Example of circuit simulation in Cassandra 2.0 in CQL3

In the previous article, I lucidly explained how Cassandra stores data. I strongly recommend at least a run through the eyes. In this article, we will create a simple database to use in the next article , which will be fully devoted to data retrieval / retrieval.

Task

Suppose we have an ad network that spins ads. People click on banners, the advertising customer pays, we (the network), resellers (distributors) and hosters of advertising space have an income on this. Advertising space resellers work for 20%. This percentage grows due to various factors, the most important thing is that it is not constant and the new percentage can be applied, for example, to clicks of a month ago.

Need: quickly be able to count the income of each reseller for any period of days, keep a schedule of clicks in real time.

')

Disclaimer

The article does NOT serve as the only correct guide for modeling ad network tasks.
Each case requires its own specific simulation. Silver bullet does not exist.

Likbez

Cassandra (hereinafter C * ) is a distributed NoSQL database.
cqlsh - command line for C * CQL.
CQL is a SQL-like query language. Abbreviation for C assandra Q uery L anguage.
Keyspace is an analogue of a database in relational DBMS.
The primary storage unit is a string . A row can dynamically expand to 2 billion columns. It is important.
Primary Key - a unique identifier for the string. Required to be at each line. It is used to search for the node in which the string is stored, as well as to search within the string itself.
The distribution key is the first part of the master key.
The cluster key is the second part of the master key.
A node (node) is a java process of Cassandra. All nodes are equivalent. No master-slave dependencies.
Cluster - several nodes working with each other as a whole.
During recording, it is not possible to read the current value of the cell being written (this is almost true).

Let's assume that we have 6 nodes .

cqlsh

Run cqlsh .

 Connected to Test Cluster at localhost:9160. [cqlsh 4.1.0 | Cassandra 2.0.2 | CQL spec 3.1.1 | Thrift protocol 19.38.0] cqlsh>

Create keyspace

Create a keyspace (database).

 CREATE KEYSPACE ad_network WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '3' }; USE ad_network;

replication_factor is the number of nodes that will hold the string.

Create a model

Reseller table

Create a reseller table and fill it with data. In the table we will keep a history of changes in the interest rate of the reseller.

 CREATE TABLE reseller ( id text, effective_since text, -- day in the format of 'YYYY-MM-DD' reward_percent float, -- value from 0.0 to 1.0 PRIMARY KEY (id, effective_since) --   (id)    (effective_since) ) WITH CLUSTERING ORDER BY (effective_since DESC); --    LIMIT 1  INSERT INTO reseller (id, effective_since, reward_percent) VALUES ('supaboobs', '2011-02-13', 0.2); INSERT INTO reseller (id, effective_since, reward_percent) VALUES ('supaboobs', '2012-01-22', 0.25); INSERT INTO reseller (id, effective_since, reward_percent) VALUES ('supaboobs', '2013-11-30', 0.3);

It seems that we have created three lines. But those who read the previous article know that we created one line with the distribution key 'supaboobs' and three cluster keys : '2011-02-13' , '2012-01-22' and '2013-11-30' . This string, and all subsequent ones, will be stored on three of our six nodes.

Let's see the contents:

 cqlsh:ad_network> SELECT * FROM reseller WHERE id='supaboobs'; id | effective_since | reward_percent -----------+-----------------+---------------- supaboobs | 2013-11-30 | 0.3 supaboobs | 2012-01-22 | 0.25 supaboobs | 2011-02-13 | 0.2

Later, when we need the current interest rate, we will do the following:

 cqlsh:ad_network> SELECT * FROM reseller WHERE id = 'supaboobs' LIMIT 1; id | effective_since | reward_percent -----------+-----------------+---------------- supaboobs | 2013-11-30 | 0.3

Ad_click table

In this we will store clicks on our banners.
Columns: reseller ID, day (to speed up the search), date + click time, banner ID, total cost per click.

 CREATE TABLE ad_click ( reseller_id text, day text, -- day in the format of 'YYYY-MM-DD' time timestamp, ad_id text, amount float, PRIMARY KEY ((reseller_id, day), time, ad_id) --   (reseller_id, day)    (time, ad_id) ) WITH CLUSTERING ORDER BY (time DESC); --

Add some data.

 INSERT INTO ad_click (reseller_id, day, time, ad_id, amount) VALUES ('supaboobs', '2013-11-28', '2013-11-28 02:16:52', '890_567_234', 0.005); INSERT INTO ad_click (reseller_id, day, time, ad_id, amount) VALUES ('supaboobs', '2013-11-28', '2013-11-28 07:17:35', '890_567_234', 0.005); INSERT INTO ad_click (reseller_id, day, time, ad_id, amount) VALUES ('supaboobs', '2013-11-29', '2013-11-29 17:18:51', '890_567_211', 0.0075); INSERT INTO ad_click (reseller_id, day, time, ad_id, amount) VALUES ('supaboobs', '2013-11-29', '2013-11-29 22:20:37', '890_567_211', 0.0075); INSERT INTO ad_click (reseller_id, day, time, ad_id, amount) VALUES ('supaboobs', '2013-11-30', '2013-11-30 11:21:56', '890_567_234', 0.005); INSERT INTO ad_click (reseller_id, day, time, ad_id, amount) VALUES ('supaboobs', '2013-12-01', '2013-12-01 12:21:59', '890_567_010', 0.01);

Let's look at them.

 cqlsh:ad_network> SELECT * FROM ad_click; reseller_id | day | time | ad_id | amount -------------+------------+---------------------+-------------+-------- supaboobs | 2013-12-01 | 2013-12-01 12:21:59 | 890_567_010 | 0.01 supaboobs | 2013-11-30 | 2013-11-30 11:21:56 | 890_567_234 | 0.005 supaboobs | 2013-11-28 | 2013-11-28 07:17:35 | 890_567_234 | 0.005 supaboobs | 2013-11-28 | 2013-11-28 02:16:52 | 890_567_234 | 0.005 supaboobs | 2013-11-29 | 2013-11-29 22:20:37 | 890_567_211 | 0.0075 supaboobs | 2013-11-29 | 2013-11-29 17:18:51 | 890_567_211 | 0.0075

Since we have a composite distribution key - (reseller_id, day) , 4 lines were actually created here (if it’s hard to understand why, then read the previous article and everything will fall into place). It turns out that for each reseller we will create a new line every day and fill it with data. The cluster key is also composite - time, ad_id .

Table amount_by_day

Since the interest rate cannot change more often than once a day, you can win a few milliseconds and processor time on this. Create another table that will store the same money, but without breakdown on clicks:

 CREATE TABLE amount_by_day ( reseller_id text, day text, -- day in the format of 'YYYY-MM-DD' amount double, PRIMARY KEY (reseller_id, day) --   (reseller_id)    (day) ) WITH CLUSTERING ORDER BY (day DESC); --

It will have to be filled once in the suki. A separate code in our system will collect data from ad_click , summarize and write to amount_by_day .

Click Count Count Tables

Naturally, it is important for us to know how many times we clicked on which banner. But since SELECT COUNT(0) FROM ad_click WHERE ad_id='...' across all six nodes would be too expensive (and there is no COUNT operation in CQL), then in C * there is such a thing as counter - s.

Counter is a column type, i.e. syntactically used just like timestamp , text , double , etc. But there are limitations. If there is at least one counter in the table, then all other columns must also be of the counter type (excluding PRIMARY KEY, of course). Create the same table:

 CREATE TABLE clicks_per_ad ( ad_id text, clicks counter, PRIMARY KEY (ad_id)); --   (ad_id),

And so in the table you can change the value of the clicks column.

 cqlsh:ad_network> SELECT * FROM clicks_per_ad; (0 rows) cqlsh:ad_network> UPDATE clicks_per_ad SET clicks = clicks + 1 WHERE ad_id = '890_567_234'; cqlsh:ad_network> SELECT * FROM clicks_per_ad; ad_id | clicks -------------+-------- 890_567_234 | 1 (1 rows) cqlsh:ad_network> UPDATE clicks_per_ad SET clicks = clicks + 1 WHERE ad_id = '890_567_234'; cqlsh:ad_network> SELECT * FROM clicks_per_ad; ad_id | clicks -------------+-------- 890_567_234 | 2 (1 rows) cqlsh:ad_network> UPDATE clicks_per_ad SET clicks = 0 WHERE ad_id = '890_567_234'; cqlsh:ad_network> SELECT * FROM clicks_per_ad; ad_id | clicks -------------+-------- 890_567_234 | 0 (1 rows)

Thus, you can count anything if it is signed int . Those. we can consider exceptional integers, but in the range -2 ^ 63 - + 2 ^ 63.

It is noteworthy that at first there were no rows in the table, but after the UPDATE command, one suddenly appeared. This is a CQL feature. INSERT and UPDATE - the essence of the same command. I will make a reservation that in C * there is an opportunity not to update / insert data if they already (or "have not yet") exist. It is called “lightweight transactions”, which are slow relative to a normal data write operation.

Of course, the number of clicks can be collected on any criteria (keys). For example:

 CREATE TABLE clicks_per_reseller_per_day ( reseller_id text, day text, -- day in the format of 'YYYY-MM-DD' clicks counter, PRIMARY KEY ((reseller_id, day))); --   (reseller_id, day),    CREATE TABLE clicks_per_reseller ( reseller_id text, clicks counter, PRIMARY KEY (reseller_id)); --   (reseller_id),

A bit about text ID

In RDBMS, we are used to assigning a unique identifier of type int to strings. Why not make the identifiers textual, which would mean something meaningful? Yes, because productivity will suffer. Personally, I was very depressed. We are used to having our driver's license ID numbers, insurance policy ID numbers. But you often have to add letters, for example, passport ID - two letters and 6 numbers, house numbers often with letters or hyphens, etc.

In C * it is not customary to use dry digits as keys, because they do not incur acceleration of work, unlike RDBMS. Yes, and auto increment in C * is missing (but there is a timeuuid , if you suddenly need a unique ID). text as column type reseller_id may seem reseller_id . In C *, a distribution key (partition key) is searched for by comparing hashes. Those. there is no direct comparison of strings, which means that performance does not sag.

Data recording

To record one click, we will need to do as many as 4 UPDATE operations. I have not lost my mind. Writing to C * is too fast, performance will not suffer. Writing speed with 6 nodes will be about 100 times faster than in MongoDB or 2-5 times faster than in HBase , not to mention RDBMS. The latest versions of C * are able to independently optimize disk space ( compaction , compression ), so everything will be fine with the hard disk space.

To make sure that all the INSERTs worked, there is such an understanding as BATCH . Unfortunately, they work within the same distribution key (within one line).

 BEGIN BATCH -- INSERT, UPDATE, DELETE ... APPLY BATCH;

Batch is not a substitute for transactions in an RDBMS. With the help of them, C * transmits all commands in one packet, rather than several commands, thus optimizing the network operation. There are two types of batch.

Unlogged - BEGIN UNLOGGED BATCH - the usual batch. But if the coordinating node (coordinator node - the one that is responsible for the CQL command that came to it and for communicating with other nodes) dies in the middle of the batch, the integrity (consistency) of the data may deteriorate.
Atomic (atomic) - BEGIN BATCH - in this case C * will make sure that either all data is written or nothing. But this operation is about 30% slower.

Reading data

This topic is devoted to the following article . The SELECT ... FROM ... WHERE operation has many limitations compared to RDBMS, so special attention should be paid to this.

Conclusion

The task of this post was to show how much the approach to database modeling in Kassandra differs from the RDBMS. The approach differs dramatically, as you can see. And do not be confused by the similarity of CQL and SQL, in fact only syntax matches. Only a few techniques and distinctive features were shown here, but they are much, much more.

Previous article of the cycle .
Next article of the cycle .

Source: https://habr.com/ru/post/204026/

All Articles