cqlsh
. Connected to Test Cluster at localhost:9160. [cqlsh 4.1.0 | Cassandra 2.0.2 | CQL spec 3.1.1 | Thrift protocol 19.38.0] cqlsh>
CREATE KEYSPACE ad_network WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '3' }; USE ad_network;
replication_factor
is the number of nodes that will hold the string. CREATE TABLE reseller ( id text, effective_since text, -- day in the format of 'YYYY-MM-DD' reward_percent float, -- value from 0.0 to 1.0 PRIMARY KEY (id, effective_since) -- (id) (effective_since) ) WITH CLUSTERING ORDER BY (effective_since DESC); -- LIMIT 1 INSERT INTO reseller (id, effective_since, reward_percent) VALUES ('supaboobs', '2011-02-13', 0.2); INSERT INTO reseller (id, effective_since, reward_percent) VALUES ('supaboobs', '2012-01-22', 0.25); INSERT INTO reseller (id, effective_since, reward_percent) VALUES ('supaboobs', '2013-11-30', 0.3);
'supaboobs'
and three cluster keys : '2011-02-13'
, '2012-01-22'
and '2013-11-30'
. This string, and all subsequent ones, will be stored on three of our six nodes. cqlsh:ad_network> SELECT * FROM reseller WHERE id='supaboobs'; id | effective_since | reward_percent -----------+-----------------+---------------- supaboobs | 2013-11-30 | 0.3 supaboobs | 2012-01-22 | 0.25 supaboobs | 2011-02-13 | 0.2
cqlsh:ad_network> SELECT * FROM reseller WHERE id = 'supaboobs' LIMIT 1; id | effective_since | reward_percent -----------+-----------------+---------------- supaboobs | 2013-11-30 | 0.3
CREATE TABLE ad_click ( reseller_id text, day text, -- day in the format of 'YYYY-MM-DD' time timestamp, ad_id text, amount float, PRIMARY KEY ((reseller_id, day), time, ad_id) -- (reseller_id, day) (time, ad_id) ) WITH CLUSTERING ORDER BY (time DESC); --
INSERT INTO ad_click (reseller_id, day, time, ad_id, amount) VALUES ('supaboobs', '2013-11-28', '2013-11-28 02:16:52', '890_567_234', 0.005); INSERT INTO ad_click (reseller_id, day, time, ad_id, amount) VALUES ('supaboobs', '2013-11-28', '2013-11-28 07:17:35', '890_567_234', 0.005); INSERT INTO ad_click (reseller_id, day, time, ad_id, amount) VALUES ('supaboobs', '2013-11-29', '2013-11-29 17:18:51', '890_567_211', 0.0075); INSERT INTO ad_click (reseller_id, day, time, ad_id, amount) VALUES ('supaboobs', '2013-11-29', '2013-11-29 22:20:37', '890_567_211', 0.0075); INSERT INTO ad_click (reseller_id, day, time, ad_id, amount) VALUES ('supaboobs', '2013-11-30', '2013-11-30 11:21:56', '890_567_234', 0.005); INSERT INTO ad_click (reseller_id, day, time, ad_id, amount) VALUES ('supaboobs', '2013-12-01', '2013-12-01 12:21:59', '890_567_010', 0.01);
cqlsh:ad_network> SELECT * FROM ad_click; reseller_id | day | time | ad_id | amount -------------+------------+---------------------+-------------+-------- supaboobs | 2013-12-01 | 2013-12-01 12:21:59 | 890_567_010 | 0.01 supaboobs | 2013-11-30 | 2013-11-30 11:21:56 | 890_567_234 | 0.005 supaboobs | 2013-11-28 | 2013-11-28 07:17:35 | 890_567_234 | 0.005 supaboobs | 2013-11-28 | 2013-11-28 02:16:52 | 890_567_234 | 0.005 supaboobs | 2013-11-29 | 2013-11-29 22:20:37 | 890_567_211 | 0.0075 supaboobs | 2013-11-29 | 2013-11-29 17:18:51 | 890_567_211 | 0.0075
(reseller_id, day)
, 4 lines were actually created here (if it’s hard to understand why, then read the previous article and everything will fall into place). It turns out that for each reseller we will create a new line every day and fill it with data. The cluster key is also composite - time, ad_id
. CREATE TABLE amount_by_day ( reseller_id text, day text, -- day in the format of 'YYYY-MM-DD' amount double, PRIMARY KEY (reseller_id, day) -- (reseller_id) (day) ) WITH CLUSTERING ORDER BY (day DESC); --
ad_click
, summarize and write to amount_by_day
.SELECT COUNT(0) FROM ad_click WHERE ad_id='...'
across all six nodes would be too expensive (and there is no COUNT
operation in CQL), then in C * there is such a thing as counter - s.timestamp
, text
, double
, etc. But there are limitations. If there is at least one counter in the table, then all other columns must also be of the counter type (excluding PRIMARY KEY, of course). Create the same table: CREATE TABLE clicks_per_ad ( ad_id text, clicks counter, PRIMARY KEY (ad_id)); -- (ad_id),
clicks
column. cqlsh:ad_network> SELECT * FROM clicks_per_ad; (0 rows) cqlsh:ad_network> UPDATE clicks_per_ad SET clicks = clicks + 1 WHERE ad_id = '890_567_234'; cqlsh:ad_network> SELECT * FROM clicks_per_ad; ad_id | clicks -------------+-------- 890_567_234 | 1 (1 rows) cqlsh:ad_network> UPDATE clicks_per_ad SET clicks = clicks + 1 WHERE ad_id = '890_567_234'; cqlsh:ad_network> SELECT * FROM clicks_per_ad; ad_id | clicks -------------+-------- 890_567_234 | 2 (1 rows) cqlsh:ad_network> UPDATE clicks_per_ad SET clicks = 0 WHERE ad_id = '890_567_234'; cqlsh:ad_network> SELECT * FROM clicks_per_ad; ad_id | clicks -------------+-------- 890_567_234 | 0 (1 rows)
signed int
. Those. we can consider exceptional integers, but in the range -2 ^ 63 - + 2 ^ 63.UPDATE
command, one suddenly appeared. This is a CQL feature. INSERT
and UPDATE
- the essence of the same command. I will make a reservation that in C * there is an opportunity not to update / insert data if they already (or "have not yet") exist. It is called “lightweight transactions”, which are slow relative to a normal data write operation. CREATE TABLE clicks_per_reseller_per_day ( reseller_id text, day text, -- day in the format of 'YYYY-MM-DD' clicks counter, PRIMARY KEY ((reseller_id, day))); -- (reseller_id, day), CREATE TABLE clicks_per_reseller ( reseller_id text, clicks counter, PRIMARY KEY (reseller_id)); -- (reseller_id),
text
as column type reseller_id
may seem reseller_id
. In C *, a distribution key (partition key) is searched for by comparing hashes. Those. there is no direct comparison of strings, which means that performance does not sag.UPDATE
operations. I have not lost my mind. Writing to C * is too fast, performance will not suffer. Writing speed with 6 nodes will be about 100 times faster than in MongoDB or 2-5 times faster than in HBase , not to mention RDBMS. The latest versions of C * are able to independently optimize disk space ( compaction , compression ), so everything will be fine with the hard disk space. BEGIN BATCH -- INSERT, UPDATE, DELETE ... APPLY BATCH;
BEGIN UNLOGGED BATCH
- the usual batch. But if the coordinating node (coordinator node - the one that is responsible for the CQL command that came to it and for communicating with other nodes) dies in the middle of the batch, the integrity (consistency) of the data may deteriorate.BEGIN BATCH
- in this case C * will make sure that either all data is written or nothing. But this operation is about 30% slower.SELECT ... FROM ... WHERE
operation has many limitations compared to RDBMS, so special attention should be paid to this.Previous article of the cycle .
Next article of the cycle .
Source: https://habr.com/ru/post/204026/
All Articles