
My experience with Apache Cassandra

Like most NoSQL solutions, C* suffers from one extremely unpleasant affliction: it is an excellent tool for a narrow class of tasks, but evangelists position it as yet another silver bullet for data storage. In this article I will describe my experience of adopting C* in a (comparatively) high-load web analytics project. It should be useful to anyone choosing a scalable data store, and it should dispel some myths and misconceptions about this tool.



Let me start with the positives. As I said, C* handles its core tasks with flying colors. It was built for fast writes, scalability and fault tolerance, and in these it is probably the best: I have not yet seen simpler cluster management, and in all this time it has not failed me once. However, the biggest problem with C* is that its area of application is much narrower than the documentation makes it seem. Below I explain why, point by point.

CQL is not SQL


Honestly, I don't understand why the C* developers decided to create CQL. It confuses and disorients, and it creates a false impression of how C* actually works. Here are some facts:
  1. The main misconception that CQL instills in every beginner is the illusion that you can run arbitrary selections. This is not true. C* is a key-value store. You cannot get a subset of the rows in a table: you get either one row (by key) or all of them. To work around this limitation, C* has “wide rows”: the ability to write arbitrary columns into a row (independent of the schema, up to 2 billion unique columns per row). But even that only helps if the data model is planned specifically around it.
  2. CQL introduces the concepts of partition key and clustering key within PRIMARY KEY. This is another major source of misconceptions about how the database works. In fact, a row in C* is identified only by the partition part of the key. All “records” that differ only in their clustering key simply follow one another inside the same row.
  3. There is no real composite partition key. The easiest way to understand how a composite key behaves is to imagine that the values of the key fields are concatenated before being saved. A row can then be fetched only by the full value of the key (as in Redis, for example).
  4. INSERT and UPDATE in C* are the same thing. Now and forever.
  5. Collections in CQL are only syntactic sugar; their entries are stored as separate columns.
  6. Secondary indexes are also only syntactic sugar. C* creates a new column family (table) for each secondary index and duplicates the records there.
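
The storage model described in points 1 to 4 can be illustrated with a toy in-memory sketch. This is not how C* is implemented; the `Table` class, its method names and the `":"` separator are hypothetical, chosen only to show a partition as one physical row, a composite key as concatenated fields, and INSERT/UPDATE as a single upsert.

```python
class Table:
    """Toy model: partition key -> one physical row of cells keyed by clustering key."""

    def __init__(self):
        self.partitions = {}

    @staticmethod
    def _partition_id(key_fields):
        # Composite partition key modeled as concatenated field values.
        # (Assumption: ":" never occurs in the values; a real store would escape them.)
        return ":".join(str(f) for f in key_fields)

    def upsert(self, key_fields, clustering_key, value):
        # INSERT and UPDATE are the same operation: write the cell, last write wins.
        row = self.partitions.setdefault(self._partition_id(key_fields), {})
        row[clustering_key] = value

    def read_partition(self, key_fields):
        # A row can be fetched only by the full partition key;
        # its cells come back ordered by clustering key.
        row = self.partitions.get(self._partition_id(key_fields), {})
        return sorted(row.items())


t = Table()
t.upsert(("site1", "2015-05-01"), "10:00", 3)
t.upsert(("site1", "2015-05-01"), "09:00", 1)
t.upsert(("site1", "2015-05-01"), "10:00", 5)  # an "update" is just another write

full_key_result = t.read_partition(("site1", "2015-05-01"))
partial_key_result = t.read_partition(("site1",))
print(full_key_result)     # [('09:00', 1), ('10:00', 5)]
print(partial_key_result)  # []
```

In this sketch, looking up by only part of the key finds nothing, because the concatenated identifier is different. That is the behavior point 3 describes.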


Apparently, CQL was created to popularize this database among beginners by hiding the most important fundamental concepts of how it works.

Features of data design


The main principle of data design in C* is the “SELECT-driven model”: you design the data so that you can later retrieve it with a SELECT. For a key-value wide-row store, this implies very heavy denormalization. You do not just denormalize the data the way you are used to doing in relational databases; you actually create a separate table for each query. In many projects (those with large data volumes) this leads either to a huge overhead during map/reduce and aggregation, or to a huge overhead in the amount of data stored.
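
A minimal sketch of what “a separate table for each query” means in practice, again as a toy in-memory model rather than real C* code. The table names `hits_by_page` and `hits_by_user` and the `record_hit` helper are hypothetical, invented for this illustration.

```python
from collections import defaultdict

# Hypothetical query tables: each one exists to answer exactly one kind of SELECT.
hits_by_page = defaultdict(list)  # query: "all hits for a given page"
hits_by_user = defaultdict(list)  # query: "all hits for a given user"


def record_hit(user, page, ts):
    # One logical event is written once per query it has to serve.
    hits_by_page[page].append((ts, user))
    hits_by_user[user].append((ts, page))


record_hit("alice", "/home", 1)
record_hit("bob", "/home", 2)
record_hit("alice", "/pricing", 3)

# Three events, two query tables: six stored cells.
stored_cells = sum(len(v) for v in hits_by_page.values()) + sum(
    len(v) for v in hits_by_user.values()
)
print(hits_by_page["/home"])  # [(1, 'alice'), (2, 'bob')]
print(stored_cells)           # 6
```

Every additional query pattern adds another full copy of the data, which is where the storage overhead mentioned above comes from.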

And yes, you should be prepared from the start for the fact that without distributed aggregation (Hadoop, Spark, Hive, etc.) this database is useless. The developers promise aggregation operators in CQL in the next version, but you should not count on them much. From the architecture of this database it is clear that they will work only within a single row.

Counters


I am giving this data type a section of its own, and here is why. When I first started integrating C* into my project, I was delighted: atomic counters are a great fit for web analytics and simplify the system a lot. But then I learned a simple truth: never, do you hear, NEVER use counters in C* (at least in versions up to and including 2.1.*). This data type contradicts the whole ideology of the database, and as a result it has a huge number of problems. Ask any C* specialist about counters and all you will get in response is a malicious snicker.

In short:
  1. A row that holds counters cannot hold any other values, only counters.
  2. You cannot put counters in a collection (for the reason above).
  3. You cannot use counters as clustering or partition keys (for the reason above), and of course you cannot sort by them or build secondary indexes on them either.
  4. You cannot set a time to live (TTL) on a counter.
  5. Counters have replication problems (rare, but they hit hard).
  6. If you delete a counter, you will not be able to create it again for a long time.


The root of the mismatch between counters and the database itself is that all operations in C* are idempotent (that is, applying an operation repeatedly does not change the result). This is what keeps the architecture trouble-free when nodes or data centers go down, when the network has problems, and so on. Counters violate this principle, and internally they are implemented through “lightweight transactions”: first read the value, then increase it. That causes a lot of problems. In general, if you need counters, it is better to keep them in a Redis layer and write only the final value into C*.
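
The difference between an idempotent write and a counter increment can be shown with a small simulation. It assumes a client that resends an operation after a timeout even though the earlier attempts were applied; `apply_with_retry`, `set_to_5` and `increment` are hypothetical helpers, not part of any C* driver.

```python
def apply_with_retry(operation, state, retries):
    # Simulate a client that timed out and resent the same operation:
    # every attempt reached the node, but the acknowledgements were lost.
    for _ in range(1 + retries):
        state = operation(state)
    return state


def set_to_5(state):
    # Idempotent: writing the same value again changes nothing.
    return {**state, "views": 5}


def increment(state):
    # Not idempotent: every replay adds one more.
    return {**state, "views": state.get("views", 0) + 1}


after_set = apply_with_retry(set_to_5, {"views": 4}, retries=2)
after_inc = apply_with_retry(increment, {"views": 4}, retries=2)
print(after_set)  # {'views': 5}
print(after_inc)  # {'views': 7}
```

The replayed plain write still ends at 5, while the replayed increment overshoots to 7 instead of 5. Writing a final, precomputed value into C* sidesteps this, which is the point of the Redis-layer suggestion above.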

That's all for now. Choose your tools wisely. If something seems too good to be true, then perhaps it is.

Source: https://habr.com/ru/post/258581/

