
How does Apache Cassandra work?


In this post I would like to talk about how Cassandra is organized: a decentralized, fault-tolerant, and reliable key-value database. The storage itself takes care of the problems of a single point of failure, server failures, and distributing data between the cluster nodes. This holds both for servers in a single data center and for configurations with many data centers separated by large distances and, consequently, network latency. Reliability here means eventual consistency of the data, with the ability to tune the consistency level of each individual request.



NoSQL databases generally require a deeper understanding of their internal structure than SQL databases do. This article describes the basic design, and the follow-up articles will cover: CQL and the programming interface; design and optimization techniques; and the specifics of clusters spread across many data centers.



Data model



In Cassandra terminology, an application works with a keyspace, which corresponds to the notion of a database schema in the relational model. A keyspace can contain several column families, which correspond to relational tables. A column family, in turn, contains columns, which are grouped together by a row key into records (rows). A column consists of three parts: a name, a timestamp, and a value. The columns within a record are ordered. Unlike a relational database, there is no requirement that records (rows, in database terms) contain columns with the same names as other records: each record can have its own set of columns. Column families come in several types, but in this article we will omit this detail. Recent versions of Cassandra also support data definition and manipulation queries (DDL, DML) via the CQL language, as well as secondary indices.
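To make this hierarchy concrete, here is a minimal sketch in plain Python (the column family name and the values are made up, and real Cassandra of course does not store data in Python dicts): a keyspace maps column family names to column families, a column family maps row keys to records, and a record maps column names to (value, timestamp) pairs.

```python
import time

# keyspace -> column family -> row key -> {column name: (value, timestamp)}
keyspace = {"users": {}}                            # "users" is a made-up column family
row = keyspace["users"].setdefault(b"key42", {})    # the record for row key b"key42"

def put_column(record, name, value):
    # A column is a triple: name, value and a timestamp (microseconds here).
    record[name] = (value, int(time.time() * 1_000_000))

put_column(row, b"name", b"Alice")
put_column(row, b"city", b"Riga")

# Columns within a record are kept ordered by name (the comparator's job);
# other records in the same column family may have entirely different columns.
for name in sorted(row):
    print(name, row[name])
```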

Data model



A specific value stored in Cassandra is therefore identified by the keyspace, the column family, the key of the record, and the column name.

Each value is associated with a timestamp, a user-supplied number that is used to resolve conflicts during writes: the larger the number, the newer the column is considered to be, and when columns are compared it supersedes the older ones.



As for data types: the keyspace and the column family are strings (names); the timestamp is a 64-bit number; and the key, the column name, and the column value are byte arrays. Cassandra also has a notion of data types. These types can optionally be specified by the developer when creating a column family. For column names this is called a comparator, for values and keys a validator. The former defines which byte values are allowed for column names and how to order them; the latter defines which byte values are valid for column values and keys. If these data types are not specified, Cassandra stores the values and compares them as byte strings (BytesType), since that is how they are stored internally anyway.



The main built-in data types include:

- BytesType: raw byte strings (the default);
- AsciiType and UTF8Type: character strings;
- IntegerType and LongType: integers;
- UUIDType and TimeUUIDType: UUIDs (the latter ordered by time);
- BooleanType, FloatType, DoubleType, DecimalType: other scalar values;
- CounterColumnType: counters.





In Cassandra, every write operation is an overwrite: if a column with the same key and name already exists in the column family and the new timestamp is greater than the stored one, the value is overwritten. Written values themselves are never modified; newer columns with new values simply arrive.
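A tiny sketch of this last-write-wins rule (plain Python, with made-up column values): when two versions of the same column meet, the one with the larger timestamp survives.

```python
def reconcile(stored, incoming):
    """Return the winning version of a column.

    Each version is a (name, value, timestamp) tuple; the larger
    timestamp wins, which is why every write is effectively an overwrite.
    """
    return incoming if incoming[2] >= stored[2] else stored

old = ("email", "old@example.com", 1_000)
new = ("email", "new@example.com", 2_000)
print(reconcile(old, new))   # ('email', 'new@example.com', 2000)
```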



Writing to Cassandra is faster than reading from it, and this changes the approach used during design. From the point of view of data model design, it is easier to think of a column family not as a table but as a materialized view: a structure that represents the data of some complex query, but stores it on disk. Instead of trying to assemble the data with queries, it is better to keep everything that a given query may need in one column family. That is, you should approach the design not from the side of relations between entities or objects, but from the side of the queries: which fields need to be selected, in what order the records should go, and which related data should be requested together; all of this should already be stored in the column family. The number of columns in a record is theoretically limited to 2 billion. This is a brief digression; there will be more detail in the article on design and optimization techniques. Now let us dig into how data is stored in and read from Cassandra.



Data distribution



Let us look at how data is distributed across the cluster nodes depending on the key. Cassandra lets you choose a data distribution strategy. The first strategy distributes data based on the md5 value of the key: the random partitioner. The second takes into account the byte representation of the key itself: the byte-ordered partitioner. The first strategy is preferable in most cases, since it frees you from worrying about an even distribution of data between the servers and similar problems. The second strategy is used in rare cases, for example when range queries over keys are needed. It is important to note that this choice is made before the cluster is created and, in practice, cannot be changed without a full data reload.



To distribute the data, Cassandra uses a technique known as consistent hashing. This approach distributes the data between the nodes in such a way that when a node is added or removed, only a small amount of data has to be transferred. To achieve this, each node is assigned a token, which partitions the set of all md5 key values. Since the RandomPartitioner is used in most cases, let us consider it. As mentioned, it computes a 128-bit md5 hash of each key. To determine which nodes will store the data, all node tokens are sorted from smallest to largest, and the first node whose token is greater than the md5 value of the key is selected, together with some number of subsequent nodes (in token order). The total number of nodes selected must equal the replication factor. The replication factor is set per keyspace and lets you adjust the data redundancy.
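Below is a simplified sketch of this placement in Python (it is not the exact token arithmetic of the real RandomPartitioner, and the node names are made up): the key is hashed with md5, the first node whose token is not smaller than the hash is chosen, and the following nodes on the ring hold the remaining replicas.

```python
import bisect
import hashlib

def md5_token(key: bytes) -> int:
    # The RandomPartitioner hashes the key with md5 into a big integer.
    return int.from_bytes(hashlib.md5(key).digest(), "big")

def replicas_for(key: bytes, ring, replication_factor: int):
    """Pick replica nodes for a key on a token ring.

    `ring` is a list of (token, node) pairs sorted by token.  The first
    node whose token is >= the key's hash owns the key, and the next
    replication_factor - 1 nodes in token order hold the extra replicas
    (roughly what SimpleStrategy does).
    """
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, md5_token(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1]
            for i in range(min(replication_factor, len(ring)))]

# Six nodes with evenly spaced tokens, like the ring illustration below.
step = 2 ** 128 // 6
ring = [(i * step, "node%d" % (i + 1)) for i in range(6)]
print(replicas_for(b"some-row-key", ring, replication_factor=3))
```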



Data replication



Before adding a node to the cluster, you must assign it a token. The share of keys that falls into the gap between this token and the next one determines how much data will be stored on that node. The whole set of tokens of a cluster is called a ring.



Here is an illustration of the ring of a cluster of six nodes with evenly distributed tokens, produced with the built-in nodetool utility.



nodetool ring



Data consistency



Cassandra cluster nodes are equal, and clients can connect to any of them, both for writing and for reading. Requests go through a coordination stage, during which the server finds out which nodes should hold the data and sends the requests to those nodes. We will call the node that does the coordination the coordinator, and the nodes selected to store the record with the given key the replica nodes. Physically, the coordinator may be one of the replica nodes: this depends only on the key, the partitioner, and the tokens.



For each request, both read and write, it is possible to set the level of data consistency.



For a write, this level determines how many replica nodes must acknowledge that the operation completed successfully (the data has been written) before control is returned to the client. The write consistency levels are:

- ANY: it is enough that the write has been stored somewhere, even only as a hint on a node that is not a replica;
- ONE, TWO, THREE: acknowledgement from one, two or three replica nodes respectively;
- QUORUM: acknowledgement from a quorum of the replica nodes (more than half of the replication factor);
- LOCAL_QUORUM: a quorum of the replica nodes located in the coordinator's data center;
- EACH_QUORUM: a quorum of the replica nodes in each data center;
- ALL: acknowledgement from all replica nodes.



Writing to Cassandra



For a read, the consistency level determines how many replica nodes will be read from. The read consistency levels are:

- ONE, TWO, THREE: the responses of one, two or three replica nodes are awaited;
- QUORUM: the responses of a quorum of the replica nodes are awaited;
- LOCAL_QUORUM and EACH_QUORUM: a quorum in the local data center or in every data center, respectively;
- ALL: the responses of all replica nodes are awaited.

In every case the value with the newest timestamp among the responses received is returned to the client.





Reading from Cassandra



Thus, it is possible to tune the latency of read and write operations and to adjust the consistency (tune consistency) as well as the availability of each kind of operation. In fact, availability is directly tied to the consistency level of reads and writes, since it determines how many replica nodes may fail while the operations are still acknowledged.
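As an aside (the programming interface is a topic for the follow-up articles), this is roughly how a per-request consistency level looks from a client using the DataStax Python driver; the address, keyspace and table here are made up:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])        # any node can act as the coordinator
session = cluster.connect("demo")       # "demo" is a hypothetical keyspace

# Write with QUORUM: wait for a majority of the replica nodes to confirm.
write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, (42, "alice"))

# Read with ONE: a single replica's response is enough.
read = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
print(session.execute(read, (42,)).one())
```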



If the number of nodes that acknowledge a write, added to the number of nodes that are read from, is greater than the replication factor, we have a guarantee that the new value will always be read after it is written; this is called strict consistency (strong consistency). Without strict consistency there is a possibility that a read returns stale data.



In any case, the value eventually propagates to all replicas, but only after the coordination wait has finished. This propagation is the eventual consistency. If not all replica nodes are available during a write, then sooner or later recovery tools such as read repair and anti-entropy node repair come into play. More on this below.



Thus, with QUORUM consistency for both reads and writes, strict consistency is always maintained, and there is a reasonable balance between read and write latency. With ALL writes and ONE reads there is also strict consistency, and reads are faster and more available: the number of failed nodes under which a read still succeeds can be larger than with QUORUM, but writes require all replica nodes to be up. With ONE writes and ALL reads, strict consistency also holds; writes are faster and write availability is high, since it is enough to confirm that the write happened on at least one server, while reads are slower and require all replica nodes to respond. If the application has no requirement for strict consistency, it can speed up both reads and writes and improve availability by choosing lower consistency levels.
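A quick sanity check of these combinations: strict consistency holds when the number of write acknowledgements plus the number of replicas read is greater than the replication factor (W + R > N). A small sketch with a replication factor of 3, using just the levels discussed above:

```python
def acks_needed(level: str, replication_factor: int) -> int:
    # How many replica responses each level waits for (simplified).
    return {"ONE": 1,
            "QUORUM": replication_factor // 2 + 1,
            "ALL": replication_factor}[level]

def strictly_consistent(write_level: str, read_level: str, rf: int = 3) -> bool:
    return acks_needed(write_level, rf) + acks_needed(read_level, rf) > rf

for write_level, read_level in [("QUORUM", "QUORUM"), ("ALL", "ONE"),
                                ("ONE", "ALL"), ("ONE", "ONE")]:
    print(write_level, read_level, strictly_consistent(write_level, read_level))
# The first three combinations print True; ONE/ONE prints False.
```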



Data recovery



Cassandra supports three data recovery mechanisms:

- read repair: during a read, the coordinator compares the replicas' responses and sends the newest version back to the replicas that hold stale data;
- anti-entropy node repair: a manually started process (nodetool repair) that reconciles a node's data with the other replicas;
- hinted handoff: if a replica is unavailable during a write, another node stores a hint and replays the write once the replica comes back.





Writing to disk



After coordination, when data arrives at a node for writing, it goes into two data structures: an in-memory table (memtable) and a commit log. There is a memtable for every column family, and it allows a value to be remembered instantly. Technically it is a hash table with concurrent access, based on a data structure called a skip list. The commit log is shared by all column families and is stored on disk; it is a sequence of modification operations, split into segments once a certain size is reached.
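A very rough sketch of this write path in Python (the real memtable is a concurrent skip-list structure and the commit log has its own binary format; the file name below is made up):

```python
import json
import time

COMMIT_LOG = "commitlog_segment_001.log"   # hypothetical segment file

# column family -> {row key -> {column name: (value, timestamp)}}
memtables = {}

def write(column_family, row_key, name, value):
    ts = int(time.time() * 1_000_000)
    # 1. Append the mutation to the commit log: a cheap sequential disk write
    #    that makes the operation durable.
    with open(COMMIT_LOG, "a") as log:
        log.write(json.dumps([column_family, row_key, name, value, ts]) + "\n")
    # 2. Apply it to the in-memory table, which is what reads will see first.
    cf = memtables.setdefault(column_family, {})
    cf.setdefault(row_key, {})[name] = (value, ts)

write("users", "key42", "name", "Alice")
```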



This organization makes the write speed bounded by the speed of sequential writes to the hard disk while still guaranteeing data durability. In case of an emergency node stop, the commit log is read when the Cassandra service starts and all memtables are restored from it. So the speed is limited by sequential disk writes, which on modern hard drives is around 100 MB/s. For this reason it is advisable to put the commit log on a separate disk.



Obviously, sooner or later the memory fills up, so the memtable must also be saved to disk. To determine when to do this, there is a limit on the total size of the memtables (memtable_total_space_in_mb), by default derived from the maximum Java heap size. When the memtables grow beyond this limit, Cassandra creates a new memtable and writes the old one to disk as an SSTable. Once created, an SSTable is never modified again (it is immutable). When the flush to disk happens, the corresponding commit log segments are marked as free, releasing the disk space occupied by the log. Keep in mind that the log interleaves data from different column families of the keyspace, so some segments may not be freed, because parts of them correspond to data that is still held in memtables.
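Continuing the sketch above (again with a made-up on-disk format; real SSTables also include an index, a bloom filter and other components), flushing a memtable means writing its rows to disk sorted by key, after which the file is only ever read:

```python
import json

def flush_memtable(memtable, path):
    """Write a memtable to disk as an immutable, key-sorted file."""
    with open(path, "w") as sstable:
        for row_key in sorted(memtable):
            columns = {name: memtable[row_key][name]
                       for name in sorted(memtable[row_key])}
            sstable.write(json.dumps([row_key, columns]) + "\n")
    return path   # from now on this file is never modified, only read and compacted

flush_memtable({"key42": {"name": ["Alice", 1000]}}, "users-sstable-0001.json")
```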



Save value to disk



As a result, each column family corresponds to one memtable and a number of SSTables. When a node processes a read request, it has to query all of these structures and pick the value with the most recent timestamp. To speed this up there are three mechanisms: the bloom filter, the key cache, and the row cache:

- the bloom filter quickly tells whether a given SSTable definitely does not contain the key, so most SSTables are skipped without touching the disk;
- the key cache remembers the positions of recently used keys inside the SSTables, saving an index lookup;
- the row cache keeps entire recently read records in memory, so such reads do not touch the SSTables at all.

A sketch of this read path follows the illustration below.





Read from disk
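Here is the read path sketched in the same plain-Python style (ignoring the bloom filter, key cache and row cache for brevity): every structure that may contain the column is consulted, and the version with the newest timestamp wins.

```python
def read_column(row_key, name, memtable, sstables):
    """Collect every stored version of the column and return the newest one.

    `memtable` and each element of `sstables` map row keys to
    {column name: (value, timestamp)}; a real node would first use the
    bloom filter and key cache to skip most SSTables entirely.
    """
    versions = []
    for source in [memtable, *sstables]:
        column = source.get(row_key, {}).get(name)
        if column is not None:
            versions.append(column)
    return max(versions, key=lambda c: c[1]) if versions else None

memtable = {"key42": {"name": ("Alice v2", 2000)}}
sstables = [{"key42": {"name": ("Alice v1", 1000)}}]
print(read_column("key42", "name", memtable, sstables))   # the newest version wins
```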



Compaction



At some point, the data in a column family gets overwritten: columns arrive that have the same name and key as existing ones. That is, a situation arises where an older and a newer SSTable contain old and new versions of the same data. To guarantee integrity, Cassandra has to read all of these SSTables and select the data with the latest timestamp, so the number of hard-disk seek operations during a read is proportional to the number of SSTables. To get rid of the overwritten data and reduce the number of SSTables, there is the compaction process. It reads several SSTables sequentially and writes a new SSTable that merges the data by timestamp. When the new table is fully written and brought into use, Cassandra can release the source tables that formed it. Thus, if the tables contained overwritten data, this redundancy is eliminated. Clearly, during such an operation the amount of data on disk temporarily grows: the new SSTable exists alongside its sources, so there must always be enough free disk space for a compaction to run.
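A sketch of such a merge in the same style (a heavily simplified stand-in for the real compaction code): several SSTables are combined into one, and for every column only the version with the newest timestamp is kept.

```python
def compact(sstables):
    """Merge several SSTables into one, keeping only the newest versions.

    Each SSTable is modelled as {row key: {column name: (value, timestamp)}};
    once the merged table is in use, the source tables can be deleted,
    which frees the space taken by overwritten data.
    """
    merged = {}
    for table in sstables:
        for row_key, columns in table.items():
            out = merged.setdefault(row_key, {})
            for name, (value, ts) in columns.items():
                if name not in out or out[name][1] < ts:
                    out[name] = (value, ts)
    return merged

older = {"key42": {"name": ("Alice v1", 1000), "city": ("Riga", 1000)}}
newer = {"key42": {"name": ("Alice v2", 2000)}}
print(compact([older, newer]))   # "name" keeps only the newer version, "city" survives
```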



Compaction



Cassandra allows you to choose one of two compaction strategies:

- size-tiered compaction (the default): SSTables of similar size are merged once enough of them accumulate;
- leveled compaction: SSTables are organized into levels of fixed-size tables, which bounds the number of SSTables a read has to touch at the cost of more compaction work.





Delete operations



From the point of view of the internal machinery, deleting a column is just writing a special value, a tombstone. When such a value turns up during a read, it is skipped, as if the column had never existed. During compaction, tombstones gradually displace the obsolete real values and may eventually disappear altogether. If columns with real data and even newer timestamps appear, they will in turn supersede these tombstones.
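In the same sketch style, a delete is just another column write whose value is a tombstone marker, and reads treat it as absence:

```python
TOMBSTONE = object()   # marker meaning "this column has been deleted"

def delete_column(memtable, row_key, name, ts):
    # A delete is simply a write of a tombstone with its own timestamp,
    # so it wins over older real values during reads and compaction.
    memtable.setdefault(row_key, {})[name] = (TOMBSTONE, ts)

def visible_value(column):
    # Reads skip tombstones as if the column never existed.
    if column is None or column[0] is TOMBSTONE:
        return None
    return column[0]

memtable = {"key42": {"name": ("Alice", 1000)}}
delete_column(memtable, "key42", "name", 2000)
print(visible_value(memtable["key42"]["name"]))   # None: hidden by the tombstone
```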



Transactionality



Cassandra supports transactionality at the level of a single record, that is, for the set of columns under one key. Here is how the four ACID requirements are met:

- atomicity: all columns written under one key in a single operation are applied as a whole or not at all;
- consistency: tunable per request via the consistency levels described above;
- isolation: writes to a single record are isolated, so readers do not see a partially applied update of that record;
- durability: guaranteed by the commit log written to disk.





Afterword



So, we have looked at how the basic operations, reading and writing values, work in Cassandra.



In addition, in my opinion, important references:





Please send notes about spelling and ideas for improving the article via private message.

Source: https://habr.com/ru/post/155115/


