
Yandex's cloud platform: a closer look at Elliptics

Some time ago I started writing on Habr about Elliptics, our fault-tolerant distributed key-value storage (which, by the way, is free and distributed under the GPL license). Back then I described Elliptics in general terms: its architecture and basic principles of operation, how the system achieves its reliability, how it can be scaled out, and how it behaves under failures.

Starting with this article, let's dive deeper into Elliptics: I want to tell you about its internal architecture and the various features it supports.

Today's topic is the network and software architecture of Elliptics and some of its features. I will also cover the cache in detail, along with Eblob, our low-level library for local data storage.

Vocabulary


To begin with, let's introduce a small set of terms so that we understand each other better:


Network architecture


As mentioned in the previous article, in Elliptics each node is responsible for a certain range of keys within its group. At the same time, every node knows which ranges all the other nodes are responsible for. In Elliptics this knowledge is called the request routing table. Thanks to this table and the peer-to-peer design, in which every node connects to every other node, any request can be served in O(1) network operations.
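To illustrate the idea (not the actual Elliptics implementation), here is a minimal sketch of such a routing table: node ranges are kept sorted by their starting key, the node responsible for a key is found with a single binary search, and the client then talks to that node directly, i.e. in O(1) network operations. The node addresses and range boundaries below are made up.

```python
import bisect
import hashlib

class RoutingTable:
    """Toy routing table: a sorted list of (range_start, node_address) pairs.
    Each node owns the keys from its range_start up to the next entry."""

    def __init__(self, entries):
        # entries: list of (range_start, node_address), range_start is an int
        self.entries = sorted(entries)
        self.starts = [start for start, _ in self.entries]

    def node_for(self, key: bytes) -> str:
        # Keys are hashed (Elliptics uses sha-512), so key ids are spread
        # uniformly over the id space.
        key_id = int.from_bytes(hashlib.sha512(key).digest(), "big")
        # Find the last range whose start is <= key_id.
        idx = bisect.bisect_right(self.starts, key_id) - 1
        return self.entries[idx][1]

table = RoutingTable([(0, "node-1:1025"),
                      (2**510, "node-2:1025"),
                      (2**511, "node-3:1025")])
print(table.node_for(b"some-object-key"))   # -> address of the responsible node
```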

Elliptics was originally designed as a truly reliable data storage system, one that is not architecturally afraid of losing machines, data centers, or even entire countries. It was therefore important to build a system without a master node, one that is reasonably simple and reacts easily to any change in the network configuration.

The Elliptics routing table is kept up to date in a very simple way: every so often, each node asks the first node of each group to return its routing table. This makes it possible to react to any network changes in time and to learn about new machines appearing or existing ones disappearing.

If the client's routing table has fallen behind the actual state of the cluster, a server may receive a request for an identifier it is not responsible for. In this case the server forwards the request to the node that, in its view, is responsible for that range.

For communication between nodes, Elliptics uses its own extensible asynchronous binary protocol with multiplexing support. Each packet has a header and, optionally, a request or response body.

The header contains:

In total the header is 96 bytes; the bodies of requests and responses are specific to each command and, from the network engine's point of view, are just a set of bytes. This protocol abstracts the actual network interaction between nodes away from the server logic, i.e. the data storage itself. Because of this we can say that Elliptics is no longer just a data storage system, but a distributed command execution system.
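The exact field layout is not spelled out here, so the following is only an illustrative sketch with an assumed set of fields that happen to add up to 96 bytes (a 64-byte key id, a 4-byte group, a 4-byte command, an 8-byte flags word, an 8-byte transaction id and an 8-byte body size). It shows the general idea of a fixed-size header followed by an opaque body, not the real Elliptics wire format.

```python
import struct

# Hypothetical header layout (NOT the real Elliptics format): 64-byte key id,
# 4-byte group id, 4-byte command, 8-byte flags, 8-byte transaction id and
# 8-byte body size -- 96 bytes in total, matching the size quoted above.
HEADER = struct.Struct("<64s I I Q Q Q")
assert HEADER.size == 96

def pack_packet(key_id: bytes, group: int, command: int, flags: int,
                trans: int, body: bytes) -> bytes:
    header = HEADER.pack(key_id.ljust(64, b"\0"), group, command,
                         flags, trans, len(body))
    return header + body

def unpack_header(data: bytes) -> dict:
    key_id, group, command, flags, trans, size = HEADER.unpack_from(data)
    # The body is opaque to the network engine: just `size` bytes after the header.
    return {"id": key_id, "group": group, "cmd": command,
            "flags": flags, "trans": trans, "size": size}
```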

What makes Elliptics different?


In large networks, prioritizing network flows becomes a problem; Elliptics offers several ways to address it.

First, you can "color" traffic using the TOS (Type of Service) field of the IP packet. This lets you prioritize client-to-server and server-to-server requests differently. Elliptics uses the server_net_prio and client_net_prio options for this.
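Under the hood this kind of "coloring" boils down to setting the TOS byte on the socket. A minimal illustration with plain Python sockets (not the Elliptics code itself; the TOS value, host name and port are just examples) looks like this:

```python
import socket

# Example TOS value; in Elliptics the actual values come from the
# server_net_prio / client_net_prio configuration options.
EXAMPLE_TOS = 0x10

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Mark every IP packet sent over this connection with the chosen TOS value,
# so that network equipment can prioritize (or deprioritize) this traffic.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, EXAMPLE_TOS)
sock.connect(("elliptics-node.example.org", 1025))  # hypothetical address/port
```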


Secondly, you can send different kinds of traffic over different networks; for this, Elliptics supports multiple interfaces. Each interface is assigned a network class (a number, an identifier), so every network gets its own routing table. This allows some clients to use a faster but, say, less reliable network, while others use a slower but more reliable one. Thanks to this you can, for example, run the recovery procedure over the faster network without affecting the requests currently coming from the outside world.


Elliptics also supports regional locality: it always tries to serve data from the machine on the network that is closest to the client. This makes it possible to keep cached copies of data in data centers around the world without sending every request to Moscow. For Yandex.Music, for example, the most popular tracks can be stored in the data centers of a particular region.

Because of the way Elliptics stores data on disk (more on that below), data can be streamed directly from the servers, without intermediate proxying. To do this, the client finds out where the copy of the data closest to it is located and redirects the user's HTTP request straight to the server where the data physically resides. The redirected request already carries information about the file that contains the needed data block, as well as the size and offset of that block from the beginning of the file. This lets us use one of the best tools available today for serving static content over HTTP: nginx. To protect against access to other people's data, such requests carry simple private-key authentication.
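A rough sketch of what such a redirect might look like (the URL layout, parameter names and signing scheme are purely illustrative assumptions, not the actual Yandex scheme): the client resolves the key to a physical file, offset and size, and hands the user a signed URL pointing straight at the storage server.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

PRIVATE_KEY = b"shared-secret"   # assumed to be shared with the storage servers

def build_redirect_url(server: str, blob_path: str, offset: int, size: int) -> str:
    # Everything nginx needs to serve the block: which file, where in it, how much.
    params = {
        "file": blob_path,
        "offset": offset,
        "size": size,
        "expires": int(time.time()) + 300,   # short-lived link
    }
    query = urlencode(sorted(params.items()))
    # Sign the query with the private key so the storage server can verify
    # that the request really came from our proxy layer.
    params["sign"] = hmac.new(PRIVATE_KEY, query.encode(), hashlib.sha512).hexdigest()
    return f"http://{server}/stream?{urlencode(sorted(params.items()))}"

print(build_redirect_url("storage-7.example.org", "/srv/eblob/data-0007.blob",
                         offset=104857600, size=4194304))
```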

It is no secret that from time to time data is lost for reasons beyond our control: disks fail, servers burn out, hurricanes leave data centers without power, the Internet gets switched off in a country for political reasons. In such cases it is important that the data keep being stored in the required number of copies. For this, Elliptics has a procedure called read recovery: if during a read it turns out that the file is missing from some of the required groups, then once the data has been received, the client writes it back to those groups. This keeps the system healthy even without launching the long-running global data recovery procedures.
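A minimal sketch of the idea (with a hypothetical per-group client API, not the real Elliptics one): read from the replica groups, and if some of them are missing the key, write the retrieved data back to them.

```python
def read_with_recovery(client, key, groups):
    """Read `key` from the given replica groups and heal missing copies.

    `client` is assumed to expose per-group read()/write() calls and to raise
    KeyError when a group has no copy; this only sketches the read-recovery
    idea, it is not the actual Elliptics client API.
    """
    data = None
    missing = []
    for group in groups:
        try:
            result = client.read(key, group=group)
            if data is None:
                data = result
        except KeyError:                  # this group has lost its copy
            missing.append(group)

    if data is None:
        raise KeyError(f"{key!r} is not present in any of the groups {groups}")

    # Read recovery: write the data back to every group that lost its copy.
    for group in missing:
        client.write(key, data, group=group)
    return data
```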

Besides read recovery we also have data recovery procedures both within a data center and between data centers. The first is well suited for when we add new machines to groups: the added machines immediately start serving all requests. Writes succeed, but the new nodes cannot return the older data, because it still lives on the "old" servers, even though the corresponding ranges are already served by the "new" ones. Merge recovery moves this data from some servers of a group to others. Inter-data-center recovery is needed in case of outages or planned drills, when we lose part of the data in one or several groups; the missing (or newer) data is copied from the other replicas.

Server architecture


The server itself is structured like a layer cake. At the very top is the network layer, which is responsible only for maintaining connections between nodes and for reading and sending packets over the network. The routing table is also kept up to date at this level.


Below it is the command processing layer. All incoming requests are dispatched to the appropriate handlers: the cache, secondary indexes, counters, srw, the backend, and service commands.

I/O operations with the corresponding flags go through the cache. They reach the backend only if the cache does not have enough data to serve the request, or on a timeout (by default, data is synced from the cache to disk once every 30 seconds). For example, all append requests are accumulated in the cache to minimize the number of disk operations. The exceptions are iteration and range requests, which go straight to the backend. You can also forbid working with the disk altogether (this is enabled by a special flag in the command); in that case all operations are performed only against the cache's memory.
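The write-back behaviour described above can be sketched like this (an illustrative toy, not the Elliptics cache itself): appends are merged in memory per key and flushed to the backend only once the sync interval has elapsed.

```python
import time

class WriteBackCache:
    """Toy write-back cache: appends are merged in memory and flushed
    to the backend at most once per `sync_interval` seconds."""

    def __init__(self, backend, sync_interval=30.0):
        # `backend` is assumed to expose read()/write() and to return b"" for
        # missing keys -- a stand-in for the on-disk storage layer.
        self.backend = backend
        self.sync_interval = sync_interval
        self.dirty = {}                   # key -> buffered appended bytes
        self.last_sync = time.monotonic()

    def append(self, key, data: bytes):
        # Several append requests for the same key are glued into one buffer,
        # so the disk sees a single write instead of many small ones.
        self.dirty[key] = self.dirty.get(key, b"") + data
        self._maybe_sync()

    def read(self, key) -> bytes:
        return self.backend.read(key) + self.dirty.get(key, b"")

    def _maybe_sync(self):
        if time.monotonic() - self.last_sync < self.sync_interval:
            return
        for key, buffered in self.dirty.items():
            self.backend.write(key, self.backend.read(key) + buffered)
        self.dirty.clear()
        self.last_sync = time.monotonic()
```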

Secondary indexes run on top of the existing Elliptics infrastructure. To implement them it was enough to add a few new commands and to reuse the already existing methods for writing and reading files, which in turn go through the cache. Secondary indexes in Elliptics let you tag any files and later find the files you need by those tags.
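Conceptually this reduces to ordinary key-value reads and writes: each tag becomes its own key whose value is the list of object ids carrying that tag. The sketch below shows that reduction with a hypothetical key-value client (not the actual Elliptics index format).

```python
import json

def add_to_index(kv, tag: str, object_id: str):
    """Attach `object_id` to `tag` using nothing but plain reads and writes.

    `kv` is a hypothetical key-value client with get()/put() that raises
    KeyError for missing keys; the real Elliptics secondary indexes use
    dedicated commands, this only shows the underlying idea.
    """
    index_key = f"index:{tag}"
    try:
        ids = set(json.loads(kv.get(index_key)))
    except KeyError:
        ids = set()
    ids.add(object_id)
    kv.put(index_key, json.dumps(sorted(ids)))

def find_by_tags(kv, tags):
    """Return ids of objects that carry all of the given tags."""
    result = None
    for tag in tags:
        try:
            ids = set(json.loads(kv.get(f"index:{tag}")))
        except KeyError:
            return set()
        result = ids if result is None else result & ids
    return result or set()
```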

In addition, Elliptics has a data processing system called srw (unfortunately, nobody remembers what the abbreviation stands for or what it means at all). Srw allows various tasks to be executed on Elliptics servers. Applications are launched, and their resource consumption monitored, by our cloud platform Cocaine. In general, it works like this:

  1. The client sends the server a request to perform a task; the request contains the application name, the event name, and binary data, the event's arguments.
  2. The server passes the request to the desired application through Cocaine.
  3. Cocaine, if necessary, launches new instances of the application and monitors their lifetime and load.
  4. The application processes the request and, if necessary, sends further messages to other applications (which may be running on other machines). Each application can send some data back to the client.
  5. The last application informs the client that the task is complete.

The data is handled by the "copy" of the application running directly on the machine responsible for storing the key being processed; this gives high locality of data and handlers.
You can roughly think of this server-side code as a trigger on a database operation. On top of this system we built Grape, our open-source system for real-time stream data processing.
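To make the flow from the list above concrete, here is a purely hypothetical client-side call (the method name, signature and reply handling are assumptions for illustration, not the real srw/Cocaine API):

```python
def run_task(client, key, app_name, event_name, payload: bytes):
    """Send an execution request and collect intermediate and final replies.

    `client.exec_request` is a made-up method standing in for the real srw
    command; the point is only to show what travels over the wire: the key
    (which decides where the task runs), the application name, the event
    name and opaque binary arguments.
    """
    replies = []
    for reply in client.exec_request(key=key, app=app_name,
                                     event=event_name, data=payload):
        # Each application in the chain may stream some data back to the
        # client; the final reply marks the task as complete.
        replies.append(reply)
    return replies
```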

Cache


The Elliptics cache appeared in August 2012 to reduce the load on the disk. Initially it was an LRU cache which, when given special flags, saved a copy of the data being read or written, so that the next request would not have to go to disk. In fact, each node has 16 caches (their number can, of course, be changed), each responsible for its own range of keys; this gives better support for multi-core systems.
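Picking one of those caches is a simple matter of mapping a key onto a range index; a minimal sketch (illustrative, not the actual Elliptics code):

```python
import hashlib

NUM_CACHES = 16   # configurable, as noted above

def cache_index(key: bytes, num_caches: int = NUM_CACHES) -> int:
    # Keys are sha-512 hashes, so key ids are spread uniformly over the
    # 512-bit id space; splitting that space into `num_caches` contiguous
    # ranges gives each cache its own key range (and its own lock/CPU core).
    key_id = int.from_bytes(hashlib.sha512(key).digest(), "big")
    return (key_id * num_caches) >> 512

caches = [dict() for _ in range(NUM_CACHES)]   # stand-ins for the real caches
caches[cache_index(b"some-key")][b"some-key"] = b"value"
```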


Later, timeout support was added, after which data is removed from the cache. Delayed writes to disk were also added; this allows several I/O operations on a key to be "glued" into one, which is very noticeable when there are many appends to the ends of files, for example when storing logs.

More recently, the LRU cache was replaced with a Segmented LRU cache, which behaves much better with hot data. We arrived at this idea after reading about Facebook's experience. Segmented LRU keeps hot data in a separate list, so it is not pushed out of the cache by keys that were accessed only once.
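A compact sketch of the Segmented LRU idea (illustrative only, not the Elliptics implementation): new keys land in a probationary segment, and only a repeat access promotes them to the protected segment, so one-shot keys cannot evict genuinely hot data.

```python
from collections import OrderedDict

class SegmentedLRU:
    """Toy Segmented LRU: a probationary segment for keys seen once and a
    protected segment for keys that have been accessed again."""

    def __init__(self, probation_size=4, protected_size=4):
        self.probation = OrderedDict()
        self.protected = OrderedDict()
        self.probation_size = probation_size
        self.protected_size = protected_size

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)           # refresh recency
            return self.protected[key]
        if key in self.probation:
            # Second access: promote the key to the protected segment.
            value = self.probation.pop(key)
            self._insert_protected(key, value)
            return value
        raise KeyError(key)

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        elif key in self.probation:
            # A repeated touch promotes the key, just like a repeated read.
            del self.probation[key]
            self._insert_protected(key, value)
        else:
            # New keys start in probation; one-shot keys fall out of it
            # without ever displacing the hot (protected) data.
            self.probation[key] = value
            while len(self.probation) > self.probation_size:
                self.probation.popitem(last=False)

    def _insert_protected(self, key, value):
        self.protected[key] = value
        while len(self.protected) > self.protected_size:
            # Overflowing protected items get a second chance in probation.
            old_key, old_value = self.protected.popitem(last=False)
            self.probation[old_key] = old_value
            while len(self.probation) > self.probation_size:
                self.probation.popitem(last=False)
```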

In total, the cache needs to support four data structures: LRU lists (one per SLRU page), a binary search tree over the keys, a priority queue for timeouts, and a priority queue for syncing to disk. However, the last three structures fit quite well on top of a Cartesian tree: a binary search tree over the keys and a heap over the "events", where an "event" is the time of a sync or of a key's expiration. But, contrary to the standard practice of achieving balance through random priorities in the heap, here we get a tree balanced by "random" keys. And we can treat the keys as random because they are the result of cryptographic hashing with sha-512.
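Here is a sketch of that Cartesian tree (treap) idea, purely for illustration: the tree is ordered by the hashed keys, the heap property uses the next event time (sync or expiration), so the key with the earliest pending event is always at the root; and because the keys are sha-512 hashes, they are effectively random and keep the tree balanced without extra random priorities.

```python
import hashlib

class Node:
    __slots__ = ("key", "event_time", "left", "right")

    def __init__(self, key, event_time):
        self.key = key                  # 512-bit key hash: the BST ordering
        self.event_time = event_time    # next sync/expiration: the heap ordering
        self.left = None
        self.right = None

def split(root, key):
    """Split a treap into (keys < key, keys >= key) subtrees."""
    if root is None:
        return None, None
    if root.key < key:
        less, greater = split(root.right, key)
        root.right = less
        return root, greater
    less, greater = split(root.left, key)
    root.left = greater
    return less, root

def insert(root, node):
    """Insert keeping BST order by key and min-heap order by event_time."""
    if root is None:
        return node
    if node.event_time < root.event_time:
        # The new node carries the earliest event, so it becomes the root of
        # this subtree; the old subtree is split around its key.
        node.left, node.right = split(root, node.key)
        return node
    if node.key < root.key:
        root.left = insert(root.left, node)
    else:
        root.right = insert(root.right, node)
    return root

def add(root, user_key: bytes, event_time: float):
    key_hash = int.from_bytes(hashlib.sha512(user_key).digest(), "big")
    return insert(root, Node(key_hash, event_time))

# After any sequence of add() calls, the root holds the key with the nearest
# sync/expiration event, so "what must be flushed or evicted next?" is answered
# by looking at a single node; lookups by key use ordinary BST descent.
```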

Backend


In conclusion, a few words about our main backend, Eblob. Over the history of Elliptics many different libraries have been tried, for example Google LevelDB, Tokyo/Kyoto Cabinet, Oracle BerkeleyDB, and a plain file system. For our workloads, Eblob turned out to be the best local system for storing medium (over a kilobyte) and large (tens of gigabytes) amounts of data.

So, Eblob is our low-level open-source append-only library for storing data in blobs.

Each blob can hold millions of files, and its size can reach hundreds of gigabytes; all of these parameters are configurable. At any given moment, writes go only to the last, "open" blob. As soon as the blob reaches a limit on size or on the number of files, it is "closed" and a new one is opened. After that, the files in "closed" blobs never change; the most that can happen is that a file is marked as deleted, either because it was removed or because a new version was written into the new blob.
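A minimal sketch of that append-only write path (illustrative only; the record framing and limits below are made up, not the real eblob format): records are appended to the currently open blob, and when it hits the size or record-count limit it is closed and a fresh blob is started.

```python
import os
import struct

MAX_BLOB_SIZE = 100 * 2**30        # both limits are configurable, as noted above
MAX_BLOB_RECORDS = 5_000_000

class BlobWriter:
    """Toy append-only blob writer with a made-up on-disk record framing."""

    RECORD_HEADER = struct.Struct("<64sQ")   # 64-byte key + 8-byte data size

    def __init__(self, directory):
        self.directory = directory
        self.blob_index = 0
        self.blob = None
        self.records = 0
        self._open_new_blob()

    def _open_new_blob(self):
        path = os.path.join(self.directory, f"data-{self.blob_index:04d}.blob")
        self.blob = open(path, "ab")
        self.blob.seek(0, os.SEEK_END)
        self.records = 0

    def append(self, key: bytes, data: bytes):
        """Append a record; return (blob index, offset of the data in the blob)."""
        if self.blob.tell() >= MAX_BLOB_SIZE or self.records >= MAX_BLOB_RECORDS:
            # "Close" the full blob; it will never be modified again.
            self.blob.close()
            self.blob_index += 1
            self._open_new_blob()
        offset = self.blob.tell()
        self.blob.write(self.RECORD_HEADER.pack(key, len(data)))
        self.blob.write(data)
        self.records += 1
        return self.blob_index, offset + self.RECORD_HEADER.size
```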


Some time after a blob is closed, or as soon as more keys than a threshold value have been deleted from it, the sorting procedure (essentially, defragmentation) starts. It can be triggered either manually by an external utility, or by a timeout or time range specified in the eblob config. During sorting/defragmentation all the empty spaces are removed from the blob, and an index file is created that contains only the data headers (key, internal eblob flags, size, and offset of the data from the beginning of the "big" blob). To smooth the load on the disk, the defragmentation operation can be postponed so that it runs when the system load is at its lowest.
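An illustrative sketch of such an index file and a lookup against it (continuing the toy framing above; the entry layout is an assumption, not the real eblob index format): the index holds fixed-size sorted headers, so a key can be located with a binary search without touching the data blob itself.

```python
import struct

# Toy index entry: 64-byte key, flags, size, offset of the data in the blob.
INDEX_ENTRY = struct.Struct("<64sQQQ")

def build_index(headers):
    """headers: iterable of (key, flags, size, offset) tuples from one blob.
    Returns the sorted index as bytes, ready to be written next to the blob."""
    entries = sorted(headers, key=lambda h: h[0])   # sort by key
    return b"".join(INDEX_ENTRY.pack(*entry) for entry in entries)

def lookup(index_bytes: bytes, key: bytes):
    """Binary-search the sorted index; return (flags, size, offset) or None."""
    count = len(index_bytes) // INDEX_ENTRY.size
    lo, hi = 0, count - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        entry_key, flags, size, offset = INDEX_ENTRY.unpack_from(
            index_bytes, mid * INDEX_ENTRY.size)
        if entry_key == key:
            return flags, size, offset
        if entry_key < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None   # keys are assumed to be fixed 64-byte hashes
```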

Several techniques are used to find the blob in which a key lives:

Conclusion


In the following articles we will talk about how to set up Elliptics yourself and examples of its use, more about how Eblob works (believe me, we have interesting things to tell about it), data streaming, secondary indexes, and much more.

Source: https://habr.com/ru/post/209882/

