
Chain replication: building efficient KV storage (part 1/2)


In this article we will look at the architecture of simple and efficient KV storage built on chain replication, an approach that is actively studied and successfully used in a variety of systems.

This is the first half of an article on chain replication; the second part is here. First comes a little theory, then a few examples of its use with various modifications.

  1. Introduction: the problem statement and a comparison with the primary / backup protocol.
  2. Chain replication: the basic approach.
  3. Chain replication: distributed (apportioned) queries.
  4. FAWN: a Fast Array of Wimpy Nodes.

1. Introduction


1.1 Purpose


Suppose we want to design a simple key-value store. The store will have a very minimal interface:

  1. write(key, object): save or update the value stored under the key.
  2. read(key): return the value stored under the key.
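
For concreteness, here is a minimal sketch of such an interface in Python (a toy in-memory version; the class and method names are only illustrative, not taken from any particular system):

```python
from typing import Any, Optional


class KVStore:
    """Toy key-value store with the minimal interface described above."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def write(self, key: str, obj: Any) -> None:
        """Save or update the value stored under `key`."""
        self._data[key] = obj

    def read(self, key: str) -> Optional[Any]:
        """Return the value stored under `key`, or None if it was never written."""
        return self._data.get(key)
```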

We also know that the data set is relatively small (everything fits on one server, so no sharding is needed), but the volume of read / write requests can be very large.
Our goal is to sustain a large number of requests (high throughput, HT), to provide high availability (HA), and to guarantee strong consistency (SC).

In many systems SC is sacrificed for the sake of HA + HT, because achieving all three properties at once is a non-trivial task. Amazon Dynamo was a huge leap forward and spawned a family of Dynamo-style databases such as Cassandra, Riak, Voldemort, etc.

1.2 Primary / Backup


One of the most common and simplest approaches to building such a data storage system is primary / backup replication:
we have one primary server and several backup servers, and write / read operations go only through the primary server.


The picture shows one of the possible interaction protocols (the primary waits for an ack from every backup before sending its own ack to the client); there are other, not mutually exclusive, options, for example:


A separate process is also needed to monitor the state of the cluster and distribute the configuration to the participants: when the primary server fails, it initiates a new election, and it also decides what to do in the case of split brain. Depending on the requirements, part of this logic can live inside the replication algorithm itself and part in a third-party component (for example, ZooKeeper for configuration storage).
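
As a rough sketch of the write path in this scheme (assuming the synchronous variant where the primary waits for every backup before acknowledging the client; all class and method names here are illustrative):

```python
class Backup:
    def __init__(self):
        self.data = {}

    def replicate(self, key, value):
        # Apply the replicated write and acknowledge it to the primary.
        self.data[key] = value
        return "ack"


class Primary:
    """Primary / backup sketch: all reads and writes go through the primary."""

    def __init__(self, backups):
        self.backups = backups
        self.data = {}

    def write(self, key, value):
        self.data[key] = value
        # Synchronous variant: wait for an ack from every backup
        # before acknowledging the client.
        for backup in self.backups:
            backup.replicate(key, value)
        return "ack"

    def read(self, key):
        # Serving reads only from the primary keeps the view strongly consistent.
        return self.data.get(key)


primary = Primary([Backup(), Backup()])
primary.write("x", 1)
print(primary.read("x"))  # 1
```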

Obviously, sooner or later the performance of primary / backup replication will run into two bottlenecks:


The stricter the reliability and consistency requirements imposed on the cluster, the sooner this moment arrives.

Are there any other ways to achieve our goal?

1.3 Chain Replication



In general, chain replication consists of a sequence (chain) of servers with two special roles: HEAD (the server the client communicates with) and TAIL (the end of the chain, which provides the strong-consistency guarantee). The chain has at least the following properties:

  1. It tolerates the failure of up to n - 1 servers.
  2. Write speed is not significantly worse than that of an SC primary / backup setup.
  3. Reconfiguring the cluster after a HEAD failure is much faster than after a primary failure; for the other servers it is comparable to or faster than in primary / backup.

A small but important point: a reliable FIFO link is required between the servers.

Let us further consider in more detail the various methods of constructing chain replication.

2. Basic approach



2.1 Algorithm of work


Clients send write requests to the head node and read requests to the tail node. The response always comes from the tail. On receiving a write request, the head computes the required state change, applies it, and forwards it to the next node. As soon as the tail processes it, an ACK is sent back along the chain. Clearly, if a read request returns some value x, then x is stored on every node.
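
A minimal single-process sketch of this flow (the chain is modelled with ordinary Python objects and synchronous calls; a real system would exchange messages over reliable FIFO links, and all names here are illustrative):

```python
class ChainNode:
    """One node of the chain: writes arrive at the head, propagate towards the
    tail, and the ACK travels back along the same chain."""

    def __init__(self):
        self.data = {}
        self.successor = None  # next node towards the tail; None for the tail itself

    def handle_write(self, key, value):
        # Apply the state change locally, then forward it down the chain.
        self.data[key] = value
        if self.successor is None:
            return "ack"  # tail reached: the write is now visible to readers
        # The ACK from the tail flows back up through the returns.
        return self.successor.handle_write(key, value)

    def handle_read(self, key):
        # In the basic approach only the tail answers read requests.
        assert self.successor is None, "reads must go to the tail"
        return self.data.get(key)


def build_chain(n):
    nodes = [ChainNode() for _ in range(n)]
    for prev, nxt in zip(nodes, nodes[1:]):
        prev.successor = nxt
    return nodes  # nodes[0] is the head, nodes[-1] is the tail


nodes = build_chain(3)
head, tail = nodes[0], nodes[-1]
head.handle_write("x", 42)    # returns "ack" after the tail applies the write
print(tail.handle_read("x"))  # 42
```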

2.2 Replication Protocol


Let's number the servers from head to tail; then on each node i we will additionally store:

  1. Pending_i: the set of requests received by node i but not yet processed by the tail.
  2. Sent_i: the set of requests that node i has forwarded to its successor but that have not yet been processed by the tail.

2.3 Handling server failures


As stated in the introduction, we need a master process that:

  1. detects failed servers;
  2. notifies each server of its current predecessor and successor in the chain;
  3. notifies clients which server is the head and which is the tail.

We assume that the master process is stable and never fails. The selection of such a process is beyond the scope of this article.

The second very important assumption is that the servers are fail-stop:

  1. a failed server stops working instead of continuing to operate incorrectly;
  2. the failure of a server can be detected by the other servers.

Consider how to add a new server:
in theory, a new server can be inserted anywhere in the chain, but adding it at the tail looks the least complicated: we only need to copy the state of the current tail to the new server, notify the master about the change in the chain, and tell the old tail that requests must now be forwarded further.
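
A sketch of this tail-extension step, reusing the ChainNode class from the sketch above (the master here is just a stub with a hypothetical notify_new_tail method):

```python
def add_tail(old_tail, master):
    """Append a new node to the end of the chain, as described above."""
    new_tail = ChainNode()
    new_tail.data = dict(old_tail.data)  # 1. copy the current tail's state
    master.notify_new_tail(new_tail)     # 2. tell the master that the chain changed
    old_tail.successor = new_tail        # 3. old tail now forwards requests further
    return new_tail
```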

Finally, consider three possible failure scenarios:

2.3.1 Head failure
Simply remove the server from the chain and make the next node the new head. The only loss is the requests in Pending_head that were not yet forwarded, i.e. Pending_head \ Sent_head.

2.3.2 Tail failure
Remove the server from the chain and make its predecessor the new tail; before that, the new tail's Sent set is cleared (all of those operations are now marked as processed by the tail), and its Pending set shrinks accordingly.

2.3.3 Failure of an intermediate node k
The master informs nodes k-1 and k+1 about the change in the chain order.
The requests in Sent_{k-1} may be lost if node k did not manage to forward them to its successor, so after node k is removed from the chain, node k-1 first re-sends Sent_{k-1}, and only after that does it continue to process new requests.
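
A sketch of the master's reaction to these three scenarios, again on top of the ChainNode toy model (here each node is additionally assumed to keep a `sent` list of (key, value) pairs forwarded to its successor but not yet acknowledged by the tail; the Pending bookkeeping is omitted for brevity):

```python
def handle_failure(chain, k):
    """Reconfigure the chain (a list of ChainNode objects) after node k fails."""
    if k == 0:
        # Head failure: the next node becomes the new head; only the requests
        # in Pending_head \ Sent_head are lost.
        return chain[1:]
    if k == len(chain) - 1:
        # Tail failure: the predecessor becomes the new tail, and everything in
        # its Sent set is now considered processed by the tail.
        chain[-2].successor = None
        chain[-2].sent.clear()
        return chain[:-1]
    # Intermediate node: link k-1 to k+1, re-send Sent_{k-1} first, and only
    # then let node k-1 continue processing new requests.
    prev, nxt = chain[k - 1], chain[k + 1]
    prev.successor = nxt
    for key, value in list(prev.sent):
        nxt.handle_write(key, value)
    return chain[:k] + chain[k + 1:]
```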

2.4 Comparison with the primary / backup protocol



Chain replication delays on failures:


P / B delays on failures:


As you can see, the worst failure for chain replication (the tail) is handled faster than the worst one for P / B (the primary).

The authors of this approach ran load tests that showed performance comparable to the P / B protocol.

3. Distributed Queries (Chain Replication with Apportioned Queries - CRAQ)


The basic approach has an obvious weak point: the tail, which handles all read requests. This can lead to two problems:

  1. the tail becomes a bottleneck that limits read throughput;
  2. clients cannot read from the replica closest to them, which matters when the chain is spread across several data centers.
The idea of CRAQ is quite simple: allow read requests to go to every server except the tail, and to preserve consistency store a version for each object on writes; in case of ambiguity, a node queries the tail for the latest committed version.

3.1 CRAQ


Let us formalize the CRAQ architecture:
every node except the tail handles read requests and returns responses, while the head returns responses to write requests (compare with the basic approach).


Each non-tail node may store several versions of the same object, and the versions form a strictly monotonically increasing sequence. Each version carries an additional attribute, "clean" or "dirty". Initially, all versions are clean.

As soon as a node receives a write request, it appends the received version to the object's list of versions, and then:

  1. if the node is not the tail, it marks the version as dirty and forwards the request to its successor;
  2. if the node is the tail, it marks the version as clean (the write is committed) and sends an ACK back along the chain.

As soon as the node receives confirmation from the successor, it marks the version as clean and removes all previous versions.

As soon as a node receives a read request:

  1. if the latest version of the requested object is clean, the node returns it immediately;
  2. if the latest version is dirty, the node asks the tail for the number of the last committed version and returns that version.


For read-dominated workloads, CRAQ's performance grows linearly with the number of nodes; for write-dominated workloads, performance is no worse than in the basic approach.
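
A simplified sketch of such a CRAQ node, covering both the write path and the read path described above (synchronous calls instead of network messages, one write in flight at a time; all names are illustrative):

```python
class CraqNode:
    """CRAQ sketch: every object keeps a list of versions marked clean or dirty,
    and any non-tail node may answer reads itself."""

    def __init__(self):
        self.versions = {}     # key -> list of (version_number, value, is_clean)
        self.successor = None  # None means this node is the tail
        self.tail = None       # reference to the tail, used to resolve dirty reads

    def handle_write(self, key, value, version):
        if self.successor is None:
            # Tail: the version is committed (clean) at once; older versions are
            # dropped and the ACK travels back up the chain.
            self.versions[key] = [(version, value, True)]
            return version
        # Non-tail: store the version as dirty and forward the write.
        self.versions.setdefault(key, []).append((version, value, False))
        committed = self.successor.handle_write(key, value, version)
        # On the ACK, mark the committed version clean and drop older ones.
        self.versions[key] = [(committed, value, True)]
        return committed

    def handle_read(self, key):
        versions = self.versions.get(key, [])
        if not versions:
            return None
        number, value, clean = versions[-1]
        if clean:
            return value  # unambiguous: the latest version is committed
        # Ambiguous (dirty): ask the tail for the last committed version number.
        committed = self.tail.last_committed_version(key)
        for number, value, _ in versions:
            if number == committed:
                return value
        return None

    def last_committed_version(self, key):
        versions = self.versions.get(key, [])
        return versions[-1][0] if versions else None
```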

CRAQ can be deployed in one data center or across several. This lets clients pick the nearest nodes, reducing the latency of read requests.



3.2 Consistency in CRAQ


CRAQ provides strong consistency except in one case: when a node has obtained the latest committed version from the tail, the tail may commit a new version before the node replies to the client. In this situation CRAQ provides monotonic reads across the whole chain (successive read requests do not go back in time, though they may return stale data).

Weaker consistency modes are also possible:

  1. eventual consistency: a node returns its latest known version even if it is dirty, without contacting the tail;
  2. bounded eventual consistency: a dirty version is returned only if it is not older than the last committed version by more than a configured bound (in time or in number of versions).

3.3 Handling server failures


Similar to the basic approach.

3.4 More


CRAQ has one interesting feature: multicast can be used for write operations. Suppose the head multicasts the change itself and sends only an identifier of this event further along the chain. If the actual update has not yet reached a node, the node can wait and obtain it from a neighbouring node once the tail sends confirmation that the change has been committed. Similarly, the tail can multicast the commit confirmation.

4. FAWN: a Fast Array of Wimpy Nodes


A very interesting study that is not directly about the topic of this article, but serves as an example of the use of chain replication.

High-performance key-value stores (Dynamo, memcached, Voldemort) share common characteristics: they are I/O-bound with minimal computation, they handle huge numbers of concurrent, independent accesses to random keys, and they store small values, up to about 1 KB.

Servers with HDDs are not suitable for such clusters because of the long seek time (random access latency), while servers with a lot of DRAM consume a surprisingly large amount of power: 2 GB of DRAM draws about as much power as a 1 TB HDD.

Building a cluster that is efficient in throughput per watt is the goal of the original study. Over three years, 50% of a server's cost is electricity, and modern power-saving modes are not as effective as advertised: in tests at 20% load, CPU power consumption stayed around 50%, and most of the other server components have no power-saving modes at all (DRAM, for example, draws the same power even when idle). It is also important that in such clusters the gap between the CPU and I/O keeps growing: a powerful CPU is forced to sit idle waiting for I/O operations to complete.

4.1 Architecture


The FAWN cluster is built from cheap $250 servers (2009 prices) with an embedded 500 MHz CPU, 512 MB of RAM, and a 32 GB SSD. If you are familiar with the Amazon Dynamo architecture or with consistent hashing, the FAWN architecture will look familiar:

  1. Each physical server contains several virtual nodes, each has its own VID.
  2. VIDs form a ring, each VID is responsible for the range “behind itself” (for example, A1 is responsible for keys in the R1 range).
  3. To increase reliability, data is replicated to the next R nodes clockwise (for example, with R = 2 a key owned by A1 is replicated to B1 and C1), which gives us chain replication (the basic approach).
  4. Read requests go to the tail of the chain, i.e. a read of a key owned by A1 goes to C1.
  5. Write requests go to the head of the chain and propagate down to the tail.


The server map is stored on a cluster of front-end servers, each of which is responsible for its own list of VIDs and can redirect a request to another front-end server.
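
A rough sketch of how a key could be mapped to its replication chain on such a ring (the hash function and ring layout are simplified, and all names are illustrative):

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Position on the ring: a 32-bit slice of a SHA-1 digest."""
    return int(hashlib.sha1(name.encode()).hexdigest()[:8], 16)


class Ring:
    """Consistent-hashing ring of virtual nodes (VIDs); a key's chain is the
    owner VID plus the next R VIDs clockwise, as in the FAWN description."""

    def __init__(self, vids, r=2):
        self.r = r
        self.ring = sorted((ring_position(v), v) for v in vids)
        self.points = [p for p, _ in self.ring]

    def chain_for(self, key):
        # The owner is the first VID clockwise from the key's position
        # (each VID is responsible for the range "behind itself").
        i = bisect.bisect_left(self.points, ring_position(key)) % len(self.ring)
        return [self.ring[(i + j) % len(self.ring)][1] for j in range(self.r + 1)]


ring = Ring(["A1", "B1", "C1", "A2", "B2", "C2"], r=2)
chain = ring.chain_for("some-key")
head, tail = chain[0], chain[-1]  # writes go to the head, reads to the tail
print(chain)
```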

4.2 Test Results


In load tests, FAWN reaches a QPS (queries per second) equal to 90% of the raw random-read QPS of its flash disk.

The following table compares the total cost of ownership (TCO) of various configurations, where the baseline ("Traditional") is a $1000 server consuming 200 W (2009 prices):

Thus, if:



Links


Source: https://habr.com/ru/post/430396/

