
Chain replication: building efficient KV storage (part 2/2)


We continue our survey of systems that use chain replication. Basic definitions and architectures were given in the first part; I recommend reading it before this one.

In this article we will look at the following systems:

- Hibari
- HyperDex
- ChainReaction
- Leveraging Sharding in the Design of Scalable Replication Protocols

5. Hibari


Hibari is a distributed, fault-tolerant KV store written in Erlang. It uses chain replication (the basic approach) and therefore provides strict consistency. In benchmarks Hibari shows high throughput: several thousand updates per second on 2U servers (with 1 KB requests).

5.1 Architecture


Data placement uses consistent hashing. Storage is organized into physical and logical bricks. A physical brick is a Linux server (possibly an EC2 instance or, more generally, any VM). A logical brick is a storage instance that the main cluster processes work with; each logical brick is a node of exactly one chain. In the example below, the cluster is configured with 2 logical bricks per physical brick and a chain length of 2. Note that the nodes of a chain are "smeared" across physical bricks to increase reliability.
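
A minimal sketch of this layout (my own illustration in Python, not Hibari code; names and the striping rule are assumptions) shows how logical bricks can be striped so that the nodes of one chain never share a physical brick:

```python
# Hypothetical sketch of Hibari-style chain placement: logical bricks are
# striped across physical bricks so that the nodes of one chain never sit
# on the same machine. Names and layout rule are illustrative only.

def place_chains(physical_bricks, chains_per_brick, chain_len):
    """Return a mapping chain_name -> list of (physical_brick, logical_brick)."""
    n = len(physical_bricks)
    placement = {}
    for c in range(n * chains_per_brick // chain_len):
        chain = []
        for pos in range(chain_len):
            # shift each position to a different physical brick ("smearing")
            pb = physical_bricks[(c + pos) % n]
            chain.append((pb, f"logical_brick_{c}_{pos}"))
        placement[f"chain_{c}"] = chain
    return placement

if __name__ == "__main__":
    layout = place_chains(["pb1", "pb2", "pb3", "pb4"],
                          chains_per_brick=2, chain_len=2)
    for name, bricks in layout.items():
        print(name, bricks)
```
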
The master process (see the definition in the first part) is called the Admin server.

Data is stored in "tables", which simply serve as namespaces. Each table is stored in at least one chain, and each chain stores data from only one table.

Hibari clients receive updates from the Admin server containing the heads and tails of all chains (of all tables), so a client immediately knows which logical brick to send a request to.



5.2 Hashing


Hibari uses the pair {T, K} to determine the name of the chain that stores key K of table T: the key is mapped (using MD5) onto the unit interval, which is divided into sections, each owned by exactly one chain. Sections can have different widths depending on the "weight" of the chain, for example:



Thus, if some physical bricks are especially powerful, the chains placed on them can be given wider sections (and so more keys will land on them).
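
A minimal sketch of this scheme (my own illustration, not Hibari's actual code; chain names and weights are made up): MD5 maps {T, K} to a point on the unit interval, and each chain owns a section whose width is proportional to its weight.

```python
import hashlib

# Illustrative sketch of Hibari-style key-to-chain mapping: the key is hashed
# with MD5 onto [0.0, 1.0), and the interval is cut into sections whose widths
# are proportional to chain "weights". Chain names and weights are invented.

CHAIN_WEIGHTS = {"chain_1": 1.0, "chain_2": 2.0, "chain_3": 1.0}

def key_to_point(table, key):
    digest = hashlib.md5(f"{table}:{key}".encode()).hexdigest()
    return int(digest, 16) / 16 ** 32           # point in [0.0, 1.0)

def chain_for_key(table, key):
    point = key_to_point(table, key)
    total = sum(CHAIN_WEIGHTS.values())
    left = 0.0
    for name, weight in CHAIN_WEIGHTS.items():
        right = left + weight / total            # this chain's section
        if point < right:
            return name
        left = right
    return name                                  # numerical edge case

print(chain_for_key("users", "alice"))
```
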

6. HyperDex


The goal of this project was to build a distributed key-value store that, unlike other popular solutions (BigTable, Cassandra, Dynamo), supports fast search by secondary attributes and fast range queries. For example, in the previously reviewed systems, finding all values in a certain range would require visiting every server, which is clearly unacceptable. HyperDex solves this problem with hyperspace hashing.

6.1 Architecture


The idea behind hyperspace hashing is to build an n-dimensional space in which each attribute corresponds to one coordinate axis. For example, for objects with attributes (first-name, last-name, phone-number) the space might look like this:



The gray hyperplane passes through all keys with last-name = Smith, the yellow one through all keys with first-name = John. The intersection of these planes is the answer to the query "phone numbers of people named John Smith". In general, a query over k attributes returns an (n - k)-dimensional subspace.

The search space is split into non-intersecting n-dimensional regions, and each region is assigned to exactly one server. An object is stored on the server whose region contains the object's coordinates; this defines the mapping between objects and servers.

A range query determines which regions the resulting hyperplane intersects and thereby keeps the number of polled servers to a minimum.

This approach has one problem: the number of required servers grows exponentially with the number of attributes (with n attributes, on the order of 2^n servers are needed). To address this, HyperDex partitions the hyperspace into subspaces of lower dimension, each over a subset of the attributes:
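
A rough sketch of the idea (assumed names and grid sizes, not HyperDex's implementation): each attribute is hashed onto its own axis, each axis is cut into a few ranges, and the resulting grid cell identifies the responsible server; a subspace is the same construction over a subset of attributes.

```python
import hashlib

# Toy model of hyperspace hashing: every attribute is hashed to a coordinate,
# every axis is split into AXIS_PARTS ranges, and the cell (region) index
# determines the responsible server. A subspace is the same construction over
# a subset of attributes. All names and constants are illustrative.

AXIS_PARTS = 4   # each axis split into 4 ranges -> AXIS_PARTS**n regions

def coord(value):
    h = hashlib.md5(str(value).encode()).hexdigest()
    return int(h, 16) / 16 ** 32                 # coordinate in [0.0, 1.0)

def region(obj, attributes):
    """Grid cell for the given subset of attributes (a subspace)."""
    return tuple(int(coord(obj[a]) * AXIS_PARTS) for a in attributes)

obj = {"first-name": "John", "last-name": "Smith", "phone-number": "555-1234"}

# the full 3-dimensional space has AXIS_PARTS**3 regions,
# the (first-name, last-name) subspace only AXIS_PARTS**2
print(region(obj, ["first-name", "last-name", "phone-number"]))
print(region(obj, ["first-name", "last-name"]))

# a search for first-name == "John" only touches regions whose first
# coordinate matches coord("John") -- a thin slice of the grid
print(int(coord("John") * AXIS_PARTS))
```
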


6.2 Replication


To ensure strict consistency the authors developed a special approach based on chain replication, called value-dependent chaining: each subsequent node of the chain is determined by hashing the corresponding attribute. For example, the key is first hashed into the key subspace (giving the head of the chain, also called the point leader), then the hash of "John" gives the coordinate on the corresponding axis, and so on (see the picture below for an update example).

All updates go through the point leader, which orders requests (this gives linearizability).



If an update moves an object to a different region, the new version is first written right after the old one, and only after the ACK from the tail is received is the link to the old version on the previous server updated. So that concurrent requests do not violate consistency, the point leader attaches version numbers and other meta-information to each request, so that a server receiving a later update first can detect that the order is broken and wait for the earlier one.
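
A simplified sketch of that ordering (assumed structure; the real HyperDex protocol carries more metadata): the point leader stamps every update with an increasing version, and a downstream server that receives a later version first holds it back until the earlier ones arrive.

```python
# Simplified sketch of value-dependent chaining ordering (not HyperDex code):
# the point leader stamps updates with increasing versions; a downstream
# server buffers an update until all earlier versions have been applied.

class PointLeader:
    def __init__(self):
        self.version = 0

    def prepare(self, key, value):
        self.version += 1
        return {"key": key, "value": value, "version": self.version}

class RegionServer:
    def __init__(self):
        self.applied = 0          # highest version applied so far
        self.pending = {}         # out-of-order updates, version -> update
        self.store = {}

    def receive(self, update):
        self.pending[update["version"]] = update
        # apply strictly in version order; wait if an earlier one is missing
        while self.applied + 1 in self.pending:
            u = self.pending.pop(self.applied + 1)
            self.store[u["key"]] = u["value"]
            self.applied += 1

leader, server = PointLeader(), RegionServer()
u1 = leader.prepare("phone", "555-1234")
u2 = leader.prepare("phone", "555-9999")
server.receive(u2)     # arrives first: buffered, order would be broken
server.receive(u1)     # now both are applied, in order
print(server.store["phone"], server.applied)   # 555-9999 2
```
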

7. ChainReaction


ChainReaction uses the causal+ consistency model, which adds conflict-free convergence to causal consistency. To enforce causality, metadata listing the versions of all causally dependent keys is attached to each request. ChainReaction supports geo-replication across several data centers and is a further development of the CRAQ idea.

7.1 Architecture


The architecture is borrowed from FAWN with small changes: each DC consists of data servers, the backends (they store data, replicate it, and form a DHT ring), and client proxies, the frontends (they route each request to the appropriate node). Each key is replicated on R consecutive nodes, which form its chain. Read requests are handled by the tail, writes by the head.
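
A minimal sketch (illustrative node names, my own simplification) of how a chain of R consecutive nodes can be picked from such a DHT ring for a given key:

```python
import hashlib
from bisect import bisect_left

# Illustrative sketch of FAWN/ChainReaction-style chain selection: backends
# form a ring ordered by their hash, and a key is replicated on the R nodes
# that follow its position on the ring.

R = 3
NODES = ["backend-a", "backend-b", "backend-c", "backend-d", "backend-e"]

def h(x):
    return int(hashlib.md5(x.encode()).hexdigest(), 16)

ring = sorted(NODES, key=h)              # ring order by node hash

def chain_for(key):
    positions = [h(n) for n in ring]
    start = bisect_left(positions, h(key)) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(R)]

chain = chain_for("user:42")
print("head:", chain[0], "tail:", chain[-1])   # writes to head, reads to tail
```
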


7.2 One data center


Note an important property that follows from chain replication: if a node is causally consistent with some client operations, then so are all preceding nodes. Therefore, if an operation was last observed on some node, all read operations that causally depend on it may only be served by nodes from the head up to that node. Once the operation has been executed at the tail, there are no more restrictions on reads. Write operations that have reached the tail in data center d are called DC-Write-Stable(d).

Each client keeps metadata for the keys it has accessed, in the form (key, version, chainIndex), where chainIndex is the position in the chain of the node that answered the last request for that key. Metadata is kept only for keys that the client does not yet know to be DC-Write-Stable(d).
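
The client-side bookkeeping can be pictured like this (a sketch under my own assumptions, not the paper's code): the client remembers (version, chainIndex) per key and drops an entry once that key is known to be DC-Write-Stable(d).

```python
# Sketch of ChainReaction client metadata (illustrative): for every key whose
# stability is unknown, the client remembers the version it saw and the index
# of the chain node that served it; entries are dropped once the key becomes
# DC-Write-Stable(d).

class ClientMetadata:
    def __init__(self):
        self.meta = {}                      # key -> (version, chainIndex)

    def observe(self, key, version, chain_index):
        self.meta[key] = (version, chain_index)

    def mark_stable(self, keys):
        for key in keys:
            self.meta.pop(key, None)        # no restrictions once stable

    def dependencies(self):
        """Metadata attached to the next write (causal dependencies)."""
        return dict(self.meta)

md = ClientMetadata()
md.observe("cart:7", version=3, chain_index=1)
print(md.dependencies())                    # sent with the next write
md.mark_stable(["cart:7"])
print(md.dependencies())                    # {}
```
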

7.2.1 Performing a write operation


Note that once an operation has become DC-Write-Stable(d), no read request can return an earlier version.

Each write request carries the list of keys that were read before this write (the client's accumulated metadata). When the client proxy receives the request, it performs blocking reads on the tails of the chains of all keys from that metadata (waiting for confirmation that the same or a newer version is present, i.e. enforcing the causal-consistency condition). Once the confirmations arrive, the write request is sent to the head of the corresponding chain.



As soon as the new value has been stored on the first k nodes of the chain, a notification (with the index of the last such node) is sent to the client. The client updates chainIndex and removes the metadata for the keys attached to the request, since they are now known to be DC-Write-Stable(d). In parallel, the write continues down the chain (lazy propagation); thus writing to the first k nodes is given priority. Once the tail stores the new version of the key, a notification is sent to the client and propagated to all nodes of the chain so that they mark the key as stable.
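
Putting these steps together, a client proxy's write path might look roughly like this (a sketch under assumed helper functions such as read_from_tail and send_to_node; not the actual implementation):

```python
# Rough sketch of a ChainReaction client-proxy write (hypothetical helpers):
# 1) blocking reads on the tails of all causally dependent keys,
# 2) write to the first K_FAST chain nodes and acknowledge the client,
# 3) let the write propagate lazily to the rest of the chain.

K_FAST = 2   # how many chain nodes must store the value before the client ACK

def handle_write(chain, key, value, dependencies, read_from_tail, send_to_node):
    # step 1: wait until every dependency (same or newer version) is at a tail
    for dep_key, (version, _chain_index) in dependencies.items():
        while read_from_tail(dep_key)["version"] < version:
            pass                                  # block: causal condition

    # step 2: synchronous write to the first K_FAST nodes
    for node in chain[:K_FAST]:
        send_to_node(node, key, value)
    ack = {"key": key, "chainIndex": K_FAST - 1}  # returned to the client

    # step 3: lazy propagation to the remaining nodes; when the tail stores
    # the value, the key is marked stable on every node of the chain
    for node in chain[K_FAST:]:
        send_to_node(node, key, value)
    return ack

# toy usage with in-memory stubs
store = {}
ack = handle_write(
    chain=["n1", "n2", "n3"], key="cart:7", value="book",
    dependencies={},                               # nothing to wait for
    read_from_tail=lambda k: {"version": 0},
    send_to_node=lambda node, k, v: store.setdefault(node, {}).update({k: v}),
)
print(ack, store)
```
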

7.2.2 Performing a read operation


The client proxy sends the read request to one of the chain nodes up to position chainIndex, thereby distributing the load. The node replies with the value and its version. The reply is forwarded to the client, and then:


7.2.3 Processing node failures


Almost identical to the basic approach, except that in some cases the client's chainIndex becomes invalid; this is easily detected while executing a request (the node has no key with that version), and the request is then redirected to the head of the chain to find a node with the required version.

7.3 Multiple data centers (geo-replication)


We take the single-DC algorithms as a basis and adapt them as little as possible. To begin with, instead of scalar version and chainIndex values, the metadata now holds vectors of dimension N (one entry per data center).

Global-Write-Stable is defined by analogy with DC-Write-Stable(d): a write operation is Global-Write-Stable once it has been executed at the tails in all DCs.

Each DC gains a new component, remote_proxy, whose task is to send updates to and receive them from other DCs.

7.3.1 Performing a write operation (in the local DC)


The beginning is the same as in the single-DC case: we perform the blocking reads and write to the first k chain nodes. At this point the client proxy returns to the client a new chainIndex vector that is zero everywhere except in the position of the local DC, where it holds the index of the last written node. The rest proceeds as before. The one extra step comes at the very end: the update is handed to remote_proxy, which accumulates several requests and then ships them all to the other DCs.
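
The multi-DC additions could be sketched as follows (illustrative names and constants, my own simplification): the proxy returns a chainIndex vector with a single non-zero slot, and remote_proxy batches updates before shipping them.

```python
# Sketch of the multi-DC additions (illustrative names): the proxy returns a
# chainIndex vector that is zero everywhere except the local DC's slot, and
# remote_proxy batches updates before shipping them to the other DCs.

N_DCS = 3
LOCAL_DC = 1          # index of this data center
K_FAST = 2            # nodes written synchronously, as in the single-DC case

def chain_index_vector(local_dc=LOCAL_DC, k=K_FAST):
    vec = [0] * N_DCS
    vec[local_dc] = k                 # only the local DC has progressed so far
    return vec

class RemoteProxy:
    def __init__(self, batch_size=3):
        self.batch, self.batch_size = [], batch_size
        self.shipped = []             # batches sent to the other DCs

    def enqueue(self, update):
        self.batch.append(update)
        if len(self.batch) >= self.batch_size:
            self.shipped.append(list(self.batch))   # send accumulated updates
            self.batch.clear()

proxy = RemoteProxy()
for i in range(4):
    proxy.enqueue({"key": f"k{i}", "chainIndex": chain_index_vector()})
print(proxy.shipped, proxy.batch)     # one shipped batch of 3, one pending
```
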

There are two problems here:


7.3.2 Performing a read operation


Similar to the single-DC case, adjusted for the vector chainIndex instead of a scalar and for the possibility that the key is missing in the local DC (since propagation between DCs is asynchronous); in that case we either wait or redirect the request to another DC.

7.3.3 Conflict Resolution


Thanks to the metadata, causally dependent operations are always applied in the correct order (sometimes the process has to block to guarantee this). Concurrent updates in different DCs, however, can conflict. Such conflicts are resolved with Last Write Wins: every update operation carries a pair consisting of the proxy's clock value and the DC id.
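
Last Write Wins over such pairs amounts to a simple lexicographic comparison (illustrative code with assumed field names):

```python
# Sketch of Last Write Wins for concurrent cross-DC updates (illustrative):
# every update carries (proxy_clock, dc_id); the lexicographically larger
# pair wins, so ties on the clock are broken deterministically by DC id.

def lww_merge(update_a, update_b):
    tag_a = (update_a["proxy_clock"], update_a["dc_id"])
    tag_b = (update_b["proxy_clock"], update_b["dc_id"])
    return update_a if tag_a > tag_b else update_b

a = {"value": "v-from-dc0", "proxy_clock": 17, "dc_id": 0}
b = {"value": "v-from-dc2", "proxy_clock": 17, "dc_id": 2}
print(lww_merge(a, b)["value"])       # v-from-dc2: same clock, higher DC id
```
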

7.3.4 Handling node failures


Similar to the single-DC case.

8. Leveraging Sharding in the Design of Scalable Replication Protocols


The aim of this work is to build a sharded, replicated distributed system that needs no external master process to reconfigure or monitor the cluster.

The authors see the following drawbacks in today's mainstream approaches:

Replication:


Strict consistency:


Detection of node failures:


The approach proposed by the authors, called Elastic replication, avoids these shortcomings and has the following characteristics:


Summary table:


8.1 Organization of replicas


For each shard a sequence of configurations is defined; for example, a new configuration may no longer contain a failed replica.

Each element of the configuration sequence consists of:


Each shard is represented by a set of replicas (of a size fixed by construction), so we do not introduce separate "shard" and "replica" roles.

Each replica stores the following data:


The main task of the orderer replica is to send requests to the other replicas and to maintain the longest prefix of the history:


8.2 Organization of shards


Shards are combined into rings called elastic bands. Each shard belongs to exactly one ring. The predecessor of each shard plays a special role: it is that shard's sequencer. The sequencer's task is to issue a new configuration to its successor when one of its replicas fails.
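
The sequencer role can be sketched like this (my own illustration of the idea, not the paper's code; shard and replica names are invented): shards form a ring, and the predecessor of a shard issues its successor a new configuration without the failed replica.

```python
# Sketch of an "elastic band" (illustrative): shards form a ring, the
# predecessor of each shard is its sequencer, and on a replica failure the
# sequencer issues the successor a new configuration without that replica.

class Shard:
    def __init__(self, name, replicas):
        self.name = name
        self.configs = [list(replicas)]   # sequence of configurations

    @property
    def current(self):
        return self.configs[-1]

class ElasticBand:
    def __init__(self, shards):
        self.shards = shards              # ring order

    def sequencer_of(self, i):
        return self.shards[(i - 1) % len(self.shards)]   # predecessor

    def replica_failed(self, i, replica):
        seq = self.sequencer_of(i)
        new_cfg = [r for r in self.shards[i].current if r != replica]
        # the sequencer issues the successor its next configuration
        self.shards[i].configs.append(new_cfg)
        return seq.name, new_cfg

band = ElasticBand([Shard("s0", ["a0", "b0", "c0"]),
                    Shard("s1", ["a1", "b1", "c1"]),
                    Shard("s2", ["a2", "b2", "c2"])])
print(band.replica_failed(1, "b1"))   # ('s0', ['a1', 'c1'])
```
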


Two conditions are required:


The second condition may seem too strict, but it is equivalent to the "traditional" assumption that the master process never fails.

8.3 Using Chain replication


As you might have guessed, the replicas are organized as a chain (the basic approach) with the orderer as the head, with a few differences:


8.5 Reconfiguration in case of failure



Links


Source: https://habr.com/ru/post/432020/

