
How DataLine's S3 storage is arranged



Hi, Habr!

It is no secret that modern applications work with huge amounts of data, and the flow keeps growing. This data has to be stored and processed, often by a large number of machines, and that is not an easy task. Cloud object storage exists to solve it, and it is usually an implementation of Software Defined Storage (SDS) technology.
In early 2018, we conceived (and launched!) our own 100% S3-compatible storage based on Cloudian HyperStore. As it turned out, there are very few Russian-language publications about Cloudian itself, and even fewer about real-world use of this solution.

Today, drawing on DataLine's experience, I will describe the architecture and internals of the Cloudian software, including the Cloudian SDS implementation, which builds on a number of Apache Cassandra architectural solutions. Separately, we will look at the most interesting part of any SDS storage: the logic that ensures fault tolerance and object distribution.

If you are building your own S3 storage or are busy maintaining one, this article will come in handy.

First of all, I will explain why our choice fell on Cloudian. It's simple: there are very few decent options in this niche. For example, a couple of years ago, when we were only planning the build, there were just three candidates:


For us, as a service provider, the decisive factors were a high degree of API compatibility with the original Amazon S3, built-in billing, scalability with multi-region support, and third-line vendor support. Cloudian has all of that.

And yes, the most important thing (of course!) is that DataLine and Cloudian have similar corporate colors. You must admit, we could not resist such beauty.



Unfortunately, Cloudian is not the most widespread software, and there is practically no information about it on the Russian-speaking web. Today we will correct this injustice: we will discuss the features of the HyperStore architecture, examine its most important components, and go over the main architectural nuances. Let's start with the basics: what does Cloudian have under the hood?

How Cloudian HyperStore storage is organized


Let's take a look at the diagram and see how the solution from Cloudian works.


The main components of the storage.

As we can see, the system consists of several main components:


The main services, S3 Service and HyperStore Service, will be of the greatest interest to us, and we will examine them closely later. But first it makes sense to find out how services are distributed across the cluster and what the fault tolerance and data storage reliability of this solution look like in general.




The common services in the diagram above are the S3, HyperStore, CMC and Apache Cassandra services. At first glance everything looks beautiful and neat. But a closer look reveals that only a single node failure is handled effectively. The simultaneous loss of two nodes can be fatal for cluster availability: Redis QoS (on node 2) has only one slave (on node 3). The picture is the same for the risk of losing cluster management: Puppet Server runs on only two nodes (1 and 2). However, the probability of two nodes failing at once is very low, and it is quite possible to live with this.

Still, to improve storage reliability, at DataLine we use 4 nodes instead of the minimum three. The resulting resource allocation looks like this:



Another nuance immediately stands out: Redis Credentials is not placed on every node (as one might assume from the official diagram above), but only on 3 of them. Meanwhile, Redis Credentials is consulted for every incoming request. So, because the fourth node has to reach out to someone else's Redis, its performance is slightly skewed.

For us this is not yet significant: load testing showed no notable deviations in the node's response time, but on large clusters with dozens of working nodes it makes sense to address this nuance.

Here is the service migration scheme for 6 nodes:



The diagram shows how service migration works in case of node failure. Only the failure of one server per role is covered; if both servers of a role fail, manual intervention is required.

There are some subtleties here as well. Roles are migrated by Puppet, so if you lose it or accidentally break it, automatic failover may not work. For the same reason, you should not manually edit the manifests of the embedded Puppet: it is not entirely safe, since your changes can be silently overwritten, because the manifests are managed through the cluster admin panel.

Data integrity is where things get much more interesting. Object metadata is stored in Apache Cassandra, and each entry is replicated to 3 of the 4 nodes. Replication factor 3 is also used for the data itself, though you can configure a higher one. This keeps the data safe even if 2 nodes out of 4 fail simultaneously. And if you manage to rebalance the cluster in time, you can lose nothing even with a single node remaining; the main thing is to have enough space.
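As a quick sanity check of that claim, here is a tiny sketch (pure combinatorics, nothing Cloudian-specific):

```python
# With 4 nodes and every object written to 3 of them, any simultaneous
# failure of 2 nodes still leaves at least one replica alive.
from itertools import combinations

nodes = {1, 2, 3, 4}
for replica_set in combinations(nodes, 3):   # the 3 nodes holding the copies
    for failed in combinations(nodes, 2):    # the 2 nodes that die
        survivors = set(replica_set) - set(failed)
        assert survivors, "data lost"        # never triggers: 3 copies > 2 failures
print("every 2-node failure leaves at least one replica of every object")
```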



This is what happens when two nodes fail. The diagram clearly shows that even in this situation, the data remains intact.

However, the availability of the data and of the storage itself depends on the consistency strategy. It is configured separately for data and metadata, and for reads and writes.

Valid options are at least one node, a quorum, or all nodes.
This setting determines how many nodes must confirm a read or write for the request to be considered successful. We use quorum as a reasonable compromise between request processing time and the reliability of reads and writes. That is, of the three nodes involved in an operation, a consistent answer from 2 of them is enough for it to complete without errors. Accordingly, to stay afloat when more than one node fails, you would have to switch to the "one node" read/write strategy.
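To make the consistency levels concrete, here is a minimal sketch (an illustration only, not Cloudian code) of how the chosen strategy maps to the number of replica acknowledgements a request needs:

```python
# An illustration of how a consistency setting decides whether an
# operation succeeds, given how many replicas acknowledged it.
def required_acks(consistency: str, replicas: int) -> int:
    """How many replicas must confirm a read or write."""
    if consistency == "one":
        return 1
    if consistency == "quorum":
        return replicas // 2 + 1      # e.g. 2 out of 3
    if consistency == "all":
        return replicas
    raise ValueError(f"unknown consistency level: {consistency}")

def is_successful(consistency: str, replicas: int, acks_received: int) -> bool:
    return acks_received >= required_acks(consistency, replicas)

# With replication factor 3 and QUORUM, 2 consistent answers are enough,
# so losing a single node does not block reads or writes:
print(is_successful("quorum", 3, 2))   # True
# With two of the three replicas down, QUORUM no longer succeeds,
# and you would have to switch to the "one" strategy:
print(is_successful("quorum", 3, 1))   # False
print(is_successful("one", 3, 1))      # True
```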

Processing requests in Cloudian


Below we will look at two schemes for processing incoming requests in Cloudian HyperStore, PUT and GET. This is the main job of S3 Service and HyperStore Service.

Let's start with how a write request is handled:



You have surely noticed that each request triggers a lot of checks and data lookups, at least 6 calls from component to component. This is where write delays and high CPU consumption when working with small files come from.

Large files are transferred in chunks. Individual chunks are not treated as separate requests, and some checks are skipped for them.

The node that received the original request then determines on its own where and what to write, even if nothing is written to it directly. This hides the internal cluster organization from the end client and makes it possible to use external load balancers. All of this has a positive effect on the serviceability and fault tolerance of the storage. A minimal client-side example is shown below.
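This is roughly what the client side of such a PUT looks like when the storage is addressed through a single balanced endpoint; the endpoint URL, bucket name and credentials below are made-up placeholders:

```python
# A hedged client-side sketch: uploading an object to an S3-compatible
# storage through its public endpoint (a load balancer in front of the
# cluster). Endpoint, bucket and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.com",    # balanced entry point of the cluster
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# The client talks only to the endpoint; which nodes actually receive the
# object and its replicas is decided inside the cluster by the node that
# accepted the request.
s3.put_object(Bucket="my-bucket", Key="reports/2018.csv", Body=b"id;size\n1;42\n")

# Reading the object back goes through the same endpoint.
obj = s3.get_object(Bucket="my-bucket", Key="reports/2018.csv")
print(obj["Body"].read())
```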



As you can see, the read logic is not too different from the write logic. It shows the same high sensitivity of performance to the size of the objects being processed. Therefore, thanks to the significant savings on metadata work, retrieving one large object stored as many chunks is much cheaper than retrieving many separate objects of the same total volume.

Data storage and duplication


As can be seen from the diagrams above, Cloudian supports several storage and redundancy schemes:

Replication - with replication, the system maintains a configurable number of copies of each data object and stores each copy on a different node. For example, with 3X replication, 3 copies of each object are created, and each copy "lives" on its own node.

Erasure Coding — with erasure coding, each object is encoded into a configurable number of data fragments (the K number) plus a configurable number of parity fragments (the M number). All K + M fragments of an object are unique, and each fragment is stored on its own node. The object can be decoded from any K fragments; in other words, it remains readable even if M nodes are unavailable.

For example, with erasure coding using the 4 + 2 scheme (4 data fragments plus 2 parity fragments), each object is split into 6 unique fragments stored on six different nodes, and the object can be recovered and read as long as any 4 of the 6 fragments are available.

The advantage of erasure coding over replication is space savings, though at the cost of a significant increase in CPU load, slower responses, and the need for background procedures that check object consistency. In either case, the metadata is stored separately from the data (in Apache Cassandra), which increases the flexibility and reliability of the solution. A rough comparison of the space overhead is sketched below.
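To make the space savings concrete, here is a back-of-the-envelope calculation (my own illustration, not vendor numbers) comparing the raw disk consumed by 3X replication and by 4 + 2 erasure coding:

```python
# Raw disk space consumed by 3X replication versus 4 + 2 erasure coding
# for the same amount of user data.
def raw_space_tb(user_data_tb: float, scheme: str) -> float:
    if scheme == "3x-replication":
        return user_data_tb * 3               # three full copies
    if scheme == "ec-4+2":
        k, m = 4, 2
        return user_data_tb * (k + m) / k     # 1.5x the user data
    raise ValueError(scheme)

for scheme in ("3x-replication", "ec-4+2"):
    print(f"{scheme}: {raw_space_tb(100, scheme):.0f} TB of raw disk for 100 TB of data")

# 3x-replication: 300 TB, ec-4+2: 150 TB. Both survive the loss of any two
# nodes holding an object's copies or fragments, but erasure coding pays
# for the savings with extra CPU and slower responses.
```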

Briefly about other HyperStore features


As I wrote at the beginning of the article, several useful tools are built into HyperStore. Among them:


However, Cloudian HyperStore is still not perfect. For example, for some reason you cannot move an existing account to another group or assign several groups to one account. It is also impossible to generate interim billing reports: all reports become available only after the reporting period closes. So neither our clients nor we can see in real time how much the bill has grown.

Logic of Cloudian HyperStore


Now let's dive deeper and look at the most interesting part of any SDS storage: the logic of distributing objects among the nodes. In Cloudian storage, metadata is kept separately from the data itself: Cassandra is used for metadata, and the proprietary HyperStore solution for the data.

Unfortunately, there is still no official Russian translation of the Cloudian documentation, so below I include my own translation of its most interesting parts.

The role of Apache Cassandra in HyperStore


In HyperStore, Cassandra is used to store object metadata, user account data, and service usage data. In a typical deployment, on each HyperStore node the Cassandra data is stored on the same disk as the OS; the system also supports keeping the Cassandra data on a dedicated disk on each node. Cassandra data is never stored on HyperStore data disks. When vNodes are assigned to a host machine, they are distributed only across the HyperStore data disks; no vNodes are allocated to the disk where the Cassandra data is stored.
Within the cluster, the metadata in Cassandra is replicated in accordance with your storage policy (strategy). Cassandra replication uses vNodes as follows:


How HyperStore works


The placement and replication of S3 objects in a HyperStore cluster are based on a consistent hashing scheme that uses a space of integer tokens in the range from 0 to 2^127 - 1. Integer tokens are assigned to HyperStore nodes. For each S3 object, a hash is computed as it is uploaded into the storage. The object is stored on the node that was assigned the lowest token greater than or equal to the object's hash value. Replication is implemented by also storing the object on the nodes that were assigned the next tokens after that one.

In “classic” storage based on consistent hashing, one token is assigned to one physical node. The Cloudian HyperStore system uses and extends the “virtual node” (vNode) functionality introduced in Cassandra 1.2: each physical host is assigned a large number of tokens (up to 256). In essence, the storage cluster consists of a very large number of “virtual nodes”, with many virtual nodes (tokens) on each physical host.

The HyperStore system assigns a separate set of tokens (virtual nodes) to each disk on each physical host. Each disk on the host is responsible for its own set of replicas of objects. A disk crash affects only those replicas of objects that are on it. Other disks on the host will continue to operate and fulfill their data storage responsibilities.

As an example, consider a cluster of 6 HyperStore hosts, each with 4 S3 storage disks. Suppose that each physical host is assigned 32 tokens, that we have a simplified token space from 0 to 960, and that the values of the 192 tokens in this system (6 hosts with 32 tokens each) are 0, 5, 10, 15, 20, and so on up to 955.

The following diagram shows one possible distribution of tokens across the cluster. Each host's 32 tokens are distributed evenly across its 4 disks (8 tokens per disk), and the tokens themselves are scattered randomly across the cluster.



Now suppose you have configured HyperStore for 3X replication of S3 objects. Say an S3 object is uploaded into the system, and the hashing algorithm applied to its name yields a hash value of 322 (in this simplified hash space). The diagram below shows how the three instances, or replicas, of the object will be stored in the cluster:



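To make the token logic easier to follow, here is a simplified Python model of the example above (toy token space 0..960, 6 hosts with 4 disks and 8 tokens per disk). It is an illustration of the idea, not actual Cloudian code:

```python
# A simplified model of the placement logic: vNode tokens scattered across
# hosts and disks, and replica selection that walks the ring clockwise while
# skipping tokens belonging to an already-used physical host.
import bisect
import random

TOKEN_SPACE = 960
random.seed(42)

# 6 hosts x 4 disks, 8 tokens per disk = 32 tokens per host (192 in total).
tokens = list(range(0, TOKEN_SPACE, 5))          # 0, 5, 10, ..., 955
random.shuffle(tokens)
ring = {}                                        # token -> (host, disk)
it = iter(tokens)
for host in range(1, 7):
    for disk in range(1, 5):
        for _ in range(8):
            ring[next(it)] = (f"hyperstore{host}", f"Disk{disk}")

sorted_tokens = sorted(ring)

def place_object(obj_hash: int, replicas: int = 3):
    """Start at the first token >= hash and take the next tokens,
    never reusing a physical host."""
    placements, used_hosts = [], set()
    start = bisect.bisect_left(sorted_tokens, obj_hash % TOKEN_SPACE)
    for i in range(len(sorted_tokens)):
        token = sorted_tokens[(start + i) % len(sorted_tokens)]
        host, disk = ring[token]
        if host in used_hosts:
            continue                             # skip: host already holds a replica
        placements.append((token, host, disk))
        used_hosts.add(host)
        if len(placements) == replicas:
            break
    return placements

# The object with hash 322 from the example lands on the first token >= 322,
# and its two further replicas on tokens belonging to other hosts:
print(place_object(322))
```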

Let me add a comment: from a practical point of view, this optimization (distributing tokens not only across physical nodes but also across individual disks) is needed not only for availability, but also to spread data evenly between disks. No RAID is used here; the whole logic of placing data on disks is controlled by HyperStore itself. On the one hand, it is convenient and manageable: if you lose a disk, everything rebalances itself. On the other hand, I personally trust good RAID controllers more, since their logic has been refined over many years. But that is just my personal preference; we never caught HyperStore making real mistakes, as long as the vendor's recommendations for installing the software on physical servers are followed. However, an attempt to use virtualization with virtual disks on top of a single LUN on a storage array failed: when the array was overloaded during load testing, HyperStore went haywire and spread the data completely unevenly, filling up one disk while not touching another.

Disk layout inside the cluster


Recall that each host has 32 tokens and that each host's tokens are evenly distributed across its disks. Let's take a closer look at hyperstore2:Disk2 (in the diagram below). We can see that tokens 325, 425, 370 and so on are assigned to this disk.

Since the cluster is configured for 3X replication, hyperstore2:Disk2 will store the following:

For disk token 325:

For disk token 425:

And so on.

As noted earlier, when placing the second and third replicas, HyperStore may in some cases skip tokens so as not to store more than one copy of an object on a single physical node. This rules out using hyperstore2:Disk2 as storage for a second or third replica of an object that is already stored on this host.



If Disk 2 fails, Disks 1, 3 and 4 will continue to serve their data, and the objects from Disk 2 will still be present in the cluster, since they were replicated to other hosts.
Comment: in the end, the distribution of replicas and/or fragments of objects in a HyperStore cluster is based on the Cassandra design, adapted to the needs of file storage. To decide where an object should physically go, a hash is computed from its name and, based on its value, numbered "tokens" are selected for placement. The tokens are randomly distributed across the cluster in advance for load balancing. When choosing tokens for placement, restrictions on putting replicas or parts of the same object on the same physical node are taken into account. Unfortunately, this design has a side effect: if you need to add or remove a node in the cluster, the data has to be reshuffled, and this is a fairly resource-intensive process.

A single storage spanning multiple data centers


Now let's see how HyperStore handles geo-distribution across several data centers and regions. In our case, the multi-data-center mode differs from the multi-region mode in the number of token spaces used. In the first case there is a single token space; in the second, each region has an independent token space with (potentially) its own consistency settings, capacity, and storage configuration.
To understand how this works, let's again turn to the translated documentation, section "Multi-Data Center Deployments".

Consider a HyperStore deployment in two data centers; let's call them DC1 and DC2. Each data center has 3 physical nodes. As in our previous examples, each physical node has four disks, 32 tokens (vNodes) are assigned to each host, and we assume a simplified token space from 0 to 960. In this multi-data-center scenario, the token space is divided among 192 tokens: 32 tokens for each of the 6 physical hosts. The tokens are distributed across the hosts completely at random.

Also assume that S3 object replication in this case is configured for two replicas in each data center.

Let's consider how a hypothetical S3 object with a hash value of 942 will be replicated in 2 data centers:



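Here is a self-contained toy sketch of the data-center-aware variant: the same ring walk as before, but the policy keeps two replicas in each data center. Token values and host names are invented for illustration:

```python
# Data-center-aware placement: every token belongs to a host in DC1 or DC2,
# and the policy keeps two replicas in each data center. Illustration only.
import bisect

TOKEN_SPACE = 960
ring = {                       # token -> (data center, physical host)
    0:   ("DC1", "dc1-node1"), 120: ("DC2", "dc2-node1"),
    240: ("DC1", "dc1-node2"), 360: ("DC2", "dc2-node2"),
    480: ("DC1", "dc1-node3"), 600: ("DC2", "dc2-node3"),
    720: ("DC1", "dc1-node1"), 840: ("DC2", "dc2-node1"),
}
sorted_tokens = sorted(ring)

def place(obj_hash: int, per_dc: int = 2):
    """Walk the ring clockwise, collecting per_dc replicas in each DC
    and never putting two replicas on the same physical host."""
    chosen, used_hosts = {"DC1": [], "DC2": []}, set()
    start = bisect.bisect_left(sorted_tokens, obj_hash % TOKEN_SPACE)
    for i in range(len(sorted_tokens)):
        token = sorted_tokens[(start + i) % len(sorted_tokens)]
        dc, host = ring[token]
        if host in used_hosts or len(chosen[dc]) >= per_dc:
            continue
        chosen[dc].append((token, host))
        used_hosts.add(host)
    return chosen

# The hypothetical object with hash 942 from the documentation example:
print(place(942))
# {'DC1': [(0, 'dc1-node1'), (240, 'dc1-node2')],
#  'DC2': [(120, 'dc2-node1'), (360, 'dc2-node2')]}
```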
Comment: another peculiarity of the scheme above is that for storage policies spanning both data centers to work properly, the amount of space, the number of nodes, and the number of disks per node must be the same in both data centers. As I said above, the multi-region scheme has no such limitation.
This concludes our overview of the architectural features and core capabilities of Cloudian. The topic is far too serious and large to fit a comprehensive manual into a single Habr article, so if you are interested in details I omitted, or have questions or suggestions about how to present the material in future articles, I will be glad to talk with you in the comments.
In the next article, we will look at the implementation of S3 storage at DataLine, talk in detail about the infrastructure and the technologies used for network resiliency, and as a bonus I will tell the story of how it was built!

Source: https://habr.com/ru/post/423853/

