
How DataLine's S3 storage is arranged



Hi, Habr!

It is no secret that modern applications work with huge amounts of data, and the flow keeps growing. This data has to be stored and processed, often by a large number of machines, and that is not an easy task. Cloud object storage exists to solve it, and it is usually an implementation of Software Defined Storage (SDS) technology.
In early 2018, we conceived (and launched!) our own 100% S3-compatible storage based on Cloudian HyperStore. As it turned out, there are very few Russian-language publications about Cloudian itself, and even fewer about real-world use of this solution.

Today, drawing on DataLine's experience, I will describe the architecture and internals of the Cloudian software, including the Cloudian SDS implementation, which builds on a number of Apache Cassandra architectural solutions. Separately, we will look at the most interesting part of any SDS storage: the logic that ensures fault tolerance and object distribution.

If you are building your own S3 storage or are busy maintaining one, this article will come in handy.

First of all, I will explain why our choice fell on Cloudian. It's simple: there are very few decent options in this niche. For example, a couple of years ago, when we were only planning the build, there were just three candidates:


For us, as a service provider, the decisive factors were a high degree of API compatibility with the original Amazon S3, built-in billing, scalability with multi-region support, and third-line vendor support. Cloudian has all of that.

And yes, the most important thing (of course!) is that DataLine and Cloudian have similar corporate colors. You must admit, we could not resist such beauty.



Unfortunately, Cloudian is not the most widespread software, and there is practically no information about it on the Russian-speaking web. Today we will correct this injustice: we will discuss the features of the HyperStore architecture, examine its most important components, and go over the main architectural nuances. Let's start with the basics: what does Cloudian have under the hood?

How Cloudian HyperStore storage is organized


Let's take a look at the diagram and see how the solution from Cloudian works.


The main components of the storage.

As we can see, the system consists of several main components:


The main services, S3 Service and HyperStore Service, will be of the greatest interest to us, and we will examine them closely later. But first it makes sense to find out how services are distributed across the cluster and what the fault tolerance and data storage reliability of this solution look like in general.




The common services in the diagram above are the S3, HyperStore, CMC and Apache Cassandra services. At first glance everything looks beautiful and neat. But a closer look reveals that only a single node failure is handled effectively. The simultaneous loss of two nodes can be fatal for cluster availability: Redis QoS (on node 2) has only one slave (on node 3). The picture is the same for the risk of losing cluster management: Puppet Server runs on only two nodes (1 and 2). However, the probability of two nodes failing at once is very low, and it is quite possible to live with this.

Still, to improve storage reliability, at DataLine we use 4 nodes instead of the minimum three. The resulting resource allocation looks like this:



Another nuance immediately stands out: Redis Credentials is not placed on every node (as one might assume from the official diagram above), but only on 3 of them. Meanwhile, Redis Credentials is consulted for every incoming request. So, because the fourth node has to reach out to someone else's Redis, its performance is slightly skewed.

For us this is not yet significant: load testing showed no notable deviations in the node's response time, but on large clusters with dozens of working nodes it makes sense to address this nuance.

Here is the service migration scheme for 6 nodes:



The diagram shows how service migration works in case of node failure. Only the failure of one server per role is covered; if both servers of a role fail, manual intervention is required.

There are some subtleties here as well. Roles are migrated by Puppet, so if you lose it or accidentally break it, automatic failover may not work. For the same reason, you should not manually edit the manifests of the embedded Puppet: it is not entirely safe, since your changes can be silently overwritten, because the manifests are managed through the cluster admin panel.

Data integrity is where things get much more interesting. Object metadata is stored in Apache Cassandra, and each entry is replicated to 3 of the 4 nodes. Replication factor 3 is also used for the data itself, though you can configure a higher one. This keeps the data safe even if 2 nodes out of 4 fail simultaneously. And if you manage to rebalance the cluster in time, you can lose nothing even with a single node remaining; the main thing is to have enough space.
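As a quick sanity check of that claim, here is a tiny sketch (pure combinatorics, nothing Cloudian-specific):

```python
# With 4 nodes and every object written to 3 of them, any simultaneous
# failure of 2 nodes still leaves at least one replica alive.
from itertools import combinations

nodes = {1, 2, 3, 4}
for replica_set in combinations(nodes, 3):   # the 3 nodes holding the copies
    for failed in combinations(nodes, 2):    # the 2 nodes that die
        survivors = set(replica_set) - set(failed)
        assert survivors, "data lost"        # never triggers: 3 copies > 2 failures
print("every 2-node failure leaves at least one replica of every object")
```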



This is what happens when two nodes fail. The diagram clearly shows that even in this situation, the data remains intact.

However, the availability of the data and of the storage itself depends on the consistency strategy. It is configured separately for data and metadata, and for reads and writes.

Valid options are at least one node, a quorum, or all nodes.
This setting determines how many nodes must confirm a read or write for the request to be considered successful. We use quorum as a reasonable compromise between request processing time and the reliability of reads and writes. That is, of the three nodes involved in an operation, a consistent answer from 2 of them is enough for it to complete without errors. Accordingly, to stay afloat when more than one node fails, you would have to switch to the "one node" read/write strategy.
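To make the consistency levels concrete, here is a minimal sketch (an illustration only, not Cloudian code) of how the chosen strategy maps to the number of replica acknowledgements a request needs:

```python
# An illustration of how a consistency setting decides whether an
# operation succeeds, given how many replicas acknowledged it.
def required_acks(consistency: str, replicas: int) -> int:
    """How many replicas must confirm a read or write."""
    if consistency == "one":
        return 1
    if consistency == "quorum":
        return replicas // 2 + 1      # e.g. 2 out of 3
    if consistency == "all":
        return replicas
    raise ValueError(f"unknown consistency level: {consistency}")

def is_successful(consistency: str, replicas: int, acks_received: int) -> bool:
    return acks_received >= required_acks(consistency, replicas)

# With replication factor 3 and QUORUM, 2 consistent answers are enough,
# so losing a single node does not block reads or writes:
print(is_successful("quorum", 3, 2))   # True
# With two of the three replicas down, QUORUM no longer succeeds,
# and you would have to switch to the "one" strategy:
print(is_successful("quorum", 3, 1))   # False
print(is_successful("one", 3, 1))      # True
```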

Processing requests in Cloudian


Below we will look at two schemes for processing incoming requests in Cloudian HyperStore, PUT and GET. This is the main job of S3 Service and HyperStore Service.

Let's start with how a write request is handled:



You have surely noticed that each request triggers a lot of checks and data lookups, at least 6 calls from component to component. This is where write delays and high CPU consumption when working with small files come from.

Large files are transferred in chunks. Individual chunks are not treated as separate requests, and some checks are skipped for them.

The node that received the original request then determines on its own where and what to write, even if nothing is written to it directly. This hides the internal cluster organization from the end client and makes it possible to use external load balancers. All of this has a positive effect on the serviceability and fault tolerance of the storage. A minimal client-side example is shown below.
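This is roughly what the client side of such a PUT looks like when the storage is addressed through a single balanced endpoint; the endpoint URL, bucket name and credentials below are made-up placeholders:

```python
# A hedged client-side sketch: uploading an object to an S3-compatible
# storage through its public endpoint (a load balancer in front of the
# cluster). Endpoint, bucket and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.com",    # balanced entry point of the cluster
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# The client talks only to the endpoint; which nodes actually receive the
# object and its replicas is decided inside the cluster by the node that
# accepted the request.
s3.put_object(Bucket="my-bucket", Key="reports/2018.csv", Body=b"id;size\n1;42\n")

# Reading the object back goes through the same endpoint.
obj = s3.get_object(Bucket="my-bucket", Key="reports/2018.csv")
print(obj["Body"].read())
```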



As you can see, the read logic is not too different from the write logic. It shows the same high sensitivity of performance to the size of the objects being processed. Therefore, thanks to the significant savings on metadata work, retrieving one large object stored as many chunks is much cheaper than retrieving many separate objects of the same total volume.

Data storage and duplication


As can be seen from the diagrams above, Cloudian supports several storage and redundancy schemes:

Replication - with replication, the system maintains a configurable number of copies of each data object and stores each copy on a different node. For example, with 3X replication, 3 copies of each object are created, and each copy "lives" on its own node.

Erasure Coding — with erasure coding, each object is encoded into a configurable number of data fragments (the K number) plus a configurable number of parity fragments (the M number). All K + M fragments of an object are unique, and each fragment is stored on its own node. The object can be decoded from any K fragments; in other words, it remains readable even if M nodes are unavailable.

For example, with erasure coding using the 4 + 2 scheme (4 data fragments plus 2 parity fragments), each object is split into 6 unique fragments stored on six different nodes, and the object can be recovered and read as long as any 4 of the 6 fragments are available.

The advantage of erasure coding over replication is space savings, though at the cost of a significant increase in CPU load, slower responses, and the need for background procedures that check object consistency. In either case, the metadata is stored separately from the data (in Apache Cassandra), which increases the flexibility and reliability of the solution. A rough comparison of the space overhead is sketched below.
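To make the space savings concrete, here is a back-of-the-envelope calculation (my own illustration, not vendor numbers) comparing the raw disk consumed by 3X replication and by 4 + 2 erasure coding:

```python
# Raw disk space consumed by 3X replication versus 4 + 2 erasure coding
# for the same amount of user data.
def raw_space_tb(user_data_tb: float, scheme: str) -> float:
    if scheme == "3x-replication":
        return user_data_tb * 3               # three full copies
    if scheme == "ec-4+2":
        k, m = 4, 2
        return user_data_tb * (k + m) / k     # 1.5x the user data
    raise ValueError(scheme)

for scheme in ("3x-replication", "ec-4+2"):
    print(f"{scheme}: {raw_space_tb(100, scheme):.0f} TB of raw disk for 100 TB of data")

# 3x-replication: 300 TB, ec-4+2: 150 TB. Both survive the loss of any two
# nodes holding an object's copies or fragments, but erasure coding pays
# for the savings with extra CPU and slower responses.
```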

Briefly about other HyperStore features


As I wrote at the beginning of the article, several useful tools are built into HyperStore. Among them:


However, Cloudian HyperStore is still not perfect. For example, for some reason you cannot move an existing account to another group or assign several groups to one account. It is also impossible to generate interim billing reports: all reports become available only after the reporting period closes. So neither our clients nor we can see in real time how much the bill has grown.

Logic of Cloudian HyperStore


Now let's dive deeper and look at the most interesting part of any SDS storage: the logic of distributing objects among the nodes. In Cloudian storage, metadata is kept separately from the data itself: Cassandra is used for metadata, and the proprietary HyperStore solution for the data.

Unfortunately, there is still no official Russian translation of the Cloudian documentation, so below I include my own translation of its most interesting parts.

The role of Apache Cassandra in HyperStore


In HyperStore, Cassandra is used to store object metadata, user account data, and service usage data. In a typical deployment, on each HyperStore node the Cassandra data is stored on the same disk as the OS; the system also supports keeping the Cassandra data on a dedicated disk on each node. Cassandra data is never stored on HyperStore data disks. When vNodes are assigned to a host machine, they are distributed only across the HyperStore data disks; no vNodes are allocated to the disk where the Cassandra data is stored.
Within the cluster, the metadata in Cassandra is replicated in accordance with your storage policy (strategy). Cassandra replication uses vNodes as follows:


How HyperStore works


The placement and replication of S3 objects in a HyperStore cluster are based on a consistent hashing scheme that uses a space of integer tokens in the range from 0 to 2^127 - 1. Integer tokens are assigned to HyperStore nodes. For each S3 object, a hash is computed as it is uploaded into the storage. The object is stored on the node that was assigned the lowest token greater than or equal to the object's hash value. Replication is implemented by also storing the object on the nodes that were assigned the next tokens after that one.

In “classic” storage based on consistent hashing, one token is assigned to one physical node. The Cloudian HyperStore system uses and extends the “virtual node” (vNode) functionality introduced in Cassandra 1.2: each physical host is assigned a large number of tokens (up to 256). In essence, the storage cluster consists of a very large number of “virtual nodes”, with many virtual nodes (tokens) on each physical host.

The HyperStore system assigns a separate set of tokens (virtual nodes) to each disk on each physical host. Each disk on the host is responsible for its own set of replicas of objects. A disk crash affects only those replicas of objects that are on it. Other disks on the host will continue to operate and fulfill their data storage responsibilities.

As an example, consider a cluster of 6 HyperStore hosts, each with 4 S3 storage disks. Suppose that each physical host is assigned 32 tokens, that we have a simplified token space from 0 to 960, and that the values of the 192 tokens in this system (6 hosts with 32 tokens each) are 0, 5, 10, 15, 20, and so on up to 955.

The following diagram shows one possible distribution of tokens across the cluster. Each host's 32 tokens are distributed evenly across its 4 disks (8 tokens per disk), and the tokens themselves are scattered randomly across the cluster.



Now suppose you have configured HyperStore for 3X replication of S3 objects. Say an S3 object is uploaded into the system, and the hashing algorithm applied to its name yields a hash value of 322 (in this simplified hash space). The diagram below shows how the three instances, or replicas, of the object will be stored in the cluster:



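To make the token logic easier to follow, here is a simplified Python model of the example above (toy token space 0..960, 6 hosts with 4 disks and 8 tokens per disk). It is an illustration of the idea, not actual Cloudian code:

```python
# A simplified model of the placement logic: vNode tokens scattered across
# hosts and disks, and replica selection that walks the ring clockwise while
# skipping tokens belonging to an already-used physical host.
import bisect
import random

TOKEN_SPACE = 960
random.seed(42)

# 6 hosts x 4 disks, 8 tokens per disk = 32 tokens per host (192 in total).
tokens = list(range(0, TOKEN_SPACE, 5))          # 0, 5, 10, ..., 955
random.shuffle(tokens)
ring = {}                                        # token -> (host, disk)
it = iter(tokens)
for host in range(1, 7):
    for disk in range(1, 5):
        for _ in range(8):
            ring[next(it)] = (f"hyperstore{host}", f"Disk{disk}")

sorted_tokens = sorted(ring)

def place_object(obj_hash: int, replicas: int = 3):
    """Start at the first token >= hash and take the next tokens,
    never reusing a physical host."""
    placements, used_hosts = [], set()
    start = bisect.bisect_left(sorted_tokens, obj_hash % TOKEN_SPACE)
    for i in range(len(sorted_tokens)):
        token = sorted_tokens[(start + i) % len(sorted_tokens)]
        host, disk = ring[token]
        if host in used_hosts:
            continue                             # skip: host already holds a replica
        placements.append((token, host, disk))
        used_hosts.add(host)
        if len(placements) == replicas:
            break
    return placements

# The object with hash 322 from the example lands on the first token >= 322,
# and its two further replicas on tokens belonging to other hosts:
print(place_object(322))
```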

Let me add a comment: from a practical point of view, this optimization (distributing tokens not only across physical nodes but also across individual disks) is needed not only for availability, but also to spread data evenly between disks. No RAID is used here; the whole logic of placing data on disks is controlled by HyperStore itself. On the one hand, it is convenient and manageable: if you lose a disk, everything rebalances itself. On the other hand, I personally trust good RAID controllers more, since their logic has been refined over many years. But that is just my personal preference; we never caught HyperStore making real mistakes, as long as the vendor's recommendations for installing the software on physical servers are followed. However, an attempt to use virtualization with virtual disks on top of a single LUN on a storage array failed: when the array was overloaded during load testing, HyperStore went haywire and spread the data completely unevenly, filling up one disk while not touching another.

Disk layout inside the cluster


Recall that each host has 32 tokens and that each host's tokens are evenly distributed across its disks. Let's take a closer look at hyperstore2:Disk2 (in the diagram below). We can see that tokens 325, 425, 370 and so on are assigned to this disk.

Since the cluster is configured for 3X replication, hyperstore2:Disk2 will store the following:

For disk token 325:

For disk token 425:

And so on.

As noted earlier, when placing the second and third replicas, HyperStore may in some cases skip tokens so as not to store more than one copy of an object on a single physical node. This rules out using hyperstore2:Disk2 as storage for a second or third replica of an object that is already stored on this host.



If Disk 2 fails, Disks 1, 3 and 4 will continue to serve their data, and the objects from Disk 2 will still be present in the cluster, since they were replicated to other hosts.
Comment: in the end, the distribution of replicas and/or fragments of objects in a HyperStore cluster is based on the Cassandra design, adapted to the needs of file storage. To decide where an object should physically go, a hash is computed from its name and, based on its value, numbered "tokens" are selected for placement. The tokens are randomly distributed across the cluster in advance for load balancing. When choosing tokens for placement, restrictions on putting replicas or parts of the same object on the same physical node are taken into account. Unfortunately, this design has a side effect: if you need to add or remove a node in the cluster, the data has to be reshuffled, and this is a fairly resource-intensive process.

A single storage spanning multiple data centers


Now let's see how HyperStore handles geo-distribution across several data centers and regions. In our case, the multi-data-center mode differs from the multi-region mode in the number of token spaces used. In the first case there is a single token space; in the second, each region has an independent token space with (potentially) its own consistency settings, capacity, and storage configuration.
To understand how this works, let's again turn to the translated documentation, section "Multi-Data Center Deployments".

Consider a HyperStore deployment in two data centers; let's call them DC1 and DC2. Each data center has 3 physical nodes. As in our previous examples, each physical node has four disks, 32 tokens (vNodes) are assigned to each host, and we assume a simplified token space from 0 to 960. In this multi-data-center scenario, the token space is divided among 192 tokens: 32 tokens for each of the 6 physical hosts. The tokens are distributed across the hosts completely at random.

Also assume that S3 object replication in this case is configured for two replicas in each data center.

Let's consider how a hypothetical S3 object with a hash value of 942 will be replicated in 2 data centers:



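Here is a self-contained toy sketch of the data-center-aware variant: the same ring walk as before, but the policy keeps two replicas in each data center. Token values and host names are invented for illustration:

```python
# Data-center-aware placement: every token belongs to a host in DC1 or DC2,
# and the policy keeps two replicas in each data center. Illustration only.
import bisect

TOKEN_SPACE = 960
ring = {                       # token -> (data center, physical host)
    0:   ("DC1", "dc1-node1"), 120: ("DC2", "dc2-node1"),
    240: ("DC1", "dc1-node2"), 360: ("DC2", "dc2-node2"),
    480: ("DC1", "dc1-node3"), 600: ("DC2", "dc2-node3"),
    720: ("DC1", "dc1-node1"), 840: ("DC2", "dc2-node1"),
}
sorted_tokens = sorted(ring)

def place(obj_hash: int, per_dc: int = 2):
    """Walk the ring clockwise, collecting per_dc replicas in each DC
    and never putting two replicas on the same physical host."""
    chosen, used_hosts = {"DC1": [], "DC2": []}, set()
    start = bisect.bisect_left(sorted_tokens, obj_hash % TOKEN_SPACE)
    for i in range(len(sorted_tokens)):
        token = sorted_tokens[(start + i) % len(sorted_tokens)]
        dc, host = ring[token]
        if host in used_hosts or len(chosen[dc]) >= per_dc:
            continue
        chosen[dc].append((token, host))
        used_hosts.add(host)
    return chosen

# The hypothetical object with hash 942 from the documentation example:
print(place(942))
# {'DC1': [(0, 'dc1-node1'), (240, 'dc1-node2')],
#  'DC2': [(120, 'dc2-node1'), (360, 'dc2-node2')]}
```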
Comment: another peculiarity of the scheme above is that for storage policies spanning both data centers to work properly, the amount of space, the number of nodes, and the number of disks per node must be the same in both data centers. As I said above, the multi-region scheme has no such limitation.
This concludes our overview of the architectural features and core capabilities of Cloudian. The topic is far too serious and large to fit a comprehensive manual into a single Habr article, so if you are interested in details I omitted, or have questions or suggestions about how to present the material in future articles, I will be glad to talk with you in the comments.
In the next article, we will look at the implementation of S3 storage at DataLine, talk in detail about the infrastructure and the technologies used for network resiliency, and as a bonus I will tell the story of how it was built!

Source: https://habr.com/ru/post/423853/

