
More and more Russian Internet projects want a server in the cloud and are looking at Amazon EC2 and its analogues. We see this flight of large customers to the West as a challenge for Runet hosters: they, too, would like to have "their own Amazon", preferans and poetesses included. To meet hosters' demand for distributed storage that can be deployed on relatively modest hardware, we built Parallels Cloud Server (PCS).
In this post I will talk about the architecture of the storage part, one of the main highlights of PCS. It lets you build, on commodity hardware, a storage system comparable in speed and fault tolerance to expensive SAN storage. A second post (already in the works) will be of interest to developers: it will cover the things we learned while building and testing the system.
Intro
The vast majority of Habr readers know Parallels as the developer of a solution for running Windows on a Mac. Another part of the audience knows our container virtualization products: Parallels Virtuozzo Containers and Parallels Server Bare Metal, as well as the open-source OpenVZ. The ideas behind containers are used, to a greater or lesser degree, by cloud providers of the scale of Yandex and Google. But in general, Parallels' specialty is software for small and/or fast-growing service providers. Containers let hosters run multiple copies of operating systems on a single server, which gives 3-4 times higher density of virtual environments than hypervisors. On top of that, container virtualization allows migration of virtual environments between physical machines, and this happens seamlessly for the hoster's or cloud provider's client.
As a rule, service providers store client data on the local disks of their existing servers. Local storage has two problems. First, data migration is complicated by the size of virtual machine images: the files can be very large. Second, high availability of virtual machines is hard to achieve if a server goes down, say, after a power failure.
Both problems can be solved, but the solution is quite expensive. There are storage systems, SAN/NAS storage: big boxes, whole racks, and in the case of a large provider entire data centers holding many, many petabytes. Servers connect to them over some protocol and put data there or take it back. SAN storage provides redundant data storage and fault tolerance; it has its own monitoring of disk health, as well as self-healing and correction of detected errors. All of this is great, with one exception: even the simplest SAN storage costs no less than $100 thousand, which for our partners, small hosters, is a lot of money. At the same time they want something similar, and they keep telling us so. The task formulated itself: offer them a solution with similar functionality at a noticeably lower cost of purchase and ownership.
PCS architecture: the first strokes
Our starting point was that our take on distributed storage should be as simple as possible, and it followed from an equally simple task: the system has to run virtual machines and containers.
Virtual machines and containers place certain requirements on the consistency of stored data: the result of operations on the data must correspond to the order in which those operations were issued.
Recall how any journaling file system works. It has a journal, metadata, and the data itself. The file system first writes the data to the journal, waits for that write to complete, and only then writes the data to its final place on disk. The freed space in the journal is then reused for new entries, which in turn are flushed to their final locations after a while. If the power goes out or the system crashes, everything recorded in the journal up to the last committed point is preserved; only data that never made it into the journal can be lost, and replaying the journal brings the file system back to a consistent state.
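To make the ordering concrete, here is a minimal sketch of the journaling idea in Python: not file-system code, just an illustration of the "journal first, data second" rule and of replay after a crash. The file names and the JSON record format are made up for the example.

```python
# Minimal journaling sketch (illustration only, not a real file system).
import json
import os

def _load(data_file):
    if not os.path.exists(data_file):
        return {}
    with open(data_file) as d:
        return json.load(d)

def journaled_write(journal, data_file, key, value):
    # 1. Force the record into the journal first...
    with open(journal, "a") as j:
        j.write(json.dumps([key, value]) + "\n")
        j.flush()
        os.fsync(j.fileno())
    # 2. ...and only then update the data in its final place.
    data = _load(data_file)
    data[key] = value
    with open(data_file, "w") as d:
        json.dump(data, d)
        d.flush()
        os.fsync(d.fileno())

def recover(journal, data_file):
    """After a crash, replay the journal so no journaled write is lost."""
    data = _load(data_file)
    if os.path.exists(journal):
        with open(journal) as j:
            for line in j:
                key, value = json.loads(line)
                data[key] = value
    return data
```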
There are many kinds of consistency. The ordering that all file systems expect to see is called immediate, or strict, consistency. It has two distinguishing features. First, reads always return the values just written: stale data is never visible. Second, data becomes visible in the same order in which it was written.
Surprisingly, cloud storage solutions do not provide these properties. In particular, Amazon S3 object storage, used by various web-oriented services, gives completely different guarantees: so-called eventual consistency, where written data is guaranteed to become visible not immediately but only after some time. For virtual machines and file systems this is unacceptable.
Additionally, we wanted our storage system to have a few more notable properties:
- It should be able to grow as needed: storage capacity should increase dynamically as new disks are added.
- It should be able to allocate more space than fits on any single disk.
- It should be able to spread an allocated volume across several disks, because 100 TB cannot be put on one disk.
Storing data on many disks has a pitfall. As soon as we start spreading the data across multiple disks, the probability of data loss rises sharply. With one server and one disk, the chance that the disk fails is not that high. But with 100 disks in the storage, the chance that at least one of them burns out grows considerably; the more disks, the higher the risk. So the data must be stored redundantly, in several copies.
Distributing data over many disks also has its advantages. For example, a failed disk can be rebuilt in parallel from many live disks at once. In a cloud storage such an operation usually takes only a few minutes, in contrast to traditional RAID arrays, where a rebuild takes hours or even days. In general, the probability of data loss grows roughly as the square of the time needed to recover it, so the faster we recover the data, the lower the chance of losing it.
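A back-of-the-envelope model of why recovery speed matters so much (my own simplification, with illustrative numbers rather than PCS figures): with three replicas, a chunk is lost only if the two remaining copies also fail inside the recovery window, so the window length enters the estimate squared.

```python
# Back-of-the-envelope estimate (simplified model, illustrative numbers only):
# with 3 replicas, a chunk is lost only if the two remaining replicas also
# fail before the first failure is repaired.
afr = 0.04                           # assumed annual failure rate per disk
hours_per_year = 24 * 365
p_fail_per_hour = afr / hours_per_year

def p_loss(recovery_hours):
    # Probability that two specific other disks both fail inside the window.
    p_window = p_fail_per_hour * recovery_hours
    return p_window ** 2

for window in (24.0, 0.25):          # day-long RAID rebuild vs. ~15-minute rebuild
    print(f"recovery window {window:>5} h -> relative loss risk {p_loss(window):.2e}")
```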
So the decision was made: split the whole data set into chunks of a fixed size (64-128 MB in our case), replicate each chunk a configured number of times, and distribute the replicas across the cluster.
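As a toy illustration of this decision, here is a sketch of chunking and replica placement. The chunk size and replica count follow the text; the hash-based placement and all names are invented for the example and are not the PCS allocation algorithm.

```python
# Sketch: split an image into fixed-size chunks and pick replica locations.
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024        # 64 MB chunks
REPLICAS = 3

def chunk_layout(image_size, servers):
    """Return a toy placement plan: chunk index -> list of servers."""
    n_chunks = (image_size + CHUNK_SIZE - 1) // CHUNK_SIZE
    layout = {}
    for i in range(n_chunks):
        # Toy placement: hash the chunk id and take the next REPLICAS servers.
        start = int(hashlib.md5(f"chunk-{i}".encode()).hexdigest(), 16) % len(servers)
        layout[i] = [servers[(start + r) % len(servers)] for r in range(REPLICAS)]
    return layout

plan = chunk_layout(image_size=40 * 1024**3, servers=[f"cs{n}" for n in range(1, 11)])
print(len(plan), "chunks;", "chunk 0 ->", plan[0])
```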
Then we decided to simplify everything as much as possible. First of all, it was clear that we did not need a full POSIX-compatible file system. The system only has to be optimized for large objects, virtual machine images tens of gigabytes in size. Since the images themselves are rarely created, deleted, or renamed, metadata changes are rare and do not need to be optimized, which matters a lot. Inside each object there is a file system used by the container or virtual machine; it is the one optimized for metadata changes, standard POSIX semantics, and so on.
Attack of the Clones
You can often hear that providers only ever lose individual cluster nodes. Alas, it also happens that a whole data center loses power and the entire cluster goes down. From the start we decided that we had to account for such failures. Let us step back a bit and consider why strict consistency is hard to ensure in a distributed data store, and why the major providers (the same Amazon S3) started with eventually consistent storage.
The problem is simple. Suppose three servers each store a copy of an object. At some point the object is modified, and the change reaches two of the three servers; the third was unavailable for some reason. Then our rack or data center loses power, and when it comes back the servers start booting. It may happen that the server that missed the change boots first. If no special measures are taken, a client can end up reading an old, stale copy of the data.
If a large object (a virtual machine image in our case) is split into several fragments, things get even more complicated. Suppose the distributed storage splits an image into two fragments, so with three replicas each, six servers are involved. If not all servers manage to apply a change to these fragments, then after a power failure and reboot we can observe four combinations of fragment versions, two of which never existed at all. A file system is simply not designed for that.
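The "four combinations" claim is easy to see with a tiny enumeration (pure illustration):

```python
# Why partial writes are dangerous: with two fragments that each may come back
# old or new after a crash, a reader can observe four combinations, two of
# which never existed as a whole image.
from itertools import product

fragments = ["A", "B"]
states = ["old", "new"]
for combo in product(states, repeat=len(fragments)):
    consistent = len(set(combo)) == 1        # only all-old or all-new ever existed
    print(dict(zip(fragments, combo)), "consistent" if consistent else "NEVER EXISTED")
```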
It follows that objects need versions of some kind. This can be done in several ways. The first is to use a transactional file system (for example, BTRFS) and update the version along with the data. This is correct, but on traditional (rotating) hard drives it is still slow: performance drops several times over. The second option is to use a consensus algorithm (for example, Paxos) so that the servers performing the modification agree among themselves. This is also slow. The servers responsible for the data cannot track the versions of object fragments on their own, because they do not know whether someone else has changed the data. So we concluded that versions should be tracked somewhere on the side.
Version tracking is the job of a metadata server (MDS). Moreover, the version does not have to be bumped on every successful write, only when one of the servers fails to record a change and we have to exclude it. In normal operation, data is therefore written at full speed.
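Here is a toy sketch of that versioning rule, not the real MDS protocol: the version is bumped only when a replica has to be excluded, so in the common case a write costs no metadata update at all. All class and server names are made up.

```python
# Toy version tracking (illustration only, not the PCS MDS protocol).
class ChunkRecord:
    def __init__(self, replicas):
        self.version = 1
        self.replicas = set(replicas)   # servers believed to hold the current version

class ToyMDS:
    def __init__(self):
        self.chunks = {}

    def register(self, chunk_id, replicas):
        self.chunks[chunk_id] = ChunkRecord(replicas)

    def report_write(self, chunk_id, acked, failed):
        """Called after a write; bump the version only if a replica was lost."""
        rec = self.chunks[chunk_id]
        if failed:
            rec.replicas = set(acked)   # drop lagging replicas: their copies are stale
            rec.version += 1            # stale copies can now be recognized and rejected
        return rec.version

mds = ToyMDS()
mds.register("img1/0", ["cs1", "cs2", "cs3"])
mds.report_write("img1/0", acked=["cs1", "cs2"], failed=["cs3"])   # cs3 was unreachable
rec = mds.chunks["img1/0"]
print(rec.version, sorted(rec.replicas))       # -> 2 ['cs1', 'cs2']
```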
All these conclusions add up to the architecture of the solution. It consists of three components:
- Clients, which talk to the storage over ordinary Ethernet.
- The MDS, which keeps track of where files live and which versions of their fragments are current.
- The fragments themselves, spread across the local hard drives of the servers.
Obviously, the MDS is the bottleneck of the whole architecture: it knows everything about the system. So it has to be made highly available, so that the system keeps working if the machine hosting it fails. There was a temptation to put the MDS into MySQL or another popular database. But that is not our way: SQL servers are rather limited in requests per second, which immediately puts an end to the scalability of the system, and building a reliable cluster of SQL servers is harder still. We found a hint in the paper on GoogleFS. In its original form GoogleFS did not fit our task: it does not provide the strict consistency we need, because it was designed for appending data for search indexing, not for modifying it. The solution was this: the MDS keeps the complete state of the objects and their versions entirely in memory. That turns out to be not that much: each fragment is described by only about 128 bytes, so the state of a storage of several petabytes fits into the memory of a modern server, which is acceptable. State changes are written by the MDS to a metadata journal, which grows, although relatively slowly.
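A quick sanity check of the "fits in memory" claim, using the figures from the text (a 128-byte record per fragment, 64 MB fragments as the pessimistic case):

```python
# Rough check of the "fits in memory" claim using the figures from the text:
# one 128-byte record per fragment, fragments of 64-128 MB.
PB = 1024**5
chunk_size = 64 * 1024**2          # pessimistic: smaller chunks mean more records
record_size = 128                  # bytes of MDS state per fragment

for capacity_pb in (1, 4, 8):
    chunks = capacity_pb * PB // chunk_size
    ram = chunks * record_size
    print(f"{capacity_pb} PB -> {chunks:,} fragments -> {ram / 1024**3:.0f} GiB of MDS state")
```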
Our task is to arrange for several MDS instances to agree among themselves, so that the service remains available even if one of them goes down. There is also a problem: the journal cannot grow forever, something has to be done with it. So a new journal is created and all new changes go there, while in parallel a snapshot of the in-memory state is written out asynchronously. Once the snapshot is complete, the old journal can be deleted. To recover after a crash, you load the snapshot and replay the journal on top of it. When the new journal grows large again, the procedure is repeated.
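Here is a compact sketch of that rotation cycle, again an illustration rather than the actual MDS code: rotate the journal, dump a snapshot of the in-memory state, drop the old journal, and on recovery load the snapshot and replay whatever journal remains. The file names and the JSON format are assumptions made for the example.

```python
# Sketch of the journal-rotation cycle (illustration only, not the real MDS).
import json
import os

class MetadataLog:
    def __init__(self, dirpath):
        self.dir = dirpath
        self.state = {}
        os.makedirs(dirpath, exist_ok=True)
        self._recover()

    def _path(self, name):
        return os.path.join(self.dir, name)

    def apply(self, key, value):
        with open(self._path("journal"), "a") as j:
            j.write(json.dumps([key, value]) + "\n")
            j.flush()
            os.fsync(j.fileno())
        self.state[key] = value

    def compact(self):
        # New changes go to a fresh journal while the snapshot is being written.
        os.replace(self._path("journal"), self._path("journal.old"))
        with open(self._path("snapshot.tmp"), "w") as s:
            json.dump(self.state, s)
            s.flush()
            os.fsync(s.fileno())
        os.replace(self._path("snapshot.tmp"), self._path("snapshot"))
        os.remove(self._path("journal.old"))        # old journal is now redundant

    def _recover(self):
        if os.path.exists(self._path("snapshot")):
            with open(self._path("snapshot")) as s:
                self.state = json.load(s)
        for name in ("journal.old", "journal"):     # replay anything newer
            if os.path.exists(self._path(name)):
                with open(self._path(name)) as j:
                    for line in j:
                        key, value = json.loads(line)
                        self.state[key] = value

log = MetadataLog("/tmp/mds-demo")
log.apply("img1/0", {"version": 2, "replicas": ["cs1", "cs2"]})
log.compact()
```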
More tricks
Here are a couple of tricks that we implemented in the process of creating our distributed cloud storage.
How do we achieve high write speed for fragments? Distributed storages typically keep three copies of the data. Written naively, the client would have to send the new data to all three servers itself. That would be slow, namely three times slower than the network bandwidth: with Gigabit Ethernet we would push only 30-40 MB per second to each copy, hardly an outstanding result, since even an HDD writes noticeably faster. To use the hardware and the network more efficiently, we applied chain replication. The client sends the data to the first server. Having received the first piece (64 KB), that server writes it to disk and immediately forwards it along the chain to the next server. As a result, a large request starts being written as early as possible and is streamed asynchronously to the other participants, so the hardware runs at about 80% of its peak even with three copies of the data. It all works because Ethernet is full duplex: it can receive and send at the same time, so in reality it moves two gigabits per second rather than one, and the server that receives data from the client forwards it down the chain at the same rate.
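A toy model of the chain (illustration only: in the real system each hop writes to disk and forwards concurrently, while this single-threaded sketch only shows the data path and the 64 KB piece size; all names are made up):

```python
# Chain replication sketch: the client streams the write in 64 KB pieces to the
# first replica, which stores each piece and immediately passes it on, instead
# of the client pushing the whole payload to three servers itself.
PIECE = 64 * 1024

class ChainNode:
    def __init__(self, name, next_node=None):
        self.name = name
        self.next = next_node
        self.stored = bytearray()

    def receive(self, piece):
        self.stored += piece            # stands in for the local disk write
        if self.next:                   # ...and the piece goes down the chain right away
            self.next.receive(piece)

def chain_write(data, head):
    for offset in range(0, len(data), PIECE):
        head.receive(data[offset:offset + PIECE])

# Three replicas chained: client -> cs1 -> cs2 -> cs3
cs3 = ChainNode("cs3")
cs2 = ChainNode("cs2", cs3)
cs1 = ChainNode("cs1", cs2)
chain_write(b"x" * (1024 * 1024), cs1)
assert cs1.stored == cs2.stored == cs3.stored
```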
SSD caching. SSDs speed up any storage. The idea is not new, although it must be said there are no really good open-source implementations, while SAN storage has used such caching for a long time. The trick relies on SSDs delivering random-access performance orders of magnitude higher than HDDs: start caching data on SSD and you get roughly ten times the speed. Additionally, we compute checksums of all the data and also store them on the SSDs. By periodically reading all the data back, we verify that it is still there and still matches its checksums. The latter is sometimes called scrubbing (think of a skin scrub removing dead particles); it increases the reliability of the system and lets us detect errors before the data is actually needed.
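A minimal sketch of the scrubbing idea: keep a checksum per chunk and periodically re-read the data to catch silent corruption early. In PCS the checksums live on SSD; in this toy they are just a dictionary, and the chunk contents and hash choice are assumptions for the example.

```python
# Scrubbing sketch (illustration only): verify stored data against checksums.
import hashlib

def checksum(data):
    return hashlib.sha256(data).hexdigest()

def scrub(chunks, checksums):
    """Return the ids of chunks whose contents no longer match their checksum."""
    damaged = []
    for chunk_id, data in chunks.items():
        if checksum(data) != checksums[chunk_id]:
            damaged.append(chunk_id)      # would be re-replicated from a good copy
    return damaged

chunks = {"c0": b"hello", "c1": b"world"}
checksums = {cid: checksum(data) for cid, data in chunks.items()}
chunks["c1"] = b"w0rld"                   # simulate silent corruption
print(scrub(chunks, checksums))           # -> ['c1']
```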
There is another reason why SSD caching matters for virtual machines. With object storage such as Amazon S3 and its analogues, access latency is not very important: objects are usually fetched over the Internet, where an extra 10 ms is simply not noticeable. With a virtual machine on a hosting provider's server it is different: when the OS performs a series of consecutive synchronous requests, the latency accumulates and becomes very noticeable. Most scripts and applications are synchronous by nature, performing one operation after another rather than in parallel, so the accumulated delay shows up as seconds or even tens of seconds in the user interface or in responses to user actions.
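A back-of-the-envelope illustration of how synchronous latency accumulates (the request count and the latency values are invented for the example):

```python
# Back-of-the-envelope (illustrative numbers): a synchronous workload issues
# requests one at a time, so per-request latency adds up directly.
requests = 2000                     # e.g. a script touching many small files
for latency_ms in (10.0, 0.5):      # uncached network storage vs. SSD-cached access
    print(f"{latency_ms:>4} ms/request -> {requests * latency_ms / 1000:.1f} s total")
```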
Results & Conclusions
As a result, we got a data storage system whose properties make it suitable for hosters to deploy on their own infrastructure:
- The ability to run virtual machines and containers directly from the storage, because the system provides strict consistency semantics.
- The ability to scale to petabyte volumes across many servers and disks.
- Support for data checksums and SSD caching.
A few words about performance. It turned out to be comparable to enterprise storage systems. On a test cluster of 14 servers with 4 HDDs each, provided by Intel, we got 13 thousand random I/O operations per second with very small (4 KB) files on ordinary SATA hard drives, which is quite a lot. SSD caching sped up the same storage by almost two orders of magnitude: we approached 1 million I/O operations per second. One terabyte of data is recovered in 10 minutes, and the more disks there are, the faster the recovery.
A SAN storage with similar parameters costs upwards of several hundred thousand dollars. A large company can afford to buy and maintain one, but we built our distributed storage for hosters who want a solution with similar parameters for incomparably less money, on the hardware they already have.
***
As promised at the beginning, to be continued: a post is being prepared on how the storage part of Parallels Cloud Server was developed. Suggestions on what you would like to read about are welcome in the comments; I will try to take them into account.