
New cloud storage system

Just a couple of days ago we put into operation a new storage system that we had spent the last six months building. Before describing what is new about it, let me tell you the story of its development.

It should be noted right away that the storage system is the cornerstone of any cloud hosting architecture. There are two mandatory requirements for it. First, it must be network-attached, so that users' virtual machines can move freely between compute nodes. Second, it must perform well under parallel load, since it is used by a large number of clients at the same time.

When we designed the system, back in 2007, there were three main technologies on the market for building network storage: iSCSI over 1/10 Gbps Ethernet, 4 Gbps Fibre Channel and 40 Gbps Infiniband. After doing our research, including on pricing, we chose the last option, Infiniband. This allowed us to use a single network fabric for both storage traffic and IP traffic. At first glance this seems odd, since many people are used to thinking of Infiniband as a very expensive, exotic technology from the world of supercomputers. But a simple price comparison shows that in this case Infiniband is actually very economical.

So we are likely the first hosting company in the world to have used Infiniband. By now, besides us, it is used by a few American and some savvy Russian hosting companies, as well as by some data centers for their internal needs.
As for the storage arrays themselves, we did not find a single ready-made, high-performance solution capable of working with Infiniband; LSI and DDN had something, but in a very raw state. Having once again weighed the cost of fault-tolerant iSCSI and FC based solutions, we realized that at the resulting price nobody would buy them.

At that point Maxim Lapan joined our company; he had experience with IBM's commercial cluster file system, GPFS. It turned out that GPFS can work directly over Infiniband, making full use of its capabilities, and comes with all the necessary redundancy features. That is how the first version of the Scalaxi cloud storage system was built: raid10 on the GPFS nodes plus double redundancy of the nodes themselves, with virtual machine disk images stored as regular files on GPFS.

However, in operation it turned out that GPFS handles our workload extremely poorly. First, frequent crashes during cluster reconfiguration (adding new nodes and so on); colleagues who have followed our path are now running into similar problems. Second, low performance: GPFS was originally designed for sequential access to large chunks of data, such as streaming video files, not for user disk images, where random access to small blocks dominates.

We started thinking the problem over and trying different solutions; at the same time I really wanted to get away from proprietary, unpredictable technologies. We stumbled upon VastSky, a project by VA Linux Systems. Its architecture was very similar to what we were moving towards: using the device-mapper driver to handle user disk images. LVM is also built on top of this driver and is essentially a management utility for it. However, VastSky was not yet ready for production use.
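
To make the relationship concrete: an LVM logical volume is, underneath, nothing more than a device-mapper table, and the same kind of mapping can be built by hand with dmsetup. A minimal sketch (the device and volume names here are made up for illustration):

    # show the device-mapper table that lvcreate built for a volume
    # (format: start length target backing-device offset, in 512-byte sectors)
    dmsetup table vg0-lv0

    # build an equivalent 10 GiB linear mapping by hand on top of /dev/sdb
    echo "0 20971520 linear /dev/sdb 0" | dmsetup create mydisk

    # the result appears as /dev/mapper/mydisk and behaves like any block device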

Still, this concept convinced us that we were on the right track. As a result, a new version of the storage system was developed. Here is its layout:

[image: diagram of the storage system]

There are three types of servers in our cloud:

VRT (ViRTualization) - diskless virtualization servers running client virtual machines.
IBRP (IB Raid Proxy) - storage proxy servers whose job is to maintain the raid arrays.
IBRN (IB Raid Node) - storage system nodes that contain disks and caches.

A photo of a rack with IBRNs, front view:

[image: IBRN rack, front view]

Rear view:

[image: IBRN rack, rear view]

A photo of the Infiniband switch:

[image: Infiniband switch]

It all works as follows:

- IBRNs export their disks over Infiniband SRP (SCSI RDMA Protocol) using the SCST driver with caching enabled, the fastest open-source SCSI target driver. SCST is a great piece of software, also recommended to us by Maxim Lapan; it is developed by Russian engineers and lets you turn any Linux box into enterprise-grade storage.
- IBRPs receive the disks from the IBRNs and, for each IBRN pair, assemble a raid10 with md (whose code we carefully checked for performance under concurrent load), create an LVM volume group on top of that raid and export it again via SCST, this time without caching (see the sketch after this list).
- VRTs receive the disks from all IBRPs; the multipath driver groups them into a round-robin set, spreading the load across all IBRPs.
- When a new disk is created, lvcreate is run on one of the IBRPs, the device-mapper table of the created volume is saved, the same device is then created on the VRT via device-mapper and handed over to Xen.
- A write from a client virtual machine reaches an IBRP, goes into md, md writes it to both IBRNs and only then acknowledges the operation. So if any storage node drops mid-I/O, the operation in the client machine is still handled correctly.
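
Here is a rough sketch of the commands behind this layering. It is only an illustration, not our actual scripts, and all device and volume names are made up:

    # on an IBRP: assemble a raid10 with md over the SRP disks imported from
    # one IBRN pair; listing the disks alternately (node A, node B, node A,
    # node B) makes each mirror half land on a different IBRN with the
    # default near-2 layout
    mdadm --create /dev/md0 --level=10 --raid-devices=4 \
          /dev/sdb /dev/sdc /dev/sdd /dev/sde

    # put an LVM volume group on top of the raid
    pvcreate /dev/md0
    vgcreate storage /dev/md0

    # creating a client disk is a plain lvcreate; its device-mapper table
    # is what gets saved and replayed on the VRT side
    lvcreate -L 10G -n client42 storage
    dmsetup table storage-client42 > /tmp/client42.table

    # on the VRT: recreate the same device from the saved table (with the
    # backing device pointing at the local multipath device) and hand
    # /dev/mapper/client42 to Xen
    dmsetup create client42 /tmp/client42.table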

At first glance it seems that there are a lot of layers and that they must hurt performance. This is not the case: the Infiniband fabric is faster than the SAS bus, so the whole construction works at least as fast as local drives, and thanks to the 96 GB caches it outperforms them several times over.

Next, we put together a list of test scenarios to check the system's fault tolerance and ran them many times:

1. IBRN freeze or power failure.

Tests passed successfully. Client I/O freezes for a while, until the disk timeout on the IBRP fires; md on the IBRP then drops the disks of the failed IBRN and carries on. After the IBRN is restored, md resynchronizes successfully within 24 hours.
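
This kind of test can be reproduced by hand on an IBRP with mdadm; a short sketch with made-up device names:

    # simulate losing one IBRN's disk: mark it failed and drop it from the raid
    mdadm /dev/md0 --fail /dev/sdc
    mdadm /dev/md0 --remove /dev/sdc

    # I/O keeps flowing to the surviving mirror half; when the node is back,
    # re-add the disk and md starts resynchronizing
    mdadm /dev/md0 --re-add /dev/sdc

    # watch the rebuild progress
    cat /proc/mdstat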

2. Stopping an IBRN by unloading the SCST driver.

Tests passed successfully. Events unfold as in the previous test, but without the timeout.

3. IBRP freeze or power failure.

Tests passed successfully. The SRP session is terminated by timeout and multipath on the VRT removes the path. On client machines all I/O freezes for the duration of the multipath timeout.
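
How long client I/O stays frozen in this scenario is governed by multipath's path-checking and queueing settings. A fragment of /etc/multipath.conf in that spirit (the values are illustrative, not our production configuration):

    defaults {
        polling_interval      5                # how often path state is checked, in seconds
        path_grouping_policy  multibus         # put all paths into one group
        path_selector         "round-robin 0"  # spread I/O across all IBRPs
        no_path_retry         queue            # queue I/O while every path is down
        failback              immediate        # return to a path as soon as it recovers
    }

The current state of the paths can be inspected with multipath -ll.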

4. Stopping an IBRP by unloading the SCST driver.

Tests passed successfully. After SCST is unloaded, the SRP session is closed cleanly and multipath on the VRT immediately marks the path as failed. For clients everything is transparent, with no timeouts.

5. Disk failure in an IBRN (hot-pulling a disk).

Tests passed successfully. The disk disappears from the IBRN, errors show up there and on the IBRP, and md consistently redirects I/O to the mirror. After the disk is inserted back, md resynchronizes successfully within 24 hours.

6. Restarting the Infiniband switch.

Tests passed successfully. Power to the switch was cut for a couple of seconds and then restored. During the reboot we observed IB links going down and coming back up, SRP sessions breaking and re-establishing, and multipathd paths on the VRTs temporarily failing. Client I/O freezes while the switch reboots.

7. Simultaneous failure of an IBRP and an IBRN.

Tests passed successfully.

8. Simultaneously stopping an IBRN pair by unloading the SCST drivers.

Tests passed successfully. During the test the IBRN pair was stopped, rebooted and brought back into operation. All this time client I/O was frozen; once the test ended, I/O on all client machines resumed. Conclusion: an IBRN pair can be stopped simultaneously for maintenance of the physical IBRN servers during the regular maintenance window (from 2 am to 6 am).

Recall that all storage servers are located in our Oversan-Mercury data center, where each rack is fed by two power supplies from independent inputs, additionally backed by UPS units and diesel generators. So to take such a storage system down you would have to block the delivery of diesel fuel to the DC (with riot police, say) and cut power to all of Moscow for a week.

The performance of the new storage system is truly enormous: we managed to get more than 120,000 write IOPS from a single IBRN pair. But we will cover that in one of the following posts, together with a comparative analysis of the disk subsystem performance of various Russian and foreign cloud hosting providers.
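
For reference, random-write IOPS figures of this kind are commonly measured with a tool such as fio; the post does not say which tool was used, so the command below is only an illustration of the methodology (the device name and parameters are made up):

    # 4 KiB random writes with a deep queue, direct I/O, for 60 seconds
    fio --name=randwrite --filename=/dev/mapper/testvol \
        --rw=randwrite --bs=4k --direct=1 \
        --ioengine=libaio --iodepth=64 --numjobs=8 \
        --runtime=60 --time_based --group_reporting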

Those who want to try the new storage system right away can clone their existing servers (all new disks are created on the new system) or wait until next week, when we complete the final migration. Those who do not yet have servers in Skalaxi, now is the time to create them: 150 rubles are credited on registration for testing.

Source: https://habr.com/ru/post/116137/

