
Storage systems: how to choose?

A project of any complexity, like it or not, faces the task of storing data. That storage can take different forms: block storage, file storage, object storage, and key-value storage. In any sane project, before a storage solution is purchased, tests are run to check specific parameters under specific conditions. Remembering how many good projects, built by perfectly capable hands, stumbled because scalability had been forgotten, we decided to sort this out.


During testing we used RAID systems and the Parallels Cloud Storage (PStorage) distributed storage system. PStorage is part of the Parallels Cloud Server product.


Let's start by defining the main characteristics you need to pay attention to when choosing a storage system. They will determine the structure of this post.
  1. Fault tolerance
  2. Data recovery speed
  3. Performance matching your workload
  4. Data consistency


Fault tolerance


The most important property of a data storage system is that it is designed to SAVE data without any compromises, that is, to ensure maximum availability and never lose even a small part of it. For some reason many people think about performance and price, while data storage reliability gets far less attention.

To ensure resiliency in the event of a failure there is a single technique: redundancy. The question is at what level the redundancy is applied. With some rough simplification, we can say there are two levels: hardware and software.

Hardware-level redundancy has long proven itself in enterprise systems. SAN / NAS boxes have double redundancy of all modules (two or even three power supplies, a pair of “brain” boards) and store data on several disks inside a single box at the same time. Personally, I picture this as a very safe mug: built to keep the liquid inside at all costs, with thick walls and always with two handles in case one of them breaks.

Redundancy at the software level is only beginning to penetrate enterprise systems, but every year it takes a bigger and bigger bite out of HW solutions. The principle here is simple: such systems do not rely on the reliability of the hardware. They assume it is a priori unreliable and solve the redundancy task in software, by creating copies (replicas) of the data and storing them on physically different machines. Continuing the mug analogy: you have several completely ordinary mugs and pour tea into more than one, in case one of them suddenly breaks.

Thus SW solutions do not require expensive equipment and, as a rule, are cheaper, while providing exactly the same fault tolerance, albeit at a different level. They are also easier to optimize: for example, spreading data across different sites, balancing load, changing the fault tolerance level, or scaling linearly as the cluster grows.

I'll show how the redundancy question is solved using the example of Parallels Cloud Storage (PStorage). PStorage is not tied to any hardware vendor and can run on completely ordinary machines, right down to desktop PCs. We do not trust hardware, so the PStorage architecture is designed to survive the loss of an entire physical server (and not just a single disk). All data in Parallels Cloud Storage is stored in multiple copies (replicas), but PStorage never stores more than one copy per physical server / rack / room (as you wish). We recommend storing 3 copies of the data in order to be protected against the simultaneous failure of two servers / racks.
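
To make this concrete, below is a minimal sketch of a rack-aware placement rule. It is purely illustrative and not the actual PStorage algorithm; the host and rack names are made up.

    #!/bin/sh
    # Toy illustration of "never more than one replica per failure domain".
    # NOT the real PStorage placement logic; hosts and racks are invented.
    NODES="n1:rack1 n2:rack1 n3:rack2 n4:rack2 n5:rack3"
    REPLICAS=3

    placed_racks=""
    count=0
    for node in $NODES; do
        host=${node%%:*}
        rack=${node##*:}
        # Skip a rack that already holds a copy of this chunk.
        case " $placed_racks " in
            *" $rack "*) continue ;;
        esac
        echo "replica $((count + 1)) -> $host ($rack)"
        placed_racks="$placed_racks $rack"
        count=$((count + 1))
        [ "$count" -eq "$REPLICAS" ] && break
    done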

[Figure: an example of a cluster storing data in two copies]



Data recovery speed


What happens if one of the disks fails?

To begin, consider a usual HW RAID1 (mirror) of two drives. If a single disk crashes, the RAID continues to work on the remaining one, waiting for the failed disk to be replaced. That is, the RAID is vulnerable during this time: the remaining disk holds the only copy of the data. One of our clients had a case where repairs were under way in their data center and metal was being cut. The shavings flew straight onto running servers, and within a few hours their disks began to fail one after another. The system back then was built on conventional RAID, and as a result the provider lost some of the data.

How long the system stays in this vulnerable state depends on the recovery time. The dependence is described by the following formula:

MTTDL ≈ C / T²

where T is the recovery time, MTTDL (mean time to data loss) is the average time until data is lost, and C is a coefficient.
So, the faster the system restores the required number of data copies, the lower the probability of losing data. And here we even omit the fact that, for HW RAID, the administrator must replace the dead disk with a new one before the recovery process can start, which also takes time, especially if the disk has to be ordered.

For RAID1, the recovery time is the time it takes the RAID controller to copy the data from the working disk to the new one. As you can easily guess, the copy speed equals the read / write speed of the HDD, that is, approximately 100 MB/s, and that is only if the RAID controller carries no other load. If the controller is simultaneously serving external load, the speed will be several times lower. A thoughtful reader can do similar calculations for RAID10, RAID5, and RAID6 and conclude that any HW RAID rebuilds at a speed not exceeding the speed of a single disk.
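
A quick back-of-the-envelope calculation shows what this means in practice (assuming, purely for illustration, that external load drops the copy speed to 20 MB/s):

    T_idle   = 1 TB / 100 MB/s = 10^4 s     ≈ 2.8 hours
    T_loaded = 1 TB /  20 MB/s = 5 * 10^4 s ≈ 14 hours

Since MTTDL ≈ C / T², the five-times-longer rebuild under load makes the expected time to data loss 25 times shorter.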

SAN / NAS systems almost always use the same approach as regular RAID: they group disks and assemble RAIDs from them. Accordingly, the recovery speed is the same.

At the software level there are far more possibilities for optimization. For example, in PStorage data is distributed across the whole cluster and all of its disks, and if one of the disks fails, replication starts automatically: there is no need to wait for an administrator to replace the disk. In addition, all cluster disks take part in replication, so data recovery is much faster. We wrote data to the cluster, disconnected one server, and measured the time the cluster took to restore the missing number of replicas. The graph shows the result for clusters of 7/14/21 physical nodes, each with two 1TB SATA drives, on a 1Gbit network.

[Figure: replica recovery time for clusters of 7/14/21 nodes on a 1Gbit network]

If you use a 10Gbit network, then the speed will be even higher.

[Figure: replica recovery speed on a 10Gbit network]

Comment: it is not a mistake that on a 1Gbit network the recovery speed of a 21-server cluster is almost a gigabyte per second. The point is that the data stored in Parallels Cloud Storage is distributed across the cluster's disks (a kind of stripe across the cluster), so data can be copied from many disks to many disks simultaneously. That is, there is no single owner of the data that could become the bottleneck of the test.
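
A rough estimate shows why this figure is plausible (assuming ~100 MB/s of usable bandwidth per 1Gbit link and replication traffic spread evenly across the cluster):

    usable 1Gbit bandwidth per node ≈ 100 MB/s
    20 surviving nodes copy in parallel:
    aggregate recovery speed ≤ 20 * 100 MB/s = 2 GB/s

The observed ~1 GB/s sits comfortably below this network ceiling, a level a single-source rebuild (one disk at ~100 MB/s) could never reach.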

The full test script can be found in this document; if you wish, you can repeat it yourself.



How to test performance correctly - tips


Based on our experience in testing storage systems, I would highlight the following basic rules (a combined fio sketch follows the list):
  1. Define your "wish list": what exactly you want to get from the system, and how much of it. Most of our clients use Parallels Cloud Storage to build a high-availability cluster for virtual machines and containers. That is, each machine in the cluster simultaneously provides storage and runs virtual machines, so the cluster needs no dedicated external storage box. In terms of performance, this means the cluster receives load from every server. Therefore, in the example below we always load the cluster in parallel from all physical servers in the cluster.
  2. Do not use easily compressible data patterns. Many HDDs / SSDs, storage systems, and virtual machines have special low-level optimizations for handling zeroed data. In such situations it is easy to observe that writing zeros to a disk is noticeably faster than writing random data. A typical example of this mistake is the well-known test:
    dd if=/dev/zero of=/dev/sda bs=1M
    It is better to use random data when testing. At the same time, generating that data must not affect the test itself, so it is better to generate the random data in advance, for example into a file. Otherwise the test will be limited by data generation, as in the following example:
    dd if=/dev/random of=/dev/sda bs=1M
  3. Take into account the distance between distributed components. Communication between them naturally involves delays, so keep this possible bottleneck in mind under load, especially network latency and network bandwidth.
  4. Run each test for at least a minute; short runs are dominated by transient effects.
  5. Run each test several times to smooth out deviations.
  6. Use a large data volume for the load (working set). The working set is a very important parameter: it can change the test result by dozens of times, since it greatly affects cache behavior. For example, with an Adaptec 71605 RAID controller, random I/O on a 512MB file shows 100K IOPS, while on a 2GB file it shows only 3K IOPS. This 30x difference in performance is due to the RAID cache (hitting or missing the cache, depending on the working set size). If you are going to work with data larger than the cache of your storage system (512MB in this example), then test with exactly such volumes. For virtual machines we use a 16GB working set.
  7. And of course, always compare only "apples with apples." Systems must be compared at the same fault tolerance level and on the same hardware. For example, you cannot compare RAID0 with PStorage, since PStorage keeps working when drives / servers fail and RAID0 does not. The correct comparison in this case is RAID1/6/10 against PStorage.
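
As a combined illustration of rules 2, 4, 5, and 6, here is roughly how a single node's load could be generated with fio. This is a sketch: the target file path is a placeholder for the storage under test, and fio must be installed.

    #!/bin/sh
    # Sketch of one node's benchmark run following the rules above.
    # /mnt/storage/testfile is a placeholder path on the storage under test.
    # Rule 5: repeat each test several times.
    for run in 1 2 3; do
        # Rule 2: fio fills I/O buffers with pseudo-random data by default,
        #         and --randrepeat=0 avoids a repeating random pattern.
        # Rule 4: --runtime=60 --time_based keeps each run at one minute.
        # Rule 6: --size=16G makes the working set larger than typical caches.
        fio --name=rand4k --filename=/mnt/storage/testfile \
            --rw=randrw --bs=4k --size=16G \
            --runtime=60 --time_based --direct=1 \
            --randrepeat=0 --numjobs=16 --group_reporting
    done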

Below are the test results obtained with the described methodology. We compare the performance of local RAID1 ("host RAID1") with a PStorage cluster ("PCS"): the same "apples with apples." Note that systems must be compared at the same level of redundancy, so PStorage in these tests stored data in two copies (replicas=2) instead of the recommended three, keeping the fault tolerance level the same for both systems. Otherwise the comparison would be unfair: PStorage with replicas=3 survives the simultaneous loss of any 2 drives / servers, while a RAID1 of 2 drives survives only 1. We used the same hardware for all tests: 1/10/21 identical physical servers, each with two 1TB SATA disks, a 1Gbit network, a Core i5 CPU, and 16GB of RAM. When the cluster consists of 21 servers, its performance is compared with the total performance of 21 local RAIDs. The load was generated in 16 threads on each physical node simultaneously, with a 16GB working set per node; in the RANDOM 4K test, for example, the loaders randomly walked over 336GB of data across the cluster as a whole. Each run lasted 1 minute, and each test was repeated 3 times.
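
A sketch of how such a load can be launched on all cluster nodes at once (node01..node21 and the script path are hypothetical; any parallel-ssh tool would do the same job):

    #!/bin/sh
    # Start the per-node fio run (see the sketch above) everywhere at once.
    for host in $(seq -f "node%02g" 1 21); do
        ssh "$host" 'sh /root/run_fio_test.sh' &   # all loaders in parallel
    done
    wait    # gather results only after every node has finished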

The PCS + SSD columns show the performance of the same cluster but with SSD caching. PStorage has a built-in ability to use local SSDs for write journaling and read caching, which speeds up operation on rotating disks several times over. SSD drives can also be used to create a separate tier with higher performance.

[Figure: performance results for host RAID1, PCS, and PCS + SSD]


Findings


Briefly summarized:
  1. When choosing the type of redundancy, we lean towards the software level: it provides more opportunities for optimization, reduces hardware requirements, and lowers the cost of the system as a whole.
  2. Run tests under well-defined conditions (see our tips).
  3. Pay attention to recovery speed: it is a very important parameter which, if insufficient, can simply destroy part of the business.

You can also test our solution yourself, especially since we offer it for free. At least in our own tests, Parallels Cloud Storage shows the fastest data recovery after a disk loss (faster than RAID systems, including SAN), and its performance is at least as good as local RAID, and even higher with SSD caching.

We plan to talk more about data consistency in a separate post.


How to try Parallels Cloud Storage


The official product page is here. To try it for free, fill out the form.
PStorage is also available for the OpenVZ project.
You can read about how PCS works at FastVPS in this post.
Share what your tests show, along with the pros and cons; we can discuss the details in the comments.

Source: https://habr.com/ru/post/239381/

