
What factors affect storage performance and how?

Storage systems play a key role in the vast majority of web projects (and not only web projects). After all, the task often comes down not just to storing a certain type of content, but also to serving it to visitors and processing it, which imposes certain performance requirements.

While many other metrics are used in drive manufacturing, the storage and disk-drive market has settled on IOPS as the comparative metric for describing and guaranteeing performance, for convenience of comparison. However, the performance of storage systems, measured in IOPS (Input/Output Operations Per Second), is influenced by a large number of factors.

In this article, I would like to examine these factors in order to make the measure of performance expressed in IOPS more understandable.

Let's start with the fact that an IOPS is not always an IOPS, since many variables determine how many IOPS we get in any particular case. You should also take into account that storage systems perform both reads and writes and deliver different IOPS figures for each, depending on the architecture and the type of application, especially when I/O operations occur simultaneously. Different workloads have different I/O requirements. Thus, a storage system that at first glance should have provided adequate performance may not actually cope with the task.

Drive Performance Basics


In order to gain a full understanding of the question, let's start with the basics. IOPS, bandwidth (MB/s or MiB/s), and response time in milliseconds (ms) are the common units for measuring the performance of individual drives and drive arrays.

IOPS is usually understood as a measure of a storage device's ability to read/write blocks of 4-8 KB in random order, which is typical of online transaction processing tasks, databases, and running various applications.

The notion of drive throughput is usually applied to reading/writing a large file, for example in blocks of 64 KB or more, sequentially (in 1 stream, 1 file).

Response time is the time that the drive needs in order to start performing a write / read operation.

The conversion between IOPS and bandwidth can be done as follows:

IOPS = bandwidth / block size;
Bandwidth = IOPS * block size,

where block size is the amount of information transferred in a single input/output (I/O) operation. Thus, knowing a characteristic of a hard disk (SATA HDD) such as its bandwidth, we can easily calculate the number of IOPS.

For example, take the standard block size of 4 KB and the standard bandwidth declared by the manufacturer for sequential writes or reads: 121 MB/s. IOPS = 121 MB/s / 4 KB, which gives us a value of about 30,000 IOPS for our SATA hard disk. If the block size is increased to 8 KB, the value will be about 15,000 IOPS; that is, it decreases almost in proportion to the growth of the block size. However, it should be clearly understood that here we considered IOPS for sequential writing or reading.
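This conversion can be sketched in a couple of Python helpers (a minimal illustration; the 121 MB/s figure is just the example bandwidth from the text, and 1 MB is taken as 1000 KB):

```python
def iops_from_bandwidth(bandwidth_mb_s: float, block_kb: float) -> float:
    """Sequential IOPS = bandwidth / block size (1 MB taken as 1000 KB)."""
    return bandwidth_mb_s * 1000 / block_kb

def bandwidth_from_iops(iops: float, block_kb: float) -> float:
    """Bandwidth in MB/s = IOPS * block size."""
    return iops * block_kb / 1000

print(iops_from_bandwidth(121, 4))  # 30250.0 -> about 30,000 IOPS with 4 KB blocks
print(iops_from_bandwidth(121, 8))  # 15125.0 -> about 15,000 IOPS with 8 KB blocks
```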

Everything changes dramatically for traditional SATA hard drives when reads and writes are random. Here latency begins to play a role, and it is very critical for SATA/SAS hard disk drives (HDDs), and sometimes even for solid-state drives (SSDs). Although the latter often deliver performance orders of magnitude better than that of "rotating" drives thanks to the absence of moving parts, they can still exhibit noticeable write delays due to the nature of the technology, and, as a result, when they are used in arrays. amarao conducted a rather useful study of solid-state drives in arrays; as it turned out, performance depends on the latency of the slowest drive. You can read more about the results in his article: SSD + raid0 - not everything is so simple.

But let's get back to the performance of individual drives. Consider the case of "rotating" drives. The time required to perform a single random I/O operation is determined by the following components:

T(I/O) = T(A) + T(L) + T(R/W),

where T(A) is the access time (or seek time), that is, the time required for the read head to be positioned over the track containing the needed block of information. The manufacturer often specifies three parameters in the disk specification:

- the time required to move from the farthest track to the nearest;
- the time required to move between adjacent tracks;
- the average access time.

Thus, we come to the magic conclusion that T(A) can be improved if we place our data on tracks that are as close together as possible, with all data located as far as possible from the center of the platter (it takes less time to move the head assembly, and outer tracks hold more data, since a track there is longer and passes under the head faster than an inner one). Now it becomes clear why defragmentation can be so helpful, especially if the data is placed on the outer tracks first.

T(L) is the delay caused by the rotation of the disk, that is, the time required to reach the specific sector to be read or written on our track. It is easy to see that it lies in the range from 0 to 1/RPS, where RPS is the number of revolutions per second. For example, for a typical disk at 7200 RPM (revolutions per minute) we get 7200/60 = 120 revolutions per second. That is, one revolution takes (1/120) × 1000 (milliseconds per second) = 8.33 ms. The average delay in this case equals half the time of one revolution: 8.33/2 = 4.16 ms.
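The same arithmetic in a short Python sketch (the RPM values are the common drive classes mentioned later in the text):

```python
def avg_rotational_latency_ms(rpm: int) -> float:
    """Average rotational latency = half of one full revolution, in milliseconds."""
    ms_per_revolution = 60_000 / rpm  # 60,000 ms per minute / revolutions per minute
    return ms_per_revolution / 2

for rpm in (7200, 10_000, 15_000):
    print(rpm, round(avg_rotational_latency_ms(rpm), 2))  # 4.17, 3.0, 2.0 ms
```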

T(R/W) is the sector read or write time, which is determined by the block size chosen during formatting (from 512 bytes up to several megabytes; on more capacious drives, from 4 KB, the standard cluster size) and by the throughput stated in the drive's specifications.

The average rotational delay, which is approximately equal to the time of half a revolution, is easy to determine once you know the rotational speed: 7,200, 10,000 or 15,000 RPM. We have already shown how above.

The remaining parameters (the average read and write seek times) are harder to determine; they are measured in tests and specified by the manufacturer.

To calculate the number of random IOPS of a hard disk, assuming that the numbers of simultaneous read and write operations are equal (50%/50%), we can apply the following formula:

1 / (((average read seek time + average write seek time) / 2) / 1000 + average rotational delay / 1000).

Many wonder where exactly this formula comes from. IOPS is the number of input or output operations per second. That is why in the numerator we take 1 second (1000 milliseconds) and divide it by the sum of all the delays in the denominator (also expressed in seconds or milliseconds) required for one input or output operation.

That is, the formula can be written in the following way:

1000 (ms) / ((average read seek time (ms) + average write seek time (ms)) / 2 + average rotational delay (ms))

For drives with different rotational speeds (RPM, revolutions per minute), we get the following values:

For a 7200 RPM drive, IOPS = 1 / (((8.5 + 9.5) / 2) / 1000 + 4.16 / 1000) = 1 / ((9 / 1000) + (4.16 / 1000)) = 1000 / 13.16 = 75.98;
For a 10K RPM SAS drive, IOPS = 1 / (((3.8 + 4.4) / 2) / 1000 + 2.98 / 1000) = 1 / ((4.10 / 1000) + (2.98 / 1000)) = 1000 / 7.08 = 141.24;
For a 15K RPM SAS drive, IOPS = 1 / (((3.48 + 3.9) / 2) / 1000 + 2.00 / 1000) = 1 / ((3.69 / 1000) + (2 / 1000)) = 1000 / 5.69 = 175.75.
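These calculations can be wrapped in a small Python function (a sketch; rotational latency here is derived from the RPM, so the results differ slightly from figures based on manufacturer-stated latencies):

```python
def random_iops(read_seek_ms: float, write_seek_ms: float, rpm: int) -> float:
    """Approximate random IOPS of a rotating drive for a 50/50 read-write mix."""
    avg_seek_ms = (read_seek_ms + write_seek_ms) / 2
    rotational_ms = (60_000 / rpm) / 2  # average wait: half a revolution
    return 1000 / (avg_seek_ms + rotational_ms)

print(random_iops(8.5, 9.5, 7200))     # ≈ 75.9  (7200 RPM SATA)
print(random_iops(3.8, 4.4, 10_000))   # ≈ 140.8 (10K RPM SAS)
print(random_iops(3.48, 3.9, 15_000))  # ≈ 175.7 (15K RPM SAS)
```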

Thus, we see a dramatic change: from tens of thousands of IOPS for sequential reads or writes, performance drops to mere dozens of IOPS.

And with the standard 4 KB sector size, such a small number of IOPS gives us a bandwidth of not a hundred megabytes per second, but less than one megabyte per second.

These examples also explain why nominal IOPS figures differ only slightly between manufacturers for disks with the same RPM.

Now it becomes clear why the performance data lies in rather wide ranges:

7200 RPM (Revolutions per Minute) SATA HDD - 50-75 IOPS;
10K RPM SAS HDD - 110-140 IOPS;
15K RPM SAS HDD - 150-200 IOPS;
SSD (Solid State Drive) - tens of thousands of IOPS for reads, hundreds to thousands for writes.

However, nominal disk IOPS is still far from accurate, since it does not take into account the differences in the nature of the load in individual cases, which is very important to understand.

Also, for a better understanding of the topic, I recommend reading one more useful article by amarao: How to measure disk performance correctly, which also makes clear that latency is not fixed at all and likewise depends on the load and its nature.

The only thing I would like to add is:

Why, when calculating the performance of a hard disk, can the decrease in the number of IOPS with increasing block size be neglected?


We have already seen that for "rotating" drives the time required for a random read or write consists of the following components:

T(I/O) = T(A) + T(L) + T(R/W).

And then we even calculated random read/write performance in IOPS. But we essentially neglected the T(R/W) parameter there, and not by accident. We know that, say, sequential reads can run at 120 megabytes per second. It becomes clear that a 4 KB block is read in about 0.03 ms, two orders of magnitude less than the remaining delays (8 ms + 4 ms).

Thus, if with a 4K block we have 76 IOPS (the main delay is caused by the rotation of the platter and the head positioning time rather than by the read or write process itself), then with a 64K block, random-access IOPS will not fall 16-fold, but only by a few IOPS, since the time spent directly on reading or writing grows by about 0.5 ms, which is only about 4% of the total delay time.

As a result we get 76 - 4% = 72.96 IOPS, which, you will agree, is not at all critical in the calculations, since the drop in IOPS is not 16-fold but only a few percent! And when calculating the performance of systems, it is more important not to forget to account for other significant parameters.
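The block-size effect can be sketched in Python (assuming, as in the text, roughly 13 ms of seek plus rotational delay and a 120 MB/s sequential transfer speed):

```python
SEEK_PLUS_ROTATION_MS = 9.0 + 4.17  # average seek + average rotational latency, ms
SEQ_BANDWIDTH_MB_S = 120            # sequential transfer speed of the drive

def random_iops_for_block(block_kb: float) -> float:
    """Random IOPS including the transfer time of the block itself."""
    transfer_ms = block_kb / SEQ_BANDWIDTH_MB_S  # KB / (MB/s) comes out in ms
    return 1000 / (SEEK_PLUS_ROTATION_MS + transfer_ms)

print(random_iops_for_block(4))     # ≈ 75.7 IOPS
print(random_iops_for_block(64))    # ≈ 73.0 IOPS: a few percent lower, not 16x
print(random_iops_for_block(2048))  # ≈ 33.1 IOPS: a 2 MB block more than halves IOPS
```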

The magic conclusion: when calculating the performance of storage systems based on hard disks, you should choose the optimal block (cluster) size to provide the maximum throughput you need, depending on the type of data and the applications used; the drop in IOPS as the block size grows from 4 KB to 64 KB or even 128 KB can be neglected, or taken as 4% and 7% respectively, if it plays an important role in the task.

It also becomes clear why it does not always make sense to use very large blocks. For example, with video streaming a two-megabyte block size may not be the best option, since the drop in IOPS will be more than two-fold. Among other things, further degradation will appear in arrays, related to multithreading and the computational load of distributing data across the array.

The optimal block size (cluster)


The optimal block size should be chosen depending on the nature of the load and the type of applications used. If you are working with small data, for example with databases, you should choose the standard 4 KB; but if you are dealing with streaming video files, it is better to choose a cluster size of 64 KB or more.

It should be remembered that block size is not as critical for SSDs as it is for standard HDDs. An HDD can only deliver the required bandwidth through a larger block, since its small number of random IOPS decreases only slightly as the block size grows; on an SSD, by contrast, IOPS falls almost in proportion to the block size.

Why standard 4KB?


For many drives, especially solid-state ones, performance values - for writes, for example - become optimal starting from 4 KB, as can be seen from the graph:



Read speed is also quite substantial, and degrades less noticeably, starting from 4 KB:



It is for this reason that the 4 KB block size is very often used as the standard: with a smaller block there are large performance losses, while as the block grows, when working with small data, the data is distributed less efficiently, since it occupies the entire block, and the drive's capacity is not used effectively.

RAID level


If your storage system is an array of hard drives combined at a certain RAID level, then system performance will depend largely on which type of RAID is used and what percentage of the total operations are writes, because writing is the cause of performance degradation in most cases.

So, with RAID0, each write operation consumes only 1 IOPS, because the data is distributed across all drives without duplication. In the case of a mirror (RAID1, RAID10), each write operation consumes 2 IOPS, since the information must be written to 2 drives.

At higher RAID levels the losses are even more significant: in RAID5, for example, the penalty factor is already 4, which is related to the way data is distributed across the disks.

RAID5 is used instead of RAID4 in most cases because it distributes parity (checksums) across all the disks. In a RAID4 array, one disk is responsible for all the parity, while the data is spread over 3 or more disks. The penalty factor in a RAID5 array is 4 because each write involves reading the data, reading the parity, then writing the data and writing the parity.

In a RAID6 array everything is the same, except that instead of calculating parity once we do it twice; thus we have 3 reads and 3 writes, which gives us a penalty factor of 6.

It would seem that in an array like RAID-DP everything should be similar, since it is essentially a modified RAID6 array. But not so. The trick is that a dedicated file system, WAFL (Write Anywhere File Layout), is used, in which all writes are sequential and go to free space. WAFL basically writes new data to a new location on the disk and then moves the pointers to the new data, thereby eliminating the read operations that would otherwise have to occur. In addition, a journal in NVRAM tracks write transactions, initiates the writes, and can replay them if necessary. Writes first go to a buffer and are then "flushed" to disk, which speeds up the process. The experts at NetApp can probably enlighten us in more detail in the comments about how the savings are achieved; I have not fully gotten to the bottom of this question yet, but I remember that the RAID penalty factor comes out to only 2, not 6. The "trick" is quite significant.

With large RAID-DP arrays consisting of dozens of disks, there is the concept of a decreasing "parity penalty" incurred when parity is written: as a RAID-DP array grows, a smaller share of disks is required for parity, which reduces the losses associated with parity writes. However, in small arrays, or for the sake of conservatism, we can neglect this phenomenon.

Now, knowing the IOPS losses resulting from a given RAID level, we can calculate the performance of the array. Note, however, that other factors, such as interface bandwidth, non-optimal distribution of interrupts across processor cores, RAID controller bandwidth, and exceeding the allowed queue depth, can have a negative effect.

If we neglect these factors, the formula is as follows:

Functional IOPS = (Source IOPS * % writes / RAID penalty factor) + (Source IOPS * % reads), where Source IOPS = average IOPS per drive * number of drives.

For example, let's calculate the performance of a RAID10 array of 12 SATA HDDs, given that 10% of the simultaneous operations are writes and 90% are reads. Suppose a disk provides 75 random IOPS with a 4K block size.

Source IOPS = 75 * 12 = 900;
Functional IOPS = (900 * 0.1 / 2) + (900 * 0.9) = 855.
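The whole RAID calculation can be sketched in Python (the penalty factors are the ones discussed above; RAID-DP's factor of 2 is as described for WAFL):

```python
# Write penalty factors per RAID level, as discussed above.
RAID_WRITE_PENALTY = {
    "RAID0": 1, "RAID1": 2, "RAID10": 2,
    "RAID5": 4, "RAID6": 6, "RAID-DP": 2,
}

def functional_iops(disk_iops: float, disks: int, write_share: float, raid: str) -> float:
    """Usable array IOPS given the share of writes (0..1) and the RAID level."""
    source_iops = disk_iops * disks
    penalty = RAID_WRITE_PENALTY[raid]
    return source_iops * write_share / penalty + source_iops * (1 - write_share)

print(round(functional_iops(75, 12, 0.10, "RAID10"), 1))  # 855.0 - matches the example
print(round(functional_iops(75, 12, 0.10, "RAID5"), 1))   # 832.5 - same load on RAID5
```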

Thus, we see that at a low write intensity, which is typical of systems designed mainly for content delivery, the effect of the RAID penalty factor is minimal.

For the sake of conservatism, I recommend adding about 20% to the required number of IOPS when designing systems.

Application dependency


The performance of our solution may depend heavily on the applications that will run on it. It may be transaction processing of "structured" data that is organized, consistent and predictable. Such processes can often be batched and shifted in time to when the load is minimal, thereby optimizing IOPS consumption. Recently, however, there are more and more media projects where the data is "unstructured" and requires completely different processing principles.

For this reason, calculating the required performance of a solution for a particular project can be a very difficult task. Some storage vendors and experts claim that IOPS does not matter, since the vast majority of customers use up to 30-40 thousand IOPS, while modern storage systems deliver hundreds of thousands and even millions of IOPS; that is, modern storage meets the needs of 99% of customers. Nevertheless, this statement is far from always true: it holds only for the business segment that hosts storage locally, not for projects hosted in data centers, which often, even when using off-the-shelf storage solutions, must provide quite high performance and fault tolerance.

When a project is placed in a data center, in most cases it is still more economical to build storage systems from dedicated servers rather than use off-the-shelf solutions, since it becomes possible to distribute the load more efficiently and to select the optimal hardware for particular processes. Among other things, the performance figures of off-the-shelf storage systems are far from real, since most of them are based on synthetic benchmark profiles with a 4 or 8 KB block size, while most client applications now run with block sizes from 32 to 64 KB.

As you can see from the graph:



Fewer than 5% of storage systems are configured with a block size below 10 KB, and fewer than 15% use blocks smaller than 20 KB. In addition, even for a specific application it is rare for only one I/O profile to occur. For example, a database will have different I/O profiles for different processes (data files, logging, indexes...). This means that the declared synthetic performance tests of a system may be far from the truth.

And what about the delays?

Even if we ignore the fact that the tools used to measure latency tend to report average wait times and miss the fact that a single I/O in one of the processes may take much longer than the others, thereby slowing down the whole process, they also fail to account for how much I/O wait time changes depending on the block size. Among other things, this time will also depend on the specific application.

Thus, we come to another magical conclusion: not only is IOPS by itself not a very good characteristic for measuring storage system performance, but latency can also turn out to be a completely useless parameter.

Well, if neither IOPS nor latency is a good measure of storage system performance, then what?

Only a real test of the application execution on a specific solution ...

Such a test is the one true method that will let you understand how well the solution performs in your case. To do this, you need to run a copy of the application on the specific storage and simulate the load for a certain period. Only this way can you obtain reliable data. And, of course, you need to measure the application's metrics, not the storage's.

Nevertheless, taking into account the factors above that affect the performance of our systems can be very useful when selecting storage or building an infrastructure on dedicated servers. With a certain degree of conservatism it becomes possible to choose a more or less realistic solution and to eliminate some technical and software flaws, such as a non-optimal block size at partitioning or non-optimal work with the disks. Of course, the solution will not guarantee the calculated performance 100%, but in 99% of cases it will be possible to say that it will cope with the load, especially if conservatism is added to the calculation according to the application type and its features.

Source: https://habr.com/ru/post/282469/
