Increase array efficiency

This article was prepared by Nikolai Vedyashkin, an expert at the Jet Info Systems Service Center.

Let us imagine a situation: we added a new instance of the database or a new backup task (RC) to the database server, connected an additional server to the disk array, and in all these cases we discovered a decrease in its performance. Then you can go in different ways.

For example, add a database server and transfer an instance of the database to it, add backup drives to speed up the RK, upgrade processors, etc. However, it should be remembered that a simple increase in hardware capacity is least advantageous in terms of material and time costs. It is much more efficient to solve such problems at the level of the logic of IT solutions.

')

Causes of slipping

Problems with the performance of an array are often related to the fact that its initial configuration does not take into account its architecture, operating principles, and existing constraints. For example, the Achilles heel of the arrays of the old generation is a relatively low throughput of internal tires — about 200 Mb / s. Not so long ago, one of the customers asked us to analyze the work of its disk array and give recommendations for optimization. In fact, the array was not loaded, while its speed periodically left much to be desired. The analysis revealed an irregular configuration: in general, during the day, the internal disks were loaded approximately equally, but the load peaks were unevenly distributed over them. As a result, one of the internal tires was overloaded. That is, the array “skidded” due to exceeding the maximum allowable threshold for one component. Our recommendation - repartitioning it to evenly load internal tires - helped increase productivity by 30%.

The error can creep in and when connecting servers to the storage. An example is the incorrect configuration of disk capacity that is presented to hosts. The fact is that some of the modern arrays have limitations on a parameter such as a queue of commands (Queue Depth, QD). Here it is worth a little deeper into the story. In the SCSI-I standard, the SCSI-server driver had to wait for the execution of one command and only after that send the next one. From the SCSI-II standard and above, the SCSI driver can send several commands (QD) to the SCSI disk simultaneously. The maximum number of parallel serviced SCSI commands is one of the most important disk characteristics. The IOPS (Input Output Operation per Second) parameter shows how many requests (SCSI commands) per second SCSI LUNs can perform. It turns out that QD and IOPS may be in irreconcilable contradiction with each other.

The situation in which the server-side I / O characteristics are unacceptable, the response time to requests is very large, and the array is not loaded is quite real. The reason lies in - incorrect configuration of the command queue (higher than the allowable one) - commands hang in the array buffer until their queue for execution comes up. On the server, large service time is registered.

If QD is significantly below the optimal value, the performance will also be lame. With a remarkable response time and an unloaded array, the number of requests it processes will be very small. Blame - long waiting in the queue before sending requests to the storage system.

We catch IOPS by the tail

What to do if the response time goes off, and the array is not loaded? Or if you just want to "squeeze" out of the array a little more?
Can:

look at the Queue Depth settings on the server and compare the maximum allowed command queue with the LUN of the array. Adjust settings;
look at the statistics from the array. Perhaps it is piling up a queue of commands to the LUN;
split one LUN into several and connect to the host in stripe or at least concatenation depending on the configuration. Concatenation is useful if the load is distributed across all LUNs.
choose a stripe unit size on the array and host so that a typical application-side operation loads as few physical disks as possible in the array.

Fig. 1. Stripe Unit Size

An example from our experience: the “server – array” combination at the customer did not show the declared level of performance. As a result of the analysis, it turned out that the server was given a very large (several terabytes) LUN - the application was not satisfactory, and the LUN itself was overloaded with teams. We recommended breaking this LUN into several and spreading the load on different volumes. On the server, 4 instance databases were spinning, as a result, one of them started to work 6 times faster, the other one - 2 times.

No longer means better

IT specialists of customer companies do not always understand which type of RAID is best suited for a particular load profile from the application side. Everyone knows that RAID 10 is reliable, resistant to the loss of several disks and shows good speed on random operations. It is not surprising that most often choose this very expensive option. However, if the load profile of an application implies few random write operations and many read or sequential write operations, it is better to use RAID 5. It can work 1.5 or even 2 times faster on the same number of disks. One company asked us to improve the disk input / input characteristics for one of its applications. The application created many read operations and a small number of write operations. RAID 10 was configured on the array, and from the statistics it was clear that almost half of the drives in the RAID group were idle. With the transition to RAID 5 from exactly the same number of physical disks, the application will improve by more than 1.5 times.

We welcome your constructive comments.

Performance problems affect virtually every company operating a computing complex. The examples given here are not the only ones. Many problems associated with the unsatisfactory operation of arrays can be avoided if the configuration of the equipment and the application load profile are taken into account when configuring the equipment. At the same time, the improvement of the work of the computing complex should not be confined to any one of its components — the server, the array, the software, or the data network. The best results can be achieved after analyzing the entire complex as a whole and changing the configuration not only of the array, but also of the server and applications.

Source: https://habr.com/ru/post/266629/

All Articles

Increase array efficiency

Causes of slipping

We catch IOPS by the tail

No longer means better

More articles: