
Serious changes have taken place in the field of data storage devices in recent years: new technologies are being introduced, and the volume and speed of disk operations keep growing. As a result, a situation develops in which the bottleneck is no longer the device but the software: the traditional Linux kernel mechanisms for working with the disk subsystem turn out to be unsuitable for new, fast block devices.
In recent versions of the Linux kernel, much work has been done to solve this problem. With the release of version 4.12, users got the opportunity to use several new schedulers for block devices. These schedulers are based on a new request submission mechanism, the multiqueue block layer (blk-mq).
In this article, we take a detailed look at how the new schedulers work and how they affect the performance of the disk subsystem.
The article will be of interest to everyone involved in testing, implementing and administering high-load systems. We will talk about the traditional and the new schedulers, analyze the results of synthetic tests and draw some conclusions about testing and choosing a scheduler for a given workload. Let's start with the theory.
Some theory
Traditionally, the Linux subsystem for working with block device drivers has provided two ways to deliver a request to a driver: with a queue and without one.
In the first case, a queue of requests to the device is used that is shared by all processes. On the one hand, this approach allows the use of different schedulers that can reorder requests within the shared queue, delaying some and speeding up others, and merge several requests to adjacent blocks into one to reduce overhead. On the other hand, while it speeds up request processing for a particular block device, this single queue becomes a bottleneck for the entire operating system.
In the second case, the request is sent directly to the device driver. This makes it possible to bypass the shared queue, but it also means that a queue has to be created and requests scheduled inside the device driver itself. This option is suitable for virtual devices that process requests and redirect them to real devices.
In the past few years, the speed at which block devices process requests has grown, and with it the requirements for the operating system and the data interface have changed.
In 2013, a third option was developed: the multiqueue block layer (blk-mq).
Let us describe the architecture of the solution in general terms. A request first enters a software queue; the number of software queues equals the number of processor cores. After passing through the software queue, the request goes to a dispatch queue. The number of dispatch queues depends on the device driver, which can support from 1 to 2048 of them. Since the scheduler works at the level of the software queues, a request from any software queue can go to any dispatch queue provided by the driver.
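On a live system this layout can be inspected through sysfs; a minimal sketch (the device name and the output are illustrative, and the mq directory exists only for devices driven through blk-mq):

$ ls /sys/block/nvme0n1/mq/              # one directory per hardware dispatch queue
0  1  2  3
$ cat /sys/block/nvme0n1/mq/0/nr_tags    # depth (number of tags) of dispatch queue 0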
Standard Schedulers
In most modern Linux distributions, three standard schedulers are available: noop, CFQ and deadline.
The name noop is short for "no operation", which suggests that this scheduler does nothing at all. In fact, that is not quite true: noop implements a simple FIFO queue and merges requests to adjacent blocks. It does not change the order of requests and relies on the underlying layer for that, for example on the scheduling capabilities of a RAID controller.
CFQ (Completely Fair Queuing) is more elaborate. It divides bandwidth between all processes: for synchronous requests a separate queue is created per process, while asynchronous requests are combined into queues by priority. In addition, CFQ sorts requests to minimize seeking across disk sectors. The time for executing requests is distributed between the queues according to priority, which can be configured with the ionice utility; you can change both the scheduling class of a process and its priority within the class. We will not dwell on the settings in detail; we only note that this scheduler is very flexible, and setting priorities on processes is a very useful way to keep administrative tasks from getting in the way of users.
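For example, lowering the I/O priority of a backup task might look like this (the PID and the command are placeholders; these priorities are honored by CFQ and, later, BFQ):

$ ionice -c 2 -n 7 -p 1234                         # best-effort class, lowest priority within it
$ ionice -c 3 tar czf backup.tgz /var/lib/mysql    # run a job in the idle class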
The deadline scheduler uses a different strategy, based on the time a request has spent in the queue. This way it guarantees that every request will eventually be served. Since most applications block on reads, deadline by default gives priority to read requests.
The blk-mq mechanism was developed after the advent of NVMe disks, when it became clear that the standard kernel tools did not provide the necessary performance. It was added in kernel 3.13, but the schedulers for it were written later. Recent Linux kernels (≥4.12) provide the following schedulers for blk-mq: none, BFQ, mq-deadline and kyber. Let's look at each of them in more detail.
Blk-mq schedulers: an overview
With blk-mq you can genuinely disable the scheduler: it is enough to set it to none.
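Checking which schedulers are available and switching the active one is done through sysfs (the device name is just an example; the scheduler in square brackets is the current one, and on some distributions bfq and kyber are built as modules that have to be loaded first with modprobe):

$ cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none
$ echo none | sudo tee /sys/block/sda/queue/scheduler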
Much has been written about BFQ (Budget Fair Queueing); we will only note that it inherits some of its settings and its main algorithm from CFQ, introducing the concept of a budget and a few more tuning parameters.
Mq-deadline is, as you might guess, an implementation of deadline on top of blk-mq.
The last option is kyber. It was written for fast devices. Using two queues, one for read requests and one for writes, kyber gives priority to reads over writes. The algorithm measures the completion time of every request and adjusts the actual queue sizes to achieve the latencies specified in its settings.
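Those target latencies are exposed as tunables under the iosched directory; a quick way to look at them (the values shown are, as far as we know, the defaults: 2 ms for reads and 10 ms for writes):

$ grep . /sys/block/sda/queue/iosched/*
/sys/block/sda/queue/iosched/read_lat_nsec:2000000
/sys/block/sda/queue/iosched/write_lat_nsec:10000000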
Tests
Introductory notes
Checking how a scheduler behaves is not so easy: a simple single-threaded test will not show convincing results. There are two options for testing: simulate a multi-threaded load using, for example, fio, or install a real application and see how it behaves under load.
We conducted a series of synthetic tests. All of them were carried out with the default settings of the schedulers themselves (these settings can be found in /sys/block/sda/queue/iosched/).
What case will we test?
We will create a multi-threaded load on the block device. We are interested in the lowest latency at the highest data transfer rate, and we assume that read requests have priority.
HDD tests
Let's start by testing the schedulers with an HDD.
For the HDD tests, the following server was used:
- 2 x Intel® Xeon® CPU E5-2630 v2 @ 2.40GHz
- 128 GB RAM
- 8 TB HGST HUH721008AL disk
- Ubuntu Linux 16.04, kernel 4.13.0-17-generic from the official repositories
Fio options
[global]
ioengine=libaio
blocksize=4k
direct=1
buffered=0
iodepth=1
runtime=7200
random_generator=lfsr
filename=/dev/sda1

[writers]
numjobs=2
rw=randwrite

[reader_40]
rw=randrw
rwmixread=40
The settings show that we bypass caching and set a queue depth of 1 for each process. Two processes write data to random blocks, and one both writes and reads: the process described in the reader_40 section issues 40% of its requests as reads and the remaining 60% as writes (the rwmixread option).
Other fio options are described in the man page.
The test duration is two hours (7200 seconds).
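For reference, a job file like the one above can be saved and run as follows (the file and log names here are arbitrary); between runs the active scheduler is switched as shown earlier and the test is repeated:

$ sudo fio hdd-sched.fio --output=hdd-cfq.log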
Test results
[Latency graphs for each scheduler: CFQ, deadline, noop, BFQ, mq-deadline, kyber, none]
Test results in the table:

Scheduler | Metric | writers (randwrite) | reader_40 (write) | reader_40 (read)
CFQ | bw | 331 KB/s | 210 KB/s | 140 KB/s
CFQ | iops | 80 | 51 | 34
CFQ | avg lat, ms | 12.36 | 7.17 | 18.36
deadline | bw | 330 KB/s | 210 KB/s | 140 KB/s
deadline | iops | 80 | 51 | 34
deadline | avg lat, ms | 12.39 | 7.2 | 18.39
noop | bw | 331 KB/s | 210 KB/s | 140 KB/s
noop | iops | 80 | 51 | 34
noop | avg lat, ms | 12.36 | 7.16 | 18.42
BFQ | bw | 384 KB/s | 208 KB/s | 139 KB/s
BFQ | iops | 93 | 50 | 33
BFQ | avg lat, ms | 10.65 | 6.28 | 20.03
mq-deadline | bw | 333 KB/s | 211 KB/s | 142 KB/s
mq-deadline | iops | 81 | 51 | 34
mq-deadline | avg lat, ms | 12.29 | 7.08 | 18.32
kyber | bw | 385 KB/s | 193 KB/s | 129 KB/s
kyber | iops | 94 | 47 | 31
kyber | avg lat, ms | 10.63 | 9.15 | 18.01
none | bw | 332 KB/s | 212 KB/s | 142 KB/s
none | iops | 81 | 51 | 34
none | avg lat, ms | 12.3 | 7.1 | 18.3
* Here and below, the writers column shows the median across the writer streams; the reader_40 columns are treated the same way. Latency values are in milliseconds.

Let's look at the results of the traditional (single-queue) schedulers first. The values obtained in the tests hardly differ from each other. Note that the latency bursts visible on the deadline and noop charts also occurred during the CFQ tests, although less often; the maximum latency reached as much as 11 seconds, regardless of the request type, write or read. With the blk-mq schedulers, bursts of that magnitude were not observed.

Things get much more interesting with the blk-mq schedulers. We are primarily interested in the latency of read requests. In the context of this task, BFQ performed worst: its maximum latency reached 6 seconds for writes and 2.5 seconds for reads. Kyber showed the smallest maximum latency, 129 ms for writes and 136 ms for reads; with none, the maximum latency was about 20 ms higher for all streams. For mq-deadline it was 173 ms for writes and 289 ms for reads.

As the results show, changing the scheduler did not give any significant reduction in latency. Still, the BFQ scheduler is worth noting, as it showed a good result in terms of the amount of data written and read. On the other hand, the graph obtained when testing BFQ shows a strangely uneven distribution of load on the disk, even though the load generated by fio is quite uniform.
SSD tests
For the SSD tests, the following server was used:
- Intel® Xeon® CPU E5-1650 v3 @ 3.50GHz
- 64 GB RAM
- 1.6 TB Intel SSD DC S3520 Series
- Ubuntu Linux 14.04, kernel 4.12.0-041200-generic (kernel.ubuntu.com)
Fio options
[global]
ioengine=libaio
blocksize=4k
direct=1
buffered=0
iodepth=4
runtime=7200
random_generator=lfsr
filename=/dev/sda1

[writers]
numjobs=10
rw=randwrite

[reader_20]
numjobs=2
rw=randrw
rwmixread=20
The test is similar to the previous one for the HDD, but it differs in the number of processes accessing the disk (10 write-only processes and two that both write and read), the write/read ratio of 80/20, and the queue depth. A 1598 GB partition was created on the drive; two gigabytes were left unused.
Test results
[Latency graphs for each scheduler: CFQ, deadline, noop, BFQ, mq-deadline, kyber, none]
Test results in the table:

Scheduler | Metric | writers (randwrite) | reader_20 (write) | reader_20 (read)
CFQ | bw | 13065 KB/s | 6321 KB/s | 1578 KB/s
CFQ | iops | 3265 | 1580 | 394
CFQ | avg lat, ms | 1.223 | 2.000 | 2.119
deadline | bw | 12690 KB/s | 10279 KB/s | 2567 KB/s
deadline | iops | 3172 | 2569 | 641
deadline | avg lat, ms | 1.259 | 1.261 | 1.177
noop | bw | 12509 KB/s | 9807 KB/s | 2450 KB/s
noop | iops | 3127 | 2451 | 613
noop | avg lat, ms | 1.278 | 1.278 | 1.405
BFQ | bw | 12803 KB/s | 10000 KB/s | 2497 KB/s
BFQ | iops | 3201 | 2499 | 624
BFQ | avg lat, ms | 1.248 | 1.248 | 1.398
mq-deadline | bw | 12650 KB/s | 9715 KB/s | 2414 KB/s
mq-deadline | iops | 3162 | 2416 | 604
mq-deadline | avg lat, ms | 1.264 | 1.298 | 1.423
kyber | bw | 8764 KB/s | 8572 KB/s | 2141 KB/s
kyber | iops | 2191 | 2143 | 535
kyber | avg lat, ms | 1.824 | 1.823 | 0.167
none | bw | 12835 KB/s | 10174 KB/s | 2541 KB/s
none | iops | 3208 | 2543 | 635
none | avg lat, ms | 1.245 | 1.227 | 1.376
Pay attention to the average read latency. Among all the schedulers, kyber stands out with the lowest latency, while CFQ shows the highest. Kyber was designed for fast devices and aims to reduce latency in general, giving priority to synchronous requests. Its read latency is very low, although the amount of data read is smaller than with the other schedulers (with the exception of CFQ).
Let's compare the amount of data read per second and the read latency between kyber and, for example, deadline. Kyber showed about 7 times lower read latency than deadline, at the cost of only a 1.2-fold decrease in read throughput. At the same time, kyber showed worse results for write requests: roughly a 1.5-fold increase in latency and a 1.3-fold decrease in throughput.
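For reference, these ratios follow from the table above; the write-throughput figure comes out close to 1.3 if the write bandwidth of all streams is summed, which is one plausible reading of the comparison:

read latency:    1.177 ms / 0.167 ms ≈ 7.0
read bandwidth:  2567 KB/s / 2141 KB/s ≈ 1.2
write latency:   1.824 ms / 1.259 ms ≈ 1.45
write bandwidth: (12690 + 10279) / (8764 + 8572) ≈ 1.3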
Our initial task was to get the lowest read latency with the least damage to throughput. Judging by the test results, kyber handles this task better than the other schedulers.

It is interesting to note that CFQ and BFQ showed relatively low write latency, but with CFQ the processes that only wrote to the disk received the highest priority. What conclusion can be drawn from this? Probably that BFQ distributes priority between requests more "honestly", as its developers claim.

The maximum latency was much higher for the *FQ schedulers and mq-deadline: up to ~3.1 seconds for CFQ and up to ~2.7 seconds for BFQ and mq-deadline. For the other schedulers, the maximum latency during the tests was 35-50 ms.
NVMe tests
For the NVMe tests, a Micron 9100 drive was installed in the same server that was used for the SSD tests. The disk layout is the same as for the SSD: a 1598 GB test partition and 2 GB of unused space. The fio settings were the same as in the previous test, except that the queue depth (iodepth) was increased to 8.
[global]
ioengine=libaio
blocksize=4k
direct=1
buffered=0
iodepth=8
runtime=7200
random_generator=lfsr
filename=/dev/nvme0n1p1

[writers]
numjobs=10
rw=randwrite

[reader_20]
numjobs=2
rw=randrw
rwmixread=20
Test results in the table:

Scheduler | Metric | writers (randwrite) | reader_20 (write) | reader_20 (read)
BFQ | bw | 45752 KB/s | 30541 KB/s | 7634 KB/s
BFQ | iops | 11437 | 7635 | 1908
BFQ | avg lat, ms | 0.698 | 0.694 | 1.409
mq-deadline | bw | 46321 KB/s | 31112 KB/s | 7777 KB/s
mq-deadline | iops | 11580 | 7777 | 1944
mq-deadline | avg lat, ms | 0.690 | 0.685 | 1.369
kyber | bw | 30460 KB/s | 27709 KB/s | 6926 KB/s
kyber | iops | 7615 | 6927 | 1731
kyber | avg lat, ms | 1.049 | 1.000 | 0.612
none | bw | 45940 KB/s | 30867 KB/s | 7716 KB/s
none | iops | 11484 | 7716 | 1929
none | avg lat, ms | 0.695 | 0.694 | 1.367
The graph shows the test results, with a pause at each scheduler change. Test order: none, kyber, mq-deadline, BFQ.

The table and the graph once again show kyber actively working to reduce latency: 0.612 ms versus 1.3-1.4 ms for the other schedulers. It is generally believed that there is no point in using any scheduler for NVMe disks, but if reducing latency is the priority and you can sacrifice some I/O operations, it makes sense to consider kyber. Looking at the graphs, you can also notice an increase in CPU load when using BFQ (tested last).
Conclusions and recommendations
This article is only a very general introduction to the topic, and all of our test data should be taken with the understanding that in real practice everything is much more complicated than under experimental conditions. A lot depends on many factors: the type of load, the file system used and much more. A great deal also depends on the hardware: the disk model and the RAID/HBA/JBOD configuration.
Speaking in general terms: if an application uses ioprio to prioritize specific processes, then choosing CFQ or BFQ is justified. But you should always start from the type of load and the composition of the entire disk subsystem. For some solutions the developers give very specific recommendations: for ClickHouse, for example, CFQ is recommended for HDDs and noop for SSDs. If the disk needs high bandwidth and the ability to perform more I/O operations, and average latency is not important, then you should look towards BFQ/CFQ, and for SSDs also noop and none.
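As a purely illustrative sketch, a per-device-type choice like this can be made persistent with a udev rule; the match patterns and scheduler names below are assumptions to adapt to a specific system and kernel (on blk-mq kernels the valid names would be bfq, mq-deadline, kyber or none instead):

# /etc/udev/rules.d/60-iosched.rules (sketch)
# rotational devices (HDD): use CFQ
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="cfq"
# non-rotational devices (SSD): use noop
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="noop"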
If you need to reduce latency, and each individual operation, especially a read, must complete as quickly as possible, then in addition to using an SSD you should use a scheduler designed for that: deadline, or one of the new ones, mq-deadline or kyber.
Our recommendations are general in nature and will not suit every case. When choosing a scheduler, in our opinion, the following points should be considered:
- The hardware also plays an important role: it is quite possible that a RAID controller uses its own built-in algorithms for scheduling requests; in this case choosing none or noop is fully justified.
- The type of load is very important: in "combat" rather than experimental conditions an application can show different results. The fio utility used for the tests keeps the load constant, but in real life applications rarely access the disk in such a steady, uniform way. The actual queue depth, averaged per minute, may stay in the range of 1-3 but rise at peaks to 10, 13 or even 150 requests; it all depends on the type of load.
- We only tested reading and writing of random blocks, but sequential access also matters to schedulers: with a linear load you can get higher throughput if the scheduler merges requests well.
- Each scheduler has options that can be tuned further; all of them are described in the scheduler documentation.
The attentive reader has probably noticed that we used different kernels for the HDD and the SSD/NVMe tests. The fact is that during testing on kernel 4.12 with the HDD, the behavior of the BFQ and mq-deadline schedulers looked rather strange: the latency would first drop, then grow and stay at very high values for several minutes. Since this behavior did not look adequate, and kernel 4.13 had come out by then, we decided to rerun the HDD tests on the new kernel.
Do not forget that scheduler code may change. It may turn out that in the next kernel release the mechanism that caused a particular scheduler to degrade performance on your workload has been rewritten and, at the very least, no longer reduces performance.
The behavior of I/O schedulers in Linux is a complex topic that can hardly be covered in a single publication. We plan to return to it in future articles. If you have experience testing schedulers in production, we will be happy to read about it in the comments.