File system load profiling with iostat and gnuplot - amateur notes

Often the work "in the field" requires the collection and analysis of information on the customer's site in the presence of a very meager list of utilities. In particular, to collect information on the use of input-output system during the day.

In the article I will try to show how, with only iostat and gnuplot, you can try to analyze the system and what conclusions you can make.

I do not pretend to thorough possession of the subject and the exact use of terms. Moreover, I will try to speak in “ordinary” language and not to rush in terms.
')
Everything described below is the fruit of experience, mistakes, googling, mana and other things.

Small educational program

In terms of evaluating the performance of an I / O system in terms of storage, there are several characteristics:

Bandwidth (bytes per second) - the maximum linear speed with which the device writes or reads one, but a very large file.
IOPS (operations per second) is the number of requests sent to the device. This is the speed of work with a large number of small files.
Processing time (milliseconds) - the time during which an average I / O request is processed, including waiting in a queue, and waiting is the most characteristic, because it depends on the load and not on the size of the request itself. The higher this time, the “slower” the response of the system to an application request, but other applications are not affected in principle.
Load (percent) is an estimate of the I / O load of the system. At values close to 100%, the system “rises”, including all applications. Processors go to iowait, and processes that want I / O to go to [Disk] Sleep

At the same time, it is necessary to understand that none of these parameters can give a complete profile of file system usage. It is difficult to see and evaluate the “clean” operation of the file subsystem, if only because there are various caches on the stack of its work: disk buffer, RAID controller cache, system cache. Accordingly, more or less extrapolated figures can be obtained only near the peak load. And the topic of proper testing of storage systems, comparison and evaluation of results is generally not a plowed field and you can write a dissertation.

Basic characteristics

Since measuring absolute numbers in general is not very grateful, I prefer to evaluate their value based on some basis, and after that, to use some kind of relative yardstick. And the experience of working with the DBMS hints that it is not the values that are worth comparing, but their orders.
For such a basis, I took the following figures for myself:
The average server hard disk is 10 thousand revolutions per second, connected via a SAS to a convenient RAID controller (with a cache on a battery):

has a bandwidth of about 300 megabytes per second with sequential access
produces about 150 iops (in parrots)
with a request processing time of about 10 ms

If all this lives in a RAID then, depending on the type of raid, the numbers multiply or are divided by the number of disks. For example: 4 disks in RAID 1 + 0 should show half the total iops of all four disks.

If the tests show that the system works, giving out about these numbers, it means that no one “tested” during testing, there are no miracles and you can continue to work. Otherwise, you should first figure out what's wrong with the hardware (maybe the RAID is restored or the screw is scattered).

Basically, I use the results of testing to assess the efficiency of working with the storage system and the impact of DBMS settings and our software on how they use it, as well as quick diagnosis of problems with DBMS.

Measurements

Training

To begin with, it is reasonable to save df -hl in order to understand which sections we are talking about.

# df -hl Filesystem Size Used Avail Use% Mounted on /dev/sda1 4.0G 3.9G 0 100% / /dev/sda7 102G 94G 2.7G 98% /SQL /dev/sda6 1012M 307M 654M 32% /var /dev/sda2 3.0G 69M 2.8G 3% /tmp /dev/sda5 34G 24G 8.8G 73% /ARCHIVE tmpfs 4.0G 26M 3.9G 1% /dev/shm

By the way, it is useful to remember which of the sections is swap.

It will be good if the physical device is divided into different logical partitions for each component. This way we will be able to evaluate the I / O system usage profile for each separated component.

Observation

Unlike other utilities, iostat is not very suitable for logging based on time. Well, nothing, awk to help.

Line to get the log file:

 iostat -xk -t 10 | awk '// {print strftime("%Y-%m-%d %H:%M:%S"),$0}' >> iostat.log &

Naturally, it should be run in the screenshot.

Attention should be paid to the -t 10 parameter. This is the interval over which values are averaged and “averages” are calculated in seconds.

If the value is too small, there will be extra peaks and outliers on the chart. If it is too large, there will be no significant peaks. In my opinion, the interval of 3-10 seconds is optimal.

Entering our log will be something like this:

 iostat -xk -t 10 | awk '// {print strftime("%Y-%m-%d %H:%M:%S"),$0}' 2013-01-14 11:31:04 Linux 2.6.32.32-m4.x86_64 (localhost.localdomain) 01/14/2013 2013-01-14 11:31:04 2013-01-14 11:31:04 Time: 11:31:04 AM 2013-01-14 11:31:04 avg-cpu: %user %nice %system %iowait %steal %idle 2013-01-14 11:31:04 0.67 0.00 1.02 1.11 0.00 97.20 2013-01-14 11:31:04 2013-01-14 11:31:04 Device: rrqm/s wrqm/sr/sw/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util 2013-01-14 11:31:04 sda 20.70 47.44 0.91 12.01 26.18 237.91 40.89 2.35 181.49 7.76 10.03 2013-01-14 11:31:04 sda1 0.00 0.83 0.08 0.67 0.94 6.00 18.47 0.05 61.99 15.81 1.19 2013-01-14 11:31:04 sda2 0.00 0.24 0.00 0.01 0.00 1.00 173.76 0.01 533.13 45.98 0.05 2013-01-14 11:31:04 sda3 0.00 0.00 0.00 0.00 0.01 0.02 120.95 0.00 33.78 25.52 0.00 2013-01-14 11:31:04 sda4 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 72.00 72.00 0.00 2013-01-14 11:31:04 sda5 7.45 2.07 0.18 1.17 3.82 12.99 24.88 0.06 41.68 13.86 1.87 2013-01-14 11:31:04 sda6 0.00 0.15 0.00 0.28 0.11 1.72 12.85 0.01 40.63 18.86 0.54 2013-01-14 11:31:04 sda7 13.24 44.15 0.65 9.87 21.29 216.18 45.16 2.23 211.42 7.91 8.32 2013-01-14 11:31:04 2013-01-14 11:31:14 Time: 11:31:14 AM 2013-01-14 11:31:14 avg-cpu: %user %nice %system %iowait %steal %idle 2013-01-14 11:31:14 0.61 0.00 1.06 1.20 0.00 97.12 2013-01-14 11:31:14 2013-01-14 11:31:14 Device: rrqm/s wrqm/sr/sw/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util 2013-01-14 11:31:14 sda 0.00 45.40 0.00 10.50 0.00 219.20 41.75 1.69 156.43 8.30 8.72 2013-01-14 11:31:14 sda1 0.00 0.80 0.00 0.30 0.00 4.40 29.33 0.01 25.33 25.33 0.76 2013-01-14 11:31:14 sda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2013-01-14 11:31:14 sda3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2013-01-14 11:31:14 sda4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2013-01-14 11:31:14 sda5 0.00 1.30 0.00 1.00 0.00 9.20 18.40 0.03 27.80 8.10 0.81 2013-01-14 11:31:14 sda6 0.00 0.70 0.00 1.10 0.00 7.20 13.09 0.02 18.55 10.64 1.17 2013-01-14 11:31:14 sda7 0.00 42.60 0.00 8.10 0.00 198.40 48.99 1.63 195.89 8.15 6.60

The first block is the measurement for the period of work since the start. The rest of the time specified interval.

We run this logging on a certain interval (for example, a couple of days, and then we ask to send a log.

Analysis

Well, according to these figures can be said?

First of all, pay attention to the last column% util. If only because the large value of this parameter (90% -100%) for the user looks like a “server hangs”. We have sda loaded less than 9%. Fine. And basically the section sda7 (SQL).

Further write / read options:

column capacity rkB / s, wkB / s. We write 219 KB / s
iops r / s columns, w / s - 10.5 requests per second
request processing time: await - 156 ms, of which the actual execution of svctm is 8 ms

What it says:
The disk is not loaded, the numbers are not large. But requests stand in a queue for 200ms, the time is acceptable, but at “the first shukher will crack,” the subsystem itself is lousy. Although it is impossible to say unequivocally? for again, the disk is not heavily loaded, and in no hurry to send data from the queue.

What else is useful: rrqm / s, wrqm / s is the actual number of requests that applications were asked to perform, but the system was able to combine them into blocks, on average, avgrq-sz bytes.

These figures will help to optimize the application:

If you open 100,500 files, start randomly jumping through open files and doing other extras, eat out the system cache - r / s, w / s will start to rrqm / s and wrqm / s, and avgrq-sz to fall. This is a situation of pure Random Access - the slowest disk usage scenario.
And if everything is consistent, then they will strive for edinichke (and preferably to zero), and avgrq-sz grow. The only request per interval is pure water. Sequential Access is the fastest disk usage scenario.

It is clear that it is always difficult, and often impossible, to achieve consistent work, but by comparing these figures, it is possible to evaluate whether the change in the logic of working with a disk has improved or worsened the load on it.

Charts

Actually watching the log eyes with his eyes - ungrateful. Therefore, I wrote several “scripts” of bash and gnuplot, a little bit of logging and plotting.

Plotting one device for the 1st parameter plus smooth bezier:

 #!/bin/bash #echo "usage: $0 <iostat.log> <disk name> <out.png> <title> <yaxis-title> <column1> <title1> echo "procesing $1 for device $2, plotting $3 ($4 - $7)" cat $1 | grep "$2 " > dat.dat gnuplot <<_EOF_ set terminal png set output "$3" set title "$4" set xdata time set timefmt "%Y-%m-%d %H:%M:%S" set format x "%H:%M\n%d/%m" set xtics nomirror scale 3,2 set ylabel "$5" set samples 60 plot "dat.dat" using 1:$6 title "$7" with impulses, \ "dat.dat" using 1:$6 title "$7 avg" smooth bezier lw 2 _EOF_ rm dat.dat

And the same schedule for one device by 2 parameters:

 #!/bin/bash echo "procesing $1 for device $2, plotting $3 ($4 - $7, $9)" cat $1 | grep "$2 " > dat.dat gnuplot <<_EOF_ set terminal png set output "$3" set title "$4" set xdata time set timefmt "%Y-%m-%d %H:%M:%S" set format x "%H:%M\n%d/%m" set xtics nomirror scale 3,2 set ylabel "$5" set samples 10 plot "dat.dat" using 1:$6 title "$7" with lines, \ "dat.dat" using 1:$8 title "$9" with lines _EOF_ rm dat.dat

Well, since there are a lot of devices, and there are several servers, and laziness to start with pens, I wrote a couple more primitive scripts: plot_info.sh

 #!/bin/bash echo "usage: $0 <n> <flitle> <title> <y-axis> <column1> <column-title> <column2> <column-title>" echo "$0 2 rw 'Read/write iops per second' 'iops' 4 'read request/s' 5 'write request/s'" #/dev/cciss/c0d0p1 20G 12G 7.2G 61% / #/dev/cciss/c0d0p6 476G 213G 239G 48% /ARCHIVE #/dev/cciss/c0d0p5 9.7G 264M 8.9G 3% /var #/dev/cciss/c0d0p3 9.7G 167M 9.0G 2% /TMP #/dev/cciss/c0d1p1 559G 39G 521G 7% /SQL #/dev/cciss/c0d2p2 1020G 320G 701G 32% /SQL/file #/dev/cciss/c0d2p1 98G 1.9G 96G 2% /SQL/pgsql/data/pg_xlog ./graph$1.sh ./log201.txt c0d0p2 201/c0d0p2.swap.$2.png "swap $3" "$4" "$5" "$6" "$7" "$8" ./graph$1.sh ./log201.txt c0d0p3 201/c0d0p3.tmp.$2.png "/TMP $3" "$4" "$5" "$6" "$7" "$8" ./graph$1.sh ./log201.txt c0d1p1 201/c0d1p1.sql.$2.png "/SQL $3" "$4" "$5" "$6" "$7" "$8" ./graph$1.sh ./log201.txt c0d2p1 201/c0d2p1.pg_xlog.$2.png "pg_xlog $3" "$4" "$5" "$6" "$7" "$8" ./graph$1.sh ./log201.txt c0d1 201/dev.c0d1-sql.$2.png "dev:/SQL $3" "$4" "$5" "$6" "$7" "$8" ./graph$1.sh ./log201.txt c0d2 201/dev.c0d2-xlog_file.$2.png "dev:pg_xlog+file $3" "$4" "$5" "$6" "$7" "$8"

and task.sh

 #!/bin/bash ./plot_info.sh 3 iops_load 'Estimated IOPS' 'iops_limit' '(100*($7+$6)/$14)' 'iops' ./plot_info.sh 0 wrratio 'Write percent' 'wratio' '($9*100/($10+$9))' '%' ./plot_info.sh 0 qwaitratio 'Queue wait ratio' 'qwaitratio' '($13*100/$12)' '%' ./plot_info.sh 2 rwqm 'Request merged and queued to device' 'rqm/s' 4 'read rqm/s' 5 'write rqm/s' ./plot_info.sh 2 request 'Request issued to device' 'requests' 6 'r/s' 7 'w/s' ./plot_info.sh 2 rwspeed 'Read/write speed (kB per sec)' 'speed' 8 'read kB/s' 9 'write kB/s' ./plot_info.sh 1 rsize 'Average request size' 'avgrq-sz' '($10*4)' 'kB' ./plot_info.sh 1 qsize 'Average query size' 'avgqu-sz' 11 'queries' ./plot_info.sh 2 await 'Average request servicing time' 'await' "'12'" 'waiting msec' 13 'servicing msec' ./plot_info.sh 1 util 'Bandwidth utilization' '%util' 14 '%' ./plot_info.sh 1 iops 'Average IOPS' 'iops' "'7+6'" 'iops'

For the code, please do not kick it was written to work quickly and on the knee (for which the actual scripts were invented by smart people).

As a result, we get nice graphics.

Summarizing

As a summary, a couple of “successful stories”

Usually the user has only one symptom: everything slows down. Below I will give examples of how the analysis of logs iostat helped to identify the problem, or indicate that the problem is not in the disk.

First story

In iostat, we saw abnormally high rKB / s values when executing the simplest queries, including queries with LIMIT 1 (tens and hundreds of megabytes) - we clearly read more than necessary.
From this, we could eventually show that we did not work out the auto-vacuum (although it was launched). A further vacuum log showed that max_fsm_page is two orders of magnitude lower than necessary.

Second story

A long commit transaction of the order of 60 seconds containing a total of a dozen DDL queries. All sinned at a high "background" load, and this type of brakes. But iostat showed in every way that at that time there was no additional anomalous load on the disk. As a result, the PostgreSQL 8.3 DBMS is connected with a large number (pool) of unused connections ...

A little about what it is all about

It so happened that the product we are developing includes some kind of storage based on PostgreSQL DBMS. And this database lives without any supervision from the DBA, and our support does not give access to it either. There are a few dozens of clients reading this database. But the record clients, though not many (up to a hundred), they only write 24/7 when hundreds of records per second, and when a thousand or another. Well, as the great Murphy bequeathed - everything always works on a synthetic lab in the test lab, but in the field on real data "everything is terribly slow."

As a result, the problem arises of collecting the information about the operation of the complex from the fields by phone and finding out what it is missing. And these hands are usually unqualified, the maximum run PuTTY and WinSCP, some sort of skriptik, send logs. On more you can not count on.

Another question - the complex is complex, there is no excess iron. Of course, you can add additional software, but we all understand that even free software is not free, and the customer does not want to pay for additional installation, configuration and support. And again, an extra component in the system does not contribute to an increase in reliability.

As a result, we have:
Server level HP DL380G7 with PostgreSQL DBMS living on it and tying from every client written in C ++.
An operating system based on truncated Fedora or CentOS distributions.
Storage on the basis of a normal RAID controller with a dozen SAS screws.
Of the utilities, only the very minimum is available, including iostat and gnuplot.

Source: https://habr.com/ru/post/165855/

All Articles