
Backing up LVM2 volumes with I/O overload protection using SIGSTOP and SIGCONT signals

Configuring backups is one of an administrator's most important tasks. Depending on the backup goals, the applications involved and the type of data, backups can be made with a wide variety of tools, such as rsync, duplicity, rdiff-backup, bacula and many others.


Beyond implementing a backup process that meets the needs of the organization, a number of problems inevitably arise during backups. One of them is increased load on the disk subsystem, which can degrade application performance.


This problem has no simple solution. The administrator is often forced into compromises: either the procedure takes longer, or the backup frequency drops from daily to weekly. These tradeoffs are a forced response to existing technical limitations.


Nevertheless, the main question remains open: how do you back up in such a way that core applications still receive acceptable quality of service? UNIX-family operating systems provide standard mechanisms for managing the I/O priorities of applications, such as ionice on Linux, and specific implementations provide their own mechanisms for imposing additional restrictions. For example, GNU/Linux has the cgroups mechanism, which can limit bandwidth (for physically connected devices) and set a relative priority for a group of processes.


However, in some cases such solutions are not enough, and it is necessary to focus on the actual "well-being" of the system, as reflected in parameters such as Load Average or %IOWait. In that case, an approach that I have been applying successfully for quite a long time when backing up data from LVM2 with dd can help.


Task Description


There is a GNU/Linux server whose storage is configured with LVM2. Every night a volume backup runs on this server: a snapshot of the volume is created and copied with dd + gzip:


ionice -c3 dd if=/dev/vg/volume-snap bs=1M | gzip --fast | ncftpput ... 

I want the backup to finish as quickly as possible, but I have found empirically that once %IOWait rises to 30%, the quality of service the disk system provides to applications becomes unacceptable, so %IOWait must be kept below that level. What is needed is a limiting mechanism that lets the backup run at full speed while keeping %IOWait within the permissible maximum.


Finding a solution


The first attempt relied on ionice -c3, but it did not give a stable result. Mechanisms based on cpipe and cgroups (throttling) were rejected because they cap the copy speed even when %IOWait is normal. In the end, the chosen solution monitors %IOWait with the sar statistics tool and suspends or resumes the dd process using the SIGSTOP and SIGCONT signals.
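As an illustration of the monitoring half, here is a minimal sketch of pulling the average %IOWait out of sar output. It assumes the usual sysstat layout under LANG=C (a header line containing CPU and %iowait, and a final "Average: all ..." line). The function name sar_avg_iowait is made up for this example. It finds the column from the header rather than hard-coding a field number.

```shell
#!/bin/sh
# sar_avg_iowait: read `LANG=C sar 1 N` output on stdin and print the
# average %iowait. Hypothetical helper, not part of sysstat.
sar_avg_iowait() {
    awk '
        # Header line: remember how far %iowait sits to the right of CPU.
        # The offset is used because the header starts with a timestamp
        # (one or two fields), while the summary starts with "Average:".
        off == 0 {
            c = 0; w = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "CPU")     c = i
                if ($i == "%iowait") w = i
            }
            if (c && w) off = w - c
        }
        # Summary line: the CPU column ("all") is field 2 here.
        $1 == "Average:" && $2 == "all" && off { print $(2 + off) }
    '
}
```

It would be called as `LANG=C sar 1 10 | sar_avg_iowait`, which blocks for about ten seconds and then prints a single number.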


Solution


Schematically, the solution is as follows:


  1. Collect statistics for N seconds and take the average value of %IOWait;
  2. Determine the action:
    a. If %IOWait is < 30, resume the process (SIGCONT) and reset the counter;
    b. If %IOWait is > 30, stop the process (SIGSTOP) and increment the counter;
  3. If the process has been stopped for longer than N x K seconds, resume it and stop it again after 2 seconds.

Item 3 most likely raises questions. Why such a strange action? As part of the backup, data is transferred over FTP to a remote server, and if copying is paused for long enough, the connection can be lost to a timeout. To prevent this, the copy process is forcibly resumed and then stopped again, even while we are in the red zone.
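The decision steps can be sketched as two small functions: one that chooses an action and one that applies it. This is only an illustration under assumptions, not the author's script. The names decide_action and apply_action, the PULSE label for the keep-alive resume, and the KILL and PULSE_SECONDS variables are all invented here. A value exactly equal to the threshold is treated as low. Signals go to a single PID, so unrelated processes are not affected.

```shell
#!/bin/sh
# decide_action IOWAIT THRESHOLD STOPPED MAX_STOPPED
#   IOWAIT       average %IOWait over the last interval
#   THRESHOLD    permitted maximum (30 in the text)
#   STOPPED      consecutive intervals already spent stopped
#   MAX_STOPPED  K: number of stopped intervals that triggers a pulse
# Prints CONT, STOP or PULSE.
decide_action() {
    awk -v v="$1" -v t="$2" -v n="$3" -v k="$4" 'BEGIN {
        if (v + 0 <= t + 0)
            print "CONT"
        else if (n + 1 >= k + 0)
            print "PULSE"
        else
            print "STOP"
    }'
}

# apply_action ACTION PID: send the matching signal(s) to one process.
# KILL may be overridden (for example KILL="echo kill") for a dry run.
apply_action() {
    case $1 in
        CONT)  ${KILL:-kill} -s CONT "$2" ;;
        STOP)  ${KILL:-kill} -s STOP "$2" ;;
        PULSE) # brief resume so the FTP connection does not time out
               ${KILL:-kill} -s CONT "$2" &&
               sleep "${PULSE_SECONDS:-2}" &&
               ${KILL:-kill} -s STOP "$2" ;;
        *)     echo "unknown action: $1" >&2; return 1 ;;
    esac
}
```

A caller would keep the stopped-interval counter itself: reset it to 0 after CONT or PULSE and increment it after STOP. The PID could come, for instance, from `pgrep -n -x dd` run right after the pipeline starts.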

The solution code is shown below.


#!/bin/bash
INTERVAL=10
CNTR=0
while :
do
    CUR_LA=`LANG=C sar 1 $INTERVAL | grep Average | awk '{print $6}' | perl -pe 'if ($_ > 30) { print "HIGH "} else {print "LOW "}'`
    echo $CUR_LA
    MARKER=`echo $CUR_LA | awk '{print $1}'`
    if [ "$MARKER" = "LOW" ]
    then
        CNTR=0
        pkill dd -x --signal CONT
        continue
    else
        let "CNTR=$CNTR+1"
        pkill dd -x --signal STOP
    fi
    if [ "$CNTR" = "5" ]
    then
        echo "CNTR = $CNTR - CONT / 2 sec / STOP to avoid socket timeouts"
        CNTR=0
        pkill dd -x --signal CONT
        sleep 2
        pkill dd -x --signal STOP
    fi
done

This solution successfully solved the problem of I/O overload on the server without restricting the copy speed severely, and it has served faithfully for several months, whereas solutions based on the mechanisms designed for this purpose did not give a positive result. Note that the value obtained from sar can easily be replaced with Load Average or other parameters that correlate with service degradation. The script is also suitable for backups that do not use LVM2 + dd but, for example, rsync or other backup tools, as long as the process name passed to pkill is changed accordingly.
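The Load Average variant mentioned above might look like the following sketch. It assumes the Linux /proc/loadavg format, whose first field is the 1-minute load average. The helper name load_exceeds and the limit of 8 in the usage line are invented for illustration, and a suitable limit depends on the machine.

```shell
#!/bin/sh
# load_exceeds LIMIT: read /proc/loadavg-formatted text on stdin and
# succeed (exit status 0) when the 1-minute load average is above LIMIT.
# Empty input counts as "not exceeded".
load_exceeds() {
    awk -v limit="$1" '
        NR == 1 { high = ($1 + 0 > limit + 0) }
        END     { exit !high }
    '
}
```

In the monitoring loop, the sar pipeline could then be replaced by something like `if load_exceeds 8 < /proc/loadavg; then MARKER=HIGH; else MARKER=LOW; fi`, followed by a `sleep` for the chosen interval, since reading /proc/loadavg returns immediately.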


With cgroups, the same approach can be used to apply a bandwidth limit instead of a full stop, provided the data is being copied from a physical block device.
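A rough sketch of that idea follows. It assumes cgroup v2 with the io controller enabled, where a group's io.max file accepts lines of the form "MAJOR:MINOR rbps=BYTES" and the value "max" removes the limit. The function names, the device numbers and the cgroup path in the usage line are hypothetical.

```shell
#!/bin/sh
# pick_rbps IOWAIT THRESHOLD FAST SLOW
# Choose a read-bandwidth value: SLOW while %IOWait is above THRESHOLD,
# FAST otherwise (FAST may be "max" to lift the limit entirely).
pick_rbps() {
    awk -v v="$1" -v t="$2" -v fast="$3" -v slow="$4" \
        'BEGIN { print ((v + 0 > t + 0) ? slow : fast) }'
}

# io_max_line MAJOR MINOR RBPS
# Print one line in the cgroup v2 io.max format for a block device.
io_max_line() {
    printf '%s:%s rbps=%s\n' "$1" "$2" "$3"
}
```

With the backup processes placed in a cgroup, each monitoring interval would end with something like `io_max_line 8 0 "$(pick_rbps "$iowait" 30 max 10485760)" > /sys/fs/cgroup/backup/io.max`, which requires root and an existing group at that path.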


PS: The script is shown without editing in its original form.



Source: https://habr.com/ru/post/332614/

