Since the beginning of this year, our cloud-based VDS has had a public API. It allows the client to perform almost all the same actions with cloud virtual machines and disks as in the control panel: create and delete disks and VMs, change tariffs, resize disks, and so on.

Along with the API came the idea of building on top of it a system that monitors the resources of a virtual machine (VM) from inside the machine and automatically increases or decreases them as needed: autoscaling (AS).
Some explanations on AS
It is worth clarifying here that since the AS system is based on the API, its task does not include instant provisioning of resources on demand, at the very moment the need for them arises, nor guessing or predicting future resource needs. The essence of AS is that it should detect the moment when it can be said with confidence that the resources of the current tariff plan are not enough for guaranteed* and timely** execution of the processes running in the virtual machine, and automatically transfer the VM to the next tariff.
* Guaranteed, because if the VM's RAM is close to exhaustion and no swap is configured on the VM, the situation is close to the point where some of the processes running on the VM will be abnormally terminated by the operating system once the total memory consumption of all processes exceeds the available volume. If swap is configured, then until it too is exhausted nobody will be killed, but the VM's performance will be very poor, because it will depend on the speed of the swap partition, which in any case is an order of magnitude lower than the speed of RAM.
** Timely, because if the execution of the running processes is limited by the CPU resource, the processes will be executed one way or another, but their completion time becomes unpredictable.
In order to understand that the situation is close to critical, the AS must observe it for some time. After all, an instantaneous short-term spike in load is most likely not a good reason to change the tariff. It should therefore be kept in mind that by the time the AS understands the situation is critical, something bad may have already happened (some visitor spent a minute waiting for a page of your site to load, or some processes on your VM have already been killed by the OOM Killer, the Linux kernel subsystem that chooses which process to sacrifice so that the rest of the system can keep working when RAM is exhausted). In other words, AS is not a silver bullet guaranteed to protect your services from a lack of resources. But it is configured to minimize the negative consequences of resource exhaustion by expanding resources in time. It will certainly do so faster than you would notice, or even learn, that problems have appeared on your server and manage to reach the VM control panel to expand the resources by hand. Moreover, in most cases (it all depends on the profile of the growing load), the AS switches the tariff before the shortage of resources has time to manifest itself in any obviously negative way.
At the same time, if no critical resource situation is observed, the AS tries to understand whether the current tariff is really necessary, or whether a cheaper one would be enough, and if so, it will promptly transfer you to a more economical tariff without hurting the quality of your VM. The same goes for the disk: we try to expand it before it runs out of space. In the case of a disk, the automation works only upwards. Shrinking a file system is an offline operation that requires stopping the virtual machine; you can still do it through the control panel when you need to.
AS can be enabled separately for different criteria. For example, you can enable autoscaling on the memory criterion but not on the CPU, if you think that the speed of the processes running inside your VM is not as critical as the stability of their work and does not justify the extra cost. Say you have some heavy, long-running process on your virtual machine, such as loading a large amount of data into a database or another scheduled processing of a large data set: it may not matter much whether it finishes in an hour or two, but it is much more important that it finishes at all and is not killed by the operating system because of a lack of RAM.
Or, if a small application runs on your VM and you are sure it does not need extra memory and CPU, but you are afraid that its constantly accumulating logs may fill the entire disk and stop an important process, you can turn on automatic disk expansion and choose the maximum size to which you are willing to let it grow. For scaling by CPU and/or memory, you likewise set the upper and lower limits of resource change within the line of tariff plans.
Every time the tariff is changed or the disk is expanded, our control panel automatically sends a notification email to your contact address describing what was done and why. So if you think the situation is abnormal, you can immediately log in to the server and find out why the load has grown or what exactly has filled up the disk.
Let's take a closer look at how it all works. In our case, autoscaling is a set of Python scripts run by cron. On startup they read the history of previous measurements, perform the necessary system diagnostics, save the updated history for the next run, evaluate the collected data and, if necessary, take action to change the tariff or expand the disk, or simply exit.
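As an illustration, the overall skeleton of such a cron job might look roughly like this (a minimal sketch only; the history file path and function names such as collect_metrics are assumptions, not the actual implementation):

#!/usr/bin/env python3
# Minimal sketch of an autoscaling cron job; paths and names are hypothetical.
import json, os, time

HISTORY_FILE = "/var/lib/autoscaling/history.json"   # assumed location

def load_history():
    if os.path.exists(HISTORY_FILE):
        with open(HISTORY_FILE) as f:
            return json.load(f)
    return []

def save_history(history):
    with open(HISTORY_FILE, "w") as f:
        json.dump(history, f)

def collect_metrics():
    # read /proc/meminfo, /proc/stat, df and so on (see the sketches further below)
    return {"ts": time.time(), "mem_free_pct": 42.0, "cpu_idle_pct": 73.0}

def main():
    history = load_history()
    history.append(collect_metrics())
    history = history[-10:]          # keep only the most recent measurements
    save_history(history)
    # a decide(history) step would call the panel API if thresholds are exceeded

if __name__ == "__main__":
    main()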
Disk autoscaling.
Several disks can be connected to a virtual machine. Autoscaling can be enabled separately for each disk, and for each disk you can separately set the limit to which it may be expanded automatically.
Limitations: By default, a new disk comes with the ext4 file system, but the user can later put any other file system, partition table, LVM, or anything else on their disk. Since the file system, like the disk itself, should expand automatically (otherwise the point of AS is lost), there is a risk of performing destructive actions on whatever is installed on the disk, which in the worst case could even lead to data loss. Therefore, we decided to limit AS to working only with the standard ext4 file system.
We also do not allow disk autoscaling on CentOS 6-based systems, since they use the too-old 2.6 kernel, in which udev does not work correctly, so the file system is not automatically expanded after the disk is grown.
Mechanism of operation: Disk autoscaling is started by cron every 7 minutes. Information about all disks connected to the system is collected from all available sources:
1. When you configure AS in the panel, a config file in json format is sent to the virtual machine as /etc/autoscaling/autoscaling.conf. It lists all disks connected to this VM (their uid), whether autoscaling is enabled for each of them, and the maximum size to which each may be expanded;
2. Inside the VM, the list of all disks visible to the system is taken from /sys/class/block;
3. Disk sizes are taken from /proc/partitions;
4. From /proc/mounts we extract the file system type and whether it is mounted read-write;
5. From the df output we obtain the size of the file system and how much free space is left on it.
As a result of collecting this information, we should end up with something like a complete table for all disks:

disk  id    disk_size  max_size  autoscale_enabled  fs_type  is_writable  fs_size  fs_free
-------------------------------------------------------------------------------------------
vda   1081  5          40        True               ext4     True         5        1.2
vdb   None  0.1        None      None               iso9660  False        0.1      0.1
vdc   1082  10         45        True               ext4     True         10       6.2
vdd   1234  5          15        False              ext4     True         5        0.3
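For illustration, gathering this information from inside the VM could be sketched roughly as follows (a sketch only; the helper names are made up, and the real script additionally matches the disks visible in /sys/class/block against the uids from the panel config):

# Sketch of collecting per-disk data; function names and merging logic are hypothetical.
import json, subprocess

def read_config():
    # which disks have AS enabled and up to what size (structure as described above)
    with open("/etc/autoscaling/autoscaling.conf") as f:
        return json.load(f)

def read_mounts():
    mounts = {}
    with open("/proc/mounts") as f:
        for line in f:
            dev, mountpoint, fstype, opts = line.split()[:4]
            mounts[dev] = {"mountpoint": mountpoint,
                           "fs_type": fstype,
                           "is_writable": "rw" in opts.split(",")}
    return mounts

def read_sizes():
    sizes = {}
    with open("/proc/partitions") as f:
        for line in f.readlines()[2:]:             # skip the header and the blank line
            major, minor, blocks, name = line.split()
            sizes[name] = int(blocks) * 1024       # sizes are given in 1 KiB blocks
    return sizes

def read_df(mountpoint):
    line = subprocess.check_output(["df", "-B1", mountpoint]).decode().splitlines()[1]
    _, fs_size, _, fs_free = line.split()[:4]
    return int(fs_size), int(fs_free)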
If we managed to collect all the necessary information for a disk, it has a suitable file system (ext4) on it, and AS is enabled for it, a check is performed: what percentage of free space is left on the file system. If this value is below the 10% threshold, an API call is sent to the panel to enlarge the disk. The panel sends the corresponding command to the master server where this VM is running (and also sends the user an email to the contact address about this event with all the necessary details). When the disk is expanded, udev inside the user's system picks up this event and starts expanding the file system (resize2fs).
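The check itself then boils down to something like the following (a sketch; the 10% threshold is the one mentioned above, while the panel API endpoint and request format here are purely hypothetical):

# Hypothetical sketch of the free-space check; the real API endpoint and payload differ.
import requests

FREE_SPACE_THRESHOLD = 10.0   # percent, as described above

def check_disk(disk, api_token):
    free_pct = disk["fs_free"] * 100.0 / disk["fs_size"]
    if disk["autoscale_enabled"] and free_pct < FREE_SPACE_THRESHOLD:
        if disk["disk_size"] < disk["max_size"]:
            # Ask the panel to grow the disk; udev inside the VM will then
            # pick up the size change and run resize2fs on the filesystem.
            requests.post("https://panel.example/api/disks/%d/resize" % disk["id"],
                          headers={"Authorization": "Bearer " + api_token},
                          json={"grow": True})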
All AS actions are logged inside the virtual machine in /var/log/syslog and in the panel logs.
Tariff autoscaling.
To change the memory or CPU, a switch to another tariff is used. We have a line of tariffs, which can be retrieved via the API in json format.
At the moment it looks like this:

[
  { "memsize": 1024, "name": "tiny", "ncpu": 2 },
  { "memsize": 2048, "name": "small", "ncpu": 2 },
  { "memsize": 4096, "name": "medium", "ncpu": 4 },
  { "memsize": 8192, "name": "large", "ncpu": 4 },
  { "memsize": 8192, "name": "xl8", "ncpu": 8 },
  { "memsize": 16384, "name": "xl16", "ncpu": 8 }
]
As you can see, some transitions to the neighboring tariff do not increase or decrease one of the parameters (ncpu/memsize). For example, switching from TINY to SMALL increases only the amount of memory but does not add CPU. To get more CPU while on TINY, you need to go straight to MEDIUM. This is taken into account when choosing the target tariff.
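Choosing the target tariff from this list can be illustrated as follows (a sketch; it simply walks up the line until it finds a tariff that actually adds the resource that is lacking):

# Sketch: pick the nearest tariff above the current one that adds CPU (or memory).
def next_tariff_up(tariffs, current_name, need_cpu=False, need_mem=False):
    # tariffs are assumed to be ordered from smallest to largest, as in the API output
    idx = next(i for i, t in enumerate(tariffs) if t["name"] == current_name)
    cur = tariffs[idx]
    for t in tariffs[idx + 1:]:
        if (not need_cpu or t["ncpu"] > cur["ncpu"]) and \
           (not need_mem or t["memsize"] > cur["memsize"]):
            return t["name"]
    return None   # already at the top of the line

# e.g. next_tariff_up(tariffs, "tiny", need_cpu=True) returns "medium"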
Tariff autoscaling. Memory.
Information on memory consumption is taken from /proc/meminfo.
Based on the 'MemTotal', 'MemFree', 'Buffers', 'Cached' and 'Shmem' parameters, we find how much memory is currently used and cannot be freed if needed:

'Used' = 'MemTotal' - 'MemFree' - 'Buffers' - 'Cached' + 'Shmem'

'Buffers' is memory reserved for temporary storage of small disk blocks. It is used to speed up disk I/O, and this memory can be quickly released by the operating system if necessary (when some process needs more memory than is currently unused).
'Cached' is the cache of files read from disk. This memory can also be quickly released.
But there is also 'Shmem': shared memory, used as the fastest method of interprocess communication (IPC). In /proc/meminfo, 'Shmem' is counted as part of 'Cached', but this part of memory cannot be quickly released while some process is using it.
Then the percentage of memory that is free or can be freed on demand can be calculated as:

('MemTotal' - 'Used') * 100 / 'MemTotal'

or, equivalently,

('MemFree' + 'Buffers' + 'Cached' - 'Shmem') * 100 / 'MemTotal'
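In code, this calculation might look roughly like this (a sketch of parsing /proc/meminfo; the file layout is standard, the function names are ours):

# Sketch: percentage of memory that is free or can be quickly reclaimed.
def meminfo():
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            values[key] = int(rest.split()[0])   # values are given in kB
    return values

def free_memory_pct():
    m = meminfo()
    reclaimable_free = m["MemFree"] + m["Buffers"] + m["Cached"] - m["Shmem"]
    return reclaimable_free * 100.0 / m["MemTotal"]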
To make the decision to switch to a higher tariff, we take measurements once a minute, each time keeping the history of previous measurements. If the average free memory over the last 5 minutes is below the 20% threshold, we consider that the user's VM has steadily entered the critical zone, and to avoid further problems it is time to switch to the next tariff with more memory. The corresponding API call is made. The tariff changes almost instantly, without a reboot, and your system immediately gets significantly more memory. The user is sent an email to the contact address about the event.
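The decision step on top of the measurement history could then be sketched as follows (the threshold and window are the ones from the text; the history format is an assumption):

MEM_THRESHOLD_PCT = 20.0   # average free memory over the last 5 minutes
WINDOW = 5                 # one measurement per minute

def should_scale_up_memory(history):
    recent = [h["mem_free_pct"] for h in history[-WINDOW:]]
    return len(recent) == WINDOW and sum(recent) / WINDOW < MEM_THRESHOLD_PCT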
When the tariff changes, the accumulated measurement history is reset to zero so that it does not affect new measurements under the new conditions. Thus, the next tariff change will happen no sooner than a new measurement history has accumulated.
Here too, all AS actions are logged in /var/log/syslog and in the panel logs.
Tariff autoscaling. CPU.
Statistics on CPU consumption are collected once a minute, and the history over the last 10 minutes is taken into account. CPU statistics are taken from /proc/stat, but interpreting them is somewhat more complicated than in the case of memory.
The fact is that, for more accurate resource allocation, we do not give the virtual machine actual cores of the master server's processor; instead, any virtual machine always sees 12 cores of the master server, and with cgroups we limit what percentage of one core this virtual machine's process may consume (cgroups is a Linux kernel technology that allows limiting the resources available to OS processes or groups of processes). So if the MEDIUM tariff provides 4 CPU cores for a VM, this means that 12 actual cores are presented, and cgroups limits CPU consumption to four hundred percent of one core. This approach gives advantages in the accuracy and flexibility of resource provisioning: if we handed out virtual cores and more than one VM ran on the same physical core, they would ultimately share that core's resources among themselves, whereas limiting with cgroups lets the virtual machine get exactly what is promised (provided that we monitor the CPU load on the master servers and migrate virtual machines between more and less loaded master servers in time). It also makes it possible to issue not a whole number of CPU cores but, say, 123% of a single core (we do not use this).
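For reference, "four hundred percent of one core" corresponds to a CFS bandwidth quota; with cgroups v1 it could be expressed roughly like this (a sketch; the cgroup path and group name are assumptions about how a master server might be laid out, not our actual configuration):

# Sketch: limit a group of processes to ncpu * 100% of one core via a cgroups v1 CFS quota.
NCPU = 4                       # e.g. the MEDIUM tariff
PERIOD_US = 100000             # default CFS period: 100 ms
QUOTA_US = PERIOD_US * NCPU    # 400000 us of CPU time per period = "400%"

cgroup = "/sys/fs/cgroup/cpu/machine/vm_40b5d315"   # hypothetical path
with open(cgroup + "/cpu.cfs_period_us", "w") as f:
    f.write(str(PERIOD_US))
with open(cgroup + "/cpu.cfs_quota_us", "w") as f:
    f.write(str(QUOTA_US))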
On the other hand, this approach has a downside: the situation becomes less obvious to the owner of the virtual machine. In /proc/cpuinfo, in /proc/stat and in all utilities such as top, htop and atop, the user sees 12 CPU cores. To correctly assess the situation with such utilities, a little more understanding of the nature of these limits is needed (for the convenience of users, however, the control panel displays all the necessary information in a more familiar form).
So, in /proc/stat we see something like this:

cpu  308468 0 155290 2087891920 139327 0 1717 323604 0 0
cpu0 52005 0 18792 173912232 41000 0 734 40868 0 0
cpu1 20211 0 23423 173905770 5134 0 253 47490 0 0
cpu2 58747 0 22624 173929843 17311 0 162 35843 0 0
cpu3 46602 0 17294 173965248 16777 0 100 31919 0 0
cpu4 21629 0 9426 174009578 8572 0 66 21842 0 0
cpu5 17057 0 10685 174021499 6986 0 53 18868 0 0
cpu6 14844 0 8881 174011454 5011 0 58 25400 0 0
cpu7 17524 0 10358 174023057 6182 0 66 20343 0 0
cpu8 15126 0 8597 174030215 7034 0 47 19455 0 0
cpu9 15437 0 9150 174023863 10817 0 56 20722 0 0
cpu10 15456 0 8545 174028209 8363 0 62 21390 0 0
cpu11 13826 0 7511 174030947 6135 0 56 19460 0 0
The per-core information for the 12 cores is not needed here; the first line, which summarizes all cores, is enough:
cpu 308468 0 155290 2087891920 139327 0 1717 323604 0 0
The numeric fields of this line mean the following, in order (a small parsing sketch follows this list):
user - processor time spent on processes in user space.
nice - the same, but for processes with a changed priority (nice).
system - the time spent on making system calls.
idle - idle time (while the processor is not busy with any other tasks from this list).
iowait - the time the processor spent waiting for I/O operations.
irq - the time spent processing interrupts.
softirq - "soft" interrupts. (Obviously, to understand what these are, you would need to drink as much bad vodka as Alexey Kuznetsov did when he was writing them.)
steal - time not received by this virtual machine because the master gave this resource to another VM (and/or the cgroups limit kicked in, as in our case).
guest - the time spent running a virtual CPU for guest operating systems under the control of the Linux kernel (this parameter makes sense only on the master system).
guest_nice - similarly, for guest systems with a modified priority.
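A minimal sketch of reading this summary line into named counters (this is just the standard /proc/stat layout, nothing specific to our scripts):

# Sketch: parse the aggregate "cpu" line of /proc/stat into named counters.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")

def read_cpu_counters():
    with open("/proc/stat") as f:
        parts = f.readline().split()   # the first line is the aggregate "cpu" line
    return dict(zip(FIELDS, map(int, parts[1:])))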
In addition to the information from /proc/stat, calculating CPU statistics requires knowing how many CPU cores are actually allocated to this VM via cgroups. There is no trivial way to find this out from inside the VM itself. Therefore, another API call is used, which returns basic information about the virtual machine. In json it can be represented as follows:

{
  "id": 11004,
  "ips": [
    { "id": 11993, "ipvalue": "185.41.161.231" }
  ],
  "memsize": 1024,
  "monitoring_enabled": false,
  "name": "esokolov_AS_test",
  "ncpu": 2,
  "ssh_keys": [ 42 ],
  "state": "active",
  "storages": [ 12344 ],
  "type": "tiny",
  "vm_id": "vm_40b5d315"
}
To be able to scale down, you also need to know how many cores there will be after switching to a lower tariff. This information is taken from the tariff table given earlier. In order not to hit the API every minute with the same requests for information from the panel (the tariff line and the parameters of this VM), this information is cached. Until the tariff changes, the cached information remains relevant. The cache is reset if the tariff is changed, the system is rebooted, or if it has been stored longer than the specified time (a day in our case).
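This caching can be illustrated with a small sketch (the file location, field names, and detecting a reboot via /proc/uptime are all assumptions):

# Sketch: cache panel API responses on disk for up to a day, drop the cache on reboot.
import json, time

CACHE_FILE = "/var/lib/autoscaling/api_cache.json"   # assumed location
CACHE_TTL = 24 * 3600                                # one day, as in the text

def boot_time():
    with open("/proc/uptime") as f:
        return time.time() - float(f.read().split()[0])

def load_cache():
    try:
        with open(CACHE_FILE) as f:
            cache = json.load(f)
    except (OSError, ValueError):
        return None
    too_old = time.time() - cache["saved_at"] > CACHE_TTL
    rebooted = cache["saved_at"] < boot_time()
    # the cache is also dropped explicitly when the tariff changes
    return None if too_old or rebooted else cache["data"]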
In /proc/stat, the information is stored as simple integer time counters (one unit is one hundredth of a second on most architectures) accumulated since system start. So we need to keep history not only to get statistics for the last 10 minutes, but also to understand how much resource was spent in each specific measurement period, by subtracting the values of the previous measurement from the current ones. Let's call these per-period values 'cur_user', 'cur_nice', 'cur_system', and so on.
From /proc/stat we can calculate the total amount of CPU resource (let's call it 'total'):

'total' = 'user' + 'nice' + 'system' + 'idle' + 'iowait' + 'irq' + 'softirq' + 'steal'
The total amount of CPU resource for the current time period (let's call it 'cur_total') is obtained by subtracting the 'total' of the previous measurement from the 'total' of the current measurement. This is the resource of all 12 cores. The part given to this virtual machine ('my_cur_total') is:

'my_cur_total' = 'cur_total' * ncpu / 12
where ncpu is taken from the VM information (the cgroups limit in cores).
Now we can calculate the real 'idle' (CPU idle time) of the current time period for this VM (let's call it 'my_cur_idle'):

'my_cur_idle' = 'my_cur_total' - 'cur_user' - 'cur_nice' - 'cur_system' - 'cur_iowait' - 'cur_irq' - 'cur_softirq'
Clearly, this value has little in common with the 'idle' from /proc/stat, since there that value means the idle time of all 12 cores.
The values of 'my_cur_idle', 'cur_iowait', 'cur_steal', etc., expressed as a percentage of 'my_cur_total', are essentially the values usually shown by utilities like top, atop and htop.
These are the basis for switching the tariff. That is, the current percentage of free CPU is calculated and recorded in the measurement history. If the average value of this percentage over the last 10 minutes falls below the critical 10%, we switch to the tariff with more CPU. In other words, the reason for switching up is a situation in which the virtual machine has, on average, consumed more than 90% of its CPU over the last 10 minutes.
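Putting the formulas above together, the per-interval CPU check might be sketched like this (the counter names follow the text; the history handling and the 12 visible cores are as described above):

# Sketch: free-CPU percentage for one measurement interval, given two /proc/stat snapshots.
VISIBLE_CORES = 12          # every VM sees 12 cores of the master server
CPU_THRESHOLD_PCT = 10.0    # switch up if the average free CPU drops below this

def free_cpu_pct(prev, cur, ncpu):
    # prev/cur are dicts of /proc/stat counters, ncpu is the cgroups limit from the API
    def total(c):
        return sum(c[k] for k in ("user", "nice", "system", "idle",
                                  "iowait", "irq", "softirq", "steal"))
    cur_total = total(cur) - total(prev)
    my_cur_total = cur_total * ncpu / VISIBLE_CORES
    busy = sum(cur[k] - prev[k] for k in ("user", "nice", "system",
                                          "iowait", "irq", "softirq"))
    my_cur_idle = my_cur_total - busy
    return my_cur_idle * 100.0 / my_cur_total

def should_scale_up_cpu(history):
    # history holds the last free-CPU percentages, one per minute
    recent = history[-10:]
    return len(recent) == 10 and sum(recent) / 10 < CPU_THRESHOLD_PCT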
The duration of the measurements and the threshold values are, of course, chosen empirically, based on the assumption that, for example, a one-off run of a heavy script that eats all the CPU and finishes within a couple of minutes is not a good reason to switch the tariff. However, these parameters may be adjusted based on our users' feedback (AS scripts and settings are automatically updated on virtual machines when new versions are released).
Tariff autoscaling down.
In order to go down to the previous tariff, with less memory or fewer CPUs, without making life hell for the system, we must be sure that at the previous tariff the current load would fully fit into the available resources, both CPU and memory, and that some reserve would remain so that the situation on the lower tariff does not become critical immediately after the switch. From this it follows, first, that we can only switch down one tariff at a time (if it is possible to go down two tariffs, this will be done in two steps). Second, it means that we must also perform all the calculations given above for memory and CPU using the parameters of the previous tariff (how much memory and CPU it provides). These parameters are taken from the tariff table, and all the reasoning above for obtaining the load criteria is carried out in parallel with respect to the previous tariff's parameters, with its own history saved. In the end, we simultaneously obtain the values that would be calculated for this virtual machine, at its current load, if it were on the previous tariff.
To avoid constant switching of tariffs back and forth, the thresholds for switching down differ slightly from the thresholds for switching up.
So, to switch down, the lower tariff must leave, after the switch, at least 20% of unused CPU (idle) and at least 30% of free memory (or memory that can be freed: buffers and disk cache). And the trigger for the switch is again not a single measurement but the average of the last 10 measurements.
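The downscale check can be sketched in the same spirit (thresholds are those from the text; the recalculation "as if on the previous tariff" reuses the same functions with the lower tariff's parameters):

# Sketch: decide whether it is safe to switch one tariff down.
DOWN_CPU_IDLE_PCT = 20.0    # at least this much free CPU on the lower tariff
DOWN_MEM_FREE_PCT = 30.0    # at least this much free (or reclaimable) memory

def should_scale_down(cpu_history_prev_tariff, mem_history_prev_tariff):
    # both histories hold values recalculated for the *previous* tariff's ncpu/memsize
    if len(cpu_history_prev_tariff) < 10 or len(mem_history_prev_tariff) < 10:
        return False
    avg_cpu = sum(cpu_history_prev_tariff[-10:]) / 10
    avg_mem = sum(mem_history_prev_tariff[-10:]) / 10
    return avg_cpu >= DOWN_CPU_IDLE_PCT and avg_mem >= DOWN_MEM_FREE_PCT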
P.S. If you are already our client, you can enable autoscaling in the control panel: bit.ly/Panel-NetAngels.
To register a new account, go here.