
How to speed up containers: tuning OpenVZ

OpenVZ is an open-source implementation of container virtualization for the Linux kernel which lets you run, on a single system with the OpenVZ kernel, many virtual environments with various Linux distributions inside. Because virtualization happens at the kernel level rather than at the hardware level, OpenVZ does better than other virtualization technologies on a number of performance metrics: density, elasticity, RAM requirements, response time, and so on. For example, here you can find performance comparisons of OpenVZ with traditional hypervisor-based virtualization systems. Beyond that, Linux and OpenVZ offer a great many tuning options.
In this article we will look at some non-trivial options for configuring containers on the OpenVZ kernel that can improve the performance of the whole OpenVZ system.

General settings


The main settings that affect container performance are the limits on memory and CPU consumption. In most cases, increasing the amount of memory and the number of CPUs allocated to a container improves the performance of the user applications inside it, such as your web server or database server.

To set the overall limit on the physical memory allocated to a container, it is enough to specify the --ram option, for example:

# vzctl set 100 --ram 4G --save 

Setting the limits in a container too low very often leads to failed memory allocations in one place or another, in the kernel or in an application, so when running containers it is extremely useful to monitor the contents of /proc/user_beancounters. Non-zero values in the failcnt column mean that some limit is too small and you need to either reduce the working set inside the container (for example, lower the number of apache or postgresql server processes) or raise the memory limit with the --ram option. For convenient monitoring of /proc/user_beancounters you can use the vzubc utility, which lets you, for example, show only the counters that are close to hitting failcnt, or refresh the readings periodically (top-like mode). Read more about the vzubc utility here.
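
As a quick illustration, here is a minimal sketch of such monitoring. The awk one-liner relies only on failcnt being the last column of /proc/user_beancounters; exact vzubc flags differ between vzctl versions, so it is shown here without options:

 # awk '$NF ~ /^[0-9]+$/ && $NF != 0' /proc/user_beancounters   # print only counters with a non-zero failcnt
 # vzubc                                                        # or use the bundled helper for a readable view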

In addition to the physical memory limit, it is recommended to set a limit on the container's swap size. To adjust it, use the --swap option of the vzctl command:

 # vzctl set 100 --ram 4G --swap 8G --save 


The sum of the --ram and --swap values is the maximum amount of memory the container can use. After the container reaches its --ram limit, memory pages belonging to the container's processes are pushed out into the so-called "virtual swap" (VSwap). No real disk I/O takes place; instead, the container is artificially slowed down to create the effect of real swapping.

When configuring containers, it is recommended that the sum of ram + swap over all containers does not exceed the sum of ram + swap on the host node. You can check your settings with the vzoversell utility.
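
As a rough manual alternative, here is a sketch that sums the configured limits; it assumes the usual config location /etc/vz/conf and that PHYSPAGES/SWAPPAGES are stored as 4 KB pages (containers with "unlimited" values are simply skipped by the arithmetic):

 # vzoversell
 # awk -F'[":]' '/^(PHYSPAGES|SWAPPAGES)=/ { sum += $3 } END { printf "%.1f GB allocated to containers\n", sum*4/1024/1024 }' /etc/vz/conf/*.conf
 # free -g                                                      # compare with the host's RAM + swap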

To set the maximum number of CPUs available to a container, use the --cpus option, for example:

 # vzctl set 100 --cpus 4 --save 


When a new container is created, its number of CPUs is not limited, so it will use all the CPU resources of the server it can get. Therefore, on systems with many containers it makes sense to limit the number of CPUs of each container according to the tasks assigned to it. It can also sometimes be useful to limit the CPU share in percent (or in megahertz) with the --cpulimit option, as well as to manage weights, that is, container priorities, with the --cpuunits option.
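
For example, a minimal sketch of a percentage cap (the value 200, meaning two cores' worth of CPU time, is only an illustration and should be chosen to match the container's workload):

 # vzctl set 100 --cpulimit 200 --save                          # cap container 100 at 200% of a single CPU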

Memory overcommit


The OpenVZ kernel allows you to grant all containers, taken together, more memory than the host physically has. This situation is called memory overcommit. In this case the kernel manages memory dynamically, balancing it between containers: the memory a container is allowed to use will not necessarily be demanded, and the kernel is free to manage it at its discretion. Under memory overcommit the kernel efficiently manages the various caches (page cache, dentry cache) and tries to shrink them in containers in proportion to the configured memory limits. For example, if you want to isolate the services of a high-load web site consisting of a front end, a back end and a database from each other for the sake of better security, you can put them in separate containers and still give each of them all the memory available on the host, for example:

 # vzctl set 100 --ram 128G --save
 # vzctl set 101 --ram 128G --save
 # vzctl set 102 --ram 128G --save


In this case, from the memory management point of view, the situation will not differ from the one where your services run directly on the host, and memory balancing will remain as efficient as possible. You do not need to decide which container should get more memory: the front-end container, for a larger page cache holding the site's static data, or the database container, for a larger database cache. The kernel will balance everything automatically.

Although overcommit lets the kernel balance memory between containers as efficiently as possible, it also has some unpleasant properties. When the total amount of allocated anonymous memory, that is, the combined working set of all processes of all containers, approaches the total memory size of the host, an attempt by a process or by the kernel to allocate new memory may trigger a global out-of-memory condition, and the OOM killer will kill one of the processes in the system. To check whether this has happened on the host or in a container, you can use the command:

 # dmesg | grep oom
 [3715841.990200] 645043 (postmaster) invoked oom-killer in ub 600 generation 0 gfp 0x2005a
 [3715842.557102] oom-killer in ub 600 generation 0 ends: task died


It is important to note that the OOM killer will not necessarily kill a process in the same container that tried to allocate the memory. To control the OOM killer's behavior, use the command:

 # vzctl set 100 --oomguarpages 2G --save 


which sets a guarantee: processes of the container are not touched by the OOM killer as long as the container stays within this limit. For containers running vital services you can therefore set this guarantee equal to the memory limit.

Processor overcommit


Just as with memory, CPU overcommit lets you allocate to the containers a total number of CPUs exceeding the number of logical CPUs on the host. And just as with memory, CPU overcommit lets you get the best overall system performance. For example, for the same web server split into three containers for the front end, the back end and the database, you can leave the number of CPUs of each container unlimited and achieve the maximum total system throughput. Once again, the kernel itself will decide which processes from which containers get CPU time.

Unlike memory, the CPU is an "elastic" resource: a shortage of CPU time does not lead to any exceptions or errors in the system, only to a slowdown of some running processes. Therefore CPU overcommit is a safer way to speed up a system than memory overcommit. The only negative effect of CPU overcommit is a possible violation of fair allocation of CPU time between containers, which can be bad, for example, for VPS hosting customers, who may receive less CPU time than they paid for. To preserve fair allocation of CPU time, set the "weight" of each container according to the CPU time it is entitled to, using the --cpuunits option of the vzctl command (read more here).
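
For example, a sketch for two hosting plans where the second customer pays for twice the CPU share of the first (the numbers are relative weights only, not absolute guarantees):

 # vzctl set 100 --cpuunits 1000 --save                         # plan A container
 # vzctl set 101 --cpuunits 2000 --save                         # plan B container: twice the CPU time under contention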

Container optimization on NUMA hosts


When containers run on a host with NUMA (Non-Uniform Memory Access), it can happen that a container's processes execute on one NUMA node while some (or all) of their memory has been allocated on another NUMA node. In that case every memory access is slower than an access to the local NUMA node, and the slowdown depends on the distance between the nodes. The Linux kernel tries to avoid this situation, but to make sure a container executes on its local NUMA node you can set a CPU mask for each container that restricts the set of processors allowed to run the container's processes.

You can view the NUMA nodes available on the host using the numactl command:

  # numactl -H
 available: 2 nodes (0-1)
 node 0 cpus: 0 1 2 3
 node 0 size: 16351 MB
 node 0 free: 1444 MB
 node 1 cpus: 4 5 6 7
 node 1 size: 16384 MB
 node 1 free: 10602 MB
 node distances:
 node 0 1
   0: 10 21
   1: 21 10 


In this example, there are two NUMA nodes on the host, each with 4 processor cores and 16GB of memory.

To restrict the set of processors for a container, use the vzctl command:

 # vzctl set 100 --cpumask 0-3 --save
 # vzctl set 101 --cpumask 4-7 --save


In this example we allowed container 100 to execute only on processors 0 to 3, and container 101 only on processors 4 to 7. Keep in mind, however, that if a process from, say, container 100 has already allocated memory on NUMA node 1, every access to that memory will be slower than an access to local memory. It is therefore recommended to restart the containers after executing these commands.

It is worth noting that the new vzctl 4.8 release introduced the --nodemask option, which lets you bind a container to a specific NUMA node without listing that node's processors, by giving just the NUMA node number.
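
A sketch of the same binding via --nodemask (assuming vzctl 4.8 or newer), followed by the restart recommended above:

 # vzctl set 100 --nodemask 0 --save
 # vzctl set 101 --nodemask 1 --save
 # vzctl restart 100 && vzctl restart 101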

Keep in mind that this approach limits the ability of the process scheduler to balance load across all processors of the system, which, under heavy CPU overcommit, can actually slow things down.

Controlling the behavior of fsync in containers


As you know, to make sure its data reaches the disk, an application must call the fsync() system call on every modified file. This system call writes the file's data from the write-back cache to the disk and initiates a flush of the data from the disk's own cache to permanent non-volatile media. Even if the application writes to the disk bypassing the write-back cache (so-called Direct I/O), this system call is still needed to guarantee that the data is flushed from the disk's cache.

Frequent fsync() calls can significantly slow down the disk subsystem: an average hard disk is capable of 30-50 syncs/sec.

It is often known in advance that, for all or some of the containers, such strict data-durability guarantees are not needed and the loss of some data in the event of a hardware failure is not critical. For such cases the OpenVZ kernel provides the ability to ignore fsync()/fdatasync()/sync() requests for all or some of the containers. The kernel behavior is configured through the /proc/sys/fs/fsync-enable file. Possible values of this file when set on the host node (global settings):

   0 (FSYNC_NEVER)    fsync()/fdatasync()/sync() requests from containers are ignored
   1 (FSYNC_ALWAYS)   fsync()/fdatasync()/sync() requests from containers work as usual;
                      the data of all inodes on all file systems of the host machine is written out
   2 (FSYNC_FILTERED) fsync()/fdatasync() requests from containers work as usual;
                      sync() requests from containers affect only the container's files (default value)


Possible values of this file when set inside a particular container:

   0 fsync()/fdatasync()/sync() requests from this container are ignored
   2 use the global settings set on the host node (default value)
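
For example, a minimal sketch that turns fsync() off only inside container 100 while keeping the global default (this assumes vzctl exec passes the command through a shell inside the container, so the redirection happens there):

 # cat /proc/sys/fs/fsync-enable                                # check the global setting on the host
 # vzctl exec 100 'echo 0 > /proc/sys/fs/fsync-enable'          # ignore fsync()/fdatasync()/sync() only in container 100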


Although these settings can significantly speed up the server's disk subsystem, use them carefully and selectively: disabling fsync() may lead to data loss in the event of a hardware failure.

Controlling Direct I / O Behavior in Containers


By default, writes to all files opened without the O_DIRECT flag go through the write-back cache. This not only lowers the latency of a write for the application (the write() system call returns as soon as the data is copied into the write-back cache, without waiting for the actual write to disk), it also lets the kernel I/O scheduler distribute disk resources between processes more efficiently by grouping I/O requests from the applications.

At the same time, certain categories of applications, databases for example, manage the writing of their data efficiently themselves, issuing large sequential I/O requests. Such applications therefore often open files with the O_DIRECT flag, which tells the kernel to write to such a file bypassing the write-back cache, directly from the user application's memory. With a single database running on a host, this approach is more efficient than writing through the cache: the database's I/O requests are already optimally aligned and there is no need for the extra copy from the user application into the write-back cache.

With several database containers on the same host this assumption no longer holds, because the I/O scheduler in the Linux kernel cannot optimally distribute disk resources between applications using Direct I/O. Therefore, by default, OpenVZ disables Direct I/O for containers and all data is written through the write-back cache. This introduces a small overhead, an extra copy from the user application into the write-back cache, but allows the kernel I/O scheduler to distribute disk resources more efficiently.

If you know in advance that this situation will not occur on your host, you can avoid the extra overhead and allow all or some of the containers to use Direct I/O. The kernel behavior is configured through the /proc/sys/fs/odirect_enable file. Possible values of this file when set on the host node (global settings):

   0 the O_DIRECT flag is ignored for containers; all writes go through the write-back cache (default value)
   1 the O_DIRECT flag works as usual in containers


Possible values of this file when set inside a particular container:

   0 the O_DIRECT flag is ignored for this container; all writes go through the write-back cache
   1 the O_DIRECT flag works as usual for this container
   2 use the global settings (default value)
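
For example, a minimal sketch that honors O_DIRECT only in a dedicated database container and leaves the other containers on the write-back cache (again assuming vzctl exec passes the command through a shell inside the container):

 # cat /proc/sys/fs/odirect_enable                              # global setting on the host (0 by default)
 # vzctl exec 100 'echo 1 > /proc/sys/fs/odirect_enable'        # honor O_DIRECT inside container 100 only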


Conclusion


The Linux kernel in general, and OpenVZ in particular, offer a great number of options for fine-tuning performance to specific user workloads. OpenVZ-based virtualization achieves the highest possible performance thanks to flexible resource management and a variety of settings. In this article we have covered only a small part of the container-specific settings. In particular, I did not describe how the three parameters CPUUNITS/CPULIMIT/CPUS influence each other, but I am ready to explain this, and much more, in the comments.
For more information, see the vzctl man page and the many resources on the Internet, for example openvz.livejournal.com.

Source: https://habr.com/ru/post/240197/

