
We continue our series of articles on containerization mechanisms.
Last time we talked about process isolation using the namespaces mechanism. But isolation alone is not enough for containerization. If we run an application in an isolated environment, we must be sure that enough resources are allocated to it and that it cannot consume extra resources and thereby disrupt the rest of the system. To solve this problem, the Linux kernel provides a special mechanism: cgroups (short for control groups). We will discuss it in today's article.
The topic of cgroups is particularly relevant today: a new version of this mechanism,
cgroup v2, was officially added to the kernel in version 4.5, released in March of this year.
While working on it, cgroups was essentially rewritten.
Why were such radical changes necessary? To answer this question, let's take a closer look at how the first version of cgroups was implemented.
Cgroups: a brief history
Development of cgroups was started in 2006 by Google engineers Paul Menage and Rohit Seth. The term "control groups" was not yet in use; instead they spoke of "process containers". In fact, they did not initially set out to create cgroups in the modern sense. The original idea was much more modest: to improve the
cpuset mechanism, designed to distribute CPU time and memory between tasks. Over time, however, the work grew into a larger project.
In late 2007, the name "process containers" was replaced with "control groups". This was done to avoid ambiguity in the interpretation of the term "container" (at that time the OpenVZ project was actively developing, and the word "container" was beginning to be used in its new, modern meaning).
In 2008, the cgroups mechanism was officially added to the Linux kernel (version 2.6.24). What was new in this kernel version compared to previous ones?
No system call specifically designed to work with cgroups was added. Among the major changes is the cgroups file system, also known as cgroupfs.
In init/main.c, calls to the functions that activate cgroups at boot time were added: cgroup_init and cgroup_init_early. The functions used to spawn and terminate processes, fork() and exit(), were slightly modified.
New entries appeared in the /proc virtual file system: /proc/{pid}/cgroup (for each process) and /proc/cgroups (for the system as a whole).
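These /proc entries can be inspected from any shell. For instance, a process can read its own membership file (the exact controller lines vary from system to system):

```shell
# Each line of /proc/self/cgroup has the form
#   hierarchy-ID:subsystem-list:cgroup-path
# e.g. "4:memory:/" under cgroups v1, or "0::/" under cgroup v2.
cat /proc/self/cgroup
```

/proc/cgroups, in turn, summarizes the subsystems compiled into the kernel, with their hierarchy IDs, group counts, and enabled flags.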
Architecture
The cgroups mechanism consists of two components: the core (
cgroup core) and so-called subsystems. In kernel version 4.4.0-21 there are 12 such subsystems:
- blkio - sets limits on reading from and writing to block devices;
- cpuacct - generates reports on CPU resource usage;
- cpu - controls CPU access for processes within the control group;
- cpuset - distributes tasks within the control group between processor cores;
- devices - allows or denies access to devices;
- freezer - suspends and resumes the execution of tasks within the control group;
- hugetlb - enables support for huge memory pages for control groups;
- memory - manages memory allocation for groups of processes;
- net_cls - marks network packets with a special tag that makes it possible to identify packets generated by a particular task within the control group;
- net_prio - used to dynamically set traffic priorities;
- pids - used to limit the number of processes in the control group.
You can display the list of subsystems on the console using the command:
$ ls /sys/fs/cgroup/
blkio    cpu,cpuacct  freezer  net_cls           perf_event
cpu      cpuset       hugetlb  net_cls,net_prio  pids
cpuacct  devices      memory   net_prio          systemd
Each subsystem is represented by a directory with control files in which all settings are stored. Each of these directories contains the following control files:
- cgroup.clone_children - allows passing parent properties to child control groups;
- tasks - contains the list of PIDs of all processes included in the control group;
- cgroup.procs - contains the list of TGIDs of the process groups included in the control group;
- cgroup.event_control - allows sending notifications when the status of the control group changes;
- release_agent - contains the command that will be executed if the notify_on_release option is enabled. It can be used, for example, to automatically remove empty control groups;
- notify_on_release - contains a boolean value (0 or 1) that enables (or disables) execution of the command specified in release_agent.
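To make notify_on_release and release_agent concrete, here is a minimal sketch of the cleanup logic such an agent typically performs. The function name and the scratch directory below are our own invention for illustration; a real agent would receive the path of the emptied group as its first argument and operate on the actual hierarchy root, which requires root privileges:

```shell
# Sketch of a release agent: the kernel invokes the command stored in
# release_agent with the hierarchy-relative path of the now-empty
# control group as $1; a common use is simply removing that directory.
cleanup_empty_cgroup() {
  root=$1      # hierarchy mount point (e.g. /sys/fs/cgroup/cpuset)
  relpath=$2   # path passed by the kernel, e.g. /group0
  rmdir "$root$relpath"
}

# Exercise the logic against a scratch directory instead of a real hierarchy:
mkdir -p /tmp/fake_hierarchy/group0
cleanup_empty_cgroup /tmp/fake_hierarchy /group0
```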
Each subsystem also has its own control files. We will describe some of them below.
To create a control group, it is enough to create a nested directory in any of the subsystems. Control files will be added to this directory automatically (we will discuss them in more detail below). Adding processes to a group is very simple: just write their PIDs to the tasks control file.
The set of control groups nested within a subsystem is called a hierarchy. Let's try to understand the principles of how cgroups work through simple practical examples.
The cgroups hierarchy: a practical introduction
Example 1: CPU Management
Run the command:
$ mkdir /sys/fs/cgroup/cpuset/group0
Using this command, we created a control group containing the following control files:
$ ls /sys/fs/cgroup/cpuset/group0
cgroup.clone_children  cpuset.memory_pressure
cgroup.procs           cpuset.memory_spread_page
cpuset.cpu_exclusive   cpuset.memory_spread_slab
cpuset.cpus            cpuset.mems
cpuset.effective_cpus  cpuset.sched_load_balance
cpuset.effective_mems  cpuset.sched_relax_domain_level
cpuset.mem_exclusive   notify_on_release
cpuset.mem_hardwall    tasks
cpuset.memory_migrate
So far, there are no processes in our group. To add a process, you need to write its PID to the tasks file, for example:
$ echo $$ > /sys/fs/cgroup/cpuset/group0/tasks
The $$ variable expands to the PID of the current command shell process.
At this point the process is not pinned to any particular CPU core, which the following command confirms:
$ cat /proc/$$/status | grep '_allowed'
Cpus_allowed:   3
Cpus_allowed_list:      0-1
Mems_allowed:   00000000,00000001
Mems_allowed_list:      0
The output of this command shows that two CPU cores, numbered 0 and 1, are available to the process we are interested in.
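The Cpus_allowed value is a hexadecimal bitmask in which bit N is set if CPU N is available, and Cpus_allowed_list is the same information in list form. As an illustration, here is a small helper of our own making (not part of any standard tool) that decodes such a mask:

```shell
# Decode a Cpus_allowed hex mask into the CPU list the kernel would
# report in Cpus_allowed_list (comma-separated, no range collapsing).
mask_to_cpus() {
  mask=$((16#$1))   # parse the hex mask (bash arithmetic)
  cpu=0
  out=""
  while [ "$mask" -ne 0 ]; do
    if [ $((mask & 1)) -eq 1 ]; then
      out="${out:+$out,}$cpu"
    fi
    mask=$((mask >> 1))
    cpu=$((cpu + 1))
  done
  echo "$out"
}

mask_to_cpus 3   # mask 3 = binary 11 -> CPUs 0,1
mask_to_cpus 1   # mask 1 -> CPU 0
```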
Let's try to pin this process to core number 0:
$ echo 0 >/sys/fs/cgroup/cpuset/group0/cpuset.cpus
Check what happened:
$ cat /proc/$$/status | grep '_allowed'
Cpus_allowed:   1
Cpus_allowed_list:      0
Mems_allowed:   00000000,00000001
Mems_allowed_list:      0
Example 2: Memory Management
Let's add the group created in the previous example to one more subsystem:
$ mkdir /sys/fs/cgroup/memory/group0
Next, run:
$ echo $$ > /sys/fs/cgroup/memory/group0/tasks
Let's try to limit memory consumption for the control group group0. To do this, we need to write the corresponding limit to the memory.limit_in_bytes file:
$ echo 40M > /sys/fs/cgroup/memory/group0/memory.limit_in_bytes
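The memory.limit_in_bytes file accepts a plain byte count or a value with a K, M, or G suffix; when read back, it reports the limit in bytes. A small helper of our own making shows the arithmetic behind the 40M value above:

```shell
# Expand the K/M/G suffixes accepted by memory.limit_in_bytes into a
# plain byte count (the kernel also rounds the value to the page size;
# that rounding is omitted here for simplicity).
to_bytes() {
  n=${1%[KMG]}        # numeric part
  suffix=${1#"$n"}    # trailing suffix, if any
  case "$suffix" in
    K) echo $((n * 1024)) ;;
    M) echo $((n * 1024 * 1024)) ;;
    G) echo $((n * 1024 * 1024 * 1024)) ;;
    *) echo "$n" ;;    # no suffix: already in bytes
  esac
}

to_bytes 40M   # -> 41943040, the value memory.limit_in_bytes will report
```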
The cgroups mechanism provides very extensive memory management capabilities. For example, we can use it to protect critical processes from the OOM killer:
$ echo 1 > /sys/fs/cgroup/memory/group0/memory.oom_control
$ cat /sys/fs/cgroup/memory/group0/memory.oom_control
oom_kill_disable 1
under_oom 0
If we place, for example, an SSH daemon in a separate control group and disable the OOM killer for that group, we can be sure that it will not be "killed" for exceeding its memory consumption.
Example 3: Device Management
Add our control group to another hierarchy:
$ mkdir /sys/fs/cgroup/devices/group0
By default, the group has no device access restrictions:
$ cat /sys/fs/cgroup/devices/group0/devices.list
a *:* rwm
Let's try to set restrictions:
$ echo 'c 1:3 rmw' > /sys/fs/cgroup/devices/group0/devices.deny
This command adds the /dev/null device to the list of devices forbidden to our control group. We wrote a line of the form 'c 1:3 rmw' to the control file. It begins with the device type: in our case a character device, denoted by the letter c (short for character device). The other two types are block devices (b) and all devices (a). Then come the major and minor numbers of the device. You can find out these numbers with the command:
$ ls -l /dev/null
Instead of /dev/null, of course, you can specify any other path. The output of this command looks like this:
crw-rw-rw- 1 root root 1, 3 May 30 10:49 /dev/null
The first number in the output is the major number, and the second is the minor number.
The last three letters indicate the access rights: r - permission to read from the specified device, w - permission to write to the specified device, m - permission to create new device files (mknod).
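The rule string can be assembled mechanically from the ls -l output. The helper below is our own illustration, not a standard tool: it extracts the type and the major:minor pair and combines them with the desired permissions:

```shell
# Build a devices-cgroup rule ("type major:minor perms") from a line of
# `ls -l` output for a device node.
dev_rule() {
  line=$1    # e.g. "crw-rw-rw- 1 root root 1, 3 May 30 10:49 /dev/null"
  perms=$2   # some combination of r, w, m
  echo "$line" | awk -v perms="$perms" '{
    type = substr($1, 1, 1)            # "c" = character, "b" = block
    major = $5; sub(/,$/, "", major)   # field 5 is "1," -> strip the comma
    print type, major ":" $6, perms    # field 6 is the minor number
  }'
}

dev_rule "crw-rw-rw- 1 root root 1, 3 May 30 10:49 /dev/null" rwm
# -> c 1:3 rwm
```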
Next, run:
$ echo $$ > /sys/fs/cgroup/devices/group0/tasks
$ echo "test" > /dev/null
When executing the last command, the system will display an error message:
-bash: /dev/null: Operation not permitted
We cannot interact with the /dev/null device because access has been denied.
Restore access:
$ echo a > /sys/fs/cgroup/devices/group0/devices.allow
As a result of this command, the entry a *:* rwm will be added to the file /sys/fs/cgroup/devices/group0/devices.allow, and all restrictions will be lifted.
Cgroups and containers
From the examples above, the principle of how cgroups work is clear: we place processes into a group, which we then "embed" into subsystems. Now let's look at more complex examples and see how cgroups are used in modern containerization tools, taking LXC as an example.
Install LXC and create a container:
$ sudo apt-get install lxc debootstrap bridge-utils
$ sudo lxc-create -n ubuntu -t ubuntu -f /usr/share/doc/lxc/examples/lxc-veth.conf
$ lxc-start -d -n ubuntu
Let's see what has changed in the cgroups directory after creating and running the container:
$ ls /sys/fs/cgroup/memory
cgroup.clone_children  memory.limit_in_bytes            memory.swappiness
cgroup.event_control   memory.max_usage_in_bytes        memory.usage_in_bytes
cgroup.procs           memory.move_charge_at_immigrate  memory.use_hierarchy
cgroup.sane_behavior   memory.numa_stat                 notify_on_release
lxc                    memory.oom_control               release_agent
memory.failcnt         memory.pressure_level            tasks
memory.force_empty     memory.soft_limit_in_bytes
As you can see, an lxc directory has appeared in each hierarchy, which in turn contains an ubuntu subdirectory. A separate subdirectory is created in the lxc directory for each new container. The PIDs of all processes running in a container are written to the file /sys/fs/cgroup/cpu/lxc/[container name]/tasks.
You can allocate resources to containers using either the cgroups control files or special lxc commands, for example:
$ lxc-cgroup -n [container name] memory.limit_in_bytes 400
The situation is similar for Docker, systemd-nspawn, and others.
Disadvantages of cgroups
Over almost 10 years of its existence, the cgroups mechanism has been criticized repeatedly. As the author of an
article on LWN.net noted, kernel developers actively dislike cgroups. The reasons for this dislike can be seen even in the examples given in this article, although we tried to present them as neutrally as possible: it is very inconvenient to embed a control group into each subsystem separately. Looking more closely, we can see that this approach is deeply inconsistent.
If we create a nested control group, for example, then in some subsystems the settings of the parent group are inherited, and in others they are not.
In the cpuset subsystem, any change in the parent control group is automatically propagated to nested groups, while in other subsystems there is no such propagation unless the cgroup.clone_children flag is enabled.
Discussions about eliminating these and other cgroups flaws have been going on in the kernel developer community for a long time:
one of the first texts on this topic dates back to early 2012.
The author of that text, Facebook engineer Tejun Heo, explicitly pointed out that the main problem of cgroups is an incorrect organization in which subsystems are attached to numerous hierarchies of control groups. He proposed using one and only one hierarchy, with subsystems enabled for each group separately. This approach entailed major changes, including a change of name: the resource management mechanism is now called
cgroup (in the singular), not cgroups.
We will understand in more detail the essence of the implemented innovations.
Cgroup v2: what's new
As noted above, cgroup v2 has been part of the Linux kernel since version 4.5. The old version remains supported as well. For version 4.6 there is already a patch that can be used to disable support for the first version at kernel boot.
Currently, cgroup v2 supports only three subsystems: io, memory, and pids. Patches that make it possible to manage CPU resources have already appeared (so far in test versions).
Cgroup v2 is mounted using the following command:
$ mount -t cgroup2 none [mount point]
Suppose we mounted cgroup v2 in the /cgroup2 directory. The following control files are automatically created in this directory:
- cgroup.controllers - contains the list of supported subsystems;
- cgroup.procs - right after mounting, contains the list of all processes running in the system, including zombie processes. If we create a group, a cgroup.procs file is created for it as well; it remains empty until processes are added to the group;
- cgroup.subtree_control - contains the list of subsystems activated for this control group; it is empty by default.
The same files are created in each new control group. In addition, the cgroup.events file, which is absent from the root directory, is added to each group.
A new group is created like this:
$ mkdir /cgroup2/group1
To enable a subsystem for a group, you need to write the name of this subsystem to the cgroup.subtree_control file:
$ echo "+pids" > /cgroup2/group1/cgroup.subtree_control
To disable a subsystem, use the same command with a minus in place of the plus:
$ echo "-pids" > /cgroup2/group1/cgroup.subtree_control
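Writes to cgroup.subtree_control are incremental: each "+name" or "-name" token adjusts the currently enabled set rather than replacing it. The following sketch is our own model of that behavior, not kernel code, and mimics how successive writes compose:

```shell
# Model how successive "+name"/"-name" tokens written to
# cgroup.subtree_control modify the set of enabled subsystems.
apply_subtree_control() {
  current=$1; shift
  for tok in "$@"; do
    name=${tok#[+-]}
    # Remove any existing occurrence of the subsystem...
    current=$(echo "$current" | tr ' ' '\n' | grep -v "^$name\$" | tr '\n' ' ')
    # ...then re-add it if the token asked to enable it.
    case $tok in
      +*) current="$current$name " ;;
    esac
  done
  echo "$current" | xargs   # squeeze whitespace
}

apply_subtree_control "" +pids +memory -pids   # -> memory
```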
When a subsystem is activated for a group, additional control files are created. For example, after activating the pids subsystem, the files pids.max and pids.current appear in the directory. The first is used to limit the number of processes in the group; the second contains information about the number of processes currently in the group.
Within existing groups, you can create subgroups:
$ mkdir /cgroup2/group1/subgroup1
$ mkdir /cgroup2/group1/subgroup2
$ echo "+memory" > /cgroup2/group1/cgroup.subtree_control
All subgroups inherit the characteristics of the parent group. In the example just given, the pids subsystem will be active both for group1 and for both subgroups nested in it; the pids.max and pids.current files will be added to them as well.
To avoid confusion with nested groups (see above), cgroup v2 applies the following rule: a process cannot be added to a nested group if any subsystem is already activated in it.
In the first version of cgroups, a process could belong to several control groups at once if those groups were in different hierarchies attached to different subsystems. In the second version, a process can belong to only one group, which avoids confusion.
Conclusion
In this article, we described how the cgroups mechanism works and what has changed in its new version. If you have questions or additions, welcome to the comments.
If for one reason or another you cannot leave comments here, you are welcome to visit
our corporate blog.