Today I want to continue the series of articles about the CRIU project (Checkpoint/Restore, mostly in userspace). The project is just over a year old, and in terms of capabilities it has already come close to the similar functionality in OpenVZ.
The first part of the article covers the new functionality that has appeared in CRIU over the past few months. The second part describes our experience with introducing new technologies to improve the development process.
New functionality
Memory snapshots and iterative migration
The killer feature of the upcoming release is iterative snapshots of process state and, as a consequence, iterative migration. In both cases, each subsequent iteration saves only the part of memory that has changed since the previous one. For snapshots, this reduces both the time taken and the amount of data written to disk. For migration, the downtime of the system is significantly reduced, because processes are not frozen during the first iteration of memory copying.
The main problem was implementing a mechanism that lets us track which memory regions have changed since the previous iteration. It works quite simply. All existing pages of a process can be marked as "clean". The kernel then write-protects them, and if someone tries to write to such a page, an exception (page fault) is raised. As a result, the kernel restores write permission for the page and marks it as "dirty" (PME_SOFT_DIRTY). All page properties can be read from the /proc/PID/pagemap2 file. Yes, we had to create a second version of the file, because the bits in the first format had run out, and some of them were, in fact, always zero; those are the ones we managed to free up in the second version.
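For illustration, here is a minimal sketch of driving this tracking from user space. It assumes the soft-dirty interface as it later landed in mainline kernels (writing "4" to /proc/PID/clear_refs marks all pages clean, and bit 55 of each /proc/PID/pagemap entry reports the soft-dirty flag) rather than the pagemap2 file mentioned above:

```c
/* A minimal sketch of soft-dirty tracking from user space.
 * Assumes the mainline interface: writing "4" to /proc/PID/clear_refs
 * marks all pages of the process as "clean", and bit 55 of each
 * /proc/PID/pagemap entry says whether the page was written to since.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

#define PME_SOFT_DIRTY (1ULL << 55)

/* Mark all pages of the target process as "clean". */
static void clear_soft_dirty(pid_t pid)
{
	char path[64];
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/clear_refs", pid);
	fd = open(path, O_WRONLY);
	if (fd < 0 || write(fd, "4", 1) != 1) {
		perror("clear_refs");
		exit(1);
	}
	close(fd);
}

/* Check whether the page containing addr has been dirtied since. */
static int page_is_soft_dirty(pid_t pid, unsigned long addr)
{
	char path[64];
	uint64_t entry;
	long psize = sysconf(_SC_PAGESIZE);
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/pagemap", pid);
	fd = open(path, O_RDONLY);
	/* one 64-bit entry per page */
	if (fd < 0 || pread(fd, &entry, sizeof(entry),
			    addr / psize * sizeof(entry)) != sizeof(entry)) {
		perror("pagemap");
		exit(1);
	}
	close(fd);
	return !!(entry & PME_SOFT_DIRTY);
}
```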
Diskless migration
Literally in the first months of CRIU's existence, a colleague from a German institute asked us to implement migration without using a disk. The speed of data transfer between machines is often several times higher than the speed of writing data to disk, so his request seemed reasonable. Less than a month ago this functionality was added. It works as follows: a server is started on the remote end, all data is transferred to it, and it also initiates the restore procedure.
A peculiarity here is that we tried as much as possible to avoid unnecessary copying of memory. The kernel provides a number of system calls for this purpose (vmsplice, splice, etc.). For example, process memory can be sent through a pipe without making a single copy.
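As a rough illustration (not CRIU's actual code), a minimal sketch of pushing a memory region into a socket without an intermediate user-space copy; the function name is hypothetical, and a real implementation must loop over pipe-sized chunks and handle short transfers:

```c
/* Sketch: send a region of our own memory into a socket with no
 * user-space copy. vmsplice() attaches the pages to a pipe, splice()
 * then moves them from the pipe into the socket.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>

static int send_region_zero_copy(int sock, void *addr, size_t len)
{
	int p[2];
	struct iovec iov = { .iov_base = addr, .iov_len = len };

	if (pipe(p) < 0)
		return -1;

	/* Map the pages into the pipe... */
	if (vmsplice(p[1], &iov, 1, 0) < 0)
		return -1;

	/* ...and splice them straight into the socket. */
	if (splice(p[0], NULL, sock, NULL, len, SPLICE_F_MOVE) < 0)
		return -1;

	close(p[0]);
	close(p[1]);
	return 0;
}
```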
Signals
It would seem that saving and restoring signals is not a big task, yet it dragged on for more than a month. If there are no suitable interfaces for saving or restoring some object, we try to come up with a new interface that will be useful not only to CRIU. This approach has both supporters and opponents.
It was the same this time. It would have been possible to add another command to ptrace (the interface used by debuggers), but that seemed too narrow in scope, while Linux had long had the signalfd system call, which does something very similar. Once implementation started, problems began to surface. In ordinary life a process does not care from which queue it receives a signal (there are two queues: one shared by the whole process and one per thread), but for CRIU this matters: we cannot restore the signals of one thread into another. We also wanted to look at signals without taking them off the queue, because the dump process must not be destructive. Neither of these problems looked serious; we had already done something similar for other objects. The third problem was that none of the formats for presenting siginfo to user space carries enough information for restoration. There were two formats in the kernel. The first is what we get in a signal handler, and it is closest to what the kernel stores, with one exception: it lacks the object type, which the kernel simply truncates before passing the structure to user space. The second format is what signalfd returns. The message type can be determined from indirect signs, but in some cases part of the information is simply missing. The solution to this problem is obvious: we need a third format that returns siginfo in exactly the form the kernel stores it.
The fourth key problem is that a signalfd descriptor is not tied to a specific process, while file descriptor semantics imply that after fork() it points to the same object as before the fork. This contradiction was introduced by the original signalfd developers and, for the sake of backward compatibility, cannot be changed. For the current interface it does not affect anything, but when we decided to extend signalfd's capabilities, the problems started. Everyone wanted working with a signalfd descriptor to be as close as possible to working with other descriptors, but because of the contradiction described above no reasonable solution could be found. I made about 5-7 attempts and, as a result, had to build the initial version on top of ptrace.
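For reference, a minimal sketch of what peeking at a pending signal queue through ptrace can look like. It assumes the PTRACE_PEEKSIGINFO request (merged around Linux 3.10) and the struct __ptrace_peeksiginfo_args wrapper that glibc exposes for it; the target task must already be attached and stopped:

```c
/* Sketch: read pending signals of a traced task without removing them
 * from the queue, i.e. a non-destructive dump of the queue contents.
 */
#include <stdio.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/ptrace.h>

static void dump_pending_signals(pid_t pid, int shared_queue)
{
	siginfo_t si[32];
	struct __ptrace_peeksiginfo_args args = {
		.off   = 0,   /* start from the head of the queue */
		.nr    = 32,  /* fetch at most this many entries */
		/* shared (process-wide) queue vs. the per-thread one */
		.flags = shared_queue ? PTRACE_PEEKSIGINFO_SHARED : 0,
	};
	long n = ptrace(PTRACE_PEEKSIGINFO, pid, &args, si);

	for (long i = 0; i < n; i++)
		printf("pending signal %d (code %d)\n",
		       si[i].si_signo, si[i].si_code);
}
```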
OpenVZ image converter
In previous articles it was already mentioned that CRIU should replace the existing OpenVZ checkpointing mechanism and, of course, backward compatibility must be preserved. I will let you in on a secret: the OpenVZ team is currently working on a new stable version of the kernel. As usual, it will be based on the kernel from the next version of RHEL (7). Alongside this work, a converter from OpenVZ images to CRIU images is being developed. This task is not so much difficult as it is time consuming.
Netlink sockets, TCP connection migration, vDSO conversion
Work is also under way on restoring the vDSO library. The main difficulty is that this library is provided by the kernel, and the mechanism of its interaction with the kernel is not fixed. The library is loaded dynamically, so function addresses can vary from kernel to kernel. We decided the easiest approach would be to create a proxy that looks like the old library but calls functions from the new one. Even on this path not everything goes smoothly. For example, a process may be executing library code when it is interrupted by signal handling. In a signal handler a process can do anything, and in the general case it is impossible to understand how it got there. On LKML the idea even came up of making a "stable" vDSO library whose code would never change. Until we come up with a better solution, we will live with the proxy and hope that the process will not be interrupted in the middle of vDSO code.
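To make the problem concrete: the address at which the kernel maps the vDSO, and therefore the addresses of its functions, is only discoverable at run time. A minimal sketch, assuming x86-64 Linux and glibc's getauxval():

```c
/* Sketch: find where the kernel mapped the vDSO into this process.
 * CRIU has to parse this ELF object to learn the function addresses,
 * which differ from kernel to kernel, hence the proxy described above.
 */
#include <stdio.h>
#include <elf.h>
#include <sys/auxv.h>

int main(void)
{
	unsigned long vdso = getauxval(AT_SYSINFO_EHDR);
	const Elf64_Ehdr *ehdr = (const Elf64_Ehdr *)vdso;

	printf("vDSO mapped at %#lx, %u program headers\n",
	       vdso, (unsigned)ehdr->e_phnum);
	return 0;
}
```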
Until recently, all TCP connections used one global counter for putting timestamps on packets (TCP timestamps), and this counter is reset on every reboot. This scheme prevented migrating TCP connections from one machine to another. The next Linux kernel will be able to set the offset for each socket separately, which will allow CRIU to migrate connections.
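A sketch of how a restored socket might get its timestamp value back, assuming the TCP_REPAIR (Linux 3.5+) and TCP_TIMESTAMP (Linux 3.9+) socket options that were added for checkpoint/restore; TCP_REPAIR requires CAP_NET_ADMIN, the timestamp can only be changed while the socket is in repair mode, and on older glibc these constants may have to come from <linux/tcp.h> instead:

```c
/* Sketch: put back a per-socket TCP timestamp on the restore side.
 * saved_ts would have been read with getsockopt(TCP_TIMESTAMP) at
 * dump time on the source machine.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static int restore_tcp_timestamp(int sk, unsigned int saved_ts)
{
	int on = 1, off = 0;

	/* the timestamp can only be set while the socket is in repair mode */
	if (setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)) < 0)
		return -1;
	if (setsockopt(sk, IPPROTO_TCP, TCP_TIMESTAMP,
		       &saved_ts, sizeof(saved_ts)) < 0)
		return -1;
	return setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &off, sizeof(off));
}
```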
Another small feature was support for Netlink sockets. The existing interface, the /proc/net/netlink file, does not provide enough information, so we had to extend socket diag to cover netlink sockets.
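Roughly, querying netlink sockets through socket diag looks like the sketch below. It assumes the netlink_diag part of the sock_diag interface (struct netlink_diag_req from <linux/netlink_diag.h>); receiving and parsing the reply messages is left out:

```c
/* Sketch: ask the kernel about netlink sockets via sock_diag. */
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/sock_diag.h>
#include <linux/netlink_diag.h>

static int query_netlink_sockets(void)
{
	struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK };
	struct {
		struct nlmsghdr nlh;
		struct netlink_diag_req req;
	} msg;
	int sk;

	sk = socket(AF_NETLINK, SOCK_RAW, NETLINK_SOCK_DIAG);
	if (sk < 0)
		return -1;

	memset(&msg, 0, sizeof(msg));
	msg.nlh.nlmsg_len      = sizeof(msg);
	msg.nlh.nlmsg_type     = SOCK_DIAG_BY_FAMILY;
	msg.nlh.nlmsg_flags    = NLM_F_REQUEST | NLM_F_DUMP;
	msg.req.sdiag_family   = AF_NETLINK;
	msg.req.sdiag_protocol = NDIAG_PROTO_ALL;   /* all netlink protocols */
	msg.req.ndiag_show     = NDIAG_SHOW_GROUPS; /* report group membership */

	if (sendto(sk, &msg, sizeof(msg), 0,
		   (struct sockaddr *)&nladdr, sizeof(nladdr)) < 0)
		return -1;
	return sk; /* the caller recv()s the reply messages from sk */
}
```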
Technology
Continuous integration
The CRIU development process is modeled on that of the Linux kernel. The basic principles we follow are:
- One commit is one logical change. Each commit is one complete thought. This principle greatly simplifies review and improves its quality.
- No commit may break the build: at any point the project must compile and all tests must pass. This principle makes it easy to bisect for the commit that introduced a problem.
The first principle is enforced by the person who reviews the changes and merges them into the main repository. The second principle can only be verified experimentally. This process is usually called "continuous integration", and there are several solutions for automating it. We chose the currently popular Jenkins project. Installation and setup do not take much time. Based on the past two months, we can safely say that the effort was not wasted. Not that we break the build often, quite the opposite, but thanks to the large number of unit-test runs Jenkins caught several bugs related to resource races or rare combinations of circumstances.
The technology will pay off in full when we start running all the tests (not only the unit tests), in all variations (backward-compatibility testing, testing of non-destructive dumps), on all configurations (two architectures, two kernel versions). This is exactly the kind of work that is most expensive for a developer to do by hand.
Static code analyzers
I was always skeptical about static code analyzers; for some reason I believed they would not bring much benefit, yet would take time to process their findings. Sometimes in the course of work there come moments when you don't want to do anything and need a distraction. I like trying something new at such moments. That is how we started using Jenkins, and that is how our code got run through clang-analyzer. I can't say we got any supernatural results, but it did point out a couple of bugs in error-handling paths.
Inspired by the results, another developer registered our project on scan.coverity.com. Its engine is somewhat more powerful than clang-analyzer's, but the number of false positives is higher. My opinion on static analyzers is this: they are useful, but their priority is not very high. If your project is well covered by tests and the tests are not finding bugs, then you can spend time on static analyzers.
Links
lwn.net/Articles/546966
lwn.net/Articles/531939
habrahabr.ru/post/152903
habrahabr.ru/post/148413
jenkins-ci.org
ru.wikipedia.org/wiki/CRIU
criu.org