Specifics of testing C/R technology in Linux

In 2012, when Andrew Morton accepted the first Linux kernel changes supporting C/R (Checkpoint/Restore), he was pessimistic about the future of the CRIU (Checkpoint and Restore In Userspace) project. The idea of saving and restoring running processes from user space looked crazy, yet four years later the project is not only alive but attracting more and more interest. Before CRIU there had been other attempts to implement C/R in Linux (DMTCP, BLCR, OpenVZ, CKPT, and others), but all of them were doomed to failure for various reasons, while CRIU became a viable project. Unfortunately, the C/R problem in Linux has not become any easier. In this article I will talk about the specifics of testing CRIU.

Books and hundreds of articles have already been written about the benefits of unit testing and continuous integration. These techniques are known to every experienced developer and are the accepted standard for any software project. So we will not describe their advantages here and will cover only the nuances that distinguish CRIU from other projects.
The CRIU development process is no different from Linux kernel development: every change is one complete thought. All patches are sent to the criu@ mailing list, where they are reviewed. Reviewed patches are added to the repository by the project maintainer. Although review catches some of the problems in the code, it cannot catch all of them because of the number of scenarios and configurations. So, to detect regressions, we run the tests for every new change. The only guarantee that the tests are run every time is to run them automatically.

In the early stages of development we started using functional tests from the ZDTM (Zero DownTime Migration) test suite, which had already been used successfully to test the in-kernel C/R implementation in OpenVZ. Each test from this suite is now launched separately and goes through three stages: preparing the environment, daemonizing and waiting for a signal (which tells the test that it is time to check its state), and checking the result. The tests fall roughly into two groups. The first group is static tests, which prepare some constant environment or state and “fall asleep” waiting for the signal. The second group is dynamic tests, which keep changing their state and/or environment (for example, by sending data over a TCP connection). In 2012 the CRIU test suite consisted of about 70 individual test programs; today there are about 200. But what is truly terrifying is the number of combinations that have to be run to test CRIU thoroughly.
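The three stages of a static test can be sketched roughly as follows. This is a toy illustration, not ZDTM code: the function name `run_static_test`, the use of SIGTERM as the "check now" signal, and the unlinked temporary file as the prepared state are all assumptions made for the example, and real daemonization is reduced to a plain `fork`. The harness part only marks the place where a checkpoint and restore would happen.

```python
import os
import signal
import tempfile


def run_static_test() -> int:
    """Run one toy static test in a child process and return its exit code."""
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: plays the role of the test program.
        code = 1
        try:
            os.close(ready_r)
            # Block SIGTERM so that it can be waited for synchronously.
            signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
            # Stage 1: prepare the state (an open but unlinked file).
            fd, path = tempfile.mkstemp()
            os.write(fd, b"pose")
            os.unlink(path)
            # Stage 2: report readiness and "fall asleep" until the signal.
            os.write(ready_w, b"1")
            os.close(ready_w)
            signal.sigwait({signal.SIGTERM})
            # Stage 3: check that the prepared state is still intact.
            os.lseek(fd, 0, os.SEEK_SET)
            code = 0 if os.read(fd, 4) == b"pose" else 1
        finally:
            os._exit(code)
    # Parent: plays the role of the harness.
    os.close(ready_w)
    os.read(ready_r, 1)  # wait until the test has prepared its state
    os.close(ready_r)
    # A real harness would checkpoint and restore the process here.
    os.kill(pid, signal.SIGTERM)  # tell the test to check its state
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    print("PASS" if run_static_test() == 0 else "FAIL")
```

A dynamic test would differ only in stage 2: instead of sleeping, it would keep changing its state until the signal arrives.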

The basic configuration is a run of the entire test suite on the host: each test program strikes a certain pose (that is, puts itself into a certain state), the test process is checkpointed and restored, and then it is asked to check whether it is still in the same pose. The next most important configuration checks that C/R not only works but that the original process is not broken after the checkpoint. So every test must also be run in a variant where only the first part is performed (without restore), checking that the pose still holds. This is the checkpoint-only test. A restored process may be in the same pose and yet be unsuitable for another C/R, so one more configuration appears: repeated C/R. Then come configurations with snapshots, C/R inside namespaces, C/R with ordinary user privileges, C/R with backward-compatibility checks, and checks for successful restore on BTRFS and NFS (because these file systems have their own “features”). And besides C/R of individual processes there is group C/R, which saves groups of processes, both when all processes are in the same pose and when each process is in its own pose.
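The difference between the first three configurations comes down to which steps the harness performs between starting a test and asking it to check itself. A minimal sketch, assuming hypothetical mode and step names (these are not CRIU or ZDTM identifiers):

```python
from typing import List


def plan(mode: str, iterations: int = 1) -> List[str]:
    """Return the harness steps for one test in the given (hypothetical) mode."""
    steps = ["start test"]
    if mode == "checkpoint-only":
        # Only the first half of C/R: the original process keeps running.
        steps.append("checkpoint (leave running)")
    elif mode in ("basic", "repeated"):
        # "basic" is a single C/R cycle; "repeated" runs several in a row.
        cycles = 1 if mode == "basic" else max(2, iterations)
        for _ in range(cycles):
            steps += ["checkpoint", "restore"]
    else:
        raise ValueError(f"unknown mode: {mode}")
    steps += ["send signal", "check result"]
    return steps


if __name__ == "__main__":
    for mode in ("basic", "checkpoint-only", "repeated"):
        print(mode, "->", plan(mode))
```

The other configurations (namespaces, unprivileged user, particular file systems) change the environment in which these steps run rather than the steps themselves.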

CRIU supports several hardware architectures: currently x86_64, ARM, AArch64, and PPC64le, with i386 on the way. Harsh reality also forces us to test several kernel versions: the latest official vanilla kernel release, the RHEL7 kernel (which is based on 3.10), and the linux-next branch. A single test run is short (2-10 minutes), but once the number of combinations of scenarios and configurations is taken into account, the total is impressive.
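To get a feel for the scale, here is a back-of-the-envelope estimate. It rests on two assumptions that the article does not state: that every architecture is crossed with every kernel and every scenario (an upper bound; in practice not every combination is run), and that the 2-10 minutes apply to one run of the suite in one combination. The scenario labels are shorthand for the configurations listed above.

```python
from itertools import product

ARCHITECTURES = ["x86_64", "ARM", "AArch64", "PPC64le"]
KERNELS = ["vanilla release", "RHEL7", "linux-next"]
SCENARIOS = [
    "basic C/R",
    "checkpoint only",
    "repeated C/R",
    "snapshots",
    "namespaces",
    "unprivileged user",
    "backward compatibility",
    "BTRFS",
    "NFS",
]


def matrix():
    """Full cross product of the three axes (assumed, upper bound)."""
    return list(product(ARCHITECTURES, KERNELS, SCENARIOS))


def total_minutes(runs: int, minutes_per_run: int) -> int:
    """Sequential machine time for the given number of suite runs."""
    return runs * minutes_per_run


if __name__ == "__main__":
    runs = len(matrix())
    print(runs, "combinations")
    print(total_minutes(runs, 2), "to", total_minutes(runs, 10), "minutes")
```

Under these assumptions the matrix has 108 entries, which is between 216 and 1080 minutes (3.6 to 18 hours) of sequential machine time per change, before group C/R is even considered.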

As I already wrote, tests are useful only when they are run regularly. For a while we ran the tests manually, but at some point we realized that local testing was taking too much of the developers' time, and we set up a system that runs the tests continuously for every new change.

We use two CI systems. Travis CI is used to check compilation on all supported hardware architectures. Because Travis CI runs a kernel older than version 3.8, which lacks most of the patches CRIU requires, it is not suitable for running the tests, so we additionally use the well-known Jenkins CI.

Conclusions

build, testing, and code coverage measurement should be automated
there is no such thing as too many tests: our ratio of production code to test code is about 1.6 (48 KLOC vs 30 KLOC), and there is still room to improve
if the number of configurations to test is huge, prioritize
as always, there are not enough hands, so come and join us at CRIU

The CRIU project was launched in 2012 by engineers from Virtuozzo and was later supported by other companies interested in C/R technology for Linux.

Source: https://habr.com/ru/post/283504/
