How to upgrade the kernel without restarting services (a step-by-step guide)

How realistic do you think it is to log in over ssh, upgrade the system, boot a new kernel, and still be in the same ssh session? Patching the kernel on the fly (ksplice, KernelCare, ReadyKernel, etc.) is fashionable right now, but that approach has many limitations. First, it cannot apply changes that alter data structures. Second, objects already in memory may contain incorrect data, which can cause problems later. Here we describe a more "honest" way to upgrade the kernel. The method itself has been known for a long time [1]; the value of this article is that we work through it in detail on a real example, see how simple or hard it is, and learn what to expect from such experiments.

Travis CI is a popular continuous integration service that works well with GitHub. It is developing rapidly: a few years ago it offered only containers with rather stale distributions, whereas today you can choose between containers and virtual machines, Linux is no longer the only supported system, and there is much more.

We started using Travis CI in our CRIU project (checkpoint/restore in userspace) several years ago and have always squeezed the most out of the service. We began with a compile check on x86_64; today Travis CI runs our tests, checks compilation on all architectures and with different compilers, and even tests compatibility with new kernels, including the most unstable and bleeding-edge branch, Linux-Next.

Most importantly, any developer can use all of this for their own purposes without having to set up, configure, or maintain anything locally.

And now to the point, gentlemen ...

Today, though, I want to talk not about how we test CRIU but about one interesting way of using it. Imagine that we are given a virtual machine in which a process has been started over ssh. How do we boot our own kernel so that the process does not notice? This is exactly the situation we have in Travis CI.

We have no external access to the virtual machine, and if the Travis process dies (exits) for any reason, the service finishes the job and deletes the VM. You must agree that this is, frankly, not an easy problem. We even added a poll at the end: let us know whether you came up with a solution right away.

Here is what we did: take CRIU, dump the Travis ssh session, boot the new kernel, restore the processes, and carry on. That was roughly my plan when I decided to have some fun after dinner and show how it all takes off.

I must say the task is by no means abstract. It has several real uses. One of them is that some users want Ubuntu 16.04 ( https://github.com/travis-ci/travis-ci/issues/5821 ). The Travis developers are not planning to solve this yet, so we can try to do it without their help. The idea is the same: take the stock 14.04 system, upgrade it, and reboot into the new environment.

Solution

Upgrading the system is the least of our troubles; it takes a couple of commands:

    sed -i -e "s/trusty/xenial/g" /etc/apt/sources.list
    apt-get update && apt-get dist-upgrade -y
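Before touching the real /etc/apt/sources.list, it may be worth previewing what the substitution will do. A minimal sketch, assuming a helper called `retarget_release` (a name made up here, not part of the original scripts) that prints the rewritten file instead of editing it in place:

```shell
#!/bin/bash
# retarget_release FILE OLD NEW  (hypothetical helper)
# Print FILE with every occurrence of the codename OLD replaced by NEW.
# The file itself is left untouched.
retarget_release() {
    local file=$1 old=$2 new=$3
    sed -e "s/${old}/${new}/g" "$file"
}

# Usage: compare the preview with the current file, then apply the
# change for real with `sed -i`.
# retarget_release /etc/apt/sources.list trusty xenial | diff /etc/apt/sources.list -
```

The preview costs nothing and shows whether the word "trusty" appears anywhere it should not be replaced.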

But then things get much more fun. First, a question arises: where do we start the dump? Second, how will we restore? And if something goes wrong, how do we find out what exactly? There is no point in waiting for help from a hung Travis job.

So we start figuring it out ourselves. We look at the process tree and see that the dump should start from the sshd process that handles our ssh session.

Process Tree:

    12253 ?        Ss     0:03 /usr/sbin/sshd -D
    32443 ?        Ss     0:00  \_ sshd: root@pts/0
    32539 pts/0    Ss     0:00  |   \_ -bash

We walk up the chain of parents, starting from ourselves, until we reach the child of init; the process we want is the second sshd counting from init:

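    ppid=""
    pid=$$
    # Walk up until $pid is a direct child of init (the master sshd);
    # $ppid then holds the previous step, the sshd serving our session.
    while :; do
        p=$(awk '/^PPid:/ { print $2 }' /proc/$pid/status)
        test "$p" -eq 1 && break
        ppid=$pid
        pid=$p
    done
    echo $pid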

Now we know what to dump, and we need to decide who will do it. Keep in mind that CRIU does not allow "sawing off the branch it is sitting on", so the dump has to be started from a separate process outside that tree:

 setsid bash -c "setsid ./scripts/travis/kexec-dump.sh $ppid < /dev/null &> /travis.log &"
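It is setsid that moves the dumper out of the ssh session. A small sketch (Linux only; the variable names are ours) shows the effect by reading the session ID, which is field 6 of /proc/self/stat, once directly and once through setsid:

```shell
#!/bin/bash
# Field 6 of /proc/<pid>/stat is the session ID of that process.
# awk reads its own entry through /proc/self, so it reports the
# session it is running in.
current_session=$(awk '{ print $6 }' /proc/self/stat)

# The same command started through setsid runs in a brand-new session.
detached_session=$(setsid awk '{ print $6 }' /proc/self/stat)

echo "current session:  $current_session"
echo "detached session: $detached_session"
```

The two numbers differ, which is what lets the dump script freeze the whole ssh session without freezing itself.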

Time to write the dump command. If you think that is easy, you are greatly mistaken: CRIU has so many options that not every developer can make sense of them right away. In practice it is not that bad once you look into it, and the command turned out to be quite short.

    ./criu/criu dump -D /imgs -o dump.log -t $pid --tcp-established --ext-unix-sk -v4 --file-locks --link-remap

In plain language, this command reads: "CRIU, dump the process subtree starting at $pid, put all the data in the /imgs directory, write the log to dump.log, report in detail everything you do, and feel free to save TCP sockets, unix sockets connected to the outside world, file locks, and descriptors of deleted files."
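The same command can be written as an annotated bash array, which makes it easier to see which flag does what. This is only a sketch: the flags are the ones used above, the value of `pid` is a placeholder, and the command is printed rather than executed.

```shell
#!/bin/bash
# $pid is assumed to hold the root of the subtree found earlier;
# 12345 is a placeholder for illustration only.
pid=${pid:-12345}

dump_opts=(
    -D /imgs            # directory for the image files
    -o dump.log         # log file
    -t "$pid"           # root of the process subtree to dump
    --tcp-established   # allow dumping established TCP connections
    --ext-unix-sk       # allow unix sockets connected to the outside world
    -v4                 # verbose logging
    --file-locks        # allow dumping file locks
    --link-remap        # handle files that are deleted but still open
)

# Print the command instead of running it.
echo ./criu/criu dump "${dump_opts[@]}"
```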

Everything seems clear except the deleted files: where would they come from? Recall that we have just installed a major system upgrade, so almost everything has been replaced, including libraries and executables. Our process, however, was not restarted and still uses the old versions of those files. The --link-remap option is there for them.
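You can see the same effect on a small scale without upgrading anything. The sketch below (Linux only; descriptor number 9 and the scratch file are arbitrary choices) opens a file, deletes it while it is still open, and asks /proc what the descriptor points to:

```shell
#!/bin/bash
# Open a scratch file on descriptor 9, then delete it while it is open.
scratch=$(mktemp)
exec 9<"$scratch"
rm -f "$scratch"

# The kernel keeps the inode alive and marks the old path as deleted.
link_target=$(readlink "/proc/$$/fd/9")
echo "$link_target"      # the old path followed by " (deleted)"

exec 9<&-                # close the descriptor again
```

After a dist-upgrade, the libraries and executables of a long-running process are in exactly this state.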

Another problem shows up right away. Network traffic must be blocked between dumping and restoring the processes; otherwise there is no guarantee that the TCP connections will survive the operation. CRIU adds a couple of iptables rules for this, and our job is to restore those rules after the new kernel boots but before the network is configured. I had to google a bit here, but overall this was not too hard to solve.

    cat > /etc/network/if-pre-up.d/iptablesload << EOF
    #!/bin/sh
    iptables-restore < /etc/iptables.rules
    unlink /etc/network/if-pre-up.d/iptablesload
    unlink /etc/iptables.rules
    exit 0
    EOF
    chmod +x /etc/network/if-pre-up.d/iptablesload
    iptables-save -c > /etc/iptables.rules

Recovery

So, the processes are saved, and it is time to prepare whatever will restore them. For that we have to write a small service.

    cat > /lib/systemd/system/crtr.service << EOF
    [Unit]
    Description=Restore a Travis process
    [Service]
    Type=idle
    ExecStart=/root/criu/scripts/travis/kexec-restore.sh $d $f
    [Install]
    WantedBy=multi-user.target
    EOF
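Note that $d and $f in the unit are shell variables, not systemd ones: because the heredoc delimiter is unquoted, the shell substitutes them while the file is being written. A small sketch with made-up values for both variables, writing to a temporary file instead of /lib/systemd/system:

```shell
#!/bin/bash
# Placeholder values for illustration; the real script sets its own.
d=/root/criu
f=/run/systemd/sessions/example.ref

unit_file=$(mktemp)

# Unquoted delimiter: $d and $f are replaced while the file is written.
cat > "$unit_file" << EOF
ExecStart=/root/criu/scripts/travis/kexec-restore.sh $d $f
EOF

cat "$unit_file"
```

With a quoted delimiter (`<< 'EOF'`) the literal text `$d $f` would end up in the unit file, and the restore script would not get the arguments it expects.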

Everything seems ready for liftoff. Turn the key to start.

    kernel=$(ls /boot/vmlinuz* | tail -n 1 | sed 's/.*vmlinuz-\(.*\)/\1/')
    echo $kernel
    kexec -l /boot/vmlinuz-$kernel --initrd=/boot/initrd.img-$kernel --reuse-cmdline
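One caveat about `ls | tail -n 1`: names are compared as strings, so with several kernels installed it can pick 4.4.0-9 over 4.4.0-10. A possible variant, assuming GNU `sort -V` is available and using a helper name we made up, compares version numbers instead:

```shell
#!/bin/bash
# newest_kernel [DIR]  (hypothetical helper)
# Print the highest kernel version that has a vmlinuz-* image in DIR,
# comparing version numbers rather than plain strings.
newest_kernel() {
    local boot_dir=${1:-/boot}
    ls "$boot_dir"/vmlinuz-* 2>/dev/null |
        sed 's/.*vmlinuz-//' |
        sort -V |
        tail -n 1
}

# Usage:
# kernel=$(newest_kernel /boot)
```

On a freshly upgraded Travis VM the difference rarely matters, but it is a cheap safeguard.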

Fly!

 kexec -e

So we took off, but, like SpaceX, we failed to land on the first try. We failed because the landing platform was already occupied by someone else. Seriously, though, the problem is that CRIU can restore processes only with the same identifiers they had at dump time. We rebooted into a new system that runs systemd (!!!) and has somewhat more processes. This problem has long been solved, and containers come to the rescue, or rather one small part of them: the pid namespace.

    unshare -pfm --mount-proc --propagation=private ./criu/criu restore \
        -D /imgs -o restore.log -j --tcp-established --ext-unix-sk \
        -v4 -l --link-remap &
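To see why a fresh pid namespace is needed, it is enough to check whether the old identifiers are still free after the reboot. A rough sketch (the helper name is ours; the PIDs are the ones from the process tree shown earlier):

```shell
#!/bin/bash
# pid_in_use PID  (hypothetical helper)
# Succeed if a process with this ID exists in the current pid namespace.
pid_in_use() {
    [ -d "/proc/$1" ]
}

# Check the PIDs recorded at dump time before trying to restore.
for wanted in 32443 32539; do
    if pid_in_use "$wanted"; then
        echo "pid $wanted is already taken"
    else
        echo "pid $wanted is free"
    fi
done
```

Inside a new pid namespace no identifiers are in use yet, so the restored processes can get their old ones back.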

We try to take off again, and again our ship goes silent. This time we have no idea what went wrong, so we need to get the logs somehow. We decided not to think too long and simply upload them to one of the popular dumping grounds for all sorts of files.

    #!/usr/bin/env python2
    import dropbox, sys, os
    access_token = os.getenv("DROPBOX_TOKEN")
    client = dropbox.client.DropboxClient(access_token)
    f = open(sys.argv[1])
    fname = os.path.basename(sys.argv[1])
    response = client.put_file(fname, f)
    print 'uploaded: ', response
    print "====================="
    print client.share(fname)['url']
    print "====================="

In full view of the cameras we lose yet another ship and realize that the jokes are over. This time CRIU complains about a D-Bus socket, that is, a connection whose state is inaccessible to us because it is known only to the D-Bus daemon. On the other hand, why does sshd need this socket? Surely it wants to monitor network status and other such nonsense. We are not going to do anything of the kind (or rather, we have already done everything), so let's just restore this socket somehow and move on.

    diff --git a/criu/sk-unix.c b/criu/sk-unix.c
    index 5cbe07a..f856552 100644
    --- a/criu/sk-unix.c
    +++ b/criu/sk-unix.c
    @@ -708,5 +708,4 @@ static int dump_external_sockets(struct unix_sk_desc *peer)
                                   if (peer->type != SOCK_DGRAM) {
                                           show_one_unix("Ext stream not supported", peer);
                                           pr_err("Can't dump half of stream unix connection.\n");
    -                                       return -1;
                                   }

In effect, we made our own CRIU patch. This could have been solved more elegantly with plugins, but patching was faster. We push our changes again and wait for the next crash. This time the problem is with pseudoterminals: the numbers we need are already in use by someone else. We could mount devpts with newinstance, but this option recently stopped working.

- The newinstance mount
Ignored. // Eric W. Biederman

It looks like it is time to get into the process images and touch them up by hand. We change the pseudoterminal numbers in them by prepending a 1, so terminal number 1 becomes number 11. CRIU makes this possible: an image can be converted to JSON and back. It looks like this:

    ./crit/crit show /imgs/tty-info.img | \
        sed 's/"index": \([0-9]*\)/"index": 1\1/' | \
        ./crit/crit encode > /imgs/tty-info.img.new
    ./crit/crit show /imgs/reg-files.img | \
        sed 's|/dev/pts/\([0-9]*\)|/dev/pts/1\1|' | \
        ./crit/crit encode > /imgs/reg-files.img.new
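The two sed expressions are easy to try out on their own before running them over real images. In this sketch the input lines are made up for illustration (real crit output has many more fields), and the function names are ours:

```shell
#!/bin/bash
# The two substitutions from the crit pipeline, wrapped in functions
# so they can be tried on sample text.
renumber_index() { sed 's/"index": \([0-9]*\)/"index": 1\1/'; }
renumber_path()  { sed 's|/dev/pts/\([0-9]*\)|/dev/pts/1\1|'; }

echo '"index": 1,' | renumber_index             # prints: "index": 11,
echo '"name": "/dev/pts/0",' | renumber_path    # prints: "name": "/dev/pts/10",
```

Prepending a digit instead of adding a fixed offset keeps the mapping trivial to reverse and needs no arithmetic in sed.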

We run it again and wait. It is already well past noon, and this whole undertaking is clearly dragging on. As usual, we get an error; this time some fifo files from /run/systemd/sessions cannot be restored. We have no desire to find out what these files are, so before restoring we simply create them and move on.

    f=$(lsof -p $1 | grep /run/systemd/sessions | awk '{ print $9 }')
    ...
    criu dump
    kexec
    mkfifo $f
    criu restore
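On the restore side, the "create them and move on" step might look roughly like this. It is a sketch, not the script we actually ran: the helper name is invented, and it also creates the parent directory in case the new system has not done so yet.

```shell
#!/bin/bash
# recreate_fifos PATH...  (hypothetical helper)
# Create a FIFO at every given path that does not exist as one yet.
recreate_fifos() {
    local path
    for path in "$@"; do
        [ -p "$path" ] && continue        # already a FIFO, nothing to do
        mkdir -p "$(dirname "$path")"
        mkfifo "$path"
    done
}

# Usage, with $f collected before the dump as shown above:
# recreate_fifos $f
```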

We crash again, and this time it looks like we have hit a bug in CRIU. We see that sys_prctl(PR_SET_MM, PR_SET_MM_MAP, ...) returns EACCES, dig into the kernel, and find that the cause is the restoration of the link to the executable file. The kernel sees that we are passing a link to a file that does not have the execute bit set. Remember that we upgraded the whole system, so this link now points to a deleted file. It turns out that dpkg removed the execute permission from the file before deleting it.

    # strace -e chmod,link,unlink -f apt-get install --reinstall sudo
    ...
    3331  link("/usr/bin/sudo", "/usr/bin/sudo.dpkg-tmp") = 0
    3331  chmod("/usr/bin/sudo.dpkg-tmp", 0600) = 0
    3331  unlink("/usr/bin/sudo.dpkg-tmp")  = 0
    ...
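Why does a chmod on the temporary name affect the old binary? Because link() only adds a second name for the same inode, and permissions live in the inode. The sequence from the strace output can be reproduced on a scratch file (file names here are arbitrary; `stat -c` assumes GNU coreutils):

```shell
#!/bin/bash
# Reproduce the dpkg sequence on a scratch file.
workdir=$(mktemp -d)
touch "$workdir/tool"
chmod 0755 "$workdir/tool"

ln "$workdir/tool" "$workdir/tool.dpkg-tmp"     # link(): second name, same inode
chmod 0600 "$workdir/tool.dpkg-tmp"             # chmod(): changes the shared inode
rm "$workdir/tool.dpkg-tmp"                     # unlink(): only the name goes away

mode_after=$(stat -c %a "$workdir/tool")
echo "$mode_after"                              # prints 600
rm -rf "$workdir"
```

The original name never had chmod called on it, yet it ends up without the execute bit, which is exactly the state in which our running sshd found its own executable.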

It seems that one more CRIU patch is enough, and the golden key will be in our pocket.

    diff --git a/criu/cr-restore.c b/criu/cr-restore.c
    index 12f13ae..39277cf 100644
    --- a/criu/cr-restore.c
    +++ b/criu/cr-restore.c
    @@ -2278,6 +2278,23 @@ static int prepare_mm(pid_t pid, struct task_restore_args *args)
            if (exe_fd < 0)
                    goto out;
     
    +       {
    +               struct stat st;
    +
    +               if (fstat(exe_fd, &st)) {
    +                       pr_perror("Unable to stat a file");
    +                       return -1;
    +               }
    +
    +               if (!(st.st_mode & (S_IXUSR | S_IXGRP | S_IXOTH))) {
    +                       pr_debug("Add the execution bit for %d (st_mode %o)\n", exe_fd, st.st_mode);
    +                       if (fchmod(exe_fd, st.st_mode | S_IXUSR)) {
    +                               pr_perror("Unable to add the execution bit");
    +                               return -1;
    +                       }
    +               }
    +       }
    +
            args->fd_exe_link = exe_fd;
            ret = 0;
     out:

Conclusion

Hooray! Everything works: https://travis-ci.org/avagin/criu/builds/181822758 . This is, in fact, a very condensed retelling of the whole story. I had to run this job in Travis 33 times before it first succeeded.

What have we proved? First, we solved a couple of practical problems. Second, we showed that CRIU is a very low-level tool, and even a simple task may require deep knowledge of the system. The effort is repaid with power, flexibility, and possibilities, although nobody guarantees that you will not have to fight bugs.

Good luck on the cosmic expanses!

Source: https://habr.com/ru/post/318522/
