Today I want to talk about my favorite feature in the latest release of Parallels Cloud Server: rebootless update, or updating without a reboot.
A reboot means server downtime and the loss of the state of everything that was running. That is undesirable for a server used by a large number of people. A popular technology today is Ksplice, which applies changes to the live system. This is unreliable, and not every update can be rolled out this way. In general, there is also no guarantee that the buggy code has not already left its mark before it was patched. Another important problem is that developers are reluctant to take on bugs reported after such updates. Who knows what has been cooked up in that hodgepodge.
We at Parallels approached the problem from the other side and decided to do everything honestly. Honestly means actually rebooting the kernel, but in such a way that nobody notices. The fastest way to boot a new kernel is kexec. Now recall that both containers and virtual machines can save and restore their state (suspend/resume, dump/restore, snapshots, and so on). So if we put all the virtual environments to sleep, quickly reboot the kernel, and restore the environments, the user will notice only a short delay in service, similar to a network glitch. To a first approximation, this is how rebootless update works.
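To make the sequence concrete, here is a minimal sketch that only builds the list of commands for each phase and does not run anything. It is not how Parallels Cloud Server implements the feature. It assumes the OpenVZ vzctl tool (suspend and resume) as a stand-in for saving and restoring containers, plus the standard kexec utility. The function name build_update_plan, the container IDs, and the kernel paths are hypothetical.

```python
from typing import Dict, List


def build_update_plan(container_ids: List[str], kernel: str,
                      initrd: str) -> Dict[str, List[List[str]]]:
    """Return the commands for each phase; nothing is executed here.

    Assumption: containers are managed with OpenVZ's vzctl. The real
    Parallels Cloud Server tooling is not described in the article.
    """
    # Phase 1 (old kernel): put every environment to sleep.
    old_kernel = [["vzctl", "suspend", ctid] for ctid in container_ids]
    # Load the new kernel into memory, then jump into it via kexec.
    old_kernel.append(
        ["kexec", "-l", kernel, f"--initrd={initrd}", "--reuse-cmdline"]
    )
    old_kernel.append(["kexec", "-e"])
    # Phase 2 (new kernel, e.g. from a boot-time script): wake them up.
    new_kernel = [["vzctl", "resume", ctid] for ctid in container_ids]
    return {"old_kernel": old_kernel, "new_kernel": new_kernel}


if __name__ == "__main__":
    # Hypothetical container IDs and kernel paths, for illustration only.
    plan = build_update_plan(["101", "102"],
                             "/boot/vmlinuz-new", "/boot/initrd-new.img")
    for phase, commands in plan.items():
        print(phase)
        for command in commands:
            print("  " + " ".join(command))
```

The plan is split in two because nothing runs after kexec -e in the old kernel: the restore step has to be started by the new kernel once it is up.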
Parallels developers went further and significantly reduced the downtime of virtual machines. First, the PramFS file system was created. It is similar to tmpfs, but its contents survive a kernel reboot via kexec. The state of virtual machines and containers is saved to this file system. PramFS is several orders of magnitude faster than a disk, so the time needed to save and restore environments dropped significantly.
Saving the state of a container means saving all its objects (open files, sockets, pipes, timers, process states, etc.) and its user memory. The next optimization step allowed us to leave user memory and file system caches exactly where they were before the reboot. This step further reduced both the time to save and restore containers and the downtime.
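A toy model shows why leaving user memory in place matters so much. This is only an illustration, not the real checkpoint format: the ContainerState class, the bytes_to_save function, and the 256-byte size of one object record are all hypothetical, and 4096 bytes is simply a common page size.

```python
from dataclasses import dataclass, field
from typing import List

PAGE_SIZE = 4096  # bytes; a common page size, used here only for illustration


@dataclass
class ContainerState:
    # Descriptions of kernel objects: open files, sockets, pipes, timers...
    kernel_objects: List[str] = field(default_factory=list)
    # Number of user memory pages owned by the container's processes.
    user_pages: int = 0


def bytes_to_save(state: ContainerState, keep_memory_in_place: bool,
                  object_record_size: int = 256) -> int:
    """Estimate how much data must be written before the kernel switch.

    object_record_size is a made-up figure for one serialized object.
    """
    total = len(state.kernel_objects) * object_record_size
    if not keep_memory_in_place:
        # Without the optimization, every user page is copied as well.
        total += state.user_pages * PAGE_SIZE
    return total


if __name__ == "__main__":
    state = ContainerState(
        kernel_objects=["file", "socket", "pipe", "timer"],
        user_pages=1000,
    )
    print(bytes_to_save(state, keep_memory_in_place=False))
    print(bytes_to_save(state, keep_memory_in_place=True))
```

In this model the object descriptions are tiny compared to the memory pages, so skipping the copy of user memory removes almost all of the data that would otherwise be saved and restored.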
As a result, after such an update the server runs a new kernel with no trace of the old one. All kernel objects are recreated and their state is restored. User memory and file system caches remain untouched. The whole process takes several times less than a normal reboot.
At the moment, this feature is available only to Parallels Cloud Server users, but we plan to offer this functionality to the Linux community. Saving and restoring containers will be implemented as part of the CRIU project.
www.parallels.com/products/pcs
en.wikipedia.org/wiki/Kexec
en.wikipedia.org/wiki/Ksplice
criu.org/Main_Page