The PID 1 zombie reaping problem in Docker

Hi, Habr!
At Hexlet we actively use Docker, both to run the application itself and its associated services, and to run user-submitted code in hands-on programming exercises. Without these lightweight containers, these tasks would be much harder for us. Docker is a wonderful technology, but it sometimes causes unexpected problems. One of these problems (and its solution) is described on the blog of Phusion (the creators of Phusion Passenger), and today we are publishing a translation of that post.

About a year ago, when Docker was at version 0.6, we first introduced Baseimage-docker. It is a minimal Ubuntu image modified specifically for Docker. People can pull this base image from the Docker Registry and use it as the basis for their own images.

We were early Docker users, using it for CI and for creating development environments long before version 1.0 was released. We built the base image to solve problems that stem from the way Docker works. For example, Docker does not run processes under a special init process that properly handles child processes, so zombie processes can end up causing a lot of trouble. Docker also does nothing about syslog, so important messages may be lost. And so on.
However, we found that many people do not understand the problems we are solving. Admittedly, these are rather low-level Unix mechanisms that are far from obvious to everyone. So in this post we describe the most important problem that we solve: the PID 1 zombie reaping problem.

We came to the following conclusions:

The problems that we solve are relevant to many people.
Many people are not even aware that these problems exist, so sooner or later things are bound to break in unexpected ways (Murphy's law).
It is very inefficient if everyone has to solve these problems on their own.

So we packaged our solution into a universal base image that everyone can use: Baseimage-docker. This image adds a number of useful tools that (we believe) Docker image developers need. We use Baseimage-docker as the basis for all our own images.

The community likes what we do: our image is the third most popular in the Docker Registry after the official images of Ubuntu and CentOS.

The PID 1 problem: reaping zombies

All processes in Unix are organized as a tree. Each process can spawn child processes, and each process has a parent, except for the topmost (root) one.

The root process is init. The kernel starts it when the system boots. init is responsible for starting the rest of the system: the SSH daemon, the Docker daemon, Apache/Nginx, the graphical interface, and so on. Each of these, in turn, starts its own child processes.

So far, nothing unusual. But what happens when a process terminates? Suppose a bash process (PID 5) terminates. It turns into a so-called “defunct process”, also known as a “zombie process”.

Why does this happen? Unix is designed so that a parent process must explicitly wait for a child to terminate in order to collect its exit status. The zombie process exists until the parent performs this action using the waitpid() family of system calls. Here is a quote from the man page:

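A child that terminates, but has not been waited for becomes a “zombie”. The kernel maintains a minimal set of information about the zombie process (PID, termination status, resource usage information) in order to allow the parent to later perform a wait to obtain information about the child.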

People often think of zombie processes as runaway processes that wreak havoc. But formally, from the Unix operating system's point of view, zombie processes have a precise definition: they are processes that have terminated but have not (yet) been waited for by their parent processes.

In most cases this is not a problem. Calling waitpid() on a child process to eliminate its zombie is called “reaping”. Many applications reap their child processes correctly. For example, if bash was started by sshd and bash terminates, the operating system sends a SIGCHLD signal to sshd to wake it up. sshd notices this and reaps the child process.
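The life cycle described above can be sketched in a few lines of Python. This is only an illustration on a Unix-like system, not code from Baseimage-docker; the function name spawn_and_reap is made up for this example. It forks a child, lets it terminate (at which point it is a zombie), reaps it with waitpid(), and then shows that a second waitpid() fails because the zombie is gone.

```python
import os


def spawn_and_reap(exit_code):
    """Fork a child, let it become a zombie, then reap it.

    Returns (exit code seen by the parent, True if the zombie is gone).
    The function name is hypothetical; only the os calls are real.
    """
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: terminate immediately

    # Block until the child has terminated, but leave it unreaped
    # (WNOWAIT). From this point on the child is a zombie.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    # Reaping: waitpid() collects the exit status and frees the entry
    # in the kernel's process table.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        code = os.WEXITSTATUS(status)
    else:
        code = -os.WTERMSIG(status)  # child was killed by a signal

    # A second wait fails: there is nothing left to reap.
    try:
        os.waitpid(pid, 0)
        gone = False
    except ChildProcessError:
        gone = True
    return code, gone
```

A long-running parent such as sshd would typically do the waitpid() call in response to SIGCHLD rather than blocking like this sketch does.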

But there is a special case. Imagine that the parent process terminated, intentionally or because of a user action. What happens to its child processes? They no longer have a parent, so they become “orphans” (this is a technical term).

This is where the init process comes into play. The init process (PID 1) has a special task: to “adopt” orphaned processes (again, this is a real technical term). This means that init becomes the parent of such processes, even though init did not create them.

Consider the example of Nginx, which daemonizes itself by default. It works as follows: first, Nginx creates a child process. Then the original Nginx process exits. The Nginx child process is now adopted by init.
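Adoption can be observed directly. The following sketch (hypothetical function name, Unix-like systems only) imitates the daemonization pattern: a launcher process forks a child and exits right away, and the child reports which process became its new parent. Inside a container that is usually PID 1; on a desktop system it may instead be another process that has registered itself as a reaper, so treat the exact number as an assumption.

```python
import os
import time


def orphan_new_parent(timeout=5.0):
    """Return (launcher_pid, adopter_pid) for an orphaned grandchild."""
    read_fd, write_fd = os.pipe()
    launcher = os.fork()
    if launcher == 0:
        # Launcher: plays the role of the original daemonizing process.
        try:
            os.close(read_fd)
            me = os.getpid()
            if os.fork() == 0:
                # Grandchild: the "daemon". Wait until the launcher is
                # gone and we have been handed to a new parent.
                try:
                    deadline = time.monotonic() + timeout
                    while os.getppid() == me and time.monotonic() < deadline:
                        time.sleep(0.01)
                    os.write(write_fd, f"{me} {os.getppid()}".encode())
                finally:
                    os._exit(0)
        finally:
            os._exit(0)  # launcher exits, orphaning the grandchild

    os.close(write_fd)
    os.waitpid(launcher, 0)       # reap the launcher itself
    data = os.read(read_fd, 64)   # blocks until the grandchild reports
    os.close(read_fd)
    original, adopter = (int(x) for x in data.split())
    return original, adopter
```

Note that the caller reaps only the launcher. The grandchild is no longer its child, so reaping it is now the adopter's job, which is exactly the responsibility discussed here.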

The operating system kernel expects special behavior from init: it assumes that init will reap adopted processes as well.

This is a very important responsibility in Unix systems. It is so fundamental that many programs rely on it in order to work correctly. Most daemons are written with the expectation that daemonized processes will be adopted and reaped by init.

I use daemons as an example, but the mechanism is not limited to them. Every time a process exits while it still has children, it expects init to clean up after it. This is described in detail in two very good books: Operating System Concepts and Advanced Programming in the UNIX Environment.

Why are zombie processes harmful

Why are zombie processes harmful if they are just terminated processes? Surely the memory allocated to the process has already been released, and the zombie is nothing more than a line in ps?

Yes, the memory of the process has already been released. But the fact that the process is still visible in ps means that it still consumes kernel resources. Here is a quote from the waitpid man page:

As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes.
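One way to see which slots are currently held by zombies is to scan /proc, which is roughly what ps does. The helper below is a Linux-only sketch with a made-up name (find_zombies); it assumes the usual /proc/<pid>/stat layout of "pid (name) state ppid ..." and returns every process whose state is Z together with its parent, that is, the process that has failed to reap it.

```python
import os


def find_zombies(proc_root="/proc"):
    """Return a list of (pid, ppid, name) for all zombie processes.

    Linux-specific sketch; returns an empty list if /proc is unavailable.
    """
    zombies = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return zombies
    for entry in entries:
        if not entry.isdigit():
            continue  # not a process directory
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as stat_file:
                stat = stat_file.read()
        except OSError:
            continue  # the process vanished while we were scanning
        # The name is wrapped in parentheses and may itself contain
        # spaces or parentheses, so split on the last ")".
        name = stat[stat.index("(") + 1:stat.rindex(")")]
        fields = stat[stat.rindex(")") + 2:].split()
        state, ppid = fields[0], int(fields[1])
        if state == "Z":
            zombies.append((int(entry), ppid, name))
    return zombies
```

If the reported parent is PID 1 inside a container, the container's main process is not doing its job as init.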

What does this have to do with Docker?

So what does this have to do with Docker? Many people run only one process in their container. But most likely that process does not behave like a proper init. Instead of reaping adopted processes, it assumes that some other init process will do so. And rightly so.

Let's look at a concrete example. Suppose your container contains a web server that runs a CGI script written in bash. The script calls grep. Then the web server decides that the script is taking too long and kills it. But grep keeps running. Because its parent is gone, grep is adopted by PID 1 (the web server), and when it finishes it turns into a zombie. The web server knows nothing about grep, so it never reaps it, and the zombie grep stays in the system.

The problem applies to other situations as well. Many people build containers for third-party applications, such as PostgreSQL, and run the application as the only process inside the container. When you run someone else's code, can you be sure that it does not spawn child processes that later turn into zombies? If you run only your own code and know exactly what it and its libraries do, then everything is fine. But in the general case, you need to run a proper init to solve the problem.

But doesn't running a full init system turn the container into something heavyweight, like a virtual machine?

An init system does not have to be heavyweight. Perhaps you are thinking of Upstart, Systemd, SysV init, and so on. Perhaps you think you need to run an entire system inside the container. That is not true. A full init system is neither necessary nor desirable.

The system we need is a simple little program whose task is to launch your application and collect the adopted processes. Using such a simple init system is fully consistent with the Docker philosophy.

Simple init system

Perhaps there are ready-made solutions? Almost. Good old bash. Bash reaps adopted processes. Bash can run anything. So instead of this line in the Dockerfile...

CMD ["/path-to-your-app"]

you can write

CMD ["/bin/bash", "-c", "set -e && /path-to-your-app"]

(The set -e directive prevents bash from treating the script as a simple command and exec()'ing it directly.)

This results in the following process hierarchy: bash runs as PID 1, and your application runs as its child.

Unfortunately, this approach has a problem: it does not handle signals properly! Suppose you use kill to send SIGTERM to the bash process. Bash terminates, but it does not send SIGTERM to its child processes!

When bash terminates, the kernel terminates the entire container along with all the processes inside it. These processes are killed with SIGKILL, so they have no way to shut down cleanly. Suppose your application is writing to a file. The file may be corrupted if the application is killed in the middle of a write. Unclean termination is bad. It is almost like pulling the power cord out of a server.

But why should we care whether the init process is terminated with SIGTERM? Because docker stop sends SIGTERM to the init process. “docker stop” should stop the container cleanly so that it can be started again later with “docker start”.

Bash experts will probably want to write an EXIT handler that simply sends signals to the child processes, like this:

#!/bin/bash
function cleanup()
{
    # Collect the PIDs of all background jobs of this shell
    local pids=`jobs -p`
    if [[ "$pids" != "" ]]; then
        kill $pids >/dev/null 2>/dev/null
    fi
}

trap cleanup EXIT
/path-to-your-app

Unfortunately, this does not solve the problem. Sending signals to child processes is not enough: init must also wait for the child processes to terminate before terminating itself. If init exits first, all remaining child processes are killed uncleanly by the kernel.

Clearly, a somewhat more sophisticated solution is required, but a full init system such as Upstart, Systemd or SysV init is too heavyweight for a lightweight Docker container. Fortunately, Baseimage-docker has a solution. We wrote our own lightweight init system specifically for use inside Docker containers. For lack of a better name, we called it my_init. It is a 350-line Python program.

Key functions of my_init:

Reaps adopted child processes.
Runs subprocesses.
Waits for all subprocesses to terminate before terminating itself, with a maximum timeout.
Logs activity to “docker logs”.
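To make these responsibilities concrete, here is a minimal sketch of such an init in Python. It is not the actual my_init source; the names run_as_init and grace_seconds are invented for this example, and logging is left out. It starts one application, forwards SIGTERM and SIGINT to it, reaps every child that terminates (including adopted ones), and after the application exits keeps reaping stragglers for a limited time before returning the application's exit code.

```python
import os
import signal
import time


def _exit_code(status):
    """Translate a wait() status into a shell-style exit code."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return 128 + os.WTERMSIG(status)


def run_as_init(argv, grace_seconds=5.0):
    """Run argv as the main child and behave like a tiny init."""
    main_pid = os.fork()
    if main_pid == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed

    def forward(signum, _frame):
        # Pass the signal on so the application can shut down cleanly.
        try:
            os.kill(main_pid, signum)
        except ProcessLookupError:
            pass  # the application is already gone

    handled = (signal.SIGTERM, signal.SIGINT)
    previous = {s: signal.signal(s, forward) for s in handled}
    try:
        # Reap every child that terminates, adopted or not, until the
        # main application itself has exited.
        exit_code = None
        while exit_code is None:
            pid, status = os.wait()
            if pid == main_pid:
                exit_code = _exit_code(status)

        # Give remaining children a bounded amount of time to finish.
        deadline = time.monotonic() + grace_seconds
        while time.monotonic() < deadline:
            try:
                pid, _ = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                break  # no children left
            if pid == 0:
                time.sleep(0.05)  # children exist, none finished yet
        return exit_code
    finally:
        for signum, handler in previous.items():
            if handler is None:
                handler = signal.SIG_DFL
            signal.signal(signum, handler)
```

A wrapper script could call sys.exit(run_as_init(sys.argv[1:])) and be used as the container's CMD in front of the real application. A production init would also signal the remaining processes before giving up on them, which this sketch deliberately omits.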

Will Docker solve this problem itself?

Ideally, the PID 1 problem would be solved natively by Docker itself. That would be great, but as of January 2015 we have not heard anything of the kind from the Docker team. This is not a criticism: Docker is very ambitious, and I am sure its team has more important things to work on. The PID 1 problem is easy to solve at the user level. So until Docker solves it officially, we recommend that people solve it themselves with a system like the one described above.

Is this a problem at all?

The problem may seem hypothetical. If you have never seen a zombie in your container, you may think everything is fine. But the only way to be sure that there is no problem is to audit all your code, all your libraries, and all the libraries those libraries use. If you have not done so, there may well be a line of code somewhere that starts a child process which later turns into a zombie.

Do not forget Murphy's law.

Besides occupying slots in the kernel's process table, zombies can also interfere with programs that check whether processes exist. For example, Phusion Passenger manages processes and restarts them when they crash. To check for crashes, it parses the ps output and sends signal 0 to the process ID. A zombie shows up in ps and responds to signal 0, so Phusion Passenger thinks the process is still alive.
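The signal 0 behaviour is easy to reproduce. The sketch below is not Phusion Passenger code; the helper names are invented for this illustration. It creates a zombie, probes it with signal 0 the way a naive liveness check would, then reaps it and probes again.

```python
import os


def is_alive_by_signal0(pid):
    """Naive liveness check: can signal 0 be delivered to pid?"""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to someone else
    return True


def zombie_fools_signal0():
    """Return (looks alive while a zombie, looks alive after reaping)."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: terminate immediately

    # Wait for the child to terminate without reaping it (WNOWAIT),
    # so that it is guaranteed to be a zombie during the first probe.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    while_zombie = is_alive_by_signal0(pid)

    os.waitpid(pid, 0)  # reap the zombie
    after_reap = is_alive_by_signal0(pid)
    return while_zombie, after_reap
```

On Linux the result is (True, False): the terminated process keeps passing the check until somebody reaps it.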

All you need to do to protect yourself from zombie problems is spend 5 minutes switching to Baseimage-docker or importing the 350 lines of my_init. The overhead is minimal: only a couple of megabytes of disk space and memory.

Conclusion

The PID 1 problem is real. One solution is to use Baseimage-docker. Is it the only one? Of course not. The goals of Baseimage-docker are:

Tell people about a few important points when working with Docker containers.
Provide a ready-made solution so that people do not reinvent the wheel.

Several solutions are possible here; what matters is that they handle the problem described. You can write your own implementation in C, Go, Ruby or something else.

You may not want to use an Ubuntu-based image. Maybe you use CentOS. Even so, Baseimage-docker can still be useful to you. For example, our passenger_rpm_automation project uses CentOS containers. We simply extracted my_init and dropped it in there.

Happy Dockering!

Source: https://habr.com/ru/post/248519/
