
Rethinking PID 1. Part 3





Parallelizing file system jobs


If you look at the boot chart of a current distribution, you will see more synchronization points than just daemon startups: the jobs related to file systems take up the most time: mounting, checking for errors (fsck), and quota checking. Right now, during boot, a lot of time is spent waiting until all the devices listed in /etc/fstab show up in the device tree; only then are they checked for errors, mounted, and have quotas applied (if quotas are enabled at all). Only after all of that can we go further and actually start loading services.
Can we improve this? It turns out that we can. Harald Hoyer came up with the idea of using the venerable autofs for this.

Just as a connect() call "declares" that a process is interested in another service, an open() call (or a similar call) "declares" that it is interested in a particular file or file system. So, to make parallelization possible, we need to make sure that applications wait only in the case where the file system they want is not yet mounted but will be very soon. To do this, we set up an autofs mount point (a placeholder mount point), and when the real file system passes its fsck integrity check during normal boot, we replace the placeholder with the real mount point. While the real file system is not yet mounted, an attempt to access it is queued by the kernel and the accessing process blocks, but this affects only the one daemon making the one access. This way we can start our daemons long before all file systems become available, without any of them losing access to the files they need, and parallelize the boot process as much as possible.
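Below is a rough, heavily simplified sketch of how an init system might park such an autofs placeholder mount, assuming the autofs v5 kernel protocol; the option string details and error handling are illustrative, not a definitive implementation:

    /* Minimal sketch: parking an autofs placeholder on a mount point
     * (e.g. /home) so the real file system can be fsck'ed and mounted
     * later. Assumes the autofs v5 kernel protocol. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mount.h>

    int park_autofs(const char *where) {
        int p[2];
        char opts[128];

        if (pipe(p) < 0)
            return -1;

        /* The kernel writes access-request packets to p[1];
         * the init system reads them from p[0]. */
        snprintf(opts, sizeof(opts),
                 "fd=%d,pgrp=%d,minproto=5,maxproto=5,direct",
                 p[1], (int) getpgrp());

        if (mount(where, where, "autofs", 0, opts) < 0)
            return -1;

        return p[0];  /* poll this fd for queued access requests */
    }

    /* Later, when a daemon touches the mount point, the kernel delivers
     * a struct autofs_v5_packet (see <linux/auto_fs4.h>) on the pipe.
     * Once fsck has finished and the real file system is mounted over
     * the placeholder, the init system acknowledges the queued access,
     * e.g. ioctl(mount_fd, AUTOFS_IOC_READY, packet.wait_queue_token),
     * or AUTOFS_IOC_FAIL if the mount could not be completed, and the
     * blocked open() in the daemon returns as if nothing happened. */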

Parallelizing file system jobs makes no sense for the root mount point /, where all services (daemons) and all binaries live. However, for a mount point like /home, which is usually larger, may even be encrypted, may be mounted from a remote machine, and is rarely accessed by daemons during boot, this can noticeably improve boot speed. It probably goes without saying that virtual file systems such as procfs or sysfs should never be mounted via autofs.

It will not surprise me if some readers find integrating autofs into the init process slightly "fragile" or even strange, perhaps belonging to the more "hackish" side of things. However, having played with it quite a bit, I can say that autofs feels quite right in this role. Using autofs means that we can create a mount point without having to provide the real file system immediately; in effect, we get deferred access. If an application tries to access such an autofs file system and it takes us a long time to replace it with the real one, the application hangs in an interruptible sleep, which means you can safely cancel it, for example with Ctrl+C. Also note that at any point, if the real file system cannot be mounted correctly (because fsck failed), we can simply tell autofs to return a real error code (say, ENOENT). So what I want to say is: even though integrating autofs into an init system may seem reckless at first, our experimental code has shown that the idea behaves surprisingly well in practice, provided, of course, that it is done for the right reasons and implemented correctly.

Also note that these autofs mount points should be so-called direct maps (translator's note: information on autofs map types here), meaning that from the application's point of view there is little difference between a classic (real) mount point and one backed by autofs.

Keep the first user PID small


Another good thing we can learn from the MacOS boot logic is that shell scripts are evil. The shell is fast to hack in but slow to execute: in a shell script you can quickly figure out what is going on, but the execution speed leaves much to be desired. The classic sysvinit boot system is modeled around shell scripts, and whether it is /bin/bash or any other shell (even one written with script execution speed in mind), the approach is ultimately doomed to be slow. On my machine the scripts in /etc/init.d call grep 77 times, awk 92 times, cut 23 times, and sed 74 times. Every one of these calls spawns a new process, searches for shared libraries, sets up things like i18n, and so on, all to perform an operation that is rarely more complicated than trivial string handling, and the boot process crawls because of it. Only shell scripting encourages working this way. Moreover, shell scripts are very "fragile", as I said above: their behavior changes drastically depending on environment variables and similar factors that are hard to track and control.

So let's get rid of shell scripts in the boot process. Before we can do that, we first need to understand what they are actually used for: well, viewed from a high level, most of the time they do very boring things. Most of the scripting is spent on trivial starting and stopping of services, and it should be rewritten in C, either as separate executables, or moved into the services (daemons) themselves, or implemented in the init system itself.

It does not look like we can get rid of shell scripts in the boot process completely any time soon: rewriting them in C takes time, in some cases makes no sense at all, and sometimes shell scripts are simply too convenient. But we can certainly make them less necessary and less pervasive than they are now.

A good metric for measuring how deeply shell scripts have invaded the boot process is the PID number of the first process you can start after the system has fully booted. Boot, log in, open a terminal, and type echo $$. Try it on your Linux machine, then compare the result with MacOS! (Hint: the result looks something like this: Linux PID 1823, MacOS PID 154, measured on our test machines.)

Process tracking


A central component of a system that starts and maintains services should be a process babysitter: it should watch the services and restart them when they shut down. When a service crashes, it should collect all available information about the crash and keep it around for the administrator to inspect later, cross-linking it with the information available in crash dump systems such as abrt, in logging systems such as syslog, and in the audit system.

The process babysitter should also be able to shut a service down completely (translator's note: meaning the service's process and all of its child processes). This may sound like a trivial task, but it is harder than it seems. Traditionally on Unix, a process that forks twice can escape the supervision of its parent, and the original parent will know nothing about the relationship between the new process and the one it actually started. For example: as things stand now, a misbehaving CGI script that has forked twice is not terminated when Apache is shut down. Moreover, you cannot even establish the connection between the misbehaving CGI script and Apache unless you know its name and purpose.
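To make the problem concrete, here is a minimal sketch of the classic double-fork trick as a misbehaving daemon might perform it; the function name is illustrative, and this is the behavior a babysitter has to defend against, not code to imitate:

    /* Classic double-fork: after this, the surviving process has been
     * reparented to PID 1 and the original parent can no longer find
     * it or wait on it. */
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    void escape_the_babysitter(void (*work)(void)) {
        pid_t pid = fork();
        if (pid < 0)
            exit(1);
        if (pid > 0) {
            waitpid(pid, NULL, 0);  /* reap the intermediate child */
            return;                 /* original process carries on */
        }

        setsid();                   /* detach from the controlling terminal */

        pid = fork();               /* second fork */
        if (pid < 0)
            _exit(1);
        if (pid > 0)
            _exit(0);               /* intermediate child exits at once... */

        /* ...so this grandchild is orphaned, adopted by init, and
         * invisible to whoever supervised the original process. */
        work();
        _exit(0);
    }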

So how can we track processes so that they cannot escape the babysitter, and so that we can control them as one unit even if they fork a gazillion times?

Different people have proposed different solutions to this problem. I will not go into details here, but let me at least say that the approaches based on ptrace or on the netlink connector (a kernel interface that lets you receive a netlink message every time a process in the system calls fork() or exit()), which some people have invested time in and implemented, have been criticized by many as ugly and poorly scalable.

So what can we do about it? Well, not long ago the kernel gained Control Groups (aka "cgroups"). In essence, they allow creating a hierarchy of process groups. The hierarchy is exposed directly through a virtual file system and is therefore easy to access: the group names are directory names in that file system. If a process belongs to a group, all descendants (child processes) it creates via fork() belong to the same group. Unless a process has privileges (say, root) and access to the cgroup file system, it cannot leave its group. Originally cgroups were added to the kernel for container purposes: certain kernel subsystems can impose resource limits on groups, such as limits on CPU or memory usage. Traditional resource limits (as implemented by setrlimit()) apply, for the most part, to a single process; cgroups, on the other hand, let you set limits on a whole group of processes. And cgroups are useful for setting limits even outside their original container use case: you can, for example, use cgroups to cap the amount of memory or CPU that Apache and all its children may use. A misbehaving CGI script can then no longer escape those limits simply by calling fork() again.
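A minimal sketch of how this looks from userspace, assuming a cgroup v1-style hierarchy with the memory controller mounted at /sys/fs/cgroup/memory (the exact mount point varies by distribution, and the group name and limit are illustrative):

    /* Sketch: put a service and all its future children into their own
     * cgroup with a memory cap (cgroup v1-style interface). */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/stat.h>

    int enter_service_cgroup(const char *name) {
        char path[256];
        FILE *f;

        /* Creating a directory in the hierarchy creates the group. */
        snprintf(path, sizeof(path), "/sys/fs/cgroup/memory/%s", name);
        if (mkdir(path, 0755) < 0)
            return -1;

        /* Cap the whole group (e.g. Apache + all CGI children) at 512 MiB. */
        snprintf(path, sizeof(path),
                 "/sys/fs/cgroup/memory/%s/memory.limit_in_bytes", name);
        if (!(f = fopen(path, "w")))
            return -1;
        fprintf(f, "%llu", 512ULL * 1024 * 1024);
        fclose(f);

        /* Move ourselves into the group; every fork() from now on
         * inherits the membership. */
        snprintf(path, sizeof(path), "/sys/fs/cgroup/memory/%s/tasks", name);
        if (!(f = fopen(path, "w")))
            return -1;
        fprintf(f, "%d", (int) getpid());
        fclose(f);
        return 0;
    }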

Besides containers and resource limits, cgroups are also very useful for keeping track of services: cgroup membership is reliably inherited by all child processes, and they cannot escape it. There is also a notification mechanism, so a supervising process can be told when a cgroup runs empty. You can find a process's cgroups by reading /proc/$PID/cgroup. Consequently, cgroups are very well suited for the babysitting role, i.e. for directly tracking processes.

Controlling the process environment


A good babysitter should not only watch and control when and how services are started and stopped, or when they crash; it should also set up a good, minimal, and secure working environment for them.

That means setting obvious process parameters such as resource limits via setrlimit(), user/group IDs, or the environment variable block, but it does not end there. The Linux kernel gives users and administrators a great deal of control over processes: for each process you can set CPU and IO scheduler parameters, CPU affinity, and of course the cgroup environment with additional limits, and much more.
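As a small sketch, this is the kind of setup an init system might perform in the child process between fork() and exec(); the specific values are illustrative only:

    /* Sketch: per-service execution parameters set in the child
     * after fork() and before exec(). Values are examples. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    void set_execution_environment(void) {
        /* Resource limits: at most 256 open files, no core dumps. */
        struct rlimit rl = { .rlim_cur = 256, .rlim_max = 256 };
        setrlimit(RLIMIT_NOFILE, &rl);
        rl.rlim_cur = rl.rlim_max = 0;
        setrlimit(RLIMIT_CORE, &rl);

        /* CPU affinity: pin the service to CPU 0. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        sched_setaffinity(0, sizeof(set), &set);

        /* CPU scheduling: run at a low (nice) priority. */
        setpriority(PRIO_PROCESS, 0, 10);
    }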

As an example, calling ioprio_set() with IOPRIO_CLASS_IDLE is a great way to minimize the impact of locate's updatedb job on the interactivity of the system.
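For illustration, here is roughly what that call looks like; glibc has historically shipped no wrapper for ioprio_set(), so the raw syscall is used, and the constants below are copied from the kernel's ioprio definitions:

    /* Sketch: mark the calling process as "idle" I/O priority, so its
     * disk access only happens when nobody else needs the disk. */
    #include <unistd.h>
    #include <sys/syscall.h>

    #define IOPRIO_CLASS_SHIFT  13
    #define IOPRIO_CLASS_IDLE   3
    #define IOPRIO_WHO_PROCESS  1
    #define IOPRIO_PRIO_VALUE(cl, data) (((cl) << IOPRIO_CLASS_SHIFT) | (data))

    int make_io_idle(void) {
        /* 0 as the second argument means "the calling process". */
        return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                       IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
    }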

On top of that, certain high-level controls can be very useful, such as read-only layers over the file system based on bind mounts. That way we can run certain daemons so that all (or some) file systems appear read-only to them, and every write attempt returns an EROFS error. This lets us restrict what daemons can do in a poor man's SELinux fashion (but this method is definitely no substitute for SELinux, so please don't take it as an excuse to avoid SELinux).
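For reference, a minimal sketch of the classic two-step read-only bind mount; a plain MS_BIND mount ignores MS_RDONLY, hence the remount:

    /* Sketch: make src visible at dst read-only. After this, writes
     * under dst fail with EROFS. */
    #include <sys/mount.h>

    int bind_mount_readonly(const char *src, const char *dst) {
        if (mount(src, dst, NULL, MS_BIND, NULL) < 0)
            return -1;
        return mount(NULL, dst, NULL,
                     MS_BIND | MS_REMOUNT | MS_RDONLY, NULL);
    }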

And finally, logging is an important part of running services: ideally, every bit of output a service generates should be logged. So the init system should provide logging to the daemons it spawns right from the start, attaching their standard output and standard error to syslog or, in some cases, to /dev/kmsg, which in many cases is a perfectly good replacement for syslog (embedded folks, listen up!), especially in times when the kernel log buffer is configured ridiculously large out of the box.
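A short sketch of that wiring, done in the child right before exec(); the daemon path passed in argv is whatever the init system wants to start:

    /* Sketch: attach a daemon's stdout/stderr to /dev/kmsg before
     * exec(), so everything it prints lands in the kernel log buffer. */
    #include <fcntl.h>
    #include <unistd.h>

    void exec_with_kmsg_logging(char *const argv[]) {
        int fd = open("/dev/kmsg", O_WRONLY);
        if (fd >= 0) {
            dup2(fd, STDOUT_FILENO);
            dup2(fd, STDERR_FILENO);
            if (fd > STDERR_FILENO)
                close(fd);
        }
        execv(argv[0], argv);  /* on success, never returns */
    }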

To be continued…

Source: https://habr.com/ru/post/335780/

