
New interface for getting process attributes in Linux

While developing CRIU, we realized that the current interface for obtaining information about processes is far from perfect. A similar problem had already been solved successfully for sockets, so we tried to apply that approach to processes and got quite good results, which you will learn about if you read this article to the end.

Disadvantages of the current interface


After reading the title, the first question is: what was wrong with the old interface? Many of you know that information about processes is currently exposed through the procfs file system. There, each process corresponds to a directory that contains several dozen files.

$ ls /proc/self/
attr             cwd      loginuid    numa_maps      schedstat  task
autogroup        environ  map_files   oom_adj        sessionid  timers
auxv             exe      maps        oom_score      setgroups  uid_map
cgroup           fd       mem         oom_score_adj  smaps      wchan
clear_refs       fdinfo   mountinfo   pagemap        stack
cmdline          gid_map  mounts      personality    stat
comm             io       mountstats  projid_map     statm
coredump_filter  latency  net         root           status
cpuset           limits   ns          sched          syscall


Each file contains a set of attributes. The first problem is that we have to read at least one file per process, i.e. make at least three system calls (open, read, close). If you need to collect data about hundreds of thousands of processes, this can take a long time even on a powerful machine. You may recall that the ps and top utilities run slowly on loaded machines.
The second problem stems from how process properties are split across files. We have a good example showing that the current split is not ideal. CRIU needs data about all memory regions of a process. If we look at /proc/PID/maps, we find that it does not contain the flags that are required to restore the memory regions. Fortunately, there is another file, /proc/PID/smaps, which contains the necessary information as well as statistics on consumed physical memory (which we do not need). A simple experiment shows that generating the first file takes an order of magnitude less time.

$ time cat /proc/*/maps > /dev/null

real	0m0.061s
user	0m0.002s
sys	0m0.059s

$ time cat /proc/*/smaps > /dev/null

real	0m0.253s
user	0m0.004s
sys	0m0.247s


You have probably already guessed that the memory-consumption statistics are to blame: collecting them is what takes most of the time.

The third problem lies in the file formats. First, there is no uniform format. Second, the format of some files cannot be extended at all (for that reason we cannot add a flags field to /proc/PID/maps). Third, many of the files are text files that are easy for a human to read. This is convenient when you want to look at a single process. However, when the task is to analyze thousands of processes, you will not read them with your eyes, you will write code. Parsing files in many different formats is not the most pleasant pastime. A binary format is usually easier to handle in program code, and generating it often requires fewer resources.
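As a small illustration of the parsing problem (this sketch is mine, not code from the original article): even the seemingly simple /proc/PID/stat cannot just be split on whitespace, because the comm field is enclosed in parentheses and may itself contain spaces and parentheses.

/* Minimal sketch: extracting a few fields from /proc/self/stat.
 * The second field (comm) may contain spaces and ')' characters,
 * so we look for the last ')' before parsing the rest. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[4096];
	FILE *f = fopen("/proc/self/stat", "r");

	if (!f || !fgets(buf, sizeof(buf), f))
		return 1;
	fclose(f);

	char *end = strrchr(buf, ')');	/* end of the comm field */
	int pid, ppid, pgrp, session;
	char state;

	sscanf(buf, "%d", &pid);
	sscanf(end + 2, "%c %d %d %d", &state, &ppid, &pgrp, &session);
	printf("pid=%d state=%c ppid=%d pgrp=%d sid=%d\n",
	       pid, state, ppid, pgrp, session);
	return 0;
}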

Interface for obtaining information about sockets: socket-diag


When we started working on CRIU, there was a problem with getting information about sockets. For most socket types, as usual, files in /proc were used (/proc/net/unix, /proc/net/netlink, etc.), which contain a rather limited set of parameters. For INET sockets there was a netlink-based interface that presented information in a binary and easily extensible format. We managed to generalize this interface to all socket types.
It works as follows. First, a request is formed that specifies a set of parameter groups and the set of sockets for which they are needed. The output is the required data, split into messages; one message describes one socket. All parameters are divided into groups, and a group can be quite small, since its only overhead is the header describing its type and size. At the same time, we retain the ability to extend existing groups or add new ones.
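As an illustration (this sketch is not from the original article), here is roughly what a sock_diag request for TCP sockets looks like; error handling and the parsing of the reply attributes are omitted.

/* Minimal sketch of a sock_diag dump request for all TCP sockets. */
#include <linux/inet_diag.h>
#include <linux/netlink.h>
#include <linux/sock_diag.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_SOCK_DIAG);

	struct {
		struct nlmsghdr nlh;
		struct inet_diag_req_v2 req;
	} msg;

	memset(&msg, 0, sizeof(msg));
	msg.nlh.nlmsg_len   = sizeof(msg);
	msg.nlh.nlmsg_type  = SOCK_DIAG_BY_FAMILY;
	msg.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;	/* dump all matching sockets */
	msg.req.sdiag_family   = AF_INET;
	msg.req.sdiag_protocol = IPPROTO_TCP;
	msg.req.idiag_states   = -1;				/* sockets in any state */
	msg.req.idiag_ext      = 1 << (INET_DIAG_INFO - 1);	/* request an extra attribute group */

	send(fd, &msg, sizeof(msg), 0);

	/* The reply is a stream of netlink messages, one inet_diag_msg per socket,
	 * each followed by the requested attribute groups (struct rtattr). */
	char buf[32768];
	ssize_t len = recv(fd, buf, sizeof(buf), 0);
	(void)len;

	close(fd);
	return 0;
}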

New interface for getting process attributes: task-diag


When we ran into the problems with obtaining process data, the analogy with sockets immediately came to mind, and the idea arose to use the same kind of interface for processes.

All attributes have to be divided into groups, with one important rule: no attribute should noticeably affect the time needed to generate the other attributes in its group. Remember what I said about /proc/PID/smaps? In the new interface we moved those statistics into a separate group.

At the first stage we did not aim to cover all attributes; we wanted to understand how convenient the new interface is. So we decided to make it sufficient for the needs of CRIU, which resulted in the following set of attribute groups:
	TASK_DIAG_BASE,		/* pid, tid, sig, pgid, comm */
	TASK_DIAG_CRED,		/* process credentials */
	TASK_DIAG_STAT,		/* statistics, the same data that taskstats provides */
	TASK_DIAG_VMA,		/* memory regions */
	TASK_DIAG_VMA_STAT,	/* memory consumption statistics for each region */

	TASK_DIAG_PID = 64,	/* process identifier */
	TASK_DIAG_TGID,		/* thread group identifier */


This is, in fact, the current version of the division into groups. In particular, TASK_DIAG_STAT appeared in the second version as part of integrating the interface with the already existing taskstats interface, which is built on top of netlink sockets. The latter uses the netlink protocol and has a number of known problems, which we will touch on below.

And a couple of words about how the set of processes for which information is requested is specified:
#define TASK_DIAG_DUMP_ALL		0	/* all processes in the system */
#define TASK_DIAG_DUMP_ALL_THREAD	1	/* all threads of all processes */
#define TASK_DIAG_DUMP_CHILDREN		2	/* children of the specified process */
#define TASK_DIAG_DUMP_THREAD		3	/* threads of the specified process */
#define TASK_DIAG_DUMP_ONE		4	/* only the specified process */
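To make this more concrete, here is a hypothetical sketch of how a request combines a dump mode with a bit mask of requested attribute groups. The structure layout, field names and the use of group numbers as bit positions are my assumptions for illustration; the real definitions live in the task_diag patch series and may differ.

/* Hypothetical request layout, for illustration only: the real structure in the
 * task_diag patches may use different field names, sizes and flag encoding. */
#include <stdint.h>

#define TASK_DIAG_DUMP_CHILDREN 2	/* dump mode quoted from the list above */

struct task_diag_request {
	uint32_t pid;			/* starting process (unused when dumping all) */
	uint32_t dump_strategy;		/* one of the TASK_DIAG_DUMP_* modes */
	uint64_t show_flags;		/* bit mask of requested attribute groups */
};

int main(void)
{
	/* Ask for the base attributes and memory regions of all children of pid 1234,
	 * assuming the group enum values (TASK_DIAG_BASE = 0, TASK_DIAG_VMA = 3, ...)
	 * can be used as bit positions. */
	struct task_diag_request req = {
		.pid           = 1234,
		.dump_strategy = TASK_DIAG_DUMP_CHILDREN,
		.show_flags    = (1ULL << 0) | (1ULL << 3),
	};
	(void)req;
	return 0;
}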


Several questions arose during implementation. The interface should be accessible to unprivileged users, i.e. we needed to keep permission checks somewhere. The second question is: where do we take the reference to the process namespace (pidns) from?

Let's start with the second one. We use the netlink interface, which is based on sockets and is used primarily in the network subsystem, where the reference to the network namespace is taken from the socket. In our case we need a reference to the process namespace. After reading a bit of kernel code, it turned out that every message carries information about its sender (SCM_CREDENTIALS), which includes the process ID, so we can take the pidns reference from the sender. This differs from how the network namespace is handled, where the socket is bound to the namespace it was created in. Taking the pidns reference from the process that requested the information seems acceptable, and we also get the opportunity to specify the namespace we need, since sender information can be set when the message is sent.
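For reference (this is generic socket code, not from the article): sender credentials are a struct ucred attached to the message as ancillary data, which is also how a sufficiently privileged sender could specify a different pid. A minimal sketch of attaching them explicitly:

/* Sketch: explicitly attaching sender credentials (SCM_CREDENTIALS) to a message.
 * Overriding pid/uid/gid requires the corresponding capabilities; an unprivileged
 * process can only pass its own values. */
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

ssize_t send_with_creds(int fd, const void *data, size_t len)
{
	struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
	union {
		char buf[CMSG_SPACE(sizeof(struct ucred))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};

	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type  = SCM_CREDENTIALS;
	cmsg->cmsg_len   = CMSG_LEN(sizeof(struct ucred));

	struct ucred creds = { .pid = getpid(), .uid = getuid(), .gid = getgid() };
	memcpy(CMSG_DATA(cmsg), &creds, sizeof(creds));

	return sendmsg(fd, &msg, 0);
}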

The first problem turned out to be much more interesting, and for a long time we could not understand all of its details. Linux file descriptors have a useful property: we can open a file and then drop privileges, and the descriptor remains fully functional. To some extent the same is true for netlink sockets, but there is a problem that Andy Lutomirski pointed out to me. It is that we cannot specify what exactly a netlink socket will be used for. So if an application creates a netlink socket and then drops privileges, it can still use that socket for any functionality available over netlink sockets; in other words, dropping privileges has no effect on the netlink socket. Whenever we add new functionality to netlink sockets, we open up new possibilities for applications that already hold such sockets, which is a serious security problem.

There were other suggestions for the interface. In particular, there was an idea to add a new system call, but I did not like it much, because there may be too much data to return it all in a single buffer. A file descriptor lets the data be read out in chunks, which, in my opinion, looks more reasonable.

There was also a proposal to make a transactional file in procfs. The idea is similar to what we did with netlink sockets: open the file, write a request, read the reply. This is the idea we settled on as the working approach for the next version.
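The resulting usage pattern is simple; here is a sketch of it, with a hypothetical file name and an opaque request buffer, since the exact names and formats are defined by the patch series.

/* Sketch of the transactional-file pattern: open, write a request, read the reply
 * in chunks. The file name and request format are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/proc/task-diag", O_RDWR);	/* hypothetical path */
	if (fd < 0)
		return 1;

	char req[64] = { 0 };				/* a task_diag request would go here */
	if (write(fd, req, sizeof(req)) < 0)
		return 1;

	char buf[16384];
	ssize_t n;
	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		/* each chunk carries netlink-style messages, one per task */
		printf("got %zd bytes\n", n);
	}

	close(fd);
	return 0;
}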

A few words about performance


The first version did not cause much discussion, but it helped find another group of people interested in a new, faster interface for obtaining process properties. One evening I shared my work with Pavel Odintsov (@pavelodintsov), and he said he had recently had problems with perf that were also related to the speed of collecting process attributes. That is how he put us in touch with David Ahern, who has made a considerable contribution to the development of the interface and showed, with yet another example, that this work is needed by more than just us.

The performance comparison can start with a simple example. Suppose we need to get, for every process, the session ID, process group and a few other parameters from /proc/PID/stat.

For a fair comparison, we will write a small program that reads /proc/PID/stat for each process. Below we will see that it is faster than the ps utility.

	/* d is a DIR * opened on /proc; buf, fd and tasks are declared earlier */
	while ((de = readdir(d))) {
		/* skip everything that is not a PID directory */
		if (de->d_name[0] < '0' || de->d_name[0] > '9')
			continue;

		snprintf(buf, sizeof(buf), "/proc/%s/stat", de->d_name);
		fd = open(buf, O_RDONLY);
		read(fd, buf, sizeof(buf));
		close(fd);
		tasks++;
	}


The program for task-diag is more voluminous; it can be found in my repository in the tools/testing/selftests/task_diag/ directory.

$ ps a -o pid,ppid,pgid,sid,comm | wc -l
50006

$ time ps a -o pid,ppid,pgid,sid,comm > /dev/null

real	0m1.256s
user	0m0.367s
sys	0m0.871s

$ time ./task_proc_all a
tasks: 50085

real	0m0.279s
user	0m0.013s
sys	0m0.255s

$ time ./task_diag_all a

real	0m0.051s
user	0m0.001s
sys	0m0.049s


Even in such a simple example it is clear that task_diag is several times faster. The ps utility is slower because it reads several files for each process.

Let's see what perf trace --summary shows for both variants.

$ perf trace --summary ./task_proc_all a
tasks: 50086

 Summary of events:

 task_proc_all (72414), 300753 events, 100.0%, 0.000 msec

   syscall            calls      min       avg       max      stddev
                                (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ------
   read               50091     0.003     0.005     0.925      0.40%
   write                  1     0.011     0.011     0.011      0.00%
   open               50092     0.003     0.004     0.992      0.49%
   close              50093     0.002     0.002     0.061      0.15%
   fstat                  7     0.002     0.003     0.008     25.95%
   mmap                  18     0.002     0.006     0.026     19.70%
   mprotect              10     0.006     0.010     0.020     13.28%
   munmap                 2     0.012     0.020     0.028     40.18%
   brk                    3     0.003     0.007     0.010     30.28%
   rt_sigaction           2     0.003     0.003     0.004     18.81%
   rt_sigprocmask         1     0.003     0.003     0.003      0.00%
   access                 1     0.005     0.005     0.005      0.00%
   getdents              50     0.003     0.940     2.023      4.51%
   getrlimit              1     0.003     0.003     0.003      0.00%
   arch_prctl             1     0.002     0.002     0.002      0.00%
   set_tid_address        1     0.003     0.003     0.003      0.00%
   openat                 1     0.022     0.022     0.022      0.00%
   set_robust_list        1     0.003     0.003     0.003      0.00%


$ perf trace --summary ./task_diag_all a

 Summary of events:

 task_diag_all (72481), 183 events, 94.8%, 0.000 msec

   syscall            calls      min       avg       max      stddev
                                (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ------
   read                  31     0.003     1.471     6.364     14.43%
   write                  1     0.003     0.003     0.003      0.00%
   open                   7     0.005     0.008     0.020     26.21%
   close                  6     0.002     0.002     0.003      3.96%
   fstat                  6     0.002     0.002     0.003      4.67%
   mmap                  17     0.002     0.006     0.030     25.38%
   mprotect              10     0.005     0.007     0.010      6.33%
   munmap                 2     0.006     0.007     0.008     13.84%
   brk                    3     0.003     0.004     0.004      9.08%
   rt_sigaction           2     0.002     0.002     0.002      9.57%
   rt_sigprocmask         1     0.002     0.002     0.002      0.00%
   access                 1     0.006     0.006     0.006      0.00%
   getrlimit              1     0.002     0.002     0.002      0.00%
   arch_prctl             1     0.002     0.002     0.002      0.00%
   set_tid_address        1     0.002     0.002     0.002      0.00%
   set_robust_list        1     0.002     0.002     0.002      0.00%


The number of system calls in the task_diag case is drastically lower.

Results for the perf utility (quoted from David Ahern's email):
> Using the fork test command:
>    10,000 processes; 10k proc with 5 threads = 50,000 tasks
>    reading /proc: 11.3 sec
>    task_diag:      2.2 sec
>
> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
>
> 128 instances of sepcjbb, 80,000+ tasks:
>    reading /proc: 32.1 sec
>    task_diag:      3.9 sec
>
> So overall much snappier startup times.


Here we see a performance improvement of an order of magnitude.

Conclusion


This project is still under development and may change many times, but we already have two real projects that demonstrate a serious performance gain. I am almost certain that, in one form or another, this work will sooner or later make it into the mainline kernel.

Links


github.com/avagin/linux-task-diag
lkml.org/lkml/2015/7/6/142
lwn.net/Articles/633622
www.slideshare.net/openvz/speeding-up-ps-and-top-57448025

Source: https://habr.com/ru/post/275545/

