
How effective is the procfs virtual file system and can it be optimized?

The proc file system (hereinafter simply procfs) is a virtual file system that provides information about processes. It is a textbook example of an interface following the “everything is a file” paradigm. Procfs was designed a long time ago, when a server typically ran a few dozen processes and opening a file to read information about one of them was not a problem. Time does not stand still, however, and nowadays a single server can run hundreds of thousands of processes or even more. In this context, the idea of “opening a file for each process to read the data of interest” no longer looks so attractive, and the first thing that comes to mind for speeding up reads is getting information about a whole group of processes in one iteration. In this article we will try to find elements of procfs that can be optimized.




The idea of improving procfs came up when we discovered that CRIU spends a significant amount of time just reading procfs files. We had seen how a similar problem was solved for sockets, and decided to do something similar to the sock_diag interface, but for procfs. Of course, we had an idea of how difficult it would be to change an old, well-established interface in the kernel and to convince the community that the game was worth the candle, and we were pleasantly surprised by the number of people who supported the creation of a new interface. Strictly speaking, nobody knew what the new interface should look like, but there is no doubt that procfs does not meet today's performance requirements. Consider this scenario: the server takes too long to respond to requests, vmstat shows that memory has gone into swap, launching “ps ax” takes 10 seconds or more, and top shows nothing at all. In this article we will not consider any specific new interface; rather, we will try to describe the problems and the ways to solve them.


In procfs, each running process is represented by the /proc/<pid> directory.
Each such directory contains many files and subdirectories that provide access to specific information about the process. Subdirectories group the data by attribute. For example ($$ is a special shell variable that expands to the pid of the current process):


 $ ls -F /proc/$$
 attr/            exe@        mounts         projid_map     status
 autogroup        fd/         mountstats     root@          syscall
 auxv             fdinfo/     net/           sched          task/
 cgroup           gid_map     ns/            schedstat      timers
 clear_refs       io          numa_maps      sessionid      timerslack_ns
 cmdline          limits      oom_adj        setgroups      uid_map
 comm             loginuid    oom_score      smaps          wchan
 coredump_filter  map_files/  oom_score_adj  smaps_rollup
 cpuset           maps        pagemap        stack
 cwd@             mem         patch_state    stat
 environ          mountinfo   personality    statm

All these files expose data in different formats. Most of them are ASCII text, easily readable by a human. Well, almost easily:


 $ cat /proc/$$/stat
 24293 (bash) S 21811 24293 24293 34854 24876 4210688 6325 19702 0 10 15 7 33 35 20 0 1 0 47892016 135487488 3388 18446744073709551615 94447405350912 94447406416132 140729719486816 0 0 0 65536 3670020 1266777851 1 0 0 17 2 0 0 0 0 0 94447408516528 94447408563556 94447429677056 140729719494655 140729719494660 140729719494660 140729719496686 0

To understand what each element of this set means, the reader has to open proc(5) in man or the kernel documentation. For example, the second field is the name of the executable in parentheses, and the nineteenth field is the current value of the nice priority.
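
For illustration, here is a minimal C sketch (our own example, not part of the article's tooling) that extracts those two fields; it anchors on the last closing parenthesis, because the command name itself may contain spaces and parentheses:

 /*
  * A sketch of parsing /proc/<pid>/stat: field 2 (comm, in parentheses)
  * and field 19 (nice). Our own illustration, not from the article.
  */
 #include <stdio.h>
 #include <string.h>

 int main(void)
 {
     char buf[4096], comm[64];
     long nice_val = 0;
     FILE *f = fopen("/proc/self/stat", "r");

     if (!f || !fgets(buf, sizeof(buf), f))
         return 1;
     fclose(f);

     char *open_paren = strchr(buf, '(');
     char *close_paren = strrchr(buf, ')');
     if (!open_paren || !close_paren)
         return 1;

     /* field 2: the command name between the parentheses */
     snprintf(comm, sizeof(comm), "%.*s",
              (int)(close_paren - open_paren - 1), open_paren + 1);

     /* after ')' comes field 3 (state); skip 16 fields to reach field 19 (nice) */
     char *p = close_paren + 2;
     for (int i = 0; i < 16 && p; i++) {
         p = strchr(p, ' ');
         if (p)
             p++;
     }
     if (p)
         sscanf(p, "%ld", &nice_val);

     printf("comm=%s nice=%ld\n", comm, nice_val);
     return 0;
 }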


Some files are quite readable by themselves:


 $ cat /proc/$$/status | head -n 5
 Name:   bash
 Umask:  0002
 State:  S (sleeping)
 Tgid:   24293
 Ngid:   0

But how often do users read information directly from procfs files? How much time does the kernel spend converting binary data into text form? What overhead does procfs add? How convenient is this interface for monitoring programs, and how much time do they spend processing this text data? How critical is such a slow implementation in emergency situations?


It is probably safe to say that users prefer programs like ps or top to reading data from procfs directly.


To answer the remaining questions, we will run several experiments. First, let's find out where the kernel spends time generating procfs files.


To obtain a given piece of information about every process in the system, we have to walk the /proc/ directory and select all subdirectories whose names consist of decimal digits. Then, in each of them, we need to open a file, read it, and close it.
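
Conceptually, that is all the per-process loop does. Below is a minimal C sketch of this walk (our own illustration, not the task_proc_all utility used in the measurements later), with the three system calls per process marked in the comments:

 /*
  * Sketch of the procfs walk: list /proc, keep only the numeric
  * entries, and open/read/close one file per process.
  */
 #include <ctype.h>
 #include <dirent.h>
 #include <fcntl.h>
 #include <limits.h>
 #include <stdio.h>
 #include <unistd.h>

 int main(int argc, char **argv)
 {
     /* which per-process file to read, e.g. "stat" or "status" */
     const char *name = argc > 1 ? argv[1] : "stat";
     char path[PATH_MAX], buf[4096];
     unsigned long tasks = 0;
     struct dirent *de;
     DIR *d = opendir("/proc");

     if (!d)
         return 1;
     while ((de = readdir(d)) != NULL) {
         if (!isdigit((unsigned char)de->d_name[0]))
             continue;                              /* not a <pid> directory */
         snprintf(path, sizeof(path), "/proc/%s/%s", de->d_name, name);
         int fd = open(path, O_RDONLY);             /* system call 1 */
         if (fd < 0)
             continue;                              /* the process may have exited */
         read(fd, buf, sizeof(buf));                /* system call 2: the only one returning data */
         close(fd);                                 /* system call 3 */
         tasks++;
     }
     closedir(d);
     printf("tasks: %lu\n", tasks);
     return 0;
 }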


In total, we execute three system calls per process, and one of them creates a file descriptor (in the kernel, a file descriptor is associated with a set of internal objects for which additional memory is allocated). The open() and close() system calls do not give us any information by themselves, so they can be counted as pure overhead of the procfs interface.


Let's try just calling open() and close() for each process in the system, without reading the contents of the files:


 $ time ./task_proc_all --noread stat
 tasks: 50290

 real    0m0.177s
 user    0m0.012s
 sys     0m0.162s

 $ time ./task_proc_all --noread loginuid
 tasks: 50289

 real    0m0.176s
 user    0m0.026s
 sys     0m0.145s

task_proc_all is a small utility whose source code is available via the link below.


It does not matter which file we open, because the real data is generated only at read() time.


Now let's look at the output of the perf kernel profiler:


 -   92.18%     0.00%  task_proc_all  [unknown]
    - 0x8000
       - 64.01% __GI___libc_open
          - 50.71% entry_SYSCALL_64_fastpath
             - do_sys_open
                - 48.63% do_filp_open
                   - path_openat
                      - 19.60% link_path_walk
                         - 14.23% walk_component
                            - 13.87% lookup_fast
                               - 7.55% pid_revalidate
                                    4.13% get_pid_task
                                  + 1.58% security_task_to_inode
                                    1.10% task_dump_owner
                                 3.63% __d_lookup_rcu
                         + 3.42% security_inode_permission
                      + 14.76% proc_pident_lookup
                      + 4.39% d_alloc_parallel
                      + 2.93% get_empty_filp
                      + 2.43% lookup_fast
                      + 0.98% do_dentry_open
              2.07% syscall_return_via_sysret
              1.60% 0xfffffe000008a01b
              0.97% kmem_cache_alloc
              0.61% 0xfffffe000008a01e
       - 16.45% __getdents64
          - 15.11% entry_SYSCALL_64_fastpath
               sys_getdents
               iterate_dir
             - proc_pid_readdir
                - 7.18% proc_fill_cache
                   + 3.53% d_lookup
                     1.59% filldir
                + 6.82% next_tgid
                + 0.61% snprintf
       - 9.89% __close
          + 4.03% entry_SYSCALL_64_fastpath
            0.98% syscall_return_via_sysret
            0.85% 0xfffffe000008a01b
            0.61% 0xfffffe000008a01e
         1.10% syscall_return_via_sysret

The kernel spends almost 75% of the time just creating and destroying the file descriptor, and about 16% listing the processes.


Although we now know how long open() and close() take for each process, we cannot yet judge how significant that is; we need to compare these numbers with something. Let's do the same with the most commonly read files. Usually, when you need to list processes, the ps or top utilities are used. Both of them read /proc/<pid>/stat and /proc/<pid>/status for every process in the system.


Let's start with /proc/<pid>/status, a fairly large file with a fixed number of fields:


 $ time ./task_proc_all status
 tasks: 50283

 real    0m0.455s
 user    0m0.033s
 sys     0m0.417s

 -   93.84%     0.00%  task_proc_all  [unknown]  [k] 0x0000000000008000
    - 0x8000
       - 61.20% read
          - 53.06% entry_SYSCALL_64_fastpath
             - sys_read
                - 52.80% vfs_read
                   - 52.22% __vfs_read
                      - seq_read
                         - 50.43% proc_single_show
                            - 50.38% proc_pid_status
                               - 11.34% task_mem
                                    + seq_printf
                               + 6.99% seq_printf
                               - 5.77% seq_put_decimal_ull
                                    1.94% strlen
                                  + 1.42% num_to_str
                               - 5.73% cpuset_task_status_allowed
                                    + seq_printf
                               - 5.37% render_cap_t
                                  + 5.31% seq_printf
                               - 5.25% render_sigset_t
                                    0.84% seq_putc
                                 0.73% __task_pid_nr_ns
                               + 0.63% __lock_task_sighand
                                 0.53% hugetlb_report_usage
                         + 0.68% _copy_to_user
            1.10% number
            1.05% seq_put_decimal_ull
            0.84% vsnprintf
            0.79% format_decode
            0.73% syscall_return_via_sysret
            0.52% 0xfffffe000003201b
       + 20.95% __GI___libc_open
       + 6.44% __getdents64
       + 4.10% __close

It can be seen that only about 60% of the time is spent inside the read() system call. Looking at the profile more closely, we find that 45% of the time goes to the kernel functions seq_printf and seq_put_decimal_ull, so converting from binary to text is quite an expensive operation. This raises a legitimate question: do we really need a text interface for pulling data out of the kernel? How often do users actually want to work with the raw text? And why do the top and ps utilities have to convert this text back into binary form?


It would be interesting to know how much faster the output could be if binary data were used directly, and if three system calls per process were not required.


Attempts to create such an interface have been made before. Back in 2004, there was an attempt to use the netlink engine.


 [0/2][ANNOUNCE] nproc: netlink access to /proc information
 (https://lwn.net/Articles/99600/)

 nproc is an attempt to address the current problems with /proc. In short,
 it exposes the same information via netlink (implemented for a small subset).

Unfortunately, the community did not show much interest in this work. One of the most recent attempts to rectify the situation was made two years ago.


 [PATCH 0/15] task_diag: add a new interface to get information about processes
 (https://lwn.net/Articles/683371/)

The task_diag interface is based on the following principles (as listed in the patch series linked above):

- Transactional: write a request, read back a response;
- Netlink message format (the same one used by sock_diag: binary and extendable);
- The ability to specify a set of processes to get information about;
- Optimal grouping of attributes: an attribute in a group must not increase the response time.

This interface has been presented at several conferences. It was integrated into pstools and CRIU, and David Ahern also integrated task_diag into perf as an experiment.


The kernel developer community became interested in the task_diag interface. The main subject of discussion was the choice of transport between the kernel and user space. The initial idea of using netlink sockets was rejected: partly because of unresolved problems in the netlink engine code itself, and partly because many people believe the netlink interface was designed exclusively for the network subsystem. It was then proposed to use transactional files inside procfs: the user opens a file, writes a request to it, and then simply reads back the answer. As usual, this approach also had opponents. A solution that suits everyone has not yet been found.
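
To make the "transactional file" idea more concrete, here is a purely hypothetical sketch of what the usage pattern could look like from user space. The file name, request structure, and mask values below are invented for illustration only, since no such ABI was ever agreed upon or merged:

 /*
  * Purely hypothetical: the file name, request layout, and mask values
  * are invented for illustration; no such interface exists in the kernel.
  */
 #include <fcntl.h>
 #include <stdint.h>
 #include <stdio.h>
 #include <unistd.h>

 struct task_req {            /* hypothetical request describing the query */
     uint32_t dump_all;       /* 1 = all processes visible in this pid namespace */
     uint32_t show_mask;      /* bitmask of attribute groups (pids, creds, cmdline, ...) */
 };

 int main(void)
 {
     struct task_req req = { .dump_all = 1, .show_mask = 0x3 };
     char buf[16 * 4096];
     ssize_t n;
     int fd = open("/proc/task-diag", O_RDWR);    /* hypothetical transactional file */

     if (fd < 0)
         return 1;
     /* one write() describes the whole query... */
     if (write(fd, &req, sizeof(req)) != sizeof(req))
         return 1;
     /* ...and a few large read()s stream back binary records for every
      * matching process, instead of three system calls per pid */
     while ((n = read(fd, buf, sizeof(buf))) > 0)
         printf("received %zd bytes of task records\n", n);

     close(fd);
     return 0;
 }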


Let's compare the performance of task_diag with procfs.


The task_diag engine comes with a test utility that is well suited for our experiments. Suppose we want to request process IDs and credentials. Here is the output for a single process:


 $ ./task_diag_all one -c -p $$
 pid 2305 tgid 2305 ppid 2299 sid 2305 pgid 2305 comm bash
 uid: 1000 1000 1000 1000
 gid: 1000 1000 1000 1000
 CapInh: 0000000000000000
 CapPrm: 0000000000000000
 CapEff: 0000000000000000
 CapBnd: 0000003fffffffff

And now for all processes in the system, that is, the same thing we did in the procfs experiment when we read the /proc/<pid>/status file:


 $ time ./task_diag_all all -c

 real    0m0.048s
 user    0m0.001s
 sys     0m0.046s

It took only 0.05 seconds to get the data needed to build a process tree. With procfs, it took 0.177 seconds just to open one file per process, without even reading any data.


The perf output for the task_diag interface:


 -   82.24%     0.00%  task_diag_all  [kernel.vmlinux]  [k] entry_SYSCALL_64_fastpath
    - entry_SYSCALL_64_fastpath
       - 81.84% sys_read
            vfs_read
            __vfs_read
            proc_reg_read
            task_diag_read
          - taskdiag_dumpit
             + 33.84% next_tgid
               13.06% __task_pid_nr_ns
             + 6.63% ptrace_may_access
             + 5.68% from_kuid_munged
             - 4.19% __get_task_comm
                  2.90% strncpy
                  1.29% _raw_spin_lock
               3.03% __nla_reserve
               1.73% nla_reserve
             + 1.30% skb_copy_datagram_iter
             + 1.21% from_kgid_munged
               1.12% strncpy

There is nothing particularly interesting in this listing, other than the fact that there are no obvious functions worth optimizing.


Let's also look at the perf trace summary of system calls made while reading information about all processes in the system:


 $ perf trace -s ./task_diag_all all -c -q

 Summary of events:

 task_diag_all (54326), 185 events, 95.4%

   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   read                  49    40.209     0.002     0.821     4.126      9.50%
   mmap                  11     0.051     0.003     0.005     0.007      9.94%
   mprotect               8     0.047     0.003     0.006     0.009     10.42%
   openat                 5     0.042     0.005     0.008     0.020     34.86%
   munmap                 1     0.014     0.014     0.014     0.014      0.00%
   fstat                  4     0.006     0.001     0.002     0.002     10.47%
   access                 1     0.006     0.006     0.006     0.006      0.00%
   close                  4     0.004     0.001     0.001     0.001      2.11%
   write                  1     0.003     0.003     0.003     0.003      0.00%
   rt_sigaction           2     0.003     0.001     0.001     0.002     15.43%
   brk                    1     0.002     0.002     0.002     0.002      0.00%
   prlimit64              1     0.001     0.001     0.001     0.001      0.00%
   arch_prctl             1     0.001     0.001     0.001     0.001      0.00%
   rt_sigprocmask         1     0.001     0.001     0.001     0.001      0.00%
   set_robust_list        1     0.001     0.001     0.001     0.001      0.00%
   set_tid_address        1     0.001     0.001     0.001     0.001      0.00%

With procfs, we have to make more than 150,000 system calls (three per process) to pull out information about all processes; with task_diag, a little more than 50.


Now let's look at a real-life scenario: we want to display the process tree along with the command line arguments of each process. To do this, we need the pid of each process, the pid of its parent, and the command line arguments themselves.
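
On the procfs side, gathering that data for one process might look roughly like the sketch below (the helper is our own illustration, not part of the article's test utilities); note that the arguments in /proc/<pid>/cmdline are separated by NUL bytes:

 /*
  * Our own illustration: read PPid from /proc/<pid>/status and the
  * command line from /proc/<pid>/cmdline (NUL-separated arguments).
  */
 #include <stdio.h>
 #include <string.h>
 #include <unistd.h>

 static int print_tree_entry(int pid)
 {
     char path[64], line[256], cmdline[4096];
     int ppid = -1;
     FILE *f;

     snprintf(path, sizeof(path), "/proc/%d/status", pid);
     f = fopen(path, "r");
     if (!f)
         return -1;                                 /* the process may already be gone */
     while (fgets(line, sizeof(line), f))
         if (sscanf(line, "PPid:\t%d", &ppid) == 1)
             break;
     fclose(f);

     snprintf(path, sizeof(path), "/proc/%d/cmdline", pid);
     f = fopen(path, "r");
     if (!f)
         return -1;
     size_t n = fread(cmdline, 1, sizeof(cmdline) - 1, f);
     fclose(f);

     /* arguments are NUL-separated; join them with spaces for printing */
     for (size_t i = 0; i + 1 < n; i++)
         if (cmdline[i] == '\0')
             cmdline[i] = ' ';
     cmdline[n] = '\0';

     printf("%d (parent %d): %s\n", pid, ppid, cmdline);
     return 0;
 }

 int main(void)
 {
     return print_tree_entry(getpid());
 }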


For the task_diag interface, the program sends one request to get all the parameters at once:


 $ time ./task_diag_all all --cmdline -q

 real    0m0.096s
 user    0m0.006s
 sys     0m0.090s

For the original procfs, we need to read /proc/<pid>/status and /proc/<pid>/cmdline for each process:

 $ time ./task_proc_all status
 tasks: 50278

 real    0m0.463s
 user    0m0.030s
 sys     0m0.427s

 $ time ./task_proc_all cmdline
 tasks: 50281

 real    0m0.270s
 user    0m0.028s
 sys     0m0.237s

It is easy to see that task_diag is about 7 times faster than procfs (0.096 s versus 0.27 s + 0.46 s). Usually a performance improvement of a few percent is already a good result; here the speed increased by almost an order of magnitude.


It is also worth mentioning that the creation of internal kernel objects has a significant effect on performance, especially when the memory subsystem is under heavy load. Let's compare the number of objects created by procfs and task_diag:


 $ perf trace --event 'kmem:*alloc*' ./task_proc_all status 2>&1 | grep kmem | wc -l
 58184

 $ perf trace --event 'kmem:*alloc*' ./task_diag_all all -q 2>&1 | grep kmem | wc -l
 188

We also need to know how many objects are allocated just to start a simple process, for example the true utility:


 $ perf trace --event 'kmem:*alloc*' true 2>&1 | wc -l
 94

Subtracting this baseline, procfs creates about 600 times more objects than task_diag. This is one of the reasons why procfs performs so badly when memory is under pressure, and this alone makes it worth optimizing.


We hope this article will attract more developers to optimizing the procfs kernel subsystem.


Many thanks to David Ahern, Andy Lutomirski, Stephen Hemminger, Oleg Nesterov, W. Trevor King, Arnd Bergmann, Eric W. Biederman, and many others who helped develop and improve the task_diag interface.


Thanks to cromer, k001 and Stanislav Kinsbursky for helping to write this article.


Links




Source: https://habr.com/ru/post/418715/

