
We explore the inner workings of the Linux kernel, using /proc for quick diagnosis and troubleshooting.

This article is about modern Linux. For example, RHEL6 with its 2.6.3x kernels will do, but RHEL5 with the 2.6.18 kernel (still the most popular one in production, by the way) alas will not. And one more thing: there will be no kernel debuggers or SystemTap scripts here, only good old simple commands like "cat /proc/PID/xyz" against a few useful nodes of the /proc file system.

Diagnosing a "sluggish" process


Here is a good example of a common problem, which I reproduced on my laptop: a user complains that the find command runs "much slower" and never returns any results. Knowing what the problem was, we solved it. However, I was asked to write up a systematic approach to solving problems like this.

Fortunately, the system runs OEL6, i.e. a fairly recent kernel (namely 2.6.39 UEK2).
So, let's proceed to the diagnosis.

First check to see if the find process is still alive:

[root@oel6 ~]# ps -ef | grep find
root     27288 27245  4 11:57 pts/0    00:00:01 find . -type f
root     27334 27315  0 11:57 pts/1    00:00:00 grep find

Yes, it is still there: PID 27288 (I will use this PID throughout the diagnostic examples below).

Let's start with the basics and take a look at what the bottleneck for this process is. If it is not blocked by anything (for example, it reads everything it needs from the cache), it should be using 100% of a CPU. If it is blocked on I/O or network problems, the CPU load should be lower or absent altogether.

[root@oel6 ~]# top -cbp 27288
top - 11:58:15 up 7 days, 3:38, 2 users, load average: 1.21, 0.65, 0.47
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2026460k total, 1935780k used, 90680k free, 64416k buffers
Swap: 4128764k total, 251004k used, 3877760k free, 662280k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27288 root      20   0  109m 1160  844 D  0.0  0.1   0:01.11 find . -type f

The 'top' output says that this process does not load the CPU at all (or the load is so small it can be considered zero). However, there is a very important difference between a process that is completely stuck and never gets a chance at a CPU time slice, and a process that constantly wakes up from its wait state and then immediately goes back to sleep (for example, some poll operation that keeps timing out, after which the process re-issues it and falls asleep again). Unfortunately, the 'top' command alone is not enough to tell these two states apart. But at least we have established that this process is not devouring CPU time.

Let's try other tools. Normally, when a process seems to be stuck like this (0% CPU usually means the process is sitting in some blocking system call, which makes the kernel put it to sleep), I run strace on it to see which system call it is currently stuck in. If the process is not actually completely stuck, but periodically returns from a system call and wakes up, that will also show up in the strace output (the system call will keep completing and being issued again):

[root@oel6 ~]# strace -cp 27288
Process 27288 attached - interrupt to quit

^C
^Z
[1]+ Stopped     strace -cp 27288

[root@oel6 ~]# kill -9 %%
[1]+ Stopped     strace -cp 27288
[root@oel6 ~]#
[1]+ Killed      strace -cp 27288

Oops... looks like the strace command itself got stuck too! It did not print anything for quite a while and did not even respond to CTRL+C, so I had to push it into the background with CTRL+Z first and then kill it outright. So much for easy diagnosis!

Let's try pstack (on Linux, pstack is just a shell wrapper around the GDB debugger). pstack will not tell us anything about the kernel's inner workings, but at least it will show which system call we are in (it usually shows up as a libc function call at the top of the user-space stack):

[root@oel6 ~]# pstack 27288

^C
^Z
[1]+ Stopped     pstack 27288

[root@oel6 ~]# kill %%
[1]+ Stopped     pstack 27288
[root@oel6 ~]#
[1]+ Terminated  pstack 27288

pstack froze as well, without any explanation!

So we still do not know whether our process is 100% stuck (alas) or just 99.99% so (waking up and immediately going back to sleep), and we do not know where exactly it is stuck either.

Where else can we look? There is one more commonly available place: the process state and WCHAN fields, which can be examined with the good old ps command (hmm... perhaps it was worth running it right away, to make sure we are not dealing with a zombie):

[root@oel6 ~]# ps -flp 27288
F S UID        PID  PPID  C PRI  NI ADDR SZ    WCHAN  STIME TTY      TIME     CMD
0 D root     27288 27245  0  80   0 -    28070 rpc_wa 11:57 pts/0    00:00:01 find . -type f

To make sure the process stays in the same state, ps should be run several times in a row (you do not want to draw a false conclusion from a single sample taken at an unlucky moment, do you?). However, here I bravely show just one run.
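
A minimal sketch of such a repeated check (the PID is the one from above; adjust the count and interval to taste):

# re-sample the process state a few times (PID from the example above)
for i in {1..5}; do ps -flp 27288 | tail -n 1; sleep 1; done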

The process is in state D ("uninterruptible sleep"), which is usually associated with disk I/O (as the ps man page also says). In addition, the WCHAN field (the name of the kernel function in which the process sleeps/waits) is slightly truncated. I could add an option to the ps call to make that column a bit wider, but since its contents come from the /proc file system anyway, let's look at the source directly (once again, it would be good to do this several times to make sure whether the process is stuck solid or just sleeps very, very often):

[root@oel6 ~]# cat /proc/27288/wchan
rpc_wait_bit_killable


Hmm... it looks like the process is waiting inside some RPC call. This usually means the process is talking to other processes (either on the local machine, or even on a remote server). But we still do not know why.
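
By the way, the ps option mentioned earlier for showing an untruncated WCHAN would look something like this (a sketch; the exact format syntax may differ between ps versions):

# show a 32-character wide WCHAN column instead of the truncated default
ps -o pid,state,wchan:32,cmd -p 27288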

Is there any movement, or is the process completely frozen?


Before we get to the real "meat" of this article, let's determine whether the process is completely stuck or not. On modern kernels you can find out by examining /proc/PID/status. I have highlighted the values of interest to us:

[root@oel6 ~]# cat /proc/27288/status
Name:   find
State:  D (disk sleep)
Tgid:   27288
Pid:    27288
PPid:   27245
TracerPid:  0
Uid:    0   0   0   0
Gid:    0   0   0   0
FDSize: 256
Groups: 0 1 2 3 4 6 10
VmPeak:   112628 kB
VmSize:   112280 kB
VmLck:         0 kB
VmHWM:      1508 kB
VmRSS:      1160 kB
VmData:      260 kB
VmStk:       136 kB
VmExe:       224 kB
VmLib:      2468 kB
VmPTE:        88 kB
VmSwap:        0 kB
Threads:    1
SigQ:   4/15831
SigPnd: 0000000000040000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000180000000
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed:   ffffffff,ffffffff
Cpus_allowed_list:  0-63
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:  0
voluntary_ctxt_switches:    9950
nonvoluntary_ctxt_switches: 17104


The process is in state D, "disk sleep" (uninterruptible sleep). Also note the voluntary_ctxt_switches and nonvoluntary_ctxt_switches values: they tell us how many times the process was given CPU time slices (or gave them up). Then, a few seconds later, run the command again and check whether the values have increased. In my case the numbers did not grow, so I would conclude that the process is stuck solid (or at least that it never woke up during those few seconds between the two commands). So now I can be more confident that the process is completely stuck (and not just flying below the radar, constantly consuming 0.04% of CPU time).
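
A sketch of that repeated check, for those who like one-liners (same PID, a three-second pause between samples):

# compare the context-switch counters a few seconds apart
grep ctxt /proc/27288/status; sleep 3; grep ctxt /proc/27288/status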

By the way, there are two more places where you can see the number of context switches (and the second of them is also available on ancient kernels):

[root@oel6 ~]# cat /proc/27288/sched
find (27288, #threads: 1)
se.exec_start: 617547410.689282
se.vruntime: 2471987.542895
se.sum_exec_runtime: 1119.480311
se.statistics.wait_start: 0.000000
se.statistics.sleep_start: 0.000000
se.statistics.block_start: 617547410.689282
se.statistics.sleep_max: 0.089192
se.statistics.block_max: 60082.951331
se.statistics.exec_max: 1.110465
se.statistics.slice_max: 0.334211
se.statistics.wait_max: 0.812834
se.statistics.wait_sum: 724.745506
se.statistics.wait_count: 27211
se.statistics.iowait_sum: 0.000000
se.statistics.iowait_count: 0
se.nr_migrations: 312
se.statistics.nr_migrations_cold: 0
se.statistics.nr_failed_migrations_affine: 0
se.statistics.nr_failed_migrations_running: 96
se.statistics.nr_failed_migrations_hot: 1794
se.statistics.nr_forced_migrations: 150
se.statistics.nr_wakeups: 18507
se.statistics.nr_wakeups_sync: 1
se.statistics.nr_wakeups_migrate: 155
se.statistics.nr_wakeups_local: 18504
se.statistics.nr_wakeups_remote: 3
se.statistics.nr_wakeups_affine: 155
se.statistics.nr_wakeups_affine_attempts: 158
se.statistics.nr_wakeups_passive: 0
se.statistics.nr_wakeups_idle: 0
avg_atom: 0.041379
avg_per_cpu: 3.588077
nr_switches: 27054
nr_voluntary_switches: 9950
nr_involuntary_switches: 17104
se.load.weight: 1024
policy: 0
prio: 120
clock-delta: 72

Here you need to look at nr_switches (which equals nr_voluntary_switches + nr_involuntary_switches).

The total nr_switches is 27054 in the snippet above, and it also appears as the third field of the /proc/PID/schedstat output:

[root@oel6 ~]# cat /proc/27288/schedstat
1119480311 724745506 27054


And it does not increase ...
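
If watch(1) is installed, it is a convenient way to keep an eye on this counter and highlight any change (a sketch, same PID):

# the third field is nr_switches; -d highlights differences between refreshes
watch -d cat /proc/27288/schedstat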

Exploring the kernel internals through the /proc file system


It seems our process is stuck solid :). strace and pstack are useless here. They use the ptrace() system call to attach to the process and poke around in its memory, but since the process is hopelessly stuck, most likely in some system call, I assume the ptrace() call hangs as well (by the way, I once tried running strace on strace itself at the moment it attaches to the target process, and that crashed the process. You have been warned! :)

So how do we find out which system call we are stuck in, with neither strace nor pstack? Fortunately, we are on a modern kernel. Say hello to /proc/PID/syscall!

[root@oel6 ~]# cat /proc/27288/syscall
262 0xffffffffffffff9c 0x20cf6c8 0x7fff97c52710 0x100 0x100 0x676e776f645f616d 0x7fff97c52658 0x390e2da8ea

Whoa! And what am I supposed to do with that?
Well, these numbers are usually there for a reason. If we see something like "0xAVeryBigNumber", it is usually a memory address (and can be used with utilities like pmap), and if the number is small, it is most likely an index into some array, for example the array of open file descriptors (which you can see in /proc/PID/fd), or, in this case, since we are dealing with system calls, the number of the system call the process is in! So we now know that the process is stuck in system call #262!

Note that system call numbers can differ between operating systems, OS versions and platforms, so you need the right header file for your OS. A good start is to search for "syscall*" under /usr/include. On my Linux version and platform (64-bit), the system calls are defined in /usr/include/asm/unistd_64.h:

[root@oel6 ~]# grep 262 /usr/include/asm/unistd_64.h
#define __NR_newfstatat    262
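
The two steps above can be strung together into a single command; a sketch, assuming the same PID, a 64-bit system and the header path shown above (it only makes sense while the process is actually blocked inside a system call):

# take the syscall number from /proc/PID/syscall and look it up in the header
grep -w "$(awk '{print $1}' /proc/27288/syscall)" /usr/include/asm/unistd_64.h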


We are almost there! System call 262 is something called newfstatat. All that remains is to run man and find out what it is. A quick hint about system call names: if the man page does not have an entry for the exact name, try searching without possible suffixes and prefixes (for example, "man pread" instead of "man pread64"); in this case, run man without the "new" prefix: man fstatat. Or just google it.

Anyway, this system call, "new-fstat-at", lets you read file properties very much like the regular stat system call does. So we are stuck in a file metadata read operation. We have moved one step further, but we still do not know why the hang happened!

Well, time to say hello to my little friend /proc/PID/stack, which lets you look at the process's kernel stack chain simply by printing the contents of a proc file!!!

[root@oel6 ~]# cat /proc/27288/stack
[] rpc_wait_bit_killable+0x24/0x40 [sunrpc]
[] __rpc_execute+0xf5/0x1d0 [sunrpc]
[] rpc_execute+0x43/0x50 [sunrpc]
[] rpc_run_task+0x75/0x90 [sunrpc]
[] rpc_call_sync+0x42/0x70 [sunrpc]
[] nfs3_rpc_wrapper.clone.0+0x35/0x80 [nfs]
[] nfs3_proc_getattr+0x47/0x90 [nfs]
[] __nfs_revalidate_inode+0xcc/0x1f0 [nfs]
[] nfs_revalidate_inode+0x36/0x60 [nfs]
[] nfs_getattr+0x5f/0x110 [nfs]
[] vfs_getattr+0x4e/0x80
[] vfs_fstatat+0x70/0x90
[] sys_newfstatat+0x24/0x50
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff

The function at the top is the place in kernel code where we are stuck. That is exactly what the WCHAN field already told us (note, though, that there are actually a few more functions on top of it, such as the kernel function schedule(), which puts processes to sleep or wakes them up as needed; they are not shown here, probably because they are already a consequence of the wait state itself rather than its cause).

With the full kernel stack for this task in hand, we can read it from the bottom up and try to figure out how the call chain ended up in rpc_wait_bit_killable, which called the scheduler and put us to sleep.

The system_call_fastpath entry is the standard kernel system call handler; it invoked the code implementing the newfstatat system call (sys_newfstatat) that we are dealing with. Moving further up into the "child" functions, we see several NFS-related pieces. This is already 100%, irrefutable proof that we are somewhere in or around the NFS-handling code. I am not saying "in the NFS code", because the topmost of these NFS functions in turn calls an RPC function (rpc_call_sync) to communicate with another process, in this case probably the [kworker/N:N], [nfsiod], [lockd] or [rpciod] kernel I/O threads. And for some reason one of those threads does not respond at all (the usual suspects being a broken network connection, packet loss, or just general network problems).
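
A quick sketch to list those usual suspects on a system (thread names are as ps shows them; the exact set will differ between machines and kernel versions):

# candidate kernel threads that serve NFS RPC traffic
ps -e -o pid,comm | egrep 'kworker|nfsiod|lockd|rpciod'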

To figure out whether any of those helper threads are stuck in network-related code, you would have to look at their stacks too, although kworkers, for example, do far more work than just NFS RPC. In a separate experiment (simply copying a large file over NFS), I caught one of the kworkers waiting in network communication code:

[root@oel6 proc]# for i in `pgrep worker`; do ps -fp $i; cat /proc/$i/stack; done
UID        PID  PPID  C STIME TTY          TIME CMD
root        53     2  0 Feb14 ?        00:04:34 [kworker/1:1]

[] __cond_resched+0x2a/0x40
[] lock_sock_nested+0x35/0x70
[] tcp_sendmsg+0x29/0xbe0
[] inet_sendmsg+0x48/0xb0
[] sock_sendmsg+0xef/0x120
[] kernel_sendmsg+0x41/0x60
[] xs_send_kvec+0x8e/0xa0 [sunrpc]
[] xs_sendpages+0x173/0x220 [sunrpc]
[] xs_tcp_send_request+0x5d/0x160 [sunrpc]
[] xprt_transmit+0x83/0x2e0 [sunrpc]
[] call_transmit+0xa8/0x130 [sunrpc]
[] __rpc_execute+0x66/0x1d0 [sunrpc]
[] rpc_async_schedule+0x15/0x20 [sunrpc]
[] process_one_work+0x13e/0x460
[] worker_thread+0x17c/0x3b0
[] kthread+0x96/0xa0
[] kernel_thread_helper+0x4/0x10

It would probably not be hard to enable kernel tracing and find out exactly which kernel threads are talking to each other, but I do not want to go there in this article. Let this remain a practical (and simple) diagnostic exercise!

Understanding and "fixing" the problem


In any case, thanks to how easily you can get a kernel stack printout in modern Linux (I cannot say exactly which kernel version introduced it), we were able to conclude, step by step, exactly where our find command got stuck: in the NFS code of the Linux kernel. And when you are dealing with an NFS-related hang, the most likely suspect is a network problem. If you are wondering how I reproduced this problem, it was very simple: I mounted an NFS volume from a virtual machine, ran the find command, and then suspended the machine. This produces the same symptoms as a network (configuration, firewall) problem where a connection drops silently, without notifying the TCP endpoints, or where packets simply stop getting through for some reason.
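
For the curious, a rough sketch of that reproduction; the server address and export path below are made up, substitute your own NFS-server VM:

# mount an NFS export from a VM, start the find, then suspend the VM
mount -t nfs 192.168.56.101:/export/test /mnt/nfstest
cd /mnt/nfstest && find . -type f
# suspend the NFS-server VM now: the find soon ends up in the D state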

Since the function at the top of the stack is one of the "killable" functions (those that can be safely interrupted, rpc_wait_bit_killable), we can kill the process with kill -9:

[root@oel6 ~]# ps -fp 27288
UID        PID  PPID  C STIME TTY          TIME CMD
root     27288 27245  0 11:57 pts/0    00:00:01 find . -type f
[root@oel6 ~]# kill -9 27288

[root@oel6 ~]# ls -l /proc/27288/stack
ls: cannot access /proc/27288/stack: No such file or directory

[root@oel6 ~]# ps -fp 27288
UID        PID  PPID  C STIME TTY          TIME CMD
[root@oel6 ~]#

The process is gone.

Quick-and-dirty profiling of kernel threads


Note that the /proc/PID/stack file looks like a regular text file, so you can easily collect a kernel-thread profile! Here is a quick-and-dirty way to sample the current system call and the kernel stack walk (if the process is in a system call), and then aggregate it all into a semi-hierarchical profile:

[root@oel6 ~]# export LC_ALL=C; for i in {1..100}; do cat /proc/29797/syscall | awk '{print $1}'; cat /proc/29797/stack | /home/oracle/os_explain -k; usleep 100000; done | sort -r | uniq -c

     69 running
      1 ffffff81534c83
      2 ffffff81534820
      6 247
     25 180

    100 0xffffffffffffffff
      1 thread_group_cputime
     27 sysenter_dispatch
      3 ia32_sysret
      1 task_sched_runtime
     27 sys32_pread
      1 compat_sys_io_submit
      2 compat_sys_io_getevents
     27 sys_pread64
      2 sys_io_getevents
      1 do_io_submit
     27 vfs_read
      2 read_events
      1 io_submit_one
     27 do_sync_read
      1 aio_run_iocb
     27 generic_file_aio_read
      1 aio_rw_vect_retry
     27 generic_file_read_iter
      1 generic_file_aio_read
     27 mapping_direct_IO
      1 generic_file_read_iter
     27 blkdev_direct_IO
     27 __blockdev_direct_IO
     27 do_blockdev_direct_IO
     27 dio_post_submission
     27 dio_await_completion
      6 blk_flush_plug_list


This gives a very rough idea of where in the kernel the process spends its time (if it spends any there at all). The system call numbers are grouped separately at the top. "running" means that at the moment of sampling the process was executing in user space (not in a system call). So, 69% of the time the process was in user code; 25% in system call #180 (nfsservctl on my system), and 6% in system call #247 (waitid).

Above, the author calls a script of his own, /home/oracle/os_explain; you can get it from the link. (Translator's note.)
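
If you do not have the os_explain script at hand, an even cruder variant that simply counts raw kernel-stack lines gives a similar picture (a sketch, same PID as in the example above):

# sample the kernel stack ~10 times a second and count identical frames
for i in {1..100}; do cat /proc/29797/stack; usleep 100000; done | sort | uniq -c | sort -nr | head -20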

The output shows two more "functions" that for some reason are not displayed by name. Their addresses are known, however, so we can look them up manually:

[root@oel6 ~]# cat /proc/kallsyms | grep -i ffffff81534c83
ffffffff81534c83 t ia32_sysret

It looks like this is the return from a system call in the 32-bit compatibility subsystem; however, since this function is not a system call itself (it is just an internal helper), the /proc/PID/stack handler apparently did not resolve its name. It is also possible that these raw addresses show up because the /proc state is not consistent for the reader: while one thread modifies these memory structures, reading threads sometimes see stale data.

Let's check the second address while we are at it:

[root@oel6 ~]# cat /proc/kallsyms | grep -i ffffff81534820
[root@oel6 ~]#

Nothing? Well, the debugging is not over yet! Let's look for something interesting around that address, for example by dropping the last couple of digits:

[root@oel6 ~]# cat /proc/kallsyms | grep -i ffffff815348
ffffffff8153480d t sysenter_do_call
ffffffff81534819 t sysenter_dispatch
ffffffff81534847 t sysexit_from_sys_call
ffffffff8153487a t sysenter_auditsys
ffffffff815348b9 t sysexit_audit

It looks like the sysenter_dispatch function starts just one byte before the address reported in /proc/PID/stack, so the saved instruction pointer is most likely one byte into the function (perhaps past a NOP instruction placed there for dynamic tracing). In any case, these stack samples appear to be inside sysenter_dispatch, which is not itself a system call but a helper function. (If you look a little closer, the difference is actually not 1 but 7 bytes; the numbers are hexadecimal! Translator's note.)
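
If you do not feel like trimming digits by hand, here is a rough sketch that finds the nearest symbol at or below a given address. It relies on the fact that kallsyms addresses are fixed-width lowercase hex, so plain string comparison works:

# print the kallsyms entry with the largest address not exceeding $a
a=ffffffff81534820
awk -v a="$a" '$1 <= a' /proc/kallsyms | sort | tail -n 1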

More about stack profiling


Note that the various existing stack-sampling utilities, such as Perf on Linux, OProfile, and the profile provider of DTrace on Solaris, capture the instruction pointer register (EIP on 32-bit Intel, or RIP on x64) and the stack pointer (ESP on 32-bit, RSP on x64) of the thread currently on the CPU, and then walk the stack pointers backwards. Thus, these tools can only show threads that are running on a CPU at the moment of sampling! That is certainly great when we are hunting problems with high CPU load, but it is completely useless for diagnosing processes that are stuck solid or waiting/sleeping for a long time.

Utilities like pstack on Linux, Solaris and HP-UX, procstack (on AIX), ORADEBUG SHORT_STACK, and simply reading the /proc/PID/stack pseudo-file are a good addition to (but not a replacement for) CPU profiling utilities, since they access the process's memory regardless of its state in the scheduler and read the stack right from there. If the process is asleep and not touching the CPU, the top of the process stack can be read from the saved context, which the OS scheduler stores in kernel memory during a context switch.

Of course, CPU event profiling utilities can often do much more than pstack: OProfile, Perf, and even the CPC provider in DTrace (on Solaris 11) can, for example, program and read internal CPU counters to estimate things like the number of CPU cycles stalled waiting for memory access, the number of L1/L2 cache misses, and so on. But better read what Kevin Closson writes about this: (Perf, Oprofile)

Good luck :-)

Source: https://habr.com/ru/post/209446/

