Kernel tracing with ftrace

We have already touched upon the problems of kernel tracing and profiling in previous publications . There are many specialized tools for analyzing events at the kernel level: SystemTap , Ktap , Sysdig , LTTNG, and others. There are many detailed articles and educational materials about these tools.

Much less information can be found about Linux's “native” mechanisms, with which you can monitor system events, receive and analyze debug information. We would like to consider this topic in today's article. We will pay special attention to ftrace - the first and so far the only trace tool added to the kernel. Let's start with the definition of basic concepts.

')

Core profiling and tracing

Kernel profiling is the identification of performance bottlenecks. Using profiling, you can determine where exactly the performance loss occurred in a particular program. Special programs generate a profile - a summary of events on the basis of which you can find out which functions took the most time to complete. These programs, however, do not help to identify the cause of the decline in performance.

“Bottlenecks” very often arise under certain conditions, which cannot be identified during profiling. To understand why an event happened, you need to restore the context. In turn, tracing is required to restore context.

Tracing is the receipt of information about what is happening inside the operating system. For this purpose, special software tools are used. They record all events in the system just as the tape recorder records all the sounds of the environment.

Tracer programs can simultaneously monitor events at the level of individual applications and at the level of the operating system. Information obtained during tracing may be useful for diagnosing and solving many system problems.

Tracing is sometimes compared with logging. There are similarities between the two procedures, but there are differences.

During the trace information about events occurring at a low level is recorded. Their number is in the hundreds and even thousands. Logs also record information about high-level events that occur much less frequently: for example, user logons, application errors, database transactions, and others.

Like logs, files with trace data can be read "from the sheet." It is much more interesting and useful, however, to extract information from them that relates to the work of specific applications. All tracer programs have corresponding functions.

In the Linux kernel, there are three main mechanisms by which the procedures for tracing and profiling the kernel are performed:

tracepoints - a mechanism that works through static code instrumentation;
kprobes - a dynamic tracing mechanism, with which you can interrupt the execution of kernel code anywhere, call your own handler, and after completing all necessary operations, go back;
perf_events - PMU access interface (Performance Monitoring Unit).

We will not describe in detail the features of all these mechanisms; interested readers refer to the article of STC MetroTek , as well as the blog by Brendan Gregg .

Using ftrace, you can interact with these mechanisms and receive debugging data while in user space. We'll talk more about this below. All command examples are for OS Ubuntu 14.04, kernel version 3.13.0-24.

Ftrace: general information

The name ftrace is an abbreviation for Function Trace - tracing functions. However, the capabilities of this tool are much broader: with its help, you can track contextual switches, measure interrupt processing times, calculate time for activating tasks with high priority, and much more.

Ftrace was developed by Stephen Rostedt and added to the core in 2008, starting with version 2.6.27. This is a framework that provides a debug ring buffer for writing data. These tracers are built into the kernel.

The ftrace works on the debugfs file system, which in most modern Linux distributions is mounted by default. To get started with ftrace, you just need to go to the sys / kernel / debug / tracing directory (it is available only to the root user):

# cd /sys/kernel/debug/tracing

We will not talk in detail about all the files and subdirectories it contains - this is already done in the official documentation . We briefly describe only those of them that are of interest in the context of our consideration:

available_tracers — available tracers;
current_tracer - the current tracing program;
tracing_on - service file, responsible for enabling and disabling data recording in the ring buffer (to enable recording, write number 1 to this file, and to disable it, 0);
trace is a file where trace data is stored in a human readable format.

Available tracers

You can view the list of available tracers using the command

 root@andrei:/sys/kernel/debug/tracing#: cat available_tracers blk mmiotrace function_graph wakeup_rt wakeup function nop

We give a brief description for each tracer:

function - tracer functions of the kernel calls without the possibility of obtaining arguments;
function_graph - kernel call function tracer with subcalls;
blk - call and event tracer related to input-output to block devices; that is what is used in the blktrace utility, which we have already written about ;
mmiotrace is a tracer of I / O operations reflected in memory.
nop is the simplest tracer, which, as the name implies, does nothing (however, in some situations it may be useful, which is discussed below).

Tracer function

We will begin the introduction to ftrace with the function tracer. We will consider it on the material of such a test script:

 #!/bin/sh dir=/sys/kernel/debug/tracing sysctl kernel.ftrace_enabled=1 echo function > ${dir}/current_tracer echo 1 > ${dir}/tracing_on sleep 1 echo 0 > ${dir}/tracing_on less ${dir}/trace

Crypts are very simple and generally understandable, but there are moments in it that are worth paying attention to.
The sysctl ftrace.enabled = 1 command enables tracing of functions. Next, we install the current tracer by writing its name to the current_tracer file.

After that we write 1 to the file tracing_on and thereby activate the ring buffer update. An important feature of the syntax: there must be a space between the 1 and the> sign: a command like echo1> tracing_on will not work. Literally in one line, we disable it (note: if we write 0 to the tracing_on file, the ring buffer will not be cleared; the ftrace will not be disabled either).

Why do we do this? Between the two echo commands is the sleep 1 command. We enable the buffer update, execute this command, and then immediately disable it. Due to this, the trace will include information on all system calls that occurred during the execution of this command.

In the last line of the script, we give the command to output the trace data to the console.

As a result of the execution of this script, we will see the following output (here is a small fragment):

 # tracer: function # # entries-in-buffer/entries-written: 29571/29571 #P:2 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | trace.sh-1295 [000] .... 90.502874: mutex_unlock <-rb_simple_write trace.sh-1295 [000] .... 90.502875: __fsnotify_parent <-vfs_write trace.sh-1295 [000] .... 90.502876: fsnotify <-vfs_write trace.sh-1295 [000] .... 90.502876: __srcu_read_lock <-fsnotify trace.sh-1295 [000] .... 90.502876: __srcu_read_unlock <-fsnotify trace.sh-1295 [000] .... 90.502877: __sb_end_write <-vfs_write trace.sh-1295 [000] .... 90.502877: syscall_trace_leave <-int_check_syscall_exit_work trace.sh-1295 [000] .... 90.502878: context_tracking_user_exit <-syscall_trace_leave trace.sh-1295 [000] .... 90.502878: context_tracking_user_enter <-syscall_trace_leave trace.sh-1295 [000] d... 90.502878: vtime_user_enter <-context_tracking_user_enter trace.sh-1295 [000] d... 90.502878: _raw_spin_lock <-vtime_user_enter trace.sh-1295 [000] d... 90.502878: __vtime_account_system <-vtime_user_enter trace.sh-1295 [000] d... 90.502878: get_vtime_delta <-__vtime_account_system trace.sh-1295 [000] d... 90.502879: account_system_time <-__vtime_account_system trace.sh-1295 [000] d... 90.502879: cpuacct_account_field <-account_system_time trace.sh-1295 [000] d... 90.502879: acct_account_cputime <-account_system_time trace.sh-1295 [000] d... 90.502879: __acct_update_integrals <-acct_account_cputime

The output starts with information on the number of event entries in the buffer (entries in buffer) and the total number of recorded events (entries written). The difference between these two numbers is the number of events lost while filling the buffer (in our case there are no lost events).

The following is a list of functions, including the following information:

process identifier (PID);
the number of the processor core that is being traced (CPU #);
timestamp (TIMESTAMP; indicates the time when the entry to the function occurred);
the name of the function being traced and the name of the parent function that called it (FUNCTION); for example, in the very first line of our output, the function mutex-unlock calls the function rb_simple_write.

Function_graph tracer

The function_graph tracer works just like function, but tracks functions in more detail: for each function, it specifies an entry point and an exit point. With this tracer, you can track functions with subcalls and measure the execution time for each function.

Edit the script used in the previous example:

 #!/bin/sh dir=/sys/kernel/debug/tracing sysctl kernel.ftrace_enabled=1 echo function_graph > ${dir}/current_tracer echo 1 > ${dir}/tracing_on sleep 1 echo 0 > ${dir}/tracing_on less ${dir}/trace

As a result of this script, we get the following output:

 # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 0) 0.120 us | } /* resched_task */ 0) 1.877 us | } /* check_preempt_curr */ 0) 4.264 us | } /* ttwu_do_wakeup */ 0) + 29.053 us | } /* ttwu_do_activate.constprop.74 */ 0) 0.091 us | _raw_spin_unlock(); 0) 0.260 us | ttwu_stat(); 0) 0.133 us | _raw_spin_unlock_irqrestore(); 0) + 37.785 us | } /* try_to_wake_up */ 0) + 38.478 us | } /* default_wake_function */ 0) + 39.203 us | } /* pollwake */ 0) + 40.793 us | } /* __wake_up_common */ 0) 0.104 us | _raw_spin_unlock_irqrestore(); 0) + 42.920 us | } /* __wake_up_sync_key */ 0) + 44.160 us | } /* sock_def_readable */ 0) ! 192.850 us | } /* tcp_rcv_established */ 0) ! 197.445 us | } /* tcp_v4_do_rcv */ 0) 0.113 us | _raw_spin_unlock(); 0) ! 205.655 us | } /* tcp_v4_rcv */ 0) ! 208.154 us | } /* ip_local_deliver_finish */

In the column DURATION indicates the time spent on the function. Special attention should be paid to items marked with + and! The + sign means that the function took more than 10 microseconds, and the exclamation mark - more than 100.

In the column FUNCTION_CALLS with information on function calls.
The beginning and end of the execution of each function are indicated in it as it is accepted in the C language: the curly bracket at the beginning of the function and another one at the end. Functions that are leaves of a graph and do not call any other functions are indicated by a semicolon (;).

Function Filters

The output of ftrace can sometimes be very large, and it is extremely difficult to find the necessary information in it. You can simplify the search with the help of filters: the output will not contain information about all the functions, but only about those that really interest us. To do this, simply write the names of necessary functions to the set_ftrace_filter file, for example:

 root@andrei:/sys/kernel/debug/tracing# echo kfree > set_ftrace_filter

To disable the filter, you need to write in the same file an empty line:

 root@andrei:/sys/kernel/debug/tracing# echo > set_ftrace_filter

As a result of the command

 root@andrei:/sys/kernel/debug/tracing# echo kfree > set_ftrace_notrace

we will get a completely opposite result: the output will contain information about all functions except kfree ().

Another useful option is set_ftrace_pid. It is designed to trace functions called during the execution of the specified process.

Filtering capabilities in ftrace are much wider; Read more about them in the article by Stephen Rostedt on LWN.net .

Event tracing

Above, we already mentioned the tracepoints mechanism. The word tracepoint in translation means "trace point". The trace point is a special insertion into the code that records certain system events. A trace point can be active (this means that some check is assigned to it) or inactive (no check is assigned to it).

An inactive trace point has no effect on the operation of the system; it only adds a few bytes to call the trace function at the end of the instrumented function, and also adds the data structure to a separate section.

When the trace point is active, when the corresponding code fragment is executed, the trace function is called. Trace data is written to the debug ring buffer.

Trace points can be inserted anywhere in the code. They are already present in the code of many nuclear functions. Consider, for example, the function kmem_cache_alloc (taken from here ):

 { void *ret = slab_alloc(cachep, flags, _RET_IP_); trace_kmem_cache_alloc(_RET_IP_, ret, cachep->object_size, cachep->size, flags); return ret; }

Pay attention to the trace_kmem_cache_alloc line - this is exactly the trace point. The number of such examples can be multiplied by referring to the source code of other functions of the kernel.

The Linux kernel has a special API that allows you to work with trace points from user space. In the / sys / kernel / debug / tracing directory there is a subdirectory of events in which system events available for tracking are stored. The system event in this context is nothing more than the trace points included in the core.

Their list can be viewed using the command:

 root@andrei:/sys/kernel/debug/tracing# cat available_events

A long list will be displayed on the console, which is rather inconvenient to view. You can output the same list in a more structured format like this:

 root@andrei:/sys/kernel/debug/tracing# ls events

Before you start tracking events, check to see if event recording is included in the ring buffer:

 root@andrei:/sys/kernel/debug/tracing# cat tracing_on

If after executing this command the number 0 is displayed on the console, we will execute:

 root@andrei:/sys/kernel/debug/tracing# echo 1 > tracing_on

In the last article we wrote about the chroot () system call - we will track the entrance to this system call. As a tracer, we will select nop: function and function_graph record too much information, including information that is irrelevant to the event of interest to us.

 root@andrei:/sys/kernel/debug/tracing# echo nop > current_tracer

All events related to system calls are stored in the syscalls directory. It, in turn, stores directories for entering and exiting various system calls. Activate the trace point we need by writing 1 to its enable file:

 root@andrei:/sys/kernel/debug/tracing# echo 1 > events/syscalls/sys_enter_chroot/enable

Then create an isolated file system using the chroot command (see details in the previous post ). After executing the commands of interest to us, turn off tracing so that the output does not contain information about unnecessary and unrelated events:

 root@andrei:/sys/kernel/debug/tracing# echo 0 > tracing_on

Next, view the contents of the ring buffer. At the very end of the output will contain information about the system call that we are interested in (here is a small fragment):

 root@andrei:/sys/kernel/debug/tracing# at trace ...... chroot-11321 [000] .... 4606.265208: sys_chroot(filename: 7fff785ae8c2) chroot-11325 [000] .... 4691.677767: sys_chroot(filename: 7fff242308cc) bash-11338 [000] .... 4746.971300: sys_chroot(filename: 7fff1efca8cc) bash-11351 [000] .... 5379.020609: sys_chroot(filename: 7fffbf9918cc)

More details about the event trace settings can be found here .

Conclusion

In this article we have done a general overview of the possibilities of ftrace. We would appreciate any comments and additions. If you want to dive deeper into the topic, we recommend that you familiarize yourself with the following sources:

https://www.kernel.org/doc/Documentation/trace/tracepoints.txt - detailed description of the tracepoints mechanism;
https://www.kernel.org/doc/Documentation/trace/events.txt - a guide to tracing system events in Linux;
https://www.kernel.org/doc/Documentation/trace/ftrace.txt - official ftrace documentation;
https ://lttng.
https://lwn.net/Articles/370423/ - Stephen Rodstedt article on ftrace features;
http://alex.dzyoba.com/linux/profiling-ftrace.html - ftrace review article with practical case study;
http://panda-nox.livejournal.com/2235.html is the only detailed publication about ftrace in Russian.

If you for one reason or another can not leave comments here - we invite you to our blog .

Source: https://habr.com/ru/post/280322/

All Articles