
We have already touched upon the problems of kernel tracing and profiling
in previous publications . There are many specialized tools for analyzing events at the kernel level:
SystemTap ,
Ktap ,
Sysdig ,
LTTNG, and others. There are many detailed articles and educational materials about these tools.
Much less information can be found about Linux's “native” mechanisms, with which you can monitor system events, receive and analyze debug information. We would like to consider this topic in today's article. We will pay special attention to
ftrace - the first and so far the only trace tool added to the kernel. Let's start with the definition of basic concepts.
')
Core profiling and tracing
Kernel profiling is the identification of performance bottlenecks. Using profiling, you can determine where exactly the performance loss occurred in a particular program. Special programs generate a profile - a summary of events on the basis of which you can find out which functions took the most time to complete. These programs, however, do not help to identify the cause of the decline in performance.
“Bottlenecks” very often arise under certain conditions, which cannot be identified during profiling. To understand why an event happened, you need to restore the context. In turn,
tracing is required to restore context.
Tracing is the receipt of information about what is happening inside the operating system. For this purpose, special software tools are used. They record all events in the system just as the tape recorder records all the sounds of the environment.
Tracer programs can simultaneously monitor events at the level of individual applications and at the level of the operating system. Information obtained during tracing may be useful for diagnosing and solving many system problems.
Tracing is sometimes compared with logging. There are similarities between the two procedures, but there are differences.
During the trace information about events occurring at a low level is recorded. Their number is in the hundreds and even thousands. Logs also record information about high-level events that occur much less frequently: for example, user logons, application errors, database transactions, and others.
Like logs, files with trace data can be read "from the sheet." It is much more interesting and useful, however, to extract information from them that relates to the work of specific applications. All tracer programs have corresponding functions.
In the Linux kernel, there are three main mechanisms by which the procedures for tracing and profiling the kernel are performed:
- tracepoints - a mechanism that works through static code instrumentation;
- kprobes - a dynamic tracing mechanism, with which you can interrupt the execution of kernel code anywhere, call your own handler, and after completing all necessary operations, go back;
- perf_events - PMU access interface (Performance Monitoring Unit).
We will not describe in detail the features of all these mechanisms; interested readers refer to the
article of STC MetroTek , as well as the
blog by Brendan Gregg .
Using ftrace, you can interact with these mechanisms and receive debugging data while in user space. We'll talk more about this below. All command examples are for OS Ubuntu 14.04, kernel version 3.13.0-24.
Ftrace: general information
The name ftrace is an abbreviation for Function Trace - tracing functions. However, the capabilities of this tool are much broader: with its help, you can track contextual switches, measure interrupt processing times, calculate time for activating tasks with high priority, and much more.
Ftrace was developed by Stephen Rostedt and added to the core in 2008, starting with version 2.6.27. This is a framework that provides a debug ring buffer for writing data. These tracers are built into the kernel.
The ftrace works on the debugfs file system, which in most modern Linux distributions is mounted by default. To get started with ftrace, you just need to go to the sys / kernel / debug / tracing directory (it is available only to the root user):
We will not talk in detail about all the files and subdirectories it contains - this is already done in the
official documentation . We briefly describe only those of them that are of interest in the context of our consideration:
- available_tracers — available tracers;
current_tracer - the current tracing program; - tracing_on - service file, responsible for enabling and disabling data recording in the ring buffer (to enable recording, write number 1 to this file, and to disable it, 0);
- trace is a file where trace data is stored in a human readable format.
Available tracers
You can view the list of available tracers using the command
root@andrei:/sys/kernel/debug/tracing
We give a brief description for each tracer:
- function - tracer functions of the kernel calls without the possibility of obtaining arguments;
- function_graph - kernel call function tracer with subcalls;
- blk - call and event tracer related to input-output to block devices; that is what is used in the blktrace utility, which we have already written about ;
- mmiotrace is a tracer of I / O operations reflected in memory.
- nop is the simplest tracer, which, as the name implies, does nothing (however, in some situations it may be useful, which is discussed below).
Tracer function
We will begin the introduction to ftrace with the function tracer. We will consider it on the material of such a test script:
Crypts are very simple and generally understandable, but there are moments in it that are worth paying attention to.
The sysctl ftrace.enabled = 1 command enables tracing of functions. Next, we install the current tracer by writing its name to the current_tracer file.
After that we write 1 to the file tracing_on and thereby activate the ring buffer update. An important feature of the syntax: there must be a space between the 1 and the> sign: a command like echo1> tracing_on will not work. Literally in one line, we disable it (note: if we write 0 to the tracing_on file, the ring buffer will not be cleared; the ftrace will not be disabled either).
Why do we do this? Between the two echo commands is the sleep 1 command. We enable the buffer update, execute this command, and then immediately disable it. Due to this, the trace will include information on all system calls that occurred during the execution of this command.
In the last line of the script, we give the command to output the trace data to the console.
As a result of the execution of this script, we will see the following output (here is a small fragment):
The output starts with information on the number of event entries in the buffer (entries in buffer) and the total number of recorded events (entries written). The difference between these two numbers is the number of events lost while filling the buffer (in our case there are no lost events).
The following is a list of functions, including the following information:
- process identifier (PID);
- the number of the processor core that is being traced (CPU #);
- timestamp (TIMESTAMP; indicates the time when the entry to the function occurred);
- the name of the function being traced and the name of the parent function that called it (FUNCTION); for example, in the very first line of our output, the function mutex-unlock calls the function rb_simple_write.
Function_graph tracer
The function_graph tracer works just like function, but tracks functions in more detail: for each function, it specifies an entry point and an exit point. With this tracer, you can track functions with subcalls and measure the execution time for each function.
Edit the script used in the previous example:
As a result of this script, we get the following output:
In the column DURATION indicates the time spent on the function. Special attention should be paid to items marked with + and! The + sign means that the function took more than 10 microseconds, and the exclamation mark - more than 100.
In the column FUNCTION_CALLS with information on function calls.
The beginning and end of the execution of each function are indicated in it as it is accepted in the C language: the curly bracket at the beginning of the function and another one at the end. Functions that are leaves of a graph and do not call any other functions are indicated by a semicolon (;).
Function Filters
The output of ftrace can sometimes be very large, and it is extremely difficult to find the necessary information in it. You can simplify the search with the help of filters: the output will not contain information about all the functions, but only about those that really interest us. To do this, simply write the names of necessary functions to the set_ftrace_filter file, for example:
root@andrei:/sys/kernel/debug/tracing
To disable the filter, you need to write in the same file an empty line:
root@andrei:/sys/kernel/debug/tracing
As a result of the command
root@andrei:/sys/kernel/debug/tracing
we will get a completely opposite result: the output will contain information about all functions except kfree ().
Another useful option is set_ftrace_pid. It is designed to trace functions called during the execution of the specified process.
Filtering capabilities in ftrace are much wider; Read more about them in the article by
Stephen Rostedt on LWN.net .
Event tracing
Above, we already mentioned the tracepoints mechanism. The word tracepoint in translation means "trace point". The trace point is a special insertion into the code that records certain system events. A trace point can be active (this means that some check is assigned to it) or inactive (no check is assigned to it).
An inactive trace point has no effect on the operation of the system; it only adds a few bytes to call the trace function at the end of the instrumented function, and also adds the data structure to a separate section.
When the trace point is active, when the corresponding code fragment is executed, the trace function is called. Trace data is written to the debug ring buffer.
Trace points can be inserted anywhere in the code. They are already present in the code of many nuclear functions. Consider, for example, the function kmem_cache_alloc (taken
from here ):
{ void *ret = slab_alloc(cachep, flags, _RET_IP_); trace_kmem_cache_alloc(_RET_IP_, ret, cachep->object_size, cachep->size, flags); return ret; }
Pay attention to the
trace_kmem_cache_alloc line - this is exactly the trace point. The number of such examples can be multiplied by referring to the source code of other functions of the kernel.
The Linux kernel has a special API that allows you to work with trace points from user space. In the / sys / kernel / debug / tracing directory there is a subdirectory of events in which system events available for tracking are stored. The system event in this context is nothing more than the trace points included in the core.
Their list can be viewed using the command:
root@andrei:/sys/kernel/debug/tracing
A long list will be displayed on the console, which is rather inconvenient to view. You can output the same list in a more structured format like this:
root@andrei:/sys/kernel/debug/tracing
Before you start tracking events, check to see if event recording is included in the ring buffer:
root@andrei:/sys/kernel/debug/tracing
If after executing this command the number 0 is displayed on the console, we will execute:
root@andrei:/sys/kernel/debug/tracing
In the last article we wrote about the chroot () system call - we will track the entrance to this system call. As a tracer, we will select nop: function and function_graph record too much information, including information that is irrelevant to the event of interest to us.
root@andrei:/sys/kernel/debug/tracing
All events related to system calls are stored in the syscalls directory. It, in turn, stores directories for entering and exiting various system calls. Activate the trace point we need by writing 1 to its enable file:
root@andrei:/sys/kernel/debug/tracing
Then create an isolated file system using the chroot command (see details
in the previous post ). After executing the commands of interest to us, turn off tracing so that the output does not contain information about unnecessary and unrelated events:
root@andrei:/sys/kernel/debug/tracing
Next, view the contents of the ring buffer. At the very end of the output will contain information about the system call that we are interested in (here is a small fragment):
root@andrei:/sys/kernel/debug/tracing
More details about the event trace settings can be found
here .
Conclusion
In this article we have done a general overview of the possibilities of ftrace. We would appreciate any comments and additions. If you want to dive deeper into the topic, we recommend that you familiarize yourself with the following sources:
If you for one reason or another can not leave comments here - we invite you to our
blog .