We all use profilers. The traditional way of working with them is to launch the program "under the profiler" from the start and then, after it finishes, analyze the raw dump with additional utilities.
But what if, without having root, we want to profile a program that is already running, has behaved properly for a long time, and only now something has gone wrong? And we want to do it quickly. Sound familiar?
Let's first look at the most popular profilers and how they work, and then at a profiler that solves exactly this task.
Popular Profilers
If you know a fundamentally different one, write about it in the comments. For now, let's look at these four:
I. gprof
A good old UNIX profiler that, according to Kirk McKusick, was written by Bill Joy to analyze the performance of BSD subsystems. The profiling is actually "provided" by the compiler: it has to place control points at the beginning and at the end of each function, and the difference between these two points gives the function's execution time.
It is worth noting that thanks to this, gprof knows exactly how many times each function was called. While that can be useful in some situations, it also has a downside: the measurement overhead may be comparable to, or even exceed, the cost of the function body itself. That is why, for example, when compiling C++ code, optimizations that result in inlining are used.
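To make the idea of compiler-inserted control points concrete, here is a hand-written sketch of the same "hook every entry and exit" approach. It uses GCC's -finstrument-functions hooks, which is not what gprof's -pg option literally emits (gprof has its own mcount mechanism), just an illustration of the principle and of why the per-call overhead matters:

/* Sketch of compiler-inserted entry/exit hooks (illustration only,
 * not what -pg actually generates). Build with:
 *   gcc -finstrument-functions -o demo demo.c
 */
#include <stdio.h>

/* The hooks themselves must not be instrumented, or we would recurse. */
void __cyg_profile_func_enter(void *fn, void *call_site)
        __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *fn, void *call_site)
        __attribute__((no_instrument_function));

void __cyg_profile_func_enter(void *fn, void *call_site)
{
    /* a real profiler would record a timestamp / counter here */
    fprintf(stderr, "enter %p (called from %p)\n", fn, call_site);
}

void __cyg_profile_func_exit(void *fn, void *call_site)
{
    (void)call_site;
    fprintf(stderr, "exit  %p\n", fn);
}

static int square(int n) { return n * n; }   /* gets instrumented */

int main(void)
{
    return square(7) > 0 ? 0 : 1;
}

For a tiny function like square(), the two hook calls clearly cost more than the function body itself.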
One way or another, gprof does not work with programs that are already running.
II. Callgrind
Callgrind is part of Valgrind, an excellent framework for building dynamic code analysis tools. Valgrind runs the program in a sandbox, effectively using virtualization. Callgrind performs profiling by breaking on call and ret instructions. It significantly slows down the analyzed code, typically by a factor of 5 to 20, so it is usually not suitable for analyzing large workloads at runtime.
Nevertheless, the tool is very popular, and its simple call-graph format is supported by excellent visualization tools, for example kcachegrind.
III. OProfile
OProfile is a system-wide profiler for Linux capable of profiling all running code at low overhead.
Being system-wide, it is not aimed at working with individual processes; it profiles the entire system instead. OProfile collects its metrics not from a system timer, as gprof or callgrind do, but from CPU performance counters, and therefore requires privileges to run its daemon.
Nevertheless, it is an indispensable tool when you need to understand the behavior of the whole system, the whole server at once, and especially when profiling kernel space.
A note on OProfile 0.9.8: for versions 0.9.7 and earlier, the profiler consisted of a kernel driver and a daemon for data collection. Since version 0.9.8, this scheme has been replaced by the use of Linux Kernel Performance Events (which requires kernel 2.6.31 or later). The 0.9.8 release also includes the 'operf' program, which allows unprivileged users to profile individual processes.
IV. Google perftools
This profiler is part of the Google perftools suite. I did not find a review of it on Habr, so I will describe it very briefly.
The kit includes a set of libraries aimed at speeding up and analyzing C/C++ applications. Its central part is the tcmalloc allocator, which, in addition to speeding up memory allocation, provides tools for analyzing the classic problems: memory leaks and heap profiling.
The second part is libprofiler, which collects CPU usage statistics. It is important to understand how it does this. Several times per second (100 by default), the program is interrupted by a timer signal. In the handler of this signal the stack is unwound and all instruction pointers are recorded. At the end, the raw data is dumped to a file, from which statistics and call graphs can be built.
Here are some details of how this is done:
1. By default, the ITIMER_PROF timer is used, which ticks only while the program is using the CPU. After all, we are usually not interested in where the program was waiting for keyboard input or for data on a socket. If it is still interesting, use env CPUPROFILE_REALTIME=1.
2. The call stack is unwound either with libunwind or manually (which requires -fno-omit-frame-pointer and always works on x86).
3. Function names are resolved afterwards using addr2line(1).
4. Like the other Google perftools components, the profiler can be linked in explicitly or preloaded with LD_PRELOAD.
The principle of operation is interesting: the program is interrupted only N times per second, where N is fairly small. This is a so-called sampling profiler. Its advantage is that it has no significant impact on the program being analyzed, no matter how many small functions it calls. By the nature of its operation, however, it cannot answer the question "how many times was this function called".
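To illustrate the mechanism, here is a minimal, self-contained sketch of ITIMER_PROF-based sampling in the spirit of libprofiler. This is my own simplified code, not the perftools implementation: a real profiler would unwind the whole stack on each signal instead of storing just the interrupted instruction pointer.

/* Minimal sketch of a sampling profiler: SIGPROF fires ~100 times per
 * CPU-second, and the handler records the interrupted instruction pointer. */
#define _GNU_SOURCE        /* for REG_RIP in <ucontext.h> on glibc/x86_64 */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES 100000
static void *samples[MAX_SAMPLES];   /* interrupted instruction pointers */
static volatile int nsamples;

static void prof_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)si;
    ucontext_t *uc = ctx;
    if (nsamples < MAX_SAMPLES)
#if defined(__x86_64__)
        samples[nsamples++] = (void *)(uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
#else
        samples[nsamples++] = NULL;  /* PC extraction is arch-specific */
#endif
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = prof_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    /* 100 samples per CPU-second, like the perftools default */
    struct itimerval it = { { 0, 10000 }, { 0, 10000 } };
    setitimer(ITIMER_PROF, &it, NULL);

    /* burn some CPU so that samples accumulate */
    volatile double x = 0;
    for (long i = 0; i < 300000000L; i++)
        x += i * 0.5;

    printf("collected %d samples, first ip = %p\n",
           nsamples, nsamples ? samples[0] : NULL);
    return 0;
}

Mapping the collected instruction pointers back to function names (via addr2line or the symbol table) is what turns this raw data into a profile.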
In the case of the Google profiler there are a couple more drawbacks:
- This profiler is also not designed to work with already running programs.
- recent versions do not work across fork(2), which sometimes makes them hard to use in daemons
Crxprof
As promised, now about another profiler, written specifically to solve the problem outlined above: easy profiling of already running processes.
It collects call stacks and, when you press ENTER, displays the hottest paths in the console. It can also save the call graph in the callgrind format mentioned above. It works quickly and, like any other sampling profiler, its overhead does not depend on the complexity of the calls in the profiled program.
Some details of how it works: in general, crxprof works much like perftools, but does external profiling through ptrace(2). Like perftools, it uses libunwind to unwind the stack, and for the hard work of converting addresses to function names it uses libbfd instead of addr2line(1).
Several times per second the program is stopped (with SIGSTOP) and its call stack is captured using libunwind. Since crxprof loads the map of functions of the profiled program and its associated libraries at startup, it can quickly determine which function each instruction pointer belongs to.
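For illustration, here is a heavily simplified sketch of such external sampling via ptrace(2). This is my own code, not the actual crxprof source: it attaches to a pid, periodically stops it, reads only the instruction pointer and resumes it, whereas crxprof unwinds the whole stack with libunwind and resolves every address to a function name.

/* Toy external sampler: attach to a running process with ptrace(2),
 * stop it ~10 times per second, read RIP, and let it run again. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s pid\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);

    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
        perror("ptrace(PTRACE_ATTACH)");
        return 1;
    }
    waitpid(pid, NULL, 0);                  /* wait for the attach stop */
    ptrace(PTRACE_CONT, pid, NULL, NULL);   /* let the target run again */

    for (int i = 0; i < 50; i++) {          /* ~10 samples/sec for 5 seconds */
        usleep(100000);
        kill(pid, SIGSTOP);                 /* stop the target ...          */
        waitpid(pid, NULL, 0);

        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, NULL, &regs);
#if defined(__x86_64__)
        printf("sample %d: rip = 0x%llx\n", i, (unsigned long long)regs.rip);
#endif
        /* crxprof unwinds the whole call stack here (libunwind)
         * and maps each address to a function name */
        ptrace(PTRACE_CONT, pid, NULL, NULL);   /* ... and resume it */
    }

    kill(pid, SIGSTOP);                     /* the tracee must be stopped ... */
    waitpid(pid, NULL, 0);
    ptrace(PTRACE_DETACH, pid, NULL, NULL); /* ... before detaching */
    return 0;
}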
The call graph is built along the way, assuming there is some central function, the entry point; this is usually __libc_start_main from libc.
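Folding the sampled stacks into such a call graph can be done with a very simple data structure. Below is my own simplified sketch (not crxprof's actual code): each sampled stack, ordered from the entry point down to the innermost frame, is merged into a tree with per-node counters.

/* Merge stack samples into a call tree rooted at the entry point. */
#include <stdio.h>
#include <stdlib.h>

struct node {
    void        *addr;      /* function (or call site) this node stands for */
    unsigned     total;     /* samples that passed through this node        */
    unsigned     self;      /* samples where this frame was the innermost   */
    struct node *child, *sibling;
};

/* find or create the child of 'parent' for the given address */
static struct node *get_child(struct node *parent, void *addr)
{
    for (struct node *c = parent->child; c; c = c->sibling)
        if (c->addr == addr)
            return c;
    struct node *c = calloc(1, sizeof *c);
    c->addr = addr;
    c->sibling = parent->child;
    parent->child = c;
    return c;
}

/* stack[0] is the root frame (entry point), stack[depth-1] the innermost one */
static void add_sample(struct node *root, void **stack, int depth)
{
    struct node *cur = root;
    root->total++;
    for (int i = 0; i < depth; i++) {
        cur = get_child(cur, stack[i]);
        cur->total++;
    }
    cur->self++;            /* total/self map to the "75% | 49% self" output */
}

int main(void)
{
    /* fake addresses standing in for main, heavy_fn and fn */
    void *MAIN = (void *)0x10, *HEAVY = (void *)0x20, *FN = (void *)0x30;
    struct node root = { 0 };          /* conceptual __libc_start_main */

    void *s1[] = { MAIN, HEAVY };      /* sample taken inside heavy_fn    */
    void *s2[] = { MAIN, HEAVY, FN };  /* sample taken in fn via heavy_fn */
    void *s3[] = { MAIN, FN };         /* sample taken in fn via main     */
    add_sample(&root, s1, 2);
    add_sample(&root, s2, 3);
    add_sample(&root, s3, 2);

    printf("root: %u samples collected\n", root.total);
    return 0;
}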
The source code is available on GitHub. Since the utility was created for me and my colleagues, I fully admit it may not match your use case. Either way, feel free to ask.
Let's build crxprof and look at an example of its use.
Building
What is needed: Linux (2.6+), autoconf + automake, binutils-dev (includes libbfd), libunwind-dev (on my system it is called libunwind8-dev).
To build, run:
autoreconf -fiv
./configure
make
sudo make install
If libunwind is installed in a non-standard location, use:
./configure --with-libunwind=/path/to/libunwind
Profiling
To profile a running process, just run:
crxprof pid
And that's it! Now press ENTER to display the profile in the console, and ^C to quit. crxprof will also print the profile when the profiled program exits.
crxprof: ptrace(PTRACE_ATTACH) failed: Operation not permitted
If you see this error, then ptrace is restricted on your system (Ubuntu?).
Read more here. In short, either run crxprof with sudo, or (better) execute in the console:
$ echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
Like any Unix utility, crxprof prints its usage when called with --help; see also man crxprof.
crxprof --help
Usage: crxprof [options] pid
Options are:
-t | --threshold N: n% of time (default: 5.0)
-d | --dump FILE: save callgrind dump to given FILE
-f | --freq FREQ: set profile frequency to FREQ Hz (default: 100)
-m | --max-depth N: show at most N levels while visualizing (default: no limit)
-r | --realtime: use realtime profile instead of CPU
-h | --help: show this help
--full-stack: print full stack while visualizing
--print-symbols: just print funcs and addrs (and quit)
Real example
To give a real but not complicated example, I use this code in C. Compile it, run it, and ask crxprof to save the function call graph (4054 is the pid of the program being profiled).
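The original test source is linked above; for reference, here is my own rough approximation of a program with the same call structure (main() calls heavy_fn() and fn(), and heavy_fn() calls fn(); the function bodies are invented, only the structure matches):

/* Approximation of the test program: a hot heavy_fn() that also calls fn(),
 * plus a direct call to fn() from main(). */
#include <stdio.h>

static volatile double sink;   /* volatile so the loops are not optimized away */

static void fn(long iters)
{
    for (long i = 0; i < iters; i++)
        sink += i * 0.5;
}

static void heavy_fn(long iters)
{
    /* spends roughly half of its time in its own loop
     * and the rest inside fn() */
    for (long i = 0; i < iters; i++)
        sink += i * 1.5;
    fn(iters / 2);
}

int main(void)
{
    for (;;) {                 /* run long enough to attach crxprof */
        heavy_fn(50000000L);
        fn(25000000L);
    }
    return 0;
}

Attaching crxprof to such a process produces a session like the one below.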
$ crxprof -d /tmp/test.calls 4054
Reading symbols (list of function)
reading symbols from /home/dkrot/work/crxprof/test/a.out (exe)
reading symbols from /lib/x86_64-linux-gnu/libc-2.15.so (dynlib)
reading symbols from /lib/x86_64-linux-gnu/ld-2.15.so (dynlib)
Attaching to process: 6704
Starting profile (interval 10ms)
Press ENTER to show profile, ^C to quit

1013 snapshot interrputs got (0 dropped)
main (100% | 0% self)
 \_ heavy_fn (75% | 49% self)
   \_ fn (25% | 25% self)
 \_ fn (24% | 24% self)
Profile saved to /tmp/test.calls (Callgrind format)
^C--- Exit since ^C pressed
From the statistics displayed in the console, it can be seen that:
- main() calls heavy_fn() (and this is the "heaviest" path)
- heavy_fn() calls fn()
- main() also calls fn() directly
- heavy_fn() itself takes about half of the CPU time
- fn() takes the remaining CPU time
- main() by itself consumes almost nothing

The visualization follows a "largest subtrees first" scheme, so even for large real-world programs you can use this simple console visualization, which should be convenient on servers.
To visualize complex call graphs, it is convenient to use KCachegrind:
$ kcachegrind /tmp/test.calls
The picture that I got is on the right.
In lieu of a conclusion, let me note that for now only a few of my colleagues and I use this profiler. I hope it will be useful to you as well.