The latest version of the Java Microbenchmark Harness (JMH) ships a new profiler, dtraceasm: the long-awaited port of perfasm to Mac OS X, which displays an assembly-level profile of a Java benchmark.
A survey showed that not everyone understands how it is possible, even in principle, to take a Java method as input and produce a listing of the compiled method with its hottest instructions, their distribution, and a small profile along the lines of "another 5% of the time was spent by the virtual machine in Symbol::as_C_string(char*, int)".
While porting perfasm it turned out that, in fact™, none of this is very difficult, so I wanted to explain how such a profiler works.
To follow the article, it helps a lot to be familiar with JMH, for example by looking at its usage samples.
What should such a profiler do?
For a Java benchmark, it must show exactly where most of the CPU time is spent at the level of the generated code.
For example, for a method that computes a logarithm:
@Benchmark public double log(double x) { return Math.log(x); }
dtraceasm or perfasm will show a profile like the one in the screenshot on the left, blaming the fstpl instruction for everything. Because modern processors are deeply pipelined, such a profile can be skewed, so it often makes sense to look not only at the instruction reported as hot, but also at the one before it. Here that is fyl2x, which actually computes the logarithm.
In fact, such a profiler is very similar to perf annotate, except that it can work with JIT-compiled Java code.
And why would you need an *asm profiler if you are not writing your own JIT compiler? Not least, of course, out of curiosity, because it helps to answer questions like these very quickly:
- what a given piece of code actually compiles to (of course, you can go to the PrintAssembly output and find the right place, or use JITWatch, but this is usually less convenient);
- how an intuition ("Math.sqrt will definitely slow things down") differs from the harsh reality.
Curiosity aside, being able to answer the same questions is useful if you suddenly decide to optimize one small spot, for example your own thread-safe queue or a highly specialized class.
And if you use a tool, it is worth understanding at least roughly how it works inside, so that you do not treat it as some kind of magic and have a realistic idea of its capabilities and limitations.
To build a profile over the generated code, you first need to get that generated code from somewhere. Fortunately, everything has already been invented for us: the virtual machine (from here on I mean only HotSpot) can print all compiled code to stdout. You just need to enable the right flag (-XX:+PrintAssembly, which is a diagnostic option and therefore also needs -XX:+UnlockDiagnosticVMOptions) and put a special disassembler library (hsdis) into $JAVA_HOME. There are plenty of instructions on the Internet on how to do this; usually you do not need to build anything yourself and can simply download a prebuilt disassembler for your platform.
PrintAssembly is useful, but not very convenient. Its output has a well-known format and is even annotated with comments saying which bytecode instruction the current line belongs to, which method it comes from, or which register holds an argument. However, it is measured in megabytes and contains every version of the compiled method (the C1 version, the C2 version, the version after deoptimization, GOTO 1), so finding what you need in it is usually extremely hard.
The profiler has to point at the exact place in this huge output where the hottest part of the benchmark lives. To write such a profiler, we need three things from this output: which method each instruction belongs to, its address in memory and, optionally, the comments from the disassembler.
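To make "address, instruction and optional comment" concrete, here is a minimal sketch of pulling those three pieces out of one disassembly line. It assumes lines shaped like `0x<hex address>: <instruction> ; <comment>`; the real PrintAssembly layout differs between JDK versions and disassembler builds, and the class name `AsmLineParser` and the sample line are hypothetical, not taken from JMH.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: the assumed line shape is "0x<hex>: <instruction> [; <comment>]".
final class AsmLineParser {

    /** One disassembled instruction with its address and optional comment. */
    static final class AsmLine {
        final long address;
        final String instruction;
        final String comment; // empty when the disassembler gave no comment

        AsmLine(long address, String instruction, String comment) {
            this.address = address;
            this.instruction = instruction;
            this.comment = comment;
        }
    }

    private static final Pattern LINE =
            Pattern.compile("\\s*0x([0-9a-fA-F]+):\\s*([^;]*?)\\s*(?:;(.*))?");

    /** Returns the parsed line, or empty if the line is not an instruction line. */
    static Optional<AsmLine> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches() || m.group(2).isEmpty()) {
            return Optional.empty();
        }
        long address = Long.parseUnsignedLong(m.group(1), 16);
        String comment = m.group(3) == null ? "" : m.group(3).trim();
        return Optional.of(new AsmLine(address, m.group(2), comment));
    }

    public static void main(String[] args) {
        // Hypothetical sample line, written by hand for this sketch.
        parse("  0x00007f1f2d0a3c4c: inc    %r10d              ;*iinc")
                .ifPresent(l -> System.out.println(
                        Long.toHexString(l.address) + " | " + l.instruction + " | " + l.comment));
    }
}
```

Lines that do not start with an address (headers, blank lines, method banners) simply yield an empty result, which is enough for a profiler that only cares about instruction addresses.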
DTrace is a dynamic tracing framework supported on Solaris, FreeBSD, Mac OS X and, partially, Linux. It consists of a kernel module that implements the core functionality, and client programs written in a special language called D (not to be confused with the other D language). A client program declares which events it is interested in; the kernel module compiles the program into a special bytecode, does some preparatory work, and runs the program whenever those events occur. The D language is deliberately safe and restrictive: a program cannot, for example, enter an infinite loop or crash the application, so it can be executed directly in the kernel. The framework itself is very powerful and lets you do many interesting and non-trivial things that are beyond the scope of this article; I will only cover the functionality that dtraceasm needs.
dtraceasm uses the profile-n probe of the profile provider, which is not tied to any particular event and simply invokes the user program at fixed intervals.
The mechanism is simple: the kernel registers a timer with the given frequency, interrupts whatever process is currently running on the CPU, and calls our DTrace script from the interrupt handler.
The script itself looks like this:
profile-1001 /arg1/ { printf("%d 0x%lx %d", pid, arg1, timestamp); ufunc(arg1)}
It can be read as: "1001 times per second, if the interrupted process is currently running in userspace, print its pid, its PC, the current time, and the name of the function being executed (together with the library name)."
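On the consuming side, each sample then has to be split back into its fields. The sketch below assumes one sample per line in the order the script prints it (pid, PC with an `0x` prefix, timestamp), optionally followed by the resolved symbol; the exact layout of the symbol part and the class name `DTraceSample` are assumptions made for illustration.

```java
// Sketch only: parses "<pid> 0x<pc> <timestamp> [<symbol>]" as printed by the script above.
final class DTraceSample {
    final long pid;
    final long pc;
    final long timestampNanos; // DTrace's timestamp variable counts nanoseconds
    final String symbol;       // empty when nothing was resolved

    private DTraceSample(long pid, long pc, long timestampNanos, String symbol) {
        this.pid = pid;
        this.pc = pc;
        this.timestampNanos = timestampNanos;
        this.symbol = symbol;
    }

    static DTraceSample parse(String line) {
        // Limit of 4 keeps a symbol that itself contains spaces in one piece.
        String[] parts = line.trim().split("\\s+", 4);
        if (parts.length < 3 || !parts[1].startsWith("0x")) {
            throw new IllegalArgumentException("Not a sample line: " + line);
        }
        long pid = Long.parseLong(parts[0]);
        long pc = Long.parseUnsignedLong(parts[1].substring(2), 16);
        long timestamp = Long.parseLong(parts[2]);
        String symbol = parts.length == 4 ? parts[3] : "";
        return new DTraceSample(pid, pc, timestamp, symbol);
    }

    public static void main(String[] args) {
        // Hypothetical sample line, written by hand for this sketch.
        DTraceSample s = parse("4242 0x10e3f2a4c 1500000000 libjvm.dylib`Symbol::as_C_string");
        System.out.println(s.pid + " " + Long.toHexString(s.pc) + " " + s.symbol);
    }
}
```

Keeping the pid in every sample matters because the timer interrupts whatever happens to be running, so samples from unrelated processes have to be filtered out afterwards.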
The PC (program counter) is a special register that holds the address of the instruction currently being executed. But where does the method name come from?
The kernel knows everything about loaded libraries, executables and their symbols (methods are symbols), as well as the addresses at which they are loaded, so it can build an index of the form "instruction address -> library -> specific method". In other words, knowing the value of the PC, you can find out which code it points into.
For example, suppose that in the library lib.so the foo() method starts at offset 1024, the next method, bar(), starts at offset 2048, and the library itself is loaded into the process at address 1048576. If the current value of the PC lies in the half-open interval [1048576 + 1024, 1048576 + 2048), then foo() from lib.so is being executed right now.
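The same lookup can be sketched with a sorted map: find the greatest symbol offset that does not exceed the PC's offset inside the library. This is only an illustration of the idea, not how the kernel implements it; the class name `SymbolIndex` and the library size of 4096 bytes are assumptions added to make the example complete.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch only: resolves a PC to "library`symbol" using symbol start offsets.
final class SymbolIndex {
    private final String library;
    private final long base; // address at which the library is loaded
    private final long size; // assumed total size of the mapped library
    private final TreeMap<Long, String> symbolsByOffset = new TreeMap<>();

    SymbolIndex(String library, long base, long size) {
        this.library = library;
        this.base = base;
        this.size = size;
    }

    SymbolIndex add(long offset, String name) {
        symbolsByOffset.put(offset, name);
        return this;
    }

    /** Returns "library`symbol", or "unknown" if the PC is outside any known symbol. */
    String resolve(long pc) {
        long offset = pc - base;
        if (offset < 0 || offset >= size) {
            return "unknown";
        }
        // floorEntry picks the last symbol that starts at or before this offset.
        Map.Entry<Long, String> entry = symbolsByOffset.floorEntry(offset);
        return entry == null ? "unknown" : library + "`" + entry.getValue();
    }

    public static void main(String[] args) {
        SymbolIndex lib = new SymbolIndex("lib.so", 1048576L, 4096L)
                .add(1024, "foo")
                .add(2048, "bar");
        System.out.println(lib.resolve(1048576L + 1500)); // inside foo
        System.out.println(lib.resolve(1048576L + 2048)); // first byte of bar
    }
}
```

A PC that falls outside every known library resolves to "unknown", which is exactly the situation with dynamically generated code.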
But if the code was generated dynamically (which is exactly what JIT compilers do) and there is no symbol information for it, the kernel will not find a method name.
So how do we get an annotated assembly listing of the benchmark?
The benchmark is launched in a separate JVM with the PrintAssembly flag, and immediately after it starts, a DTrace script is launched that writes its results to a file.
With this data in hand, only a few steps remain: from the PrintAssembly output, map every address to its instruction (for example, inc %r10d) and an optional comment from the disassembler, then match the sampled PC values from the DTrace output against those addresses.
The problem that the PrintAssembly output contains code compiled by both C1 and C2 resolves itself: after the warmup iterations only one version of the compiled code ends up in the hot profile (provided there is no constant recompilation in the steady state). In addition, *asm profilers filter out events from warmup iterations, and native methods (internals of the JVM itself, native calls, etc.) will still show up among the hottest methods.
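The matching step boils down to counting how many samples landed on each known instruction address. Here is a minimal sketch under simplifying assumptions: the disassembly is already available as a map from address to instruction text, the samples are a plain array of PC values, and the share is computed against all samples. The class name `HotInstructions` and all addresses are hypothetical; JMH's actual processing is more involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Sketch only: joins sampled PCs with disassembled instructions, hottest first.
final class HotInstructions {

    static List<String> report(Map<Long, String> asmByAddress, long[] sampledPcs) {
        // Count samples per address, ignoring PCs outside the known generated code.
        Map<Long, Integer> hits = new HashMap<>();
        for (long pc : sampledPcs) {
            if (asmByAddress.containsKey(pc)) {
                hits.merge(pc, 1, Integer::sum);
            }
        }

        // Sort by hit count descending, then by address for a stable order.
        List<Map.Entry<Long, Integer>> sorted = new ArrayList<>(hits.entrySet());
        sorted.sort((a, b) -> {
            int byHits = Integer.compare(b.getValue(), a.getValue());
            return byHits != 0 ? byHits : Long.compare(a.getKey(), b.getKey());
        });

        List<String> lines = new ArrayList<>();
        for (Map.Entry<Long, Integer> e : sorted) {
            double share = 100.0 * e.getValue() / sampledPcs.length;
            lines.add(String.format(Locale.ROOT, "%5.2f%%  0x%x: %s",
                    share, e.getKey(), asmByAddress.get(e.getKey())));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical addresses; the instructions echo the logarithm example.
        Map<Long, String> asm = new HashMap<>();
        asm.put(0x1000L, "fyl2x");
        asm.put(0x1002L, "fstpl (%rsp)");
        long[] samples = {0x1002L, 0x1002L, 0x1000L, 0x1002L, 0x9999L};
        report(asm, samples).forEach(System.out::println);
    }
}
```

Samples whose PC is not in the map (here `0x9999`) still count toward the total, so the percentages of generated code do not add up to 100 when time is spent elsewhere.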
NB: porting to Mac OS X "only" required implementing the PC sampling part with DTrace. The rest of the infrastructure for processing the results had existed in JMH since perfasm, and the author of the article (me) did not have to do anything there.
By combining a few simple tools you get a fairly powerful profiler. To an unprepared developer it can look like a black box, but now you know how it actually works and that there is no magic or rocket science in it (and you can easily figure out how, for example, perfasm works).
Special thanks to Alexey Shipilyov for reading the article, catching its inaccuracies, and making sure that I bring knowledge to the masses rather than nonsense :)
Source: https://habr.com/ru/post/347124/