They say you can watch fire, water, and other people working forever, but there is one more thing: we are sure you could talk endlessly with Sasha (goldshtn) Goldstein about performance. We already interviewed Sasha before JPoint 2017, but that conversation was specifically about BPF, the subject of his talk. This time we decided to dig deeper and look at the fundamental problems of performance monitoring and their solutions.

Where to start
- Last time we talked in some detail about BPF and briefly touched on the problems of monitoring Java performance on Linux. This time I would like to concentrate not on a specific tool, but on problems and the search for solutions. The first question is rather trivial: how do you know that there are performance problems at all? Should you think about this if users are not complaining?
Sasha Goldstein: If you only start thinking about performance at the moment your users complain, they will not stay with you for long. For many, performance engineering means troubleshooting and crisis mode: phones ringing, lights flashing, the system down, the keyboard on fire, the everyday life of a performance engineer. In reality, performance engineers spend most of their time planning, designing, monitoring and preventing crises.
Capacity planning means estimating the expected system load and resource utilization; designing for scalability helps avoid bottlenecks when the load grows significantly; instrumentation and monitoring are vital for understanding what is happening inside the system instead of digging blindly; automatic alerting tells you about problems as they arise, usually before users start complaining; and, of course, there will still be occasional crises that have to be resolved under stress.
It is worth noting that the tools change constantly, but the process itself stays the same. To give a couple of concrete examples: you can do capacity planning on the back of a napkin; you can use APM solutions (like New Relic or Plumbr) for end-to-end instrumentation and monitoring, ab and JMeter for quick load testing, and so on. To learn more, read Brendan Gregg's book Systems Performance, an excellent source on the performance life cycle and methodology, and Google's Site Reliability Engineering, which covers Service Level Objectives and how to monitor them.
- Suppose we understand that there is a problem: where do we start? It often seems to me that many people (especially those who are not professional performance engineers) immediately reach for JMH, rewrite everything with Unsafe and "hack the compilers", and only then look at what actually happened. In reality, it is better not to start with this?

Sasha Goldstein: It is a fairly common scenario: while writing code and running basic profiler tests, you find performance problems that can be fixed easily by changing the code or "hacking the compilers". In production, however, in my experience this happens far less often. Many problems are specific to one environment, caused by changing workload patterns, or tied to bottlenecks outside your application's code, and only a small portion can be fixed at the source level with clever hacks.
Here are a few examples to illustrate:
- A couple of years ago, Datadog ran into a problem where PostgreSQL inserts and updates jumped from 50 ms to 800 ms. They were using AWS EBS with SSDs. What was going on? Instead of tuning the database or changing the application code, they discovered that EBS throttling was to blame: EBS volumes have an IOPS quota, and once you exceed it, you are pushed down to a much lower performance limit.
- Recently I had a customer with huge server response times caused by garbage collection delays. Some requests took more than 5 seconds, seemingly at random, because garbage collection was getting out of control. After carefully studying the system, we found that nothing was wrong with the application's memory allocation or with the garbage collector tuning; rather, a jump in workload size had increased actual memory usage and caused swapping, which is absolutely disastrous for any garbage collector (if the collector has to page memory in and out just to mark live objects, it's game over).
- A couple of months ago, Sysdig ran into a container isolation problem: file system operations performed by container Y became much slower when it was running next to container X, even though memory usage and CPU utilization of both containers were very low. After some investigation, they found that container X was flooding the kernel's directory entry cache, which led to hash table collisions and, as a result, a significant slowdown. Again, changing the application code or the containers' resource allocation would not have solved this problem.
I understand that it is often much easier to focus on things you can control, such as application-level hacks. Psychologically this is understandable: it does not require deep knowledge of the system or the environment, and in some cultures it is for some reason considered "cooler". But reaching for it first is wrong.
- Probably, you should first see how the application or service behaves in production. What tools do you recommend for this, and which ones not so much?

Sasha Goldstein: Monitoring and profiling in production is a combination of tools and techniques.
We start with high-level performance metrics, focusing on resource utilization (CPU, memory, disk, network) and load characteristics (number of requests, errors, request types, number of database queries). There are standard tools for obtaining this data on every operating system and runtime: on Linux you would typically use tools like vmstat, iostat, sar, ifconfig, pidstat; for the JVM, JMX-based tools or jstat. These metrics can be collected continuously into a database, say at a 5- or 30-second interval, so that you can analyze spikes and, if necessary, go back in time to correlate them with earlier deployments, releases, world events, or workload changes. It is important to note that many people collect only averages; averages are fine, but by definition they do not represent the full distribution of what you are measuring. It is much better to collect percentiles, and if possible even histograms.
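To make the point about averages versus percentiles concrete, here is a minimal, self-contained Java sketch; the class and method names are illustrative and not taken from any particular monitoring library:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Collect per-request latencies and report percentiles instead of a single average.
public class LatencyPercentiles {
    private final List<Long> samplesMicros = new ArrayList<>();

    public synchronized void record(long micros) {
        samplesMicros.add(micros);
    }

    public synchronized long percentile(double p) {
        List<Long> sorted = new ArrayList<>(samplesMicros);
        Collections.sort(sorted);
        int index = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(index, 0));
    }

    public static void main(String[] args) {
        LatencyPercentiles stats = new LatencyPercentiles();
        // Simulated request latencies: mostly fast, with occasional slow outliers.
        for (int i = 0; i < 10_000; i++) {
            long micros = ThreadLocalRandom.current().nextLong(1_000, 5_000);
            if (i % 500 == 0) micros += 200_000;   // an occasional very slow request
            stats.record(micros);
        }
        double avg = stats.samplesMicros.stream().mapToLong(Long::longValue).average().orElse(0);
        System.out.printf("avg=%.0f us, p50=%d us, p99=%d us, p99.9=%d us%n",
                avg, stats.percentile(50), stats.percentile(99), stats.percentile(99.9));
    }
}
```

The average barely moves, while the 99th and 99.9th percentiles expose the slow outliers that your users actually feel.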
The next level is operational metrics, which usually cannot be collected continuously or stored for long: garbage collection logs, network requests, database queries, class loads, and so on. Making sense of this data after it has been stored somewhere is sometimes much harder than actually collecting it. It does, however, let you ask questions such as "which queries were running while the database CPU load climbed to 100%" or "what were the disk IOPS and response times while this request was executing". Numbers alone, especially averages, will not let you do that kind of investigation.
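A minimal sketch of this kind of per-operation event logging might look like the following; the names are illustrative, and in a real system the output would go to a structured log or event store rather than stdout:

```java
import java.time.Instant;
import java.util.function.Supplier;

// Record each database query or outbound request with a timestamp, a label and
// its duration, so that it can later be correlated with resource metrics
// ("which queries ran while DB CPU was at 100%?").
public class OperationLog {
    public static <T> T timed(String label, Supplier<T> work) {
        Instant start = Instant.now();
        long t0 = System.nanoTime();
        try {
            return work.get();
        } finally {
            long micros = (System.nanoTime() - t0) / 1_000;
            System.out.printf("%s op=%s duration_us=%d%n", start, label, micros);
        }
    }

    public static void main(String[] args) {
        String rows = timed("db.select_orders", () -> {
            // stand-in for a real database call
            try { Thread.sleep(25); } catch (InterruptedException ignored) { }
            return "42 rows";
        });
        System.out.println(rows);
    }
}
```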
And finally, the "hardcore" level: SSHing into the server (or launching a tool remotely) to collect deeper internal metrics that cannot be recorded during normal operation of the service. These are the tools commonly called "profilers".
For Java profiling in production there are plenty of scary options that not only impose a large overhead and latency, but can also outright lie to you. Even though the ecosystem is about 20 years old, there are only a few reliable, low-overhead profiling techniques for JVM applications. I can recommend Richard Warburton's Honest Profiler, Andrey Pangin's async-profiler and, of course, my favorite: perf.
By the way, many tools focus on CPU profiling, that is, understanding which code paths cause high CPU utilization. That is great, but often that is not where the problem lies; we need tools that can show the code paths responsible for memory allocations (async-profiler can now do this too), page faults, cache misses, disk accesses, network requests, database queries and other events. This is the area that attracts me: finding the right tools for investigating performance in a production environment.
Java profiling for Linux
- I have heard that the Java/Linux stack has a lot of problems with measurement accuracy. Surely this can be fought somehow. How do you do it?

Sasha Goldstein: Yes, it is sad. Here is what the current situation looks like. Imagine a fast assembly line with a huge number of different parts, which you need to inspect to find defects and to understand the speed of the application or service. You cannot inspect every single part, so your strategy is to check one part per second and see whether everything is in order, and you have to do this through a tiny window above the belt, because it is dangerous to get any closer. Sounds reasonable, doesn't it? But then it turns out that when you look in, the window shows you not what is on the belt right now; it waits until the line enters a magical "safe" mode, and only then lets you see anything. It also turns out that there are many parts you will never see, because the line cannot enter its "safe" mode while they are passing by; and on top of that, inspecting a part through the window takes a whole 5 seconds, so doing it every second is impossible.
That is roughly the state of most profilers in the JVM world today. YourKit, jstack, JProfiler, VisualVM: they all take the same approach to CPU profiling, sampling threads at safepoints. This means they use a documented API to suspend all JVM threads and capture their stack traces, which are then aggregated into a report of the hottest methods and stacks.
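For illustration, here is a minimal sketch of that style of sampler built on the public Thread.getAllStackTraces() API. Real profilers are far more sophisticated, but the underlying mechanism, and therefore the safepoint bias described next, is similar:

```java
import java.util.HashMap;
import java.util.Map;

// Naive sampling "profiler": every 10 ms, grab the stack traces of all threads
// and count the top frame of each runnable thread. getAllStackTraces() only
// observes threads at safepoints, so the resulting picture is biased.
public class NaiveSampler {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> hotFrames = new HashMap<>();

        // Workload to sample: a thread spinning on some arithmetic.
        Thread worker = new Thread(() -> {
            double x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += Math.sqrt(x + 1);
            }
        }, "worker");
        worker.setDaemon(true);
        worker.start();

        for (int i = 0; i < 200; i++) {           // ~2 seconds of sampling
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                StackTraceElement[] stack = e.getValue();
                if (e.getKey().getState() == Thread.State.RUNNABLE && stack.length > 0) {
                    hotFrames.merge(stack[0].toString(), 1, Integer::sum);
                }
            }
            Thread.sleep(10);
        }

        hotFrames.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(5)
                .forEach(entry -> System.out.println(entry.getValue() + "  " + entry.getKey()));
    }
}
```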
The problem with suspending the process this way is that threads do not stop immediately: the runtime waits until each thread reaches a safepoint, which may be many instructions or even whole methods later. As a result, you get a biased picture of the application, and different profilers may even disagree with each other!
There is a study showing just how bad this is: each profiler had its own opinion about which method was the hottest in the same workload (Mytkowicz et al., "Evaluating the Accuracy of Java Profilers"). Moreover, if you have 1,000 threads with deep Spring call stacks, you often cannot collect stack traces very frequently, perhaps no more than 10 times a second, so your stack data diverges from the actual workload even more!
Solving these problems is not easy, but the investment is worth it: some production workloads simply cannot be profiled with "traditional" tools like those listed above.
There are two separate approaches and one hybrid:
- Richard Warburton's Honest Profiler uses an internal, undocumented API, AsyncGetCallTrace, which returns the stack trace of a single thread, does not require threads to reach a safepoint, and can be called from a signal handler. It was originally designed for Oracle Developer Studio. The basic approach is to install a signal handler and register it with a timer signal at a fixed frequency (for example, 100 Hz), and then, inside the signal handler, capture the stack trace of whatever thread is currently running. Obviously there are hard problems in aggregating these stack traces efficiently, especially in the context of a signal handler, but the approach works well. (This approach is also used by JFR, which requires a commercial license.)
- Linux perf can provide rich sampled stacks (not just for CPU cycles, but also for other events, such as disk accesses and network requests). The problem is translating Java method addresses into method names, which requires a JVMTI agent that emits a text file (a perf map) that perf can read and use. There are also problems reconstructing stacks if the JIT omits frame pointers. This approach can work very well, but it requires some preparation. In return, however, you get stack traces not only for JVM threads and Java methods, but for all of your threads, including kernel stacks and C++ stacks.
- Andrey Pangin's async-profiler combines the two approaches. It sets up perf_events sampling, but also uses a signal handler that calls AsyncGetCallTrace to obtain the Java stack. Merging the two stacks gives a complete picture of what is happening in the thread, and avoids both the Java method name translation problem and the frame pointer omission problem.
Any of these options is much better than safepoint-biased profilers in terms of accuracy, resource consumption and sampling frequency. They can be fiddly to set up, but I think accurate, low-overhead profiling in production is worth the effort.
Container Profiling
- Speaking of environments, it is now fashionable to pack everything into containers. Are there any special considerations here? What should you keep in mind when working with containerized applications?

Sasha Goldstein: There are interesting problems with containers that many tools ignore completely, and as a result they simply stop working.
Briefly, as a reminder: Linux containers are built on two key technologies, control groups and namespaces. Control groups let you impose resource quotas on a process or group of processes: CPU time caps, memory limits, storage IOPS, and so on. Namespaces provide container isolation: the mount namespace gives each container its own mount points (effectively a separate file system), the PID namespace its own process identifiers, the network namespace its own network interfaces, and so on. Because of namespaces, it is difficult for many tools to interact properly with containerized JVM applications (although some of these problems are not specific to the JVM).
Before discussing specific issues, let me briefly describe the different kinds of observability tools for the JVM. If you have not heard of some of them, it is time to refresh your knowledge:
- basic tools like jps and jinfo provide information about running JVM processes and their configuration;
- jstack can be used to get a thread dump (stack traces) from a running JVM process;
- jmap can produce a heap dump of a running JVM process, or simpler class histograms;
- jcmd replaces most of the previous tools and sends commands to running JVM processes through the JVM attach interface, which is based on a UNIX domain socket that the JVM process and jcmd use to exchange data (a short attach sketch follows this list);
- jstat monitors basic JVM performance counters, such as class loading, JIT compilation, and garbage collection statistics; it relies on the JVM publishing this data in binary form in the /tmp/hsperfdata_$UID/$PID files;
- the Serviceability Agent provides an interface for inspecting the memory, threads, stacks and so on of JVM processes, and can work with a memory dump, not only with live processes; it works by reading the process memory and internal data structures;
- JMX (managed beans) can be used to obtain performance information from a running process, as well as to send commands that control its behavior;
- JVMTI agents can hook various interesting JVM events, such as class loading, method compilation, thread start and stop, monitor contention, and so on.
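As an aside, the attach interface that jcmd and jinfo rely on is also exposed as a Java API. A minimal sketch, assuming the jdk.attach module (or tools.jar on JDK 8) is available and the program runs as the same user as the target JVM, might look like this:

```java
import com.sun.tools.attach.VirtualMachine;
import com.sun.tools.attach.VirtualMachineDescriptor;
import java.util.Properties;

// List local JVMs, attach to one by PID, and read its system properties.
public class AttachDemo {
    public static void main(String[] args) throws Exception {
        for (VirtualMachineDescriptor vmd : VirtualMachine.list()) {
            System.out.println("Found JVM: pid=" + vmd.id() + " name=" + vmd.displayName());
        }
        if (args.length > 0) {
            VirtualMachine vm = VirtualMachine.attach(args[0]);   // args[0] = target PID
            try {
                Properties props = vm.getSystemProperties();
                System.out.println("java.version = " + props.getProperty("java.version"));
                // vm.loadAgent("/path/to/agent.jar") would inject a Java agent here.
            } finally {
                vm.detach();
            }
        }
    }
}
```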
Here are some problems that arise when profiling and monitoring containers from the host, and how to work around them (I will try to cover this in more detail in the future):
- Most tools require access to the process's binaries in order to resolve symbols and values. These live in the container's mount namespace and are not visible from the host. This can be partially solved by bind-mounting from the container, or by having the profiler switch into the container's mount namespace while resolving symbols (this is what perf and the BCC tools you mentioned in the previous interview do now).
- If you have a JVMTI agent that generates a perf map (for example, perf-map-agent), it will write it to the container's /tmp using the in-container process ID (for example, /tmp/perf-1.map). The map file has to be accessible from the host, and the file name has to contain the process ID as the host sees it. (Again, perf and BCC can now handle this automatically.)
- The JVM attach interface (which jcmd, jinfo, jstack and some other tools rely on) requires the attach file to be created with the correct PID and in the correct mount namespace, plus access to the UNIX domain socket used to exchange data with the JVM. This can be solved with the jattach utility, which creates the attach file after entering the container's namespaces, or by bind-mounting the corresponding directories onto the host.
- Using the JVM performance data files (in /tmp/hsperfdata_$UID/$PID) that jstat relies on requires access to the container's mount namespace. This is easily solved by bind-mounting the container's /tmp onto the host.
- For JMX-based tools, the simplest approach is probably to treat the JVM as if it were remote: configure an RMI endpoint, just as you would for remote diagnostics (see the connection sketch after this list).
- The Serviceability Agent requires an exact version match between the target JVM process and the JDK used to run it. As you can imagine, you cannot count on that when running it from the host, especially if the host uses a different distribution with different JVM versions installed.
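For the JMX route, a minimal connection sketch might look like the following; the host, port and remote-JMX flags are assumptions, and the target JVM has to be started with the usual com.sun.management.jmxremote.* options:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.RuntimeMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Connect to a containerized JVM over remote JMX (RMI) and read a few MXBeans.
public class JmxRemoteDemo {
    public static void main(String[] args) throws Exception {
        String hostPort = args.length > 0 ? args[0] : "localhost:9010";  // illustrative host:port
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + hostPort + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            RuntimeMXBean runtime = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.RUNTIME_MXBEAN_NAME, RuntimeMXBean.class);
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            System.out.println("VM: " + runtime.getVmName() + " " + runtime.getVmVersion());
            System.out.println("Heap used: " + memory.getHeapMemoryUsage().getUsed() + " bytes");
        }
    }
}
```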
Here you might think: what if I just put the profiling tool inside the container, so that none of these isolation problems apply to me? The idea is not bad, but many tools will not work in this configuration because of seccomp. Docker, for example, blocks the perf_event_open system call, which is required for profiling with perf and async-profiler; it also blocks the ptrace system call, which many tools use to read the memory of the JVM process. Changing the seccomp policy to allow these system calls puts the host at risk, and by putting profiling tools inside the container you also increase its attack surface.
We would have liked to continue the conversation and discuss how hardware affects profiling...
Very soon Sasha will come to St. Petersburg to run a training on profiling JVM applications in production and to speak at the Joker 2017 conference with a talk on BPF, so if you want to dive deeper into the topic, you have every chance to meet Sasha in person.