
Do It Yourself Java Profiling

At the last Application Developer Days conference, Roman Elizarov ( elizarov ) explained how to profile, that is, explore the performance of, any Java-based application without specialized tools, whether vendor-supplied or open-source. It turns out that little-known features built into the JVM (thread dumps, java agents, bytecode manipulation) let you implement profiling quickly and efficiently, and you can keep it running all the time, even on a production system. Here is the video of the talk (it is embedded a bit clumsily here, but it is 1280x720 and everything is perfectly readable):

I also suggest taking a look at the roughly 70K of illustrated article-transcript under the cut, which I compiled from the video and slides.
Today my talk is about "Do It Yourself Java Profiling." The slides are in English; I will give the talk in Russian. There are a great many slides and not much time, so I will go through some of them quickly and try to leave more time at the end for questions, and maybe even somewhere in the middle. Do not hesitate to ask or clarify something, or even interrupt me; I would rather cover in more detail what is interesting to you than simply say what I planned to say.

The talk is based on real experience: for more than ten years our company has been building complex, very highly loaded financial applications that work with large data sets, millions of quotes per second, and tens of thousands of users online, and with that kind of work it always comes down to profiling the application. Profiling an application is an inevitable part of any optimization; optimization without profiling is impossible. You profile, find bottlenecks, optimize, profile again; it is a constant cycle. Why is the talk called "Do It Yourself Java Profiling"? Why do anything yourself? There is a huge number of ready-made tools that help with profiling: the profilers themselves and similar instruments.

But, first of all, there may be a problem with a third-party tool as such. For reasons of reliability or security you may simply not be allowed to run a third-party tool on a live environment. Unfortunately, you often have to profile not only on the test platform but also on the live one, and for a highly loaded platform you do not always have the opportunity and resources to build an identical copy of the system and put it under the same load. Many bottlenecks can be detected only under heavy load, only under very specific conditions. You see that the system is not working well, but you do not understand why, and you do not know what load pattern you would need to create for the problem to show itself; therefore you often have to profile the live system.
When we write financial applications, we also have the task of ensuring the reliability of the system. And we are not building "banks," where the main thing is not to lose money but which can be unavailable for hours. We build brokerage systems for online trading, where 24×7 availability is one of the key qualities; they should never go down.

And, as I already said, the whole industry is regulated, and sometimes we simply cannot use a third-party product on a real system.

Tools are also often opaque. Yes, there is documentation that describes what a tool shows, but it does not describe how exactly it obtains those results, and it is not always possible to understand what was actually meant.

And even if the tool is open source, that does not change much, because there is a lot of code and you will lose a lot of time digging through it. Tools have to be learned, and doing something yourself is, of course, much more pleasant. What is the problem with learning? Naturally, if you use a tool often, you should learn it. If you program every day in a development environment you love, you know it inside and out, and that knowledge naturally pays off a hundredfold.

But if you need to do something only once a month, for example profile once a month because of a performance bug, it is not at all certain that learning the appropriate tool will pay off, unless, of course, the tool solves the problem many times faster.

By doing things with your own hands, you can reuse your existing knowledge. You have knowledge of programming languages and of your tools; you can deepen, extend, and refine it, learning more about the tools you already have instead of learning a new one. Why a Java talk? Not only because our company programs in Java: it is the leading language of 2011 according to the TIOBE index and is great for enterprise applications. And for this particular lecture it is simply wonderful, because Java is a managed language, it runs in a virtual machine, and this is exactly what makes profiling very easy. I will first talk about how to solve many profiling problems simply by writing some Java code, then about the capabilities of the Java machine that can be used, and then about a technique called bytecode manipulation. Today we will look at profiling both CPU and memory, and I will describe different techniques. CPU profiling: the easiest way is simply to sit down and program it. If you need to figure out where and how much time is spent in your program and how many times something is called, then the easiest way requires no tools at all: just write a few lines of code.

Java has the handy System.currentTimeMillis function that returns the current time. You can measure it in one place, measure it in another, and then count how many times something was done, the total time, the minimum and maximum time, anything you like. The simplest way, DIY in its maximum simplicity and primitivism.
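A minimal sketch of this "measure it yourself" approach; the class and method names (OrderService, processOrder) are hypothetical placeholders, not from the talk:

```java
// Hand-rolled timing of a business method with System.currentTimeMillis.
class OrderService {
    private long calls, totalMs, maxMs;

    void processOrder(String order) {
        long start = System.currentTimeMillis();
        try {
            // ... actual business logic ...
        } finally {
            long elapsed = System.currentTimeMillis() - start;
            calls++;                             // how many times the method ran
            totalMs += elapsed;                  // total time spent in it
            maxMs = Math.max(maxMs, elapsed);    // worst case seen so far
        }
    }

    @Override
    public String toString() {
        return String.format("processOrder: calls=%d total=%dms max=%dms", calls, totalMs, maxMs);
    }
}
```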

Oddly enough, in practice this method works fine and brings a lot of benefit, because it is fast, convenient, and efficient. When does it work well? It works great for business methods: a business method is large, it is not called very often, and you need to measure something about it. Moreover, since you wrote this code yourself, it becomes part of your application and part of its functionality. More or less any modern large application contains management interfaces, some statistics, and so on; application performance is one of the things you often want the application to report about itself, simply as part of its functionality.

In this sense, programming an application to profile itself is a logical step. You increase the functionality of the application; profiling becomes part of it. Especially if you place such measurements in the business methods that the end user calls, this information will also be meaningful to the end user: how many times which methods were called, how long they took, and so on. The information you collect with this approach is completely under your control. You can measure the number of calls, the minimum time, the maximum time, the average; you can build histograms of the distribution of execution time, compute medians and percentiles. You can track different execution paths in the code in different ways; as in this example (for those who managed to read it while I was talking), depending on the execution path we record different statistics: how often the query result was found in the cache and how long that took, and how often the query had to go to the database and how long that took. This is possible when you write this code yourself, collect the statistics, and embed them in your application yourself. Moreover, as a person involved in the profile-optimize cycle, you use this information later to understand what is happening in your application. The information is always inside your application, and the code runs in the live system. If you had a bad day and the system did not behave as it should, you can look at these statistics in the logs and figure it out.
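Since the slide with that example is not reproduced here, below is a minimal sketch of what such per-path statistics might look like; the names (QuoteDao, queryDatabase, getQuote) are hypothetical placeholders:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Separate counters and timers for the cache-hit path and the database path.
class QuoteDao {
    private final AtomicLong cacheHits = new AtomicLong();
    private final AtomicLong cacheHitMs = new AtomicLong();
    private final AtomicLong dbQueries = new AtomicLong();
    private final AtomicLong dbQueryMs = new AtomicLong();
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    String getQuote(String symbol) {
        long start = System.currentTimeMillis();
        String cached = cache.get(symbol);
        if (cached != null) {
            cacheHits.incrementAndGet();
            cacheHitMs.addAndGet(System.currentTimeMillis() - start);
            return cached;
        }
        String fromDb = queryDatabase(symbol);        // the slow path
        cache.put(symbol, fromDb);
        dbQueries.incrementAndGet();
        dbQueryMs.addAndGet(System.currentTimeMillis() - start);
        return fromDb;
    }

    private String queryDatabase(String symbol) { return "..."; }
}
```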

A wonderful technique: no third-party tools, just a little code in Java.
What do you do if the methods are shorter and are called more often?

The direct approach no longer works, because currentTimeMillis is not fast, and it measures only milliseconds. If you only need to measure the number of calls, you can do it cheaply with the AtomicLong Java class. With it you can count the number of calls to any method you are interested in while adding minimal overhead. This works up to tens of thousands of calls per second without greatly distorting the behavior of the application itself. But what if you still need to measure execution time? Measuring the execution time of short methods is a very complex topic. It cannot be solved by standard means; even though Java has a System.nanoTime method, it does not solve the problem: it is slow in itself, and it is difficult to measure anything fast with it.
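A minimal sketch of cheap call counting with AtomicLong; the class and method names (QuoteHandler, handleQuote) are hypothetical placeholders:

```java
import java.util.concurrent.atomic.AtomicLong;

// Counting calls only: one atomic increment per call, very low overhead.
class QuoteHandler {
    static final AtomicLong HANDLE_QUOTE_CALLS = new AtomicLong();

    void handleQuote(String quote) {
        HANDLE_QUOTE_CALLS.incrementAndGet();
        // ... actual work ...
    }
}
```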

The only real way out is native code: on x86 processors there is a wonderful instruction, rdtsc, which returns the processor cycle counter. There is no direct access to it from Java, but you can write a one-line method in C that executes rdtsc, link it with your Java code, and call it from Java. Such a call will cost you on the order of a hundred cycles, so it only makes sense if the piece of code you need to measure takes a thousand cycles or more, and if you are optimizing every machine cycle and want to understand, plus or minus, whether you have become faster or slower. That is a genuinely rare case, when you really need to optimize every cycle.
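A minimal sketch of what such a JNI wrapper could look like; the class name, library name, and the C snippet in the comment are my own illustration under the assumptions stated in the comments, not code from the talk:

```java
// Java side of an rdtsc wrapper. Assumes a native library "rdtsc" (librdtsc.so or
// rdtsc.dll) built from a one-line C function roughly like:
//
//   #include <jni.h>
//   #include <x86intrin.h>
//   JNIEXPORT jlong JNICALL Java_Rdtsc_rdtsc(JNIEnv *env, jclass cls) {
//       return (jlong) __rdtsc();
//   }
public final class Rdtsc {
    static {
        System.loadLibrary("rdtsc");
    }

    public static native long rdtsc();   // returns the CPU cycle counter

    public static void main(String[] args) {
        long start = Rdtsc.rdtsc();
        // ... the short piece of code being measured ...
        long cycles = Rdtsc.rdtsc() - start;
        System.out.println("cycles: " + cycles);
    }
}
```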

Most often, when it comes to shorter and more frequently called pieces of code, a different approach is used, called sampling. Instead of accurately measuring what is called and how many times, you periodically look at what the program is executing at arbitrary points in time, for example once a second or once every ten seconds. You look at where execution is and note where you most often find your program. If there is a line in the program where it spends all, or at least 90%, of its time, for example some loop and, deep inside it, some line, then most likely that is the line you will catch it on when you stop the program.

Such a place in the program is called a hot spot, and it is always a great candidate for optimization. What is great is that there is a built-in way to dump all threads, the thread dump. On Windows it is done by pressing Ctrl-Break in the console, and on Linux and other Unixes by sending signal 3 with the command "kill -3". The Java machine then prints to the console a detailed description of the state of all threads. And if you really have one hot place in the code, then most likely the program will be caught there. So, again, when you have a performance problem you do not need to run for a profiler, you do not need to do anything special. See what is slow, take at least one thread dump, and look. If you have one hot place where the program spends all its time, you will see that line right in the thread dump and can open it in your favorite development environment, without any third-party tools. Look at the code, study it, optimize it: either it is called too often or it runs slowly; then comes the debriefing, optimization, or further profiling. Modern Java machines also ship the wonderful jstack utility: pass it the process identifier and you get a thread dump as output.

Do not stop at a single thread dump, take several. If the first one does not catch anything, look at a couple more. Maybe the program spends not one hundred percent of its time in a hot spot, but 50%. Having taken several thread dumps, you will catch the hot code in at least some of them, and you can then look with your own eyes at the places where you caught it. This idea can be developed further. You can start the Java machine with its output redirected to a file and run a script that takes a thread dump every three seconds. You can do this quite calmly on a live system, without any risk of harming it, because collecting a thread dump is quite fast, about 100 milliseconds, even with a very large number of threads.

And if you write in Java, then most likely your system is not hard real time and nanoseconds do not matter to you, because garbage collection happens periodically anyway. That is, an extra hundred-millisecond pause does not create any problems.

Even in our financial field, most of the systems we write are still written for people; people work with them. Yes, there are millions of quotes per second, yes, there are robots (that is a separate story), but most often these quotes are watched by a human, and a human will not notice plus or minus 100 milliseconds. A person will notice a two-hundred-millisecond delay, that is already perceptible, but one hundred milliseconds is not.

Therefore you can stop worrying about an extra hundred milliseconds and safely take a thread dump every three seconds, even on a live system. The thread dump is part of the Java machine and is well tested; in all my experience I have never seen an attempt to take a thread dump do anything bad to a Java machine.

That is, it is a completely safe tool for profiling live, working systems. Having collected a file of thread dumps, you can look at it with your own eyes, or you can write a simple piece of code that analyzes it and computes some statistics, at the very least simply which methods appeared and how many times, and what state the threads were in. Moreover, while standard profiling tools really do look only at the thread state the Java machine prints ("RUNNABLE"), in reality that state may mean nothing: if your program works a lot with a database or some external network service, your code can be waiting for data from the network while the Java machine still considers the thread "RUNNABLE", and you cannot tell which methods are actually working and which are waiting for data from the network. On the other hand, if you analyze the stacks yourself, you know what your program is doing: you know that this frame is a call to the database, that this method on the stack means you have gone into the database, so you can count what percentage of the time you spend in the database, and so on. You may know that a given method actually does not consume CPU even though the Java machine thinks it is "RUNNABLE". Moreover, thread dumps can be built into the application itself; Java has the wonderful method
  Thread.getAllStackTraces 
which lets you obtain the stack traces of all threads programmatically.
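A minimal sketch of an in-process sampler built on Thread.getAllStackTraces; the class name, sampling period, and output format are my own illustrative choices:

```java
import java.util.Map;

// Periodically walks all thread stacks from inside the application itself.
public class StackSampler {
    public static void start(final long periodMillis) {
        Thread sampler = new Thread(() -> {
            while (true) {
                for (Map.Entry<Thread, StackTraceElement[]> e
                        : Thread.getAllStackTraces().entrySet()) {
                    StackTraceElement[] stack = e.getValue();
                    if (stack.length > 0) {
                        // Only the top frame is printed here; real code would keep
                        // counters per method or per interesting stack pattern.
                        System.out.println(e.getKey().getName() + " @ " + stack[0]);
                    }
                }
                try {
                    Thread.sleep(periodMillis);
                } catch (InterruptedException ie) {
                    return;
                }
            }
        }, "stack-sampler");
        sampler.setDaemon(true);
        sampler.start();
    }
}
```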

In this way you can integrate profiling as a functional part of the application and ship the application to your customers with profiling already built in. You then have a constant flow of information that you can analyze to improve your application. But there is a problem. When you ask a Java machine for a thread dump, it does not simply stop the process and capture the stacks; it raises a flag telling every thread to stop at the next safe point. Safe points are special places the compiler puts into the code where the program has a well-defined state: it is clear what is in the registers, where the execution point is, and so on. If you take a piece of straight-line code with no backward jumps and no loops, there may be no safe point in it at all. Add to that the fact that method calls can be inlined by HotSpot, and then there will be no safe points at those calls either.

Therefore, if you see a line in a thread dump, it does not necessarily mean that this exact line of your code is hot; it is just the safe point closest to the hot spot. When you press Ctrl-Break, the Java machine asks all threads to stop at the nearest safe point, and only when they have stopped does it record the state they are in. Now let us turn to memory profiling and how it is done.

First, there are wonderful, ready-made features of the Java machine. There is the great jmap tool ("jmap -histo <pid>") that prints a histogram of what your memory is full of: which objects there are and how much memory they occupy. It is a great tool for a general overview of what is going on and what your memory is clogged with.

Again, if you have never profiled the program before, you will most often find problems immediately, and you will have food for thought on how to optimize your memory usage further. The problem is that this way you get information about all objects, including those that are no longer in use and are effectively garbage.

Therefore jmap has a special "live" option ("jmap -histo:live <pid>"), which first runs a garbage collection so that only reachable objects remain, and only then builds the histogram.

The problem is that with this option a large live system working with many gigabytes of memory cannot be profiled, because a full garbage collection of a heap of a dozen gigabytes takes ten or twenty seconds, and that can be unacceptable on any system that serves real people: a user who gets no response from the system for more than three seconds considers it stuck. On a system that people work with you really cannot afford to stop for longer than a second. Even a second is already noticeable, though not yet a disaster, but if you plug in a tool that stops the system for 10 seconds, that is a disaster. Therefore, on live systems you often have to be content with a jmap of the objects that are simply there, never mind whether they are garbage or not.
It is also useful to know the advanced options of the Java machine. For example, with "-XX:+PrintClassHistogram" you can ask it to print a class histogram whenever you take a thread dump.

You can also ask the Java machine to write its state to disk when it runs out of memory. This is very useful, because people usually start optimizing memory consumption only when it runs out. Nobody profiles while everything is fine, but when the program starts running out of memory and crashing, they begin to think about how to optimize. So this option is always worth having enabled. Then, when things go wrong, the Java machine will write you a binary heap dump, which you can analyze later, off the live system, with any tool you like. The same dump can be taken from a Java machine at any time with jmap and the "-dump" option, but this, again, stops the Java machine for a long time, and on a live system you are unlikely to want to do that.

Remark from the audience: There is the point that this "HeapDumpOnOutOfMemoryError" is meant precisely for the case when memory has already run out.

Yes, of course. "HeapDumpOnOutOfMemoryError" is a very useful option, and although it is an "-XX" option, you should not be afraid of these; the "XX" emphasizes that they are highly specialized, but they are not experimental. They are normal production options of the Java machine: stable, reliable, and usable on a live, real system.

These are not experimental options! The Java machine does have a clear division into experimental and non-experimental options, but it does not depend on the number of X's.

Remark from the audience: This option sometimes does not manage to write the dump...

Well, there are bugs in the Java machine too; it all depends... there are various reasons for running out of memory, I will not dwell on that, we do not have much time. In the remaining time I want to dwell on a very important topic, namely profiling memory allocation.

It is one thing to know what memory is occupied and how you use it in general. But if somewhere in the code there is excessive allocation of temporary memory, that is, you allocate something, do something with it inside the method and then forget it, it becomes garbage and the garbage collector later picks it up; and if you do this again and again, your program will run slower than it could. Yet you will not find that place with a CPU profiler, because the allocation operation itself in the Java machine is fantastically fast, faster than in any unmanaged language like C or C++: in Java a memory allocation is simply an increment of a single pointer. That is all. It is a few assembly instructions, everything happens very quickly; the memory has already been zeroed, allocated, and prepared in advance. You will not see that time when analyzing the hot spots of your code; it will never show up in any thread dump or any profiler as your hot spot. Yet your application still pays for all of it. Why? Because later the garbage collector spends time collecting that garbage. So look at what percentage of its time your application spends on garbage collection.
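Besides the JVM flags discussed below, the accumulated GC time can also be read in-process through the standard GarbageCollectorMXBean API; the sketch below is my own illustration, not from the talk:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Rough fraction of wall-clock time spent in garbage collection since JVM start.
public class GcTime {
    public static void main(String[] args) {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            gcMillis += Math.max(0, gc.getCollectionTime());   // -1 means "not supported"
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC time: %d ms of %d ms uptime (%.1f%%)%n",
                gcMillis, uptimeMillis, 100.0 * gcMillis / uptimeMillis);
    }
}
```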

Useful options here are "-verbose:gc", "-XX:+PrintGC", and "-XX:+PrintGCDetails", which let you figure out how much time your application spends collecting garbage. If you see that a significant percentage of time goes to garbage collection, it means that somewhere in the program you allocate a lot of memory; the GC log will not show you that place, you have to look for who allocates the memory. How do you search? There is a mechanism built into the Java machine, the "-Xaprof" flag. Unfortunately, it only prints the so-called allocation profile at the end of the process, and it reports not the contents of memory but allocation statistics: which objects were allocated and how often.

If this really happens often, you will most likely see that some temporary class is being allocated somewhere very frequently. Try running with aprof right away; perhaps you will find your problem immediately. But not necessarily. It may turn out that you see a large number of character arrays, strings, or something similar being allocated, and it is not clear where.

Of course, you may have suspicions about where it is: maybe some recent change could have caused it. In the end, you can instrument the place where you suspect memory is allocated too often, using the same code modification technique with AtomicLong, count how many allocations happen there, look at the statistics, and narrow down the suspicious places.

And what if you have no idea at all where this happens? Then you have to add statistics collection everywhere, in every place where memory is allocated. For this kind of task, aspect-oriented programming or direct bytecode manipulation is perfect.

For the remaining time I will dwell on the bytecode manipulation technique, which is suitable for problems like "in every place where an array is allocated, I want to count how many times that place executes, so that I can find the spot in my code where, for some reason, a lot of int arrays are allocated." That is, I see that a lot of them are allocated, and I just want to find where. Bytecode manipulation can solve not only this kind of task; it allows making any non-functional changes to the code after compilation, and thereby decoupling profiling from business logic. If at the beginning I said that profiling can often be a logical part of your functionality, there are also times when it should not be: when you need to find a problem, solve it, and leave no extra lines behind. In that case, the wonderful technique of bytecode manipulation fits well.

It can be done both as a post-compilation step at build time and on the fly, as the code is loaded. The best way I know is to use the ObjectWeb ASM library. It is an open-source library that makes bytecode manipulation very easy, and it is fantastically fast: you can manipulate code on the fly without noticeably slowing down the application's load time.

ASM is very simple. It has a ClassReader class that reads .class files and, using the Visitor pattern, converts the bytes into a set of calls like "I see a method", "I see a field with such-and-such name in this class", and so on. When it sees a method, it uses a MethodVisitor to report the bytecode it sees there.

And on the other side there is ClassWriter, which, conversely, turns the class back into the array of bytes the Java machine needs. So, for example, tracking all array allocations with ASM is done quite primitively; you only need a couple of classes. You define your own adapter class which, when told that a method has been encountered, overrides that callback and returns its own MethodVisitor to see what happens inside the method.

And when, inside the method, it is told that there is an integer-operand instruction with the array allocation bytecode (NEWARRAY), at that moment it has the opportunity to insert its own bytecodes into the output stream, and that is all. You have traced all the places where arrays are allocated and changed the corresponding bytecode. Next: what do you do if you want to make these changes on the fly?
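A minimal sketch of such an adapter, assuming ASM 5+; the counter class "ArrayAllocCounter" and the helper "Rewriter" are hypothetical placeholders of my own, not from the talk:

```java
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// Inserts a call to ArrayAllocCounter.count() before every NEWARRAY (primitive array allocation).
class ArrayAllocAdapter extends ClassVisitor {
    ArrayAllocAdapter(ClassVisitor next) {
        super(Opcodes.ASM5, next);
    }

    @Override
    public MethodVisitor visitMethod(int access, String name, String desc,
                                     String signature, String[] exceptions) {
        MethodVisitor mv = super.visitMethod(access, name, desc, signature, exceptions);
        return new MethodVisitor(Opcodes.ASM5, mv) {
            @Override
            public void visitIntInsn(int opcode, int operand) {
                if (opcode == Opcodes.NEWARRAY) {
                    // Counter call goes into the output stream right before the allocation.
                    super.visitMethodInsn(Opcodes.INVOKESTATIC,
                            "ArrayAllocCounter", "count", "()V", false);
                }
                super.visitIntInsn(opcode, operand);
            }
        };
    }
}

class Rewriter {
    // Post-compilation use: feed original class bytes in, get instrumented bytes out.
    static byte[] rewrite(byte[] originalBytes) {
        ClassReader cr = new ClassReader(originalBytes);
        ClassWriter cw = new ClassWriter(cr, ClassWriter.COMPUTE_MAXS);
        cr.accept(new ArrayAllocAdapter(cw), 0);
        return cw.toByteArray();
    }
}
```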

If you have a set of compiled classes, it is easy: you run them through this tool once, and that is all.

If you need to do it on the fly, the Java machine has a wonderful option: javaagent. You make a special jar file whose manifest contains a "Premain-Class" attribute naming your class. That class has a "premain" method following a specific pattern, and thus, even before the main code with the main method starts, you get control along with a reference to the Instrumentation interface. This interface is wonderful: it lets you change classes on the fly inside the Java machine. It lets you register your own class file transformer, which the Java machine will invoke on every attempt to load a class.

And you can substitute classes. That is, you load not just the classes as they lie on disk, but use the same ObjectWeb ASM to analyze and change them, replacing them on the fly. Through the same interface you can also find out the size of an allocated object.
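A minimal sketch of such an agent, tying together the premain entry point, the class file transformer, and the ASM rewriter from the earlier sketch (Rewriter.rewrite is that same hypothetical helper); the agent class name is my own choice:

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Started with -javaagent:agent.jar; the jar manifest must contain
// "Premain-Class: ProfilingAgent".
public class ProfilingAgent {
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                // Leave JDK classes alone; instrument everything else.
                if (className == null || className.startsWith("java/")
                        || className.startsWith("sun/")) {
                    return null;                 // null means "keep the original bytes"
                }
                return Rewriter.rewrite(classfileBuffer);
            }
        });
        // The same Instrumentation reference can also report object sizes:
        // inst.getObjectSize(someObject)
    }
}
```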

A wonderful tool for this kind of profiling on the knee, when you have a specific task that needs to be solved. In conclusion I will say that to solve many profiling tasks it is not at all necessary to master some tool; it is enough to know the bytecode, the options of the Java machine, and the standard Java libraries, the same instrumentation facilities. This lets you solve a huge number of the specific problems you run into. Over ten years our company has developed several home-grown tools that solve problems specific to us, which off-the-shelf profilers do not solve. It starts with a utility, grown somewhat sprawling over the years but still simple, that analyzes thread dumps and produces statistics on them; it can hardly even be called a tool, just a few pages of classes that collect statistics and display them nicely. It is wildly useful, because we do not need to connect profilers to the production system; a thread dump is enough, and that is all...

And it ends with our own memory profiling tool, which again is small, it is hard even to call it a tool; it keeps track of where and what is allocated, and does so practically without slowing the program down. Commercial and open-source profilers can also track memory allocation, but they try to solve a more complex and universal problem: to find out where each allocation happens, with a full stack trace. That is slow and adds a lot of overhead; or they use sampling and do not capture every allocation, and therefore do not collect complete statistics, and so on.

They make compromises that, in your subject area, you may not need; you have your own tasks to solve when analyzing the performance of your systems.

Now I will answer questions (answers to questions start at 30:06 in the video).

You can also view other transcripts from Application Developer Days, download all videos from past conferences via torrents ( [2011] , [2010] ), or the entire folder on http .
Disclaimer again: I am not the author of the talk. I am the chairman of the program committee of this conference, the video editor, and the compiler of the transcript. The author did not have a Habr account; perhaps over time I will be able to invite him, for example, to comment on your questions. Update: the author of the talk, elizarov , has been summoned to the comments and answers questions. See also the discussion in the author's journal.

BTW, elizarov is speaking at the ADD-2012 conference with the talk "Writing the fastest hash for data caching."

Source: https://habr.com/ru/post/143468/

