In a recent article on Java performance, a discussion about performance measurement broke out. Reading it, it is sad to realize that many people still do not understand how difficult it is to measure the execution time of a particular piece of code correctly. Moreover, people are not used to the idea that the same code can take significantly different amounts of time to run under different conditions. For example, here is one of the opinions:
If I need to find out which language is faster for me on my task, I will run the most primitive benchmark in the world. If the difference is significant (say, an order of magnitude), then most likely everything will be about the same on the user's machine.
Unfortunately, the most primitive benchmark in the world is usually an incorrectly written benchmark. And one should not hope that a wrong benchmark will get the result right even to within an order of magnitude. It may measure something else entirely, something that has nothing in common with the actual performance of a program containing similar code. Let's look at an example.
Suppose we have seen the Java 8 Stream API and want to check how fast it handles simple math. For simplicity, let's take a stream of integers from 0 to 99,999 and square each one using the map operation, and nothing else; we just want to measure performance, right? But even a quick look at the API is enough to see that streams are lazy, and IntStream.range(0, 100_000).map(x -> x * x) by itself does nothing at all. Therefore, we add a terminal operation, forEach, which somehow uses our result; for example, it increments it by one. As a result, we get this test:
static void test() {
    IntStream.range(0, 100_000).map(x -> x * x).forEach(x -> x++);
}
Fine. How do we measure how long it takes? Everyone knows: take the time at the start, take the time at the end, and compute the difference. Let's add a method that measures the time and returns the result in nanoseconds:
static long measure() {
    long start = System.nanoTime();
    test();
    long end = System.nanoTime();
    return end - start;
}
Well, now let's just print the result. On my not-the-fastest Core i7 with 64-bit OpenJDK 8u91, different runs give me a number in the region of 50 to 65 million nanoseconds, that is, 50-65 milliseconds. One hundred thousand squarings in 50 milliseconds? That is monstrous: only two million per second. Twenty-five years ago computers could already square numbers faster than that. Java is shamelessly slow! Or is it?
In fact, the first use of lambdas and the Stream API in an application will always add a delay of 50-70 ms on modern computers, because a great deal of work has to happen during that first call: the Stream API classes have to be loaded and verified, the invokedynamic call sites behind the lambdas have to be bootstrapped (lambda classes are generated at run time), and all of this code initially runs in the interpreter before the JIT compiler gets to it.
All of this takes considerable time, and it is actually surprising that it fits within 50 ms. But all of it is needed exactly once.
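This one-time cost is easy to observe directly: time the very first call to the stream pipeline and then a later call in the same JVM. Here is a minimal sketch using the same test() and measure() methods as above; the exact numbers will vary wildly between machines and JVM versions:

```java
import java.util.stream.IntStream;

public class FirstUseCost {
    // Same pipeline as in the article's test() method.
    static void test() {
        IntStream.range(0, 100_000).map(x -> x * x).forEach(x -> x++);
    }

    // Wall-clock duration of one test() call, in nanoseconds.
    static long measure() {
        long start = System.nanoTime();
        test();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // The first call pays for class loading, lambda bootstrap and interpretation.
        long first = measure();
        long later = 0;
        for (int i = 0; i < 1000; i++) {
            later = measure(); // keep only the last measurement
        }
        System.out.println("first call: " + first + " ns, a later call: " + later + " ns");
    }
}
```

On a typical desktop JVM the first call is several orders of magnitude slower than the later ones.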
Lyrical digression
In general, in the presence of dynamic loading and caching of anything, it becomes very hard to understand what exactly we have measured. This applies not only to Java. A simple library call may trigger loading and initializing a shared library from the hard disk (and imagine that the disk has also gone into sleep mode). As a result, the call can take far longer than usual. Should we care? Sometimes yes. For example, back in the Windows 95 days, loading the OLE32.DLL shared library took considerable time, and the first program that tried to load OLE32 would have been blamed for the slowdown. This forced developers to postpone loading OLE32 as long as possible, so that some other program would take the blame. In some places other libraries even implemented functions duplicating parts of OLE32, just to avoid loading it. Raymond Chen tells this story in more detail.
So, we have figured out that our benchmark is super slow because it includes a lot of work that needs to be done exactly once after startup. If our program intends to run for longer than a second, we most likely do not care about that. So let's "warm up the JVM": we will run this measurement 100,000 times and print the result of the last one:
for (int i = 100_000; i >= 0; i--) {
    long res = measure();
    if (i == 0)
        System.out.println(res);
}
This program finishes in under a second and prints 70-90 nanoseconds on my machine. That's great! So one squaring takes 0.7-0.9 picoseconds? Java squares numbers more than a trillion times per second? Java is super fast! Or is it?
Already on the second iteration, much of the list above has been done and the process speeds up a hundredfold. After that, the JIT compiler gradually compiles different pieces of code (there is a lot of it inside the Stream API), collecting execution profiles and optimizing more and more aggressively. In the end, the JIT is smart enough to inline the entire chain of lambdas and notice that the result of the multiplication is never used. Our naive attempt to "use" it via an increment does not fool the JIT compiler: that operation still has no side effects. The JIT compiler did not quite manage to eliminate the whole stream, but it did manage to remove the inner loop, effectively making the test's running time independent of the number of iterations: replace IntStream.range(0, 100_000) with IntStream.range(0, 1_000_000) and the result will be the same.
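One way to see that dead-code elimination, not arithmetic, dominates such a test is to make the pipeline produce a value the program actually consumes, for example by replacing forEach with a reduction. A sketch of that idea (note the mapToLong: with a plain map, x * x would overflow int for x above 46340):

```java
import java.util.stream.IntStream;

public class SumOfSquares {
    // Sum of k^2 for k in [0, n): a result the JIT cannot discard,
    // because the caller actually uses it.
    static long sumOfSquares(int n) {
        return IntStream.range(0, n)
                        .mapToLong(x -> (long) x * x) // widen before multiplying to avoid int overflow
                        .sum();
    }

    public static void main(String[] args) {
        // Closed form check: sum = (n-1) * n * (2n-1) / 6.
        System.out.println(sumOfSquares(100_000));
    }
}
```

A pipeline like this still won't give you one definitive "cost of squaring", but at least it forces the multiplication to actually happen on every element.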
By the way, at times this short, the cost and granularity of nanoTime() itself become significant. Even on the same hardware, different operating systems can give noticeably different answers. Alexei Shipilev covers this in detail.
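The granularity of nanoTime() is easy to probe: call it in a loop until the returned value changes. A minimal sketch; note that this measures the call latency and the clock granularity together, not granularity alone, and the result depends heavily on the OS and hardware:

```java
public class NanoTimeGranularity {
    // Smallest observable difference between two distinct consecutive
    // nanoTime() readings (latency + granularity combined).
    static long observedStep() {
        long t0 = System.nanoTime();
        long t1;
        do {
            t1 = System.nanoTime();
        } while (t1 == t0);
        return t1 - t0;
    }

    public static void main(String[] args) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < 1000; i++) {
            min = Math.min(min, observedStep());
        }
        System.out.println("observed nanoTime step: " + min + " ns");
    }
}
```

If the step you observe is comparable to the interval you are trying to measure (70-90 ns in our warmed-up test), single measurements of that interval are mostly noise.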
So, we have written "the most primitive benchmark in the world". At first it turned out to be super slow, and after a small refinement it became super fast, almost a million times faster. We wanted to measure how quickly squaring is done via the Stream API, but in the first test this mathematical operation drowned in a sea of other work, and in the second test it was simply not performed at all. Beware of hasty conclusions.
So where is the truth? The truth is that this test has nothing to do with reality. It produces no visible effects in your program; that is, it effectively does nothing. In real life you rarely write code that does nothing, and such code certainly does not earn you money (although there are exceptions). Trying to answer how long the squaring inside the Stream API "really" takes is not very meaningful: it is a very simple operation, and depending on the surrounding code the JIT compiler can compile a loop with a multiplication in very different ways. Remember that performance is not additive: if A runs in x seconds and B runs in y seconds, it does not at all follow that running A and B together will take x + y seconds. It may be nothing of the sort.
If you want a simple answer: in real programs the truth will lie somewhere in the middle. The overhead of a stream of 100,000 integers being squared will be roughly a thousand times larger than the super-fast result and roughly a thousand times smaller than the super-slow one. But depending on many factors it can be worse. Or better.
At last year's Joker conference I examined a slightly more interesting example of measuring Stream API performance and dug deeper into what was going on there. And the obligatory reference to JMH: it will help you avoid the simplest pitfalls when measuring the performance of JVM languages. Although, of course, even JMH will not solve all your problems magically: you still have to think.
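JMH fights dead-code elimination with its Blackhole class, which "consumes" values so the JIT must keep computing them. Outside JMH, the same idea can be roughly approximated with a volatile sink field; this is a crude stand-in for illustration, not what JMH actually does internally:

```java
import java.util.stream.IntStream;

public class VolatileSink {
    // A volatile write is a side effect the JIT must preserve, so the
    // squaring that feeds it cannot be eliminated. A crude analog of
    // JMH's Blackhole.consume(); not a substitute for using JMH.
    static volatile int sink;

    static void test() {
        IntStream.range(0, 100_000).map(x -> x * x).forEach(x -> sink = x);
    }

    public static void main(String[] args) {
        test();
        // The squared values wrap around on int overflow for x > 46340,
        // but every one of them is actually computed and stored.
        System.out.println("last consumed value: " + sink);
    }
}
```

Unlike the x++ version, this benchmark at least guarantees the multiplications happen; it still does not control for warm-up, OS jitter, or any of the other effects JMH was built to handle.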
Source: https://habr.com/ru/post/307268/