From the translator:
The argument between those who insist on writing final everywhere and their opponents is akin to the dispute between the Little-Endians and the Big-Endians. As in some other communities, this sluggish dispute had been dragging on for years in our company, and only this article by Richard Warburton allowed our Little-Endians to finally prevail.

What are we talking about?
Let's start with a short story. A few weeks ago I sent a suggestion to the Java core libs mailing list to remove the final modifier from some methods. This sparked several threads of discussion. One of them, for example, was about finding out to what extent the performance of a call to a method that had been final would deteriorate if the final were removed.
I had some thoughts on whether a performance regression would occur or not, but I first tried to find out whether anyone had already published benchmark results on this question. Unfortunately, I could not find anything. This does not mean that such results do not exist or that other people have not investigated the situation, but I did not come across any peer-reviewed code. So it was time to write a few benchmarks.
Benchmarking Methodology
So, I decided to use the wonderful JMH framework for benchmarking. If you are not convinced that a framework will help you get accurate benchmark results, you should watch this talk by Aleksey Shipilev (@TheShade), the author of the framework, or Nitsan Wakart's really cool blog post, which explains how it helps.
In my case, I wanted to understand what affects the performance of a method call, so I decided to try different variations of calling methods and measure their costs. By having a suite of benchmarks and varying only one factor at a time, we can tease apart how different factors or combinations of factors affect the cost of a method call.
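To make the methodology concrete, here is a minimal sketch of what such a JMH benchmark can look like (the class and method names are mine, for illustration, and not taken from the article's benchmark suite):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class CallCostSketch {

    // The receiver is kept in a field so the JIT sees a realistic
    // object rather than a compile-time constant.
    private final Target target = new Target();

    @Benchmark
    public int virtualCall() {
        // Returning the value lets JMH consume it, which stops the JIT
        // from eliminating the call as dead code.
        return target.work();
    }

    static class Target {
        int work() {
            return 42;
        }
    }
}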
Inlining

At the same time, the most and the least obvious influencing factor is whether there is a method call at all! The actual method call can be optimized away by the compiler entirely. Broadly speaking, there are two ways to reduce the cost of a call. One is to inline the method itself; the other is to use an inline cache. Don't worry - these are fairly simple concepts, but there is a bit of terminology to get through. Imagine that we have a class named Foo, which defines a method called bar.
class Foo { void bar() { ... } }
We can call the bar method by writing code that looks like this:
Foo foo = new Foo(); foo.bar();
The main thing here is the place where bar is actually called - foo.bar () - it is called
callsite . When we say that the method was inline (built-in), it means that the body of the method is taken and inserted into callsite, instead of calling the method. For programs that consist of many small methods (I think that writing programs more correctly), embedding can lead to significant acceleration. This is because the program does not spend most of its time on method calls instead of doing work! We can control whether a method is embedded or not in JMH using the CompilerControl annotation. We will return to the concept of inline cache a little later.
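As a rough illustration (the method names are mine, modeled on the standard JMH CompilerControl sample rather than the article's actual code), forbidding inlining of a particular callee looks like this:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class InliningControlSketch {

    private int value = 1;

    @Benchmark
    public int callNotInlinable() {
        // This callsite is never inlined, so we measure the cost of a
        // genuine method call.
        return targetNoInline();
    }

    @Benchmark
    public int callInlinable() {
        // Here the JIT is free to inline target() according to its heuristics.
        return target();
    }

    // Hotspot is told not to inline this method at its callsites.
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    private int targetNoInline() {
        return value;
    }

    private int target() {
        return value;
    }
}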
Hierarchy depth and method overrides

If we decide to remove the final keyword from a method, it becomes possible to override it. This is another factor that needs to be taken into account. So I took methods and called them at different levels of a class hierarchy, and also had methods overridden at different levels of the hierarchy. This allowed me to determine how deep class hierarchies interact with method overriding.
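A sketch of the kind of shape being benchmarked might look like this (class and method names are illustrative, not the ones used in the real benchmark):

class HierarchySketch {

    // Each level declares a method of its own; overriddenLower() is declared
    // at the top of the hierarchy and overridden three levels down.
    static class Level1 {
        int declaredAtLevel1() { return 1; }
        int overriddenLower()  { return 1; }
    }

    static class Level2 extends Level1 {
        int declaredAtLevel2() { return 2; }
    }

    static class Level3 extends Level2 {
        int declaredAtLevel3() { return 3; }

        @Override
        int overriddenLower()  { return 3; }
    }

    static class Level4 extends Level3 {
        int declaredAtLevel4() { return 4; }
    }

    public static void main(String[] args) {
        Level4 subject = new Level4();
        // The receiver is always a Level4, but the invoked methods were
        // declared (or overridden) at different depths of the hierarchy.
        int sum = subject.declaredAtLevel1()
                + subject.declaredAtLevel2()
                + subject.declaredAtLevel3()
                + subject.declaredAtLevel4()
                + subject.overriddenLower();
        System.out.println(sum);
    }
}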
Polymorphism

When I mentioned the idea of a callsite earlier, I quietly sidestepped a rather important issue. Since a non-final method can be overridden in a subclass, our callsites can end up invoking different methods. So perhaps I'm dealing with a Foo, or with its subclass Baz, which also implements bar(). How does the compiler know which method to call? By default all methods in Java are virtual (overridable), and for each call the correct method has to be looked up in the so-called virtual method table (vtable). This is rather slow, so optimizing compilers always try to reduce the cost of these lookups. The approach mentioned earlier - inlining - works great if the compiler can prove that a given callsite can only invoke one specific implementation of the method. Such a callsite is called monomorphic.
Unfortunately, for the most part, doing the analysis to prove that a callsite is strictly monomorphic is impractical. Instead, JIT compilers typically take an alternative approach: they profile which types actually occur at a callsite and assume that if the callsite was monomorphic for its first N calls, it is worth speculatively optimizing on the assumption that it will always be monomorphic. This speculative optimization is usually correct, but not always, so the compiler must insert a guard before the method call that checks the type of the object on which the method is invoked.
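Conceptually - and only conceptually, since the real guard is emitted by the JIT as machine code rather than written in Java - the speculatively optimized call behaves something like this sketch:

class GuardedCallSketch {

    static class Foo {
        void bar() { /* the real virtual method */ }
    }

    static void callBar(Foo foo) {
        if (foo.getClass() == Foo.class) {
            // Guard passed: execute the inlined copy of Foo.bar()'s body.
            inlinedBodyOfFooBar(foo);
        } else {
            // Guard failed: deoptimize and fall back to the generic
            // vtable-based virtual dispatch.
            foo.bar();
        }
    }

    // Stands in for the body of Foo.bar() that the JIT copied into the callsite.
    private static void inlinedBodyOfFooBar(Foo foo) {
        // ... the same work as Foo.bar() ...
    }
}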
However, monomorphic callsites are not the only case we want to optimize. Many callsites are so-called bimorphic - there are two methods that can be invoked. It is still possible to inline bimorphic callsites by using guard code to check which implementation should be called and then branching to it. This is still cheaper than a full method call. It is also possible to optimize this case using an inline cache. An inline cache does not actually inline the method body into the callsite; instead it has a specialized jump table that acts like a cache over a full virtual method table. The Hotspot JIT compiler supports bimorphic inline caches, and any callsite with 3 or more possible implementations is considered megamorphic.
Thus, there are three categories of calls to benchmark and investigate: the monomorphic, bimorphic and megamorphic cases, as sketched below.
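One way to picture the three cases (with my own illustrative naming, not the article's PolymorphicBenchmark code): the degree of "morphism" is simply how many distinct implementations a single callsite actually sees at runtime.

class MorphismSketch {

    static class Shape    { int sides() { return 0; } }
    static class Triangle extends Shape { @Override int sides() { return 3; } }
    static class Square   extends Shape { @Override int sides() { return 4; } }
    static class Pentagon extends Shape { @Override int sides() { return 5; } }

    // The callsite under study is shape.sides(); how expensive it is depends
    // on how many concrete types have been observed at it.
    static int exercise(Shape[] shapes) {
        int sum = 0;
        for (Shape shape : shapes) {
            sum += shape.sides();
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] monomorphic = { new Triangle(), new Triangle() };               // 1 implementation seen
        Shape[] bimorphic   = { new Triangle(), new Square() };                 // 2 implementations seen
        Shape[] megamorphic = { new Triangle(), new Square(), new Pentagon() }; // 3+ implementations seen
        System.out.println(exercise(monomorphic) + " " + exercise(bimorphic) + " " + exercise(megamorphic));
    }
}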
Results
To group the results so that we can see the wood for the trees, I will present the raw figures along with a bit of analysis. The specific numbers/costs are not really that interesting in themselves; what matters is the ratios between the different types of method call, and the fact that the statistical errors are low. There is a fairly significant difference - 6.26x - between the fastest and the slowest. In reality the difference is probably larger, because of the overhead associated with measuring the time of an empty method.
The source code for these benchmarks is available on GitHub. To avoid confusion, I have presented the results in separate blocks. The polymorphic benchmarks were done using PolymorphicBenchmark, and the rest using JavaFinalBenchmark.
Simple callsites
Benchmark                                          Mode  Samples   Mean  Mean error  Units
c.i.j.JavaFinalBenchmark.finalInvoke               avgt       25  2.606       0.007  ns/op
c.i.j.JavaFinalBenchmark.virtualInvoke             avgt       25  2.598       0.008  ns/op
c.i.j.JavaFinalBenchmark.alwaysOverriddenMethod    avgt       25  2.609       0.006  ns/op
Our first set of results shows the cost of calling a virtual method, a final method, and a method that sits deep in a hierarchy and is overridden. Note that in all of these cases we forced the compiler not to inline the methods. As you can see, the differences between the timings are minimal, and our mean errors show that they carry little significance. So we can conclude that simply adding the final keyword does not significantly improve call performance. Whether the method is overridden also appears to make little difference.
Inlining simple callsites
Benchmark                                                   Mode  Samples   Mean  Mean error  Units
c.i.j.JavaFinalBenchmark.inlinableFinalInvoke               avgt       25  0.782       0.003  ns/op
c.i.j.JavaFinalBenchmark.inlinableVirtualInvoke             avgt       25  0.780       0.002  ns/op
c.i.j.JavaFinalBenchmark.inlinableAlwaysOverriddenMethod    avgt       25  1.393       0.060  ns/op
Now let's take the same three cases and remove the restriction on inlining. Again, the final and virtual method calls end up taking the same time. They are about 4 times faster than in the case where inlining was prohibited, which I would attribute to the inlining itself. The always-overridden method call ends up in between the two. I suspect that this is because the method has several possible subclass implementations and therefore the compiler has to insert a type check. The mechanics of this are explained in more detail above, in the "Polymorphism" section.
Impact of class hierarchy
Benchmark                                          Mode  Samples   Mean  Mean error  Units
c.i.j.JavaFinalBenchmark.parentMethod1             avgt       25  2.600       0.008  ns/op
c.i.j.JavaFinalBenchmark.parentMethod2             avgt       25  2.596       0.007  ns/op
c.i.j.JavaFinalBenchmark.parentMethod3             avgt       25  2.598       0.006  ns/op
c.i.j.JavaFinalBenchmark.parentMethod4             avgt       25  2.601       0.006  ns/op
c.i.j.JavaFinalBenchmark.inlinableParentMethod1    avgt       25  1.373       0.006  ns/op
c.i.j.JavaFinalBenchmark.inlinableParentMethod2    avgt       25  1.368       0.004  ns/op
c.i.j.JavaFinalBenchmark.inlinableParentMethod3    avgt       25  1.371       0.004  ns/op
c.i.j.JavaFinalBenchmark.inlinableParentMethod4    avgt       25  1.371       0.005  ns/op
Wow, that's a big block of methods! Each of the numbered method calls (1-4) indicates how deep in the class hierarchy the method was declared. So parentMethod4 means we call a method declared in the 4th parent class. Looking at the numbers, the difference between 1 and 4 is very small, so we can conclude that hierarchy depth does not matter. The inlinable cases all follow the same pattern: hierarchy depth is irrelevant. Our inlinable methods are comparable to inlinableAlwaysOverriddenMethod, but slower than inlinableVirtualInvoke. I would again attribute this to the type check: the JIT compiler can profile the methods and find that only one implementation has ever been invoked, but it cannot prove that this will always be the case.
Impact of class hierarchy on final methods
Benchmark                                               Mode  Samples   Mean  Mean error  Units
c.i.j.JavaFinalBenchmark.parentFinalMethod1             avgt       25  2.598       0.007  ns/op
c.i.j.JavaFinalBenchmark.parentFinalMethod2             avgt       25  2.596       0.007  ns/op
c.i.j.JavaFinalBenchmark.parentFinalMethod3             avgt       25  2.640       0.135  ns/op
c.i.j.JavaFinalBenchmark.parentFinalMethod4             avgt       25  2.601       0.009  ns/op
c.i.j.JavaFinalBenchmark.inlinableParentFinalMethod1    avgt       25  1.373       0.004  ns/op
c.i.j.JavaFinalBenchmark.inlinableParentFinalMethod2    avgt       25  1.375       0.016  ns/op
c.i.j.JavaFinalBenchmark.inlinableParentFinalMethod3    avgt       25  1.369       0.005  ns/op
c.i.j.JavaFinalBenchmark.inlinableParentFinalMethod4    avgt       25  1.371       0.003  ns/op
Here we see the same pattern as above - the final keyword does not seem to make any difference. I thought it might theoretically be provable here that inlinableParentFinalMethod4 can be inlined with the type check eliminated, but that does not appear to be the case.
Polymorphism
Monomorphic:            2.816 +- 0.056 ns/op
Bimorphic:              3.258 +- 0.195 ns/op
Megamorphic:            4.896 +- 0.017 ns/op
Inlinable Monomorphic:  1.555 +- 0.007 ns/op
Inlinable Bimorphic:    1.555 +- 0.004 ns/op
Inlinable Megamorphic:  4.278 +- 0.013 ns/op
Finally, we come to the case of polymorphic calls. Monomorphic call costs are, roughly speaking, the same as those of our regular virtual calls above. Since lookups have to be done in larger virtual method tables, the calls become slower in the bimorphic and megamorphic cases. Once we allow inlining, the profiling kicks in and our monomorphic and bimorphic callsites come down to the cost of our "inlined with guard" calls: very similar to the class hierarchy cases, only slightly slower. The megamorphic case is still very slow. Remember that here we have not told Hotspot to prevent inlining; it simply does not implement a polymorphic inline cache for callsites more complex than bimorphic.
What can we learn from this?
I think it is worth noting that there are plenty of people who do not have a mental performance model that accounts for different types of method calls taking different amounts of time, and plenty of people who understand that they take different amounts of time but do not have it quite right. I know, because I was one of them and made wrong assumptions. So I hope this investigation has been useful to you. Here is a brief summary of the claims I have defended in this article:
- There is a big difference between the fastest and the slowest type of method call.
- In practice, adding or removing the final keyword does not really affect performance, but if you later refactor your hierarchy, things may begin to slow down.
- Deeper class hierarchies have no real effect on call performance.
- Monomorphic calls are faster than bimorphic ones.
- Bimorphic calls are faster than megamorphic ones.
- The type check we pay for in the profiled case, when monomorphism cannot be proven, slows the call down slightly compared to a truly monomorphic call.
I would say that the cost of the type check is my personal "big revelation". It is something I see rarely discussed and often neglected.
Caveats and further work
Of course, this is not the only possible approach to this issue!
- This article focuses on the type-related factors affecting the performance of method calls. One factor I did not mention is the heuristics around inlining that depend on method size and call-stack depth. If your method is too large, it will not be inlined at all, and you still end up paying the full cost of the method call. One more reason to write small, easy-to-read methods. (The diagnostic flags sketched after this list let you see these inlining decisions.)
- I did not consider how a call through an interface affects any of these situations. If you find this interesting, then there is an interface performance study on the Mechanical Sympathy blog.
- One factor we completely ignored is the effect of inlining on other compiler optimizations. When compilers perform optimizations that analyze only one method at a time (intra-procedural optimization), they really want as much information as possible in order to optimize effectively. Restricting inlining can significantly reduce the context that other optimizations have to work with.
- It would be useful to tie these explanations down to the assembly level, to dig into the issue in more detail (the perfasm snippet after this list is one way to start).
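For anyone who wants to poke at these last two points themselves, HotSpot and JMH already offer some help. The flags below are real HotSpot/JMH options, though the jar name is illustrative and the numeric values shown are only typical defaults that vary between JVM versions:

# Show which methods HotSpot inlines (and why it rejects others)
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -jar benchmarks.jar

# The size/depth knobs consulted by the inlining heuristics (sizes are in bytecodes)
java -XX:MaxInlineSize=35 -XX:FreqInlineSize=325 -XX:MaxInlineLevel=9 -jar benchmarks.jar

# JMH's perfasm profiler annotates the hot generated assembly (needs the hsdis library)
java -jar benchmarks.jar -prof perfasm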
Perhaps these are topics for future posts.
Thanks:
- Aleksey Shipilev for feedback on the benchmarks,
- Martin Thompson, Martijn Verburg, Sadiq Jaffer and Chris West for very useful reviews and comments on my blog.