
Search for odd performance reasons

Introduction


Recently I happened to be studying Java bytecode in detail, and almost immediately an interesting question came to mind. There is a NOP instruction that does nothing. So how does this "nothing" affect performance? The process of investigating exactly that is described in this post.

Disclaimer


This story is, first of all, not about how things really work, but about what mistakes to beware of when measuring performance.

Instruments


Let's start with the main thing: how all the measurements were carried out. The ASM library was used to generate the bytecode, and JMH was used to build the benchmark itself.

In order not to use reflection, a small interface was created:
public interface Getter { int get(); } 

Next, a class was generated that implements the get method:
  public get()I
    NOP
    ...
    NOP
    LDC 20
    IRETURN

An arbitrary number of NOPs can be inserted.

Full code generator
public class SimpleGetterClassLoader extends ClassLoader {
    private static final String GENERATED_CLASS_NAME = "other.GeneratedClass";
    private static final ClassLoader myClassLoader = new SimpleGetterClassLoader();

    @SuppressWarnings("unchecked")
    public static Getter newInstanceWithNOPs(int nopCount) throws Exception {
        Class<?> clazz = Class.forName(GENERATED_CLASS_NAME + "_" + nopCount, false, myClassLoader);
        return (Getter) clazz.newInstance();
    }

    @NotNull
    @Override
    protected Class<?> findClass(@NotNull String name) throws ClassNotFoundException {
        if (!name.startsWith(GENERATED_CLASS_NAME))
            throw new ClassNotFoundException(name);
        int nopCount = Integer.parseInt(name.substring(GENERATED_CLASS_NAME.length() + 1));
        ClassWriter cw = new ClassWriter(0);
        cw.visit(V1_5, ACC_PUBLIC, name.replace('.', '/'), null,
                getInternalName(Object.class), new String[]{getInternalName(Getter.class)});
        {
            MethodVisitor mv = cw.visitMethod(ACC_PUBLIC, "<init>", "()V", null, null);
            mv.visitCode();
            mv.visitVarInsn(ALOAD, 0);
            mv.visitMethodInsn(INVOKESPECIAL, getInternalName(Object.class), "<init>", "()V");
            mv.visitInsn(RETURN);
            mv.visitMaxs(1, 1);
            mv.visitEnd();
        }
        {
            MethodVisitor mv = cw.visitMethod(ACC_PUBLIC, "get", "()I", null, null);
            mv.visitCode();
            for (int i = 0; i < nopCount; i++) {
                mv.visitInsn(NOP);
            }
            mv.visitLdcInsn(20);
            mv.visitInsn(IRETURN);
            mv.visitMaxs(1, 1);
            mv.visitEnd();
        }
        cw.visitEnd();
        byte[] bytes = cw.toByteArray();
        return defineClass(name, bytes, 0, bytes.length);
    }
}


Benchmark
@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class Bench {
    private Getter nop_0;
    private Getter nop_10;
    ...

    @Setup
    public void setup() throws Exception {
        nop_0 = newInstanceWithNOPs(0);
        nop_10 = newInstanceWithNOPs(10);
        ...
    }

    @GenerateMicroBenchmark
    public int nop_0() {
        return nop_0.get();
    }

    @GenerateMicroBenchmark
    public int nop_10() {
        return nop_10.get();
    }
    ...



Search for truth


First, two benchmarks were run: one with no NOPs and one with 2000.

Benchmark         Mode   Samples     Mean  Mean error  Units
b.Bench.nop_0     thrpt        5  838,753      48,962  ops/us
b.Bench.nop_2000  thrpt        5  298,428       7,965  ops/us

And I immediately jumped to a very bold conclusion: "The dumb JIT does not throw the NOPs away, but translates them into machine NOPs."
Question for experts:
If this were true, would the measurements look like this, or would the result be something completely different?

But this was nevertheless only a hypothesis, and I really wanted to test it. First I made sure that these methods were actually JIT-compiled, then I looked at what they compiled into. Naturally, the resulting assembly was completely identical. And here I realized that I did not understand something: the executable code is the same, yet the performance differs by a factor of 2.5. Strange.

Then I really wanted to look at the shape of the dependence.
Benchmark          Mode   Samples     Mean  Mean error  Units
b.Bench.nop_0      thrpt        5  813,010      71,510  ops/us
b.Bench.nop_2000   thrpt        5  302,589      12,360  ops/us
b.Bench.nop_10000  thrpt        5    0,268       0,017  ops/us

Hidden knowledge
This measurement is truly gorgeous: three data points, each from a different regime.

It is worth noting here that for the new data point I did not check at all whether compilation took place or what the output looked like. I automatically assumed everything was the same as at 0/2k. That was a mistake.

Looking at it, I drew the following far-reaching conclusion: "The dependence is strongly non-linear." But, more importantly, at this point I began to suspect that the real cause was not the NOPs themselves, but the size of the method.

The next thought was that our methods are virtual, which means they live in a virtual method table. Maybe the table itself is sensitive to method size? To check, I simply moved the code into static methods, and, of course, nothing changed at all.
Question to experts 2
Was this thought complete nonsense, or was there something sensible in it?


Next, out of sheer confusion, it made sense to search for anything that depends on the size of a method. The answer was found in the OpenJDK sources:
  develop(intx, HugeMethodLimit, 8000,                      \
          "Don't compile methods larger than this if "      \
          "+DontCompileHugeMethods")

Interestingly, the limit falls exactly between 2k and 10k. Let's calculate the size of my method: "return 20" takes 3 bytes (a 2-byte LDC plus a 1-byte IRETURN), which leaves room for 7997 NOPs.
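The size arithmetic above can be sketched in a few lines (the class and constant names here are mine, for illustration only):

```java
// Bytecode size of the generated get() method: each NOP is 1 byte,
// and "return 20" compiles to LDC (2 bytes: opcode + constant-pool
// index) followed by IRETURN (1 byte).
public class MethodSize {
    static final int LDC_BYTES = 2;
    static final int IRETURN_BYTES = 1;

    static int methodSize(int nopCount) {
        return nopCount + LDC_BYTES + IRETURN_BYTES;
    }
}
```

With 7997 NOPs the method is exactly 8000 bytes, i.e. right at the HugeMethodLimit boundary.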
Benchmark          Mode   Samples     Mean  Mean error  Units
b.Bench.nop_0      thrpt        5  797,376      12,998  ops/us
b.Bench.nop_2000   thrpt        5  306,795       0,243  ops/us
b.Bench.nop_7997   thrpt        5  303,314       7,161  ops/us
b.Bench.nop_7998   thrpt        5    0,335       0,001  ops/us
b.Bench.nop_10000  thrpt        5    0,269       0,000  ops/us

With this boundary the guess is confirmed. It remains to understand what happens below 8000 bytes. Let's add more points:
Benchmark          Mode   Samples     Mean  Mean error  Units
b.Bench.nop_0      thrpt        5  853,499      61,847  ops/us
b.Bench.nop_10     thrpt        5  845,861     112,504  ops/us
b.Bench.nop_100    thrpt        5  867,068      20,681  ops/us
b.Bench.nop_500    thrpt        5  304,116       1,665  ops/us
b.Bench.nop_1000   thrpt        5  299,295       8,745  ops/us
b.Bench.nop_2000   thrpt        5  306,495       0,578  ops/us
b.Bench.nop_7997   thrpt        5  301,322       7,992  ops/us
b.Bench.nop_7998   thrpt        5    0,335       0,005  ops/us
b.Bench.nop_10000  thrpt        5    0,269       0,004  ops/us
b.Bench.nop_25000  thrpt        5    0,105       0,007  ops/us
b.Bench.nop_50000  thrpt        5    0,053       0,001  ops/us

The first pleasing thing here is that once the JIT drops out, a linear dependence is clearly visible. This exactly matches our expectation, since in the interpreter each NOP must be processed explicitly.
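The linearity is easy to check from the interpreted-mode points in the table: the per-call cost (the inverse of throughput) should grow by a constant amount per NOP. A small sketch with hypothetical names, using the measured numbers:

```java
// Sanity check of the linearity claim: in interpreted mode the
// per-call cost in microseconds (1 / throughput in ops/us) should
// be linear in the NOP count, so slope estimates taken between
// different pairs of points should agree.
public class LinearityCheck {
    static double usPerCall(double opsPerUs) {
        return 1.0 / opsPerUs;
    }

    // estimated per-NOP cost (us) between two measured points
    static double slope(int nops1, double t1, int nops2, double t2) {
        return (usPerCall(t2) - usPerCall(t1)) / (nops2 - nops1);
    }

    public static void main(String[] args) {
        // interpreted-mode data points from the table above
        double s1 = slope(10000, 0.269, 25000, 0.105);
        double s2 = slope(25000, 0.105, 50000, 0.053);
        // both estimates land around 0.4 ns per NOP
        System.out.println(s1 + " " + s2);
    }
}
```

The two slope estimates agree to within a few percent, which is about as linear as a five-sample benchmark gets.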

The next thing that catches the eye is a strong feeling that below 8k there is not some smooth dependency at all, but simply two constants. Another five minutes of manual binary search, and the boundary is found.
Benchmark          Mode   Samples     Mean  Mean error  Units
b.Bench.nop_0      thrpt        5  805,466      10,074  ops/us
b.Bench.nop_10     thrpt        5  862,027       4,756  ops/us
b.Bench.nop_100    thrpt        5  861,462       9,881  ops/us
b.Bench.nop_322    thrpt        5  863,176      22,385  ops/us
b.Bench.nop_323    thrpt        5  303,677       5,130  ops/us
b.Bench.nop_500    thrpt        5  299,368      11,143  ops/us
b.Bench.nop_1000   thrpt        5  302,884       3,373  ops/us
b.Bench.nop_2000   thrpt        5  306,682       3,598  ops/us
b.Bench.nop_7997   thrpt        5  301,457       4,209  ops/us
b.Bench.nop_7998   thrpt        5    0,337       0,001  ops/us
b.Bench.nop_10000  thrpt        5    0,268       0,004  ops/us
b.Bench.nop_25000  thrpt        5    0,107       0,002  ops/us
b.Bench.nop_50000  thrpt        5    0,053       0,000  ops/us

Almost done; it remains to understand what sits at this boundary. Calculate: 3 + 322 == 325. A search for this magic 325 turns up a certain flag, -XX:FreqInlineSize
FreqInlineSize is 325 on modern 64bit Linux

and its description in the docs:
Integer specifying the maximum number of bytecode instructions in a frequently executed method which gets inlined.
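Unlike HugeMethodLimit, which is a develop-only flag, FreqInlineSize is a product flag and can be read at runtime. A minimal sketch, assuming a HotSpot-based JDK where com.sun.management.HotSpotDiagnosticMXBean is available (the class name InlineSizeProbe is mine):

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

// Reads the effective FreqInlineSize of the running HotSpot JVM.
// The default value is platform-dependent (325 on the author's
// 64-bit Linux), so we only print it rather than assume a number.
public class InlineSizeProbe {
    static int freqInlineSize() {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return Integer.parseInt(bean.getVMOption("FreqInlineSize").getValue());
    }

    public static void main(String[] args) {
        System.out.println("FreqInlineSize = " + freqInlineSize());
    }
}
```

The same value can also be seen with `java -XX:+PrintFlagsFinal -version`.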


Hooray! Finally, everything came together. So we found the dependence of performance on method size (all else being equal, of course):
1. JIT + inlining
2. JIT without inlining
3. plain interpretation
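The three regimes can be summarized as a toy classifier over bytecode size, using the two thresholds observed in this experiment (both are HotSpot defaults that can vary between JVM builds and platforms; the class name is mine):

```java
// Toy model of the three regimes found above, keyed on bytecode size:
//   <= 325 bytes  -> JIT-compiled and inlined (FreqInlineSize)
//   <= 8000 bytes -> JIT-compiled, not inlined (HugeMethodLimit)
//   larger        -> interpreted only
// The thresholds are the defaults measured in this post, not constants
// guaranteed by the JVM specification.
public class RegimeModel {
    static String regime(int methodBytes) {
        if (methodBytes <= 325) return "JIT + inline";
        if (methodBytes <= 8000) return "JIT";
        return "interpreter";
    }
}
```

For the generated getter (3 bytes of payload), this predicts the fast/slow switches at exactly 322 and 7997 NOPs, matching the tables.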

Conclusion


As I said at the beginning, the main thing to pay attention to is not the actual behavior. It turned out to be rather trivial, and I am sure it is described in the docs (I have not read them, so I don't know). My main message is that it is very important to trust common sense, and if the measurement results differ from it even slightly, or simply seem incomprehensible, you should definitely check and re-check everything.

I hope someone found this post interesting.

PS


I always counted both 8000 and 325 in bytes, inclusive. It seems they should actually have been counted in instructions, exclusive.

Question for experts 3
Why exactly 325 and 8000? Are these numbers random, or is there something behind them?

Source: https://habr.com/ru/post/215355/

