How hard can it be to measure the performance of a simple function, like this one?
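Based on the objdump listing shown later (a vectorized loop over 32 ints that adds 1 to each element), the function under test plausibly looks like this; treat it as my reconstruction rather than the article's exact listing:

```cpp
// A minimal sketch consistent with the objdump listing shown later:
// 32 ints, each incremented by one; Clang vectorizes this with AVX2.
void benchmark_func(int* a) {
  for (int i = 0; i < 32; ++i)
    a[i] += 1;
}
```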
Well, let's just wrap it in some micro-benchmark, call it many, many times (to average out the noise), and see what happens, right? Okay, we can also look at the generated instructions to make sure the compiler didn't "optimize" anything away, and run a few different tests to confirm that the loop is the bottleneck. That's it, then. We understand what we are measuring, right?
Now let's imagine that there is another function in the same file whose speed you also measure, but in separate tests.
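That is, the file looks something like this; the body of foo() is my assumption, since the article does not show its contents:

```cpp
// The body of foo() is a placeholder assumed for illustration; the
// article only states that it lives in the same file and is never
// called from the test of benchmark_func().
void foo(int* a) {
  for (int i = 0; i < 32; ++i)
    a[i] -= 1;
}

void benchmark_func(int* a) {
  for (int i = 0; i < 32; ++i)
    a[i] += 1;
}
```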
Then one day your manager comes to you with a complaint from a user of your library: it does not run as fast as you promised. But wait, we measured the performance carefully and promised exactly what the test results gave us. What went wrong?
The user says that they were only interested in benchmark_func(), so they ran the performance tests for that function alone.
Numbers
I compiled this code with the latest Clang with the following options:
-O2 -march=skylake -fno-unroll-loops
I ran this code on an Intel Core i7-6700 Skylake processor. All the code, together with the build scripts, can be downloaded here. You will also need the Google Benchmark library.
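For reference, a harness along these lines could produce the numbers below; this is my sketch assuming the Google Benchmark API, not the article's actual test script (the name func_bench is hypothetical, echoing the func_bench_median row in the results):

```cpp
#include <benchmark/benchmark.h>

void benchmark_func(int* a);  // the function under test, defined elsewhere

// My sketch of a possible harness, not the article's actual script.
static void func_bench(benchmark::State& state) {
  int a[32] = {0};
  for (auto _ : state) {
    benchmark_func(a);
    benchmark::ClobberMemory();  // keep the stores from being optimized out
  }
}
BENCHMARK(func_bench);
BENCHMARK_MAIN();
```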
Let's call the version of the code with both functions "baseline", and the variant with only benchmark_func() "no_foo". Here are the results:
```
$ ./baseline.sh
----------------------------------------------------------------------
Benchmark          CPU    Iterations   Throughput    Clockticks/iter
----------------------------------------------------------------------
func_bench_median  4 ns   191481954    32.5626GB/s   74.73

$ ./no_foo.sh
----------------------------------------------------------------------
Benchmark          CPU    Iterations   Throughput    Clockticks/iter
----------------------------------------------------------------------
func_bench_median  4 ns   173214907    29.5699GB/s   84.54
```
I computed the "Clockticks/iter" metric myself by dividing the number of clockticks spent executing benchmark_func() by the number of iterations.
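(For scale, my own back-calculation, not a number from the article: 74.73 clockticks/iter × 191481954 iterations ≈ 14.3 billion clockticks in total for the baseline run.)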
Oddly enough, deleting foo(), a function that is never even called in the test, from the source file reduced the performance of the remaining function by ~10%.
Let's try to understand what's going on here.
Looking ahead, I'll say that the generated assembly code for benchmark_func() is identical in both cases; the only difference is its location in the binary and the alignment of the inner loop.
Let's first look at the generated code for the "baseline" version:
```
$ objdump -d a.out -M intel | grep "<_Z14benchmark_funcPi>:" -A15
00000000004046c0 <_Z14benchmark_funcPi>:
  4046c0: 48 c7 c0 80 ff ff ff    mov    rax,0xffffffffffffff80
  4046c7: c5 fd 76 c0             vpcmpeqd ymm0,ymm0,ymm0
  4046cb: 0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
  4046d0: c5 fe 6f 8c 07 80 00    vmovdqu ymm1,YMMWORD PTR [rdi+rax*1+0x80]
  4046d7: 00 00
  4046d9: c5 f5 fa c8             vpsubd ymm1,ymm1,ymm0
  4046dd: c5 fe 7f 8c 07 80 00    vmovdqu YMMWORD PTR [rdi+rax*1+0x80],ymm1
  4046e4: 00 00
  4046e6: 48 83 c0 20             add    rax,0x20
  4046ea: 75 e4                   jne    4046d0 <_Z14benchmark_funcPi+0x10>
  4046ec: c5 f8 77                vzeroupper
  4046ef: c3                      ret
```
We can see that the code is aligned on a cache line (0x4046c0 mod 0x40 == 0x0). That's good. But there is one more thing we need to know about the architecture of Intel processors. My Skylake-family processor uses the MITE (Micro-instruction Translation Engine), the legacy decode pipeline, which fetches instructions in 16-byte chunks. The important point here is that these are not just any 16 bytes, but 16 bytes from an aligned 16-byte window. After the instructions have been fetched, the decoder converts them into a sequence of smaller micro-operations (uops), which are passed on to the subsequent pipeline stages.
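To make the alignment arithmetic concrete, here is a tiny illustration (my own, not from the article) that computes the aligned fetch window and checks cache-line alignment for the address from the objdump listing above:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  uint64_t addr = 0x4046c0;  // start of benchmark_func from the objdump above
  // MITE fetches from 16-byte aligned windows: round down to 16 bytes.
  printf("16-byte fetch window starts at %#llx\n",
         (unsigned long long)(addr & ~0xFULL));
  // A cache line is 64 (0x40) bytes; 0x4046c0 mod 0x40 == 0, so it is aligned.
  printf("cache-line aligned: %s\n", addr % 0x40 == 0 ? "yes" : "no");
  return 0;
}
```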
But there is another hardware block, the DSB (Decoded Stream Buffer), which, as its name implies, is a cache of decoded micro-operations. If we want to execute a set of instructions we have recently executed, we first check whether the corresponding micro-operations are already in the DSB. If they are, we avoid not only decoding them again through MITE, but even reading them from memory (which is excellent). However, there are certain limitations that affect how micro-operations can get (or fail to get) into the DSB; we will talk about those below.
In the assembly above, you can see that the code was vectorized, so we actually have only 4 loop iterations (the counter in rax runs from -0x80 up to 0 in 0x20 steps). That is good for this example, because otherwise the LSD (Loop Stream Detector) would detect the loop and stop fetching instructions from memory.
More details about all these nuances of the Intel architecture can be found in the "Intel 64 and IA-32 Architectures Optimization Reference Manual". There is also a good presentation by Zia Ansari on this topic.
Code alignment matters
I think you can already guess where this is going. Let's see how benchmark_func() is laid out in the binary in both cases:
"Base case":
[Figure: code layout of benchmark_func() in the baseline case]
"No_foo":
[Figure: code layout of benchmark_func() in the no_foo case]
The thick rectangles in the pictures above mark 32-byte windows, and the yellow background marks the instructions of the loop body. The first observation is that in the second case the entire loop fits into a single 32-byte window, while in the first it is split between two. Indeed, in the second case we have half as many DSB misses (DSB_MISS_PS: 1800M vs 888M) and exactly zero DSB-to-MITE switch penalties (DSB2MITE_SWITCHES.PENALTY_CYCLES: 888M vs 0). But why, then, does this version run 10% slower? There must be some other architectural detail I have not yet accounted for.
I ran several experiments testing various hypotheses about how decoded instructions end up in the DSB, but I am still not 100% sure I fully understand it. I have posted my experiments here.
The performance counters showed no anomalies. The only notable difference between the two cases is in IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE (4100M vs 5200M). If you don't know what that is, see the end of the article, where all the counters used are explained.
Going even further
I did two more experiments, setting the alignment explicitly: -mllvm -align-all-functions=5 and -mllvm -align-all-blocks=5 (a value of 5 means a 2^5 = 32-byte boundary).
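For example, the aligned-functions build could be invoked like this (my sketch: the source file name is hypothetical, and the actual build scripts are in the archive linked above):

```
$ clang++ -O2 -march=skylake -fno-unroll-loops \
          -mllvm -align-all-functions=5 \
          benchmark.cpp -lbenchmark -lpthread -o a.out
```

And here are the results: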
```
$ ./aligned_functions.sh
----------------------------------------------------------------------
Benchmark          CPU    Iterations   Throughput    Clockticks/iter
----------------------------------------------------------------------
func_bench_median  3 ns   218294614    36.8538GB/s   63.37

$ ./aligned_blocks.sh
----------------------------------------------------------------------
Benchmark          CPU    Iterations   Throughput    Clockticks/iter
----------------------------------------------------------------------
func_bench_median  3 ns   262104631    44.3106GB/s   46.25
```
Aligning benchmark_func() on a 32-byte boundary gave me +13% performance, and aligning all the basic blocks of benchmark_func() (including the function entry) on 32-byte boundaries gave a +36% speedup. Funny, right?
The layout of the function in the aligned-functions case is not much different from the baseline case:
[Figure: code layout of benchmark_func() with -mllvm -align-all-functions=5]
That is, we are dealing with the same kind of DSB problem as in the baseline case. The counters show even worse DSB utilization: DSB_MISS_PS 2600M vs 1800M. What is more important is to compare IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE: 330M vs 4100M. In the end, what really matters to us is keeping the backend fed with decoded micro-operations.
Now the case with aligned basic blocks:
[Figure: code layout of benchmark_func() with -mllvm -align-all-blocks=5]
It is interesting in that it shows both high DSB usage and a low number of cycles in which no micro-operations were delivered. See the table below for the specific counter values.
Performance counters used
[Table: performance counter values for all four cases]
Here is an explanation of the column headers of this table:
FRONTEND_RETIRED.DSB_MISS_PS — counts retired instructions whose fetch experienced a miss in the DSB (Decode Stream Buffer).
DSB2MITE_SWITCHES.PENALTY_CYCLES — counts penalty cycles for DSB-to-MITE switches. A DSB lookup that does not find the required instructions and has to fall back to MITE can, in the worst case, cost up to 6 cycles during which no micro-operations are delivered to the IDQ; typically it costs up to 2 cycles.
IDQ.ALL_DSB_CYCLES_4_UOPS — counts the number of cycles in which exactly 4 micro-operations were delivered to the Instruction Decode Queue (IDQ) from the Decode Stream Buffer (DSB).
IDQ.ALL_DSB_CYCLES_ANY_UOPS — counts the number of cycles in which any micro-operations were delivered to the Instruction Decode Queue (IDQ) from the Decode Stream Buffer (DSB).
IDQ_UOPS_NOT_DELIVERED.CORE — counts, per thread, the number of unused delivery slots to the Resource Allocation Table (RAT): it adds 4 − x for each cycle in which the Instruction Decode Queue (IDQ) delivers x micro-operations to the RAT (x ∈ {0,1,2,3}).
IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE — counts, per thread, the number of cycles in which no micro-operations were delivered to the Resource Allocation Table (RAT), i.e. cycles where IDQ_UOPS_NOT_DELIVERED.CORE counted the full 4.
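On Linux, counters like these can be read with perf. A sketch, assuming a Skylake PMU; the exact event names may differ between kernel versions, so treat this as illustrative:

```
$ perf stat -e idq.all_dsb_cycles_4_uops,idq.all_dsb_cycles_any_uops \
            -e idq_uops_not_delivered.core \
            -e idq_uops_not_delivered.cycles_0_uops_deliv.core \
            -e frontend_retired.dsb_miss,dsb2mite_switches.penalty_cycles \
            ./a.out
```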
Caveats
For this particular case, all these alignment problems would disappear if we, for example, increased the number of iterations to 1024. At that point the Loop Stream Detector (LSD) kicks in: it notices that we keep executing the same instructions in a loop, stops fetching them from memory, and issues them from its internal buffer instead. From that moment it no longer matters how our instructions are laid out and aligned in memory.
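For illustration, such a variant might look like this (my sketch, extending the reconstructed function from the beginning of the article):

```cpp
// With 1024 iterations the vectorized loop runs long enough for the
// LSD to take over, making code alignment irrelevant in this case.
void benchmark_func(int* a) {
  for (int i = 0; i < 1024; ++i)
    a[i] += 1;
}
```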
Another interesting observation: I got a 10% performance drop when I used the gold linker. That is not because it is somehow bad, but again because of code alignment.
Why not always align the code?
Alignment means that the compiler inserts NOP instructions into the code. This increases the binary size and can also hurt performance if the NOPs land inside some hot loop. Executing NOPs is not entirely free: they still have to be fetched from memory and decoded (you can see one such multi-byte NOP, used as padding before the loop, in the objdump listing above).
Conclusions
As you can see, even such a tiny piece of code can be tricky. I don't think all of us need to be experts in microprocessor architecture, but it is worth at least knowing that such problems exist. Keep in mind that the performance of a function, once measured, may change in the future even if its code does not. If this matters to you, don't forget to take additional performance measurements to catch problems like the one described in this article.