The other day, while working on a bug in an open source project, I saw a colleague (who was working on the same problem in parallel) land the following commit [31a078bec7]:
```diff
 	/*
-	 * Select the list item based on the index. Negative operand means
-	 * end-based indexing (-2, ...), and -1 means out of range.
+	 * Decode end-offset index values.
 	 */
-	if (opnd < -1) {
-	    index = opnd+1 + objc;
-	} else {
-	    index = opnd;
-	}
+	index = opnd + (opnd <= TCL_INDEX_END)*(objc - 1 - TCL_INDEX_END);
 	pcAdjustment = 5;
```
The change itself is correct (`TCL_INDEX_END` is a constant defined as `(-2)`).
Roughly speaking, in one's head this turns into the following (all variables are `int`):

```c
index = opnd + cmp(opnd, (-2))==>(0 | 1) * (objc - 1 - (-2))
```
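If there is any doubt about the equivalence, it is easy to check mechanically. Below is a minimal self-contained sketch (my own, not from the commit) that brute-forces the original `if/else` against the arithmetic replacement over a small range of operands:

```c
#include <assert.h>
#include <stdio.h>

#define TCL_INDEX_END (-2)

int main(void) {
    /* Compare the original if/else decoding with the arithmetic
     * replacement over a small brute-force range. */
    for (int objc = 0; objc < 64; objc++) {
        for (int opnd = -32; opnd < 32; opnd++) {
            int old_index, new_index;
            /* original variant */
            if (opnd < -1) {
                old_index = opnd + 1 + objc;
            } else {
                old_index = opnd;
            }
            /* arithmetic variant: (opnd <= -2) is 0 or 1, so the second
             * term is either 0 or (objc - 1 - (-2)) == objc + 1 */
            new_index = opnd
                + (opnd <= TCL_INDEX_END) * (objc - 1 - TCL_INDEX_END);
            assert(old_index == new_index);
        }
    }
    printf("equivalent over the tested range\n");
    return 0;
}
```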
In other words, he apparently wanted to save one conditional jump here.
That might all seem like nothing, but such a seemingly trifling "optimization" with a bias toward arithmetic still put me on alert.
Firstly, this change touches the most "main" function of this project (`TEBCresume`), the one responsible for executing bytecode (JIT-compiled TCL instructions). For the same reason it is also the largest function (about six thousand lines, plus primitives and macros) and one of the most complex in the project's codebase, with multiple `goto`s, puzzle-like macros for working with the execution stack, NRE folding/unfolding (non-recursive evaluation), and so on.
That is, changes to this function are often examined under a magnifying glass, or even under a microscope (it has happened that even minor modifications turned the whole code of this function upside down)...
Secondly, by the nature of my work I often have to optimize shared code while looking at its reflection in assembly, squeezing out fractions of micro- and even nanoseconds, and I often see that the situation here is very ambiguous. At least on occasion, unrolling such constructions that "save" a conditional jump back into an `if` or even an `if/else`, I have seen an improvement both in the resulting assembly code and explicitly in the final comparison of performance results.
Actually, the reason I wrote all this: I wanted to show with an example how it plays out and, since I have touched on this topic, to collect some statistics. Hence a couple of polls at the end of the article...
Do not forget about optimization in modern compilers, which, if it has not reached perfection, is already at a very, very high level.
Another matter is that what the compiler has optimized sometimes backfires on modern hardware with its own internal optimizations, such as out-of-order or speculative execution, instruction-level parallelism (ILP), and so on.
But this is a topic for a completely different article.
For obvious reasons, we will not pick all of this apart in the original function...
Therefore, as an example, take two functions: `test1` (the "optimization" via arithmetic) and `test2` (the flow as it is, with a single `if`). The resulting assembly for both functions was compared across different compiler versions, gcc for example, compiled with `-O2` and `-Ofast` (almost no difference between the two, and on trunk none at all):
c / c++:

```c
int test1(int opnd, int objc) {
    int index;
    index = opnd + (opnd <= (-2)) * (objc - 1 - (-2));
    return index;
}

int test2(int opnd, int objc) {
    int index;
    if ((index = opnd) <= (-2)) {
        index += (objc - 1 - (-2));
    }
    return index;
}
```
x64, gcc 5.1:

```asm
;#test1(int,int)
        xor     eax, eax
        cmp     edi, -1
        setl    al
        add     esi, 1
        imul    esi, eax
        lea     eax, [rsi+rdi]
        ret

;#test2(int,int)
        lea     edx, [rdi+1+rsi]
        mov     eax, edi
        cmp     edi, -2
        cmovle  eax, edx
        ret
```
x64, gcc 7.3:

```asm
;#test1(int,int)
        xor     eax, eax
        cmp     edi, -1
        setl    al
        add     esi, 1
        imul    eax, esi
        add     eax, edi
        ret

;#test2(int,int)
        lea     ecx, [rdi+1+rsi]
        mov     eax, edi
        cmp     edi, -1
        cmovl   eax, ecx
        ret
```
x64, gcc (trunk):

```asm
;#test1(int,int)
        add     esi, 1
        mov     eax, 0
        cmp     edi, -1
        cmovge  esi, eax
        lea     eax, [rsi+rdi]
        ret

;#test2(int,int)
        mov     eax, edi
        cmp     edi, -1
        lea     ecx, [rdi+1+rsi]
        cmovl   eax, ecx
        ret
```
x86, gcc 5.1:

```asm
;#test1(int,int)
        mov     ecx, [esp+4]
        mov     edx, [esp+8]
        xor     eax, eax
        cmp     ecx, -1
        setl    al
        add     edx, 1
        imul    eax, edx
        add     eax, ecx
        ret

;#test2(int,int)
        mov     eax, [esp+4]
        mov     edx, [esp+8]
        lea     edx, [eax+1+edx]
        cmp     eax, -2
        cmovle  eax, edx
        ret
```
x86, gcc 7.3:

```asm
;#test1(int,int)
        mov     ecx, [esp+4]
        mov     edx, [esp+8]
        xor     eax, eax
        cmp     ecx, -1
        setl    al
        add     edx, 1
        imul    eax, edx
        add     eax, ecx
        ret

;#test2(int,int)
        mov     eax, [esp+4]
        mov     edx, [esp+8]
        lea     edx, [eax+1+edx]
        cmp     eax, -1
        cmovl   eax, edx
        ret
```
x86, gcc (trunk):

```asm
;#test1(int,int)
        mov     edx, [esp+4]
        mov     eax, [esp+8]
        mov     ecx, 0
        add     eax, 1
        cmp     edx, -1
        cmovge  eax, ecx
        add     eax, edx
        ret

;#test2(int,int)
        mov     eax, [esp+4]
        mov     edx, [esp+8]
        cmp     eax, -1
        lea     edx, [eax+1+edx]
        cmovl   eax, edx
        ret
```
This can be seen more clearly (with syntax highlighting, etc.) here or here.
What we have as the bottom line:
- in all variants (even with maximum optimization), `test1` is no better than `test2`, or loses to it
- `test2` is implemented by the compiler without any conditional jump at all, contrary to what one might think, simply using the conditional move `cmovl` (a readable source variant with the same property is sketched right after this list)
- even the source code of `test2` is more readable
- `test2` is shorter, in places more convenient for ILP / SEP, and it pollutes the CPU cache less (at least the instruction cache)
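For what it's worth, the same branch-free machine code can usually be obtained from perfectly readable source as well. A minimal sketch (my own variant, not from the commit): the ternary below expresses the same decoding as `test2`, and mainstream compilers at `-O2` typically lower it to the very same conditional move rather than a branch:

```c
/* Same decoding as test2, written as a ternary; gcc/clang at -O2
 * typically emit a cmov here, not a conditional jump.
 * Note that (objc - 1 - (-2)) simplifies to (objc + 1). */
int test3(int opnd, int objc) {
    return (opnd <= -2) ? opnd + objc + 1 : opnd;
}
```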
Incidentally, the progress in compiler development from version to version is clearly visible here.
And in the full build of the whole function, the compiler slightly changed the entire flow, which on some artificial benchmarks showed a drop of one to two percent (this may differ on other compilers / systems / platforms, etc.).
The general test of compiled TCL code execution (without parasitic load) shows a slight drop (fractions of a percent), which may simply be due to vacuum energy fluctuations.
Now, why I think one should not engage in this kind of "blind" optimization in one's head:
- the arithmetic and the actual process flow get mixed into one pile, which, among other things, makes it harder for the compiler to parse
- during compilation, arithmetic is optimized somewhat more weakly than the flow tree itself and the inlining process, since it requires overflow control, handling of implicit behavior, etc. (this is historical; roughly speaking, arithmetic has a slightly shifted focus during optimization)
- the source code becomes implicit, which at the very least slows down review and analysis (sometimes it is simply very hard to read)
This list could be continued indefinitely. For example, when compiling sources for a specific processor (if the compiler supports it), explicitly expressed intended behavior leaves the compiler (or a newer version of it) more freedom to choose the optimal scenario, for example by using some new processor instructions, cutting out "duplicates", or reshuffling the whole code completely, etc.
I am not saying that you should not optimize at all; but if you do it, then do it armed with a magnifying glass: looking at the native compilation result, measuring execution speed, and profiling performance (both of the specific function and of the code as a whole), etc.
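For illustration only, a naive timing harness for the two functions above might look like the sketch below. It is entirely my own construction (fixed iteration count, a volatile sink, calls through a function pointer to keep the compiler from folding everything away); a real measurement would want a proper benchmark framework, pinned cores, and statistics:

```c
#include <stdio.h>
#include <time.h>

int test1(int opnd, int objc) {
    return opnd + (opnd <= -2) * (objc - 1 - (-2));
}

int test2(int opnd, int objc) {
    int index;
    if ((index = opnd) <= -2) {
        index += (objc - 1 - (-2));
    }
    return index;
}

/* volatile sink so the optimizer cannot discard the loop body */
static volatile int sink;

static double bench(int (*fn)(int, int)) {
    clock_t t0 = clock();
    for (int i = 0; i < 100000000; i++) {
        /* (i & 7) - 4 yields operands in -4..3, exercising both cases */
        sink = fn((i & 7) - 4, 16);
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    printf("test1: %.3fs\n", bench(test1));
    printf("test2: %.3fs\n", bench(test2));
    return 0;
}
```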
I have often seen how some optimization that seemed to definitely improve the bytecode (fewer and shorter instructions, optimal register loading, etc.) sagged by a few percent in performance on real tests with multithreaded and/or parallel parasitic loads.
P.S. Happy holiday to all the women!
And with that, on to the polls...