
Overtaking the compiler

It is often repeated these days, on forums and other places where developers gather, that a decent optimizing compiler will always surpass the pitiful, almost-human efforts of a hand-written assembly program. There are rare cases, such as MPEG decoders, where good use of SIMD instructions lets assembly completely outperform the compiler. But the usual refrain, heard everywhere, is that the compiler always does better.



The reason usually given is that a modern CPU has so many pipelines and scheduling hazards to deal with that a naive assembly program simply won't handle them well.



But is that really so? Rather than taking the word of some guys on the Internet as gospel, let's run a little experiment and find out.



We'll just take a simple piece of code and look at it. I'm not going to pick an example that can benefit heavily from exotic intrinsics; instead I'll take the good old standard quicksort.


Here is the simple quicksort in C++ that we will be testing:



struct Item {
    int key;
    int value;
};

extern "C" void sortRoutine(Item *items, int count)
{
    if (count <= 1)
        return;  // already sorted if only zero/one element

    // Pick the pivot.
    Item pivot = items[count-1];

    int low = 0;
    for (int pos = 0; pos < count-1; pos++) {
        if (items[pos].key <= pivot.key) {
            // swap elements
            Item tmp = items[pos];
            items[pos] = items[low];
            items[low] = tmp;
            low++;
        }
    }

    items[count-1] = items[low];
    items[low] = pivot;

    sortRoutine(items, low);
    sortRoutine(items+low+1, count-1-low);
}


Nothing unusual here. It's not the best sorting algorithm in the world, and this isn't even the best implementation of it, but it is simple and does the job.



Now let's write a straightforward implementation of the same approach directly in assembly:



sortRoutine:
    ; rcx = items
    ; rdx = count
    sub rdx, 1
    jbe done

    ; Pick the pivot.
    mov r8, [rcx+rdx*8]     ; r8 = pivot data
    xor r9, r9              ; r9 = low
    xor r10, r10            ; r10 = pos

partition:
    cmp [rcx+r10*8], r8d
    jg noswap

    ; swap elements
    mov rax, [rcx+r10*8]
    mov r11, [rcx+r9*8]
    mov [rcx+r9*8], rax
    mov [rcx+r10*8], r11
    inc r9

noswap:
    inc r10
    cmp r10, rdx
    jb partition

    ; move pivot into place
    mov rax, [rcx+r9*8]
    mov [rcx+rdx*8], rax
    mov [rcx+r9*8], r8

    ; recurse
    sub rdx, r9
    lea rax, [rcx+r9*8+8]
    push rax
    push rdx
    mov rdx, r9
    call sortRoutine
    pop rdx
    pop rcx
    test rdx, rdx
    jnz sortRoutine

done:
    ret


It turned out to be quite easy to write, mostly thanks to Intel's flexible memory-addressing operands. Interestingly, I made no real attempt at instruction scheduling, pipelining and the like; I just wrote it as a plain, straightforward assembly program.
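The article doesn't say which assembler was used; the syntax above is NASM/MASM-style. With NASM, for example, adding a global sortRoutine directive at the top of the file and assembling with something like the following (file names are my own) would produce a linkable 64-bit object file:

nasm -f win64 sort_asm.asm -o sort_asm.obj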



Now let's measure the elapsed time. I wrote a simple test program that sorts an array of 1,000,000 items. The test was run 100 times, and the best time over the whole set was taken. The C++ version was compiled with gcc 4.8.1, clang 3.8.0 and MSVC 2013.
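The harness itself isn't reproduced in the text (it is in the archive linked at the end), but a minimal sketch of the loop just described could look like this; the timing method, the fixed seed, and all names here are my assumptions, not the archived code:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

struct Item { int key; int value; };
extern "C" void sortRoutine(Item *items, int count);

int main()
{
    const int N = 1000000;      // array of 1,000,000 items
    const int RUNS = 100;       // run 100 times, keep the best

    std::mt19937 rng(12345);    // fixed seed: identical input for every build
    std::vector<Item> data(N), work(N);
    for (int i = 0; i < N; i++)
        data[i] = { int(rng() & 0x7fffffff), i };

    double best = 1e300;
    for (int run = 0; run < RUNS; run++) {
        work = data;            // sort a fresh copy each run
        auto t0 = std::chrono::steady_clock::now();
        sortRoutine(work.data(), N);
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        best = std::min(best, ms);
    }
    std::printf("best of %d runs: %.1f ms\n", RUNS, best);
    return 0;
}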



Results



sort_cpp_recurse_gcc.exe   : 99 ms (best of 100 runs)
sort_cpp_recurse_clang.exe : 99 ms (best of 100 runs)
sort_cpp_recurse_ms.exe    : 98 ms (best of 100 runs)
sort_asm_recurse.exe       : 92 ms (best of 100 runs)


Well, this is interesting. The compilers all produce essentially the same result, with a slight edge for MSVC. But the assembly version is clearly faster: 92 ms against MSVC's 98 ms, almost 7% in this case.



The thing is that C++ does not always map well onto the underlying machine. It does fine with variables, but its view of the stack is very limited. C++ assumes we only use the stack for calls, whereas in reality one of the things we can do with it is treat it as a data stack.



Let's try that and see what happens. We remove the recursive calls to sortRoutine and instead pull our data ranges directly off a stack. This lets us do everything in a single loop without actually recursing. This approach can often give a significant win, since it eliminates the per-call overhead of entering and leaving a function.



The corresponding program is available in the archive file below.
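Since that archive isn't inlined here, here is a minimal sketch of what such an explicit-stack version might look like in C++; this is my illustration rather than the archived code, and the name sortRoutineIter and the push-larger/loop-on-smaller policy are my own choices:

#include <utility>

struct Item { int key; int value; };

extern "C" void sortRoutineIter(Item *items, int count)
{
    // Explicit stack of pending (base, count) ranges instead of recursion.
    // Always push the larger half and keep looping on the smaller one,
    // so the depth stays O(log n); 64 entries is then more than enough.
    struct Range { Item *base; int count; };
    Range stack[64];
    int top = 0;
    stack[top++] = { items, count };

    while (top > 0) {
        Range r = stack[--top];
        while (r.count > 1) {
            // Same partition step as the recursive version.
            Item pivot = r.base[r.count-1];
            int low = 0;
            for (int pos = 0; pos < r.count-1; pos++) {
                if (r.base[pos].key <= pivot.key)
                    std::swap(r.base[pos], r.base[low++]);
            }
            r.base[r.count-1] = r.base[low];
            r.base[low] = pivot;

            int rightCount = r.count-1-low;
            if (low > rightCount) {
                stack[top++] = { r.base, low };        // push larger half
                r = { r.base + low + 1, rightCount };  // loop on smaller half
            } else {
                stack[top++] = { r.base + low + 1, rightCount };
                r = { r.base, low };
            }
        }
    }
}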



sort_cpp_recurse_gcc.exe   : 99 ms (best of 100 runs)
sort_cpp_recurse_clang.exe : 99 ms (best of 100 runs)
sort_cpp_recurse_ms.exe    : 98 ms (best of 100 runs)
sort_asm_recurse.exe       : 92 ms (best of 100 runs)
sort_cpp_iter_gcc.exe      : 106 ms (best of 100 runs)
sort_cpp_iter_clang.exe    : 97 ms (best of 100 runs)
sort_cpp_iter_ms.exe       : 95 ms (best of 100 runs)
sort_asm_iter.exe          : 92 ms (best of 100 runs)




Interesting. The assembly version gives almost exactly the same result. I suspect this is because, although the iterative approach removes the function call overhead, that overhead was already small in our hand-written x64 version, so there was little to gain.



For the C++ versions the situation is different. Most show a slight speedup, but gcc is clearly slower! As far as I can tell from the disassembly, the compiler ends up tripping over itself: the extra control paths and the bookkeeping that comes with them lead to a lot of register and branch juggling.



I compiled these tests for x64, where the default calling convention is fastcall (arguments passed in registers). I suspect that if we built a 32-bit version instead, using the stack-based cdecl convention, the non-recursive version would do comparatively better. I haven't tried it; I leave that as an exercise for the reader.
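For reference, and purely as an illustration rather than code from the archive, here is roughly how the same call looks under the two conventions:

; x64 fastcall: the first integer arguments arrive in registers,
; so a call costs little beyond the call/ret itself.
mov rcx, items      ; rcx = items
mov rdx, count      ; rdx = count
call sortRoutine

; 32-bit cdecl: every argument goes through the stack, pushed right
; to left, and the caller cleans up afterwards.
push count
push items
call _sortRoutine
add esp, 8          ; caller removes the two arguments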



Conclusion



It seems the old saying that "a modern compiler will always be faster than a program you wrote in assembly" is not necessarily true.



What is true is that the compiler does its job quite well, and the resulting code is much easier to work with. So although we could squeeze out a little more speed, it is rarely worth the maintenance headaches that come with it.



Still, the assembly version was faster. And if there is one useful takeaway from this post, it is that people on the Internet sometimes say things that do not correspond to reality at all.



Materials

sorttest.zip - all the code used in this article.

Source: https://habr.com/ru/post/317078/


