
Optimizing the code, or how to outrun Ognelis in speed

I read a post about the new super-optimizations in Ognelis (Firefox) and got to thinking.
It is not at all clear to me why this kind of work is celebrated like a holiday, with fireworks and a Snow Maiden. Let's take a closer look at what has actually been done.

* Function inlining: removing the overhead of a function call by simply substituting the function body at the call site.
Almost every compiler can inline functions. It is a simple and, one might say, free way to speed up a program, though it does have limitations (a small sketch follows the list):
1) Excessive inlining bloats the code. This is hardly a problem nowadays - memory is plentiful - but the compiler still has an upper limit on the size of a function it is willing to inline.
2) The compiler does NOT know how to inline functions from a neighboring module, from system libraries, or from dynamic libraries.
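As a minimal illustration (the function names here are mine, not from the original post): a small helper like this is a textbook inlining candidate, and with optimizations enabled the compiler will usually substitute its body directly, so no call/return overhead remains.

static inline int add(int a, int b) {
    return a + b;   /* small enough for the compiler to substitute at the call site */
}

int sum3(int x, int y, int z) {
    /* after inlining this is plain arithmetic, with no function calls at all */
    return add(add(x, y), z);
}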

In addition, a normal (modern) compiler can work with so-called "intrinsics" - functions it recognizes from a table and for which it has ready-made code. These are usually math functions such as sin(). So that same sine will not be compiled as a call to sin(), but will be replaced with a snippet of code, i.e., inlined automatically.
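A small sketch of what that means in practice (my own example, not from the post): with optimizations enabled, compilers such as GCC or ICC treat sin() as a known builtin and may expand it inline or substitute a vectorized version, rather than emitting a plain library call per element.

#include <math.h>

/* With -O2/-O3 the compiler recognizes sin() as an intrinsic; it may expand it
   or call a vectorized variant instead of a plain call for every iteration. */
void fill_sines(double *out, const double *in, int n) {
    for (int i = 0; i < n; i++)
        out[i] = sin(in[i]);
}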

* Type inference: removing type checks for common operators (like "+") - when the engine sees a "+", it already knows what types it is dealing with.
Well, this is what is usually called skipping runtime type checks - they threw out type checking where it is not needed... Sure, it's nice that a JIT compiler can do this too, but ordinary compilers have been able to do it for ages. Naturally, the responsibility for the types - or rather, for their possible mismatch - lies with the programmer :)
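To make the point concrete, here is a rough C sketch (the tagged-value layout and names are mine, purely illustrative) of what a dynamic "+" looks like with runtime type checks, versus the specialized code you get once the types are known - whether inferred by a JIT or fixed by a statically typed language.

/* Generic "+" in a dynamic language: every operation checks the type tags first. */
typedef struct {
    enum { T_INT, T_DOUBLE } tag;
    union { long i; double d; } u;
} value;

value generic_add(value a, value b) {
    value r;
    if (a.tag == T_INT && b.tag == T_INT) {
        r.tag = T_INT;
        r.u.i = a.u.i + b.u.i;
    } else {
        r.tag = T_DOUBLE;
        r.u.d = (a.tag == T_INT ? (double)a.u.i : a.u.d)
              + (b.tag == T_INT ? (double)b.u.i : b.u.d);
    }
    return r;
}

/* Once the types are known, the checks disappear and "+" is a single instruction: */
long specialized_add(long a, long b) { return a + b; }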

* Looping: the overhead of loops has been greatly reduced; a list of the most frequently executed hot sections of code is compiled.
Nine times out of ten this means a plain loop unroll was done. Unrolling duplicates the body of the loop N times:
for (int i = 0; i < maxI; i++) {
    a[i] = b[i] + c[i];
}


With an unroll factor of 4 we get:
for (int i = 0; i < maxI; i += 4) {
    a[i]   = b[i]   + c[i];
    a[i+1] = b[i+1] + c[i+1];
    a[i+2] = b[i+2] + c[i+2];
    a[i+3] = b[i+3] + c[i+3];
}

Well, plus an extra check for the case when maxI is not a multiple of 4 :)
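A sketch of that extra check (the function wrapper and element type are mine, not from the post): the unrolled loop covers the part of the range divisible by 4, and a short tail loop finishes whatever is left over.

void add_arrays(double *a, const double *b, const double *c, int maxI) {
    int i;
    int limit = maxI - (maxI % 4);   /* largest multiple of 4 not exceeding maxI */
    for (i = 0; i < limit; i += 4) { /* unrolled main loop */
        a[i]   = b[i]   + c[i];
        a[i+1] = b[i+1] + c[i+1];
        a[i+2] = b[i+2] + c[i+2];
        a[i+3] = b[i+3] + c[i+3];
    }
    for (; i < maxI; i++)            /* tail loop handles the last 0..3 elements */
        a[i] = b[i] + c[i];
}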

Now we take our program and the Intel C Compiler (or Intel Fortran Compiler, whoever prefers that) and build the project with the following flags (an example command line follows the list):
-O3 (yes, aggressive optimization);
-axT (enable vectorization, i.e., use SSEx, plus generate generic fallback code for the hard cases);
-ip (interprocedural optimization, including partial inlining; for enthusiasts there is also -ipo - cross-module optimization, not portable!!!);
-ansi-alias -fno-alias (improves the vectorizability :) of loops)
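For illustration, an invocation of the classic Intel compiler with these flags might look roughly like this (the file and output names are placeholders):

icc -O3 -axT -ip -ansi-alias -fno-alias -o myapp main.c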

On top of all that we get: automatic loop unrolling by at least a factor of 4, plus inlining and intrinsic versions of all the math functions.
And Ognelis will not catch up ;)
Yes, in the general case there is a difference between a regular compiler and a JIT compiler, but it is not so big as to pass off well-known techniques as discoveries ("Oh! We invented inlining!").

P.S. This is my first post (the first pancake always comes out lumpy), so please go easy on me.

Source: https://habr.com/ru/post/38228/

