
Optimizing step by step with the Intel C++ compiler



Sooner or later every developer faces the problem of optimizing their application, and wants to do it with minimal effort and maximum performance gain. Here the compiler comes to the rescue: today it can do a great deal automatically, you just have to ask it with the right keys. There are quite a lot of compilation options and kinds of optimization out there, so I decided to write a post about step-by-step optimization of an application with the Intel compiler.

So, the whole thorny path of compiling and optimizing our application can be divided into 7 steps. Let's go!

Step 1. Shall we build the code with no optimizations at all?
That's right, the first step is to answer exactly this question. I often start the optimization process by turning absolutely everything off in the compiler. Why? Well, first, I want to make sure that my code works correctly without any interference from the compiler and its ingenious transformations. I disable optimizations with the -O0 key (/Od on Windows), build the code and launch the application. Besides, debugging non-optimized code is easier.
Step 2. What "simple" things can we enable?
We start with the "basic" options.

-O1 / -Os
The first, basic optimization level, at which the compiler does not auto-vectorize (it does not even try). It does perform data-flow analysis, code motion, strength reduction of operations, analysis of variable lifetimes, and instruction scheduling. This level is often used to limit the size of the application, cutting back somewhat on optimization. If -O1 is enabled, then -Os is implicitly enabled as well.

-O2
The optimization level enabled by default, aimed at application speed. Starting from this level, loop vectorization is enabled. In addition, a series of basic loop optimizations is performed, along with inlining, IP (intra-file interprocedural) optimization, and more.

-O3
At this maximum optimization level, in addition to everything done at -O2, a number of more aggressive loop transformations are enabled: for example, unrolling the outer loop and fusing the inner ones, splitting loops into blocks (blocking), and combining IF conditions. A very good overview of these optimizations is presented here. If preserving exact numerical results is critical for your application (for example, in scientific computing), you need to be careful with this option. Quite often the numbers "drift", and you have to rein in the optimization, going back to -O2 and using the -fp-model options. In any case, after compiling with -O2 nothing stops us from trying -O3 and simply seeing what happens. In theory, the application should run faster.
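To picture what blocking does, here is a hand-made sketch (my own illustration, not actual compiler output); the tile size BS and the transpose example are made up:

    #define BS 64  /* tile size; a typical cache-friendly choice */

    /* Naive version: b is written column by column, so every iteration
       of the inner loop touches a different cache line of b. */
    void transpose(int n, double b[n][n], double a[n][n]) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                b[j][i] = a[i][j];
    }

    /* Blocked version: the same iterations regrouped into BS x BS tiles
       so the working set stays in cache; -O3 may do this automatically. */
    void transpose_blocked(int n, double b[n][n], double a[n][n]) {
        for (int ii = 0; ii < n; ii += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int j = jj; j < jj + BS && j < n; j++)
                        b[j][i] = a[i][j];
    }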

-no-prec-div
A division operation conforming to the IEEE standard is quite expensive. With this option you can sacrifice some accuracy in the calculations but speed them up: the compiler will, for example, replace the expression A/B with A*(1/B).
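The effect can be pictured like this (a hand-written sketch of the kind of rewrite the compiler may perform; the function and array names are made up):

    /* Original: one IEEE-accurate division per iteration. */
    void scale(int n, double *c, const double *a, double b) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] / b;
    }

    /* What -no-prec-div effectively allows: a single division up front,
       then cheap multiplications, at the cost of a little accuracy. */
    void scale_fast(int n, double *c, const double *a, double b) {
        double rb = 1.0 / b;
        for (int i = 0; i < n; i++)
            c[i] = a[i] * rb;
    }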

-ansi-alias
This option tells the compiler that our code follows the strict aliasing rules of the ISO C standard: when dereferencing pointers to objects of different types, we never access the same memory location. This gives the compiler more room to perform optimizations. For a detailed description of aliasing, you can read this article.
It is important to note that starting with Intel compiler version 15.0 (Intel Parallel Studio XE 2015 Composer Edition), this option is enabled by default; if you are working with an earlier version, don't forget about it.
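Here is a minimal illustration of the rule (my own example): under strict aliasing the compiler may assume that an int* and a float* never refer to the same object, so the load of *n below can be cached or reordered around the store to *f.

    /* Under -ansi-alias the compiler may assume *f and *n do not overlap. */
    int store_and_read(int *n, float *f) {
        *f = 1.0f;   /* may not modify *n under the ISO C aliasing rules */
        return *n;   /* so this load can be scheduled freely */
    }

    /* A call like this breaks the rule and is undefined behavior:
         int x = 0;
         store_and_read(&x, (float *)&x);
    */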

Step 3. Using the specifics of the hardware
You can use the -x<code> option to enable optimizations specific to Intel processors. It tells the compiler which processor features it may rely on, including the instruction sets it may generate. As <code> you can specify SSE2, SSE3, SSSE3, SSE3_ATOM, SSSE3_ATOM, ATOM_SSSE3, ATOM_SSE4.2, SSE4.1, SSE4.2, AVX, CORE-AVX-I, CORE-AVX2, CORE-AVX512, MIC-AVX512, or COMMON-AVX512.
Clearly, the resulting application can then be run only on systems with Intel processors that support the generated instructions.
By default the -xSSE2 key is in effect, which, for example, tells the compiler to use SSE2 instructions during vectorization. In most cases (Pentium 4 and above) this guarantees that the application will run.
If we are writing for Atom and know for sure that the application will run only there, then for better performance we can use -xSSSE3_ATOM. For the Silvermont architecture, specify -xATOM_SSE4.2.
The particularly lazy can use the -xHost option, in which case the code is optimized for the hardware it is built on.

By the way, you can specify not just one instruction set but several at once, using the -ax<code> key.
In this case an auto-dispatcher is added to the code: at application startup it identifies the CPU (via CPUID) and, depending on which instruction sets it supports, takes the appropriate code path. Naturally, this increases the size of the application, but it gives more flexibility. In addition to the instruction sets explicitly listed in <code>, a default SSE2 version is always created as well. For example, specifying -axAVX, we get one default SSE2 version plus a separate AVX version.
We can also list several instruction sets in the -ax option, separated by commas. For example, -axSSE4.2,AVX tells the compiler to generate SSE4.2 and AVX versions, not forgetting the default (SSE2) branch, which is always there. The default branch can also be set explicitly with the -x option alongside -ax: with the keys -axSSE4.2,AVX -xSSE4.1, the default version will be SSE4.1.
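Putting this together, a Linux build along these lines might look as follows (a sketch; myapp.c is a placeholder):

    icc -O2 -axSSE4.2,AVX -xSSE4.1 myapp.c -o myapp

This produces SSE4.2 and AVX code paths plus the default SSE4.1 path, and the CPUID-based dispatcher picks one of them at startup.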

For optimizations that are not specific to Intel processors, the -m option is used.
For example, for the Quark SoC X1000 you can specify the options -mia32 (generate code for the IA-32 architecture) and -falign-stack=assume-4-byte, which tells the compiler that our stack is aligned to 4 bytes. If necessary, the compiler can align it to 16 bytes itself. This can reduce the amount of data needed for function calls.
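A Quark build could then look like this (a sketch; app.c is a placeholder):

    icc -mia32 -falign-stack=assume-4-byte app.c -o app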

Step 4. IPO
No, we are not going public on the stock exchange just yet. IPO (Interprocedural Optimization) is the interprocedural analysis and optimization that the compiler performs over our code. It is enabled with the -ipo option and allows optimizing not each source file in isolation, but all sources at once. The compiler then knows much more and can draw many more conclusions and, accordingly, perform more transformations and optimizations. All the subtleties of IPO are explained in this blog. A peculiarity of how it works is that with -ipo the usual order of compilation and linking changes and the object files contain a packed internal representation, so the standard (on Linux) linker ld and the ar utility must be replaced with Intel's xild and xiar. Keep in mind that compilation with IPO can itself take significantly longer, especially for "large" applications.
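As a sketch, a multi-file IPO build on Linux that goes through a static library might look like this (file names are placeholders):

    icc -ipo -c fx.c                # the object file holds the packed internal representation
    xiar rcs libfx.a fx.o           # xiar instead of ar for IPO object files
    icc -ipo pi.c -L. -lfx -o pi    # the compiler driver links through xild instead of ld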

Step 5. Or maybe we should "profile" it?
Nothing can give the compiler more information than a run of the application itself. From a run we can find out exactly which branches were taken, where most of the time was spent, how often we missed the cache, and so on. The natural conclusion is that profiling our application can significantly help its optimization.
The compiler has a facility for exactly that: profiling the application and optimizing based on the collected data, known as PGO (Profile-Guided Optimization).
The process consists of several steps and, accordingly, several compiler keys.

First of all, we instrument the application by building it with the -prof-gen key. Next, we run the application, which collects various statistics (the profile) into separate files with the .dyn extension. Finally, we use this data during the final compilation with the -prof-use key, where the compiler will try to optimize the most expensive, computation-heavy branches of the code.

In some cases you may need to specify where to put the files with the run results. This can be done with the -prof-dir=<val> option, giving the path to a folder. That way we can build the code on one machine, profile it on another, and perform the final compilation back on the first: just take the .dyn files, put them into a folder on the build system, and pass the path via -prof-dir.
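For example, the whole three-step cycle on Linux might look like this (the /tmp/prof path is a placeholder):

    icc -prof-gen -prof-dir=/tmp/prof pi.c fx.c -o pi_inst   # step 1: instrumented build
    ./pi_inst                                                # step 2: run; .dyn files land in /tmp/prof
    icc -prof-use -prof-dir=/tmp/prof -O2 pi.c fx.c -o pi    # step 3: rebuild using the profile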

For a profile to be collected, the application must terminate normally.
If our application runs indefinitely (a frequent case for embedded systems, for example), a couple of extra moves are required:
1. Add an exit point to the application.
2. Add a call to the PGO API function _PGOPTI_Prof_Dump_All() (see the sketch below).
3. Control the dump interval in microseconds through environment variables:

export INTEL_PROF_DUMP_INTERVAL=5000
export INTEL_PROF_DUMP_CUMULATIVE=1
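A sketch of the dump call for a never-ending application (my own example; as far as I know the API is declared in pgouser.h, and the call only has an effect in a -prof-gen build):

    #include "pgouser.h"              /* Intel PGO API: declares _PGOPTI_Prof_Dump_All() */

    extern void handle_request(void); /* placeholder for the application's real work */

    void service_loop(void) {
        for (;;) {
            handle_request();
            _PGOPTI_Prof_Dump_All();  /* flush profile data without exiting */
        }
    }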

Step 6. Vector games
I decided to dwell on vectorization separately, even though it is enabled by default (with the -O2 option) and the instruction set is controlled by the already described -x, -ax, and friends. When we talk about the performance of the Intel compiler, vectorization deserves special attention, because it gives the biggest boost in application speed. Read the corresponding post on how to help the compiler in its hard work. The updated set of -opt-report options will also help.
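For example, to see what the vectorizer did and why (option spelling as of compiler 15.0; later versions prefix these with -q, as in -qopt-report):

    icc -O2 -opt-report=2 -opt-report-phase=vec pi.c fx.c -o pi

If I remember correctly, the report ends up in .optrpt files next to the object files.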

Step 7. Parallel automatically!
The Intel compiler has a most interesting option, -parallel, which makes the compiler parallelize loops automatically using OpenMP. Obviously, not all loops parallelize equally well, and the compiler cannot always do it. But this option is worth a try: we are unlikely to lose anything.
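On Linux this might look like the following (a sketch; the thread count comes from the OpenMP runtime):

    icc -O2 -parallel pi.c fx.c -o pi
    export OMP_NUM_THREADS=4    # auto-parallelized loops use the OpenMP runtime
    ./pi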

As a result, here is a set of options worth trying when compiling your code to increase performance:

-O2 (or -O3) -no-prec-div -x<code> -ipo -prof-gen / -prof-use -prof-dir=<val> -parallel

By the way, for the lazy there is the -fast option, which enables most of these keys: -ipo, -O3, -no-prec-div, -static, -fp-model fast=2, and -xHost.
And beyond compiler options, the excellent Intel® VTune Amplifier XE profiler will always help us, but that is another story.

Practice
In addition to these theoretical reflections, I want to play with the listed options on an example that computes the number Pi, and show how they affect the speed of the application, unpretentious as it is. In the "theory" above I gave the keys for Linux; on Windows they are almost identical, with the letter Q added at the front (in most cases). I will build the example on Windows to show the corresponding options. I used the Intel C++ compiler version 15.0 (15.0.2.179 Build 20150121).

So, I deliberately split the code into two files (so that IPO would have an effect).
pi.c:

#include <stdio.h>
#include <time.h>

#define N 1000000000

double f(double x);

int main() {
    double sum, pi, x, h;
    clock_t start, stop;
    int i;

    h = (double)1.0 / (double)N;
    sum = 0.0;
    start = clock();
    for (i = 0; i < N; i++) {
        x = h * (i - 0.5);
        sum = sum + f(x);
    }
    stop = clock();

    // print the value of pi to be sure the computation is correct
    pi = h * sum;
    printf(" pi is approximately : %f \n", pi);

    // print elapsed time
    printf("Elapsed time = %lf seconds\n", ((double)(stop - start)) / CLOCKS_PER_SEC);

    return 0;
}


In a separate file fx.c, the function f is defined:

 double f(double x){ double ret; ret = 4.0 / (x*x + 1.0); return ret; } 

Now let's build this code with different options and look at the resulting speedup. To begin with, we compile without optimizations:

 icl /Od pi.c fx.c /o Od_pi.exe 

And run Od_pi.exe:

pi is approximately : 3.141593
Elapsed time = 22.828000 seconds

That took quite a while; let's see what the next level, O1, gives:

icl /O1 pi.c fx.c /o O1_pi.exe

pi is approximately : 3.141593
Elapsed time = 4.963000 seconds

Interestingly, raising the optimization level to O2 or O3 gains us no additional speed.
This is quite logical: the code is very simple, and moreover it was not vectorized because the loop calls a function defined in another file. So IPO should help:

icl /O2 /Qipo pi.c fx.c /o ipo_pi.exe

pi is approximately : 3.141593
Elapsed time = 2.562000 seconds

Along the way, our loop was vectorized. If we build the code without IPO but with the keys /QxAVX, /QxSSE2, and others from the same series, we will not notice any difference in speed. Again quite logical, since vectorization does not kick in:

icl /O2 /QxAVX pi.c fx.c /o xAVX_pi.exe

Elapsed time = 5.065000 seconds

icl /O2 /QxSSE2 pi.c fx.c /o xSSE2_pi.exe

Elapsed time = 5.093000 seconds

I compile the code and run the application on Haswell, so I use the /QxHost option together with IPO:

icl /O2 /QxHost /Qipo pi.c fx.c /o xHost_ipo_pi.exe

Elapsed time = 2.718000 seconds

The /fast option gives the same result:

icl /fast /Qvec-report2 pi.c fx.c /o fast_pi.exe

Elapsed time = 2.718000 seconds

Profiling lets us squeeze out a little more optimization:

 icl /Qprof-gen pi.c fx.c /o pgen_pi.exe 

Run the application, and then compile again:

icl /Qprof-use /O2 /Qipo pi.c fx.c /o puse_pi.exe

Elapsed time = 2.578000 seconds

And we gain the most from auto-parallelization:

icl /Qparallel /Qpar-report2 /Qvec-report2 /Qipo pi.c fx.c /o par_ipo_pi.exe

Elapsed time = 1.447000 seconds

And so we sped up the application considerably without putting in much effort. A simple game of options, so to speak.

Source: https://habr.com/ru/post/256251/

