/GL flag), then the compiler driver (cl.exe) calls only the front end (c1.dll or c1xx.dll) and postpones the work of the back end (c2.dll) until link time. The resulting object files contain C Intermediate Language (CIL) code rather than machine code. Then the linker (link.exe) is invoked. It sees that the object files contain CIL code and calls the back end, which in turn performs WPO and generates binary object files so that the linker can stitch them together into the executable file.

The back end can then apply many optimizations (enabled by combining the /GL compiler switch with /O1 or /O2 and /Gw, and by using the /OPT:REF and /OPT:ICF linker switches). In this article I will discuss only function inlining and COMDAT optimizations. A complete list of LTCG optimizations is provided in the documentation. It is useful to know that the linker can perform LTCG on native object files, mixed native/managed object files, pure managed object files, safe managed object files and safe .netmodules.

The program I will build consists of two source files (source1.c and source2.c) and a header file (source2.h). The source files are shown in the listing below; the header file, which contains the prototypes of all the functions in source2.c, is so simple that I will not show it.

    // source1.c
    #include <stdio.h> // scanf_s and printf.
    #include "source2.h"

    int square(int x) { return x*x; }

    int main(void) {
      int n = 5, m;
      scanf_s("%d", &m);
      printf("The square of %d is %d.", n, square(n));
      printf("The square of %d is %d.", m, square(m));
      printf("The cube of %d is %d.", n, cube(n));
      printf("The sum of %d is %d.", n, sum(n));
      printf("The sum of cubes of %d is %d.", n, sumOfCubes(n));
      printf("The %dth prime number is %d.", n, getPrime(n));
      return 0;
    }

    // source2.c
    #include <math.h>    // sqrt.
    #include <stdbool.h> // bool, true and false.
    #include "source2.h"

    int cube(int x) { return x*x*x; }

    int sum(int x) {
      int result = 0;
      for (int i = 1; i <= x; ++i) result += i;
      return result;
    }

    int sumOfCubes(int x) {
      int result = 0;
      for (int i = 1; i <= x; ++i) result += cube(i);
      return result;
    }

    static bool isPrime(int x) {
      for (int i = 2; i <= (int)sqrt(x); ++i) {
        if (x % i == 0) return false;
      }
      return true;
    }

    int getPrime(int x) {
      int count = 0;
      int candidate = 1;
      while (count != x) {
        ++candidate; // Advance to the next candidate; without this the loop never ends.
        if (isPrime(candidate)) ++count;
      }
      return candidate;
    }

The file source1.c contains two functions: square, which computes the square of an integer, and main, the entry point of the program. The main function calls square and every function from source2.c except isPrime. The file source2.c contains five functions: cube raises an integer to the third power, sum computes the sum of the integers from 1 to a given number, sumOfCubes computes the sum of the cubes of the integers from 1 to a given number, isPrime tests whether a number is prime, and getPrime returns the x-th prime number. I have omitted error handling because it is of no interest in this article.

The getPrime function is the most complex: it contains a while loop, inside which it calls the isPrime function, which also contains a loop. I will use this code to demonstrate one of the most important compiler optimizations, function inlining, and several additional optimizations.

To follow along, you will need the assembler output file (produced with the /FA[s] compiler switch) to examine the generated code, and a map file (produced with the /MAP linker switch) to study the COMDAT optimizations performed (the linker can also report them if you turn on the /verbose:icf and /verbose:ref switches). Make sure all the switches are set correctly before reading on. I will use the C compiler (/TC) to make the generated code easier to study, but everything described in the article also applies to C++ code.

The first configuration uses the /Od switch without the /GL switch. In this configuration, the resulting object files contain binary code that corresponds exactly to the source code.
You can examine the assembler output files and the map file to verify this. This configuration is equivalent to the Debug configuration in Visual Studio.

The next configuration enables optimizations (the /O1, /O2 or /Ox switch is specified) but does not include the /GL switch. In this configuration, the resulting object files contain optimized binary code, but no whole-program optimization is performed.

If you examine the assembly listing for source1.c, you will notice that two important optimizations have been performed. The first call to the square function, square(n), was replaced by a value computed at compile time. How did this happen? The compiler noticed that the function body is small and decided to substitute its contents for the call. Then the compiler noticed that the computation involves a local variable n with a known initial value that does not change between the initial assignment and the function call. It therefore concluded that it is safe to perform the multiplication at compile time and substitute the result (25). The second call to the square function, square(m), was also inlined, that is, the function body was substituted for the call. But the value of the variable m is unknown at compile time, so the compiler could not compute the value of the expression in advance.

Now look at the assembly listing for source2.c, which is much more interesting. The call to the cube function in the sumOfCubes function was inlined. This, in turn, allowed the compiler to optimize the loop (for more details, see the "Loop Optimizations" section). In the isPrime function, SSE2 instructions were used to convert int to double when calling sqrt and to convert from double to int when taking the result of sqrt. In fact, sqrt is called only once, before the loop starts.
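Going back to source1.c, the two outcomes for square can be pictured in source form. This is only a hand-written sketch of the effect, not the compiler's actual output; the wrapper function names are hypothetical and exist only for illustration.

```c
/* square as defined in source1.c. */
static int square(int x) { return x * x; }

/* Call with an argument whose value is known at compile time (n = 5). */
static int knownArgumentAsWritten(void) {
    int n = 5;
    return square(n);
}

/* After inlining and constant folding, the whole call collapses to 25. */
static int knownArgumentOptimized(void) {
    return 25;
}

/* Call with an argument that is only known at run time (m). */
static int unknownArgumentAsWritten(int m) {
    return square(m);
}

/* After inlining, the call is gone but the multiplication remains. */
static int unknownArgumentOptimized(int m) {
    return m * m;
}
```

Each "optimized" function returns the same value as its "as written" counterpart; the difference is only in how much work is left for run time.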
Note that, with respect to the /arch switch, the compiler uses SSE2 by default when targeting x86 (most x86 processors and all x86-64 processors support SSE2).

In the last configuration, optimizations are enabled and the /GL compiler switch is specified (you can also explicitly specify /O1 or /O2). This tells the compiler to generate object files with CIL code instead of machine code. It means that the linker will call the back end of the compiler to perform WPO, as described above. I will now discuss a few whole-program optimizations to show the great benefit of LTCG. Generated assembly code listings for this configuration are available online.

As long as inlining is enabled (the /Ob switch, which is turned on whenever you request optimizations), the /GL switch allows the compiler to inline functions defined in other files regardless of the /Gy switch (discussed a little later). The /LTCG linker switch is optional and affects only the linker.

If you examine the assembly listing for source1.c, you will notice that the calls to all functions except scanf_s were inlined. As a result, the compiler was able to compute the values of cube, sum and sumOfCubes at compile time. Only the isPrime function was not inlined. However, if you inlined it manually into getPrime, the compiler would still inline getPrime into main.

Inlining can be controlled in several ways. You can exclude a range of functions from automatic inlining with the auto_inline pragma. You can tell the compiler never to inline a specific function or method with __declspec(noinline). You can also mark a function with the inline keyword to advise the compiler to inline it (although the compiler may ignore this advice if it considers it a bad idea). The inline keyword has been available since the first version of C++, and it was introduced to C in C99. You can use the compiler-specific __inline keyword in both C and C++; this is convenient if you target older versions of C that do not support inline. The __forceinline keyword (for C and C++) makes the compiler always inline a function whenever possible. Last but not least, you can tell the compiler to unfold a recursive function to a specified or indefinite depth by inlining it, using the inline_recursion pragma.
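The inlining controls listed above might be used as in the following sketch. The macro names NEVER_INLINE and ALWAYS_INLINE and the function names are hypothetical. The non-Microsoft branches are an assumption added only so that the sketch also builds with other compilers; they are not part of the article's subject.

```c
/* Map hypothetical macro names onto compiler-specific inlining controls. */
#if defined(_MSC_VER)
#  define NEVER_INLINE  __declspec(noinline)
#  define ALWAYS_INLINE __forceinline
#elif defined(__GNUC__)
   /* Assumption: GCC/Clang equivalents, included only for portability. */
#  define NEVER_INLINE  __attribute__((noinline))
#  define ALWAYS_INLINE inline __attribute__((always_inline))
#else
#  define NEVER_INLINE
#  define ALWAYS_INLINE inline
#endif

/* The compiler is told never to substitute this body at a call site. */
NEVER_INLINE static int cubeNeverInlined(int x) { return x * x * x; }

/* The compiler is told to inline this function whenever it can. */
static ALWAYS_INLINE int cubeAlwaysInlined(int x) { return x * x * x; }

/* Functions defined between the two pragmas are excluded from automatic
   inlining (Visual C++ only, hence the guard). */
#if defined(_MSC_VER)
#pragma auto_inline(off)
#endif
static int sumUpTo(int x) {
    int result = 0;
    for (int i = 1; i <= x; ++i) result += i;
    return result;
}
#if defined(_MSC_VER)
#pragma auto_inline(on)
#endif
```

None of these annotations change what the functions compute; they only influence whether calls to them survive in the generated code, which you can check in the assembly listing.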
Note that the compiler currently offers no way to control inlining at the call site rather than at the function declaration.

The /Ob0 switch disables inlining completely, which is useful during debugging (this switch is in effect in the Debug configuration in Visual Studio). The /Ob1 switch tells the compiler to consider only functions marked with inline, __inline or __forceinline as candidates for inlining. The /Ob2 switch takes effect only when /O[1|2|x] is specified and tells the compiler to consider all functions for inlining. In my opinion, the only reason to use the inline and __inline keywords is to control inlining under /Ob1.

The compiler can package functions and global variables into separate sections when you specify the /Gy (function-level linking) and /Gw (global data optimization) switches, respectively. These sections are called COMDATs. You can also mark a particular global variable with __declspec(selectany) to tell the compiler to package that variable into a COMDAT. Then, with the /OPT:REF linker switch, you can get rid of unreferenced functions and global variables. The /OPT:ICF switch folds identical functions and global constant data (ICF stands for Identical COMDAT Folding). The /ORDER switch makes the linker place COMDATs in the final image in a specific order. Note that none of the linker optimizations require the /GL switch. The /OPT:REF and /OPT:ICF switches should be turned off during debugging, for obvious reasons.

If you modify source1.c so that m is passed to sumOfCubes instead of n, the compiler can no longer compute the value of the parameter, so it has to compile the function to work for any argument. The resulting function is heavily optimized, which makes it rather large, so the compiler does not inline it.

With the /O1 switch, no optimizations are applied to sumOfCubes. Compiling with the /O2 switch optimizes for speed. The code size then increases significantly, because the loop inside the sumOfCubes function is unrolled and vectorized.
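To make the term "unrolled" concrete, here is a hand-written sketch of sumOfCubes unrolled by two with cube inlined. It only illustrates the shape of the transformation; the code the compiler actually generates is different (and also vectorized). The name sumOfCubesUnrolled is hypothetical.

```c
/* cube and sumOfCubes as written in source2.c. */
static int cube(int x) { return x * x * x; }

static int sumOfCubes(int x) {
    int result = 0;
    for (int i = 1; i <= x; ++i) result += cube(i);
    return result;
}

/* Hand-unrolled by two, with the body of cube substituted for the call. */
static int sumOfCubesUnrolled(int x) {
    int result = 0;
    int i = 1;
    /* Each pass handles two consecutive values of i, halving the number
       of loop-condition checks and jumps. */
    for (; i + 1 <= x; i += 2) {
        result += i * i * i;
        result += (i + 1) * (i + 1) * (i + 1);
    }
    /* One value is left over when x is odd. */
    if (i <= x) result += i * i * i;
    return result;
}
```

Both functions return the same result for any x whose sum of cubes fits in an int; the unrolled form trades code size for fewer branches.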
It is very important to understand that vectorization would not be possible without inlining the cube function. Moreover, loop unrolling would also be less effective without inlining. A simplified graphical representation of the resulting code is shown in the following figure (the graph is valid for both x86 and x86-64).
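The graph represents the code generated for the sumOfCubes function. If SSE4 is supported and x is greater than or equal to 8, SSE4 instructions are used to perform four multiplications at a time. Performing the same operation on several values at once is called vectorization. The compiler also unrolls this loop twice, which means that the loop body is repeated twice in each iteration. As a result, eight multiplications are performed per iteration. If x is less than 8, the non-vectorized version of the code is used. Note that the compiler inserts three exit points instead of one, thereby reducing the number of jumps.

The instruction sets the compiler may use are selected with the /arch switch. By specifying /arch:AVX2, you tell the compiler to also use the FMA and BMI instructions.

You can influence these loop optimizations with __forceinline and with the loop pragma with the no_vector option (the latter turns off auto-vectorization of the specified loops).

The sumOfCubes function is not the only one whose loop has been unrolled. If you modify the code and pass m to the sum function instead of n, the compiler cannot compute its value and has to generate code for the function; its loop is unrolled twice.

Finally, consider the following version of the sum function:

    int sum(int x) {
      int result = 0;
      int count = 0;
      for (int i = 1; i <= x; ++i) {
        ++count;
        result += i;
      }
      printf("%d", count);
      return result;
    }

Whenever the loop runs, count ends up equal to x, so the increment can be moved out of the loop by assigning x to count. This optimization is called loop-invariant code motion. The word "invariant" indicates that the technique applies when a piece of code does not depend on any expression involving the loop variable.

There is a catch if you apply this optimization by hand. When x is not positive, the loop never executes and count stays untouched, so an unconditional assignment of x to count is unnecessary work and, for negative x, leaves count with the wrong value. The Visual C++ compiler takes this into account, so the generated code is correct for all values of x.

In addition to the /O1, /O2 and /Ox compiler switches, you can control optimizations for individual functions with the optimize pragma: #pragma optimize( "[optimization-list]", {on | off} ). The optimization list is either empty or contains one or more of the letters g, s, t and y, which correspond to the /Og, /Os, /Ot and /Oy switches. An empty list with off turns all of these optimizations off; an empty list with on restores the settings given on the command line.

The /Og switch enables global optimizations. If LTCG is enabled, /Og enables WPO.

The optimize pragma is useful when different functions should be optimized in different ways: some for size, others for speed. If you really need that level of control, however, consider profile-guided optimization (PGO), which Visual Studio supports.

A JIT compiler such as RyuJIT works at run time, so it can, for example, determine that a condition is never true in a particular run and optimize it away. It can also exploit the processor it runs on: if the processor supports SSE4.1, the JIT compiler can emit only SSE4.1 instructions for the sumOfCubes function, which makes the generated code more compact. On the other hand, RyuJIT cannot spend much time optimizing, because JIT compilation time counts against the running application. The Visual C++ compiler can spend far more time looking for optimization opportunities. The Microsoft .NET Native technology compiles managed code using the Visual C++ back end; at the time of writing it supports only Windows Store apps.

Control over managed-code optimizations is limited: the compilers only let you turn optimizations on or off with the /optimize switch.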
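To control the JIT compiler, you can apply the System.Runtime.CompilerServices.MethodImpl attribute to a method with one of the MethodImplOptions values. NoOptimization turns optimizations off, NoInlining prevents the method from being inlined, and AggressiveInlining (available since .NET 4.5) recommends that the JIT compiler inline the method.

Source: https://habr.com/ru/post/250199/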