/GL flag), then the compiler driver (cl.exe) calls only the front end (c1.dll or c1xx.dll) and postpones the work of the back end (c2.dll) until link time. The resulting object files contain C Intermediate Language (CIL), not machine code. Then the linker (link.exe) is invoked. It sees that the object files contain CIL code and calls the back end, which in turn runs WPO and generates binary object files that the linker can combine into the executable file.

/GL
compiler keys with /O1 or /O2 and /Gw, as well as the /OPT:REF and /OPT:ICF linker keys). In this article I will discuss only inlining and COMDAT optimization. A complete list of LTCG optimizations is provided in the documentation. It is useful to know that the linker can perform LTCG on native, mixed native/managed and pure managed object files, as well as on safe managed object files and safe .netmodules.

source1.c
and source2.c) and a header file (source2.h). The source1.c and source2.c files are shown in the listing below; the header file, which contains the prototypes of the source2.c functions, is so simple that I will not show it.

// source1.c
#include <stdio.h> // scanf_s and printf.
#include "Source2.h"

int square(int x) { return x*x; }

int main(void) {
  int n = 5, m;
  scanf_s("%d", &m);
  printf("The square of %d is %d.", n, square(n));
  printf("The square of %d is %d.", m, square(m));
  printf("The cube of %d is %d.", n, cube(n));
  printf("The sum of %d is %d.", n, sum(n));
  printf("The sum of cubes of %d is %d.", n, sumOfCubes(n));
  printf("The %dth prime number is %d.", n, getPrime(n));
  return 0;
}

// source2.c
#include <math.h> // sqrt.
#include <stdbool.h> // bool, true and false.
#include "Source2.h"

int cube(int x) { return x*x*x; }

int sum(int x) {
  int result = 0;
  for (int i = 1; i <= x; ++i) result += i;
  return result;
}

int sumOfCubes(int x) {
  int result = 0;
  for (int i = 1; i <= x; ++i) result += cube(i);
  return result;
}

static bool isPrime(int x) {
  for (int i = 2; i <= (int)sqrt(x); ++i) {
    if (x % i == 0) return false;
  }
  return true;
}

int getPrime(int x) {
  int count = 0;
  int candidate = 1;
  while (count != x) {
    ++candidate; // Try the next number.
    if (isPrime(candidate)) ++count;
  }
  return candidate;
}
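The header that the listing leaves out could look roughly like the sketch below. It is an assumed reconstruction, not the author's file: it only declares the four non-static functions defined in source2.c (isPrime is static, so it stays private to that file), and the include-guard name SOURCE2_H is hypothetical.

```c
// Source2.h (hypothetical reconstruction; the article does not show this file)
#ifndef SOURCE2_H // the include-guard name is an assumption
#define SOURCE2_H

int cube(int x);       // x raised to the third power
int sum(int x);        // 1 + 2 + ... + x
int sumOfCubes(int x); // 1^3 + 2^3 + ... + x^3
int getPrime(int x);   // the x-th prime number

#endif // SOURCE2_H
```

Both source1.c and source2.c include this file, so the compiler sees the same prototypes in each translation unit.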
source1.c contains two functions: square, which calculates the square of an integer, and the program's entry point, main. The main function calls square and every function from source2.c except isPrime. The file source2.c contains five functions: cube raises an integer to the third power, sum computes the sum of the integers from 1 to a given number, sumOfCubes computes the sum of the cubes of the integers from 1 to a given number, isPrime checks whether a number is prime, and getPrime returns the prime number with a given ordinal. I omitted error handling, since it is of no interest in this article.

The getPrime function is the most complex: it contains a while loop, inside of which it calls the isPrime function, which also contains a loop. I will use this code to demonstrate one of the most important compiler optimizations, function inlining, and several additional optimizations.

/FA[s]
compiler key) and a map file (obtained using the linker /MAP key) to study the COMDAT optimizations performed (the linker will report them if you turn on the /verbose:icf and /verbose:ref keys). Make sure all keys are set correctly before you continue reading. I will use the C compiler (/TC) to make the generated code easier to follow, but everything described in the article also applies to C++ code.

/Od
key without the /GL key. In this configuration, the resulting object files contain binary code that corresponds exactly to the source code. You can examine the assembler output files and the map file to verify this. This configuration is equivalent to the Debug configuration in Visual Studio.

/O1
, /O2 or /Ox switches are specified), but it does not include the /GL switch. In this configuration, the resulting object files contain optimized binary code, but whole program optimization is not performed.

source1.c
, you will notice that two important optimizations have been performed. The first call to the square function, square(n), was replaced by a value computed at compile time. How did this happen? The compiler noticed that the function body is small and decided to substitute its contents for the call. Then the compiler noticed that the calculation involves a local variable n with a known initial value that does not change between the initial assignment and the function call. It therefore concluded that it is safe to evaluate the multiplication and substitute the result (25). The second call to the square function, square(m), was also inlined, i.e., the function body was substituted for the call. But the value of m is unknown at compile time, so the compiler could not evaluate the expression in advance.
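The effect of inlining followed by constant folding can be written out by hand. The sketch below only illustrates the transformation and is not compiler output; the helper names square_of_n_folded and square_of_m_inlined are hypothetical.

```c
// The function from source1.c.
int square(int x) { return x * x; }

// square(n) with n == 5: once the body is inlined, both operands of the
// multiplication are known constants, so 5 * 5 can be folded into 25.
int square_of_n_folded(void) { return 25; }

// square(m): the body is substituted for the call, but m is only known at
// run time, so the multiplication itself has to stay in the code.
int square_of_m_inlined(int m) { return m * m; }
```

In both cases the call instruction disappears; only the second case still performs a multiplication at run time.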
Now consider source2.c's assembly listing file, which is much more interesting. The call to the cube function in the sumOfCubes function was inlined. This, in turn, allowed the compiler to perform loop optimizations (for more details, see the “Loop Optimization” section). In the isPrime function, SSE2 instructions were used to convert int to double when calling sqrt and to convert double back to int when obtaining the result from sqrt. In fact, sqrt is called only once, before the loop starts. Note that if the /arch switch is not specified, the compiler uses SSE2 by default on x86 (most x86 processors and x86-64 processors support SSE2).

/GL
compiler key is specified (you can also explicitly specify /O1 or /O2). This tells the compiler to generate object files with CIL code instead of machine code. This means that the linker will call the back end of the compiler to run WPO, as described above. Now we will discuss a few WPO optimizations to show the benefits of LTCG. Generated assembly code listings for this configuration are available online.

/Ob
key, which is in effect when optimization is turned off), the /GL key allows the compiler to inline functions defined in other files regardless of whether the /Gy key is specified (we will discuss it a little later). The linker /LTCG key is optional and affects only the linker.

source1.c
, you may notice that calls to all functions except scanf_s were inlined. As a result, the compiler was able to evaluate the calls to cube, sum and sumOfCubes at compile time. Only the isPrime function was not inlined. However, if you manually inlined it into getPrime, the compiler would still inline getPrime into main.

auto_inline
directive. You can also tell the compiler never to inline specific functions or methods using __declspec(noinline). You can mark a function with the inline keyword to advise the compiler to inline it (although the compiler may ignore this advice if it considers it a bad idea). The inline keyword has been available since the first version of C++; in C it appeared in C99. You can use the compiler-specific __inline keyword in both C and C++, which is convenient if you target older versions of C that do not support inline. The __forceinline keyword (for C and C++) makes the compiler always inline a function, if possible. Last but not least, you can tell the compiler to expand a recursive function to a specified or unspecified depth by inlining, using the inline_recursion directive. Note that the compiler currently offers no way to control inlining at the call site rather than at the function definition.
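A minimal sketch of these annotations is shown below. The macro names NOINLINE and FORCEINLINE and the functions twice and combine are hypothetical; the non-Microsoft branch maps the keywords onto roughly equivalent GCC/Clang attributes only so that the example also builds with other compilers.

```c
// Map the Visual C++ keywords onto rough GCC/Clang equivalents.
// NOINLINE and FORCEINLINE are hypothetical macro names.
#if defined(_MSC_VER)
#  define NOINLINE    __declspec(noinline)
#  define FORCEINLINE static __forceinline
#else
#  define NOINLINE    __attribute__((noinline))
#  define FORCEINLINE static inline __attribute__((always_inline))
#endif

// A hint only: the compiler may still decide not to inline this function.
static inline int square(int x) { return x * x; }

// Kept as a real call at every call site.
NOINLINE int cube(int x) { return x * x * x; }

// Inlined whenever the compiler is able to do so.
FORCEINLINE int twice(int x) { return x + x; }

// Calls all three; only the call to cube is guaranteed to remain a call.
int combine(int x) { return square(x) + cube(x) + twice(x); }
```

All of the annotations sit on the function definitions, which matches the remark above that inlining cannot be controlled per call site.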
The /Ob0 switch disables inlining completely, which is useful during debugging (this switch is in effect in the Debug configuration in Visual Studio). The /Ob1 switch tells the compiler that only functions marked with inline, __inline or __forceinline should be considered as candidates for inlining. The /Ob2 switch takes effect only when /O[1|2|x] is specified and tells the compiler to consider all functions for inlining. In my opinion, the only reason to use the inline and __inline keywords is to control inlining under the /Ob1 switch.

/Gy
(function-level linking) and /Gw (global data optimization). These sections are called COMDATs. You can also mark a given global variable with __declspec(selectany) to tell the compiler to package the variable in a COMDAT. Further, using the linker /OPT:REF key, you can get rid of unused functions and global variables. The /OPT:ICF key folds identical functions and global constants into one (ICF stands for Identical COMDAT Folding). The /ORDER key forces the linker to place COMDATs in the final image in a specific order. Note that none of these linker optimizations requires the /GL key. The /OPT:REF and /OPT:ICF keys must be turned off during debugging for obvious reasons.
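The kind of code these linker options act on can be sketched as follows. All function names are hypothetical, and whether folding or removal actually happens depends on the switches used and on the linker's own decisions; the comments describe candidates, not guaranteed results.

```c
// Two functions with identical bodies. When each function is compiled into
// its own COMDAT (/Gy), /OPT:ICF may fold them into a single copy.
int areaOfSquare(int side) { return side * side; }
int squareOfNumber(int n)  { return n * n; }

// Not referenced by the other functions in this file. If nothing else in
// the program references it either, /OPT:REF can leave it out of the image.
int unusedHelper(int x) { return x + 1; }

// Uses both identical functions; the result is the same whether or not
// the linker folded them.
int compute(int v) { return areaOfSquare(v) + squareOfNumber(v); }
```

One consequence of folding is that two distinct functions may end up with the same address, which is one reason to turn these options off while debugging.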
If you change source1.c so that m is passed to sumOfCubes instead of n, then the compiler will not be able to evaluate the call at compile time and will have to compile the function so that it works for any argument. The resulting function will be heavily optimized, which makes it large, so the compiler will not inline it.

/O1
key, no optimizations will be applied to sumOfCubes. Compiling with the /O2 key enables speed optimizations. The code size will increase significantly, because the loop inside the sumOfCubes function will be unrolled and vectorized. It is very important to understand that vectorization would not be possible without inlining the cube function. Moreover, loop unrolling would also be less effective without inlining. A simplified graphical representation of the resulting code is shown in the following picture (this graph is valid for both x86 and x86-64).

sumOfCubes
function is sumOfCubes. If SSE4 is supported and x is greater than or equal to 8, then SSE4 instructions are used to perform four multiplications at a time. Performing the same operation on several values at once is called vectorization. The compiler also unrolls this loop twice, which means that the loop body is repeated twice in each iteration. As a result, eight multiplications are performed in one iteration. If x is less than 8, then the code without these optimizations is used. Note that the compiler inserts three exit points instead of one, thus reducing the number of jumps.
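The shape of the optimized function can be modeled in plain C. This is a hand-written sketch, not the code the compiler generates: the four-element array lanes stands in for one SIMD register, and the names sumOfCubesScalar and sumOfCubesUnrolled are hypothetical.

```c
// Straightforward version: source2.c's loop with cube already inlined.
int sumOfCubesScalar(int x) {
    int result = 0;
    for (int i = 1; i <= x; ++i) result += i * i * i;
    return result;
}

// Model of the vectorized loop, unrolled twice.
int sumOfCubesUnrolled(int x) {
    if (x < 8) return sumOfCubesScalar(x); // short inputs use the plain loop

    int lanes[4] = {0, 0, 0, 0}; // stands in for a 4-wide SIMD accumulator
    int i = 1;
    for (; i + 7 <= x; i += 8) { // eight values per iteration
        for (int k = 0; k < 4; ++k) { // first group of four multiplications
            int v = i + k;
            lanes[k] += v * v * v;
        }
        for (int k = 0; k < 4; ++k) { // second group: the unrolled copy
            int v = i + 4 + k;
            lanes[k] += v * v * v;
        }
    }
    int result = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i <= x; ++i) result += i * i * i; // leftover iterations
    return result;
}
```

The two inner loops correspond to two vector operations on four values each, which is where the eight multiplications per iteration come from.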
The instruction set that the compiler may use is selected with the /arch switch. By specifying /arch:AVX2, you tell the compiler to also use the FMA and BMI instructions.

__forceinline
and loop directives with the no_vector option (the latter turns off vectorization of the specified loops).
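Applied to a loop like the one in sumOfCubes, the loop directive could be used as in the sketch below. The function name sumOfCubesNoVector is hypothetical, and the pragma is guarded because it is specific to Microsoft's compiler.

```c
// Sum of cubes with auto-vectorization switched off for this one loop.
int sumOfCubesNoVector(int x) {
    int result = 0;
#if defined(_MSC_VER)
#pragma loop(no_vector) // applies to the loop that follows immediately
#endif
    for (int i = 1; i <= x; ++i) {
        result += i * i * i;
    }
    return result;
}
```

The result is the same with or without the directive; only the generated code differs.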
The sumOfCubes function is not the only one whose loop has been unrolled. If you modify the code and pass m to the sum function instead of n, the compiler cannot evaluate the call at compile time and has to generate code for the function; its loop will be unrolled twice.

int sum(int x) {
  int result = 0;
  int count = 0;
  for (int i = 1; i <= x; ++i) {
    ++count;
    result += i;
  }
  printf("%d", count);
  return result;
}
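This code can be optimized by moving the increment of count out of the loop, that is, by simply assigning count the value of x. This optimization is called loop-invariant code motion. The word "invariant" indicates that the technique applies when part of the code does not depend on expressions that involve the loop variable.

There is a catch if you apply this optimization by hand. When x is not positive, the loop never executes and count is never touched, whereas the hand-optimized version still assigns x to count! Moreover, if x is negative, count ends up with the wrong value. The Visual C++ compiler avoids this pitfall by checking the loop condition before the assignment, so the code behaves correctly for every value of x.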
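The pitfall of moving the count update out of the loop by hand can be sketched as follows. The three function names are hypothetical, and the guarded variant only illustrates the idea of testing the loop condition first; it is not the compiler's actual output.

```c
// The loop as written: count is incremented on every iteration.
int countWithLoop(int x) {
    int count = 0;
    for (int i = 1; i <= x; ++i) ++count;
    return count;
}

// Naive manual rewrite: replaces the loop with count = x.
// This is wrong for negative x, where the loop would not run at all.
int countHoistedNaive(int x) {
    int count = x;
    return count;
}

// Guarded rewrite: check the loop's entry condition before assigning.
int countHoistedGuarded(int x) {
    int count = 0;
    if (1 <= x) count = x; // same condition as the first loop test
    return count;
}
```

For positive x all three functions agree; they differ only when the loop body would never have executed.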
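In addition to the /O1, /O2 and /Ox switches, you can control optimizations for specific functions with the optimize pragma: #pragma optimize( "[optimization-list]", {on | off} )

The optimization list can be empty or contain one or more of the values g, s, t and y, which correspond to the /Og, /Os, /Ot and /Oy switches.

An empty list with off turns all of these optimizations off, regardless of the switches passed to the compiler. An empty list with on applies the optimizations specified by the compiler switches.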
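A short sketch of the optimize pragma with an empty list is shown below. The function names sumUnoptimized and sumDefault are hypothetical, and the pragma is guarded because it is specific to Microsoft's compiler; other compilers build both functions with the command-line settings.

```c
#if defined(_MSC_VER)
#pragma optimize("", off) // empty list + off: optimizations disabled from here
#endif
// Compiled without optimizations when built with Visual C++.
int sumUnoptimized(int x) {
    int result = 0;
    for (int i = 1; i <= x; ++i) result += i;
    return result;
}
#if defined(_MSC_VER)
#pragma optimize("", on) // empty list + on: back to the command-line settings
#endif

// Compiled with whatever the compiler switches request.
int sumDefault(int x) {
    int result = 0;
    for (int i = 1; i <= x; ++i) result += i;
    return result;
}
```

The pragma has to appear outside a function and takes effect at the first function defined after it, which is why it brackets sumUnoptimized here.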
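The /Og switch enables global optimizations. When LTCG is enabled, /Og also enables WPO.

The optimize pragma is useful when different functions need to be optimized in different ways: some for size, others for speed. If you really need that level of control, however, consider profile-guided optimization (PGO), for which Visual Studio provides the necessary tools.

Because it works at run time, a JIT compiler can determine, for example, that a condition is never true in a particular run and optimize it away. RyuJIT can also take advantage of the processor it runs on. For example, if the processor supports SSE4.1, the JIT compiler emits only SSE4.1 instructions for the sumOfCubes function, which makes the generated code more compact. On the other hand, RyuJIT cannot spend much time optimizing, because the time spent on JIT compilation affects application performance. The Visual C++ compiler can spend far more time looking for optimization opportunities. Microsoft's .NET Native technology compiles managed code into executables optimized with the Visual C++ back end. At the time of writing it supports only Windows Store apps.

For managed code, optimizations are turned on or off with the /optimize switch. To control the JIT compiler, you can apply the System.Runtime.CompilerServices.MethodImpl attribute with a value from MethodImplOptions. NoOptimization turns optimizations off, NoInlining prevents the method from being inlined, and AggressiveInlining (.NET 4.5) recommends that the JIT compiler inline the method.

Source: https://habr.com/ru/post/250199/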