
Profile-guided optimization (hereinafter PGO) is a compiler optimization technique aimed at increasing program performance. Unlike traditional optimization methods, which analyze only the source code, PGO uses measurements from test runs of the program being optimized to generate better code.
Test runs reveal which parts of the program are executed frequently and which rarely. The advantage of this approach is that the compiler does not make assumptions when choosing an optimization strategy, but relies on real data collected while the program runs. Keep in mind that the test runs must follow the most typical usage scenario so that the statistics are representative; otherwise the program's performance may even decrease.
PGO can include the following types of optimizations (source):
- Inlining - for example, if function A frequently calls function B, and B is sufficiently small, then B is inlined into A. This is done based on actual statistics from program runs.
- Virtual Call Speculation - if a virtual call, or a call through a function pointer, frequently targets a particular function, it can be replaced with a conditionally executed direct call to that function, and the call can even be inlined.
- Register Allocation - register allocation is optimized based on the collected data.
- Basic Block Optimization - basic blocks that commonly execute together are placed in the same memory page, which minimizes the number of pages used and reduces memory overhead.
- Size/Speed Optimization - functions in which the program spends a significant portion of its time can be optimized for execution speed.
- Function Layout - based on the call graph, functions that belong to the same execution chain are placed in the same section.
- Conditional Branch Optimization - optimization of branches and switch statements. Based on the test runs, PGO determines which case values of a switch statement are taken more often than others; those values can then be pulled out of the switch statement. The same applies to if/else: the compiler can order the branches according to which one is taken more often.
- Dead Code Separation - code that was never called during the test runs can be moved to a special section so that it does not end up in frequently used memory pages.
- EH Code Separation - exception handling code, which runs only in exceptional cases, can be moved to a separate section when it can be determined that exceptions are thrown only under specific conditions.
- Memory Intrinsics - whether to expand a memory intrinsic (such as memcpy) inline can be decided based on how frequently it is called; an intrinsic can also be optimized for the block sizes it typically operates on.
I will describe the easiest way to use PGO with the GCC compiler. PGO support in GCC comes down to two flags: -fprofile-generate and -fprofile-use. The general compilation scheme looks like this:
- Compile the program with all optimization flags and the -fprofile-generate flag. This flag must be set to both the compiler and the linker. For example:
g++ -O3 -march=native -mtune=native -fprofile-generate -Wall -c -fmessage-length=0 -MMD -MP -MF "src/pgo-1.d" -MT "src/pgo-1.d" -o "src/pgo-1.o" "../src/pgo-1.cpp"
g++ -fprofile-generate -o "pgo-1" ./src/pgo-1.o
- After a successful compilation, perform a test run of the program with the most typical use case. If everything is done correctly, the test run will produce a statistics file with the .gcda extension.
- Compile the program with all optimization flags and the -fprofile-use flag. This flag must be set to both the compiler and the linker. For example:
g++ -O3 -march=native -mtune=native -fprofile-use -Wall -c -fmessage-length=0 -MMD -MP -MF "src/pgo-1.d" -MT "src/pgo-1.d" -o "src/pgo-1.o" "../src/pgo-1.cpp"
g++ -fprofile-use -o "pgo-1" ./src/pgo-1.o
In this case, gcc will use the statistics file created in step 2, or report that the file was not found.
That is the whole scheme. However, no optimization should be applied blindly. Before optimizing anything, first create a test environment that allows you to reliably evaluate the usefulness of particular optimization flags.
Let us examine the effectiveness of PGO on a simple example. The source code of the program under study:
#include <iostream>
#include <algorithm>
#include <stdlib.h>

const size_t MB = 1024 * 1024;
size_t MOD = 0;

unsigned char uniqueNumber() {
    static unsigned char number = 0;
    return ++number % MOD;
}

int main(int argc, char** argv) {
    if (argc < 3) {
        return 1;
    }
    size_t BLOCK_SIZE = atoi(argv[1]) * MB;
    MOD = atoi(argv[2]);
    unsigned char* garbage = (unsigned char*) malloc(BLOCK_SIZE);
    std::generate_n(garbage, BLOCK_SIZE, uniqueNumber);
    std::sort(garbage, garbage + BLOCK_SIZE);
    free(garbage);
    return 0;
}
The program creates an unsigned char array of several megabytes, depending on the first parameter passed. It then fills the array with a repeating sequence of values determined by the second parameter, and finally sorts it. For example:
./prog 32 3 - creates a 32 MB array filled according to the pattern {1, 2, 0, 1, 2, 0, ...}, then sorts it.
./prog 16 7 - creates a 16 MB array filled according to the pattern {1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, ...}, then sorts it.
Thus, by setting different parameters, we can influence the frequency of conditional branches taken while sorting the array, as well as the size of the data being processed. This lets us test the Conditional Branch Optimization described earlier. With a simple script, I carried out 1792 tests with various parameters and plotted the results in the graphs below. The array size varied over {2, 4, 8, 16, 32, 128, 256}, and the divisor over {1..256}.
Performance Percentage (Cumulative)
In this graph the percentages are stacked, so look at the area filled with each color rather than at the absolute values on the y-axis. The larger the area, the greater the performance gain. The graph clearly shows the performance gain for different array sizes.

Performance Percentage (idle)
The y-axis shows the percentage performance gain relative to the non-optimized program. The x-axis shows the divisor used.

Absolute performance values
The graphs below show the program execution time in seconds on the y-axis, and the divisor used on the x-axis. In the legend, -PGO means built without PGO, and +PGO means built with PGO.







Findings
For this particular program, PGO-based optimization is extremely useful and gives a steady performance increase of 5-25%.
The file with the source code, the testing script, and the graphing script can be downloaded.
The archive structure is as follows:
pgo-1 / fprofile-generate - make build script with the -fprofile-generate flag
pgo-1 / fprofile-use - make build script with the -fprofile-use flag
pgo-1 / Release - make script to build a regular version
pgo-1 / src - source code
pgo-1 / graph.pl - graph generation script (reads the super_log file)
pgo-1 / run.sh - script to run a test cycle
graph.pl works with a log file whose format corresponds to the output of the run.sh script.
For example, you can run like this:
./run.sh > log 2>&1
tail log | grep -A2 PGO
UPD. The test project was built with the flags -O3 -march=native -mtune=native without PGO, and -O3 -march=native -mtune=native plus the PGO flags with PGO. All graphs show the gain relative to -O3 -march=native -mtune=native.
Mirror Article