
GCC Profile-guided optimization

Profile-guided optimization (hereinafter PGO) is a compiler technique aimed at increasing a program's performance. Unlike traditional optimizations, which analyze only the source code, PGO feeds measurements from test runs of the program back into the compiler to generate better code. Test runs reveal which parts of the program execute frequently and which rarely. The advantage of this approach is that the compiler does not have to guess when choosing an optimization: it relies on real data collected during program execution. Keep in mind that the test runs must follow the program's most typical usage scenario so that the statistics are representative; otherwise performance may even decrease.

PGO can drive several kinds of optimization, such as conditional branch optimization, inlining decisions, and basic-block layout (the original article links to a source with the full list).

I will describe the easiest way to use PGO with the GCC compiler. PGO support in GCC is exposed through two flags, -fprofile-generate and -fprofile-use. The general compilation scheme looks like this:
  1. Compile the program with all optimization flags plus -fprofile-generate. This flag must be passed to both the compiler and the linker. For example:
    g++ -O3 -march=native -mtune=native -fprofile-generate -Wall -c -fmessage-length=0 -MMD -MP -MF "src/pgo-1.d" -MT "src/pgo-1.d" -o "src/pgo-1.o" "../src/pgo-1.cpp"
    g++ -fprofile-generate -o "pgo-1" ./src/pgo-1.o

  2. After a successful build, perform a test run of the program with the most typical use case. If everything is done correctly, the test run will produce a statistics file with the .gcda extension.
  3. Compile the program with all optimization flags plus -fprofile-use. This flag must be passed to both the compiler and the linker. For example:
    g++ -O3 -march=native -mtune=native -fprofile-use -Wall -c -fmessage-length=0 -MMD -MP -MF "src/pgo-1.d" -MT "src/pgo-1.d" -o "src/pgo-1.o" "../src/pgo-1.cpp"
    g++ -fprofile-use -o "pgo-1" ./src/pgo-1.o

    In this step gcc uses the statistics file created in step 2, or reports that the file was not found.

That is the whole scheme. However, it is not always worthwhile to apply an optimization blindly. Before optimizing anything, first create a test environment that lets you effectively evaluate the usefulness of particular optimization flags.

Let us examine the effectiveness of PGO on a simple example. The code of the program under study:
#include <iostream>
#include <algorithm>
#include <stdlib.h>

const size_t MB = 1024 * 1024;
size_t MOD = 0;

// Produces the repeating byte sequence 1, 2, ..., MOD-1, 0, 1, ...
unsigned char uniqueNumber() {
    static unsigned char number = 0;
    return ++number % MOD;
}

int main(int argc, char** argv) {
    if (argc < 3) {
        return 1;
    }

    size_t BLOCK_SIZE = atoi(argv[1]) * MB; // array size in megabytes
    MOD = atoi(argv[2]);                    // divisor that controls repetition

    unsigned char* garbage = (unsigned char*)malloc(BLOCK_SIZE);

    std::generate_n(garbage, BLOCK_SIZE, uniqueNumber);
    std::sort(garbage, garbage + BLOCK_SIZE);

    free(garbage);

    return 0;
}


The program allocates an unsigned char array of several megabytes, sized by the first parameter. It then fills the array with a repeating sequence of byte values determined by the second parameter, and finally sorts the resulting array. For example:
./prog 32 3 - creates a 32 MB array filled with the repeating pattern {1, 2, 0, 1, 2, 0, ...}, then sorts it.

./prog 16 7 - creates a 16 MB array filled with the repeating pattern {1, 2, 3, 4, 5, 6, 0, 1, ...}, then sorts it.
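The generator's behavior can be reproduced outside the program as a sanity check: uniqueNumber pre-increments a counter and takes it modulo the divisor, so the values cycle through 1, 2, ..., MOD-1 and then 0. A small sketch with the divisor hard-coded to 3 for illustration:

```shell
# Reproduce the fill pattern of uniqueNumber() for divisor 3:
# the counter is pre-incremented, then taken modulo the divisor.
mod=3
n=1
out=""
while [ "$n" -le 9 ]; do
    out="$out$((n % mod)) "
    n=$((n + 1))
done
echo "$out"   # first nine generated values: 1 2 0 1 2 0 1 2 0
```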

Thus, by varying the parameters we can influence both the frequency of conditional branches during sorting and the size of the data being processed. This lets us test the conditional branch optimization mentioned earlier. With a simple script I ran 1792 tests with various parameters and plotted the results below. The array size took the values {2, 4, 8, 16, 32, 128, 256} and the divisor {1..256}.
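The sweep itself can be sketched as a pair of nested loops over that grid. This is a hypothetical stand-in for the article's actual run.sh (which ships in the archive); PROG defaults to true here so the sketch runs standalone, but for a real sweep point it at the compiled binary:

```shell
#!/bin/sh
# Hypothetical parameter sweep over the article's grid:
# 7 array sizes x 256 divisors = 1792 runs.
PROG=${PROG:-true}   # set PROG=./prog to time the real binary
count=0
for size in 2 4 8 16 32 128 256; do
    for mod in $(seq 1 256); do
        start=$(date +%s%N)
        $PROG "$size" "$mod"
        end=$(date +%s%N)
        echo "size=$size mod=$mod ns=$((end - start))"
        count=$((count + 1))
    done
done
echo "total runs: $count"
```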

Performance Percentage (Cumulative)


On this graph the percentages overlap: look at the area filled with a specific color rather than at the absolute values on the y-axis. The larger the area, the greater the performance gain. The graph clearly shows the performance gain for different array sizes.


Performance Percentage (individual)


The y-axis shows the performance gain, as a percentage, relative to the non-optimized program. The x-axis shows the divisors used.


Absolute performance values


The graphs below show the program execution time in seconds on the y-axis and the divisor used on the x-axis. In the legend, -PGO means built without PGO and +PGO means built with PGO.


Findings


For this particular program, PGO-based optimization is extremely useful and gives a steady performance increase of 5-25%.

The file with the source code, the testing script, and the graphing script can be downloaded.

The archive structure is as follows:
pgo-1/fprofile-generate - make build script with the -fprofile-generate flag
pgo-1/fprofile-use - make build script with the -fprofile-use flag
pgo-1/Release - make script to build the regular version
pgo-1/src - source code
pgo-1/graph.pl - graph generation script (reads the super_log file)
pgo-1/run.sh - script to run a test cycle

graph.pl works with a log file whose format corresponds to the output of run.sh.
For example, you can run it like this:
./run.sh > log 2>&1
tail log | grep -A2 PGO


UPD. The test project was built with the flags -O3 -march=native -mtune=native without PGO, and -O3 -march=native -mtune=native plus the PGO flags with PGO. All graphs show the gain relative to -O3 -march=native -mtune=native.


Source: https://habr.com/ru/post/138132/

