Most of the program execution time falls on cycles: it can be calculations, reception and processing of information, etc. Proper application of optimization techniques cycles will increase the speed of the program. But before embarking on optimizations, it is necessary to identify the “narrow” places of the program and try to find the reasons for the drop in performance.
The execution time of the code in cycles depends on the organization of the memory, the processor architecture, including the supported set of instructions, pipelines, caches and programmer's experience.
Consider some methods of loop optimization: loop unrolling, loop fusion, loop distribution, loop alignment, loop interchange, loop blocking.
Before applying any optimization, do the simplest thing:
remove all variables from the loop that do not change in it .
What reasons can lead to a decrease in the speed of the program in cycles?
- Loop iterations are dependent and cannot be executed in parallel.
- The loop body is large and too many registers are required.
- The body of the cycle or the number of iterations is small and profitable to completely abandon the use of the cycle.
- The loop contains calls to functions and procedures from third-party libraries.
- The loop intensively uses any one processor execution unit.
- In the loop there are conditional transitions.
Sweep cycles
This optimization is performed when the body of the loop is small. It is necessary to more effectively use executing devices at each iteration. Therefore, repeatedly duplicate the body of the cycle, depending on the number of performing devices. But such optimization can cause a data dependency, in order to get rid of it, additional variables are introduced.
Before | After | After # 2 |
for ( int i = 0; i < iN; i++){ res *= a[i]; } | for ( int i = 0; i < iN; i+=3){ res *= a[i]; res *= a[i+1]; res *= a[i+2]; } | for ( int i = 0; i < iN; i+=3){ res1 *= a[i]; res2 *= a[i+1]; res3 *= a[i+2]; } res = res1 * res2 * res3; |
The following keys can be used in gcc:
-funroll-all-loops -funroll-loops .
Cycle integration
The loop can have long-running instructions, for example, extracting square roots. Or there are several cycles that run on the same interval of indices. Therefore, it is advisable to combine cycles for a more balanced load of performing devices.
Before | After |
for ( int i = 0; i < iN; i++){ a[i] = b[i] - 5; } for ( int i = 0; i < iN-1; i++){ d[i] = e[i] * 3; } | for ( int i = 0; i < iN-1; i++){ a[i] = b[i] - 5; d[i] = e[i] * 3; } a[iN-1] = b[iN-1] - 5; |
Cutting cycles
This optimization is used when the loop body is large and the variables are not enough registers. Therefore, the data is first pushed into the cache, and if absolutely everything is bad, then in the RAM.
And access to RAM takes about 300 processor cycles, and access to L2 is only ~ 10. Access to memory with a big step slows down the program even more. It is optimal to “walk” in memory in increments of
2 n , where n is a rather small number (<7).
Before | After |
for ( int j = 0; j < jN; j++){ for ( int k = 0; k < kN; k++){ for ( int m = 0; m < mN; m++){ i = j * k + m; a[i] = b[i] * c[i] + f[i]/e[i]+ x[i] - y[i] + z[i]/r[i] + d[i] * x[i]; } } } | for ( int j = 0; j < jN; j++){ for ( int k = 0; k < kN; k++){ double tmp; for ( int m = 0; m < mN; m++){ i = j * k + m; tmp = b[i] * c[i] + f[i]/e[i]; a[i] = tmp - y[i] + z[i]/r[i] + (d[i] + 1) * x[i]; } } } |
Swap cycles
In nested loops, the order of nesting is important. Therefore, it is necessary to remember how arrays are stored in memory. A classic example: c / c ++ store matrices line by line, and fortran - column by column.
Before | After |
for ( int i = 0; i < iN; i++){ for ( int j = 0; j < jN; j++){ for ( int k = 0; k < kN; k++){ c[i][j] = c[i][j] + a[i][k] * b[k][j]; } } } | for ( int i = 0; i < iN; i++){ for ( int k = 0; k < kN; k++){ for ( int j = 0; j < jN; j++){ c[i][j] = c[i][j] + a[i][k] * b[k][j]; } } } |
Now calls to the array
a go sequentially.
')
Splitting cycles into blocks
If the loop body is complex, then this optimization can be applied to better locate the data in memory and improve the use of caches. The result of the optimization depends heavily on the processor architecture.
Before | After |
for ( int i = 0; i < iN; i++){ for ( int j = 0; j < jN; j++){ a[i][j] = a[i][j] + b[i][j]; } } | // int iBlk, jBlk; for ( int k = 0; k < iN/iBlk; k++){ for ( int m = 0; m < jN/jBlk; m++){ for ( int i = k * iBlk; i < ((k + 1) * iBlk); i++){ for ( int j = m * jBlk; j < ((m + 1) * jBlk); j++){ a[i][j] = a[i][j] + b[i][j]; } } } }
|
About this principle, MPI technology works: divides large arrays into blocks and sends them to individual processors.
Dependency resolution
The best solution is to get rid of. But not with all dependencies it will turn out.
for ( int i = 1; i < N; i++){
a[i] = a[i-1] + 1;
}
For this example, it is better to apply the scan, because the result of the calculation will remain on the registers. But the majority of such cycles cannot be fully optimized (or parallelized), the result still depends on the previous round of the cycle.
To check the cycle for independence, reverse the direction of motion in the cycle. If the result of the calculations has not changed, then the iterations of the cycle are independent.
Regarding conditional transitions
Loss of time occurs due to errors in the prediction of transitions, because you have to roll back the pipeline. Therefore, it is best to abandon the conventional structures. But, if this is impossible, you need to try to facilitate the work of the transition prediction module. To do this,
place the most likely branches at the beginning of the branch and, if possible, move the conditional constructions beyond the cycle.
Instead of conclusion
If you create a complex program that will take a lot of CPU time, then
- Familiarize yourself with the processor architecture (find out how many and what execution devices it has, how many pipelines, L1 and L2 cache sizes).
- Try compiling a program with different compilers and with different keys.
- Consider the impact of the operating system.
I also advise you to read this
article .
From my own experience, I can say that the proper application of optimizations can improve the speed of the program at times.
If you want to practice optimization yourself, then try to calculate the number Pi:

Below is the “bad” code.
long N = 10000000;
double dx, sum, x;
sum = 0.0;
x = 0.0;
dx = 1.0 / ( double ) N;
for ( long i = 0; i < N; i++){
sum += 4.0 / (1.0 + x * x);
x += dx;
}
double pi = dx * sum;
What I didn’t tell about: vectorization of calculations (SSE instructions); about prefetchs that make it easier to work with memory. If there are people “who are interested”, I will write a separate article about the vectorization of cycles.
Source Code Highlighter source code highlights .