Among the many loop optimizations, one of the most effective is the technique of splitting a loop into blocks (loop blocking). Its essence is to reshape the iteration space so as to work with memory more efficiently, that is, to reduce cache misses. For exactly this purpose a special directive has appeared in the latest version of the compiler that lets you control this optimization. But first things first.
We all know that modern processors have a cache: a small, fast memory designed to minimize accesses to RAM. It sits between the processor and main memory and stores a copy of part of the data in RAM:
Cache memory is divided into several levels; each successive level is larger but slower to access and transfer data than the previous one. The fastest, first-level cache (L1) is split in two: an instruction cache and a data cache. The L2 cache, however, is shared, meaning each core can use as much of it as it needs; if all the cores are in use, the cache is divided among them dynamically, depending on the load.
The processor accesses memory in small blocks called cache lines; in fact, the cache consists of them. A typical line size is 64 bytes. When any value is read from memory, at least one whole cache line is brought into the cache, and subsequent accesses to any value in that line are very fast. Moving to another line takes more time, and if the data is absent from the cache altogether, the result is a serious performance hit from loading and evicting data.
Thus, in terms of performance, it would be ideal if our entire array fit in the cache and its elements were accessed sequentially, row by row.
It is also important to mention data locality. Memory accesses exhibit temporal and spatial locality. If a memory location is accessed, it is likely to be needed again soon; this is temporal locality.
Spatial locality means that when a memory location is accessed, its neighbors are likely to be accessed as well.
It is no accident that data is loaded not byte by byte but in 64-byte cache lines. Having touched one array element in a loop, we get several of its neighbors "ready" at once, and with sequential access we can work with them as efficiently as possible, exploiting spatial locality.
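As a quick illustration (my own sketch, not part of the original example; N is an arbitrary size), compare row-wise and column-wise traversal of a C array:

#define N 1024   /* illustrative size, an assumption for this sketch */

/* Row-wise traversal: a[i][j] and a[i][j+1] are adjacent in memory, so one
   loaded 64-byte cache line serves eight consecutive doubles.              */
double sum_by_rows(double a[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-wise traversal: consecutive accesses are N*8 bytes apart, so each
   access touches a different cache line and spatial locality is lost.      */
double sum_by_cols(double a[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}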
Temporal locality in loops comes from reusing the same data across several iterations. The loop-blocking technique exploits temporal locality by ensuring that once data has been loaded into the cache, it is not evicted until we are done with it.
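Here is a small sketch of temporal locality (again my own illustration, with an arbitrary N): in a matrix-vector product the vector x is reused on every iteration of the outer loop, so once it is in the cache it stays hot.

#define N 1024   /* illustrative size, an assumption for this sketch */

/* y = A*x: the vector x is reused on every iteration of the outer loop.
   If x fits in the cache, it is fetched from RAM once and then served from
   the cache for all N rows - temporal locality in action. Loop blocking
   aims to create this kind of reuse for data that would otherwise be
   evicted between uses.                                                    */
void matvec(const double a[N][N], const double x[N], double y[N]) {
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            s += a[i][j] * x[j];   /* x[j] is touched again for every i */
        y[i] = s;
    }
}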
Now consider an example. Suppose we have a certain loop that works with arrays:
double A[MAX][MAX], B[MAX][MAX];

for (i = 0; i < MAX; i++)
    for (j = 0; j < MAX; j++)
        A[i][j] = A[i][j] + B[j][i];
On the first iteration, accessing B[0][0] loads the cache line containing B[0][0]..B[0][7] (the line is 64 bytes, and each element is 8 bytes). But the inner loop advances index j, which selects the row of B, so every iteration jumps to a different row and therefore to a different cache line, never reusing the one already loaded: consecutive accesses B[j][i] and B[j+1][i] are MAX*8 bytes apart and can never share a line. Provided MAX is large enough, and given the limited cache size, the line loaded at the start will be evicted before we return to it on the next iteration of the outer loop. As a result, every access to an element of B is a cache miss, and reuse of B's data in the cache simply does not happen.
The problem can be solved by working on blocks of the arrays that fit into the cache.
for (i = 0; i < MAX; i += block_size)
    for (j = 0; j < MAX; j += block_size)
        for (ii = i; ii < i + block_size; ii++)
            for (jj = j; jj < j + block_size; jj++)
                A[ii][jj] = A[ii][jj] + B[jj][ii];
In this case, arrays A and B are processed in rectangular blocks chosen so that a block of A plus a block of B does not exceed the cache size:
For example, if block_size is 8, each block will occupy 8 cache lines of 64 bytes each. On the first iteration of the inner loop, the lines holding A[0][0]..A[0][7] and B[0][0]..B[0][7] are cached. On the first iteration of the outer loop we still get misses, but now one per cache line instead of eight.
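As a rough back-of-the-envelope sketch (my own, assuming a 32 KB L1 data cache rather than anything stated above), you can size the block so that two square tiles fit in the cache:

#include <stddef.h>

/* Assumed L1 data cache size; adjust for your CPU. */
#define L1_DCACHE_BYTES (32 * 1024)

/* Pick the largest power-of-two block size such that two square tiles of
   'elem_size'-byte elements still fit in the L1 data cache.               */
static size_t pick_block_size(size_t elem_size) {
    size_t bs = 1;
    while (2 * (bs * 2) * (bs * 2) * elem_size <= L1_DCACHE_BYTES)
        bs *= 2;
    return bs;   /* e.g. 32 for 8-byte doubles and a 32 KB L1 */
}

For 8-byte doubles this gives 32, so block sizes like the 8 used above or the 16 used later are comfortably within budget.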
By the way, if MAX in this example is very large, this technique also helps avoid the overhead of misses in the translation lookaside buffer (TLB), which accelerates the translation of virtual memory addresses into physical ones. In addition, it saves bandwidth on the external bus.
It's time to build some simple code and demonstrate the effectiveness of the technique.
#include <time.h>
#include <stdio.h>

#define MAX 8000
#define BS 16       // Block size is selected as the loop-blocking factor

void add(int a[][MAX], int b[][MAX]);

int main()
{
    int i, j;
    int A[MAX][MAX];
    int B[MAX][MAX];
    clock_t before, after;

    // Initialize arrays
    for (i = 0; i < MAX; i++) {
        for (j = 0; j < MAX; j++) {
            A[i][j] = j;
            B[i][j] = j;
        }
    }

    before = clock();
    add(A, B);
    add(A, B);
    add(A, B);
    add(A, B);
    after = clock();

    printf("\nTime taken to complete : %7.2lf secs\n",
           (float)(after - before) / CLOCKS_PER_SEC);
}
Inside add we will, as the name suggests, add the arrays. First, without using blocks:
void add(int a[][MAX], int b[][MAX])
{
    int i, j;

    for (i = 0; i < MAX; i++)
        for (j = 0; j < MAX; j++)
            a[i][j] = a[i][j] + b[j][i];
}
And using block division:
void add(int a[][MAX], int b[][MAX])
{
    int i, j, ii, jj;

    for (i = 0; i < MAX; i += BS)
        for (j = 0; j < MAX; j += BS)
            for (ii = i; ii < i + BS; ii++)
                for (jj = j; jj < j + BS; jj++)
                    a[ii][jj] = a[ii][jj] + b[jj][ii];
}
We build all this with a sufficiently large stack (the two 8000x8000 int arrays live on the stack and take about half a gigabyte):
icl /Qoption,link,"/STACK:1000000000" test.cpp
The regular version took 4.67 seconds to run versus 1.13 with blocking. A more than impressive win.
And now for the reason I started talking about loop blocking in the first place. The latest version of the compiler has a directive that can make life easier and save us from having to carve out the blocks by hand: the block_loop directive.
C/C++:
#pragma block_loop [clause[,clause]...]
#pragma noblock_loop
Fortran:
!DIR$ BLOCK_LOOP [clause[[,] clause]...]
!DIR$ NOBLOCK_LOOP
The clauses let you control the block size (factor) and the loop nesting level (level) to which it applies (from 1 to 8). If you specify nothing, suitable block sizes for your processor are computed by default. You can set a separate block size for each loop. For example:
#pragma block_loop factor(256) level(1)
#pragma block_loop factor(512) level(2)
This sets the block size to 256 for the outer loop and 512 for the inner loop (assuming a doubly nested loop). Now consider this example:
#pragma block_loop factor(256) level(1:2)
for (j = 1; j < n; j++) {
    f = 0;
    for (i = 1; i < n; i++) {
        f = f + a[i] * b[i];
    }
    c[j] = c[j] + f;
}
In this case, the directive converts our code like this:
for (jj = 1; jj < n/256 + 1; jj++) {
    for (ii = 1; ii < n/256 + 1; ii++) {
        for (j = (jj-1)*256 + 1; j < min(jj*256, n); j++) {
            f = 0;
            for (i = (ii-1)*256 + 1; i < min(ii*256, n); i++) {
                f = f + a[i] * b[i];
            }
            c[j] = c[j] + f;
        }
    }
}
Now let's go back to the example for which I ran the tests and modify the add function by adding the #pragma block_loop directive.
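Here is a minimal sketch of what the modified function might look like (the exact placement of the pragma and the absence of factor/level clauses are my assumptions; with no clauses the compiler picks the block sizes itself):

void add(int a[][MAX], int b[][MAX])
{
    int i, j;
#pragma block_loop              // no clauses: let the compiler choose block sizes
    for (i = 0; i < MAX; i++)
        for (j = 0; j < MAX; j++)
            a[i][j] = a[i][j] + b[j][i];
}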
Rebuilding the code with the same options, I got a completely unexpected result: nothing in the application's speed changed at all. The culprit is the default O2 option, at which the compiler does not perform a number of loop optimizations, loop blocking among them. For the directive to take effect, you need to set the highest optimization level, O3:
icl /O3 /Qoption,link,"/STACK:1000000000" test.cpp
This time I got 0.83 seconds, an improvement over what I had done by hand. To be fair, I had not tuned the block size there at all, just demonstrated the technique, so there was room to grow. By the way, to make sure the directive really matters, I also rebuilt with O3 but without it. As expected, performance did not improve, and I got the same sad 4.69 seconds.
It seems to me that this directive will be very useful, given the almost magical effect it has. Note that for now it is available in the 16.0 beta, whose release is just around the corner. Go and try it!