In modern compilers, loop vectorization is an important and necessary task. In most cases, successful vectorization can significantly increase application performance. There are many ways to achieve it, and even more subtleties involved in actually getting the expected “speedup” of our application.
Today we will talk about data alignment, its impact on performance and vectorization, and how to work with it in the compiler. The concept itself, along with many other nuances, is covered in great detail in
this article; here we are interested in the effect of alignment on vectorization. So, if you have read that article, or simply know how memory access works, the fact that data is read in blocks will not surprise you.
When we operate on array elements (and not only on them), we are actually constantly working with cache lines of 64 bytes each. SSE and AVX vectors always fall into a single cache line if they are aligned to 16 and 32 bytes, respectively. But if our data is not aligned, it is very likely that we will have to load an extra, “additional” cache line. This has a strong impact on performance, and if on top of that we access the array elements, and therefore memory, non-sequentially, everything can get even worse.
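As a rough illustration (a sketch of my own, not from the article), you can check in code whether an address is aligned and whether a 32-byte vector load starting at it would touch two cache lines; the function and constants here are purely illustrative:

#include <stdint.h>
#include <stdio.h>

/* Sketch: report the alignment of an address and whether a 32-byte AVX
   load starting at it stays within a single 64-byte cache line. */
static void check_alignment(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    printf("aligned to 32 bytes: %s\n", (addr % 32 == 0) ? "yes" : "no");
    /* The load crosses a cache line if its first and last bytes fall
       into different 64-byte blocks. */
    int crosses = (addr / 64) != ((addr + 31) / 64);
    printf("32-byte load crosses a cache line: %s\n", crosses ? "yes" : "no");
}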
In addition, the instructions themselves come in aligned and unaligned variants. If we see the letter u (unaligned) in an instruction name, it is most likely an unaligned read or write instruction, for example vmovupd. It is worth noting that since the Nehalem architecture the speed of these instructions has become comparable to the aligned ones, provided the data actually is aligned. On older architectures things were worse.
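In intrinsics terms this corresponds roughly to the pair _mm256_loadu_pd / _mm256_load_pd; here is a minimal sketch (the function copy4 is illustrative, not from the article):

#include <immintrin.h>

/* Sketch: unaligned vs aligned AVX loads. _mm256_load_pd requires the
   address to be 32-byte aligned and may fault otherwise, while
   _mm256_loadu_pd (vmovupd) accepts any address. */
void copy4(double *dst, const double *src)
{
    __m256d v = _mm256_loadu_pd(src);   /* safe for any alignment          */
    /* __m256d v = _mm256_load_pd(src);    aligned variant: src must be
                                           32-byte aligned                 */
    _mm256_storeu_pd(dst, v);
}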
The compiler can actively help us in the fight for performance. For example, it may try to split a 128-bit unaligned load into two 64-bit ones, which is better, but still slow. Another good solution the compiler can implement is to generate different versions for the aligned and unaligned cases: at run time it determines what kind of data we have, and execution proceeds through the appropriate version. The only problem is that the overhead of such checks may be too high, and the compiler will abandon the idea. Even better is when the compiler can align the data for us. By the way, if during vectorization the data is not aligned, or the compiler knows nothing about its alignment, the original loop is split into three parts (sketched in code after the list below):
- a number of iterations (always fewer than the vector length) before the main “kernel” (the peel loop), which the compiler can use to align the starting address. Peeling can be disabled with the mP2OPT_vec_alignment=6 option.
- the main body, the “kernel” loop, for which aligned vector instructions are generated
- the “tail” (remainder loop), which exists because the number of iterations is not divisible by the vector length; it may also be vectorized, but not as efficiently as the main loop. If we want to disable vectorization of the loop remainder, we use the #pragma vector novecremainder directive in C/C++ or !DIR$ vector noremainder in Fortran.
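A conceptual sketch of this split (my own illustration, not actual compiler output) for a simple multiplication loop might look like this:

#include <stdint.h>

/* Sketch: how a loop c[i] = a[i] * b[i] over n doubles is conceptually
   split when the alignment of a is unknown; 32-byte AVX vectors assumed. */
void mult_split(double *a, double *b, double *c, int n)
{
    int i = 0;

    /* Peel loop: scalar iterations until &a[i] is 32-byte aligned. */
    while (i < n && ((uintptr_t)&a[i] % 32) != 0) {
        c[i] = a[i] * b[i];
        i++;
    }

    /* Kernel loop: whole vectors, aligned accesses to a. */
    for (; i + 4 <= n; i += 4)
        for (int k = 0; k < 4; k++)          /* stands in for one AVX multiply */
            c[i + k] = a[i + k] * b[i + k];

    /* Remainder loop: leftover iterations when n is not a multiple of 4. */
    for (; i < n; i++)
        c[i] = a[i] * b[i];
}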
Thus, alignment of the starting address can be achieved at the cost of some speed: we have to “shuffle in place” before reaching the main loop kernel, executing a certain number of scalar iterations. But this can be avoided by aligning the data and telling the compiler about it.
Developers should make it a rule to align data “properly”: to 16 bytes for SSE, 32 for AVX and 64 for MIC & AVX-512. How can this be done?
To allocate aligned memory on the heap in C/C++, there is the function:
void* _mm_malloc(size_t size, size_t base)
On Linux there is also:
int posix_memalign(void **p, size_t base, size_t size)
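A minimal usage sketch (the function names alloc_aligned / alloc_aligned_posix are mine, and the 32-byte alignment is just an example):

#include <stdlib.h>
#include <xmmintrin.h>   /* _mm_malloc / _mm_free */

/* Sketch: allocating a 32-byte aligned array of n doubles on the heap.
   Memory from _mm_malloc must be released with _mm_free. */
double *alloc_aligned(size_t n)
{
    return (double *)_mm_malloc(n * sizeof(double), 32);
}

/* The same with the POSIX call (release with free()). */
double *alloc_aligned_posix(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 32, n * sizeof(double)) != 0)
        return NULL;     /* allocation failed */
    return (double *)p;
}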
For variables on the stack, the __declspec attribute is used:
__declspec(align(base)) <var>
Or, for gcc (typical on Linux):
<var> __attribute__((aligned(base)))
The problem is that __declspec is unknown to gcc, so there may be portability issues; it is therefore better to use the preprocessor:
#ifdef __GNUC__
#define _ALIGN(N) __attribute__((aligned(N)))
#else
#define _ALIGN(N) __declspec(align(N))
#endif

_ALIGN(16) int foo[4];
Interestingly, the Fortran compiler from Intel (version 13.0 and higher) has a special option -align, with which you can make data aligned at declaration. For example, with -align array32byte we tell the compiler that all arrays should be aligned to 32 bytes. There is also a directive:
!DIR$ ATTRIBUTES ALIGN: base :: variable
Now about the instructions themselves. When working with unaligned data, unaligned read and write instructions are very slow, with the exception of vector SSE operations on Sandy Bridge and newer: there, under certain conditions, they may be no slower than their aligned counterparts. Unaligned AVX vector instructions are slower when working with unaligned data than when working with aligned data, even on the latest processor generations.
Because of this, the compiler prefers to generate unaligned instructions for AVX: if the data turns out to be aligned, they run just as fast, and if it is not aligned, execution will be slower, but it will still work. If aligned instructions are generated and the data is not aligned, everything will simply crash.
You can tell the compiler which instructions to use with the #pragma vector unaligned/aligned directive.
For example, consider this code:
void mult(double* a, double* b, double* c)
{
    int i;
#pragma vector unaligned
    for (i = 0; i < N; i++)
        c[i] = a[i] * b[i];
}
For it, when using AVX instructions, we get the following assembly code:
..B2.2:
        vmovupd     (%rdi,%rax,8), %xmm0
        vmovupd     (%rsi,%rax,8), %xmm1
        vinsertf128 $1, 16(%rsi,%rax,8), %ymm1, %ymm3
        vinsertf128 $1, 16(%rdi,%rax,8), %ymm0, %ymm2
        vmulpd      %ymm3, %ymm2, %ymm4
        vmovupd     %xmm4, (%rdx,%rax,8)
        vextractf128 $1, %ymm4, 16(%rdx,%rax,8)
        addq        $4, %rax
        cmpq        $1000000, %rax
        jb          ..B2.2
It is worth noting that in this case there will be no peel loop, because we used the directive.
If we replace unaligned with aligned, thereby guaranteeing the compiler that the data is aligned and that it is safe to generate the corresponding aligned instructions, we get the following:
..B2.2:
        vmovupd     (%rdi,%rax,8), %ymm0
        vmulpd      (%rsi,%rax,8), %ymm0, %ymm1
        vmovntpd    %ymm1, (%rdx,%rax,8)
        addq        $4, %rax
        cmpq        $1000000, %rax
        jb          ..B2.2
The latter case will run faster provided that a, b and c really are aligned. If not, everything will break. In the first case, we get a slightly slower implementation even with aligned data, because the compiler could not use vmovntpd and the additional vextractf128 instruction appeared.
Another important point is the distinction between alignment of the starting address and relative alignment. Consider the following example:
void matvec(double a[][COLWIDTH], double b[], double c[])
{
    int i, j;
    for (i = 0; i < size1; i++) {
        b[i] = 0;
#pragma vector aligned
        for (j = 0; j < size2; j++)
            b[i] += a[i][j] * c[j];
    }
}
There is only one question here: will this code work, provided that a, b and c are aligned to 16 bytes and we build our code with SSE? The answer depends on the value of COLWIDTH. For an odd length (the SSE register length divided by the size of double is 2, so COLWIDTH must be divisible by 2), our application will terminate much earlier than expected, right after passing through the first row of the array. The reason is that the first data element of the second row is not aligned. For such cases it is necessary to add dummy elements (“holes”) at the end of each row so that the next row is aligned again, the so-called padding. Here we can do this by choosing COLWIDTH appropriately, depending on the vector instruction set and the data type we are going to use. As already mentioned, for SSE it must be an even number, and for AVX it must be divisible by 4. A sketch of such padding is shown below.
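Here is that sketch (the sizes and macro names are illustrative, not taken from the Intel sample):

/* Sketch: pad the row length so that every row of a starts on a 32-byte
   boundary. For 8-byte doubles and AVX, COLWIDTH must be a multiple of 4;
   the pad elements at the end of each row are simply never used. */
#define ROW      100                        /* illustrative sizes            */
#define COLS     1003                       /* logical number of columns     */
#define COLWIDTH ((COLS + 3) & ~3)          /* rounded up to a multiple of 4 */

#ifdef __GNUC__
__attribute__((aligned(32))) double a[ROW][COLWIDTH];
#else
__declspec(align(32)) double a[ROW][COLWIDTH];
#endif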
If we know that only the starting address is aligned, we can give this information to the compiler with:
__assume_aligned(<array>, base)
The Fortran analogue:
!DIR$ ASSUME_ALIGNED address1:base [, address2:base] ...
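A sketch of how this might look for the matvec example above (assuming only the starting addresses are 32-byte aligned; COLWIDTH, size1 and size2 are as in that example, and on gcc the analogous builtin is __builtin_assume_aligned):

/* Sketch: only the starting addresses of a, b and c are promised to be
   32-byte aligned; nothing is claimed about the alignment of each row. */
void matvec_assume(double a[][COLWIDTH], double b[], double c[])
{
    __assume_aligned(a, 32);
    __assume_aligned(b, 32);
    __assume_aligned(c, 32);

    for (int i = 0; i < size1; i++) {
        b[i] = 0;
        for (int j = 0; j < size2; j++)
            b[i] += a[i][j] * c[j];
    }
}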
I played a little with a simple matrix-vector multiplication example on Haswell, in order to compare the application speed with AVX instructions on Windows depending on the directives in the code:
for (j = 0; j < size2; j++)
    b[i] += a[i][j] * x[j];
The data is aligned to 32 bytes:
__declspec(align(32)) FTYPE a[ROW][COLWIDTH];
__declspec(align(32)) FTYPE b[ROW];
__declspec(align(32)) FTYPE x[COLWIDTH];
This sample ships with the Intel compiler samples, where all of the code can be viewed. So, with the #pragma vector aligned directive before the loop, the loop run time was 2.531 seconds. Without it, the time increased to 3.466 seconds and a peel loop appeared; apparently the compiler did not recognize that the data was aligned. With peel loop generation turned off via mP2OPT_vec_alignment=6, the loop took almost 4 seconds. Interestingly, it turned out to be quite difficult to “deceive” the compiler in this example: it stubbornly generated run-time data checks and produced several versions of the loop, so the run with unaligned data was only slightly worse.
The bottom line: by aligning your data you will almost always avoid potential problems, in particular with performance. But aligning the data by itself is not enough; you also need to tell the compiler what you know, and then you can get the most efficient application at the output. The main thing is not to forget about these little tricks!