
Some of the simplest principles of auto-vectorization

My previous post covered loop permutation optimizations, the problem of loop recognition, memory disambiguation, and the definition and importance of dependencies. Now I want to look at one of the most effective loop optimizations: auto-vectorization. I would like to discuss how effective this optimization is, and to try to understand what factors determine that effectiveness. Anyone interested is welcome. In the discussion I will focus on Intel's auto-vectorizer and the gcc 4.7.2 auto-vectorizer. I examine gcc to confirm that the vectorization principles I try to formulate here are fairly general in nature, and, of course, to gauge the state of auto-vectorization in gcc. There is a certain element of unfairness here, since I use the latest Intel compiler but not the newest gcc, so in comparisons I will mostly stick to SSE instructions. (Incidentally, Intel actively participates in developing the gcc auto-vectorizer.) Since Intel and the Intel compiler are closer to me, in places I will pay more attention to them. I do not claim to be a vectorization guru and will be glad if someone spots my mistakes and corrects me. There will be a lot of text.


Where vectorization came from


Since the very first supercomputers, there have been attempts to implement architectural mechanisms that exploit data parallelism to speed up computation. In many computational programs the same operation is applied to a whole set of data, usually an array or an array section. In that case there is a repetitive data-access pattern, along with operations that can be performed in parallel on elements of this data set. The first pipelined vector processors supported instruction sets that operated on vectors of data located directly in memory. The Cray-1 introduced vector registers, which greatly sped up instruction execution, since vector registers could hold intermediate results of computations. Vector architectural mechanisms can be programmed directly. But if a program already exists written as sequential computations, it has to be modified to exploit data parallelism. This transformation of a sequential algorithm into a parallel one using data parallelism is called vectorization.

The history of vectorization and auto-vectorization for the Intel64 and IA32 architectures, told for dummies, probably looks like this. In the glorious old days of the 8086, the venerable father of the entire x86 processor family, the processor worked in tandem with a math coprocessor designed to support floating-point computation; it was a separate chip installed in a special socket on the motherboard. Many old-timers remember that when buying a personal computer one had to solve a dilemma: buy a system with the coprocessor (more expensive) or without it (cheaper, but slower at floating-point work). The coprocessor supported the x87 instruction set and worked with a register stack. Starting with the Intel486DX, the coprocessor was integrated into the processor as the FPU (floating point unit). This integration added new registers to the microprocessor, namely eight 80-bit data registers for storing floating-point numbers. But integer computations did not use these resources at all. So it was decided to process integer vectors on top of the FPU, and the MMX extension, which first appeared in the Pentium MMX processor, was proposed. New registers were added (eight 64-bit registers MM0-MM7), aliased onto the FPU registers, and the instruction set was supplemented with a number of SIMD (Single Instruction, Multiple Data) instructions operating on these registers and performing integer operations on them. The concept of packed data was introduced, i.e. each of these registers could store
2 × 32-bit integers,
4 × 16-bit integers, or
8 × 8-bit integers.
The drawback was the inability to work with packed integers and floating-point numbers at the same time: the processor had to be switched into a special mode, and that switch took a long time.
The next step was the SSE extension, first implemented in the Pentium III. The size of vector registers grew to 128 bits. This instruction set solved the problem of working with packed integer and floating-point data simultaneously: packed integers could now be processed with MMX while floating-point computations were done with SSE and its vector registers. Besides the vector registers, SSE added a 32-bit control/status register (and operations on it) to the computing system and extended the set of SIMD integer operations. Instructions were also added for explicit data prefetching, cache control, and ordering of store operations, and the CPUID instruction was extended to provide information about the processor.
SSE was followed by the SSE2 extension, which added new instructions in order to fully supersede MMX and to process packed double-precision floating-point data.
SSE3, SSSE3, SSE4 are subsequent SSE extensions.
AVX is a newer extension of the instruction set, increasing the size of vector registers to 256 bits.
AVX2 is the newly announced extension of AVX. It is already here.

Auto vectorizers


In today's competitive world, the term "economic efficiency" often rules. So new instructions appear in processors for a reason: to make the product more attractive to buyers and to motivate them to replace old computing architectures with newer, more efficient, more energy-efficient, cool and trendy ones. There is even a special independent organization where a methodology has been developed for measuring speed and performance correctly. And while modifying and optimizing existing instructions and improving the architecture can have an immediate positive effect on performance, new instructions require extra work: for example, persuading compiler vendors to support them in their products, and ensuring those products reach the market together with the new architectures. Understandably, that picture looks slightly optimistic, so Intel has long tried to take the promotion of its computing systems into its own hands and, in effect, ships alongside the hardware all the components needed for its success, such as compilers and performance analyzers. Hence new compiler versions are released simultaneously with the arrival of iconic new architectures on the market.

To demonstrate the gains from new vector instructions, the compiler needs an auto-vectorizer; that is, the compiler must be able to convert scalar code into vector code by itself. Performance comparisons of computing systems are done on representative program suites (such as CPU2000, CPU2006, and so on). The ability to vectorize something by hand with assembler inserts or calls to built-in vector intrinsics is not part of the contest conditions: when preparing a contestant (compiling a program before measuring its performance), all one may do is pass the compiler a set of options. So a high-quality auto-vectorizer is an important condition for promoting new computing systems in the market, both for Intel's systems and for competitors'. It is not surprising that the development of automatic vectorization principles is one of the drivers of computer science as a whole.

The auto-vectorizer also has great independent value as an important component of an optimizing compiler, making it relatively easy to improve the performance of computational programs. Vectorization can be done in various ways: with assembler inserts, or with calls to built-in intrinsics. Either way, this manual approach has several drawbacks. It is laborious, the code has to be ported whenever new vector extensions appear, and so on. Although you can achieve better results this way than with the auto-vectorizer, it is not an easy job. From a software producer's point of view, using the auto-vectorizer is far more promising.

The advantage of the auto-vectorizer is that it is enough to add options at compile time, and the compiler itself will vectorize those loops that are profitable to vectorize. Intel's auto-vectorizer now works by default, since the default optimization level is -O2. The gcc auto-vectorizer kicks in starting with -O3. Both auto-vectorizers can be turned on and off with special options. The auto-vectorizer lets you build applications targeted at a specific computing system supporting a specific vector extension. The Intel compiler offers several ways to choose the target architecture. The normal mode, where you pick the vector extension to use, is no different from how a vector extension is selected in gcc. But for Intel architectures you can also enable runtime checks and create, for example, multiversioned applications with different vectorization variants, choosing the best one depending on the architecture the program runs on; or you can automatically perform optimal vectorization for the machine you are compiling on. That is, for the icc compiler there are Intel architectures and everyone else; I am afraid fine-tuning vectorization for the second kind of architecture is not provided. I can assume the gcc auto-vectorizer is a greater proponent of equality.

To use the auto-vectorizer well, it is desirable to get feedback from it. If the auto-vectorizer reports the results of its work, i.e. what decision it made about vectorization in each particular case, you can try to deal with the problems that prevent vectorization. Both icc and gcc have this functionality: icc has the -vec_report option, and gcc reports on the auto-vectorizer's work via -ftree-vectorizer-verbose. The icc report is clearly aimed at users, while the gcc report targets developers. By raising the verbosity level in gcc you can get many interesting details about the "thought process" of the auto-vectorizer, which is quite interesting, since commercial icc shares its secrets reluctantly. On the other hand, while the important details in gcc have to be fished out of a pile of other information, icc is pleasantly concise.

An example of using an auto vectorizer


Let's take a simple function and try to look at the diagnostics of vectorizers.

void Calculate(float * a, float * b, float * c, int n) {
  for (int i = 0; i < n; i++) {
    a[i] = a[i] + b[i] + c[i];
  }
  return;
}

gcc -c -O3 -march=native -std=c99 -mfpmath=sse -ftree-vectorizer-verbose=2 vector.c
Analyzing loop at vector.c:2
Vectorizing loop at vector.c:2
2: created 2 versioning for alias checks.
2: LOOP VECTORIZED.
vector.c:1: note: vectorized 1 loops in function.

When working with floating-point addition, do not forget -mfpmath=sse for gcc, otherwise the vectorizer does not kick in.
And similarly for icc:
icc -c -vec_report3 -xhost -std=c99 vector.c
vector.c(2): (col. 3) remark: LOOP WAS VECTORIZED
vector.c(2): (col. 3) remark: loop skipped: multiversioned


An approximate scheme of vectorizing the simplest loop


Vectorization is mostly performed on loops, in which case it is a loop permutation optimization. That is the kind of vectorization I want to consider. Here I have tried to draw a rough vectorization scheme. Before vectorization we have sets of data of the same type, perform identical operations on the elements of these sets, and store the results of the computation. When this sequential code is vectorized, the scalar data must be gathered into vectors (written into vector registers), the vector operation applied to those vectors, and the result then stored from the vector register back to memory. If a simple loop with a single statement inside is vectorized, the first stage of vectorization can be thought of as unrolling the loop by a factor of N, where N is the number of elements of the given type that fit into a vector register. As a result of vectorization, one vector iteration replaces N scalar iterations. In general, we can assume that during vectorization the data has to be "packed" into vector registers. Vector instructions can take memory addresses as operands, but those addresses must be aligned, and in most cases the vectorizer has to do some work to ensure that this condition is met.



The admissibility of vectorization


Since we treat vectorization as a loop optimization, it is performed only for "good" loops, i.e. loops with a determinable number of iterations and sequentially changing induction variables. In other words, before the auto-vectorizer can vectorize loops, it must first recognize them. I described the difficulties the compiler has to overcome when recognizing loops in a previous post.
Loop vectorization changes the order of operations in the loop. If the loop body contains several statements, the order in which these statements execute changes as shown in the diagram:



By the criterion of admissibility of permutation optimizations, vectorization is admissible if it does not change the order of dependent computations. Because of how vectorization works, it never changes the order of lexically forward data dependencies, but it can break loop-carried self-dependencies and backward (anti) dependencies. Whether such dependencies are actually violated depends on the distance between the dependent statements: if the distance is at least N (the number of elements of the given type in a vector register), the dependencies are not broken. It is also interesting that so-called output dependencies need not be violated by vectorization either, provided that loads into vector registers and stores from them are performed sequentially, i.e. in the same order as in the original program.

To understand whether vectorization is admissible, a dependence analysis has to be performed. The worst case is when the loop contains different objects that may refer to the same memory. And going back to our first example, that is exactly the situation we have there. If the Calculate function is compiled separately, then in general the pointers a, b, c may point to the same or overlapping memory. As the reports show, both vectorizers coped with this by adding runtime checks and multiversioned code (although they phrase it differently). As a result, extra instructions were added to the code: if the memory regions that start at the pointers a, b, c and have size n * sizeof(float) do not overlap, the vectorized loop is invoked, otherwise the scalar one. The function code grew considerably, and the runtime checks reduce performance.

Auto-vectorization can be made more optimal in several ways: either by telling the compiler directly that the objects do not overlap in memory, or by adding interprocedural analysis in the hope that the compiler can prove it itself. Interprocedural analysis is another large topic of its own. But it is clear that, in theory, we can check whether the actual arguments of all calls to the Calculate function can overlap in memory, and if they do not, extend this knowledge to the formal arguments of Calculate. Let's extend the very first test a bit and see whether the compilers can extract this kind of information.

#include <stdlib.h>
#define N 2000

static void Calculate(float * a, float * b, float * c, int n) {
  for (int i = 0; i < n; i++) {
    a[i] = a[i] + b[i] + c[i];
  }
  return;
}

float test() {
  float *x, *y, *z, ret;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));
  z = (float*)malloc(N*sizeof(float));
  for (int i = 0; i < N; i++) {
    x[i] = 1.0; y[i] = 0.0; z[i] = 1.0;
  }
  Calculate(x, y, z, N);
  ret = x[1];
  free(x); free(y); free(z);
  return ret;
}

When compiling, we forbid inlining in order to keep the Calculate function itself. Here interprocedural analysis is applied only to functions in the source file vector.c, and the static attribute lets the compiler know that this function is not called from other source files, so it can be modified according to the detected properties of its actual arguments.
gcc -c -O3 -fno-inline -fipa-pta -march=native -std=c99 -mfpmath=sse -ftree-vectorizer-verbose=2 vector.c
...
Vectorizing loop at vector.c:5
5: LOOP VECTORIZED.
...
icc -c -ip -ip-no-inlining -vec_report3 -xhost -std=c99 vector.c
vector.c(5): (col. 3) remark: LOOP WAS VECTORIZED
vector.c(5): (col. 3) remark: REMAINDER LOOP WAS VECTORIZED

For gcc, you must add the -fipa-pta option (perform interprocedural pointer analysis and interprocedural modification and reference analysis), since it is not included in the default optimization levels. With it, both gcc and icc successfully managed to derive the properties of the actual arguments.

In order not to burden the compiler with program analysis, it is simpler to add the restrict qualifier to the pointer declarations, telling the compiler that the data a pointer refers to is accessed only through that pointer. Once there are no objects in the loop that can alias, the dependence analysis still has to find the remaining dependencies (for example, accesses to the same array element on different iterations of the loop), determine their type, the distance between dependent iterations, their chronological order, and decide whether vectorization is admissible. Not an easy job at all.
Let's see how the auto-vectorizers handle informing the user about problems. Consider a few typical cases where dependencies can affect vectorization:
void Calculate(float * restrict a, float * restrict b, float * restrict c, int * restrict vec, int n) {
  for(int i=0;i<n-3;i++)      //3
    a[i+3] = a[i]+1;
  for(int i=0;i<n-5;i++)      //5
    a[i+5] = a[i]+1;
  for(int i=3;i<n;i++)        //7
    a[i-3] = a[i]+1;
  for(int i=0;i<n;i++)        //9
    b[i] = a[vec[i]];
  for(int i=0;i<n;i++)        //11
    a[vec[i]] = b[i]*vec[i];
  for(int i=1;i<n;i++) {      //13
    a[i] = c[i]+1;
    b[i] = a[i-1];
  }
  for(int i=0;i<n-1;i++) {    //17
    a[i] = c[i]+1;
    b[i] = a[i+1];
  }
  for(int i=0;i<n-5;i++) {    //21
    a[i] = c[i]+1;
    b[i] = a[i+5];
  }
  return;
}

In the first two loops we have lexically backward dependencies with distances 3 and 5. With 128-bit registers (SSE), this prevents vectorizing the first loop across the full width of the vector register. With 256-bit registers (AVX), full-width vectorization is impossible in the second case as well. The third loop contains a forward dependence, which does not prevent vectorization. The fourth and fifth loops are cases of indirect (vector) indexing. In the fifth there is an output dependence, i.e. the order of writes to memory matters. This dependence does not prevent vectorization: although the operations are vectorized, the results are stored from the vector registers to memory sequentially, which ultimately preserves the correct result. The sixth loop has a forward dependence between the two statements of the loop body. The seventh and eighth cases are backward dependencies.
gcc -c -O3 -fno-inline -march=corei7-avx -std=c99 -mfpmath=sse -ftree-loop-distribution -ftree-vectorizer-verbose=1 vector.c
Analyzing loop at vector.c:21
Vectorizing loop at vector.c:21
21: LOOP VECTORIZED.
Analyzing loop at vector.c:17
Analyzing loop at vector.c:13
Vectorizing loop at vector.c:13
13: LOOP VECTORIZED.
Analyzing loop at vector.c:11
Analyzing loop at vector.c:9
Analyzing loop at vector.c:7
Vectorizing loop at vector.c:7
7: LOOP VECTORIZED.
Analyzing loop at vector.c:5
Vectorizing loop at vector.c:5
5: LOOP VECTORIZED.
Analyzing loop at vector.c:3
vector.c:1: note: vectorized 4 loops in function.

Gcc did not want to consider the loops with indirect indexing. In the loops with backward dependencies it calculated the distances and vectorized where possible. At the same time, it lowered the vectorization factor for loops //5 and //21, i.e. vectorized for 128-bit registers. You can find this out by increasing the value of -ftree-vectorizer-verbose.
…
21: === vect_analyze_dependences ===
21: dependence distance = 5.
21: adjusting maximal vectorization factor to 5
21: dependence distance >= VF.
21: bad data dependence.
21: ***** Re-trying analysis with vector size 16

Similarly, for icc we get:
icc -c -ip -vec_report3 -xAVX -std=c99 vector.c
vector.c(3): (col. 3) remark: loop was not vectorized: existence of vector dependence
vector.c(4): (col. 7) remark: vector dependence: assumed FLOW dependence between a line 4 and a line 4
vector.c(5): (col. 3) remark: LOOP WAS VECTORIZED
vector.c(7): (col. 3) remark: LOOP WAS VECTORIZED
vector.c(9): (col. 3) remark: LOOP WAS VECTORIZED
vector.c(9): (col. 3) remark: REMAINDER LOOP WAS VECTORIZED
vector.c(11): (col. 3) remark: LOOP WAS VECTORIZED
vector.c(13): (col. 3) remark: LOOP WAS VECTORIZED
vector.c(17): (col. 3) remark: PARTIAL LOOP WAS VECTORIZED
vector.c(17): (col. 3) remark: PARTIAL LOOP WAS VECTORIZED
vector.c(21): (col. 3) remark: PARTIAL LOOP WAS VECTORIZED
vector.c(21): (col. 3) remark: PARTIAL LOOP WAS VECTORIZED

icc failed to vectorize only loop //3. Interestingly, loop //17 was first split into two loops, and then both were vectorized. The same was done for loop //21, although there it was not necessary.
-ftree-loop-distribution could have helped gcc vectorize loop //17 by first splitting it in two, but for some reason it did not.

Another interesting example looks like this. It is, in fact, the simplest work with square matrices.
void ttt(int *restrict a, int * restrict b, int n) {
  for(int i=0;i<n;i++)
    for(int j=0;j<n;j++)
      a[j+n*i] = a[j+n*i] + b[i+n*j];
  return;
}

icc vectorizes the inner loop successfully, while gcc writes "not suitable for gather" (?).
gcc -c -fno-inline -O3 -march=corei7 -std=c99 -mfpmath=sse -ftree-vectorizer-verbose=3 vector.c
Analyzing loop at vector.c:8
8: not vectorized: data ref analysis failed D.2238_18 = *D.2237_17;
Analyzing loop at vector.c:9
9: not vectorized: not suitable for gather D.2238_18 = *D.2237_17;
vector.c:6: note: vectorized 0 loops in function.

When I talked about the admissibility criterion for permutation optimizations, I already mentioned that there are directives that can convey the programmer's insider knowledge about the program to the auto-vectorizer. For example, you use indirect indexing and know that every index in the index vector is unique. Then the dependencies that such indexing could otherwise cause cannot occur. For example:
void Calculate(float * restrict a, float * restrict b, float * restrict c, int * restrict vec, int n) {
  for(int i=0;i<n;i++)
    a[vec[i]] += b[i]*c[i];
  return;
}

That is, a dependence is possible if there are identical vec[i] and vec[j]. As I found out, gcc does not understand indirect indexing at all. icc issues the following diagnostics:
icc -c -vec_report3 -xSSE4.2 -std=c99 vector.c
vector.c(3): (col. 3) remark: loop was not vectorized: existence of vector dependence
vector.c(4): (col. 7) remark: vector dependence: assumed ANTI dependence between a line 4 and a line 4
vector.c(4): (col. 7) remark: vector dependence: assumed FLOW dependence between a line 4 and a line 4
vector.c(4): (col. 7) remark: vector dependence: assumed FLOW dependence between a line 4 and a line 4
vector.c(4): (col. 7) remark: vector dependence: assumed ANTI dependence between a line 4 and a line 4

This problem can be solved by placing #pragma ivdep before the loop. The pragma asserts that there are no assumed dependencies in the loop. With this pragma, the compiler successfully vectorizes the loop.

For more complex loops, the vectorizer's report can give an idea of the dependencies that prevent vectorization. Having studied them, one can judge whether they are caused by real dependencies, or the compiler is simply unable to prove their absence. In the second case the auto-vectorizer can be helped with #pragma ivdep.
I was surprised to find that the gcc compiler does not support this pragma, and the corresponding tracker issue has existed since 2007.

Factors determining the effectiveness of vectorization: Sequential access



The diagram shows that vectorization reduces the number of arithmetic operations performed, but adds the need to "pack" the data into vector registers. The data can be placed into vector registers either element by element or with a single copy from memory. If the elements are not contiguous in memory, loading into registers can only be done element by element. If the vectorized read/write statements are the bulk of the work, the benefit of vectorization becomes doubtful: the number of operations performed barely changes, while the amount of code grows significantly (as with unrolling by N), which can also hurt performance.

If, for example, we take the simplest loop in which the arrays are accessed with stride 2:
void Calculate(int * restrict a, int * restrict b, int n) {
  for(int i=0;i<n;i=i+2) {
    b[i] = a[i]+1;
  }
  return;
}

then vectorization produces roughly the following assembly for the vectorized part of the loop (I use the Intel auto-vectorizer for my examples):
icc -S -xSSE4.2 -std=c99 vector.c

..B1.8:                        # Preds ..B1.8 ..B1.7
        movd      24(%edi,%eax,8), %xmm1     #4.14
        movd      8(%edi,%eax,8), %xmm3      #4.14
        movd      16(%edi,%eax,8), %xmm2     #4.14
        movd      (%edi,%eax,8), %xmm4       #4.14
        punpckldq %xmm1, %xmm3               #4.14
        punpckldq %xmm2, %xmm4               #4.14
        punpckldq %xmm3, %xmm4               #4.14
        paddd     %xmm0, %xmm4               #4.7
        movd      %xmm4, (%esi,%eax,8)       #4.7
        psrldq    $4, %xmm4                  #4.7
        movd      %xmm4, 8(%esi,%eax,8)      #4.7
        psrldq    $4, %xmm4                  #4.7
        movd      %xmm4, 16(%esi,%eax,8)     #4.7
        psrldq    $4, %xmm4                  #4.7
        movd      %xmm4, 24(%esi,%eax,8)     #4.7
        addl      $4, %eax                   #3.4
        cmpl      %edx, %eax                 #3.4
        jb        ..B1.8                     # Prob 82%

This example speaks for itself. Several vector registers are used just to form the vector a[i:i+N]. The loop body grows significantly, and it is hard to expect a performance improvement from such vectorization.

Vectorization experts may ask: what about scatter/gather? At the moment, gather is implemented for processors that support AVX2. At the very least, gather reduces the number of instructions. In this case:
icc -S -xCORE-AVX2 -std=c99 vector.c

..B1.8:                        # Preds ..B1.8 ..B1.7
        lea        (%edi,%eax,8), %ebx           #4.14
        vpxor      %ymm3, %ymm3, %ymm3           #4.14
        vpcmpeqd   %ymm2, %ymm2, %ymm2           #4.14
        vpgatherdd %ymm2, (%ebx,%ymm0,8), %ymm3  #4.14
        vpaddd     %ymm1, %ymm3, %ymm4           #4.19
        vextracti128 $1, %ymm4, %xmm2            #4.7
        vpsrldq    $4, %xmm4, %xmm5              #4.7
        vmovd      %xmm4, (%esi,%eax,8)          #4.7
        vpsrldq    $4, %xmm2, %xmm3              #4.7
        vpsrldq    $4, %xmm5, %xmm6              #4.7
        vpsrldq    $4, %xmm3, %xmm4              #4.7
        vmovd      %xmm5, 8(%esi,%eax,8)         #4.7
        vpsrldq    $4, %xmm6, %xmm7              #4.7
        vpsrldq    $4, %xmm4, %xmm5              #4.7
        vmovd      %xmm6, 16(%esi,%eax,8)        #4.7
        vmovd      %xmm7, 24(%esi,%eax,8)        #4.7
        vmovd      %xmm2, 32(%esi,%eax,8)        #4.7
        vmovd      %xmm3, 40(%esi,%eax,8)        #4.7
        vmovd      %xmm4, 48(%esi,%eax,8)        #4.7
        vmovd      %xmm5, 56(%esi,%eax,8)        #4.7
        addl       $8, %eax                      #3.4
        cmpl       %edx, %eax                    #3.4
        jb         ..B1.8


Sequential access simplifies loading memory into registers, so vectorization is more efficient the more objects in the loop are accessed sequentially. On the other hand, if the vectorized code contains many computations and reuses their results, vectorization can be quite effective even with non-sequential access to objects.

Factors determining the efficiency of vectorization: Alignment


The speed of copying from memory also depends on how the memory block to be loaded into the vector register is aligned. When working with aligned addresses, those addresses can be used directly as instruction operands, which allows a better register allocation. Reads from aligned addresses are also faster. SSE instructions require addresses aligned to 16 bytes; AVX, to 32.

Due to the effect of alignment on performance, the overall loop vectorization scheme looks like this:



In general, loop vectorization generates three loops: a loop that runs until the elements of some processed object become aligned in memory, a vectorized part, and a scalar tail loop. If several objects are processed inside the loop, not all of them can be aligned, so a "lead" object has to be chosen and everything adjusted to it.

To evaluate the effect of alignment, let's invent a simple test that works with both aligned and unaligned memory. With the PERF macro we create two variants of the program. In one, the arguments passed to the ttt function are aligned differently; in the other, identically. Studying the assembly files suggests that both icc and gcc chose array "a" as the "lead" array, so several scalar iterations are inserted before the vector loop to align to it; array "b" then ends up aligned in one variant of the program but not in the other. We also suppress inlining so that the function body is the same in both executables, and allocate aligned memory for the processed arrays with memalign.

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#define N 1000

void ttt(int * restrict a, int * restrict b, int n) {
  for(int i=0;i<n;i++) {
    a[i] = a[i]+b[i];
  }
  return;
}

int main() {
  int *a,*b;
  a = (int*)memalign(32,N*sizeof(int));
  b = (int*)memalign(32,N*sizeof(int));
  for(int i=0;i<N;i++) {
    a[i]=i; b[i]=i;
  }
  for(int rep=0; rep<10000000; rep++) {
#ifndef PERF
    ttt(&a[0],&b[1],N-3);
    ttt(&a[1],&b[2],N-4);
#else
    ttt(&a[0],&b[0],N-3);
    ttt(&a[1],&b[1],N-4);
#endif
  }
  printf("%d\n",a[100]);
  free(a); free(b);
}

icc -ip-no-inlining -vec_report3 -xSSE4.2 -std=c99 vector.c -o vector_sse
icc -DPERF -ip-no-inlining -vec_report3 -xSSE4.2 -std=c99 vector.c -o vector_sse_perf

On my lab machine I get the following numbers:
time ./vector_sse      : 6.36s
time ./vector_sse_perf : 3.46s

I did not expect such a difference. Granted, all the data here resides in the cache; in a real program such a difference due to alignment alone is probably harder to obtain.
Let's see what gcc shows in this situation:
gcc -fno-inline -O3 -march=corei7 -std=c99 -mfpmath=sse -ftree-vectorizer-verbose=1 vector.c -o vector
gcc -fno-inline -DPERF -O3 -march=corei7 -std=c99 -mfpmath=sse -ftree-vectorizer-verbose=1 vector.c -o vector_perf
time ./vector      : 6.49s
time ./vector_perf : 3.61s

So we can conclude that alignment has a very noticeable positive effect on performance when the loop runs many iterations.

Since alignment affects performance so strongly, one would like to have tools to control this property. Such tools exist. If the memory for an array is allocated on the stack, you can use the __attribute__((aligned(N))) construct. For example, like this:
 int a[N] __attribute__((aligned(32))); 

This attribute can also align arrays located inside a class or structure. In my program I already use an analogue of malloc, namely memalign, a function that returns aligned memory. If you have taken care of alignment and want the auto-vectorizer to generate code for aligned objects, but the program analysis cannot derive this property of the object on its own, you can use the built-in function __builtin_assume_aligned, for example:
 int * aa =__builtin_assume_aligned(a,32); 

In this case, the auto-vectorizer will generate code without the alignment loop, will be able to use instructions for working with aligned memory, and will be able to use memory addresses as operands of vector instructions. (The first point merely affects performance, but the latter ones will have disastrous consequences for the program if the object is not actually aligned.)

There is also #pragma vector aligned. As I understand it, when it is used, all objects involved in the vectorization are considered aligned.

Evaluating the profitability of auto-vectorization


Since some vectorization efficiency factors are tied to the operation of the machine itself, it is logical to take these factors into account when deciding whether to vectorize, so as not to vectorize obviously losing cases and degrade application performance. It makes sense to have a heuristic mechanism that weighs all the factors for and against vectorization and issues its verdict. For the icc compiler the verdict looks like this:
 vector.c(3): (col. 4) remark: loop was not vectorized: vectorization possible but seems inefficient 

Gcc with the option -ftree-vectorizer-verbose=3 describes the course of its reasoning (here, for example, it refuses to vectorize a short loop):
8: cost model: prologue peel iters set to vf/2.
8: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown.
8: Cost model analysis:
  Vector inside of loop cost: 5
  Vector outside of loop cost: 24
  Scalar iteration cost: 4
  Scalar outside cost: 0
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7
8: Profitability threshold = 6
8: not vectorized: vectorization not profitable.

In general, though, since gcc already showed problems with vectorizing objects accessed non-sequentially, I could not find more complex examples where gcc refuses to vectorize a loop because vectorization would be unprofitable.

If such evaluation mechanisms decide against vectorization and the user believes the heuristic reached the wrong conclusion about its benefit, the vectorizer's decision can be overridden, for example, with various pragmas. #pragma vector always recommends always vectorizing the loop that follows; #pragma novector, in turn, forbids vectorizing it.

Conclusion


The quality of an auto-vectorizer is not limited to the considerations discussed here. To produce optimal code, the vectorizer needs to cooperate with other loop optimizations, use the results of interprocedural analysis, recognize various idioms, reuse vector computations that have already been done, and much more. But I hope this brief introduction to auto-vectorization was useful.
Overall, I could not catch the icc auto-vectorizer making any mistakes on these simple examples. The gcc auto-vectorizer showed a number of shortcomings (it refused to vectorize loops with indirect indexing, and the non-sequential-access case with the square matrix).

Some useful literature on the subject.


There is a great book, Optimizing Compilers for Modern Architectures, by Randy Allen and Ken Kennedy. It introduces all the major aspects of an optimizing compiler's work.
There is also "The Software Vectorization Handbook", a book by engineers for engineers, written by Aart J.C. Bik and devoted directly to the theory of auto-vectorization.
Many interesting articles can be found on Intel's sites. Read, for example, A Guide to Auto-vectorization with Intel C++ Compilers.

Source: https://habr.com/ru/post/171439/

