Loop vectorization: diagnostics and control

Programmers often rely on the compiler to vectorize loops. But the compiler is not omnipotent, and it often needs help with difficult code. This article answers the question: how do you find out where the compiler struggles with vectorization, and how do you help it?

Loop vectorization was first introduced in LLVM 3.2 and became enabled by default in LLVM 3.3. It has already been discussed on this blog in 2012 and in 2013, as well as at the FOSDEM 2014 and WWDC 2013 conferences. The LLVM vectorizer combines several iterations of a loop into one to improve performance. Modern processors can execute consecutive, mutually independent instructions in parallel thanks to hardware support: multiple execution units and out-of-order execution.

Unfortunately, when vectorizing a loop is impossible or unprofitable, the compiler silently skips it. This is a problem for the many applications that rely on the compiler to vectorize their loops. Recent updates in LLVM 3.5 added new command line options that help determine why a loop was not vectorized.

Optimization remarks

These messages give the user information from the LLVM optimizer, including data on loop unrolling, instruction interleaving, and vectorization. To display them, pass the option '-Rpass=loop-vectorize' to the compiler. The example below shows a loop that was vectorized with a vectorization factor of 4 and interleaved with a factor of 2.

    void test1(int *List, int Length) {
      int i = 0;
      while(i < Length) {
        List[i] = i*2;
        i++;
      }
    }

    clang -O3 -Rpass=loop-vectorize -S test1.c -o /dev/null

    test1.c:4:5: remark: vectorized loop (vectorization factor: 4, unrolling interleave factor: 2)
        while(i < Length) {
        ^

Many loops cannot be vectorized because of complex control flow (for example, many if blocks), because they contain data types that cannot be vectorized, or because they call functions that cannot be vectorized.

For example, to vectorize the code below, the compiler must first make sure that array 'A' is not an alias of array 'B' (that is, the two do not overlap in memory). The optimizer cannot establish this, because it cannot determine the bounds of 'A'.

    void test2(int *A, int *B, int Length) {
      for (int i = 0; i < Length; i++)
        A[B[i]]++;
    }

    clang -O3 -Rpass-analysis=loop-vectorize -S test2.c -o /dev/null

    test2.c:3:5: remark: loop not vectorized: cannot identify array bounds
        for (int i = 0; i < Length; i++)
        ^
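When the programmer knows that two pointers never overlap, the C99 'restrict' qualifier is one way to tell the compiler so. The sketch below only illustrates the aliasing idea on a simpler loop. The function scale_into and its parameters are hypothetical and not part of the article, and 'restrict' would not fix test2 itself, where the indirect index A[B[i]] can still repeat.

```c
#include <assert.h>

/* Hypothetical helper, not from the article.
   'restrict' is a promise from the caller that dst and src do not overlap,
   so the compiler does not have to prove it (or check it at run time). */
void scale_into(int *restrict dst, const int *restrict src, int length) {
    assert(length >= 0); /* precondition, checked once outside the loop */
    for (int i = 0; i < length; i++)
        dst[i] = src[i] * 2;
}
```

Calling scale_into with overlapping buffers would break that promise and is undefined behavior, so this is only appropriate when the caller can really guarantee it.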

Statements that prevent vectorization can be identified with the command line option '-Rpass-analysis=loop-vectorize'. For example, in many cases 'break' and 'switch' cannot be vectorized.

In the first example, a simple conditional 'break' prevents vectorization:

    for (int i = 0; i < Length; i++) {
      if (A[i] > 10.0)
        break;
      A[i] = 0;
    }

    control_flow.cpp:5:9: remark: loop not vectorized: loop control flow is not understood by vectorizer
        if (A[i] > 10.0)
        ^
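One possible way to help the compiler with a loop like the one above is to separate the search for the exit point from the actual work. The following is a minimal sketch under assumptions: the function name clear_until_large is hypothetical, and 'A' is assumed to be a float array, which the original snippet does not state. Whether the second loop is actually vectorized depends on the compiler version and target.

```c
#include <assert.h>

/* Hypothetical rewrite, not from the article. Same result as the loop
   with 'break': every element before the first one greater than 10.0
   is set to zero, and the rest of the array is left untouched. */
void clear_until_large(float *A, int Length) {
    assert(Length >= 0);
    int stop = 0;
    /* Search loop: still exits early, so it may remain scalar. */
    while (stop < Length && !(A[stop] > 10.0))
        stop++;
    /* Fill loop: trip count known on entry, no control flow in the body. */
    for (int i = 0; i < stop; i++)
        A[i] = 0;
}
```

The early exit is still there, but it is now confined to a loop that only reads, which leaves the store loop in a shape the vectorizer can analyze.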

The second example shows vectorization failing simply because the loop contains a switch:

    for (int i = 0; i < Length; i++) {
      switch(A[i]) {
      case 0: B[i] = 1; break;
      case 1: B[i] = 2; break;
      default: B[i] = 3;
      }
    }

    no_switch.cpp:4:5: remark: loop not vectorized: loop contains a switch statement
        switch(A[i]) {
        ^
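A small switch like this can often be rewritten as a chain of conditional expressions, which the compiler can turn into select operations instead of branches. This is a sketch under assumptions: the function name map_values is hypothetical, and 'A' and 'B' are assumed to be int arrays, which the original snippet does not state. Newer compilers may handle the switch form as well, so it is worth checking the remarks before rewriting.

```c
#include <assert.h>

/* Hypothetical rewrite, not from the article. Produces the same values
   as the switch: 0 maps to 1, 1 maps to 2, anything else maps to 3. */
void map_values(const int *A, int *B, int Length) {
    assert(Length >= 0);
    for (int i = 0; i < Length; i++) {
        int a = A[i];
        /* Branch-free form: every iteration performs exactly one store. */
        B[i] = (a == 0) ? 1 : (a == 1) ? 2 : 3;
    }
}
```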

New pragma directive for loops

Explicit control over vectorization, interleaving, and loop unrolling is necessary for fine-tuning a program's performance. For example, when compiling with the -Os flag (optimize for size), it is a good idea to vectorize the hottest loops. Vectorization, interleaving, and unrolling can be specified explicitly with the #pragma clang loop directive on any for, while, do-while, or C++11 range-based for loop. In the example below, the vector width and the interleave count are set with this directive.

    void test3(float *Vx, float *Vy, float *Ux, float *Uy, float *P, int Length) {
    #pragma clang loop vectorize_width(4) interleave_count(4)
    #pragma clang loop unroll(disable)
      for (int i = 0; i < Length; i++) {
        float A = Vx[i] * Ux[i];
        float B = A + Vy[i] * Uy[i];
        P[i] = B;
      }
    }

    clang -O3 -Rpass=loop-vectorize -S test3.c -o /dev/null

    test3.c:5:5: remark: vectorized loop (vectorization factor: 4, unrolling interleave factor: 4)
        for (int i = 0; i < Length; i++) {
        ^
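For the -Os case mentioned above, the directive also accepts vectorize(enable) and interleave(enable), which request vectorization without fixing a width or count. The sketch below assumes a hypothetical hot function saxpy in a file otherwise built for size. It is an illustration, not the article's example, and compilers other than Clang will typically ignore the pragma (possibly with a warning).

```c
#include <assert.h>

/* Hypothetical hot loop, not from the article: y = a * x + y.
   'restrict' is an added assumption that x and y do not overlap. */
void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    assert(n >= 0);
    /* Request vectorization and interleaving for this loop only;
       the compiler still chooses the width and the count. */
#pragma clang loop vectorize(enable) interleave(enable)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Compiling with -Os and -Rpass=loop-vectorize should show whether the request was honored for this loop.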

Integer constant expressions

The options of the pragma directive above (vectorize_width, interleave_count and unroll_count) take integer constants, but they also accept any expression evaluated at compile time, as in the following example:

    template <int ArchWidth, int ExecutionUnits>
    void test4(float *Vx, float *Vy, float *Ux, float *Uy, float *P, int Length) {
    #pragma clang loop vectorize_width(ArchWidth)
    #pragma clang loop interleave_count(ExecutionUnits * 4)
      for (int i = 0; i < Length; i++) {
        float A = Vx[i] * Ux[i];
        float B = A + Vy[i] * Uy[i];
        P[i] = B;
      }
    }

    void compute_test4(float *Vx, float *Vy, float *Ux, float *Uy, float *P, int Length) {
      const int arch_width = 4;
      const int exec_units = 2;
      test4<arch_width, exec_units>(Vx, Vy, Ux, Uy, P, Length);
    }

Now compile it:

    clang++ -O3 -Rpass=loop-vectorize -S test4.cpp -o /dev/null

    test4.cpp:6:5: remark: vectorized loop (vectorization factor: 4, unrolling interleave factor: 8)
        for (int i = 0; i < Length; i++) {
        ^

Warnings when vectorization fails

Of course, even with explicit directives, vectorization is not always possible, for example because of complex control flow. If an explicitly requested vectorization fails, the compiler emits a warning that the directive could not be honored. Below is a function that returns the index of the last positive number in an array; its loop cannot be vectorized because the variable 'last_positive_index' is used outside the loop:

    int test5(int *List, int Length) {
      int last_positive_index = 0;
    #pragma clang loop vectorize(enable)
      for (int i = 1; i < Length; i++) {
        if (List[i] > 0) {
          last_positive_index = i;
          continue;
        }
        List[i] = 0;
      }
      return last_positive_index;
    }

    clang -O3 -g -S test5.c -o /dev/null

    test5.c:5:9: warning: loop not vectorized: failed explicitly specified loop vectorization
        for (int i = 1; i < Length; i++) {
        ^

The line and column of the loop that could not be vectorized are only reported when the '-g' option is passed.
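One way to react to such a warning is to split the loop so that the value needed after the loop is computed separately from the stores. The sketch below is a hypothetical rewrite of test5 (the name test5_split is not from the article). The first loop may still stay scalar on some compiler versions; the point is only that the second loop no longer carries any value out of the loop.

```c
#include <assert.h>

/* Hypothetical rewrite, not from the article. Same result as test5:
   returns the index of the last positive element (0 if none at i >= 1)
   and replaces every non-positive element from index 1 on with 0. */
int test5_split(int *List, int Length) {
    assert(Length >= 0);
    int last_positive_index = 0;
    /* Loop 1: only reads List and carries the value used after the loop. */
    for (int i = 1; i < Length; i++)
        if (List[i] > 0)
            last_positive_index = i;
    /* Loop 2: unconditional store per element, nothing escapes the loop. */
    for (int i = 1; i < Length; i++)
        List[i] = (List[i] > 0) ? List[i] : 0;
    return last_positive_index;
}
```

The trade-off is a second pass over the array, so this is worth measuring rather than assuming it is faster.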

Conclusion

Diagnostic messages and loop pragma directives are two new features that are useful for tuning program performance. Special thanks to everyone who contributed to their development. Future work includes adding diagnostic messages for the SLP vectorizer and more options for the pragma directive. If you have other ideas for improvements, I will be glad to hear them.

Source: https://habr.com/ru/post/244505/
