
This post is again devoted to loop optimization. Why so much attention to loop transformations? Because they are among the most effective parts of an optimizing compiler: both auto-vectorization and auto-parallelization belong to this family. Each of these optimizations has its own specifics, but in general all loop optimizations share common problems and common methods for solving them.
You often hear the opinion that in many cases the compiler is "dumb". Here I want to act as the compiler's attorney: to show that the compiler's life is not so easy, to evoke a little sympathy for its difficult lot, to show what objective difficulties exist in processing a program, and to explain why in many cases the compiler rightly cannot perform optimizations that seem obvious to the programmer. At the same time, I want to demonstrate some ways to help the compiler in its work. Of course, sometimes subjective factors also play a role, in the form of developers who for some reason did not implement this or that functionality in the compiler.
I will use the Intel compiler for Windows, but I think similar problems cause headaches for users and developers of all optimizing compilers that perform loop optimizations, so I hope this post will be useful to all programmers. I assume the reader has some basic idea of interprocedural analysis, so I will refer to the capabilities of this analysis without explaining in detail what it is and how it works.
In the previous post I suggested looking at the change in performance of a synthetic test caused by applying a high-level loop transformation, loop distribution (loop splitting). One of the main goals of that example was to show how reordering operations can affect the memory subsystem and how many different performance factors can change as a result. In other words, evaluating the profitability of an optimization is quite difficult work, often complicated by the lack of accurate information about the structure of the data being processed.
And these are not all the difficulties the compiler faces. Before evaluating the possible benefit of a particular transformation, it must spend some effort to recognize the loops themselves and to prove that a reordering transformation is even legal. This post is mainly devoted to the problems of loop recognition and of ambiguity resolution when working with memory.
Loop Recognition
To perform loop transformations, the compiler must first recognize and classify loops. The main loop optimizations are applicable only to loops with a known number of iterations and a linearly changing induction variable (and, as a consequence, with no jumps out of the loop and no calls to unknown functions). Consider, for example, a set of simple loops located in some function foo:
extern int unknown();
int gn, gi;

typedef struct vector {
    float *array;
    int lbound;
    int ubound;
} vec;

int foo(int n, float *a, vec *v)
{
    int i, j;
    int b[100];
    // loop 1
    for (i = 0; i < gn; i++) a[i] = 0;
    // loop 2
    for (gi = 0; gi < n; gi++) a[gi] = 0;
    // loop 3 (starting from 1 so that i = i*3 actually advances)
    for (i = 1; i < n; i = i * 3) b[i] = 0;
    // loop 4
    for (i = 0; i < n; i++) { j = unknown(); b[i] = j; }
    // loop 5
    for (i = 0; i < n; i++) { if (b[i] != 0) break; b[i] = 1; }
    // loop 6
    for (i = 0; i < n && i < gn; i++) b[i] = i;
    // loop 7
    for (i = 0; i < n; i++) { if (gn) break; else b[i] = 0; }
    // loop 8
    for (i = 0; i < n; i++) { if (i > n) goto exit; b[i] = 100; }
    // loop 9
    for (i = v->lbound; i < v->ubound; i++) { v->array[i] = i; }
exit:
    return b[10];
}
Did the compiler recognize these loops, and is there a way to find out? I don't know a direct way (other than using internal developer dumps), but there is an interesting indirect method: the auto-vectorizer's report, enabled with the /Qvec-report2 option. From the compiler help:
/Qvec-report[n]
  control amount of vectorizer diagnostic information
    n=0: no diagnostic information
    n=1: indicate vectorized loops (DEFAULT when enabled)
    n=2: indicate vectorized/non-vectorized loops
    n=3: indicate vectorized/non-vectorized loops and prohibiting
         data dependence information
...
If you compile the file with default options plus /Qvec-report2, the compiler prints the following:
icl -Qvec-report2 test_loops.c
..\test_loops.c(34): (col. 1) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
..\test_loops.c(41): (col. 1) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
..\test_loops.c(45): (col. 1) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
..\test_loops.c(53): (col. 1) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
..\test_loops.c(16): (col. 1) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
..\test_loops.c(20): (col. 1) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
..\test_loops.c(24): (col. 1) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
..\test_loops.c(28): (col. 1) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
..\test_loops.c(60): (col. 1) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
The phrase "nonstandard loop" means that the loop's structure is unsuitable for loop optimization. Why could the compiler not recognize and classify these loops?
Loop 1: the global variable gn is used in the exit condition, and memory is modified through the pointer a on each iteration. In theory this pointer could point to gn, so the exit condition could change on every iteration.

Loop 2: the same situation, caused by using the global variable gi as the loop's induction variable. The induction variable could be changed on any iteration through the pointer a.

Loop 3: the induction variable changes nonlinearly.

Loop 4: a call to an unknown function inside the loop. Such a call could, for example, terminate the program.

Loop 5: there are several conditions under which execution of the loop can be interrupted, i.e. the number of iterations cannot be determined at compile time.

Loop 6: the compiler was unable to parse the compound exit condition. The use of the global gn is not an obstacle here, since the loop works with the local array b, so gn cannot be changed in the process.

Loop 7: a conditional jump inside the loop makes it difficult to determine the number of iterations. The jump does not depend on the induction variable, so the compiler could hoist the condition out of the loop; but apparently to do that it must first recognize the loop.

Loop 8: a goto leaving the loop makes its number of iterations indeterminate. Although this jump can never actually happen (i > n is false inside the loop), that would have to be proven BEFORE the loop-recognition phase.

Loop 9: a very typical case. The array field could in theory point to the lbound or ubound fields. This is one of the reasons why there are problems processing various containers from the STL, such as vector.
Most of the above difficulties in classifying loops are related to the problem of ambiguity resolution when working with memory (memory disambiguation). To resolve it, the compiler uses both the rules of the programming-language standards and the results of interprocedural analysis. Since in my example interprocedural analysis is not used to its full extent (by default the Intel compiler uses -Qip, i.e. only the functions in the compiled source file are analyzed), the compiler basically has to rely on the rules of the C and C++ standards alone, and those standards give programmers ample opportunity to create different objects that refer to the same memory.
We can formulate the following recommendations for programmers who want to achieve the best results when optimizing their programs:
1.) Do not use global variables and pointers in the constructs that control a loop's behavior. Where necessary, create local copies.
2.) Simplify, where possible, the loop's exit condition. For example, in loop 6 one could precompute the smaller of n and gn and use it in the exit condition.
3.) Avoid conditional jumps out of the loop.
4.) Use the -Qansi_alias option if your program does not violate the ANSI aliasing rules. These rules state that pointers of incompatible types can never refer to the same memory, i.e. by setting this option you allow the compiler to use type information to resolve ambiguities when working with memory. You are telling the compiler that in your program pointers refer only to objects of their own or compatible types.
5.) Quite often function arguments are pointers. The C99 standard introduced a new type qualifier for pointers, restrict. Qualifying pointers with restrict means that the data they point to do not overlap. In our example, the declaration of foo with this qualifier would look like: int foo(int n, float * restrict a, vec * restrict v). To use this feature, add the option -Qstd=c99 when compiling.
6.) When organizing your project, place functions that are used together in one source file and use the static attribute to limit the scope of objects and functions to that file. This improves the effectiveness of interprocedural analysis with the -Qip option. Looking ahead, I note that full interprocedural analysis with -Qipo is very useful, but it significantly increases compilation resources and compile time. Therefore, with a well-organized project, -Qip can be a very useful alternative, providing performance comparable to -Qipo.
7.) Use the option that enables full interprocedural analysis and optimization (-Qipo). It allows the compiler to determine, for example, whether the address of a variable was ever taken; if it can establish that a variable's address has never been taken, disambiguation becomes much simpler. In addition, interprocedural analysis collects information about the objects that pointers can refer to (local and global points-to analysis) and about the properties of functions. Inlining a function body can turn a loop containing a call into a "good" loop that can be optimized. All of this is partly true for -Qip as well.
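As a sketch of how recommendations 1, 2 and 5 apply to the loops from the first example (the function names and exact rewrites are mine, not from the original code):

```c
#include <stddef.h>

int gn;
typedef struct vector { float *array; int lbound; int ubound; } vec;

/* Loop 1 per recommendation 1: snapshot the global bound into a local,
   so stores through "a" can no longer be suspected of changing it. */
void loop1_fixed(float *restrict a)
{
    int n = gn;
    for (int i = 0; i < n; i++)
        a[i] = 0.0f;
}

/* Loop 6 per recommendation 2: precompute min(n, gn) instead of the
   compound exit condition i < n && i < gn. */
void loop6_fixed(int *restrict b, int n)
{
    int limit = n < gn ? n : gn;
    for (int i = 0; i < limit; i++)
        b[i] = i;
}

/* Loop 9 per recommendation 1: copy the pointer and both bounds into
   locals, so stores through "arr" cannot alias lbound/ubound. */
void loop9_fixed(vec *v)
{
    float *restrict arr = v->array;
    int lb = v->lbound, ub = v->ubound;
    for (int i = lb; i < ub; i++)
        arr[i] = (float)i;
}
```

In each case the transformation is trivial for the programmer but removes exactly the aliasing suspicion that blocked loop recognition.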
And, so as not to offend the C++ fans, a small synthetic test in C++.
#include <stdlib.h>

class Array {
public:
    Array(int n);
    ~Array() { free(Objects); }
    void initArray(int x);
    int nobjects;
    int *Objects;
};

Array::Array(int n)
{
    nobjects = n;
    Objects = (int*) malloc(nobjects * sizeof(int));
}

void Array::initArray(int x)
{
    for (int i = 0; i < nobjects; i++)
        Objects[i] = x;
}
The question is whether the loop inside initArray is recognized.
Without the -Qansi_alias option everything is bad here, because of the hidden this pointer which, from the point of view of a compiler taught by bitter experience, can point to anything. -Qansi_alias solves this problem: with this option, class members can no longer be aliased through this. But even -Qansi_alias does not help the compiler recognize the loop.
icl -Qvec-report2 -Qansi_alias test_loops.cpp -c
test_loops.cpp(18): (col. 2) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
The compiler fears that a store through Objects might change the variable nobjects, so the loop is not recognized. The -Qansi_alias option does not help, since Objects points to int and nobjects is of the same type. If this file were part of a project built with -Qipo, the compiler would recognize and vectorize this loop, provided it could prove that these objects cannot overlap in memory.
Note that when a class member like nobjects is used in the loop's exit condition, the compiler sees this->nobjects. Copying it into a local variable before the loop simplifies recognition of the loop.
void Array::initArray(int x)
{
    int n = nobjects;
    for (int i = 0; i < n; i++)
        Objects[i] = x;
}
Interestingly, if nobjects is declared a private member of the class, the loop is vectorized. This is thanks to the interprocedural analysis (-Qip) performed within the single file.
Thus another recommendation can be formulated:
6a) Use the private attribute for class members where possible, to limit their scope. This is a good recommendation from the point of view of writing modular code anyway, and it also brings some benefit for optimization.
I would also note that in a loop nest, an unrecognized inner loop automatically disqualifies all the loops enclosing it. There are not many transformations that operate on an entire loop nest, loop interchange being one example, but recognizing loop nests can be very useful for performance.
Legality criterion for reordering transformations
So, the main idea of loop optimization is to change the order in which work is performed. Once the loops have been recognized and classified, the real work begins. We must determine the conditions under which a reordering transformation can be applied while still obtaining an "equivalent program", or equivalent computation. From the point of view of an unbiased user, two programs are equivalent if, given the same input data, they produce the same results and print them in the same order. If you think about when a change in evaluation order leaves the final result unchanged, it is easy to derive the equivalence criterion: the results will be the same if statements that depend on each other execute in the same order in the original and the modified computation. That is, independent statements can be reordered arbitrarily, but the order of dependent computations must be preserved. And here a key concept for an optimizing compiler appears: dependence.
There are data dependences and control dependences.
If you want to apply a certain loop transformation, then to prove its correctness you need to show that the order of dependences in the original and the transformed loop is the same. There is a rather complicated methodology for detecting dependences when a loop (or loop nest) accesses an array A indexed by the iteration counters (induction variables). The compiler tries to determine whether a read and a write to the same memory occur on different iterations. For many loop optimizations the best case is the absence of any dependence in the loop (although there are transformations that exploit dependences in their work). To detect dependences the compiler uses quite sophisticated methods, but all this machinery is useless unless the memory-disambiguation problem has been solved first: otherwise completely unexpected dependences appear and block profitable optimizations.
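To make the distinction concrete, here is a small illustration of mine (not from the original post) of a loop-carried flow dependence versus a loop with no carried dependence:

```c
/* Each iteration reads the value the previous iteration wrote:
   a loop-carried flow (read-after-write) dependence. The iterations
   cannot be reordered or vectorized naively. */
void carried(float *a, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + 1.0f;
}

/* Each iteration touches only its own element: no carried dependence,
   so iterations may run in any order (or all at once, vectorized). */
void independent(float *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] + 1.0f;
}
```

The compiler's job is to prove, for every pair of memory accesses, which of these two situations it is looking at.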
Again, let's take a small example and see how dependences interfere with vectorization and other optimizations. Here all measures have already been taken to prevent different objects from sharing memory.
void foo(float * restrict a, float * restrict b, int n, int offset)
{
    int i;
    for (i = 0; i < n; i++) {
        a[i] = a[i + offset];
        b[i] = b[i + offset];
    }
    return;
}
icl -Qstd=c99 -Qvec-report3 test_loops.c -c
..\test_loops.c(4): (col. 1) remark: loop was not vectorized: existence of vector dependence.
..\test_loops.c(6): (col. 4) remark: vector dependence: assumed ANTI dependence between b line 6 and b line 6.
..\test_loops.c(6): (col. 4) remark: vector dependence: assumed FLOW dependence between b line 6 and b line 6.
..\test_loops.c(6): (col. 4) remark: vector dependence: assumed FLOW dependence between b line 6 and b line 6.
..\test_loops.c(6): (col. 4) remark: vector dependence: assumed ANTI dependence between b line 6 and b line 6.
..\test_loops.c(5): (col. 4) remark: vector dependence: assumed ANTI dependence between a line 5 and a line 5.
..\test_loops.c(5): (col. 4) remark: vector dependence: assumed FLOW dependence between a line 5 and a line 5.
..\test_loops.c(5): (col. 4) remark: vector dependence: assumed FLOW dependence between a line 5 and a line 5.
..\test_loops.c(5): (col. 4) remark: vector dependence: assumed ANTI dependence between a line 5 and a line 5.
The problem is that when performing this shift the compiler does not know how the variables n and offset are related. To help the compiler optimize such loops, you can use pragmas:
8) Use #pragma ivdep before a loop if, from your knowledge of the loop, there can be no dependence in it. Pragmas can also help with loop recognition.
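A sketch of the shift loop above annotated with #pragma ivdep. The pragma is a promise from the programmer that, for the values of n and offset this function will actually see, the load a[i + offset] never conflicts with an earlier store to a[i] (for example, because the caller guarantees offset >= n or offset <= 0). Note that #pragma ivdep is Intel-specific; other compilers ignore it or spell it differently (GCC has #pragma GCC ivdep):

```c
void shift(float *restrict a, float *restrict b, int n, int offset)
{
    /* Programmer's assertion: no loop-carried dependence for the
       n/offset combinations this function is called with. */
#pragma ivdep
    for (int i = 0; i < n; i++) {
        a[i] = a[i + offset];
        b[i] = b[i + offset];
    }
}
```

If the promise is broken at run time (e.g. 0 < offset < n), the vectorized code may silently compute different results than the scalar loop, so the pragma must be used only when the independence really holds.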
Another widely known case where the compiler has no chance of disproving dependences is indirect (vector) indexing, for example:
void foo(int n, float * restrict a, float * restrict b, int * restrict ind)
{
    int i;
    for (i = 0; i < n; i++)
        a[ind[i]] += b[i];
}
icl -Qstd=c99 -Qvec-report2 test_loops.c -c
..\test_loops.c(4): (col. 1) remark: loop was not vectorized: existence of vector dependence.
..\test_loops.c(5): (col. 3) remark: vector dependence: assumed ANTI dependence between a line 5 and a line 5.
..\test_loops.c(5): (col. 3) remark: vector dependence: assumed FLOW dependence between a line 5 and a line 5.
..\test_loops.c(5): (col. 3) remark: vector dependence: assumed FLOW dependence between a line 5 and a line 5.
..\test_loops.c(5): (col. 3) remark: vector dependence: assumed ANTI dependence between a line 5 and a line 5.
The ind array can contain duplicate values, and then there really will be loop-carried dependences preventing various optimizations. If you know there are no duplicates, #pragma ivdep again helps the compiler.
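A sketch of the indexed-accumulation loop with the pragma applied (the function name scatter_add is mine). This is legal only under the caller's guarantee that ind[] contains no duplicate values; with duplicates, a vectorized gather/update/scatter could drop some of the updates:

```c
/* Programmer's assertion via #pragma ivdep: all values in ind[] are
   distinct, so no two iterations update the same element of a[]. */
void scatter_add(int n, float *restrict a, const float *restrict b,
                 const int *restrict ind)
{
#pragma ivdep
    for (int i = 0; i < n; i++)
        a[ind[i]] += b[i];
}
```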
One can invent many more such cases, but I hope the general idea is clear.
Conclusion:
This text was written to show that the work of an optimizing compiler is in many cases determined by the need to be conservative: an optimization can be applied only when its correctness can be proven.
Conservative analysis must presume the programmer "guilty" of wanting to confuse and deceive the compiler. You can meet the compiler halfway by giving up some very common language features; languages also provide various means of giving the compiler additional information.
I would be glad if someone shares their experience on this topic.