A lot of interesting things have been written on the topic of vectorization. For example, there is a great post that explains in detail how auto-vectorization works; I highly recommend reading it. I'm interested in a different question. Developers today have a large number of ways to create "vector" code, from pure assembly all the way up to the auto-vectorizer. Which way should you choose? How do you find a balance between the necessary and the sufficient? That is what this post is about.
So, there are several ways to get the coveted vector instructions. Let's lay them out schematically in the following table:
If we are experienced gurus and can afford to write in pure assembly, this way perhaps gives us 100% confidence that our code gets the maximum out of the processor. After all, we write directly in the instructions we need and use all of its capabilities. The catch is that the code will be "sharpened" for a specific instruction set, and therefore for specific hardware. The release of new instructions (and progress does not stand still) will require a global rework and new labor costs. Obviously, it is worth thinking about something more user-friendly. And at the next "step" up, intrinsic functions appear.
This is no longer pure assembly; however, it still takes a lot of time to rewrite code this way. Say, a simple loop that adds two arrays will look like this:
#include <immintrin.h>

// _mm256_load_pd/_mm256_store_pd require 32-byte alignment
alignas(32) double A[100], B[100], C[100];

for (int i = 0; i < 100; i += 4) {
    __m256d a = _mm256_load_pd(&A[i]);
    __m256d b = _mm256_load_pd(&B[i]);
    __m256d c = _mm256_add_pd(a, b);
    _mm256_store_pd(&C[i], c);
}
In this case we use AVX intrinsic functions. This guarantees the generation of the corresponding AVX instructions; that is, we are again tied to specific hardware. Labor costs have decreased, but we will not be able to reuse this code forever: sooner or later it will have to be rewritten again. And this will always be the case as long as we explicitly select instructions by "writing" them into the source code, whether through pure assembly, intrinsic functions, or SIMD intrinsic classes. The latter, by the way, are an interesting thing in their own right, representing the next level of abstraction.
The same example rewritten with SIMD classes looks like this:
#include <dvec.h>

// 4 elements per vector * 25 = 100 elements
F64vec4 A[25], B[25], C[25];

for (int i = 0; i < 25; i++)
    C[i] = A[i] + B[i];
In this case we no longer need to know which functions to use. The code itself looks quite elegant, and the developer only needs to declare data of the desired class. In this example, F64 means a 64-bit floating-point type (double), and vec4 means 4-element vectors, that is, Intel AVX (vec2 corresponds to SSE).
I think everyone understands why this method cannot be called the best in terms of the price/quality ratio. That's right: portability is still not perfect. Therefore, a reasonable solution is to delegate such problems to the compiler. By simply rebuilding our code with it, we can create binaries for whatever architecture we need and use the latest instruction sets. At the same time, we need to make sure that the compiler actually manages to vectorize the code.
So far we have been walking the table from the bottom up, discussing the "complex" ways of vectorizing code. Now let's talk about the simpler ones.
Obviously, the easiest way is to shift all responsibility to the compiler and enjoy life. But not everything is so simple. No matter how clever the compiler is, there are still many cases where it is powerless to do anything with a loop without additional data or hints. In addition, in some cases code that is successfully vectorized by one version of the compiler is no longer vectorized by another. It all comes down to compiler heuristics, so relying on auto-vectorization 100% of the time is not possible, although the feature is definitely useful. For example, a modern compiler can vectorize code like this:
double A[1000], B[1000], C[1000], D[1000], E[1000];

for (int i = 0; i < 1000; i++)
    E[i] = (A[i] < B[i]) ? C[i] : D[i];
If we tried to create an equivalent of this code with intrinsic functions, guaranteeing vectorization, we would get something like this:
#include <emmintrin.h>

double A[1000], B[1000], C[1000], D[1000], E[1000];

for (int i = 0; i < 1000; i += 2) {
    __m128d a = _mm_load_pd(&A[i]);
    __m128d b = _mm_load_pd(&B[i]);
    __m128d c = _mm_load_pd(&C[i]);
    __m128d d = _mm_load_pd(&D[i]);
    // Build a mask from the comparison and blend C and D through it
    __m128d mask = _mm_cmplt_pd(a, b);
    __m128d e = _mm_or_pd(_mm_and_pd(mask, c),
                          _mm_andnot_pd(mask, d));
    _mm_store_pd(&E[i], e);
}
How good it is when the compiler can do this for us! It is a pity that it cannot always... and in the cases where the compiler does not cope, the developer can help it. For this, there are special "hints" in the form of directives. For example, #pragma ivdep tells the compiler that there are no dependencies in the loop, and #pragma vector always tells it to ignore its vectorization "efficiency policy" (often, if the compiler thinks that vectorizing a loop is inefficient, it simply does not do it). But these directives are from the "may help" category. If the compiler is sure that there are dependencies, it will not vectorize the loop even with pragma ivdep in place.
That is why I single out another approach, which is also based on directives but works on somewhat different principles. These are the directives from the new OpenMP 4.0 standard and Intel Cilk Plus: #pragma omp simd and #pragma simd, respectively. They make the compiler completely "forget" its own checks and rely entirely on what the developer says. Responsibility, in this case, naturally shifts to the developer's shoulders and head, so you need to act carefully. Hence the need for yet another method.
How can we make the checks stay in place while the code is still guaranteed to be vectorized? With the syntax that exists in standard C/C++, so far we cannot. But with the special syntax for working with arrays (array notation) that is part of Cilk Plus (see the previous post to get an idea of everything it offers), we can. Moreover, the syntax is very simple, reminiscent of Fortran, and has the following form:
base[first:length:stride]
We specify a name, a starting index, the number of elements, and a stride (optional), and off we go. The previous example is rewritten like this:
double A[1000], B[1000], C[1000], D[1000], E[1000];

E[:] = (A[:] < B[:]) ? C[:] : D[:];
A lone colon means that we refer to all elements. More complex manipulations are also possible. Say, this code
for (int i = 0; i < 5; i++)
    A[(i * 2) + 1] = B[i + 1];
can be rewritten more compactly and, most importantly, with guaranteed vectorization:
int A[10], *B;  // B must point to at least 6 valid int elements

A[1:5:2] = B[1:5];
So we see that there are indeed many ways to achieve vectorization. If we talk about balance, then since we listed all the methods in the table "from simple to complex", from the point of view of required effort versus result, the golden mean converges on Cilk Plus. But this does not mean everything is so clear-cut. If a patient is ill, he is not always immediately prescribed antibiotics, right? So it is here. For some, auto-vectorization may be enough; for others, the ivdep and vector always directives are quite a reasonable solution. What matters more is to involve the compiler, so that your head does not hurt when new instructions and hardware come out, and here Intel always has something to offer. Until new posts, friends!