
We waited, and waited, and it finally arrived: OpenMP 4.0



Each new OpenMP specification introduces very useful and much-needed additions to the existing functionality. For example, version 3.0 added so-called tasks, which made it possible to parallelize an even wider range of applications, and version 3.1 brought a number of improvements for working with tasks and with reductions.

But compared with what version 4.0 of the standard now gives us, the previous innovations look modest. The latest version expands the kinds of parallelism supported, something that had never happened before.

We all know that parallelism comes in very different forms. It starts at the instruction level, where modern processor architectures give us pipelining and superscalar execution. At that level we can already say our application is parallel, but this is something built into today's hardware and used implicitly, without any effort from developers. Modern processors have other "goodies" as well: multiple cores and vector registers (SSE, AVX, etc.).
They give us two more types of parallelism:

- task (thread) parallelism, where the work is spread across several cores;
- data parallelism, where a single instruction processes a whole vector of data at once (vectorization).

These are, in fact, all the kinds of parallelism available on shared-memory systems. It is worth noting that distributed systems also exist (hello to clusters and MPI), but they are not the focus of today's conversation. We are talking about OpenMP, which is classically "sharpened" for shared-memory systems.

Moreover, until the very last moment, namely July 2013, it was always about task parallelism. OpenMP became one of the most popular approaches here: universal in terms of language support (it suits both C/C++ and Fortran) and simple in terms of adoption (the idea of adding parallelism step by step through directives).

But data parallelism plays no smaller role when it comes to performance. It is important to note that vectorization is one of the key points when using the Intel compiler if you are working on an application where performance really matters; the same is true, by the way, for any other compiler. That is why there has long been a whole set of directives that help vectorize code, that is, generate instructions that use the full width of the registers, be it SSE, AVX, AVX2 or something "fresher" and "longer". And do not forget that vectorization plays a key role on the new MIC architecture.

The committee working on the OpenMP specification understood this too. So, by a deliberate decision, directives for working with data parallelism, very similar to those in Intel Cilk Plus and well known to people working with the Intel compiler, were added to the latest document.

So, what has appeared?
Now we can not only "scatter" our work across threads through the existing directives, but also ensure vectorization of the corresponding code. And all of this within OpenMP itself, that is, without being tied to a specific compiler, since sooner or later compilers tend to support everything that is in the standard.

First, I would like to show how it was done in Cilk Plus, using a simple example.
Suppose we have a loop like this:

void add_floats(float *a, float *b, float *c, float *d, float *e, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
    }
}

If we compile it with the vec-report option to find out whether it was vectorized, we will see that it was not. The reason is the compiler's conservatism: if it "feels" that vectorization is unsafe, it prefers to play it safe and does nothing. In our case this is due to the large number of pointers passed into the function. The compiler simply assumes that some of them may overlap in memory. But if we, as developers, know that this is not the case and the loop can be vectorized safely, we can use the #pragma simd directive from Intel Cilk Plus:

#pragma simd
for (i = 0; i < n; i++) {
    a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
}

We rebuild, and voila: the loop is vectorized! But once again, this is specific to the Intel compiler. Now the same functionality has appeared in OpenMP, only the directive has taken a more "familiar" form, namely:

 #pragma omp simd 
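
Applied to the same add_floats function, a minimal sketch might look like this (again under our own assumption that the pointers never overlap; the name add_floats_omp is just for illustration):

void add_floats_omp(float *a, float *b, float *c, float *d, float *e, int n)
{
    int i;
    /* We assert that the iterations are independent, so the compiler
       is free to generate vector instructions for this loop. */
    #pragma omp simd
    for (i = 0; i < n; i++) {
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
    }
}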

In addition, there is a combined construct for using both types of parallelism at the same time:

 #pragma omp parallel for simd 

It should be noted that using the construct in its "bare" form is not really recommended, because the compiler's own checks for that loop are completely disabled. Therefore it is possible, and necessary, to specify additional clauses, as with other OpenMP directives, that help the compiler (about the vector length, memory alignment, access patterns, and so on). By the way, these are also very similar to the ones that already existed in the Intel compiler, with a few changes.
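
For example, a sketch of how such hints might look on our loop (the safelen value of 8 and the 32-byte alignment here are purely illustrative assumptions about the data, not requirements of the standard):

void add_floats_hinted(float *a, float *b, float *c, float *d, float *e, int n)
{
    int i;
    /* safelen(8): iterations less than 8 apart may run concurrently;
       aligned(... : 32): we assume the arrays are 32-byte aligned.
       Both values are assumptions for the sake of the example. */
    #pragma omp parallel for simd safelen(8) aligned(a, b, c, d, e : 32)
    for (i = 0; i < n; i++) {
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
    }
}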

Another very useful thing has been added: so-called elemental functions. It is no secret that if a function is called inside a loop, that loop is, in the general case, not vectorized. In Cilk Plus a magic __declspec(vector) was used, which turns an ordinary function into an elemental one, that is, a function able to "produce" a whole vector of results. As a result, it could be called in a loop, and the loop could then be vectorized successfully. In OpenMP it now looks like this:

 #pragma omp declare simd 
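
A small sketch of how this could be used (the functions scale_shift and apply are invented for illustration): the compiler generates a vector variant of the marked function in addition to the usual scalar one, and the call inside the loop no longer prevents vectorization.

/* Elemental (SIMD-enabled) function: the compiler also produces a vector variant. */
#pragma omp declare simd
float scale_shift(float x, float a, float b)
{
    return a * x + b;
}

void apply(float *dst, const float *src, float a, float b, int n)
{
    int i;
    /* The call to scale_shift no longer blocks vectorization of this loop. */
    #pragma omp simd
    for (i = 0; i < n; i++) {
        dst[i] = scale_shift(src[i], a, b);
    }
}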

Fortran developers will also be happy now, because until very recently they had no such support at all (Cilk Plus is C/C++ only).
Thus developers now have, within a single specification, a powerful tool for exploiting all types of parallelism, which is good news. But that is not all. The committee also did not overlook the programming model for "hybrid" systems with accelerators, which has become extremely popular. And no matter what kind of accelerator it is, a GPU or some other exotic beast, the model for using it from the developer's point of view is now common, and it too is part of the new OpenMP. With the help of the target directive we can now run computations on an accelerator and get the results back on the host:

#pragma omp target map(to: b[0:count], c, d) map(from: a[0:count])
{
    #pragma omp parallel for
    for (i = 0; i < count; i++)
        a[i] = b[i] * c + d;
}
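
For completeness, here is a minimal self-contained sketch of the same idea (the function name saxpy_offload and the sizes are my own invention; if no target device is available, the region simply runs on the host):

#include <stdio.h>

void saxpy_offload(float *a, float *b, float c, float d, int count)
{
    /* Send b to the device, bring a back to the host;
       the scalars c, d and count are mapped implicitly. */
    #pragma omp target map(to: b[0:count]) map(from: a[0:count])
    {
        #pragma omp parallel for
        for (int i = 0; i < count; i++)
            a[i] = b[i] * c + d;
    }
}

int main(void)
{
    enum { N = 8 };
    float a[N], b[N];
    for (int i = 0; i < N; i++)
        b[i] = (float)i;

    saxpy_offload(a, b, 2.0f, 1.0f, N);

    for (int i = 0; i < N; i++)
        printf("a[%d] = %f\n", i, a[i]);
    return 0;
}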

But that is a big topic of its own. By the way, OpenMP is now increasingly being compared with other standards, in particular OpenCL, OpenACC and others, and the comparison turns out to be interesting, with many points in OpenMP's favor. But back to the topic.
In addition to everything above, other "goodies" have appeared as well, though I would rank them as less significant, even if very useful for various kinds of tasks. I am talking about new tools for error handling, binding threads to specific cores, new tasking extensions, support for user-defined reductions and a number of other innovations.
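
As one small illustration of user-defined reductions, a sketch might look like this (the cplx type and the cadd identifier are invented for the example; the combiner and initializer follow the declare reduction syntax):

typedef struct { float re, im; } cplx;

/* How to initialize a private copy of the reduction variable. */
void cplx_init(cplx *p) { p->re = 0.0f; p->im = 0.0f; }

/* How two partial results are combined. */
#pragma omp declare reduction(cadd : cplx : omp_out.re += omp_in.re, omp_out.im += omp_in.im) initializer(cplx_init(&omp_priv))

cplx sum_cplx(const cplx *v, int n)
{
    cplx s = {0.0f, 0.0f};
    int i;
    #pragma omp parallel for reduction(cadd : s)
    for (i = 0; i < n; i++) {
        s.re += v[i].re;
        s.im += v[i].im;
    }
    return s;
}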

A full description can, as usual, be found on the OpenMP website. One of the first compilers that already partially supports a number of the new features is the Intel compiler; as always, a 30-day trial version is available. So welcome, and try out the new features of OpenMP 4.0!

Source: https://habr.com/ru/post/204668/

