
Progress does not stand still: OpenMP 4.5



Everything flows, everything changes, and OpenMP keeps evolving. Almost three years ago the standard started to support not only task parallelism but also data parallelism (vectorization), which I covered in detail earlier. It's time to look at what appeared in the latest version of the standard, released in November 2015, and what the Intel compilers already support today. Well, let's get started!

The taskloop construct


Consider a loop that we want to parallelize with OpenMP. Previously we would simply have put the parallel for directive in front of it and been happy. Something like this:

 #pragma omp parallel for
 for (int j = 0; j < n; j++)
     do_useful_job(j);

But everything changes when new standards and new capabilities arrive. Now we can do more than hand chunks of iterations out to all the threads: we can create tasks for them, and those tasks will then be distributed among the threads. This is implemented with the taskloop construct:
 #pragma omp taskloop 

By placing this directive before a for loop we say that the iterations of the loop should be split into chunks and a task created for each chunk. These tasks are then executed by the threads, but there is no fixed binding of a particular chunk of iterations to a particular thread. We can control the number of tasks with the num_tasks clause and the size of the chunks with grainsize. If we request 32 tasks with num_tasks(32) and the loop has 1024 iterations, each task will execute 32 iterations of the loop. To be fair, the previous OpenMP 4.0 standard already let us do the same thing:

 #pragma omp taskgroup
 {
     for (int tmp = 0; tmp < 32; tmp++)
         #pragma omp task
         for (int i = tmp * 32; i < tmp * 32 + 32; i++)
             do_useful_job(i);
 }

But now our code becomes even simpler, especially when the loops are nested or iterators are used.
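For comparison, here is a minimal sketch of what the same work could look like with the 4.5 taskloop construct; do_useful_job and the counts (1024 iterations, 32 tasks) are just the placeholders from the example above, not a real application.

 // Sketch of the taskloop equivalent of the 4.0 code above.
 // taskloop must run inside a parallel region; a single thread
 // generates the tasks and the whole team executes them.
 #pragma omp parallel
 #pragma omp single
 {
     #pragma omp taskloop num_tasks(32)   // 1024 / 32 = 32 iterations per task
     for (int i = 0; i < 1024; i++)
         do_useful_job(i);
 }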

With grainsize we can instead specify the number of iterations each task should execute, and the number of tasks is then computed automatically. A natural question arises: which is better, the "classic" parallel for or the taskloop construct?
If your code does not use tasks at all, there is little point in replacing parallel for with taskloop. The advantage shows up when the loop iterations are imbalanced and there are other tasks that can run in parallel with them. Here is an example from the OpenMP 4.5 documentation:

 void long_running_task(void);
 void loop_body(int i, int j);

 void parallel_work(void)
 {
     int i, j;
     #pragma omp taskgroup
     {
         #pragma omp task
         long_running_task(); // can execute concurrently

         #pragma omp taskloop private(j) grainsize(500) nogroup
         for (i = 0; i < 10000; i++) { // can execute concurrently
             for (j = 0; j < i; j++) {
                 loop_body(i, j);
             }
         }
     }
 }

Here we create a task that runs the long-running function long_running_task, and in the same task group (taskgroup) we iterate the loop with taskloop, "giving" each task 500 iterations. The function may execute in parallel with the loop iterations. The nogroup clause tells taskloop not to create its own implicit taskgroup, so all the tasks (the loop chunks and long_running_task) belong to the enclosing taskgroup, and we do not leave parallel_work until every one of them has finished.

It is worth noting that working with tasks can be more efficient because it avoids oversubscription, the situation where too many logical threads are used and the operating system has to time-slice access to the hardware resources, which adds significant overhead. When working directly with threads, the developer is responsible for matching the parallelism in the application to the available hardware resources. By the way, tasks (as opposed to threads) have been available since the OpenMP 3.0 standard.

A vivid example of where tasks are needed is the use of library functions inside parallel regions, say dgemm from MKL, which can be either sequential or parallel. As a result we may end up with far more logical threads than cores, which is quite likely to hurt performance.
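Here is a small sketch of how such oversubscription arises; parallel_library_call() is a hypothetical stand-in for a library routine (like a parallel dgemm) that opens its own parallel region internally.

 #include <omp.h>

 // Hypothetical stand-in for a library routine (e.g. a parallel dgemm)
 // that creates its own team of threads internally.
 void parallel_library_call(int i);

 void oversubscription_example(int n)
 {
     // Outer region: typically one thread per core.
     #pragma omp parallel for
     for (int i = 0; i < n; i++) {
         // If the library call spawns its own team as well, we end up
         // with roughly (cores x cores) logical threads competing for
         // the same hardware, and the OS has to time-slice them.
         parallel_library_call(i);
     }
 }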

The taskloop functionality is already supported by the Intel compiler. Tasks can now also be given different priorities through the priority clause, in which case the runtime executes high-priority tasks first. That said, the Intel compiler does not support this yet.
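For reference, a minimal sketch of the priority clause syntax; process_high() and process_low() are invented functions, and the priority values are only a hint to the runtime.

 // Sketch of the OpenMP 4.5 priority clause.
 #pragma omp parallel
 #pragma omp single
 {
     #pragma omp task priority(10)   // hint: schedule this task first
     process_high();

     #pragma omp task priority(0)
     process_low();

     #pragma omp taskwait
 }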

Doacross parallelism


There is a class of algorithms with well-structured loop-carried dependences. A typical example is stencil computations, widely used in scientific computing for solving differential equations by the finite-difference method and in computational mechanics problems.

In these cases we may still want to parallelize the loops, but the amount of work per iteration has to be substantial so that the threads do not spend most of their time waiting for each other.

In general, the order in which loop iterations execute is not defined. But already in the 4.0 standard the ordered construct appeared, which lets you mark certain parts of a loop that must execute in sequential order. Here is a simple example where this is useful:

 #pragma omp for ordered schedule(dynamic)
 for (int n = 0; n < 100; ++n)
 {
     files[n].compress();

     #pragma omp ordered
     send(files[n]);
 }

In this loop 100 files are "compressed" in parallel, but they are "sent" strictly in increasing order. If a thread has compressed the 10th file but the 9th has not been sent yet, that thread waits for the send and does not start compressing a new file. So all threads work in parallel up to the ordered part, where execution is sequential. This works well as long as the code outside the ordered block takes a substantial amount of time.

In the new standard (and the Intel compiler already supports this feature) the ordered construct can take additional clauses for describing well-structured dependences between loop iterations:

 #pragma omp for ordered(2)
 for (int i = 0; i < M; i++)
     for (int j = 0; j < N; j++)
     {
         a[i][j] = foo(i, j);

         #pragma omp ordered depend(sink: i - 1, j) depend(sink: i, j - 1)
         b[i][j] = bar(a[i][j], b[i - 1][j], b[i][j - 1]);

         #pragma omp ordered depend(source)
         baz(a[i][j], b[i][j]);
     }

Let's look at this example. Only the outer loop is distributed among the threads. The ordered depend(sink) directive blocks execution of iteration (i, j) until iterations (i-1, j) and (i, j-1) have reached the ordered depend(source) directive.

In this example the compiler can ignore depend(sink: i, j - 1): since only the iterations of the outer loop are distributed among the threads, by the time a thread reaches iteration (i, j) it has already executed iteration (i, j-1) itself.
By the way, the ordered directive can now also take the simd clause, for use inside vectorized loops:

 #pragma omp ordered simd 

It marks a part of the code in a SIMD loop that must execute in lexical (iteration) order; in other words, it defines a small region that will not be vectorized.
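A small sketch of how this might look; the histogram update, hist[], bucket_of() and a[] are invented for illustration. The loop is vectorized, but the block under ordered simd runs one lane at a time, in iteration order.

 // The loop body is vectorized, but the block marked "ordered simd"
 // executes for one iteration at a time, in lexical order, so the
 // scalar histogram update has no vector conflicts.
 #pragma omp simd
 for (int i = 0; i < n; i++) {
     int b = bucket_of(a[i]);   // vectorized part

     #pragma omp ordered simd
     {
         hist[b]++;             // scalar, executed in iteration order
     }
 }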

General improvements


C++ references used to be allowed only in shared clauses; this restriction has been lifted and they can now appear in private/firstprivate clauses as well.
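A tiny sketch of what is now allowed; the function and its arguments are invented for illustration.

 #include <vector>

 // In OpenMP 4.5 the reference 'factor' may appear in a firstprivate
 // clause; before 4.5 only shared was allowed for references.
 void scale(std::vector<double> &v, double &factor)
 {
     #pragma omp parallel for firstprivate(factor)
     for (std::size_t i = 0; i < v.size(); i++)
         v[i] *= factor;
 }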
Another long-awaited improvement concerns reductions in loops. Let me remind you that they solve a synchronization problem:

 int sum = 0;
 #pragma omp parallel for reduction(+:sum)
 for (int i = 0; i < N; i++)
     sum += val[i];

Here the value of the variable sum is computed correctly: each thread gets its own copy, and at the end the per-thread values are summed together. That is what the reduction clause, with its operator and variable, is for.
However, reductions could not be performed on elements of an array. That is, if sum is an array of size N and we wanted a reduction over sum[i], we had to invent something like this ourselves (probably not the best solution):

 #pragma omp parallel
 {
     int sum_private[N];
     for (int i = 0; i < N; i++)   // the private copy must start at zero
         sum_private[i] = 0;

     #pragma omp for
     for (int i = 0; i < N; i++) {
         for (int j = 0; j <= i; j++) {
             sum_private[i] += val[j];
         }
     }

     #pragma omp critical
     {
         for (int i = 0; i < N; i++)
             sum[i] += sum_private[i];
     }
 }

The 4.0 standard made it possible to define our own reductions (user-defined reductions), but for that we had to wrap the array in our own data type, a structure or a class:

 struct user_def_red {
     int sum[10];
 };

Define the operation, for example, addition:
 void add_user_def_red(struct user_def_red *x, struct user_def_red *y)
 {
     int i;
     for (i = 0; i < 10; i++)
         x->sum[i] += y->sum[i];
 }

And declare the reduction itself:

 #pragma omp declare reduction(Add_test : struct user_def_red : \
     add_user_def_red(&omp_out, &omp_in)) \
     initializer(omp_priv = {{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }})

And only after all that could the wrapped array be used as a reduction variable:

 #pragma omp parallel for reduction(Add_test : Sum_red)
 for (n = 0; n < 10; ++n)
     for (m = 0; m <= n; ++m)
         Sum_red.sum[n] += val[m];
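In OpenMP 4.5 this boilerplate is no longer needed for C/C++: array sections can appear directly in the reduction clause. A minimal sketch, reusing sum and val from the examples above:

 // OpenMP 4.5 array-section reduction: each thread gets a private copy
 // of the whole section sum[0:N], and the copies are combined element
 // by element with '+' at the end.
 #pragma omp parallel for reduction(+ : sum[0:N])
 for (int i = 0; i < N; i++)
     for (int j = 0; j <= i; j++)
         sum[i] += val[j];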

It is worth noting that even the latest Intel compiler, 17.0 Update 1, does not yet support array reductions for C++.
In addition, the standard now allows non-static class data members to be used in private clauses inside (non-static) member functions by naming them directly, without going through the this pointer.
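A short sketch of what that looks like; the class and the loop body are invented for illustration.

 // OpenMP 4.5 lets a non-static data member ('seed') be named directly
 // in a data-sharing clause of a non-static member function.
 class Generator {
     int seed;
 public:
     void fill(int *out, int n)
     {
         #pragma omp parallel for firstprivate(seed)
         for (int i = 0; i < n; i++)
             out[i] = seed + i;   // each thread works on its own copy of seed
     }
 };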

Synchronization mechanisms


Modern processors support transactional memory, for example IBM BlueGene/Q or Intel TSX (Intel Transactional Synchronization Extensions); plenty of posts about it are easy to find. The idea is quite interesting and can give a performance boost under the right conditions. Note that applications can have very different requirements for their synchronization mechanism: in some cases conflicts on shared data happen often, in others we protect system calls (for example, I/O operations), and sometimes we add synchronization just to be safe, because conflicts almost never occur (hash tables). It would be nice to choose the implementation of the synchronization objects depending on our needs.

But, as we know, if you simply use OpenMP synchronization directives such as critical, or work with locks through the functions omp_init_lock, omp_set_lock and omp_unset_lock, the compiler is unlikely to generate code that uses transactional memory and the corresponding instructions.
Now, at the standard level, we can say which kind of synchronization object we want. This is done with the new "with hint" functions:

 omp_init_lock_with_hint(omp_lock_t *lock, omp_lock_hint_t hint)
 omp_init_nest_lock_with_hint(omp_nest_lock_t *lock, omp_lock_hint_t hint)

Through the hint argument we specify the kind of synchronization that suits us:

 omp_lock_hint_none
 omp_lock_hint_uncontended
 omp_lock_hint_contended
 omp_lock_hint_nonspeculative
 omp_lock_hint_speculative

So if we want to use transactional memory, we set the hint to omp_lock_hint_speculative and the compiler will generate the appropriate instructions. With the Intel compiler, Intel TSX is used as the implementation:

 void example_locks()
 {
     omp_lock_t lock;
     omp_init_lock_with_hint(&lock, omp_lock_hint_speculative);

     #pragma omp parallel
     {
         omp_set_lock(&lock);
         do_some_protected_job();
         omp_unset_lock(&lock);
     }
 }

The hint clause can also be given on a critical section; in that case the section must have an explicit name:

 void example_critical_with_hint()
 {
     #pragma omp parallel for
     for (int i = 0; i < N; i++)
     {
         Data d = get_some_data(i);

         #pragma omp critical (HASH) hint(omp_lock_hint_speculative)
         hash.insert(d);
     }
 }


What is left behind the scenes?


In addition to everything above, the standard gained many smaller improvements that make our lives easier. For example, the linear clause of #pragma omp declare simd now accepts the val, uval and ref modifiers. They let us say explicitly what is really "linear": the addresses or the values themselves. This is especially relevant for Fortran developers, where function arguments are passed by reference rather than by value, which used to cost performance with declare simd because the compiler had to generate gather instructions.
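A minimal sketch of the new modifiers in C++ syntax; the function itself is invented. Here ref(x) promises that across consecutive SIMD lanes the reference x points to consecutive elements (unit stride), so the compiler can use a single vector load instead of gathers.

 // OpenMP 4.5 linear modifiers on declare simd; scale_elem() is an
 // invented example. 'ref(x)' marks the referenced storage as linear,
 // 'uniform(scale)' says scale is the same for all lanes.
 #pragma omp declare simd linear(ref(x)) uniform(scale)
 float scale_elem(const float &x, float scale)
 {
     return x * scale;
 }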

I deliberately said nothing about a huge topic that deserves separate attention: the OpenMP facilities for offload (moving computations to various accelerators).
That part has changed significantly, to the point where code written against the previous standard may not even compile any more. I hope the topic is of interest, and I will write a follow-up. As they say, "to be continued...".

Source: https://habr.com/ru/post/316164/
