
Recently I stumbled upon the blog post "OpenMP 3.0 tasking", and it prompted me to check how well recursive functions parallelize with OpenMP 3.0 tasks. Recall that before the third version of the standard there was no support for dynamic or irregular parallelism (for example, while loops or recursive functions); a sketch of what tasks make expressible follows below.
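To make the contrast concrete, here is a minimal sketch of the kind of irregular parallelism tasks enable: a while-style traversal of a linked list, which the pre-3.0 worksharing constructs could not express. The node type and the process() function are hypothetical, my own illustration rather than anything from the sources discussed here:
#include <omp.h>

typedef struct node {
    struct node *next;
    int payload;
} node_t;

void process( node_t *p );   /* hypothetical per-node work */

void traverse( node_t *head ) {
    node_t *p = head;
    #pragma omp parallel
    #pragma omp single
    while( p != NULL ) {
        /* Each list node becomes an asynchronous task; firstprivate
           captures the current pointer before the loop advances it. */
        #pragma omp task firstprivate( p )
        process( p );
        p = p->next;
    }
}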
Computing systems
I will begin with a description of the computing systems that I used. Below is the list of machines for which I can publish results; the rest are not yet in production.

Parallelization process
For testing, I took the computation of Fibonacci numbers. The first idea for parallelization was to create an asynchronous task for each recursion step and to synchronize before leaving the function. Unfortunately, this approach gave me no speedup at all (below 1x), because the runtime overhead of task creation dominated the useful work.
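A minimal sketch of that first attempt, reconstructed from the description above (fib_naive is my name for it, not the author's):
INT_TYPE
fib_naive( INT_TYPE n ) {
    INT_TYPE i, j;
    if( n < 2 )
        return n;
    /* One task per recursion step: the task-creation overhead ends up
       dwarfing the few arithmetic operations each task performs. */
    #pragma omp task shared( i )
    i = fib_naive( n - 1 );
    #pragma omp task shared( j )
    j = fib_naive( n - 2 );
    #pragma omp taskwait
    return i + j;
}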
The solution was found here, and it turned out to be simple: create tasks only for part of the recursion steps and execute the rest sequentially; in other words, balance the runtime overhead against the useful execution time. Below is the piece of code responsible for the balancing:
INT_TYPE
fib( INT_TYPE n ) {
    INT_TYPE i, j;
    if( n < SPLITTER ) {
        /* Below the cutoff, fall back to plain serial recursion. */
        return fib_serial( n );
    } else {
        /* Above the cutoff, spawn the two sub-problems as tasks. */
        #pragma omp task shared( i )
        i = fib( n - 1 );
        #pragma omp task shared( j )
        j = fib( n - 2 );
        /* Wait for both child tasks before combining the results. */
        #pragma omp taskwait
        return i + j;
    }
}
In this example, SPLITTER is responsible for the balancing. All of the source code can be downloaded here.
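The serial fallback fib_serial is not reproduced in the article; presumably it is just the plain recursion, along these lines:
INT_TYPE
fib_serial( INT_TYPE n ) {
    /* Plain recursion, no pragmas: a sketch of the fallback referenced
       above; the actual definition is in the downloadable source. */
    if( n < 2 )
        return n;
    return fib_serial( n - 1 ) + fib_serial( n - 2 );
}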
But not everything is so simple: the parallelization efficiency differs for different SPLITTER values. By plain trial and error I picked two SPLITTER values, 19 and 32, at which the parallelization efficiency approaches one.
Compilers
Of all the software, only the compilers are of interest here. I had three at my disposal: Intel® Composer XE for Linux, Microsoft Visual C++ 2010, and GCC. Visual C++ was discarded immediately because it does not support OpenMP 3.0 (at the moment it supports only OpenMP 2.5). GCC has supported OpenMP 3.0 since version 4.4; I used 4.5.0.
Below are the compile times and the sizes of the executables, linked both dynamically and statically. Note that, unlike GCC, Intel® Composer XE was installed on a remote machine, which inflates its compile time.

The command lines for GCC and Intel® Composer XE are as follows:
# gcc ./fib_task.c -fopenmp
# gcc ./fib_task.c -fopenmp -static
# icc ./fib_task.c -openmp -static
# icc ./fib_task.c -openmp
Execution time and scalability
All measurements were made while searching for the 50th Fibonacci number, with SPLITTER = 19 and SPLITTER = 32.
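For anyone repeating the runs, here is a minimal sketch of a timing harness (my own, not part of the downloadable source), assuming the fib() shown earlier and a 64-bit INT_TYPE:
#include <omp.h>
#include <stdio.h>

int main( void ) {
    int t;
    /* Try 1, 2, 4, ... threads up to the number of logical processors. */
    for( t = 1; t <= omp_get_num_procs(); t *= 2 ) {
        INT_TYPE result;
        double start, stop;
        omp_set_num_threads( t );
        start = omp_get_wtime();
        /* The top-level call must sit inside parallel/single so the
           spawned tasks have a thread team to execute on. */
        #pragma omp parallel
        #pragma omp single
        result = fib( 50 );
        stop = omp_get_wtime();
        printf( "%2d threads: %8.3f s, fib(50) = %llu\n",
                t, stop - start, (unsigned long long) result );
    }
    return 0;
}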

Acceleration and Parallelization Efficiency

This chart shows the speedup for SPLITTER = 32. Efficiency, in turn, is the speedup divided by the number of threads.
Source and executable files
As mentioned above, the source file can be downloaded here. As shipped, it computes the 41st Fibonacci number; to reproduce the measurements, change the variables to n = 50 and expected = 12586269025UL, as sketched below. You can also download the ready-made executables right away.
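Assuming those variables sit at the top of main(), as the description suggests, the two edits look like this:
INT_TYPE n        = 50;             /* 41 in the example as downloaded     */
INT_TYPE expected = 12586269025UL;  /* fib(50), used to verify the result  */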
Findings
Actually, the charts with the execution times and speedups speak for themselves.
- For Composer XE, the parallelization efficiency is ~ 1 as long as the number of threads does not exceed the number of physical cores.
- For GCC 4.5.0, the parallelization efficiency is ~ 0.5 as long as the number of threads does not exceed the number of physical cores.
I leave the remaining conclusions to you, since everyone is interested in something different.
Afterword
On SpringerLink I found the paper "An Extension to Improve OpenMP Tasking Control", which proposes introducing a new final clause that acts as a balancer. From my point of view, this is a sensible proposal. One could go further and add balancer clauses that perform adaptive balancing at run time.
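For illustration, here is a sketch of how such a final clause could express the same cutoff that SPLITTER implements manually above (a clause of this form was in fact standardized later, in OpenMP 3.1):
INT_TYPE
fib_final( INT_TYPE n ) {
    INT_TYPE i, j;
    if( n < 2 )
        return n;
    /* Once n drops below SPLITTER the task becomes final, and all of its
       descendant tasks execute immediately instead of being deferred,
       which replaces the manual fib_serial branch. */
    #pragma omp task shared( i ) final( n < SPLITTER )
    i = fib_final( n - 1 );
    #pragma omp task shared( j ) final( n < SPLITTER )
    j = fib_final( n - 2 );
    #pragma omp taskwait
    return i + j;
}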
Please refer to
the Optimization Notice page for more details on performance and optimization in Intel software products.