
Recently I stumbled upon the blog post "OpenMP 3.0 tasking", and it prompted me to check how well recursive functions parallelize with OpenMP 3.0 tasks. Recall that before the third version of the standard there was no support for dynamic or irregular parallelism (for example, while loops or recursive functions); a sketch of what tasks make expressible follows below.
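To make the contrast concrete, here is a minimal sketch of the kind of irregular parallelism tasks enable: a while-style traversal of a linked list, which the pre-3.0 worksharing constructs could not express. The node type and the process() function are hypothetical, my own illustration rather than anything from the sources discussed here:
#include <omp.h>

typedef struct node {
    struct node *next;
    int payload;
} node_t;

void process( node_t *p );   /* hypothetical per-node work */

void traverse( node_t *head ) {
    node_t *p = head;
    #pragma omp parallel
    #pragma omp single
    while( p != NULL ) {
        /* Each list node becomes an asynchronous task; firstprivate
           captures the current pointer before the loop advances it. */
        #pragma omp task firstprivate( p )
        process( p );
        p = p->next;
    }
}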
Computing systems
I will begin with a description of the computing systems that I used. Below is the list of machines for which I can publish results; the rest are not yet in production.

Parallelization process
For testing, I took the computation of Fibonacci numbers. The first idea for parallelization was to create an asynchronous task for each recursion step and to synchronize before leaving the function. Unfortunately, this approach gave me no speedup at all (below 1x), because the runtime overhead of task creation dominated the useful work.
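A minimal sketch of that first attempt, reconstructed from the description above (fib_naive is my name for it, not the author's):
INT_TYPE
fib_naive( INT_TYPE n ) {
    INT_TYPE i, j;
    if( n < 2 )
        return n;
    /* One task per recursion step: the task-creation overhead ends up
       dwarfing the few arithmetic operations each task performs. */
    #pragma omp task shared( i )
    i = fib_naive( n - 1 );
    #pragma omp task shared( j )
    j = fib_naive( n - 2 );
    #pragma omp taskwait
    return i + j;
}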
The solution was found here, and it turned out to be simple: create tasks only for part of the recursion steps and execute the rest sequentially; in other words, balance the runtime overhead against the useful execution time. Below is the piece of code responsible for the balancing:
INT_TYPE
fib( INT_TYPE n ) {
    INT_TYPE i, j;
    if( n < SPLITTER ) {
        /* Below the cutoff, fall back to plain serial recursion. */
        return fib_serial( n );
    } else {
        /* Above the cutoff, spawn the two sub-problems as tasks. */
        #pragma omp task shared( i )
        i = fib( n - 1 );
        #pragma omp task shared( j )
        j = fib( n - 2 );
        /* Wait for both child tasks before combining the results. */
        #pragma omp taskwait
        return i + j;
    }
}
In this example, SPLITTER is responsible for the balancing. All of the source code can be downloaded here.
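The serial fallback fib_serial is not reproduced in the article; presumably it is just the plain recursion, along these lines:
INT_TYPE
fib_serial( INT_TYPE n ) {
    /* Plain recursion, no pragmas: a sketch of the fallback referenced
       above; the actual definition is in the downloadable source. */
    if( n < 2 )
        return n;
    return fib_serial( n - 1 ) + fib_serial( n - 2 );
}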
But not everything is so simple: the parallelization efficiency differs for different SPLITTER values. By plain trial and error I picked two SPLITTER values, 19 and 32, at which the parallelization efficiency approaches one.
Compilers
Of all the software, only the compilers are of interest here. I had three at my disposal: Intel® Composer XE for Linux, Microsoft Visual C++ 2010, and GCC. Visual C++ was discarded immediately because it does not support OpenMP 3.0 (at the moment it supports only OpenMP 2.5). GCC has supported OpenMP 3.0 since version 4.4; I used 4.5.0.
Below are the compile times and the sizes of the executables, linked both dynamically and statically. Note that, unlike GCC, Intel® Composer XE was installed on a remote machine, which inflates its compile time.

The command lines for GCC and Intel® Composer XE are as follows:
# gcc ./fib_task.c -fopenmp
# gcc ./fib_task.c -fopenmp -static
# icc ./fib_task.c -openmp -static
# icc ./fib_task.c -openmp
Execution time and scalability
All measurements were made while searching for the 50th Fibonacci number, with SPLITTER = 19 and SPLITTER = 32.
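For anyone repeating the runs, here is a minimal sketch of a timing harness (my own, not part of the downloadable source), assuming the fib() shown earlier and a 64-bit INT_TYPE:
#include <omp.h>
#include <stdio.h>

int main( void ) {
    int t;
    /* Try 1, 2, 4, ... threads up to the number of logical processors. */
    for( t = 1; t <= omp_get_num_procs(); t *= 2 ) {
        INT_TYPE result;
        double start, stop;
        omp_set_num_threads( t );
        start = omp_get_wtime();
        /* The top-level call must sit inside parallel/single so the
           spawned tasks have a thread team to execute on. */
        #pragma omp parallel
        #pragma omp single
        result = fib( 50 );
        stop = omp_get_wtime();
        printf( "%2d threads: %8.3f s, fib(50) = %llu\n",
                t, stop - start, (unsigned long long) result );
    }
    return 0;
}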

Acceleration and Parallelization Efficiency

This chart shows the speedup for SPLITTER = 32. Efficiency, in turn, is the speedup divided by the number of threads.
Source and executable files
As mentioned above, the source file can be downloaded here. As shipped, it computes the 41st Fibonacci number; to reproduce the measurements, change the variables to n = 50 and expected = 12586269025UL, as sketched below. You can also download the ready-made executables right away.
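Assuming those variables sit at the top of main(), as the description suggests, the two edits look like this:
INT_TYPE n        = 50;             /* 41 in the example as downloaded     */
INT_TYPE expected = 12586269025UL;  /* fib(50), used to verify the result  */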
Findings
Actually, the charts with the execution times and speedups speak for themselves.
- For Composer XE, the parallelization efficiency is ~ 1 as long as the number of threads does not exceed the number of physical cores.
- For GCC 4.5.0, the parallelization efficiency is ~ 0.5 as long as the number of threads does not exceed the number of physical cores.
I leave the remaining conclusions to you, since everyone is interested in something different.
Afterword
On SpringerLink I found the paper "An Extension to Improve OpenMP Tasking Control", which proposes introducing a new final clause that acts as a balancer. From my point of view, this is a sensible proposal. One could go further and add balancer clauses that perform adaptive balancing at run time.
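For illustration, here is a sketch of how such a final clause could express the same cutoff that SPLITTER implements manually above (a clause of this form was in fact standardized later, in OpenMP 3.1):
INT_TYPE
fib_final( INT_TYPE n ) {
    INT_TYPE i, j;
    if( n < 2 )
        return n;
    /* Once n drops below SPLITTER the task becomes final, and all of its
       descendant tasks execute immediately instead of being deferred,
       which replaces the manual fib_serial branch. */
    #pragma omp task shared( i ) final( n < SPLITTER )
    i = fib_final( n - 1 );
    #pragma omp task shared( j ) final( n < SPLITTER )
    j = fib_final( n - 2 );
    #pragma omp taskwait
    return i + j;
}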
Please refer to
the Optimization Notice page for more details on performance and optimization in Intel software products.