The gradual development of the project went on as usual.
Part of the grant money went toward updating the computer hardware. As a result, the calculations now run not on the long-suffering laptop, but on a perfectly acceptable machine with a pseudo-eight-core Intel Core i7-2600 and 8 GB of RAM on board. Development is done in Visual Studio 2005 (obtained through the DreamSpark program) with a trial version of the Intel Fortran Compiler 12 / Intel Parallel Studio XE 2011 plugged in (all of it running under Windows 7). OpenMP is used as the parallel API.
With the noticeable growth in available computing power, new unpleasant traits of the previously written algorithm came to light. First of all, since March a deep optimization of the computational part of the code has been carried out, which yielded roughly a 70% gain in performance. The gain came primarily from eliminating division operations and from increasing the number of precomputed variables.
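For illustration, here is a minimal sketch of the kind of arithmetic rework meant above: a division inside a hot loop is replaced by multiplication with a precomputed reciprocal. The names rho, flux, dx are invented for the example and are not taken from the actual code.

program reciprocal_sketch
  implicit none
  integer, parameter :: N = 1000
  real(8) :: rho(0:N+1), flux(N), dx, inv_dx
  integer :: i

  dx = 0.01d0
  rho = 1.0d0

  ! Before: flux(i) = (rho(i+1) - rho(i)) / dx   inside the loop.
  ! After: the reciprocal is computed once, outside the loop.
  inv_dx = 1.0d0 / dx
  do i = 1, N
    flux(i) = (rho(i+1) - rho(i)) * inv_dx
  enddo

  write(*,*) 'flux(1) =', flux(1)
end program reciprocal_sketch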
UPD: This post is, by and large, about gray working routine and contains no discoveries as such.
A small dirty trick
The program was used regularly and produced good results, until one day it was decided to check how well the parallelization actually performed. And, rather expectedly than surprisingly, single-threaded execution turned out to be on average twice as fast as multi-threaded runs, regardless of the number of threads.
The answer, by and large, lies on the surface. From the mathematical point of view, the algorithm had been optimized to the point where its bottleneck became the exchange of data between individual threads, which even a brief analysis in Intel VTune Amplifier confirmed. Most of the time went into creating the threads and initializing their local variables, as well as into accessing shared variables and arrays. The dirty trick had stayed invisible largely because, until now, a coarse computational grid of only 3x200 spatial nodes had been used (a sort of imitation of a one-dimensional problem), and the computation time was comparatively small.
Minor fixes
What was done for optimization?
First of all, the directives and the classification of variables were revised. In particular, the main working arrays, which store the values that are the goal of the calculation, were turned from SHARED into THREADPRIVATE by giving them the COMMON attribute (which at the same time optimized their placement in memory) and adding COPYIN clauses (a small sketch of this arrangement follows the directive below). The precomputed variables were left SHARED, since applying FIRSTPRIVATE or COPYIN to them not only gave no noticeable effect but actually worsened the results. In total, the directive before the main work loops ended up looking roughly like this:
!$OMP PARALLEL DO NUM_THREADS(Threads_number) SCHEDULE(DYNAMIC) &
!$OMP PRIVATE(...) &
!$OMP COPYIN(...) &
!$OMP DEFAULT(SHARED)
The variable lists are omitted here; with them the code takes up about a dozen lines.
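And here is a minimal compilable sketch of the THREADPRIVATE-via-COMMON arrangement described above. The array name Work, the common block name /wrk/, and the sizes are made up for the example and are not taken from the actual program.

program threadprivate_sketch
  use omp_lib
  implicit none
  real(8) :: Work(4)
  common /wrk/ Work
  !$OMP THREADPRIVATE(/wrk/)   ! every thread gets its own copy of /wrk/

  Work = 5.0d0                 ! filled in by the initial (master) thread

  ! COPYIN copies the master's values into each thread's private copy
  ! when the parallel region starts.
  !$OMP PARALLEL NUM_THREADS(4) DEFAULT(SHARED) COPYIN(/wrk/)
  Work(1) = Work(1) + omp_get_thread_num()   ! each thread edits only its own copy
  write(*,*) 'thread', omp_get_thread_num(), 'Work(1) =', Work(1)
  !$OMP END PARALLEL

  write(*,*) 'master copy afterwards:', Work(1)   ! only the master's edit persists
end program threadprivate_sketch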
In total, there are nine such places in the code. Nine bottlenecks through which the program, to quote M. Evdokimov, "squeaks, but climbs."
Shuffling various variables back and forth took a couple of evenings, but the result could hardly be called optimal. A run at full processor load showed that, on average, only 2.1 to 2.3 threads were actually running at any moment, while processor time was dutifully being consumed by all eight. For clarity, here are the histograms from VTune Amplifier for the 3x200 grid:

For 100x100:

For 200x200:

Obviously, the results improve as the share of computation grows, but there is no urge to call this efficiency high.
Forcibly depriving the threads of sleep by raising KMP_BLOCKTIME from 200 ms to 10 s helped only a little, in much the same way.
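For reference, a minimal sketch of that setting is below. It assumes the Intel-specific extension routine kmp_set_blocktime() is accessible through omp_lib, as it is with the Intel compilers; the same effect is achieved by setting the KMP_BLOCKTIME environment variable before launch.

program blocktime_sketch
  use omp_lib
  implicit none
  ! Keep idle worker threads spinning for 10 s instead of the default 200 ms,
  ! so they do not go to sleep between closely spaced parallel regions.
  call kmp_set_blocktime(10000)   ! value is in milliseconds
  !$OMP PARALLEL
  write(*,*) 'worker threads will now spin rather than sleep when idle'
  !$OMP END PARALLEL
end program blocktime_sketch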
Nonsense is rarely small
Then, unexpectedly, a stern look was cast at the "space-time" structure formed by the threads in the algorithm, and everything immediately fell into place. The weak point was the directive
!$OMP PARALLEL DO NUM_THREADS(Threads_number) SCHEDULE(DYNAMIC)
The PARALLEL keyword, as is well known, marks the boundary between sequential and parallel regions of code. When it is reached, new threads are created, their local variables are allocated in memory, and other rather time-consuming procedures take place. As already mentioned, there were nine such places. Accordingly, the threads were created and destroyed nine times, and in some spots there was not even a sequential section between them. Schematically, it can be represented like this:

A complete reorganization of the structure of the parallel part of the program was carried out. The thread diagram now looks like this:

Vertical dashed lines conventionally mark the boundaries of the parallel loops, and toward the end the SINGLE directive is used: it writes the calculation results to disk, and for its duration all threads except one are suspended. Parallelizing that step is difficult at the very least, although there is an idea to do the writing in one thread while the others run the next loop (which does not depend on the disk write), or to swap the two around. But those are details that are beside the point here.
In the source text, the directive structure looks like this:
Time_cycle: do n = 0, Nt, 1
  !$OMP PARALLEL NUM_THREADS(Threads_number) &
  !$OMP PRIVATE(...) &
  !$OMP COPYIN(...) &
  !$OMP DEFAULT(SHARED)
  !$OMP DO SCHEDULE(DYNAMIC)
  ...
  !$OMP END DO
  ...
  !$OMP SINGLE
  ...
  !$OMP END SINGLE
  !$OMP DO SCHEDULE(DYNAMIC)
  ...
  !$OMP END DO
  !$OMP END PARALLEL
enddo Time_cycle
That is, the entire body of the time loop (the program performs a direct numerical simulation of the evolution of a hydrodynamic system) is now inside a parallel region; the threads are created at the start of an iteration and destroyed only at its end, instead of being resurrected over and over. Wrapping the time loop itself in the parallel region as well is no longer possible, since each of its iterations, naturally, depends completely on the previous one (please correct me if this is not so; my logic falters a little here).
The final improvement aimed at speeding things up was disabling the barrier synchronization between threads in those loops where it would not affect subsequent calculations.
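To make the structure concrete, here is a minimal compilable sketch of the same arrangement: one parallel region per time step, two work-sharing loops, a SINGLE block for output, and barriers dropped (NOWAIT) where a loop does not feed the one right after it. The array names u, unew, e and the simple update formula are stand-ins invented for the example, not the actual hydrodynamic kernels.

program parallel_region_sketch
  implicit none
  integer, parameter :: Nx = 200, Nt = 1000
  real(8) :: u(0:Nx+1), unew(0:Nx+1), e(Nx)
  integer :: i, n

  u = 1.0d0
  u(Nx/2) = 2.0d0              ! a small disturbance so something evolves
  unew = u

  Time_cycle: do n = 1, Nt
    ! One parallel region per time step: the threads are created here once
    ! and live until END PARALLEL instead of being respawned for every loop.
    !$OMP PARALLEL NUM_THREADS(4) DEFAULT(SHARED) PRIVATE(i)

    !$OMP DO SCHEDULE(STATIC)
    do i = 1, Nx
      unew(i) = u(i) + 0.1d0*(u(i-1) - 2.0d0*u(i) + u(i+1))
    enddo
    ! NOWAIT is safe here only because the next loop reads u, which the
    ! loop above does not modify.
    !$OMP END DO NOWAIT

    !$OMP DO SCHEDULE(STATIC)
    do i = 1, Nx
      e(i) = u(i)*u(i)         ! a diagnostic based on the old field
    enddo
    !$OMP END DO               ! implicit barrier: everyone finishes before output

    !$OMP SINGLE               ! one thread writes, the rest wait at END SINGLE
    if (mod(n, 500) == 0) write(*,*) 'step', n, 'sum e =', sum(e)
    !$OMP END SINGLE

    !$OMP DO SCHEDULE(STATIC)
    do i = 1, Nx
      u(i) = unew(i)           ! copy-back for the next time step
    enddo
    ! The barrier at END PARALLEL already covers this loop, so NOWAIT costs nothing.
    !$OMP END DO NOWAIT

    !$OMP END PARALLEL
  enddo Time_cycle

  write(*,*) 'final u(Nx/2) =', u(Nx/2)
end program parallel_region_sketch

SCHEDULE(STATIC) is used in the sketch simply because it makes the NOWAIT reasoning easy to verify; the real code uses DYNAMIC, where dropping a barrier requires the same kind of independence between consecutive loops.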
As a result, the share of genuinely parallel execution time has noticeably grown. On the 3x200 grid, VTune now reports this:

On the 100x100 grid:

Finally, at 200x200:

Thus we confirm the age-old truth that on large grids, where there really is a lot of computation, the computations take up most of the time and parallelism pays off. On small grids, the data exchange between threads has to be optimized, otherwise the results are not encouraging; and even then, the sequential stages still account for the dominant share of the running time.
The question arises: was the work done worth the time and effort spent? Let us check the actual speed of the program. The runs were carried out on the same three grids with eight threads, measuring the time to a certain reference point. The reference points differ between cases, so comparing absolute values across grids would be incorrect: 1 million iterations for 3x200, 500 thousand for 100x100, and 200 thousand for 200x200 nodes. The values in parentheses show the relative difference in execution time between the two versions of the program.
| Grid size | Execution time, s |
|---|---|
| 3x200, before optimization | 94.4 |
| 3x200, optimized | 70.0 (-26%) |
| 100x100, before optimization | 352 |
| 100x100, optimized | 285 (-19%) |
| 200x200, before optimization | 543 |
| 200x200, optimized | 436 (-19%) |
It is obvious that the optimization carried out provided a good performance boost.
At the same time, let us compare the quality of parallelization by measuring, in the same way, the performance gain as the number of threads grows. It should be noted right away that single-threaded runs may turn out faster than two- or three-threaded ones, firstly because there is no need to exchange data with neighbors, and secondly because of Turbo Boost raising the clock frequency by 400 MHz. Also recall that the processor mentioned at the beginning has only 4 physical cores, so the speedup on 8 threads comes from Hyper-Threading.
Grid 3x200, 1 million iterations:
| Number of threads (before optimization) | Execution time, s |
|---|---|
| 1 | 80.9 |
| 2 | 88.7 |
| 3 | 84.4 |
| 4 | 83.7 |
| 8 | 94.4 |

| Number of threads (after optimization) | Execution time, s |
|---|---|
| 1 | 87.9 |
| 2 | 145 |
| 3 | 113 |
| 4 | 97.8 |
| 8 | 70.0 |
Grid 100x100, 500 thousand iterations:
| Number of threads (before optimization) | Execution time, s |
|---|---|
| 1 | 918 |
| 2 | 736 |
| 3 | 536 |
| 4 | 431 |
| 8 | 352 |

| Number of threads (after optimization) | Execution time, s |
|---|---|
| 1 | 845 |
| 2 | 528 |
| 3 | 434 |
| 4 | 381 |
| 8 | 285 |
While on a small number of nodes the results are ambiguous, to put it mildly (though they still speak in favor of multithreading and of the optimization), on a large number of nodes everything is once again plain to see.
Conclusions? The main one is still the same: watch how the threads in your program are born and die. Extending their lives can be worthwhile.
The rearrangement of the space-time structure of the threads implemented here is described, in particular, in "Efficient load balancing between threads using OpenMP*". It was, however, read only after the solution had already been thought up.
It would have saved a couple of days, eh.