
OpenMP Regions Analysis with Intel® VTune™ Amplifier XE

OpenMP* is a popular parallel programming model, especially in high-performance computing. But to actually achieve that high performance, OpenMP constructs often have to be tuned, and for that a good profiler is indispensable. Most profilers attribute performance data to functions or loops but give no picture tied to specific OpenMP regions. As a result, the programmer loses context, and without the OpenMP context it is very hard to diagnose imbalance or overhead.
Intel VTune Amplifier XE can profile OpenMP regions. The latest version, 2015 Update 2, makes the analysis much simpler and clearer by presenting data in OpenMP terms. The tool shows the time spent in parallel and serial regions, the difference between the actual and idealized execution time of a region, a breakdown into parallel loops, and the CPU utilization of each region separately.
The "potential gain" metric makes it easier to see where to invest effort first. The classification of overheads helps determine the cause of inefficiency, for example waiting caused by load imbalance, or time lost on synchronization locks.
This article describes several types of OpenMP problems identified by VTune Amplifier and how to understand and resolve them.



Configuration


Good news: with the latest versions of VTune Amplifier and the Intel Compiler, almost nothing needs to be set up. You only need to set one environment variable (example for Linux):

export KMP_FORKJOIN_FRAMES_MODE=3 

After that, simply run any analysis type on your OpenMP application, for example Advanced Hotspots. The following product versions are used in this post:

Setting the variable is only a temporary requirement, because some of the functionality is still experimental; in future updates it will not be needed.

Evaluating inefficiency: serial code and CPU utilization


For starters, it is worth looking at the CPU utilization histogram on the Summary panel. It shows the elapsed time of the whole program broken down by how many processor cores were loaded simultaneously. Only useful time is counted here, i.e. cycles burned in active spinning are not included.



Ideally, in a parallel application most of the time should fall into the "green" zone, where most of the cores are busy simultaneously. The picture above shows a test run on an Intel Xeon Phi coprocessor. Most of the 224 hardware threads are idle, and there can be two main reasons for that:
1. A large serial part - there is simply not enough parallelism to begin with.
2. Low efficiency of the parallel regions - the code is parallel, but some bottleneck limits scalability.
Now look at the “OpenMP Analysis” section on the Summary panel:



The elapsed time is divided into Serial Time and Parallel Region Time. If the serial time is large, look for ways to reduce the serial part: either parallelize what has not been parallelized yet, or apply other optimizations, for example micro-architectural ones, if the algorithm does not parallelize. The more hardware threads the machine has, the stronger the negative effect of serial sections. In our test, 93.4% of the execution time is spent in serial code, which is the main limiting factor (Amdahl's law).
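
To see why this matters, here is a quick back-of-the-envelope application of Amdahl's law to these numbers (assuming the remaining 6.6% of the time scales perfectly across all 224 hardware threads):

    S_max = 1 / (0.934 + 0.066 / 224) ≈ 1.07

In other words, even with an ideally scaling parallel part, this run cannot get more than roughly 7% faster until the serial fraction is reduced.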

If we want to examine the serial part further, switch to the Bottom-up tab, select the grouping "/OpenMP Region/Thread/Function..." and filter by the master thread in the "[Serial - outside any region]" row.



Here in Bottom-up we can see that the serial segment takes 21.571 seconds, most of the total execution time. The CPU time (summed over all threads), however, is much larger in the parallel sections: the OpenMP region on line 179 alone consumes 164 seconds, and that is just one region. Filtering by the master thread is necessary because there may be many OpenMP worker threads that simply wait while the master thread executes the serial code. They wait actively, burning CPU cycles, and this time should be excluded if we want to concentrate on the serial code itself; in our test the serial code accounts for only 20.713 seconds of CPU time. On the many-core Intel Xeon Phi coprocessor such a serial part kills performance - Amdahl's law in action.

The overall efficiency of parallel regions can be assessed with the "OpenMP Region CPU Usage Histogram" on the Summary tab. It is the same kind of CPU usage histogram, but tied to a specific parallel region. Below is the CPU utilization from the same test on the Intel Xeon Phi, but for the OpenMP region on line 153. Here the utilization is much better: most of it is close to ideal, and the rest is shifted well to the right of zero. In other words, this parallel region keeps more cores busy than the program does on average, which is expected.



Searching for optimization opportunities: potential gain


Let us examine another example. We analyzed the NAS Parallel Benchmarks (NPB) for possible performance problems in OpenMP regions.

Test configuration:

The number of OpenMP threads is set to 24, which matches the number of physical cores. The CPU utilization histogram shows good utilization, but still not ideal: a considerable amount of time was spent with only 2-6 cores busy simultaneously. The program is parallel, but does not always use all 24 cores:



The serial time in NPB is negligible and not a problem. But pay attention to the "Potential Gain" metric (highlighted in pink):



Potential Gain is the difference in execution time between the actual measurement and an idealized case in which the load on all threads is perfectly balanced and the OpenMP runtime overhead is zero. In other words, Potential Gain is the maximum time you could win by improving parallel execution. This metric can be more important than elapsed time or CPU time, because it points you not simply at the most expensive region, but at the region where optimization is most likely to pay off (which is usually what you are looking for - otherwise why profile?).
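
Expressed as a formula in the terms used above:

    Potential Gain = Elapsed Time (measured) - Elapsed Time (idealized),

where the idealized time assumes perfectly balanced work across threads and zero OpenMP runtime overhead.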

In the picture above, the Potential Gain highlighted in pink says that optimizing all parallel regions to the ideal state could reduce the execution time of the whole program by 3.975s, or 34.9%. There is something to fight for, even though we will of course never reach the ideal.

So far we have been looking at metrics for the whole application; let's go deeper, to the level of individual parallel regions. The Summary contains the top five OpenMP regions by potential gain. In our case, essentially the entire potential gain (3.958s) is concentrated in a single region on line 514. That is good news for us: the problem is narrowed down to one region.

Determining the causes of inefficiency of a parallel region


Once we have settled on a specific region, click it in the Summary and follow the hyperlink to Bottom-up, which is then automatically grouped by OpenMP regions with the region we need highlighted:



The Bottom-up table contains various statistics about parallel regions: region elapsed time, potential gain, the number of OpenMP worker threads, and the number of times the region was executed (instance count). CPU time is split into effective time (the application's own code), spin time (active waiting), and overhead time. The active-waiting time is significant and highlighted in pink: 92.159s. Before digging deeper, let's take a quick look at the source code. Our parallel region on line 514 consists of many parallel "!$omp do" loops. This is bad news, because our metrics cover the whole region, and it is unclear which loops to attribute them to:



And here the good news is that VTune Amplifier can break the data down not only by parallel regions, but also by OpenMP barriers. As is well known, every "#pragma omp for" or "!$omp do" construct has an implicit synchronization barrier unless the "nowait" clause is specified. Since VTune Amplifier recognizes these barriers, we can see the elapsed and CPU time for each barrier, i.e. for each parallel loop within the region. To do this, create a custom grouping "/OpenMP Region/OpenMP Barrier Type/OpenMP Barrier/..." - see the small button on the right in the grouping line.
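
As an illustration, here is a minimal Fortran sketch (not the NPB source; the loop bodies are hypothetical) showing where the implicit barriers sit and how "nowait" removes one:

program barrier_sketch
  implicit none
  integer, parameter :: n = 100000
  real(8) :: a(n), b(n)
  integer :: i
!$omp parallel
!$omp do
  do i = 1, n
     a(i) = sqrt(real(i, 8))
  end do
! the "!$omp end do" below carries an implicit barrier: worker threads wait
! here for each other, and VTune Amplifier can attribute that waiting to this barrier
!$omp end do
!$omp do
  do i = 1, n
     b(i) = real(i, 8)**2
  end do
! "nowait" on the end directive removes the implicit barrier for this loop
!$omp end do nowait
!$omp end parallel
  print *, a(1), b(n)
end program barrier_sketch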

After grouping by barriers, everything becomes clearer. First, most of the time and of the potential gain comes from the loop barrier on line 572 (highlighted in the screenshot; the full line did not fit):



Second, expanding the Spin Time column by category shows that all the active waiting is due to imbalance (Imbalance is highlighted in pink). Third, the "OpenMP Loop Schedule Type" column says that our parallel loop uses static scheduling.

Correcting the imbalance


The loop on line 572 contains only "!$omp do", without any "schedule" clause. Therefore the default schedule type is static, which is exactly what the profiler reported.
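
Schematically (the actual NPB source is shown in the screenshot and not reproduced here; the loop body below is a hypothetical stand-in), the loop looks like this - no schedule clause, so the implementation default of static scheduling applies:

!$omp do
      do i = 1, n
         y(i) = y(i) + a(i) * x(i)   ! small amount of work per iteration (hypothetical body)
      end do
!$omp end do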



Since the loop suffers from imbalance, it is logical to try dynamic scheduling and let the runtime distribute the load automatically:
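
The corresponding change (again a sketch rather than the actual NPB code) is just a schedule clause on the directive:

!$omp do schedule(dynamic)
      do i = 1, n
         y(i) = y(i) + a(i) * x(i)   ! same hypothetical body as above
      end do
!$omp end do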



Analysis and correction of overhead


After switching to dynamic scheduling, the loop's performance actually got worse: its execution time increased from 10.445s to 11.102s:



The picture in the table has changed: the imbalance and the spin time have disappeared, so we did fix that part, which is not bad at all. But a new problem is now highlighted: 74.99s of CPU time goes into scheduling overhead. The OpenMP runtime is doing too much work internally.

The "OpenMP Loop Chunk" column has changed from the original 3125 to 1. This means each iteration is now scheduled individually, while the amount of computation per iteration is quite small - see the code above. The pieces of work handed to OpenMP threads are tiny and have to be distributed very frequently, so the runtime works too hard and burns a lot of CPU time. The parallelism is too fine-grained.

The chunk size (grain size) defaults to a single iteration for dynamic scheduling. Let's check the hypothesis of too fine a granularity by increasing it to 20:
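
Schematically, the only change is the chunk argument of the schedule clause:

!$omp do schedule(dynamic, 20)
      do i = 1, n
         y(i) = y(i) + a(i) * x(i)   ! same hypothetical body
      end do
!$omp end do

With a chunk of 20, each idle thread grabs 20 iterations at a time, so scheduling events become roughly 20 times less frequent while the dynamic balancing is retained.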



Let's look at the profiling results. Imbalance and overhead now consume only about one second of CPU time. The loop barrier on line 572 has moved down the list of hot loops, because its potential gain dropped from the initial 3.133s to 0.077s. The loop's execution time dropped from 10.445s to 8.928s. The CPU time of the whole parallel region decreased from ~250s to ~213s. So we gained some performance, although less than the potential of roughly 3 seconds:



Conclusion


Performance analysis of OpenMP applications with VTune Amplifier XE has become more natural: the tool operates with OpenMP terms and concepts rather than with hardware and system metrics. You can analyze bottlenecks from the general to the particular, starting with an assessment of the serial part and processor utilization for the whole application, and then checking CPU utilization for a particular parallel region. The potential gain metric focuses the developer on the most promising optimization opportunities. The breakdown of data by barriers makes it possible to analyze regions containing several parallel loops.
The OpenMP statistics in Bottom-up help determine the cause of inefficiency. The schedule type, chunk size, the number of threads and region instances, and the categorization of active waiting and overhead time all help to understand what limits performance: imbalance, too fine a granularity, synchronization objects, or something else.

References:
- OpenMP analysis is joint work between the Intel OpenMP runtime library and VTune Amplifier XE, both of which are available in Intel Parallel Studio XE 2015 Professional Edition.
- Official site and additional materials on VTune Amplifier XE.
- A more complete and detailed article about OpenMP problems identified by VTune Amplifier (in English).

Source: https://habr.com/ru/post/248979/

