Adapting software to make efficient use of all available processors is especially critical in light of the multi-core future of modern computing. Beyond the other obstacles that may be met along the way, there are problems caused by contention for the finite throughput of existing platforms and processors. Proper use of Intel Core 2 processor performance events makes it possible to pinpoint exactly what is stopping an application on its way to full use of all the cores available in the system.
What is scalability?
Before we start studying software scalability problems in a parallel computing environment, we need to define the term itself. Perfect scalability means that the total task execution time decreases linearly with the number of cores used by the application. However, according to Amdahl's law, parallel execution can shorten the total program run time only for the portion of the code that has actually been parallelized; the remaining serial portion limits the achievable speedup. Thus, the first step in our search is to determine the degree of parallelism actually achieved by the algorithms.
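For reference, Amdahl's law can be written as follows (a standard formulation, where p is the parallelizable fraction of the run time and N is the number of cores):

```latex
% Amdahl's law: speedup S on N cores when a fraction p of the run
% time is parallelized and the remaining (1 - p) stays serial.
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}
```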
CPU performance events of the Intel Core 2 microarchitecture give us convenient metrics for this purpose. Each core has a Performance Monitoring Unit (PMU), and the CPU_CLK_UNHALTED.CORE event lets you count the number of unhalted (working) cycles on each core. If the VTune™ Performance Analyzer is used to collect the data, this number is summed over all cores that executed the process of interest; in other words, it is the number of cycles actually spent executing the application. Let's call this number "effective_cycles".
A useful feature of the PMU is the ability to compare an event's count with a given number (cmask) on each cycle and determine whether it is greater than or equal to that number (inv = 0) or less than it (inv = 1). When such a condition is specified, the PMU counts cycles only while the condition holds. However, this is possible only for the general-purpose counters; the fixed counters, used for the CPU_CLK_UNHALTED.CORE, INST_RETIRED.ANY and CPU_CLK_UNHALTED.REF events, do not support this operation. If the CPU_CLK_UNHALTED.CORE_P event (the general-purpose version of the cycle counter) is compared with the number 2 under the "less than" condition (inv = 1), the counter counts all cycles. If this number is summed over all cores and divided by the number of cores in the system, we get the total elapsed execution time of the process. Let's call this number "total_cycles".
To ensure the correctness of the values obtained, Speed Stepping must be disabled in the BIOS and OS; otherwise unloaded cores may run at a lower frequency, which distorts the resulting total time. The ratio effective_cycles / total_cycles equals the number of cores used for highly scalable code and equals 1 for completely serial code, and the result does not depend on how many of the system's cores were involved during execution. However, the value of this technique can be undermined if the code contains CPU-devouring busy-wait loops, so waits must be implemented correctly. For a more detailed analysis it is recommended to use Intel Thread Profiler.
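As an illustration, here is a minimal sketch of this calculation, assuming the counter values have already been collected (for example, with VTune in counting mode); the function and variable names are hypothetical:

```c
#include <stdio.h>

/* Estimate of the achieved degree of parallelism from the two cycle
 * counts described above:
 *   effective_cycles - CPU_CLK_UNHALTED.CORE summed over all cores
 *   total_cycles     - CPU_CLK_UNHALTED.CORE_P (cmask = 2, inv = 1),
 *                      summed over all cores and divided by the number
 *                      of cores in the system
 * The result is close to the number of cores used for highly scalable
 * code and close to 1.0 for completely serial code. */
static double achieved_parallelism(double effective_cycles, double total_cycles)
{
    return effective_cycles / total_cycles;
}

int main(void)
{
    /* Purely illustrative numbers. */
    double effective_cycles = 8.0e9;
    double total_cycles     = 2.1e9;
    printf("achieved parallelism: %.2f\n",
           achieved_parallelism(effective_cycles, total_cycles));
    return 0;
}
```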
Deviations from ideal scalability may be caused by the structure of the code and its synchronization logic, but they can also be caused by heavy contention for shared system resources. The purpose of this article is precisely the systematic search for such problems and their resolution.
Resource constraints
First of all, we need to list the resources that may be heavily contended in parallel computing. These include:
- System bandwidth
- Memory bandwidth
- Interconnection bandwidth between sockets
- Disk subsystem load
- Shared cache
- Data translation lookaside buffer (DTLB)
- Access to individual cache lines
- Shared elements within a cache line
- Shared cache lines without shared elements (false sharing)
The last item on the list affects performance somewhat differently than all the others, since synchronization of threads and processes depends on locked access to cache lines. The difference is most obvious in that a faster or larger cache cannot always mitigate the resulting loss of scalability, as it can for the other items on this list.
With these constraints in mind, there are two main scenarios for scaling an application, which will be discussed below: splitting a fixed amount of work between cores, and linearly increasing the amount of work performed by all cores. These two scenarios raise somewhat different scalability problems and, therefore, call for different properties in the code.
Bandwidth
To understand how bus traffic contributes to the drop in performance, you need to determine what traffic actually goes over the bus during program execution (in total and broken down by component) and what the peak bandwidth of the platform used in the analysis is. Peak bandwidth depends on a large number of factors: the configuration of the hardware prefetchers, the number, type and placement of the memory modules, the system bus frequency, the processor frequency, and the conditions required to keep the caches coherent. Therefore, no single metric can be taken as defining the bandwidth of a platform based on Intel Core 2. The only acceptable way to determine it is to run a synthetic bandwidth test; a single long loop of the TRIAD algorithm seems best suited for this purpose. Since the bandwidth limit for multi-core execution is likely to differ from the single-core one, this test should support parallel execution, either with threads or by splitting it into separate processes.
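As a rough illustration, a single-threaded TRIAD-style kernel can look like the sketch below (the array size, timing method and scalar value are assumptions for the example; to probe the multi-core limit, run several copies in parallel as discussed above):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* TRIAD kernel: a[i] = b[i] + scalar * c[i].
 * With arrays much larger than the last-level cache, the loop is
 * bandwidth-bound and serves as a simple synthetic bandwidth test. */
#define N (32 * 1024 * 1024)   /* ~256 MB per array of doubles */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* Three streams are counted: read b, read c, write a (the extra
     * write-allocate traffic for a is ignored in this estimate). */
    printf("approx. bandwidth: %.2f GB/s\n",
           3.0 * N * sizeof(double) / sec / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```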
The system bandwidth limit affects performance somewhat differently than most slowing factors. The influence of most of them grows linearly with their defining metric, as in the case of last-level cache misses, where the effect on performance is essentially the corresponding number of wait cycles. Here, instead, performance runs into a barrier that is not noticeable until the application has exhausted the platform's resources. That is, the dependence of performance on bandwidth consumption is step-like rather than linear, as it is with other events. For example, memory access latency grows nonlinearly as the bandwidth limit is approached. You can observe this effect by measuring the bus access latency while increasing the number of simultaneously running TRIAD instances. Bus access latency can be measured with performance events in counting mode (as opposed to sampling) as the ratio:
Bus Latency = BUS_REQUEST_OUTSTANDING.SELF / (BUS_TRANS_BRD.SELF - BUS_TRANS_IFETCH.SELF)
In most cases the ifetch (instruction fetch) component is insignificant, especially at peak bandwidth, and can therefore be ignored.
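A minimal sketch of this calculation, assuming the three counter values have already been collected in counting mode:

```c
/* Average outstanding-request time per cache line read, computed from
 * the counter values named above. The ifetch term can be dropped when
 * it is negligible. */
static double bus_read_latency(double bus_request_outstanding,
                               double bus_trans_brd,
                               double bus_trans_ifetch)
{
    return bus_request_outstanding / (bus_trans_brd - bus_trans_ifetch);
}
```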
There are many sources contributing to system bus traffic. Intel Core 2 processor performance events offer a variety of techniques for determining both the overall bus load and the contribution of individual components. Overall bus saturation is very simple to determine: it can be expressed as the fraction of bus cycles used for data transfer:
BUS_DRDY_CLOCKS.ALL_AGENTS/CPU_CLK_UNHALTED.BUS
or directly, as the number of bytes transferred over the bus in cache lines:
Cacheline_Bandwidth (bytes/s) ~ 64 * BUS_TRANS_BURST.ALL_AGENTS * core_freq / CPU_CLK_UNHALTED.CORE
A convenient metric in this case is simply the number of cache lines per cycle, since the appetite of N parallel copies of the application / thread will be N times this value. So if the platform limit is expressed in these units, the likely bus load under parallel execution can easily be estimated from a single-thread analysis.
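The two bus-load metrics above can be combined into a small helper; a sketch, assuming the raw counter values and the core frequency are supplied by the caller:

```c
/* Bus-load metrics from the text, computed from raw counter values. */
struct bus_load {
    double saturation;           /* fraction of bus cycles carrying data  */
    double cachelines_per_cycle; /* per-thread demand; N copies need ~N x */
    double bytes_per_second;
};

static struct bus_load bus_load_metrics(double bus_drdy_clocks_all_agents,
                                        double cpu_clk_unhalted_bus,
                                        double bus_trans_burst_all_agents,
                                        double cpu_clk_unhalted_core,
                                        double core_freq_hz)
{
    struct bus_load m;
    m.saturation = bus_drdy_clocks_all_agents / cpu_clk_unhalted_bus;
    m.cachelines_per_cycle = bus_trans_burst_all_agents / cpu_clk_unhalted_core;
    m.bytes_per_second = 64.0 * m.cachelines_per_cycle * core_freq_hz;
    return m;
}
```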
The BUS_TRANS_* events can be used hierarchically to isolate the components using the bus. A brief description is given in the table below; they are also discussed in much more detail in the VTune Performance Analyzer online help.
| Event | Description |
|---|---|
| BUS_TRANS_ANY | All bus transactions: Mem, IO, Def, Partial |
| BUS_TRANS_MEM | All cache lines, plus partial and invalidate transactions |
| BUS_TRANS_BURST | All cache lines: Brd, RFO, WB, combined writes |
| BUS_TRANS_BRD | All cache line reads: Data, Ifetch |
| BUS_TRANS_IFETCH | All cache lines for instruction fetch |
| BUS_TRANS_RFO | All cache lines requested for exclusive use (read for ownership) |
| BUS_TRANS_WRB | All cache line writebacks (modified cache lines) |
There are many standard ways to reduce bus traffic that can be applied if the platform limit is reached, including the following:
- Use all bytes of all cache lines while they are in the cache.
- Arrange structure members in descending order of size to avoid padding inserted by the compiler for alignment (a sketch follows below).
- Define structures according to actual usage, not object orientation or thematic affinity.
- The leading dimensions of arrays should be traversed in the innermost of nested loops.
- Whenever possible, pack structure components that are used together into the same cache line.
- Place shared structure components nearby
- Use streaming (non-temporal) store instructions that bypass the cache for large assignment loops.
- Do not prefetch unused cache lines
- If possible, merge loops that are limited by bandwidth with loops that are limited by processor performance. This is usually easy to achieve when the loops have the same number of iterations, but in principle it is also possible when the iteration counts differ.
- Try to break the data into blocks so that each block is used as much as possible while it is in the cache.
The above does not in any way pretend to be an exhaustive list of possible options; it is rather a list of the most obvious ones. Anyone who has actually tried them will confirm that the last one is generally easier said than done.
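As an illustration of the structure-layout advice above, here is a minimal sketch (the structures are hypothetical, not taken from the original article) showing how ordering members by descending size removes compiler-inserted padding:

```c
#include <stdio.h>

/* Members in arbitrary order: the compiler inserts padding after `flag`
 * and at the end to satisfy alignment, wasting space in cache lines. */
struct padded {
    char   flag;   /* 1 byte + 7 bytes of padding        */
    double value;  /* 8 bytes                            */
    short  id;     /* 2 bytes + 6 bytes of tail padding  */
};

/* Same members sorted by descending size: less padding, so more
 * elements fit into each 64-byte cache line. */
struct packed {
    double value;  /* 8 bytes                            */
    short  id;     /* 2 bytes                            */
    char   flag;   /* 1 byte + 5 bytes of tail padding   */
};

int main(void)
{
    /* On a typical x86-64 compiler this prints 24 and 16 bytes. */
    printf("padded: %zu bytes, packed: %zu bytes\n",
           sizeof(struct padded), sizeof(struct packed));
    return 0;
}
```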
I apologize for the wall of text and some confusion, but the material really is rather dry. I did the translation last year, so I also apologize for some loss of relevance.
The article continues in:
part 2,
part 3,
part 4