
Finding and solving scalability problems using the example of multi-core Intel Core 2 processors (part 2)

Continuation of the article: part 1, part 3, part 4

Cache Line Efficiency


Items 1.1 and 1.2 cover one of the most basic problems: using only part of the bytes within a cache line. This not only increases traffic on the bus, but also reduces the efficiency of the processor caches. If only part of each resident cache line is actually used, the application effectively shrinks the usable size of its caches. Item 1.1 refers to the compiler practice of placing structure members in memory according to their size. So, if the first member of a structure is a "char" followed by a 4-byte "int", the compiler will insert 3 padding bytes between them in order to place the "int" on a 4-byte boundary.
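As an illustration of item 1.1, here is a minimal sketch (the member names and the third field are ours, not from the original text) showing how padding inflates a structure and how ordering members by decreasing size avoids it on a typical x86 compiler:

#include <stdio.h>

/* char, int, char: padding after each char brings the size to 12 bytes. */
struct scattered {
    char a;   /* 1 byte + 3 bytes padding */
    int  b;   /* 4 bytes                  */
    char c;   /* 1 byte + 3 bytes padding */
};

/* The same members ordered by decreasing size fit in 8 bytes. */
struct reordered {
    int  b;   /* 4 bytes                  */
    char a;   /* 1 byte                   */
    char c;   /* 1 byte + 2 bytes padding */
};

int main(void) {
    printf("scattered: %zu bytes, reordered: %zu bytes\n",
           sizeof(struct scattered), sizeof(struct reordered));
    return 0;
}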

Item 1.2 seems to speak for itself, yet it is one of the most important sources of high bus traffic. There are many ways to detect this problem. In loop-dominated applications, efficiency can be estimated roughly by counting how many times the main loops execute and how many bytes of data each iteration uses, determined by static analysis of the assembly listing. This gives the maximum number of bytes actually used in the loop. Measuring the number of cache lines transferred over the bus during those loops and multiplying that number by 64 gives the bus traffic generated by the loops. Dividing the number of bytes consumed by the loop (from the listing analysis) by the total number of bytes transferred over the bus then yields a rough estimate of bus-utilization efficiency.
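A small worked example of that estimate (the numbers are invented for illustration): if a loop body consumes 24 bytes per iteration and executes 1,000,000 times, it uses 24,000,000 bytes. If the bus events show 1,500,000 cache lines transferred for that loop, the traffic is 1,500,000 × 64 = 96,000,000 bytes, and the bus-utilization efficiency is 24,000,000 / 96,000,000 = 25%.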

To find out how many times a loop executes, average the count of INST_RETIRED.ANY_P events over the loop's instructions. This gives the number of executions of the basic block. The event is not distributed evenly across the basic block; however, the total is correct for a "large enough" region. The total for the whole application is accurate, but "long" instructions will show a much larger event count than the instructions immediately before or after them within the same block. In a loop whose dominant flow passes through several basic blocks, the event should be averaged over all of those blocks, provided that statistical analysis of the assembly listing shows they are all executed an equal number of times. With collected data, this can be done in VTune Analyzer: simply highlight all the instructions in the loop to display their total. Count the number of instructions in the highlighted region, multiply the INST_RETIRED.ANY_P total by the Sampling After Value (SAV), and divide by the number of instructions. The result is the total number of loop iterations.
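Again with invented numbers: if the highlighted loop contains 12 instructions, the INST_RETIRED.ANY_P samples over those instructions sum to 5,000, and the SAV is 2,000,000, then the loop executed approximately 5,000 × 2,000,000 / 12 ≈ 833 million iterations.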
This technique also makes it possible to determine the average trip count of an inner loop: divide the number of executions of the inner loop by the number of executions of the basic block immediately before or after it. The result is quite inaccurate when the inner loop runs for much more than 100 iterations, but very accurate for small loops. And even when the technique is inaccurate because the iteration count is large, it is still useful simply to know that there are many iterations.

Simply counting the load and store instructions helps estimate the total number of bytes used per iteration. There is, of course, a risk of overstating this estimate because of aliased pointers, especially when many loops are involved.

The total number of cache lines transferred over the bus is measured by the BUS_TRANS_BURST.SELF event.
Another article with a more detailed review of data-access analysis will be available soon.

In cases where only a small part of the data transmitted over the bus is actually used in frequently executed loops, there are several ways to improve bus utilization and, with it, the performance of a single thread or of several running in parallel. The most obvious solution is to split the data according to actual use, organizing it into parallel arrays or linked lists of structures. Commonly, only a certain subset of each structure's fields is used repeatedly, while the rest are touched only occasionally. In a large application, completely reorganizing the data layout can be extremely difficult. In such cases, the frequently used data can be copied into a compact, packed array of structures containing only the frequently used fields. This lets the hot loops of the program generate minimal bus traffic while they run. A more aggressive optimization would "unroll" the data layout by 2, for double-precision floating-point numbers, or by 4, if all the data are integers (int) or single-precision floating-point (float). This opens the way to extremely efficient use of the Intel Streaming SIMD Extension 3 (SSE3) instruction set, since no data-shuffling instructions will be needed. It may even be worth adding a few dummy elements at the end of the array to guarantee that its size is a multiple of 2 or 4.
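A minimal sketch of the "copy the hot fields into a packed array" idea (the structure and field names are invented for illustration):

/* Original "fat" record: only x and y are touched in the hot loop. */
struct body {
    double x, y;
    double mass, charge;
    char   name[32];          /* rarely used metadata */
};

/* Packed copy holding only the frequently used fields. */
struct body_hot {
    double x, y;
};

/* One-time gather of the hot fields into a dense array. */
void gather_hot(struct body_hot *hot, const struct body *all, int n)
{
    for (int i = 0; i < n; i++) {
        hot[i].x = all[i].x;
        hot[i].y = all[i].y;
    }
}

/* The frequently executed loop now streams densely packed data,
   so every byte of every fetched cache line is useful. */
void advance(struct body_hot *hot, int n, double dx, double dy)
{
    for (int i = 0; i < n; i++) {
        hot[i].x += dx;
        hot[i].y += dy;
    }
}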

It may be better to use structures of arrays rather than arrays of structures if the dominant access pattern is sequential traversal of the arrays, since this ensures that all bytes of each cache line are used.
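The same point expressed as a sketch (the types and comments are ours):

/* Array of structures (AoS): a sequential pass over only the x field
   also drags the unused y and z fields through the cache. */
struct point_aos { double x, y, z; };

double sum_x_aos(const struct point_aos *p, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += p[i].x;          /* uses 8 of every 24 bytes fetched */
    return s;
}

/* Structure of arrays (SoA): the same pass touches cache lines
   that contain nothing but x values. */
struct points_soa { double *x, *y, *z; };

double sum_x_soa(const struct points_soa *p, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += p->x[i];         /* every byte of every fetched line is used */
    return s;
}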

Stores that bypass the cache


Compilers avoid generating streaming (non-temporal) store instructions when the trip count of the loop doing the stores is unknown. For example, in the simple TRIAD function shown below, the value of len is not known at compile time.

void TRIAD(int len, double a1, double *a, double *b, double *c, double *d)
{
    /* c is unused here; it is kept to match the original signature */
    int i;
    for (i = 0; i < len; i++)
        a[i] = b[i] + a1 * d[i];
}


Most compilers will use ordinary store instructions or packed SSE store instructions for array "a". Each such store triggers a request for exclusive ownership of the cache line, which doubles the bandwidth consumed for this array. If the array is not needed again soon, or is too large to fit in the cache anyway, this traffic can be halved. In the example above, if the number of iterations "len" multiplied by 8 (the size of a double) is larger than the last-level cache (LLC) size divided by 3, the first cache lines of array "a" will be evicted from the cache even before the loop finishes. To make the Intel compiler generate the optimal code for large trip counts, align the arrays on 16 bytes and then add two pragma directives immediately before the loop.

#pragma vector aligned
#pragma vector nontemporal
for (i = 0; i < len; i++)
    a[i] = b[i] + a1 * d[i];


Excessive prefetching of cache lines is also very important. Poorly placed software prefetches can increase bus traffic. Make sure you do not prefetch cache lines that will not be used afterwards. The prefetch distance must also be large enough for the line to actually arrive in the cache before it is used. Load instructions that reference cache lines still being loaded by a software prefetch increment the LOAD_HIT_PRE event counter, so this situation is quite easy to detect.
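A hedged sketch of a software prefetch with an explicit prefetch distance (the function, the distance of 16 iterations, and the _MM_HINT_T0 hint are assumptions that would have to be tuned for a real loop):

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

void scale(double *dst, const double *src, double s, int n)
{
    /* Prefetch distance: far enough ahead that the line arrives before
       it is used, but not so far that it reaches past the array. */
    const int dist = 16;

    for (int i = 0; i < n; i++) {
        if (i + dist < n)
            _mm_prefetch((const char *)&src[i + dist], _MM_HINT_T0);
        dst[i] = s * src[i];
    }
}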

There are 4 hardware prefetch units in the Intel Core 2 processor core. Two work with the second-level cache (L2) in a manner similar to that previously seen in Pentium 4 processors. The other 2 hardware prefetch units work with the first-level data cache (L1D): a streaming prefetcher and a prefetcher that looks for strided patterns tied to specific instruction addresses (IP). As a rule, the system enables the two L2 prefetchers and the IP-based L1D prefetcher, but not the L1D streaming prefetcher. For analysis purposes it may be useful to disable these units. This can usually be done in the BIOS settings, in the "cpu options" subsection. Look for the keywords "prefetch" and "adjacent"; these are usually the items that need changing.

You can use performance events to determine which hardware prefetchers are enabled. The L2_LD.SELF.PREFETCH event counts cache lines loaded into the L2 cache by the L2 prefetchers. A nonzero count of the L1D_PREFETCH.REQUESTS event indicates that the prefetchers for the L1 data cache are enabled.

To determine the number of unused cache lines requested by the hardware prefetchers, run the program twice, with prefetching enabled and then disabled. Measure the total number of cache lines transferred with the BUS_TRANS_BURST.SELF event. If there is a significant difference between the two values, use VTune Analyzer to determine exactly where in the source code this happens. Frequent causes are:
  1. nested loops with a small number of iterations and large strides in the outer loops;
  2. nested loops that traverse the data in opposite directions (the outer loop goes forward with a large stride, the inner loop goes backward with a small stride).

The hardware prefetcher is usually driven by the inner loop, so in the first case the prefetched lines lie beyond the address range of the inner loop and are not used by the next iteration of the outer loop. In the second case, the prefetcher may try to load lines that have already been used, which is completely useless. This situation is very common in iterative solvers and in most cases can be fixed simply by reversing the direction of the inner loop and adjusting the corresponding pointers. In both cases it is likely that many cache lines will end up being fetched on demand by the load and store instructions. This can be verified with the MEM_LOAD_RETIRED.L2_LINE_MISS event, which counts only loads, and the L2_LD.SELF.DEMAND event, which counts cache lines fetched for both loads and stores, including requests made by the L1 prefetchers.
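A sketch of the second case and its fix (the matrix traversal and names are ours; the addition here is order-independent, so the inner loop can safely be reversed):

#define N 1024

/* Problematic pattern: the outer loop moves forward row by row, while the
   inner loop walks each row backward.  The inner loop drives the hardware
   prefetcher backward, toward lines that were just used. */
void sweep_bad(double m[N][N], const double v[N])
{
    for (int row = 0; row < N; row++)
        for (int col = N - 1; col >= 0; col--)
            m[row][col] += v[col];
}

/* Fix: reverse the inner loop so both loops move forward through memory,
   letting the prefetcher fetch lines that really will be used next. */
void sweep_good(double m[N][N], const double v[N])
{
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++)
            m[row][col] += v[col];
}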

Breaking data up into blocks is the standard remedy for insufficient bandwidth; it is usually suggested without any hint of how to actually implement it. Easier said than done. However, there is a certain correspondence between data decomposition and blocking, and it can be exploited in some cases. The data decomposition can be organized into two levels of hierarchy. The outer level splits the data into "one set per core", and at the same time per process or per thread, depending on the decomposition used: OpenMP, explicit threading, or MPI, for example. Each "per-core data set" can then be further divided, following the same strategy, into "virtual nodes" whose size is determined by the size of the last-level cache. This strategy should be considered at the early stages of algorithm design, since it can significantly improve scalability.
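A minimal sketch of that two-level split, assuming the per-core share has already been assigned to this thread or process; the block size, the LLC fraction, and process_chunk are placeholders of ours:

#include <stddef.h>

/* Stand-in for the real computational kernel on one contiguous chunk. */
static void process_chunk(double *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= 2.0;
}

/* Second level of the decomposition: split this core's share of the data
   into "virtual nodes" sized to a fraction of the last-level cache. */
void process_per_core_share(double *data, size_t n, size_t llc_bytes)
{
    size_t block = llc_bytes / (2 * sizeof(double)); /* leave headroom in the LLC */

    for (size_t start = 0; start < n; start += block) {
        size_t len = (n - start < block) ? (n - start) : block;
        process_chunk(data + start, len);
    }
}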

Excessive load on the disk subsystem


Excessive utilization of a shared disk means that the time the program spends waiting for the disk subsystem starts to dominate the application's execution time as the number of cores grows. This is easy to see in the module view of the VTune Performance Analyzer by simply comparing how much of the run time is spent in the disk driver as the number of cores increases. For a well-scaling application (with a fixed or linearly growing amount of work) this share should remain constant. Of course, it is worth paying attention to only if the (increased) disk-access time is significant.

Variations in behavior mean that several components interfere with each other's access to the disk. This can happen, for example, when different operations cause unnecessary movement of the disk heads, defeating sequential access, which is the most efficient.

In cases where the timing of disk access is not fixed, it is almost always better to sequence it in time, keeping the remaining cores busy with other work. By staggering the disk-access phases of individual threads or processes, excessive load on the disk subsystem can be avoided.

Source: https://habr.com/ru/post/107621/

