Continuation of the article: part 1, part 2, part 4.
Shared cache
Scalability degradation due to contention for the shared cache means that sharing the cache leads to a large number of evictions and subsequent cache-line reloads. On Intel Core 2 processors, last-level cache (LLC) misses are counted by the L2_LINES_IN.SELF.ANY event, which counts cache lines brought into the LLC by loads, stores, and hardware and software prefetches. The symptoms differ between the two scenarios discussed earlier, but the difference is easy to spot. If the amount of work is fixed, then with a non-overlapping parallel decomposition the total number of LLC misses should not change as more cores are used. If the amount of work grows with the number of cores, then the number of misses per core should remain constant. In either case this is only a total count, not a metric that points to the nature of the cause.
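As a rough illustration of these two rules, here is a toy comparison using purely hypothetical L2_LINES_IN.SELF.ANY totals (none of these numbers come from the article):

```c
/* Toy check of the scaling rules above; all counter values are hypothetical.
 * For fixed total work the aggregate L2_LINES_IN.SELF.ANY count should stay
 * roughly flat as cores are added; for work that grows with the core count,
 * the per-core count should stay roughly flat instead. */
#include <stdio.h>

int main(void)
{
    struct { int cores; long long total_misses; } runs[] = {
        {1, 100000000LL}, {2, 104000000LL}, {4, 240000000LL},
    };
    long long baseline = runs[0].total_misses;

    for (int i = 0; i < 3; i++) {
        double vs_fixed = (double)runs[i].total_misses / baseline;
        double per_core = (double)runs[i].total_misses / runs[i].cores;
        printf("%d cores: total/baseline = %.2f, per-core = %.0f\n",
               runs[i].cores, vs_fixed, per_core);
    }
    /* For a fixed-work run, a total/baseline ratio far above 1.0 (as in the
     * 4-core row here) points at extra evictions: shared-cache thrashing or
     * cache-line sharing, both discussed below. */
    return 0;
}
```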
There are several reasons why these numbers may not follow the rules above. For example:
- True and false cache-line sharing that increases traffic
- Blocking of the working dataset tuned to the full LLC size, which hurts performance when several cores share the cache because parallel threads or processes evict each other's cache lines
The first case, (true or false) sharing of cache lines, is easily identified with the EXT_SNOOP.ALL_AGENTS.HITM event. It counts how many times a line that missed the LLC was found in another cache in the Modified state. We will return to this situation in the last section of this article.
The second reason is the focus of this section. A typical example is an application that multiplies large matrices, with the data blocked so that the blocks use more than half (or all) of the LLC. With parallelization, the matrices handled by each thread or process become smaller because of the data decomposition, but if the blocking factor has not changed, two solvers together will need twice as much LLC space for their data, which can exceed its entire physical capacity. The result is cache thrashing: the two threads or processes compete for space in the cache and evict each other's cache lines. As a consequence, the total number of cache-line loads across the whole program (L2_LINES_IN.SELF.ANY) that are not associated with HITM responses (the line being found in another cache in the Modified state) increases significantly.
Another technique for estimating how much of the deviation from ideal scaling is due to excessive evictions in the cache is to measure the size of the application's working dataset for runs with a small and a large number of cores. This can be done with the Pin tool, which can measure the cache stack distance (CSD) during the two runs. The technique builds a stack of cache lines: each new cache access is pushed onto the top, and the stack is searched downwards until the previous occurrence of that line is found and removed, with the depth recorded. The resulting distribution of stack distances is the size distribution of the application's working dataset. The positions of significant peaks or "shoulders" in that distribution relative to each core's share of the LLC show the extent to which individual threads or processes evict each other's cache lines prematurely. For example, if two threads or processes are to run simultaneously on a shared LLC without interfering with each other, their working datasets must each fit within half of the LLC.
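The stack-distance bookkeeping itself is simple. Below is a minimal sketch of the algorithm just described, driven by a synthetic address trace; an actual measurement would feed it the memory trace collected by a Pin tool, and the trace, sizes, and bucketing used here are assumptions made purely for illustration:

```c
/* Toy illustration of the cache stack distance (CSD) algorithm described
 * above. A real measurement would drive this from a memory trace collected
 * with a Pin tool; here the trace is a synthetic sweep (hypothetical data). */
#include <stdio.h>
#include <stdint.h>

#define LINE_SHIFT 6          /* 64-byte cache lines */
#define MAX_LINES  (1 << 20)  /* capacity of the reuse stack */

static uint64_t stack[MAX_LINES];
static size_t   depth = 0;

/* Push one access; return its stack distance, or -1 for a first touch. */
static long stack_distance(uint64_t addr)
{
    uint64_t line = addr >> LINE_SHIFT;
    long found = -1;

    /* Search from the top (most recently used) downwards. */
    for (size_t i = depth; i-- > 0; ) {
        if (stack[i] == line) {
            found = (long)(depth - 1 - i);
            /* Remove the old entry by shifting everything above it down. */
            for (size_t j = i; j + 1 < depth; j++)
                stack[j] = stack[j + 1];
            depth--;
            break;
        }
    }
    if (depth < MAX_LINES)
        stack[depth++] = line;   /* new top of stack */
    return found;
}

int main(void)
{
    /* Hypothetical access trace: repeated sweeps over a 512 KB working set. */
    long histogram[64] = {0};
    for (int pass = 0; pass < 4; pass++)
        for (uint64_t a = 0; a < 512 * 1024; a += 64) {
            long d = stack_distance(a);
            if (d >= 0) {
                int bucket = 0;
                while ((1L << bucket) < d && bucket < 63) bucket++;
                histogram[bucket]++;   /* log2-spaced distance buckets */
            }
        }
    for (int b = 0; b < 20; b++)
        printf("distance <= 2^%-2d lines: %ld reuses\n", b, histogram[b]);
    return 0;
}
```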
If such mutual eviction does occur, it means the blocking of the data was tuned only for the case where a single thread or process had the entire cache to itself. Solving the problem of threads/processes evicting each other requires that the blocking also be scaled at run time, so that multiple threads/processes can coexist within their shares of the limited cache size.
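A minimal sketch of what such run-time scaling of the blocking might look like, assuming a 4 MB shared LLC, double-precision elements, and a rough 3*B*B estimate of the per-thread block footprint (all of these numbers are assumptions, not values from the article):

```c
/* Sketch of scaling the blocking factor to the per-thread share of the LLC.
 * LLC_BYTES, ELEM_BYTES and the 3*B*B footprint estimate are assumptions. */
#include <math.h>
#include <stddef.h>

#define LLC_BYTES   (4u * 1024 * 1024)   /* assumed shared LLC size */
#define ELEM_BYTES  sizeof(double)

/* Pick a block edge B so that the ~3*B*B elements touched by one thread's
 * block multiply fit in its share of the LLC, not in the whole cache. */
static size_t pick_block(unsigned nthreads_sharing_llc)
{
    size_t budget = LLC_BYTES / nthreads_sharing_llc;   /* this thread's share */
    size_t b = (size_t)sqrt((double)budget / (3.0 * ELEM_BYTES));
    return b > 8 ? b : 8;                /* guard against degenerate sizes */
}

/* Blocked multiply of the sub-matrices owned by one thread. */
static void block_matmul(const double *a, const double *b, double *c,
                         size_t n, unsigned nthreads)
{
    size_t B = pick_block(nthreads);
    for (size_t ii = 0; ii < n; ii += B)
        for (size_t kk = 0; kk < n; kk += B)
            for (size_t jj = 0; jj < n; jj += B)
                for (size_t i = ii; i < ii + B && i < n; i++)
                    for (size_t k = kk; k < kk + B && k < n; k++)
                        for (size_t j = jj; j < jj + B && j < n; j++)
                            c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

With one thread this degenerates to the usual single-cache blocking; with two threads sharing the LLC the block edge shrinks by roughly a factor of √2, keeping the combined footprint within the physical cache.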
Of course, when the data is decomposed for execution on many cores and systems (e.g. with MPI), it is also possible for each core's share of the work to become entirely cacheable, which changes the bus traffic dramatically; this, too, is worth bearing in mind when interpreting the counts.
Data Translation Lookaside Buffer (DTLB)
Each Intel Core 2 processor core has its own dedicated two-level DTLB. Therefore, in our first scenario an application with a "reasonable" data decomposition should use a proportionally smaller number of DTLB entries per core, and in the second scenario a constant number. If the number of DTLB misses grows instead, it means the data decomposition has increased the number of memory pages being accessed even though the total size of the working dataset has decreased or stayed the same.
One thing is clear: the page locality of the data decomposition can be greatly improved in this case. This usually happens in a multi-threaded application, since when the work is split into processes each of them lives in its own, non-overlapping virtual address space. A typical case, in a multi-threaded application or when using shared memory, is an application that uses multidimensional arrays and partitions the data along the leading dimension instead of the last one, so that each thread's slice is strided across the address space and touches many more pages. The situation is easy to detect by counting the MEM_LOAD_RETIRED.DTLB_MISS event (loads only, precise counting mode) or DTLB_MISSES.ANY (loads and stores, non-precise mode).
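For illustration (this is not the article's code), the sketch below counts how many distinct 4 KB pages one thread's share of a row-major C matrix spans under the two decompositions; which dimension is the "wrong" one depends on the storage order, so for column-major (Fortran) arrays the roles of the dimensions are reversed. The matrix size and thread count are arbitrary:

```c
/* Count how many distinct 4 KB pages one thread's share of a 2D array spans
 * under a contiguous (row) split versus a strided (column) split. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define N    2048
#define PAGE 4096u

static size_t count_pages(const uint8_t *touched, size_t npages)
{
    size_t n = 0;
    for (size_t p = 0; p < npages; p++) n += touched[p];
    return n;
}

int main(void)
{
    static double a[N][N];                        /* 32 MB, row-major */
    size_t npages = (sizeof a + PAGE - 1) / PAGE;
    uint8_t *touched = calloc(npages, 1);
    uintptr_t base = (uintptr_t)a;
    int nthreads = 16, tid = 0;                   /* look at thread 0's share */

    /* Decomposition 1: the thread gets a contiguous block of rows. */
    for (int i = tid * N / nthreads; i < (tid + 1) * N / nthreads; i++)
        for (int j = 0; j < N; j++)
            touched[((uintptr_t)&a[i][j] - base) / PAGE] = 1;
    printf("row split:    %zu pages\n", count_pages(touched, npages));

    /* Decomposition 2: the thread gets a strip of columns (strided slice),
     * which touches a small piece of every row and hence far more pages. */
    memset(touched, 0, npages);
    for (int i = 0; i < N; i++)
        for (int j = tid * N / nthreads; j < (tid + 1) * N / nthreads; j++)
            touched[((uintptr_t)&a[i][j] - base) / PAGE] = 1;
    printf("column split: %zu pages\n", count_pages(touched, npages));

    free(touched);
    return 0;
}
```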
If a problem is detected, compare the values of the same event for runs with a large and a small number of cores; then, to understand where the problem originates, look at the part of the code that shows the non-scalable behaviour (the VTune Analyzer can attribute the event counts to the source code). With this source-level view, finding the origin of the problem should not be difficult.
Another analysis method is to build a histogram of accesses to virtual addresses. If the accessed regions turn out not to be contiguous and regular, a number of problems are clearly lurking there.
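A histogram like that needs very little machinery. The sketch below assumes the virtual addresses come from some instrumentation (a Pin tool or hand-inserted calls, both assumptions here) and simply bins them by 4 KB page:

```c
/* Minimal virtual-address access histogram: record_access() bins addresses
 * by 4 KB page relative to the first page seen. The "workload" in main()
 * is a hypothetical strided walk used only to produce some output. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NBINS      4096              /* pages tracked, relative to a base */

static uint64_t base_page;
static uint64_t bins[NBINS];

static void record_access(const void *p)
{
    uint64_t page = (uintptr_t)p >> PAGE_SHIFT;
    if (base_page == 0) base_page = page;        /* first access sets the base */
    uint64_t idx = page - base_page;
    if (idx < NBINS) bins[idx]++;
}

static void dump_histogram(void)
{
    for (uint64_t i = 0; i < NBINS; i++)
        if (bins[i])
            printf("page base+%4llu : %llu accesses\n",
                   (unsigned long long)i, (unsigned long long)bins[i]);
}

int main(void)
{
    static double data[512 * 1024];              /* 4 MB array */
    /* Strided walk: touches only every 8th page, once each, which shows up
     * in the histogram as a sparse, non-contiguous access pattern. */
    for (size_t i = 0; i < 512 * 1024; i += 4096)
        record_access(&data[i]);
    dump_histogram();
    return 0;
}
```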
Access to individual cache lines
In Intel Core 2 processors, data is accessed and transferred in cache lines of 64 contiguous bytes. In shared-memory parallel execution, contention for access to individual cache lines is possible: whenever two threads running in the same address space access data within the same 64-byte cache line, there can be contention in which the thread that does not currently own the line is stalled. This problem cannot occur between parallel processes, since each of them has its own address space.
Sharing cache lines does not necessarily cause contention. The cache-coherence protocol (MESI) allows data to be shared across a multi-core, multi-socket platform. A cache line that is only read can be shared: the protocol allows several cores to hold copies of the line, and those copies are then in the Shared (S) state. As soon as one of the copies is modified, its state changes to Modified (M). This state change does not go unnoticed by the other cores, whose copies move to the Invalid (I) state. A cache line can also be held in the Exclusive (E) state, for example when a lock prefix is applied to a memory-access instruction or when the xchg instruction is used; in that case exclusive access to the line is guaranteed. This is the basic mechanism behind the mutex-based synchronization used to coordinate threads.
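As a minimal sketch of such locked access (illustrative, not the article's code): atomic_exchange below compiles to an xchg, i.e. an implicitly locked instruction on x86, so the core executing it must first obtain exclusive ownership of the cache line holding the lock variable, and contention for that single line serializes the threads:

```c
/* Minimal spinlock sketch to illustrate locked access to one cache line.
 * atomic_exchange_explicit maps to xchg on x86; the executing core must own
 * the line holding lock_word exclusively before the swap can complete. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int lock_word;          /* lives in a single 64-byte cache line */
static long shared_counter;

static void spin_lock(atomic_int *l)
{
    /* xchg-style acquire: loop until we swap 0 -> 1. */
    while (atomic_exchange_explicit(l, 1, memory_order_acquire))
        ;                              /* another core owns the line */
}

static void spin_unlock(atomic_int *l)
{
    atomic_store_explicit(l, 0, memory_order_release);
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        spin_lock(&lock_word);         /* RFO: line goes Exclusive/Modified */
        shared_counter++;              /* serialized critical section */
        spin_unlock(&lock_word);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", shared_counter);
    return 0;
}
```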
When working with cache lines in a shared-memory environment, two distinctions are possible, each with two options. If the ranges accessed by the threads within a cache line overlap, it is true sharing; if the ranges within the line do not overlap, it is false sharing. The second, independent distinction is whether or not a lock is used when accessing the cache line.
Contention for unlocked shared cache lines often results in unstable run times, unstable performance-event counts, and even unstable results. This is because the resolution of the contention depends on the exact interleaving of the competing accesses. If the execution time is unstable but the results are stable, the cause is unlocked false sharing. If both the results and the run times are unstable, this is a clear sign of accesses to unlocked truly shared cache lines, and the cause is often a race condition. In this case Intel Thread Checker is extremely useful for finding race conditions and the related access patterns.
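A minimal sketch of unlocked false sharing and its usual fix (illustrative; the structure, sizes, and thread count are assumptions): each thread updates only its own counter, but without padding both counters land in one 64-byte line, so the line ping-pongs between the cores in the Modified state and should show up as HITM snoop traffic:

```c
/* Illustrative example of unlocked false sharing and the padding fix.
 * Each thread writes only its own counter; without padding both counters
 * sit in the same 64-byte cache line and contend for ownership. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 2
#define ITERS    100000000L

struct padded_counter {
    volatile long value;
    char pad[64 - sizeof(long)];       /* remove to provoke false sharing */
} __attribute__((aligned(64)));

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg)
{
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].value++;          /* no lock: each thread has its own slot */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}
```

Comparing the padded and unpadded variants under the VTune Analyzer should show the padded version with far fewer EXT_SNOOP.ALL_AGENTS.HITM responses and a noticeably more stable run time.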
Locked sharing is the basis of thread synchronization. Contention for a locked variable, or for its cache line, serializes the accesses. This reduces the amount of work that can be done concurrently and therefore degrades the application's scalability.
Performance events let you estimate the traffic caused by these contended lines. Contended access to a cache line can increase the number of LLC misses caused by demand loads or even by the hardware prefetchers. When a load misses the LLC, execution may stall for the full latency of bringing the cache line from memory. When the variable is stored to, a read-for-ownership (RFO) request is needed to bring the cache line into an exclusive state. We have already discussed what this means for bus traffic and application scalability; the aspects are interrelated, but here we look at the problem from a slightly different angle.
There is a whole range of events useful for detecting such access problems. The following are only the most obvious:
- EXT_SNOOP.ALL_AGENTS.HITM
- MEM_LOAD_RETIRED.L2_LINE_MISS
- MEM_LOAD_RETIRED.L2_MISS
- BUS_TRANS_BURST.SELF
- BUS_TRANS_RFO.SELF
- BUS_HITM_DRV
- L2_LD.SELF.E_STATE
- L2_LD.SELF.I_STATE
- L2_LD.SELF.S_STATE
The EXT_SNOOP event counts how many times an LLC miss from this core drew a hit-modified (HITM) response from another core on the bus. The BUS_HITM_DRV event is the mirror image: it counts how many times the given core drove that response. The MEM_LOAD_RETIRED events are counted precisely, which makes it possible to determine the exact number of loads that missed the L2. While the requested cache line is in flight, the miss causes only a single transaction on the bus. Thus the L2_LINE_MISS variant of the event counts the number of cache lines requested, while the L2_MISS variant counts the total number of missing loads. The difference between them indicates how many accesses to the requested line occurred between the first miss, which issued the request for the line, and the moment the line arrived over the bus.
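Stated as a simple derived metric (it follows directly from the description above, not from a formula given in the article):

accesses to a line while it is in flight ≈ MEM_LOAD_RETIRED.L2_MISS − MEM_LOAD_RETIRED.L2_LINE_MISS

A large difference for a hot code location is a hint that several loads pile up on the same contended cache line before it arrives.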
The case of unlocked false sharing complicates the picture somewhat, especially if the L1D IP prefetcher interprets the misses as a regular access pattern and tries to fetch the lines from memory itself. This applies in particular to false sharing of a line inside loops executed simultaneously by several threads. The difficulty is that the precise MEM_LOAD_RETIRED.L2_LINE_MISS event will not see such a miss, since it is attributed not to the retired load but to the IP prefetcher. These LLC misses will, however, be included in the L2_LD.SELF.I_STATE event, and even here MEM_LOAD_RETIRED.L2_MISS will help localize the problem. When comparing counter values between single-threaded and multi-threaded runs, a sudden spike in this event together with a large number of EXT_SNOOP responses on the VTune Analyzer graphs will reveal the problem area.