
Finding and solving scalability problems using the example of multi-core Intel Core 2 processors (part 4)

Article continuation: part 1, part 2, part 3


Probably the simplest example is a parallel algorithm that sums the values of the elements of an array, where each thread adds up its own subset of the elements. The algorithm might look like this:
int sum(int* data, int* sum, int size, int tid)
{
    int i;
    for (i = 0; i < size; i++)
        *sum += data[i] * data[i];
    return *sum;
}

If the pointer "sum" points into a plain array indexed by the thread identifier, we may run into parallel threads competing for the cache line that holds the array elements. Strictly speaking, the compiler could accumulate the sum in a register (eliminating the stores through "sum") and thus avoid the problem entirely, which is exactly what the Intel compiler does.
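
To make the contention concrete, here is a minimal sketch (not from the original article; NTHREADS, CHUNK, the data and sums arrays and worker() are illustrative assumptions) of how such a per-thread accumulator array might be used. With four threads and 4-byte ints, sums[0]..sums[3] occupy 16 contiguous bytes and therefore share a single 64-byte cache line, so every update bounces that line between cores:

#include <pthread.h>

#define NTHREADS 4
#define CHUNK    250000

int sum(int* data, int* sum, int size, int tid);   /* the function above */

int data[NTHREADS * CHUNK];
int sums[NTHREADS];          /* all four accumulators land in one cache line */

static void* worker(void* arg)
{
    int tid = (int)(long)arg;
    /* each thread sums its own chunk, but the stores all hit the same line */
    sum(&data[tid * CHUNK], &sums[tid], CHUNK, tid);
    return 0;
}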

However, not all compilers are as smart. If the function is declared like this:
int sum(int* data, volatile int* sum, int size, int tid)
then the compiler is no longer allowed to keep "sum" in a register, which guarantees a fight for the cache line, and the events described above will pour in.
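
One common remedy, sketched below under the same assumptions (the struct name is invented; the 64-byte cache line size of the Core 2 is the only hardware fact relied on), is to give each thread's accumulator its own cache line by padding and aligning it:

#define CACHE_LINE 64    /* Core 2 cache line size in bytes */

/* Illustrative padded accumulator, replacing the plain int sums[NTHREADS]:
 * neighbouring threads' counters can no longer share a cache line. */
struct padded_sum {
    volatile int value;
    char pad[CACHE_LINE - sizeof(int)];
} __attribute__((aligned(CACHE_LINE)));   /* GCC/ICC alignment syntax */

struct padded_sum sums[NTHREADS];

/* each thread now updates only sums[tid].value, so the line it writes
 * is never needed by any other thread */
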
As a second example, let us take a very simple triply nested loop that repeatedly permutes the elements of an array; it suits our discussion perfectly.
#define MAXTHR 4
#define ITERS 1000
#define SIZE 1000

int aa[MAXTHR][SIZE];
volatile int i[MAXTHR], j[MAXTHR], k[MAXTHR], n[MAXTHR], tmp[MAXTHR];

int sort(int *a, int size, int tid)   // a = aa[tid][0]
{
    n[tid] = 0;
    for (k[tid] = 0; k[tid] < ITERS/2; k[tid]++) {
        for (i[tid] = 0; i[tid] < size-1; i[tid]++) {
            for (j[tid] = i[tid]+1; j[tid] < size; j[tid]++) {
                if (a[i[tid]] > a[j[tid]]) {
                    tmp[tid] = a[i[tid]];
                    a[i[tid]] = a[j[tid]];
                    a[j[tid]] = tmp[tid];
                    n[tid]++;
                }
            }
        }
        for (i[tid] = 0; i[tid] < size-1; i[tid]++) {
            for (j[tid] = i[tid]+1; j[tid] < size; j[tid]++) {
                if (a[i[tid]] < a[j[tid]]) {
                    tmp[tid] = a[i[tid]];
                    a[i[tid]] = a[j[tid]];
                    a[j[tid]] = tmp[tid];
                    n[tid]++;
                }
            }
        }
    }
    return n[tid];
}

Looking at the code, it is obvious that the index arrays i, j, k and the tmp array are very likely to end up in the same cache line, which the threads will fight over on every iteration of every loop. They are deliberately declared volatile to keep the compiler from placing them in registers, which the language standard would otherwise allow. The standard in general permits a great deal of freedom with data accesses for optimization purposes, and unless you apply something like volatile, the compiler has no reason to assume that this function will be executed by several threads at once.

The current Intel compiler (10.0) for the Intel 64 architecture, at optimization level O3, will produce an executable completely free of false sharing of the cache line if the volatile qualifier is removed from the function above: none of the loop counters and none of the temporary variables are ever written back to memory, they live only in registers. Whether false sharing of a line (without locking) appears at all depends heavily on the compiler, and there are countless variations on this theme once the programmer starts using optimizations such as inlining, splitting functions into parts, and so on.
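
To illustrate the point about registers, here is a hedged sketch (sort_local is an invented name, not the article's code) of the same loops written with ordinary automatic variables. Each of the volatile int[4] arrays above occupies only 16 bytes, so several of them inevitably share a 64-byte line, whereas the locals below need never touch memory at all:

/* Sketch: the same permutation loops rewritten with automatic variables.
 * Nothing here is shared between threads except the array a itself,
 * so the loop counters and tmp can stay in registers. */
int sort_local(int *a, int size)      /* a = aa[tid][0], as before */
{
    int i, j, k, tmp, n = 0;

    for (k = 0; k < ITERS/2; k++) {
        for (i = 0; i < size - 1; i++)          /* ascending pass */
            for (j = i + 1; j < size; j++)
                if (a[i] > a[j]) {
                    tmp = a[i]; a[i] = a[j]; a[j] = tmp; n++;
                }
        for (i = 0; i < size - 1; i++)          /* descending pass */
            for (j = i + 1; j < size; j++)
                if (a[i] < a[j]) {
                    tmp = a[i]; a[i] = a[j]; a[j] = tmp; n++;
                }
    }
    return n;
}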

Suppose we have collected data with the VTune Analyzer; we then look at the relationships between the event counts. The number of EXT_SNOOP.ALL_AGENTS_HITM is approximately equal to BUS_HITM_DRV. Pay attention to how the application behaves from run to run and what happens to the events. MEM_LOAD_RETIRED.L2_MISS is much larger than MEM_LOAD_RETIRED.L2_LINE_MISS, and MEM_LOAD_RETIRED.L2_LINE_MISS in turn is much smaller than the number of cache lines transferred over the bus, judging by BUS_TRANS_BURST.SELF. The largest contribution comes from requests for ownership (RFO), measured by the BUS_TRANS_RFO.SELF event; evaluate the RFO contribution to the bus traffic measured by BUS_TRANS_BURST.SELF.

The remainder is made up of ordinary reads, measured by BUS_TRANS_BRD.SELF. The L2_LD.SELF event is useful for determining the state of the cache line, especially when the influence of the hardware prefetchers is strong.
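
As a rough illustration of the arithmetic (the event counts below are invented purely for the example; the real numbers come from the sampling results), the RFO share of the bus traffic is simply BUS_TRANS_RFO.SELF divided by BUS_TRANS_BURST.SELF, and the reads counted by BUS_TRANS_BRD.SELF account for most of the rest:

#include <stdio.h>

int main(void)
{
    /* Hypothetical sampled counts, for illustration only. */
    double burst = 1.2e7;   /* BUS_TRANS_BURST.SELF: full-line transactions */
    double rfo   = 0.9e7;   /* BUS_TRANS_RFO.SELF:   requests for ownership */
    double brd   = 0.3e7;   /* BUS_TRANS_BRD.SELF:   ordinary reads         */

    printf("RFO share of bus traffic:  %.0f%%\n", 100.0 * rfo / burst);
    printf("Read share of bus traffic: %.0f%%\n", 100.0 * brd / burst);
    return 0;
}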

Consequently, to determine whether there is contention for cache lines, it is reasonable to collect data first with a single-threaded run in order to establish baseline values, and only then with a multi-threaded one. To identify false sharing of a line, it is probably best to look at MEM_LOAD_RETIRED.L2_MISS, comparing the counts and the locations of the peaks of this event in the source view of the VTune Analyzer. Usually everything becomes clear immediately, especially if you also pay attention to EXT_SNOOPS.ALL_AGENTS.HITM.

Finding conflicts over locked accesses is also quite simple. The L2_LOCK.SELF.E_STATE event occurs whenever a locked access (other than the xchg instruction) is used to implement a mutex. If the locked item has been modified, the L2_LOCK.SELF.M_STATE event occurs as well. Because of the IP-based hardware prefetcher, the MEM_LOAD_RETIRED.L2_LINE_MISS event is not effective for this search. In this case, peaks of MEM_LOAD_RETIRED.L2_MISS together with L2_LD.LOCK.E_STATE in the source view of the VTune Analyzer clarify the picture; EXT_SNOOP.HITM.ALL_AGENTS is again present in these cases.
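
For context, locked accesses of this kind typically come from primitives such as the spinlock sketched below (a hypothetical example built on GCC atomic builtins, not code from the article); every acquisition attempt is a locked read-modify-write, and under contention the line holding the lock word migrates between cores in M state:

/* Minimal test-and-set style spinlock sketch using GCC builtins;
 * __sync_val_compare_and_swap compiles to a locked cmpxchg on x86. */
typedef volatile int spinlock_t;

static void spin_lock(spinlock_t *l)
{
    while (__sync_val_compare_and_swap(l, 0, 1) != 0)   /* locked RMW */
        while (*l)                                       /* spin on plain reads */
            ;
}

static void spin_unlock(spinlock_t *l)
{
    __sync_lock_release(l);                              /* store 0, release */
}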

Contention for a locked cache line is often caused by the use of a synchronization API. The simple analysis of the collected data described above will most likely show that the application spends a considerable part of its time in the synchronization wait loop. In that case you need to find where this API is called in order to understand whether the serialization caused by locking the variable can be reduced by changing the code. The location of these bottlenecks can be determined quickly with the call graph in the VTune Analyzer or with the Intel Thread Profiler.
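
A typical change of this kind, sketched below with pthreads (the shared counter, the lock and add_partial() are assumptions made for the example), is to accumulate results privately and enter the critical section once per thread rather than once per element:

#include <pthread.h>

pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;
long total = 0;

/* Sketch: instead of locking inside the loop for every element,
 * each thread accumulates locally and serializes only once. */
static void add_partial(const int *chunk, int count)
{
    long local = 0;
    int  i;

    for (i = 0; i < count; i++)
        local += chunk[i];

    pthread_mutex_lock(&total_lock);    /* single short critical section */
    total += local;
    pthread_mutex_unlock(&total_lock);
}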

While contention for cache lines is characteristic of the shared-memory parallelization model, an excessive loss of scalability is also possible through the overuse of synchronization operations in MPI. Synchronous message passing, MPI_Wait, and MPI collective operations (MPI_Allreduce, for example) can affect performance in much the same way as in the cases described above. Intel Trace Analyzer and Collector was created precisely to find such MPI problems, and using it is practically a necessity for achieving the best MPI scaling in a large cluster environment based on Intel Core 2 processors.
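
As a sketch of the analogous idea in MPI (purely illustrative; the halo buffers, the neighbor rank and the local-work steps are assumptions), blocking sends and receives can often be replaced with non-blocking ones so that computation overlaps communication before the waits are finally issued:

#include <mpi.h>

/* Sketch: overlap communication with local work instead of blocking. */
void exchange_and_work(double *halo_out, double *halo_in, int n,
                       int neighbor, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* post non-blocking receive and send instead of blocking calls */
    MPI_Irecv(halo_in,  n, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
    MPI_Isend(halo_out, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

    /* ... do local computation that does not need halo_in yet ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* ... computation that uses the received halo_in ... */
}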

Conclusion


The Intel Core 2 processor has a fairly well developed hierarchy of performance events, which is very effective for analyzing performance problems. Many causes of poor scaling in a multi-core environment can be identified quickly and easily with the Intel VTune Performance Analyzer. For resolving more complex thread synchronization issues there is the Intel Thread Profiler, and for MPI the Intel Trace Analyzer and Collector is recommended.

Source: https://habr.com/ru/post/107624/

