
Bubbles, caches, and branch predictors

This article grew out of a curious post and a brief comment by its author, which encouraged me to work out in more detail what was going on. The idea is to compare two variations of the bubble sort algorithm. The first is the usual bubble sort with a small optimization; the inner loop can stop a little earlier, because we know that the tail of the array is already sorted:
for (i = 0; i < N; i++)
    for (j = 0; j < N - (i+1); j++)
        if (a[j] > a[j+1])
            swap(a[j], a[j+1]);


In the second variant, the inner loop runs over a different part of the array, yet this variant is algorithmically equivalent to the first (details below):
for (i = 0; i < N-1; i++)
    for (j = i; j >= 0; j--)
        if (a[j] > a[j+1])
            swap(a[j], a[j+1]);


We run both versions (code), for example, for N = 100,000 on an array of ints, and get about 30 seconds in the first case and less than 10 seconds in the second: a difference of 3 times! Where, then, does such a difference come from?
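For readers who want to reproduce the measurement without the linked code, here is a minimal benchmark sketch; the fixed seed, the use of rand() for filling, and the clock()-based timing are my assumptions, not taken from the original code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 100000

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* variant 1: classic bubble sort with a shrinking inner loop */
static void sort1(int *a)
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N - (i+1); j++)
            if (a[j] > a[j+1])
                swap(&a[j], &a[j+1]);
}

/* variant 2: the inner loop walks backwards from i */
static void sort2(int *a)
{
    int i, j;
    for (i = 0; i < N - 1; i++)
        for (j = i; j >= 0; j--)
            if (a[j] > a[j+1])
                swap(&a[j], &a[j+1]);
}

int main(void)
{
    static int a[N], b[N];
    int i;
    srand(42);                 /* fixed seed so both variants see the same data */
    for (i = 0; i < N; i++)
        a[i] = b[i] = rand();

    clock_t t0 = clock();
    sort1(a);
    clock_t t1 = clock();
    sort2(b);
    clock_t t2 = clock();

    printf("variant 1: %.1f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("variant 2: %.1f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}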

To begin with, let us check that compiler optimizations are not to blame. This is easy to verify by generating and comparing the assembly code in both cases. Alternatively, you can turn optimization off, or rewrite the code in assembler (for example, using inline assembly). In any case, the effect persists.

Next, how do we check that the two variants are equivalent? For example, we can print the compared values a[j], a[j+1] together with the corresponding value of j, sort the resulting lists, and then compare them line by line (naturally, it is reasonable to do this for small N). It turns out that the lists are identical, which means that both variants perform the same comparisons and the same assignments; the difference is only in the order in which these operations are executed.
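A hedged sketch of such instrumentation; the log format and the idea of sorting the text files afterwards are my choices, not the author's:

#include <stdio.h>

/* Call this before each comparison in both variants, writing to a
   per-variant log; identical sorted logs mean identical operations. */
static void log_cmp(FILE *f, int j, int x, int y)
{
    fprintf(f, "%d %d %d\n", j, x, y);
}

/* inside the inner loop of either variant:
       log_cmp(f, j, a[j], a[j+1]);
       if (a[j] > a[j+1]) swap(a[j], a[j+1]);
   then sort both log files and compare them line by line */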

Yes, some will say, but we know other examples where execution time depends on the order of operations. For instance, there is a difference between sequential and random memory access, even when the same data is read. The point there is that a memory access is a rather long process, during which the processor sits practically idle although it could have executed hundreds of instructions. With sequential access, however, these delays are amortized by the processor caches, into which a large chunk of data is copied at once and which are much faster to access.
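To make this concrete, here is a small illustrative sketch (not from the original post): the same elements are summed twice, once in order and once through a randomly shuffled index array, and on a large enough array the second traversal is noticeably slower:

#include <stdio.h>
#include <stdlib.h>

#define M (1 << 24)           /* 16M ints, far larger than any cache */

int main(void)
{
    int *data = malloc(M * sizeof(int));
    int *idx  = malloc(M * sizeof(int));
    long long sum = 0;
    int i;

    for (i = 0; i < M; i++) { data[i] = i; idx[i] = i; }

    /* Fisher-Yates shuffle of the index array
       (rand() is crude, but enough for the illustration) */
    for (i = M - 1; i > 0; i--) {
        int k = rand() % (i + 1);
        int t = idx[i]; idx[i] = idx[k]; idx[k] = t;
    }

    for (i = 0; i < M; i++) sum += data[i];       /* sequential: cache-friendly */
    for (i = 0; i < M; i++) sum += data[idx[i]];  /* random: many cache misses  */

    printf("%lld\n", sum);
    free(data); free(idx);
    return 0;
}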
But note that in our case the amount of data is not that large: 100,000 ints at 4 bytes each is only about 400 KB, while modern processors have L2 caches of a megabyte and more. Besides, in both cases we access the array elements sequentially, so the memory access delays are again amortized. Finally, if you decrease N, the same threefold difference remains. So the caches, although they play their part, are not the main cause of the observed effect.

To understand what is happening, and to check the previous claim, let us analyze both variants using the processor's hardware performance counters. These counters record the number of instructions executed, cache accesses, the number of branches taken, and many other quantities. To take the readings I used Intel's VTune Performance Analyzer (the trial version; no, don't expect freebies, you're not Google here!), but there are other ways. We run the analysis and get the following output (P4 Prescott, L1 16 KB, L2 2 MB; first variant vs. second):

Instructions executed: 40 vs. 35 (·10⁹)
L1 cache misses: 1.1 vs. 1.8 (·10⁹)
L2 cache misses: 1.6 vs. 0.4 (·10⁶)

So what follows from this? First, there are very few second-level cache misses per instruction, so the corresponding delays are not very significant. Second, there are still quite a lot of first-level cache misses, but it turns out that in the second variant these misses occur even more often than in the first! The results on other processors may differ somewhat (in particular, there may be an L3 cache, which my processor lacks), but I think the overall picture remains about the same.
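If VTune is not at hand, similar readings can be taken on Linux through the perf_event_open system call. Below is a minimal sketch (Linux-specific, error handling omitted); the helper measure() and the choice of counter are my assumptions about what is relevant here, not part of the original setup:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* open a hardware counter for the current thread */
static int open_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;          /* e.g. PERF_COUNT_HW_BRANCH_MISSES */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

/* wrap one of the sort functions from the benchmark above */
void measure(void (*sort)(int *), int *a)
{
    int fd = open_counter(PERF_COUNT_HW_BRANCH_MISSES);
    uint64_t misses = 0;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    sort(a);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &misses, sizeof(misses));
    printf("mispredicted branches: %llu\n", (unsigned long long)misses);
    close(fd);
}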

So it is not about the caches. But let's look at some additional analysis results:

Branches executed: 1.0 vs. 1.0 (·10¹⁰)
Mispredicted branches: 0.16 vs. 0.00009 (·10¹⁰)

A small digression for those who, like me until recently, are not familiar with such an interesting thing as the branch predictor; oddly enough, I found no posts about it on Habr.

You probably know that modern processors, thanks to the instruction pipeline, execute instructions not strictly one after another but several at once in parallel, combining the results afterwards. However, which instructions come next may depend on the outcomes of conditional branches, as in our example: if a[j] > a[j+1], the elements are swapped, otherwise they are not. This is where the branch predictor comes to the rescue: it tries to guess whether the branch will be taken or not, and instructions are fetched into the pipeline accordingly.

A branch predictor can be static or dynamic. A dynamic predictor tries to predict a branch based on the history of previous branches; while that history has not yet been accumulated, a static predictor is used, but in our case this does not matter. On the subject of static and dynamic predictors I liked this article (in English), although in reality everything is, of course, more complicated.

How important are the branch predictor and its errors? Wikipedia says that on modern processors the misprediction penalty is tens of cycles, which by itself is not that much (leotsarev). More importantly, the absence of such errors brings a considerable benefit, because thanks to the long pipeline the processor can "look" many instructions ahead. You can verify this with the following code:

for (i = 0; i < T; i++)
    for (j = 0; j < M; j++)
        if (p[j]) c++;


Here p[] is an array that determines whether the conditional branch is taken or not, and the counter c merely serves to distinguish these two outcomes. If all values of p[j] are the same, the branch is predicted well after a few iterations. If p[j] is generated in some, say, periodic way, the predictability of the branch depends on the predictor's implementation. But if the array is filled randomly, the next branch is impossible to predict, within certain limits. Note that the size M of the array matters: if it is too large, it caches poorly, and if it is too small, the branch pattern itself can be learned by the predictor.

On my computer, the execution time of this code varies by a factor of 4 to 6 depending on how randomly the array is filled. If a more complex operation is performed instead of incrementing the counter, for example swapping array elements, the difference decreases, but not by much. Thus, the presence or absence of branch predictor errors can by itself change the execution time several times over, as observed in our original problem.
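Here is a minimal self-contained sketch of this experiment; the sizes T and M, the three fill modes, and the timing method are my choices for illustration, not the author's:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define M 4096                /* small enough to stay in cache */
#define T 100000              /* outer repetitions */

static void run(const char *name, const int *p)
{
    volatile long c = 0;      /* volatile so the loop is not optimized away */
    int i, j;
    clock_t t0 = clock();
    for (i = 0; i < T; i++)
        for (j = 0; j < M; j++)
            if (p[j]) c++;
    printf("%-10s %.2f s (c=%ld)\n", name,
           (double)(clock() - t0) / CLOCKS_PER_SEC, (long)c);
}

int main(void)
{
    static int p[M];
    int j;

    for (j = 0; j < M; j++) p[j] = 1;            /* always taken: trivial    */
    run("constant", p);

    for (j = 0; j < M; j++) p[j] = j & 1;        /* strictly periodic        */
    run("periodic", p);

    srand(1);
    for (j = 0; j < M; j++) p[j] = rand() & 1;   /* random: hard to predict  */
    run("random", p);
    return 0;
}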

According to the analysis results above, in the first sorting variant predictor errors occur in 16% of cases (0.16·10¹⁰ mispredictions out of 1.0·10¹⁰ branches), and in the second about a thousand times less often. Why is that? This can be understood by looking at how the sorting proceeds in each case.

In the first case, for small i the inner loop over j runs almost to the end of the array, skipping only the sorted "tail". Initially the data in this part of the array is unordered, so the condition a[j] > a[j+1] holds almost at random. As i grows, the swaps bring some order to the elements, but plenty of randomness still remains (animation). The branch is therefore hard to predict, which leads to a large number of predictor errors.

In the second case, with i = 0 the inner loop sorts only a[0] and a[1]; with i = 1 it adds a[2], and so on. In effect this is insertion sort (without binary search): at the i-th step the element a[i+1] is inserted into the already sorted subarray a[0..i] (animation). Since the element is inserted starting from the end of the subarray, in most cases it is moved step by step towards its proper position, and until it gets there the condition a[j] > a[j+1] keeps evaluating the same way. Thus, after a few iterations the branch becomes easy to predict, which the branch predictor is evidently quite happy about.
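To make the insertion-sort structure explicit, the second variant can be rewritten with an early exit. This is my illustrative rewrite, not the author's code; unlike the measured variant, it skips the trailing comparisons that always evaluate to false, so it no longer performs the identical sequence of operations:

for (i = 0; i < N-1; i++)
    for (j = i; j >= 0; j--) {
        if (a[j] > a[j+1])
            swap(a[j], a[j+1]);   /* keep moving a[i+1] down */
        else
            break;                /* it reached its place; a[0..j] is sorted */
    }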

Source: https://habr.com/ru/post/73726/

