This is a follow-up to yesterday's post about optimizing away conditional branches when computing x = sign(a,b) * min(abs(a), abs(b)), supposedly 10 times faster. Summary:
- the optimization is real, but its size was imaginary: 2.5x, not 10x;
- benchmarks must be done correctly: don't measure CPU stalls, RAM bandwidth, and so on instead of the function under investigation;
- benchmarks must be done correctly: otherwise they jitter wildly;
- raising the priority sounds cool, but on short benchmarks it's in vain: +0.5% speed, and the ~15% jitter stays;
- measure the function under investigation and only it; that's the only way to get correct data;
- warm up the CPU and take the best of at least N runs/seconds; that's the only way to beat the jitter;
- use SSE: with it we got 8.6x, and the code... is still readable.
In short, once again we have a bundle of classic methodological benchmarking errors. If you care about how NOT to make such mistakes: details, a full debriefing, several more rounds of optimization, and, most importantly, the source code below the cut.
I read the original post yesterday and was very surprised by the 10x speedup from eliminating branches, even if on a synthetic test. That's too much: branches are expensive, but not that expensive. I tried to reproduce the results, looked more closely, and naturally: kindergarten-level errors in the benchmark methodology, again! Well, time to take them apart once more; it's a good example.
Error #1 is that the original code simply doesn't measure anything sane. It sums the results of llr(), which is good. But the compiler sees that the sum is never used, and therefore has every right to optimize it away. Which it did. Moreover, the originally published (since revised and corrected) optimized variants didn't compute that result at all, and nobody noticed. Oops...
Moral #1: folks, print your results, ALWAYS. You'll catch errors immediately, and the compiler won't throw away the "unneeded" loop. Declaring the result volatile also helps, but print it anyway.
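A minimal sketch of what that looks like in practice. The LLR here is just a straightforward scalar rendering of the sign(a,b) * min(abs(a), abs(b)) formula above, and all names are illustrative, not from the original source:

```cpp
#include <cstdio>
#include <cstdlib>

// the function under test: sign(a*b) * min(|a|, |b|), per the formula above
static int LLR(int a, int b)
{
    int m = abs(a) < abs(b) ? abs(a) : abs(b);
    return (a ^ b) < 0 ? -m : m;
}

volatile int g_sink; // volatile: the compiler must assume this value is observed

int main()
{
    int x = 0;
    for (int i = 0; i < 10000000; i++)
    {
        int a = rand() - RAND_MAX / 2;
        int b = rand() - RAND_MAX / 2;
        x += LLR(a, b);
    }
    g_sink = x;            // now the loop can no longer be thrown out...
    printf("x = %d\n", x); // ...and printing makes wrong results visible at once
    return 0;
}
```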
Error #2 is that the author measures a rand() + llr() loop, then a rand() + llr2() loop, and then subtracts the execution times by eye. This is a bad idea for two reasons. rand() is very slow, so the benchmark becomes unreasonably long. ;) That's one. And the experiment measures the performance of a mishmash of two functions, and that mishmash by no means behaves the same as the function we actually care about. That's two.
In the general case, the "let's measure a mix of functions A + B" approach is unsuitable because the compiler can interleave the computations. Part of the "measured" function B ends up hidden inside A, and we actually measure some unknown fraction of B. That's fine when A + B is a real production pair that gets used together like that. It's bad when, instead of A, you have test synthetics like rand().
In our case no such mixing occurs: the disassembly shows a call _rand() with no attempt to inline it, yet some other trouble is clearly going on. What exactly, I don't understand, but the working hypothesis is CPU stalls. The hypothesis: by slightly delaying the start of the llr() computation, which begins with a test esi, esi instruction where esi has only just been returned from rand(), the original benchmark can be sped up. Simply repeating the loop body 2x gives, of course, no effect; the computations have to be interleaved:
10.7 sec, rand()
13.3 sec, rand() + llr()
12.6 sec, rand() + llr() + 2x unroll
```cpp
// 2x unroll, simple repetition (no effect), 13.3 sec
int a = rand() - RAND_MAX / 2;
int b = rand() - RAND_MAX / 2;
x += LLR(a, b);
a = rand() - RAND_MAX / 2;
b = rand() - RAND_MAX / 2;
x += LLR(a, b);

// 2x unroll, interleaved, 12.6 sec
int a = rand() - RAND_MAX / 2;
int b = rand() - RAND_MAX / 2;
int a2 = rand() - RAND_MAX / 2;
int b2 = rand() - RAND_MAX / 2;
x += LLR(a, b);
x += LLR(a2, b2);
```
In one of the experiments, by the way, a dumb asm { nop } inserted before x += LLR(a, b) sped things up to 12.8 sec overall, but I couldn't confidently reproduce that miracle. Some random fluctuation. In general, though, it's plain that measuring a mix of a heavy rand() and a light llr() is, hmm, a non-trivial occupation. I assume that somewhere around a test / jxx instruction pair stalls occur because the tested expression was computed just a moment ago. Maybe someone with VTune at hand will be able to see more precisely why this is so.
Moral #2: folks, don't measure a mix of the function under investigation plus synthetics; measure only the function you need. How the compiler blends that mix and what special effects arise at the seams is anyone's guess.
So we get rid of the slow rand(): precompute a sufficiently large block of random numbers before the loop, and leave only the llr() computation and summation inside it. Besides the function itself we now also measure memory access overhead, but with linear reads it's minimal. SUDDENLY, instead of an imaginary 10x speedup, we observe a real, modest 2.5x.
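A sketch of that restructuring, reusing LLR from the sketch above; the block size and names here are illustrative, the exact ones live in the gist linked at the end:

```cpp
#include <cstdlib>

static const int N = 1 << 20; // a "sufficiently large" block (size is an assumption)
static int g_a[N], g_b[N];

void Precompute() // all the slow rand() calls happen outside the timed loop
{
    for (int i = 0; i < N; i++)
    {
        g_a[i] = rand() - RAND_MAX / 2;
        g_b[i] = rand() - RAND_MAX / 2;
    }
}

int RunPass() // only llr() plus summation remain inside; memory is read linearly
{
    int x = 0;
    for (int i = 0; i < N; i++)
        x += LLR(g_a[i], g_b[i]);
    return x;
}
```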
OK, this matches expectations, but now it's somehow not fast enough ;)
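For reference, a branchless scalar variant of the kind that yields such a speedup might look like this. This is one common formulation, an illustration rather than the post's actual llr2():

```cpp
// branchless sign(a*b) * min(|a|, |b|); assumes |a|, |b| fit in int (they do:
// inputs are bounded by RAND_MAX / 2)
static inline int LLR2(int a, int b)
{
    int sa = a >> 31;                              // 0 or -1: sign mask of a
    int sb = b >> 31;
    int aa = (a ^ sa) - sa;                        // abs(a) without a branch
    int bb = (b ^ sb) - sb;                        // abs(b) without a branch
    int d  = aa - bb;
    int m  = bb + (d & (d >> 31));                 // min(aa, bb) without a branch
    int s  = sa ^ sb;                              // sign mask of a*b
    return (m ^ s) - s;                            // apply the sign
}
```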
SSE to the rescue. My desktop has a not-exactly-new Core2Duo E8500, but even it can do SSE4. We unroll the loop 4x and process 4 pairs at a time, using the dedicated sign and abs instructions head-on.
1.073 sec, llr () baseline
0.438 sec, llr2 () optimized, 2.5x
0.125 sec, llr4 () sse + 4x unroll, 8.6x
Interestingly, the code remains quite readable. The only thing that really needs careful commenting is _mm_sign_epi32(). First, it unexpectedly takes a 2nd argument too, and in effect multiplies the 1st argument by the sign of the 2nd. Which is exactly what we need. Second, the sign of 0 is taken as 0, not 1, as you might want for some tasks. For our purposes, though, it makes no difference: if abs(a) or abs(b) is 0, then sign * min(abs) = 0 anyway, so there's no error.
```cpp
static inline __m128i LLR4(const int * pa, const int * pb)
{
    __m128i a = *(__m128i*)pa;
    __m128i b = *(__m128i*)pb;
    // the body below is reconstructed from the description above; the exact
    // original is in the gist. needs <tmmintrin.h> (SSSE3) for sign/abs and
    // <smmintrin.h> (SSE4.1) for _mm_min_epi32
    __m128i m = _mm_min_epi32(_mm_abs_epi32(a), _mm_abs_epi32(b)); // min(|a|,|b|)
    return _mm_sign_epi32(_mm_sign_epi32(m, a), b); // * sign(a) * sign(b)
}
```
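For context, the 4x-unrolled measurement loop around it might look roughly like this, reusing the g_a/g_b arrays from the earlier sketch (N assumed to be a multiple of 4; acc and tmp are my names):

```cpp
__m128i acc = _mm_setzero_si128();
for (int i = 0; i < N; i += 4)
    acc = _mm_add_epi32(acc, LLR4(g_a + i, g_b + i));

int tmp[4];
_mm_storeu_si128((__m128i*)tmp, acc);   // unload the register into memory...
x += tmp[0] + tmp[1] + tmp[2] + tmp[3]; // ...and sum 4 normal ints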
A note in the margin: interestingly, summing up the register's components via _mm_hadd_epi32 instead of unloading the register into memory and adding 4 normal ints gives nothing. And I've long wanted to finally see hadd pay off, at least somewhere.
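For completeness, the hadd variant that was tried looks like this (a sketch, continuing the loop above):

```cpp
// _mm_hadd_epi32 alternative (measured no faster here)
__m128i t = _mm_hadd_epi32(acc, acc); // [a0+a1, a2+a3, a0+a1, a2+a3]
t = _mm_hadd_epi32(t, t);             // the total ends up in every lane
x += _mm_cvtsi128_si32(t);            // extract the low lane
```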
Another note: interestingly, unrolling the loop further gives nothing either. Apparently it's already limited by memory read speed; that works out to about 6.4 GB/sec.
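(The figure is easy to sanity-check, with an assumed block size since the exact one is in the gist: if a run streams, say, two arrays of 10^8 ints, that's 2 * 4 * 10^8 bytes ≈ 0.8 GB, and 0.8 GB / 0.125 sec = 6.4 GB/sec.)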
Error #3, accordingly, is that the long-available SSE extensions weren't used at all: all the valiant optimization was, for some reason, done for what is essentially the i386 platform.
This is a debatable conclusion; I pondered for a while whether to write it at all. Is it an "error" or not? I personally think it is. Because if we're optimizing, let's go all the way! The scholastics in the comments will surely howl that it isn't: after all, this version is vectorized, works on 4 pairs of numbers at a time, and that's a decidedly incompatible change to the original function. Plus, an i386 optimization is terribly valuable too; what if the program suddenly gets run not on an i7 but on a monument of IT archeology? Still, either way the moral is the same.
Moral #3: folks, it's 2013, SSE4 is almost everywhere, SSE2 absolutely everywhere; act accordingly and optimize (where possible) for the 99% of users, not the 1% fallback.
The resulting SSE version is now brisk enough that across different launches the time dances from 0.125 to 0.145 seconds. Almost a 20% spread. Oops...
Error #4 appeared just now, thanks to the optimization ;) Setting a high thread and process priority gives about 0.5% of performance but in no way saves you from measurement jitter. First, an idle processor drops its clock frequency, and it doesn't come back up instantly: over 10+ seconds it manages, over 0.1 seconds it doesn't. Second, even if you run a short test 100 times, the time of each individual iteration still dances, both at the start and at the end of those 100 runs. Not to mention whatever background work a supposedly 99%-idle system is quietly doing, and however that interferes.
The processor must be "warmed up", and even then the results of (short enough) runs must be filtered. Simply doing a few more iterations and averaging isn't enough: the total time still jitters noticeably. Throwing out the first "cold" iterations isn't enough either: the result still jitters. You can take the mean and the variance, throw out the outliers where abs(value - mean) >= k * stddev, then recompute the mean. I tried. It still jitters!
The least jittery approach turns out to be embarrassingly simple: do several runs (I do 10) and take the minimum single-run time among them. That minimum is rock-solid; across several launches it differs by at most 0.5%.
Note that these runs must come in addition to an initial warm-up of 3-10 seconds; otherwise 10 * 0.1 = 1 second is again not enough to get the clock up to speed, and despite the minimum trick the time still jitters. Alternatively, do (well) over 10 runs. In our example, the very first reference implementation conveniently doubles as that warm-up. But comment out the Bench(Test1) line and the St. Vitus's dance starts all over again, throttling-style. Oops!
Moral #4: folks, warm up the CPU for at least 3 seconds, preferably 10, and on top of that take the best of at least N runs.
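Putting morals #1, #2 and #4 together, the timing harness might look roughly like this (a sketch reusing g_sink and RunPass from the sketches above; function names and constants are mine, not from the gist):

```cpp
#include <cstdio>
#include <ctime>

// warm up: busy-loop for the given number of seconds so the CPU clocks back up
void WarmUp(double seconds)
{
    volatile int spin = 0;
    clock_t until = clock() + (clock_t)(seconds * CLOCKS_PER_SEC);
    while (clock() < until)
        spin++;
}

// take the minimum time over nruns runs: the least jittery estimate
template <typename FN>
double BenchMin(FN fn, int nruns = 10)
{
    double best = 1e9;
    for (int r = 0; r < nruns; r++)
    {
        clock_t t0 = clock();
        fn();
        double sec = double(clock() - t0) / CLOCKS_PER_SEC;
        if (sec < best)
            best = sec;
    }
    return best;
}

// usage:
// WarmUp(3.0);
// double t = BenchMin([]{ g_sink = RunPass(); }); // moral #1: consume the result
// printf("%.3f sec\n", t);
```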
One can make a bunch of other benchmarking errors too, and the final code (by the way, here it is: gist.github.com/anonymous/5605201) probably still misses something, but today we've dissected these. Hopefully it wasn't entirely useless.
Stay vigilant, measure correctly, optimize mercilessly, all the way ;)