I hope that everyone who wanted to has already read about my trial experiment on Habr in this article. Now, I think, it is only right to announce its results and to explain in more detail why such experiments are run at all. The post will be half philosophical, because everything in the computer world is now so complicated that without philosophical reflection it is simply impossible to make any sensible decision. I will try to lay out my opinion about spherical measurements in a vacuum, so there will be a lot of text. The article contained a poll that ran until May 1, 2016. Everything under the cut is pure IMHO.
First of all, please note that I called the experiment a “trial” one, and that is no accident. I have conceived a more detailed study of how different algorithms behave, and for that I need the help of different people (at least indirect help: run something somewhere and report the result). For future complex experiments to go smoothly, I need to practice and to check whether it is fundamentally possible to do research this way at all. The community likes to kick you for mistakes in research methodology and organization, and constructive pointers to such mistakes were exactly what I needed. Fortunately, there were some.
Secondly, I apologize for forgetting about the need for a chaotic data stream; because of this, some people had to retest everything, and there was confusion about which version each participant had checked. In my article about counting set bits I remembered the importance of a random stream of input data very well, but here I somehow forgot. It happens. As I said above, this is the first and most important reason for me to run a trial experiment: to make fewer mistakes in serious research and to take possible constructive advice into account in advance.
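For the curious, here is a minimal sketch of how such a chaotic stream can be prepared (the names and the generator choice are mine, not from the original harness): pre-fill a buffer with pseudo-random values before timing starts, so that the branch predictor cannot learn the outcome of the tested comparisons and the generation cost is not measured.

#include <cstdint>
#include <vector>

// Marsaglia's xorshift64: cheap and unpredictable enough for this purpose.
static std::uint64_t xorshift64(std::uint64_t& s) {
    s ^= s << 13; s ^= s >> 7; s ^= s << 17;
    return s;
}

// Fill the buffer once, before the timed loop runs.
std::vector<std::int32_t> make_chaotic_input(std::size_t n,
                                             std::uint64_t seed = 88172645463325252ULL) {
    std::vector<std::int32_t> v(n);
    for (auto& x : v) x = static_cast<std::int32_t>(xorshift64(seed));
    return v;
}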
Now let us talk about the research methodology itself and about whether there is any need at all to test functions for speed. This is where the philosophy begins.
Explanation of the word "branching"
Another mistake I made was not explaining what I meant by branching. I am usually not interested in academic terms, because they often bear no relation to reality, so I have to give many words a meaning of my own.
In that article I meant precisely an algorithmic (or logical) branch, the one that inevitably appears in program logic when a logical expression is evaluated and subsequent actions depend on its result. So when we see a condition in program code, then by logic (human logic, that is) it is a branch. Why put it this way? The thing is that branching is hard to pin down formally: the logic of the algorithm is first translated into a programming language, then the compiler translates that into machine code, and then the processor translates its own instructions into internal, say, RISC-like micro-operations. At one stage there is a branch, at another there is none... and then it may appear again.
That is, you may think there is no branching in your C program when in fact there is, or the other way around: you wrote code with branching (in its logic), and the compiler figured out how to do without branching in the machine code. An example:
u64 a, c; u32 b; c = a / b;
No branch here, right? Actually, there is one (at least with the VC++ 2015 compiler). The point is that in so-called “2/1” division (a double-width value divided by a single-width one) the quotient can turn out either single- or double-width. In the first case the division is done with one div instruction, in the second with two. So to know which path to take, the size of the quotient must be determined first, and that choice is a branch before the division itself. There may also be, say, a check for division by zero (though usually there is none). In short, division does not boil down to a single div; it is a rather complicated procedure, while a programmer looking at this line may think the code is linear.
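For illustration, here is a rough sketch of that hidden branch. This is not actual compiler output; hw_div_64_32 is my hypothetical stand-in for the hardware div instruction, which on x86 divides a 64-bit dividend by a 32-bit divisor and faults if the quotient does not fit into 32 bits.

#include <cstdint>

// Hypothetical stand-in for the hardware div; assumes the quotient fits in 32 bits.
static std::uint32_t hw_div_64_32(std::uint64_t n, std::uint32_t d) {
    return static_cast<std::uint32_t>(n / d);
}

std::uint64_t div_64_by_32(std::uint64_t a, std::uint32_t b) {
    std::uint32_t hi = static_cast<std::uint32_t>(a >> 32);
    std::uint32_t lo = static_cast<std::uint32_t>(a);
    if (hi < b)                                 // the hidden branch:
        return hw_div_64_32(a, b);              // quotient fits, one div is enough
    std::uint64_t q_hi = hi / b;                // otherwise, schoolbook long division:
    std::uint64_t r    = hi % b;                // first div gives the high word,
    std::uint64_t q_lo = hw_div_64_32((r << 32) | lo, b);  // second div the low word
    return (q_hi << 32) | q_lo;
}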
A second example: a maximum function may look like this:
i32 max1(i32 a, i32 b) { return a ^ ((a ^ b) & -(a < b)); }
On the x86 architecture there are obviously no branches here, and probably there will be none on many other architectures either, even though there is an explicit logical comparison of two numbers. That is, logically there is a branch, but there is none in the code. Now replace i32 with i64 and compile the program in x86 (32-bit) mode: branches appear immediately. But in this code:
i32 maxi1(i32 a, i32 b) { i32 d = a - b; return b + (d & (~(d ^ ((a ^ b) & (d ^ a))) >> SHIFT)); }
there are no branches in the logic, and there will be none in the program for any fixed-size variable the compiler supports (speaking of x86, compiled correctly; SHIFT is the width of the type minus one, i.e. 31 for i32). That is why I call it the only method of finding the maximum known to me that contains no branching at all.
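To keep myself honest, here is a minimal self-check of both formulas (my own addition; it assumes two's-complement wrap-around and an arithmetic right shift, which the formula itself already relies on):

#include <cassert>
#include <cstdint>

using i32 = std::int32_t;
const int SHIFT = 31;  // width of i32 minus one, i.e. its sign bit position

i32 max1(i32 a, i32 b)  { return a ^ ((a ^ b) & -(a < b)); }

i32 maxi1(i32 a, i32 b) {
    i32 d = a - b;  // the sign of d alone is wrong when a - b overflows;
    // (a ^ b) & (d ^ a) repairs the sign bit for exactly that case
    return b + (d & (~(d ^ ((a ^ b) & (d ^ a))) >> SHIFT));
}

int main() {
    assert(max1(3, 7) == 7 && max1(-5, 2) == 2);
    assert(maxi1(3, 7) == 7 && maxi1(-5, 2) == 2);
    assert(maxi1(INT32_MIN, INT32_MAX) == INT32_MAX);  // a - b overflows here
    assert(maxi1(INT32_MAX, INT32_MIN) == INT32_MAX);
}

Note the INT32_MIN/INT32_MAX pair: that is exactly the case where the naive sign of a - b lies, while max1 with its (a < b) comparison never had that problem to begin with.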
Why am I telling you all this?
The main problem of programmers who have heard a little about the classical methods of program optimization but have not understood them deeply is the urge to get rid of branches wherever possible. First, it is cool and beautiful; second, you can shine in front of your colleagues; third, the touch of magic is fascinating, it somehow raises self-esteem, and suddenly you feel you are not living in vain. But an understanding of how fast or slow the new code will actually run very rarely makes that list of reasons. By removing a branch from the logic, a programmer may, first, generate one in the code anyway, and second, slow the code down because the new algorithm is structurally more complex.
I wanted to show that a simple experiment is available to everyone that shows when it is worth getting rid of branches in the logic and when it is not. We take different implementations and compare them with each other, at least in spherical-in-a-vacuum tests... and this does make sense, which is what I am philosophizing about here.
Experiment Methodology
Of course, everyone who has seriously worked on program optimization has noticed how artificial such time measurements are. Look: we take a function, create ideal conditions for it, and test it under those ideal conditions. In my case the ideal condition was the simplest possible loop. From the time of that loop with the function call inside, and the time of an empty loop, we try to learn something about the efficiency of the function itself... and it may have seemed to some that this is exactly what I was doing. In fact it is not so simple, because one cannot draw conclusions that way.
In reality your function will be surrounded by a different set of instructions, the loop it runs in may be longer and more complicated, and the function's running time may turn out completely different. In a sense, benchmarks like mine resemble the window-dressing of a store that a high official has decided to visit: prices suddenly drop tenfold and staff salaries rise tenfold... but only until the official leaves the store. So why do this at all if the result does not reflect reality? Because if you understand the phrase “reflects reality” correctly, everything falls into place.
The thing is, I do not draw conclusions about the speed of a function in isolation. I conclude which of the functions will be faster and which slower in these greenhouse conditions. Moreover, I am only interested in a big gap in speed, because a small gap is often swamped by the complexity of the other parts of a program, whereas a big gap guarantees (at least in my experience) that the difference will be roughly the same under real conditions. It does not matter that the function itself may run 100 times longer than under ideal conditions; it will still beat the slower alternatives by about the same margin I get in the greenhouse. Honestly, I do not know whether this holds for ordinary everyday tasks, but in scientific computing, where millions of machine-hours are at stake, it can be taken as an axiom. In my practice it has always been this way and no other. Now let us try to make philosophical sense of the general value of such research.
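To make the greenhouse concrete, here is a minimal sketch of such a measurement (the names and structure are my assumption, not the exact harness from the previous article): time the loop with the call over chaotic data, time an identical loop without the payload, and compare the differences between implementations rather than the absolute numbers.

#include <chrono>
#include <cstdint>
#include <vector>

// Times f over pre-generated chaotic data. The volatile sink stops the
// optimizer from deleting the loop body; subtract an empty-loop run of the
// same shape to estimate the cost of f itself.
template <class F>
double time_loop(const std::vector<std::int32_t>& a,
                 const std::vector<std::int32_t>& b, F f) {
    volatile std::int32_t sink = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < a.size(); ++i)
        sink = f(a[i], b[i]);
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double>(t1 - t0).count();
}

The verdict is then never “maxi1 takes X nanoseconds”, only “maxi1 is N times faster than its rival in this greenhouse”.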
In our world a significant share of experiments (even social ones) suffers from the drawback of being artificial, and yet from their results people can still quite reliably predict real situations. Take a long-distance race. Several people put on running shorts, T-shirts and (sometimes spiked) running shoes, do a special warm-up, go to the start, and run, say, 5 km along a perfectly smooth, soft track. These are ideal conditions, under which a master of sports covers the distance in 14 minutes. Put felt boots and a fur coat on that master and he will run the same distance more slowly, especially through mud and puddles, say in 18-20 minutes... but still faster than an ordinary untrained person would run it even under ideal conditions; on equal terms the ordinary person has no chance at all. Or take other everyday conditions: you need to catch a departing bus (or make the green light at a pedestrian crossing). A simple question: who has the better chance of catching it from a sufficiently large distance, the long-distance runner or the average person? Clearly the former, and his chances hardly change with the clothing or many other circumstances. Yet we make this (very reliable) assumption only because we once saw masters run 5 km in 14 minutes. We simply make a prediction based on knowledge obtained under ideal conditions, and with high probability these predictions come true under real ones.
In the programming world, of course, everything is much more complicated. For example, why did I decide that the conditions described above are ideal? I called them that, but it may turn out that in a real program the compiler finds a way to “smear” my function across the surrounding code so that its running time becomes effectively zero (it gets computed in parallel with some heavy neighboring operations). Yes, that can happen, and in hard optimization work an experienced programmer will try to reshape the algorithm so as to balance the instruction mix for maximum parallel execution, and the processor will then reorder instructions on the fly even better. But that is a different conversation, because such optimizations are not done separately for small individual functions like sign or abs. It works differently: you take a narrow section of code and look at it as a whole, asking whether anything can be done with it (perhaps rewriting it entirely) so that its logical meaning stays the same but its cost goes down. Our situation is different: we want to find out how fast this or that implementation of a separate small function will be, assuming this small function is heavily used in some process, but not so heavily that it is worth optimizing it jointly with other parts of the program.
This is how ordinary efficient programs are often written: when the algorithmic efficiency satisfies the programmer, he gains additional practical efficiency by using well-optimized individual functions, but he does not dig deeper, because that can be expensive, and it is easier to double the number of cores than to spend a lot of time achieving the same speedup on the existing ones. And those well-optimized functions are written... attention... under ideal conditions! Take the MPIR library for long arithmetic: look at its code and you will find many implementations of low-level functions tuned for different processors, and MPIR will beat your home-made long arithmetic both on tests like mine (spherical in a vacuum) and under real conditions where the data is not very predictable (clearly, MPIR is easy to beat when you know some serious properties of the incoming numbers in advance, but I am talking about the case when you do not). There are plenty of such examples in the scientific world. The factor function in Maple will crush your home-made polynomial factorization, both under ideal conditions, when you measure running time by repeating random tests one after another, and in real programs where factorization takes up a tangible share of your computation (for example, in some work with rational fractions). Of course, I admit that one can compete with Maple's factor, but very few people can, and we are talking about ordinary users who want to write a reasonably good program and find it difficult to choose one or another implementation of a heavily used function.
What I want to say is this: I do not know how things are in the ordinary IT world, but in scientific computing there is a clear correlation between benchmarks like mine (spherical ones) and the actual behavior of the tested functions inside complex calculations, when those functions contribute tangibly to the total cost of the program. That is, if some function f beat function g tenfold in spherical tests, the situation in the real program will be roughly the same. I have confirmed this repeatedly with a profiler.
Let me explain one more thing: in real problems I have never seen a need to optimize the functions min, max, sign and abs. They usually sit inside a group of much heavier computations and are therefore completely invisible in the profiler's results table. Yet I often meet programmers who consider it their duty to mangle code on the basis of intuitive assumptions about optimization, while the bottleneck of their program lies somewhere else entirely. Don't do that.
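For completeness, here is one common way to let a profiler point at the real bottleneck (Linux perf shown as an example; any profiler will do):

g++ -O2 -g program.cpp -o program    # keep symbols so the report is readable
perf record ./program                # sample where the time actually goes
perf report                          # min/max/sign/abs rarely even show up here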
Nevertheless, my experiment with these functions still made sense to me, despite its artificiality and apparent detachment from any real need. I explain why below.
Objectives of the experiment
Let me remind you that the main goal was to get feedback from the community in the form of comments, recommendations and behavior in general. By analyzing all of this I draw conclusions and can now make similar experiments more interesting and useful. Here I was simply practicing, and I thank everyone who took part in whatever way they could.
The second goal was to check the quality of my way of measuring time. Given that some participants got negative times, it is safe to say the measurement procedure needs rethinking. That is a good thing: a similar failure in a serious experiment would have been a much harder blow for me.
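A plausible remedy (my assumption; the original harness did not do this): negative values naturally arise when two noisy one-shot timings are subtracted, so repeat each measurement and subtract the minima, which are far more stable. Building on the time_loop sketch above:

#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Best-of-N timing: the minimum over repetitions is far less noisy than a
// single run, so subtracting two such minima is unlikely to go negative.
// time_loop and maxi1 are those from the earlier sketches.
double best_time(const std::vector<std::int32_t>& xs,
                 const std::vector<std::int32_t>& ys) {
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < 10; ++r)
        best = std::min(best, time_loop(xs, ys, maxi1));
    return best;
}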
The third goal was to test the ability of potential participants on Habr to read the situation correctly and to act even under deviations from the plan. This ability satisfied me completely, although I was a little disappointed that some participants reported results for a program compiled without any optimization flags... I do not know what the point of such measurements is; in scientific computing, at least, they can only be useful for debugging. It is partly my fault for not explaining this, although I should have realized that since all programmers are very different, the conditions must be formulated as precisely as possible. Trial experiments teach exactly this kind of understanding, which is also useful.
The fourth goal was to look at the ratio of the functions' running times on different processors and with different compilers. Here one thing surprised me: for some users the minu0 function turned out to be several times slower than the other seven minimum and maximum functions. Here are a few examples (specifically with chaotic input data):
Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz
GCC 4.9.3-r3
Options: -std=gnu++11 -O3
mini: 1.22 vs 2.59
maxi: 1.19 vs 2.71
minu: 13.64 vs 3.01
maxu: 1.21 vs 2.54
Intel Core i7-6700K CPU @ 4.00GHz
GCC 5.2.1
Options: -O3 -std=gnu++11
mini: 0.49 vs 0.83
maxi: 0.48 vs 0.82
minu: 10.20 vs 0.74
maxu: 0.49 vs 0.91
Intel(R) Core(TM)2 Duo CPU E7300 @ 2.66GHz
GCC 4.8.4
Options: g++ -std=gnu++11 -O3
sign: 12.95 vs 2.56
abs: 12.74 vs 0.91
mini: 2.31 vs 3.07
maxi: 2.19 vs 3.19
minu: 15.79 vs 3.54
maxu: 2.08 vs 3.77
Raspberry Pi 3, SoC: Broadcom BCM2837, CPU: ARM Cortex-A53 @ 1.2GHz
gcc version 4.9.2 (Raspbian 4.9.2-10)
options: -std=gnu++11 -O3
mini: 10.74 vs 17.93
maxi: 10.74 vs 14.33
minu: 24.63 vs 7.16
maxu: 10.74 vs 7.16
And there are many more such examples. GCC stumbles in every one of them, whereas Clang did everything right:
Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz:
Clang 3.7 with Microsoft CodeGen (v140_clang_3_7):
Full optimization (-O3)
mini: 0.41 vs 2.61
maxi: 0.35 vs 9.28
minu: 0.69 vs 2.96
maxu: 0.44 vs 8.83
There are many more examples of Clang's output in the comments. I will not quote them here, but VC++ 2015 also did everything right. So these examples should be useful to the GCC developers for debugging the optimizer: my experiment exposed a compiler flaw that could later surface somewhere in a serious program.
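Anyone who wants to verify this does not even need my harness; it is enough to look at the assembly the compiler emits, for example:

g++ -std=gnu++11 -O3 -S -masm=intel minu.cpp -o minu.s

A healthy result for an unsigned minimum on x86 is a short branch-free sequence around a conditional move (a cmp followed by cmovb); a conditional jump or a long bit-twiddling chain in its place is precisely the regression shown above. (The file name minu.cpp here is, of course, just a placeholder.)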
Some other results also deserve attention. For example, in some cases the scary branch-free formula finds the minimum or maximum faster than the formula with a branch. Here is a snippet:
Raspberry Pi 3, SoC: Broadcom BCM2837, CPU: ARM Cortex-A53 @ 1.2GHz
gcc version 4.9.2 (Raspbian 4.9.2-10)
options: -std=gnu++11 -O3
minu: 24.63 vs 7.16
maxu: 10.74 vs 7.16
The first line is clear: it is the GCC flaw I wrote about above. The second, however, is a genuine defeat of the branching method.
The fifth goal was indirectly stated above: to show ordinary programmers that one should not fanatically cling to any dogma. In each particular case your variant may turn out better, and you need to test in order to choose, rather than rely on blind conviction rooted in optimizing for a Pentium 4 back in those distant university days.
The sixth goal, the least significant one, was to share my materials (links [4-7] in the previous article) in order to get feedback over time, to understand whether I am moving in the right direction when I write such articles, to sketch out a plan for future work, and to find like-minded people. I have not received this feedback yet, but everything in its own time.
The seventh goal, an indirect one, was to get a reason to write this post, share my thoughts in it, and hold a vote.
So I propose a vote. If 80% of voters think it makes sense to continue similar experiments on Habr (done more carefully, of course), I will continue and, step by step, work through many different algorithms. The benefit to the community is educational: in the course of the experiments I always explain, or point to an explanation of, the things being tested. Another benefit is the chance to check everything on your own computer. The benefit to me is criticism, corrections, hints and advice. If the 80% is not reached, I will gather an audience for such experiments myself; it just takes more time, and I will not get in the way here.
Let's vote, and see you next time :)
P.S. A clarification for the poll: a “more serious” experiment does not mean a more complex algorithm (the algorithms will vary, from sign(x) to Schönhage-Strassen); rather, I will try to improve the quality of the experiments and the potential value of their results.
UPD: The vote ended in favor of conducting more serious experiments, so I will prepare them. Thank you all for your support :)