On the question of Bezier curves, speed Arduino and one interesting site, or how I spent the weekend

“Anyone can solve the Gray paradox with dolphins, and you try to do it without dolphins. "

Actually, I planned to spend the weekend a little differently, to go to the Copter Huck (not that I was a fan of the copters, just to see what young people think of, hang out like), but the older sister was totally against it. Of course, I insisted (that is, he snorted a couple of times and said, “Well, maybe, all the same ... it will be cool”), but she was inexorable, and when the spouse took her side, there were no chances for a trip. Well, okay, “I didn’t really want to,” but then I sat for a bit on the amusing programming problem that I thought of myself, which I report on.

(Necessary note - it meant the previous weekend, like this always - writing a program requires a couple of hours, writing a report on it and for five days of travel in public transport is not completed.)
')
In one recent post, the author considered the problem of acceleration (among other things) of the calculation of Bezier curves (CBs) on MCs with relatively weak parameters. Well, in fact, these parameters are at the level of the average mainframe of the 70s, but today they are considered to be clearly insufficient. As a result of certain actions, the author managed to speed up the calculations somewhat, in my opinion, clearly not enough, so I decided to write how to do this in the first approximation. I know very well the universal recipe for solving problems with speed - to take the MK with a higher frequency or switch to another family, but I come from the times when we learned to manage with what we have, simply because there was nothing else, from the word at all. For the time being, the approach is outdated, but it seemed to me that modern Habr's readers would not be without interest.

We formulate the task - we want as quickly as possible to calculate the coordinates of points on the Bezier curve given by the extreme points A and B and the imaginary focus C. The formula for calculating the point P on the curve is given by the expression

(1) P = T * T * B + 2 * T * (1 - T) * C + (1 - T) * (1 - T) * A

$(1) P = T * T * B + 2 * T * (1-T) * C + (1-T) * (1-T) * A$

where T varies from 0 to 1 inclusive. (Wiki writes that this formula was secret at one time, as strange as that, but anything is possible). It is clear that we will not count in a complex form; instead, we will look for the X and Y coordinates separately. We estimate the complexity of the calculation using this formula, simply by counting the number of characters of arithmetic operations in this expression - 7 multiplications and 5 additions (=> 7 * 5 + ). It is possible that a good compiler (and now all compilers are good and perfectly optimized, if you don’t explicitly prohibit it) will reduce costs to 7 * 3 +, although it would be better to help him by calculating (1-T) in advance. Generally speaking, a good compiler can do wonders if all values in a formula are represented by constants, but we assume all values are statically undefined.

Part One, Mathematical

We begin the optimization process, for which we open the parentheses and group the terms at T (maybe someday the compiler will be able to do this for us, but so far this part of the work rests on natural intelligence), getting

(2) P = T * T * (B + A - 2 * C) + T * 2 * (C - A) + A

$(2) P = T * T * (B + A-2 * C) + T * 2 * (C-A) + A$

=> 5 * 5 +, which is clearly better than the original value of 7 * 5 +, but with respect to the improved 7 * 3 +, you should still think.

If we take the execution time of the addition operation per unit, then the execution time of the multiplication will be exactly not less than one, as a rule, longer, but how much it depends on the implementation of the MC (first wrote - on the architecture, but this is not quite true). When there is no hardware multiplier on the crystal, the execution time of the multiplication will be tens (30+) times greater than one, and when it is present, it will be several times (1-6). Therefore, it is safe to believe that the replacement of multiplication by addition gives a gain (and often significant) in execution time almost always. Well, we immediately note that the transition from fixed-point numbers to a floating point (let us leave aside the proof of this fact) leads to an increase in the execution time by 20+ times for addition (alignment is very much affected here), but only to an insignificant increase for multiplication . Therefore, for floating-point numbers, the times of addition and multiplication differ little, especially in relative terms (one can expect 2 times maximum), but they still differ and are not in favor of multiplication, so the gain is also here.

Returning to the previous paragraph, we find that for PT, a 5 * 5 + score should not have a significant advantage over 7 * 3 +, but we still have reserves. Let's pay attention to the fact that we have to calculate the set of values of points on the Bezier curve, when the parameter T changes, and all other parameters of the curve are fixed (but not constant, but sorry), then the rest of the formula can be calculated in advance and we get

(3) P = T * T * A 1 + T * B 1 + A

$(3) P = T * T * A1 + T * B1 + A$

=> 3 * 2 +, where

A 1 = A + B - 2 * C

$A1 = A + B-2 * C$ and

B 1 = 2 * (C - A)

$B1 = 2 * (C-A)$ already not bad, but if you remember Horner’s scheme and write

(4) P = T * (T * A 1 + B 1) + A

$(4) P = T * (T * A1 + B1) + A$

=> 2 * 2 +, then compared with the decision "in the forehead" we have to win more than 2 times, almost 3, and these optimizations are completely obvious.

Let's test theory with practice (although this is completely unnecessary, we are confident in our assessments, but suddenly I underestimated the compiler), for which we need to measure the real time of execution of different variants on a real hardware. Well, it just so happens that I have many different debugging motherboards for MK from various companies at home (including rare books like Luminary Micro or Intel Edisson debugs, try to buy this today), but there is not a single Arduino board (“Well we do not have pineapples "). It would seem to be a dead end, but there are options - a very interesting site tinkercad.com comes to our rescue, on which you can build your scheme on a breadboard using the Arduino module, write a sketch and execute it immediately. At the same time, you can set breakpoints, execute the program step by step, and even (an unprecedented thing for a real Arduino) view the values of variables at the moments of stopping.

We turn to this site and begin to measure. To begin with, let's check our assumptions about the execution time of operations and, after clearing away the attendant circumstances, we obtain the following data for integers:

8 + 8 => 8 - 1 clock, 16 + 16 => 16 - 2,
8 * 8 => 16 - 2, 16 * 16 => 16 - 14 (the only thing that was unexpected, I thought to get 4 * 2 + 4 * 2 = 16, there are interesting optimizations),
8/8 => 8 - 230, 16/16 => 16 - 230.

Pay attention to the last two digits, of which it is clear that the division operation refers to prohibited ones, if we really want to count quickly. Now (finally) we measure the time for performing operations on numbers of PTs with a 24-bit mantissa
a + b - 126 (and highly dependent on operands), a * b - 140, a / b - 482.
The obtained data correlate well with our theoretical assumptions, it is clear that the hardware implementation on board this MC: for multiplication is, for division is not, for PT operations is not.

Now we begin to measure the time of the full calculation. We are given the values A = 140, B = 120, C = 70, and we build 170 evenly distributed points in KB. Why exactly these values - they were given in the specified post when evaluating performance. Below are the algorithms and the corresponding test execution time.

Formula (1) => 20ms or 1900 cycles per counting
Formula (1) => 18 ms or 1660 cycles per count (we consider 1-T separately)
Formula (2) => 16ms or 1540 cycles per counting
Formula (3) => 10ms or 923 cycles per count
Formula (4) => 8ms or 762 clocks per count

It can be seen that the resulting reduction in execution time (from 20 ms to 8 ms) correlates well with the expected one and we managed to speed up the calculations by more than 2 times. We note that in addition to completely obvious considerations and mathematics that does not go beyond the course of secondary school, we did not need to.

And now let's talk about what we should do if the result obtained is not enough, and we have already squeezed everything out of the calculation formulas. They wrote to me here (in the comments to another post) that in general any task can be reduced to calculations with FT and, despite the obvious controversy of the assumption (try to do it to solve the Navier-Stokes equations numerically), in this particular case this recommendation applies , although, as always, there are nuances.

Part Two, Computing

Once the modifications of the algorithm have been exhausted, only the data structures remain and we enter the ground of fixed-point numbers. Here we are waiting for a lot of pitfalls, which we did not think about for PT - the range and accuracy (in general, and for PT, these questions should be thought about, but here everything is simpler, much has been done for us). It is necessary to conduct a small study of the problem to determine the necessary representation of FT (selected in the post mentioned 9.7, judging by the results, is clearly insufficient), but I suggest to go a little different way. By the way, if we take not 170 steps in the interval, but 128 (I see no reason prohibiting this step), then this presentation would be fine for us. If we take into account the fact that the constants defining the KB are given by integers, and the only parameter T can be represented as a fraction of the form and / and the result will be used for drawing on the screen, that is, translated into integer coordinates, then we can just do everything in integers that are processed much faster.

We use only the last formula and rewrite it in the new notation.

(5) P = a n d / A n d * (a n d / A n d * A 1 + B 1) + A

$(5) P = and / And * (and / And * A1 + B1) + A$

(=> 2 * 2 + 2 /), where A1 and B1 are calculated in the same way as for PT. Obviously, all integers and corresponding operations should be performed much faster. In order not to lose accuracy in the operation of dividing integers (2/3 = 1! = 1.5) and carry out the division at the very last moment, we slightly transform the formula to the form

(6) P = ((a n d * A 1 + B 1 * I) / I * a n d + A * I) / I

$(6) P = ((and * A1 + B1 * I) / I * and + A * I) / I$

(=> 4 * 2 + 2 /). All the numbers of FT, so we implement this algorithm and get ... so here you are, grandmother, and St. George's day ... 1869 cycles, but this is much worse than for PT, we started with this, some garbage, because whole numbers are much faster.

We begin the debriefing and it turns out that it is not enough just to change the type of variables. First, we must use numbers not 8 or even 16, but 32 bits, otherwise overflow will occur, but long numbers, although faster than PT, but not enough to compensate for the flaws of the algorithm. Second, these flaws - y we again had constants calculated on each clock cycle - we remove them by preliminary calculation B2 = B1 * I, A2 = A * I * I. Then we get

(7) P = ((a n d * A 1 + B 2) * a n d + A 2) / A N D / A N D

$(7) P = ((and * A1 + B2) * and + A2) / AND / AND$

(=> 2 * 2 + 2 /) with the result of 1684 - better than the previous one, but still it was not for this that we left PT.

We exclude the calculation of another constant I2 = And * And we get

(8) P = ((a n d * A 1 + B 2) * a n d + A 2) / I 2

$(8) P = ((and * A1 + B2) * and + A2) / I2$

(=> 2 * 2 + 1 /), with the execution time of 956 cycles - and this is interesting, the exclusion of one operation led to a significant increase in productivity.

That's what slows us down - division, since this is a very time-consuming operation, but we have an interesting trick to deal with it. To calculate the expression 1 / And we can carry out elementary transformations 1 / = 1 / * ( / ) = 1 * ( / ) / . If we choose a power of two as H, then division by H can be replaced by shifts, and if the exponent is a multiple of 8, then even shifts are not needed. And the value of H / I will have to be calculated honestly, but only once, after which only multiplication remains in the calculation cycle.

We draw attention to the fact that we did an incompletely correct transformation and replaced H / AND with its rounded value K in order to proceed to operations with integers only. Invalidity is a loss of accuracy and additional research should be conducted to prove the applicability of this approach to our case. We write H / I as (K * I + d) / I = K + (d / I), where d is less than I. Then the absolute error in going to H / I k will be d / I, and the relative error - d / And / (K + d / I)> = d / I / (K + 1) ~ d / I / K, provided that K >> 1 (this is not a shift). It follows that the value of H should be chosen as large as possible, since the absolute error of the calculation is A * d / I / K> = A * 1 / H / I. If we want the error to be no more than one, we must withstand the condition A / K <= 1, then K> = A, convert K * I> = A * I, which means H> = A * I, then we don’t losing exactly. For our case A <= 256 and AND <= 256, we get H> = 2 ** 16, which is quite acceptable. Obviously, in the above formulas should use the modules of the original numbers.

We note for the future that if we will not round down, but towards the nearest integer, then the requirements are somewhat reduced and H should be two times smaller, although there are nuances here.

In any case, we can ensure the required accuracy and obtain the following algorithm: H = 2 ** 16; K = [H / I] (And <256); 0 <= and <= AND;

(9) P = ((((a n d * A 1 + B 2) * a n d + A 2) * K) >> 16) * K) >> $ 1

$(9) P = ((((and * A1 + B2) * and + A2) * K) >> 16) * K) >> $ 1$

(=> 4 * 2 + 2 >> 16) where all operations are performed on long integers. We implement this algorithm and get 583 clocks ... but this is already close to the ideal, but not yet an ideal.

Then there are minor settings for a specific MK - working with global variables is faster. than with local ones, but even faster with register local ones, which leads to a decrease in time to 506 cycles.

Further, we note that the last multiplication before the shift can be carried out with 16-bit numbers, which will give 504 - a trifle, but nice.

In total, we accelerated the calculations compared with the implementation of the "head on" in 1900/504 - more than 3 times, and not lost exactly from the word at all. I call this result a time optimization, and not 20% obtained in the original post.

Is it possible to achieve even better indicators - it is possible, but this is the topic of the next post.

Source: https://habr.com/ru/post/428113/

All Articles

On the question of Bezier curves, speed Arduino and one interesting site, or how I spent the weekend

“Anyone can solve the Gray paradox with dolphins, and you try to do it without dolphins. "

Part One, Mathematical

Part Two, Computing

More articles: