On the issue of division

We have the opportunity to conduct a small but extremely interesting tactical doctrine.

In the process of researching a new MC from a well-known company based on the Cortex-M4 architecture (I’ll definitely write about this), the question arose how quickly the operation of integer division in hardware implementation can work. The natural experiment gave a somewhat unexpected result: dividing a 32-bit number into a 32-bit one is performed in 3 clock cycles of the processor frequency — well, you can figure out how fast. It turned out that this is the case only with certain operands, but further studies have shown that never the time it takes to perform a division exceeds 7 clock cycles. The results obtained caused a slight shock (“and this is not a certain figure of speech, which is unknown what it means, but a very specific verb” - Divs, as always, is incomparable).

Well, you can not just take and quickly divide such long numbers, strange as that, but the facts are stubborn things. I imagined a picture that tomorrow the President of the Russian Federation calls me to him and sets before me the task of making MK no worse than that of the ARM (I agree that the picture is delusional, but that does not happen in the world), and I look at him confused and understand that I cannot make such a division of such numbers in such a time, and I will not justify the expectations placed on me (well, in fact, I can always quietly buy a license from ARM, and pretend to have thought everything myself, many do, but from me, the GDP is waiting for something completely different, and even then - I can deceive him then, but I can hardly be myself).

And it became sad for me that in ARM guys are sitting much smarter than me, and I went with a long look in the eye to peep on the Internet, how they do it. On the ARM website, I did not find any information on the time of execution, one of the materials on STM32 indicated that the division takes from 2 to 7 cycles, which corresponds to observations, but there is no information on how this is done.

In general, the Almighty Inet didn’t help much, there are tricks with division by a constant, he wrote about them in one of the posts, but we have a different situation, there is Newton’s algorithm and its accelerated version, but this is clearly not the case, there is an algorithm based on Fourier transforms, but this is for very large numbers and is unlikely to be completed in 7 clocks even on the ARM architecture. I had to invent it myself and the solution was found, and it was so simple and obvious that it becomes somewhat embarrassing that this was not done immediately after the task was formulated.
')
Before looking at my solution, I suggest you find your own, and then compare it with mine, and if they are different then I am waiting for you in the comments.

So, how do we quickly (no more than 7 cycles) divide two 32-bit numbers with a 32-bit result.

To begin with, we recall how division is generally carried out in binary arithmetic in
classic form. The algorithm is quite simple and clear - we subtract the divisor from the dividend. If the result is non-negative (we divide the unsigned numbers), then the next digit of the result is made equal to one and the result is considered as the next dividend, otherwise the next bit of the result is 0. Before the next clock cycle, we reduce the divisor by two times (or shift it to the right, move the dividend to the left) and reduce the bit weight by 2 times (similar shifts). Thus, we get one bit of the result in one clock cycle and the whole operation will last 32 clock cycles. In this process, there is still an initial shift, but it does not affect the assessment of the situation as a whole. We will accelerate, but how?

Note that the resulting algorithm strongly resembles the operation of the ADC with a sequential approximation and recalls that there is another conversion method, a much faster - parallel conversion. What if…

We will subtract from the divider not only the dividend, but also the dividend * 2 and the dividend * 3 (at the same time, on three adders), then we get three bits (result signs) of information that take 4 different values, so 2 bits can be extracted from them result. Next, we extrapolate a similar approach for 3,4,5 bits of the result.
To get 5 bits of information in one clock cycle, we need 31 adders, each of which will have a Divisable-Divisor operation * n (1-31), pass the result signs through the encoder and get 5 bits of the result immediately. Then we shift the dividend by 5 digits to the left and repeat until ready. Then we need 32/5 = 6.4 => 7 cycles to complete the operation.

To work, we need 31 + x adders, it seems to be a lot, but we already have them, because we have a 32 * 32 multiplication operation per cycle, and to implement it without 32 adders we can't do (well, I think so ... ), so that we already have the necessary equipment, the only question is in constructing a control circuit and heaps of multiplexers for the implementation of the fast shift, but this is completely solved.

So, the task to divide in 7 cycles has been solved, the question remains - how can this time be reduced, because in the investigated MC it is less than 7. The promising solution is to determine the number of the most significant digit of the dividend (H) and divisor (W) and immediately it becomes clear how many high bits of the quotient are zero, so that we can skip the first or several phases of the algorithm. For example, if H <W, then the result is immediately zero and we complete the operation, you can certainly derive a formula for the number of measures, but I was already bored.

Interestingly, the operation udiv gives only a partial, although the remainder is clearly somewhere inside remains. In principle, it is not difficult to get it in two cycles, which was done in the analyzed fragment of machine code by executing the pseudo-code Divisible-Partial * Divisor, but this is on any 2 cycles, why would not it be issued immediately in the register pair - I do not know the answer to this question.

In general, you will meet the GDP, tell him that we will definitely do the division unit at MK if it is still interesting to him.

PS: By the way, when I was searching for KDPV (as you noticed, I did not find it), I noticed one with the frankly incorrect inscription “It is impossible to divide by zero”. I must say with all certainty that it is possible to divide by zero, it is impossible to divide. But seriously, in different architectures they divide by zero differently, in x86 we get an exception (about this unforgettable error 200), in some we get a dividend or zero, but I have never seen the maximum integer. In ARM n / 0 = 0/0 and it turns out 0.

Source: https://habr.com/ru/post/444550/

All Articles

On the issue of division

We have the opportunity to conduct a small but extremely interesting tactical doctrine.

More articles: