Problem with accuracy and overflow

I want to share with you a small algorithmic task that my colleagues in the workshop have recently planted for me. As it seems to me, the solution was quite elegant.

The essence of the problem was as follows. There is a certain device (microcontroller) that can handle only 32-bit integer values and does not know how to work with a floating point.
On such a device there is a timer that generates 32768 ticks per second. It is necessary to write a function that transforms ticks into milliseconds without loss of accuracy (preferably with rounding).

Note: All the examples in the article are made in Java, but this should not interfere with the understanding of the algorithms.

It is obvious that the solution "in the forehead" - m = (ticks * 1000) / (2^15) not suitable, as it can easily lead to overflow.
')

Eureka?

The first option that the staff talked about is stupidly dividing by 32, but here we have a large margin of error.
Another solution was supposed to be “smart” division with multiplication, in several stages with a series of checks, trying not to overwhelm 32 bits at every step.

The following chain of transformations occurred to me quickly enough (remembering that 1000 = 1024 - 24 ):

ticks * 1000 * 2^(-15) =
ticks * 2^(-5) - ticks * 24 * 2^(-15) = ticks * 2^(-5) - 3 * ticks * 2^(-12)

Total we have:

  private int getMillis(int ticks) { return (ticks >> 5) - 3*(ticks >> 12); }

Super! It is a pity that it does not work.

First approach

I thought and checked once again - there are no obvious errors in the reasoning. Checked on several test examples - such an algorithm considers, but not always correctly. In some cases there are deviations from the correct result by + -3. Apparently, there are no problems in mathematics (otherwise the answers would not be like the truth), but there is an error.

To understand what the problem is, it is worth remembering how the bit shift works. The operation of the bit shift to the right by n corresponds to dividing completely by 2 to the power n. This is an integer operation, so bits that are “right” of the zero bit are simply discarded.

If you look at the proposed algorithm, we have 2 shift operations, the multiplication of one of the results and the difference. Schematically, this operation is shown below.

It is obvious that the error can accumulate in the bits that are located to the right of the zero bit.

Initially, I had a suspicion of the last bit dropped (minus the first), perhaps it was him who was not enough in multiplication and difference operations. As a result, the function code has become somewhat more complicated:

  private int getMillis(int ticks) { int err = ((ticks & 16) >> 4) - 3 * ((ticks & 2048) >> 11); if (err < 0) { return (ticks >> 5) - 3 * (ticks >> 12) - (Math.abs(err) >> 1); } else { return (ticks >> 5) - 3 * (ticks >> 12) + (err >> 1); } }

In the err variable, we put the bit that was dropped during the shift operation, and the same operation of the form x - 3 * y performed on it separately. To do this, the 4th and 11th bits are allocated, respectively, shifted to the zero position, the operation is calculated and the result is again shifted to the right by 1 bit.
As a result, all that remains after the shift is the lost bits that we use to adjust the result.
The result improved: now the absolute error was no more than 2.

Decision

Now it became obvious that the direction of thought was correct, but not one bit to the right of the comma should participate, but all 5 intersecting ones (that is, they end up in the position from -1 to -5 in the diagram).

As a result, I received a final version of the function, the error of which was less than 0.5.
The general idea was the same as in the previous example, but the source of error is not allocated to 1 bit from each operand, but 5.

In addition, rounding is implemented, which is based on the idea that the i- th bit is half the i + 1- th bit.

Everything else should be clear from the code :)

  private int getMillis(int ticks) { int l = (ticks & 0x1F) << 7; int r = ticks & 0xFFF; // 2^12 - 1 int err = l - 3 * r; int fraction = (Math.abs(err) & 0xFFF); int fractionRound = fraction >> 11; //== 1 <=> there is 0.5xxx fraction = fraction & 0x7FF; // 0.5xxx => 0.0xxx if (err < 0) { fraction = fraction > 0 ? 1 : 0; return (ticks >> 5) - 3 * (ticks >> 12) - (Math.abs(err) >> 12) - (fractionRound & fraction); } else { return (ticks >> 5) - 3 * (ticks >> 12) + (err >> 12) + fractionRound; } }

Be good programmers and have a nice weekend!

Source: https://habr.com/ru/post/137599/

All Articles

Problem with accuracy and overflow

Eureka?

First approach

Decision

More articles: