Introduction, or why this topic
Reading Habrahabr, I came across two topics about "floating" floating-point calculations. One of them gives a fairly detailed and well-made digest of the IEEE 754 standard and of the main problems of floating-point calculations; the other is a short note pointing out that not everything is as good with calculations on a PC as one might expect. That note recommends, when the mathematical accuracy of the result matters, using integer calculations, "fixing the comma", or at least checking the results produced by the platform (compiler + processor).
Although the advice is practical, it is not easy to understand how to use integer calculations where a floating point was used before, especially without mathematical preparation. In this sense, the attempt of one of the habrovchane to figure out fixed point by experiment is quite entertaining.
This topic is a brief introduction that should give an idea of fixed-point calculations. The mathematics in this article should not scare anyone: everything here is very basic. Let me apologize in advance: among my colleagues the well-established term is precisely "fixed point" (from the English fixed-point), not "fixed comma", so I will stick to that term.
Once again about the mantissa and the exponent
In computational mathematics, fractional values are represented as a pair of integers (n, e): the mantissa and the exponent (in Russian the more correct term is «порядок», but for brevity and out of habit I will keep using the word "exponent"). The pair represents a fractional number of the form n * 2^-e.
The exponent can be thought of as the position of the point separating the fractional part of the number, i.e. the number of fractional binary digits.
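For example, 0.75 = 3 * 2^-2 corresponds to the pair (3, 2): with an exponent of 2, the two least significant binary digits of the mantissa (11) hold the fractional part.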
If the exponent is a variable stored in a register and unknown at compile time, (n, e) is called a floating-point number. If the exponent is known in advance, (n, e) is called a fixed-point number. Fixed-point numbers can be kept in ordinary integer variables (registers) by storing only the mantissa. The exponent is usually denoted by the letter q. So, on finding a comment on a variable in the spirit of "q15 multiplier", we should treat that variable as a fixed-point number with an exponent of 15. I will return to the question of notation, as found in various source codes and articles, below.
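For example, a q15 value might be stored and checked like this (a minimal sketch; the variable names and value are illustrative, and <stdint.h> is assumed):
int16_t mult_q15 = 0x4000;              /* the exponent 15 is kept "in mind": 0x4000 * 2^-15 = 0.5 */
double  check    = mult_q15 / 32768.0;  /* divide by 2^15 to get back a real number for verification */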
Calculations
So, we have figured out that when working with a fixed point the exponent is not stored anywhere and is kept "in mind".
How are the calculations done? Computational arithmetic is an entire science with its own formulas, axioms and theorems, and giving an introduction to that science is not the purpose of this article. The approaches below are aimed primarily at programmers solving engineering and applied problems, i.e. problems where the ranges of allowable values and the required accuracy of the calculations are known and limited.
Another limitation of the article is that algorithms for trigonometric and other complex operations are not given here; a full review of them in one article is unrealistic (and hardly necessary). The article provides the basis needed to understand such algorithms (and to develop your own): the rules for performing the basic operations (addition/subtraction, multiplication, division) and the general method of fixed-point calculation.
Addition and subtraction
Addition is simple if we imagine adding two decimal fractions "in a column" on a piece of paper. In that operation the numbers are written one under the other so that the commas separating the fractional parts line up. In binary arithmetic this operation is called aligning the exponents.
Moving from "paper" to mathematical notation, we get the following:
Suppose there are two numbers a = n1 * 2^-q1 and b = n2 * 2^-q2. Then:
a + b = n1 * 2^-q1 + n2 * 2^-q2 = (n1 + n2 * 2^(q1 - q2)) * 2^-q1.
The factor 2^(q1 - q2) on the second term essentially means an arithmetic shift that brings the two numbers to the same exponent. Note that the result of the calculation may also have to be shifted to bring its exponent to the desired value.
C code snippet (a minimal sketch; the q12 and q15 formats and the values assumed here are illustrative):
int32_t a = 0x1000L;          /* q12: 0x1000 * 2^-12 = 1.0 */
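int32_t b = 0x4000L;          /* q15: 0x4000 * 2^-15 = 0.5 */
int32_t c = (a << 3) + b;     /* align a to q15 by shifting left by 15 - 12 = 3,
                                 then add: c = 0xC000, i.e. 1.5 in q15 */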
From the example it can be seen that even in an operation as simple as addition there is room for thought in real calculations. Always keep these questions in mind:
- Sacrifice accuracy or not? The terms could instead be brought to the smaller exponent by shifting right and discarding the lower digits.
- Are the values of the variables bounded? If so, a shift to the right, for example, may not lead to any loss of accuracy.
- Is it possible to expand the bit width?
This is not a complete list, but it already shows that everything is not as simple as it may seem at first glance. In most practical applications the ranges of acceptable values for the subject area are known or can be obtained, so working with a fixed point requires some experience or research. Often the code is first written with a floating point, after which the ranges of values are examined and negligibly small values are discarded.
Multiplication
Fixed-point multiplication can be performed without tricky alignments or reduction to a single exponent. Nevertheless, multiplication is a rather dangerous operation: it is where loss of accuracy most often occurs, and it requires special care.
Let's start with a mathematical description of multiplication:
Suppose there are two numbers a = n1 * 2^-q1 and b = n2 * 2^-q2. Then:
a * b = (n1 * 2^-q1) * (n2 * 2^-q2) = n1 * n2 * 2^-(q1 + q2).
From the expression it can be seen that the exponents add up during multiplication: 2^-(q1 + q2). Data bit widths are not discussed in detail in this article; it suffices to remember that for safe multiplication without overflow or loss of accuracy, the bit width of the result must be no less than the combined bit width of the factors.
Because the exponents add up, the multiplication result has to be corrected before further calculations. When the exponent is decreased, the lower digits of the result are discarded, that is, accuracy is lost. The loss of accuracy can be reduced (and sometimes has to be), but every way of fighting these losses comes with overhead.
C code snippet (a minimal sketch; the q15 formats and values assumed here are illustrative):
int32_t a = 0x8000L;              /* q15: 0x8000 * 2^-15 = 1.0 */
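int32_t b = 0x2000L;              /* q15: 0x2000 * 2^-15 = 0.25 */
int64_t d = (int64_t)a * b;       /* q30 product: may need up to 64 bits */
int32_t c = (int32_t)(d >> 15);   /* drop the 15 LSBs to return to q15: c = 0x2000, i.e. 0.25 */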
I note that the 15 least significant bits of the multiplication result were discarded in order to bring the number back to the format of the factors. It would of course have been possible to increase the bit width of the variable c, but, as I said, in practice the ranges of values are usually limited, and the lower bits are often neglected. In addition, the possibility of non-zero high-order bits in the original factors is not taken into account; overflow handling is not considered in this article.
Division
Let's start with the mathematical expression for division:
Suppose there are two numbers a = n1 * 2^-q1 and b = n2 * 2^-q2. Then:
a / b = (n1 * 2^-q1) / (n2 * 2^-q2) = (n1 / n2) * 2^-(q1 - q2).
The factor 2^-(q1 - q2) means that the exponent decreases automatically when a division is performed. If no measures are taken, some of the significant digits are simply discarded.
The correction method is obvious: the bit width of the dividend must be increased in advance so that the division yields the desired number of significant bits:
a / b = (n1 * 2^-q1 * 2^q3) / (n2 * 2^-q2) = (n1 / n2) * 2^-(q1 - q2 + q3).
Thus, the exponent of the quotient is increased by q3 digits.
C code snippet (a minimal sketch; the q15 formats and values assumed here are illustrative):
int32_t a = 0x4000L;                 /* q15: 0x4000 * 2^-15 = 0.5 */
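int32_t b = 0x2000L;                 /* q15: 0x2000 * 2^-15 = 0.25 */
int64_t t = (int64_t)a << 15;        /* widen the dividend: q30 held in 64 bits */
int32_t c = (int32_t)(t / b);        /* q30 / q15 = q15: c = 0x10000, i.e. 2.0 */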
Obviously, if the required bit width exceeds 32 bits, the problem cannot be solved so easily. However, for simple engineering calculations 32-bit numbers are usually more than enough.
There is one simple way to significantly reduce the loss of accuracy in division: preliminary normalization of the dividend. Normalization is in effect the maximum left shift of the mantissa at which no significant bits are lost. How far the number can be shifted is determined by counting its leading zeros, for which there are special algorithms (or even hardware processor instructions).
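A portable leading-zero count might look like this (a minimal sketch assuming <stdint.h>; many compilers instead expose a hardware instruction through an intrinsic such as GCC's __builtin_clz):
static int clz32(uint32_t x)          /* number of leading zero bits in a 32-bit value */
{
    int n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000UL) == 0) { /* shift left until the top bit is set */
        x <<= 1;
        n++;
    }
    return n;
}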
After the division, the quotient should be shifted right by the same number of digits to restore the exponent.
With normalization (using the clz32 sketch above; the formats and values are again illustrative), the code snippet may look like this:
int32_t a = 0x4000L;               /* q15: 0x4000 * 2^-15 = 0.5 */
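int32_t b = 0x2000L;               /* q15: 0x2000 * 2^-15 = 0.25 */
int q3 = clz32((uint32_t)a) - 1;   /* leading zeros minus the sign bit: the maximum safe left shift (16 here) */
int32_t c = (a << q3) / b;         /* the dividend becomes q31; the quotient is q(15 - 15 + 16) = q16: c = 0x20000, i.e. 2.0 */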
As you can see, in this case the loss of accuracy was avoided without increasing the bit width of the dividend.
However, this is not always the case, and if you need to stay within a certain bit width (for example, 32 bits), the division has to be implemented algorithmically. In a review article it is hardly worth plunging into such a jungle; the description already given is enough to understand the division process and the difficulties associated with it.
Notation adopted in the literature and various sources
In the last section of the article I would like to return once more to the generally accepted notation used when describing fixed-point algorithms. This is quite an important point to pay attention to when reading other people's source code.
The two most common options for denoting a fixed-point number are:
- QM, where M is the number of digits after the point (the variant used earlier in this article).
- QN.M, where N is the number of digits before the point (not counting the sign bit) and M the number of digits after it.
The minus of the first notation is obvious: when working with a variable, you have to go back to its declaration (remember its bit width) and do some arithmetic in your head to figure out how to bring the exponent to the desired value. Moreover, if we recall the (int32_t) d truncation in the multiplication example, comments in this notation make it hard to tell whether a shift or the discarding of significant bits will lead to an error.
With comments in the second notation you can simply record the accuracy of the calculations, which removes the need to remember how the variable is declared.
Let me explain with an example (the variable and its format are illustrative):
a = 0x1000; // q12 (first notation)
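a = 0x1000; // q3.12 (second notation: for a 16-bit signed variable, 3 integer digits and 12 fractional digits, so the value 1.0 is clear without looking at the declaration of a)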
Comments in the second notation are obviously more convenient.
I will not argue here about the benefits of comments (they are far from universal); let me just say that I always describe the formats of variables in my calculations, so as not to make mistakes and not get confused when aligning the exponents.
If there are no comments at all, reading and understanding the code is of course harder, but when you get confused you can always add such a decoding in Q-notation to understand where "this left shift by 4, and then the right shift by 10" came from.
I note that in the recently released Google source code of the GIPS VoIP engine (webrtc), the comments most often just say Q, implying that all the bits of the number are allocated to the fractional part. Personally, I find this very confusing, because you have to rummage through the definitions to figure out how the code works.
For myself, I use yet another notation, different from those above and close to the MATLAB fixed-point toolbox. It ties the mathematics to the bit width of the variables and simplifies life when you need to evaluate the result of an operation (its bit width and exponent).
In my comments I denote fixed-point numbers as QN.M, where N is the total bit width of the number and M is the number of fractional digits.
Let me explain why I found such a scheme convenient for myself:
- Knowing the bit width of a number, you can always predict the bit width of the result, i.e. select a variable type sufficient to represent it.
- I personally find it inconvenient to read a record of the form Q(-N).M, which appears in the second notation after a right shift when no digits remain for the integer part. For example, a 16-bit number with an exponent of 18 (n * 2^-18) is written as q16.18 in my notation, but as q(-3).18 in the second one. A record in the first notation, as already mentioned, in any case forces you to consult the definition to understand the accuracy of the calculations, and even then it remains unclear whether the leading significant bits have already been dropped or not.
- Having done the calculations in my notation, it is easier to see what bit width the result will fit into and how to align the exponents. For example, q32.15 * q16.4 = q48.19: it is immediately obvious that a full representation of the result needs 48 bits. In the second notation the same record looks like q16.15 * q11.4 = q27.19, and you have to work out that 27 + 19, plus one sign digit from the first factor and one from the second, gives 48 bits. A trifle, but a pleasant one, especially when there is a lot of source code.
On the pros and cons of using a fixed point
Such a detailed description of even the basic operations may scare engineers and programmers away from using a fixed point in calculations, especially those who are already used to a floating point and to not tracking the result. However, using a fixed point has advantages, some of which are not obvious.
To finally decide whether you need it, you can use the following summary of fixed-point calculations.
Pros:
- You need to think.
- A predictable result. With the right approach to coding, the result of the calculations will be the same on any platform (processor + compiler) down to the last bit. There is a special term for this: bit-exactness. A properly coded algorithm is always bit-exact and can therefore be studied on a non-target platform. This is especially useful when debugging on the target platform is difficult or impossible and only the input data can be captured.
- Full control over the behavior of the code. A fixed point eliminates the "surprises" associated with the peculiarities of the floating-point implementation on the platform in use.
- Automatic "filtering" of negligible values. With a floating point, computation errors can accumulate; with a fixed point this does not happen (small values are discarded), or the error-accumulation process can be controlled algorithmically.
- An algorithmically controlled range of variable values. A floating point gives more freedom in calculations, but the result can go beyond the permissible limits, which then has to be controlled separately. With a fixed point this problem is solved automatically during the development and debugging of the algorithm.
- Portability of algorithms. This plus correlates with the earlier ones, but it is worth noting that many non-x86 processors support integer calculations far better than floating-point ones. Having once developed an algorithm with a fixed point, it becomes much easier to port it to various "weak" platforms. Sometimes integer calculations are the only thing available on the target platform at all.
- The ability to control the complexity of the calculations by reducing the accuracy while developing the algorithm.
- Sometimes it's interesting.
Cons:
- You need to think.
- A reduced (in the simplest case) range of variable values compared to a floating point.
- The need to algorithmically control the range of variable values. Much of the development time is spent on correct scaling and the selection of ranges.
- The need to monitor the bit width at every stage of the computation.
- The need to write your own framework of basic functions (trigonometric, logarithmic, etc.) or to adapt an existing one.
- The need to dive into the application area when developing an algorithm.
- The need for a higher culture of writing and maintaining code: you cannot do without reusing your own fixed-point developments. With a floating point you can most often rewrite the math "head-on" without thinking, using ready-made functions.
As a conclusion
I did not touch on such (basic!) problems of fixed-point calculations as overflow of the result and zero drift during rounding, or the methods of dealing with them. The text has already turned out voluminous and possibly boring enough without those details.
For my part, I paid quite a lot of attention to the generally accepted notation for writing fixed-point operations, in order to make it easier to read the specialized literature and to write your own code (and, possibly, to understand the second part of the article). Comments with calculations in Q-notation have more than once saved me from serious debugging and source-code analysis.
If the topic proves to be in demand, I will extend the article with a second part, in which I will describe the points above and try to explain how, in general, an algorithm can be moved from a floating point to a fixed one.
NB: Mathematicians, please do not worry; I am sure you know the questions covered here better than I do.
UPDATES:
- The notation I use, as it turned out, came to me from MATLAB; I could not remember where I had picked it up at the time. Thanks to nerudo for the reminder. The fixed-point toolbox of that package uses exactly the pair "number of bits" + "number of fractional bits" to create objects, with an explicit indication of whether the number is signed or unsigned.
- Although the examples abound in 8-, 16-, 32- and 64-bit words, I did not try to tie the description only to x86 and other general-purpose processors. It is simply that, firstly, it is easier to give examples in C and, secondly, I am far from being an expert on FPGAs, ASICs, etc. (read: Verilog and the like), which allow an arbitrary bit width to be chosen, so I cannot describe them clearly. Perhaps people more knowledgeable on the question will be able to supplement the article with their own topics and examples.
- The notation descriptions say nothing about unsigned numbers. To be fair, the original sources where I saw them also say nothing about signs; it was basically assumed that all numbers are signed. Offhand I cannot say how the authors intended unsigned numbers to be written; I will only note that in my/MATLAB notation I add the letter 'u' after the Q to mark an unsigned number.