
Back in the 1970s, the distant past by IT standards, mathematicians (as programmers were then called) fought, like Don Quixote, an unequal battle with computers the size of small windmills. The tasks were serious: searching the ocean for enemy submarines in images taken from orbit, calculating the ballistics of long-range missiles, and so on. To solve them, a computer must operate on real numbers, which form a continuum, while its memory is finite. The continuum therefore has to be mapped onto a finite set of zeros and ones. As a compromise between speed, size, and accuracy of representation, scientists came up with floating-point numbers.
For some reason floating-point arithmetic is considered an exotic corner of computer science, even though the corresponding data types are present in every programming language. Honestly, I never paid much attention to computer arithmetic myself, until solving the same problem on the CPU and on the GPU gave different results. It turned out that curious and strange phenomena hide in the dark corners of this field: non-associative arithmetic operations, signed zero, a difference of unequal numbers that equals zero, and so on. The roots of this iceberg go deep into mathematics, and below I will try to describe only what lies on the surface.
1. Basics
The set of integers is infinite, but we can always choose a number of bits sufficient to represent any integer that arises in a specific problem. The set of real numbers is not only infinite but also continuous, so no matter how many bits we take, we will inevitably encounter numbers that have no exact representation. Floating-point numbers are one possible way of representing real numbers; they are a compromise between precision and the range of representable values.
A floating-point number consists of a fixed set of bits, conventionally divided into a sign, an exponent (order), and a mantissa. The order and the mantissa are integers, which together with the sign represent a floating-point number as follows:

(−1)^s × M × B^E,

where s is the sign, B is the base, E is the order, and M is the mantissa.
The base determines the numeral system of the digits. It can be shown mathematically that floating-point numbers with base B = 2 (binary representation) are the most resistant to rounding errors, so in practice only bases 2 and, more rarely, 10 occur. In what follows we will always assume B = 2, so the floating-point formula becomes:

(−1)^s × M × 2^E

What are the mantissa and the order?
A mantissa is a fixed-length integer that represents the most significant bits of a real number. Suppose our mantissa consists of three bits (|M| = 3). Take, for example, the number 5, which in binary is 101₂. The most significant bit corresponds to 2² = 4, the middle one (zero here) to 2¹ = 2, and the least significant bit to 2⁰ = 1.
The order is the power of the base (2) that corresponds to the most significant digit. In our case E = 2. Such numbers are conveniently written in the so-called "scientific" form, for example "1.01e+2". It is immediately clear that the mantissa has three digits and the order equals two.
Suppose we want to get a fractional number using the same 3 bits of mantissa. We can do this by taking, say, E = 1. Then our number equals

1.01e+1 = 1×2¹ + 0×2⁰ + 1×2⁻¹ = 2 + 0.5 = 2.5

Here, since E = 1, the power of two of the first digit (the one before the point) is 1. The other two digits, to the right of the point, contribute 2^(E−1) and 2^(E−2) (2⁰ and 2⁻¹ respectively). Obviously, by adjusting E the same number can be represented in different ways. Consider an example with mantissa length |M| = 4. The number 2 can be represented as follows:

2 = 10₂ (binary) = 1.000e+1 = 0.100e+2 = 0.010e+3 (E = 1, E = 2, E = 3, respectively)
Note that the same number has several representations. This is inconvenient for hardware, since the multiplicity of representations has to be taken into account when comparing numbers and when performing arithmetic on them. Moreover, it is wasteful: the number of bit patterns is finite, and duplicates shrink the set of numbers that can be represented at all. Therefore, already in the very first machines a trick was used: the leading bit of the mantissa was always made equal to one. Such a representation is called normalized.

This saves one bit, since the implicit leading one does not need to be stored in memory, and it gives each number a unique representation. In our example, 2 has a single normalized representation ("1.000e+1"), and the mantissa is stored in memory as "000", since the leading one is implied. But the normalized representation brings a new problem: it is impossible to represent zero in this form.
Strictly speaking, a normalized number looks like this:

(−1)^s × 1.M × 2^E

The quality of the solution of a problem depends on the choice of floating-point representation. With that, we have smoothly arrived at the problem of standardizing such a representation.
2. A bit of history
In the 1960s and 70s there was no single standard for representing floating-point numbers, for rounding, or for arithmetic operations. As a result, programs were extremely non-portable. An even bigger problem was that different computers each had their own "quirks" which had to be known and accounted for in the program. For example, on some machines the difference of two unequal numbers could return zero, so the checks "X == Y" and "X − Y == 0" contradicted each other. Programmers worked around this with very cunning tricks, for example performing the assignment "X = (X − X) + X" before multiplication and division operations to avoid problems.
The initiative to create a single standard for representing floating-point numbers suspiciously coincided with Intel's attempts in 1976 to develop the "best possible" arithmetic for the new coprocessors of the 8086 and i432. The work was led by Prof. John Palmer and William Kahan. The latter, in his interview, expressed the opinion that the seriousness with which Intel approached its arithmetic made the other companies join forces and start the standardization process.
Everyone was serious, because it is very profitable to promote one's own architecture and make it the standard. DEC, National Semiconductor, Zilog, and Motorola presented their proposals. The mainframe makers Cray and IBM watched from the sidelines. Intel, of course, also submitted its new arithmetic. The authors of its proposed specification were William Kahan, Jerome Coonen, and Harold Stone, and the proposal was immediately nicknamed "KCS".
Almost immediately, all proposals were rejected except two: VAX from DEC and "KCS" from Intel. The VAX specification was much simpler, it was already implemented in PDP-11 computers, and it was clear how to get maximum performance out of it. On the other hand, KCS contained a lot of useful functionality, such as "special" and "denormalized" numbers (details below).
In KCS, all arithmetic algorithms are specified strictly, and an implementation is required to produce results that match them exactly. This makes it possible to carry out rigorous proofs within the specification. Previously, when a mathematician solved a problem by numerical methods and proved properties of the solution, there was no guarantee that those properties would survive in the program. The rigor of KCS arithmetic made it possible to prove theorems relying on floating-point arithmetic.
DEC did everything it could to make its specification the standard. It even gained the support of some reputable scientists for the claim that KCS arithmetic could never, in principle, achieve the same performance as DEC's. The irony is that Intel knew how to make its specification just as fast, but those tricks were a trade secret. If Intel had not yielded and revealed some of its secrets, it would not have been able to withstand DEC's onslaught.
For more information about the standardization battles, see Professor Kahan's interview; now let's look at what the floating-point representation looks like today.
3. Representation of floating-point numbers today
The developers of KCS won, and their brainchild is now embodied in the IEEE754 standard. Floating-point numbers in it are represented by a sign (s), a mantissa (M), and an order (E) as follows:

(−1)^s × 1.M × 2^E

Note: in the newer standard IEEE754-2008 there are, in addition to base-2 numbers, also base-10 numbers, the so-called decimal floating-point numbers.
In order not to overload the reader with information that can be found on Wikipedia, we will consider only one data type, single precision (float). Numbers with half, double, and extended precision have the same features but different ranges for the order and the mantissa. In single-precision numbers (float/single) the order occupies 8 bits and the mantissa 23. The effective order is defined as E − 127. For example, the number 0.15625 is stored in memory as follows (picture taken from Wikipedia):

[bit layout of 0.15625: sign 0, exponent 01111100, mantissa 01000000000000000000000]

In this example:
- Sign s = 0 (positive number)
- Order E = 01111100₂ − 127₁₀ = −3
- Mantissa M = 1.01₂ (the leading one is implicit)
- As a result, our number is F = 1.01₂ e−3 = 2⁻³ + 2⁻⁵ = 0.125 + 0.03125 = 0.15625

In slightly more detail: we are dealing with the binary representation of the number 101₂ with the point shifted a few positions to the left. 1.01 is binary notation meaning 1×2⁰ + 0×2⁻¹ + 1×2⁻². Moving the point three positions to the left, we get 1.01e−3 = 1×2⁻³ + 0×2⁻⁴ + 1×2⁻⁵ = 1×0.125 + 0×0.0625 + 1×0.03125 = 0.125 + 0.03125 = 0.15625.
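To see this layout programmatically, here is a minimal sketch (my own illustration, not from the article) that pulls the sign, exponent, and mantissa fields out of a float; the field widths and the bias of 127 are those of the IEEE754 single format described above:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    float f = 0.15625f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);            /* reinterpret the float's bits */

    unsigned sign     =  bits >> 31;           /* 1 bit                        */
    unsigned exponent = (bits >> 23) & 0xFF;   /* 8 bits, biased by 127        */
    unsigned mantissa =  bits & 0x7FFFFF;      /* 23 bits, implicit leading 1  */

    printf("sign=%u exponent=%u (E=%d) mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    /* prints: sign=0 exponent=124 (E=-3) mantissa=0x200000 */
    return 0;
}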
3.1 Special Numbers: Zero, Infinity, and Uncertainty
In IEEE754 the number 0 is represented by a value with order E = Emin − 1 (for single precision that is −127) and a zero mantissa. Introducing zero as a number in its own right (since zero cannot be represented in normalized form) made it possible to avoid many oddities in arithmetic. And although operations with zero have to be handled separately, they are usually performed faster than operations with ordinary numbers.
IEEE754 also provides representations for special numbers, operations on which raise an exception. These are the infinities (±∞) and the uncertainty (NaN). They make it possible to return a sensible value in case of overflow. Infinities are represented as numbers with order E = Emax + 1 and a zero mantissa. Infinity can be obtained on overflow or by dividing a non-zero number by zero. The developers defined division-based infinities by appealing to limits, where the dividend and the divisor tend toward some values. Accordingly, c/0 = ±∞ for non-zero c (for example, 3/0 = +∞ and −3/0 = −∞), since if the dividend tends to a constant and the divisor tends to zero, the limit is infinity. For 0/0 the limit does not exist, so the result is uncertainty.
Uncertainty, or NaN (from "not a number"), is a concept invented so that an arithmetic operation can always return some value, even a meaningless one. In IEEE754, NaN is represented as a number with E = Emax + 1 and a non-zero mantissa. Any operation involving NaN returns NaN. If desired, information that the program can interpret may be recorded in the mantissa; the standard does not specify this, and the mantissa is usually ignored.
How can I get NaN? One of the following ways:
- ∞ + (−∞)
- 0 × ∞
- 0/0, ∞/∞
- sqrt(x), where x < 0
By definition NaN ≠ NaN, so to check whether a variable holds NaN it is enough to compare the variable with itself.
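A minimal illustration (assuming IEEE semantics for division by zero, i.e. a platform implementing C's Annex F):

#include <stdio.h>
#include <math.h>

int main(void) {
    double zero = 0.0;              /* runtime variable, so the compiler does not fold it */
    double nan1 = zero / zero;      /* 0/0 gives uncertainty (NaN)                        */
    double nan2 = sqrt(-1.0);       /* sqrt of a negative number does too                 */
    double inf  = 1.0 / zero;       /* dividing a non-zero number by zero gives infinity  */

    printf("nan1 == nan1 : %d\n", nan1 == nan1);  /* 0: NaN is not equal to itself */
    printf("isnan(nan2)  : %d\n", isnan(nan2));   /* nonzero                       */
    printf("inf - inf    : %f\n", inf - inf);     /* nan                           */
    return 0;
}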
What is the zero sign (or +0 vs -0)
The inquisitive reader has probably already noticed that this representation of floating-point numbers has two zeros, differing only in sign. Thus 3·(+0) = +0 while 3·(−0) = −0, yet the comparison +0 == −0 is true. The sign was kept in the standard deliberately, so that expressions that turn into infinity or zero due to overflow or underflow still produce the most correct possible result when multiplied or divided further. For example, if zero had no sign, the identity 1/(1/x) = x would not hold for x = ±∞, since both 1/∞ and 1/−∞ would be equal to the same 0.
One more example:
(+∞/0) + ∞ = +∞, while (+∞/−0) + ∞ = NaN
Why is infinity better than NaN in this case? Because if NaN appears in an arithmetic expression, the result of the whole expression will always be NaN, whereas if infinity appears, the result can still be zero, infinity, or an ordinary floating-point number. For example, 1/∞ = 0.
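A small sketch of signed-zero behavior, again assuming IEEE division semantics:

#include <stdio.h>
#include <math.h>

int main(void) {
    double pz = +0.0, nz = -0.0;

    printf("pz == nz : %d\n", pz == nz);                   /* 1: the two zeros compare equal */
    printf("1/pz = %f, 1/nz = %f\n", 1.0 / pz, 1.0 / nz);  /* inf and -inf                   */

    /* 1/(1/x) == x holds for x = -inf only because zero keeps its sign */
    double x = -INFINITY;
    printf("1/(1/x) == x : %d\n", 1.0 / (1.0 / x) == x);   /* 1 */
    return 0;
}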
3.3 Denormalized numbers
To see what denormalized (subnormal) numbers are, consider a simple example. Suppose we have a normalized representation with mantissa length |M| = 2 bits (plus one implicit normalization bit) and an order range −1 ≤ E ≤ 2. In this case we get 16 numbers:

[number line of the 16 representable values between 0.5 and 7; long ticks mark the numbers with mantissa 1.00]

The long ticks mark numbers with mantissa equal to 1.00. It can be seen that the distance from zero to the nearest number (0 to 0.5) is greater than the distance from that number to the next one (0.5 to 0.625). This means that the difference of any two numbers between 0.5 and 1 gives 0, even if the numbers themselves are not equal. Worse, the difference of some numbers greater than 1 also falls into the gap between 0 and 0.5. For example, 1.5 − 1.25 = 0 (see the picture).
Not every program falls into this "near-zero hole". According to statistics from the 1970s, on average each computer ran into the problem about once a month. Given that computers were becoming mass-produced, the developers of KCS considered the problem serious enough to solve it at the hardware level. The solution they proposed was as follows. We know that when E = Emin − 1 (for float that is −127) and the mantissa is zero, the number is considered zero. If the mantissa is not zero, then the number is considered non-zero, its order is taken to be E = Emin, and the implicit leading bit of the mantissa is taken to be zero. Such numbers are called denormalized.
Strictly speaking, floating-point numbers now look like this:

(−1)^s × 1.M × 2^E,    if Emin ≤ E ≤ Emax (normalized numbers)
(−1)^s × 0.M × 2^Emin, if E = Emin − 1 (denormalized numbers)
Let's go back to the example. Our Emin = −1. We introduce a new order value, E = −2, at which numbers are denormalized. As a result we get a new representation of the numbers:

[the same number line, with denormalized numbers filling the gap between 0 and 0.5]

The interval from 0 to 0.5 is filled with denormalized numbers, thanks to which the examples discussed above (0.5 − 0.25 and 1.5 − 1.25) no longer collapse to 0. This made the representation more robust to rounding errors for numbers close to zero.
But the luxury of using denormalized numbers in the processor does not come for free. Because such numbers have to be handled differently in every arithmetic operation, it is hard to make such arithmetic efficient, which complicates the implementation of the ALU. And although denormalized numbers are very useful, they are not a panacea, and rounding toward zero still has to be watched for. This functionality therefore became a stumbling block in the development of the standard and met the strongest resistance.
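For illustration, here is a sketch that shows a denormalized result appearing when two close normal numbers near FLT_MIN are subtracted (assuming denormals are not disabled by flags such as -ffast-math or flush-to-zero modes):

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    float a = 2.0e-38f;   /* both values are normal numbers just above FLT_MIN */
    float b = 1.5e-38f;

    float d = a - b;      /* about 5e-39 is below FLT_MIN: without denormals it
                             would flush to zero even though a != b             */
    printf("FLT_MIN           = %e\n", FLT_MIN);
    printf("a - b             = %e\n", d);
    printf("isnormal(a - b)   = %d\n", isnormal(d));              /* 0: subnormal  */
    printf("smallest denormal = %e\n", nextafterf(0.0f, 1.0f));   /* about 1.4e-45 */
    return 0;
}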
3.4 The order of numbers in IEEE754
One of the remarkable features of the IEEE754 representation is that the order and the mantissa are placed one after the other in such a way that together they form a sequence of integers {n} for which

n < n + 1 ⇒ F(n) < F(n + 1),

where F(n) is the floating-point number formed from the integer n by splitting its bits into an order and a mantissa.
Therefore, if we take a positive floating-point number, reinterpret it as an integer, and add 1, we get the next number representable in this arithmetic. In C this can be done like this:
float a = 0.5;
int n = *((int*) &a);
n = n + 1;
float b = *((float*) &n);
printf("%e -> %e (difference %e)\n", a, b, b - a);
This code will only work on a 32-bit int architecture.
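The pointer casts above also formally violate C's aliasing rules; a variant of the same trick through memcpy (my own sketch, not from the article) avoids that, while still assuming a 32-bit float and a positive value:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    float a = 0.5f;
    uint32_t n;
    memcpy(&n, &a, sizeof n);   /* reinterpret the float's bits as an integer */
    n += 1;                     /* step to the next representable float       */
    float b;
    memcpy(&b, &n, sizeof b);
    printf("%e -> %e (difference %e)\n", a, b, b - a);
    return 0;
}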
4. Pitfalls in floating point arithmetic
Now let's turn to practice and consider the features of floating-point arithmetic that require special care when programming.
4.1 Rounding
Errors due to rounding are hard to run into in modern floating-point arithmetic, especially with double precision. The rounding rule in IEEE754 says that the result of any arithmetic operation must be the same as if the operation were performed on the exact values and the result were then rounded to the nearest number representable in the format. This requires extra effort from the ALU, and some compiler options (such as "-ffast-math" in gcc) can disable this behavior. Peculiarities of rounding in IEEE754:
- Rounding to the nearest value in the standard is not done the way we are used to. It can be shown mathematically that if 0.5 is always rounded up to 1, there are sequences of operations in which the rounding error grows without bound. Therefore IEEE754 rounds ties to even: 12.5 is rounded to 12, and 13.5 to 14.
- The most dangerous operation with respect to rounding is subtraction. When close numbers are subtracted, significant digits may be lost, which can increase the relative error many times over.
- For many common mathematical formulas, mathematicians have developed special forms that significantly reduce the rounding error. For example, x² − y² is better computed as (x − y)(x + y) (see the sketch below).
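A sketch of the last point (the values are illustrative choices of my own; the printed results assume ordinary IEEE single/double evaluation, without x87 extended precision):

#include <stdio.h>

int main(void) {
    float x = 10000.001f, y = 10000.0f;

    float direct   = x * x - y * y;       /* subtracts two nearly equal large numbers  */
    float factored = (x - y) * (x + y);   /* algebraically the same, numerically safer */

    /* reference value computed in double precision from the same float inputs */
    double exact = (double)x * x - (double)y * y;

    printf("x*x - y*y   = %f\n", direct);     /* typically 16.000000 */
    printf("(x-y)*(x+y) = %f\n", factored);   /* typically 19.531250 */
    printf("reference   = %f\n", exact);      /* about     19.531251 */
    return 0;
}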
4.2 Nonassociative arithmetic operations
In floating-point arithmetic, the associativity rule (a ∗ b) ∗ c = a ∗ (b ∗ c) does not hold for the arithmetic operations. For example,

(10²⁰ + 1) − 10²⁰ = 0 ≠ (10²⁰ − 10²⁰) + 1 = 1
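The same effect in code (a minimal double-precision sketch):

#include <stdio.h>

int main(void) {
    double big = 1e20;

    /* 1 is far below the precision of doubles around 1e20, so it is lost when
       added to the big number first, but survives if the big terms cancel first */
    printf("(big + 1) - big = %f\n", (big + 1.0) - big);   /* 0.000000 */
    printf("(big - big) + 1 = %f\n", (big - big) + 1.0);   /* 1.000000 */
    return 0;
}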
Suppose we have a program for summing numbers.
double s = 0.0; for (int i=0; i<n; i++) s = s + t[i];
Some compilers, depending on optimization options, may rewrite this code so as to use several ALUs at the same time (assume n is divisible by 2):
double sa[2], s;
sa[0] = sa[1] = 0.0;
for (int i = 0; i < n/2; i++) {
    sa[0] = sa[0] + t[i*2 + 0];
    sa[1] = sa[1] + t[i*2 + 1];
}
s = sa[0] + sa[1];
Since summation operations are not associative, these two programs may produce different results.
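Here is a small self-contained illustration of such a divergence, with values chosen by me so that the two summation orders visibly differ (assuming strict single-precision evaluation):

#include <stdio.h>

int main(void) {
    /* The ULP of a float near 1e8 is 8, so adding 3 at a time is always lost;
       the second accumulator first combines the small terms into 6, which is
       large enough to round the total upward.                                 */
    float t[4] = { 1e8f, 3.0f, 3.0f, 3.0f };

    float s = 0.0f;                         /* sequential summation         */
    for (int i = 0; i < 4; i++)
        s = s + t[i];

    float sa[2] = { 0.0f, 0.0f };           /* two interleaved accumulators */
    for (int i = 0; i < 2; i++) {
        sa[0] = sa[0] + t[i * 2 + 0];
        sa[1] = sa[1] + t[i * 2 + 1];
    }
    float s2 = sa[0] + sa[1];

    printf("sequential = %f\n", s);         /* 100000000.000000 */
    printf("pairwise   = %f\n", s2);        /* 100000008.000000 */
    return 0;
}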
4.3 Numeric Constants
Remember that not all decimal numbers have an exact binary floating-point representation. For example, the number 0.2 is represented in single precision as approximately 0.20000003. Accordingly, 0.2 + 0.2 ≈ 0.4. The absolute error in an isolated case may be small, but if such a constant is used inside a loop, the error can accumulate.
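A quick sketch of such accumulation (the loop count is an illustrative choice of mine):

#include <stdio.h>

int main(void) {
    /* 0.2 has no exact binary representation, so every addition contributes
       a tiny error that accumulates over the loop.                           */
    float sum = 0.0f;
    for (int i = 0; i < 10000; i++)
        sum = sum + 0.2f;

    printf("sum      = %f\n", sum);        /* not exactly 2000.000000 */
    printf("expected = %f\n", 2000.0f);
    return 0;
}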
4.4 Selecting the minimum of two values
Suppose we need to choose the minimum of two values. In C, this can be done in one of the following ways:
- x < y ? x : y
- x <= y ? x : y
- x > y ? y : x
- x >= y ? y : x
Often the compiler considers them equivalent and always uses the first variant, since it executes in a single processor instruction. But if we take ±0 and NaN into account, these operations are by no means equivalent:
x | y | x < y ? x : y | x <= y ? x : y | x > y ? y : x | x >= y ? y : x |
+0 | -0 | -0 | +0 | +0 | -0 |
NaN | 1 | 1 | 1 | NaN | NaN |
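The zero row of the table can be checked with a few lines of C (a sketch for illustration; with a typical libc, "%g" prints the minus sign of a negative zero):

#include <stdio.h>

int main(void) {
    float x = +0.0f, y = -0.0f;

    /* the four "minimum" variants disagree on which zero they return */
    printf("x < y ? x : y  -> %g\n", x <  y ? x : y);   /* -0 */
    printf("x <= y ? x : y -> %g\n", x <= y ? x : y);   /*  0 */
    printf("x > y ? y : x  -> %g\n", x >  y ? y : x);   /*  0 */
    printf("x >= y ? y : x -> %g\n", x >= y ? y : x);   /* -0 */
    return 0;
}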
4.5 Comparing Numbers
A very common mistake when working with floats occurs when testing for equality. For example,
float fValue = 0.2; if (fValue == 0.2) DoStuff();
The error here is, firstly, that 0.2 does not have an exact binary representation, and secondly, that 0.2 is a double-precision constant while the variable fValue is single precision, so there is no guarantee about the outcome of this comparison.
A better, but still flawed, way is to compare the difference with a permissible absolute error:
if (fabs(fValue - fExpected) < 0.0001) DoStuff();
The disadvantage of this approach is that the representation error of a number grows together with the number itself. So if the program expects 10000, the above comparison will fail even for the nearest neighboring float (10000.000977). This is especially relevant if the program converts from single precision to double precision.
Choosing the right comparison procedure is difficult, and I refer interested readers to the article by Bruce Dawson. It proposes comparing floating-point numbers by reinterpreting them as integers. This is the best, though not portable, way. The function has the signature:
bool AlmostEqual2sComplement(float A, float B, int maxUlps)
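The body of the function, reconstructed from memory following Dawson's article (treat it as a sketch rather than the canonical text), looks roughly like this:

#include <stdbool.h>
#include <stdlib.h>

/* Assumes a 32-bit int and two's complement, matching the 32-bit float format. */
bool AlmostEqual2sComplement(float A, float B, int maxUlps)
{
    /* Reinterpret the floats as integers. */
    int aInt = *(int*)&A;
    int bInt = *(int*)&B;

    /* Remap negative floats so that the integers are ordered the same way
       as the floats they came from (lexicographic ordering).               */
    if (aInt < 0) aInt = 0x80000000 - aInt;
    if (bInt < 0) bInt = 0x80000000 - bInt;

    /* The numbers are "almost equal" if no more than maxUlps representable
       floats lie between them.                                             */
    int intDiff = abs(aInt - bInt);
    return intDiff <= maxUlps;
}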
In this function, maxUlps (from Units-In-the-Last-Place) is the maximum number of representable floating-point numbers that may lie between the checked value and the expected value. Another way to read this parameter is as the number of low-order binary digits of the compared numbers that are allowed to differ. For example, maxUlps = 16 means that the lower 4 bits (log₂ 16) may not coincide and the numbers will still be considered equal. With this setting, when comparing against the number 10000 the absolute tolerance is about 0.016, and when comparing against 0.001 it is below 10⁻⁸.
5. Checking the completeness of IEEE754 support
Do you think that if processors fully comply with the IEEE754 standard, then any program using standard data types (such as float/double in C) will produce the same result on different computers? You would be mistaken: portability and standard compliance are also affected by the compiler and by optimization options. William Kahan wrote a C program (there is also a Fortran version) that checks whether a given combination of architecture, compiler, and compiler options satisfies IEEE754. It is called "Floating point paranoia" and its source code is available for download. A similar program exists for the GPU. For example, the Intel compiler (icc) uses a "relaxed" IEEE754 model by default, and as a result not all of the tests pass; the "-fp-model precise" option makes it compile the program in exact compliance with the standard. The GCC compiler has the option "-ffast-math", whose use leads to non-compliance with IEEE754.
Conclusion
Finally, a cautionary tale. When I was working on a test project on a GPU, I had a serial and a parallel version of the same program. Comparing the run times, I was very pleased: I got a 300× speedup. But it later turned out that the calculations on the GPU "collapsed" into NaN, and working with NaNs on that GPU was faster than working with ordinary numbers. What was curious is that the same program produced the correct result on the GPU emulator (on the CPU), but not on the GPU itself. It turned out that the problem was that this GPU did not fully support the IEEE754 standard, and the straightforward approach did not work.
Today floating-point arithmetic is almost perfected. The naive approach will almost always work, and a program that ignores all these features will usually produce the correct result; the pitfalls described here concern only fairly exotic cases. But one must always stay alert: in a subject like computer arithmetic it is easy to step on a rake.
PS Thanks to the user uqlock for an important remark: money must not be stored as floating-point numbers, since significant digits get lost. If a programming language has no fixed-point data types, one can get around this by storing money as an integer number of the smallest currency unit (cents/kopecks, sometimes fractions thereof).
PPS Thanks for typos and errors to users: gribozavr, kurokikaze, Cenness, TheShock, perl_demon, GordTremor, fader44, DraculaDis, icc, f0rbidik, Harkonnen, AlexanderYastrebov, Vayun, EvilsInterrupt!
Literature
- Interview with William Kahan on the development of the IEEE754 standard.
- "What Every Computer Scientist Should Know About Floating-Point Arithmetic", David Goldberg — the mathematical details.
- "Comparing floating point numbers", Bruce Dawson.