Among the many formats for representing real numbers in computing, the floating-point format defined by the IEEE754 standard holds a special place. The main advantages of floating-point numbers, as is well known, are that they support calculations over a wide range of values and that these calculations are carried out with binary arithmetic, which is easy to implement in hardware. The latter circumstance, however, is fraught with pitfalls that can cause calculations in this format to produce completely unexpected results.
Below we give examples of arithmetic operations on some floating point numbers that lead to incorrect results. These results do not depend on the platform on which the calculations are implemented.
In this article we do not give the theoretical analysis that explains why these errors appear. That is the subject of the next article. Here we only try to draw the attention of specialists to the catastrophic inaccuracy that arises when arithmetic operations on decimal numbers are performed with binary arithmetic. The examples considered here inevitably raise the question of whether it is expedient to use the floating-point format in the form in which the IEEE754 standard defines it.
Note that the main cause of erroneous calculations with floating-point numbers is not rounding errors, to which the standard pays great attention, but the very nature of the conversion between decimal and binary numbers.
We will consider the two main floating-point formats, float and double. Recall that the float format can represent up to 7 correct decimal digits in a binary mantissa of 24 bits. The double format can represent up to 15 correct decimal digits in a mantissa of 53 bits.
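The digit counts quoted above can be related to the mantissa widths with a one-line estimate. The sketch below assumes the usual rule of thumb that a p-bit binary mantissa spans about p * log10(2) decimal digits; the dictionary and variable names are ours, introduced only for illustration.

```python
import math

# Mantissa widths in bits, as quoted in the text.
MANTISSA_BITS = {"float": 24, "double": 53}

# Rule of thumb (an assumption of this sketch): p bits span about p * log10(2) decimal digits.
decimal_digits = {name: bits * math.log10(2) for name, bits in MANTISSA_BITS.items()}

for name, digits in decimal_digits.items():
    print(f"{name}: about {digits:.2f} decimal digits")
```

The integer parts of the two estimates correspond to the figures of 7 and 15 digits used in this article.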
Decimal numbers that cannot be represented exactly in a binary machine word contain, after conversion back to decimal form, both correct digits and "tails" of incorrect digits. These "tails" are the source of erroneous results when decimal floating-point numbers are processed with binary arithmetic. Let us show this with examples.
ADDITION
To begin, consider the sum of the following two real numbers in the float format, each of which has 7 correct significant digits:
0.6000006 + 0.03339874 = 0.6333993 4 ≈ 0.6333993
All calculations will be carried out in the float format. For clarity, we write the numbers in unpacked form. Our decimal numbers in normalized binary form are:
0.6000006 ≈ 1.001100110011001101001 * 2^(-1)
0.03339874 ≈ 1.00010001100110100011110 * 2^(-5)
If we convert the resulting binary codes back to decimal form, we obtain the following values:
0.6000006 ≈ 0.6000006 198883056640625 ≈ 0.6000006 2
0.03339874 ≈ 0.03339873 9993572235107421875 ≈ 0.0333987 4
Here each number, rounded to 7 correct digits, is separated from its "tail" by a space. These "tails" arise from the inverse conversion of the binary code stored in the machine mantissa back to decimal.
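The "tails" can be inspected directly. The following is a minimal sketch in Python, assuming that packing a value with the standard `struct` module is an acceptable stand-in for storing it in a float; the helper names `to_f32` and `stored_decimal` are hypothetical and introduced only for this illustration.

```python
import struct
from decimal import Decimal


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def stored_decimal(text: str) -> Decimal:
    """Exact decimal expansion of the float value stored for a decimal literal."""
    return Decimal(to_f32(float(text)))


for text in ("0.6000006", "0.03339874"):
    print(f"{text} is stored as {stored_decimal(text)}")
```

`Decimal(float)` converts the binary value exactly, so every digit printed after the seventh significant one belongs to the "tail".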
The sum of our numbers in 24-bit binary form gives the following result:
1.001100110011001101001 * 2^(-1) + 1.00010001100110100011110 * 2^(-5) ≈ 1.0100010001001100111011 * 2^(-1) ≈ 0.6333994
The same result is obtained if we add the decimal numbers together with their "tails":
0.6000006 2 + 0.0333987 4 = 0.6333993 6 ≈ 0.6333994
As we see, after rounding to 7 significant digits, the result differs from the one obtained by adding the decimal numbers on a calculator. The rounding rules for binary representations laid down in the IEEE754 standard do not solve the problem of exact calculation for the numbers considered here. In our case, the error is caused by the digits that follow the last correct digits of the decimal terms, about which nothing is known a priori.
Let us give one more example of addition. On a calculator, add the following two real numbers, each of which contains 7 correct significant decimal digits:
6543.455 + 12.34548 = 6555.80048 ≈ 6555.800
We convert the decimal terms to normalized binary form:
6543.455 ≈ 1.10011000111101110100100 * 2^12 ≈ 6543.455 078
12.34548 ≈ 1.10001011000011100010110 * 2^3 ≈ 12.345 48
The sum of these terms in binary gives the following binary number:
1.10011001101111001101 * 2^12 = 6555.80078125 ≈ 6555.801
Here we again get a result that differs from that calculated on a calculator for non-converted decimal numbers.
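Both addition examples can be replayed in a few lines. This is a sketch under stated assumptions rather than a definitive implementation: float arithmetic is emulated by computing in double and rounding the result back to 32 bits with `struct`, and the helpers `to_f32`, `round7` and `add_f32` are hypothetical names of ours.

```python
import struct
from decimal import Context, Decimal


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round7(x) -> Decimal:
    """Round a float or Decimal to 7 significant decimal digits."""
    return Context(prec=7).create_decimal(Decimal(x))


def add_f32(a: str, b: str) -> float:
    """Add two decimal literals the way float arithmetic would (emulated)."""
    return to_f32(to_f32(float(a)) + to_f32(float(b)))


for a, b in (("0.6000006", "0.03339874"), ("6543.455", "12.34548")):
    binary = round7(add_f32(a, b))
    decimal = round7(Decimal(a) + Decimal(b))
    print(f"{a} + {b}: float gives {binary}, decimal arithmetic gives {decimal}")
```

The `Decimal` sums play the role of the calculator here, since they operate on the decimal digits as written.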
MULTIPLICATION
The same problem of inaccurate calculation arises when finding the products of some floating-point numbers represented in binary code in the float format. For example, consider the product of the following real numbers:
0.06543455 * 139 = 9.095402 45 ≈ 9.095402
The factors in normalized binary form are:
0.06543455 ≈ 1.00001100000001010001101 * 2^(-4)
139 = 1.0001011 * 2^7
Multiplying these numbers in binary gives:
1.00001100000001010001101 * 2^(-4) × 1.0001011 * 2^7 ≈ 1001.00011000011011000101 ≈ 9.095403
We obtained an error in the low-order digit of the product that cannot be corrected by rounding the number represented in binary code. Such errors are called fatal.
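The multiplication example can be checked the same way. As before, this is only a sketch: float multiplication is emulated by rounding a double product back to 32 bits, and `to_f32`, `round7` and `mul_f32` are hypothetical helper names.

```python
import struct
from decimal import Context, Decimal


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round7(x) -> Decimal:
    """Round a float or Decimal to 7 significant decimal digits."""
    return Context(prec=7).create_decimal(Decimal(x))


def mul_f32(a: str, b: str) -> float:
    """Multiply two decimal literals the way float arithmetic would (emulated)."""
    return to_f32(to_f32(float(a)) * to_f32(float(b)))


binary_product = round7(mul_f32("0.06543455", "139"))
decimal_product = round7(Decimal("0.06543455") * Decimal("139"))
print(f"float: {binary_product}, decimal arithmetic: {decimal_product}")
```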
DIVISION
As with multiplication, division in the float format also leads to fatal errors for some floating-point numbers. Consider the following example:
131 / 0.066 ≈ 1984.848
The dividend and divisor in normalized binary form are:
131 = 1.0000011 * 2^7
0.066 ≈ 1.00001110010101100000010 * 2^(-4)
The quotient of these numbers is:
(1.0000011 * 2^7) / (1.00001110010101100000010 * 2^(-4)) ≈
≈ 1.11110000001101100100111 * 2^10 = 1984.848 5107421875 ≈ 1984.849
We see that the result obtained here does not correspond to the correct value calculated on a calculator or manually.
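A corresponding sketch for division follows. The assumptions are the same as in the earlier examples: the quotient is computed in double and rounded to 32 bits to emulate float division, and the helper names are ours.

```python
import struct
from decimal import Context, Decimal


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round7(x) -> Decimal:
    """Round a float or Decimal to 7 significant decimal digits."""
    return Context(prec=7).create_decimal(Decimal(x))


def div_f32(a: str, b: str) -> float:
    """Divide two decimal literals the way float arithmetic would (emulated)."""
    return to_f32(to_f32(float(a)) / to_f32(float(b)))


binary_quotient = round7(div_f32("131", "0.066"))
decimal_quotient = round7(Decimal("131") / Decimal("0.066"))
print(f"float: {binary_quotient}, decimal arithmetic: {decimal_quotient}")
```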
SUBTRACTION
Subtraction is the operation that completely discredits the idea of using binary arithmetic to compute decimal floating-point values. In some cases this operation yields a result that does not even come close to the true value. Let us demonstrate this with an example.
Let the minuend be 105.3256 and subtract from it the number 105.32. The difference of these numbers, calculated manually, is:
105.3256 - 105.32 = 0.0056
The decimal minuend and subtrahend in normalized binary form are:
105.3256 ≈ 1.10100101010011010110101 * 2^6 ≈ 105.32559967041015625
105.32 ≈ 1.10100101010001111010111 * 2^6 ≈ 105.32
Find the difference of these numbers in binary form:
1.10100101010011010110101 * 2^6 - 1.10100101010001111010111 * 2^6 = 1.01101111 * 2^(-8)
After converting this number to decimal, we get:
1.01101111 * 2^(-8) ≈ 0.005599976
We got a result that is significantly different from what we expected.
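The subtraction example is the easiest one to reproduce, because the binary difference itself is computed exactly and only the stored operands carry the "tails". The sketch below uses the same emulation of the float format via `struct`; all helper names are hypothetical.

```python
import struct
from decimal import Context, Decimal


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round7(x) -> Decimal:
    """Round a float or Decimal to 7 significant decimal digits."""
    return Context(prec=7).create_decimal(Decimal(x))


minuend = to_f32(105.3256)
subtrahend = to_f32(105.32)
binary_diff = to_f32(minuend - subtrahend)
decimal_diff = Decimal("105.3256") - Decimal("105.32")

print(f"stored minuend:    {Decimal(minuend)}")
print(f"stored subtrahend: {Decimal(subtrahend)}")
print(f"float difference:  {round7(binary_diff)}")
print(f"decimal difference: {decimal_diff}")
```

Only the first four significant digits of the float difference agree with 0.0056: the leading digits of the operands cancel, and what remains is dominated by their "tails".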
ERRORS IN DOUBLE FORMAT
A higher-precision format such as double does not remedy the problem of fatal errors. As we noted above, this is due to the very nature of converting numbers from one number system to another. Let us show this with examples.
As a tool to check the correctness of our reasoning, we will use Excel 2009, which implements the calculations in strict accordance with the IEEE754 standard specification.
Let's perform the following calculations using Excel 2009. Set the cell format to numeric with 18 decimal places. To find the sum, we enter the following numbers into the Excel cells:
A1 = 0.6236
A2 = 0.00661666666070646
In cell A3, Excel gives the sum of these numbers:
A3 = A1 + A2 = 0.6236 + 0.00661666666070646 ≈ 0.63021666666060707
If you calculate this sum manually or on a calculator, you get the number:
0.6236 + 0.00661666666070646 ≈ 0.630216666660706
This does not coincide in the low-order digits with the value obtained in Excel.
Let's see what the subtraction operation leads to in Excel. Enter the following numbers in the cells:
A1 = 123456.789012345
A2 = 123456
In cell A3, we find the difference of these numbers. It is equal to:
A3 = A1 - A2 = 0.789012345005176
Whereas we expected to get the number:
123456.789012345 - 123456 = 0.789012345
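The same subtraction can be tried outside Excel. Python's built-in `float` is a double on common platforms, which is the assumption of this sketch; the variable names are ours.

```python
from decimal import Decimal

a = 123456.789012345  # stored as the nearest double
b = 123456.0          # an integer of this size is stored exactly

binary_diff = a - b
decimal_diff = Decimal("123456.789012345") - Decimal("123456")

print(f"stored value of a: {Decimal(a)}")
print(f"double difference:  {binary_diff:.15f}")
print(f"decimal difference: {decimal_diff}")
```

The extra digits in the double difference are the "tail" of the stored value of `a`, exposed once the integer part has been subtracted away.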
In conclusion, we give an example of how quickly an error can grow when binary arithmetic is used to calculate with decimal real numbers, even without a subtraction operation.
We write the following numbers into the Excel cells:
A1 = 0.500000000660006
A2 = 0.0000213456548763
A3 = 0.00002334565487363
A4 = 0.000013345654873263
In cell A6, we enter the formula = A1 / 5 + A2. The result is:
A6 = A1 / 5 + A2 = 0.100021345786878
In cell A7, we perform the following calculations:
A7 = A6 / 3 + A3 = 0.0333637942504995
Now let's calculate
A8 = A7 / 4 + A4 = 0.00835429421749813
Let's carry out the same calculations on a calculator, with an accuracy of 15 significant decimal digits. Digits to the right of the least significant retained digit are rounded off according to the usual rules of arithmetic. As a result, we have:
A1 / 5 = 0.500000000660006 / 5 = 0.100000000132001 2 ≈ 0.100000000132001
A1 / 5 + A2 = 0.100000000132001 + 0.0000213456548763 ≈ 0.100021345786877
(A1 / 5 + A2) / 3 = 0.100021345786877 / 3 ≈ 0.0333404485956257
(A1 / 5 + A2) / 3 + A3 = 0.0333404485956257 + 0.00002334565487363 ≈ 0.0333637942504993
[(A1 / 5 + A2) / 3 + A3] / 4 = 0.0333637942504993 / 4 ≈ 0.00834094856262483
[(A1 / 5 + A2) / 3 + A3] / 4 + A4 = 0.00834094856262483 + 0.000013345654873263 ≈ 0.00835429421749809
Comparing the result of the calculation carried out by the rules of decimal arithmetic with the one obtained in Excel, we can conclude that using binary arithmetic on decimal floating-point numbers can lead to completely unpredictable results.
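For readers who want to repeat the last comparison, here is a minimal sketch of both chains. It rests on two assumptions of ours: Python's `float` stands in for Excel's double arithmetic, and the hand calculation is modelled with the `decimal` module by rounding every intermediate result to 15 significant digits, half up. The function names are hypothetical.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

A1 = "0.500000000660006"
A2 = "0.0000213456548763"
A3 = "0.00002334565487363"
A4 = "0.000013345654873263"


def chain_binary() -> float:
    """Evaluate ((A1 / 5 + A2) / 3 + A3) / 4 + A4 in binary double arithmetic."""
    a1, a2, a3, a4 = (float(s) for s in (A1, A2, A3, A4))
    a6 = a1 / 5 + a2
    a7 = a6 / 3 + a3
    return a7 / 4 + a4


def chain_decimal15() -> Decimal:
    """Same chain in decimal arithmetic, rounding each step to 15 significant digits."""
    ctx = Context(prec=15, rounding=ROUND_HALF_UP)
    a1, a2, a3, a4 = (Decimal(s) for s in (A1, A2, A3, A4))
    a6 = ctx.add(ctx.divide(a1, Decimal(5)), a2)
    a7 = ctx.add(ctx.divide(a6, Decimal(3)), a3)
    return ctx.add(ctx.divide(a7, Decimal(4)), a4)


print(f"binary double:      {chain_binary():.15g}")
print(f"decimal, 15 digits: {chain_decimal15()}")
```

Changing `prec` or the rounding mode in `chain_decimal15` shows how sensitive the low-order digits of the decimal chain are to the way intermediate results are rounded.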