Again about floating point numbers

Although many publications deal with the accuracy of computer calculations, some aspects, in our opinion, have still not been fully covered. Namely:

1. How many correct decimal digits n is a decimal number guaranteed to have when it is represented by an m-bit binary code in floating-point format?
2. How does normalization of floating-point numbers affect the accuracy of a number's representation, both when it is converted from one number system to another and in arithmetic operations performed on a computer?
3. How does rounding a number represented in binary form affect its decimal equivalent?
4. How does the position of the virtual point in a machine word affect the value of a number represented in exponential form?

Below we try to answer these questions.

In our reasoning we treat numbers from the standpoint of classical arithmetic. We consider numbers whose count of significant digits is limited by the bit grid of the machine word. To simplify the presentation, we consider only positive numbers.

As a rule, before a computer performs arithmetic on decimal numbers, they are converted to binary fractions in natural (positional) form and then written in normalized exponential form:

F = M * 2^p,

where M is the mantissa of the binary number, 2^p is the characteristic of the number (often called the exponent), and p is the order of the characteristic.

A binary number represented in decimal form will be called the decimal equivalent of the binary number. A decimal number represented in binary form will be called the binary equivalent of the decimal number.

If the natural record of a number is taken as the mantissa M, then the order of the characteristic in the above formula is zero and the characteristic itself is one. The number is then represented as F = M * 2^0 = M. For example, the binary number F = 0.011₂ has the order p = 0. If the point is conventionally placed before the most significant digit, the order of the characteristic is p = -1 and F = 0.11₂ * 2^-1. If the point is placed after the most significant digit, the order is p = -2 and F = 1.1₂ * 2^-2. If the virtual point is placed after the lowest digit of the mantissa M, the mantissa becomes an integer with p = -3, and F = 11₂ * 2^-3. As we see, in all cases the order of the characteristic equals the number of positions by which the virtual point is shifted relative to its place in the natural record. All the records in our examples are equivalent and have the same value.

Thus, for a number represented in exponential form, the order of the characteristic, taken together with the position of the virtual point, determines where the point stands in the natural record of the number.
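The equivalence of the four records of 0.011₂ can be checked with exact rational arithmetic. The following is only a minimal sketch; the helper name `binary_value` is introduced here for illustration and is not part of the original text.

```python
from fractions import Fraction

def binary_value(mantissa_bits: str, p: int) -> Fraction:
    """Exact value of a binary mantissa string such as '0.11' times 2**p.

    Hypothetical helper, written only to illustrate the examples in the text.
    """
    int_part, _, frac_part = mantissa_bits.partition(".")
    digits = int_part + frac_part
    # Read all digits as one integer, then move the point back.
    value = Fraction(int(digits, 2), 2 ** len(frac_part))
    return value * Fraction(2) ** p

# (mantissa, order) pairs for the same number with different point positions
records = [("0.011", 0), ("0.11", -1), ("1.1", -2), ("11", -3)]
values = [binary_value(m, p) for m, p in records]
print(values)  # every entry is Fraction(3, 8), i.e. 0.375
```

Only the order p changes from record to record; the value stays 3/8 in all four cases.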

The width of the machine mantissa uniquely determines how many distinct numbers can be represented in it in binary form, and the position of the virtual point in the mantissa determines the region of the number axis where these numbers lie. Binary numbers whose number of significant digits does not exceed the number of bits of the machine mantissa are exact, and the decimal numbers obtained by converting them to decimal are called representable.

When not all significant digits of a binary number fit into the bit grid of the machine mantissa, the number is rounded to the required number of significant digits. Such a number is said to be rounded to the nearest representable one. The rounded binary number becomes approximate.

Although every binary number is equivalent to a representable decimal number, not every decimal number can be represented in a machine word. This is due to the incommensurability of the decimal and binary number systems. Therefore decimal numbers can, in general, be represented in binary only approximately, while all binary numbers can be represented in decimal exactly. In other words, the binary equivalent of a finite decimal number may contain infinitely many significant digits, whereas the decimal equivalent of a finite binary number always contains a finite number of significant digits.
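This asymmetry can be illustrated with a short check. The sketch below rests on a standard fact: a fraction in lowest terms has a finite binary expansion only if its denominator is a power of two. The function name `has_finite_binary_expansion` is an assumption made for this example.

```python
from decimal import Decimal
from fractions import Fraction

def has_finite_binary_expansion(x: Fraction) -> bool:
    """True if x = k / 2**j, i.e. the reduced denominator is a power of two."""
    d = x.denominator
    return d & (d - 1) == 0

tenth_is_finite = has_finite_binary_expansion(Fraction(1, 10))  # decimal 0.1
five_eighths_is_finite = has_finite_binary_expansion(Fraction(5, 8))  # 0.101 in binary
print(tenth_is_finite, five_eighths_is_finite)  # False True

# A finite binary fraction always has a finite, exact decimal expansion:
exact_decimal = Decimal(157) / Decimal(2 ** 17)
print(exact_decimal)
```

The decimal number 0.1 has no finite binary record, while the binary fraction 157 * 2^-17 converts to decimal without any loss.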

Approximate numbers consist of correct (in the broad sense) and incorrect digits. Incorrect digits distort the final result of arithmetic operations. To prevent this, approximate numbers are rounded to the nearest correct digit.

Rounding a binary number reduces the number of correct digits retained in the rounded number and changes the incorrect digits in its decimal equivalent. The incorrect digits form the absolute error of the conversion.

Rounding a binary number to the nearest representable one increases the error in the representation of its decimal equivalent, because rounding reduces the number of significant binary digits that contribute to the decimal equivalent.

Converting a decimal number to binary with a given number of significant digits therefore yields a binary number whose decimal equivalent is an approximate number containing both correct and incorrect digits.
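As an illustration, the decimal equivalent of a stored binary number can be printed exactly. This sketch assumes that Python's `float` is a binary number with a 53-bit mantissa (IEEE 754 double precision), which holds on common platforms but is not stated in the text above.

```python
from decimal import Decimal

exact_decimal = Decimal("0.1")   # the decimal number we wanted to store
stored = Decimal(0.1)            # exact decimal equivalent of the stored binary number
error = stored - exact_decimal   # absolute error of the conversion

print(stored)
print(error)
```

The leading digits of `stored` agree with 0.1 and are the correct ones; the long tail after them consists of the incorrect digits that make up the conversion error.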

Let us determine how many correct digits n a decimal number is guaranteed to have when it is represented by an m-bit binary code in floating-point format.

If m binary bits are allocated in the machine word for the mantissa, the largest integer that can be written into such a mantissa is F_max = 2^m - 1. All m digits of this number are equal to 1. For m = 8, for example, F_max = 2^8 - 1 = 255 = 11111111₂. Now suppose we have a decimal integer with n significant digits. The largest such number consists of n digits, each equal to 9, so it can be written as F_max = 10^n - 1. For example, for n = 2, F_max = 10^2 - 1 = 99.

For a decimal number with n significant digits to be guaranteed representable by a binary code with an m-bit mantissa, the following condition must hold: 10^n - 1 ≤ 2^m - 1, or 10^n ≤ 2^m. Taking logarithms, log₁₀(10^n) ≤ log₁₀(2^m), or n ≤ m log₁₀ 2. Since log₁₀ 2 ≈ 0.3, this gives n ≤ 0.3m, and since m and n are integers, n ≤ ⌊0.3m⌋. So, for m = 8, we have n ≤ ⌊0.3 * 8⌋ = 2.
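The bound can be cross-checked against the exact condition 10^n ≤ 2^m using integer arithmetic. This is a minimal sketch; the function name `guaranteed_decimal_digits` is chosen here for illustration, and ⌊0.3m⌋ is computed as `(3 * m) // 10` to avoid floating-point rounding in the check itself.

```python
def guaranteed_decimal_digits(m: int) -> int:
    """Largest n with 10**n <= 2**m, found with exact integer arithmetic."""
    n = 0
    while 10 ** (n + 1) <= 2 ** m:
        n += 1
    return n

for m in (8, 24, 53, 57):
    exact_n = guaranteed_decimal_digits(m)
    approx_n = (3 * m) // 10      # floor(0.3 * m)
    print(m, exact_n, approx_n)
```

For these mantissa widths the exact search and the approximate formula agree: 2, 7, 15 and 17 digits respectively. Because 0.3 is slightly smaller than log₁₀ 2, the approximate formula can only underestimate, never overestimate, the exact bound.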

So far we have talked about an integer machine mantissa. In practice, the machine mantissa is usually treated as a fraction with the virtual point placed before the high-order bit. This virtual point turns the integer recorded in the machine mantissa into a fraction. Converting a binary integer with m significant digits to a proper fraction is equivalent to multiplying it by 2^-m. Thus, if every bit of an m-bit machine mantissa is one and the virtual point is assumed to be at the beginning of the mantissa, the number represented is the maximum one, M_maxd = 1 - 2^-m, where M_maxd is the largest fraction that can be represented in an m-bit machine mantissa with the virtual point at its beginning.

On the other hand, if the machine mantissa is treated as an integer, i.e. the virtual point is assumed to stand immediately after its lowest bit, then the largest integer M_maxc that can be written into it consists of all ones and equals M_maxc = 2^m - 1. If the point is now moved to the beginning of the mantissa, this is equivalent to M_maxd = M_maxc * 2^-m.

Thus, fractions in the range from 0 to M_maxc * 2^-m can be written to an m-bit machine mantissa, where M_maxc is the largest binary integer that fits in the machine mantissa.

In the general case, the position of the virtual point affects only the order of the characteristic of a number represented in exponential form. The value of the number itself remains unchanged.

For example, the largest integer that can be written into a three-bit machine mantissa is 111₂ = 7. With the virtual point before the most significant bit of the machine mantissa, this number has the value 0.111₂ = 7 * 2^-3 = 0.875. The number 101₂ = 5, recorded in the machine mantissa with the virtual point before the high-order bit, has the value 0.101₂ = 5 * 2^-3 = 0.625, and so on.
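The same scaling can be written as a small sketch in Python. The helper `mantissa_as_fraction` is a hypothetical name used only here; it reads an m-bit mantissa as an integer and multiplies it by 2^-m.

```python
from fractions import Fraction

def mantissa_as_fraction(bits: str) -> Fraction:
    """Value of an m-bit mantissa with the virtual point before the high-order bit."""
    m = len(bits)
    return Fraction(int(bits, 2), 2 ** m)

seven = mantissa_as_fraction("111")
five = mantissa_as_fraction("101")
print(float(seven), float(five))  # 0.875 0.625

# The all-ones mantissa gives M_maxd = 1 - 2**-m, here for m = 8.
m = 8
max_fraction = mantissa_as_fraction("1" * m)
print(max_fraction == 1 - Fraction(1, 2 ** m))  # True
```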

Different sources currently give different opinions on the accuracy with which decimal numbers are represented in binary code. Leaving aside the accuracy of binary arithmetic on decimal numbers represented in binary form, let us consider the accuracy of converting decimal numbers to binary.

In the English-language Wikipedia [5], regarding the double format, it is said that a 53-bit machine mantissa can represent decimal numbers with 15 to 17 significant digits. However, by the formula derived above, a 53-bit mantissa is guaranteed to represent a decimal number with n ≤ ⌊0.3 * 53⌋ = 15 significant digits. Indeed, the largest binary integer that can be written to a 53-bit mantissa consists of 53 ones. Its decimal value is 2^53 - 1 = 9007199254740991, which has 16 significant decimal digits. If one is added to it, the decimal number still has 16 significant digits, but its binary representation now has one more digit. Therefore, not all 16-digit decimal numbers can be represented in a 53-bit machine mantissa. At the same time, such a mantissa is guaranteed to hold every decimal number with at most 15 significant digits. A guaranteed representation of decimal numbers with 17 significant digits requires a machine mantissa of at least 57 bits, since ⌊0.3 * 57⌋ = 17.
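This can be observed directly, assuming that Python's `float` is a double with a 53-bit mantissa (the first line of the sketch checks that assumption). The sketch uses 2^53 + 1 as the 16-digit counterexample, because 2^53 itself has only one significant bit and is still stored exactly thanks to the exponent.

```python
import sys

# Assumption: float is IEEE 754 double precision with a 53-bit mantissa.
assert sys.float_info.mant_dig == 53

largest = 2 ** 53 - 1             # 9007199254740991, 16 decimal digits
largest_ok = int(float(largest)) == largest

sixteen_digits = 2 ** 53 + 1      # 9007199254740993, also 16 decimal digits
sixteen_ok = int(float(sixteen_digits)) == sixteen_digits

fifteen_digits = 10 ** 15 - 1     # largest 15-digit integer
fifteen_ok = int(float(fifteen_digits)) == fifteen_digits

print(largest_ok, sixteen_ok, fifteen_ok)  # True False True
```

The 16-digit number 9007199254740993 needs 54 significant bits and is rounded on conversion, while the largest 15-digit integer survives the round trip unchanged.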

To shift the integer mantissa considered in the examples above into the range of fractional numbers, all numbers must be multiplied by the scale factor 2^-53. Then the maximum binary fraction in the 53-bit machine mantissa will be:
0.11...1₂ (53 ones) = 9007199254740991 * 2^-53 = 1 - 2^-53 ≈ 0.99999999999999989

For a decimal number to be converted to binary as precisely as possible, its binary representation must retain as many significant digits as possible. Ideally, the number of significant digits in the binary mantissa equals the number of bits of the machine mantissa. To this end, the decimal number is expanded in powers of two until the number of significant binary digits equals the number of bits of the machine mantissa, or until the expansion terminates exactly. The number obtained is then normalized by shifting its most significant digit to the most significant bit of the machine mantissa, and the scale factor, equal to the number of shifts, is placed in the part of the machine word reserved for the order of the characteristic. Normalization does not change the value of the number, and hence does not change the accuracy of its representation.

For example, take a computer with an 8-bit mantissa. The guaranteed number of significant decimal digits representable in an 8-bit machine mantissa is n ≤ ⌊0.3 * 8⌋ = 2. Suppose we are given the number 0.0012. In binary it is ≈ 0.00000000010011101010010₂. Let us round this number to 8 significant digits, which matches the bit grid of the machine mantissa. We get 0.00000000010011101₂ = 0.00119781494140625 ≈ 0.0012. We normalize this number by placing all its significant digits in the machine mantissa: 0.00000000010011101₂ = 0.10011101₂ * 2^-9 = 0.61328125 * 2^-9 = 0.00119781494140625. As we can see, the value of the number did not change after normalization. The accuracy of the representation did not change either, since the number of significant digits remained the same. We simply obtained another form of recording the same number.

If, during normalization of a binary number, the number of shifts of its most significant digit exceeds what fits in the part of the machine word reserved for the order of the characteristic, the number cannot be written in normalized form. In that case accuracy is lost, because some or all of the significant digits of the number fall outside the bit grid of the machine mantissa.
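The rounding and normalization of 0.0012 for an 8-bit mantissa can be reproduced with exact fractions. This is only a sketch under stated assumptions: the function `to_machine_form` is hypothetical, it rounds to nearest (ties to even), and it does not model the limited width of the order field.

```python
from fractions import Fraction

def to_machine_form(x: Fraction, m: int) -> tuple:
    """Round x > 0 to m significant bits; return (M, p) with x ≈ (M / 2**m) * 2**p.

    M is the m-bit integer mantissa; the virtual point is assumed to stand
    before its high-order bit, so M / 2**m lies in [1/2, 1).
    """
    p = 0
    # Find the order p such that 1/2 <= x / 2**p < 1.
    while x / Fraction(2) ** p >= 1:
        p += 1
    while x / Fraction(2) ** p < Fraction(1, 2):
        p -= 1
    scaled = x / Fraction(2) ** p * 2 ** m
    M = round(scaled)          # round to nearest, ties to even
    if M == 2 ** m:            # rounding carried into a new high-order bit
        M //= 2
        p += 1
    return M, p

M, p = to_machine_form(Fraction(12, 10000), 8)
value = Fraction(M, 2 ** 8) * Fraction(2) ** p
print(format(M, "08b"), p, float(value))  # 10011101 -9 0.00119781494140625
```

The mantissa 10011101₂ and the order -9 match the hand calculation above; the error relative to 0.0012 comes from the rounding step, not from the normalization.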

When an arithmetic operation yields a binary number whose significant digits all lie within the machine mantissa, normalizing the result makes no difference from the point of view of arithmetic.

For example, suppose we have a computer with an 8-bit machine mantissa in which the virtual point stands before the high-order bit. Let us find the difference of two binary numbers: 0.10110000₂ - 0.10010011₂ = 0.00011101₂. The significant digits of the difference fit completely into the bit grid of the machine mantissa. The decimal equivalent of this difference is 0.6875 - 0.57421875 = 0.11328125. Now normalize the number 0.00011101₂. We have 0.00011101₂ = 0.11101₂ * 2^-3 = 0.90625 * 2^-3 = 0.11328125. We see that normalization did not change the value of the difference in any way, so in further calculations these two records of the result are mathematically equivalent.
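The subtraction example can be verified with exact fractions. This is a minimal sketch; the variable names are chosen here for illustration.

```python
from fractions import Fraction

a = Fraction(0b10110000, 2 ** 8)   # 0.10110000 in binary = 0.6875
b = Fraction(0b10010011, 2 ** 8)   # 0.10010011 in binary = 0.57421875
diff = a - b                       # 0.00011101 in binary

# Normalized record of the difference: 0.11101 (binary) * 2**-3
normalized = Fraction(0b11101, 2 ** 5) * Fraction(1, 2 ** 3)

print(float(diff), float(normalized), diff == normalized)
```

Both records evaluate to 29/256 = 0.11328125, so the comparison yields `True`.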

SUMMARY

Any number can be represented in exponential form with the virtual point located anywhere in the mantissa. The offset of the point from its natural position is recorded in the part of the machine word reserved for the order of the characteristic.

The width of the machine mantissa determines how many distinct numbers can be represented in it, and the position of the virtual point determines the region of the number axis where these numbers lie.

The position of the virtual point does not affect the accuracy of the representation of a number.

The accuracy of the representation of a binary number depends on the number of significant mantissa digits that can be written into the machine word.

Every binary number has an exact, representable decimal equivalent, but not every decimal number is representable in a machine word. In general, decimal real numbers can be represented in binary only approximately, while all binary numbers can be represented in decimal exactly.

Rounding a binary number discards the digits that do not fit in the bit grid of the machine mantissa. This reduces the number of correct digits retained in the rounded binary number and, as a result, the accuracy of its decimal equivalent.

Rounding binary real numbers changes the incorrect digits in their decimal equivalent but does not eliminate them.

The incorrect decimal digits in the decimal equivalent of a binary number obtained from an exact decimal number form the absolute error of the conversion.

A machine mantissa with m bits is guaranteed to represent a decimal number with n ≤ ⌊0.3m⌋ correct digits.

Normalization of a binary number does not change its value if it is carried out without rounding.

Normalization does not affect the accuracy of a number if it is carried out without rounding.

SOURCES

1. "What you need to know about floating point arithmetic"
2. "Everything, point, sailed! We learn to work with floating-point numbers and develop an alternative with fixed precision decimals"
3. "Compensation errors in operations with floating point numbers"
4. "Floating point calculations: can you trust the results?"
5. Wikipedia
6. www.softelectro.ru/ieee754.html
7. "Do I need normalization in floating point numbers"

Source: https://habr.com/ru/post/322984/
