
A few days ago I came across an interesting question: what would be the result of executing this code?
double a = 2.0 - 1.1;
or this one:
double f = 0.0;
for (int i = 1; i <= 10; i++) {
    f += 0.1;
}
Contrary to all my expectations, the answers were:
0.89999999999999991 in the first case and
0.99999999999999989 in the second.
If you want to know why, and to learn a few more interesting facts about this data type, you are welcome to read on.
In general, the answer to the above question sounds something like this: “Such errors stem from the internal binary representation of numbers. Just as the result of dividing 1 by 3 cannot be represented exactly in the decimal system, 1/10 cannot be represented exactly in the binary system. If you need to eliminate rounding errors, use the BigDecimal class.”
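For instance, here is a minimal sketch of what that advice looks like in practice (the String constructor is used deliberately, since new BigDecimal(1.1) would already carry the binary rounding error of the double literal):

import java.math.BigDecimal;

// The String constructor captures the decimal values exactly.
BigDecimal a = new BigDecimal("2.0").subtract(new BigDecimal("1.1"));
System.out.println(a); // 0.9

BigDecimal f = BigDecimal.ZERO;
for (int i = 1; i <= 10; i++) {
    f = f.add(new BigDecimal("0.1"));
}
System.out.println(f); // 1.0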
There is an important distinction between abstract real numbers, such as π or 0.2, and the double data type in Java. First, the platonically perfect representation of a real number is infinite, while its representation in Java is limited to a fixed number of bits. However, the accuracy of calculations is an even more pressing problem than the limit on the magnitude of numbers. Even more intriguing is the rather original way numbers are rounded, but first things first.
It’s probably worth starting with the binary representation of integers; this material will be useful later. The simplest way to represent integers is the so-called "direct code" (sign-and-magnitude), in which the most significant bit records the sign of the number (0 for positive, 1 for negative), and the remaining bits record the value itself. Thus, the number "-9" in an eight-bit representation looks like 10001001. The disadvantage of this approach is the presence of two zeros ("+0" and "-0") and the complication of arithmetic operations with negative numbers. Another option, the one that interests us, is the "shift code" (a biased, or excess, representation), in which, in simple terms, we add to our number a constant equal to 2^(n-1), where n is the number of digits (bits). In our case, the example with the number "-9" in the eight-bit representation looks like this:
-9 + 2^(8-1) = -9 + 128 = 119. In binary form we get 01110111. This option is convenient because there is only one zero, but the offset has to be taken into account in arithmetic operations.
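A minimal sketch of this encoding in Java (the variable names are mine, just for illustration):

int value = -9;
int n = 8; // width in bits
int biased = value + (1 << (n - 1)); // add the bias 2^(n-1): -9 + 128 = 119
// Pad the binary string with leading zeros to n = 8 digits.
String code = String.format("%8s", Integer.toBinaryString(biased)).replace(' ', '0');
System.out.println(code); // 01110111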
Here it is worth making a digression. One of the stated goals of the Java language is machine independence: calculations must produce the same result regardless of which virtual machine performs them. For floating-point arithmetic this unexpectedly proved to be a difficult task. The double type uses 64 bits to store numeric values, but some processors have 80-bit floating-point registers. These registers provide additional accuracy at intermediate stages of a calculation: an intermediate result is stored in an 80-bit register, after which the answer is rounded to 64 bits. However, the result may differ if all the calculations are performed using 64-bit arithmetic. For this reason, the original JVM specification required all intermediate results to be rounded. This provoked a protest from many specialists, since such rounding not only can lead to overflow, but also makes the calculations slower. As a result, JDK 1.2 introduced support for the strictfp keyword, which guarantees the reproducibility of the results of all calculations performed within a given method, class, or interface (or rather, its implementation). In other words, the strictfp keyword ensures that floating-point computations behave identically on every platform, with a fixed accuracy, even if some platforms can compute with greater precision. Interestingly, in the x86 family the floating-point hardware was originally a separate chip, the floating-point unit (FPU); starting with the MMX Pentium processors, it has been integrated into the CPU.
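A minimal sketch of where the keyword can appear (the class and method names are mine):

public strictfp class StrictCalculator {
    // All floating-point arithmetic in this class is evaluated
    // with strict IEEE 754 64-bit semantics, with no
    // extended-precision intermediate results.
    public double scale(double value, double factor) {
        return value * factor;
    }
}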
Moving on. The IEEE 754 standard tells us that real numbers must be written in exponential form. This means that some of the bits encode the so-called mantissa of the number, another group encodes the exponent (the power), and one more bit indicates the sign of the number (0 if the number is positive, 1 if negative). Mathematically, this is written like this:
(-1)^s × M × 2^E, where s is the sign, M is the mantissa, and E is the exponent. The exponent is written with a shift, similar to the one described above (for double the bias equals 2^(n-1) - 1 = 1023, where n = 11 is the width of the exponent field).
What are the mantissa and the exponent? The mantissa is a fixed-length integer that represents the most significant bits of a real number. Suppose our mantissa consists of four bits (|M| = 4). Take, for example, the number "9", which in binary equals 1001.
The exponent (also called the "order" or "power") is the power of the base (two) of the most significant digit. You can think of it as the number of digits before the point separating the fractional part of the number. If the exponent is a variable written to a register and unknown at compile time, the number is called a "floating-point number". If the exponent is known in advance, the number is called a "fixed-point number". Fixed-point numbers can be written to ordinary integer variables (registers) by saving only the mantissa. For floating-point numbers, both the mantissa and the exponent are recorded, in the so-called standard form, for example "1.001e+3". Here it is immediately evident that the mantissa consists of four digits and the exponent equals three.
Suppose we want to get a fractional number using the same four bits of mantissa. We can do this if we take, say, E = 2. Then our number will be equal to
1.001e+2 = 1 × 2^2 + 0 × 2^1 + 0 × 2^0 + 1 × 2^(-1) = 4 + 0.5 = 4.5
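A quick sanity check of this arithmetic in Java (the bit string and the power of two are hard-coded just for this illustration):

// 1.001 in binary is 1001 / 2^3 = 1.125
double mantissa = Integer.parseInt("1001", 2) / 8.0;
System.out.println(mantissa * Math.pow(2, 2)); // 1.125 * 4 = 4.5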

One of the problems with this approach is that the same number has several representations within the same mantissa length. Our "9", with a mantissa length of 5, can be represented as 1.00100e+3, as 0.10010e+4, or as 0.01001e+5. This is inconvenient for the hardware, since the multiplicity of representations has to be taken into account when comparing numbers and when performing arithmetic operations on them. Moreover, it is uneconomical: the number of representations is finite, and repetitions reduce the set of numbers that can be represented at all. However, there is a little trick. It turns out that the exponent can be used to deduce the value of the first bit of the mantissa: if all bits of the exponent are 0, the first bit of the mantissa is also considered to be zero; otherwise it equals one. Floating-point numbers in which the first bit of the mantissa is one are called normalized. Floating-point numbers in which the first bit of the mantissa is zero are called denormalized; with their help, much smaller values can be represented. Since the first bit can always be deduced, there is no need to store it explicitly. This saves one bit, since the implicit unit does not need to be kept in memory, and it provides a unique representation of each number. In our example with "9", the normalized representation is 1.00100e+3, and the mantissa is stored in memory as "00100", since the leading one is implied. The problem with this approach is the impossibility of representing zero, which I will talk about later.
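To see the two kinds of numbers in Java (assuming Java 6 or later, where the Double.MIN_NORMAL constant is available):

// The smallest positive normalized double (implicit leading one).
System.out.println(Double.MIN_NORMAL); // 2.2250738585072014E-308
// The smallest positive denormalized double (leading bit is zero).
System.out.println(Double.MIN_VALUE); // 4.9E-324
// Denormalized values fill the gap between 0 and MIN_NORMAL.
System.out.println(Double.MIN_VALUE < Double.MIN_NORMAL); // true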
By the way, starting with JDK 1.5 it is permissible to specify floating-point numbers in hexadecimal format. For example, 0.125 can be written as 0x1.0p-3. In hexadecimal notation, the letter "p" is used instead of "e" to indicate the exponent, which here is a power of two.
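For example (the values in the comments follow from the formula above):

System.out.println(0x1.0p-3); // 1.0 * 2^(-3) = 0.125
System.out.println(0x1.8p1);  // 1.5 * 2^1 = 3.0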
Things to remember when working with Double:
- Integer division by 0 throws an exception, while the result of dividing floating-point numbers by zero is infinity (or NaN in the case of 0.0 / 0.0). By the way, I was interested to learn that the JVM developers, following the same IEEE 754 standard, also introduced the values Double.NEGATIVE_INFINITY and Double.POSITIVE_INFINITY, equal to -1.0 / 0.0 and 1.0 / 0.0, respectively.
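A quick demonstration (the integer case is commented out so that the snippet runs to the end):

System.out.println(1.0 / 0.0);  // Infinity
System.out.println(-1.0 / 0.0); // -Infinity
System.out.println(0.0 / 0.0);  // NaN
// System.out.println(1 / 0);   // would throw ArithmeticException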
- Double.MIN_VALUE is actually not the smallest number that can be stored in a double. Remember we said that, according to the IEEE 754 standard, the leading unit of the mantissa is implicit? So, as already stated above, zero cannot be represented in the normalized form of a floating-point number, since there is no power of two that equals zero. The JVM developers introduced the Double.MIN_VALUE constant, which is, in fact, the closest possible value to zero. The smallest value you can store in a double is -Double.MAX_VALUE.
System.out.println(0.0 > Double.MIN_VALUE); // prints "false": Double.MIN_VALUE is a positive number
- Developing the previous topic, here is another interesting example showing that not everything is as obvious as it may seem at first glance. Double.MAX_VALUE is 1.7976931348623157E308, but what happens if we convert a string containing a floating-point number to a double?
System.out.println(Double.parseDouble("1.7976931348623157E308"));
It turns out that between Double.MAX_VALUE and Double.POSITIVE_INFINITY there are still values that are rounded to one side or the other during conversion. This is worth examining in more detail.
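For instance, these two strings differ from Double.MAX_VALUE only in the last decimal digit, yet they land on opposite sides of the rounding boundary (the midpoint between Double.MAX_VALUE and 2^1024):

System.out.println(Double.parseDouble("1.7976931348623158E308")); // 1.7976931348623157E308
System.out.println(Double.parseDouble("1.7976931348623159E308")); // Infinity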
The set of real numbers is infinitely dense. There is no such thing as the next real number: between any two real numbers there is another real number. This property does not hold for floating-point numbers. For every float or double there is a next number, and there is a minimum finite distance between two consecutive numbers of type float or double. The Math.nextUp() method returns the next floating-point number greater than its argument. For example, this code prints all float numbers between 1.0 and 2.0 inclusive:
float x = 1.0F;
int numFloats = 0;
while (x <= 2.0) {
    numFloats++;
    System.out.println(x);
    x = Math.nextUp(x); // step to the next representable float
}
System.out.println(numFloats); // 8388609
It turns out that in the interval from 1.0 to 2.0 inclusive there are exactly 8,388,609 float numbers. That is a lot, but far fewer than the infinite number of real numbers in the same range. Consecutive float numbers in this interval are approximately 0.0000001 apart. This distance is called the unit of least precision (ULP). For double, the situation is completely analogous, except that the number of representable values is much larger.
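Java exposes this distance directly through the Math.ulp() method:

System.out.println(Math.ulp(1.0F)); // 1.1920929E-7, the float spacing near 1.0
System.out.println(Math.ulp(1.0));  // 2.220446049250313E-16, the double spacing near 1.0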
Perhaps that's all. Those who wish to "dig deeper" may find the following code useful:
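For example, a minimal sketch (my own illustration) that splits a double into the sign, exponent, and mantissa fields described above, using Double.doubleToLongBits():

double value = 9.0; // 1.001e+3 in binary, as in the example above
long bits = Double.doubleToLongBits(value);

long sign = (bits >>> 63) & 0x1L;        // 1 sign bit
long exponent = (bits >>> 52) & 0x7FFL;  // 11 bits of biased exponent
long mantissa = bits & 0xFFFFFFFFFFFFFL; // 52 stored mantissa bits (the leading one is implicit)

System.out.println("sign     = " + sign);              // 0
System.out.println("exponent = " + (exponent - 1023)); // 3, after subtracting the bias
System.out.println("mantissa = " + Long.toBinaryString(mantissa));
// the stored fraction is 001 followed by 49 zeros;
// Long.toBinaryString drops the two leading zeros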
Thanks to everyone who read to the end. I will be glad to receive constructive criticism and additions.
Materials on the topic:
- New Java Math Features, Part 2: Floating Point Numbers
- IEEE Standard 754 Floating Point Numbers
- Java Language and Virtual Machine Specifications
- Representation of real numbers
- What you need to know about floating-point arithmetic
- Float Arithmetic