
Understanding floating point numbers (part 0)

Hello, Habr readers. I have long been fascinated by floating-point numbers. I always wondered how output to the screen works, and so on. I remember that long ago, at university, I implemented my own 512-bit floating-point number class. The only thing I could never get working was the output to the screen.

As soon as I had some free time, I picked the old project back up. I got myself a notebook and set to work. I wanted to figure everything out on my own, only occasionally consulting the IEEE 754 standard.
And here is what came of it all. Those interested, read on.

To follow this article, you need to know the following: what a bit is, the binary number system, and arithmetic up to negative powers. The article will not cover engineering details of the implementation at the processor level, nor normalized and denormalized numbers. The emphasis is on converting numbers into binary form and back, and on explaining how floating-point numbers are stored as bits in general.

Floating-point numbers are a very powerful tool that you need to know how to use correctly. They are not as trivial as integers, but not as complicated as they may seem if you work through them carefully and unhurriedly.
In today's article I will use 32-bit numbers as examples. Double-precision (64-bit) numbers work according to exactly the same logic.

First, let's talk about how floating-point numbers are stored. The most significant bit, bit 31, is the sign bit: a one means the number is negative, and a zero, accordingly, means it is positive. Next come 8 bits of exponent; these 8 bits are an ordinary unsigned number. And at the very end are the 23 bits of the mantissa. For convenience, we will denote the sign as S, the exponent as E, and, oddly enough, the mantissa as M.

We get the general formula: (-1) ^ S × M × 2 ^ (E - 127)

The mantissa has one implicit leading one bit. That is, the mantissa is really 24 bits, but since its most significant bit is always one, that bit can be omitted. For example, 1.0 is stored with E = 127 and all 23 explicit mantissa bits set to zero: the implicit one alone supplies its value. This "restriction" is what gives us a unique representation for every number.

The mantissa is an ordinary binary number, but unlike integers, its most significant bit corresponds to 2 ^ 0, and the following bits to decreasing powers. This is where the exponent comes into play: depending on its value, the power of two assigned to each bit shifts up or down. That is the whole genius of this idea.

Let's try to show it with a clear example:

Let's represent the number 3.625 in binary form. First, we decompose the number into powers of two:

3.625 = 2 + 1 + 0.5 + 0.125 = 1 × 2 ^ 1 + 1 × 2 ^ 0 + 1 × 2 ^ -1 + 0 × 2 ^ -2 + 1 × 2 ^ -3

The power of the highest two equals one, so E - 127 = 1, and E = 128.

0 10000000 11010000000000000000000

And that is our entire number.
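As a quick cross-check, here is a minimal sketch (my own addition, separate from the program shown later) that pulls S, E, and M out of the bits of 3.625f with shifts and masks:

#include <stdio.h>

int main(void) {
    /* the union lets us view the same 32 bits as an integer */
    union { float f; unsigned int u; } v = { 3.625f };

    unsigned int s = v.u >> 31;           /* sign: bit 31 */
    unsigned int e = (v.u >> 23) & 0xFF;  /* exponent: bits 30..23 */
    unsigned int m = v.u & 0x7FFFFF;      /* mantissa: bits 22..0 */

    /* expected: S = 0, E = 128 (power 1), M = 0x680000,
       i.e. 1101 followed by 19 zeros */
    printf("S = %u, E = %u (power %d), M = 0x%06X\n", s, e, (int)e - 127, m);
    return 0;
}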

Now let's try the opposite direction. Suppose we have an arbitrary 32 bits:

0 10000100 (1) 11011100101000000000000

The implicit leading bit is shown in parentheses.

First we compute the exponent: E = 132. Accordingly, the power of the highest two equals 5. In total we have the following number:

2 ^ 5 + 2 ^ 4 + 2 ^ 3 + 2 ^ 1 + 2 ^ 0 + 2 ^ -1 + 2 ^ -4 + 2 ^ -6 =
= 32 + 16 + 8 + 2 + 1 + 0.5 + 0.0625 + 0.015625 = 59.578125

It is not difficult to see that we can only store a span of 24 powers of two. Accordingly, if the exponents of two numbers differ by more than 24, then after addition the result remains equal to the larger of the two.
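A minimal sketch to see this effect (the specific constants are my own illustration): 2^23 + 1 still fits into the 24 bits of mantissa, while for 2^25 + 1 the exponents differ by 25, so the one is simply lost:

#include <stdio.h>

int main(void) {
    float a = 8388608.0f;   /* 2^23: exponents differ by 23, the sum still fits */
    float b = 33554432.0f;  /* 2^25: exponents differ by 25, the 1 is lost */

    printf("%f\n", a + 1.0f);  /* prints 8388609.000000 */
    printf("%f\n", b + 1.0f);  /* prints 33554432.000000 */
    return 0;
}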

For convenient conversion, here is a small program in C.

#include <stdio.h>

/* The union lets us view the same 32 bits as either a float or an integer. */
union IntFloat {
    unsigned int integerValue;
    float floatValue;
};

/* Print all 32 bits, with spaces separating sign, exponent, and mantissa. */
void printBits(unsigned int x) {
    int i;
    for (i = 31; i >= 0; i--) {
        if ((x & ((unsigned int)1 << i)) != 0) {
            printf("1");
        } else {
            printf("0");
        }
        if (i == 31) {  /* after the sign bit */
            printf(" ");
        }
        if (i == 23) {  /* after the exponent */
            printf(" ");
        }
    }
    printf("\n");
}

int main() {
    union IntFloat b0;

    /* float -> bits */
    b0.floatValue = 59.578125;
    printBits(b0.integerValue);

    /* bits -> float */
    b0.integerValue = 0b01000010011011100101000000000000;
    printf("%f\n", b0.floatValue);

    return 0;
}
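Assuming the file is saved as float_bits.c (the name is mine) and a GCC- or Clang-compatible compiler is used (the 0b binary literals are a compiler extension, standardized only in C23), a run should look like this:

$ cc float_bits.c -o float_bits && ./float_bits
0 10000100 11011100101000000000000
59.578125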

The grid spacing is the minimum difference between two adjacent floating-point numbers. If you interpret the bit pattern of such a number as an ordinary integer, then the adjacent floating-point number, viewed as an integer, differs from it by exactly one.

To put it another way: two adjacent floating-point numbers differ by 2 ^ (E - 127 - 23). That is, the difference equals the value of the least significant bit of the mantissa. For example, for the number 59.578125 above (E = 132), the spacing is 2 ^ (132 - 127 - 23) = 2 ^ -18 ≈ 0.0000038.

As a demonstration, you can replace the body of main in the code and compile it again.

union IntFloat b0, b1, b2;

b0.floatValue = 59.578125F;
b1.integerValue = b0.integerValue + 1;  /* the adjacent representable number */
b2.floatValue = b1.floatValue - b0.floatValue;

printBits(b0.integerValue);
printBits(b1.integerValue);
printBits(b2.integerValue);
printf("%f\n", b0.floatValue);
printf("%f\n", b1.floatValue);
printf("%f\n", b2.floatValue);

short exp1 = 0b10000100;  /* 132, the exponent of the numbers above */
short exp2 = 0b01101101;  /* 109 = 132 - 23, the exponent of their difference */

/* A mantissa of all ones: adding 1 to the integer representation
   carries over into the exponent field. */
b0.integerValue = 0b01000010011111111111111111111111;
b1.integerValue = b0.integerValue + 1;
b2.floatValue = b1.floatValue - b0.floatValue;

printBits(b0.integerValue);
printBits(b1.integerValue);
printBits(b2.integerValue);
printf("%f\n", b0.floatValue);
printf("%f\n", b1.floatValue);
printf("%f\n", b2.floatValue);

/* the two exponents in decimal */
printf("%d %d\n", exp1, exp2);
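The standard library offers an independent check of the same spacing: nextafterf from <math.h> returns the adjacent representable number, and ldexpf computes powers of two directly. A minimal sketch of such a cross-check (my own addition; link with -lm on Linux):

#include <stdio.h>
#include <math.h>

int main(void) {
    float x = 59.578125f;                  /* E = 132 for this number */
    float next = nextafterf(x, INFINITY);  /* the adjacent number, one step up */

    printf("%.10f\n", next - x);                      /* spacing measured directly */
    printf("%.10f\n", ldexpf(1.0f, 132 - 127 - 23));  /* 2^-18 from the formula */
    return 0;
}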

I think we can wrap up for today, or else the article will turn out too long. Next time I will write about the addition of floating-point numbers and the loss of precision due to rounding.

PS: I understand that I did not touch on denormalized numbers and the like. I simply did not want to overload the article, and that information can easily be found in the IEEE 754 standard, almost at the very beginning.

Source: https://habr.com/ru/post/456714/

