
That's it, the point has sailed away! We learn to work with floating-point numbers and develop an alternative with fixed decimal precision.



Today we will talk about real numbers, or more precisely, how the processor represents them when computing fractional values. Each of us has run into output like 3.4999990123 instead of 3.5 or, worse, a huge difference between the theoretical result of a calculation and what the program code actually produced. There is no terrible secret here: we will discuss the pros and cons of the floating-point representation, consider the alternative fixed-point path, and write a decimal fraction class with fixed precision.

Where the point floats


It is no secret that the processor did not always understand real numbers. At the dawn of programming, before the first coprocessors, real numbers were not supported in hardware and were emulated algorithmically on top of the integers the processor handled well. Thus the real type in good old Pascal was a progenitor of today's real numbers, but it was a layer over an integer whose bits were logically interpreted as a mantissa and an exponent of a real number.

A mantissa is essentially the number written without the dot. The exponent is the power to which a certain base N must be raised (as a rule, N = 2) so that, multiplied by the mantissa, it yields the desired number (up to the bit width of the mantissa). It looks like this:
x = m * N^e, where m and e are integers: m is called the mantissa, e the exponent, and N is the base of the representation.

To avoid ambiguity, it is agreed that 1 <= |m| < N; that is, the number is written as if it had exactly one digit before the decimal point, but then the point was maliciously erased and the number turned into a whole one.

The mantissa is essentially the number written without the dot; the exponent is the power to which the base N must be raised (as a rule, N = 2) so that, multiplied by the mantissa, it yields the desired number

Until quite recently, developers tried to avoid floating-point operations, and indeed any algorithms with real numbers. A coprocessor for real-number operations was not present on all processors, and where it was, it did not always work efficiently. But times change: today floating-point operations are built into the processor core, and video chips also crunch real numbers actively, parallelizing operations of the same kind.

Modern real numbers, supported in hardware at the processor level, are likewise broken down into a mantissa and an exponent. All the operations familiar from integer arithmetic are of course available as processor instructions for real numbers too, and they run as fast as possible.

Why such a complicated format? First of all, it makes multiplication and division of such numbers quite efficient; besides, it is easy to recover the original real number from it. This representation is called a floating-point number.

The floating-point standard


Floating-point real numbers supported at the processor level are described by the international standard IEEE 754. The two main types for any calculation are single-precision and double-precision floating-point numbers. These names directly reflect the bit widths of their binary representations: 32 bits are allocated to a single-precision number and 64 bits, oddly enough exactly twice as many, to a double-precision one.

In addition to single and double precision, the newer edition of the standard, IEEE 754-2008, also provides extended-, quadruple- and even half-precision types. In C/C++, however, besides float and double there is only the long double type, which Microsoft persistently declines to support: Visual C++ silently substitutes a plain double for it. Half and quadruple precision are, as a rule, not yet supported by processor cores either. So when picking a floating-point representation we effectively choose between float and double.
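A quick way to see exactly what we are choosing between is to ask the compiler itself; a minimal sketch using std::numeric_limits:

    #include <iostream>
    #include <limits>

    int main()
    {
        // digits   = mantissa bits, including the implicit leading 1
        // digits10 = decimal digits that survive a round trip through the type
        std::cout << "float:  " << std::numeric_limits<float>::digits << " mantissa bits, "
                  << std::numeric_limits<float>::digits10 << " exact decimal digits\n";
        std::cout << "double: " << std::numeric_limits<double>::digits << " mantissa bits, "
                  << std::numeric_limits<double>::digits10 << " exact decimal digits\n";
    }

On an IEEE 754 platform this prints 24 mantissa bits and 6 exact decimal digits for float, 53 bits and 15 digits for double.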

According to the standard, the base of the exponent is 2, so the formula above reduces to the following:

x = m * 2^e, where 1 <= |m| < 2, m is the mantissa and e is an integer exponent

The bit layout of a single-precision number looks like this:

| 1 sign bit | 8 exponent bits | 23 mantissa bits |

For double precision, we can use more bits:

| 1 sign bit | 11 exponent bits | 52 mantissa bits |

In both cases, a sign bit of 0 means the number is positive, and 1 is set for negative numbers. This rule is similar to integers, with one difference: unlike with integers, to obtain the additive inverse of a floating-point number it is enough to flip the single sign bit.

Since the mantissa of a normalized number is written in binary, its integer part is always equal to 1, so one bit of the mantissa is always implied and is not stored in the binary record. What the mantissa bits store is the fractional part of the normalized number, in binary notation.

The mantissa is written in binary, and the integer part, which is obviously equal to 1, is dropped; so never forget that the mantissa is one bit longer than what is actually stored

You do not need a doctoral degree to work out the precision, in decimal digits, that this standard can represent: 2^(23+1) = 16,777,216 tells us that single precision is good for slightly more than seven significant decimal digits. That means we cannot store, for example, the number 123,456.78 in this format: a modest number, all in all, but starting from the hundredths digit we get wrong digits. It gets worse for large numbers: 1,234,567,890 fits comfortably even into a 32-bit integer, yet as a float it picks up an error of hundreds of units! That is why C++ uses the double type for real numbers by default. The double-precision mantissa is good for more than 15 digits, 2^52 = 4,503,599,627,370,496, and comfortably holds every 32-bit integer, failing only on really large 64-bit integers (19 decimal digits), where an error of hundreds of units is usually irrelevant. If you need more accuracy than that, this article will definitely help.
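Both effects are easy to reproduce; a minimal sketch (the exact printed tails depend on the platform, the values shown in the comments are for ordinary IEEE 754 hardware):

    #include <iomanip>
    #include <iostream>

    int main()
    {
        float f   = 123456.78f;    // ~7 significant digits: breaks in the hundredths
        float big = 1234567890.0f; // fits a 32-bit integer exactly, but not a float

        std::cout << std::fixed
                  << std::setprecision(4) << f   << "\n"  // 123456.7812
                  << std::setprecision(0) << big << "\n"; // 1234567936, off by 46
    }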

Now for the exponent. It is stored as an ordinary binary integer: the power to which 2 must be raised so that, multiplied by the normalized mantissa, it gives the original number. The standard, in addition, introduces an offset that must be subtracted from the stored binary value to obtain the actual power of two (the so-called biased exponent). The exponent is biased to simplify comparison operations: the offset is 127 for single precision and 1023 for double. This all sounds extremely complicated, which is why many people skip the chapter on the floating-point type. And in vain!

Approximate floating


To make it a little clearer, let's take an example and encode the number 640 (= 512 + 128) in binary as a single-precision real number:
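Since 640 = 1.25 * 2^9, the fields are: sign 0; exponent 127 + 9 = 136 = 10001000 in binary; mantissa fraction 0.25 = 0.01 in binary, stored as 01000000000000000000000:

| 0 | 10001000 | 01000000000000000000000 |

You can verify this yourself by reinterpreting the bits of the float; a small sketch (in C++20, std::bit_cast would do the same job):

    #include <bitset>
    #include <cstdint>
    #include <cstring>
    #include <iostream>

    int main()
    {
        float value = 640.0f; // 1.25 * 2^9
        uint32_t bits = 0;
        std::memcpy(&bits, &value, sizeof bits); // a legal way to inspect the bytes

        std::cout << ((bits >> 31) & 1)         << " | "  // sign: 0
                  << std::bitset<8>(bits >> 23) << " | "  // exponent: 10001000 (= 136)
                  << std::bitset<23>(bits)      << "\n";  // mantissa: 0100...0 (= 0.25)
    }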


For double precision almost everything is the same, except the mantissa has even more zeros on the right in its fractional part, and the exponent is 1023 + 9 = 1024 + 8, that is, a leading one, a few more zeros between the high bit and the eight: 10000001000. All in all, it is not so scary once you work through it carefully.

Homework: work out the binary representations of the following constants: plus and minus infinity (INF, infinity), zero, minus zero, and the not-a-number value (NaN).

Don't swim beyond the buoys!


There is one important rule: every number format has limits beyond which you must not swim. Moreover, the programmer has to police those limits personally, because a C/C++ program will keep a perfectly straight face while handing you a small negative number as the sum of two large positive integers. But whereas for integers only the maximum and minimum values need watching, for floating-point numbers you should worry not so much about the extreme values as about the number of significant digits. Thanks to the exponent, the largest double exceeds 10^308, and even the single-precision exponent can encode numbers above 10^38. Compare that with the "miserable" 10^19 maximum of 64-bit integers, and it is clear that the maximum and minimum values will hardly ever be your problem, though you should not forget about them either.

Whereas for integers only the maximum and minimum values need watching, for floating-point numbers you should worry not so much about the extreme values as about the number of significant digits.

Accuracy is another matter. A pitiful 23 bits of mantissa produce an error as early as the 8th decimal digit. For double-precision numbers the situation is not as deplorable, but even 15 decimal digits quickly become a problem if, for example, the data requires 6 fixed digits after the point: that leaves only 9 digits before it. Any multi-billion sums will then carry a noticeable error in the fractional part. Process such numbers intensively enough and billions of euros can vanish simply because they "did not fit", while the fractional error quietly accumulates into a huge balance of unaccounted money.
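The symptom is easy to reproduce even without billions:

    #include <cstdio>

    int main()
    {
        // Neither 0.1 nor 0.2 has a finite binary expansion,
        // so their sum is not the double closest to 0.3
        double a = 0.1, b = 0.2;
        std::printf("%.17f\n", a + b);                              // 0.30000000000000004
        std::printf("%s\n", a + b == 0.3 ? "equal" : "not equal"); // not equal
    }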

If only this were just theory! In practice not even a thousandth of a cent may go missing: the error of every operation must be strictly zero. That is why business logic is, as a rule, written not in C/C++ but in C# or Python, whose standard libraries already include a Decimal type that processes decimal fractions with zero error at a specified number of places after the point. What can we C++ programmers do when faced with processing numbers of very high precision without resorting to high-level languages? The same as usual: fill the gap ourselves by creating one small data type for high-precision decimal fractions, similar to the Decimal types of the high-level libraries.

Let's cement the floating point


Time to fix the point in place. Since we are giving up the floating-point types because of their accuracy problems, we are left with the integer types, and since we want maximum precision, we will take integers of the maximum width: 64 bits.

Today, for educational purposes, we will build a representation of real numbers with guaranteed accuracy up to 18 digits after the decimal point. It is achieved by simply combining two 64-bit integers for the integer and fractional parts respectively. In principle, nothing stops us from taking an array of values for each component instead of a single number and getting full-blown "long" arithmetic, but what we have is already more than enough to solve the accuracy problem: 18 digits before and 18 after the decimal point, with the point fixed between the two values and poured over with cement.

Give me a decimal!


First, a little theory. Denote our two components, the integer and fractional parts of a number, as n and f; the number itself is then representable as

x = n + f * 10^-18, where n and f are integers, 0 <= f < 10^18.

For the integer part a signed 64-bit integer suits best, and for the fractional part an unsigned one; this will simplify many operations later.

    class decimal
    {
        …
    private:
        int64_t  m_integral;
        uint64_t m_fractional;
    };

The integer part here is the largest integer not exceeding the represented number; the fractional part is the number minus its integer part, multiplied by 10^18 and truncated to an integer: f = (x - n) * 10^18.

For a negative number the integer part is larger in absolute value than the number itself, and the fractional part does not match the decimal notation of the number: for -1.67 the components are n = -2 and f = 0.33 * 10^18. Such a representation, however, lets us simplify and speed up the addition and multiplication algorithms, since no branching is needed for negative numbers.

Decimal type operations


Of course, a high-precision number type is useless without arithmetic operations. Addition is implemented relatively simply:

    x = a + b * 1e-18,
    y = c + d * 1e-18,
    z = x + y = e + f * 1e-18,
    a, c, e: int64_t; b, d, f: uint64_t; 0 <= b, d, f < 1e+18

    z = (a + b * 1e-18) + (c + d * 1e-18)
    e = a + c + [b * 1e-18 + d * 1e-18]
    f = {b * 1e-18 + d * 1e-18} * 1e+18

NB: here and below, all records of the form 1e+18 denote integers, not floating-point literals.

Here [n] denotes taking the integer part of a number and {n} its fractional part. All good, but remember the limits of integers: the value 1e+18 is already close to the edge of the unsigned 64-bit type uint64_t (which is exactly why we chose it). Still, nothing prevents us from simplifying the expression a little so that, given the initial conditions, we are guaranteed to stay within the type:

    e = a + c + (b + d) div 1e+18,
    f = (b + d) mod 1e+18.

Two things should always be kept in mind when implementing operations on numbers, since they are meant for intensive use. First, always optimize the algorithm by minimizing multiplications and divisions; simplify the expression mathematically in advance so that this comes easily. In our case that means minimizing integer divisions with remainder. Second, be sure to check every situation where a value can overflow the bounds of the underlying type; otherwise the users of your type will run into very non-obvious errors.

Of course, the boundary values are worth checking when adding a and c. Also, since b and d are each less than 1e+18, we know that (b + d) < 2 * 1e+18, which means the carry adds at most one; so, algorithmically, the division can be dropped entirely to speed up the operation:

    e = a + c;
    f = b + d;
    if (f >= 1e+18)
        f -= 1e+18, ++e;

The checks against the maximum integer value for e are omitted here for simplicity.
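In C++ the same addition might look like the sketch below; the member names follow the class above, while the operator signature and a default constructor are our assumptions, and the overflow check on the integer part is omitted just as in the pseudocode:

    decimal decimal::operator+(const decimal& other) const
    {
        const uint64_t E18 = 1000000000000000000ULL; // 1e+18 as an integer

        decimal result;
        result.m_integral   = m_integral + other.m_integral;     // overflow check omitted
        result.m_fractional = m_fractional + other.m_fractional; // < 2e+18, fits uint64_t
        if (result.m_fractional >= E18) // at most a single carry into the integer part
        {
            result.m_fractional -= E18;
            ++result.m_integral;
        }
        return result;
    }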

Subtraction is a bit more complicated: here we must take into account wrapping below zero in the unsigned integer, which means comparing the two fractional parts before subtracting:

    e = a - c;
    if (b >= d)
        f = b - d;
    else
        f = (1e+18 - d) + b, --e;
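And the matching sketch for subtraction, with the borrow from the integer part (same assumptions as above, range checks again omitted):

    decimal decimal::operator-(const decimal& other) const
    {
        const uint64_t E18 = 1000000000000000000ULL; // 1e+18 as an integer

        decimal result;
        result.m_integral = m_integral - other.m_integral;
        if (m_fractional >= other.m_fractional)
            result.m_fractional = m_fractional - other.m_fractional;
        else
        {
            // borrow one from the integer part so the fraction stays non-negative
            result.m_fractional = (E18 - other.m_fractional) + m_fractional;
            --result.m_integral;
        }
        return result;
    }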

So far, so easy. Before multiplication and division, everything is always easy.

Multiplication of numbers with fixed precision


Consider multiplication first. Since the numbers in the fractional part are quite large, typically in the near neighborhood of 1e+18, we either have to use assembly language and instructions that produce a double-width product, or stay within plain integers by splitting each component into two halves of up to 1e+9 each. A product of two such halves does not exceed 1e+18, so we do not need the assembler yet: it will not be as fast, but a 64-bit integer suffices and we remain inside C++. Remember long multiplication from school? If not, now is the time:

a = s_a * (a_1 * 10^9 + a_2); b = b_1 * 10^9 + b_2;
c = s_c * (c_1 * 10^9 + c_2); d = d_1 * 10^9 + d_2;
0 <= a_2, b_1, b_2, c_2, d_1, d_2 < 10^9;
s_a = sign(a), s_c = sign(c);
0 <= a_1, c_1 < MAX_INT64 / 10^9

We introduce a matrix to simplify the calculation of multiplication:

U = (a_1, a_2, b_1, b_2),
V = (c_1, c_2, d_1, d_2)^T,
A = V * U,
A =
| a_1*c_1  a_2*c_1  b_1*c_1  b_2*c_1 |
| a_1*c_2  a_2*c_2  b_1*c_2  b_2*c_2 |
| a_1*d_1  a_2*d_1  b_1*d_1  b_2*d_1 |
| a_1*d_2  a_2*d_2  b_1*d_2  b_2*d_2 |

The matrix is introduced not so much for convenience of calculation as for convenience of checking. After all, A_11 = a_1 * c_1 must be strictly less than MAX_INT64 / 10^18, and the values on the diagonal below it, A_12 = a_2 * c_1 and A_21 = a_1 * c_2, must be strictly less than MAX_INT64 / 10^9, simply because of the coefficients these terms are multiplied by when the components are summed:

e = A_11 * 10^18 + (A_12 + A_21) * 10^9 + (A_13 + A_22 + A_31) + (A_14 + A_23 + A_32 + A_41) div 10^9,
f = ((A_14 + A_23 + A_32 + A_41) mod 10^9) * 10^9 + (A_24 + A_33 + A_42) + (A_34 + A_43) div 10^9

Here we omit the term A_44 div 10^18 simply because it is always zero. Of course, before each addition it is worth checking that we do not exceed MAX_INT64; fortunately, we can use the unsigned uint64_t type for all matrix components and for the intermediate results. All that remains at the end is to determine the sign of the result, s_e = s_a xor s_c, and, for a negative number, correct the integral and fractional parts: decrement the integer part by one and subtract the fractional part from 10^18. That is the whole multiplication; the main thing is to be very careful. With assembler everything is an order of magnitude simpler, but that material goes beyond the scope of the C++ Academy.
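To get a feel for the bookkeeping, here is just the fractional-by-fractional corner of that matrix (the b and d components) as a sketch under the same 10^9 split; the helper name is ours, and the result is truncated rather than rounded:

    #include <cstdint>

    // Product of two fractional parts b and d (each < 10^18, in units of 10^-18).
    // The full product b*d is in units of 10^-36; we return it truncated back
    // to units of 10^-18. Splitting at 10^9 keeps every partial product in uint64_t.
    uint64_t multiply_fractionals(uint64_t b, uint64_t d)
    {
        const uint64_t E9 = 1000000000ULL; // 10^9
        uint64_t b1 = b / E9, b2 = b % E9; // b = b1 * 10^9 + b2
        uint64_t d1 = d / E9, d2 = d % E9;

        // weights: b1*d1 -> 10^-18, (b1*d2 + b2*d1) -> 10^-27, b2*d2 -> 10^-36
        uint64_t mid = b1 * d2 + b2 * d1 + (b2 * d2) / E9; // carry the low corner up
        return b1 * d1 + mid / E9;                         // truncate to 10^-18 units
    }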

Division without registration and SMS


If you remember the long-division algorithm, well done, but it will not be needed here. Thanks to mathematics and a little witchcraft with inequalities, it is easier for us to compute the reciprocal x^-1 and then perform a multiplication by x^-1. So we solve the problem

y = x^-1 = 1 / (a + b * 10^-18) = c + d * 10^-18.

For simplicity, consider finding the reciprocal of a positive x. If at least one of the components of x is zero (but not both), the calculation simplifies greatly. If a = 0, then:

    y = 1 / (b * 1e-18) = 1e+18 / b,
    e = 1e+18 div b, f = 1e+18 mod b;
    if b = 0 and a = 1, then y = e = 1, f = 0;
    if b = 0 and a > 1, then y = 1 / a, e = 0, f = 1e+18 div a.

In the more general case, when x has a nonzero fractional part as well as an integer part, the equation reduces to the following:

    if a >= 1 and b != 0, then y = 1 / (a + b * 1e-18) < 1, so e = 0, f = 1e+18 / (a + b * 1e-18).

Now we need to find the maximum power of 10 that does not exceed a, and iteratively perform the following:

    k = max(k): 1e+k <= a,
    u = 1e+18,
    v = a * 1e+(18-k) + b div 1e+k;
    f = (u / v) * 1e+(18-k);
    for (++k; k <= 18; ++k) {
        u = (u % v) * 10;
        if (!u) break; // the division terminated early
        f += (u / v) * 1e+(18-k);
    }

Here we simply multiply and divide the fraction by the same factor, a power of ten, and then step by step compute the quotient and the remainder for each next power of ten.

It is very useful to keep an array of the powers of ten from 0 to 18 inclusive: there is no need to compute them, we know them in advance and will need them often.
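Putting the iteration together in C++ might look like the following sketch; it covers only the general case a >= 1 from above, the function name and the POW10 table are ours, and, like the pseudocode, it leaves out the guards that a production version would need:

    #include <cstdint>

    // Powers of ten 10^0 .. 10^18: the lookup table recommended above.
    static const uint64_t POW10[19] = {
        1ULL, 10ULL, 100ULL, 1000ULL, 10000ULL, 100000ULL, 1000000ULL,
        10000000ULL, 100000000ULL, 1000000000ULL, 10000000000ULL,
        100000000000ULL, 1000000000000ULL, 10000000000000ULL,
        100000000000000ULL, 1000000000000000ULL, 10000000000000000ULL,
        100000000000000000ULL, 1000000000000000000ULL
    };

    // Fractional part f of y = 1 / (a + b * 1e-18) for positive x with a >= 1
    // (so the integer part of y is 0). Digits of f come out as in long division.
    uint64_t reciprocal_fractional(uint64_t a, uint64_t b)
    {
        int k = 0; // largest k with 10^k <= a
        while (k < 18 && POW10[k + 1] <= a)
            ++k;

        uint64_t u = POW10[18];                        // dividend: 10^18
        uint64_t v = a * POW10[18 - k] + b / POW10[k]; // divisor: x scaled by 10^(18-k)
        uint64_t f = (u / v) * POW10[18 - k];          // leading digit of the result

        for (++k; k <= 18; ++k) // remaining digits, one per power of ten
        {
            u = (u % v) * 10;   // NB: can overflow for some inputs; a production
                                // version needs wider intermediates or extra checks
            if (!u)
                break;          // the division terminated early
            f += (u / v) * POW10[18 - k];
        }
        return f;
    }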

Type conversion


We now know and can do enough to turn the vague double and float into our new decimal.

    decimal::decimal(double value)
        : m_integral(static_cast<int64_t>(std::floor(value)))
        , m_fractional(static_cast<uint64_t>(std::floor((value - m_integral) * 1e+18 + 0.5)))
    {
        normalize();
    }

    void decimal::normalize()
    {
        uint64_t tail = m_fractional % 1000; // 1e+3, the noise below double's precision
        if (tail)
        {
            if (tail > 1000 / 2)
                m_fractional += 1000 - tail;
            else
                m_fractional -= tail;
        }
    }

Here 10^3 is, in effect, the error bound beyond which a double stops being exact: a double is reliable to roughly 15 decimal digits while our fractional part carries 18, hence the 10^(18-15) = 10^3 used here for clarity; if desired, the error allowance can be reduced further. Normalization after the conversion is needed in any case, since the precision of double is obviously lower than that of even our fractional part. In addition, we must take into account the case when the double goes beyond the limits of int64_t; under such conditions our decimal cannot correctly convert the integer part of the number.

For float everything looks similar, but the error is much larger: 10^(18-7) = 10^11.

Converting a decimal back to an integer is trivial: take m_integral and discard m_fractional, remembering that for a negative number with a nonzero fractional part m_integral lies one below the value itself and may need correcting by one.

Converting a decimal to double or float is a one-liner:

 return m_integral + m_fractional * 1e-18; 

String conversion is also simple: the integer part is printed before the decimal separator and the fractional part, padded with zeros to the required width, after it. The number of digits shown after the point can be driven by a "precision" setting such as an m_precision field.

Parsing a string works in reverse: read the integer part up to the separator, then the fractional digits, padding them with zeros up to the full 18 places.

With that, the decimal type is essentially complete.

GitHub


The full source code of the decimal class is published on GitHub.

In conclusion


As you can see, exact work with decimal fractions is possible in C/C++ as well. Unlike Python or C#, we had to build the type ourselves, but in return we get 15–18 decimal digits with zero error.

Of course, our decimal is limited by the range of int64_t. Unlike double and float it cannot hold astronomically large values, but within its range a decimal is exact to the last digit. And if the range is ever too small, the same approach scales out to arrays of components, that is, to long arithmetic.

The main thing is to understand your tools. double and float are fine wherever a small error is acceptable; where every last digit matters, and with money it always does, fixed precision is worth the effort. Choose the representation that fits the task, and may your point never float away!


First published in Hacker magazine #192.
Author: Vladimir Qualab Kerimov, Lead C++ Developer, Parallels

Subscribe to "Hacker"

Source: https://habr.com/ru/post/257897/

