
Single or double precision?

Introduction


In scientific computing we constantly work with floating-point numbers. This article is a guide to choosing the right floating-point representation. Most programming languages have two built-in precisions: 32-bit (single precision) and 64-bit (double precision). In the C family they are called float and double, and those are the terms used here. Other precisions exist as well: half, quad, and so on. I will not dwell on them, although much of the half vs. float and double vs. quad debate follows the same logic. So let's be clear right away: this article is only about 32-bit and 64-bit IEEE 754 numbers.

This article is also aimed at those of you who have a lot of data. If you only need a handful of numbers here and there, just use double and don't give it another thought!

The article splits into two separate (but related) discussions: what to use to store your data, and what to use in your calculations. Sometimes it is best to store data in float but perform the calculations in double.

If you need a refresher, at the end of the article there is a short reminder of how floating-point numbers work. Feel free to read it first and then come back.

Data precision


32-bit floats carry about 24 bits of precision, roughly 7 decimal digits, while doubles carry 53 bits, roughly 16 decimal digits. How much is that? Here are some rough estimates of the worst-case precision you get when using float and double to measure objects at different scales:

Scale                      Single precision    Double precision
Room size                  micrometer          proton radius
Earth's circumference      2.4 meters          nanometer
Distance to the Sun        10 km               thickness of a human hair
Length of a day            5 milliseconds      picosecond
Length of a century        3 minutes           microsecond
Time since the Big Bang    millennium          minute

(Example: using double, we can represent the time since the Big Bang to within about a minute.)

So if you are measuring the size of an apartment, float is enough. But if you want to represent GPS coordinates to within a meter or better, you will need double.
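You can query these gaps directly. Here is a minimal C sketch (the scale is the Earth's circumference from the table above, rounded to 40,075 km; link with -lm). It prints the distance between adjacent representable values at that magnitude:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* Earth's circumference, roughly 40,075 km, in meters. */
        float  c_f = 40075000.0f;
        double c_d = 40075000.0;
        /* nextafter gives the adjacent representable value, so the
           difference is the gap (ulp) at this magnitude. */
        printf("float gap:  %g m\n", nextafterf(c_f, INFINITY) - c_f); /* ~4 m      */
        printf("double gap: %g m\n", nextafter(c_d, INFINITY) - c_d);  /* ~7.5e-9 m */
        return 0;
    }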

Why not just store everything in double precision?


If you have plenty of RAM and neither execution speed nor battery life is a concern, you can stop reading right now and use double. Goodbye and have a nice day!

If memory is limited, the reason to choose float over double is simple: it takes half the space. But even when memory is not an issue, storing data as float can be significantly faster. As mentioned, double takes twice the space of float, so it takes twice as long to allocate, initialize, and copy the data. Moreover, if you read the data in an unpredictable pattern (random access), double roughly doubles the number of cache misses, which slows reads by about 40% (going by the practical O(√N) rule of thumb for memory access cost, which benchmarks confirm).
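As a trivial illustration of the footprint difference (the array size here is an arbitrary example):

    #include <stdio.h>

    int main(void) {
        enum { N = 1000000 };  /* one million values */
        printf("float array:  %zu bytes\n", N * sizeof(float));   /* ~4 MB */
        printf("double array: %zu bytes\n", N * sizeof(double));  /* ~8 MB */
        return 0;
    }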

The impact of single vs. double precision on compute performance


If you have a well-tuned SIMD pipeline, switching from double to float can double your FLOPS, since a vector register holds twice as many single-precision values. If not, the difference can be much smaller, and it depends heavily on your CPU: on an Intel Haswell the gap between float and double is small, while on an ARM Cortex-A9 it is large. See the comprehensive test results here.

Of course, if the data is stored in double, there is little point in doing the arithmetic in float. After all, why store precision you do not intend to use? The reverse, however, is not true: it can make perfect sense to store data in float but perform some or all of the calculations in double precision.

When to calculate with increased precision


Even if you store data in single precision, it sometimes pays to use double precision in the calculations. Here is a simple example in C:

    float sum(float* values, long long count) {
        float sum = 0;
        for (long long i = 0; i < count; ++i) {
            sum += values[i];
        }
        return sum;
    }

Run this code on ten single-precision numbers and you will not notice any precision problem. Run it on a million numbers and you certainly will. The reason is that precision is lost when adding numbers of very different magnitudes, and after summing a million numbers that situation is bound to occur. The rule of thumb: when you sum 10^N values, you lose N decimal digits of precision. So summing a thousand (10^3) numbers costs three digits; summing a million (10^6) numbers costs six digits, and float only has seven to begin with! The fix is simple: do the accumulation in double instead:

    float sum(float* values, long long count) {
        double sum = 0;
        for (long long i = 0; i < count; ++i) {
            sum += values[i];
        }
        return (float)sum;
    }

This code will most likely run just as fast as the first version, but without the loss of precision. Note that you do not need to store your numbers in double to benefit from higher-precision arithmetic!
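A minimal test harness makes the difference between the two versions visible (the fill value 1.1f is arbitrary; any value that is inexact in binary will do):

    #include <stdio.h>

    float sum_f(const float* v, long long n) {      /* float accumulator  */
        float s = 0;
        for (long long i = 0; i < n; ++i) s += v[i];
        return s;
    }

    float sum_d(const float* v, long long n) {      /* double accumulator */
        double s = 0;
        for (long long i = 0; i < n; ++i) s += v[i];
        return (float)s;
    }

    int main(void) {
        enum { N = 1000000 };
        static float values[N];
        for (long long i = 0; i < N; ++i) values[i] = 1.1f;
        /* The exact sum is N * 1.1f, about 1,100,000.02. The float
           accumulator drifts noticeably; the double one does not. */
        printf("float accumulator:  %.2f\n", sum_f(values, N));
        printf("double accumulator: %.2f\n", sum_d(values, N));
        return 0;
    }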

Example


Suppose you want to measure some quantity accurately, but your measuring device (with some kind of digital display) shows only three significant digits. Measuring the quantity ten times produces the following series of values:

 3.16, 3.15, 3.16, 3.18, 3.15, 3.11, 3.14, 3.11, 3.14, 3.15 

To improve the accuracy, you decide to sum the measurements and compute the average. In this example we use a base-10 floating-point format with exactly seven significant digits of precision (close to a 32-bit float). With three significant digits in the data, that leaves us four extra digits of headroom:

 3.160000 + 3.150000 + 3.160000 + 3.180000 + 3.150000 + 3.110000 + 3.140000 + 3.110000 + 3.140000 + 3.150000 = 31.45000 

The total already occupies four significant digits, leaving three to spare. What if you sum a hundred such values? Then you get something like this:

 314.4300 

Now only two spare digits remain. What if we sum a thousand numbers?

 3140.890 

Ten thousand?

 31412.87 

So far so good, but now all seven digits of precision are in use. Let's keep adding numbers:

 31412.87 + 3.11 = 31415.98 

Notice how we shift the smaller number to line up the decimal points. We have no spare digits left and are dangerously close to losing precision. What happens once you have summed a hundred thousand values? Then adding a new value looks like this:

 314155.6 + 3.12 = 314158.7 

Note how the last significant digit of the data (the 2 in 3.12) is lost. Now precision really is being lost, since we continually discard the last digit of every value we add. We can see that the problem starts somewhere between ten thousand and one hundred thousand additions. Our format has seven digits of precision and the measurements have three significant digits; the remaining four digits are four orders of magnitude that serve as a kind of numeric buffer. We can therefore safely sum four orders of magnitude, i.e. 10,000 values, without losing precision, but beyond that we run into trouble. So the rule is:

If your floating-point format carries P digits of precision (7 for float, 16 for double) and your data has S significant digits, then you have P − S digits of headroom and can sum 10^(P−S) values without precision problems. So if we had used 16 digits of precision instead of 7, we could have summed 10^(16−3) = 10,000,000,000,000 values without any trouble.

(There are numerically stable ways to sum a large number of values, such as Kahan summation. But simply switching from float to double is much simpler and probably faster.)
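For reference, here is a minimal sketch of Kahan (compensated) summation. One caveat: aggressive compiler options such as -ffast-math may reassociate the arithmetic and optimize the compensation away.

    /* Kahan summation: keeps a float accumulator numerically stable
       by tracking the rounding error of each addition. */
    float kahan_sum(const float* values, long long count) {
        float sum = 0.0f;
        float c = 0.0f;                 /* running compensation */
        for (long long i = 0; i < count; ++i) {
            float y = values[i] - c;    /* corrected next term */
            float t = sum + y;          /* low-order digits of y are lost here... */
            c = (t - sum) - y;          /* ...and recovered into c */
            sum = t;
        }
        return sum;
    }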

Conclusions


Store your data in float when about seven significant digits are enough and you have a lot of it: you halve the memory footprint, and memory-bound code gets noticeably faster. Perform long accumulations and other sensitive calculations in double, even when the data itself lives in float. And if you only have a handful of values, or performance simply does not matter, just use double.

Appendix: What is a floating point number?


I have found that many people do not really understand what floating-point numbers are, so a brief explanation is in order. I will skip the low-level details of bits, INF, NaN, and subnormals, and instead show a few examples of floating-point numbers in base 10. Everything carries over to binary numbers in the same way.

Here are some examples of floating-point numbers, all with seven significant decimal digits (which is close to a 32-bit float):

1.875545 · 10^-18 = 0.000 000 000 000 000 001 875 545
3.141593 · 10^0 = 3.141593
2.997925 · 10^8 = 299 792 500
6.022141 · 10^23 = 602 214 100 000 000 000 000 000

In each example, the first factor (e.g. 1.875545) is called the mantissa, and the power of ten is the exponent. In short, the mantissa holds the precision and the exponent holds the magnitude. So how do you compute with them? Multiplication is easy: multiply the mantissas and add the exponents:

1.111111 · 10^42 · 2.000000 · 10^7
= (1.111111 · 2.000000) · 10^(42 + 7)
= 2.222222 · 10^49
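In binary the decomposition is the same, just with a base-2 exponent; C's standard frexp performs the split for you, as this small sketch shows:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double x = 299792500.0;
        int exp;
        double mant = frexp(x, &exp);  /* x == mant * 2^exp, with 0.5 <= mant < 1 */
        printf("%g = %.17g * 2^%d\n", x, mant, exp);
        return 0;
    }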

Addition is a little trickier: to add two numbers of different magnitudes, you first have to shift the smaller one so that the decimal points line up.

3.141593 · 10^0 + 1.111111 · 10^-4 =
3.141593 + 0.0001111111 =
3.141593 + 0.000111 =
3.141704

Notice how we shifted away some of the significant digits to make the decimal points line up. In other words, we lose precision whenever we add numbers of different magnitudes.

Source: https://habr.com/ru/post/331814/

