Introduction
In scientific computing we often use floating-point numbers. This article is a guide to choosing the right floating-point representation. Most programming languages have two built-in precisions: 32-bit (single precision) and 64-bit (double precision). In the C family of languages they are known as `float` and `double`, and those are the terms we will use here. There are other precisions as well: `half`, `quad`, etc. I will not focus on them, although the choice of `half` vs `float` or `double` vs `quad` raises many of the same debates. To be clear right away: here we are talking only about 32-bit and 64-bit IEEE 754 numbers.
The article is also written for those of you who have a lot of data. If you just need a few numbers here and there, use `double` and don't give it another thought!
The article is divided into two separate (but related) discussions: what to use to store your data, and what to use in your calculations. Sometimes it is best to store data as `float` but perform the calculations in `double`.

If you need it, at the end of the article there is a short refresher on how floating-point numbers work. Feel free to read it first and then come back here.
Data precision
32-bit floating-point numbers have about 24 bits of precision, that is, about 7 decimal digits, while double-precision numbers have 53 bits, about 16 decimal digits. How much is that? Here are some rough estimates of the worst-case precision you get when using `float` and `double` to measure objects at different scales:
| Scale | Single precision | Double precision |
|---|---|---|
| Room size | micrometer | proton radius |
| Earth's circumference | 2.4 meters | nanometer |
| Distance to the Sun | 10 km | thickness of a human hair |
| Duration of a day | 5 milliseconds | picosecond |
| Duration of a century | 3 minutes | microsecond |
| Time since the Big Bang | millennium | minute |
(For example: using `double`, we can represent the time since the Big Bang with a precision of about one minute.)
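You can check these digit counts directly from C's <float.h>. A minimal sketch: FLT_DIG and DBL_DIG report the guaranteed round-trip decimal digits (6 and 15), slightly more conservative than the "about 7 and 16" figures above:

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    /* Decimal digits guaranteed to survive a round-trip, and the
       relative precision (machine epsilon) of each type. */
    printf("float:  %d digits, epsilon = %g\n", FLT_DIG, FLT_EPSILON);
    printf("double: %d digits, epsilon = %g\n", DBL_DIG, DBL_EPSILON);
    return 0;
}
```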
So if you are measuring the size of an apartment, `float` is enough. But if you want to store GPS coordinates with sub-meter precision, you will need `double`.
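You can verify the GPS claim directly: the spacing between adjacent `float` values at Earth-scale coordinates is several meters. A small sketch (the circumference value is approximate):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Earth's circumference in meters, stored as a float */
    float pos = 40075000.0f;
    /* Distance to the next representable float = storage granularity */
    float step = nextafterf(pos, INFINITY) - pos;
    printf("float granularity at %.0f m: %.0f m\n", pos, step);
    /* Prints 4 m: positions closer together than about 4 meters
       collapse to the same float at this magnitude. */
    return 0;
}
```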
Why not always store everything in double precision?
If you have plenty of RAM and neither execution speed nor battery consumption is a concern, you can stop reading right now and use `double`. Goodbye and have a nice day!
If memory is limited, the reason to choose `float` over `double` is simple: it takes half the space. But even when memory is not a problem, storing data as `float` can be significantly faster. As already mentioned, `double` takes twice the space of `float`, which means it takes twice as long to allocate, initialize, and copy the data. Moreover, if you read the data in an unpredictable pattern (random access), `double` increases the number of cache misses, which slows reading down by roughly 40% (judging by the O(√N) rule of thumb, which is confirmed by benchmarks).
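The footprint difference is easy to see, and the copy-cost difference can be measured with a crude timing sketch like the one below (not a rigorous benchmark; the array size is arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N 50000000LL  /* 50 million elements */

static double copy_ms(void* dst, const void* src, size_t bytes) {
    clock_t t0 = clock();
    memcpy(dst, src, bytes);
    return 1000.0 * (clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    float*  fa = calloc(N, sizeof(float));
    float*  fb = malloc(N * sizeof(float));
    double* da = calloc(N, sizeof(double));
    double* db = malloc(N * sizeof(double));

    printf("float array:  %lld MB\n", (long long)(N * sizeof(float) / (1024 * 1024)));
    printf("double array: %lld MB\n", (long long)(N * sizeof(double) / (1024 * 1024)));
    printf("copy floats:  %.0f ms\n", copy_ms(fb, fa, N * sizeof(float)));
    printf("copy doubles: %.0f ms\n", copy_ms(db, da, N * sizeof(double)));

    free(fa); free(fb); free(da); free(db);
    return 0;
}
```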
The impact of single vs. double precision on computing performance
If you have a well-tuned SIMD pipeline, you can double your FLOPS by replacing `double` with `float`. If not, the difference may be much smaller, though it depends heavily on your CPU: on an Intel Haswell the difference between `float` and `double` is small, while on an ARM Cortex-A9 it is large. See the comprehensive test results here.
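The 2x SIMD factor comes straight from register width. For example, a 256-bit AVX register holds 8 floats but only 4 doubles, so one instruction does twice the work in single precision (this sketch assumes an x86 CPU with AVX; compile with -mavx):

```c
#include <immintrin.h>

/* One 256-bit instruction adds 8 floats at a time... */
void add8_floats(const float* a, const float* b, float* out) {
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));
}

/* ...but only 4 doubles at a time. */
void add4_doubles(const double* a, const double* b, double* out) {
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    _mm256_storeu_pd(out, _mm256_add_pd(va, vb));
}
```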
Of course, if the data is stored in `double`, it makes little sense to do the calculations in `float`: why store all that precision if you are not going to use it? The reverse, however, is not true: it can be perfectly reasonable to store data as `float` but do some or all of the calculations in double precision.
When to calculate with higher precision
Even if you store your data in single precision, in some cases it is appropriate to use double precision in the calculations. Here is a simple example in C:
```c
float sum(float* values, long long count) {
    float sum = 0;
    for (long long i = 0; i < count; ++i) {
        sum += values[i];
    }
    return sum;
}
```
If you run this code on ten single-precision numbers, you won't notice any precision problems. Run it on a million numbers and you definitely will. The reason is that precision is lost when a large and a small number are added together, and after summing a million numbers that situation is bound to occur. The rule of thumb is: if you add 10^N values, you lose N decimal digits of precision. So when you add a thousand (10^3) numbers, three decimal digits are lost. Add a million (10^6) numbers and six decimal digits are lost (and a `float` only has seven!). The solution is simple: accumulate in `double` instead:
```c
float sum(float* values, long long count) {
    double sum = 0;
    for (long long i = 0; i < count; ++i) {
        sum += values[i];
    }
    return (float)sum;
}
```
Most likely this code will run just as fast as the first version, but without the loss of precision. Note that you do not need to store your numbers in `double` to benefit from higher-precision calculations!
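To see the difference concretely, here is a small self-contained test (the names float_sum and double_sum are mine, wrapping the two versions above) that sums twenty million ones:

```c
#include <stdio.h>
#include <stdlib.h>

float float_sum(float* values, long long count) {
    float sum = 0;
    for (long long i = 0; i < count; ++i) sum += values[i];
    return sum;
}

float double_sum(float* values, long long count) {
    double sum = 0;
    for (long long i = 0; i < count; ++i) sum += values[i];
    return (float)sum;
}

int main(void) {
    long long n = 20000000;
    float* values = malloc(n * sizeof(float));
    for (long long i = 0; i < n; ++i) values[i] = 1.0f;

    printf("float accumulator:  %.1f\n", float_sum(values, n));
    printf("double accumulator: %.1f\n", double_sum(values, n));
    /* The float accumulator gets stuck at 16777216.0 (2^24), because
       adding 1 to 2^24 rounds back to 2^24 in single precision.
       The double accumulator prints 20000000.0 exactly. */
    free(values);
    return 0;
}
```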
Example
Suppose you want to measure some quantity accurately, but your measuring device (with some kind of digital display) shows only three significant digits. Measuring the quantity ten times produces this series of values:
3.16, 3.15, 3.16, 3.18, 3.15, 3.11, 3.14, 3.11, 3.14, 3.15
To improve accuracy, you decide to sum the measurements and compute the average. In this example we use base-10 floating-point numbers with exactly seven decimal digits of precision (similar to a 32-bit `float`). With three significant digits in the data, that leaves four extra digits of headroom:
3.160000 + 3.150000 + 3.160000 + 3.180000 + 3.150000 + 3.110000 + 3.140000 + 3.110000 + 3.140000 + 3.150000 = 31.45000
The sum already has four significant digits, with three to spare. What if you sum a hundred of these values? Then you get something like this:
314.4300
There are still two unused digits. What about a thousand numbers?
3140.890
Ten thousand?
31412.87
So far so good, but now all of the decimal digits are being used for precision. Keep adding values:
31412.87 + 3.11 = 31415.98
Notice how we shift the smaller number to line up the decimal point. We no longer have any digits to spare, and we are dangerously close to losing precision. What if you sum a hundred thousand values? Then adding a new value looks like this:
314155.6 + 3.12 = 314158.7
Note that the last significant digit of the data (the 2 in 3.12) is lost. Now real precision loss occurs, because we continually discard the last digit of precision of our data. We saw that the problem appears somewhere after ten thousand additions but before one hundred thousand. We have seven decimal digits of precision, and the measurements carry three significant digits. The remaining four digits are four orders of magnitude that act as a kind of numeric buffer. We can therefore safely add four orders of magnitude = 10,000 values without losing precision, but beyond that we run into trouble. So the rule is:
If your floating-point type holds P digits of precision (7 for `float`, 16 for `double`), and your data carries S significant digits, then you have P - S digits of headroom and can add 10^(P - S) values without precision problems. So if we had used 16 digits of precision instead of 7, we could have added 10^(16 - 3) = 10,000,000,000,000 values without any precision problems.
(There are numerically stable ways to add a large number of values. However, simply switching from `float` to `double` is much simpler and probably faster.)
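One well-known numerically stable method is Kahan (compensated) summation. A minimal sketch; note that aggressive compiler options such as -ffast-math may optimize the compensation away:

```c
/* Kahan summation: a running compensation term recovers the
   low-order bits that a plain float accumulator would discard. */
float kahan_sum(float* values, long long count) {
    float sum = 0.0f;
    float c = 0.0f;               /* running compensation */
    for (long long i = 0; i < count; ++i) {
        float y = values[i] - c;  /* corrected next value */
        float t = sum + y;        /* low bits of y are lost here... */
        c = (t - sum) - y;        /* ...and recovered here */
        sum = t;
    }
    return sum;
}
```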
Conclusions
- Do not use more precision than you need when storing data.
- If you are summing a large amount of data, switch to double precision.
Appendix: What is a floating-point number?
I have found that many people do not really understand what floating-point numbers are, so it makes sense to explain them briefly. I will skip the low-level details about bits, INF, NaN, and subnormals, and instead show a few examples of floating-point numbers in base 10. Everything here applies equally to binary numbers.
Here are a few examples of floating-point numbers, all with seven decimal digits of precision (close to a 32-bit `float`):

**1.875545** · 10^*-18* = 0.000 000 000 000 000 00**1 875 545**
**3.141593** · 10^*0* = **3.141593**
**2.997925** · 10^*8* = **299 792 5**00
**6.022141** · 10^*23* = **602 214 1**00 000 000 000 000 000
The **bold** part is called the mantissa, and the *italicized* part is the exponent. In short, the mantissa holds the precision and the exponent holds the magnitude. So how do we operate on these numbers? Multiplication is easy: multiply the mantissas and add the exponents:

**1.111111** · 10^*42* · **2.000000** · 10^*7* = (**1.111111** · **2.000000**) · 10^(*42* + *7*) = **2.222222** · 10^*49*

Addition is a little trickier: to add two numbers of different magnitudes, you first have to shift the smaller of the two so that the decimal points line up:
**3.141593** · 10^*0* + **1.111111** · 10^*-4*
= 3.141593 + 0.000 111 111 1
= 3.141593 + 0.000 111
= **3.141704**

Notice how we shifted away some of the significant decimal digits to make the decimal points line up. In other words, we lose precision when adding numbers of different magnitudes.
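The same effect is easy to reproduce in binary with real `float` values; this sketch adds a number too small to survive the shift:

```c
#include <stdio.h>

int main(void) {
    /* 2^24 is the limit of float's 24-bit mantissa; at this magnitude
       the spacing between adjacent floats is 2.0, so adding 0.25
       shifts all of its significant bits off the end. */
    float big = 16777216.0f;   /* 2^24 */
    float small = 0.25f;
    printf("%.2f\n", big + small);  /* prints 16777216.00 */
    return 0;
}
```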