In the vast majority of my posts about random number generation, we have mainly considered the properties of various generation schemes. This may be unexpected, but the performance of a randomized algorithm can depend not on the chosen generation scheme but on other factors. In this post (which was inspired by an excellent article by Daniel Lemire), we explore the main causes of slow random number generation, which often outweigh the performance of the PRNG engine itself.
Imagine this situation:
As a homework assignment, Juan and Sasha implement the same randomized algorithm in C++, to be run on the same university computer with the same data set. Their code is almost identical and differs only in how it generates random numbers. Juan is in a hurry to get to his music lesson, so he simply chose the Mersenne Twister. Sasha, on the other hand, spent several extra hours on research: he benchmarked several of the fastest PRNGs he had recently learned about on social networks and chose the fastest. At their meeting, Sasha could not wait to boast, and he asked Juan: “What PRNG did you use?”
“Personally, I just took the Mersenne Twister; it is built into the language and seems to work quite well.”
“Ha!”, answered Sasha. “I used jsf32. It is much faster than the old and slow Mersenne Twister! My program runs in 3 minutes 15 seconds!”
“Hmm, not bad, but mine does it in less than a minute,” says Juan and shrugs. “Well, okay, I have to go to the concert. Will you come with me?”

“No,” Sasha replies. “I... uh... need to take another look at my code.”
This awkward fictional situation is not all that fictional; it is based on real results. If your randomized algorithm does not run as fast as you would like, and the bottleneck seems to be random number generation, then, oddly enough, the problem may not lie in the random number generator!
Introduction: Random Numbers in Practice
Most modern high-quality random number generators produce machine words filled with random bits, that is, they usually generate numbers in the interval [0..2^32) or [0..2^64). But in many use cases, numbers are needed in a specific interval: for example, to roll a die or pick a random playing card, numbers are needed from small fixed intervals. Moreover, many algorithms, from shuffling and reservoir sampling to randomized binary search trees, require numbers drawn from other intervals.
Methods
We will look at many different methods. To simplify the discussion, instead of generating numbers in the interval [i..j) or [i..j], we will generate numbers in the interval [0..k). Given such a scheme, we can, for example, generate numbers in the interval [i..j) by setting k = j - i, generating a number in [0..k), and then adding i to it.
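As a quick illustration of that reduction, here is a sketch (the helper name is mine; the [0..k) step uses the classic modulo for brevity, which carries the bias discussed later in the article):

```cpp
#include <cstdint>
#include <random>

// Reducing [i..j) to [0..k): set k = j - i, draw from [0..k), add i.
uint32_t rand_in_interval(std::mt19937& rng, uint32_t i, uint32_t j) {
    uint32_t k = j - i;       // size of the target interval
    uint32_t x = rng() % k;   // a value in [0..k) -- biased, see below
    return i + x;             // shifted into [i..j)
}
```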
Built-in C++ tools
Many languages have built-in tools for getting a random number in a specified interval. For example, to draw a card from a 52-card deck in scripting languages like Perl and Python, we can write int(rand(52)) or random.randint(0,51) (randint includes both endpoints). In C++, we can similarly use uniform_int_distribution.
The C++ code implementing this approach is simple:

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    std::uniform_int_distribution<uint32_t> dist(0, range-1);
    return dist(rng);
}
Usually the built-in tools use one of the techniques described below, but most users simply use them without thinking about what happens “under the hood”, assuming that these tools are properly designed and reasonably efficient. In C++, the built-in tools are more complex, because they must be able to work with fairly arbitrary generation engines: a generator producing values in the interval from -3 to 17 is perfectly acceptable and can be used with std::uniform_int_distribution to create numbers in any interval, for example [0..1000). As a result, the built-in C++ tools are over-engineered for most of the cases in which they are used.
The classic division remainder (biased)
Let's move from an over-complicated approach to an over-simplified one. When I was learning to program, we generated numbers in an interval (for example, to pick a card from a 52-card deck) using the remainder operator. To get a number in [0..52), we wrote rand() % 52.
In C++, this approach can be implemented as follows:

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    return rng() % range;
}
Despite its simplicity, this approach illustrates why getting numbers into the right interval is usually slow: it requires division (to compute the remainder produced by the % operator). Division is usually at least an order of magnitude slower than other arithmetic operations, so this single arithmetic operation takes more time than all the work performed by a fast PRNG.
But besides being slow, it is also biased. To understand why rand() % 52 returns skewed numbers, suppose rand() produces numbers in the interval [0..2^32), and note that 52 does not divide 2^32 evenly: it divides it 82,595,524 times with a remainder of 48. That means that if we use rand() % 52, there will be 82,595,525 ways to pick each of the first 48 cards in the deck and only 82,595,524 ways to pick each of the last four cards. In other words, there is a 0.00000121% bias against those last four cards (perhaps they are kings!). When I was a student writing homework about rolling dice or drawing cards, nobody usually worried about such tiny distortions, but as the interval grows, the bias grows linearly. For a 32-bit PRNG, a bound smaller than 2^24 gives a bias below 0.5%, but above 2^31 the bias reaches 50%: some numbers will be returned twice as often as others.
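The arithmetic behind those counts is easy to check in a few lines (a sanity check of the numbers above, not part of the original code; the struct and function names are mine):

```cpp
#include <cstdint>

// 2^32 = 52 * quotient + remainder.  Under rand() % 52, each of the
// first `remainder` residues has quotient+1 preimages; the rest have
// exactly `quotient` preimages.
struct BiasCounts {
    uint64_t quotient;
    uint64_t remainder;
};

BiasCounts card_bias() {
    uint64_t space = uint64_t(1) << 32;   // number of distinct 32-bit outputs
    return BiasCounts{space / 52, space % 52};
}
```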
In this article we will mainly consider techniques that eliminate this bias, but it is worth saying that for 64-bit PRNGs, the amount of bias in most typical applications will most likely be negligible.
Another problem is that some generators have weak low-order bits. For example, the Xoroshiro+ and Xoshiro+ PRNG families have low-order bits that fail statistical tests. When we compute % 52 (and 52 is even), we pass the lowest bit straight through to the output.
Floating-point multiplication (biased)
Another common technique is to use a PRNG that generates floating-point numbers in the interval [0..1) and then scale them into the desired interval. This approach is used in Perl, where it is recommended to write int(rand(10)) to generate an integer in [0..10) by generating a floating-point number and rounding down.
In C++, this approach is written as:

static uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    double zeroone = 0x1.0p-32 * rng();
    return range * zeroone;
}
(Note that 0x1.0p-32 is a hexadecimal floating-point constant for 2^-32, which we use to convert a random integer in [0..2^32) to a double in the unit interval; we could instead perform this conversion with ldexp(rng(), -32), but when I benchmarked that approach, it was much slower.)
This approach is just as biased as the classic division remainder, but the bias shows up differently. For example, if we pick numbers in the interval [0..52), the numbers 0, 13, 26, and 39 would appear one time fewer than the others.
Generalizing this version to 64 bits is even more unpleasant, because it requires a floating-point type whose significand is at least 64 bits wide. On x86 machines running Linux and macOS, we can use long double to get x86 extended-precision floats with a 64-bit significand, but long double is not universally portable: on some systems long double is equivalent to double.
On the plus side, this approach runs faster than the remainder-based solutions for PRNGs with weak low-order bits.
Integer multiplication (biased)
The multiplication method can be adapted to fixed-point rather than floating-point arithmetic. In effect, we always multiply by 2^32:
uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t x = rng();
    uint64_t m = uint64_t(x) * uint64_t(range);
    return m >> 32;
}
It may seem that this version requires 64-bit arithmetic, but on x86 processors a good compiler turns this code into a single 32-bit mult instruction (which yields two 32-bit output values, one of which is the return value). This version can be expected to be fast, but it is biased in exactly the same way as the floating-point multiplication method.
Division with rejection (unbiased)
We can turn the floating-point multiplication scheme into a division-based scheme. Instead of computing x * range / 2^32, we compute x / (2^32 / range). Since we are working with integer arithmetic, rounding behaves differently in this version, and it sometimes generates values outside the desired interval. If we reject those values (that is, throw them away and generate new ones), the resulting method is unbiased.
For example, when drawing a card with a 32-bit PRNG, we can generate a 32-bit number and divide it by 2^32/52 = 82,595,524 (rounded down) to pick a card. This technique works whenever the random value from the 32-bit PRNG is less than 52 × 82,595,524 = 2^32 - 48. If the random value from the PRNG is one of the top 48 values of the generator's interval, it must be rejected and another one generated.
Our code for this version uses a trick to divide 2^32 by range without 64-bit math. Computing 2^32 / range directly would require representing the number 2^32, which is too large (by one!) to fit in a 32-bit integer. Instead, we note that for unsigned integers, unary negation of range computes the positive value 2^32 - range; dividing that value by range yields an answer one less than 2^32 / range.
Consequently, the C++ code for generating a number via division with rejection looks like this:

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    // divisor is 2**32 / range; (-range) / range is one less, so add one
    uint32_t divisor = ((-range) / range) + 1;
    if (divisor == 0)   // overflow: range == 1, the only possible value is 0
        return 0;
    for (;;) {
        uint32_t val = rng() / divisor;
        if (val < range)
            return val;
    }
}
Of course, this approach requires two slow division-based operations, which are usually slower than other arithmetic operations, so we should not expect it to be fast.
The division remainder, twice (unbiased) - the OpenBSD method
We can also use rejection to eliminate the bias in the classic division remainder method. In the playing card example, we again need to reject 48 values. In this version, instead of rejecting the last 48 values, we (equivalently) reject the first 48 values.
Here is a C++ implementation of this approach:

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    // t is 2**32 % range, computed as (2**32 - range) % range
    uint32_t t = (-range) % range;
    for (;;) {
        uint32_t r = rng();
        if (r >= t)
            return r % range;
    }
}
This technique eliminates the bias, but it requires two time-consuming remainder operations for every output value (and may require the internal generator to produce more than one number). Therefore we should expect the method to be roughly half as fast as the classic biased approach.
OpenBSD's arc4random_uniform (also used in OS X and iOS) uses this strategy.
The division remainder, once (unbiased) - the Java method
Java uses a different approach to generating a number in an interval, one that needs only a single remainder operation except in the fairly rare case when the result is rejected. Code:
static uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t x, r;
    do {
        x = rng();
        r = x % range;
    } while (x - r > (-range));
    return r;
}
It takes a little thought to see why this variant works. Unlike the previous remainder-based version, which removes bias by dropping some of the lowest values the internal engine can produce, this version filters out values from the top of the engine's interval.
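One way to convince yourself is to shrink the word size: the same acceptance rule can be checked exhaustively in a 16-bit world. This scaled-down check is my own illustration, not code from the article:

```cpp
#include <cstdint>
#include <vector>

// Exhaustive check of the Java-style rule at 16 bits: x is accepted
// unless x - (x % range) falls in the final, partial block of size
// 2^16 % range at the top of the output space.  Returns how many
// accepted values of x map to each result.
std::vector<uint32_t> java_counts_16bit(uint32_t range) {
    const uint32_t space = 1u << 16;           // stands in for 2^32
    std::vector<uint32_t> counts(range, 0);
    for (uint32_t x = 0; x < space; ++x) {
        uint32_t r = x % range;
        if (x - r <= space - range)            // 16-bit analogue of
            ++counts[r];                       //   x - r <= (-range)
    }
    return counts;
}
```

With range 52, the rule rejects 16 of the 65,536 inputs (since 2^16 mod 52 = 16), and every result value is produced by exactly 1,260 accepted inputs.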
Integer multiplication (unbiased) - Lemire's method
In much the same way that we removed the bias from the division remainder technique, we can remove the bias from the integer multiplication technique. This technique was devised by Lemire.
uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    // t is 2**32 % range
    uint32_t t = (-range) % range;
    uint64_t m;
    uint32_t l;
    do {
        uint32_t x = rng();
        m = uint64_t(x) * uint64_t(range);
        l = uint32_t(m);
    } while (l < t);
    return m >> 32;
}
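As with the Java method, the rejection rule can be verified exhaustively at a reduced word size. This 16-bit scaling is my own illustration:

```cpp
#include <cstdint>
#include <vector>

// Exhaustive check of Lemire's rule at 16 bits: keep x when the low
// half of x * range reaches t = 2^16 % range; the high half is then a
// perfectly uniform result.  Returns the count of accepted inputs per
// result value.
std::vector<uint32_t> lemire_counts_16bit(uint32_t range) {
    const uint32_t space = 1u << 16;       // stands in for 2^32
    uint32_t t = space % range;            // rejection threshold
    std::vector<uint32_t> counts(range, 0);
    for (uint32_t x = 0; x < space; ++x) {
        uint32_t m = x * range;            // product fits in 32 bits here
        uint32_t l = m & 0xFFFFu;          // low 16 bits of the product
        if (l >= t)
            ++counts[m >> 16];             // high 16 bits are the result
    }
    return counts;
}
```

For range 52, every result value ends up with exactly floor(2^16 / 52) = 1,260 accepted inputs, confirming that the method is unbiased.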
Bitmask with rejection (unbiased) - Apple's method
Our final approach avoids division and remainder operations entirely. Instead, it uses a simple mask operation to get a random number in the interval [0..2^k), where k is the smallest value such that 2^k is at least as large as the interval. If the value is too large for our interval, we reject it and try again. The code is shown below:
uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t mask = ~uint32_t(0);
    --range;
    mask >>= __builtin_clz(range|1);
    uint32_t x;
    do {
        x = rng() & mask;
    } while (x > range);
    return x;
}
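To see which mask the code picks, the computation can be pulled out and checked on its own (the helper name is mine; __builtin_clz is the GCC/Clang builtin used in the code above):

```cpp
#include <cstdint>

// The mask the bitmask method computes: the smallest 2^k - 1 that
// covers range - 1.  For a 52-card deck the mask is 63, so each
// attempt succeeds with probability 52/64.
uint32_t mask_for_range(uint32_t range) {
    uint32_t mask = ~uint32_t(0);
    --range;
    mask >>= __builtin_clz(range | 1);   // range|1 avoids clz(0)
    return mask;
}
```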
Apple adopted this approach when (in the macOS Sierra release) it rewrote its own arc4random_uniform code.
Benchmarking basic techniques
Now we have several approaches to evaluate. Unfortunately, when we are worried about the cost of a single division, benchmarking becomes non-trivial. No benchmark can account for all the factors that affect a use case, and there is no guarantee that the best option for your application will be the best for mine.
We use three benchmarks and test techniques with many different PRNGs.
Benchmark Large-Shuffle
Probably the most obvious benchmark is shuffling. In this benchmark we imitate a large-scale shuffle. To shuffle an array of size N, we must generate numbers in the intervals [0..N), [0..(N-1)), ..., [0..1). In this benchmark we assume that N is the maximum possible number (for uint32_t, that is 2^32 - 1). Code:
for (uint32_t i = 0xffffffff; i > 0; --i) {
    uint32_t bval = bounded_rand(rng, i);
    assert(bval < i);
    sum += bval;
}
Note that we “use” each number by adding it to sum (so that it does not get optimized away), but we do no actual shuffling, to keep the focus on number generation.
To test 64-bit generation we have an analogous test, but running a test corresponding to shuffling an array of 2^64 - 1 entries would be impractical (this larger benchmark would take many thousands of years to run). Instead we traverse the whole 64-bit interval but generate the same number of output values as in the 32-bit test. Code:
for (uint32_t i = 0xffffffff; i > 0; --i) {
    uint64_t bound = (uint64_t(i)<<32) | i;
    uint64_t bval = bounded_rand(rng, bound);
    assert(bval < bound);
    sum += bval;
}
Mersenne Twister results
The results below show the performance of this benchmark for each of the methods discussed, using the Mersenne Twister and testing both the 32-bit code discussed in the article (using std::mt19937 from libstdc++) and similar 64-bit code (using std::mt19937_64 from libstdc++). Each result is the geometric mean of 15 runs with different seed values, normalized so that the classic division remainder method has unit run time.
It might seem that we have clear answers about performance: the techniques appear to rank neatly from best to worst, and you might wonder what the libstdc++ developers were thinking when they wrote such a terrible implementation for 32-bit numbers. But, as is often the case with benchmarking, the situation is more complicated than these results suggest. First, there is a risk that the results are specific to the Mersenne Twister, so we will widen the set of PRNGs tested. Second, there may be a subtle problem with the benchmark itself. Let's deal with the first question first.
Results for multiple PRNGs
For 32-bit testing we will use arc4_rand32, chacha8r, gjrand32, jsf32, mt19937, pcg32, pcg32_fast, sfc32, splitmix32, xoroshiro64+, xorshift*64/32, xoshiro128+, and xoshiro128**; for 64-bit testing we will use gjrand64, jsf64, mcg128, mcg128_fast, mt19937_64, pcg64, pcg64_fast, sfc64, splitmix64, xoroshiro128+, xorshift*128/64, xoshiro256+, and xoshiro256*. These sets give us a few slow PRNGs and a large number of very fast ones.
Here are the results:
We can see key differences from the Mersenne Twister results. Faster PRNGs shift the balance toward the bounding code, so the differences between the approaches become more pronounced, especially for 64-bit PRNGs. With this wider set of generators, the libstdc++ implementation no longer seems quite so terrible.
Findings
In this benchmark, the biased multiplication approach wins by a significant margin. There are many situations in which the bound will be small relative to the size of the PRNG output and performance is absolutely critical. In such situations a slight bias is unlikely to have a noticeable effect, but the PRNG speed will. One such example is Quicksort with a random pivot. Of the unbiased methods, the bitmask technique looks promising.
But before drawing serious conclusions, we must point out a huge problem with this benchmark: most of the time is spent on very large bounds, which most likely gives excessive weight to large intervals. Therefore we move on to a second benchmark.
Benchmark Small-Shuffle
This benchmark is similar to the previous one, but it performs a much smaller “array shuffle” many times over. Code:
for (uint32_t j = 0; j < 0xffff; ++j) {
    for (uint32_t i = 0xffff; i > 0; --i) {
        uint32_t bval = bounded_rand(rng, i);
        assert(bval < i);
        sum += bval;
    }
}
Mersenne Twister results
Results for multiple PRNGs
Findings
This benchmark avoids placing too much weight on large bounds and more accurately reflects real-world use cases, but it now discards large bounds entirely.

Benchmark for all intervals
This benchmark aims to avoid the drawbacks of the previous two; it tests at every power-of-two bound size, so that every size is represented but none is overweighted. Code:

for (uint32_t bit = 1; bit != 0; bit <<= 1) {
    for (uint32_t i = 0; i < 0x1000000; ++i) {
        uint32_t bound = bit | (i & (bit - 1));
        uint32_t bval = bounded_rand(rng, bound);
        assert(bval < bound);
        sum += bval;
    }
}
Mersenne Twister results
Results for multiple PRNGs
Findings
Many of our findings remain unchanged. The biased multiplication method is fast if we can tolerate the error, and the bitmask scheme looks like a good middle-ground choice. We could stop here, if we did not want to go back, take a critical look at our code, and improve it.

Making improvements
Up to this point, every bias-elimination method has required an extra remainder operation, which is why they run much slower than the biased methods. It would help if we could reduce that gap.

Faster threshold-based rejection
Some of our algorithms use a threshold value, with code such as:

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    // t is 2**32 % range, the rejection threshold
    uint32_t t = (-range) % range;
    // ... use t to decide which values of rng() to reject ...
}
When range is small compared to the PRNG's output interval, the generated number will usually be far above the threshold. So if we can add a preliminary test against an estimate of the threshold that may be slightly high, we save the expensive remainder operation. The following code does exactly that:

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t r = rng();
    if (r < range) {
        uint32_t t = (-range) % range;
        while (r < t)
            r = rng();
    }
    return r % range;
}
This change can be applied both to the "division remainder, twice (unbiased)" method (see above) and to the "integer multiplication (unbiased)" method. The idea is due to Lemire, who applied it to the latter (but not to the former).

Large-Shuffle benchmark results
This optimization leads to a significant improvement in the 64-bit benchmark (where the remainder operation is even more expensive), but actually slightly degrades performance in the 32-bit benchmark. Despite the improvements, the bitmask method still wins.

Small-Shuffle benchmark results
On the other hand, this change significantly speeds up the small-shuffle benchmark both for the integer multiplication method and for the double division remainder method. In both cases, performance moves close to that of the biased variants. The double remainder method (OpenBSD) now performs almost identically to the single remainder method (Java).

Benchmark results for all intervals
We see a similar improvement in the benchmark for all intervals. It seems we can declare a new overall winner: Lemire's optimized unbiased integer multiplication method.

Optimizing the division remainder
Usually, computing a % b requires division, but when a < b the result is simply a and no division is needed; and when a/2 < b, the result is just a - b. Therefore, instead of computing a %= b; we can run:

if (a >= b) {
    a -= b;
    if (a >= b)
        a %= b;
}
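The replacement must agree with a plain % in all three cases: a < b, b <= a < 2b, and a >= 2b. A quick sanity check (the wrapper function is my own):

```cpp
#include <cstdint>

// Conditional-subtraction replacement for a %= b: a < b costs
// nothing, b <= a < 2b costs one subtraction, and only a >= 2b
// falls back to an actual division.
uint32_t fast_mod(uint32_t a, uint32_t b) {
    if (a >= b) {
        a -= b;
        if (a >= b)
            a %= b;
    }
    return a;
}
```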
Division is so costly that the added cost of this more complex code can pay for itself through the time saved by avoiding division.

Large-Shuffle benchmark results
Adding this optimization significantly improves the large-shuffle results. The effect is again more noticeable in 64-bit code, where the remainder operation is more expensive. For the double remainder method (OpenBSD style), results are shown with the optimization applied to only one remainder operation and to both. In this benchmark the bitmask approach still wins, but the gap between it and Lemire's approach has narrowed considerably.

Small-Shuffle benchmark results
Adding this optimization does not improve the small-shuffle results, so the remaining question is whether it adds significant cost. In some cases it does not; in others the cost rises slightly.

Benchmark results for all intervals
In the benchmark for all intervals, the changes are likewise small.

Bonus: results of the PRNG comparison
The main reason for using a variety of PRNGs to test the interval-bounding schemes was to avoid inadvertently distorting the results through the quirks of particular PRNG schemes. But we can use the same internal test results to compare the generation schemes themselves.

PRNGs producing 32-bit output
The graph below shows the performance of different 32-bit generation schemes, averaged over all methods and fifteen runs, normalized to the performance of the 32-bit Mersenne Twister. On the one hand, I am glad to see that pcg32_fast really is fast; only a small variant of Xoroshiro (which fails statistical tests) beats it. But it also shows why I rarely worry about the performance of modern high-performance general-purpose PRNGs: the difference between the various methods is very small. In particular, the four fastest schemes differ in performance by less than 5%, which I believe is simply "noise".

PRNGs producing 64-bit output
The graph shows the performance of various 64-bit generation schemes, averaged over all techniques and fifteen runs, normalized to the performance of the 32-bit Mersenne Twister. It may seem odd to normalize against the 32-bit Mersenne Twister, but doing so lets us see the extra cost of using 64-bit generation in cases where 32-bit generation would suffice. These results confirm that mcg128_fast is incredibly fast, but the next four techniques again differ by only about 5%, so it is hard to choose among the fastest methods. pcg64 and pcg64_fast must be slower than mcg128_fast because their underlying generators are a 128-bit linear congruential generator (LCG) and a 128-bit multiplicative congruential generator (MCG). Although they are not the fastest techniques in this set, pcg64 is still 20% faster than the 64-bit Mersenne Twister. Perhaps more importantly, these results also show that if you do not need 64-bit output, a 64-bit PRNG is usually slower than a 32-bit one.

Findings
From our benchmarks, we can see that switching from a commonly used PRNG (for example, the 32-bit Mersenne Twister) to a faster PRNG reduced benchmark execution time by 45%. But switching from the standard method of generating a number in an interval to our fastest method reduced benchmark time by about 66%; in other words, to one third of the original time. The fastest (unbiased) method is Lemire's, with my extra optimization. Here it is:

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t x = rng();
    uint64_t m = uint64_t(x) * uint64_t(range);
    uint32_t l = uint32_t(m);
    if (l < range) {
        uint32_t t = -range;
        if (t >= range) {
            t -= range;
            if (t >= range)
                t %= range;
        }
        while (l < t) {
            x = rng();
            m = uint64_t(x) * uint64_t(range);
            l = uint32_t(m);
        }
    }
    return m >> 32;
}
Using Lemire's method will improve the performance of most randomized algorithms more than switching from a fast generation engine to a slightly faster one.

Appendix: testing notes
The code for all the tests is posted on GitHub. In total, I tested 23 methods for bounded_rand using 26 different PRNGs (13 32-bit and 13 64-bit), with two compilers (GCC 8 and LLVM 6), giving me 26 × 23 × 2 = 1196 executables, each of which was run with the same 15 seeds, giving 1196 × 15 = 17,940 unique test runs, each combining the three benchmarks. I ran the tests mainly on a 48-core machine with four 2.1 GHz Xeon E7-4830v3 processors. Running the full suite required a little under a month of CPU time. Finally, let's return to the situation from the article's introduction. Imagine that Sasha used jsf32.STD-libc++ and Juan used mt19937.BIASED_FP_MULT_SCALE. In the third benchmark, the latter takes 69.6% less time. In other words, the times in that fictional situation are based on real-world data.