📜 ⬆️ ⬇️

Pseudo Random Number Generator Overview for CUDA

According to the specifics of work, one often has to do simulations on the GPU using pseudorandom number generators. As a result, I gained experience, which I decided to share with the community.

So, the types of generators, their advantages and disadvantages.

1. Linear Congruential Generator
')
The simplest and most inefficient generator with a period of only 2 ^ 32. It is not suitable for serious simulations, such as Monte Carlo, but can be used for debugging, because it is extremely simple to implement.

__device__ inline unsigned int lcg_rand(unsigned int& seed)
{
seed = seed * 1664525 + 1013904223UL;
return seed;
}


The generator requires storing a state for each generator (stream). To do this, you can use the unsigned int array and address it by threadId. For example:

__global__ void GenerateRandNums(unsigned int* out, unsigned int* states)
{
const int tid = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int s = states[tid];
out[tid] = lcg_rand(s) + lcg_rand(s);
states[tid] = s;
}


2. Combined Tausworthe Generator

This generator has a longer period than LCG, but almost as simple. It requires only 27 instructions to generate a stream with a period of 2 ^ 113. This period is achieved by combining several generators with exclusive or. Below is the code generating a single thread.

//S1,S2,S3,M -
__device__ inline unsigned int taus_rand_step(unsigned int& state, int S1, int S2, int S3, int M)
{
unsigned int b = (((state << S1) ^ state) >> S2);
state = (((state & M) << S3) ^ b);
return state;
}


Combined generator using 4 independent streams.

__device__ unsigned int taus_rand(uint4& state)
{
return
taus_rand_step(state.x, 6, 13, 18, 4294967294UL) ^
taus_rand_step(state.y, 2, 27, 2, 4294967288UL) ^
taus_rand_step(state.z, 13, 21, 7, 4294967280UL) ^
taus_rand_step(state.w, 3, 12, 13, 4294967268UL);

}



This generator requires 4 unsigned int to store the state, but its use is not fundamentally different from LCG.

3. Hybrid Generator

This generator is obtained by combining the 2 previous generators. If the periods of 2 generators are mutually simple, then the period of the resulting generator will be the product of their periods. This generator is obtained by combining 3 Tausworthe generator streams and one LCG stream. The resulting period will be 2 ^ 121.
__device__ unsigned int hybrid_taus(uint4& state)
{
return
taus_rand_step(state.x, 13, 19, 12, 4294967294UL) ^
taus_rand_step(state.y, 2, 25, 4, 4294967288UL) ^
taus_rand_step(state.z, 3, 11, 17, 4294967280UL) ^
lcg_rand(state.w);
}


4. XorShift

This generator has a period of 2 ^ 128-1 and requires 4 unsigned int to store the state. Thanks to maratyszcza for advice.

__device__ inline unsigned int rng_xor128(uint4& s)
{
unsigned int t;

t = sx^(sx<<11);

sx = sy;
sy = sz;
sz = sw;

sw = (sw^(sw>>19))^(t^(t>>8));

return sw;
}


5. Mersenne Twister

This generator is considered to be one of the best at the present time and has a very long period - 2 ^ 19937; however, to generate one element from the stream requires the implementation of a complex sequential algorithm. Also, the generator requires a lot of memory to store its state, which can greatly affect the performance of the GPU, because state has to be stored in global memory. However, in situations where accuracy is more important than speed, it is more preferable to use this generator. As in the previous generators, each thread has its own generator and state. Below is the code for this generator (a modified example from the NVIDIA SDK).

#define SEED 999

#define DCMT_SEED 4172
#define MT_RNG_PERIOD 607

typedef struct{
unsigned int matrix_a;
unsigned int mask_b;
unsigned int mask_c;
unsigned int seed;
} mt_struct_stripped;

#define MT_RNG_COUNT (4096*16)
#define MT_MM 9
#define MT_NN 19
#define MT_WMASK 0xFFFFFFFFU
#define MT_UMASK 0xFFFFFFFEU
#define MT_LMASK 0x1U
#define MT_SHIFT0 12
#define MT_SHIFTB 7
#define MT_SHIFTC 15
#define MT_SHIFT1 18

__device__ struct mt_state
{
int iState;
unsigned int mti1;
unsigned int mt[MT_NN];
};

__device__ mt_struct_stripped ds_MT[MT_RNG_COUNT];
static mt_struct_stripped h_MT[MT_RNG_COUNT];
__device__ mt_state ds_mtState[MT_RNG_COUNT];

void InitGPUTwisters(const char* fname, unsigned int seed)
{
FILE *fd = fopen(fname, "rb");
if(!fd)
printf("initMTGPU(): failed to open %s\n", fname);

if( !fread(h_MT, sizeof(h_MT), 1, fd) )
printf("initMTGPU(): failed to load %s\n", fname);

fclose(fd);

mt_struct_stripped *MT = (mt_struct_stripped *)malloc(MT_RNG_COUNT * sizeof(mt_struct_stripped));

for(int i = 0; i < MT_RNG_COUNT; i++)
{
MT[i] = h_MT[i];
seed = lcg_rand(seed);
MT[i].seed = seed;
}
cudaMemcpyToSymbol(ds_MT, MT, sizeof(h_MT));

free(MT);
}

__device__ unsigned int GetRandomIntegerFast(unsigned int tid, mt_state* mts, mt_struct_stripped* config)
{
int iState1, iStateM;
unsigned int mti, mtiM, x;

iState1 = mts->iState + 1;
iStateM = mts->iState + MT_MM;

if(iState1 >= MT_NN)
iState1 -= MT_NN;

if(iStateM >= MT_NN)
iStateM -= MT_NN;

mti = mts->mti1;
mts->mti1 = mts->mt[iState1];
mtiM = mts->mt[iStateM];

x = (mti & MT_UMASK) | (mts->mti1 & MT_LMASK);
x = mtiM ^ (x >> 1) ^ ((x & 1) ? config->matrix_a : 0);
mts->mt[mts->iState] = x;
mts->iState = iState1;

x ^= (x >> MT_SHIFT0);
x ^= (x << MT_SHIFTB) & config->mask_b;
x ^= (x << MT_SHIFTC) & config->mask_c;
x ^= (x >> MT_SHIFT1);

return x;
}

__device__ float GetRandomFloatFast(unsigned int tid, mt_state* mts, mt_struct_stripped* config)
{
return ((float)GetRandomIntegerFast(tid,mts,config) + 1.0f) / 4294967296.0f;
}

__device__ void InitTwisterFast(unsigned int tid, mt_state* mts, mt_struct_stripped* config)
{
mts->mt[0] = config->seed;

for(int i = 1; i < MT_NN; i++)
mts->mt[i] = (1812433253U * (mts->mt[i - 1] ^ (mts->mt[i - 1] >> 30)) + i) & MT_WMASK;

mts->iState = 0;
mts->mti1 = mts->mt[0];
}



An example of using this generator:

__global__ void GenerateUniformNumbers(float* output)
{
const uint tid = gridDim.x*gridDim.y*threadIdx.x + blockIdx.y*gridDim.x + blockIdx.x;

mt_state mts = ds_mtState[tid];
mt_struct_stripped config = ds_MT[tid];

InitTwisterFast(tid, &mts, &config);

output[tid] = GetRandomFloatFast(tid, &mts, &config)

ds_mtState[tid] = mts;
}


6. WarpStandard generator

This generator was designed specifically for the GPU and has a period of 2 ^ 1024. The generator uses shared memory, so all threads in the block must be executed in concert, which is not always possible. A very good description of this generator is presented by reference.

Conclusion

Each of the presented generators is suitable for different cases. For debugging, you can use LCG, if you need more precision and randomness, I would use MersenneTwister. If all the threads on the GPU are consistent, there are no dynamic transitions inside the blocks — WarpStandard fits very well. In all other cases, I use HybridGenerator.

Source: https://habr.com/ru/post/99876/


All Articles