
Parallelizing the non-parallel, or searching for prime numbers on the GPU

One fine summer evening, in the heat of an argument, I was foolish enough to remark that a fast sieve of Eratosthenes could be written in CUDA. N = 1,000,000,000 (nine zeros) was set as the goal. And so the legend began...

I will not go into the details of the algorithm (you can read about it, for example, here) and will go straight to the code I had at the time:

#include <iostream>
#include <math.h>

using namespace std;

int main() {
    double number = 1000000000;
    // Only odd numbers are stored: a[i] represents 2*i + 1.
    bool* a = new bool[int(number / 2)];
    int i, j, result;
    for (i = 0; i < number / 2; i++)
        a[i] = true;
    for (i = 3; i <= floor(sqrt(number)); i += 2)
        if (a[i / 2])
            for (j = i * i; j <= number; j += i * 2)
                a[j / 2] = false;
    // a[0] stands for 1 (not prime) and never gets crossed out, which
    // compensates for the missing 2, so a plain count gives the answer.
    result = 0;
    for (i = 0; i < number / 2; i++)
        if (a[i]) result++;
    cout << result << endl;
    delete[] a;
    return 0;
}

Single-threaded, slightly optimized code that runs in 14-15 seconds on a Core i3 330M and consumes a large amount of memory. Let's start with it.

Block sieve


The key to successful parallelization is dividing the task into independent parts in order to eliminate the memory problems that arise from concurrent access. This is where the so-called block sieve comes in: it saves memory and splits one large sieve into a number of small ones. This becomes possible due to one interesting property of the sieve, namely: every composite number between √N and N is a multiple of some prime from the interval 2 to √N. In other words, once the primes up to √N are found (let's call this the preliminary sieve), all that remains for each block is to find the elements from which to start crossing out.
The first thing that comes to mind is to iterate over the divisors and, whenever a block element is divisible, start crossing out with a step equal to the divisor. But even rough estimates show that such an approach is very inefficient, even with every possible optimization.

If you take N = 10,000, then just finding the primes (1229 of them) takes a little under thirty thousand checks; moreover, don't forget that each element is checked at least once, which is another ten thousand checks.

By the way, this option was implemented on CUDA and hung the GeForce 9600 GT video card for almost an hour.

Attempt number two


This could have been the end of it. But the solution came, as always, unexpectedly. Here is what it looked like.

Suppose N = 100, the number of blocks is ten, and so is the size of each part. The preliminary sieve will hold four elements: 2, 3, 5, 7. Since the size of the preliminary sieve equals the size of a part, we can start from the second block, which begins with the familiar number 11. As an example of a prime used for crossing out, take 3. (I ignore 2 because of an obvious optimization involving it, but more on that later.) The previous block ended at 10; the last crossed-out multiple of 3 is 9, located one element from the end of the block. Consequently, if you subtract this "tail" from 3, you find how far into the next block the first element to be crossed out lies: at 12.

And so on for every prime of the preliminary sieve. In addition, the optimizations used in the "ordinary" sieve apply here as well, namely: saving memory by excluding even elements, and starting the crossing-out for each prime at its square.

From theory to practice


It remained to put all the theoretical groundwork together. Here's what came of it:

for (i = 0; i < PARTS; i++) {
    // Reset the current part; only odd numbers are stored.
    for (j = 0; j < PART_SIZE / 2; j++)
        part[j] = true;
    // part1/result1 come from the preliminary sieve; j starts at 1 to skip 2.
    for (j = 1; j < result1; j++) {
        offset = 0;
        if (part1[j] * part1[j] <= (i + 1) * PART_SIZE) {
            // First multiple of part1[j] strictly inside the current part...
            offset = i * PART_SIZE + part1[j] - i * PART_SIZE % part1[j];
            // ...adjusted to be odd, since even elements are not stored.
            if (offset % 2 == 0)
                offset += part1[j];
        } else
            break;   // the remaining primes start crossing out beyond this part
        for (k = offset; k <= (i + 1) * PART_SIZE; k += part1[j] * 2)
            if (part[(k - (i * PART_SIZE) - 1) / 2])
                part[(k - (i * PART_SIZE) - 1) / 2] = false;
    }
    if (i == 0)
        result = j - 1;
    for (j = 0; j < PART_SIZE / 2; j++)
        if (part[j]) result++;
}

I don't consider it necessary to include the preliminary sieve code here, since it is identical to a "regular" sieve.

This version of the sieve, let's call it the block version, runs in 8-9 seconds.

Splitting the sieve


It was decided not to stop at what had been achieved, so a version running in two threads was quickly written.

The sieve in two threads: the result is 4.7 seconds.

But one factor slowed down execution: the second thread, working with the second half of the sieve, turned out to be slower than the main one. By shifting the "cut" point of the sieve, we managed to speed up execution by almost half a second.

The result after shifting the cut: 4.3 seconds.

The sieve moves to the GPU


The algorithm, tested on the CPU, was ported to the GPU. The grid dimensions and the sizes of the sieve parts that yield the best performance were established experimentally.

The sieve of Eratosthenes on CUDA: the result is 4.5 seconds.

Source code is available on GitHub.
Update 07/19/2014: added a version with OpenMP.

That's all, thank you all for your attention!

Source: https://habr.com/ru/post/230357/

