
CUDA: performance aspects in solving typical problems

Before porting a computational algorithm to a video card, it is worth asking whether we will actually get the desired performance gain or just waste time. Despite the manufacturers' promises of hundreds of GFLOPS, the current generation of cards has problems of its own that are better known in advance. I will not go deep into theory; instead I will look at a few significant practical points and draw some useful conclusions.

I will assume that you already have a general idea of how CUDA works and have downloaded a stable version of the CUDA Toolkit.

I will be tormenting a mid-range GTX 460 on a Core 2 Duo E8400.

Function call

If we want to compute anything at all, we cannot avoid calling a function that executes on the card. So let's write the simplest test kernel:
__global__ void stubCUDA(unsigned short *output)
{
    // the most harmless kernel possible: does nothing
}

Let me remind you that the __global__ specifier marks a function that executes on the GPU and is called from the CPU:
cudaThreadSynchronize();
stubCUDA<<<GRID, THREADS>>>(0);
cudaThreadSynchronize();

All kernel launches are asynchronous by default, so the calls to cudaThreadSynchronize() are needed to wait for the launched kernel to complete.

Let's run this block in a loop: we get about 15,000 calls per second with GRID = 160, THREADS = 96.

Frankly, that is not much: at 15,000 calls per second, even the simplest kernel that does nothing takes about 0.07 ms per call.

The first guess is that most of the time goes into synchronization, and that asynchronous launches would be much faster (even if they only suit certain kinds of tasks).

Let's check. Without synchronization we managed about 73,100 launches per second. A result, it must be said, that is still not impressive.

One last test: launch the kernel with GRID = THREADS = 1, which would seem to eliminate the overhead of spawning a pile of threads inside the card. But no: we get the same 73,000-73,500 calls per second.
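
For reference, here is a minimal sketch of the kind of measurement loop used above; the iteration count and the crude host-side timer are my choices, not the article's:

#include <cstdio>
#include <ctime>

#define GRID    160
#define THREADS 96

// stubCUDA is the empty kernel defined above

int main()
{
    const int N = 100000;                 // launches to average over
    clock_t start = clock();              // crude timer: the busy-wait in
                                          // cudaThreadSynchronize() makes
                                          // CPU time approximate wall time
    for (int i = 0; i < N; i++)
    {
        cudaThreadSynchronize();
        stubCUDA<<<GRID, THREADS>>>(0);
        cudaThreadSynchronize();
    }
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("%.0f calls per second\n", N / seconds);
    return 0;
}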

So, the moral: the launch overhead is essentially fixed at roughly 14 µs (about 70 µs with synchronization), regardless of the kernel body or thread count. Pack as much work as possible into each kernel call, and do not offload tiny jobs to the card at all.

Memory access from outside

To compute anything useful we need input and output data, so first we need to know how fast data travels to and from the video card. The workhorse is the following function:
cudaMemcpy(data_cuda, image, data_cuda_size, cudaMemcpyHostToDevice);

Yes, CUDA does offer asynchronous transfer facilities but, running ahead, their throughput is no different from the synchronous function.
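
For completeness, an asynchronous copy looks roughly like this; the sketch assumes the host buffer is allocated as pinned memory via cudaMallocHost, which cudaMemcpyAsync needs in order to actually overlap with CPU work (host_buf is an illustrative name):

unsigned short *host_buf;
cudaMallocHost((void **)&host_buf, data_cuda_size);   // pinned host memory

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(data_cuda, host_buf, data_cuda_size,
                cudaMemcpyHostToDevice, stream);      // returns immediately
// ... the CPU is free to do other work here ...
cudaStreamSynchronize(stream);                        // wait for the copy to finish

cudaStreamDestroy(stream);
cudaFreeHost(host_buf);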

Copying large blocks (over 100 megabytes) in either direction, cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost, we get about 2 GB/s. On the whole, that is very good.

Things are much worse with very small structures. Transferring 4 bytes at a time, we get no more than 22,000 calls per second, i.e. 88 KB/s.

The moral: per-call overhead dominates small transfers. Pack small pieces of data into one buffer and move them with a single cudaMemcpy.
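
A sketch of that obvious workaround: stage the small values in one host buffer and move them with a single copy (all names and sizes here are illustrative, not from the article):

const int N = 1024;                    // number of small values to send
float host_params[N];                  // staging buffer on the host
for (int i = 0; i < N; i++)
    host_params[i] = (float)i;         // stand-in for whatever produces the values

float *dev_params;
cudaMalloc((void **)&dev_params, N * sizeof(float));

// one call instead of N: the per-call overhead is paid once
cudaMemcpy(dev_params, host_params, N * sizeof(float), cudaMemcpyHostToDevice);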

Memory access from the inside

Once the data has been transferred to the card, we can start working with it. Let's estimate the speed of access to video memory with the following kernel:
__global__ void accessTestCUDA(unsigned short *output, unsigned short *data, int blockcount, int blocksize)
{
    // just a test of maximum access speed: does nothing useful
    unsigned short temp;
    for (int i = blockIdx.x; i < blockcount; i += gridDim.x)
    {
        int vectorBase = i * blocksize;
        int vectorEnd = vectorBase + blocksize;

        for (int j = vectorBase + threadIdx.x; j < vectorEnd; j += blockDim.x)
        {
            temp = data[j];
        }
    }
    output[0] = temp;
}


The GRID and THREADS parameters are already in play here; for now I will not explain the loop structure, but believe me: everything is as it should be. The picky will point out that the result is written back incorrectly because there is no synchronization, but for this test we don't care.

So, we get about 42 GB/s for reading. Not bad at all.

Now let's modify the kernel so that it copies the input data to the output. Pointless in itself, but it lets us estimate the speed of writing to video memory (the change is trivial, so instead of duplicating the whole kernel I give a sketch of the inner loop below).
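
The inner loop becomes (output must now point to a buffer as large as data):

for (int j = vectorBase + threadIdx.x; j < vectorEnd; j += blockDim.x)
{
    output[j] = data[j];   // read plus write instead of read only
}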

We get about 30 GB/s for combined read/write. Also not bad.

One caveat: we actually used sequential (more or less) memory access. For arbitrary access the numbers can drop by up to a factor of two, but even then, is that really a problem?

The moral: on-board memory bandwidth (30-40 GB/s) is an order of magnitude higher than the PCIe link. Once the data is on the card, keep it there and avoid shuttling intermediate results back to the host.

Arithmetic operations

We will skip the trivial examples and do something useful right away: image normalization (pixel[t] := (pixel[t] - sub) * factor). The code:
__global__ void normalizeCUDA(unsigned short *data, int blockcount, int blocksize, float sub, float factor)
{
    for (int i = blockIdx.x; i < blockcount; i += gridDim.x)
    {
        int vectorBase = i * blocksize;
        int vectorEnd = vectorBase + blocksize;

        for (int j = vectorBase + threadIdx.x; j < vectorEnd; j += blockDim.x)
        {
            register float d = (float)data[j];
            d = (d - sub) * factor;
            data[j] = (unsigned short)d;
        }
    }
}


Three seemingly costly operations appear here: a cast to float, an ADDMUL, and a cast back to integer. Forums will scare you that int-float conversion performs terribly; perhaps that was true on older generations of cards, but not any more.

Total processing speed: 26 GB/s. Three extra operations cost only 13% relative to plain copying.

If you look closely, the code does not normalize quite correctly: before storing to an integer, the float should be rounded, for example with round(). Don't do it; in fact, try never to use it!

round(d): 20 GB/s, another 23% lost.
(unsigned short)(d + 0.5): 26 GB/s, the time did not change even within the measurement error.
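
In code, the two variants of the final store differ by a single line (a sketch; note that the +0.5 trick matches round() only for non-negative values, which pixel data is):

data[j] = (unsigned short)round(d);    // 20 GB/s: library rounding is costly
data[j] = (unsigned short)(d + 0.5);   // 26 GB/s: same result for non-negative d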

The moral: type conversions and simple float arithmetic are nearly free next to the memory traffic, but library rounding is not. For non-negative data, adding 0.5 before the integer cast gives correct rounding at no cost.

Logical operations

Let's estimate the speed of logical operations, and do one more good deed along the way: find the minimum and maximum values in the array. This stage usually precedes normalization (and that is what it was written for), but we take them in reverse order, because this one is harder. Here is the working code:
__global__ void getMinMaxCUDA(unsigned short *output, unsigned short *data, int blockcount, int blocksize)
{
    __shared__ unsigned short sMins[MAX_THREADS];
    __shared__ unsigned short sMaxs[MAX_THREADS];

    sMins[threadIdx.x] = data[0];
    sMaxs[threadIdx.x] = data[0];

    for (int i = blockIdx.x; i < blockcount; i += gridDim.x)
    {
        int vectorBase = i * blocksize;
        int vectorEnd = vectorBase + blocksize;

        for (int j = vectorBase + threadIdx.x; j < vectorEnd; j += blockDim.x)
        {
            register unsigned short d = data[j];
            if (d < sMins[threadIdx.x])
                sMins[threadIdx.x] = d;
            if (d > sMaxs[threadIdx.x])
                sMaxs[threadIdx.x] = d;
        }
    }

    __syncthreads();

    if (threadIdx.x == 0)
    {
        register unsigned short min = sMins[0];
        for (int j = 1; j < blockDim.x; j++)
            if (sMins[j] < min)
                min = sMins[j];
        if (min < output[0])
            output[0] = min;
    }

    if (threadIdx.x == 1)
    {
        register unsigned short max = sMaxs[0];
        for (int j = 1; j < blockDim.x; j++)
            if (sMaxs[j] > max)
                max = sMaxs[j];
        if (max > output[1])
            output[1] = max;
    }

    __syncthreads();
}


This one cannot be done without thread synchronization and shared memory.

Final speed: 29 GB/s, even faster than normalization.

Why did I combine the minimum and maximum in one kernel? Both are usually needed, and separate calls waste time (see the first section).

In general, throw a stone at whoever told you that video cards handle conditional operations badly. To slow this fragment down artificially by almost a factor of two, I had to nest conditions four levels deep: if () if () if () if () else if () ...

The moral: simple branching is cheap on modern cards; a shared-memory reduction with a couple of if's per element runs at nearly full memory speed. Fuse related passes (like min and max) into one kernel to save launch overhead.

Complex data structures

Guided by the idea that algorithms and data structures are tightly coupled (remember N. Wirth, at least), we should check how things stand with more complex data structures.

Here a problem arises: when passing data to a kernel, we can use only two kinds of objects, constant integral values (numbers) and references to blocks of video memory.

The idea of building, say, trees out of pointers is ruled out right away: a pointer obtained on the host means nothing inside the card's address space.

So complex data structures have to be represented as one contiguous block of memory plus an array of references (offsets) into that block. In this way you can easily express a hash table, a tree, or an index structure over any data array.
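
A sketch of how such a structure might be set up on the host, one flat data block plus an array of offsets into it (all names and sizes are illustrative; host_index and host_data are assumed to be prepared on the CPU, e.g. by a tree or hash-table layout pass):

const int elem_count  = 1 << 20;          // number of elements in the structure
const int total_elems = elem_count * 4;   // total payload size in unsigned shorts

unsigned short *dev_data;   // all payloads packed into one flat block
int *dev_index;             // index[j] = offset of element j inside dev_data

cudaMalloc((void **)&dev_data,  total_elems * sizeof(unsigned short));
cudaMalloc((void **)&dev_index, elem_count  * sizeof(int));

cudaMemcpy(dev_index, host_index, elem_count * sizeof(int),
           cudaMemcpyHostToDevice);
cudaMemcpy(dev_data, host_data, total_elems * sizeof(unsigned short),
           cudaMemcpyHostToDevice);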

The price of such tricks is double indexing:
for (int j = vectorBase + threadIdx.x; j < vectorEnd; j += blockDim.x)
{
    temp = data[index[j] + i];
}

This fragment runs at 10 to 30 GB/s, depending on the contents and size of the index and the data. The memory layout can be optimized, but even in the best case we lose 25% of access speed. Triple indexing behaves even worse, losing 40-60% of performance.

Today we understood a lot

With proper use of a video card's capabilities, you can get unprecedented performance in tasks such as image, sound, and video processing: anywhere there are large volumes of data, a need for clever arithmetic, and no complex data structures.

If you like the topic, next time I will describe how to compute several useful things on a video card (Distance Map, image morphology, search indexes) and show some interesting data structures that run fast enough and do not create unnecessary synchronization problems.

Source: https://habr.com/ru/post/119435/

