
CUDA: Working with memory. Part I.

In my previous posts on CUDA I have hardly touched on how graphics card memory is used. It is time to fill this gap.

Since the topic is quite large, I have decided to split it into several parts. In this part I will describe the main types of memory available on the video card and show, with an example, how the choice of memory type affects the performance of GPU computations.

Video card and memory types


When programming a GPU, the developer has several types of memory at their disposal: register, local, global, shared, constant, and texture memory. Each of these types has a specific purpose, determined by its technical characteristics (speed, read and write access rules). The memory hierarchy is shown in Fig. 1.

Fig. 1. Types of video card memory
  1. Register memory (registers) is the fastest of all types. The number of registers available per block can be obtained with the already familiar cudaGetDeviceProperties function (see the sketch after this list), and the registers available to a single GPU thread can be estimated by dividing that number by the number of threads in the block. All GPU registers are 32-bit. CUDA provides no explicit way to use register memory; the compiler does all the work of placing data in registers.
  2. Local memory may be used by the compiler when a function has a large number of local variables. In terms of speed, local memory is much slower than registers. The nVidia documentation recommends resorting to local memory only when absolutely necessary. There is no explicit way to forbid the use of local memory, so when performance drops it is worth analyzing the code carefully and eliminating unnecessary local variables.
  3. Global memory is the slowest type of memory accessible from the GPU. Global variables can be declared with the __device__ specifier or allocated dynamically with functions of the cudaMallocXXX family. Global memory is mainly used to store large amounts of data transferred to the device from the host; this transfer is performed with the cudaMemcpyXXX functions. In algorithms that require high performance, the number of global memory operations must be minimized.
  4. Shared memory is a fast type of memory. It is recommended for reducing the number of accesses to global memory and for storing local variables of functions. Shared memory addresses are the same for all threads within one block, which makes it possible to exchange data between threads of the same block. To place data in shared memory, the __shared__ specifier is used.
  5. Constant memory is one of the faster kinds of memory available on the GPU. Its distinctive feature is that the host can write to it, while the GPU can only read from it, hence the name. To place data in constant memory, the __constant__ specifier is provided. If an array is to reside in constant memory, its size must be specified in advance: dynamic allocation, unlike with global memory, is not supported for constant memory. To copy data from the host into constant memory, the cudaMemcpyToSymbol function is used, and to copy from the device back to the host, cudaMemcpyFromSymbol; as you can see, this differs slightly from working with global memory (see the second sketch after this list).
  6. Texture memory, as the name implies, is designed primarily for working with textures. It has specific features for addressing, reading, and writing data. I will discuss texture memory in more detail when covering image processing on the GPU.
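
As a quick illustration of the first point, here is a minimal sketch of how the register, shared, and constant memory amounts can be queried. The field names (regsPerBlock, sharedMemPerBlock, totalConstMem) are fields of the cudaDeviceProp structure from the CUDA runtime API; the printed messages are just my own choice.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    // Query the properties of device 0
    cudaGetDeviceProperties(&prop, 0);

    printf("Device: %s\n", prop.name);
    printf("32-bit registers per block: %d\n", prop.regsPerBlock);
    printf("Shared memory per block: %d bytes\n", (int)prop.sharedMemPerBlock);
    printf("Total constant memory: %d bytes\n", (int)prop.totalConstMem);

    return 0;
}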
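
And for point 5, a small sketch of working with constant memory. cudaMemcpyToSymbol and cudaMemcpyFromSymbol are the actual runtime API calls; the array name and its size here are purely illustrative.

#include <cuda_runtime.h>

#define N 256

// Array in constant memory; its size must be known at compile time
__constant__ float constData[N];

__host__ void copyToConstantMemory(const float* hostData)
{
    // Host -> constant memory
    cudaMemcpyToSymbol(constData, hostData, N * sizeof(float));
}

__host__ void copyFromConstantMemory(float* hostData)
{
    // Constant memory -> host
    cudaMemcpyFromSymbol(hostData, constData, N * sizeof(float));
}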

Shared Memory Example


Just above, I briefly talked about the different types of memory that are available when programming a GPU. Now I want to give an example of using shared memory during matrix transposition.

Before writing the main code, let me mention a small debugging aid. As you know, functions from the CUDA runtime API can return various error codes, which I did not take into account last time. To simplify your life, you can use the following macro to catch errors:
#define CUDA_DEBUG

#ifdef CUDA_DEBUG

#define CUDA_CHECK_ERROR(err) \
    if (err != cudaSuccess) { \
        printf("Cuda error: %s\n", cudaGetErrorString(err)); \
        printf("Error in file: %s, line: %i\n", __FILE__, __LINE__); \
    }

#else

#define CUDA_CHECK_ERROR(err)

#endif


As you can see, if the CUDA_DEBUG macro is defined, the error code is checked and the file and line where the error occurred are printed. The macro can be defined for debug builds and removed for release builds.
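
For clarity, a minimal usage sketch (the allocation size here is arbitrary; the point is only how the macro wraps a runtime call):

float* devBuffer;
// Any runtime API call that returns cudaError_t can be wrapped
CUDA_CHECK_ERROR(cudaMalloc((void**)&devBuffer, 1024 * sizeof(float)));
CUDA_CHECK_ERROR(cudaFree(devBuffer));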

Getting down to the main task.

To see how the use of shared memory affects the speed of computation, we also need a version of the function that uses only global memory.
Let's write it:

//
// Matrix transpose kernel without shared memory
//
// inputMatrix  - pointer to the input matrix
// outputMatrix - pointer to the result matrix
// width        - matrix width (number of columns)
// height       - matrix height (number of rows)
//
__global__ void transposeMatrixSlow(float* inputMatrix, float* outputMatrix, int width, int height)
{
    int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    int yIndex = blockDim.y * blockIdx.y + threadIdx.y;

    if ((xIndex < width) && (yIndex < height))
    {
        // Linear index of the element in the input matrix
        int inputIdx = xIndex + width * yIndex;

        // Linear index of the element in the result (transposed) matrix
        int outputIdx = yIndex + height * xIndex;

        outputMatrix[outputIdx] = inputMatrix[inputIdx];
    }
}



This function simply copies the rows of the input matrix into the columns of the result matrix. The only tricky point is computing the indices of the matrix elements: since the kernel can be launched with different block and grid dimensions, the built-in variables blockDim, blockIdx, and threadIdx are used for this.
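
The bounds check in the kernel also lets it handle matrices whose sizes are not multiples of the block size. The main program later in the article simply uses sizes divisible by BLOCK_DIM, but as a sketch (reusing the BLOCK_DIM constant and device pointers that appear below), a launch for arbitrary sizes could round the grid up:

// Sketch: launching transposeMatrixSlow for arbitrary width and height
// by rounding the grid size up; threads outside the matrix are filtered
// out by the bounds check inside the kernel.
dim3 blockSize(BLOCK_DIM, BLOCK_DIM, 1);
dim3 gridSize((width + BLOCK_DIM - 1) / BLOCK_DIM,
              (height + BLOCK_DIM - 1) / BLOCK_DIM,
              1);

transposeMatrixSlow<<<gridSize, blockSize>>>(devInputMatrix, devOutputMatrix, width, height);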

We write a transpose function that uses shared memory:

#define BLOCK_DIM 16

//
// Matrix transpose kernel using shared memory
//
// inputMatrix  - pointer to the input matrix
// outputMatrix - pointer to the result matrix
// width        - matrix width (number of columns)
// height       - matrix height (number of rows)
//
__global__ void transposeMatrixFast(float* inputMatrix, float* outputMatrix, int width, int height)
{
    __shared__ float temp[BLOCK_DIM][BLOCK_DIM];

    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    if ((xIndex < width) && (yIndex < height))
    {
        // Linear index of the element in the input matrix
        int idx = yIndex * width + xIndex;

        // Copy the element into shared memory
        temp[threadIdx.y][threadIdx.x] = inputMatrix[idx];
    }

    // Wait until all threads of the block have filled the shared buffer
    __syncthreads();

    xIndex = blockIdx.y * blockDim.y + threadIdx.x;
    yIndex = blockIdx.x * blockDim.x + threadIdx.y;

    if ((xIndex < height) && (yIndex < width))
    {
        // Linear index of the element in the result matrix
        int idx = yIndex * height + xIndex;

        // Write the transposed element from shared memory to the result matrix
        outputMatrix[idx] = temp[threadIdx.x][threadIdx.y];
    }
}



In this function, I use shared memory as a two-dimensional array.
As already mentioned, shared memory addresses are the same for all threads within a block; therefore, to avoid conflicts when reading and writing, each element of the array is assigned to exactly one thread of the block.
After the elements of the input matrix have been copied into the temp buffer, the __syncthreads function is called. It synchronizes the threads within a block. What distinguishes it from other synchronization mechanisms is that it executes only on the GPU.
At the end, the saved elements of the input matrix are copied into the result matrix according to the transposition rule.
It may seem that this function should run more slowly than the version without shared memory, since an intermediary has appeared. In reality, however, the chain global memory - shared memory - global memory is much faster than direct global-to-global copying: staging through shared memory lets both the reads from and the writes to global memory proceed with consecutive addresses, whereas the naive version writes to global memory with a large stride.
Also note that the bounds of the matrix arrays have to be checked manually; the GPU has no hardware facilities for monitoring array bounds.
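
A side note that goes beyond the code above: with this access pattern, the read temp[threadIdx.x][threadIdx.y] in the second phase makes the threads of a warp walk down a column of the shared array, which typically causes shared memory bank conflicts. A common refinement, given here as a sketch rather than a measured improvement for this particular example, is to pad the array by one column:

// Padding the second dimension by one element shifts consecutive rows
// into different shared memory banks, so the column-wise reads in the
// second phase of the kernel no longer hit the same bank.
__shared__ float temp[BLOCK_DIM][BLOCK_DIM + 1];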

And finally, let's write the transpose function, which is executed only on the CPU:

// Matrix transpose function executed on the CPU
__host__ void transposeMatrixCPU(float* inputMatrix, float* outputMatrix, int width, int height)
{
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            outputMatrix[x * height + y] = inputMatrix[y * width + x];
        }
    }
}



Now we need to generate the data for the computation, copy it from the host to the device (when the GPU is used), measure the execution time, and release the resources.
Since these steps are roughly the same as the ones I described last time, I will give the whole fragment at once:

#define GPU_SLOW 1
#define GPU_FAST 2
#define CPU      3

#define ITERATIONS 20 // number of iterations in the load loop

__host__ int main()
{
    int width = 2048;   // matrix width
    int height = 1536;  // matrix height

    int matrixSize = width * height;
    int byteSize = matrixSize * sizeof(float);

    // Allocate memory for the matrices on the host
    float* inputMatrix = new float[matrixSize];
    float* outputMatrix = new float[matrixSize];

    // Fill the input matrix
    for (int i = 0; i < matrixSize; i++)
    {
        inputMatrix[i] = i;
    }

    // Choose the compute mode
    printf("Select compute mode: 1 - Slow GPU, 2 - Fast GPU, 3 - CPU\n");
    int mode;
    scanf("%i", &mode);

    // Write the input matrix to a file
    printMatrixToFile("before.txt", inputMatrix, width, height);

    if (mode == CPU) // compute on the CPU
    {
        int start = GetTickCount();
        for (int i = 0; i < ITERATIONS; i++)
        {
            transposeMatrixCPU(inputMatrix, outputMatrix, width, height);
        }
        // CPU compute time (in milliseconds)
        printf("CPU compute time: %i\n", GetTickCount() - start);
    }
    else // compute on the GPU
    {
        float* devInputMatrix;
        float* devOutputMatrix;

        // Allocate global memory on the device
        CUDA_CHECK_ERROR(cudaMalloc((void**)&devInputMatrix, byteSize));
        CUDA_CHECK_ERROR(cudaMalloc((void**)&devOutputMatrix, byteSize));

        // Copy the input matrix from the host to the device
        CUDA_CHECK_ERROR(cudaMemcpy(devInputMatrix, inputMatrix, byteSize, cudaMemcpyHostToDevice));

        // Grid and block configuration
        dim3 gridSize = dim3(width / BLOCK_DIM, height / BLOCK_DIM, 1);
        dim3 blockSize = dim3(BLOCK_DIM, BLOCK_DIM, 1);

        cudaEvent_t start;
        cudaEvent_t stop;

        // Create events for timing on the GPU
        CUDA_CHECK_ERROR(cudaEventCreate(&start));
        CUDA_CHECK_ERROR(cudaEventCreate(&stop));

        // Start GPU timing
        cudaEventRecord(start, 0);

        if (mode == GPU_SLOW) // version without shared memory
        {
            for (int i = 0; i < ITERATIONS; i++)
            {
                transposeMatrixSlow<<<gridSize, blockSize>>>(devInputMatrix, devOutputMatrix, width, height);
            }
        }
        else if (mode == GPU_FAST) // version with shared memory
        {
            for (int i = 0; i < ITERATIONS; i++)
            {
                transposeMatrixFast<<<gridSize, blockSize>>>(devInputMatrix, devOutputMatrix, width, height);
            }
        }

        // Stop GPU timing
        cudaEventRecord(stop, 0);

        float time = 0;
        // Wait until the stop event has been recorded
        cudaEventSynchronize(stop);
        // GPU compute time (in milliseconds)
        cudaEventElapsedTime(&time, start, stop);

        // Print the time
        printf("GPU compute time: %.0f\n", time);

        // Copy the result from the device to the host
        CUDA_CHECK_ERROR(cudaMemcpy(outputMatrix, devOutputMatrix, byteSize, cudaMemcpyDeviceToHost));

        // Free the device memory and destroy the events
        CUDA_CHECK_ERROR(cudaFree(devInputMatrix));
        CUDA_CHECK_ERROR(cudaFree(devOutputMatrix));

        CUDA_CHECK_ERROR(cudaEventDestroy(start));
        CUDA_CHECK_ERROR(cudaEventDestroy(stop));
    }

    // Write the result matrix to a file (note that its dimensions are swapped)
    printMatrixToFile("after.txt", outputMatrix, height, width);

    // Free the host memory
    delete[] inputMatrix;
    delete[] outputMatrix;

    return 0;
}




If the calculations are performed only on the CPU, the GetTickCount() function, declared in windows.h, is used to measure the computation time. To measure the computation time on the GPU, the cudaEventElapsedTime function is used; its prototype is as follows:

cudaError_t cudaEventElapsedTime(float* time, cudaEvent_t start, cudaEvent_t end), where
  1. time - pointer to a float that receives the elapsed time between the start and end events (in milliseconds),
  2. start - handle of the first event,
  3. end - handle of the second event.

Returns:
  1. cudaSuccess - if successful
  2. cudaErrorInvalidValue - incorrect value
  3. cudaErrorInitializationError - initialization error
  4. cudaErrorPriorLaunchFailure - error during previous asynchronous function launch
  5. cudaErrorInvalidResourceHandle - invalid event handle
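
Like any other runtime call, cudaEventElapsedTime can be wrapped in the CUDA_CHECK_ERROR macro from the beginning of the article so that the error codes listed above do not go unnoticed; a small sketch, assuming the start and stop events from the listing above:

float time = 0.0f;
// Any of the error codes listed above will be printed by the macro
CUDA_CHECK_ERROR(cudaEventSynchronize(stop));
CUDA_CHECK_ERROR(cudaEventElapsedTime(&time, start, stop));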


I also write the input matrix and the result to files with the printMatrixToFile function, to make sure the results are correct. The code of this function is as follows:

__host__ void printMatrixToFile(const char* fileName, float* matrix, int width, int height)
{
    FILE* file = fopen(fileName, "wt");
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            fprintf(file, "%.0f\t", matrix[y * width + x]);
        }
        fprintf(file, "\n");
    }
    fclose(file);
}




If the matrices are very large, then outputting data to files can slow down the execution of the program.
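
If you prefer not to inspect the output files by hand, a simple in-memory comparison of the CPU and GPU results is enough. This helper is my own sketch, not part of the original program; exact comparison of floats is acceptable here only because the matrix is filled with integral values.

// Compare two result matrices element by element
__host__ bool matricesEqual(const float* a, const float* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        if (a[i] != b[i])
        {
            printf("Mismatch at element %i: %.0f vs %.0f\n", i, a[i], b[i]);
            return false;
        }
    }
    return true;
}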

Conclusion



For testing I used matrices of 2048 * 1536 = 3145728 elements and 20 iterations in the load loops. The measurements gave the following results (Fig. 2).


Fig. 2. Computation time (lower is better).

As you can see, the GPU version with shared memory runs almost 20 times faster than the CPU version. It is also worth noting that with shared memory the computation is about 4 times faster than without it.
In this example I do not account for the time spent copying data from the host to the device and back, but in real applications it must be taken into account. The number of data transfers between the CPU and GPU should be minimized whenever possible.
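
As an illustration of that last point, the same event mechanism can be used to time the transfers themselves. This is a sketch of my own that reuses the variables from the main program above:

// Time the host-to-device copy with the same cudaEvent mechanism
cudaEvent_t copyStart, copyStop;
CUDA_CHECK_ERROR(cudaEventCreate(&copyStart));
CUDA_CHECK_ERROR(cudaEventCreate(&copyStop));

CUDA_CHECK_ERROR(cudaEventRecord(copyStart, 0));
CUDA_CHECK_ERROR(cudaMemcpy(devInputMatrix, inputMatrix, byteSize, cudaMemcpyHostToDevice));
CUDA_CHECK_ERROR(cudaEventRecord(copyStop, 0));
CUDA_CHECK_ERROR(cudaEventSynchronize(copyStop));

float copyTime = 0.0f;
CUDA_CHECK_ERROR(cudaEventElapsedTime(&copyTime, copyStart, copyStop));
printf("Host to device copy time: %.2f ms\n", copyTime);

CUDA_CHECK_ERROR(cudaEventDestroy(copyStart));
CUDA_CHECK_ERROR(cudaEventDestroy(copyStop));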

P.S. I hope you were impressed by the performance boost a GPU can provide.

Source: https://habr.com/ru/post/55461/

