
CUDA: How does the GPU work

Understanding the internal model of an nVidia GPU is a key step toward GPGPU programming with CUDA. This time I will try to describe the software side of the GPU in more detail: I will cover the key points of the CUDA compiler and the CUDA runtime API, and finally give an example of using CUDA for simple mathematical calculations.

Let's get started

Computational GPU model:


Consider the computational model of the GPU in more detail.
  1. At the top level, the computational core of the GPU consists of blocks that are grouped into a grid of dimension N1 * N2 * N3. This can be represented as follows:

    Fig. 1. Computational model of the GPU device.
    The dimensions of the grid of blocks can be obtained with the cudaGetDeviceProperties function; in the resulting structure they are stored in the maxGridSize field. For example, on my GeForce 9600M GS the grid of blocks is 65535 * 65535 * 1, that is, the grid is two-dimensional (this matches Compute Capability v.1.1).
  2. Each block in turn consists of threads, which are the direct performers of the calculations. The threads in a block are organized as a three-dimensional array (Fig. 2); its maximum dimensions can also be obtained with cudaGetDeviceProperties, this time from the maxThreadsDim field.


Fig. 2. Structure of a GPU block.

When working with the GPU, you can choose a grid of the required size and configure the blocks to suit the needs of your task.
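As a side note, a minimal sketch of such a query (assuming a single CUDA-capable device with index 0) might look like this:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // query the properties of device 0

    // maximum grid and block dimensions reported by the device
    printf("Max grid size: %d x %d x %d\n",
        prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Max block size: %d x %d x %d\n",
        prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);

    return 0;
}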

CUDA and C language:


CUDA itself (the nvcc.exe compiler) introduces a number of extensions to the C language that are needed for writing GPU code:
  1. Function specifiers, which determine how and from where functions will be called and executed.
  2. Variable specifiers that serve to indicate the type of memory used by the GPU.
  3. GPU kernel launch qualifiers.
  4. Built-in variables for identifying threads, blocks, and other parameters when executing code in the GPU core.
  5. Additional variable types.

As mentioned, function specifiers determine how and from where functions will be called. There are 3 such specifiers in CUDA:
  1. __host__ - executed on the CPU and called from the CPU (the default for ordinary functions);
  2. __global__ - executed on the GPU and called from the CPU (this is the kernel specifier);
  3. __device__ - executed on the GPU and called from the GPU.

Kernel launch qualifiers are used to specify the number of blocks and threads, and the amount of memory, that you want to allocate for a computation on the GPU. The kernel launch syntax is as follows:

myKernelFunc <<< gridSize, blockSize, sharedMemSize, cudaStream >>> (float * param1, float * param2), where
  1. gridSize - the dimensions (dim3) of the grid of blocks allocated for the computation,
  2. blockSize - the dimensions (dim3) of each block,
  3. sharedMemSize - the amount of additional shared memory, in bytes, to allocate per block,
  4. cudaStream - the stream (cudaStream_t) in which the kernel call will be issued.

And of course myKernelFunc itself is the kernel function (the __global__ specifier). Some configuration arguments can be omitted when calling the kernel, for example sharedMemSize and cudaStream.
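To make the launch configuration concrete, here is a hedged sketch of a call with all four configuration arguments written out explicitly (myKernelFunc, param1 and param2 are placeholders, not real functions or data):

dim3 gridSize(1, 1, 1);      // one block in the grid
dim3 blockSize(512, 1, 1);   // 512 threads along x in each block
size_t sharedMemSize = 0;    // no dynamically allocated shared memory
cudaStream_t stream = 0;     // the default (zero) stream

myKernelFunc<<<gridSize, blockSize, sharedMemSize, stream>>>(param1, param2);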

It is also worth mentioning the built-in variables:
  1. gridDim - the dimensions of the grid, of type dim3;
  2. blockDim - the dimensions of the block, of type dim3;
  3. blockIdx - the index of the current block within the grid, of type uint3;
  4. threadIdx - the index of the current thread within the block, of type uint3;
  5. warpSize - the warp size, of type int.

By the way, gridDim and blockDim are the very values that we pass at kernel launch, although inside the kernel they are read-only.
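To see how these variables work together, here is a hedged sketch of the usual pattern for computing a global element index inside a kernel (the kernel name and array are invented for the example):

__global__ void scaleArray( float * data, int n)
{
    // global index: the offset of the block plus the position of the thread inside it
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // gridDim.x * blockDim.x threads are running along x in total
    if (idx < n)
        data[idx] *= 2.0f;
}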

Additional variable types and their specifiers will be discussed directly in the examples of working with memory.

CUDA host API:


Before you start using CUDA directly for computing, you need to familiarize yourself with the so-called CUDA host API, which is the link between the CPU and the GPU. The CUDA host API, in turn, can be divided into a low-level API called the CUDA driver API, which provides access to the CUDA user-mode driver, and the high-level API, the CUDA runtime API. In my examples, I will use the CUDA runtime API.

The CUDA runtime API includes, among others, the following function groups: device management, memory management, execution control, stream management, event management, texture reference management, interoperability with OpenGL and Direct3D, and error handling.

Understanding how the GPU works:


As was said, a thread is the direct performer of the calculations. How, then, are calculations parallelized between threads? Let us consider the work of a single block.

Task. It is required to calculate the sum of two vectors of N elements each.

We know the maximum dimensions of our block: 512 * 512 * 64 threads. Since our vector is one-dimensional, for now we will limit ourselves to the x-dimension of the block, that is, we will use only one strip of threads from the block (Fig. 3).

Fig. 3. Our strip of threads from the used block.

Note that the x-dimension of the block is 512, so we can add vectors of at most N <= 512 elements in a single launch. For more massive calculations, however, you can use a larger number of blocks and multidimensional indexing. I also noticed one interesting detail: some of you may have thought that a single block can hold 512 * 512 * 64 = 16777216 threads; naturally this is not so - the total number of threads per block cannot exceed 512 (at least on my video card).
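For vectors longer than 512 elements, the same addition can be spread over several blocks. A possible sketch (not part of the example that follows) is shown below:

__global__ void addVectorLarge( float * left, float * right, float * result, int n)
{
    // combine the block index and the thread index into a global element index
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // guard against threads that fall outside the vector
    if (idx < n)
        result[idx] = left[idx] + right[idx];
}

// possible launch: enough 256-thread blocks to cover all n elements
// int threads = 256;
// int blocks = (n + threads - 1) / threads;
// addVectorLarge<<<blocks, threads>>>(devVec1, devVec2, devVec3, n);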

The program itself must perform the following steps:
  1. Get data for calculations.
  2. Copy this data to GPU memory.
  3. Perform calculation in the GPU through the kernel function.
  4. Copy the calculated data from the GPU memory to the RAM.
  5. View results.
  6. Free up used resources.

Moving directly to writing code:

First of all, we write the kernel function, which will add vectors:
// kernel function that adds two vectors element by element
__global__ void addVector( float * left, float * right, float * result)
{
    // get the id of the current thread
    int idx = threadIdx.x;

    // add the corresponding elements of the two vectors
    result[idx] = left[idx] + right[idx];
}



Thus, the parallelization is performed automatically when the kernel is launched. This function also uses the built-in variable threadIdx and its field x, which lets us establish a correspondence between a vector element and a thread in the block: each element of the vector is computed by a separate thread.

Now we write the code responsible for steps 1 and 2 of the program:

#define SIZE 512
__host__ int main()
{
    // allocate the vectors in host memory
    float * vec1 = new float [SIZE];
    float * vec2 = new float [SIZE];
    float * vec3 = new float [SIZE];

    // initialize the input vectors
    for ( int i = 0; i < SIZE; i++)
    {
        vec1[i] = i;
        vec2[i] = i;
    }

    // pointers to the vectors in the video card's memory
    float * devVec1;
    float * devVec2;
    float * devVec3;

    // allocate memory for the vectors on the video card
    cudaMalloc(( void **)&devVec1, sizeof ( float ) * SIZE);
    cudaMalloc(( void **)&devVec2, sizeof ( float ) * SIZE);
    cudaMalloc(( void **)&devVec3, sizeof ( float ) * SIZE);

    // copy the input data to the video card's memory
    cudaMemcpy(devVec1, vec1, sizeof ( float ) * SIZE, cudaMemcpyHostToDevice);
    cudaMemcpy(devVec2, vec2, sizeof ( float ) * SIZE, cudaMemcpyHostToDevice);
    …
}



To allocate memory on the video card, the cudaMalloc function is used, which has the following prototype:
cudaError_t cudaMalloc (void ** devPtr, size_t count), where
  1. devPtr - the pointer to which the address of the allocated memory is written,
  2. count - the size of the allocated memory in bytes.

Returns:
  1. cudaSuccess - on successful memory allocation
  2. cudaErrorMemoryAllocation - on memory allocation error

To copy data into the video card's memory, use cudaMemcpy, which has the following prototype:
cudaError_t cudaMemcpy (void * dst, const void * src, size_t count, enum cudaMemcpyKind kind), where
  1. dst - a pointer containing the address of the copy destination,
  2. src - a pointer containing the address of the copy source,
  3. count - the size of the copied data in bytes,
  4. kind - an enumeration of type cudaMemcpyKind that indicates the direction of the copy (one of cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyHostToHost, cudaMemcpyDeviceToDevice).

Returns:
  1. cudaSuccess - on successful copying
  2. cudaErrorInvalidValue - invalid argument parameters (for example, copy size is negative)
  3. cudaErrorInvalidDevicePointer - invalid memory pointer in video card
  4. cudaErrorInvalidMemcpyDirection - wrong copy direction (for example, the source and destination are swapped)
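Since almost every runtime API call returns a cudaError_t, it is convenient to wrap the calls in a small check. A minimal sketch of such a helper (it is not part of the original example) could look like this:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// print a readable message and abort if a runtime API call failed
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            printf("CUDA error: %s\n", cudaGetErrorString(err));   \
            exit(1);                                               \
        }                                                          \
    } while (0)

// usage:
// CUDA_CHECK(cudaMalloc(( void **)&devVec1, sizeof ( float ) * SIZE));
// CUDA_CHECK(cudaMemcpy(devVec1, vec1, sizeof ( float ) * SIZE, cudaMemcpyHostToDevice));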

Now we come to calling the kernel itself to perform the computation on the GPU.
…
dim3 gridSize = dim3(1, 1, 1);     // grid size: one block is enough for us
dim3 blockSize = dim3(SIZE, 1, 1); // block size: SIZE threads along x

// launch the kernel
addVector<<<gridSize, blockSize>>>(devVec1, devVec2, devVec3);
…


In our case, it is not necessary to define the grid and block sizes explicitly, since we use only one block and only one dimension within that block, so the code above can be shortened to:
addVector<<<1, SIZE>>>(devVec1, devVec2, devVec3);


Now we need to copy the result of the calculation from video memory to host memory. But kernel functions have a special feature: asynchronous execution. The fact that the next piece of code started running after the kernel call does not mean that the GPU has finished the calculations. To wait for a given kernel function to complete, you must use synchronization tools, such as events. Therefore, before copying the results to the host, we synchronize the GPU threads through an event.

Code after calling the kernel:
// launch the kernel
addVector<<<blocks, threads>>>(devVec1, devVec2, devVec3);

// handle of the event used for synchronization
cudaEvent_t syncEvent;

cudaEventCreate(&syncEvent);     // create the event
cudaEventRecord(syncEvent, 0);   // record the event in the zero stream
cudaEventSynchronize(syncEvent); // wait until the event has passed

// copy the result from the video card's memory back to host memory
cudaMemcpy(vec3, devVec3, sizeof ( float ) * SIZE, cudaMemcpyDeviceToHost);

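By the way, for this simple case the same effect can be achieved without creating an event, by blocking the host until all previously issued GPU work has finished; a hedged alternative sketch:

// launch the kernel
addVector<<<blocks, threads>>>(devVec1, devVec2, devVec3);

// wait for all previously issued GPU work to finish
cudaThreadSynchronize(); // in newer CUDA versions this call is named cudaDeviceSynchronize()

// only then copy the result back to the host
cudaMemcpy(vec3, devVec3, sizeof ( float ) * SIZE, cudaMemcpyDeviceToHost);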

Let's take a closer look at the functions from the Event Management API.

An event is created using the cudaEventCreate function, the prototype of which is:
cudaError_t cudaEventCreate (cudaEvent_t * event), where
  1. event - a pointer to which the created event's handle is written.

Returns:
  1. cudaSuccess - if successful
  2. cudaErrorInitializationError - initialization error
  3. cudaErrorPriorLaunchFailure - error during previous asynchronous function launch
  4. cudaErrorInvalidValue - incorrect value
  5. cudaErrorMemoryAllocation - memory allocation error

An event is recorded using the cudaEventRecord function, the prototype of which is:
cudaError_t cudaEventRecord (cudaEvent_t event, cudaStream_t stream), where
  1. event - the handle of the event to record,
  2. stream - the stream in which the event is recorded (in our case, the main zero stream).

Returns:
  1. cudaSuccess - if successful
  2. cudaErrorInvalidValue - incorrect value
  3. cudaErrorInitializationError - initialization error
  4. cudaErrorPriorLaunchFailure - error during previous asynchronous function launch
  5. cudaErrorInvalidResourceHandle - invalid event handle

The event is synchronized with the cudaEventSynchronize function. This function waits until all GPU work issued before the given event has finished and the event has been passed, and only then returns control to the calling program. The function prototype is:
cudaError_t cudaEventSynchronize (cudaEvent_t event), where
  1. event - the handle of the event whose completion is awaited.

Returns:
  1. cudaSuccess - if successful
  2. cudaErrorInitializationError - initialization error
  3. cudaErrorPriorLaunchFailure - error during previous asynchronous function launch
  4. cudaErrorInvalidValue - incorrect value
  5. cudaErrorInvalidResourceHandle - invalid event handle

You can understand how cudaEventSynchronize works from the following scheme:


Fig. 4. Synchronization of the host program and the GPU.

In Figure 4, the “Waiting for the event to pass” block corresponds to the call to the cudaEventSynchronize function.
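Events are also commonly used to measure GPU execution time. As a side note, a hedged sketch of timing the same kernel with a pair of events might look like this:

cudaEvent_t start, stop;
float elapsedMs = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);  // record an event before the kernel launch
addVector<<<1, SIZE>>>(devVec1, devVec2, devVec3);
cudaEventRecord(stop, 0);   // record an event after the kernel launch
cudaEventSynchronize(stop); // wait until the kernel has finished

cudaEventElapsedTime(&elapsedMs, start, stop); // time between the two events, in milliseconds
printf("Kernel time: %.3f ms\n", elapsedMs);

cudaEventDestroy(start);
cudaEventDestroy(stop);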

Well, in conclusion, we print the result on the screen and free the allocated resources.
// print the result on the screen
for ( int i = 0; i < SIZE; i++)
{
    printf( "Element #%i: %.1f\n" , i, vec3[i]);
}

// free the allocated resources:
// destroy the synchronization event,
// free the memory on the video card and on the host

cudaEventDestroy(syncEvent);

cudaFree(devVec1);
cudaFree(devVec2);
cudaFree(devVec3);

delete[] vec1; vec1 = 0;
delete[] vec2; vec2 = 0;
delete[] vec3; vec3 = 0;


I think there is no need to describe the resource-release functions in detail. It is only worth recalling that they also return cudaError_t values, in case you need to check that they worked correctly.

Conclusion


I hope this material will help you understand how the GPU functions. I have described the most important points you need to know in order to work with CUDA. Try writing the addition of two matrices yourself (a starting sketch follows below), but do not forget about the hardware limitations of the video card.
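As a starting point for that exercise, here is a hedged sketch of a two-dimensional kernel (the names and block size are only an illustration, not a complete solution):

__global__ void addMatrix( float * a, float * b, float * c, int width, int height)
{
    // two-dimensional indices: one thread per matrix element
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col < width && row < height)
    {
        int idx = row * width + col;
        c[idx] = a[idx] + b[idx];
    }
}

// possible launch: 16 * 16 = 256 threads per block stays within the 512-thread limit
// dim3 blockSize(16, 16, 1);
// dim3 gridSize((width + 15) / 16, (height + 15) / 16, 1);
// addMatrix<<<gridSize, blockSize>>>(devA, devB, devC, width, height);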

P.S.: It turned out not very brief; I hope it did not tire you. If you need the full source code, I can send it by mail.
P.P.S.: Feel free to ask questions.

Source: https://habr.com/ru/post/54707/

