
CUDA: Working with memory. Part II.

The main topic of this part is optimizing work with global memory when programming the GPU.

The GPU has a number of features which, if ignored, can cost you a severalfold loss of performance when using global memory. But if you take all the subtleties into account, you can get truly efficient CUDA programs.

Getting started.

What is wrong with global memory?


The amount of global memory is the largest of all memory types, but at the same time this memory is the slowest in terms of read and write speed.

In the previous part, I looked at the example of matrix transposition. To improve performance, an intermediate buffer in shared memory was used, which increased performance almost fourfold. But the same gain could have been obtained without this extra intermediary. The secret lies in correct access to global memory.

There are two ways to optimize work with global memory: aligning the sizes of the types used and coalescing (combining) memory requests.

Aligning the sizes of the types used


Aligning the data type allows a global memory request to be compiled into a single GPU instruction; otherwise the compiler generates additional code, which can significantly reduce performance. For optimal performance, the size of the data type should be 4, 8, or 16 bytes.

If the size of the type is not 4, 8, or 16 bytes, it is better to use a type of higher dimension or to align the structure with the __align__(alignment size) keyword.

An example of optimization using the built-in CUDA types.

The size of the int3 type is 12 bytes, so memory access will not be optimal:

__device__ int3 data[512];

__global__ void initData()
{
    int idx = threadIdx.x;                    // one element per thread
    data[idx] = make_int3(idx, idx, idx);
}



It is better to use the int4 type (16 bytes), even if you do not need the fourth component:

__device__ int4 data[512];

__global__ void initData()
{
    int idx = threadIdx.x;                    // one element per thread
    data[idx] = make_int4(idx, idx, idx, 0);
}



When working with structures, you must use the __align__ keyword, which allows you to align the type to a given size.

An example of aligning the size of a structure.

Before alignment, the size of the vector3 structure will be 12 bytes:

#include <cstdio>

struct vector3
{
    float x;
    float y;
    float z;
};

int main()
{
    printf("%i\n", (int)sizeof(vector3));
    return 0;
}




The number 12 is displayed on the console.

After alignment, the size of vector3 will be 16 bytes:

#include <cstdio>

struct __align__(16) vector3
{
    float x;
    float y;
    float z;
};

int main()
{
    printf("%i\n", (int)sizeof(vector3));
    return 0;
}



The number 16 is displayed on the console.

Coalescing memory requests


A much greater performance gain can be obtained by combining many global memory requests into one (such combined requests are sometimes called transactions). In the nVidia documentation this is called coalescing global memory accesses. But before discussing what it takes to coalesce memory requests, you need to know a couple more things about how the GPU works.

To manage the execution of threads, the GPU uses so-called warps. From the programming point of view, a warp is a pool of threads. It is within a warp that the threads requested at kernel launch run in parallel, and it is within a warp that threads can interact with each other. The warp size for all GPUs is 32, that is, only 32 threads execute in parallel within a warp. Several warps can run on the GPU at the same time; their number is determined by the amount of available register and shared memory. Another interesting feature is that memory access is done per half-warp: first the first 16 threads access memory, and then the second 16. Why access happens exactly this way I cannot say for sure; I can only assume that it is related to the GPU's primary task, graphics processing.
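
To make the thread grouping concrete, here is a small illustrative kernel (my own sketch, not from the original article; the array name, its size of 512, and the assumption of at most 512 threads per block are mine) that records which half of its warp each thread belongs to:

__device__ int halfWarpOf[512];

__global__ void markHalfWarps()
{
    int laneId   = threadIdx.x % 32;   // position of the thread inside its 32-thread warp
    int halfWarp = laneId / 16;        // 0 = first 16 lanes, 1 = second 16 lanes
    halfWarpOf[threadIdx.x] = halfWarp;
    // global memory requests are issued separately for the two half-warps
}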

Now let's look at the requirements for coalescing global memory requests. Do not forget that memory access happens per half-warp.

The conditions necessary for coalescing memory accesses depend on the Compute Capability version; I give them here for versions 1.0 and 1.1, more details can be found in the documentation from nVidia:

- the threads of the half-warp must access 32-bit words, producing a single 64-byte transaction, or 64-bit words, producing a single 128-byte transaction;
- the k-th thread of the half-warp must access the k-th word, that is, access must be sequential;
- the words being accessed must all lie in one segment of memory whose start address is aligned to the transaction size.

A couple of notes on these conditions:

- some threads of the half-warp may skip their memory access altogether; the remaining accesses are still combined;
- the figures below use 32-bit words, so the transaction block is 16 threads * 4 bytes = 64 bytes.

Fig. 1. Requests that are coalesced when accessing memory

Fig. 1 shows examples of global memory requests that produce a single transaction. On the left, all conditions are met: each thread of the half-warp accesses the corresponding 32-bit word in order, and the start address of the memory is aligned to the size of the transaction block (16 threads * 4 bytes = 64 bytes). On the right is an example where some threads of the half-warp skip their corresponding words altogether, and the access is still combined into one transaction.


Fig. 2. Requests that are not coalesced when accessing memory

Fig. 2 shows examples that are not coalesced when accessing global memory. On the left, the condition that threads access the corresponding words in order is not met. On the right, the condition that the start address be aligned to the size of the block is not met. As a result, instead of one combined transaction we get 16 separate ones, one for each thread of the half-warp.
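
To make the difference tangible in code, here is a hedged sketch (my own example, not from the article) of two copy kernels: the first matches the left side of Fig. 1, the second breaks the sequential-access condition with a stride of 2. The array names and sizes are arbitrary, and the kernels assume a launch with 1024 threads in total:

__device__ float src[2048];
__device__ float dst[1024];

// Coalesced: thread k of each half-warp reads the k-th consecutive 32-bit word,
// and the start of the array is naturally aligned.
__global__ void copyCoalesced()
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    dst[idx] = src[idx];
}

// Not coalesced on compute capability 1.0/1.1: with a stride of 2 the 16 threads
// of a half-warp no longer touch 16 consecutive words, so the hardware splits
// the access into separate transactions.
__global__ void copyStrided()
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    dst[idx] = src[idx * 2];
}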

Structure of arrays or array of structures?



A few words should be said about working with structures and how to improve performance here. If you need an array of structures, it is better to create separate arrays for the structure's components: this reduces the number of global memory requests thanks to coalescing.

Consider an example.

Inefficient work with global memory:

#define SIZE 512   // array size; the value here is chosen only for illustration

struct __align__(16) vec3
{
    float x;
    float y;
    float z;
};

__device__ vec3 data[SIZE];

__global__ void initData()
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    data[idx].x = idx;
    data[idx].y = idx * 2;
    data[idx].z = idx * 3;
}



It is more efficient to use separate arrays:

__device__ float x[SIZE];
__device__ float y[SIZE];
__device__ float z[SIZE];

__global__ void initArr()
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    x[idx] = idx;
    y[idx] = idx * 2;
    z[idx] = idx * 3;
}



In the first case, with an array of structures, a separate memory request is needed for each field of each structure; in the second case, thanks to coalescing, only 3 requests are needed per half-warp. On average, this approach can roughly double performance.
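
For completeness, here is a hypothetical host-side snippet (not part of the original example) showing one way the initArr kernel above might be launched and one component array copied back with cudaMemcpyFromSymbol. It assumes it sits in the same .cu file as the arrays above, with SIZE defined as 512 and a multiple of the 256-thread block size:

#include <cstdio>

int main()
{
    initArr<<<SIZE / 256, 256>>>();                  // block size is a multiple of the half-warp size
    cudaDeviceSynchronize();

    float hostX[SIZE];
    cudaMemcpyFromSymbol(hostX, x, sizeof(hostX));   // read the x component array back to the host
    printf("%i\n", (int)hostX[100]);                 // expected: 100
    return 0;
}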

Conclusion


To sum up all of the above, I want to give the most important piece of advice for working with memory in CUDA:

NEVER TRY TO CHANGE THE VALUE OF THE SAME MEMORY CELL FROM SEVERAL THREADS SIMULTANEOUSLY.

This is the most common mistake in multithreaded programming. In fact, CUDA does not guarantee atomic access to a particular memory location for each thread, so the results may turn out differently than expected. Atomic operations do exist in CUDA, but it is better to use the concept of immutable data and to save the results of calculations in new objects, which are then passed to the next stages of the computation.
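
As a hedged illustration of this advice (my own sketch, not from the article), the fragment below contrasts a racy increment of a shared counter with two safe alternatives: an atomicAdd, which CUDA provides for 32-bit integers in global memory starting with compute capability 1.1, and the approach recommended above, where each thread writes its result to its own cell:

__device__ int counter;
__device__ int results[512];

// Unsafe: many threads read-modify-write the same cell, so the final value is unpredictable.
__global__ void racyCount()
{
    counter = counter + 1;                  // data race
}

// Safe but serialized: a hardware-atomic read-modify-write.
__global__ void atomicCount()
{
    atomicAdd(&counter, 1);
}

// In the spirit of immutable data: each thread writes only its own cell;
// any reduction is done at a later stage of the computation.
__global__ void perThreadResult()
{
    results[threadIdx.x] = 1;
}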

Source: https://habr.com/ru/post/56514/

