Understanding shared-memory bank conflicts in NVIDIA CUDA

Shared memory is a very effective optimization tool due to very fast access (100 times faster than global memory). However, if it is used improperly, there may be bank conflicts that significantly slow down performance. This article will discuss how these conflicts arise, and how to avoid them.

How shared memory conflicts arise

Conflicts arise when 2 or more streams from one warp (for devices version 2.0) or half of a warp (for devices version 1.3 and below) access the bytes that belong to different 32-bit words in the same memory bank. In the event of a conflict, access is performed sequentially. The number of threads accessing the bank is called the degree of conflict. If the degree of conflict is N, then access is carried out N times slower than if there was no conflict.

Broadcast Access Mechanism

')
On devices of version 1.x, conflict can be avoided if several streams access the same word belonging to the same bank, and only if this request is single — in this case, the broadcast access mechanism is activated.

On devices of version 2.x there can be several such requests and they will be realized in parallel (different threads can access different bytes of the word).

Features access to devices version 2.0

With 64-bit access, a bank conflict only occurs if 2 or more streams from either of the warp halves access the addresses belonging to the same bank.

With 128-bit access, there are usually second-degree bank conflicts.

Access with a bit greater than 32 is broken down into requests of 32, 64 and 128 bits.

How memory is distributed among banks

The memory is distributed among banks in such a way that each 32-bit word in a sequence is sequentially assigned to one of 32 banks in the case of device version 2.0 and 16 banks in the case of device version 1.3 and below. Accordingly, the bank number can be calculated by the following formula:

Bank number = (Address in bytes / 4)% 32 - for device version 2.0
Bank number = (Address in bytes / 4)% 16 - for device version 1.x

Examples of memory access causing conflicts

For device version 1.x

1. 8 and 16 bit access

__shared__ char shmem8[32];
char data = shmem8[threadIdx.x];

In this example, the first 4 bytes are in the same bank, so the first 4 streams will conflict with access

The problem is solved by adding redundant data (padding) and changing the access scheme:

__shared__ char shmem8[32*4];
char data = shmem8[threadIdx.x*4];

For 16-bit access:

__shared__ short shmem16[32];
short data = shmem16[threadIdx.x];

In this example, the first 2 shorts are in the same bank, so the first 2 threads will conflict with access

The problem is solved in the same way as 8-bit access:

__shared__ short shmem16[32*2];
short data = shmem16[threadIdx.x*2];

2. 32-bit access

For this type of access, bank conflicts are less obvious, but can occur when, for example, this access scheme is:

__shared__ int shmem32[64];
int data1 = shmem32[threadIdx.x*2];
int data2 = shmem32[threadIdx.x*2+1];

In this case, the 0th and 8th streams are read from 0 and 1 banks, respectively, thus creating a 2nd degree conflict.

You can solve this problem for example:

__shared__ int shmem32_1[32];
__shared__ int shmem32_2[32];
int data1 = shmem32_1[threadIdx.x];
int data2 = shmem32_2[threadIdx.x];

For devices version 2.0

Due to the peculiarities of broadcast access, 8 and 16 bit access schemes on these devices do not cause bank conflicts, however, a conflict may arise in the following case:

__shared__ int shared[64];
int data = shared[threadIdx.x*s];

A conflict arises if s is even. If s is odd, but there is no conflict.

Bank Conflict Tracking

NVIDIA Banck Checker

Conflicts can be traced by using the macro CUT_BANK_CHECKER (array, index), which is part of the CUDA Utility Toolkit. To do this, you must use this macro to access memory and run the application in emulation mode. Upon completion of the application, a conflict report will be printed.

For example, like this:

__shared__ int shared[64];
int data = CUT_BANK_CHECKER(shared, threadIdx.x*s);

CUDA Profiler

Also, you can use a profiler to track conflicts. This information is displayed in the warp serialize section. This counter shows the number of warps that need to serialize their access when addressing constant or shared memory, in other words, this counter shows bank conflicts.

Conclusion

In conclusion, I would like to note that the most effective method for eliminating bank conflicts is the development of access schemes that minimize their occurrence and the subsequent analysis of the application by a profiler (which is never superfluous).

Source: https://habr.com/ru/post/100363/

All Articles