Shared memory is a very effective optimization tool due to very fast access (100 times faster than global memory). However, if it is used improperly, there may be bank conflicts that significantly slow down performance. This article will discuss how these conflicts arise, and how to avoid them.
How shared memory conflicts arise
Conflicts arise when 2 or more streams from one warp (for devices version 2.0) or half of a warp (for devices version 1.3 and below) access the bytes that belong to different 32-bit words in the same memory bank. In the event of a conflict, access is performed sequentially. The number of threads accessing the bank is called the degree of conflict. If the degree of conflict is N, then access is carried out N times slower than if there was no conflict.
Broadcast Access Mechanism
')
On devices of version 1.x, conflict can be avoided if several streams access the same word belonging to the same bank, and only if this request is single — in this case, the broadcast access mechanism is activated.
On devices of version 2.x there can be several such requests and they will be realized in parallel (different threads can access different bytes of the word).
Features access to devices version 2.0
With 64-bit access, a bank conflict only occurs if 2 or more streams from either of the warp halves access the addresses belonging to the same bank.
With 128-bit access, there are usually second-degree bank conflicts.
Access with a bit greater than 32 is broken down into requests of 32, 64 and 128 bits.
How memory is distributed among banks
The memory is distributed among banks in such a way that each 32-bit word in a sequence is sequentially assigned to one of 32 banks in the case of device version 2.0 and 16 banks in the case of device version 1.3 and below. Accordingly, the bank number can be calculated by the following formula:
Bank number = (Address in bytes / 4)% 32 - for device version 2.0
Bank number = (Address in bytes / 4)% 16 - for device version 1.x
Examples of memory access causing conflicts
For device version 1.x
1. 8 and 16 bit access
__shared__ char shmem8[32];
char data = shmem8[threadIdx.x];
In this example, the first 4 bytes are in the same bank, so the first 4 streams will conflict with access
The problem is solved by adding redundant data (padding) and changing the access scheme:
__shared__ char shmem8[32*4];
char data = shmem8[threadIdx.x*4];
For 16-bit access:
__shared__ short shmem16[32];
short data = shmem16[threadIdx.x];
In this example, the first 2 shorts are in the same bank, so the first 2 threads will conflict with access
The problem is solved in the same way as 8-bit access:
__shared__ short shmem16[32*2];
short data = shmem16[threadIdx.x*2];
2. 32-bit access
For this type of access, bank conflicts are less obvious, but can occur when, for example, this access scheme is:
__shared__ int shmem32[64];
int data1 = shmem32[threadIdx.x*2];
int data2 = shmem32[threadIdx.x*2+1];
In this case, the 0th and 8th streams are read from 0 and 1 banks, respectively, thus creating a 2nd degree conflict.
You can solve this problem for example:
__shared__ int shmem32_1[32];
__shared__ int shmem32_2[32];
int data1 = shmem32_1[threadIdx.x];
int data2 = shmem32_2[threadIdx.x];
For devices version 2.0
Due to the peculiarities of broadcast access, 8 and 16 bit access schemes on these devices do not cause bank conflicts, however, a conflict may arise in the following case:
__shared__ int shared[64];
int data = shared[threadIdx.x*s];
A conflict arises if
s is even. If
s is odd, but there is no conflict.
Bank Conflict Tracking
NVIDIA Banck Checker
Conflicts can be traced by using the macro CUT_BANK_CHECKER (array, index), which is part of the CUDA Utility Toolkit. To do this, you must use this macro to access memory and run the application in emulation mode. Upon completion of the application, a conflict report will be printed.
For example, like this:
__shared__ int shared[64];
int data = CUT_BANK_CHECKER(shared, threadIdx.x*s);
CUDA Profiler
Also, you can use a profiler to track conflicts. This information is displayed in the
warp serialize section. This counter shows the number of warps that need to serialize their access when addressing constant or shared memory, in other words, this counter shows bank conflicts.
Conclusion
In conclusion, I would like to note that the most effective method for eliminating bank conflicts is the development of access schemes that minimize their occurrence and the subsequent analysis of the application by a profiler (which is never superfluous).