📜 ⬆️ ⬇️

Stencil Buffer Implementation Example with CUDA


Stencil buffer is used to mask reflections in places where they actually do not exist. The Stencil technique is used in OpenGL and DirectX. Before applying the algorithm, the Stencil test is performed and in those places where there is no image the pixel shader will not be rendered. So we stop the extra work.


Stencil is stored in a buffer with depth. For example, in the format D3DFMT_D24S8 24 bits - bits of depth and 8 bits - Stencil. For simplicity, we will assume below that Stencil is stored in the last bit. If this bit = 1, then the pixel is active. Thus, the simplified algorithm presents the following steps:

  1. Reset (fill with zeros) Stencil buffer.
  2. We start recording and draw a plane in Stencil buffer, relative to which we will consider the reflection. Where there is a mirror, units will be stored, and where there is no mirror, zeros.
  3. We reflect all the geometry relative to the plane using a separate matrix, and then draw the reflection, simultaneously running the Stencil test.

')
Thus, where the mirror was in the image, the reflection will be displayed. And where there is none, nothing will change.

CUDA software implementation



CUDA, unfortunately, the mechanism Stencil-test is missing. This is a very useful trick and I will explain how to get around this limitation in the next article, but for now let's look at the implementation details.

So, we get stencil buffer size exactly (N / 32) * sizeof (int) bytes. And we tie a texture to it.

cudaMalloc((void**)&m_stencilBuffer, N*sizeof(int)/32); cudaBindTexture(0, stencil_tex, m_stencilBuffer, N*sizeof(int)/32);     -  (.h )  : Texture<int, 1, cudaReadModeElementType> stencil_tex; ,        : static __device__ int g_stencilMask[32] = { 0x00000001, 0x00000002, 0x00000004, 0x00000008, 0x00000010, 0x00000020, 0x00000040, 0x00000080, 0x00000100, 0x00000200, 0x00000400, 0x00000800, 0x00001000, 0x00002000, 0x00004000, 0x00008000, 0x00010000, 0x00020000, 0x00040000, 0x00080000, 0x00100000, 0x00200000, 0x00400000, 0x00800000, 0x01000000, 0x02000000, 0x04000000, 0x08000000, 0x10000000, 0x20000000, 0x40000000, 0x80000000 }; 


For those kernels that only read the stencil buffer, the macro should be applied at the beginning of the kernel as follows:

 __global__ void my_kernel(…) { uint tid = blockDim.x * blockIdx.x + threadIdx.x; STENCIL_TEST(tid); // my code here } 


In practice (GTX560), such a stencil test is about 20-25% faster than a simple check of the check type:

 uint activeFlag = a_flags[tid]; if(activeFlag==0) return; 


However, taking into account memory savings, there is definitely a profit. It should also be noted that on video cards with a less wide bus (for example, GTS450), the acceleration may be more significant.

So, it remains to implement only the entry in the stencil buffer. First, we read the value for everything in the warp from the stealth buffer to the variable activeWarp; Then each thread gets its bit from this variable using the logical & and stores it in the active variable. At the end of the kernel, we will collect all active variables for the given warp value back into one 32-bit uint, and the zero warp stream will write the result back to memory.

 // (tid >> 5) same as (tid/32) // (tid & 0x1f) same as (tid%32) __global__ void my_kernel2(…,uint* a_stencilBuffer) { uint tid = blockDim.x * blockIdx.x + threadIdx.x; uint activeWarp = a_stencilBuffer[tid >> 5]; if(activeWarp==0) // all threads in warp inactive return; // each threads will store it's particular bit from group of 32 threads uint active = activeWarp & g_stencilMask[tid&0x1f]; if(!active) goto WRITE_BACK_STENCIL_DATA; // my code here WRITE_BACK_STENCIL_DATA: WriteStencilBit(tid, a_stencilBuffer, active); } 

If the stream is inactive, it will immediately go to the ridge of the kernel. If for any reason you have decided within your code that this thread should be inactive, do this:

 if(want to kill thread) { active = 0; goto WRITE_BACK_STENCIL_DATA; } 


In the example, the label and the goto operator are intentionally used. Although it is a bad programming style, in this case it adds security to your code. The fact is that you are guaranteed to reach WriteStencilBit function code. If for some reason you decide to make a return inside your code, everything will break down (let's discuss why later). Instead of return, you need to put goto WRITE_BACK_STENCIL_DATA so that before exiting, all the threads from the warp can collect data, and the zero stream (zero inside the warp-a) will write it to the stencil buffer. Actually, the WriteStencilBit function looks as follows:

 __device__ void WriteStencilBit(int tid, uint* a_stencilBuffer, uint value) { uint stencilMask = __ballot(value); if((tid & 0x1f) == 0) // same as tid%32 == 0 a_stencilBuffer[tid >> 5] = stencilMask; } 


The __ballot () function returns uint, where each i-th bit is 1 if and only if what is in its argument is not zero. That is, it does exactly what is needed there, stitching back into the uint flags from different threads inside the warp.

The __ballot () function belongs to the so-called “warp vote functions” and works very quickly. Unfortunately, it is only available for compute capability 2.0, that is, video cards with Fermi architecture. Important note on her work, the following code will be wrong:

 __device__ void WriteWrongStencilBit(int tid, uint* a_stencilBuffer, uint value) { if((tid & 0x1f) == 0) // same as tid%32 == 0 a_stencilBuffer[tid >> 5] = __ballot(value); } 


The fact is that __ballot () will always put 0 in those bits whose streams are currently masked. And all streams with a number inside the warp that are not equal to zero (1..31) will be masked and will not fall inside the if operator, therefore 1..31 bits of the result of the __ballot () function for such a code will always be zero. From here the truth follows an interesting conclusion. If you are guaranteed to write for video cards with Fermi architecture, then even for kernels that are written in stencil buffer, you can kill the stream as follows:

 if(want to kill thread) return; 


Thus, the threads for which you made return will be masked and __ballot () will return zeros for the corresponding bits in its result. There is truth one subtlety. At least for a zero stream inside the warp, you cannot do this, otherwise the result will simply not be written back. Therefore, in fact, you can only do so.

 if(want to kill thread && (tid&0x1f!=0)) return; 


Or use the form suggested above:

 if(want to kill thread) { active = 0; goto WRITE_BACK_STENCIL_DATA; } 


Implementation features for older hardware (G80-GT200)

Let us now consider which extensions must be made in order for the stencil to work effectively on older GPUs. The __ballot () function is not supported on these video cards. Let's rewrite the WriteStencilBit function in accordance with the capabilities that we have:

 template<int CURR_BLOCK_SIZE> __device__ inline void WriteStencilBit(int tid, uint* a_stencilBuffer, uint value) { #if COMPUTE_CAPABILITY >= COMPUTE_CAPABILITY_GF100 uint stencilMask = __ballot(value); if((tid & 0x1f) == 0) a_stencilBuffer[tid >> 5] = stencilMask; #elif COMPUTE_CAPABILITY >= COMPUTE_CAPABILITY_GT200 if(__all(value==0)) { if((tid & 0x1f) == 0) a_stencilBuffer[tid >> 5] = 0; } else if(__all(value)) { if((tid & 0x1f) == 0) a_stencilBuffer[tid >> 5] = 0xffffffff; } else { __shared__ uint active_threads[CURR_BLOCK_SIZE/32]; uint* pAddr = active_threads + (threadIdx.x >> 5); if((tid & 0x1f) == 0) *pAddr = 0; atomicOr(pAddr, value); if((tid & 0x1f) == 0) a_stencilBuffer[tid >> 5] = *pAddr; } #else __shared__ uint active_threads[CURR_BLOCK_SIZE]; active_threads[threadIdx.x] = value; if((threadIdx.x & 0x1) == 0) active_threads[threadIdx.x] = value | active_threads[threadIdx.x+1]; if((threadIdx.x & 0x3) == 0) active_threads[threadIdx.x] = active_threads[threadIdx.x] | active_threads[threadIdx.x+2]; if((threadIdx.x & 0x7) == 0) active_threads[threadIdx.x] = active_threads[threadIdx.x] | active_threads[threadIdx.x+4]; if((threadIdx.x & 0xf) == 0) active_threads[threadIdx.x] = active_threads[threadIdx.x] | active_threads[threadIdx.x+8]; if((threadIdx.x & 0x1f) == 0) active_threads[threadIdx.x] = active_threads[threadIdx.x] | active_threads[threadIdx.x+16]; uint* perWarpArray = active_threads + ((threadIdx.x >> 5) << 5); if((tid & 0x1f) == 0) a_stencilBuffer[tid >> 5] = perWarpArray[0]; #endif } 


Thus, we can make atomics in shared memory + 2 voting functions are available, __any and __all, so that we can use them. In other cases, only classical reduction remains.

Testing Stencil


For raytracing needs, this stencil buffer came up pretty well. On my old laptop's GTX560, I get about 4 billion calls per second (i.e. 4 billion empty calls per second) - not bad, right ?! As the tracing depth increased, performance dropped slightly according to how many real-world objects we see. Tests were specifically made on the most simple reflective scene:

The FPS dynamics are as follows: 30, 25, 23.7, 20, 19.4, 18.8

Source: https://habr.com/ru/post/151640/


All Articles