
GPU Particles using Compute and Geometry Shaders

Hello dear reader!

Today we will continue studying the graphics pipeline, and I will talk about such wonderful things as the Compute Shader and the Geometry Shader, using the example of building a system of 1,000,000+ particles, which are not points but squares (billboard quads) with their own texture. In other words, we will render 2,000,000+ textured triangles at over 100 FPS (on a budget GeForce 550 Ti video card).



Introduction


I have written a lot about shaders in my articles, but so far we have only worked with two types: the Vertex Shader and the Pixel Shader. With the arrival of DX10+, however, new shader types appeared: the Geometry Shader, Domain Shader, Hull Shader, and Compute Shader. Just in case, let me remind you what the graphics pipeline looks like now:

[Figure: the Direct3D 10/11 graphics pipeline stages]


Let me say right away that this article will not touch on the Domain Shader and Hull Shader; I will write about tessellation in future articles.

That leaves the Geometry Shader. So what is a Geometry Shader?

Chapter 1: Geometry Shader


The Vertex Shader processes vertices, the Pixel Shader processes pixels, and, as you might guess, the Geometry Shader processes primitives.

This shader is an optional part of the pipeline, i.e. it may be absent entirely: vertices then go straight to the Primitive Assembly stage and the primitive is rasterized as usual.
The Geometry Shader sits between the Primitive Assembly stage and the Rasterizer stage.

As input, it can receive information both about the assembled primitive itself and about the neighboring (adjacent) primitives:

[Figure: geometry shader input primitives, with and without adjacency]
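
As a quick reference (my addition, not from the original article), the input primitive type declared in the GS signature determines how many vertices arrive per invocation; a sketch, reusing the PixelInput struct from the listing below:

 // Input primitive types and the vertex counts they deliver:
 //   point - 1, line - 2, triangle - 3,
 //   lineadj - 4 (line plus its two neighbors),
 //   triangleadj - 6 (triangle plus its three neighbors)
 [maxvertexcount(3)]
 void AdjacencyGS( triangleadj PixelInput input[6], inout TriangleStream<PixelInput> stream )
 {
     // Even indices are the triangle's own vertices;
     // odd indices belong to the adjacent triangles.
     stream.Append(input[0]);
     stream.Append(input[2]);
     stream.Append(input[4]);
     stream.RestartStrip();
 }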

As output we have a stream of primitives, to which we in turn append primitives. Moreover, the type of the returned primitive may differ from the input type: for example, we receive a Point and return a Line. Here is an example of a trivial geometry shader that does nothing and simply passes its input through to the output:

 struct PixelInput
 {
     float4 Position : SV_POSITION; // System-Value semantic
 };

 [maxvertexcount(1)] // the maximum number of vertices the shader can emit
 void SimpleGS( point PixelInput input[1], inout PointStream<PixelInput> stream )
 {
     PixelInput pointOut = input[0]; // copy the input vertex
     stream.Append(pointOut);        // append it to the output stream
     stream.RestartStrip();          // end the strip (for a Point this is optional)
 }

Chapter 2: StructuredBuffer


DirectX 10+ has a buffer type called the Structured Buffer. Such a buffer can be laid out however the programmer pleases: in the most classical sense it is a homogeneous array of structures of a certain type, stored in GPU memory.

Let's try to create such a buffer for our particle system. First we describe the properties a particle has (on the C# side):

 public struct GPUParticleData
 {
     public Vector3 Position;
     public Vector3 Velocity;
 };

And create the buffer itself ( using the SharpDX.Toolkit helper ):

 _particlesBuffer = Buffer.Structured.New<GPUParticleData>(graphics, initialParticles, true); 

Where initialParticles is an array of GPUParticleData with the size of the desired number of particles.

It is worth noting that the flags when creating the buffer are set as follows:

BufferFlags.ShaderResource - so the buffer can be read from a shader (bound as an SRV)
BufferFlags.StructuredBuffer - marks the buffer as a structured buffer
BufferFlags.UnorderedAccess - so the buffer can be modified from a shader (bound as a UAV)
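
For illustration only, here is a sketch of what the same creation looks like with the flags spelled out through the general Buffer.New overload (I am assuming SharpDX.Toolkit exposes such an overload taking a BufferFlags combination; the Buffer.Structured.New call above with true as the last argument sets this same combination):

 // A sketch: the same structured buffer with the flags written out explicitly.
 _particlesBuffer = Buffer.New<GPUParticleData>(
     graphics,
     initialParticles,
     BufferFlags.ShaderResource |
     BufferFlags.StructuredBuffer |
     BufferFlags.UnorderedAccess);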

Create a buffer with a size of 1,000,000 elements and fill it with random elements:

 GPUParticleData[] initialParticles = new GPUParticleData[PARTICLES_COUNT];
 for (int i = 0; i < PARTICLES_COUNT; i++)
 {
     initialParticles[i].Position = random.NextVector3(
         new Vector3(-30f, -30f, -30f),
         new Vector3(30f, 30f, 30f));
 }

After that, a buffer of 1,000,000 elements with random values sits in GPU memory.

Chapter 3: Render Point Particles


Now we need to figure out how to draw this buffer. After all, we do not even have vertices! We will generate the vertices on the fly, based on the values in our structured buffer.

Create two shaders - Vertex Shader and Pixel Shader .
First, let's describe the input data for shaders:

 struct Particle // mirrors the structure stored on the GPU
 {
     float3 Position;
     float3 Velocity;
 };

 StructuredBuffer<Particle> Particles : register(t0); // the particle buffer

 cbuffer Params : register(b0) // per-frame parameters
 {
     float4x4 View;
     float4x4 Projection;
 };

 // Since there is no Vertex Buffer, the only input we receive is the vertex ID
 struct VertexInput
 {
     uint VertexID : SV_VertexID;
 };

 struct PixelInput // what the Vertex Shader outputs
 {
     float4 Position : SV_POSITION;
 };

 struct PixelOutput // what the Pixel Shader outputs
 {
     float4 Color : SV_TARGET0;
 };

Now let's look at the shaders more closely, starting with the vertex shader:

 PixelInput DefaultVS(VertexInput input)
 {
     PixelInput output = (PixelInput)0;

     Particle particle = Particles[input.VertexID];

     float4 worldPosition = float4(particle.Position, 1);
     float4 viewPosition = mul(worldPosition, View);
     output.Position = mul(viewPosition, Projection);

     return output;
 }

The magic here is simple: we read the particle from the particle buffer at the current VertexID (which for us ranges from 0 to 999,999) and, using the particle's position, project it into screen space.

And the Pixel Shader could not be simpler:

 PixelOutput DefaultPS(PixelInput input)
 {
     PixelOutput output = (PixelOutput)0;
     output.Color = float4((float3)0.1, 1);
     return output;
 }

We set the particle color to float4(0.1, 0.1, 0.1, 1). Why 0.1? Because we have a million particles and we will use Additive Blending, so overlapping particles add up.
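
The _additiveBlendState object itself is not shown in the article; here is a minimal sketch of an additive blend description through raw SharpDX.Direct3D11 (source and destination factors both One, so overlapping particles simply sum), with the Toolkit wrapper call being my assumption:

 // Additive blending: result = src * 1 + dst * 1, so particles accumulate brightness.
 var desc = SharpDX.Direct3D11.BlendStateDescription.Default();
 desc.RenderTarget[0].IsBlendEnabled = true;
 desc.RenderTarget[0].SourceBlend = SharpDX.Direct3D11.BlendOption.One;
 desc.RenderTarget[0].DestinationBlend = SharpDX.Direct3D11.BlendOption.One;
 desc.RenderTarget[0].BlendOperation = SharpDX.Direct3D11.BlendOperation.Add;
 desc.RenderTarget[0].SourceAlphaBlend = SharpDX.Direct3D11.BlendOption.One;
 desc.RenderTarget[0].DestinationAlphaBlend = SharpDX.Direct3D11.BlendOption.One;
 desc.RenderTarget[0].AlphaBlendOperation = SharpDX.Direct3D11.BlendOperation.Add;
 desc.RenderTarget[0].RenderTargetWriteMask = SharpDX.Direct3D11.ColorWriteMaskFlags.All;

 // Wrapping it for the Toolkit (assuming BlendState.New accepts a raw description):
 _additiveBlendState = BlendState.New(graphics, desc);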

Set the buffers and draw the geometry:

 graphics.ResetVertexBuffers(); // no vertex buffer is bound
 graphics.SetBlendState(_additiveBlendState); // enable the additive blend state

 // bind the particle buffer to the shader as an SRV (read-only)
 _particlesRender.Parameters["Particles"].SetResource<SharpDX.Direct3D11.ShaderResourceView>(0, _particlesBuffer);

 // camera matrices
 _particlesRender.Parameters["View"].SetValue(camera.View);
 _particlesRender.Parameters["Projection"].SetValue(camera.Projection);

 // apply the pass
 _particlesRender.CurrentTechnique.Passes[0].Apply();

 // draw 1,000,000 points without any vertex buffer
 graphics.Draw(PrimitiveType.PointList, PARTICLES_COUNT);

And now admire the first victory:



Chapter 4: Render QuadBillboard Particles


If you have not forgotten the first chapter, we can now safely turn our set of points into full-fledged billboards, each consisting of two triangles.

A few words about what a QuadBillboard is: it is a square made of two triangles, and this square is always turned towards the camera.

How do we create this square? We need an algorithm for generating such squares quickly. Let's look at the Vertex Shader: there are three spaces involved in building SV_Position:

  1. World Space - the vertex position in world coordinates
  2. View Space - the vertex position in camera coordinates
  3. Projection Space - the vertex position in screen coordinates


View Space is exactly what we need: these coordinates are relative to the camera, so a quad spanning (-1 + px, -1 + py, pz) -> (1 + px, 1 + py, pz) built in this space will always have its normal aimed at the camera.

So we change the vertex shader slightly:

 PixelInput TriangleVS(VertexInput input)
 {
     PixelInput output = (PixelInput)0;

     Particle particle = Particles[input.VertexID];

     float4 worldPosition = float4(particle.Position, 1);
     float4 viewPosition = mul(worldPosition, View);

     output.Position = viewPosition; // view space now, not projection space
     output.UV = 0; // PixelInput is extended with a float2 UV field

     return output;
 }

Through SV_Position we now pass not the projection-space position but the view-space position, so that the Geometry Shader can create the new primitives in view space.

Add a new stage:

 // offset the vertex in view space, then transform into Projection Space
 PixelInput _offsetNprojected(PixelInput data, float2 offset, float2 uv)
 {
     data.Position.xy += offset;
     data.Position = mul(data.Position, Projection);
     data.UV = uv;

     return data;
 }

 [maxvertexcount(4)] // the GS emits 4 vertices, forming a TriangleStrip
 void TriangleGS( point PixelInput input[1], inout TriangleStream<PixelInput> stream )
 {
     PixelInput pointOut = input[0];

     const float size = 0.1f; // half-size of the quad

     // emit the four corners of the quad
     stream.Append( _offsetNprojected(pointOut, float2(-1,-1) * size, float2(0, 0)) );
     stream.Append( _offsetNprojected(pointOut, float2(-1, 1) * size, float2(0, 1)) );
     stream.Append( _offsetNprojected(pointOut, float2( 1,-1) * size, float2(1, 0)) );
     stream.Append( _offsetNprojected(pointOut, float2( 1, 1) * size, float2(1, 1)) );

     // finish the TriangleStrip
     stream.RestartStrip();
 }

And since we now have UVs, we can sample a texture in the pixel shader:

 PixelOutput TrianglePS(PixelInput input)
 {
     PixelOutput output = (PixelOutput)0;
     float particle = ParticleTexture.Sample(ParticleSampler, input.UV).x * 0.3;
     output.Color = float4((float3)particle, 1);
     return output;
 }

Additionally, set the sampler and particle texture for the render:

 _particlesRender.Parameters["ParticleSampler"].SetResource<SamplerState>(_particleSampler);
 _particlesRender.Parameters["ParticleTexture"].SetResource<Texture2D>(_particleTexture);

Build and test:



Chapter 5: Particle Movement


Everything is in place now: we have a buffer in GPU memory and a particle renderer built on the Geometry Shader, but such a system is static. We could, of course, update the positions on the CPU by reading the buffer back from the GPU each frame, modifying it, and uploading it again, but then what GPU power could we speak of? Such a system would not survive even 100,000 particles.

To work with such buffers directly on the GPU, there is a special shader: the Compute Shader. It stands outside the traditional render pipeline and can be used on its own.

What is a Compute Shader ?

To put it briefly, the Compute Shader is a special pipeline stage that replaces all the traditional ones (but can also be used alongside them). It lets you execute arbitrary code on the GPU and read/write data in buffers (including textures). Moreover, this code executes exactly as parallel as the developer sets it up to be.

Let's look at the execution of the simplest code:

 [numthreads(1, 1, 1)]
 void DefaultCS( uint3 DTiD : SV_DispatchThreadID )
 {
     // DTiD.xyz identifies the current thread
     // ... arbitrary work goes here
 }

 technique ComputeShader
 {
     pass DefaultPass
     {
         Profile = 10.0;
         ComputeShader = DefaultCS;
     }
 }

At the very beginning of the code is the numthreads attribute, which specifies the number of threads in a group. For now we will not use multiple threads per group and will keep one thread per group.
The uint3 DTiD.xyz value identifies the current thread.

The next step is to launch the shader, which is done like this:

 _effect.CurrentTechnique.Passes[0].Apply();
 graphics.Dispatch(1, 1, 1);

In the Dispatch method we specify how many thread groups to launch; each dimension is capped at 65535 groups. If we execute this code, the shader body runs on the GPU exactly once: 1 group of threads with 1 thread in it. With Dispatch(5, 1, 1), the body runs five times: 5 groups, 1 thread each. If we also change numthreads to (5, 1, 1), the code runs 25 times: 5 groups with 5 threads in each; the picture below illustrates this breakdown.
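
To pin down how these IDs relate in general, here is a minimal sketch (my addition, not from the original article) of the thread-related system values a compute shader can declare:

 [numthreads(5, 1, 1)]
 void ExampleCS(
     uint3 groupID          : SV_GroupID,          // which group; 0..4 for Dispatch(5, 1, 1)
     uint3 groupThreadID    : SV_GroupThreadID,    // which thread inside the group; 0..4 here
     uint3 dispatchThreadID : SV_DispatchThreadID, // global: groupID * numthreads + groupThreadID
     uint  groupIndex       : SV_GroupIndex )      // flattened in-group index: z*Tx*Ty + y*Tx + x
 {
     // With Dispatch(5, 1, 1) and numthreads(5, 1, 1) this body executes 25 times in total.
 }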



Now, back to the particle system: what do we have? A one-dimensional array of 1,000,000 elements, and the task of updating the particle positions. Since the particles move independently of each other, the task parallelizes very well.

In DX10 (we use this CS version to support DX10-class cards) the maximum number of threads per group is 768, counted across all three dimensions combined (cs_5_0 raises this to 1024; in cs_4_x the Z dimension of numthreads must be 1). I create 32 * 24 * 1 = 768 threads per group, i.e. one group can process 768 particles (1 thread per particle). Next we need to work out how many thread groups are required (given that one group processes 768 particles) to handle N particles; the groups are laid out as a near-square 2D grid so that neither dispatch dimension overflows its limit.
You can calculate it like this:

 int numGroups = (PARTICLES_COUNT % 768 != 0)
     ? ((PARTICLES_COUNT / 768) + 1)
     : (PARTICLES_COUNT / 768);

 double secondRoot = System.Math.Pow((double)numGroups, (double)(1.0 / 2.0));
 secondRoot = System.Math.Ceiling(secondRoot);
 _groupSizeX = _groupSizeY = (int)secondRoot;

After that we can call Dispatch(_groupSizeX, _groupSizeY, 1), and the shader will process our N elements in parallel.
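
To make the numbers concrete (my arithmetic, not from the article): for 1,000,000 particles, numGroups = ceil(1,000,000 / 768) = 1303; sqrt(1303) is about 36.1, so _groupSizeX = _groupSizeY = 37. Dispatch(37, 37, 1) launches 1369 groups, i.e. 1369 * 768 = 1,051,392 threads; the extra 51,392 threads simply exit early on the index >= MaxParticles check in the shader below.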

To access a specific element, a thread computes its global index from its group ID and its in-group index (GroupDim here is the width of our square group grid, matching the shader below):

 uint index = groupID.x * THREAD_GROUP_TOTAL
            + groupID.y * GroupDim * THREAD_GROUP_TOTAL
            + groupIndex;

Below is the updated shader code:

 struct Particle
 {
     float3 Position;
     float3 Velocity;
 };

 cbuffer Handler : register(c0)
 {
     int GroupDim;
     uint MaxParticles;
     float DeltaTime;
 };

 RWStructuredBuffer<Particle> Particles : register(u0);

 #define THREAD_GROUP_X 32
 #define THREAD_GROUP_Y 24
 #define THREAD_GROUP_TOTAL 768

 [numthreads(THREAD_GROUP_X, THREAD_GROUP_Y, 1)]
 void DefaultCS( uint3 groupID : SV_GroupID, uint groupIndex : SV_GroupIndex )
 {
     uint index = groupID.x * THREAD_GROUP_TOTAL
                + groupID.y * GroupDim * THREAD_GROUP_TOTAL
                + groupIndex;

     [flatten]
     if (index >= MaxParticles)
         return;

     Particle particle = Particles[index];

     float3 position = particle.Position;
     float3 velocity = particle.Velocity;

     // payload
     particle.Position = position + velocity * DeltaTime;
     particle.Velocity = velocity;

     Particles[index] = particle;
 }

 technique ParticleSolver
 {
     pass DefaultPass
     {
         Profile = 10.0;
         ComputeShader = DefaultCS;
     }
 }

More magic happens here: we use our particle buffer as a special resource type, RWStructuredBuffer, which means we can both read from and write to the buffer.
(!) A prerequisite for writing: the buffer must have been created with the UnorderedAccess flag.

And the final stage: we bind our buffer to the shader as an UnorderedAccessView and call Dispatch:

 /* SOLVE PARTICLES */
 _particlesSolver.Parameters["GroupDim"].SetValue(_threadGroupSize);
 _particlesSolver.Parameters["MaxParticles"].SetValue(PARTICLES_COUNT);
 _particlesSolver.Parameters["DeltaTime"].SetValue(deltaTime);
 _particlesSolver.Parameters["Particles"].SetResource<SharpDX.Direct3D11.UnorderedAccessView>(0, _particlesBuffer);
 _particlesSolver.CurrentTechnique.Passes[0].Apply();

 graphics.Dispatch(_threadSize, _threadSize, 1);

 _particlesSolver.CurrentTechnique.Passes[0].UnApply(false);

After the shader finishes, it is necessary to unbind the UnorderedAccessView from the shader (the UnApply call above), otherwise we will not be able to use the buffer elsewhere!

Now let's actually do something with the particles and write the simplest solver:

 float3 _calculate(float3 anchor, float3 position)
 {
     float3 direction = anchor - position;
     float distance = length(direction);
     direction /= distance;

     return direction * max(0.01, (1 / (distance * distance)));
 }

 // main
 {
     ...
     velocity += _calculate(Attractor, position);
     velocity += _calculate(-Attractor, position);
     ...
 }

The Attractor itself we will set through the constant buffer, as sketched below.
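
The article does not show the Attractor plumbing; here is a sketch following the same pattern as the other solver parameters (the field's placement in the Handler cbuffer is my assumption):

 cbuffer Handler : register(c0)
 {
     int GroupDim;
     uint MaxParticles;
     float DeltaTime;
     float3 Attractor; // the point that pulls particles toward itself (assumed field)
 };

On the C# side it would then be fed like the other parameters, e.g. _particlesSolver.Parameters["Attractor"].SetValue(...).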

Compile, run and admire:


Conclusion 1


If we talk about particles, nothing prevents you from building a full-fledged, powerful particle system on this foundation: the points are easy enough to sort (for correct transparency), you can apply the soft particles technique when drawing them, and you can also light the "non-emissive" particles. Compute shaders are used, among other things, to create the Bokeh Blur effect (geometry shaders are needed there as well), to build a Tiled Deferred Renderer, and so on. Geometry shaders, in turn, are useful whenever you need to generate a lot of geometry; the most striking examples are grass and particles. The uses of GS and CS are endless and limited only by the developer's imagination.

Conclusion 2


Traditionally, I attach the full source code and a demo to the post.
P.S. To run the demo you need a video card that supports DX10 and Compute Shaders.

Conclusion 3


I am always pleased when people show interest in what I write, and the reaction to an article matters a lot to me, whether it is an upvote or a downvote with a constructive comment. That is how I can tell which topics the Habr community finds interesting and which it does not.

Source: https://habr.com/ru/post/248755/
