
Physical simulation on the GPU using compute shaders in the Unity3D environment

In this tutorial, I'll explain how to use a compute shader to run computations on the video card, using a hair model as an example:


Here is the Unity3D project that this guide is built around. Download it and open it in Unity:

link to the Unity project

Who is this manual for? Those who use Unity3D, or who at least know C# or C++. The shader is written in HLSL, a close syntactic relative of C++.
Who will find this guide useful? Experienced programmers who want to learn to use the GPU for computation. But even an inexperienced, diligent programmer will follow everything easily.

Why use the video card for computation? For parallel tasks its performance is 10-100 times higher than the CPU's. In other words, every computer contains a small supercomputer with a convenient API, and it makes sense to use it where it fits.
Is this huge performance really needed? Yes: CPU speed is often the limiting factor, for example when you need to apply the same operations to large arrays of data. Such tasks parallelize easily. Besides, developers often reject solutions because of their computational cost, and whole regions of the algorithm space remain unexplored. For example, you can do very cool physics in games if you load the graphics processor carefully.
So with a video card you can now just brute-force problems? No: the demand for optimization does not depend on the performance of the hardware. There is no supercomputer that cannot be brought to its knees by inefficient code.

Why a compute shader? Why not OpenCL or CUDA? CUDA only works on NVIDIA hardware, and I don't know OpenCL. Unity can build for any graphics API, including OpenGL Core. Compute shaders work on Macs and on Android, and on Linux too (although I haven't tried it). Still, each API has limitations worth keeping in mind. For example, on Metal you cannot have more than 256 threads along one axis (in DX10 it's 1024), and the Android API cannot use more than 4 buffers per kernel (in DX10 it's 8, in DX11 even more).

Why physical simulation, specifically? It is a computationally intensive task, well suited to parallel computation, and it is in demand: game developers can implement interesting physics in games, students can build experimental models for coursework, engineers and scientists can run calculations on a model.

And why a hair model exactly? I wanted a simple task that nevertheless covers the main issues.

How should you use this guide? It is best to download the source code, open it, and read it as you move through the manual. I will explain all the main lines in detail, though not literally every line; the meaning of most of them is obvious. There are no complex algorithms in the text, only the use of the class interface that serves GPU computation. On the shader side there is nothing but reading data, performing simple math on it, and writing results back. But if something is unclear, by all means ask; I will answer everything in the comments.

And for those who have no idea at all about compute shaders, I suggest taking a step aside to a very simple guide devoted to the basics of using them. I advise starting there, to get a feel for the practice of GPU computation on an extremely simple example, and then coming back here. Those at least somewhat familiar with compute shaders can boldly read on.

If you want to make a physical model calculated on a GPU from scratch, this task can be divided into 4 parts:

- mathematical model of the phenomenon
- algorithm for parallel calculation of the model
- shader code
- preparation and launch of the shader in Unity

Mathematical model


The strength of video cards is that they can apply a single operation to many objects simultaneously. A hair model can therefore be made as a chain of points, each interacting with its two neighbors. The interaction between points follows the spring principle: k * (S0 - S)^n, where S0 is the equilibrium distance and S is the current distance. A real hair does not behave like a spring; it is perceived as inextensible, which means the spring in the model must be made sufficiently rigid. It is better to increase the stiffness by increasing n, because a higher power increases the curvature of the force curve near equilibrium, which reduces play and the “rubbery” feel of the hair. I took n = 2; the value of the coefficient k will be discussed below.

In addition to the elastic force between points, we will implement diffusion of relative velocities, a kind of one-dimensional viscosity. Exchanging the tangential component of velocity models dynamic resistance to stretching; exchanging the normal component models dynamic resistance to bending. Together this speeds up the propagation of disturbances along the hair, which improves the dynamics and makes the hair more visually coherent and less springy.

There will also be a static tendency to straighten. Each point will try to compensate for the bend of the hair at its location: if the hair is bent at a point, that point experiences a force proportional to the bend and directed so as to reduce it, while the two neighboring points experience half that force in the opposite direction.

These interactions are enough to model the physics of hair, but we will not stop there: we will add interaction between the hair and solid bodies. This has practical meaning. Physical models, as a rule, include interaction between differently simulated entities, for example liquids and solids. And in practical tasks, such as games, a GPU simulation must interact in real time with objects computed on the CPU side. So I could not leave this interaction out. Our hair will interact with solid bodies, information about which is transferred to video memory at each step.

For simplicity we will work only with round objects. On the CPU side we will have several circle colliders from Unity's standard 2D physics. The interaction rule is this: if a hair point ends up inside a solid body, it is moved outside, the component of its velocity directed into the body is subtracted from it, and the same amount is transferred to the body. For simplicity we will not account for the body's own velocity.

Algorithm, code and shader preparation


These three topics are too closely related to discuss separately.

To describe the points from which many hairs are made, we use the following structure:

 struct hairNode{
     float x;    // point position
     float y;
     float vx;   // point velocity
     float vy;
     int dvx;    // velocity change, accumulated by parallel threads
     int dvy;
     int dummy1; // padding so the struct size is a multiple of 128 bits
     int dummy2;
 }

This structure is declared twice: on the CPU side and on the GPU side, for convenience. On the CPU side we write the initial data into it and copy it to the GPU buffer, where it is then processed. We could have declared this structure only on the GPU side if we did not need to transfer initial data.

About the dummy1 and dummy2 parameters: in an article by an NVIDIA engineer I read that the size of elements in video memory buffers should be kept a multiple of 128 bits, because that reduces the number of operations needed to compute offsets.

The meaning of the remaining parameters should be clear. Although an attentive reader may ask: why is the velocity a float, but the velocity change an int? The short answer: because the velocity change is modified simultaneously by parallel threads, and to avoid computation errors you need to use protected writes. The protected-write functions only work with integer variables. More on this below.

We model our hair with many points. Data on all points is stored in video memory and is available through a buffer interface:

 RWStructuredBuffer<hairNode> hairNodesBuffer; 

In the shader code we define only its name and data type; its size is set from outside, by the code executed on the processor.

How is compute shader code structured? The code consists of kernels. Kernels are like methods, but each one is executed in parallel across many threads, so for each kernel the number of threads is specified as a three-dimensional structure.
This is what an empty kernel looks like, with no code, only the required external information:

 #pragma kernel kernelName
 [numthreads(8,4,1)]
 void kernelName (uint3 id : SV_DispatchThreadID){
     // kernel code goes here
 }

The kernel has an input parameter, id, which holds the three-dimensional index of the thread. This is very convenient: each thread knows its own index, which means it can work with its own separate unit of data.

From the side of the processor code, the kernel is called like this:

 shaderInstance.Dispatch(kernelIndex, 2, 2, 1); 

These three numbers “2, 2, 1” are associated with the line that precedes the corresponding kernel:

 [numthreads(8,4,1)] 

These two triples of numbers determine the number of threads, that is, the number of parallel copies of the kernel. Just multiply them: 8 * 4 * 1 * 2 * 2 * 1 = 128 threads.

Threads are addressed along each axis. In this case the x axis will have 8 * 2 = 16 units, and the y axis 4 * 2 = 8 units. That is, if the kernel is called like this:

 ComputeShader.Dispatch(kernelIndex, X, Y, Z); 

And on the shader side, the number of threads is set as follows:

 [numthreads(x,y,z)] 

Then we will have (X * x) * (Y * y) * (Z * z) threads.

For example, suppose we need to process each pixel of a 256 x 256 texture, and we want a separate thread to handle each pixel. Then we can set the number of threads as follows:

 Dispatch(kernelIndex, 16, 16, 1); 

and on the shader side:

 [numthreads(16,16,1)] 

Inside the kernel, the id.x parameter takes values in the range [0, 255], and so does id.y.

So, this is the line:

 texture[id.xy]=float4(1, 1, 1, 1); 

paints each of the 65536 pixels of the texture white.

id.xy is the same as uint2(id.x, id.y).

If this part about thread counts is unclear, I advise you to go to the easier guide I mentioned and see how all of this is used in practice to draw the Mandelbrot fractal with the simplest of shaders.

The shader in our model contains several kernels, which are launched in turn from the CPU side in the Update() method. I will go through the code of each kernel later; first, a brief summary of what each one does.

calc — the tangential and normal interaction forces between particles are calculated: the tension of the “springs” pushes particles along the line between them, and the “bending stiffness” force pushes particles perpendicular to the line between adjacent particles; the computed forces are stored per particle

velShare - particles exchange relative velocities, the tangential component and the full velocity separately. Why single out the tangential component if the full velocity is exchanged anyway? Because the tangential exchange should be much more intense, with a higher coefficient, so it had to be separated. Then why do I use the full velocity in the second case rather than the pure normal component? To save on calculations. The velocity changes are recorded as forces, as in the previous kernel.

interactionWithColliders - each point interacts with the colliders, information about which is held in a buffer updated every cycle

calcApply - the forces calculated in the previous kernels are added to the velocities, and the velocities change the coordinates of the points

visInternodeLines - lines between points are drawn into a special 1024 x 1024 intermediate buffer (not yet onto the texture)

pixelsToTexture - here the values from that buffer are converted into pixel colors on the 1024 x 1024 texture

clearPixels - all values of the intermediate buffer (into which we drew the lines) are reset

clearTexture - clear texture

oneThreadAction - this kernel runs in one single thread; it is needed to move the entire hair system smoothly toward wherever we dragged it with the mouse. The smoothness is needed so that the system does not blow up from an abrupt movement (remember that in our model the force between particles is proportional to the square of the distance between them).

On the side of the CPU code


Now I will show how these kernels are started by the CPU code. But first, how to prepare the shader for launch.

We declare a variable:

 ComputeShader _shader; 

We initialize it by specifying the shader text file:

 _shader = Resources.Load<ComputeShader>("shader"); 

Set the constants that will be useful on the GPU side:

 // nodesPerHair and nHairs are defined earlier in the code
 _shader.SetInt("nNodesPerHair", nodesPerHair);
 _shader.SetInt("nHairs", nHairs);

We declare variables for the array that will hold the data of the simulated points, and for the buffer through whose interface we will read and write that data in video memory:

 hairNode[] hairNodesArray;
 ComputeBuffer hairNodesBuffer;

Initialize the buffer and write the array data to the video memory.

 // hairNodesArray has already been filled with the initial data
 hairNodesBuffer = new ComputeBuffer(hairNodesArray.Length, 4 * 8);
 hairNodesBuffer.SetData(hairNodesArray);

For each kernel, we set the buffers it uses, so that the kernel can read and write data in them:

 kiCalc = _shader.FindKernel("calc");
 _shader.SetBuffer(kiCalc, "hairNodesBuffer", hairNodesBuffer);

When all the necessary buffers have been created and set for all the shader's kernels, the kernels can be run.

All kernels are run from Update(). They should not be run from FixedUpdate() (everything will lag heavily), because the graphics pipeline is synchronized with Update().

The kernels are launched in the following sequence (here is the entire doShaderStuff() method called from Update()):

 void doShaderStuff(){
     int i, nHairThreadGroups, nNodesThreadGroups;
     nHairThreadGroups = (nHairs - 1) / 16 + 1;
     nNodesThreadGroups = (nodesPerHair - 1) / 8 + 1;
     _shader.SetFloats("pivotDestination", pivotPosition);
     circleCollidersBuffer.SetData(circleCollidersArray);
     i = 0;
     while (i < 40) {
         _shader.Dispatch(kiVelShare, nHairThreadGroups, nNodesThreadGroups, 1);
         _shader.Dispatch(kiCalc, nHairThreadGroups, nNodesThreadGroups, 1);
         _shader.Dispatch(kiInteractionWithColliders, nHairThreadGroups, nNodesThreadGroups, 1);
         _shader.Dispatch(kiCalcApply, nHairThreadGroups, nNodesThreadGroups, 1);
         _shader.Dispatch(kiOneThreadAction, 1, 1, 1);
         i++;
     }
     circleCollidersBuffer.GetData(circleCollidersArray);
     _shader.Dispatch(kiVisInternodeLines, nHairThreadGroups, nNodesThreadGroups, 1);
     _shader.Dispatch(kiClearTexture, 32, 32, 1);
     _shader.Dispatch(kiPixelsToTexture, 32, 32, 1);
     _shader.Dispatch(kiClearPixels, 32, 32, 1);
 }

It immediately catches the eye that several kernels are launched 40 times per Update(). What for? So that the simulation runs in real time with a small time step. And why must the time step be small? To reduce the discretization error, that is, for the stability of the system. And how does instability arise? If the step is large and a large force acts on a point, then in one step the point overshoots, the restoring force becomes even larger, and in the next step the point flies even further to the other side. The result: the system blows up, all points oscillating back and forth with ever-growing amplitude. With a small step, all the force and velocity curves are very smooth, because the errors shrink sharply as the time step decreases.

So instead of one big step, the system takes 40 small steps per cycle, and because of this it achieves high computational accuracy. Thanks to that accuracy, we can work with large interaction forces without losing stability. And large forces mean that instead of limp, springy spaghetti dangling in the model and threatening to explode at any sudden movement, we get strong hairs that move briskly.

The data about the points with which we model the hair is stored in video memory as a one-dimensional array, which we access through the buffer interface.

For convenience of working with a one-dimensional buffer, we index the threads as follows: x axis - hair index, y axis - index of the point within the hair. That is, we will have a two-dimensional array of threads, each of which finds its point from its thread index.

As you remember, the number of threads in which the kernel is executed is determined by the product of the parameters of the Dispatch () method and the parameters of the [numthreads ()] directive in the shader code.

In our case, all kernels that work with hair points are preceded by the directive [numthreads(16,8,1)]. Therefore the parameters of Dispatch() must be such that the product gives at least as many threads as we need to process the entire array of points. In the code we compute the x and y parameters of Dispatch():

 nHairThreadGroups = (nHairs - 1) / 16 + 1;
 nNodesThreadGroups = (nodesPerHair - 1) / 8 + 1;

The relationship between the [numthreads()] and Dispatch() parameters stems from GPU architecture. The first is the number of threads in a group; the second is the number of thread groups. Their ratio affects performance: if we need 1024 threads along the x axis, it is better to make 32 groups of 32 threads than 1 group of 1024 threads. Why? Answering that would require a long detour into GPU architecture, so I will leave the topic untouched here.

Details of the GPU code


So, 40 times per Update() we launch, in turn, the kernels that compute the changes in point velocities and then apply those changes to velocities and coordinates. Let's look at the code of each kernel. Everything there is quite simple; you only need to learn a couple of specific features.

The “calc” kernel computes the change in the points' velocities. The points in hairNodesBuffer are laid out sequentially: first the first point of the first hair, then the second, and so on to the last; then immediately the first point of the second hair, and so on for all hairs to the end of the buffer. Remember that the kernel has the parameter id; in our case id.x is the hair index and id.y is the point index. Here is how we access the point data:

 int nodeIndex, nodeIndex2;
 hairNode node, node2;
 nodeIndex = id.x * nNodesPerHair + id.y;
 nodeIndex2 = nodeIndex + 1;
 node = hairNodesBuffer[nodeIndex];
 node2 = hairNodesBuffer[nodeIndex2];

Here nNodesPerHair is a constant that we set from the CPU side when the shader is initialized. The data is copied from the buffer into the local variables node and node2 because accessing buffer data can take more GPU cycles than accessing a local variable. The algorithm itself is this: for each point that is not the last in its hair, we compute the force acting between it and the next point, and based on this force we record the velocity change at each of the two points.

Here we meet an important feature of parallel computation: each thread modifies two points, the current one and the next, which means each point is modified by two parallel threads. Unprotected writes to memory shared by parallel threads are fraught with data loss. If you use a normal increment:

 variable += value; 

then the writes can interleave like this: the first thread copies the original value and adds its one, but before it writes the result back to the memory cell, the second thread reads the original value. The first thread then writes back its incremented value; the second thread adds its one and writes its result back. The outcome: although two threads each added one, the variable grew by only one. To avoid this situation, protected writes are used. HLSL has several functions for protected modification of shared variables; they guarantee that no data is lost and every thread's contribution is counted.

A minor problem is that these functions only work with integer variables. That is exactly why the structure describing a point's state has the dvx and dvy parameters of type int: so that we can write to them with the protected functions and lose no data. And in order not to lose accuracy to rounding, we define two multipliers in advance: one converts float to int, the other converts back. This way we use the entire int range and lose almost no accuracy (we do lose some, of course, but negligibly little).

A protected write looks like this:

 InterlockedAdd(hairNodesBuffer[nodeIndex].dvx, (int)(F_TO_I * (dv.x + 2 * dvFlex.x)));
 InterlockedAdd(hairNodesBuffer[nodeIndex].dvy, (int)(F_TO_I * (dv.y + 2 * dvFlex.y)));
 InterlockedAdd(hairNodesBuffer[nodeIndex2].dvx, (int)(F_TO_I * (-dv.x - dvFlex.x)));
 InterlockedAdd(hairNodesBuffer[nodeIndex2].dvy, (int)(F_TO_I * (-dv.y - dvFlex.y)));

Here F_TO_I is the aforementioned coefficient for projecting float onto int, dv is the force vector of the second particle's influence on the first through the spring connection, and dvFlex is the straightening force. The (int) cast is needed because InterlockedAdd() is overloaded for int and uint, and a float would be interpreted as uint by default.

The velShare kernel is similar to the previous one; it also modifies the dvx and dvy parameters of two adjacent points, but instead of computing forces it computes the diffusion of relative velocity.

In the interactionWithColliders kernel the points do not interact with each other; instead each point runs through all colliders in the rigid-body buffer (which we refresh every update). Each thread writes to only one particle, so there is no danger of simultaneous writes, and instead of InterlockedAdd() we can change the particle's velocity directly. At the same time, our model has the points transferring momentum to the collider, which means parallel threads can simultaneously change the momentum of the same collider, so for that we do use the protected write.

One thing to understand here: when we project a float onto an int, the integer and fractional parts compete, that is, accuracy competes with range. For the interaction between points we chose a coefficient that allows enough range for our magnitudes and gives the rest to accuracy. But that coefficient does not suit the transfer of momentum to a collider, because hundreds of points can add their momentum in one direction at the same time, so accuracy must be sacrificed in favor of being able to hold a large number. Hence in this protected write we use not F_TO_I but a smaller coefficient.

After all the point interactions have been computed, the “calcApply” kernel adds the accumulated impulse to the velocities and the velocities to the coordinates. In addition, in this kernel each root point of a hair (the first in its row) is pinned to a specific place relative to the current position of the whole hair system. The contribution of gravity is also added here to the vertical velocity component. Finally, air drag is implemented: the absolute value of each point's velocity is multiplied by a factor slightly less than one.

Note that in the calcApply kernel the velocity affects the coordinates through the dPosRate coefficient, which sets the size of the simulation step. This factor is set on the CPU side and stored in a variable I called simulationSpeed. The larger this parameter, the faster the system evolves in time, but the lower the accuracy of the calculation. And accuracy, I repeat, limits the magnitude of the forces: with large forces and low accuracy, the error is so large that it dominates the behavior of the model. We took the simulation speed rather low, which gives high accuracy, so we can afford larger forces and therefore more realistic behavior.

The magnitude of the forces is governed by the coefficient that scales the effect of the accumulated impulse on velocity, dVelRate. This factor is large; it is set on the CPU side and is called strengthOfForces.

I repeat that in all the kernels mentioned, the number of threads equals the number of points: one thread is responsible for one point. And this is good practice. We do not pay for the number of threads; there can be as many as you like (in shader model 5.0, no more than 1024 along the x and y axes and no more than 64 along z). In the tradition of parallel computing, it is better to avoid loops that make a single thread process several units of data; better to create exactly as many threads as the principle “one unit of data, one thread” requires.

Let's go back to the doShaderStuff() method on the CPU side. After completing the cycle of 40 simulation steps, we read back the collider data:

 circleCollidersBuffer.GetData(circleCollidersArray); 

Recall that on the GPU side, impulses from the hair are written into the buffer with the collider data, and on the CPU side we use them to apply forces to the rigidbodies. Note that the force is applied to a rigidbody in FixedUpdate(), since that method is synchronized with physics, while the impulse data is refreshed in Update(). Depending on various factors, several FixedUpdate() calls can occur during one Update() and vice versa. So the hair's influence on a collider is not perfectly accurate: some data may be overwritten before it has an effect, and other data may take effect twice. Measures could be taken to prevent this, but they are not taken in the program under consideration.

It is worth noting that the GetData() method stalls the graphics pipeline, which causes a noticeable slowdown. Unfortunately, an asynchronous version of this method has not yet been implemented in Unity, though rumor has it that it will appear in 2018. In the meantime, understand that if your task requires copying data from the GPU to the CPU, the program will run 20-30% slower. SetData(), on the other hand, has no such effect and works quickly.

Visualization


Finally, visualization: the last four kernels launched in doShaderStuff() draw the current state of the hair.

The result is rendered into a texture. On the CPU side we create a RenderTexture with enableRandomWrite = true and assign it as the mainTexture of a UI Image.

Then, using SetTexture(), we bind the RenderTexture to the kernel that writes to it:

 RenderTexture renderTexture;
 renderTexture = new RenderTexture(1024, 1024, 32);
 renderTexture.enableRandomWrite = true;
 renderTexture.Create();
 GameObject.Find("canvas/image").GetComponent<UnityEngine.UI.Image>().material.mainTexture = renderTexture;
 _shader.SetTexture(kiPixelsToTexture, "renderTexture", renderTexture);

On the shader side, the texture is declared as an RWTexture2D:

 RWTexture2D<float4> renderTexture; 

Here, for example, is the kernel that clears the texture:

 #pragma kernel clearTexture
 [numthreads(32,32,1)]
 void clearTexture (uint3 id : SV_DispatchThreadID){
     renderTexture[id.xy] = float4(0, 0, 0, 0);
 }

It is launched like this:

 _shader.Dispatch(kiClearTexture, 32, 32, 1); 

This gives 32 * 32 groups of 32 * 32 threads, that is, 1024 x 1024 threads, one per pixel of the texture. Each thread addresses its own pixel through id.xy.

Why not draw the lines directly onto the texture? Many threads draw lines simultaneously, and different threads can land on the same pixel, while writes to a texture cannot be protected the way buffer writes can. So the lines are first drawn into an intermediate buffer, and only then transferred to the texture.

Accordingly, the «visInternodeLines» kernel draws a line between each point and the next point of the same hair, not into the texture but into an intermediate RWStructuredBuffer, which holds a 4-byte uint per pixel and can be written with the protected functions.

Unlike the RenderTexture, the contents of that buffer persist until they are explicitly reset, so after the picture has been transferred to the texture, the buffer is «cleared» by the clearPixels kernel. This is a normal pattern for compute shaders: intermediate data lives in buffers, and the texture receives only the final result.

Then the «pixelsToTexture» kernel converts the values from the intermediate buffer into pixel colors on the texture.

That is the whole pipeline of computation on the GPU. As you can see, there is nothing complicated here: data buffers, kernels, and a handful of launch parameters, and the same scheme can carry a far heavier simulation than our hair.

If something remained unclear, ask in the comments. The CPU-side code and the shader are in the project linked above, and the best way to consolidate all of this is to experiment with them yourself.

Source: https://habr.com/ru/post/346268/

