I want to tell you how I created my own particle system and then moved it to the GPU. I naively thought it would be a trivial job ("come on, just move the particles around and you're done"). In reality, one could talk about the nuances that came up during the implementation for a long time, so here I will only cover how the bottleneck problems were solved.
Background
The customer builds dynamic musical fountains, which are driven by DMX controllers running scripts. He wrote the script editor himself, but in practice creating scripts turned out to be inconvenient: to see how a script looks, you need a fully built and running fountain. Besides, if the choreographer suddenly wanted to add extra nozzles to the fountain, that was nearly impossible to do. So the customer wanted a module for simulating fountains, so that the choreographer could develop scenarios without a real fountain. In the end it came out something like this: here is a video of the simulation, Hawaii50.wmv, and here is what it looked like in real life after the fountain was built: H5OClip.wmv
Requirements
At the moment there is a fixed set of nozzles that behave in certain ways, plus LED light sources. Essentially, I had to provide an interface for each type of nozzle and for the light sources, with methods for manipulating them. There had to be a scene that can be rotated with the mouse in its own window (there were also many small requirements unrelated to the particle system, such as a grid on the ground plane, fountain height marks, etc.). And of course all of this had to run in real time, i.e. at least 25-30 frames per second.
First pancake
At first I did the simplest thing: particles were created on the CPU, then everything went into a vertex buffer, which was rendered. In tests it all worked fine; in practice it turned out to be unusable. If you look at the H5OClip.wmv video, notice how many light sources are lit at the same time. The number often reaches a hundred or more. At the same time one source often "covers" several jets at once, and each jet is essentially an emitter. Now imagine 150-200 jets creating particles simultaneously. How many particles does it take to depict one jet? In practice it turned out that a tolerable rendering of one jet firing at full power needs about 5k particles on average. For 150 jets that gives 750,000 particles, and clearly we have to budget for at least 150 jets.
The first version worked like this. First came particle creation. Each particle had a field storing the millisecond at which it would die. We determine how many particles the emitter created during the last frame and walk the array from the beginning until all of them are created. If we meet a dead particle (current time > time of death), we give it a new time of death and set its initial coordinates and initial velocity; in essence, a particle is created. If the array ends and not all particles have been created yet, we allocate an additional chunk of memory. Once all particles are created, we run to the end of the buffer and remember the index of the last living particle; if that index is much smaller than the array length, we shrink the array. This index is useful later on, so we don't have to walk the whole buffer. Then came the combined pass of moving the particles and filling the VBO (Vertex Buffer Object): we walk the array, and if a particle is alive we move it and write it into the VBO, otherwise we skip it. We only scan up to the index of the last living particle. The VBO is ready; render it. In practice (back then I had an Athlon 64 X2 3800, i.e. 2.0 GHz per core), if memory serves, it came out to about 100-150k particles at 25-30 FPS, which is not good. So we either have to somehow handle a peak of 750k particles, or come up with an alternative. On to the second pancake.
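The old code has not survived (see the P.S.), so below is only an approximate reconstruction of that scheme; the Particle struct, initPositionAndVelocity and all other names are mine, not taken from the project.

```cpp
#include <vector>
#include <cstddef>

// Approximate reconstruction of the first (pure CPU) version.
struct Particle {
    float  x, y, z;       // current position
    float  vx, vy, vz;    // current velocity
    double deathTime;     // millisecond at which the particle dies
};

std::vector<Particle> particles;
std::size_t lastAlive = 0;                   // index of the last living particle

void initPositionAndVelocity(Particle& p);   // emitter-specific (hypothetical helper)

// Pass 1: "create" particles by reusing dead slots, growing the array if needed.
void emitParticles(std::size_t count, double now, double lifetimeMs) {
    std::size_t created = 0;
    for (std::size_t i = 0; i < particles.size() && created < count; ++i) {
        Particle& p = particles[i];
        if (now > p.deathTime) {             // dead slot found -> reuse it
            p.deathTime = now + lifetimeMs;
            initPositionAndVelocity(p);
            ++created;
        }
    }
    for (; created < count; ++created) {     // ran out of slots -> extend the array
        Particle p{};
        p.deathTime = now + lifetimeMs;
        initPositionAndVelocity(p);
        particles.push_back(p);
    }
    lastAlive = 0;                           // remember where the living part of the array ends
    for (std::size_t i = 0; i < particles.size(); ++i)
        if (now <= particles[i].deathTime) lastAlive = i;
    // (if lastAlive is much smaller than particles.size(), the array can be shrunk here)
}

// Pass 2: move living particles and write them into the VBO staging array.
void updateAndFill(float dt, double now, std::vector<float>& vboData) {
    vboData.clear();
    for (std::size_t i = 0; i <= lastAlive && i < particles.size(); ++i) {
        Particle& p = particles[i];
        if (now > p.deathTime) continue;     // skip dead particles
        p.vy -= 9.81f * dt;                  // gravity
        p.x += p.vx * dt;  p.y += p.vy * dt;  p.z += p.vz * dt;
        vboData.push_back(p.x);  vboData.push_back(p.y);  vboData.push_back(p.z);
    }
    // vboData is then uploaded into the VBO (e.g. via glBufferData) and rendered as points.
}
```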
Second pancake
Analysis
First, I ran tests to find out what exactly eats the final FPS. The clearly visible loads were:
Creation/death of particles
Particle motion
Filling the vertex buffer
Particle rendering
Particle motion turned out to be the slowest part, of course. Second place went to creation/death of particles, third to filling the vertex buffer. Rendering itself varied. Since the fountains are in 3D, they can be at different distances from the camera, and the particles have to be scaled with depth. If the camera points straight down so that the jets shoot right into it, the FPS drops; that is understandable, because the particles become huge and so does the fillrate. In normal use nobody ever pointed the camera like that (later I added an optimization for such cases), and rendering had almost no effect on FPS, because the GPU draws asynchronously and managed to finish its work while the CPU was busy with its own.
Attempt to optimize
The first thing I decided to optimize was the displacement math. But the math was so simple that there was practically nothing to optimize; the compiler already turned it into near-optimal asm. Then the idea came up not to traverse the array twice, but to create particles, move them and fill the buffer in one pass. No sooner said than done. But the speedup was only visible under a microscope. For filling the vertex buffer no optimizations came to mind. It would have been possible to use a single buffer both for the particle data and for the VBO, but then dead vertices would have to be pushed outside the viewport, and that option seemed even slower to me; besides, the overhead of filling the VBO was tiny. Theoretically (in FLOPS) there was still plenty of headroom, but reaching those synthetic numbers was impossible. Of course, I could also have solved it head-on: parallelize across 4 threads and put a quad-core CPU at 2.5 GHz per core into the minimum system requirements, but I really did not like that path.
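For completeness, the merged pass looked roughly like this (my reconstruction, reusing the hypothetical Particle and initPositionAndVelocity from the sketch above): creation, motion and VBO filling in a single walk over the array.

```cpp
// Single-pass variant: dead slots are reused for newly emitted particles, living ones
// are moved and written into the VBO staging array, all in one traversal.
void emitMoveAndFill(std::size_t toCreate, float dt, double now, double lifetimeMs,
                     std::vector<float>& vboData) {
    vboData.clear();
    for (std::size_t i = 0; i < particles.size(); ++i) {
        Particle& p = particles[i];
        if (now > p.deathTime) {             // dead slot
            if (toCreate == 0) continue;     // nothing left to emit -> just skip it
            p.deathTime = now + lifetimeMs;  // otherwise a new particle is born here
            initPositionAndVelocity(p);
            --toCreate;
        }
        p.vy -= 9.81f * dt;                  // gravity
        p.x += p.vx * dt;  p.y += p.vy * dt;  p.z += p.vz * dt;
        vboData.push_back(p.x);  vboData.push_back(p.y);  vboData.push_back(p.z);
    }
}
```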
Successful optimization attempt
So, we need to reduce the number of particles. Clearly, fountains located far away do not need many particles: you can draw fewer particles and make them a bit bigger. But if the camera suddenly moves close to a fountain, we have to show more particles again. All of this seems logical and clear; the problem is that we still have to move those invisible particles somehow, otherwise when the camera approaches they will still be sitting at their starting point. And again we run into having to process all the particles on the CPU. But what if we do not move a particle at all, and instead simply compute its position from the equation of motion? The simplest equation of motion is x = x0 + v0*t + 0.5*a*t^2. It would be a great option if not for one thing: the customer wanted "air friction", because for jets at a low angle to the horizon the simulation result differed a lot from the real one. The viscous friction force is F = -b*V; for a uniform medium and droplets of the same size and shape we can roughly say that the friction acceleration is a = k*V, where k is some coefficient. As a result, our simple equation of motion turns into a monster (the current formula in the shader: NewCoord = ((uAirFriction * aVel + G) * (exp(uAirFriction * dt) - 1.0) / uAirFriction - G * dt) / uAirFriction + aCoord;). And yet, despite the wild formula, I already got a tangible performance gain just because I computed positions only for the vertices I was actually going to draw: for fountains at a distance N from the camera we take every second particle, for fountains at a distance 2N every fourth, and so on. The result was on the order of 500-700k living particles at 20-30 FPS, which is quite good. In reality that figure fluctuated a lot depending on where the fountains were in the frame, but overall the performance fully met the requirements.
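For reference, here is the same closed-form motion written out as a small C++ function. The math mirrors the shader line above; the vec3 type, the operators and the distance-based stride helper are my own illustration, not project code.

```cpp
#include <cmath>

struct vec3 { float x, y, z; };
static vec3 operator+(vec3 a, vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static vec3 operator*(vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }

// Closed-form position under gravity G and viscous drag (acceleration k*v, k < 0),
// i.e. the solution of dv/dt = G + k*v:
//   v(t) = (v0 + G/k) * e^(k*t) - G/k
//   x(t) = x0 + (v0 + G/k) * (e^(k*t) - 1)/k - (G/k)*t
// Regrouped, this is exactly the shader formula quoted above; t is the particle's age.
vec3 positionAt(vec3 coord0, vec3 vel0, vec3 G, float k, float t) {
    return ((vel0 * k + G) * ((std::exp(k * t) - 1.0f) / k) + G * (-t)) * (1.0f / k) + coord0;
}

// Distance-based thinning: draw every particle up to distance N, every 2nd beyond that,
// every 4th beyond 2N, etc. (one possible reading of the scheme described above).
int strideForDistance(float dist, float N) {
    int stride = 1;
    for (float d = N; dist > d && stride < 64; d *= 2.0f) stride *= 2;
    return stride;
}
```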
Third pancake
Even though the task was already done and the customer was satisfied, out of my own sporting interest I decided to move the calculations to the GPU. So I needed render-to-vertex-buffer: from a vertex buffer with initial values (initial position, initial velocity, birth time, death time) we render into a vertex buffer that stores only the current coordinates, and then use the resulting buffer to render the particles themselves. A straightforward head-on implementation (without reducing the number of particles with distance) gave about a million particles at 40-50 FPS on my GF250. Now the CPU only has to "give birth" to particles, and there are not that many of them per frame. But varying the number of particles with distance is not so trivial here: we no longer have a contiguous array of particles, but an array with "holes" left by dead particles. I see a couple of solutions for this case, but I have not had time to implement them. If the Habr community is interested in seeing further implementations, then when free time appears I will try to bake the fourth and fifth pancakes (and with demos this time).
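The article does not say which render-to-vertex-buffer mechanism was used, so purely as an illustration, here is a minimal sketch of the first pass using OpenGL transform feedback (GL 3.0+). All names are hypothetical; shader compilation, attribute setup and error handling are omitted.

```cpp
#include <GL/glew.h>   // assuming GLEW (or any other loader) provides the GL 3.0 entry points

// Pass 1 of the scheme above: a vertex shader evaluates the motion formula for every
// particle and the results are captured into currentCoordsVbo via transform feedback.
// updateProgram must have been linked after a call like
//   const char* varyings[] = { "NewCoord" };
//   glTransformFeedbackVaryings(updateProgram, 1, varyings, GL_SEPARATE_ATTRIBS);
void updateParticlesOnGpu(GLuint updateProgram, GLuint initialStateVbo,
                          GLuint currentCoordsVbo, GLsizei particleCount) {
    glUseProgram(updateProgram);
    glEnable(GL_RASTERIZER_DISCARD);                 // nothing is rasterized in this pass
    glBindBuffer(GL_ARRAY_BUFFER, initialStateVbo);  // initial position/velocity, birth and death time
    // (glVertexAttribPointer calls for the initial-state layout go here)
    glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, currentCoordsVbo);
    glBeginTransformFeedback(GL_POINTS);
    glDrawArrays(GL_POINTS, 0, particleCount);       // one "vertex" per particle
    glEndTransformFeedback();
    glDisable(GL_RASTERIZER_DISCARD);
    // Pass 2 then binds currentCoordsVbo as a regular attribute buffer and draws the points.
}
```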
Findings
In particle systems the bottleneck is not the bus, as I originally thought (I can't speak for AGP, but on PCIe x16 it is not noticeable), but particle motion.
Fillrate in particle systems can significantly eat into performance; it is worth optimizing.
Tasks that parallelize well are often solved faster on the GPU even when done head-on, and such bottlenecks are better moved to the GPU (though this does not mean everything should be done on the GPU head-on).
The most important conclusion: think first, then do. Had I estimated the number of particles up front, I would have thought about optimization right away, and the first pancake would simply not have happened.
P.S. I apologize for the lack of code and demos. I understand that articles with pictures are more interesting to read, but the old versions of the code have not survived, so I am describing my "explorations" from memory. Accordingly, there are no old screenshots/videos/demos, but the current video can be found on the customer's site. I will try to fix this in the following articles.
upd. Uploaded the video mentioned above to YouTube.
upd2. Lossless-quality screenshots: the top one is an attempt to add a water-hammer effect and make things prettier and more realistic; I consider that attempt unsuccessful because of the heavy performance drop. The bottom one is the third pancake from the article above.