Last week I showed how we can leverage the power of the GPU to render huge numbers of particles with great performance.
I stated that by using the GPU to simulate our particles we can get much better performance than if we were using the CPU only. However – and while you may believe me – I provided no evidence that this is in fact the case.
That is something I want to rectify today.
Instead of doing a simple comparison between a CPU and a GPU implementation of the same particle system, I would like to go into more detail on what exactly we are doing, why we are doing it, and why the results – spoiler alert – will confirm my hypothesis above.
To illustrate the points I will be making, I implemented a variety of different ways to simulate and render particles. The complete code for all of them can be found on GitHub.
The particle simulation pipeline
In the simplest case – and we will refrain from making things more complicated than they have to be for the purpose of this post – the simulation and rendering of particles can be looked at as a pipeline.
The pipeline we will look at here applies to every single frame of the game/application in question. Further, we will ignore particle creation completely, since this is not the bottleneck of any reasonably well implemented system – unless everything else has been optimised that is.
Efficient creation of particles is an interesting topic in itself however, and we might cover that in a future post.
For today’s purpose we will divide the particle simulation pipeline into the following five steps:
- Storing data
The place and fashion in which we store the state or initial parameters of our particles.
- Updating positions
How we update or calculate the current position of our particles for rendering.
- Creating quad
Creating a quad of four vertices and two triangles to render for each particle.
- Transforming vertices into screenspace
Applying model, view, and projection transformations to our vertex coordinates.
- Rendering the quad
Rendering the quad onto the screen.
For each of these five points we must make a number of decisions resulting in a variety of possible ways to simulate particles. Some of these are easy, some are more difficult, and the results in terms of performance can vary drastically as we will see below.
Simple CPU simulated particles
To get a baseline to compare our different implementations to, I went with the most straightforward implementation I could think of.
In this case, we keep a list of particles, each one a managed object. Each frame we integrate the gravitational acceleration and advance each particle's velocity and position according to the time elapsed since the last frame.
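As a sketch of this per-frame update (in Python for brevity – the names and numbers here are illustrative, not from the original implementation):

```python
# Minimal sketch of the per-frame CPU particle update.
GRAVITY = -9.81  # acceleration along the y axis

class Particle:
    def __init__(self, x, y, vx, vy):
        self.x, self.y = x, y
        self.vx, self.vy = vx, vy

    def update(self, dt):
        # Semi-implicit Euler: integrate acceleration into velocity,
        # then velocity into position.
        self.vy += GRAVITY * dt
        self.x += self.vx * dt
        self.y += self.vy * dt

# Each frame, iterate over all live particles and advance them.
particles = [Particle(0, 0, 1, 5)]
for p in particles:
    p.update(1 / 60)
```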
When rendering, we then write four vertices and six indices to OpenGL vertex and index buffer objects respectively, and then render these using simple vertex and fragment shaders.
Of note is that we create all four vertices in the same position and only expand them into a quad in the vertex shader, aligning them with the screen plane so that the particles always face the camera.
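The expansion the shader performs is simple vector math; here is a sketch of it in Python, where `right` and `up` stand for the camera's screen-plane axes (the function name and layout are assumptions of this sketch, not the original shader code):

```python
# Sketch of expanding a particle centre into a screen-aligned quad.
# In the real system this runs on the GPU in a shader.
def make_quad(center, right, up, half_size):
    cx, cy, cz = center
    corners = []
    # Four corners, counter-clockwise, offset along the screen-plane axes.
    for sx, sy in ((-1, -1), (1, -1), (1, 1), (-1, 1)):
        corners.append((
            cx + (sx * right[0] + sy * up[0]) * half_size,
            cy + (sx * right[1] + sy * up[1]) * half_size,
            cz + (sx * right[2] + sy * up[2]) * half_size,
        ))
    return corners
```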
In my test, I spawned batches of 1000 particles 100 times a second. Each particle lived for 1.5 seconds on average, resulting in 150,000 particles being on the screen at once.
Not surprisingly, we do not get good performance with this number of particles. On my machine this particle system ran at a fairly constant 25 frames per second.
Improvements and considerations
I also did some more experiments, including using parametric particles, which do not have to be updated iteratively. However, this did not result in a measurable performance difference.
While I have no data on this, I believe we would see more significant differences here if we were dealing with more complicated particles that take up more space in memory and are thus less cache efficient. In fact, since the program does not do anything but simulate particles, we could hardly make it more cache efficient if we wanted to.
One way in which we can still improve it is by using structures instead of classes. This essentially forces the runtime to keep all particles close together – and even in order – in memory, which is as good as it gets.
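In C# this means using `struct` instead of `class`; the closest equivalent idea in Python terms is keeping the particle fields in flat, contiguous arrays rather than in a list of objects. A sketch, not the original code:

```python
from array import array

# Structure-of-arrays storage: every field lives in one contiguous
# typed array, analogous to an array of C# structs in memory.
class ParticleStore:
    def __init__(self):
        self.x = array('f')
        self.y = array('f')
        self.vx = array('f')
        self.vy = array('f')

    def add(self, x, y, vx, vy):
        self.x.append(x); self.y.append(y)
        self.vx.append(vx); self.vy.append(vy)

    def update(self, dt, gravity=-9.81):
        # Updating sweeps linearly through memory, which is cache friendly.
        for i in range(len(self.x)):
            self.vy[i] += gravity * dt
            self.x[i] += self.vx[i] * dt
            self.y[i] += self.vy[i] * dt
```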
However, in my experiments I could detect no difference in performance when using structures either. I have seen such differences while working on the particle-rich graphics pipeline of Roche Fusion however. There, using structures instead of classes gave us a slight improvement in overall performance and significantly improved garbage collection behaviour.
Why we cannot do much better using this technique
One of the reasons we do not see any change in performance when changing the way we are storing and updating our particles is because this is not the bottleneck of the system.
As already highlighted in the above diagram, the bottleneck of this particle system is pushing – or more precisely creating – vertices and indices on the CPU to render them.
This can be demonstrated not only by profiling, but also through a little experiment:
When I changed the vertex creation code to write the vertices and indices directly into arrays, instead of pushing them in small batches of 4 and 6, my simulation went from 25 to 34 FPS – a 36% increase.
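That experiment can be sketched as follows (Python; the function name is illustrative). The point is that we fill one preallocated pair of arrays, rather than pushing per-particle batches of 4 vertices and 6 indices; the index pattern 0,1,2 / 0,2,3 forms the two triangles of each quad:

```python
def fill_buffers(centers):
    # Preallocate flat arrays once for all particles.
    vertices = [None] * (len(centers) * 4)
    indices = [0] * (len(centers) * 6)
    for i, c in enumerate(centers):
        v = i * 4
        # All four vertices start at the particle centre; the shader
        # expands them into a quad later.
        vertices[v:v + 4] = [c, c, c, c]
        # Two triangles per quad: (0,1,2) and (0,2,3).
        indices[i * 6:i * 6 + 6] = [v, v + 1, v + 2, v, v + 2, v + 3]
    return vertices, indices
```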
Introducing the geometry shader
The biggest problem above is the limited speed with which we can create the vertices and indices for our particles on a single CPU core.
However, there is no inherent reason why this work should be done by the CPU in the first place. Since the introduction of geometry shaders, the GPU is perfectly able to create more complex geometry than it receives. Further, this is a highly parallelisable problem where each instance takes relatively little work – in other words, the perfect kind of job for a GPU.
In this step we will use a geometry shader to take a single vertex representing an entire particle and expand it into a quad, which can then be rendered to the screen as usual.
Note that the vertex shader is the only non-optional shader stage, so we have to include it. There is however nothing the vertex shader does that cannot also be done in the geometry shader, so we will simply pass the particle vertex through without modification.
Using this technique I more than doubled the simulation's performance, all the way up to a steady 90 FPS.
Remember that we are still storing our particles in RAM and are calculating their render parameters entirely on the CPU. We simply decreased the amount of data we have to compile and upload to the GPU to render our particles.
Calculating particle parameters on the GPU
Looking at what we have done so far, the next step is obvious:
Our particles are already parametric, so it is almost trivial to move the particle simulation code to the GPU entirely.
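With purely parametric particles, the GPU only needs each particle's creation parameters and the current time; the position follows in closed form, with no per-frame state to store or update. A sketch of that formula (Python; names are illustrative):

```python
def position_at(p0, v0, gravity, t):
    # Closed-form ballistic motion: p(t) = p0 + v0*t + 0.5*g*t^2.
    # On the GPU this would be evaluated per particle in a shader.
    return tuple(p + v * t + 0.5 * g * t * t
                 for p, v, g in zip(p0, v0, gravity))
```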
The results for this case were mixed for me. Overall the system seems to run slightly slower, at around 80 FPS, which I attribute to us again uploading slightly more data to the GPU than before (in my example the initial vertex information contains more data than the final render parameters).
Only uploading particle data once
However, the last point is of little importance. There is no need to upload virtually the same data to the GPU every single frame.
Instead we will switch to uploading the data only a single time and continuously render from the same vertex buffer for a large number of consecutive frames.
We can do this by batching together all particles created at the same time into one vertex buffer, and rendering this until the last particle has died – a simple check on the CPU.
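The remaining CPU-side bookkeeping then reduces to a per-batch expiry check, sketched here (names are illustrative, not from the original code):

```python
class ParticleBatch:
    # One vertex buffer's worth of particles, uploaded to the GPU once.
    def __init__(self, creation_time, max_lifetime):
        self.creation_time = creation_time
        self.max_lifetime = max_lifetime  # lifetime of longest-lived particle

    def is_expired(self, now):
        # The batch can be thrown away once its last particle has died.
        return now > self.creation_time + self.max_lifetime

batches = [ParticleBatch(0.0, 1.5), ParticleBatch(2.0, 1.5)]
now = 2.0
# Drop whole expired batches; everything else is rendered from its
# existing vertex buffer, with no per-frame re-upload.
batches = [b for b in batches if not b.is_expired(now)]
```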
While this means we submit slightly more particles than need to be rendered at any time – namely those within each batch that have already died – we can discard these in the geometry shader. And even if we did not, and simply created degenerate geometry to be discarded by the rasteriser, the performance gain from freeing up the CPU is immense.
Using this technique, my simulation ran at 130 FPS in 720p resolution, and up to 400 FPS at smaller resolutions. None of the other techniques showed differences based on the resolution rendered.
This shows that with this technique we are finally using the GPU to its full potential, resulting in our simulation now being fill-rate limited, instead of CPU bound.
In a practical sense this is great news since it frees up our CPU to do more important things like simulate actual gameplay, physics, AI, etc.
For more details on the implementation of this technique, see my detailed post on it, including all the shader code and a full example project.
We use this exact same technique – albeit with much more complicated particles – in Roche Fusion, where it is a huge contributor to the graphical fidelity we achieve by using large numbers of particles. This would not be possible if we could not render many tens of thousands of particles each frame by utilising the GPU.
Note that using this technique as implemented here can result in the usage of a great number of vertex buffers, which each need to be rendered using individual draw-calls. While there surely is a limit at which this becomes slower than the previous techniques due to the overhead of the draw-calls, we have not come close to it in Roche Fusion, even though we often render hundreds of vertex buffers containing only a few dozen particles each.
What helps us do so is the minimisation of state changes between rendering the different buffers. But that is a topic for another post.
Conclusion

Above we demonstrated different techniques to simulate and render particles.
We saw that the key point in question is how much work to do on the GPU, and that – in general – using the GPU as much as possible can increase our performance hugely.
I hope you found this interesting, and please let me know what you think of the different techniques I presented.
Especially if you have suggestions or ideas for other ways of modifying them to possibly increase performance even further, let me know in the comments below.
Enjoy the pixels!