I recently embarked upon a project to develop an interactive, large-scale simulation using the Blender Game Engine. As an initial disclaimer, any details concerning the internals of the project will not be disclosed, in the interests of my employer. For the purpose of this blog post, I will refer to the project in the context of a general-purpose particle simulation.
As with all projects, I started designing the basis of the program by establishing a sensible, object-oriented model. Any form of user interaction would be directed through well-defined, generalised interfaces. Any behaviours explicitly requested by the client would be implemented as abstractly as possible, so as to avoid influencing the general design and sanity of the program in its early stages.
The initial release of this simulation proved to be both exciting and underwhelming. Whilst any simulation can be a fun endeavour, if it fails to reach its performance targets, it can quickly become a constant struggle to extract smaller and smaller margins. As a programmer, it is generally recommended to avoid premature optimisation, in order to encourage good-quality, legible code. However, sometimes measures must be taken to optimise the code you’ve written, because it’s simply not fast enough.
The simulation in question had an ideal update frequency of 60 Hz (a budget of roughly 16 ms per frame), but in reality the frame-time would exceed 100 ms. As the client would later require an even larger simulation, this was evidently not suitable as a final candidate.
This led me to question how best to improve performance. Before changing anything, I did some low-level benchmarking.
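The kind of low-level benchmarking I mean can be sketched with nothing more than `time.perf_counter`; the `update_step` function below is purely an illustrative stand-in, not the project's actual update code:

```python
import time

def benchmark(step, frames=100):
    """Run a per-frame update function repeatedly and return the
    mean frame-time in milliseconds."""
    start = time.perf_counter()
    for _ in range(frames):
        step()
    elapsed = time.perf_counter() - start
    return elapsed / frames * 1000.0

# Illustrative stand-in for a particle update step.
def update_step():
    sum(i * i for i in range(1000))

mean_ms = benchmark(update_step)
```

Measuring each update step in isolation like this is what made it possible to attribute the frame-time to specific components rather than guessing.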
The simulation had two components involved in the update process: an O(n) update step for all particle components, and an O(a*n) step that managed future state with respect to neighbouring particles. In addition, this second step required searching through k*n particles to find the a nearest neighbours.
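The nearest-neighbour search described above, done naively, looks something like the following brute-force scan (all names are illustrative; the real simulation's data structures are not shown here):

```python
import math

def nearest_neighbours(particles, target, a):
    """Return the `a` particles closest to `target`.

    Brute force: computes the distance from `target` to every other
    particle, so each query costs O(n), and running one query per
    particle per frame gives the O(a*n)-flavoured cost described above.
    """
    others = [p for p in particles if p is not target]
    return sorted(others, key=lambda p: math.dist(p, target))[:a]

particles = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.5, 0.5)]
# The two particles closest to the first one:
neighbours = nearest_neighbours(particles, particles[0], 2)
```

Sorting the full candidate list is the simplest correct formulation; a spatial structure (grid or k-d tree) would reduce the per-query cost, but the brute-force version is the honest baseline for the performance figures discussed here.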
This secondary step was the slowest aspect of the simulation, and was therefore the natural starting point for any optimisation. Quite quickly, it became apparent that the limitations of the Python runtime were the fundamental reason behind the large frame-time. The dynamic nature of Python – where little is known at compile time – means that it is very difficult to optimise the executed code ahead of runtime. So, I looked into Cython as a means of improving the performance of the heavy update code.
After spending some time converting pure Python into Cython code, it was evident that despite a significant improvement (~3× speedup), it was not enough to support future expansion of the simulation size. Whilst Cython was quite easy to work with, it had a few drawbacks that prevented further gains.
- To get the best results from the system, all the code would need to be annotated with Cython syntax to increase the number of compile-time optimisations.
- Cython syntax has little editor support for syntax highlighting, and is quite ugly. For this reason, I found the language unpleasant to work with at this stage of development (for large-scale applications).
So, after careful deliberation, I discarded the Cython route.
But what was left? As I have written earlier, I am comfortable programming in C++, but given the maturity of the code as it stood, was it worth moving to C++? It would give me the speed improvements that I required for the project, but at a significant development-time overhead. After weighing the benefits and drawbacks, I started the arduous task of converting from Python to C++, with the help of the boost::python library.
A lot of work was required to complete this task. Certain design choices no longer made sense in the C++ paradigm, and in many places, inlining functions and passing by reference could be used to optimise performance. But by far the most difficult process was debugging. For my build environment, I used MSVC 2013, with Python 3.4.2 64-bit (Windows). As this was the biggest C++ project that I’d ever worked with, a number of things would catch me out as I started working with the new debugging model. Whereas in Python the traditional debugging route is a combination of exception stack traces and console messages, C++ lends itself to an in-memory debugger. Certain details, like enabling exceptions in the debugging options (for the built-in MSVC debugger) and learning the meaning behind obscure build errors, caused a lot of initial headache. The most notorious of all bugs were caused by mismanaged pointers. (Of course, many would encourage the use of shared pointers, such as boost’s shared_ptr, but in the interests of speed, I preferred to use raw pointers.)
At the end of this process, I was comfortable with debugging practices, the importance of appropriate debug scenarios (for repeatability), debug dumps and general C++ best practices. Thankfully, the Python–C++ interface caused only an initial headache, due to specific nuances with exposing raw pointers (I opted for a special base class that held onto boost::shared_ptr instances for the Python interface). In hindsight, I made the right choice. As the simulation grew ever more featureful, it became increasingly heavyweight due to client requests and greater simulation depth. Had I not written the simulation code in C++, every new feature would have led to a greater feeling of disdain and foreboding over performance concerns. However, my performance problems were not yet fully mitigated.
Of course, Python is my preferred language, fast for development, and the scripting layer for the BGE, so a simulation manager was layered atop the C++ Python bindings to manage and interface with the simulation code. This made development easier and more extensible as features were requested. Wherever possible, features were written in Python, with the C++ API extended where necessary to accommodate such requests.
With the final iteration of the simulation code, performance was still a concern. Certain scenarios could reach a frame-time of 20–50 ms for some of the larger implementations, with a framerate in the range of 8–30 Hz. Initially, I was quite concerned that the simulation code was too complex for a real-time application, but then I attempted to isolate the bottleneck. The core of the simulation would call into Python through a callback to set the simulation results in the 3D scene. First, I disabled these callbacks to test the simulation code in isolation from the Python environment, and sure enough, the frame-time would sit around 3–6 ms. So, with some thought, I wondered whether the culprit was the code executed in the callbacks, or the act of calling into Python itself. Simply commenting out the function body and adding a pass statement at the top of the definition proved that the call itself was not to blame. It was this observation that indicated that setting the position and orientation of objects from Python was a bottleneck in the simulation pipeline. Furthermore, as the simulation objects deepened their child hierarchy, the frame-time increased accordingly!
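The isolation technique itself is worth sketching outside the BGE. The idea is to time the per-frame callback twice, once with an empty body and once doing its real work, so that call overhead and body cost can be separated; everything below is an illustrative stand-in, not the engine's API:

```python
import time

def time_per_call_ms(fn, n=10000):
    """Total wall-clock time (ms) for n calls to fn."""
    t0 = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - t0) * 1000.0

def empty_callback():
    pass  # the emptied callback: only the cost of the call itself remains

state = {"pos": (0.0, 0.0, 0.0)}

def full_callback():
    # Stand-in for applying a transform to a scene object.
    state["pos"] = (1.0, 2.0, 3.0)

overhead = time_per_call_ms(empty_callback)
with_work = time_per_call_ms(full_callback)
# If `with_work` dominates `overhead`, the callback body (setting
# transforms in the scene) is the bottleneck, not calling into Python.
```

In the real project the comparison was done inside the engine, but the principle is the same: measure the call with and without its body before blaming the language boundary.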
At a loss when faced with such a fundamental engine limitation, I started to wildly consider removing the Python interface entirely: building Blender with my simulation code embedded within it. I experimented with a bulk call to set transformations from a list rather than one by one, to eliminate the method-call overhead. Whilst this helped to some degree, it was unreliable and caused more problems than it was worth maintaining for. A small conversation with a project member suddenly led me to consider optimisation methods.
Given that the simulation itself wasn’t the primary performance bottleneck, I could afford to think about reducing the Python task load without impacting the simulation integrity.
- Level of detail – reducing the rate of transformation updates with distance from the camera, and culling transformations entirely beyond a specified distance.
- Frustum culling – by far the most effective means of improving performance, culling all transformation updates of invisible objects.
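Both optimisations boil down to one question per object: is this transform worth sending to the scene this frame? A minimal sketch of that decision is below, using a distance cutoff for level of detail and a simplified view-cone test standing in for a full six-plane frustum check (all names and parameters are illustrative, not the project's actual API):

```python
import math

def should_update(obj_pos, cam_pos, cam_forward, fov_deg, max_dist):
    """Return True if an object's transform is worth updating.

    Combines a level-of-detail distance cutoff with a view-cone
    visibility test (a simplification of proper frustum culling).
    `cam_forward` is assumed to be a unit vector.
    """
    to_obj = tuple(o - c for o, c in zip(obj_pos, cam_pos))
    dist = math.sqrt(sum(d * d for d in to_obj))
    if dist > max_dist:
        return False  # level-of-detail cutoff: too far to matter
    if dist == 0.0:
        return True   # coincident with the camera; always update
    # Angle between the camera's forward vector and the object direction.
    cos_angle = sum(d * f for d, f in zip(to_obj, cam_forward)) / dist
    return cos_angle >= math.cos(math.radians(fov_deg) / 2.0)

# Camera at the origin looking down +Y with a 90-degree cone:
in_view = should_update((0, 5, 0), (0, 0, 0), (0, 1, 0), 90.0, 100.0)
behind = should_update((0, -5, 0), (0, 0, 0), (0, 1, 0), 90.0, 100.0)
```

A cone test over-culls near the frustum corners compared with testing all six planes, but it illustrates why skipping invisible objects removes the per-object Python call entirely rather than merely making it cheaper.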
For both optimisations, I decided to work from the C++ code, because whilst the BGE KX_Camera API offers access to frustum culling, and the KX_GameObject API to distances between objects, I had most of this information in C++ already, and I didn’t want the optimisation method itself to become a bottleneck.
Frustum culling proved a slightly more laborious task than anticipated. It turns out that Blender’s fov value for KX_Camera is in degrees (whereas some parts of the BGE API use radians), and is a horizontal FOV, whilst most implementations of frustum culling, including mine for this project, use a vertical FOV. This required converting the FOV value prior to any culling. Furthermore, the client made use of a very large FOV (270 degrees over three cameras), so the performance benefits were limited to ~25% of the main simulation time, except in certain edge cases.
By the end of this process, the simulation was finally usable in the end product. A careful combination of Python, C++ and additional optimisation methods made the goal attainable.
At the end of this project, I recognised the power that C++ can give you in building more advanced, more expansive games and interactive simulations. I would encourage anyone whose projects are currently running into performance bottlenecks to give the process some serious thought as a means of alleviating those concerns!