20100105 - Game Engine Architecture

Gamasutra Features - Book Excerpt: Game Engine Architecture

Damn, I need to get this book!

Batch Bucketed Updates With State Caching
Jason Gregory provides an awesome description of an evolution to a fully parallel game model. Lets attempt to evolve it further.

OVERLAPPED EXECUTION BUCKETS - Effectively in the bucketed update model, each bucket of independent jobs ends with a sync point (wait for bucket to finish before starting next bucket), creating a pipeline bubble. One way to remove this pipeline bubble is to split work into two independent groups, and space out dependencies by two buckets.

Bucket C -> Dependent on completion of bucket A and prior.
Bucket D -> Dependent on completion of bucket B and prior.
Bucket E -> Dependent on completion of bucket C and prior.
Bucket F -> Dependent on completion of bucket D and prior.

Note how bucket D can now start while bucket C is finishing up (as long as B is fully finished).

GENERALIZE BUCKETS TO HETEROGENEOUS TASK DEPENDENCIES - No reason to limit this model to game objects, nearly any system can be pipelined in this way. Total number of bucket tiers/frame will bound how parallel the game can get (and thus how it will scale on future hardware). One can fully generalize this model by allowing heterogeneous tasks per bucket tier instead of one homogeneous task per tier. In the excerpt's model, use state caching to break out and fully pipeline the work done in UpdateBucket(). For instance gameObject.PreAnimUpdate() of tier B can run in parallel with g_animationEngine.CalculateIntermediatePoses() of tier A.

WORK STEALING - One possible option for adding tasks to a bucket, and later switching to removing task from bucket when processing tier, is to go the global route using a single lock free stack (like DX11 append/consume model). CPU atomic operations in this case do not scale. Second option is to use a task stack per CPU hardware thread per tier. This completely removes the atomic on write, and can mostly remove the atomic contention on read as well, because each hardware thread completely processes tasks on the thread's local stack first, only to attempt stealing work from remote stacks in the case where the thread would wait on a tier dependency.

STATIC TASK GRAPH WITH DYNAMIC RESULTS - One can completely remove the "add task to bucket step" if the task graph is relatively static. Instead just fully fill tier stacks as a pre-process, and skip unneeded work when reading from the immutable stacks at run-time.

GO NUMA AWARE - Group tasks based on thread affinity organized into hierarchy based on NUMA. Via the work stealing approach, work will stay localized unless the machine needs to load balance. Modify work stealing to steal more from threads in the local memory hierarchy before going out to the more remote threads.

On to the GPU
The ultimate evolution is to get everything fully scalable and running on an actual parallel machine optimized for throughput. This is indeed possible right now in CUDA, it is quite unconventional, but is trivially easy with the dynamic static task graph, work stealing, and uber-kernels (topic for future post)...