20090708 - GPU Ray Traversal Efficiency


Understanding the Efficiency of Ray Traversal on GPUs
by Timo Aila and Samuli Laine of NVidia Research

I really like this HPG2009 paper which explores the primary factors effecting performance of ray traversal on NVidia GPUs. They compared actual results with results from a simulator which had zero cost memory performance. So actuals vs upper bound on best (unobtainable) GPU performance of only ALU cycle counts.

They found work distribution yielding lower SIMD efficiency was the primary bottleneck for ray traversal and NOT the memory system. Quote from the paper, "We have also shown that, contrary to conventional wisdom, ray tracing performance of GTX285 is not significantly hampered by the lack of cache hierarchy. In fact, we can also expect good scaling to more complex scenes as a result of not relying on caches."


Upper End Result in the Paper on a GTX285

- 140M rays/sec for primary rays (coherent)
- 40M rays/sec for secondary randomly shuffled GI rays (non-coherent)

They found around (and often better than) a 2x performance increase via two primary optimizations. With these changes they managed between 77% to 91% of the theoretical maximum GPU performance assuming infinite memory speed.

(1.) PERSISTENT THREADS. Instead of launching a thread per pixel, launch enough threads to fill the GPU once, and have the threads pull work from a global pool. The problem is that ray traversal has varying execution time. The advantage of threads grabbing their own work is that long running warps only stall the warp instead of stalling the entire block (another block would have to wait until the longest running warp finished before being issued). Also note paper pulls work from the global pool into small local (shared memory) pools to reduce pressure on a single global atomic counter.

(2.) PER-RAY TRAVERSAL. Typical CPU SIMD tracers do packet traversal partly because of lack of vector gather and because CPUs don't automatically handle branch divergence. The paper found that per-ray traversal is faster than packet traversal on the GTX285 in all cases, and for diffuse primary rays about 2x faster (even though per-ray traversal decreases memory coherency)!


Implementation

- 192 threads/block
- 21-25 registers/thread
- node data fetched via 1D texture fetch (cached)
- triangle data fetched via direct global memory access (non-cached)
- per-ray stacks via global memory access (non-cached)
- binary BVH traversal with maximum leaf size of 8
- tree built using greedy surface-area heuristic
- 64 bytes per the two BVH nodes (one full global memory segment access)
- pair of BVH nodes always fetched and tested together
- Woop's unit triangle intersection test
- 48 bytes per triangle (1 to 2 global memory segments per access)
- rays assigned in Morton order (good node texture coherency on primary rays)

I'm making some assumptions here (which I didn't see in the paper directly but should be obvious?), that the pairs of BVH nodes and 48 byte triangle leaf data (referred to below as a "data block") are fetched using 1-2 global memory transaction into shared memory.

(1.) Threads write address of data block into shared memory.
(2.) All threads share in loading the data block for one thread at a time.
(3.) Data blocks are loaded with increasing bank offsets.

This insures very fast memory transactions both globally (perfect coalescing on GT200), and no bank conflicts locally afterward when threads move from coordinated loading to individual parallel computation.

Notice stack data is also thread-local storage which is non-cached global memory accesses. I'm going to assume again that they insured good global memory segment coalescing (GT200) via keeping per thread stacks spaced out by a stride of 16. This way global memory segments required for stack update would be a function of divergence in stack size. So stacks of similar size would get serviced in a single global memory transaction (aside, this is exactly the same thing required on Larrabee to do fast parallel stacks since gather performance is a function of cache lines accessed).


Dedicated Warp-Wide Scan or Prefix Sum Instruction

This enables lower software overhead for fine thread granularity work regrouping on the GPU. IMO NVidia needs this (or something like Larrabee's 2 related instructions), covered it in prior blog posts, have hopes that this might be in NVidia's GT300 arch. The paper covers this in the section on "Improving the SIMD efficiency further".

Paper found that current software overhead for replacing terminated rays didn't yield a consistent performance update. However simulator suggested a max 20% improvement with the addition of two warp-wide hardware instructions (prefix sum and population count) assuming no memory bottleneck.

Paper also has some very rough speculation towards going to fully dynamic "Work queues" using the simulated new warp-wide instructions showing a max potential of 40% and 80% performance improvements in two cases assuming no memory bottleneck.


SIMD Width : 32-wide vs 16-wide vs 1-wide (Scaler)

Given impossibly infinitely fast memory, and identical computational power, simulator results showed 16-wide SIMD to be 6-19% faster, and scaler to be 30-144% faster than 32-wide SIMD.