20180330 - Join, Preemption, and Going Slow on the GPU
Preemption is Like Losing a Full-Screen Pass
Let's base this on a 64 CU GCN part with around 512 GiB/s of bandwidth.
One can view preemption as roughly losing a 3072x1728 full-screen pass.
- 16 MiB VGPRs
- 4 MiB LDS
- 20 MiB state (without including SGPRs)
- 40 MiB to save/restore state
A 3072x1728 32-bit/pixel image is around 20 MiB.
And if the GPU can sustain around 80% of its paper spec bandwidth, that full-screen pass (20 MiB read plus 20 MiB written) would be around 0.1 ms if bandwidth bound.
So if one preempted the GPU every millisecond, you'd have a 10% perf tax,
not including the costs of waiting until drain, flushing the caches, relaunching waves, and starting in cold cache conditions.
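The arithmetic above can be checked with a rough sketch. The per-CU sizes (256 KiB of VGPRs and 64 KiB of LDS per CU) are assumptions chosen to reproduce the 16 MiB and 4 MiB totals, and the function names are made up for illustration. Only save/restore bandwidth is modeled, nothing else.

```python
# Back-of-envelope preemption cost for the hypothetical 64 CU GCN part.
MIB = 1024 * 1024
GIB = 1024 * MIB


def state_mib(cus=64, vgpr_kib_per_cu=256, lds_kib_per_cu=64):
    """On-chip state in MiB (VGPRs + LDS, SGPRs ignored).

    The per-CU sizes are assumptions that match the 16 MiB / 4 MiB totals.
    """
    return cus * (vgpr_kib_per_cu + lds_kib_per_cu) * 1024 / MIB


def preemption_cost_ms(paper_bw_gib_s=512, efficiency=0.8):
    """Time to save and then restore all state, if purely bandwidth bound."""
    traffic_bytes = 2 * state_mib() * MIB          # save + restore
    sustained_bw = efficiency * paper_bw_gib_s * GIB
    return traffic_bytes / sustained_bw * 1000.0


def full_screen_image_mib(width=3072, height=1728, bytes_per_pixel=4):
    """Size of one 32-bit/pixel full-screen image in MiB."""
    return width * height * bytes_per_pixel / MIB


cost_ms = preemption_cost_ms()
period_ms = 1.0                                    # preempt every millisecond
print(f"state:        {state_mib():.2f} MiB")
print(f"image:        {full_screen_image_mib():.2f} MiB")
print(f"save/restore: {cost_ms:.3f} ms")
print(f"perf tax:     {100.0 * cost_ms / period_ms:.1f}%")
```

Changing `efficiency` or the preemption period shows how quickly the tax scales.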
Join is Like a Mega Upgrade to Preemption's Slowness
Launch the same 3072x1728 sized kernel (over 5 million invocations).
Say each wave forks and joins, i.e., stops execution of the wave,
saves state, then "calls" into another kernel, then returns and continues execution where it left off.
If kernel state at the "fork" is 64 VGPRs (and no LDS), we can compute the fork cost,
- 64 VGPRs * 4 bytes/VGPR = 256 bytes/lane
- 512 bytes/lane to save and restore state
- 512 bytes/lane * 3072x1728 lanes = 2592 MiB required to fork/join
2592 MiB / 40 MiB = roughly enough bandwidth for over 64 full-screen passes, or maybe around 6.5 ms.
Hopefully dead obvious: call and return, or fork and join,
or any construct which is built on the concept of saving and restoring per-lane state
is fundamentally flawed.
But Wait, What About "Keeping it On Chip?"
Getting back to GPU stats,
- 16 MiB VGPRs
- 4 MiB LDS
- 4 MiB L2$
- 1 MiB L1$
First obvious problem, there is no cache on chip that can hold VGPR state.
But let's just ignore this problem anyway.
How about using 2 MiB of L2$ for saving state?
Say L2$ has somewhere between 2x and 4x bandwidth of VRAM.
So fork/join would still be 32 or 16 full-screen passes respectively.
And if that isn't bad enough, by burning half the L2$,
average cacheline life also halves,
and one can only fork 1/8 of the waves.
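The fork/join numbers, both out to VRAM and through L2$, can be reproduced with a small sketch. It reuses the post's figures (40 MiB and roughly 0.1 ms per full-screen pass, L2$ at 2x to 4x VRAM bandwidth). The function and constant names are hypothetical.

```python
# Rough fork/join cost, expressed in "full-screen passes" of 40 MiB each.
MIB = 1024 * 1024
PASS_MIB = 40.0        # read + write of a 3072x1728 32-bit image (rounded)
PASS_MS = 0.1          # approximate bandwidth-bound time per pass


def fork_join_traffic_mib(width, height, vgprs_per_lane, bytes_per_vgpr=4):
    """Bytes moved when every lane saves and later restores its VGPRs."""
    lanes = width * height
    save_bytes_per_lane = vgprs_per_lane * bytes_per_vgpr   # 256 bytes
    return lanes * 2 * save_bytes_per_lane / MIB            # save + restore


traffic = fork_join_traffic_mib(3072, 1728, 64)
vram_passes = traffic / PASS_MIB
print(f"traffic: {traffic:.0f} MiB")
print(f"VRAM:    {vram_passes:.1f} passes, ~{vram_passes * PASS_MS:.1f} ms")

# Assume L2$ offers 2x or 4x the bandwidth of VRAM.
for speedup in (2, 4):
    print(f"L2$ at {speedup}x: {vram_passes / speedup:.1f} passes")

# Only as much wave state as fits in the reserved L2$ can be forked at once.
l2_reserved_mib = 2.0
vgpr_total_mib = 16.0
forkable_fraction = l2_reserved_mib / vgpr_total_mib
print(f"forkable fraction of waves: {forkable_fraction}")
```

Even with the generous 4x assumption, the cost stays at well over a dozen full-screen passes.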
Ok, so clearly attempting to save and load state from L2$ is also like swimming in a pool of petrol and playing with matches.
"On Chip", As in VGPRs?
And the final option: do what existing GPUs already do, inline everything and just massively up the VGPR count.
Meaning leave enough VGPRs available to keep existing state in VGPRs while calling out to a separate function.
Clearly this is the only sane option, as it doesn't burn all of on-chip bandwidth attempting to save and restore state.
But as is well known, performance of traditional shaders at low wave occupancy is horribly poor.
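To put a number on how VGPR count drives occupancy, here is a minimal sketch. The limits in it are assumptions not stated in this post: the commonly cited GCN figures of a 256-VGPR budget per lane per SIMD, at most 10 waves per SIMD, and allocation in blocks of 4 VGPRs. The function name is made up for illustration.

```python
def gcn_waves_per_simd(vgprs, vgpr_budget=256, max_waves=10, granularity=4):
    """Estimate waves resident per SIMD given VGPRs used per lane.

    Assumed limits (not from the post): 256 VGPRs per lane per SIMD,
    10 waves per SIMD maximum, allocation rounded up to a multiple of 4.
    """
    if not 1 <= vgprs <= vgpr_budget:
        raise ValueError("vgprs must be between 1 and the VGPR budget")
    allocated = -(-vgprs // granularity) * granularity   # round up
    return min(max_waves, vgpr_budget // allocated)


# A 64 VGPR caller that keeps its state live while calling a 64 VGPR
# callee effectively needs 128 VGPRs.
for vgprs in (24, 64, 128, 256):
    print(f"{vgprs:3d} VGPRs -> {gcn_waves_per_simd(vgprs)} waves/SIMD")
```

Under these assumptions, doubling the register footprint from 64 to 128 VGPRs halves the waves available to hide latency.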
Fork and join is a great way to guarantee slowness!