20180119 - Extreme Minimal/Rapid GPU Programming


Investing in compute-only for rendering with some radical simplifications and API-illegal behavior. Aka, "Don't Do This", but also happens to be exactly how I personally operate, at the friendly knife edge of what is possible. Like juggling live chainsaws, not recommended to try at home ...

Engine is a fixed data-flow network of compute dispatches. Command buffers for odd and even frames get replayed instead of re-generated (no CPU work). One command buffer per frame removes the overhead associated with command buffer breaks. Only synchronization is between the command buffer and present. All resources are created at initialization time and never freed; exit simply exits the process. Why bother with specialized clean-up or any dynamic allocation? With instant app restart, resolution change is the same as program restart, and same with lost device.
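
A minimal host-side sketch of what that frame loop might look like, assuming the two command buffers were recorded once at init and the swapchain/semaphore setup already exists; the names (cmd, acquireDone, drawDone) are hypothetical and error handling is omitted.

```c
#include <vulkan/vulkan.h>
#include <stdint.h>

// Replay a pre-recorded command buffer (odd/even frame parity), then present.
// No per-frame command buffer recording, no CPU-side scene work.
void Frame(VkDevice device, VkQueue queue, VkSwapchainKHR swapchain,
           VkCommandBuffer cmd[2], VkSemaphore acquireDone, VkSemaphore drawDone,
           uint64_t frame) {
  uint32_t imageIndex;
  vkAcquireNextImageKHR(device, swapchain, UINT64_MAX, acquireDone, VK_NULL_HANDLE, &imageIndex);

  // Submit the pre-baked command buffer for this frame's parity.
  VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
  VkSubmitInfo submit = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
  submit.waitSemaphoreCount = 1;
  submit.pWaitSemaphores = &acquireDone;
  submit.pWaitDstStageMask = &waitStage;
  submit.commandBufferCount = 1;
  submit.pCommandBuffers = &cmd[frame & 1];
  submit.signalSemaphoreCount = 1;
  submit.pSignalSemaphores = &drawDone;
  vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);

  // The only synchronization outside the command buffer: present waits on the submit.
  VkPresentInfoKHR present = { VK_STRUCTURE_TYPE_PRESENT_INFO_KHR };
  present.waitSemaphoreCount = 1;
  present.pWaitSemaphores = &drawDone;
  present.swapchainCount = 1;
  present.pSwapchains = &swapchain;
  present.pImageIndices = &imageIndex;
  vkQueuePresentKHR(queue, &present);
}
```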

All resources are per-app frequency, not per-item (material, etc.). All specialization is in data, not in code, objects, or configuration of resources. All image resources are STORAGE (aka writable), so GENERAL layout, no transitions. Leverage the implicit existing per-command-buffer cache flush once per frame; no more per-dispatch cache flushes. All shader stores are "coherent" in GLSL, aka GLC=1, write-through to the GPU's last-level cache. All data access fits the write-once-before-read-then-read-many model (no chance of stale data in the caches). All indirect dispatch data writes are "volatile" in GLSL, aka SLC=1, write-through to DRAM. It is a small amount of data, which avoids needing an L2 flush on hardware whose front-end fetches from DRAM instead of L2, but works in either case. No more barriers, only execution dependencies.
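
A sketch of the Vulkan side of that resource policy, assuming a single queue family; the helper name and format are illustrative only. The GLSL side declares the matching images/buffers "coherent" (and "volatile" for indirect dispatch args).

```c
#include <vulkan/vulkan.h>

// Hypothetical helper: create one of the engine's per-app-frequency storage images.
VkImage CreateStorageImage(VkDevice device, uint32_t width, uint32_t height) {
  VkImageCreateInfo info = { VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO };
  info.imageType = VK_IMAGE_TYPE_2D;
  info.format = VK_FORMAT_R16G16B16A16_SFLOAT;   // example format, an assumption
  info.extent.width = width;
  info.extent.height = height;
  info.extent.depth = 1;
  info.mipLevels = 1;
  info.arrayLayers = 1;
  info.samples = VK_SAMPLE_COUNT_1_BIT;
  info.tiling = VK_IMAGE_TILING_OPTIMAL;
  info.usage = VK_IMAGE_USAGE_STORAGE_BIT;       // STORAGE only: written and read by CS
  info.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
  VkImage image = VK_NULL_HANDLE;
  vkCreateImage(device, &info, NULL, &image);
  // Memory allocation/binding and a one-time UNDEFINED -> GENERAL transition happen at
  // init; after that every descriptor uses VK_IMAGE_LAYOUT_GENERAL and the replayed
  // command buffers contain no layout transitions or per-dispatch memory barriers.
  return image;
}
```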

Leverage the GPU's design of serializing CS dispatches, consistent kernel rasterization, and the natural limits on parallel execution. If two dependent kernels cannot physically have a data race given the practical limits above, then that itself serves as the execution dependency (no barrier or event used, or objectively needed in practice). Fully pipelined execution (no wait-for-idle inside the command buffer). When a given kernel launch cannot reliably fill the machine for N times the GPU fill time on current and projected hardware, then mix in enough independent work to do so, or use an explicit execution barrier (event). Using "coherent" for dynamically uniform (K$-access-only) polling to move the execution dependency into the post-wave-init region of shader execution is also an interesting possibility.
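
A minimal sketch of that explicit fallback: a Vulkan event used purely as an execution dependency between two compute dispatches, with zero memory barriers (no per-dispatch cache flushes), relying on "coherent" stores and the once-per-command-buffer flush for visibility. The dispatch sizes and the event object are assumptions; the event would be created once at init.

```c
#include <vulkan/vulkan.h>

void RecordDependentDispatches(VkCommandBuffer cmd, VkEvent event) {
  vkCmdDispatch(cmd, 64, 1, 1);            // producer kernel (example dispatch size)
  vkCmdSetEvent(cmd, event, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
  // Independent dispatches can be recorded here to keep the machine filled.
  vkCmdWaitEvents(cmd, 1, &event,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // srcStageMask
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // dstStageMask
    0, NULL,                               // no global memory barriers
    0, NULL,                               // no buffer memory barriers
    0, NULL);                              // no image memory barriers
  // Reset so the same command buffer can be replayed next frame.
  vkCmdResetEvent(cmd, event, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
  vkCmdDispatch(cmd, 64, 1, 1);            // consumer kernel
}
```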

CPU-to-GPU is all late-latch communication with minimal use of bandwidth (limited to gamepad/key/mouse state upload and network data upload; all game logic lives on the GPU, not the CPU). Since bandwidth is minimal, bypass transfer queue usage and its associated WDDM overheads; instead a CS dispatch does "volatile" reads to get the latest data and stores it GPU-side. Avoiding DEVICE_LOCAL+HOST_VISIBLE due to the problem of Win7 persistent maps getting migrated to CPU-side memory, and because it is not portable. GPU-to-CPU is all pass-through of network data, also minimal bandwidth. Also store direct from shader to CPU-side memory, accepting any loss of perf due to disruption of GPU memory traffic.
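
A sketch of the late-latch upload path under those constraints: a small HOST_VISIBLE|HOST_COHERENT (not DEVICE_LOCAL) buffer, persistently mapped once at init and simply overwritten by the CPU each frame; a CS dispatch then reads it late with "volatile"/"coherent" loads and copies the values GPU-side. The InputState struct and helper names are hypothetical.

```c
#include <vulkan/vulkan.h>
#include <string.h>

// Hypothetical per-frame upload payload: input state plus any network data.
typedef struct InputState { float mouse[2]; unsigned buttons; unsigned pad; } InputState;

// One-time setup: persistently map the HOST_VISIBLE|HOST_COHERENT upload memory.
void* MapUploadBuffer(VkDevice device, VkDeviceMemory memory, VkDeviceSize size) {
  void* ptr = NULL;
  vkMapMemory(device, memory, 0, size, 0, &ptr);  // never unmapped
  return ptr;
}

// Per-frame: overwrite the mapped memory; no transfer queue, no vkCmdCopyBuffer.
// A compute dispatch on the GPU reads this with "volatile" loads and late-latches it.
void LateLatchWrite(void* persistentMap, const InputState* latest) {
  memcpy(persistentMap, latest, sizeof(InputState));
}
```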

Thus a collection of "bad behavior" to massively simplify engine development and gain optimal execution on the GPU. Recompile on source file modification (save), with instant reload without app exit, rounds out the rapid development functionality. Effectively a "shader toy" on steroids, but something which recompiles into a single binary in shippable-to-consumer form. Had a very stale version of the shell of this engine up on github in public domain form (LDVE, Low Dependency Vulkan Engine). Stale as in the repo only got updated during early initial bring-up. Skipping the link, will post if I ever get around to clean-up and upload of the real thing.
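
One possible shape of that recompile-on-save loop, purely an assumption about mechanism: poll the shader source's modification time each frame, and when it changes, recompile to SPIR-V and rebuild the compute pipeline without restarting the app. The compiler invocation and the omitted pipeline rebuild are illustrative, not the engine's actual implementation.

```c
#include <sys/stat.h>
#include <stdlib.h>

static time_t lastWrite;

// Returns 1 when the file's modification time has changed since the last check.
int SourceChanged(const char* path) {
  struct stat st;
  if (stat(path, &st) != 0) return 0;
  if (st.st_mtime == lastWrite) return 0;
  lastWrite = st.st_mtime;
  return 1;
}

void MaybeReload(const char* path) {
  if (!SourceChanged(path)) return;
  // Recompile on save (example invocation); then vkCreateShaderModule +
  // vkCreateComputePipelines on the new SPIR-V, destroying the old pipeline
  // once in-flight frames are done.
  system("glslangValidator -V shader.comp -o shader.spv");
}
```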

Random Interesting Stuff on the Interwebs