20151028 - Random Thoughts on TS vs Alternatives

In the context of: Joshua Barczak - Thoughts on Texel Shaders.

Here is where my mind gets stuck when I think about any kind of TS-like stage...

Looking at a single compute unit in GCN,

256 KB of vector registers
16 KB of L1

If a shader occupies 32 waves (relatively good occupancy out of 40 possible) that is a tiny 512 bytes of L1 cache on average per wave. Dividing that out into the 64 lanes of a wave provides just 8 bytes on average per invocation in the L1 cache. Interesting to think about these 8 bytes in the context of how many textures a fragment (or pixel) shader invocation accesses in its lifetime. The ratio of vector register to L1 cache is 16:1. This working state to cache ratio provides a strong indication that data lifetime in the L1 cache is typically very short. L1 cache serves to collect coherence in a relatively small window temporally, and the SIMD lockstep execution of a full wave guarantees the tight timing requirements. This suggests one likely could not cache TS stage results in L1. L2 is also relatively tiny in comparison with the amount of vector register state of the full machine...
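The arithmetic above can be checked with a minimal sketch. The constant names are mine; the numbers are the per compute unit figures quoted in this post.

```python
# Per compute unit figures quoted above for GCN.
VGPR_BYTES = 256 * 1024   # vector registers
L1_BYTES = 16 * 1024      # L1 cache
WAVES = 32                # occupancy assumed in the text (out of 40 possible)
LANES_PER_WAVE = 64

l1_per_wave = L1_BYTES // WAVES               # average L1 bytes per wave
l1_per_lane = l1_per_wave // LANES_PER_WAVE   # average L1 bytes per invocation
vgpr_to_l1 = VGPR_BYTES // L1_BYTES           # working state to cache ratio

print(l1_per_wave, l1_per_lane, vgpr_to_l1)
```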

Going back to R9 Nano ratios: 16 ops to 2 bytes to 1 texture fetch. The "op" in this context is a vector instruction (1 FMA instruction provides 2 flops). Let's work with the assumption of a balanced shader using those numbers. Say a shader uses 256 vector operations, it has capacity for 16 texture fetches, and let's assume those 16 fetches are batched into 4 sets of 4 fetches. Let's simplify scheduling to exact round robin. Then simplify to assume magically 5 waves can always be scheduled (enough active waves to run 5 function units like scalar, vector, memory, export, etc). Then simplify to average texture return latency of 384 cycles (made that up). Given vector ops take 4 clocks, we can ballpark shader runtime as,

4 clocks per op * 256 operations * 5 waves interleaved + 4 batches of fetch * 384 cycles of latency
= 5120 + 1536 = 6656 cycles, roughly 6.7 thousand cycles of run-time
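The same ballpark as a small function, so the made-up inputs can be varied. This is only a sketch of the simplified model above; the function and parameter names are mine and the defaults are the numbers from the text.

```python
def estimate_cycles(vector_ops=256, clocks_per_op=4, waves_interleaved=5,
                    fetch_batches=4, fetch_latency=384):
    """Ballpark shader run-time in cycles under the simplifications above."""
    alu_cycles = clocks_per_op * vector_ops * waves_interleaved
    fetch_cycles = fetch_batches * fetch_latency
    return alu_cycles + fetch_cycles

baseline = estimate_cycles()
# Doubling the (made up) texture return latency only touches the fetch term.
slower_fetch = estimate_cycles(fetch_latency=2 * 384)
print(baseline, slower_fetch)
```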

This made up example is just to point out that adding a TS stage serves as an amplifier on the amount of latency a texture miss can take to service. Instead of pulling in memory, the shader waiting on the miss now waits on another program instead. Assuming TS dumps results to L2 (which auto-backs to memory),

Dump out arguments for TS shading request
Schedule new TS job
Wait until machine has resources available to run scheduled work (free wave)
Wait for shader execution to finish filling in the missing cache lines
Send back an ack that the TS job is finished

If a TS shader can access a procedural texture, in theory that TS shader could also miss, resulting in a compounding amount of latency. The 16:1 ratio of vector register to L1 cache hints at another trouble: the shader has a huge amount of state. Any attempt to save out wave state and later restore it (for a wave which needs to sleep for many 1000's or maybe many 10000's of cycles for a TS shader to service a miss) is likely to use more bandwidth to save/restore than is used to fetch textures in the shader itself running without a TS stage. Ultimately suggesting it would be better to service what would be expected TS misses long before a shader runs, instead of preemptively attempting to service while a shader is running...
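A rough sketch of that save/restore comparison, using only numbers already in this post. The assumptions are mine: each wave holds an even 1/32 share of the register file, the state is written out once and read back once, and the shader is the balanced 256 op example at 2 bytes of memory traffic per 16 ops.

```python
VGPR_BYTES_PER_CU = 256 * 1024
WAVES_PER_CU = 32            # assumed occupancy, as earlier in the post
LANES_PER_WAVE = 64
OPS_PER_INVOCATION = 256     # the balanced shader example
BYTES_PER_16_OPS = 2         # from the "16 ops to 2 bytes" ratio

# Assumption: every wave uses an equal share of the vector register file.
state_per_wave = VGPR_BYTES_PER_CU // WAVES_PER_CU
save_restore_bytes = 2 * state_per_wave          # one save plus one restore

fetch_per_lane = OPS_PER_INVOCATION // 16 * BYTES_PER_16_OPS
fetch_per_wave = fetch_per_lane * LANES_PER_WAVE

ratio = save_restore_bytes // fetch_per_wave
print(save_restore_bytes, fetch_per_wave, ratio)
```

Under these assumptions a single save and restore moves about 8 times the memory traffic the wave would otherwise generate over its whole lifetime.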

The majority of visual coherence is temporal, not spatial. Comparing compression ratios of video to still image provides an idea of the magnitude. Might be more powerful to engineer around enabling temporal coherence instead of just very limited spatial coherence. Suggests the optimal end game caches all the way through to DRAM in some kind of view independent parameterization to enable some amount of reuse across frames in the common case. This also could be a major stepping stone in decoupling shading rate from both refresh-rate and screen resolution. Suggesting again a pipeline of caching what would be TS results across frames...

Gut feeling based on a tremendous amount of hand waving is pointing to something which doesn't actually need any new hardware, something which can be done quite well on existing GCN GPUs for example. Unique virtual shading cache shaded in the same 8x8 texel tiles one might imagine for TS shaders, but in this case async shaded in CS instead. With a background mechanism which is actively pruning and expanding the tree structure of the cache based on the needs of view visibility. Each 8x8 tile with a high precision {scale, translation, quaternion}, paired with a compressed texture feeding a 3D object space displacement, providing texel world space position for rigid bodies or pre-fabs. Skinned objects perhaps have an additional per tile bone list, per tile base weights, and compressed texture feeding per texel modifications to base weights. Lots of shading complexity is factored out into per tile work. For example with traditional lights, can cull lights to fully in/out of shadow to skip shadow sampling. Each frame can classify the highest priority tiles which need update, then shade them: tiles with actively changing shadow, tiles reflecting quickly changing specular, etc.
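The per frame "classify, then shade the highest priority tiles" step could look something like the following CPU-side sketch. Everything here is hypothetical: the Tile record, the priority field, and the fixed per frame budget are stand-ins of mine, with only the {scale, translation, quaternion} fields taken from the description above.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Tile:
    tile_id: int
    scale: float
    translation: tuple       # (x, y, z)
    quaternion: tuple        # (x, y, z, w)
    priority: float = 0.0    # hypothetical: 0 means "no update needed"

def pick_tiles_to_shade(tiles, budget):
    """Return up to `budget` tiles needing an update, highest priority first."""
    needs_update = (t for t in tiles if t.priority > 0.0)
    return heapq.nlargest(budget, needs_update, key=lambda t: t.priority)

identity = (0.0, 0.0, 0.0, 1.0)
tiles = [Tile(i, 1.0, (0.0, 0.0, 0.0), identity, p)
         for i, p in enumerate([0.0, 3.0, 1.0, 2.0])]
picked = [t.tile_id for t in pick_tiles_to_shade(tiles, budget=2)]
print(picked)
```

The selected tiles would then be handed to async compute for shading; tiles that miss the budget simply keep their cached results for another frame.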

Continuing on the TS Blog Conversation Chain

Re Joshua Barczak - Texel Shader Discussion...

"Iím suggesting that the calling wave be re-used to service the TS misses (if any), so instead of waiting for scheduling and execution, it can jump into a TS and do the execution itself."

I'm going to attempt to digest actually building this on something similar to current hardware and see where the pitfalls would be. Basically the PS stage shader gets recompiled to include conditional TS execution. This would roughly look like,

(1.) Do some special IMAGE_LOADS which set bit on miss in a wave bitmask stored in a pair of SGPRs.
(2.) Do standard independent ALU work to help hide latency.
(3.) Do S_WAITCNT to wait for IMAGE_LOADS to return.
(4.) Check if bitmask in SGPR is non-zero, if so do wave coherent branch to TS execution (this needs to activate inactive lanes).

Continuing with TS execution,

(5.) Loop while bitmask is non-zero.
(6.) Find first one bit.
(7.) Start TS shader wave-wide corresponding to the lane with the one bit.
(8.) Use the TEX return {x,y} to get an 8x8 tile coordinate to re-generate and {z} for mip level.
(9.) Do TS work and write results directly back into L1.
(10.) When (5.) ends, re-issue IMAGE_LOADS.
(11.) Do S_WAITCNT to wait for loads to return.
(12.) For any new invocations which didn't pass before, save off successful results to other registers.
(13.) Check again if bitmask in SGPR is non-zero, if so go back to (5.).
(14.) Branch back to PS execution (which needs to disable inactive lanes).
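To make the control flow of steps (5.) through (13.) concrete, here is a toy CPU-side model. It is a sketch under heavy assumptions of mine: the miss bitmask is a Python integer, the L1 is a tiny FIFO of whole 8x8 tiles with a hypothetical capacity, and "shading" a tile just means inserting it into that cache. None of this is real GCN behavior.

```python
from collections import OrderedDict

def issue_loads(lane_tiles, mask, cache):
    """Steps (1.)/(10.): of the lanes in `mask`, return those that miss."""
    miss = 0
    for lane, tile in enumerate(lane_tiles):
        if (mask >> lane) & 1 and tile not in cache:
            miss |= 1 << lane
    return miss

def run_ts_path(lane_tiles, capacity, cache=None):
    """Return (passes over the TS loop, number of tiles shaded)."""
    cache = OrderedDict() if cache is None else cache
    passes = tiles_shaded = 0
    miss = issue_loads(lane_tiles, (1 << len(lane_tiles)) - 1, cache)
    while miss:                                       # (4.) and (13.)
        passes += 1
        todo = miss
        while todo:                                   # (5.)
            lane = (todo & -todo).bit_length() - 1    # (6.) find first one bit
            tile = lane_tiles[lane]
            if tile not in cache:                     # (7.)-(9.) regenerate tile
                cache[tile] = True
                tiles_shaded += 1
                if len(cache) > capacity:
                    cache.popitem(last=False)         # toy FIFO eviction
            todo &= todo - 1                          # clear the serviced bit
        miss = issue_loads(lane_tiles, miss, cache)   # (10.)-(12.)
    return passes, tiles_shaded

worst = run_ts_path(list(range(64)), capacity=32)   # 64 distinct tiles
fits = run_ts_path(list(range(64)), capacity=64)
shared = run_ts_path([7] * 64, capacity=32)         # every lane hits one tile
print(worst, fits, shared)
```

In this toy model, 64 distinct tiles against a 32 tile cache need a second pass and shade 96 tiles in total, because tiles generated early in a pass are evicted before the loads are re-issued.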

This kind of design has a bunch of issues, getting into a few of them,

(A.) Step (10.) has no post load ALU before S_WAITCNT, so it hides less of its own latency (even though it will hit in the cache).

(B.) Need to assume a texture can miss at the point where the wave is already at peak register usage in the shader, which implies a shader needs that plus the TS needs in terms of total VGPR usage. Given the frequency of PS work which is VGPR pressure limited without any TS pass compiled in, this is quite scary. Cannot afford to save out the PS registers. Also cannot afford to make hardware to dynamically allocate registers at run-time just for TS (deadlock issues, worst case problem of too many waves attempting to run TS code paths at same time, etc). So VGPR usage would be a real problem with TS embedded in PS.

(C.) Need a hardware change to ensure TS results are in pinned cache lines until after the first access finishes serving a given invocation. This way the IMAGE_LOAD in (10.) is ensured a hit to have some guaranteed forward progress. There is a real problem that 8x8 tiles generated early in the (5.) loop might normally be evicted by the time all the data was generated.

(D.) Attempt to do random access at 64-bit/texel (aka fetch from 64 8x8 tiles) which all miss. That's 64*8*8*8 bytes (32KB) or double the size of the L1 cache. There are multiple major terminal design problems related to this causing the wasteful (12.) step: need to support the possibility that one cannot service all texture loads for 64-invocations from one pass.
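The worst-case footprint in (D.) works out as follows. A minimal sketch; the constant names are mine and the 16 KB L1 size is the figure from earlier in this post.

```python
TILE_DIM = 8            # 8x8 texel tiles
BYTES_PER_TEXEL = 8     # 64-bit/texel
LANES = 64              # one distinct tile missed per invocation
L1_BYTES = 16 * 1024

tile_bytes = TILE_DIM * TILE_DIM * BYTES_PER_TEXEL   # bytes per tile
worst_case_bytes = LANES * tile_bytes                # all 64 lanes miss
tiles_that_fit = L1_BYTES // tile_bytes              # tiles resident in L1 at once

print(tile_bytes, worst_case_bytes, tiles_that_fit)
```

So even an otherwise empty L1 holds at most half of the tiles one wave can ask for in a single pass.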

(E.) The TS embedded in PS option would lead to some radically extreme cases like multiple waves missing on the same 8x8 tiles, and possibly attempting to regenerate same tiles in parallel.

(F.) The TS embedded in PS option would result in extreme variation in PS execution time. Causing a requirement for more buffering in the in-order ROP fixed function pipeline.

So gut feeling is that this isn't practical.

I feel like many of these ideas fall into a similar design trap: the idea of borrowing the concept of "call and return" from CPUs. Devs have decades of experience solving problems taking advantage of a "stack", depending on what seems like "free" saving of state and later restoring of state. That idea only applies to hardware which has tiny register files, and massive caches. GPUs are the opposite, there is no room in any cache for saving working state. And working state of a kernel is massive in comparison to the bandwidth used to fetch inputs and write outputs, so one never wants working set to ever go off-chip. Any time anyone builds a GPU based API which has a "return", or a "join", this is an immediate red flag for me. GPUs instead require "fire and forget" solutions, things that are "stackless" and things which look more like message passing where a job never waits for the return. The message needs to include something which triggers the thing which ultimately consumes the data, or the consumer is pre-scheduled to run and waits on some kind of signal which blocks launch.
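As an illustration of the "fire and forget" shape, here is a toy single-threaded sketch in which the message itself names the consumer, so the producer never waits for a return. The Scheduler class and the job functions are hypothetical stand-ins of mine, not any real GPU API.

```python
from collections import deque

class Scheduler:
    """Toy run-to-completion job queue: no job ever blocks on another."""
    def __init__(self):
        self.queue = deque()
        self.results = []

    def post(self, job, payload):
        self.queue.append((job, payload))

    def run(self):
        while self.queue:
            job, payload = self.queue.popleft()
            job(self, payload)     # runs to completion, keeps no saved state

def shade_tile(sched, message):
    tile_id, consumer = message
    shaded = tile_id * 2            # stand-in for the actual shading work
    sched.post(consumer, shaded)    # trigger the consumer, then forget

def consume(sched, shaded):
    sched.results.append(shaded)

sched = Scheduler()
for tile_id in (1, 2, 3):
    sched.post(shade_tile, (tile_id, consume))
sched.run()
print(sched.results)
```

Nothing here ever holds working state while waiting: each job finishes, and whatever needs the result is launched by the message rather than resumed from a stack.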