20180326 - Thinking Through GPU Memory Access

Proxy GPU Used For This Post
Made up but similar to GCN, something like Fury Nano with more VRAM. Numbers set to make it easy to reason about with simple power of 2 maths.

Derived properties of this made up GPU (implied by the figures used below):
64 CUs at a power-of-2 "1 GHz" clock (2^30 Hz), 64-lane waves, 64 byte cache lines.
Peak off-chip bandwidth: 512 GiB/s = 8 GiLines/s.
Peak L1$ bandwidth: 1 line/clk/CU = 64 GiLines/s (8x off-chip).
Peak fetch rate: 4 fetches/clk/CU = 256 GiFetch/s.

Memory Divergence Hiding Poor SIMD Occupancy
Difference between peak off-chip and on-chip bandwidth is 8x (8 GiLines/s from VRAM vs 64 GiLines/s from L1$).

So in theory, if not limited by latency hiding, a kernel which runs roughly balanced between ALU and MEM when hitting in L1$ could drop to around 8 useful lanes (out of 64) as the workload becomes fully memory divergent (missing in the cache). The workload wouldn't be lane-occupancy limited, but rather actually memory bound.
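The drop from 64 to around 8 useful lanes can be sketched with a toy model. The linear blend between the full-hit and full-miss extremes is my assumption, not from the post; only the 8x ratio and 64-lane wave are from the proxy GPU:

```python
# Crude model of the 8x gap: when every lane misses the cache, the memory
# system can only feed 1/8 of the lanes it could feed from L1$.
ONCHIP_OVER_OFFCHIP = 8   # proxy GPU: L1$ bandwidth is 8x VRAM bandwidth
WAVE_LANES = 64

def effective_fed_lanes(active_lanes=WAVE_LANES, miss_rate=1.0):
    """miss_rate: fraction of accesses missing in L1$ (0..1).
    Returns how many lanes the memory system can keep fed."""
    rate = (1.0 - miss_rate) + miss_rate / ONCHIP_OVER_OFFCHIP
    return active_lanes * rate

print(effective_fed_lanes(miss_rate=1.0))  # fully divergent: 8.0 lanes
print(effective_fed_lanes(miss_rate=0.0))  # fully cached: 64.0 lanes
```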

While just a synthetic thought experiment, this outlines a core reason why lane refilling can be a lost cause. Always check if bandwidth bound before focusing on lane occupancy.

Magnitude of Accesses Can Be a Limiter
Let's say a workload needs to traverse a tree of 20 levels. That often means at least 20 cache lines touched, and even if the majority hit in the cache, the bandwidth reduction for a hit is only a factor of 8. So even if the ALU load were negligible, hitting in the cache won't make the cost of large amounts of addressed loads disappear.
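A quick worked version of that ballpark, assuming an L1$ hit costs 1/8 of a miss in bandwidth terms (per the 8x figure above; the cost model itself is my assumption):

```python
# Cost of an N-level tree walk in VRAM-line-equivalents, where an L1$ hit
# still consumes 1/8 of a miss's worth of bandwidth (the proxy 8x ratio).
HIT_COST = 1 / 8

def traversal_cost(levels=20, hit_rate=0.9):
    misses = levels * (1.0 - hit_rate)
    hits = levels * hit_rate
    return misses + hits * HIT_COST

print(traversal_cost(hit_rate=1.0))  # all hits: still 2.5 line-equivalents
print(traversal_cost(hit_rate=0.0))  # all misses: 20.0
```

Even a perfect hit rate only buys the 8x, leaving the 20-level walk well above the ~1 line a direct mapping would cost.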

This is a motivation for avoiding log-time searching and adopting faster near-direct domain mappings.

Vector Register Access is the Highest Source of On-Chip Bandwidth
Difference between peak off-chip and VGPR source bandwidth is 96x.
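One way the 96x can fall out of per-part rates. Every figure here (CU count, clock, 3 operand reads per lane) is an assumption about the proxy GPU, not stated in the post:

```python
# Assumed proxy-GPU figures (Fury-like, power-of-2 clock for easy maths):
CUS = 64
LANES_PER_CU = 64        # one 64-wide wave's worth of ALU lanes per clock
OPERAND_READS = 3        # eg an FMA sources 3 VGPR operands per lane
VGPR_BYTES = 4           # 32-bit registers
CLOCK = 2 ** 30          # "1 GHz" rounded to a power of 2

vgpr_bw = CUS * LANES_PER_CU * OPERAND_READS * VGPR_BYTES * CLOCK
offchip_bw = 512 * 2 ** 30   # 512 GiB/s VRAM = 8 GiLines/s of 64 B lines

print(vgpr_bw // offchip_bw)  # 96
```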

At the point of being TEX addressing bound (ie the 16 clock/CU/VMEM op limit for images), the next transform requires in-register reuse.

This is another motivating factor for semi-persistent and persistent waves, as unrolling enables register reuse.

Peak Cached Tree Depths
Working with a base assumption of one cacheline per node in a tree.

At best roughly 7 levels of that tree fit in L1$, which is only good for 128 lanes if all lanes end up on a unique node. A CU typically needs 1024 lanes active to hide latency for non-unrolled workloads.
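Where "roughly 7 levels" comes from, assuming a GCN-like 16 KiB L1$ with 64 B lines and one node per line (the cache size is my assumption, not given above):

```python
L1_BYTES = 16 * 1024    # assumed GCN-like 16 KiB L1$ per CU
LINE_BYTES = 64
LINES_IN_L1 = L1_BYTES // LINE_BYTES   # 256 lines

def lines_for_levels(levels):
    """Nodes (one line each) in a full binary tree of the given depth."""
    return (1 << levels) - 1

print(LINES_IN_L1)            # 256
print(lines_for_levels(7))    # 127: half the cache
print(lines_for_levels(8))    # 255: essentially the whole cache
```

7 levels leave half the cache for everything else sharing the CU; an 8th level would evict essentially all of it.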

Hints at why it is important to test against the highest-detail dataset when prototyping: small-dataset under-sampling might not realistically stress the cache.

Getting to a unique node per lane in a fully coherent traversal means 64 lines for just the leaf nodes, eating a quarter of the L1$. Note that something like 8 such waves are often sharing the L1$. A lane requires 16 VGPRs and 4 128-bit fetches to grab one full cache line, making this kind of access highly dependent on back-to-back issue of those 4 VMEM instructions from the same wave. For a single wave to traverse the last 6 levels of a binary tree down to unique detail means 127 lines.
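The per-wave arithmetic in that paragraph, spelled out (64 B lines, 32-bit VGPRs, and a 256-line L1$ assumed):

```python
LANES = 64
LINE_BYTES = 64
VGPR_BYTES = 4           # one 32-bit register per lane
FETCH_BITS = 128         # widest per-lane vector fetch

leaf_lines = LANES                                # 64 unique leaf lines
l1_fraction = leaf_lines / 256                    # of a 256-line L1$

vgprs_per_line = LINE_BYTES // VGPR_BYTES         # 16 VGPRs to hold a line
fetches_per_line = LINE_BYTES * 8 // FETCH_BITS   # 4 x 128-bit fetches

# Last 6 levels of a binary tree down to a unique node per lane:
# 2 + 4 + ... + 64 nodes plus their common ancestor = 127 lines.
lines_last_6_levels = (1 << 7) - 1

print(l1_fraction, vgprs_per_line, fetches_per_line, lines_last_6_levels)
```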

L1$ is small, thus at some point L1$ reuse is mostly between lanes of a wave.

Continuing the thoughts of getting to a unique node per lane. One line miss per wave is a 63/64 = 98.4% cache hit rate (ie all lanes access the same data). Ignoring L2$ effects, one can ballpark 127+N lines streamed from VRAM for a traversal with N coherent levels followed by the 6-level fan-out to unique nodes. The change between a small 10-level coherent portion (small dataset) and a 24-level one (floating-point-precision-limited dataset) might only be around 9% of lines.
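That ballpark as a formula, treating N as the number of fully coherent levels before the 6-level fan-out (my reading of the 127+N estimate):

```python
def lines_streamed(coherent_levels):
    """Lines from VRAM per wave: one line per fully coherent level, plus
    127 lines for the 6-level fan-out to a unique node per lane."""
    return 127 + coherent_levels

small = lines_streamed(10)   # small dataset
big = lines_streamed(24)     # float-precision-limited dataset
print(small, big)                        # 137 151
print(round((big - small) / big * 100))  # 9 (percent more lines)
```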

Again the majority of bandwidth is getting to the unique data per leaf.

Missing LOD and/or Per-Lane Detail Is Perf Death For Log-Time Searching
Continuing ...

Using a lower bound of say 128 lines accessed per coherent traversal to unique data (tiny dataset), this machine could do at best 64 Mi traversals per second (8 GiLines/s / 128). If fetch limited, the machine tops out at around 10 Gi traversals per second (256 GiFetch/s / 24) for a 24-level traversal.
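The two bounds as arithmetic, using the proxy GPU's peak rates:

```python
GI = 1 << 30
MI = 1 << 20

OFFCHIP_LINE_RATE = 8 * GI   # 8 GiLines/s peak from VRAM
FETCH_RATE = 256 * GI        # 256 GiFetch/s peak

bandwidth_bound = OFFCHIP_LINE_RATE / 128   # 128 lines per coherent traversal
fetch_bound = FETCH_RATE / 24               # one fetch per level, 24 levels

print(bandwidth_bound / MI)  # 64.0 Mi traversals/sec
print(fetch_bound / GI)      # ~10.7 Gi traversals/sec
```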

This is the point at which a bunch of things should get quite clear, taking this silly example and relating it to something real. Coherent RT-style traversals on similar machines tend to do better than 64 Mi traversals/sec, but come in at under 1/10 of that 10 Gi traversal fantasy number. Hints that bandwidth is the core limiter in those kinds of cases. Also hints at why RT traversal tends to benchmark quite poorly as a function of level-of-detail: even coherent RT traversal time tanks unless the scene was simple enough that the majority of rays terminate on similar leaf nodes. Which means quite literally: GPU RT for even coherent rays falls off the performance cliff if anyone attempts to get to per-pixel geometric triangle detail.

Lane Regrouping For Data Locality Given Incoherent Traversal?
Little point even discussing this given coherent traversal is doomed for interesting amounts of detail, but going to dive in anyway.

If using VRAM to bucket lanes for regrouping (append state to buckets, then launch when buckets are wave sized), one can ballpark at least two lines per lane of regrouping overhead, not including the atomics to get positions to append to. So regrouping overheads of at least 128 lines per wave. Which in theory might halve the traversal rate, not including the further reduction from any loss of coherency of lanes in the final buckets.
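Sketch of that overhead ballpark (two lines per lane and 64-lane waves; the 128-line traversal lower bound is the one used in the previous section):

```python
LANES = 64
REGROUP_LINES_PER_LANE = 2    # append state out + read it back in (ballpark)

regroup_lines = LANES * REGROUP_LINES_PER_LANE   # 128 lines per wave
traversal_lines = 128         # lower-bound lines per coherent traversal

# Regrouping roughly doubles line traffic, so roughly halves traversal rate:
slowdown = (traversal_lines + regroup_lines) / traversal_lines
print(regroup_lines, slowdown)  # 128 2.0
```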

IMO before diving into technology it is important to think through the data flow, specifically how the problem maps to memory access patterns. Often very rough estimations are good enough to gain insight, which can translate into a "go / no-go" gauge of whether a given solution space can ultimately yield acceptable performance.