20180115 - Acceleration and Technical Debt


Technical Debt Accelerator
Objectively speaking, the primary goal of hardware in the current age is to accelerate technical debt. Meaning developers target a market which typically includes half-decade-old hardware (or older), and software takes years to change. The most successful hardware has been hardware which runs legacy workloads better. This has a side effect: hardware is not necessarily designed to evolve toward the most efficient way to do computation given a clean-room design. Nor is software. Instead we have a strange feedback loop where software is incentivized to avoid redesign (minimizing initial cost) and instead depend on faster hardware, hardware is then optimized to better run the poor software, and the cycle repeats.

This worked great in the prior era of non-power-limited hardware and rapid process shrinks. Moving forward, we are falling into a local minimum. Engineering thought processes also have a lot of inertia. We often argue that while something is not real-time today, it will be in 5-10 years. However, if the trend lines don't support that conclusion, it might be a good time to adjust our thinking.

Local Blip in the Area Limit of Hardware Design
NVIDIA pushed beyond the prior typical limit of ~600 mm^2 to a ~800 mm^2 chip. Given that yield drops (and thus cost rises) as area grows, one can only speculate this chip is expensive to produce. Analysis of Gflops/$ and scaling shows diminishing returns. NVIDIA's latest Titan V (Volta) compared to the prior Titan Xp (Pascal) in raw specs from Wikipedia,

1.73x area (12 nm vs 16 nm)
1.76x transistors
1.19x bandwidth
1.14x 32-bit flops

Titan V has 4.1 Gflops/$ while Titan Xp had 9.0 Gflops/$.

For comparison, AMD's latest Vega 64 LC (Vega) compared to prior Fury X (Fiji) in raw specs from Wikipedia,

0.82x area (14 nm vs 28 nm)
1.40x transistors
0.94x bandwidth
1.60x 32-bit flops (using peak clocks)

Vega 64 LC has 19.7 Gflops/$ while Fury X had 13.3 Gflops/$.
AMD RX 580 (Polaris) in 4 GB form tops out at 31.0 Gflops/$ (almost 8x the flops/$ of the Titan V).
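
As a sanity check on those flops/$ numbers, here is a minimal sketch of the arithmetic. Note the prices and base clocks below are my own assumptions (launch MSRPs and Wikipedia base clocks), not figures from the comparisons above,

  // Peak FP32 rate = 2 ops/clock (FMA) * ALU count * clock in GHz.
  // Prices and clocks below are assumptions for illustration only.
  #include <stdio.h>
  static double gflopsPerDollar(double alus, double ghz, double dollars) {
    return (2.0 * alus * ghz) / dollars;
  }
  int main(void) {
    printf("Titan V  %.1f Gflops/$\n", gflopsPerDollar(5120.0, 1.200, 2999.0)); // ~4.1
    printf("Titan Xp %.1f Gflops/$\n", gflopsPerDollar(3840.0, 1.405, 1200.0)); // ~9.0
    return 0;
  }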

Traditional gaming performance has not been scaling with the ALU increase, so leveraging all the free ALU capacity is one giant opportunity to increase performance. Note that even when a problem is mostly memory bound, i.e. when load request queues periodically drain, Amdahl's law ultimately applies: it is still important to scale ALU performance to get the load arguments computed as fast as possible, though this works at diminishing returns, approaching the limit established by the serial parts.
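
A minimal sketch of that Amdahl's law limit, with a made-up 10% serial fraction standing in for the part that ALU scaling cannot touch,

  // Amdahl's law: speedup = 1 / (serial + (1 - serial) / n).
  // The 10% serial fraction here is a made-up number for illustration.
  #include <stdio.h>
  static double amdahl(double serial, double n) {
    return 1.0 / (serial + (1.0 - serial) / n);
  }
  int main(void) {
    // 8x the ALUs buys well under 8x the speedup.
    printf("8x ALUs, 10%% serial: %.2fx\n", amdahl(0.1, 8.0)); // ~4.71x
    return 0;
  }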

The Assembly Line
The most important "invention" in the modern age, efficient manufacturing, reduces to one simple concept: pre-schedule the supply chain to the location of work, so that work/time can be maximized.

Parallel accelerators like GPUs are still solving problems in the "pre-assembly-line" era. In this context the "supply chain" is the data, and the "work" is ALU. GPUs have accumulated over a decade of technical debt solving problems where the "worker" has to figure out what "he/she" needs at every step, and then "ask" (aka fetch) for the next blocking item: the great serializer. GPUs are designed around accelerating the technical debt of hiding the latency of this serializer, specifically via giant register file SRAMs (aka the parking lot) and now larger and larger caches.
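
To make the serializer concrete, a sketch in C of the two access patterns (hypothetical functions, not any real API). The first stalls for full memory latency on every step because each address depends on the prior result; the second has all addresses known up front, so the fetches can be pre-scheduled,

  // The great serializer: the address of the next fetch depends on the
  // result of the prior fetch, so each step blocks on memory latency.
  int chase(const int* mem, int node, int steps) {
    for (int i = 0; i < steps; ++i) node = mem[node];
    return node;
  }
  // Pre-schedulable: all addresses are known up front, so loads can be
  // issued back to back and latency overlaps with ALU work.
  int stream(const int* mem, int base, int steps) {
    int sum = 0;
    for (int i = 0; i < steps; ++i) sum += mem[base + i];
    return sum;
  }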

Switching to pre-scheduling and optimized data transfer involves moving to a store/scatter/message-passing design, which in hardware would look like interleaved SRAM/ALU nodes on some kind of network, with "loads" limited to just reading from tiny ALU-local SRAMs. The important "feature" of this architecture would be removing all the serialization typically involved in store-based designs. Hardware which only supports store {data} to {remote address} would likely be too limiting. Much better to have send {data} to {remote peer}, where the peer decides where to place the data in its local SRAM. That logic could perhaps even filter the message. The network could also support localized broadcast, meaning sending one message to multiple nodes based on some kind of mask (in contrast, typical parallel machines like GPUs duplicate load requests for each node fetching the same data). Unfortunately there is no way to get to this kind of more efficient hardware, because it would *not* be efficient at running legacy software.
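
For illustration only, here is a hypothetical sketch of what the software-visible side of such a machine might look like. Every name below is invented; no shipping hardware exposes this interface,

  #include <stdint.h>
  // Hypothetical message: a tag plus a small payload.
  typedef struct { uint32_t tag; uint32_t payload[4]; } Msg;
  // Fire-and-forget send: no reply, so nothing for the sender to stall on.
  void nodeSend(uint32_t peer, const Msg* m);
  // Localized broadcast: one message delivered to all nodes matching the
  // mask, instead of N duplicate load requests for the same shared data.
  void nodeBroadcast(uint32_t peerMask, const Msg* m);
  // Receiver-side handler: the peer decides where the payload lands in
  // its tiny local SRAM, and can filter (drop) the message entirely.
  typedef void (*MsgHandler)(const Msg* m, uint32_t* localSram);
  void nodeSetHandler(MsgHandler handler);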

Random Interesting Stuff on the Interwebs

After-Action Review: An Overview of the Making of LuftrauserZ - C64 port. Labor of love, sold roughly 122 units in a few weeks.