20170319 - Beyond the GPU

Disclaimer: as is everything on my blog, these are my personal opinions and thoughts, which could be wrong, and might possibly be something I later disagree with after learning something new or moving on to a better idea...

That being said, this post dives into what would be required to scale compute capacity beyond what is possible on GPUs, without changing the manufacturing process.

Motivating Factors
Referencing the graph from the 1985-2017 Industry Retrospective below: reset the {os,api} complexity to the 1985 level; continue the {ops} scaling with a giant bump, even though process {freq,mm2,tdp} has hit its practical limits and process shrink {nm} is slowing, approaching its asymptotic limit.

Improvements rest on the premise that hardware can be vastly simplified. Transistors per lane provide a good indication of hardware complexity. Using the Fury Nano as an example,

4096 operations/clock
4096 bytes/clock (through vector L1$)

If the chip is divided into 4096 pieces, the chip has around 2.17 million transistors/piece.
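As a quick sanity check on that division (assuming the commonly quoted ~8.9 billion transistor count for the Fury Nano's Fiji die):

```python
transistors = 8.9e9     # assumed Fiji transistor count
lanes = 4096            # operations/clock
per_lane = transistors / lanes
print(round(per_lane / 1e6, 2))   # ~2.17 million transistors per piece
```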

To place complexity into context for 32-bit processors, according to Wikipedia,

1.180 million transistors - 80486 (1989, has tiny L1 cache and floating point unit)
0.275 million transistors - 80386 (1985)
0.025 million transistors - ARM 1 (1985)

Motivating Paper
How to build a Megacore microprocessor by Andreas Olofsson. 53.3% of total die area at 117 mm2 on 16FF+ provides 64 MiB SRAM, or roughly 1 MiB/mm2. Continuing the paper's comparison to NVIDIA's P100 (also 16FF+): scale the design to match the P100 die size of 610 mm2 (a factor of 5.2x), assume the Custom logic design slide is possible (a factor of 2x), and it seems possible to hit 512 MiB of on-chip SRAM in the P100 package using somewhere between 42% and 84% of the area (depending on how much custom logic helps), assuming the rest of the chip complexity is minimized. Andreas's presentation pushes to the extreme of using 90% of on-chip area for SRAM.

NVIDIA P100 ............. 3840 lanes ..... 18 MiB+ (4 MiB L2, 14 MiB Registers, ? L1, ? shared)
Same die, but better ... 65536 lanes .... 512 MiB

Rounding the P100 up to 32 MiB of on-chip storage, the "Same die, but better" might have 16x the program-accessible on-chip memory.
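The area arithmetic can be sketched as a back-of-envelope calculation (using the paper's rounded ~1 MiB/mm2 figure and the assumed 2x gain from custom logic):

```python
density = 1.0          # MiB/mm^2, rounded from 64 MiB / (0.533 * 117 mm^2)
p100_mm2 = 610
print(round(p100_mm2 / 117, 1))         # ~5.2x die-size scaling
area = 512 / density                    # mm^2 needed for 512 MiB of SRAM
print(round(area / p100_mm2, 2))        # 0.84 -> ~84% of a P100-sized die
print(round(area / 2 / p100_mm2, 2))    # 0.42 -> ~42% if custom logic gives 2x
```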

Looking at memory per lane given possible numbers of lanes,

__4 KiLanes = 128 KiB/lane (half an SPU of memory per lane)
__8 KiLanes = _64 KiB/lane
_16 KiLanes = _32 KiB/lane
_32 KiLanes = _16 KiB/lane
_64 KiLanes = __8 KiB/lane (over 16x the lanes of the P100)
128 KiLanes = __4 KiB/lane
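The table above is just 512 MiB split evenly across the lane count; a quick sketch to reproduce it:

```python
total_kib = 512 * 1024                    # 512 MiB expressed in KiB
for ki_lanes in (4, 8, 16, 32, 64, 128):
    lanes = ki_lanes * 1024
    print(f"{ki_lanes:3d} KiLanes = {total_kib // lanes:3d} KiB/lane")
```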

The practical target seems like over an order of magnitude of scaling in both ALU and MEMORY density over current GPUs on the same die and process, by ripping out all the complexity found in the parallel processor, and producing something which is most definitely not going to run any previously written software.

Data Flow Constraints Provide the Necessary Processor Design
This is where I diverge from the thoughts expressed by Andreas: Parallella has a scatter-based 2D grid network, which I currently believe is likely too limiting for performance scaling. Uncoordinated data flow has poor latency guarantees, which limits useful peak bandwidth or greatly increases buffering requirements.

Working from the 8 KiB/lane, 65536 lanes/chip example, how does one organize and access on-chip memory on such a machine?

All the memory is divided among the compute lanes, and all overhead like {page tables, virtual memory, TLBs, and caches} has been fully stripped out. Effectively, 512 MiB of SRAM is divided into 65536 banks (memory is 65536-way interleaved). Each 32-bit word in a given lane is 1/65536th of a larger slice of memory, and there are 2 KiSlices (the 2048 addressable 32-bit words of local on-chip memory).
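One illustrative way to picture the interleaving (the mapping helper below is hypothetical, not from any real ISA): a flat 32-bit-word address splits into a slice index and a lane index, with consecutive words landing in consecutive lanes:

```python
LANES = 65536        # memory banks (one per lane)
SLICES = 2048        # 32-bit words of local memory per lane (8 KiB / 4 B)

def word_to_bank(word_addr):
    # 65536-way interleaving: consecutive word addresses land in
    # consecutive lanes; a full "slice" spans one word in every lane.
    assert word_addr < LANES * SLICES
    return divmod(word_addr, LANES)   # -> (slice, lane)

# Word 0 and word 1 sit in adjacent lanes of slice 0;
# word 65536 wraps around to lane 0 of slice 1.
print(word_to_bank(0), word_to_bank(1), word_to_bank(65536))
```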

Any gather-based parallel machine of this density will require a parallel hierarchical process to de-interleave memory and select individual words of memory.

Any memory used for buffering (partially de-interleaved data) eats directly into the amount of memory available to store de-interleaved local and interleaved global data. Minimizing buffering is thus a direct design requirement. And thus fully synchronized data flow is a direct design requirement. And thus we are talking about SIMD extended to the full chip, all the way to controlling on-chip network routing for "memory".

The machine thus looks like a giant wave with some form of hierarchical cross-lane operations for on-chip network routing, where the cross-lane operations are driven in parallel with computation.

This is an awesome place to be design-wise, because it has been well explored and understood in the public domain since 1980s super-computing.

Possible High-Level Design

Hierarchical Lane Organization
Focusing on the core architecture of the on-chip "network". For an N-lane machine it would be nice to have full log2(N)-dimensional hypercube network connectivity, except that is not practical. Instead one can go for a lower-dimensional machine where the lanes in a given dimension are connected via a simple low-cost ring. For example, a 256-lane machine could be an 8D hypercube using 256*8=2048 connections, but could be simplified to a 2D machine where each lane joins two 16-lane rings (one per dimension), using 256*2=512 connections, as shown below.

76543210 - lane index
....xxxx - 1st dimension
xxxx.... - 2nd dimension
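A minimal sketch of pulling the per-dimension ring coordinates out of the lane index bit fields above (the helper name is illustrative):

```python
# Split an 8-bit lane index into its two 4-bit ring coordinates:
# low nibble = position in the 1st-dimension ring,
# high nibble = position in the 2nd-dimension ring.
def lane_coords(lane):
    return lane & 0xF, (lane >> 4) & 0xF   # (x, y)

print(lane_coords(0x00))   # (0, 0)
print(lane_coords(0xA3))   # (3, 10)
```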

Each dimension has a separate directional (one-way) ring interconnect network. This is minimal complexity, and the networks for each dimension work in parallel. The lower-dimensional interconnects will be fully lane-to-lane; this ultimately won't hold after some dimension due to wire and pin constraints, so some variable amount of serialization will happen. For massive multi-chip machines, the higher-dimensional interconnects will likely venture off-chip to a chip stack or interposer. Eventually, at some dimension, all machines can connect to an off-board network.

Below is a logical visualization of a simple 2D 16-lane machine with two sets of 4-lane rings. The lowest-dimension interconnect is drawn out in full; for the 2nd dimension only the {ABCD} to {EFGH} paths are drawn (next would be {EFGH} to {IJKL}, and so on). The physical layout would look quite different, with no long return path (like a folded torus).

|  +---------|--+    (first hop of 2nd dimension's ring)
|  |  +------|--|--+
|  |  |  +---|--|--|--+
|  |  |  |   |  |  |  |
|  |  |  |   v  v  v  v
A->B->C->D   E->F->G->H   I->J->K->L   M->N->O->P 
^        |   ^        |   ^        |   ^        |
|        |   |        |   |        |   |        |
+--------+   +--------+   +--------+   +--------+

Vector Gather Via Parallel Coordinated De-Interleaving
Continuing with the example of a 2D machine, and showing the process of coordinating access to non-local memory via a DMA engine. The program accesses global memory via [address][y][x], where [address] is the local memory address and is machine-dynamically-uniform in this example. Think of [address] as selecting a tile of memory on the machine, [y] as providing the cache line (row), and [x] as providing the index of the word to read out of a line (column). In this context [y] must be dynamically uniform across a row of the machine, and [x] is fully non-uniform.

Given an MxM machine, loads in this context can be coordinated as described below. Each lane only has the ability to load from its own tiny local memory. Non-local access is coordinated by the push-only network, which only needs to transfer the data and not the addresses (because the [y][x] address of the data on the network is fully parametric and can be derived from the local lane index and the clock of the operation).

(1.) Each core fetches its local value at [address].
(2.) That value is placed on the [y] dimension network.
(3.) The network rolls the fetched line in the Y direction M times.
(4.) When a row of lanes on the machine sees its desired [y] line, it copies the value off the network.
(5.) After these M steps each row on the machine has the desired line.
(6.) The network then rolls the lines on the [x] dimension network, in the X direction M times.
(7.) After these M steps each lane in the row has a local copy of the full line.
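The seven steps above can be sketched as a tiny software simulation of an MxM machine (all names, the M=4 size, and the roll direction are illustrative assumptions):

```python
M = 4
# data[y][x]: the word stored at [address] in lane (y, x);
# row y of this tile is "line" y.
data = [[y * M + x for x in range(M)] for y in range(M)]
want = [2, 0, 3, 1]   # desired line index per machine row (uniform per row)

# Steps 1-5: roll the fetched values along the Y rings; each row copies
# the passing word when its desired line visits.
held = [[None] * M for _ in range(M)]
for step in range(M):
    for y in range(M):
        visiting = (y + step) % M      # line index passing row y this clock
        if visiting == want[y]:
            for x in range(M):         # every lane of the row grabs its word
                held[y][x] = data[visiting][x]

# Steps 6-7: roll the held words along the X rings so each lane
# collects a local copy of the full line.
lines = [[[None] * M for _ in range(M)] for _ in range(M)]
for step in range(M):
    for y in range(M):
        for x in range(M):
            src = (x + step) % M       # whose word is visiting lane (y, x)
            lines[y][x][src] = held[y][src]

assert all(lines[y][x] == data[want[y]] for y in range(M) for x in range(M))
print("every lane holds its row's desired line")
```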

In this simplified example,

(A.) Line load latency is roughly 2*M clocks.
(B.) M*M lanes do computation in parallel on the machine.
(C.) Due to separated networks, 2 requests can be active in parallel.
(D.) Line load request throughput is roughly M clocks per request.

The machine in theory might be able to fill line loads at a rate which approaches being single-write-port limited when storing the lines into the local SRAMs. Bringing this back to the prior 65536-lane machine example, that is maybe around 4 bytes * 64Ki ports * 1 GHz = around 262144 GB/s of localized on-chip vector line gather. A 4D machine with 16-lane rings/dimension might have roughly 64 cycles of latency per line request, and could consume 4 requests in parallel.
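Plugging in the numbers (treating the 65536-lane case as a 2D 256x256 machine for the latency figures; all values follow the text's rough arithmetic):

```python
M = 256                       # 65536 lanes arranged as a 2D 256 x 256 machine
lanes = M * M
line_latency = 2 * M          # clocks: M rolls in Y, then M rolls in X
request_rate = M              # clocks/request with 2 requests in flight
gather_bytes_per_clock = 4 * lanes   # one 4-byte write port per lane
print(lanes, line_latency, request_rate)   # 65536 512 256
print(gather_bytes_per_clock)              # 262144 (~262144 GB/s at 1 GHz)
# 4D variant with 16-lane rings: one 16-step roll per dimension.
print(4 * 16)                 # 64 cycles of latency per line request
```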

Other Aspects
Depending on the final machine design, there are ways to reduce latency by trading for temporary local data duplication and reuse. For instance, working from the 4D/16-ring example of [address][w][z][y][x], de-interleave and duplicate [address][w] into 16 slots of local memory. Future memory accesses sharing [w] can then start directly from just [z][y][x] (a 25% reduction in latency). This can be applied at various scales on the machine, and it represents an explicit analog to hierarchical caches, but with the added advantage of efficient broadcast at various levels of the hierarchy.
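The saving in the 4D/16-ring example works out as follows (illustrative arithmetic):

```python
ring = 16                 # lanes per ring in the 4D example
full = 4 * ring           # roll all four dimensions: [w][z][y][x]
after_w = 3 * ring        # [w] pre-duplicated locally: only [z][y][x] remain
print(full, after_w, 1 - after_w / full)   # 64 48 0.25 (25% less latency)
```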

It might also be possible to dynamically split the multi-dimensional lane grid into subsets so that different programs could run in parallel.

Lots of ideas to explore in this space. Given how simple the architecture could be, it might be possible to build a 1/64 scale (1024 lane) machine in a $1000 FPGA dev board...