20160423 - Parallel Machine Thoughts

Been very slowly working up towards learning everything required to build my own highly parallel computer in an FPGA. This process has in some ways been on-going for years, getting comfortable building languages and working on little OS prototypes, etc. On the way taking inspriation from other people's projects like the JayStation and Linus Åkesson's Parallelogram, and realizing that not-too-expensive FPGAs have reached the performance level where is it possible to simulate vintage arcade machines like the NeoGeo. This post represents some of the higher level thinking on the current design direction which is highly motivated by the CM-1.

Fusing MEM and ALU
One traditional design construct which I'm intending to toss out is the macro-level separation of memory and ALU. The FPGA I have in mind currently is the Digilent Nexys Video Artix-7 FPGA Trainer Board which has the largest 7 series Xilinx chip which can be used with the free version of Vivado. The Artix-7 has 13-mega-bits of on-chip block ram. To place this into context, the 2nd Amiga, the A2000, started with 8-mega-bits, and a little over 7 MHz CPU. Effectively I'm looking at carving out the block rams across a large number of tiny processors connected by an on-chip network, treating the external DDR3 more like volatile "storage" instead of "memory". I'm intending to prototype (or just design if I fail) something for fun and learning (something I could build a little arcade game on), but if made into actual hardware could hopefully be scaled up to stacked chips like HBM but instead of being composed of just memory, would have arrays of paired eDRAM and ALU cells.

Contrasting with GPUs
The majority of on-chip memory in a GPU is effectively a parking lot for the state of parallel jobs waiting on off-chip memory access, or high latency hits into on-chip caches. As a direct result of this, one of the software programming model constructs a GPU effectively deprecates, is the usage of the large CPU stacks. Likewise deprecated, the model of writing large software libraries invoked with "call and return" traversing huge library dependency chains.

However the GPU is still built around the data analog to "call and return", the "fetch memory" construct (the VMEM load/sample opcodes in GCN for example). The important property of the "fetch memory" construct is that it is a blocking operation.

Designing Around All Non-Blocking Constructs
GPU is designed around vector gather (aka the texture fetch), with only limited ability to scatter, either locally into a tiny memory like LDS divided across workgroups, or globally into L2, but no ability to directly scatter into the largest on-chip memory: the register file (aka the GPU's parking lot). One of the best evolutionally changes between the CPU and GPU was moving the ALU for global atomics from the shader cores to the other side of the crossbar close to the L2. I'd like to continue this evolution to the next logical step.

An inverse of the gather machine is a machine built around scatter, a machine built around non-blocking "fire-and-forget". Conceptionally the tiny processors become "pinned jobs" which are persistently running instead of mostly parked, and the "scatter" is on-chip parallel message routing between the register files of these tiny pinned jobs.

Designing Around Cross-Lane Operations
Looking to repurpose the area of the traditional GPU memory sub-system, like caches and texture units, along with the ports used in the register file for load/store, instead for what is effectively a second functional unit which specializes in doing cross-lane operations in parallel with the ALU on a giant SIMD machine. These cross-lane operations become the "scatter" or parallel message routing network which I'm going to call the "NPU" (short for Network Processing Unit). This NPU would be able to do things like broadcast (on a local to global scale), sorting (when paired with ALU operations), scatter, etc. NPU's purpose is to re-organize data for the needs of computation in parallel with computation.

Transposed ALU Design
The ALU I have in mind is very similar to the CM-1: an ultra-wide SIMD machine with a bit addressed register file and a very simple ALU. Where conventional SIMD processor opcodes use an immediate register index to address and fetch something like 1 32-bit word per lane, the design I have in mind instead would fetch 1-bit for 32 words instead (using the same register index). The opcode register index is thus a bit index (shifts are amortized into the cost of the bit addressed register file), and the ALU, instead of retiring one 32-bit result per op, retires one bit for 32 words. This introduces the ability to do varible bit-width operations, but greatly increases the wall clock time it takes to compute anything (no longer optimizing for latency here because there are no non-register file memory reads).

In order to make it possible to compute multiplies faster, I'm looking at an ALU design which includes a pipelined tree reduction, with a mux at the end to select which element of the tree to return to the register file (implemented in block rams). Very rough diagram below for a 4-bit tree. Each [_] is an adder, and with carry and control logic not drawn. The "y" input is transfered across the tree with a one clock delay. Each of the [a] through [d] adders have an input which feeds back to itself for the next cycle, with a control input which can selectively override with the new "x" input to the module.
bit in x --->o-----o-----o-----o
             |     |     |     |
bit in y -->[a]-->[b]-->[c]-->[d]
             |\   /       \   /
             | [e]         [f]
             |  | \       / 
             |  |  \     /  
             |  |    [g]    
             |  |     |     
bit out z <--o--o-----o
The number of leaves of the tree acts as some multiplier on the throughput of multiply, while having side effects on the critical path timing, and increasing area, etc. I'm still exporing what other uses this tree construct could have to see what opcode control signals I should support to reconfigure.

Bit-Banked Register File
The FPGA's block rams are dual port, and I need 3 ports for the ALU, and more for the NPU. To work around this problem, I'm looking at banking the register file at the bit address level. Multi-precision ALU ops and multi-bit message passing (NPU) will stream through a series of addresses, making it easy to have streams start at different addresses modulo the banding multiplier, in theory at least. This adds and associated complexity to the compiler for code generation, or the human, if hand assembling code. Not sure yet if this idea is going to work out.

Finding this to be the most challenging part of the project, producing a design which is not too large in comparison with the ALU, etc. I'm working from the idea of using the same CM-1 style hypercube network, but not sure about edge node design (CM-1 has 16 ALUs at each edge). Currently doubt I want anything like a full 16-lane crossbar in an edge.

From a high level view, messages are stored in the register file, and the NPU can conditionally swap two SIMD lane's messages, with the ability to conditionally write the data payload of the swapped message instead to a second area of register file for a message which reached the destination (this way the NPU can continue to use the message scratch area for routing other messages). Lane connectivity is limited by whatever fixed network layout I end up going with. Decided on using the register file as the message buffering area to allow the "program" to control the full routing algorithm (instead of having say a fixed petit cycle as in the CM-1 and a separate memory, etc). This also effectively limits the design to message swaps (like a parallel sorting network), but enables the possibility of doing broadcast and other things requiring message duplication.

Messages are composed of {message enable bit, variable bit relative lane address, variable bit payload}. Some message passing won't need the relative lane address (fixed network flow instead of data driven). Also parts of the message could be read out of order. For instance to build up the NPU register which enables the message transfer to actually write to the lane's register file, for routing like in the CM-1, the bit representing the active edge of the petit cycle would be read first.

The NPU design I'm looking at has one read and one write port into the register file, if message throughput is a bit per clock per 1-bit lane in the SIMD machine. The alternative is to drive messages at half rate and use the one FPGA block ram port which can be read or write. This has some appeal because then only 4 ports are needed across the ALU and NPU combined. And lastly I'm thinking about just merging ALU and NPU, so that network inputs can directly be used as ALU input.

On Paper
Current work involves building towards a fixed ALU+NPU logic design which gets driven by SIMD control fan-out. Once I'm happy with the ALU+NPU design, I'm back to verilog to see how many of these I can place in the FPGA. Then I need to think more about building the "sequencer" or CPU-style scalar unit which drives the control lines for the massive SIMD machine.

Wrestled about actually blogging about this for a long time, because like anything not tied to the day job with a real deadline, I'm likely to scrap everything and do something else. And this is my first attempt at somewhat serious hardware design, which is certain to contain horrible fail moments, which hopefully others can enjoy at my expense. On the other hand, unlike dealing with software, this project has brought back the sense of wonder I had as a kid which otherwise since has transformed into sarcasm after dealing with how the industry as since evolved.