20160626 - FPGA Processor Rethinking : Part 2

Continued from prior post...

Why Not MIMD
Jumping to the conclusion first, I still wouldn't adopt a MIMD architecture for a highly parallel machine. This post is going to attempt to describe why, and outline the rough design that I've been pursuing for a SIMD machine.

INSTRUCTION ISSUES - With MIMD the amount of instruction RAM per core would be tiny. GRVI in the 8 CPU cluster configuration from the GRVI Phalanx Paper has just 4KB per core. Ideally instructions need to be small, with an ISA with good code density (like Forth, like the J1), but ultimately this will drop ALU IPC. The J1 Paper provides an example: 35% instructions are ALU. Rather be able to issue loads and stores and DSP ALU (d=a*b+c) all in the same cycle, which requires VLIW. SIMD amortizes the large instructions, with a broadcast of the decoded control signals, or a hierarchical instruction decode across the fan out (based on whatever is optimal). I'd like to also have parallel ability to pack and unpack sub-words (to maintain good information density in the core's local scratch RAM), as well as broadcast uniform constants.

ROUTING ISSUES - Or rather timing issues. The Hoplite Paper outlines one of the of core problems. Routing the communication of random MIMD jobs requires throttling message injection to avoid saturating the network, which would otherwise result in some drastically horrible message latencies. Seems like Hoplite requires something like roughly 0.5/max(gridWidth,gridHeight) message injection rate max (seems like the router needs to remain roughly half empty to avoid latency cliff). Optimizing for machines with highly volatile data-dependent run-time performance can be a nightmare. SIMD ensures the work synchronization required for highly efficient communication, as well as makes it easy to implement a majority of the common forms of parallel communication.

Current Design Thoughts
An 18-bit/word massively wide SIMD machine driven by VLIW with the following per lane,

One 1024 entry x 18-bit Block RAM (2 ports both either Read or Write)
One 32 entry x 18-bit Register File (32x2Q SLICEM x 9)
One hyper-cube router

Rough machine limits/lane if targeting a 512 lane machine (using same FPGA from prior post),

11552 SLICEM / 512 = 22
22098 SLICEL / 512 = 43

To place this into context, each SLICEL has the ability to MUX four 4-bit values into one 4-bit value, given a 2-bit control signal. Each 18-bit word requires 4.5 SLICEs to do implement a 4:1 choice. That gives 43/4.5 or roughly 9 4:1 choices for the SLICELs for driving connections between {reg-file, BRAM, DSP, and router}. Not including using left over SLICEMs, and not including all the other misc things required. Likely impossible to hit a 512 lane target, won't know until I move on to implementing the cores.

BIT-PIPELINED? - Prior was thinking about doing a bit-pipelined machine (1-bit per lane, 16 lanes grouped). Each SLICE can be configured into 4 pairs of 5:1 LUTs each which can make a full adder. The bit-pipelined multiply would require 4 SLICEs per 16 1-bit lanes for each adder. Easy to burn the entire SLICEL budget on the 3 stage pipeline for just an N-bit x 8-bit multiplier (not including the control structures). And would need a huge amount of instructions to drive that machine. Also the routing required roughly 18x the switch registers (compared to current thinking). So tabled the idea for now.

ROUTER - An analogy for this "router" is to think of a railroad switching station, with a set of parallel tracks or "lanes" each connected by a set of switches which can exchange trains between two tracks, but with switches that require lanes to cross over each other (as in a bridge over tracks not involved in the exchange). The trains in this case are really small, ideally one word of information, but likely sub-words due to area (more on this later).

The general idea is to setup the switching station based on the data-driven needs of the communication, then allow the data payload to flow through the switch. For instance when sorting a packet per lane, for each of the log2(lanes) sorting sub-pass of a bitonic sort, the sorting key gets passed through the switching station, setting switches, then the data payload passes through at a cost proportional to size of the payload. The switching station takes out what would otherwise be log2(lanes/pow2(sub_pass)) passes per sub-pass for the payload transfer for the bitonic sort.

The series of connections are hyper-cube edges, which means if you take the binary value of the lane index, there is one connection per bit in that lane index. For example, a 512-lane SIMD machine has a 9-bit lane index, and thus 9 connections (9 edges in the 9 dimensional hyper-cube). In order to make a switching station which useful for basic parallel communication, the ordering in time of the connections matter greatly. Broadcast and other data replication/expansion algorithms need the switches ordered from LSB to MSB, while sorting and data merging algorithms need the switching ordered from MSB to LSB. Continuing the 512-lane example, each of the 9 pipeline stages of the switching station requires,

(1.) Output from the prior pipeline stage.
(2.) The associated LSB-first ordered edge from the prior pipeline stage.
(3.) The associated MSB-first ordered edge from the prior pipeline stage.
(4.) 2-switch control bits (select between 1,2,3 or some extra value).

For a 9 pipeline stage switching station, this would exceed the entire SLICEL budget. So likely going to, run the switch at half-word or lower granularity.

MOVING ON - Choosing VLIW ISA, and the exact structure for the cores is hard work. Validating the design has all the hardware features I'd like for solving problems is also challenging. Left off thinking about multi-precision math, and how to best manage the 3 cycle DSP latency.