20160507 - Parallel Project : Routing Thoughts


Some very quick thoughts on this in between being too busy to work on it... well, until this May 7th weekend!

Another interesting massively parallel CPU-on-FPGA project: GRVI Phalanx. That design is quite conventional: a bunch of small CPU cores connected by a directional 2D torus router. What I'm aiming to do is quite different. Instead of a conventional MIMD machine, I'm looking to get as many bit-pipelined SIMD lanes as possible on one single "thread" of execution on the chip (aiming for 1024 or more lanes). Instead of highly variable routing, routing in my project really means accelerating lock-step cross-lane communication patterns: ultimately, applying algorithms like {broadcast, flood-fill, reductions, shuffles, scans, sorting, scatter, etc.} at anywhere from sub-region through full SIMD width, since these are the core building blocks of parallel programs. The focus is on a bit-pipelined design, vs say a 32-bit word-pipelined design, because I'd rather scale up the number of lanes able to communicate by that factor of 32 instead.

From the hardware design perspective, I'm looking to connect each lane only to the lanes along the edges of a hypercube (similar to the CM-1). This way the per-lane connection count scales as log2(lanes). Reducing this to 16-wide SIMD (4D) for the sake of example, the connections would be as follows,

   0 1 2 3 4 5 6 7 8 9 a b c d e f  <-- lanes
 +--------------------------------
0| \ 0 0 . 0 . . . 0 . . . . . . .
1| 1 \ . 1 . 1 . . . 1 . . . . . . 
2| 2 . \ 2 . . 2 . . . 2 . . . . .  <-- connections for lane 2 (\ = self lane, 2 = connected lane)
3| . 3 3 \ . . . 3 . . . 3 . . . .
4| 4 . . . \ 4 4 . . . . . 4 . . .
5| . 5 . . 5 \ . 5 . . . . . 5 . .
6| . . 6 . 6 . \ 6 . . . . . . 6 .
7| . . . 7 . 7 7 \ . . . . . . . 7
8| 8 . . . . . . . \ 8 8 . 8 . . .
9| . 9 . . . . . . 9 \ . 9 . 9 . .
a| . . a . . . . . a . \ a . . a .
b| . . . b . . . . . b b \ . . . b
c| . . . . c . . . c . . . \ c c .
d| . . . . . d . . . d . . d \ . d
e| . . . . . . e . . . e . e . \ e
f| . . . . . . . f . . . f . f f \

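Since lane i's hypercube neighbors are just i XOR 2^d for each dimension d, the table above is easy to generate and sanity-check in software. A minimal Python sketch (the names are mine, not part of the design):

LANES = 16                     # 16-wide SIMD example (4D hypercube)
DIMS = LANES.bit_length() - 1  # log2(LANES) = 4

for lane in range(LANES):
    # Each lane connects only to the lanes differing in exactly one
    # address bit, so the connection count per lane is log2(lanes).
    neighbors = [lane ^ (1 << d) for d in range(DIMS)]
    print(f"lane {lane:x}: " + " ".join(f"{n:x}" for n in neighbors))

For example, lane 2 prints neighbors 3, 0, 6, and a, matching row 2 of the table.
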
I'm looking at building a fixed routing pipeline with a per-lane 1-bit register controlling the switch at each stage of the pipeline. The routing network either sources from the ALU or supplies an argument to the ALU. Each switch decides whether to exchange bits between a pair of lanes as a message flows through the routing pipeline, and each stage feeds the next stage a clock cycle later. The idea is that the algorithm first configures the switches in the network, then passes a message through the "configured" network without needing to load and store the complete message in any given lane except the destination lane. Continuing the 16-wide SIMD example, there are 4 bit-pipelined stages in the routing pipeline, in which each "o--o" connection is a point where the bits of a message can swap between lanes (a behavioral sketch follows the stage diagrams).

0 1 2 3 4 5 6 7 8 9 a b c d e f
-------------------------------
o---------------o..............  <-- stage 0
..o---------------o............
....o---------------o..........
......o---------------o........
........o---------------o......
..........o---------------o....
............o---------------o..
..............o---------------o

0 1 2 3 4 5 6 7 8 9 a b c d e f
-------------------------------
o-------o.......o-------o......  <-- stage 1
..o-------o.......o-------o....
....o-------o.......o-------o..
......o-------o.......o-------o

0 1 2 3 4 5 6 7 8 9 a b c d e f
-------------------------------
o---o...o---o...o---o...o---o..  <-- stage 2
..o---o...o---o...o---o...o---o

0 1 2 3 4 5 6 7 8 9 a b c d e f
-------------------------------
o-o.o-o.o-o.o-o.o-o.o-o.o-o.o-o  <-- stage 3

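Here is a minimal behavioral sketch in Python of the pipeline above (an untimed model, not the Verilog): each stage pairs lanes at distance 8, 4, 2, 1, and a per-lane switch bit selects whether that pair exchanges values. As a simplifying assumption I index each pair's switch bit by the lower lane of the pair; in hardware each stage would be pipeline registers fed a clock cycle later.

def route(lanes, switches):
    # lanes: 16 values; switches[stage][lane]: the 1-bit switch settings.
    n = len(lanes)
    data = list(lanes)
    dist = n // 2                 # stage 0 pairs lanes 8 apart
    stage = 0
    while dist >= 1:
        for lo in range(n):
            hi = lo + dist
            # (lo & dist) == 0 selects the lower lane of each pair.
            if hi < n and (lo & dist) == 0 and switches[stage][lo]:
                data[lo], data[hi] = data[hi], data[lo]
        dist //= 2
        stage += 1
    return data

# Setting every switch swaps at every stage, sending lane i's value to
# lane i XOR 0b1111, i.e. reversing the lane order:
print(route(list(range(16)), [[1] * 16 for _ in range(4)]))
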
The pipeline matches exactly what is required for the Bitonic Merge procedure on a hypercube network. A typical SIMD sort using lane shuffles takes log2(lanes) passes of small-to-large Bitonic Merges. By avoiding the need to shuffle full messages through each stage of the Bitonic Merge sub-passes, the aim is to provide a massive speed-up (less of the algorithm requires perf-limited load/store to the register file). Likewise, in order to make broadcast (and flood-fill) fast, I'm planning on also supporting the reverse of that pipeline. All other communication patterns build from these primitives.
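
To make that concrete: merging a bitonic sequence of 16 keys is exactly compare-exchange at distances 8, 4, 2, 1, the stage order of the pipeline above. In the real design the compare result would set the per-lane switch bit and the full (multi-bit) message would then stream through the configured network; this Python sketch just folds the two together:

def bitonic_merge(keys):
    data = list(keys)
    n = len(data)
    dist = n // 2
    while dist >= 1:
        for lo in range(n):
            hi = lo + dist
            # Compare-exchange; the compare is what would configure the
            # switch bit for the message pass in the actual hardware.
            if hi < n and (lo & dist) == 0 and data[lo] > data[hi]:
                data[lo], data[hi] = data[hi], data[lo]
        dist //= 2
    return data

# A bitonic input (ascending then descending) merges to sorted order:
print(bitonic_merge([1, 3, 4, 7, 9, 12, 14, 15, 13, 11, 10, 8, 6, 5, 2, 0]))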

High-Level View
Data is grouped into lanes, and algorithms use fixed, simple communication patterns to do computations on neighborhood data. Each lane then computes a new position in some space which represents spatial locality, followed by a full-machine sort and re-organization of the lanes. Repeat. In-place communication keeps memory/buffer requirements low, and fixed timing keeps optimization easy.
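
A hypothetical sketch of that outer loop, with stand-in functions invented purely for illustration (the real machine would implement each piece with the fixed routing patterns above):

def compute_with_neighbors(i, lanes):
    # Stand-in neighborhood computation: average with ring neighbors.
    return (lanes[i - 1] + lanes[i] + lanes[(i + 1) % len(lanes)]) / 3.0

def step(lanes):
    # Fixed communication patterns compute from neighborhood data...
    lanes = [compute_with_neighbors(i, lanes) for i in range(len(lanes))]
    # ...each lane's value doubles as its new spatial position here, and
    # a full-machine sort re-organizes the lanes by that position.
    return sorted(lanes)

print(step([5.0, 1.0, 4.0, 2.0, 7.0, 0.0, 6.0, 3.0]))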

Next Steps
I've got the high-level ALU and lane routing designs thought through enough to have started on the Verilog implementation, ultimately to test how many lanes I can synthesize in the target FPGA. More next time...