20160623 - FPGA Processor Rethinking


Have been very inspired by Jan Gray's GRVI Phalanx.


MIMD
Outside of work, I've been slowly working up a paper design for my own massively parallel FPGA-based computer, looking to close on something to actually build. Mostly been working on SIMD-based machines, without a serious focus on MIMD, until I read Jan's paper, which set me off in another direction: how would I build a MIMD machine in a Xilinx FPGA?

Basics
Looking at using a board with the fastest Artix-7. Collecting numbers,

33650 - CLB Slices
740 - DSP Slices
365 - 36 Kbit Block RAMs (each of which can be split into two 18 Kbit BRAMs)

If I could use all 740 DSP slices and maintain a 375 MHz clock (number borrowed from Jan's paper), that would be,

740 DSPs * 2 ops/clk * 375 MHz = 0.555 Tops/sec
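
As a quick sanity check, the peak number can be recomputed in a few lines of Python. This is only a sketch: the constant and function names are made up for illustration, and it assumes the 2 ops/clk counts the multiply and the add of one "d=a*b+c" as separate ops.

```python
# Peak DSP throughput from the figures quoted above.
DSP_SLICES = 740       # DSP slices on the part
OPS_PER_CLK = 2        # assumption: multiply + add of one d=a*b+c per clock
CLOCK_HZ = 375e6       # 375 MHz, borrowed from Jan's paper


def peak_ops_per_sec(dsps=DSP_SLICES, ops_per_clk=OPS_PER_CLK, clock_hz=CLOCK_HZ):
    """Upper bound: every DSP issues one multiply-add every clock."""
    return dsps * ops_per_clk * clock_hz


peak = peak_ops_per_sec()
print(f"{peak / 1e12:.3f} Tops/sec")
```

Changing the clock or the number of usable DSPs here gives a fast feel for how far below peak a real design might land.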

Talking about effectively trying for a 740 core MIMD machine. Definitely won't be able to realize that peak, and in comparison to GPUs like the Fury X at 8.6 Tops/sec (and 32-bit instead of 18-bit), this number seems small at first. Except if this little FPGA machine was driving an old TV like a console at around NES resolution (same width, less height),

PC driving 2560x1440 which at 8.6 Tops/sec is ___2.3 Mops/pixel/sec.
FPGA driving 256x192 which at 0.555 Tops/sec is __11.3 Mops/pixel/sec.
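
The per-pixel comparison is just throughput divided by pixel count. A minimal sketch (the helper name is hypothetical, and the FPGA figure uses the 0.555 Tops/sec peak computed earlier):

```python
def mops_per_pixel(ops_per_sec, width, height):
    """Millions of ops per second available to each pixel of the display."""
    return ops_per_sec / (width * height) / 1e6


pc = mops_per_pixel(8.6e12, 2560, 1440)     # GPU driving a 2560x1440 display
fpga = mops_per_pixel(0.555e12, 256, 192)   # FPGA driving a 256x192 display

print(f"PC   {pc:5.1f} Mops/pixel/sec")
print(f"FPGA {fpga:5.1f} Mops/pixel/sec")
print(f"FPGA has {fpga / pc:.1f}x the ops per pixel")
```

Even at 1/4 of peak, the FPGA still comes out slightly ahead of the GPU per pixel under these numbers.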

Which is complete insanity levels of per-pixel performance for what amounts to a vintage arcade box, even if only reaching 1/4 of that peak.

So Let the Fun Begin
DSPs are limited by having an 18-bit input for the "d=a*b+c" operation, and Block RAMs are natively 18 bits wide, so naturally this is going to be an 18-bit computer. Working backwards to get rough design constraints: first breaking down how many CLB slices support distributed RAMs (SLICEM) versus not (SLICEL), then dividing everything by the 740 DSP slices.

11552 SLICEM / 740 = 15
22098 SLICEL / 740 = 29
__730 _BRAMS / 740 = ~1 (18 Kbit)
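
The same division as a small sketch, keeping the fractional part so the rounding is visible (the dictionary and variable names are illustrative only):

```python
# Resources available per DSP slice if every one of the 740 DSPs became a core.
DSPS = 740
RESOURCES = {
    "SLICEM": 11552,   # slices that can act as distributed RAM
    "SLICEL": 22098,   # logic-only slices
    "BRAM18": 730,     # 365 x 36 Kbit BRAMs, each split into two 18 Kbit halves
}

budget = {name: count / DSPS for name, count in RESOURCES.items()}

for name, per_dsp in budget.items():
    print(f"{name}: {per_dsp:.2f} per DSP")
```

The SLICEM and SLICEL figures above are these values rounded down, and the BRAM figure lands just under 1.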

Block RAMs are dual ported, so they don't have the ports required to keep the DSPs filled. Each SLICEM in contrast can be used as a quad-port 32 x 2-bit RAM, which looks like a good target for a register file (can sustain 3 read ports for the DSP op). Will need 9 SLICEMs to support a 32 entry x 18-bit register file. These SLICEMs only support 1 write port, and I want 2 (for the ability to write into the register file in parallel with DSP ops). So the register file will need to be at least 2 banks, for a total of 18 SLICEMs. With only about 15 SLICEMs available per DSP, this high-level register file design will limit the peak number of DSPs used.
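
To put a rough number on that limit, here is a sketch of the SLICEM cost of the register file and the core count it allows. It assumes, unrealistically, that every SLICEM on the part goes to register files and nothing else, so treat the result as a ceiling rather than a target. The function and variable names are made up for this example.

```python
import math

SLICEM_TOTAL = 11552


def slicems_for_regfile(word_bits=18, bits_per_slicem=2, banks=2):
    """SLICEMs needed when each one provides a quad-port 32 x 2-bit RAM."""
    return math.ceil(word_bits / bits_per_slicem) * banks


per_core = slicems_for_regfile()            # 9 SLICEMs per bank, 2 banks
max_cores = SLICEM_TOTAL // per_core        # ceiling set by SLICEMs alone

print(f"{per_core} SLICEMs per core, at most {max_cores} cores of 740 DSPs")
```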

Initial target of one 1024 entry x 18-bit Block RAM per core for data (roughly 16x the capacity of the register file). If clustering 8 CPUs together, that is 8K x 18-bit words for data, via 8-way banking. I'm thinking about doing something quite rash, and only supporting aligned 8-word (8 * 18-bits/word) block loads and stores from this data RAM, both for the CPU register file and the message router. Compared with GRVI, this would replace the 2:1 concentrators and 4x4 crossbar. Instead the CPU would have four 8-word regions in one of the banks which could be accessed for block load/store, in parallel with the other 32-word register file bank, which is used for DSP operations. Effectively the CPU would be modal with a binary switch: one bank used for block load/store to set up for the next group of computation, while the other bank is used for computation. Then switch to do math on the loaded data, and load new data into the other bank. It is a very restrictive and simple design, but something I think can work quite well in practice.
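
A toy software model of the bank-switching idea, just to make the data flow concrete. Everything here is an assumption for illustration: the class and method names are invented, the data RAM is a flat list rather than 8 physical banks, arithmetic is simply wrapped to 18 bits unsigned, and no timing or pipelining is modeled.

```python
WORD_MASK = (1 << 18) - 1   # 18-bit words
BLOCK_WORDS = 8             # only aligned 8-word block transfers are allowed
BANK_WORDS = 32             # four 8-word regions per register bank


class Core:
    """One CPU: two 32-word register banks and a binary mode switch."""

    def __init__(self, data_ram):
        self.data_ram = data_ram                        # cluster data RAM (flat list)
        self.banks = [[0] * BANK_WORDS for _ in range(2)]
        self.compute = 0                                # bank currently used for DSP ops

    def _transfer_bank(self):
        return self.banks[self.compute ^ 1]             # the bank not doing math

    def block_load(self, region, block):
        """Copy aligned data RAM block `block` into region 0..3 of the transfer bank."""
        src, dst = block * BLOCK_WORDS, region * BLOCK_WORDS
        self._transfer_bank()[dst:dst + BLOCK_WORDS] = self.data_ram[src:src + BLOCK_WORDS]

    def block_store(self, region, block):
        """Copy region 0..3 of the transfer bank out to aligned data RAM block `block`."""
        src, dst = region * BLOCK_WORDS, block * BLOCK_WORDS
        self.data_ram[dst:dst + BLOCK_WORDS] = self._transfer_bank()[src:src + BLOCK_WORDS]

    def mad(self, d, a, b, c):
        """d = a*b + c on the compute bank (3 reads, 1 write)."""
        r = self.banks[self.compute]
        r[d] = (r[a] * r[b] + r[c]) & WORD_MASK

    def switch(self):
        """Flip which bank computes and which bank transfers."""
        self.compute ^= 1


data_ram = list(range(8 * 1024))       # 8K words for an 8-CPU cluster
core = Core(data_ram)
core.block_load(region=0, block=1)     # words 8..15 land in the idle bank
core.switch()                          # now do math on the freshly loaded bank
core.mad(d=8, a=0, b=1, c=2)           # r8 = r0 * r1 + r2
core.switch()                          # make that bank available for transfers again
core.block_store(region=1, block=0)    # write the region holding r8 back to data RAM
print(data_ram[0])
```

In hardware the load/store on one bank and the math on the other would overlap in time; the model only shows which bank each operation is allowed to touch.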

For instruction RAM, also thinking about sharing one BRAM between two CPUs, but switching to block loads of instructions, so all 8 CPUs in a cluster can share 4K x 18-bit words of BRAM. This requires an ISA design which can compute the branch target early enough in the pipeline.
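
Adding up the BRAM usage per cluster gives a second rough ceiling on core count. This is a sketch under the numbers above (one 18 Kbit data BRAM per CPU, one 18 Kbit instruction BRAM per two CPUs), and it assumes no BRAM is spent on anything else, such as the message router. The names are illustrative.

```python
BRAM18_TOTAL = 730         # 18 Kbit BRAMs on the part
WORDS_PER_BRAM = 1024      # 1024 x 18-bit words each
CPUS_PER_CLUSTER = 8
CPUS_PER_INSTR_BRAM = 2    # one instruction BRAM shared by two CPUs

data_brams = CPUS_PER_CLUSTER                           # one data BRAM per CPU
instr_brams = CPUS_PER_CLUSTER // CPUS_PER_INSTR_BRAM   # shared instruction BRAMs
brams_per_cluster = data_brams + instr_brams

instr_words = instr_brams * WORDS_PER_BRAM              # shared instruction store
max_clusters = BRAM18_TOTAL // brams_per_cluster
max_cpus = max_clusters * CPUS_PER_CLUSTER

print(f"{brams_per_cluster} BRAMs per cluster, {instr_words} instruction words")
print(f"at most {max_clusters} clusters = {max_cpus} CPUs from BRAM alone")
```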

More next time...