20160705 - CPU Threading to Hide Pipelining

If a CPU has a 4 stage pipeline, it would be nice to have 4 CPU threads scheduled round-robin, so that pipeline delays for {memory, ALU, branches, etc} do not have to be programmer visible, and complexities such as forwarding can be avoided.
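As a rough illustration of why round-robin threading removes the need for forwarding, here is a minimal sketch of a generic 4 stage pipeline, assuming each instruction occupies stages on consecutive cycles and finishes in its last stage. The function names are made up for this example, and this is not a model of any specific hardware.

```python
# Sketch only: generic round-robin (barrel style) issue on a 4 stage pipeline.
PIPELINE_DEPTH = 4  # stages, from the text
NUM_THREADS = 4     # one thread per stage, from the text

def issuing_thread(cycle, num_threads=NUM_THREADS):
    """Thread that enters the first stage on this cycle."""
    return cycle % num_threads

def issue_cycles(thread, count, num_threads=NUM_THREADS):
    """First `count` cycles on which `thread` issues an instruction."""
    return [thread + i * num_threads for i in range(count)]

def has_hazard(num_threads, thread=0, count=8):
    """True if a thread issues while its previous instruction is still in flight."""
    cycles = issue_cycles(thread, count, num_threads)
    for prev, nxt in zip(cycles, cycles[1:]):
        last_stage_cycle = prev + PIPELINE_DEPTH - 1  # assumption: done after last stage
        if nxt <= last_stage_cycle:
            return True
    return False

print(has_hazard(4))  # thread count equal to pipeline depth
print(has_hazard(2))  # fewer threads than stages
```

With as many threads as stages, a thread never has two instructions in flight in this simplified model, so there is nothing to forward. With only 2 threads the same check reports an overlap.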

According to docs, Xilinx 7-series DSPs need a 3 stage pipeline for full speed MAC, and BRAMs (Block RAMs) need 1 cycle delay for reads.

Working from high level constraints, I'm planning on using the following for the CPU-side of the project,

16-bit or 18-bit machine
X BRAMs of Instruction RAM (2 ports, read or write for either)
Y BRAMs of Data RAM (2 ports, read or write for either)

Which suggests the following 4 stage pipeline (with 4 CPU threads running in parallel, one on each pipeline stage),

Instruction BRAM Read -> Instruction BRAM Registers
DSP MUL -> DSP {M,C} Registers (from prior instruction)

Instruction Decode
DSP ALU -> DSP {P} Registers (from prior instruction)

Data BRAM Write(s) (results from prior instruction)
Data BRAM Read(s) -> Data BRAM Registers

DSP Input -> DSP {A,B,D} Registers
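To make the stage list above easier to follow, here is a small sketch of which thread sits in which stage on a given cycle, plus one possible reading of the "(from prior instruction)" notes: the DSP work for an instruction spills into the same thread's next trip through the pipeline. The step ordering in `instruction_timeline` is an interpretation, and all names are invented for the example.

```python
# Sketch only: stage occupancy for 4 threads on the 4 stage pipeline above.
STAGES = [
    "Instruction BRAM read | DSP MUL (prior)",
    "Instruction decode | DSP ALU (prior)",
    "Data BRAM write (prior) | Data BRAM read",
    "DSP input",
]
NUM_THREADS = len(STAGES)

def thread_in_stage(cycle, stage):
    """Thread in `stage` on `cycle`, assuming thread t enters stage 0 on cycle t."""
    return (cycle - stage) % NUM_THREADS

def occupancy(cycle):
    """Thread index per stage for one cycle."""
    return [thread_in_stage(cycle, s) for s in range(NUM_THREADS)]

def instruction_timeline(issue_cycle):
    """Assumed life of one instruction: 4 stages, then 3 more steps in the next pass."""
    steps = [
        "Instruction BRAM read", "Instruction decode", "Data BRAM read",
        "DSP input", "DSP MUL", "DSP ALU", "Data BRAM write",
    ]
    return [(issue_cycle + i, step) for i, step in enumerate(steps)]

for cycle in range(4, 8):
    print(cycle, ["T%d" % t for t in occupancy(cycle)])
for cycle, step in instruction_timeline(4):
    print(cycle, step)
```

Under this reading, an instruction's result is written to the Data BRAM on the same cycle that the thread's next instruction does its Data BRAM read, which matches the write and read being listed together in the third stage.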

The ISA can do something as complex as the example below in one instruction. The focus is on instruction forms which can leverage both ports on the Instruction BRAMs (opcode and separate optional immediate), as well as both ports on the Data BRAMs. Dedicated address registers provide windows into the Data BRAMs for immediate access instead of a conventional register file, and a special high-bit-width accumulator maintains precision of intermediate fixed-point operations.

[addressRegister[2bitImmediate]^nbitImmediate] = accumulator;
accumulator = [addressRegister[2bitImmediate]^nbitImmediate] OP 18bitImmediate;
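A minimal behavioural sketch of that single instruction follows, with several assumptions that the post does not state: `^` is read as XOR, there are four address registers (since the selector is a 2-bit immediate), data words are 18 bits, the accumulator is 48 bits wide (the width of the DSP48E1 P register), a store keeps only the low 18 bits, and values are treated as unsigned. The class and method names are hypothetical.

```python
import operator

DATA_BITS = 18                    # assumption: the 18-bit option from the text
ACC_BITS = 48                     # assumption: matches the DSP48E1 P register
DATA_MASK = (1 << DATA_BITS) - 1
ACC_MASK = (1 << ACC_BITS) - 1

class Machine:
    def __init__(self, data_words=1024):
        self.data = [0] * data_words
        self.address_registers = [0, 0, 0, 0]  # selected by a 2-bit immediate
        self.accumulator = 0

    def effective_address(self, reg_select, offset):
        """addressRegister[2bitImmediate] ^ nbitImmediate, wrapped to memory size."""
        return (self.address_registers[reg_select & 3] ^ offset) % len(self.data)

    def step(self, store, load, op, immediate):
        """One instruction: store the accumulator, then load, apply OP, and accumulate.

        `store` and `load` are (reg_select, offset) pairs.
        """
        store_addr = self.effective_address(*store)
        self.data[store_addr] = self.accumulator & DATA_MASK  # assumption: truncate
        load_addr = self.effective_address(*load)
        operand = self.data[load_addr]
        self.accumulator = op(operand, immediate & DATA_MASK) & ACC_MASK

m = Machine()
m.address_registers[1] = 0x10
m.accumulator = 7
m.step(store=(1, 0x3), load=(1, 0x3), op=operator.mul, immediate=5)
print(hex(m.effective_address(1, 0x3)), m.data[0x13], m.accumulator)
```

XOR against an aligned base acts like an offset into a small window, which is one way the address registers could "provide windows" into the Data BRAMs without needing an adder for address generation.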