20161001 - T4K Try 1


Tries : 3 2 1

This post is just me using blogger an as active notepad to paper design a FPGA soft-core for a many-core machine. Hopefully I'll update it once an a while. The aim is to try to see what can be built in the budget of one BRAM per core on Xilinx FPGAs, thinking through pipelines and physical implementation in CLBs. Machine target is 600 32-bit integer cores, at 300 to 500 MHz, with 4KB/core of on-chip local RAM (for instruction and data), with '25-bit * 18-bit + 48-bit' DSP based ALUs, with a Hoplite-based on-chip router for message passing.

Update Log
2016/10/02 : Returning to the mechanics of code, shift and bit array stuff.

2016/10/01 : Updating ISA, etc. Traditionally an instruction encapsulates {source values, operation, destination}, and the instruction data flows down the CPU pipeline. I'm working towards a different kind of ISA, where the opcode instead describes what to do in this clock, routing between the various data pipelines in the CPU core. This could be a horrible idea (racing to fail), it certainly is for assembly readability! Effectively I'm attempting to be able to describe {branch, alu, mov, mem} access all in a single 32-bit opcode. Attempt to push an IPC closer to 3 or 4, instead of around 1.

2016/10/01 : Beginning. Yanked the prior post because my implementation estimation was fail, the revised version appears here.

Notes
==============

  FPGA NOTES

==============

===============
  CORE BUDGET
===============
  1 BRAM (32-bit x 1024 entry, with 2 ports both which can read or write)
  2 DSPs 
400 LUTs (50 slices, 8 LUTs per slice)


==============
  LUTs / MUX
==============
1 LUT =  2:1 MUX x2 (get two of these)
1 LUT =  4:1 MUX
2 LUT =  8:1 MUX
4 LUT = 16:1 MUX
8 LUT = 32:1 MUX


===================================
  LUTs / GENERAL PURPOSE FUNCTION
===================================
1 LUT =  5:1 x2 (get two of these sharing same 5 input bits)
1 LUT =  6:1
2 LUT = 13:1
4 LUT = 27:1 


===============
  SLICE RULES
===============
Carry chain for slice can start at LUT 0 or LUT 4
Distributed RAM granularity is half a slice (starting at LUT 0 or LUT 4)


========================
  DSP PATTERN DETECTOR
========================
From docs, "use of the pattern detector leads to a moderate speed reduction 
  due to the extra logic on the pattern detect path"
Suggests not using this to check for zero
  So branch on signed or unsigned only?
Wouldn't be able to use this for both saturation checks and zero check anyway


====================

  FUNCTIONAL UNITS

====================

============================
  CURRENT LUT BUDGET USAGE
============================
LUTs  %    Usage
====  ===  =====
28    7    Program counter
12    3    Return stack
107   27   BRAM variable bit-width windows (includes adr^imm)
64    16   Register file
----  ---  -----
211   53   Total


===================
  PROGRAM COUNTER
===================
10-bit program counter (PC)
Only lower 8-bits of PC increment on linear execution
  Requires only an 8-bit PC+1 computation (one slice)

Aim to minimize critical path getting next address to BRAM
  Only one level of LUT to compute next address
  All inputs registered at end of prior clock
  Followed by 8-bit add to compute possible PC for next clock

PC function inputs per output bit (map to 13:1 function at 2 LUTs/bit)
  Bits  Meaning
  ====  =======
  1     Next PC if not branching (computed in prior clock)
  1     Top of return stack
  1     Immediate absolute branch address
  1     PSP P output register (computed branch target in prior clock)
  1     PSP P output register sign bit (for conditional branch)
  8     Up to 8 bits from instruction opcode to decode
  
LUTs  Usage
====  =====
20    13:1 function for next 10-bit PC computation including instruction decode
8     PC+1 adder for 8 lower bits of PC
----  -----
28    Total (7% of 400 LUT/BRAM budget)


================
  RETURN STACK
================
Going to plan on a dedicated return stack for now
Only need single port for return stack (either call/push, or return/pop)
Hardware background,
  Distributed RAM works in 4 LUT granularity
  1 LUT provides 2x SPRAM32 (single port 32x1 RAM)
  Writes are synchronous on clock edge
  Reads are async
  8 LUTs for a 32 x 16-bit return stack data

Not using everything
  Padding to 4 LUT granularity
  Only need 10-bits out of 16-bits (6-bits free for other state)
  Likely going to keep only 4-bit top of stack address register
    Could use other 16 entries for run function on new message?

Todo
  Control inputs, adder input, etc

LUTs  Usage
====  =====
8     32 entry x 16-bit return stack
4     4-bit top of stack pointer
?     
----  -----
12    Total (3% of 400 LUT/BRAM budget)


=================================================
  ADDRESS REGISTER XOR INSTEAD OF ADD IMMEDIATE
=================================================
Planning on [address ^ immediate] addressing
  This removes an adder from the design

XOR is the same as [address + immediate] for an n-bit immediate
  When lower n-bits of address are zero
  Means data must be aligned to the nearest pow2 of maximum immediate offset

Using the following terms,
  ggggggoooo
    g =  group bits (address bits choose the group of data)
    o = offset bits (address bits are zero, immediate chooses element in group)

For bits in address which are not cleared (ie the group bits),
  Setting bits in the immediate results in accessing a neighbor group
  Regardless of the starting group in the address register
    It is possible to roll through all aligned groups
    But ordering is different based on starting group address
    Example of group bits for address crossed with immediate
           00 01 10 11   
         +-------------
      00 | 00 01 10 11
      01 | 01 00 11 10
      10 | 10 11 00 01
      11 | 11 10 01 00


===================================
  BRAM VARIABLE BIT-WIDTH WINDOWS
===================================
Trying to support transparent pack/unpack of variable bit-widths from BRAM
  Want zero impact to ISA, no special instructions
  Instead dividing address range into windows of different bit-widths
  Each address range addresses at a multiple of the bit-width
  Effectively the high bits of address choose the bit-width

Store path limited to {8,16,32}-bit
  Only using BRAM byte write mask to avoid any {read, modify, write}

Fixed signed vs unsigned configuration
  32-bit doesn't matter
  16-bit going to go with signed (needed for vector or audio)
   8-bit unsigned (not so sure about that)
   4-bit unsigned for sure (sprites?)

BRAMs always in 32-bit port mode,
  fedcba9876543210 
  ================
  .xxxxxxxxxx00000 - requires 10-bit address

Address register,
  fedcba9876543210 
  ================
  00....xxxxxxxxxx - 1024 x 32-bit
  01...xxxxxxxxxxx - 2048 x 16-bit 
  10..xxxxxxxxxxxx - 4096 x  8-bit 
  11.xxxxxxxxxxxxx - 8192 x  4-bit (supported for read only)

Address register value to BRAM address translation
  This needs to include XORing the immediate
  Uses a 13:1 function for each bit,
    Bits  Meaning
    ====  =======
    4     Address shifted left {0,1,2,3} bits
    4     Immediate shifted left {0,1,2,3} bits
    2     The 'fe' address bits
    3     Up to 3 bits from instruction opcode to decode
          Should select if use immediate, etc

Permutations (showing address and byte write mask for store),
  11111111111111110000000000000000
  fedcba9876543210fedcba9876543210
  ================================
  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa - 32-bit adr=00....xxxxxxxxxx  write=1111
  ................bbbbbbbbbbbbbbbb - 16-bit adr=01...xxxxxxxxxx0  write=0011
  cccccccccccccccc................ - 16-bit adr=01...xxxxxxxxxx1  write=1100
  ........................dddddddd -  8-bit adr=10..xxxxxxxxxx00  write=0001
  ................eeeeeeee........ -  8-bit adr=10..xxxxxxxxxx01  write=0010
  ........ffffffff................ -  8-bit adr=10..xxxxxxxxxx10  write=0100
  gggggggg........................ -  8-bit adr=10..xxxxxxxxxx11  write=1000
  ............................hhhh -  4-bit adr=11.xxxxxxxxxx000
  ........................iiii.... -  4-bit adr=11.xxxxxxxxxx001
  ....................jjjj........ -  4-bit adr=11.xxxxxxxxxx010
  ................kkkk............ -  4-bit adr=11.xxxxxxxxxx011
  ............llll................ -  4-bit adr=11.xxxxxxxxxx100
  ........mmmm.................... -  4-bit adr=11.xxxxxxxxxx101
  ....nnnn........................ -  4-bit adr=11.xxxxxxxxxx110
  oooo............................ -  4-bit adr=11.xxxxxxxxxx111

Shift value for store,
  Requires 3:1 MUX per bit, 32 LUTs

Generate write enable for store,
  Requires same 4-bits per function,
    2 lower address bits
    2 upper address bits
  2 LUTs (5:1 function sharing inputs, 2 outputs per LUT)

Unpack after load,
  11111111111111110000000000000000
  fedcba9876543210fedcba9876543210
  ================================
  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
  <---------------bbbbbbbbbbbbbbbb - sign extended
  <---------------cccccccccccccccc - sign extended
  000000000000000000000000dddddddd
  000000000000000000000000eeeeeeee
  000000000000000000000000ffffffff
  000000000000000000000000gggggggg
  0000000000000000000000000000hhhh
  0000000000000000000000000000iiii
  0000000000000000000000000000jjjj
  0000000000000000000000000000kkkk
  0000000000000000000000000000llll
  0000000000000000000000000000mmmm
  0000000000000000000000000000nnnn
  0000000000000000000000000000oooo
  ================================
  xxxxxxxxxxxxxxxx................ -  4:1 MUX/bit (a bit, top b bit, top c bit, 0)                                   
  ................xxxxxxxx........ -  4:1 MUX/bit ({a,b,c} bit, 0)
  ........................xxxx.... -  8:1 MUX/bit ({a,b,...g} bit, 0) 
  ............................xxxx - 15:1 MUX/bit ({a,b,...o} bit) 

Unpack if done with MUX control logic computed in prior clock,
  This registers 4-bits extra, to reduce LUT cost on MUX
  Register 2-bits for top 24-bit MUX control
    Logic function (3:1 sharing inputs), one LUT 
      up lo    1 0
      == ===   = =
      00 xxx | 0 0
      01 xx0 | 0 1
      01 xx1 | 1 0
      10 xxx | 1 1
      11 xxx | 1 1    
  Register 3-bits for 2nd lowest 4-bit MUX control
    Logic function (4:1 sharing inputs), rounds to 2 LUTs 
      up lo    2 1 0
      == ==    = = =
      00 xxx | 0 0 0
      01 xx0 | 0 0 1
      01 xx1 | 0 1 0
      10 x00 | 0 1 1
      10 x01 | 1 0 0
      10 x10 | 1 0 1
      10 x11 | 1 1 0
      11 xxx | 1 1 1  
  Register 4-bits for lower 4-bit MUX control
    Logic function (5:1 sharing inputs), 2 LUTs 
      Skipping as pattern is obvious ...

LUTs  Usage
====  =====
20    Address register value to BRAM address translation 2 LUTs x 10-bits
32    Shift value for store
2     Generate write enable
5     Unpack MUX control logic (done on clock computing address for BRAM)
24    Unpack MUX for 24-bits
8     Unpack MUX for 4-bits (higher nibble)
16    Unpack MUX for 4-bits (lower nibble) 
----  -----
107   Total (27% of 400 LUT/BRAM budget)


=================
  REGISTER FILE
=================
Using the smallest possible, but assuming the need for 1 write and 3 read ports,
  8 LUTs (one SLICEM) yields one Quad-port 32 entry x 4-bit RAM

Register file write sources,
  DSP P output register
  BRAM load
  What else?
  TODO?
    Need to fold these choices into BRAM unpack logic?
    Or is BRAM load also forwarded into DSP inputs without first going to reg file?

Register file read sources,
  Address register
  DSP A,B,C inputs
  What else?   

LUTs  Usage
====  =====
64    Total (16% of 400 LUT/BRAM budget)


=======
  DSP
=======
Not attempting to use all features of DSP
  Pre-add 'a+d' tossed because of extra pipeline stage

DSP is effectively modal,
  If multiply is enabled then there are only 2 options,
    'p = c+(a*b)'
    'p = c-(a*b)' <- is multiply subtract worth it? (assuming yes for now)
  Otherwise,
    'p = c OP (a:b)


===================
  MESSAGE PASSING  
===================
Todo


=================================
  
  ISA / CODE GENERATION DETAILS

=================================
Todo

=======
  ISA 
=======
Going to get messy for now.
Attempting first to describe everything which needs instruction control.
This will overflow a 32-bit instruction.
In process culling options which are less needed.
In hope of eventually fitting everything.

Source input data which can be accessed each clock,
  Register file loads (from prior clock),
    32-bit x
    32-bit y
    32-bit z
  Instruction immediate for this instruction,
    10-bit i
  BRAM last unpacked fetch (from prior clock),
    32-bit f
  DSP output (from prior clock),
    32|48-bit p

Sink output data,
  DSP inputs,
    18-bit b
    24-bit a (could be up to 30-bit, but no LUTs for that)
    32-bit c (could be up to 48-bit, but no LUTs for that)
     ?-bit control bits (todo)
  BRAM output word
    32-bit o
  BRAM memory address before translation
    16-bit m
  BRAM access control bits
  Register file entries for each port
     5-bit immediate s (store port)
     5-bit immediate t 
     5-bit immediate u
     5-bit immediate v
  Register file write value
    32-bit w

Alphabet usage,
  abcdefghijklmnopqrstuvwxyz
  ==========================
  abc.......................  DSP inputs
  ...............p..........  DSP output
  ........i.................  immediate
  ............m.............  BRAM memory address
  .....f....................  BRAM fetched value
  ..............o...........  BRAM output word
  ..................s.......  Register file store port
  ...................tuv....  Register file read ports
  ......................w...  Register file write value
  .......................xyz  Register file reads

Tracking how much opcode overload,
  11111111111111110000000000000000
  fedcba9876543210fedcba9876543210
  ================================
  bbb.............................  Branching 
  ...rr...........................  BRAM operation 
  .....sssstttuuuuvvvv............  Reg file ports
  ....................??..........  Not enough space for DSP control
  ......................iiiiiiiiii  Trying for fixed 10-bit immediate 
  ================================
  Notes,
    Definitely need smarter opcode encoding

Immediate,
  11111111111111110000000000000000
  fedcba9876543210fedcba9876543210
  ================================
  ......................iiiiiiiiii  Trying for fixed 10-bit immediate 
  ================================
  Notes,
    Wanted 10-bit to hit full BRAM absolute address

DSP control,
  11111111111111110000000000000000
  fedcba9876543210fedcba9876543210
  ================================
  Notes,
    Lots of control bits to get correct in here      

DSP inputs,
  11111111111111110000000000000000
  fedcba9876543210fedcba9876543210
  ================================
  ................................  Input a 24-bits
  ................................    a=signExtend(i)
  ................................    a=y[32:18]
  ................................    a=f[32:18]
  ................................    a=f
  ................................    a=y
  ................................    a=z
  ================================
  ................................  Input b 18-bits
  ...............................0    b=i
  ..............................??    b=y
  ..............................??    b=f 
  ================================
  ................................  Input c 32-bits
  ...............................?    c=f
  ...............................?    c=z
  ================================
  Notes,
    Might be able to have a be modal based on DSP control (for mul vs a:b cases)
      Or maybe merge for some cases?
    The a=signExtend(i) case is needed for signed (a:b)=i
    Should c=z be z or something else
    Believe f probably should be an input into b for '(a:b) op c' case
    Culled, a=i, as i is too small for top bits of a:b, and mul is associative
    Culled, b=p, as it is better to just forward in these cases
    Culled, c=p, hoping forwarding if control bits covers this

Branch control,
  11111111111111110000000000000000
  fedcba9876543210fedcba9876543210
  ================================
  .............................???  No branch
  .............................???  Return
  .............................???  Call to p
  .............................???  Switch to/from MessageHandler/Program
  .............................???  Conditional jump if p<0
  .............................???  Conditional jump if p>=0
  .............................???  Call
  .............................???  Jump
  ================================
  Notes,
    Likely not getting branch control under 3-bits
      At least without multiple instruction forms
    In theory this enables "free" branching
    Won't have p==0, as don't want to turn on pattern detector
    Work through computed branch targets cases again ...
    
Register file ports,
  11111111111111110000000000000000
  fedcba9876543210fedcba9876543210
  ================================
  .................ssss...........  Store to any register
  .....................ttt........  BRAM address registers (limited to first 8 for space)
  ........................uuuuvvvv  Load from any register
  ================================
  Notes,
    This is the most trouble in opcode encoding size
    Dropping to 16 entries from 32
      Assuming want freer context switch to handle message
        First 16 used during normal execution
        Later 16 used during handle message

BRAM operation bits,
  11111111111111110000000000000000
  fedcba9876543210fedcba9876543210
  ================================
  ..............................00  f=[x]
  ..............................01  f=[x^i]
  ..............................10  [x]=p
  ..............................11  [x^i]=p
  ================================
  Notes,
    This is all that is needed to keep the one BRAM port busy
    Direct mapping to register port t (1st read port) to save size
    Only supporting storing from DSP output p
      If one is going to store, store when generated
      If need to store later just load back into the DSP (nop)
    Need separate [x] case because imm may be used for something else
    Culled, nop (just going to burn power for load regardless if needed)
    Culled, f=[i], [i]=p, because [x^i] can load x=0


===================
  SHIFTING ISSUES
===================
Needs more thought ...

Have the following pipeline options built in the DSP
 (25-bit a * 18-bit b) + 48-bit c
 ((25-bit a * 18-bit b) + 48-bit c) << 17

Usage cases,
  Address math,
    This is 'base + index * stride', so use 'a*b+c'
    Index limited to 16 M
    Base not limited
  Bitfield,
    Included in via variable-bit BRAM access
  Bit arrays,
    Expanded to later section
  Divides
    Needs some thought ...

Shifter will do at most 24-bit integers,
  Top bits get sign-extended (likely not the opcode space for unsigned)
  Using 24 to keep with quad LUT alignment

Shifts for 16-bit,
  __for_shift_>>__ __for_shift_<<__
  1111111111111111 0000000000000000
  fedcba9876543210 fedcba9876543210  mul       <<  >>
  ================ ================  ========  ==  ==    
  ................ fedcba9876543210  00000001   0  10   
  ...............f edcba9876543210.  00000002   1   f
  ..............fe dcba9876543210..  00000004   2   e
  .............fed cba9876543210...  00000008   3   d
  ............fedc ba9876543210....  00000010   4   c
  ...........fedcb a9876543210.....  00000020   5   b
  ..........fedcba 9876543210......  00000040   6   a
  .........fedcba9 876543210.......  00000080   7   9
  ........fedcba98 76543210........  00000100   8   8
  .......fedcba987 6543210.........  00000200   9   7
  ......fedcba9876 543210..........  00000400   a   6
  .....fedcba98765 43210...........  00000800   b   5
  ....fedcba987654 3210............  00001000   c   4
  ...fedcba9876543 210.............  00002000   d   3
  ..fedcba98765432 10..............  00004000   e   2
  .fedcba987654321 0...............  00008000   f   1
  fedcba9876543210 ................  00010000  10   0


==============
  BIT ARRAYS
==============
Maybe best to just use the 8-bit BRAM window
  Keeps with-in the range of immediate for AND mask

Emulation without special hardware,
  Algorithms,
    Extract lowest bit set ....................  x & -x
    Get mask up to lowest bit set .............  x ^ (x - 1)
    Reset lowest bit set ......................  x & (x - 1)
    ===========================================  ============
    Fill from lowest clear bit ................  x & (x + 1)
    Isolate lowest clear bit and complement ...  ~x & (x + 1)
    Mask from lowest clear bit ................  x ^ (x + 1)
    Mask from trailing zeros ..................  ~x & (x - 1)
    ===========================================  ============
    Isolate lowest clear bit ..................  x | ~(x + 1)
    Set lowest clear bit ......................  x | (x + 1)
    Fill from lowest set bit ..................  x | (x - 1)
    Isolate lowest set bit and complement .....  ~x | (x - 1)
    Inverse mask from trailing ones ...........  ~x | (x + 1)

Common algorithms,
  Bit insert
  Bit extract
  Popuplation count
    Output is 3 bits
    Requires 8:1 function, or 2 LUTs/bit, 6 LUTs in hardware
  Count leading zeros
    6 LUTs in hardware
  Count trailing zeros
    6 LUTs in hardware

Todo,
  Look through De Bruijn Sequence based stuff again
          

Related Material
Bit Hacks
Bit Manipulation Instruction Sets
A Fast Carry Chain Adder for Virtex-5 FPGAs
GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Accelerator
On Hard Adders and Carry Chains in FPGAs
Principles of FPGA IP Interconnect