20130814 - Notes on AMD GCN ISA

From what I can tell from the AMD GCN ISA doc,

All vector operations are predicated by the EXEC bitmask. The EXEC bitmask provides 64 bits, one for each of the 64 lanes of a vector. The EXEC bitmask lives in scalar registers 126 and 127 (a pair of 32-bit registers). There is a VCC or vector condition code bitmask in scalar registers 106 and 107. The VCC bitmask is set by vector compare instructions. There are special registers which contain a 1-bit flag indicating whether the EXEC or VCC bitmask is zero. Finally, there is an SCC or scalar condition code bit.

Predicates (bools) for the vector are stored in pairs of scalar registers. Predication of a sequence of instructions (using if() in shader code) involves saving the current EXEC bitmask into a pair of scalar registers and loading the EXEC bitmask with the predicate (both done by a single instruction: S_{op}_SAVEEXEC_B64), then later restoring the saved EXEC (S_MOV_B64).
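A minimal sketch of that save/predicate/restore sequence, modeled in Python rather than ISA. It assumes the AND variant of the instruction (S_AND_SAVEEXEC_B64); the helper names and the 64-element lists standing in for vector registers are made up for illustration.

```python
MASK64 = (1 << 64) - 1  # EXEC holds one bit per lane, 64 lanes


def s_and_saveexec_b64(exec_mask, predicate):
    """Model of S_AND_SAVEEXEC_B64: returns (saved EXEC, new EXEC)."""
    saved = exec_mask
    new_exec = exec_mask & predicate & MASK64
    return saved, new_exec


def v_mov(dst, src, exec_mask):
    """Predicated vector move: only lanes whose EXEC bit is set are written."""
    return [s if (exec_mask >> lane) & 1 else d
            for lane, (d, s) in enumerate(zip(dst, src))]


def run_if_example():
    """Models: if (x < 4) y = 1;  with x equal to the lane index."""
    exec_mask = MASK64
    x = list(range(64))
    y = [0] * 64
    # A vector compare writes one predicate bit per lane into VCC.
    vcc = sum(1 << lane for lane in range(64) if x[lane] < 4)
    saved, exec_mask = s_and_saveexec_b64(exec_mask, vcc)
    y = v_mov(y, [1] * 64, exec_mask)  # body of the if()
    exec_mask = saved                  # S_MOV_B64 restores the saved EXEC
    return y, vcc, exec_mask
```

Lanes that fail the compare keep their old value because the write is masked, not because any branch is taken.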

At full 40-wavefront occupancy (and going with the published 8KB scalar register file number), each wavefront gets 51 scalar registers (and 25 vector registers). The peak of around 100 scalar registers per wavefront is easily reached at around 50% occupancy (with 52 vector registers per wavefront).
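The scalar numbers above follow from simple division. A small sketch of the arithmetic, assuming the 8KB file is split evenly across resident wavefronts and ignoring any allocation granularity (the function name is made up for illustration; the vector numbers are not modeled here):

```python
SCALAR_FILE_BYTES = 8 * 1024  # published 8KB scalar register file
BYTES_PER_REGISTER = 4        # each scalar register is 32 bits


def scalar_registers_per_wavefront(wavefronts):
    """Evenly split the scalar register file across resident wavefronts."""
    total_registers = SCALAR_FILE_BYTES // BYTES_PER_REGISTER
    return total_registers // wavefronts


print(scalar_registers_per_wavefront(40))  # 51 at full occupancy
print(scalar_registers_per_wavefront(20))  # 102 at 50% occupancy
```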

Vector Instructions
Either 32-bit or 64-bit total opcode length. The 32-bit form supports,

vdst = OP(vsrc0, vsrc1); // Vector source.
vdst = OP(ssrc0, vsrc1); // Common scalar source (typically a constant).
vdst = OP(-16 to 64, vsrc1); // Free special integer literal.
vdst = OP(0.0 or +/-{0.5,1.0,2.0,4.0}, vsrc1); // Free special floating point literal.
vdst = OP(imm32, vsrc1); // A 32-bit immediate which follows the instruction.
vdst = OP(LDS[M0], vsrc1); // Broadcast shared LDS value, M0 is special scalar register.
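To make the cost difference between the "free" literals and the trailing 32-bit immediate concrete, here is a rough Python sketch that classifies a constant using only the ranges listed above. The function names are hypothetical, and the sizes cover just the 32-bit form plus its optional immediate.

```python
# Floating point values that can be used without a trailing immediate.
FREE_FLOATS = {0.0, 0.5, -0.5, 1.0, -1.0, 2.0, -2.0, 4.0, -4.0}


def is_free_literal(value):
    """True if the constant fits one of the free special literals."""
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return -16 <= value <= 64
    if isinstance(value, float):
        return value in FREE_FLOATS
    return False


def encoded_bytes(value):
    """32-bit opcode, plus a 32-bit immediate when the literal is not free."""
    return 4 if is_free_literal(value) else 8
```

So multiplying by 2.0 or adding 64 costs nothing extra, while multiplying by 3.0 or adding 65 doubles the instruction size.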

The 64-bit form supports mix of any of the following,

vdst = OMOD * OP(); // Free multiplier on result, OMOD = {0.5,1.0,2.0,4.0}.
vdst = clamp(OP(),0.0,1.0); // Free clamp on result.
vdst = OP(neg(src0),abs(src1),neg(abs(src2))); // Free negate or absolute value on any input.
vdst = OP(src0, 0.0 or +/-{0.5,1.0,2.0,4.0}, src2); // Free special floating point literal on any input.
vdst = OP(src0, -16 to 64, src2); // Free special integer literal on any input.
vdst = OP(vsrc0, ssrc0, ssrc0); // Only one scalar register can be used in any of the three inputs.
vdst = OP(src0, LDS[M0], src2); // One broadcast shared LDS value for any of the three inputs.
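A rough Python model of how the free input and output modifiers combine around a three-input op, using a multiply-add as the example. The ordering is an assumption here: abs before neg on inputs (matching neg(abs(src2)) above), and OMOD before clamp on the result. The function names and the keyword layout are made up for illustration.

```python
def apply_input_modifiers(x, neg=False, abs_=False):
    """Free per-input modifiers; abs is applied first, then negate."""
    if abs_:
        x = abs(x)
    if neg:
        x = -x
    return x


def apply_output_modifiers(result, omod=1.0, clamp=False):
    """Free result modifiers; assumed order is OMOD scale, then clamp."""
    if omod not in (0.5, 1.0, 2.0, 4.0):
        raise ValueError("OMOD must be one of 0.5, 1.0, 2.0, 4.0")
    result *= omod
    if clamp:
        result = min(max(result, 0.0), 1.0)
    return result


def mad_f32(a, b, c, neg=(False,) * 3, abs_=(False,) * 3,
            omod=1.0, clamp=False):
    """Models vdst = OMOD * (a * b + c) with modifiers on every input."""
    a, b, c = (apply_input_modifiers(v, n, m)
               for v, n, m in zip((a, b, c), neg, abs_))
    return apply_output_modifiers(a * b + c, omod, clamp)
```

For example, mad_f32(-2.0, 3.0, 1.0, abs_=(True, False, False), omod=0.5) evaluates 0.5 * (2.0 * 3.0 + 1.0) in what would be a single instruction.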

The V_MOVRELS and V_MOVRELD instructions enable indexed register file access, but note M0 is a scalar register so M0 is non-divergent across the entire wavefront. Anything divergent requires extra manual predication logic.

vdst = vsrc[M0]; // Indexed register load.
vdst[M0] = vsrc; // Indexed register store.
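A Python sketch of the indexed load, plus one possible shape for the "extra manual predication logic" when the index differs per lane: loop over the distinct index values and run one uniform-M0 load per value, each under a reduced lane mask. The loop is an illustration, not something taken from the ISA doc, and the register file is modeled as a hypothetical list of 64-lane lists.

```python
LANES = 64


def v_movrels(vgprs, vdst, vsrc, m0, exec_mask):
    """vdst = vsrc[M0]: every active lane reads the same register, vsrc + M0."""
    for lane in range(LANES):
        if (exec_mask >> lane) & 1:
            vgprs[vdst][lane] = vgprs[vsrc + m0][lane]


def divergent_indexed_load(vgprs, vdst, vsrc, index, exec_mask):
    """Per-lane index: one uniform-M0 load per distinct index value."""
    remaining = exec_mask
    while remaining:
        # Pick the index of the lowest lane that is still waiting.
        first = (remaining & -remaining).bit_length() - 1
        m0 = index[first]
        # Mask of all waiting lanes that want the same index.
        same = sum(1 << lane for lane in range(LANES)
                   if (remaining >> lane) & 1 and index[lane] == m0)
        v_movrels(vgprs, vdst, vsrc, m0, same)
        remaining &= ~same
```

The loop runs once per distinct index value, so the cost scales with how divergent the index is across the wavefront.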

Changing Materials the GCN Way
DX and GL are years behind in API design compared to what is possible on GCN. For instance, there is no need for the CPU to do any binding for a traditional material system with unique shaders/textures/samplers/buffers associated with geometry. Going to the metal on GCN, it would be trivial to pass a 32-bit index from the vertex shader to the pixel shader, then use the 32-bit index and S_BUFFER_LOAD_DWORDX16 to get constants, samplers, textures, buffers, and shaders associated with the material. Then do an S_SETPC to branch to the proper shader.
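The flow can be sketched in Python as a table lookup followed by an indirect jump. Everything about the record layout is hypothetical (dword 0 as the shader entry, the other 15 dwords as material data), and a dictionary of callables stands in for the S_SETPC branch target.

```python
DWORDS_PER_MATERIAL = 16  # one S_BUFFER_LOAD_DWORDX16 per material


def s_buffer_load_dwordx16(buffer, material_index):
    """Fetch the 16-dword record for a material from a flat buffer."""
    base = material_index * DWORDS_PER_MATERIAL
    return buffer[base:base + DWORDS_PER_MATERIAL]


def shade_pixel(buffer, shaders, material_index):
    """Look up the material record, then 'branch' to its shader."""
    record = s_buffer_load_dwordx16(buffer, material_index)
    shader_address = record[0]  # hypothetical layout: dword 0 = shader entry
    resources = record[1:]      # hypothetical layout: remaining material data
    return shaders[shader_address](resources)  # stands in for S_SETPC


# Two materials sharing one buffer, each pointing at a different shader.
buffer = [100] + [1] * 15 + [200] + [2] * 15
shaders = {
    100: lambda resources: ("flat", sum(resources)),
    200: lambda resources: ("textured", sum(resources)),
}
print(shade_pixel(buffer, shaders, 1))  # ('textured', 30)
```

The CPU never binds anything per material here; the only per-object input is the 32-bit index.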

This kind of system would use a set of uber-shaders grouped by material shader register usage. So all objects sharing a given uber-shader can get drawn in, say, one draw call (draw as many views as you have GPU perf for). No need for traditional vertex attributes either: again, just one S_SETPC branch in the vertex shader and manually fetch what is needed from buffers or textures...