20151123 - Cross-Invocation Data Sharing Portability


A general look at the possibility of portability across dGPUs for cross-invocation data sharing (the natural next step after ARB_shader_ballot, which starts exposing useful SIMD-level programming constructs). As always I'd like any feedback anyone has on this topic; feel free to write comments or contact me directly. Warning: this was typed up fast to collect ideas, so there might be some errors in here...

References: NV_shader_thread_group | NV_shader_thread_shuffle | AMD GCN3 ISA Docs

NV Quad Swizzle (supported on Fermi and beyond)
shuffledData = quadSwizzle{mode}NV({type} data, [{type} operand])
Where,
(1.) "mode" is {0,1,2,3,X,Y}
(2.) "type" must be a floating point type (implies possible NaN issues with integers)
(3.) "operand" is an optional extra unshuffled operand which can be added to the result
The "mode" is either a direct index into the 2x2 fragment quad, or a swap in the X or Y directions.
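The mode semantics can be sketched in Python. This is a minimal model of one 2x2 quad, not NV's actual API: "quad" is just a 4-element list of per-invocation data, and the quad layout is assumed to be index = y*2 + x (so X swaps bit 0 and Y swaps bit 1).

```python
def quad_swizzle(quad, mode):
    # Model of quadSwizzle{mode}NV over one 2x2 fragment quad (indices 0..3).
    if mode in (0, 1, 2, 3):
        # Direct-index modes broadcast one invocation's value to the whole quad.
        return [quad[mode]] * 4
    if mode == 'X':
        # Swap horizontally: flip bit 0 of the quad index.
        return [quad[i ^ 1] for i in range(4)]
    if mode == 'Y':
        # Swap vertically: flip bit 1 of the quad index.
        return [quad[i ^ 2] for i in range(4)]
    raise ValueError(mode)
```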

AMD GCN DS_SWIZZLE_B32
swizzledData = quadSwizzleAMD({type} data, mode)
Where,
(1.) "mode" is a bit array, can be any permutation (not limited to just what NVIDIA exposes)
(2.) "type" can be integer or floating point

Possible Portable Swizzle Interface
bool allQuad(bool value) // returns true for the entire quad if all invocations in the quad are true
bool anyQuad(bool value) // returns true for the entire quad if any invocation in the quad is true
swizzledData = quadSwizzleFloat{mode}({type} data)
swizzledData = quadSwizzle{mode}({type} data)
Where,
(1.) "mode" is the portable subset {0,1,2,3,X,Y} (same as NV)
(2.) "type" is limited to float based types only for quadSwizzleFloat()
This is the direct union of common functionality from both dGPU vendors. According to the GL extension, NV returns 0 for "swizzledData" if any invocation in the quad is inactive. AMD returns 0 for "swizzledData" only for inactive invocations. So the portable spec would leave "swizzledData" undefined if any invocation in the fragment quad is inactive, which is a perfectly acceptable compromise IMO. This would work on all AMD GCN GPUs, and on NVIDIA GPUs since Fermi for quadSwizzleFloat() and since Maxwell for quadSwizzle() (using shuffle, see below), which implies two extensions. Quads in non-fragment shaders are defined by directly splitting the SIMD vector into aligned groups of 4 invocations.
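The quad-uniform boolean reductions above are easy to model. A minimal Python sketch, treating a quad as a 4-element list of bools and returning the per-invocation results (these helper names are mine, not part of any shipped API):

```python
def all_quad(values):
    # allQuad: every invocation in the 2x2 quad receives the AND of all four bools.
    r = all(values)
    return [r] * 4

def any_quad(values):
    # anyQuad: every invocation in the 2x2 quad receives the OR of all four bools.
    r = any(values)
    return [r] * 4
```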



NV Shuffle (supported starting with Maxwell)
shuffledData = shuffle{mode}NV({type} data, uint index, uint width, [out bool valid])
Where,
(1.) "mode" is one of {up, down, xor, indexed}
(2.) "data" is what to shuffle
(3.) "index" is an invocation index in the SIMD vector (0 to 31 on NV GPUs)
(4.) "width" is {2,4,8,16, or 32}, divides the SIMD vector into equal sized segments
(5.) "valid" is optional return which is false if the shuffle was out-of-segment
Below, "startOfSegmentIndex" is the invocation index at which the segment starts in the SIMD vector, and "selfIndex" is the invocation's own index in the SIMD vector. Each invocation computes a "shuffleIndex" of another invocation to read "data" from, then returns the read "data". Out-of-segment means that "shuffleIndex" is outside the local segment defined by "width". Out-of-segment shuffles result in "valid = false" and set "shuffleIndex = selfIndex" (to return un-shuffled "data"). The computation of "shuffleIndex" before the out-of-segment check depends on "mode".
(indexed) shuffleIndex = startOfSegmentIndex + index
(_____up) shuffleIndex = selfIndex - index
(___down) shuffleIndex = selfIndex + index
(____xor) shuffleIndex = selfIndex ^ index
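The index math above can be sketched as a Python model over one SIMD vector. This is a behavioral sketch, assuming the out-of-segment rules described above; "data" is a list whose length is the SIMD width:

```python
def shuffle(data, mode, index, width):
    # Model of shuffle{mode}NV: returns (shuffled, valid) lists.
    # "width" divides the SIMD vector into equal aligned segments.
    n = len(data)
    out, valid = [0] * n, [False] * n
    for self_idx in range(n):
        seg_start = (self_idx // width) * width
        if mode == 'indexed':
            shuffle_idx = seg_start + index
        elif mode == 'up':
            shuffle_idx = self_idx - index
        elif mode == 'down':
            shuffle_idx = self_idx + index
        elif mode == 'xor':
            shuffle_idx = self_idx ^ index
        else:
            raise ValueError(mode)
        # Out-of-segment: flag invalid and return un-shuffled data.
        in_seg = seg_start <= shuffle_idx < seg_start + width
        valid[self_idx] = in_seg
        out[self_idx] = data[shuffle_idx] if in_seg else data[self_idx]
    return out, valid
```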


AMD DS_SWIZZLE_B32 (all GCN)
DS_SWIZZLE_B32 can also swizzle across segments of 32 invocations using the following math.
and_mask = offset[4:0];
or_mask = offset[9:5];
xor_mask = offset[14:10];
for (i = 0; i < 32; i++) {
  j = ((i & and_mask) | or_mask) ^ xor_mask;
  thread_out[i] = thread_valid[j] ? thread_in[j] : 0;
}

The "_mask" values are compile time immediate values encoded into the instruction.
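A runnable Python model of the lane math above, for experimenting with mask values. This is a sketch, not the hardware encoding: "offset" packs the three masks as in the bit-field extraction above, and all lanes are assumed valid unless a validity list is passed:

```python
def ds_swizzle_b32(thread_in, offset, thread_valid=None):
    # Model of the DS_SWIZZLE_B32 32-lane math; "offset" is the 15-bit immediate.
    and_mask = offset & 0x1F            # offset[4:0]
    or_mask = (offset >> 5) & 0x1F      # offset[9:5]
    xor_mask = (offset >> 10) & 0x1F    # offset[14:10]
    if thread_valid is None:
        thread_valid = [True] * 32
    out = []
    for i in range(32):
        j = ((i & and_mask) | or_mask) ^ xor_mask
        # Inactive source lanes read as zero.
        out.append(thread_in[j] if thread_valid[j] else 0)
    return out
```

For example, and_mask=31, or_mask=0, xor_mask=16 (the 32-to-16 reduction step used later in this post) is a pure XOR swap with the lane 16 away.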

AMD VOP_DPP (starts with GCN3: Tonga, Fiji, etc)
DPP can do many things,
For a segment size of 4, can do full permutation by immediate operand.
For a segment size of 16, can shift invocations left by an immediate operand count.
For a segment size of 16, can shift invocations right by an immediate operand count.
For a segment size of 16, can rotate invocations right by an immediate operand count.
For a segment size of 64, can shift or rotate, left or right, by 1 invocation.
For a segment size of 16, can reverse the order of invocations.
For a segment size of 8, can reverse the order of invocations.
For a segment size of 16, can broadcast the 15th segment invocation to fill the next segment.
Can broadcast invocation 31 to all invocations after 31.
Has option of either using "selfIndex" on out-of-segment, or forcing return of zero.
Has option to force on invocations for the operation.

AMD DS_PERMUTE_B32 / DS_BPERMUTE_B32 (starts with GCN3: Tonga, Fiji, etc)
Supports something like this (where "temp" is in hardware),
bpermute(data, uint index) { temp[selfIndex] = data; return temp[index]; }
permute(data, uint index) { temp[index] = data; return temp[selfIndex]; }
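These two can be modeled in Python as a gather versus a scatter over the SIMD vector. A sketch, assuming all lanes active; note that if two lanes scatter to the same index in permute(), which write wins is a hardware detail not modeled here:

```python
def bpermute(data, index):
    # Backward permute: each lane reads from the lane it indexes (gather).
    return [data[index[i]] for i in range(len(data))]

def permute(data, index):
    # Forward permute: each lane writes to the lane it indexes (scatter),
    # then every lane reads back its own slot of the temp storage.
    temp = [0] * len(data)
    for i in range(len(data)):
        temp[index[i]] = data[i]
    return [temp[i] for i in range(len(data))]
```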


Possible Portable Shuffle Interface : AMD GCN + NV Maxwell
This is just a start of ideas, have not had time to fully explore the options, feedback welcomed...
SIMD width would be different for each platform, so the developer would need to build shader permutations for different platform SIMD widths in some cases.

SEGMENT-WIDE BUTTERFLY
butterflyData = butterfly{width}({type} data)
Where "width" is {2,4,8,16,32}. This is "xor" mode for shuffle on NV, and DS_SWIZZLE_B32 on AMD (with and_mask = ~0, and or_mask = 0) with possible DPP optimizations on GCN3 for "width"={2 or 4}. The XOR "mask" field for both NV and AMD is "width>>1". This can be used to implement a bitonic sort (see slide 19 here).
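The butterfly is just an XOR swap with a mask of half the segment width, which a short Python sketch makes concrete (a behavioral model, assuming "data" is one SIMD vector and "width" evenly divides it):

```python
def butterfly(data, width):
    # butterfly{width}: each lane exchanges data with the lane width/2 away,
    # i.e. an XOR shuffle with mask = width >> 1.
    mask = width >> 1
    return [data[i ^ mask] for i in range(len(data))]
```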

SEGMENT-WIDE PARALLEL SCAN
TODO: Weather is nice outside, will write up later...

SEGMENT-WIDE PARALLEL REDUCTIONS
reducedData = reduce{op}{width}({type} data)
Where,
(1.) "op" specifies the operation to use in the reduction (add, min, max, and, ... etc)
(2.) "width" specifies the segment width
At the end of this operation only the largest-indexed invocation in each segment has the result; the values for all other invocations in the segment are undefined. This enables both NV and AMD to have optimal paths. This uses "up" or "xor" mode on NV for log2("width") operations. Implementation on AMD GCN uses DS_SWIZZLE_B32 as follows,
32 to 16 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=16
16 to 8 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=8
8 to 4 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=4
4 to 2 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=2
2 to 1 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=1
64 from finalized 32 => V_READFIRSTLANE_B32 to grab invocation 0 to apply to all invocations
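The DS_SWIZZLE_B32 chain above can be modeled in Python as successive XOR-partner combines. A sketch for a 32-wide add reduction, assuming all lanes active; note the XOR steps actually leave every lane in the segment with the sum, while the portable spec only guarantees the largest-indexed lane:

```python
def reduce_add_32(data):
    # Model of the 32-wide add reduction: each step pairs every lane with its
    # partner at xor_mask distance (the DS_SWIZZLE_B32 steps listed above).
    for xor_mask in (16, 8, 4, 2, 1):
        data = [data[i] + data[i ^ xor_mask] for i in range(32)]
    return data
```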

Implementation on AMD GCN3 uses DPP as follows,
16 to 8 => reverse order of 16-wide (DPP_ROW_MIRROR)
__0123456789abcdef__
__fedcba9876543210__
8 to 4 => reverse order of 8-wide (DPP_ROW_HALF_MIRROR)
_01234567_
_76543210_
4 to 2 => reverse order using full 4-wide permutation mode
__0123__
__3210__
2 to 1 => reverse order using full 4-wide permutation mode
__0123__
__1032__
32 from finalized 16 => DPP_ROW_BCAST15
__...............s...............t...............u................__
__................ssssssssssssssssttttttttttttttttuuuuuuuuuuuuuuuu__
64 from finalized 32 => DPP_ROW_BCAST32
__...............................s................................__
__................................ssssssssssssssssssssssssssssssss__

reducedData = allReduce{op}{width}({type} data)
The difference from "reduce" is that all invocations end up with the result. Uses "xor" mode on NV for log2("width") operations. On AMD this is the same as "reduce" except for "width"={32 or 64}. The 64 case can use V_READLANE_B32 from the "reduce" version to keep the result in an SGPR and avoid using a VGPR. The 32 case can use DS_SWIZZLE_B32 for the 32 to 16 step.


Possible Portable Shuffle Interface 2nd Extension : AMD GCN3 + NV Maxwell
This is just a start of ideas, have not had time to fully explore the options, feedback welcomed...
SIMD width would be different for each platform, so the developer would need to build shader permutations for different platform SIMD widths in various cases.

SIMD-WIDE PERMUTE
Backwards permutation across the full SIMD width is portable across platforms; it maps to shuffleNV(data, index, 32) on NV and DS_BPERMUTE_B32 on AMD,
permutedData = bpermute(data, index)