References: NV_shader_thread_group | NV_shader_thread_shuffle | AMD GCN3 ISA Docs

Where,

(1.) "mode" is {0,1,2,3,X,Y}

(2.) "type" must be a floating point type (implying possible NaN issues when passing integers through as floats)

(3.) "operand" is an optional extra unshuffled operand which can be added to the result

The "mode" is either a direct index into the 2x2 fragment quad, or a swap in the X or Y directions.

Where,

(1.) "mode" is a bit array which can encode any permutation (not limited to just what NVIDIA exposes)

(2.) "type" can be integer or floating point

Where,

(1.) "mode" is the portable subset {0,1,2,3,X,Y} (same as NV)

(2.) "type" is limited to float-based types only for quadSwizzleFloat()

This is the direct union of common functionality from both dGPU vendors. According to the GL extension, NV returns 0 for "swizzledData" if any invocation in the quad is inactive, while AMD returns 0 for "swizzledData" only for the inactive invocations themselves. So the portable spec would leave "swizzledData" undefined if any invocation in the fragment quad is inactive. This is a perfectly acceptable compromise IMO. This would work on all AMD GCN GPUs, and on NVIDIA since Fermi for quadSwizzleFloat() and since Maxwell for quadSwizzle() (using shuffle, see below), which implies two extensions. Quads in non-fragment shaders are defined by directly splitting the SIMD vector into aligned groups of 4 invocations.
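As a sketch of the intended semantics (function name hypothetical, not part of any extension), the source-lane math for the portable quad swizzle can be emulated on plain lane indices, treating quads as aligned groups of 4:

```c
#include <assert.h>

/* Hypothetical emulation of the portable quadSwizzle() source-lane math.
   Quads are aligned groups of 4 lanes: modes '0'..'3' index directly into
   the quad, 'X' swaps in the x direction, 'Y' swaps in the y direction. */
static int quad_swizzle_src(int self, char mode) {
    int base = self & ~3;              /* first lane of this quad */
    switch (mode) {
        case 'X': return self ^ 1;     /* horizontal neighbor */
        case 'Y': return self ^ 2;     /* vertical neighbor */
        default:  return base + (mode - '0'); /* direct index 0..3 */
    }
}
```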

Where,

(1.) "mode" is one of {up, down, xor, indexed}

(2.) "data" is what to shuffle

(3.) "index" is an invocation index in the SIMD vector (0 to 31 on NV GPUs)

(4.) "width" is {2,4,8,16,32}, and divides the SIMD vector into equal-sized segments

(5.) "valid" is an optional return which is false if the shuffle was out-of-segment

Below, "startOfSegmentIndex" is the invocation index of where the segment starts in the SIMD vector, and "selfIndex" is the invocation's own index in the SIMD vector. Each invocation computes a "shuffleIndex" of another invocation to read "data" from, then returns the read "data". Out-of-segment means that "shuffleIndex" falls outside the local segment defined by "width". Out-of-segment shuffles result in "valid = false" and set "shuffleIndex = selfIndex" (to return un-shuffled "data"). The computation of "shuffleIndex" before the out-of-segment check depends on "mode".

(_____up) shuffleIndex = selfIndex - index

(___down) shuffleIndex = selfIndex + index

(____xor) shuffleIndex = selfIndex ^ index
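A hypothetical CPU-side emulation of the "shuffleIndex" computation plus the out-of-segment check described above ("width" must be a power of two; the indexed-mode formula, segment start plus "index", is an assumption based on the width semantics):

```c
#include <assert.h>
#include <stdbool.h>

enum shuffle_mode { SHUF_UP, SHUF_DOWN, SHUF_XOR, SHUF_INDEXED };

/* Emulation of the "shuffleIndex" math; "width" must be a power of two.
   Out-of-segment shuffles set *valid to false and fall back to selfIndex
   so un-shuffled "data" is returned. */
static int shuffle_index(enum shuffle_mode mode, int selfIndex, int index,
                         int width, bool *valid) {
    int start = selfIndex & ~(width - 1); /* startOfSegmentIndex */
    int shuf;
    switch (mode) {
        case SHUF_UP:   shuf = selfIndex - index; break;
        case SHUF_DOWN: shuf = selfIndex + index; break;
        case SHUF_XOR:  shuf = selfIndex ^ index; break;
        default:        shuf = start + index;     break; /* indexed (assumed) */
    }
    *valid = (shuf >= start) && (shuf < start + width);
    if (!*valid) shuf = selfIndex;
    return shuf;
}
```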

AMD can also do a swizzle across segments of 32 invocations using the following math.

and_mask = offset[4:0];
or_mask = offset[9:5];
xor_mask = offset[14:10];
for (i = 0; i < 32; i++) {
  j = ((i & and_mask) | or_mask) ^ xor_mask;
  thread_out[i] = thread_valid[j] ? thread_in[j] : 0; }

The "_mask" values are compile time immediate values encoded into the instruction.
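A runnable restatement of that lane math (the three masks decoded from the 15-bit immediate "offset" field; all invocations assumed active, so the "thread_valid" test is dropped):

```c
#include <assert.h>
#include <stdint.h>

/* Runnable sketch of the DS_SWIZZLE_B32 lane math; the three masks are
   decoded from the 15-bit immediate "offset" field. */
static void ds_swizzle(const uint32_t in[32], uint32_t out[32],
                       unsigned offset) {
    unsigned and_mask = offset & 31;         /* offset[4:0]   */
    unsigned or_mask  = (offset >> 5) & 31;  /* offset[9:5]   */
    unsigned xor_mask = (offset >> 10) & 31; /* offset[14:10] */
    for (unsigned i = 0; i < 32; i++) {
        unsigned j = ((i & and_mask) | or_mask) ^ xor_mask;
        out[i] = in[j]; /* all lanes assumed active here */
    }
}
```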

DPP can do many things,

For a segment size of 4, can do full permutation by immediate operand.

For a segment size of 16, can shift invocations left by an immediate operand count.

For a segment size of 16, can shift invocations right by an immediate operand count.

For a segment size of 16, can rotate invocations right by an immediate operand count.

For a segment size of 64, can shift or rotate, left or right, by 1 invocation.

For a segment size of 16, can reverse the order of invocations.

For a segment size of 8, can reverse the order of invocations.

For a segment size of 16, can broadcast the 15th segment invocation to fill the next segment.

Can broadcast invocation 31 to all invocations after 31.

Has option of either using "selfIndex" on out-of-segment, or forcing return of zero.

Has option to force on invocations for the operation.
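Two of the row modes above, sketched as plain array permutations over a 64-lane vector (an emulation for illustration, not the hardware encoding):

```c
#include <assert.h>
#include <stdint.h>

/* ROW_MIRROR: reverse each 16-lane row.  ROW_HALF_MIRROR: reverse each
   8-lane half row.  Emulated as array permutations over 64 lanes. */
static void dpp_row_mirror(const uint32_t in[64], uint32_t out[64]) {
    for (int i = 0; i < 64; i++) out[i] = in[(i & ~15) | (15 - (i & 15))];
}
static void dpp_row_half_mirror(const uint32_t in[64], uint32_t out[64]) {
    for (int i = 0; i < 64; i++) out[i] = in[(i & ~7) | (7 - (i & 7))];
}
```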

Supports something like this (where "temp" is in hardware),

SIMD width would be different for each platform so developer would need to build shader permutations for different platform SIMD width in some cases.

SEGMENT-WIDE BUTTERFLY

Where "width" is {2,4,8,16,32}. This is "xor" mode for shuffle on NV, and DS_SWIZZLE_B32 on AMD (with and_mask = ~0, and or_mask = 0) with possible DPP optimizations on GCN3 for "width"={2 or 4}. The XOR "mask" field for both NV and AMD is "width>>1". This can be used to implement a bitonic sort (see slide 19 here).
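A minimal emulation of the butterfly over a 32-lane vector: each lane reads from its partner at "selfIndex ^ (width>>1)", matching the XOR "mask" described above:

```c
#include <assert.h>
#include <stdint.h>

/* Segment-wide butterfly: each lane exchanges data with its partner at
   selfIndex ^ (width >> 1); "width" must be a power of two. */
static void butterfly(const uint32_t in[32], uint32_t out[32], int width) {
    for (int i = 0; i < 32; i++) out[i] = in[i ^ (width >> 1)];
}
```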

SEGMENT-WIDE PARALLEL SCAN

TODO: Weather is nice outside, will write up later...

SEGMENT-WIDE PARALLEL REDUCTIONS

Where,

(1.) "op" specifies the operation to use in the reduction (add, min, max, and, ... etc)

(2.) "width" specifies the segment width

At the end of this operation only the largest indexed invocation in each segment has the result, the values for all other invocations in the segment are undefined. This enables both NV and AMD to have optimal paths. This uses "up" or "xor" mode on NV for log2("width") operations. Implementation on AMD GCN uses DS_SWIZZLE_B32 as follows,

16 to 8 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=8

8 to 4 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=4

4 to 2 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=2

2 to 1 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=1

64 from finalized 32 => V_READFIRSTLANE_B32 to grab invocation 0 to apply to all invocations
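The NV "up"-mode path can be sketched as follows over a 32-lane vector with add as the op (an emulation with hypothetical names). Note the log2("width") up-steps actually produce an inclusive segment scan, of which the largest-indexed lane holds the full reduction, matching the weaker guarantee above:

```c
#include <assert.h>
#include <stdint.h>

/* "up"-mode reduction sketch (op = add): lanes below "step" in a segment
   contribute nothing.  After log2(width) steps each lane holds an
   inclusive prefix sum; the largest-indexed lane holds the segment total. */
static void reduce_add_up(uint32_t v[32], int width) {
    uint32_t tmp[32];
    for (int step = 1; step < width; step <<= 1) {
        for (int i = 0; i < 32; i++)
            tmp[i] = ((i & (width - 1)) >= step) ? v[i - step] : 0;
        for (int i = 0; i < 32; i++) v[i] += tmp[i];
    }
}
```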

Implementation on AMD GCN3 uses DPP as follows,

16 to 8 => reverse order of 16-wide (DPP_ROW_MIRROR)

__0123456789abcdef__

__fedcba9876543210__

8 to 4 => reverse order of 8-wide (DPP_ROW_HALF_MIRROR)

_01234567_

_76543210_

4 to 2 => reverse order using full 4-wide permutation mode

__0123__

__3210__

2 to 1 => reverse order using full 4-wide permutation mode

__0123__

__1032__

32 from finalized 16 => DPP_ROW_BCAST15

__...............s...............t...............u................__

__................ssssssssssssssssttttttttttttttttuuuuuuuuuuuuuuuu__

64 from finalized 32 => DPP_ROW_BCAST32

__...............................s................................__

__................................ssssssssssssssssssssssssssssssss__

SEGMENT-WIDE PARALLEL REDUCTIONS WITH BROADCAST

The difference from the "reduce" version above is that all invocations end up with the result. Uses "xor" mode on NV for log2("width") operations. On AMD this is the same as "reduce" except for "width"={32 or 64}. The 64 case can use V_READLANE_B32 from the "reduce" version to keep the result in an SGPR and save a VGPR. The 32 case can use DS_SWIZZLE_B32 for the 32 to 16 step.
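The broadcast variant via "xor" mode, emulated over a 32-lane vector with add as the op (hypothetical names); after log2("width") butterfly steps every lane in a segment holds the segment result:

```c
#include <assert.h>
#include <stdint.h>

/* Broadcast reduction sketch via "xor" butterflies (op = add): after
   log2(width) steps every lane in a segment holds the segment sum. */
static void reduce_add_bcast(uint32_t v[32], int width) {
    uint32_t tmp[32];
    for (int step = width >> 1; step >= 1; step >>= 1) {
        for (int i = 0; i < 32; i++) tmp[i] = v[i ^ step]; /* butterfly */
        for (int i = 0; i < 32; i++) v[i] += tmp[i];       /* op = add  */
    }
}
```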

SIMD width would be different for each platform so developer would need to build shader permutations for different platform SIMD width in various cases.

SIMD-WIDE PERMUTE

Backwards permutation of the full SIMD width is portable across platforms; it maps to shuffleNV(data, index, 32) on NV, and to DS_BPERMUTE_B32 on AMD,
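The shared backward-permute semantics ("read data from the lane named by index") can be sketched as follows (hypothetical name, 32-lane emulation):

```c
#include <assert.h>
#include <stdint.h>

/* Backward permute: every lane i reads from lane index[i] (wrapped into
   the 32-lane SIMD vector), matching "out = in[index]" semantics. */
static void bpermute(const uint32_t in[32], const uint32_t index[32],
                     uint32_t out[32]) {
    for (int i = 0; i < 32; i++) out[i] = in[index[i] & 31];
}
```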