20120109 - SBAA Paper, SRAA, and Texture-Aligned Deferred Shading (TADS)

I'm going to describe a new deferred shading technique, Texture-Aligned Deferred Shading (TADS), below, but first some notes on SRAA and the SBAA paper by Marco Salvi and Kiril Vidimce.

(1.) SBAA has two rendering passes. The paper describes the first drawing pass as similar to a z pre-pass, but leaves out some important details. On many GPUs a Z pre-pass can be rendered at 2x the fill rate, and a Z pre-pass can often use lower poly conservative proxy geometry, etc. SBAA requires full geometry for both passes. Two full geometry passes are likely a bad idea for tessellated geometry (or really high polygon non-tessellated geometry) and bad for the number of required draw calls. A single-pass solution is highly desirable.

(2.) SBAA uses the same primitiveID trick mentioned in the NVIDIA SIGGRAPH presentation on SRAA. Matthaus and I came up with the primitiveID option for SRAA as an optimization, but didn't have time to fully explore the best solution when mixed with tessellation, where primitiveID is per patch instead of per triangle. When used with vector displacement, one would likely want to create a modified hash value in the domain shader for each triangle, based on primitiveID and some parametric coordinates.

Side note on SRAA: it should be possible to do Matthaus's SRAA in one pass with DX11 and surface writes. In the MSAA pass which generates primitiveID (or something better for tessellation), also write the G-buffer, except early exit in the pixel shader when the coverage for the pixel doesn't intersect the "center sample" used for SRAA. For pixels which intersect the center sample, write out the G-buffer using surface writes.

(3.) SBAA has the same problem MSAA has: screen-aligned shading, or filtering before shading, is the wrong way to shade for moving pictures.

Texture-Aligned Deferred Shading (TADS)

I'm describing this here, because right now I don't have time to actually try this...

Ideally for visual quality per number of shaded samples, shading would be aligned to the texels used to texture shaded surfaces. Filtering, to re-sample shading results to the screen grid, would ideally happen after shading for temporal stability. TADS is a method to do exactly this.

TADS requires "software" virtual texturing (like mega-texture), or some other method where all surface texture data can be indexed from one position in a set of textures (or texture layers) referred to as the "physical texture" below. This position will be referred to as "physical position" below. For an 8K by 8K sized set of textures, a 32-bit physical position gives 16 bits per axis, of which 13 bits address the texel, leaving only 3 bits of sub-texel precision (which might not be enough, so might need to use more bits).
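The bit budget above can be checked with a minimal sketch. It assumes a 16.16 split of the 32 bits (x in the low half, y in the high half) and unsigned fixed point with 3 fraction bits per axis; the layout and the function names are my own illustration, not something the pipeline requires.

```python
TEXEL_BITS = 13                      # 8K = 2**13 texels per axis
AXIS_BITS = 16                       # assumed: 32 bits split evenly over x and y
FRAC_BITS = AXIS_BITS - TEXEL_BITS   # 3 bits left for sub-texel precision


def pack_physical_position(u, v):
    """Pack texel-space coordinates in [0, 8192) into one 32-bit integer."""
    scale = 1 << FRAC_BITS
    max_fixed = (1 << AXIS_BITS) - 1
    # Truncate to 1/8 texel steps and clamp to the representable range.
    x = min(max(int(u * scale), 0), max_fixed)
    y = min(max(int(v * scale), 0), max_fixed)
    return (y << AXIS_BITS) | x


def unpack_physical_position(packed):
    """Recover texel-space coordinates, quantized to 1/8 of a texel."""
    scale = 1 << FRAC_BITS
    mask = (1 << AXIS_BITS) - 1
    return (packed & mask) / scale, (packed >> AXIS_BITS) / scale
```

With 3 fraction bits the position snaps to a grid of 1/8 texel, so any filtering weight derived from it has only 8 distinct levels per axis, which is the reason more bits might be needed.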

The TADS pipeline works as follows:

(1.) Only one rendering pass, a modified G-buffer generation pass. Write out {normal, binormal, and physical position}.

(2.) Find unique samples pass. This is a full screen pass which reads the G-buffer's physical position, computes the number of texture aligned samples which need to be shaded, and writes this information to Z. The number of unique samples to be shaded is packed in the high bits, and a 4-bit bitmask of the samples to shade in the 2x2 texel footprint in the physical texture is packed in the low bits. Each physical position can require between one and four shaded texels, and in most cases neighboring pixels have physical positions which share parts of the shaded texel footprint. To ensure samples are not shaded multiple times, each pixel fetches the physical position of the North, North West, and West pixels, then removes any texels shared with those 3 neighbors from this pixel's list of samples to shade.

(3.) Deferred shading passes. Each pixel has a variable number of samples to shade (much like mixing MSAA and deferred shading). To optimize for this case, use depth bounds test on the Z value generated in (2.) to split the shading up into 4 passes. First pass for pixels which have one sample, second pass for pixels which have two samples, etc. Shaders for each of the 4 passes are optimized to shade {1,2,3,4} samples at the same time respectively. These passes shade texels in the physical texture, and use surface stores to write the results into a "shade results physical texture". AMD hardware doesn't support depth bounds test, so an alternative method is needed there.

(4.) Filtering pass. Each pixel fetches the physical position from the G-buffer, and samples the filtered results from the "shade results physical texture".
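Steps (2.) and (3.) can be sketched on the CPU as below. This is a rough illustration under several assumptions of mine: positions are in texel units with texel centers at half-integers, the footprint is the usual bilinear 2x2 (collapsing to fewer texels when the position sits exactly on a texel center row or column), the Z value is laid out as count in the bits above a 4-bit mask, and plain binning by count stands in for the depth bounds test. All function names are hypothetical.

```python
import math


def footprint(pos):
    """Map each texel touched by a bilinear fetch at pos to its bit in the 2x2 mask."""
    x, y = pos[0] - 0.5, pos[1] - 0.5
    bx, by = math.floor(x), math.floor(y)
    # A position exactly on a texel center needs only one column (or row).
    cols = (0,) if x == bx else (0, 1)
    rows = (0,) if y == by else (0, 1)
    return {(bx + dx, by + dy): dx + 2 * dy for dy in rows for dx in cols}


def find_unique_samples(positions):
    """positions[py][px] is the physical position of a pixel; row 0 is the north row.

    Returns, per pixel, (count << 4) | mask after dropping texels that the
    North, North West, and West neighbors also touch.
    """
    height, width = len(positions), len(positions[0])
    packed = [[0] * width for _ in range(height)]
    for py in range(height):
        for px in range(width):
            texels = footprint(positions[py][px])
            for nx, ny in ((px, py - 1), (px - 1, py - 1), (px - 1, py)):
                if nx >= 0 and ny >= 0:
                    for shared in footprint(positions[ny][nx]):
                        texels.pop(shared, None)
            mask = 0
            for bit in texels.values():
                mask |= 1 << bit
            packed[py][px] = (len(texels) << 4) | mask
    return packed


def split_into_passes(packed):
    """Bin pixels by sample count, standing in for the 4 depth-bounds passes."""
    passes = {1: [], 2: [], 3: [], 4: []}
    for py, row in enumerate(packed):
        for px, value in enumerate(row):
            count = value >> 4
            if count:
                passes[count].append((px, py))
    return passes
```

Pixels whose count drops to zero fall out of every shading pass, and only fetch filtered results in (4.). Note the three-neighbor test only removes texels shared with those neighbors, so a texel shared only with a pixel outside that set can still be shaded more than once.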

Adding in LOD and anisotropy to improve filtering:

(5.) Modify the G-buffer generation pass (1.) to also write out a gradient vector to provide a line to filter along, and the thickness of this line (LOD or simulation of mip-mapping).

(6.) Modify the shading passes (3.) to output the lower bits of a frame counter into the alpha channel.

(7.) Modify the filtering pass (4.) to fetch the gradient vector and thickness from the G-buffer, and manually sample the anisotropic footprint (simplify to a few samples along the vector) and for LOD a few samples to do a low-pass filter. Compare the filtered alpha value of texture fetches to the frame counter. As the alpha value starts to deviate from the frame counter value, reduce the weight of the associated texture fetch when blending for the final filtered value. The deviation signifies that the filtered result is fetching from a texel which was last shaded in prior frames (temporally stale).
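The stale-texel weighting in (7.) might look something like the sketch below. The specifics are assumptions on my part: alpha holds the low 8 bits of the frame counter, the weight falls off linearly with the deviation, and if every tap is stale the filter falls back to the unweighted result. The names and the falloff parameter are hypothetical.

```python
FRAME_BITS = 8                # assumed width of the frame counter stored in alpha
FRAME_MOD = 1 << FRAME_BITS


def staleness_weight(filtered_alpha, frame_counter, falloff=1.0):
    """Weight in [0, 1]: 1 when shaded this frame, dropping as the fetch gets stale."""
    current = frame_counter % FRAME_MOD
    # Modular distance, so a counter that wrapped still reads as "recent".
    deviation = (current - filtered_alpha) % FRAME_MOD
    return max(0.0, 1.0 - falloff * deviation)


def filter_taps(taps, frame_counter, falloff=1.0):
    """Blend taps of (rgb, filtered_alpha, filter_weight) along the footprint.

    Assumes the filter weights are positive.
    """
    weights = [w * staleness_weight(a, frame_counter, falloff) for _, a, w in taps]
    if sum(weights) == 0.0:
        # Everything is stale: use the plain filter weights rather than output black.
        weights = [w for _, _, w in taps]
    total = sum(weights)
    return tuple(
        sum(wt * color[i] for wt, (color, _, _) in zip(weights, taps)) / total
        for i in range(3)
    )
```

Because the alpha is itself filtered, a fetch that mixes a fresh texel with one shaded a frame ago lands halfway between the two counter values and gets a partial weight. A filtered fetch that straddles the counter wrap would give a meaningless alpha, which this sketch does not handle.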

Reducing the shading rate:

(8.) One can reduce the frequency of shading by modifying the "find unique samples" pass (2.) to simply mark some texels as already shaded, so they are left out of this frame's shading passes. The engine will automatically use the prior shaded value for those texels, since it is still in the "shade results physical texture". It is better to mark tiles of texels as shaded instead of random texels. There are a lot of possibilities here, which I will leave for a future post.
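As one purely illustrative possibility, tiles could be skipped in a rotating pattern. The tile size, the checkerboard style schedule, and the function names below are all hypothetical choices of mine, picked only to show where the marking would hook into the unique samples pass.

```python
TILE_SIZE = 8  # hypothetical tile size in texels


def tile_of(texel):
    """Tile coordinates of a texel in the physical texture."""
    return (texel[0] // TILE_SIZE, texel[1] // TILE_SIZE)


def tiles_marked_shaded(tiles, frame_counter, period=2):
    """Hypothetical schedule: each tile is re-shaded once every `period` frames."""
    return {t for t in tiles if (t[0] + t[1] + frame_counter) % period != 0}


def texels_to_shade(candidates, marked):
    """Drop texels whose tile is marked as already shaded for this frame."""
    return [t for t in candidates if tile_of(t) not in marked]
```

A texel dropped here keeps whatever value and frame counter it had from the last time its tile was shaded, so the filtering pass in (7.) would see it as slightly stale and weight it down accordingly.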