20140621 - Filtering and Reconstruction : Points and Filtered Raster Without MSAA

Just ran across these old screen captures (open images in a new window, they are likely getting cropped), stuff I'm no longer working on, but might re-invest in sometime in the future...

Reconstruction with Points or Surfels

The above shot is a HDR monochrome image (converted to color) with a little grain and bloom, of a fractal with lots of sub-pixel geometry rendered in a 360-degree fisheye with point based reconstruction. There is an average of 2 samples/pixel (using 1 bin/pixel but 2x the screen area). The GPU keeps a tree of the scene representation which is updated based on visibility, and then bins all the leaf nodes each frame. Finally there is a reconstruction pass which takes {color, sub-pixel offset} for all the binned points and then reconstructs the scene. Reconstruction is a simple gaussian filter which does a weighted average of a neighborhood of points with weights based on distance from pixel center. For HDR pre-convert to color*rcp(luma+1) as well to remove the HDR fire-fly problem (not energy conserving), remembering to un-convert after filtering. Binning of points uses some stratified jitter so that each frame gets a slightly different collection of points for filtering. Did not use any temporal filtering. The result is a quality analog feeling reconstruction which has soothing temporal noise. Need about 2 samples/pixel for good quality if not using temporal filtering. With really good temporal filtering (see Brian Karis's talk at Siggraph) only need on average one shaded sample/pixel/frame.

Binning for the above image was done using standard point raster, everything running in the vertex shader, near null fragment shader. This works ok on NVIDIA GPUs, but I would not advise for AMD. Future forward, with GPUs which have 64-bit integer min/max atomics and ability to run a full cache-line of atomics in one clock, the binning process should be quite fast. Bin layout in cache-lines must have good 2D locality (atomics on texture not buffer). Evaluate points in groups which have good spatial locality. Ideally should hide shading operations in the shadow of atomic throughput. If not enough work to fill that shadow, then try a pre-resolve atomic collisions in shared memory. Pack pixels into 64-bits {MSB logZ, LSB color and a few bits of sub-pixel offset}. The color is stored scaled by rcp(luma+1) and uses dithering in the conversion.

Points can early out by fetching destination pixels and checking if they are more distant than the current framebuffer result. Just make sure to schedule the load early to avoid the latency problem. Can extend this theory to scene traversal and do a hierarchical z-buffer pre-pass. Render out in a compute shader anti-conservative points into the hierarchy. This process would involve culling paths in the scene graph, and appending a list of nodes (point groups) which need to get rendered. Load balancing this traversal is a challenge which is worthy of another blog post.

Common in both point based techniques and ray-marching with fixed cost/frame algorithms, is that depending on traversal cost, some percentage of screen pixels might have holes or might have fractional coverage. It is possible to hole fill leveraging temporal reprojection, also a topic worthy of another blog post.

Reconstruction with Rotated Rendering and 2x Super-Sampling

Another monochrome image (converted to color) this time of a lot of unlit boxes. This image is also using a slight image warp and some out-of-focus vignette. This was generated by rendering an average of 2x super-sampling, but rendering the frame rotated. This removes the garbage sample pattern of regular non-rotated rendering. The result with a proper reconstruction filter (I used gaussian for performance with weights based on sample distance from pixel center) is something which is similar to 3xMSAA (horizontal and vertical gradients have three visible primary steps if the frame rotation angle is chosen correctly). There is a high cost in memory for rendering rotated. However the pre-z-fill (set to near plane) pass to mask out the invisible extents is super fast.