20070926 - Drawing in Reverse


On the Topic of Alpha Blending
I had a theory that only about 8 times of overdraw per pixel would be necessary to render everything in Atom. I am currently using upwards of 32 times overdraw per pixel, so if I could skip 3/4 of the overdraw, this would be a tremendous performance win. So I switched the rendering from back-to-front to front-to-back, changed the alpha blending equation to match, and added a stencil test so that only the 8 front-most impostors per pixel get drawn. The result mostly worked, with one problem: when the first 8 pixels drawn are all low alpha, there is still some artifacting. Adding an alpha test to clip out really low alpha pixels, so they didn't count toward my 8 pixel limit, helped but didn't fix the problem. A more innovative solution was needed!
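As a rough illustration of the front-to-back idea, here is a CPU-side sketch for a single pixel. It is not the Atom code. It assumes the standard front-to-back ("under") blend, where each new layer is weighted by the coverage still remaining, and it uses a plain counter in place of the stencil buffer. The function names, the single-channel colors, and the `alpha_cutoff` parameter are all made up for the example.

```python
def composite_front_to_back(layers, max_layers=8, alpha_cutoff=0.0):
    """Blend (color, alpha) layers sorted front-most first, for one pixel.

    Returns (accumulated color, accumulated coverage, layers drawn).
    Colors are single floats here to keep the sketch short.
    """
    acc_color = 0.0
    acc_alpha = 0.0
    drawn = 0  # stands in for the per-pixel stencil counter
    for color, alpha in layers:
        if drawn >= max_layers:
            break  # stencil test fails: overdraw limit reached
        if alpha <= alpha_cutoff:
            continue  # alpha test: rejected layers do not use up the limit
        weight = (1.0 - acc_alpha) * alpha  # coverage left times layer alpha
        acc_color += weight * color
        acc_alpha += weight
        drawn += 1
    return acc_color, acc_alpha, drawn


def composite_back_to_front(layers):
    """Classic 'over' blending onto black, for comparison."""
    color = 0.0
    for c, a in reversed(layers):
        color = a * c + (1.0 - a) * color
    return color
```

With fewer than 8 layers the two orderings give the same color. With 32 layers of alpha 0.02, only 8 get drawn and the accumulated coverage stays around 0.15, which is the low alpha case described above.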

If you think about it, when a pixel is generated by the overlap of many low alpha sprites, it is usually representing some kind of fog or haze. And this fog or haze usually has a similar color to the surrounding pixels. So if the accumulated coverage of a pixel is very low after drawing 8 pixels, it is probably safe to assume the fog/haze case. Now I had a solution to the problem.

The solution is to add one more pass, drawing a 1/2 down-sampled copy (using the GPU's automatic mipmap generation) of the previous frame as the last back-most overdraw pass. The down-sampling blurs the pixels slightly (fog/haze), and fills in the areas of low alpha accumulation. Given a good 30 fps, the convergence of the algorithm is invisible to the eye. And it worked, really really well!
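The extra pass can be sketched on the CPU like this. This is only an approximation of what the GPU does: it assumes the accumulated color is already weighted by coverage, uses a 2x2 box filter as a stand-in for one mipmap level, and samples the half-size image with nearest lookup rather than filtered texture reads. The names `downsample_half` and `fill_from_previous` are hypothetical.

```python
def downsample_half(img):
    """2x2 box filter over a list of rows of floats (even width and height)."""
    h, w = len(img), len(img[0])
    return [
        [
            (img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
            for x in range(0, w, 2)
        ]
        for y in range(0, h, 2)
    ]


def fill_from_previous(acc_color, acc_alpha, prev_frame):
    """Use the blurred previous frame as the back-most layer.

    acc_color is assumed to be coverage-weighted, acc_alpha is the
    accumulated coverage after the limited front-to-back passes.
    """
    half = downsample_half(prev_frame)
    out = []
    for y in range(len(acc_color)):
        row = []
        for x in range(len(acc_color[0])):
            remaining = 1.0 - acc_alpha[y][x]  # coverage still unfilled
            row.append(acc_color[y][x] + remaining * half[y // 2][x // 2])
        out.append(row)
    return out
```

Pixels that are already fully covered are left alone, and pixels with low accumulated coverage mostly take on the blurred color from the previous frame.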

Final Step to a Huge Performance Win
The stencil test already helps quite a lot by skipping the fragment shader (and thus 2 texture reads and 1 ROP blend). But there is a faster way: eliminate large groups of pixels well before the stencil check. After some research, it looks as if only the newest AMD/ATI GPUs have a hierarchical stencil buffer, which enables the stencil pass to clip out groups of pixels (say 16 or 32) at a time. So the next best option is to use the hierarchical z-cull hardware, which I believe is similar in function on all DX10 type cards.

Filling the Z buffer is another subproblem. It looks like to use the z-cull, I'm going to have to draw polygons with alpha test off and no fragment shader depth write. So my idea is to first draw a mini framebuffer (x/4 by y/4) using the stencil idea, but writing only Z into a texture instead of color, so that the last Z drawn at each mini pixel is that of the 8th pixel drawn there. Then I would use a vertex shader to generate two triangles per pixel of the mini framebuffer, and do a depth-only write of the resulting Z values into the full size Z buffer. The z-cull hardware should then be primed to quickly chop groups of pixels which exceed the overdraw limit.
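Here is a small CPU-side sketch of the data flow, not a GPU implementation. It assumes smaller depth means closer, a far plane at 1.0, and a less-or-equal depth test, none of which is settled above. The helper names are hypothetical.

```python
def nth_layer_depth(depths_front_to_back, limit=8, far=1.0):
    """Depth of the limit-th fragment on a mini pixel, or far if there are fewer."""
    if len(depths_front_to_back) < limit:
        return far  # limit never reached, so nothing should be culled here
    return depths_front_to_back[limit - 1]


def expand_mini_depth(mini, scale=4):
    """Replicate each mini pixel into a scale x scale block of the full Z buffer."""
    height = len(mini) * scale
    width = len(mini[0]) * scale
    return [[mini[y // scale][x // scale] for x in range(width)] for y in range(height)]


def passes_depth_test(fragment_depth, cull_depth):
    """Less-or-equal test: anything behind the stored depth is rejected."""
    return fragment_depth <= cull_depth
```

The sketch copies one depth value to all 16 full size pixels under a mini pixel. Whether that value is a safe cull depth for every one of those pixels would need checking on real scenes.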

With the stencil optimization alone, I am again CPU bound. So I probably won't get to my z-cull test until I get the CPU side better optimized (need to finish my Atom4th stuff).