Interesting talk comparing methods - GTC S4385: Order Independent Transparency in OpenGL by Christoph Kubisch, NVIDIA

Given that OIT for billboards does not solve the order change problem, or the stereo flat card problem, I eventually lost interest and never got around to trying most of this. Here are some ideas to extend the performance of the k-buffer approach, which uses a fixed number of bins per pixel and an atomic counter to decide which bin in the k-buffer to write to.

Packing {color, alpha, depth} into 64 bits could be done with {24-bit logarithmic depth, 8-bit alpha, 32-bit RGBD}, which provides both a well-behaved depth distribution (constant relative precision across the range) and HDR color.
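
As a rough illustration of that layout, here is a minimal CPU-side sketch in Python. The near/far range, the particular RGBD variant (8-bit RGB with a shared 8-bit divisor D, decoded as channel / D), and the choice to put depth in the top bits are all my assumptions for the example, not something the packing idea dictates.

```python
import math

NEAR, FAR = 0.1, 10000.0  # assumed depth range, for illustration only

def encode_depth24(z):
    # Logarithmic mapping of [NEAR, FAR] onto 24 bits.
    z = min(max(z, NEAR), FAR)
    t = math.log(z / NEAR) / math.log(FAR / NEAR)
    return min(0xFFFFFF, int(t * 0xFFFFFF + 0.5))

def decode_depth24(d):
    return NEAR * (FAR / NEAR) ** (d / 0xFFFFFF)

def encode_rgbd(r, g, b):
    # Assumed RGBD variant: non-negative inputs, decoded value = channel / D.
    m = max(r, g, b)
    d = 255 if m <= 1.0 else max(1, int(255.0 / m))
    q = [min(255, int(c * d + 0.5)) for c in (r, g, b)]
    return (q[0] << 24) | (q[1] << 16) | (q[2] << 8) | d

def decode_rgbd(v):
    d = v & 0xFF
    return tuple(((v >> s) & 0xFF) / d for s in (24, 16, 8))

def pack(z, alpha, rgb):
    # Layout (assumed): [63:40] depth, [39:32] alpha, [31:0] RGBD.
    a8 = min(255, max(0, int(alpha * 255 + 0.5)))
    return (encode_depth24(z) << 40) | (a8 << 32) | encode_rgbd(*rgb)

def unpack(bits):
    z = decode_depth24(bits >> 40)
    alpha = ((bits >> 32) & 0xFF) / 255
    return z, alpha, decode_rgbd(bits & 0xFFFFFFFF)
```

With depth in the most significant bits, comparing two packed values as unsigned 64-bit integers orders them by depth first, which could be handy for the sort at resolve time.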

Thanks to the awesome NV_shader_thread_group extension by NVIDIA, one can now bin by 2x2 quads instead of pixels to reduce the atomic load. Use gl_ThreadInWarpNV to find which quad you are in (divide by 4). Then use ballotThreadNV() to find the first active thread per quad. Only that first active thread per quad does the atomic +1 (to find the bin in the k-buffer to write into). Finally, use the quadSwizzle*NV() functions to communicate the atomic return value (the bin) to all threads in the 2x2 quad.
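
To make the control flow concrete, here is a small CPU emulation in Python of one 32-thread warp. It is not shader code: `ballot`, `allocate_bins`, `counters`, and `quad_addr` are hypothetical names for this sketch, and it assumes each group of 4 consecutive threads in the warp forms one 2x2 quad with one counter per quad.

```python
WARP_SIZE = 32

def ballot(active):
    # Stand-in for ballotThreadNV(true): bit i is set if thread i is active.
    mask = 0
    for i, is_active in enumerate(active):
        if is_active:
            mask |= 1 << i
    return mask

def allocate_bins(active, counters, quad_addr):
    """Return (bin per thread or None, number of atomics issued)."""
    mask = ballot(active)
    bins = [None] * WARP_SIZE
    atomics = 0
    for quad in range(WARP_SIZE // 4):       # quad = thread index // 4
        quad_mask = (mask >> (quad * 4)) & 0xF
        if quad_mask == 0:
            continue                          # no active thread in this quad
        # The lowest set bit marks the first active thread, which would
        # issue the atomic; every active thread derives the same answer.
        leader = (quad_mask & -quad_mask).bit_length() - 1
        assert (quad_mask >> leader) & 1
        addr = quad_addr[quad]
        old = counters[addr]                  # emulated atomic add: returns old value
        counters[addr] = old + 1
        atomics += 1
        # Stand-in for the quad swizzle: share the result within the quad.
        for lane in range(4):
            if (quad_mask >> lane) & 1:
                bins[quad * 4 + lane] = old
    return bins, atomics
```

In this emulation a warp issues at most 8 atomics instead of up to 32, which is the reduction in atomic load the quad approach is after.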

With a rough object z-sort before OIT, one could extend the effective k-buffer per-pixel array size using dithering and the 2x2 quad approach. Given pixels ABCD in a 2x2 quad, for example, an 8-deep k-buffer could be extended to: 4 full quads ABCD, then half quads {1 AD, 1 BC}, {1 AD, 1 BC}, then quarter quads {1 A, 1 D, 1 B, 1 C}, {1 A, 1 D, 1 B, 1 C}. That gives an effective 16 bins, with lower resolution as the bins fill up. The reconstruction filter would need to restore the full resolution image. Some improvement could be made by adjusting the dithering pattern so that the quarter quad patterns change per quad and change temporally (use a temporal feedback post filter to help remove the noise).
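
One way to read that layout is as a mapping from the quad-wide bin index to the pixels that store a sample and the per-pixel slot they use. The Python sketch below is only my interpretation of the example above, with a fixed (non-dithered, non-temporal) pattern and hypothetical names.

```python
HALF_GROUPS = ("AD", "BC")   # diagonal pairs used by the half-resolution bins
QUARTER_ORDER = "ADBC"       # single pixels used by the quarter-resolution bins

def bin_layout(bin_index):
    """Map a quad-wide bin index (0..15) to (pixels that store, per-pixel slot)."""
    if bin_index < 4:
        # Full resolution: all four pixels store their own sample.
        return "ABCD", bin_index
    if bin_index < 8:
        # Half resolution: alternate between the AD and BC pairs.
        i = bin_index - 4
        return HALF_GROUPS[i % 2], 4 + i // 2
    if bin_index < 16:
        # Quarter resolution: one pixel per bin.
        i = bin_index - 8
        return QUARTER_ORDER[i % 4], 6 + i // 4
    return "", None          # overflow: nothing left to store into
```

Each pixel ends up using its 8 slots exactly once (4 full, 2 half, 2 quarter), while the quad as a whole covers 4 + 4 + 8 = 16 bins.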