20121010 - Faster GPU Integer Math Using Subnormals


Update: This won't work in DX or GL.
Update: Still looking for a way to get GL to not flush denormals to zero.


When optimizing for a GPU which either has better floating point performance than integer performance, or when optimizing a program which is integer performance limited on a GPU which can co-issue floating point and integer instructions, one can sometimes move integer instructions onto the floating point unit and see performance gains.

NVIDIA's Kepler family of GPUs falls into this category.

DX and GL
On the GPU you can freely alias between integer and floating point without any type conversion and at no cost. In GL this is done using floatBitsToInt(), floatBitsToUint(), intBitsToFloat(), and uintBitsToFloat(). In DX this is done using asint(), asuint(), and asfloat().
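
For example, a minimal HLSL sketch of the round trip (the value here is just a placeholder),

uint u = 0x00000001;  // smallest positive subnormal bit pattern
float f = asfloat(u); // bitwise reinterpret, no conversion instruction
uint v = asuint(f);   // back again, v == u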

Technique
Modern GPUs (on AMD, starting with the HD 5xxx series in 2009) support full speed subnormals (also called denormals). A subnormal's bit pattern maps linearly to its value (bit pattern n represents n * 2^-149, and this linearity holds exactly for n in {0 to 16777216}), so for addition, subnormals behave exactly like unsigned integers over that range. So as long as the programmer can ensure no overflow or underflow, something like this should just work,

uint a; // both inputs assumed in {0 to 16777216}
uint b;
uint c = asuint(asfloat(a) + asfloat(b)); // addition happens on the FP unit


For integer addition simulated with subnormals, a runtime constant which needs to be negative can be built as "abs(N)+0x80000000", which sets the floating point sign bit. Note the programmer still needs to ensure all intermediate computations never underflow or overflow the safe {0 to 16777216} range.
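
For example, a sketch of subtraction built this way (assuming b <= a so the result stays in the safe range),

uint negB = b + 0x80000000;                  // abs(b) with the float sign bit set
uint c = asuint(asfloat(a) + asfloat(negB)); // c == a - b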

Also note that on modern GPUs like Kepler and Fermi, negation of a constant indexed by a compile-time immediate is typically free. For example, both of these lines effectively have the same performance on Kepler or Fermi in something like a pixel shader,

cbuffer cb0 : register(b0) { float4 f[16] : packoffset(c0); };
float d = (a * b) - c;              // plain multiply-add
float d = (a * (-f[0].x)) - f[1].x; // negated constant costs nothing extra


For integer multiplication simulated with subnormals, one of the terms must be an actual (normal) floating point number, since the product of two subnormals underflows. Also, in the multiplication-only case, underflow simply clamps to zero. So it is really easy to simulate bit shifts; these lines produce the same result whenever the discarded low bits are zero (otherwise the float multiply rounds to nearest while the shift truncates),

uint c = a >> 5;                          // integer right shift by 5
uint c = asuint(asfloat(a) * (1.0/32.0)); // multiply by 2^-5 on the FP unit
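
Left shifts and multiplies by small integer constants follow the same pattern, and unlike right shifts they are exact as long as the result stays in the safe range. A quick sketch (no overflow checking, the constants are just examples),

uint c = asuint(asfloat(a) * 8.0); // a << 3, exact while (a << 3) <= 16777216
uint d = asuint(asfloat(a) * 5.0); // a * 5, exact while (a * 5) <= 16777216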


Double Check in Practice
Note that run-time optimizing compilers sit between your GLSL or HLSL source and the GPU's native instruction stream, so before attempting to leverage this, double check that it actually works...
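
For example, a minimal sketch of such a check as an HLSL compute shader (the buffer names and layout are just assumptions for illustration): fill testIn[] with runtime values from the CPU so the compiler cannot constant fold the math, then read back testOut[0],

StructuredBuffer<uint> testIn : register(t0);    // two test values written by the CPU
RWStructuredBuffer<uint> testOut : register(u0); // 1 if the technique works, else 0

[numthreads(1, 1, 1)]
void main()
{
    uint a = testIn[0];
    uint b = testIn[1];
    uint viaFloat = asuint(asfloat(a) + asfloat(b));
    // Matches integer addition only if subnormals were not flushed to zero.
    testOut[0] = (viaFloat == (a + b)) ? 1 : 0;
}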