20110907 - Parallel Programming With Clones

This is a very simple way to do CPU parallel programming which automatically handles load ballancing without requiring atomic operations. Start the same program running on N threads in parallel sharing the same memory space. Write the program such that,

(1.) The program can skip over work which it knows has been completed by another clone.

(2.) That clones will always write the same results to memory for data.

Under these rules, any thread can be preempted and the program continues, when the thread returns it quickly skips ahead to the front of computation. Threads trade sometimes redoing work for avoiding expensive communication and synchronization. A simple example,

// frameCounter advances by one each frame
uint32_t frame;
// check if should start task
volatile uint32_t taskStart;
volatile uint32_t taskEnd;
if(taskStart != frame)
{ taskStart = frame; DoTask(); MemoryBarrier(); taskEnd = frame; }
// do some other work
// verify task was completed
// before using results of task
if(taskEnd != frame)
{ DoTask(); MemoryBarrier(); taskEnd = frame; }
// use results of DoTask()

Keep data double buffered and setup a simple barrier construct so that threads cannot fall over two frames behind. Also insure that threads are running enough out of phase so that they do not all always attempt to do all the same work.

For exceptions to the rule, keep track of a threadIdx for each thread (perhaps in thread local storage). External interfaces need to be predicated to only run in one thread, for example OpenGL access requires that calls be predicated to only the thread which owns the active GL context.