20100102 - PC CPU Task Parallelism Limits


All numbers below are from a Windows XP + 2.39 GHz Intel Core 2 Duo laptop. Performance on other machines and configurations will vary (duh).


Some Numbers

Histogram of the measured clock cycles/task of one thread processing the same ~10K clock cycle task many times (in a loop) on the above laptop,

835330 ... 0-16K cycles
8005 ..... 16K-64K cycles
1178 ..... 64K-256K cycles
66 ....... 256K-1M cycles
20 ....... 1M-4M cycles
5 ........ greater than 4M cycles


This specific task is a fully ALU bound dummy task which spins doing pure math (all values in registers). Run time variability is a function of hyper-threading (minor effect) and operating system preemption (major effect). Time scale here is somewhere around 4 seconds of total run time (for 0.8M tasks), which could be 120 frames of a 30 Hz game on this laptop. Note the variability.


Preemption

A major problem for any task parallel system on the CPU is preemption at non-task boundaries. The net result is that tasks at random could easily get stalled for over 1 millisecond (happened 5 times in 4 seconds in the above example). Any tasks which depend on the result of those stalled tasks also get stalled.

The visible result is variable frame rates and a screwed over player/user experience when the application attempts to have more than a few task parallel dependencies per video frame.


Workarounds

The best solution is cooperative multitasking: tasks release the CPU when they are finished, programmers insure task run time is at the desired program response latency. As desktop operating systems have de-evolved over the years we have lost the ability to provide a correct solution.

Are there any workarounds?

UNDER-UTILIZE CPU - In order to always hit v-sync, the program can under-utilize the CPU such that the maximum amount of time taking to compute a frame is the minimum amount of time in the worst case preemption. Clearly preemption is not a bounded problem, a program will always have to accept some kind of frame drop.

DUPLICATE TASKS - With a painful case of over-engineering, one could build a task system with transactional memory, such that if a currently needed task dependency is not completed due to preemption, that a running thread could manually duplicate and run the dependent task.

Anything less painful?


Manual Self Preemption?

Seemed like a great idea: keep track of thread run time and yield execution to another task worker thread before the operating system would preempt the thread's time-slice. This way the program could simulate cooperative multitasking switching at task boundaries.

Epic fail on the XP Laptop!

Of the two ways to do this with fixed thread affinity: pair of threads yielding execution, or pair of threads blocking/releasing each other via signals, all have high overhead and do not fully solve the variability problem.

So anything less painful?

No!