20090602 - Thread Scheduling Part 1


Decided to revisit the problems of IO and scheduling across all platforms to see what has changed since I last checked (turns out, not much at all)! Beginning with Windows, Charles Bloom's post is a great place to start.


Windows Thread Scheduling Refresher

Scheduling Priorities - Priorities range from 1 (lowest) to 31 (highest). Threads at each priority level are scheduled round-robin. The scheduler runs the highest-priority runnable thread at all times, so it is possible for higher-priority threads to starve lower-priority threads. Base priority is set from the process priority class plus the thread priority level (within the class).
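
For concreteness, a minimal sketch of how base priority is composed from the two Win32 calls (the class and level values here are just illustrative picks):

#include <windows.h>

int main(void)
{
    // Base priority = process priority class + thread priority level.
    SetPriorityClass(GetCurrentProcess(), NORMAL_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
    // NORMAL_PRIORITY_CLASS + THREAD_PRIORITY_HIGHEST yields base priority 10.
    return 0;
}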

If a higher-priority thread becomes available to run, the system stops the lower-priority thread (before it finishes its time slice) and runs the higher-priority thread for a full time slice.

That is what the docs say. Here is my interpretation: threads are scheduled to run for a "quantum" number of timer interrupts (which, BTW, increases when the dynamic priority gets bumped). If a higher-priority thread becomes available to run, it could in theory get scheduled at two places: (1.) the timer interrupt, or (2.) after an IO interrupt (or kernel call) is processed which results in a higher-priority thread getting marked as runnable. *** I have a feeling case (2.) doesn't happen on Windows. Which means that scheduling interrupt timer granularity is of huge importance for any blocking IO worker threads! Default Windows interrupt timer granularity was previously no better than 10-15ms (I don't know what it is now). Good idea to fix this with timeBeginPeriod(1) for games to get 1ms granularity.
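
A minimal sketch of the timeBeginPeriod(1) fix (link with winmm.lib); timeGetDevCaps is an easy way to check what granularity the system actually supports:

#include <windows.h>
#include <mmsystem.h> // timeBeginPeriod/timeEndPeriod, link winmm.lib

int main(void)
{
    TIMECAPS tc;
    if (timeGetDevCaps(&tc, sizeof(tc)) == TIMERR_NOERROR)
    {
        // Request the finest timer period supported, ideally 1ms.
        UINT period = tc.wPeriodMin;
        timeBeginPeriod(period);

        // ... game runs here with ~1ms scheduling granularity ...

        // Always pair with timeEndPeriod() before exit.
        timeEndPeriod(period);
    }
    return 0;
}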

SwitchToThread() can be used to yield to the next thread in line to run on the current processor only. The calling thread is re-queued for the next pass of round-robin. Note that if no other thread of equal priority is available to run, this does nothing (higher-priority threads would have already preempted the lower-priority thread automatically).
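
A quick sketch of a polite spin-wait built on SwitchToThread() (the g_workReady flag is a hypothetical placeholder for real wait state):

#include <windows.h>

static volatile LONG g_workReady = 0; // hypothetical, set by a producer thread

void PoliteSpinWait(void)
{
    // Give the rest of our quantum to any runnable thread of equal
    // priority on this processor; if none exists, SwitchToThread()
    // returns FALSE immediately and we just keep polling.
    while (!g_workReady)
    {
        SwitchToThread();
    }
}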

Priority Boosts - Actual priority is base priority + dynamic priority. Windows boosts priority in the following cases: (1.) when a NORMAL_PRIORITY_CLASS process becomes the foreground window, (2.) when events are sent to a window's event queue, the thread associated with that queue gets a boosted priority, and (3.) when wait conditions are satisfied for a blocked thread. Set(Process|Thread)PriorityBoost() can be used to control this.
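
A sketch of turning boosting off and on; note the parameter means *disable*, so TRUE turns boosting off:

#include <windows.h>

int main(void)
{
    // Disable dynamic priority boosts for the whole process...
    SetProcessPriorityBoost(GetCurrentProcess(), TRUE);
    // ...but re-enable them for this particular thread.
    SetThreadPriorityBoost(GetCurrentThread(), FALSE);
    return 0;
}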


To Place Scheduling Granularity In Perspective

Quantums vary from 20 to 120ms across NT4/2000/XP/2003, depending on priority boost, workstation or server OS, etc. Note with a 60Hz refresh rate (a 16.67ms frame), timeBeginPeriod(1) 1ms scheduling granularity, and assumption *** above, this results in only ~16 chances per frame for a thread to get involuntarily preempted (by a higher-priority thread)! Mix in background tasks/IO and it is easy to see wild fluctuation in run-time. The only way to hit more task switches per frame is to have threads manually blocking (on IO or an object) in a way that isn't a spinlock. Note other applications might not play nice, and more task switches per frame means more time doing nothing.

If there is anything to be learned from this,

(1.) Any IO-blocking worker threads need to be given a high enough priority to ensure they always preempt "compute" threads, which would otherwise run for their entire quantum. With any luck, in any of these ~16 chances/frame for a compute thread to get preempted, the higher priority of the IO-blocking threads would then ensure any unblocked IO threads get run before another compute thread gets reissued (see the sketch after this list).

(2.) Interactive in-focus high-FPS applications pushing CPU limits cannot really afford to play nice and Sleep() away free CPU time before the next frame. However (and I haven't tried this yet), there might be a way to manually reduce priority to simulate sleep and then allow a higher-priority timer thread to "wake" the sleeper thread by raising its priority (also sketched after this list).

(3.) The upper limit on the number of dependent stages in a 60Hz game pipeline is probably best kept well under 16 if any of those stages can block.

(4.) Threading sync interfaces which use two (or more) potentially blocking sync primitives to enter and leave can be very bad.

(5.) There is a good point to non-blocking IO!
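
To make (1.) and (2.) concrete, here is an untested sketch (as said above, I haven't tried this); the thread handles, priority levels, and the 1ms wake interval are all arbitrary picks:

#include <windows.h>

static HANDLE g_renderThread; // hypothetical handle to the frame thread

// (1.) IO-blocking workers run above compute threads so they preempt
// a compute thread the moment their IO completes.
void SetupIoWorker(HANDLE ioThread)
{
    SetThreadPriority(ioThread, THREAD_PRIORITY_ABOVE_NORMAL);
}

// (2.) Simulate sleep without actually calling Sleep(): drop our own
// priority so anything else runnable gets the CPU first...
void SimulatedSleepBegin(void)
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_IDLE);
}

// ...and let a separate higher-priority timer thread "wake" the
// sleeper by raising its priority back up for the next frame.
DWORD WINAPI WakerThread(LPVOID arg)
{
    for (;;)
    {
        Sleep(1); // ~1ms granularity with timeBeginPeriod(1)
        // if (time for the next frame to start) ...
        SetThreadPriority(g_renderThread, THREAD_PRIORITY_NORMAL);
    }
    return 0;
}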


Thread Scheduling and V-Sync

First, forget the notion that it is OK for the GPU to wait when one is rendering faster than v-sync. Frame rates are variable; being able to get an early start on the next frame can be quite useful (it lowers the chance of dropped frames)!

There are multiple ways hardware could possibly handle sync:

(1.) The CPU adds a GPU command buffer command which stalls the GPU until v-sync, followed by a command which swaps buffers. The CPU then wouldn't have to stall and could continue adding commands for the next frame until command buffer queue space fills up (at which point the draw call would stall). Clearly in this case the GPU stalls and cannot begin processing commands for the next frame; this is very bad!

(2.) The CPU adds a GPU command which sets up for the swap. Then later the CPU uses something besides a command buffer command to trigger the actual swap (memory-mapped IO to a hardware register, etc., likely from a v-blank interrupt on the CPU side). So neither the CPU nor the GPU needs to stall. The CPU could queue draw calls and issue those draw calls to the GPU as long as they didn't write into the front buffer (used before the swap) or fill up command buffer queues. At some point the CPU would get notification that the swap had completed and could issue dependent calls (as in writes to the previous front buffer, i.e. the new back buffer) for the next frame. This is the (only) sane way to do things.

I'm going to assume the GPU supports (and the driver is using) case (2.).

If the application is GPU bound, meaning the CPU can issue draw calls faster than the GPU can complete them, then at some point a graphics API function call will block. The API call will block for one of two possible reasons: first, buffer space is all used up; or second, it blocks on the swap to ensure the CPU side doesn't get too many frames ahead. So if one wants to push the CPU too, don't do that from the same thread issuing graphics API function calls!
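
A skeletal sketch of that split (the function names here are purely illustrative stubs):

#include <windows.h>

void SimulateGame(void)   { /* CPU-heavy game/sim work */ }
void IssueDrawCalls(void) { /* graphics API calls, may block on swap */ }

DWORD WINAPI ComputeThread(LPVOID arg)
{
    // Compute work never stalls when the graphics API blocks on
    // command buffer space or the swap.
    for (;;)
        SimulateGame();
    return 0;
}

int main(void)
{
    // Keep the potentially blocking API calls on their own thread.
    CreateThread(NULL, 0, ComputeThread, NULL, 0, NULL);
    for (;;)
        IssueDrawCalls(); // may block; compute thread keeps running
    return 0;
}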