20090606 - Thread Scheduling Part 3

My Personal Views on Solutions to This Mess

*** Personal views for the independent developer. Clearly things are different with the constraints a huge legacy code bases, etc...

First a little background, back when I started programming, is was practical to write your own operating systems and drivers (I did this in my teens). I would boot DOS, then switch into protected mode with the DOS interrupt table and low memory intact so if necessary I could switch into virtual x86 mode and use DOS for things I didn't feel like programming my self (like access to the FAT files system on the main harddrive). Everything else, sound, video, etc was done accessing the hardware directly. Actually things where much EASIER back then (IMO). Since then, the industry had added layer upon layer to abstract out the hardware...

WHAT DO I MEAN BY EASIER? - Back in the early days you got control over everything. Computers were single processor, no virtual memory, no threading. Applications (at least the stuff I did) would work with non-blocking asynchronous IO requests and coarse polling with a small set of interrupt service routines to handle hardware interrupts. It was rather easy to poll with one bitmask read to check for completion of any IO request, then cooperatively "multi-task" in application. Things were actually more interactive and responsive back then as well. Polling was cheep (read from memory, no kernel call), task switch was cheap because it was cooperative, no state to save and restore. Programs were tiny, systems were tiny. Check out Contiki for a modern example!

I hold no illusions that somehow we could go back to the way things were before, however there are a huge number of lessons from back then which should be applied now.

What I am Doing Circa 2009

First I have one thread per hyperthread in the system (or per processor in a non-hyperthreaded system). These threads are locked down via setting thread affinity. These threads are my "compute threads" which don't ever do blocking IO. Whenever possible I use low-level as possible non-blocking IO interfaces (this covers networking and raw file IO on most systems).

For all the other cases, meaning when the operating system or API doesn't provide a 100% non-blocking interface, I resort to blocking IO threads to which I non-locking queue requests to. Note because scheduling granularity is 1ms or so at best, queues get groups of requests at one time (to lower atomic operation overhead and overhead of kernel call to wake a IO thread which was blocking on an empty queue). Blocking IO threads run at a higher priority so they will be insured to preempt a "compute thread" whenever they get marked as runnable. Blocking IO threads are designed to have very short runtime and thus a minor effect on the "compute thread" which it prempted.

My job scheduler is quite different from what others do (it is modeled directly on how GPUs process draw calls and what is now the OpenCL model). I break my program down into an array/list of small jobs (like draw calls). I handle handle different sections of the program by turning on/off jobs in this list.

Each job has an index to a prior dependent job (which controls when the job can be scheduled). I keep track of job runtime (free profiling) and use this to pre-schedule all jobs into one queue per "compute thread" a frame in advance. So "compute threads" simply pull jobs in order off their queue and spin if in the unlikely case the job's dependency hasn't finished execution.

It is up to the program designer (ie me) to insure that the program pipeline doesn't ever need to stall and thus has enough non-dependent jobs to handle the variability of the preemption by the blocking IO threads.

Note that with this system I don't ever need any other thread-2-thread synchronization other than atomic operations. All that "work" and "mess" is factored out into proper program "pipeline" design!

To handle scalability to machines of varying number of processors, I have two types of jobs: (BATCH) This job is attempted to be scheduled on all "compute threads" at the same time. Each job entry point gets the "compute thread" index and the number of "compute threads". It is then up to the job to carve up the data and process its group. Batches are for data parallel work. (TASK) This job gets scheduled on only one processor. Tasks are for task parallel work.

For fast development cycles, I have all my code in a single dynamic (or shared on unix platforms) library, so I can re-compile, unload the old library, and load the new library to instantly test code changes without needing to restart the program. This is my re-attachable code model. BTW, I don't use the debugger ever at home, I don't need to!