20090611 - Factoring Out the Job Scheduler


Last night I got another idea on how I might be able to both simplify my job scheduler (in fact remove it entirely) and better handle a dynamic CPU load, all while keeping a very simple design with minimal sync points. Be warned, I have NOT had time to try this yet; it should work in theory...

Threading

(1.) The program entry starts N threads then goes to sleep forever.
(2.) N threads to match the N hardware threads of the machine.
(3.) N threads are locked down via thread affinity.
(4.) Each of these N threads runs the exact same function.
(5.) Each thread starts with function(X, N).
(6.) Where X is the thread index ranging {0 to N-1}.

The basis of parallelism here is running N instances of a program. However, since the N instances can have completely divergent branching, the program can use X to mix and match task and data parallel computation anywhere in the program. Note that all threads share the same memory.
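As a rough sketch of the startup (untried, same caveat as above), assuming Linux pthreads; threadMain and gN are just names I made up for illustration,

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <unistd.h>

  static long gN; // N = number of hardware threads

  static void* threadMain(void* arg)
  {
    long x = (long)arg; // X = thread index, {0 to N-1}
    // the entire program runs here, branching on x and gN as needed
    return 0;
  }

  int main(void)
  {
    gN = sysconf(_SC_NPROCESSORS_ONLN);
    for(long x = 0; x < gN; x++)
    {
      pthread_t t;
      cpu_set_t set;
      pthread_create(&t, 0, threadMain, (void*)x);
      CPU_ZERO(&set);
      CPU_SET((int)x, &set); // lock this thread to hardware thread x
      pthread_setaffinity_np(t, sizeof(set), &set);
    }
    pause(); // program entry sleeps forever
    return 0;
  }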


Synchronization

For efficient computation, only very coarse-grain synchronization is supported between the N threads. Coarse as in you can count the number of sync points per screen refresh on your fingers. The examples below show one line per thread, with letters as follows,

- -> thread doing work
x -> thread blocking
E -> event
S -> sync point, thread blocks until all threads finished E


During normal program operation, sync points never block because a good program architect provides enough work in between. Note that in this model threads have a lot of slop and never actually fully sync up in time unless something goes really wrong.

--------E-------S-------
-----------E------S-----
------E--------S--------
-------E------------S---


When something goes wrong, threads do block.

--------E-------S-------
---------------E--S-----
------E-----Sxxx--------
---E-----Sxxxxxx--------


The program would be composed of a sequence of E and S points per frame.

-----E1----E2S1-----S2--

In the above example, after S1, data generated prior to E1 is free to be shared read-only between all threads. Likewise, after S2, data generated prior to E2 (i.e. between E1 and E2) is safe to share as read-only.

Data outside sync points is considered "unordered" (just as UAVs are in DX11) and can be shared only with DX11- or CUDA-style unordered atomic operations. Data which would otherwise require fine-grain synchronization to share is simply re-computed in each thread.
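For what it's worth, here is one possible shape for an E/S pair, assuming C11 atomics and a spin wait; gEventCount, eventPoint, and syncPoint are my names for this sketch, nothing final,

  #include <stdatomic.h>

  #define MAX_THREADS 64
  static atomic_uint gEventCount[MAX_THREADS]; // Es passed, one counter per thread

  // E: mark this thread as having passed another event (never blocks)
  static void eventPoint(long x)
  {
    atomic_fetch_add_explicit(&gEventCount[x], 1, memory_order_release);
  }

  // S: block until every thread has passed at least e events
  static void syncPoint(long n, unsigned e)
  {
    for(long i = 0; i < n; i++)
      while(atomic_load_explicit(&gEventCount[i], memory_order_acquire) < e)
        ; // spin, should only happen when something goes wrong
  }

After syncPoint(n, e) returns, anything written before event e on any thread is safe to read, matching the read-only sharing rule above.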


Dynamic Load Balancing

This is where things get interesting. The goal case is where threads reach E at the same time (E being at the same point in code on all threads). There is enough slop in the system that the E point in time can differ a bit between threads. To load balance, one would want to dynamically adjust work distribution such that at the next E point the threads would be closer together in timing.

To enable dynamic balancing, all threads have a secondary work distribution factor Y (the primary work distribution factor is the thread index X). This Y factor provides a segment {start,end} along the range of {0 to 1}. This Y segment can be used by the thread to choose the ranges of data parallel work to operate on.

(1.) At E, each thread stores CPU clock cycle time.
(2.) This requires CPU clock cycle counter bases to be re-synced periodically.
(3.) Also at E, each thread records the diff in cycle time since the previous E.
(4.) At S, threads read all the E times for all the threads.
(5.) Note the E times are read-only.
(6.) Each thread uses the E times to compute a new future Y segment.
(7.) This new future Y segment is stored to be used at the next sync point.
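
A sketch of step (6.), assuming the E-to-E cycle diffs came from something like rdtsc; all names here are illustrative. Every thread runs this on the same read-only inputs, so every thread deterministically computes the same partition with no extra sync,

  typedef struct { float start; float end; } Segment;

  // cycles[i] = E-to-E cycle diff for thread i, width[i] = thread i's current Y width
  static void rebalance(long n, const unsigned long long cycles[],
                        const float width[], Segment seg[])
  {
    // estimated throughput = fraction of total work finished per cycle
    double speed[MAX_THREADS];
    double total = 0.0;
    for(long i = 0; i < n; i++)
    {
      speed[i] = (double)width[i] / (double)cycles[i];
      total += speed[i];
    }
    // hand each thread a slice of {0 to 1} proportional to its speed
    float cursor = 0.0f;
    for(long i = 0; i < n; i++)
    {
      seg[i].start = cursor;
      cursor += (float)(speed[i] / total);
      seg[i].end = cursor;
    }
    seg[n - 1].end = 1.0f; // fixup against float rounding
  }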


Example Forms of Task and Data Parallel Programming

To have one thread handle an OpenGL interface, the program would just dynamically branch on X==0 (X being the thread index). Note, a GL context is usually tied to one thread. With DX11 there is also the concept of a primary thread, while secondary threads build deferred command lists. Note it is OK to lock down a task to just one thread, because the dynamic load balancing would redistribute the data parallel work via the Y segment (from above).
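In sketch form, with pumpGlInterface() standing in for whatever the GL task actually is,

  // task parallel: pin the GL interface to thread 0
  if(x == 0) pumpGlInterface(); // hypothetical, context current on this thread
  // all threads, including 0, continue with data parallel work via Y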

To process M objects, each thread would compute an exclusive set of the M objects using the Y segment (something like Y.start * M to Y.end * M, with fixup to ensure no overlap).
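A sketch of that split, with seg being this thread's Y segment from the rebalance above, and m, x, n as before. Both ends use the same rounding, so thread i's end equals thread i+1's start and nothing overlaps,

  long begin = (long)(seg.start * (float)m);
  long end   = (long)(seg.end   * (float)m);
  if(x == n - 1) end = m; // fixup: last thread absorbs rounding slack
  for(long i = begin; i < end; i++) processObject(i); // processObject is hypothetical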


Advantages

With this system there isn't a worklet system or a scheduler; there is just the program running multiple instances of itself. Synchronization is flexible and rarely used (very parallel friendly). Read-only data would likely be shared well in a common L1, L2, or L3 cache, because the same program is running on all threads and the dynamic load balancing works to bring the phase (the timing) of the threads together.