20090604 - Thread Scheduling Part 2


Part 2 is all about the IO side of Thread Scheduling, CPU IO in a parallel world.

Review, Blocking vs Non-Blocking

BLOCKING - Thread which issues the IO call is put to sleep until the operation is "finished". At some point in the IO call, a kernel call is made (to service the IO request) which results in a run-level task switch to kernel mode. In kernel mode, the kernel queues the IO request, sleeps the calling thread and does a task switch to another runnable thread. At some point later after the kernel services the IO request, the original thread (which issued the IO) is again scheduled to run (latency can be a function of IO latency plus how often the blocking thread gets rescheduled to run).

LIKELY BLOCKING, POSSIBLY NON-BLOCKING - In the case of reads, sometimes the data is cached (perhaps disk cache) and thus the IO call can be serviced right away. This can also be the case for writes when the kernel call copies the write request to a queue (or to the page cache) and immediately returns (note when queue is full or page isn't in page cache, kernel call can become a blocking call).

MOST LIKELY NON-BLOCKING, POSSIBLY BLOCKING - This is the troubling case where an interface is almost completely non-blocking, but has some caveats. Caveats in that the interface doesn't return some error code indicating that the call would block. Often this is the case when buffer or queue space is full. The v-sync section in the previous post is a good example of this!

NON-BLOCKING - Correct fully non-blocking API, if a call would block, the call returns with an error code. Non-blocking APIs might provide a polling function to return the status of pending operations. Non-blocking APIs might also provide a callback interface. In this case a user supplied function is called on pending IO. Callback interface can be lower latency at the expense of loosing control over job synchronization.


Disk Access Re-ordering To Reduce Seek Time

All operating systems which have good file IO performance on rotating disc based storage do this (else performance would suck worse than it does now). The operating system queues up the disk accesses required to service the file IO (virtual memory IO, etc) from all processes and threads, and orders the accesses to reduce seek time.

This should be kept in mind when thinking about application file IO. It is often smart to queue up file IO requests from many files at once, instead of just following the read->process->read->process model. Likewise, having just one asych thread servicing file requests using a blocking interface might not yield the best throughput depending on access patterns.


Read Ahead

The operating system provides a read-ahead mechanism which attempts to predict which pages will be needed in the future (based on past access patterns) and effectively adds future requests to current requests when reading from the device. In some cases (Linux) it can get various patterns like sequential + stride, etc.


Why Your Desktop User Experience Is Often a Waiting Game

(1.) Same thread processing the application "GUI" event queue does file IO. Epic Fail!

(2.) Application is creating a huge data structure, it writes into memory which was just allocated, but hasn't yet been mapped to physical pages by the OS. System is under heavy load, and not enough non-dirty pages are available to discard. Operating system needs to write out dirty pages to disc to provide pages for the application to write to. Application blocks until disc IO is finished on the dirty pages. Epic Fail!

(3.) Application suffers from code bloat. App could even be tiny but suffer from a bloated hierarchy of dynamic link libraries. Non-accessed code likely isn't resident (or if loaded, was paged out to provide space for something else). Example, user clicks on menu and opens up a seldom used dialog, paged-out code is accessed, and App blocks on file IO. Epic Fail!

(4.) Application access a huge number of small files, all with long path names (think bloated file structure with huge number of files per directory). Seek nightmare for two reasons, first just accessing the (possibly uncached) directory structure to find the files, and second accessing the (likely uncached) files. Epic Fail!

Solving the first case is just good application design. Solving the third case is expert application design. One has to insure all GUI or user-input processing is in a small core amount of common code (and data) which will always be resident (often accessed likely equals always resident). Possibly blocking functions (ie ones which might access code which is seldom called, ie which is paged out) needs to be done in an asynchronous thread. This way the app is always responsive!

Point here is anything which might result (directly or indirectly) in file IO must be on a separate thread from whatever thread or thread(s) interact with the user.


Flavors of File IO

CACHED STREAM FILE IO - Think the standard C library interface: open(), close(), read(), write(), etc. Reads and writes end up as a mix of blocking and non-blocking kernel calls. Nothing new here. I avoid this stuff.

MEMORY MAPPED FILE IO - The case where the operating system provides file IO directly using the virtual memory system. All the mapping kernel call has to do is to setup the page tables to invalid and then setup the os pages mapping to map to the file. At this point it can return (however it might prefetch based on mapping hints). When the app then attempts to first access memory in the mapped range, a page fault will happen, the kernel will then queue the disk IO, and sleep the thread (assuming the access isn't beyond the end of the file, also assuming the disk page isn't cached in memory). Depending on access patterns, the OS might prefetch N extra pages on fault (read-ahead). Nothing says that the entire file is loaded.

A scheduling pitfall of memory mapped IO is possible lengthy page faults even after the first access. So if using memory mapped IO in a blocking service thread, that thread would have to manually touch (read a integer from) all the pages to insure they are all actually loaded (skipping pages isn't an option because the read-ahead might detect that an then only service the skipped pages). This way when the async service thread marks the IO as complete, the program can assume a compute thread will not block. Problem with a blocking async thread touching pages to insure they are loaded is that each read will be a full (wasted) cache line miss out to main memory (if each miss is 1000 cycles, on a 2 GHz machine with 4KB pages, 1 ms of miss time is only good for at best ~7.8 MB of touched blocks, costing you 1/16 of your 60Hz frame). So totally NOT practical!

Pages modified are automatically marked by the CPU so on unmap the OS can only write out modified pages for free. Also in theory, memory mapped IO could provide the operating system a way to avoid an extra copy on read and write from the file system. This is assuming the OS does direct file DMA to/from the pages. Avoiding the copy would be avoiding the cache pollution and stalls caused by huge CPU to CPU copies. Avoiding the extra copy sitting around in the disk cache also saves memory.

Memory mapped file IO is available on Windows/Unix/Mac/Etc, but not on consoles, and other embedded systems!

RAW DIRECT ASYNCHRONOUS FILE IO - Given the problems with cached or memory mapped file IO, for databases and other critical high performance applications, operating systems often provide an interface for raw file IO. Interfaces for raw file IO are also commonly non-blocking and asynchronous (for obvious reasons) and likely batch request friendly.


Windows Asynchronous Disk I/O

Quick refresher, I'm not personally interested in things like accessing a huge number of random files. IMO that is just broken by design. What I do want is background IO with simple polling for completion (no heavy weight kernel call), and no duplication of memory (ie copy of the file block in the file cache). So I'm not going to cover things like I/O completion ports...

Microsoft has a great article on this which should be read first. Windows AIO has one key design flaw in that the operating system can choose to go synchronous for an operation without giving the application a choice. This IMO is an Epic Fail case for any asynchronous interface.

Cases of going synchronous are: NTFS compression, NTFS encryption, extending a file's length, and large number of cached file IO requests. FILE_FLAG_NO_BUFFERING (raw direct un-cached IO) best insures an actual asynchronous call. However from their example (of no-buffering async calls), only 500 requests got queued in 0.22 seconds. Which shows a 0.45ms average kernel call time. Yikes, that is either certainly not asynchronous, or just spin locking, or the thread is getting heavily preempted!!!

Windows does provide fail cases with ERROR_INVALID_USER_BUFFER or ERROR_NOT_ENOUGH_MEMORY whenever there are too many outstanding asynchronous I/O requests.

Windows also does provide a HasOverlappedIoCompleted() macro to check for AIO completion without using a kernel call.


MacOSX Asynchronous Disk I/O

POSIX asynchronous IO wasn't supported until MacOSX 10.4. POSIX AIO provides a true non-blocking interface, if a request cannot be enqueued, then functions fail. POSIX AIO requires a aio_error() "system call" to check on status of an async operation. I believe that POSIX AIO on OSX is a wrapper for kqueue() so polling does indeed require a system call (yuck). According to some internet sources a process is limited to 16 AIO requests at a time on OSX. Probably better to just use kqueue() directly on OSX.


Linux Asynchronous Disk I/O

Included in 2.6 kernel (patch for 2.4). Works on files or block devices opened with O_DIRECT or raw devices. Doesn't work on sockets or pipes. One option is to directly interface with the io_*() system calls through the libaio userspace wrapper library. Basically,

#include < libaio.h >

// setup context
aio_context_t ctx;
bzero(&ctx, sizeof(aio_context_t));
io_setup(max_events, &ctx);

// submit events
struct iocb tab[events];
io_prep_pread(tab+0, fd, buf, bytes, offset);
io_prep_pwrite(tab+1, fd, buf, bytes, offset);
...
io_submit(ctx, events, tab);

// poll on completion using syscall
// will want to wrap this in something which disables signals
struct io_event tab2[events];
num_events_or_error = io_getevents(ctx, 0, events, tab2, 0);
if(tab2[0].obj == tab+0) // then that event is finished
// tab2[0].res2 == 0 then ok, else error
// tab2[0].res == amount read/written


Links: Linux System Calls, Neat Interactive Map of Linux Kernel

Other option would be to use libposix-aio to get the POSIX AIO interface (which uses io syscalls). The link above is to the source which is a good reference to the syscalls (for example they wrap io_getevents() in something which masks interrupts, stuff like that likely learned the hard way through trial and error!).

BTW, Linux too fails on the most basic level of simply directly updating the iocb with the completion status so a kernel call isn't needed to poll. Arg.


Non-Blocking Network Interface

I personally don't use TCP, and only do UDP. Via UDP, the problem of the "connection" interface to the operating system goes away, and a single socket can gather all incoming packets from all clients and/or send packets to all clients (recvfrom()/sendto()). A quick ioctl(), or ioctlsocket() (with Windows WSA emulation of BSD sockets), can be used to set non-blocking operation.


Audio

Going to leave this for another blog post...