20120719 - Dynamic Linking and Run-time Recompile


First, some important background on dynamic linking (this is Linux specific, but the same ideas apply on other platforms):

(1.) There is only one external dynamic symbol required on Linux to support using dynamic libraries: dlsym(). One can use dlsym(0, "dlopen") to grab the address of dlopen(), etc., which enables manual dynamic linking. Note that dlsym(0, "dlsym") also works for the faster manual linking (below).

(2.) While RTLD_LAZY or lazy binding can save memory (in that unused parts of shared libraries are never paged in), it can cause in-app stutter: a symbol's first use can happen well after load time and trigger blocking IO.

(3.) Use static when possible! A quick reference: Understanding the x86-64 Code Models. On x86-64, thanks to RIP-relative addressing, accesses to static symbols (functions and data) in position-independent code (PIC) are as fast as in non-PIC code for the default small memory model.

(4.) Manual dynamic linking is faster than standard dynamic linking. Manual means fetching an external symbol's address (via dlsym()), placing that address in a static variable (if global) or embedding it in other data, then fetching that saved address directly at the point of use. For function calls, the manual method does not require the extra jmp through the PLT, and the symbol's address can be packed with data known to be in cache and grouped for good cache locality.

(5.) The .bss section for zero-initialized global data can be avoided via -fno-zero-initialized-in-bss. This forces all writable global data to get packed into the .data section, making run-time recompile easier. While this can increase the size of the binary, it also removes the run-time pain and suffering of page faults when .bss pages get written for the first time (assuming one is using mlockall()).
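A quick way to see the effect (file names and the objdump check are illustrative, not part of any build):

```c
/* zeroed.c -- a zero-initialized global normally lands in .bss:
 *
 *   cc -c zeroed.c
 *   objdump -t zeroed.o    # big_zero shows up in .bss
 *
 * with the flag it gets packed into .data instead:
 *
 *   cc -c -fno-zero-initialized-in-bss zeroed.c
 *   objdump -t zeroed.o    # big_zero now shows up in .data
 */
char big_zero[4096];
```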


Run-Time Recompile at Native Performance (for x86-64)!
This is trivially easy using my current model of "all code in one object file" with all static symbols and manual dynamic linking, since in this case PIC code runs at the same speed as non-PIC code.

All code is compiled into one single shared object file. A tiny loader copies the .so (the actual application) to a temp file, then loads it using dlopen(). The program monitors the timestamp on the source .c file and recompiles when it changes. The program also independently monitors the timestamp on the original .so file, and triggers the reload process on change.

The reload process finishes the frame, shuts down all extra threads, and returns to the loader, passing a pointer to the global .data segment and its size. The loader copies the new .so to a new temp file, then attempts to load it. On success, it enters the new .so, passing in the pointer and size of the .data segment of the last running .so. The program copies that data into its own .data segment, fixes any .data-based pointers (data re-linking), closes the prior .so, restarts the extra threads, and starts the next frame. All global data in the app is in one global structure called "all", so grabbing the size of and pointer to the .data segment is easy. Any new data added between run-time recompiles goes at the end of the "all" structure to ensure the same organization of memory.


Misc
I figured out another way to do this without shared objects, so that the data re-linking step would not be required, but the complexity did not seem worth the effort to me. Hint: one can control segment start via the --section-start linker option, which can also indirectly control segment sizing. Then mprotect() can be used to make the read-only segments writable for run-time reload, etc...