20120801 - A Better and Faster way to do TLS
A better option for Thread Local Storage (which I have not tried) would be to simply have a page-sized and page-aligned static variable or struct inside the application's or dynamic library's data segment.
Then at run-time, each thread which wants local storage calls a function to "setup" this TLS data region. This function would munmap() the pages used by the static variable, then mmap(MAP_PRIVATE|MAP_FIXED) to get new private page(s) for this thread in the munmap()ed region.
Accessing this Thread Local Storage has the same performance as any static global access (which on x86-64 is direct RIP-relative addressing).
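A rough sketch of what that "setup" function might look like, assuming Linux, GCC/Clang, and a 4096-byte page. The names tls_region, tls_setup and TLS_PAGE_SIZE are made up for illustration, and MAP_ANONYMOUS is added because there is no file backing the new pages. This only shows the remap mechanics from a single thread: in an ordinary pthread program all threads share one address space, so the remap is visible to every thread.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Assumed page size; real code should check sysconf(_SC_PAGESIZE). */
#define TLS_PAGE_SIZE 4096

/* Page-sized, page-aligned static region in the data segment.
   It fills its page completely, so no other variable shares that page. */
static struct {
    int  counter;
    char scratch[TLS_PAGE_SIZE - sizeof(int)];
} tls_region __attribute__((aligned(TLS_PAGE_SIZE))) = { 42, { 0 } };

/* Drop the pages backing tls_region and map fresh, zero-filled private
   pages at the same address. Returns 0 on success, -1 on failure. */
static int tls_setup(void)
{
    void *addr = &tls_region;

    /* Explicit munmap() to follow the description; MAP_FIXED alone
       would also replace whatever mapping is already there. */
    if (munmap(addr, sizeof tls_region) != 0)
        return -1;

    void *p = mmap(addr, sizeof tls_region, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return (p == addr) ? 0 : -1;
}
```

After tls_setup() returns 0, tls_region.counter no longer holds its initializer: the new anonymous pages start out zero-filled, and any later access is still a plain static global load or store.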
EDIT: Fail, looks like I'm wrong here for Linux and mmap(). I thought the convention was that mmap(MAP_SHARED|MAP_ANONYMOUS) was required in threads to get the same physical pages in each thread. Turns out the CLONE_VM flag used to create the thread ensures this for all mmap()/munmap() calls, since every thread shares one address space. So for my idea to work, one would need to roll "fork" style threads with clone() without CLONE_VM, then use mmap(MAP_SHARED|MAP_ANONYMOUS) or shm_open()+mmap() to manually share the parts of the address space which aren't TLS. Too bad. Looks like Linux would need a new "MAP_THREAD_PRIVATE" feature to support this without breaking backwards compatibility.
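A minimal sketch of the "fork" style arrangement, assuming Linux and using plain fork() as a stand-in for clone() without CLONE_VM (both give the worker its own copy-on-write address space). The names private_value and fork_worker_demo are hypothetical. The static stays private to each worker, while the MAP_SHARED|MAP_ANONYMOUS page has to be set up by hand before the worker is created.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ordinary static: each worker gets its own copy after fork(). */
static int private_value = 1;

/* Returns 0 if the worker's write to the shared page is visible in the
   parent while its write to the static is not; -1 on any error. */
static int fork_worker_demo(void)
{
    /* Memory that must be shared is mapped explicitly, before forking. */
    int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(shared, sizeof *shared);
        return -1;
    }
    if (pid == 0) {
        private_value = 2;   /* touches only the worker's private copy */
        *shared = 2;         /* lands in the page both sides map */
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        munmap(shared, sizeof *shared);
        return -1;
    }

    int ok = (*shared == 2) && (private_value == 1);
    munmap(shared, sizeof *shared);
    return ok ? 0 : -1;
}
```

The cost of this design is the inverse of normal threads: everything is private by default, so every shared structure has to live in an explicitly shared mapping.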