20150626 - OS Project : 8 - Factoring out Hashing and Interpreting


Another long break, too busy, but somewhere in the background the thoughts continue to churn in the quest for something better...

Since last time, wrote an editor based on the prior idea of storing strings in a reversible hashed form. Experimented with various methods of input. Worked great. Then the time away cultivated some disruptive ideas.

Factoring out Hashing and Interpreting
At the most primitive level a Forth dictionary is basically something which maps identical symbols to one and the same word of memory. Somewhere along the line colorForth added in the complexity of mixing tags/data/symbols into the source tokens. Let's first remove tags/data, and return to source as only a series of words. Now when source is loaded, each word is translated to an address in memory (in the dictionary), making sure any words which are the same (in this source or in already-loaded source) end up with the same address in memory. So loaded source is simply an array of addresses, each the address of a location in memory which holds the address where the word later gets compiled. At this point might as well just load source directly as code, each source word translating into,

CALL [address]

Which in x86-64 can be padded with 2E prefixes to a full 64-bit value, where the high 32 bits (the __ bytes) are the RIP-relative offset (measured from the end of the instruction) to the address,

2E 2E FF 15 __ __ __ __
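
For example, a minimal sketch in C of emitting one source word as such an 8-byte padded indirect call (the names here are illustrative, not from the actual loader):

#include <stdint.h>
#include <string.h>

// code : where this instruction will be written
// slot : the dictionary cell holding the word's compiled address
static void emitWordCall(uint8_t* code, const void* slot)
{
  // disp32 is RIP-relative, measured from the end of the 8-byte instruction
  int32_t disp = (int32_t)((intptr_t)slot - ((intptr_t)code + 8));
  code[0] = 0x2E;  // CS prefix, ignored in 64-bit mode, used as padding
  code[1] = 0x2E;  // second padding prefix, total instruction size = 8 bytes
  code[2] = 0xFF;  // CALL r/m64
  code[3] = 0x15;  // ModRM /2 with RIP-relative disp32 addressing
  memcpy(code + 4, &disp, 4);
}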

So interpreting source is now just jumping or calling directly (no interpreter is required). This loading process can be directly inverted as long as it is possible to translate an address back into its string, which is needed for both the editor and for serializing out to storage. This is trivial: just make the dictionary a layered data structure, with 64-bit pointers in the first layer, then 64-bit packed strings (supporting up to 10 characters each) in the second layer. Given a word address, the string is at (address + layerSize).
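
As a sketch (in C, with an assumed capacity and illustrative names), the layered dictionary might look like the following, where the string for a word is found by offsetting its address by the size of the pointer layer:

#include <stdint.h>

#define DICT_MAX 65536  // assumed capacity, one entry per unique word

typedef struct
{
  uint64_t ptr[DICT_MAX];  // layer 0: address each word gets compiled to
  uint64_t str[DICT_MAX];  // layer 1: matching 64-bit packed string
  uint64_t count;          // number of entries currently in use
} Dict;

// given the address of a word's pointer cell, recover its packed string
static inline uint64_t dictString(Dict* d, uint64_t* wordAddr)
{
  return *(uint64_t*)((uint8_t*)wordAddr + sizeof(d->ptr));  // address + layerSize
}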

Making Loading Fast
(1.) Have a 2nd chunk of memory 2x the size of the source, and clear it.
(2.) Hash the source words into that chunk of memory.
(3.) Run one pass over the dictionary, hashing into the same chunk of memory, recording the dictionary address when there is a match.
(4.) Run through the source, looking in the hash map: if a match is found use it, else append a new entry to the end of the dictionary layers.

Similar to radix sort this has a high fixed cost: two passes over the source and one pass over the full dictionary. But it is very parallel friendly for something like a GPU; each word in the source or dictionary can be processed in parallel within a given pass.
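
A single-threaded sketch of those steps in C (the open-addressed map layout, the placeholder modulo hashing, and the reserved zero key are assumptions; Dict is the layered structure sketched earlier):

#include <stdint.h>
#include <string.h>

typedef struct { uint64_t key; uint64_t value; } Slot;  // key 0 = empty, value = dict index + 1

static uint64_t findSlot(Slot* map, uint64_t cap, uint64_t key)
{
  uint64_t i = key % cap;                                          // placeholder hash of the packed word
  while (map[i].key != 0 && map[i].key != key) i = (i + 1) % cap;  // linear probe
  return i;
}

// src : packed 64-bit source words, n of them
// out : resulting dictionary index per source word
// map : scratch memory, cap = 2 * n slots
void load(Dict* d, uint64_t* src, uint64_t n, uint64_t* out, Slot* map, uint64_t cap)
{
  memset(map, 0, cap * sizeof(Slot));                    // (1) clear the scratch chunk
  for (uint64_t i = 0; i < n; i++)                       // (2) hash source words into it
    map[findSlot(map, cap, src[i])].key = src[i];
  for (uint64_t i = 0; i < d->count; i++)                // (3) one pass over the dictionary
  {
    uint64_t s = findSlot(map, cap, d->str[i]);
    if (map[s].key == d->str[i]) map[s].value = i + 1;   //     record address on match
  }
  for (uint64_t i = 0; i < n; i++)                       // (4) resolve each source word
  {
    uint64_t s = findSlot(map, cap, src[i]);
    if (map[s].value == 0)                               //     no match: append new entry
    {
      d->str[d->count] = src[i];
      d->ptr[d->count] = 0;                              //     filled in when compiled
      map[s].value = ++d->count;
    }
    out[i] = map[s].value - 1;
  }
}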

Editing and Bringing Back Color
Remembering that the dictionary is layered, it would be possible to instead transform source into,

CALL [RBP+offset]

And then have the layer base address in RBP before calling into the source. This makes it possible to use a separate dictionary for interpreting the source, and another for drawing it. In this context, drawing the source into the console is just changing RBP, then calling the source directly (it draws itself). Words that have special meaning, like {start comment, define word, newline, etc}, just define a non-default meaning for themselves in the editor layer of the dictionary. These non-default function pointers can change color, etc.
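
For reference, CALL [RBP+disp32] should encode as FF /2 with a mod=10, rm=101 ModRM byte, so with the same two bytes of 2E prefix padding it stays a fixed 8-byte instruction where only the addressing mode differs from the earlier form,

2E 2E FF 95 __ __ __ __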