20141231 - Continued Notes on Custom Language


Continuing notes on portable development for Linux and Windows using a custom language...

Using WINE on Linux for Windows development is working great: ability to use OpenGL 4.x in native 64-bit WIN32 binaries on Linux. Using the same compiler to build both Linux and Windows binaries which can both be run from the same Linux box.

Forth-like Language as a Pre-Processor for Code-Generation
My 1706 byte compiler takes a stream of prefixed text: mostly {optional prefix character, then word, then space} repeated...

\comment\
word \compile call to address at word\
:word \compile pop data stack into word\
.word \compile push 64-bit value at word onto data stack\
'word \compile conditional call to address at word if pop top of stack is non-zero\
`word \copy opcodes at word into compile stream\
%word \compile push address of word, still haven't used this\
{ word \define word, stores compile address in word, then at closing stores size (used for opcodes)\ }
34eb- \compile push hex number negated\
,c3 \write raw byte into compile stream, x86 ret opcode in this example\
"string" \compile push address of string, then push size of string\


Compiler strips whitespace and comments while converting the input words into machine code (using the conventions above), then executes the machine code. There is no "interpreter" as would normally be used in a Forth based language. The compiler does not even have the standard Forth opcodes, these are instead just specified in the input source file. For example ADD,

{ add ,48 ,03 ,03 ,83 ,eb ,08 }

Which is the raw opcode stream for "MOV rax,[rbx]; ADD rbx,8" where "rax" is the top of the data stack, and "rbx" points to the 2nd item on the data stack. Since Forth opcodes are zero operand, it is trival to just write them in source code directly (language is easily extendable). I use under 30 forth style opcodes. After an opcode is defined, it can be used. For example,

10 20 `add

Which pushes 16 then 32 on the data stack (numbers are all in hex), then adds them. To do anything useful, words are added which pop a value from the data stack and write them to a buffer. For example write a byte,

{ byte ,40 ,88 ,07 ,83 ,c7 ,01 ,48 ,8b ,03 ,83 ,eb ,08 }

Once the compiler is finished executing the machine code generated by the source, which in turn is used to write a binary into a buffer, the compiler stores that buffer to disk and exits. In order to do anything useful the next step is to use the language and source opcodes which extend the language to build an assembler. Some bits of my assembler (enough for the ADD opcodes),

\setup words for integer registers\
0 :rax 1 :rcx 02 :rdx 03 :rbx 04 :rsp 05 :rbp 06 :rsi 07 :rdi
8 :r8 9 :r9 0a :r10 0b :r11 0c :r12 0d :r13 0e :r14 0f :r15

\words used to generate the opcode\
{ REXw 40 `add `byte }
{ REX .asmR 1 `shr 4 `and .asmRM 3 `shr `add `dup 'REXw `drp }
{ REXx .asmR 1 `shr 4 `and .asmRM 3 `shr `add 48 `add `byte }
{ MOD .asmR 7 `and 8 `mul .asmRM 7 `and `add .asmMOD `add `byte }
{ OP .asmOP 8 `shr 'OPh .asmOP `byte }
{ OP2 :asmOP :asmRM :asmR 0c0 :asmMOD REX OP MOD }
{ OP2x :asmOP :asmRM :asmR 0c0 :asmMOD REXx OP MOD }

\implementation of 32-bit and 64-bit ADD\
{ + 03 OP2 } { X+ 03 OP2x }


Afterwords it is possible to write assembly like,

.rax .rbx + \32-bit ADD eax,ebx\
.rax .rbx X+ \64-bit ADD rax,rbx\


Due to the complexity of x86-64 ISA, I used roughly 300 lines to get a full assembler (sans vector opcodes). With a majority of those opcodes not even getting used in practice. The ref.x86asm.net/coder64.html site is super useful as an opcode reference.

Binary Header
Next step is writing source to generate either a PE (Windows) or ELF (Linux) binary header. ELF with "dlsym" symbol used roughly 70 lines (mostly comments to describe the mess of structure required for an ELF). The PE header I generate for WIN32 binaries looks similar to this example from Peter Ferrie which is a rather minimal header with non-standard overlapping structures. I added an import table for base Kernel32 functions like "LoadLibraryA", because of fear that manual run-time "linking" via PEB would trigger virus warnings on real Windows boxes. I'm not really attempting to hit a minimum size (like a 4KB demo), but rather just attempting to limit complexity. WINE handles my non-standard PE with ease.

If I was to write an OS, I wouldn't have binary headers (PE/ELF complexity just goes away). Instead I would just load the binary at zero, with execution starting at 4 with no stack setup, with binary pages loaded with read+write+execute, and then some defined fixed address to grab any system related stuff (same page always mapped to all processes as read-only). This has an interesting side effect that JMP/CALL to zero would just restart the program (if nop filled) or do exception (if invalid opcode filled). Program start would map zero-fill and setup stack. I'd also implement thread local storage as page mapping specific to a thread (keeping it simple).

ABI: Dealing With the Outside World
Having your own language is awesome ... dealing with the C based Linux/Windows OS is a pain. I use a 8-byte stack alignment convention. The ABI uses a 16-byte stack alignment convention. The ABI for Linux Kernel, Linux C export libraries, and 64-bit Windows is different. Here is a rough breakdown of register usage,

_ ___ _ LXK LXU WIN
0 rax .
1 rcx . k0 a3 a0
2 rdx . a2 a2 a1
3 rbx . s0 s0 s0
4 rsp X
5 rbp X t0 t0 t0
6 rsi . a1 a1 a4 <- stored before call on WIN
7 rdi . a0 a0 a5 <- stored before call on WIN
8 r8_ . a4 a4 a2
9 r9_ . a5 a5 a3
a r10 . a3 k0 k0
b r11 . k1 k1 k1
c r12 X t1 t1 t1
d r13 X t2 t2 t2
e r14 . s1 s1 s1
f r15 . s2 s2 s2


rax = return value (or syscall index in Linux)
rsp = hardware stack
a# = register argument (where Windows 64-bit a4 and a5 are actually on the stack)
t# = temp register saved by callee, but register requires SIB byte for immediate indexed addressing
s# = register saved by callee, no SIB required for immediate indexed addressing
k# = register saved by caller if required (callee can kill the register)

I use a bunch of techniques to manage portability. Use a set of words {abi0, abi1, abi2 ... } and {os0, os1, os2 ... } (for things which map to Linux system calls) which map to different registers based on platform. Use word "ABI(" to store the stack into R13, then align the stack to 16-bytes, then setup a stack frame for the ABI safe for any amount of arguments. Words "ABI", "ABI5", "ABI6+" to do ABI based calls based on the number of integer arguments needed for the call. This is needed because Linux supports 6 arguments in registers, and Windows only supports 4. Then later ")ABI" to get my 8-byte aligned stack back,

{ ABI( .abiT2 .rsp X= .rsp 0fffffffffffffff0 X#& .rsp 50- X#+ }
{ ABI \imm\ #@CAL } \call with up to 4 arguments\
{ ABI5 \imm\ #@CAL } \call with 5 arguments\
{ ABI6+ \imm\ #@CAL } \call with 6 or more arguments\
{ )ABI .rsp .abiT2 X= }


With the following words overriding some of the above words on Windows (slighly more overhead on Windows),

{ ABI(.W .abiT2 .rsp X= .rsp 0fffffffffffffff0 X#& .rsp 80- X#+ }
{ ABI5.W .abi4 .abiW4 PSH! \imm\ #@CAL }
{ ABI6+.W .abi4 .abiW4 PSH! .abi5 .abiW5 PSH! \imm\ #@CAL }


So a C call to something like glMemoryBarrier would end up being something like,

ABI( .abi0 \GL_BUFFER_UPDATE_BARRIER_BIT\ 200 # .GlMemoryBarrier ABI )ABI

And in practice the "ABI(" and ")ABI" would be factored out to a much larger group of work. The "#@CAL" translates to call to "CALL [RIP+disp32]", since all ABI calls are manual dynamic linking the ".GlMemoryBarrier" is the address which holds the address to the external function (in practice I rename long functions into something smaller). Since both Windows and Linux lack any way to force libraries into the lower 32-bits of address space, and x86-64 has no "CALL [RIP+disp64]", I decided against run-time code modification patching due to complexity (would be possible via "MOV rax,imm64; CALL rax"). Both Windows and Linux require slightly different stack setup. Convention used for Linux (arg6 is the 7th argument),

-58 pushed return
-50 arg6 00 <---- aligned to 16-bytes, ABI base
-48 arg7 08
-40 arg8 10
-38 arg9 18
-30 argA 20
-28 argB 28
-20 argC 30
-18 argD 38
-10 argE 40
-08 argF 48
+00 aligned base


Convention used for Windows,

-88 pushed return
-80 .... 00 <---- aligned to 16-bytes, ABI base
-78 .... 08
-70 .... 10
-68 .... 18
-60 arg4 20
-58 arg5 28
-50 arg6 30
-48 arg7 38
-40 arg8 40
-38 arg9 48
-30 argA 50
-28 argB 58
-20 argC 60
-18 argD 68
-10 argE 70
-08 argF 78
+00 aligned base


The C based ABI and associated "{{save state, call, ... nesting ..., return, load state} small amount of code} repeat ...}" pattern forces inefficient code in the form of code bloat and constant shuffling around data between functional interfaces. Some percentage of callee-save registers are often used to shadow arguments, often data is loaded to register arguments only to be saved in memory again for the next call, caller saves happen even when the callee does not modify, etc.

I'd much rather be using a "have command structures in memory, fire-and-forget array of pointers to commands" model. The fire-and-forget model is more parallel friendly (no return), and provides ability for reuse of command data (patching). The majority of system calls or ABI library calls could just be baked command data which exists in the binary. Why do I need to generate complex code to do constant run-time generation and movement of mostly static data?

I conceptionally treat registers as a tiny fast compile-time immediate-indexed RAM (L0$). Register allocation is a per-task process, not a per-function process. There is no extra shuffling of data. No push/pop onto stack frames, etc. For example register allocation is fixed purpose during the two passes of the compiler,

\____REGISTERS_DURING_COMPILE__________________________________\
.rax :chr \input character\ .rax :dic \dictionary entry address\ .rax :str#
.rcx :hsh \hash value\ .rcx :jmp \jump address\ .rcx :num .rcx :str$
.rdx :pck1 \string packing 1\ .rdx :num- .rdx :siz
.rbx :pck2 \string packing 2\ .rbx :chr$1
.rbp :dic$ \dictionary top pointer, not addressed from\
.rsi :chr$ \input pointer\
.rdi :mem$ \memory stack, compile stack\
.r8 :def$ \define stack\
.r15 :zero \set to zero\
\____REGISTERS_DURING_EXECUTE__________________________________\
.rax :top \rax used for .dic only at launch\
.rcx :cnt \counter on text copy\
.rdx :fsk$ \float stack pointer, points to 2nd entry\
.rbx :stk$ \data stack pointer, points to 2nd entry\
.rsi :txt$ \used for source on text copy\
.rdi :out$ \output pointer, points to next empty byte\


Debugging
My dev environment is basically two side by side terminal windows per virtual desktop with a easy switch to different virtual desktops. I'm using nano as a text editor and have some rather simple color syntax highlighting for my language. No IDE no debugger. Proper late stone-age development.

To port to WIN32 I did have to fix some code generation bugs with regard to having a non-zero ORG. On Linux I load the binary at zero, so file offset and loaded offsets are the same. This is not possible in Windows. IMO there is more utility in keeping it simple than having zero page fault. When tracking down bugs in code generation or binary headers I just use "hexdump -C binary". My language supports injection of any kind of data during code generation, so it is trivial to just wrap an instruction or bit of code or data with something like "--------" which is easy to find via "hexdump -C binary | less".
The Forth inspired language I use has only one possible error, attempting to call a word which has not been defined (which calls zero). My compiler does not print any error messages, it simply crashes on that error. Since in practice this effectively almost never happens, I've never bothered to have it write out an error message. The last time I misspelled a word, it was a super quick manual log search (commenting out code) to find the error. When compiling and testing is perceptually instant, lots of traditional complexity is just not needed.

As for regular development, I started programming when a bug would take down the entire machine (requiring reboot). Being brought up in that era quickly instills a different kind of development process (one that does not generate bugs often). Most bugs I have are opcode mistakes (misspellings), like load "@" instead of store "!", or using a 32-bit form of an opcode instead of the 64-bit form. The only pointers which are 64-bit in my binaries are pointers to library functions (Linux loads libraries outside the 32-bit address space), the stack (I'm still using the OS default instead of moving to the lower 32-bit address space), or pointers returned by a library. When dealing with bugs with an instant iteration cycle, "printf" style debug is the fast path. I've built some basic constructs for debugging,

BUG( "message" )BUG \writes message to console\
.bugN .rax = 10 BUG# \example: prints the hex value in RAX to 16 digits to console\


Adding something to my "watch window" is just adding the code to output the value to the console. This ends up being way better than a traditional debug tool because console output provides both the history and timeline of everything being "watched".

Misc
In the past I had a practice of building a custom editor and run-time for each language. The idea being that it was more efficient to compile from a binary representation, which has a dictionary embedded in the source code (no parsing, no hashing). Ultimately I moved away from that approach due to the complexity involved in building the editor that I wanted for the language. Mostly due to the complexity of interfacing with the OS. It is really easy to fall in the trap of building a tool which is many times more complex than the problem the tool is engineered to solve.

Decoupling from the dependency of a typical compiler and toolchain on modern systems has been a great learning experience. If FPGA perf/$ was competitive with GPUs I'd probabably move on to building at the hardware level instead...