Welcome back to the “Compiling a Lisp” series. Last time we made a small code execution demo. Today we’re going to add the first part of our language: integer literals. Our programs will look like this:
123
-10
0
But we’re not going to put a parser in. That can come later when it gets harder to manually construct syntax trees.
Also, since implementing full big number support is pretty tricky, we’re only going to support fixed-width numbers. It’s entirely possible to then implement big number support in Lisp after we build out some more features.
Since the integers are always small (less than 64 bits), and we’re targeting x86-64, we can represent the integers as tagged pointers. To read a little more about that, check out the “Pointer tagging” section of my Programming languages resources page. Since we’ll also represent some other types of objects as tagged pointers, I’ll sketch out a tagging scheme up front. That way it’s easier to reason about than if I draw it out post-by-post.
High Low
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX00 Integer
0000000000000000000000000000000000000000000000000XXXXXXX00001111 Character
00000000000000000000000000000000000000000000000000000000X0011111 Boolean
0000000000000000000000000000000000000000000000000000000000101111 Nil
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX001 Pair
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX010 Vector
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX011 String
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX101 Symbol
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX110 Closure
In this diagram, we have some pointer templates composed of 0
s, 1
s, and
X
s. 0
refers to a 0 bit and 1
refers to a 1 bit.
X
is a placeholder that refers to payload data for that value. For immediate
values — values whose data are part of the pointer itself — the X
s refer
to the data. For heap-allocated objects, it is the pointer address.
It’s important to note that we can only accomplish this tagging scheme because
on modern computer systems the lower 3 bits of heap-allocated pointers are 0
because allocations are word-aligned — meaning that all pointers are
numbers that are multiples of 8. This lets us a) differentiate real pointers
from fake pointers and b) stuff some additional data there in the real
pointers.
These tags let us quickly distinguish objects from one another. Just check the lower bits:
00
means integer111
means one of the other immediate value types; check the
lower 7 bits to tell them apartThis is a choice that Ghuloum made when drawing up the compiler paper. It’s entirely possible to pick your own encoding as long as your encoding also has the property that it’s possible to distinguish the type based on the pointer.1
We’re going to be a little clever and use the same encoding scheme inside the compiler to represent Abstract Syntax Tree (AST) nodes as we are going to use in the compiled code. I mean, why not? We’re going to have to build the encoding and decoding tools anyway.
We’ll start off with integer encoding, since we don’t have any other types yet.
#include <assert.h> // for assert
#include <stdbool.h> // for bool
#include <stddef.h> // for NULL
#include <stdint.h> // for int32_t, etc
#include <string.h> // for memcpy
#include <sys/mman.h> // for mmap
#include "greatest.h"
// Objects
typedef int64_t word;
typedef uint64_t uword;
const int kBitsPerByte = 8; // bits
const int kWordSize = sizeof(word); // bytes
const int kBitsPerWord = kWordSize * kBitsPerByte; // bits
Ignore greatest.h
— that is a header-only library I use for lightweight
testing.
word
and uword
are type aliases that I will use throughout the codebase to
refer to types of values that fit in registers. It saves us a bunch of typing
and helps keep types consistent.
To avoid some mysterious magical constants, I’ve also defined helpful names for the number of bits in a byte (a standard C feature), the number of bytes in a word, and the number of bits in a word.
const unsigned int kIntegerTag = 0x0;
const unsigned int kIntegerTagMask = 0x3;
const unsigned int kIntegerShift = 2;
const unsigned int kIntegerBits = kBitsPerWord - kIntegerShift;
const word kIntegerMax = (1LL << (kIntegerBits - 1)) - 1;
const word kIntegerMin = -(1LL << (kIntegerBits - 1));
word Object_encode_integer(word value) {
assert(value < kIntegerMax && "too big");
assert(value > kIntegerMin && "too small");
return value << kIntegerShift;
}
// End Objects
As we saw above, integers can be fit inside pointers by shifting them two bits
to the left. We have this handy-dandy function, Object_encode_integer
, for
that.
I’ve added some bounds checks to make sure we don’t accidentally mangle the values coming in. If the number we’re trying to encode is too big or too small, shifting it left by 2 bits will chop off the left end.
This function is pretty low-level. It doesn’t add any new type information (it
returns a word
, just as it takes a word
). It’s meant to be a utility
function inside the compiler. We’ll add another function in a moment that
builds on top of this one to make ASTs.
While we could pass around word
s all day and try really hard to keep the
boundary between integral values and pointer values straight, I don’t much
fancy that. I like my type in formation, thank you very much. So we’re going to
add a thin veneer over the object encoding that both gives us some nicer type
APIs and gives the C compiler some hints about when we’ve already encoded an
object.
// AST
struct ASTNode;
typedef struct ASTNode ASTNode;
ASTNode *AST_new_integer(word value) {
return (ASTNode *)Object_encode_integer(value);
}
bool AST_is_integer(ASTNode *node) {
return ((word)node & kIntegerTagMask) == kIntegerTag;
}
word AST_get_integer(ASTNode *node) { return (word)node >> kIntegerShift; }
// End AST
We’ll use these functions pretty heavily in the compiler, especially as we add more datatypes.
Now that we can manually build programs, let’s get cracking writing our
buffers. We have to emit the machine code to somewhere, after all. Remember the
mmap
/memcpy
stuff from last time? We’re going to wrap those in some
easier-to-remember APIs.
// Buffer
typedef unsigned char byte;
typedef enum {
kWritable,
kExecutable,
} BufferState;
typedef struct {
byte *address;
BufferState state;
size_t len;
size_t capacity;
} Buffer;
byte *Buffer_alloc_writable(size_t capacity) {
byte *result = mmap(/*addr=*/NULL, /*length=*/capacity,
/*prot=*/PROT_READ | PROT_WRITE,
/*flags=*/MAP_ANONYMOUS | MAP_PRIVATE,
/*filedes=*/-1, /*offset=*/0);
assert(result != MAP_FAILED);
return result;
}
void Buffer_init(Buffer *result, size_t capacity) {
result->address = Buffer_alloc_writable(capacity);
assert(result->address != MAP_FAILED);
result->state = kWritable;
result->len = 0;
result->capacity = capacity;
}
void Buffer_deinit(Buffer *buf) {
munmap(buf->address, buf->capacity);
buf->address = NULL;
buf->len = 0;
buf->capacity = 0;
}
int Buffer_make_executable(Buffer *buf) {
int result = mprotect(buf->address, buf->len, PROT_EXEC);
buf->state = kExecutable;
return result;
}
These functions are good building blocks for creating and destroying buffers. They abstract away some of the fiddly parameters and add runtime checks.
We still need to write into the buffer at some point, though, and we’re not
going to memcpy
whole blocks in. So let’s add some APIs for incremental
writing.
byte Buffer_at8(Buffer *buf, size_t pos) { return buf->address[pos]; }
void Buffer_at_put8(Buffer *buf, size_t pos, byte b) { buf->address[pos] = b; }
This Buffer_at_put8
is the building block of the rest of the compiler. Every
write will go through this function. But notice that it is pretty low-level; it
does not do any bounds checks and it does not advance the current position in
the buffer. So let’s add some more functions to do that…
word max(word left, word right) { return left > right ? left : right; }
void Buffer_ensure_capacity(Buffer *buf, word additional_capacity) {
if (buf->len + additional_capacity <= buf->capacity) {
return;
}
word new_capacity =
max(buf->capacity * 2, buf->capacity + additional_capacity);
byte *address = Buffer_alloc_writable(new_capacity);
memcpy(address, buf->address, buf->len);
int result = munmap(buf->address, buf->capacity);
assert(result == 0 && "munmap failed");
buf->address = address;
buf->capacity = new_capacity;
}
void Buffer_write8(Buffer *buf, byte b) {
Buffer_ensure_capacity(buf, sizeof b);
Buffer_at_put8(buf, buf->len++, b);
}
void Buffer_write32(Buffer *buf, int32_t value) {
for (size_t i = 0; i < sizeof value; i++) {
Buffer_write8(buf, (value >> (i * kBitsPerByte)) & 0xff);
}
}
// End Buffer
With the addition of Buffer_ensure_capacity
, Buffer_write8
, and
Buffer_write32
, we can start putting together functions to emit x86-64
instructions. I added both write8
and write32
because we’ll need to both
emit single bytes and 32-bit immediate integer values. The helper function
ensures that we don’t need to think about endian-ness every single time we emit
a 32-bit value.
There are a couple ways we could write an assembler:
mov eax, 1
and mov ecx, 1
are distinct, for example) and fetch
chunks of bytes from thereI chose to go with the last option, though I’ve seen all three while looking
for a nice C assembler library. It allows us to write code like
Emit_mov_reg_imm32(buf, Rcx, 123)
, which if you ask me, looks fairly similar
to mov rcx, 123
.
If we were writing C++ we could get really clever with operator overloading… or we could not.
Note that I did not make up this encoding logic. This is a common phenomenon in instruction sets and it helps in decoding (for the hardware) and encoding (for the compilers).
// Emit
typedef enum {
kRax = 0,
kRcx,
kRdx,
kRbx,
kRsp,
kRbp,
kRsi,
kRdi,
} Register;
static const byte kRexPrefix = 0x48;
void Emit_mov_reg_imm32(Buffer *buf, Register dst, int32_t src) {
Buffer_write8(buf, kRexPrefix);
Buffer_write8(buf, 0xc7);
Buffer_write8(buf, 0xc0 + dst);
Buffer_write32(buf, src);
}
void Emit_ret(Buffer *buf) { Buffer_write8(buf, 0xc3); }
// End Emit
Boom. Two instructions. One mov
, one ret
. The REX prefix is used in x86-64
to denote that the following instruction, which might have been decoded as
something else in x86-32, means something different in 64-bit mode.
In this particular mov
’s case, it is the difference between mov eax, IMM
and mov rax, IMM
.
Now that we can emit instructions, it’s time to choose what instructions to emit based on the input program. We have a very restricted set of input programs (yes, several billion of them, if you’re being persnickety about the range of possible integers) so the implementation is short and sweet.
If we see a literal integer, encode it and put it in rax
. Done.
// Compile
int Compile_expr(Buffer *buf, ASTNode *node) {
if (AST_is_integer(node)) {
word value = AST_get_integer(node);
Emit_mov_reg_imm32(buf, kRax, Object_encode_integer(value));
return 0;
}
assert(0 && "unexpected node type");
}
int Compile_function(Buffer *buf, ASTNode *node) {
int result = Compile_expr(buf, node);
if (result != 0) {
return result;
}
Emit_ret(buf);
return 0;
}
// End Compile
I make a distinction between expr
and function
because we don’t always
want to ret
. We only want to ret
the result of a function body, which might
be composed of several nested expressions. This divide will become clearer as
we add more expression types.
Our compiler is all well and good, but it’s notably more complicated than the
mini JIT demo from the last post. It’s one thing to test that by manually
checking the return code of main
, but I think we should have some regression
tests to keep us honest as we go forth and break things.
For that, I’ve written some testing utilities to help check that we generated the right code, and also to execute the JITed code and return the result.
typedef int (*JitFunction)();
// Testing
#define EXPECT_EQUALS_BYTES(buf, arr) \
ASSERT_MEM_EQ(arr, (buf)->address, sizeof arr)
word Testing_execute_expr(Buffer *buf) {
assert(buf != NULL);
assert(buf->address != NULL);
assert(buf->state == kExecutable);
// The pointer-pointer cast is allowed but the underlying
// data-to-function-pointer back-and-forth is only guaranteed to work on
// POSIX systems (because of eg dlsym).
JitFunction function = *(JitFunction *)(&buf->address);
return function();
}
// End Testing
ASSERT_MEM_EQ
will check the generated code and point out any differences if
it finds them. Even though this only prints out hex representations of the
generated code, it’s very helpful. I often paste unexpected output into rasm2
(part of the radare2 suite), Cutter (also part of the radare2 suite), or this
online disassembler. If the
instructions look super unfamiliar, it means we messed up the encoding!
Since we have our utilities, we’re going to use the greatest.h
testing API to
write some unit tests for our compiler and compiler utilities.
// Tests
TEST encode_positive_integer(void) {
ASSERT_EQ(0x0, Object_encode_integer(0));
ASSERT_EQ(0x4, Object_encode_integer(1));
ASSERT_EQ(0x28, Object_encode_integer(10));
PASS();
}
TEST encode_negative_integer(void) {
ASSERT_EQ(0x0, Object_encode_integer(0));
ASSERT_EQ((word)0xfffffffffffffffc, Object_encode_integer(-1));
ASSERT_EQ((word)0xffffffffffffffd8, Object_encode_integer(-10));
PASS();
}
TEST buffer_write8_increases_length(void) {
Buffer buf;
Buffer_init(&buf, 5);
ASSERT_EQ(buf.len, 0);
Buffer_write8(&buf, 0xdb);
ASSERT_EQ(Buffer_at8(&buf, 0), 0xdb);
ASSERT_EQ(buf.len, 1);
Buffer_deinit(&buf);
PASS();
}
TEST buffer_write8_expands_buffer(void) {
Buffer buf;
Buffer_init(&buf, 1);
ASSERT_EQ(buf.capacity, 1);
ASSERT_EQ(buf.len, 0);
Buffer_write8(&buf, 0xdb);
Buffer_write8(&buf, 0xef);
ASSERT(buf.capacity > 1);
ASSERT_EQ(buf.len, 2);
Buffer_deinit(&buf);
PASS();
}
TEST buffer_write32_expands_buffer(void) {
Buffer buf;
Buffer_init(&buf, 1);
ASSERT_EQ(buf.capacity, 1);
ASSERT_EQ(buf.len, 0);
Buffer_write32(&buf, 0xdeadbeef);
ASSERT(buf.capacity > 1);
ASSERT_EQ(buf.len, 4);
Buffer_deinit(&buf);
PASS();
}
TEST buffer_write32_writes_little_endian(void) {
Buffer buf;
Buffer_init(&buf, 4);
Buffer_write32(&buf, 0xdeadbeef);
ASSERT_EQ(Buffer_at8(&buf, 0), 0xef);
ASSERT_EQ(Buffer_at8(&buf, 1), 0xbe);
ASSERT_EQ(Buffer_at8(&buf, 2), 0xad);
ASSERT_EQ(Buffer_at8(&buf, 3), 0xde);
Buffer_deinit(&buf);
PASS();
}
TEST compile_positive_integer(void) {
word value = 123;
ASTNode *node = AST_new_integer(value);
Buffer buf;
Buffer_init(&buf, 10);
int compile_result = Compile_function(&buf, node);
ASSERT_EQ(compile_result, 0);
// mov eax, imm(123); ret
byte expected[] = {0x48, 0xc7, 0xc0, 0xec, 0x01, 0x00, 0x00, 0xc3};
EXPECT_EQUALS_BYTES(&buf, expected);
Buffer_make_executable(&buf);
word result = Testing_execute_expr(&buf);
ASSERT_EQ(result, Object_encode_integer(value));
PASS();
}
TEST compile_negative_integer(void) {
word value = -123;
ASTNode *node = AST_new_integer(value);
Buffer buf;
Buffer_init(&buf, 100);
int compile_result = Compile_function(&buf, node);
ASSERT_EQ(compile_result, 0);
// mov eax, imm(-123); ret
byte expected[] = {0x48, 0xc7, 0xc0, 0x14, 0xfe, 0xff, 0xff, 0xc3};
EXPECT_EQUALS_BYTES(&buf, expected);
Buffer_make_executable(&buf);
word result = Testing_execute_expr(&buf);
ASSERT_EQ(result, Object_encode_integer(value));
PASS();
}
SUITE(object_tests) {
RUN_TEST(encode_positive_integer);
RUN_TEST(encode_negative_integer);
}
SUITE(buffer_tests) {
RUN_TEST(buffer_write8_increases_length);
RUN_TEST(buffer_write8_expands_buffer);
RUN_TEST(buffer_write32_expands_buffer);
RUN_TEST(buffer_write32_writes_little_endian);
}
SUITE(compiler_tests) {
RUN_TEST(compile_positive_integer);
RUN_TEST(compile_negative_integer);
}
// End Tests
GREATEST_MAIN_DEFS();
int main(int argc, char **argv) {
GREATEST_MAIN_BEGIN();
RUN_SUITE(object_tests);
RUN_SUITE(buffer_tests);
RUN_SUITE(compiler_tests);
GREATEST_MAIN_END();
}
These tests pass, at least for me. And no Valgrind errors, either! The full source for this post can be put together by putting together the individual code snippets back to back, in order. I recommend following along and typing it manually, to get the full educational experience, but if you must copy and paste it should still work. :)
If you want to convince yourself the tests work, modify the values we’re checking against in some places. Then you’ll see the test fail. Never trust a test suite that you haven’t seen fail… it might not be running the tests!
I think there is also a way to use greatest.h
to do setup and teardown so we
don’t have to do all that buffer machinery, but I haven’t figured out an
ergonomic way to do that yet.
Next time on Dragon Ball Z, we’ll compile some other immediate constants.
Actually, you can get away with a scheme that only plays games with pointer tagging for immediate objects, and uses a header as part of the heap-allocated object to encode additional information about the type, the length, etc. This is what runtimes like the JVM do. ↩