In my last post I discussed inline caching as a technique for runtime optimization. I ended the post with some extensions to the basic technique, like quickening. If you have not read the previous post, I recommend it. This post will make many references to it.
Quickening involves bytecode rewriting — self-modifying code — to remove some branches and indirection in the common path. Stefan Brunthaler writes about it in his papers Efficient Interpretation using Quickening and Inline Caching Meets Quickening.
Let’s take a look at a fragment of the caching interpreter from the last post
so we can talk about the problem more concretely. You can also get the sources
from the repo and open interpreter.c
in your preferred editor.
void add_update_cache(Frame* frame, Object* left, Object* right) {
  Method method = lookup_method(object_type(left), kAdd);
  cache_at_put(frame, object_type(left), method);
  Object* result = (*method)(left, right);
  push(frame, result);
}
void eval_code_cached(Frame* frame) {
  // ...
  while (true) {
    // ...
    switch (op) {
      // ...
      case ADD: {
        Object* right = pop(frame);
        Object* left = pop(frame);
        CachedValue cached = cache_at(frame);
        Method method = cached.value;
        if (method == NULL || cached.key != object_type(left)) {
          add_update_cache(frame, left, right);
          break;
        }
        Object* result = (*method)(left, right);
        push(frame, result);
        break;
      }
      // ...
    }
    frame->pc += kBytecodeSize;
  }
}
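As a refresher, the cache is a per-opcode table mapping an object type to the method to call for it. Here is a minimal sketch of the helpers used above, assuming one cache slot per opcode indexed by pc (in the repo, the cache array may hang off the Code object rather than the Frame):
typedef struct {
  ObjectType key;  // type of the left operand the entry was filled for
  Method value;    // method to call for that type
} CachedValue;

CachedValue cache_at(Frame* frame) {
  return frame->caches[frame->pc / kBytecodeSize];
}

void cache_at_put(Frame* frame, ObjectType key, Method value) {
  frame->caches[frame->pc / kBytecodeSize] =
      (CachedValue){.key = key, .value = value};
}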
As I also mentioned last post, the ADD opcode handler has three cases to handle:
1. The cache is empty.
2. The cache holds an entry for the wrong type (a cache miss).
3. The cache holds an entry for the type of the left operand (a cache hit).
Since Deutsch & Schiffman found that types at a given point in a program don’t vary that much, the third case is the fast path. This means that we should do as little as possible in that case. And right now, we’re doing too much work.
Why should we have to check if the cache slot is empty when, on the fast path, it never will be? And why should we then have to make an indirect call? On some CPUs, indirect calls are much slower than direct calls. And this assumes the compiler generates a call instruction at all — it’s very possible that a compiler would decide to inline the direct call.
Quickening is a technique that reduces the number of checks by explicitly marking state transitions in the bytecode.
In order to remove one of the checks — the method == NULL check — we can add a new opcode, ADD_CACHED. The ADD_CACHED opcode can skip the check because our interpreter will maintain the following invariant:
Invariant: The opcode ADD_CACHED
will appear in the bytecode stream if
and only if there is an entry in the cache at that opcode.
After ADD adds something to the cache, it can rewrite itself to ADD_CACHED. This way, the next time around, we have satisfied the invariant. Let’s see how that looks:
void eval_code_quickening(Frame* frame) {
  // ...
  while (true) {
    // ...
    switch (op) {
      // ...
      case ADD: {
        Object* right = pop(frame);
        Object* left = pop(frame);
        add_update_cache(frame, left, right);
        // The cache at this opcode is now filled, so the invariant holds.
        code->bytecode[frame->pc] = ADD_CACHED;
        break;
      }
      case ADD_CACHED: {
        Object* right = pop(frame);
        Object* left = pop(frame);
        CachedValue cached = cache_at(frame);
        if (cached.key != object_type(left)) {
          add_update_cache(frame, left, right);
          break;
        }
        Method method = cached.value;
        Object* result = (*method)(left, right);
        push(frame, result);
        break;
      }
      // ...
    }
    frame->pc += kBytecodeSize;
  }
}
Not too different. We’ve shuffled the code around a little bit but overall it looks fairly similar. We still get to share some code in add_update_cache, so there isn’t too much duplication, either.
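One note on the rewrite itself: it only works if ADD_CACHED exists alongside ADD in the opcode set. A sketch of the kind of definition assumed here (the enum name and layout are illustrative; the real one lives in interpreter.c):
typedef enum {
  // ...
  ADD,
  ADD_CACHED,  // new: the quickened form of ADD
  // ...
} Bytecode;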
Now that we’ve moved the empty check, it’s time to remove the indirect call.
Let’s assume for a minute that you, the writer of a language runtime, know that
most of the time, when people write a + b
, the operation refers to integer
addition.
Not many other primitive types implement addition. Frequently floating point numbers use the same operator (though languages like OCaml do not). Maybe strings. And maybe your language allows for overloading the plus operator. But most people don’t do that. They add numbers.
In that case, you want to remove as much of the overhead as possible for adding two numbers. So let’s introduce a new opcode, ADD_INT, that is specialized for integer addition.
In an ideal world, we would just be able to pop two objects, add them, and move on. But in our current reality, we still have to deal with the possibility of programmers passing in a non-integer every once in a while.
So first, we check if the types match. If they don’t, we populate the cache and transition to ADD_CACHED. I’ll get to why we do that in a moment. And if we did actually get an int, great, we call this new function, do_add_int.
void do_add_int(Frame* frame, Object* left, Object* right) {
  Object* result = int_add(left, right);
  push(frame, result);
}
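do_add_int in turn calls int_add, which was elided above. A minimal sketch, assuming the boxed-integer representation from the last post, where object_as_int and new_int (assumed names) unbox and box C integers:
Object* int_add(Object* left, Object* right) {
  // Unbox both operands, add, and box the result. Overflow handling
  // is omitted in this sketch.
  return new_int(object_as_int(left) + object_as_int(right));
}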
void eval_code_quickening(Frame* frame) {
  // ...
  while (true) {
    // ...
    switch (op) {
      // ...
      case ADD_INT: {
        Object* right = pop(frame);
        Object* left = pop(frame);
        if (object_type(left) != kInt) {
          // Our integer assumption was wrong; fall back to the generic cache.
          add_update_cache(frame, left, right);
          code->bytecode[frame->pc] = ADD_CACHED;
          break;
        }
        do_add_int(frame, left, right);
        break;
      }
      // ...
    }
    frame->pc += kBytecodeSize;
  }
}
This is a nice opcode handler for ADD_INT, but right now it’s orphaned. Some opcode has to take the leap and rewrite itself to ADD_INT, otherwise it’ll never get run.
I suggest we make ADD
do the transition. This keeps ADD_CACHED
fast for
other types. If ADD
observes that the left hand side of the operation is an
integer, it’ll call do_add_int
and rewrite itself.
Let’s see how that looks in code.
void eval_code_quickening(Frame* frame) {
  // ...
  while (true) {
    // ...
    switch (op) {
      // ...
      case ADD: {
        Object* right = pop(frame);
        Object* left = pop(frame);
        if (object_type(left) == kInt) {
          do_add_int(frame, left, right);
          code->bytecode[frame->pc] = ADD_INT;
          break;
        }
        add_update_cache(frame, left, right);
        code->bytecode[frame->pc] = ADD_CACHED;
        break;
      }
      // ...
    }
    frame->pc += kBytecodeSize;
  }
}
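Taken together, the three opcodes form a small state machine at each addition site, summarized here as a comment:
// Opcode transitions at a given ADD site:
//   ADD ---------(left is an int)----------> ADD_INT
//   ADD ---------(left is any other type)--> ADD_CACHED
//   ADD_INT -----(left is not an int)------> ADD_CACHED
//   ADD_CACHED --(cache miss)--------------> ADD_CACHED (entry is replaced)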
Back to “why transition from ADD_INT to ADD_CACHED”. Two thoughts:
1. We could transition back to ADD. In that case, this code would perform poorly in an environment where the programmer passes multiple different types at this opcode. There would be a lot of bytecode rewriting overhead going on as it goes back and forth between ADD and ADD_INT.
2. We could also assume it’s a hiccup and not rewrite. This would perform poorly if the argument is an integer the first time but something else on every subsequent operation. There would be a lot of lookup_method calls.
A great extension here would be to add a polymorphic cache. Those are designed to efficiently handle a small number of repeated types (normally fewer than five) at a given point.
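Such a cache might be a small fixed-size array of (type, method) entries searched linearly. A minimal sketch with invented names (ObjectType and Method are the interpreter’s existing types); this is not code from the repo:
#define kMaxCacheEntries 4

typedef struct {
  ObjectType key;
  Method value;
} CacheEntry;

typedef struct {
  CacheEntry entries[kMaxCacheEntries];
} PolymorphicCache;

// Return the cached method for this type, or NULL on a miss.
Method poly_cache_lookup(PolymorphicCache* cache, ObjectType type) {
  for (int i = 0; i < kMaxCacheEntries; i++) {
    if (cache->entries[i].key == type) {
      return cache->entries[i].value;
    }
  }
  return NULL;
}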
Even if we leave the interpreter in this state, a small C bytecode interpreter, we save a couple of instructions and some call overhead in the fast path of integer addition. This is a decent win for math-heavy applications.
In the best case, though, we save a great many instructions. It’s entirely possible that the compiler will optimize the entire body of ADD_INT to something like:
pop rax              ; right operand
pop rcx              ; left operand
cmp rcx, $IntTag     ; check the type of the left operand
jne slow_path
add rcx, rax
push rcx
jmp next_opcode
slow_path:
; ...
It won’t look exactly like that, due to our object representation and because our push/pop functions do not operate on the C call stack, but it will be a little closer than before. But what if we could fix these issues and trim down the code even further?
Then we might have something like the Dart intermediate implementation of addition for small integers on x86-64. The following C++ code emits assembly for a specialized small integer handler:
void CheckedSmiOpInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
  // ...
  Register left = locs()->in(0).reg();
  Register right = locs()->in(1).reg();
  // Check both left and right are small integers
  __ movq(TMP, left);
  __ orq(TMP, right);
  __ testq(TMP, compiler::Immediate(kSmiTagMask));
  __ j(NOT_ZERO, slow_path->entry_label());
  Register result = locs()->out(0).reg();
  __ movq(result, left);
  __ addq(result, right);
  // ...
}
This example is a little bit different since it is using an optimizing compiler and assumes the input and output are both in registers, but it still expresses the main ideas. Note the type check: small integers (Smis) have a tag bit of 0, so ORing the two operands together and testing against the tag mask checks both operands at once.
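The same trick is easy to express in C. A sketch, assuming a one-bit tag where small integers have a low tag bit of 0, as in Dart; kSmiTagMask here is a stand-in constant, not the interpreter’s:
#include <stdbool.h>
#include <stdint.h>

enum { kSmiTagMask = 1 };

// One OR plus one mask test checks both operands' tags at once.
bool both_are_smis(uintptr_t left, uintptr_t right) {
  return ((left | right) & kSmiTagMask) == 0;
}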
See also the JVM template interpreter implementation for binary operations on small integers:
void TemplateTable::iop2(Operation op) {
  // ...
  __ pop_i(rdx);
  __ addl(rax, rdx);
  // ...
}
which pops a value off the stack into rdx and adds it into rax. I think this is because the JVM caches the top of the stack in the register rax at all times, but I have not been able to confirm this. It would explain why it adds into rax and why there is no push, though.
There are a number of improvements that could be made to this very simple demo. Bytecode rewriting can unlock a lot of performance gains with additional work. For example, the specialized opcodes (like ADD_INT) could directly make use of the call stack. Maybe I will even write about them in the future.