Optimization is getting things to work faster with fewer resources.

What are some common ways to optimize an interpreter?
Comments:

  • "In what way does this need focus?" – Starship (Jun 28, 2023 at 20:45)
  • "@Starship-OnStrike Optimizing compilers/interpreters is a whole field. Like a quarter of this site's entire scope is covered by this one question. It's enormously broad." (Jun 29, 2023 at 16:32)
  • "I said some common ways. Not all ways @RydwolfPrograms" – Starship (Jun 30, 2023 at 0:53)

3 Answers


The first option is changing the way you execute the code. Reduce the cache thrashing of tree-walking interpretation by moving the AST nodes being interpreted into a contiguous memory buffer.

After that, you can reduce the size of each interpreter node by keeping only what is essential for execution: change the virtual function table pointer into a byte-sized enum that you switch on (or use computed goto) in the interpreter loop, and change jumps from pointers to offsets into the buffer.

Then you are only a few short steps removed from a proper bytecode interpreter. To get there, you will need to refactor toward register-based or stack-based interpretation. Stack-based designs have more compact bytecode but need a bit more work to keep the stack-top pointer up to date between instructions.

Then optimize the interpreter loop itself: make sure there are as few instructions as possible between pieces of actual work.

Then you can check which combinations of opcodes often occur together and fuse them into fewer opcodes (sometimes called superinstructions). However, that may bloat the machine code, which can hurt the CPU's instruction cache.

The final step in making an interpreter fast is compiling it to machine code at runtime (usually called JIT compilation); however, this means quite a bit of up-front work before you start executing.


I can only speak to my experience; I have a widely-used interpreter, and I didn't optimize it much.

The first thing that I noticed is that executing an AST directly involves a lot of pointer chasing. In fact, it was so obvious that I noticed the problem before I released the first version.

Now, it is still tempting to do that, and it can be fast enough, but in my case, it wasn't.

So I rewrote everything to use bytecode, which meant encoding arguments as indices.

"Indices into what?" you might ask.

It depends on the argument, but essentially, for every kind of argument, I had a global or local array of the values of that kind of argument.

Anyway, you can do a few tricks to shrink the size of bytecode to optimize further.

For example, I encoded indices as one byte whose value was the number of bytes needed to encode the value, then that number of bytes smashed together into the value.

This means that 0 encodes as one byte (with a value of 0), 1 encodes as two bytes (the first with a value of 1, the second with a value of 1), 2 encodes as two bytes (the first with a value of 1, the second with a value of 2), and so on.

This shrinks the bytecode, which shrinks both the memory it takes and the number of loads you need to do.

Maybe you can improve performance by using bitcode, shrinking everything down to the exact number of bits needed? Perhaps, but my experience showed that the extra instructions needed to extract the information either nullified the advantage or reversed it.

The second thing you can do is improve the data structures used to store data. I did this by having all of the values use the same data (a union of sorts) even if it was a bit wasteful. I would be surprised if you could use this in a general interpreter, but bc is simple.

The third thing is to optimize the execution of the most important instructions. I did this by making my math operations much faster.

In fact, while this step is the most open-ended, it has the potential for great improvement because profiling will help you focus your efforts on the best places.

The fourth thing I did was to take advantage of common compiler optimizations with link-time optimization to inline aggressively. This gave me a 2x improvement basically everywhere because the inlining would then be able to specialize stuff for specific operations.

The fifth thing was to use computed goto when I could. Here is a good introduction to computed goto.

I had it harder because my code had to work on compilers without computed goto, so I wrote some macros whose implementation changes per compiler. But this gave me a 1.5x improvement.

At that point, it was good enough (math is still the bottleneck for a bc), so I stopped.

But if you want to go further, study the best, of which an incomplete list is:

  • LuaJIT for Lua by Mike Pall.
  • V8 for JavaScript.
  • CPython for Python.

It depends on how exactly you're using your interpreter.

If you need to optimise for latency between reading a line of script code and executing it, then doing no optimisation whatsoever is the best choice in many cases, since any time spent analysing the code adds directly to that latency.

If you need to optimise an interpreter for execution speed of somewhat long-running code, then you can do pretty much all the same things you'd do in an optimising compiler. E.g., lower your code into some easily optimisable IR, such as SSA form; apply all the optimisations you can think of (constant folding, various code motions, aggressive dead code elimination, partial evaluation, inlining, loop fusion, etc.); then reduce what is left to some bytecode and execute it. Or go further and perform JIT compilation guided by runtime tracing data.

If you need a balance between execution speed and the latency from reading code to executing it, then you can do minimal, simple optimisations, then generate, say, some stack-VM bytecode, convert it to indirect (or even direct) threaded code, and execute that.

Language properties affect performance quite significantly. For example, a dynamically typed language will result in slow code regardless of how smart your optimisations are, simply due to the dynamic dispatch overhead. A statically typed language allows a lot more optimisations at much lower cost.

Likewise, lexical scoping allows for much better runtime performance than dynamic scoping, which may require runtime variable-name lookups.
