How to speed up VM / bytecode interpreter?

I have programmed a register-based bytecode interpreter/virtual machine. The purpose of this is to make it possible to run (interpreted) programs which can be edited from the Serial Monitor.

I started with a simple stack-based virtual machine (not all bytecodes are implemented yet). I have improved the execution speed 4 to 5 times, but it is still 24 times slower than native code. (All this is measured by calculating Fibonacci numbers, n = (n-2) + (n-1).) So I am looking for further improvements to execution speed. Attached is the sketch, which completes in 101 ms on a 16 MHz Arduino Uno.

I noticed / implemented the following (in the order below) which improved execution speed:

  • Replaced the bytecode switch tree with gotos
  • Register-based instead of stack-based
  • Local variables and no class construct
  • Bytes as data types instead of ints
  • Bytecodes (and argument counts) matching the AVR instruction set

What did not make any difference was setting the GCC argument to -O3.

Any ideas how to further improve the speed?

Interpreter_RegBased.ino (13.1 KB)

Try here

Continuing the other thread...

By the way, one order of magnitude is a factor of 10, and two orders of magnitude is a factor of 100. Orders of magnitude are powers of 10.

If you are slowed down by a factor of 24, then you are slowed down between one and two orders of magnitude.

vaj4088:
Continuing the other thread...

By the way, one order of magnitude is a factor of 10, and two orders of magnitude is a factor of 100. Orders of magnitude are powers of 10.

If you are slowed down by a factor of 24, then you are slowed down between one and two orders of magnitude.

log₁₀(24) ≈ 1.38, to be precise.

"it still is 24 times slower than native code" - I assume that what you mean is that if a given instruction sequence takes 10 ms to run on a real AVR processor, presumably at 16 MHz, then it will take 240 ms to run on your simulator? If so, I doubt you'll do much better with this approach.

First, what you've done is virtually identical to the code the compiler would produce for this simple construct:

...
int opcode;
while ((opcode = NCODE()) != 0)
{
    switch (opcode)
    {
        case ADD:
            ...
            break;
        case SUB:
            ...
            break;
        ...
    }
}

which is an order of magnitude more maintainable and extensible. A switch statement with a non-sparse set of case values will generally be compiled to a jump table. But your basic problem is that each operation you perform, for example fetching an operand, takes multiple AVR instructions, each of which takes MORE time than the entire emulated AVR instruction. You simply won't get close to the same performance. The only way to get as close as possible is to write in assembler, optimize the he11 out of the code, and use every trick in the book.

Back in the '80s, I wrote a Z-80 + CP/M emulator that ran on IBM PCs, so that old CP/M programs could run on them. In fact, the company that sold it ran their CP/M CBASIC accounting software on it for many years, and they also sold a Z-80 text editor for PCs that was actually just the existing CP/M version of the editor running under my emulator. Nobody ever knew. Running on a 6 MHz PC-AT, it was as fast as a 2 MHz Z-80. Faster, in fact, than any other Z-80 emulator on the market, and even some of the Z-80 plug-in boards. But it was written entirely in 8086 assembler, and used every trick I knew to make it that fast.

If your goal is to make some kind of emulator, why not simply use the AVR to actually execute real AVR code? Your program can act as an interactive assembler that compiles to actual machine code, and you can modify the code on the fly to allow breakpoints, single-stepping, register dumps, or whatever.

Regards,
Ray L.

"it still is 24 times slower than native code" - I assume that what you mean is that if a given instruction sequence takes 10 ms to run on a real AVR processor, presumably at 16 MHz, then it will take 240 ms to run on your simulator? If so, I doubt you'll do much better with this approach....But your basic problem is that each operation you perform, for example fetching an operand, takes multiple AVR instructions, each of which takes MORE time than the entire emulated AVR instruction.

I understand. Would it be faster to parse the full bytecode up front, determine the operands, store them in a structure per opcode, and then interpret/run these structures instead of the bytecode?
I did try this a bit, but did not achieve any speed gains (only slowdowns). However, I could well have implemented this suboptimally. Is it worth pursuing further?
BTW: I forgot to make some register variables local. Current speed is 'only' 20x slower.

If your goal is to make some kind of emulator, why not simply use the AVR to actually execute real AVR code?

Execute AVR ASM code which the user inputs, or do I misunderstand? I thought that, given the Harvard architecture, this was impossible.

NardJ:
Execute AVR ASM code which the user inputs, or do I misunderstand? I thought that, given the Harvard architecture, this was impossible.

Why? In emulation, you just have a program segment and a data segment, both in RAM.

I think you've missed the whole point. You're simply NOT going to get it significantly faster than it already is. By coding the whole thing in assembler you might get another 10-15%, but that's as good as it's going to get. Even 10X slower than the actual hardware is simply not realistic.

Regards,
Ray L.

Well, I managed to get it down to only 20x slower (using fib(6000) with byte calculations as the benchmark). I can't get it any faster (I don't want to resort to assembly, and indeed doubt it would make much difference).

See here for the forum-thread with the result/screenshot/source files.

Thanks all for your explanations and help!