VM/Interpreter - stack or register based?

I built a simple stack-based VM for the Uno and get about 150k-175k opcodes/second when calculating Fibonacci numbers. That is about 50 times slower than native Arduino (C) code.
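For readers who have not seen one, here is a minimal sketch of what a stack-based dispatch loop can look like in C. It is not the VM from this post: the opcode names, the one-byte operand encoding, the local-variable slots and the helper `fib_stack` are all assumptions made up for illustration, and it computes Fibonacci iteratively rather than recursively to keep the bytecode short.

```c
#include <stdint.h>

/* Hypothetical opcode set, for illustration only. */
enum { OP_PUSH, OP_LOAD, OP_STORE, OP_ADD, OP_SUB, OP_JZ, OP_JMP, OP_HALT };

/* Runs bytecode until OP_HALT, returns the top of the stack and
   reports how many opcodes were dispatched. */
static int32_t run_stack_vm(const uint8_t *code, uint32_t *executed) {
    int32_t stack[16];
    int32_t locals[4] = {0};
    uint8_t sp = 0;      /* index of the next free stack slot */
    uint16_t pc = 0;     /* byte offset into the bytecode */
    uint32_t count = 0;

    for (;;) {
        uint8_t op = code[pc++];
        count++;
        switch (op) {
        case OP_PUSH:  stack[sp++] = code[pc++];         break;
        case OP_LOAD:  stack[sp++] = locals[code[pc++]]; break;
        case OP_STORE: locals[code[pc++]] = stack[--sp]; break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_SUB:   sp--; stack[sp - 1] -= stack[sp]; break;
        case OP_JZ: {  /* pop a value, jump if it is zero */
            uint8_t target = code[pc++];
            if (stack[--sp] == 0) pc = target;
            break;
        }
        case OP_JMP:   pc = code[pc];                    break;
        case OP_HALT:  *executed = count; return stack[sp - 1];
        default:       *executed = count; return -1;     /* unknown opcode */
        }
    }
}

/* Iterative Fibonacci: local 0 = a, local 1 = b, local 2 = n. */
static int32_t fib_stack(uint8_t n, uint32_t *executed) {
    const uint8_t program[] = {
        OP_PUSH, 0,  OP_STORE, 0,   /*  0: a = 0            */
        OP_PUSH, 1,  OP_STORE, 1,   /*  4: b = 1            */
        OP_PUSH, n,  OP_STORE, 2,   /*  8: n = argument     */
        OP_LOAD, 2,  OP_JZ, 36,     /* 12: while (n != 0)   */
        OP_LOAD, 0,  OP_LOAD, 1,    /* 16: push a, push b   */
        OP_ADD,                     /* 20: a + b            */
        OP_LOAD, 1,  OP_STORE, 0,   /* 21: a = b            */
        OP_STORE, 1,                /* 25: b = old a + b    */
        OP_LOAD, 2,  OP_PUSH, 1,    /* 27: push n, push 1   */
        OP_SUB,      OP_STORE, 2,   /* 31: n = n - 1        */
        OP_JMP, 12,                 /* 34: back to the test */
        OP_LOAD, 0,  OP_HALT        /* 36: result is a      */
    };
    return run_stack_vm(program, executed);
}
```

Calling `fib_stack(12, &count)` returns the 12th Fibonacci number and leaves the number of dispatched opcodes in `count`, which is the figure you would divide by elapsed time to get opcodes/second. Most of the work per opcode is the fetch and the `switch`, so that dispatch overhead is probably where a large part of the gap to native code comes from.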

I did some testing with a register-based VM (I am not sure my implementation is optimal, however) and found it about 5-10% slower. EDIT: It is slower when counting the number of opcodes handled per second. However, the register-based VM needs fewer opcodes (at least for a recursive Fibonacci test case) and is 10-15% faster in total run time.
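To illustrate why a register-based VM can execute fewer opcodes while each one costs a bit more, here is a minimal sketch in C. Again, this is not the implementation measured above: the fixed four-byte instruction format, the opcode names and the helper `fib_register` are hypothetical, and the test program is an iterative Fibonacci rather than a recursive one.

```c
#include <stdint.h>

/* Hypothetical opcode set, for illustration only. */
enum { R_LDI, R_ADD, R_MOV, R_SUBI, R_JZ, R_JMP, R_HALT };

/* Fixed-width instruction: opcode plus up to three operand fields. */
typedef struct { uint8_t op, a, b, c; } Insn;

/* Runs instructions until R_HALT, returns the named register and
   reports how many instructions were dispatched. */
static int32_t run_register_vm(const Insn *code, uint32_t *executed) {
    int32_t r[4] = {0};
    uint16_t pc = 0;     /* instruction index */
    uint32_t count = 0;

    for (;;) {
        Insn i = code[pc++];
        count++;
        switch (i.op) {
        case R_LDI:  r[i.a] = i.b;               break; /* ra = imm      */
        case R_ADD:  r[i.a] = r[i.b] + r[i.c];   break; /* ra = rb + rc  */
        case R_MOV:  r[i.a] = r[i.b];            break; /* ra = rb       */
        case R_SUBI: r[i.a] = r[i.b] - i.c;      break; /* ra = rb - imm */
        case R_JZ:   if (r[i.a] == 0) pc = i.b;  break; /* jump if ra==0 */
        case R_JMP:  pc = i.a;                   break;
        case R_HALT: *executed = count; return r[i.a];
        default:     *executed = count; return -1;      /* unknown opcode */
        }
    }
}

/* Iterative Fibonacci: r0 = a, r1 = b, r2 = n, r3 = scratch. */
static int32_t fib_register(uint8_t n, uint32_t *executed) {
    const Insn program[] = {
        {R_LDI,  0, 0, 0},   /* 0: a = 0            */
        {R_LDI,  1, 1, 0},   /* 1: b = 1            */
        {R_LDI,  2, n, 0},   /* 2: n = argument     */
        {R_JZ,   2, 9, 0},   /* 3: while (n != 0)   */
        {R_ADD,  3, 0, 1},   /* 4: t = a + b        */
        {R_MOV,  0, 1, 0},   /* 5: a = b            */
        {R_MOV,  1, 3, 0},   /* 6: b = t            */
        {R_SUBI, 2, 2, 1},   /* 7: n = n - 1        */
        {R_JMP,  3, 0, 0},   /* 8: back to the test */
        {R_HALT, 0, 0, 0}    /* 9: result is a      */
    };
    return run_register_vm(program, executed);
}
```

In this sketch one pass through the loop is 6 instructions, because an instruction like `R_ADD` names its operands directly instead of needing separate load and store opcodes around it. Each instruction has more fields to decode, though, which fits the observation that opcodes/second goes down while total run time improves.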

For example, Forth interpreters/compilers are also available for Arduinos. When a terminal is needed for user interaction or updates, a PC is normally used for that nowadays, so the IDE can be used to update the program as well.

I have been reading up on Forth as a possible interpreter/VM. The language itself seems very close to the bytecode that serves as input for a (stack-based) VM. I do think it is hard to master as a language, but speed matters most to me, so I looked for benchmarks and found nothing meaningful.

Any experience with the speed of Forth compared to native Arduino (C) Code?

EDIT: I managed to install YAFFA, learned some Forth, and ran a speed test computing the Fibonacci of 12. Forth (at least the YAFFA implementation) seems about 70 times slower than native Arduino (C) code.