For the fourth World's Simplest Bytecode Interpreter, Kazanov goes into a bit of esoterica without discussing why: he wants "authentic bytecode," so, writing in C, he dense-packs each instruction into a single 16-bit word. That's fine if you're emulating a real chunk of hardware, but for our purposes... I really don't care.

With that in mind, let's talk: a register-based machine is one in which every instruction names the storage locations it operates on directly, and the entirety of the program's live data is expected to fit in a small, fixed number of those locations, the registers. In our case, that small number is 16, although in reality my tests only use three.
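
Just to make that concrete, here's the shape a program takes in this machine: every operand is an explicit register name, rather than "whatever happens to be on top of the stack." The opcodes themselves are defined in the next chunk, so treat this as a preview rather than working code yet:

<code>// Compute 2 * 3: Mul(1, 2, 0) reads registers 1 and 2 and writes register 0.
let program = vec![
    Opcode::Load(1, 2),   // r1 = 2
    Opcode::Load(2, 3),   // r2 = 3
    Opcode::Mul(1, 2, 0), // r0 = r1 * r2
    Opcode::Done(0),      // return whatever is in r0
];</code>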

There are only two opcodes for moving values in or out of our system: Load, which places an immediate value into a register, and Done, which returns the value of a named register to the calling context. Since our example program is a simple calculator, every arithmetic operation takes three register names: two sources and a destination:

<dfn><lib.rs>=</dfn>
use std::result;

const REGISTER_COUNT: usize = 16;

pub enum Opcode {
    Load(usize, i64),         // Place an immediate value in the named register
    Add(usize, usize, usize), // Arithmetic: (source, source, destination)
    Sub(usize, usize, usize),
    Mul(usize, usize, usize),
    Div(usize, usize, usize),
    Done(usize),              // Return the value of the named register
}

pub struct Program {
    program: Vec<Opcode>,             // The instruction stream
    registers: [i64; REGISTER_COUNT], // The register file, zeroed at start
}

Since there's no stack, we don't care about stack underflow. On the other hand, we're requiring that the program have an explicit exit terminator; unlike the previous versions, falling off the end of the program is not acceptable, so we replace our stack underflow with an "unexpected termination" error.

<dfn><lib.rs>+=</dfn>
#[derive(Debug)]
pub enum ProgramError {
    DivisionByZero,
    UnexpectedTermination,
}

type Result<T> = result::Result<T, ProgramError>;

Initializing our program remains familiar, although I've changed one of the initializers. The cargo clippy command enforces an even stricter programming style on Rust than cargo fmt, and it recommends this form of initialization. Here, the bare program entry means "find a value named program in the current lexical scope and use it to initialize the program field of the Program struct." I am not fond of this (it strikes me as a kind of mystery meat programming) but it seems to be popular with Rustaceans.
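
Purely as an illustration (this snippet isn't one of the lib.rs chunks), here are the two forms side by side; clippy prefers the shorthand whenever the local variable and the struct field share a name:

<code>// The explicit form: name the field, then the value that initializes it.
let p = Program { program: program, registers: [0; REGISTER_COUNT] };

// The field-init shorthand clippy recommends; same meaning, less repetition.
let p = Program { program, registers: [0; REGISTER_COUNT] };</code>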

<dfn><lib.rs>+=</dfn>
pub fn interpret(program: Vec<Opcode>) -> Result<i64> {
    let mut program = Program {
        program, 
        registers: [0; REGISTER_COUNT],
    };

And now the opcode handler. This is straightforward: we either introduce values into the machine with Load, we process them with an arithmetic operation, or we're Done and return the value in the register specified. We don't need our macro this time; the register indices are immediate operands, right there in the instruction. We do still need to watch out for that divide-by-zero problem. And if the program ends without being Done, well, that's an error, too. I should point out that the operand order is Kazanov's; if I had been designing this from the ground up, I would have chosen the leftmost register operand as the destination.

<dfn><lib.rs>+=</dfn>
    for op in program.program {
        match op {
            Opcode::Load(r0, imm) => program.registers[r0] = imm,
            // For the arithmetic opcodes, the operand order is
            // (source, source, destination).
            Opcode::Add(r0, r1, r2) => {
                program.registers[r2] = program.registers[r0] + program.registers[r1]
            }
            Opcode::Sub(r0, r1, r2) => {
                program.registers[r2] = program.registers[r0] - program.registers[r1]
            }
            Opcode::Mul(r0, r1, r2) => {
                program.registers[r2] = program.registers[r0] * program.registers[r1]
            }
            Opcode::Div(r0, r1, r2) => {
                // Integer division by zero panics in Rust; catch it first and
                // report it as a proper interpreter error instead.
                if program.registers[r1] == 0 {
                    return Err(ProgramError::DivisionByZero);
                }
                program.registers[r2] = program.registers[r0] / program.registers[r1];
            }
            Opcode::Done(r0) => return Ok(program.registers[r0]),
        }
    }
    Err(ProgramError::UnexpectedTermination)
}

There are two things this VM does not do that it should. First, we do not verify that a register being read has been written to at least once by the program; otherwise we're just using the default value of 0 for whatever calculations come next. Second, we do not validate the register IDs being passed in; an out-of-range ID will index past the end of our static array and panic at runtime.
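
Here's a sketch of what the second check might look like. This isn't part of the lib.rs chunks, and it assumes a hypothetical InvalidRegister variant added to ProgramError:

<code>// Hypothetical: assumes ProgramError grows an InvalidRegister variant.
fn check_register(r: usize) -> Result<usize> {
    if r < REGISTER_COUNT {
        Ok(r)
    } else {
        Err(ProgramError::InvalidRegister)
    }
}

// Each match arm would then validate its operands before touching the
// register file, for example:
//     Opcode::Load(r0, imm) => program.registers[check_register(r0)?] = imm,</code>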

In some ways, this is okay: a smart compiler would catch these mistakes before they could possibly happen.

But we're now outside the realm of "things the compiler can do for us"; our interpreter is beginning to have its own error states, and while there are a number of things we can do to make sure the interpreter itself is well-formed, we're going to have to start worrying about bugs in the programs our interpreter runs.
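
To see what that looks like from the caller's side, here's a sketch (again, not one of the lib.rs chunks) of a buggy program surfacing as one of our error values rather than a crash:

<code>// Div(0, 1, 2) reads register 1, which no instruction ever wrote, so it
// still holds 0 and the interpreter reports DivisionByZero.
let buggy = vec![Opcode::Load(0, 10), Opcode::Div(0, 1, 2), Opcode::Done(2)];
match interpret(buggy) {
    Ok(value) => println!("result: {}", value),
    Err(ProgramError::DivisionByZero) => println!("the program divided by zero"),
    Err(ProgramError::UnexpectedTermination) => println!("the program never reached Done"),
}</code>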

And the unit tests. You can start to see how complex a compiler for a register-based machine would have to be to keep track of which registers were free, which were in use, and how those registers should be allocated and used. And a "real" programming language would also need a stack for scopes and for tracking the evolution of complex function trees. For now, though, this is sufficient.

<dfn><lib.rs>+=</dfn>
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn basic() {
        use Opcode::*;
        assert_eq!(interpret(vec![Load(0, 2), Done(0)]).unwrap(), 2);
        // 2 * 2 * 3 * 4, accumulated in register 0.
        assert_eq!(interpret(vec![Load(1, 2), Load(2, 2), Mul(1, 2, 0),
                                  Load(1, 3), Mul(1, 0, 0), Load(1, 4),
                                  Mul(1, 0, 0), Done(0)]).unwrap(),
                   48);
    }
}

And the results are the same:

<code>running 1 test
test tests::basic ... ok</code>

And that's a register-based virtual machine.