18Apr

The World’s Fifth Simplest Bytecode Interpreter: A Regular Expression Engine!

Posted by Elf Sternberg as Uncategorized

Kazanov’s fifth World’s Simplest Bytecode Intepreter (see the Fourth Simplest) isn’t quite so simple: it is an implementation of Thompson’s most basic regular expression table-driven DFA handler. It handles a complete set of Kleene’s Regular Expressions without the closure operator (no * operator), and it does so by exploiting the host language stack to handle nested recursion of alternatives.

This is a much different beast from the previous four, and it requires a little thinking to grok why it’s so different. For one thing, we’re going to be jumping back and forth inside the program: this is our first bytecode interpreter with a Jmp instruction. More than that, we may want to jump backwards, so we can’t use ‘usize’ for that, we’ll need to use ‘isize’. There are times we want to add a negative offset, after all!

Our opcode now looks like this:

<lib.rs>=
pub enum Opcode {
    Chr(char),
    Jmp(isize),
    Alt(isize, isize),
    End,
}

If you’re at all following my other project at home, there’s an operation missing: where is Seq? Well, it turns out that Seq is handled simply by incrementing the two pointers.

In order to facilitate rapid access to our opcodes, we’re going to convert it into a concrete Vec<Opcode>, and likewise the string we’ll be analyzing is going to be a Vec<char>. We’ll be passing references of these two objects to our interpreter, as neither the string nor the program will be mutated by the interpreter’s operation:

<lib.rs>+=
pub fn interpret_reentrant(
    program: &[Opcode],
    sample: &[char],
    mut pc: isize,
    mut sp: isize,
) -> bool {

We’re only going to bother with “did it match, or didn’t it?”— a boolean. We also have two new variables, the pc (program counter) and the sp (string pointer). As those can “fall off the world” at either end, we have to test for that. If we leave the world, obviously it didn’t match:

<lib.rs>+=
    loop {
        if sp < 0 || (sp as usize) > sample.len() || pc < 0 || (pc as usize) > program.len() {
            return false;
        }
        

The meat of our interpreter is, erm, “fun.” If the character matches, we proceed; if not, we return false. The Alt operator takes two addresses: an offset to the left child expression and another offset to the right child expression. Each child expression must end with a Jmp to the next element in the program sequence after the Alt.  And again, because we’re mixing and matching pointer and offsets, we have to do benign casts on our indexes:

<lib.rs>+=
        match program[pc as usize] {
            Opcode::Chr(c) => {
                if c == sample[sp as usize] {
                    pc += 1;
                    sp += 1;
                } else {
                    return false;
                }
            },
            
            Opcode::Jmp(i) => {
                pc += i;
            },
            
            Opcode::Alt(l, r) => {
                if interpret_reentrant(program, sample, pc + l, sp) {
                    return true;
                }
                pc += r;
            },

            Opcode::End => {
                return true
            }
        }
    }
}

We now have to start our engine:

<lib.rs>+=
pub fn interpret(program: Vec<Opcode>, sample: &str) -> bool {
    let s: Vec<char> = sample.chars().collect();
    interpret_reentrant(&program, &s, 0, 0)
}

And… that’s it. That’s an entire regular expression engine in about 45 lines of Rust. It’s not really a good regular expression engine; a char in Rust isn’t really a character, as it will only encode a Unicode code point. A letter with an umlaut might be a single Unicode point, or it could be a multi-code assembly of an undecorated letter followed by an “put an umlaut over the previous symbol” instruction. Discussing the difference between individual Unicode code-points and grapheme clusters, and normalizing between them, is a very advanced topic, one I have not yet begun to tackle in my own work. Because the sp points on the stack encode backtracking-on-failure, this code doesn’t work in a streaming fashion. But for a demo… eh. It does the job.

Let’s just say that performing regular expressions on Asian languages, apparently especially Korean, is a non-trivial exercise.

The unit tests are straightforward, but you can see how complex our programs quickly become. There’s also an interesting, well, I don’t want to call it a ‘failure mode,’ but look down at the Alt test. See those Jmp(1) operations? They’re a consequence of the rule that each arm of the Alt must end with a Jmp to command after the Alt. That Jmp is completely unnecessary; it only advances the program counter by one, which our (virtual) Seq would do anyway. It’s a wasted operation and hopefully any good compiler we write would recognize it for what it is.

<lib.rs>+=
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn basic() {
        use Opcode::*;
        // "a"
        assert!(interpret(vec![Chr('a'), End], "a"));
    }

    #[test]
    fn seq() {
        use Opcode::*;
        // "ab"
        assert!(interpret(vec![Chr('a'), Chr('b'), End], "ab"));
    }

    #[test]
    fn alt() {
        use Opcode::*;
        // "(a|b)c"
        assert!(interpret(
            vec![Alt(1, 3), Chr('a'), Jmp(2), Chr('b'), Jmp(1), Chr('c'), End],
            "ac"
        ));
        assert!(interpret(
            vec![Alt(1, 3), Chr('a'), Jmp(2), Chr('b'), Jmp(1), Chr('c'), End],
            "bc"
        ));
        assert_eq!(
            interpret(
                vec![Alt(1, 3), Chr('a'), Jmp(2), Chr('b'), Jmp(1), Chr('c'), End],
                "ab"
            ),
            false
        );
    }
}

And here ends where I’m going to follow Kazanov. He does much more with later regular expressions, exploiting some GCC-isms to create a full-powered context-threaded interpreter. It turns out Rust will recognize what you’re trying to do and do the right thing if you follow a fairly idiomatic design pattern for interpreters, so those wacky GCC-isms aren’t necessary.

I’m sure there’s much more we could go into, but I have other things to study next week, I’m in the middle of looking for work (hey, anyone need a skilled Rust developer?) and this more or less satisfies my rebooting my knowledge of bytecode interpreter design.

All of the code for this project is now available on Github.

Comment Form

Subscribe to Feed

Categories

Calendar

April 2019
M T W T F S S
« Mar   May »
1234567
891011121314
15161718192021
22232425262728
2930