Kazanov’s fifth World’s Simplest Bytecode Interpreter (see the Fourth Simplest) isn’t quite so simple: it is an implementation of Thompson’s most basic regular expression table-driven DFA handler. It handles a complete set of Kleene’s Regular Expressions without the closure operator (no * operator), and it does so by exploiting the host language stack to handle nested recursion of alternatives.

This is a much different beast from the previous four, and it requires a little thinking to grok why it’s so different. For one thing, we’re going to be jumping back and forth inside the program: this is our first bytecode interpreter with a Jmp instruction. More than that, we may want to jump backwards, so we can’t use ‘usize’ for the program counter; we’ll need ‘isize’. There are times we want to add a negative offset, after all!

Our opcode now looks like this:

<lib.rs>=
pub enum Opcode {
    Chr(char),
    Jmp(isize),
    Alt(isize, isize),
    End,
}

If you’re at all following my other project at home, there’s an operation missing: where is Seq? Well, it turns out that Seq is handled simply by incrementing the two pointers.
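
As a sketch (assuming use Opcode::* is in scope, as it is in the unit tests below), the expression "ab" compiles down to nothing but adjacency:

// Matching a character advances both the program counter and the string
// pointer, so two Chr operations sitting next to each other is sequencing.
let seq_ab = vec![Chr('a'), Chr('b'), End];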

In order to facilitate rapid access to our opcodes, we’re going to convert the program into a concrete Vec<Opcode>, and likewise the string we’ll be analyzing is going to be a Vec<char>. We’ll be passing references to these two objects to our interpreter, as neither the string nor the program will be mutated by the interpreter’s operation:

<lib.rs>+=
pub fn interpret_reentrant(
    program: &[Opcode],
    sample: &[char],
    mut pc: isize,
    mut sp: isize,
) -> bool {

We’re only going to bother with “did it match, or didn’t it?”— a boolean. We also have two new variables, the pc (program counter) and the sp (string pointer). As those can “fall off the world” at either end, we have to test for that. If we leave the world, obviously it didn’t match:

<lib.rs>+=
    loop {
        if sp < 0 || (sp as usize) > sample.len() || pc < 0 || (pc as usize) >= program.len() {
            return false;
        }
        

The meat of our interpreter is, erm, “fun.” If there’s a character left in the sample and it matches, we proceed; if not, we return false. The Alt operator takes two offsets: one to the left child expression and another to the right child expression. Each child expression must end with a Jmp to the next element in the program sequence after the Alt. And again, because we’re mixing and matching pointers and offsets, we have to do benign casts on our indexes:

<lib.rs>+=
        match program[pc as usize] {
            Opcode::Chr(c) => {
                if (sp as usize) < sample.len() && c == sample[sp as usize] {
                    pc += 1;
                    sp += 1;
                } else {
                    return false;
                }
            },
            
            Opcode::Jmp(i) => {
                pc += i;
            },
            
            Opcode::Alt(l, r) => {
                if interpret_reentrant(program, sample, pc + l, sp) {
                    return true;
                }
                pc += r;
            },

            Opcode::End => {
                return true
            }
        }
    }
}
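
To make the offsets concrete, here is how "(a|b)c" lays out; it’s the same program used in the Alt unit test below (again assuming use Opcode::* is in scope), with comments tracing where each relative jump lands:

// index 0: Alt(1, 3)   left arm at 0 + 1 = 1; on failure, pc += 3 lands at 3
// index 1: Chr('a')    left arm
// index 2: Jmp(2)      2 + 2 = 4, which immediately jumps again to 5
// index 3: Chr('b')    right arm
// index 4: Jmp(1)      4 + 1 = 5, the first instruction after the alternation
// index 5: Chr('c')
// index 6: End
let alt = vec![Alt(1, 3), Chr('a'), Jmp(2), Chr('b'), Jmp(1), Chr('c'), End];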

We now have to start our engine:

<lib.rs>+=
pub fn interpret(program: Vec<Opcode>, sample: &str) -> bool {
    let s: Vec<char> = sample.chars().collect();
    interpret_reentrant(&program, &s, 0, 0)
}

And… that’s it. That’s an entire regular expression engine in about 45 lines of Rust. It’s not really a good regular expression engine; a char in Rust isn’t really a character, as it only encodes a single Unicode code point. A letter with an umlaut might be a single code point, or it could be a multi-code-point assembly of an undecorated letter followed by a “put an umlaut over the previous symbol” instruction. Discussing the difference between individual Unicode code points and grapheme clusters, and normalizing between them, is a very advanced topic, one I have not yet begun to tackle in my own work. And because the saved sp values on the host stack encode backtracking-on-failure, this code doesn’t work in a streaming fashion. But for a demo… eh. It does the job.
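
A quick illustration of the code-point problem, using nothing but the standard library; this isn’t part of the interpreter, just a demonstration:

// One precomposed 'ü' versus a 'u' followed by a combining diaeresis:
// the same glyph on screen, but a different number of chars.
assert_eq!("\u{00fc}".chars().count(), 1); // precomposed
assert_eq!("u\u{0308}".chars().count(), 2); // u + combining mark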

Let’s just say that performing regular expressions on Asian languages, apparently especially Korean, is a non-trivial exercise.

The unit tests are straightforward, but you can see how complex our programs quickly become. There’s also an interesting, well, I don’t want to call it a ‘failure mode,’ but look down at the Alt test. See that Jmp(1) at the end of the right arm? It’s a consequence of the rule that each arm of the Alt must end with a Jmp to the command after the Alt. That Jmp is completely unnecessary; it only advances the program counter by one, which our (virtual) Seq would do anyway. It’s a wasted operation, and hopefully any good compiler we write would recognize it for what it is.

<lib.rs>+=
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn basic() {
        use Opcode::*;
        // "a"
        assert!(interpret(vec![Chr('a'), End], "a"));
    }

    #[test]
    fn seq() {
        use Opcode::*;
        // "ab"
        assert!(interpret(vec![Chr('a'), Chr('b'), End], "ab"));
    }

    #[test]
    fn alt() {
        use Opcode::*;
        // "(a|b)c"
        assert!(interpret(
            vec![Alt(1, 3), Chr('a'), Jmp(2), Chr('b'), Jmp(1), Chr('c'), End],
            "ac"
        ));
        assert!(interpret(
            vec![Alt(1, 3), Chr('a'), Jmp(2), Chr('b'), Jmp(1), Chr('c'), End],
            "bc"
        ));
        assert_eq!(
            interpret(
                vec![Alt(1, 3), Chr('a'), Jmp(2), Chr('b'), Jmp(1), Chr('c'), End],
                "ab"
            ),
            false
        );
    }
}

And here ends where I’m going to follow Kazanov. He does much more with regular expressions in his later installments, exploiting some GCC-isms to create a full-powered context-threaded interpreter. It turns out Rust will recognize what you’re trying to do and do the right thing if you follow a fairly idiomatic design pattern for interpreters, so those wacky GCC-isms aren’t necessary.

I’m sure there’s much more we could go into, but I have other things to study next week, I’m in the middle of looking for work (hey, anyone need a skilled Rust developer?), and this more or less satisfies my goal of rebooting my knowledge of bytecode interpreter design.

All of the code for this project is now available on GitHub.

For the fourth World’s Simplest Bytecode Interpreter, Kazanov goes into a bit of esoterica without discussing why: he wants “authentic bytecode,” so, writing in C, he dense-packs each instruction into a single 16-bit word. That’s fine if you’re emulating a real chunk of hardware, but for our purposes… I really don’t care.

With that in mind, let’s talk: a register-based machine is one in which the entirety of the program’s live data is assumed to fit in a small, fixed set of named slots (registers) that the instructions read from and write to directly, rather than on a stack. In our case, that small number is 16, although in reality my tests only use three.

There are only two opcodes for moving values in or out of our system: Load, which takes an immediate value and puts it into a register, and Done, which returns a specific value from a register to some calling context. Since our example program is a simple calculator, every arithmetic operation takes three register names: two sources and a destination:

<lib.rs>=
use std::result;

const REGISTER_COUNT: usize = 16;

pub enum Opcode {
    Load(usize, i64),
    Add(usize, usize, usize),
    Sub(usize, usize, usize),
    Mul(usize, usize, usize),
    Div(usize, usize, usize),
    Done(usize),
}

pub struct Program {
    program: Vec<Opcode>,
    registers: [i64; REGISTER_COUNT],
}

Since there’s no stack, we don’t care about stack underflow. On the other hand, we’re requiring that the program have an explicit exit terminator; unlike the previous versions, falling off the end of the program is not acceptable, so we replace our stack underflow with an “unexpected termination” error.

<lib.rs>+=
#[derive(Debug)]
pub enum ProgramError {
    DivisionByZero,
    UnexpectedTermination,
}

type Result<T> = result::Result<T, ProgramError>;

Initializing our program remains familiar, although I’ve changed one of the initializers. The cargo clippy command enforces an even more strict programming style on Rust than cargo fmt, and it recommends this form of initialization. Here, the program entry means “find a value named program in the current lexical scope and use it to initialize the program field for the Program struct.” I am not fond of this— it strikes me as a kind of mystery meat programming— but it seems to be popular with Rustaceans.
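
For comparison, here are both spellings written as hypothetical constructors (they assume they live in the same module as Program); clippy pushes you toward the second:

// The long form: name the field and the value separately.
fn new_program_explicit(program: Vec<Opcode>) -> Program {
    Program { program: program, registers: [0; REGISTER_COUNT] }
}

// The shorthand: when the local variable and the field share a name,
// naming it once is enough.
fn new_program(program: Vec<Opcode>) -> Program {
    Program { program, registers: [0; REGISTER_COUNT] }
}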

<lib.rs>+=
pub fn interpret(program: Vec<Opcode>) -> Result<i64> {
    let mut program = Program {
        program, 
        registers: [0; REGISTER_COUNT],
    };

And now the opcode handler. This is straightforward; we either introduce values into the program with Load, we process them with an operation, or we’re Done and return a value in the register specified. We don’t need our macro this time; the register addresses are carried in the opcode itself and immediately available. We do still need to watch out for that divide-by-zero problem. And if the program ends without being Done, well, that’s an error, too. I should point out that the order of registers is Kazanov’s; if I had been designing this from the ground up, I would have chosen the leftmost register operand as the destination.

<lib.rs>+=
    for op in program.program {
        match op {
            Opcode::Load(r0, imm) => program.registers[r0] = imm,
            Opcode::Add(r0, r1, r2) => {
                program.registers[r2] = program.registers[r0] + program.registers[r1]
            }
            Opcode::Sub(r0, r1, r2) => {
                program.registers[r2] = program.registers[r0] - program.registers[r1]
            }
            Opcode::Mul(r0, r1, r2) => {
                program.registers[r2] = program.registers[r0] * program.registers[r1]
            }
            Opcode::Div(r0, r1, r2) => {
                if program.registers[r1] == 0 {
                    return Err(ProgramError::DivisionByZero);
                }
                program.registers[r2] = program.registers[r0] / program.registers[r1];
            },
            Opcode::Done(r0) => return Ok(program.registers[r0]),
        }
    }
    Err(ProgramError::UnexpectedTermination)
}

Two things this VM does not do that it should. First, we do not verify that a register being read is one that has been written to at least once by the program, otherwise we’re just using the default 0 value for whatever calculations come next. Second, we do not validate that the register IDs being passed in are valid; we could easily access a value outside the size of our static array.

In some ways, this is okay: a smart compiler would catch these mistakes before they could possibly happen.

But we’re now outside the realm of “things the compiler can do for us”; our interpreter is beginning to have its own error states, and while there are a number of things we can do to make sure the interpreter itself is well-formed, we’re going to have to start worrying about bugs in the programs our interpreter runs.
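
If we wanted the interpreter to defend itself rather than trust the compiler, a pre-flight pass over the opcodes would do it. A sketch (validate is my own function, not part of Kazanov’s design): reject out-of-range register IDs and reads from registers that were never written.

pub fn validate(program: &[Opcode]) -> bool {
    let mut written = [false; REGISTER_COUNT];
    let ok = |r: usize| r < REGISTER_COUNT;
    for op in program {
        match *op {
            Opcode::Load(r0, _) => {
                if !ok(r0) {
                    return false;
                }
                written[r0] = true;
            }
            Opcode::Add(r0, r1, r2)
            | Opcode::Sub(r0, r1, r2)
            | Opcode::Mul(r0, r1, r2)
            | Opcode::Div(r0, r1, r2) => {
                if !(ok(r0) && ok(r1) && ok(r2) && written[r0] && written[r1]) {
                    return false;
                }
                written[r2] = true;
            }
            Opcode::Done(r0) => return ok(r0) && written[r0],
        }
    }
    false // fell off the end without a Done
}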

And the unit tests. You can start to see how complex a register-based compiler would have to be to keep track of which registers were free, which were in use, and how those registers should be allocated and used. And a “real” programming language would also have a stack for scopes and for tracking the evolution of complex function trees. For now, though, this is sufficient.

<lib.rs>+=
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn basic() {
        use Opcode::*;
        assert_eq!(interpret(vec![Load(0, 2), Done(0)]).unwrap(), 2);
        assert_eq!(interpret(vec![Load(1, 2), Load(2, 2), Mul(1, 2, 0), 
                                  Load(1, 3), Mul(1, 0, 0), Load(1, 4), 
                                  Mul(1, 0, 0), Done(0) ]).unwrap(),
                   48);
    }
}

And the results are the same:

running 1 test
test tests::basic ... ok

And that’s a register-based virtual machine.

For the previous step of Implementing The World’s Simplest Bytecode Interpreter in Rust, I followed Vladimir Kazanov as he added larger opcodes to his bytecode. In the third variant, he adds a stack, pushing values onto it as needed and letting the operations pop their arguments back off.

My variant of this operation follows his, but is slightly different. He continues to use the stack somewhat arbitrarily, but as I’ve pointed out, Rust doesn’t allow for that outside of unsafe code.

We’re going to swap out the accumulator for a stack, and we’ll return the last value pushed onto the stack as the result. After including the result type, which we’ll use for error messaging and forwarding, our program now looks like this:

<lib.rs>=
use std::result;

pub enum Opcode {
    Push(i64),
    Add,
    Sub,
    Mul,
    Div,
    Done
}

pub struct Program {
    program: Vec<Opcode>,
    stack: Vec<i64>,
}

One thing I wanted to do was add proper error handling. This isn’t exactly "proper," as I’ve added no pretty printing, but it does the job. One thing I have done is redefine Result to always have the types used by my library, and nothing else:

<lib.rs>+=
#[derive(Debug)]
pub enum ProgramError {
    DivisionByZero,
    StackUnderflow,
}

type Result<T> = result::Result<T, ProgramError>;

Okay, now things are gonna get… a little advanced. But only a little. What this next piece is, is simple: I found myself writing the exact same seven lines of code for three of the four operations. So why not extract that out into a function? Well, the biggest problem is that the maths symbols, +, -, etc, aren’t first-class functions; they can’t be passed around. But they are Rust tokens, which means they can be handled by a macro. So I created a macro that does the operation, and manipulates the object to pop() and push() the values as needed:

<lib.rs>+=
macro_rules! make_op {
    ($code:expr, $op:tt) => {{
        if let Some(a) = $code.stack.pop() {
            if let Some(b) = $code.stack.pop() {
                $code.stack.push(b $op a);
                None
            } else {
                Some(ProgramError::StackUnderflow)
            }
        } else {
            Some(ProgramError::StackUnderflow)
        }
    }};
}

This is a very simple substitution macro; it takes two items separated by a comma and builds a block that either returns None or Some(Error). We’ll use this return value as a signal to our processing loop.

The first item passed into the macro is a Rust expression, something that returns a value (in this case, the value is just the program struct, so access is a no-op), and the second is a Rust token, which is the most primitive object a macro can handle, but it’s perfect for arithmetic symbols.

There is one big oddity in the macro rule; can you spot it? The operation’s values will be extracted into a and b, but the operation itself is b $op a. Multiplication and addition are fully commutative, at least for integers, but subtraction is not; we need to do the operation in this order to ensure that it happens in an "unsurprising" order: 5 4 - is "5 - 4", not "4 - 5".

Initializing our program is unchanged:

<lib.rs>+=
pub fn interpret(program: Vec<Opcode>) -> Result<i64> {
    let mut code = Program {
        program: program,
        stack: Vec::new()
    };

And now that I have my macro handler, the actual opcode runner is fairly uncluttered, although the cyclomatic complexity of the error handling for division does make it ugly. This is one place where railway-oriented programming would be lovely, but that’s an advanced topic.

Basically, if any operation returns an error the program halts right there with the error value, otherwise it proceeds until it either hits Done or runs out of operations to perform, at which point it tries to extract the value from the stack and return it, or returns a StackUnderflow error if the stack is empty.

<lib.rs>+=
    for op in code.program {
        if let Some(err) = match op {
            Opcode::Push(i) => { code.stack.push(i); None }
            Opcode::Mul => make_op!(code, *),
            Opcode::Add => make_op!(code, +),
            Opcode::Sub => make_op!(code, -),
            Opcode::Div => {
                if let Some(a) = code.stack.pop() {
                    if a == 0 {
                        Some(ProgramError::DivisionByZero)
                    } else if let Some(b) = code.stack.pop() {
                        code.stack.push(b / a);
                        None
                    } else { Some(ProgramError::StackUnderflow) }
                } else { Some(ProgramError::StackUnderflow) }
            }
            Opcode::Done => break,
        } {
            return Err(err);
        }
    }

    if let Some(v) = code.stack.pop() {
        Ok(v)
    } else {
        Err(ProgramError::StackUnderflow)
    }
}

And the unit tests:

<lib.rs>+=
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn basic() {
        use Opcode::*;
        assert_eq!(interpret(vec![Push(2), Done]).unwrap(), 2);
        assert_eq!(interpret(vec![Push(2), Push(3), Mul, Done]).unwrap(), 6);
        assert_eq!(interpret(vec![Push(5), Push(3), Sub, Done]).unwrap(), 2);
        assert!(interpret(vec![Push(2), Mul, Done]).is_err());
        assert!(interpret(vec![Push(2), Push(0), Div, Done]).is_err());
        assert_eq!(
            interpret(vec![ Push(2), Push(2), Mul, Push(3), Mul,
                            Push(4), Mul, Done]).unwrap(), 48);
        assert_eq!(
            interpret(vec![ Push(5), Push(2), Mul, Push(5), Div,
                            Push(2), Div, Done ]) .unwrap(), 1);
    }
}

And the results are the same:

running 1 test
test tests::basic ... ok

One thing to note here is that we’ve now entered a realm where type errors are possible; if we could push more than one type of object onto the stack, we would need to differentiate among the types that our operations could handle. For a full-blown general-purpose interpreter, this sort of typechecking will get repetitive very quickly, but once you’ve identified the typechecking pattern in your code you can extract that repetition into a macro. Rust macros are already amazing, and I consider myself very much a beginner when it comes to them.
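
To make that concrete, here is a sketch of where a typed stack would lead. The Value type, TypedError, and pop_int! are all hypothetical, not part of this interpreter; the point is that the type check is exactly the kind of boilerplate a macro can absorb:

enum Value {
    Int(i64),
    Bool(bool),
}

enum TypedError {
    StackUnderflow,
    TypeMismatch,
}

// Pop a value and insist that it is an integer; anything else is an error.
macro_rules! pop_int {
    ($stack:expr) => {
        match $stack.pop() {
            Some(Value::Int(n)) => Ok(n),
            Some(_) => Err(TypedError::TypeMismatch),
            None => Err(TypedError::StackUnderflow),
        }
    };
}

// Add becomes "pop two integers, push their sum," with the type checks
// folded into the macro.
fn typed_add(stack: &mut Vec<Value>) -> std::result::Result<(), TypedError> {
    let a = pop_int!(stack)?;
    let b = pop_int!(stack)?;
    stack.push(Value::Int(b + a));
    Ok(())
}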

In my last post, Implementing The World’s Simplest Bytecode Interpreter in Rust, I did exactly that, following Vladimir Kazanov’s World’s Simplest Bytecode Interpreter.

His first interpreter was basically dumb as a rock: starting from an integer value of zero it allowed only increment and decrement. His second variant adds two new instructions which allow for the addition and subtraction of arbitrary integers. This is where the difference between C and Rust really begins to shine. I’ve added the two instructions to the Opcode:

pub enum Opcode {
    Inc,
    Dec,
    Add(i64),  // New
    Sub(i64),  // New
    Done
}
    
pub struct Program {
    program: Vec<Opcode>,
    accumulate: i64
}

The program definition is unchanged:

pub fn interpret(program: Vec<Opcode>) -> Option<i64> {
    let mut code = Program {
        program: program,
        accumulate: 0
    };

And the actual opcode runner is fairly straightforward:

    for i in code.program {
        code.accumulate = match i {
            Opcode::Inc => code.accumulate + 1,
            Opcode::Dec => code.accumulate - 1,
            Opcode::Add(i) => code.accumulate + i,  // New
            Opcode::Sub(i) => code.accumulate - i,  // New
            Opcode::Done => break
        }
    }

    return Some(code.accumulate)
}

And that’s the whole change. Kazanov is trying to show how the program’s array of instructions can keep values associated with an operation in-line with the operation itself, but in Rust that’s just… an enum. An enum in Rust is an authentic sum type; it can carry additional information.  There’s no need to coerce the vector into holding “weird” values and casting them appropriately; Rust already guarantees that happened at compile time.  And again, verifying the program’s correctness as it comes off storage is a separate responsibility; by the time the program has entered the interpreter, we’ve verified that it is syntactically correct at the bytecode level.

There is a cost to this: Rust’s Vec will consist of uniformly sized objects, each as big as the largest object defined in the enum plus the management of the enum itself, so at least one byte for the enum differentiation plus the size of a 64-bit integer, even for the Inc and Dec operators. For a small DSL, that’s not a terrible price to pay. As an interpreter grows, alternative ways of describing an opcode may be desired. We shall see.
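
You can see the cost directly; the exact number is platform-dependent, but on a typical 64-bit target the i64 payload plus an aligned discriminant comes to sixteen bytes:

// Every element of Vec<Opcode> is the same size, even Inc and Dec,
// which carry no payload at all.
println!("{}", std::mem::size_of::<Opcode>()); // typically prints 16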

A basic unit test:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn longer() {
        use Opcode::*;
        assert_eq!(interpret(vec![Inc, Inc, Inc, Add(4), Add(3), Sub(6), Done]), Some(4));
    }

}

And the results are the same:

running 1 test
test tests::longer ... ok

 

Because I feel woefully out of practice with parts of Rust that aren’t related to my current project, I decided to try a couple of on-line exercises, mostly written in other languages, and see what sort of effort it would take to do the exercises in Rust.

Vladimir Kazanov’s World’s Simplest Bytecode Interpreter was my first exercise, and it turned out that a number of the things Kazanov used just aren’t possible in Rust. His bytecode interpreter really is the simplest possible: it has three bytecodes: increment, decrement, and done.

Kazanov defines three structures: a VM, a bytecode collection, and a result. He defines a constructor for the VM that nulls it out, and then a function that takes a pointer to the program and runs it through his bytecode interpreter, returning either a result or an error message.

So far, the Rust is going to look the same:

pub enum Opcode {
    Inc,
    Dec,
    Done
}
    
pub struct Program {
    program: Vec<Opcode>,
    accumulate: i64
}

Instead of a pointer to the program, I’m going to move the entire program into my VM. This is probably not what I want, in the long run; I’d either want a reference or the capacity to Clone the code as needed, but this is the world’s simplest bytecode interpreter, so for now, there’s not a lot to do.
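
For what it’s worth, the borrowing version is only a small change; a sketch (interpret_borrowed is a hypothetical name, and it skips the Program struct entirely):

pub fn interpret_borrowed(program: &[Opcode]) -> Option<i64> {
    let mut accumulate: i64 = 0;
    // The caller keeps ownership of the opcodes; we only read them.
    for i in program {
        accumulate = match i {
            Opcode::Inc => accumulate + 1,
            Opcode::Dec => accumulate - 1,
            Opcode::Done => break,
        };
    }
    Some(accumulate)
}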

The interpret function is equally straightforward. Here are the first notable changes to Kazanov’s code: First, we don’t need a special return type, because Rust has Option. Second, we don’t need to zero out the memory, because Rust won’t let us construct the struct without initializing every field. Third, we’re moving the program into the VM, thus requiring the client to take on the responsibility of ensuring the program is correct.

More importantly, we’re constrained by the compiler to the opcodes provided; we literally cannot have a bad opcode! Kazanov can; he defines an enum, but in C that’s just a set of named integers. In Rust, enums have a much more strict semantic meaning. (Now, turning an enum into a static, on-storage representation, and reading it back into memory, is a different task, but it’s a task with exceptionally clear guidance that you just don’t have in C, and it’s guidance that you get for free, guidance with no runtime cost whatsoever. This is part of why I love to work in Rust.)
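
As a sketch of that separate task, here is a hypothetical byte-to-opcode reader; the compiler forces us to say what happens to bytes that don’t map to any opcode:

fn decode(byte: u8) -> Option<Opcode> {
    match byte {
        0 => Some(Opcode::Inc),
        1 => Some(Opcode::Dec),
        2 => Some(Opcode::Done),
        _ => None, // there is simply no way to construct a bad Opcode
    }
}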

pub fn interpret(program: Vec<Opcode>) -> Option<i64> {
    let mut code = Program {
        program: program,
        accumulate: 0
    };

And the actual opcode runner is fairly straightforward:

    for i in code.program {
        code.accumulate = match i {
            Opcode::Inc => code.accumulate + 1,
            Opcode::Dec => code.accumulate - 1,
            Opcode::Done => break
        }
    }

    return Some(code.accumulate)
}

A basic unit test:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn longer() {
        use Opcode::*;
        assert_eq!(interpret(vec![Inc, Inc, Inc, Dec, Dec, Done, Inc, Inc]), Some(1));
    }

}

And the results are promising:

running 2 tests
test tests::longer ... ok

In Kazanov’s code, he needs extra tests to assert that the program is “correct” in its bytecode, but these tests are not needed in Rust: for embedded bytecode, the compiler literally will not let me write a bad bytecode representation. A byte-to-bytecode reader will be required by the compiler to associate correct data values with read data types.

As I’ve been working my way through this project of mine, I’ve come to respect exactly why Haskell is so mind-bendingly different from other programming languages. It has to do with the level of abstraction you’re expected to track in your head.

In the classic "Object Oriented Programming Language" tutorial, when inheritance is usually introduced, the abstraction is flat: you create a Mammal class, which says it has four limbs and fur, and then add specific subclasses such as Dog and Cat, with different function bodies for the abstract method speak(), defining what sounds the animal makes.

The impression you’re given is that object orientation is often just fiddling around the edges; subclasses define different kinds of business logic. This impression is reinforced in languages like Java and C++ in realms as wildly different as office application development and game programming. Even in game programming, subclasses are often just business logic in a pretty dress: this class of Object moves four pixels per frame, that class has a weapon that does 20 points of damage.

Haskell throws all that out the window and asks you a simple question: "What is addition?"

Like, we think we know what "addition" is. It’s the sum of two numbers to create a new number. "Well, what’s a number?" And so forth.

In my project, the discovery that regular expressions both are semirings and can produce semirings was mind-boggling, and promised a very neat and pluggable way to adapt regex to a variety of purposes; and then when I learned what semirings really were, the embogglement went much higher.

Because a semiring is often described as this: (0, 1, ⨯, +, ℙ), where ℙ is some set. Not a set "of numbers," just some set. If we say ℙ is "the set of natural numbers" then we get all the operations you’d expect. But this is just a concrete implementation! The underlying abstraction is still there. The exact same behavior holds for (F, T, &, |, 𝔹), the set of booleans values; F&x is always false, just as 0⋅x is always zero. F|x=x, T&x=x, etc.

And then the mind-boggling part I got from Might & Co, which I later discovered were called Viterbi Parse Forests: ([], [""], ⊗, ∪, [𝕊]), where the brackets represent "set of…" and 𝕊 represents "strings" (as programmers understand them), has the exact same behavior. The ⊗ operator is the Cartesian product, that is, the concatenation of every string from the left set with every string from the right set. With this, what you get back is a record of the exact strings that were recognized and matched.

If you replace "set" in the above paragraph with "list", you get prioritized unions— which is the basis of disambiguation in Parsing Expression Grammar theory. If you define the semiring rather strangely but still within the rules, you get "longest match" disambiguation— which is the basis of μ-regex parsing under the POSIX rules. And so forth.
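
In Rust terms the abstraction itself is tiny; a sketch (the trait and these instances are mine, not from any particular library), where swapping the implementation swaps what "parsing" means:

trait Semiring {
    fn zero() -> Self; // identity for add, annihilator for mul
    fn one() -> Self;  // identity for mul
    fn add(self, rhs: Self) -> Self;
    fn mul(self, rhs: Self) -> Self;
}

// Recognition, (F, T, &, |, 𝔹): "did it match?"
impl Semiring for bool {
    fn zero() -> Self { false }
    fn one() -> Self { true }
    fn add(self, rhs: Self) -> Self { self | rhs }
    fn mul(self, rhs: Self) -> Self { self & rhs }
}

// Counting, (0, 1, ×, +, ℕ): "how many ways did it match?"
impl Semiring for u64 {
    fn zero() -> Self { 0 }
    fn one() -> Self { 1 }
    fn add(self, rhs: Self) -> Self { self + rhs }
    fn mul(self, rhs: Self) -> Self { self * rhs }
}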

This is a much deeper level of abstraction. The user has to be able to not just internalize the "obvious" uses, but to understand the deep connective tissue linking these different uses and then come up with new ones. Haskell developers are a mixture of very quick and very slow— the slow part comes while their brain is tickling out a new understanding of the abstraction before them, and the quick is implementing it with good IDE support. It separates the clever from the skilled. Ideally, it would be good to have both.

So, to muse on about regular expressions…

I was genuinely pleased to post Rigged Brzozowski Regular Expressions in Rust (Experiment No. 08), one that completely supports all of the operations in Haskell and one that completely supports the same semantics as the Glushkov Heavyweights in Rust (Experiment No. 07), which makes me mindbogglingly happy.

Now, there are a couple of things going on here. Here’s the skinny: The Glushkov version is a state machine that progressively builds the semiring as characters are read. The Brzozowski version takes the regular expression tree and builds a parse tree of results, some of which use Adams’s Delta operator as a gatekeeper to preserve valid parses or to block off invalid ones. It then presents the resulting tree to a function that postprocesses it, creating a semiring out of the results.

The Delta operator is only introduced into the parse tree during Sequence operations, to block off trees that don’t need postprocessing because they couldn’t have been valid in the first place, or to preserve trees that could be valid. For example, ab*c is an expression that can contain just ac or abc (or abbbbc), so the sequence requires a gatekeeper to preserve whether or not the b was seen.

There are two ways of handling gatekeeping. Might says:

Dc(L ◦ R) = (Dc(L) ◦ R) ∪ (δ(L) ◦ Dc(R)).

is equivalent to:

if (ε ∈ L) then
    Dc(L ◦ R) = (Dc(L) ◦ R) ∪ Dc(R)
else
    Dc(L ◦ R) = Dc(L) ◦ R

The difference between these is that in the first case, the delta operator is either an annihilator or an identity on the result of the parse, whereas in the second, the nullability function decides on the process of the parse. Despite Adams’s assertion, the truth is harder; in semiring parsing, the loss of result information on the left hand side of a sequence is devastating to the final parse. We do not get an honest parse tree without the Delta operator.

Might solves the problem by recognizing that concatenation of an epsilon with anything requires data preservation only, and since he’s approaching this from a more engineering-oriented standpoint than the Play on Regular Expressions people, using in-line combinators, he offers:

(ε ↓ {t1}) ◦ p ⇒ p → λt2.(t1, t2)
p ◦ (ε ↓ {t2}) ⇒ p → λt1.(t1, t2)

This basically says that if you have a (pending) semiring t1 sequenced with the eventual results of a parser p, you end up with the output of p being the left hand of the mul operation over the two semirings. And flipped if the epsilon is on the right. In other words, he recognizes when an epsilon is in a sequence and ‘floats’ multiplying it with the result of the right hand side of the sequence. The information is preserved.

I want to believe this is on purpose; that Might & Adams knew this was a necessary preservation step. Darais’s highly optimized Racket implementation uses it effectively and has no need of either the Delta operator or the Epsilon-Star operator, as data preservation is left to the reduction optimization on line 97 of Derp-3.

It’s exactly what you’d expect out of semiring parsing, but it’s mechanical and, well, naïve. Not in a bad way, mind you. I’m still trying to figure out how to make it "go" in a pure semiring implementation. I may have to implement reducers anyway, just to support this; otherwise, I have to use the Delta operator and the tree remains alarmingly large.

I went to the HackerX recruitment event this week and, while it was interesting, the event was awkward. It’s a bit like speed-dating: 36 companies show up and you have five minutes per company to hear their pitch and make your own, and then time is called and you move on to the next company. The round lasts an hour, so you have time to visit at most 12 recruiters.

The problem is that three times as many people showed up for the event as there were seats. In order to "facilitate" this problem, you were placed in a waiting line per row, and depending upon your starting position you could fall out of your line, in which case you got to pick a new waiting line. And the lines were initialized in order, with the first person in line being sent to the end of the row.

In other words, if you were a go-getter who arrived early, you got to talk to exactly one recruiter before you fell out of the room and had to go to a waiting line, and there was very little chance you’d get back into the room.

Even worse: that recruiter was always Amazon. The last entry of every row was Amazon: AWS, Amazon Logistics, or Amazon Internals. It didn’t matter. It was Amazon. All Jeff, all the time.

I was second in line. I got to talk to two recruiters, and I really didn’t care to talk to Amazon, although the people behind the table were nice enough. In line (I really wanted to talk to the Edu startup), I did get to talk to a woman who was looking to move from startup to enterprise, and some recruiters did come out and talk to us in line, having mercy on our patience, but they were the HR assistants to the engineering recruiters, and couldn’t answer many of my questions about work environment, tooling or projects.

I’d probably go again, but I’d really prefer less of a free-for-all.

I have hit a snag in my project. This entry is me thinking about solutions.

My goal was a reasonable one:

There are a million other "goals" that can go with this:

  • Write a variant that produces a DFA, allowing for static compilation
  • Write a variant that produces a bitmap, allowing for fast comparison
  • Write a variant that is an iterator, such that some expressions can be marked "progressive," and when one is complete, emit the contents
  • Write a relationship between the parser and the semirings such that progress can be suspended and restored as needed
  • Add negation, complement, and interleaving as regular expression operators
  • Write Darais’ Language-of-Languages
  • Write a PEG parser on top
  • Write ROSIE in Rust
  • Instantiate backreferences
  • Add a concrete stream operator

The idea is that Barre/Barge is a generalized but still highly performant toolkit for building recognition, matching, and parsing engines, and the nature of the operation depends on passing it an output dependency with an interface that is theoretically sound and operationally powerful.

Unfortunately, I’ve hit this snag: The classic implementation of parsing-with-semirings can produce a parse tree via the Derivation Forest Semiring, but disambiguating that forest down to a single tree requires a different semiring to implement one of First Match / Longest Match / Longest Atomic Match. Adams and Darais provide a derivation forest engine, and it’s the one I’ve implemented, but there’s one special case.

Imagine a sequence of regular expressions: "abcde". These can be depicted in one of two ways: (a, (b, (c, (d, e)))) or ((((a, b), c), d), e). Computationally, that second one, "left-heavy," is troubling: for every character, you’ll have to transition down the tree to reach it, traversing four nodes just to get to the a expression. Darais implemented Adams’ suggestion that the tree be rebalanced to make it "right-heavy," and then a post-parsing operation is placed in front of the node to rebalance the resulting parse tree back to a left-heavy representation so that the results are consistent with expectations.
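
A toy sketch of the two shapes (the Re type here is illustrative, not the Barre implementation):

enum Re {
    Sym(char),
    Seq(Box<Re>, Box<Re>),
}

fn sym(c: char) -> Re { Re::Sym(c) }
fn seq(l: Re, r: Re) -> Re { Re::Seq(Box::new(l), Box::new(r)) }

// Right-heavy: (a, (b, (c, (d, e)))). The next symbol to consume is
// always one node away from the root.
fn right_heavy() -> Re {
    seq(sym('a'), seq(sym('b'), seq(sym('c'), seq(sym('d'), sym('e')))))
}

// Left-heavy: ((((a, b), c), d), e). Reaching the a expression means
// walking through four Seq nodes first.
fn left_heavy() -> Re {
    seq(seq(seq(seq(sym('a'), sym('b')), sym('c')), sym('d')), sym('e'))
}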

The problem with this is that there’s no clear implementation that corresponds to the semiring. It’s an engineering-versus-math issue, and I’m not yet sure how to deal with it, although just writing it out does help me see some of the potential solutions. The basic solution is to keep it all: the raw parse tree, the raw disambiguation matrix, and the product of the parse as quickly as the expression specification can produce it.


Current Bibliography


I’m just keeping this here, to keep me honest about what I’m reading.

  • A Play on Regular Expressions: A Functional Pearl. Sebastian Fischer, Frank Huch, Thomas Wilke.
    • Solid introduction to parsing with semirings in Haskell
    • Haskell is straightforward and accessible
    • Readable, breezy style
    • Semirings covered: Recognition, Counting, Viterbi-Best (Maxlong, Firstlong)
  • Algebraic Foundations of Semiring Parsing. Yudong Liu
    • Discusses semiring parsing as unified framework
    • Unites semiring parsing and finite state transducers
    • Describes a mechanical C++ framework for the pieces/parts
    • Looks a lot like Barge/Barre
    • Good overview, fairly accessible.
  • An Algorithm for RELAX NG Validation. James Clark.
    • Describes the interleave operator in some detail.
  • Bit Coded Regular Expression Parsing. Lasse Nielsen.
    • Presentation, so short on details
    • Describes bit coding as a way to do comparisons on regexes with fewer than machine-width alternatives
  • Bitsets Match Regular Expressions, Compactly. Paul Khuong
    • Describes the Bitap algorithm for bitmap encoding
    • Describes graph minimization as assuming that all states are singular, and breaking them up as we prove the assumption wrong.
  • Constructing Human Grade Parsers. Joe Karter.
    • Wish-list of parser capabilities outside the actual parsing engine
    • Parse errors should be recoverable
    • Parser should be able to produce partial output
    • Requires tooling outside the formalities of regex and semiring parsing
  • Design and Analysis of String Representation. R. Sekar.
    • Describes Owens, et al.’s DFA generator as a realization of McNaughton-Yamada
    • Describes alternative search algorithms
  • Efficient and Flexible Incremental Parsing. S. L. Graham
    • Describes incremental parsing, in which a document tree and parse tree are maintained and updated together.
  • Error-Correcting Parser Combinators. S. Doaitse Swierstra, Pablo R. Azero Alcocer
    • Haskell-based parsing language
    • Describes error-handling in parsing processes
    • Describes recovering from errors to continue parsing
    • Uses combinators, though
  • Total Parser Combinators. Guillaume Allais
    • Describes a combinator-based parser with guaranteed termination
    • Uses Brzozowski as its basis
    • Proves that the combinator recognizer is a semiring
    • Written in Agda (eek!)
  • Kleene Meets Church. Fritz Henglein
    • Has the "theory vs practice" description
    • Ideal: parsing + catamorphic processing
    • Actual: FAs + ad-hoc instrumentation
  • On the Complexity and Performance of Parsing with Derivatives. Michael D. Adams, Celeste Hollenbeck, Matthew Might
    • Provides both theoretical and mechanical optimizations
    • Replaces the recursion nullability with data-flow analysis
    • Provides basis for Darais’s implementation
  • Recognising and Generating Terms using Derivatives of Parsing Expression Grammars. Tony Garnock-Jones, Mahdi Eslamimehr, Alessandro Warth.
    • Describes the derivatives of PEGs
    • Describes the δPEG algorithm without backtracking for negation
    • Describes the derivation of prioritized choice
  • Regular Expression Derivatives Reexamined. Scott Owens, John Reppy, Aaron Turon.
    • The paper that started the current project.
    • Section 3.3: Using derivatives for DFA construction.
    • Describes equivalence sets in the face of symbol class operators
    • Describes regular vectors for multithreaded matching.
    • Describes integration of complement and intersection operations
  • Regular Expressions as Types. Fritz Henglein
    • Introduced the "ad-hoc" slander about PCRE. 😊
    • Describes regex as a set membership problem
    • Describes types systems as "regular languages"
    • Clarifies emitter location for stream-processing
    • On the Kleene star operation
    • May require different Kleene stars, or a visitor with some context awareness.
  • Semiring Parsing. Joshua Goodman.
    • Mentioned by Liu
    • Covers the initial history of semiring parsing
    • Describes derivations forests (see Might & Adams) in a concise way (sec 2.5.1)
    • Describes Viterbi n-best as a probabilistic disambiguation strategy
  • Stream Processing Using Grammars and Regular Expressions. Ulrik Terp Rasmussen
    • Has a great explanation of the "Regular Expressions as Types" approach.
    • Leads into regex equivalence sets
    • Discusses disambiguation strategies with equivalence sets
    • Discusses bit coding of unions (no μ-regex or e-regex)
    • (Con) Uses bit-coding as a parse tree extraction tool
    • Discusses alternative needs
      • Recognition
      • Matching (capture groups)
      • Parsing (parse trees)
    • Discusses two-pass analysis (Might) vs one-pass (Adams)
    • Discusses difficulty of regex composition (see Conway 2018)
    • Discusses CFGs as an alternative DSL for regex composition (PEG, EBNF)
    • Is actually 5 papers; the other 4 cover algorithm’s evolution
    • Fantastic overview overall
  • Taxonomies and Toolkits of Regular Expression Algorithms. Bruce William Watson
    • Solid overview
    • Fairly long
    • Excellent coverage of DFA minimization
    • Discusses prefix/suffix scanning via Aho-Corasick, et al.
    • Cover Brzozowski’s algorithm as a performance-based alternative DFA constructor
  • Yacc is Dead. Matthew Might, David Darais
    • Started this whole mess
    • Still a pretty good introduction.
