After working through the take-home quiz, I have four numbers to report about the "twitter clone" project given to me by ${MEGACORP}.

2,572

2,572 different Javascript libraries ended up inside the build tree for the demonstration application. That number does not include the global instance of the Prisma toolkit that I ended up installing on the recommendation of the tutorial. That number does not include the Prisma back-end, running on AWS, over which I had no ownership or control.

That’s 2,572 potential cases of corruption, infiltration, or code unreliability. No one involved with this project could possibly have vetted all of them, and there’s very little chance anyone could have a web-of-trust that encompasses all of them.

2,371,451

The final build size for this project was 2,371,451 bytes. That’s the size of the payload that gets downloaded to your client— desktop, phone, tablet— to run on your browser. 2.4 Megabytes. That’s a lot of code to run when all you want is a simple scrolling twitter feed.

[NaN]

I don’t know how many random accounts I have out there. I’ve been on the Internet since 1991; I’ve had accounts on AOL, CompuServe, MySpace, and a zillion other places. Authorizing Prisma to have access to my GitHub account for the purposes of authentication was just one more, and it annoyed me that local-first isn’t even on some companies’ radars.

32

32 is the number of hours I spent on a coding challenge. That’s a ridiculous number; no one who’s actually got a job could possibly have gotten through this assignment in anything like a reasonable amount of time.

On the other hand, learning five new technologies well enough to "make it go" in only 32 hours is an achievement, and it was really good to see that my coding chops at the browser level are still as solid as they are at the back-end.

Afterward

The thing of it all is, I look at This is a Motherfucking Website and wonder how the hell we got from that to, well, to React. My story sites use native Javascript and treat the HTML as the source of truth, which is one of the reasons they’re so damn fast. The Javascript on those pages is minuscule… less than 1 kilobyte, but it does the job of helping people navigate the site quickly and efficiently, and it’s still ARIA-compliant.

React is like Golang: both are enterprise-ready solutions that let developers do things fast enough, but they hem developers into pre-packaged tooling and solutions that limit thought and creativity. They’re both great tools in their roles, but they’re not what I would choose to work with recreationally.

Which is too bad, but what else can you do? They really do let companies throw stuff out fast, and if the consumer can’t keep up, well, it’s the consumer’s responsibility to upgrade their phone or laptop, isn’t it? Upgrade or die.

The world still deserves better.

So how did I do on the coding challenge? I have no idea. But here’s what I went through:

The README promised that there was a working, minimal implementation in the ZIP file. There was not. So the first thing I did was go through the How To GraphQL tutorial, which kinda gets you to the point of having a working Twitter clone. While going through it, I realized that the coding challenge was this tutorial, more or less.

Prisma is ridiculously difficult to run locally. It’s a cloud-first solution, written in Scala. In fact, ${MEGACORP} provided me with a Prisma instance running somewhere on AWS rather than ask me to try to run it on my laptop. So the first discovery was that part of the reason the example code they gave me wasn’t working was that I didn’t have a Prisma account; I had to log into Prisma and grant it authentication via GitHub.

The next thing I realized was that, looking at the schema they provided me, it didn’t look right. I ran the prisma generate command and all these warnings and errors came back to me. Despite being only three months old, the schema provided was already deprecated; the @unique attribute had been augmented by the @id attribute for data keys, and the syntax for date/time fields had changed. With the schema fixed, the Prisma account activated, and the database seeded, I actually started to see something in my "Twitter" feed.

It was time to turn my attention to the client.

Figuring out how to do login / logout was the hardest part. It involved all the moving parts, including authentication, and it was tricky. I ended up wrapping the entire app in a hand-written React Authentication.Context object that provided hooks to the login and logout functions, as well as a state that indicated whether there was a user and who it was. I so missed Rust’s sum types here; I wish Javascript had them, and in that form. (One of the wildest things I saw recently was that Typescript has sum types and enums, and they’re not the same thing. I filed that under "What the whatting what?") But once it was working, it was working fine.
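
For what it’s worth, the shape I wanted is trivial to express as a Rust sum type; this is just a sketch of the state I was modeling, not code from the project:

// The authentication state I was hand-rolling in React, as a Rust enum (sketch only).
enum AuthState {
    Anonymous,
    LoggedIn { username: String },
}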

I did have one moment where I couldn’t figure out how to banish the log-in popup. It was a two-stage asynchronous operation, one to handle the ‘submit’ event and another to handle the ‘login’ event. On a hunch, I wrapped the login handler in a console.log() and saw that it returned a Promise containing the return value from the login, so I could use a successful Promise resolution as the trigger to call the .dismiss() method. It worked, and it was elegant as hell, but I get the feeling it’s rather un-React-y; it was a "do something" rather than "be something" implementation.

I wrote out a few decision tables and decided I could live with it.

After that, things accelerated: I quickly sketched out how I wanted it to look, how I wanted the toolbar and profile sidebar to update, and getting them to work was interesting. I learned a megaton of Material and Material-UI in the process, although I’m still not keen on Material’s thematic engine.

I had a good time doing this. It’s not the best use of my skills, but it’s certainly satisfying to see things working in real-time, and I’m impressed with the tooling, at any rate.

I’ve just spent the past four days working on a “homework assignment,” a coding challenge given to me by ${MEGACORP} to assess my skills as a web developer and programmer. I was sent a zip file with a README in it and a pair of subfolders, one for the server and one for the client, and was told to build it out and “make it work.”

It was broken as delivered. The install worked fine, but I winced when I realized that yarn had installed 2,572 different packages in the library trees, including the entire flow binary and the OCaml runtime that comes with it! Running the server threw a dozen errors, and the client showed only a blank white page with a red notification bar, completely unpopulated.

I was coming out of Splunk, where I’d been working with Splunk’s own server and client foundation for five years. That foundation was built on top of Django and Backbone, two skills I clearly had when Splunk hired me in 2013. In 2015, some of our teams had started to work with React, and I’d been fortunate enough to be part of that.

But the 2017-2018 development cycle had been all Go, all the time. I was woefully out of practice as a web developer.

Here’s what they threw at me:

GraphQL

GraphQL is a query language specified by Facebook that handles client/server transactions. You specify the possibly deeply nested data you want sent to the client, and when you make the request you get a JSON object pre-populated with that nested data. Instead of assembling all the data out of different requests, your client sends a single request and gets back all the data it needs to fill in a page.

It’s a pretty good idea. But it was new to me.

Yoga

Yoga is a proxy server into which you define the GraphQL endpoints for your client. You write your specification here, defining the structure of the data and the handlers for assembling that data out of your back-end. Also completely new to me. You write in Javascript or one of its derivatives (Typescript, Clojurescript).

Prisma

Prisma is yet another proxy server, this one written in Scala. Prisma is the big aggregate engine; it talks to all your back-end services and assembles the data for you.

The relationship between Yoga and Prisma is basically that of a firewall: Yoga limits the kinds of transactions that can happen, and can restrict access based on authentication. Prisma will take whatever Yoga sends it and do the heavy lifting of assembling and aggregating the objects to send to the client.

JWT

JSON Web Tokens (JWTs) were being used for authentication. I straight up admit that I ended up using them badly and contrary to security advice. We hadn’t used them at Splunk until the “cloud era,” and by then I was working on Kubernetes deployment issues and didn’t get to learn much about them. They’re basically cryptographically signed bundles that describe “claims,” a word that encompasses authentication and authorization, but they’re short enough to fit in the HTTP Authorization: header. Unfortunately, they’re also signed rather than encrypted, so if you’re using them in an unsecured environment anyone can grab them and use them against you.

ES6 Javascript

Oh, and did I mention that this was entirely ES6 Javascript, which we hadn’t been using at Splunk at all? So I got to learn a whole bunch of new things! And actually, I was pretty stoked about that. It has real Classes now, and the arrow-function shorthand with no this overrides (inheriting the lexical this is awesome!), and it does destructuring and lexical naming and a whole bunch of other stuff that I enjoyed the hell out of.

React

I wasn’t new to React. In fact, I had a pretty solid working relationship with React, and even had a good idea of how to hack it to use with older, Backbone-based data sources. But that was two and a half years ago, and React has moved on; there are many more libraries today.

Apollo

Apollo is a toolkit for GraphQL clients. The Apollo toolkit creates React Contexts in which you can write GraphQL queries and get them back as processed objects ready to feed directly into your React components and draw them at will. Apollo was interesting, but I found its relationship with React a bit awkward, especially when trying to handle post-event routing. Apollo wants to deliver data that you ultimately render, but in some cases, such as log-in, what you want is both the render event and some other behavior, such as dismissing the login pop-up. That was tricky.

Material & Material UI

Material is Google’s own CSS Design System, the one we’ve all come to recognize. It has paper and cards as metaphors, and has a whole bunch of guidelines for building out the best UIs. It’s fine as far as design systems go, and it’s certainly mature.

Material UI is an implementation of Material in React; it provides a whole bunch of widgets for all the things one might normally do with an application and makes them easy to bring to life.

… and that was just the basics. I had to learn all of those enough to do the job, and this was for a code quiz. It was like a 32-hour whiteboarding session without any of the “cultural fit” stuff going on. It was grueling, and I hope I don’t have to do anything like it very soon.

Part 2: What I actually did on the project.

Kazanov’s fifth World’s Simplest Bytecode Interpreter (see the Fourth Simplest) isn’t quite so simple: it is an implementation of Thompson’s most basic regular expression table-driven DFA handler. It handles a complete set of Kleene’s Regular Expressions without the closure operator (no * operator), and it does so by exploiting the host language stack to handle nested recursion of alternatives.

This is a much different beast from the previous four, and it requires a little thinking to grok why it’s so different. For one thing, we’re going to be jumping back and forth inside the program: this is our first bytecode interpreter with a Jmp instruction. More than that, we may want to jump backwards, so we can’t use ‘usize’ for that; we’ll need ‘isize’. There are times we want to add a negative offset, after all!

Our opcode now looks like this:

<lib.rs>=
pub enum Opcode {
    Chr(char),
    Jmp(isize),
    Alt(isize, isize),
    End,
}

If you’re at all following my other project at home, there’s an operation missing: where is Seq? Well, it turns out that Seq is handled simply by incrementing the two pointers.

In order to facilitate rapid access to our opcodes, we’re going to convert it into a concrete Vec<Opcode>, and likewise the string we’ll be analyzing is going to be a Vec<char>. We’ll be passing references of these two objects to our interpreter, as neither the string nor the program will be mutated by the interpreter’s operation:

<lib.rs>+=
pub fn interpret_reentrant(
    program: &[Opcode],
    sample: &[char],
    mut pc: isize,
    mut sp: isize,
) -> bool {

We’re only going to bother with “did it match, or didn’t it?”— a boolean. We also have two new variables, the pc (program counter) and the sp (string pointer). As those can “fall off the world” at either end, we have to test for that: negative pointers and a runaway program counter are checked here, and the string pointer’s upper bound is checked where the sample is actually read. If we leave the world, obviously it didn’t match:

<lib.rs>+=
    loop {
        // A negative pointer, or a program counter that has run off the end
        // of the program, means there is no match.
        if sp < 0 || pc < 0 || (pc as usize) >= program.len() {
            return false;
        }
        

The meat of our interpreter is, erm, “fun.” If the character matches, we proceed; if not, we return false. The Alt operator takes two addresses: an offset to the left child expression and another offset to the right child expression. Each child expression must end with a Jmp to the next element in the program sequence after the Alt.  And again, because we’re mixing and matching pointers and offsets, we have to do benign casts on our indexes:

<lib.rs>+=
        match program[pc as usize] {
            // A literal character: the string pointer must still be inside
            // the sample, and the character there has to match.
            Opcode::Chr(c) => {
                if (sp as usize) < sample.len() && c == sample[sp as usize] {
                    pc += 1;
                    sp += 1;
                } else {
                    return false;
                }
            },
            
            // An unconditional jump, possibly backwards.
            Opcode::Jmp(i) => {
                pc += i;
            },

            // Try the left alternative via the host stack; if it fails,
            // fall through to the right alternative.
            Opcode::Alt(l, r) => {
                if interpret_reentrant(program, sample, pc + l, sp) {
                    return true;
                }
                pc += r;
            },

            Opcode::End => {
                return true
            }
        }
    }
}

We now have to start our engine:

<lib.rs>+=
pub fn interpret(program: Vec<Opcode>, sample: &str) -> bool {
    let s: Vec<char> = sample.chars().collect();
    interpret_reentrant(&program, &s, 0, 0)
}

And… that’s it. That’s an entire regular expression engine in about 45 lines of Rust. It’s not really a good regular expression engine; a char in Rust isn’t really a character, as it only encodes a single Unicode code point. A letter with an umlaut might be a single code point, or it could be a multi-code-point assembly of an undecorated letter followed by a “put an umlaut over the previous symbol” combining mark. Discussing the difference between individual Unicode code points and grapheme clusters, and normalizing between them, is a very advanced topic, one I have not yet begun to tackle in my own work. And because the sp values held on the host stack encode backtracking-on-failure, this code doesn’t work in a streaming fashion. But for a demo… eh. It does the job.

Let’s just say that performing regular expressions on Asian languages, apparently especially Korean, is a non-trivial exercise.
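
To make the code-point problem concrete, here’s a tiny illustration (mine, not Kazanov’s): the same visible glyph can be one code point or two, and this engine treats those as different “characters.”

// Illustration only: one glyph, two different char counts.
fn main() {
    let precomposed = "é";       // U+00E9, a single code point
    let combined = "e\u{0301}";  // 'e' followed by a combining acute accent
    assert_eq!(precomposed.chars().count(), 1);
    assert_eq!(combined.chars().count(), 2);
}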

The unit tests are straightforward, but you can see how complex our programs quickly become. There’s also an interesting, well, I don’t want to call it a ‘failure mode,’ but look down at the Alt test. See that Jmp(1) operation? It’s a consequence of the rule that each arm of the Alt must end with a Jmp to the command after the Alt. That Jmp is completely unnecessary; it only advances the program counter by one, which our (virtual) Seq would do anyway. It’s a wasted operation, and hopefully any good compiler we write would recognize it for what it is.

<lib.rs>+=
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn basic() {
        use Opcode::*;
        // "a"
        assert!(interpret(vec![Chr('a'), End], "a"));
    }

    #[test]
    fn seq() {
        use Opcode::*;
        // "ab"
        assert!(interpret(vec![Chr('a'), Chr('b'), End], "ab"));
    }

    #[test]
    fn alt() {
        use Opcode::*;
        // "(a|b)c"
        assert!(interpret(
            vec![Alt(1, 3), Chr('a'), Jmp(2), Chr('b'), Jmp(1), Chr('c'), End],
            "ac"
        ));
        assert!(interpret(
            vec![Alt(1, 3), Chr('a'), Jmp(2), Chr('b'), Jmp(1), Chr('c'), End],
            "bc"
        ));
        assert_eq!(
            interpret(
                vec![Alt(1, 3), Chr('a'), Jmp(2), Chr('b'), Jmp(1), Chr('c'), End],
                "ab"
            ),
            false
        );
    }
}

And here ends where I’m going to follow Kazanov. He does much more with later regular expressions, exploiting some GCC-isms to create a full-powered context-threaded interpreter. It turns out Rust will recognize what you’re trying to do and do the right thing if you follow a fairly idiomatic design pattern for interpreters, so those wacky GCC-isms aren’t necessary.

I’m sure there’s much more we could go into, but I have other things to study next week, I’m in the middle of looking for work (hey, anyone need a skilled Rust developer?), and this more or less satisfies my goal of rebooting my knowledge of bytecode interpreter design.

All of the code for this project is now available on Github.

For the fourth World’s Simplest Bytecode Interpreter, Kazanov goes into a bit of esoterica without discussing why: he wants “authentic bytecode.” He’s doing C programming, and he dense-packs his instructions into single 16-bit words. That’s fine if you’re emulating a real chunk of hardware, but for our purposes… I really don’t care.

With that in mind, let’s talk: a register-based machine is one in which the program’s live data is kept in a small, fixed collection of named slots (the registers), and every instruction names the registers it reads and writes. In our case, that small number is 16, although in reality my tests only use three.

There are only two opcodes for moving values in or out of our system: Load, which places an immediate value into a register, and Done, which returns a specific register’s value to the calling context. Since our example program is a simple calculator, every arithmetic operation takes three register names: two sources and a destination:

<lib.rs>=
use std::result;

const REGISTER_COUNT: usize = 16;

pub enum Opcode {
    Load(usize, i64),
    Add(usize, usize, usize),
    Sub(usize, usize, usize),
    Mul(usize, usize, usize),
    Div(usize, usize, usize),
    Done(usize),
}

pub struct Program {
    program: Vec<Opcode>,
    registers: [i64; REGISTER_COUNT],
}
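
Before going further, here’s an illustrative, hand-assembled program (not one of the <lib.rs> chunks) that computes (2 + 3) * 4 using registers 0 and 1 and the interpret function defined below:

// Illustration only: (2 + 3) * 4 on the machine defined in this post.
#[test]
fn worked_example() {
    use Opcode::*;
    let program = vec![
        Load(0, 2),   // r0 = 2
        Load(1, 3),   // r1 = 3
        Add(0, 1, 0), // r0 = r0 + r1 = 5
        Load(1, 4),   // r1 = 4
        Mul(0, 1, 0), // r0 = r0 * r1 = 20
        Done(0),      // return r0
    ];
    assert_eq!(interpret(program).unwrap(), 20);
}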

Since there’s no stack, we don’t care about stack underflow. On the other hand, we’re requiring that the program have an explicit exit terminator; unlike the previous versions, falling off the end of the program is not acceptable, so we replace our stack underflow with an “unexpected termination” error.

<lib.rs>+=
#[derive(Debug)]
pub enum ProgramError {
    DivisionByZero,
    UnexpectedTermination,
}

type Result<T> = result::Result<T, ProgramError>;

Initializing our program remains familiar, although I’ve changed one of the initializers. The cargo clippy command enforces an even stricter programming style on Rust than cargo fmt, and it recommends this form of initialization. Here, the bare program entry means “find a value named program in the current lexical scope and use it to initialize the program field of the Program struct.” I am not fond of this (it strikes me as a kind of mystery-meat programming), but it seems to be popular with Rustaceans.

<lib.rs>+=
pub fn interpret(program: Vec<Opcode>) -> Result<i64> {
    let mut program = Program {
        program, 
        registers: [0; REGISTER_COUNT],
    };

And now the opcode handler. This is straightforward; we either introduce values into the program with Load, we process them with an operation, or we’re Done and return the value in the register specified. We don’t need our macro this time; the register addresses are immediate and quickly available. We do still need to watch out for that divide-by-zero problem. And if the program ends without being Done, well, that’s an error, too.  I should point out that the order of registers is Kazanov’s; if I had been designing this from the ground up, I would have chosen the leftmost register as the target.

<lib.rs>+=
    for op in program.program {
        match op {
            Opcode::Load(r0, imm) => program.registers[r0] = imm,
            Opcode::Add(r0, r1, r2) => {
                program.registers[r2] = program.registers[r0] + program.registers[r1]
            }
            Opcode::Sub(r0, r1, r2) => {
                program.registers[r2] = program.registers[r0] - program.registers[r1]
            }
            Opcode::Mul(r0, r1, r2) => {
                program.registers[r2] = program.registers[r0] * program.registers[r1]
            }
            Opcode::Div(r0, r1, r2) => {
                if program.registers[r1] == 0 {
                    return Err(ProgramError::DivisionByZero);
                }
                program.registers[r2] = program.registers[r0] / program.registers[r1];
            },
            Opcode::Done(r0) => return Ok(program.registers[r0]),
        }
    }
    Err(ProgramError::UnexpectedTermination)
}

There are two things this VM should do that it does not. First, we do not verify that a register being read has been written to at least once by the program; otherwise we’re just using the default 0 value for whatever calculations come next. Second, we do not validate that the register IDs being passed in are in range; we could easily access a value outside the bounds of our static array.

In some ways, this is okay: a smart compiler would catch these mistakes before they could possibly happen.
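
Here’s a minimal sketch of the kind of pre-flight check such a compiler (or a loader) might run; the validate function is my own illustration, not part of Kazanov’s design:

// Illustration only: reject any register index outside 0..REGISTER_COUNT
// before the program ever reaches the interpreter.
fn validate(program: &[Opcode]) -> bool {
    fn ok(r: usize) -> bool { r < REGISTER_COUNT }
    program.iter().all(|op| match *op {
        Opcode::Load(r0, _) => ok(r0),
        Opcode::Add(r0, r1, r2)
        | Opcode::Sub(r0, r1, r2)
        | Opcode::Mul(r0, r1, r2)
        | Opcode::Div(r0, r1, r2) => ok(r0) && ok(r1) && ok(r2),
        Opcode::Done(r0) => ok(r0),
    })
}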

But we’re now outside the realm of “things the compiler can do for us”; our interpreter is beginning to have its own error states, and while there are a number of things we can do to make sure the interpreter itself is well-formed, we’re going to have to start worrying about bugs in the programs our interpreter runs.

And the unit tests. You can start to see how complex a register-based compiler would have to be to keep track of which registers were free, which were in use, and how those registers should be allocated and used. And a “real” programming language would also need a stack for scopes and for tracking the evolution of complex function trees. For now, though, this is sufficient.

<lib.rs>+=
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn basic() {
        use Opcode::*;
        assert_eq!(interpret(vec![Load(0, 2), Done(0)]).unwrap(), 2);
        assert_eq!(interpret(vec![Load(1, 2), Load(2, 2), Mul(1, 2, 0), 
                                  Load(1, 3), Mul(1, 0, 0), Load(1, 4), 
                                  Mul(1, 0, 0), Done(0) ]).unwrap(),
                   48);
    }
}

And the results are the same:

running 1 tests
test tests::basic ... ok

And that’s a register-based virtual machine.

For the previous step of Implementing The World’s Simplest Bytecode Interpreter in Rust, I followed Vladimir Kazanov as he added larger opcodes to his bytecode. In the third variant, he adds a stack, pushing values on as needed and allowing the operations to pop values off the stack as needed.

My variant of this operation follows his, but is slightly different. He continues to use the stack somewhat arbitrarily, but as I’ve pointed out, Rust doesn’t allow for that outside of unsafe code.

We’re going to swap out the accumulator for a stack, and we’ll return the last value pushed onto the stack as the result. After including the result type, which we’ll use for error messaging and forwarding, our program now looks like this:

<lib.rs>=
use std::result;

pub enum Opcode {
    Push(i64),
    Add,
    Sub,
    Mul,
    Div,
    Done
}

pub struct Program {
    program: Vec<Opcode>,
    stack: Vec<i64>,
}

One thing I wanted to do was add proper error handling. This isn’t exactly "proper," as I’ve added no pretty printing, but it does the job. One thing I have done is redefine Result to always have the types used by my library, and nothing else:

<lib.rs>+=
#[derive(Debug)]
pub enum ProgramError {
    DivisionByZero,
    StackUnderflow,
}

type Result<T> = result::Result<T, ProgramError>;

Okay, now things are gonna get… a little advanced. But only a little. What this next piece is, is simple: I found myself writing the exact same seven lines of code for three of the four operations. So why not extract that out into a function? Well, the biggest problem is that the maths symbols, +, -, etc, aren’t first-class functions; they can’t be passed around. But they are Rust tokens, which means they can be handled by a macro. So I created a macro that does the operation, and manipulates the object to pop() and push() the values as needed:

<lib.rs>+=
macro_rules! make_op {
    ($code:expr, $op:tt) => {{
        if let Some(a) = $code.stack.pop() {
            if let Some(b) = $code.stack.pop() {
                $code.stack.push(b $op a);
                None
            } else {
                Some(ProgramError::StackUnderflow)
            }
        } else {
            Some(ProgramError::StackUnderflow)
        }
    }};
}

This is a very simple substitution macro; it takes two items separated by a comma and builds a block that either returns None or Some(Error). We’ll use this return value as a signal to our processing loop.

The first item passed into the macro is a Rust expression, something that returns a value (in this case, the value is just the program struct, so access is a no-op), and the second is a Rust token, which is the most primitive object a macro can handle, but it’s perfect for arithmetic symbols.

There is one big oddity in the macro rule; can you spot it? The operation’s values will be extracted into a and b, but the operation itself is b $op a. Multiplication and addition are fully commutative, at least for integers, but subtraction is not; we needed to do the operations in this order to ensure that they were done in an "unsurprising" order: 5 4 - is "5 - 4", not "4 - 5".
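
For example, make_op!(code, -) expands, give or take macro hygiene, into a block equivalent to this:

// Hand-expansion of make_op!(code, -), for illustration only.
if let Some(a) = code.stack.pop() {
    if let Some(b) = code.stack.pop() {
        code.stack.push(b - a); // b was pushed first, so it is the left operand
        None
    } else { Some(ProgramError::StackUnderflow) }
} else { Some(ProgramError::StackUnderflow) }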

Initializing our program is unchanged:

<lib.rs>+=
pub fn interpret(program: Vec<Opcode>) -> Result<i64> {
    let mut code = Program {
        program: program,
        stack: Vec::new()
    };

And now that I have my macro handler, the actual opcode runner is fairly uncluttered, although the cyclomatic complexity of the error handling for division does make it ugly. This is one place where railway-oriented programming would be lovely, but that’s an advanced topic.

Basically, if any operation returns an error, the program halts right there with that error; otherwise it proceeds until it either hits Done or runs out of operations to perform, at which point it tries to pop the result off the stack and return it, or returns a StackUnderflow error if the stack is empty.

<lib.rs>+=
    for op in code.program {
        if let Some(err) = match op {
            Opcode::Push(i) => { code.stack.push(i); None }
            Opcode::Mul => make_op!(code, *),
            Opcode::Add => make_op!(code, +),
            Opcode::Sub => make_op!(code, -),
            Opcode::Div => {
                if let Some(a) = code.stack.pop() {
                    if a == 0 {
                        Some(ProgramError::DivisionByZero)
                    } else if let Some(b) = code.stack.pop() {
                        code.stack.push(b / a);
                        None
                    } else { Some(ProgramError::StackUnderflow) }
                } else { Some(ProgramError::StackUnderflow) }
            }
            Opcode::Done => break,
        } {
            return Err(err);
        }
    }

    if let Some(v) = code.stack.pop() {
        Ok(v)
    } else {
        Err(ProgramError::StackUnderflow)
    }
}

And the unit tests:

<lib.rs>+=
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn basic() {
        use Opcode::*;
        assert_eq!(interpret(vec![Push(2), Done]).unwrap(), 2);
        assert_eq!(interpret(vec![Push(2), Push(3), Mul, Done]).unwrap(), 6);
        assert_eq!(interpret(vec![Push(5), Push(3), Sub, Done]).unwrap(), 2);
        assert!(interpret(vec![Push(2), Mul, Done]).is_err());
        assert!(interpret(vec![Push(2), Push(0), Div, Done]).is_err());
        assert_eq!(
            interpret(vec![ Push(2), Push(2), Mul, Push(3), Mul,
                            Push(4), Mul, Done]).unwrap(), 48);
        assert_eq!(
            interpret(vec![ Push(5), Push(2), Mul, Push(5), Div,
                            Push(2), Div, Done ]) .unwrap(), 1);
    }
}

And the results are the same:

running 1 tests
test tests::basic ... ok

One thing to note here is that we’ve now entered a realm where type errors are possible; if we could push more than one type of object onto the stack, we would need to differentiate among the types that our operations could handle. For a full-blown general-purpose interpreter, this sort of typechecking will get repetitive very quickly, but once you’ve identified the typechecking pattern in your code you can extract that repetition into a macro. Rust macros are already amazing, and I consider myself very much a beginner when it comes to them.
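
As a rough sketch of what that might look like (none of this is in the code above; Value, TypeMismatch, and int_op! are names I’m inventing for illustration):

// Hypothetical sketch: a tagged value type and a macro that refuses to
// operate on anything but integers. Assumes the stack holds Value and
// that ProgramError has grown a TypeMismatch variant.
#[derive(Debug, PartialEq)]
pub enum Value {
    Int(i64),
    Bool(bool),
}

macro_rules! int_op {
    ($code:expr, $op:tt) => {{
        match ($code.stack.pop(), $code.stack.pop()) {
            (Some(Value::Int(a)), Some(Value::Int(b))) => {
                $code.stack.push(Value::Int(b $op a));
                None
            }
            (Some(_), Some(_)) => Some(ProgramError::TypeMismatch),
            _ => Some(ProgramError::StackUnderflow),
        }
    }};
}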

In my last post, Implementing The World’s Simplest Bytecode Interpreter in Rust, I did exactly that, following Vladimir Kazanov’s World’s Simplest Bytecode Interpreter.

His first interpreter was basically dumb as a rock: starting from an integer value of zero it allowed only increment and decrement. His second variant adds two new instructions which allow for the addition and subtraction of arbitrary integers. This is where the difference between C and Rust really begins to shine. I’ve added the two instructions to the Opcode:

pub enum Opcode {
    Inc,
    Dec,
    Add(i64),  // New
    Sub(i64),  // New
    Done
}
    
pub struct Program {
    program: Vec<Opcode>,
    accumulate: i64
}

The program definition is unchanged:

pub fn interpret(program: Vec<Opcode>) -> Option<i64> {
    let mut code = Program {
        program: program,
        accumulate: 0
    };

And the actual opcode runner is fairly straightforward:

    for i in code.program {
        code.accumulate = match i {
            Opcode::Inc => code.accumulate + 1,
            Opcode::Dec => code.accumulate - 1,
            Opcode::Add(i) => code.accumulate + i,  // New
            Opcode::Sub(i) => code.accumulate - i,  // New
            Opcode::Done => break
        }
    }

    return Some(code.accumulate)
}

And that’s the whole change. Kazanov is trying to show how the program’s array of instructions can keep values associated with an operation in-line with the operation itself, but in Rust that’s just… an enum. An enum in Rust is an authentic sum type; it can carry additional information.  There’s no need to coerce the vector into holding “weird” values and casting them appropriately; Rust already guarantees that happened at compile time.  And again, verifying the program’s correctness as it comes off storage is a separate responsibility; by the time the program has entered the interpreter, we’ve verified that it is syntactically correct at the bytecode level.

There is a cost to this: Rust’s Vec will consist of uniformly sized objects, each as big as the largest object defined in the enum plus the management of the enum itself, so at least one byte for the enum differentiation plus the size of a 64-bit integer, even for the Inc and Dec operators. For a small DSL, that’s not a terrible price to pay. As an interpreter grows, alternative ways of describing an opcode may be desired. We shall see.
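
A quick, illustrative way to see that cost (assuming the Opcode enum above is in scope): on a typical 64-bit target this prints 16, an i64 payload plus the discriminant padded to alignment, though the exact figure isn’t guaranteed by the language.

// Illustration only: measure the size of one Opcode slot in the Vec.
fn main() {
    println!("size_of::<Opcode>() = {}", std::mem::size_of::<Opcode>());
}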

A basic unit test:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn longer() {
        use Opcode::*;
        assert_eq!(interpret(vec![Inc, Inc, Inc, Add(4), Add(3), Sub(6), Done]), Some(4));
    }

}

And the results are the same:

running 1 tests
test tests::longer ... ok


Because I feel woefully out of practice with parts of Rust that aren’t related to my current project, I decided to try a couple of on-line exercises, mostly written in other languages, and see what sort of effort it would take to do the exercises in Rust.

Vladimir Kazanov’s World’s Simplest Bytecode Interpreter was my first exercise, and it turned out that a number of the things Kazanov used just aren’t possible in Rust. His bytecode interpreter really is the simplest possible: it has three bytecodes: increment, decrement, and done.

Kazanov defines three structures: a VM, a bytecode collection, and a result. He defines a constructor for the VM that nulls it out, and then a function that takes a pointer to the program and runs it through his bytecode interpreter, returning either a result or an error message.

So far, the Rust is going to look the same:

pub enum Opcode {
    Inc,
    Dec,
    Done
}
    
pub struct Program {
    program: Vec<Opcode>,
    accumulate: i64
}

Instead of a pointer to the program, I’m going to move the entire program into my VM. This is probably not what I want, in the long run; I’d either want a reference or the capacity to Clone the code as needed, but this is the world’s simplest bytecode interpreter, so for now, there’s not a lot to do.

The interpret function is equally straightforward. Here are the first notable changes to Kazanov’s code: First, we don’t need a special return type, because Rust has Option. Second, we don’t need a separate step to zero out the machine, because Rust won’t let us construct the Program struct without initializing every field. Third, we’re moving the program into the VM, thus requiring the client to take on the responsibility of ensuring the program is correct.

More importantly, we’re constrained by the compiler to the opcodes provided; we literally cannot have a bad opcode! Kazanov can; he defines an enum, but in C that’s just a set of named integers. In Rust, enums have a much more strict semantic meaning. (Now, turning an enum into a static, on-storage representation, and reading it back into memory, is a different task, but it’s a task with exceptionally clear guidance that you just don’t have in C, and it’s guidance that you get for free, guidance with no runtime cost whatsoever. This is part of why I love to work in Rust.)

pub fn interpret(program: Vec<Opcode>) -> Option<i64> {
    let mut code = Program {
        program: program,
        accumulate: 0
    };

And the actual opcode runner is fairly straightforward:

    for i in code.program {
        code.accumulate = match i {
            Opcode::Inc => code.accumulate + 1,
            Opcode::Dec => code.accumulate - 1,
            Opcode::Done => break
        }
    }

    return Some(code.accumulate)
}

A basic unit test:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn longer() {
        use Opcode::*;
        assert_eq!(interpret(vec![Inc, Inc, Inc, Dec, Dec, Done, Inc, Inc]), Some(1));
    }

}

And the results are promising:

running 2 tests
test tests::longer ... ok

In Kazanov’s code, he needs extra tests to assert that the program is “correct” in its bytecode, but these tests are not needed in Rust: for embedded bytecode, the compiler literally will not let me write a bad bytecode representation. A byte-to-bytecode reader would be required by the compiler to associate the correct data values with each opcode it reads.

As I’ve been working my way through this project of mine, I’ve come to respect exactly why Haskell is so mind-bendingly different from other programming languages. It has to do with the level of abstraction you’re expected to track in your head.

In the classic "Object Oriented Programming Language" tutorial, when inheritance is usually introduced, the abstraction is flat: you create a Mammal class, which says it has four limbs and fur, and then add specific subclasses such as Dog and Cat, with different function bodies for the abstract method speak(), defining what sounds the animal makes.

The impression you’re given is that object orientation is often just fiddling around the edges; subclasses define different kinds of business logic. This impression is reinforced in languages like Java and C++ in realms as wildly different as office application development and game programming. Even in game programming, subclasses are often just business logic in a pretty dress: this class of Object moves four pixels per frame, that class has a weapon that does 20 points of damage.

Haskell throws all that out the window and asks you a simple question: "What is addition?"

Like, we think we know what "addition" is. It’s the sum of two numbers to create a new number. "Well, what’s a number?" And so forth.

In my project, the discovery that regular expressions both are semirings and can produce semirings was mind-boggling, and it promised a very neat and pluggable way to adapt regex to a variety of purposes; then, when I learned what semirings really were, the embogglement went much higher.

Because a semiring is often described as this: (0, 1, ⨯, +, ℙ), where ℙ is some set. Not a set "of numbers," just some set. If we say ℙ is "the set of natural numbers" then we get all the operations you’d expect. But this is just a concrete implementation! The underlying abstraction is still there. The exact same behavior holds for (F, T, &, |, 𝔹), the set of booleans values; F&x is always false, just as 0⋅x is always zero. F|x=x, T&x=x, etc.
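
In Rust, the abstraction ends up looking something like this (a minimal sketch of my own, with the boolean instance from the previous paragraph):

// A minimal semiring abstraction and its boolean instance; sketch only.
pub trait Semiring {
    fn zero() -> Self;                 // identity for add, annihilator for mul
    fn one() -> Self;                  // identity for mul
    fn add(&self, rhs: &Self) -> Self; // the "+" of the tuple
    fn mul(&self, rhs: &Self) -> Self; // the "×" of the tuple
}

impl Semiring for bool {
    fn zero() -> Self { false }
    fn one() -> Self { true }
    fn add(&self, rhs: &Self) -> Self { *self || *rhs } // F|x = x
    fn mul(&self, rhs: &Self) -> Self { *self && *rhs } // F&x = F, T&x = x
}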

And then the mind-boggling part I got from Might & Co, which I later discovered were called Viterbi Parse Forests: ([], [""], ⊗, ∪, [𝕊]), where the brackets represent "set of…" and 𝕊 represents "strings" (as programmers understand them), has the exact same behavior. The ⊗ operator is the Cartesian product, that is, the concatenation of every string from the left set with every string from the right set. With this, what you get back is a record of the exact strings that were recognized and matched.

If you replace "set" in the above paragraph with "list", you get prioritized unions— which is the basis of disambiguation in Parsing Expression Grammar theory. If you define the semiring rather strangely but still within the rules, you get "longest match" disambiguation— which is the basis of μ-regex parsing under the POSIX rules. And so forth.

This is a much deeper level of abstraction. The user has to be able to not just internalize the "obvious" uses, but to understand the deep connective tissue linking these different uses and then come up with new ones. Haskell developers are a mixture of very quick and very slow— the slow part comes while their brain is tickling out a new understanding of the abstraction before them, and the quick is implementing it with good IDE support. It separates the clever from the skilled. Ideally, it would be good to have both.

So, to muse on about regular expressions…

I was genuinely pleased to post Rigged Brzozowski Regular Expressions in Rust (Experiment No. 08), one that completely supports all of the operations in Haskell and one that completely supports the same semantics as the Glushkov Heavyweights in Rust (Experiment No. 07), which makes me mindbogglingly happy.

Now, there are a couple of things going on here. Here’s the skinny: The Glushkov version is a state machine that progressively builds the semiring as characters are read. The Brzozowski version takes the regular expression tree and builds a parse tree of results, some of which use Adams’s Delta operator as a gatekeeper to preserve valid parses or to block off invalid ones. It then presents the resulting tree to a function that postprocesses it, creating a semiring out of the results.

The Delta operator is only introduced into the parse tree during Sequence operations, to block off trees that don’t need postprocessing because they couldn’t have been valid in the first place, or to preserve trees that could be valid. For example, ab*c is an expression that can contain just ac or abc (or abbbbc), so the sequence requires a gatekeeper to preserve whether or not the b was seen.

There are two ways of handling gatekeeping. Might says:

Dc(L ◦ R) = (Dc(L) ◦ R) ∪ (δ(L) ◦ Dc(R)).

is equivalent to:

if (ε ∈ L) then
    Dc(L ◦ R) = (Dc(L) ◦ R) ∪ Dc(R)
else
    Dc(L ◦ R) = Dc(L) ◦ R

The difference between these is that in the first case, the delta operator is either an annihilator or an identity on the result of the parse, whereas in the second, the nullability function decides on the process of the parse. Despite Adams’s assertion, the truth is harder; in semiring parsing, the loss of result information on the left hand side of a sequence is devastating to the final parse. We do not get an honest parse tree without the Delta operator.

Might solves the problem by recognizing that concatenation of an epsilon with anything requires data preservation only, and since he’s approaching this from a more engineering-oriented standpoint than the Play on Regular Expressions people, using in-line combinators, he offers:

(ε ↓ {t1}) ◦ p ⇒ p → λt2.(t1, t2)
p ◦ (ε ↓ {t2}) ⇒ p → λt1.(t1, t2)

This basically says that if you have a (pending) semiring t1 sequenced with the eventual results of a parser p, you end up with the output of p becoming the right hand of the mul operation over the two semirings, with t1 on the left. And it’s flipped if the epsilon is on the right. In other words, he recognizes when an epsilon is in a sequence and "floats" its value, multiplying it with the eventual result of the other side of the sequence. The information is preserved.

I want to believe this is on purpose; that Might & Adams knew this was a necessary preservation step. Darais’s highly optimized Racket implementation uses it effectively and has no need of either the Delta operator or the Epsilon-Star operator, as data preservation is left to the reduction optimization on line 97 of Derp-3.

It’s exactly what you’d expect out of semiring parsing, but it’s mechanical and, well, naïve. Not in a bad way, mind you. I’m still trying to figure out how to make it "go" in a pure semiring implementation. I may have to implement reducers anyway, just to support this; otherwise, I have to use the Delta operator and the tree remains alarmingly large.
