I've made some excellent progress on the Barre project. Along the way, I discovered that implementing a theoretical CS paper has some annoying difficulties, the worst of which is figuring out a translation table for the different notations used by different researchers in the given field. Today, in describing my progress on Barre and on implementing Brzozowski's regular expressions in Rust, I'm going to talk about a different problem: orthogonality.
In the four core papers on Brzozowski's 1965 algorithm (Might 2010, Might 2011, Adams 2015, and Sulzmann 2010), there is a progression, and a swerve, about the nature of implementing a parser via Brzozowski derivatives. Brzozowski himself was talking only about regular expression recognizers, which can only tell you if a string matches a given regular expression, a boolean result. Brzozowski realized that, equationally, one could put an instance of the parser inside itself, creating a recursive structure that would gracefully handle backreferences and even context-free grammars (CFGs). Unfortunately, no programming language in 1965 could handle Brzozowski's formulation. In 2009, Scott Owens published a paper showing it could be done with modern languages.
Scott Owens showed that it could work as a recognizer. Matt Might made it work as a parser, and then contributed some significant performance-oriented optimizations based on the observation that the derivative of the derivative of some parsers would have redundancies that could be avoided. Adams made it go fast. Sulzmann contributed an important step toward making it deterministic and predictable.
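To make the idea concrete, here is a minimal sketch of a derivative-based recognizer in Rust. It is not Barre's code: the `Re` type and the `nullable`, `derive`, and `matches` names are assumptions made for illustration. The derivative of an expression with respect to a character is the expression that matches whatever may follow that character, and a string matches when, after taking a derivative for each of its characters, the expression that remains accepts the empty string.

```rust
// Illustrative regular expression type (an assumption, not Barre's node type).
#[derive(Clone, Debug)]
enum Re {
    Empty,                 // matches no string at all
    Eps,                   // matches only the empty string
    Chr(char),             // matches exactly one character
    Alt(Box<Re>, Box<Re>), // either side
    Cat(Box<Re>, Box<Re>), // left side followed by right side
    Star(Box<Re>),         // zero or more repetitions
}

// Does the expression accept the empty string?
fn nullable(r: &Re) -> bool {
    match r {
        Re::Empty | Re::Chr(_) => false,
        Re::Eps | Re::Star(_) => true,
        Re::Alt(a, b) => nullable(a) || nullable(b),
        Re::Cat(a, b) => nullable(a) && nullable(b),
    }
}

// The Brzozowski derivative of `r` with respect to `c`.
fn derive(r: &Re, c: char) -> Re {
    match r {
        Re::Empty | Re::Eps => Re::Empty,
        Re::Chr(x) => if *x == c { Re::Eps } else { Re::Empty },
        Re::Alt(a, b) => Re::Alt(Box::new(derive(a, c)), Box::new(derive(b, c))),
        Re::Cat(a, b) => {
            let left = Re::Cat(Box::new(derive(a, c)), b.clone());
            if nullable(a) {
                // The left side may match nothing, so `c` may belong to `b`.
                Re::Alt(Box::new(left), Box::new(derive(b, c)))
            } else {
                left
            }
        }
        Re::Star(a) => Re::Cat(Box::new(derive(a, c)), Box::new(r.clone())),
    }
}

// A recognizer: the only answer it can give is yes or no.
fn matches(r: &Re, input: &str) -> bool {
    let mut cur = r.clone();
    for c in input.chars() {
        cur = derive(&cur, c);
    }
    nullable(&cur)
}
```

With an expression built for `(ab)*`, `matches` accepts `""` and `"abab"` and rejects `"aba"`. Nothing here simplifies the derived expressions, so they grow with every character; that growth is the kind of cost the later optimization work is aimed at.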
But in trying to implement the results of all these papers together, I've discovered that there are several moving parts:
1. Recognizing regular expressions
2. Recognizing context-free grammars
   1. Memoizing derived parsers for re-use
   2. Avoiding derivations until needed
3. Extracting parse trees from the recognizer
4. Performance-related optimizations
   1. Node optimizations
   2. Control-flow optimizations
5. Common extensions
   1. Character classes and ranges
   2. Capture groups
   3. Named capture groups
   4. Back references
   5. Forward references
   6. Intersection and negation
6. Preambles and contextual symbols
7. Self-hosting: The regexp parser is written in itself.
8. Alternative parse languages
It's easy, reading through the papers, to think that I might want to implement these in order, or that each one is a prerequisite for the next. In fact, only the first, "Recognizing regular expressions," is a prerequisite for everything else.
Right now, I have a working implementation of (1). I think I have a working implementation of (2), which requires (2.1) and (2.2) to be complete. CFGs only work with memoization (so that we can find the recursive entry) and laziness (so that a recursive call doesn't go into an infinite loop).
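The interplay between those two requirements can be sketched as follows. This is a simplified illustration under stated assumptions, not Barre's implementation: nodes live in a `Vec` and refer to each other by index, a `Hole` placeholder registered in the memo table before its body is computed stands in for true lazy evaluation, and nullability is recomputed as a naive fixed point. All names (`Grammar`, `Node`, `derive`, `balanced_parens`) are hypothetical.

```rust
use std::collections::HashMap;

type Id = usize;
const EMPTY: Id = 0; // node that matches nothing
const EPS: Id = 1;   // node that matches only the empty string

#[derive(Clone, Copy)]
enum Node {
    Empty,
    Eps,
    Chr(char),
    Alt(Id, Id),
    Cat(Id, Id),
    Hole, // placeholder whose body has not been filled in yet
}

struct Grammar {
    nodes: Vec<Node>,
    memo: HashMap<(Id, char), Id>, // (node, char) -> derived node
}

impl Grammar {
    fn new() -> Self {
        Grammar { nodes: vec![Node::Empty, Node::Eps], memo: HashMap::new() }
    }

    fn add(&mut self, n: Node) -> Id {
        self.nodes.push(n);
        self.nodes.len() - 1
    }

    // Nullability as a least fixed point: start from "false" everywhere and
    // iterate until nothing changes, so cycles in the graph are harmless.
    fn nullable(&self, id: Id) -> bool {
        let mut null = vec![false; self.nodes.len()];
        loop {
            let mut changed = false;
            for (i, n) in self.nodes.iter().enumerate() {
                let v = match *n {
                    Node::Eps => true,
                    Node::Alt(a, b) => null[a] || null[b],
                    Node::Cat(a, b) => null[a] && null[b],
                    _ => false,
                };
                if v && !null[i] {
                    null[i] = true;
                    changed = true;
                }
            }
            if !changed {
                return null[id];
            }
        }
    }

    fn derive(&mut self, id: Id, c: char) -> Id {
        // Memoization: a recursive reference finds the entry made below.
        if let Some(&done) = self.memo.get(&(id, c)) {
            return done;
        }
        // Deferral: register a placeholder before visiting the children.
        let hole = self.add(Node::Hole);
        self.memo.insert((id, c), hole);
        let node = self.nodes[id];
        let body = match node {
            Node::Empty | Node::Eps | Node::Hole => Node::Empty,
            Node::Chr(x) => if x == c { Node::Eps } else { Node::Empty },
            Node::Alt(a, b) => {
                let (da, db) = (self.derive(a, c), self.derive(b, c));
                Node::Alt(da, db)
            }
            Node::Cat(a, b) => {
                let da = self.derive(a, c);
                let left = self.add(Node::Cat(da, b));
                let right = if self.nullable(a) { self.derive(b, c) } else { EMPTY };
                Node::Alt(left, right)
            }
        };
        self.nodes[hole] = body;
        hole
    }
}

fn matches(g: &mut Grammar, start: Id, input: &str) -> bool {
    let mut cur = start;
    for c in input.chars() {
        cur = g.derive(cur, c);
    }
    g.nullable(cur)
}

// Hypothetical example grammar: S = '(' S ')' S | ε (balanced parentheses).
fn balanced_parens() -> (Grammar, Id) {
    let mut g = Grammar::new();
    let s = g.add(Node::Hole); // reserve S so the rules can refer to it
    let open = g.add(Node::Chr('('));
    let close = g.add(Node::Chr(')'));
    let tail = g.add(Node::Cat(close, s));  // ')' S
    let mid = g.add(Node::Cat(s, tail));    // S ')' S
    let full = g.add(Node::Cat(open, mid)); // '(' S ')' S
    g.nodes[s] = Node::Alt(full, EPS);
    (g, s)
}
```

Calling `matches` on the grammar from `balanced_parens()` accepts `"(())()"` and rejects `"(()"`. If the memo lookup is removed, or the placeholder is registered only after the children have been derived, a left-recursive rule such as `S = S 'a' | ε` recurses forever. The sketch never compacts or simplifies the graph, so it only shows that the two mechanisms are enough to terminate, not how to make them fast.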
It was after I had the CFG recognizer "working" (it passes tests, but I'm not confident I have the algorithm perfectly correct) that I started working on the extractor, and hit a wall. The code is fairly complicated, and I made the mistake of implementing (4.2), because it made for an interesting challenge in Rust.
But I realized yesterday that implementing (3), extracting the data from the parser, was a perfectly compatible step with (1), the simplest of all the implementations. I didn't have to worry about recursion, or the node's structure types during the run.
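As a rough illustration of why extraction sits comfortably on top of (1), here is a sketch that adds parse tree extraction to a plain regular expression recognizer. It uses the `mkeps`/injection construction described by Sulzmann and Lu, which may or may not be what Barre ends up doing. The `Re`, `Tree`, `mkeps`, `inject`, and `parse` names are assumptions made for this example, and the block carries its own small recognizer so that it stands alone. It relies on the derivative being left unsimplified, so that every tree has exactly the shape `inject` expects.

```rust
#[derive(Clone, Debug, PartialEq)]
enum Re {
    Empty,
    Eps,
    Chr(char),
    Alt(Box<Re>, Box<Re>),
    Cat(Box<Re>, Box<Re>),
    Star(Box<Re>),
}

// Illustrative parse tree: one shape per regular expression constructor.
#[derive(Clone, Debug, PartialEq)]
enum Tree {
    Unit,                        // matched the empty string
    Chr(char),                   // matched one character
    Left(Box<Tree>),             // took the left branch of an Alt
    Right(Box<Tree>),            // took the right branch of an Alt
    Pair(Box<Tree>, Box<Tree>),  // both halves of a Cat
    Many(Vec<Tree>),             // the iterations of a Star
}

fn nullable(r: &Re) -> bool {
    match r {
        Re::Empty | Re::Chr(_) => false,
        Re::Eps | Re::Star(_) => true,
        Re::Alt(a, b) => nullable(a) || nullable(b),
        Re::Cat(a, b) => nullable(a) && nullable(b),
    }
}

fn derive(r: &Re, c: char) -> Re {
    match r {
        Re::Empty | Re::Eps => Re::Empty,
        Re::Chr(x) => if *x == c { Re::Eps } else { Re::Empty },
        Re::Alt(a, b) => Re::Alt(Box::new(derive(a, c)), Box::new(derive(b, c))),
        Re::Cat(a, b) => {
            let left = Re::Cat(Box::new(derive(a, c)), b.clone());
            if nullable(a) { Re::Alt(Box::new(left), Box::new(derive(b, c))) } else { left }
        }
        Re::Star(a) => Re::Cat(Box::new(derive(a, c)), Box::new(r.clone())),
    }
}

// Tree for the empty string; only meaningful when `r` is nullable.
fn mkeps(r: &Re) -> Tree {
    match r {
        Re::Alt(a, b) => {
            if nullable(a) { Tree::Left(Box::new(mkeps(a))) } else { Tree::Right(Box::new(mkeps(b))) }
        }
        Re::Cat(a, b) => Tree::Pair(Box::new(mkeps(a)), Box::new(mkeps(b))),
        Re::Star(_) => Tree::Many(vec![]),
        _ => Tree::Unit,
    }
}

// Given a tree for `derive(r, c)`, rebuild the tree for `r` by putting `c` back.
fn inject(r: &Re, c: char, t: Tree) -> Tree {
    match (r, t) {
        (Re::Chr(_), Tree::Unit) => Tree::Chr(c),
        (Re::Alt(a, _), Tree::Left(t)) => Tree::Left(Box::new(inject(a, c, *t))),
        (Re::Alt(_, b), Tree::Right(t)) => Tree::Right(Box::new(inject(b, c, *t))),
        // Cat with a non-nullable left side: the derivative was a Cat.
        (Re::Cat(a, _), Tree::Pair(t1, t2)) => Tree::Pair(Box::new(inject(a, c, *t1)), t2),
        // Cat with a nullable left side: the derivative was an Alt.
        (Re::Cat(a, _), Tree::Left(p)) => match *p {
            Tree::Pair(t1, t2) => Tree::Pair(Box::new(inject(a, c, *t1)), t2),
            _ => unreachable!("left branch of a derived Cat is always a Pair"),
        },
        (Re::Cat(a, b), Tree::Right(t2)) => {
            Tree::Pair(Box::new(mkeps(a)), Box::new(inject(b, c, *t2)))
        }
        (Re::Star(a), Tree::Pair(t1, rest)) => match *rest {
            Tree::Many(mut ts) => {
                ts.insert(0, inject(a, c, *t1));
                Tree::Many(ts)
            }
            _ => unreachable!("tail of a derived Star is always a Many"),
        },
        _ => unreachable!("tree does not match the shape of the derivative"),
    }
}

// Recognize and extract in one pass: derive forward, inject on the way back.
fn parse(r: &Re, input: &str) -> Option<Tree> {
    let mut chars = input.chars();
    match chars.next() {
        None => if nullable(r) { Some(mkeps(r)) } else { None },
        Some(c) => parse(&derive(r, c), chars.as_str()).map(|t| inject(r, c, t)),
    }
}
```

For `(a|b)c` applied to `"bc"`, `parse` returns a `Pair` whose first half is `Right(Chr('b'))` and whose second half is `Chr('c')`. The recognizer functions are untouched: extraction is added entirely through `mkeps`, `inject`, and `parse`.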
The lesson here is to identify, down to the smallest possible unit, what an individual implementation step is, and do only that step. Implementing parse tree extraction on the simplest possible code is the only way for me to understand the implications of doing it with much more complex code. Having it at hand also means that the pattern for implementing it in future extensions, such as character classes, intersection, and negation, will already be understood, rather than hacked ad hoc into those node types at the same time I'm trying to figure them out at all.
Plus, being aware that those future node types are coming will encourage me to write the parse_tree_extract() function in a way that has the least impact on the highly diverse node types, and to keep the zoo of node types to a minimum.