For the past two weeks, I've been working my way through Matt Might's paper Yacc is Dead and implementing it in Rust. The result is BARRE: Brzozowski Antimirov Rusty Regular Expressions. So far as I've been able to determine, I'm the first person to try implementing it in a systems language, one without garbage collection.
Rust has made this pretty straightforward so far. My progress has been pretty good: I have the naive implementation built and running. The naive implementation sucks: it's slow, it's inefficient, and it has a tendency to grow out of control in space. The regex (A|B)*C, when trying to match 'ABABAAC', grows to a deterministic finite automaton (DFA) of over 140 nodes!
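For readers who haven't seen the technique, here is a minimal sketch of a naive Brzozowski derivative matcher. It is not BARRE's actual code: the `Re` enum and the `nullable`, `deriv`, `size`, `matches`, and `sample` names are all made up for illustration, and it uses a boxed tree rather than whatever representation BARRE uses. It does show why the naive version balloons: every derivative builds new nodes and nothing is ever simplified away.

```rust
// Hypothetical representation, for illustration only.
#[derive(Clone, Debug, PartialEq)]
enum Re {
    Empty,                 // matches nothing
    Eps,                   // matches only the empty string
    Tok(char),             // matches exactly one character
    Alt(Box<Re>, Box<Re>), // r | s
    Cat(Box<Re>, Box<Re>), // r followed by s
    Star(Box<Re>),         // r*, zero or more
}

// Does this expression accept the empty string?
fn nullable(r: &Re) -> bool {
    match r {
        Re::Empty | Re::Tok(_) => false,
        Re::Eps | Re::Star(_) => true,
        Re::Alt(a, b) => nullable(a) || nullable(b),
        Re::Cat(a, b) => nullable(a) && nullable(b),
    }
}

// The derivative of r with respect to c: what is left to match after c.
// Deliberately naive: no simplification, so dead nodes pile up.
fn deriv(r: &Re, c: char) -> Re {
    match r {
        Re::Empty | Re::Eps => Re::Empty,
        Re::Tok(t) => if *t == c { Re::Eps } else { Re::Empty },
        Re::Alt(a, b) => Re::Alt(Box::new(deriv(a, c)), Box::new(deriv(b, c))),
        Re::Cat(a, b) => {
            let left = Re::Cat(Box::new(deriv(a, c)), b.clone());
            if nullable(a) {
                Re::Alt(Box::new(left), Box::new(deriv(b, c)))
            } else {
                left
            }
        }
        Re::Star(a) => Re::Cat(Box::new(deriv(a, c)), Box::new(r.clone())),
    }
}

// Count the nodes in an expression tree.
fn size(r: &Re) -> usize {
    match r {
        Re::Empty | Re::Eps | Re::Tok(_) => 1,
        Re::Star(a) => 1 + size(a),
        Re::Alt(a, b) | Re::Cat(a, b) => 1 + size(a) + size(b),
    }
}

// Derive once per input character, then ask whether what remains is nullable.
fn matches(r: &Re, input: &str) -> bool {
    let mut cur = r.clone();
    for c in input.chars() {
        cur = deriv(&cur, c);
    }
    nullable(&cur)
}

// Builds (A|B)*C.
fn sample() -> Re {
    let a_or_b = Re::Alt(Box::new(Re::Tok('A')), Box::new(Re::Tok('B')));
    Re::Cat(Box::new(Re::Star(Box::new(a_or_b))), Box::new(Re::Tok('C')))
}
```

Calling `matches(&sample(), "ABABAAC")` walks seven derivatives, and comparing `size(&sample())` with the size of a single derivative already shows the tree growing. The exact node counts in this sketch won't line up with the 140 figure above, since that comes from a different representation.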
My progress is going to be a bit stalled this week as I work through chapter 1 of Michael Sipser's Theory of Computation, which teaches about DFAs and NFAs and how to convert back and forth between the two. There are several major improvements that can be made to the existing algorithm.
First, I need to figure out how to extend the engine to handle different repetition kinds. Right now it's the pure Kleene star, "zero or more." Modern regex engines handle "one or more" ("Kleene plus"), "zero or one", and ranges ("exactly n", "at least n", "at most n").
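One possible route, sketched here under the assumption that the engine only understands the core operators, is to desugar each repetition kind into concatenation, alternation, and the Kleene star before matching. The `Re` type and every function name below are hypothetical. Note that expanding ranges this way copies the sub-expression n times, so it is a starting point rather than an efficient answer.

```rust
// Hypothetical core representation; only the operators the engine already has.
#[derive(Clone, Debug, PartialEq)]
enum Re {
    Eps,
    Tok(char),
    Alt(Box<Re>, Box<Re>),
    Cat(Box<Re>, Box<Re>),
    Star(Box<Re>),
}

fn cat(a: Re, b: Re) -> Re { Re::Cat(Box::new(a), Box::new(b)) }
fn alt(a: Re, b: Re) -> Re { Re::Alt(Box::new(a), Box::new(b)) }
fn star(a: Re) -> Re { Re::Star(Box::new(a)) }

// "One or more": r+ is r followed by r*.
fn plus(r: Re) -> Re {
    cat(r.clone(), star(r))
}

// "Zero or one": r? is r or the empty string.
fn optional(r: Re) -> Re {
    alt(r, Re::Eps)
}

// "Exactly n": n copies of r in sequence, ending in the empty string.
fn exactly(r: &Re, n: usize) -> Re {
    (0..n).fold(Re::Eps, |acc, _| cat(r.clone(), acc))
}

// "At least n": n copies of r, then r*.
fn at_least(r: &Re, n: usize) -> Re {
    (0..n).fold(star(r.clone()), |acc, _| cat(r.clone(), acc))
}

// "At most n": n copies of r?, ending in the empty string.
fn at_most(r: &Re, n: usize) -> Re {
    (0..n).fold(Re::Eps, |acc, _| cat(optional(r.clone()), acc))
}
```

With these definitions, `at_least(&r, 1)` builds the same tree as `plus(r)`, which is a handy sanity check on the rewrites.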
PCRE adds Boundaries (word boundaries, the "\b" operator), Anchors (start of line, end of line, start of document, end of document), and Classes (is-a-num, is-a-letter, range-of-letters). All of these are single-term operations, though, and should have the same semantics as a Token.
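A minimal sketch of the "same semantics as a Token" idea, covering classes only: a single term is just a predicate over one input character, and its derivative is the empty string when the predicate holds and the empty set otherwise. `Term`, `Step`, `accepts`, and `derive_term` are invented names, not BARRE's. Boundaries and anchors are left out because they consume no input, so they would likely need to inspect position or neighboring characters rather than the current character alone.

```rust
// Hypothetical single-term kinds; a literal token is just one more case.
#[derive(Clone, Debug, PartialEq)]
enum Term {
    Literal(char),
    Digit,             // is-a-num
    Letter,            // is-a-letter
    Range(char, char), // range-of-letters, inclusive on both ends
}

impl Term {
    // Does this term accept the character c?
    fn accepts(&self, c: char) -> bool {
        match self {
            Term::Literal(t) => *t == c,
            Term::Digit => c.is_ascii_digit(),
            Term::Letter => c.is_alphabetic(),
            Term::Range(lo, hi) => (*lo..=*hi).contains(&c),
        }
    }
}

// What a single term derives to: the empty string on success, nothing on failure.
#[derive(Debug, PartialEq)]
enum Step {
    Eps,
    Empty,
}

fn derive_term(t: &Term, c: char) -> Step {
    if t.accepts(c) { Step::Eps } else { Step::Empty }
}
```

The appeal of this shape is that the derivative function needs only one case for all single-term operations, no matter how many class kinds get added later.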
PCRE also adds Captures (groups that can be retrieved after the expression engine has run, showing what was actually matched), and Backreferences (literals that appear later in the regular expression that match strings built out of what was matched, effectively "splicing" the parse tree with a runtime-only concatenation at a given point in the matching process).
Python, of all things, adds a unique feature to the input iterator to replace indent/dedent changes with block-open and block-close symbols. This may be useful for whitespace-based languages, and obviates the need to play funny with the regular expression engine.
Brzozowski's algorithm itself has four major improvements: laziness, which delays actually creating a derivative node until it's truly needed; fixed points, which takes a derivative subgraph and re-derives it, over and over until it stops changing, at which point we have the final matcher; memoization, which intercepts the fixed-point processor to find duplication and returns a pointer to the start of the duplicate in the existing subgraph, interrupting any potential infinite recursions which may blow our stack; and compaction, which identifies duplications across subgraphs and prevents the in-memory representation from growing out of control.
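To give a feel for what compaction buys, here is a small sketch using simplifying "smart constructors," one common way to keep derivatives from piling up dead nodes. It is a stand-in, not necessarily the compaction scheme described above, and it does not attempt laziness, fixed points, or memoization. All names (`Re`, `alt`, `cat`, `deriv`, `sample`) are hypothetical.

```rust
// Hypothetical representation, for illustration only.
#[derive(Clone, Debug, PartialEq)]
enum Re {
    Empty,                 // matches nothing
    Eps,                   // matches only the empty string
    Tok(char),
    Alt(Box<Re>, Box<Re>),
    Cat(Box<Re>, Box<Re>),
    Star(Box<Re>),
}

// Alternation that drops dead branches and exact duplicates.
fn alt(a: Re, b: Re) -> Re {
    match (a, b) {
        (Re::Empty, r) | (r, Re::Empty) => r,
        (a, b) if a == b => a,
        (a, b) => Re::Alt(Box::new(a), Box::new(b)),
    }
}

// Concatenation that collapses around the empty set and the empty string.
fn cat(a: Re, b: Re) -> Re {
    match (a, b) {
        (Re::Empty, _) | (_, Re::Empty) => Re::Empty,
        (Re::Eps, r) | (r, Re::Eps) => r,
        (a, b) => Re::Cat(Box::new(a), Box::new(b)),
    }
}

fn nullable(r: &Re) -> bool {
    match r {
        Re::Empty | Re::Tok(_) => false,
        Re::Eps | Re::Star(_) => true,
        Re::Alt(a, b) => nullable(a) || nullable(b),
        Re::Cat(a, b) => nullable(a) && nullable(b),
    }
}

// Same derivative rules as the naive version, but built through alt and cat.
fn deriv(r: &Re, c: char) -> Re {
    match r {
        Re::Empty | Re::Eps => Re::Empty,
        Re::Tok(t) => if *t == c { Re::Eps } else { Re::Empty },
        Re::Alt(a, b) => alt(deriv(a, c), deriv(b, c)),
        Re::Cat(a, b) => {
            let left = cat(deriv(a, c), (**b).clone());
            if nullable(a) { alt(left, deriv(b, c)) } else { left }
        }
        Re::Star(a) => cat(deriv(a, c), r.clone()),
    }
}

// Builds (A|B)*C.
fn sample() -> Re {
    let a_or_b = Re::Alt(Box::new(Re::Tok('A')), Box::new(Re::Tok('B')));
    Re::Cat(Box::new(Re::Star(Box::new(a_or_b))), Box::new(Re::Tok('C')))
}
```

With these constructors, the derivative of (A|B)*C with respect to 'A' simplifies back to (A|B)*C itself, and the derivative with respect to 'C' is just the empty string, so the expression stops growing as input is consumed. Structural equality on trees is a crude way to spot duplicates; a real implementation would more likely compare node identities.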
As far as I can tell, Brzozowski regular expressions have a "start-up cost," that is, the first time through can be pretty expensive as the graph is filled out, but follow-on matches can be pretty quick, as only the deltas need to be filled out.
Antimirov Derivatives promise an easier-to-implement variant of Brzozowski's algorithm. I haven't yet dived into these, but they're on the list.
My hope is that Antimirov's Derivatives don't interfere with Might's final point: Recursive Regular Expressions which, if implemented in a system with captures, can parse context-free grammars whole.
A couple nice-to-haves: First, I need internal macros to build internal representations in as straightforward a manner as possible. Rust has a fairly powerful macro language, and leveraging it to make my life easy while negotiating with the borrow checker would be awesome. Second, the internal representation of a first-order Brzozowski's Derivative is a simple array; being able to pre-compile that array to have stand-alone regexes at compile time would be wonderful.
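As a rough idea of what such macros might look like, here is a sketch using `macro_rules!` over a boxed tree. The `Re` enum and the `tok!`, `star!`, `alt!`, and `cat!` macros are hypothetical, and since the real internal representation is an array rather than a tree, the macro bodies would differ. Only the calling syntax is the point here.

```rust
// Hypothetical tree representation, for illustration only.
#[derive(Clone, Debug, PartialEq)]
enum Re {
    Tok(char),
    Alt(Box<Re>, Box<Re>),
    Cat(Box<Re>, Box<Re>),
    Star(Box<Re>),
}

// A single literal token.
macro_rules! tok {
    ($c:expr) => { Re::Tok($c) };
}

// Kleene star over one expression.
macro_rules! star {
    ($r:expr) => { Re::Star(Box::new($r)) };
}

// Alternation over one or more expressions, nested to the right.
macro_rules! alt {
    ($a:expr) => { $a };
    ($a:expr, $($rest:expr),+) => {
        Re::Alt(Box::new($a), Box::new(alt!($($rest),+)))
    };
}

// Concatenation over one or more expressions, nested to the right.
macro_rules! cat {
    ($a:expr) => { $a };
    ($a:expr, $($rest:expr),+) => {
        Re::Cat(Box::new($a), Box::new(cat!($($rest),+)))
    };
}

// (A|B)*C, written with the macros instead of nested Box::new calls.
fn sample() -> Re {
    cat!(star!(alt!(tok!('A'), tok!('B'))), tok!('C'))
}
```

The macros hide the `Box::new` noise and take ownership of their arguments, which sidesteps most borrow-checker friction when building test expressions by hand.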
And finally, it would be amazing if a parser for a Thompson, PCRE, or even PCRE-light syntax (such as supported by Rust's own Regex engine) could be written in BARRE. The parser for Rust's Regex engine is written with, but not in, Regex; that is, while the Regex engine is used to recognize substrings written in Thompson/PCRE regex (you know, the classic abc(.*?)de strings), it can't handle nested expressions, which require the engine to recurse on itself. Theoretically, BARRE can.
If all of this is true and possible, then I might well have a competitive alternative to both regex.rs and nom.rs, and my next steps toward learning how to write my own programming language will be clear, if not easy.