Barre: Progress Report, second week of December
The develop branch of Barre now fully implements all of the features of Adams, Might & Darais's 2015 implementation, only it does so in Rust. The current price of running seems to be about 15μs (15 microseconds) per operation, although that's a preliminary finding based on very short grammars.
This... is a frigging miracle. It works. It works, dammit. It can be done, and it seems to be fairly performant. It still has a problem with memory blowup, and at this point the only solution is collaborative garbage collection, but I have to do an analysis to make sure the nodes are mobile and not-fragile. I know my roots so compacting the arena is probably do-able, but I would have to determine the memory-vs-speed costs before implementing.
But it works. This is something of a minor breakthrough in theoretical computer science! I'm absolutely thrilled.
The biggest headache now is genericizing the reducers. Under the covers, a regular expression engine does one thing: it say "Yes" or "No" to whatever is being parsed. Reducers rely on whatever operation said "Yes" to store what it said "yes" to: the character, sequence, or list of alternatives that the parser engine found worth returning, and it does so in the form of a parse tree. These discoveries are stored in a special operation called an epsilon. Because alternatives can result in multiple matches, reducers deal in sets: each reducer takes a set of trees from its inner parsers and turns the result into something else.
This is incredibly useful. A string of epsilons that we would have to trudge through every time to get to the next operation can be collapsed into a single epsilon by a reducer on any epsilon ◦ epsilon operations.
Optimizations on the current engine rely on reducers. The rebalancing of sequence trees from left-to-right requires that the results be balanced back to right-to-left afterward, and a reducer does this. Multiple reductions in sequence can be rebalanced to a single, composed reduction on the results of the whole sequence. And when there are alternatives of alternatives, the results have to be permutated appropriately.
All well and good. And those internal reductions all work well and are well-applied inside the grammar. But there's a complicating headache: external reducers.
The user of this library will want to add in-line reducers to modify the discovered parse tree into something else. A basic example is the regular expression standby, the capture group. A capture group is a reduction: in this case, taking the product of the regular expression inside the capture group and adding a label. And that's where the problem lies: there's no place in the existing datatype for a label. Or anything else. Decorating the rutrned tree with additional information is currently not possible.
The whole point of this exercise is to produce ASTs, IRs, and other internal representations of raw data of some kind. The problem is that client reducers will not have the same return type. Even worse, client reducers will have a headache of a time dealing with the internal representations of the graphs produced by Barre.
As an exercise, I'm going to try to implement Darais's "Language of Languages," which provides a lot of convenience methods for defining other languages. It relies on external reductions, so maybe it'll teach me a thing or two about how to get it right.
Current To-Do list looks like this:
-
Implement right-handed optimizations
-
Implement Language-of-Languages
-
Implement Capture Groups
-
Determine Reducer return type strategy (look at pom for tips)
-
Implement a parser-of-regex parser that supports the Rust Regex dialect
-
Benchmark Barre vs Rust Regex
-
Benchmark Barre vs RE2
-
Implement the negation, intersection, and interleaf operators
-
Implement a parser-of-regex parser that supports a meaningful subset of the Perl Six dialect
-
Implement the cut operator,
which enables POSIX-compliant matching -
Swap out the
set
return type for aposet
type, to enable prioritized matching -
Implement a Derivatives of Parser Expression Grammars library
-
Implement ROSIE in Rust
-
Implement a PEGjs-like tool for generating advanced parsers in Rust
-
Genericize the PEG producer to work with other languages.
-
Kill YACC.