I have hit a snag in my project. This entry is me thinking about solutions.
My goal was a reasonable one:
- write a recursive regular expression engine
- in Rust
- using Brzozowski Derivatives algorithm for calculating truth values
- using Might's algorithm for recursion
- using Adam's algorithms for optimal performance
- using Semiring Parsing Theory to categorically "lift" those truth values via finite state transduction
- use a "derivation parsing" semiring to capture the text
- provide alternative semirings to turn derivation parsing on and off, creating capture group semantics
- use a "weighted semiring" to automate the usually error-prone process of disambiguating the parse tree into a pluggable and theoretically sound approach
There are a million other "goals" that can go with this:
- Write a variant that produces a DFA, allowing for static compilation
- Write a variant that produces a bitmap, allowing for fast comparison
- Write a variant that is-a iterator, such that some expressions can be marked "progressive," and when is complete emit the contents
- Write a relationship between the parser and the semirings such that progress can be suspended and restored as needed
- Add negation, complement, and interleaving as regular expression operators
- Write Darais' Language-of-Languages
- Write a PEG parser on top
- Write ROSIE in Rust
- Instantiate backreferences
- Add a concrete stream operator
The idea is that Barre/Barge is a generalized but still highly performant toolkit for building recognition, matching, and parsing engines, and the nature of the operation depends on passing it an output dependency with an interface that is theoretically sound and operationally powerful.
Unfortunately, I've hit this snag: The classic implementation of parsing-with-semirings can produce parse tree via the Derivation Forest Semiring, but disambiguating that the forest down to a single tree requires a different semiring to implement one of First Match / Longest Match / Longest Atomic Match. Adams and Darais provide a derivation forest engine, and it's the one I've implemented, but there's one special case.
Imagine a sequence of regular expressions: "abcde" These can be depicted in one of two ways: (a, (b, (c, (d, e))))
or ((((a, b), c), d), e)
. Computationally, that second one, "left heavy," is troubling: for every character, you'll have to transition down the tree to reach the character, traversing four nodes just to get to the a
expression. Darais implemented Adams' suggestion that the tree be rebalanced to make it "right-heavy," and then a post-parsing operation is placed in front of the node to rebalance the resulting parse tree back to a left-heavy representation so that the results are consistent with expectations.
The problem with this is that there's no clear implementation that corresponds to the semiring. It's an engineering-versus-math issue, and I'm not yet sure how to deal with it, although just writing it out does help me see some of the potential solutions. The basic solution is to keep it all: the raw parse tree, the raw disambiguation matrix, and the product of the parse as quickly as the expression specification can produce it.