Elf Sternberg: Hitting the math-vs-engineering speedbump with the Regex project

I have hit a snag in my project. This entry is me thinking about solutions.

My goal was a reasonable one:

write a recursive regular expression engine
in Rust
using Brzozowski Derivatives algorithm for calculating truth values
using Might's algorithm for recursion
using Adam's algorithms for optimal performance
using Semiring Parsing Theory to categorically "lift" those truth values via finite state transduction
use a "derivation parsing" semiring to capture the text
provide alternative semirings to turn derivation parsing on and off, creating capture group semantics
use a "weighted semiring" to automate the usually error-prone process of disambiguating the parse tree into a pluggable and theoretically sound approach

There are a million other "goals" that can go with this:

Write a variant that produces a DFA, allowing for static compilation
Write a variant that produces a bitmap, allowing for fast comparison
Write a variant that is-a iterator, such that some expressions can be marked "progressive," and when is complete emit the contents
Write a relationship between the parser and the semirings such that progress can be suspended and restored as needed
Add negation, complement, and interleaving as regular expression operators
Write Darais' Language-of-Languages
Write a PEG parser on top
Write ROSIE in Rust
Instantiate backreferences
Add a concrete stream operator

The idea is that Barre/Barge is a generalized but still highly performant toolkit for building recognition, matching, and parsing engines, and the nature of the operation depends on passing it an output dependency with an interface that is theoretically sound and operationally powerful.

Unfortunately, I've hit this snag: The classic implementation of parsing-with-semirings can produce parse tree via the Derivation Forest Semiring, but disambiguating that the forest down to a single tree requires a different semiring to implement one of First Match / Longest Match / Longest Atomic Match. Adams and Darais provide a derivation forest engine, and it's the one I've implemented, but there's one special case.

Imagine a sequence of regular expressions: "abcde" These can be depicted in one of two ways: (a, (b, (c, (d, e)))) or ((((a, b), c), d), e). Computationally, that second one, "left heavy," is troubling: for every character, you'll have to transition down the tree to reach the character, traversing four nodes just to get to the a expression. Darais implemented Adams' suggestion that the tree be rebalanced to make it "right-heavy," and then a post-parsing operation is placed in front of the node to rebalance the resulting parse tree back to a left-heavy representation so that the results are consistent with expectations.

The problem with this is that there's no clear implementation that corresponds to the semiring. It's an engineering-versus-math issue, and I'm not yet sure how to deal with it, although just writing it out does help me see some of the potential solutions. The basic solution is to keep it all: the raw parse tree, the raw disambiguation matrix, and the product of the parse as quickly as the expression specification can produce it.

Elf M. Sternberg

Full Stack Web Developer

Where one teaches, two learn.

Blog

HITTING THE MATH-VS-ENGINEERING SPEEDBUMP WITH THE REGEX PROJECT