Elf Sternberg: The purity of mathematics vs. the necessities of engineering, take 2.

So, to muse on about regular expressions...

I was genuinely pleased to post Rigged Brzozowski Regular Expressions in Rust (Experiment No. 08), once that completely supports all of the operations in Haskell and one that completely supports the same semantics as the Glushkov Heavyweights in Rust (Experiment No. 07), which makes me mindbogglingly happy.

Now, there are a couple of things going on here. Here's the skinny: The Glushkov version is a state machine that is progressively building the Semiring as characters are read. The Brzozowski version takes the regular expression tree and builds a parse tree of results, some of which use Adam's Delta operator as a gatekeeper to preserve valid parses or to block off invalid ones. It then presents the resulting tree to a function that postprocesses it, creating a semiring out of the results.

The Delta operator is only introduced into the parse tree during Sequence operations, to block off trees that don't need postprocessing because they couldn't have been valid in the first place, or to preserve trees that could be valid. For example, ab*c is an expression that can contain just ac or abc (or abbbbc), so the sequence requires a gatekeeper to preserve whether or not the b was seen.

There are two ways of handling gatekeeping. Might says:

<code>Dc(L ◦ R) = (Dc(L) ◦ R) ∪ (δ(L) ◦ Dc(R)).</code>

is equivalent to:

<code>if (ε ∈ L) then
    Dc(L ◦ R) = (Dc(L) ◦ R) ∪ Dc(R)
else
    Dc(L ◦ R) = Dc(L) ◦ R</code>

The difference between these is that in the first case, the delta operator is either an annihilator or an identity on the result of the parse, whereas in the second, the nullability function decides on the process of the parse. Despite Adams's assertion, the truth is harder; in semiring parsing, the loss of result information on the left hand side of a sequence is devastating to the final parse. We do not get an honest parse tree without the Delta operator.

Might solves the problem by recognizing that concatenation of an epsilon with anything requires data preservation only, and since he's approaching this from a more engineering-oriented standpoint than the Play on Regular Expressions people, using in-line combinators, he offers:

<code>(ε ↓ {t1}) ◦ p ⇒ p → λt2.(t1, t2)
p ◦ (ε ↓ {t2}) ⇒ p → λt1.(t1, t2)</code>

This basically says that if you have a (pending) semiring t1 sequenced with the eventual results of a parser p, you end up with the output of p being the left hand of the mul operation over the two semirings. And flipped if the epsilon is on the right. In other words, he recognizes when an epsilon is in a sequence and 'floats' multiplying it with the result of the right hand side of the sequence. The information is preserved.

I want to believe this is on purpose; that Might & Adams knew this was a necessary preservation step. Darais's highly optimized Racket implementation uses it effectively and has no need of either the Delta operator or the Epsilon-Star operator, as data preservation is left to the reduction optimization on line 97 of Derp-3.

It's exactly what you'd expect out of semiring parsing, but it's mechanical and, well, naïve. Not in a bad way, mind you. I'm still trying to figure out how to make it "go" in a pure semiring implementation. I may have to implement reducers anyway, just to support this; otherwise, I have to use the Delta operator and the tree remains alarmingly large.

Elf M. Sternberg

Full Stack Web Developer

Where one teaches, two learn.

Blog

THE PURITY OF MATHEMATICS VS. THE NECESSITIES OF ENGINEERING, TAKE 2.