Regular expressions
"A PLAY ON REGULAR EXPRESSIONS," PART 2!
Okay, in part one I showed you what I've learned about writing regular expression engines in Haskell. The example shown uses Haskell chars, so is unicode-compliant.
I also mentioned a semiring. Here's what a Haskell semiring's class looks like. (A Haskell 'class' is more like a Trait in Rust, or an Interface is Java; it describes a collection of behaviors that we want implementors of the class to have. The syntax for an implementation is an `instance'):
<code>class Semiring s where
zero, one :: s
mul, add :: s -> s -> s</code>
That's... that's it. s
is a set, and these are the operations on the set. Remember that Int
is a set: the set of all integer numbers.
The idea in the paper is to make regular expressions more generalized. To give them powers above and beyond what they're normally capable of. This is also one of the main ideas in Barre, so I've been reading this paper with significant interest.
The idea here is to make the return type of the expression a Semiring. This is also the idea in Might and Adams, but here the authors go in a different direction. Because they have, as discussed last time, all the possible combinations of string and regex run at the same time, at the end of parsing, they end up with all possible parse trees automatically. The example showed all those resulting parse trees, which were lists (not sets) of boolean values being reduced down to, for sequences, "Are they all true?", and for alternatives, "Are any of them true?" But for a brief moment, all possible parse trees exist.
So let's create a semiring version of our regular expression. Better yet, let's exploit the fact that we already have one. Here, the only real change is that we've said a rigged regular expression is one that takes a symbol and returns a semiring – a "rig".
<code>data Regw c s =
Epsw -- Epsilon
| Symw (c -> s) -- Character
| Altw (Regw c s) (Regw c s) -- Alternation
| Seqw (Regw c s) (Regw c s) -- Sequence
| Repw (Regw c s) -- R*</code>
For our two "base" types, Epsilon and Symbol, we'll have to implement them; everything else can be implemented in terms of them. For everything else, we can "lift" them from the existing machinery, and here's our lift:
<code>rigged :: Semiring s => Reg -> Regw Char s
rigged Eps = Epsw
rigged (Sym c) = sym c
rigged (Alt p q) = Altw (rigged p) (rigged q)
rigged (Seq p q) = Seqw (rigged p) (rigged q)
rigged (Rep r) = Repw (rigged r)
sym :: Semiring s => Char -> Regw Char s
sym c = Symw (\b -> if b == c then one else zero)</code>
Those are the definitions, including the one that say that if the input symbol matches the constructed symbol, then return the Semiring's version of "one", otherwise "zero". Recall that the test expression for the original version of Sym
was u == [c]
; we've now said that Symw
takes a predicate that compares two symbols.
The accept
version must now be written to handle the rigged versions, and this case we pass the input string to the case statement, which says if the list exists pass it to the predicate, otherwise zero
:
<code>acceptw :: Semiring s => Regw c s -> [c] -> s
acceptw Epsw u = if null u then one else zero
acceptw (Symw f) u =
case u of
[c] -> f c
_ -> zero
acceptw (Altw p q) u = acceptw p u `add` acceptw q u
acceptw (Seqw p q) u = sumr [ acceptw p u1 `mul` acceptw q u2 | (u1, u2) <- split u ]
acceptw (Repw r) u = sumr [ prodr [ acceptw r ui | ui <- ps ] | ps <- parts u ]</code>
In the Boolean version in the previous post, I used and
and or
to sum up all the resulting values of True
and False
that came back from the recognizer. I can't do that here, because we don't know what type of set the Semiring operates over; for Boolean, it would just be True/False. For other things...
But wait! Haskell can abstract that as well. We'll replace and
with prodr
and or
with sumr
, and define them as folds over their semiring operations:
<code>sumr, prodr :: Semiring r => [r] -> r
sumr = foldr add zero
prodr = foldr mul one</code>
This works. If we're talking booleans, add
becomes ||
, and mul
become &&
, and foldr (&&) True [True, True, True]
is True, as is foldr (||) False [True, False, True]
, and so forth. Which gives us our first Semiring:
<code>instance Semiring Bool where
zero = False
one = True
add = (||)
mul = (&&)</code>
And then you can say:
<code>> let nocs = Rep ( Alt ( Sym 'a' ) ( Sym 'b' ) )
> let onec = Seq nocs (Sym 'c' )
> let evencs = Seq ( Rep ( Seq onec onec ) ) nocs
> acceptw (rigged evencs) "acc" :: Bool
True</code>
This is the point where I start to suspect Haskell of sorcery. You'll notice that we didn't actually associate the Semiring with our regular expression. No, we told acceptw
that its job was to return a Boolean value, since acceptw
takes a Semiring, it just went out and found a Semiring that does the job. There was only one implementation in the current scope that meets the definition "a Semiring of Bool," so Haskell assumed that must have been the one I wanted, and just used it. To someone coming from the C/C++ world, that's flaming magical. I understand it, but man, it's magical.
All right. But I said Semirings can be defined on natural numbers. That looks exactly like you'd expect:
<code>instance Semiring Int where
zero = 0
one = 1
add = (+)
mul = (*)
</code>
Yeah, zero is, um, zero. The number. There are no surprises here. So what happens when we ask:
<code>> acceptw (rigged evencs) "acc" :: Int
1</code>
"1?" What does "1?" mean. It means that our example above has exactly one parse tree. The number of different ways our regular expression above can handle this is: one. "aaa" gets you zero. And a string of twelve letters takes several minutes on my laptop due to that crazy explosion of possibilities to check I discussed in the previous post.
So is there a way to show this working? Well, sure. What would the regular expression (a|a*)
do when handed the string "a"
?
<code>> let as = Alt (Sym 'a') (Rep (Sym 'a'))
> acceptw (rigged as) "a" :: Int
2</code>
Two? Yes: both the left and the right alternatives returned their semiring one
value, and the alternative add
ed them together. Two.
<code>> acceptw (rigged as) "aa" :: Int
1</code>
One? Yes: Only the repetition returned anything, and it consumed the whole string. There was only one parse tree returned to be summed up.
<code>> let bs = Alt (Sym 'b') (Rep (Sym 'b'))
> acceptw (rigged (Seq as bs)) "ab" :: Int
4</code>
Four? Two for recognizing the a
, and two for recognizing the b
, and then the sequencing sumr
summed those values together to give us four.
And that's where I am in the paper. I've gotten this to work. Here's where things get interesting:
* The third part of the paper discusses reducing that explosion-of-nodes issue by using [Glushkov's Construction](https://en.wikipedia.org/wiki/Glushkov%27s_construction_algorithm) to generate an efficient directed finite automata. I want to use Brzozowski's algorithm instead, so this'll be interesting.
* The semiring construction here returns parse trees and does analysis on them on a symbol-by-symbol level; this is different from Might's use of a new regular expression type that acts as a parser-combinator, taking the parse tree and returning something built out of it, performing Henglein's "catamorphism on parse trees"; Adams' version takes this further and does the catamorphism the moment parsing is done. What this means is that Might's & Adams's work allows us to insert transformations in the middle of the expression, much the same way a PEG might return something constructed and of a different type, rather than merely a subset of the data provided.
* Brzozowski's algorithm with Might's nullability operation can be made recursive; does that make the semiring construction invalid? (I don't think so, since the analysis is done after the recursion has be successfully traversed, but I'm not sure.)
* The semiring construction requires a whole parse tree, but Brzozowski's algorithm with Might's reduction node addition allows for the parser to return results _at any time_, yielding them, turning the parser into one of type `Parser Iterator s, Iterator r => s -> r`; in Rust terms, it takes an iterator as an input, and it return an iterator; every time you call `.next()` on it, it consumes as many symbols from `s` as necessary until a coherent element of type `r` can be returned, or it returns an error. Can _this_ technique be adapted to work in that environment?
I have two competing desires at this point: (1) Try to adapt the work they've done here to a Brzozowski implementation, or (2) try to implement the work they've done here in Rust. I haven't decided which yet.