Regular expressions
READING "A PLAY ON REGULAR EXPRESSIONS" AND EXPLAINING IT TO MYSELF, PART 1.
I've been working my way through A Play on Regular Expressions, a programming pearl by Sebastian Fischer, Frank Huch, and Thomas Wilke on the use of weighted values in regular expressions. The paper has six parts, and I've understood everything up through part two, and what I've understood already blows my mind. I recently put up a big chunk of documentation on where I am in Barre, and where I am is that I have a reliable, working toolkit for parsing regular expressions and returning the string found (I have "Smart Epsilons"), but I haven't progressed much beyond that part.
This paper does something different. It starts with the idea of a Semiring as the basic "wrapper" type of the expression.
A semiring is a set R with members and operations written: (0, 1, ⊕, ⊛, R). A semiring is an abstraction of common addition and multiplication over natural numbers (The integers from zero to infinity), with the requirement that 0 ⊕ x = x, 1 ⊛ x = x, 0 ⊛ x = 0. That last operation is called "annihilation." That's it; it has two values from the set with specific meanings, and two operations on members of the set that have specific meanings, and a couple of rules about the two values. Try hard not to set in your mind that semirings are about any one "thing" at all. It's just an arbitrary set of rules that lets us do cool things.
A semiring can also be defined over a set of sets! In fact, regular expressions are themselves semirings: ⊕ defines the union of the results of two expressions; this is use in the Alternative operator; ⊛ is the Sequence operator, and defines all possible legal combinations of two expressions is sequence. An expression that recognizes nothing is zero; an expression that recognizes only the empty string is one. If an operation returns the empty set, it's an annhilator under ⊛.
But that's not what this paper is about. This paper starts with a basic regular expression. If you've been with me at all, you know that regular expressions are six functions that take a string and return a value: Empty (always False), Epsilon (True on Empty String), Token (True if it matches the constructed Token), Alternative (True if any constructed expression within matches), Sequence (True if the two constructed expressions within match in order), and Repetition (True if the constructed expression within matches zero or more times). Here are the five main operations (the sixth, "Empty", is assumed) in Haskell for a regular expression recognizer (boolean values only):
<code>data Reg =
Eps -- Epsilon
| Sym Char -- Token
| Alt Reg Reg -- Alternation
| Seq Reg Reg -- Sequence
| Rep Reg -- Repetition
accept :: Reg -> String -> Bool
accept Eps u = null u
accept (Sym c) u = u == [c]
accept (Alt p q) u = accept p u || accept q u
accept (Seq p q) u = or [accept p u1 && accept q u2 | (u1, u2) <- split u]
accept (Rep r) u = or [and [accept r ui | ui <- ps] | ps <- parts u]
split :: [a] -> [([a], [a])]
split [] = [([], [])]
split (c:cs) = ([], c : cs) : [(c : s1, s2) | (s1, s2) <- split cs]
parts :: [a] -> [[[a]]]
parts [] = [[]]
parts [c] = [[[c]]]
parts (c:cs) = concat [[(c : p) : ps, [c] : p : ps] | p:ps <- parts cs]</code>
This says that we return true if the string is empty and the expression is empty; we return true if the string is a single char and the expression is a symbol of one character. split
returns every possible combination of the string, split at each point along its length; parts
goes futher and produces every possible combination of the string; an eight-letter word has 2⁷ (128) possible combinations:
<code>split "abc" == [("","abc"),("a","bc"),("ab","c"),("abc","")]
parts "abc" == [["abc"],["a","bc"],["ab","c"],["a","b","c"]]</code>
This is hideously inefficient. It has to be done repeatedly for every sequence and subsequence of the string and its expression; this function is exponential O(n²) where n
is the length of the string! But it does work. You can build a regular expression out of this and it does recognize strings reliably:
<code>> let nocs = Rep ( Alt ( Sym 'a' ) ( Sym 'b' ) )
> let onec = Seq nocs (Sym 'c' )
> let evencs = Seq ( Rep ( Seq onec onec ) ) nocs
> accept evencs "acc"
True</code>
In part two, things get really cool.