Regular expressions
READING A PAPER: DERIVATIVES OF REGULAR EXPRESSIONS RE-EXAMINED
Reading papers in computer science is itself an important skill, one you should master. I've detailed some of my problems before in my series on A Play On Regular Expressions, and now that I have a little free time on my hands I'm going to dive into some of the issues that I want to cover in the future.
My Regular Expressions series talked mostly about two different papers: Fischer, Huch, and Wilke's A Play on Regular Expressions, and Matt Might's Parsing with Derivatives. The first isn't about Brzozowski's derivatives at all; it's about identifying a common mathematical principle by which the boolean yes/no Finite State Automata of the classic Kleene Algebra can be transformed into a Finite State Transducer by abstracting "up one level" what the automaton does. The second describes a long-neglected algorithm for solving the Kleene Algebra that promises some interesting characteristics. My contribution was to show that these papers are compatible, that Might's combinators are actually semirings in disguise, and that Brzozowski's algorithm can be expressed in a systems language with strict memory requirements.
But that's not where the story starts. The story starts with two other papers, one of which I'll be reviewing today. The first is Joshua Goodman's 1999 paper, "Semiring Parsing," which I'll be addressing later.
Today's paper is
Regular Expression Derivatives Re-Examined
... by Scott Owens, John Reppy, and Aaron Turon.
This paper is what kicked off the whole Brzozowski revival in the first place. There had been a couple of interesting experiments, mostly in the XML space, but nobody had seriously contemplated implementing Brzozowski recognizers until this paper came along. It's a fairly short paper, about 18 pages long, and it has some gems that I think get missed along the way.
The Kleene Algebra is the basis of regular expressions, and the starting point for the discipline of "stringology," the process of recognizing a sequence of data. Stringology underlies the abstraction known as parsing, but it's more interesting than that, as we'll see.
The paper starts out by describing the Kleene Algebra:
The Kleene Algebra consists of six operators. Given an alphabet of symbols Σ and a string (a linear sequence made up of those symbols), you can do the following:
- R(∅) = false
- R(ε) = true if the string is empty, false otherwise
- R(c) = true if the next symbol is c, false otherwise
- R(re1 re2) = R(re1) followed by R(re2)
- R(re1 | re2) = R(re1) or R(re2)
- R(re*) = true if the string consists of zero or more matches of re
The paper then adds two extras:
- R(re1 & re2) = R(re1) and R(re2) are true simultaneously
- R(!re) = true if re is false, and vice versa
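To keep the rest of this discussion concrete, here's a minimal sketch, entirely my own and not from the paper, of what this extended regular expression syntax looks like as a data type in Python (the class names are my invention):

```python
from dataclasses import dataclass

# A tiny AST for extended regular expressions. The names are mine;
# the paper uses the usual mathematical notation.

@dataclass(frozen=True)
class Empty:            # ∅: matches nothing
    pass

@dataclass(frozen=True)
class Epsilon:          # ε: matches only the empty string
    pass

@dataclass(frozen=True)
class Symbol:           # c: matches exactly one symbol
    char: str

@dataclass(frozen=True)
class Seq:              # re1 re2: re1 followed by re2
    left: "Regex"
    right: "Regex"

@dataclass(frozen=True)
class Alt:              # re1 | re2: re1 or re2
    left: "Regex"
    right: "Regex"

@dataclass(frozen=True)
class Star:             # re*: zero or more repetitions of re
    inner: "Regex"

@dataclass(frozen=True)
class And:              # re1 & re2: both must match (the paper's first extension)
    left: "Regex"
    right: "Regex"

@dataclass(frozen=True)
class Not:              # !re: anything re does not match (the second extension)
    inner: "Regex"

Regex = Empty | Epsilon | Symbol | Seq | Alt | Star | And | Not
```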
The authors claim that these additions are "conservative," and have been proven to be valid under Boolean algebra... but that may be problematic for the Semiring implementation, as semirings only support the Kleene Algebra. (It's also important to note that there are two "Kleene Algebras," and one of them is a superset of Boolean algebra, but as Wikipedia points out, it is a different algebra altogether!).
Regular expressions are said to recognize "a language." By this they mean a collection of strings. And just as Alonzo Church's lambda notation and Turing's tape machine can describe computation, McCulloch's Deterministic Finite Automata and Kleene's notation can capture recognition. My concern about whether principled "Parsing with Semirings" is possible with extended regular expressions is still suspended, since Wikipedia's entry on Deterministic Finite Automata states that the languages they recognize are still closed under intersection (AND) and negation (NOT).
In Section 3, the authors introduce Brzozowski's Algorithm. The basics of Brzozowski's Derivatives of Regular Expressions can be explained simply: the derivative of a regular expression with respect to a string fragment u is a new regular expression that recognizes exactly those strings v such that u·v belongs to the language recognized by the original. Basically, after successfully recognizing a symbol, a new regular expression is "derived" that recognizes the language of "the rest of the string."
The derivatives for a regular expression are defined this way:
- Dc(∅) = ∅
- Dc(ε) = ∅
- Dc(c′) = ε if c′ = c, ∅ otherwise
- Dc(re1 re2) = δ(re1) Dc(re2) | Dc(re1) re2
- Dc(re1 | re2) = Dc(re1) | Dc(re2)
- Dc(re*) = Dc(re) re*
That δ symbol is the nullability operator; it returns ε (true) if the expression it wraps can match the empty string, and ∅ (false) if it cannot. Nullability is a huge issue with Brzozowski's algorithm, as we'll see.
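Here's a rough sketch of how nullability and the derivative rules above translate into code, using the toy AST from the earlier block. This is my own illustration, not the authors' implementation; the And and Not cases follow the extended rules the paper proves in Section 3.1.

```python
def nullable(r: Regex) -> bool:
    """The δ operator as a boolean: can r match the empty string?"""
    match r:
        case Empty():    return False
        case Epsilon():  return True
        case Symbol(_):  return False
        case Seq(a, b):  return nullable(a) and nullable(b)
        case Alt(a, b):  return nullable(a) or nullable(b)
        case Star(_):    return True
        case And(a, b):  return nullable(a) and nullable(b)
        case Not(a):     return not nullable(a)

def deriv(r: Regex, c: str) -> Regex:
    """The Brzozowski derivative of r with respect to the symbol c."""
    match r:
        case Empty() | Epsilon():
            return Empty()
        case Symbol(ch):
            return Epsilon() if ch == c else Empty()
        case Seq(a, b):
            first = Seq(deriv(a, c), b)
            # If a can match the empty string, c may also begin a match of b.
            return Alt(deriv(b, c), first) if nullable(a) else first
        case Alt(a, b):
            return Alt(deriv(a, c), deriv(b, c))
        case Star(a):
            return Seq(deriv(a, c), Star(a))
        case And(a, b):
            return And(deriv(a, c), deriv(b, c))
        case Not(a):
            return Not(deriv(a, c))
```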
Section 3.1 provides a proof that the derivative of a regular expression is itself a regular expression, and goes further to include the extended operators of the Boolean algebra.
Section 3.2 discusses using Brzozowski's formula to calculate regular expression derivatives. They assert that Brzozowski included the negation and intersection operators in his original 1964 paper (which I haven't been able to get ahold of, so far), and I'll take them at their word on that.
Section 3.3 describes the construction of DFAs (Deterministic Finite Automata) using Brzozowski's formula. I haven't done this yet, so this is the part of the paper that most interests me. As they put it, for each symbol of the string, they are left with a residual string and a residual regular expression.
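That "residual regular expression" view also gives the simplest possible matcher, before any DFA is built: derive once per symbol, then check nullability at the end. A small sketch of my own, using the functions above:

```python
def matches(r: Regex, s: str) -> bool:
    """Recognize s by taking one derivative per symbol, then checking nullability."""
    for c in s:
        r = deriv(r, c)        # the residual regex for the residual string
    return nullable(r)

# e.g. (a|b)*c
pattern = Seq(Star(Alt(Symbol("a"), Symbol("b"))), Symbol("c"))
assert matches(pattern, "ababc")
assert not matches(pattern, "abab")
```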
The authors talk about an important idea: equivalence. Two regular expressions are equivalent if they're constructed of different expressions but still contain (that is, recognize) the same language. The simplest example is the OR operator expressed in both possible ways, a|b and b|a, but that operator is defined as commutative, so they're axiomatically equivalent.
The equivalence issue comes up when talking about equivalence classes, which are essential for making regular expressions over large alphabets practical. When generating a DFA from a regular expression, we don't want to have to generate a separate test for every letter in the alphabet. That's actually how things were done in the 1970s, when every drop of RAM was precious but the ASCII table had only about 95 printable characters. Nowadays, the Unicode definition has room for 1,114,112 code points, and you can't have a table of a million "not recognized" entries for every point in the language to be recognized. So a "symbol class" is defined that includes ranges ("every letter in Klingon"), and ranges that fail ("every symbol in Unicode except the λ") can also be supported through equivalence.
Once it's proven that the derivatives of equivalent regular expressions are themselves equivalent, the derivation of a DFA generator is somewhat straightforward. They also claim (but do not show) that a regular expression has only a finite number of derivatives, and that the size of that set is the product of the number of expressions and the number of equivalence classes per expression.
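In code, that generator is the usual worklist loop: each distinct derivative becomes a state, and each symbol's derivative names the target state. Here is a deliberately naive sketch of my own, using structural equality of the dataclasses as the notion of "same state" and ignoring the equivalence-class machinery; note that without the similarity rules discussed below, this loop isn't even guaranteed to terminate.

```python
def build_dfa(start: Regex, alphabet: str):
    """Naive DFA construction: states are derivatives, modulo structural equality."""
    states = {start: 0}         # regex -> state id
    transitions = {}            # (state id, symbol) -> state id
    accepting = set()
    worklist = [start]
    while worklist:
        r = worklist.pop()
        if nullable(r):
            accepting.add(states[r])
        for c in alphabet:
            d = deriv(r, c)
            if d not in states:
                states[d] = len(states)
                worklist.append(d)
            transitions[(states[r], c)] = states[d]
    return states, transitions, accepting
```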
The algorithm they present is naive, but that leads us to section 4:
Section 4: Practical DFA Generation.
And now the fun part. What surprised me most is how often people have discovered that the generation of regular expression derivatives results in a lot of useless states. Both alternation and sequence generate derivatives that have left and right operations that must be considered simultaneously at the next symbol, but often one of those computes to... nothing. And we know what they are, much of the time. Using smart constructors during derivation allows us to elide lots of unnecessary work.
The authors show a list of "similarity" rules, common identities like ∅|r ≃ r, which means that if a derivative results in one side of an or operation being null, only the other side has to be considered. Brzozowski showed that only three such rules were necessary to ensure DFA generation terminates, but the authors include nineteen, and say that adding them generates nearly minimal DFAs.
They also say that they sort the operands of commutative and associative operations into lexical order, eliminate redundancies, and reuse functional pieces, creating a lookup table of regular expressions that basically becomes the classic state table built by tools like Lex.
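A couple of those rules expressed as smart constructors, to give a flavor of what this looks like in code. This is my own sketch, not the authors' implementation, which carries the full set of nineteen rules:

```python
def alt(a: Regex, b: Regex) -> Regex:
    """Alternation with a few similarity rules applied during construction."""
    if isinstance(a, Empty): return b          # ∅ | r ≃ r
    if isinstance(b, Empty): return a          # r | ∅ ≃ r
    if a == b:               return a          # r | r ≃ r
    # A canonical ordering makes a|b and b|a structurally identical.
    return Alt(*sorted((a, b), key=repr))

def seq(a: Regex, b: Regex) -> Regex:
    """Sequencing with the obvious absorptions."""
    if isinstance(a, Empty) or isinstance(b, Empty):
        return Empty()                         # ∅ r ≃ r ∅ ≃ ∅
    if isinstance(a, Epsilon): return b        # ε r ≃ r
    if isinstance(b, Epsilon): return a        # r ε ≃ r
    return Seq(a, b)
```

Swapping these in for the bare Alt and Seq constructors inside deriv keeps the derivatives small; with the full rule set (at minimum associativity, commutativity, and idempotence of alternation), the naive DFA loop above is guaranteed to terminate.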
And finally they revisit the character set issue, and show how, as long as your alphabet has an agreed-upon lexical order, it's possible to build derivatives of character sets. They show that derivation follows naturally:
- L(S) = S (the language of a character set is... the character set!)
- ν(S) = ∅ (like individual characters, a character set's nullability is false)
- Da(S) = ε if a ∈ S, ∅ otherwise
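In an implementation, a character set is just one more constructor, and its derivative follows the rule above. A tiny sketch, with a CharSet class of my own invention added to the earlier AST:

```python
@dataclass(frozen=True)
class CharSet:                  # S: matches any single symbol in the set
    chars: frozenset[str]

# nullable(CharSet(...)) is False, and the derivative is the Da(S) rule above:
def deriv_charset(s: CharSet, a: str) -> Regex:
    return Epsilon() if a in s.chars else Empty()
```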
On page 10, there's an assertion that bothered me for a day. While talking about how an equivalence class is determined by the derivative product, they write: "For example, the derivative classes for a+b·a+c are {a,c}, {b}, and Σ\{a,b,c}." Under the covers, regular expressions are usually defined using linked lists, and are treated atomically. To my eyes, this looks like a left-to-right regex. It's not. First, the · operator is given precedence, and second, the entire operation, all the expressions, is a single regex. Treated this way, what you really have is the regular expression, written in PCRE rather than computer-science metanotation, as (a|ba|c). The derivative of that with respect to 'a' or 'c' is ε, with respect to 'b' it's 'a', and with respect to everything else in Σ it's ∅. So, yeah, the example works, but man did I have to sleep on that for a day or so.
They then prove, with simple algebraic reasoning, that if a ~r b and a ~s b (that is, a is equivalent to b with respect to both r and s), then:
- a ~(r·s) b (a is equivalent to b in r sequences with s)
- a ~(r+s) b (a is equivalent to b in r alternated with s)
- a ~(r&s) b (a is equivalent to b in r simultaneously with s)
- a ~r∗ b (a is equivalent to b in r-star)
- a ~¬r b (a is equivalent to b in the negation of r)
... leading to the possibility of calculating complex regex equivalences. So now, to calculate the classes, which correspond to the outgoing transitions that must be considered from any DFA state, they define the following:
- C(ε) = {Σ} (the character class for an empty-string recognizer is the entire alphabet)
- C(S) = {S,Σ\S} (the two classes that must be considered for a symbol-set recognizer are the set of symbols that transition and the set that doesn't)
- C(r·s) = if r is nullable, C(r)∧C(s), otherwise C(r)
- C(r+s) = C(r)∧ C(s)
- C(r&s) = C(r) ∧ C(s)
- C(r∗) = C(r)
- C(¬r) = C(r)
There are no equivalence classes for ∅, obviously.
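The ∧ in those rules is pairwise intersection of two partitions of Σ. Here's a rough sketch of the computation over a small explicit alphabet, representing each class as a frozenset of characters; the real implementation works with symbolic ranges rather than enumerated sets, and SIGMA and the helper names are mine:

```python
SIGMA = frozenset("abcdefghijklmnopqrstuvwxyz")     # a toy alphabet, for illustration only

def cross(p: set, q: set) -> set:
    """Pairwise intersection of two partitions: the ∧ operator."""
    return {a & b for a in p for b in q if a & b}

def classes(r: Regex) -> set:
    """Approximate derivative classes C(r), following the rules above."""
    match r:
        case Epsilon():             return {SIGMA}
        case Empty():               return {SIGMA}      # my default, treating ∅ like an empty character set
        case Symbol(ch):            return {frozenset({ch}), SIGMA - {ch}}
        case CharSet(s):            return {s, SIGMA - s}
        case Seq(a, b):
            return cross(classes(a), classes(b)) if nullable(a) else classes(a)
        case Alt(a, b) | And(a, b): return cross(classes(a), classes(b))
        case Star(a) | Not(a):      return classes(a)
```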
The algorithm described above, they admit, overpartitions (that is, creates more outgoing states than are necessary) when processing a regex using the linked-list representation. Take the example that blocked me for so long:
a + b·a + c :: C((a+b·a)+c) = C(a+b·a) ∧ C(c)
This is how these things are usually represented, and note how the precedence operation is encapsulated. By separating the handling of the character 'c' into its own subexpression, this algorithm ends up not with {a,c}, {b}, {∅} as the outgoing classes, but with {a}, {c}, {b}, {∅}, with both 'a' and 'c' pointing to two different new regular expressions that are themselves mathematically equivalent. Ah, well; it's a space-vs-time tradeoff.
But now I get it.
Regular Vectors
A tuple of regular expressions (r1, r2, ... rn) is called a regular vector. Brzozowski proved that ∂(r1, r2, ... rn) is (∂r1, ∂r2, ... ∂rn), so it's fairly straightforward to create a collection of regular expressions and merge them all into a single DFA. Each state transition leads to another node in which some of the regexes drop out as failures, eventually leading to the recognition of a token, and a state is considered "accepting" if any expression in the vector can handle the empty string when the string being processed is exhausted. Which means that Brzozowski regular expressions can "easily" be turned into a recognition engine for something like Lex.
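In code, the vector derivative is just the pointwise derivative, and a state "accepts" for whichever labeled components are nullable, which is exactly the shape of a Lex-style rule table. A sketch of my own, with made-up token names:

```python
def deriv_vector(rs: tuple, c: str) -> tuple:
    """∂ of a regular vector is the vector of ∂s, taken pointwise."""
    return tuple(deriv(r, c) for r in rs)

def accepting_labels(rs: tuple, labels: tuple) -> list:
    """Which labeled rules accept in this state, i.e. which components are nullable."""
    return [label for r, label in zip(rs, labels) if nullable(r)]

# A two-rule "lexer", purely illustrative: the keyword "if" versus runs of zeros.
labels = ("IF", "ZEROS")
rules  = (Seq(Symbol("i"), Symbol("f")),
          Seq(Symbol("0"), Star(Symbol("0"))))

state = rules
for ch in "if":
    state = deriv_vector(state, ch)
print(accepting_labels(state, labels))    # ['IF']
```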
Every DFA is basically the answer to the question, "Given a symbol c from the alphabet Σ, what is the next state?" and in Brzozowski's formulation that question has a perfect, easily constructed answer. It might, on the other hand, be a quite massive DFA, but it would be fast, with a single lookup and a single transition per symbol in the string.
By labeling the vectors, we end up with a full-blown tokenizer. Which is, y'know, really freakin' cool.
Extended Regular Expressions
This is where I get a little weedy, because I want a principled algebraic way to express the products of regular expressions. Joshua Goodman's 1999 paper provides an algebraic abstraction that accurately describes the products of the standard Kleene algebra, but when you add the two new operations of intersection and negation, that abstraction no longer works. Goodman's paper describes two operations, but now we have four. (I managed to get my hands on a copy of Goodman's 1998 thesis, Parsing Inside-Out, but he never gets beyond the basic semiring operators.)
Performance
Interestingly enough, the paper says that the number of states produced by their engine is nearly minimal. I was surprised to learn that the lexer for HTML has only 49 states in a "perfect" system, and Lex produces one with only a few more, at 52. Scheme's own lexer produces a DFA of 324 states, but Brzozowski's method produces one of only 194, a saving of roughly 40%! That's kinda cool.
One of the best moments in the paper is where they raspberry another developer who's proud of their technique for using derivatives to create DFAs. The example engine created a lexer for parsing a limited log file vocabulary, and took 18 minutes to compile! The authors of this paper reimplemented it, and it took less than a second. Most of the savings came from using interval ranges and character classes, rather than scanning the entire alphabet on a per-expression basis. And this was using seven-bit ASCII, not the full Unicode book.
Conclusion
Here's what I really get out of this paper:
- The vision of using Brzozowski recognizers to replace Lex is already present in this paper, especially in the form of Regex Vectors, a feature I haven't seen elsewhere.
- The processing of regular expressions as unitary expressions rather than linked lists could go a long way toward reducing the size of DFAs even further, but that would require a compiler-writer's skill in determining code flow and post-processing equivalence when looking at things like the alternative and intersection operators.
- The cost of calculating nullability is not addressed in this paper, and it remains a serious bugaboo of the whole project until Might's second paper, which we'll get to.
- The production of values above and beyond recognition is not addressed in this paper. It's a lexer, not a parser.
- As impressive as Owens, Reppy, and Turon's original implementation may have been, it was still slower than it had to be, due to the cost of calculating nullability.
- This paper does not address recursive regular expressions at all.
And one big insight I got from this paper: the equivalence of Kleene's Algebra and McCulloch's DFAs isn't just interesting, it may be critical. A programming language does only a few things: calculates values, recurses, selects from a collection of alternatives, sequences through operations, returns a value, or fails. Those look like the same six operations as the Kleene Algebra, which raises the question:
If you have a program expressed in the Untyped Lambda Calculus, could you solve that program by calculating the terminal derivative of the expression?