Introduction

I've been trying to grok Matt Might's Parsing With Derivatives for a while, and one of the better, clearer explanations for it that I found is Yehonathan Sharvit's Clojure version. Since Clojure isn't a language I know very well, if at all, although I do know some Scheme, I decided to try and re-implement it in Python.

The tangled edition of this document (i.e. the source code in runnable form) is available on my GitHub as a gist.

Literate Program

A note: this article was written with the Literate Programming toolkit Noweb. Where you see something that looks like this, it's a placeholder for code described elsewhere in the document. Placeholders with an equal sign at the end of them indicate the place where that code is defined. The link (U->) indicates that the code you're seeing is used later in the document, and (<-U) indicates it was used earlier but is being defined here.

What is a parser

A parser is something that recognizes languages. A language is a collection of zero or more symbols arranged in a regular way, where "regular" means "arranged in a definite pattern." The simplest parser, the kind illustrated here, simply says that a set of symbols corresponds to a language. All programming languages are regular languages, in that a parser says that your source code corresponds to the language.

A language is a collection of strings, formed from an alphabet. The simplest language is Empty, it has no language. Then we have the null language, which consists of the Empty String. (Yeah, that can be confusing. Bear with me.) A language of one letter, say "a", will only recognize a string of one letter, "a".

You can have languages like char("a"), char("b"), char("c"), and so forth. To string them all together, you concatenate them, like so:

cat(char("a"), char("b"), char("c"))

This will recognize the string "abc", and no other.

Parser derivatives

The hard part for me was understand what Matt Might meant by "the derivative of a regular expression." It was originally defined as "after the recognition of a single token, the set of all possible regular expressions that could follow." Which is one of those phrases that intimidate the neophyte developer, like when dealing with continuation passing you read that "the continuation represents the entire rest of the program", and you're left wondering how the heck you're even supposed to know what that is, much less construct it.

But let's start with an example.

In a regular expression, we might write "(abc)*" to represent that we want to match zero or more representations of the string "abc". So it would match the empty string, or "abc" or "abcabcabc", but would fail on "abd" or "abcabcabe", because those don't match. How would we represent this? Let's say there's a function that represents "repetition of zero or more", and this function returns a function that matches things.

abc = cat(char("a"), char("b"), char("c"))
abcstar = rep(cat)

In the parsing lingo, those are both languages: one is a repetition of another. Let's say we call abcstar("abcabc"). The first pass-through, it matches the "a". What's left? Well, we can't continue with the repetition; we're no longer matching it's internal language, the concatenation. In parsing with derivatives, the concatenation will return a derivative of itself, a function that will match "bc".

But repetition can't work with that, because "bc" isn't part of repetitions, er, repertoire. We need to take the derivative of abc, and concatenate it with the existing repetition. So after one iteration, our function looks like:

cat(derivative_of(abc), abcstar)

Which, in turn, is:

cat(cat(char("b"), char("c")), abcstar)

See how this is the derivative of our initial language with respect to having matched the first character? That's all that parsing derivatives is about.

The trick is to generate these derivatives automatically. Let's start with the basics of our language: an empty language.

<a href="#NWD2T5fU2-5" name="NW2T5fU2-3YIKzx-1"><dfn><empty language>=</dfn></a> (<a href="#NWD2T5fU2-I">U-></a>)
class Empty(Derives):
    def __init__(self, *args, **kwargs): pass
    def __str__(self): return "(Empty)"

A language of the empty string.

<a href="#NWD2T5fU2-6" name="NW2T5fU2-fg2wi-1"><dfn><empty string>=</dfn></a> (<a href="#NWD2T5fU2-I">U-></a>)
class Eps(Derives):
    def __init__(self, *args, **kwargs): pass
    def __str__(self): return "(EPS)"

What are the basic atoms (in this case, characters or, given that this is the 21st century, runes) of our language? Here's a class that describes a language of one character.

<a href="#NWD2T5fU2-7" name="NW2T5fU2-1cSjcF-1"><dfn><char>=</dfn></a> (<a href="#NWD2T5fU2-I">U-></a>)
class Char(Derives):
    def __init__(self, c, *args, **kwargs):
        self.c = c
    
    def __str__(self): return "(char '{}')".format(self.c)

As I said, languages are built out of smaller languages. It's not good enough for a language to have "a" and "b", we want it to have words like "function" and "return". For that, we have concatenation. Since I'm kinda a lispy guy, I do it as a linked list:

<a href="#NWD2T5fU2-8" name="NW2T5fU2-3L2wzM-1"><dfn><concatenation>=</dfn></a> (<a href="#NWD2T5fU2-I">U-></a>)
class Cat(Derives):
    def __init__(self, l, r, *args, **kwargs):
        if len(args) > 0:
            self.l = l
            self.r = Cat(r, args[0], *args[1:])
         else:
            self.l = l
            self.r = r

    def __str__(self):
        return "(cat {} {})".format(self.l, self.r)

You'll notice that we're not dealing with characters here, but with languages, in this case the sub-languages of Char or... whatever. We can concatenate other concatenations, or unions, or repetition.

Union is similar, but the rules for applying are different:

<a href="#NWD2T5fU2-9" name="NW2T5fU2-4coxDV-1"><dfn><union>=</dfn></a> (<a href="#NWD2T5fU2-I">U-></a>)
class Alt(Derives):
    def __init__(self, l, r, *args, **kwargs):
        if len(args) > 0:
            self.l = l
            self.r = Alt(r, args[0], *args[1:])
        else:
            self.l = l
            self.r = r
    
    def __str__(self):
        return "(alt {} {})".format(self.l, self.r)

And finally repetition:

<a href="#NWD2T5fU2-A" name="NW2T5fU2-29VUxa-1"><dfn><repetition>=</dfn></a> (<a href="#NWD2T5fU2-I">U-></a>)
class Rep(Derives):
    def __init__(self, r, *args, **kwargs):
        self.r = r

    def __str__(self):
        return "(rep {})".format(self.r)

One of the things we have to be able to figure out is if the string is exhausted; that is, having reached the end of the language, is the rest of the string to be analyzed the empty string? Also, note that we need to be able to return the empty string for repetition in the "zero examples left" case. We also need to know it because there are two alternatives for analyzing concatenation depending on which kind of language we're using. The term for this is nullability:

<a href="#NWD2T5fU2-B" name="NW2T5fU2-4ZySbG-1"><dfn><nullability>=</dfn></a> (<a href="#NWD2T5fU2-I">U-></a>)
def nullable(l):
    if isinstance(l, Empty) or isinstance(l, Char): return False
    if isinstance(l, Eps) or isinstance(l, Rep): return True
    if isinstance(l, Alt): return nullable(l.l) or nullable(l.r)
    if isinstance(l, Cat): return nullable(l.l) and nullable(l.r)
    raise "Not a language."

You may have noticed that all my classes derive from a class called Derive. Derive actually does all the work of traversing the regular expression and the string at the same time, building derivative regular expressions as it goes.

To make it easy to use our regular expression engine, I'm going to make all regular languages callable, like so:

<a href="#NWD2T5fU2-C" name="NW2T5fU2-3fx6cH-1"><dfn><derive>=</dfn></a> (<a href="#NWD2T5fU2-I">U-></a>)
class Derives(object):
    def __call__(self, w, l = None):
        if l == None:
            l = self

        if (w == ""):
            return nullable(l)

        return self.__call__(w[1:], self.derive(w[0], l))

    <a href="#NWD2T5fU2-D" name="NW2T5fU2-3fx6cH-1-u1"><derive definition></a>

And now we move onto derivation. The really tricky part of the derivative is in the issues like union, concatenation, and repetition. Let's show the basic atoms first:

<a href="#NWD2T5fU2-D" name="NW2T5fU2-42RTPp-1"><dfn><derive definition>=</dfn></a> (<a href="#NWD2T5fU2-C"><-U</a>)
    def derive(self, c, o = None):
        if o == None:
            o = self
        if isinstance(o, Empty): return Empty()
        if isinstance(o, Eps): return Empty()
        if isinstance(o, Char):
            if (o.c == c):
                return Eps()
            return Empty()
        <a href="#NWD2T5fU2-E" name="NW2T5fU2-42RTPp-1-u1"><derive repetition></a>
        <a href="#NWD2T5fU2-F" name="NW2T5fU2-42RTPp-1-u2"><derive union></a>
        <a href="#NWD2T5fU2-G" name="NW2T5fU2-42RTPp-1-u3"><derive concatenation></a>

Here, we're using Eps to indicate that this substring has matched, otherwise we're returning Empty, which means we've run out without a match.

For the tricky part, let's start with repetition, since we've covered that. I've already described the derivation for you: A concatenation of the derivative of the inner language ("what's left to match from one repetition") with the repetition of the inner language (which can be zero or more, so it's okay if there's only one instance to match). Here it is in all its glory. That inner 'r' is from the class defined above.

<a href="#NWD2T5fU2-E" name="NW2T5fU2-2VFVke-1"><dfn><derive repetition>=</dfn></a> (<a href="#NWD2T5fU2-D"><-U</a>)
if isinstance(o, Rep):
    return Cat(o.r.derive(c), o)

Union is easy too: it's just derivative of its components with respect to the next character:

<a href="#NWD2T5fU2-F" name="NW2T5fU2-3cZZgc-1"><dfn><derive union>=</dfn></a> (<a href="#NWD2T5fU2-D"><-U</a>)
if isinstance(o, Alt):
    return Alt(o.l.derive(c), o.r.derive(c))

This one took me while to understand. If we're dealing with a concatenation and the next element of the concatenation can be nullified, we have to deal the fact that both it and the next language can be true: it's a hidden union. Otherwise, we can deal with just the derivative of the current language and concatenate it with the rest of the concatenation's language. This is one of those places where it's painfully obvious how we're constantly building and rebuilding the derivative regular expression as the string gets consumed.

<a href="#NWD2T5fU2-G" name="NW2T5fU2-1eKGNl-1"><dfn><derive concatenation>=</dfn></a> (<a href="#NWD2T5fU2-D"><-U</a>)
if isinstance(o, Cat):
    left_derivative = Cat(o.l.derive(c), o.r)
    if nullable(o.l):
        return Alt(left_derivative, o.r.derive(c))
    return left_derivative

This "constantly building and rebuilding" is, in fact, where the first optimization can happen: we can memoize the produced derivatives and put them into a lookup, so the next time we encounter this particular language, we don't have to build the regular expression derivative, we just re-use the one we had last time.

To test, I'm going to follow from Sharvit's example, create a parser for floating point numbers, and then test it against our use cases.

<a href="#NWD2T5fU2-H" name="NW2T5fU2-2LwY0b-1"><dfn><testing>=</dfn></a> (<a href="#NWD2T5fU2-I">U-></a>)
digit = Alt(*[Char(i) for i in "0123456789"])
floater = Cat(Alt(Eps(), Alt(Char("+"), Char("-"))),
              Rep(digit),
              Char("."),
              digit,
              Rep(digit))

for i in [("-2.0", True), ("1", False), ("", False), ("+12.12", True), ("1.0", True)]:
    print i[1], floater(i[0])

It occurs to me that this is how you bootstrap a regular expression parser. You don't have a parser for the parser yet, so you create one using the syntax above that describes your language for your regular expression, and then you can bootstrap upward to a full-on regular expression handler.

Assembling the file for Noweb looks like:

<a href="#NWD2T5fU2-I" name="NW2T5fU2-d8QWE-1"><dfn><derivatives.py>=</dfn></a>
<a href="#NWD2T5fU2-C" name="NW2T5fU2-d8QWE-1-u1"><derive></a>

<a href="#NWD2T5fU2-5" name="NW2T5fU2-d8QWE-1-u2"><empty language></a>

<a href="#NWD2T5fU2-6" name="NW2T5fU2-d8QWE-1-u3"><empty string></a>

<a href="#NWD2T5fU2-7" name="NW2T5fU2-d8QWE-1-u4"><char></a>

<a href="#NWD2T5fU2-8" name="NW2T5fU2-d8QWE-1-u5"><concatenation></a>

<a href="#NWD2T5fU2-9" name="NW2T5fU2-d8QWE-1-u6"><union></a>

<a href="#NWD2T5fU2-A" name="NW2T5fU2-d8QWE-1-u7"><repetition></a>

<a href="#NWD2T5fU2-B" name="NW2T5fU2-d8QWE-1-u8"><nullability></a>

<a href="#NWD2T5fU2-H" name="NW2T5fU2-d8QWE-1-u9"><testing></a>






  * _<char>_: D1, U2
  * _<concatenation>_: D1, U2
  * _<derivatives.py>_: D1
  * _<derive>_: D1, U2
  * _<derive concatenation>_: D1
  * _<derive definition>_: U1, D2
  * _<derive repetition>_: D1
  * _<derive union>_: D1
  * _<empty language>_: D1, U2
  * _<empty string>_: D1, U2
  * _<nullability>_: D1, U2
  * _<repetition>_: D1, U2
  * _<testing>_: D1, U2
  * _<union>_: D1, U2