For the past two weeks I've been working on an implemented of regular expressions, only writing an alternative implementation to the classic Thompson engine, the simplest one, and often the fastest if you don't need advanced features like, oh, Unicode range matching or backtracking. Instead of the Kleene/Thompson engine, I've been using Brzozowski's Algorithm, which has some interesting properties and promises some interesting capabilities.

Right now, the engine implements the most naive possible version, one that's known to be inefficient in time and space. It's also the most fundamental Thompson implementation, with exactly the six operators in his 1968 matching engine. It has no 'plus', no 'any' operator. On the other hand, it is completely templatized, so it can handle bytes, chars, or more complex symbols. It only recognizes truly regular languages, and has none of the power of modern engines, but part of the point of this exercise is to see how close I can get in expressiveness, power, speed, and size, and then to provide wrappers that could parse Clojure spec files, or implement the Rosie syntax.

The problem

But I had a problem. Part of this problem is an implementation detail, but part of it is just Rust being Rust. To recognize the letters "ABC" I would have to write:

<code>let mut pattern = Recognizer::<char>::new();
let a =  pattern.tok('A');
let b =  pattern.tok('B');
let ab = pattern.cat(a, b);
let c =  pattern.tok('C');
let _ =  pattern.cat(ab, c);</code>

What I really wanted was something simpler. I wanted to be able to write:

let mut pattern = re(cat('A', 'B', 'C'))

And since I didn't know how to write Rust macros, my first thought was that I'd write it with them. But to give you a taste of what I ended up with, let's just say that the pattern AB(CC|DDDD)E*F ultimately looks like this:

<code>let mut pattern = re!{cat{tok{'A'},tok{ 'B' },
                      alt{cat{tok{'C'},tok{'C'}},
                          cat{tok{'D'},tok{'D'},tok{'D'},tok{'D'}}},
                      rep{tok{'E'}},tok{'F'}}};</code>

I learned a few days later that the RE2 engine, which I will be using as a learning example and as a source of some knowledge, also has an alternative representation that looks somewhat similar:

/a.b.c/ => cat{lit{a}dot{}lit{b}dot{}lit{c}}

It uses 'lit' where I used 'tok', and I haven't implemented 'any', and they have 'dot', but the outcome is remarkably the same.

Initialization

There are two different things that my macro has to two: it has to create a new Recognizer, and it has to add elements to that Recognizer.

The Recognizer uses a memory arena to represent the graph of the regular expression, a graph which is built on-the-fly as the string is processed. This implementation can't have "nodes" as loose objects "somewhere" in RAM because Rust really doesn't like that: adding a new node that "points to" previous nodes happens through handles (integers) rather than pointers, a phenomenon you can see in my first example above.

Rust macros are hygenic and textual. A Rust macro has a pattern (sometimes containing captures) and a result: if your input to the rust macro matches the pattern, the invocation of the macro will be replaced with the text of the result, and some elements inside the result will be replaced with elements of the capture.

There are currently 10 different things that can be captured; my macros only use four of them: ty for types, expr for expressions, ident for identifiers, and tt for token trees.

When a macro is read by the compiler, it is broken up into a token tree. The content of what you pass to the macro is also broken up into a token tree.

Let's start with the most simple example. Let's create the recognizer.

I called my macro 're!', because of course I did. I also had to define the type that the Recognizer recognized.

<code>macro_rules! re {
    ($sty:ty; $($cmds:tt)+) => {
        {
            let mut pt = Recognizer::<$sty>::new();
            {
                let _ = re!(...);
            }
            pt
        }
    };
};</code>

That pattern there is one of the most simple: two items separated by a semicolon. The first item must be recognizable by the compiler as a type; the second item must be a stream of tokens (some of which can be further streams of tokens). A "token" in this case is any object Rust can reasonably identify is separate from other objects: tok{'a'} is four objects, as we'll see later. tt objects are the most useful, but they're not terminal objects– Rust doesn't know what to do with them. They're useful only for recursive processing; eventually, they must be reanalyzed to determine if they're types, identifiers, expressions, or other valid Rust code.

And man, I really wanted a default type. Let me try to add that rule.

<code>    ($($cmds:tt)+) => { re!{ char; $($cmds)* } };</code>

This is the first recursive macro, in that if we just get a blank list of tokens, we're going to call re! again, only this time with the pattern ty:$(tt)+, which means a type followed by a collection of tokens. That matches the first pattern, as so a new recognizer is born.

Note that although the pattern reads $($cmds:tt)+, which captures a collection of tokens, the replacement pattern is $($cmds)*.

Note also that our collections end in a plus, which means "one or more." What if the user only wants a blank Recognizer, one that recognizes nothing? Here's where things get interesting. We can do this:

<code>    () => { re!{ char; } };
    ( $sty:ty ; ) => { Recognizer::<$sty>::new() };</code>

The order's a little backwards; first it says that if the expression is empty, call it with only the type; if the type exists but the tree is empty, build an empty Recognizer.

These rules are extremely precise; they cannot be confused for anything else. So they have to go at the very top of the macro_rules! call; the rules that are most precise must come before rules that have more general matching semantics, or the more general semantics will match and the macro generator will attempt to use the corresponding result... which will leave you with sadness.

Okay, so now we've handled all the possible invocations of the Recognizer. What about the actual expression?

Parsing the regex

We already know that macros can be called recursively. We also want the parsing to be unambiguous. And here's where we enter into an interesting problem: Token trees and expressions can easily be mistaken for one another. The phrase tok { 'a' } could be a token stream of two tokens, the second of which is itself a token stream; it could also be a token stream of four tokens; it could also be an expression (although a broken one, as the Rust parser would attempt to treat it as a struct, and the interior of the struct does not meet Rust's specification).

In this case we want it to be a token stream of four tokens: tok { 'a' and }. Only one of those is an actual expression. We also want this to be unambiguously called, and for that we use an internal identifier. In macros, an internal identifier begins with an @ symbol. Like so:

<code>(@process $pt:ident, tok { $e:expr }) => {
    {
        $pt.tok($e)
    }
};</code>

This says that we want a token stream of six tokens: the four we identified earlier, along with the literal token @process and this thing $pt:ident.

Go back and look at the original macro rule definition, the first one that created a Recognizer. Let's replace the elipsis ('...') with what I really want there:

<code>    ($sty:ty; $($cmds:tt)+) => {
        {
            let mut pt = Recognizer::<$sty>::new();
            {
                let _ = re!(@process pt, $($cmds)*);
            }
            pt
        }
    };</code>

We create a new scope, and a new variable within that scope, pt. Then, within yet another new scope, I start the recursive parser by calling re!(@process pt, $($cmds)*); the @process unambiguously points us to the internal handling, and the pt is recognized as an identifier that will be kept consistent throughout the construction process.

Next, the cat macro, for concatenation, which conveniently has the exact same structure as the alt macro. Those need a left and a right arm, as they're nodes and not leaves of the graph. We'll start with the same three tokens: @process, $pt, and... what? Alt? Cat? There's no "alternative" operator in Rust macros, so I'll use a regular identifier.

<code>    (@process $pt:ident, $op:ident { $lop:ident { $($lex:tt)+ }, $rop:ident { $($rex:tt)+ } }) => {
        {
            let l = re!(@process $pt, $lop { $($lex)* });
            let r = re!(@process $pt, $rop { $($rex)* });
            $pt.$op(l, r)
        }
    };</code>

A couple of things to note here: First, note that I have to re-create the structure of the op { tokens+ }, because sometimes they're not tokens! For the tok operator, that's expected to be an expression. Also note that the final operation can be $pt.$op(l, r), because the substitution here is purely textual.

The left and right arms will be replaced with scoped expressions like this one (or the one for tok).

There's one thing left that I want: cat{tok{'a'},tok{'b'},tok{'c'}}. For that, I have to do two things: I have to capture more than just two token trees, and I have to use the cons-style extension for cat (and alt, as you can have more than two alternate possibilities), preserving the existing operator:

<code>    (@process $pt:ident, $op:ident { $lop:ident { $($lex:tt)+ }, $rop:ident { $($rex:tt)+ }, $($xex:tt)+ }) => {
        {
            let l = re!(@process $pt, $lop { $($lex)* });
            let r = re!(@process $pt, $op { $rop { $($rex)* }, $($xex)* });
            $pt.$op(l, r)
        }
    };</code>

In this case, I add the rest of the expression to the end of the @process parse line, and then for the right-hand expression I then re-create the rest of the expression whole, minus the one operation I've already handled. This is classic Lisp-like recursive programming, and it's the only way to do this with Rust macros: For a list of problems, I take one argument off the list, process it, then re-call the function with a smaller list of problems, until the list can be handled by the two-argument instance.

Just a (geeky) (but important) note: this works here because the cat and alt operators are associative, that is, cat(A, cat(B, C)) is exactly the same as cat(cat(A, B), C).

Conclusion

The entirety of the macro is available at Github.

Rust macros take some time to understand; quite they're different from the rest of Rust, and they have some Lisp-like qualities in their internal, recursive form for which a general education in Scheme or Lisp might come in very useful. It helped a lot to understand the nature of the problem, too, in that I knew that concatenative operations were associative.

But they are quite useful, and I've made my life much easier for writing test cases, once I've written tests that satisfy me that the macros are doing the right thing.

Consequences

At the moment, BARRE "Languages" and the "Recognizer" are inappropriately mixed up. What I have are the six operations of classic regular expressions, and the Language and Recognizer are mixed together. Languages are supposed to be composable, and Rust's own regex engine not only can compose them, but has a special root version of alt in the RegexSet handler that can keep track of large sets of regular expression, giving users the ability to know which of several complex expressions may have been triggered by a string.

If cat, alt, etc. were written as either bare languages or simple operations, it should be possible to perform all the compositional operations on them. This may eliminate the need for the macros entirely. Time will tell. In the meantime, this was a great learning opportunity.