13Dec

Barre, progress report: IT LIVES!

Posted by Elf Sternberg as Uncategorized

Barre: Progress Report, second week of December

The develop branch of Barre now fully implements all of the features of Adams, Might & Darais’s 2015 implementation, only it does so in Rust. The current price of running seems to be about 15μs (15 microseconds) per operation, although that’s a preliminary finding based on very short grammars.

This… is a frigging miracle. It works. It works, dammit. It can be done, and it seems to be fairly performant. It still has a problem with memory blowup, and at this point the only solution is collaborative garbage collection, but I have to do an analysis to make sure the nodes are mobile and not-fragile. I know my roots so compacting the arena is probably do-able, but I would have to determine the memory-vs-speed costs before implementing.

But it works. This is something of a minor breakthrough in theoretical computer science! I’m absolutely thrilled.

The biggest headache now is genericizing the reducers. Under the covers, a regular expression engine does one thing: it says "Yes" or "No" to whatever is being parsed. Reducers rely on whatever operation said "Yes" to store what it said "yes" to: the character, sequence, or list of alternatives that the parser engine found worth returning, and it does so in the form of a parse tree. These discoveries are stored in a special operation called an epsilon. Because alternatives can result in multiple matches, reducers deal in sets: each reducer takes a set of trees from its inner parsers and turns the result into something else.
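To make that concrete, here is a rough sketch of the shape a reducer might take in Rust. This is illustrative only: the Tree variants and the Reducer alias are stand-ins, not Barre's actual types.

use std::collections::HashSet;
use std::rc::Rc;

// A stand-in parse-tree type of the kind an epsilon might carry.
#[derive(Clone, PartialEq, Eq, Hash)]
enum Tree {
    Empty,                      // matched the empty string
    Leaf(char),                 // a single matched character
    Pair(Box<Tree>, Box<Tree>), // the result of a concatenation
}

// A reducer maps the set of trees produced by an inner parser to a new set.
type Reducer = Rc<dyn Fn(HashSet<Tree>) -> HashSet<Tree>>;

// Example: a reducer that collapses Pair(Empty, t) into t, anticipating the
// epsilon-collapsing optimization described just below.
fn collapse_left_epsilon() -> Reducer {
    Rc::new(|trees: HashSet<Tree>| -> HashSet<Tree> {
        trees
            .into_iter()
            .map(|t| match t {
                Tree::Pair(l, r) if *l == Tree::Empty => *r,
                other => other,
            })
            .collect()
    })
}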

This is incredibly useful. A string of epsilons that we would have to trudge through every time to get to the next operation can be collapsed into a single epsilon by a reducer on any epsilon ◦ epsilon operations.

Optimizations on the current engine rely on reducers. The rebalancing of sequence trees from left-to-right requires that the results be balanced back to right-to-left afterward, and a reducer does this. Multiple reductions in sequence can be rebalanced to a single, composed reduction on the results of the whole sequence. And when there are alternatives of alternatives, the results have to be permutated appropriately.

All well and good. And those internal reductions all work well and are well-applied inside the grammar. But there’s a complicating headache: external reducers.

The user of this library will want to add in-line reducers to modify the discovered parse tree into something else. A basic example is the regular expression standby, the capture group. A capture group is a reduction: in this case, taking the product of the regular expression inside the capture group and adding a label. And that’s where the problem lies: there’s no place in the existing datatype for a label. Or anything else. Decorating the returned tree with additional information is currently not possible.

The whole point of this exercise is to produce ASTs, IRs, and other internal representations of raw data of some kind. The problem is that client reducers will not have the same return type. Even worse, client reducers will have a headache of a time dealing with the internal representations of the graphs produced by Barre.

As an exercise, I’m going to try to implement Darais’s "Language of Languages," which provides a lot of convenience methods for defining other languages. It relies on external reductions, so maybe it’ll teach me a thing or two about how to get it right.

Current To-Do list looks like this:

  • Implement right-handed optimizations
  • Implement Language-of-Languages
  • Implement Capture Groups
  • Determine Reducer return type strategy (look at pom for tips)
  • Implement a parser-of-regex parser that supports the Rust Regex dialect
  • Benchmark Barre vs Rust Regex
  • Benchmark Barre vs RE2
  • Implement the negation, intersection, and interleave operators
  • Implement a parser-of-regex parser that supports a meaningful subset of the Perl Six dialect
  • Implement the cut operator, which enables POSIX-compliant matching
  • Swap out the set return type for a poset type, to enable prioritized matching
  • Implement a Derivatives of Parser Expression Grammars library
  • Implement ROSIE in Rust
  • Implement a PEGjs-like tool for generating advanced parsers in Rust
  • Genericize the PEG producer to work with other languages.
  • Kill YACC.

So, I’m mad about this.

The original paper about an implementation of Brzozowski’s Parsing Algorithm is Matt Might’s Parsing With Derivatives: A Functional Pearl. For three weeks I’ve been hung up on a problem with the implementation that wasn’t making any sense to me, and when I realized my mistake it came down to this pair of sentences:

  • The derivative of the empty parser is empty: Dc(∅) = ∅.
  • The derivative of the null parser is also empty: Dc(ε) = ∅.

The "empty parser" is called empty because the set of words it recognizes contains nothing: {}. It recognizes no words1 at all. Another name for the empty set: the NULL set, which is why the null symbol, ∅, is used there. The null parser is called that because it parses the empty string, the string written "" if you’re using the Latin alphabet, and in C, C++, and other early languages was often marked by an actual symbol in the language, NUL, and referred to the ASCII byte 0000.

If you’re confused, good: so was I. It’s especially not clear when the author writes, "The null string becomes the consume-nothing parser, while the empty set becomes the reject-everything parser." Note the change in terminology: here he means the null string parser and the empty parser, which elsewhere are called, respectively, the empty string parser and the null parser.
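One way to keep the two straight is to look at them in code. A minimal, purely illustrative sketch (not Barre’s actual node types):

// Illustrative only -- keeping the two "nothing" parsers straight:
enum Lang {
    Empty,   // ∅, the "empty parser": its language is {}; it rejects everything.
    Epsilon, // ε, the "null parser": its language is {""}; it accepts only the empty string.
}

// Both have the same derivative, which is part of what makes the naming confusing:
//   Dc(Empty)   = Empty   -- there was nothing to match, and there still isn't.
//   Dc(Epsilon) = Empty   -- ε matches only ""; consuming any character fails.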

Once I figured out my confusion, figuring out how data preservation actually works in Might’s formula was easier. But man, the people writing these math papers often do not mean for them to be read, do they?


¹ "Words" in this case does not mean what it means in English or other languages; it just means sequences of letters in the alphabet making up meaningful strings, and the separation between them is arbitrary and not necessarily whitespace or other punctuation.

28Nov

A bit of analysis while working on Barre.

Posted by Elf Sternberg as Uncategorized

I’ve been worrying a lot about the next stage of my project and realizing that I had seriously screwed up somewhere along the way. The next task on my list is to convert the recognizer into a parser.

I’m going to refer to a couple of different things here. A quick bibliography:

  • Herp, Rerp, and Derp all refer to the final variants produced by David Darais in 2011, after Matt Might’s original code.
  • Adams 2016 refers to the Derp-3 implementation in Racket, which as far as anyone knows has the greatest number of optimizations.
  • McGuire 2012 refers to Tommy McGuire’s Java implementation.

There are three different main issues that need to be covered by this code.

First is the generation of a new parser for each new character, in much the same way that parser combinators work. Ideally this new parser is lazy, in that it doesn’t actually recurse down a tree until an item is needed, and once an item has been generated it is memoized so that it doesn’t need to be generated a second time.

Second is the determination of nullability. Since these parsers are defined equationally, then given a language 𝑳, we calculate the fixed point of 𝑳 in the classic way: feed 𝑳 into the equations repeatedly until 𝑳 doesn’t change. The resulting nullability test is both non-recursive and efficient.
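A minimal sketch of that fixed-point loop, assuming a hypothetical arena of grammar nodes addressed by index; the Node variants and function names here are stand-ins, not the actual Barre structures:

// Illustrative fixed-point computation of nullability over an arena of nodes.
enum Node {
    Empty,
    Epsilon,
    Token(char),
    Alt(usize, usize), // handles (indices) into the arena
    Cat(usize, usize),
    Star(usize),
}

fn nullable_fixed_point(arena: &[Node]) -> Vec<bool> {
    let mut nullable = vec![false; arena.len()];
    loop {
        let mut changed = false;
        for (i, node) in arena.iter().enumerate() {
            let new = match node {
                Node::Empty | Node::Token(_) => false,
                Node::Epsilon | Node::Star(_) => true,
                Node::Alt(l, r) => nullable[*l] || nullable[*r],
                Node::Cat(l, r) => nullable[*l] && nullable[*r],
            };
            if new != nullable[i] {
                nullable[i] = new;
                changed = true;
            }
        }
        if !changed {
            return nullable; // nothing changed: we've reached the fixed point
        }
    }
}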

Third is the extraction of the result from the language. This is where I’m a bit stuck, and trying to reverse-engineer the result from the code examples above.

The problem I have at the moment is that there appear to be two different ways of calculating and managing the nullability and extraction issues. To understand the issue, you have to know that nullability is important not just when determining whether or not the parser is done, but whenever the parser needs to do concatenation, the single most common operation in regular expressions; for concatenation, the next parser-combinator generated is different depending on whether or not the first symbol in the concatenation machinery is nullable. In Matt Might’s original Parsing With Derivatives: A Functional Pearl, he has a simple nullability function for recognition, but uses a different nullability function, a sort of “automatic cancellation” feature, in concatenation. McGuire also goes this route.

Darais goes back to the original way of doing things: nullability is a simple true/false operation with no special behavior. Its effects on other equations are handled in those equations as special constructors, which is very useful when the constructors make decisions not about themselves, but about the results of their derivatives, a two-level analysis for performance reasons.

Adams goes even further; his code is deeply optimized with some very powerful features, but the biggest problem I have is that he does productions piecewise in order to reduce the length of the analysis chain the further into the expression you go. This kind of optimization is highly challenging in a strongly typed language like Rust, which is why I’ve been having such a hard time with it.

Adams does this with “Reductions,” which Might introduced in his original paper in the context of discussing parser combinators, using them interleaved with the parsing process, which again, is a challenge in a typed language.

This week, I’m going to focus on very small incremental features. I’m just going to implement Literals and Alternatives, and see if I can’t get logical extractions out of them. If I can, then I’ll try and get Concatenation working.

My fear is that, when that time comes, I’ll have to implement Reductions as well. Since I’ve been using this project as a way to learn Rust, I’ve not yet delved into some deeper parts of it, like, for example, function pointers, which is how Reductions are stored. This should be fun.

19Nov

Making Things and Going Slowly

Posted by Elf Sternberg as Uncategorized


For the record, version 1.0 is supposed to support POSIX-style regular expressions with character classes, ranges, and limited backtracking, providing a drop-in API compatible with the Rust regex library without the PCRE context-sensitive expressions such as ‘(x?)’ and ”. I maintain that there’s something wrong with a regular expression library that can’t use itself to parse out regular expressions, which is why "self-hosting" is on the post-1.0 list.

I started the Barre project ten weeks ago. I expected it to take about four to six weeks, and yet here I am, ten weeks in, and I still don’t have a full implementation of the core, much less any of the extensions.

I’m disappointed by this slow progress, but not surprised. Half of it has been learning Rust. I mean, I thought I knew Rust when I finished all of the exercises in the Programming Rust textbook, but that was only half the story. I’d only done exercises and built off the work in the book. I hadn’t written a program in Rust on my own, "editing a blank page." Rust does not support graph structures comfortably; the options for doing so are limited and verbose, and my choice to use a memory arena means that I had a different set of difficulties with convincing Rust’s correctness operators to give me access to nodes as I needed them.

The other half has been that, no really, implementing laziness in a language that doesn’t support it natively is really tricky. Tommy McGuire’s implementation is blunt and obvious, and while his code was excellent inspiration for what I’m trying to achieve, it’s not written in a way that makes it easily portable to Rust. On the other hand, Michael Adams’s implementation, while much more complete and tractable in Rust terms, is also written in a highly advanced Racket that re-uses fields, treats types as run-time signals, and does deep exploitation of Lisp’s list-like structures to perform post-hoc transformations. Trying to synthesize out of both of those, as well as Matt Might’s original implementation, has proven more challenging than I imagined.

As always, there’s too much to learn, and not a whole lot of time in which to do it all.

I’ve made some excellent progress on the Barre project. Along the way, I discovered that implementing a theoretical CS paper has some annoying difficulties, the worst of which is figuring out a translation table for the different notations used by different researchers in the given field. In describing my progress on Barre, and implementing Brzozowski’s Regular Expressions in Rust, today I’m going to talk about a different problem: orthogonality.

In the four core papers on Brzozowski’s 1965 algorithm (Might 2010, Might 2011, Adams 2015, and Sulzmann 2010), there is a progression, and a swerve, about the nature of implementing a parser via Brzozowski derivatives. Brzozowski himself was talking only about regular expression recognizers, which can only tell you if a string matches a given regular expression, a boolean result. Brzozowski realized that, equationally, one could put an instance of the parser inside itself, creating a recursive structure that would gracefully handle backreferences and even context-free grammars (CFGs). Unfortunately, no programming language in 1965 could handle Brzozowski’s formulation. In 2009, Scott Owens published a paper showing it could be done with modern languages.

Scott Owens showed that it could work as a recognizer. Matt Might made it work as a parser, and then contributed some significant performance-oriented optimizations based on the observation that the derivative of the derivative of some parsers would have redundancies that could be avoided. Adams made it go fast. Sulzmann contributed an important step toward making it deterministic and predictable.

But in trying to implement the results of all these papers together, I’ve discovered that there are several moving parts:

  1. Recognizing regular expressions
  2. Recognizing context-free grammars
    1. Memoizing derived parsers for re-use
    2. Avoiding derivations until needed
  3. Extracting parse trees from the recognizer
  4. Performance-related optimizations
    1. Node optimizations
    2. Control-flow optimizations
  5. Common extensions.
    1. Character classes and ranges
    2. Capture groups
    3. Named capture groups
    4. Ranges
    5. Back references
    6. Forward references
    7. Intersection and negation
  6. Preambles and contextual symbols
  7. Self-hosting: The regexp parser is written in itself.
  8. Alternative parse languages
    1. PCRE
    2. Perl6
    3. Peg
    4. Rosie

It’s easy, reading through the papers, to think that I might want to implement these in order, or that one is a pre-requisite for another. The first, "Recognizing regular expressions," is a pre-requisite for everything else.

Right now, I have a working implementation of (1). I think I have working implementations of (2), which require (2.1) and (2.2) to be complete. CFGs only work with memoization (so that we can find the recursive entry) and laziness (so that a recursive call doesn’t go into an infinite loop).

It was after I had the CFG recognizer "working" (it passes tests, but I’m not confident I have the algorithm perfectly correct) that I started working on the extractor– and hit a wall. The code is fairly complicated, and I made the mistake of implementing (4.2), because it made for an interesting challenge in Rust.

But I realized yesterday that implementing (3), extracting the data from the parser, was a perfectly compatible step with (1), the simplest of all the implementations. I didn’t have to worry about recursion, or the node’s structure types during the run.

The lesson here is to identify, down to the smallest possible unit, what an individual implementation step is, and do only that step. Implementing parse tree extraction on the simplest possible code is the only way for me to understand the implications of doing it with much more complex code, and having it at-hand means that the pattern for implementing it for future extensions such as character classes, intersection and negation, will already be understood, and not hacked ad-hoc into those node types at the same time I’m trying to figure them out at all.

Plus, being aware that those future node types are coming will encourage me to try and write the parse_tree_extract() function in a way that has the least impact on the highly diverse node types, and try to keep the zoo of node types to a minimum.

Theoretical computer science is fascinating. There are all sorts of papers out there about advanced topics which haven’t quite reached mainstream acceptance or awareness, and once in a while you find a fascinating topic and start chasing down the rabbit hole of implementations and papers trying to wrap your head around the ideas presented.

I’ve been doing this for the past two weeks with respect to something called Brzozowski Derivatives, an algorithm for parsing and running regular expressions. Without access to Janusz Brzozowski’s original 1964 paper, I was left reading Matt Might’s 2011 paper in which he discussed Brzozowski’s algorithm and how advances in modern software made it possible to implement a footnote in Brzozowski’s original paper: if it could be made recursive (that is, if a sub-expression could refer to itself), it could be built into a full-blown parser. Parsers often use regular expressions, but context-free grammars require a lot more specialized handling. Brzozowski unified the two, but only in math: software couldn’t handle it. Might figured out how to make it work.

But the performance was terrible. Might asked for assistance in improving the implementation, and a lot of papers were published. And I’ve read a lot of papers this week. Usually one a day. Might’s original paper was seven pages long, and it took me seven hours to read it, or approximately one page an hour. I visited the University of Washington library and downloaded or printed quite a few.

And then I moved on to the other papers. And I learned a lot about reading theoretical computer science papers, and here’s what I learned.

Scan the paper first: What kind of CS paper is it?

My goal here is to do something new and interesting: Implement a Brzozowski Expression Library in a systems language. To the best of my knowledge, this has never been done before; all of the implementations are in languages that have memory management, closures, continuations, and lots of nifty things that aren’t available in systems languages like C, C++, or Rust.

If you’re in this boat, here’s what you want to do: Read the Introduction (not the abstract), then scan the rest of the paper to determine: is this paper an implementation, an algorithm, or results? A “results” paper is almost always useless; the authors are bragging about something they achieved, but they have no interest in helping you understand how they achieved it. Even then, scan the paper and ask: “Is there a link to source code?”

Even if there isn’t, a judicious Google search will sometimes help you. Sometimes you can just type the author’s name into Github, Bitbucket, and GoogleCode, and see what you get.

Implementation: Is it worthwhile?

If the paper is an implementation paper, go down to the “Conclusions” section and see if their implementation offers anything worthwhile. Might’s original paper concluded that the implementation was simple (and indeed, it was so simple it took me only two days to write it in Rust), but maybe there could be improvements. Writing it in Rust gave me a grip on the scale of the non-recursive problem.

Sometimes, “worthwhileness” is hard to determine from a single reading. Scott Owens’s paper (see below) is entirely about creating static DFAs for regular expressions via derivatives and not about higher-order parsing, so at first blush it didn’t seem worthwhile; however, Owens also offered two extensions to Brzozowski’s theory that allow for regular expression predicates (the sequence of symbols to be matched must match more than one expression, what Perl 6 calls the “conjunctive operator”) without backtracking, which is an amazing result.

Implementation: Has there been any progress?

Your next step is to search for citations of the paper, to see if any other papers use it. CiteSeer is surprisingly useless for this task. I found that a much better solution is to search for the title with a modifier to find PDFs, such as "Parsing with Derivatives" inurl:pdf, after which a lot of papers jump out. For me, the most interesting papers included “POSIX Regular Expression Parsing with Derivatives” (Sulzmann and Lu), “Regular Expression Derivatives Reexamined” (Owens, et al.), and “On the Complexity and Performance of Parsing with Derivatives” (Adams, et al.).

While lots of papers are hidden behind various paywalls, often papers associated with conference releases, preprints, and pre-reviewed versions are readily available. When all else failed, I’ve emailed the lead author and asked for a copy, and so far they’ve always just mailed me a copy. (The authors of papers love to know someone is engaging with their work. Really.)

Algorithm: Every Author has their own notation. You will have to write your own translation guide.

Those three papers represent the core of what I’m studying. The Adams paper features a better algorithm for detecting early match failures, as well as an important insight into avoiding the infinite loops inherent in Brzozowski’s original description. The Sulzmann paper describes using some performance hacks from classical regular expressions that might be applicable to Brzozowski. And the Owens paper proves (in the mathematical sense) that Brzozowski’s algorithm supports Perl-style regular expression conjunctions natively without having to supply an internal boolean mini-parser as PCRE does.

Brzozowski’s algorithm has six basic sub-cases for matching: a letter, a concatenation (this then that), an alternative (this or that), repetition (the star operator), the empty string, and the empty parser. The most complex is concatenation, so let me show you the equation for that as presented in the algorithm papers. The concatenation operator is an open circle; the ‘∪’ operator is the set theoretical symbol for ‘union’ and represents alternation.

These equations all mean the same thing: “The derivative of two regular expressions in sequence with respect to a character c depends on whether or not the first expression recognizes the empty string; if it does, then the derivative is the derivative of the first regular expression with respect to c followed by the second regular expression or the derivative of the second expression with respect to c, and if it doesn’t then it’s just the first derivative of the first expression with respect to c followed by the second expression.”

Got that?

Good. Might uses Dc to represent “The derivative with respect to c,” and he uses c to mean “character.”

Might:
Dc(L1 ◦ L2) = (Dc(L1) ◦ L2) ∪ Dc(L2) if ε ∈ L1
Dc(L1 ◦ L2) = Dc(L1) ◦ L2 if ε not ∈ L1

Adams:
Dc(L1 ◦ L2) = (Dc (L1) ◦ L2) ∪ Dc (L2) if ε ∈ [[L1]]
Dc(L1 ◦ L2) = (Dc (L1) ◦ L2) otherwise

Owens:
∂a (r · s) = ∂a r · s + ν(r) · ∂a s

Sulzmann:
(r1r2)\l= (r1\l)r2 + r2\l if ε ∈ L(r1), (r1\l)r2 otherwise

Those are (supposedly) all the same. The Adams and Might equations are entirely the same, aside from Adams’ use of the L1 vs [[L1]] syntax. Understanding Owens’ formulation requires that you understand that the ν(r) expresses the same thing as the ε ∈ L1 syntax from Might, that is, ν is a function that determines whether or not ε is a member of the language r, and exploits the fact that ∅ anywhere in a sequence expression results in no matching ever.

And Sulzmann, who even cited Might’s paper… Sulzmann is just from another world. His equation means the same thing, but as you can see his notation is entirely, wildly different. The concatenation symbol is elided as “understood,” the “with respect to a character c” is “\l”…

Now, the thing here is that there are six basic operators in regular expression theory, and since every paper has this same table of six operations, it was possible to figure out, eventually, what Sulzmann meant by his notation. But it took work; I had to write down on a sheet of paper his equations and discoveries in order to make sense of them.

It takes time.

If you want to implement a result in theoretical computer science as a research project and, eventually, as a library, you’re going to have to read and re-read the papers until you’ve internalized the basic rules and gotten comfortable with at least one notation, and if you’re really comfortable with that notation you’re going to have to take a deep breath and re-write many of the other paper’s statements and conclusions in your notation so you have something you can readily absorb when you’re looking for refinements, extensions, and progress in the topic.

It’s also fun (for very geeky definitions of the word “fun”). I’ve been having a good time working this out.

I seem to be in a state of semantic confusion.

I went back and reviewed the Theory of Regular Operations, which is section 1.10 of Sipser’s Theory of Computation, and I seem to have come upon a bit of semantic confusion that I need to decide how to resolve, and more importantly, when to resolve.

According to Sipser, a regular language is any language (that is, a set of strings made up of symbols) for which a finite automaton can be constructed. For Brzozowski, a regular language can be recognized by processing its derivative, with the following rules:

  • Dc(∅) = ∅
  • Dc(ε) = ∅
  • Dc(c) = ε if c = c’, ∅ otherwise
  • Dc(re1 re2) = δ(re1) Dc(re2) | Dc(re1) re2
  • Dc(re1 | re2) = Dc(re1) | Dc(re2)
  • Dc(re*) = Dc(re) re*

Only one of those rules talks about the symbols, namely the third; all the rest are either end-state issues or what are termed in the business the regular operations.

The first two are results which describe the state of the machine at some point in the process. If the machine is ε at the end of the process then the string was recognized, and if it’s in any other state then the string was not recognized. The third is about the recognition of symbols.
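Those rules translate almost line-for-line into a recursive derivative function. The following is a naive, purely illustrative sketch (boxed recursion rather than an arena, no laziness or memoization), not the Barre implementation:

// Naive Brzozowski derivative over a boxed expression tree. Illustrative only.
#[derive(Clone)]
enum Re {
    Empty,                 // ∅
    Epsilon,               // ε
    Sym(char),             // c
    Cat(Box<Re>, Box<Re>), // re1 re2
    Alt(Box<Re>, Box<Re>), // re1 | re2
    Star(Box<Re>),         // re*
}

// δ: does this expression recognize the empty string?
fn nullable(re: &Re) -> bool {
    match re {
        Re::Empty | Re::Sym(_) => false,
        Re::Epsilon | Re::Star(_) => true,
        Re::Cat(a, b) => nullable(a) && nullable(b),
        Re::Alt(a, b) => nullable(a) || nullable(b),
    }
}

fn derive(re: &Re, c: char) -> Re {
    match re {
        Re::Empty => Re::Empty,   // Dc(∅) = ∅
        Re::Epsilon => Re::Empty, // Dc(ε) = ∅
        Re::Sym(s) => {
            // Dc(c) = ε if c = c', ∅ otherwise
            if *s == c { Re::Epsilon } else { Re::Empty }
        }
        // Dc(re1 re2) = Dc(re1) re2 | δ(re1) Dc(re2)
        Re::Cat(a, b) => {
            let left = Re::Cat(Box::new(derive(a, c)), b.clone());
            if nullable(a) {
                Re::Alt(Box::new(left), Box::new(derive(b, c)))
            } else {
                left
            }
        }
        // Dc(re1 | re2) = Dc(re1) | Dc(re2)
        Re::Alt(a, b) => Re::Alt(Box::new(derive(a, c)), Box::new(derive(b, c))),
        // Dc(re*) = Dc(re) re*
        Re::Star(r) => Re::Cat(Box::new(derive(r, c)), Box::new(Re::Star(r.clone()))),
    }
}

// Matching is then: derive once per character, and check nullability at the end.
fn matches(re: &Re, input: &str) -> bool {
    let mut cur = re.clone();
    for ch in input.chars() {
        cur = derive(&cur, ch);
    }
    nullable(&cur)
}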

A tree is a directed graph in which any two nodes are connected by exactly one path. The outcome of Brzozowski’s algorithm, the naive version, is a tree generated dynamically as the derivative results of the current symbol (provided it doesn’t cause termination) are placed onto the stack and then processed with the subsequent symbol. Characters, CharacterClasses (i.e. ranges, [A-Z] or [0-9], etc), and the Any symbol are leaf nodes for which a submatch operation (such as ‘Alternative’ or ‘Concatenation’) allows the process to proceed. The last three items in the list are the three main operations: ‘Alternative’, ‘Concatenate’, and ‘Star’.

Other operations, such as "At least," "At most," "Exactly n", and "Plus" are just variants on the above; how to optimize them isn’t something I’ve addressed yet, nor the "automatic derivative of a concatenation of symbols," which could greatly speed up some searches; an extended string match would result in either ε or ∅. (On the other hand, the automatic concatenation of symbols would also result in the need for a checkpointing operation, as now we’d be trading memory for a dubious speedup.)

There are some non-operations that also need to be considered, including capture groups and non-nested successive backtracking (nested and precessive backtracking I am not handling, maybe not ever). The input iterator needs a wrapper to introduce specialized symbols for WordBoundary, StartOfText, StartOfLine, and their complements.

More to the point, the Recognizer recognizes Strings belonging to Languages, but I don’t have a Language, I have a convenient enum of recognizer machine states. You can perform operations on the Languages, and that’s more what I want. Once you can cat, alt, and star a whole language, the algebra of my recognizer becomes much easier to use.

The real problem comes down to the use of a memory arena to depict the tree of nodes. Rust Leipzig’s Idiomatic Trees and Graphs shows that the graph has to allocate the nodes, his pointers are opaque handles, and even his library doesn’t supply a composition of graphs operation.

All this is to say that I’m a bit stymied at the moment. There are obvious steps I could take, but imagine how much easier it would be to write regular expressions if they were concatenatable, alternable, and kleene-able?

14Oct

Rust Macros, and how they’re used in Barre.

Posted by Elf Sternberg as Rust

For the past two weeks I’ve been working on an implementation of regular expressions, writing an alternative to the classic Thompson engine, the simplest one, and often the fastest if you don’t need advanced features like, oh, Unicode range matching or backtracking. Instead of the Kleene/Thompson engine, I’ve been using Brzozowski’s Algorithm, which has some interesting properties and promises some interesting capabilities.

Right now, the engine implements the most naive possible version, one that’s known to be inefficient in time and space. It’s also the most fundamental Thompson implementation, with exactly the six operators in his 1968 matching engine. It has no ‘plus’, no ‘any’ operator. On the other hand, it is completely templatized, so it can handle bytes, chars, or more complex symbols. It only recognizes truly regular languages, and has none of the power of modern engines, but part of the point of this exercise is to see how close I can get in expressiveness, power, speed, and size, and then to provide wrappers that could parse Clojure spec files, or implement the Rosie syntax.

The problem

But I had a problem. Part of this problem is an implementation detail, but part of it is just Rust being Rust. To recognize the letters "ABC" I would have to write:

let mut pattern = Recognizer::<char>::new();
let a =  pattern.tok('A');
let b =  pattern.tok('B');
let ab = pattern.cat(a, b);
let c =  pattern.tok('C');
let _ =  pattern.cat(ab, c);

What I really wanted was something simpler. I wanted to be able to write:

let mut pattern = re(cat('A', 'B', 'C'))

And since I didn’t know how to write Rust macros, my first thought was that I’d write it with them. But to give you a taste of what I ended up with, let’s just say that the pattern AB(CC|DDDD)E*F ultimately looks like this:

let mut pattern = re!{cat{tok{'A'},tok{ 'B' },
                      alt{cat{tok{'C'},tok{'C'}},
                          cat{tok{'D'},tok{'D'},tok{'D'},tok{'D'}}},
                      rep{tok{'E'}},tok{'F'}}};

I learned a few days later that the RE2 engine, which I will be using as a learning example and as a source of some knowledge, also has an alternative representation that looks somewhat similar:

/a.b.c/ => cat{lit{a}dot{}lit{b}dot{}lit{c}}

It uses ‘lit’ where I used ‘tok’, and I haven’t implemented ‘any’, and they have ‘dot’, but the outcome is remarkably the same.

Initialization

There are two different things that my macro has to do: it has to create a new Recognizer, and it has to add elements to that Recognizer.

The Recognizer uses a memory arena to represent the graph of the regular expression, a graph which is built on-the-fly as the string is processed. This implementation can’t have "nodes" as loose objects "somewhere" in RAM because Rust really doesn’t like that: adding a new node that "points to" previous nodes happens through handles (integers) rather than pointers, a phenomenon you can see in my first example above.
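A minimal sketch of that arena-and-handles idea (the names are illustrative, not Barre’s real structures): nodes are owned by a Vec, and "pointers" between them are just indices.

// Illustrative memory arena: nodes live in a Vec, edges are usize handles.
struct Arena {
    nodes: Vec<Node>,
}

enum Node {
    Tok(char),
    Cat(usize, usize), // handles of the left and right sub-parsers
    Alt(usize, usize),
}

impl Arena {
    fn new() -> Self {
        Arena { nodes: Vec::new() }
    }

    // Adding a node returns its handle. Later nodes refer back by handle, which
    // keeps the borrow checker happy: nothing holds a long-lived reference into
    // the Vec, so the Vec is free to grow.
    fn add(&mut self, node: Node) -> usize {
        self.nodes.push(node);
        self.nodes.len() - 1
    }
}

fn main() {
    // Roughly the shape of the "ABC" example from the top of this post.
    let mut arena = Arena::new();
    let a = arena.add(Node::Tok('A'));
    let b = arena.add(Node::Tok('B'));
    let ab = arena.add(Node::Cat(a, b));
    let c = arena.add(Node::Tok('C'));
    let _abc = arena.add(Node::Cat(ab, c));
}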

Rust macros are hygienic and textual. A Rust macro has a pattern (sometimes containing captures) and a result: if your input to the rust macro matches the pattern, the invocation of the macro will be replaced with the text of the result, and some elements inside the result will be replaced with elements of the capture.

There are currently 10 different things that can be captured; my macros only use four of them: ty for types, expr for expressions, ident for identifiers, and tt for token trees.

When a macro is read by the compiler, it is broken up into a token tree. The content of what you pass to the macro is also broken up into a token tree.

Let’s start with the most simple example. Let’s create the recognizer.

I called my macro ‘re!’, because of course I did. I also had to define the type that the Recognizer recognized.

macro_rules! re {
    ($sty:ty; $($cmds:tt)+) => {
        {
            let mut pt = Recognizer::<$sty>::new();
            {
                let _ = re!(...);
            }
            pt
        }
    };
};

That pattern there is one of the most simple: two items separated by a semicolon. The first item must be recognizable by the compiler as a type; the second item must be a stream of tokens (some of which can be further streams of tokens). A “token” in this case is any object Rust can reasonably identify is separate from other objects: tok{'a'} is four objects, as we’ll see later. tt objects are the most useful, but they’re not terminal objects– Rust doesn’t know what to do with them. They’re useful only for recursive processing; eventually, they must be reanalyzed to determine if they’re types, identifiers, expressions, or other valid Rust code.

And man, I really wanted a default type. Let me try to add that rule.

    ($($cmds:tt)+) => { re!{ char; $($cmds)* } };

This is the first recursive macro, in that if we just get a bare list of tokens, we’re going to call re! again, only this time with the pattern ty:$(tt)+, which means a type followed by a collection of tokens. That matches the first pattern, and so a new recognizer is born.

Note that although the pattern reads $($cmds:tt)+, which captures a collection of tokens, the replacement pattern is $($cmds)*.

Note also that our collections end in a plus, which means "one or more." What if the user only wants a blank Recognizer, one that recognizes nothing? Here’s where things get interesting. We can do this:

    () => { re!{ char; } };
    ( $sty:ty ; ) => { Recognizer::<$sty>::new() };

The order’s a little backwards; first it says that if the expression is empty, call it with only the type; if the type exists but the tree is empty, build an empty Recognizer.

These rules are extremely precise; they cannot be confused for anything else. So they have to go at the very top of the macro_rules! call; the rules that are most precise must come before rules that have more general matching semantics, or the more general semantics will match and the macro generator will attempt to use the corresponding result… which will leave you with sadness.

Okay, so now we’ve handled all the possible invocations of the Recognizer. What about the actual expression?

Parsing the regex

We already know that macros can be called recursively. We also want the parsing to be unambiguous. And here’s where we enter into an interesting problem: Token trees and expressions can easily be mistaken for one another. The phrase tok { 'a' } could be a token stream of two tokens, the second of which is itself a token stream; it could also be a token stream of four tokens; it could also be an expression (although a broken one, as the Rust parser would attempt to treat it as a struct, and the interior of the struct does not meet Rust’s specification).

In this case we want it to be a token stream of four tokens: tok, {, 'a', and }. Only one of those is an actual expression. We also want this to be unambiguously called, and for that we use an internal identifier. In macros, an internal identifier begins with an @ symbol. Like so:

(@process $pt:ident, tok { $e:expr }) => {
    {
        $pt.tok($e)
    }
};

This says that we want a token stream of six tokens: the four we identified earlier, along with the literal token @process and this thing $pt:ident.

Go back and look at the original macro rule definition, the first one that created a Recognizer. Let’s replace the ellipsis (‘…’) with what I really want there:

    ($sty:ty; $($cmds:tt)+) => {
        {
            let mut pt = Recognizer::<$sty>::new();
            {
                let _ = re!(@process pt, $($cmds)*);
            }
            pt
        }
    };

We create a new scope, and a new variable within that scope, pt. Then, within yet another new scope, I start the recursive parser by calling re!(@process pt, $($cmds)*); the @process unambiguously points us to the internal handling, and the pt is recognized as an identifier that will be kept consistent throughout the construction process.

Next, the cat macro, for concatenation, which conveniently has the exact same structure as the alt macro. Those need a left and a right arm, as they’re nodes and not leaves of the graph. We’ll start with the same three tokens: @process, $pt, and… what? Alt? Cat? There’s no "alternative" operator in Rust macros, so I’ll use a regular identifier.

    (@process $pt:ident, $op:ident { $lop:ident { $($lex:tt)+ }, $rop:ident { $($rex:tt)+ } }) => {
        {
            let l = re!(@process $pt, $lop { $($lex)* });
            let r = re!(@process $pt, $rop { $($rex)* });
            $pt.$op(l, r)
        }
    };

A couple of things to note here: First, note that I have to re-create the structure of the op { tokens+ }, because sometimes they’re not tokens! For the tok operator, that’s expected to be an expression. Also note that the final operation can be $pt.$op(l, r), because the substitution here is purely textual.

The left and right arms will be replaced with scoped expressions like this one (or the one for tok).

There’s one thing left that I want: cat{tok{'a'},tok{'b'},tok{'c'}}. For that, I have to do two things: I have to capture more than just two token trees, and I have to use the cons-style extension for cat (and alt, as you can have more than two alternate possibilities), preserving the existing operator:

    (@process $pt:ident, $op:ident { $lop:ident { $($lex:tt)+ }, $rop:ident { $($rex:tt)+ }, $($xex:tt)+ }) => {
        {
            let l = re!(@process $pt, $lop { $($lex)* });
            let r = re!(@process $pt, $op { $rop { $($rex)* }, $($xex)* });
            $pt.$op(l, r)
        }
    };

In this case, I add the rest of the expression to the end of the @process parse line, and then for the right-hand expression I then re-create the rest of the expression whole, minus the one operation I’ve already handled. This is classic Lisp-like recursive programming, and it’s the only way to do this with Rust macros: For a list of problems, I take one argument off the list, process it, then re-call the function with a smaller list of problems, until the list can be handled by the two-argument instance.

Just a (geeky) (but important) note: this works here because the cat and alt operators are associative, that is, cat(A, cat(B, C)) is exactly the same as cat(cat(A, B), C).

Conclusion

The entirety of the macro is available at Github.

Rust macros take some time to understand; they’re quite different from the rest of Rust, and they have some Lisp-like qualities in their internal, recursive form for which a general education in Scheme or Lisp might come in very useful. It helped a lot to understand the nature of the problem, too, in that I knew that concatenative operations were associative.

But they are quite useful, and I’ve made my life much easier for writing test cases, once I’ve written tests that satisfy me that the macros are doing the right thing.

Consequences

At the moment, BARRE "Languages" and the "Recognizer" are inappropriately mixed up. What I have are the six operations of classic regular expressions, and the Language and Recognizer are mixed together. Languages are supposed to be composable, and Rust’s own regex engine not only can compose them, but has a special root version of alt in the RegexSet handler that can keep track of large sets of regular expressions, giving users the ability to know which of several complex expressions may have been triggered by a string.

If cat, alt, etc. were written as either bare languages or simple operations, it should be possible to perform all the compositional operations on them. This may eliminate the need for the macros entirely. Time will tell. In the meantime, this was a great learning opportunity.

For the past two weeks, I’ve been working my way through Matt Might’s paper Yacc is Dead and implementing it in Rust. The result is BARRE: Brzozowski Antimirov Rusty Regular Expressions. So far as I’ve been able to determine, I’m the first person to try implementing it in a systems language, one without garbage collection.

Rust has made this pretty straightforward so far. My progress has been pretty good: I have the naive implementation built and running. The naive implementation sucks: it’s slow, it’s inefficient, and it has a tendency to grow out of control in space. The regex (A|B)*C when trying to match ‘ABABAAC’ grows to a deterministic finite automaton (DFA) of over 140 nodes!

My progress is going to be a bit stalled this week as I work through chapter 1 of Michael Sipser’s Theory of Computation, which teaches about DFAs and NFAs and how to convert back and forth between the two. There are several major improvements that can be made to the existing algorithm.

First, I need to figure out how to extend the engine to handle different repetition kinds. Right now it’s the pure Kleene star, "zero or more." Modern regex engines handle "one or more," ("Kleene plus"), "zero or one", and ranges ("exactly n", "at least n", "at most n").
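Most of those richer repetition forms can be desugared into the pure Kleene star plus the other primitives. A hedged sketch of the equivalences, using a stand-in Re type rather than Barre’s internals:

// Illustrative desugarings of common repetition operators into the primitives.
#[derive(Clone)]
enum Re {
    Epsilon,
    Sym(char),
    Cat(Box<Re>, Box<Re>),
    Alt(Box<Re>, Box<Re>),
    Star(Box<Re>),
}

// r+  ==  r r*
fn plus(r: Re) -> Re {
    Re::Cat(Box::new(r.clone()), Box::new(Re::Star(Box::new(r))))
}

// r?  ==  r | ε
fn optional(r: Re) -> Re {
    Re::Alt(Box::new(r), Box::new(Re::Epsilon))
}

// r{n}  ==  r r ... r  (n copies; ε when n == 0)
fn exactly(r: Re, n: usize) -> Re {
    (0..n).fold(Re::Epsilon, |acc, _| Re::Cat(Box::new(acc), Box::new(r.clone())))
}

// r{n,}  ==  r{n} r*
fn at_least(r: Re, n: usize) -> Re {
    Re::Cat(Box::new(exactly(r.clone(), n)), Box::new(Re::Star(Box::new(r))))
}

// r{0,m}  ==  (r?)^m
fn at_most(r: Re, m: usize) -> Re {
    (0..m).fold(Re::Epsilon, |acc, _| Re::Cat(Box::new(acc), Box::new(optional(r.clone()))))
}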

PCRE adds Boundaries (word boundaries, the "\b" operator), and Anchors (start of line, end of line, start of document, end of document), Classes (is-a-num, is-a-letter, range-of-letters). All of these are single-term operations, though, and should have the same semantics as a Token.

PCRE also adds Captures (groups that can be retrieved after the expression engine has run, showing what was actually matched), and Backreferences (literals that appear later in the regular expression that match strings built out of what was matched, effectively "splicing" the parse tree with a runtime-only concatenation at a given point in the matching process).

Python, of all things, adds a unique feature to the input iterator to replace indent/dedent changes with block-open and block-close symbols. This may be useful for whitespace-based languages, and obviates the need to play funny with the regular expression engine.

Brzozowski’s algorithm itself has four major improvements: laziness, which delays actually creating a derivative node until it’s truly needed; fixed points, which takes a derivative subgraph and re-derives it, over and over until it stops changing, at which point we have the final matcher; memoization, which intercepts the fixed-point processor to find duplication and returns a pointer to the start of the duplicate in the existing subgraph, interrupting any potential infinite recursions which may blow our stack; and compaction, which identifies duplications across subgraphs and prevents the in-memory representation from growing out of control.
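Of those four, memoization is the one whose shape is easiest to sketch. Below is a hypothetical memo table over arena handles (the names are stand-ins, not Barre’s API); the part that interrupts infinite recursion is recording the result’s handle before the derivative’s body is computed:

use std::collections::HashMap;

// Illustrative memo table for derivatives over an arena of nodes.
struct Memo {
    table: HashMap<(usize, char), usize>, // (node handle, character) -> derivative handle
}

impl Memo {
    fn new() -> Self {
        Memo { table: HashMap::new() }
    }

    // `reserve` allocates an empty placeholder node and returns its handle;
    // `fill` computes the derivative's body and patches the placeholder.
    // Recursive calls made inside `fill` find the handle already in the table,
    // so a recursive grammar ties back into itself instead of looping forever.
    fn derive<R, F>(&mut self, node: usize, c: char, reserve: R, fill: F) -> usize
    where
        R: FnOnce() -> usize,
        F: FnOnce(&mut Memo, usize),
    {
        if let Some(&cached) = self.table.get(&(node, c)) {
            return cached; // already derived (or being derived right now)
        }
        let handle = reserve();
        self.table.insert((node, c), handle);
        fill(self, handle);
        handle
    }
}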

As far as I can tell, Brzozowski regular expressions have a "start-up cost," that is, the first time through can be pretty expensive as the graph is filled out, but follow-on matches can be pretty quick, as only the deltas need to be filled out.

Antimirov Derivatives promise an easier-to-implement variant of Brzozowski’s algorithm. I haven’t yet dived into these, but they’re on the list.

My hope is that Antimirov’s Derivatives don’t interfere with Might’s final point: Recursive Regular Expressions which, if implemented in a system with captures, can parse context-free grammars whole.

A couple nice-to-haves: First, I need internal macros to build internal representations in a straightforward manner. Rust has a fairly powerful macro language, and leveraging it to make my life easy while negotiating with the borrow checker would be awesome. Second, the internal representation of a first-order Brzozowski’s Derivative is a simple array; being able to pre-compile that array to have stand-alone regexes at compile time would be wonderful.

And finally, it would be amazing if a parser for a Thompson, PCRE, or even PCRE-light syntax (such as supported by Rust’s own Regex engine), could be written in Barre. The parser for Rust’s Regex engine is written with, but not in, Regex; that is, while the Regex engine is used to recognize substrings written in Thompson/PCRE regex (you know, the classic abc(.*?)de strings), it can’t handle nested expressions, which requires the engine to recurse on itself, but theoretically, Barre can.

If all of this is true and possible, then I might well have a competitive alternative to both regex.rs and nom.rs, and my next steps toward learning how to write my own programming language will be clear, if not easy.

21Sep

Progress Report, Week 10.

Posted by Elf Sternberg as Uncategorized

I set out this week to do the following:

On the whole, that’s not… too bad. I have about four hours a day in which to do my studying; my studies get my morning, and my family seems to be occupying more of my afternoons than I had originally planned when I took this sabbatical. I also didn’t get anything at all done Thursday this week; instead, we all took the day off to go to the Washington State Fair and eat very badly. (So badly, in fact, that when I got home I took a 90 minute nap, woke up long enough to watch a movie with my wife, and then went right back to bed for a solid eight hours of sack time.)

I’ve been leaning on Scheme a lot this week, primarily because I’ve become obsessed with understanding this bit of code, which I hope will unlock secrets of the universe for me. I’m finally starting to understand the difference between let, let*, and letrec, but reading Scheme macros, especially pattern matching macros, is still a bit beyond me. But I hope to figure it out, and then build on top of it.

I realized, and perhaps this is obvious to others, as I was working through my primitive constructions of regular expressions, that by defining the “thing” you’re using as your core language as <T> (rather than char or byte), you end up with an ability to match on tokens rather than characters, and from there you end up with the ability to pattern match sequences much more complex than mere strings. One of the things that drives me crazy is how opaque the texts are about the leap from language generators to language lexers and parsers, gliding as quickly as possible over language recognizers. That needs to be made more explicit. Recognizers say that a sequence matches a pattern; parsers say that a sequence matches a pattern and derive a parallel (“Lifted?”) structure that reifies the sequence in a meaningful way.
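To make the <T> point concrete, here is a tiny, purely illustrative sketch: the same "recognize a fixed sequence" logic written generically, so that with T = char it matches characters and with T = a lexer’s token enum it matches token streams. The Tok enum is made up for the example.

// Illustrative: a sequence check generic over the symbol type T.
#[derive(Clone, PartialEq, Debug)]
enum Tok {
    If,
    LParen,
    Ident,
    RParen,
}

fn recognizes_prefix<T: PartialEq>(pattern: &[T], input: &[T]) -> bool {
    input.len() >= pattern.len() && pattern.iter().zip(input).all(|(p, i)| p == i)
}

fn main() {
    // Characters...
    assert!(recognizes_prefix(&['i', 'f'], &['i', 'f', '(']));
    // ...or tokens: exactly the same machinery.
    assert!(recognizes_prefix(
        &[Tok::If, Tok::LParen],
        &[Tok::If, Tok::LParen, Tok::Ident, Tok::RParen]
    ));
}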

Remember my wild ambition? Take a look at this chart and you’ll see where I’m going with this. I love Clojure because it’s a Lisp with a remarkably low rate of bug density among skilled developers… but it runs on the JVM. I love Python… but its tendency toward bugginess (and the death’s head semantic) drives me insane. I loathe Go’s syntax and semantics… but its runtime and libraries are first-rate.* I’d love to get to static where possible, dynamic where necessary. This is where I want to land with Blackwing: somewhere in the middle of this. I’m not sure that I can, but I’ve got a frackton of examples to work with.

I also attended the Fun(c) Meetup on Category Theory, and that was pretty interesting, mostly because I started to see the way real-world applications matter in categorical thinking. One of the biggest realizations to come out of the event was that often compositional thinking is an optimization issue: does it make sense to spend the time and energy composing the functions that you’ll perform on a lifted type, or to live with the code as-is?

Overall, I’m driven by two overwhelming notions. First, programmers need to develop good taste in developing programs. They need both clear naming schemes and clear layout. Code should tell a story. It should be easy to read and understand. Go achieves this by dumbing everything down to a third-grade vocabulary, but I don’t believe that’s necessary, and I don’t believe programmers should be treated this way. Go is the epitome of “We don’t want programmers with good taste, we want programmers that taste good.”

Second, developers need compassion. This is harder to teach than good taste. The old saw, “Always treat your source code as if the next person to see it is a psychopath who knows where you live,” has always bothered me as violent and frightful; I’d much rather we appreciate that the next person to see your code is going to need as much compassion as you can muster, and your code should reflect your wish to convey grace and kindness. Not everyone will appreciate it, but those who do are the ones who matter.


Next week

  • Three more chapters of Seasoned Schemer
  • Chapter 3.5 of SICP
  • Redraft the Mandelbrot and Buddhabrot code.
  • Colorize Buddhabrot.
    • Stretch: Consider alternative coloring schemes for Mandelbrot.
  • Read ahead in the Scheme book to understand dparse.rkt
