I’ve made some excellent progress on the Barre project. Along the way, I discovered that implementing a theoretical CS paper has some annoying difficulties, the worst of which is figuring out a translation table for the different notations used by different researchers in the field. In describing my progress on Barre, my implementation of Brzozowski’s Regular Expressions in Rust, today I’m going to talk about a different problem: orthogonality.

In the four core papers on Brzozowski’s 1964 algorithm (Might 2010, Might 2011, Adams 2015, and Sulzmann 2010), there is a progression, and a swerve, about the nature of implementing a parser via Brzozowski derivatives. Brzozowski himself was talking only about regular expression recognizers, which can only tell you if a string matches a given regular expression, a boolean result. Brzozowski realized that, equationally, one could put an instance of the parser inside itself, creating a recursive structure that would gracefully handle backreferences and even context-free grammars (CFGs). Unfortunately, no programming language of that era could handle Brzozowski’s formulation. In 2009, Scott Owens published a paper showing it could be done with modern languages.

Scott Owens showed that it could work as a recognizer. Matt Might made it work as a parser, and then contributed some significant performance-oriented optimizations based on the observation that the derivative of the derivative of some parsers would have redundancies that could be avoided. Adams made it go fast. Sulzmann contributed an important step toward making it deterministic and predictable.

But in trying to implement the results of all these papers together, I’ve discovered that there are several moving parts:

  1. Recognizing regular expressions
  2. Recognizing context-free grammars
    1. Memoizing derived parsers for re-use
    2. Avoiding derivations until needed
  3. Extracting parse trees from the recognizer
  4. Performance-related optimizations
    1. Node optimizations
    2. Control-flow optimizations
  5. Common extensions.
    1. Character classes and ranges
    2. Capture groups
    3. Named capture groups
    4. Ranges
    5. Back references
    6. Forward references
    7. Intersection and negation
  6. Preambles and contextual symbols
  7. Self-hosting: The regexp parser is written in itself.
  8. Alternative parse languages
    1. PCRE
    2. Perl6
    3. Peg
    4. Rosie

It’s easy, reading through the papers, to think that I might want to implement these in order, or that one is a prerequisite for another. In truth, only the first, "Recognizing regular expressions," is a prerequisite for everything else.

Right now, I have a working implementation of (1). I think I have working implementations of (2), which require (2.1) and (2.2) to be complete. CFGs only work with memoization (so that we can find the recursive entry) and laziness (so that a recursive call doesn’t go into an infinite loop).
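To make that combination concrete, here is one way the two can cooperate in an arena-and-handles setting. The types are entirely made up for the sketch, not Barre’s actual code: the trick is to memoize the derivative’s slot before computing its contents, so a recursive grammar that re-enters the same (node, character) pair gets the placeholder handle back instead of recursing forever, and the contents themselves are filled in only when actually needed.

use std::collections::HashMap;

type NodeId = usize;

// Toy arena: a node is either a placeholder (its derivative is still being
// computed) or a finished node; the real node kinds don't matter here.
enum Node {
    Placeholder,
    // ... real node kinds ...
}

struct Graph {
    nodes: Vec<Node>,
    memo: HashMap<(NodeId, char), NodeId>,
}

impl Graph {
    fn derive(&mut self, node: NodeId, c: char) -> NodeId {
        // Memoization: if we've already started (or finished) this
        // derivative, hand back its handle instead of recursing.
        if let Some(&slot) = self.memo.get(&(node, c)) {
            return slot;
        }
        // Reserve and memoize the result slot *before* computing it, so a
        // recursive grammar that re-enters (node, c) finds the placeholder
        // and the recursion bottoms out.
        let slot = self.nodes.len();
        self.nodes.push(Node::Placeholder);
        self.memo.insert((node, c), slot);
        // ... lazily compute the real derivative and overwrite
        // self.nodes[slot] when it's actually needed ...
        slot
    }
}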

It was after I had the CFG recognizer "working" (it passes tests, but I’m not confident I have the algorithm perfectly correct) that I started working on the extractor, and hit a wall. The code is fairly complicated, and I made the mistake of implementing (4.2), because it made for an interesting challenge in Rust.

But I realized yesterday that implementing (3), extracting the data from the parser, was a perfectly compatible step with (1), the simplest of all the implementations. I didn’t have to worry about recursion, or the node’s structure types during the run.

The lesson here is to identify, down to the smallest possible unit, what an individual implementation step is, and do only that step. Implementing parse tree extraction on the simplest possible code is the only way for me to understand the implications of doing it with much more complex code, and having it at hand means that the pattern for implementing it for future extensions, such as character classes, intersection, and negation, will already be understood, rather than hacked ad hoc into those node types at the same time I’m trying to figure them out at all.

Plus, being aware that those future node types are coming will encourage me to try and write the parse_tree_extract() function in a way that has the least impact on the highly diverse node types, and try to keep the zoo of node types to a minimum.

Theoretical computer science is fascinating. There are all sorts of papers out there about advanced topics which haven’t quite reached mainstream acceptance or awareness, and once in a while you find a fascinating topic and start chasing down the rabbit hole of implementations and papers trying to wrap your head around the ideas presented.

I’ve been doing this for the past two weeks with respect to something called Brzozowski Derivatives, an algorithm for parsing and running regular expressions. Without access to Janusz Brzozowski’s original 1964 paper, I was left reading Matt Might’s 2011 paper, in which he discussed Brzozowski’s algorithm and how advances in modern software made it possible to implement a footnote in Brzozowski’s original paper: if it could be made recursive (that is, if a sub-expression could refer to itself), it could be built into a full-blown parser. Parsers often use regular expressions, but context-free grammars require a lot more specialized handling. Brzozowski unified the two, but only in math: software couldn’t handle it. Might figured out how to make it work.

But the performance was terrible. Might asked for assistance in improving the implementation, and a lot of papers were published. And I’ve read a lot of papers this week. Usually one a day. Might’s original paper was seven pages long, and it took me seven hours to read it, or approximately one page an hour. I visited the University of Washington library and downloaded or printed quite a few.

And then I moved on to the other papers, and along the way I learned a lot about reading theoretical computer science papers. Here’s what I learned.

Scan the paper first: What kind of CS paper is it?

My goal here is to do something new and interesting: Implement a Brzozowski Expression Library in a systems language. To the best of my knowledge, this has never been done before; all of the implementations are in languages that have memory management, closures, continuations, and lots of nifty things that aren’t available in systems languages like C, C++, or Rust.

If you’re in this boat, here’s what you want to do: Read the Introduction (not the abstract), then scan the rest of the paper to determine: is this paper an implementation, an algorithm, or results? A “results” paper is almost always useless; the authors are bragging about something they achieved, but they have no interest in helping you understand how they achieved it. Even then, scan the paper and ask: “Is there a link to source code?”

Even if there isn’t, a judicious Google search will sometimes help you. Sometimes you can just type the author’s name into Github, Bitbucket, and GoogleCode and see what you get.

Implementation: Is it worthwhile?

If the paper is an implementation paper, go down to the “Conclusions” section and see if their implementation offers anything worthwhile. Might’s original paper concluded that the implementation was simple (and indeed, it was so simple it took me only two days to write it in Rust), but maybe there could be improvements. Writing it in Rust gave me a grip on the scale of the non-recursive problem.

Sometimes, “worthwhileness” is hard to determine from a single reading. Scott Owens’ paper (see below) is entirely about creating static DFAs for regular expressions via derivatives and not about higher-order parsing, so at first blush it didn’t seem worthwhile; however, Owens also offered two extensions to Brzozowski’s theory that allow for regular expression predicates (the sequence of symbols to be matched must match more than one expression, what Perl 6 calls the “conjunctive operator”) without backtracking, which is an amazing result.

Implementation: Has there been any progress?

Your next step is to search for citations of the paper, to see if any other papers use it. CiteSeer is surprisingly useless for this task. I found a much better solution is to search for the title with a modifier to find PDFs, such as "Parsing with Derivatives" inurl:pdf, after which a lot of papers jump out. For me, the most interesting papers included “POSIX Regular Expression Parsing with Derivatives” (Sulzmann and Lu), “Regular Expression Derivatives Reexamined” (Owens, et al.), and “On the Complexity and Performance of Parsing with Derivatives” (Adams, et. al.).

While lots of papers are hidden behind various paywalls, often papers associated with conference releases, preprints, and pre-reviewed versions are readily available. When all else fails, I’ve emailed the lead author and asked for a copy, and so far they’ve always just sent me one. (The authors of papers love to know someone is engaging with their work. Really.)

Algorithm: Every Author has their own notation. You will have to write your own translation guide.

Those three papers represent the core of what I’m studying. The Adams paper features a better algorithm for detecting early match failures, as well as an important insight into avoiding the infinite loops inherent in Brzozowski’s original description. The Sulzmann paper describes using some performance hacks from classical regular expressions that might be applicable to Brzozowski. And the Owens paper proves (in the mathematical sense) that Brzozowski’s algorithm supports Perl-style regular expression conjunctions natively without having to supply an internal boolean mini-parser as PCRE does.

Brzozowski’s algorithm has six basic sub-cases for matching: a letter, a concatenation (this then that), an alternative (this or that), repetition (the star operator), the empty string, and the empty parser. The most complex is concatenation, so let me show you the equation for that as presented in the algorithm papers. The concatenation operator is an open circle; the ‘∪’ operator is the set theoretical symbol for ‘union’ and represents alternation.

These equations all mean the same thing: “The derivative of two regular expressions in sequence with respect to a character c depends on whether or not the first expression recognizes the empty string; if it does, then the derivative is the derivative of the first regular expression with respect to c followed by the second regular expression or the derivative of the second expression with respect to c, and if it doesn’t then it’s just the first derivative of the first expression with respect to c followed by the second expression.”

Got that?

Good. Might uses Dc to represent “the derivative with respect to c,” and he uses c to mean “character.”

Might:
Dc(L1 ◦ L2) = (Dc(L1) ◦ L2) ∪ Dc(L2) if ε ∈ L1
Dc(L1 ◦ L2) = Dc(L1) ◦ L2 if ε not ∈ L1

Adams:
Dc(L1 ◦ L2) = (Dc(L1) ◦ L2) ∪ Dc(L2) if ε ∈ [[L1]]
Dc(L1 ◦ L2) = Dc(L1) ◦ L2 otherwise

Owens:
∂a (r · s) = ∂a r · s + ν(r) · ∂a s

Sulzmann:
(r1r2)\l= (r1\l)r2 + r2\l if ε ∈ L(r1), (r1\l)r2 otherwise

Those are (supposedly) all the same. The Adams and Might equations are entirely the same, aside from Adams’ use of the L1 vs [[L1]] syntax. Understanding Owens’ formulation requires that you understand that the ν(r) expresses the same thing as the ε ∈ L1 syntax from Might, that is, ν is a function that determines whether or not ε is a member of the language r, and exploits the fact that ∅ anywhere in a sequence expression results in no matching ever.

And Sulzmann, who even cited Might’s paper… Sulzmann is just from another world. His equation means the same thing, but as you can see his notation is entirely, wildly different. The concatenation symbol is elided as “understood,” and “the derivative with respect to a character c” becomes “\l”…

Now, the thing here is that there are six basic operators in regular expression theory, and since every paper has this same table of six operations, it was possible to figure out, eventually, what Sulzmann meant by his notation. But it took work; I had to write his equations and discoveries down on a sheet of paper in order to make sense of them.
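Once you’ve settled on one notation, though, the rule translates into code almost mechanically. Here’s a minimal sketch in Rust of the nullability test and the derivative, with the concatenation case written exactly as in the equations above; the enum is a toy for illustration, not Barre’s arena-based node types:

// The six basic cases: empty parser, empty string, a symbol,
// concatenation, alternation, and repetition.
#[derive(Clone)]
enum Re {
    Empty,                  // ∅
    Epsilon,                // ε
    Char(char),             // a single symbol
    Cat(Box<Re>, Box<Re>),  // this then that
    Alt(Box<Re>, Box<Re>),  // this or that
    Star(Box<Re>),          // zero or more
}

// ν(r): does r accept the empty string?
fn nullable(r: &Re) -> bool {
    match r {
        Re::Empty | Re::Char(_) => false,
        Re::Epsilon | Re::Star(_) => true,
        Re::Cat(l, r) => nullable(l) && nullable(r),
        Re::Alt(l, r) => nullable(l) || nullable(r),
    }
}

// Dc(L1 ◦ L2), exactly as in the equations above: the derivative of the
// first followed by the second, unioned with the derivative of the second
// only when the first is nullable.
fn derive_cat(l1: &Re, l2: &Re, c: char) -> Re {
    let left = Re::Cat(Box::new(derive(l1, c)), Box::new(l2.clone()));
    if nullable(l1) {
        Re::Alt(Box::new(left), Box::new(derive(l2, c)))
    } else {
        left
    }
}

fn derive(r: &Re, c: char) -> Re {
    match r {
        Re::Empty | Re::Epsilon => Re::Empty,
        Re::Char(a) => if *a == c { Re::Epsilon } else { Re::Empty },
        Re::Cat(l, r) => derive_cat(l, r, c),
        Re::Alt(l, r) => Re::Alt(Box::new(derive(l, c)), Box::new(derive(r, c))),
        Re::Star(inner) => Re::Cat(Box::new(derive(inner, c)), Box::new(Re::Star(inner.clone()))),
    }
}

derive_cat is the Might/Adams equation line for line; the Owens and Sulzmann formulations compile down to exactly the same match.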

It takes time.

If you want to implement a result in theoretical computer science as a research project and, eventually, as a library, you’re going to have to read and re-read the papers until you’ve internalized the basic rules and gotten comfortable with at least one notation. Once you’re really comfortable with that notation, you’re going to have to take a deep breath and re-write many of the other papers’ statements and conclusions in your notation, so you have something you can readily absorb when you’re looking for refinements, extensions, and progress in the topic.

It’s also fun (for very geeky definitions of the word “fun”). I’ve been having a good time working this out.

I seem to be in a state of semantic confusion.

I went back and reviewed the Theory of Regular Operations, which is section 1.10 of Sipser’s Theory of Computation, and I seem to have come upon a bit of semantic confusion that I need to decide how to resolve, and more importantly, when to resolve.

According to Sipser, a regular language is any language (that is, a set of strings made up of symbols) for which a finite automaton can be constructed. For Brzozowski, a regular language can be recognized by processing its derivative, with the following rules:

  • Dc(∅) = ∅
  • Dc(ε) = ∅
  • Dc(c’) = ε if c = c’, ∅ otherwise
  • Dc(re1 re2) = δ(re1) Dc(re2) | Dc(re1) re2
  • Dc(re1 | re2) = Dc(re1) | Dc(re2)
  • Dc(re*) = Dc(re) re*

Only one of those rules talks about the symbols, namely the third; all the rest are either end-state issues or what are termed in the business the regular operations.

The first two are results which describe the state of the machine at some point in the process. If the expression left at the end of the process can match ε (that is, if it’s nullable), then the string was recognized; if it can’t, the string was not recognized. The third is about the recognition of symbols.
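In driver form, that’s the whole recognizer: fold the derivative over the input, then apply the nullability test at the end. A minimal sketch, reusing the toy Re, derive, and nullable sketched in the notation post above rather than Barre’s arena-based machinery:

// One derivative per input symbol; the real engine also watches for ∅
// along the way so it can bail out early.
fn recognizes(mut re: Re, input: &str) -> bool {
    for c in input.chars() {
        re = derive(&re, c);
    }
    nullable(&re) // "ends in an accepting state" == the final result is nullable
}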

A tree is a directed graph in which any two nodes are connected by exactly one path. The outcome of Brzozowski’s algorithm, the naive version, is a tree generated dynamically as the derivative results of the current symbol (provided it doesn’t cause termination) are placed onto the stack and then processed with the subsequent symbol. Characters, CharacterClasses (e.g. ranges, [A-Z] or [0-9], etc.), and the Any symbol are leaf nodes for which a submatch operation (such as ‘Alternative’ or ‘Concatenation’) allows the process to proceed. The last three items in the list are the three main operations: ‘Alternative’, ‘Concatenate’, and ‘Star’.

Other operations, such as "At least," "At most," "Exactly n", and "Plus" are just variants on the above; how to optimize them isn’t something I’ve addressed yet, nor the "automatic derivative of a concatenation of symbols," which could greatly speed up some searches; an extended string match would result in either ε or ∅. (On the other hand, the automatic concatenation of symbols would also result in the need for a checkpointing operation, as now we’d be trading memory for a dubious speedup.)

There are some non-operations that also need to be considered, including capture groups and non-nested successive backtracking (nested and precessive backtracking I am not handling, maybe not ever). The input iterator needs a wrapper to introduce specialized symbols for WordBoundary, StartOfText, StartOfLine, and their complements.

More to the point, the Recognizer recognizes Strings as belonging to Languages, but I don’t have a Language, I have a convenient enum of recognizer machine states. You can perform operations on the Languages, and that’s more what I want. Once you can cat, alt, and star a whole language, the algebra of my recognizer becomes much easier to use.

The real problem comes down to the use of a memory arena to depict the tree of nodes. Rust Leipzig’s Idiomatic Trees and Graphs shows that the graph has to allocate the nodes, his pointers are opaque handles, and even his library doesn’t supply a composition of graphs operation.

All this is to say that I’m a bit stymied at the moment. There are obvious steps I could take, but imagine how much easier it would be to write regular expressions if they were concatenatable, alternable, and kleene-able?

14Oct

Rust Macros, and how they’re used in Barre.

Posted by Elf Sternberg as Rust

For the past two weeks I’ve been working on an implementation of regular expressions, writing an alternative to the classic Thompson engine, the simplest one, and often the fastest if you don’t need advanced features like, oh, Unicode range matching or backtracking. Instead of the Kleene/Thompson engine, I’ve been using Brzozowski’s Algorithm, which has some interesting properties and promises some interesting capabilities.

Right now, the engine implements the most naive possible version, one that’s known to be inefficient in time and space. It also covers only the most fundamental operators, exactly the six in Thompson’s 1968 matching engine: it has no ‘plus’ and no ‘any’ operator. On the other hand, it is completely templatized, so it can handle bytes, chars, or more complex symbols. It only recognizes truly regular languages, and has none of the power of modern engines, but part of the point of this exercise is to see how close I can get in expressiveness, power, speed, and size, and then to provide wrappers that could parse Clojure spec files, or implement the Rosie syntax.

The problem

But I had a problem. Part of this problem is an implementation detail, but part of it is just Rust being Rust. To recognize the letters "ABC" I would have to write:

let mut pattern = Recognizer::<char>::new();
let a =  pattern.tok('A');
let b =  pattern.tok('B');
let ab = pattern.cat(a, b);
let c =  pattern.tok('C');
let _ =  pattern.cat(ab, c);

What I really wanted was something simpler. I wanted to be able to write:

let mut pattern = re(cat('A', 'B', 'C'))

And since I didn’t know how to write Rust macros, my first thought was that I’d write it with them. But to give you a taste of what I ended up with, let’s just say that the pattern AB(CC|DDDD)E*F ultimately looks like this:

let mut pattern = re!{cat{tok{'A'},tok{ 'B' },
                      alt{cat{tok{'C'},tok{'C'}},
                          cat{tok{'D'},tok{'D'},tok{'D'},tok{'D'}}},
                      rep{tok{'E'}},tok{'F'}}};

I learned a few days later that the RE2 engine, which I will be using as a learning example and as a source of some knowledge, also has an alternative representation that looks somewhat similar:

/a.b.c/ => cat{lit{a}dot{}lit{b}dot{}lit{c}}

It uses ‘lit’ where I use ‘tok’, and it has ‘dot’ where I haven’t yet implemented ‘any’, but the outcome is remarkably similar.

Initialization

There are two different things that my macro has to do: it has to create a new Recognizer, and it has to add elements to that Recognizer.

The Recognizer uses a memory arena to represent the graph of the regular expression, a graph which is built on-the-fly as the string is processed. This implementation can’t have "nodes" as loose objects "somewhere" in RAM because Rust really doesn’t like that: adding a new node that "points to" previous nodes happens through handles (integers) rather than pointers, a phenomenon you can see in my first example above.
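Conceptually, the arena looks something like this. These are toy, non-generic stand-ins, not Barre’s actual definitions, but they show why a “pointer” is just an index:

// Nodes live in a Vec; a handle is the index of the node in that Vec.
struct Recognizer {
    arena: Vec<Node>,
}

struct Node {
    kind: NodeKind,
    left: Option<usize>,   // handle of the left child, if any
    right: Option<usize>,  // handle of the right child, if any
}

enum NodeKind {
    Tok(char),
    Cat,
    Alt,
    Rep,
}

impl Recognizer {
    fn tok(&mut self, c: char) -> usize {
        self.arena.push(Node { kind: NodeKind::Tok(c), left: None, right: None });
        self.arena.len() - 1 // the handle is just the index of the new node
    }

    fn cat(&mut self, left: usize, right: usize) -> usize {
        self.arena.push(Node { kind: NodeKind::Cat, left: Some(left), right: Some(right) });
        self.arena.len() - 1
    }
}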

Rust macros are hygienic and textual. A Rust macro has a pattern (sometimes containing captures) and a result: if your input to the Rust macro matches the pattern, the invocation of the macro will be replaced with the text of the result, and some elements inside the result will be replaced with elements of the capture.

There are currently 10 different things that can be captured; my macros only use four of them: ty for types, expr for expressions, ident for identifiers, and tt for token trees.

When a macro is read by the compiler, it is broken up into a token tree. The content of what you pass to the macro is also broken up into a token tree.

Let’s start with the most simple example. Let’s create the recognizer.

I called my macro ‘re!’, because of course I did. I also had to define the type that the Recognizer recognized.

macro_rules! re {
    ($sty:ty; $($cmds:tt)+) => {
        {
            let mut pt = Recognizer::<$sty>::new();
            {
                let _ = re!(...);
            }
            pt
        }
    };
};

That pattern there is one of the simplest: two items separated by a semicolon. The first item must be recognizable by the compiler as a type; the second item must be a stream of tokens (some of which can be further streams of tokens). A “token” in this case is any object Rust can reasonably identify as separate from other objects: tok{'a'} is four objects, as we’ll see later. tt objects are the most useful, but they’re not terminal objects; Rust doesn’t know what to do with them. They’re useful only for recursive processing; eventually, they must be reanalyzed to determine if they’re types, identifiers, expressions, or other valid Rust code.

And man, I really wanted a default type. Let me try to add that rule.

    ($($cmds:tt)+) => { re!{ char; $($cmds)* } };

This is the first recursive macro, in that if we just get a bare list of tokens, we’re going to call re! again, only this time with the pattern ty:$(tt)+, which means a type followed by a collection of tokens. That matches the first pattern, and so a new recognizer is born.

Note that although the pattern reads $($cmds:tt)+, which captures a collection of tokens, the replacement pattern is $($cmds)*.

Note also that our collections end in a plus, which means "one or more." What if the user only wants a blank Recognizer, one that recognizes nothing? Here’s where things get interesting. We can do this:

    () => { re!{ char; } };
    ( $sty:ty ; ) => { Recognizer::<$sty>::new() };

The order’s a little backwards; first it says that if the expression is empty, call it with only the type; if the type exists but the tree is empty, build an empty Recognizer.

These rules are extremely precise; they cannot be confused for anything else. So they have to go at the very top of the macro_rules! call; the rules that are most precise must come before rules that have more general matching semantics, or the more general semantics will match and the macro generator will attempt to use the corresponding result… which will leave you with sadness.
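To make the dispatch concrete, here’s which rule each invocation ends up matching (a sketch; the last two also rely on the expression rules described below):

let blank_default = re!{};                 // matches (), expands to re!{ char; }
let blank_typed   = re!{ u8; };            // matches ($sty:ty ;), an empty Recognizer::<u8>
let with_default  = re!{ tok{'a'} };       // matches ($($cmds:tt)+), defaults the type to char
let with_type     = re!{ char; tok{'a'} }; // matches ($sty:ty; $($cmds:tt)+)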

Okay, so now we’ve handled all the possible invocations of the Recognizer. What about the actual expression?

Parsing the regex

We already know that macros can be called recursively. We also want the parsing to be unambiguous. And here’s where we enter into an interesting problem: Token trees and expressions can easily be mistaken for one another. The phrase tok { 'a' } could be a token stream of two tokens, the second of which is itself a token stream; it could also be a token stream of four tokens; it could also be an expression (although a broken one, as the Rust parser would attempt to treat it as a struct, and the interior of the struct does not meet Rust’s specification).

In this case we want it to be a token stream of four tokens: tok, {, 'a', and }. Only one of those is an actual expression. We also want this to be unambiguously called, and for that we use an internal identifier. In macros, an internal identifier begins with an @ symbol. Like so:

(@process $pt:ident, tok { $e:expr }) => {
    {
        $pt.tok($e)
    }
};

This says that we want a token stream of six tokens: the four we identified earlier, along with the literal token @process and this thing $pt:ident.

Go back and look at the original macro rule definition, the first one that created a Recognizer. Let’s replace the ellipsis (‘…’) with what I really want there:

    ($sty:ty; $($cmds:tt)+) => {
        {
            let mut pt = Recognizer::<$sty>::new();
            {
                let _ = re!(@process pt, $($cmds)*);
            }
            pt
        }
    };

We create a new scope, and a new variable within that scope, pt. Then, within yet another new scope, I start the recursive parser by calling re!(@process pt, $($cmds)*); the @process unambiguously points us to the internal handling, and the pt is recognized as an identifier that will be kept consistent throughout the construction process.

Next, the cat macro, for concatenation, which conveniently has the exact same structure as the alt macro. Those need a left and a right arm, as they’re nodes and not leaves of the graph. We’ll start with the same three tokens: @process, $pt, and… what? Alt? Cat? There’s no "alternative" operator in Rust macros, so I’ll use a regular identifier.

    (@process $pt:ident, $op:ident { $lop:ident { $($lex:tt)+ }, $rop:ident { $($rex:tt)+ } }) => {
        {
            let l = re!(@process $pt, $lop { $($lex)* });
            let r = re!(@process $pt, $rop { $($rex)* });
            $pt.$op(l, r)
        }
    };

A couple of things to note here: First, note that I have to re-create the structure of the op { tokens+ }, because sometimes they’re not tokens! For the tok operator, that’s expected to be an expression. Also note that the final operation can be $pt.$op(l, r), because the substitution here is purely textual.

The left and right arms will be replaced with scoped expressions like this one (or the one for tok).

There’s one thing left that I want: cat{tok{'a'},tok{'b'},tok{'c'}}. For that, I have to do two things: I have to capture more than just two token trees, and I have to use the cons-style extension for cat (and alt, as you can have more than two alternate possibilities), preserving the existing operator:

    (@process $pt:ident, $op:ident { $lop:ident { $($lex:tt)+ }, $rop:ident { $($rex:tt)+ }, $($xex:tt)+ }) => {
        {
            let l = re!(@process $pt, $lop { $($lex)* });
            let r = re!(@process $pt, $op { $rop { $($rex)* }, $($xex)* });
            $pt.$op(l, r)
        }
    };

In this case, I add the rest of the expression to the end of the @process parse line, and then for the right-hand expression I then re-create the rest of the expression whole, minus the one operation I’ve already handled. This is classic Lisp-like recursive programming, and it’s the only way to do this with Rust macros: For a list of problems, I take one argument off the list, process it, then re-call the function with a smaller list of problems, until the list can be handled by the two-argument instance.

Just a (geeky) (but important) note: this works here because the cat and alt operators are associative, that is, cat(A, cat(B, C)) is exactly the same as cat(cat(A, B), C).

Conclusion

The entirety of the macro is available at Github.

Rust macros take some time to understand; they’re quite different from the rest of Rust, and they have some Lisp-like qualities in their internal, recursive form for which a general education in Scheme or Lisp might come in very useful. It helped a lot to understand the nature of the problem, too, in that I knew that concatenative operations were associative.

But they are quite useful, and they’ve made my life much easier for writing test cases, now that I’ve written tests that satisfy me that the macros are doing the right thing.
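For example, a test can now build a pattern both ways and compare them, something like this (assuming, and this is purely my assumption, that Recognizer implements PartialEq and Debug so that assert_eq! works):

#[test]
fn macro_builds_same_graph_as_hand_construction() {
    // By hand, with handles.
    let mut by_hand = Recognizer::<char>::new();
    let a = by_hand.tok('A');
    let b = by_hand.tok('B');
    let _ = by_hand.cat(a, b);

    // Via the macro.
    let by_macro = re!{ char; cat{ tok{'A'}, tok{'B'} } };

    // Assumes Recognizer: PartialEq + Debug.
    assert_eq!(by_hand, by_macro);
}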

Consequences

At the moment, BARRE "Languages" and the "Recognizer" are inappropriately mixed up: what I have are the six operations of classic regular expressions, with the Language and the Recognizer tangled together. Languages are supposed to be composable, and Rust’s own regex engine not only can compose them, but has a special root version of alt in the RegexSet handler that can keep track of large sets of regular expressions, giving users the ability to know which of several complex expressions may have been triggered by a string.

If cat, alt, etc. were written as either bare languages or simple operations, it should be possible to perform all the compositional operations on them. This may eliminate the need for the macros entirely. Time will tell. In the meantime, this was a great learning opportunity.

For the past two weeks, I’ve been working my way through Matt Might’s paper Yacc is Dead, and implementing the algorithm it describes in Rust. The result is BARRE: Brzozowski Antimirov Rusty Regular Expressions. So far as I’ve been able to determine, I’m the first person to try implementing it in a systems language, one without garbage collection.

Rust has made this pretty straightforward so far. My progress has been pretty good: I have the naive implementation built and running. The naive implementation sucks: it’s slow, it’s inefficient, and it has a tendency to grow out of control in space. The regex (A|B)*C when trying to match ‘ABABAAC’ grows to a deterministic finite automaton (DFA) of over 140 nodes!

My progress is going to be a bit stalled this week as I work through chapter 1 of Michael Sipser’s Theory of Computation, which teaches about DFAs and NFAs and how to convert back and forth between the two. There are several major improvements that can be made to the existing algorithm.

First, I need to figure out how to extend the engine to handle different repetition kinds. Right now it’s the pure Kleene star, "zero or more." Modern regex engines handle "one or more," ("Kleene plus"), "zero or one", and ranges ("exactly n", "at least n", "at most n").
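All of those desugar into the operators the engine already has. A sketch, using the toy Re enum from the notation post above rather than the real engine, and with helper names that are entirely mine:

fn plus(r: Re) -> Re {
    // one or more: r r*
    Re::Cat(Box::new(r.clone()), Box::new(Re::Star(Box::new(r))))
}

fn optional(r: Re) -> Re {
    // zero or one: r | ε
    Re::Alt(Box::new(r), Box::new(Re::Epsilon))
}

fn exactly(r: &Re, n: usize) -> Re {
    // exactly n: r repeated n times (zero times is just ε)
    (0..n).fold(Re::Epsilon, |acc, _| Re::Cat(Box::new(acc), Box::new(r.clone())))
}

fn at_least(r: &Re, n: usize) -> Re {
    // at least n: n copies followed by r*
    Re::Cat(Box::new(exactly(r, n)), Box::new(Re::Star(Box::new(r.clone()))))
}

fn at_most(r: &Re, n: usize) -> Re {
    // at most n: n copies of (r | ε)
    (0..n).fold(Re::Epsilon, |acc, _| Re::Cat(Box::new(acc), Box::new(optional(r.clone()))))
}

The open question is less how to express them than how to keep these expansions from blowing up the graph, which is where the optimizations come in.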

PCRE adds Boundaries (word boundaries, the "\b" operator), Anchors (start of line, end of line, start of document, end of document), and Classes (is-a-num, is-a-letter, range-of-letters). All of these are single-term operations, though, and should have the same semantics as a Token.

PCRE also adds Captures (groups that can be retrieved after the expression engine has run, showing what was actually matched), and Backreferences (literals that appear later in the regular expression that match strings built out of what was matched, effectively "splicing" the parse tree with a runtime-only concatenation at a given point in the matching process).

Python, of all things, adds a unique feature to the input iterator to replace indent/dedent changes with block-open and block-close symbols. This may be useful for whitespace-based languages, and obviates the need to play funny with the regular expression engine.

Brzozowski’s algorithm itself has four major improvements: laziness, which delays actually creating a derivative node until it’s truly needed; fixed points, which takes a derivative subgraph and re-derives it, over and over until it stops changing, at which point we have the final matcher; memoization, which intercepts the fixed-point processor to find duplication and returns a pointer to the start of the duplicate in the existing subgraph, interrupting any potential infinite recursions which may blow our stack; and compaction, which identifies duplications across subgraphs and prevents the in-memory representation from growing out of control.

As far as I can tell, Brzozowski regular expressions have a "start-up cost," that is, the first time through can be pretty expensive as the graph is filled out, but follow-on matches can be pretty quick, as only the deltas need to be filled out.

Antimirov Derivatives promise an easier-to-implement variant of Brzozowski’s algorithm. I haven’t yet dived into these, but they’re on the list.

My hope is that Antimirov’s Derivatives don’t interfere with Might’s final point: Recursive Regular Expressions which, if implemented in a system with captures, can parse context-free grammars whole.

A couple of nice-to-haves: First, I need internal macros to build internal representations in as straightforward a manner as possible. Rust has a fairly powerful macro language, and leveraging it to make my life easy while negotiating with the borrow checker would be awesome. Second, the internal representation of a first-order Brzozowski’s Derivative is a simple array; being able to pre-compile that array to have stand-alone regexes at compile time would be wonderful.

And finally, it would be amazing if a parser for a Thompson, PCRE, or even PCRE-light syntax (such as supported by Rust’s own Regex engine), could be written in Barre. The parser for Rust’s Regex engine is written with, but not in, Regex; that is, while the Regex engine is used to recognize substrings written in Thompson/PCRE regex (you know, the classic abc(.*?)de strings), it can’t handle nested expressions, which requires the engine to recurse on itself, but theoretically, Barre can.

If all of this is true and possible, then I might well have a competitive alternative to both regex.rs and nom.rs, and my next steps toward learning how to write my own programming language will be clear, if not easy.

21Sep

Progress Report, Week 10.

Posted by Elf Sternberg as Uncategorized

I set out this week to do the following:

On the whole, that’s not… too bad. I have about four hours a day in which to do my studying; my studies get my morning, and my family seems to be occupying more of my afternoons than I had originally planned when I took this sabbatical. I also didn’t get anything at all done Thursday this week; instead, we all took the day off to go to the Washington State Fair and eat very badly. (So badly, in fact, that when I got home I took a 90 minute nap, woke up long enough to watch a movie with my wife, and then went right back to bed for a solid eight hours of sack time.)

I’ve been leaning on Scheme a lot this week, primarily because I’ve become obsessed with understanding this bit of code, which I hope will unlock secrets of the universe for me. I’m finally starting to understand the difference between let, let*, and letrec, but reading Scheme macros, especially pattern matching macros, is still a bit beyond me. But I hope to figure it out, and then build on top of it. I realized, and perhaps this is obvious to others, that as I was working through my primitive constructions of regular expressions, by defining the “thing” you’re using as your core language as <T> (rather than char or byte), you end up with an ability to match on tokens rather than characters, and from there you end up with the ability to pattern match sequences much more complex than mere strings. One of the things that drives me crazy is how opaque the texts are about the leap from language generators to language lexers and parsers, gliding as quickly as possible over language recognizers. That needs to be made more explicit. Recognizers say that a sequence matches a pattern; parsers say that a sequence matches a pattern and derives a parallel (“lifted?”) structure that reifies the sequence in a meaningful way.

Remember my wild ambition? Take a look at this chart and you’ll see where I’m going with this. I love Clojure because it’s a Lisp with a remarkably low rate of bug density among skilled developers… but it runs on the JVM. I love Python… but its tendency toward bugginess (and the death’s head semantic) drives me insane. I loathe Go’s syntax and semantics… but its runtime and libraries are first-rate. I’d love to get to static where possible, dynamic where necessary. This is where I want to land with Blackwing: somewhere in the middle of this. I’m not sure that I can, but I’ve got a frackton of examples to work with.

I also attended the Fun(c) Meetup on Category Theory, and that was pretty interesting, mostly because I started to see the way real-world applications matter in categorical thinking. One of the biggest realizations to come out of the event was that often compositional thinking is an optimization issue: does it make sense to spend the time and energy composing the functions that you’ll perform on a lifted type, or live with the code as-is?

Overall, I’m driven by two overwhelming notions. First, programmers need to develop good taste in developing programs. They need both clear naming schemes and clear layout. Code should tell a story. It should be easy to read and understand. Go achieves this by dumbing everything down to a third-grade vocabulary, but I don’t believe that’s necessary, and I don’t believe programmers should be treated this way. Go is the epitome of “We don’t want programmers with good taste, we want programmers that taste good.”

Second, developers need compassion. This is harder to teach than good taste. The old saw, “Always treat your source code as if the next person to see it is a psychopath who knows where you live,” has always bothered me as violent and frightful; I’d much rather we appreciate that the next person to see your code is going to need as much compassion as you can muster, and your code should reflect your wish to convey grace and kindness. Not everyone will appreciate it, but those who do are the ones who matter.


Next week

  • Three more chapters of Seasoned Schemer
  • Chapter 3.5 of SICP
  • Redraft the Mandelbrot and Buddhabrot code.
  • Colorize Buddhabrot.
    • Stretch: Consider alternative coloring schemes for Mandelbrot.
  • Read ahead in the Scheme book to understand dparse.rkt

17Sep

Notes on using the Rust image library

Posted by Elf Sternberg as programming, Rust

I have finally figured out how to use the images library for Rust, and it’s not obvious, especially not for internally generated images. The library does a very good job of hiding complicated implementation details when you load a file off the disk, automatically figuring out the file format and providing some baseline utilities. Zbigniew Siciarz has a pretty good writeup of the basic implementation, but for the Mandelbrot renderer, I wanted to get a little more specific.

So here are the basics. Images are made up of Pixels, and every pixel is an array of 1 to 4 numbers (of any machine size; common ones are u8 and u32, but you can create Pixels of type f64 if you like). A single-item Pixel is greyscale; a 3-item Pixel is RGB, and a 4-item Pixel is RGBA (alpha).

Pixels go into implementations of GenericImage, the most common of which is ImageBuffer. ImageBuffer controls access to the underlying representation, ensures bounds checking and provides limited blending capabilities.

To colorize my image, I needed to create the colormap, and then map the colors to a new image. And then I learned that the PNM colormap handler doesn’t handle Pixels.

PNM is an ancient image format, which may explain why I enjoy it. It’s simply a raw dump of bytes, RGBRGBRGBRGB…, with a header explaining how to dimension the image by width and height. The PNM Handler for images.rs can only handle flattened arrays that look like a PNM array.

So for the Mandelbrot colorizer, the solution was to create a new array three times as long as the underlying Vec for my original image, and sequence through the pixels, mapping them to a color and then pushing the pixel subcolors to the new array. Which is annoying as heck, but it works.
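Reduced to its essentials, that loop looks something like this; colormap is a stand-in for whatever palette function you’re using, and the result is the flat RGB buffer the PNM handler wants:

// Map each greyscale count to an RGB color and push the three subpixels
// into a flat buffer three times as long as the original.
fn flatten_to_rgb(counts: &[u16], colormap: impl Fn(u16) -> [u8; 3]) -> Vec<u8> {
    let mut flat = Vec::with_capacity(counts.len() * 3);
    for &count in counts {
        let [r, g, b] = colormap(count);
        flat.push(r);
        flat.push(g);
        flat.push(b);
    }
    flat
}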

I’m sure there’s a better way to do this. And the constant remapping from u32 (which the image array understands) to usize (which is what slices are made of) and back got tiresome, although the compiler was helpful in guiding me through the process.

Brad Delong recently pointed me at Susan Dynarski’s article For Better Learning, Lay Down the Laptop and Pick up a Pen, in which she reviews the evidence that laptops, because of their speed and utility, actually inhibit learning by allowing students to take notes too quickly, and by giving students a universe of alternative distractions should the instruction get too boring. I’ve found Dynarski’s article to be absolutely true.

I recently finished a small homework assignment. After three days of working with it on the computer, I sat down with a sheet of paper and worked it out in an hour. I wrote it snowflake fashion: I described it, then iterated on the description until I had a complete description, with the public API underlined. It took another hour to implement it.

This is my lesson: if I can’t explain it to myself on paper, then I can’t explain or implement it on the computer. Ever. It just doesn’t work that way, at least not with software. The ideas don’t stick unless I’ve written them out. Every few years I have to re-learn this lesson, believing that I can short out the learning curve and get straight to the meat of the problem. But the meat of the problem isn’t in the computer, it’s math and common sense, and those take time, and paper, and a pencil.

One of the horrifying follow-on realizations this led me to was that, when I was at my last job, I was very rarely writing software. I wasn’t working out algorithms or implementing new and innovative processes. I was exploiting lenses.

In software development, a lens is an algorithmic function that allows the user to focus in on a specific part of data without the user having access to, or having to know about, the rest of the data. (There’s much more to it than that, including a deep mathematical description, but that’s the basic idea.) And most of what I was doing was writing shims, to be part of a Kubernetes ecosystem, to help internal users monitor the health and well-being of the system.

Which is all well-and-good, but when you realize that you’ve married a massive protocol like HTTP to a very simple collection of filters, then come up with your own peculiar way of specifying to the filters who you are and what you want, over and over again… well, that can lead to a bit of burnout. Every job is easy, but every job whispers, "This is stupid and should have taken three days, not four weeks."

With very rare exception, this is what programming is these days. All software is little more than views into specific arrangements of data. This is one of the reasons functional programming is becoming so prevalent: that’s what functional programming believes in its heart. (There is a hard part, dealing with philosophical notions of time, but until other people can understand what the heck Conal Elliot is talking about, I’ll just leave that there.) Sometimes getting the data into that specific arrangement is a story about performance, timeliness, memory, and storage, but that’s pretty much all it is.

13Sep

Mostly studying this week.

Posted by Elf Sternberg as Uncategorized

Happy Thursday!

Thursday is the day where I look back upon the week and consider what I’ve learned. Last week I completed the main Buddhabrot algorithm and got it to work.

Studying: The Little Schemer

This week was primarily studying, so there’s not a lot of code to consider. Instead, I worked my way through The Little Schemer, a book that teaches how to write Scheme and, by doing so, master the Y Combinator (the mathematical formula, not the orange site people), and eventually meta-circular interpretation. The latter wasn’t as interesting to me as I’d hoped, as I’d already done it once, writing a meta-circular interpreter on top of a Lisp I’d already written which, in turn, was written in Coffeescript, because that’s how I was rolling back then.

It’s a rite of passage to finish The Structure and Interpretation of Computer Programs and then write your own programming language in it. I’m up to section 3.3 and am currently a little stuck on the “write your own deque” section. One thing I dislike intensely about MIT Scheme (and Schemes in general) is the way objects and pointers just exist, willy-nilly, and knowing when you’re using the original handle and when you’re using a reference is all a matter of keeping it in your head. I think I’m becoming more of a Rust partisan every day.

Speaking of Rust, I started an implementation of Brzozowski’s Algorithm in Rust. It’s a very naive sort of thing, based on my earlier attempt with Python. It doesn’t work yet and, according to Might, it will probably have terrible performance and memory issues, but it’s a first draft. I may have to figure out how to do memoization and some macro handling / inlining. The first time I wrote it, I realized where the memoization would have to occur to make it performant, but that knowledge seems to have faded from my brain and I’ll have to re-learn it all over.

Plans: Next Week

Next week, fate willing, I’m going to:

  • Implement colorized versions of the Mandelbrot and Buddhabrot implementations.
  • Read three chapters of The Seasoned Schemer
  • Finish chapter three of Structure and Interpretation of Computer Programs
  • Write some damn documentation for the images.rs library.

Future Plans

There are a number of other, unfinished projects on my plate. Most notably, I haven’t finished porting Daniel Givney’s Assembly Tutorials to 64-bit ASM, as promised.

And then I’m going to dive into writing an interpreter, a virtual machine, and a compiler. All the stuff I never learned how to do because my parents didn’t believe, in 1985, that a programming career would ever be lucrative and would only pay for my college if I majored in Accounting– excuse me, I meant “Business Computing”– instead. My ideas for it are incredibly vague, but there are some basic ideas.

Excessively Ambitious:

  • It is going to be a Scheme.
  • It is going to be strongly typed.
  • It is not going to be beholden to the naming history of Lisp or Scheme. My first thought was to steal from Hy, using Ghostwheel or Compact Notations as inspiration for typing annotations.
  • It will use type inference wherever possible.
  • It will use a sane import scheme.
  • It will rest upon an Intermediate Representation that will not be married to the syntax. Alternative syntaxes should be easy.
  • I want it to have the following:
    • Garbage Collection: A tri-color parallel garbage collector.
    • Green threads implementation, like Go routines.
    • Batteries included
    • Computation Expressions
    • Fat executables (all files and libraries included in a single, runnable format)
  • It would be silly for me to want:
    • Software-transactional memory
    • Type Providers
  • It would be madness to provide domain-specific stuff like:
    • A Grammar interface
    • A type-driven graph engine (the underlying technology of spreadsheets)
    • A type-driven pluggable buffer engine (the underlying technology of databases)
    • A high-performance rope or piece table implementation (the underlying technology of text processors)
    • A document object model (the DOM is an obvious one, but what about OpenDocument?)

For all I know, there is a strong conflict among these different goals, in that, for example, grammars might be incompatible with strong typing, or at least difficult to implement correctly. This is a learning experiment.

And here’s the thing: I don’t just want to write a programming language. I want to write useful things in my programming language. That’s the whole point. If I can’t write useful things in it, then it’s not a useful language. Yes, there are days when I feel like this plan goes over Programming Language for Old Timers, but that’s also the point; I know enough to feel modern, but I want something that is expressive to me.

This week, I finished off the basic Programming Rust exercises by extending the Mandelbrot example in the book to produce Buddhabrots instead. In my last post, I mentioned that I’d gotten the Buddhabrot working, but only in a very basic way. I was unhappy with the result; my point-to-pixel mapper was giving me weirdly squashed results, and what I was producing didn’t look like many of the examples found in the wild. The O’Reilly Mandelbrot exercise extended the code to work on multi-core machines, so I tried to do that with the Buddhabrot.

The Mandelbrot set divides the complex plane into two sets: for a given complex point c and a complex point z starting at 0 + 0i, those points for which the iteration z = z * z + c goes to infinity, and those for which z remains within a bounded region, usually the circle of radius 2 centered at the origin. For those starting points that go to infinity, the velocity at which they go gives a grayscale number, from which we generate the Mandelbrot set.

The Buddhabrot says, “Wait a minute, for all those points that don’t go to infinity, what if we colored those instead?” For those, at each iteration we track the points the orbit visits within the black center of the Mandelbrot set, and the number of times a point corresponds to a pixel gives us a number we can then render as a greyscale.
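Reduced to a sketch, the inner loop looks something like this, using the num crate’s Complex the way the book’s Mandelbrot example does; the point-to-pixel mapping is a stand-in, and the real code is messier:

use num::Complex;

// Iterate z = z*z + c, remembering the orbit; only if the point never
// escapes do we bump the hit count for every pixel the orbit visited.
fn accumulate_orbit(
    c: Complex<f64>,
    limit: usize,
    hits: &mut [u32],
    width: usize,
    to_pixel: impl Fn(Complex<f64>) -> Option<(usize, usize)>,
) {
    let mut z = Complex { re: 0.0, im: 0.0 };
    let mut orbit = Vec::with_capacity(limit);
    for _ in 0..limit {
        z = z * z + c;
        if z.norm_sqr() > 4.0 {
            return; // escaped: the Mandelbrot set cares, this variant doesn't
        }
        orbit.push(z);
    }
    // Never escaped: credit every pixel the orbit passed through.
    for point in orbit {
        if let Some((x, y)) = to_pixel(point) {
            hits[y * width + x] += 1;
        }
    }
}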

So: the problem here is easy. For Mandelbrot set, each point is not dependent on any other points on the screen. For the Buddhabrot, this isn’t true: most points are dependent upon the behavior of some previous point. The Mandelbrot set could be divided up into different regions and each region handed to a different core. For the Buddhabrot, we have to build up each map in total.

Even on my monitor, which is 3840×2160, using 16-bit greyscale and 8 threads, that was still under a gigabyte of memory. Yay for modern computers!


One of the biggest problems I had learning this was the notation for Rust mutation. And this is before we get to things like reference counters, boxes, and reference cells!

Here’s what I need to remember:

let x = 5; // defines a location, 'x', and puts 5 into it.
// x = 6; // illegal: x isn't mutable.

let mut x = 5; // legal: This shadows the value above.
x = 6; // redefines 'x' to be 6, which is legal because x is now
       // mutable.  The types must match!
{
    let y = &mut x; // 'y' is a mutable reference to 'x'. 'y' cannot be
                    // reassigned, but what it points to can be mutated:
    *y += 1;
}

That covers the basics, and helps a lot. The other thing is that the &mut syntax inside function signatures has a subtle meaning: that the function only takes mutable references, and there can only be one of those at a time.


The rules about mutability in Rust just killed me. I wanted to run my Buddhabrot generator the way I would expect in any other object-oriented language: multiple planes on which to operate, each one running independently. I couldn’t get it to work with a vec of Plane objects, because Rust really didn’t want to move those massive planes around.

The only way to get it to work was to generate one gigantic vec, then use Vec::chunks_mut to break it up into subplanes, one for each thread, and hand them over to an immutable processor that did all the work of calculating the Buddhabrot for that plane. It worked, but what a pain in the neck.
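In sketch form, that approach looks something like this. std::thread::scope is newer than this post (the original went through a different mechanism), the counts are u32 here rather than u16, and render_into stands in for the real per-plane Buddhabrot work:

use std::thread;

// One big Vec, one full-size sub-plane per thread, each handed out by
// chunks_mut as an independent &mut [u32]; the per-thread planes get
// summed into a single image at the end.
fn render_parallel(
    width: usize,
    height: usize,
    threads: usize,
    render_into: impl Fn(&mut [u32]) + Sync,
) -> Vec<u32> {
    let plane_len = width * height;
    let mut big = vec![0u32; plane_len * threads];
    thread::scope(|s| {
        for sub in big.chunks_mut(plane_len) {
            s.spawn(|| render_into(sub));
        }
    });
    // Fold the per-thread planes into one.
    let (first, rest) = big.split_at(plane_len);
    let mut out = first.to_vec();
    for sub in rest.chunks(plane_len) {
        for (o, v) in out.iter_mut().zip(sub) {
            *o += *v;
        }
    }
    out
}

Because each chunk is a disjoint mutable slice of the same allocation, the borrow checker is satisfied without moving any planes around.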


The other thing I did this week was rip through the first 100 pages or so of The Little Schemer, a classic book teaching you how to use Scheme. I went through it using Racket, which was lovely, and taught me a lot about how Scheme works inside. I have two thoughts as I return to the Scheme land after wandering around in Rust: I really, really want better guarantees in my language that I won’t be passing around non-containers to functions that expect containers, and I really want better guarantees that the types I pass to functions are understood by those functions ahead of time.

Maybe I want Haskell. Or some kind of typed scheme. Or something. But this exercise was just… weird.
