When last we left our hero, he had successfully implemented Rigged Regular Expressions in Rust.

There’s been a lot of progress since then.

Matt Might’s Parse Result Data Type Is A Semiring!

One of my strongest intuitions when I started reading A Play on Regular Expressions was that under the covers Matt Might’s implementation of Brzozowski’s Regular Expressions used a data structure exactly like those in the Play paper. I was correct; I was able to implement a (non-recursive) Brzozowski engine in Haskell, and use a Semiring formalism to accurately reproduce Might’s outcome and data structure from Parsing With Derivatives: A Functional Pearl.

And Might’s parser-combinators are functors between semirings!

To me, this is something of a monumental discovery. Intellectually, I’m happy to show that the data structures Might is using are semirings and his parser-combinator approach is actually just functors between semirings. Personally, I’m thrilled that I was able to do this in Haskell, writing code that is not from either paper to show that the two have significant overlap. Even better, I was able to show that the semiring properties can encompass a different category of complexity from the one in the Play paper.

Semiring-as-a-trait in Rust

The first thing is that, in that last post, I discussed not having to implement a Semiring trait in Rust, because I could abstract it further using the num_traits crate to define my zero and one, and the std::ops traits to define multiplication and addition.

Unfortunately, that didn’t hold. It worked fine when my values were primitives such as booleans or integers, but when I implemented Matt Might’s sets of strings as a return value, the std::ops implementations were insufficient. To work as one of those operations, Rust requires that the data be bitwise copyable without loss of fidelity; unfortunately, bitwise copying of sets-of-strings doesn’t work, because the underlying implementation keeps much of its data distributed across the heap. So I had to implement the Semiring as a trait whose operations take Rust references, which gave me high-performance operations, and it worked just fine.
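
For the record, the shape it ended up taking looks roughly like the sketch below. This is not the repository’s exact code, and the Parses type is an illustrative stand-in for the set-of-strings value; the point is that the operations take references, so nothing has to be bitwise-copied.

use std::collections::HashSet;

// Sketch only: names and signatures here are illustrative, not the crate's API.
pub trait Semiring {
    fn zero() -> Self;
    fn one() -> Self;
    fn add(&self, rhs: &Self) -> Self;
    fn mul(&self, rhs: &Self) -> Self;
}

// A set-of-strings semiring: union for "add", pairwise concatenation for "mul".
#[derive(Clone, PartialEq, Debug)]
pub struct Parses(HashSet<String>);

impl Semiring for Parses {
    fn zero() -> Parses { Parses(HashSet::new()) }
    fn one() -> Parses { Parses([String::new()].iter().cloned().collect()) }
    fn add(&self, rhs: &Parses) -> Parses {
        Parses(self.0.union(&rhs.0).cloned().collect())
    }
    fn mul(&self, rhs: &Parses) -> Parses {
        Parses(self.0.iter()
            .flat_map(|a| rhs.0.iter().map(move |b| format!("{}{}", a, b)))
            .collect())
    }
}

With &self and &Self in the signatures, the heap-backed sets are borrowed rather than moved or copied, which is the flexibility I couldn’t get out of the std::ops route.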

You can write Haskell (poorly) in any language

At one point developing the Rigged Brzozowski in Haskell implementation, I was stuck on a hard-to-surface bug. That whole "if it compiles, it’s probably correct" bit of Haskell superiority is nonsense; there are lots of little fiddly bits in this algorithm ("Is this a sum or product operation?" "Is the initializer one or zero?" "Is the precedence correct?" "Is this test on the left or right operand?") that can go wrong.

I ported the implementation to Python in order to litter it with print statements and watch it work. This was useful, and helped me track down the bug. I was able to port the working instance back to Haskell easily.

One thing to note is that the Python is rather… odd. The six regex operations that can be performed on a string form a small, fixed set. By using namedtuple I was able to create the same structure as the Haskell data or Rust enum operation, and with clever use of Python’s reflection capabilities I was likewise able to make Python do something like Haskell or Rust’s pattern matching, using dictionaries. The result is, to my eye, pleasing, but I’ve been told it’s "not very Pythonic."

Easier to maintain, at any rate.

Are we done here?

That’s… really the twelve thousand dollar question now, isn’t it? I’ve finished sections one and two; the third section is about adapting this framework to recursive regular expressions, which I’m already somewhat proficient in from working with Darais’ Racket implementation. So there are a couple of different ways I could go about this:

I could just keep plugging away

I could proceed as I have been, implementing:

  • Heavyweights using Brzozowski in Haskell
  • Heavyweights using Brzozowski in Rust
  • Recursive Regular Expressions in Haskell
  • Recursive Regular Expressions in Rust
  • Recursive Brzozowski Regular Expressions

I could show it works in C++

I could combine what I did with the Python implementation and my limited (very limited) C++ knowledge and try to port one of the Rigged Glushkov engines to C++. The state-of-the-art in C++ unicode support looks absolutely terrifying, though.

I could add to the feature set: incremental regular expressions

One of the big changes I made in Rust was that, toward the end, I changed the input value from an Iterable to an Iterator, thus simplifying the API. I want to do the same thing for the output; that is, I want the receiver to get not just a semiring containing a set, but an iterator that produces elements from the semiring as they’re produced, in order. I want to create an incremental regular expression.
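
Purely as a sketch of the shape I have in mind (none of these names exist in the crate, and the body below is a placeholder rather than a working engine), the receiver would drive something like this:

// Hypothetical API shape: an iterator of chars in, an iterator of semiring
// values out. The names and the stub body are illustrative only.
pub struct Matches<I, S> {
    input: I,
    ready: Vec<S>,
}

impl<I, S> Iterator for Matches<I, S>
where
    I: Iterator<Item = char>,
{
    type Item = S;

    fn next(&mut self) -> Option<S> {
        // A real implementation would feed characters from self.input into the
        // engine and yield each semiring value as soon as it is determined.
        while self.ready.is_empty() {
            match self.input.next() {
                Some(_c) => { /* feed _c to the engine here */ }
                None => break,
            }
        }
        self.ready.pop()
    }
}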

I could add to the feature set: compile-time regular expressions

In the paper that started this whole thing, Owens, Reppy & Turon showed (Section 3.3) that Brzozowski’s algorithm can produce static DFAs, and that high-performance compile-time regular expressions are possible. Combined with Rust’s Procedural Macros and the iterator ideas above, this could lead to static regular expressions becoming a first-class data type next to the container library.
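
To make that concrete, here is the kind of thing such a macro could expand to, hand-written for the toy pattern "ab*". The macro name is hypothetical; only the expansion below is real code.

// Hypothetically, something like regex_static!("ab*") could expand to a static
// transition table plus a small matching loop, with all the derivatives taken
// at compile time. This expansion is hand-written for illustration.
const REJECT: usize = 2;
static NEXT: [[usize; 2]; 3] = [
    // columns: on 'a', on 'b'
    [1, REJECT],      // state 0: start
    [REJECT, 1],      // state 1: accepting, loops on 'b'
    [REJECT, REJECT], // state 2: dead
];

fn matches_ab_star(s: &str) -> bool {
    let mut state = 0;
    for c in s.chars() {
        state = match c {
            'a' => NEXT[state][0],
            'b' => NEXT[state][1],
            _ => REJECT,
        };
    }
    state == 1
}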

I could add to the feature set: compile-time bitmapped regular expressions

Fritz Henglein has a paper in which he discusses Bit Coded Regular Expressions, which look fairly adaptable. BCRE requires that you not have character classes in your regular expression library (no "\w+", no "\{Number}", and no "." operator!), but in exchange what you get is an insanely fast regular expression matching algorithm that stores and outputs its results as a bitmap, which happens to make "I found X" vs "I found Y" incredibly cheap on modern hardware; the engine knows which bitmap corresponds to which sequence exactly. This speed is exactly what you need to analyze the terabytes of data that flow through a modern logging engine.
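
The flavor of the idea, as I read it (a sketch of the encoding style, not Henglein’s exact scheme): every choice the parser makes can be recorded as a single bit, so an entire parse collapses into a bit string.

// Sketch: bit-coding a parse tree. Alt records one bit for which branch
// matched, Rep records one bit per iteration plus a stop bit, and everything
// else is determined by the regular expression itself and costs nothing.
enum ParseTree {
    Unit,                                 // matched epsilon
    Chr(char),                            // matched a literal
    Left(Box<ParseTree>),                 // took the left Alt branch
    Right(Box<ParseTree>),                // took the right Alt branch
    Pair(Box<ParseTree>, Box<ParseTree>), // a sequence
    Reps(Vec<ParseTree>),                 // a repetition
}

fn bitcode(t: &ParseTree, out: &mut Vec<bool>) {
    match t {
        ParseTree::Unit | ParseTree::Chr(_) => {}
        ParseTree::Left(inner) => { out.push(false); bitcode(inner, out); }
        ParseTree::Right(inner) => { out.push(true); bitcode(inner, out); }
        ParseTree::Pair(a, b) => { bitcode(a, out); bitcode(b, out); }
        ParseTree::Reps(items) => {
            for item in items {
                out.push(false);
                bitcode(item, out);
            }
            out.push(true); // stop bit
        }
    }
}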

I could add to the feature set: extended regular expressions

There are four other operations that are known to work with regular expressions: Intersection ("this AND that at the same time"), Negation ("NOT that"), Interleaf ("This AND that AND that, independent of order of appearance"), and Cut ("The biggest THIS"). The Cut operation can be fully simulated using the Semiring implementation for the "longest submatch" (which can be adapted to emulate any of the Alternative behaviors found in the wild: longest, atomic longest, first match, last match). The Interleaf operator is useful for parsing declarative languages in which elements appear in any order (for example, HTML attributes), and Intersection plus Negation are already commonplace in engines like PCRE.

Unfortunately, these fall outside of the Semiring implementation. That doesn’t make them bad; I just don’t have a good intellectual grasp on how to implement them in a way that has a solid theoretical foundation. Even worse, there’s some evidence that Intersection plus Negation together create a parser with some edge cases that carry gnarly performance penalties.
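
As constructors, the additions would look something like the sketch below; the names mirror the rigged Regw type from the Rust post, and none of this is implemented. It is only the shape of the problem.

// Sketch only: the Kleene core plus the four extended operators.
enum ExtRegw<S> {
    // the Kleene core, as in the earlier posts
    Eps,
    Sym(Box<dyn Fn(char) -> S>),
    Alt(Box<ExtRegw<S>>, Box<ExtRegw<S>>),
    Seq(Box<ExtRegw<S>>, Box<ExtRegw<S>>),
    Rep(Box<ExtRegw<S>>),
    // the proposed extensions
    And(Box<ExtRegw<S>>, Box<ExtRegw<S>>),       // Intersection: this AND that
    Not(Box<ExtRegw<S>>),                        // Negation: NOT that
    Interleaf(Box<ExtRegw<S>>, Box<ExtRegw<S>>), // both, in any order
    Cut(Box<ExtRegw<S>>),                        // the biggest THIS
}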

I could abstract Brzozowski’s algorithm into a finite state transducer

More abstractly, Brzozowski’s algorithm can actually be abstracted (eek!) into a bidirectional scanner. In some ways, a regular expression is a set-membership function with a functor transforming the membership determinant into something more useful than true-or-false. If I’m reading the paper Clowns to the Left of Me, Jokers to the Right correctly (no promises), Conor McBride shows that you could start anywhere in an abstract syntax tree, and Brzozowski’s algorithm (really, any regular expression algorithm, but Brzozowski’s seems easiest here, actually) could provide derivatives of the left and right parts of the tree, producing a new tree transformed according to a regular expression substitution rule on the old tree. Do this progressively enough, and you end up with a fully functional tree transformer, or a very powerful abstraction of a finite state transducer that’s fully explicable in Church terms, rather than the Turing terms used in the Wikipedia article.

I could finish Barre

Or I could just "go for it," and just start re-writing Barre with the knowledge I’ve picked up working on these. One of the big goals of Barre is to implement Adams & Darais’ "Language of Languages," a superset of Kleene’s base six regular expressions to make it easier to build primitive parsers, giving you the "strap" part of bootstrapping this implementation into a full-powered parser generator.

I just haven’t decided which to do yet.

In the last post on “A Play on Regular Expressions,” I showed how we go from a boolean regular expression to a “rigged” one; one that uses an arbitrary data structure to extract data from the process of recognizing regular expressions. The data structure must conform to a set of mathematical laws (the semiring laws), but that simple requirement led us to some surprisingly robust results.

Now, the question is: Can we port this to Rust?

Easily.

The first thing to do, however, is to not implement a Semiring. A Semiring is a conceptual item, and in Rust it turns out that you can get away without defining a Semiring as a trait; instead, it’s a collection of traits derived from the num_traits crate: Zero, zero, One, one; the capitalized versions are the traits, and the lower case ones are the implementations we have to provide.

I won’t post the entire code here, but you can check it out in Rigged Kleene Regular Expressions in Rust. Here are a few highlights:

The accept() function for the Haskell version looked like this:

acceptw :: Semiring s => Regw c s -> [c] -> s
acceptw Epsw u     = if null u then one else zero
acceptw (Symw f) u = case u of [c] -> f c;  _ -> zero
acceptw (Altw p q) u = acceptw p u `add` acceptw q u
acceptw (Seqw p q) u = sumr [ acceptw p u1 `mul` acceptw q u2 | (u1, u2) <- split u ]
acceptw (Repw r)   u = sumr [ prodr [ acceptw r ui | ui <- ps ] | ps <- parts u ]

The accept() function in Rust looks almost the same:

pub fn acceptw<S>(r: &Regw<S>, s: &[char]) -> S
    where S: Zero + One
{
    match r {
        Regw::Eps => if s.is_empty() { one() } else { zero() },
        Regw::Sym(c) => if s.len() == 1 { c(s[0]) } else { zero() },
        Regw::Alt(r1, r2) => S::add(acceptw(&r1, s), acceptw(&r2, s)),
        Regw::Seq(r1, r2) => split(s)
            .into_iter()
            .map(|(u1, u2)| acceptw(r1, &u1) * acceptw(r2, &u2))
            .fold(S::zero(), sumr),
        Regw::Rep(r) => parts(s)
            .into_iter()
            .map(|ps| ps.into_iter().map(|u| acceptw(r, &u)).fold(S::one(), prod))
            .fold(S::zero(), sumr)
    }
}

There’s a bit more machinery here to support the sum-over and product-over maps. There’s also the where S: Zero + One clause, which tells us that our Semiring must be something that understands those two notions and has implementations for them.

To restore our boolean version of our engine, we have to build a nominal container that supports the various traits of our semiring. To do that, we need to implement the methods associated with Zero, One, Mul, and Add, and explain what they mean to the datatype of our semiring. The actual work is straightforward.

pub struct Recognizer(bool);

impl Zero for Recognizer {
    fn zero() -> Recognizer { Recognizer(false) }
    fn is_zero(&self) -> bool { !self.0 }
}

impl One for Recognizer {
    fn one() -> Recognizer { Recognizer(true) }
}

impl Mul for Recognizer {
    type Output = Recognizer;
    fn mul(self, rhs: Recognizer) -> Recognizer { Recognizer(self.0 && rhs.0) }
}

impl Add for Recognizer {
    type Output = Recognizer;
    fn add(self, rhs: Recognizer) -> Recognizer { Recognizer(self.0 || rhs.0) }
}

Also, unlike Haskell, Rust must be told explicitly what kind of Semiring will be used before processing, whereas Haskell sees what kind of Semiring you need to produce the requested result and hooks up the machinery for you; that’s not surprising. In Rust, you “lift” a straight expression to a rigged one thusly:

let rigged: Regw<Recognizer>  = rig(&evencs);

All in all, porting the Haskell to Rust was extremely straightforward. The code looks remarkably similar, but for one detail. In the Kleene version of regular expressions we’re emulating, as closely as possible, the “all possible permutations of our input string” implicit in the set-theoretic language of Kleene’s 1956 paper. That slows us down a lot, but in Haskell the code for doing it was extremely straightforward, with two simple functions that create all the possible splits and partitions needed by the sequence and repetition operations:

split []     = [([], [])]
split (c:cs) = ([], c : cs) : [(c : s1, s2) | (s1, s2) <- split cs]
parts []     = [[]]
parts [c]    = [[[c]]]
parts (c:cs) = concat [[(c : p) : ps, [c] : p : ps] | p:ps <- parts cs]

In Rust, these two functions were 21 and 29 lines long, respectively. Rust’s demand that you pay attention to memory usage, and to the rules about it, requires that you be very explicit about when you want memory, so Rust knows exactly when you no longer want it and can release it back to the allocator.
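
For flavor, here is a compressed sketch of split over a slice of chars; the version in the repository spells out the loops and the allocations far more explicitly.

// A compressed sketch of split, behaviorally equivalent to the Haskell version.
fn split(s: &[char]) -> Vec<(Vec<char>, Vec<char>)> {
    (0..=s.len())
        .map(|i| (s[..i].to_vec(), s[i..].to_vec()))
        .collect()
}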

Rust’s syntax and support are amazing, and the way Haskell can be ported to Rust with little to no loss of fidelity makes me happy to work in both.

Okay, in part one I showed you what I’ve learned about writing regular expression engines in Haskell. The example shown uses Haskell chars, so it’s Unicode-compliant.

I also mentioned a semiring. Here’s what a Haskell semiring’s class looks like. (A Haskell ‘class’ is more like a Trait in Rust, or an Interface in Java; it describes a collection of behaviors that we want implementors of the class to have. The syntax for an implementation is an ‘instance’.)

class Semiring s where
    zero, one :: s
    mul, add  :: s -> s -> s

That’s… that’s it. s is a set, and these are the operations on the set. Remember that Int is a set: the set of all integer numbers.

The idea in the paper is to make regular expressions more generalized. To give them powers above and beyond what they’re normally capable of. This is also one of the main ideas in Barre, so I’ve been reading this paper with significant interest.

The idea here is to make the return type of the expression a Semiring. This is also the idea in Might and Adams, but here the authors go in a different direction. Because they have, as discussed last time, all the possible combinations of string and regex run at the same time, at the end of parsing, they end up with all possible parse trees automatically. The example showed all those resulting parse trees, which were lists (not sets) of boolean values being reduced down to, for sequences, “Are they all true?”, and for alternatives, “Are any of them true?” But for a brief moment, all possible parse trees exist.

So let’s create a semiring version of our regular expression. Better yet, let’s exploit the fact that we already have one. Here, the only real change is that we’ve said a rigged regular expression is one that takes a symbol and returns a semiring – a “rig”.

data Regw c s =                
    Epsw                       -- Epsilon
  | Symw (c -> s)              -- Character
  | Altw (Regw c s) (Regw c s) -- Alternation
  | Seqw (Regw c s) (Regw c s) -- Sequence
  | Repw (Regw c s)            -- R*

For our two “base” types, Epsilon and Symbol, we’ll have to write implementations ourselves; everything else can be implemented in terms of them by “lifting” the existing machinery, and here’s our lift:

rigged :: Semiring s => Reg -> Regw Char s
rigged Eps = Epsw
rigged (Sym c) = sym c
rigged (Alt p q) = Altw (rigged p) (rigged q)
rigged (Seq p q) = Seqw (rigged p) (rigged q)
rigged (Rep r)   = Repw (rigged r)

sym :: Semiring s => Char -> Regw Char s
sym c = Symw (\b -> if b == c then one else zero)

Those are the definitions, including the one that says: if the input symbol matches the constructed symbol, return the Semiring’s version of “one”, otherwise “zero”. Recall that the test expression for the original version of Sym was u == [c]; we’ve now said that Symw takes a predicate that compares two symbols.

The accept function must now be written to handle the rigged versions. In this case we pass the input string to a case statement, which says: if the input is a single symbol, pass it to the predicate; otherwise return zero:

acceptw :: Semiring s => Regw c s -> [c] -> s
acceptw Epsw u     = if null u then one else zero
acceptw (Symw f) u =
    case u of
      [c] -> f c
      _ -> zero
acceptw (Altw p q) u = acceptw p u `add` acceptw q u
acceptw (Seqw p q) u = sumr [ acceptw p u1 `mul` acceptw q u2 | (u1, u2) <- split u ]
acceptw (Repw r)   u = sumr [ prodr [ acceptw r ui | ui <- ps ] | ps <- parts u ]

In the Boolean version in the previous post, I used and and or to sum up all the resulting values of True and False that came back from the recognizer. I can’t do that here, because we don’t know what type of set the Semiring operates over; for Boolean, it would just be True/False. For other things…

But wait! Haskell can abstract that as well. We’ll replace and with prodr and or with sumr, and define them as folds over their semiring operations:

sumr, prodr :: Semiring r => [r] -> r
sumr = foldr add zero
prodr = foldr mul one

This works. If we’re talking booleans, add becomes ||, and mul becomes &&, and foldr (&&) True [True, True, True] is True, as is foldr (||) False [True, False, True], and so forth. Which gives us our first Semiring:

instance Semiring Bool where
    zero = False
    one = True
    add = (||)
    mul = (&&)

And then you can say:

> let nocs = Rep ( Alt ( Sym 'a' ) ( Sym 'b' ) )
> let onec = Seq nocs (Sym 'c' )
> let evencs = Seq ( Rep ( Seq onec onec ) ) nocs 
> acceptw (rigged evencs) "acc" :: Bool 
True

This is the point where I start to suspect Haskell of sorcery. You’ll notice that we didn’t actually associate the Semiring with our regular expression. No: we told acceptw that its job was to return a Boolean value, and since acceptw returns a Semiring, Haskell just went out and found a Semiring instance that does the job. There was only one implementation in the current scope that meets the definition “a Semiring of Bool,” so Haskell assumed that must have been the one I wanted, and just used it. To someone coming from the C/C++ world, that’s flaming magical. I understand it, but man, it’s magical.

All right. But I said Semirings can be defined on natural numbers. That looks exactly like you’d expect:

instance Semiring Int where
    zero = 0
    one = 1
    add = (+)
    mul = (*)
    

Yeah, zero is, um, zero. The number. There are no surprises here. So what happens when we ask:

> acceptw (rigged evencs) "acc" :: Int
1

“1”? What does “1” mean? It means that our example above has exactly one parse tree. The number of different ways our regular expression above can handle this string is: one. “aaa” gets you zero. And a string of twelve letters takes several minutes on my laptop, due to that crazy explosion of possibilities to check that I discussed in the previous post.

So is there a way to show this working? Well, sure. What would the regular expression (a|a*) do when handed the string "a"?

> let as = Alt (Sym 'a') (Rep (Sym 'a'))
> acceptw (rigged as) "a" :: Int
2

Two? Yes: both the left and the right alternatives returned their semiring one value, and the alternative added them together. Two.

> acceptw (rigged as) "aa" :: Int
1

One? Yes: Only the repetition returned anything, and it consumed the whole string. There was only one parse tree returned to be summed up.

> let bs = Alt (Sym 'b') (Rep (Sym 'b'))
> acceptw (rigged (Seq as bs)) "ab" :: Int
4

Four? Two for recognizing the a, and two for recognizing the b; the sequence operator multiplied those counts for each split of “ab” and summed over the splits: 1·0 + 2·2 + 0·1 = 4, with only the (“a”, “b”) split contributing.

And that’s where I am in the paper. I’ve gotten this to work. Here’s where things get interesting:

  • The third part of the paper discusses reducing that explosion-of-nodes issue by using Glushkov’s Construction to generate an efficient directed finite automaton. I want to use Brzozowski’s algorithm instead, so this’ll be interesting.
  • The semiring construction here returns parse trees and does analysis on them on a symbol-by-symbol level; this is different from Might’s use of a new regular expression type that acts as a parser-combinator, taking the parse tree and returning something built out of it, performing Henglein’s “catamorphism on parse trees”; Adams’ version takes this further and does the catamorphism the moment parsing is done. What this means is that Might’s & Adams’s work allows us to insert transformations in the middle of the expression, much the same way a PEG might return something constructed and of a different type, rather than merely a subset of the data provided.
  • Brzozowski’s algorithm with Might’s nullability operation can be made recursive; does that make the semiring construction invalid? (I don’t think so, since the analysis is done after the recursion has been successfully traversed, but I’m not sure.)
  • The semiring construction requires a whole parse tree, but Brzozowski’s algorithm with Might’s reduction node addition allows for the parser to return results at any time, yielding them, turning the parser into one of type Parser Iterator s, Iterator r => s -> r; in Rust terms, it takes an iterator as an input, and it returns an iterator; every time you call .next() on it, it consumes as many symbols from s as necessary until a coherent element of type r can be returned, or it returns an error. Can this technique be adapted to work in that environment?

I have two competing desires at this point: (1) Try to adapt the work they’ve done here to a Brzozowski implementation, or (2) try to implement the work they’ve done here in Rust. I haven’t decided which yet.

I’ve been working my way through A Play on Regular Expressions, a programming pearl by Sebastian Fischer, Frank Huch, and Thomas Wilke on the use of weighted values in regular expressions. The paper has six parts, and I’ve understood everything up through part two, and what I’ve understood already blows my mind. I recently put up a big chunk of documentation on where I am in Barre, and where I am is that I have a reliable, working toolkit for parsing regular expressions and returning the string found (I have “Smart Epsilons”), but I haven’t progressed much beyond that part.

This paper does something different. It starts with the idea of a Semiring as the basic “wrapper” type of the expression.

A semiring is a set R with two distinguished members and two operations, written (R, 0, 1, ⊕, ⊛). A semiring is an abstraction of common addition and multiplication over natural numbers (the integers from zero to infinity), with the requirement that 0 ⊕ x = x, 1 ⊛ x = x, and 0 ⊛ x = 0. (The full definition also asks that ⊕ and ⊛ be associative, that ⊕ be commutative, and that ⊛ distribute over ⊕.) That last rule, 0 ⊛ x = 0, is called “annihilation.” That’s it; it has two values from the set with specific meanings, two operations on members of the set that have specific meanings, and a couple of rules about the two values. Try hard not to set in your mind that semirings are about any one “thing” at all. It’s just an arbitrary set of rules that lets us do cool things.

A semiring can also be defined over a set of sets! In fact, regular expressions are themselves semirings: ⊕ defines the union of the results of two expressions, and is used in the Alternative operator; ⊛ is the Sequence operator, and defines all possible legal combinations of two expressions in sequence. An expression that recognizes nothing is zero; an expression that recognizes only the empty string is one. If an operation returns the empty set, it acts as an annihilator under ⊛.

But that’s not what this paper is about. This paper starts with a basic regular expression. If you’ve been with me at all, you know that regular expressions are six functions that take a string and return a value: Empty (always False), Epsilon (True on Empty String), Token (True if it matches the constructed Token), Alternative (True if any constructed expression within matches), Sequence (True if the two constructed expressions within match in order), and Repetition (True if the constructed expression within matches zero or more times). Here are the five main operations (the sixth, “Empty”, is assumed) in Haskell for a regular expression recognizer (boolean values only):

data Reg = 
    Eps         -- Epsilon
  | Sym Char    -- Token
  | Alt Reg Reg -- Alternation
  | Seq Reg Reg -- Sequence
  | Rep Reg     -- Repetition

accept :: Reg -> String -> Bool
accept Eps u       = null u
accept (Sym c) u   = u == [c]
accept (Alt p q) u = accept p u || accept q u
accept (Seq p q) u = or [accept p u1 && accept q u2 | (u1, u2) <- split u]
accept (Rep r) u   = or [and [accept r ui | ui <- ps] | ps <- parts u]

split :: [a] -> [([a], [a])]
split []     = [([], [])]
split (c:cs) = ([], c : cs) : [(c : s1, s2) | (s1, s2) <- split cs]

parts :: [a] -> [[[a]]]
parts []     = [[]]
parts [c]    = [[[c]]]
parts (c:cs) = concat [[(c : p) : ps, [c] : p : ps] | p:ps <- parts cs]

This says that we return true if the string is empty and the expression is empty; we return true if the string is a single char and the expression is a symbol of that character. split returns every possible way of dividing the string in two, at each point along its length; parts goes further and produces every possible partition of the string into non-empty substrings; an eight-letter word has 2⁷ (128) possible partitions:

split "abc" == [("","abc"),("a","bc"),("ab","c"),("abc","")]
parts "abc" == [["abc"],["a","bc"],["ab","c"],["a","b","c"]]

This is hideously inefficient. It has to be done repeatedly for every sequence and subsequence of the string and its expression; the number of partitions is exponential, 2ⁿ⁻¹ for a string of length n! But it does work. You can build a regular expression out of this and it does recognize strings reliably:

> let nocs = Rep ( Alt ( Sym 'a' ) ( Sym 'b' ) )
> let onec = Seq nocs (Sym 'c' )
> let evencs = Seq ( Rep ( Seq onec onec ) ) nocs
> accept evencs "acc"
True

In part two, things get really cool.

01Jan

I installed LSP and I Liked It


I was wrong about Visual Studio Code.

Not in the particulars, mind you. I’m still an Emacs user. It’s still my one and only IDE. But there was something about Visual Studio Code that bugged the hell out of me: on-the-fly syntactic analysis. I hated it. I thought it was a lazy developer’s feature, a hand-holding digital assistant who would nitpick the hell out of your work. That was the compiler’s job.

A few weeks ago I managed, mostly just out of curiosity, to install the Language Server Protocol engine for Rust, and to get it to work under Emacs. It turned out the biggest problem was a breakdown between the PATH as understood by the boot shell and the path understood by my terminal shell. Once I reconciled the two, Everything Worked™. And I’ve started to appreciate what I was missing. It’s not that it helps me with semantics, but that it helps me catch typos and misunderstandings, and since my academic side project involves three different languages that I’m not all that comfortable with (Rust, Haskell, and Racket), it’s been a heck of a boon.

I know I’m probably the last programmer on Earth to understand this, but in case I’m not: go ahead and try LSP, FlyCheck, and Company. It’ll really make a difference in the code-compile-test cycle.

13Dec

Barre, progress report: IT LIVES!


Barre: Progress Report, second week of December

The develop branch of Barre now fully implements all of the features of Adams, Might & Darais’s 2015 implementation, only it does so in Rust. The current price of running seems to be about 15μs (15 microseconds) per operation, although that’s a preliminary finding based on very short grammars.

This… is a frigging miracle. It works. It works, dammit. It can be done, and it seems to be fairly performant. It still has a problem with memory blowup, and at this point the only solution is collaborative garbage collection, but I have to do an analysis to make sure the nodes are mobile and not-fragile. I know my roots so compacting the arena is probably do-able, but I would have to determine the memory-vs-speed costs before implementing.

But it works. This is something of a minor breakthrough in theoretical computer science! I’m absolutely thrilled.

The biggest headache now is genericizing the reducers. Under the covers, a regular expression engine does one thing: it says "Yes" or "No" to whatever is being parsed. Reducers rely on whatever operation said "Yes" to store what it said "yes" to: the character, sequence, or list of alternatives that the parser engine found worth returning, and it does so in the form of a parse tree. These discoveries are stored in a special operation called an epsilon. Because alternatives can result in multiple matches, reducers deal in sets: each reducer takes a set of trees from its inner parsers and turns the result into something else.

This is incredibly useful. A string of epsilons that we would have to trudge through every time to get to the next operation can be collapsed into a single epsilon by a reducer on any epsilon ◦ epsilon operations.
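
In Rust terms, a reducer is roughly a function from a set of trees to a set of trees. Here is a sketch of that epsilon-collapsing case; the Tree type and the function are illustrative stand-ins, not Barre’s internals.

use std::collections::BTreeSet;

// Illustrative types only; Barre's internal representation is different.
#[derive(Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum Tree {
    Eps,                        // the result of matching an epsilon
    Leaf(char),                 // the result of matching a single token
    Pair(Box<Tree>, Box<Tree>), // the result of a sequence
}

// A reducer: take the set of trees an inner parser produced, rewrite each one.
// This one collapses "epsilon ∘ t" pairs down to plain "t".
fn collapse_left_epsilon(trees: BTreeSet<Tree>) -> BTreeSet<Tree> {
    trees.into_iter()
        .map(|t| match t {
            Tree::Pair(l, r) => {
                if *l == Tree::Eps { *r } else { Tree::Pair(l, r) }
            }
            other => other,
        })
        .collect()
}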

Optimizations on the current engine rely on reducers. The rebalancing of sequence trees from left-to-right requires that the results be balanced back to right-to-left afterward, and a reducer does this. Multiple reductions in sequence can be rebalanced to a single, composed reduction on the results of the whole sequence. And when there are alternatives of alternatives, the results have to be permuted appropriately.

All well and good. And those internal reductions all work well and are well-applied inside the grammar. But there’s a complicating headache: external reducers.

The user of this library will want to add in-line reducers to modify the discovered parse tree into something else. A basic example is the regular expression standby, the capture group. A capture group is a reduction: in this case, taking the product of the regular expression inside the capture group and adding a label. And that’s where the problem lies: there’s no place in the existing datatype for a label. Or anything else. Decorating the returned tree with additional information is currently not possible.

The whole point of this exercise is to produce ASTs, IRs, and other internal representations of raw data of some kind. The problem is that client reducers will not have the same return type. Even worse, client reducers will have a headache of a time dealing with the internal representations of the graphs produced by Barre.

As an exercise, I’m going to try to implement Darais’s "Language of Languages," which provides a lot of convenience methods for defining other languages. It relies on external reductions, so maybe it’ll teach me a thing or two about how to get it right.

Current To-Do list looks like this:

  • Implement right-handed optimizations
  • Implement Language-of-Languages
  • Implement Capture Groups
  • Determine Reducer return type strategy (look at pom for tips)
  • Implement a parser-of-regex parser that supports the Rust Regex dialect
  • Benchmark Barre vs Rust Regex
  • Benchmark Barre vs RE2
  • Implement the negation, intersection, and interleaf operators
  • Implement a parser-of-regex parser that supports a meaningful subset of the Perl Six dialect
  • Implement the cut operator, which enables POSIX-compliant matching
  • Swap out the set return type for a poset type, to enable prioritized matching
  • Implement a Derivatives of Parser Expression Grammars library
  • Implement ROSIE in Rust
  • Implement a PEGjs-like tool for generating advanced parsers in Rust
  • Genericize the PEG producer to work with other languages.
  • Kill YACC.

So, I’m mad about this.

The original paper about an implementation of Brzozowski’s Parsing Algorithm is Matt Might’s Parsing With Derivatives: A Functional Pearl. For three weeks I’ve been hung up on a problem with the implementation that wasn’t making any sense to me, and when I realized my mistake it came down to this pair of sentences:

  • The derivative of the empty parser is empty: Dc(∅) = ∅.
  • The derivative of the null parser is also empty: Dc(ε) = ∅.

The "empty parser" is called empty because the set of words it recognizes contains nothing: {}. It recognizes no words¹ at all. Another name for the empty set is the NULL set, which is why the null symbol, ∅, is used there. The null parser is called that because it parses the empty string, the string written "" if you’re using the Latin alphabet; in C, C++, and other early languages the empty string was often marked by an actual symbol in the language, NUL, which referred to the ASCII byte 0.

If you’re confused, good: because so was I. It’s especially not clear when the author writes, "The null string becomes the consume-nothing parser, while the empty set becomes the reject-everything parse." Note the change in terminology: here he means the null string parser and the empty parser, which in other places are called, respectively, the empty string parser and the null parser.
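
In Rust terms, the distinction is just two different constructors; the names below are mine, chosen to keep the two ideas apart, not Might’s.

// A disambiguation aid, not anyone's actual parser type.
enum Parser {
    Empty,   // the empty parser, written ∅: its language is {}, so it recognizes no words at all
    Epsilon, // the null-string parser, written ε: its language is {""}, the empty string and nothing else
}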

Once I figured out my confusion, figuring out how data preservation actually works in Might’s formula was easier. But man, the people writing these math papers often do not mean for them to be read, do they?


¹ "Words" in this case does not mean what it means in English or other languages; it just means sequences of letters in the alphabet making up meaningful strings, and the separation between them is arbitrary and not necessarily a whitespace or other punctuation.

28Nov

A bit of analysis while working on Barre.


I’ve been worrying a lot about the next stage of my project and realizing that I had seriously screwed up somewhere along the way. The next task on my list is to convert the recognizer into a parser.

I’m going to refer to a couple of different things here. A quick bibliography:

  • Herp, Rerp, and Derp all refer to the final variants produced by David Darais in 2011, after Matt Might’s original code.
  • Adams 2016 refers to the Derp-3 implementation in Racket, which as far as anyone knows has the greatest number of optimizations.
  • McGuire 2012 refers to Tommy McGuire’s Java implementation.

There are three different main issues that need to be covered by this code.

First is the generation of a new parser for each new character, in much the same way that parser combinators work. Ideally this new parser is lazy, in that it doesn’t actually recurse down a tree until an item is needed, and once the item has been generated it is memoized so that it doesn’t need to be generated a second time.

Second is the determination of nullability. Since these parsers are defined equationally, given a language 𝑳 we calculate the fixed point of 𝑳 in the classic way: feed 𝑳 into the equations repeatedly until 𝑳 doesn’t change. Computed this way, nullability for 𝑳 is both non-recursive and efficient.
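
Here is a sketch of that fixed-point loop over a grammar held in an arena of nodes; the names and the representation are illustrative, not Barre’s. We start everything at false and keep re-evaluating until nothing changes.

// Sketch: least-fixed-point nullability over an arena of grammar nodes.
#[derive(Clone, Copy)]
enum Node {
    Empty,
    Eps,
    Sym(char),
    Alt(usize, usize), // indices of the two alternatives
    Seq(usize, usize), // indices of the two halves of the sequence
    Rep(usize),        // index of the repeated node
}

fn nullable_fixpoint(arena: &[Node]) -> Vec<bool> {
    let mut null = vec![false; arena.len()];
    loop {
        let mut changed = false;
        for (i, node) in arena.iter().enumerate() {
            let n = match *node {
                Node::Empty | Node::Sym(_) => false,
                Node::Eps | Node::Rep(_) => true,
                Node::Alt(a, b) => null[a] || null[b],
                Node::Seq(a, b) => null[a] && null[b],
            };
            if n != null[i] {
                null[i] = n;
                changed = true;
            }
        }
        if !changed {
            return null;
        }
    }
}

Because a node’s nullability only ever flips from false to true, the loop terminates, and it converges even when the grammar refers to itself.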

Third is the extraction of the result from the language. This is where I’m a bit stuck, and trying to reverse-engineer the result from the code examples above.

The problem I have at the moment is that there appear to be two different ways of calculating and managing the nullability and extraction issues. To understand the issue, you have to know that nullability is important not just when determining whether or not the parser is done, but whenever the parser needs to do concatenation, the single most common operation in regular expressions; for concatenation, the next parser-combinator generated is different depending on whether or not the first symbol in the concatenation machinery is nullable. In Matt Might’s original Parsing With Derivatives: A Functional Pearl, he has a simple nullability function for recognition, but uses a different nullability function, a sort of “automatic cancellation” feature, in concatenation. McGuire also goes this route.
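
For reference, here is the recognizer-level version of that rule, written out as a plain (non-lazy, non-memoized) sketch; the Seq case is where the nullability of the first operand changes which parser comes next.

// A plain Brzozowski sketch, not Barre's lazy/memoized machinery.
#[derive(Clone)]
enum Re {
    Empty,
    Eps,
    Sym(char),
    Alt(Box<Re>, Box<Re>),
    Seq(Box<Re>, Box<Re>),
    Rep(Box<Re>),
}

fn nullable(r: &Re) -> bool {
    match r {
        Re::Empty | Re::Sym(_) => false,
        Re::Eps | Re::Rep(_) => true,
        Re::Alt(a, b) => nullable(a) || nullable(b),
        Re::Seq(a, b) => nullable(a) && nullable(b),
    }
}

// Dc(a·b) = Dc(a)·b            when a is not nullable
// Dc(a·b) = Dc(a)·b ∪ Dc(b)    when a is nullable
fn derive(r: &Re, c: char) -> Re {
    match r {
        Re::Empty | Re::Eps => Re::Empty,
        Re::Sym(s) => if *s == c { Re::Eps } else { Re::Empty },
        Re::Alt(a, b) => Re::Alt(Box::new(derive(a, c)), Box::new(derive(b, c))),
        Re::Seq(a, b) => {
            let left = Re::Seq(Box::new(derive(a, c)), b.clone());
            if nullable(a) {
                Re::Alt(Box::new(left), Box::new(derive(b, c)))
            } else {
                left
            }
        }
        Re::Rep(a) => Re::Seq(Box::new(derive(a, c)), Box::new(Re::Rep(a.clone()))),
    }
}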

Darais goes back to the original way of doing things: nullability is a simple true/false operation with no special behavior. Its effects on other equations are handled in those equations as special constructors, which is very useful when the constructors make decisions not about themselves, but about the results of their derivatives, a two-level analysis for performance reasons.

Adams goes even further; his code is deeply optimized with some very powerful features, but the biggest problem I have is that he does productions piecewise in order to reduce the length of the analysis chain the further into the expression you go. This kind of optimization is highly challenging in a strongly typed language like Rust, which is why I’ve been having such a hard time with it.

Adams does this with “Reductions,” which Might introduced in his original paper in the context of discussing parser combinators, using them interleaved with the parsing process, which again, is a challenge in a typed language.

This week, I’m going to focus on very small incremental features. I’m just going to implement Literals and Alternatives, and see if I can’t get logical extractions out of them. If I can, then I’ll try and get Concatenation working.

My fear is that, when that time comes, I’ll have to implement Reductions as well. Since I’ve been using this project as a way to learn Rust, I’ve not yet delved into some deeper parts of it, like, for example, function pointers, which is how Reductions are stored. This should be fun.

19Nov

Making Things and Going Slowly



For the record, version 1.0 is supposed to support POSIX-style regular expressions with character classes, ranges, and limited backtracking, providing a drop-in API compatible with the Rust regex library without the PCRE context-sensitive expressions such as ‘(x?)’ and ”. I maintain that there’s something wrong with a regular expression library that can’t use itself to parse out regular expressions, which is why "self-hosting" is on the post-1.0 list.

I started the Barre project ten weeks ago. I expected it to take about four to six weeks, and yet here I am, ten weeks in, and I still don’t have a full implementation of the core, much less any of the extensions.

I’m disappointed by this slow progress, but not surprised. Half of it has been learning Rust. I mean, I thought I knew Rust when I finished all of the exercises in the Programming Rust textbook, but that was only half the story. I’d only done exercises and built off the work in the book. I hadn’t written a program in Rust on my own, "editing a blank page." Rust does not support graph structures comfortably; the options for doing so are limited and verbose, and my choice to use a memory arena means that I had a different set of difficulties with convincing Rust’s correctness operators to give me access to nodes as I needed them.

The other half has been that, no really, implementing laziness in a language that doesn’t support it natively is really tricky. Tommy McGuire’s implementation is blunt and obvious, and while his code was excellent inspiration for what I’m trying to achieve, it’s not written in a way that makes it easily portable to Rust. On the other hand, Michael Adams’s implementation, while much more complete and tractable in Rust terms, is also written in a highly advanced Racket that re-uses fields, treats types as run-time signals, and does deep exploitation of Lisp’s list-like structures to perform post-hoc transformations. Trying to synthesize out of both of those, as well as Matt Might’s original implementation, has proven more challenging than I imagined.

As always, there’s too much to learn, and not a whole lot of time in which to do it all.

I’ve made some excellent progress on the Barre project. Along the way, I discovered that implementing a theoretical CS paper has some annoying difficulties, the worst of which is figuring out a translation table for the different notations used by different researchers in the given field. In describing my progress on Barre, and implementing Brzozowski’s Regular Expressions in Rust, today I’m going to talk about a different problem: orthogonality.

In the four core papers on Brzozowski’s 1965 algorithm (Might 2010, Might 2011, Adams 2015, and Sulzmann 2010), there is a progression, and a swerve, about the nature of implementing a parser via Brzozowski derivatives. Brzozowski himself was talking only about regular expression recognizers, which can only tell you if a string matches a given regular expression, a boolean result. Brzozowski realized that, equationally, one could put an instance of the parser inside itself, creating a recursive structure that would gracefully handle backreferences and even context free grammars (CFGs). Unfortunately, no programming language in 1965 could handle Brzozowski’s formulation. In 2009, Scott Owens published a paper showing it could be done with modern languages.

Scott Owens showed that it could work as a recognizer. Matt Might made it work as a parser, and then contributed some significant performance-oriented optimizations based on the observation that the derivative of the derivative of some parsers would have redundancies that could be avoided. Adams made it go fast. Sulzmann contributed an important step toward making it deterministic and predictable.

But in trying to implement the results of all these papers together, I’ve discovered that there are several moving parts:

  1. Recognizing regular expressions
  2. Recognizing context-free grammars
    1. Memoizing derived parsers for re-use
    2. Avoiding derivations until needed
  3. Extracting parse trees from the recognizer
  4. Performance-related optimizations
    1. Node optimizations
    2. Control-flow optimizations
  5. Common extensions.
    1. Character classes and ranges
    2. Capture groups
    3. Named capture groups
    4. Ranges
    5. Back references
    6. Forward references
    7. Intersection and negation
  6. Preambles and contextual symbols
  7. Self-hosting: The regexp parser is written in itself.
  8. Alternative parse languages
    1. PCRE
    2. Perl6
    3. Peg
    4. Rosie

It’s easy, reading through the papers, to think that I might want to implement these in order, or that one is a pre-requisite for another. The first, "Recognizing regular expressions," is a pre-requisite for everything else.

Right now, I have a working implementation of (1). I think I have working implementations of (2), which require (2.1) and (2.2) to be complete. CFGs only work with memoization (so that we can find the recursive entry) and laziness (so that a recursive call doesn’t go into an infinite loop).

It was after I had the CFG recognizer "working" (it passes tests, but I’m not confident I have the algorithm perfectly correct) that I started working on the extractor– and hit a wall. The code is fairly complicated, and I made the mistake of implementing (4.2), because it made for an interesting challenge in Rust.

But I realized yesterday that implementing (3), extracting the data from the parser, was a perfectly compatible step with (1), the simplest of all the implementations. I didn’t have to worry about recursion, or the node’s structure types during the run.

The lesson here is to identify, down to the smallest possible unit, what an individual implementation step is, and do only that step. Implementing parse tree extraction on the simplest possible code is the only way for me to understand the implications of doing it with much more complex code, and having it at-hand means that the pattern for implementing it for future extensions such as character classes, intersection and negation, will already be understood, and not hacked ad-hoc into those node types at the same time I’m trying to figure them out at all.

Plus, being aware that those future node types are coming will encourage me to try and write the parse_tree_extract() function in a way that has the least impact on the highly diverse node types, and try to keep the zoo of node types to a minimum.
