I went to the HackerX recruitment event this week and, while it was interesting, the event was awkward. It’s a bit like speed-dating: 36 companies show up and you have five minutes per company to hear their pitch and make your own, and then time is called and you move on to the next company. The round lasts an hour, so you have time to visit at most 12 recruiters.

The problem is that three times as many people showed up for the event as there were seats. In order to "facilitate" this problem, you were placed in a waiting line per row, and depending upon your starting position you could fall out of your line, in which case you got to pick a new waiting line. And the lines were initialized in order, with the first person in line being sent to the end of the row.

In other words, if you were a go-getter who arrived early, you got to talk to exactly one recruiter before you fell out of the room and had to go to a waiting line, and there was very little chance you’d get back into the room.

Even worse: that recruiter was always Amazon. The last entry of every row was Amazon: AWS, Amazon Logistics, or Amazon Internals. It didn’t matter. It was Amazon. All Jeff, all the time.

I was second in line. I got to talk to two recruiters, and I really didn’t care to talk to Amazon, although the people behind the table were nice enough. In line (I really wanted to talk to the Edu startup), I did get to talk to a woman who was looking to move from startup to enterprise, and some recruiters did come out and talk to us in line, having mercy on our patience, but they were the HR assistants to the engineering recruiters, and couldn’t answer many of my questions about work environment, tooling or projects.

I’d probably go again, but I’d really prefer less of a free-for-all.

I have hit a snag in my project. This entry is me thinking about solutions.

My goal was a reasonable one:

There are a million other "goals" that can go with this:

  • Write a variant that produces a DFA, allowing for static compilation
  • Write a variant that produces a bitmap, allowing for fast comparison
  • Write a variant that is-a iterator, such that some expressions can be marked "progressive," and when is complete emit the contents
  • Write a relationship between the parser and the semirings such that progress can be suspended and restored as needed
  • Add negation, complement, and interleaving as regular expression operators
  • Write Darais’ Language-of-Languages
  • Write a PEG parser on top
  • Write ROSIE in Rust
  • Instantiate backreferences
  • Add a concrete stream operator

The idea is that Barre/Barge is a generalized but still highly performant toolkit for building recognition, matching, and parsing engines, and the nature of the operation depends on passing it an output dependency with an interface that is theoretically sound and operationally powerful.

Unfortunately, I’ve hit this snag: The classic implementation of parsing-with-semirings can produce parse tree via the Derivation Forest Semiring, but disambiguating that the forest down to a single tree requires a different semiring to implement one of First Match / Longest Match / Longest Atomic Match. Adams and Darais provide a derivation forest engine, and it’s the one I’ve implemented, but there’s one special case.

Imagine a sequence of regular expressions: "abcde" These can be depicted in one of two ways: (a, (b, (c, (d, e)))) or ((((a, b), c), d), e). Computationally, that second one, "left heavy," is troubling: for every character, you’ll have to transition down the tree to reach the character, traversing four nodes just to get to the a expression. Darais implemented Adams’ suggestion that the tree be rebalanced to make it "right-heavy," and then a post-parsing operation is placed in front of the node to rebalance the resulting parse tree back to a left-heavy representation so that the results are consistent with expectations.

The problem with this is that there’s no clear implementation that corresponds to the semiring. It’s an engineering-versus-math issue, and I’m not yet sure how to deal with it, although just writing it out does help me see some of the potential solutions. The basic solution is to keep it all: the raw parse tree, the raw disambiguation matrix, and the product of the parse as quickly as the expression specification can produce it.


Current Bibliography

Posted by Elf Sternberg as Uncategorized

I’m just keeping this here, to keep me honest about what I’m reading.

  • A Play on Regular Expressions: A Functional Pearl. Sebastian Fischer, Frank Huch, Thomas Wilke.
    • Solid introduction to parsing with semirings in Haskell
    • Haskell is straightforward and accessible
    • Readable, breezy style
    • Semirings covered: Recognition, Counting, Viterbi-Best (Maxlong, Firstlong)
  • Algebraic Foundations of Semiring Parsing. Yudong Liu
    • Discusses semiring parsing as unified framework
    • Unites semiring parsing and finite state transducers
    • Describes a mechanical C++ framework for the pieces/parts
    • Looks a lot like Barge/Barre
    • Good overview, fairly accessible.
  • An Algorithm for RELAX NG Validation James Clark.
    • Describes the Interleaf operator in some detail.
  • Bit Coded Regular Expression Parsing. Lasse Nielsen.
    • Presentation, so short on details
    • Describes bit coding as a way to do comparisons on regexs with less than machine-width alternatives
  • Bitsets Match Regular Expressions, Compactly. Paul Khuong
    • Describes the Bitap algorithm for bitmap encoding
    • Describes graph minimization as assuming that all states are singular, and breaking them up as we prove the assumption wrong.
  • Constructing Human Grade Parsers. Joe Karter.
    • Wish-list of parser capabilities outside the actual parsing engine
    • Parse errors should be recoverable
    • Parser should be able to produce partial output
    • Requires tooling outside the formalities of regex and semiring parsing
  • Design and Analysis of String Representation. R. Sekar.
    • Describes Owens, et. al.’s DFA generator as a realization of McNaughton-Yamada
    • Describes alternative search algorithms
  • Efficient and Flexible Incremental Parsing. S. L. Graham
    • Describes incremental parsing, in which a document tree and parse tree are maintained and updated together.
  • Error-Correcting Parser Combinators S. Doaitse Swierstra, Pablo R. Azero Alcocer
    • Haskell-based parsing language
    • Describes error-handling in parsing processes
    • Describes recovering from errors to continue parsing
    • Uses combinators, though
  • Total Parser Combinators Guillaume Allais
    • Describes a combinator-based parser with guaranteed termination
    • Uses Brzozowski as its basis
    • Proves that the combinator recognizer is a semiring
    • Written in Agda (eek!)
  • Kleene Meets Church. Fritz Henglein
    • Has the "theory vs practice" description
    • Ideal: parsing + catamorphic processing
    • Actual: FAs + ad-hoc instrumentation
  • On the Complexity and Performance of Parsing with Derivatives Michael D. Adams, Celeste Hollenbeck, Matthew Might
    • Provides both theoretical and mechanical optimizations
    • Replaces the recursion nullability with data-flow analysis
    • Provides basis for Darais’s implementation
  • Recognising and Generating Terms using Derivatives of Parsing Expression Grammars. Tony Garnock-Jones, Mahdi Eslamimehr, Alessandro Warth.
    • Describes the derivatives of PEGs
    • Describes the δPEG algorithm without backtracking for negation
    • Describes the derivation of prioritized choice
  • Regular Expression Derivatives Reexamined Scott Owens, John Reppy, Aaron Turon.
    • The paper that started the current project.
    • Section 3.3: Using derivatives for DFA construction.
    • Describes equivalence sets in the face of symbol class operators
    • Describes regular vectors for multithreaded matching.
    • Describes integration of complement and intersection operations
  • Regular Expressions as Types. Fritz Henglein
    • Introduced the "ad-hoc" slander about PCRE. 😊
    • Describes regex as a set membership problem
    • Describes types systems as "regular languages"
    • Clarifies emitter location for stream-processing
    • On the Kleene star operation
    • May require different Kleene stars, or a visitor with some context awareness.
  • Semiring Parsing. Joshua Goodman.
    • Mentioned by Liu
    • Covers the initial history of semiring parsing
    • Describes derivations forests (see Might & Adams) in a concise way (sec 2.5.1)
    • Describes Viterbi n-best as a probablistic disambiguation strategy
  • Stream Processing Using Grammars and Regular Expressions Urlik Turk Rasmussen
    • Has a great explanation of the "Regular Expressions as Types" approach.
    • Leads into regex equivalence sets
    • Discusses disambiguation strategies with equivalence sets
    • Discusses bit coding of unions (no μ-regex or e-regex)
    • (Con) Uses bit-coding as a parse tree extraction tool
    • Discusses alternative needs
      • Recognition
      • Matching (capture groups)
      • Parsing (parse trees)
    • Discusses two-pass analysis (Might) vs one-pass (Adams)
    • Discusses difficulty of regex composition (see Conway 2018)
    • Discusses CFGs as an alternative DSL for regex composition (PEG, EBNF)
    • Is actually 5 papers; the other 4 cover algorithm’s evolution
    • Fantastic overview overall
  • Taxonomies and Toolkits of Regular Expression Algorithms Bruce William Watson
    • Solid overview
    • Fairly long
    • Excellent coverage of DFA minimization
    • Discusses prefix/suffix scanning via Aho-Corasik, et.al.
    • Cover Brzozowski’s algorithm as a performance-based alternative DFA constructor
  • Yacc is Dead Matthew Might, David Darais
    • Started this whole mess
    • Still a pretty good introduction.

Denis Kyashif recently wrote a piece called "Implementing a Regular Expression Engine," and showed how to do it in Javascript. It’s a good piece, and I recommend it. Kyashif does a very good job of describing regular expressions from a Turing-theoretic approach using finite automata, but there’s another way to think about regular expressions, and that’s from the Church-theoretic approach. And since I’ve implemented sixteen seventeen regular expression engines in a variety of languages from a primarily Church-theoretic approach, the approach originally used by Stephen Kleene in 1956, I’m going to cover how to think about regular expressions from that direction.

Alphabets, Languages, Primitives and Composites

You’ll notice in the DFA approach that rarely does one ask the question, "What is a regular expression expressing, and about what?" The "about what" is easier to explain: it’s called a regular language and is a set of strings composed out of a finite alphabet of symbols. The most common symbols are just character sets, and the most common of those are the ASCII Set and the Unicode Standard.

The set of strings in a regular language can be finite and concrete. In the common DSL known as regex, "foo" is a regular language of one string. "foo|bar" is a regular language of two strings. Or they can be infinite: "(foo|bar)*" is a regular language with an infinite number of strings that consist of the repetition of the previous example: "foobarbarbarfoo" and "barfoobarfoofoo" are both in that language.

In programming terms, a regular expression is a function that takes a string and returns a boolean indicating whether or not that string is a member of a specific regular language. That function is composed out of six other functions, three of which are called the primitives and three of which are composites built out of other regular expressions. All of these functions are themselves regular expressions, and have the same inputs and outputs.

  • null(s): Always returns False
  • empty(s): Is the string empty?
  • symc(s): Constructed with the symbol c, does the string consist of (and only of) the symbol c?
  • altr1,r2(s): Constructed out of two other regular expressions, and true only if s matches either. This is the ‘alternatives’ operator, specified in regex with the pipe symbol: |
  • seqr1,r2(s): Constructed out of two other regular expressions, and true only if s consists of the first expression immediately followed by the second.
  • repr1(s): Constructed out of another regular expression, true if s consists of zero or more repeated instances of that expression. This is the star operator: *

Every regular expression is a tree of these sub-expressions, and given a string, it starts at the top of the tree and works its way down the chain of functions until it determines the truth proposition "is this string a member of the language described by this regular expression?"

Expressing regular expressions Kleene-ly

In Haskell, it’s possible to turn Kleene’s formula directly into source code, and from here, it’s possible to convert this directly to Javascript. This is the Haskell version:

data Reg = Emp | Sym Char | Alt Reg Reg | Seq Reg Reg | Rep Reg

accept :: Reg -> String -> Bool
accept Emp u       = null u
accept (Sym c) u   = u == [c]
accept (Alt p q) u = accept p u || accept q u
accept (Seq p q) u = or [accept p u1 && accept q u2 | (u1, u2) <- split u]
accept (Rep r) u   = or [and [accept r ui | ui <- ps] | ps <- parts u]

split :: [a] -> [([a], [a])]
split []     = [([], [])]
split (c:cs) = ([], c : cs) : [(c : s1, s2) | (s1, s2) <- split cs]

parts :: [a] -> [[[a]]]
parts []     = [[]]
parts [c]    = [[[c]]]
parts (c:cs) = concat [[(c : p) : ps, [c] : p : ps] | p:ps <- parts cs]

Those split and parts functions are necessary to express the sequence (r1 followed by r2) operator from Kleene’s original math:

L[[r · s]] = {u · v | u ∈ L[[r]] and v ∈ L[[s]]}

Those membership tests are universal: every possible combination of sequences in the string being tested must be tested. split() takes every possible substring and decomposes it into a list of all possible pairs of strings, so that Seq{} can be compared against them. parts() goes even further, devolving every possible substring into the powerset of lists of strings, so that Rep{} can be compared to every possible variation of the ordered input string. Mathematically, this is elegant and sensible; computationally, it’s inefficient and ridiculous; a string of n letters requires 2n-1 tests!

Doing It In Javascript Typescript

Porting the Kleene version of this to Javascript was difficult only insofar as Javascript is notorious about copying vs. referencing, especially when it comes to heavily nested arrays like those used above. The ports of split and parts were also significantly more complex, although Typescript’s type system was of enormous help in sorting out what was happening each step of the way.

The conversion is straightforward because the Haskell doesn’t use any higher-kinded types: no applicatives, no functors, and certainly no monads!

The datatype Reg in Haskell, along with a couple of convenience factory functions, becomes:

interface Regcom { kind: string };
class Eps implements Regcom { kind: "eps"; };
class Sym implements Regcom { kind: "sym"; s: string; }
class Alt implements Regcom { kind: "alt"; l: Regex; r: Regex };
class Seq implements Regcom { kind: "seq"; l: Regex; r: Regex };
class Rep implements Regcom { kind: "rep"; r: Regex };

function eps():                   Eps { return { kind: "eps" }; };
function sym(c: string):          Sym { return { kind: "sym", s: c }; };
function alt(l: Regex, r: Regex): Alt { return { kind: "alt", l: l, r: r }; };
function seq(l: Regex, r: Regex): Seq { return { kind: "seq", l: l, r: r }; };
function rep(r: Regex):           Rep { return { kind: "rep", r: r }; };

type Regex = Eps | Sym | Alt | Seq | Rep;

And the accept looks remarkably similar. The some() and every() methods on Arrays were especially useful here, as they implement the same behavior as and and or over Haskell lists.

function accept(r: Regex, s: string): boolean {
    switch(r.kind) {
    case "eps":
        return s.length == 0;
    case "sym":
        return s.length == 1 && r.s == s[0];
    case "alt":
        return accept(r.l, s) || accept(r.r, s);
    case "seq":
        return split(s).some((v: Array<string>) => accept(r.l, v[0]) && accept(r.r, v[1]));
    case "rep":
        return parts(s).some((v: Array<string>) => v.every((u: string) => accept(r.r, u)));

split() required a significant amount of munging to make sure the arrays were copied and not just referenced, but looks much like the Haskell version:

function split(s: string) {
    if (s.length == 0) {
        return [["", ""]];  
    return [["", s.slice()]].concat(
            (v) => [s[0].slice().concat(v[0].slice()), v[1].slice()]));

parts(), too, learns a lot from the Haskell version:

function parts(s: string): Array<Array<string>> {
    if (s.length == 0) {
        return [[]];

    if (s.length == 1) {
        return [[s]];

    let c = s[0];
    let cs = s.slice(1);
    return parts(cs).reduce((acc, pps) => {
        let p: string  = pps[0];
        let ps: Array<string> = pps.slice(1);
        let l:  Array<string> = [c + p].concat(ps);
        let r:  Array<string> = [c].concat(p).concat(ps);
        return acc.concat([l, r]);
    }, [[]]).filter((c) => c.length != 0);

You use this code in a straightforward fashion:

let nocs = rep(alt(sym("a"), sym("b")));
let onec = seq(nocs, sym("c"));
let evencs = seq(rep(seq(onec, onec)), nocs);
console.log(accept(evencs, "abcc") == true);           // "true"
console.log(accept(evencs, "abccababbbbcc") == true);  // "true

Wrap up

Most regular expression "under the covers" tutorials come from a Turing-theoretic approach, describing the finite automata that transition from state-to-state, ultimately ending up somewhere in a table with a flag that says "This is an accept state."

I approached this from a Church-theoretic approach. The Church-Turing Thesis says that these two approaches are equivalent, but use different notation. Turing’s approach is mechanical and engineering oriented; Church’s approach and notation are mathematical.

Stephen Kleene’s original 1956 paper on Regular Expressions was primarily written from a Church-theoretic approach, and I showed that this approach can legitimately, if inefficiently, be implemented in an ordinary programming language like Javascript. I showed how Kleene’s six basic operations can be composed together to create complete and effective regular expressions.

The code for this Typescript implementation, the eight other Haskell variants, the seven other Rust variants, and one Python variant, are all available on Github.

When last we left our hero, he had successfully implemented Rigged Regular Expressions in Rust.

There’s been a lot of progress since then.

Matt Might’s Parse Result Data Type Is A Semiring!

One of my strongest intuitions when I started reading A Play on Regular Expressions was that under the covers Matt Might’s implementation of Brzozowski’s Regular Expressions used a data structure exactly like those in the Play paper. I was correct; I was able to implement a (non-recursive) Brzozowski engine in Haskell, and use a Semiring formalism to accurately reproduce Might’s outcome and data structure from Parsing With Derivatives: A Functional Pearl.

And Might’s parser-combinators are functors between semirings!

To me, this is something of a monumental discovery. Intellectually, I’m happy to show that the data structures Might is using are semirings and his parser-combinator approach is actually just functors between semirings. Personally, I’m thrilled that I was able to do this in Haskell, writing code that is not from either paper to show that the two have signficant overlap. Even better, I was able to show a different category of complexity can be encompassed by the semiring properties from those in the Play paper.

Semiring-as-a-trait in Rust

The first thing is that, in that last post, I discussed not having to implement a Semiring trait in Rust, because I could abstract it further using the num_trait crate to define my zero and one, and the std::ops trait to define multiplication and addition.

Unfortunately, that didn’t hold. It worked fine when my values were primitives such as Boolean or Integer, but when I implemented Matt Might’s sets of strings as a return value, the std::ops implementations were insufficient. To work as an operation, Rust requires that the data must be bitwise copyable without loss of fidelity; unfortunately, bitwise copying of sets-of-strings doesn’t work; the underlying implementation has much of its data distributed across the heap. So I had to implement the Semiring as a trait to exploit Rust references and get high-performance operations, and it worked just fine.

You can write Haskell (poorly) in any language

At one point developing the Rigged Brzozowski in Haskell implementation, I was stuck on a hard-to-surface bug. That whole bit about how "If it compiles it’s probably correct" Haskell superiority nonsense is nonsense; there are lots of little fiddly bits in this algorithm ("Is this a sum or product operation?" "Is the initializer one or zero?" "Is the precedence correct?" "Is this test on the left or right operand?") that can go wrong.

I ported the implementation to Python in order to litter it with print statements and watch it work. This was useful, and helped me track down the bug. I was able to port the working instance back to Haskell easily.

One thing to note is that the Python is rather… odd. There are the six regex operations that can be performed on a string which are both unique and finite. By using namedtuple I was able to create the same structure as the Haskell data or Rust enum operation, and with clever use of Python’s reflection capabilities I was likewise able to make Python do something like Haskell or Rust’s pattern-matching, using dictionaries. The result is, to my eye, pleasing, but I’ve been told it’s "not very Pythonic."

Easier to maintain, at any rate.

Are we done here?

That’s… really the twelve thousand dollar question now, isn’t it? I’ve finished sections one and two; the third section is about adopting this framework to recursive regular expressions, which I’m already somewhat proficient in from working with Darais’ Racket implementation. So there are a couple of different ways I could go about this:

I could just keep plugging away

I could proceed as I have been, implementing:

  • Heavyweights using Brzozowski in Haskell
  • Heavyweights using Brzozowski in Rust
  • Recursive Regular Expressions in Haskell
  • Recursive Regular Expressions in Rust
  • Recursive Brzozowski Regular Expressions

I could show it works in C++

I could combine what I did with the Python implementation and my limited (very limited) C++ knowledge and try to port one of the Rigged Glushkov engines to C++. The state-of-the-art in C++ unicode support looks absolutely terrifying, though.

I could add to the feature set: incremental regular expressions

One of the big changes I made in Rust was that, toward the end, I changed the input value from an Iteratable to an Iterator, thus simplifying the API. I want to do the same thing for the output, that is, I want the receiver to get not just a semiring containing a set, but to get instead an iterator that produces elements from the semiring as they’re produced, in order. I want to create an incremental regular expression.

I could add to the feature set: compile-time regular expressions

In the paper that started this whole thing, Owens, Reppy & Turon showed (Section 3.3) that Brzozowski’s algorithm can produce static DFAs, and that high-performance compile-time regular expressions are possible. Combined with Rust’s Procedural Macros and the iterator ideas above, this could lead to static regular expressions becoming a first-class data type next to the container library.

I could add to the feature set: compile-time bitmapped regular expressions

Fritz Henglein has a paper in which he discusses Bit Coded Regular Expressions, which look fairly adaptable. BCRE requires that you not have character classes in your regular expression library (no "\w+", no "\{Number}", and no "." operator!), but in exchange what you get is an insanely fast regular expression matching algorithm that stores and outputs its results as a bitmap, which happens to make "I found X" vs "I found Y" incredibly cheap on modern hardware; the engine knows which bitmap corresponds to which sequence exactly. This speed is exactly what you need to analyze the terabytes of data that flow through a modern logging engine.

I could add to the feature set: extended regular expressions

There are four other operations that are known to work with regular expressions: Intersection ("this AND that at the same time"), Negation ("NOT that"), Interleaf ("This AND that AND that, independent of order of appearance"), and Cut ("The biggest THIS"). The Cut operation can be fully simulated using the Semiring implementation for the "longest submatch" (which can be adapted to emulate any of the Alternative behaviors found in the while: longest, atomic longest, first match, last match). The Interleaf operator is useful for parsing declarative languages in which elements appear in any order (for example, HTML attributes), and Intersection plus Negation are already commonplace in engines like PCRE.

Unfortunately, these fall outside of the the Semiring implementation. That doesn’t make them bad, I just don’t have a good intellectual grasp on how to implement them in a way that has a solid theoretical foundation. Even worse, there’s some evidence that Intersection plus Negation together create a parser that has some edge cases with gnarly performance penalties.

I could abstract Brzozowki’s algorithm into a finite state transducer

More abstractly, Brzozowski’s algorithm can actually be abstracted (eek!) into a bidirectional scanner. In some ways, a regular expression is a set-membership function with a functor transforming the membership determinant into something more useful than true-or-false. If I’m reading the paper Clowns to the Left of Me, Jokers to the Right correctly (no promises), Conor McBride shows that you could start anywhere in an abstract syntax tree, and Brzozowski’s algorithm (really, any regular expression algorithm, but Brzozowksi’s seems easiest here, actually) could provide derivatives of the left and right parts of the tree, producing a new tree transformed according to a regular expression substitution rule on the old tree. Do this progressively enough, and you end up with a fully functional tree transformer, or a very powerful abstraction of a finite state transducer that’s fully explicable in Church terms, rather than the Turing terms used in the Wikipedia article.

I could finish Barre

Or I could just "go for it," and just start re-writing Barre with the knowledge I’ve picked up working on these. One of the big goals of Barre is to implement Adams & Darais’ "Language of Languages," a superset of Kleene’s base six regular expressions to make it easier to build primitive parsers, giving you the "strap" part of bootstrapping this implementation into a full-powered parser generator.

I just haven’t decided which to do yet.

In the last post on “A Play on Regular Expressions,” I showed how we go from a boolean regular expression to a “rigged” one; one that uses an arbitrary data structure to extract data from the process of recognizing regular expressions. The data structure must conform to a set of mathematical laws (the semiring laws), but that simple requirement led us to some surprisingly robust results.

Now, the question is: Can we port this to Rust?


The first thing to do, however, is to not implement a Semiring. A Semiring is a conceptual item, and in Rust it turns out that you can get away without defining a Semiring as a trait; instead, it’s a collection of traits derived from the num_traits crate: Zero, zero, One, one; the capitalized versions are the traits, and the lower case ones are the implementations we have to provide.

I won’t post the entire code here, but you can check it out in Rigged Kleene Regular Expressions in Rust. Here are a few highlights:

The accept() function for the Haskell version looked like this:

acceptw :: Semiring s => Regw c s -> [c] -> s
acceptw Epsw u     = if null u then one else zero
acceptw (Symw f) u = case u of [c] -> f c;  _ -> zero
acceptw (Altw p q) u = acceptw p u `add` acceptw q u
acceptw (Seqw p q) u = sumr [ acceptw p u1 `mul` acceptw q u2 | (u1, u2) <- split u ]
acceptw (Repw r)   u = sumr [ prodr [ acceptw r ui | ui <- ps ] | ps <- parts u ]

The accept() function in Rust looks almost the same:

pub fn acceptw<S>(r: &Regw<S>, s: &[char]) -> S
    where S: Zero + One
    match r {
        Regw::Eps => if s.is_empty() { one() } else { zero() },
        Regw::Sym(c) => if s.len() == 1 { c(s[0]) } else { zero() },
        Regw::Alt(r1, r2) => S::add(acceptw(&r1, s), acceptw(&r2, s)),
        Regw::Seq(r1, r2) => split(s)
            .map(|(u1, u2)| acceptw(r1, &u1) * acceptw(r2, &u2))
            .fold(S::zero(), sumr),
        Regw::Rep(r) => parts(s)
            .map(|ps| ps.into_iter().map(|u| acceptw(r, &u)).fold(S::one(), prod))
            .fold(S::zero(), sumr)

There’s a bit more machinery here to support the sum-over and product-over maps. There’s also the where S: Zero + One clause, which tells us that our Semiring must be something that understands those two notions and have implementations for them.

To restore our boolean version of our engine, we have to build a nominal container that supports the various traits of our semiring. To do that, we need to implement the methods associated with Zero, One, Mul, and Add, and explain what they mean to the datatype of our semiring. The actual work is straightforward.

pub struct Recognizer(bool);

impl Zero for Recognizer {
    fn zero() -> Recognizer { Recognizer(false) }
    fn is_zero(&self) -> bool { !self.0 }

impl One for Recognizer {
    fn one() -> Recognizer { Recognizer(true) }

impl Mul for Recognizer {
    type Output = Recognizer;
    fn mul(self, rhs: Recognizer) -> Recognizer { Recognizer(self.0 && rhs.0) }

impl Add for Recognizer {
    type Output = Recognizer;
    fn add(self, rhs: Recognizer) -> Recognizer { Recognizer(self.0 || rhs.0) }

Also, unlike Haskell, Rust must be explicitly told what kind of Semiring will be used before processing, whereas Haskell will see what kind of Semiring you need to produce the processed result and hook up the machinery for you, but that’s not surprising. In Rust, you “lift” a straight expression to a rigged one thusly:

let rigged: Regw<Recognizer>  = rig(&evencs);

All in all, porting the Haskell to Rust was extremely straightforward. The code looks remarkably similar, but for one detail. In the Kleene version of regular expressions we’re emulating as closely as possible the “all possible permutations of our input string” implicit in the set-theoretic language of Kleene’s 1956 paper. That slows us down a lot, but in Haskell the code for doing it was extremely straightforward, which two simple functions to create all possible permutations for both the sequence and repetition options:

split []     = [([], [])]
split (c:cs) = ([], c : cs) : [(c : s1, s2) | (s1, s2) <- split cs]
parts []     = [[]]
parts [c]    = [[[c]]]
parts (c:cs) = concat [[(c : p) : ps, [c] : p : ps] | p:ps <- parts cs]

In Rust, these two functions were 21 and 29 lines long, respectively. Rust’s demands that you pay attention to memory usage and the rules about it require that you also be very explicit about when you want it, so Rust knows exactly when you no longer want it and can release it back to the allocator.

Rust’s syntax and support are amazing, and the way Haskell can be ported to Rust with little to no loss of fidelity makes me happy to work in both.

Okay, in part one I showed you what I’ve learned about writing regular expression engines in Haskell. The example shown uses Haskell chars, so is unicode-compliant.

I also mentioned a semiring. Here’s what a Haskell semiring’s class looks like. (A Haskell ‘class’ is more like a Trait in Rust, or an Interface is Java; it describes a collection of behaviors that we want implementors of the class to have. The syntax for an implementation is an `instance’):

class Semiring s where
    zero, one :: s
    mul, add  :: s -> s -> s

That’s… that’s it. s is a set, and these are the operations on the set. Remember that Int is a set: the set of all integer numbers.

The idea in the paper is to make regular expressions more generalized. To give them powers above and beyond what they’re normally capable of. This is also one of the main ideas in Barre, so I’ve been reading this paper with significant interest.

The idea here is to make the return type of the expression a Semiring. This is also the idea in Might and Adams, but here the authors go in a different direction. Because they have, as discussed last time, all the possible combinations of string and regex run at the same time, at the end of parsing, they end up with all possible parse trees automatically. The example showed all those resulting parse trees, which were lists (not sets) of boolean values being reduced down to, for sequences, “Are they all true?”, and for alternatives, “Are any of them true?” But for a brief moment, all possible parse trees exist.

So let’s create a semiring version of our regular expression. Better yet, let’s exploit the fact that we already have one. Here, the only real change is that we’ve said a rigged regular expression is one that takes a symbol and returns a semiring – a “rig”.

data Regw c s =                
    Epsw                       -- Epsilon
  | Symw (c -> s)              -- Character
  | Altw (Regw c s) (Regw c s) -- Alternation
  | Seqw (Regw c s) (Regw c s) -- Sequence
  | Repw (Regw c s)            -- R*

For our two “base” types, Epsilon and Symbol, we’ll have to implement them; everything else can be implemented in terms of them.  For everything else, we can “lift” them from the existing machinery, and here’s our lift:

rigged :: Semiring s => Reg -> Regw Char s
rigged Eps = Epsw
rigged (Sym c) = sym c
rigged (Alt p q) = Altw (rigged p) (rigged q)
rigged (Seq p q) = Seqw (rigged p) (rigged q)
rigged (Rep r)   = Repw (rigged r)

sym :: Semiring s => Char -> Regw Char s
sym c = Symw (\b -> if b == c then one else zero)

Those are the definitions, including the one that say that if the input symbol matches the constructed symbol, then return the Semiring’s version of “one”, otherwise “zero”. Recall that the test expression for the original version of Sym was u == [c]; we’ve now said that Symw takes a predicate that compares two symbols.

The accept version must now be written to handle the rigged versions, and this case we pass the input string to the case statement, which says if the list exists pass it to the predicate, otherwise zero:

acceptw :: Semiring s => Regw c s -> [c] -> s
acceptw Epsw u     = if null u then one else zero
acceptw (Symw f) u =
    case u of
      [c] -> f c
      _ -> zero
acceptw (Altw p q) u = acceptw p u `add` acceptw q u
acceptw (Seqw p q) u = sumr [ acceptw p u1 `mul` acceptw q u2 | (u1, u2) <- split u ]
acceptw (Repw r)   u = sumr [ prodr [ acceptw r ui | ui <- ps ] | ps <- parts u ]

In the Boolean version in the previous post, I used and and or to sum up all the resulting values of True and False that came back from the recognizer. I can’t do that here, because we don’t know what type of set the Semiring operates over; for Boolean, it would just be True/False. For other things…

But wait! Haskell can abstract that as well. We’ll replace and with prodr and or with sumr, and define them as folds over their semiring operations:

sumr, prodr :: Semiring r => [r] -> r
sumr = foldr add zero
prodr = foldr mul one

This works. If we’re talking booleans, add becomes ||, and mul become &&, and foldr (&&) True [True, True, True] is True, as is foldr (||) False [True, False, True], and so forth. Which gives us our first Semiring:

instance Semiring Bool where
    zero = False
    one = True
    add = (||)
    mul = (&&)

And then you can say:

> let nocs = Rep ( Alt ( Sym 'a' ) ( Sym 'b' ) )
> let onec = Seq nocs (Sym 'c' )
> let evencs = Seq ( Rep ( Seq onec onec ) ) nocs 
> acceptw (rigged evencs) "acc" :: Bool 

This is the point where I start to suspect Haskell of sorcery. You’ll notice that we didn’t actually associate the Semiring with our regular expression. No, we told acceptw that its job was to return a Boolean value, since acceptw takes a Semiring, it just went out and found a Semiring that does the job.  There was only one implementation in the current scope that meets the definition “a Semiring of Bool,” so Haskell assumed that must have been the one I wanted, and just used it.  To someone coming from the C/C++ world, that’s flaming magical. I understand it, but man, it’s magical.

All right. But I said Semirings can be defined on natural numbers. That looks exactly like you’d expect:

instance Semiring Int where
    zero = 0
    one = 1
    add = (+)
    mul = (*)

Yeah, zero is, um, zero. The number. There are no surprises here. So what happens when we ask:

> acceptw (rigged evencs) "acc" :: Int

“1?” What does “1?” mean. It means that our example above has exactly one parse tree. The number of different ways our regular expression above can handle this is: one. “aaa” gets you zero. And a string of twelve letters takes several minutes on my laptop due to that crazy explosion of possibilities to check I discussed in the previous post.

So is there a way to show this working? Well, sure. What would the regular expression (a|a*) do when handed the string "a"?

> let as = Alt (Sym 'a') (Rep (Sym 'a'))
> acceptw (rigged as) "a" :: Int

Two? Yes: both the left and the right alternatives returned their semiring one value, and the alternative added them together. Two.

> acceptw (rigged as) "aa" :: Int

One? Yes: Only the repetition returned anything, and it consumed the whole string. There was only one parse tree returned to be summed up.

> let bs = Alt (Sym 'b') (Rep (Sym 'b'))
> acceptw (rigged (Seq as bs)) "ab" :: Int

Four? Two for recognizing the a, and two for recognizing the b, and then the sequencing sumr summed those values together to give us four.

And that’s where I am in the paper. I’ve gotten this to work. Here’s where things get interesting:

  • The third part of the paper discusses reducing that explosion-of-nodes issue by using Glushkov’s Construction to generate an efficient directed finite automata. I want to use Brzozowski’s algorithm instead, so this’ll be interesting.
  • The semiring construction here returns parse trees and does analysis on them on a symbol-by-symbol level; this is different from Might’s use of a new regular expression type that acts as a parser-combinator, taking the parse tree and returning something built out of it, performing Henglein’s “catamorphism on parse trees”; Adams’ version takes this further and does the catamorphism the moment parsing is done. What this means is that Might’s & Adams’s work allows us to insert transformations in the middle of the expression, much the same way a PEG might return something constructed and of a different type, rather than merely a subset of the data provided.
  • Brzozowski’s algorithm with Might’s nullability operation can be made recursive; does that make the semiring construction invalid? (I don’t think so, since the analysis is done after the recursion has be successfully traversed, but I’m not sure.)
  • The semiring construction requires a whole parse tree, but Brzozowski’s algorithm with Might’s reduction node addition allows for the parser to return results at any time, yielding them, turning the parser into one of type Parser Iterator s, Iterator r => s -> r; in Rust terms, it takes an iterator as an input, and it return an iterator; every time you call .next() on it, it consumes as many symbols from s as necessary until a coherent element of type r can be returned, or it returns an error. Can this technique be adapted to work in that environment?

I have two competing desires at this point: (1) Try to adapt the work they’ve done here to a Brzozowski implementation, or (2) try to implement the work they’ve done here in Rust. I haven’t decided which yet.

I’ve been working my way through A Play on Regular Expressions, a programming pearl by Sebastian Fischer, Frank Huch, and Thomas Wilke on the use of weighted values in regular expressions. The paper has six parts, and I’ve understood everything up through part two, and what I’ve understood already blows my mind. I recently put up a big chunk of documentation on where I am in Barre, and where I am is that I have a reliable, working toolkit for parsing regular expressions and returning the string found (I have “Smart Epsilons”), but I haven’t progressed much beyond that part.

This paper does something different. It starts with the idea of a Semiring as the basic “wrapper” type of the expression.

A semiring is a set R with members and operations written: (0, 1, ⊕, ⊛, R). A semiring is an abstraction of common addition and multiplication over natural numbers (The integers from zero to infinity), with the requirement that 0 ⊕ x = x, 1 ⊛ x = x, 0 ⊛ x = 0. That last operation is called “annihilation.” That’s it; it has two values from the set with specific meanings, and two operations on members of the set that have specific meanings, and a couple of rules about the two values. Try hard not to set in your mind that semirings are about any one “thing” at all. It’s just an arbitrary set of rules that lets us do cool things.

A semiring can also be defined over a set of sets! In fact, regular expressions are themselves semirings: ⊕ defines the union of the results of two expressions; this is use in the Alternative operator; ⊛ is the Sequence operator, and defines all possible legal combinations of two expressions is sequence. An expression that recognizes nothing is zero; an expression that recognizes only the empty string is one. If an operation returns the empty set, it’s an annhilator under ⊛.

But that’s not what this paper is about. This paper starts with a basic regular expression. If you’ve been with me at all, you know that regular expressions are six functions that take a string and return a value: Empty (always False), Epsilon (True on Empty String), Token (True if it matches the constructed Token), Alternative (True if any constructed expression within matches), Sequence (True if the two constructed expressions within match in order), and Repetition (True if the constructed expression within matches zero or more times). Here are the five main operations (the sixth, “Empty”, is assumed) in Haskell for a regular expression recognizer (boolean values only):

data Reg = 
    Eps         -- Epsilon
  | Sym Char    -- Token
  | Alt Reg Reg -- Alternation
  | Seq Reg Reg -- Sequence
  | Rep Reg     -- Repetition

accept :: Reg -> String -> Bool
accept Eps u       = null u
accept (Sym c) u   = u == [c]
accept (Alt p q) u = accept p u || accept q u
accept (Seq p q) u = or [accept p u1 && accept q u2 | (u1, u2) <- split u]
accept (Rep r) u   = or [and [accept r ui | ui <- ps] | ps <- parts u]

split :: [a] -> [([a], [a])]
split []     = [([], [])]
split (c:cs) = ([], c : cs) : [(c : s1, s2) | (s1, s2) <- split cs]

parts :: [a] -> [[[a]]]
parts []     = [[]]
parts [c]    = [[[c]]]
parts (c:cs) = concat [[(c : p) : ps, [c] : p : ps] | p:ps <- parts cs]

This says that we return true if the string is empty and the expression is empty; we return true if the string is a single char and the expression is a symbol of one character. split returns every possible combination of the string, split at each point along its length; parts goes futher and produces every possible combination of the string; an eight-letter word has 2⁷ (128) possible combinations:

split "abc" == [("","abc"),("a","bc"),("ab","c"),("abc","")]
parts "abc" == [["abc"],["a","bc"],["ab","c"],["a","b","c"]]

This is hideously inefficient. It has to be done repeatedly for every sequence and subsequence of the string and its expression; this function is exponential O(n²) where n is the length of the string! But it does work. You can build a regular expression out of this and it does recognize strings reliably:

> let nocs = Rep ( Alt ( Sym 'a' ) ( Sym 'b' ) )
> let onec = Seq nocs (Sym 'c' )
> let evencs = Seq ( Rep ( Seq onec onec ) ) nocs
> accept evencs "acc"

In part two, things get really cool.

Last night, I went to a huge “Tech Talk and Happy Hour” put on by one of the major automobile manufacturers. They were recruiting talent for their engineering teams to develop a new infrastructure for self-driving cars. They even had an example on display. It was invitation-only and the recruiter tried extra hard to make sure I showed up.

They fed us pretty well; two free beers and all the tacos you could eat, and the tacos were made in-house and very tasty. So that was kinda nice. The place was crowded and there were a lot of people of various skills at the place.

I have very mixed feelings about self-driving cars. I suspect they’re going to make life worse, not better; we may not have to take the wheel, and they may be safer than human beings once we solve a whole host of problems, but they’re going to create more problems than they solve. Privately owned ones will stratify the rich from the poor even further; wealthy parents will be able to keep working while dispatching the car to get the kids to and from school, and wealthy owners will tell their cars to just orbit the block “until we’re done,” thus creating horrific traffic in city cores. Fleet cars will add to the street burden.

As a technologist, I think self-driving cars are significant and important. However, as a committed urbanist I want cities that are walkable, and mass transit that is frequent, useful, and adaptable. Self-driving cars will make cities less livable, not moreso. They have their place; as an adaptive technology for the disabled, they will be fantastic. As a convenient technology for the lazy, they’re a communal and personal health hazard.

Also, the infrastructure for supporting a fleet of self-driving cars is an environmental nightmare. “We have four cars at the moment, and together those four cars generate as much data per day as all of Facebook.” Facebook generates 19Kg of CO2per second; so do four self-driving cars. There are 268.8 million cars on the road. If one percent of them were self driving, that’s 50,920,000 Kgs/CO2 per second just for the “steering” part, never mind actually charging up the vehicle!

The good news, though, is that when the presentation was over, I went up to the people in the company shirts, and they eventually directed me over to a tech recruiter. My pitch was “Look, here’s what I can do for you, now convince me you’re not a dysfunctional mess,” which he actually took as a bit of a challenge. When he asked me where I was, I explained that I had left Splunk and was taking a few classes, and then mentioned my project. “Wait,” I said. “How nerdy do you want me to get?”

“I’m an engineer turned tech recruiter. Get as super-nerdy as you want.” So after about 15 minutes of explaining the project, its origin, and the wild things it has led me to do (take classes in set theory, category theory, and Haskell, among other things), and concluding with a list of use cases and potential value-add projects, he said, “Oh, we have to hire you.”

So there’s that, I guess.


I installed LSP and I Liked It

Posted by Elf Sternberg as Uncategorized

I was wrong about Visual Studio Code.

Not in the particulars, mind you. I’m still an Emacs user. It’s still my one and only IDE. But there was something about Visual Studio Code that bugged the hell out of me: on-the-fly syntactic analysis. I hated it. I thought it was a lazy developer’s feature, a hand-holding digital assistant who would nitpick the hell out of your work. That was the compiler’s job.

A few weeks ago I managed, mostly just out of curiosity, to install the Language Server Protocol engine for Rust, and to get it work under Emacs. It turned out the biggest problem was a breakdown between the PATH as understood by the boot shell and the path understood by my terminal shell. Once I reconciled the two, Everything Worked™. And I’ve started to appreciate what I was missing. It’s not that it helps me with semantics, but that it helps me catch typos and misunderstandings, and since my academic side project involves three different languages that I’m not all that comfortable with (Rust, Haskell, and Racket), it’s been a heck of a boon.

I know I’m probably the last programmer on Earth to understand this, but in case I’m not: go ahead and try LSP, FlyCheck, and Company. It’ll really make a difference in the code-compile-test cycle.

Subscribe to Feed



May 2019
« Apr