As I’ve been working my way through this project of mine, I’ve come to respect exactly why Haskell is so mind-bendingly different from other programming languages. It has to do with the level of abstraction you’re expected to track in your head.

In the classic "Object Oriented Programming" tutorial, where inheritance is usually introduced, the abstraction is flat: you create a Mammal class, which says it has four limbs and fur, and then add specific subclasses such as Dog and Cat, with different function bodies for the abstract method speak(), defining what sounds the animal makes.

The impression you’re given is that object orientation is often just fiddling around the edges; subclasses define different kinds of business logic. This impression is reinforced in languages like Java and C++ in realms as wildly different as office application development and game programming. Even in game programming, subclasses are often just business logic in a pretty dress: this class of Object moves four pixels per frame, that class has a weapon that does 20 points of damage.

Haskell throws all that out the window and asks you a simple question: "What is addition?"

Like, we think we know what "addition" is. It’s the sum of two numbers to create a new number. "Well, what’s a number?" And so forth.

In my project, the discovery that regular expressions both are semirings and can produce semirings was mind-boggling, and it promised a very neat and pluggable way to adapt regex to a variety of purposes; then, when I learned what semirings really were, the embogglement went much higher.

Because a semiring is often described as this: (0, 1, ⨯, +, ℙ), where ℙ is some set. Not a set "of numbers," just some set. If we say ℙ is "the set of natural numbers" then we get all the operations you’d expect. But this is just a concrete implementation! The underlying abstraction is still there. The exact same behavior holds for (F, T, &, |, 𝔹), the set of boolean values; F&x is always false, just as 0⋅x is always zero. F|x=x, T&x=x, etc.
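To make that concrete, here is a minimal Haskell sketch of those two instances, using the same zero/one/add/mul class shape that shows up later in these notes; nothing here is special, it is just the two concrete implementations written down:

class Semiring s where
    zero, one :: s
    add, mul  :: s -> s -> s

-- The "numbers" implementation: ordinary addition and multiplication.
instance Semiring Int where
    zero = 0
    one  = 1
    add  = (+)
    mul  = (*)

-- The "booleans" implementation: F&x is always False, just as 0⋅x is always 0.
instance Semiring Bool where
    zero = False
    one  = True
    add  = (||)
    mul  = (&&)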

And then the mind-boggling part I got from Might & Co, which I later discovered is called a Viterbi Parse Forest: ([], [""], ⊗, ∪, [𝕊]), where the brackets represent "set of…" and 𝕊 represents "strings" (as programmers understand them), has the exact same behavior. The ⊗ operator is the Cartesian product, that is, the concatenation of every string from the left set with every string from the right set. With this, what you get back is a record of the exact strings that were recognized and matched.
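Here is a minimal sketch of that parse-forest semiring in Haskell; this is my own illustration rather than code from Might or from the paper, and the Strings newtype exists only to keep the instance plain Haskell 98:

import qualified Data.Set as Set
import           Data.Set (Set)

class Semiring s where
    zero, one :: s
    add, mul  :: s -> s -> s

newtype Strings = Strings (Set String) deriving (Eq, Show)

instance Semiring Strings where
    zero = Strings Set.empty              -- the empty set: []
    one  = Strings (Set.singleton "")     -- the set of just the empty string: [""]
    add (Strings a) (Strings b) = Strings (Set.union a b)
    mul (Strings a) (Strings b) =         -- Cartesian concatenation of the two sets
        Strings (Set.fromList [ x ++ y | x <- Set.toList a, y <- Set.toList b ])

Feed something of this shape to a rigged matcher like the ones described further down, and what comes back from a match is the set of strings that were actually recognized, which is exactly the record described above.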

If you replace "set" in the above paragraph with "list", you get prioritized unions— which is the basis of disambiguation in Parsing Expression Grammar theory. If you define the semiring rather strangely but still within the rules, you get "longest match" disambiguation— which is the basis of μ-regex parsing under the POSIX rules. And so forth.

This is a much deeper level of abstraction. The user has to be able to not just internalize the "obvious" uses, but to understand the deep connective tissue linking these different uses and then come up with new ones. Haskell developers are a mixture of very quick and very slow; the slow part comes while their brain is teasing out a new understanding of the abstraction before them, and the quick part is implementing it with good IDE support. It separates the clever from the skilled. Ideally, it would be good to have both.

So, to muse on about regular expressions…

I was genuinely pleased to post Rigged Brzozowski Regular Expressions in Rust (Experiment No. 08), one that completely supports all of the operations in the Haskell version and one that completely supports the same semantics as the Glushkov Heavyweights in Rust (Experiment No. 07), which makes me mindbogglingly happy.

Now, there are a couple of things going on here. Here’s the skinny: The Glushkov version is a state machine that progressively builds the Semiring as characters are read. The Brzozowski version takes the regular expression tree and builds a parse tree of results, some of which use Adams’s Delta operator as a gatekeeper to preserve valid parses or to block off invalid ones. It then presents the resulting tree to a function that postprocesses it, creating a semiring out of the results.

The Delta operator is only introduced into the parse tree during Sequence operations, to block off trees that don’t need postprocessing because they couldn’t have been valid in the first place, or to preserve trees that could be valid. For example, ab*c is an expression that can contain just ac or abc (or abbbbc), so the sequence requires a gatekeeper to preserve whether or not the b was seen.

There are two ways of handling gatekeeping. Might says:

D_c(L ◦ R) = (D_c(L) ◦ R) ∪ (δ(L) ◦ D_c(R))

is equivalent to:

if (ε ∈ L) then
    D_c(L ◦ R) = (D_c(L) ◦ R) ∪ D_c(R)
else
    D_c(L ◦ R) = D_c(L) ◦ R

The difference between these is that in the first case, the delta operator is either an annihilator or an identity on the result of the parse, whereas in the second, the nullability function decides on the process of the parse. Despite Adams’s assertion, the truth is harder; in semiring parsing, the loss of result information on the left hand side of a sequence is devastating to the final parse. We do not get an honest parse tree without the Delta operator.
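Here is a small sketch of what that means, in my own notation rather than Might’s or Adams’s code (Rep is left out to keep it short): the yes/no nullability test becomes nu, the semiring value an expression assigns to the empty string, and the derivative of a sequence has to carry that value along on its right-hand branch, which is exactly the job the Delta operator is doing:

class Semiring s where
    zero, one :: s
    add, mul  :: s -> s -> s

-- Just enough structure to show the sequence case; Rep is omitted.
data Rx s = Eps s                -- ε carrying the value it has accumulated
          | Sym (Char -> s)
          | Alt (Rx s) (Rx s)
          | Seq (Rx s) (Rx s)

-- The value an expression assigns to the empty string: weighted nullability.
nu :: Semiring s => Rx s -> s
nu (Eps v)   = v
nu (Sym _)   = zero
nu (Alt p q) = nu p `add` nu q
nu (Seq p q) = nu p `mul` nu q

-- D_c(L ◦ R) = (D_c(L) ◦ R) ∪ (δ(L) ◦ D_c(R)), with δ(L) rendered as an
-- ε node carrying nu(L): the left-hand result is preserved, not discarded.
deriv :: Semiring s => Char -> Rx s -> Rx s
deriv _ (Eps _)   = Eps zero
deriv c (Sym f)   = Eps (f c)
deriv c (Alt p q) = Alt (deriv c p) (deriv c q)
deriv c (Seq p q) = Alt (Seq (deriv c p) q)
                        (Seq (Eps (nu p)) (deriv c q))

-- Derive once per character, then read off the value of the empty string.
weight :: Semiring s => Rx s -> String -> s
weight r = nu . foldl (flip deriv) r

If the nullability test were collapsed to a plain boolean, nu p would never make it onto the right-hand branch, and that is exactly the loss of left-hand result information described above.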

Might solves the problem by recognizing that concatenation of an epsilon with anything requires data preservation only, and since he’s approaching this from a more engineering-oriented standpoint than the Play on Regular Expressions people, using in-line combinators, he offers:

(ε ↓ {t1}) ◦ p ⇒ p → λt2.(t1, t2)
p ◦ (ε ↓ {t2}) ⇒ p → λt1.(t1, t2)

This basically says that if you have a (pending) semiring value t1 sequenced with the eventual results of a parser p, the output of p becomes the right hand of the mul operation over the two semirings, with t1 on the left. And flipped if the epsilon is on the right. In other words, he recognizes when an epsilon is in a sequence and ‘floats’ its value out, multiplying it with the eventual result of the other side of the sequence. The information is preserved.
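In ordinary parser-combinator terms the reduction is just mapping the pairing function over the other parser. Here is a toy rendering of my own (a list-of-successes parser type, not Might’s actual combinators), where floatLeft is the rewritten form of sequencing an already-known ε result with a parser p:

-- A toy list-of-successes parser, just to illustrate the rewrite.
type Parser t = String -> [(t, String)]

epsP :: t -> Parser t
epsP t s = [(t, s)]

seqP :: Parser a -> Parser b -> Parser (a, b)
seqP p q s = [ ((a, b), s'') | (a, s') <- p s, (b, s'') <- q s' ]

-- (ε ↓ {t1}) ◦ p  ⇒  p → λt2.(t1, t2): never build the ε node at all,
-- just pair the pending t1 with whatever p eventually produces.
floatLeft :: a -> Parser b -> Parser (a, b)
floatLeft t1 p s = [ ((t1, t2), s') | (t2, s') <- p s ]

floatLeft t1 p produces exactly the same results as seqP (epsP t1) p, but t1 never has to be stored in the tree, and nothing about it is lost.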

I want to believe this is on purpose; that Might & Adams knew this was a necessary preservation step. Darais’s highly optimized Racket implementation uses it effectively and has no need of either the Delta operator or the Epsilon-Star operator, as data preservation is left to the reduction optimization on line 97 of Derp-3.

It’s exactly what you’d expect out of semiring parsing, but it’s mechanical and, well, naïve. Not in a bad way, mind you. I’m still trying to figure out how to make it "go" in a pure semiring implementation. I may have to implement reducers anyway, just to support this; otherwise, I have to use the Delta operator and the tree remains alarmingly large.

I went to the HackerX recruitment event this week and, while it was interesting, the event was awkward. It’s a bit like speed-dating: 36 companies show up and you have five minutes per company to hear their pitch and make your own, and then time is called and you move on to the next company. The round lasts an hour, so you have time to visit at most 12 recruiters.

The problem is that three times as many people showed up for the event as there were seats. In order to "facilitate" this problem, you were placed in a waiting line per row, and depending upon your starting position you could fall out of your line, in which case you got to pick a new waiting line. And the lines were initialized in order, with the first person in line being sent to the end of the row.

In other words, if you were a go-getter who arrived early, you got to talk to exactly one recruiter before you fell out of the room and had to go to a waiting line, and there was very little chance you’d get back into the room.

Even worse: that recruiter was always Amazon. The last entry of every row was Amazon: AWS, Amazon Logistics, or Amazon Internals. It didn’t matter. It was Amazon. All Jeff, all the time.

I was second in line. I got to talk to two recruiters, and I really didn’t care to talk to Amazon, although the people behind the table were nice enough. In line (I really wanted to talk to the Edu startup), I did get to talk to a woman who was looking to move from startup to enterprise, and some recruiters did come out and talk to us in line, having mercy on our patience, but they were the HR assistants to the engineering recruiters, and couldn’t answer many of my questions about work environment, tooling or projects.

I’d probably go again, but I’d really prefer less of a free-for-all.

I have hit a snag in my project. This entry is me thinking about solutions.

My goal was a reasonable one:

There are a million other "goals" that can go with this:

  • Write a variant that produces a DFA, allowing for static compilation
  • Write a variant that produces a bitmap, allowing for fast comparison
  • Write a variant that is-a iterator, such that some expressions can be marked "progressive" and emit their contents as the match completes
  • Write a relationship between the parser and the semirings such that progress can be suspended and restored as needed
  • Add negation, complement, and interleaving as regular expression operators
  • Write Darais’ Language-of-Languages
  • Write a PEG parser on top
  • Write ROSIE in Rust
  • Instantiate backreferences
  • Add a concrete stream operator

The idea is that Barre/Barge is a generalized but still highly performant toolkit for building recognition, matching, and parsing engines, and the nature of the operation depends on passing it an output dependency with an interface that is theoretically sound and operationally powerful.

Unfortunately, I’ve hit this snag: The classic implementation of parsing-with-semirings can produce a parse tree via the Derivation Forest Semiring, but disambiguating that forest down to a single tree requires a different semiring to implement one of First Match / Longest Match / Longest Atomic Match. Adams and Darais provide a derivation forest engine, and it’s the one I’ve implemented, but there’s one special case.

Imagine a sequence of regular expressions: "abcde". These can be depicted in one of two ways: (a, (b, (c, (d, e)))) or ((((a, b), c), d), e). Computationally, that second one, "left heavy," is troubling: for every character, you’ll have to transition down the tree to reach the character, traversing four nodes just to get to the a expression. Darais implemented Adams’ suggestion that the tree be rebalanced to make it "right-heavy," and then a post-parsing operation is placed in front of the node to rebalance the resulting parse tree back to a left-heavy representation, so that the results are consistent with expectations.
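A few lines of Haskell make the rotation concrete (this is my own sketch of the idea, not Darais’s code): keep rotating any Seq whose left child is itself a Seq until the tree is right-heavy.

data Rx = Sym Char
        | Seq Rx Rx
        deriving Show

-- ((((a, b), c), d), e)  becomes  (a, (b, (c, (d, e))))
rightHeavy :: Rx -> Rx
rightHeavy (Seq (Seq a b) c) = rightHeavy (Seq a (Seq b c))
rightHeavy (Seq a b)         = Seq (rightHeavy a) (rightHeavy b)
rightHeavy r                 = r

The post-parsing step runs the corresponding rotation in reverse over the parse tree, so the caller still sees the left-heavy shape they wrote.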

The problem with this is that there’s no clear implementation that corresponds to the semiring. It’s an engineering-versus-math issue, and I’m not yet sure how to deal with it, although just writing it out does help me see some of the potential solutions. The basic solution is to keep it all: the raw parse tree, the raw disambiguation matrix, and the product of the parse as quickly as the expression specification can produce it.


Current Bibliography


I’m just keeping this here, to keep me honest about what I’m reading.

  • A Play on Regular Expressions: A Functional Pearl. Sebastian Fischer, Frank Huch, Thomas Wilke.
    • Solid introduction to parsing with semirings in Haskell
    • Haskell is straightforward and accessible
    • Readable, breezy style
    • Semirings covered: Recognition, Counting, Viterbi-Best (Maxlong, Firstlong)
  • Algebraic Foundations of Semiring Parsing. Yudong Liu
    • Discusses semiring parsing as unified framework
    • Unites semiring parsing and finite state transducers
    • Describes a mechanical C++ framework for the pieces/parts
    • Looks a lot like Barge/Barre
    • Good overview, fairly accessible.
  • An Algorithm for RELAX NG Validation. James Clark.
    • Describes the Interleaf operator in some detail.
  • Bit Coded Regular Expression Parsing. Lasse Nielsen.
    • Presentation, so short on details
    • Describes bit coding as a way to do comparisons on regexes with fewer than machine-width alternatives
  • Bitsets Match Regular Expressions, Compactly. Paul Khuong
    • Describes the Bitap algorithm for bitmap encoding
    • Describes graph minimization as assuming that all states are singular, and breaking them up as we prove the assumption wrong.
  • Constructing Human Grade Parsers. Joe Karter.
    • Wish-list of parser capabilities outside the actual parsing engine
    • Parse errors should be recoverable
    • Parser should be able to produce partial output
    • Requires tooling outside the formalities of regex and semiring parsing
  • Design and Analysis of String Representation. R. Sekar.
    • Describes Owens et al.’s DFA generator as a realization of McNaughton-Yamada
    • Describes alternative search algorithms
  • Efficient and Flexible Incremental Parsing. S. L. Graham
    • Describes incremental parsing, in which a document tree and parse tree are maintained and updated together.
  • Error-Correcting Parser Combinators. S. Doaitse Swierstra, Pablo R. Azero Alcocer.
    • Haskell-based parsing language
    • Describes error-handling in parsing processes
    • Describes recovering from errors to continue parsing
    • Uses combinators, though
  • Total Parser Combinators. Guillaume Allais.
    • Describes a combinator-based parser with guaranteed termination
    • Uses Brzozowski as its basis
    • Proves that the combinator recognizer is a semiring
    • Written in Agda (eek!)
  • Kleene Meets Church. Fritz Henglein
    • Has the "theory vs practice" description
    • Ideal: parsing + catamorphic processing
    • Actual: FAs + ad-hoc instrumentation
  • On the Complexity and Performance of Parsing with Derivatives. Michael D. Adams, Celeste Hollenbeck, Matthew Might.
    • Provides both theoretical and mechanical optimizations
    • Replaces the recursion nullability with data-flow analysis
    • Provides basis for Darais’s implementation
  • Recognising and Generating Terms using Derivatives of Parsing Expression Grammars. Tony Garnock-Jones, Mahdi Eslamimehr, Alessandro Warth.
    • Describes the derivatives of PEGs
    • Describes the δPEG algorithm without backtracking for negation
    • Describes the derivation of prioritized choice
  • Regular Expression Derivatives Reexamined. Scott Owens, John Reppy, Aaron Turon.
    • The paper that started the current project.
    • Section 3.3: Using derivatives for DFA construction.
    • Describes equivalence sets in the face of symbol class operators
    • Describes regular vectors for multithreaded matching.
    • Describes integration of complement and intersection operations
  • Regular Expressions as Types. Fritz Henglein
    • Introduced the "ad-hoc" slander about PCRE. 😊
    • Describes regex as a set membership problem
    • Describes types systems as "regular languages"
    • Clarifies emitter location for stream-processing
    • On the Kleene star operation
    • May require different Kleene stars, or a visitor with some context awareness.
  • Semiring Parsing. Joshua Goodman.
    • Mentioned by Liu
    • Covers the initial history of semiring parsing
    • Describes derivations forests (see Might & Adams) in a concise way (sec 2.5.1)
    • Describes Viterbi n-best as a probabilistic disambiguation strategy
  • Stream Processing Using Grammars and Regular Expressions. Ulrik Terp Rasmussen.
    • Has a great explanation of the "Regular Expressions as Types" approach.
    • Leads into regex equivalence sets
    • Discusses disambiguation strategies with equivalence sets
    • Discusses bit coding of unions (no μ-regex or e-regex)
    • (Con) Uses bit-coding as a parse tree extraction tool
    • Discusses alternative needs
      • Recognition
      • Matching (capture groups)
      • Parsing (parse trees)
    • Discusses two-pass analysis (Might) vs one-pass (Adams)
    • Discusses difficulty of regex composition (see Conway 2018)
    • Discusses CFGs as an alternative DSL for regex composition (PEG, EBNF)
    • Is actually 5 papers; the other 4 cover algorithm’s evolution
    • Fantastic overview overall
  • Taxonomies and Toolkits of Regular Expression Algorithms. Bruce William Watson.
    • Solid overview
    • Fairly long
    • Excellent coverage of DFA minimization
    • Discusses prefix/suffix scanning via Aho-Corasick, et al.
    • Covers Brzozowski’s algorithm as a performance-based alternative DFA constructor
  • Yacc is Dead. Matthew Might, David Darais.
    • Started this whole mess
    • Still a pretty good introduction.

Denis Kyashif recently wrote a piece called "Implementing a Regular Expression Engine," and showed how to do it in Javascript. It’s a good piece, and I recommend it. Kyashif does a very good job of describing regular expressions from a Turing-theoretic approach using finite automata, but there’s another way to think about regular expressions, and that’s from the Church-theoretic approach. And since I’ve implemented sixteen (now seventeen) regular expression engines in a variety of languages from a primarily Church-theoretic approach, the approach originally used by Stephen Kleene in 1956, I’m going to cover how to think about regular expressions from that direction.

Alphabets, Languages, Primitives and Composites

You’ll notice in the DFA approach that one rarely asks the question, "What is a regular expression expressing, and about what?" The "about what" is easier to explain: it’s called a regular language, and it is a set of strings composed out of a finite alphabet of symbols. The most common alphabets are just character sets, and the most common of those are the ASCII set and the Unicode standard.

The set of strings in a regular language can be finite and concrete. In the common DSL known as regex, "foo" is a regular language of one string. "foo|bar" is a regular language of two strings. Or they can be infinite: "(foo|bar)*" is a regular language with an infinite number of strings that consist of the repetition of the previous example: "foobarbarbarfoo" and "barfoobarfoofoo" are both in that language.

In programming terms, a regular expression is a function that takes a string and returns a boolean indicating whether or not that string is a member of a specific regular language. That function is composed out of six other functions, three of which are called the primitives and three of which are composites built out of other regular expressions. All of these functions are themselves regular expressions, and have the same inputs and outputs.

  • null(s): Always returns False
  • empty(s): Is the string empty?
  • sym_c(s): Constructed with the symbol c, does the string consist of (and only of) the symbol c?
  • alt_{r1,r2}(s): Constructed out of two other regular expressions, and true only if s matches either. This is the ‘alternatives’ operator, specified in regex with the pipe symbol: |
  • seq_{r1,r2}(s): Constructed out of two other regular expressions, and true only if s consists of the first expression immediately followed by the second.
  • rep_{r1}(s): Constructed out of another regular expression, true if s consists of zero or more repeated instances of that expression. This is the star operator: *

Every regular expression is a tree of these sub-expressions, and given a string, it starts at the top of the tree and works its way down the chain of functions until it determines the truth proposition "is this string a member of the language described by this regular expression?"

Expressing regular expressions Kleene-ly

In Haskell, it’s possible to turn Kleene’s formula directly into source code, and from here, it’s possible to convert this directly to Javascript. This is the Haskell version:

data Reg = Emp | Sym Char | Alt Reg Reg | Seq Reg Reg | Rep Reg

accept :: Reg -> String -> Bool
accept Emp u       = null u
accept (Sym c) u   = u == [c]
accept (Alt p q) u = accept p u || accept q u
accept (Seq p q) u = or [accept p u1 && accept q u2 | (u1, u2) <- split u]
accept (Rep r) u   = or [and [accept r ui | ui <- ps] | ps <- parts u]

split :: [a] -> [([a], [a])]
split []     = [([], [])]
split (c:cs) = ([], c : cs) : [(c : s1, s2) | (s1, s2) <- split cs]

parts :: [a] -> [[[a]]]
parts []     = [[]]
parts [c]    = [[[c]]]
parts (c:cs) = concat [[(c : p) : ps, [c] : p : ps] | p:ps <- parts cs]

Those split and parts functions are necessary to express the sequence (r1 followed by r2) operator from Kleene’s original math:

L[[r · s]] = {u · v | u ∈ L[[r]] and v ∈ L[[s]]}

Those membership tests are universal: every possible combination of sequences in the string being tested must be tested. split() decomposes the string into every possible pair of prefix and suffix, so that Seq{} can be compared against them. parts() goes even further, devolving the string into every possible partition into substrings, so that Rep{} can be compared to every possible variation of the ordered input string. Mathematically, this is elegant and sensible; computationally, it’s inefficient and ridiculous; a string of n letters requires 2ⁿ⁻¹ tests!
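A quick check of that count in GHCi, using the split and parts definitions above (eight letters: nine splits, 2⁷ = 128 partitions):

> length (split "abcdefgh")
9
> length (parts "abcdefgh")
128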

Doing It In Javascript (Well, Typescript)

Porting the Kleene version of this to Javascript was difficult only insofar as Javascript is notorious about copying vs. referencing, especially when it comes to heavily nested arrays like those used above. The ports of split and parts were also significantly more complex, although Typescript’s type system was of enormous help in sorting out what was happening each step of the way.

The conversion is straightforward because the Haskell doesn’t use any higher-kinded types: no applicatives, no functors, and certainly no monads!

The datatype Reg in Haskell, along with a couple of convenience factory functions, becomes:

interface Regcom { kind: string }
class Eps implements Regcom { kind: "eps"; }
class Sym implements Regcom { kind: "sym"; s: string; }
class Alt implements Regcom { kind: "alt"; l: Regex; r: Regex; }
class Seq implements Regcom { kind: "seq"; l: Regex; r: Regex; }
class Rep implements Regcom { kind: "rep"; r: Regex; }

function eps():                   Eps { return { kind: "eps" }; };
function sym(c: string):          Sym { return { kind: "sym", s: c }; };
function alt(l: Regex, r: Regex): Alt { return { kind: "alt", l: l, r: r }; };
function seq(l: Regex, r: Regex): Seq { return { kind: "seq", l: l, r: r }; };
function rep(r: Regex):           Rep { return { kind: "rep", r: r }; };

type Regex = Eps | Sym | Alt | Seq | Rep;

And the accept looks remarkably similar. The some() and every() methods on Arrays were especially useful here, as they implement the same behavior as and and or over Haskell lists.

function accept(r: Regex, s: string): boolean {
    switch(r.kind) {
    case "eps":
        return s.length == 0;
    case "sym":
        return s.length == 1 && r.s == s[0];
    case "alt":
        return accept(r.l, s) || accept(r.r, s);
    case "seq":
        return split(s).some((v: Array<string>) => accept(r.l, v[0]) && accept(r.r, v[1]));
    case "rep":
        return parts(s).some((v: Array<string>) => v.every((u: string) => accept(r.r, u)));
    }
}

split() required a significant amount of munging to make sure the arrays were copied and not just referenced, but looks much like the Haskell version:

function split(s: string) {
    if (s.length == 0) {
        return [["", ""]];  
    }
    return [["", s.slice()]].concat(
        split(s.slice(1)).map(
            (v) => [s[0].slice().concat(v[0].slice()), v[1].slice()]));
}

parts(), too, learns a lot from the Haskell version:

function parts(s: string): Array<Array<string>> {
    if (s.length == 0) {
        return [[]];
    }

    if (s.length == 1) {
        return [[s]];
    }

    let c = s[0];
    let cs = s.slice(1);
    return parts(cs).reduce((acc, pps) => {
        let p: string  = pps[0];
        let ps: Array<string> = pps.slice(1);
        let l:  Array<string> = [c + p].concat(ps);
        let r:  Array<string> = [c].concat(p).concat(ps);
        return acc.concat([l, r]);
    }, [[]]).filter((c) => c.length != 0);
}

You use this code in a straightforward fashion:

let nocs = rep(alt(sym("a"), sym("b")));
let onec = seq(nocs, sym("c"));
let evencs = seq(rep(seq(onec, onec)), nocs);
console.log(accept(evencs, "abcc") == true);           // "true"
console.log(accept(evencs, "abccababbbbcc") == true);  // "true
 

Wrap up

Most regular expression "under the covers" tutorials come from a Turing-theoretic approach, describing the finite automata that transition from state-to-state, ultimately ending up somewhere in a table with a flag that says "This is an accept state."

I approached this from a Church-theoretic approach. The Church-Turing Thesis says that these two approaches are equivalent, but use different notation. Turing’s approach is mechanical and engineering oriented; Church’s approach and notation are mathematical.

Stephen Kleene’s original 1956 paper on Regular Expressions was primarily written from a Church-theoretic approach, and I showed that this approach can legitimately, if inefficiently, be implemented in an ordinary programming language like Javascript. I showed how Kleene’s six basic operations can be composed together to create complete and effective regular expressions.

The code for this Typescript implementation, the eight other Haskell variants, the seven other Rust variants, and one Python variant, are all available on Github.

When last we left our hero, he had successfully implemented Rigged Regular Expressions in Rust.

There’s been a lot of progress since then.

Matt Might’s Parse Result Data Type Is A Semiring!

One of my strongest intuitions when I started reading A Play on Regular Expressions was that under the covers Matt Might’s implementation of Brzozowski’s Regular Expressions used a data structure exactly like those in the Play paper. I was correct; I was able to implement a (non-recursive) Brzozowski engine in Haskell, and use a Semiring formalism to accurately reproduce Might’s outcome and data structure from Parsing With Derivatives: A Functional Pearl.

And Might’s parser-combinators are functors between semirings!

To me, this is something of a monumental discovery. Intellectually, I’m happy to show that the data structures Might is using are semirings and his parser-combinator approach is actually just functors between semirings. Personally, I’m thrilled that I was able to do this in Haskell, writing code that is not from either paper, to show that the two have significant overlap. Even better, I was able to show that the semiring properties can encompass a different category of complexity from the one in the Play paper.

Semiring-as-a-trait in Rust

The first thing is that, in that last post, I discussed not having to implement a Semiring trait in Rust, because I could abstract it further using the num_traits crate to define my zero and one, and the std::ops traits to define multiplication and addition.

Unfortunately, that didn’t hold. It worked fine when my values were primitives such as Boolean or Integer, but when I implemented Matt Might’s sets of strings as a return value, the std::ops implementations were insufficient. To work as an operation, Rust requires that the data must be bitwise copyable without loss of fidelity; unfortunately, bitwise copying of sets-of-strings doesn’t work; the underlying implementation has much of its data distributed across the heap. So I had to implement the Semiring as a trait to exploit Rust references and get high-performance operations, and it worked just fine.

You can write Haskell (poorly) in any language

At one point developing the Rigged Brzozowski in Haskell implementation, I was stuck on a hard-to-surface bug. That whole "if it compiles it’s probably correct" bit of Haskell superiority is nonsense; there are lots of little fiddly bits in this algorithm ("Is this a sum or product operation?" "Is the initializer one or zero?" "Is the precedence correct?" "Is this test on the left or right operand?") that can go wrong.

I ported the implementation to Python in order to litter it with print statements and watch it work. This was useful, and helped me track down the bug. I was able to port the working instance back to Haskell easily.

One thing to note is that the Python is rather… odd. There are the six regex operations that can be performed on a string which are both unique and finite. By using namedtuple I was able to create the same structure as the Haskell data or Rust enum operation, and with clever use of Python’s reflection capabilities I was likewise able to make Python do something like Haskell or Rust’s pattern-matching, using dictionaries. The result is, to my eye, pleasing, but I’ve been told it’s "not very Pythonic."

Easier to maintain, at any rate.

Are we done here?

That’s… really the twelve thousand dollar question now, isn’t it? I’ve finished sections one and two; the third section is about adapting this framework to recursive regular expressions, which I’m already somewhat proficient in from working with Darais’ Racket implementation. So there are a couple of different ways I could go about this:

I could just keep plugging away

I could proceed as I have been, implementing:

  • Heavyweights using Brzozowski in Haskell
  • Heavyweights using Brzozowski in Rust
  • Recursive Regular Expressions in Haskell
  • Recursive Regular Expressions in Rust
  • Recursive Brzozowski Regular Expressions

I could show it works in C++

I could combine what I did with the Python implementation and my limited (very limited) C++ knowledge and try to port one of the Rigged Glushkov engines to C++. The state-of-the-art in C++ unicode support looks absolutely terrifying, though.

I could add to the feature set: incremental regular expressions

One of the big changes I made in Rust was that, toward the end, I changed the input value from an Iterable to an Iterator, thus simplifying the API. I want to do the same thing for the output; that is, I want the receiver to get not just a semiring containing a set, but an iterator that produces elements from the semiring as they’re produced, in order. I want to create an incremental regular expression.

I could add to the feature set: compile-time regular expressions

In the paper that started this whole thing, Owens, Reppy & Turon showed (Section 3.3) that Brzozowski’s algorithm can produce static DFAs, and that high-performance compile-time regular expressions are possible. Combined with Rust’s Procedural Macros and the iterator ideas above, this could lead to static regular expressions becoming a first-class data type next to the container library.

I could add to the feature set: compile-time bitmapped regular expressions

Fritz Henglein has a paper in which he discusses Bit Coded Regular Expressions, which look fairly adaptable. BCRE requires that you not have character classes in your regular expression library (no "\w+", no "\p{Number}", and no "." operator!), but in exchange what you get is an insanely fast regular expression matching algorithm that stores and outputs its results as a bitmap, which happens to make "I found X" vs "I found Y" incredibly cheap on modern hardware; the engine knows which bitmap corresponds to which sequence exactly. This speed is exactly what you need to analyze the terabytes of data that flow through a modern logging engine.

I could add to the feature set: extended regular expressions

There are four other operations that are known to work with regular expressions: Intersection ("this AND that at the same time"), Negation ("NOT that"), Interleaf ("This AND that AND that, independent of order of appearance"), and Cut ("The biggest THIS"). The Cut operation can be fully simulated using the Semiring implementation for the "longest submatch" (which can be adapted to emulate any of the Alternative behaviors found in the wild: longest, atomic longest, first match, last match). The Interleaf operator is useful for parsing declarative languages in which elements appear in any order (for example, HTML attributes), and Intersection plus Negation are already commonplace in engines like PCRE.

Unfortunately, these fall outside of the Semiring implementation. That doesn’t make them bad; I just don’t have a good intellectual grasp on how to implement them in a way that has a solid theoretical foundation. Even worse, there’s some evidence that Intersection plus Negation together create a parser that has some edge cases with gnarly performance penalties.
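For plain recognition, at least, the derivative clauses for intersection and complement are already in Owens, Reppy & Turon; here is a Bool-only sketch of my own transcription, which deliberately says nothing about how a semiring result would thread through the And and Not cases, because that is the part I don’t have a good grasp of yet:

data Rx = Empty | Eps | Sym Char
        | Alt Rx Rx | Seq Rx Rx | Rep Rx
        | And Rx Rx | Not Rx

nullable :: Rx -> Bool
nullable Empty     = False
nullable Eps       = True
nullable (Sym _)   = False
nullable (Alt p q) = nullable p || nullable q
nullable (Seq p q) = nullable p && nullable q
nullable (Rep _)   = True
nullable (And p q) = nullable p && nullable q
nullable (Not p)   = not (nullable p)

deriv :: Char -> Rx -> Rx
deriv _ Empty     = Empty
deriv _ Eps       = Empty
deriv c (Sym a)   = if a == c then Eps else Empty
deriv c (Alt p q) = Alt (deriv c p) (deriv c q)
deriv c (Seq p q)
    | nullable p  = Alt (Seq (deriv c p) q) (deriv c q)
    | otherwise   = Seq (deriv c p) q
deriv c (Rep p)   = Seq (deriv c p) (Rep p)
deriv c (And p q) = And (deriv c p) (deriv c q)    -- intersection: derive both sides
deriv c (Not p)   = Not (deriv c p)                -- complement: derive underneath

matches :: Rx -> String -> Bool
matches r = nullable . foldl (flip deriv) r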

I could abstract Brzozowski’s algorithm into a finite state transducer

More abstractly, Brzozowski’s algorithm can actually be abstracted (eek!) into a bidirectional scanner. In some ways, a regular expression is a set-membership function with a functor transforming the membership determinant into something more useful than true-or-false. If I’m reading the paper Clowns to the Left of Me, Jokers to the Right correctly (no promises), Conor McBride shows that you could start anywhere in an abstract syntax tree, and Brzozowski’s algorithm (really, any regular expression algorithm, but Brzozowski’s seems easiest here, actually) could provide derivatives of the left and right parts of the tree, producing a new tree transformed according to a regular expression substitution rule on the old tree. Do this progressively enough, and you end up with a fully functional tree transformer, or a very powerful abstraction of a finite state transducer that’s fully explicable in Church terms, rather than the Turing terms used in the Wikipedia article.

I could finish Barre

Or I could just "go for it," and just start re-writing Barre with the knowledge I’ve picked up working on these. One of the big goals of Barre is to implement Adams & Darais’ "Language of Languages," a superset of Kleene’s base six regular expressions to make it easier to build primitive parsers, giving you the "strap" part of bootstrapping this implementation into a full-powered parser generator.

I just haven’t decided which to do yet.

In the last post on “A Play on Regular Expressions,” I showed how we go from a boolean regular expression to a “rigged” one; one that uses an arbitrary data structure to extract data from the process of recognizing regular expressions. The data structure must conform to a set of mathematical laws (the semiring laws), but that simple requirement led us to some surprisingly robust results.

Now, the question is: Can we port this to Rust?

Easily.

The first thing to do, however, is to not implement a Semiring. A Semiring is a conceptual item, and in Rust it turns out that you can get away without defining a Semiring as a trait; instead, it’s a collection of traits derived from the num_traits crate: Zero, zero, One, one; the capitalized versions are the traits, and the lower case ones are the implementations we have to provide.

I won’t post the entire code here, but you can check it out in Rigged Kleene Regular Expressions in Rust. Here are a few highlights:

The accept() function for the Haskell version looked like this:

acceptw :: Semiring s => Regw c s -> [c] -> s
acceptw Epsw u     = if null u then one else zero
acceptw (Symw f) u = case u of [c] -> f c;  _ -> zero
acceptw (Altw p q) u = acceptw p u `add` acceptw q u
acceptw (Seqw p q) u = sumr [ acceptw p u1 `mul` acceptw q u2 | (u1, u2) <- split u ]
acceptw (Repw r)   u = sumr [ prodr [ acceptw r ui | ui <- ps ] | ps <- parts u ]

The accept() function in Rust looks almost the same:

pub fn acceptw<S>(r: &Regw<S>, s: &[char]) -> S
    where S: Zero + One
{
    match r {
        Regw::Eps => if s.is_empty() { one() } else { zero() },
        Regw::Sym(c) => if s.len() == 1 { c(s[0]) } else { zero() },
        Regw::Alt(r1, r2) => S::add(acceptw(&r1, s), acceptw(&r2, s)),
        Regw::Seq(r1, r2) => split(s)
            .into_iter()
            .map(|(u1, u2)| acceptw(r1, &u1) * acceptw(r2, &u2))
            .fold(S::zero(), sumr),
        Regw::Rep(r) => parts(s)
            .into_iter()
            .map(|ps| ps.into_iter().map(|u| acceptw(r, &u)).fold(S::one(), prod))
            .fold(S::zero(), sumr)
    }
}

There’s a bit more machinery here to support the sum-over and product-over maps. There’s also the where S: Zero + One clause, which tells us that our Semiring must be something that understands those two notions and have implementations for them.

To restore our boolean version of our engine, we have to build a nominal container that supports the various traits of our semiring. To do that, we need to implement the methods associated with Zero, One, Mul, and Add, and explain what they mean to the datatype of our semiring. The actual work is straightforward.

use num_traits::{One, Zero};
use std::ops::{Add, Mul};

pub struct Recognizer(bool);

impl Zero for Recognizer {
    fn zero() -> Recognizer { Recognizer(false) }
    fn is_zero(&self) -> bool { !self.0 }
}

impl One for Recognizer {
    fn one() -> Recognizer { Recognizer(true) }
}

impl Mul for Recognizer {
    type Output = Recognizer;
    fn mul(self, rhs: Recognizer) -> Recognizer { Recognizer(self.0 && rhs.0) }
}

impl Add for Recognizer {
    type Output = Recognizer;
    fn add(self, rhs: Recognizer) -> Recognizer { Recognizer(self.0 || rhs.0) }
}

Also, unlike Haskell, Rust must be explicitly told what kind of Semiring will be used before processing, whereas Haskell will see what kind of Semiring you need to produce the processed result and hook up the machinery for you, but that’s not surprising. In Rust, you “lift” a straight expression to a rigged one thusly:

let rigged: Regw<Recognizer>  = rig(&evencs);

All in all, porting the Haskell to Rust was extremely straightforward. The code looks remarkably similar, but for one detail. In the Kleene version of regular expressions we’re emulating as closely as possible the “all possible permutations of our input string” implicit in the set-theoretic language of Kleene’s 1956 paper. That slows us down a lot, but in Haskell the code for doing it was extremely straightforward, with two simple functions to create all possible permutations for both the sequence and repetition options:

split []     = [([], [])]
split (c:cs) = ([], c : cs) : [(c : s1, s2) | (s1, s2) <- split cs]
parts []     = [[]]
parts [c]    = [[[c]]]
parts (c:cs) = concat [[(c : p) : ps, [c] : p : ps] | p:ps <- parts cs]

In Rust, these two functions were 21 and 29 lines long, respectively. Rust’s demand that you pay attention to memory usage, and the rules about it, requires that you be very explicit about when you want memory, so Rust knows exactly when you no longer want it and can release it back to the allocator.

Rust’s syntax and support are amazing, and the way Haskell can be ported to Rust with little to no loss of fidelity makes me happy to work in both.

Okay, in part one I showed you what I’ve learned about writing regular expression engines in Haskell. The example shown uses Haskell chars, so is unicode-compliant.

I also mentioned a semiring. Here’s what a Haskell semiring’s class looks like. (A Haskell ‘class’ is more like a Trait in Rust, or an Interface in Java; it describes a collection of behaviors that we want implementors of the class to have. The syntax for an implementation is an ‘instance’):

class Semiring s where
    zero, one :: s
    mul, add  :: s -> s -> s

That’s… that’s it. s is a set, and these are the operations on the set. Remember that Int is a set: the set of all integer numbers.

The idea in the paper is to make regular expressions more generalized. To give them powers above and beyond what they’re normally capable of. This is also one of the main ideas in Barre, so I’ve been reading this paper with significant interest.

The idea here is to make the return type of the expression a Semiring. This is also the idea in Might and Adams, but here the authors go in a different direction. Because they have, as discussed last time, all the possible combinations of string and regex run at the same time, at the end of parsing, they end up with all possible parse trees automatically. The example showed all those resulting parse trees, which were lists (not sets) of boolean values being reduced down to, for sequences, “Are they all true?”, and for alternatives, “Are any of them true?” But for a brief moment, all possible parse trees exist.

So let’s create a semiring version of our regular expression. Better yet, let’s exploit the fact that we already have one. Here, the only real change is that we’ve said a rigged regular expression is one that takes a symbol and returns a semiring – a “rig”.

data Regw c s =                
    Epsw                       -- Epsilon
  | Symw (c -> s)              -- Character
  | Altw (Regw c s) (Regw c s) -- Alternation
  | Seqw (Regw c s) (Regw c s) -- Sequence
  | Repw (Regw c s)            -- R*

For our two “base” types, Epsilon and Symbol, we’ll have to provide new implementations; everything else can be “lifted” from the existing machinery, and here’s our lift:

rigged :: Semiring s => Reg -> Regw Char s
rigged Eps = Epsw
rigged (Sym c) = sym c
rigged (Alt p q) = Altw (rigged p) (rigged q)
rigged (Seq p q) = Seqw (rigged p) (rigged q)
rigged (Rep r)   = Repw (rigged r)

sym :: Semiring s => Char -> Regw Char s
sym c = Symw (\b -> if b == c then one else zero)

Those are the definitions, including the one that say that if the input symbol matches the constructed symbol, then return the Semiring’s version of “one”, otherwise “zero”. Recall that the test expression for the original version of Sym was u == [c]; we’ve now said that Symw takes a predicate that compares two symbols.

The accept version must now be written to handle the rigged versions, and in this case we pass the input string to the case statement, which says: if the input is a single symbol, pass it to the predicate; otherwise, zero:

acceptw :: Semiring s => Regw c s -> [c] -> s
acceptw Epsw u     = if null u then one else zero
acceptw (Symw f) u =
    case u of
      [c] -> f c
      _ -> zero
acceptw (Altw p q) u = acceptw p u `add` acceptw q u
acceptw (Seqw p q) u = sumr [ acceptw p u1 `mul` acceptw q u2 | (u1, u2) <- split u ]
acceptw (Repw r)   u = sumr [ prodr [ acceptw r ui | ui <- ps ] | ps <- parts u ]

In the Boolean version in the previous post, I used and and or to sum up all the resulting values of True and False that came back from the recognizer. I can’t do that here, because we don’t know what type of set the Semiring operates over; for Boolean, it would just be True/False. For other things…

But wait! Haskell can abstract that as well. We’ll replace and with prodr and or with sumr, and define them as folds over their semiring operations:

sumr, prodr :: Semiring r => [r] -> r
sumr = foldr add zero
prodr = foldr mul one

This works. If we’re talking booleans, add becomes ||, and mul become &&, and foldr (&&) True [True, True, True] is True, as is foldr (||) False [True, False, True], and so forth. Which gives us our first Semiring:

instance Semiring Bool where
    zero = False
    one = True
    add = (||)
    mul = (&&)

And then you can say:

> let nocs = Rep ( Alt ( Sym 'a' ) ( Sym 'b' ) )
> let onec = Seq nocs (Sym 'c' )
> let evencs = Seq ( Rep ( Seq onec onec ) ) nocs 
> acceptw (rigged evencs) "acc" :: Bool 
True

This is the point where I start to suspect Haskell of sorcery. You’ll notice that we didn’t actually associate the Semiring with our regular expression. No, we told acceptw that its job was to return a Boolean value, and since acceptw returns a Semiring, Haskell just went out and found a Semiring that does the job.  There was only one implementation in the current scope that meets the definition “a Semiring of Bool,” so Haskell assumed that must have been the one I wanted, and just used it.  To someone coming from the C/C++ world, that’s flaming magical. I understand it, but man, it’s magical.

All right. But I said Semirings can be defined on natural numbers. That looks exactly like you’d expect:

instance Semiring Int where
    zero = 0
    one = 1
    add = (+)
    mul = (*)
    

Yeah, zero is, um, zero. The number. There are no surprises here. So what happens when we ask:

> acceptw (rigged evencs) "acc" :: Int
1

“1”? What does “1” mean? It means that our example above has exactly one parse tree. The number of different ways our regular expression above can handle this is: one. “aaa” gets you zero. And a string of twelve letters takes several minutes on my laptop, due to that crazy explosion of possibilities to check that I discussed in the previous post.

So is there a way to show this working? Well, sure. What would the regular expression (a|a*) do when handed the string "a"?

> let as = Alt (Sym 'a') (Rep (Sym 'a'))
> acceptw (rigged as) "a" :: Int
2

Two? Yes: both the left and the right alternatives returned their semiring one value, and the alternative added them together. Two.

> acceptw (rigged as) "aa" :: Int
1

One? Yes: Only the repetition returned anything, and it consumed the whole string. There was only one parse tree returned to be summed up.

> let bs = Alt (Sym 'b') (Rep (Sym 'b'))
> acceptw (rigged (Seq as bs)) "ab" :: Int
4

Four? Two for recognizing the a, and two for recognizing the b, and then the sequence operation multiplied those values together to give us four.

And that’s where I am in the paper. I’ve gotten this to work. Here’s where things get interesting:

  • The third part of the paper discusses reducing that explosion-of-nodes issue by using Glushkov’s Construction to generate an efficient directed finite automata. I want to use Brzozowski’s algorithm instead, so this’ll be interesting.
  • The semiring construction here returns parse trees and does analysis on them on a symbol-by-symbol level; this is different from Might’s use of a new regular expression type that acts as a parser-combinator, taking the parse tree and returning something built out of it, performing Henglein’s “catamorphism on parse trees”; Adams’ version takes this further and does the catamorphism the moment parsing is done. What this means is that Might’s & Adams’s work allows us to insert transformations in the middle of the expression, much the same way a PEG might return something constructed and of a different type, rather than merely a subset of the data provided.
  • Brzozowski’s algorithm with Might’s nullability operation can be made recursive; does that make the semiring construction invalid? (I don’t think so, since the analysis is done after the recursion has been successfully traversed, but I’m not sure.)
  • The semiring construction requires a whole parse tree, but Brzozowski’s algorithm with Might’s reduction node addition allows for the parser to return results at any time, yielding them, turning the parser into one of type Parser Iterator s, Iterator r => s -> r; in Rust terms, it takes an iterator as an input, and it returns an iterator; every time you call .next() on it, it consumes as many symbols from s as necessary until a coherent element of type r can be returned, or it returns an error. Can this technique be adapted to work in that environment?

I have two competing desires at this point: (1) Try to adapt the work they’ve done here to a Brzozowski implementation, or (2) try to implement the work they’ve done here in Rust. I haven’t decided which yet.

I’ve been working my way through A Play on Regular Expressions, a programming pearl by Sebastian Fischer, Frank Huch, and Thomas Wilke on the use of weighted values in regular expressions. The paper has six parts, and I’ve understood everything up through part two, and what I’ve understood already blows my mind. I recently put up a big chunk of documentation on where I am in Barre, and where I am is that I have a reliable, working toolkit for parsing regular expressions and returning the string found (I have “Smart Epsilons”), but I haven’t progressed much beyond that part.

This paper does something different. It starts with the idea of a Semiring as the basic “wrapper” type of the expression.

A semiring is a set R with members and operations written: (0, 1, ⊕, ⊛, R). A semiring is an abstraction of common addition and multiplication over natural numbers (The integers from zero to infinity), with the requirement that 0 ⊕ x = x, 1 ⊛ x = x, 0 ⊛ x = 0. That last operation is called “annihilation.” That’s it; it has two values from the set with specific meanings, and two operations on members of the set that have specific meanings, and a couple of rules about the two values. Try hard not to set in your mind that semirings are about any one “thing” at all. It’s just an arbitrary set of rules that lets us do cool things.

A semiring can also be defined over a set of sets! In fact, regular expressions are themselves semirings: ⊕ defines the union of the results of two expressions; this is used in the Alternative operator; ⊛ is the Sequence operator, and defines all possible legal combinations of two expressions in sequence. An expression that recognizes nothing is zero; an expression that recognizes only the empty string is one. If an operation returns the empty set, it’s an annihilator under ⊛.
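That aside can be written down directly; this is a sketch of my own, and the semiring laws hold only up to language equivalence (Alt Empty r recognizes the same strings as r), not up to equality of the expression trees:

class Semiring s where
    zero, one :: s
    add, mul  :: s -> s -> s

-- The expressions themselves, with an explicit empty language added.
data Rx = Empty | Eps | Sym Char | Alt Rx Rx | Seq Rx Rx | Rep Rx

instance Semiring Rx where
    zero = Empty    -- recognizes nothing
    one  = Eps      -- recognizes only the empty string
    add  = Alt      -- union of the two languages
    mul  = Seq      -- concatenation of the two languages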

But that’s not what this paper is about. This paper starts with a basic regular expression. If you’ve been with me at all, you know that regular expressions are six functions that take a string and return a value: Empty (always False), Epsilon (True on Empty String), Token (True if it matches the constructed Token), Alternative (True if any constructed expression within matches), Sequence (True if the two constructed expressions within match in order), and Repetition (True if the constructed expression within matches zero or more times). Here are the five main operations (the sixth, “Empty”, is assumed) in Haskell for a regular expression recognizer (boolean values only):

data Reg = 
    Eps         -- Epsilon
  | Sym Char    -- Token
  | Alt Reg Reg -- Alternation
  | Seq Reg Reg -- Sequence
  | Rep Reg     -- Repetition

accept :: Reg -> String -> Bool
accept Eps u       = null u
accept (Sym c) u   = u == [c]
accept (Alt p q) u = accept p u || accept q u
accept (Seq p q) u = or [accept p u1 && accept q u2 | (u1, u2) <- split u]
accept (Rep r) u   = or [and [accept r ui | ui <- ps] | ps <- parts u]

split :: [a] -> [([a], [a])]
split []     = [([], [])]
split (c:cs) = ([], c : cs) : [(c : s1, s2) | (s1, s2) <- split cs]

parts :: [a] -> [[[a]]]
parts []     = [[]]
parts [c]    = [[[c]]]
parts (c:cs) = concat [[(c : p) : ps, [c] : p : ps] | p:ps <- parts cs]

This says that we return true if the string is empty and the expression is empty; we return true if the string is a single char and the expression is a symbol of one character. split returns the string divided at every possible point along its length; parts goes further and produces every possible partition of the string; an eight-letter word has 2⁷ (128) possible partitions:

split "abc" == [("","abc"),("a","bc"),("ab","c"),("abc","")]
parts "abc" == [["abc"],["a","bc"],["ab","c"],["a","b","c"]]

This is hideously inefficient. It has to be done repeatedly for every sequence and subsequence of the string and its expression; this function is exponential, since a string of n letters produces 2ⁿ⁻¹ partitions! But it does work. You can build a regular expression out of this and it does recognize strings reliably:

> let nocs = Rep ( Alt ( Sym 'a' ) ( Sym 'b' ) )
> let onec = Seq nocs (Sym 'c' )
> let evencs = Seq ( Rep ( Seq onec onec ) ) nocs
> accept evencs "acc"
True

In part two, things get really cool.
