Theoretical computer science is fascinating. There are all sorts of papers out there about advanced topics which haven't quite reached mainstream acceptance or awareness, and once in a while you find a fascinating topic and start chasing down the rabbit hole of implementations and papers trying to wrap your head around the ideas presented.
I've been doing this for the past two weeks with respect to something called Brzozowski Derivatives, an algorithm for parsing and running regular expressions. Without access to Janus Brzozowski's original 1964 paper, I was left reading Matt Might's 2011 paper in which he discussed Brzozowski's algorithm and how advances in modern software made it possible to implement a footnote in Brzozowski's original paper: if it could be made recursive (that is, if a sub-expression could refer to itself), it could be built into a full-blown parser. Parsers often use regular expressions, but context-free grammars require a lot more specialized handling. Brzozowski unified the two, but only in math: software couldn't handle it. Might figured out how to make it work.
But the performance was terrible. Might asked for assistance in improving the implementation, and a lot of papers were published. And I've read a lot of papers this week. Usually one a day. Might's original paper was seven pages long, and it took me seven hours to read it, or approximately one page an hour. I visited the University of Washington library and downloaded or printed quite a few.
And then I moved on to the other papers. And I learned a lot about reading theoretical computer science papers, and here's what I learned.
My goal here is to do something new and interesting: Implement a Brzozowski Expression Library in a systems language. To the best of my knowledge, this has never been done before; all of the implementations are in languages that have memory management, closures, continuations, and lots of nifty things that aren't available in systems languages like C, C++, or Rust.
If you're in this boat, here's what you want to do: Read the Introduction (not the abstract), then scan the rest of the paper to determine: is this paper an implementation, an algorithm, or results? A "results" paper is almost always useless; the authors are bragging about something they achieved, but they have no interest helping you understand how they achieved it. Even then, scan the paper and ask: "Is there a link to source code?"
Sometimes, even if there isn't, a judicious Google search will sometimes help you. Sometimes, you can just type the author's name into Github, Bitbucket, and GoogleCode, and see what you get.
If the paper is an implementation paper, go down to the "Conclusions" section and see if their implementation offers anything worthwhile. Might's original paper concluded that the implementation was simple (and indeed, it was so simple it took me only two days to write it in Rust), but maybe there could be improvements. Writing it in Rust gave me a grip on the scale of the non-recursive problem.
Sometimes, "worthwhileness" is hard to determine from a single reading. Scott Owen's paper (see below) is entirely about creating static DFAs for regular expressions via derivatives and not about higher-order parsing, so at first blush it didn't seem worthwhile; however, Owens also offered two extensions to Brzozowski's theory that allow for regular expression predicates (the sequence of symbols to be matched must match more than one expression, what Perl 6 calls the "conjunctive operator") without backtracking, which is an amazing result.
Your next step is to search for citations of the paper, to see if any other papers use it. CiteSeer is surprisingly useless for this task. I found a much better solution is to search for the title with a modifier to find PDFs, such as
"Parsing with Derivatives" inurl:pdf, after which a lot of papers jump out. For me, the most interesting papers included "POSIX Regular Expression Parsing with Derivatives" (Sulzmann and Lu), "Regular Expression Derivatives Reexamined" (Owens, et al.), and "On the Complexity and Performance of Parsing with Derivatives" (Adams, et. al.).
While lots of papers are hidden behind various paywalls, often papers associated with conference releases, preprints, and pre-reviewed versions are readily available. When all else failed, I've emailed the lead author and asked for a copy, and so far they've always just mailed me a copy. (The authors of papers love to know someone is engaging with their work. Really.)
Those three papers represent the core of what I'm studying. The Adams paper features a better algorithm for detecting early match failures, as well as an important insight into avoiding the infinite loops inherent in Brzozowski's original description. The Sulzmann paper describes using some performance hacks from classical regular expressions that might be applicable to Brzozowski. And the Owens paper proves (in the mathematical sense) that Brzozowski's algorithm supports Perl-style regular expression conjunctions natively without having to supply an internal boolean mini-parser as PCRE does.
Brzozowski's algorithm has six basic sub-cases for matching: a letter, a concatenation (this then that), an alternative (this or that), repetition (the star operator), the empty string, and the empty parser. The most complex is concatenation, so let me show you the equation for that as presented in the algorithm papers. The concatenation operator is an open circle; the '∪' operator is the set theoretical symbol for 'union' and represents alternation.
These equations all mean the same thing: "The derivative of two regular expressions in sequence with respect to a character c depends on whether or not the first expression recognizes the empty string; if it does, then the derivative is the derivative of the first regular expression with respect to c followed by the second regular expression or the derivative of the second expression with respect to c, and if it doesn't then it's just the first derivative of the first expression with respect to c followed by the second expression."
Good. Might uses Dc to represent "The derivative with respect to c," and he uses c to mean "character."
Might: Dc(L1 ◦ L2) = (Dc(L1) ◦ L2) ∪ Dc(L2) if ε ∈ L1 Dc(L1 ◦ L2) = Dc(L1) ◦ L2 if ε not ∈ L1
Adams: Dc(L1 ◦ L2) = (Dc (L1) ◦ L2) ∪ Dc (L2) if ε ∈ [[L1]] Dc(L1 ◦ L2) = Dc (L1 ◦ L2) = (Dc (L1) ◦ L2 otherwise
Owens: ∂a (r · s) = ∂a r · s + ν(r) · ∂a s
Sulzmann: (r1r2)\l= (r1\l)r2 + r2\l if ε ∈ L(r1), (r1\l)r2 otherwise
Those are (supposedly) all the same. The Adams and Might equations are entirely the same, aside from Adams' use of the L1 vs [[L1]] syntax. Understanding Owens' formulation requires that you understand that the ν(r) expresses the same thing as the ε ∈ L1 syntax from Might, that is, ν is a function that determines whether or not ε is a member of the language r, and exploits the fact that ∅ anywhere in a sequence expression results in no matching ever.
And Sulzmann, who even cited Might's paper... Sulzmann is just from another world. His equation means the same thing, but as you can see his notation is entirely, wildly different. The concatenation symbol is elided as "understood," the "with respect to a character c" is "/l"...
Now, the thing here is that there are six basic operators in regular expression theory, so since every paper has this same table of six operations, it was possible to figure out, eventually, what Sulzmann meant by his notation. But it took work; I had to write down on a sheet of paper his equations and discoveries in order to make sense of them.
If you want to implement a result in theoretical computer science as a research project and, eventually, as a library, you're going to have to read and re-read the papers until you've internalized the basic rules and gotten comfortable with at least one notation, and if you're really comfortable with that notation you're going to have to take a deep breath and re-write many of the other paper's statements and conclusions in your notation so you have something you can readily absorb when you're looking for refinements, extensions, and progress in the topic.
It's also fun (for very geeky definitions of the word "fun"). I've been having a good time working this out.