I seem to be in a state of semantic confusion.

I went back and reviewed the Theory of Regular Operations, which is section 1.10 of Sipser’s Theory of Computation, and I seem to have come upon a bit of *semantic confusion* that I need to decide how to resolve, and more importantly, *when* to resolve.

According to Sipser, a *regular language* is any *language* (that is, a set of *strings* made up of *symbols*) for which a *finite automata* can be constructed. For Brozozowski, a *regular language* can be *recognized* by processing its derivative, with the following rules:

- Dc(∅) = ∅
- Dc(ε) = ∅
- Dc(c) = ε if c = c’, ∅ otherwise
- Dc(re1 re2) = δ(re1) Dc(re2) | Dc(re1) re2
- Dc(re1 | re2) = Dc(re1) | Dc(re2)
- Dc(re
*) = Dc(re) re*

Only one of those rules talks about the *symbols*, namely the third; all the rest are either end-state issues or what are termed in the busines the *regular operations*.

The *first two* are *results* which describe the state of the machine at some point in the process. If the machine is ε at the end of the process then the string was recognized, and if it’s in any other state then the string was not recognized. The third is about the recognition of symbols.

A tree is a directed graph in which any two nodes are connected by exactly one path. The outcome of Brzozowski’s algorithm, the naive version, is a tree generated dynamically as the derivative results of the current symbol (provided it doesn’t cause termination) are placed onto the stack and then processed with the subsequent symbol. Characters, CharacterClasses (i.e. ranges, `[A-Z]`

or `[0-9]`

, etc), and the *Any* symbol are leaf nodes for which a submatch operation (such as ‘Alternative’ or ‘Concatenation’) allows the process to proceed. The last three items in the list are the three main operations: ‘Alternative’, ‘Concatenate’, and ‘Star’.

Other operations, such as "At least," "At most," "Exactly *n*", and "Plus" are just variants on the above; how to optimize them isn’t something I’ve addressed yet, nor the "automatic derivative of a concatenation of symbols," which could greatly speed up some searches; an extended string match would result in either ε or ∅. (On the other hand, the automatic concatenation of symbols would also result in the need for a checkpointing operation, as now we’d be trading memory for a dubious speedup.)

There are some non-operations that also need to be considered, including capture groups and non-nested *successive* backtracking (nested and precessive backtracking I am *not* handling, maybe not ever). The input iterator needs a wrapper to introduce specialized symbols for WordBoundary, StartOfText, StartOfLine, and their complements.

More to the point, the *Recognizer* recognizes *Strings* about *Languages*, but I don’t have a *Language*, I have a convenient enum of recognizer machine states. You can perform operations on the Languages, and that’s more what I want. Once you can cat, alt, and star a whole language, the algebra of my recognizer becomes much easier to use.

The real problem comes down to the use of a memory arena to depict the tree of nodes. Rust Leipzig’s Idiomatic Trees and Graphs shows that the *graph* has to allocate the nodes, his pointers are opaque handles, and even his library doesn’t supply a composition of graphs operation.

All this is to stay that I’m a bit stymied at the moment. There are obvious steps I could take, but imagine how much easier it would be to write regular expressions if they were concatenatable, alternable, and kleene-able?

## 2 Responses to Thinking about refactoring Barre, and what to do next

Buddha Buck

October 17th, 2018 at 12:10 pm

I believe you mentioned derivatives of regular expressions months ago, which lead me down the pathway you are visiting now. My musings on the semantic issues you are dealing with are expressed in detail at http://hacek.homelinux.org/historical-regular-expressions/ in a blog post I wrote in January.

It includes links to early papers on regular expressions by Thompson, Brozozowski, and Kleene.

Brozozowski does make FA out of his derivatives, noting that for a given regular expression, there are only a finite number of sets of similar regular expressions formed by differentiation (similarity is a weakened form of equality — all similar regular expressions will recognize the same language, but not all regular expressions which recognize the same language will be similar). These sets of similar REs form the states of the FA, and the paths of differentiation take you from state to state (as in, if you start with an re “r”, you start in the state corresponding to re’s similar to “r”. If you take in the character “c”, you move to the state of re’s similar to D_c r, if you take in the character “a”, you then move to the state of re’s similar to D_ca r, and so on). Similar RE’s which accept the empty string are final/recognizing states of the FA.

The bulk of my RE derivative code (in Haskell) is devoted to implementing Brozozowski’s similarity functionality. My goal was to take an RE and reduce it to a “canonical” form that, among other things, doesn’t have any null REs or unneeded empty REs. Two RE’s with the same canonical form are similar, and generally the canonical form of D_c r is much smaller than D_c r itself, and much faster to work with.

Elf Sternberg

October 28th, 2018 at 1:05 pm

The optimizations proposed by Might & Adams automatically remove the null and peripheral REs by default. Owens paper (see my later post) includes mention of actually includes intersection, which, if I’m reading this right, provides an alternative to Perl’s regex predicates and regex conjunctions without any special magic in the parsser. Which is just wild; like, did no one realize this before, when it’s apparently in Kleene’s original paper?