## THINKING ABOUT REFACTORING BARRE, AND WHAT TO DO NEXT

I seem to be in a state of semantic confusion.

I went back and reviewed the Theory of Regular Operations, which is section 1.10 of Sipser's Theory of Computation, and I've come upon a bit of *semantic confusion*. I need to decide how to resolve it and, more importantly, *when* to resolve it.

According to Sipser, a *regular language* is any *language* (that is, a set of *strings* made up of *symbols*) for which a *finite automaton* can be constructed. For Brzozowski, a *regular language* can be *recognized* by repeatedly taking its derivative with respect to each input symbol, with the following rules:

- Dc(∅) = ∅
- Dc(ε) = ∅
- Dc(c') = ε if c = c', ∅ otherwise
- Dc(re1 re2) = δ(re1) Dc(re2) | Dc(re1) re2
- Dc(re1 | re2) = Dc(re1) | Dc(re2)
- Dc(re\*) = Dc(re) re\*
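The rules above can be sketched in Rust roughly as follows. This is a minimal illustration, not Barre's actual code: the `Re` enum and the names `nullable`, `derive`, and `matches` are all made up for this example. The δ in the concatenation rule is handled by the `nullable` check, and there is no simplification of the resulting expressions, so this is very much the naive version.

```rust
// Hypothetical expression type for illustration; not Barre's real enum.
#[derive(Debug, Clone, PartialEq)]
enum Re {
    Empty,                 // ∅: matches nothing
    Eps,                   // ε: matches only the empty string
    Sym(char),             // a single symbol
    Cat(Box<Re>, Box<Re>), // re1 re2
    Alt(Box<Re>, Box<Re>), // re1 | re2
    Star(Box<Re>),         // re*
}

// δ(re): true when the language of `re` contains the empty string.
fn nullable(re: &Re) -> bool {
    match re {
        Re::Empty | Re::Sym(_) => false,
        Re::Eps | Re::Star(_) => true,
        Re::Cat(a, b) => nullable(a) && nullable(b),
        Re::Alt(a, b) => nullable(a) || nullable(b),
    }
}

// Dc(re): the derivative of `re` with respect to the symbol `c`.
fn derive(re: &Re, c: char) -> Re {
    match re {
        Re::Empty | Re::Eps => Re::Empty,
        Re::Sym(s) => {
            if *s == c {
                Re::Eps
            } else {
                Re::Empty
            }
        }
        Re::Cat(a, b) => {
            // Dc(re1) re2 is always part of the result...
            let left = Re::Cat(Box::new(derive(a, c)), b.clone());
            if nullable(a) {
                // ...and δ(re1) Dc(re2) only contributes when re1 is nullable.
                Re::Alt(Box::new(derive(b, c)), Box::new(left))
            } else {
                left
            }
        }
        Re::Alt(a, b) => Re::Alt(Box::new(derive(a, c)), Box::new(derive(b, c))),
        Re::Star(a) => Re::Cat(Box::new(derive(a, c)), Box::new(re.clone())),
    }
}

// Take the derivative for each input symbol, then ask whether what is
// left over can still accept the empty string.
fn matches(re: &Re, input: &str) -> bool {
    let end = input.chars().fold(re.clone(), |state, c| derive(&state, c));
    nullable(&end)
}
```

With this sketch, an expression built as `Star(Cat(Sym('a'), Sym('b')))` would accept the empty string and `abab`, and reject `aba`.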

Only one of those rules talks about the *symbols*, namely the third; all the rest are either end-state issues or what are termed in the business the *regular operations*. (In the concatenation rule, δ(re) is ε if re can match the empty string, and ∅ otherwise.)

The *first two* are *results* which describe the state of the machine at some point in the process. If the machine can accept ε at the end of the process (in the simplest case, it *is* ε) then the string was recognized, and if it's in any other state then the string was not recognized. The third is about the recognition of symbols.

A tree is a directed graph in which any two nodes are connected by exactly one path. The outcome of Brzozowski's algorithm, the naive version, is a tree generated dynamically as the derivative results of the current symbol (provided it doesn't cause termination) are placed onto the stack and then processed with the subsequent symbol. Characters, CharacterClasses (i.e. ranges, `[A-Z]` or `[0-9]`, etc.), and the *Any* symbol are leaf nodes for which a submatch operation (such as 'Alternative' or 'Concatenation') allows the process to proceed. The last three items in the list are the three main operations: 'Alternative', 'Concatenate', and 'Star'.
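To make the leaf nodes concrete, here is a small sketch of how the three kinds of leaf might answer the only question a leaf is ever asked: "does this symbol match?" The `Leaf` type and its `accepts` method are hypothetical names for this example, and a character class is assumed to be a single inclusive range, which is simpler than a real class like `[A-Za-z0-9]`.

```rust
use std::ops::RangeInclusive;

// Hypothetical leaf type; a real character class would hold several ranges.
#[derive(Debug, Clone, PartialEq)]
enum Leaf {
    Char(char),                 // a literal character
    Class(RangeInclusive<char>), // a single range such as [A-Z]
    Any,                        // matches every symbol
}

impl Leaf {
    // A leaf either consumes the symbol (yielding ε) or fails (yielding ∅);
    // this returns true in the first case.
    fn accepts(&self, c: char) -> bool {
        match self {
            Leaf::Char(x) => *x == c,
            Leaf::Class(range) => range.contains(&c),
            Leaf::Any => true,
        }
    }
}
```

Everything above the leaves (Alternative, Concatenate, Star) only decides which leaves get asked next.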

Other operations, such as "At least," "At most," "Exactly *n*", and "Plus" are just variants on the above, and how to optimize them isn't something I've addressed yet. Nor have I addressed the "automatic derivative of a concatenation of symbols," which could greatly speed up some searches; an extended string match would result in either ε or ∅. (On the other hand, the automatic concatenation of symbols would also result in the need for a checkpointing operation, as now we'd be trading memory for a dubious speedup.)
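As an unoptimized illustration of "just variants on the above," each of these operations can be desugared into the three core ones. The `Re` type and the helper names (`cat`, `alt`, `star`, `plus`, `exactly`, `at_least`, `at_most`) are assumptions made for this sketch, not an existing API.

```rust
// Hypothetical expression type, limited to what the desugaring needs.
#[derive(Debug, Clone, PartialEq)]
enum Re {
    Eps,
    Sym(char),
    Cat(Box<Re>, Box<Re>),
    Alt(Box<Re>, Box<Re>),
    Star(Box<Re>),
}

fn cat(a: Re, b: Re) -> Re {
    Re::Cat(Box::new(a), Box::new(b))
}

fn alt(a: Re, b: Re) -> Re {
    Re::Alt(Box::new(a), Box::new(b))
}

fn star(a: Re) -> Re {
    Re::Star(Box::new(a))
}

// Plus: r+ is r followed by r*.
fn plus(r: Re) -> Re {
    cat(r.clone(), star(r))
}

// Exactly n: r concatenated with itself n times, starting from ε.
fn exactly(r: &Re, n: usize) -> Re {
    (0..n).fold(Re::Eps, |acc, _| cat(acc, r.clone()))
}

// At least n: exactly n copies, then any number more.
fn at_least(r: &Re, n: usize) -> Re {
    cat(exactly(r, n), star(r.clone()))
}

// At most n: (ε | r (ε | r (...))) nested n levels deep.
fn at_most(r: &Re, n: usize) -> Re {
    (0..n).fold(Re::Eps, |acc, _| alt(Re::Eps, cat(r.clone(), acc)))
}
```

The expansion grows linearly with *n*, which is exactly why these would want a smarter representation eventually.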

There are some non-operations that also need to be considered, including capture groups and non-nested *successive* backtracking (nested and precessive backtracking I am *not* handling, maybe not ever). The input iterator needs a wrapper to introduce specialized symbols for WordBoundary, StartOfText, StartOfLine, and their complements.
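One way such a wrapper might look, purely as a sketch: walk the input once and interleave marker tokens with the real characters. The `Token` enum, `is_word`, and `annotate` are hypothetical names, the definition of a "word" character is an assumption, and the complements (not-a-word-boundary and friends) are left out. It also collects into a `Vec` for brevity where a real wrapper would presumably stay lazy.

```rust
// Hypothetical token type: real characters plus zero-width markers.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    StartOfText,
    StartOfLine,
    WordBoundary,
    Char(char),
}

// Assumption: a "word" character is alphanumeric or an underscore.
fn is_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

fn annotate(text: &str) -> Vec<Token> {
    // The start of the text is also the start of the first line.
    let mut out = vec![Token::StartOfText, Token::StartOfLine];
    let mut prev: Option<char> = None;
    for c in text.chars() {
        // A boundary sits wherever word-ness changes between neighbors.
        let prev_is_word = prev.map_or(false, is_word);
        if prev_is_word != is_word(c) {
            out.push(Token::WordBoundary);
        }
        out.push(Token::Char(c));
        if c == '\n' {
            out.push(Token::StartOfLine);
        }
        prev = Some(c);
    }
    // The end of the text closes a trailing word.
    if prev.map_or(false, is_word) {
        out.push(Token::WordBoundary);
    }
    out
}
```

The recognizer would then treat the markers as ordinary symbols that happen to consume no input.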

More to the point, the *Recognizer* recognizes *Strings* about *Languages*, but I don't have a *Language*, I have a convenient enum of recognizer machine states. You can perform operations on the Languages, and that's more what I want. Once you can cat, alt, and star a whole language, the algebra of my recognizer becomes much easier to use.
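To show what I mean by an algebra of Languages, here is a sketch using operator overloading: `+` for concatenation, `|` for alternation, and a `star` method. The `Language` type here is a hypothetical stand-in with boxed children, which sidesteps the arena question entirely, and the choice of operators is just one possibility.

```rust
use std::ops::{Add, BitOr};

// Hypothetical Language type for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Language {
    Sym(char),
    Cat(Box<Language>, Box<Language>),
    Alt(Box<Language>, Box<Language>),
    Star(Box<Language>),
}

impl Language {
    // Kleene star of a whole language.
    fn star(self) -> Language {
        Language::Star(Box::new(self))
    }
}

// `a + b` concatenates two languages.
impl Add for Language {
    type Output = Language;
    fn add(self, rhs: Language) -> Language {
        Language::Cat(Box::new(self), Box::new(rhs))
    }
}

// `a | b` is the alternation (union) of two languages.
impl BitOr for Language {
    type Output = Language;
    fn bitor(self, rhs: Language) -> Language {
        Language::Alt(Box::new(self), Box::new(rhs))
    }
}
```

With that in place, `(Language::Sym('a') + Language::Sym('b')).star() | Language::Sym('c')` reads a lot like the expression `(ab)*|c` it stands for.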

The real problem comes down to the use of a memory arena to represent the tree of nodes. Rust Leipzig's Idiomatic Trees and Graphs shows that the *graph* has to allocate the nodes, that his pointers are opaque handles, and that even his library doesn't supply a composition of graphs operation.
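A stripped-down sketch of the arena shape shows where the trouble is. Handles are just indices into one particular arena, so combining two arenas means copying nodes and rewriting every handle. The names `Arena`, `NodeId`, `Node`, and `absorb` are invented for this example, and `absorb` is only one guess at what a composition operation could look like.

```rust
// An opaque handle: only meaningful for the arena that issued it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NodeId(usize);

#[derive(Debug, Clone, PartialEq)]
enum Node {
    Sym(char),
    Cat(NodeId, NodeId),
    Alt(NodeId, NodeId),
    Star(NodeId),
}

// The arena owns every node; children are referred to by handle.
#[derive(Debug, Default)]
struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    fn add(&mut self, node: Node) -> NodeId {
        self.nodes.push(node);
        NodeId(self.nodes.len() - 1)
    }

    fn get(&self, id: NodeId) -> &Node {
        &self.nodes[id.0]
    }

    // Copy all of `other` into `self`, shifting each handle by the number
    // of nodes already here. Returns the new handle for `root`.
    fn absorb(&mut self, other: &Arena, root: NodeId) -> NodeId {
        let offset = self.nodes.len();
        let shift = |id: NodeId| NodeId(id.0 + offset);
        for node in &other.nodes {
            let moved = match node {
                Node::Sym(c) => Node::Sym(*c),
                Node::Cat(a, b) => Node::Cat(shift(*a), shift(*b)),
                Node::Alt(a, b) => Node::Alt(shift(*a), shift(*b)),
                Node::Star(a) => Node::Star(shift(*a)),
            };
            self.nodes.push(moved);
        }
        shift(root)
    }
}
```

Once one arena has absorbed the other, a new `Cat` or `Alt` node can join the two roots, but every composition costs a full copy of one side.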

All this is to say that I'm a bit stymied at the moment. There are obvious steps I could take, but imagine how much easier it would be to write regular expressions if they were concatenatable, alternable, and Kleene-able?