I seem to be in a state of semantic confusion.
I went back and reviewed the Theory of Regular Operations, which is section 1.10 of Sipser's Theory of Computation, and I seem to have come upon a bit of semantic confusion that I need to decide how to resolve, and more importantly, when to resolve.
According to Sipser, a regular language is any language (that is, a set of strings made up of symbols) for which a finite automata can be constructed. For Brozozowski, a regular language can be recognized by processing its derivative, with the following rules:
- Dc(∅) = ∅
- Dc(ε) = ∅
- Dc(c) = ε if c = c', ∅ otherwise
- Dc(re1 re2) = δ(re1) Dc(re2) | Dc(re1) re2
- Dc(re1 | re2) = Dc(re1) | Dc(re2)
- Dc(re_) = Dc(re) re_
Only one of those rules talks about the symbols, namely the third; all the rest are either end-state issues or what are termed in the busines the regular operations.
The first two are results which describe the state of the machine at some point in the process. If the machine is ε at the end of the process then the string was recognized, and if it's in any other state then the string was not recognized. The third is about the recognition of symbols.
A tree is a directed graph in which any two nodes are connected by exactly one path. The outcome of Brzozowski's algorithm, the naive version, is a tree generated dynamically as the derivative results of the current symbol (provided it doesn't cause termination) are placed onto the stack and then processed with the subsequent symbol. Characters, CharacterClasses (i.e. ranges, [A-Z]
or [0-9]
, etc), and the Any symbol are leaf nodes for which a submatch operation (such as 'Alternative' or 'Concatenation') allows the process to proceed. The last three items in the list are the three main operations: 'Alternative', 'Concatenate', and 'Star'.
Other operations, such as "At least," "At most," "Exactly n", and "Plus" are just variants on the above; how to optimize them isn't something I've addressed yet, nor the "automatic derivative of a concatenation of symbols," which could greatly speed up some searches; an extended string match would result in either ε or ∅. (On the other hand, the automatic concatenation of symbols would also result in the need for a checkpointing operation, as now we'd be trading memory for a dubious speedup.)
There are some non-operations that also need to be considered, including capture groups and non-nested successive backtracking (nested and precessive backtracking I am not handling, maybe not ever). The input iterator needs a wrapper to introduce specialized symbols for WordBoundary, StartOfText, StartOfLine, and their complements.
More to the point, the Recognizer recognizes Strings about Languages, but I don't have a Language, I have a convenient enum of recognizer machine states. You can perform operations on the Languages, and that's more what I want. Once you can cat, alt, and star a whole language, the algebra of my recognizer becomes much easier to use.
The real problem comes down to the use of a memory arena to depict the tree of nodes. Rust Leipzig's Idiomatic Trees and Graphs shows that the graph has to allocate the nodes, his pointers are opaque handles, and even his library doesn't supply a composition of graphs operation.
All this is to stay that I'm a bit stymied at the moment. There are obvious steps I could take, but imagine how much easier it would be to write regular expressions if they were concatenatable, alternable, and kleene-able?