I seem to be in a state of semantic confusion.

I went back and reviewed the Theory of Regular Operations, which is section 1.10 of Sipser’s Theory of Computation, and I seem to have come upon a bit of semantic confusion that I need to decide how to resolve, and more importantly, when to resolve.

According to Sipser, a regular language is any language (that is, a set of strings made up of symbols) for which a finite automaton can be constructed. For Brzozowski, a string in a regular language can be recognized by repeatedly taking the derivative of the language’s regular expression with respect to each symbol, with the following rules:

  • Dc(∅) = ∅
  • Dc(ε) = ∅
  • Dc(c′) = ε if c = c′, ∅ otherwise
  • Dc(re1 re2) = δ(re1) Dc(re2) | Dc(re1) re2
  • Dc(re1 | re2) = Dc(re1) | Dc(re2)
  • Dc(re*) = Dc(re) re*

Only one of those rules talks about the symbols, namely the third; all the rest are either end-state issues or what are termed in the business the regular operations.

The first two are results which describe the state of the machine at some point in the process. If the remaining expression is nullable (that is, it can match the empty string, ε) when the input is exhausted, then the string was recognized; if it’s anything else, the string was not recognized. The third rule is about the recognition of symbols.
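To make those rules concrete, here’s a minimal sketch of them in Rust. It’s purely illustrative, with my own names and a boxed-enum representation; the actual engine stores its nodes in an arena and passes handles around instead:

#[derive(Clone, PartialEq)]
enum Re {
    Empty,                 // ∅: matches nothing
    Epsilon,               // ε: matches the empty string
    Tok(char),             // a single symbol
    Cat(Box<Re>, Box<Re>), // concatenation
    Alt(Box<Re>, Box<Re>), // alternation
    Star(Box<Re>),         // Kleene star
}

// δ(re): ε if re can match the empty string, ∅ otherwise.
fn delta(re: &Re) -> Re {
    let nullable = match re {
        Re::Empty | Re::Tok(_) => false,
        Re::Epsilon | Re::Star(_) => true,
        Re::Cat(l, r) => delta(l) == Re::Epsilon && delta(r) == Re::Epsilon,
        Re::Alt(l, r) => delta(l) == Re::Epsilon || delta(r) == Re::Epsilon,
    };
    if nullable { Re::Epsilon } else { Re::Empty }
}

// Dc(re): the derivative of re with respect to the symbol c.
fn derive(re: &Re, c: char) -> Re {
    match re {
        Re::Empty | Re::Epsilon => Re::Empty,
        Re::Tok(c2) => if *c2 == c { Re::Epsilon } else { Re::Empty },
        // Dc(re1 re2) = δ(re1) Dc(re2) | Dc(re1) re2
        Re::Cat(l, r) => Re::Alt(
            Box::new(Re::Cat(Box::new(delta(l)), Box::new(derive(r, c)))),
            Box::new(Re::Cat(Box::new(derive(l, c)), Box::new((**r).clone()))),
        ),
        // Dc(re1 | re2) = Dc(re1) | Dc(re2)
        Re::Alt(l, r) => Re::Alt(Box::new(derive(l, c)), Box::new(derive(r, c))),
        // Dc(re*) = Dc(re) re*
        Re::Star(inner) => Re::Cat(Box::new(derive(inner, c)), Box::new(re.clone())),
    }
}

// A string is recognized if the final derivative is nullable.
fn matches(re: &Re, input: &str) -> bool {
    let end = input.chars().fold(re.clone(), |acc, c| derive(&acc, c));
    delta(&end) == Re::Epsilon
}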

A tree is a graph in which any two nodes are connected by exactly one path. The outcome of Brzozowski’s algorithm, in the naive version, is a tree generated dynamically as the derivative results of the current symbol (provided it doesn’t cause termination) are placed onto the stack and then processed with the subsequent symbol. Characters, CharacterClasses (ranges such as [A-Z] or [0-9], etc.), and the Any symbol are leaf nodes, above which a submatch operation (such as ‘Alternative’ or ‘Concatenation’) allows the process to proceed. The last three items in the list are the three main operations: ‘Alternative’, ‘Concatenate’, and ‘Star’.

Other operations, such as "At least," "At most," "Exactly n", and "Plus" are just variants on the above; how to optimize them isn’t something I’ve addressed yet, nor the "automatic derivative of a concatenation of symbols," which could greatly speed up some searches; an extended string match would result in either ε or ∅. (On the other hand, the automatic concatenation of symbols would also result in the need for a checkpointing operation, as now we’d be trading memory for a dubious speedup.)

There are some non-operations that also need to be considered, including capture groups and non-nested successive backtracking (nested and precessive backtracking I am not handling, maybe not ever). The input iterator needs a wrapper to introduce specialized symbols for WordBoundary, StartOfText, StartOfLine, and their complements.

More to the point, the Recognizer recognizes strings as belonging to Languages, but I don’t have a Language, I have a convenient enum of recognizer machine states. You can perform operations on Languages, and that’s more what I want. Once you can cat, alt, and star a whole language, the algebra of my recognizer becomes much easier to use.

The real problem comes down to the use of a memory arena to depict the tree of nodes. Rust Leipzig’s Idiomatic Trees and Graphs shows that the graph has to allocate the nodes, that his pointers are opaque handles, and that even his library doesn’t supply a graph-composition operation.
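The arena idiom in question looks roughly like this (a sketch of the general pattern, not Barre’s actual types): the arena owns a Vec of nodes, and the "pointers" are opaque handles, which are just indices into that Vec.

// Handles are plain indices into the arena's Vec; they stay valid because
// nodes are only ever appended, never removed.
type NodeId = usize;

enum Node {
    Tok(char),
    Cat(NodeId, NodeId),
    Alt(NodeId, NodeId),
    Star(NodeId),
}

struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    fn new() -> Self {
        Arena { nodes: Vec::new() }
    }

    // Adding a node hands back an opaque handle instead of a reference,
    // which is what keeps the borrow checker happy.
    fn add(&mut self, node: Node) -> NodeId {
        self.nodes.push(node);
        self.nodes.len() - 1
    }

    fn get(&self, id: NodeId) -> &Node {
        &self.nodes[id]
    }
}

Composing two such graphs would mean copying one arena’s nodes into the other and offsetting every handle along the way, which is exactly the operation that’s missing.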

All this is to say that I’m a bit stymied at the moment. There are obvious steps I could take, but imagine how much easier it would be to write regular expressions if they were concatenatable, alternable, and kleene-able?

14Oct

Rust Macros, and how they’re used in Barre.

Posted by Elf Sternberg as Rust

For the past two weeks I’ve been working on an implementation of regular expressions, writing an alternative to the classic Thompson engine, the simplest one, and often the fastest if you don’t need advanced features like, oh, Unicode range matching or backtracking. Instead of the Kleene/Thompson engine, I’ve been using Brzozowski’s Algorithm, which has some interesting properties and promises some interesting capabilities.

Right now, the engine implements the most naive possible version, one that’s known to be inefficient in time and space. It also implements only the most fundamental feature set, with exactly the six operators of Thompson’s 1968 matching engine. It has no ‘plus’, no ‘any’ operator. On the other hand, it is completely templatized, so it can handle bytes, chars, or more complex symbols. It only recognizes truly regular languages, and has none of the power of modern engines, but part of the point of this exercise is to see how close I can get in expressiveness, power, speed, and size, and then to provide wrappers that could parse Clojure spec files, or implement the Rosie syntax.

The problem

But I had a problem. Part of this problem is an implementation detail, but part of it is just Rust being Rust. To recognize the letters "ABC" I would have to write:

let mut pattern = Recognizer::<char>::new();
let a =  pattern.tok('A');
let b =  pattern.tok('B');
let ab = pattern.cat(a, b);
let c =  pattern.tok('C');
let _ =  pattern.cat(ab, c);

What I really wanted was something simpler. I wanted to be able to write:

let mut pattern = re(cat('A', 'B', 'C'))

And since I didn’t know how to write Rust macros, my first thought was that I’d write it with them. But to give you a taste of what I ended up with, let’s just say that the pattern AB(CC|DDDD)E*F ultimately looks like this:

let mut pattern = re!{cat{tok{'A'},tok{ 'B' },
                      alt{cat{tok{'C'},tok{'C'}},
                          cat{tok{'D'},tok{'D'},tok{'D'},tok{'D'}}},
                      rep{tok{'E'}},tok{'F'}}};

I learned a few days later that the RE2 engine, which I will be using as a learning example and as a source of some knowledge, also has an alternative representation that looks somewhat similar:

/a.b.c/ => cat{lit{a}dot{}lit{b}dot{}lit{c}}

It uses ‘lit’ where I used ‘tok’, and I haven’t implemented ‘any’, and they have ‘dot’, but the outcome is remarkably the same.

Initialization

There are two different things that my macro has to do: it has to create a new Recognizer, and it has to add elements to that Recognizer.

The Recognizer uses a memory arena to represent the graph of the regular expression, a graph which is built on-the-fly as the string is processed. This implementation can’t have "nodes" as loose objects "somewhere" in RAM because Rust really doesn’t like that: adding a new node that "points to" previous nodes happens through handles (integers) rather than pointers, a phenomenon you can see in my first example above.

Rust macros are hygienic and textual. A Rust macro has a pattern (sometimes containing captures) and a result: if your input to the Rust macro matches the pattern, the invocation of the macro will be replaced with the text of the result, and some elements inside the result will be replaced with elements of the capture.

There are currently 10 different things that can be captured; my macros only use four of them: ty for types, expr for expressions, ident for identifiers, and tt for token trees.

When a macro is read by the compiler, it is broken up into a token tree. The content of what you pass to the macro is also broken up into a token tree.

Let’s start with the most simple example. Let’s create the recognizer.

I called my macro ‘re!’, because of course I did. I also had to define the type that the Recognizer recognized.

macro_rules! re {
    ($sty:ty; $($cmds:tt)+) => {
        {
            let mut pt = Recognizer::<$sty>::new();
            {
                let _ = re!(...);
            }
            pt
        }
    };
};

That pattern there is one of the simplest: two items separated by a semicolon. The first item must be recognizable by the compiler as a type; the second item must be a stream of tokens (some of which can be further streams of tokens). A “token” in this case is any object Rust can reasonably identify as separate from other objects: tok{'a'} is four objects, as we’ll see later. tt objects are the most useful, but they’re not terminal objects– Rust doesn’t know what to do with them. They’re useful only for recursive processing; eventually, they must be reanalyzed to determine whether they’re types, identifiers, expressions, or other valid Rust code.

And man, I really wanted a default type. Let me try to add that rule.

    ($($cmds:tt)+) => { re!{ char; $($cmds)* } };

This is the first recursive macro, in that if we just get a bare list of tokens, we’re going to call re! again, only this time with the pattern $sty:ty; $($cmds:tt)+, which means a type followed by a collection of tokens. That matches the first pattern, and so a new recognizer is born.

Note that although the pattern reads $($cmds:tt)+, which captures a collection of tokens, the replacement pattern is $($cmds)*.

Note also that our collections end in a plus, which means "one or more." What if the user only wants a blank Recognizer, one that recognizes nothing? Here’s where things get interesting. We can do this:

    () => { re!{ char; } };
    ( $sty:ty ; ) => { Recognizer::<$sty>::new() };

The order’s a little backwards; first it says that if the expression is empty, call it with only the type; if the type exists but the tree is empty, build an empty Recognizer.

These rules are extremely precise; they cannot be confused for anything else. So they have to go at the very top of the macro_rules! call; the rules that are most precise must come before rules that have more general matching semantics, or the more general semantics will match and the macro generator will attempt to use the corresponding result… which will leave you with sadness.

Okay, so now we’ve handled all the possible invocations of the Recognizer. What about the actual expression?

Parsing the regex

We already know that macros can be called recursively. We also want the parsing to be unambiguous. And here’s where we enter into an interesting problem: Token trees and expressions can easily be mistaken for one another. The phrase tok { 'a' } could be a token stream of two tokens, the second of which is itself a token stream; it could also be a token stream of four tokens; it could also be an expression (although a broken one, as the Rust parser would attempt to treat it as a struct, and the interior of the struct does not meet Rust’s specification).

In this case we want it to be a token stream of four tokens: tok { 'a' and }. Only one of those is an actual expression. We also want this to be unambiguously called, and for that we use an internal identifier. In macros, an internal identifier begins with an @ symbol. Like so:

(@process $pt:ident, tok { $e:expr }) => {
    {
        $pt.tok($e)
    }
};

This says that we want a token stream of six tokens: the four we identified earlier, along with the literal token @process and this thing $pt:ident.

Go back and look at the original macro rule definition, the first one that created a Recognizer. Let’s replace the ellipsis (‘…’) with what I really want there:

    ($sty:ty; $($cmds:tt)+) => {
        {
            let mut pt = Recognizer::<$sty>::new();
            {
                let _ = re!(@process pt, $($cmds)*);
            }
            pt
        }
    };

We create a new scope, and a new variable within that scope, pt. Then, within yet another new scope, I start the recursive parser by calling re!(@process pt, $($cmds)*); the @process unambiguously points us to the internal handling, and the pt is recognized as an identifier that will be kept consistent throughout the construction process.

Next, the cat macro, for concatenation, which conveniently has the exact same structure as the alt macro. Those need a left and a right arm, as they’re nodes and not leaves of the graph. We’ll start with the same three tokens: @process, $pt, and… what? Alt? Cat? There’s no "alternative" operator in Rust macros, so I’ll use a regular identifier.

    (@process $pt:ident, $op:ident { $lop:ident { $($lex:tt)+ }, $rop:ident { $($rex:tt)+ } }) => {
        {
            let l = re!(@process $pt, $lop { $($lex)* });
            let r = re!(@process $pt, $rop { $($rex)* });
            $pt.$op(l, r)
        }
    };

A couple of things to note here: First, note that I have to re-create the structure of the op { tokens+ }, because sometimes they’re not tokens! For the tok operator, that’s expected to be an expression. Also note that the final operation can be $pt.$op(l, r), because the substitution here is purely textual.

The left and right arms will be replaced with scoped expressions like this one (or the one for tok).

There’s one thing left that I want: cat{tok{'a'},tok{'b'},tok{'c'}}. For that, I have to do two things: I have to capture more than just two token trees, and I have to use the cons-style extension for cat (and alt, as you can have more than two alternate possibilities), preserving the existing operator:

    (@process $pt:ident, $op:ident { $lop:ident { $($lex:tt)+ }, $rop:ident { $($rex:tt)+ }, $($xex:tt)+ }) => {
        {
            let l = re!(@process $pt, $lop { $($lex)* });
            let r = re!(@process $pt, $op { $rop { $($rex)* }, $($xex)* });
            $pt.$op(l, r)
        }
    };

In this case, I add the rest of the expression to the end of the @process parse line, and then for the right-hand expression I then re-create the rest of the expression whole, minus the one operation I’ve already handled. This is classic Lisp-like recursive programming, and it’s the only way to do this with Rust macros: For a list of problems, I take one argument off the list, process it, then re-call the function with a smaller list of problems, until the list can be handled by the two-argument instance.

Just a (geeky) (but important) note: this works here because the cat and alt operators are associative, that is, cat(A, cat(B, C)) is exactly the same as cat(cat(A, B), C).

Conclusion

The entirety of the macro is available at Github.

Rust macros take some time to understand; they’re quite different from the rest of Rust, and they have some Lisp-like qualities in their internal, recursive form, for which a general education in Scheme or Lisp might come in very useful. It helped a lot to understand the nature of the problem, too, in that I knew that concatenative operations were associative.

But they are quite useful, and I’ve made my life much easier for writing test cases, once I’ve written tests that satisfy me that the macros are doing the right thing.

Consequences

At the moment, BARRE "Languages" and the "Recognizer" are inappropriately mixed up. What I have are the six operations of classic regular expressions, and the Language and Recognizer are mixed together. Languages are supposed to be composable, and Rust’s own regex engine not only can compose them, but has a special root version of alt in the RegexSet handler that can keep track of large sets of regular expressions, giving users the ability to know which of several complex expressions may have been triggered by a string.

If cat, alt, etc. were written as either bare languages or simple operations, it should be possible to perform all the compositional operations on them. This may eliminate the need for the macros entirely. Time will tell. In the meantime, this was a great learning opportunity.

For the past two weeks, I’ve been working my way through Matt Might’s paper Yacc is Dead, and implementing it in Rust. The result is BARRE: Brzozowski Antimirov Rusty Regular Expressions. So far as I’ve been able to determine, I’m the first person to try implementing it in a systems language, one without garbage collection.

Rust has made this pretty straightforward so far. My progress has been pretty good: I have the naive implementation built and running. The naive implementation sucks: it’s slow, it’s inefficient, and it has a tendency to grow out of control in space. The regex (A|B)*C, when trying to match ‘ABABAAC’, grows to a deterministic finite automaton (DFA) of over 140 nodes!

My progress is going to be a bit stalled this week as I work through chapter 1 of Michael Sipser’s Theory of Computation, which teaches about DFAs and NFAs and how to convert back and forth between the two. There are several major improvements that can be made to the existing algorithm.

First, I need to figure out how to extend the engine to handle different repetition kinds. Right now it’s the pure Kleene star, "zero or more." Modern regex engines handle "one or more," ("Kleene plus"), "zero or one", and ranges ("exactly n", "at least n", "at most n").
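None of those variants needs new machinery; they all desugar into the three core operators. Here’s a sketch over a toy Re type with toy constructors (this is not the engine’s actual API, just the algebra):

#[derive(Clone)]
enum Re {
    Epsilon,
    Tok(char),
    Cat(Box<Re>, Box<Re>),
    Alt(Box<Re>, Box<Re>),
    Star(Box<Re>),
}

fn cat(a: Re, b: Re) -> Re { Re::Cat(Box::new(a), Box::new(b)) }
fn alt(a: Re, b: Re) -> Re { Re::Alt(Box::new(a), Box::new(b)) }
fn star(a: Re) -> Re { Re::Star(Box::new(a)) }

// r+  ==  r r*
fn plus(r: Re) -> Re { cat(r.clone(), star(r)) }

// r?  ==  r | ε
fn optional(r: Re) -> Re { alt(r, Re::Epsilon) }

// r{n}  ==  r concatenated with itself n times (ε when n == 0)
fn exactly(r: Re, n: usize) -> Re {
    (0..n).fold(Re::Epsilon, |acc, _| cat(acc, r.clone()))
}

// r{n,}  ==  r{n} r*
fn at_least(r: Re, n: usize) -> Re { cat(exactly(r.clone(), n), star(r)) }

// r{,n}  ==  r? repeated n times
fn at_most(r: Re, n: usize) -> Re {
    (0..n).fold(Re::Epsilon, |acc, _| cat(acc, optional(r.clone())))
}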

PCRE adds Boundaries (word boundaries, the "\b" operator), Anchors (start of line, end of line, start of document, end of document), and Classes (is-a-num, is-a-letter, range-of-letters). All of these are single-term operations, though, and should have the same semantics as a Token.

PCRE also adds Captures (groups that can be retrieved after the expression engine has run, showing what was actually matched), and Backreferences (literals that appear later in the regular expression that match strings built out of what was matched, effectively "splicing" the parse tree with a runtime-only concatenation at a given point in the matching process).

Python, of all things, adds a unique feature to the input iterator to replace indent/dedent changes with block-open and block-close symbols. This may be useful for whitespace-based languages, and obviates the need to play funny with the regular expression engine.

Brzozowski’s algorithm itself has four major improvements:

  • Laziness, which delays actually creating a derivative node until it’s truly needed.
  • Fixed points, which take a derivative subgraph and re-derive it, over and over, until it stops changing, at which point we have the final matcher.
  • Memoization, which intercepts the fixed-point processor to find duplication and returns a pointer to the start of the duplicate in the existing subgraph, interrupting any potential infinite recursions which may blow our stack.
  • Compaction, which identifies duplications across subgraphs and prevents the in-memory representation from growing out of control.
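Of those, memoization is the easiest to sketch: a cache of already-computed derivatives keyed by (node, symbol). This is only an illustration with hypothetical NodeId handles, not the real thing; in particular, the real fixed-point machinery has to insert a placeholder entry before recursing so that cyclic graphs terminate.

use std::collections::HashMap;

// Hypothetical handles into an expression arena.
type NodeId = usize;

// Cache of already-computed derivatives, keyed by (node, symbol).
struct DeriveCache {
    table: HashMap<(NodeId, char), NodeId>,
}

impl DeriveCache {
    fn new() -> Self {
        DeriveCache { table: HashMap::new() }
    }

    // Return the cached derivative of `node` with respect to `c`, or compute
    // it with `compute` and remember the answer for next time.
    fn derive<F>(&mut self, node: NodeId, c: char, compute: F) -> NodeId
    where
        F: FnOnce(&mut Self) -> NodeId,
    {
        if let Some(&hit) = self.table.get(&(node, c)) {
            return hit;
        }
        let derived = compute(self);
        self.table.insert((node, c), derived);
        derived
    }
}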

As far as I can tell, Brzozowski regular expressions have a "start-up cost," that is, the first time through can be pretty expensive as the graph is filled out, but follow-on matches can be pretty quick, as only the deltas need to be filled out.

Antimirov Derivatives promise an easier-to-implement variant of Brzozowski’s algorithm. I haven’t yet dived into these, but they’re on the list.

My hope is that Antimirov’s Derivatives don’t interfere with Might’s final point: Recursive Regular Expressions which, if implemented in a system with captures, can parse context-free grammars whole.

A couple of nice-to-haves: First, I need internal macros to build internal representations in as straightforward a manner as possible. Rust has a fairly powerful macro language, and leveraging it to make my life easy while negotiating with the borrow checker would be awesome. Second, the internal representation of a first-order Brzozowski’s Derivative is a simple array; being able to pre-compile that array to have stand-alone regexes at compile time would be wonderful.

And finally, it would be amazing if a parser for a Thompson, PCRE, or even PCRE-light syntax (such as supported by Rust’s own Regex engine), could be written in Barre. The parser for Rust’s Regex engine is written with, but not in, Regex; that is, while the Regex engine is used to recognize substrings written in Thompson/PCRE regex (you know, the classic abc(.*?)de strings), it can’t handle nested expressions, which requires the engine to recurse on itself, but theoretically, Barre can.

If all of this is true and possible, then I might well have a competitive alternative to both regex.rs and nom.rs, and my next steps toward learning how to write my own programming language will be clear, if not easy.

21Sep

Progress Report, Week 10.

Posted by Elf Sternberg as Uncategorized

I set out this week to do the following:

On the whole, that’s not… too bad. I have about four hours a day in which to do my studying; my studies get my morning, and my family seems to be occupying more of my afternoons than I had originally planned when I took this sabbatical. I also didn’t get anything at all done Thursday this week; instead, we all took the day off to go to the Washington State Fair and eat very badly. (So badly, in fact, that when I got home I took a 90 minute nap, woke up long enough to watch a movie with my wife, and then went right back to bed for a solid eight hours of sack time.)

I’ve been leaning on Scheme a lot this week, primarily because I’ve become obsessed with understanding this bit of code, which I hope will unlock secrets of the universe for me. I’m finally starting to understand the difference between let, let*, and letrec, but reading Scheme macros, especially pattern matching macros, is still a bit beyond me. But I hope to figure it out, and then build on top of it. I realized, and perhaps this is obvious to others, as I was working through my primitive constructions of regular expressions, that by defining the “thing” you’re using as your core language as <T> (rather than char or byte), you end up with an ability to match on tokens rather than characters, and from there you end up with the ability to pattern match sequences much more complex than mere strings. One of the things that drives me crazy is how opaque the texts are about the leap from language generators to language lexers and parsers, gliding as quickly as possible over language recognizers. That needs to be made more explicit. Recognizers say that a sequence matches a pattern; parsers say that a sequence matches a pattern and derives a parallel (“Lifted?”) structure that reifies the sequence in a meaningful way.
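Here’s a toy illustration of that point, with hypothetical types that aren’t from any of the projects mentioned: the same generic function recognizes a sequence of chars or a sequence of lexer tokens, because all it asks of its symbol type is equality.

// Recognize a literal sequence of symbols of any type T.
fn matches_literal<T: PartialEq>(pattern: &[T], input: &[T]) -> bool {
    pattern.len() == input.len() && pattern.iter().zip(input).all(|(p, i)| p == i)
}

#[derive(PartialEq)]
enum Token {
    Number(i64),
    Plus,
}

fn main() {
    // Over chars, it's just string matching...
    let chars: Vec<char> = "abc".chars().collect();
    assert!(matches_literal(&['a', 'b', 'c'], &chars));

    // ...but over tokens, it's the beginning of a parser.
    let toks = vec![Token::Number(1), Token::Plus, Token::Number(2)];
    assert!(matches_literal(
        &[Token::Number(1), Token::Plus, Token::Number(2)],
        &toks
    ));
}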

Remember my wild ambition? Take a look at this chart and you’ll see where I’m going with this. I love Clojure because it’s a Lisp with a remarkably low rate of bug density among skilled developers… but it runs on the JVM. I love Python… but its tendency toward bugginess (and the death’s head semantic) drives me insane. I loathe Go’s syntax and semantics… but its runtime and libraries are first-rate. I’d love to get to “static where possible, dynamic where necessary.” This is where I want to land with Blackwing: somewhere in the middle of this. I’m not sure that I can, but I’ve got a frackton of examples to work with.

I also attended the Fun(c) Meetup on Category Theory and that was pretty interesting, mostly because I started to see the way real-world applications matter in categorical thinking. One of the biggest realizations to come out of the event was that often compositional thinking is an optimization issue: does it make sense to spend the time and energy composing the functions that you’ll perform on a lifted type, or to live with the code as-is?

Overall, I’m driven by two overwhelming notions. First, programmers need to develop good taste in developing programs. They need both clear naming schemes and clear layout. Code should tell a story. It should be easy to read and understand. Go achieves this by dumbing everything down to a third-grade vocabulary, but I don’t believe that’s necessary, and I don’t believe programmers should be treated this way. Go is the epitome of “We don’t want programmers with good taste, we want programmers that taste good.”

Second, developers need compassion. This is harder to teach than good taste. The old saw, “Always treat your source code as if the next person to see it is a psychopath who knows where you live,” has always bothered me as violent and frightful; I’d much rather we appreciate that the next person to see your code is going to need as much compassion as you can muster, and your code should reflect your wish to convey grace and kindness. Not everyone will appreciate it, but those who do are the ones who matter.


Next week

  • Three more chapters of Seasoned Schemer
  • Chapter 3.5 of SICP
  • Redraft the Mandelbrot and Buddhabrot code.
  • Colorize Buddhabrot.
    • Stretch: Consider alternative coloring schemes for Mandelbrot.
  • Read ahead in the Scheme book to understand dparse.rkt

17Sep

Notes on using the Rust image library

Posted by Elf Sternberg as programming, Rust

I have finally figured out how to use the images library for Rust, and it’s not obvious, especially not for internally generated images. The library does a very good job of hiding complicated implementation details when you load a file off the disk, automatically figuring out the file format and providing some baseline utilities. Zbigniew Siciarz has a pretty good writeup of the basic implementation, but for the Mandelbrot renderer, I wanted to get a little more specific.

So here are the basics. Images are made up of Pixels, and every pixel is an array of 1 to 4 numbers (of any machine size; common ones are u8 and u32, but you can create Pixels of type f64 if you like). A single-item Pixel is greyscale; a 3-item Pixel is RGB, and a 4-item Pixel is RGBA (alpha).

Pixels go into implementations of GenericImage, the most common of which is ImageBuffer. ImageBuffer controls access to the underlying representation, ensures bounds checking and provides limited blending capabilities.
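As a minimal usage sketch, assuming a recent version of the image crate (the 0.24-era API; older releases spell the pixel constructors slightly differently):

use image::{ImageBuffer, Luma, Rgb};

fn main() {
    // A 256×256 greyscale gradient: single-channel Luma pixels.
    let grey: ImageBuffer<Luma<u8>, Vec<u8>> =
        ImageBuffer::from_fn(256, 256, |x, _y| Luma([x as u8]));
    grey.save("gradient_grey.png").expect("failed to write PNG");

    // The same gradient as three-channel RGB pixels.
    let rgb: ImageBuffer<Rgb<u8>, Vec<u8>> =
        ImageBuffer::from_fn(256, 256, |x, y| Rgb([x as u8, y as u8, 128]));
    rgb.save("gradient_rgb.png").expect("failed to write PNG");
}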

To colorize my image, I needed to create the colormap, and then map the colors to a new image. And then I learned that the PNM colormap handler doesn’t handle Pixels.

PNM is an ancient image format, which may explain why I enjoy it. It’s simply a raw dump of bytes, RGBRGBRGBRGB…, with a header giving the image’s width and height. The PNM Handler for images.rs can only handle flattened arrays that look like a PNM array.

So for the Mandelbrot colorizer, the solution was to create a new array three times as long as the underlying Vec for my original image, and sequence through the pixels, mapping them to a color and then pushing the pixel subcolors to the new array. Which is annoying as heck, but it works.
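Here’s roughly what that flattening step looks like, as a sketch with a placeholder colormap (the real palette can be anything):

// Counts come out of the renderer as one u16 per pixel; the PNM writer wants
// a flat RGBRGB... byte array three times as long.
fn colorize(counts: &[u16]) -> Vec<u8> {
    // Placeholder palette: map the count onto a blue-to-white ramp.
    fn colormap(v: u16) -> [u8; 3] {
        let v = (v >> 8) as u8;
        [v, v, 255]
    }

    let mut flat = Vec::with_capacity(counts.len() * 3);
    for &count in counts {
        flat.extend_from_slice(&colormap(count));
    }
    flat
}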

I’m sure there’s a better way to do this. And the constant remapping from u32 (which the image array understands) to usize (which is what slices are made of) and back got tiresome, although the compiler was helpful in guiding me through the process.

Brad Delong recently pointed me at Susan Dynarski’s article For Better Learning, Lay Down the Laptop and Pick up a Pen, in which she reviews the evidence that laptops, because of their speed and utility, actually inhibit learning by allowing students to take notes too quickly, and by giving students a universe of alternative distractions should the instruction get too boring. I’ve found Dynarski’s article to be absolutely true.

I recently finished a small homework assignment. After three days of working with it on the computer, I sat down with a sheet of paper and worked it out in an hour. I wrote it snowflake fashion: I described it, then iterated on the description until I had a complete description, with the public API underlined. It took another hour to implement it.

This is my lesson: if I can’t explain it to myself on paper, then I can’t explain or implement it on the computer. Ever. It just doesn’t work that way, at least not with software. The ideas don’t stick unless I’ve written them out. Every few years I have to re-learn this lesson, believing that I can shortcut the learning curve and get straight to the meat of the problem. But the meat of the problem isn’t in the computer, it’s math and common sense, and those take time, and paper, and a pencil.

One of the horrifying follow-on realizations this led me to was that, when I was at my last job, I was very rarely writing software. I wasn’t working out algorithms or implementing new and innovative processes. I was exploiting lenses.

In software development, a lens is an algorithmic function that allows the user to focus in on a specific part of data without the user having access to, or having to know about, the rest of the data. (There’s much more to it than that, including a deep mathematical description, but that’s the basic idea.) And most of what I was doing was writing shims, to be part of a Kubernetes ecosystem, to help internal users monitor the health and well-being of the system.
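For the curious, here’s a bare-bones illustration of the idea in Rust, with made-up types; real lens libraries add composition and a good deal of law-abiding machinery on top of this:

// A lens is just a paired getter and setter focused on one part of a
// larger structure.
struct Lens<S, A> {
    get: fn(&S) -> A,
    set: fn(S, A) -> S,
}

#[derive(Debug, Clone)]
struct Server {
    host: String,
    port: u16,
}

fn main() {
    let port_lens: Lens<Server, u16> = Lens {
        get: |s| s.port,
        set: |s, p| Server { port: p, ..s },
    };

    let server = Server { host: "localhost".into(), port: 8080 };
    let moved = (port_lens.set)(server.clone(), 9090);

    println!("old port: {}", (port_lens.get)(&server));
    println!("new port: {}", moved.port);
}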

Which is all well-and-good, but when you realize that you’ve married a massive protocol like HTTP to a very simple collection of filters, then come up with your own peculiar way of specifying to the filters who you are and what you want, over and over again… well, that can lead to a bit of burnout. Every job is easy, but every job whispers, "This is stupid and should have taken three days, not four weeks."

With very rare exception, this is what programming is these days. All software is little more than views into specific arrangements of data. This is one of the reasons functional programming is becoming so prevalent: that’s what functional programming believes in its heart. (There is a hard part, dealing with philosophical notions of time, but until other people can understand what the heck Conal Elliot is talking about, I’ll just leave that there.) Sometimes getting the data into that specific arrangement is a story about performance, timeliness, memory, and storage, but that’s pretty much all it is.

13Sep

Mostly studying this week.

Posted by Elf Sternberg as Uncategorized

Happy Thursday!

Thursday is the day where I look back upon the week and consider what I’ve learned. Last week I completed the main Buddhabrot algorithm and got it to work.

Studying: The Little Schemer

This week was primarily studying, so there’s not a lot of code to consider. Instead, I worked my way through The Little Schemer, a book that teaches how to write Scheme and, by doing so, master the Y Combinator (the mathematical formula, not the orange site people), and eventually meta-circular interpretation. The latter wasn’t as interesting to me as I’d hoped, as I’d already done it once, writing a meta-circular interpreter on top of a Lisp I’d already written which, in turn, was written in Coffeescript, because that’s how I was rolling back then.

It’s a rite of passage to finish The Structure and Interpretation of Computer Programs and then write your own programming language in it. I’m up to section 3.3 and am currently a little stuck on the “write your own deque” section. One thing I dislike intensely about MIT Scheme (and Schemes in general) is the way objects and pointers just exist, willy-nilly, and knowing when you’re using the original handle and when you’re using a reference is all a matter of keeping it in your head. I think I’m becoming more of a Rust partisan every day.

Speaking of Rust, I started an implementation of Brzozowski’s Algorithm in Rust. It’s a very naive sort of thing, based on my earlier attempt with Python. It doesn’t work yet and, according to Might, it will probably have terrible performance and memory issues, but it’s a first draft. I may have to figure out how to do memoization and some macro handling / inlining. The first time I wrote it, I realized where the memoization would have to occur to make it performant, but that knowledge seems to have faded from my brain and I’ll have to re-learn it all over.

Plans: Next Week

Next week, fate willing, I’m going to:

  • Implement colorized versions of the Mandelbrot and Buddhabrot implementations.
  • Read three chapters of The Seasoned Schemer
  • Finish chapter three of Structure and Interpretation of Computer Programs
  • Write some damn documentation for the images.rs library.

Future Plans

There are a number of other, unfinished projects on my plate. Most notably, I haven’t finished porting Daniel Givney’s Assembly Tutorials to 64-bit ASM, as promised.

And then I’m going to dive into writing an interpreter, a virtual machine, and a compiler. All the stuff I never learned how to do because my parents didn’t believe, in 1985, that a programming career would ever be lucrative and would only pay my college if I majored in Accounting– excuse me, I meant “Business Computing”– instead. My ideas for it are incredibly vague, but there are some basic ideas.

Excessively Ambitious:

  • It is going to be a Scheme.
  • It is going to be strongly typed.
  • It is not going to be beholden to the naming history of Lisp or Scheme. My first thought was to steal from Hy, using Ghostwheel or Compact Notations as inspiration for typing annotations.
  • It will use type inference wherever possible.
  • It will use a sane import scheme.
  • It will rest upon an Intermediate Representation that will not be married to the syntax. Alternative syntaxes should be easy.
  • I want it to have the following:
    • Garbage Collection: A tri-color parallel garbage collector.
    • Green threads implementation, like Go routines.
    • Batteries included
    • Computation Expressions
    • Fat executables (all files and libraries included in a single, runnable format)
  • It would be silly for me to want:
    • Software-transactional memory
    • Type Providers
  • It would be madness to provide domain-specific stuff like:
    • A Grammar interface
    • A type-driven graph engine (the underlying technology of spreadsheets)
    • A type-driven pluggable buffer engine (the underlying technology of databases)
    • A high-performance rope or piece table implementation (the underlying technology of text processors)
    • A document object model (the DOM is an obvious one, but what about OpenDocument?)

For all I know, there is a strong conflict among these different goals, in that, for example, grammars might be incompatible with strong typing, or at least difficult to implement correctly. This is a learning experiment.

And here’s the thing: I don’t just want to write a programming language. I want to write useful things in my programming language. That’s the whole point. If I can’t write useful things in it, then it’s not a useful language. Yes, there are days when I feel like this plan goes over Programming Language for Old Timers, but that’s also the point; I know enough to feel modern, but I want something that is expressive to me.

This week, I finished off the basic Programming Rust exercises by extending the Mandelbrot example in the book to produce Buddhabrots instead. In my last post, I mentioned that I’d gotten the Buddhabrot working, but only in a very basic way. I was unhappy with the result; my point-to-pixel mapper was giving me weirdly squashed results, and what I was producing didn’t look like many of the examples found in the wild. The O’Reilly Mandelbrot exercise extended the code to work on multi-core machines, so I tried to do that with the Buddhabrot.

The Mandelbrot set divides the complex plane into two sets: for a given complex point c and a starting complex point z = 0+0i, those points for which the iteration z = z * z + c runs off to infinity, and those for which z remains within a bounded region, usually the circle of radius 2 centered at the origin. For those starting points that go to infinity, the velocity at which they go gives a greyscale number, from which we generate the Mandelbrot set.

The Buddhabrot says, “Wait a minute, for all those points that don’t go to infinity, what if we colored those instead?” For those, at each iteration we track the points the orbit visits within the black center of the Mandelbrot set, and the number of times a point corresponds to a pixel gives us a number we can then render as greyscale.
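As a sketch of that accumulation step, with hypothetical dimensions and a hand-rolled point-to-pixel mapping (the real code goes through its Plane struct):

const WIDTH: usize = 800;
const HEIGHT: usize = 800;
const MAX_ITER: usize = 1000;

// Map a complex point in the square from -2-2i to 2+2i onto a pixel index.
fn to_pixel(re: f64, im: f64) -> Option<usize> {
    let x = ((re + 2.0) / 4.0 * WIDTH as f64) as isize;
    let y = ((im + 2.0) / 4.0 * HEIGHT as f64) as isize;
    if x >= 0 && (x as usize) < WIDTH && y >= 0 && (y as usize) < HEIGHT {
        Some(y as usize * WIDTH + x as usize)
    } else {
        None
    }
}

// Accumulate the orbit of a single starting point c into the counts buffer,
// but only if the orbit stays bounded for the whole run.
fn accumulate(counts: &mut [u32], c_re: f64, c_im: f64) {
    let (mut z_re, mut z_im) = (0.0_f64, 0.0_f64);
    let mut orbit = Vec::with_capacity(MAX_ITER);
    for _ in 0..MAX_ITER {
        // z = z * z + c, written out over the real and imaginary parts.
        let re = z_re * z_re - z_im * z_im + c_re;
        let im = 2.0 * z_re * z_im + c_im;
        z_re = re;
        z_im = im;
        if z_re * z_re + z_im * z_im > 4.0 {
            return; // escaped: this variant only records bounded orbits
        }
        orbit.push((z_re, z_im));
    }
    for (re, im) in orbit {
        if let Some(idx) = to_pixel(re, im) {
            counts[idx] += 1;
        }
    }
}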

So: the problem here is easy. For Mandelbrot set, each point is not dependent on any other points on the screen. For the Buddhabrot, this isn’t true: most points are dependent upon the behavior of some previous point. The Mandelbrot set could be divided up into different regions and each region handed to a different core. For the Buddhabrot, we have to build up each map in total.

Even on my monitor, which is 3840×2160, using 16-bit greyscale and 8 threads, that was still under a gigabyte of memory. Yay for modern computers!


One of the biggest problems I had learning this was the notation for Rust mutation. And this is before we get to things like reference counters, boxes, and reference cells!

Here’s what I need to remember:

let x = 5; // defines a location, 'x', and puts 5 into it.
// x = 6; // illegal: x isn't mutable.

let mut x = 5; // legal: This shadows the value above.
x = 6; // redefines 'x' to be 6, which is legal because x is now
       // mutable.  The types must match!
{
    let y = &mut x; // 'y' is a mutable reference to 'x'. 'y' cannot be
                    // reassigned, but what it points to can be mutated:
    *y += 1;
}

That covers the basics, and helps a lot. The other thing is that the &mut syntax inside function signatures has a subtle meaning: that the function only takes mutable references, and there can only be one of those at a time.


The rules about mutability in Rust just killed me. I wanted to run my Buddhabrot generator the way I would expect in any other object-oriented language: multiple planes on which to operate, each one running independently. I couldn’t get it to work with a vec of Plane objects, because Rust really didn’t want to move those massive planes around.

The only way to get it to work was to generate one gigantic vec, then use Vec::chunks_mut to break it up into subplanes, one for each thread, and hand them over to an immutable processor that did all the work of calculating the Buddhabrot for that plane. It worked, but what a pain in the neck.
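That pattern looks roughly like the sketch below. It uses std::thread::scope, which has been stable since Rust 1.63 (at the time, the same shape was usually written with crossbeam’s scoped threads), and render_band is just a stand-in for the real per-subplane renderer:

use std::thread;

// Stand-in for the real renderer that fills one horizontal band of pixels.
fn render_band(band: &mut [u32], band_index: usize) {
    for (i, px) in band.iter_mut().enumerate() {
        *px = (band_index * 1_000_000 + i) as u32;
    }
}

fn main() {
    let width = 1024;
    let height = 768;
    let threads = 8;

    // One giant buffer, carved into disjoint mutable bands with chunks_mut.
    let mut pixels = vec![0u32; width * height];
    let rows_per_band = height / threads + 1;

    thread::scope(|s| {
        for (i, band) in pixels.chunks_mut(rows_per_band * width).enumerate() {
            s.spawn(move || render_band(band, i));
        }
    });
}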


The other thing I did this week was rip through the first 100 pages or so of The Little Schemer, a classic book teaching you how to use Scheme. I went through it using Racket, which was lovely, and taught me a lot about how Scheme works inside. I have two thoughts as I return to the Scheme land after wandering around in Rust: I really, really want better guarantees in my language that I won’t be passing around non-containers to functions that expect containers, and I really want better guarantees that the types I pass to functions are understood by those functions ahead of time.

Maybe I want Haskell. Or some kind of typed scheme. Or something. But this exercise was just… weird.

27Aug

The Mandelbrot and the Buddhabrot

Posted by Elf Sternberg as Uncategorized

Last week, I started knuckling down and practicing Rust for seriousness. I’ve been kinda skating along the top, not learning it in any real way; I’d been doing that for a while at my last job, where they insisted I use Go instead. I’m not fond of Go; I think it’s an amazingly powerful idiomatic garbage collection and inter-thread communications framework on top of which is built the godawfulest ugly language I’ve ever seen.

Last week I figured out how to render the Mandelbrot set. It’s the last exercise in chapter two of Programming Rust, and after I was done writing it, I decided to see if I could make a few improvements.

So here’s the basic thing about the Mandelbrot set: it’s a little arbitrary. The Mandelbrot set lives in the ℂ plane in a subplane ℤ described by the points 2+2i to -2-2i; the objective is to pick a subregion of that square, and map it to an arbitrary pixel plane, say an image 1024×768. For each pixel in that pixel plane:

  • figure out what point c on the ℂ plane most closely maps to that pixel.
  • Start with a complex number z representing the origin: 0+0i.
  • iterate on z = z² + c
  • if z does not leave the subplane ℤ, leave the pixel black
  • if z does leave the subplane, the iteration number at which it crossed the subplane border can be used to map to a color
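Here’s a hedged sketch of the escape-time loop that list describes, written with plain f64 arithmetic rather than a complex-number type (my own names, not the book’s):

const MAX_ITER: u32 = 255;

// Returns None if c appears to be in the set (z stayed bounded for MAX_ITER
// iterations), or Some(i) with the iteration at which |z| left the radius-2
// circle.
fn escape_time(c_re: f64, c_im: f64) -> Option<u32> {
    let (mut z_re, mut z_im) = (0.0_f64, 0.0_f64);
    for i in 0..MAX_ITER {
        if z_re * z_re + z_im * z_im > 4.0 {
            return Some(i);
        }
        // z = z² + c
        let re = z_re * z_re - z_im * z_im + c_re;
        let im = 2.0 * z_re * z_im + c_im;
        z_re = re;
        z_im = im;
    }
    None
}

fn main() {
    // The origin never escapes; a point well outside the circle escapes at once.
    assert_eq!(escape_time(0.0, 0.0), None);
    assert!(escape_time(2.0, 2.0).is_some());
}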

Note that the base Mandelbrot algorithm returns only a single value per pixel, and is appropriate only for greyscale images. The typically colorful Mandelbrot images you see are mappings of arbitrary colormaps to the greyscale values, much the same way GIFs are colored. There are other algorithms for choosing coloring, and I haven’t looked much into them.

The source code for my efforts is on my Github under programming_rust/mandelbrot. There are two changes to the example in the book:

  1. I realized early on that every function was receiving the same two variables: a tuple describing the pixel plane, a tuple describing the complex plane. These values never change during the operation of a single render, and since I have programmed a bit of Rust over the past year or so I realized that these could best be put into a struct and have implementations range over them, so now we have an object Plane. This makes the function pixel_to_point() something that just takes a pixel and returns a complex number. It also means that every iteration through render() no longer has to recalculate the mapping every time; it’s done once and done for good.
  2. I decided to make my output into PNM. I’m a big fan of the Portable Anymap format. I know it’s not most people’s favorite, but it works for me and I know how to process it well.

I also did the concurrent version of the algorithm. This involves slicing up the plane into subplanes; points on the Mandelbrot set are independent of their neighbors (unlike, say, a Conway Game of Life), so making it concurrent is simply a matter of slicing the plane into uniform subplanes based on how many cores your CPU can support, and giving each core a subplane to process.

At first, I worried about how this was going to work with my Plane object, but then I realized that the Plane itself had the capacity to provide its own slices, and the method subplane() does exactly that. Then it’s simply a matter of mapping the subplane to a band of pixels in the image buffer, creating a thread, and having the thread call render() on the subplane.

I believe this code is easier to read than what’s in the book. Yes, it relies on structs, which aren’t introduced for quite a few chapters, so it’s cheating, but it’s fun.

There are two things I’d like to do with this: one is colorize it. My problem so far is that I don’t know how to read Rust type signatures, so figuring out how, using Rust’s most popular image processing library, to colorize an image has failed me so far.

On the other hand, generating a basic Buddhabrot turned out to be fairly easy.  That’s what you’re seeing in that illustration above.

A Buddhabrot says that for the iterated function z = z² + c, each resulting z represents another point inside the plane ℤ. Instead of counting when an iteration point leaves the plane ℤ, a Buddhabrot simply increments every pixel z represents during a set number of iterations. The resulting images are ghostly and strange, and looked to be a lot of fun. So I did it, and it worked, and I’m happy with the result.

I’ve decided, for the sake of insanity, to work my way through the O’Reilly Programming Rust book, since I feel like my Rust has become, shall we say, rusty. And I’m doing it with my usual sense of attitude, that "There has to be a better way!"

First, I took this mess from the very first example (another greatest common divisor function… you’d think tutorial writers would figure out most of us would rather write something useful, ne):

let mut numbers = Vec::new();
for arg in std::env::args().skip(1) {
    numbers.push(u64::from_str(&arg)
                 .expect("error parsing argument"));
}

And said, "I know there are way to initialize and map values automatically in other languages. Rust should too." And I dug through the documentation and discovered, why yes, Rust does:

let numbers = Vec::from_iter(
    std::env::args().skip(1)
        .map(|x| u64::from_str(&x).expect("Error parsing argument")));

That’s pretty dense, but it does the job, and it uses the exact same allocations, too.

The next step was the actual reduction (I’d already written the GCD function in a library):

let mut d = numbers[0];
for m in &numbers[1..] {
    d = gcd(d, *m);
}

Could I one-liner that? Why yes:

    let d = &numbers[1..].iter().fold(numbers[0], |acc, x| gcd(acc, *x));

I know there are people who hate this kind of programming, but I dunno, I find it much more readable. It says that d is a reduction via GCD of all the numbers in a container. It’s actually less confusing than the loop.

To me, at least.
