I've been conducting a bit of code archaeology, and this week I've been researching the original Locate program, the one written in 1983, which persisted at least until the release of FreeBSD in 1994. In this post, I'm going to describe the original Locate database format, which was called Squozen 1.0.

The Locate database used only the ASCII character set, and was in fact limited to the printable characters from SPACE up to, but not including, DEL (U+20 to U+7E).

Building a Squozen 1.0 database

Let's start with the basics. Squozen was built using the find command! Here's the basic sequence. Two commands in it won't be familiar: locate.bigram and locate.code, which construct the Squozen database, reading from stdin and writing to stdout.

# Walk the entire filesystem, pruning anything that's not
# on a "local" filesystem (skipping things like NFS mounts),
# and print every remaining path.  Sort the paths, temporarily
# replacing the '/' separator with the byte \001 so that they
# sort in the order human beings expect, then put the '/' back.
# Note that this means no filename can contain a \001 byte,
# or it will be lost to this database.

find / ! -fstype local -a -prune -o -print | \
    tr '/' '\001' | \
    sort -f | \
    tr '\001' '/' > locate.list

# The 'bigram' script reads in the list just generated 
# and blindly spits out every two-byte pair.  So "abcd"
# would generate "ab", "bc", "cd", etc. Those pairs are
# sorted, counted for frequency, then resorted by number, 
# and the first 128 are stored in the bigram file.

locate.bigram < locate.list | \
    sort | \
    uniq -c | \
    sort -nr | \
    awk '{ if (NR <= 128) print $2 }' | \
    tr -d '\012' > locate.bigrams
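The bigram step itself is tiny. Here's a sketch in C of what the locate.bigram stage does, per the description above (the function name and buffer-based interface are mine, not the original's, which simply printed to stdout):

```c
#include <string.h>

/* Write every adjacent two-byte pair of `path` into `pairs`
 * (two bytes per pair, back to back) and return the number of
 * pairs: "abcd" yields "ab", "bc", "cd".  The shell pipeline
 * then sorts, counts, and keeps the 128 most frequent pairs. */
size_t emit_bigrams(const char *path, char *pairs)
{
    size_t n = 0, len = strlen(path);

    for (size_t i = 0; i + 1 < len; i++) {
        pairs[2 * n]     = path[i];
        pairs[2 * n + 1] = path[i + 1];
        n++;
    }
    return n;
}
```

All the clever parts (frequency counting, ranking, truncation to 128) happen in the sort | uniq -c | sort -nr pipeline, not in the C code.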

# See below for the explanation.  This is the database 
# constructor.
locate.code locate.bigrams < locate.list > locate.db

(Note: I've removed the cleanup phase and the error handling for sort failures from the above. The sort program is incredibly robust and spills to temporary files if it starts to run out of memory, but if it runs out of both memory and disk space, it returns an error.)

The locate database has exactly two components: a 256-byte block containing the first 128 bigrams, and a linear array of every path in the filesystem. That array is searched in its entirety for every locate request. The database uses two tricks to compress the content.

The prefix for every entry was either a single byte or a three-byte sequence that began with RS (Record Separator, code point U+1E), a control character "below" SPACE, the first printable ASCII character. If the first byte was RS, the next two bytes were a 16-bit integer representing the number of characters from the previously seen path that could safely be re-used; otherwise, the single byte encoded the same thing. (Strictly, what's stored is the change in that count from the previous entry, biased by a small offset, which is why the escape is needed at all; we'll see that offset again in the code below.)

This example comes from the documentation:

/usr/src                     0 /usr/src
/usr/src/cmd/aardvark.c      8 /cmd/aardvark.c
/usr/src/cmd/armadillo.c    14 armadillo.c
/usr/tmp/zoo                 5 tmp/zoo

The prefix number indicates the number of characters from the prior line that can be re-used. By re-using parts of the previous path, each entry only needed to store its terminal difference from its prior sibling, saving a lot of disk space and read time. Notable here is that if a character had the high bit set (that is, was greater than 127), it was masked down into the ASCII range that locate could handle; if it was at or below RS, it was replaced with a question mark. How rude!

Now that we know how many characters of the previous iteration's path can be preserved, we write that 1- or 3-byte prefix and begin writing the difference, one byte at a time, but with one extra trick: for each byte, if it and the byte that follows it form one of the 128 common bigrams, write out the one-byte index of the pair instead, with the high bit set to indicate that this byte is an index into the bigram table.
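The prefix-count side of that encoder can be sketched like this. These are hypothetical helpers, not the original source; what exactly the wide form stores, and in what byte order, varies by version, so this sketch stores the biased delta big-endian as an assumption:

```c
#include <string.h>

#define OFFSET 14   /* bias added to the prefix-count delta */
#define SWITCH 036  /* RS (0x1E): escape for out-of-range deltas */

/* Number of leading characters `path` shares with `prev`. */
int common_prefix(const char *prev, const char *path)
{
    int n = 0;
    while (prev[n] != '\0' && prev[n] == path[n])
        n++;
    return n;
}

/* Encode the re-use count: a single biased byte when the delta
 * from the previous entry's count fits in [0, 2*OFFSET], otherwise
 * SWITCH followed by a wider value.  Returns bytes written to `out`. */
int encode_count(int count, int prevcount, unsigned char *out)
{
    int diff = count - prevcount + OFFSET;

    if (diff >= 0 && diff <= 2 * OFFSET) {
        out[0] = (unsigned char)diff;
        return 1;
    }
    out[0] = SWITCH;
    out[1] = (unsigned char)(diff >> 8);    /* byte order assumed */
    out[2] = (unsigned char)(diff & 0xff);
    return 3;
}
```

Run against the documentation example, common_prefix of the aardvark and armadillo paths gives 14, and a delta of 6 from the previous count of 8 fits comfortably in the single biased byte.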

As was typical of programming at the time, it uses two arrays for "previous" and "current," and then swaps them at the end of processing a line, so that the "current" becomes the "previous" and vice-versa. This means that the same memory is being used over and over.

Reading a Squozen 1.0 database

The locate program does this in reverse: It reads in the entire bigram table, and then scans the user's requested search string, looking for a run of characters that are not Unix glob symbols (tl;dr: glob is a primitive matching syntax similar to regular expressions, with a much more limited vocabulary, but one that can be scanned quickly and in linear time). If it finds such a run, it uses it as a "primitive" pattern to match.
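That extraction step can be sketched as follows. This is a simplification of what locate actually does; the function name and the exact set of metacharacters handled are mine:

```c
#include <string.h>

/* Find the longest run of characters in `pattern` containing no
 * glob metacharacters; locate matches this "primitive" substring
 * cheaply before paying for a full fnmatch().  Returns the run's
 * length and, via `start`, where it begins. */
size_t longest_literal(const char *pattern, const char **start)
{
    size_t best = 0, run = 0;
    const char *beg = pattern, *bestbeg = pattern;

    for (const char *p = pattern; ; p++) {
        if (*p != '\0' && strchr("*?[]", *p) == NULL) {
            if (run++ == 0)
                beg = p;            /* a new literal run starts here */
        } else {
            if (run > best) {       /* flush the run we just left */
                best = run;
                bestbeg = beg;
            }
            run = 0;
            if (*p == '\0')
                break;
        }
    }
    *start = bestbeg;
    return best;
}
```

For a query like `*.jpg`, this yields the literal ".jpg", which is exactly the cheap "extension check" described below.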

In pure "C" terms, the database could be mmap'd to this:

struct Squozen {
    char bigrams[256];  /* 128 two-byte bigram pairs */
    char paths[];       /* front-coded path entries, to end of file */
};

Then, for each entry in the database, it unpacks it, reversing the compression described above, again using two arrays for "previous" and "current". It scans each line from the end, which is statistically the part most likely to have a high delta, but also the most likely to match an "extension check," if the user is just looking for a filename extension (like locate .jpg). If the primitive pattern succeeds in matching, the fnmatch() function call is used to do a full glob search against the path. If it succeeds, the entry is printed.
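Put together, unpacking a single entry looks roughly like this sketch. The names are mine; the real code streams from a FILE and detects the end of an entry when it hits the next count byte, where this sketch cheats with a NUL terminator, and the wide-form layout here is an assumption:

```c
#include <stdint.h>
#include <string.h>

#define OFFSET 14   /* bias on the stored prefix-count delta */
#define SWITCH 036  /* RS: escape byte for out-of-range deltas */
#define PARITY 0200 /* high bit: marks a bigram-table index */

/* Decode one entry from `in` into `path`, which still holds the
 * previous entry's text; `bigrams` is the 256-byte table of 128
 * two-byte pairs; `count` carries the previous re-use count.
 * Returns the number of bytes consumed from `in`. */
size_t decode_entry(const unsigned char *in, const char *bigrams,
                    char *path, int *count)
{
    size_t i = 0;

    if (in[i] == SWITCH) {      /* wide form: 16-bit biased delta */
        *count += (int16_t)(in[i + 1] << 8 | in[i + 2]) - OFFSET;
        i += 3;
    } else {                    /* common form: one biased byte */
        *count += in[i++] - OFFSET;
    }

    char *p = path + *count;    /* keep `*count` chars of the old path */
    while (in[i] != '\0') {     /* sketch-only: NUL ends the entry */
        unsigned char c = in[i++];
        if (c & PARITY) {       /* expand a bigram index */
            const char *bg = &bigrams[2 * (c & (PARITY - 1))];
            *p++ = bg[0];
            *p++ = bg[1];
        } else {
            *p++ = (char)c;
        }
    }
    *p = '\0';
    return i;
}
```

Because decoding only ever appends past the re-used prefix, the previous path in `path` is overwritten in place, the same memory-recycling trick the encoder uses.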

During the scanning, it records the position in the line where it found the previous match. Exploiting what the Squozen format already tells us about how much of the path prefix is unchanged, we can safely assume those parts of the path still won't match, so we don't have to scan the entire line from end to beginning, just the changed part, saving CPU time.

Because it was able to read one line at a time from the file, the only RAM either program needed was space for two paths (the previous one, and the current one), and the bigram table. In 1994, MAXPATHLEN was 1024 bytes, and the bigram table was 256 bytes, so overall this program ran in less than 8KB of RAM, but only after sort had done its magic. In fact, sort is the real hero of these scripts, as it does all the heavy lifting of both the filename collection and the bigram collection.

Criticism

It's a bit unfair to criticize a program like this, from a prior age when disk space and memory were measured in kilobytes, for this kind of super-dense and very skilled programming, but I still found a few things that bothered me. Some were minor.

But I suspect some were bugs. This fragment from the locate.code program, the one that actually builds the Squozen database, stands out to me. It's analyzing the character that was just read. We want to make sure it can't be confused with a bigram index entry, so we check whether it has that high bit set and, if it does, strip it out. Also, if it's at or below RS (SWITCH, here), we replace it with a question mark.

    if ((unsigned char)*cp >= PARITY)
      *cp &= PARITY - 1;
    else if (*cp <= SWITCH)
      *cp = '?';

This makes me wonder whether I could pass it a character that has the PARITY bit set, but whose masked value lands at or below RS, sneaking a reserved character into the database and creating havoc.
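It turns out this is easy to check: because the two branches are exclusive, any byte whose low seven bits land at or below RS passes through the first branch and is never re-examined by the second. A sketch, with the constants as in the quoted source and the function name mine:

```c
#define PARITY 0200
#define SWITCH 036

/* The sanitizing step quoted above, applied to one character. */
unsigned char sanitize(unsigned char c)
{
    if (c >= PARITY)
        c &= PARITY - 1;   /* strip the high bit... */
    else if (c <= SWITCH)  /* ...but the result is never re-checked */
        c = '?';
    return c;
}
```

Feed it 0236 (0x9E) and out comes 036: RS itself, the very byte the format reserves as its escape, written straight into the database.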

Smaller ones are like this:

 if ( diffcount < 0 || diffcount > 2*OFFSET )

I can't help but think that would be clearer (and faster!) as:

    if (diffcount >= 0 && diffcount <= (2 * OFFSET)) {

This puts the more common case first, and makes it clear that we're checking for something inside the offset range (OFFSET is 14), where the differential run length can be encoded as a single byte.

As is typical of Unix at the time, most functions return zero when everything is fine, and a positive signed integer is an error code... unless the result could be an array index, in which case zero is okay, and a negative number is an error code... unless the "array" is actually being indexed by pointers, in which case zero is an invalid memory location, and now we're back to the original rules.

And you just had to memorize all this. Or at least understand where to look for it in the man pages.

No tests

But the biggest thing in all of this is that there were exactly zero unit tests. No one tested to see whether that potential bug in the locate.code program could actually manifest. There are so many places where this code could fail, and the developer provided none of the automated testing we've become accustomed to for ensuring that the algorithm being used in any given place is correct.

The good and bad of modern development

I'm glad we don't code like this anymore. It's sparse and empty, too busy meeting the demands of expensive hardware that still provided very little RAM or disk space, and those disks were actual spinning disks, slow and cranky. Little things like automated testing, static analysis, and language servers have made our lives so much better; even as I was doing my archaeology there was a little red indicator in the lower-left corner of my IDE's message bar showing how many warnings and errors were detected... and that's before we even talk about the archaic dialect of C in which function arguments were declared twice: as a bare list in the declaration, and then with their types in the space between the declaration and the function body.

On the other hand, there is something charming about the idea that you would just write a single file, compile it, and run it. These days development environments come with initialization tools like npm, cargo, and whatever Python is using nowadays, and involve the struggle to hook up your testing and documentation with cloud services, SaaS providers, and APIs that you have to mock if you're to get anything done in a timely and testable fashion.

I don't miss the old days, but I can't help but feel that something's gone terribly awry. We oversell cloud services and don't do much useful with them, while at the same time we spend hours and hours tinkering with getting the pipeline of versions and services just right. Some days I yearn to just sit down and bang out some code; I just know that if I'm going to share it with anyone, "just banging it out" is irresponsible; I want to provide quality tests, examples, and documentation, to encourage others to use what I write.