One of the things I've always wanted for my website is a search engine that wasn't beholden to Google. I've considered using Solr or one of those, but I really wanted something local. Something that ran on my laptop. Something that wasn't "in the cloud."

I had helped develop a library science application back at university, but that was over 30 years ago, and while I actually still have that textbook, the techniques in it are outdated. I thought about using Tantivy or something along those lines, but each option was a complicated mess.

Then I discovered Meilisearch, and I fell immediately in love. It's a typo-friendly, incredibly fast, and (for searches) lightweight search engine that you can run locally.

Here's how I built it:

Installing the search engine

Meilisearch is a single binary; aside from that it needs a configuration file and a folder in which to store its document stores and indices. It's written in [Rust], so you're welcome to build it yourself, but it's probably easiest to just download the most recent release.

Since I run a single-site server (very old-skool), I created a new user privileged to run Meilisearch and nothing else. I created a bin folder for the executable, plus the data.ms and dumps folders, and copied the binary into bin. I also copied over the config.toml file that came with the source code, giving it a local secret key, setting the mode to production, and locking the network interface down to localhost-only.
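
The interesting lines in that config look something like this (a sketch from memory; the exact key names vary a little between Meilisearch versions, and the key shown is obviously a placeholder):

env = "production"                  # no development niceties, stricter defaults
master_key = "not-my-real-secret"   # the local secret key; generate a long random one
http_addr = "127.0.0.1:7700"        # bind to localhost only
db_path = "/home/meili/data.ms"     # the document stores and indices
dump_dir = "/home/meili/dumps"      # where dumps land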

I also added a systemd entry:

[Unit]
Description=Meilisearch
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
Restart=always
RestartSec=1
User=meili
ExecStart=/home/meili/bin/meilisearch --config-file-path /home/meili/config.toml

[Install]
WantedBy=multi-user.target

Entering the data

I built the database on my desktop and then rsync-copied it into the server's data.ms folder. Building the database was an interesting exercise. I had long had a copy of Emily Daniels' "Character Extraction" program, written in Python, which identifies all the proper names in a document. I hacked it up quite a bit, turning it into a library and streamlining it so that it cared only about the character names and not their sentiment.

My stories are stored the way my static site generator, Zola, likes them: in Markdown with a TOML header identifying the story by title, position in the series, and so on, and I know the algorithm by which that information is turned into HTML and the slugs that identify each story's location. I wrote a quick Python script using the Python meilisearch library, the character-extraction library, and my own toolkit for handling such Zola-fied Markdown; it opened each folder, ran the text through the character extraction, generated the slug, and then sent the ID, series ID, title, characters, content, and slug to the Meilisearch server.
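
Stripped of error handling, the heart of that script looked roughly like the sketch below. The extract_characters() helper, the front-matter field names, and the slug logic are stand-ins for my own hacked-up libraries, so treat them as illustrative rather than gospel:

import pathlib
import tomllib                      # Python 3.11+ TOML parser, standing in for my Zola toolkit

import meilisearch                  # pip install meilisearch
from characters import extract_characters   # hypothetical name for the hacked-up extraction library

client = meilisearch.Client("http://127.0.0.1:7700", "not-my-real-secret")
index = client.index("stories")

documents = []
for doc_id, path in enumerate(sorted(pathlib.Path("content/stories").rglob("*.md"))):
    raw = path.read_text(encoding="utf-8")
    # Zola fences its TOML front matter with +++ lines; split it away from the Markdown body.
    front, _, body = raw.partition("+++\n")[2].partition("+++\n")
    meta = tomllib.loads(front)
    documents.append({
        "id": doc_id,
        "series": meta.get("extra", {}).get("series"),
        "title": meta["title"],
        "characters": extract_characters(body),
        "content": body,
        "path": f"/stories/{path.parent.name}/{path.stem}/",   # approximate Zola's slug rules
    })

index.add_documents(documents)      # indexing is asynchronous; poll the returned task if you need to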

Processing 400 stories took about 15 minutes, most of that due to the character search algorithm. AI isn't known for being fast.

Securing the server

What I wanted was to use Nginx to remap any query into a more limited one, so the response didn't include the full text of every story that matched the request; at the same time I wanted Meilisearch itself to return every match, since I only have a few hundred stories.

Nginx creates a variable $arg_X for any query key X; I just wanted to forward that along to Meilisearch, together with the secret key Meilisearch uses to secure its instance. Here's the basic magic in the Nginx configuration:

location /_query/ {
  if ($request_method != GET) {
    return 404;
  }
  proxy_pass http://127.0.0.1:7700/indexes/stories/search/?q=$arg_q&limit=500&attributesToRetrieve=path,title,series,characters;
  proxy_method     GET;
  proxy_redirect   off;
  proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
  proxy_set_header X-Real-IP $remote_addr;
  proxy_set_header X-Forwarded-Host $host;
  proxy_set_header Authorization "Bearer <Your Secret Here>";
}

This is by no means perfect, but it's probably good enough for the time being. (You didn't think I was going to actually reveal my server's secret key, did you?) One thing I'd like to do is figure out how to use the Nginx map directive to do the whitelisting of the query content here, rather than have to force Meilisearch to do it.
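
If I ever get around to it, my guess is it would look something like this: a map in the http block flagging any query that contains characters outside a small whitelist, checked before the proxy_pass. This is untested, and the allowed character class is just a guess at what I'd permit:

map $arg_q $bad_query {
    default              0;
    "~[^A-Za-z0-9 '-]"   1;
}

# ...and inside the server block, ahead of the proxy_pass:
location /_query/ {
  if ($bad_query) {
    return 400;
  }
  # proxy settings as above
}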

Providing the experience

Meilisearch can put out "instant" results, but that's not what I was after. I just wanted a basic search experience: type in a name, hit [Enter], and get back a list of stories in "relevance order". I mapped the path back to the file path I expected, and the rest was just a simple list of titles along with the provided list of characters, which to me is the most important aspect of the story.

The core of the code is absurdly simple. Assume renderHit() is just a template (because it is). This is the function commitSearch, minus a bit of whitelist checking of the input. The same whitelist checking is done server-side, but it's nice to short-circuit the dumber hackers early.

const results = document.getElementById("search-results");
// `encoded` holds the whitelisted, URI-encoded search term from the input field.
fetch(`https://pendorwright.com/_query/?q=${encoded}`)
  .then(response => response.json())
  .then(data => {
    // Render each hit through the template and drop the list into the page.
    const lines = data.hits.map(renderHit);
    results.innerHTML = `<dl>${lines.join("\n")}</dl>`;
    // Record the search in the URL so 'back' and permalinks work.
    window.history.pushState({}, null,
      `${window.location.origin}${window.location.pathname}?search=${encoded}`);
  });

A couple of event handlers (onClick for the button, onKeydown(enter) for the input field), and that's about it.
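
For completeness, the wiring is roughly this (the element IDs are made up for illustration; both paths call the commitSearch shown above):

const input = document.getElementById("search-input");
const button = document.getElementById("search-button");

button.addEventListener("click", () => commitSearch(input.value));
input.addEventListener("keydown", (event) => {
  if (event.key === "Enter") {
    commitSearch(input.value);
  }
});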

I hacked a little on window.history so that if you went forward from a page and then hit the 'back' button, your last search would still be there. It also lets you create permalink searches: for example, a permanent URL listing every story in which "Marriage" is mentioned. The JavaScript for that, in the page loader, was just as straightforward:

  const queryString = window.location.search;
  const urlParams = new URLSearchParams(queryString);
  const pageSearch = urlParams.get("search");
  if (pageSearch) {
    commitSearch(pageSearch);
  }

Challenges and Learning

From this experiment, I learned:

  • How to install Meilisearch
  • How to configure Meilisearch
  • How to build rich Meilisearch indexes using the Python SDK
  • How to incorporate Ms. Daniels' NLTK "Identify the characters" library into a Meilisearch index
  • How to use Meilisearch with Zola
  • How to configure Nginx to forward outside requests to Meilisearch in a somewhat safe manner
  • How to write complex Meilisearch queries using only the GET interface

I also wrote some very straightforward JavaScript, not even TypeScript. I probably should have used TypeScript, but it seemed heavyweight for a 60-line JavaScript program. It would have been silly to pull out something like React or Backbone for this, but if I ever want to move to a richer experience, perhaps I'll look into it.

If I have one complaint about Meilisearch, it's that the document store isn't compressed. It took 663MB to index only 13MB of source text: all 372 episodes, plus the 84 chapters of novels, unrelated short stories, and so forth that are part of my story series. I shudder to think what indexing all 6,576 blog entries in my Livejournal archive will take.

As far back as 1989 we knew that it was not only possible to compress the document store and still make it searchable, but that it was faster and more efficient to pull the compressed content off disk and decompress it than to read an uncompressed stream. That may no longer be true in the era of solid-state storage, but it's still a feature I would love to see; my server is tiny and cheap, and every gigabyte is precious to me.