I added Basename and Dirname to PostgreSQL, and sped up Catalogia by a LOT

Posted by Elf Sternberg as database, django, python

The Semantics of Python Import series has been important because the work I’m doing there supports work I’m doing elsewhere, namely modernizing the Hy import system to work with importlib, and then further shimming the import system to provide for heterogeneous source loaders.

My showcase for all this has been Catalogia, a music collection management program written in Hy and Django. Catalogia shows off all the inner workings of my Polyloader shim, and shows that even Django’s bizarre metaprogramming won’t break with Polyloader active. Download my version of Hy (that’s important, as I haven’t made a PR to the Hy people yet; I’m still testing Polyloader to make sure it’s not broken) into a (preferably Python 3) virtualenv, install Django, install Catalogia… and it might work. No promises.

But while I’m working on Python, I had another headache. One of Catalogia’s issues was that I wanted to find nested albums. My collection has over a thousand albums (I’m old, okay? When I was a teenager, collecting vinyl was the way to go. I still have my high school’s glee club albums on flippin’ vinyl!), and I’m sure somewhere along the line I messed up and mvd a file to the wrong place.

Catalogia’s primary key is the path to an MP3 file. The organizational scheme I’ve always used is “Artist – Album/Song.mp3”, so knowing if somehow one album folder wound up inside another was a good integrity check; that’s not supposed to happen.

PostgreSQL (which I’m using) doesn’t have POSIX basename and dirname functions. Why would it? But I really needed them, because I needed the dirname (folder path) to an MP3 file, so I know what folder an album represents, so I could check for nested albums.

So I wrote dirname and basename for PostgreSQL. The unit tests are taken from the Python unit tests for posixpath.py, and so ought to be competently congruent with Python, which is my whole point. They’re even fun to watch:

    name    | basetest | baseexpected | dirtest  | direxpected | base | dir  
 /foo/bar   | bar      | bar          | /foo     | /foo        | PASS | PASS
 /          |          |              | /        | /           | PASS | PASS
 foo        | foo      | foo          |          |             | PASS | PASS
 ////foo    | foo      | foo          | ////     | ////        | PASS | PASS
 //foo//bar | bar      | bar          | //foo    | //foo       | PASS | PASS
 /foo/bar   | bar      | bar          | /foo     | /foo        | PASS | PASS
 /foo/bar/  |          |              | /foo/bar | /foo/bar    | PASS | PASS

Y’know, in case I’ve forgotten how to SQL.

But the best part came later, when I tried to test out whether or not it worked on my own database. The comparison was actually kinda… well…

    SELECT DISTINCT dirname(a.path) AS parent,
                    dirname(b.path) AS child
    FROM catalog_mp3 as a,
         catalog_mp3 as b,
    WHERE dirname(a.path) != dirname(b.path)
    AND   dirname(b.path) ~ ('^' || dirname(a.path));

On my laptop, that took 11 minutes and 52 seconds to run on a sample set of about 200 albums (folders). It was really disheartening. But then I realized that part of the reason it ran so slowly was because it was comparing title paths, not album paths, and it was performing the dirname comparison over and over.

The fix was obvious. Use a Common Table Expression to make a temporary table of the album paths, then run comparisons against the CTE:

    WITH prepped_paths AS (
      SELECT DISTINCT dirname(path) AS dpath FROM catalog_mp3)
         SELECT a.dpath AS parent, b.dpath AS child
         FROM prepped_paths AS a, prepped_paths AS b
         WHERE a.dpath != b.dpath
         AND a.dpath ~ ('^' || b.dpath);

That took 3 seconds, a speed-up factor of almost 240! Not too shabby at all.

But then I wondered… what if I preprocessed the comparison expression, and used LIKE instead of regexp?

     WITH prepped_paths AS (
       SELECT DISTINCT dirname(path) AS dpath,
                       (dirname(path) || '%') AS mpath
       FROM catalog_mp3)
         SELECT a.dpath AS parent, b.dpath AS child
         FROM prepped_paths AS a, prepped_paths AS b
         WHERE a.dpath != b.dpath
         AND a.dpath LIKE b.mpath;

88ms. A speed-up factor of 8000! Whee!

This does leave me with a conundrum, however. I could lock Catalogia down into using only PostgreSQL, which wouldn’t leave me heartbroken as I’m a PostgreSQL snob. The alternative is to store album paths with the album title fields. Which I may end up doing anyway in order to support MySQL and SQLite… but with a real RDBMS it’s derivable information, so the PostgreSQL way makes me much happier.

Comment Form

Subscribe to Feed



July 2016
« Jun   Aug »