I searched for "django gearman" on Google and Bing, and found precious little.  There isn't much out there, so I've decided to put together my own example, using Gearman as a queue manager.

If you don't know what Gearman is, it's a "generic application framework to farm out work to other machines or processes that are better suited to do the work."  You have Gearman clients and workers: clients dispatch jobs to a configurable table of gearman servers, which in turn dispatch each job to an idle worker process.  Clients and workers can be on separate machines; it is the collection of gearman servers that decides which worker will accept a given task.  This can be very useful when you have a long-running process, such as audio or video processing (or, in my case, the automatic LaTeX-ification of documentation).

As an (utterly trivial and inadequate) example, I'm going to show you how to communicate between a client triggered from Django, a worker that does some work, and then how the message gets back to Django that the process has run.  This example is trivial because it assumes both processes are on the same machine; it is inadequate because it uses no synchronization to ensure that the message passed back to the Django process isn't accidentally destroyed by a race condition.

Let's start with the basics:

virtualenv --no-site-packages geardemo
cd geardemo
source bin/activate
pip install gearman django
django-admin startproject geardemo
cd geardemo
./manage.py startapp testapp
mkdir workers

Here, I've created a virtual environment in which we're going to run our example, installed gearman and django, started a django project, and in that project created our app and a directory for the workers.
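At this point the tree looks roughly like this (the inner layout follows the django-admin defaults of the time; `worker.py` gets written further down):

```
geardemo/              <- the virtualenv
└── geardemo/          <- the django project
    ├── manage.py
    ├── settings.py
    ├── urls.py
    ├── testapp/
    │   └── views.py
    └── workers/
        └── worker.py
```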

What I want to do is create a view in my testapp that dispatches jobs to Gearman. The view will take a single argument from its web page and pass that to Gearman. I also want to pass my session key, because I'll be using the session object to receive notice that the process is done.  In testapp/views.py:

from django.shortcuts import render_to_response
from django.template import RequestContext
from gearman import GearmanClient

from django.conf import settings

try:
    import cPickle as pickle
except ImportError:
    import pickle

def run(request):
    if request.method == "POST":
        jobname = request.POST.get('name', '')

        if jobname.strip():
            client = GearmanClient(["127.0.0.1"])
            req = request.COOKIES.get(settings.SESSION_COOKIE_NAME, None)
            arg = (jobname, req)
            pik = pickle.dumps(arg, pickle.HIGHEST_PROTOCOL)
            res = client.dispatch_background_task("work", pik)

    status = request.session.get('worker.status', [])
    return render_to_response('view.html', { 'status': status },
        context_instance=RequestContext(request))

The tricky part, for me at least, was figuring out how to pass multiple objects to the worker process.  The gearmand servers are written in C and will blindly pass any stream of bytes from the clients to the workers, so to pass multiple objects, they must be serialized into a single string with pickle.dumps before being handed to gearman.
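For instance, here's that round trip in isolation (the job name and session key are made-up values, not from the demo):

```python
try:
    import cPickle as pickle  # C implementation, Python 2 only
except ImportError:
    import pickle

# Bundle the job name and the session key into one byte string,
# since gearmand only ever sees an opaque blob.
payload = ('latexify-docs', 'abc123sessionkey')
blob = pickle.dumps(payload, pickle.HIGHEST_PROTOCOL)

# The worker reverses the process to recover both objects.
jobname, session_key = pickle.loads(blob)
```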

The worker process has its own mysteries.  In the client, I read the status out of the session object, which means I have to populate the session object within the worker.  This is inadequate because there's no synchronization at work: the django process could read the status object, the worker process could read then write, and then the django process could write, erasing the worker process's changes.  To make it possible at all, I pass the session key to the worker in the client request above.  This is likewise inadequate because there's no guarantee the worker and the client are on the same system-- but it will work if you're using memcached as your session store.  In fact, for this kind of communication I recommend memcached and a smarter synchronization mechanism, perhaps one with atomic increments, independent of sessions.
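To illustrate the atomic-increment idea, here's a toy in-process sketch of it.  This is not memcached (with python-memcached you'd call the client's incr method against a shared key); it just shows why serializing the read-modify-write cycle prevents the lost-update race described above:

```python
import threading

class AtomicCounter(object):
    """Toy stand-in for memcached's atomic incr operation.  A lock
    serializes the read-modify-write cycle so concurrent writers
    can't clobber each other's updates the way the session-based
    status list can."""
    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def incr(self, key, delta=1):
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + delta
            return self._counts[key]

counter = AtomicCounter()

def bump():
    # Each writer bumps the same key 1000 times.
    for _ in range(1000):
        counter.incr('jobs.finished')

threads = [threading.Thread(target=bump) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place, no increment is lost no matter how the two threads interleave.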

Here's the file workers/worker.py:

from django.core.management import setup_environ
from django.utils.importlib import import_module
from gearman import GearmanWorker
import time
import sys
import os.path

try:
    import cPickle as pickle
except ImportError:
    import pickle

sys.path.insert(0, os.path.realpath(os.path.join(os.path.dirname(__file__), '..')))
import settings

setup_environ(settings)
from django.conf import settings
engine = import_module(settings.SESSION_ENGINE)

def dothework(gjob):
    job, session_key = pickle.loads(gjob.arg)
    for i in xrange(0, 6):
        time.sleep(1)

    if session_key:
        session = engine.SessionStore(session_key)
        status = session.get('worker.status', [])
        status.append('done with %s' % job)
        session['worker.status'] = status
        session.save()

    return True

worker = GearmanWorker(['127.0.0.1'])
worker.register_function("work", dothework)
worker.work()

There's a certain amount of rigmarole in importing the Django session object.  Using sys.path to put settings.py within our path, setting up the environment, then re-importing the django settings object gives us access to SESSION_ENGINE, and just as in the client I need pickle to get at the job name and session key passed from the django process.  The worker itself merely sleeps for six seconds, then writes back to the user's session that the job is done, echoing the word passed in from the client.

Note that both the client and the server use a token, "work", to communicate which function they want run at the worker's end.

One last file. Here's view.html, the template for our example:



<h1>Test</h1>
<p>Current Status: </p>
{% if not status %}
<p>No recent status updates</p>
{% else %}
<ul>
{% for s in status %}
<li>{{ s }}</li>
{% endfor %}
</ul>
{% endif %}

<hr>
<form action="." method="POST">
{% csrf_token %}
<label for="name">Job Name: <input type="text" name="name"></label>
<input type="submit" value="submit">
</form>

Very simple and straightforward.  We're not even using forms.  Just if there's status then show it, and ask for another job to do.
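One thing I've glossed over is the URL routing that points at the view.  For Django of this vintage, a minimal urls.py would look something like this (the pattern and dotted path are assumptions based on the layout above, not taken from the demo source):

```python
# urls.py -- a minimal, hypothetical routing for the demo
from django.conf.urls.defaults import patterns, url

urlpatterns = patterns('',
    url(r'^$', 'testapp.views.run'),
)
```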

To run this:

./manage.py syncdb
gearmand -d
cd workers
python worker.py &
cd ..
./manage.py runserver

If you did everything right, you can now browse to port 8000 on your server and get the view above. Pass it job names, and six seconds after you do, you'll get a response. If you stack the jobs, the later ones will take longer, because the lone worker processes its queue serially, not in parallel.

And that's it.

Full source code to the geardemo program is available.  You will still have to set up the virtualenv on your own.