clihelper

I tend to write a fair number of command-line applications in Python that, more often than not, are meant to run as daemons. I also tend to use the same patterns in doing so. At first I wrote the daemonization code myself, following the general pattern that can be found in many places. Then I discovered python-daemon, the reference implementation of PEP-3143.

For logging, I would often use the same code, cut and pasted from one application to another. After digging into the logging documentation for Python 2.7, I decided that DictConfig in logging was for me, but I needed support in 2.6. I also wanted something I could install via pip instead of copying the code from 2.7's logging package and including it in my code. Thus logging-config was born. *Edit: I have now removed logging-config and moved to logutils thanks to a comment by Vinay Sajip below.

Today, I am releasing clihelper, a Python module that aims to make writing command-line applications and daemons in Python easier. It uses python-daemon and logging/logutils together with a YAML based configuration file to let one focus on writing the core application and not the details about how to deal with command-line option handling, configuration, logging and daemonization.

Getting started with clihelper is meant to be very straightforward; simply extend the clihelper.Controller class:

class MyApp(clihelper.Controller):
    def _process(self):
        self._logger.info('Would be processing at the specified interval now')

Next, in the main guard of the Python module that contains the MyApp class, call the clihelper.setup method, then call the clihelper.run method, passing in the class that will be used as your application controller.

if __name__ == '__main__':
    clihelper.setup('MyApp', 'MyApp is just a demo', '0.0.1')
    clihelper.run(MyApp)

Next, the configuration file should be created:

Logging:
  version: 1
  formatters:
    verbose:
      format: '%(levelname) -10s %(asctime)s %(process)-6d %(processName) -15s %(name) -10s %(funcName) -20s: %(message)s'
      datefmt: '%Y-%m-%d %H:%M:%S'
  handlers:
    console:
      class: logging.StreamHandler
      formatter: verbose
      level: DEBUG
      debug_only: True
    file:
      class: logging.handlers.RotatingFileHandler
      formatter: verbose
      level: DEBUG
      filename: example.log
      maxBytes: 1024
      backupCount: 3
  loggers:
    MyApp:
      handlers: [console, file]
      level: DEBUG
      propagate: true
    clihelper:
      handlers: [console, file]
      level: DEBUG
      propagate: true
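
For readers who have not used DictConfig before, here is a minimal sketch of what a section like the one above boils down to once it is parsed into a dictionary and handed to the standard library. It is not clihelper's own loading code. I have dropped the file handler to keep it side-effect free, and I have left out debug_only, which is not a standard logging key and is assumed here to be consumed by clihelper itself.

```python
import logging
import logging.config

# Hand-written dictionary mirroring part of the YAML above (assumption:
# clihelper passes something similar to dictConfig after parsing the file)
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'verbose': {
            'format': '%(levelname)-10s %(asctime)s %(name)-10s '
                      '%(funcName)-20s: %(message)s',
            'datefmt': '%Y-%m-%d %H:%M:%S',
        },
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'verbose',
            'level': 'DEBUG',
        },
    },
    'loggers': {
        'MyApp': {
            'handlers': ['console'],
            'level': 'DEBUG',
            'propagate': False,
        },
    },
}

# Configure the logging package in one call
logging.config.dictConfig(LOGGING)

logger = logging.getLogger('MyApp')
logger.info('logging configured from a dictionary')
```

On Python 2.6 the same call would come from logutils.dictconfig rather than logging.config.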

Now invoke your application via the command line. Try passing --help to see the base level options:

Usage: usage: example.py -c  [options]

MyApp is just a demo

Options:
  -h, --help            show this help message and exit
  -c CONFIGURATION, --config=CONFIGURATION
                        Path to the configuration file.
  -f, --foreground      Run interactively in console

clihelper allows you to add your own command-line options and does not need to be interval based. Instead, one can use a blocking IOLoop or other similar concepts. In the class that extends clihelper.Controller, redefine the clihelper.Controller._loop method. In addition, you'll likely want to extend the clihelper.Controller._shutdown method to tell the IOLoop to stop when the application has been signalled to stop. An example with the Tornado IOLoop may look something like:

import clihelper
from tornado import ioloop

class Test(clihelper.Controller): 
    
    def _loop(self):
        self._ioloop = ioloop.IOLoop.instance()
        self._ioloop.start()

    def _shutdown(self):
        self._ioloop.stop()

There are a few other options, so if you're inclined to try it out, I suggest reading the README and the code. I hope that someone else will find this useful. If you have any suggestions or improvements, please do not hesitate to send them my way.

Automagical Docstring Linking in Sphinx

I've been trying to pick up Sphinx to document my projects and have spent a few hours today trying to force it to do what I want. In essence I was looking for a way using the autodoc functionality to automatically replace module.Class references in docblock text with python domain markup.

I would have thought this functionality would be included, and if it is, I could not find it. What I did find, however, was the autodoc-process-docstring event, which I was able to create a handler for. In this handler, I use a regex to search for module.Class references and then replace them in place with :py:meth:`~module.Class`, which then auto-links the appropriate document and location in the docblock's Sphinx output for me.

Since I could not find any examples of this, nor any mention or example of using autodoc-process-docstring, I present to you my implementation. To use, just append this to your conf.py in your Sphinx project.

import re

# Matches dotted references such as module.Class
py_meth = re.compile("(([a-z_]+)([.][a-z_]+)+?)", re.IGNORECASE)


def is_module(test_string):
    # True if the reference, or its parent module, can be imported
    try:
        __import__(test_string)
        return True
    except ImportError:
        parts = test_string.split('.')
        try:
            __import__('.'.join(parts[:-1]))
            return True
        except ImportError:
            return False


def process_docstring(app, what, name, obj, options, lines):
    # Sphinx expects the lines list to be modified in place
    for x in range(len(lines)):
        matches = py_meth.findall(lines[x])
        for match in matches:
            if is_module(match[0]):
                lines[x] = lines[x].replace(match[0],
                                            ':py:meth:`~%s`' % match[0])


def setup(app):
    app.connect('autodoc-process-docstring', process_docstring)

I'm checking to see if my matches are actually importable and if they are, I'm replacing them with the :py:meth syntax. Since my Python project is local to the scope of the Sphinx project, the import testing is valid.
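
If you want to try the idea outside of a Sphinx build, the following is a small self-contained sketch. It is a variation rather than the handler above: it uses re.sub with a callback so each reference is rewritten exactly once, and the names DOTTED_NAME, is_importable and link_references are made up for this example.

```python
import re

# One or more dotted components, e.g. os.path or package.module.Class
DOTTED_NAME = re.compile(r'[A-Za-z_]+(?:\.[A-Za-z_]+)+')


def is_importable(dotted_name):
    """True if the name, or its parent, can be imported as a module."""
    for candidate in (dotted_name, dotted_name.rpartition('.')[0]):
        try:
            __import__(candidate)
            return True
        except ImportError:
            continue
    return False


def link_references(lines):
    """Rewrite importable dotted names in place with :py:meth: markup."""
    def replace(match):
        name = match.group(0)
        if is_importable(name):
            return ':py:meth:`~%s`' % name
        return name

    for index, line in enumerate(lines):
        lines[index] = DOTTED_NAME.sub(replace, line)


docstring = ['See os.path for details.',
             'Ignore nosuchpkg_xyz.thing entirely.']
link_references(docstring)
```

After the call, the first line carries the :py:meth: markup and the second is left alone, because neither nosuchpkg_xyz.thing nor its parent can be imported.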

All your infinite loop are belong to asyncore.poll

In working on the next version of Pika I've been implementing functional and unit testing of the various bits of the code base. I ran into a weird bug when I was working on a test case for connecting to and disconnecting from RabbitMQ. The way that I structured the test is to iterate through each of Pika's adapters, doing a connect and disconnect test. What I encountered was that any time a test followed the AsyncoreConnection adapter, nosetests would hang. If I changed the order of the tests and put asyncore at the end, the tests would complete successfully.

After much frustration I decided to dig into the asyncore code to see if I could figure out what was going on. The first thing I noticed other than the coding style is that the code hasn't been updated in over 10 years, so I wasn't the first to stumble upon this bug. As such I started digging in the bug tracker on python.org and ran across http://bugs.python.org/issue10878 which is very recent (2011-01-10).

Teodor Georgiev had been seeing the same symptom as I was encountering and went as far as to isolate the reason. Since he pointed out the problem is when the read, write and both flags are empty lists, I decided to go look at the code in context.

asyncore.loop() has four parameters: timeout, use_poll, map, and count. The problem stood out when I saw that if you do not pass your own socket map into asyncore.loop, it will use the asyncore.socket_map dictionary created when you import asyncore. The odd thing to me about the default socket_map dictionary is that it doesn't seem to be maintained in the module.

The hack to get around the bug ended up being to pass an empty list as the map parameter when calling asyncore.loop(), as it only checks whether you passed in a value. When you do this, map has a value other than None and the while loops on lines 209 and 213 can exit because an empty list will evaluate to False. An empty dictionary, like socket_map, will just sit in that loop and run forever.

The proper bug fix should be to change 209 and 213 to have it check len(map) not just map.

Tornado is not (just) a Web Framework

One of the points I've been working on refining in my talks on Tornado is that it's a take-what-you-need framework. At myYearbook we took this to the next step and have been experimenting with building non-HTTP protocol servers on the tornado.ioloop module. This has some great advantages for rapid prototyping of servers while making available things like the async tornado.httpclient.

In essence, this is what I do for the Pika Tornado Adapter as well.

I'll be giving a talk on Tornado at PyCon 2011 in March, so I'll have to figure out a way to bring this point home in one slide or less.

Tornado Tip: Variables & Functions in Tornado Templates

Tornado has a very fast and flexible template system that is reminiscent of other template systems, such as that found in web.py.  I have found that in practical use of the template system, the documentation is thin in some areas.  A good example is in what variables and functions are exposed to the templates by default.

To that end, here is a list of the variables and functions available to a template as defined in both template.py and web.py:

  • _ (underscore)
    An alias for the locale.translate() function.
  • current_user
    The current_user object as returned by RequestHandler.get_current_user().
  • datetime
    The datetime module.   Example: {{ datetime.date.today().year }} returns the current year.
  • escape
    A function that escapes a string so it is valid within XML or XHTML.
  • handler
    The request handler that called the self.render function to process the template.  Example: {{ handler.static_url('foo') }} will run the static_url function in RequestHandler, returning a full URL path, prepending static_url.
  • json_encode
    A function that JSON-encodes the given Python object.
  • locale
    The locale value as returned by RequestHandler.get_user_locale().
  • request
    The request object that is passed into the Request Handler.
  • reverse_url 
    Given a full module and class (myapp.Homepage) it will look at the handler to URI mapping and return the URL for the given class.
  • squeeze
    A function that replaces all sequences of whitespace chars with a single space.
  • static_url
    The value of the static_url property passed in to the application settings.
  • url_escape
    A function that returns a valid URL-encoded version of the given value.
  • xsrf_form_html
    If using the cross-site request forgery protection, this function returns the hidden input field containing the xsrf variable.

This is the current list as of today's master branch on github.  If you're using 0.2 there are a few functions that are not in that version.  Did I miss one or get something wrong?  Please let me know.

Web Application Development with Tornado

I recently finished a rewrite of privatepaste.com in Python using Tornado as the web framework. There are multiple reasons why I decided to use Tornado instead of something like Django, CherryPy or web.py, all of which I’ve previously used.  One of the main reasons for my choice to switch to Tornado was its feature-rich yet light-weight nature. In addition, the benchmarks and asynchronous HTTP server were intriguing.

Tornado is distributed as a set of loosely coupled Python modules. It’s up to the developer to decide which aspects of Tornado to use. It’s also up to the developer to write the core application which is responsible for running the web application as a daemon. To create a base level application, all that is required is using tornado.httpserver and tornado.web. If you’ve ever programmed using web.py, many of the conventions should be familiar to you.

The base principle in writing an application is to map a URI to a class. In that class you provide functions for the HTTP methods you intend to support. Because Tornado at its core is an HTTP server, you must implement every HTTP method you intend to support for a URI. For example, if you are writing a CMS, you would not only implement the GET function, but you’d want to implement a HEAD function for returning cache information about your content to browsers.

Because Tornado is loosely coupled, it is up to the developer to implement everything from session handling and authentication to localization and the data layer.  Some may consider this an issue, but not to worry, Tornado includes modules to help.  There are authentication mix-ins for Google, Twitter, Facebook and FriendFeed.  To achieve authentication with Tornado, you would extend the tornado.web.RequestHandler class and override get_current_user() to handle the authentication functionality.

Localization is handled in a similar fashion.  While there is some magic under the covers, it generally leaves localization implementation up to the developer.  By extending get_user_locale(), the developer returns a locale object which has been initialized with the appropriate language.  As with other modules, the meat of the documentation is in the locale and web classes.

Documentation is one of the key drawbacks of using Tornado.  If you cannot dive into other people’s code to find what you need, you’re going to have a difficult time with Tornado.  Much of the initial time that I spent with Tornado was in the Tornado code itself, figuring out how to access different parts of data within the HTTP request, templating system and the application class.  The documentation provided is deceptively simple, and indeed, for a Hello World application the documentation is sufficient and accurate.  It’s when you’re knee deep in code that you’ll find yourself having to go beyond the documentation to get what you need.

The template engine is full-featured and has yet to leave me wanting.  While there has been some back and forth on the mailing lists about the speed of the template engine, it has proven important to turn off debug mode when comparing template engines.  I have found the template engine to be very fast, even when extending other templates and including modules.

The biggest hurdle, which isn’t uncommon in any web application, is right-sizing processes to serve your application.  Because your tornado app runs as a stand-alone HTTP server that is directly coupled to your application classes, you need to run multiple processes to serve multiple requests.  Like FriendFeed, I use Tornado behind a web server using a reverse proxy module.  However, instead of Nginx, I am using Cherokee.  I use Python’s multiprocessing module to spawn multiple HTTPServer instances with my application.  Each instance has its own port number and the reverse proxy server uses these backends in a round-robin pool to provision requests.
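
To make the one-process-per-port layout concrete, here is a rough sketch. Note the swap: it uses the standard library's http.server as a stand-in for Tornado's HTTPServer so that it runs without Tornado installed, and HelloHandler, serve, backend_ports and build_backends are names invented for this example, not part of my actual setup.

```python
import multiprocessing
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Stand-in for the real application; replies with the serving port."""

    def do_GET(self):
        body = ('served by port %d\n' % self.server.server_port).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port):
    """Run one blocking HTTP server; each backend process calls this."""
    HTTPServer(('127.0.0.1', port), HelloHandler).serve_forever()


def backend_ports(base_port, count):
    """Ports the reverse proxy would list in its round-robin pool."""
    return [base_port + offset for offset in range(count)]


def build_backends(base_port, count):
    """Create, but do not start, one process per backend port."""
    return [multiprocessing.Process(target=serve, args=(port,),
                                    name='backend-%d' % port)
            for port in backend_ports(base_port, count)]
```

In a real launcher you would call start() on each process from inside a main guard and then join() them, with the reverse proxy configured to balance across the same list of ports.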

When coming from a CGI-based backend, you have to think a bit more about how you size your backends.  Because your web server front-end can’t spawn new backends on demand, you’ll need to make sure that you have enough backends to provision your maximum number of simultaneous requests.  There are changes in the master branch of Tornado on github to make Tornado fork on its own, spawning a process per CPU core, but this will not change the scaling concern, as the same principles apply.

Asynchronous request handling is one of the more often touted features of Tornado.  It’s important to understand exactly how async requests fit into your application development model.  Because each Tornado back end is single threaded it is important to think about the blocking areas of your application, such as database calls, to determine if you can benefit from the async functionality.  To truly benefit from the async server, you’ll need to use a fully async model for any type of operation that would normally be blocking.

An example where the async functionality shines is the authentication mix-ins for Google, Facebook, FriendFeed and Twitter.  When you use these mix-ins, you specify a callback function to call once the HTTPClient class has returned a result from your call.

Because I use psycopg2, a blocking PostgreSQL driver, for my database access, I generally do not use the async functionality.  For me this is not an issue, as in full-featured applications I still see requests complete in as little as 1ms from start to finish.  Of course your own application has as much impact on performance as Tornado itself, if not more.

If you’re just getting started with Tornado, be sure to check out the demo code.  If you’re looking for a little more structure in getting started, check out Tinman, a meta-framework on top of Tornado.

Edit: Changed to reflect a misstatement about the new forking changes coming up for 0.3.

The Attention Deficit Disorder Guide to RabbitMQ

RabbitMQ has been one of my interests of late, as I’ve identified it as part of our technology path at work. There are other very good resources that dive pretty deep in RabbitMQ and how to use it. The goal of this guide is to help you get on your feet quickly and easily. It assumes a couple of things:

  • You already know about message queues and have some experience or knowledge on the subject.
  • You know what AMQP is.
  • You are already interested in RabbitMQ enough to try it out.

If you’re good on those things, let’s get started…

RabbitMQ is written in Erlang. As such, you should have already downloaded and installed Erlang as a first step.

Download RabbitMQ and install it, which is pretty easy.  I like to set up RabbitMQ in an /opt/rabbitmq directory. To do that, I set some environment variables before compiling (bash assumed):

export TARGET_DIR=/opt/rabbitmq
export SBIN_DIR=/opt/rabbitmq/sbin
export MAN_DIR=/opt/rabbitmq/man

Then I compile and install with “make install.” Because I like to run as my own user or a service user, I’ll chown -R myuser /opt/rabbitmq as appropriate.

There are a few other things we need to do, including making the log directory and the directory RabbitMQ will use to store its data:

mkdir /var/log/rabbitmq
chown myuser /var/log/rabbitmq
mkdir /var/lib/rabbitmq
chown myuser /var/lib/rabbitmq

Now as “myuser” we can “cd /opt/rabbitmq/sbin” and run “./rabbitmq-server” and what you should see is:

RabbitMQ 1.6.0 (AMQP 8-0)
Copyright (C) 2007-2009 LShift Ltd., Cohesive Financial Technologies LLC., and Rabbit Technologies Ltd.
Licensed under the MPL. See http://www.rabbitmq.com/

node  : rabbit@binti
log  : /var/log/rabbitmq/rabbit.log
sasl log  : /var/log/rabbitmq/rabbit-sasl.log
database dir: /var/lib/rabbitmq/mnesia/rabbit

starting database …done
starting core processes …done
starting recovery …done
starting persister …done
starting guid generator …done
starting builtin applications …done
starting TCP listeners …done

If you have the hang of starting RabbitMQ and now want to run it in the background, instead do: “./rabbitmq-server -detached”.

Once we’ve gotten this far, we’ve got our broker up and running and now we’ll need some way to talk to it. For the purposes of this article, I’m going to talk about amqplib and Python. There are AMQP libraries for just about every relevant language at this point. RabbitMQ 1.6.0 implements the AMQP 0.8 standard. The easiest way to install amqplib is a simple “easy_install amqplib”.

But before we dive into code, there are a few key concepts we need to talk about:

Queues: You should get these already; one puts a message in a queue and a consumer app receives it somewhere else.

Exchanges: These are a little more tricky than queues. I like to think of them as namespaces.  One of the keen things about RabbitMQ exchanges is that different exchanges will get a different Erlang process, which should help make better use of your available hardware resources. There are three types of exchanges that we need to talk about:

Direct: a direct exchange means when you put a message in, it goes to one consumer and he’s all that will get that message routed through the exchange.

Fanout: a fanout exchange sends your message to every consumer that is listening to a particular exchange / queue combination.

Topic: this type of exchange allows you to do neat things like listen to the same queue across exchanges on one consumer, multiple queues in one namespace in a consumer and other wildcard type trickery.

Bindings: In RabbitMQ you bind your exchanges and queues together in unique combinations which determine how messages are routed to what consumers.
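
The wildcard trickery that topic exchanges allow can be sketched in plain Python. This is only an illustration of the AMQP matching rules (* stands for exactly one dot-separated word, # for zero or more words), not how RabbitMQ implements them, and topic_matches is a made-up helper name.

```python
def topic_matches(pattern, routing_key):
    """True if a topic binding pattern would match the routing key."""
    return _match(pattern.split('.'), routing_key.split('.'))


def _match(pattern_words, key_words):
    # Pattern exhausted: match only if the routing key is exhausted too
    if not pattern_words:
        return not key_words
    head, rest = pattern_words[0], pattern_words[1:]
    if head == '#':
        # '#' may absorb zero or more words, so try every split point
        return any(_match(rest, key_words[i:])
                   for i in range(len(key_words) + 1))
    if not key_words:
        return False
    if head == '*' or head == key_words[0]:
        # '*' matches exactly one word; a literal must match exactly
        return _match(rest, key_words[1:])
    return False
```

With this, topic_matches('test.*', 'test.messages') is true while topic_matches('test.*', 'test.a.b') is not, and 'test.#' matches both of those routing keys as well as plain 'test'.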

Memory: As of RabbitMQ 1.6.0 all messages are kept in memory. If you have nothing consuming your messages and you send too many of them, you’ll run out of memory.

Monitoring: The main install has the app rabbitmqctl which you can use to inspect the various parts of RabbitMQ. This isn’t very good for remote monitoring or visualization. For that there’s a great project called Alice which is also Erlang based.

Speed: There are two ways to get messages from RabbitMQ: basic_get and basic_consume.

basic_get is where your app, on a message by message basis, asks RabbitMQ for a message. This is the slower of the two methods and will not allow single consumer applications to scale to a very high transaction rate.  Note that RabbitMQ will not register these connections as a consumer and you will not see them in list_queues or in Alice as such.

basic_consume is where your app registers itself with RabbitMQ as a consumer and RabbitMQ will send messages to you as fast as you’re able to consume them.

Durability: If you want to have the definitions of your queues and exchanges hang around if you have to restart RabbitMQ you need to define them as durable.

Auto-Delete: If you want your queues and exchanges to exist even when there are no consumers waiting for messages on them, you need to turn auto-delete off.

Persistence: If you do not tell RabbitMQ that you want it to hang on to your messages if it reboots, it will not do so. You must set the delivery mode of a message to “2” to tell it to persist it until it is consumed.

Auto-Ack: You can tell RabbitMQ to automatically acknowledge receipt of a message, or you can do it yourself. This is a boolean setting that you use when you’re consuming messages via basic_get or basic_consume.

Queue and Exchange definitions: By default, queues and exchanges do not exist until you connect a consumer to them. You can cheat and do this in your code that enqueues your messages.

Now that we have that out of the way, here’s some sample Consumer code:

#!/usr/bin/env python
""" Sample Consumer Code """

import amqplib.client_0_8 as amqp


# This is the function that basic_consume will send messages to
def process_message( message ):
    """ Callback function used by channel.basic_consume """
    print('Received: %s' % message.body)

# Rabbit Server to connect to
host = '127.0.0.1'
port = 5672

# Exchange and queue information
exchange_name = 'test'
exchange_type = 'direct'
queue_name = 'messages'
routing_key = 'test.messages'

# Let's set this up by default, we'll use it later
process_messages = True

# Connect to Rabbit
connection = amqp.Connection( host = '%s:%s' % ( host, port ),
                              userid = 'guest',
                              password = 'guest',
                              ssl = False,
                              virtual_host = '/' )

# Create a channel to talk to Rabbit on
channel = connection.channel()

# Create our exchange
channel.exchange_declare( exchange = exchange_name, 
                          type = exchange_type, 
                          durable = True,
                          auto_delete = False )
                                       
# Create our Queue
channel.queue_declare( queue = queue_name , 
                       durable = True,
                       exclusive = False, 
                       auto_delete = True )
            
# Bind to the Queue / Exchange
channel.queue_bind( queue = queue_name, 
                    exchange = exchange_name,
                    routing_key = routing_key )

# Let AMQP know to send us messages
consumer_tag = channel.basic_consume( queue = queue_name, 
                                      no_ack = True,
                                      callback = process_message )

# This might go somewhere like a signal handler. It is defined before the
# main loop so that it exists while the loop is running.
def cancel_processing():
    """ Stop consuming messages from RabbitMQ """
    global channel, consumer_tag, process_messages

    # Do this so we exit our main loop
    process_messages = False

    # Tell the channel you don't want to consume anymore
    channel.basic_cancel( consumer_tag )

# Loop while process_messages is True
while process_messages:

    # Wait for a message
    channel.wait()

# Close the channel
channel.close()

# Close our connection
connection.close()
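
Since cancel_processing is described as something that might live in a signal handler, here is a minimal sketch of that wiring using only the standard library. The Consumer class is a hypothetical stand-in for the script above; in the real thing, stop() would also call channel.basic_cancel( consumer_tag ).

```python
import signal


class Consumer(object):
    """Hypothetical stand-in for the consumer script's main loop state."""

    def __init__(self):
        self.running = True

    def stop(self, signum=None, frame=None):
        # Signal handlers are called with (signum, frame); just flip the
        # flag here so the main loop can exit on its next pass
        self.running = False

    def install_signal_handlers(self):
        # Must be called from the main thread of the process
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self.stop)
```

The main loop would then read "while consumer.running:" instead of checking a module-level flag.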

Note that a lot of what is in that example is comments and whitespace for ease of reading; the actual implementation is pretty darn simple.

Now that we have a consumer going let’s send some messages in:

#!/usr/bin/env python
import amqplib.client_0_8 as amqp

# Connect
connection = amqp.Connection( host = "localhost:5672", 
                              userid = "guest", 
                              password = "guest", 
                              virtual_host = "/", 
                              insist = False )

# Create our channel
channel = connection.channel()

# We've already declared our queue, exchange and binding in our consumer,
# so just send the messages
for i in range(0, 10):
    message = amqp.Message("Test message %i!" % i)
    # Delivery mode 2 asks RabbitMQ to persist the message
    message.properties["delivery_mode"] = 2
    channel.basic_publish( message,
                           exchange = "test",
                           routing_key = "test.messages")

# Clean up once everything has been sent
channel.close()
connection.close()

That’s it! If we did this right, you’ve now set up RabbitMQ, sent some messages and consumed them on the other end of the pipe.

If I’ve kept you this long and you’re still interested, but still have questions, I highly recommend this article which goes much more in depth and has been a valuable guide for me.

If you’re into both python and RabbitMQ, you might want to check out my consumer framework “rejected.py,” it’s on GitHub.

I hope you enjoyed the first of my A.D.D. Guides. I’d be happy to answer any questions and would appreciate feedback so I may improve this and future articles to come.

gdata Python Client Mappings

I had a little bit of an issue today finding out that Google’s gdata Python library changes attribute names for the UserEntity class to match the PEP8 naming conventions.  Here’s a quick rundown:

  • userName becomes user_name
  • changePasswordAtNextLogin becomes change_password
  • ipWhitelisted becomes ip_whitelisted
  • agreedToTerms becomes agreed_to_terms
  • hashFunctionName becomes hash_function_name

Notice that changePasswordAtNextLogin is shortened to just change_password.  Frankly I’m surprised by the lack of consistency on the part of the gdata Python authors, and by the lack of documentation of this in the Google Apps Provisioning API Developer’s Guide: Python.
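
To show just how irregular that one name is, here is a short sketch that compares the mapping above against a mechanical camelCase to snake_case conversion. The helper names to_snake_case and irregular_names are made up for this example and are not part of gdata.

```python
import re

# Mapping listed above: API attribute name -> gdata Python attribute name
GDATA_ATTRIBUTES = {
    'userName': 'user_name',
    'changePasswordAtNextLogin': 'change_password',
    'ipWhitelisted': 'ip_whitelisted',
    'agreedToTerms': 'agreed_to_terms',
    'hashFunctionName': 'hash_function_name',
}


def to_snake_case(name):
    """Mechanically convert camelCase to snake_case."""
    return re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', name).lower()


def irregular_names(mapping):
    """API names whose Python name is not the mechanical conversion."""
    return sorted(api for api, py in mapping.items()
                  if to_snake_case(api) != py)
```

Running irregular_names(GDATA_ATTRIBUTES) singles out changePasswordAtNextLogin as the only entry that does not follow the pattern.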