Parsing XML via SAX in Python

Posted on 3th May 2017


I've worked with XML before (in Java), but always small files using the Document Object Model. Now faced with multi-GB of Open Street Map derived XML files, of which I need to get a small amount of data, some other method is required. Step forward the Simple API for XML (SAX). This is an event-driven API: the XML parser calls a "handler" object with information about tags opening and closing, and the character data in between.

In Python, there is support in the standard library for SAX parsing. You need to sub-class (or duck-type, and implement the interface of) xml.sax.handler.ContentHandler. It seems that duck-typing is frustrating, as you need to implement the whole interface, even if you never expect certain methods to be called.

The methods startDocument and endDocument are called at the start and end of parsing. The startElement method sends details of the name of an opening tag, and it's attributes (sent as essentially, but not quite the same as, a dict from string to string), and endElement tells you of a closing tag. Text is sent to you via characters which will also notify of new lines (which probably want ignoring). There is more, but that's enough for my application.

Getting a generator

Somehow, a callback doesn't feel very "pythonic" (and does feel terribly Javascript-esq). The pythonic way to push data to a client is surely to use a generator. Naively, to convert a callback to a generator, we'd like to:

  • Make an __iter__ method call the code which requires the callback handler.
  • When control is first returned to the callback, store the data and somehow return control to __iter__ which builds an iterator, and returns control to the client.
  • Each time we call __next__ on the iterator, return control to the data generation function...
  • ???
  • Profit?

Given that we are suspending execution, it should come as no surprise that the way to do this is via threading. Run the data generation code in a separate thread, and let the callback handler write all its data to a blocking queue. On the main thread, we simply implement a generator which pulls data off the queue (waiting if necessary for data, hence allowing control back to the thread) and yields it back to the client.

For fun, I implement this in the module cbtogen in my project OSMDigest. Sadly, in Python, event with a large queue, there is a signifcant overhead in running two threads and passing data over a queue.

For the application of converting the SAX callback to a generator, the result is incredibly slow code. This can be significantly improved by moving as much parse logic as possible to the thread, so we send far fewer objects over the queue. However, this is a better way...

Alternative using element-tree

The xml.etree.ElementTree module in the Python standard library represents an XML document via elements which have "children" (i.e. nested tags). The standard usage is to parse the whole document, but it is also possible to parse the document tag by tag using the iterparse function. There is a caveat however: all children of tags are still collected. If your document consists of lots of disjoint sub-sections, this is not a problem, but for my application, parsing Open Street Map data, the entire document is contained in an <osm> tag. As such, we'd eventually collect the entire document as children of this main (or root) tag. The trick is to capture a reference to the root tag, and then periodically (at a suitable point in the iteration loop) call clear on this object. This removes references to children, and so allows them to be garbage collected. The obvious downside here is that different document structures might require different techniques to "clear" the correct tags.

For OSM data, however, this is a clear winner, giving by fast the quickest way to parse the data.


Categories
Recent posts