Posted on 3rd May 2017
I've worked with XML before (in Java), but always with small files, using the Document Object Model. Now, faced with multi-GB Open Street Map derived XML files from which I need to extract only a small amount of data, some other method is required. Step forward the Simple API for XML (SAX). This is an event-driven API: the XML parser calls a "handler" object with information about tags opening and closing, and the character data in between.
In Python, there is support in the standard library for SAX parsing. You need to subclass (or duck-type, and implement the interface of) `xml.sax.handler.ContentHandler`. Duck-typing turns out to be frustrating, as you need to implement the whole interface, even if you never expect certain methods to be called.
The methods `startDocument` and `endDocument` are called at the start and end of parsing. The `startElement` method sends the name of an opening tag and its attributes (passed as an object which is essentially, but not quite the same as, a `dict` from string to string), and `endElement` tells you of a closing tag. Text is sent to you via `characters`, which is also called for the newlines between tags (which you probably want to ignore). There is more, but that's enough for my application.
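As a minimal sketch (not the OSMDigest code; the tag name and filename here are illustrative), a handler which just counts the node tags in an OSM-style file might look like this:

```python
import xml.sax

class NodeCounter(xml.sax.handler.ContentHandler):
    def startDocument(self):
        self.count = 0

    def startElement(self, name, attrs):
        # `attrs` behaves much like a dict from string to string
        if name == "node":
            self.count += 1

    def endElement(self, name):
        pass

    def characters(self, content):
        # Also called for the whitespace and newlines between tags
        pass

    def endDocument(self):
        print("Found", self.count, "node tags")

xml.sax.parse("map.osm", NodeCounter())
```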
Somehow, a callback doesn't feel very "pythonic" (and does feel terribly JavaScript-esque). The pythonic way to push data to a client is surely to use a generator. Naively, to convert a callback to a generator, we'd like to:

- Have the `__iter__` method call the code which requires the callback handler.
- Somehow suspend execution of that code, and return control to `__iter__`, which builds an iterator, and returns control to the client.
- On each call of `__next__` on the iterator, return control to the data generation function...

Given that we are suspending execution, it should come as no surprise that the way to do this is via threading. Run the data generation code in a separate thread, and let the callback handler write all its data to a blocking queue. On the main thread, we simply implement a generator which pulls data off the queue (waiting for data if necessary, hence handing control back to the worker thread) and yields it back to the client.
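As a rough sketch of the idea (the sentinel object, queue size and event tuples here are illustrative choices, not the exact implementation I use):

```python
import queue
import threading
import xml.sax

_FINISHED = object()  # sentinel placed on the queue when parsing ends

class _QueueingHandler(xml.sax.handler.ContentHandler):
    """Callback handler which pushes each SAX event onto a blocking queue."""
    def __init__(self, q):
        self._queue = q

    def startElement(self, name, attrs):
        self._queue.put(("start", name, dict(attrs)))

    def endElement(self, name):
        self._queue.put(("end", name, None))

    def characters(self, content):
        self._queue.put(("text", None, content))

def events(filename, maxsize=1000):
    """Generator yielding SAX events from `filename`, parsed on a worker thread."""
    q = queue.Queue(maxsize=maxsize)

    def worker():
        xml.sax.parse(filename, _QueueingHandler(q))
        q.put(_FINISHED)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()  # blocks, handing control to the worker thread
        if item is _FINISHED:
            return
        yield item
```

The client then just writes `for event, name, data in events("map.osm"): ...` and never sees a callback.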
For fun, I implement this in the module `cbtogen` in my project OSMDigest. Sadly, in Python, even with a large queue, there is a significant overhead in running two threads and passing data over a queue.
For the application of converting the SAX callback to a generator, the result is incredibly slow code. This can be significantly improved by moving as much parse logic as possible to the worker thread, so that far fewer objects are sent over the queue. However, there is a better way...
The `xml.etree.ElementTree` module in the Python standard library represents an XML document via elements which have "children" (i.e. nested tags). The standard usage is to parse the whole document, but it is also possible to parse the document tag by tag using the `iterparse` function. There is a caveat, however: all children of tags are still collected. If your document consists of lots of disjoint sub-sections, this is not a problem, but for my application, parsing Open Street Map data, the entire document is contained in an `<osm>` tag. As such, we'd eventually collect the entire document as children of this main (or root) tag. The trick is to capture a reference to the root tag, and then periodically (at a suitable point in the iteration loop) call `clear` on this object. This removes references to children, and so allows them to be garbage collected. The obvious downside here is that different document structures might require different techniques to "clear" the correct tags.
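A sketch of this pattern (the way I pull data out of the node tags here is illustrative, not the actual OSMDigest code):

```python
import xml.etree.ElementTree as ET

def parse_nodes(filename):
    """Yield (id, lat, lon) for each <node> tag, clearing the root as we go."""
    root = None
    for event, element in ET.iterparse(filename, events=("start", "end")):
        if event == "start" and root is None:
            # Capture a reference to the root <osm> tag
            root = element
        elif event == "end" and element.tag == "node":
            yield (element.get("id"), element.get("lat"), element.get("lon"))
            # Drop the children accumulated under the root so they
            # can be garbage collected
            root.clear()

for node_id, lat, lon in parse_nodes("map.osm"):
    pass  # process each node here
```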
For OSM data, however, this is a clear winner, being by far the quickest way to parse the data.