On memory management

I have only ever been a hobbyist C++ programmer, while I have been paid to write Java and Python. A common complaint I’ve read about C++ is that you have to manage memory manually, and worry about it. I’d slightly dispute this for modern C++ (C++11 onwards), but perhaps I don’t really have enough experience to comment.

However, I think there’s a strong case that with garbage-collected languages you can’t really forget about memory, or the difference between copying a reference and copying the underlying data; the language merely allows you to pretend that you can cease to worry. In my experience, this is only true 99% of the time, and the 1% of the time it bites you, you’ve quite forgotten that it’s a possibility, which makes debugging a real pain (the classic “unknown unknown”).

A stupid example which wasted some of my time today is:

import numpy as np

# `coords` is a two-row array of coordinates; `times` is a parallel 1D array
indexes = np.argsort(times)
coords[0] = coords[0][indexes]  # in-place assignment into row 0
coords[1] = coords[1][indexes]

With hindsight this obviously mutates the data underpinning coords, and hence mutates anything which is an alias of coords. Cue two tests failing, where the first one was silently mutating the data the second test tried to use. But this is really hard to spot: both tests failed, so I spent a while looking at the base class, because that’s the only common code involved. Unit testing doesn’t really help, as I’d never think to test that I’m not accidentally mutating some data reference (because I’d never be that stupid, right…)

What I meant to do was:

coords = coords[:,indexes]

This generates a new array instance and assigns the reference to coords. But this is quite subtle. To even express it, I have to use language which I learnt from C/C++. I only finally noticed when I wrote some test code in a notebook, and noticed that there was some period 2 behaviour going on. “Oh, I must be mutating something… Oh, right…”
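To make the difference concrete, here is a minimal sketch (the array contents are invented for illustration) contrasting the mutating assignment with the rebinding one:

```python
import numpy as np

coords = np.array([[3.0, 1.0, 2.0],
                   [30.0, 10.0, 20.0]])
alias = coords                    # a second reference to the SAME array
indexes = np.argsort(coords[0])   # [1, 2, 0]

coords[0] = coords[0][indexes]    # in-place: writes through to `alias` too
print(alias[0])                   # [1. 2. 3.] -- the alias was mutated

coords = coords[:, indexes]       # fancy indexing builds a NEW array
print(coords is alias)            # False -- `alias` keeps the old buffer
```

The key point is that fancy indexing on the right-hand side always produces a copy; the bug is purely in whether the result is assigned into an existing array’s storage or bound to a fresh name.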

The problem with Python, and Java, is that you get out of the habit of even thinking in this way. I used to write a lot of immutable code in Java, precisely to avoid such problems. That seems to make massive sense in a corporate environment. But for numpy, and trying to squeeze performance out of an interpreted language, you sometimes need mutability. Which means you need to think. (And regularly makes me wish I could just use C++, but that’s another story…)

Learning Python UI programming

Another new task: get going with some GUI programming!

Some references

Hints and tips

Some books

As culled from Leeds University Library:

Pandas, HDF5, and large data sets

I have finally gotten around to playing with the HDF5 backend for pandas (which uses, I believe, PyTables under the hood). I’m only scratching the surface, but it’s easy to do what I want:

  • Create a huge data frame storing an entire data set
  • Efficiently query subsections of the frame

Create the dataframe

We obviously cannot do this in memory. But if we have some way of generating one row at a time, or a small “chunk” of rows at a time, then we can “append” these iteratively to an HDF5 store:

import pandas as pd

store = pd.HDFStore("test.hd5", "w", complevel=9, complib="bzip2", fletcher32=True)
# Generate a data frame as `frame`
store.append("main", frame, data_columns=True)
# Repeat as necessary, then:
store.close()

This creates a new HDF5 file, and then creates a table in it named “main”. We can call store.append() repeatedly to add lots of rows. Passing data_columns=True is necessary if we wish to query by column (which we do).

Read back the data

We can then iterate over the whole dataframe in “chunks” of rows:

store = pd.HDFStore("test.hd5", "r")
for df in store.select("main", chunksize=1000):
    pass  # Do something with `df`, which contains the next 1000 rows

Alternatively, we can use the powerful querying ability. Suppose we have a column named “one” in the large dataframe, and we just want the rows where the value of “one” is less than 100. Then we can use:

store = pd.HDFStore("test.hd5", "r")
df = store.select("main", where="one < 100")

This seems to be wonderfully fast.


One caveat: you cannot store arbitrary Python “objects” in a table, so storing, for example, a GeoPandas data frame is impossible (or extremely hard).

Some sources

Parsing XML via SAX in Python

I’ve worked with XML before (in Java), but always small files, using the Document Object Model. Now, faced with multiple GB of Open Street Map derived XML files, from which I need only a small amount of data, some other method is required. Step forward the Simple API for XML (SAX). This is an event-driven API: the XML parser calls a “handler” object with information about tags opening and closing, and the character data in between.

In Python, there is support in the standard library for SAX parsing. You need to sub-class (or duck-type, and implement the interface of) xml.sax.handler.ContentHandler. Duck-typing turns out to be frustrating, as you need to implement the whole interface yourself, even for methods you never expect to be called; sub-classing ContentHandler gives you no-op defaults for free.

The methods startDocument and endDocument are called at the start and end of parsing. The startElement method sends details of the name of an opening tag and its attributes (sent as something essentially, but not quite, like a dict from string to string), and endElement tells you of a closing tag. Text is sent to you via characters, which will also notify you of new lines (which probably want ignoring). There is more, but that’s enough for my application.
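As a minimal sketch of these callbacks in action (the tiny XML snippet and the TagCounter name are my own invention, not from OSMDigest), a handler that counts opening tags and collects non-whitespace text might look like:

```python
import xml.sax

class TagCounter(xml.sax.handler.ContentHandler):
    """Counts each opening tag and collects non-whitespace character data."""
    def __init__(self):
        super().__init__()
        self.counts = {}
        self.text = []

    def startElement(self, name, attrs):
        # `attrs` behaves much like a dict from string to string
        self.counts[name] = self.counts.get(name, 0) + 1

    def characters(self, content):
        # Ignore the whitespace-only chunks (e.g. new lines between tags)
        if content.strip():
            self.text.append(content.strip())

handler = TagCounter()
xml.sax.parseString(
    b"<osm><node lat='1' lon='2'/><node lat='3' lon='4'/></osm>", handler
)
print(handler.counts)  # {'osm': 1, 'node': 2}
```

Note that sub-classing ContentHandler means any method we don’t override (endElement, startDocument, …) is silently a no-op.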

Getting a generator

Somehow, a callback doesn’t feel very “pythonic” (and does feel terribly JavaScript-esque). The pythonic way to push data to a client is surely to use a generator. Naively, to convert a callback to a generator, we’d like to:

  • Make an __iter__ method call the code which requires the callback handler.
  • When control is first returned to the callback, store the data and somehow return control to __iter__ which builds an iterator, and returns control to the client.
  • Each time we call __next__ on the iterator, return control to the data generation function…
  • ???
  • Profit?

Given that we are suspending execution, it should come as no surprise that the way to do this is via threading. Run the data generation code in a separate thread, and let the callback handler write all its data to a blocking queue. On the main thread, we simply implement a generator which pulls data off the queue (waiting if necessary for data, hence allowing control back to the thread) and yields it back to the client.
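Here is a minimal sketch of that idea (this is not the actual cbtogen code; the names callback_to_generator and produce are invented for illustration):

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the data stream

def callback_to_generator(run_with_callback, maxsize=1024):
    """Run `run_with_callback(emit)` on a worker thread; yield what it emits.

    `run_with_callback` is any function that repeatedly calls its single
    argument (the "callback") with items of data.
    """
    q = queue.Queue(maxsize=maxsize)  # blocking queue between the threads

    def worker():
        try:
            run_with_callback(q.put)  # the callback just enqueues each item
        finally:
            q.put(_SENTINEL)          # always signal completion

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()                # blocks, handing control to the worker
        if item is _SENTINEL:
            return
        yield item

# Example: a push-style producer, consumed as an ordinary iterable
def produce(emit):
    for i in range(3):
        emit(i * i)

print(list(callback_to_generator(produce)))  # [0, 1, 4]
```

The bounded queue provides back-pressure: if the consumer is slow, the producing thread blocks on q.put rather than buffering the whole document.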

For fun, I implemented this in the module cbtogen in my project OSMDigest. Sadly, in Python, even with a large queue, there is a significant overhead in running two threads and passing data over a queue.

For the application of converting the SAX callback to a generator, the result is incredibly slow code. This can be significantly improved by moving as much parse logic as possible to the worker thread, so we send far fewer objects over the queue. However, there is a better way…

Alternative using ElementTree

The xml.etree.ElementTree module in the Python standard library represents an XML document via elements which have “children” (i.e. nested tags). The standard usage is to parse the whole document, but it is also possible to parse the document tag by tag using the iterparse function.

There is a caveat, however: all children of tags are still collected. If your document consists of lots of disjoint sub-sections, this is not a problem; but for my application, parsing Open Street Map data, the entire document is contained in an <osm> tag, so we would eventually collect the entire document as children of this main (or root) tag. The trick is to capture a reference to the root tag, and then periodically (at a suitable point in the iteration loop) call clear on this object. This removes references to children, and so allows them to be garbage collected. The obvious downside here is that different document structures might require different techniques to “clear” the correct tags.
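A minimal sketch of the technique (the two-node stand-in document is invented for illustration; a real OSM file would be a multi-GB stream):

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for a huge OSM file: everything inside one <osm> root.
xml_data = io.BytesIO(
    b"<osm><node id='1' lat='0' lon='0'/><node id='2' lat='1' lon='1'/></osm>"
)

ids = []
# Asking for "start" events lets us grab the root before its children arrive.
root = None
for event, elem in ET.iterparse(xml_data, events=("start", "end")):
    if root is None:
        root = elem                  # the very first event is the <osm> start tag
    if event == "end" and elem.tag == "node":
        ids.append(elem.get("id"))   # harvest the completed element
        root.clear()                 # drop finished children so they can be GC'd

print(ids)  # ['1', '2']
```

Each <node> is fully built by the time its “end” event fires, so it is safe to read its attributes there and then discard it via root.clear().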

For OSM data, however, this is a clear winner, giving by far the quickest way to parse the data.

Open Street Map XML data

I want to process large amounts of XML data from Open Street Map (OSM), i.e. data obtained from Geofabrik or Planet.osm. For smaller snapshots, do look at OSMnx.

My pure-Python project to read and process OSM data, currently a work in progress, can be found on GitHub, as “OSMDigest”.

The XML format is documented on the OSM Wiki. There is no formal schema, but the data you can download seems to be of quite a constrained type:

  • Start with an <osm> element giving the “version”, “generator” and “timestamp”.
  • Then a <bounds> element giving the rectangle in latitude/longitude coordinates which encloses the data.
  • Following this, elements of three types. (They seem to appear in the order given here, though I guess this is unimportant.) Each of these elements carries some common attributes: “id”, giving the OSM id (which is unique within each type); the optional “user” and “uid”, giving the user who last modified the object; the “timestamp” of last modification; the edit “version” (which increases on each edit); and the “changeset” number. There is also a “visible” attribute, but in the downloaded data which I’ve seen, this is always either missing or “true”.
  • <node> specifies a point on the planet, and has attributes “lon”, “lat” for coordinates. May contain 0 or more <tag> sub-elements.
  • <way> specifies a path. Contains, in order, <nd> sub-elements referencing nodes, and 0 or more <tag>s.
  • <relation> specifies some logical relationship between other objects (e.g. the route of a bus, the area enclosing woodland, traffic instructions such as “no left turn here”). Contains <member> sub-elements referencing the other objects which make up the relationship, and 0 or more <tag>s.
  • Then we have three sub-elements which never contain further elements themselves:
  • <tag> which is a key/value pair, stored as attributes “k” and “v”.
  • <nd> which references a node and contains just the attribute “ref”.
  • <member> which contains attributes “ref”, “type” and a (maybe empty, but always present) “role” describing what role the member has in the relationship.

The meaning of ways and relations is defined by the tags present. For more details see:

  • Way article. Things rapidly get complicated. A way which starts and ends at the same node is a “closed” way, and closed ways are often, but not always, treated as Areas. For example, a closed way tagged “highway=footway” is assumed to be a circular pathway, unless we also have the tag “area=yes”, in which case it is a pedestrian plaza. But “landuse=forest” is always an “area”, even without the “area=yes” tag.
  • Relation article and types of relation.
  • Possible keys and values can be found here: Key descriptions by group and Map features.