Learning Python UI programming

Another new task: get going with some GUI programming!

Some references

Hints and tips

Some books

As culled from Leeds University Library:

Pandas, HDF5, and large data sets

I have finally gotten around to playing with the HDF5 driver for pandas (which uses, I believe, PyTables under the hood). I’m only scratching the surface, but it’s easy to do what I want:

  • Create a huge data frame storing an entire data set
  • Efficiently query subsections of the frame

Create the dataframe

We obviously cannot do this in memory. But if we have some way of generating one row at a time, or a small “chunk” of rows at a time, then we can “append” these iteratively to an HDF5 store:

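import pandas as pd

store = pd.HDFStore("test.hd5", "w", complevel=9, complib="bzip2", fletcher32=True)
# Generate a data frame as `frame` (a small stand-in here; use your real data)
frame = pd.DataFrame({"one": range(1000), "two": range(1000)})
store.append("main", frame, data_columns=True)
# Repeat as necessary
store.close()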

This creates a new HDF5 file, and then creates a table in it named “main”. We can call store.append() repeatedly to add lots of rows. The data_columns=True is necessary if we wish to query by column (which we do).

Read back the data

We can then iterate over the whole dataframe in “chunks” of rows:

store = pd.HDFStore("test.hd5", "r")
for df in store.select("main", chunksize=1000):
    # Do something with `df`, which contains (up to) the next 1000 rows
    print(len(df))
store.close()

Alternatively, we can use the powerful querying ability. Suppose we have a column named “one” in the large dataframe, and we just want the rows where the value of “one” is less than 100. Then we can use:

store = pd.HDFStore("test.hd5", "r")
df = store.select("main", where="one < 100")

This seems to be wonderfully fast.
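To tie the write and read steps together, here is a minimal end-to-end sketch. The helper names (make_chunk, build_store, count_rows, query_store) and the second column “two” are made up for illustration, and it assumes PyTables is installed alongside pandas:

```python
import numpy as np
import pandas as pd


def make_chunk(start, size):
    """Hypothetical stand-in for real data: `size` rows, columns "one" and "two"."""
    index = np.arange(start, start + size)
    return pd.DataFrame({"one": index, "two": index * 0.5}, index=index)


def build_store(path, n_chunks=5, chunk_size=1000):
    # The context manager closes the file even if an append fails.
    with pd.HDFStore(path, "w", complevel=9, complib="bzip2") as store:
        for i in range(n_chunks):
            frame = make_chunk(i * chunk_size, chunk_size)
            # data_columns=True makes every column queryable via `where`.
            store.append("main", frame, data_columns=True)


def count_rows(path, chunksize=1000):
    # Iterate over the table in chunks without loading it all at once.
    with pd.HDFStore(path, "r") as store:
        return sum(len(df) for df in store.select("main", chunksize=chunksize))


def query_store(path):
    # Only the rows matching the condition are read into memory.
    with pd.HDFStore(path, "r") as store:
        return store.select("main", where="one < 100")
```

Using the store as a context manager is a design choice rather than a requirement; calling store.close() explicitly, as above, works just as well.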

Downsides

You cannot store “objects” in a table, so e.g. storing a GeoPandas data frame is impossible (or extremely hard).

Some sources

Parsing XML via SAX in Python

I’ve worked with XML before (in Java), but always with small files using the Document Object Model. Now I am faced with multi-gigabyte XML files derived from Open Street Map, from which I need only a small amount of data, so some other method is required. Step forward the Simple API for XML (SAX). This is an event-driven API: the XML parser calls a “handler” object with information about tags opening and closing, and the character data in between.

In Python, there is support in the standard library for SAX parsing. You need to sub-class (or duck-type, and implement the interface of) xml.sax.handler.ContentHandler. It seems that duck-typing is frustrating, as you need to implement the whole interface, even if you never expect certain methods to be called.

The methods startDocument and endDocument are called at the start and end of parsing. The startElement method sends details of the name of an opening tag, and its attributes (sent as essentially, but not quite the same as, a dict from string to string), and endElement tells you of a closing tag. Text is sent to you via characters, which will also notify you of new lines (which probably want ignoring). There is more, but that’s enough for my application.
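As a concrete illustration of the handler interface, here is a small sketch that counts element names and collects node ids. The class name TagCounter and the sample document are invented for this example; only the methods we care about are overridden, because the base class supplies do-nothing versions of the rest:

```python
import xml.sax
import xml.sax.handler
from collections import Counter

# A tiny, made-up document in the style of OSM data.
SAMPLE = b"""<?xml version="1.0"?>
<osm version="0.6">
  <node id="1" lat="53.8" lon="-1.55">
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" lat="53.9" lon="-1.5"/>
</osm>"""


class TagCounter(xml.sax.handler.ContentHandler):
    """Counts how often each element name opens, and records node ids."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.node_ids = []

    def startElement(self, name, attrs):
        self.counts[name] += 1
        if name == "node":
            # `attrs` supports dict-style lookup by attribute name.
            self.node_ids.append(attrs["id"])


def count_tags(data):
    handler = TagCounter()
    xml.sax.parseString(data, handler)
    return handler
```

For a file on disk, xml.sax.parse(filename, handler) takes the place of parseString.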

Getting a generator

Somehow, a callback doesn’t feel very “pythonic” (and does feel terribly JavaScript-esque). The pythonic way to push data to a client is surely to use a generator. Naively, to convert a callback to a generator, we’d like to:

  • Make an __iter__ method call the code which requires the callback handler.
  • When control is first returned to the callback, store the data and somehow return control to __iter__ which builds an iterator, and returns control to the client.
  • Each time we call __next__ on the iterator, return control to the data generation function…
  • ???
  • Profit?

Given that we are suspending execution, it should come as no surprise that the way to do this is via threading. Run the data generation code in a separate thread, and let the callback handler write all its data to a blocking queue. On the main thread, we simply implement a generator which pulls data off the queue (waiting if necessary for data, hence allowing control back to the thread) and yields it back to the client.
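The thread-and-queue idea can be sketched in a few lines. This is not the code from cbtogen, just a minimal illustration; the names callback_to_generator and produce_squares are made up, and the producer is assumed to be any function which takes a single callback argument:

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the data stream


def callback_to_generator(producer, maxsize=1000):
    """Run `producer(callback)` in a thread and yield whatever it sends."""
    q = queue.Queue(maxsize=maxsize)

    def run():
        try:
            # q.put blocks when the queue is full, which pauses the producer.
            producer(q.put)
        finally:
            q.put(_SENTINEL)

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    while True:
        item = q.get()  # blocks until the producer has sent something
        if item is _SENTINEL:
            break
        yield item
    thread.join()


def produce_squares(callback):
    """Example callback-style producer."""
    for i in range(5):
        callback(i * i)
```

This sketch ignores two things a real version needs to handle: an exception raised in the producer is not passed on to the consumer, and if the consumer abandons the generator early the producer thread is left blocked on a full queue.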

For fun, I implemented this in the module cbtogen in my project OSMDigest. Sadly, in Python, even with a large queue, there is a significant overhead in running two threads and passing data over a queue.

For the application of converting the SAX callback to a generator, the result is incredibly slow code. This can be significantly improved by moving as much parse logic as possible to the thread, so we send far fewer objects over the queue. However, there is a better way…

Alternative using element-tree

The xml.etree.ElementTree module in the Python standard library represents an XML document via elements which have “children” (i.e. nested tags). The standard usage is to parse the whole document, but it is also possible to parse the document tag by tag using the iterparse function. There is a caveat however: all children of tags are still collected. If your document consists of lots of disjoint sub-sections, this is not a problem, but for my application, parsing Open Street Map data, the entire document is contained in an <osm> tag. As such, we’d eventually collect the entire document as children of this main (or root) tag. The trick is to capture a reference to the root tag, and then periodically (at a suitable point in the iteration loop) call clear on this object. This removes references to children, and so allows them to be garbage collected. The obvious downside here is that different document structures might require different techniques to “clear” the correct tags.

For OSM data, however, this is a clear winner, giving by far the quickest way to parse the data.
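A minimal sketch of the iterparse approach with the “clear the root” trick follows. The function name iter_nodes and the sample document are invented for illustration, and the choice to clear after every node is an assumption which suits OSM data, where nodes are direct children of the root:

```python
import io
import xml.etree.ElementTree as ET

# A tiny, made-up document in the style of OSM data.
SAMPLE = b"""<osm version="0.6">
  <node id="1" lat="53.8" lon="-1.55">
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" lat="53.9" lon="-1.5"/>
</osm>"""


def iter_nodes(source):
    """Yield (id, lat, lon, tags) for each <node>, keeping memory use flat."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a reference to it.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "node":
            tags = {t.get("k"): t.get("v") for t in elem.iter("tag")}
            yield elem.get("id"), float(elem.get("lat")), float(elem.get("lon")), tags
            # Drop the finished children so they can be garbage collected.
            root.clear()


nodes = list(iter_nodes(io.BytesIO(SAMPLE)))
```

The same function accepts a filename or an open file object in place of the BytesIO wrapper.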

Open Street Map XML data

I want to process large amounts of XML data from Open Street Map (OSM), i.e. the extracts obtained from Geofabrik or Planet.osm. For smaller snapshots, do look at OSMnx.

My pure-Python project to read and process OSM data, currently a work in progress, can be found on GitHub, as “OSMDigest”.

The XML format is documented on the OSM Wiki. There is no formal schema, but the data you can download seems to be of quite a constrained type:

  • Start with an <osm> element giving the “version”, “generator” and “timestamp”.
  • Then a <bounds> element giving the rectangle in latitude/longitude coordinates which encloses the data.
  • Following this, elements of three types. (They seem to appear in the order given here, though I guess this is unimportant.) Each of these elements carries some common attributes: “id”, giving the OSM id (which is unique within each type); the (optional) “user” and “uid”, giving the user who last modified the object; the “timestamp” of the last modification; the edit “version” (which increases on each edit); and the “changeset” number. There is also a “visible” attribute, but in the downloaded data which I’ve seen, this is always either missing or “true”.
      • <node> specifies a point on the planet, and has attributes “lon”, “lat” for coordinates. May contain 0 or more <tag> sub-elements.
      • <way> specifies a path. Contains, in order, <nd> sub-elements referencing nodes, and 0 or more <tag>s.
      • <relation> specifies some logical relationship between other objects (e.g. the route of a bus, the area enclosing woodland, traffic instructions such as “no left turn here”). Contains <member> sub-elements referencing the other objects which make up the relationship, and 0 or more <tag>s.
  • Then we have three sub-elements which never contain further elements themselves:
      • <tag> which is a key/value pair, stored as attributes “k” and “v”.
      • <nd> which references a node and contains just the attribute “ref”.
      • <member> which contains attributes “ref”, “type” and a (maybe empty, but always present) “role” describing what role the member has in the relationship.
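To make the structure described above concrete, here is a sketch which reads a tiny document into simple Python objects. The sample data and the class and function names (Node, Way, Relation, read_tags, parse_osm) are made up for illustration and are not taken from OSMDigest; it parses the whole document at once, so it only suits small files:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# A tiny, invented document following the layout described above.
SAMPLE = """<osm version="0.6" generator="example">
  <bounds minlat="53.7" minlon="-1.6" maxlat="53.9" maxlon="-1.4"/>
  <node id="1" lat="53.80" lon="-1.55" version="1"/>
  <node id="2" lat="53.81" lon="-1.54" version="1">
    <tag k="amenity" v="cafe"/>
  </node>
  <way id="10" version="2">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="footway"/>
  </way>
  <relation id="100" version="1">
    <member type="way" ref="10" role=""/>
    <tag k="type" v="route"/>
  </relation>
</osm>"""


@dataclass
class Node:
    osm_id: int
    lat: float
    lon: float
    tags: dict = field(default_factory=dict)


@dataclass
class Way:
    osm_id: int
    node_refs: list  # node ids, in order along the path
    tags: dict = field(default_factory=dict)


@dataclass
class Relation:
    osm_id: int
    members: list  # (type, ref, role) triples
    tags: dict = field(default_factory=dict)


def read_tags(elem):
    """Collect the <tag k=... v=...> children of an element into a dict."""
    return {t.get("k"): t.get("v") for t in elem.findall("tag")}


def parse_osm(text):
    root = ET.fromstring(text)
    nodes = [Node(int(e.get("id")), float(e.get("lat")), float(e.get("lon")), read_tags(e))
             for e in root.findall("node")]
    ways = [Way(int(e.get("id")), [int(nd.get("ref")) for nd in e.findall("nd")], read_tags(e))
            for e in root.findall("way")]
    relations = [Relation(int(e.get("id")),
                          [(m.get("type"), int(m.get("ref")), m.get("role"))
                           for m in e.findall("member")],
                          read_tags(e))
                 for e in root.findall("relation")]
    return nodes, ways, relations
```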

The meaning of ways and relations is defined by the tags present. For more details see:

  • Way article. Things rapidly get complicated. A way which starts and ends at the same node is a “closed” way, and is often, but not always, treated as an Area. For example, a closed way tagged “highway=footway” is assumed to be a circular pathway, unless we also have the tag “area=yes” in which case it is a pedestrian plaza. But “landuse=forest” is always an “area” even without the “area=yes” tag.
  • Relation article and types of relation.
  • Possible keys and values can be found here: Key descriptions by group and Map features.
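The closed-way rule from the Way bullet above can be written down as a small check. This is a deliberately simplified sketch: the function names are invented, and the set of “always an area” tags holds only the single example mentioned in the text, whereas the real rules on the wiki are far longer:

```python
# Only the one example from the text; the real list is much longer.
AREA_BY_DEFAULT = {("landuse", "forest")}


def is_closed(node_refs):
    """A way is closed if it starts and ends at the same node."""
    return len(node_refs) > 1 and node_refs[0] == node_refs[-1]


def is_area(node_refs, tags):
    """Simplified guess at whether a way should be treated as an area."""
    if not is_closed(node_refs):
        return False
    if tags.get("area") == "yes":
        return True
    return any(tags.get(key) == value for key, value in AREA_BY_DEFAULT)


loop = [1, 2, 3, 1]
circular_path = is_area(loop, {"highway": "footway"})
plaza = is_area(loop, {"highway": "footway", "area": "yes"})
forest = is_area(loop, {"landuse": "forest"})
```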

Open Street Map Data

I’m currently working on using some address information from OpenStreetMap to augment other open data sources. Here are some notes on using data from OpenStreetMap, in Python.

Getting and using OpenStreetMap Data

It seems like this is a bit of a pain. OpenStreetMap (OSM) uses a custom XML-based format which is hard, if not impossible, for standard GIS software to read.

Data sources

  • http://wiki.openstreetmap.org/wiki/Planet.osm Gives links to download the world and various extracts of it.
  • http://wiki.openstreetmap.org/wiki/Overpass_API Details the “Overpass API” which allows targeted querying of data from the OSM database. (The OSMnx package seems great for making direct, small-scale queries.)

Geofabrik

http://download.geofabrik.de/ Offers downloads of regions of the world, either in OSM format, or converted to shapefile format. The latter can be loaded into geopandas, QGIS, etc., but appears, sadly, to be missing useful information. For example, I can locate a building and find its polygon (by its ID) in the shapefile, but there is no metadata attached to give me e.g. the address!

However, for off the shelf analysis using existing tools, this is perhaps where to start.

Format

The standard format is XML, usually distributed compressed; the alternative PBF format packages the same data using Protobuf.

  • http://wiki.openstreetmap.org/wiki/OSM_XML Gives the XML format
  • http://overpass-api.de/output_formats.html Gives details of the JSON format which the Overpass API can return.
  • http://wiki.openstreetmap.org/wiki/Map_Features Gives details of the “features” the returned data can contain.
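For comparison with the XML, the Overpass JSON output puts everything in an “elements” list, with each entry carrying a “type” of node, way or relation. Here is a sketch of reading such a response; the sample response is invented and trimmed to a few fields, and the function name group_elements is made up:

```python
import json
from collections import defaultdict

# A made-up, heavily trimmed response in the Overpass JSON layout.
SAMPLE_RESPONSE = """
{
  "version": 0.6,
  "elements": [
    {"type": "node", "id": 1, "lat": 53.8, "lon": -1.55,
     "tags": {"amenity": "cafe"}},
    {"type": "node", "id": 2, "lat": 53.81, "lon": -1.54},
    {"type": "way", "id": 10, "nodes": [1, 2],
     "tags": {"highway": "footway"}}
  ]
}
"""


def group_elements(text):
    """Group the elements of an Overpass JSON response by their type."""
    data = json.loads(text)
    grouped = defaultdict(list)
    for element in data["elements"]:
        grouped[element["type"]].append(element)
    return dict(grouped)


grouped = group_elements(SAMPLE_RESPONSE)
```

Note that “tags” is simply absent for untagged elements, so element.get("tags", {}) is the safer way to read it.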

Libraries

I should probably be using pyOsmium, but there is no conda build and, being stuck with either a Windows box or a locked-down Linux box, I am scared to try to get it to build.

GDAL / OGR can be easily installed (or come for free with geopandas).

  • http://www.gdal.org/drv_osm.html Gives some brief details
  • A cookbook for GDAL.
  • I didn’t have a great deal of luck finding the details I wanted.

OSMnx

It seems that others have had similar thoughts to me. OSMnx looks to be a great tool for network analysis of road networks. The review paper is also very informative, with some nice background information.

With Anaconda, it’s a one-line install:

conda install -c conda-forge osmnx

This seems very nice for targeted downloads, but it’s a bit too “magic” for my tastes. The source code is very readable however, and I’ve been learning a lot from it.