PyPI and use of reStructuredText

I’m in the process of putting together my first proper Python package to be uploaded to PyPI (and the legacy PyPI site). Much of the documentation around doing this is not great, but the official docs are pretty good.

One thing which was unclear to me was how to specify the text which gets displayed on PyPI. After some playing, it seems that:

  1. This should be set in the long_description argument of setup() (or in setup.cfg).
  2. It needs to be reStructuredText, not, for example, Markdown.
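For instance, a minimal sketch of wiring this into setup() (the package name and version are placeholders):

    from setuptools import setup

    # Read the long description from an existing reStructuredText file
    with open("README.rst", encoding="utf-8") as f:
        long_description = f.read()

    setup(
        name="mypackage",      # placeholder
        version="0.1.0",       # placeholder
        long_description=long_description,
    )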

Some searching found a solution for converting my Markdown readme:

  • Download Pandoc
  • Install pyandoc (pip install pyandoc), which provides the pandoc module used below
  • (Or use Conda for both steps in one)
  • Then you can dynamically generate an .rst file when setup.py is invoked:

      try:
          import pandoc  # provided by the pyandoc package
          doc = pandoc.Document()
          with open('readme.md', encoding='utf-8') as f:
              doc.markdown = f.read().encode('utf-8')
          with open('README.rst', 'wb') as f:
              f.write(doc.rst)
      except Exception:
          # Conversion is best-effort: fall back to the existing README.rst
          print("NOT REFRESHING README.rst")

      with open('README.rst', encoding='utf-8') as f:
          long_description = f.read()
    
  • Enclosing everything in try/except means I haven’t broken setup.py for users without pandoc installed
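An alternative sketch using pypandoc instead, whose convert_file call performs the same Markdown-to-reStructuredText conversion (the fallback to the raw Markdown here is my own choice):

    try:
        import pypandoc
        long_description = pypandoc.convert_file('readme.md', 'rst')
    except (ImportError, OSError):
        # pypandoc, or the pandoc binary itself, is missing
        with open('readme.md', encoding='utf-8') as f:
            long_description = f.read()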

Here’s the project on GitHub: TileMapBase

On memory management

I have only ever been a hobbyist C++ programmer, while I have been paid to write Java and Python. But a common complaint I’ve read about C++ is that you have to manage memory manually, and worry about it. Now, I’d slightly dispute this with C++11, but perhaps I don’t really have enough experience to comment.

However, I think there’s a strong case that with garbage-collected languages you can’t really forget about memory, or about the difference between a copy and a reference; rather, the language allows you to pretend that you can stop worrying. In my experience this is only true 99% of the time, and the 1% of the time it bites you, you’ve quite forgotten that it’s a possibility, which makes debugging a real pain (the classic “unknown unknown”).

A stupid example which wasted some of my time today is:

import numpy as np
...
# `coords` is a 2xN array; sort both coordinate rows by time.
# Fancy-indexed assignment writes into the existing array in place.
indexes = np.argsort(times)
coords[0] = coords[0][indexes]
coords[1] = coords[1][indexes]

With hindsight this obviously mutates the data underpinning coords, and hence mutates anything which is an alias of coords. Cue two tests failing: the first was silently mutating the data the second test tried to use. But this was really hard to spot. Both tests failed, so I spent a while looking at the base class, because that’s the only common code involved. Unit testing doesn’t really help, as I’d never think to test that I’m not accidentally mutating shared data (because I’d never be that stupid, right…)

What I meant to do was:

coords = coords[:, indexes]

This generates a new array instance and assigns the reference to coords. But this is quite subtle. To even express it, I have to use language which I learnt from C/C++. I only finally noticed when I wrote some test code in a notebook, and saw that there was some period-2 behaviour going on (the output alternated between two states). “Oh, I must be mutating something… Oh, right…”
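A minimal demonstration of the difference (the array values are made up for illustration):

    import numpy as np

    times = np.array([3.0, 1.0, 2.0])
    coords = np.array([[10.0, 20.0, 30.0],
                       [1.0, 2.0, 3.0]])
    alias = coords  # a second reference to the same array

    indexes = np.argsort(times)

    # In-place assignment: writes into the existing array,
    # so `alias` sees the change too.
    coords[0] = coords[0][indexes]
    print(alias[0])           # [20. 30. 10.] -- mutated via the alias

    # Rebinding: fancy indexing builds a NEW array, and the name
    # `coords` now points at it; `alias` still holds the old array.
    coords = coords[:, indexes]
    print(coords is alias)    # False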

The problem with Python, and Java, is that you get out of the habit of even thinking in this way. I used to write a lot of immutable code in Java, precisely to avoid such problems; that makes a lot of sense in a corporate environment. But with numpy, trying to squeeze performance out of an interpreted language, you sometimes need mutability. Which means you need to think. (And it regularly makes me wish I could just use C++, but that’s another story…)

Learning Python UI programming

Another new task: get going with some GUI programming!

Some references

Hints and tips

Some books

As culled from Leeds University Library:

Pandas, HDF5, and large data sets

I have finally gotten around to playing with the HDF5 driver for pandas (which, I believe, uses PyTables under the hood). I’m only scratching the surface, but it’s easy to do what I want:

  • Create a huge data frame storing an entire data set
  • Efficiently query subsections of the frame

Create the dataframe

We obviously cannot do this in memory. But if we have some way of generating one row at a time, or a small “chunk” of rows at a time, then we can “append” these iteratively to an HDF5 store:

import pandas as pd

store = pd.HDFStore("test.hd5", "w", complevel=9, complib="bzip2", fletcher32=True)
# Generate a data frame as `frame`
store.append("main", frame, data_columns=True)
# Repeat as necessary
store.close()

This creates a new HDF5 file, and then creates a table in it named “main”. We can call store.append() repeatedly to add lots of rows. The data_columns=True is necessary if we wish to query by column (which we do).
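For instance, one way to generate the chunks, assuming the raw data lives in a large CSV file (the filename and chunk size are illustrative):

    import pandas as pd

    store = pd.HDFStore("test.hd5", "w", complevel=9, complib="bzip2", fletcher32=True)
    # Read the source file in manageable chunks, appending each to the table
    for chunk in pd.read_csv("huge.csv", chunksize=100000):
        store.append("main", chunk, data_columns=True)
    store.close()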

Read back the data

We can then iterate over the whole dataframe in “chunks” of rows:

store = pd.HDFStore("test.hd5", "r")
for df in store.select("main", chunksize=1000):
    ...  # do something with `df`, which contains the next 1000 rows

Alternatively, we can use the powerful querying ability. Suppose we have a column named “one” in the large dataframe, and we just want the rows where the value of “one” is less than 100. Then we can use:

store = pd.HDFStore("test.hd5", "r")
df = store.select("main", where="one < 100")

This seems to be wonderfully fast.
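Conditions can also be combined within the where string. A sketch, reusing the open store, and assuming a second queryable column named “two” (hypothetical):

    # `two` is a hypothetical second data column in the table
    df = store.select("main", where="one < 100 & two == 5")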

Downsides

You cannot store arbitrary Python “objects” in a table, so, for example, storing a GeoPandas data frame is impossible (or extremely hard).

Some sources

Parsing XML via SAX in Python

I’ve worked with XML before (in Java), but always with small files using the Document Object Model. Now, faced with multiple gigabytes of Open Street Map derived XML files, from which I need to extract a small amount of data, some other method is required. Step forward the Simple API for XML (SAX). This is an event-driven API: the XML parser calls a “handler” object with information about tags opening and closing, and the character data in between.

In Python, there is support in the standard library for SAX parsing. You need to sub-class (or duck-type, implementing the interface of) xml.sax.handler.ContentHandler. Duck-typing is frustrating here, as you need to implement the whole interface, even if you never expect certain methods to be called.

The methods startDocument and endDocument are called at the start and end of parsing. The startElement method sends the name of an opening tag and its attributes (delivered as something which is essentially, but not quite, a dict from string to string), and endElement tells you of a closing tag. Text is sent to you via characters, which will also notify you of newlines (which you probably want to ignore). There is more, but that’s enough for my application.
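A minimal sketch, collecting the id attribute of OSM-style <node> tags (the tag, attribute, and file names are illustrative):

    import xml.sax

    class NodeHandler(xml.sax.handler.ContentHandler):
        # Collects the `id` attribute of every <node> tag
        def __init__(self):
            super().__init__()
            self.node_ids = []

        def startElement(self, name, attrs):
            # `attrs` behaves much like a dict from string to string
            if name == "node":
                self.node_ids.append(attrs["id"])

    handler = NodeHandler()
    xml.sax.parse("map.osm", handler)
    print(len(handler.node_ids))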

Getting a generator

Somehow, a callback doesn’t feel very “pythonic” (and does feel terribly JavaScript-esque). The pythonic way to push data to a client is surely to use a generator. Naively, to convert a callback to a generator, we’d like to:

  • Make an __iter__ method call the code which requires the callback handler.
  • When control is first returned to the callback, store the data and somehow return control to __iter__ which builds an iterator, and returns control to the client.
  • Each time we call __next__ on the iterator, return control to the data generation function…
  • ???
  • Profit?

Given that we are suspending execution, it should come as no surprise that the way to do this is via threading. Run the data generation code in a separate thread, and let the callback handler write all its data to a blocking queue. On the main thread, we simply implement a generator which pulls data off the queue (waiting if necessary for data, hence allowing control back to the thread) and yields it back to the client.
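A minimal sketch of the idea (my own simplified version, not the cbtogen module itself):

    import queue
    import threading

    _DONE = object()  # sentinel marking the end of the data

    def callback_to_generator(run_with_callback, maxsize=1024):
        # `run_with_callback(cb)` should call `cb(item)` once per item
        q = queue.Queue(maxsize=maxsize)

        def worker():
            try:
                run_with_callback(q.put)  # q.put blocks when the queue is full
            finally:
                q.put(_DONE)

        threading.Thread(target=worker, daemon=True).start()
        while True:
            item = q.get()  # blocks until the worker supplies data
            if item is _DONE:
                return
            yield item

    # Example: wrap a callback-driven producer
    def produce(callback):
        for i in range(5):
            callback(i * i)

    for value in callback_to_generator(produce):
        print(value)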

For fun, I implemented this in the module cbtogen in my project OSMDigest. Sadly, in Python, even with a large queue, there is a significant overhead in running two threads and passing data over a queue.

For the application of converting the SAX callback to a generator, the result is incredibly slow code. This can be significantly improved by moving as much parse logic as possible into the worker thread, so that far fewer objects are sent over the queue. However, there is a better way…

Alternative using ElementTree

The xml.etree.ElementTree module in the Python standard library represents an XML document via elements which have “children” (i.e. nested tags). The standard usage is to parse the whole document, but it is also possible to parse the document tag by tag using the iterparse function. There is a caveat, however: all children of tags are still collected. If your document consists of lots of disjoint sub-sections this is not a problem, but for my application, parsing Open Street Map data, the entire document is contained in an <osm> tag. As such, we’d eventually collect the entire document as children of this main (or root) tag.

The trick is to capture a reference to the root tag, and then periodically (at a suitable point in the iteration loop) call clear on it. This removes references to children, and so allows them to be garbage collected. The obvious downside is that different document structures might require different techniques to “clear” the correct tags.
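A sketch of the pattern (the tag names follow the OSM format; the filename is illustrative):

    import xml.etree.ElementTree as ET

    context = ET.iterparse("map.osm", events=("start", "end"))
    # The first event hands us the root (<osm>) element
    event, root = next(context)

    for event, elem in context:
        if event == "end" and elem.tag == "node":
            # The <node> element is now fully parsed; process it here
            print(elem.get("id"))
            # Drop references to processed children so they can be GC'd
            root.clear()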

For OSM data, however, this is a clear winner, giving by far the quickest way to parse the data.