More formal working

I am a big fan of Jupyter notebooks and similar (e.g. R Markdown) systems which allow you to mix code and documentation, preferably in a browser (which allows sharing).

However, I’ve found that it’s quite easy to fall into a “hacking” work pattern of developing quite a lot of code, and mixing it up with substantial data processing. This leads to a number of anti-patterns:

  • The code begins to dominate, crowding out the documentation and the big-picture overview.
  • I fall into the habit of restarting the notebook, wasting time on reloading data, and then making small changes to an analysis.
  • Constant minor editing and then “shift-return”ing through a load of cells.

This leads to wasted time, to getting lost (have I tried this minor variation before?) and to general frustration.

A better working pattern seems to be the following:

  • Prototype quickly in a notebook
  • But before long, start to formally develop code in a formal package. A good directory layout is:

      |-- my_package
          |-- __init__.py
          |-- load.py
          |-- analysis.py
      |-- tests
          |-- __init__.py
          |-- load_test.py
          |-- analysis_test.py
      |-- notebooks
          |-- Clean Data.ipynb
    
  • A good trick for importing the package without worrying about setup.py and installing is to start each notebook with

      import os, sys
      sys.path.insert(0, os.path.abspath(".."))
    

With the above directory layout, this adds the base directory to the Python search path, so that

    import my_package

will work, and will load the working version (not a version which you might have installed).
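
The priority given to position 0 of the search path can be checked with a small, self-contained sketch. It builds a throwaway directory containing a hypothetical package called demo_pkg (a made-up name standing in for my_package) and imports it after inserting the directory at the front of sys.path:

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway "project" holding a hypothetical package, demo_pkg.
base = tempfile.mkdtemp()
pkg_dir = os.path.join(base, "demo_pkg")
os.mkdir(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("VERSION = 'working copy'\n")

# Entries are searched in order, so position 0 takes priority over
# site-packages for any package found in that directory.
sys.path.insert(0, base)
importlib.invalidate_caches()

import demo_pkg

print(demo_pkg.VERSION)
print(demo_pkg.__file__)  # a path inside the throwaway directory
```

In a notebook, printing my_package.__file__ in the same way is a quick check that the working copy, and not an installed one, has been picked up.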

  • Then I move code out of the notebooks into the package
  • I write tests as I go along. I tend not to go full TDD, but with Python especially, having some basic tests which load the package and run most of the code is a great way to catch silly errors which an IDE will struggle to find (e.g. namespace related issues).
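
As a rough sketch of the kind of basic test meant here: the function normalise below is a hypothetical stand-in for something that would live in my_package/analysis.py, defined inline only so the example runs on its own. In a real tests/analysis_test.py it would be imported from the package instead.

```python
# Hypothetical stand-in for a function from my_package/analysis.py.
def normalise(values):
    """Scale values linearly so that they span the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Plain functions named test_* are collected by pytest.
def test_normalise_spans_unit_interval():
    assert normalise([2, 4, 6]) == [0.0, 0.5, 1.0]


def test_normalise_constant_input():
    assert normalise([5, 5]) == [0.0, 0.0]


def test_normalise_does_not_mutate_input():
    data = [3, 1, 2]
    normalise(data)
    assert data == [3, 1, 2]


if __name__ == "__main__":
    test_normalise_spans_unit_interval()
    test_normalise_constant_input()
    test_normalise_does_not_mutate_input()
    print("all tests passed")
```

Even tests this simple import the module and execute most lines, which is enough to catch misspelt names and missing imports.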

I call this working a bit more “formally”. As with many processes in software development, it slows you down initially, but in the long-run you win.

  • Writing formal functions and classes forces me to confront design issues, algorithm choice, and scientific/research questions properly. It’s all too easy to hack away in a notebook, think “this is probably okay, but I should think more closely about it later”, and then never revisit the decision.
  • The code naturally ends up being documented and well-factored.
  • With the code packaged away, the notebooks become much cleaner, allowing you to concentrate on presentation.
  • Comparing parameters and different algorithms becomes a lot easier.
  • From a reproducible research perspective, this is a big win.

I’m currently using this process with the notebook here: Comparison methods in the day job.

TileMapBase

I’ve published my first Python package on PyPi (see also the new PyPi, which seems to have finally synced).

Get it here: TileMapBase or TileMapBase on new PyPi:

Uses OpenStreetMap tiles, or other tile servers, to produce “basemaps” for use with matplotlib. Uses a SQLite database to cache the tiles, so you can experiment with map production without re-downloading the same tiles. Supports Open Data tiles from the UK Ordnance Survey.
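
The caching idea can be illustrated with a minimal sketch using the standard library sqlite3 module. This is not TileMapBase’s actual implementation: the TileCache class, its table layout and the fake_fetch function are all invented for illustration, and a real fetch function would download the tile from a tile server.

```python
import sqlite3


class TileCache:
    """Tiny SQLite-backed cache keyed on (zoom, x, y). Illustrative only."""

    def __init__(self, path, fetch):
        self._fetch = fetch  # callable (zoom, x, y) -> bytes
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS tiles ("
            "zoom INTEGER, x INTEGER, y INTEGER, data BLOB, "
            "PRIMARY KEY (zoom, x, y))"
        )

    def get(self, zoom, x, y):
        row = self._db.execute(
            "SELECT data FROM tiles WHERE zoom=? AND x=? AND y=?",
            (zoom, x, y),
        ).fetchone()
        if row is not None:
            return bytes(row[0])  # cache hit: no fetch needed
        data = self._fetch(zoom, x, y)  # cache miss: fetch, then store
        self._db.execute(
            "INSERT INTO tiles VALUES (?, ?, ?, ?)", (zoom, x, y, data)
        )
        self._db.commit()
        return data


calls = []  # records every call that reaches the (fake) server


def fake_fetch(zoom, x, y):
    calls.append((zoom, x, y))
    return f"tile {zoom}/{x}/{y}".encode("ascii")


cache = TileCache(":memory:", fake_fetch)
first = cache.get(10, 511, 340)
second = cache.get(10, 511, 340)  # served from the database
print(first == second, len(calls))
```

Passing a file name instead of ":memory:" would make the cache persist between sessions, which is the property that saves re-downloading tiles.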

My original aim was to produce a simple, high-level way to use OpenStreetMap style tiles as a “basemap” with matplotlib in Jupyter Python notebooks. Since then, I’ve also been working on TileWindow, which uses this library to cache tiles and provides a tkinter widget which displays a map, sort of like Google Maps but in Python. Ultimately this is for use in my current job: PredictCode.

PyPi and use of ReStructuredText

I’m in the process of putting together my first proper Python package to be uploaded to PyPi / PyPi Old. Most of the documentation around doing this is not great, but the official docs are pretty good:

One thing which was unclear to me was how to specify the text which gets displayed on PyPi. After some playing, it seems that:

  1. This should be set in the long_description variable of setup() or in setup.cfg
  2. This needs to be ReStructuredText, not (for example) Markdown.

Some searching found a solution:

  • Download Pandoc
  • Install pypandoc: pip install pypandoc
  • (Or use Conda for both steps in one)
  • Then you can dynamically generate a rst file when setup.py is invoked:

      try:
          import pypandoc
          # Convert the Markdown readme to reStructuredText via pandoc
          rst = pypandoc.convert_file("readme.md", "rst")
          with open("README.rst", "w", encoding="utf-8") as f:
              f.write(rst)
      except Exception:
          # pypandoc or pandoc is unavailable: keep the existing README.rst
          print("NOT REFRESHING README.rst")
    
      with open('README.rst', encoding='utf-8') as f:
          long_description = f.read()
    
  • Enclosing in try/except means I haven’t broken setup for users without pypandoc
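
For completeness, here is a hedged sketch of how the text might then reach setup(). The helper read_long_description is a hypothetical name, and the fallback to an empty string when README.rst is missing is an assumption of this sketch and not something the steps above require. The setup() call itself is shown only as a comment, since running it needs a real project.

```python
import os
import tempfile


def read_long_description(path="README.rst"):
    """Return the text of the reST readme, or "" if the file is missing."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        return ""


# In setup.py the result would be passed on, along the lines of:
#   setup(name="my_package", long_description=read_long_description(), ...)

# Small demonstration using a temporary file in place of README.rst.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "README.rst")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("My Package\n==========\n")

found = read_long_description(demo_path)
missing = read_long_description(os.path.join(demo_dir, "absent.rst"))
print(repr(found), repr(missing))
```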

Here’s the project on GitHub: TileMapBase

On memory management

I have only ever been a hobbyist C++ programmer, while I have been paid to write Java and Python. But a common complaint I’ve read about C++ is that you have to manage memory manually, and worry about it. Now, I’d slightly dispute this with C++11, but perhaps I don’t really have enough experience to comment.

However, I think there’s a strong case that with garbage-collected languages you can’t really forget about memory, or about the difference between copying a reference and copying the data; rather, the language allows you to pretend that you can stop worrying. In my experience, this is only true 99% of the time, and the 1% of the time it bites you, you’ve quite forgotten that it’s a possibility, which makes debugging a real pain (the classic “unknown unknown”).

A stupid example which wasted some of my time today is:

    import numpy as np
    ...
    indexes = np.argsort(times)
    coords[0] = coords[0][indexes]
    coords[1] = coords[1][indexes]

With hindsight this obviously mutates the data underpinning coords and hence mutates anything which is an alias of coords. Cue two tests failing, and the first one was silently mutating the data the second test tried to use. But this is really hard to spot: both tests failed, so I spent a while looking at the base class because that’s the only common code involved. Unit testing doesn’t really help, as I’d never think to test that I’m not accidentally mutating some data reference (because I’d never be that stupid, right…)

What I meant to do was:

    coords = coords[:, indexes]

This generates a new array instance and assigns the reference to coords. But this is quite subtle. To even express it, I have to use language which I learnt from C/C++. I only finally noticed when I wrote some test code in a notebook and saw that there was some period 2 behaviour going on. “Oh, I must be mutating something… Oh, right…”
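
The difference between the two versions can be seen in a small sketch. The arrays here are made-up sample data, and the name alias stands in for whatever other reference to the array exists elsewhere (such as the fixture shared by the two tests):

```python
import numpy as np

times = np.array([3.0, 1.0, 2.0])
original = np.array([[30, 10, 20],
                     [31, 11, 21]])
indexes = np.argsort(times)

# In-place version: writes into the rows of the existing array,
# so every alias of that array sees the change.
coords = original.copy()
alias_inplace = coords
coords[0] = coords[0][indexes]
coords[1] = coords[1][indexes]
print(alias_inplace is coords)  # still the same object
print(alias_inplace)            # the alias has been reordered too

# Rebinding version: indexing with an index array builds a new array,
# and the name coords is pointed at it. The old array is left untouched.
coords = original.copy()
alias_rebound = coords
coords = coords[:, indexes]
print(alias_rebound is coords)  # now two different objects
print(alias_rebound)            # the alias keeps the original order
```

A test that keeps a copy of its input and compares it afterwards with np.array_equal would catch this class of bug.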

The problem with Python, and Java, is that you get out of the habit of even thinking in this way. I used to write a lot of immutable code in Java, precisely to avoid such problems. That seems to make massive sense in a corporate environment. But for numpy, and trying to squeeze performance out of an interpreted language, you sometimes need mutability. Which means you need to think. (And regularly makes me wish I could just use C++, but that’s another story…)

Learning Python UI programming

Another new task: get going with some GUI programming!

Some references

Hints and tips

Some books

As culled from Leeds University Library: