More formal working

Posted on 20th October 2017

I am a big fan of Jupyter notebooks and similar (e.g. R Markdown) systems which allow you to mix code and documentation, preferably in a browser (which allows sharing).

However, I've found that it's quite easy to fall into a "hacking" work pattern of developing quite a lot of code, and mixing it up with substantial data processing. This leads to a number of anti-patterns:

The code begins to completely dominate, vs the documentation, or overview, big picture view.
I fall into the habit of restarting the notebook, wasting time on reloading data, and then making small changes to an analysis.
Constant minor editing and then "shift-return"ing through a load of cells.

This leads to wasting time; to getting lost (have I tried this minor variation before?) and general frustration.

A better working pattern seems to be the following:

Prototype quickly in a notebook

But before long, start to formally develop code in a formal package. A good directory layout is:

|-- my_package
    |-- __init__.py
    |-- load.py
    |-- analysis.py
|-- tests
    |-- __init__.py
    |-- load_test.py
    |-- analysis_tets.py
|-- notebooks
    |-- Clean Data.ipynb

A good trick to import data without worrying about setup.py and installing is to start each notebook with
```
import os, sys
sys.path.insert(0, os.abspath("..))
```

With the above directory layout, this adds the base directory to the python search path, so that

    import my_package

will work; and will load the working version (and not a version which you might have installed).

Then I move code out of the notebooks into the package
I write tests as I go along. I tend not to go full TDD, but with Python especially, having some basic tests which load the package and run most of the code is a great way to catch silly errors which an IDE will struggle to find (e.g. namespace related issues).

I call this working a bit more "formally". As with many processes in software development, it slows you down initially, but in the long-run you win.

By writing formal functions and classes, it forces me to confront design issues, algorithm choice,and scientific/research questions properly. It's all too easy to hack away in a notebook, think "this is probably okay, but I should think more closely about it later", and then never revisit the decision.
The code naturally ends up being documented and well-factored.
With the code packaged away, the notebooks become much cleaner, allowing you to concentrate on presentation.
Comparing parameters and different algorithms becomes a lot easier.
From a Reproducible research perspective, this is a big win.

I'm currently using this process with the notebook here: Comparison methods in the day job.

More formal working

Categories

Recent posts