Posted on 20th October 2017
I am a big fan of Jupyter notebooks and similar (e.g. R Markdown) systems which allow you to mix code and documentation, preferably in a browser (which allows sharing).
However, I've found that it's quite easy to fall into a "hacking" work pattern of developing quite a lot of code, and mixing it up with substantial data processing. This leads to a number of anti-patterns:
This wastes time, makes it easy to get lost (have I tried this minor variation before?), and causes general frustration.
A better working pattern seems to be the following:
But before long, start to develop the code properly in a package. A good directory layout is:
|-- my_package
|   |-- __init__.py
|   |-- load.py
|   |-- analysis.py
|-- tests
|   |-- __init__.py
|   |-- load_test.py
|   |-- analysis_test.py
|-- notebooks
|   |-- Clean Data.ipynb
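As a rough sketch of what moving code out of a notebook might look like, here is a small function that could live in load.py together with a test that could live in load_test.py. The function name parse_rows and the CSV data are made up for illustration. Both parts are shown in one block so that it runs on its own.

```python
import csv
import io


# Would live in my_package/load.py (parse_rows is a hypothetical name)
def parse_rows(text):
    """Parse CSV text into a list of dicts, stripping whitespace from values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: value.strip() for key, value in row.items()} for row in reader]


# Would live in tests/load_test.py
def test_parse_rows():
    rows = parse_rows("name,count\nalpha, 3\nbeta,5 \n")
    assert rows == [
        {"name": "alpha", "count": "3"},
        {"name": "beta", "count": "5"},
    ]


# Call the test directly so this sketch checks itself when run as a script
test_parse_rows()
```

The notebook then only needs to call the function, and the fiddly parsing logic is tested once instead of being re-run by hand in a cell.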
A good trick for importing the package without worrying about setup.py
and installing it is to start each notebook with
import os, sys
sys.path.insert(0, os.path.abspath(".."))
With the above directory layout, this adds the base directory to the front of the Python search path, so that
import my_package
will work, and will load the working version (and not a version which you might have installed).
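To check which copy actually gets imported, you can look at the module's `__file__` attribute. The following is a minimal sketch that rebuilds the layout in a temporary directory and confirms that the working copy is the one loaded. The VERSION attribute and the temporary directory are assumptions made purely for the demonstration.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway copy of the layout: <base>/my_package and <base>/notebooks
base = os.path.realpath(tempfile.mkdtemp())
os.makedirs(os.path.join(base, "my_package"))
os.makedirs(os.path.join(base, "notebooks"))
with open(os.path.join(base, "my_package", "__init__.py"), "w") as f:
    f.write("VERSION = 'working copy'\n")  # hypothetical marker attribute

# Pretend the notebook is running from the notebooks directory
os.chdir(os.path.join(base, "notebooks"))
sys.path.insert(0, os.path.abspath(".."))

# Forget any copy imported earlier in this session, then import afresh
sys.modules.pop("my_package", None)
importlib.invalidate_caches()
import my_package

# The directory the package was loaded from should be <base>/my_package
loaded_from = os.path.dirname(os.path.realpath(my_package.__file__))
print(loaded_from)
```

In a real notebook, only the last few lines matter: after the sys.path line, printing `my_package.__file__` should show a path inside your working directory and not inside site-packages.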
I call this working a bit more "formally". As with many processes in software development, it slows you down initially, but in the long run you win.
I'm currently using this process with the notebook here: Comparison methods in the day job.