I have had an extended dalliance with Computer Science, probably making the mistake of confusing aspects of Computer Science with a career in IT. I worked for a time as a Java Developer for a high frequency sports betting company, on the IT side of things (data acquisition, building APIs for quants to interact with, and so forth) rather than the quant side. What Java I learnt I have probably now forgotten.

Data science

I then transferred back to academia, employed as a scientific programmer in the School of Geography, University of Leeds. The project looked at short-timeframe, spatial aspects of Predictive Policing, developing a Python library and GUI application to provide open-source implementations of various published algorithms. Some thoughts from this project:

  • I was keen to take ideas from industry into an academic setting: specifically, using vaguely formal software development workflows. For this project, this meant working with source control daily, instead of just as an "archive" of work performed mainly offline (which seems to be how many academics use GitHub), and using continuous integration tools.
  • I am a convert to Test Driven Development, which I mainly use as a design tool. I write tests first, which helps me to design APIs from a user perspective, and then write code motivated by those tests. In particular, it was a great way to break myself of REPL-based Python prototyping. Of course, having the tests there also makes for great, evolving documentation of the expected use of the API, and running the tests under continuous integration does indeed catch many (silly) mistakes.
  • I organically settled on a workflow that mixed formally developing a Python package with interactively using that package in Jupyter notebooks. This seems to me an excellent way of building medium-sized data science projects. I religiously wrote PyDocs, and found that these self-guides to my code, interactively accessible in a Jupyter notebook, greatly sped up my work.
  • In particular, I essentially stopped work on the GUI side of the project, deciding that Jupyter notebooks, backed by the library I had developed, provided a far more flexible, and far more reproducible (see below), environment.
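
As a sketch of the test-first, docstring-heavy workflow described above (the function and its API here are hypothetical illustrations, not code from the actual project): the test is written first, pinning down how a small helper should behave before it exists.

```python
# Hypothetical illustration of test-first design: the test below was "written
# first", fixing the API of a small grid-counting helper before implementing it.

def test_counts_events_per_cell():
    # Events at (x, y) coordinates; cells are 10-by-10 squares.
    events = [(1, 1), (2, 3), (15, 1)]
    assert grid_counts(events, cell_size=10) == {(0, 0): 2, (1, 0): 1}

def grid_counts(events, cell_size):
    """Count events per square grid cell.

    :param events: iterable of (x, y) coordinates
    :param cell_size: side length of each square cell
    :return: dict mapping (cell_x, cell_y) to the number of events in that cell
    """
    counts = {}
    for x, y in events:
        key = (int(x // cell_size), int(y // cell_size))
        counts[key] = counts.get(key, 0) + 1
    return counts

test_counts_events_per_cell()
```

The docstring then doubles as interactive documentation in a notebook (via `grid_counts?` or `help(grid_counts)`), which is where the two habits reinforce each other.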

Reproducible Data Science

Towards the end of the project, I became very interested (slightly obsessed) with reproducible research, especially in the "computational sciences". I firmly believe in the Manifesto here. Writing a GUI doesn't really fit this manifesto: what the GUI is doing is too opaque, and documenting exactly the steps used and settings chosen is too hard. Working in a Jupyter notebook does fit the bill: the code is there to see, you can add prose and mathematical discussion in situ, and the whole notebook can be re-run by an interested party.

It strikes me (although I am an outsider looking in) that the "harder" sciences (e.g. computational astronomy) are much further along in this than the social science area I was working in. I became somewhat discouraged by reading all too many papers which made strong claims about, e.g., the performance of algorithms, but for which code and data were simply not available (even after asking the authors). In my bleaker moments, I wonder in what sense this is "science", as the claims being made simply cannot be reproduced by other researchers (easily, or perhaps even with a large amount of work), never mind by end users.

The future

I am now working again as a Pure Mathematician, and while I remain very interested in software development, I have little enough time for my mathematics research. The PI of the project I worked on is on extended leave from his position, and so the project is somewhat moribund. However, I remain interested, and would welcome collaboration (taking account of my time commitments).


I wrote three academic papers, though all currently remain unpublished (see comments above regarding lack of time):

  • Self-excited point process models of crime. A mathematical explanation of the algorithms which are believed to underpin the PredPol system, together with some case studies on openly available crime data.
  • Open crime data. Some standardised Python code for working with openly available crime data, together with algorithms to randomly modify the data to make it more realistic.
  • Prediction scoring. An attempt to offer different ways of assessing the quality of short-term spatial predictions of crime, moving beyond the "hit rate".
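
For the first paper, the self-excited (Hawkes) model's conditional intensity can be written in its standard exponential-kernel form; this is the textbook version, not necessarily the exact parameterisation used in the paper:

```latex
\lambda(t) = \mu + \sum_{t_i < t} \theta \, \omega \, e^{-\omega (t - t_i)},
```

where \mu is the background rate of events, \theta the expected number of "offspring" events triggered by each event, and \omega the exponential decay rate of the triggering effect.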

I gave a talk on the "Prediction scoring" work, together with an introduction to Predictive Policing, and some polemics about reproducible research: Talk to Leeds Institute For Data Science.
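
Part of the appeal, and the limitation, of the hit rate is how simple it is. A sketch of the usual definition (my own illustrative code, not the paper's):

```python
# Illustrative definition of the classic "hit rate" for spatial crime
# prediction: the fraction of actual crimes that fall inside the flagged cells.
# This is not code from the paper.

def hit_rate(flagged_cells, crime_cells):
    """Fraction of crimes occurring within the flagged (predicted) cells.

    :param flagged_cells: set of grid cells the prediction flags
    :param crime_cells: list of grid cells where crimes actually occurred
    """
    if not crime_cells:
        return 0.0
    hits = sum(1 for cell in crime_cells if cell in flagged_cells)
    return hits / len(crime_cells)

# Flagging 2 cells captures 3 of 4 crimes, a hit rate of 0.75 -- but the
# measure says nothing about how many cells were flagged, or how close the
# misses were, which is why one might want to move beyond it.
rate = hit_rate({(0, 0), (1, 2)}, [(0, 0), (0, 0), (1, 2), (5, 5)])
```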


Some software, in order of quality.

  • TileMapBase.
    Uses OpenStreetMap tiles, or other tile servers, to produce "basemaps" for use with matplotlib. Uses a SQLite database to cache the tiles, so you can experiment with map production without re-downloading the same tiles. Supports Open Data tiles from the UK Ordnance Survey.

    For bugs, open an issue on GitHub, and I will endeavour to fix it.

    The aim was to produce something like Leaflet, but designed for off-line work: for example, producing high-quality figures for inclusion in documents.

  • PDFImage. A simple, pure Python library for reading the raw contents of PDF files and writing PDF files composed of images, e.g. from scans. Supports PNG images natively, and JBIG2 black-and-white compression via an external program. I use it for programmatically producing high-quality PDF files from hand-written scans, etc.
  • TileWindow. Uses tkinter to display large or infinite images built out of tiles. Originally designed for use with the Python GUI I built (as above) to visualise crime events on top of a basemap, though this functionality was never added to the GUI. Works, but is not complete.
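
The SQLite caching idea behind TileMapBase can be sketched in a few lines using only the standard library. This is a simplified, hypothetical version: the schema and function names here are mine, not the library's actual internals.

```python
import sqlite3

# Simplified sketch of a tile cache in the style of TileMapBase: tiles are
# keyed by (zoom, x, y) and fetched from the network only on a cache miss,
# so repeated map production never re-downloads the same tiles.
# The table layout and names are illustrative, not TileMapBase's own.

def open_cache(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS tiles "
               "(zoom INT, x INT, y INT, data BLOB, PRIMARY KEY (zoom, x, y))")
    return db

def get_tile(db, zoom, x, y, fetch):
    """Return tile bytes, calling `fetch(zoom, x, y)` only on a cache miss."""
    row = db.execute("SELECT data FROM tiles WHERE zoom=? AND x=? AND y=?",
                     (zoom, x, y)).fetchone()
    if row is not None:
        return row[0]
    data = fetch(zoom, x, y)
    db.execute("INSERT INTO tiles VALUES (?, ?, ?, ?)", (zoom, x, y, data))
    return data
```

On the second request for the same tile the bytes come straight from SQLite, so you can iterate on a figure off-line once the tiles have been downloaded.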


Back when I was (temporarily) losing interest in Mathematics and coding more, I spent some time:

  • Solving Project Euler problems. A lot of fun for a Mathematician interested in programming. But a time sink, and I haven't looked at these for some time.
    Project Euler profile
  • Some Google Code Jam solutions on GitHub; I was too slow in 2015, and didn't compete beyond the prelim round in 2017. A lot of fun, but to be competitive I would at least need some practice, and I just don't have the time (and the competition times often clash with "real life").
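
Problem 1 (the sum of the multiples of 3 or 5 below 1000) gives the flavour of why Project Euler suits a mathematician who programs: a one-line brute force, or a closed form via inclusion-exclusion.

```python
# Project Euler, Problem 1: sum the natural numbers below 1000 that are
# multiples of 3 or 5.

def euler1_brute(limit=1000):
    # Direct enumeration: fine at this scale.
    return sum(n for n in range(limit) if n % 3 == 0 or n % 5 == 0)

def euler1_closed_form(limit=1000):
    # Inclusion-exclusion with the arithmetic series formula:
    # sum of multiples of k below limit is k * m * (m + 1) / 2, m = (limit-1)//k.
    def multiples_sum(k):
        m = (limit - 1) // k
        return k * m * (m + 1) // 2
    return multiples_sum(3) + multiples_sum(5) - multiples_sum(15)

assert euler1_brute() == euler1_closed_form() == 233168
```

The pleasure is in agreeing the O(n) and O(1) answers; the time sink is that every problem invites the same treatment.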