I’ve written some stuff for the University of Leeds Python Discussion Group on Code style, tooling, testing, all that Jazz.
I’ll try and do some live coding, and get some debate going tomorrow. But the above can be read as a bit of blog post as well.
I am a big fan of Jupyter notebooks and similar (e.g. R Markdown) systems which allow you to mix code and documentation, preferably in a browser (which allows sharing).
However, I’ve found that it’s quite easy to fall into a “hacking” work pattern of developing quite a lot of code, and mixing it up with substantial data processing. This leads to a number of anti-patterns:
This leads to wasting time; to getting lost (have I tried this minor variation before?) and general frustration.
A better working pattern seems to be the following:
But before long, start to formally develop code in a formal package. A good directory layout is:
|-- my_package
|-- __init__.py
|-- load.py
|-- analysis.py
|-- tests
|-- __init__.py
|-- load_test.py
|-- analysis_tets.py
|-- notebooks
|-- Clean Data.ipynb
A good trick to import data without worrying about setup.py
and installing is to start each notebook with
import os, sys
sys.path.insert(0, os.abspath("..))
With the above directory layout, this adds the base directory to the python search path, so that
import my_package
will work; and will load the working version (and not a version which you might have installed).
I call this working a bit more “formally”. As with many processes in software development, it slows you down initially, but in the long-run you win.
I’m currently using this process with the notebook here: Comparison methods in the day job.
I’ve published my first python package on PyPi (See also the New PyPi which seems to have finally synced.)
Get it here: TileMapBase or TileMapBase on new PyPi:
Uses OpenStreetMap tiles, or other tile servers, to produce “basemaps” for use with matplotlib. Uses a SQLite database to cache the tiles, so you can experiment with map production without re-downloading the same tiles. Supports Open Data tiles from the UK Ordnance Survey.
My original aim was to produce a simple, high-level way to use OpenStreetMap style tiles as a “basemap” with MatPlotLib in Jupyter Python notebooks. Since then, I’ve also been working on TileWindow which uses this library to cache tiles, and provides a tkinter
widget which displays a map– sort of like GoogleMaps but in Python. Ultimately for use in my current job: PredictCode.
I’ve in the process of putting together my first proper Python package to be uploaded to PyPi / PyPi Old. The docs around doing this are not great, but the official docs are pretty good:
One thing which was unclear to me was how to specify the text which gets displayed on PyPi. After some playing, it seems that:
long_description
variable of setup()
or in setup.cfg
Some searching found a solution:
pip install pypandoc
Then you can dynamically generate a rst
file when setup.py
is invoked:
try:
import pandoc
doc = pandoc.Document()
with open('readme.md', encoding='utf-8') as f:
doc.markdown = f.read().encode("utf-8")
with open("README.rst", "wb") as f:
f.write(doc.rst)
except:
print("NOT REFRESHING README.rst")
with open('README.rst', encoding='utf-8') as f:
long_description = f.read()
try/except
means I haven’t broken setup
for users without pypandoc
Here’s the project on GitHub: TileMapBase
I have only ever been a hobbyist C++ programmer, while I have been paid to write Java and Python. But a common complaint I’ve read about C++ is that you have to manage memory manually, and worry about it. Now, I’d slightly dispute this with C++11, but perhaps I don’t really have enough experience to comment.
However, I think there’s a strong case that with Garbage Collected languages, you can’t really forget about memory, or the difference between copy by reference and copy, but the language rather allows you to pretend that you can cease to worry. In my experience, this is only true 99% of the time, and the 1% of time it bites you, you’ve quite forgotten that it’s a possibility, which makes debugging a real pain (the classic “unknown unknown”).
A stupid example which wasted some of my time today is:
import numpy as np
...
indexes = np.argsort(times)
coords[0] = coords[0][indexes]
coords[1] = coords[1][indexes]
With hindsight this obviously mutates the data underpinning coords
and hence mutates anything which is an alias of coords
. Cue two tests failing, and the first one was silently mutating the data the second test tried to use. But this is really hard to spot– both tests failed, so I spend a while looking at the base class because that’s the only common code involved. Unit testing doesn’t really help, as I’d never think to test that I’m not accidentally mutating some data reference (because I’d never be that stupid, right…)
What I meant to do was:
coords = coords[:,indexes]
This generates a new array
instance and assigns the reference to coords
. But this is quite subtle. To even express it, I have to use language which I learnt from C/C++. I only finally noticed when I wrote some test code in a notebook, and noticed that there was some period 2 behaviour going on. “Oh, I must be mutating something… Oh, right…”
The problem with Python, and Java, is that you get out of the habit of even thinking in this way. I used to write a lot of immutable code in Java, precisely to avoid such problems. That seems to make massive sense in a corporate environment. But for numpy
, and trying to squeeze performance out of an interpretted language, you sometimes need mutability. Which means you need to think. (And regularly makes me wish I could just use C++, but that’s nothing story…)
Another new task: get going with some GUI programming!
.wait_window()
is necessary but not sufficient.As culled from Leeds University Library:
I have finally gotten around to playing with the HD5 driver for pandas
(which uses, I believe, pytables
under
the hood). I’m only scratching the surface, but it’s easy to do what I want:
We obviously cannot do this in memory. But if we have some way of generating one row at a time, or a small “chunk” of rows at a time, then we can “append” these iteratively to a HD5 store:
store = pd.HDFStore("test.hd5", "w", complevel=9, complib="bzip2", fletcher32=True)
# Generate a data frame as `frame`
store.append("main", frame, data_columns=True)
# Repeat as necessary
store.close()
This creates a new HD5 file, and then creates a table in it named “main”. We can call store.append()
repeatedly to add lots of rows. The data_columns=True
is necessary if we wish to query by column (which we do).
We can then iterate over the whole dataframe in “chunks” of rows:
store = pd.HDFStore("test.hd5", "r")
for df in store.select("main", chunksize = 1000):
# Do something with `df` which contains the next 1000 rows
Alternatively, we can use the power querying ability. Suppose we have a column named “one” in the large dataframe, and we just want the rows where the value of “one” is less then 100. Then we can use:
store = pd.HDFStore("test.hd5", "r")
df = store.select("main", where="one < 100")
This seems to be wonderfully fast.
You cannot store “objects” in a table, so e.g. storing a GeoPandas
data frame is impossible (or extremely hard).
I’ve worked with XML before (in Java), but always small files using the Document Object Model. Now faced with multi-GB of Open Street Map derived XML files, of which I need to get a small amount of data, some other method is required. Step forward the Simple API for XML (SAX). This is an event-driven API: the XML parser calls a “handler” object with information about tags opening and closing, and the character data in between.
In Python, there is support in the standard library for SAX parsing. You need to sub-class (or duck-type, and implement the interface of) xml.sax.handler.ContentHandler
. It seems that duck-typing is frustrating, as you need to implement the whole interface, even if you never expect certain methods to be called.
The methods startDocument
and endDocument
are called at the start and end of parsing. The startElement
method sends details of the name of an opening tag, and it’s attributes (sent as essentially, but not quite the same as, a dict
from string to string), and endElement
tells you of a closing tag. Text is sent to you via characters
which will also notify of new lines (which probably want ignoring). There is more, but that’s enough for my application.
Somehow, a callback doesn’t feel very “pythonic” (and does feel terribly Javascript-esq). The pythonic way to push data to a client is surely to use a generator. Naively, to convert a callback to a generator, we’d like to:
__iter__
method call the code which requires the callback handler.__iter__
which builds an iterator, and returns control to the client.__next__
on the iterator, return control to the data generation function…Given that we are suspending execution, it should come as no surprise that the way to do this is via threading. Run the data generation code in a separate thread, and let the callback handler write all its data to a blocking queue. On the main thread, we simply implement a generator which pulls data off the queue (waiting if necessary for data, hence allowing control back to the thread) and yield
s it back to the client.
For fun, I implement this in the module cbtogen
in my project OSMDigest. Sadly, in Python, event with a large queue, there is a signifcant overhead in running two threads and passing data over a queue.
For the application of converting the SAX callback to a generator, the result is incredibly slow code. This can be significantly improved by moving as much parse logic as possible to the thread, so we send far fewer objects over the queue. However, this is a better way…
The xml.etree.ElementTree
module in the Python standard library represents an XML document via elements which have “children” (i.e. nested tags). The standard usage is to parse the whole document, but it is also possible to parse the document tag by tag using the iterparse
function. There is a caveat however: all children of tags are still collected. If your document consists of lots of disjoint sub-sections, this is not a problem, but for my application, parsing Open Street Map data, the entire document is contained in an <osm>
tag. As such, we’d eventually collect the entire document as children of this main (or root) tag. The trick is to capture a reference to the root tag, and then periodically (at a suitable point in the iteration loop) call clear
on this object. This removes references to children, and so allows them to be garbage collected. The obvious downside here is that different document structures might require different techniques to “clear” the correct tags.
For OSM data, however, this is a clear winner, giving by fast the quickest way to parse the data.
I want to process large amounts of XML data from Open Street Map (OSM). I.e. that obtained from GeoFrabrik or OSM.Planet. For smaller snapshots, do look at OSMnx.
My pure-Python project to read and process OSM data, currently a work in progress, can be found on GitHub, as “OSMDigest”.
The XML format is documented on the OSM Wiki. There is no formal schema, but the data you can download seems to be of quite a constrained type:
The meaning of ways and relations is defined by the tags present. For more details see:
I’m currently working on using some address information from open street map to augment other open data sources. Here are some notes on using data from open street map, in Python.
It seems like this is a bit of a pain. Open StreetMap (OSM) uses a custom, XML based, format which is hard/impossible for standard GIS software to read.
OSMnx
package seems great for making direct, small-scale queries.)http://download.geofabrik.de/ Offer downloads of regions of the world, either in OSM format, or converted to shape-file format. The latter can be loaded into geopandas, QGIS etc. etc. but appears, sadly, to be missing useful information. For example, I can locate a building and find the polygon for this (by it’s ID) in the shapefile, but there is no meta-data attached to give me e.g. the address!
However, for off the shelf analysis using existing tools, this is perhaps where to start.
The standard format is XML, either compressed, or packaged using Protobuf.
I should probably be using pyOsmium but there is no conda build, and being stuck with either a windows box, or a locked-down linux box, I am scared to try to get it to build.
GDAL / OGR can be easily installed (or come for free with geopandas).
It seems that others have had similar thoughts to me. OSMnx looks to be a great tool for network analysis of road networks. The review paper is also very informative, with some nice background information.
With Anaconda, it’s a one-line install:
conda install -c conda-forge osmnx
This seems very nice for targeted downloads, but it’s a bit too “magic” for my tastes. The source code is very readable however, and I’ve been learning a lot from it.
I asked this question on Stackoverflow, and got a nice answer, but one which I needed to think through a little more. Here’s my conclusions.
My aim was to understand how to write robust code which could take scalars, but which would also do “as expected” on arrays. Let me expand a little on this, by using a slightly easier example than in the original question. Suppose f(x)
is a function which takes a scalar and returns a scalar. I then want that if x
is actually an array, of any shape, then f(x)
will return an array of the same shape as x
, namely the array obtained by applying f
to every entry.
You can do this with the np.frompyfunc
method, but we cannot expect to take advantage of the speed of numpy. Furthermore, we also obtain an array of objects.
Instead, let me present the “numpy way” and then explain why and how it works. Firstly, the starting function:
fixed_array = np.arange(10)
def f(x):
return np.sum((x - fixed_array) ** 2)
Here fixed_array
is global for simplicity. So what f
does is to subtract x
from each entry of fixed_array
, square, and then sum. Equivalent to sum( (x-t)**2 for t in fixed_array )
. Passing an array into f
yields an error as numpy
cannot work out how to compute x - fixed_array
. Indeed, we do not want to form this pointwise. Rather, I think what we really want to do is something like the following:
x
is also a 1D array (though in the end x
should be allowed to be any shape.)xx
so that xx[i][j] = x[i]
for all i,j
yy
so that yy[i][j] = fixed_array[j]
for all i,j
xx
and yy
have the same shape.zz = (xx - yy)**2
(pointwise) so that ``zz[i][j] = (x[i] - fixed_array[j])**2 for each
i,j`.f(x[i])
in the i
th position, as required.Okay, so here’s the answer:
def fff(x):
return np.sum((np.asarray(x)[...,None] - fixed_array)**2, axis=-1)
How does this work?
np.asarray(x)
returns, if x
is scalar, an array of shape (1)
containing x
as it’s entry; and if x
is already “array like”, we get a genuine array.y
, the slice y[...,None]
does the same as (the perhaps more clear) y[...,np.newaxis]
. The ...
means the same as :,:,:
however many times required, and np.newaxis
gives you a new axis. If y
has shape (2,3,4)
then y[...,None]
has shape (2,3,4,1)
. See Indexing docs.yy
and consider how Broadcasting applies to yy - fixed_array
fixed_array
so it has the same ndim
as yy
(2,3,4,n)
where fixed_array
is of length n
y
will be used in the first dimension(s), and the entries of fixes_array
in the last dimension.axis=-1
command.See Notebook for a quick demo and some code.
In my new job, I find myself working with numpy (after a break of a couple of years, and now professionally, and not as a hobby.) Numpy is great, but it doesn’t half require a little thinking upon occasion.
Suppose we have an array of 10 points in the plane. Should this be represented as a
numpy
array of shape(2,10)
or(10,2)
?
Suppose I want to write a function which represents a two dimensional kernel, for example:
def ker(x,y):
return x + 2 * y
This doesn’t vectorise at all: I cannot do
p = np.array([1,2])
ker(p)
So instead we might write:
def ker(p):
return p[0] + 2 * p[1]
Then, to vectorise across my array of 10 points, evaluated the kernel at each point, I need my array to have shape (2,10)
.
This is exactly the convention chosen by the kernel
object which Scipy stats Gaussian KDE will produce, for example.
Given my 10 points, I should be able to translate them all by the same amount, by just adding a point.
my_array + point
For this to work, I need to work with the Numpy broadcasting rules which imply shape (10,2)
:
my_array = np.random.random(20).reshape((10,2))
point = np.array([2,3])
my_array + point
So if you’re working with things your thinking of as spatial “vectors”, the other convention seems most natural. This is exactly the convention chosen by e.g. Scipy Spatial KDTree.
I guess you pick a convention, stick with it, and translate (e.g. using .T
) as necessary.
A further hint is to read the broadcasting rules very closely and to know that indexing with None
is the same as adding a new axis:
my_array = np.random.random(20).reshape((10,2))
new_array = my_array[:,None,:]
print(new_array.shape) # (10,1,2)
Given a population and sampling at random (“with replacement”) what’s the expected number of samples I need to see 50% (or any fixed proportion) of the population.
I deliberately ask for “expected” because calculating expectations is often easier than getting a handle on the whole probability distribution. A trick is to exploit linearity: express the random variable of interest as a sum of random variables you can calculate the expectation of.
Sampling at random from , suppose we have seen exactly members of . As each sample is independent, letting denote the number of samples required to see a new member of , we see that \[ \mathbb P(T_k = j) = \left(\frac{k}{\vert P\vert}\right)^{j-1} \frac{\vert P\vert-k}{\vert P\vert} \] That is, a geometric distribution, and so . By convention, .
Then, if I want to see exactly members, the time needed is . We can estimate the expectation by an integral, \[ \mathbb E(S_n) \approx \vert P\vert \int_{\vert P\vert-n}^{\vert P\vert+1} \frac 1 x \ dx = \vert P\vert \log \Big( \frac{\vert P\vert+1}{\vert P\vert-n} \Big) \]
Hence, in answer to our original question, we need on average samples to see half of .
This post, very tangentially, relates to a quiz we set job candidates. If you are applying for a job at my current company, and somehow work out I work there, and find this, then you probably half deserve a job anyway.
Suppose you have a population \( P \) and some test as to whether a member of the population is good or bad. We want to find a “random good member”. There are two methods that come to mind:
The first method has the virtue of being simple. The second method uses a lot of memory, if \(P\) is large. But on closer thought, what if the proportion of “good” members is rather small. The 2nd method is guaranteed to find a good member in . How slow can the first method be?
At the dying ends of the work day, I came across this page and was initially confused by
public abstract class Enum<E extends Enum<E>>
Doesn’t that look, well, horribly circular? Stackoverflow suggests this is common confusion.
LED lights seem a no-brainer: instant on, good colour reproduction, extremely energy efficient (cool to the touch after minutes of use). But they are expensive. I wonder what the payback time is?
While spending a long weekend in Windsor, we found the following:
The instructions read, in part:
The puzzle challenge is to solve this unique one-way maze by travelling from the entrance Pawn to the central Castle – always going forwards.
It’s not quite clear to me what this means, but there is a good quote from Here:
The maze should be played as if you are a runaway train - always moving smoothly through the points forwards, and never able to reverse. This maze-movement idea was inspired by the nearby Windsor Railway Station, which brought the Royal Family to Windsor from Victorian times onwards. The maze paths sometimes run together in pairs, like a pair of railway tracks; sometimes they even run through “railway cuttings”, where the grass surface rises and falls around them.
I wonder if there is an elegant way to convert this to a directed graph problem? I couldn’t see how to do this cleanly, as in the problem you must pass through a “node”, not reverse direction.
Anyway, here is a solution. From the pawn you initially get yourself into an outer “loop”, from which only the King is accessible. As we can only visit the King once, we don’t want to get back onto the outer loop in the same direction.
So, the only solution is King, Bishop, Knight, Queen, Castle. And once we’ve heading to the Knight, we only have a choice of which direction to pass through the Knight “station” in, everything else is determined. Before that we can loop about a bit if we wish.
More a note to a future self than aything else… In C++ we have the notion of Resource Acquisition Is Initialization which I think I have internalised as the following:
Any “resource” (memory, file, handle etc.) should be “owned” by an object. An object’s constructor should allocate the resource, and the destructor should free the resource (with some appropriate choices then made for what copy/move constructors/assignments should do). By use of C++11
unique_ptr
andshared_ptr
this model can be extended to pointers.
Notice I didn’t mention “exception”, though one major additional use of RAII is exception safety: if an exception is thrown, then all objects currently in scope will have their destructors run, and so any resources will be safely released.
In C#/Java we have garbage collection. This takes care of memory, but not any other “resource”. In C# there is the IDisposable
interface, and I was for a while confused by what this did. I am not alone, judging from SO:
For me, the following is a good way to think:
malloc
and then free it later with free
. Similarly, if you open a file, you need to later close it.free
anything. But it does nothing for the file open/close issue.close()
a file, say.try ... catch ... finally
idiom!IDisposable
interface in C# should be thought of as enabling the syntactic sugar of the using
command. Similarly the AutoCloseable
interface in Java 7.What of destructors? Don’t use them: they are only of use if you need to free “unmanaged resources”, and if you need to do that, you’ll know it.
So why the “IDisposable pattern”? This seems really to exist merely to support finalizers (aka destructors). However, if your class is not sealed, then maybe a derived class will need a destructor. The pattern then exists to ensure unmanaged resources are always freed, but from the destructor no managed resource is freed (as they might have already been freed by the garbage collector).
dispose
methoddispose
on member fields, which is the way to handle “composition”.All in all, that seems about as much work as writing a destructor in C++!
BF is an amusingly named, impractical, Turing complete programming language. It is very like an actual Turing machine: you have an infinite (originally, one-sided, 30K long) array of (8-bit) values, with commands to increase/decrease (+/-) the cell, or move left/right (</>). Input (,) and output (.) are handled. Loops are given by [] and implement a “while cell is non-zero” loop. So a trivial mapping to C is possible (see the Wikipedia article).
I thought it would be fun to give an object oriented implementation, for which I used C++ for a change. My implementation decouples the parsing of the programme from running it. The parse class reads the input in a single pass, using a stack for dealing with nested loops. It translates the input into a list of commands, represented by an abstract base class with an execute
method, overloaded for the various different commands. This is the “command” pattern (closely related to the “strategy” pattern).
Round 2 came and went (and Round 3 and the “distributed” round). As an effort to practice with C#, I solved these problems in C# (definitely slower than using Python, and actually, the end running time is not that much quicker, for these sort of algorithmic / numerical problems).
There’s a definite step-up in difficulty here! Maybe I kid myself I could have done problem A and some “small” problems quickly… that people did them all in 150 minutes is quite scary. The Official Contest Analysis is rather brief (from G+, it seems the organisers were busy with the distributed round; or maybe, if we’ve come this far, shorter answers are all we need ;-> ) I think my background in pure Maths shows here: I found problem D quite easy, as I found analysing the special cases off-line (pen and paper) to be not so hard. Problems B and C (in “large” form) were hard, I thought. (I missed the trick with problem B; and using an off-the-shelf Graph package is obviously the fast way to do Problem C.)
See the code on GitHub.
On the basis that knowing a modern OO language would be good, and on the basis that everyone and their granny knows JAVA, I thought I’d look at C#. Getting going with C# was a bit tedious: I end up looking things up endlessly, and puzzling over things for a while before they “click” and seem obvious (which is always annoying, as then I’m left wondering why it wasn’t obvious from the start). You wait for muscle memory to build…
Example: A small mistake lead to some musings on how sort algorithms are implemented: StackOverFlow Question (the actual problem came from thinking about orderings in Python and C++ too much, where you specify a strict weak ordering, basically “less than”, and then porting that to a CompareTo
method and missing a corner-case. This came up in a CodeJam problem.)
(Aside: Stackoverflow never ceases to amaze me: I asked a kind-of-silly question, which probably I could have thought about myself, and I get a wonderful, long, complete answer…)
I do really like LINQ, and the way it reminds one of list comprehensions. Turning this…
into this…
is rather nice. (The 2nd snippet uses a simple extension method to Add
; again, extension methods seem pretty neat.)
However, give me templates from C++ anyday! I really (really) like the duck-typing nature of templates: it’s sort of like using a dynamic language, but with compile time checking. (Of course, one also has to enjoy page long error messages… Maybe “concepts” will arrive one day.) That C# just doesn’t let you do
is very tedious! Similary, the limitations of using
(i.e. it’s lack of being a replacement for typedef
) in C# are annoying: I typically like setting up private typedefs in a class, to save typing out long class names. Or, giving Tuple<T,S>
a more domain-specific name: using DirectedEdge = Tuple<T,T>
for example.
An analogue of namedtuple
from Python would be neat: a simple way to setup a struct or class which just needs some fields, and default ordering / hashing etc. C# does do this with anonymous types, but the result is, well, anonymous…
It would also be nice to have something similar to C++ const
. readonly
is too limited, and class only. A readonly
List<T>
can still have its items changed, and that’s not my definition of “readonly”! For a statically typed, compile time checked language, decorating methods with const
is, I think, a wonderful thing: it’s another way to signal to the compiler (and other people reading your code) your intentions.
Is next up to be toyed with…!
Mostly as an excuse to learn C# I implemented the path finding algorithms. Running some tests (random graphs, and then computing cuts from found disjoint paths, as to verify a “cut” is easy!) caught some implementation issues. All obvious with hindsight.
See the file on GitHub. I wanted to use interfaces, and extension methods, as they are features of C# which differ from e.g. C++. In hindsight, I think an abstract base case Graph
would have been more sensible, with DirectedGraph
and UndirectedGraph
inheriting from this.
DirectedGraph
should have a GetNeighbours
method.)The Ford-Fulkerson algorithm makes essential use of “unflowing” flow back along a path. In our case, this allows us the new path to make use of an old path in reverse. That, at the end of each step, we really do get a set of edge disjoint paths is slightly tricky (and I didn’t prove this– in the full FF algorithm, it follows by noting that the “flow constraints” always hold).
In my implementation, I simply ignore the issue, and store “paths” as simply a set of (directed) edges. Then, at the end, we reconstruct the paths (using that each edge can occur in only one path, and that a simple greedy algorithm works.) The slightly subtle issue here is shown in the following diagram:
Namely, we can end up with loops which are divorced from any actual “path”. Once you realise this, we simply need to ignore it at the final step. In the algorithm itself, we can’t run into a problem, as we can always use the “loop” in reverse, when finding a new path, and having a “loop”, by definition, means that we don’t have to worry about where we start: we can always access all the vertices.
In e.g. the https://en.wikipedia.org/wiki/Depth-first_search on depth-first search, the word “backtracking” is used a lot. So a rather naive way to implement depth-first search might be a recursive algorithm. This would allow us to save “state” on the stack: at each vertex, where in the list of neighbours we had got to, and, encoded in the call stack, the partial path.
For simple route finding, a rather different approach is normally used. Suppose first I simply want to know if I get to vertex target
from start
. We’d maintain a “stack” (in the abstract datatype sense) of vertices to visit, and a set of visited vertices. At each turn:
target
and stop if so.That we use a “stack” means this is depth-first. And we visit each vertex exactly once. Rather than keeping track of state for each vertex, we simply have our set of visited vertices, \(O(V)\) size, and a stack, which in a naive implementation can get as large as \(O(E)\).
If we want to actually find the path, then we can keep track of not only if a vertex has been visited, but also what the “parent” vertex which lead to the vertex is. (I use this terminology as we are effectively finding a spanning tree rooted at start
). We can then work backwards from the target
vertex, using here that we visit each vertex just once, and so we really do build a tree, and hence there is a well-defined path back to start
.
Replacing the “stack” with a “queue” gives a breadth-first search instead.
I implemented the above without really thinking. But it doesn’t work for the vertex-disjoint path case, because of the following example:
Here we have one path already, the straight line. It’s possible to find a new vertex-disjoint path:
Notice that is traversing the “residual graph”, shown on the left, we have the visit the middle vertex twice. This makes route finding somewhat more tricky, as our “history” changes the permissible future paths. In a full backtracking approach, this would be easy to implement, as we’d always have a “local copy” of the current path.
A simple way to implement this in a modification of the above. The observation is that we only need to visit certain vertices twice and never more so. So we simply tediously keep track of if we’ve coming back for a 2nd time, and if so, remember where we came from on the 1st go. The problem of not keeping track of this extra information is shown here:
Here on the 2nd visit to the middle vertex, we over-write the original “parent” and end up in an infinite loop.
A few notes on paths in graphs: finding the maximum number of edge/vertex disjoint paths, and an application to finding disconnections.
Consider a graph, for the moment undirected. We wish to find the maximum number of edge disjoint paths from a fixed vertex \(s\) to a fixed vertex \(t\). We can do this with the Max-Flow Min-Cut theorem and the Ford-Fulkerson algorithm, at least in the directed case, but let’s specialise this to our case.
Motivated by another nice post from Jeremy Kun I’ve had a play with AdaBoost. Like Jermey I used the “Adult Census Dataset” and then also the “Titanic” dataset. The latter requires some data munging, and some decisions about how to process the data.
Once you understand what AdaBoost does, and how Decision Stumps/Trees work, it seems reasonable to now use scikit-learn (which can often make you seem like a trained monkey, plugging things into your data). It’s worth noting that an out the box decision tree works almost perfectly on the “Adult” dataset, but much less well on the Titanic Dataset (where AdaBoost and decision stumps now don’t look so bad!)
As ever, Python feels a bit slow for this sort of thing, but also nice because it’s interactive, and well aimed at handling data.
As a final application of MCMC methods, you can’t escape the Ising Model, and it’s cute application to denoising images (of a text-like nature).
So, linear regression isn’t so exciting. What about removing outliers? I do find this, just a little, magical:
We define a model where a data point \( y_i \) can either be from a normal linear regression model, or just a random (normally distributed) number. We then build into our model indicators \( o_i \in \{0,1\} \) to indicate if this is an outlier or not. Then sample the posterior distribution with an MCMC sampler, integrate out the \(o_i\) and you have your line of best fit. Fix \(i\) and integerate out everything but \(o_i\) and you get an estimate of probability the point is an outlier. The above plot shows those points estimated above 90%.
Rest is in an Ipython notebook on GitHub. This also gave me a chance to play with my own modified Gibbs sampler, which works nicely, if I say so myself.
How might we tackle simple regression from a Bayesian perpsective? Our model will be that we are given points \( (x_i)_{i=1}^n \) (which we assume we know, at least to a very high accuracy) and dependent points \( y_i^{re} = \alpha + \beta x_i \), but we only observe \( y_i \) where \( y_i = y_i^{re} + e_i \) where the \( e_i \) are our “uncertainties” (I like the line: “if they were errors, we would have corrected them!”, see footnote 10 of arXiv:1008.4686) usually modelled as iid \( N(0,\sigma^2) \). The likelihood is then
\[ f(y|\alpha,\beta,\sigma) = \prod (2\pi\sigma^2)^{-1/2} \exp\Big( -\frac{1}{2\sigma^2} (y_i - (\alpha + \beta x_i))^2 \Big) \]
Finding the MLE leads to Least Squares Regression. A simple Bayesian approach would be to stick some prior on $\alpha, \beta, \sigma$, but of course, this raises the question of how to do this!
Anyway, another Ipython notebook which develops some of the basic maths, and then uses emcee and the Triangle Plot to make some simulations and draw some plots of posterior distributions.
Bayesian statistics appeals to me both because it seems more “philosophically correct” than frequentist arguments (I have this data, and want to make an inference from it, rather than worrying about data which “might” exist, given other circumstances). Also, Bayesian approaches lead to some interesting computational issues, which are interesting in their own right.
TL;DR: In the process of thinking about MCMC methods in sampling from posterior distributions, I became interested in the choice of prior distributions. Emerging from a lot of reading, you can view my IPython Notebook on finding an “invariant” prior: if you have a natural family of transformations on your “data space” which is reflected in transformations in the parameter space, then I argue there’s a natural reason to expect certain invariance in the prior.
For a simple example, consider normally distributed data with known variance. If you translate your data by, say, 5 units, then you would expect your inference about the (unknown) mean to also be exactly translated by 5 units (but to otherwise be the same). This leads to a uniform (improper) prior. I treat the case of unknown mean and variance, which leads to the Jeffreys prior, which is regarded as not so great in the 2 parameter case. Hey ho. This was also a good excuse to learn and play with SymPy.
One of the fun things about C++11 is that minor changes allow you to write somewhat more expressive (aka Pythonic) code.
A common pattern is to iterate over an array while also knowing the index you are at. Python does this with the enumerate
keyword:
In C++ the idiomatic way to do this is perhaps:
This has its flaws though (and is tedious to type). Using g++, if you have warnings on, -Wall
, then you’ll see complaint, as the type of container.size()
is, typically, of size std::size_t
which on my machine is unsigned long long
(aka an unsigned 64-bit number). Warnings aside, as int
is on my machine a 32-bit unsigned number, we could potentially have problems if the size of container is more than around 2 billion (unlikely, but possible these days).
Last one of these for a while. Problem D became an obsession. As ever, links to: Official Contest Analysis and my code on GitHub.
Problem A, Speaking in Tongues: Could be solved with pen and paper, as all the information you need is sneakily in the question.
Wrapping up with these… The final problem was rather hard, and in the end was an exercise in profiling… As ever, links to: Official Contest Analysis and my code on GitHub.
Problem A, Password Problem: You have typed A
characters of your password, and the probability you typed letter i
correctly is \( p_i \), for i=1,...,A
. You can press backspace from 0 to A
times and type again, or just give up and retype. All your typing will from now on be 100% accurate, and pressing enter counts as a keypress. For each strategy compute the expected number of keypresses needed (if you get the password wrong, you’ll have to retype it) and return the lowest expected number of keypresses needed. Your password is B
characters in total.
Urgh, so failure! Some silly (stupid) errors meant my good start of getting problem A out in 16 minutes didn’t get me anywhere, as a silly not checking the boundary conditions killed problem B, and not thinking on paper long enough got to problem C. 3 hours and less stress, and it would have been 100/100, but I guess everyone says that about exams. (Of course, given the competition, Round 2 was going to be the end anyway).
As ever, links to: Official Contest Analysis and my code on GitHub.
Problem A, Brattleship: It’s in your brother’s interest to drag the game out for as long as possible: once he says “hit” we move to a 2nd phase which can only end more quickly.
W-1
. So we hit points W, 2*W, 3*W, ...
Busy at 5pm Saturday when this ran (so all the eggs in the final basket of round 1C). Under timed conditions, I did problem A very slowly, and B-small in the time, so would have qualified at around place 770. B-large took a bit longer, and Problem C wasn’t really looked at in the time limit. Hard problems…
As ever, links to: Official Contest Analysis and my code on GitHub.
Problem A, Counter Culture: How fast can you get from N
from 1 if you are allowed moves of: say one more than the last numbers; or reverse the decimal number. E.g. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 21, 22, 23 is the quickest route to 23.
I’ve spent some time on Codingame which is an on-line Competitive programming site, based around computer games. (Though only loosely: some puzzles are just puzzles like Google Code Jam, and the “games” are often pretty obviously graph or search based puzzles). But it’s kind of fun: certainly the more interactive puzzles, where your solution has to respond to unknown, almost real-time, inputs, is entertaining.
As a change, here is a puzzle I never quite got to the bottom of a few years ago. Just some flattened disks arranged in a circle and then rendered using LuxRender. I rather like the effect. There are two ways to make a cylinder shape: either use the inbuilt primitive objects (so a cylinder with disks top and bottom) or export a cylinder from Blender as a mesh. Actually, I rolled my own using a python script:
As ever, links to: Official Contest Analysis and my code on GitHub.
I did this under timed conditions, and would have just qualified. A silly error was all which stood between B large and me…
Problem A, Osmos: Start with A and \( (x_i)_{i=1}^N \) integers. You can absorb one of the \( x_i \) if it’s smaller than A, and then A grows by \( x_i \). Help Armin to be able to absorb all the numbers by adjusting the initial set:
What is the least number of moves to get a valid set?
So, I didn’t do this one live, as it ran at 2am UK time… I did however, for fun, try this under timed conditions a few days later, and I didn’t do great, but got enough to qualify (did A and B with 3 silly mistakes from B, then C small, and stupidly didn’t think, and tried my slow algorithm on C large. As large problems are one-shot, it would have been game over, and I’d be around 500 in the ranking.)
As ever, links to: Official Contest Analysis and my code on GitHub.
Problem A, Mushroom Monster: Easy, and my solution doesn’t differ from the official analysis.
As ever, links to: Official Contest Analysis and my code on GitHub.
What did you need to do to progress? Score 30 points in 100 minutes, or more points. So doing problems A and B would be enough. Again, tackling the “small” problems quickly is worthwhile.
Problem A, Part Elf: At generation 40, everyone is 1 or 0 elf. So at generation 39, everyone is \( \frac12 (a+b) \) elf, where \( 0 \leq a,b \leq 1 \). By induction, at generation \( 40-n \) everyone is \( a/2^n \) elf, for some integer \( 0 \leq a \leq 2^n \). So read in P
and Q
, use Euclid’s algorithm to find the gcd and hence write P/Q
in lowest terms, and then we need \( P/Q = a / 2^{40} \) so Q
should be a power of 2, less than or equal to 40, and we need \( P \leq Q \).
As ever, links to: Official Contest Analysis and my code on GitHub.
I did this under timed conditions: a mild disaster, but then it was for everyone back in 2013. If I had been clinical, I could have solved A, B-small and C-small, which would have been enough.
Problem A, Bullseye: Draw some concentric rings, with a fixed amount of paint. A bit of maths shows that ring \( n \in \{1,2,3,\cdots\} \) uses \( 2r+4n-3 \) units of paint, so we want the maximal \( N \) with \[ t \geq \sum_{n=1}^N 2r+4n-3 = 2rN + 2N(N+1) - 3N. \]
As ever, links to: Official Contest Analysis and my code on GitHub.
Of current interest is how to progress: top 1000 people go through, and for that you’d need 42 points in any time faster than basically the whole 2.5 hours. This seems more reasonable…
Problem A, The Repeater: Given some input strings and Omar can make a move: he can pick one string, and one character in that string, and either repeat it once more (so “abc” -> “abbc” or “aabc” or “abcc”) or if the character is already repeated, he can delete one copy (so “aabcc” -> “abcc” or “aabc”). Can he make all the strings the same, and if so, what’s the minimal number of moves to do so?
As ever, links to: Official Contest Analysis and my code on GitHub.
Of current interest is how to progress: top 1000 people go through, and for that you’d need 55 points in a reasonable time. That is, solve all but the tricky statistics based question, Problem C.
Problem A, Charging Chaos: The problem is the find a “mask” which when xored with each outlet gives each device. Of course, what makes this tricky is that there is also an unknown permutation to find, to map the outlets bijectively to the devices. But you can find these iteratively:
As expected, flight back from Boston to Leeds via Dublin resulted in zero sleep, and so hopes were not overly high for the 2015 Qualification Round. But, only 20 points needed to progress, which can’t be too hard, right?
Anyway, in the end, I got 100%, and seem to be the highest ranked UK participant. This, one suspects, won’t last…!
Some code on GitHub. On the todo list is to update this post with some comments, but my solutions are pretty similar to the Offical Analysis.
I found this to be somewhat harder than round A. Also, my solutions in Python, running on my 2013 era laptop, are rather slow, so really a C++ or similar implementation would be needed, especially on 2008 era hardware.
Again, The Official Contest Analysis is a good writeup, so I won’t say a great deal. See the code on GitHub.
Continuing, here is the 2013 qualification round.
As ever, the Official Analysis is very good, and similar to my approaches. Some code on GitHub.
While on holiday, what better way to unwind than solve some old Code Jam puzzles?
As ever, the Official Analysis is very good, and similar to my approaches. Some code on GitHub.
These seem to me to be essentially pure mathematics, in that once I understood the problem, and the maths behind it, the implementation in code was almost trivial.
The Official Contest Analysis is a good writeup, so I won’t say a great deal. See the code on GitHub.
Problem A, Minimum Scalar Product: The only comment I have is that the formal “proof” the contest analysis gives seems overly complicated to me.
I was hoping to take part in Google Code Jam this year, but for much of Saturday the 11th I’ll be transatlantic, flying back from Boston to Leeds, and I’m not sure I fancy my chances so much with jet-lag, no sleep, and much reduced amount of time…
But anyway, I thought it would be fun to look at past problems, and also to try to solve them in C#, as a way of learning more about the language. I must say that I’ve found it useful to solve the problems in Python– it’s nice having an iterative environment, and Python is generally fast enough.
So, here are my thoughts on the 2008 Qual round. See my code on GitHub.
I had a need to update and move my old academic website so thought I would take the plunge and host a site on GitHub. Here is an aide-memoire for myself as to how I did this:
This blog is built using Jekyll: