All Posts

On memory management

I have only ever been a hobbyist C++ programmer, while I have been paid to write Java and Python. But a common complaint I’ve read about C++ is that you have to manage memory manually, and worry about it. Now, I’d slightly dispute this with C++11, but perhaps I don’t really have enough experience to comment.

However, I think there’s a strong case that with Garbage Collected languages, you can’t really forget about memory, or the difference between copy by reference and copy, but the language rather allows you to pretend that you can cease to worry. In my experience, this is only true 99% of the time, and the 1% of time it bites you, you’ve quite forgotten that it’s a possibility, which makes debugging a real pain (the classic “unknown unknown”).

A stupid example which wasted some of my time today is:

import numpy as np
...
indexes = np.argsort(times)
coords[0] = coords[0][indexes]
coords[1] = coords[1][indexes]

With hindsight this obviously mutates the data underpinning coords and hence mutates anything which is an alias of coords. Cue two tests failing, and the first one was silently mutating the data the second test tried to use. But this is really hard to spot– both tests failed, so I spend a while looking at the base class because that’s the only common code involved. Unit testing doesn’t really help, as I’d never think to test that I’m not accidentally mutating some data reference (because I’d never be that stupid, right…)

What I meant to do was:

coords = coords[:,indexes]

This generates a new array instance and assigns the reference to coords. But this is quite subtle. To even express it, I have to use language which I learnt from C/C++. I only finally noticed when I wrote some test code in a notebook, and noticed that there was some period 2 behaviour going on. “Oh, I must be mutating something… Oh, right…”

The problem with Python, and Java, is that you get out of the habit of even thinking in this way. I used to write a lot of immutable code in Java, precisely to avoid such problems. That seems to make massive sense in a corporate environment. But for numpy, and trying to squeeze performance out of an interpretted language, you sometimes need mutability. Which means you need to think. (And regularly makes me wish I could just use C++, but that’s nothing story…)

Learning Python UI programming

Another new task: get going with some GUI programming!

Some references

Hints and tip

Some books

As culled from Leeds University Library:

Pandas, HD5, and large data sets

I have finally gotten around to playing with the HD5 driver for pandas (which uses, I believe, pytables under the hood). I’m only scratching the surface, but it’s easy to do what I want:

  • Create a huge data frame storing an entire data set
  • Efficiently query subsections of the frame

Create the dataframe

We obviously cannot do this in memory. But if we have some way of generating one row at a time, or a small “chunk” of rows at a time, then we can “append” these iteratively to a HD5 store:

 store = pd.HDFStore("test.hd5", "w", complevel=9, complib="bzip2", fletcher32=True)
 # Generate a data frame as `frame`
 store.append("main", frame, data_columns=True)
 # Repeat as necessary
 store.close()

This creates a new HD5 file, and then creates a table in it named “main”. We can call store.append() repeatedly to add lots of rows. The data_columns=True is necessary if we wish to query by column (which we do).

Read back the data

We can then iterate over the whole dataframe in “chunks” of rows:

store = pd.HDFStore("test.hd5", "r")
for df in store.select("main", chunksize = 1000):
    # Do something with `df` which contains the next 1000 rows

Alternatively, we can use the power querying ability. Suppose we have a column named “one” in the large dataframe, and we just want the rows where the value of “one” is less then 100. Then we can use:

store = pd.HDFStore("test.hd5", "r")
df = store.select("main", where="one < 100")

This seems to be wonderfully fast.

Downsides

You cannot store “objects” in a table, so e.g. storing a GeoPandas data frame is impossible (or extremely hard).

Some sources

Parsing XML via SAX in Python

I’ve worked with XML before (in Java), but always small files using the Document Object Model. Now faced with multi-GB of Open Street Map derived XML files, of which I need to get a small amount of data, some other method is required. Step forward the Simple API for XML (SAX). This is an event-driven API: the XML parser calls a “handler” object with information about tags opening and closing, and the character data in between.

In Python, there is support in the standard library for SAX parsing. You need to sub-class (or duck-type, and implement the interface of) xml.sax.handler.ContentHandler. It seems that duck-typing is frustrating, as you need to implement the whole interface, even if you never expect certain methods to be called.

The methods startDocument and endDocument are called at the start and end of parsing. The startElement method sends details of the name of an opening tag, and it’s attributes (sent as essentially, but not quite the same as, a dict from string to string), and endElement tells you of a closing tag. Text is sent to you via characters which will also notify of new lines (which probably want ignoring). There is more, but that’s enough for my application.

Getting a generator

Somehow, a callback doesn’t feel very “pythonic” (and does feel terribly Javascript-esq). The pythonic way to push data to a client is surely to use a generator. Naively, to convert a callback to a generator, we’d like to:

  • Make an __iter__ method call the code which requires the callback handler.
  • When control is first returned to the callback, store the data and somehow return control to __iter__ which builds an iterator, and returns control to the client.
  • Each time we call __next__ on the iterator, return control to the data generation function…
  • ???
  • Profit?

Given that we are suspending execution, it should come as no surprise that the way to do this is via threading. Run the data generation code in a separate thread, and let the callback handler write all its data to a blocking queue. On the main thread, we simply implement a generator which pulls data off the queue (waiting if necessary for data, hence allowing control back to the thread) and yields it back to the client.

For fun, I implement this in the module cbtogen in my project OSMDigest. Sadly, in Python, event with a large queue, there is a signifcant overhead in running two threads and passing data over a queue.

For the application of converting the SAX callback to a generator, the result is incredibly slow code. This can be significantly improved by moving as much parse logic as possible to the thread, so we send far fewer objects over the queue. However, this is a better way…

Alternative using element-tree

The xml.etree.ElementTree module in the Python standard library represents an XML document via elements which have “children” (i.e. nested tags). The standard usage is to parse the whole document, but it is also possible to parse the document tag by tag using the iterparse function. There is a caveat however: all children of tags are still collected. If your document consists of lots of disjoint sub-sections, this is not a problem, but for my application, parsing Open Street Map data, the entire document is contained in an <osm> tag. As such, we’d eventually collect the entire document as children of this main (or root) tag. The trick is to capture a reference to the root tag, and then periodically (at a suitable point in the iteration loop) call clear on this object. This removes references to children, and so allows them to be garbage collected. The obvious downside here is that different document structures might require different techniques to “clear” the correct tags.

For OSM data, however, this is a clear winner, giving by fast the quickest way to parse the data.

Open Street Map XML data

I want to process large amounts of XML data from Open Street Map (OSM). I.e. that obtained from GeoFrabrik or OSM.Planet. For smaller snapshots, do look at OSMnx.

My pure-Python project to read and process OSM data, currently a work in progress, can be found on GitHub, as “OSMDigest”.

The XML format is documented on the OSM Wiki. There is no formal schema, but the data you can download seems to be of quite a constrained type:

  • Start with an <osm> element giving the “version”, “generator” and “timestamp”.
  • Then a <bounds> element giving the rectangle in latitude/longitude coordinates which encloses the data.
  • Following this, elements of three types. (They seem to appear in the order given here, though this I guess is unimportant). Each of these elements contains some common attributes: “id” giving the OSM id (which is unique within each type), the (optional) “user”, “uid”; giving the user who last modified the object, the “timestamp” of last modification, the edit “version” (which increases on each edit) and the “changeset” number. There is also a “visible”, but in the downloaded data which I’ve seen, this is always either missing, or “true”.
  • <node> specifies a point on the planet, and has attributes “lon”, “lat” for coordinates. May contain 0 or more <tag> sub-elements.
  • <way> specifies a path. Contains, in order, <nd> sub-elements referencing nodes, and 0 or more <tag>s.
  • <relation> specifies some logical relationship between other objects (e.g. the route of a bus, the area enclosing woodland, traffic instructions such as “no left turn here”). Contains <member> sub-elements referencing the other objects which make up the relationship, and 0 or more <tag>s.
  • Then we have three sub-elements which never contain further elements themselves:
  • <tag> which is a key/value pair, stored as attributes “k” and “v”.
  • <nd> which references a node and contains just the attribute “ref”
  • <member> which contains attributes “ref”, “type” and a (maybe empty, but always present) “role” describing what role the member has in the relationship.

The meaning of ways and relations is defined by the tags present. For more details see:

  • Way article. Things rapidly get complicated. A way which starts and ends at the same node is a “closed” way, and are often, but not always, treated as Areas. For example, a closed way tagged “highway=footway” is assumed to be a circular pathway, unless we also have the tag “area=yes” in which case it is a pedestrian plaza. But “landuse=forest” is always an “area” even without the “area=yes” tag.
  • Relation article and types of relation.
  • Possible keys and values can be found here: Key descriptions by group and Map features.

Open Street Map Data

I’m currently working on using some address information from open street map to augment other open data sources. Here are some notes on using data from open street map, in Python.

Getting and using Open StreetMap Data

It seems like this is a bit of a pain. Open StreetMap (OSM) uses a custom, XML based, format which is hard/impossible for standard GIS software to read.

Data sources

  • http://wiki.openstreetmap.org/wiki/Planet.osm Gives links to download the world and various extracts of it.
  • http://wiki.openstreetmap.org/wiki/Overpass_API Details the “Overpass API” which allows targeted querying of data from the OSM database. (The OSMnx package seems great for making direct, small-scale queries.)

GeoFabrik

http://download.geofabrik.de/ Offer downloads of regions of the world, either in OSM format, or converted to shape-file format. The latter can be loaded into geopandas, QGIS etc. etc. but appears, sadly, to be missing useful information. For example, I can locate a building and find the polygon for this (by it’s ID) in the shapefile, but there is no meta-data attached to give me e.g. the address!

However, for off the shelf analysis using existing tools, this is perhaps where to start.

Format

The standard format is XML, either compressed, or packaged using Protobuf.

  • http://wiki.openstreetmap.org/wiki/OSM_XML Gives the XML format
  • http://overpass-api.de/output_formats.html Gives details of the JSON format which the Overpass API can return.
  • http://wiki.openstreetmap.org/wiki/Map_Features Gives details of the “features” the returned data can contain.

Libraries

I should probably be using pyOsmium but there is no conda build, and being stuck with either a windows box, or a locked-down linux box, I am scared to try to get it to build.

GDAL / OGR can be easily installed (or come for free with geopandas).

  • http://www.gdal.org/drv_osm.html Gives some brief details
  • A cookbook for GDAL.
  • I didn’t have a great deal of luck finding the details I wanted.

OSMnx

It seems that others have had similar thoughts to me. OSMnx looks to be a great tool for network analysis of road networks. The review paper is also very informative, with some nice background information.

With Anaconda, it’s a one-line install:

conda install -c conda-forge osmnx

This seems very nice for targeted downloads, but it’s a bit too “magic” for my tastes. The source code is very readable however, and I’ve been learning a lot from it.

Numpy vectorising

I asked this question on Stackoverflow, and got a nice answer, but one which I needed to think through a little more. Here’s my conclusions.

My aim was to understand how to write robust code which could take scalars, but which would also do “as expected” on arrays. Let me expand a little on this, by using a slightly easier example than in the original question. Suppose f(x) is a function which takes a scalar and returns a scalar. I then want that if x is actually an array, of any shape, then f(x) will return an array of the same shape as x, namely the array obtained by applying f to every entry.

You can do this with the np.frompyfunc method, but we cannot expect to take advantage of the speed of numpy. Furthermore, we also obtain an array of objects.

Instead, let me present the “numpy way” and then explain why and how it works. Firstly, the starting function:

fixed_array = np.arange(10)
def f(x):
    return np.sum((x - fixed_array) ** 2)

Here fixed_array is global for simplicity. So what f does is to subtract x from each entry of fixed_array, square, and then sum. Equivalent to sum( (x-t)**2 for t in fixed_array ). Passing an array into f yields an error as numpy cannot work out how to compute x - fixed_array. Indeed, we do not want to form this pointwise. Rather, I think what we really want to do is something like the following:

  • For simplicity, suppose x is also a 1D array (though in the end x should be allowed to be any shape.)
  • Form an array xx so that xx[i][j] = x[i] for all i,j
  • Form an array yy so that yy[i][j] = fixed_array[j] for all i,j
  • xx and yy have the same shape.
  • Compute zz = (xx - yy)**2 (pointwise) so that ``zz[i][j] = (x[i] - fixed_array[j])**2 for each i,j`.
  • Sum over the final dimension, thus giving f(x[i]) in the ith position, as required.

Okay, so here’s the answer:

def fff(x):
   return np.sum((np.asarray(x)[...,None] - fixed_array)**2, axis=-1)

How does this work?

  • np.asarray(x) returns, if x is scalar, an array of shape (1) containing x as it’s entry; and if x is already “array like”, we get a genuine array.
  • Then for an array y, the slice y[...,None] does the same as (the perhaps more clear) y[...,np.newaxis]. The ... means the same as :,:,: however many times required, and np.newaxis gives you a new axis. If y has shape (2,3,4) then y[...,None] has shape (2,3,4,1). See Indexing docs.
  • Keep working with yy and consider how Broadcasting applies to yy - fixed_array
    1. We prepend 1s to the shape of fixed_array so it has the same ndim as yy
    2. We will get an output of size (2,3,4,n) where fixed_array is of length n
    3. If an input has dimension 1 then the single entry in that dimension will be used.
  • This gives us exactly what we want, because the entries of y will be used in the first dimension(s), and the entries of fixes_array in the last dimension.
  • We then square entry wise, and sum over the final dimensional, the axis=-1 command.

See Notebook for a quick demo and some code.

Working with numpy again

In my new job, I find myself working with numpy (after a break of a couple of years, and now professionally, and not as a hobby.) Numpy is great, but it doesn’t half require a little thinking upon occasion.

Suppose we have an array of 10 points in the plane. Should this be represented as a numpy array of shape (2,10) or (10,2)?

Argument from kernels

Suppose I want to write a function which represents a two dimensional kernel, for example:

def ker(x,y):
   return x + 2 * y

This doesn’t vectorise at all: I cannot do

p = np.array([1,2])
ker(p)

So instead we might write:

def ker(p):
   return p[0] + 2 * p[1]

Then, to vectorise across my array of 10 points, evaluated the kernel at each point, I need my array to have shape (2,10).

This is exactly the convention chosen by the kernel object which Scipy stats Gaussian KDE will produce, for example.

Argument from broadcasting

Given my 10 points, I should be able to translate them all by the same amount, by just adding a point.

my_array + point

For this to work, I need to work with the Numpy broadcasting rules which imply shape (10,2):

my_array = np.random.random(20).reshape((10,2))
point = np.array([2,3])
my_array + point

So if you’re working with things your thinking of as spatial “vectors”, the other convention seems most natural. This is exactly the convention chosen by e.g. Scipy Spatial KDTree.

What to do?

I guess you pick a convention, stick with it, and translate (e.g. using .T) as necessary.

A further hint is to read the broadcasting rules very closely and to know that indexing with None is the same as adding a new axis:

my_array = np.random.random(20).reshape((10,2))
new_array = my_array[:,None,:]
print(new_array.shape) # (10,1,2)

Random sampling to see a percentage of a population.

Given a population and sampling at random (“with replacement”) what’s the expected number of samples I need to see 50% (or any fixed proportion) of the population.

I deliberately ask for “expected” because calculating expectations is often easier than getting a handle on the whole probability distribution. A trick is to exploit linearity: express the random variable of interest as a sum of random variables you can calculate the expectation of.

Sampling at random from , suppose we have seen exactly members of . As each sample is independent, letting denote the number of samples required to see a new member of , we see that \[ \mathbb P(T_k = j) = \left(\frac{k}{\vert P\vert}\right)^{j-1} \frac{\vert P\vert-k}{\vert P\vert} \] That is, a geometric distribution, and so . By convention, .

Then, if I want to see exactly members, the time needed is . We can estimate the expectation by an integral, \[ \mathbb E(S_n) \approx \vert P\vert \int_{\vert P\vert-n}^{\vert P\vert+1} \frac 1 x \ dx = \vert P\vert \log \Big( \frac{\vert P\vert+1}{\vert P\vert-n} \Big) \]

Hence, in answer to our original question, we need on average samples to see half of .

Random sampling

This post, very tangentially, relates to a quiz we set job candidates. If you are applying for a job at my current company, and somehow work out I work there, and find this, then you probably half deserve a job anyway.

Suppose you have a population \( P \) and some test as to whether a member of the population is good or bad. We want to find a “random good member”. There are two methods that come to mind:

  • Random sampling: pick a member at random, test it, if it passes return it, otherwise try again.
  • Randomly order \(P\) and work through the whole set, returning the first good member.

The first method has the virtue of being simple. The second method uses a lot of memory, if \(P\) is large. But on closer thought, what if the proportion of “good” members is rather small. The 2nd method is guaranteed to find a good member in . How slow can the first method be?

Java Enum definition

At the dying ends of the work day, I came across this page and was initially confused by

public abstract class Enum<E extends Enum<E>>

Doesn’t that look, well, horribly circular? Stackoverflow suggests this is common confusion.

LED lights and payoff

LED lights seem a no-brainer: instant on, good colour reproduction, extremely energy efficient (cool to the touch after minutes of use). But they are expensive. I wonder what the payback time is?

Graphs in Windsor

While spending a long weekend in Windsor, we found the following:

Maze problem

The instructions read, in part:

The puzzle challenge is to solve this unique one-way maze by travelling from the entrance Pawn to the central Castle – always going forwards.

It’s not quite clear to me what this means, but there is a good quote from Here:

The maze should be played as if you are a runaway train - always moving smoothly through the points forwards, and never able to reverse. This maze-movement idea was inspired by the nearby Windsor Railway Station, which brought the Royal Family to Windsor from Victorian times onwards. The maze paths sometimes run together in pairs, like a pair of railway tracks; sometimes they even run through “railway cuttings”, where the grass surface rises and falls around them.

I wonder if there is an elegant way to convert this to a directed graph problem? I couldn’t see how to do this cleanly, as in the problem you must pass through a “node”, not reverse direction.

Anyway, here is a solution. From the pawn you initially get yourself into an outer “loop”, from which only the King is accessible. As we can only visit the King once, we don’t want to get back onto the outer loop in the same direction.

  • So, we take the loop which the King is on, and reverse direction. Two choices here.
  • We can then either go to the Bishop:
    • In we pass out by the “Raised Banks” then we’re back on the outer loop, which is no good.
    • So we must reverse direction again, and head in the direction of the Knight.
    • We can skip past the knight back onto the outer loop, but that’s wrong.
    • So visit the Knight.
    • Then take the branch down to the Queen, and then to the Castle. Task complete.
  • Or we loop back down towards the Pawn, and either:
    • Head towards the Knight, but we can’t visit the Knight, so we end up back up near the Bishop, and we’ve just taken a loop.
    • Take the earlier branch, and head up the Castle, but following the track, we’re forced to visit the Queen, and then head to the Castle. But we’ve missed pieces.

So, the only solution is King, Bishop, Knight, Queen, Castle. And once we’ve heading to the Knight, we only have a choice of which direction to pass through the Knight “station” in, everything else is determined. Before that we can loop about a bit if we wish.

RAII in C#/Java?

More a note to a future self than aything else… In C++ we have the notion of Resource Acquisition Is Initialization which I think I have internalised as the following:

Any “resource” (memory, file, handle etc.) should be “owned” by an object. An object’s constructor should allocate the resource, and the destructor should free the resource (with some appropriate choices then made for what copy/move constructors/assignments should do). By use of C++11 unique_ptr and shared_ptr this model can be extended to pointers.

Notice I didn’t mention “exception”, though one major additional use of RAII is exception safety: if an exception is thrown, then all objects currently in scope will have their destructors run, and so any resources will be safely released.

In C#/Java we have garbage collection. This takes care of memory, but not any other “resource”. In C# there is the IDisposable interface, and I was for a while confused by what this did. I am not alone, judging from SO:

For me, the following is a good way to think:

  • Imagine going back to programming in C. You allocate memory with malloc and then free it later with free. Similarly, if you open a file, you need to later close it.
  • A managed language (let’s say C#) removes the need to free anything. But it does nothing for the file open/close issue.
  • So we must manually close() a file, say.
  • How then to handle exceptions? Use the try ... catch ... finally idiom!
  • Then the IDisposable interface in C# should be thought of as enabling the syntactic sugar of the using command. Similarly the AutoCloseable interface in Java 7.

What of destructors? Don’t use them: they are only of use if you need to free “unmanaged resources”, and if you need to do that, you’ll know it.

So why the “IDisposable pattern”? This seems really to exist merely to support finalizers (aka destructors). However, if your class is not sealed, then maybe a derived class will need a destructor. The pattern then exists to ensure unmanaged resources are always freed, but from the destructor no managed resource is freed (as they might have already been freed by the garbage collector).

All in all, that seems about as much work as writing a destructor in C++!

BF interpreter in C++

BF is an amusingly named, impractical, Turing complete programming language. It is very like an actual Turing machine: you have an infinite (originally, one-sided, 30K long) array of (8-bit) values, with commands to increase/decrease (+/-) the cell, or move left/right (</>). Input (,) and output (.) are handled. Loops are given by [] and implement a “while cell is non-zero” loop. So a trivial mapping to C is possible (see the Wikipedia article).

I thought it would be fun to give an object oriented implementation, for which I used C++ for a change. My implementation decouples the parsing of the programme from running it. The parse class reads the input in a single pass, using a stack for dealing with nested loops. It translates the input into a list of commands, represented by an abstract base class with an execute method, overloaded for the various different commands. This is the “command” pattern (closely related to the “strategy” pattern).

CodeJam 2015 Round 2

Round 2 came and went (and Round 3 and the “distributed” round). As an effort to practice with C#, I solved these problems in C# (definitely slower than using Python, and actually, the end running time is not that much quicker, for these sort of algorithmic / numerical problems).

There’s a definite step-up in difficulty here! Maybe I kid myself I could have done problem A and some “small” problems quickly… that people did them all in 150 minutes is quite scary. The Official Contest Analysis is rather brief (from G+, it seems the organisers were busy with the distributed round; or maybe, if we’ve come this far, shorter answers are all we need ;-> ) I think my background in pure Maths shows here: I found problem D quite easy, as I found analysing the special cases off-line (pen and paper) to be not so hard. Problems B and C (in “large” form) were hard, I thought. (I missed the trick with problem B; and using an off-the-shelf Graph package is obviously the fast way to do Problem C.)

See the code on GitHub.

Thoughts about C#

On the basis that knowing a modern OO language would be good, and on the basis that everyone and their granny knows JAVA, I thought I’d look at C#. Getting going with C# was a bit tedious: I end up looking things up endlessly, and puzzling over things for a while before they “click” and seem obvious (which is always annoying, as then I’m left wondering why it wasn’t obvious from the start). You wait for muscle memory to build…

Example: A small mistake lead to some musings on how sort algorithms are implemented: StackOverFlow Question (the actual problem came from thinking about orderings in Python and C++ too much, where you specify a strict weak ordering, basically “less than”, and then porting that to a CompareTo method and missing a corner-case. This came up in a CodeJam problem.)

(Aside: Stackoverflow never ceases to amaze me: I asked a kind-of-silly question, which probably I could have thought about myself, and I get a wonderful, long, complete answer…)

LINQ

I do really like LINQ, and the way it reminds one of list comprehensions. Turning this…

var allEdges = new List<Tuple<T,T>>()
foreach(var edge in edges)
{
	foreach(var e in graph.AsDirectedEdges(edge.Item1, edge.Item2))
	{
		allEdges.Add(e);
	}
}

into this…

allEdges.Add(from edge in edges
	from e in graph.AsDirectedEdges(edge.Item1, edge.Item2)
	select e);

is rather nice. (The 2nd snippet uses a simple extension method to Add; again, extension methods seem pretty neat.)

Generics

However, give me templates from C++ anyday! I really (really) like the duck-typing nature of templates: it’s sort of like using a dynamic language, but with compile time checking. (Of course, one also has to enjoy page long error messages… Maybe “concepts” will arrive one day.) That C# just doesn’t let you do

public static MyMethod<T>(T one, T two)
{
   return one + two;
}

is very tedious! Similary, the limitations of using (i.e. it’s lack of being a replacement for typedef) in C# are annoying: I typically like setting up private typedefs in a class, to save typing out long class names. Or, giving Tuple<T,S> a more domain-specific name: using DirectedEdge = Tuple<T,T> for example.

An analogue of namedtuple from Python would be neat: a simple way to setup a struct or class which just needs some fields, and default ordering / hashing etc. C# does do this with anonymous types, but the result is, well, anonymous…

Const

It would also be nice to have something similar to C++ const. readonly is too limited, and class only. A readonly List<T> can still have its items changed, and that’s not my definition of “readonly”! For a statically typed, compile time checked language, decorating methods with const is, I think, a wonderful thing: it’s another way to signal to the compiler (and other people reading your code) your intentions.

Async

Is next up to be toyed with…!

Disjoint Paths; Implementation Issues

Mostly as an excuse to learn C# I implemented the path finding algorithms. Running some tests (random graphs, and then computing cuts from found disjoint paths, as to verify a “cut” is easy!) caught some implementation issues. All obvious with hindsight.

Notes on the implementation

See the file on GitHub. I wanted to use interfaces, and extension methods, as they are features of C# which differ from e.g. C++. In hindsight, I think an abstract base case Graph would have been more sensible, with DirectedGraph and UndirectedGraph inheriting from this.

  • That I made the interface explicit is annoying, as you have to cast to the interface. (In this case, again it’s more logical to implicitly implement, as a DirectedGraph should have a GetNeighbours method.)
  • I do like extension methods, though I can see that keeping track of them in a large project could be dangerous. It is odd that for an extension method on the Interface, an explicit cast isn’t needed (compared to the previous point). Of course, LINQ wouldn’t work without this…

Efficient path finding

The Ford-Fulkerson algorithm makes essential use of “unflowing” flow back along a path. In our case, this allows us the new path to make use of an old path in reverse. That, at the end of each step, we really do get a set of edge disjoint paths is slightly tricky (and I didn’t prove this– in the full FF algorithm, it follows by noting that the “flow constraints” always hold).

In my implementation, I simply ignore the issue, and store “paths” as simply a set of (directed) edges. Then, at the end, we reconstruct the paths (using that each edge can occur in only one path, and that a simple greedy algorithm works.) The slightly subtle issue here is shown in the following diagram:

Example

Namely, we can end up with loops which are divorced from any actual “path”. Once you realise this, we simply need to ignore it at the final step. In the algorithm itself, we can’t run into a problem, as we can always use the “loop” in reverse, when finding a new path, and having a “loop”, by definition, means that we don’t have to worry about where we start: we can always access all the vertices.

Depth-first or breadth-first?

In e.g. the https://en.wikipedia.org/wiki/Depth-first_search on depth-first search, the word “backtracking” is used a lot. So a rather naive way to implement depth-first search might be a recursive algorithm. This would allow us to save “state” on the stack: at each vertex, where in the list of neighbours we had got to, and, encoded in the call stack, the partial path.

For simple route finding, a rather different approach is normally used. Suppose first I simply want to know if I get to vertex target from start. We’d maintain a “stack” (in the abstract datatype sense) of vertices to visit, and a set of visited vertices. At each turn:

  • We pop off a new vertex from the stack.
  • If we haven’t visited this vertex, then add to our visited set, check if it’s target and stop if so.
  • Otherwise the add to the stack all the (unvisited) neighbours of the current vertex

That we use a “stack” means this is depth-first. And we visit each vertex exactly once. Rather than keeping track of state for each vertex, we simply have our set of visited vertices, \(O(V)\) size, and a stack, which in a naive implementation can get as large as \(O(E)\).

  • A less naive approach is to keep track of a set of “to visit” vertices as well– we only add to the stack if we haven’t visited the neighbour, and also if we haven’t already marked it as “to visit”.
  • Further thought then shows that we can merge the “to visit” and “visited” sets, so we still just have one set and one stack, now both using at most \( O(V) \) space.
  • At worst, we do have to at least examine each edge, so we have an \(O(E)\) time performance.

If we want to actually find the path, then we can keep track of not only if a vertex has been visited, but also what the “parent” vertex which lead to the vertex is. (I use this terminology as we are effectively finding a spanning tree rooted at start). We can then work backwards from the target vertex, using here that we visit each vertex just once, and so we really do build a tree, and hence there is a well-defined path back to start.

Replacing the “stack” with a “queue” gives a breadth-first search instead.

The problem…

I implemented the above without really thinking. But it doesn’t work for the vertex-disjoint path case, because of the following example:

Example

Here we have one path already, the straight line. It’s possible to find a new vertex-disjoint path:

Example

Notice that is traversing the “residual graph”, shown on the left, we have the visit the middle vertex twice. This makes route finding somewhat more tricky, as our “history” changes the permissible future paths. In a full backtracking approach, this would be easy to implement, as we’d always have a “local copy” of the current path.

A simple way to implement this in a modification of the above. The observation is that we only need to visit certain vertices twice and never more so. So we simply tediously keep track of if we’ve coming back for a 2nd time, and if so, remember where we came from on the 1st go. The problem of not keeping track of this extra information is shown here:

Example

Here on the 2nd visit to the middle vertex, we over-write the original “parent” and end up in an infinite loop.

Disjoint Paths

A few notes on paths in graphs: finding the maximum number of edge/vertex disjoint paths, and an application to finding disconnections.

Edge disjoint paths

Consider a graph, for the moment undirected. We wish to find the maximum number of edge disjoint paths from a fixed vertex \(s\) to a fixed vertex \(t\). We can do this with the Max-Flow Min-Cut theorem and the Ford-Fulkerson algorithm, at least in the directed case, but let’s specialise this to our case.

AdaBoost and classification

Motivated by another nice post from Jeremy Kun I’ve had a play with AdaBoost. Like Jermey I used the “Adult Census Dataset” and then also the “Titanic” dataset. The latter requires some data munging, and some decisions about how to process the data.

Once you understand what AdaBoost does, and how Decision Stumps/Trees work, it seems reasonable to now use scikit-learn (which can often make you seem like a trained monkey, plugging things into your data). It’s worth noting that an out the box decision tree works almost perfectly on the “Adult” dataset, but much less well on the Titanic Dataset (where AdaBoost and decision stumps now don’t look so bad!)

As ever, Python feels a bit slow for this sort of thing, but also nice because it’s interactive, and well aimed at handling data.

GitHub Repository

Ising model and image denoising

As a final application of MCMC methods, you can’t escape the Ising Model, and it’s cute application to denoising images (of a text-like nature).

Denoising

Ipython notebook

Removing outliers

So, linear regression isn’t so exciting. What about removing outliers? I do find this, just a little, magical:

Outliers

We define a model where a data point \( y_i \) can either be from a normal linear regression model, or just a random (normally distributed) number. We then build into our model indicators \( o_i \in \{0,1\} \) to indicate if this is an outlier or not. Then sample the posterior distribution with an MCMC sampler, integrate out the \(o_i\) and you have your line of best fit. Fix \(i\) and integerate out everything but \(o_i\) and you get an estimate of probability the point is an outlier. The above plot shows those points estimated above 90%.

Rest is in an Ipython notebook on GitHub. This also gave me a chance to play with my own modified Gibbs sampler, which works nicely, if I say so myself.

Regression from a (simple) Bayesian perspective

How might we tackle simple regression from a Bayesian perpsective? Our model will be that we are given points \( (x_i)_{i=1}^n \) (which we assume we know, at least to a very high accuracy) and dependent points \( y_i^{re} = \alpha + \beta x_i \), but we only observe \( y_i \) where \( y_i = y_i^{re} + e_i \) where the \( e_i \) are our “uncertainties” (I like the line: “if they were errors, we would have corrected them!”, see footnote 10 of arXiv:1008.4686) usually modelled as iid \( N(0,\sigma^2) \). The likelihood is then

\[ f(y|\alpha,\beta,\sigma) = \prod (2\pi\sigma^2)^{-1/2} \exp\Big( -\frac{1}{2\sigma^2} (y_i - (\alpha + \beta x_i))^2 \Big) \]

Finding the MLE leads to Least Squares Regression. A simple Bayesian approach would be to stick some prior on $\alpha, \beta, \sigma$, but of course, this raises the question of how to do this!

Anyway, another Ipython notebook which develops some of the basic maths, and then uses emcee and the Triangle Plot to make some simulations and draw some plots of posterior distributions.

Dabbling in Bayesian Statistics

Bayesian statistics appeals to me both because it seems more “philosophically correct” than frequentist arguments (I have this data, and want to make an inference from it, rather than worrying about data which “might” exist, given other circumstances). Also, Bayesian approaches lead to some interesting computational issues, which are interesting in their own right.

TL;DR: In the process of thinking about MCMC methods in sampling from posterior distributions, I became interested in the choice of prior distributions. Emerging from a lot of reading, you can view my IPython Notebook on finding an “invariant” prior: if you have a natural family of transformations on your “data space” which is reflected in transformations in the parameter space, then I argue there’s a natural reason to expect certain invariance in the prior.

For a simple example, consider normally distributed data with known variance. If you translate your data by, say, 5 units, then you would expect your inference about the (unknown) mean to also be exactly translated by 5 units (but to otherwise be the same). This leads to a uniform (improper) prior. I treat the case of unknown mean and variance, which leads to the Jeffreys prior, which is regarded as not so great in the 2 parameter case. Hey ho. This was also a good excuse to learn and play with SymPy.

Enumerate in C++

One of the fun things about C++11 is that minor changes allow you to write somewhat more expressive (aka Pythonic) code.

A common pattern is to iterate over an array while also knowing the index you are at. Python does this with the enumerate keyword:

for index, x in enumerate(container):
   print("At {} is {}".format(index, x))

In C++ the idiomatic way to do this is perhaps:

std::vector<int> container;
...
for (int i = 0; i < container.size(); ++i) {
   std::cout << "At " << i << " is " << container[i] << "\n";
}

This has its flaws though (and is tedious to type). Using g++, if you have warnings on, -Wall, then you’ll see complaint, as the type of container.size() is, typically, of size std::size_t which on my machine is unsigned long long (aka an unsigned 64-bit number). Warnings aside, as int is on my machine a 32-bit unsigned number, we could potentially have problems if the size of container is more than around 2 billion (unlikely, but possible these days).

Jam 2012 Qualification Round

Last one of these for a while. Problem D became an obsession. As ever, links to: Official Contest Analysis and my code on GitHub.

Problem A, Speaking in Tongues: Could be solved with pen and paper, as all the information you need is sneakily in the question.

Jam 2012 Round 1A

Wrapping up with these… The final problem was rather hard, and in the end was an exercise in profiling… As ever, links to: Official Contest Analysis and my code on GitHub.

Problem A, Password Problem: You have typed A characters of your password, and the probability you typed letter i correctly is \( p_i \), for i=1,...,A. You can press backspace from 0 to A times and type again, or just give up and retype. All your typing will from now on be 100% accurate, and pressing enter counts as a keypress. For each strategy compute the expected number of keypresses needed (if you get the password wrong, you’ll have to retype it) and return the lowest expected number of keypresses needed. Your password is B characters in total.

Jam 2015 Round 1C

Urgh, so failure! Some silly (stupid) errors meant my good start of getting problem A out in 16 minutes didn’t get me anywhere, as a silly not checking the boundary conditions killed problem B, and not thinking on paper long enough got to problem C. 3 hours and less stress, and it would have been 100/100, but I guess everyone says that about exams. (Of course, given the competition, Round 2 was going to be the end anyway).

As ever, links to: Official Contest Analysis and my code on GitHub.

Problem A, Brattleship: It’s in your brother’s interest to drag the game out for as long as possible: once he says “hit” we move to a 2nd phase which can only end more quickly.

  • So the worst case is when the brat says “miss” until he has no choice but to say hit.
  • You want to chop the board up into regions where the ship cannot hit: so strips of width W-1. So we hit points W, 2*W, 3*W, ...
  • Do each row, and then in the final row, at the final point, brat must say “hit” and we move to a 2nd algorithm of finding the ship.

Jam 2015 Round 1B

Busy at 5pm Saturday when this ran (so all the eggs in the final basket of round 1C). Under timed conditions, I did problem A very slowly, and B-small in the time, so would have qualified at around place 770. B-large took a bit longer, and Problem C wasn’t really looked at in the time limit. Hard problems…

As ever, links to: Official Contest Analysis and my code on GitHub.

Problem A, Counter Culture: How fast can you get from N from 1 if you are allowed moves of: say one more than the last numbers; or reverse the decimal number. E.g. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 21, 22, 23 is the quickest route to 23.

Codingame

I’ve spent some time on Codingame which is an on-line Competitive programming site, based around computer games. (Though only loosely: some puzzles are just puzzles like Google Code Jam, and the “games” are often pretty obviously graph or search based puzzles). But it’s kind of fun: certainly the more interactive puzzles, where your solution has to respond to unknown, almost real-time, inputs, is entertaining.

LuxRender of my profile picture

Profile Picture

As a change, here is a puzzle I never quite got to the bottom of a few years ago. Just some flattened disks arranged in a circle and then rendered using LuxRender. I rather like the effect. There are two ways to make a cylinder shape: either use the inbuilt primitive objects (so a cylinder with disks top and bottom) or export a cylinder from Blender as a mesh. Actually, I rolled my own using a python script:

Jam 2013 Round 1B

As ever, links to: Official Contest Analysis and my code on GitHub.

I did this under timed conditions, and would have just qualified. A silly error was all which stood between B large and me…

Problem A, Osmos: Start with A and \( (x_i)_{i=1}^N \) integers. You can absorb one of the \( x_i \) if it’s smaller than A, and then A grows by \( x_i \). Help Armin to be able to absorb all the numbers by adjusting the initial set:

  • You can add any new number;
  • You can remove a number.

What is the least number of moves to get a valid set?

Jam 2015 Round 1A

So, I didn’t do this one live, as it ran at 2am UK time… I did however, for fun, try this under timed conditions a few days later, and I didn’t do great, but got enough to qualify (did A and B with 3 silly mistakes from B, then C small, and stupidly didn’t think, and tried my slow algorithm on C large. As large problems are one-shot, it would have been game over, and I’d be around 500 in the ranking.)

As ever, links to: Official Contest Analysis and my code on GitHub.

Problem A, Mushroom Monster: Easy, and my solution doesn’t differ from the official analysis.

Jam 2014 Round 1C

As ever, links to: Official Contest Analysis and my code on GitHub.

What did you need to do to progress? Score 30 points in 100 minutes, or more points. So doing problems A and B would be enough. Again, tackling the “small” problems quickly is worthwhile.

Problem A, Part Elf: At generation 40, everyone is 1 or 0 elf. So at generation 39, everyone is \( \frac12 (a+b) \) elf, where \( 0 \leq a,b \leq 1 \). By induction, at generation \( 40-n \) everyone is \( a/2^n \) elf, for some integer \( 0 \leq a \leq 2^n \). So read in P and Q, use Euclid’s algorithm to find the gcd and hence write P/Q in lowest terms, and then we need \( P/Q = a / 2^{40} \) so Q should be a power of 2, less than or equal to 40, and we need \( P \leq Q \).

Jam 2013 Round 1A

As ever, links to: Official Contest Analysis and my code on GitHub.

I did this under timed conditions: a mild disaster, but then it was for everyone back in 2013. If I had been clinical, I could have solved A, B-small and C-small, which would have been enough.

Problem A, Bullseye: Draw some concentric rings, with a fixed amount of paint. A bit of maths shows that ring \( n \in \{1,2,3,\cdots\} \) uses \( 2r+4n-3 \) units of paint, so we want the maximal \( N \) with \[ t \geq \sum_{n=1}^N 2r+4n-3 = 2rN + 2N(N+1) - 3N. \]

Jam 2014 Round 1B

As ever, links to: Official Contest Analysis and my code on GitHub.

Of current interest is how to progress: top 1000 people go through, and for that you’d need 42 points in any time faster than basically the whole 2.5 hours. This seems more reasonable…

Problem A, The Repeater: Given some input strings and Omar can make a move: he can pick one string, and one character in that string, and either repeat it once more (so “abc” -> “abbc” or “aabc” or “abcc”) or if the character is already repeated, he can delete one copy (so “aabcc” -> “abcc” or “aabc”). Can he make all the strings the same, and if so, what’s the minimal number of moves to do so?

Jam 2014 Round 1A

As ever, links to: Official Contest Analysis and my code on GitHub.

Of current interest is how to progress: top 1000 people go through, and for that you’d need 55 points in a reasonable time. That is, solve all but the tricky statistics based question, Problem C.

Problem A, Charging Chaos: The problem is the find a “mask” which when xored with each outlet gives each device. Of course, what makes this tricky is that there is also an unknown permutation to find, to map the outlets bijectively to the devices. But you can find these iteratively:

  • For the first bit, try the mask having bit 0: then you partition the outlets into two sets, those with bit 1 having 0, and those with bit 1 being 1. Also partition the devices similarly, and then the sets need to match in size for a bijection to exist. Similarly if the mask has bit 1 (then 0 for outlet maps to 1 for device).
  • Then carry on, further subdividing the sets
  • This gives a Backtracking algorithm, or, what amounts to the same thing, a Depth First Search with pruning.
  • This is rather different from the official solution.

Code Jam Qualification 2015

As expected, flight back from Boston to Leeds via Dublin resulted in zero sleep, and so hopes were not overly high for the 2015 Qualification Round. But, only 20 points needed to progress, which can’t be too hard, right?

Anyway, in the end, I got 100%, and seem to be the highest ranked UK participant. This, one suspects, won’t last…!

Some code on GitHub. On the todo list is to update this post with some comments, but my solutions are pretty similar to the Offical Analysis.

Jam 2008 Round 1B

I found this to be somewhat harder than round A. Also, my solutions in Python, running on my 2013 era laptop, are rather slow, so really a C++ or similar implementation would be needed, especially on 2008 era hardware.

Again, The Official Contest Analysis is a good writeup, so I won’t say a great deal. See the code on GitHub.

Code Jam 2013 Qualification

Continuing, here is the 2013 qualification round.

As ever, the Official Analysis is very good, and similar to my approaches. Some code on GitHub.

Code Jam 2014 Qualification

While on holiday, what better way to unwind than solve some old Code Jam puzzles?

As ever, the Official Analysis is very good, and similar to my approaches. Some code on GitHub.

Jam 2008 Round 1A

These seem to me to be essentially pure mathematics, in that once I understood the problem, and the maths behind it, the implementation in code was almost trivial.

The Official Contest Analysis is a good writeup, so I won’t say a great deal. See the code on GitHub.

Problem A, Minimum Scalar Product: The only comment I have is that the formal “proof” the contest analysis gives seems overly complicated to me.

Google Code Jam

I was hoping to take part in Google Code Jam this year, but for much of Saturday the 11th I’ll be transatlantic, flying back from Boston to Leeds, and I’m not sure I fancy my chances so much with jet-lag, no sleep, and much reduced amount of time…

But anyway, I thought it would be fun to look at past problems, and also to try to solve them in C#, as a way of learning more about the language. I must say that I’ve found it useful to solve the problems in Python– it’s nice having an iterative environment, and Python is generally fast enough.

So, here are my thoughts on the 2008 Qual round. See my code on GitHub.

Hello World

I had a need to update and move my old academic website so thought I would take the plunge and host a site on GitHub. Here is an aide-memoire for myself as to how I did this:

  • How to get a website for your GitHub project: Pages

This blog is built using Jekyll: