Blog

Parsing XML via SAX in Python

I've worked with XML before (in Java), but always with small files and the Document Object Model. Now I'm faced with multiple gigabytes of XML derived from OpenStreetMap, from which I need only a small amount of data, so some other method is required. Step forward the Simple API for XML (SAX). This is an event-driven API: the XML parser calls a "handler" object with information about tags opening and closing, and the character data in between.

In Python, the standard library supports SAX parsing. You need to sub-class (or duck-type, implementing the interface of) xml.sax.handler.ContentHandler. It seems that duck-typing is frustrating, as you need to implement the whole interface, even the methods you never expect to be called.
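As a minimal sketch of the sub-classing route, the handler below overrides only the two callbacks it needs and inherits do-nothing defaults for the rest. The class name `TagCollector`, the helper `collect_tags` and the tiny sample document are all made up for illustration; the sample only imitates the general shape of OSM XML (`node` elements containing `tag` elements with `k` and `v` attributes).

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class TagCollector(ContentHandler):
    """Hypothetical handler: collect (k, v) pairs from <tag> elements inside <node>."""

    def __init__(self):
        super().__init__()
        self.in_node = False
        self.tags = []

    def startElement(self, name, attrs):
        # Called by the parser each time an element opens.
        if name == "node":
            self.in_node = True
        elif name == "tag" and self.in_node:
            self.tags.append((attrs["k"], attrs["v"]))

    def endElement(self, name):
        # Called by the parser each time an element closes.
        if name == "node":
            self.in_node = False


# Invented sample data, only shaped like an OSM extract.
SAMPLE = b"""<?xml version="1.0"?>
<osm>
  <node id="1" lat="53.8" lon="-1.5">
    <tag k="addr:street" v="High Street"/>
    <tag k="addr:housenumber" v="12"/>
  </node>
  <way id="2"><tag k="highway" v="residential"/></way>
</osm>
"""


def collect_tags(stream):
    """Parse a file name or file-like object, returning the tags found on nodes."""
    handler = TagCollector()
    xml.sax.parse(stream, handler)
    return handler.tags


node_tags = collect_tags(io.BytesIO(SAMPLE))
```

Because the parser streams through the input and the handler keeps only what it wants, memory use depends on the data retained rather than on the size of the file.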

3rd May 2017

Open Street Map Data

I'm currently working on using some address information from OpenStreetMap to augment other open data sources. Here are some notes on using OpenStreetMap data in Python.

Getting and using OpenStreetMap Data

It seems like this is a bit of a pain. OpenStreetMap (OSM) uses a custom, XML-based format which is hard, if not impossible, for standard GIS software to read.

27th April 2017

Numpy vectorising

I asked this question on Stack Overflow, and got a nice answer, but one which I needed to think through a little more. Here are my conclusions.

My aim was to understand how to write robust code which could take scalars, but which would also do "as expected" on arrays. Let me expand a little on this, by using a slightly easier example than in the original question. Suppose f(x) is a function which takes a scalar and returns a scalar. I then want that if x is actually an array, of any shape, then f(x) will return an array of the same shape as x, namely the array obtained by applying f to every entry.
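To make the aim concrete, here is a rough sketch of two ways of getting that behaviour. The function `f_scalar` is a made-up example (it is not the function from the original question), and neither approach is claimed to be the one the Stack Overflow answer recommended.

```python
import numpy as np


def f_scalar(x):
    """Hypothetical scalar-only function: the Python `if` fails on arrays."""
    return x * x if x >= 0 else -x


# Option 1: wrap the scalar function. np.vectorize applies it entry by entry
# and returns an array with the same shape as the input.
f = np.vectorize(f_scalar, otypes=[float])


# Option 2: rewrite the body using array operations, so that scalars and
# arrays of any shape follow the same code path.
def f_direct(x):
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x * x, -x)


x = np.array([[1.0, -2.0], [3.0, -4.0]])
by_wrapper = f(x)
by_array_ops = f_direct(x)
```

The NumPy documentation describes np.vectorize as a convenience rather than a performance tool, so the second form is usually the one to aim for when the function can be expressed that way. Both forms return a 0-dimensional array, rather than a plain Python float, when given a scalar.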

24th March 2017

Working with numpy again

In my new job, I find myself working with numpy (after a break of a couple of years, and now professionally rather than as a hobby). Numpy is great, but it doesn't half require a little thinking upon occasion.

Suppose we have an array of 10 points in the plane. Should this be represented as a numpy array of shape (2,10) or (10,2)?
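A small sketch of what each choice looks like in practice may help frame the question. The variable names and the random data are invented for illustration, and the sketch does not argue for either layout.

```python
import numpy as np

rng = np.random.default_rng(0)

points = rng.random((10, 2))   # one point per row: shape (10, 2)
coords = points.T              # one coordinate per row: shape (2, 10)

# With shape (10, 2), indexing the first axis gives a single point.
first_point = points[0]        # shape (2,)

# With shape (2, 10), unpacking the first axis gives all x and all y values.
x, y = coords                  # each has shape (10,)

# Translating every point by a fixed offset. Broadcasting lines up trailing
# axes, so the (10, 2) layout works directly, while the (2, 10) layout needs
# the offset reshaped into a column of shape (2, 1).
offset = np.array([1.0, -1.0])
shifted = points + offset
shifted_t = coords + offset[:, np.newaxis]
```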

23rd March 2017

Random sampling to see a percentage of a population.

Given a population \( P \), sampled at random ("with replacement"), what's the expected number of samples I need in order to see 50% (or any fixed proportion) of the population?

I deliberately ask for "expected" because calculating expectations is often easier than getting a handle on the whole probability distribution. A trick is to exploit linearity: express the random variable of interest as a sum of random variables you can calculate the expectation of.
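One standard way to apply the linearity trick (a sketch, and not necessarily the decomposition used in the full post) is to split the total number of draws into waiting times: once i distinct members have been seen, each draw shows a new member with probability (n - i)/n, so the expected wait for the next new member is n/(n - i). Summing these waits gives the answer. The function names below are made up for illustration.

```python
import random
from fractions import Fraction


def expected_draws(n, k):
    """Expected draws, with replacement, from n members until k distinct are seen.

    By linearity of expectation this is the sum of the expected waiting
    times n / (n - i) for i = 0, 1, ..., k - 1.
    """
    return sum(Fraction(n, n - i) for i in range(k))


def simulate_draws(n, k, rng):
    """Draw with replacement until k distinct members are seen; return the count."""
    seen = set()
    draws = 0
    while len(seen) < k:
        seen.add(rng.randrange(n))
        draws += 1
    return draws


# Compare the formula with a simulation for half of a population of 100.
rng = random.Random(0)
trials = 5000
simulated_mean = sum(simulate_draws(100, 50, rng) for _ in range(trials)) / trials
exact_mean = float(expected_draws(100, 50))
```

For a large population the sum is close to n ln 2, roughly 0.69 n, when the target is 50%.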

3rd October 2016

Random sampling

This post, very tangentially, relates to a quiz we set job candidates. If you are applying for a job at my current company, and somehow work out I work there, and find this, then you probably half deserve a job anyway.

Suppose you have a population \( P \) and some test as to whether a member of the population is good or bad. We want to find a "random good member". There are two methods that come to mind:

Random sampling: pick a member at random, test it, if it passes return it, otherwise try again.
Randomly order \( P \) and work through the whole set, returning the first good member.

The first method has the virtue of being simple. The second method uses a lot of memory if \(P\) is large. But on closer thought, what if the proportion of "good" members is rather small? The second method is guaranteed to find a good member, if one exists, in \( O(\vert P \vert) \) time. How slow can the first method be?
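The two methods can be sketched as follows. The function names, the `max_tries` safety limit and the return convention (the member found plus the number of tests used) are choices made for this illustration, not part of the original quiz.

```python
import random


def sample_until_good(population, is_good, rng, max_tries=None):
    """Method 1: pick at random, test, and retry until a good member turns up.

    Returns (member, tries). If max_tries is given and exhausted, returns
    (None, tries). Without a limit this never returns when nothing is good.
    """
    tries = 0
    while max_tries is None or tries < max_tries:
        tries += 1
        candidate = rng.choice(population)
        if is_good(candidate):
            return candidate, tries
    return None, tries


def shuffle_then_scan(population, is_good, rng):
    """Method 2: randomly order a copy of the population, return the first good member.

    Returns (member, tests), or (None, len(population)) if nothing is good.
    """
    order = list(population)  # the copy is where the extra memory goes
    rng.shuffle(order)
    for tests, candidate in enumerate(order, start=1):
        if is_good(candidate):
            return candidate, tests
    return None, len(order)
```

If a proportion p of the population is good, the number of tries used by the first method is geometrically distributed with mean 1/p, and no fixed bound holds for a single run. The second method never tests more than \( \vert P \vert \) members.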

29th September 2016

Java Enum definition

At the dying ends of the work day, I came across this page and was initially confused by

public abstract class Enum<E extends Enum<E>>

Doesn't that look, well, horribly circular? Stack Overflow suggests this is a common confusion.

20th October 2015

LED lights and payoff

LED lights seem a no-brainer: instant on, good colour reproduction, extremely energy efficient (cool to the touch after minutes of use). But they are expensive. I wonder what the payback time is?

16th October 2015

Blog of Matthew Daws
