OpenStreetMap Data

I’m currently working on using some address information from OpenStreetMap to augment other open data sources. Here are some notes on using OpenStreetMap data in Python.

Getting and using OpenStreetMap Data

It seems like this is a bit of a pain. OpenStreetMap (OSM) uses a custom, XML-based format which is hard or impossible for standard GIS software to read.

Data sources

  • http://wiki.openstreetmap.org/wiki/Planet.osm Gives links to download the world and various extracts of it.
  • http://wiki.openstreetmap.org/wiki/Overpass_API Details the “Overpass API” which allows targeted querying of data from the OSM database. (The OSMnx package seems great for making direct, small-scale queries.)

Geofabrik

http://download.geofabrik.de/ Offers downloads of regions of the world, either in OSM format, or converted to shapefile format. The latter can be loaded into geopandas, QGIS, etc., but appears, sadly, to be missing useful information. For example, I can locate a building and find the polygon for this (by its ID) in the shapefile, but there is no metadata attached to give me e.g. the address!

However, for off-the-shelf analysis using existing tools, this is perhaps where to start.

Format

The standard format is XML, either compressed, or repackaged in the binary PBF format (based on Protocol Buffers).

  • http://wiki.openstreetmap.org/wiki/OSM_XML Gives the XML format
  • http://overpass-api.de/output_formats.html Gives details of the JSON format which the Overpass API can return.
  • http://wiki.openstreetmap.org/wiki/Map_Features Gives details of the “features” the returned data can contain.

Libraries

I should probably be using pyOsmium, but there is no conda build, and being stuck with either a Windows box or a locked-down Linux box, I am scared to try to get it to build.

GDAL / OGR can be easily installed (or come for free with geopandas).

  • http://www.gdal.org/drv_osm.html Gives some brief details
  • A cookbook for GDAL.
  • I didn’t have a great deal of luck finding the details I wanted.
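
For what it’s worth, the basic pattern with the OGR OSM driver looks roughly like this. This is only a sketch: the file name is a placeholder, and I’m relying on the driver documentation for the layer name.

from osgeo import ogr

# "extract.osm.pbf" is a placeholder for a downloaded OSM extract; the OSM
# driver reads .osm and .osm.pbf files directly.
ds = ogr.Open("extract.osm.pbf")
layer = ds.GetLayerByName("points")  # other layers include lines, multipolygons, ...
feature = layer.GetNextFeature()
print(feature.items())               # all attribute fields of the feature, as a dict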

OSMnx

It seems that others have had similar thoughts to me. OSMnx looks to be a great tool for network analysis of road networks. The review paper is also very informative, with some nice background information.

With Anaconda, it’s a one-line install:

conda install -c conda-forge osmnx

This seems very nice for targeted downloads, but it’s a bit too “magic” for my tastes. The source code is very readable, however, and I’ve been learning a lot from it.
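
As a rough sketch of the kind of targeted download it supports (the place name is just an example, and the OSMnx API may have moved on since I wrote this):

import osmnx as ox

# download the drivable street network around a named place, via the Overpass API
G = ox.graph_from_place("Leeds, UK", network_type="drive")
print(len(G.nodes), len(G.edges))
ox.plot_graph(G)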

Numpy vectorising

I asked this question on Stackoverflow, and got a nice answer, but one which I needed to think through a little more. Here are my conclusions.

My aim was to understand how to write robust code which could take scalars, but which would also do “as expected” on arrays. Let me expand a little on this, by using a slightly easier example than in the original question. Suppose f(x) is a function which takes a scalar and returns a scalar. I then want that, if x is actually an array of any shape, f(x) returns an array of the same shape as x, namely the array obtained by applying f to every entry.

You can do this with np.frompyfunc, but it runs a Python loop under the hood, so we cannot expect to take advantage of the speed of numpy. Furthermore, we obtain an array of objects rather than a numeric array.
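
For example (using a throwaway function, not the f below), np.frompyfunc does give elementwise behaviour, but note the object dtype of the result:

import numpy as np

def g(x):
    # a scalar-only function, for illustration
    return x * x + 1

g_vec = np.frompyfunc(g, 1, 1)  # 1 input, 1 output
out = g_vec(np.arange(4))
print(out)        # [1 2 5 10]
print(out.dtype)  # object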

Instead, let me present the “numpy way” and then explain why and how it works. Firstly, the starting function:

import numpy as np

fixed_array = np.arange(10)

def f(x):
    return np.sum((x - fixed_array) ** 2)

Here fixed_array is global for simplicity. So what f does is to subtract each entry of fixed_array from x, square, and then sum; equivalent to sum( (x-t)**2 for t in fixed_array ). Passing an array into f does not do what we want: numpy either cannot work out how to compute x - fixed_array, or (if x happens to have a compatible shape) forms a pointwise difference we did not intend. Rather, I think what we really want to do is something like the following:

  • For simplicity, suppose x is also a 1D array (though in the end x should be allowed to be any shape.)
  • Form an array xx so that xx[i][j] = x[i] for all i,j
  • Form an array yy so that yy[i][j] = fixed_array[j] for all i,j
  • xx and yy have the same shape.
  • Compute zz = (xx - yy)**2 (pointwise) so that zz[i][j] = (x[i] - fixed_array[j])**2 for each i,j.
  • Sum over the final dimension, thus giving f(x[i]) in the ith position, as required.
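
Translating these steps literally (for the 1D case only) gives something like the following sketch, where the broadcast_to calls build xx and yy explicitly:

import numpy as np

fixed_array = np.arange(10)

def f_explicit(x):
    # literal version of the steps above; assumes x is a 1D array
    x = np.asarray(x)
    shape = (len(x), len(fixed_array))
    xx = np.broadcast_to(x[:, None], shape)   # xx[i, j] = x[i]
    yy = np.broadcast_to(fixed_array, shape)  # yy[i, j] = fixed_array[j]
    zz = (xx - yy) ** 2                       # zz[i, j] = (x[i] - fixed_array[j]) ** 2
    return np.sum(zz, axis=-1)

print(f_explicit([0.0, 1.0]))  # [285. 205.]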

Okay, so here’s the answer:

def fff(x):
    return np.sum((np.asarray(x)[..., None] - fixed_array) ** 2, axis=-1)

How does this work?

  • np.asarray(x) returns, if x is a scalar, a zero-dimensional array (of shape ()) with x as its entry; and if x is already “array like”, we get a genuine array.
  • Then for an array y, the slice y[...,None] does the same as (the perhaps clearer) y[...,np.newaxis]. The ... means “as many : as required”, and np.newaxis gives you a new axis. If y has shape (2,3,4) then y[...,None] has shape (2,3,4,1). See Indexing docs.
  • Write yy = y[...,None] and consider how Broadcasting applies to yy - fixed_array:
    1. We prepend 1s to the shape of fixed_array so it has the same ndim as yy
    2. We will get an output of size (2,3,4,n) where fixed_array is of length n
    3. If an input has dimension 1 then the single entry in that dimension will be used.
  • This gives us exactly what we want, because the entries of y are used in the first dimension(s), and the entries of fixed_array in the last dimension.
  • We then square entrywise, and sum over the final dimension, via the axis=-1 argument.
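
A quick sanity check (with arbitrarily chosen shapes) that fff really does behave like f applied entrywise:

import numpy as np

fixed_array = np.arange(10)

def f(x):
    return np.sum((x - fixed_array) ** 2)

def fff(x):
    return np.sum((np.asarray(x)[..., None] - fixed_array) ** 2, axis=-1)

x = np.random.random((2, 3, 4))
print(fff(x).shape)                                # (2, 3, 4), the same shape as x
print(np.isclose(fff(x)[1, 2, 3], f(x[1, 2, 3])))  # True
print(fff(5.0), f(5.0))                            # also agrees for a plain scalar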

See Notebook for a quick demo and some code.

Working with numpy again

In my new job, I find myself working with numpy (after a break of a couple of years, and now professionally, not as a hobby). Numpy is great, but it doesn’t half require a little thinking upon occasion.

Suppose we have an array of 10 points in the plane. Should this be represented as a numpy array of shape (2,10) or (10,2)?

Argument from kernels

Suppose I want to write a function which represents a two-dimensional kernel, for example:

def ker(x, y):
    return x + 2 * y

This doesn’t vectorise at all: I cannot do

p = np.array([1, 2])
ker(p)  # TypeError: ker() missing 1 required positional argument: 'y'

So instead we might write:

def ker(p):
    return p[0] + 2 * p[1]

Then, to vectorise across my array of 10 points, evaluating the kernel at each point, I need my array to have shape (2,10).

This is exactly the convention chosen by the kernel object which Scipy stats Gaussian KDE will produce, for example.
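
For example (a minimal sketch with random data), both the hand-written kernel and scipy.stats.gaussian_kde expect the (2, 10) layout:

import numpy as np
from scipy.stats import gaussian_kde

def ker(p):
    return p[0] + 2 * p[1]

points = np.random.random((2, 10))  # 10 points in the plane, one column per point
print(ker(points).shape)            # (10,): the kernel evaluated at each point
kde = gaussian_kde(points)          # gaussian_kde wants shape (# dims, # points)
print(kde(points).shape)            # (10,): a density estimate at each point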

Argument from broadcasting

Given my 10 points, I should be able to translate them all by the same amount, by just adding a point.

my_array + point

For this to work with the Numpy broadcasting rules, my array needs shape (10,2):

my_array = np.random.random(20).reshape((10,2))
point = np.array([2,3])
my_array + point

So if you’re working with things you’re thinking of as spatial “vectors”, the other convention seems most natural. This is exactly the convention chosen by e.g. Scipy Spatial KDTree.
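
For instance (a minimal sketch with random data), scipy.spatial.KDTree wants one row per point:

import numpy as np
from scipy.spatial import KDTree

points = np.random.random((10, 2))  # shape (10, 2): one row per point
tree = KDTree(points)               # KDTree expects (# points, # dimensions)
dist, idx = tree.query([0.5, 0.5])  # nearest neighbour of a single query point
print(dist, idx)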

What to do?

I guess you pick a convention, stick with it, and translate (e.g. using .T) as necessary.

A further hint is to read the broadcasting rules very closely and to know that indexing with None is the same as adding a new axis:

my_array = np.random.random(20).reshape((10,2))
new_array = my_array[:,None,:]
print(new_array.shape) # (10,1,2)
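
For example, that extra axis is what lets broadcasting compute all pairwise differences between the points in one go:

import numpy as np

my_array = np.random.random(20).reshape((10, 2))
# shapes (10, 1, 2) and (1, 10, 2) broadcast to (10, 10, 2)
diffs = my_array[:, None, :] - my_array[None, :, :]
print(diffs.shape)  # (10, 10, 2), with diffs[i, j] = my_array[i] - my_array[j]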

Random sampling to see a percentage of a population

Given a population \(P\), and sampling at random (“with replacement”), what’s the expected number of samples I need to see 50% (or any fixed proportion) of the population?

I deliberately ask for “expected” because calculating expectations is often easier than getting a handle on the whole probability distribution. A trick is to exploit linearity: express the random variable of interest as a sum of random variables you can calculate the expectation of.

Sampling at random from \(P\), suppose we have seen exactly \(k\) members of \(P\). As each sample is independent, letting \(T_k\) denote the number of samples required to see a new member of \(P\), we see that \[ \mathbb P(T_k = j) = \left(\frac{k}{\vert P\vert}\right)^{j-1} \frac{\vert P\vert-k}{\vert P\vert} \] That is, a geometric distribution, and so \(\mathbb E(T_k) = \frac{\vert P\vert}{\vert P\vert - k}\). By convention, \(0^0 = 1\), so this also covers the case \(k = 0\).

Then, if I want to see exactly \(n\) members, the number of samples needed is \(S_n = T_0 + T_1 + \cdots + T_{n-1}\), so \(\mathbb E(S_n) = \sum_{k=0}^{n-1} \frac{\vert P\vert}{\vert P\vert - k}\). We can estimate the expectation by an integral, \[ \mathbb E(S_n) \approx \vert P\vert \int_{\vert P\vert-n}^{\vert P\vert+1} \frac 1 x \ dx = \vert P\vert \log \Big( \frac{\vert P\vert+1}{\vert P\vert-n} \Big) \]

Hence, in answer to our original question, we need on average about \(\vert P\vert \log 2 \approx 0.69\,\vert P\vert\) samples to see half of \(P\).
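
As a sanity check, a quick simulation (population size and number of trials chosen arbitrarily) agrees well with the \(\vert P\vert \log 2\) estimate:

import numpy as np

rng = np.random.default_rng(0)

def mean_samples_to_see_half(pop_size, trials=200):
    # sample uniformly with replacement until half of the population has been seen
    counts = []
    for _ in range(trials):
        seen = set()
        count = 0
        while len(seen) < pop_size // 2:
            seen.add(rng.integers(pop_size))
            count += 1
        counts.append(count)
    return np.mean(counts)

print(mean_samples_to_see_half(1000))  # roughly 693
print(1000 * np.log(2))                # 693.1...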

Random sampling

This post, very tangentially, relates to a quiz we set job candidates. If you are applying for a job at my current company, and somehow work out I work there, and find this, then you probably half deserve a job anyway.

Suppose you have a population \( P \) and some test as to whether a member of the population is good or bad. We want to find a “random good member”. There are two methods that come to mind:

  • Random sampling: pick a member at random, test it, if it passes return it, otherwise try again.
  • Randomly order \(P\) and work through the whole set, returning the first good member.

The first method has the virtue of being simple. The second method uses a lot of memory, if \(P\) is large. But on closer thought, what if the proportion of “good” members is rather small? The 2nd method is guaranteed to find a good member in at most \(\vert P\vert\) steps. How slow can the first method be?

Let \(B\subseteq P\) be the set of bad members. The first method fails to find a good member in \(n\) tries with probability \[ \left( \frac{\vert B\vert }{\vert P\vert} \right)^n \] (the chance of repeatedly choosing a bad member).

  • So, if half your members are bad, then you need just 7 goes to be 99% sure of finding a good member.
  • If only 1% of your population is good then you need 459 trials to be 99% sure of finding a good member.
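
These counts come from solving \( (\vert B\vert/\vert P\vert)^n < 0.01 \) for \(n\):

import math

# smallest n with (bad_fraction)**n < 0.01
for bad_fraction in (0.5, 0.99):
    n = math.ceil(math.log(0.01) / math.log(bad_fraction))
    print(bad_fraction, n)  # 0.5 -> 7, 0.99 -> 459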

For the 2nd method, by “random ordering” I mean picking a member of the symmetric group of order \(\vert P\vert\) at random and applying it to the set. We can do this in time proportional to \(\vert P\vert\). The algorithm is simple: choose an unpicked member of \(P\) at random and take it as the next member of the random ordering, and then repeat. How long does it take to find a good member? The chance of choosing only bad members for the first \(n\) goes is \[ \frac{\vert B \vert (\vert B\vert -1) (\vert B\vert-2) \cdots (\vert B\vert - n+1)} {\vert P \vert (\vert P\vert -1) (\vert P\vert-2) \cdots (\vert P\vert - n+1)} \]

  • So this will be quicker than method one, always.
  • But as \(\vert P\vert\) becomes large, the limit is the same.
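
A quick numerical check of both failure probabilities (the sizes are arbitrary) shows method 2 is always at least as good, with the gap closing as \(\vert P\vert\) grows:

import numpy as np

def p_fail_method1(bad, total, n):
    # probability that n independent samples (with replacement) are all bad
    return (bad / total) ** n

def p_fail_method2(bad, total, n):
    # probability that the first n entries of a random ordering are all bad
    i = np.arange(n)
    return np.prod((bad - i) / (total - i))

for total in (100, 10_000, 1_000_000):
    bad = total // 2  # half the population is bad
    print(total, p_fail_method1(bad, total, 7), p_fail_method2(bad, total, 7))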

I’m not sure I’ve come to any conclusion. Method 1 is simple and fast if the good population is not too small. Method 2 needs some storage space, but is more predictable if \(\vert P\vert\) is not too large and the proportion of good members is very small. If \(\vert P\vert\) is very large and the proportion of good members very small, you probably need a better idea than simple sampling.

A more mathematical question presents itself. Suppose we do away with the good and bad members and simply ask:

Given a population \(P\), and sampling at random (“with replacement”), what’s the expected number of samples I need to see 50% (or any fixed proportion) of the population?