Random sampling

Posted on 29th September 2016

This post, very tangentially, relates to a quiz we set job candidates. If you are applying for a job at my current company, and somehow work out I work there, and find this, then you probably half deserve a job anyway.

Suppose you have a population $ P $ and some test as to whether a member of the population is good or bad. We want to find a "random good member". There are two methods that come to mind:

Random sampling: pick a member at random, test it, if it passes return it, otherwise try again.
Randomly order $ P $ and work through the whole set, returning the first good member.

The first method has the virtue of being simple. The second method uses a lot of memory, if $P$ is large. But on closer thought, what if the proportion of "good" members is rather small. The 2nd method is guaranteed to find a good member in $ O(\vert P \vert) $. How slow can the first method be?

Let $B\subseteq P$ be the set of bad members. The first method fails to find a good member in $n$ terms with probability \[ \left( \frac{\vert B\vert }{\vert P\vert} \right)^n \] (The chance of repeatedly choosing a bad member).

So, if half your members are bad, then you need just 7 goes to be 99% sure of finding a good member.
If only 1% of your population is good then you need 459 trials to be 99% sure of find a good member.

For the 2nd method, by "random ordering" I mean picking a member of the symmetric group of order $$ N = \vert P \vert $$ at random and applying it to the set. We can do this in time proportional to $ N $. The algorithm is simple: choose an unpicked member of $ P $ at random and take it as the next member of the random ordering, and then repeat. How long does it take to find a good member? The chance of choosing only bad members for the first $ n $ goes is \[ \frac{\vert B \vert (\vert B\vert -1) (\vert B\vert-2) \cdots (\vert B\vert - n+1)} {\vert P \vert (\vert P\vert -1) (\vert P\vert-2) \cdots (\vert P\vert - n+1)} \]

So this will be quicker than method one, always.
But as $ P $ becomes large, the limit is the same.

I'm not sure I've come to any conclusion. Method 1 is simple and fast if the good population is not too small. Method 2 needs some storage space, but is more predictable if $ P $ is not too large and the proportion of good members is very small. If $ P $ is very large and the proportion of good members very small, you probably need a better idea than simple sampling.

A more mathematical question presents itself. Suppose we do away with the good and bad members and simply ask:

Given a population $P$ and sampling at random ("with replacement") what's the expected number of samples I need to see 50% (or any fixed proportion) of the population.

Random sampling

Categories

Recent posts