Posted on 29th September 2016
This post, very tangentially, relates to a quiz we set job candidates. If you are applying for a job at my current company, and somehow work out I work there, and find this, then you probably half deserve a job anyway.
Suppose you have a population \( P \) and some test as to whether a member of the population is good or bad. We want to find a "random good member". There are two methods that come to mind:
The first method has the virtue of being simple. The second method uses a lot of memory, if \(P\) is large. But on closer thought, what if the proportion of "good" members is rather small. The 2nd method is guaranteed to find a good member in \( O(\vert P \vert) \). How slow can the first method be?
Let \(B\subseteq P\) be the set of bad members. The first method fails to find a good member in \(n\) terms with probability \[ \left( \frac{\vert B\vert }{\vert P\vert} \right)^n \] (The chance of repeatedly choosing a bad member).
For the 2nd method, by "random ordering" I mean picking a member of the symmetric group of order $$ N = \vert P \vert $$ at random and applying it to the set. We can do this in time proportional to \( N \). The algorithm is simple: choose an unpicked member of \( P \) at random and take it as the next member of the random ordering, and then repeat. How long does it take to find a good member? The chance of choosing only bad members for the first \( n \) goes is \[ \frac{\vert B \vert (\vert B\vert -1) (\vert B\vert-2) \cdots (\vert B\vert - n+1)} {\vert P \vert (\vert P\vert -1) (\vert P\vert-2) \cdots (\vert P\vert - n+1)} \]
I'm not sure I've come to any conclusion. Method 1 is simple and fast if the good population is not too small. Method 2 needs some storage space, but is more predictable if \( P \) is not too large and the proportion of good members is very small. If \( P \) is very large and the proportion of good members very small, you probably need a better idea than simple sampling.
A more mathematical question presents itself. Suppose we do away with the good and bad members and simply ask:
Given a population \(P\) and sampling at random ("with replacement") what's the expected number of samples I need to see 50% (or any fixed proportion) of the population.