# Random sampling

29 Sep 2016This post, very tangentially, relates to a quiz we set job candidates. If you are applying for a job at my current company, and somehow work out I work there, and find this, then you probably half deserve a job anyway.

Suppose you have a population \( P \) and some test as to whether a member of the population is good or bad. We want to find a “random good member”. There are two methods that come to mind:

- Random sampling: pick a member at random, test it, if it passes return it, otherwise try again.
- Randomly order \(P\) and work through the whole set, returning the first good member.

The first method has the virtue of being simple. The second method uses a lot of memory, if \(P\) is large. But on closer thought, what if the proportion of “good” members is rather small. The 2nd method is guaranteed to find a good member in . How slow can the first method be?

Let \(B\subseteq P\) be the set of bad members. The first method fails to find a good member in \(n\) terms with probability \[ \left( \frac{\vert B\vert }{\vert P\vert} \right)^n \] (The chance of repeatedly choosing a bad member).

- So, if half your members are bad, then you need just 7 goes to be 99% sure of finding a good member.
- If only 1% of your population is good then you need 459 trials to be 99% sure of find a good member.

For the 2nd method, by “random ordering” I mean picking a member of the symmetric group of order at random and applying it to the set. We can do this in time proportional to . The algorithm is simple: choose an unpicked member of at random and take it as the next member of the random ordering, and then repeat. How long does it take to find a good member? The chance of choosing only bad members for the first goes is \[ \frac{\vert B \vert (\vert B\vert -1) (\vert B\vert-2) \cdots (\vert B\vert - n+1)} {\vert P \vert (\vert P\vert -1) (\vert P\vert-2) \cdots (\vert P\vert - n+1)} \]

- So this will be quicker than method one, always.
- But as becomes large, the limit is the same.

I’m not sure I’ve come to any conclusion. Method 1 is simple and fast if the good population is not too small. Method 2 needs some storage space, but is more predictable if is not too large and the proportion of good members is very small. If is very large and the proportion of good members very small, you probably need a better idea than simple sampling.

A more mathematical question presents itself. Suppose we do away with the good and bad members and simply ask:

Given a population and sampling at random (“with replacement”) what’s the expected number of samples I need to see 50% (or any fixed proportion) of the population.