Posted on 3th October 2016
Given a population \( P \) and sampling at random ("with replacement") what's the expected number of samples I need to see 50% (or any fixed proportion) of the population.
I deliberately ask for "expected" because calculating expectations is often easier than getting a handle on the whole probability distribution. A trick is to exploit linearity: express the random variable of interest as a sum of random variables you can calculate the expectation of.
Sampling at random from \( P \), suppose we have seen exactly \( k \) members of \( P \). As each sample is independent, letting \( T_k \) denote the number of samples required to see a new member of \( P \), we see that \[ \mathbb P(T_k = j) = \left(\frac{k}{\vert P\vert}\right)^{j-1} \frac{\vert P\vert-k}{\vert P\vert} \] That is, a geometric distribution, and so \( \mathbb E(T_k) = \frac{\vert P\vert}{\vert P\vert-k} \). By convention, \( T_0=1 \).
Then, if I want to see exactly \( n \) members, the time needed is \( S_n = \sum_{j=0}^{n-1} T_j \). We can estimate the expectation by an integral, \[ \mathbb E(S_n) \approx \vert P\vert \int_{\vert P\vert-n}^{\vert P\vert+1} \frac 1 x \ dx = \vert P\vert \log \Big( \frac{\vert P\vert+1}{\vert P\vert-n} \Big) \]
Hence, in answer to our original question, we need on average \( 0.7 \vert P\vert \) samples to see half of \( P \).