Posted on 17th May 2020
What feels like a lifetime ago (before some other genetics happened) I listened to Adam Rutherford's book of the week "How to Argue With a Racist" on Radio 4. It's now been so long that the Radio 4 link is dead, but you can buy the book or read the book review. The radio series was interesting, but one particular (slightly off topic) point I remember. Adam claimed (I mean, I think, it was a while ago) something like
Go back around 11 generations, and you will have ancestors who share none of your genetic material.
What argument might lead to this conclusion? My main initial conclusion was that I did not really know how inheritance works at the genetic level.
Humans have 22 pairs of chromosomes and one pair of sex chromosomes. I shall ignore the latter, as they are different, and complicate the argument. (Aside: I didn't know that the classical picture of a chromosome, as an "X" shape, actually only arises during the metaphase stage of mitosis: when chromosomes are being replicated). The non-sex chromosomes are called autosomes and in most, diploid, cells in the body you have a pair, 2 different copies, of each chromosome.
Inheritance (in my crude understanding) is described by Mendelian inheritance, which describes observations, and chromosome theory which gives a mechanism for these observations to occur. For us, what is important is that gametes contain only one copy of each chromosome, and at fertilisation, these are combined to give a diploid cell.
In our first model, for each chromosome, a (uniformly) randomly chosen copy will end up in the gamete. Thus, from your father (F) and mother (M) you will receive exactly one copy of each pair. However, for each chromosome, that copy could have either come from your father's father (FF) or your father's mother (MF), and similarly from your mother' father (FM) or your mother's mother (MM).
For example, there is a \( (1/2)^{22} \approx 2.4\times 10^{-6} \) chance that FF gave you no chromosomes. How do we perform the calculations for further back up the family tree? 2 generations back, your father has 2 parents, 3 generations back 4 ancestors lead to your father, and \( n \) generations back \( 2^{n-1} \) ancestors. When \( n=6 \) we have \( 2^5=32 \) ancestors, and at least 10 of these cannot have contributed a single chromosome!
This leaves a bit of a puzzle though: how can we get genetic diversity if whole chromosomes are copied around? Indeed, this model is unreasonably simplistic, as in the process of Meiosis, chromosomal crossover occurs, mixing the two copies of the chromosome together. How this occurs is a little complicated, with the upshot that "geometry matters": genetic material which is close by is less likely to be mixed than further away material -- for more details see genetic distance. However, I shall model this using a simple uniform distribution: from your father, say, for each chromosome independently, you get \( p \) proportion of the 1st copy, and \( 1-p \) proportion of 2nd copy, of each chromosome, where \( p\sim U[0,1] \).
Now we have a genuine probability problem: for each chromosome, you will always have some fraction from every ancestor. However, that we have 22 chromosomes and not one simplifies things. There are 22 independent cross-over events, which in our simple model leads to 22 independent uniform random variables. To obtain the total fraction of chromosomes carries from one ancestor, we add these random variables up, which leads to the Irwin–Hall distribution. More formally, let \( (X_i)_{i=1}^{22} \) be iid \( U[0,1] \) random variables, and then we are interested in \( X = \frac{1}{22}\sum_{i=1}^{22} X_i \) which is the total proportion. Then \( \mathbb{E}(X) = \frac12 \) and \( \newcommand{\var}{\operatorname{var}}\var(X) = \frac{1}{12\times 22} \). From the central limit theorem we know that, if \( 22 \) were a large number, then \( X \) would be close to a normal random variable with the stated mean and variance. In fact, in this particular case, \( 22 \) is "large", and \( X \) is very close to being distributed as a \( N(1/2, 1/264) \) random variable.
There are already some simplifications here: for example, in human chromosomes it is not true that each chromosome has the same amount of "genetic information". We could make a further simplification, and suppose that actually \( X=1/2 \). That is, lets assume that cross-over exactly gives a 50/50 mix. Thus 2 generations back, you have exactly 1/4 of the genetic material from each 4 ancestors MM, MF, FM, FF, 4 generations back you share 1/8, and so forth. We have again removed any randomness.
Going back to the general case, let's ask the question: "going back \( n \) generations, what is the distribution of the minimal ancestor?" By this, I mean the following: we have \( 2^n \) ancestors, and each contributes a fraction of genetic material, say \( F_1,\cdots,F_{2^n} \) (these being random variables, with \( 0\leq F_i\leq 1 \) and \( \sum_i F_i=1 \)). I am interested in the distribution of \( M=\min_i F_i \). Let \( X \) be as above, a random variable distributed as the Irwin-Hall distribution with \( N=22 \). Then if \( M_F \) is the minimum contribution from an ancestor of your father, and similarly \( M_M \) from your mother, we have that \( M = \min(XM_F, (1-X)M_M) \). Notice that \( M_M, M_F \) are iid with \( 2^{n-1} \) ancestors. Further \( X \) is symmetric about \( \frac12 \), so if \( M_F, M_M \) has CDF \( G_{n-1} \), and \( X \) has density \( h \), \[ G_n(p) = \mathbb P(M\leq p) = 2\int_0^{1/2} h(t) G_{n-1}(p/t) \ dt. \] Notice that \( G_1(p) = 2\int_0^p h(t) \ dt \) (again by symmetry of \( h \)). In principle this allows us to compute \( G_n \) by recursion, but in practice I do not see how to do this.
However, it's not hard to perform a simulation, and get an idea of what \( G_{11} \) looks like. For example, the 95th percentile of the distribution of \( G_{11} \) is around \( 1.4\times 10^{-4} \). There is some debate around how many genes we have, but Wikipedia suggests around at most 330000. This leaves the minimal contribution from an ancestor 11 generations ago at around \( 1.4\times 10^{-4} \times 3.3\times 10^{5} \approx 46 \) genes. If we just look at protein-coding genes, then there are at most around 21300, so a minimal contribution of \( 1.4\times 10^{-4} \times 2.13\times 10^{4} \approx 3 \) genes.
So, I seem to have failed to quite decide why the magic "11 generations ago". I guess I need to buy the book, check the quote, and see is there is a footnote. At least I now know more about basic genetics than I did.
A fuller argument would need to consider the sex chromosomes, and also Mitochondrial DNA. Similarly, as Rutherford's programme already pointed out, going back many generations and assuming a pure tree structure for your family tree (that is, with no common ancestors at all) is basically just wrong. Some code on GitHub.