Posted on 28th January 2020
As part of my job, I have to give two lectures to postgraduate students, as part of a "training programme" for them. As we have no Mathematics PhD students, I will be speaking to (Astro-)Physicists and Astronomers. This leaves me with three choices, as I see it:
So, my talks will be "A Mathematician looks at statistics". These are some brief working notes: for the first talk at least I plan to give a "chalk-and-talk" with maybe some brief Python demonstrations.
What is "Statistics"? To quote Wikipedia:
Statistics is the discipline that concerns the collection, organization, analysis, interpretation and presentation of data.
Furthermore, we can broadly split the study of statistics into two areas:
Descriptive statistics which is the study of presenting data.
Statistical inference which, to quote, is "the process of using data analysis to deduce properties of an underlying probability distribution."
Rather crudely, one might say that the use of descriptive statistics is when you have access to all the data you need, and you wish to present it. For example, the university knows the home address postcode of every student: how might we present this information visually to gain some knowledge of where our students come from?
Again, rather crudely, statistical inference occurs when we do not have access to all the data, but only a "sample". The classical example is an election opinion poll: we contact, at random, a few thousand people, and ask them how they will vote in an election. From this "sample" (which is a tiny fraction of the "population" of all voters) we can somehow determine what the likely outcome of the election will be.
This is a rather crude picture, as statistical inference also occurs in, for example, the following two situations:
We are interested in coin flipping. We have a coin which we suspect is biased: more likely to land heads than tails. How might we test whether this suspicion is true or not?
We have luminosity data from two families of stars. How might we decide if the two families are "different", or not?
Neither of these cases fits easily into the "sample from a population" framework.
Let's go back to what Wikipedia said: "...deduce properties of an underlying probability distribution". In this first lecture, I want to first of all think about what a "probability distribution" is.
In Mathematical statistics, the idea of "probability" is central. In Mathematical probability, we have two key notions: a probability space, which is the set of all possible outcomes; and a probability, which assigns to each outcome (or set of outcomes) a number between 0 and 1, with the total probability being 1.
As is tradition in maths, having given a definition, we proceed to some examples:
We toss a fair coin, and obtain a head or a tail. The probability space is hence the "set" (a mathematical collection) \( \{ H, T\} \) and the probability is \[ \mathbb{P}(H) = \mathbb{P}(T) = 0.5. \] This represents that getting a head, or getting a tail, is equally likely (and then \( 0.5 \) is the only possibility if the total probability is 1).
We pick someone at random, and ask which month their birthday is in. The probability space is the set of all months (or the numbers 1 to 12, perhaps?). The probability assigns \( 1/12 \) to each month, as there should be no bias as to which month a randomly chosen person is born in. We might like to think about whether this is a realistic model of real people.
What does probability "mean"? This is a deep philosophical problem. As a mathematician, I am tempted to dismiss the question, and simply say that we have some axioms which we work with (this is the Kolmogorov view). But to do statistics, we have to have some belief that mathematical probability is an accurate model of the world. Without getting beyond my pay grade, a common justification of probability is given by the law of large numbers. In the coin example, that the probabilities are equal is reflected in the fact that if we toss a coin many times, then we would expect to get, "roughly", "on average", the same number of heads as tails. In the birthday month example, the situation is easier, as we could (in theory) find the birth month of everyone in the UK (for example) and then the probability is simply the proportion of each month which occurs.
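As a first brief Python demonstration, here is a minimal simulation of the coin example (the numbers of tosses are arbitrary choices for illustration): the proportion of heads should settle down towards \( 0.5 \) as the number of tosses grows.

```python
import random

random.seed(0)  # fix the seed so the demonstration is reproducible

for n in [10, 100, 1_000, 10_000, 100_000]:
    # Each toss is a head with probability 0.5
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(f"{n:>7} tosses: proportion of heads = {heads / n:.4f}")
```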
This naive view is dangerous however, because it is easy to think of situations where we wish to speak of "probability" but where it is neither possible to sample the entire population, nor possible to repeat an experiment many times (even in theory, never mind what is realistically possible). However, I shall say no more.
A nice class of examples comes from considering urn models. Consider a container (the "urn") which contains \( n \) red balls and \( m \) blue balls. If we draw a ball at random, the probability that the ball is red is \( n/(n+m) \).
We can extend this in lots of ways. Maybe there are more than two colours. Maybe we remove more than one ball. Let us introduce some more notation: if \( A \) is a set of events, then write \( \mathbb P(A) \) for the probability that some event in the set \( A \) occurs.
In this last example, we can think of \( R \) as being "the number of red balls removed". This is a "variable": it is a whole number, one of \( 0, 1, 2 \) or \( 3 \). However, it is a "random variable" as it doesn't have a fixed value, but instead takes each value with a certain probability. We write, for example, \( \mathbb P(R=2) \) for the probability of drawing exactly 2 red balls.
We can perform arithmetic on random variables. For example, if \( B \) is the number of blue balls drawn, then \( R+B=3 \) and so \( \mathbb P(R=2) = \mathbb P(B=1) \).
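As a quick sanity check of this kind of calculation, here is a short simulation (the particular urn, with 5 red and 7 blue balls, and the choice of drawing 3 balls without replacement, are purely illustrative): we compare the observed frequency of \( R=2 \) with the exact value.

```python
import random
from math import comb

n_red, n_blue, draws = 5, 7, 3   # illustrative values only
urn = ["R"] * n_red + ["B"] * n_blue

# Exact probability of exactly 2 red balls in 3 draws without replacement.
exact = comb(n_red, 2) * comb(n_blue, 1) / comb(n_red + n_blue, draws)

random.seed(0)
trials = 100_000
hits = sum(random.sample(urn, draws).count("R") == 2 for _ in range(trials))

print(f"exact P(R=2)     = {exact:.4f}")
print(f"simulated P(R=2) = {hits / trials:.4f}")
```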
How might we model the heights of people? Notice the word "model" here: I am already thinking about how to reduce complicated real-world data to a simple mathematical description. The key problem is that "height" is a continuous variable. We deal with this by saying that it is only possible to talk about the probability of a random variable falling in some interval. A standard model is the normal (Gaussian) distribution: if \( X \) is the height of a randomly chosen person, then \[ \mathbb P(a \leq X \leq b) = \int_a^b \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2} \,dx. \]
In this example, the probability depends on two parameters \( \sigma,\mu \). In the "urn" example, we chose the number of balls \( n \) and \( m \). At present, we think of these as fixed, but later we might want to "estimate" these parameters from some data.
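Here is a brief sketch of this in Python (the values \( \mu = 175 \)cm and \( \sigma = 10 \)cm are purely illustrative, not real data): the probability of a height landing in an interval is the area under the normal density, which we can read off from the cumulative distribution function.

```python
from scipy.stats import norm

mu, sigma = 175, 10   # illustrative parameter values, in cm

# P(170 <= height <= 180) is the area under the normal density between
# 170 and 180: the difference of the cumulative distribution function
# evaluated at the two endpoints.
prob = norm.cdf(180, loc=mu, scale=sigma) - norm.cdf(170, loc=mu, scale=sigma)
print(f"P(170cm <= height <= 180cm) = {prob:.4f}")
```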
Suppose we take our coin, and toss it twice. We now have a new probability space: \( \{ HH, HT, TH, TT\} \) of all 4 possible outcomes of two tosses. What should be the probability? It seems reasonable to think that each event should be equally likely, so each has probability \( 1/4 \).
The key idea here is "independence". We think of tossing one coin, and then tossing it again, and the result of the first toss does not affect the second. Let \( H_1 \) be the event of getting a head on the first toss, and \( H_2 \) getting a head on the second toss. Then \( H_1 \) is the set of outcomes \( HT \) or \( HH \), while \( H_2 \) is the set of outcomes \( TH \) or \( HH \). Thus \( H_1 \cap H_2 \) (the intersection) is the event that we get both a head on the first toss, and a head on the second toss, that is, \( HH \). So \[ \frac14 = \mathbb P(HH) = \mathbb P(H_1 \cap H_2) = \mathbb P(H_1)\mathbb P(H_2) = \frac12 \times \frac12. \]
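We can check this directly in Python by enumerating the (tiny) probability space:

```python
from fractions import Fraction

# The probability space for two tosses: each of the 4 outcomes has probability 1/4.
space = {"HH": Fraction(1, 4), "HT": Fraction(1, 4),
         "TH": Fraction(1, 4), "TT": Fraction(1, 4)}

H1 = {"HH", "HT"}   # a head on the first toss
H2 = {"HH", "TH"}   # a head on the second toss

def prob(event):
    """The probability of an event is the sum of the probabilities of its outcomes."""
    return sum(space[outcome] for outcome in event)

print(prob(H1 & H2))          # 1/4
print(prob(H1) * prob(H2))    # (1/2) * (1/2) = 1/4, so H1 and H2 are independent
```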
Definition: Two events \( A \) and \( B \) are independent if \( \mathbb P(A\cap B) = \mathbb P(A) \mathbb P(B) \).
We might like to think about how this could fail to be so. For example, let \( A \) be the event that there is a traffic accident at a certain road junction on a given day. Perhaps the probability of an accident on a Monday is \( 0.05 \), but the probability of an accident on the day after an accident has occurred is only \( 0.03 \) (perhaps drivers are more careful for a while after seeing an accident). Then whether an accident occurs today is not independent of what happened yesterday.
(You might like to think about how we might, in the real world, obtain the probabilities involved here. Google "regression to mean".)
The following is a "paradox": suppose I toss a fair coin 10 times, and get a head every time. Surely a tail is now "overdue", and so more likely on the next toss?
Of course, each toss is independent, and so the chance of getting a tail after 10 heads is 1/2, as it always is.
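A brief simulation illustrates this (the number of trials is simply chosen large enough that runs of 10 heads show up): among sequences of 11 tosses which begin with 10 heads, the 11th toss is still a tail roughly half the time.

```python
import random

random.seed(0)
trials = 1_000_000
runs_of_ten_heads = 0   # sequences whose first 10 tosses are all heads
then_tails = 0          # ...and whose 11th toss is a tail

for _ in range(trials):
    tosses = [random.random() < 0.5 for _ in range(11)]  # True means heads
    if all(tosses[:10]):
        runs_of_ten_heads += 1
        if not tosses[10]:
            then_tails += 1

print(f"runs of 10 heads seen: {runs_of_ten_heads}")
print(f"proportion followed by a tail: {then_tails / runs_of_ten_heads:.3f}")
```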
To make these sorts of calculations easier, we introduce some more notation: Given sets of events \( A \) and \( B \), write \[ \mathbb P(A|B) \] for the probability of \( A \) happening, given that \( B \) has happened. This is called conditional probability: the probability that \( A \) occurs, conditional on \( B \) having occurred. In the traffic accident example, we had \[ \mathbb P(A|\text{It is Monday}) = 0.05, \qquad \mathbb P(A|\text{accident happened the day before}) = 0.03. \]
Then two events are independent if \( \mathbb P(A|B) = \mathbb P(A) \).
We can derive a formula for \( \mathbb P(A|B) \). If \( B \) is known to have occurred, then we consider a new probability space as being all the things which can occur, knowing that \( B \) has occurred. Then that \( A \) occurs means that \( A\cap B \) must occur (both \( A \) and \( B \) occur), but we need to normalise by dividing by the probability that \( B \) occurs. Thus we obtain \[ \mathbb P(A|B) = \frac{\mathbb P(A\cap B)}{\mathbb P(B)}. \] (Note that in some Mathematical approaches to probability, this is actually taken as an axiom.) We shall henceforth take this as the definition of conditional probability. If \( A,B \) are independent then \( \mathbb P(A\cap B) = \mathbb P(A)\mathbb P(B) \) and so \[ \mathbb P(A|B) = \frac{\mathbb P(A\cap B)}{\mathbb P(B)} = \frac{\mathbb P(A)\mathbb P(B)}{\mathbb P(B)} = \mathbb P(A), \] as we had before. (So we start to see how "definitions" agree with "intuition" / "meaning".)
There are two key results to know here.
Law of total probability: Suppose \( B_1,\cdots,B_n \) are disjoint events which cover the whole of the probability space. If \( A \) is any event, then \[ \mathbb P(A) = \sum_{k=1}^n \mathbb P(A|B_k)\mathbb P(B_k). \]
Proof: We can suppose \( \mathbb P(B_k)\not=0 \) as if it was \( 0 \) we can ignore it. By definition, \( \mathbb P(A|B_k)\mathbb P(B_k) = \mathbb P(A\cap B_k) \). As the union of the \( B_k \) is the whole space, we know that \[ A = \bigcup_k A\cap B_k. \] As \( A\cap B_k \) is disjoint from \( A\cap B_j \) for \( j\not=k \), and probability is additive, we have \[ \mathbb P(A) = \sum_{k=1}^n \mathbb P(A \cap B_k) = \sum_{k=1}^n \mathbb P(A|B_k)\mathbb P(B_k), \] as required.
Bayes' Theorem: \( \mathbb P(A|B) \mathbb P(B) = \mathbb P(B|A) \mathbb P(A) \).
Proof: Using the definition we have \[ \mathbb P(A|B) \mathbb P(B) = \frac{\mathbb P(A\cap B)}{\mathbb P(B)} \mathbb P(B) = \mathbb P(A\cap B). \] Similarly, \( \mathbb P(B|A) \mathbb P(A) = \mathbb P(A\cap B) \). Thus both sides are equal.
We often re-arrange as \[ \mathbb P(A|B) = \frac{\mathbb P(B|A) \mathbb P(A)}{\mathbb P(B)}. \]
This simple result is hugely useful in applications. Here is a classical one. Suppose there is a rare virus, which infects 1 person in 100,000. There is a test for the virus which is 95% "sensitive" and 99% "specific". I take the test, and it comes back positive: should I be worried?
We need to understand what these words mean: "sensitive" means that if I do have the virus, the test will detect it 95% of the time; "specific" means that if I do not have the virus, the test will correctly come back negative 99% of the time.
Let \( C \) be the event I have the virus, and \( \neg C \) the event I don't. Let \( T \) be the event the test says I have the virus, and \( \neg T \) the event the test says I don't have the virus. The question gives us this information: \[ \mathbb P(C) = 1 / 100,000, \qquad \mathbb P(T|C) = 0.95, \qquad \mathbb P(\neg T|\neg C) = 0.99. \] I want to know \( \mathbb P(C|T) \): the probability I have the virus, given that I test positive for it.
By Bayes' Theorem, we have \[ \mathbb P(C|T) = \frac{ \mathbb P(T|C) \mathbb P(C) }{ \mathbb P(T) }. \] By the Law of Total Probability, \[ \mathbb P(T) = \mathbb P(T|C)\mathbb P(C) + \mathbb P(T|\neg C)\mathbb P(\neg C), \] where \( \mathbb P(T|\neg C) = 1 - \mathbb P(\neg T|\neg C) = 0.01 \). Putting in the numbers gives \[ \mathbb P(T) = 0.95 \times 1/100,000 + 0.01 \times 99,999 / 100,000 \approx 0.01001, \] and so \[ \mathbb P(C|T) = \frac{0.95 / 100,000}{0.01001} \approx 9.5 \times 10^{-4}. \] I should not worry!
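A few lines of Python confirm the arithmetic, and make it easy to play with the numbers:

```python
p_c = 1 / 100_000            # P(C): prevalence of the virus
p_t_given_c = 0.95           # P(T|C): the test detects the virus when present
p_t_given_not_c = 1 - 0.99   # P(T|not C): false positive rate, from P(not T|not C) = 0.99

# Law of Total Probability: P(T) = P(T|C)P(C) + P(T|not C)P(not C)
p_t = p_t_given_c * p_c + p_t_given_not_c * (1 - p_c)

# Bayes' Theorem: P(C|T) = P(T|C)P(C) / P(T)
p_c_given_t = p_t_given_c * p_c / p_t

print(f"P(T)   = {p_t:.6f}")          # about 0.010009
print(f"P(C|T) = {p_c_given_t:.6f}")  # about 0.00095
```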
Hopefully this seems a little counter-intuitive to you. The test sounds quite accurate! However, the virus is rather rare, and it is in fact this which really matters. I have never met this situation in real life, but I wonder if doctors are good at explaining this? There are various visualisation tools, which I know of from the work of David Spiegelhalter, which can help here.
I like to think of Statistics as swapping the roles of "random variables" and "parameters". For example:
An urn contains 10 red balls, and an unknown number of blue balls. I draw a ball, observe its colour, and put the ball back in the urn. Doing this 5 times I get BBBRB. What is the most likely number of blue balls? (There is a brief code sketch of this at the end of these notes.)
We assume heights of students are distributed normally with \( \sigma=20cm \). We measure 10 Mathematics students and get a mean height of 190cm, and measure 10 Physics students and get a mean height of 195cm. Are Physicists taller than Mathematicians?
I will talk about "parameter estimation" and "hypothesis testing", and maybe the Bayesian approach to this.
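As a taste of the first of these, here is a minimal, brute-force sketch in Python of the urn example above (not necessarily how I will present it in the lecture): with 10 red balls and \( b \) blue balls, each draw is blue with probability \( b/(10+b) \), so we simply compute the probability of observing the sequence BBBRB for a range of values of \( b \) and pick out the maximum.

```python
n_red = 10   # the known number of red balls

def likelihood(b):
    """Probability of observing the sequence B B B R B (4 blue, 1 red),
    drawing with replacement from an urn with n_red red and b blue balls."""
    p_blue = b / (n_red + b)
    p_red = n_red / (n_red + b)
    return p_blue**4 * p_red

# Brute force: evaluate the likelihood over a range of candidate values of b.
candidates = range(1, 201)
best = max(candidates, key=likelihood)

print(f"most likely number of blue balls: {best}")           # 40
print(f"likelihood at the maximum: {likelihood(best):.5f}")  # 0.08192
```

This is precisely the idea of maximum likelihood estimation: treat the parameter (here \( b \)) as unknown, and choose the value which makes the observed data most probable.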