Theory of Inference MATH35600/MATHM0019
The course instructor is Simon Wood (Fry GA.05). There are 2 lectures (see below) and one computer-lab-based tutorial (Tuesday 12 - see BB page) each week. 20% of the course mark will be from an assessed practical project. The remaining 80% comes from a 2.5 hour exam consisting of 4 questions (no choice): an example is provided below. Tutorial sheet questions provide the best preparation for the exam. Office hours are Tuesday 13-14 (see BB page).
Given that you already have printed notes, there seems little that course slides could usefully add, so I have recorded electronic whiteboard lectures as mp4 movies (a process that swallows days). Some of the lectures are longer than a standard lecture, as it seemed sensible to divide the material by content, rather than by arbitrary 45-minute blocks. To use the lectures, I recommend that you print out (if possible) and read the corresponding lecture note sections before viewing the lecture (there is some evidence that people are better able to retain written information read from printed paper than from a screen). Then view the lecture, with the lecture notes to hand for reference (and annotation). Then go over the printed notes again. After that, attempt relevant questions from the tutorial sheets: that should include the practical questions (so make sure you have R and JAGS installed).
- p-values and testing Section 8.5, also regularity conditions and test inversion.
- AIC Section 8.9 on Akaike's Information Criterion.
- Bayesian MCMC Sections 9-9.6 on Bayesian stochastic simulation by Metropolis-Hastings Markov chain Monte Carlo.
- Gibbs sampling Sections 9.7-9.11, also covering convergence diagnostics and credible intervals.
- DAGs, Gibbs and JAGS Section 10.
- More Bayes Section 11. Point estimation and model comparison.
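As a preview of the Bayesian computation lectures, here is a minimal sketch (mine, not taken from the notes) of a random-walk Metropolis-Hastings sampler, using a standard normal target so that the acceptance ratio is just a ratio of `dnorm` values:

```r
## Random-walk Metropolis-Hastings sketch for a N(0,1) target.
## The chain proposes a normal step from the current state and accepts
## it with probability min(1, target(proposal)/target(current)).
set.seed(1)
n <- 10000
x <- numeric(n)          ## chain storage; x[1] = 0 is the start value
for (i in 2:n) {
  prop <- x[i-1] + rnorm(1)                  ## random walk proposal
  if (runif(1) < dnorm(prop)/dnorm(x[i-1])) {
    x[i] <- prop                             ## accept
  } else {
    x[i] <- x[i-1]                           ## reject: stay put
  }
}
mean(x); sd(x)   ## should be close to 0 and 1 for a N(0,1) target
```

Section 9 of the notes covers the general algorithm, why the acceptance ratio works, and how to diagnose convergence; this sketch is only meant to show how little code the basic idea needs.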
Please install R on your own computer from CRAN, and the R package 'rjags' from the same place. 'rjags' also requires that you install JAGS as standalone software on your computer.
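In practice the installation is two steps: first install the standalone JAGS program (from its own download page), then install the R package. Something like the following at the R console (a sketch; 'rjags' will fail to load if JAGS itself is missing):

```r
## Install the R interface to JAGS from CRAN.
## The standalone JAGS program must already be installed on your machine,
## or library(rjags) will fail with a "JAGS not found" style error.
install.packages("rjags")
library(rjags)   ## if this loads without error, you are ready for the labs
```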
As a result of the UCU strike action, some of the more in-depth material will not be lectured or examined, although for a deeper understanding of the subject I would encourage you to work through these sections yourself. The skipped sections are: the full Newton algorithm at the end of 7.2, and the specific examples in 7.2.1, although the basic principle of Newton's method (in 1 and more dimensions) was covered and is examinable. Sections 7.2.2 and 8.5.3 are dropped. Section 8.6 will not be covered, although you are expected to be able to use the result and understand its limitations (covered in other sections). As it turns out, the two lectures that I had to reschedule to strike days no longer need to be rescheduled and will take place, so fewer lectures were lost to the strike than planned, and dropping more material should not be necessary.
- Lecture notes. These may be updated as the course progresses.
- Matrix notes. Essential revision notes on matrices.
- Core Statistics is a short textbook covering the material in this course, along with background and extensions. Chapter one should be reviewed for the essential background assumed by the course.
- CRAN is the place to get your own (highly recommended) free copy of the R statistical computing language and environment used in the course. Notice the Documentation section on the left of the CRAN page --- it has a lot of useful information.
- The JAGS user manual is a handy reference for when we cover Bayesian computation.
- D.R. Cox (2006) Principles of Statistical Inference also covers much of the material in the course.
- A.C. Davison (2003) Statistical Models covers everything we cover and much more.
- G. Casella and R.L. Berger (1990) Statistical Inference covers many of the topics in greater mathematical depth.
- Daniel Kahneman's Thinking, Fast and Slow has a lot of interesting things to say about statistical reasoning, and the built-in flaws in how we humans tend to reason and make inferences.
Need to brush up on basic matrix algebra etc? Try
Exam papers from before 2018 are not a good guide to the course exam, because the course content has been modified to increase the emphasis on practical application of statistical inference theory (e.g. the introduction of weekly labs and assessed coursework). While studying your notes, attending labs and handing in work for marking remain the best ways to prepare for the exam, here is an example exam (with solutions) to indicate roughly what to expect in terms of paper style and question format. Here is the 2019 exam and solution.
Given that the remaining lectures are now online (see above), I have uploaded all the remaining tutorial sheets, so that you can attempt them early if you go through the material early. In the current situation I would strongly recommend working online with your project group on tutorial problems. If you help someone else, the act of explaining is better than anything else at consolidating your knowledge. If you need help from someone else, the benefits are obvious. If you finish a tutorial sheet early and want the solutions, just email me with the subject line 'TOI solutions request'. There is an online computer lab/tutorial session at 12 on Tuesdays (see Blackboard page). To get tutorial work marked, upload it to the course Blackboard page as detailed there, by the end of Thursday.
Here are some datasets used in labs. To read them directly into R, use something like:
## read a lab dataset directly from the course website into an R data frame
dat <- read.table("https://people.maths.bris.ac.uk/~sw15190/TOI/confound.txt")
The course work is different for MATH35600 and MATHM0019. The course work is to be completed in groups of three, to provide experience of team working. You must arrange to be in a team with two others on the same course code as you (MATH35600 or MATHM0019). The coursework was handed out at the 9AM lecture on Monday 16th of March. You should have attended this lecture to pick up the coursework and register your work group (or made sure that another group member did so for you). People not in full groups of 3 were able to get into complete groups in the lecture. In that lecture it was agreed that groups would complete the work by online collaboration for a slightly extended deadline of 12 Noon Monday 30th March (to avoid the work contributing to likely further disruption after the Easter break). The university has now announced that deadlines have to be in term time, so the official deadline is 12 Noon Tuesday 21st April. My recommendation is still to do the work now and submit it by the original deadline, to avoid it dragging on too long and a loss of momentum.
If you did not register your group in the Monday 16th March lecture then you should email me the names of your group members immediately.
CORRECTION: in class I said that the summation up to K in the model started at 1. For M0019 only, this is wrong, it should start at 0 or the form used in the slightly modified sheet below should be used.
The coursework assignments are here:
Some extra information (just for interest and not for inclusion in your report). Statistical inference has been prominent in the COVID-19 episode, and some catastrophic statistical blunders are contributing to the problems. This paper from Oxford makes interesting reading, pointing out one of the problems. Using models for a whole epidemic (not just the early phase models you have been using), they use the Bayesian methods we will cover shortly for inference about the possible size of the hidden epidemic (i.e. the epidemic amongst asymptomatics and those with mild illness). They fit their model to the data on deaths, since these are the most reliable (the death data are the most likely to represent the true deaths; the case data are much more problematic as testing rates change). The analysis shows that the course of the epidemic so far is completely consistent with a very large hidden epidemic. We cannot rule out that a high proportion of the population have already had the disease, nor can we rule out that the proportion already infected is much smaller. By using statistical inference to draw attention to a level of uncertainty that prevents any sort of sensible epidemic management planning, they also highlight what is needed to move back to evidence-based policy: testing random samples of the population to establish disease prevalence, and preferably the proportion already exposed. The true mortality rates could then also be estimated accurately: currently all that is known is the mortality rate amongst those whose symptoms are considered severe enough to test, with `severe enough' varying hugely by country.
In contrast to the above paper, some statistical shockers are being used to argue for the current policy. For example, an article in the Guardian argued that there would be no life lost as a result of the large recession likely to be caused by the current measures, and that the weight of evidence suggested that life expectancy would actually be increased by recession. Following the link for this `weight of evidence' takes you to a single paper. Unfortunately the full text is behind a paywall, but there is a summary. Hopefully after our discussions on confounding and causality you can see what is wrong with going from their results to their conclusion (unless you can't think of anything else that might have changed over time and impacted life expectancy, in which case it's all fine). Sadly this kind of problem is not rare, although this particular example is rather spectacular. Note that I can't tell you for certain that a recession will lead to substantial loss of life, but there is a 5-8 year difference in life expectancy between the richest and poorest in the UK, and data comparing regions of the US show a 2/3 of a year reduction in life expectancy for each percentage point increase in unemployment. It's quite a gamble (with other people's lives) to simply assume that all these effects are down to confounding, which is what those saying there will be no life lost to recession are doing. In fact the gap between life expectancy of the top 10% and bottom 10% in the UK grew by 20 weeks in the aftermath of the 2008 financial crisis, an increase that is difficult to ascribe to confounding (see here for the data). That's a loss of life of 2 weeks per person if it were averaged over everyone, rather than being concentrated in the poorest group. 2 weeks is about the average loss of life per person expected if you did nothing to mitigate the covid-19 pandemic.
Part of the current mood is down to a problem in communicating risks to non-specialists. An excellent attempt to redress this is the More or Less Corona special (although you may have the feeling that the first and last contributors have not quite managed to communicate to each other). David Spiegelhalter finds an excellent way of communicating what the Corona risks really mean to individuals - that if you get the disease (and show symptoms) you will essentially have a year's worth of your risk of dying compressed into a couple of weeks. So it is a risk you'd avoid if you can, but probably not with measures out of all proportion to those you usually take to avoid your risk of dying in a year. It's also not clear to me why we would take measures to protect others out of all proportion to the measures we are prepared to take to protect them over a year (and I routinely vote to spend more of my money on protecting others). Let me know if you have thoughts on that one. My guess is that the explanation is to be found in Daniel Kahneman's Thinking, Fast and Slow, which would make better lockdown reading than anything I can teach you.
My minuscule contribution to the public discussion is here, while this article makes a number of interesting points, including the one that even the data on `deaths from' corona are not what they might seem. Not reported there is that one of the big problems with reported data on death rates, in particular, is that while almost 100 percent of deaths are known and reported, only a small proportion of cases is known. The only country that is close to attempting to rectify this is Iceland, where they have a testing programme aiming to directly get at true disease prevalence. It's still not fully randomized, so we still have confounding problems, but if we treat their data as coming from a near random sample we get some idea of the size of the problem. At time of writing Iceland has 890 cases having tested 3 percent of its population and has had 3 deaths, generally reported as a 0.3 percent crude death rate. But the 890 cases come from 3 percent of the population and the deaths from the whole population. If the 3 percent were a truly random sample, then there would have to be about 30000 cases in the whole population, suggesting a crude death rate of 1 in 10000. A more careful accounting using the published information from Iceland suggests 3000-9000 cases in addition to the known ones, suggesting a crude death rate considerably below 1 in 1000. I have written `crude' death rate here, because although it is what is commonly reported it neglects disease duration effects that matter.
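The Iceland scale-up above is a one-line calculation; here it is in R, using only the illustrative figures quoted in the paragraph (not official statistics):

```r
## If the 3 percent of the population tested were a random sample, the 890
## known cases scale up by 1/0.03 to give the implied whole-population total.
implied_cases <- 890 / 0.03   ## roughly 30000 implied cases
3 / implied_cases             ## crude death rate: about 1 in 10000
```

Compare this with the naive 3/890 (about 0.3 percent) to see how much difference the denominator choice makes.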
The 2019 coursework assignments are here:
The 2018 course work is here: