I am a Sir Henry Dale Wellcome Trust Research Fellow working on "Statistical methodology for population genetics inference from massive datasets with applications in epidemiology" at the University of Bristol. I am interested in developing methodology for the statistical analysis of "difficult" datasets, meaning a) massive data, b) qualitative data, and c) dynamical systems, game theory, and other "intractable" models. Statistical methods are central to these problems, but modelling also plays a significant role. In particular, the development of models appropriate to the available data is an essential step. I'm working at the intersection of statistics and data mining when handling massive datasets.
Scaling Population Genetics Methods
Population-based sequencing is a rapidly evolving tool in genetics. Epidemiological datasets already extend to tens of thousands of individuals (UK10K, Genetics of Type 2 Diabetes), and projects are underway to sequence 100,000 individuals (the NHS 100,000 Genomes Project). Massive datasets help us understand how historical populations have changed over time (e.g. the Peopling of the British Isles, the Human Genome Diversity Panel) and inform how the genome itself has evolved, which has key consequences for the use of genetics to improve public health. Current methodology provides tantalising glimpses but is inadequate at the scale of these datasets. We need new methods that directly address the 'big-data' problems faced, whilst accounting for the uncertainty introduced by the approximations involved.

The genetic relationships between populations provide an increasingly powerful lens on human history. Modelling how populations have changed throughout history provides insights into important historical events. This fascinating way of looking into our own past is also an increasingly important step in understanding how our own genomes are structured, as the scale of modern datasets makes extracting information from related individuals and inhomogeneous populations essential. The new generation of genetics data promises yet new levels of detail on genomic function, but is massive and complex. With so much investment in gathering genetics datasets, it is essential that we have methods that can learn from them.
Importance of scalable methodology
It should not be the case that we learn less from more data. Yet in reality, a carefully chosen small dataset can be much more informative than a far larger one, even when the former is contained within the latter. That is because experiments are expertly designed to obtain the best information first. Early datasets were carefully crafted, and are therefore significantly more informative than a random sample of the same size from a modern dataset.
One way to scale analyses is to create new methods that can handle more data. It is often acceptable that these methods use each datum less efficiently: we still learn more overall. Genetics has benefited greatly from this approach, moving effectively from single-locus methods to faster haplotype-based ones. Creating better models is always worthwhile.
Another way is to scale the methods we already have to work with more data than is normally possible. For example, downsampling the data is sometimes sufficient to run an existing algorithm. This has the big advantage that no new methodology needs to be developed. However, for heterogeneous datasets random sampling performs poorly. The problem is that we have no statistically sound way of turning a large dataset into the sample that an expert would have chosen. This is difficult because making that choice requires looking at the data, so the data end up being used twice. This creates a bias that is hard to account for.
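As a toy illustration of why uniform subsampling loses information in heterogeneous data, the sketch below compares a random subsample against a greedy diversity-seeking one (farthest-point sampling, used here purely as a stand-in for "expert" design; all names and numbers are invented for the example). The diversity-seeking sample retains the rare group, but only by looking at the data first, which is exactly the source of the bias described above.

```python
import numpy as np

rng = np.random.default_rng(2)

# A heterogeneous "population": a large common group and a small rare one.
common = rng.normal(0.0, 1.0, size=(990, 2))
rare = rng.normal(8.0, 0.5, size=(10, 2))
data = np.vstack([common, rare])   # rows 990-999 are the rare group

def farthest_point_sample(points, k):
    """Greedy diversity-seeking subsample: repeatedly pick the point
    farthest from everything chosen so far."""
    chosen = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def rare_count(idx):
    return int(np.sum(idx >= 990))

k = 20
random_idx = rng.choice(len(data), size=k, replace=False)
diverse_idx = farthest_point_sample(data, k)
print("rare individuals kept: random =", rare_count(random_idx),
      "diversity-seeking =", rare_count(diverse_idx))
```

The random subsample usually misses the rare group entirely (it is only 1% of the data), while the diversity-seeking one reliably keeps it.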
An important research question in statistics is therefore: how do we scale up generic algorithms so that we can use them on massive data with minimal effort? I'm working on emulator methods (see Lawson & Adams) that do this. The idea is that you only need to perform some of the calculations, and can then use machine learning to predict the rest. This is helpful for many genetic problems in which all-vs-all comparisons are required. Two examples are ChromoPainter, which compares all individuals against each other to paint their genomes, and the Multiple Sequentially Markovian Coalescent (Schiffels & Durbin), which allows comparisons across all populations.
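A minimal sketch of the emulator idea, assuming the all-vs-all comparison matrix is approximately low rank (as matrices of between-population similarity typically are): compute only a fraction of the entries, then predict the rest by iterated truncated-SVD matrix completion. This illustrates the general strategy only; it is not the specific method of Lawson & Adams.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "coancestry" matrix: individuals cluster into populations, so the
# all-vs-all comparison matrix is approximately low rank.
n, k = 60, 3
labels = rng.integers(0, k, size=n)
full = np.where(labels[:, None] == labels[None, :], 5.0, 1.0)
full = full + 0.1 * rng.standard_normal((n, n))
full = (full + full.T) / 2  # symmetrise

# Pretend we can only afford a fraction of the pairwise calculations.
mask = rng.random((n, n)) < 0.3
mask = mask | mask.T
np.fill_diagonal(mask, True)

# Emulate the remaining entries with a rank-k truncated SVD, iterating
# between imputation and factorisation (a simple matrix-completion scheme).
est = np.where(mask, full, full[mask].mean())
for _ in range(50):
    u, s, vt = np.linalg.svd(est, full_matrices=False)
    low_rank = (u[:, :k] * s[:k]) @ vt[:k]
    est = np.where(mask, full, low_rank)

err = np.abs(est - full)[~mask].mean()
print(f"mean absolute error on emulated entries: {err:.3f}")
```

Because the block structure is rank-3, the unobserved comparisons are recovered to within roughly the noise level, at a fraction of the computational cost.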
Research by Application
Applications in Genetics and Epidemiology
How do we make sense of the massive amount of genetics data being produced, and how do we relate it to phenotypes of interest? Can we exploit the massive redundancy present to process it more efficiently? Specifically, we use the relationships between individuals and the history of populations to extract information about how the genome has changed. Further, we use this to learn about phenotypes, including disease, at the massive scale of upcoming genetics data.
The main tool here is "Chromosome Painting", a way of summarising a genome in terms of how similar it is to other genomes. Specifically, we find the recombination events at which the most recent common ancestor of a given individual's genome changes as we move along the chromosome. The number of such events is an important summary of "coancestry" and is key to inferring demographic relationships between populations.
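The painting computation can be sketched as a dynamic program in the style of the Li & Stephens copying model: a recipient genome is expressed as a mosaic of donor genomes, trading copying mismatches against recombination switches. The costs below are arbitrary illustrative choices, not ChromoPainter's actual parameters.

```python
import numpy as np

def paint(recipient, donors, switch_cost=1.0, miss_cost=2.5):
    """Express `recipient` as a minimum-cost mosaic of rows of `donors`,
    where copying a mismatched allele costs `miss_cost` and switching
    donor (a recombination event) costs `switch_cost`."""
    m, L = donors.shape
    cost = np.where(donors[:, 0] == recipient[0], 0.0, miss_cost)
    back = np.zeros((L, m), dtype=int)
    for site in range(1, L):
        stay = cost                      # keep copying the same donor
        jump = cost.min() + switch_cost  # recombine to the best donor
        back[site] = np.where(stay <= jump, np.arange(m), cost.argmin())
        cost = np.minimum(stay, jump) + np.where(
            donors[:, site] == recipient[site], 0.0, miss_cost)
    # Trace back the mosaic of donors along the chromosome
    path = np.empty(L, dtype=int)
    path[-1] = int(cost.argmin())
    for site in range(L - 1, 0, -1):
        path[site - 1] = back[site, path[site]]
    return path

donors = np.array([[0, 0, 0, 0, 1, 1, 1, 1],
                   [1, 1, 1, 1, 0, 0, 0, 0]])
recipient = np.zeros(8, dtype=int)
# The recipient copies donor 0, then switches to donor 1 halfway along.
print(paint(recipient, donors))
```

The single switch in the recovered path is exactly the kind of event that is counted to summarise coancestry.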
Key research topics include: how can we account for the different degrees of relatedness that occur in real populations, when we can sample a large fraction of the population? How can we interpret the differences between populations? How can we efficiently compute the painting for large numbers of people? And how can we learn about regions of the genome that have experienced selection or are associated with disease?
Head over to the fineSTRUCTURE page to learn more about this problem.
Recombination in Bacteria
We can learn about the evolutionary history of bacteria using the "Ancestral Recombination Graph" model, which assumes that a population of organisms has been reproducing (and their genomes have been recombining) at random. Unfortunately, this is very unwieldy for inferring events that actually happened, so we are considering approximations that can readily find recombination events in a bacterium's history. Our model treats recombination as a relatively rare event occurring on a background of clonal reproduction, and can detect weak signals of recombination, including the origin of imported DNA.
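A crude way to see how imports stand out against a clonal background is to scan for windows of elevated substitution density. The sketch below does this with arbitrary window and threshold choices; it is illustrative only, not a description of our actual model.

```python
import numpy as np

def flag_imports(snp_positions, genome_length, window=1000, fold=4.0):
    """Flag windows whose substitution density is several-fold above the
    genome-wide average, as candidate recombination imports. Window size
    and threshold are arbitrary choices for illustration."""
    edges = np.arange(0, genome_length + window, window)
    counts, _ = np.histogram(snp_positions, bins=edges)
    background = len(snp_positions) / (genome_length / window)
    flagged = np.nonzero(counts > fold * background)[0]
    return [(int(edges[i]), int(edges[i + 1])) for i in flagged]

rng = np.random.default_rng(1)
# Sparse clonal mutations, plus one dense imported tract at 40-42 kb
clonal = rng.uniform(0, 100_000, size=50)
imported = rng.uniform(40_000, 42_000, size=40)
snps = np.sort(np.concatenate([clonal, imported]))
print(flag_imports(snps, 100_000))
```

The imported tract carries far more substitutions per base than the clonal background, so its windows are flagged while the rest of the genome is not.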
Applications in Ecology
Bacteria live inside us and perform a range of useful activities, such as processing nutrients. However, not all of these bacteria are good for us. Understanding the ecological processes affecting bacteria in the gut is therefore of wide interest, and I'm working with Grietje Holtrop of BioSS to infer the actual ecological interactions of bacteria living in the gut from simple experimental data.
The birth and death processes assumed in the above genetic applications are really occurring in physical space. This has an impact on readily observable landscape-scale features, such as the distribution of chemicals produced by trees in a forest. I have an ongoing project looking at this spatial distribution, and inferring ecological parameters from it.
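A minimal simulation of the idea, tied to no particular system: offspring disperse near their parents and deaths are density-dependent, so spatial clustering emerges that one could, in principle, fit ecological parameters to. All rates and distances here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Spatial birth-death sketch on a 10x10 torus: local dispersal plus
# density-dependent death generates the clustered spatial pattern one
# might observe at landscape scale. (Distances ignore the wrap for
# simplicity; this is a sketch, not a fitted model.)
pos = rng.uniform(0, 10, size=(100, 2))
for step in range(200):
    # Births: each individual leaves one offspring nearby with prob 0.3
    parents = pos[rng.random(len(pos)) < 0.3]
    offspring = parents + rng.normal(0, 0.3, size=parents.shape)
    pos = np.vstack([pos, offspring]) % 10.0
    # Deaths: probability rises with the number of neighbours within r=1
    d2 = ((pos[:, None] - pos[None, :]) ** 2).sum(-1)
    neigh = (d2 < 1.0).sum(1) - 1
    death_prob = 0.05 + 0.01 * neigh
    pos = pos[rng.random(len(pos)) > death_prob]
print("final population size:", len(pos))
```

The equilibrium density and degree of clustering are controlled by the dispersal and crowding parameters, which is what makes inverting such observations to estimate ecological parameters feasible.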
Applications in History
Why do historical states collapse? History is often viewed as a continuous progression from simple prehistoric societies to more complex ones. However, throughout history complexity has decreased locally when large organised states or empires have collapsed into multiple simpler entities. This is difficult to explain with verbal theories because it can involve unintuitive non-linear feedback processes; to be sure that a theory can explain the observed pattern, we must create a mathematical description.
Climate is the simplest explanation for state collapse. However, more dynamical possibilities exist: Turchin (2003, Historical Dynamics: Why States Rise and Fall) describes a set of mathematical theories for how states might "overshoot" their optimum size because they are more unified during their early growth period, leaving larger states less stable and less capable of innovation than smaller ones. The theory itself is still in its infancy, as quantitative data of sufficient quality are scarce and few mathematical descriptions of history are available for comparison. However, what is clear is that it is possible to describe qualitative patterns in history with simple predictive models. This leads to the obvious question: how do we compare models?
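As a sketch of what such a mathematical description can look like, the toy model below is in the spirit of (but not identical to) Turchin's demographic-fiscal reasoning: population grows towards a carrying capacity boosted by state resources, the state taxes the shrinking surplus, and once the treasury drains the carrying capacity falls back, so the population overshoots and declines. All parameter values are invented.

```python
import numpy as np

def simulate(steps=4000, dt=0.05, r=0.1, rho=1.0, beta=0.4, c=3.0):
    """Euler-step a toy demographic-fiscal model: population N, state
    resources S. Resources raise the carrying capacity; the state taxes
    the surplus but spends in proportion to population."""
    N, S = 0.1, 0.0
    traj = []
    for _ in range(steps):
        k = 1.0 + c * S / (1.0 + S)          # resources boost capacity
        dN = r * N * (1.0 - N / k)           # logistic population growth
        dS = rho * N * (1.0 - N / k) - beta * N  # tax surplus, pay costs
        N = max(N + dt * dN, 1e-6)
        S = max(S + dt * dS, 0.0)            # the treasury cannot go negative
        traj.append((N, S))
    return np.array(traj)

traj = simulate()
peak = traj[:, 0].max()
final = traj[-1, 0]
print(f"peak population {peak:.2f}, final population {final:.2f}")
```

The trajectory rises well above the resource-free carrying capacity and then falls back once the treasury is exhausted: a collapse produced by internal feedback alone, with no climate forcing, which is precisely why such models need formal comparison against the climate explanation.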
I'm interested in exploring the many facets of quantifying the study of history in general, with a focus on asking "why?". History is clearly not just one thing after another: things change, populations grow, and societies (on average) become more complex. This field is in its infancy, and there is no answer yet as to how much we can explain and how much of history is just chance. Ultimately, it may be possible to have an evolutionary theory for societies with significant predictive power.
I also work with a wide range of researchers with specialist knowledge, on topics including the analysis of energy markets, the flow of information in computer networks, and cyber security.