Full publication list on Google Scholar. Selected papers below.

High-dimensional data

The Manifold Hypothesis is a widely accepted tenet of Machine Learning which asserts that nominally high-dimensional data are in fact concentrated near a low-dimensional manifold, embedded in high-dimensional space. This phenomenon is observed empirically in many real-world situations, has led to the development of a wide range of statistical methods in the last few decades, and has been suggested as a key factor in the success of modern AI technologies. We show that rich and sometimes intricate manifold structure in data can emerge from a generic and remarkably simple statistical model -- the Latent Metric Model -- via elementary concepts such as latent variables, correlation and stationarity. This establishes a general statistical explanation for why the Manifold Hypothesis seems to hold in so many situations. Informed by the Latent Metric Model we derive procedures to discover and interpret the geometry of high-dimensional data, and explore hypotheses about the data generating mechanism. These procedures operate under minimal assumptions and make use of well-known, scalable graph-analytic algorithms.
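
As a rough illustration (not the paper's own pseudocode) of the kind of scalable graph-analytic procedure meant here, one might embed the data linearly, build a nearest-neighbour graph, and read off geometry through shortest-path distances. The function name, the choice of PCA, and the parameter values below are illustrative assumptions.

# Sketch: explore manifold geometry of high-dimensional data via a linear
# embedding, a k-nearest-neighbour graph, and shortest-path (geodesic) distances.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def approximate_geodesics(Y, n_components=10, k=15):
    """Y: (n_samples, n_features) data matrix. Illustrative parameter defaults."""
    X = PCA(n_components=n_components).fit_transform(Y)        # linear embedding
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")    # sparse k-NN graph
    D = shortest_path(G, method="D", directed=False)           # geodesic distance estimates
    return X, D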

In this paper we offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.
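
A minimal sketch of the merge rule described above, written from the abstract rather than from the authors' code: at each step the two clusters with the largest average pairwise dot product are merged. The naive search and the function name are illustrative only.

# Sketch: agglomerative clustering with "maximum average dot product" linkage.
import numpy as np

def average_dot_product_tree(X):
    """X: (n, p) data matrix. Returns a list of merges (cluster_a, cluster_b, similarity)."""
    S = X @ X.T                                   # all pairwise dot products
    clusters = {i: [i] for i in range(len(X))}    # active clusters: id -> member indices
    merges = []
    next_id = len(X)
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for a in range(len(keys)):
            for b in range(a + 1, len(keys)):
                ia, ib = clusters[keys[a]], clusters[keys[b]]
                sim = S[np.ix_(ia, ib)].mean()    # average dot product between the two clusters
                if best is None or sim > best[0]:
                    best = (sim, keys[a], keys[b])
        sim, ka, kb = best
        clusters[next_id] = clusters.pop(ka) + clusters.pop(kb)
        merges.append((ka, kb, sim))
        next_id += 1
    return merges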

Graphs & dynamic networks

We present a new algorithmic framework, Intensity Profile Projection, for learning continuous-time representations of the nodes of a dynamic network, characterised by a node set and a collection of instantaneous interaction events which occur in continuous time. Our framework consists of three stages: estimating the intensity functions underlying the interactions between pairs of nodes, e.g. via kernel smoothing; learning a projection which minimises a notion of intensity reconstruction error; and inductively constructing evolving node representations via the learned projection. We show that our representations preserve the underlying structure of the network, and are temporally coherent, meaning that node representations can be meaningfully compared at different points in time. We develop estimation theory which elucidates the role of smoothing as a bias-variance trade-off, and shows how we can reduce smoothing as the signal-to-noise ratio increases on account of the algorithm 'borrowing strength' across the network.
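
A highly simplified sketch of the three stages, assuming Gaussian kernel smoothing, an SVD-based projection and a regular time grid; none of these specific choices, nor the function names, are taken from the paper.

# Sketch of the three stages: kernel-smoothed pairwise intensities, an SVD-based
# projection, and evolving node representations from projecting each time slice.
import numpy as np

def kernel_intensity(times, grid, bandwidth=1.0):
    """Gaussian-kernel estimate of an intensity function on a time grid (numpy array)."""
    t = np.asarray(times, dtype=float)
    if t.size == 0:
        return np.zeros_like(grid)
    k = np.exp(-0.5 * ((grid[None, :] - t[:, None]) / bandwidth) ** 2)
    return k.sum(axis=0) / (bandwidth * np.sqrt(2.0 * np.pi))

def intensity_profile_projection(events, n_nodes, grid, d, bandwidth=1.0):
    """events: dict mapping node pairs (i, j) to lists of interaction times."""
    # Stage 1: kernel-smoothed intensity matrix at each grid time
    Lam = np.zeros((len(grid), n_nodes, n_nodes))
    for (i, j), times in events.items():
        lam = kernel_intensity(times, grid, bandwidth)
        Lam[:, i, j] = Lam[:, j, i] = lam
    # Stage 2: a d-dimensional projection, here the leading left singular vectors
    # of the time-concatenated intensity slices
    stacked = np.concatenate(list(Lam), axis=1)                # n x (n * len(grid))
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    P = U[:, :d]
    # Stage 3: evolving node representations by projecting each slice
    return np.stack([Lam[t] @ P for t in range(len(grid))])    # shape (len(grid), n, d)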

This paper concerns the statistical analysis of a weighted graph through spectral embedding. Under a latent position model in which the expected adjacency matrix has low rank, we prove uniform consistency and a central limit theorem for the embedded nodes, treated as latent position estimates. In the special case of a weighted stochastic block model, this result implies that the embedding follows a Gaussian mixture model with each component representing a community. We exploit this to formally evaluate different weight representations of the graph using Chernoff information. For example, in a network anomaly detection problem where we observe a p-value on each edge, we recommend against directly embedding the matrix of p-values, and instead embedding threshold or log p-values, depending on network sparsity and signal strength.
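
A hedged sketch of the pipeline suggested above: transform the edge weights (for example to -log p-values), spectrally embed the weight matrix using the largest-magnitude eigenvalues, and fit a Gaussian mixture. The function names, default transform and scaling convention are illustrative choices, not the paper's.

import numpy as np
from sklearn.mixture import GaussianMixture

def weighted_spectral_embedding(W, d):
    """W: symmetric matrix of edge weights. Returns an n x d node embedding."""
    vals, vecs = np.linalg.eigh(W)
    idx = np.argsort(-np.abs(vals))[:d]                # d largest eigenvalues by magnitude
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

def embed_and_cluster(P, d, K, transform=lambda p: -np.log(p)):
    """Embed a transformed p-value matrix P and fit a K-component Gaussian mixture."""
    X = weighted_spectral_embedding(transform(P), d)
    labels = GaussianMixture(n_components=K).fit_predict(X)
    return X, labels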

Recent work has shown that sparse graphs containing many triangles cannot be reproduced using a finite-dimensional representation of the nodes, in which link probabilities are inner products. Here, we show that such graphs can be reproduced using an infinite-dimensional inner product model, where the node representations lie on a low-dimensional manifold. Recovering a global representation of the manifold is impossible in a sparse regime. However, we can zoom in on local neighbourhoods, where a lower-dimensional representation is possible. As our constructions allow the points to be uniformly distributed on the manifold, we find evidence against the common perception that triangles imply community structure.

A generalisation of a latent position network model known as the random dot product graph is considered. We show that, whether the normalised Laplacian or adjacency matrix is used, the vector representations of nodes obtained by spectral embedding, using the largest eigenvalues by magnitude, provide strongly consistent latent position estimates with asymptotically Gaussian error, up to indefinite orthogonal transformation. The mixed membership and standard stochastic block models constitute special cases where the latent positions live respectively inside or on the vertices of a simplex, crucially, without assuming the underlying block connectivity probability matrix is positive-definite. Estimation via spectral embedding can therefore be achieved by respectively estimating this simplicial support, or fitting a Gaussian mixture model. In the latter case, the use of K-means (with Euclidean distance), as has been previously recommended, is suboptimal and for identifiability reasons unsound. Indeed, Euclidean distances and angles are not preserved under indefinite orthogonal transformation, and we show stochastic block model examples where such quantities vary appreciably. Empirical improvements in link prediction (over the random dot product graph), as well as the potential to uncover richer latent structure (than posited under the mixed membership or standard stochastic block models) are demonstrated in a cyber-security example.
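
A short sketch of the embedding step: keep the d largest eigenvalues by magnitude and record the signature of the indefinite inner product. The subsequent clustering step would then be a Gaussian mixture fit rather than Euclidean K-means, for the identifiability reasons above. The function name and scaling convention are illustrative.

import numpy as np

def grdpg_spectral_embedding(A, d):
    """Adjacency spectral embedding keeping the d largest-magnitude eigenvalues."""
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(-np.abs(vals))[:d]
    vals, vecs = vals[idx], vecs[:, idx]
    X = vecs * np.sqrt(np.abs(vals))                            # latent position estimates
    signature = (int((vals > 0).sum()), int((vals < 0).sum()))  # (p, q) of the indefinite form
    return X, signature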

Given a graph or similarity matrix, we consider the problem of recovering a notion of true distance between the nodes, and so their true positions. Through new insights into the manifold geometry underlying a generic latent position model, we show that this can be accomplished in two steps: matrix factorisation, followed by nonlinear dimension reduction. This combination is effective because the point cloud obtained in the first step lives close to a manifold in which latent distance is encoded as geodesic distance. Hence, a nonlinear dimension reduction tool, approximating geodesic distance, can recover the latent positions, up to a simple transformation. We give a detailed account of the case where spectral embedding is used, followed by Isomap, and provide encouraging experimental evidence for other combinations of techniques.
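
A compact sketch of the two-step recipe, using adjacency spectral embedding for the factorisation step and scikit-learn's Isomap for the nonlinear step; the dimensions, neighbourhood size and scaling below are illustrative choices.

import numpy as np
from sklearn.manifold import Isomap

def latent_distance_recovery(A, d, latent_dim, n_neighbors=10):
    """Step 1: spectral embedding into d dimensions. Step 2: Isomap to latent_dim dimensions."""
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(-np.abs(vals))[:d]
    X = vecs[:, idx] * np.sqrt(np.abs(vals[idx]))       # point cloud near a manifold
    return Isomap(n_neighbors=n_neighbors,
                  n_components=latent_dim).fit_transform(X)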

We consider the problem of embedding a dynamic network, to obtain time-evolving vector representations of each node, which can then be used to describe changes in behaviour of individual nodes, communities, or the entire graph. Given this open-ended remit, we argue that two types of stability in the spatio-temporal positioning of nodes are desirable: to assign the same position, up to noise, to nodes behaving similarly at a given time (cross-sectional stability) and a constant position, up to noise, to a single node behaving similarly across different times (longitudinal stability). Similarity in behaviour is defined formally using notions of exchangeability under a dynamic latent position network model. By showing how this model can be recast as a multilayer random dot product graph, we demonstrate that unfolded adjacency spectral embedding satisfies both stability conditions. We also show that two alternative methods, omnibus and independent spectral embedding, each lack one or the other form of stability.
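
A sketch of unfolded adjacency spectral embedding as described above: column-concatenate the adjacency matrices, take a rank-d SVD, and read time-varying node positions off the scaled right singular vectors. The scaling convention shown is one common choice, not necessarily the paper's.

import numpy as np

def uase(As, d):
    """As: list of T symmetric n x n adjacency matrices. Returns a (T, n, d) array of embeddings."""
    T, n = len(As), As[0].shape[0]
    unfolded = np.concatenate(As, axis=1)                 # n x (n T) unfolded matrix
    U, s, Vt = np.linalg.svd(unfolded, full_matrices=False)
    right = Vt[:d].T * np.sqrt(s[:d])                     # scaled right singular vectors, (n T) x d
    return right.reshape(T, n, d)                         # node i at time t: result[t, i]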

Statistical analysis of a graph often starts with embedding, the process of representing its nodes as points in space. How to choose the embedding dimension is a nuanced decision in practice, but in theory a notion of true dimension is often available. In spectral embedding, this dimension may be very high. However, this paper shows that existing random graph models, including graphon and other latent position models, predict the data should live near a much lower dimensional set. One may therefore circumvent the curse of dimensionality by employing methods which exploit hidden manifold structure.

We present a comprehensive extension of the latent position network model known as the random dot product graph to accommodate multiple graphs -- both undirected and directed -- which share a common subset of nodes, and propose a method for jointly embedding the associated adjacency matrices, or submatrices thereof, into a suitable latent space. Theoretical results concerning the asymptotic behaviour of the node representations thus obtained are established, showing that after the application of a linear transformation these converge uniformly in the Euclidean norm to the latent positions with Gaussian error. Within this framework, we present a generalisation of the stochastic block model to a number of different multiple graph settings, and demonstrate the effectiveness of our joint embedding method through several statistical inference tasks in which we achieve comparable or better results than rival spectral methods. Empirical improvements in link prediction over single graph embeddings are exhibited in a cyber-security example.

This paper shows that graph spectral embedding using the random walk Laplacian produces vector representations which are completely corrected for node degree. Under a generalised random dot product graph, the embedding provides uniformly consistent estimates of degree-corrected latent positions, with asymptotically Gaussian error. In the special case of a degree-corrected stochastic block model, the embedding concentrates about K distinct points, representing communities. These can be recovered perfectly, asymptotically, through a subsequent clustering step, without spherical projection, as commonly required by algorithms based on the adjacency or symmetric normalised Laplacian matrices. While the estimand does not depend on degree, the asymptotic variance of its estimate does -- higher degree nodes are embedded more accurately than lower degree nodes. Our central limit theorem therefore suggests fitting a weighted Gaussian mixture model as the subsequent clustering step, for which we provide an expectation-maximisation algorithm.
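
A sketch of the embedding step, computing eigenvectors of the random walk Laplacian D^{-1}A via its symmetric conjugate; the returned degrees could then weight the mixture-model fit suggested by the central limit theorem. Names and scaling are illustrative, and isolated nodes are assumed absent.

import numpy as np

def random_walk_embedding(A, d):
    """Embedding from eigenvectors of D^{-1} A. Assumes every node has positive degree."""
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    # eigenvectors of D^{-1} A obtained from the symmetric matrix D^{-1/2} A D^{-1/2}
    vals, vecs = np.linalg.eigh(D_inv_sqrt @ A @ D_inv_sqrt)
    idx = np.argsort(-np.abs(vals))[:d]
    X = D_inv_sqrt @ vecs[:, idx]          # back-transform to eigenvectors of D^{-1} A
    return X, deg                          # degrees available to weight the clustering step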

Spectral embedding finds vector representations of the nodes of a network, based on the eigenvectors of its adjacency or Laplacian matrix, and has found applications throughout the sciences. Many such networks are multipartite, meaning their nodes can be divided into partitions and nodes of the same partition are never connected. When the network is multipartite, this paper demonstrates that the node representations obtained via spectral embedding live near partition-specific low-dimensional subspaces of a higher-dimensional ambient space. For this reason we propose a follow-on step after spectral embedding, to recover node representations in their intrinsic rather than ambient dimension, proving uniform consistency under a low-rank, inhomogeneous random graph model. Our method naturally generalises bipartite spectral embedding, in which node representations are obtained by singular value decomposition of the biadjacency or bi-Laplacian matrix.
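
A sketch of the proposed follow-on step under simplifying assumptions: after spectral embedding, reduce each partition's point cloud to its intrinsic dimension, here simply by PCA within the partition. Function names and the use of PCA are illustrative, not the paper's specification.

import numpy as np
from sklearn.decomposition import PCA

def partitionwise_reduction(X, partition_labels, intrinsic_dim):
    """X: n x D ambient embedding; partition_labels: length-n array of partition ids."""
    X_reduced = {}
    for p in np.unique(partition_labels):
        mask = partition_labels == p
        # reduce this partition's points to their intrinsic dimension
        X_reduced[p] = PCA(n_components=intrinsic_dim).fit_transform(X[mask])
    return X_reduced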

Hypothesis testing / anomaly detection

The mid-p-value is a proposed improvement on the ordinary p-value for the case where the test statistic is partially or completely discrete. In this case, the ordinary p-value is conservative, meaning that its null distribution is larger than a uniform distribution on the unit interval, in the usual stochastic order. The mid-p-value is not conservative. However, its null distribution is dominated by the uniform distribution in a different stochastic order, called the convex order. This property leads us to discover some new finite-sample and asymptotic bounds on functions of mid-p-values, which can be used to combine results from different hypothesis tests conservatively, yet more powerfully, using mid-p-values rather than p-values. Our methodology is demonstrated on real data from a cyber-security application.
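
For concreteness, here is the standard mid-p construction for a discrete upper-tail test, using a binomial example: the ordinary p-value is P(T >= t) while the mid-p-value is P(T > t) + 0.5 P(T = t). The example and function name are illustrative, not taken from the paper.

from scipy.stats import binom

def binomial_mid_p(x, n, p0):
    """Upper-tail ordinary and mid-p-values for X ~ Binomial(n, p0), observed value x."""
    p_ordinary = binom.sf(x - 1, n, p0)                        # P(X >= x)
    p_mid = binom.sf(x, n, p0) + 0.5 * binom.pmf(x, n, p0)     # P(X > x) + 0.5 P(X = x)
    return p_ordinary, p_mid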

Combining p-values from independent statistical tests is a popular approach to meta-analysis, particularly when the original data which founded each of the tests are either no longer available or are difficult to combine into a single test. A diverse range of p-value combiners appear in the scientific literature, each with quite different statistical properties. Yet all too often the final choice of combiner used in a meta-analysis can appear arbitrary, as if all effort had been expended in building the models that gave rise to the p-values in the first place. Birnbaum (1954) gave an existence proof showing that any sensible p-value combiner must be optimal against some alternative hypothesis for the p-values. Starting from this perspective and recasting each method of combining p-values as a likelihood ratio test, this article presents some straightforward theoretical results for some of the standard combiners, which provide guidance about how a powerful combiner might be chosen in practice.
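
Two of the standard combiners referred to above, written as functions of the p-values with their usual null distributions; as the article emphasises, each is optimal against a different alternative. This snippet is illustrative background, not the article's code.

import numpy as np
from scipy.stats import chi2, norm

def fisher_combine(pvals):
    """Fisher's method: -2 * sum(log p) has a chi-squared null distribution with 2k d.o.f."""
    p = np.asarray(pvals, dtype=float)
    stat = -2.0 * np.log(p).sum()
    return chi2.sf(stat, df=2 * len(p))

def stouffer_combine(pvals):
    """Stouffer's method: average of probit-transformed p-values, standard normal under the null."""
    p = np.asarray(pvals, dtype=float)
    z = norm.isf(p).sum() / np.sqrt(len(p))
    return norm.sf(z)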

This article presents an algorithm that generates a conservative confidence interval of a specified length and coverage probability for the power of a Monte Carlo test (such as a bootstrap or permutation test). It is the first method that achieves this aim for almost any Monte Carlo test. Previous research has focused on obtaining as accurate a result as possible for a fixed computational effort, without providing a guaranteed precision in the above sense. The algorithm we propose does not have a fixed effort and runs until a confidence interval with a user-specified length and coverage probability can be constructed. We show that the expected effort required by the algorithm is finite in most cases of practical interest, including situations where the distribution of the p-value is absolutely continuous or discrete with finite support. The algorithm is implemented in the R package simctest, available on CRAN.

Single-molecule localisation microscopy

Photoactivated localisation microscopy (PALM) produces an array of localisation coordinates by means of photoactivatable fluorescent proteins. However, observations are subject to fluorophore multiple-blinking and each protein is included in the dataset an unknown number of times at different positions, due to localisation error. This causes artificial clustering to be observed in the data. We present a workflow using calibration-free estimation of blinking dynamics and model-based clustering, to produce a corrected set of localisation coordinates now representing the true underlying fluorophore locations with enhanced localisation precision. These can be reliably tested for spatial randomness or analysed by other clustering approaches, and previously inestimable descriptors such as the absolute number of fluorophores per cluster are now quantifiable, which we validate with simulated data. Using experimental data, we confirm that the adaptor protein, LAT, is clustered at the T cell immunological synapse, with its nanoscale clustering properties depending on location and intracellular phosphorylatable tyrosine residues.

Single-molecule localization-based super-resolution microscopy techniques such as photoactivated localization microscopy (PALM) and stochastic optical reconstruction microscopy (STORM) produce pointillist data sets of molecular coordinates. Although many algorithms exist for the identification and localization of molecules from raw image data, methods for analyzing the resulting point patterns for properties such as clustering have remained relatively under-studied. Here we present a model-based Bayesian approach to evaluate molecular cluster assignment proposals, generated in this study by analysis based on Ripley's K function. The method takes full account of the individual localization precisions calculated for each emitter. We validate the approach using simulated data, as well as experimental data on the clustering behavior of CD3ζ, a subunit of the CD3 T cell receptor complex, in resting and activated primary human T cells.
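
As background for the cluster-proposal step, a naive, edge-correction-free empirical Ripley's K estimator for a 2D point pattern; real analyses would use edge correction, and the function name and interface here are illustrative assumptions.

import numpy as np
from scipy.spatial.distance import pdist

def ripley_k(points, radii, area):
    """points: (n, 2) coordinates; radii: array of distances r; area: area of the study region."""
    n = len(points)
    d = pdist(points)                       # all pairwise distances, each unordered pair once
    lam = n / area                          # intensity (points per unit area)
    # K(r) = (1 / (lambda * n)) * number of ordered pairs within distance r
    return np.array([2.0 * np.sum(d <= r) / (n * lam) for r in radii])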