Preprints

A generalisation of a latent position network model known as the random dot product graph model is considered. The resulting model may be of independent interest because it has the unique property of representing a mixture of connectivity behaviours as the corresponding convex combination in latent space. We show that, whether the normalised Laplacian or adjacency matrix is used, the vector representations of nodes obtained by spectral embedding provide strongly consistent latent position estimates with asymptotically Gaussian error. Direct methodological consequences follow from the observation that the well-known mixed membership and standard stochastic block models are special cases where the latent positions live respectively inside or on the vertices of a simplex. Estimation via spectral embedding can therefore be achieved by respectively estimating this simplicial support, or fitting a Gaussian mixture model. In the latter case, the use of $K$-means, as has been previously recommended, is suboptimal and for identifiability reasons unsound. Empirical improvements in link prediction, as well as the potential to uncover much richer latent structure (than available under the mixed membership or standard stochastic block models) are demonstrated in a cyber-security example.
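
As a rough sketch of the estimation recipe implied above (spectral embedding followed by a Gaussian mixture model rather than $K$-means), the following Python fragment embeds a simulated two-block graph and clusters the embedded nodes. The embedding dimension and number of mixture components are assumed known, and the fragment is an illustration of the general idea rather than the paper's procedure.

```python
# Minimal sketch: adjacency spectral embedding followed by Gaussian mixture
# modelling. The embedding dimension d and number of communities K are
# assumed known; parameter values below are illustrative only.
import numpy as np
from sklearn.mixture import GaussianMixture

def adjacency_spectral_embedding(A, d):
    """Embed an n x n symmetric adjacency matrix into R^d using its
    d largest-magnitude eigenvalues and corresponding eigenvectors."""
    vals, vecs = np.linalg.eigh(A.astype(float))
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

def cluster_embedding(X, K):
    """Fit a K-component Gaussian mixture to the embedded points."""
    gmm = GaussianMixture(n_components=K, covariance_type="full")
    return gmm.fit_predict(X)

# Example usage on a small simulated two-block graph.
rng = np.random.default_rng(0)
n = 200
z = rng.integers(0, 2, size=n)             # community labels
B = np.array([[0.5, 0.1], [0.1, 0.4]])     # block connection probabilities
P = B[z][:, z]
A = rng.binomial(1, np.triu(P, 1))
A = A + A.T                                # symmetric, no self-loops
X = adjacency_spectral_embedding(A, d=2)
labels = cluster_embedding(X, K=2)
```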

The mixed membership stochastic blockmodel is a statistical model for a graph, which extends the stochastic blockmodel by allowing every node to randomly choose a different community each time a decision of whether to form an edge is made. Whereas spectral analysis for the stochastic blockmodel is increasingly well established, theory for the mixed membership case is considerably less developed. Here we show that adjacency spectral embedding into $\mathbb{R}^k$, followed by fitting the minimum volume enclosing convex $k$-polytope to the $k-1$ principal components, leads to a consistent estimate of a $k$-community mixed membership stochastic blockmodel. The key is to identify a direct correspondence between the mixed membership stochastic blockmodel and the random dot product graph, which greatly facilitates theoretical analysis. Specifically, a $2 \rightarrow \infty$ norm bound and a central limit theorem for the random dot product graph are exploited to respectively show consistency and partially correct the bias of the procedure.
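
The correspondence invoked above can be stated concisely in the simplified case where the $k \times k$ block matrix $B$ is positive semi-definite of rank $d$ (an assumption of this sketch): writing $B = V V^\top$ with $V \in \mathbb{R}^{k \times d}$, a node with membership probability vector $\pi_i$ on the unit simplex connects to node $j$ with probability
\[
\mathbb{P}(A_{ij} = 1 \mid \pi_i, \pi_j) \;=\; \pi_i^\top B\, \pi_j \;=\; \langle V^\top \pi_i,\; V^\top \pi_j \rangle,
\]
so that $X_i = V^\top \pi_i$ serves as a latent position in the sense of the random dot product graph, and the $X_i$ lie in the convex $k$-polytope whose vertices are the rows of $V$ (with pure nodes sitting at the vertices).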

Posterior predictive p-values are a common approach to Bayesian model-checking. This article analyses their frequency behaviour, that is, their distribution when the parameters and the data are drawn from the prior and the model respectively. We show that the family of possible distributions is exactly described as the distributions that are less variable than uniform on [0,1], in the convex order. In general, p-values with such a property are not conservative, and we illustrate how the theoretical worst-case error rate for false rejection can occur in practice. We describe how to correct the p-values to recover conservatism in several common scenarios, for example, when interpreting a single p-value or when combining multiple p-values into an overall score of significance. We also handle the case where the p-value is estimated from posterior samples obtained from techniques such as Markov Chain or Sequential Monte Carlo. Our results place posterior predictive p-values in a much clearer theoretical framework, allowing them to be used with more assurance.
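
For reference, the convex order used here (and again for mid-p-values below) has the standard definition: a p-value $P$ is less variable than $U \sim \mathrm{Uniform}[0,1]$ in the convex order if
\[
\mathbb{E}\,\varphi(P) \;\le\; \mathbb{E}\,\varphi(U) \quad \text{for every convex function } \varphi : [0,1] \to \mathbb{R},
\]
which in particular forces $\mathbb{E}(P) = 1/2$ (take $\varphi(x) = x$ and $\varphi(x) = -x$) while allowing $P$ to be more concentrated about its mean than the uniform.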

Many scientific questions rely on determining whether two sequences of event times are associated. This article introduces a likelihood ratio test which can be parameterised in several ways to detect different forms of dependence. A common finite-sample distribution is derived, and shown to be asymptotically related to a weighted Kolmogorov-Smirnov test. Analysis leading to these results also motivates a more general tool for diagnosing dependence. The methodology is demonstrated on data generated on an email network, showing evidence of information flow using only timing information. Implementation code is available in the R package 'mppa'.

Journal publications

The mid-p-value is a proposed improvement on the ordinary p-value for the case where the test statistic is partially or completely discrete. In this case, the ordinary p-value is conservative, meaning that its null distribution is larger than a uniform distribution on the unit interval, in the usual stochastic order. The mid-p-value is not conservative. However, its null distribution is dominated by the uniform distribution in a different stochastic order, called the convex order. This property leads us to discover some new finite-sample and asymptotic bounds on functions of mid-p-values, which can be used to combine results from different hypothesis tests conservatively, yet more powerfully, using mid-p-values rather than p-values. Our methodology is demonstrated on real data from a cyber-security application.
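
For a discrete test statistic $T$ with observed value $t$ and rejection for large values, the mid-p-value referred to above is conventionally defined as
\[
p_{\mathrm{mid}} \;=\; \mathbb{P}(T > t) \;+\; \tfrac{1}{2}\,\mathbb{P}(T = t),
\]
that is, the ordinary p-value $\mathbb{P}(T \ge t)$ minus half of the probability mass at the observed value.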

Combining p-values from independent statistical tests is a popular approach to meta-analysis, particularly when the original data which founded each of the tests are either no longer available or are difficult to combine into a single test. A diverse range of p-value combiners appear in the scientific literature, each with quite different statistical properties. Yet all too often the final choice of combiner used in a meta-analysis can appear arbitrary, as if all statistical effort had been expended in building the models that gave rise to the p-values in the first place. Birnbaum (1954) gave an existence proof showing that any sensible p-value combiner must be optimal against some alternative hypothesis for the p-values. Starting from this perspective and recasting each method of combining p-values as a likelihood ratio test, this article presents some straightforward theoretical results for some of the standard combiners, which provide guidance about how a powerful combiner might be chosen in practice.
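
As a concrete example of a standard combiner recast as a test statistic, Fisher's method combines $n$ independent p-values $p_1, \ldots, p_n$ via
\[
\psi \;=\; -2 \sum_{i=1}^{n} \log p_i \;\sim\; \chi^2_{2n} \quad \text{under the null that each } p_i \sim \mathrm{Uniform}[0,1],
\]
and can be viewed as a likelihood ratio test against alternatives in which each p-value has a decreasing Beta-type density on $[0,1]$; other combiners correspond to other implied alternatives.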

Single-molecule localisation microscopy (SMLM) allows the localisation of fluorophores with a precision of 10–30 nm, revealing the cell’s nanoscale architecture at the molecular level. Recently, SMLM has been extended to 3D, providing a unique insight into cellular machinery. Although cluster analysis techniques have been developed for 2D SMLM data sets, few have been applied to 3D. This lack of quantification tools can be explained by the relative novelty of imaging techniques such as interferometric photo-activated localisation microscopy (iPALM). Also, existing methods that could be extended to 3D SMLM are usually subject to user-defined analysis parameters, which remains a major drawback. Here, we present a new open source cluster analysis method for 3D SMLM data, free of user-definable parameters, relying on a model-based Bayesian approach which takes full account of the individual localisation precisions in all three dimensions. The accuracy and reliability of the method are validated using simulated data sets. This tool is then deployed on novel experimental data as a proof of concept, illustrating the recruitment of LAT to the T-cell immunological synapse in data acquired by iPALM, providing ~10 nm isotropic resolution.

Cell function is regulated by the spatiotemporal organization of the signaling machinery, and a key facet of this is molecular clustering. Here, we present a protocol for the analysis of clustering in data generated by 2D single-molecule localization microscopy (SMLM)—for example, photoactivated localization microscopy (PALM) or stochastic optical reconstruction microscopy (STORM). Three features of such data can cause standard cluster analysis approaches to be ineffective: (i) the data take the form of a list of points rather than a pixel array; (ii) there is a non-negligible unclustered background density of points that must be accounted for; and (iii) each localization has an associated uncertainty in regard to its position. These issues are overcome using a Bayesian, model-based approach. Many possible cluster configurations are proposed and scored against a generative model, which assumes Gaussian clusters overlaid on a completely spatially random (CSR) background, before every point is scrambled by its localization precision. We present the process of generating simulated and experimental data that are suitable for our algorithm, the analysis itself, and the extraction and interpretation of key cluster descriptors such as the number of clusters, cluster radii and the number of localizations per cluster. Variations in these descriptors can be interpreted as arising from changes in the organization of the cellular nanoarchitecture. The protocol requires no specific programming ability, and the processing time for one data set, typically containing 30 regions of interest, is ~18 h; user input takes ~1 h.
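
The generative model described above (Gaussian clusters overlaid on a completely spatially random background, with every point then scrambled by its own localization precision) is straightforward to simulate. The sketch below is a minimal 2D illustration with made-up parameter values, not the protocol's simulation code.

```python
# Minimal sketch of the generative model: Gaussian clusters overlaid on a
# completely spatially random (CSR) background, with every localisation then
# perturbed by its own localisation precision. Parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
roi = 3000.0          # square region of interest, in nm
n_clusters = 5
radius = 50.0         # cluster standard deviation, in nm
per_cluster = 40      # localisations per cluster
n_background = 200    # CSR background localisations

# Cluster centres, uniform over the region of interest.
centres = rng.uniform(0, roi, size=(n_clusters, 2))

# Clustered points: Gaussian scatter about each centre.
clustered = np.vstack([
    c + rng.normal(scale=radius, size=(per_cluster, 2)) for c in centres
])

# Background points: completely spatially random.
background = rng.uniform(0, roi, size=(n_background, 2))
points = np.vstack([clustered, background])

# Each localisation has its own precision (standard deviation, in nm);
# the observed position is the true position plus Gaussian noise of that size.
precision = rng.gamma(shape=4.0, scale=5.0, size=len(points))
observed = points + rng.normal(size=points.shape) * precision[:, None]
```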

Single-molecule localization-based super-resolution microscopy techniques such as photoactivated localization microscopy (PALM) and stochastic optical reconstruction microscopy (STORM) produce pointillist data sets of molecular coordinates. Although many algorithms exist for the identification and localization of molecules from raw image data, methods for analyzing the resulting point patterns for properties such as clustering have remained relatively under-studied. Here we present a model-based Bayesian approach to evaluate molecular cluster assignment proposals, generated in this study by analysis based on Ripley's K function. The method takes full account of the individual localization precisions calculated for each emitter. We validate the approach using simulated data, as well as experimental data on the clustering behavior of CD3ζ, a subunit of the CD3 T cell receptor complex, in resting and activated primary human T cells.

This article presents an algorithm that generates a conservative confidence interval of a specified length and coverage probability for the power of a Monte Carlo test (such as a bootstrap or permutation test). It is the first method that achieves this aim for almost any Monte Carlo test. Previous research has focused on obtaining as accurate a result as possible for a fixed computational effort, without providing a guaranteed precision in the above sense. The algorithm we propose does not have a fixed effort and runs until a confidence interval with a user-specified length and coverage probability can be constructed. We show that the expected effort required by the algorithm is finite in most cases of practical interest, including situations where the distribution of the p-value is absolutely continuous or discrete with finite support. The algorithm is implemented in the R package simctest, available on CRAN.
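
For intuition only, the sketch below shows a naive version of the outer loop: datasets are simulated under the alternative, a Monte Carlo test is run on each, and sampling continues until a Clopper-Pearson interval for the rejection probability is short enough. Unlike the algorithm in the article (and its simctest implementation), this naive scheme treats the estimated power as a plain binomial proportion and does not carry the coverage guarantee that motivates the paper.

```python
# Naive illustration only: estimate the power of a Monte Carlo test by
# simulating datasets until a Clopper-Pearson interval for the rejection
# probability is shorter than a target length. This does NOT reproduce the
# paper's algorithm or its guaranteed coverage.
import numpy as np
from scipy.stats import beta

def clopper_pearson(k, n, conf=0.95):
    """Exact (conservative) binomial confidence interval for k successes in n trials."""
    alpha = 1 - conf
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

def estimate_power(run_test, target_length=0.05, conf=0.95, max_sims=100_000):
    """run_test() simulates one dataset, runs the Monte Carlo test on it,
    and returns True if the test rejects."""
    rejections, n = 0, 0
    while n < max_sims:
        rejections += bool(run_test())
        n += 1
        lo, hi = clopper_pearson(rejections, n, conf)
        if hi - lo <= target_length:
            return rejections / n, (lo, hi)
    return rejections / n, clopper_pearson(rejections, n, conf)
```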

We consider the problem of testing whether a complex-valued random vector is proper, i.e., is uncorrelated with its complex conjugate. We formulate the testing problem in terms of real-valued Gaussian random vectors, so we can make use of some useful existing results which enable us to study the null distributions of two test statistics. The tests depend only on the sample size n and the dimensionality p of the vector. The basic behaviors of the distributions of the test statistics are derived and critical values (thresholds) are calculated and presented for certain (n,p) values. For one of these tests we derive a distributional approximation for a transform of the statistic, potentially very useful in practice for rapid and simple testing. We also study the power (detection probability) of the tests. Our results mean that testing for propriety can be a practical and undaunting procedure.
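
For context, a zero-mean complex-valued random vector $\mathbf{z}$ is proper precisely when its complementary covariance matrix vanishes,
\[
\mathbb{E}\!\left[\mathbf{z}\,\mathbf{z}^{T}\right] = \mathbf{0},
\]
while the ordinary covariance $\mathbb{E}[\mathbf{z}\,\mathbf{z}^{H}]$ (with $H$ denoting the conjugate transpose) is unrestricted; the tests above assess whether this complementary covariance is compatible with zero.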

The contribution to a stationary complex-valued time series at a single frequency takes the form of a random ellipse, and its properties such as aspect ratio (which includes rotational direction) and orientation are of great interest in science. A case when both the aspect ratio and orientation are fixed is found, and their variability, in general, results from the additional influence of an orthogonal ellipse. It is shown how a magnitude squared coherence coefficient controls both the relative influences of these components and the variation of both the orientation and aspect ratio of the resultant ellipse. Realizations of random ellipses are recovered very accurately from simulated time series. The mean orientation of the random ellipse is formally derived.

An algorithm is proposed for the simulation of improper (noncircular) complex-valued second-order stationary stochastic processes having specified second-order properties. Three examples are given. Generated processes are shown to obey necessary distributional properties.

Conference proceedings, book chapters, abstracts

Comment on "Sparse graphs using exchangeable random measures" by Caron & Fox
Patrick Rubin-Delanchy. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79.5 (2017): 1295-1366.

In this article we outline a general modus operandi under which to perform intrusion detection at scale. The over-arching principle is this: a network monitoring tool has access to large stores of data on which it can learn 'normal' network behaviour. On the other hand, data on intrusions are relatively rare. This imbalance invites us to frame intrusion detection as an anomaly detection problem where, under the null hypothesis that there is no intrusion, the data follow a machine-learnt model of behaviour, and, under the alternative that there is some form of intrusion, certain anomalies in that model will be apparent. This approach to cyber-security poses some important statistical challenges. One is simply modelling and doing inference with such large-scale and heterogeneous data. Another is performing anomaly detection when the null hypothesis comprises a complex model. Finally, a key problem is combining different anomalies through time and across the network.

Network data is ubiquitous in cyber-security applications. Accurately modelling such data allows discovery of anomalous edges, subgraphs or paths, and is key to many signature-free cyber-security analytics. We present a recurring property of graphs originating from cyber-security applications, often considered a 'corner case' in the main literature on network data analysis, that greatly affects the performance of standard 'off-the-shelf' techniques. This is the property that similarity, in terms of network behaviour, does not imply connectivity, and in fact the reverse is often true. We call this disassortativity. The phenomenon is illustrated using network flow data collected on an enterprise network. Improved procedures for spectral analysis and link prediction are proposed that take explicit account of this property.

Statistical anomaly detection techniques provide the next layer of cyber-security defences below traditional signature-based approaches. This article presents a scalable, principled, probability-based technique for detecting outlying connectivity behaviour within a directed interaction network such as a computer network. Independent Bayesian statistical models are fit to each message recipient in the network using the Dirichlet process, which provides a tractable, conjugate prior distribution for an unknown discrete probability distribution. The method is shown to successfully detect a red team attack in authentication data obtained from the enterprise network of Los Alamos National Laboratory.
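
The conjugacy exploited above is the generic Dirichlet process predictive rule (restated here for intuition rather than as the article's exact scoring): if a recipient has so far received $n$ messages, of which $n_j$ came from the $j$th distinct source seen, then under a Dirichlet process prior with concentration parameter $\alpha$ the next message comes from
\[
\mathbb{P}(\text{source } j \mid \text{history}) = \frac{n_j}{n + \alpha},
\qquad
\mathbb{P}(\text{previously unseen source} \mid \text{history}) = \frac{\alpha}{n + \alpha},
\]
so connections with surprisingly small predictive probability can be flagged as anomalous.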

The network traffic generated by a computer, or a pair of computers, is often well-modelled as a series of sessions. These are, roughly speaking, intervals of time during which a computer is engaging in the same, continued, activity. This article explores a variety of statistical approaches to re-discovering sessions from network flow data using timing alone. Solutions to this problem are essential for network monitoring and cyber-security. For example, overlapping sessions on a computer network can be evidence of an intruder 'tunnelling'.

Detecting polling behaviour in a computer network has two important applications. First, the polling behaviour can be indicative of malware beaconing, where an undetected software virus sends regular communications to a controller. Second, the polling behaviour may not be malicious, and may instead correspond to regular automated update requests permitted by the client; to build models of normal host behaviour for signature-free anomaly detection, this polling behaviour needs to be understood. This article presents a simple Fourier analysis technique for identifying polling behaviour, and focuses on the second application: modelling the normal behaviour of a host, using real data collected from the computer network of Imperial College London.
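
A minimal version of this kind of Fourier check (not necessarily the article's exact statistic) is to bin the connection times of a client-server pair into a counts series, compute its periodogram, and flag the pair when a single frequency carries an outsized share of the power:

```python
# Minimal sketch of a Fourier check for polling: bin event times into a
# counts series, compute the periodogram, and report the dominant frequency
# and the share of (non-DC) power it carries. Thresholds and parameter
# values are illustrative, not the article's.
import numpy as np

def polling_score(event_times, bin_width=1.0):
    """Return the dominant frequency (in Hz if times are in seconds) and
    the fraction of non-DC periodogram power it carries."""
    t = np.asarray(event_times, dtype=float)
    t = t - t.min()
    counts, _ = np.histogram(t, bins=np.arange(0.0, t.max() + bin_width, bin_width))
    counts = counts - counts.mean()                  # remove the DC component
    power = np.abs(np.fft.rfft(counts)) ** 2
    freqs = np.fft.rfftfreq(len(counts), d=bin_width)
    power, freqs = power[1:], freqs[1:]              # drop the zero frequency
    peak = int(np.argmax(power))
    return freqs[peak], power[peak] / power.sum()

# Example: events every 30 s with small jitter look strongly periodic.
rng = np.random.default_rng(2)
times = np.cumsum(30.0 + rng.normal(0, 0.5, size=200))
freq, share = polling_score(times)
# freq is close to 1/30 Hz; share is far larger than for aperiodic traffic.
```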

How can we effectively use costly statistical models in the defence of large computer networks? Statistical modelling and machine learning are potentially powerful ways to detect threats as they do not require a human-level understanding of the attack. However, they are rarely applied in practice as the computational cost of deploying all but the simplest algorithms can become prohibitively large. Here we describe a multilevel approach to statistical modelling in which descriptions of the normal running of the network are built up from the lower netflow level to higher-level sessions and graph-level descriptions. Statistical models at low levels are most capable of detecting the unusual activity that might be a result of malicious software or hackers, but are too costly to run over the whole network. We develop a fast algorithm to identify tunnelling behaviour at the session level using 'telescoping' of sessions containing other sessions, and demonstrate that this allows a statistical model to be run at scale on netflow timings. The method is applied to a toy dataset using an artificial 'attack'.