Statistical inference for computer experiments

My main research interests lie in the field of Bayesian statistics and its application. In particular, I am interested in the statistical challenges posed by computer experiments.

David Cox said in a recent interview that the 'challenge [...] for an academic statistician, is to be involved in several fields of application in a non-trivial sense and combine the stimulus and the contribution you can make that way with theoretical contributions that those contacts will suggest' and this captures perfectly my research strategy: to be involved in several application areas and work on interesting methodology as it arises.

The main areas I am interested in are listed below. I would be happy to supervise PhD students in any of these areas and interested students should contact me for details of potential projects.

  • Approximate Bayesian computation (ABC)
  • Analysis of computer experiments
  • Statistical challenges in climate science
  • Bayesian approaches to palaeontology
  • Uncertainty quantification for carbon capture and storage
Please see my publication page for more details.

    I'm also a member of the Past Earth Network, group leader of the Model-Data Comparison group, and chair of the Environmental Statistics Section of the Royal Statistical Society.

    Approximate Bayesian computation (ABC)

    ABC methods are a class of Monte Carlo methods that arose out of genetics and have been extensively studied in the past decade. They can be used for Bayesian inference when the likelihood function cannot be evaluated directly, but the model can be simulated from. This situation is common in many scientific fields, where it is easy to code a physical model that simulates some data, but the model is difficult to analyse mathematically. A simple way to perform calibration in these models is via approximate Bayesian computation.

    If we let θ represent the model parameters, x the observed data, and S(x) some summary statistic of the data, then the basic ABC algorithm is as follows:

  1. Choose θ from the prior.
  2. Simulate data x' from the model with parameter θ, and calculate S' = S(x').
  3. Accept θ if the simulated statistic S' is close to the summary of the real data S = S(x), i.e. accept if ρ(S, S') < ε, where ρ is a carefully chosen metric.

    The main advantage of ABC is that it is easy to code and can be used in nearly all modelling situations. It doesn't necessarily require extensive statistical knowledge, and so is an important tool for scientists who wish to fit their model to data, but who might not have experience with complex likelihood calculations and MCMC. The key challenge for statisticians is developing algorithms that can be easily applied and that give accurate inference.
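    As a minimal sketch of the three steps above, the following Python code implements basic rejection ABC. The toy problem (inferring the mean of a normal distribution with known unit variance, a uniform prior on [-5, 5], the sample mean as summary statistic, and ε = 0.1) and all function and argument names are illustrative assumptions, not taken from any of the papers listed here.

```python
import numpy as np

def abc_rejection(observed, prior_sampler, simulator, summary, distance,
                  eps, n_draws, rng):
    """Basic rejection ABC: keep prior draws whose simulated summary
    lies within eps of the observed summary."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler(rng)          # step 1: draw theta from the prior
        x_sim = simulator(theta, rng)       # step 2: simulate data x'
        s_sim = summary(x_sim)              #         and compute S' = S(x')
        if distance(s_obs, s_sim) < eps:    # step 3: accept if rho(S, S') < eps
            accepted.append(theta)
    return np.array(accepted)

# Toy problem (an assumption for illustration): normal data with unknown mean.
rng = np.random.default_rng(0)
x_obs = rng.normal(loc=2.0, scale=1.0, size=50)

posterior_sample = abc_rejection(
    observed=x_obs,
    prior_sampler=lambda r: r.uniform(-5.0, 5.0),
    simulator=lambda theta, r: r.normal(loc=theta, scale=1.0, size=50),
    summary=np.mean,
    distance=lambda a, b: abs(a - b),
    eps=0.1,
    n_draws=20000,
    rng=rng,
)
```

    Only a small fraction of the prior draws is accepted, and each draw costs one simulator run. This is why the plain rejection algorithm becomes impractical when the simulator is expensive.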

    The focus of much of my current research is on trying to speed up ABC algorithms so that they can be used with more computationally expensive simulators.

  • Wilkinson, Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Statistical Applications in Genetics and Molecular Biology, 12(2):129-141, 2013. Access the recommendation on F1000Prime

    Answers the question: what distribution are we really sampling from when we do ABC?

  • Wilkinson, Accelerating ABC methods using Gaussian processes. JMLR Workshop and Conference Proceedings Volume 33: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics.

    Uses Gaussian processes to build a model of the unknown likelihood function with the aim of speeding up ABC. Seeks to combine ideas from ABC and history-matching.

  • Wilkinson, Bayesian Inference of Primate Divergence Times. PhD Thesis. Department of Applied Mathematics and Theoretical Physics, University of Cambridge, (2007).

    Chapters 3, 4, 5 and 6 all give details and extensions of the ABC method, and chapter 3 serves as an introduction to the subject.

    I gave a two-hour introductory tutorial on ABC at NIPS in 2013 (slides and video), and a shorter 45-minute introduction at CERN.

    Analysis of computer experiments

    Computer experiments are now commonplace in nearly all areas of science. From a statistical point of view, the main question is: how can we learn about a physical system from a simulation of it? For deterministic simulators, if we run the model multiple times with the same input, we will get the same output each time. Because there is no natural variation, we must introduce and account for uncertainty ourselves in order to make predictions with sensible error bounds. An excellent source of information on some of the methods for dealing with computer experiments is the MUCM toolkit.

    I am interested in two main problems:

  • Dealing with code uncertainty - if the simulator is expensive to run, we will have to conduct all inference using a small ensemble of model runs. One approach to dealing with this is to build an emulator of the simulator. In other words, we build a cheap statistical model of the expensive computer model.
  • Model error - if we accept that a model is imperfect, then it is natural to ask whether we can learn in what way the model goes wrong, and whether we can correct the error. Related to this is the question of whether we can learn the appropriate degree of uncertainty we must add to our inference in order to make sensible predictions. This work has close links to some of my ABC work.

    Gaussian processes are used to tackle both of these problems, as they provide a flexible non-parametric family of models that is easy to work with. Some papers:

    1. Wilkinson, M. Vrettas, D. Cornford, J. E. Oakley. Quantifying simulator discrepancy in discrete-time dynamical simulators. Journal of Agricultural, Biological and Environmental Statistics (special issue on "Computer models and spatial statistics for environmental science") 16(4), 554-570, 2011. There is also some supplementary material available here.

    2. Wilkinson, Bayesian calibration of expensive multivariate computer experiments. In Large scale inverse problems and quantification of uncertainty.

      We introduce the principal component emulator, which can be used to emulate models with high dimensional output. We also show how to calibrate the model using the PCA emulator.

    3. L. Bastos and Wilkinson. Análise Estatística de Simuladores (Statistical Analysis of Computer Experiments). Simpósio Nacional de Probabilidade e Estatística 19o (SINAPE), (2010).

      This is a short book Leonardo Bastos and I wrote as an introduction to computer experiments. It is currently only available in Portuguese, but we hope to produce an English version soon. There are also lecture slides that accompany the notes:

      1. Introduction to computer experiments (in English)
      2. Gaussian process emulators (Portuguese)
      3. Design of experiments and multi-output emulators (Portuguese)
      4. Calibration (English)
      5. Validation and sensitivity analysis (Portuguese)
      6. Approximate Bayesian computation (English)
    4. Holden, Edwards, Garthwaite, Wilkinson, Emulation and interpretation of high-dimensional models. In submission.

    5. N. Bounceur, M. Crucifix and Wilkinson. General sensitivity analysis of the climate-vegetation system to astronomical forcing: an emulator-based approach.
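    To illustrate the emulator idea described in this section, here is a minimal sketch of a Gaussian process emulator fitted to a small ensemble of runs of a deterministic simulator. The one-dimensional `simulator` function, the squared-exponential kernel, the fixed hyperparameters (no estimation is attempted), and the zero prior mean are all assumptions made for illustration; they are not the models used in the papers above.

```python
import numpy as np

def sq_exp_kernel(a, b, variance=1.0, length_scale=0.2):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def fit_emulator(x_train, y_train, jitter=1e-8):
    """Condition a zero-mean GP on the ensemble of simulator runs."""
    k = sq_exp_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    chol = np.linalg.cholesky(k)
    # alpha = K^{-1} y, computed via two solves with the Cholesky factor
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    return chol, alpha

def predict(x_new, x_train, chol, alpha):
    """Posterior mean and variance of the emulator at new inputs."""
    k_star = sq_exp_kernel(x_new, x_train)
    mean = k_star @ alpha
    v = np.linalg.solve(chol, k_star.T)
    var = sq_exp_kernel(x_new, x_new).diagonal() - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)  # clip tiny negative rounding errors

def simulator(x):
    """Hypothetical stand-in for an expensive deterministic simulator."""
    return np.sin(2.0 * np.pi * x) + x

# A small ensemble of runs is all we can afford from the "expensive" model.
x_train = np.linspace(0.0, 1.0, 8)
y_train = simulator(x_train)
chol, alpha = fit_emulator(x_train, y_train)

# The emulator is cheap to evaluate anywhere, with a variance that
# quantifies code uncertainty away from the design points.
x_new = np.linspace(0.0, 1.0, 101)
mean, var = predict(x_new, x_train, chol, alpha)
```

    Because the simulator is deterministic, the emulator interpolates the training runs (up to the small jitter term), and its predictive variance shrinks to essentially zero at the design points.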

    Statistical challenges in climate science

      Climate models tend to be expensive and deterministic, typically depending on many unknown parameters. The models are also imperfect, in the sense that there are physical processes missing from the models and approximating assumptions need to be made in order to solve the equations. The data can also be sparse and, in the case of palaeodata, very noisy. All these are reasons why I think climate science is a great area to work in.

      I am interested in work which incorporates statistical modelling into the analysis in order to produce predictions/explanations that account for the various uncertainties we know to exist (code uncertainty, parametric uncertainty, model error, measurement errors). It is important to produce probabilistic predictions, as deterministic predictions (which scientists understand to be approximate) are often interpreted (by skeptics) as undermining all of climate science if they do not occur exactly as promised. Some papers:

      1. P.B. Holden, N.R. Edwards, K.I.C. Oliver, T.M. Lenton and Wilkinson. A probabilistic calibration of climate sensitivity in GENIE-1. To appear, Climate Dynamics.

      2. J. Carson, M. Crucifix, S. Preston, Wilkinson. What drives the glacial-interglacial cycle? A Bayesian solution to a long standing problem.

      3. Holden, Edwards, Garthwaite, Wilkinson, Emulation and interpretation of high-dimensional models. In submission.

      4. N. Bounceur, M. Crucifix and Wilkinson. General sensitivity analysis of the climate-vegetation system to astronomical forcing: an emulator-based approach.

      5. D.M. Ricciuto, R. Tonkonojenkov, Wilkinson, N.M. Urban, D. Matthews, K.J. Davis, and K. Keller, Assimilation of global carbon cycle observations into an Earth system model to estimate uncertain terrestrial carbon cycle parameters.

      I am a member of the two EPSRC networks, and the newly formed Past Earth Network.

        Bayesian approaches in palaeontology

        I am interested in stochastic modelling in palaeontological applications. These problems are often characterised by only having a limited amount of noisy data that we must exploit as best we can.

        Primate genetic and fossil posteriors

      1. Wilkinson, M. Steiper, C. Soligo, R.D. Martin, Z. Yang, and S. Tavaré. Dating primate divergences through an integrated analysis of palaeontological and molecular data. Systematic Biology 60(1): 16-31, 2011.

        We combine the posteriors found in chapter 4 of my thesis with genetic data from extant primates to find estimates of the primate divergence time that incorporate both genetic and fossil data. The main findings are illustrated in the figure above.

      2. Wilkinson and S. Tavaré. Estimating the primate divergence time using conditioned birth-and-death processes. Theoretical Population Biology, 75 (2009), pp. 278-285. doi:10.1016/j.tpb.2009.02.003.

        A paper based on chapters 7 and 8 of my PhD thesis.

      3. Bracken-Grissom, H.D., Ahyong, S. T., Wilkinson, Feldmann, R., Schweitzer, C., Brienholdt, J., Bendall, M., Palero, F., Chan, T-Y., Felder, D.L., Robles, R., Chu, K.H., Tsang, M., Kim, D., Martin, J., Crandall, K.A. The Emergence of the Lobsters: Phylogenetic Relationships, Morphological Evolution and Divergence Time Comparisons of an Ancient Group (Decapoda: Achelata, Astacidea, Glypheidea, Polychelida). Systematic Biology, in press, 2014.

        Amongst other things, does for lobsters what we previously did for primates. It also compares different approaches for utilizing the information in the fossil record.

      4. Wilkinson. Bayesian Inference of Primate Divergence Times. PhD Thesis. Department of Applied Mathematics and Theoretical Physics, University of Cambridge, (2007).

        My thesis contains most of the results in the above two papers, plus other cases that haven't been published. Chapters of particular interest might be chapters 4 and 6, where the basic primate model is extended in various directions.

        Uncertainty quantification for carbon capture and storage

        Carbon capture and storage (CCS) is the process of capturing the CO2 emissions produced in the generation of electricity from fossil fuels, and then burying the captured CO2 underground. I am involved in several CCS projects.

        The first is through involvement in the project, an EU-funded consortium across 12 universities. In Nottingham, we are responsible for the uncertainty quantification. The main question we wish to answer is: given that the geology of the underlying bedrock is unknown, what do we believe the likely distribution is for how the buried CO2 will permeate through the bedrock? This involves building Gaussian process emulators of complex groundwater flow models. The dimension of the inputs and outputs of this simulator can be very high (d=105 is typical), and so we are investigating dimension reduction techniques for accurately capturing the simulator behaviour.

        The second CCS project is looking at modelling impure CO2 properties, which is important in the transport of CO2. This ranges from training empirical parametric models, to non-parametric Gaussian process models where we allow the functional form of the model to arise from the data itself, rather than being postulated in advance.

        Finally, we have been looking at ab-initio molecular simulation. A particular feature of our approach is to integrate techniques across different length scales, aiming to produce highly tractable models that are informed by the more expensive molecular techniques. We have been using Gaussian processes to feed information about the molecular interactions from computational chemistry calculations into the simulation. Ultimately, we hope this will lead to completely ab-initio computation of physical properties.