Approximate Bayesian model choice via random forests


  1. Reliable Approximate Bayesian computation (ABC) model choice via random forests. Christian P. Robert, Université Paris-Dauphine, Paris & University of Warwick, Coventry. SPA 2015, University of Oxford. [email protected] Joint with J.-M. Cornuet, A. Estoup, J.-M. Marin, & P. Pudlo.
  2. The next MCMSkv meeting: Computational Bayes section of ISBA major meeting: MCMSki V in Lenzerheide, Switzerland, Jan. 5-7, 2016. MCMC, pMCMC, SMC2, HMC, ABC, (ultra-) high-dimensional computation, BNP, QMC, deep learning, &tc. Plenary speakers: S. Scott, S. Fienberg, D. Dunson, K. Latuszynski, T. Lelièvre. Call for contributed sessions and tutorials opened. Switzerland in January, where else...?!
  3. Outline: intractable likelihoods; ABC methods; ABC for model choice; ABC model choice via random forests.
  4. Intractable likelihood. Case of a well-defined statistical model where the likelihood function ℓ(θ|y) = f(y1, ..., yn|θ) is (really!) not available in closed form, cannot (easily!) be either completed or demarginalised, and cannot be (at all!) estimated by an unbiased estimator. Examples: latent variable models of high dimension, including combinatorial structures (trees, graphs), and models with a missing normalising constant, f(x|θ) = g(x, θ)/Z(θ) (e.g. Markov random fields, exponential random graphs, ...). This prohibits direct implementation of a generic MCMC algorithm like Metropolis-Hastings, which gets stuck exploring the missing structures.
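The point about the missing normalising constant can be made concrete with a small sketch (not from the slides; the one-dimensional Ising-type chain, the parameter value and the function names are illustrative): evaluating Z(θ) for a Markov random field means summing the unnormalised term g(x, θ) over every configuration, a cost that explodes with the dimension.

```python
import itertools
import numpy as np

def unnormalised(x, theta):
    """Unnormalised Ising-type likelihood g(x, theta) on a 1-D chain of +/-1 spins:
    theta rewards agreement between neighbouring spins."""
    return np.exp(theta * np.sum(x[:-1] * x[1:]))

def log_Z(theta, n):
    """Brute-force normalising constant Z(theta): sums over all 2^n
    configurations, so it is only feasible for very small n."""
    total = 0.0
    for x in itertools.product([-1, 1], repeat=n):
        total += unnormalised(np.array(x), theta)
    return np.log(total)

# For n = 20 spins this loop already visits ~10^6 states; for a modest
# 10x10 lattice (n = 100) it would require 2^100 terms, hence "intractable".
print(log_Z(theta=0.5, n=12))
```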
  6. Necessity is the mother of invention. Case of a well-defined statistical model where the likelihood function ℓ(θ|y) = f(y1, ..., yn|θ) is out of reach. Empirical A to the original B problem: degrading the data precision down to a tolerance level ε; replacing the likelihood with a non-parametric approximation based on simulations; summarising/replacing the data with insufficient statistics.
  9. Approximate Bayesian computation. Outline: intractable likelihoods; ABC methods (genesis of ABC, abc of ABC, summary statistic); ABC for model choice; ABC model choice via random forests.
  10. Genetic background of ABC. ABC is a recent computational technique that only requires being able to sample from the likelihood f(·|θ). This technique stemmed from population genetics models, about 15 years ago, and population geneticists still significantly contribute to methodological developments of ABC. [Griffiths & al., 1997; Tavaré & al., 1999]
  11. Demo-genetic inference. Each model is characterized by a set of parameters θ that cover historical (divergence times, admixture times, ...), demographic (population sizes, admixture rates, migration rates, ...) and genetic (mutation rates, ...) factors. The goal is to estimate these parameters from a dataset of polymorphism (DNA sample) y observed at the present time. Problem: most of the time, we cannot calculate the likelihood of the polymorphism data f(y|θ)...
  13. Kingman's coalescent. Kingman's genealogy: when the time axis is normalised, T(k) ~ Exp(k(k-1)/2). Mutations according to the Simple stepwise Mutation Model (SMM): the dates of the mutations follow a Poisson process with intensity θ/2 over the branches, and each mutation is an independent step of ±1 with probability 1/2 (illustration: MRCA allele value = 100).
  15. Kingman's coalescent (continued). Observations: the leaves of the tree. Question: θ̂ = ?
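A minimal simulation sketch of the two slides above (not from the deck; the sample size, mutation rate and ancestral allele value are illustrative), combining coalescent waiting times T(k) ~ Exp(k(k-1)/2) with Poisson(θ/2 per unit of branch length) stepwise mutations:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_smm(n_sample, theta, ancestral=100):
    """Simulate microsatellite data under Kingman's coalescent + SMM.
    - inter-coalescence times: T(k) ~ Exp(rate = k(k-1)/2)
    - mutations: Poisson process of intensity theta/2 along each branch
    - each mutation moves the repeat number by +1 or -1 with prob. 1/2
    """
    # each active lineage carries the set of sampled leaves below it
    lineages = [[i] for i in range(n_sample)]
    alleles = np.full(n_sample, ancestral, dtype=int)

    while len(lineages) > 1:
        k = len(lineages)
        t_k = rng.exponential(scale=2.0 / (k * (k - 1)))   # T(k)
        # drop mutations on every active branch during this epoch;
        # all leaves below a branch inherit its mutation steps
        for leaves in lineages:
            n_mut = rng.poisson(theta / 2.0 * t_k)
            alleles[leaves] += rng.choice([-1, 1], size=n_mut).sum()
        # pick two lineages uniformly at random to coalesce
        i, j = rng.choice(k, size=2, replace=False)
        merged = lineages[i] + lineages[j]
        lineages = [l for idx, l in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    return alleles

print(simulate_smm(n_sample=10, theta=2.0))
```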
  16. Instance of ecological questions [message in a beetle]. How did the Asian ladybird beetle arrive in Europe? Why do they swarm right now? What are the routes of invasion? How to get rid of them? [Lombaert & al., 2010, PLoS ONE] [photo: beetles in forests]
  17. Worldwide invasion routes of Harmonia axyridis. [Estoup et al., 2012, Molecular Ecology Res.]
  18. Intractable likelihood. Missing (too much missing!) data structure: f(y|θ) = ∫_G f(y|G, θ) f(G|θ) dG cannot be computed in a manageable way... [Stephens & Donnelly, 2000] The genealogies are considered as nuisance parameters. This modelling clearly differs from the phylogenetic perspective, where the tree is the parameter of interest.
  20. A?B?C? A stands for approximate [wrong likelihood / picture], B stands for Bayesian, C stands for computation [producing a parameter sample].
  21. ABC methodology. Bayesian setting: target is π(θ) f(x|θ). When the likelihood f(x|θ) is not in closed form, likelihood-free rejection technique. Foundation: for an observation y ~ f(y|θ), under the prior π(θ), if one keeps jointly simulating θ′ ~ π(θ), z ~ f(z|θ′), until the auxiliary variable z is equal to the observed value, z = y, then the selected θ′ ~ π(θ|y). [Rubin, 1984; Diggle & Gratton, 1984; Tavaré et al., 1997]
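The exact-match foundation is feasible whenever the data are discrete. A minimal sketch under an assumed toy model (Poisson likelihood with an Exponential prior; the observed count and number of draws are made up), where the accepted θ's are exact draws from the posterior:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: y ~ Poisson(theta), theta ~ Exp(1) prior, observed y = 3.
# Because the data are discrete, the event z == y has positive probability,
# so the accepted theta's follow exactly pi(theta | y).
y_obs = 3
accepted = []
while len(accepted) < 1000:
    theta = rng.exponential(1.0)      # theta' ~ pi(theta)
    z = rng.poisson(theta)            # z ~ f(z | theta')
    if z == y_obs:                    # keep only exact matches
        accepted.append(theta)

print(np.mean(accepted))  # close to the posterior mean (y_obs + 1)/2 = 2.0
```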
  23. A as A...pproximative. When y is a continuous random variable, strict equality z = y is replaced with a tolerance zone ρ(y, z) ≤ ε, where ρ is a distance. Output distributed from π(θ) Pθ{ρ(y, z) < ε}, which is by definition proportional to π(θ | ρ(y, z) < ε). [Pritchard et al., 1999]
  25. ABC recap. Likelihood-free rejection sampling [Tavaré et al. (1997), Genetics]:
     1) Set i = 1,
     2) Generate θ′ from the prior distribution π(·),
     3) Generate z′ from the likelihood f(·|θ′),
     4) If ρ(η(z′), η(y)) ≤ ε, set (θi, zi) = (θ′, z′) and i = i + 1,
     5) If i ≤ N, return to 2).
     Only keep the θ's such that the distance between the corresponding simulated dataset and the observed dataset is small enough. Tuning parameters: ε > 0, the tolerance level; η(z), a function that summarizes datasets; ρ(·, ·), a distance between vectors of summary statistics; N, the size of the output.
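A minimal sketch of this rejection sampler, under illustrative choices that are not in the slides: a normal location model, the sample mean and log-variance as the summary η, and the Euclidean distance as ρ. As in the recap above, it loops until N accepted values have been collected.

```python
import numpy as np

rng = np.random.default_rng(2)

def summary(data):
    """eta(z): summary statistics (here sample mean and log-variance)."""
    return np.array([data.mean(), np.log(data.var() + 1e-12)])

def abc_rejection(y, simulate, prior_sample, eps, n_keep):
    """Likelihood-free rejection sampler: keep theta' whenever
    rho(eta(z'), eta(y)) <= eps, until n_keep values are accepted."""
    s_obs = summary(y)
    kept = []
    while len(kept) < n_keep:
        theta = prior_sample()                   # theta' ~ pi(theta)
        z = simulate(theta, len(y))              # z' ~ f(. | theta')
        if np.linalg.norm(summary(z) - s_obs) <= eps:   # rho(eta(z'), eta(y))
            kept.append(theta)
    return np.array(kept)

# Toy example: y ~ N(mu, 1) with a N(0, 10) prior on mu.
y = rng.normal(1.5, 1.0, size=50)
post = abc_rejection(
    y,
    simulate=lambda mu, n: rng.normal(mu, 1.0, size=n),
    prior_sample=lambda: rng.normal(0.0, np.sqrt(10.0)),
    eps=0.2,
    n_keep=500,
)
print(post.mean())  # close to the sample mean of y for a small tolerance
```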
  27. Output. The likelihood-free algorithm samples from the marginal in z of
     π_ε(θ, z|y) = π(θ) f(z|θ) 1_{A_{ε,y}}(z) / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ,
     where A_{ε,y} = {z ∈ D : ρ(η(z), η(y)) < ε}. The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution: π_ε(θ|y) = ∫ π_ε(θ, z|y) dz ≈ π(θ|y).
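For readability, the same target can be written out in LaTeX, together with the limiting behaviour that motivates the deck's later attention to summary statistics (the ε → 0 limit is a standard ABC fact, added here as a clarifying step):

```latex
\[
\pi_\varepsilon(\theta, z \mid y)
  = \frac{\pi(\theta)\, f(z \mid \theta)\, \mathbf{1}_{A_{\varepsilon,y}}(z)}
         {\int_{A_{\varepsilon,y} \times \Theta}
          \pi(\theta)\, f(z \mid \theta)\, \mathrm{d}z\, \mathrm{d}\theta},
\qquad
A_{\varepsilon,y} = \{\, z \in \mathcal{D} :
                        \rho(\eta(z), \eta(y)) < \varepsilon \,\}.
\]
As $\varepsilon \to 0$, the marginal
$\pi_\varepsilon(\theta \mid y) = \int \pi_\varepsilon(\theta, z \mid y)\,\mathrm{d}z$
converges to $\pi(\theta \mid \eta(y))$, the posterior given the summary
statistics, which coincides with $\pi(\theta \mid y)$ only when $\eta$ is sufficient.
```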