
Approximate Bayesian Computation for Astrostatistics

Jessi Cisewski, Department of Statistics

Yale University

October 24, 2016

SAMSI Undergraduate Workshop

Our goals

Introduction to Bayesian methods

Likelihoods, priors, posteriors

Learn our ABCs...

Approximate Bayesian Computation

Image from http://livingstontownship.org/lycsblog/?p=241


Suppose we randomly select a sample of n women and measure their heights: x_1, x_2, x_3, ..., x_n

Let’s assume this follows a Normal distribution: X ∼ N(µ, σ)

$$f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Stellar Initial Mass Function


Probability density function (pdf) → the function f(x, µ, σ), with µ fixed and x variable

$$f(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

[Figure: Normal density over Height (in), centered at µ; each observation is a draw from this density.]

Likelihood → the function f(x, µ, σ), with µ variable and x fixed

$$f(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

[Figure: the likelihood as a function of µ.]


Consider a random sample of size n (assuming independence, and a known σ): X_1, ..., X_n ∼ N(3, 2)

$$f(x, \mu, \sigma) = f(x_1, \ldots, x_n, \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$$

[Figure: normalized likelihood of µ for n = 50, 100, 250, 500; the curves concentrate around µ as n grows.]
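To make the figure concrete, here is a minimal sketch (my code, not from the talk; the data are simulated here) that evaluates the normalized likelihood of µ on a grid and shows it tightening around the true value µ = 3 as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                                   # known standard deviation
mu_grid = np.linspace(1.0, 5.0, 400)

for n in (50, 100, 250, 500):
    x = rng.normal(3.0, sigma, size=n)        # X_1, ..., X_n ~ N(3, 2)
    # log-likelihood on the grid (constants not involving mu drop out)
    loglik = -0.5 * ((x[None, :] - mu_grid[:, None]) ** 2).sum(axis=1) / sigma**2
    lik = np.exp(loglik - loglik.max())       # stabilize before exponentiating
    lik /= lik.sum() * (mu_grid[1] - mu_grid[0])   # normalize to integrate to 1
    print(n, mu_grid[np.argmax(lik)])         # the peak approaches mu = 3
```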


Likelihood Principle

All of the information in a sample is contained in the likelihood function, a density or distribution function.

The data are modeled by a likelihood function.

How do we infer µ? Or any unknown parameter(s) θ?


Maximum likelihood estimation

The parameter value, θ̂, that maximizes the likelihood:

$$\hat{\theta} = \arg\max_{\theta}\, f(x_1, \ldots, x_n, \theta)$$

[Figure: the likelihood of µ with the MLE marked.]

$$\max_{\mu} f(x_1, \ldots, x_n, \mu, \sigma) = \max_{\mu} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$$

Hence, $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}$.
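Filling in the intermediate step, which the slide skips: maximize the log-likelihood instead of the likelihood.

$$\log L(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2, \qquad \frac{d}{d\mu}\log L(\mu) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \;\Rightarrow\; \hat{\mu} = \bar{x}$$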


Bayesian framework

Suppose θ is our parameter(s) of interest

Classical or Frequentist methods for inference consider θ to be fixed and unknown
→ performance of methods is evaluated by repeated sampling
→ considers all possible data sets

Bayesian methods consider θ to be random
→ considers only the observed data set and prior information


Posterior distribution

$$\pi(\theta \mid \overbrace{x}^{\text{Data}}) = \frac{\overbrace{f(x \mid \theta)}^{\text{Likelihood}} \cdot \overbrace{\pi(\theta)}^{\text{Prior}}}{f(x)} = \frac{f(x \mid \theta)\,\pi(\theta)}{\int_{\Theta} f(x \mid \theta)\,\pi(\theta)\,d\theta} \propto f(x \mid \theta)\,\pi(\theta)$$

The prior distribution allows you to “easily” incorporate your beliefs about the parameter(s) of interest

The posterior is a distribution on the parameter space given the observed data


Data: y_1, ..., y_4 ∼ N(µ = 3, σ = 2), ȳ = 1.278

Prior: N(µ0 = −3, σ0 = 5)

Posterior: N(µ1 = 1.114, σ1 = 0.981)
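These numbers follow from the standard Normal–Normal conjugate update; a quick sketch to verify them (the helper name is mine):

```python
import math

def normal_posterior(ybar, n, sigma, mu0, sigma0):
    """Posterior N(mu1, sigma1) for mu, given ybar from N(mu, sigma) and prior N(mu0, sigma0)."""
    prec = 1 / sigma0**2 + n / sigma**2                   # posterior precision
    mu1 = (mu0 / sigma0**2 + n * ybar / sigma**2) / prec  # precision-weighted mean
    return mu1, math.sqrt(1 / prec)

print(normal_posterior(1.278, 4, 2, -3, 5))   # ~(1.114, 0.981), this slide
print(normal_posterior(1.278, 4, 2, -3, 1))   # ~(-0.861, 0.707), next slide
```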

[Figure: likelihood, prior, and posterior densities for µ.]

Data: y_1, ..., y_4 ∼ N(µ = 3, σ = 2), ȳ = 1.278

Prior: N(µ0 = −3, σ0 = 1)

Posterior: N(µ1 = −0.861, σ1 = 0.707)

[Figure: likelihood, prior, and posterior densities for µ.]

Data: y_1, ..., y_200 ∼ N(µ = 3, σ = 2), ȳ = 2.999

Prior: N(µ0 = −3, σ0 = 1)

Posterior: N(µ1 = 2.881, σ1 = 0.140)

[Figure: likelihood, prior, and posterior densities for µ.]

Prior distribution

The prior distribution allows you to “easily” incorporate your beliefs about the parameter(s) of interest

If one has a specific prior in mind, then it fits nicely into the definition of the posterior

But how do you go from prior information to a prior distribution?

And what if you don’t actually have prior information?

And what if you don’t know the likelihood??


Approximate Bayesian Computation

“Likelihood-free” approach to approximating π(θ | x_obs), i.e. f(x_obs | θ) is not specified

Proceeds via simulation of the forward process

Why would we not know f(x_obs | θ)?

1. Physical model too complex to write as a closed-form function of θ

2. Strong dependency in the data

3. Observational limitations

ABC in Astronomy: Ishida et al. (2015); Akeret et al. (2015); Weyant et al. (2013); Schafer and Freeman (2012); Cameron and Pettitt (2012)


Basic ABC algorithm

For the observed data xobs and prior π(θ):

Algorithm∗

1. Sample θ_prop from the prior π(θ)

2. Generate x_prop from the forward process f(x | θ_prop)

3. Accept θ_prop if x_obs = x_prop

4. Return to step 1

∗Introduced in Pritchard et al. (1999) (population genetics)


Step 3: Accept θ_prop if x_obs = x_prop

Waiting for proposals such that x_obs = x_prop would be computationally prohibitive

Instead, accept proposals with ∆(x_obs, x_prop) ≤ ε, for some distance ∆ and some tolerance threshold ε

When x is high-dimensional, ε has to be made too large in order to keep the acceptance probability reasonable.

Instead, reduce the dimension by comparing summaries S(x_prop) and S(x_obs)


Gaussian illustration

Data x_obs consists of 25 iid draws from Normal(µ, 1)

Summary statistic S(x) = x̄

Distance function ∆(S(x_prop), S(x_obs)) = |x̄_prop − x̄_obs|

Tolerance ε = 0.50 and 0.10

Prior π(µ) = Normal(0,10)
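Putting the pieces together, a minimal rejection-ABC sketch of this illustration (my code; the “observed” data are simulated here with an arbitrary true µ = 0.7):

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, eps = 25, 1000, 0.1
x_obs = rng.normal(0.7, 1.0, size=n)           # observed data, mu unknown to us
s_obs = x_obs.mean()                           # summary statistic S(x) = xbar

accepted = []
while len(accepted) < N:
    mu_prop = rng.normal(0, 10)                # 1. sample from the prior N(0, 10)
    x_prop = rng.normal(mu_prop, 1.0, size=n)  # 2. forward-simulate a data set
    if abs(x_prop.mean() - s_obs) <= eps:      # 3. accept if summaries are close
        accepted.append(mu_prop)

print(np.mean(accepted), np.std(accepted))     # ABC posterior mean and sd for mu
```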


Gaussian illustration: posteriors for µ

[Figure: true posterior vs. ABC posterior for µ (N = 1000, n = 25), at tolerance ε = 0.5 (left) and ε = 0.1 (right), with the input value marked.]

How to pick a tolerance, ε?

Instead of starting the ABC algorithm over with a smaller tolerance (ε), decrease the tolerance and use the already sampled particle system as a proposal distribution rather than drawing from the prior distribution.

Particle system: (1) retained sampled values, (2) importance weights

Beaumont et al. (2009); Del Moral et al. (2011); Bonassi and West (2015)
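A compact sketch of this sequential scheme for the Gaussian illustration, in the spirit of Beaumont et al. (2009) (my simplification, not code from the talk): at each stage, resample the previous particles, perturb with a Gaussian kernel of scale τ, and reweight by prior density over the kernel mixture.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 25, 1000
x_obs = rng.normal(0.7, 1.0, size=n)
s_obs = x_obs.mean()
eps_seq = [1.00, 0.75, 0.53, 0.38, 0.27, 0.19, 0.15, 0.11, 0.08, 0.06]

def prior_pdf(mu):                              # prior N(0, 10)
    return np.exp(-0.5 * (mu / 10.0) ** 2) / (10.0 * np.sqrt(2 * np.pi))

particles, weights, tau = None, None, None
for eps in eps_seq:
    new = []
    while len(new) < N:
        if particles is None:
            mu = rng.normal(0, 10)              # first stage: draw from the prior
        else:                                   # later stages: move a past particle
            mu = rng.choice(particles, p=weights) + rng.normal(0, tau)
        if abs(rng.normal(mu, 1.0, size=n).mean() - s_obs) <= eps:
            new.append(mu)
    new = np.array(new)
    if particles is None:
        w = np.ones(N) / N
    else:                                       # importance weight: prior / proposal mixture
        kern = np.exp(-0.5 * ((new[:, None] - particles[None, :]) / tau) ** 2)
        w = prior_pdf(new) / (kern * weights[None, :]).sum(axis=1)
        w /= w.sum()
    particles, weights = new, w
    m = np.average(particles, weights=weights)
    tau = np.sqrt(2 * np.average((particles - m) ** 2, weights=weights))
    print(eps, m)                               # weighted posterior mean per stage
```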


Gaussian illustration: sequential posteriors

[Figure: sequential ABC posterior for µ (N = 1000, n = 25) against the true posterior and the input value.]

Tolerance sequence ε_{1:10}: 1.00, 0.75, 0.53, 0.38, 0.27, 0.19, 0.15, 0.11, 0.08, 0.06


Stellar Initial Mass Function


Stellar Initial Mass Function: the distribution of star masses after a star formation event within a specified volume of space

Molecular cloud → Protostars → Stars

Image: adapted from http://www.astro.ljmu.ac.uk


Broken power law (Kroupa, 2001)

$$\Phi(M) \propto M^{-\alpha_i}, \quad M_{1i} \leq M \leq M_{2i}$$

α_1 = 0.3 for 0.01 ≤ M/M_Sun∗ ≤ 0.08 [Sub-stellar]

α_2 = 1.3 for 0.08 ≤ M/M_Sun ≤ 0.50

α_3 = 2.3 for 0.50 ≤ M/M_Sun ≤ M_max

Many other models, e.g. Salpeter (1955); Chabrier (2003)

∗1 M_Sun = 1 Solar Mass (the mass of our Sun)


ABC for the Stellar Initial Mass Function

[Figure: histogram of masses (left) and of log10 masses (right).]

Image (left): Adapted from http://www.astro.ljmu.ac.uk


IMF Likelihood

Start with a power-law distribution: each star's mass is independently drawn from a power-law distribution with density

$$f(m) = \left(\frac{1-\alpha}{M_{\max}^{1-\alpha} - M_{\min}^{1-\alpha}}\right) m^{-\alpha}, \quad m \in (M_{\min}, M_{\max})$$

Then the likelihood is

$$L(\alpha \mid m_{1:n_{\rm tot}}) = \left(\frac{1-\alpha}{M_{\max}^{1-\alpha} - M_{\min}^{1-\alpha}}\right)^{n_{\rm tot}} \times \prod_{i=1}^{n_{\rm tot}} m_i^{-\alpha}$$

n_tot = total number of stars in the cluster
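Since f(m) has a closed-form CDF, the forward process can draw masses by inverting it. A small sketch (the helper name is mine):

```python
import numpy as np

def sample_powerlaw(n, alpha, m_min, m_max, rng):
    """Draw n masses with density proportional to m**(-alpha) on (m_min, m_max)."""
    a = 1.0 - alpha                                   # assumes alpha != 1
    u = rng.uniform(size=n)
    return (m_min**a + u * (m_max**a - m_min**a)) ** (1.0 / a)

rng = np.random.default_rng(3)
masses = sample_powerlaw(1000, 2.35, 2.0, 60.0, rng)  # e.g. the simulation settings below
```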


Observational limitations: aging

The lifecycle of a star depends on its mass → more massive stars die faster

Cluster age of τ Myr → only observe stars with masses < T_age ≈ τ^(−2/5) × 10^(8/5)

If the age is 30 Myr, then the aging cutoff is T_age ≈ 10 M_Sun

Then the likelihood is

$$L(\alpha \mid m_{1:n_{\rm obs}}, n_{\rm tot}) = \left(\frac{1-\alpha}{T_{\rm age}^{1-\alpha} - M_{\min}^{1-\alpha}}\right)^{n_{\rm obs}} \left(\prod_{i=1}^{n_{\rm obs}} m_i^{-\alpha}\right) \times P(M > T_{\rm age})^{\,n_{\rm tot}-n_{\rm obs}}$$

n_tot = # of stars in the cluster

n_obs = # of stars observed in the cluster

Image: http://scioly.org
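In log form this truncated likelihood is straightforward to evaluate; a sketch under the slide's assumptions (the function name is mine):

```python
import numpy as np

def loglik_aging(alpha, m_obs, n_tot, t_age, m_min, m_max):
    """log L(alpha | m_obs, n_tot) with the aging cutoff t_age, per the formula above."""
    a = 1.0 - alpha                                         # assumes alpha != 1
    n_obs = len(m_obs)
    log_norm = np.log(a / (t_age**a - m_min**a))            # normalizer on (m_min, t_age)
    p_dead = (m_max**a - t_age**a) / (m_max**a - m_min**a)  # P(M > t_age)
    return (n_obs * log_norm
            - alpha * np.log(m_obs).sum()
            + (n_tot - n_obs) * np.log(p_dead))
```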


Observational limitations: completeness

Completeness function:

$$P(\text{observing star} \mid m) = \begin{cases} 0, & m < C_{\min} \\[2pt] \dfrac{m - C_{\min}}{C_{\max} - C_{\min}}, & m \in [C_{\min}, C_{\max}] \\[2pt] 1, & m > C_{\max} \end{cases}$$

Probability of observing a particular star given its mass

Depends on the flux limit, stellar crowding, etc.

Image: NASA, J. Trauger (JPL), J. Westphal (Caltech)


Observational limitations: measurement error

Incorporating log-normal measurement error gives our final likelihood:

$$\begin{aligned}
L(\alpha \mid m_{1:n_{\rm obs}}, n_{\rm tot}) ={}& \left( P(M > T_{\rm age}) + \left(\frac{1-\alpha}{M_{\max}^{1-\alpha} - M_{\min}^{1-\alpha}}\right) \int_{C_{\min}}^{C_{\max}} M^{-\alpha} \left(1 - \frac{M - C_{\min}}{C_{\max} - C_{\min}}\right) dM \right)^{n_{\rm tot}-n_{\rm obs}} \\
&\times \prod_{i=1}^{n_{\rm obs}} \Bigg\{ \int_{2}^{T_{\rm age}} (2\pi\sigma^2)^{-\frac{1}{2}}\, m_i^{-1}\, e^{-\frac{1}{2\sigma^2}\left(\log(m_i) - \log(M)\right)^2} \left(\frac{1-\alpha}{M_{\max}^{1-\alpha} - M_{\min}^{1-\alpha}}\right) M^{-\alpha} \\
&\hspace{3em} \times \left( I\{M > C_{\max}\} + \left(\frac{M - C_{\min}}{C_{\max} - C_{\min}}\right) I\{C_{\min} \leq M \leq C_{\max}\} \right) dM \Bigg\}
\end{aligned}$$


[Figure: the IMF density (left) and the mass function after aging, completeness, and measurement error (right).]

Sample size = 1000 stars, [C_min, C_max] = [2, 4], σ = 0.25


Simulation Study: forward model

Draw from

$$f(m) = \left(\frac{1-\alpha}{60^{1-\alpha} - 2^{1-\alpha}}\right) m^{-\alpha}, \quad m \in (2, 60)$$

Aged 30 Myr

Observational completeness:

$$P(\text{obs} \mid m) = \begin{cases} 0, & m < 2 \\[2pt] \dfrac{m-2}{2}, & m \in [2, 4] \\[2pt] 1, & m > 4 \end{cases}$$

Uncertainty: log M = log m + 0.25η (with η ∼ N(0, 1))

Prior: α ∼ U[0, 6]∗
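A sketch of this forward model (one ABC proposal in, one simulated catalog out), reusing the sample_powerlaw helper from the earlier sketch:

```python
import numpy as np

def forward_model(alpha, n_stars, rng, m_min=2.0, m_max=60.0,
                  age_myr=30.0, c_min=2.0, c_max=4.0, sigma=0.25):
    """Simulate observed masses for a proposed IMF slope alpha."""
    m = sample_powerlaw(n_stars, alpha, m_min, m_max, rng)    # true masses
    t_age = age_myr ** (-2.0 / 5.0) * 10.0 ** (8.0 / 5.0)     # aging cutoff, ~10 Msun at 30 Myr
    m = m[m < t_age]                                          # more massive stars have died
    p_obs = np.clip((m - c_min) / (c_max - c_min), 0.0, 1.0)  # completeness function
    m = m[rng.uniform(size=m.size) < p_obs]                   # incompleteness thinning
    return m * np.exp(sigma * rng.normal(size=m.size))        # log M = log m + sigma * eta

rng = np.random.default_rng(4)
alpha_prop = rng.uniform(0, 6)                                # prior alpha ~ U[0, 6]
m_sim = forward_model(alpha_prop, 1000, rng)
```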


Simulation Study: summary statistics

We want to account for the following with our summary statistics and distance functions:

1. Shape of the observed Mass Function

$$\rho_1(m_{\rm sim}, m_{\rm obs}) = \left[\int \left\{f_{\log m_{\rm sim}}(x) - f_{\log m_{\rm obs}}(x)\right\}^2 dx\right]^{1/2}$$


2. Number of stars observed

$$\rho_2(m_{\rm sim}, m_{\rm obs}) = \left|1 - n_{\rm sim}/n_{\rm obs}\right|$$

m_sim = masses of the stars simulated from the forward model
m_obs = masses of the observed stars
n_sim = number of stars simulated from the forward model
n_obs = number of observed stars
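A sketch of these two distances (my implementation: a Gaussian KDE for the density of the log masses, with base-10 logs and a fixed grid, neither of which is specified on the slide):

```python
import numpy as np
from scipy.stats import gaussian_kde

def rho1(m_sim, m_obs):
    """L2 distance between density estimates of the log masses."""
    grid = np.linspace(-1.0, 2.0, 512)                 # log10-mass grid
    f_sim = gaussian_kde(np.log10(m_sim))(grid)
    f_obs = gaussian_kde(np.log10(m_obs))(grid)
    dx = grid[1] - grid[0]
    return np.sqrt(((f_sim - f_obs) ** 2).sum() * dx)  # Riemann sum of the integral

def rho2(m_sim, m_obs):
    """Relative mismatch in the number of stars."""
    return abs(1.0 - len(m_sim) / len(m_obs))
```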


Simulation Study

1. Draw n = 10³ stars

2. IMF slope α = 2.35 with M_min = 2 and M_max = 60

3. N = 10³ particles

4. T = 30 sequential time steps

[Figure: the observed mass function (density over mass).]


[Figure: kernel density estimates of the IMF (N = 1000, bandwidth 0.4971) and the observed MF (N = 510, bandwidth 0.5625).]

Sample size = 1000 stars, [C_min, C_max] = [2, 4], σ = 0.25


Simulation Study results

1. Draw n = 10³ stars

2. IMF slope α = 2.35 with M_min = 2 and M_max = 60

3. N = 10³ particles

4. T = 30 sequential time steps

[Figure: (A) the observed mass function on the log10 scale with the posterior median and a 95% credible band; (B) the initial, intermediate, and final ABC posteriors for α, with the true posterior and the input value.]


Summary

ABC can be a useful tool when data are too complex to define a reasonable likelihood,

E.g. the Stellar Initial Mass Function

Selection of good summary statistics is crucial for the ABC posterior to be meaningful

THANK YOU!


Bibliography I

Akeret, J., Refregier, A., Amara, A., Seehars, S., and Hasner, C. (2015), “Approximate Bayesian computation for forward modeling in cosmology,” Journal of Cosmology and Astroparticle Physics, 2015, 043.

Bate, M. R. (2012), “Stellar, brown dwarf and multiple star properties from a radiation hydrodynamical simulation of star cluster formation,” Monthly Notices of the Royal Astronomical Society, 419, 3115–3146.

Beaumont, M. A., Cornuet, J.-M., Marin, J.-M., and Robert, C. P. (2009), “Adaptive approximate Bayesian computation,” Biometrika, 96, 983–990.

Bonassi, F. V. and West, M. (2015), “Sequential Monte Carlo with Adaptive Weights for Approximate Bayesian Computation,” Bayesian Analysis, 10, 171–187.

Cameron, E. and Pettitt, A. N. (2012), “Approximate Bayesian Computation for Astronomical Model Analysis: A Case Study in Galaxy Demographics and Morphological Transformation at High Redshift,” Monthly Notices of the Royal Astronomical Society, 425, 44–65.

Chabrier, G. (2003), “The galactic disk mass function: Reconciliation of the Hubble Space Telescope and nearby determinations,” The Astrophysical Journal Letters, 586, L133.

Ishida, E. E. O., Vitenti, S. D. P., Penna-Lima, M., Cisewski, J., de Souza, R. S., Trindade, A. M. M., Cameron, E., and Busti, V. C. (2015), “cosmoabc: Likelihood-free inference via Population Monte Carlo Approximate Bayesian Computation,” Astronomy and Computing, 13, 1–11.

Kroupa, P. (2001), “On the variation of the initial mass function,” Monthly Notices of the Royal Astronomical Society, 322, 231–246.

Del Moral, P., Doucet, A., and Jasra, A. (2011), “An adaptive sequential Monte Carlo method for approximate Bayesian computation,” Statistics and Computing, 22, 1009–1020.

Pritchard, J. K., Seielstad, M. T., and Perez-Lezaun, A. (1999), “Population Growth of Human Y Chromosomes: A study of Y Chromosome Microsatellites,” Molecular Biology and Evolution, 16, 1791–1798.

Salpeter, E. E. (1955), “The luminosity function and stellar evolution.” The Astrophysical Journal, 121, 161.

Schafer, C. M. and Freeman, P. E. (2012), Statistical Challenges in Modern Astronomy V, Springer, chap. 1, Lecture Notes in Statistics, pp. 3–19.

Weyant, A., Schafer, C., and Wood-Vasey, W. M. (2013), “Likelihood-free cosmological inference with type Ia supernovae: approximate Bayesian computation for a complete treatment of uncertainty,” The Astrophysical Journal, 764, 116.
