Transcript of A Gentle Tutorial in Bayesian Statistics.pdf

Page 1: A Gentle Tutorial in Bayesian Statistics.pdf

A Gentle Tutorial in Bayesian Statistics

Theo Kypraios
http://www.maths.nott.ac.uk/~tk

School of Mathematical Sciences − Division of Statistics

Division of Radiological and Imaging Sciences Away Day


Page 2: A Gentle Tutorial in Bayesian Statistics.pdf

Warning

This talk includes

about 5 equations (hopefully not too hard!)

about 10 figures.

This tutorial should be accessible even if the equations might look hard.


Page 3: A Gentle Tutorial in Bayesian Statistics.pdf

Outline of the Talk

The need for (statistical) modelling;

two examples (a linear model/tractography)

introduction to statistical inference (frequentist);

introduction to the Bayesian approach to parameter estimation;

more examples and Bayesian inference in practice

conclusions.


Page 4: A Gentle Tutorial in Bayesian Statistics.pdf

Use of Statistics in Clinical Sciences (1)

Examples include:

Sample Size Determination

Comparison between two (or more) groups

t-tests, Z-tests;

Analysis of variance (ANOVA);

tests for proportions etc;

Receiver Operating Characteristic (ROC) curves;

Clinical Trials;

. . .



Page 9: A Gentle Tutorial in Bayesian Statistics.pdf

Use of Statistics in Clinical Sciences (2)

One of the best ways to describe some data is by fitting a (statistical) model. Examples include:

(linear/logistic/loglinear) regression models;

survival analysis;

longitudinal data analysis;

infectious disease modelling;

image/shape analysis;

. . .



Page 15: A Gentle Tutorial in Bayesian Statistics.pdf

Aims of Statistical Modelling: A Simple Example

Perhaps we can fit a straight line?

y = α + βx + error

[Figure: scatter plot of response (y) against explanatory (x), with x from −2 to 2 and y from 0.2 to 1.0]
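To make this concrete, here is a minimal sketch (my own illustration with simulated data, not from the slides) of fitting the straight line y = α + βx by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data loosely resembling the slide's scatter plot.
x = rng.uniform(-2, 2, 40)
y = 0.6 + 0.2 * x + rng.normal(0, 0.05, size=40)  # y = alpha + beta*x + error

# Least-squares estimates of (alpha, beta); polyfit returns the slope first.
beta_hat, alpha_hat = np.polyfit(x, y, deg=1)
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}")
```

The estimates should land close to the simulated values (α ≈ 0.6, β ≈ 0.2).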


Page 16: A Gentle Tutorial in Bayesian Statistics.pdf

An Example in DW-MRI

Suppose that we are interested in tractography.

We use the diffusion tensor to model local diffusion within a voxel.

The (model) assumption made is that local diffusion could be modelled with a 3D Gaussian distribution whose variance-covariance matrix is proportional to the diffusion tensor, D.


Page 17: A Gentle Tutorial in Bayesian Statistics.pdf

An Example in DW-MRI

The resulting diffusion-weighted signal $\mu_i$ along a gradient direction $g_i$ with b-value $b_i$ is modelled as:

$$\mu_i = S_0 \exp\{-b_i\, g_i^T D\, g_i\} \qquad (1)$$

where

$$D = \begin{pmatrix} D_{11} & D_{12} & D_{13} \\ D_{21} & D_{22} & D_{23} \\ D_{31} & D_{32} & D_{33} \end{pmatrix}$$

$S_0$ is the signal with no diffusion-weighting gradients applied (i.e. $b_0 = 0$).

The eigenvectors of D give an orthogonal coordinate system and define the orientation of the ellipsoid axes.

The eigenvalues of D give the lengths of these axes.

If we sort the eigenvalues by magnitude we can derive the orientation of the major axis of the ellipsoid and the orientations of the minor axes.


Page 18: A Gentle Tutorial in Bayesian Statistics.pdf

An Example in DW-MRI

Although this may look a bit complicated, it can actually be written in terms of a linear model (see the sketch below).

Taken from Sotiropoulos, Jones, Bai + K (2010).
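To make the linear-model connection concrete, here is a minimal sketch (my own illustration, not code from the paper): taking logs of equation (1) gives $\ln(\mu_i/S_0) = -b_i\, g_i^T D\, g_i$, which is linear in the six unique elements of D, so the tensor can be estimated by ordinary least squares.

```python
import numpy as np

# Hypothetical acquisition scheme: b-values and unit gradient directions.
bvals = np.full(7, 1000.0)
gdirs = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                  [1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)
gdirs /= np.linalg.norm(gdirs, axis=1, keepdims=True)

def design_row(b, g):
    """ln(mu_i/S0) = -b * g^T D g is linear in the six unique
    tensor elements (D11, D22, D33, D12, D13, D23)."""
    gx, gy, gz = g
    return -b * np.array([gx*gx, gy*gy, gz*gz, 2*gx*gy, 2*gx*gz, 2*gy*gz])

X = np.array([design_row(b, g) for b, g in zip(bvals, gdirs)])

# Simulate noiseless signals from a known tensor and check the recovery.
D_true = np.diag([1.7e-3, 0.3e-3, 0.3e-3])  # prolate tensor (made-up values)
mu = np.exp([-b * (g @ D_true @ g) for b, g in zip(bvals, gdirs)])  # S0 = 1

d_hat, *_ = np.linalg.lstsq(X, np.log(mu), rcond=None)
print(d_hat)  # first three entries ≈ diag(D_true), off-diagonals ≈ 0
```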

Page 19: A Gentle Tutorial in Bayesian Statistics.pdf

Aims of Statistical Modelling

Models have parameters, some of which (if not all) are unknown, e.g. α and β.

In statistical modelling we are interested in inferring (e.g. estimating) the unknown parameters from data → inference.

Parameter estimation needs to be done in a formal way. In other words, we ask ourselves the question: what are the best values for α and β such that the proposed model (straight line) best describes the observed data?

Should we only look for a single estimate for (α, β)? No!

Why? Because there may be many pairs (α, β) (often not very different from each other) which may equally well describe the data → uncertainty



Page 22: A Gentle Tutorial in Bayesian Statistics.pdf

The likelihood function

The likelihood function plays a fundamental role in statisticalinference.

In non-technical terms, the likelihood function is a function that, when evaluated at a particular point, say (α0, β0), gives the probability of observing the (observed) data given that the parameters (α, β) take the values α0 and β0.

Let’s think of a very simple example:

Suppose we are interested in estimating the probability of success (denoted by θ) for one particular experiment.

Data: Out of 100 times we repeated the experiment we observed 80 successes.

Here the likelihood is $L(\theta) = \binom{100}{80}\theta^{80}(1-\theta)^{20}$. What about L(0.1), L(0.7), L(0.99)?
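A minimal sketch of these evaluations (my own illustration): computing the binomial likelihood at a few values of θ shows why values near 0.8 are favoured.

```python
from math import comb

def likelihood(theta, n=100, k=80):
    """Binomial likelihood: probability of k successes in n trials."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

for theta in (0.1, 0.7, 0.8, 0.99):  # 0.8 added for comparison
    print(f"L({theta}) = {likelihood(theta):.3e}")
# L(0.8) is by far the largest: theta = 0.8 makes the data most probable.
```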


Page 23: A Gentle Tutorial in Bayesian Statistics.pdf

Classical (Frequentist) Inference

Frequentist inference tells us that:

we should look for parameter values that maximise the likelihood function → maximum likelihood estimator (MLE);

associate parameter uncertainty with the calculation of standard errors . . .

. . . which in turn enable us to construct confidence intervals for the parameters.

What’s wrong with that?

Nothing, but . . .

. . . it is approximate, counter-intuitive (the data are assumed to be random, the parameter is fixed) and often mathematically intractable.
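As a hedged sketch of this recipe for the success-probability example (standard binomial results, not worked on the slides): the MLE is θ̂ = k/n, its approximate standard error is √(θ̂(1−θ̂)/n), and a 95% Wald confidence interval is θ̂ ± 1.96 se.

```python
from math import sqrt

n, k = 100, 80
theta_hat = k / n                           # maximum likelihood estimate
se = sqrt(theta_hat * (1 - theta_hat) / n)  # approximate standard error
lo, hi = theta_hat - 1.96 * se, theta_hat + 1.96 * se
print(f"MLE = {theta_hat}, 95% CI = ({lo:.3f}, {hi:.3f})")
# MLE = 0.8, 95% CI = (0.722, 0.878)
```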



Page 25: A Gentle Tutorial in Bayesian Statistics.pdf

Classical (Frequentist) Inference - Some Issues

For instance, we cannot ask (or answer!) questions such as

1. “what is the probability that the (unknown) probability of success in the previous experiment is greater than 0.6?”, i.e. compute the quantity P(θ > 0.6) . . .

2. or something like P(0.3 < θ < 0.9);

Sometimes we are interested in (not necessarily linear) functions of the parameters, e.g.

$$\theta_1 + \theta_2, \qquad \frac{\theta_1/(1-\theta_1)}{\theta_2/(1-\theta_2)}$$

Whilst in some cases the frequentist approach offers a solution which is not exact but approximate, there are others where it cannot, or where it is very hard to do so.


Page 26: A Gentle Tutorial in Bayesian Statistics.pdf

Bayesian Inference

When drawing inference within a Bayesian framework,

the data are treated as a fixed quantity and

the parameters are treated as random variables.

That allows us to assign probabilities to parameters (and models), making the inferential framework

far more intuitive and

more straightforward (at least in principle!)


Page 27: A Gentle Tutorial in Bayesian Statistics.pdf

Bayesian Inference (2)

Denote by θ the parameters and by y the observed data. Bayes theorem allows us to write:

$$\pi(\theta|y) = \frac{\pi(y|\theta)\,\pi(\theta)}{\pi(y)} = \frac{\pi(y|\theta)\,\pi(\theta)}{\int_\theta \pi(y|\theta)\,\pi(\theta)\, d\theta}$$

where

π(θ|y) denotes the posterior distribution of the parameters given the data;

π(y|θ) = L(θ) is the likelihood function;

π(θ) is the prior distribution of θ, which expresses our beliefs about the parameters before we see the data;

π(y) is often called the marginal likelihood and plays the role of the normalising constant of the density of the posterior distribution.
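For the success-probability example everything above is available in closed form. The following is a minimal sketch (standard conjugate Beta-binomial algebra, not taken from the slides): with a Beta(a, b) prior and k successes in n trials, the posterior is Beta(a + k, b + n − k).

```python
from scipy import stats

a, b = 1.0, 1.0     # Beta(1, 1) prior: uniform on [0, 1]
n, k = 100, 83      # data: 83 successes out of 100 trials

posterior = stats.beta(a + k, b + n - k)  # conjugate posterior update

# The posterior answers exactly the questions the frequentist framework
# could not, e.g. P(theta > 0.6) and P(0.3 < theta < 0.9):
print(posterior.mean())                         # ≈ 0.82
print(1 - posterior.cdf(0.6))                   # P(theta > 0.6)
print(posterior.cdf(0.9) - posterior.cdf(0.3))  # P(0.3 < theta < 0.9)
```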


Page 28: A Gentle Tutorial in Bayesian Statistics.pdf

Bayesian vs Frequentist Inference

Everything is assigned distributions (prior, posterior);

we are allowed to incorporate prior information about the parameter . . .

. . . which is then updated by using the likelihood function . . .

leading to the posterior distribution, which tells us everything we need to know about the parameter.



Page 32: A Gentle Tutorial in Bayesian Statistics.pdf

Bayesian Inference: The Prior

One of the biggest criticisms of the Bayesian paradigm is the use of the prior distribution.

Choose a very informative prior to come up with favourable results;

I know nothing about the parameter; what prior do I choose?

Arguments against that criticism:

priors should be chosen before we see the data, and it is very often the case that there is some prior information available (e.g. previous studies);

if we know nothing about the parameter, then we could assign to it a so-called uninformative (or vague) prior;

if there is a lot of data available then the posterior distribution will not be influenced by the prior (too much), and vice versa.


Page 33: A Gentle Tutorial in Bayesian Statistics.pdf

Bayesian Inference: The Posterior

Although Bayesian inference has been around for a long time, it is only in the last two decades that it has really revolutionized the way we do statistical modelling.

Although, in principle, Bayesian inference is straightforward and intuitive, when it comes to computations it can be very hard to implement.

Thanks to computational developments such as Markov Chain Monte Carlo (MCMC), doing Bayesian inference is a lot easier; a minimal sampler is sketched below.
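As an illustration of the idea (a textbook random-walk Metropolis sampler, not code from the talk), here is a minimal MCMC sampler for the success probability θ in the binomial example. It only ever evaluates prior × likelihood, so the intractable normalising constant π(y) is never needed.

```python
import math
import random

def log_post(theta, n=100, k=83):
    """Unnormalised log posterior: binomial likelihood x uniform prior."""
    if not 0 < theta < 1:
        return -math.inf
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

theta, samples = 0.5, []
for _ in range(20000):
    proposal = theta + random.gauss(0, 0.05)  # symmetric random-walk proposal
    # Metropolis rule: accept with probability min(1, posterior ratio).
    if math.log(random.random()) < log_post(proposal) - log_post(theta):
        theta = proposal
    samples.append(theta)

kept = samples[2000:]                          # discard burn-in
print(sum(kept) / len(kept))                   # posterior mean ≈ 0.82
print(sum(t > 0.6 for t in kept) / len(kept))  # P(theta > 0.6) ≈ 1
```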


Page 34: A Gentle Tutorial in Bayesian Statistics.pdf

Bayesian Inference: Some Examples

83/100 successes: interested in probability of success θ

[Figure: prior, likelihood, and posterior densities for θ on [0, 1]]
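The lost figure can be reproduced along these lines; this is a sketch under the assumption of a Beta(2, 2) prior, since the slides do not record which prior was used.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

n, k = 100, 83
a, b = 2.0, 2.0                                # assumed prior
theta = np.linspace(0.001, 0.999, 500)

prior = stats.beta(a, b).pdf(theta)
lik = stats.beta(k + 1, n - k + 1).pdf(theta)  # likelihood scaled to a density
posterior = stats.beta(a + k, b + n - k).pdf(theta)

plt.plot(theta, posterior, label="posterior")
plt.plot(theta, lik, label="lik")
plt.plot(theta, prior, label="prior")
plt.xlabel("theta")
plt.ylabel("density")
plt.legend()
plt.show()
```

With 83/100 successes the likelihood dominates, so the posterior sits almost on top of it; rerunning with n = 10, k = 8 (as on the later slide) lets the prior pull the posterior around much more.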



Page 37: A Gentle Tutorial in Bayesian Statistics.pdf

Bayesian Inference: Some Examples

8/10 successes: interested in probability of success θ

[Figure: prior, likelihood, and posterior densities for θ on [0, 1]]



Page 39: A Gentle Tutorial in Bayesian Statistics.pdf

Comparing Different Hypotheses: Bayesian Model Choice

Suppose that we are interested in testing two competing model hypotheses, M1 and M2.

Within a Bayesian framework, the model index M can be treated as an extra parameter (as well as the other parameters in M1 and M2).

So, it is natural to ask “what is the posterior model probability given the observed data?”, i.e. P(M1|y) or P(M2|y).

Bayes Theorem:

$$P(M_1|y) = \frac{\pi(y|M_1)\,\pi(M_1)}{\pi(y)}$$

where π(y|M1) is the marginal likelihood (also called the evidence) and π(M1) is the prior model probability.


Page 40: A Gentle Tutorial in Bayesian Statistics.pdf

Bayesian Model Choice (2)

Given a model selection problem in which we have to choose between two models, on the basis of observed data y . . .

. . . the plausibility of the two different models M1 and M2, parametrised by model parameter vectors θ1 and θ2, is assessed by the Bayes factor given by:

$$\frac{P(y|M_1)}{P(y|M_2)} = \frac{\int_{\theta_1} \pi(y|\theta_1, M_1)\,\pi(\theta_1)\, d\theta_1}{\int_{\theta_2} \pi(y|\theta_2, M_2)\,\pi(\theta_2)\, d\theta_2}$$

The Bayesian model comparison does not depend on the parameters used by each model. Instead, it considers the probability of the model taking all possible parameter values into account.

This is similar to a likelihood-ratio test, but instead of maximizing the likelihood, we average over all the parameters (a worked sketch follows below).
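A hedged toy version of this computation (standard conjugate algebra, not from the talk): for the binomial data, compare M1: θ = 0.5 against M2: θ ~ Beta(1, 1). Both marginal likelihoods are available in closed form, so no numerical integration is needed here.

```python
import math
from scipy.special import betaln

n, k = 100, 83

# M1: theta fixed at 0.5 -> marginal likelihood is the binomial pmf at 0.5.
log_m1 = math.log(math.comb(n, k)) + n * math.log(0.5)

# M2: theta ~ Beta(a, b) -> the marginal likelihood integrates the binomial
# likelihood against the prior: C(n, k) * B(a + k, b + n - k) / B(a, b).
a, b = 1.0, 1.0
log_m2 = math.log(math.comb(n, k)) + betaln(a + k, b + n - k) - betaln(a, b)

print(f"BF21 = {math.exp(log_m2 - log_m1):.3g}")
# BF21 is huge: the data are far more probable under the free-theta model.
```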


Page 41: A Gentle Tutorial in Bayesian Statistics.pdf

Bayesian Model Choice (3)

Why bother?

An advantage of the use of Bayes factors is that it automatically, and quite naturally, includes a penalty for including too much model structure.

It thus guards against overfitting.

No free lunch!

In practical situations, the calculation of the Bayes factor relies on the employment of computationally intensive methods, such as Reversible-Jump Markov Chain Monte Carlo (RJ-MCMC), which require a certain amount of expertise from the end-user.


Page 42: A Gentle Tutorial in Bayesian Statistics.pdf

An Example in DW-MRI Analysis

We assume that the voxel’s intensity can be modelled as

$$S_i/S_0 \sim N(\mu_i, \sigma^2)$$

where we could consider (at least) two different models:

1. Diffusion Tensor Model (Model 1) assumes that:

$$\mu_i = \exp\{-b_i\, g_i^T D\, g_i\}$$

2. Simple Partial Volume Model (Model 2) assumes that:

$$\mu_i = f \exp\{-b_i d\} + (1-f)\exp\{-b_i d\, g_i^T C\, g_i\}$$
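A minimal sketch of the two predicted signals (my own illustration; the parameter values are made up, and treating C as a symmetric rank-1 orientation matrix is an assumption, since the slides do not define it):

```python
import numpy as np

def model1_mu(b, g, D):
    """Diffusion tensor model: mu_i = exp(-b_i g_i^T D g_i)."""
    return np.exp(-b * (g @ D @ g))

def model2_mu(b, g, f, d, C):
    """Simple partial volume model:
    mu_i = f exp(-b_i d) + (1 - f) exp(-b_i d g_i^T C g_i)."""
    return f * np.exp(-b * d) + (1 - f) * np.exp(-b * d * (g @ C @ g))

# Hypothetical values for a single measurement:
b = 1000.0
g = np.array([1.0, 0.0, 0.0])
D = np.diag([1.7e-3, 0.3e-3, 0.3e-3])
f, d = 0.3, 1.0e-3
v = np.array([1.0, 0.0, 0.0])  # assumed fibre direction
C = np.outer(v, v)             # assumed orientation matrix

print(model1_mu(b, g, D), model2_mu(b, g, f, d, C))
```

Fitting both models to the same measurements and comparing their marginal likelihoods, as on the next slide, is then a direct application of the Bayes factor machinery above.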


Page 43: A Gentle Tutorial in Bayesian Statistics.pdf

An Example in DW-MRI Analysis (2)

Suppose that we have some measurements (intensities) for each voxel.

We could fit the two different models (on the same dataset).

Question: How do we tell which model fits the data best, taking into account the uncertainty associated with the parameters in each model?

Answer: Calculate the Bayes factor!



Page 45: A Gentle Tutorial in Bayesian Statistics.pdf

Conclusions

Quantification of the uncertainty both in parameter estimation and model choice is essential in any modelling exercise.

A Bayesian approach offers a natural framework to deal with parameter and model uncertainty.

It offers much more than a single “best fit” or any sort of “sensitivity analysis”.

There is no free lunch, unfortunately. To do fancy things, one often has to write his/her own computer programs.

Software available: R, WinBUGS, BayesX . . .
