Bayesian Model Diagnostics and...

Bayesian Model Diagnostics and Checking

Earvin Balderama

Quantitative Ecology LabDepartment of Forestry and Environmental Resources

North Carolina State University

April 12, 2013

1 / 34Bayesian Model Diagnostics and Checking

c© 2013 by E. Balderama

Introduction

MCMCMC



Introduction

MCMCMC

Steps in (Bayesian) Modeling1 Model Creation2 Model Checking (Criticism)3 Model Comparison



Introduction

Outline

1 Popular Bayesian Diagnostics2 Predictive Distributions3 Posterior Predictive Checks4 Model Comparison

Precision

Accuracy

Extreme Values

5 Conclusion



Introduction

Modeling

Classical methods1 Standardized Pearson residuals2 p-values3 Likelihood ratio4 MLE

also apply in Bayesian Analysis;1 Posterior mean of the standardized residuals.2 Posterior probabilities3 Bayes factor4 Posterior mean



Popular Model Fit Statistics

Bayes Factor

For determining which model fits the data better, the Bayes factor iscommonly used in a hypothesis test.

Bayes factorThe odds ratio of the data favoring Model 1 over Model 2 is given by

BF =f1(y)f2(y)

=

∫f (y|θ1)f (θ1)dθ1∫f (y|θ2)f (θ2)dθ2

(1)

More robust than frequentist hypothesis testing.

Difficult to compute, although easy to approximate with software.

Only defined for proper marginal density functions.




Bayes Factor

Computation is conditional that one of the models is true.Because of this, Gelman thinks Bayes factors are irrelevant.

Prefers looking at distance measures between data and model.

Many alternatives to choose from! One of which is...




DIC

Like many good measures of model fit and comparison, the DevianceInformation Criterion (DIC) includes

1 how well the model fits the data (goodness of fit) and2 the complexity of the model (effective number of parameters).

Deviance Information Criterion

DIC = D + pD




Deviance

Deviance

D = −2 log f (y|θ) (2)

Example: Deviance with Poisson likelihood

D = −2 log∏

i

[µyi

i e−µi

yi!

]= −2

∑i

[− µi + yi logµi − log(yi!)

]




DIC




DIC = D + pD (3)




DIC




DIC = D + pD (4)

where1 D = E(D) is the posterior mean of deviance.2 pD = D− D(θ) is “mean deviance - deviance at means.”




DIC

DIC can then be rewritten as

DIC = 2D− D(θ)

= D(θ) + 2pD

= −2 log f (y|θ) + 2pD

which is a generalization of

AIC = −2 log f (y|θMLE) + 2k (5)

DIC can be used to compare different models as well as differentmethods. Preferred models have low DIC values.




DIC

Requires joint posterior distribution to be approximatelymultivariate normal.Doesn’t work well with

highly non-linear models

mixture models with discrete parameters

models with missing data

If pD is negativelog-likelihood may be non-concave

prior may be misspecified

posterior mean may not be a good estimator



Posterior Predictive Checks

Prior Predictive Distribution

One way to decide between competing models is to rank them basedon how “well” each model does in predicting future observations. InBayesian analyses, predictive distributions are used for this kind ofdecision.Before data is observed, what could we use for predictions?

Marginal likelihood

f (y) =∫

f (y|θ)f (θ)dθ (6)

The marginal likelihood is what one would expect data to look likeafter averaging over the prior distribution of θ , so it is called the priorpredictive distribution.




Posterior Predictive Distribution

More interestingly, if a set of data y have already been observed, onecan predict future (or new or unobserved) y′ from the marginalposterior likelihood of y′, called the posterior predictive distribution.

Marginal posterior likelihood

f (y′|y) =∫

f (y′|θ)f (θ|y)dθ (7)

This distribution is what one would expect y′ to look like afterobserving y and averaging over the posterior distribution of θ given y .




Posterior Predictive Distribution

Equivalently, y′ can be the missing values and treated as additionalparameters to be estimated in a Bayesian framework. Specifically, foreach month, we randomly give NA values to 10 observed sites tocreate a test set of n = 220 observations, {yi; i = 1, 2, . . . , n}. Then then posterior predictive distributions, {Pi; i = 1, 2, . . . , n}, can then beused to determine measures of overall model goodness-of-fit, as wellas predictive performance measures of each individual yi in the test set.




Posterior Predictive Ordinate

The posterior predictive ordinate (PPO) is the density of the posteriorpredictive distribution evaluated at an observation yi. PPO can be usedto estimate the probability of observing yi in the future if after havingalready observed y.

PPOi = f (yi|y) =∫

f (yi|θ)f (θ|y)dθ (8)

We can estimate the ith posterior predictive ordinate by

PPOi =1S

S∑s=1

f (yi|θ(s)) (9)




BUGS code

Example: Poisson count model

model{

#likelihoodfor(i in 1:N){

y[i] ~ dpois(mu[i])mu[i] <- ...

ppo.term[i] <- exp(-mu[i] + y[i]*log(mu[i]) - logfact(y[i]))

}

#priors...

}




Conditional Predictive Ordinate

PPO is good for prediction, but violates the likelihood principle.

The conditional predictive ordinate (CPO) is based onleave-one-out-cross-validation.

CPO estimates the probability of observing yi in the future if afterhaving already observed y−i.

Low CPO values suggest possible outliers, high-leverage andinfluential observations.





CPOi = f (yi|y−i) = · · · =[∫

1f (yi|θ)

f (θ|y)dθ]−1

(10)

Therefore, CPO can be estimated by taking the inverse of the posteriormean of the inverse density function value of yi (harmonic mean of thelikelihood of yi). Thus,

CPOi =

1S

S∑s=1

1

f(

yi|θ(s))−1

(11)




Predictive Ordinates

Estimate of PPO

PPOi =1S

S∑s=1

f (yi|θ(s)) (18)

Estimate of CPO

CPOi =

1S

S∑s=1

1

f(

yi|θ(s))−1

(19)




BUGS code

Example: Poisson count model

model{

#likelihoodfor(i in 1:N){

y[i] ~ dpois(mu[i])mu[i] <- ...

ppo.term[i] <- exp(-mu[i] + y[i]*log(mu[i]) - logfact(y[i]))icpo.term[i] <- 1/ppo.term[i]

}

#priors...

}



Model Comparison

LPML

The sum of the log CPO’s and is an estimator for the log marginallikelihood.

The “best” model amongst competing models have the largestLPML.

Log-pseudo marginal likelihood

LPML =1n

n∑i=1

log(CPOi) (20)



Model Comparison

Hypothesis Testing

A ratio of LPML’s is a surrogate for the Bayes factor.

Another overall measure for model comparison is the posterior Bayesfactor, which is simply the Bayes factor but using the posteriorpredictive distributions.



Model Comparison

Measures of Predictive Precision

The mean absolute deviation (MAD) is the mean of the absolutevalues of the deviations between the actual observed value and themedian of its respective P .

Mean Absolute Deviation

MAD =1n

n∑i=1

∣∣∣yi − Pi

∣∣∣ (21)

Note: You can also use the median absolute deviation (also MAD!) fora more robust statistic.



Model Comparison

Measures of Predictive Precision

MSE is the mean of the squared deviations (errors) between theactual observed value and the mean of its respective P .

The average standard deviation of the P’s can also be helpful.

Mean Squared Error

MSE =1n

n∑i=1

(yi − P i

)2(22)

Mean Standard Deviation

SD =1n

n∑i=1

σPi (23)



Model Comparison

Measures of Predictive Accuracy

Coverage is the proportion of test set observations yi falling insidesome interval of their respective posterior predictive distributions Pi.

90% Coverage

C(90%) =1n

n∑i=1

[I(P(.05)

i < yi < P(.95)i

)](24)

where I is the indicator function and P(q) is the estimated qth quantile.

This shows how well the model does in creating posterior predictivedistributions that actually capture the true value.



Model Comparison

Predictive Performance

Combining information from measures of precision (i.e Coverage) andmeasures of accuracy (i.e. SD) is important for model comparison.

A high coverage probability can simply be a result of high-varianceposterior predictive distributions.



Model Comparison

Prediction of Extreme Values

The Brier score is the squared difference between the posteriorpredictive probability of exceeding a certain value and whether or notthe actual observation exceeds that value. This score can be used as ameasure of predictive accuracy of extreme values.

The Brier score for a test set observation, given a certain value c, canbe computed as

Brier Score

Brieri =[I(yi > c)− Pi(yi > c)

]2(25)

where I is the indicator function and P(y > c) is the posteriorpredictive probability that yi > c.



Model Comparison


For each observation yi in the test set, the quantile score can becomputed as

Quantile Score

QSi = 2×[I(

yi < P(q)i

)− q]×[P(q)

i − yi

](26)

where I is the indicator function and P(q) is the estimated qth quantile.



Model Comparison


The average of the quantile scores and the average of the Brier Scorescan be used for model comparison. They evaluate how well a modelcaptures extreme values.

Smaller scores are better.

Larger Brier scores suggest lack of predictive accuracy.

Larger quantile scores suggest that the observed value is very farfrom its estimated quantile value from P .



Conclusion

Other Posterior Predictive Checks

1 Other Bayesian p-value tests.2 Gelman Chi-Square tests, and other Chi-Square tests.3 Quantile Ratio4 Predictive Concordance5 Bayesian Predictive Information Criterion (BPIC)6 L-criterion7 ...and many more...



Conclusion

Summary

1 Separate your research into the three MC’s.2 List of model checks is not exhaustive!3 Choose some based on the focus of your research.4 Statistics only a guide.



Bayesian Model Diagnostics and...

Documents

Transcript of Bayesian Model Diagnostics and...