Bayesian Statistics: Model Uncertainty & Missing
Data
David Dunson
National Institute of Environmental Health Sciences, NIH
March 1, 2007
Outline

- Introduction to Bayesian Statistics
  - Basic Definitions
  - Posterior Computation via MCMC
  - Epidemiologic Application
- Model Uncertainty
  - Formulation of Problem
  - Variable Selection & Stochastic Search
  - Epidemiologic Application
- Missing Data
  - General Formulation
  - Posterior Computation
- Concluding Remarks
Illustration: Patient Diagnoses

- For the past two weeks, Sue has been feeling weak and has had nausea.
- Although she suspects a stomach virus, she visits the doctor because the symptoms have been persisting.
- The doctor also suspects a virus, but collects blood samples and orders several tests to verify that there aren't more serious problems.
- The tests come back and Sue has an abnormally low white cell count.
Updating Subjective Probabilities

- Formalizing this problem statistically, let D = (D1, . . . , DK)′, with Dk = 1 if disease k is the cause of Sue's symptoms.
- D1 = 1 if Sue has a virus or bacterial infection, D2 = 1 if cancer, D3 = 1 if a parasitic infection of type 1, etc.
- During the first few days of her illness, Sue estimated her probability of a virus or bacterial infection as Pr(D1 = 1) = π1(0) > 0.99.
- After two weeks, her estimated probability gradually decreased to π1(2) = 0.95.
- With the abnormally low white cell count test, this probability decreased further to π1(3) = 0.90.
Updating Subjective Probabilities

- We can formalize the process by which Sue's value of π1 changes using Bayes rule:

\[
\pi_1(t) = \frac{\pi_1(0)\, L(\mathrm{data}_t \mid D_1 = 1)}{\pi_1(0)\, L(\mathrm{data}_t \mid D_1 = 1) + \{1 - \pi_1(0)\}\, L(\mathrm{data}_t \mid D_1 = 0)},
\]

- t = 0 corresponds to the time of symptom onset
- π1(0) is Sue's guess at the probability of D1 = 1 at t = 0
- L(data_t | D = d) = likelihood of the data at time t given D = d
- "data" = ongoing symptoms, test results, input by Sue's physician, reading on the web, etc.
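The update above is a one-line computation. In the sketch below, the likelihood values are hypothetical, chosen only to illustrate the mechanics: persisting symptoms are taken to be less probable under a routine virus (D1 = 1) than under the alternatives, so the posterior probability drops.

```python
def bayes_update(prior, lik_d1, lik_d0):
    """Posterior Pr(D1 = 1) via Bayes rule.

    prior  : Pr(D1 = 1) before seeing the data
    lik_d1 : L(data | D1 = 1)
    lik_d0 : L(data | D1 = 0)
    """
    num = prior * lik_d1
    return num / (num + (1 - prior) * lik_d0)

# Hypothetical likelihoods for illustration only.
pi_0 = 0.99
pi_t = bayes_update(pi_0, lik_d1=0.4, lik_d0=0.8)
```

Because the data are less likely under D1 = 1 than D1 = 0, `pi_t` falls below the prior of 0.99, mirroring Sue's gradually decreasing probability.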
Learning from Sue

- For Sue this updating process continued until an astute physician diagnosed a parasitic infection after several months.
- Most medical diagnoses proceed in a similar manner, with the patient's and physician's probabilities updated through Bayesian learning.
- Scientific research evolves in a similar manner, with prior insights updated as new data become available.
- Bayesian statistics seeks to formalize the process of learning through the accrual of evidence from different sources.
Illustration: Normal Linear Regression

- Suppose we collect data consisting of a response, yi, and predictors xi = (xi1, . . . , xip)′, for subjects i = 1, . . . , n.
- For example, yi may be birth weight, with xi factors potentially predictive of birth weight.
- Assuming yi is normally distributed conditionally on xi, we have the likelihood function:

\[
L(y; \theta, X) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{1}{2\sigma^2} (y_i - x_i'\beta)^2 \right\},
\]

with θ = (β, σ2).
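In practice one usually works with the log of this likelihood. A dependency-light sketch (simulated data; all settings illustrative) evaluating it at given parameter values:

```python
import math
import random

def normal_loglik(y, X, beta, sigma2):
    """Log of L(y; theta, X) for the normal linear model, theta = (beta, sigma2)."""
    ll = -0.5 * len(y) * math.log(2 * math.pi * sigma2)
    for yi, xi in zip(y, X):
        resid = yi - sum(b * x for b, x in zip(beta, xi))
        ll -= resid ** 2 / (2 * sigma2)
    return ll

rng = random.Random(1)
beta_true = [1.0, 2.0]
X = [[1.0, rng.gauss(0, 1)] for _ in range(50)]
y = [sum(b * x for b, x in zip(beta_true, xi)) + rng.gauss(0, 0.5) for xi in X]
ll_true = normal_loglik(y, X, beta_true, 0.25)
```

As expected, the log likelihood is higher at the data-generating coefficients than at, say, β = (0, 0).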
Classical Model Fitting and Inferences

- Often, interest focuses on inferences on the regression coefficients β.
- To estimate β, the standard approach is maximum likelihood estimation, which results in the least squares estimator:

\[
\hat{\beta} = (X'X)^{-1} X'y,
\]

with X = (x1, . . . , xn)′ and y = (y1, . . . , yn)′.
- One can obtain confidence intervals for βj and test whether βj = 0 using standard approaches.
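In matrix form the estimator is a one-liner. A sketch with simulated data (all names and settings illustrative); solving the normal equations is numerically preferable to explicitly forming the inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Design matrix: intercept column plus two simulated predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 0.5, -2.0])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# beta_hat = (X'X)^{-1} X'y, computed by solving X'X beta = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

The result agrees with numpy's built-in least squares routine, `np.linalg.lstsq`.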
Prior Knowledge

- In most applications, prior knowledge is available about θ before observing the data in the current study.
- For example, in investigating factors predictive of birth weight, one can rely on a rich literature from previous studies.
- The classical paradigm relies entirely upon the current data, which often contain too little information to obtain accurate estimates of all parameters.
Bayesian Paradigm

- Bayesians instead choose a prior distribution to quantify the state of knowledge about θ before observing the current data.
- The prior distribution, π(θ), effectively treats the parameters as random variables.
- Inferences are then based on the posterior distribution, updating the prior with the likelihood from the current study:

\[
\pi(\theta \mid y, X) = \frac{\pi(\theta)\, L(y; \theta, X)}{\int \pi(\theta)\, L(y; \theta, X)\, d\theta},
\]

with L(y; X) = ∫ π(θ) L(y; θ, X) dθ known as the marginal likelihood.
Posterior Calculations for Linear Regression

- For normal linear regression models and conjugate priors, we can calculate the posterior distribution analytically.
- Suppose π(β) = Np(β; β0, Σβ), with β0 the prior mean and Σβ the prior covariance.
- Then, the conditional posterior of β given y, X, σ2 is:

\[
\pi(\beta \mid y, X, \sigma^2) = N(\hat{\beta}, \hat{\Sigma}_\beta), \qquad
\hat{\beta} = \hat{\Sigma}_\beta \left( \Sigma_\beta^{-1} \beta_0 + \sigma^{-2} X'y \right), \qquad
\hat{\Sigma}_\beta = \left( \Sigma_\beta^{-1} + \sigma^{-2} X'X \right)^{-1}
\]

Note: β̂ → MLE as n increases and the prior variance increases.
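These formulas translate directly into a pair of matrix computations. A sketch with simulated data (variable names and settings illustrative), which also checks the note above: under a very vague prior the posterior mean essentially matches least squares.

```python
import numpy as np

def posterior_beta(X, y, sigma2, beta0, Sigma_beta):
    """Posterior mean and covariance of beta under the prior N(beta0, Sigma_beta),
    following the conjugate-update formulas above."""
    prior_prec = np.linalg.inv(Sigma_beta)
    Sigma_hat = np.linalg.inv(prior_prec + X.T @ X / sigma2)
    beta_hat = Sigma_hat @ (prior_prec @ beta0 + X.T @ y / sigma2)
    return beta_hat, Sigma_hat

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -0.5]) + rng.normal(scale=0.5, size=n)

# Very vague prior: prior precision is negligible next to the data precision.
b_post, _ = posterior_beta(X, y, 0.25, np.zeros(2), 1e6 * np.eye(2))
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
```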
Prior Elicitation - Different Schools of Thought

- Subjective Bayes: informative priors should be used that accurately describe your prior uncertainty in the parameters.
- Objective Bayes: non-informative or default priors should be used to obtain statistical procedures with good properties.
- Pragmatic Bayes: the Bayes machinery is very useful for addressing complex problems & (hopefully) results are robust to priors.
Shrinkage Priors

- Although non-informative priors are widely used, shrinkage priors often have better performance (e.g., lower mean square error) (MacLehose et al., 2007, Epidemiology).
- By choosing a prior centered at zero for the coefficients, one tends to obtain more stable estimates, limiting over-fitting and multicollinearity problems.
- There is a rich theoretical literature providing motivation for shrinkage.
- From an applied perspective, it is a good idea to choose priors that assign low probability outside of a plausible range for the parameters.
- MLEs have problems when there is limited information about certain parameters.
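A zero-centered normal prior with fixed σ2 yields the familiar ridge-type estimate. A minimal sketch (simulated, nearly collinear predictors; all settings illustrative) of the stabilizing effect relative to least squares:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
z = rng.normal(size=n)
# Two nearly collinear predictors: least squares is unstable here.
X = np.column_stack([z, z + rng.normal(scale=0.05, size=n)])
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)

sigma2, tau2 = 1.0, 1.0  # error variance and prior variance (illustrative)
# Posterior mean under beta ~ N(0, tau2 I): ridge with penalty sigma2/tau2.
b_shrink = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(2), X.T @ y)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

The shrinkage estimate always has smaller norm than least squares, which is exactly the stabilization described above.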
What about more Complex Settings?

- Bayes & frequentist inferences under a single model tend to be similar in simple settings (e.g., linear regression with modest numbers of predictors & ample sample size).
- Advantages of Bayes are more apparent in complex settings: model uncertainty, missing data, large numbers of parameters, etc.
- Outside of simple models, posterior computation typically relies on Markov chain Monte Carlo (MCMC) algorithms.
MCMC - The Basic Idea

- MCMC algorithms rely on randomly sampling the model parameters in a special way so that the samples converge in distribution to a target distribution, which is the true joint posterior distribution.
- Flavors of MCMC:
  - Gibbs sampling (Gelfand and Smith, 1990): sequentially samples from the full conditional posterior distributions of each of the parameters.
  - Metropolis-Hastings (Hastings, 1970): samples a candidate for a parameter from a proposal density and accepts this candidate with a specific probability.
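The slides do not spell out either sampler, so as an illustration of the Metropolis-Hastings recipe, here is a random-walk sampler for a normal mean μ, under an assumed N(0, 10²) prior and unit error variance (all settings illustrative):

```python
import math
import random

def log_post(mu, data, prior_sd=10.0):
    """Unnormalized log posterior: N(0, prior_sd^2) prior, N(mu, 1) likelihood."""
    lp = -0.5 * (mu / prior_sd) ** 2
    return lp - 0.5 * sum((x - mu) ** 2 for x in data)

def metropolis(data, n_iter=5000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings for the mean mu."""
    rng = random.Random(seed)
    mu, samples = 0.0, []
    for _ in range(n_iter):
        cand = mu + rng.gauss(0, step)  # propose a candidate move
        # Accept with probability min(1, posterior ratio); symmetric proposal,
        # so the proposal densities cancel.
        if math.log(rng.random()) < log_post(cand, data) - log_post(mu, data):
            mu = cand
        samples.append(mu)
    return samples

data_rng = random.Random(42)
data = [data_rng.gauss(2.0, 1.0) for _ in range(100)]
samples = metropolis(data)
post_mean = sum(samples[1000:]) / len(samples[1000:])  # discard burn-in
```

After burn-in, the sample average approximates the posterior mean, which for this nearly flat prior sits close to the sample mean of the data.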
Application to Study of DDE & Preterm Birth

- Scientific Interest: Association between DDE exposure & preterm birth, adjusting for possible confounding variables.
- Data from the US Collaborative Perinatal Project (CPP): n = 2380 children, of whom 361 were born preterm.
- Analysis: Bayesian analysis using a probit model.
Probit Model for Risk of Preterm Birth

- Let yi = 1 if preterm birth and yi = 0 if full-term birth.
- Probit Model:

\[
\Pr(y_i = 1 \mid x_i, \beta) = \Phi(x_i'\beta),
\]

where Φ(·) is the standard normal distribution function.
- xi = (1, ddei, xi3, . . . , xi7)′
- xi3, . . . , xi7 represent possible confounders
- β1 = intercept
- β2 = dde slope
Bayesian Analysis: Prior, Likelihood & Posterior

- Prior: π(β) = N(β0, Σβ)
- Likelihood:

    L(y; β, X) = ∏i=1..n Φ(xi′β)^yi {1 − Φ(xi′β)}^(1−yi)

- Posterior:

    π(β | y, X) ∝ π(β) L(y; β, X).

- No closed form is available for the normalizing constant
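The prior-times-likelihood form of the unnormalized posterior can be written directly as a log-posterior function. A Python sketch under a simplifying assumption (independent normal prior components with common variance, rather than the general N(β0, Σβ)); names are illustrative:

```python
from math import erf, sqrt, log

def normal_cdf(t):
    """Standard normal distribution function Phi(t)."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def log_posterior(beta, y, X, beta0, prior_var):
    """log pi(beta | y, X) up to the unknown normalizing constant:
    log of an independent-normal prior plus the probit log likelihood.
    NOTE: for |x'beta| beyond roughly 6 the CDF rounds to 0 or 1 and
    log() fails; serious implementations use log-scale CDFs."""
    lp = sum(-(b - b0) ** 2 / (2.0 * prior_var) for b, b0 in zip(beta, beta0))
    for yi, xi in zip(y, X):
        p = normal_cdf(sum(xj * bj for xj, bj in zip(xi, beta)))
        lp += yi * log(p) + (1 - yi) * log(1.0 - p)
    return lp
```

Only the missing normalizing constant prevents this from being the posterior density itself, which is exactly why MCMC is used next.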
Posterior Computation using Data Augmentation

- The full conditional posterior distributions needed for Gibbs sampling are not automatically available
- However, we can rely on a very useful data augmentation trick proposed by Albert and Chib (1993):
- Augment the observed data {yi, xi} with latent zi.
- The probit model can then be expressed in hierarchical form as follows:

    yi = 1(zi > 0)
    zi ∼ N(xi′β, 1)

- Marginalizing out zi, we obtain Pr(yi = 1 | xi, β) = Φ(xi′β).
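The marginalization claim is easy to check by simulation: drawing zi ~ N(μ, 1) and recording 1(zi > 0) should reproduce Φ(μ). A small Monte Carlo sketch (μ = 0.3 is an arbitrary illustrative value for xi′β):

```python
import random
from math import erf, sqrt

random.seed(1)

def normal_cdf(t):
    """Standard normal distribution function Phi(t)."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

mu = 0.3        # illustrative value of x_i' beta
n = 200_000
# Draw z_i ~ N(mu, 1) and record y_i = 1(z_i > 0)
hits = sum(1 for _ in range(n) if random.gauss(mu, 1.0) > 0.0)
mc_prob = hits / n   # Monte Carlo estimate of Pr(y_i = 1)
```

With 200,000 draws the estimate agrees with Φ(0.3) to within Monte Carlo error of about 0.001.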
Gibbs Sampling Steps

- Gibbs sampling relies on alternately sampling from the full conditional posterior distributions of the unknown parameters
- After data augmentation, the unknowns include the latent data {zi} and the regression parameters β
- Full conditional posterior distributions:
  1. π(zi | y, X, β) = N(xi′β, 1), truncated below by zero if yi = 1 and above by zero if yi = 0.
  2. π(β | z, y, X) = Np(β̂, Σ̂β), with Σ̂β = (Σβ⁻¹ + X′X)⁻¹ and β̂ = Σ̂β(Σβ⁻¹β0 + X′z).
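The two full conditionals can be sketched for the simplest case of a single predictor with a N(0, v) prior, so that no matrix algebra is needed. This is illustrative code, not the lecture's implementation; the truncated normal draws use the inverse-CDF method:

```python
import random
from statistics import NormalDist

std = NormalDist()  # standard normal: .cdf and .inv_cdf

def draw_truncated_normal(mean, lower=None, upper=None, rng=random):
    """Draw z ~ N(mean, 1) restricted to (lower, upper) via inverse CDF."""
    a = std.cdf(lower - mean) if lower is not None else 0.0
    b = std.cdf(upper - mean) if upper is not None else 1.0
    u = min(max(rng.uniform(a, b), 1e-12), 1.0 - 1e-12)  # keep inv_cdf in (0,1)
    return mean + std.inv_cdf(u)

def gibbs_step(beta, y, x, prior_var, rng=random):
    """One scan of the two full conditionals for p = 1, prior beta ~ N(0, prior_var)."""
    # Step 1: z_i | beta, y_i ~ N(x_i * beta, 1), truncated by the sign of y_i
    z = [draw_truncated_normal(xi * beta, lower=0.0, rng=rng) if yi == 1
         else draw_truncated_normal(xi * beta, upper=0.0, rng=rng)
         for yi, xi in zip(y, x)]
    # Step 2: beta | z ~ N(beta_hat, v_hat), v_hat = (1/prior_var + x'x)^-1
    v_hat = 1.0 / (1.0 / prior_var + sum(xi * xi for xi in x))
    beta_hat = v_hat * sum(xi * zi for xi, zi in zip(x, z))
    return rng.gauss(beta_hat, v_hat ** 0.5)
```

For p > 1 the second step becomes the multivariate normal update with Σ̂β and β̂ given above, but the structure of the scan is identical.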
Gibbs Sampling Implementation

- To implement Gibbs sampling, we simply iterate sampling from these full conditional posterior distributions a large number of times
- An initial burn-in is discarded to allow convergence to the stationary distribution
- Inferences can be based on posterior summaries calculated using the draws from the joint posterior distribution.
- WinBUGS provides an easy-to-use, free software package for implementing Gibbs sampling for complex models.
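Given the stored draws for one parameter, the usual posterior summaries (mean, median, equal-tail 95% credible interval) after discarding the burn-in can be computed as follows; a minimal sketch with illustrative names:

```python
def posterior_summary(draws, burn_in=100):
    """Posterior mean, median, and equal-tail 95% credible interval from
    MCMC draws of a single parameter, after discarding the burn-in."""
    kept = sorted(draws[burn_in:])
    n = len(kept)
    mean = sum(kept) / n
    median = 0.5 * (kept[(n - 1) // 2] + kept[n // 2])
    lo = kept[int(0.025 * (n - 1))]   # 2.5th percentile of retained draws
    hi = kept[int(0.975 * (n - 1))]   # 97.5th percentile of retained draws
    return {"mean": mean, "median": median, "ci95": (lo, hi)}
```

Applying this to each coefficient's chain yields exactly the kind of table reported for the DDE analysis below.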
Returning to the DDE and Premature Birth Application

- We chose a normal prior, π(β) = N7(β; 0, 4 × I7×7) (motivated by shrinkage considerations)
- Choosing β = 0 as the starting value, we ran the Gibbs sampler for 1,000 iterations, discarding the first 100 as a burn-in
- In general, more samples should be taken for complex models
Gibbs Sampling Trace Plots [figure]
Convergence and Mixing Issues

- Whenever MCMC algorithms are used, trace plots of the different parameters should be carefully examined.
- In complex models, convergence and mixing are often of concern.
- Slow convergence: the chain takes a long time to converge to the stationary distribution, so a long burn-in is needed.
- Slow mixing: high autocorrelation in the samples even after convergence, so a very large number of samples is needed to reduce Monte Carlo error.
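Slow mixing can be quantified by the lag-k autocorrelation of the chain: values that stay near 1 at large lags mean each draw carries little new information. A minimal sketch:

```python
def autocorrelation(draws, lag):
    """Sample lag-k autocorrelation of an MCMC chain."""
    n = len(draws)
    mean = sum(draws) / n
    var = sum((d - mean) ** 2 for d in draws) / n
    cov = sum((draws[t] - mean) * (draws[t + lag] - mean)
              for t in range(n - lag)) / n
    return cov / var
```

In practice one inspects this over a range of lags (or the effective sample size derived from it) alongside the trace plots.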
Estimated Posterior Densities [figure]
Posterior Summaries of Regression Parameters

Parameter   Mean    Median   SD     95% credible interval
β1         -1.08   -1.08    0.04    (-1.16, -1.01)
β2          0.17    0.17    0.03    ( 0.12,  0.23)
β3         -0.13   -0.13    0.04    (-0.20, -0.05)
β4          0.11    0.11    0.03    ( 0.05,  0.18)
β5         -0.02   -0.02    0.03    (-0.08,  0.05)
β6         -0.08   -0.08    0.04    (-0.15, -0.02)
β7          0.05    0.06    0.06    (-0.07,  0.18)
Maximum Likelihood Results

Parameter   MLE     SE     Z stat    p-value
β1         -1.08   0.04   -24.8      < 2e-16
β2          0.18   0.03     6.03     1.67e-09
β3         -0.13   0.04    -3.63     0.0003
β4          0.11   0.03     3.30     0.001
β5         -0.02   0.03    -0.501    0.617
β6         -0.08   0.04    -2.30     0.022
β7          0.05   0.06     0.844    0.399

β2 = dde slope (highly significant increasing trend)
Fitting Bayesian GLMs in SAS

- We repeated our Bayesian analysis using BGENMOD (a new SAS proc for Bayesian analysis of GLMs).
- Very simple to implement in a few lines of code, and it gave identical results to our R implementation (despite a different MCMC implementation)
- Automatically outputs posterior summaries, trace plots, convergence diagnostics, etc.
- UNC Bayes in SAS Conference, May 17-18 (www.sph.unc.edu/bios for details)
Suppose the true model is unknown

- In the DDE application, we assumed that we knew in advance that the probit model with pre-specified predictors was appropriate.
- There is typically substantial uncertainty in the model & it is more realistic to suppose that there is a list of a priori plausible models.
- Typical strategy: sequentially change the model until a good fit is produced, then base inferences/predictions on the final selected model.
- This strategy is flawed in ignoring uncertainty in the model selection process, which leads to major bias in many cases.
Bayes Model Uncertainty

- Let M ∈ M denote a model index, with M a list of possible models.
- To allow for model uncertainty, Bayesians first choose:
  1. A prior probability for each model: Pr(M = m) = πm, m ∈ M.
  2. Priors for the coefficients within each model, π(θm), m ∈ M.
- Given data y, the posterior probability of model M = m is

    π̂m = Pr(M = m | y) = πm Lm(y) / Σl∈M πl Ll(y),

  where Lm(y) = ∫ L(y | M = m, θm) π(θm) dθm is the marginal likelihood for model M = m
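Given marginal likelihoods for the candidate models, the posterior model probabilities are just this normalized product; they are most stably computed on the log scale. A sketch (function and argument names are illustrative):

```python
from math import exp

def posterior_model_probs(log_marglik, prior_probs):
    """pi_hat_m = pi_m L_m(y) / sum_l pi_l L_l(y), computed from log
    marginal likelihoods; subtracting the max avoids over/underflow."""
    m = max(log_marglik)
    weights = [p * exp(l - m) for l, p in zip(log_marglik, prior_probs)]
    total = sum(weights)
    return [w / total for w in weights]
```

The max-subtraction trick matters because realistic log marginal likelihoods are large negative numbers whose direct exponentials underflow to zero.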
Some Comments

- In the absence of prior knowledge about which models in the list are more plausible, one often lets πm = 1/#M, with #M the number of models.
- The highest posterior probability model is then the model with the highest marginal likelihood.
- Unlike the maximized likelihood, the marginal likelihood has an implicit penalty for model complexity.
- This penalty is due to the integration across the prior, which is higher dimensional in larger models.
Impact of Prior on Coefficients

- The prior, π(θm), on the coefficients within each model plays an important role.
- As the variances of the priors on the coefficients within each model increase, the penalty for model complexity also increases.
- Hence, for higher-variance priors, one tends to favor smaller models.
- Using the BIC criterion is approximately equivalent to assuming a unit information prior, which is quite vague, so BIC favors small models.
- By estimating the variance, one can obtain a data-adaptive penalty (George & Foster, 2000)
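The BIC connection can be made concrete: -BIC/2 approximates the log marginal likelihood under a unit information prior, so adding a parameter at equal fit costs log n on the -2 log scale. A sketch:

```python
from math import log

def bic(log_lik_max, n_params, n_obs):
    """BIC = -2 log L(theta_hat) + p log n.
    -BIC/2 approximates the log marginal likelihood, so comparing BIC
    values approximates comparing posterior model probabilities
    (under equal prior model probabilities)."""
    return -2.0 * log_lik_max + n_params * log(n_obs)
```

A model with one extra parameter must raise the maximized log likelihood by at least (log n)/2 before its BIC improves.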
Bayes Factors

- The Bayes factor (BF) can be used as a summary of the weight of evidence in the data in favor of model m1 over model m2.
- The BF for model m1 over m2 is defined as the ratio of posterior to prior odds, which is simply

    BF12 = L1(y) / L2(y),

  a ratio of marginal likelihoods.
- Values of BF12 > 1 suggest that model m1 is preferred, with the weight of evidence in favor of m1 increasing as BF12 increases.
Bayesian Model Averaging (BMA)

- Posterior model probabilities can be used for model selection and inferences.
- When the focus is on prediction, BMA is preferred to model selection (Madigan and Raftery, 1994)
- To predict yn+1 given xn+1, BMA relies on:

    f(yn+1 | xn+1) = Σm∈M π̂m ∫ L(yn+1 | M = m, θm) π(θm | M = m, y, X) dθm.
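Once the per-model posterior predictive densities have been evaluated at the new point, the BMA predictive density is just their probability-weighted mixture. A sketch assuming those densities are already in hand (names are illustrative):

```python
def bma_predictive(post_model_probs, per_model_preds):
    """f(y_new | x_new) = sum_m pi_hat_m * f_m(y_new | x_new), where
    f_m is the posterior predictive density under model m and
    pi_hat_m its posterior model probability."""
    return sum(p * f for p, f in zip(post_model_probs, per_model_preds))
```

Selecting a single model corresponds to forcing one weight to 1, which discards the model uncertainty that BMA propagates into the prediction.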
Bayes Model Uncertainty - Practical Issues

- Computation of the posterior model probabilities requires calculation of the marginal likelihoods, Lm(y)
- These marginal likelihoods are not automatically produced by typical MCMC algorithms
- Routine implementations rely on the Laplace approximation (Tierney and Kadane, 1986; Raftery, 1996)
- In large model spaces, it is not feasible to do the calculations for all models, so search algorithms are used.
- Refer to Hoeting et al. (1999) for a tutorial on BMA
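For a scalar parameter, the Laplace approximation replaces the integrand exp{h(θ)} (h = log prior + log likelihood) with a Gaussian centered at the posterior mode. A sketch using a finite-difference second derivative; for an exactly Gaussian integrand the approximation is exact, which the assertions below exploit:

```python
from math import log, pi

def laplace_log_marginal(log_joint, theta_hat, eps=1e-4):
    """Laplace approximation to log INT exp(h(theta)) d theta for a scalar
    parameter: h(theta_hat) + (1/2) log(2 pi) - (1/2) log(-h''(theta_hat)),
    where theta_hat is the mode of h and h'' is estimated by a central
    finite difference."""
    h = log_joint
    h2 = (h(theta_hat + eps) - 2.0 * h(theta_hat) + h(theta_hat - eps)) / eps ** 2
    return h(theta_hat) + 0.5 * log(2.0 * pi) - 0.5 * log(-h2)
```

In p dimensions the correction term becomes (p/2) log(2π) minus half the log determinant of the negative Hessian at the mode.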
Bayesian Variable Selection
I Suppose we start with a vector of p candidate predictors, xi = (xi1, . . . , xip)′
I A very common type of model uncertainty corresponds to uncertainty in which predictors to include in the model.
I In this case, we end up with a list of 2^p different models, corresponding to each of the p candidate predictors being excluded or not.
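To make the 2^p model space concrete, here is a small illustrative sketch (not from the original slides): each model is encoded as a vector of 0/1 inclusion indicators, matching the model-indicator notation used in the DDE results below.

```python
from itertools import product

def enumerate_models(p):
    """Enumerate all 2^p variable-inclusion indicators for p candidate predictors."""
    # Each model is a tuple of 0/1 flags: flag j says whether predictor j is included.
    return list(product([0, 1], repeat=p))

models = enumerate_models(3)
print(len(models))              # 8 models for p = 3
print(models[0], models[-1])    # (0, 0, 0) is the null model, (1, 1, 1) the full model
```

For the DDE application below, p = 7 candidate predictors gives 2^7 = 128 models.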
Stochastic Search Variable Selection (SSVS)
I George and McCulloch (1993, 1997) proposed a Gibbs sampling approach for the variable selection problem.
I Similar approaches have been very widely used in applications.
I The SSVS idea will be illustrated by returning to the DDE and preterm birth application
Bayes Variable Selection in Probit Regression
I Earlier we focused on the model, Pr(yi = 1 | xi, β) = Φ(x′i β), with yi an indicator of premature delivery.
I Previously, we chose a N7(0, 4I) prior for β, assuming all 7 predictors were included.
I To account for uncertainty in subset selection, choose a mixture prior:

π(β) = ∏_{j=1}^{p} { δ0(βj) p0j + (1 − p0j) N(βj; 0, c_j^2) },

where p0j is the prior probability of excluding the jth predictor by setting its coefficient to 0
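This mixture (spike-and-slab) prior is easy to simulate from. The sketch below is illustrative (the function name is ours, not from the talk); it draws a coefficient from the point mass at 0 with probability p0 and from the N(0, c^2) slab otherwise:

```python
import random

def sample_spike_slab(p0, c, rng):
    """One draw from the mixture prior p0 * delta_0 + (1 - p0) * N(0, c^2)."""
    if rng.random() < p0:
        return 0.0               # spike: coefficient set exactly to zero
    return rng.gauss(0.0, c)     # slab: diffuse normal prior on the coefficient

rng = random.Random(1)
draws = [sample_spike_slab(p0=0.5, c=2.0, rng=rng) for _ in range(10000)]
frac_zero = sum(d == 0.0 for d in draws) / len(draws)
print(frac_zero)  # close to the prior exclusion probability p0 = 0.5
```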
SSVS in Probit Regression
I The data augmentation Gibbs sampler described earlier can be easily adapted.
I Sample from the conditional posterior distributions of βj, for j = 1, . . . , p:

π(βj | β(−j), z, y, X) = p̂j δ0(βj) + (1 − p̂j) N(βj; Ej, Vj),

where Vj = (c_j^−2 + X′X)^−1, Ej = Vj X′z, and

p̂j = p0j / { p0j + (1 − p0j) N(0; 0, c_j^2) / N(0; Ej, Vj) }

is the conditional probability of βj = 0 (i.e., we exclude the jth predictor)
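As an illustrative check (not from the talk, and assuming scalar Ej and Vj), the exclusion probability p̂j can be computed directly from the formula above:

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def exclusion_prob(p0, c, Ej, Vj):
    """p-hat_j = p0 / ( p0 + (1 - p0) * N(0; 0, c^2) / N(0; Ej, Vj) )."""
    ratio = normal_pdf(0.0, 0.0, c ** 2) / normal_pdf(0.0, Ej, Vj)
    return p0 / (p0 + (1 - p0) * ratio)

# If the slab's full conditional N(Ej, Vj) sits far from zero, exclusion is unlikely:
print(exclusion_prob(p0=0.5, c=2.0, Ej=3.0, Vj=0.1))   # near 0
# If Ej is near zero, the spike is favoured:
print(exclusion_prob(p0=0.5, c=2.0, Ej=0.0, Vj=0.1))   # above 0.5
```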
SSVS - Comments
I After convergence, SSVS generates samples of models, corresponding to subsets of the set of p candidate predictors, from the posterior distribution.
I Based on a large number of SSVS iterations, we can estimate posterior probabilities for each of the models.
I For example, if the full model appears in 10% of the samples collected after convergence, that model is assigned posterior probability 0.10.
I To summarize, one can present a table of the top 10 or 100 models.
I It is potentially more useful to calculate marginal inclusion probabilities.
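The model-probability and marginal-inclusion summaries described above can be sketched as follows (illustrative code; the four toy draws are made up, not DDE output):

```python
from collections import Counter

def summarize_ssvs(indicator_draws):
    """Estimate posterior model probabilities and marginal inclusion
    probabilities from post-convergence draws of 0/1 inclusion indicators."""
    n = len(indicator_draws)
    # Posterior probability of a model = fraction of draws that visited it.
    model_probs = {m: k / n for m, k in Counter(indicator_draws).items()}
    # Marginal inclusion probability of predictor j = fraction of draws including it.
    p = len(indicator_draws[0])
    marginal = [sum(d[j] for d in indicator_draws) / n for j in range(p)]
    return model_probs, marginal

draws = [(1, 1, 0), (1, 1, 0), (1, 1, 1), (1, 0, 0)]  # four toy Gibbs draws
model_probs, marginal = summarize_ssvs(draws)
print(model_probs[(1, 1, 0)])  # 0.5: this model appeared in half the draws
print(marginal)                # [1.0, 0.75, 0.25]
```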
Samples from Posterior - DDE application (normal prior) [figure]
Samples from Posterior - DDE application (mixture prior) [figure]
SSVS - Comments
I Samples concentrate at 0 for the regression coefficients of the less important predictors.
I Such samples correspond to models with that predictor excluded.
I Even though the prior probabilities of exclusion are the same, the posterior probabilities vary greatly across predictors.
Posterior Summaries - Normal prior analysis
Parameter   Mean    Median   SD     95% credible interval
β1         -1.08   -1.08    0.04   (-1.16, -1.01)
β2          0.17    0.17    0.03   (0.12, 0.23)
β3         -0.13   -0.13    0.04   (-0.20, -0.05)
β4          0.11    0.11    0.03   (0.05, 0.18)
β5         -0.02   -0.02    0.03   (-0.08, 0.05)
β6         -0.08   -0.08    0.04   (-0.15, -0.02)
β7          0.05    0.06    0.06   (-0.07, 0.18)
Posterior Summaries - Mixture prior analysis
Parameter   Mean    Median   SD     95% CI            Pr(βj = 0 | data)
β1         -1.05   -1.05    0.03   (-1.12, -0.99)    0.00
β2          0.18    0.18    0.03   (0.12, 0.23)      0.00
β3         -0.08   -0.09    0.06   (-0.19, 0.00)     0.36
β4          0.05    0.00    0.06   (0.00, 0.16)      0.50
β5          0.00    0.00    0.01   (0.00, 0.00)      0.98
β6         -0.02    0.00    0.04   (-0.13, 0.00)     0.72
β7          0.01    0.00    0.02   (0.00, 0.10)      0.93
Posterior Probabilities of Visited Models
Rank   π̂m       Model Indicator
 1     0.2498   1 1 0 0 0 0 0
 2     0.2259   1 1 1 1 0 0 0
 3     0.1970   1 1 1 1 0 1 0
 4     0.1399   1 1 1 0 0 0 0
 5     0.0364   1 1 0 0 0 1 0
 6     0.0304   1 1 0 1 0 0 0
 7     0.0274   1 1 1 0 0 1 0
 8     0.0207   1 1 0 0 0 0 1
 9     0.0177   1 1 1 1 0 0 1
10     0.0122   1 1 1 0 0 0 1
Some Comments on DDE Application Results
I In 4,000 Gibbs iterations only 26/128 = 20.3% of the models were visited
I There wasn't a single dominant model, but none of the models excluded the intercept or DDE slope.
I All of the better models included the 3rd & 5th of the 5 possible confounders
General Comments on Bayes Model Uncertainty
I SSVS provides a very useful approach.
I For large numbers of candidate predictors, shotgun stochastic search provides an alternative (Hans et al., 2007).
I SSVS has also been adapted to select predictors with random effects (Cai and Dunson, 2006)
I For routine implementation, one can rely on Laplace approximations to the marginal likelihoods.
Missing Data Introduction
I Many (if not most) studies are faced with problems with missing data
I Bayesian methods provide a natural framework for accounting for missing data without needing to rely on ad hoc imputation
I Focus: missing predictors in regression models
Missing Predictors in Regression
I Suppose we are interested in the general linear model:

yi = x′i β + εi,   εi ∼ N(0, σ^2),

with xi = (xi1, . . . , xip)′ a vector of predictors that may have missing values
I Assume that the missing predictors are missing at random (MAR), so that missingness is conditionally independent of the unmeasured value given the observed data.
I To accommodate missing predictors, we need to specify a joint distribution for xi (typically chosen as normal or as a sequence of conditional GLMs).
I Then, the missing values are simply additional unknowns to be updated in the MCMC algorithm.
Gibbs Sampler
I When the predictors have a normal likelihood and we have a linear regression model, missing predictors can be accommodated in a simple Gibbs sampler:
1. Starting with initial values for β and σ^2, sample the missing predictor values from their normal full conditionals.
2. Given the imputed data, sample β and σ^2 from their full conditional posterior distributions.
I This algorithm is easily adapted for non-normal likelihoods for yi and xi, and the resulting Gibbs sampler can be implemented in WinBUGS.
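The two-step sampler above can be sketched for a toy case (illustrative code, not from the talk): a single predictor with known error variance, a known N(mu_x, tau2) predictor model, and a third of the x's made missing at random. A full sampler would also update sigma2, mu_x, and tau2.

```python
import random
import statistics

def gibbs_missing_x(y, x_obs, n_iter=2000, sigma2=1.0, mu_x=0.0, tau2=1.0, seed=0):
    """Toy Gibbs sampler for y_i = b0 + b1*x_i + N(0, sigma2), x_i ~ N(mu_x, tau2),
    where x_obs[i] is None when x_i is missing (assumed MAR)."""
    rng = random.Random(seed)
    n = len(y)
    x = [xi if xi is not None else mu_x for xi in x_obs]  # initialize missing x's
    b0, b1 = 0.0, 0.0
    keep = []
    for it in range(n_iter):
        # Step 1: impute each missing x_i from its normal full conditional,
        # combining the predictor model N(mu_x, tau2) with the regression likelihood.
        for i in range(n):
            if x_obs[i] is None:
                v = 1.0 / (1.0 / tau2 + b1 * b1 / sigma2)
                m = v * (mu_x / tau2 + b1 * (y[i] - b0) / sigma2)
                x[i] = rng.gauss(m, v ** 0.5)
        # Step 2: given the completed data, sample (b0, b1) under a flat prior:
        # b1 from its marginal (b0 integrated out), then b0 given b1.
        xbar = sum(x) / n
        sxx = sum((xi - xbar) ** 2 for xi in x)
        b1_hat = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx
        b1 = rng.gauss(b1_hat, (sigma2 / sxx) ** 0.5)
        b0 = rng.gauss(sum(yi - b1 * xi for yi, xi in zip(y, x)) / n, (sigma2 / n) ** 0.5)
        if it >= n_iter // 2:  # discard the first half as burn-in
            keep.append((b0, b1))
    return keep

# Simulate data with true b0 = 1, b1 = 2, then hide a third of the predictors.
data_rng = random.Random(42)
x_true = [data_rng.gauss(0.0, 1.0) for _ in range(150)]
y = [1.0 + 2.0 * xi + data_rng.gauss(0.0, 1.0) for xi in x_true]
x_obs = [xi if i % 3 else None for i, xi in enumerate(x_true)]
draws = gibbs_missing_x(y, x_obs)
print(round(statistics.mean(b1 for _, b1 in draws), 1))  # posterior mean of b1, near 2
```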
Some Comments
I The Bayesian approach is often used to obtain multiple imputed data sets, which are then combined using frequentist methods.
I This approach avoids ad hoc imputation methods, such as bootstrapping, which often have implicit missing-completely-at-random assumptions.
I We have assumed that missingness is non-informative.
I Shared random effects models can be used to account for informative missingness and censoring.
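One standard frequentist combining rule for the m imputed-data analyses mentioned above is Rubin's rules; the sketch below is illustrative (the function name is ours):

```python
def rubin_combine(estimates, variances):
    """Combine point estimates and variances from m imputed-data analyses
    via Rubin's rules: pooled estimate q_bar, total variance T = W + (1 + 1/m)*B."""
    m = len(estimates)
    q_bar = sum(estimates) / m
    w = sum(variances) / m                                   # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    return q_bar, w + (1.0 + 1.0 / m) * b

# Toy example: five imputed-data estimates of one coefficient, each with variance 0.04.
q, t = rubin_combine([1.0, 1.2, 0.8, 1.1, 0.9], [0.04] * 5)
print(round(q, 2), round(t, 2))  # 1.0 0.07
```

Note that the total variance exceeds the average within-imputation variance, reflecting the extra uncertainty due to the missing data.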
Summary
I Very brief introduction to Bayesian statistics
I Emphasis on regression models, variable selection & missing predictors
I Ideas related to model uncertainty and missing data can be generalized to much broader settings
I Many of the MCMC algorithms are easy to program, but there are also a number of packages available (WinBUGS, R functions for Bayes model averaging in GLMs, etc.)