
Bayesian Inference

Chapter 4: Regression and Hierarchical Models

Conchi Ausín and Mike Wiper

Department of Statistics

Universidad Carlos III de Madrid

Master in Business Administration and Quantitative Methods
Master in Mathematical Engineering


Objective

[Photos: A. F. M. Smith and Dennis Lindley]

We analyze the Bayesian approach to fitting normal and generalized linear models and introduce the Bayesian hierarchical modeling approach. We also study the modeling and forecasting of time series.


Contents

1 Normal linear models

1.1. ANOVA model

1.2. Simple linear regression model

2 Generalized linear models

3 Hierarchical models

4 Dynamic models


Normal linear models

A normal linear model is of the following form:

y = Xθ + ε,

where y = (y1, . . . , yn)ᵀ is the observed data, X is a known n × k matrix, called the design matrix, θ = (θ1, . . . , θk)ᵀ is the parameter vector and ε follows a multivariate normal distribution. Usually, it is assumed that:

ε ∼ N(0_n, (1/φ) I_n).

A simple example of a normal linear model is the simple linear regression model, where X = (1 | x) is the n × 2 matrix whose first column is a vector of ones and whose second column is (x1, x2, . . . , xn)ᵀ, and θ = (α, β)ᵀ.


Normal linear models

Consider a normal linear model, y = Xθ + ε. A conjugate prior distribution is a normal-gamma distribution:

θ | φ ∼ N(m, (1/φ)V),    φ ∼ G(a/2, b/2).

Then, the posterior distribution given y is also a normal-gamma distribution with:

m* = (XᵀX + V⁻¹)⁻¹ (Xᵀy + V⁻¹m)

V* = (XᵀX + V⁻¹)⁻¹

a* = a + n

b* = b + yᵀy + mᵀV⁻¹m − m*ᵀV*⁻¹m*
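These updating formulas translate directly into a few lines of linear algebra. Below is a minimal NumPy sketch (the function name and interface are ours, not from the slides); m, V, a, b are the prior hyperparameters and the returns are their starred posterior counterparts.

```python
import numpy as np

def normal_gamma_update(X, y, m, V, a, b):
    """Posterior hyperparameters for the conjugate normal-gamma prior
    theta | phi ~ N(m, V/phi), phi ~ G(a/2, b/2)."""
    V_inv = np.linalg.inv(V)
    V_star = np.linalg.inv(X.T @ X + V_inv)
    m_star = V_star @ (X.T @ y + V_inv @ m)
    a_star = a + len(y)
    # V*^{-1} = X'X + V^{-1}, so reuse it rather than inverting again
    b_star = b + y @ y + m @ V_inv @ m - m_star @ (X.T @ X + V_inv) @ m_star
    return m_star, V_star, a_star, b_star
```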


Normal linear models

The posterior mean is given by:

E[θ | y] = (XᵀX + V⁻¹)⁻¹ (Xᵀy + V⁻¹m)
         = (XᵀX + V⁻¹)⁻¹ (XᵀX (XᵀX)⁻¹Xᵀy + V⁻¹m)
         = (XᵀX + V⁻¹)⁻¹ (XᵀX θ̂ + V⁻¹m),

where θ̂ = (XᵀX)⁻¹Xᵀy is the maximum likelihood estimator.

Thus, this expression may be interpreted as a weighted average of the prior estimator, m, and the MLE, θ̂, with weights proportional to their precisions since, conditional on φ, the prior variance is (1/φ)V and the distribution of the MLE from the classical viewpoint is θ̂ | φ ∼ N(θ, (1/φ)(XᵀX)⁻¹).


Normal linear models

Consider a normal linear model, y = Xθ + ε, and assume the limiting prior distribution:

p(θ, φ) ∝ 1/φ.

Then, we have that:

θ | y, φ ∼ N(θ̂, (1/φ)(XᵀX)⁻¹),

φ | y ∼ G((n − k)/2, (yᵀy − θ̂ᵀ(XᵀX)θ̂)/2).

Note that σ̂² = (yᵀy − θ̂ᵀ(XᵀX)θ̂)/(n − k) is the usual classical estimator of σ² = 1/φ.

In this case, Bayesian credible intervals, estimators etc. will coincide with their classical counterparts.
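Under this limiting prior, everything is available in closed form from ordinary least squares quantities. A short sketch (names are ours):

```python
import numpy as np

def reference_prior_posterior(X, y):
    """Posterior under p(theta, phi) ∝ 1/phi: returns the MLE theta_hat
    (= posterior mean), the scaled covariance (X'X)^{-1}, the two gamma
    parameters of phi | y, and the classical variance estimate."""
    n, k = X.shape
    XtX = X.T @ X
    theta_hat = np.linalg.solve(XtX, X.T @ y)     # MLE = posterior mean
    rss = y @ y - theta_hat @ XtX @ theta_hat     # residual sum of squares
    sigma2_hat = rss / (n - k)                    # classical estimate of 1/phi
    return theta_hat, np.linalg.inv(XtX), (n - k) / 2, rss / 2, sigma2_hat
```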


ANOVA model

The ANOVA model is an example of a normal linear model where:

yij = θi + εij,

with εij ∼ N(0, 1/φ), for i = 1, . . . , k and j = 1, . . . , ni.

Thus, the parameters are θ = (θ1, . . . , θk)ᵀ, the observed data are y = (y11, . . . , y1n1, y21, . . . , y2n2, . . . , yk1, . . . , yknk)ᵀ, and the design matrix is block diagonal:

X = diag(1_{n1}, 1_{n2}, . . . , 1_{nk}),

where 1_{ni} denotes a column vector of ni ones.


ANOVA model

Assume conditionally independent normal priors, θi ∼ N(mi, 1/(αiφ)), for i = 1, . . . , k, and a gamma prior φ ∼ G(a/2, b/2).

This corresponds to a normal-gamma prior distribution for (θ, φ) where m = (m1, . . . , mk)ᵀ and V = diag(1/α1, . . . , 1/αk).

Then, it is obtained that:

θ | y, φ ∼ N( ((n1ȳ1· + α1m1)/(n1 + α1), . . . , (nkȳk· + αkmk)/(nk + αk))ᵀ, (1/φ) diag(1/(α1 + n1), . . . , 1/(αk + nk)) )

and

φ | y ∼ G( (a + n)/2, (b + ∑_{i=1}^k ∑_{j=1}^{ni} (yij − ȳi·)² + ∑_{i=1}^k (niαi/(ni + αi)) (ȳi· − mi)²)/2 ).


ANOVA model

If we assume alternatively the reference prior, p(θ, φ) ∝ 1/φ, we have:

θ | y, φ ∼ N( (ȳ1·, . . . , ȳk·)ᵀ, (1/φ) diag(1/n1, . . . , 1/nk) ),

φ | y ∼ G( (n − k)/2, (n − k)σ̂²/2 ),

where σ̂² = (1/(n − k)) ∑_{i=1}^k ∑_{j=1}^{ni} (yij − ȳi·)² is the classical variance estimate for this problem.

A 95% posterior interval for θ1 − θ2 is given by:

ȳ1· − ȳ2· ± σ̂ √(1/n1 + 1/n2) t_{n−k}(0.975),

which is equal to the usual classical interval.
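A minimal sketch of this interval in NumPy/SciPy, assuming the treatment groups are held as a list of 1-d arrays (the function name is ours):

```python
import numpy as np
from scipy import stats

def anova_interval(groups, level=0.95):
    """Reference-prior interval for theta_1 - theta_2 in the one-way
    ANOVA model; groups is a list of 1-d arrays, one per treatment."""
    n, k = sum(len(g) for g in groups), len(groups)
    sigma2 = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
    diff = groups[0].mean() - groups[1].mean()
    half = (stats.t.ppf(0.5 + level / 2, n - k)
            * np.sqrt(sigma2 * (1 / len(groups[0]) + 1 / len(groups[1]))))
    return diff - half, diff + half
```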


Example: ANOVA model

Suppose that an ecologist is interested in analysing how the masses of starlings (a type of bird) vary between four locations.

Sample data on the weights of 10 starlings from each of the four locations can be downloaded from: http://arcue.botany.unimelb.edu.au/bayescode.html.

Assume a Bayesian one-way ANOVA model for these data, where a different mean is considered for each location and the variation in mass between different birds is described by a normal distribution with a common variance.

Compare the results with those obtained with classical methods.


Simple linear regression model

Another example of a normal linear model is the simple linear regression model:

yi = α + βxi + εi,

for i = 1, . . . , n, where εi ∼ N(0, 1/φ).

Suppose that we use the limiting prior:

p(α, β, φ) ∝ 1/φ.


Simple linear regression model

Then, we have that:

(α, β)ᵀ | y, φ ∼ N( (α̂, β̂)ᵀ, (1/(φ n sx)) [ ∑_{i=1}^n xi²  −nx̄ ; −nx̄  n ] ),

φ | y ∼ G( (n − 2)/2, sy(1 − r²)/2 ),

where:

α̂ = ȳ − β̂x̄,  β̂ = sxy/sx,

sx = ∑_{i=1}^n (xi − x̄)²,  sy = ∑_{i=1}^n (yi − ȳ)²,

sxy = ∑_{i=1}^n (xi − x̄)(yi − ȳ),  r = sxy/√(sx sy),  σ̂² = sy(1 − r²)/(n − 2).
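The posterior depends on the data only through these summary statistics, which is easy to make concrete in code (a sketch; the function name is ours):

```python
import numpy as np

def slr_summaries(x, y):
    """Summary statistics and reference-prior posterior quantities for the
    simple linear regression model y_i = alpha + beta * x_i + eps_i."""
    n = len(x)
    sx = ((x - x.mean()) ** 2).sum()
    sy = ((y - y.mean()) ** 2).sum()
    sxy = ((x - x.mean()) * (y - y.mean())).sum()
    beta_hat = sxy / sx
    alpha_hat = y.mean() - beta_hat * x.mean()
    r = sxy / np.sqrt(sx * sy)
    sigma2_hat = sy * (1 - r ** 2) / (n - 2)   # classical residual variance
    return alpha_hat, beta_hat, r, sigma2_hat, sx, sy
```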


Simple linear regression model

Thus, the marginal distributions of α and β are Student-t distributions:

(α − α̂) / √(σ̂² ∑_{i=1}^n xi² / (n sx)) | y ∼ t_{n−2},

(β − β̂) / √(σ̂²/sx) | y ∼ t_{n−2}.

Therefore, for example, a 95% credible interval for β is given by:

β̂ ± (σ̂/√sx) t_{n−2}(0.975),

equal to the usual classical interval.


Simple linear regression model

Suppose now that we wish to predict a future observation:

ynew = α + βxnew + εnew.

Note that:

E[ynew | φ, y] = α̂ + β̂xnew,

V[ynew | φ, y] = (1/φ) ( (∑_{i=1}^n xi² + nxnew² − 2nx̄xnew)/(n sx) + 1 )
              = (1/φ) ( (sx + nx̄² + nxnew² − 2nx̄xnew)/(n sx) + 1 ).

Therefore,

ynew | φ, y ∼ N( α̂ + β̂xnew, (1/φ) ( (x̄ − xnew)²/sx + 1/n + 1 ) ).


Simple linear regression model

And then,

(ynew − α̂ − β̂xnew) / (σ̂ √( (x̄ − xnew)²/sx + 1/n + 1 )) | y ∼ t_{n−2},

leading to the following 95% credible interval for ynew:

α̂ + β̂xnew ± σ̂ √( (x̄ − xnew)²/sx + 1/n + 1 ) t_{n−2}(0.975),

which coincides with the usual classical interval.
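A sketch of this predictive interval in NumPy/SciPy, following the formulas above (the function name is ours):

```python
import numpy as np
from scipy import stats

def predictive_interval(x, y, x_new, level=0.95):
    """Credible (= classical prediction) interval for a new observation
    y_new at x_new under the reference prior."""
    n = len(x)
    sx = ((x - x.mean()) ** 2).sum()
    beta = ((x - x.mean()) * (y - y.mean())).sum() / sx
    alpha = y.mean() - beta * x.mean()
    sigma2 = ((y - alpha - beta * x) ** 2).sum() / (n - 2)
    centre = alpha + beta * x_new
    half = (stats.t.ppf(0.5 + level / 2, n - 2)
            * np.sqrt(sigma2 * ((x.mean() - x_new) ** 2 / sx + 1 / n + 1)))
    return centre - half, centre + half
```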


Example: Simple linear regression model

Consider the data file prostate.data that can be downloaded from: http://statweb.stanford.edu/~tibs/ElemStatLearn/.

This includes, among other clinical measures, the level of prostate-specific antigen on the log scale (lpsa) and the log cancer volume (lcavol) for 97 men who were about to receive a radical prostatectomy.

Use a Bayesian linear regression model to predict lpsa in terms of lcavol.

Compare the results with a classical linear regression fit.


Generalized linear models

The generalized linear model generalizes the normal linear model by allowing the possibility of non-normal error distributions and by allowing for a non-linear relationship between y and x.

A generalized linear model is specified by two functions:

1 A conditional, exponential family density function of y given x, parameterized by a mean parameter, µ = µ(x) = E[Y | x], and (possibly) a dispersion parameter, φ > 0, that is independent of x.

2 A (one-to-one) link function, g(·), which relates the mean, µ = µ(x), to the covariate vector, x, as g(µ) = xθ.


Generalized linear models

The following are generalized linear models with the canonical link function, which is the natural parameterization that puts the exponential family density in canonical form.

A logistic regression is often used for predicting the occurrence of an event given covariates:

Yi | pi ∼ Bin(ni, pi),
log(pi/(1 − pi)) = xiθ.

A Poisson regression is used for predicting the number of events in a time period given covariates:

Yi | λi ∼ P(λi),
log λi = xiθ.


Generalized linear models

The Bayesian specification of a GLM is completed by defining (typically normal or normal-gamma) prior distributions p(θ, φ) over the unknown model parameters.

As with standard linear models, when improper priors are used, it is then important to check that these lead to valid posterior distributions.

Clearly, these models will not have conjugate posterior distributions, but, usually, they are easily handled by Gibbs sampling.

In particular, the posterior distributions from these models are usually log-concave and are thus easily sampled via adaptive rejection sampling.


Example: A logistic regression model

The O-ring data consist of 23 observations on pre-Challenger Space Shuttle launches.

On each launch, it is observed whether there is at least one O-ring failure, and the temperature at launch.

The goal is to model the probability of at least one O-ring failure as a function of temperature.

Temperatures were 53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81.

Failures occurred at 53, 57, 58, 63, 70, 70, 75.
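With a flat prior on θ = (θ0, θ1), a basic random-walk Metropolis sampler is enough to fit this logistic regression. The sketch below hard-codes the data from the slide (which of the tied temperatures failed is arbitrary here, since only the counts enter the likelihood); the proposal scales and seed are our choices.

```python
import numpy as np

temp = np.array([53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
                 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81.0])
fail = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
                 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0.0])  # 7 failures in total

X = np.column_stack([np.ones_like(temp), temp])

def log_post(b):
    """Bernoulli log-likelihood (flat prior), with a stable log(1 + e^eta)."""
    eta = X @ b
    return np.sum(fail * eta - np.logaddexp(0.0, eta))

rng = np.random.default_rng(42)
b, draws = np.zeros(2), []
for _ in range(20000):
    prop = b + rng.normal(0.0, [2.0, 0.03])          # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(b):
        b = prop
    draws.append(b)
draws = np.array(draws[5000:])                       # discard burn-in
print(draws.mean(axis=0))                            # posterior means of (theta_0, theta_1)
```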


Example: A logistic regression model

The table shows the relationship, for 64 infants, between the gestational age of the infant (in weeks) at the time of birth (x) and whether the infant was breast feeding at the time of release from hospital (y).

x          28  29  30  31  32  33
#{y = 0}    4   3   2   2   4   1
#{y = 1}    2   2   7   7  16  14

Let xi represent the gestational age and ni the number of infants with this age. Then we can model the probability that yi infants were breast feeding at the time of release from hospital via a standard binomial regression model.


Hierarchical models

Suppose we have data, x, and a likelihood function f(x | θ) where the parameter values θ = (θ1, . . . , θk) are judged to be exchangeable, that is, any permutation of them has the same distribution.

In this situation, it makes sense to consider a multilevel model, assuming a prior distribution, f(θ | φ), which depends upon a further, unknown hyperparameter, φ, and to use a hyperprior distribution, f(φ).

In theory, this process could continue further, using hyper-hyperprior distributions on the parameters of the hyperprior distributions. This gives a structured way of eliciting the prior distribution.

One alternative is to estimate the hyperparameter using classical methods, which is known as empirical Bayes. A point estimate φ̂ is then plugged in to approximate the posterior distribution; however, the uncertainty about φ is ignored.


Hierarchical models

In most hierarchical models, the joint posterior distribution,

f(θ, φ | x) ∝ f(x | θ) f(θ | φ) f(φ),

will not be analytically tractable.

However, often a Gibbs sampling approach can be implemented by sampling from the conditional posterior distributions:

f(θ | x, φ) ∝ f(x | θ) f(θ | φ),

f(φ | x, θ) ∝ f(θ | φ) f(φ).

It is important to check the propriety of the posterior distribution when improper hyperprior distributions are used. An alternative (as in, for example, WinBUGS) is to use proper but high-variance hyperprior distributions.


Hierarchical models

For example, a hierarchical normal linear model is given by:

xij | θi, φ ∼ N(θi, 1/φ), i = 1, . . . , n, j = 1, . . . , m.

Assuming that the means, θi, are exchangeable, we may consider the following prior distribution:

θi | µ, ψ ∼ N(µ, 1/ψ),

where the hyperparameters are µ and ψ.
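The full conditionals of this model are all standard, so Gibbs sampling is immediate. A minimal sketch, assuming (our choice, not the slides') a flat prior on µ and G(a, b) priors on φ and ψ:

```python
import numpy as np

def gibbs_hier_normal(x, n_iter=5000, a=0.1, b=0.1, rng=None):
    """Gibbs sampler for x_ij | theta_i, phi ~ N(theta_i, 1/phi) and
    theta_i | mu, psi ~ N(mu, 1/psi); x is an (n, m) array."""
    rng = rng or np.random.default_rng(0)
    n, m = x.shape
    theta, mu, phi, psi = x.mean(axis=1), x.mean(), 1.0, 1.0
    out = []
    for _ in range(n_iter):
        prec = phi * m + psi                         # conditional precision of theta_i
        theta = rng.normal((phi * m * x.mean(axis=1) + psi * mu) / prec,
                           1 / np.sqrt(prec))
        mu = rng.normal(theta.mean(), 1 / np.sqrt(n * psi))   # flat prior on mu
        phi = rng.gamma(a + n * m / 2,
                        1 / (b + ((x - theta[:, None]) ** 2).sum() / 2))
        psi = rng.gamma(a + n / 2, 1 / (b + ((theta - mu) ** 2).sum() / 2))
        out.append((theta.copy(), mu, phi, psi))
    return out
```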


Example: A hierarchical one-way ANOVA

Suppose that 5 individuals take 3 different IQ tests developed by 3 different psychologists, obtaining the following results:

         1    2    3    4    5
Test 1  106  121  159   95   78
Test 2  108  113  158   91   80
Test 3   98  115  169   93   77

Then, we can assume that:

Xij | θi, φ ∼ N(θi, 1/φ),

θi | µ, ψ ∼ N(µ, 1/ψ),

for i = 1, . . . , 5 and j = 1, 2, 3, where θi represents the true IQ of subject i and µ the mean true IQ in the population.


Example: A hierarchical Poisson model

The number of failures, Xi, at power plant i is assumed to follow a Poisson distribution:

Xi | λi ∼ P(λi ti), for i = 1, . . . , 10,

where λi is the failure rate for pump i and ti is the length of operation time of the pump (in thousands of hours). It seems natural to assume that the failure rates are exchangeable, and thus we might assume:

λi | γ ∼ E(γ),

where γ is the prior hyperparameter. The observed data are:

Pump   1     2     3     4    5     6     7     8    9    10
ti    94.5  15.7  62.9  126   5.24  31.4  1.05  1.05  2.1  10.5
xi     5     1     5    14    3    19     1     1    4    22
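Because the exponential prior is conjugate for the Poisson rates, and a gamma hyperprior on γ (our assumption; the slides leave γ's prior unspecified) is conjugate in turn, the Gibbs sampler has two exact steps:

```python
import numpy as np

t = np.array([94.5, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5])
x = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22])

def gibbs_pumps(n_iter=10000, c=0.1, d=1.0, rng=None):
    """Gibbs sampler: lambda_i | gamma, x ~ G(x_i + 1, t_i + gamma) and,
    under an assumed G(c, d) hyperprior,
    gamma | lambda ~ G(c + 10, d + sum(lambda))."""
    rng = rng or np.random.default_rng(0)
    gamma, draws = 1.0, []
    for _ in range(n_iter):
        lam = rng.gamma(x + 1, 1 / (t + gamma))       # exact conditional draws
        gamma = rng.gamma(c + len(x), 1 / (d + lam.sum()))
        draws.append((lam, gamma))
    return draws
```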


Dynamic models

The univariate normal dynamic linear model (DLM) is:

yt = Ftθt + νt , νt ∼ N (0,Vt)

θt = Gtθt−1 + ωt , ωt ∼ N (0,Wt).

These models are linear state space models, where xt = Ftθt represents the signal, θt is the state vector, Ft is a regression vector and Gt is a state matrix.

The usual features of a time series, such as trend and seasonality, can be modeled within this format.

If the matrices Ft, Gt, Vt and Wt are constant, the model is said to be time invariant.


Dynamic models

One of the simplest DLMs is the random walk plus noise model, also called the first-order polynomial model. It is used to model univariate observations, and the state vector is unidimensional:

yt = θt + νt, νt ∼ N(0, Vt),

θt = θ_{t−1} + ωt, ωt ∼ N(0, Wt).

This is a slowly varying level model where the observations fluctuate around a mean which varies according to a random walk.

Assuming known variances, Vt and Wt, a straightforward Bayesian analysis can be carried out as follows.


Dynamic models

Suppose that the information at time t − 1 is y^{t−1} = {y1, y2, . . . , y_{t−1}} and assume that:

θ_{t−1} | y^{t−1} ∼ N(m_{t−1}, C_{t−1}).

Then, we have that:

The prior distribution for θt is:

θt | y^{t−1} ∼ N(m_{t−1}, Rt),

where Rt = C_{t−1} + Wt.

The one-step-ahead predictive distribution for yt is:

yt | y^{t−1} ∼ N(m_{t−1}, Qt),

where Qt = Rt + Vt.


Dynamic models

The joint distribution of θt and yt is:

(θt, yt)ᵀ | y^{t−1} ∼ N( (m_{t−1}, m_{t−1})ᵀ, [ Rt  Rt ; Rt  Qt ] ).

The posterior distribution for θt given y^t = {y^{t−1}, yt} is:

θt | y^t ∼ N(mt, Ct), where

mt = m_{t−1} + At et,

At = Rt/Qt,

et = yt − m_{t−1},

Ct = Rt − At² Qt.

Note that et is simply a prediction error term. The posterior mean formula can also be written as:

mt = (1 − At) m_{t−1} + At yt.
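These recursions are the Kalman filter for this model. A compact sketch for known V and W (the vague initial prior m0 = 0, C0 = 10⁷ is our choice):

```python
import numpy as np

def rw_plus_noise_filter(y, V, W, m0=0.0, C0=1e7):
    """Forward filtering for y_t = theta_t + nu_t, theta_t = theta_{t-1} + w_t
    with known variances V and W; returns filtered means and variances."""
    m, C = m0, C0
    ms, Cs = [], []
    for yt in y:
        R = C + W                 # prior variance of theta_t
        Q = R + V                 # predictive variance of y_t
        A = R / Q                 # adaptive coefficient
        m = m + A * (yt - m)      # m_t = (1 - A_t) m_{t-1} + A_t y_t
        C = R - A ** 2 * Q        # posterior variance C_t
        ms.append(m)
        Cs.append(C)
    return np.array(ms), np.array(Cs)
```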


Example: First order polynomial DLM

Assume a slowly varying level model for the water level in Lake Huron, with known variances Vt = 1 and Wt = 1.

1 Estimate the filtered values of the state vector based on the observations up to time t, from f(θt | y^t).

2 Estimate the predicted values of the state vector based on the observations up to time t − 1, from f(θt | y^{t−1}).

3 Estimate the predicted values of the signal based on the observations up to time t − 1, from f(yt | y^{t−1}).

4 Compare the results using, e.g.: Vt = 10 and Wt = 1; Vt = 1 and Wt = 10.


Dynamic models

When the variances are not known, Bayesian inference for the system is more complex.

One possibility is the use of MCMC algorithms, which are usually based on the so-called forward filtering backward sampling (FFBS) algorithm.

1 The forward filtering step is the standard normal linear analysis giving f(θt | y^t) at each t, for t = 1, . . . , T.

2 The backward sampling step uses the Markov property: sample θ*_T from f(θT | y^T) and then, for t = T − 1, . . . , 1, sample from f(θt | y^t, θ*_{t+1}).

Thus, a sample from the posterior parameter structure is generated, as sketched below.
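For the random walk plus noise model the backward step also has a closed form, so FFBS is a few lines. A sketch reusing rw_plus_noise_filter from the previous code block (the conditional mean and variance follow from standard normal conjugacy):

```python
def ffbs(y, V, W, rng=None):
    """Forward filtering backward sampling: draws one path theta_1..T from
    its joint posterior in the random walk plus noise model."""
    rng = rng or np.random.default_rng(0)
    ms, Cs = rw_plus_noise_filter(y, V, W)
    T = len(y)
    theta = np.empty(T)
    theta[-1] = rng.normal(ms[-1], np.sqrt(Cs[-1]))
    for t in range(T - 2, -1, -1):
        H = 1.0 / (1.0 / Cs[t] + 1.0 / W)            # var(theta_t | y^t, theta_{t+1})
        h = H * (ms[t] / Cs[t] + theta[t + 1] / W)   # corresponding mean
        theta[t] = rng.normal(h, np.sqrt(H))
    return theta
```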

However, MCMC may be computationally very expensive for on-line estimation. One possible alternative is the use of particle filters.


Dynamic models

Other examples of DLM are the following:

A dynamic linear regression model is given by:

yt = Ftθt + νt , νt ∼ N (0,Vt)

θt = θt−1 + ωt , ωt ∼ N (0,Wt).

The AR(p) model with time-varying coefficients takes the form:

yt = θ0t + θ1t y_{t−1} + . . . + θpt y_{t−p} + νt,  θit = θ_{i,t−1} + ωit.

This model can be expressed in state space form by setting θt = (θ0t, . . . , θpt)ᵀ and Ft = (1, y_{t−1}, . . . , y_{t−p}).


Dynamic models

The additive structure of DLMs makes it easy to think of an observed series as originating from the sum of different components, e.g.,

yt = y1t + . . . + yht,

where y1t might represent a trend component, y2t a seasonal component, and so on. Then, each component, yit, might be described by a different DLM:

yit = Fit θit + νit, νit ∼ N(0, Vit),

θit = Git θ_{i,t−1} + ωit, ωit ∼ N(0, Wit).

By the assumption of independence of the components, yt is also described by a DLM with:

Ft = (F1t | . . . | Fht), Vt = V1t + . . . + Vht,

and

Gt = blockdiag(G1t, . . . , Ght), Wt = blockdiag(W1t, . . . , Wht).
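The composition rule is mechanical: stack the Fit side by side, sum the observation variances, and place the Git and Wit in diagonal blocks. A small sketch with two generic placeholder components (all the matrices below are illustrative values, not from the slides):

```python
import numpy as np
from scipy.linalg import block_diag

# Placeholder components: a local linear trend and a second 2-d component.
F1, G1 = np.array([[1.0, 0.0]]), np.array([[1.0, 1.0], [0.0, 1.0]])
W1, V1 = 0.1 * np.eye(2), 1.0
F2, G2 = np.array([[1.0, 0.0]]), np.eye(2)
W2, V2 = 0.2 * np.eye(2), 0.5

F = np.hstack([F1, F2])       # F_t = (F_1t | F_2t)
G = block_diag(G1, G2)        # block-diagonal state matrix
W = block_diag(W1, W2)        # block-diagonal state noise covariance
V = V1 + V2                   # observation variances add
```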
