Bayesian Inference
Chapter 4: Regression and Hierarchical Models
Conchi Ausín and Mike Wiper
Department of Statistics
Universidad Carlos III de Madrid
Master in Business Administration and Quantitative Methods
Master in Mathematical Engineering
Conchi Ausín and Mike Wiper Regression and hierarchical models Masters Programmes 1 / 35
Objective
[Photos: A. F. M. Smith, Dennis Lindley]
We analyze the Bayesian approach to fitting normal and generalized linear models and introduce the Bayesian hierarchical modeling approach. Also, we study the modeling and forecasting of time series.
Contents
1 Normal linear models
1.1. ANOVA model
1.2. Simple linear regression model
2 Generalized linear models
3 Hierarchical models
4 Dynamic models
Normal linear models
A normal linear model is of the following form:
y = Xθ + ε,
where $y = (y_1, \ldots, y_n)'$ is the observed data, $X$ is a known $n \times k$ matrix, called the design matrix, $\theta = (\theta_1, \ldots, \theta_k)'$ is the parameter vector and $\varepsilon$ follows a multivariate normal distribution. Usually, it is assumed that:
$$\varepsilon \sim N\!\left(0, \frac{1}{\phi} I_n\right).$$
A simple example of a normal linear model is the simple linear regression model, where
$$X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}^T \quad \text{and} \quad \theta = (\alpha, \beta)^T.$$
Normal linear models
Consider a normal linear model, y = Xθ + ε. A conjugate prior distribution is a normal-gamma distribution:
$$\theta \mid \phi \sim N\!\left(m, \frac{1}{\phi} V\right), \qquad \phi \sim G\!\left(\frac{a}{2}, \frac{b}{2}\right).$$
Then, the posterior distribution given y is also a normal-gamma distribution with:
$$m^* = \left(X^T X + V^{-1}\right)^{-1}\left(X^T y + V^{-1} m\right), \qquad V^* = \left(X^T X + V^{-1}\right)^{-1},$$
$$a^* = a + n, \qquad b^* = b + y^T y + m^T V^{-1} m - {m^*}^T {V^*}^{-1} m^*.$$
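As a minimal sketch, the conjugate update above can be computed directly with linear algebra. The data are simulated and the prior values (m, V, a, b) are arbitrary illustrative choices, not values from the slides.

```python
import numpy as np

# A minimal sketch of the conjugate normal-gamma update for the normal
# linear model, using simulated data and illustrative prior values.
rng = np.random.default_rng(0)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # design matrix
theta_true = np.array([1.0, 2.0])
y = X @ theta_true + rng.normal(scale=0.5, size=n)

# Normal-gamma prior: theta | phi ~ N(m, V/phi), phi ~ G(a/2, b/2)
m = np.zeros(k)
V = np.eye(k) * 10.0
a, b = 1.0, 1.0

# Posterior updates (the formulas on the slide)
V_inv = np.linalg.inv(V)
V_star = np.linalg.inv(X.T @ X + V_inv)
m_star = V_star @ (X.T @ y + V_inv @ m)
a_star = a + n
b_star = b + y @ y + m @ V_inv @ m - m_star @ np.linalg.inv(V_star) @ m_star

print("posterior mean:", m_star)  # close to theta_true for this sample size
print("E[sigma^2 | y] =", b_star / (a_star - 2))
```

With a weak prior and n = 50 observations, the posterior mean essentially reproduces the least-squares fit, illustrating the weighted-average interpretation on the next slide.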
Normal linear models
The posterior mean is given by:
$$
\begin{aligned}
E[\theta \mid y] &= \left(X^T X + V^{-1}\right)^{-1}\left(X^T y + V^{-1} m\right) \\
&= \left(X^T X + V^{-1}\right)^{-1}\left(X^T X \left(X^T X\right)^{-1} X^T y + V^{-1} m\right) \\
&= \left(X^T X + V^{-1}\right)^{-1}\left(X^T X\, \hat{\theta} + V^{-1} m\right),
\end{aligned}
$$
where $\hat{\theta} = \left(X^T X\right)^{-1} X^T y$ is the maximum likelihood estimator.

Thus, this expression may be interpreted as a weighted average of the prior estimator, m, and the MLE, $\hat{\theta}$, with weights proportional to precisions since, conditional on φ, the prior variance is $\frac{1}{\phi} V$ and the distribution of the MLE from the classical viewpoint is $\hat{\theta} \mid \phi \sim N\!\left(\theta, \frac{1}{\phi}\left(X^T X\right)^{-1}\right)$.
Normal linear models
Consider a normal linear model, y = Xθ + ε, and assume the limiting prior distribution,
$$p(\theta, \phi) \propto \frac{1}{\phi}.$$
Then, we have that:
$$\theta \mid y, \phi \sim N\!\left(\hat{\theta}, \frac{1}{\phi}\left(X^T X\right)^{-1}\right), \qquad \phi \mid y \sim G\!\left(\frac{n-k}{2}, \frac{y^T y - \hat{\theta}^T\left(X^T X\right)\hat{\theta}}{2}\right).$$
Note that $\hat{\sigma}^2 = \frac{y^T y - \hat{\theta}^T\left(X^T X\right)\hat{\theta}}{n-k}$ is the usual classical estimator of $\sigma^2 = \frac{1}{\phi}$.

In this case, Bayesian credible intervals, estimators etc. will coincide with their classical counterparts.
ANOVA model
The ANOVA model is an example of a normal linear model where:
$$y_{ij} = \theta_i + \varepsilon_{ij},$$
where $\varepsilon_{ij} \sim N\!\left(0, \frac{1}{\phi}\right)$, for $i = 1, \ldots, k$ and $j = 1, \ldots, n_i$.

Thus, the parameters are $\theta = (\theta_1, \ldots, \theta_k)$, the observed data are $y = (y_{11}, \ldots, y_{1n_1}, y_{21}, \ldots, y_{2n_2}, \ldots, y_{k1}, \ldots, y_{kn_k})^T$, and the design matrix is:
$$X = \begin{pmatrix} \mathbf{1}_{n_1} & 0 & \cdots & 0 \\ 0 & \mathbf{1}_{n_2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{1}_{n_k} \end{pmatrix},$$
where $\mathbf{1}_{n_i}$ denotes a column vector of $n_i$ ones.
ANOVA model
Assume conditionally independent normal priors, $\theta_i \sim N\!\left(m_i, \frac{1}{\alpha_i \phi}\right)$, for $i = 1, \ldots, k$, and a gamma prior $\phi \sim G\!\left(\frac{a}{2}, \frac{b}{2}\right)$.

This corresponds to a normal-gamma prior distribution for $(\theta, \phi)$ where $m = (m_1, \ldots, m_k)$ and $V = \mathrm{diag}\!\left(\frac{1}{\alpha_1}, \ldots, \frac{1}{\alpha_k}\right)$.

Then, it is obtained that:
$$\theta \mid y, \phi \sim N\!\left(\begin{pmatrix} \frac{n_1 \bar{y}_{1\cdot} + \alpha_1 m_1}{n_1 + \alpha_1} \\ \vdots \\ \frac{n_k \bar{y}_{k\cdot} + \alpha_k m_k}{n_k + \alpha_k} \end{pmatrix}, \frac{1}{\phi}\,\mathrm{diag}\!\left(\frac{1}{\alpha_1 + n_1}, \ldots, \frac{1}{\alpha_k + n_k}\right)\right)$$
and
$$\phi \mid y \sim G\!\left(\frac{a + n}{2}, \frac{b + \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_{i\cdot}\right)^2 + \sum_{i=1}^{k}\frac{n_i \alpha_i}{n_i + \alpha_i}\left(\bar{y}_{i\cdot} - m_i\right)^2}{2}\right).$$
ANOVA model
If we assume alternatively the reference prior, $p(\theta, \phi) \propto \frac{1}{\phi}$, we have:
$$\theta \mid y, \phi \sim N\!\left(\begin{pmatrix} \bar{y}_{1\cdot} \\ \vdots \\ \bar{y}_{k\cdot} \end{pmatrix}, \frac{1}{\phi}\,\mathrm{diag}\!\left(\frac{1}{n_1}, \ldots, \frac{1}{n_k}\right)\right), \qquad \phi \mid y \sim G\!\left(\frac{n-k}{2}, \frac{(n-k)\,\hat{\sigma}^2}{2}\right),$$
where $\hat{\sigma}^2 = \frac{1}{n-k}\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_{i\cdot}\right)^2$ is the classical variance estimate for this problem.

A 95% posterior interval for $\theta_1 - \theta_2$ is given by:
$$\bar{y}_{1\cdot} - \bar{y}_{2\cdot} \pm \hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\; t_{n-k}(0.975),$$
which is equal to the usual, classical interval.
Example: ANOVA model
Suppose that an ecologist is interested in analysing how the masses of starlings (a type of bird) vary between four locations.

Sample data of the weights of 10 starlings from each of the four locations can be downloaded from: http://arcue.botany.unimelb.edu.au/bayescode.html.

Assume a Bayesian one-way ANOVA model for these data, where a different mean is considered for each location and the variation in mass between different birds is described by a normal distribution with a common variance.
Compare the results with those obtained with classical methods.
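A sketch of this analysis under the reference prior is given below. Since the starling data must be downloaded separately, simulated masses (illustrative values in grams) for four locations of 10 birds each stand in for them here.

```python
import numpy as np
from scipy import stats

# Bayesian one-way ANOVA under the reference prior p(theta, phi) ∝ 1/phi.
# Simulated group data stand in for the downloadable starling masses.
rng = np.random.default_rng(1)
groups = [rng.normal(loc, 2.0, size=10) for loc in (78, 80, 83, 88)]

k = len(groups)
n = sum(len(g) for g in groups)
ybars = np.array([g.mean() for g in groups])                    # group means
rss = sum(((g - g.mean()) ** 2).sum() for g in groups)          # within-group SS
sigma2_hat = rss / (n - k)  # classical variance estimate

# 95% posterior interval for theta_1 - theta_2
# (it coincides with the classical confidence interval)
n1, n2 = len(groups[0]), len(groups[1])
half = np.sqrt(sigma2_hat * (1 / n1 + 1 / n2)) * stats.t.ppf(0.975, n - k)
diff = ybars[0] - ybars[1]
print(f"theta1 - theta2: {diff:.2f} +/- {half:.2f}")
```

Because the reference-prior posterior reproduces the classical sampling results, this interval is numerically identical to the frequentist two-mean interval with a pooled variance estimate.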
Simple linear regression model
Another example of normal linear model is the simple regression model:
yi = α + βxi + εi ,
for $i = 1, \ldots, n$, where $\varepsilon_i \sim N\!\left(0, \frac{1}{\phi}\right)$.

Suppose that we use the limiting prior:
$$p(\alpha, \beta, \phi) \propto \frac{1}{\phi}.$$
Simple linear regression model
Then, we have that:
$$\begin{pmatrix} \alpha \\ \beta \end{pmatrix} \Big|\; y, \phi \sim N\!\left(\begin{pmatrix} \hat{\alpha} \\ \hat{\beta} \end{pmatrix}, \frac{1}{\phi\, n\, s_x}\begin{pmatrix} \sum_{i=1}^{n} x_i^2 & -n\bar{x} \\ -n\bar{x} & n \end{pmatrix}\right),$$
$$\phi \mid y \sim G\!\left(\frac{n-2}{2}, \frac{s_y\left(1 - r^2\right)}{2}\right),$$
where:
$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}, \qquad \hat{\beta} = \frac{s_{xy}}{s_x}, \qquad s_x = \sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2, \qquad s_y = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2,$$
$$s_{xy} = \sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right), \qquad r = \frac{s_{xy}}{\sqrt{s_x s_y}}, \qquad \hat{\sigma}^2 = \frac{s_y\left(1 - r^2\right)}{n-2}.$$
Simple linear regression model
Thus, the marginal distributions of α and β are Student-t distributions:
$$\frac{\alpha - \hat{\alpha}}{\sqrt{\dfrac{\hat{\sigma}^2 \sum_{i=1}^{n} x_i^2}{n\, s_x}}} \;\Big|\; y \sim t_{n-2}, \qquad \frac{\beta - \hat{\beta}}{\sqrt{\dfrac{\hat{\sigma}^2}{s_x}}} \;\Big|\; y \sim t_{n-2}.$$
Therefore, for example, a 95% credible interval for β is given by:
$$\hat{\beta} \pm \frac{\hat{\sigma}}{\sqrt{s_x}}\; t_{n-2}(0.975),$$
equal to the usual classical interval.
Simple linear regression model
Suppose now that we wish to predict a future observation:
ynew = α + βxnew + εnew .
Note that:
$$E[y_{new} \mid \phi, y] = \hat{\alpha} + \hat{\beta} x_{new},$$
$$V[y_{new} \mid \phi, y] = \frac{1}{\phi}\left(\frac{\sum_{i=1}^{n} x_i^2 + n x_{new}^2 - 2 n \bar{x} x_{new}}{n\, s_x} + 1\right) = \frac{1}{\phi}\left(\frac{s_x + n\bar{x}^2 + n x_{new}^2 - 2 n \bar{x} x_{new}}{n\, s_x} + 1\right).$$
Therefore,
$$y_{new} \mid \phi, y \sim N\!\left(\hat{\alpha} + \hat{\beta} x_{new},\; \frac{1}{\phi}\left(\frac{\left(\bar{x} - x_{new}\right)^2}{s_x} + \frac{1}{n} + 1\right)\right).$$
Simple linear regression model
And then,
$$\frac{y_{new} - \left(\hat{\alpha} + \hat{\beta} x_{new}\right)}{\hat{\sigma}\sqrt{\dfrac{\left(\bar{x} - x_{new}\right)^2}{s_x} + \dfrac{1}{n} + 1}} \;\Big|\; y \sim t_{n-2},$$
leading to the following 95% credible interval for $y_{new}$:
$$\hat{\alpha} + \hat{\beta} x_{new} \pm \hat{\sigma}\sqrt{\frac{\left(\bar{x} - x_{new}\right)^2}{s_x} + \frac{1}{n} + 1}\; t_{n-2}(0.975),$$
which coincides with the usual, classical interval.
Example: Simple linear regression model
Consider the data file prostate.data that can be downloaded from: http://statweb.stanford.edu/~tibs/ElemStatLearn/.

This includes, among other clinical measures, the level of prostate specific antigen in logs (lpsa) and the log cancer volume (lcavol) in 97 men who were about to receive a radical prostatectomy.

Use a Bayesian linear regression model to predict the lpsa in terms of the lcavol.
Compare the results with a classical linear regression fit.
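The closed-form formulas above can be sketched in a few lines of code. Since prostate.data must be downloaded separately, a simulated lcavol/lpsa-like data set (all coefficients and scales below are illustrative assumptions) is used in its place.

```python
import numpy as np
from scipy import stats

# Reference-prior analysis of the simple linear regression model, using the
# slide formulas.  Simulated data stand in for the downloadable prostate file.
rng = np.random.default_rng(2)
n = 97
x = rng.normal(1.35, 1.2, size=n)                 # stand-in for lcavol
y = 1.5 + 0.7 * x + rng.normal(0, 0.8, size=n)    # stand-in for lpsa

xbar, ybar = x.mean(), y.mean()
sx = ((x - xbar) ** 2).sum()
sy = ((y - ybar) ** 2).sum()
sxy = ((x - xbar) * (y - ybar)).sum()
beta_hat = sxy / sx
alpha_hat = ybar - beta_hat * xbar
r2 = sxy ** 2 / (sx * sy)
sigma2_hat = sy * (1 - r2) / (n - 2)

# 95% credible interval for beta (equals the classical interval)
t975 = stats.t.ppf(0.975, n - 2)
ci = (beta_hat - t975 * np.sqrt(sigma2_hat / sx),
      beta_hat + t975 * np.sqrt(sigma2_hat / sx))

# 95% predictive interval for a new observation at x_new
x_new = 2.0
mean_new = alpha_hat + beta_hat * x_new
half = t975 * np.sqrt(sigma2_hat * ((xbar - x_new) ** 2 / sx + 1 / n + 1))
print("beta CI:", ci, " prediction:", mean_new, "+/-", half)
```

The credible and predictive intervals produced this way coincide numerically with the classical confidence and prediction intervals from a least-squares fit, as the slides state.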
Generalized linear models
The generalized linear model generalizes the normal linear model by allowing the possibility of non-normal error distributions and by allowing for a non-linear relationship between y and x.

A generalized linear model is specified by two functions:

1 A conditional, exponential family density function of y given x, parameterized by a mean parameter, µ = µ(x) = E[Y | x], and (possibly) a dispersion parameter, φ > 0, that is independent of x.

2 A (one-to-one) link function, g(·), which relates the mean, µ = µ(x), to the covariate vector, x, as g(µ) = xθ.
Generalized linear models
The following are generalized linear models with the canonical link function, which is the parameterization that leaves the exponential family density in canonical form.

A logistic regression is often used for predicting the occurrence of an event given covariates:
$$Y_i \mid p_i \sim \mathrm{Bin}(n_i, p_i), \qquad \log\frac{p_i}{1 - p_i} = x_i \theta.$$
A Poisson regression is used for predicting the number of events in a time period given covariates:
$$Y_i \mid \lambda_i \sim \mathcal{P}(\lambda_i), \qquad \log \lambda_i = x_i \theta.$$
Generalized linear models
The Bayesian specification of a GLM is completed by defining (typically normal or normal-gamma) prior distributions p(θ, φ) over the unknown model parameters.

As with standard linear models, when improper priors are used, it is then important to check that these lead to valid posterior distributions.

Clearly, these models will not have conjugate posterior distributions but, usually, they are easily handled by Gibbs sampling.

In particular, the posterior distributions from these models are usually log-concave and are thus easily sampled via adaptive rejection sampling.
Example: A logistic regression model
The O-ring data consist of 23 observations on pre-Challenger Space Shuttle launches.

On each launch, it is observed whether there is at least one O-ring failure, and the temperature at launch.

The goal is to model the probability of at least one O-ring failure as a function of temperature.

Temperatures were 53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81.

Failures occurred at 53, 57, 58, 63, 70, 70, 75.
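The slides suggest sampling-based inference for GLMs; a simple random-walk Metropolis sampler (rather than the Gibbs/adaptive rejection schemes mentioned above) is one way to sketch this for the O-ring data. The flat improper prior, the centring of temperature, and the proposal scales below are all ad-hoc choices made here to keep the example short and help the chain mix.

```python
import numpy as np

# Random-walk Metropolis for the O-ring logistic regression,
# log(p/(1-p)) = theta0 + theta1 * (temp - mean temp), flat prior assumed.
temps = np.array([53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70,
                  72, 73, 75, 75, 76, 76, 78, 79, 81], dtype=float)
# 1 = at least one O-ring failure (failures at 53, 57, 58, 63,
# two of the launches at 70, and one of the launches at 75)
fail = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
                 0, 0, 0, 1, 0, 0, 0, 0, 0], dtype=float)
tc = temps - temps.mean()

def log_post(th):
    eta = th[0] + th[1] * tc
    # Bernoulli log-likelihood, written stably (flat prior adds nothing)
    return np.sum(fail * eta - np.logaddexp(0.0, eta))

rng = np.random.default_rng(3)
th = np.zeros(2)
lp = log_post(th)
samples = []
for it in range(20000):
    prop = th + rng.normal(0.0, [0.6, 0.08])   # ad-hoc proposal scales
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        th, lp = prop, lp_prop
    if it >= 5000:                              # discard burn-in
        samples.append(th.copy())
samples = np.asarray(samples)
print("posterior mean slope:", samples[:, 1].mean().round(3))
```

The posterior slope concentrates on negative values: a failure becomes less likely as the launch temperature rises.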
Example: A logistic regression model
The table shows the relationship, for 64 infants, between gestational age of the infant (in weeks) at the time of birth (x) and whether the infant was breast feeding at the time of release from hospital (y).

x          28  29  30  31  32  33
#{y = 0}    4   3   2   2   4   1
#{y = 1}    2   2   7   7  16  14

Let x_i represent the gestational age and n_i the number of infants with this age. Then we can model the probability that y_i infants were breast feeding at time of release from hospital via a standard binomial regression model.
Hierarchical models
Suppose we have data, x, and a likelihood function f(x | θ) where the parameter values θ = (θ1, . . . , θk) are judged to be exchangeable, that is, any permutation of them has the same distribution.

In this situation, it makes sense to consider a multilevel model, assuming a prior distribution, f(θ | φ), which depends upon a further, unknown hyperparameter, φ, and use a hyperprior distribution, f(φ).

In theory, this process could continue further, using hyper-hyperprior distributions to estimate the hyperprior distributions. This provides a method for eliciting prior distributions.

One alternative is to estimate the hyperparameter using classical methods, which is known as empirical Bayes. A point estimate, φ̂, is then obtained and used to approximate the posterior distribution. However, the uncertainty in φ is then ignored.
Hierarchical models
In most hierarchical models, the joint posterior distribution will not be analytically tractable, as it will be:
f (θ, φ | x) ∝ f (x | θ)f (θ | φ)f (φ)
However, often a Gibbs sampling approach can be implemented by samplingfrom the conditional posterior distributions:
f (θ | x, φ) ∝ f (x | θ)f (θ | φ)
f (φ | x,θ) ∝ f (θ | φ)f (φ)
It is important to check the propriety of the posterior distribution when improper hyperprior distributions are used. An alternative (as, for example, in WinBUGS) is to use proper but high-variance hyperprior distributions.
Hierarchical models
For example, a hierarchical normal linear model is given by:
$$x_{ij} \mid \theta_i, \phi \sim N\!\left(\theta_i, \frac{1}{\phi}\right), \qquad i = 1, \ldots, n, \quad j = 1, \ldots, m.$$
Assuming that the means, θ_i, are exchangeable, we may consider the following prior distribution:
$$\theta_i \mid \mu, \psi \sim N\!\left(\mu, \frac{1}{\psi}\right),$$
where the hyperparameters are µ and ψ.
Example: A hierarchical one-way ANOVA
Suppose that 5 individuals take 3 different IQ tests developed by 3 different psychologists, obtaining the following results:

Subject   1    2    3    4    5
Test 1  106  121  159   95   78
Test 2  108  113  158   91   80
Test 3   98  115  169   93   77

Then, we can assume that:
$$X_{ij} \mid \theta_i, \phi \sim N\!\left(\theta_i, \frac{1}{\phi}\right), \qquad \theta_i \mid \mu, \psi \sim N\!\left(\mu, \frac{1}{\psi}\right),$$
for i = 1, . . . , 5 and j = 1, 2, 3, where θ_i represents the true IQ of subject i and µ the mean true IQ in the population.
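A Gibbs sampler for this model can be sketched directly from the conditional distributions. The slide leaves the hyperpriors open, so the choices below (a flat prior on µ and vague gamma priors on φ and ψ) are illustrative assumptions.

```python
import numpy as np

# Gibbs sampling for the hierarchical one-way ANOVA on the IQ data.
# Hyperpriors (flat on mu, vague gammas on phi and psi) are assumed here.
x = np.array([[106, 108,  98],
              [121, 113, 115],
              [159, 158, 169],
              [ 95,  91,  93],
              [ 78,  80,  77]], dtype=float)   # rows: subjects, cols: tests
k, m = x.shape
rng = np.random.default_rng(4)

theta = x.mean(axis=1).copy()
mu, phi, psi = theta.mean(), 1.0, 0.01
a = b = c = d = 0.001  # vague gamma hyperparameters (assumed)

draws = []
for it in range(6000):
    # theta_i | rest: normal with precision m*phi + psi
    prec = m * phi + psi
    mean = (phi * x.sum(axis=1) + psi * mu) / prec
    theta = rng.normal(mean, 1 / np.sqrt(prec))
    # mu | rest (flat prior): normal around the mean of the thetas
    mu = rng.normal(theta.mean(), 1 / np.sqrt(k * psi))
    # phi | rest and psi | rest: conjugate gamma updates
    phi = rng.gamma(a + k * m / 2,
                    1 / (b + ((x - theta[:, None]) ** 2).sum() / 2))
    psi = rng.gamma(c + k / 2, 1 / (d + ((theta - mu) ** 2).sum() / 2))
    if it >= 1000:                               # discard burn-in
        draws.append(np.concatenate([theta, [mu]]))
draws = np.asarray(draws)
print("posterior means of true IQs:", draws[:, :k].mean(axis=0).round(1))
print("posterior mean population IQ:", draws[:, k].mean().round(1))
```

Because the between-subject variation in these data is much larger than the within-subject variation, the posterior means of the θ_i shrink only slightly from the individual subject averages toward the population mean µ.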
Example: A hierarchical Poisson model
The number of failures, X_i, of pump i at a power plant is assumed to follow a Poisson distribution:
$$X_i \mid \lambda_i \sim \mathcal{P}(\lambda_i t_i), \qquad \text{for } i = 1, \ldots, 10,$$
where λ_i is the failure rate for pump i and t_i is the length of operation time of the pump (in 1000s of hours). It seems natural to assume that the failure rates are exchangeable and thus we might assume:
$$\lambda_i \mid \gamma \sim \mathcal{E}(\gamma),$$
where γ is the prior hyperparameter. The observed data are:

Pump    1     2     3     4     5     6     7     8     9    10
t_i   94.5  15.7  62.9  126   5.24  31.4  1.05  1.05  2.1   10.5
x_i    5     1     5    14     3    19     1     1     4    22
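This model has fully conjugate conditionals, so Gibbs sampling is immediate. The slide leaves the hyperprior on γ open; a vague gamma prior, γ ~ G(c, d), is assumed here, under which λ_i | γ, x_i ~ G(x_i + 1, t_i + γ) and γ | λ ~ G(c + 10, d + Σ λ_i).

```python
import numpy as np

# Gibbs sampling for the hierarchical Poisson (pump failure) model,
# assuming a vague gamma hyperprior gamma ~ G(c, d).
t = np.array([94.5, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5])
x = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22], dtype=float)
c, d = 0.1, 0.1  # assumed hyperprior values
rng = np.random.default_rng(5)

gamma = 1.0
lam_draws = []
for it in range(6000):
    # lambda_i | gamma, x_i ~ G(x_i + 1, t_i + gamma)  (Poisson/exponential)
    lam = rng.gamma(x + 1, 1 / (t + gamma))
    # gamma | lambda ~ G(c + 10, d + sum(lambda_i))
    gamma = rng.gamma(c + len(x), 1 / (d + lam.sum()))
    if it >= 1000:                                # discard burn-in
        lam_draws.append(lam)
lam_draws = np.asarray(lam_draws)
print("posterior mean failure rates:", lam_draws.mean(axis=0).round(3))
```

The posterior rates shrink the raw estimates x_i / t_i toward each other, most visibly for the pumps with very short operation times.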
Dynamic models
The univariate normal dynamic linear model (DLM) is:
yt = Ftθt + νt , νt ∼ N (0,Vt)
θt = Gtθt−1 + ωt , ωt ∼ N (0,Wt).
These models are linear state space models, where x_t = F_t θ_t represents the signal, θ_t is the state vector, F_t is a regression vector and G_t is a state matrix.

The usual features of a time series such as trend and seasonality can be modeled within this format.

If the matrices F_t, G_t, V_t and W_t are constant, the model is said to be time invariant.
Dynamic models
One of the simplest DLMs is the random walk plus noise model, also called the first order polynomial model. It is used to model univariate observations and the state vector is unidimensional:
yt = θt + νt , νt ∼ N (0,Vt)
θt = θt−1 + ωt , ωt ∼ N (0,Wt).
This is a slowly varying level model where the observations fluctuate around a mean which varies according to a random walk.

Assuming known variances, V_t and W_t, a straightforward Bayesian analysis can be carried out as follows.
Dynamic models
Suppose that the information at time t − 1 is $y^{t-1} = \{y_1, y_2, \ldots, y_{t-1}\}$ and assume that:
θt−1 | yt−1 ∼ N (mt−1,Ct−1).
Then, we have that:
The prior distribution for θt is:
θt | yt−1 ∼ N (mt−1,Rt)
where Rt = Ct−1 + Wt
The one step ahead predictive distribution for yt is:
yt | yt−1 ∼ N (mt−1,Qt)
where Qt = Rt + Vt .
Dynamic models
The joint distribution of θ_t and y_t is:
$$\begin{pmatrix} \theta_t \\ y_t \end{pmatrix} \Big|\; y^{t-1} \sim N\!\left(\begin{pmatrix} m_{t-1} \\ m_{t-1} \end{pmatrix}, \begin{pmatrix} R_t & R_t \\ R_t & Q_t \end{pmatrix}\right).$$
The posterior distribution for θ_t given $y^t = \{y^{t-1}, y_t\}$ is $\theta_t \mid y^t \sim N(m_t, C_t)$, where:
$$m_t = m_{t-1} + A_t e_t, \qquad A_t = R_t / Q_t, \qquad e_t = y_t - m_{t-1}, \qquad C_t = R_t - A_t^2 Q_t.$$
Note that e_t is simply a prediction error term. The posterior mean formula could also be written as:
$$m_t = (1 - A_t)\, m_{t-1} + A_t\, y_t.$$
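The filtering recursions above can be sketched as a short loop. Simulated data and a vague initial state are assumed here; the known variances V = W = 1 match the Lake Huron exercise on the next slide.

```python
import numpy as np

# Filtering recursions for the random walk plus noise model with known
# variances (V = W = 1 assumed), applied to simulated data.
rng = np.random.default_rng(6)
T, V, W = 100, 1.0, 1.0
theta = np.cumsum(rng.normal(0, np.sqrt(W), T))   # latent random-walk level
y = theta + rng.normal(0, np.sqrt(V), T)          # noisy observations

m, C = 0.0, 1e6   # vague initial state (assumed)
ms, Cs = [], []
for t in range(T):
    R = C + W                 # prior variance of theta_t
    Q = R + V                 # one-step-ahead predictive variance of y_t
    A = R / Q                 # adaptive coefficient
    e = y[t] - m              # prediction error
    m = m + A * e             # posterior mean: (1 - A) m_{t-1} + A y_t
    C = R - A ** 2 * Q        # posterior variance
    ms.append(m)
    Cs.append(C)

# In the time-invariant case, C_t converges to a steady-state value,
# here the solution of C = (C + 1)/(C + 2), i.e. (sqrt(5) - 1)/2.
print("steady-state filter variance:", round(Cs[-1], 3))
```

The steady-state behaviour illustrates why, for a time-invariant model, the adaptive coefficient A_t settles down to a constant discount of past information.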
Example: First order polynomial DLM
Assume a slowly varying level model for the water level in Lake Huron with known variances: V_t = 1 and W_t = 1.

1 Estimate the filtered values of the state vector based on the observations up to time t, from f(θ_t | y^t).

2 Estimate the predicted values of the state vector based on the observations up to time t − 1, from f(θ_t | y^{t−1}).

3 Estimate the predicted values of the signal based on the observations up to time t − 1, from f(y_t | y^{t−1}).

4 Compare the results using e.g.:
  - V_t = 10 and W_t = 1.
  - V_t = 1 and W_t = 10.
Dynamic models
When the variances are not known, Bayesian inference for the system is more complex.

One possibility is the use of MCMC algorithms, which are usually based on the so-called forward filtering backward sampling algorithm.

1 The forward filtering step is the standard normal linear analysis to give f(θ_t | y^t) at each t, for t = 1, . . . , T.

2 The backward sampling step uses the Markov property and samples θ*_T from f(θ_T | y^T) and then, for t = T − 1, . . . , 1, samples from f(θ_t | y^t, θ*_{t+1}).

Thus, a sample from the posterior parameter structure is generated.

However, MCMC may be computationally very expensive for on-line estimation. One possible alternative is the use of particle filters.
Dynamic models
Other examples of DLM are the following:
A dynamic linear regression model is given by:
yt = Ftθt + νt , νt ∼ N (0,Vt)
θt = θt−1 + ωt , ωt ∼ N (0,Wt).
The AR(p) model with time-varying coefficients takes the form:
yt = θ0t + θ1tyt−1 + . . .+ θptyt−p + νt , θit = θi,t−1 + ωit ,
This model can be expressed in state space form by setting θ_t = (θ_{0t}, . . . , θ_{pt})^T and F_t = (1, y_{t−1}, . . . , y_{t−p}).
Dynamic models
The additive structure of the DLMs makes it easy to think of an observed series as originating from the sum of different components, e.g.,
$$y_t = y_{1,t} + \cdots + y_{h,t},$$
where $y_{1,t}$ might represent a trend component, $y_{2,t}$ a seasonal component, and so on. Then, each component, $y_{i,t}$, might be described by a different DLM:
$$y_{i,t} = F_{it}\theta_{it} + \nu_{it}, \qquad \nu_{it} \sim N(0, V_{it}),$$
$$\theta_{it} = G_{it}\theta_{i,t-1} + \omega_{it}, \qquad \omega_{it} \sim N(0, W_{it}).$$
By the assumption of independence of the components, $y_t$ is also a DLM described by:
$$F_t = \left(F_{1t} \mid \cdots \mid F_{ht}\right), \qquad V_t = V_{1t} + \cdots + V_{ht},$$
and
$$G_t = \begin{pmatrix} G_{1t} & & \\ & \ddots & \\ & & G_{ht} \end{pmatrix}, \qquad W_t = \begin{pmatrix} W_{1t} & & \\ & \ddots & \\ & & W_{ht} \end{pmatrix}.$$