Lecture 1: Introduction
Gaussian Markov random fields
David Bolin, Chalmers University of Technology
January 19, 2015
Practical information
Literature: The course will mostly be based on the book Gaussian Markov Random Fields: Theory and Applications by Håvard Rue and Leonhard Held.
Additional articles will be used later on.
Homepage:http://www.math.chalmers.se/~bodavid/GMRF2015/
Schedule: We will meet twice each week, Mondays and Tuesdays (10-12).
Lectures will be in MVL:14
There will be 10 lectures and 4 computer labs
Practical David Bolin
Examination

There will be two components in the examination:
• Project assignments introduced in the computer labs
• An oral exam at the end of the course

The projects can be done individually or in pairs.
The final oral exam is individual.
The grading scale comprises Fail (U) and Pass (G).
Successful completion of the course is rewarded with 7.5 hp.
Three relevant questions
Why take this course when we have had one course on Markov random fields and one course on Gaussian random fields this year?

Why is it a good idea to learn about Gaussian Markov random fields?

What is there to learn? Isn't it all just Gaussian?
Outline of lectures
Lecture 1: Introduction
Lecture 2: Definitions and basic properties of GMRFs
Lecture 3: Simulation and conditioning
Lecture 4: Numerical methods for sparse matrices
Lecture 5: Intrinsic GMRFs
Lecture 6: MCMC estimation for hierarchical models
Lecture 7: Approximation techniques and INLA
Lecture 8: Stochastic PDEs and FEM
Lecture 9: SPDEs part 2
Lecture 10: Extensions and applications
Random fields
A random field (or stochastic field), X(s, ω), s ∈ D, ω ∈ Ω, is a random function specified by its finite-dimensional joint distributions

F(y1, . . . , yn; s1, . . . , sn) = P(X(s1) ≤ y1, . . . , X(sn) ≤ yn)

for every finite n and every collection s1, . . . , sn of locations in D.

• The set D is usually a subset of Rd.
• At every location s ∈ D, X(s, ω) is a random variable, where the event ω lies in some abstract sample space Ω.
• Kolmogorov's existence theorem can be used to ensure that the random field has a valid mathematical specification.
• To simplify the notation, one often writes X(s), suppressing the dependence on ω.
GRFs David Bolin
Gaussian random fields
An important special case is when the random field is Gaussian.

A Gaussian random field X(s) is defined by a mean function µ(s) = E(X(s)) and a covariance function C(s, t) = Cov(X(s), X(t)). It has the property that, for every finite collection of points s1, . . . , sp,

x ≡ (X(s1), . . . , X(sp))⊤ ∼ N(µ, Σ),

where Σij = C(si, sj).

For existence of a Gaussian field with a prescribed mean and covariance, it is enough to ensure that C is positive definite.
Positive definite functions
A function C(s, t) is positive definite if, for any finite set of locations s1, . . . , sn in D, the covariance matrix

$$\boldsymbol\Sigma = \begin{pmatrix} C(s_1,s_1) & C(s_1,s_2) & \cdots & C(s_1,s_n)\\ C(s_2,s_1) & C(s_2,s_2) & \cdots & C(s_2,s_n)\\ \vdots & \vdots & \ddots & \vdots\\ C(s_n,s_1) & C(s_n,s_2) & \cdots & C(s_n,s_n) \end{pmatrix}$$

is non-negative definite: z⊤Σz ≥ 0 for all real-valued vectors z.

(Note the inconsistency in the terminology here: a positive definite function only requires a positive semi-definite matrix.)
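As a quick numerical illustration of this definition, one can build Σ from a covariance function and check non-negative definiteness via its eigenvalues. A minimal sketch, using the exponential covariance function (a standard positive definite choice; the locations and parameters here are arbitrary):

```python
import numpy as np

def C(s, t, sigma2=1.0, rho=0.2):
    """Exponential covariance function on R (a known positive definite choice)."""
    return sigma2 * np.exp(-np.abs(s - t) / rho)

s = np.linspace(0, 1, 50)            # locations s_1, ..., s_n
Sigma = C(s[:, None], s[None, :])    # Sigma_ij = C(s_i, s_j)

# Non-negative definiteness: all eigenvalues >= 0 (up to round-off)
eigs = np.linalg.eigvalsh(Sigma)
print(eigs.min() >= -1e-10)          # True
```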
Stationary random fields
A common simplifying assumption is that the random field is stationary.

Strict stationarity: A random field X(s) is called strictly stationary if for any vector h and every collection s1, . . . , sn of locations in D

F(y1, . . . , yn; s1 + h, . . . , sn + h) = F(y1, . . . , yn; s1, . . . , sn).

Weak stationarity: A random field X(s) is called weakly stationary if for any vector h and any locations s, t ∈ D

µ(s + h) = µ(s), and C(s + h, t + h) = C(s, t) = C(s − t).

There is no distinction between the two concepts in the Gaussian case, and one then simply writes that the field is stationary.
Isotropic fields
An important subclass of the weakly stationary fields is the isotropic fields. These have covariance functions that depend only on the distance, not the direction, between points, i.e. C(s1, s2) = C(‖s1 − s2‖).

In most practical applications of Gaussian random fields, the covariance function is chosen from a parametric family of isotropic covariance functions such as:
• Exponential covariance function
• Gaussian covariance function
• Matérn covariance function
The standard choice: Gaussian Matérn fields
The Matérn covariance function:

$$C(h) = \frac{2^{1-\nu}\,\phi^2}{(4\pi)^{d/2}\,\Gamma(\nu + d/2)\,\kappa^{2\nu}}\,(\kappa\|h\|)^{\nu} K_\nu(\kappa\|h\|), \qquad h \in \mathbb{R}^d,\ \nu > 0.$$

Here ν is a shape parameter for the covariance function, κ a spatial scale parameter, φ² a variance parameter, Γ the gamma function, and Kν a modified Bessel function of the second kind.
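A direct transcription of this parametrisation (a sketch; the parameter values below are arbitrary) can be checked against the known special case ν = 1/2, d = 1, where the Matérn covariance reduces to the exponential covariance φ²/(2κ) · exp(−κ‖h‖):

```python
import numpy as np
from scipy.special import gamma, kv  # kv = modified Bessel function of the second kind

def matern_cov(h, nu=1.0, kappa=1.0, phi2=1.0, d=2):
    """Matérn covariance C(h) for distances h >= 0, in the parametrisation above."""
    h = np.atleast_1d(np.asarray(h, dtype=float))
    norm = (4.0 * np.pi) ** (d / 2.0) * gamma(nu + d / 2.0) * kappa ** (2.0 * nu)
    out = np.empty_like(h)
    zero = h == 0.0
    out[zero] = phi2 * gamma(nu) / norm                    # limit as h -> 0 (the variance)
    hk = kappa * h[~zero]
    out[~zero] = 2.0 ** (1.0 - nu) * phi2 / norm * hk ** nu * kv(nu, hk)
    return out

# Special case nu = 1/2, d = 1: exponential covariance phi2/(2*kappa) * exp(-kappa*h)
h = np.array([0.5, 1.0])
print(np.allclose(matern_cov(h, nu=0.5, kappa=2.0, d=1),
                  1.0 / (2 * 2.0) * np.exp(-2.0 * h)))     # True
```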
Spectral representations
• An alternative to the covariance-based representation of Gaussian fields is to do the specification in the frequency domain.
• By Bochner's theorem, a function C is a valid covariance function if and only if it can be written as

C(h) = ∫ exp(i h⊤k) dΛ(k)   (1)

for some non-negative and symmetric measure Λ.
• Equation (1) is called the spectral representation of the covariance function, and if the measure Λ has a Lebesgue density S, this density is called the spectral density.
• For example, the spectral density associated with the Matérn covariance function is

$$S(k) = \frac{\phi^2}{(2\pi)^d}\,\frac{1}{(\kappa^2 + \|k\|^2)^{\nu + d/2}}.$$
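As a numerical sanity check of the spectral representation (here in d = 1, with arbitrary parameter values), integrating this spectral density over all frequencies should recover the variance C(0) = φ²Γ(ν)/((4π)^{d/2} Γ(ν + d/2) κ^{2ν}) of the Matérn field:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

nu, kappa, phi2, d = 1.0, 2.0, 1.5, 1        # arbitrary parameter values, d = 1

S = lambda k: phi2 / (2 * np.pi) ** d / (kappa ** 2 + k ** 2) ** (nu + d / 2)
integral, _ = quad(S, -np.inf, np.inf)       # integral of S over R

C0 = phi2 * gamma(nu) / ((4 * np.pi) ** (d / 2) * gamma(nu + d / 2) * kappa ** (2 * nu))
print(np.isclose(integral, C0))              # True: C(0) equals the integral of S
```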
Variograms
• Another popular representation, first proposed by Matheron (1971), is the (semi)variogram γ(h), which for a stationary process is defined as

γ(h) = (1/2) V(X(s + h) − X(s)).

• One popular estimation method in geostatistics is to use so-called empirical variograms.
• These can be useful for non-differentiable random fields but can be misleading for differentiable processes.
• We will not use variograms at all.
Geostatistics and kriging
One of the most important problems in geostatistics is spatial reconstruction of a random field X(s) given a finite number of observations Y = (Y1, . . . , Yn)⊤ of the latent field at locations s1, . . . , sn, taken under measurement noise.
The most popular method for spatial reconstruction in geostatisticswas developed by Georges Matheron.
Depending on the assumptions on the mean value function µ(s) for the latent field, linear kriging is usually divided into three cases:
• Simple kriging: µ(s) is known
• Ordinary kriging: µ(s) = µ, where µ is unknown
• Universal kriging: µ(s) = ∑_{k=1}^m βk bk(s), where the bk are known basis functions and the parameters βk are unknown

The kriging estimator of X(s) at some location s0 is derived as the minimum mean squared error linear predictor.
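For the simple kriging case with known zero mean, the minimum mean squared error linear predictor has the closed form X̂(s0) = c0⊤ΣY⁻¹Y, where c0 = Cov(X(s0), Y). A minimal numerical sketch (the exponential covariance and the noise variance are arbitrary illustrative choices):

```python
import numpy as np

def C(s, t, sigma2=1.0, rho=0.3):
    """Exponential covariance (an arbitrary choice for illustration)."""
    return sigma2 * np.exp(-np.abs(s - t) / rho)

def simple_kriging(s_obs, y, s0, sigma_e2=0.01):
    """Simple kriging (known zero mean): minimum mean squared error linear predictor."""
    Sigma_Y = C(s_obs[:, None], s_obs[None, :]) + sigma_e2 * np.eye(len(s_obs))
    c0 = C(s0, s_obs)                       # Cov(X(s0), Y)
    w = np.linalg.solve(Sigma_Y, c0)        # kriging weights
    return w @ y, C(s0, s0) - c0 @ w        # predictor and kriging variance

rng = np.random.default_rng(0)
s_obs = np.sort(rng.uniform(0, 1, 20))
y = np.sin(2 * np.pi * s_obs) + 0.1 * rng.standard_normal(20)
pred, var = simple_kriging(s_obs, y, 0.5)
print(var >= 0)                             # the kriging variance is non-negative
```

With a vanishing noise variance, predicting at an observed location essentially interpolates the data, as expected.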
Hierarchical models
There is a close connection between kriging and estimation in the hierarchical models that we will use.
A hierarchical model is constructed as a hierarchy of conditionalprobability models that, when multiplied together, yield the jointdistribution for all quantities in the model.
Typically, we have a three-stage statistical model for data ymodelled using a latent field x with hyperparameters θ, structuredin a hierarchical way
π(y,x,θ) = π(y|x,θ)π(x|θ)π(θ)
The data y | x, θ

We have been given some data y.
• Normally distributed?
• Count data?
• Binary data?
• Point pattern?
• How was it collected? (Distance sampling? Capture/recapture? Exhaustive sample? Preferential sampling?)

We place all of this information into our likelihood π(y|x,θ). A typical situation is when the latent field is measured under additive noise,

Yi = X(si) + εi.

A common assumption is that ε1, . . . , εn are independent and identically distributed with some variance σ², uncorrelated with the latent process.
The latent field x|θ
In our models, we will assume that the data depend on some unobserved latent components x:
• Covariates
• Unstructured random effects ("white noise")
• Structured random effects (temporal dependency, spatial dependency, smoothness terms)
The dependence between the data and the latent field can be linearor non-linear, but as these are not directly observed, the modellingassumptions need to be more restrictive.
The process model can in itself be written as a hierarchical model,specified by a number of conditional sub-models.
The hyperparameters θ
Both our likelihood and our latent field can depend on some hyperparameters θ:
• Variance of the observation noise
• Probability of a zero (zero-inflated models)
• Variance of the unstructured random field
• Range of a structured random effect (effective correlation distance)
• Autocorrelation parameter
For a Bayesian model, we specify these using a joint prior π(θ)
Frequentists assume that the parameters are fixed but unknown.The model is then sometimes referred to as an empirical-Bayesianmodel, or empirical hierarchical model.
Inference
Inference in hierarchical models is performed using the posteriordistribution
π(X,θ|Y) ∝ π(Y|X,θ)π(X|θ)π(θ).
Kriging predictions are calculated from the marginal posteriordistribution
π(X|Y) ∝ ∫ π(X|Y,θ) π(θ|Y) dθ,
and one typically reports the posterior mean E(X|Y) as a pointestimator and the posterior variance V(X|Y) as a measure of theuncertainty in the predictor.
The posterior distributions for X and θ generally have to be estimated using Markov chain Monte Carlo (MCMC) methods.
Inference II
In an empirical hierarchical model, inference is instead performed using the conditional posterior π(X|Y, θ̂), where θ̂ is an estimate of θ obtained using, for example, maximum likelihood estimation, or maximum a posteriori estimation in the Bayesian setting.
The parameter model π(θ) can often be chosen so that theposterior mean and variance of X agree with the classical krigingpredictions.
Even if this is not done, we will refer to the conditional mean of theposterior distribution as the kriging predictor.
Latent Gaussian Models
We call a Bayesian hierarchical model where π(x|θ) is a Gaussian distribution a latent Gaussian model (LGM):

θ ∼ π(θ)
x | θ ∼ π(x | θ) = N(0, Σ(θ))
y | x, θ ∼ ∏i π(yi | ηi, θ)

Note that we also assume that the observations are independent given the latent process.

This is a huge model class that is used in many seemingly unrelated areas, and it is especially useful if we let x be a GMRF.
GRFs — Latent Gaussian models David Bolin
Bayesian linear models
Consider the linear model yi = µ + β1c1i + β2c2i + ui + εi.
• yi is an observation
• µ is the intercept
• c1 and c2 are covariates (fixed effects) and β1 and β2 are the corresponding weights
• εi is i.i.d. normal observation noise
• u is a random effect

To make a Bayesian model, we need to choose some priors. Classical choices:
• β = (µ, β1, β2)⊤ ∼ N(0, σ²fix I), where σfix is a large number
• u ∼ N(0, Σu), where the covariance matrix Σu is known
• ε ∼ N(0, σ²n I)
LGMs — Examples David Bolin
Bayesian structured additive regression models
GLM/GAM/GLMM/GAMM/+++
• Perhaps the most important class of statistical models
• n-dimensional observation vector y, distributed according to an exponential family
• mean µi = E(yi) linked to a linear predictor

ηi = g(µi) = α + zi⊤β + ∑γ fγ(cγ,i) + ui,   i = 1, . . . , n

where
α: intercept
β: linear effects of covariates z
fγ(·): non-linear/smooth effects of covariates cγ
u: unstructured error terms
Bayesian structured additive regression models cont.
Flexibility due to the many different forms of the unknown functions fγ(·):
• relax the linear relationship of covariates
• include random effects
• temporally and/or spatially indexed covariates

Special cases:
• Generalized linear models (GLM): g(µ) = α + ∑_{j=1}^m βj zj
• Generalized additive models (GAM): g(µ) = α + ∑_{j=1}^m fj(zj)

A latent Gaussian model is obtained by assigning Gaussian priors to all random terms α, β, fγ(·), u in the linear predictor.
Example: Disease mapping
• Data yi ∼ Poisson(Ei exp(ηi))
• Log-relative risk ηi = µ + ui + vi + f(ci)
• Smooth effect of a covariate c
• Structured component u
• Unstructured component v

[Figure: disease map of the estimated log-relative risk; colour scale from −0.63 to 0.98]
Example: Spatial geological count data
• Spatio-temporal observations yit ∼ nBin(Eit exp(ηit))
• Log-relative risk of bycatch ηit = uit + ∑j f(cjit)
• Smooth effects of covariates cj
• Spatio-temporal random field u
Example: Stochastic volatility
• Log daily differences of the pound-dollar exchange rates
• Data yt | ηt ∼ N(0, exp(ηt))
• Volatility ηt = µ + ut
• Unknown mean µ
• AR process u

[Figure: time series of the data, t = 0, . . . , 1000]
A more surprising example: Point processes
Spatial point process models:
• They focus on the random locations at which events happen.
• They make excellent models for 'presence only' data when coupled with an appropriate observation process.
• Realistic models can be quite complicated.
Log-Gaussian Cox processes
The homogeneous Poisson process is often too restrictive. Generalizations include:
• inhomogeneous Poisson process: inhomogeneous intensity
• Markov point process: local interactions among individuals
• Cox process: random intensity

We focus on the Cox process, where the random intensity depends on a Gaussian random field Z(s):

Λ(s) = exp(Z(s))

If Y denotes the set of observed locations, the log-likelihood is

log π(Y | η) = |Ω| − ∫Ω Λ(s) ds + ∑_{si ∈ Y} log Λ(si).
This is very different from the previous examples!
Or is it?
NB: The number of points in a region R is Poisson distributed with mean ∫R Λ(s) ds.

• Divide the 'observation window' into rectangles.
• Let yi be the number of points in rectangle i. Then

yi | xi, θ ∼ Po(exp(xi)),

• and the log-risk surface is replaced with

x | θ ∼ N(µ(θ), Σ(θ)).
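A sketch of this counting construction. Here a made-up deterministic log-intensity stands in for the Gaussian field, and the point pattern is simulated by thinning a dominating homogeneous Poisson process:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical intensity Lambda(s) = exp(2 + sin(2*pi*s_x)) on the unit square,
# simulated by thinning a dominating homogeneous Poisson process.
lam_max = np.exp(3.0)
N = rng.poisson(lam_max)                      # number of dominating points
pts = rng.uniform(0, 1, size=(N, 2))
lam = np.exp(2 + np.sin(2 * np.pi * pts[:, 0]))
pts = pts[rng.uniform(0, 1, N) < lam / lam_max]

# Divide the observation window into an m x m grid and count points per cell
m = 10
counts, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=m, range=[[0, 1], [0, 1]])
print(counts.sum() == len(pts))               # every point falls in exactly one cell
```

The resulting `counts` play the role of the Poisson observations yi above.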
[Figure: point pattern of Andersonia heterophylla on a 55 × 55 grid, with estimated spatial effect and interaction; from slides by Sigrunn Holbek Sørbye, University of Tromsø, "Spatial point patterns - simple case studies"]
Back to the linear model
Observation: (y,u,β) are jointly Gaussian!
$$\pi(\mathbf{y}\mid\mathbf{u},\boldsymbol\beta) \propto \exp\left(-\frac{\tau_n}{2}(\mathbf{y}-\mathbf{u}-\mathbf{X}^\top\boldsymbol\beta)^\top(\mathbf{y}-\mathbf{u}-\mathbf{X}^\top\boldsymbol\beta)\right) = \exp\left(-\frac{\tau_n}{2}\begin{pmatrix}\mathbf{y}^\top & \mathbf{u}^\top & \boldsymbol\beta^\top\end{pmatrix}\begin{pmatrix}\mathbf{I} & -\mathbf{I} & -\mathbf{X}^\top\\ -\mathbf{I} & \mathbf{I} & \mathbf{X}^\top\\ -\mathbf{X} & \mathbf{X} & \mathbf{X}\mathbf{X}^\top\end{pmatrix}\begin{pmatrix}\mathbf{y}\\ \mathbf{u}\\ \boldsymbol\beta\end{pmatrix}\right)$$

It follows that

$$\pi(\mathbf{y},\mathbf{u},\boldsymbol\beta) = \pi(\mathbf{y}\mid\mathbf{u},\boldsymbol\beta)\,\pi(\mathbf{u})\,\pi(\boldsymbol\beta) \propto \exp\left(-\frac{\tau_n}{2}\begin{pmatrix}\mathbf{y}^\top & \mathbf{u}^\top & \boldsymbol\beta^\top\end{pmatrix}\begin{pmatrix}\mathbf{I} & -\mathbf{I} & -\mathbf{X}^\top\\ -\mathbf{I} & \mathbf{I}+\tau_n^{-1}\mathbf{Q}_u & \mathbf{X}^\top\\ -\mathbf{X} & \mathbf{X} & \mathbf{X}\mathbf{X}^\top+\frac{\tau_{\text{fix}}}{\tau_n}\mathbf{I}\end{pmatrix}\begin{pmatrix}\mathbf{y}\\ \mathbf{u}\\ \boldsymbol\beta\end{pmatrix}\right)$$

where τn = σn⁻², τfix = σfix⁻², and Qu = Σu⁻¹.
Estimation
Let x = (y, u, β). To estimate the parameters in the model, using MCMC or ML, we have to evaluate the log-likelihood

−(1/2) log|Σ(θ)| − (1/2) x⊤Σ(θ)⁻¹x

(up to an additive constant).
We can easily calculate marginal and conditional distributions, for example to do kriging. Recall that if

$$\begin{pmatrix}\mathbf{x}_A\\ \mathbf{x}_B\end{pmatrix} \sim N\left(\begin{pmatrix}\boldsymbol\mu_A\\ \boldsymbol\mu_B\end{pmatrix},\ \begin{pmatrix}\boldsymbol\Sigma_{AA} & \boldsymbol\Sigma_{AB}\\ \boldsymbol\Sigma_{BA} & \boldsymbol\Sigma_{BB}\end{pmatrix}\right),$$

then the conditional distribution is given by

$$\mathbf{x}_A \mid \mathbf{x}_B \sim N\left(\boldsymbol\mu_A + \boldsymbol\Sigma_{AB}\boldsymbol\Sigma_{BB}^{-1}(\mathbf{x}_B - \boldsymbol\mu_B),\ \boldsymbol\Sigma_{AA} - \boldsymbol\Sigma_{AB}\boldsymbol\Sigma_{BB}^{-1}\boldsymbol\Sigma_{BA}\right).$$
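These conditioning formulas translate directly into code. A minimal sketch, verified on a bivariate example where the answer is known in closed form:

```python
import numpy as np

def conditional_gaussian(mu, Sigma, A, B, xB):
    """Mean and covariance of x_A | x_B = xB for x ~ N(mu, Sigma).

    A and B are index arrays selecting the two blocks."""
    W = Sigma[np.ix_(A, B)] @ np.linalg.inv(Sigma[np.ix_(B, B)])
    mu_c = mu[A] + W @ (xB - mu[B])
    Sigma_c = Sigma[np.ix_(A, A)] - W @ Sigma[np.ix_(B, A)]
    return mu_c, Sigma_c

# Bivariate check: unit variances, correlation 0.5, condition on x_B = 2
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
mean_c, var_c = conditional_gaussian(mu, Sigma, [0], [1], np.array([2.0]))
print(mean_c, var_c)   # mean 0.5 * 2 = 1.0, variance 1 - 0.5^2 = 0.75
```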
Can we calculate these things in practice?
Evaluating the likelihood and the kriging predictor both require computing Σ(θ)⁻¹x. Evaluating the likelihood also requires calculating |Σ(θ)|.
• Computations scale as O(N³).
• Storage scales as O(N²): 2500 points for 20 years requires ∼20 GB.

Thus, even this very simple model is not feasible for large problems.

For more complicated models that require MCMC, N does not have to be particularly large for computations to become a major problem.

We need to decrease the computational burden!
“Low rank methods”
A popular approach to decreasing the computational cost is to approximate X(s) using a basis expansion

X(s) = ∑_{j=1}^m wj φj(s),   (2)

where the wj are Gaussian random variables and {φj}_{j=1}^m are some pre-defined basis functions. This allows us to write Σ(θ) = BΣwB⊤, which basically gives us O(m³) cost instead of O(N³) cost when we choose m ≪ N.

There are many ways to obtain these "low rank" approximations:
• Karhunen-Loève transforms
• Empirical orthogonal functions
• Process convolutions
• Fixed-rank kriging or predictive processes
• +++
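A sketch of such a basis expansion, with made-up Gaussian-bump basis functions and an identity weight covariance; the point is that the implied N × N covariance BΣwB⊤ has rank at most the number of basis functions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, m = 400, 10                                  # many locations, few basis functions

s = np.linspace(0, 1, N)
centers = np.linspace(0, 1, m)
B = np.exp(-0.5 * ((s[:, None] - centers[None, :]) / 0.1) ** 2)   # N x m basis matrix
Sigma_w = np.eye(m)                             # covariance of the weights w

x = B @ rng.multivariate_normal(np.zeros(m), Sigma_w)   # one realisation of X(s)
Sigma = B @ Sigma_w @ B.T                       # implied N x N covariance
print(np.linalg.matrix_rank(Sigma) <= m)        # True: rank at most m
```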
Key Lesson: Sparse matrices
Definition (Sparse matrix): A matrix Q is called sparse if most of its elements are zero.

• There exist very efficient numerical algorithms for dealing with sparse matrices.
• Computations scale as O(N^{3/2}).
• Storage scales as O(N): 2500 points for 20 years requires ∼400 KB.
Two possible options:
1. Force Σ to be sparse. This forces independence between variables.
2. Force the precision matrix Q = Σ⁻¹ to be sparse. What does this correspond to?
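Option 2 is what GMRFs exploit. As a taste of the payoff, here is a sketch using scipy.sparse with a tridiagonal precision matrix of the kind derived on the next slides (the size and entries are arbitrary); storage and solves stay essentially linear in n:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

n, phi = 100_000, 0.7
main = np.full(n, 1 + phi ** 2)
main[0] = main[-1] = 1.0
off = np.full(n - 1, -phi)
Q = sparse.diags([off, main, off], [-1, 0, 1], format="csc")  # tridiagonal precision

b = np.ones(n)
x = spsolve(Q, b)                    # solving Qx = b is fast and memory-cheap
print(Q.nnz)                         # 3n - 2 stored entries instead of n^2
```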
Example: AR(1) process
The simplest example of a GMRF is the AR(1) process:
xt = φxt−1 + εt, t = 1, 2, . . . , εt ∼ N (0, 1)
where t represents time and the distribution of x0 is chosen as the stationary distribution of the process: x0 ∼ N(0, 1/(1 − φ²)).

The joint density for x = (x0, x1, . . . , xn−1) is

π(x) = π(x0) π(x1|x0) · · · π(xn−1|xn−2) = (2π)^{−n/2} |Q|^{1/2} exp(−(1/2) x⊤Qx).

The matrix Q = Σ⁻¹ is called the precision matrix.

The covariance matrix for x is dense, since all time points are dependent: Σij = φ^{|i−j|}/(1 − φ²).
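One can verify numerically that inverting this dense covariance matrix yields a sparse precision matrix (small n and an arbitrary φ, for illustration):

```python
import numpy as np

phi, n = 0.7, 6
idx = np.arange(n)
Sigma = phi ** np.abs(idx[:, None] - idx[None, :]) / (1 - phi ** 2)  # dense AR(1) covariance

Q = np.linalg.inv(Sigma)             # the precision matrix
Q[np.abs(Q) < 1e-10] = 0.0           # zero out round-off noise
print(Q[0, 0], Q[1, 1], Q[0, 1])     # approximately 1, 1 + phi^2 and -phi
print(np.count_nonzero(Q))           # 3n - 2 non-zeros: Q is tridiagonal
```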
GMRFs — AR David Bolin
Conditional independence
However, the precision matrix Q = Σ⁻¹ is sparse:

$$\mathbf{Q} = \begin{pmatrix} 1 & -\phi & & & \\ -\phi & 1+\phi^2 & -\phi & & \\ & \ddots & \ddots & \ddots & \\ & & -\phi & 1+\phi^2 & -\phi \\ & & & -\phi & 1 \end{pmatrix}$$

What is the key property of this example that causes Q to be sparse?
• The key lies in the full conditionals:

xt | x−t ∼ N( φ/(1 + φ²) (xt−1 + xt+1), 1/(1 + φ²) )

• Each time point is only conditionally dependent on the two closest time points, which is the reason for the tridiagonal structure of Q.
Main features of GMRFs
• Analytically tractable
• Modelling using conditional independence
• Merging GMRFs using conditioning (hierarchical models)
• Unified framework for
  • understanding
  • representation
  • computation using numerical methods for sparse matrices
• Fits nicely into the MCMC world
  • Can construct faster and more reliable block-MCMC algorithms
• Approximate Bayesian inference
• Approximate GRFs through SPDE representations
Usage of GMRFs (I)
Structural time-series analysis
• Autoregressive models
• Gaussian state-space models
• Computational algorithms based on the Kalman filter and its variants

Analysis of longitudinal and survival data, using:
• temporal GMRF priors
• state-space approaches
• spatial GMRF priors
Usage of GMRFs (II)
Graphical models
• A key model
• Estimate Q and its (associated) graph from data
• Often used in a larger context

Semiparametric regression and splines
• Model a smooth curve in time or a surface in space
• Intrinsic GMRF models and random walk models
• Discretely observed integrated Wiener processes are GMRFs
• GMRF models for coefficients in B-splines
Usage of GMRFs (III)
Image analysis
• Image restoration using the Wiener filter
• Texture modelling and texture discrimination
• Segmentation and object identification
• Deformable templates
• 3D reconstruction
• Restoring ultrasound images

Spatial statistics
• Latent GMRF model analysis of spatial binary data
• Geostatistics using GMRFs
• Analysis of data in social sciences and spatial econometrics
• Spatial and space-time epidemiology
• Environmental statistics
• Inverse problems