Lecture 1: Introduction
Gaussian Markov random fields
David Bolin, Chalmers University of Technology
January 19, 2015
Practical information
Literature: The course will mostly be based on the book Gaussian Markov Random Fields: Theory and Applications by Håvard Rue and Leonhard Held.
Additional articles will be used later on.
Homepage:http://www.math.chalmers.se/~bodavid/GMRF2015/
Schedule: We will meet twice each week, Mondays and Tuesdays (10-12).
Lectures will be in MVL:14
There will be 10 lectures and 4 computer labs
Practical David Bolin
Examination

There will be two components in the examination:
• Project assignments introduced in the computer labs
• An oral exam at the end of the course

The projects can be done individually or in pairs.
The final oral exam is individual.
The grading scale comprises Fail (U) and Pass (G).
Successful completion of the course is rewarded with 7.5 hp.
Three relevant questions
Why take this course when we have had one course on Markov random fields and one course on Gaussian random fields this year?

Why is it a good idea to learn about Gaussian Markov random fields?

What is there to learn? Isn't it all just Gaussian?
Outline of lectures
Lecture 1: Introduction
Lecture 2: Definitions and basic properties of GMRFs
Lecture 3: Simulation and conditioning
Lecture 4: Numerical methods for sparse matrices
Lecture 5: Intrinsic GMRFs
Lecture 6: MCMC estimation for hierarchical models
Lecture 7: Approximation techniques and INLA
Lecture 8: Stochastic PDEs and FEM
Lecture 9: SPDEs part 2
Lecture 10: Extensions and applications
Random fields
A random field (or stochastic field), X(s, ω), s ∈ D, ω ∈ Ω, is a random function specified by its finite-dimensional joint distributions

F(y1, . . . , yn; s1, . . . , sn) = P(X(s1) ≤ y1, . . . , X(sn) ≤ yn)

for every finite n and every collection s1, . . . , sn of locations in D.

• The set D is usually a subset of Rd.
• At every location s ∈ D, X(s, ω) is a random variable, where the event ω lies in some abstract sample space Ω.
• Kolmogorov's existence theorem can be used to ensure that the random field has a valid mathematical specification.
• To simplify the notation, one often writes X(s), suppressing the dependence on ω.
GRFs David Bolin
Gaussian random fields
An important special case is when the random field is Gaussian.

A Gaussian random field X(s) is defined by a mean function µ(s) = E(X(s)) and a covariance function C(s, t) = Cov(X(s), X(t)). It has the property that, for every finite collection of points s1, . . . , sp,

x ≡ (X(s1), . . . , X(sp))⊤ ∼ N(µ, Σ),

where Σij = C(si, sj).

For existence of a Gaussian field with a prescribed mean and covariance, it is enough to ensure that C is positive definite.
Positive definite functions
A function C(s, t) is positive definite if, for any finite set of locations s1, . . . , sn in D, the covariance matrix

$$\boldsymbol\Sigma = \begin{pmatrix} C(s_1,s_1) & C(s_1,s_2) & \cdots & C(s_1,s_n)\\ C(s_2,s_1) & C(s_2,s_2) & \cdots & C(s_2,s_n)\\ \vdots & \vdots & \ddots & \vdots\\ C(s_n,s_1) & C(s_n,s_2) & \cdots & C(s_n,s_n) \end{pmatrix}$$

is non-negative definite: z⊤Σz ≥ 0 for all real-valued vectors z.

(Note the inconsistency in the terminology here: a positive definite function only requires a positive semi-definite matrix.)
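As a quick numerical illustration of this definition, one can build Σ from a covariance function and check non-negative definiteness via its eigenvalues. A minimal sketch, using the exponential covariance function (a standard positive definite choice; the locations and parameters here are arbitrary):

```python
import numpy as np

def C(s, t, sigma2=1.0, rho=0.2):
    """Exponential covariance function on R (a known positive definite choice)."""
    return sigma2 * np.exp(-np.abs(s - t) / rho)

s = np.linspace(0, 1, 50)            # locations s_1, ..., s_n
Sigma = C(s[:, None], s[None, :])    # Sigma_ij = C(s_i, s_j)

# Non-negative definiteness: all eigenvalues >= 0 (up to round-off)
eigs = np.linalg.eigvalsh(Sigma)
print(eigs.min() >= -1e-10)          # True
```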
Stationary random fields
A common simplifying assumption is that the random field is stationary.

Strict stationarity: A random field X(s) is called strictly stationary if for any vector h and every collection s1, . . . , sn of locations in D

F(y1, . . . , yn; s1 + h, . . . , sn + h) = F(y1, . . . , yn; s1, . . . , sn).

Weak stationarity: A random field X(s) is called weakly stationary if for any vector h and any locations s, t ∈ D

µ(s + h) = µ(s), and C(s + h, t + h) = C(s, t) = C(s − t).

There is no distinction between the two concepts in the Gaussian case, and one then simply writes that the field is stationary.
Isotropic fields
An important subclass of the weakly stationary fields is the isotropic fields. These have covariance functions that depend only on the distance, not the direction, between points, i.e. C(s1, s2) = C(‖s1 − s2‖).

In most practical applications of Gaussian random fields, the covariance function is chosen from a parametric family of isotropic covariance functions such as:
• Exponential covariance function
• Gaussian covariance function
• Matérn covariance function
The standard choice: Gaussian Matérn fields
The Matérn covariance function:

$$C(h) = \frac{2^{1-\nu}\,\phi^2}{(4\pi)^{d/2}\,\Gamma(\nu + d/2)\,\kappa^{2\nu}}\,(\kappa\|h\|)^{\nu} K_\nu(\kappa\|h\|), \qquad h \in \mathbb{R}^d,\ \nu > 0.$$

Here ν is a shape parameter for the covariance function, κ a spatial scale parameter, φ² a variance parameter, Γ the gamma function, and Kν a modified Bessel function of the second kind.
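A direct transcription of this parametrisation (a sketch; the parameter values below are arbitrary) can be checked against the known special case ν = 1/2, d = 1, where the Matérn covariance reduces to the exponential covariance φ²/(2κ) · exp(−κ‖h‖):

```python
import numpy as np
from scipy.special import gamma, kv  # kv = modified Bessel function of the second kind

def matern_cov(h, nu=1.0, kappa=1.0, phi2=1.0, d=2):
    """Matérn covariance C(h) for distances h >= 0, in the parametrisation above."""
    h = np.atleast_1d(np.asarray(h, dtype=float))
    norm = (4.0 * np.pi) ** (d / 2.0) * gamma(nu + d / 2.0) * kappa ** (2.0 * nu)
    out = np.empty_like(h)
    zero = h == 0.0
    out[zero] = phi2 * gamma(nu) / norm                    # limit as h -> 0 (the variance)
    hk = kappa * h[~zero]
    out[~zero] = 2.0 ** (1.0 - nu) * phi2 / norm * hk ** nu * kv(nu, hk)
    return out

# Special case nu = 1/2, d = 1: exponential covariance phi2/(2*kappa) * exp(-kappa*h)
h = np.array([0.5, 1.0])
print(np.allclose(matern_cov(h, nu=0.5, kappa=2.0, d=1),
                  1.0 / (2 * 2.0) * np.exp(-2.0 * h)))     # True
```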
Spectral representations
• An alternative to the covariance-based representation of Gaussian fields is to do the specification in the frequency domain.
• By Bochner's theorem, a function C is a valid covariance function if and only if it can be written as

C(h) = ∫ exp(i h⊤k) dΛ(k)   (1)

for some non-negative and symmetric measure Λ.
• Equation (1) is called the spectral representation of the covariance function, and if the measure Λ has a Lebesgue density S, this density is called the spectral density.
• For example, the spectral density associated with the Matérn covariance function is

$$S(k) = \frac{\phi^2}{(2\pi)^d}\,\frac{1}{(\kappa^2 + \|k\|^2)^{\nu + d/2}}.$$
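As a numerical sanity check of the spectral representation (here in d = 1, with arbitrary parameter values), integrating this spectral density over all frequencies should recover the variance C(0) = φ²Γ(ν)/((4π)^{d/2} Γ(ν + d/2) κ^{2ν}) of the Matérn field:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

nu, kappa, phi2, d = 1.0, 2.0, 1.5, 1        # arbitrary parameter values, d = 1

S = lambda k: phi2 / (2 * np.pi) ** d / (kappa ** 2 + k ** 2) ** (nu + d / 2)
integral, _ = quad(S, -np.inf, np.inf)       # integral of S over R

C0 = phi2 * gamma(nu) / ((4 * np.pi) ** (d / 2) * gamma(nu + d / 2) * kappa ** (2 * nu))
print(np.isclose(integral, C0))              # True: C(0) equals the integral of S
```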
Variograms
• Another popular representation, first proposed by Matheron (1971), is the (semi)variogram γ(h), which for a stationary process is defined as

γ(h) = (1/2) V(X(s + h) − X(s)).

• One popular estimation method in geostatistics is to use so-called empirical variograms.
• These can be useful for non-differentiable random fields but can be misleading for differentiable processes.
• We will not use variograms at all.
Geostatistics and kriging
One of the most important problems in geostatistics is spatial reconstruction of a random field X(s) given a finite number of observations Y = (Y1, . . . , Yn)⊤ of the latent field at locations s1, . . . , sn, taken under measurement noise.
The most popular method for spatial reconstruction in geostatisticswas developed by Georges Matheron.
Depending on the assumptions on the mean value function µ(s) for the latent field, linear kriging is usually divided into three cases:
• Simple kriging: µ(s) is known
• Ordinary kriging: µ(s) = µ, where µ is unknown
• Universal kriging: µ(s) = ∑_{k=1}^m βk bk(s), where the bk are known basis functions and the parameters βk are unknown

The kriging estimator of X(s) at some location s0 is derived as the minimum mean squared error linear predictor.
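For the simple kriging case with known zero mean, the minimum mean squared error linear predictor has the closed form X̂(s0) = c0⊤ΣY⁻¹Y, where c0 = Cov(X(s0), Y). A minimal numerical sketch (the exponential covariance and the noise variance are arbitrary illustrative choices):

```python
import numpy as np

def C(s, t, sigma2=1.0, rho=0.3):
    """Exponential covariance (an arbitrary choice for illustration)."""
    return sigma2 * np.exp(-np.abs(s - t) / rho)

def simple_kriging(s_obs, y, s0, sigma_e2=0.01):
    """Simple kriging (known zero mean): minimum mean squared error linear predictor."""
    Sigma_Y = C(s_obs[:, None], s_obs[None, :]) + sigma_e2 * np.eye(len(s_obs))
    c0 = C(s0, s_obs)                       # Cov(X(s0), Y)
    w = np.linalg.solve(Sigma_Y, c0)        # kriging weights
    return w @ y, C(s0, s0) - c0 @ w        # predictor and kriging variance

rng = np.random.default_rng(0)
s_obs = np.sort(rng.uniform(0, 1, 20))
y = np.sin(2 * np.pi * s_obs) + 0.1 * rng.standard_normal(20)
pred, var = simple_kriging(s_obs, y, 0.5)
print(var >= 0)                             # the kriging variance is non-negative
```

With a vanishing noise variance, predicting at an observed location essentially interpolates the data, as expected.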
Hierarchical models
There is a close connection between kriging and estimation in the hierarchical models that we will use.
A hierarchical model is constructed as a hierarchy of conditionalprobability models that, when multiplied together, yield the jointdistribution for all quantities in the model.
Typically, we have a three-stage statistical model for data ymodelled using a latent field x with hyperparameters θ, structuredin a hierarchical way
π(y,x,θ) = π(y|x,θ)π(x|θ)π(θ)
The data y | x, θ

We have been given some data y.
• Normally distributed?
• Count data?
• Binary data?
• Point pattern?
• How was it collected? (Distance sampling? Capture/recapture? Exhaustive sample? Preferential sampling?)

We place all of this information into our likelihood π(y|x,θ). A typical situation is when the latent field is measured under additive noise,

Yi = X(si) + εi.

A common assumption is that ε1, . . . , εn are independent and identically distributed with some variance σ², uncorrelated with the latent process.
The latent field x|θ
In our models, we will assume that the data depend on some unobserved latent components x:
• Covariates
• Unstructured random effects ("white noise")
• Structured random effects (temporal dependency, spatial dependency, smoothness terms)
The dependence between the data and the latent field can be linearor non-linear, but as these are not directly observed, the modellingassumptions need to be more restrictive.
The process model can in itself be written as a hierarchical model,specified by a number of conditional sub-models.
The hyperparameters θ
Both our likelihood and our latent field can depend on some hyperparameters θ:
• Variance of the observation noise
• Probability of a zero (zero-inflated models)
• Variance of the unstructured random field
• Range of a structured random effect (effective correlation distance)
• Autocorrelation parameter
For a Bayesian model, we specify these using a joint prior π(θ)
Frequentists assume that the parameters are fixed but unknown.The model is then sometimes referred to as an empirical-Bayesianmodel, or empirical hierarchical model.
Inference
Inference in hierarchical models is performed using the posteriordistribution
π(X,θ|Y) ∝ π(Y|X,θ)π(X|θ)π(θ).
Kriging predictions are calculated from the marginal posteriordistribution
π(X|Y) ∝ ∫ π(X|Y,θ) π(θ|Y) dθ,
and one typically reports the posterior mean E(X|Y) as a pointestimator and the posterior variance V(X|Y) as a measure of theuncertainty in the predictor.
The posterior distributions for X and θ generally have to be estimated using Markov chain Monte Carlo (MCMC) methods.
Inference II
In an empirical hierarchical model, inference is instead performed using the conditional posterior π(X|Y, θ̂), where θ̂ is an estimate of θ obtained using, for example, maximum likelihood estimation, or maximum a posteriori estimation in the Bayesian setting.
The parameter model π(θ) can often be chosen so that theposterior mean and variance of X agree with the classical krigingpredictions.
Even if this is not done, we will refer to the conditional mean of theposterior distribution as the kriging predictor.
Latent Gaussian Models
We call a Bayesian hierarchical model where π(x|θ) is a Gaussian distribution a latent Gaussian model (LGM):

θ ∼ π(θ)
x | θ ∼ π(x | θ) = N(0, Σ(θ))
y | x, θ ∼ ∏i π(yi | ηi, θ)

Note that we also assume that the observations are independent given the latent process.

This is a huge model class that is used in many seemingly unrelated areas, and it is especially useful if we let x be a GMRF.
GRFs — Latent Gaussian models David Bolin
Bayesian linear models
Consider the linear model yi = µ + β1c1i + β2c2i + ui + εi.
• yi is an observation
• µ is the intercept
• c1 and c2 are covariates (fixed effects) and β1 and β2 are the corresponding weights
• εi is i.i.d. normal observation noise
• u is a random effect

To make a Bayesian model, we need to choose some priors. Classical choices:
• β = (µ, β1, β2)⊤ ∼ N(0, σ²fix I), where σfix is a large number
• u ∼ N(0, Σu), where the covariance matrix Σu is known
• ε ∼ N(0, σ²n I)
LGMs — Examples David Bolin
Bayesian structured additive regression models
GLM/GAM/GLMM/GAMM/+++
• Perhaps the most important class of statistical models
• n-dimensional observation vector y, distributed according to an exponential family
• mean µi = E(yi) linked to a linear predictor

ηi = g(µi) = α + zi⊤β + ∑γ fγ(cγ,i) + ui,   i = 1, . . . , n

where
α: intercept
β: linear effects of covariates z
fγ(·): non-linear/smooth effects of covariates cγ
u: unstructured error terms
Bayesian structured additive regression models cont.
Flexibility due to the many different forms of the unknown functions fγ(·):
• relax the linear relationship of covariates
• include random effects
• temporally and/or spatially indexed covariates

Special cases:
• Generalized linear models (GLM): g(µ) = α + ∑_{j=1}^m βj zj
• Generalized additive models (GAM): g(µ) = α + ∑_{j=1}^m fj(zj)

A latent Gaussian model is obtained by assigning Gaussian priors to all random terms α, β, fγ(·), u in the linear predictor.
Example: Disease mapping
• Data yi ∼ Poisson(Ei exp(ηi))
• Log-relative risk ηi = µ + ui + vi + f(ci)
• Smooth effect of a covariate c
• Structured component u
• Unstructured component v

[Figure: disease map of the estimated log-relative risk; colour scale from −0.63 to 0.98]
Example: Spatial geological count data
• Spatio-temporal observations yit ∼ nBin(Eit exp(ηit))
• Log-relative risk of bycatch ηit = uit + ∑j f(cjit)
• Smooth effects of covariates cj
• Spatio-temporal random field u
Example: Stochastic volatility
• Log daily differences of the pound-dollar exchange rates
• Data yt | ηt ∼ N(0, exp(ηt))
• Volatility ηt = µ + ut
• Unknown mean µ
• AR process u

[Figure: time series of the data, t = 0, . . . , 1000]
A more surprising example: Point processes
Spatial point process models:
• They focus on the random locations at which events happen.
• They make excellent models for 'presence only' data when coupled with an appropriate observation process.
• Realistic models can be quite complicated.
Log-Gaussian Cox processes
The homogeneous Poisson process is often too restrictive. Generalizations include:
• inhomogeneous Poisson process: inhomogeneous intensity
• Markov point process: local interactions among individuals
• Cox process: random intensity

We focus on the Cox process, where the random intensity depends on a Gaussian random field Z(s):

Λ(s) = exp(Z(s))

If Y denotes the set of observed locations, the log-likelihood is

log π(Y | η) = |Ω| − ∫Ω Λ(s) ds + ∑_{si ∈ Y} log Λ(si).
This is very different from the previous examples!
Or is it?
NB: The number of points in a region R is Poisson distributed with mean ∫R Λ(s) ds.

• Divide the 'observation window' into rectangles.
• Let yi be the number of points in rectangle i. Then

yi | xi, θ ∼ Po(exp(xi)),

• and the log-risk surface is replaced with

x | θ ∼ N(µ(θ), Σ(θ)).
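A sketch of this counting construction. Here a made-up deterministic log-intensity stands in for the Gaussian field, and the point pattern is simulated by thinning a dominating homogeneous Poisson process:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical intensity Lambda(s) = exp(2 + sin(2*pi*s_x)) on the unit square,
# simulated by thinning a dominating homogeneous Poisson process.
lam_max = np.exp(3.0)
N = rng.poisson(lam_max)                      # number of dominating points
pts = rng.uniform(0, 1, size=(N, 2))
lam = np.exp(2 + np.sin(2 * np.pi * pts[:, 0]))
pts = pts[rng.uniform(0, 1, N) < lam / lam_max]

# Divide the observation window into an m x m grid and count points per cell
m = 10
counts, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=m, range=[[0, 1], [0, 1]])
print(counts.sum() == len(pts))               # every point falls in exactly one cell
```

The resulting `counts` play the role of the Poisson observations yi above.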
[Figure: point pattern of Andersonia heterophylla on a 55 × 55 grid, with estimated spatial effect and interaction; from slides by Sigrunn Holbek Sørbye, University of Tromsø, "Spatial point patterns - simple case studies"]
Back to the linear model
Observation: (y,u,β) are jointly Gaussian!
$$\pi(\mathbf{y}\mid\mathbf{u},\boldsymbol\beta) \propto \exp\left(-\frac{\tau_n}{2}(\mathbf{y}-\mathbf{u}-\mathbf{X}^\top\boldsymbol\beta)^\top(\mathbf{y}-\mathbf{u}-\mathbf{X}^\top\boldsymbol\beta)\right) = \exp\left(-\frac{\tau_n}{2}\begin{pmatrix}\mathbf{y}^\top & \mathbf{u}^\top & \boldsymbol\beta^\top\end{pmatrix}\begin{pmatrix}\mathbf{I} & -\mathbf{I} & -\mathbf{X}^\top\\ -\mathbf{I} & \mathbf{I} & \mathbf{X}^\top\\ -\mathbf{X} & \mathbf{X} & \mathbf{X}\mathbf{X}^\top\end{pmatrix}\begin{pmatrix}\mathbf{y}\\ \mathbf{u}\\ \boldsymbol\beta\end{pmatrix}\right)$$

It follows that

$$\pi(\mathbf{y},\mathbf{u},\boldsymbol\beta) = \pi(\mathbf{y}\mid\mathbf{u},\boldsymbol\beta)\,\pi(\mathbf{u})\,\pi(\boldsymbol\beta) \propto \exp\left(-\frac{\tau_n}{2}\begin{pmatrix}\mathbf{y}^\top & \mathbf{u}^\top & \boldsymbol\beta^\top\end{pmatrix}\begin{pmatrix}\mathbf{I} & -\mathbf{I} & -\mathbf{X}^\top\\ -\mathbf{I} & \mathbf{I}+\tau_n^{-1}\mathbf{Q}_u & \mathbf{X}^\top\\ -\mathbf{X} & \mathbf{X} & \mathbf{X}\mathbf{X}^\top+\frac{\tau_{\text{fix}}}{\tau_n}\mathbf{I}\end{pmatrix}\begin{pmatrix}\mathbf{y}\\ \mathbf{u}\\ \boldsymbol\beta\end{pmatrix}\right)$$

where τn = σn⁻², τfix = σfix⁻², and Qu = Σu⁻¹.
Estimation
Let x = (y, u, β). To estimate the parameters in the model, using MCMC or ML, we have to evaluate the log-likelihood

−(1/2) log|Σ(θ)| − (1/2) x⊤Σ(θ)⁻¹x

(up to an additive constant).
We can easily calculate marginal and conditional distributions, for example to do kriging. Recall that if

$$\begin{pmatrix}\mathbf{x}_A\\ \mathbf{x}_B\end{pmatrix} \sim N\left(\begin{pmatrix}\boldsymbol\mu_A\\ \boldsymbol\mu_B\end{pmatrix},\ \begin{pmatrix}\boldsymbol\Sigma_{AA} & \boldsymbol\Sigma_{AB}\\ \boldsymbol\Sigma_{BA} & \boldsymbol\Sigma_{BB}\end{pmatrix}\right),$$

then the conditional distribution is given by

$$\mathbf{x}_A \mid \mathbf{x}_B \sim N\left(\boldsymbol\mu_A + \boldsymbol\Sigma_{AB}\boldsymbol\Sigma_{BB}^{-1}(\mathbf{x}_B - \boldsymbol\mu_B),\ \boldsymbol\Sigma_{AA} - \boldsymbol\Sigma_{AB}\boldsymbol\Sigma_{BB}^{-1}\boldsymbol\Sigma_{BA}\right).$$
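These conditioning formulas translate directly into code. A minimal sketch, verified on a bivariate example where the answer is known in closed form:

```python
import numpy as np

def conditional_gaussian(mu, Sigma, A, B, xB):
    """Mean and covariance of x_A | x_B = xB for x ~ N(mu, Sigma).

    A and B are index arrays selecting the two blocks."""
    W = Sigma[np.ix_(A, B)] @ np.linalg.inv(Sigma[np.ix_(B, B)])
    mu_c = mu[A] + W @ (xB - mu[B])
    Sigma_c = Sigma[np.ix_(A, A)] - W @ Sigma[np.ix_(B, A)]
    return mu_c, Sigma_c

# Bivariate check: unit variances, correlation 0.5, condition on x_B = 2
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
mean_c, var_c = conditional_gaussian(mu, Sigma, [0], [1], np.array([2.0]))
print(mean_c, var_c)   # mean 0.5 * 2 = 1.0, variance 1 - 0.5^2 = 0.75
```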
Can we calculate these things in practice?
Evaluating the likelihood and the kriging predictor both require computing Σ(θ)⁻¹x. Evaluating the likelihood also requires calculating |Σ(θ)|.
• Computations scale as O(N³).
• Storage scales as O(N²): 2500 points for 20 years requires ∼20 GB.

Thus, even this very simple model is not feasible for large problems.

For more complicated models that require MCMC, N does not have to be particularly large for computations to become a major problem.

We need to decrease the computational burden!
“Low rank methods”
A popular approach to decreasing the computational cost is to approximate X(s) using a basis expansion

X(s) = ∑_{j=1}^m wj φj(s),   (2)

where the wj are Gaussian random variables and {φj}_{j=1}^m are some pre-defined basis functions. This allows us to write Σ(θ) = BΣwB⊤, which basically gives us O(m³) cost instead of O(N³) cost when we choose m ≪ N.

There are many ways to obtain these "low rank" approximations:
• Karhunen-Loève transforms
• Empirical orthogonal functions
• Process convolutions
• Fixed-rank kriging or predictive processes
• +++
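A sketch of such a basis expansion, with made-up Gaussian-bump basis functions and an identity weight covariance; the point is that the implied N × N covariance BΣwB⊤ has rank at most the number of basis functions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, m = 400, 10                                  # many locations, few basis functions

s = np.linspace(0, 1, N)
centers = np.linspace(0, 1, m)
B = np.exp(-0.5 * ((s[:, None] - centers[None, :]) / 0.1) ** 2)   # N x m basis matrix
Sigma_w = np.eye(m)                             # covariance of the weights w

x = B @ rng.multivariate_normal(np.zeros(m), Sigma_w)   # one realisation of X(s)
Sigma = B @ Sigma_w @ B.T                       # implied N x N covariance
print(np.linalg.matrix_rank(Sigma) <= m)        # True: rank at most m
```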
Key Lesson: Sparse matrices
Definition (Sparse matrix): A matrix Q is called sparse if most of its elements are zero.

• There exist very efficient numerical algorithms for dealing with sparse matrices.
• Computations scale as O(N^{3/2}).
• Storage scales as O(N): 2500 points for 20 years requires ∼400 KB.
Two possible options:
1. Force Σ to be sparse. This forces independence between variables.
2. Force the precision matrix Q = Σ⁻¹ to be sparse. What does this correspond to?
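Option 2 is what GMRFs exploit. As a taste of the payoff, here is a sketch using scipy.sparse with a tridiagonal precision matrix of the kind derived on the next slides (the size and entries are arbitrary); storage and solves stay essentially linear in n:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

n, phi = 100_000, 0.7
main = np.full(n, 1 + phi ** 2)
main[0] = main[-1] = 1.0
off = np.full(n - 1, -phi)
Q = sparse.diags([off, main, off], [-1, 0, 1], format="csc")  # tridiagonal precision

b = np.ones(n)
x = spsolve(Q, b)                    # solving Qx = b is fast and memory-cheap
print(Q.nnz)                         # 3n - 2 stored entries instead of n^2
```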
Example: AR(1) process
The simplest example of a GMRF is the AR(1) process:
xt = φxt−1 + εt, t = 1, 2, . . . , εt ∼ N (0, 1)
where t represents time and the distribution of x0 is chosen as the stationary distribution of the process: x0 ∼ N(0, 1/(1 − φ²)).

The joint density for x = (x0, x1, . . . , xn−1) is

π(x) = π(x0) π(x1|x0) · · · π(xn−1|xn−2) = (2π)^{−n/2} |Q|^{1/2} exp(−(1/2) x⊤Qx).

The matrix Q = Σ⁻¹ is called the precision matrix.

The covariance matrix for x is dense, since all time points are dependent: Σij = φ^{|i−j|}/(1 − φ²).
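One can verify numerically that inverting this dense covariance matrix yields a sparse precision matrix (small n and an arbitrary φ, for illustration):

```python
import numpy as np

phi, n = 0.7, 6
idx = np.arange(n)
Sigma = phi ** np.abs(idx[:, None] - idx[None, :]) / (1 - phi ** 2)  # dense AR(1) covariance

Q = np.linalg.inv(Sigma)             # the precision matrix
Q[np.abs(Q) < 1e-10] = 0.0           # zero out round-off noise
print(Q[0, 0], Q[1, 1], Q[0, 1])     # approximately 1, 1 + phi^2 and -phi
print(np.count_nonzero(Q))           # 3n - 2 non-zeros: Q is tridiagonal
```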
GMRFs — AR David Bolin
Conditional independence
However, the precision matrix Q = Σ⁻¹ is sparse:

$$\mathbf{Q} = \begin{pmatrix} 1 & -\phi & & & \\ -\phi & 1+\phi^2 & -\phi & & \\ & \ddots & \ddots & \ddots & \\ & & -\phi & 1+\phi^2 & -\phi \\ & & & -\phi & 1 \end{pmatrix}$$

What is the key property of this example that causes Q to be sparse?
• The key lies in the full conditionals:

xt | x−t ∼ N( φ/(1 + φ²) (xt−1 + xt+1), 1/(1 + φ²) )

• Each time point is only conditionally dependent on the two closest time points, which is the reason for the tridiagonal structure of Q.
Main features of GMRFs
• Analytically tractable
• Modelling using conditional independence
• Merging GMRFs using conditioning (hierarchical models)
• Unified framework for
  • understanding
  • representation
  • computation using numerical methods for sparse matrices
• Fits nicely into the MCMC world
  • Can construct faster and more reliable block-MCMC algorithms
• Approximate Bayesian inference
• Approximate GRFs through SPDE representations
Usage of GMRFs (I)
Structural time-series analysis
• Autoregressive models
• Gaussian state-space models
• Computational algorithms based on the Kalman filter and its variants

Analysis of longitudinal and survival data, using:
• temporal GMRF priors
• state-space approaches
• spatial GMRF priors
Usage of GMRFs (II)
Graphical models
• A key model
• Estimate Q and its (associated) graph from data
• Often used in a larger context

Semiparametric regression and splines
• Model a smooth curve in time or a surface in space
• Intrinsic GMRF models and random walk models
• Discretely observed integrated Wiener processes are GMRFs
• GMRF models for coefficients in B-splines
Usage of GMRFs (III)
Image analysis
• Image restoration using the Wiener filter
• Texture modelling and texture discrimination
• Segmentation and object identification
• Deformable templates
• 3D reconstruction
• Restoring ultrasound images

Spatial statistics
• Latent GMRF model analysis of spatial binary data
• Geostatistics using GMRFs
• Analysis of data in social sciences and spatial econometrics
• Spatial and space-time epidemiology
• Environmental statistics
• Inverse problems