Download - An introduction to Gaussian processes for ... - gpss.ccgpss.cc/gpss13/assets/Sheffield-GPSS2013-Turner.pdf · An introduction to Gaussian processes for probabilistic inference Dr.

An introduction to Gaussian processes forprobabilistic inference

Dr. Richard E. Turner ([email protected])

Computational and Biological Learning Lab, Department of Engineering,University of Cambridge

Motivation: non-linear regression


?


Can we do this with a plain old Gaussian?

Gaussian distribution

p(y|Σ) ∝ exp(−1

2yTΣ−1y)

Σ =

1 .7

.7 1

y1

y2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3



2yTΣ−1y)

Σ =

1 .6

.6 1

y1

y2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3



2yTΣ−1y)

Σ =

1 .4

.4 1

y1

y2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3



2yTΣ−1y)

Σ =

1 .1

.1 1

y1

y2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3



2yTΣ−1y)

Σ =

1 .0

.0 1

y1

y2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3



2yTΣ−1y)

Σ =

1 .7

.7 1

y1

y2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3


p(y2|y1,Σ) ∝ exp(−1

2(y2 − µ∗)Σ∗−1(y2 − µ∗)) 1

2

y1

y2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3


p(y2|y1,Σ) ∝ exp(−1

2(y2 − µ∗)Σ∗−1(y2 − µ∗)) 1

2

y1

y 2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3


p(y2|y1,Σ) ∝ exp(−1

2(y2 − µ∗)Σ∗−1(y2 − µ∗)) 1

2

y1

y2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3

New visualisation

−2 0 2−3

−2

−1

0

1

2

3

y1

y 2

Σ =

1 .9

.9 1

New visualisation

−2 0 2−3

−2

−1

0

1

2

3

y1

y 2

1 2−3

−2

−1

0

1

2

3

variable index

y

Σ =

1 .9

.9 1

New visualisation

−2 0 2−3

−2

−1

0

1

2

3

y1

y2

1 2−3

−2

−1

0

1

2

3

variable index

y

Σ =

1 .9

.9 1

New visualisation

−2 0 2−3

−2

−1

0

1

2

3

y1

y5

1 2 3 4 5−3

−2

−1

0

1

2

3

variable index

y

Σ =

1 .9 .8 .6 .4

.9 1 .9 .8 .6

.8 .9 1 .9 .8

.6 .8 .9 1 .9

.4 .6 .8 .9 1

New visualisation

−2 0 2−3

−2

−1

0

1

2

3

y1

y20

1 2 3 4 5 6 7 8 9 1011121314151617181920−3

−2

−1

0

1

2

3

variable index

y

Σ =

1 10 20

1

10

20

Regression using Gaussians

1 2 3 4 5 6 7 8 9 1011121314151617181920−3

−2

−1

0

1

2

3

variable index

y

Σ =

1 10 20

1

10

20

Regression using Gaussians

1 2 3 4 5 6 7 8 9 1011121314151617181920−3

−2

−1

0

1

2

3

variable index

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)

1 10 20

1

10

20

Regression: probabilistic inference in function space

−3

−2

−1

0

1

2

3

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)

1 10 20

1

10

20


−3

−2

−1

0

1

2

3

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric)

K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)

1 10 20

1

10

20

Parametric model

y(x) = f(x; θ) + σyε

ε ∼ N (0, 1)


−3

−2

−1

0

1

2

3

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y


p(y|θ) = N (0,Σ)

function estimatewith uncertaintyK(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)

1 10 20

1

10

20

Parametric model


ε ∼ N (0, 1)


−3

−2

−1

0

1

2

3

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)


noise

K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)

1 10 20

1

10

20

Parametric model


ε ∼ N (0, 1)


−3

−2

−1

0

1

2

3

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)


horizontal-scale

noise

K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)

1 10 20

1

10

20

Parametric model


ε ∼ N (0, 1)


−3

−2

−1

0

1

2

3

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)


noise

Parametric model


ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)

1 10 20

1

10

20

horizontal-scalevertical-scale

Mathematical Foundations: Definition

Gaussian process = generalization of multivariate Gaussian distribution toinfinitely many variables.

Definition: a Gaussian process is a collection of random variables, anyfinite number of which have (consistent) Gaussian distributions.

A Gaussian distribution is fully specified by a mean vector, µ, and covariancematrix Σ:

f = (f1, . . . , fn) ∼ N (µ,Σ), indices i = 1, . . . , n

A Gaussian process is fully specified by a mean function m(x) and covariancefunction K(x,x′):

f(x) ∼ GP (m(x),K(x,x′)) , indices x

Mathematical Foundations: Marginalisation

Q1. A GP is "like" a Gaussian distribution with an infinitely long mean vector and an "infinite by infinite" covariance matrix, so how do we represent it on a computer?



We are saved by the marginalisation property:




Only need to represent finite dimensional projections of GPs on computer.




Q2. Why do we use covariances and not precisions to express modelling choices?







not the same






Entries in a precision matrix depend on what other data we are considering

not the same






Precisions tell us about the conditional independence relationships betweenpairs of variables (are and independent given all other variables).

Entries in a precision matrix depend on what other data we are considering

not the same

Mathematical Foundations: Regression

Q3. What's the formal justification for how we were using GPs for regression?



Generative model (like non-linear regression)




place GP prior over the non-linear function

(smoothly wiggling functions expected)




place GP prior over the non-linear function

since the sum of two Gaussians is a Gaussian, the model induces a GP over

(smoothly wiggling functions expected)

Mathematical foundations: Prediction

Q4. How do we make predictions?



linear in the data

predictive mean



prior uncertainty

predictive uncertainty

reduction inuncertainty

linear in the data

predictive mean predictive covariance

predictions more confident than prior

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model


ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)