
Variational Inference, Gaussian Processes and non-linear state-space models

Carl Edward Rasmussen

Sensor Fusion 2018

July 11th, 2018


Properties of Gaussian Process models

• allow analytic Bayesian inference
• make probabilistic (Gaussian) predictions
• marginal likelihood (for model selection) in closed form
• flexible, interpretable, non-parametric models

but

• simple, exact inference only available in vanilla cases
• exact inference scales badly with data set size


Gaussian Processes

[Two figure panels: functions plotted against the input x, with the output f(x) on the vertical axis; x ranges over [−5, 5] and f(x) over roughly [−2, 2].]


Gaussian Processes

A Gaussian Process (GP) is a generalization of the Gaussian distribution to infinite dimensional objects.

An infinitely long vector ≃ a function.

Therefore, GPs specify distributions over functions

f(x) ∼ GP(m(x), k(x, x′)).

A GP is fully specified by a mean function and a covariance function.

The properties of the function are controlled by the mean and covariance functions, and their hyperparameters θ.

Note: GPs are often used as processes in time. In this talk the index set into the process is more general.
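As an illustration (not part of the original slides), here is a minimal numpy sketch of drawing random functions from a GP prior, of the kind shown on the next two slides; the squared-exponential covariance se_cov and its hyperparameter values are assumed choices for this sketch.

import numpy as np

def se_cov(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared-exponential (Gaussian) covariance k(x, x').
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

# Evaluate the zero-mean GP on a grid of inputs and draw three random functions.
x = np.linspace(-5, 5, 200)
K = se_cov(x, x)
L = np.linalg.cholesky(K + 1e-9 * np.eye(len(x)))   # small jitter for numerical stability
samples = L @ np.random.randn(len(x), 3)            # each column is one draw of f(x)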


[Figure: function drawn at random from a Gaussian Process with Gaussian covariance.]


[Figure: function drawn at random from a Neural Network covariance function.]


Important properties

Although functions are infinite dimensional objects, you can still do exact inference using finite memory and computation.

In probability theory this is called the marginalization property; in the kernel learning community it is called the kernel trick.

The prior is captured by a GP; after observing data, the posterior is also a GP, with a new mean and covariance function.

The posterior predictive distribution at a test input x∗ is

f(x∗) | y, θ ∼ N( k(x∗, x)[K(x, x) + σ²I]⁻¹y,  k(x∗, x∗) − k(x∗, x)[K(x, x) + σ²I]⁻¹k(x, x∗) ).
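As an illustration (not code from the talk), here is a minimal numpy sketch of this predictive equation; k stands for any covariance function, for instance the se_cov defined earlier, and sigma2 is an assumed noise variance.

import numpy as np

def gp_predict(x_train, y_train, x_test, k, sigma2):
    # Posterior predictive mean and covariance of f(x*) given noisy targets y.
    K = k(x_train, x_train) + sigma2 * np.eye(len(x_train))
    L = np.linalg.cholesky(K)                                   # avoid explicit inversion
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # [K(x,x) + σ²I]⁻¹ y
    K_star = k(x_test, x_train)                                 # k(x*, x)
    mean = K_star @ alpha
    V = np.linalg.solve(L, K_star.T)
    cov = k(x_test, x_test) - V.T @ V    # k(x*,x*) − k(x*,x)[K(x,x)+σ²I]⁻¹k(x,x*)
    return mean, cov

For example, mean, cov = gp_predict(x, y, x_star, se_cov, 0.1) evaluates the equation above for a squared-exponential covariance.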


Model selection or hyperparameter learning

Bayesian model selection and hyperparameter optimization can be done using the marginal likelihood

log p(y|θ) = −½ yᵀ[K(x, x) + σ²I]⁻¹y − ½ log |K(x, x) + σ²I| − (N/2) log 2π,

where the first term is the data fit and the second is the complexity term.

Note: this is a non-parametric model; it can fit the data exactly if it wants to, but it doesn't, because we are marginalizing over functions (not optimizing). Occam's Razor is automatic.
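A correspondingly small numpy sketch of evaluating this log marginal likelihood (again an illustration, not code from the talk); in practice one would also compute its gradients wrt θ and the noise variance for optimisation.

import numpy as np

def gp_log_marginal_likelihood(x_train, y_train, k, sigma2):
    # log p(y|θ) = −½ yᵀ[K+σ²I]⁻¹y − ½ log|K+σ²I| − (N/2) log 2π
    N = len(y_train)
    L = np.linalg.cholesky(k(x_train, x_train) + sigma2 * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    data_fit = -0.5 * y_train @ alpha
    complexity = -np.sum(np.log(np.diag(L)))   # −½ log|K+σ²I| from the Cholesky factor
    return data_fit + complexity - 0.5 * N * np.log(2 * np.pi)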

Wouldn’t it be nice if we could use GPs to solve complex non-linear modeling problems?


Issues

One issue is that inference takes O(N³) computation and O(N²) memory, so it doesn't scale to large problems.

Not clear how to handle stochastic inputs, as necessary in time-series models and learning in deep architectures.

The Gaussian posterior is only exact for Gaussian likelihood.


An inconvenient truth

[Diagram: state-space graphical model with latent states x(t−1), x(t) and observations y(t−1), y(t).]

Gaussian processes (GP) are an extremely powerful framework for learning and inference about non-linear functions.

It’s the transition non-linearity which is essential; we will assume that the observation model is linear.

Irritating fact: Although desirable, it is not easy to apply GPs to latent variable time series.


The Evidence Lower BOund, ELBO

Start from Bayes rule

p(f|y, θ) = p(y|f) p(f|θ) / p(y|θ)   ⇔   p(y|θ) = p(y|f) p(f|θ) / p(f|y, θ),

multiply and divide by an arbitrary distribution q(f), and take logarithms

p(y|θ) = [p(y|f) p(f|θ) / (p(f|y, θ) q(f))] q(f)   ⇔   log p(y|θ) = log [p(y|f) p(f|θ) / q(f)] + log [q(f) / p(f|y, θ)],

average both sides wrt q(f)

log p(y|θ) = ∫ q(f) log [p(y|f) p(f|θ) / q(f)] df + ∫ q(f) log [q(f) / p(f|y, θ)] df,

where the left-hand side is the marginal likelihood, the first term on the right is the Evidence Lower BOund (ELBO), and the second term is KL(q(f) || p(f|y, θ)).
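To make the decomposition concrete, here is a small self-contained numerical check (my own illustration, not part of the talk) using a one-dimensional conjugate Gaussian model, where the ELBO, the KL term and log p(y) are all available in closed form; for every choice of Gaussian q the two terms sum to the log marginal likelihood, and the KL term vanishes only when q equals the exact posterior.

import numpy as np

# Toy model: prior f ~ N(0, 1), likelihood y|f ~ N(f, s2); everything stays Gaussian.
y, s2 = 0.7, 0.5

def elbo_and_kl(m, v):
    # ELBO and KL(q || posterior) for a Gaussian q(f) = N(m, v).
    e_lik = -0.5 * np.log(2 * np.pi * s2) - (v + (m - y) ** 2) / (2 * s2)   # E_q[log p(y|f)]
    e_prior = -0.5 * np.log(2 * np.pi) - (v + m ** 2) / 2                   # E_q[log p(f)]
    entropy = 0.5 * np.log(2 * np.pi * np.e * v)                            # −E_q[log q(f)]
    elbo = e_lik + e_prior + entropy
    v_post = s2 / (1.0 + s2)            # exact posterior p(f|y) = N(mu_post, v_post)
    mu_post = y / (1.0 + s2)
    kl = 0.5 * (np.log(v_post / v) + (v + (m - mu_post) ** 2) / v_post - 1.0)
    return elbo, kl

log_evidence = -0.5 * np.log(2 * np.pi * (1.0 + s2)) - y ** 2 / (2 * (1.0 + s2))
for m, v in [(0.0, 1.0), (0.3, 0.2), (0.7 / 1.5, 0.5 / 1.5)]:
    elbo, kl = elbo_and_kl(m, v)
    print(elbo, kl, elbo + kl - log_evidence)   # the last column is zero for every q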


Variational Inference in Gaussian Processes

Algorithm: maximize the ELBO wrt q(f) within some restricted class; the key assumption is

q(f) = p(f≠u|u, θ) q(u),

where f is split into two disjoint sets, the inducing targets u, and the rest f≠u.

F(θ) = ∫ q(f) log [p(y|f) p(f|θ) / q(f)] df
     = ∫ p(f≠u|u, θ) q(u) log [p(y|f) p(f≠u|u, θ) p(u|θ) / (p(f≠u|u, θ) q(u))] df≠u du
     = Σ_{n=1}^N ∫ p(fn|u, θ) q(u) log p(yn|fn) du dfn − KL(q(u) || p(u|θ)),

where the p(f≠u|u, θ) factors cancel, and the result can be evaluated in closed form.


Variational Inference in Gaussian Processes, cont.

The lower bound can be evaluated in closed form.

The lower bound can be optimized wrt q(u) by free form optimisation.

The optimal q∗(u) turns out to be Gaussian.

Plugging the optimal Gaussian back into the bound, we get

F(θ) = −½ yᵀ[Qff + σ²I]⁻¹y − ½ log |Qff + σ²I| − (1/(2σ²)) tr(Kff − Qff) − (N/2) log 2π,

where Qff = Kfu Kuu⁻¹ Kuf. The structure (low rank plus diagonal) allows evaluation in O(M²N), where M is the number of inducing targets.

The bound can be optimized wrt hyperparameters θ and inducing inputs z.
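A rough numpy sketch of evaluating this collapsed bound (my own illustration, not the talk's code); for clarity it evaluates the expression directly in O(N³) rather than in the O(M²N) form that the low-rank-plus-diagonal structure permits via the matrix inversion lemma.

import numpy as np

def collapsed_bound(x, y, z, k, sigma2, jitter=1e-8):
    # F(θ) = −½ yᵀ[Qff+σ²I]⁻¹y − ½ log|Qff+σ²I| − tr(Kff−Qff)/(2σ²) − (N/2) log 2π
    N, M = len(x), len(z)
    Kuu = k(z, z) + jitter * np.eye(M)
    Kfu = k(x, z)
    Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)            # Kfu Kuu⁻¹ Kuf
    L = np.linalg.cholesky(Qff + sigma2 * np.eye(N))   # naive; exploit the low rank in practice
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    trace_term = np.trace(k(x, x)) - np.trace(Qff)     # tr(Kff − Qff)
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * trace_term / sigma2
            - 0.5 * N * np.log(2 * np.pi))

The inducing inputs z, the covariance hyperparameters inside k and the noise variance sigma2 would then be adjusted by gradient-based optimisation of this bound.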


Generative Model

Gaussian Process State Space Model

f(x|θx) ∼ GP(m(x), k(x, x′)),   (1)

xt | ft ∼ N(xt | ft, Q),  Q diagonal,   (2)

yt | xt ∼ N(yt | Cxt, R),   (3)

with hyperparameters θ = (θx, Q, C, R).
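To make equations (1)-(3) concrete, here is a hedged numpy sketch (my own illustration, with a one-dimensional state, zero GP mean and a squared-exponential covariance assumed) that simulates the generative process by sampling the transition function sequentially from its GP conditional.

import numpy as np

def se_cov(a, b, ell=1.0, var=1.0):
    return var * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def simulate_gpssm(T, Q=0.01, C=1.0, R=0.1, x0=0.0, seed=0):
    # Sample a 1-D GP-SSM: f ~ GP(0, k), x_t|f_t ~ N(f_t, Q), y_t|x_t ~ N(C x_t, R).
    rng = np.random.default_rng(seed)
    X_prev, F_prev = [], []        # previous states (GP inputs) and sampled f values
    x, xs, ys = x0, [], []
    for t in range(T):
        xq = np.array([x])
        if F_prev:                 # GP conditional of f_t given earlier function values
            Xp, Fp = np.array(X_prev), np.array(F_prev)
            Kpp = se_cov(Xp, Xp) + 1e-9 * np.eye(len(Xp))
            kqp = se_cov(xq, Xp)
            sol = np.linalg.solve(Kpp, kqp.T)
            mean = float(Fp @ sol[:, 0])
            var = float(se_cov(xq, xq)[0, 0] - kqp[0] @ sol[:, 0])
        else:
            mean, var = 0.0, float(se_cov(xq, xq)[0, 0])
        f = mean + np.sqrt(max(var, 0.0)) * rng.standard_normal()
        X_prev.append(x); F_prev.append(f)
        x = f + np.sqrt(Q) * rng.standard_normal()             # x_t | f_t
        ys.append(C * x + np.sqrt(R) * rng.standard_normal())  # y_t | x_t
        xs.append(x)
    return np.array(xs), np.array(ys)

For example, xs, ys = simulate_gpssm(100) generates a short sequence for experimentation.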

The joint probability is the prior times the likelihood

p(y, x, f|θ) = p(x0) ∏_{t=1}^T p(ft|f1:t−1, x0:t−1, θx) p(xt|ft, Q) p(yt|xt, C, R).   (4)

Note that the joint distribution of the f values isn’t even Gaussian! The marginal likelihood is

p(y|θ) = ∫ p(y, x, f|θ) df dx.   (5)

That’s really awful! But we need it to train the model.


Picture of part of generative process

[Figure: one step of the generative process, showing the transition function f(x), the identity line x = y, the points x(t), f(t+1), x(t+1), f(t+2), and the distribution p(x(t+1)).]


Let’s lower bound the marginal likelihood

log p(y|θ) ≥ ∫ q(x, f) log [ p(x0) ∏_{t=1}^T p(ft|f1:t−1, x0:t−1) p(xt|ft) p(yt|xt) / q(x, f) ] dx df
           = ∫ q(x, f) log [ p(y|θ) p(x, f|y, θ) / q(x, f) ] dx df
           = log p(y|θ) − KL(q(x, f) || p(x, f|y, θ)),   (6)

for any distribution q(x, f), by Jensen’s inequality.

Let’s choose the q(x, f) distribution within some restricted family to maximize the lower bound, or equivalently minimize the KL divergence.

This is still nasty because of the annoying prior, unless we choose:

q(x, f) = q(x) ∏_{t=1}^T p(ft|f1:t−1, x0:t−1).

This choice makes the lower bound simple but terribly loose! Why? Because the approximating distribution over the f’s doesn’t depend on the observations y.


Augment GP with inducing variables

Augment the model with an additional set of input-output pairs {zi, ui | i = 1, …, M}.

Joint distribution:

p(y, x, f, u|z) = p(x, f|u) p(u|z) ∏_{t=1}^T p(yt|xt).   (7)

Consistency (or the marginalization property) of GPs ensures that it is straightforward to augment with extra variables.

This step seemingly makes our problem worse, because we have more latent variables.


Lower bound revisited

Lower bound on marginal likelihood

log p(y|θ) ≥ ∫ q(x, f, u) log [ p(u) p(x0) ∏_{t=1}^T p(ft|f1:t−1, x0:t−1, u) p(xt|ft) p(yt|xt) / q(x, f, u) ] dx df du,   (8)

for any distribution q(x, f, u), by Jensen’s inequality.

Now choose

q(x, f, u) = q(u) q(x) ∏_{t=1}^T p(ft|f1:t−1, x0:t−1, u).   (9)

This choice conveniently makes the troublesome p(ft|f1:t−1, x0:t−1, u) term cancel. But the u values can be chosen (via q(u)) to reflect the observations, so the bound may be tight(er).

The lower bound is L(y|q(x), q(u), q(f|x, u), θ).


What do we get for all our troubles?

• For certain choices of covariance function, f can be integrated out

L(y|q(x), q(u), θ) = ∫ L(y|q(x), q(u), q(f|x, u), θ) df   (10)

• The optimal q(u) is found by calculus of variations and turns out to be Gaussian, and can be maxed out

L(y|q(x), θ) = max_{q(u)} L(y|q(x), q(u), θ)   (11)

• The optimal q(x) has Markovian structure; making a further Gaussian assumption, the lower bound can be evaluated analytically, as a function of its parameters µt, Σt,t and Σt−1,t.

Algorithm: optimize the lower bound L(y|q(x), θ) wrt the parameters of the Gaussian q(x) and the remaining parameters θ (we need derivatives for the optimisation).


Sparse approximations are built in

The computational cost is dominated by the GP. But the effective 'training set size' for the GP is given by M, the number of inducing variables.

Therefore, we can choose M to trade off accuracy and computational demand.

The computational cost is linear in the length of the time-series.


Data from non-linear Dynamical System


State space representation: f(t) as a function of f(t-1)


With explicit transitions


Data from non-linear Dynamical System


Parameterisation

Let’s use the method on a sequence of length T, with D-dimensional observations and a D-dimensional latent state, i.e. we have TD observations.

The lower bound is to be optimized wrt all the parameters

• the pairwise marginal Gaussian distribution q(x). This contains TD parameters for the mean and 2TD² for the covariances.

• the inducing inputs z. For each of the D GPs there are MD inducing inputs, i.e. a total of MD².

• parameters of the observation model C, there are D² of these
• noise parameters, 2D
• GP hyperparameters, ∼ D²

for a grand total of roughly (2T + M)D².

Example: cart and inverted pendulum, D = 4 and 10 time series each of length 100, so T = 1000 and M = 50. So we have 4000 observations and 36832 free parameters. This large number of parameters may be inconvenient but it doesn’t lead to overfitting!
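As a small sanity check of this bookkeeping (my own, not from the talk; the exact count of GP hyperparameters depends on the covariance function, so the total can differ slightly from the 36832 quoted above):

def free_parameter_count(T, D, M):
    q_x_mean   = T * D          # means of the pairwise marginals q(x)
    q_x_cov    = 2 * T * D**2   # Σ_{t,t} and Σ_{t−1,t} covariance blocks
    inducing_z = M * D**2       # M D-dimensional inducing inputs for each of the D GPs
    obs_model  = D**2           # linear observation matrix C
    noise      = 2 * D          # process and observation noise
    gp_hypers  = D**2           # roughly, depending on the covariance function
    return q_x_mean + q_x_cov + inducing_z + obs_model + noise + gp_hypers

print(free_parameter_count(T=1000, D=4, M=50))   # ≈ (2T + M) D² plus smaller terms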


Implementation

Careful implementation is necessary

• q(x) is assumed Markovian, so the precision matrix is block tridiagonal. The covariance is full; don’t write it out, it’s huge!

• q(x) has to be parameterised carefully to allow all positive-definite matrices, but without writing out the covariances (see the sketch after this list).

• initialization of parameters is important
• most of the free parameters are in the covariance of q(x). Initially train with shared covariances across time.
• then continue training with free covariances.
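One possible way to meet the first two requirements (a sketch under my own assumptions, not necessarily the talk's implementation) is to parameterise the precision of q(x) through a block-bidiagonal Cholesky factor, which guarantees positive definiteness, keeps storage linear in T, and yields a block-tridiagonal precision without ever writing out the dense covariance.

import numpy as np

def block_tridiagonal_precision(diag_blocks, off_blocks):
    # Build Λ = L Lᵀ from a block-bidiagonal lower-triangular factor L.
    # diag_blocks: T lower-triangular (D×D) blocks on the diagonal of L.
    # off_blocks:  T−1 (D×D) blocks just below the diagonal of L.
    # Λ is block tridiagonal by construction, so q(x) is Markovian; its inverse
    # (the covariance) is dense and would never be formed in a real implementation.
    T, D = len(diag_blocks), diag_blocks[0].shape[0]
    L = np.zeros((T * D, T * D))                 # dense here only for illustration
    for t in range(T):
        L[t*D:(t+1)*D, t*D:(t+1)*D] = np.tril(diag_blocks[t])
        if t > 0:
            L[t*D:(t+1)*D, (t-1)*D:t*D] = off_blocks[t - 1]
    return L @ L.T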


Conclusions

• Principled, approximate inference in flexible non-linear non-parametric models for time series is possible and practical

• Allows integration over the dynamics
• Automatically discovers the dimensionality of the hidden space; it is not the case that more latent dimensions lead to higher marginal likelihood

Some other interesting properties include

• Handles missing variables
• Multivariate latent variables and observations are straightforward
• Flexible non-parametric dynamics model
• Occam’s Razor automatically at work, no overfitting
• Framework includes computational control by limiting M
