
Transcript of COMP 551 – Applied Machine Learning, Lecture 20: Gaussian processes

Page 1

COMP 551 – Applied Machine Learning
Lecture 20: Gaussian processes

Associate Instructor: Herke van Hoof ([email protected])

Class web page: www.cs.mcgill.ca/~jpineau/comp551

Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor’s written permission.

Page 2

Last week’s Quiz

The neural autoencoder is analogous to PCA under which conditions?

Linear layer / Non-linear layer / Single hidden layer / Cross-entropy loss function / Squared-error loss function / L1-regularization

Which of the following statements are true?

Dropout reduces overfitting by reducing computation. Dropout reduces overfitting by increasing the model capacity. Dropout reduces overfitting by reducing noise. Dropout reduces overfitting by model averaging.

CNNs are effective for computer vision tasks for which reasons?

They have a built-in ability to exploit local regularities. They can scale to high-dimensional inputs. They can be trained with less data than feed-forward neural networks. They are invariant to translations.

Page 3

Today’s Quiz

Page 4

Today’s Goals

• Why do we need uncertainty in regression?

• How can we quantify uncertainty in regression?

• State-of-the-art algorithms for regression:

• Kernel ridge regression

• Gaussian process regression

Page 5

Bayesian linear regression - part II

Copyright C.M. Bishop, PRML

• Regression with an (extremely) small and noisy dataset

• Many functions are compatible with the data

Page 6

Bayesian linear regression - part II

Copyright C.M. Bishop, PRML

• Quantify the uncertainty using probabilities (e.g. a Gaussian mean and variance for every input x)

Page 7

Decision making

• What to do with the predictive distribution?

• Knowing uncertainty of output helpful in decision making

• Consider an inspection task.

• x: some measurement

• y: predicted breaking strength

• Parts which are too weak (breaking strength < t) are rejected

• Falsely rejecting a part incurs a small cost (c=1)

• Falsely accepting a part can cause more damage down the line (expected cost c=100)
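To make this concrete, here is a minimal sketch (not from the lecture) of how a predictive distribution plus these asymmetric costs turns into an accept/reject decision. The values of mu, sigma, and the threshold t are made up for illustration; only the costs c=1 and c=100 come from the slide.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical predictive distribution for one part: p(y | x) = N(mu, sigma^2)
mu, sigma = 105.0, 10.0   # predicted breaking strength and its uncertainty (illustrative values)
t = 100.0                 # parts with breaking strength below t should be rejected (illustrative)

p_weak = norm.cdf(t, loc=mu, scale=sigma)     # probability the part is actually too weak

c_false_reject = 1.0      # cost of rejecting a part that was strong enough (from the slide)
c_false_accept = 100.0    # expected cost of accepting a part that is too weak (from the slide)

expected_cost_accept = p_weak * c_false_accept
expected_cost_reject = (1.0 - p_weak) * c_false_reject

decision = "accept" if expected_cost_accept < expected_cost_reject else "reject"
print(p_weak, expected_cost_accept, expected_cost_reject, decision)
```

With these numbers the part is roughly 31% likely to be too weak, so the expected cost of accepting (about 31) far exceeds that of rejecting (about 0.7), and the part is rejected even though its predicted mean strength lies above the threshold.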

Page 8

Decision making

[Figure (Copyright C.M. Bishop, PRML): predictive distributions relative to the rejection threshold. Should we accept this part? How about this one?]

Page 9

Determining uncertainty

• To make good decisions, sometimes need to know uncertainty

• Sources of uncertainty:

• We do not know the parameters $w$, especially in areas where we have little data

• Even if we knew the parameters $w$ of the underlying function, individual parts might be slightly offset from this function

$p(y|x,w) = f(w,x) + \mathcal{N}(0, \sigma^2)$

$p(w|D) = w + \mathcal{N}(0, \Sigma)$

$D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$

Page 10

Determining uncertainty

• To make good decisions, sometimes need to know uncertainty

• Sources of uncertainty:

• We do not know the parameters $w$, especially in areas where we have little data (1: determine $p(w|D)$)

• Even if we knew the parameters $w$ of the underlying function, individual parts might be slightly offset from this function (2: combine these predictions for all $w$)

$p(y|x,w) = f(w,x) + \mathcal{N}(0, \sigma^2)$

$p(w|D) = w + \mathcal{N}(0, \Sigma)$

$D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$

Page 11

Step 1: Determine posterior

• Goal: fit lines $y = w_0 + w_1 x + \epsilon$

• Bayes theorem: $p(w|D) = \frac{p(D|w)\,p(w)}{p(D)}$

[Figure: data points in the (x, y) plane]

Page 12

Step 1: Determine posterior

• Goal: fit lines $y = w_0 + w_1 x + \epsilon$

• Bayes theorem: $p(w|D) = \frac{p(D|w)\,p(w)}{p(D)}$

• Similar to ridge regression, expect good $w$ to be small

What prior?

[Figure: axes x and y. Copyright C.M. Bishop, PRML]

Page 13

Step 1: Determine posterior

• Goal: fit lines $y = w_0 + w_1 x + \epsilon$

• Bayes theorem: $p(w|D) = \frac{p(D|w)\,p(w)}{p(D)}$

• Similar to ridge regression, expect good $w$ to be small

What prior?

[Figures: axes x and y. Copyright C.M. Bishop, PRML]

Page 14

Step 1: Determine posterior

• Goal: fit lines $y = w_0 + w_1 x + \epsilon$

• Bayes theorem: $p(w|D) = \frac{p(D|w)\,p(w)}{p(D)}$

• Similar to ridge regression, expect good $w$ to be small

What prior?

[Figures: axes x and y. Copyright C.M. Bishop, PRML]

Page 15

Step 1: Determine posterior

• Goal: fit lines $y = w_0 + w_1 x + \epsilon$

• Bayes theorem: $p(w|D) = \frac{p(D|w)\,p(w)}{p(D)}$

• Similar to ridge regression, expect good $w$ to be small

What prior?

[Figures: axes x and y. Copyright C.M. Bishop, PRML]

Page 16

Step 1: Determine posterior

• Goal: fit lines $y = w_0 + w_1 x + \epsilon$

• Bayes theorem: $p(w|D) = \frac{p(D|w)\,p(w)}{p(D)}$

• Good lines should pass ‘close by’ the datapoints

What likelihood?

[Figures: axes x and y. Copyright C.M. Bishop, PRML]

Page 17

Step 1: Determine posterior

• Goal: fit lines $y = w_0 + w_1 x + \epsilon$

• Bayes theorem: $p(w|D) = \frac{p(D|w)\,p(w)}{p(D)}$

• Good lines should pass ‘close by’ the datapoints

What likelihood?

[Figures: axes x and y. Copyright C.M. Bishop, PRML]

Page 18

Step 1: Determine posterior

• Goal: fit lines $y = w_0 + w_1 x + \epsilon$

• Bayes theorem: $p(w|D) = \frac{p(D|w)\,p(w)}{p(D)}$

• For all values of $w$, multiply prior and likelihood (and re-normalize)

[Figure: prior × likelihood = posterior. Copyright C.M. Bishop, PRML]
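As a rough illustration of this step (not code from the course), the sketch below evaluates the prior and the likelihood of the line model y = w0 + w1·x on a grid of (w0, w1) values, multiplies them, and re-normalizes, which is the computation behind the Bishop figures. The synthetic data, the prior precision alpha, and the noise level sigma are all assumptions made for the example.

```python
import numpy as np

# Synthetic data from a line y = w0 + w1*x + noise (illustrative values)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=5)
y_obs = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, size=5)

alpha, sigma = 2.0, 0.2                    # assumed prior precision and noise std
w0, w1 = np.meshgrid(np.linspace(-1, 1, 200), np.linspace(-1, 1, 200))

# log prior: independent zero-mean Gaussians on w0, w1 with precision alpha
log_prior = -0.5 * alpha * (w0**2 + w1**2)

# log likelihood: Gaussian noise around the line w0 + w1*x, summed over datapoints
pred = w0[..., None] + w1[..., None] * x           # shape (200, 200, 5)
log_lik = -0.5 * ((y_obs - pred) ** 2).sum(-1) / sigma**2

# posterior on the grid: multiply prior and likelihood, then re-normalize
log_post = log_prior + log_lik
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()
```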

Page 19

Determining uncertainty

• To make good decisions, sometimes need to know uncertainty

• Sources of uncertainty:

• We do not know the parameters $w$, especially in areas where we have little data (1: determine $p(w|D)$)

• Even if we knew the parameters $w$ of the underlying function, individual parts might be slightly offset from this function (2: combine these predictions for all $w$)

$p(y|x,w) = f(w,x) + \mathcal{N}(0, \sigma^2)$

$p(w|D) = w + \mathcal{N}(0, \Sigma)$

$D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$

Page 20

Step 2: Combine predictions

• Every $w$ makes a prediction: $y = w_0 + w_1 x + \epsilon$

[Figure: predictions of individual lines, multiplied by low, medium, and high posterior weight respectively and added together (‘+’) to form the combined prediction in the (x, y) plane. Copyright C.M. Bishop, PRML]

Page 21

Bayesian linear regression in general

• Model:

• Likelihood: $p(y|x,w) = \mathcal{N}(w^T x, \sigma^2)$

• Conjugate prior: $p(w) = \mathcal{N}(0, \alpha^{-1} I)$

• Prior precision $\alpha$ and noise variance $\sigma^2$ considered known

• Linear regression with uncertainty about the parameters

Page 22

Bayesian linear regression: inference

• Some algebra on the model definitions gives the solution:

$S_N = (\alpha I + \sigma^{-2} X^T X)^{-1}$

$p(w|D) = \mathcal{N}(\sigma^{-2} S_N X^T y,\; S_N)$

• $X$ has one input per row, $y$ has one target output per row

• If the prior precision $\alpha$ goes to 0, the mean becomes the maximum likelihood solution (ordinary linear regression)

• An infinitely wide likelihood variance $\sigma^2$, or 0 datapoints, means the distribution reduces to the prior
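A minimal NumPy sketch of these two formulas (not provided in the course; variable names are my own):

```python
import numpy as np

def blr_posterior(X, y, alpha, sigma2):
    """Posterior over the weights for Bayesian linear regression.

    X: (N, M) design matrix (one input per row), y: (N,) targets,
    alpha: prior precision, sigma2: observation noise variance.
    Returns the posterior mean (M,) and covariance S_N (M, M).
    """
    M = X.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + (X.T @ X) / sigma2)
    mean = S_N @ X.T @ y / sigma2
    return mean, S_N

# With a tiny prior precision the mean is essentially the maximum likelihood (OLS) solution
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([0.1, 0.9, 2.1, 2.9])
mean, S_N = blr_posterior(X, y, alpha=1e-8, sigma2=0.1)
```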

Page 23

Bayesian linear regression: inference

• We can investigate the maximum of the posterior (MAP)

• Log-transform the posterior: the log-posterior is the sum of the log-prior and the log-likelihood (up to a constant)

$\max_w \; \log p(w|y)$

$\max_w \; -\frac{\sigma^{-2}}{2} \sum_{n=1}^{N} (y_n - w^T x_n)^2 - \frac{\alpha}{2} w^T w + \text{const.}$

$\min_w \; \sum_{n=1}^{N} (y_n - w^T x_n)^2 + \lambda\, w^T w$

Page 24

Bayesian linear regression: inference

• We can investigate the maximum of the posterior (MAP)

• Log-transform the posterior: the log-posterior is the sum of the log-prior and the log-likelihood (up to a constant)

• Same objective function as for ridge regression (Lecture 4)! The first term is the squared error of linear regression

• Penalty term: $\lambda = \alpha \sigma^2$ (prior precision $\alpha$ × likelihood variance $\sigma^2$)

$\max_w \; \log p(w|y)$

$\max_w \; -\frac{\sigma^{-2}}{2} \sum_{n=1}^{N} (y_n - w^T x_n)^2 - \frac{\alpha}{2} w^T w + \text{const.}$

$\min_w \; \sum_{n=1}^{N} (y_n - w^T x_n)^2 + \lambda\, w^T w$

Page 25

Bayesian linear regression: prediction

• Prediction for a new datapoint (new input $x^*$):

$p(y^*|x^*, D) = \int p(w|D)\, p(y^*|x^*, w)\, dw$

• Convolution of two Gaussians, can compute the solution analytically:

$p(y^*|D) = \mathcal{N}(\sigma^{-2} x^{*T} S_N X^T y,\;\; \sigma^2 + x^{*T} S_N x^*)$

(the mean uses the posterior mean of $w$ from before; the variance is $\sigma^2$ from observation noise plus $x^{*T} S_N x^*$ from weight uncertainty)

• Variance tends to go down with more data until it reaches $\sigma^2$

• Corresponds to the sources of uncertainty discussed before
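Continuing in the same illustrative vein, the predictive mean and variance for a new feature vector x_star follow directly from $S_N$; this is a sketch with assumed variable names, not course code.

```python
import numpy as np

def blr_predict(x_star, X, y, alpha, sigma2):
    """Predictive mean and variance at a new input x_star (length-M feature vector)."""
    M = X.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + (X.T @ X) / sigma2)   # posterior covariance, as above
    mean = x_star @ (S_N @ X.T @ y) / sigma2                      # sigma^-2 x*^T S_N X^T y
    var = sigma2 + x_star @ S_N @ x_star                          # observation noise + weight uncertainty
    return mean, var
```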

Page 26

Beyond linear regression

• Non-linear data sets can be handled by using non-linear features

• Features specify the class of functions we consider (hypothesis class)

• What if we do not know good features?

$y = \sum_{i=1}^{M} w_i\, \phi_i(x)$

Page 27

Beyond linear regression

• Certain features work with many problems

[Figure: basis functions over input dimension 1 and input dimension 2. Copyright C.M. Bishop, PRML]

Page 28

Beyond linear regression

• Scaling with number of inputs

• Grid of radial basis functions: k RBFs per dimension, m-dimensional input?

• Polynomial expansion: order-k polynomial, m-dimensional input?

Page 29

Beyond linear regression

• Scaling with number of inputs

• Grid of radial basis functions: k RBFs per dimension, m-dimensional input? → $k^m$ features

• Polynomial expansion: order-k polynomial, m-dimensional input? → $m^k$ features (+ lower order terms)
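A quick sanity check of this blow-up. The exact count of distinct monomials, C(m+k, k), is an added detail not on the slide; the slide's rough count m^k is also shown.

```python
from math import comb

m, k = 10, 3                    # example: 10-dimensional input, k = 3

n_rbf_grid = k ** m             # grid of RBFs: k centres per dimension -> k^m basis functions
n_poly_rough = m ** k           # rough count of the order-k terms, as on the slide
n_poly_exact = comb(m + k, k)   # exact number of distinct monomials up to degree k
print(n_rbf_grid, n_poly_rough, n_poly_exact)   # 59049, 1000, 286
```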

Page 30

Beyond linear regression

• Relying on features can be problematic

• We tried to avoid using features before…

Page 31

Beyond linear regression

• Relying on features can be problematic

• We tried to avoid using features before…

• Lecture 8, instance based learning. Use distances!

• Lecture 12, support vector machines. Use kernels!

• We can use a similar approach with (Bayesian) linear regression

Page 32

Kernels (recap)

• A kernel is a function of two arguments which corresponds to a dot product in some feature space: $k(x, y) = \phi(x)^T \phi(y)$

• Advantage of using kernels:

• Sometimes evaluating $k$ is cheaper than evaluating the features and taking the dot product

• Sometimes $k$ corresponds to an inner product in a feature space with infinite dimensions

Examples: $k(x, y) = (x^T y)^d$, $\quad k(x, y) = \exp(-(x - y)^2)$
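A small sketch (my own, not from the slides) of the two example kernels, plus a check that the degree-2 polynomial kernel on 2-D inputs really equals a dot product in an explicit feature space:

```python
import numpy as np

def poly_kernel(x, y, d=2):
    return (x @ y) ** d

def rbf_kernel(x, y):
    return np.exp(-np.sum((x - y) ** 2))

def phi(x):
    # explicit feature map for (x^T y)^2 with 2-D inputs
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, y), phi(x) @ phi(y))   # both equal (1*3 + 2*(-1))^2 = 1
```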

Page 33

Kernels (recap)

• Kernelizing an algorithm:

• Try to formulate the algorithm so that feature vectors only ever occur in inner products

• Replace inner products by kernel evaluations (kernel trick)

• Sidenote: different kernel definitions are used in different methods. Here, ’Mercer kernels’ are used.

Page 34

Kernelizing the mean function

• Inspect the solution mean from Bayesian linear regression:

(1) $S_N = (\alpha I + \sigma^{-2} X^T X)^{-1}$

(2) $p(y^*|D) = \mathcal{N}(\sigma^{-2} x^{*T} S_N X^T y,\;\; \sigma^2 + x^{*T} S_N x^*)$

• The vector $y$ concatenates the training outputs

• The matrix $X$ has one column for each feature (length $N$), one row for each datapoint (length $M$)

• The mean prediction is

$y^* = \sigma^{-2} x^{*T} (\alpha I + \sigma^{-2} X^T X)^{-1} X^T y$

Page 35

Kernelizing the mean function

• Step 2: Reformulate to only have inner products of features:

$y^* = \sigma^{-2} x^{*T} (\alpha I + \sigma^{-2} X^T X)^{-1} X^T y$

$y^* = \sigma^{-2} x^{*T} X^T (\alpha I + \sigma^{-2} X X^T)^{-1} y$

(element $i$ of the vector $x^{*T} X^T = k(x^*)^T$ is $\phi(x_i)^T \phi(x^*)$; element $i,j$ of the matrix $X X^T = K$ is $\phi(x_i)^T \phi(x_j)$)

Page 36

Kernelizing the mean function

• Step 2: Reformulate to only have inner products of features:

$y^* = \sigma^{-2} x^{*T} (\alpha I + \sigma^{-2} X^T X)^{-1} X^T y$   (matrix to invert: #features × #features)

$y^* = \sigma^{-2} x^{*T} X^T (\alpha I + \sigma^{-2} X X^T)^{-1} y$   (matrix to invert: #datapoints × #datapoints)

(element $i$ of the vector $x^{*T} X^T = k(x^*)^T$ is $\phi(x_i)^T \phi(x^*)$; element $i,j$ of the matrix $X X^T = K$ is $\phi(x_i)^T \phi(x_j)$)

Page 37

Kernelizing the mean function

• Step 3: Replace the inner products by kernel evaluations:

$y^* = k(x^*)^T (\alpha I + K)^{-1} y$

(element $i$ of the vector $k(x^*)$ is $k(x_i, x^*)$; element $i,j$ of the matrix $K$ is $k(x_i, x_j)$)

• Remember: the mean function is the same as ridge regression

• This is kernel ridge regression
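A compact sketch of kernel ridge regression with a Gaussian kernel (illustrative code, not from the course; the toy sine data and the values of alpha and gamma are assumptions):

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) between all rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_predict(X, y, X_star, alpha=1.0, gamma=1.0):
    """y* = k(x*)^T (alpha I + K)^{-1} y for each row of X_star."""
    K = rbf_kernel_matrix(X, X, gamma)
    coeffs = np.linalg.solve(alpha * np.eye(len(X)) + K, y)   # (alpha I + K)^{-1} y
    return rbf_kernel_matrix(X_star, X, gamma) @ coeffs

# Toy usage: noisy sine data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
X_star = np.linspace(0, 10, 100)[:, None]
y_star = kernel_ridge_predict(X, y, X_star, alpha=0.1, gamma=1.0)
```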

Page 38

Kernel ridge regression

• Choosing a kernel

$k(x, y) = x\,y \qquad k(x, y) = \exp(-|x - y|) \qquad k(x, y) = \exp(-(x - y)^2)$

[Figure: kernel ridge regression fits for each of the three kernels]

Page 39

Kernel ridge regression

• Setting parameters

[Figure: kernel ridge regression fits for regularisation parameter $\alpha = 0.03$, $\alpha = 0.3$, and $\alpha = 3$]

Page 40

Kernel ridge regression

• Setting parameters

$k(x, y) = \exp\!\left(-\frac{(x - y)^2}{\sigma^2}\right)$

[Figure: kernel ridge regression fits for bandwidth $\sigma = 1$, $\sigma = 10$, and $\sigma = 0.1$]

Page 41

Why does it work?

• We still have #features = #datapoints, so regularisation is critical!

[Figure: kernel ridge regression fit with regularisation parameter $\alpha = 0$ (no regularisation)]

Page 42

Kernel regression: Practical issues

• Compare ridge regression: $w = (\lambda I + X^T X)^{-1} X^T y$

inverse: $O(d^3)$, matrix-vector product: $O(d^2 N)$, prediction: $O(d)$, memory: $O(d)$

• Kernel ridge regression: $y^* = k(x^*)^T (\alpha I + K)^{-1} y$

inverse and product: $O(N^3)$, prediction: $O(N)$, memory: $O(N)$

Page 43

Kernel regression: Practical issues

• If we have a small set of good features, it’s faster to do regression in feature space

• However, if no good features are available (or we need a very big set of features), kernel regression might yield better results

• Often, it is easier to pick a kernel than to choose a good set of features

Page 44

Kernelizing Bayesian linear regression

• We have now kernelized ridge regression

• Could we kernelize Bayesian linear regression, too?

[Diagram: linear regression → (kernel regression); ridge regression → kernel ridge regression; Bayesian linear regression → ?]

Page 45

Kernelizing Bayesian linear regression

• We have now kernelized ridge regression

• Could we kernelize Bayesian linear regression, too?

• Yes, and this is called Gaussian process regression (GPR)

[Diagram: linear regression → (kernel regression); ridge regression → kernel ridge regression; Bayesian linear regression → Gaussian process]

Page 46

Deriving GP equations

• Model:

• We are interested in the function values $y_1, y_2, \ldots$ at a set of points $x_1, x_2, \ldots$. We observe target values $t_n = y_n + \epsilon$ for the training set, but we assume these are noisy

• Prior: $y \sim \mathcal{N}(0, K)$, with $y$ a vector of function values and $K$ the kernel matrix

• Likelihood: $t \sim \mathcal{N}(y, \beta^{-1} I)$

Copyright C.M. Bishop, PRML
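To make the prior concrete, the sketch below draws a few sample functions y ~ N(0, K) at a dense grid of inputs, as in the “Examples from the prior” figure on the next slide. This is illustrative code; the squared-exponential kernel and the jitter term are my choices, not the course’s.

```python
import numpy as np

def sq_exp_kernel(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell ** 2))

xs = np.linspace(0, 10, 200)[:, None]
K = sq_exp_kernel(xs, xs) + 1e-8 * np.eye(len(xs))      # small jitter for numerical stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(len(xs)), cov=K, size=3)   # 3 functions from the prior
```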

Page 47

Examples from the prior

$y \sim \mathcal{N}(0, K)$

[Figure: sample functions drawn from the GP prior with $k(x, y) = \exp(-(x - y)^2)$ and with $k(x, y) = \exp(-|x - y|)$. Copyright C.M. Bishop, PRML]

Page 48

GP Regression

• Prior and likelihood are Gaussian

• Again obtain a closed-form solution:

$\mathbb{E}[y^*] = y^T (K + \beta^{-1} I)^{-1} k(x^*)$   (same form as kernel ridge regression)

$\mathrm{Cov}[y^*] = k(x^*, x^*) - k(x^*)^T (K + \beta^{-1} I)^{-1} k(x^*)$   (prior variance minus the reduction in variance due to close training points)

• Prediction of new observations: add the noise term

$\mathrm{Cov}[t^*] = k(x^*, x^*) - k(x^*)^T (K + \beta^{-1} I)^{-1} k(x^*) + \beta^{-1}$

• Easy to implement!
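A direct transcription of these three equations into NumPy (an illustrative sketch with my own names; here t denotes the vector of noisy training targets, matching the model two slides back):

```python
import numpy as np

def gp_predict(K, K_star, k_ss_diag, t, beta):
    """GP regression at test points.

    K: (N, N) kernel matrix of the training inputs,
    K_star: (N*, N) kernel between test and training inputs,
    k_ss_diag: (N*,) values k(x*, x*), t: (N,) noisy targets, beta: noise precision.
    """
    A = K + np.eye(len(K)) / beta                     # K + beta^{-1} I
    A_inv_Kstar_T = np.linalg.solve(A, K_star.T)      # shape (N, N*)
    mean = K_star @ np.linalg.solve(A, t)             # E[y*] for each test point
    var_y = k_ss_diag - np.sum(K_star * A_inv_Kstar_T.T, axis=1)   # Cov[y*] (diagonal)
    var_t = var_y + 1.0 / beta                        # Cov[t*]: add the noise term
    return mean, var_y, var_t
```

Any kernel can be plugged in here, for example the scaled squared-exponential kernel that appears a few slides later.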

Page 49

GP Regression

• Results of GP regression

[Figure: GP regression results, calculated for many possible $y^*$: the predictive mean $\mathbb{E}[y^*|t]$ together with $\mathbb{E}[y^*|t] + \sqrt{\mathrm{Cov}[y^*|t]}$ and $\mathbb{E}[y^*|t] - \sqrt{\mathrm{Cov}[y^*|t]}$]

Page 50

GP Regression: hyperparameters

• Hyperparameters:

• Assumed noise (variance of the likelihood): $t \sim \mathcal{N}(y, \beta^{-1} I)$

• Any parameters of the kernel

• Typical kernel: $k(x_i, x_j) = s^2 \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$

• $s$: scale (standard deviation prior to seeing data)

• $\sigma$: bandwidth (which datapoints are considered close)

• Effective regularisation: $\beta^{-1} s^{-1}$

• Knowing the ‘meaning’ of the parameters helps tune them
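A sketch of this “typical kernel” as a function that could be plugged into the GP prediction sketch above (illustrative; treating s and sigma as keyword arguments is my choice):

```python
import numpy as np

def scaled_sq_exp_kernel(A, B, s=1.0, sigma=1.0):
    """k(xi, xj) = s^2 * exp(-||xi - xj||^2 / (2 * sigma^2)).

    s: prior scale (standard deviation before seeing data), sigma: bandwidth.
    """
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return s ** 2 * np.exp(-d2 / (2 * sigma ** 2))
```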

Page 51

GP Regression: hyperparameters

• Assumed noise (variance of the likelihood): $t \sim \mathcal{N}(y, \beta^{-1} I)$

• Effective regularisation: $\beta^{-1} s^{-1}$

• Mostly changes behaviour close to the training points

[Figure: GP fits with $\beta^{-1} = 0$, $\beta^{-1} = 0.1$, and $\beta^{-1} = 1$]

Page 52

GP Regression: hyperparameters

• Kernel: $k(x_i, x_j) = s^2 \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$

• Effective regularisation: $\beta^{-1} s^{-1}$

• Mostly changes behaviour further away from the training points

[Figure: GP fits with $s = 0.1$, $s = 1$, and $s = 10$]

Page 53

GP Regression: hyperparameters

• Kernel: $k(x_i, x_j) = s^2 \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$

• Changes what is considered ‘close’ or ‘far’

[Figure: GP fits with $\sigma = 0.1$, $\sigma = 1$, and $\sigma = 10$]

Page 54

GPs: Practical issues

• Complexity pretty much similar to kernel regression, except for calculating the predictive variance:

$\mathbb{E}[y^*] = y^T (K + \beta^{-1} I)^{-1} k(x^*)$

$\mathrm{Cov}[y^*] = k(x^*, x^*) - k(x^*)^T (K + \beta^{-1} I)^{-1} k(x^*)$

• inverse, product: $O(N^3)$

• prediction: $O(N)$

• memory: $O(N)$ for the mean, $O(N^2)$ when the predictive variance is needed

Page 55

GPs: Practical issues

• For small datasets, GPR is a state-of-the-art method!

• Advantage: uncertainty, flexible yet can control overfitting

• Computational costs acceptable for small datasets (<10 000)

• For large datasets, uncertainty not so important, expensive, ….

• Good approximations exist

• Specifying the right prior (kernel!) is important!

• This means choosing good hyperparameters

• Choice of hyperparameters investigated in the next lecture

Page 56

Bayesian methods in practice

• Time complexity varies compared to frequentist methods

• Memory complexity can be higher

• e.g. need to store mean + uncertainty: quadratic, not linear

• Much data everywhere: posterior close to point estimate

• (might as well use frequentist methods)

• Little data everywhere

• Prior information helps bias/variance trade-off

• Some areas with little, some areas with much data

• Uncertainty helps to decide where predictions are reliable

Page 57

Inference in more complex models

• We saw some examples with closed-form posterior

• In many complex models, no closed-form representation

• Variational inference (deterministic)

• Consider family of distributions we can represent (Gaussian)

• Use optimisation techniques to find best of these

• Sampling (stochastic)

• Try to directly sample from the posterior

• Expectations can be approximated using the samples

• Maximum a posteriori (point estimate)

Page 58

What you should know

• Last lecture:

• What is the Bayesian view of probability?

• Why can the Bayesian view be beneficial?

• Role of the following distributions:

• Likelihood, prior, posterior, posterior predictive

• This lecture:

• Key idea of Bayesian regression and its properties

• Key idea of kernel regression and its properties

• Key idea of Gaussian process regression and its properties

Page 59

Why does it work?

• In linear regression, we choose M basis functions, then find the best function of the form

$y = \sum_{i=1}^{M} w_i\, \phi_i(x)$

• In kernel regression, we find the best function of the form

$y = \sum_{s \in S} \alpha_s\, k(x, s)$   for any set $S$

• Basis function $k(\cdot, s)$ available at every possible input

Page 60

Why does it work?

• Basis function $k(\cdot, s)$ available at every possible input

• But only the basis functions at the training points are used to minimise the training error (representer theorem, Schölkopf et al., 2001)

[Figures: basis functions over input dim 1 and input dim 2]

Page 61

Why called Gaussian process?

Stochastic process: a collection of random variables indexed by some set, i.e. a random variable $y_i$ for every element $i$ in the index set

Here, we will consider the function values $y_i$ as random variables

The index is the function argument, e.g. $x \in \mathbb{R}^d$

Furthermore, we assume that any subset of these random variables is jointly Gaussian distributed, thus defining a Gaussian process

This same assumption underlies Bayesian linear regression!