
Mathematics for Machine Learning
Marc Deisenroth

Statistical Machine Learning Group, Department of Computing, Imperial College London

@mpd37 · m.deisenroth@imperial.ac.uk · marc@prowler.io

Deep Learning Indaba, University of the Witwatersrand, Johannesburg, South Africa

September 10, 2017

Applications of Machine Learning


Mathematical Concepts in Machine Learning

§ Linear algebra and matrix decomposition

§ Differentiation

§ Optimization

§ Integration

§ Probability theory and Bayesian inference

§ Functional analysis


Outline

Introduction

Differentiation

Integration



Feedforward Neural Network

z = Ax + b,   y = σ(z)

[Figure: a single feedforward layer. Inputs x = (x_1, x_2) are mapped by the affine transformation (A, b) to z = (z_1, z_2, z_3), and the elementwise nonlinearity σ produces the outputs y = (y_1, y_2, y_3).]

§ Training a neural network means parameter optimization: typically via some form of gradient descent
  Challenge 1: Differentiation. Compute gradients of a loss function with respect to the neural network parameters A, b

§ Computing statistics (e.g., means, variances) of predictions
  Challenge 2: Integration. Propagate uncertainty through a neural network
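
Not part of the original slides: a minimal NumPy sketch of the forward pass of such a layer, assuming tanh as the nonlinearity σ and arbitrary placeholder dimensions:

import numpy as np

N, M = 4, 3                       # input and output dimensions (arbitrary)
rng = np.random.default_rng(0)

x = rng.normal(size=N)            # input vector
A = rng.normal(size=(M, N))       # weight matrix
b = rng.normal(size=M)            # bias vector

z = A @ x + b                     # affine part: z = Ax + b
y = np.tanh(z)                    # elementwise nonlinearity sigma = tanh
print(y.shape)                    # (3,)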


Background: Matrix Multiplication

§ Matrix multiplication is not commutative, i.e., AB ≠ BA in general

§ When multiplying matrices, the "neighboring" dimensions have to fit:

  A (n×k) · B (k×m) = C (n×m)

  y = Ax                          y = A.dot(x)
  y_i = Σ_j A_ij x_j              y = np.einsum('ij, j', A, x)
  C = AB                          C = A.dot(B)
  C_ij = Σ_k A_ik B_kj            C = np.einsum('ik, kj', A, B)
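
A small NumPy sketch (placeholder shapes chosen arbitrarily) illustrating both notations from the slide and that the "neighboring" dimension k cancels:

import numpy as np

n, k, m = 3, 4, 5
rng = np.random.default_rng(1)
A = rng.normal(size=(n, k))
B = rng.normal(size=(k, m))
x = rng.normal(size=k)

y1 = A.dot(x)                           # y = Ax
y2 = np.einsum('ij,j->i', A, x)         # y_i = sum_j A_ij x_j
C1 = A.dot(B)                           # C = AB
C2 = np.einsum('ik,kj->ij', A, B)       # C_ij = sum_k A_ik B_kj

print(np.allclose(y1, y2), np.allclose(C1, C2))   # True True
print(C1.shape)                                   # (3, 5): the shared dimension k drops out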


Curve Fitting (Regression) in Machine Learning (1)

[Figure: data points and a maximum likelihood fit with a polynomial of degree 5.]

§ Setting: Given inputs x, predict outputs/targets y
§ Model f that depends on parameters θ. Examples:
  § Linear model: f(x, θ) = θ^T x,  x, θ ∈ R^D
  § Neural network: f(x, θ) = NN(x, θ)
§ Training data, e.g., N pairs (x_i, y_i) of inputs x_i and observations y_i
§ Training the model means finding parameters θ*, such that f(x_i, θ*) ≈ y_i
§ Define a loss function, e.g., Σ_{i=1}^N (y_i - f(x_i, θ))^2, which we want to optimize
§ Typically: optimization based on some form of gradient descent ⇒ differentiation required

Curve Fitting (Regression) in Machine Learning (2)

§ Training data, e.g., N pairs (x_i, y_i) of inputs x_i and observations y_i
§ Training the model means finding parameters θ*, such that f(x_i, θ*) ≈ y_i

[Figure: data points and a maximum likelihood fit with a polynomial of degree 5.]

§ Define a loss function, e.g., Σ_{i=1}^N (y_i - f(x_i, θ))^2, which we want to optimize
§ Typically: optimization based on some form of gradient descent ⇒ differentiation required



Differentiation: Outline

1. Scalar differentiation: f : R → R
2. Multivariate case: f : R^N → R
3. Vector fields: f : R^N → R^M
4. General derivatives: f : R^{M×N} → R^{P×Q}

Scalar Differentiation f : R → R

§ Derivative defined as the limit of the difference quotient

  f'(x) = df/dx = lim_{h→0} [f(x + h) - f(x)] / h

  Slope of the secant line through f(x) and f(x + h)

Examples

  f(x) = x^n        f'(x) = n x^{n-1}
  f(x) = sin(x)     f'(x) = cos(x)
  f(x) = tanh(x)    f'(x) = 1 - tanh^2(x)
  f(x) = exp(x)     f'(x) = exp(x)
  f(x) = log(x)     f'(x) = 1/x
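
As a sanity check (not on the original slide), these derivatives can be verified numerically with central finite differences; the test point and step size below are arbitrary choices:

import numpy as np

def num_deriv(f, x, h=1e-6):
    # central difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7
checks = [
    (lambda t: t**3,  lambda t: 3 * t**2),          # f(x) = x^n with n = 3
    (np.sin,          np.cos),
    (np.tanh,         lambda t: 1 - np.tanh(t)**2),
    (np.exp,          np.exp),
    (np.log,          lambda t: 1 / t),
]
for f, fprime in checks:
    print(np.isclose(num_deriv(f, x), fprime(x)))   # True for each pair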

Rules

§ Sum Rule

  (f(x) + g(x))' = f'(x) + g'(x) = df/dx + dg/dx

§ Product Rule

  (f(x) g(x))' = f'(x) g(x) + f(x) g'(x) = (df/dx) g(x) + f(x) (dg/dx)

§ Chain Rule

  (g ∘ f)'(x) = (g(f(x)))' = g'(f(x)) f'(x) = (dg/df)(df/dx)


Example: Chain Rule

(g ∘ f)'(x) = (g(f(x)))' = g'(f(x)) f'(x) = (dg/df)(df/dx)

g(z) = tanh(z),   z = f(x) = x^n

(g ∘ f)'(x) = (1 - tanh^2(x^n)) · n x^{n-1}

where the first factor is dg/df and the second is df/dx.


f : R^N → R

y = f(x),   x = [x_1, ..., x_N]^T ∈ R^N

§ Partial derivative (change one coordinate at a time):

  ∂f/∂x_i = lim_{h→0} [f(x_1, ..., x_{i-1}, x_i + h, x_{i+1}, ..., x_N) - f(x)] / h

§ Jacobian vector (gradient) collects all partial derivatives:

  df/dx = [∂f/∂x_1, ..., ∂f/∂x_N] ∈ R^{1×N}

Note: This is a row vector.


Example

f : R^2 → R,   f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3 ∈ R

§ Partial derivatives:

  ∂f(x_1, x_2)/∂x_1 = 2 x_1 x_2 + x_2^3
  ∂f(x_1, x_2)/∂x_2 = x_1^2 + 3 x_1 x_2^2

§ Gradient:

  df/dx = [∂f(x_1, x_2)/∂x_1, ∂f(x_1, x_2)/∂x_2] = [2 x_1 x_2 + x_2^3,  x_1^2 + 3 x_1 x_2^2] ∈ R^{1×2}
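
A quick numerical check of this gradient (a sketch, not from the slides; the evaluation point is arbitrary):

import numpy as np

def f(x):
    return x[0]**2 * x[1] + x[0] * x[1]**3

def grad_f(x):
    # analytic gradient (row vector) from the slide
    return np.array([2 * x[0] * x[1] + x[1]**3,
                     x[0]**2 + 3 * x[0] * x[1]**2])

def num_grad(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(x.size):            # perturb one coordinate at a time
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.3, -0.4])
print(np.allclose(grad_f(x), num_grad(f, x)))   # True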


Rules

§ Sum Rule

  ∂/∂x (f(x) + g(x)) = ∂f/∂x + ∂g/∂x

§ Product Rule

  ∂/∂x (f(x) g(x)) = (∂f/∂x) g(x) + f(x) (∂g/∂x)

§ Chain Rule

  ∂/∂x (g ∘ f)(x) = ∂/∂x (g(f(x))) = (∂g/∂f)(∂f/∂x)

Example: Chain Rule

§ Consider the function

  L(e) = ½ ‖e‖^2 = ½ e^T e
  e = y - Ax,   x ∈ R^N, A ∈ R^{M×N}, e, y ∈ R^M

§ Compute dL/dx. What is the dimension/size of dL/dx?
§ dL/dx ∈ R^{1×N}

  dL/dx = (dL/de)(de/dx)
  dL/de = e^T ∈ R^{1×M}    (1)
  de/dx = -A ∈ R^{M×N}     (2)

  ⇒ dL/dx = e^T(-A) = -(y - Ax)^T A ∈ R^{1×N}
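
A NumPy sketch (arbitrary placeholder sizes) comparing the analytic result -(y - Ax)^T A against a finite-difference approximation:

import numpy as np

rng = np.random.default_rng(2)
M, N = 5, 3
A = rng.normal(size=(M, N))
x = rng.normal(size=N)
y = rng.normal(size=M)

def L(x):
    e = y - A @ x
    return 0.5 * e @ e

dLdx = -(y - A @ x) @ A              # analytic gradient, a 1xN row vector

h = 1e-6
num = np.array([(L(x + h * np.eye(N)[i]) - L(x - h * np.eye(N)[i])) / (2 * h)
                for i in range(N)])  # finite-difference check
print(np.allclose(dLdx, num))        # True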


f : R^N → R^M

y = f(x) ∈ R^M,   x ∈ R^N

[y_1, ..., y_M]^T = [f_1(x), ..., f_M(x)]^T = [f_1(x_1, ..., x_N), ..., f_M(x_1, ..., x_N)]^T

§ Jacobian matrix (collection of all partial derivatives):

  df/dx = [dy_1/dx; ...; dy_M/dx]
        = [ ∂f_1/∂x_1  ···  ∂f_1/∂x_N ]
          [     ⋮                ⋮    ]
          [ ∂f_M/∂x_1  ···  ∂f_M/∂x_N ]   ∈ R^{M×N}


Example

f(x) = Ax,   f(x) ∈ R^M, A ∈ R^{M×N}, x ∈ R^N

§ Compute the gradient df/dx
§ Dimension of df/dx: since f : R^N → R^M, it follows that df/dx ∈ R^{M×N}
§ Gradient:

  f_i = Σ_{j=1}^N A_ij x_j   ⇒   ∂f_i/∂x_j = A_ij

  ⇒ df/dx = [ ∂f_1/∂x_1  ···  ∂f_1/∂x_N ]     [ A_11  ···  A_1N ]
            [     ⋮                ⋮    ]  =  [  ⋮            ⋮  ]  =  A     (3)
            [ ∂f_M/∂x_1  ···  ∂f_M/∂x_N ]     [ A_M1  ···  A_MN ]


Chain Rule

∂/∂x (g ∘ f)(x) = ∂/∂x (g(f(x))) = (∂g/∂f)(∂f/∂x)

Example

§ Consider f : R^2 → R and x : R → R^2 with

  f(x) = f(x_1, x_2) = x_1^2 + 2 x_2
  x(t) = [x_1(t), x_2(t)]^T = [sin(t), cos(t)]^T

§ The dimensions of df/dx and dx/dt are 1×2 and 2×1, respectively
§ Compute the gradient df/dt using the chain rule:

  df/dt = [∂f/∂x_1, ∂f/∂x_2] [∂x_1/∂t; ∂x_2/∂t]
        = [2 sin t, 2] [cos t; -sin t]
        = 2 sin t cos t - 2 sin t = 2 sin t (cos t - 1)
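
A numerical check of this chain-rule result (a sketch; the evaluation point t is arbitrary):

import numpy as np

def f(x):                      # f(x1, x2) = x1^2 + 2 x2
    return x[0]**2 + 2 * x[1]

def x_of_t(t):                 # x(t) = [sin t, cos t]
    return np.array([np.sin(t), np.cos(t)])

def dfdt_chain(t):             # chain-rule result from the slide
    return 2 * np.sin(t) * (np.cos(t) - 1)

t = 0.9
h = 1e-6
num = (f(x_of_t(t + h)) - f(x_of_t(t - h))) / (2 * h)   # numerical df/dt
print(np.isclose(dfdt_chain(t), num))                   # True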


BREAK


Derivatives with Matrices

§ Recap: the gradient of a function f : R^D → R^E is an E×D matrix
  (# target dimensions × # parameters), with

  df/dx ∈ R^{E×D},   df[e, d] = ∂f_e/∂x_d

§ The generalization to cases where the parameters (D) or the targets (E) are matrices applies immediately
§ Assume f : R^{M×N} → R^{P×Q}; then the gradient is a (P×Q)×(M×N) object (a tensor), where

  df[p, q, m, n] = ∂f_pq/∂X_mn


Derivatives with Matrices: Example (1)

f = Ax,   f ∈ R^M, A ∈ R^{M×N}, x ∈ R^N

df/dA ∈ R^{M×(M×N)}

df/dA = [∂f_1/∂A; ...; ∂f_M/∂A],   ∂f_i/∂A ∈ R^{1×(M×N)}


Derivatives with Matrices: Example (2)

f_i = Σ_{j=1}^N A_ij x_j,   i = 1, ..., M

∂f_i/∂A_iq = x_q   ⇒   ∂f_i/∂A_{i,:} = x^T ∈ R^{1×1×N}
                       ∂f_i/∂A_{k≠i,:} = 0^T ∈ R^{1×1×N}

∂f_i/∂A = [0^T; ...; x^T; 0^T; ...; 0^T] ∈ R^{1×(M×N)}     (4)

(the x^T block sits in the i-th row; all other rows are zero)


Example: Higher-Order Tensors

§ Consider a matrix A ∈ R^{4×2} whose entries depend on a vector x ∈ R^3
§ We can compute dA(x)/dx ∈ R^{4×2×3} in two equivalent ways: collect the partial derivatives ∂A/∂x_i ∈ R^{4×2} for i = 1, 2, 3, or flatten A into a vector in R^8, compute the 8×3 Jacobian, and reshape it into a 4×2×3 tensor


Gradients of a Single-Layer Neural Network (1)

[Figure: x → (A, b) → z → σ → f]

f = tanh(Ax + b) = tanh(z) ∈ R^M,   z := Ax + b ∈ R^M,   x ∈ R^N, A ∈ R^{M×N}, b ∈ R^M

∂f/∂b = (∂f/∂z)(∂z/∂b) ∈ R^{M×M},         with ∂f/∂z of size M×M and ∂z/∂b of size M×M
∂f/∂A = (∂f/∂z)(∂z/∂A) ∈ R^{M×(M×N)},     with ∂f/∂z of size M×M and ∂z/∂A of size M×(M×N)


Gradients of a Single-Layer Neural Network (2)

f = tanh(Ax + b) = tanh(z) ∈ R^M,   z := Ax + b,   x ∈ R^N, A ∈ R^{M×N}, b ∈ R^M

∂f/∂b = (∂f/∂z)(∂z/∂b) ∈ R^{M×M}

∂f/∂z = diag(1 - tanh^2(z)) ∈ R^{M×M}
∂z/∂b = I ∈ R^{M×M}                                    (5)

∂f/∂b[i, j] = Σ_{l=1}^M ∂f/∂z[i, l] · ∂z/∂b[l, j]

dfdb = np.einsum('il, lj', dfdz, dzdb)


Gradients of a Single-Layer Neural Network (3)

f = tanh(Ax + b) = tanh(z) ∈ R^M,   z := Ax + b,   x ∈ R^N, A ∈ R^{M×N}, b ∈ R^M

∂f/∂A = (∂f/∂z)(∂z/∂A) ∈ R^{M×(M×N)}

∂f/∂z = diag(1 - tanh^2(z)) ∈ R^{M×M}                  (6)
∂z/∂A: see (4)

∂f/∂A[i, j, k] = Σ_{l=1}^M ∂f/∂z[i, l] · ∂z/∂A[l, j, k]

dfdA = np.einsum('il, ljk', dfdz, dzdA)
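
Putting (4), (5), (6) together in a NumPy sketch (placeholder sizes; dzdA is built explicitly as the sparse tensor from (4), and the variable names dfdz, dzdb, dzdA follow the slide):

import numpy as np

M, N = 3, 4
rng = np.random.default_rng(3)
A = rng.normal(size=(M, N))
b = rng.normal(size=M)
x = rng.normal(size=N)

z = A @ x + b
f = np.tanh(z)

dfdz = np.diag(1 - np.tanh(z)**2)        # (6): M x M
dzdb = np.eye(M)                         # (5): M x M
dzdA = np.zeros((M, M, N))               # (4): M x (M x N)
for i in range(M):
    dzdA[i, i, :] = x                    # only row i of A affects z_i

dfdb = np.einsum('il,lj', dfdz, dzdb)    # M x M
dfdA = np.einsum('il,ljk', dfdz, dzdA)   # M x M x N
print(dfdb.shape, dfdA.shape)            # (3, 3) (3, 3, 4)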


Putting Things Together

§ Inputs x, observed outputs y = f(z, θ) + ε,   ε ~ N(0, Σ)
§ Train a single-layer neural network with

  f(z, θ) = tanh(z),   z = Ax + b,   θ = {A, b}

§ Find A, b such that the squared loss

  L(θ) = ½ ‖e‖^2,   e = y - f(z, θ)

  is minimized
§ Partial derivatives:

  ∂L/∂A = (∂L/∂e)(∂e/∂f)(∂f/∂z)(∂z/∂A)
  ∂L/∂b = (∂L/∂e)(∂e/∂f)(∂f/∂z)(∂z/∂b)

  where ∂L/∂e is given by (1), ∂e/∂f by (2), (3), ∂f/∂z by (6), ∂z/∂A by (4), and ∂z/∂b by (5)
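
A minimal gradient-descent sketch for this setup (not from the slides): a single training pair, a realizable target, and an arbitrary learning rate; the gradients are the chain-rule factors above written in vectorized form:

import numpy as np

rng = np.random.default_rng(4)
N, M = 4, 3
x = rng.normal(size=N)                  # single training input (placeholder)
A_true = rng.normal(size=(M, N))        # used only to create a realizable target
b_true = rng.normal(size=M)
y = np.tanh(A_true @ x + b_true)        # observed output for this input

A = 0.1 * rng.normal(size=(M, N))       # parameters to be learned
b = np.zeros(M)
lr = 0.1                                # learning rate (arbitrary)

for step in range(2000):
    z = A @ x + b
    f = np.tanh(z)
    e = y - f                           # error, L = 0.5 * ||e||^2
    delta = -e * (1 - f**2)             # dL/dz via the chain rule (1), (2), (6)
    A -= lr * np.outer(delta, x)        # dL/dA = delta x^T, cf. (4)
    b -= lr * delta                     # dL/db = delta, cf. (5)

print(0.5 * np.sum((y - np.tanh(A @ x + b))**2))   # final loss, much smaller than initially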


Gradients of a Multi-Layer Neural Network

[Figure: a multi-layer network x → f_1 → f_2 → ··· → f_{L-1} → f_L → L, with parameters (A_j, b_j) between consecutive layers and the loss L at the end.]

§ Inputs x, observed outputs y
§ Train a multi-layer neural network with

  f_0 = x
  f_i = σ_i(A_{i-1} f_{i-1} + b_{i-1}),   i = 1, ..., L

§ Find A_j, b_j for j = 0, ..., L-1, such that the squared loss

  L(θ) = ‖y - f_L(θ, x)‖^2

  is minimized, where θ = {A_j, b_j}, j = 0, ..., L-1


Gradients of a Multi-Layer Neural Network

∂L/∂θ_{L-1} = (∂L/∂f_L)(∂f_L/∂θ_{L-1})
∂L/∂θ_{L-2} = (∂L/∂f_L)(∂f_L/∂f_{L-1})(∂f_{L-1}/∂θ_{L-2})
∂L/∂θ_{L-3} = (∂L/∂f_L)(∂f_L/∂f_{L-1})(∂f_{L-1}/∂f_{L-2})(∂f_{L-2}/∂θ_{L-3})
∂L/∂θ_i     = (∂L/∂f_L)(∂f_L/∂f_{L-1}) ··· (∂f_{i+2}/∂f_{i+1})(∂f_{i+1}/∂θ_i)

More details (including efficient implementation) later this week


Training Neural Networks as Maximum Likelihood Estimation

§ Training a neural network in the above way corresponds to maximum likelihood estimation:
§ If y = NN(x, θ) + ε,   ε ~ N(0, I), then the log-likelihood is (up to an additive constant)

  log p(y | x, θ) = -½ ‖y - NN(x, θ)‖^2

§ Find θ* by minimizing the negative log-likelihood:

  θ* = arg min_θ -log p(y | x, θ)
     = arg min_θ ½ ‖y - NN(x, θ)‖^2
     = arg min_θ L(θ)

§ Maximum likelihood estimation can lead to overfitting (interpreting noise as signal)


Example: Linear Regression (1)

§ Linear regression with a polynomial of order M:

  y = f(x, θ) + ε,   ε ~ N(0, σ_ε^2)
  f(x, θ) = θ_0 + θ_1 x + θ_2 x^2 + ··· + θ_M x^M = Σ_{i=0}^M θ_i x^i

§ Given inputs x_i and corresponding (noisy) observations y_i, i = 1, ..., N, find the parameters θ = [θ_0, ..., θ_M]^T that minimize the squared loss (equivalently: maximize the likelihood)

  L(θ) = Σ_{i=1}^N (y_i - f(x_i, θ))^2
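
A least-squares (maximum likelihood) polynomial fit as a NumPy sketch; the data-generating function, noise level, and polynomial degree below are placeholder choices:

import numpy as np

rng = np.random.default_rng(5)
N, M = 20, 5                                     # number of points, polynomial degree
x = np.linspace(-5, 5, N)
y = np.sin(x) + 0.2 * rng.normal(size=N)         # noisy observations (placeholder function)

Phi = np.vander(x, M + 1, increasing=True)       # design matrix: columns 1, x, x^2, ..., x^M
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # minimizes sum_i (y_i - f(x_i, theta))^2

y_pred = Phi @ theta                             # fitted values f(x_i, theta*)
print(theta.shape, np.sum((y - y_pred)**2))      # (6,) and the residual squared loss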


Example: Linear Regression (2)

[Figure: data points and a maximum likelihood fit with a polynomial of degree 16, illustrating overfitting.]

§ Regularization, model selection, etc. can address overfitting ⇒ tutorials later this week
§ Alternative approach based on integration



Integration: Outline

1. Motivation

2. Monte-Carlo estimation

3. Basic sampling algorithms


Bayesian Integration to Avoid Overfitting

§ Instead of fitting a single set of parameters θ*, we can average over all plausible parameters ⇒ Bayesian integration:

  p(y | x) = ∫ p(y | x, θ) p(θ) dθ

[Figure: polynomial of degree 16 with a 95% predictive confidence band, data, maximum likelihood estimate, and MAP estimate.]

§ More details on what p(θ) is ⇒ tutorials later this week
§ For neural networks this integration is intractable ⇒ approximations


Computing Statistics of Random Variables

§ Computing means/(co)variances also requires solving integrals:

  E_x[x] = ∫ x p(x) dx =: μ_x
  V_x[x] = ∫ (x - μ_x)(x - μ_x)^T p(x) dx
  Cov[x, y] = ∬ (x - μ_x)(y - μ_y)^T p(x, y) dx dy

§ These integrals can often not be computed in closed form ⇒ approximations


Approximate Integration

§ Numerical integration (low-dimensional problems)
§ Bayesian quadrature, e.g., O'Hagan (1987, 1991); Rasmussen & Ghahramani (2003)
§ Variational Bayes, e.g., Jordan et al. (1999)
§ Expectation Propagation, Opper & Winther (2001); Minka (2001)
§ Monte Carlo methods, e.g., Gilks et al. (1996), Robert & Casella (2004), Bishop (2006)

Monte Carlo Methods—Motivation

§ Monte Carlo methods are computational techniques that make use of random numbers
§ Two typical problems:
  1. Problem 1: Generate samples {x^(s)} from a given probability distribution p(x), e.g., for simulation or for representing data distributions
  2. Problem 2: Compute expectations of functions under that distribution:

     E[f(x)] = ∫ f(x) p(x) dx

     Example: means/variances of distributions, predictions
     Complication: the integral cannot be evaluated analytically


Problem 2: Monte Carlo Estimation

§ Computing expectations via statistical sampling:

  E[f(x)] = ∫ f(x) p(x) dx ≈ (1/S) Σ_{s=1}^S f(x^(s)),   x^(s) ~ p(x)

§ Making predictions (e.g., Bayesian regression with inputs x and targets y):

  p(y | x) = ∫ p(y | θ, x) p(θ) dθ ≈ (1/S) Σ_{s=1}^S p(y | θ^(s), x),   θ^(s) ~ p(θ)

  where p(θ) is the parameter distribution

§ Key problem: generating samples from p(x) or p(θ) ⇒ need to solve Problem 1
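
A minimal sketch of such a Monte Carlo estimate, assuming f(x) = x^2 and p(x) = N(0, 1), for which the exact expectation is 1:

import numpy as np

rng = np.random.default_rng(6)

S = 100000
samples = rng.normal(size=S)          # x^(s) ~ p(x) = N(0, 1)
estimate = np.mean(samples**2)        # (1/S) sum_s f(x^(s)) with f(x) = x^2
print(estimate)                       # approximately 1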


Sampling Discrete Values

[Figure: the unit interval split into segments of length p = 0.3 (value a), p = 0.2 (value b), and p = 0.5 (value c); the draw u = 0.55 falls into the segment of c.]

§ u ~ U[0, 1], where U is the uniform distribution
§ u = 0.55 ⇒ x = c
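
A sketch of this discrete sampling scheme in NumPy, using the cumulative probabilities of the three values a, b, c from the figure:

import numpy as np

rng = np.random.default_rng(7)
values = np.array(['a', 'b', 'c'])
probs = np.array([0.3, 0.2, 0.5])

u = rng.uniform()                            # u ~ U[0, 1]
idx = np.searchsorted(np.cumsum(probs), u)   # find the segment that u falls into
print(u, values[idx])                        # the sampled value (u = 0.55 would give 'c')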

Continuous Variables

[Figure: an unnormalized density p̃(z) over a continuous variable z.]

§ More complicated
§ Geometrically, we wish to sample uniformly from the area under the curve
§ Two algorithms here:
  § Rejection sampling
  § Importance sampling

Rejection Sampling: Setting

[Figure: an unnormalized density p̃(z) over a continuous variable z.]

§ Assume:
  § Sampling from p(z) is difficult
  § Evaluating p̃(z) = Z p(z) is easy (and Z may be unknown)
§ Find a simpler distribution (proposal distribution) q(z) from which we can easily draw samples (e.g., Gaussian, Laplace)
§ Find an upper bound k q(z) ≥ p̃(z)


Rejection Sampling: Algorithm

[Figure: the scaled proposal k q(z) upper-bounds p̃(z); for a proposed z_0, a vertical draw u_0 ∈ [0, k q(z_0)] lands either in the acceptance area (below p̃) or in the rejection area (above). Adapted from PRML (Bishop, 2006).]

1. Generate z_0 ~ q(z)
2. Generate u_0 ~ U[0, k q(z_0)]
3. If u_0 > p̃(z_0), reject the sample. Otherwise, retain z_0
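
A sketch of the algorithm for a placeholder target: an unnormalized bimodal density p̃(z), a Gaussian proposal q(z) = N(0, 3^2), and a hand-chosen bound k (assumed to satisfy k q(z) ≥ p̃(z) for this particular p̃):

import numpy as np

rng = np.random.default_rng(8)

def p_tilde(z):                                # unnormalized target density (placeholder choice)
    return np.exp(-0.5 * (z - 1)**2) + 0.5 * np.exp(-0.5 * (z + 2)**2)

def q_pdf(z, scale=3.0):                       # proposal density: N(0, 3^2)
    return np.exp(-0.5 * (z / scale)**2) / (scale * np.sqrt(2 * np.pi))

k = 12.0                                       # chosen so that k * q(z) >= p_tilde(z)

samples = []
while len(samples) < 1000:
    z0 = rng.normal(0.0, 3.0)                  # 1. z0 ~ q(z)
    u0 = rng.uniform(0.0, k * q_pdf(z0))       # 2. u0 ~ U[0, k q(z0)]
    if u0 <= p_tilde(z0):                      # 3. accept if u0 <= p_tilde(z0), else reject
        samples.append(z0)

print(np.mean(samples))                        # mean of (approximate) samples from p(z)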

Properties

§ Accepted pairs (z, u) are uniformly distributed under the curve p̃(z)
§ The probability density of the z-coordinates of accepted points must therefore be proportional to p(z)
§ The samples are independent samples from p(z)


Shortcomings

§ Finding the upper bound k is tricky
§ In high dimensions the factor k is probably huge
§ Low acceptance rate/high rejection rate of samples


Importance Sampling

Key idea: Do not throw away all rejected samples, but give them lower weight by rewriting the integral as an expectation under a simpler distribution q (the proposal distribution):

  E_p[f(x)] = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx = E_q[f(x) p(x)/q(x)]

If we choose q in a way that we can easily sample from it, we can approximate this last expectation by Monte Carlo:

  E_q[f(x) p(x)/q(x)] ≈ (1/S) Σ_{s=1}^S f(x^(s)) p(x^(s))/q(x^(s)) = (1/S) Σ_{s=1}^S w_s f(x^(s)),   x^(s) ~ q(x)

with importance weights w_s = p(x^(s))/q(x^(s)).
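
A sketch of importance sampling for a placeholder example: target p(x) = N(2, 1), proposal q(x) = N(0, 3^2), and f(x) = x^2, for which E_p[f(x)] = 5:

import numpy as np

rng = np.random.default_rng(9)

def p_pdf(x):                                  # target density p(x) = N(2, 1) (placeholder)
    return np.exp(-0.5 * (x - 2)**2) / np.sqrt(2 * np.pi)

def q_pdf(x, scale=3.0):                       # proposal density q(x) = N(0, 3^2)
    return np.exp(-0.5 * (x / scale)**2) / (scale * np.sqrt(2 * np.pi))

f = lambda x: x**2                             # E_p[x^2] = 2^2 + 1 = 5

S = 200000
xs = rng.normal(0.0, 3.0, size=S)              # x^(s) ~ q(x)
w = p_pdf(xs) / q_pdf(xs)                      # importance weights w_s
print(np.mean(w * f(xs)))                      # approximately 5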


Properties

§ Unbiased if q > 0 wherever p > 0 and if we can evaluate p
§ Breaks down if we do not have enough samples (puts nearly all weight on a single sample) ⇒ degeneracy; see also particle filtering and SMC (Thrun et al., 2005; Doucet et al., 2000)
§ Many draws from the proposal density q are required, especially in high dimensions
§ Requires that we can evaluate the true p. A generalization exists for p̃; this generalization is biased (but consistent)
§ Does not scale to interesting (high-dimensional) problems

A different approach to sampling from complicated (high-dimensional) distributions: Markov chain Monte Carlo (e.g., Gilks et al., 1996)


Summary

§ Two mathematical challenges in machine learning:
  § Differentiation for optimizing parameters of machine learning models ⇒ vector calculus and the chain rule
  § Integration for computing statistics (e.g., means, variances) and as a principled way to address overfitting ⇒ Monte Carlo integration to solve intractable integrals

Some Application Areas

§ Image/speech/text/language processing using deep neural networks (e.g., Krizhevsky et al., 2012, or the overview in Goodfellow et al., 2016)
§ Data-efficient reinforcement learning and robot learning using Gaussian processes (e.g., Deisenroth & Rasmussen, 2011)
§ High-energy physics using deep neural networks or Gaussian processes (e.g., Sadowski et al., 2014; Bertone et al., 2016)

References I

[1] G. Bertone, M. P. Deisenroth, J. S. Kim, S. Liem, R. R. de Austri, and M. Welling. Accelerating the BSM Interpretation of LHC Data with Machine Learning. arXiv preprint arXiv:1611.02704, 2016.

[2] C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer-Verlag, 2006.

[3] M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian Processes for Data-Efficient Learning in Robotics and Control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, Feb. 2015.

[4] M. P. Deisenroth and C. E. Rasmussen. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proceedings of the International Conference on Machine Learning, pages 465–472. ACM, June 2011.

[5] A. Doucet, S. J. Godsill, and C. Andrieu. On Sequential Monte Carlo Sampling Methods for Bayesian Filtering. Statistics and Computing, 10:197–208, 2000.

[6] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice: Interdisciplinary Statistics. Chapman & Hall, 1996.

[7] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

[8] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37:183–233, 1999.

[9] S. Kamthe and M. P. Deisenroth. Data-Efficient Reinforcement Learning with Probabilistic Model Predictive Control. arXiv preprint arXiv:1706.06491, 2017.

[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[11] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, Jan. 2001.

[12] R. M. Neal. Bayesian Learning for Neural Networks. PhD thesis, Department of Computer Science, University of Toronto, 1996.

[13] A. O'Hagan. Monte Carlo is Fundamentally Unsound. The Statistician, 36(2/3):247–249, 1987.

References II

[14] A. O'Hagan. Bayes-Hermite Quadrature. Journal of Statistical Planning and Inference, 29:245–260, 1991.

[15] M. Opper and O. Winther. Adaptive and Self-averaging Thouless-Anderson-Palmer Mean-field Theory for Probabilistic Modeling. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, 64:056131, 2001.

[16] C. E. Rasmussen and Z. Ghahramani. Bayesian Monte Carlo. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 489–496. The MIT Press, Cambridge, MA, USA, 2003.

[17] C. P. Robert and G. Casella. Monte Carlo Methods. Wiley Online Library, 2004.

[18] P. Sadowski, J. Collado, D. Whiteson, and P. Baldi. Deep Learning, Dark Knowledge, and Dark Matter. In NIPS Workshop on High-Energy Physics and Machine Learning, pages 81–87, 2014.

[19] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. The MIT Press, Cambridge, MA, USA, 2005.