
Introduction to Pattern Recognition

Bert Kappen, Wim Wiegerinck, SNN University of Nijmegen, Netherlands

May 29, 2008


Recognition of digits

An image is an array of pixels $x_i$, each between 0 and 1; $x = (x_1, \ldots, x_d)$, where $d$ is the total number of pixels ($d = 28 \times 28$).

• Goal = input: pixels → output: correct category 0, . . . , 9

• wide variability

• hand-made rules infeasible

Figures throughout are from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm

Machine Learning

• Training set: large set of (pixel array, category) pairs

– input data x, target data t(x)
– category = class

• Machine learning algorithm

– training, learning

• Result: function y(x)

– Fitted to target data
– New input x → output y(x)
– Hopefully: y(x) = t(x)
– Generalization on new examples (test set)


Preprocessing

• transformation of inputs x on training set + on test set

• by hand, rule based

– scaling, centering
– aspect ratio, curvature

• machine learning, unsupervised learning

• feature extraction

– dimension reduction
– reduces variability within a class (noise reduction) → easier to learn
– reduces variability between classes (information loss) → more difficult to learn
– trade-off


Types of machine learning tasks

• Supervised learning

– Classification: targets are classes
– Regression: target is continuous

• Unsupervised learning (no targets)

– clustering (similarity between input data)
– density estimation (distribution of input data)
– dimension reduction (preprocessing, visualisation)

• Reinforcement learning

– actions leading to maximal reward


1.1. Polynomial curve fitting

[Figure: training data, t versus x]

• Regression problem

• Given training set of N = 10 data points

– generated by tn = sin(2πx) + noise

• Goal: predict value t for new x (without knowing the curve)


We will fit the data using an M-th order polynomial

$$y(x, w) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$

y(x, w) is nonlinear in x, but linear in the coefficients w: a "linear model".

Training set: (xn, tn), n = 1, . . . , N . Objective: find parameters w, such that

y(xn,w) ≈ tn, for all n

This is done by minimizing the Error function

$$E(w) = \frac{1}{2} \sum_{n=1}^{N} \big(y(x_n, w) - t_n\big)^2$$

$E(w) \ge 0$ and $E(w) = 0 \Leftrightarrow y(x_n, w) = t_n$ for all $n$.


[Figure: fitted curve y(x, w) and data points (x_n, t_n)]


Partial derivatives, gradient, Nabla symbol

Let f(x1, . . . , xn) = f(x) be a function of several variables. The gradient of f, denoted ∇f (the symbol "∇" is called 'nabla'), is the vector of all partial derivatives:

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)^T$$

NB, the partial derivative ∂f(x1, . . . , xi, . . . , xn)/∂xi is computed by taking the derivative with respect to xi while keeping all other variables constant.

Example: $f(x, y, z) = x y^2 + 3.1\,y z$

Then

$$\nabla f(x, y, z) = \left( y^2,\; 2xy + 3.1z,\; 3.1y \right)^T$$

At local minima (and maxima, and so-called saddle points) of a differentiable function f, the gradient is zero, i.e., ∇f = 0.


Example: $f(x, y) = x^2 + y^2 + (y + 1)x$

So $\nabla f(x, y) = (2x + y + 1,\; 2y + x)^T$

Then we can compute the point (x∗, y∗) that minimizes f by setting ∇f = 0,

$$2x^* + y^* + 1 = 0, \qquad 2y^* + x^* = 0 \;\Rightarrow\; (x^*, y^*) = \left(-\tfrac{2}{3}, \tfrac{1}{3}\right)$$
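As a quick check (a minimal numpy sketch, not part of the original slides), one can compare the analytic gradient with finite differences and verify that it vanishes at (−2/3, 1/3):

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + y**2 + (y + 1) * x

def grad_f(v):
    x, y = v
    return np.array([2*x + y + 1, 2*y + x])   # analytic gradient from the slide

def num_grad(fun, v, eps=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = eps
        g[i] = (fun(v + e) - fun(v - e)) / (2 * eps)
    return g

v = np.array([0.3, -0.7])
print(grad_f(v), num_grad(f, v))              # the two gradients agree closely
print(grad_f(np.array([-2/3, 1/3])))          # approximately [0, 0]
```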


Chain rule

Suppose f is a function of y1, y2, ..., yk and each yj is a function of x; then we can compute the derivative of f with respect to x by the chain rule

$$\frac{df}{dx} = \sum_{j=1}^{k} \frac{\partial f}{\partial y_j} \frac{dy_j}{dx}$$

Example: $f(y(x), z(x)) = y(x)/z(x)$ with $y(x) = x^4$ and $z(x) = x^2$; then

$$\frac{df}{dx} = \frac{1}{z(x)}\, y'(x) - \frac{y(x)}{z(x)^2}\, z'(x) = 2x$$


Chain rule (2)

Suppose E is a function of y1, y2, ..., yN and each yj is a function of w0, . . . , wM; then we can compute the derivative of E with respect to wi by the chain rule

$$\frac{\partial E}{\partial w_i} = \sum_{j=1}^{N} \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial w_i}$$

Example:

$$E = \frac{1}{2} \sum_{j=1}^{N} \big(y_j(w) - t_j\big)^2 \;\Rightarrow\; \frac{\partial E}{\partial y_j} = y_j - t_j$$

and

$$y_j(w) = \sum_{i=0}^{M} x_j^i\, w_i \;\Rightarrow\; \frac{\partial y_j}{\partial w_i} = x_j^i$$

tj and xj are parameters in this example.


Minimization of E

E(w) in minimum: gradient ∇wE = 0

$$E(w) = \frac{1}{2} \sum_{n=1}^{N} \big(y(x_n, w) - t_n\big)^2$$

y linear in w ⇒ E quadratic in w

⇒ ∇wE is linear in w

⇒ ∇wE = 0: coupled set of linear equations (exercise)
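For the polynomial model the linear equations can be written out explicitly: with design matrix $\Phi_{nj} = x_n^j$, setting ∇E = 0 gives the normal equations $\Phi^T\Phi\, w = \Phi^T t$. A minimal sketch (numpy assumed; not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)

Phi = np.vander(x, M + 1, increasing=True)   # Phi[n, j] = x_n**j, j = 0..M

# gradient of E(w) = 1/2 ||Phi w - t||^2 is Phi.T @ (Phi w - t);
# setting it to zero gives the linear system Phi.T Phi w = Phi.T t
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(w)
```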


Model comparison, model selection

[Figure: polynomial fits to the data for M = 0, 1, 3, 9]


[Figure: root-mean-square error E_RMS on the training and test sets as a function of M]

Too simple (small M) → poor fit
Too complex (large M) → overfitting (fits the noise)

Note that the series expansion of sin contains terms of all orders.


Coefficients w* obtained for various M:

       M=0    M=1    M=3        M=9
w*0    0.19   0.82   0.31       0.35
w*1          -1.27   8          232
w*2                 -25         5321
w*3                 -17         48568
w*4                             -231639
w*5                             640042
w*6                             -10618000
w*7                             10424000
w*8                             -557683
w*9                             -125201


Model comparison, model selection

[Figure: M = 9 polynomial fits for N = 15 and N = 100 data points]

Overfitting is not so much due to noise as to sparseness of the data.

For the same model complexity: more data ⇒ less overfitting. With more data, more complex (i.e. more flexible) models can be used.


Model complexity vs problem complexity

• Rough heuristic: (# data points) ≫ (# adaptive parameters)

• Complexity as function of # data?

• Complexity as function of problem complexity

• Maximum likelihood vs Bayes

– a single polynomial of degree 9 versus a distribution of polynomials of degree 9
– no overfitting, automatic adaptation of the # effective parameters
– error bars
– prior knowledge


Regularization

Change the cost function E by adding a regularization term Ω(w) to penalize complexity:

$$\tilde{E}(w) = E(w) + \lambda\, \Omega(w)$$

For example,

$$\tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \big(y(x_n, w) - t_n\big)^2 + \frac{\lambda}{2} \|w\|^2$$

weight decay = shrinkage = ridge regression

The penalty term is independent of the number of training data points:

• small data sets: penalty term relatively large

• large data sets: penalty term relatively small

• → effective complexity depends on #training data
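A minimal sketch of weight decay (numpy assumed; not part of the slides): the penalty only changes the normal equations to $(\Phi^T\Phi + \lambda I)\, w = \Phi^T t$, so the fit shrinks as λ grows:

```python
import numpy as np

def fit_ridge(x, t, M, lam):
    """Fit an M-th order polynomial minimizing
    1/2 * sum_n (y(x_n, w) - t_n)^2 + lam/2 * ||w||^2."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)      # regularized normal equations
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(10)

for lam in (0.0, np.exp(-18), 1.0):            # ln(lambda) = -inf, -18, 0 as on the next slide
    print(np.round(fit_ridge(x, t, M=9, lam=lam), 2))
```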


[Figure: M = 9 polynomial fits for ln λ = −∞, ln λ = −18 and ln λ = 0]

       ln λ = −∞   ln λ = −18   ln λ = 0
w*0          0.35        0.35       0.13
w*1           232        4.74      -0.05
w*2          5321       -0.77      -0.06
w*3         48568      -31.97      -0.05
w*4       -231639       -3.89      -0.03
w*5        640042       55.28      -0.02
w*6     -10618000       41.32      -0.01
w*7      10424000      -45.95      -0.00
w*8       -557683      -91.53       0.00
w*9       -125201       72.68       0.01

[Figure: E_RMS on the training and test sets as a function of ln λ]

• Training set to optimize (typically many) parameters w

• Validation set to optimize (typically a few) hyperparameters λ, M


Matrix multiplications as summations

If A is an N × M matrix with entries Aij and v an M-dimensional vector with entries vi, then w = Av is an N-dimensional vector with components

$$w_i = \sum_{j=1}^{M} A_{ij} v_j$$

Similarly, if B is an M × K matrix with entries Bij, then C = AB is an N × K matrix with entries

$$C_{ik} = \sum_{j=1}^{M} A_{ij} B_{jk}$$
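A small check (numpy assumed; not from the slides) that the explicit summations agree with the matrix-vector and matrix-matrix products:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 3, 4, 5
A = rng.standard_normal((N, M))
B = rng.standard_normal((M, K))
v = rng.standard_normal(M)

# w_i = sum_j A_ij v_j
w_loop = np.array([sum(A[i, j] * v[j] for j in range(M)) for i in range(N)])
# C_ik = sum_j A_ij B_jk
C_loop = np.array([[sum(A[i, j] * B[j, k] for j in range(M)) for k in range(K)]
                   for i in range(N)])

print(np.allclose(w_loop, A @ v))   # True
print(np.allclose(C_loop, A @ B))   # True
```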


Dummy indices

The indices that are summed over are 'dummy' indices, they are just a label, so e.g.,

$$\sum_{k=1}^{M} A_{ik} B_{kj} = \sum_{l=1}^{M} A_{il} B_{lj}$$

Furthermore, the entries of the vectors and matrices are just ordinary numbers, so you don't have to worry about multiplication order. In addition, if the summation of indices is over a range that does not depend on other indices, you may interchange the order of summation,

$$\sum_{i=1}^{N} \sum_{j=1}^{M} \ldots = \sum_{j=1}^{M} \sum_{i=1}^{N} \ldots$$

So e.g, by changing summation order and renaming dummy indices,

$$w_k = \sum_{j=1}^{M} \sum_{i=1}^{N} A_{ij} B_{jk} = \sum_{l=1}^{N} \sum_{i=1}^{M} B_{ik} A_{li}$$


Kronecker delta

The notation δij usually denotes the Kronecker delta symbol, i.e.,

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

It has the nice property that it ‘eats’ dummy indices in summations:

$$\sum_{j=1}^{M} \delta_{ij} v_j = v_i \quad \text{for all } 1 \le i \le M \qquad (1)$$

The Kronecker delta can be viewed as the entries of the identity matrix I. In vector notation, (1) is equivalent to the statement Iv = v.


Probability theory

• Consistent framework for quantification and manipulation of uncertainty → foundation for Bayesian machine learning

• random variable = stochastic variable
  Example: boxes (B = r, b) and fruit (F = a, o).

• Consider an experiment of (infinitely) many (mentally) repeated trials (randomly pick a box and then randomly select an item of fruit from that box) under the same macroscopic conditions (the number of red/blue boxes and of apples/oranges in the boxes), but each time with different microscopic details (arrangements of boxes and fruits in the boxes).

The probability of an event (e.g. selecting an orange) is the fraction of times that event occurs in the experiment.

• Notation: p(F = o) = 9/20, etc (or P (. . .), IP(. . .), Prob(. . .), etc.)


X can take the values xi, i = 1, . . . , M. Y can take the values yj, j = 1, . . . , L.

N: total number of trials (N → ∞).
nij: number of trials with X = xi and Y = yj

ci: number of trials with X = xi

rj: number of trials with Y = yj

[Figure: the N trials arranged on a grid of cells (xi, yj); nij counts trials in cell (i, j), ci sums over column i, rj sums over row j]

Joint probability of X = xi and Y = yj:

$$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = p(Y = y_j, X = x_i)$$

Marginal probability of X = xi:

$$p(X = x_i) = \frac{c_i}{N} = \frac{\sum_j n_{ij}}{N} = \sum_j p(X = x_i, Y = y_j)$$

Conditional probability of Y = yj given X = xi

$$p(Y = y_j | X = x_i) = \frac{n_{ij}}{c_i}$$


• Explicit, unambiguous notation: p(X = xi)
• Short-hand notation: p(xi)
• p(X): "distribution" over the random variable X
• NB: the values xi are assumed to be mutually exclusive and complete

The Rules of Probability

Sum rule: $p(X) = \sum_Y p(X, Y)$

Product rule: $p(X, Y) = p(Y|X)\,p(X)$

Positivity: $p(X) \ge 0$

Normalization: $\sum_X p(X) = 1$


$p(X, Y) = p(Y|X)\,p(X) = p(X|Y)\,p(Y) \;\Rightarrow$

Bayes’ theorem

$$p(Y|X) = \frac{p(X|Y)\,p(Y)}{p(X)} \quad \left( = \frac{p(X|Y)\,p(Y)}{\sum_Y p(X|Y)\,p(Y)} \right)$$

Bayes’ theorem = Bayes’ rule


[Figure: a joint distribution p(X, Y) with Y = 1, 2, together with the marginals p(X), p(Y) and the conditional p(X | Y = 1)]


Fruits again

Model

p(B = r) = 4/10

p(B = b) = 6/10

p(F = a|B = r) = 1/4

p(F = o|B = r) = 3/4

p(F = a|B = b) = 3/4

p(F = o|B = b) = 1/4

Note that the (conditional) probabilities are normalized:

p(B = r) + p(B = b) = 1

p(F = a|B = r) + p(F = o|B = r) = 1

p(F = a|B = b) + p(F = o|B = b) = 1


• Marginal probability

$$p(F = a) = p(F = a|B = r)\,p(B = r) + p(F = a|B = b)\,p(B = b) = \frac{1}{4} \times \frac{4}{10} + \frac{3}{4} \times \frac{6}{10} = \frac{11}{20}$$

and from normalization,

$$p(F = o) = 1 - p(F = a) = \frac{9}{20}$$

• Conditional probability (reversing probabilities):

$$p(B = r|F = o) = \frac{p(F = o|B = r)\,p(B = r)}{p(F = o)} = \frac{3}{4} \times \frac{4}{10} \times \frac{20}{9} = \frac{2}{3}$$

• Terminology:
  p(B): prior probability (before observing the fruit)
  p(B|F): posterior probability (after observing F)
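A tiny numerical sketch of the fruit example (plain Python, not in the slides): the sum and product rules give the marginal p(F = o), and Bayes' rule reverses the conditioning:

```python
# model from the slide: prior over boxes and fruit probabilities given the box
p_B = {'r': 4/10, 'b': 6/10}
p_F_given_B = {('a', 'r'): 1/4, ('o', 'r'): 3/4,
               ('a', 'b'): 3/4, ('o', 'b'): 1/4}

# sum + product rule: p(F = o) = sum_B p(F = o | B) p(B)
p_o = sum(p_F_given_B[('o', b)] * p_B[b] for b in p_B)

# Bayes' rule: p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o)
p_r_given_o = p_F_given_B[('o', 'r')] * p_B['r'] / p_o

print(p_o)          # 0.45   (= 9/20)
print(p_r_given_o)  # 0.666... (= 2/3)
```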


(Conditionally) independent variables

• X and Y are called (marginally) independent if

P (X, Y ) = P (X)P (Y )

This is equivalent to P(X|Y) = P(X)

and also to P(Y|X) = P(Y)

• X and Y are called conditionally independent given Z if

P (X, Y |Z) = P (X|Z)P (Y |Z)

This is equivalent to P(X|Y, Z) = P(X|Z)

and also to P(Y|X, Z) = P(Y|Z)


Probability densities

• to deal with continuous variables (rather than discrete var’s)

When x takes values from a continuous domain, the probability of any particular value of x is zero! Instead, we must talk of the probability that x takes a value in a certain interval

$$\text{Prob}(x \in [a, b]) = \int_a^b p(x)\, dx$$

with p(x) the probability density over x.

$$p(x) \ge 0, \qquad \int_{-\infty}^{\infty} p(x)\, dx = 1 \quad \text{(normalization)}$$

• NB: p(x) may be bigger than one.

Probability of x falling in interval (x, x + δx) is p(x)δx for δx → 0


[Figure: a density p(x) and its cumulative distribution function P(x)]

Cumulative distribution function $P(z) = \int_{-\infty}^{z} p(x)\, dx$ (not often used in ML)


Multivariate densities

• Several continuous variables, denoted by the d-dimensional vector x = (x1, . . . , xd).
• Probability density p(x): the probability of x falling in an infinitesimal volume δx around x is given by p(x)δx.

$$\text{Prob}(x \in R) = \int_R p(x)\, dx = \int_R p(x_1, \ldots, x_d)\, dx_1 dx_2 \ldots dx_d$$

and

$$p(x) \ge 0, \qquad \int p(x)\, dx = 1$$

• Rules of probability apply to continuous variables as well,

$$p(x) = \int p(x, y)\, dy, \qquad p(x, y) = p(y|x)\,p(x)$$


Integration

The integral of a function of several variables x = (x1, x2, . . . , xn)

$$\int_R f(x)\, dx \equiv \int_R f(x_1, x_2, \ldots, x_n)\, dx_1 dx_2 \ldots dx_n$$

is the volume of the (n + 1)-dimensional region lying 'vertically above' the domain of integration R ⊂ IRⁿ and 'below' the function f(x).


Separable integrals

The easiest (but important) case is when we can separate the integration, e.g. in 2-d,

$$\int_{x=a}^{b} \int_{y=c}^{d} f(x)\,g(y)\, dx\, dy = \int_{x=a}^{b} f(x)\, dx \int_{y=c}^{d} g(y)\, dy$$

Example,

$$\int \exp\left( \sum_{i=1}^{n} f_i(x_i) \right) dx = \int \prod_{i=1}^{n} \exp(f_i(x_i))\, dx = \prod_{i=1}^{n} \int \exp(f_i(x_i))\, dx_i$$


Iterated integration

A little more complicated are the cases in which integration can be done by iteration, 'from inside out'. Suppose we can write the 2-d region R as the set a < x < b and c(x) < y < d(x); then we can write

$$\int_R f(y, x)\, dy\, dx = \int_{x=a}^{b} \left[ \int_{y=c(x)}^{d(x)} f(y, x)\, dy \right] dx$$

The first step is to evaluate the inner integral, where we interpret f(y, x) as a function of y with fixed parameter x. Suppose we can find F such that ∂F(y, x)/∂y = f(y, x); then the result of the inner integral is

$$\int_{y=c(x)}^{d(x)} f(y, x)\, dy = F(d(x), x) - F(c(x), x)$$

The result, which we call g(x), is obviously a function of x only,

$$g(x) \equiv F(d(x), x) - F(c(x), x)$$


The next step is the outer integral, which is now just a one-dimensional integral of the function g,

$$\int_{x=a}^{b} \left[ \int_{y=c(x)}^{d(x)} f(y, x)\, dy \right] dx = \int_{x=a}^{b} g(x)\, dx$$

Now suppose that the same 2-d region R can also be written as the set s < y < t and u(y) < x < v(y); then we can also choose to evaluate the integral as

$$\int_R f(y, x)\, dx\, dy = \int_{y=s}^{t} \left[ \int_{x=u(y)}^{v(y)} f(y, x)\, dx \right] dy$$

following the same procedure as above. In most regular cases the result is the same (for exceptions, see handout (*)).

Integration with more than two variables can be done with exactly the same procedure, 'from inside out'.

In Machine Learning, integration is mostly over the whole of x space, or over a subspace. Iterated integration is not often used.


Transformation of variables (1-d)

Often it is easier to do the multidimensional integral in another coordinate frame. Suppose we want to do the integration

$$\int_{y=c}^{d} f(y)\, dy$$

but the function f(y) is more easily expressed as f(y(x)), a function of x. So we want to use x as the integration variable. If y and x are related via invertible differentiable mappings y = y(x) and x = x(y), and the end points of the interval (y = c, y = d) are mapped to (x = a, x = b) (so c = y(a), etc.), then we have the equality

$$\int_{y=c}^{d} f(y)\, dy = \int_{y(x)=c}^{d} f(y(x))\, dy(x) = \int_{x=a}^{b} f(y(x))\, y'(x)\, dx$$

The derivative y'(x) comes in as the ratio between the lengths of the differentials dy and dx,

dy = y′(x) dx


Several variables

With several variables, the substitution rule is generalized as follows. We have the invertible mapping y = y(x). Let us also assume that the region of integration R is mapped to S (so S = y(R)); then we have the equality

$$\int_{y \in S} f(y)\, dy = \int_{y(x) \in S} f(y)\, dy(x) = \int_{x \in R} f(y(x)) \left| \det\left( \frac{\partial y(x)}{\partial x} \right) \right| dx$$

The factor $\det\left( \frac{\partial y(x)}{\partial x} \right)$ is called the Jacobian of the coordinate transformation. Written out in more detail,

$$\det\left( \frac{\partial y(x)}{\partial x} \right) =
\begin{vmatrix}
\frac{\partial y_1(x)}{\partial x_1} & \frac{\partial y_1(x)}{\partial x_2} & \cdots & \frac{\partial y_1(x)}{\partial x_n} \\
\frac{\partial y_2(x)}{\partial x_1} & \frac{\partial y_2(x)}{\partial x_2} & \cdots & \frac{\partial y_2(x)}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_n(x)}{\partial x_1} & \frac{\partial y_n(x)}{\partial x_2} & \cdots & \frac{\partial y_n(x)}{\partial x_n}
\end{vmatrix}$$


The absolute value¹ of the Jacobian comes in as the ratio between the volume represented by the differential dy and the volume represented by the differential dx, i.e.,

$$dy = \left| \det\left( \frac{\partial y(x)}{\partial x} \right) \right| dx$$

As a last remark, it is good to know that

$$\det\left( \frac{\partial x(y)}{\partial y} \right) = \det\left( \left( \frac{\partial y(x)}{\partial x} \right)^{-1} \right) = \frac{1}{\det\left( \frac{\partial y(x)}{\partial x} \right)}$$

¹ In the single-variable case, we took the orientation of the integration interval into account ($\int_a^b f(x)\, dx = -\int_b^a f(x)\, dx$). With several variables, this is awkward. Fortunately, it turns out that the orientation of the mapping of the domain always cancels against the 'orientation' of the Jacobian (= sign of the determinant). Therefore we take a positive orientation and the absolute value of the Jacobian.


Polar coordinates

Example: compute the area of a disc.

Consider a two-dimensional disc with radius R

$$D = \{(x, y) \mid x^2 + y^2 < R^2\}$$

Its area is $\int_D dx\, dy$.

This integral is most easily evaluated by going to 'polar coordinates'. The mapping from polar coordinates (r, θ) to Cartesian coordinates (x, y) is

x = r cos θ (2)

y = r sin θ (3)

In polar coordinates, the disc is described by 0 ≤ r < R (since x² + y² = r²) and 0 ≤ θ < 2π.


The Jacobian is

$$J = \begin{vmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \theta} \\ \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \theta} \end{vmatrix} = \begin{vmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{vmatrix} = r(\cos^2\theta + \sin^2\theta) = r$$

In other words, $dx\, dy = r\, dr\, d\theta$.

The area of the disc is now easily evaluated.

$$\int_D dx\, dy = \int_{\theta=0}^{2\pi} \int_{r=0}^{R} r\, dr\, d\theta = \pi R^2$$


Gaussian integral

How to compute $\int_{-\infty}^{\infty} \exp(-x^2)\, dx$?

$$\left( \int_{-\infty}^{\infty} \exp(-x^2)\, dx \right)^2 = \int_{-\infty}^{\infty} \exp(-x^2)\, dx \int_{-\infty}^{\infty} \exp(-y^2)\, dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \exp(-x^2)\exp(-y^2)\, dx\, dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \exp(-(x^2 + y^2))\, dx\, dy$$


The latter is easily evaluated by going to polar-coordinates,

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \exp(-(x^2 + y^2))\, dx\, dy = \int_{\theta=0}^{2\pi} \int_{r=0}^{\infty} \exp(-r^2)\, r\, dr\, d\theta = 2\pi \int_{r=0}^{\infty} \exp(-r^2)\, r\, dr = 2\pi \times \left( -\tfrac{1}{2} \exp(-r^2) \right)\Big|_0^{\infty} = \pi$$

So
$$\int_{-\infty}^{\infty} \exp(-x^2)\, dx = \sqrt{\pi}$$
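A quick numerical check (scipy assumed; not in the slides) that the integral indeed equals √π:

```python
import numpy as np
from scipy import integrate

# integrate exp(-x^2) over the whole real line
val, err = integrate.quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
print(val, np.sqrt(np.pi))   # both approximately 1.7724538509
```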


Transformation of densities

Under a nonlinear change of variables, a probability density p(x) transforms differently from an ordinary function, due to "conservation of probability" p(x)δx.

Consider x = x(y); then an ordinary function f(x) becomes f(y) = f(x(y)).

Probability densities:
• $p_x(x)\delta x$: probability that the point falls in a volume element δx around x
• $p_y(y)\delta y$: the same probability, now in terms of y

$$p_y(y)\,\delta y = p_x(x)\,\delta x \;\Rightarrow\; p_y(y) = \left| \det\left( \frac{\partial x}{\partial y} \right) \right| p_x(x(y))$$

• The values p(x) of a probability density depend on the choice of variable (p(x)δx is invariant)
• The maximum of a probability density depends on the choice of variable (see exercise 1.4).


Dirac’s delta-function

Dirac’s delta function δ(x) is defined such that

$$\delta(x) = 0 \text{ if } x \ne 0 \quad \text{and} \quad \int_{-\infty}^{\infty} \delta(x)\, dx = 1$$

It can be viewed as the limit ∆ → 0 of the function

$$f(x, \Delta) = \frac{1}{\Delta} \text{ if } |x| \le \frac{\Delta}{2}, \qquad f(x, \Delta) = 0 \text{ elsewhere}$$


The Dirac delta δ(x) is a spike (a peak, a point mass) at x = 0. The function δ(x − x0), as a function of x, is a spike at x0. As a consequence of the definition, the delta function has the important property

$$\int_{-\infty}^{\infty} f(x)\,\delta(x - x_0)\, dx = f(x_0)$$

(cf. the Kronecker delta: $\sum_j \delta_{ij} v_j = v_i$).

The multivariate delta function factorizes over the dimensions

$$\delta(x - m) = \prod_{i=1}^{n} \delta(x_i - m_i)$$


Dirac’s delta-function / delta-distribution

The Dirac delta is actually a distribution rather than a function:

$$\delta(\alpha x) = \frac{1}{\alpha}\,\delta(x)$$

This is true since

• if x ≠ 0, the left- and right-hand sides are both zero.

• after the transformation of variables x′ = αx, dx′ = α dx, we have

$$\int \delta(\alpha x)\, dx = \frac{1}{\alpha} \int \delta(x')\, dx' = \frac{1}{\alpha}$$

in agreement with the right-hand side.


Functionals vs functions

Function f: for any input value y, returns an output value f(y).

Functional F : for any function y, returns an output value F [y].

Example (linear functional):

$$F[y] = \int p(x)\, y(x)\, dx$$

(Compare with $f(y) = \sum_i p_i y_i$.)

Other (nonlinear) example:

$$F[y] = \int \frac{1}{2}\big(y'(x) + V(x)\big)^2\, dx$$


Expectations and variances

Expectations

$$\mathbb{E}[f] = \langle f \rangle = \sum_x p(x)\, f(x) \quad \text{(discrete variables)}$$

$$\mathbb{E}[f] = \int p(x)\, f(x)\, dx \quad \text{(continuous variables)}$$

Variance:

$$\mathrm{var}[f] = \langle f(x)^2 \rangle - \langle f(x) \rangle^2, \qquad \mathrm{var}[x] = \langle x^2 \rangle - \langle x \rangle^2$$


Covariance

Covariance of two random variables

cov[x, y] = 〈xy〉 − 〈x〉 〈y〉

Covariance matrix of the components of a vector variable

$$\mathrm{cov}[x] \equiv \mathrm{cov}[x, x] = \left\langle x^T x \right\rangle - \left\langle x^T \right\rangle \left\langle x \right\rangle$$

with components

$$(\mathrm{cov}[x])_{ij} = \mathrm{cov}[x_i, x_j] = \langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle$$


Bayesian probabilities

• Classical or frequentist interpretation: probabilities in terms of frequencies of random repeatable events

• Bayesian view: probabilities as subjective beliefs about uncertain events

  – event not necessarily repeatable
  – event may yield only indirect observations
  – Bayesian inference to update belief given observations

Why probability theory?

• Cox: common sense axioms for degree of uncertainty → probability theory

• de Finetti: betting games with uncertain events → probability theory (otherwise face certain loss through a Dutch book)


Bayesian machine learning

Model parameters w: uncertain

• Prior assumptions and beliefs about model parameters: prior distribution p(w)
• Observed data = x1, . . . , xN = Data
• Probability of data given w: p(Data|w)

Apply Bayes’ rule to obtain the posterior distribution

$$p(w|\text{Data}) = \frac{p(\text{Data}|w)\, p(w)}{p(\text{Data})} \propto p(\text{Data}|w)\, p(w)$$

p(w): prior
p(Data|w): likelihood
p(Data) = ∫ p(Data|w) p(w) dw: evidence
→ p(w|Data): posterior


Bayesian vs frequentist viewpoint

• Bayesians: there is a single fixed dataset (the one observed), and a probability distribution of model parameters w which expresses a subjective belief, including uncertainty, about the 'true' model parameters.

- They need a prior belief p(w), and apply Bayes' rule to compute p(w|Data).

- Bayesians can talk about the frequency that, in repeated situations with this data, w has a certain value.

• Frequentists: there is a single (unknown) fixed model parameter vector w. They consider the probability distribution of possible datasets p(Data|w), although only one is actually observed.

- Frequentists can talk about the frequency that estimators w in similar experiments, each time with different datasets drawn from p(Data|w), are within a certain confidence interval around the true w.

- They cannot make a claim for this particular data set. This is the price for not having a 'prior'.


Toy example

Unknown binary parameter w = 0, 1. Data is binary, x = 0, 1, distributed according to P(x = w|w) = 0.9.

Given data x = 1 what can be said about w?

• Bayesian: needs a prior.

• Frequentist: estimator is w = x, so w = 1.

The frequentist provides a "confidence": in 90% of similar experiments with outcomes x (either 0 or 1), the estimate w = x is correct. This is true since P(x = w|w) = 0.9.


But what can be concluded for this particular case, where x = 1? Additional information is needed, e.g. a prior P(w); then

$$P(w = 1|x = 1) = \frac{0.9\, P(w = 1)}{0.1\, P(w = 0) + 0.9\, P(w = 1)}$$

Unless P(w) is uniform, the posterior differs from the 'confidence' result.

The posterior is conditioned on the actually observed data. Confidence is a statement about outcomes in similar experiments.

Medical example: Suppose w = 0, 1 is a disease state (absent/present). The disease is rare, say P(w = 1) ≪ 0.1. What is the most probable disease state if the patient has x = 1?
NB: the statement that w = x in 90% of the population is still correct.
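A minimal sketch (plain Python, not from the slides) of the posterior in this toy model. The rare-disease prior value 0.001 below is only an illustrative assumption:

```python
def posterior_w1_given_x1(prior_w1, accuracy=0.9):
    """P(w = 1 | x = 1) for the toy model P(x = w | w) = accuracy."""
    num = accuracy * prior_w1
    den = (1 - accuracy) * (1 - prior_w1) + accuracy * prior_w1
    return num / den

print(posterior_w1_given_x1(0.5))     # uniform prior: 0.9, matches the 'confidence'
print(posterior_w1_given_x1(0.001))   # rare disease: ~0.009, 'absent' is still far more probable
```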


Bayesian vs frequentists

• Prior: inclusion of prior knowledge may be useful. True reflection of knowledge, or convenient construct? A bad choice of prior can overconfidently lead to a poor result.

• Bayesian integrals cannot be calculated in general. Only approximate results are possible, requiring intensive numerical computations.

• Frequentist methods of 'resampling the data' (cross-validation, bootstrapping) are appealing.

• The Bayesian framework is transparent and consistent. Assumptions are explicit, inference is a mechanistic procedure (Bayesian machinery) and results have a clear interpretation.

This course places emphasis on the Bayesian approach. The frequentist maximum likelihood method is viewed as an approximation (more about that later).


Gaussian distribution

Normal distribution = Gaussian distribution

$$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

Specified by µ and σ2

[Figure: the Gaussian density N(x|µ, σ²) as a function of x, centered at µ]


The Gaussian is normalized,
$$\int_{-\infty}^{\infty} \mathcal{N}(x|\mu, \sigma^2)\, dx = 1$$

The mean (= first moment), second moment, and variance are:

$$\langle x \rangle = \int_{-\infty}^{\infty} x\, \mathcal{N}(x|\mu, \sigma^2)\, dx = \mu$$

$$\langle x^2 \rangle = \int_{-\infty}^{\infty} x^2\, \mathcal{N}(x|\mu, \sigma^2)\, dx = \mu^2 + \sigma^2$$

$$\mathrm{var}[x] = \langle x^2 \rangle - \langle x \rangle^2 = \sigma^2$$

(Exercise 1.8 and 1.9 - solutions on the web)


Multivariate Gaussian

In d dimensions

$$\mathcal{N}(x|\mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

x, µ are d-dimensional vectors.

Σ is a d× d covariance matrix, |Σ| is its determinant.

$$\mu = \langle x \rangle = \int x\, \mathcal{N}(x|\mu, \Sigma)\, dx$$

$$\Sigma = \left\langle (x - \mu)(x - \mu)^T \right\rangle = \int (x - \mu)(x - \mu)^T\, \mathcal{N}(x|\mu, \Sigma)\, dx$$


We can also write this in component notation:

$$\mu_i = \langle x_i \rangle = \int x_i\, \mathcal{N}(x|\mu, \Sigma)\, dx$$

$$\Sigma_{ij} = \langle (x_i - \mu_i)(x_j - \mu_j) \rangle = \int (x_i - \mu_i)(x_j - \mu_j)\, \mathcal{N}(x|\mu, \Sigma)\, dx$$

p(x) is specified by d(d + 1)/2 + d parameters.


Maximum likelihood estimation

Given a data set

Data = x1, . . . , xN

and a parametrized distribution

p(x|θ), θ = (θ1, . . . , θM),

find the value of θ that best describes the data.

The common approach is to assume that the data that we observe are drawn independently from p(x|θ) (independent and identically distributed = i.i.d.) for some unknown value of θ:

$$p(\text{Data}|\theta) = p(x_1, \ldots, x_N|\theta) = \prod_{i=1}^{N} p(x_i|\theta)$$


Then, the most likely θ is obtained by maximizing p(Data|θ) wrt θ:

$$\theta_{ML} = \operatorname{argmax}_\theta\, p(\text{Data}|\theta) = \operatorname{argmax}_\theta \prod_{i=1}^{N} p(x_i|\theta) = \operatorname{argmax}_\theta \sum_{i=1}^{N} \log p(x_i|\theta)$$

since log is a monotonically increasing function. In general, this is a complex optimisation problem.

Error function = − log-likelihood.

With i.i.d. data:

$$E(\theta; \text{Data}) = \sum_i E(\theta; x_i) = -\sum_{i=1}^{N} \log p(x_i|\theta)$$


The likelihood for the Gaussian

Consider data (x1, . . . , xN). The likelihood of the data under a Gaussian model is the probability of the data, assuming each data point is independently drawn from the Gaussian distribution:

$$p(x|\mu, \sigma) = \prod_{n=1}^{N} \mathcal{N}(x_n|\mu, \sigma^2) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^N \exp\left( -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \right)$$

[Figure: data points x_n under a Gaussian density p(x); each point contributes a factor N(x_n|µ, σ²) to the likelihood]


Maximum likelihood

Consider the log of the likelihood:

$$\ln p(x|\mu, \sigma) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln 2\pi$$

The values of µ and σ that maximize the likelihood are given by

$$\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2$$


Bias in the ML estimates

Note that µ_ML and σ²_ML are functions of the data. We can take their expectation value, assuming that the x_n are drawn from N(x|µ, σ²).

$$\langle \mu_{ML} \rangle = \frac{1}{N} \sum_{n=1}^{N} \langle x_n \rangle = \mu$$

$$\left\langle \sigma^2_{ML} \right\rangle = \frac{1}{N} \sum_{n=1}^{N} \left\langle (x_n - \mu_{ML})^2 \right\rangle = \ldots = \frac{N - 1}{N}\, \sigma^2$$


The variance is estimated too low. This is called a biased estimator. The bias disappears when N → ∞. In complex models with many parameters, the bias is more severe.
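A small simulation sketch (numpy assumed; not in the slides) of the bias: averaging σ²_ML over many datasets of size N gives roughly (N − 1)/N times the true variance:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, N, trials = 0.0, 4.0, 5, 100_000

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_ml = x.mean(axis=1, keepdims=True)
sigma2_ml = ((x - mu_ml) ** 2).mean(axis=1)   # ML estimate divides by N, not N-1

print(sigma2_ml.mean())        # close to (N-1)/N * sigma2 = 3.2
print((N - 1) / N * sigma2)    # 3.2
```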


Curve fitting re-visited

Now from a probabilistic perspective.

Given a new input x, the prediction is: $p(t|x, w_{ML}, \beta_{ML}) = \mathcal{N}(t\,|\,y(x, w_{ML}), \beta_{ML}^{-1})$


Curve fitting re-visited

A more Bayesian approach. Prior:

$$p(w|\alpha) = \mathcal{N}(w|0, \alpha^{-1} I) = \left( \frac{\alpha}{2\pi} \right)^{(M+1)/2} \exp\left( -\frac{\alpha}{2}\, w^T w \right)$$

M is the order of the polynomial, so w has M + 1 components. Variables such as α, controlling the distribution of parameters, are called 'hyperparameters'.

Using Bayes rule:

$$p(w|t, x, \alpha, \beta) = \frac{p(t|w, x, \alpha, \beta)\, p(w|x, \alpha, \beta)}{p(t|x, \alpha, \beta)} \propto p(t|w, x, \beta)\, p(w|\alpha)$$

$$-\ln p(w|t, x, \alpha, \beta) \propto \frac{\beta}{2} \sum_{n=1}^{N} \big(y(x_n, w) - t_n\big)^2 + \frac{\alpha}{2}\, w^T w$$

Maximizing the posterior wrt w yields w_MAP. This is similar to Eq. 1.4, with λ = α/β.


Bayesian curve fitting

Given the training data x, t we are not so much interested in w, but rather in the prediction of t for a new x: p(t|x, x, t). This is given by

$$p(t|x, \mathbf{x}, \mathbf{t}) = \int dw\; p(t|x, w)\, p(w|\mathbf{x}, \mathbf{t})$$

It is the average prediction of an ensemble of models p(t|x, w) parametrized by w and averaged wrt the posterior distribution p(w|x, t).

All quantities depend on α and β.


Bayesian curve fitting

Generalized linear model with 'basis functions', e.g. $\phi_i(x) = x^i$,

$$y(x, w) = \sum_i \phi_i(x)\, w_i = \phi(x)^T w$$

So: prediction given w is

p(t|x,w) = N (t|y(x,w), β−1) = N (t|φ(x)Tw, β−1)


Result Bayesian curve fitting

Predictive distribution

p(t|x,x, t) = N (t|m(x), s2(x))

$$m(x) = \beta\, \phi(x)^T S \sum_{n=1}^{N} \phi(x_n)\, t_n = \phi(x)^T w_{MAP}$$

$$s^2(x) = \beta^{-1} + \phi(x)^T S\, \phi(x)$$

$$S^{-1} = \alpha I + \beta \sum_{n=1}^{N} \phi(x_n)\, \phi(x_n)^T$$

Note that s² depends on x. The first term, as in the ML estimate, describes the noise in the target for fixed w. The second term describes the noise due to uncertainty in w.
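A compact sketch of these formulas for the polynomial basis φ_i(x) = x^i (numpy assumed; not part of the slides), using the α and β values quoted on the next slide:

```python
import numpy as np

def phi(x, M):
    """Polynomial basis phi_i(x) = x**i, i = 0..M, for an array of inputs."""
    return np.vander(np.atleast_1d(x), M + 1, increasing=True)

def bayes_fit(x, t, M, alpha, beta):
    Phi = phi(x, M)                                    # N x (M+1) design matrix
    S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)
    def predict(x_new):
        ph = phi(x_new, M)
        m = beta * ph @ S @ Phi.T @ t                  # predictive mean m(x)
        s2 = 1.0 / beta + np.sum(ph @ S * ph, axis=1)  # predictive variance s^2(x)
        return m, s2
    return predict

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(10)
m, s2 = bayes_fit(x, t, M=9, alpha=5e-3, beta=11.1)(np.linspace(0, 1, 5))
print(np.round(m, 2), np.round(np.sqrt(s2), 2))
```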


Example

[Figure: predictive distribution for the curve-fitting example, see the caption below]

Polynomial curve fitting with M = 9, α = 5 × 10⁻³, β = 11.1. Red: m(x) ± s(x). Note that $\sqrt{\beta^{-1}} = 0.3$.


Model selection/Cross validation

[Figure: S-fold cross-validation, runs 1 to 4]

If data is plentiful, use a separate validation set to select the model with the best generalization performance, and a third independent test set for the final evaluation.

Small validation set: use S-fold cross validation.

Information criteria: penalty for complex models
• Akaike IC (AIC): ln p(D|wML) − M
• Bayesian IC (BIC): Bayesian + crude approximations
• Full Bayesian → penalties arise automatically


Curse of dimensionality/Binning

So far, we considered x one-dimensional. How does pattern recognition work in higher dimensions?

[Figure: scatter plots of two components, x6 and x7, of the data; see the caption below]

Two components of 12-dimensional data that describe gamma ray measurements of a mixture of oil, water and gas. The mixture can be in three states: homogeneous (red), annular (green) and laminar (blue).

Classification of the 'x' can be done by putting a grid on the space and assigning the class that is most numerous in the particular box.


Curse of dimensionality/Binning

This approach scales exponentially with dimensions.

[Figure: binning of the input space in D = 1, 2 and 3 dimensions]


Curse of dimensionality/Polynomials

The polynomial function considered previously becomes in D dimensions:

$$y(x, w) = w_0 + \sum_{i=1}^{D} w_i x_i + \sum_{i=1}^{D}\sum_{j=1}^{D} w_{ij} x_i x_j + \sum_{i=1}^{D}\sum_{j=1}^{D}\sum_{k=1}^{D} w_{ijk} x_i x_j x_k$$

The number of coefficients scales as $D^M$ (impractically large).


Curse of dimensionality/Gaussians

Intuition developed in low dimensional space may be wrong in high dimensions.

[Figure: fraction of the volume of a unit sphere lying in a thin shell of thickness ε at the surface, for D = 1, 2, 5, 20]

$$V_D(r) = K_D\, r^D, \qquad \frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D$$

Spheres in high dimensions have most of their volume near the boundary.
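A one-line check (plain Python, not in the slides) of the shell fraction 1 − (1 − ε)^D for a thin shell of thickness ε = 0.1:

```python
eps = 0.1
for D in (1, 2, 5, 20):
    # fraction of the unit sphere's volume within distance eps of the surface
    print(D, round(1 - (1 - eps) ** D, 2))
# 1 0.1,  2 0.19,  5 0.41,  20 0.88
```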


Curse of dimensionality/Gaussians

[Figure: distribution p(r) of the radius r of a Gaussian for D = 1, 2, 20]

As a result, when the distribution of the radius of a Gaussian with standard deviation σ is considered in high dimensions, it is concentrated in a thin shell around $r \approx \sigma\sqrt{D}$.


Curse of dimensionality

Machine learning possible?

• Data often in low dimensional subspace.

– An object located in 3-D → images of the object are N-D → there should be a 3-D manifold (curved subspace)

• Smoothness, local interpolation


Decision theory

Inference: given pairs x, t, learn p(x, t) and estimate p(x, t) for a new value of x (and possibly all t).

Decision: for new value of x estimate ’best’ t.

An example is x an X-ray image and t a class label that indicates whether the patient has cancer or not.

Bayes’ theorem:

$$p(C_k|x) = \frac{p(x|C_k)\, p(C_k)}{p(x)}$$

p(C_k) is the prior probability of class C_k. p(C_k|x) is the posterior probability of class C_k after seeing the image x.


Decision theory

A classifier is specified by defining regions Rk, such that all x ∈ Rk are assigned to class Ck.

In the case of two classes, the probability that this classifier gives a correct classification is

$$p(\text{correct}) = p(x \in R_1, C_1) + p(x \in R_2, C_2) = \int_{R_1} dx\; p(x, C_1) + \int_{R_2} dx\; p(x, C_2)$$

[Figure: joint densities p(x, C1) and p(x, C2) with decision boundary x0 separating regions R1 and R2]

p(correct) is maximized when the regions Rk are chosen such that each x is assigned to the class

$$k = \operatorname{argmax}_k\, p(x, C_k) = \operatorname{argmax}_k\, p(C_k|x)$$


Decision theory/Expected loss

$$L = \begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix}$$

Rows are true classes (cancer, normal), columns are assigned classes (cancer, normal).

Typically, not all classification errors are equally bad: classifying a healthy patient as sick is not as bad as classifying a sick patient as healthy.

The probability of assigning an x to class j while it belongs to class k is p(x ∈ Rj, Ck). Thus the total expected loss is

$$\langle L \rangle = \sum_{jk} L_{jk}\, p(x \in R_j, C_k) = \sum_j \int_{R_j} dx \sum_k L_{jk}\, p(x, C_k)$$

⟨L⟩ is minimized if each x is assigned to the class j for which $\sum_k L_{jk}\, p(x, C_k)$ is minimal.
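A small sketch (numpy assumed; not from the slides) of the resulting decision rule with the loss matrix above (rows = true class, columns = assigned class): assign the class j that minimizes the expected loss summed over the true classes k:

```python
import numpy as np

# loss matrix from the slide: rows = true class (cancer, normal),
# columns = assigned class (cancer, normal)
L = np.array([[0, 1000],
              [1,    0]])

def decide(posterior):
    """posterior[k] = p(C_k | x); return the class j minimizing
    the expected loss sum_k L[k, j] * posterior[k]."""
    expected_loss = posterior @ L        # one entry per assigned class j
    return int(np.argmin(expected_loss)), expected_loss

# even at only 5% posterior probability of cancer, the heavy penalty for
# missing cancer makes 'cancer' (class 0) the loss-minimizing decision
print(decide(np.array([0.05, 0.95])))    # (0, array([ 0.95, 50.  ]))
```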


Decision theory/Reject option

x

p(C1|x) p(C2|x)

0.0

1.0θ

reject region

It may be that maxj p(Cj|x) is small, indicating that it is unclear to which class xbelongs.

One can introduce a threshold θ and only classify those x for which maxj p(Cj|x) > θ.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 83

Decision theory/Discriminant functions

Instead of first learning a probability model and then making a decision, one can also directly learn a decision rule (a classifier) without the intermediate step of a probability model.

A set of approaches:

• Learn a model for the class conditional probabilities p(x|C_k). Use Bayes' rule to compute p(C_k|x) and construct a classifier using decision theory. This approach is the most complex, but has the advantage of yielding a model of p(x) that can be used to reject unlikely inputs x.

• Learn the inference problem p(C_k|x) directly and construct a classifier using decision theory. This approach is simpler, since no input model p(x) is learned (see figure).

• Learn f(x), called a discriminant function, that maps x directly onto a class label 0, 1, 2, . . .. Even simpler, only the decision boundary is learned. But information on the expected classification error is lost.

Decision theory/Discriminant functions

[Figure: left, the class densities p(x|C_1) and p(x|C_2); right, the posteriors p(C_1|x) and p(C_2|x).]

Example that shows that detailed structure in the class conditional densities need not affect the posterior probabilities. Learning only the decision boundary is the simplest approach.

Decision theory/Discriminant functions

Advantages of learning a class conditional probability instead of a discriminant function:

Minimizing risk When minimizing expected loss, the loss matrix may change over time whereas the class probabilities may not (for instance, in a financial application).

Reject option One can reject uncertain class assignments.

Unbalanced data One can compensate for unbalanced data sets. For instance, in the cancer example, there may be 1000 times more healthy patients than cancer patients. Very good classification (99.9 % correct) is obtained by classifying everyone as healthy. Using the posterior probability one can compute p(C_k = cancer|x). Although this probability may be low, it may be significantly higher than p(C_k = cancer), indicating a risk of cancer.

Combining models Given models for p(C_k|x) and p(C_k|y) one has a principled approach to classify on both x and y. Naive Bayes assumption:

p(x, y|C_k) = p(x|C_k)\, p(y|C_k)

(given the disease state of the patient, the blood and X-ray values are independent). Then

p(C_k|x, y) ∝ p(x, y|C_k)\, p(C_k) ∝ p(x|C_k)\, p(y|C_k)\, p(C_k) ∝ \frac{p(C_k|x)\, p(C_k|y)}{p(C_k)}

Taylor series

Assuming that f(x) has derivatives of all orders in x = a, then the Taylor expansion of f around a is

f(a + \varepsilon) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!}\, \varepsilon^k = f(a) + \varepsilon f'(a) + \frac{\varepsilon^2}{2} f''(a) + \ldots

The prefactors in the Taylor series can be checked by computing the Taylor expansion of a polynomial.

Linearization of a function around a is taking the Taylor expansion up to first order:

f(a + x) \approx f(a) + x f'(a)

Taylor series, examples

Examples: check that for small x the following expansions are correct up to second order:

\sin(x) = x
\cos(x) = 1 - \tfrac{1}{2} x^2
\exp(x) = 1 + x + \tfrac{1}{2} x^2
(1 + x)^c = 1 + c x + \frac{c(c-1)}{2} x^2
\ln(1 + x) = x - \tfrac{1}{2} x^2
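A quick numerical sanity check of these second-order expansions (my own illustration, not from the slides); the values of x and c are arbitrary:

```python
import numpy as np

x = 0.1
c = 2.5
checks = {
    "sin":      (np.sin(x),       x),
    "cos":      (np.cos(x),       1 - 0.5 * x**2),
    "exp":      (np.exp(x),       1 + x + 0.5 * x**2),
    "(1+x)^c":  ((1 + x)**c,      1 + c * x + 0.5 * c * (c - 1) * x**2),
    "ln(1+x)":  (np.log(1 + x),   x - 0.5 * x**2),
}
for name, (exact, approx) in checks.items():
    # the error of a second-order expansion should be of order x^3
    print(f"{name:9s} exact={exact:.6f} approx={approx:.6f} error={abs(exact - approx):.2e}")
```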

Taylor expansion in several dimensions

The Taylor expansion of a function of several variables, f(x_1, \ldots, x_n) = f(x), is (up to second order)

f(x) = f(a) + \sum_i (x_i - a_i) \frac{\partial}{\partial x_i} f(a) + \frac{1}{2} \sum_{ij} (x_i - a_i)(x_j - a_j) \frac{\partial}{\partial x_i} \frac{\partial}{\partial x_j} f(a)

or in vector notation, with \varepsilon = x - a,

f(a + \varepsilon) = f(a) + \varepsilon^T \nabla f(a) + \frac{1}{2} \varepsilon^T H \varepsilon

with H the Hessian, which is the symmetric matrix of partial derivatives

H_{ij} = \left. \frac{\partial^2}{\partial x_i \partial x_j} f(x) \right|_{x = a}

Information theory

Information is a measure of the 'degree of surprise' that a certain value gives us. Unlikely events are informative, likely events less so. Certain events give us no additional information. Thus, information decreases with the probability of the event.

Let h(x) denote the information of x. Then if x, y are two independent events: h(x, y) = h(x) + h(y). Since p(x, y) = p(x)p(y) we see that

h(x) = -\log_2 p(x)

is a good candidate to quantify the information in x.

If x is observed repeatedly, then the expected information

H[x] = -\sum_x p(x) \log_2 p(x)

is the entropy of the distribution p.

Information theory

Example: x can have 8 values with equal probability, then H(x) = -8 \times \tfrac{1}{8} \log \tfrac{1}{8} = 3 bits.

Example: x can have 8 values with probabilities (\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{16}, \tfrac{1}{64}, \tfrac{1}{64}, \tfrac{1}{64}, \tfrac{1}{64}). Then

H(x) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{4}\log\tfrac{1}{4} - \tfrac{1}{8}\log\tfrac{1}{8} - \tfrac{1}{16}\log\tfrac{1}{16} - \tfrac{4}{64}\log\tfrac{1}{64} = 2 bits

smaller than for the uniform distribution.

We can encode x as a 3 bit binary number, in which case the expected code length is 3 bits. We can do better, by coding likely x with short code words and unlikely x with longer ones, for instance 0, 10, 110, 1110, 111100, 111101, 111110, 111111. Then

Av. code length = \tfrac{1}{2}\times 1 + \tfrac{1}{4}\times 2 + \tfrac{1}{8}\times 3 + \tfrac{1}{16}\times 4 + \tfrac{4}{64}\times 6 = 2 bits

Noiseless coding theorem: Entropy is a lower bound on the average number of bits needed to transmit a random variable (Shannon 1948).
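The two examples above are easy to reproduce numerically. A small sketch (mine, not from the slides):

```python
import numpy as np

def entropy(p):
    """Entropy in bits, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

uniform = np.full(8, 1 / 8)
skewed = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
code_lengths = np.array([1, 2, 3, 4, 6, 6, 6, 6])   # lengths of the code words 0, 10, 110, ...

print(entropy(uniform))                  # 3.0 bits
print(entropy(skewed))                   # 2.0 bits
print(np.sum(skewed * code_lengths))     # average code length: 2.0 bits, matching the entropy
```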

Information theory

[Figure: two probability histograms, one narrowly peaked with H = 1.77 and one broad with H = 3.09.]

When x has values x_i, i = 1, \ldots, M, then

H[x] = -\sum_i p(x_i) \log p(x_i)

When p is sharply peaked (p(x_1) = 1, p(x_2) = \ldots = p(x_M) = 0) then the entropy is

H[x] = -1 \log 1 - (M - 1)\, 0 \log 0 = 0

When p is flat (p(x_i) = 1/M) the entropy is maximal:

H[x] = -M \frac{1}{M} \log \frac{1}{M} = \log M

Information theory/Maximum entropy

For p(x) a distribution density over a continuous value x we define the entropy as

H[x] = -\int p(x) \log p(x)\, dx

Suppose that all we know about p is its mean µ and its variance σ². We can ask ourselves what is the form of the distribution p that is as broad as possible, i.e. maximizes the entropy.

The solution is the Gaussian distribution N(x|µ, σ²) (exercise 1.34 and 1.35).

Information theory/KL-divergence

Relative entropy or Kullback-Leibler divergence or KL-divergence:

KL(p||q) = -\sum_i p_i \ln q_i - \left( -\sum_i p_i \ln p_i \right) = -\sum_i p_i \ln \frac{q_i}{p_i}

• Additional amount of information required to specify i when q is used for coding rather than the true distribution p.

• Divergence between 'true' distribution p and 'approximate' distribution q.

• KL(p||q) ≠ KL(q||p)

• KL(p||q) ≥ 0, and KL(p||q) = 0 ⇔ p = q (use convex functions)

• With continuous variables: KL(p||q) = -\int p(x) \ln \frac{q(x)}{p(x)}\, dx
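A short illustration (mine, not from the slides) of the discrete KL-divergence and its asymmetry; the two distributions are arbitrary examples:

```python
import numpy as np

def kl(p, q):
    """KL(p||q) in nats for discrete distributions with q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([1/2, 1/4, 1/8, 1/8])
q = np.array([1/4, 1/4, 1/4, 1/4])

print(kl(p, q), kl(q, p))   # both nonnegative, and in general not equal
print(kl(p, p))             # 0.0: the divergence vanishes only when the distributions coincide
```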

Convex functions

[Figure: a convex function f(x) with a chord between the points a and b lying above the function.]

f(\lambda a + (1 - \lambda) b) \le \lambda f(a) + (1 - \lambda) f(b)

• Examples: f(x) = x^2, f(x) = -\ln(x) and f(x) = x \ln(x) (exercise).

• Convex: ∪ shaped. Concave: ∩ shaped.

• Convex ⇔ second derivative positive.

Convex functions/Jensen's inequality

Convex functions satisfy Jensen's inequality

f\!\left( \sum_{i=1}^M \lambda_i x_i \right) \le \sum_{i=1}^M \lambda_i f(x_i)

where \lambda_i \ge 0, \sum_i \lambda_i = 1, for any set of points x_i.

In other words: f(\langle x \rangle) \le \langle f(x) \rangle

Now, since -\ln(x) is convex, we can apply Jensen with \lambda_i = p_i:

KL(p||q) = -\sum_i p_i \ln\!\left( \frac{q_i}{p_i} \right) \ge -\ln\!\left( \sum_i p_i \frac{q_i}{p_i} \right) = -\ln\!\left( \sum_i q_i \right) = 0

Information theory and density estimation

Relation with maximum likelihood:

Empirical distribution:

p(x) = \frac{1}{N} \sum_{n=1}^N \delta(x - x_n)

Approximating distribution (model): q(x|\theta)

KL(p||q) = -\int p(x) \ln q(x|\theta)\, dx - \left( -\int p \ln p\, dx \right)
         = -\frac{1}{N} \sum_{n=1}^N \ln q(x_n|\theta) + \text{const.}

Minimizing KL(p||q) with respect to θ is therefore the same as maximizing the likelihood of the data under the model q.

Information theory/mutual information

Mutual information between x and y: the KL divergence between the joint distribution p(x, y) and the product of marginals p(x)p(y),

I[x, y] ≡ KL(p(x, y)||p(x)p(y)) = -\int\!\!\int p(x, y) \ln\!\left( \frac{p(x)p(y)}{p(x, y)} \right) dx\, dy

• I[x, y] ≥ 0, with equality iff x and y are independent.

Relation with conditional entropy:

I[x, y] = H[x] - H[x|y] = H[y] - H[y|x]

Lagrange multipliers

Minimize f(x) under the constraint g(x) = 0.

Fancy formulation: define the Lagrangian

L(x, \lambda) = f(x) + \lambda g(x)

λ is called a Lagrange multiplier.

The constrained minimization of f w.r.t. x is equivalent to the unconstrained minimization of \max_\lambda L(x, \lambda) w.r.t. x. The maximization w.r.t. λ yields the following function of x:

\max_\lambda L(x, \lambda) = f(x)   if g(x) = 0
\max_\lambda L(x, \lambda) = \infty   otherwise

Lagrange multipliers

Under certain conditions, in particular f(x) convex (i.e. the matrix of second derivatives positive definite) and g(x) linear,

\min_x \max_\lambda L(x, \lambda) = \max_\lambda \min_x L(x, \lambda)

Procedure:

1. Minimize L(x, \lambda) w.r.t. x, e.g. by taking the gradient and setting it to zero. This yields a (parametrized) solution x(\lambda).

2. Maximize L(x(\lambda), \lambda) w.r.t. \lambda. The solution \lambda^* is precisely such that g(x(\lambda^*)) = 0.

3. The solution of the constrained optimization problem is

x^* = x(\lambda^*)

Example

[Figure: the constraint line g(x_1, x_2) = 0 in the (x_1, x_2) plane, with the constrained optimum (x_1^*, x_2^*).]

f(x_1, x_2) = 1 - x_1^2 - x_2^2 and constraint g(x_1, x_2) = x_1 + x_2 - 1 = 0

Lagrangian: L(x_1, x_2, \lambda) = 1 - x_1^2 - x_2^2 + \lambda(x_1 + x_2 - 1)

Setting the derivatives of L w.r.t. x_i to zero gives x_i(\lambda) = \tfrac{1}{2}\lambda.

Plug into the constraint: x_1(\lambda) + x_2(\lambda) - 1 = \lambda - 1 = 0.

So \lambda = 1 and x_i^* = \tfrac{1}{2}.
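This example can be checked symbolically. A minimal sketch (mine, not from the slides), solving the stationarity conditions of the Lagrangian together with the constraint:

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam', real=True)
f = 1 - x1**2 - x2**2
g = x1 + x2 - 1
L = f + lam * g

# Stationarity of the Lagrangian in x1, x2 plus the constraint g = 0
solution = sp.solve([sp.diff(L, x1), sp.diff(L, x2), g], [x1, x2, lam], dict=True)
print(solution)   # [{x1: 1/2, x2: 1/2, lam: 1}], matching the result above
```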

Some remarks

• Works as well for maximization (of concave functions) under constraints. The procedure is essentially the same.

• The sign in front of the λ can be chosen as you want:

L(x, \lambda) = f(x) + \lambda g(x)   or   L(x, \lambda) = f(x) - \lambda g(x)

work equally well.

• More constraints? For each constraint g_i(x) = 0 a Lagrange multiplier \lambda_i, so

L(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)

• Similar methods apply for inequality constraints g(x) ≥ 0 (restricts λ).

Chapter 2

Probability distributions

• Density estimation

• Parametric distributions

• Maximum likelihood, Bayesian inference, conjugate priors

• Bernoulli (binary), Beta, Gaussian, ..., exponential family

• Nonparametric distributions

Binary variables

x ∈ {0, 1}

p(x = 1|µ) = µ,   p(x = 0|µ) = 1 - µ

Bernoulli distribution: Bern(x|µ) = µ^x (1 - µ)^{1-x}

Mean and variance:

IE[x] = µ
var[x] = µ(1 - µ)

Likelihood

Data set (i.i.d.) D = {x_1, \ldots, x_N}, with x_i ∈ {0, 1}. Likelihood:

p(D|µ) = \prod_n p(x_n|µ) = \prod_n µ^{x_n} (1 - µ)^{1 - x_n}

Log likelihood:

\ln p(D|µ) = \sum_n \ln p(x_n|µ) = \sum_n \left( x_n \ln µ + (1 - x_n) \ln(1 - µ) \right) = m \ln µ + (N - m) \ln(1 - µ)

where m = \sum_n x_n, the total number of x_n = 1.

Maximization w.r.t. µ gives the maximum likelihood solution:

µ_{ML} = \frac{m}{N}

The beta distribution

Distribution for the parameter µ. Conjugate prior for the Bayesian treatment of this problem.

[Figure: Beta(µ|a, b) for (a, b) = (0.1, 0.1), (1, 1), (2, 3) and (8, 4).]

The beta distribution

Beta(µ|a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, µ^{a-1} (1 - µ)^{b-1} \propto µ^{a-1} (1 - µ)^{b-1},   0 ≤ µ ≤ 1

Normalisation:

\int_0^1 Beta(µ|a, b)\, dµ = 1

Mean and variance:

IE[µ] = \frac{a}{a + b}
var[µ] = \frac{ab}{(a + b)^2 (a + b + 1)}

Bayesian inference with binary variables

Prior:

p(µ) = Beta(µ|a, b) \propto µ^{a-1} (1 - µ)^{b-1}

Likelihood – data set (i.i.d.) D = {x_1, \ldots, x_N}, with x_i ∈ {0, 1}. Assume m ones and l zeros (m + l = N):

p(D|µ) = \prod_n p(x_n|µ) = \prod_n µ^{x_n} (1 - µ)^{1 - x_n} = µ^m (1 - µ)^l

Posterior:

p(µ|D) \propto p(D|µ)\, p(µ) = µ^m (1 - µ)^l \times µ^{a-1} (1 - µ)^{b-1} = µ^{m+a-1} (1 - µ)^{l+b-1} \propto Beta(µ|a + m, b + l)

Bayesian inference with binary variables

Interpretation: hyperparameters a and b are effective numbers of ones and zeros.

Data: increments of these parameters.

Conjugacy:
(1) the prior has the same form as the likelihood function;
(2) this form is preserved in the product (the posterior).

[Figure: prior, likelihood function and posterior, each as a function of µ.]

Posterior interpreted as updated prior: sequential learning.

Bayesian inference with binary variables

Prediction of the next data point given data D:

p(x = 1|D) = \int_0^1 p(x = 1|µ)\, p(µ|D)\, dµ = \int_0^1 µ\, p(µ|D)\, dµ = IE[µ|D]

With the posterior Beta(µ|a + m, b + l), and IE[µ|a, b] = a/(a + b), we find

p(x = 1|D) = \frac{m + a}{m + a + l + b}
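A compact sketch (not from the slides) of this conjugate update and the predictive probability; the hyperparameters and data are arbitrary example values:

```python
import numpy as np

def beta_bernoulli_update(a, b, data):
    """Return posterior Beta parameters after observing binary data."""
    data = np.asarray(data)
    m = int(data.sum())          # number of ones
    l = int(len(data) - m)       # number of zeros
    return a + m, b + l

a0, b0 = 2.0, 2.0                # hyperparameters: pseudo-counts of ones and zeros
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])

aN, bN = beta_bernoulli_update(a0, b0, data)
print("posterior Beta parameters:", aN, bN)
print("ML estimate m/N:          ", data.mean())
print("predictive p(x=1|D):      ", aN / (aN + bN))   # (m+a)/(m+a+l+b)
```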

Multinomial variables

Alternative representation: x ∈ {v_1, v_2}, parameter vector µ = (µ_1, µ_2), where µ_1 + µ_2 = 1.

p(x = v_k|µ) = µ_k

In fancy notation:

p(x|µ) = \prod_k µ_k^{\delta_{x, v_k}}

Generalizes to multinomial variables: x ∈ {v_1, \ldots, v_K},

µ = (µ_1, \ldots, µ_K),   \sum_k µ_k = 1

Maximum likelihood

Likelihood:

p(D|µ) = \prod_n p(x_n|µ) = \prod_n \prod_k µ_k^{\delta_{x_n, v_k}} = \prod_k µ_k^{\sum_n \delta_{x_n, v_k}} = \prod_k µ_k^{m_k}

with m_k = \sum_n \delta_{x_n, v_k}, the total number of data points with value v_k. Log likelihood:

\ln p(D|µ) = \sum_k m_k \ln µ_k

Maximize with constraints using Lagrange multipliers.

Dirichlet distribution

Dir(µ|α) \propto \prod_k µ_k^{\alpha_k}

Probability distribution on the simplex:

S_K = \left\{ (µ_1, \ldots, µ_K) \;\middle|\; 0 ≤ µ_k ≤ 1,\; \sum_{k=1}^K µ_k = 1 \right\}

Bayesian inference: prior Dir(µ|α) + data counts m → posterior Dir(µ|α + m).

Parameters α: 'pseudocounts'.

Gaussian

Gaussian distribution:

N(x|µ, σ^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - µ)^2}{2\sigma^2} \right)

Specified by µ and σ².

In d dimensions:

N(x|µ, Σ) = \frac{1}{(2\pi)^{d/2} |Σ|^{1/2}} \exp\!\left( -\frac{1}{2} (x - µ)^T Σ^{-1} (x - µ) \right)

Central limit theorem

The sum of a large set of random variables is approximately Gaussian distributed.

Let X_i be a set of random variables with mean µ and variance σ².

The sum of the first n variables is S_n = X_1 + \ldots + X_n. Now if n → ∞:

Law of large numbers: the mean of the sum, Y_n = S_n / n, converges to µ.

Central limit theorem: the distribution of

Z_n = \frac{S_n - nµ}{\sigma\sqrt{n}}

converges to the Gaussian N(Z_n|0, 1).

[Figure: histograms of the distribution of the sample mean for N = 1, 2 and 10; the distribution approaches a Gaussian as N increases.]
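A small simulation of this statement (my own sketch, not from the slides), using uniform random variables; sample sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, var = 0.5, 1.0 / 12.0          # mean and variance of a uniform [0, 1] variable

for n in (1, 2, 10, 100):
    X = rng.uniform(0.0, 1.0, size=(200000, n))
    S = X.sum(axis=1)
    Z = (S - n * mu) / np.sqrt(n * var)
    # As n grows, Z approaches a standard Gaussian: mean 0, variance 1 and vanishing skewness.
    print(f"n={n:4d}  mean={Z.mean():+.3f}  var={Z.var():.3f}  skew={np.mean(Z**3):+.3f}")
```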

Symmetric matrices

Symmetric matrices: A_{ij} = A_{ji}, A^T = A.

The inverse of a matrix is a matrix A^{-1} such that A^{-1}A = AA^{-1} = I, where I is the identity matrix.

A^{-1} is also symmetric:

I = I^T = (A^{-1}A)^T = A^T (A^{-1})^T = A (A^{-1})^T

Thus, (A^{-1})^T = A^{-1}.

Eigenvalues

A symmetric real-valued d×d matrix has d real eigenvalues λ_k and d eigenvectors u_k:

A u_k = λ_k u_k,   k = 1, \ldots, d

or

(A - λ_k I) u_k = 0

Solution of this equation for non-zero u_k requires λ_k to satisfy the characteristic equation:

det(A - λI) = 0

This is a polynomial equation in λ of degree d and thus has d solutions λ_1, \ldots, λ_d. (In general, the solutions are complex. It can be shown that for symmetric matrices the solutions are in fact real.)

Eigenvectors

Consider two different eigenvectors k and j. Multiply the k-th eigenvalue equation by u_j from the left:

u_j^T A u_k = λ_k u_j^T u_k

Multiply the j-th eigenvalue equation by u_k from the left:

u_k^T A u_j = λ_j u_k^T u_j = λ_j u_j^T u_k

Subtract:

(λ_k - λ_j)\, u_j^T u_k = 0

Thus, eigenvectors with different eigenvalues are orthogonal.

If λ_k = λ_j then any linear combination is also an eigenvector:

A(αu_k + βu_j) = λ_k(αu_k + βu_j)

This can be used to choose eigenvectors with identical eigenvalues orthogonal.

If u_k is an eigenvector of A, then αu_k is also an eigenvector of A. Thus, we can make all eigenvectors have length one: u_k^T u_k = 1.

In summary,

u_j^T u_k = δ_{jk}

with δ_{jk} the Kronecker delta, which is equal to 1 if j = k and zero otherwise.

The eigenvectors span the d-dimensional space as an orthonormal basis.

Orthogonal matrices

Write U = (u_1, \ldots, u_d).

U is an orthogonal matrix, i.e.

U^T U = I

since (U^T U)_{ij} = \sum_k (U^T)_{ik} U_{kj} = \sum_k U_{ki} U_{kj} = \sum_k (u_i)_k (u_j)_k = u_i^T u_j = δ_{ij}.

For orthogonal matrices,

U^T U U^{-1} = U^{-1} = U^T

So UU^T = I, i.e. the transpose is orthogonal as well (note that U is in general not symmetric).

Furthermore,

det(UU^T) = 1 ⇒ det(U) det(U^T) = 1 ⇒ det(U) = ±1

Orthogonal matrices implement rigid rotations, i.e. they are length and angle preserving:

\tilde{x}_1 = U x_1,   \tilde{x}_2 = U x_2

then

\tilde{x}_1^T \tilde{x}_2 = x_1^T U^T U x_2 = x_1^T x_2

Diagonalization

The eigenvector equation can be written as

A U = A(u_1, \ldots, u_d) = (A u_1, \ldots, A u_d) = (λ_1 u_1, \ldots, λ_d u_d) = (u_1, \ldots, u_d) \begin{pmatrix} λ_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & λ_d \end{pmatrix} = U Λ

By right-multiplying by U^T we obtain the important result

A = U Λ U^T

which can also be written as an 'expansion in eigenvectors'

A = \sum_{k=1}^d λ_k u_k u_k^T

Applications

A = U Λ U^T ⇒ A^2 = U Λ U^T U Λ U^T = U Λ^2 U^T

A^n = U Λ^n U^T,   A^{-n} = U Λ^{-n} U^T

The determinant is the product of the eigenvalues:

det(A) = det(U Λ U^T) = det(U) det(Λ) det(U^T) = \prod_k λ_k
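A minimal numpy illustration (my own) of the eigendecomposition A = UΛUᵀ and the identities above, on a random symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = B + B.T                                  # a random symmetric matrix

lam, U = np.linalg.eigh(A)                   # real eigenvalues and orthonormal eigenvectors (columns of U)
Lam = np.diag(lam)

print(np.allclose(A, U @ Lam @ U.T))                                 # A = U Lambda U^T
print(np.allclose(U.T @ U, np.eye(4)))                               # U is orthogonal
print(np.allclose(np.linalg.det(A), lam.prod()))                     # det(A) = product of eigenvalues
print(np.allclose(np.linalg.matrix_power(A, 3), U @ Lam**3 @ U.T))   # A^n = U Lambda^n U^T
```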

Basis transformation

We can represent an arbitrary vector x in d dimensions on a new basis U as

x = U U^T x = \sum_{k=1}^d u_k (u_k^T x)

[Figure: a vector x decomposed onto the orthonormal basis vectors u_1, u_2 with components (x·u_1) and (x·u_2).]

The numbers \tilde{x}_k = u_k^T x are the components of x on the basis u_k, k = 1, \ldots, d, i.e. on the new basis the vector has components U^T x. If the matrix A is the representation of a linear transformation on the old basis, the matrix with components

A' = U^T A U

is the representation on the new basis.

For instance, if y = A x, \tilde{x} = U^T x, \tilde{y} = U^T y, then

A' \tilde{x} = U^T A U U^T x = U^T A x = U^T y = \tilde{y}

So a matrix is diagonal on a basis of its eigenvectors:

A' = U^T A U = U^T U Λ U^T U = Λ

Multivariate Gaussian

In d dimensions

N(x|µ, Σ) = \frac{1}{(2\pi)^{d/2} |Σ|^{1/2}} \exp\!\left( -\frac{1}{2} (x - µ)^T Σ^{-1} (x - µ) \right)

µ = \langle x \rangle = \int x\, N(x|µ, Σ)\, dx

Σ = \langle (x - µ)(x - µ)^T \rangle = \int (x - µ)(x - µ)^T\, N(x|µ, Σ)\, dx

We can also write this in component notation:

µ_i = \langle x_i \rangle = \int x_i\, N(x|µ, Σ)\, dx

Σ_{ij} = \langle (x_i - µ_i)(x_j - µ_j) \rangle = \int (x_i - µ_i)(x_j - µ_j)\, N(x|µ, Σ)\, dx

Multivariate Gaussian

Spectral (eigenvalue) representation of Σ:

Σ = U Λ U^T = \sum_k λ_k u_k u_k^T
Σ^{-1} = U Λ^{-1} U^T = \sum_k λ_k^{-1} u_k u_k^T

With y = U^T(x - µ),

(x - µ)^T Σ^{-1} (x - µ) = y^T Λ^{-1} y

[Figure: contour of constant density of a two-dimensional Gaussian centred at µ; the principal axes are the eigenvectors u_1, u_2 with lengths λ_1^{1/2}, λ_2^{1/2}.]

Multivariate Gaussian

Explain the normalization: use the transformation y = U^T(x - µ) = U^T z

\int \exp\!\left( -\frac{1}{2} (x - µ)^T Σ^{-1} (x - µ) \right) dx
= \int \exp\!\left( -\frac{1}{2} z^T Σ^{-1} z \right) dz
= \int \left| \det\!\left( \frac{\partial z}{\partial y} \right) \right| \exp\!\left( -\frac{1}{2} y^T Λ^{-1} y \right) dy
= \int \exp\!\left( -\frac{1}{2λ_1} y_1^2 \right) dy_1 \cdots \int \exp\!\left( -\frac{1}{2λ_d} y_d^2 \right) dy_d
= \sqrt{2\pi λ_1} \cdots \sqrt{2\pi λ_d} = (2\pi)^{d/2} \left( \prod_i λ_i \right)^{1/2} = (2\pi)^{d/2} \det(Σ)^{1/2}

The multivariate Gaussian becomes a product of independent factors on the basis of eigenvectors.

Multivariate Gaussian

Compute the expectation value: use the shift transformation z = x - µ. Take Z as the normalisation constant.

Use the symmetry f(z) = -f(-z) ⇒ \int f(z)\, dz = 0:

IE[x] = \frac{1}{Z} \int \exp\!\left( -\frac{1}{2}(x - µ)^T Σ^{-1} (x - µ) \right) x\, dx
      = \frac{1}{Z} \int \exp\!\left( -\frac{1}{2} z^T Σ^{-1} z \right) (z + µ)\, dz
      = µ

Multivariate Gaussian

Second order moment: first, shift by z = x - µ

IE[x x^T] = \int N(x|µ, Σ)\, x x^T\, dx
          = \int N(z|0, Σ)\, (z + µ)(z + µ)^T\, dz
          = \int N(z|0, Σ)\, (z z^T + z µ^T + µ z^T + µ µ^T)\, dz
          = \int N(z|0, Σ)\, z z^T\, dz + µ µ^T

Multivariate Gaussian

Now use the transformation y = U^T z, and use Σ = U Λ U^T:

\int N(z|0, Σ)\, z z^T\, dz = \int N(y|0, Λ)\, U y y^T U^T\, dy = U \left( \int N(y|0, Λ)\, y y^T\, dy \right) U^T

Component-wise computation shows \int N(y|0, Λ)\, y y^T\, dy = Λ:

i ≠ j → \int N(y|0, Λ)\, y_i y_j\, dy = \int N(y_i|0, λ_i)\, y_i\, dy_i \int N(y_j|0, λ_j)\, y_j\, dy_j = 0

i = j → \int N(y|0, Λ)\, y_i^2\, dy = \int N(y_i|0, λ_i)\, y_i^2\, dy_i = λ_i

So

\int N(z|0, Σ)\, z z^T\, dz = U Λ U^T = Σ

Multivariate Gaussian

So, the second moment is

IE[x x^T] = Σ + µ µ^T

Covariance:

cov[x] = IE[x x^T] - IE[x]\, IE[x]^T = Σ

Multivariate Gaussian

The Gaussian covariance has d(d + 1)/2 parameters, the mean has d parameters.

The number of parameters is quadratic in d, which may be too large for high dimensional applications.

[Figure: contours of a two-dimensional Gaussian with (a) a general covariance, (b) a diagonal covariance, (c) an isotropic covariance.]

Common simplifications: Σ_{ij} = Σ_{ii} δ_{ij} (2d parameters) or Σ_{ij} = σ^2 δ_{ij} (d + 1 parameters).

Conditional Gaussians

Exponent in the Gaussian N(x|µ, Σ): a quadratic form

-\frac{1}{2}(x - µ)^T Σ^{-1} (x - µ) = -\frac{1}{2} x^T Σ^{-1} x + x^T Σ^{-1} µ + c = -\frac{1}{2} x^T K x + x^T K µ + c

Precision matrix K = Σ^{-1}. NB: the matrix inverse is not component-wise, e.g.

K_{aa} = (Σ_{aa} - Σ_{ab} Σ_{bb}^{-1} Σ_{ba})^{-1}

Conditional:

p(x_a|x_b) = \frac{p(x_a, x_b)}{p(x_b)} \propto p(x_a, x_b)

Exponent of the conditional: collect all terms with x_a, ignore constants, regard x_b as constant, and write in quadratic form as above:

-\frac{1}{2} x_a^T K_{a|b}\, x_a + x_a^T K_{a|b}\, µ_{a|b} = -\frac{1}{2} x_a^T K_{aa}\, x_a + x_a^T K_{aa}\, µ_a - x_a^T K_{ab}(x_b - µ_b)
= -\frac{1}{2} x_a^T K_{aa}\, x_a + x_a^T K_{aa} \left( µ_a - K_{aa}^{-1} K_{ab}(x_b - µ_b) \right)

so K_{a|b} = K_{aa} and µ_{a|b} = µ_a - K_{aa}^{-1} K_{ab}(x_b - µ_b).
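A numerical sketch (mine, not from the slides) of conditioning a two-dimensional Gaussian on x_b using the formulas above; the partition and parameter values are arbitrary examples:

```python
import numpy as np

mu = np.array([1.0, -1.0])                    # (mu_a, mu_b)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
K = np.linalg.inv(Sigma)                      # precision matrix

xb = 0.7                                      # observed value of x_b
Kaa, Kab = K[0, 0], K[0, 1]

mu_a_given_b = mu[0] - (1.0 / Kaa) * Kab * (xb - mu[1])   # mu_a - Kaa^{-1} Kab (xb - mu_b)
var_a_given_b = 1.0 / Kaa                                  # conditional variance = Kaa^{-1}

# Equivalent form via the covariance blocks: mu_a + Sigma_ab Sigma_bb^{-1} (xb - mu_b),
# Sigma_aa - Sigma_ab Sigma_bb^{-1} Sigma_ba
print(mu_a_given_b, var_a_given_b)
print(mu[0] + Sigma[0, 1] / Sigma[1, 1] * (xb - mu[1]),
      Sigma[0, 0] - Sigma[0, 1]**2 / Sigma[1, 1])
```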

Marginal and conditional Gaussians

The marginal and conditional of a Gaussian are also Gaussian.

[Figure: left, contours of the joint p(x_a, x_b) with the line x_b = 0.7; right, the marginal p(x_a) and the conditional p(x_a|x_b = 0.7).]

p(x_a|x_b) = N(x_a|µ_{a|b}, K_{aa}^{-1})
p(x_a) = N(x_a|µ_a, Σ_{aa})

Bayesian inference for the Gaussian

Aim: inference of the unknown parameter µ. Assume σ given.

Likelihood of µ with one data point:

p(x|µ) = N(x|µ, σ^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2\sigma^2}(x - µ)^2 \right)

Likelihood of µ with the data set:

p(x_1, \ldots, x_N|µ) = \prod_{n=1}^N p(x_n|µ) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^N \exp\!\left( -\frac{1}{2\sigma^2} \sum_n (x_n - µ)^2 \right)
= \exp\!\left( -\frac{Nµ^2}{2\sigma^2} + \frac{µ}{\sigma^2} \sum_n x_n + \text{const.} \right) = \exp\!\left( -\frac{N}{2\sigma^2} µ^2 + \frac{N\bar{x}}{\sigma^2} µ + \text{const.} \right)
= \exp\!\left( -\frac{N}{2\sigma^2} (µ - \bar{x})^2 + \text{const.} \right)   with \bar{x} = \frac{1}{N} \sum_n x_n

Bayesian inference for the Gaussian

Likelihood:

p(\text{Data}|µ) = \exp\!\left( -\frac{N}{2\sigma^2} µ^2 + \frac{N\bar{x}}{\sigma^2} µ + \text{const.} \right)

Prior:

p(µ) = N(µ|µ_0, σ_0^2) = \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\!\left( -\frac{1}{2\sigma_0^2}(µ - µ_0)^2 \right) = \exp\!\left( -\frac{1}{2\sigma_0^2} µ^2 + \frac{µ_0}{\sigma_0^2} µ + \text{const.} \right)

µ_0, σ_0 are hyperparameters. Large σ_0 = large prior uncertainty in µ.

p(µ|\text{Data}) \propto p(\text{Data}|µ)\, p(µ)

Product of Gaussian potentials

Gaussian potential: an unnormalized Gaussian

φ_1(x) = \exp\!\left( -\frac{a}{2}(x - b)^2 + \text{const.} \right) = \exp\!\left( -\frac{a}{2} x^2 + a b x + \text{const.} \right)
φ_2(x) = \exp\!\left( -\frac{c}{2} x^2 + c d x + \text{const.} \right)

⇒

φ_1(x)\, φ_2(x) = \exp\!\left( -\frac{a + c}{2} x^2 + (a b + c d) x + \text{const.} \right)
= \exp\!\left( -\frac{a + c}{2} x^2 + (a + c) \frac{a b + c d}{a + c} x + \text{const.} \right)
= \exp\!\left( -\frac{a + c}{2} \left( x - \frac{a b + c d}{a + c} \right)^2 + \text{const.} \right)

With normalization, the product p(x) ∝ φ_1(x) φ_2(x) is a Gaussian.

The posterior is proportional to the product of two Gaussian potentials:

p(\text{Data}|µ) \propto N\!\left( µ \,\middle|\, \bar{x}, \frac{\sigma^2}{N} \right)

p(µ) = N(µ|µ_0, σ_0^2) = N\!\left( µ \,\middle|\, µ_0, \frac{\sigma_0^2}{\sigma^2}\, \sigma^2 \right) = N\!\left( µ \,\middle|\, µ_0, \frac{\sigma^2}{N_0} \right)

Interpretation: µ_0 is the mean of pseudo-data; N_0 = \sigma^2 / \sigma_0^2 is the effective number of pseudocounts.

p(µ|\text{Data}) = N(µ|µ_N, σ_N^2)

with

µ_N = \frac{\frac{N}{\sigma^2}\bar{x} + \frac{N_0}{\sigma^2}µ_0}{\frac{N}{\sigma^2} + \frac{N_0}{\sigma^2}} = \frac{N\bar{x} + N_0 µ_0}{N + N_0}

\frac{1}{\sigma_N^2} = \frac{N}{\sigma^2} + \frac{N_0}{\sigma^2} = \frac{N + N_0}{\sigma^2}

[Figure: the posterior p(µ|Data) for N = 0, 1, 2, 10 data points; it narrows around the sample mean as N grows.]

For N → ∞: µ_N → \bar{x}, σ_N^2 → 0,

i.e. the posterior distribution is a peak around the ML solution, so Bayesian inference and ML coincide.
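A small sketch (my own) of this posterior update for the mean of a Gaussian with known σ; the prior settings and data generator are arbitrary:

```python
import numpy as np

sigma = 0.5                      # known observation noise (standard deviation)
mu0, sigma0 = 0.0, 1.0           # prior N(mu | mu0, sigma0^2)

rng = np.random.default_rng(2)
data = rng.normal(0.8, sigma, size=10)

N = len(data)
xbar = data.mean()
N0 = sigma**2 / sigma0**2        # effective number of pseudo-observations

mu_N = (N * xbar + N0 * mu0) / (N + N0)
var_N = sigma**2 / (N + N0)      # 1/sigma_N^2 = (N + N0)/sigma^2

print(f"ML estimate {xbar:.3f}, posterior mean {mu_N:.3f}, posterior std {np.sqrt(var_N):.3f}")
```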

Bayes' theorem for Gaussian variables

Useful relations (see Bishop, p. 93, Eqs. 2.113 to 2.117):

p(x) = N(x|µ, Λ^{-1})
p(y|x) = N(y|Ax + b, L^{-1})

⇒

p(y) = N(y|Aµ + b, L^{-1} + AΛ^{-1}A^T)
p(x|y) = N(x|Σ\{A^T L(y - b) + Λµ\}, Σ)

where

Σ = (Λ + A^T L A)^{-1}

The exponential family

Examples: Gaussian, Bernoulli, Beta, multinomial, Poisson distributions.

p(x|η) = h(x)\, g(η) \exp(η^T u(x))

η: "natural parameters"
u(x): (vector) function of x, the "sufficient statistic"
g(η): ensures normalization

Example:

p(x|µ, σ^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - µ)^2}{2\sigma^2} \right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{1}{2\sigma^2} x^2 + \frac{µ}{\sigma^2} x - \frac{µ^2}{2\sigma^2} \right)

• Natural parameters: η = \left( \frac{µ}{\sigma^2}, -\frac{1}{2\sigma^2} \right). Sufficient statistic: u(x) = (x, x^2).

The exponential family / Maximum likelihood and sufficient statistics

In the ML solution η = η_{ML}, the likelihood is stationary:

\nabla_η \sum_n \log p(x_n|η) = 0

Since

\nabla_η \log p(x_n|η) = \nabla_η \log\!\left( h(x_n)\, g(η) \exp(η^T u(x_n)) \right) = \nabla_η \log g(η) + u(x_n)

this implies that in the ML solution η = η_{ML}

-\nabla_η \log g(η) = \frac{1}{N} \sum_n u(x_n)

Now, g(η) is the normalization factor, i.e.

g(η)^{-1} = \int h(x) \exp(η^T u(x))\, dx

So

-\nabla_η \log g(η) = \nabla_η \log g(η)^{-1} = g(η)\, \nabla_η \int h(x) \exp(η^T u(x))\, dx
= g(η) \int h(x)\, u(x) \exp(η^T u(x))\, dx = \int u(x)\, p(x|η)\, dx = \langle u(x) \rangle_η

and we have that in the ML solution:

\frac{1}{N} \sum_n u(x_n) = \langle u(x) \rangle_{η_{ML}}

The ML estimator depends on the data only through the sufficient statistics \sum_n u(x_n).

Conjugate priors

Likelihood:

\prod_{n=1}^N p(x_n|η) = \left[ \prod_n h(x_n) \right] g(η)^N \exp\!\left( η^T \sum_n u(x_n) \right)
= \left[ \prod_n h(x_n) \right] g(η)^N \exp\!\left( N η^T \left[ \frac{1}{N} \sum_n u(x_n) \right] \right)
= \left[ \prod_n h(x_n) \right] g(η)^N \exp\!\left( N η^T \langle u \rangle_{\text{Data}} \right)

Conjugate prior:

p(η|χ, ν) = f(χ, ν)\, g(η)^ν \exp(ν η^T χ)

in which ν is an effective number of pseudo data and χ a sufficient statistic.

Posterior ∼ likelihood × prior:

P(η|X, χ, ν) \propto g(η)^{N+ν} \exp\!\left( η^T (N \langle u \rangle_{\text{Data}} + ν χ) \right)

Mixtures of Gaussians

Model for a multimodal distribution.

[Figure: a two-dimensional data set that is poorly modelled by a single Gaussian (left) but well modelled by a mixture of two Gaussians (right).]

p(x) = \sum_{k=1}^K π_k N(x|µ_k, Σ_k)

Mixtures of Gaussians

[Figure: a one-dimensional mixture density p(x) composed of several Gaussian components.]

Mixtures of Gaussians

• Each Gaussian is called a component of the mixture. The factors π_k are the mixing coefficients.

• The mixing coefficients satisfy

π_k ≥ 0 and \sum_k π_k = 1

which implies p(x) ≥ 0 and \int p(x)\, dx = 1.

• k: label of the mixture component. Joint distribution p(x, k) = π_k p(x|k). Since the data is not labeled, the marginal distribution p(x) = \sum_k π_k p(x|k) is used.

• p(x, k) is in the exponential family. However, the mixture model p(x) is not in the exponential family; there is no simple relation between data and parameters.

• ML: by gradient ascent (numerical function maximization) or by the EM (Expectation Maximization) algorithm, which deals with missing or hidden variables. A small sampling/evaluation sketch follows below.
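A minimal sketch (my own, not the slides' EM algorithm) of evaluating and sampling a one-dimensional mixture of Gaussians; the mixing coefficients, means and widths are arbitrary example values:

```python
import numpy as np
from scipy.stats import norm

pi = np.array([0.5, 0.3, 0.2])            # mixing coefficients, sum to 1
mus = np.array([-2.0, 0.0, 3.0])           # component means
sigmas = np.array([0.5, 1.0, 0.8])         # component standard deviations

def mixture_pdf(x):
    """p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)."""
    x = np.atleast_1d(x)[:, None]
    return np.sum(pi * norm.pdf(x, loc=mus, scale=sigmas), axis=1)

def sample(n, rng=np.random.default_rng(0)):
    """Ancestral sampling: draw the component label k, then x ~ N(mu_k, sigma_k^2)."""
    k = rng.choice(len(pi), size=n, p=pi)
    return rng.normal(mus[k], sigmas[k])

grid = np.linspace(-8, 8, 2001)
print(mixture_pdf(mus))                              # density at the three component means
print(mixture_pdf(grid).sum() * (grid[1] - grid[0])) # numerical integral, approximately 1
print(sample(5))
```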

Mixtures of Gaussians

[Figure: a mixture of three Gaussians in two dimensions. (a) Contours of the components with mixing coefficients 0.5, 0.3 and 0.2. (b) Contours of the mixture density p(x).]

Nonparametric methods/histograms

"Histograms": p(x) = p_i in bin i with width Δ_i, then ML:

p_i = \frac{n_i}{N Δ_i}

[Figure: histogram density estimates of the same data with bin widths Δ = 0.04, 0.08 and 0.25.]

• number of bins exponential in the dimension D

• discontinuous

Nonparametric methods

Assume the true density is p(x); then the probability to find a data point in a region R is

P = \int_R p(x)\, dx

If the total number of data points N is large, the number of data points K that are found in R is approximately

K ≃ N P

On the other hand, if the region R is small compared to the variation in p (i.e. p(x) approximately constant in R),

P ≃ p(x)\, V

where V is the volume of R. Combining gives

p(x) = \frac{K}{N V}

Nonparametric methods

• Fix V, determine K from the data: kernel approach / Parzen window.

• Fix K, determine V from the data: K-nearest neighbour.

Terminology:

• Parametric distribution: model given by a number of parameters, p(x, θ).

• Nonparametric distribution: the number of parameters grows with the data.

Kernel method

Kernel function:

k(u) = \begin{cases} 1, & |u_i| ≤ 1/2 \\ 0, & \text{otherwise} \end{cases}

k((x - x_n)/h): one if x_n lies in a cube of side h centered around x. Total number of data points lying in this cube:

K = \sum_{n=1}^N k\!\left( \frac{x - x_n}{h} \right)

The volume is V = h^D, so

p(x) = \frac{1}{N} \sum_{n=1}^N \frac{1}{h^D}\, k\!\left( \frac{x - x_n}{h} \right)

• The trick works for any k(u) with \int k(u)\, du = 1. For example, the Gaussian kernel k(u) ∝ \exp(-\tfrac{1}{2}||u||^2).

• h is the smoothing parameter.

Kernel methods

[Figure: kernel density estimates of the same data with smoothing parameter h = 0.005, 0.07 and 0.2.]
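A small sketch (mine) of a Gaussian-kernel Parzen window estimate in one dimension; the data set and bandwidths are arbitrary:

```python
import numpy as np

def parzen_density(x_eval, data, h):
    """p(x) = 1/N sum_n (1/h) N(x - x_n | 0, h^2) for one-dimensional data."""
    diff = (x_eval[:, None] - data[None, :]) / h
    kernel = np.exp(-0.5 * diff**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / h

rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(0.3, 0.05, 30), rng.normal(0.7, 0.1, 30)])
grid = np.linspace(0, 1, 201)

for h in (0.005, 0.07, 0.2):
    p = parzen_density(grid, data, h)
    integral = p.sum() * (grid[1] - grid[0])   # approximate integral over [0, 1]
    print(f"h={h:5.3f}  integral≈{integral:.3f}  max density={p.max():.2f}")
```

Small h gives a spiky estimate, large h over-smooths, as in the figure above.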

Nearest neighbour methods

p(x) = \frac{K}{N V}

is not a true density model (the integral over x diverges).

Classification: N_k points in class C_k, \sum_k N_k = N. With K_k the number of the K nearest neighbours of x that are in class C_k:

p(x|C_k) = \frac{K_k}{N_k V},   p(x) = \frac{K}{N V},   p(C_k) = \frac{N_k}{N}

then by Bayes' theorem the posterior probability of class membership follows,

p(C_k|x) = \frac{p(x|C_k)\, p(C_k)}{p(x)} = \frac{K_k}{K}

K (the number of neighbours) is the smoothing parameter.
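A bare-bones sketch (mine, not from the slides) of the K-nearest-neighbour classification rule p(C_k|x) = K_k/K; the toy data set is arbitrary:

```python
import numpy as np

def knn_posterior(x, X_train, t_train, K, n_classes):
    """K-nearest-neighbour estimate of p(C_k|x) for a single query point x."""
    dist = np.linalg.norm(X_train - x, axis=1)
    neighbours = np.argsort(dist)[:K]                 # indices of the K closest training points
    counts = np.bincount(t_train[neighbours], minlength=n_classes)
    return counts / K                                 # K_k / K

rng = np.random.default_rng(4)
X0 = rng.normal([0, 0], 0.8, size=(50, 2))            # class 0
X1 = rng.normal([2, 2], 0.8, size=(50, 2))            # class 1
X_train = np.vstack([X0, X1])
t_train = np.array([0] * 50 + [1] * 50)

x_query = np.array([1.0, 1.2])
for K in (1, 3, 31):
    print(K, knn_posterior(x_query, X_train, t_train, K, n_classes=2))
```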

Nearest neighbour methods

[Figure: (a) a two-class data set in the (x_1, x_2) plane; (b) the resulting nearest-neighbour decision regions.]

Nearest neighbour methods

[Figure: K-nearest-neighbour classification in the (x_6, x_7) plane for K = 1, 3 and 31.]


Chapter 3

Linear models for regression


3.2. Bias-Variance Decomposition

Complex models tend to overfit. Simple models tend to be too rigid. Complexity is controlled by, e.g., the number of terms in the polynomial or the weight decay constant λ.

(Treatment of 1.5.5.) Consider a regression problem with squared loss function

E(L) = \int dx\, dt\, \{y(x) - t\}^2\, p(x, t)

p(x, t) is the true underlying model, y(x) is our estimate of t at x.

The optimal y minimizes E(L):

\frac{\delta E(L)}{\delta y(x)} = 2 \int dt\, \{y(x) - t\}\, p(x, t) = 0

y(x) = \frac{\int dt\, t\, p(x, t)}{p(x)} = \int dt\, t\, p(t|x) = E(t|x)

3.2. Bias-Variance Decomposition

The square loss can also be written as

E(L) = \int dx\, dt\, \{y(x) - t\}^2\, p(x, t)
     = \int dx\, dt\, \{y(x) - E(t|x) + E(t|x) - t\}^2\, p(x, t)
     = \int dx\, dt\, \left( \{y(x) - E(t|x)\}^2 + \{E(t|x) - t\}^2 + 2\{y(x) - E(t|x)\}\{E(t|x) - t\} \right) p(x, t)
     = \int dx\, \{y(x) - E(t|x)\}^2\, p(x) + \int dx\, dt\, \{E(t|x) - t\}^2\, p(x, t)

The first term depends on y(x) and is minimized when y(x) = E(t|x). The second term is the variance in the data.

In the case of a finite data set D, our learning algorithm will give a solution y(x; D) that will not minimize the exact square loss (but maybe some approximate square loss). Different data sets will give different solutions y(x; D).

Consider the thought experiment that a large number of data sets D are given. Then we can construct the average solution E_D(y(x; D)), and

\{y(x; D) - E(t|x)\}^2 = \{y(x; D) - E_D(y(x; D))\}^2 + \{E_D(y(x; D)) - E(t|x)\}^2 + 2\{y(x; D) - E_D(y(x; D))\}\{E_D(y(x; D)) - E(t|x)\}

E_D\!\left( \{y(x; D) - E(t|x)\}^2 \right) = \{E_D(y(x; D)) - E(t|x)\}^2 + E_D\!\left( \{y(x; D) - E_D(y(x; D))\}^2 \right)

Substitution in the square loss:

Expected square loss = bias² + variance + noise

Bias²: the difference between the average solution E_D(y(x; D)) and the true solution E(t|x).
Variance: the scatter of individual solutions y(x; D) around their mean E_D(y(x; D)).
Noise: the (true) scatter of the data points t around their mean E(t|x).

3.2. Bias-Variance Decomposition

[Figure: fits for ln λ = 2.6, -0.31 and -2.4. Left column: the individual fitted curves (variance). Right column: their average compared with the true function (bias).]

100 data sets, each with 25 data points from t = sin(2πx) + noise. y(x) as in Eq. 3.3-4. Parameters optimized using Eq. 3.27 for different λ. Left shows variance, right shows bias.

3.2. Bias-Variance Decomposition

[Figure: (bias)², variance, (bias)² + variance and test error as functions of ln λ.]

The sum of bias, variance and noise yields the expected error on the test set. The optimal λ is a trade-off between bias and variance.

The true bias is usually not known, because E(t|x) is not known.
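A compact simulation (my own sketch) of the bias-variance trade-off with polynomial regression and L2 regularization; the data generator and settings follow the spirit of the figure but the polynomial degree, noise level and seeds are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
x_grid = np.linspace(0, 1, 50)
f_true = np.sin(2 * np.pi * x_grid)                  # E(t|x)

def fit_ridge(x, t, degree, lam):
    Phi = np.vander(x, degree + 1, increasing=True)   # polynomial design matrix
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ t)
    return np.vander(x_grid, degree + 1, increasing=True) @ w

for lam in (np.exp(2.6), np.exp(-0.31), np.exp(-2.4)):
    preds = []
    for _ in range(100):                              # 100 data sets of 25 points each
        x = rng.uniform(0, 1, 25)
        t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 25)
        preds.append(fit_ridge(x, t, degree=9, lam=lam))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"ln lam={np.log(lam):+5.2f}  bias^2={bias2:.4f}  variance={variance:.4f}")
```

Large λ gives high bias and low variance; small λ gives the reverse.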

3.3. Bayesian linear regression

Generalized linear model with 'basis functions', e.g. φ_i(x) = x^i:

y(x, w) = \sum_i w_i φ_i(x) = w^T φ(x)

Gaussian noise with precision β: the prediction given w is

p(t|x, w) = N(t|y(x, w), β^{-1}) = N(t|w^T φ(x), β^{-1})

Data x = (x_1, \ldots, x_N) and t = (t_1, \ldots, t_N).

Likelihood:

p(t|x, w, β) = \prod_{n=1}^N N(t_n|w^T φ(x_n), β^{-1}) = N(t|w^T φ(x), β^{-1} I)

is the exponential of a quadratic function in w.

With the corresponding conjugate prior

p(w) = N(w|m_0, S_0)

the posterior is also Gaussian.

Prior and likelihood (write Φ_{nj} = φ_j(x_n)):

p(w) = N(w|m_0, S_0)
p(t|w) = N(t|Φw, β^{-1} I)

Then, by applying 2.113 + 2.114 ⇒ 2.116,

p(w|t) = N(w|m_N, S_N)

with

m_N = S_N (S_0^{-1} m_0 + β Φ^T t)
S_N^{-1} = S_0^{-1} + β Φ^T Φ

Note the behaviour when S_0 → ∞ (broad prior), when N = 0, and when N → ∞.
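A short sketch (mine, not from the slides) of computing the posterior m_N, S_N for a polynomial basis, plus the predictive mean and variance of the next slide; the prior and noise settings are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
N, degree = 25, 9
beta = 1.0 / 0.2**2                  # noise precision
alpha = 2.0                          # isotropic prior precision: S0 = alpha^{-1} I, m0 = 0

x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)

Phi = np.vander(x, degree + 1, increasing=True)       # Phi[n, j] = phi_j(x_n) = x_n^j
S0_inv = alpha * np.eye(degree + 1)
m0 = np.zeros(degree + 1)

SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)        # S_N^{-1} = S_0^{-1} + beta Phi^T Phi
mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)             # m_N = S_N (S_0^{-1} m_0 + beta Phi^T t)

# Predictive mean and variance at a new input x* = 0.5
phi_star = np.vander(np.array([0.5]), degree + 1, increasing=True)[0]
pred_mean = phi_star @ mN
pred_var = 1.0 / beta + phi_star @ SN @ phi_star       # sigma_N^2(x) = beta^{-1} + phi^T S_N phi
print(pred_mean, np.sqrt(pred_var))
```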

Data: t = a_0 + a_1 x + noise. Model: y(x, w) = w_0 + w_1 x; p(t|x, w, β) = N(t|y(x, w), β^{-1}), β^{-1} = (0.2)^2; p(w|α) = N(w|0, α^{-1} I), α = 2. When N is large, S_N^{-1} is dominated by the second term and becomes infinite. Thus, the width of the posterior goes to zero. Also the mean becomes independent of β, as it cancels with the S_N term.

3.3. Bayesian linear regression

The predictive distribution is Gaussian:

p(w|t) = N(w|m_N, S_N)
p(t|w, x) = N(t|φ^T(x) w, β^{-1})

We can view p(w|t) as a new prior conditioned on t, and apply 2.113 + 2.114 ⇒ 2.115 to compute the marginal (conditioned on t and x):

p(t|t, x) = N(t|φ^T(x) m_N, σ_N^2(x))

where

σ_N^2(x) = β^{-1} + φ(x)^T S_N φ(x)

What happens to σ_N^2(x) when N → ∞?

Page 171: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

3.3. Bayesian linear regression


Data points from t = sin(2πx) + noise. y(x) as in Eq. 3.3-4. Data set size N = 1, 2, 4, 25.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 170

Page 172: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

3.3. Bayesian linear regression


Same data and model. Curves y(x, w) with w sampled from the posterior.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 171

Page 173: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

3.4. Bayesian model comparison

The maximum likelihood approach suffers from overfitting, which requires testing models of different complexity on separate data.

The Bayesian approach makes it possible to compare different models directly on the training data, but requires integration over the model parameters.

Consider L probability models and a set of data generated from one of these models. We define a prior over models p(M_i), i = 1, . . . , L to express our prior uncertainty.

Given the training data D, we wish to compute

p(M_i|D) ∝ p(D|M_i) p(M_i)

p(D|M_i) is called the model evidence, also the marginal likelihood, since it integrates over the model parameters.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 172

Page 174: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

3.4. Bayesian model comparison

When the prior over models is flat, the model posterior is proportional to the model evidence (the likelihood of the model):

p(D|M_i) = ∫ dw p(D|w, M_i) p(w|M_i)

We can use a very crude estimate of this integral to understand the mechanism of Bayesian model selection.

Consider first the one-dimensional (one-parameter) case and assume that the prior is constant within an interval of size Δw_prior and zero elsewhere, so p(w|M_i) = 1/Δw_prior. Assume that the posterior is also block shaped, with height p(D|w_MAP, M_i) and width Δw_posterior. Then

p(D|M_i) ≈ p(D|w_MAP, M_i) · (Δw_posterior / Δw_prior)

ln p(D|M_i) ≈ ln p(D|w_MAP, M_i) + ln(Δw_posterior / Δw_prior)

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 173

Page 175: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

3.4. Bayesian model comparison

[Figure: a flat prior of width Δw_prior and a block-shaped posterior of width Δw_posterior centred at w_MAP.]

The Bayesian approach gives a negative correction to the ML estimate.

With M model parameters, the same argument gives

ln p(D|M_i) ≈ ln p(D|w_MAP, M_i) + M ln(Δw_posterior / Δw_prior)

With increasing M, the first term increases (better fit), but the second term decreases (stronger complexity penalty).
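A toy numerical illustration of this trade-off (all numbers are assumptions chosen for the example, not taken from the slides):

```python
import numpy as np

# hypothetical best-fit log likelihoods for models with M parameters
log_lik_map = {1: -120.0, 3: -100.0, 10: -96.0}
occam_per_param = np.log(0.1 / 1.0)        # ln(dw_posterior / dw_prior) < 0

for M, ll in log_lik_map.items():
    print(M, round(ll + M * occam_per_param, 1))
# M = 3 wins: the extra fit of M = 10 does not pay for its complexity penalty
```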

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 174

Page 176: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

3.4. Bayesian model comparison

Consider three models of increasing complexity M_1, M_2, M_3. Consider drawing data sets from these models: we first sample a parameter vector w from the prior p(w|M_i) and then generate iid data points according to p(x|w, M_i). The resulting distribution is p(D|M_i).

A simple model has less variability in the resulting data sets than a complex model. Thus, p(D|M_1) is more peaked than p(D|M_3). Due to normalization, p(D|M_1) is then necessarily higher than p(D|M_3) in the region where it puts its mass.

[Figure: the distributions p(D|M_1), p(D|M_2), p(D|M_3) over data sets D, with a particular data set D_0 marked.]

For the data set D_0 the Bayesian approach will select model M_2, because model M_1 is too simple (too low likelihood) and model M_3 is too complex (too large penalty term). This is known as Bayesian model selection.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 175

Page 177: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Chapter 4

Linear models for classification

• Discriminant function

• Conditional probabilities P (Ck|x)

• Generative models P (x|Ck)

Generalized linear models for classification

y(x) = f(wTx + w0)

with f(·) nonlinear.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 176

Page 178: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

4.1 Discriminant functions

Two category classification problem:

y(x) > 0 ⇔ x ∈ C_1,    y(x) < 0 ⇔ x ∈ C_2

with y(x) = w^T x + w_0

[Figure: geometry of the linear discriminant in two dimensions: the boundary y = 0 separates regions R_1 (y > 0) and R_2 (y < 0); w is normal to the boundary, which lies at distance −w_0/‖w‖ from the origin, and y(x)/‖w‖ is the signed distance of x from the boundary.]

Notation: with w̃ = (w_0, w) and x̃ = (1, x) we can write

y(x) = w̃^T x̃

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 177

Page 179: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Multiple classes

y_k(x) = w_k^T x + w_{k0} = Σ_{i=1}^{D} w_{ki} x_i + w_{k0} = Σ_{i=0}^{D} w_{ki} x_i

A naive approach is to let each boundary separate class k from the rest. This does not work.

[Figure: one-versus-the-rest (left) and pairwise (right) combinations of two-class discriminants both leave ambiguous regions, marked '?', between the class regions R_1, R_2, R_3.]

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 178

Page 180: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Multiple classes

Instead, define decision boundaries as: yk(x) = yj(x).

[Figure: decision regions R_i, R_j, R_k; any point x on the line segment between x_A and x_B, both in R_k, also lies in R_k, illustrating convexity.]

If y_k(x) = w_k^T x + w_{0k}, k = 1, 2, then the decision boundary is (w_1 − w_2)^T x + w_{01} − w_{02} = 0.

For K = 3 we get three lines, y_1(x) = y_2(x), y_1(x) = y_3(x), y_2(x) = y_3(x), that cross in a common point y_1(x) = y_2(x) = y_3(x); these boundaries define the tessellation.

The class regions R_k are convex: if x, y ∈ R_k, then for all j ≠ k and 0 ≤ α ≤ 1

y_k(x) > y_j(x), y_k(y) > y_j(y) ⇒ y_k(αx + (1 − α)y) > y_j(αx + (1 − α)y)

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 179

Page 181: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Least-squares technique

Learning the parameters is done by minimizing a cost function, for instance the minus log likelihood or the quadratic cost:

E = (1/2) Σ_n Σ_k ( Σ_i x_ni w_ik − t_nk )²

∂E/∂w_jk = Σ_n ( Σ_i x_ni w_ik − t_nk ) x_nj = 0

Σ_i ( Σ_n x_nj x_ni ) w_ik = Σ_n t_nk x_nj

w_k = C^{-1} µ_k,    C_ji = Σ_n x_nj x_ni,    µ_kj = Σ_n t_nk x_nj
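A minimal sketch of this solution in code (NumPy assumed, with one-hot targets and an explicit bias column; the toy data are our own):

```python
import numpy as np

def lsq_classifier(X, T):
    """Least-squares weights W for one-hot targets T; X includes a leading 1-column."""
    W, *_ = np.linalg.lstsq(X, T, rcond=None)   # solves (X^T X) W = X^T T
    return W

def predict(W, X):
    return np.argmax(X @ W, axis=1)

# toy two-class problem
rng = np.random.default_rng(1)
X0 = rng.normal([-1, -1], 1.0, (50, 2))
X1 = rng.normal([2, 2], 1.0, (50, 2))
X = np.hstack([np.ones((100, 1)), np.vstack([X0, X1])])
T = np.zeros((100, 2)); T[:50, 0] = 1; T[50:, 1] = 1
W = lsq_classifier(X, T)
print((predict(W, X) == np.r_[np.zeros(50), np.ones(50)]).mean())
```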

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 180

Page 182: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Least-squares technique


Two-class classification problem solved with least squares (magenta) and logistic regression (green).

The least-squares solution is sensitive to outliers: data points far from the boundary contribute too much to the error.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 181

Page 183: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Least-squares technique


Three-class classification problem solved with least squares (left) and logistic regression (right).

The least-squares solution is poor in the multi-class case.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 182

Page 184: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Fisher’s linear discriminant

Consider two classes. Take an x and project it down to one dimension using

y = wTx

Let the two classes have two means:

m_1 = (1/N_1) Σ_{n∈C_1} x_n,    m_2 = (1/N_2) Σ_{n∈C_2} x_n

[Figure: samples from two correlated classes in two dimensions, together with their projections onto the line joining the class means.]

We can choose w to maximize w^T(m_2 − m_1), subject to Σ_i w_i² = 1 (exercise 4.4).

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 183

Page 185: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Fisher’s linear discriminant

The problem is that the best separation requires taking the within-class variance into account.

[Figure: projection onto the line joining the class means (left) versus the Fisher direction (right); the latter separates the projected classes much better.]

Fisher: maximize the between-class separation while minimizing the within-class variance.

Within-class variance after projection:

s_k² = Σ_{n∈C_k} (y_n − m_k)²,    k = 1, 2

with m_k = w^T m_k (the projected class mean) and y_n = w^T x_n. The total within-class variance is s_1² + s_2².

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 184

Page 186: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

The Fisher method minimizes the within-class variance and maximizes the mean separation by maximizing

J(w) = (m_2 − m_1)² / (s_1² + s_2²) = (w^T S_B w) / (w^T S_W w)

S_B = (m_2 − m_1)(m_2 − m_1)^T

S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T

(Exercise 4.5)

Differentiation w.r.t. w yields (exercise)

(w^T S_B w) S_W w = (w^T S_W w) S_B w

Dropping scalar factors, and using S_B w ∝ (m_2 − m_1):

w ∝ S_W^{-1} (m_2 − m_1)
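A small sketch of this direction in code (NumPy assumed; the toy data and the midpoint threshold are our own illustrative choices):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher direction w proportional to S_W^{-1} (m2 - m1) for two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(SW, m2 - m1)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(2)
C = [[2.0, 1.5], [1.5, 2.0]]
X1 = rng.multivariate_normal([0, 0], C, 100)
X2 = rng.multivariate_normal([3, 1], C, 100)
w = fisher_direction(X1, X2)
threshold = 0.5 * ((X1 @ w).mean() + (X2 @ w).mean())        # simple midpoint threshold
print(((X1 @ w) < threshold).mean(), ((X2 @ w) > threshold).mean())
```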

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 185

Page 187: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Fisher’s linear discriminant

When S_W ∝ the unit matrix (isotropic within-class covariance), we recover the naive solution w ∝ (m_2 − m_1).

A classifier is built by thresholding y = w^T x.

The Fisher discriminant method is best viewed as a dimension reduction method, in this case from d dimensions to 1 dimension, such that optimal classification is obtained in the linear sense.

It can be shown that the least-squares method with two classes and target values t_n = N/N_1 (class 1) and t_n = −N/N_2 (class 2) is equivalent to the two-class Fisher discriminant solution (section 4.1.5).

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 186

Page 188: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Several classes

Assume input dimension D > K, the number of classes.

We introduce D′ > 1 features y_k = w_k^T x, k = 1, . . . , D′, or

y = W^T x,    W = (w_1, . . . , w_{D′})

Class means m_k = (1/N_k) Σ_{n∈C_k} x_n and projected class means m_k = w_k^T m_k as before.

Within-class covariance S_W = Σ_{k=1}^{K} S_k, with S_k = Σ_{n∈C_k} (x_n − m_k)(x_n − m_k)^T, and projected within-class covariance W^T S_W W.

Between-class covariance S_B: first define the total mean m = (1/N) Σ_n x_n and the projected total mean m = w_k^T m. The total covariance S_T = Σ_n (x_n − m)(x_n − m)^T can be written as

S_T = S_W + S_B,    S_B = Σ_{k=1}^{K} N_k (m_k − m)(m_k − m)^T

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 187

Page 189: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Maximize

J(W) = Tr( (W^T S_W W)^{-1} (W^T S_B W) )

The solution W = (w_1, . . . , w_{D′}) consists of the D′ eigenvectors of the matrix S_W^{-1} S_B with the largest eigenvalues.

Note that S_B is a sum of K rank-one matrices and has rank at most K − 1, so S_W^{-1} S_B has D − (K − 1) eigenvectors with eigenvalue zero.

Thus D′ ≤ K − 1, because otherwise W would contain some of the zero eigenvectors.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 188

Page 190: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

The Perceptron

Relevant in history of pattern recognition and neural networks.

• Perceptron learning rule + convergence, Rosenblatt (1962)

• Perceptron critique (Minsky and Papert, 1969) → "dark ages" of neural networks

• Revival in the 80’s: Backpropagation

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 189

Page 191: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

The Perceptron

y(x) = sign(w^T φ(x))

where

sign(a) = +1 if a ≥ 0, −1 if a < 0

and φ(x) is a feature vector (e.g. a hard-wired neural network).

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 190

Page 192: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

The Perceptron algorithm

Problem: the error (number of misclassified patterns) is piecewise constant, so gradient methods cannot be applied directly.

Perceptron approach: seek w such that w^T φ(x_n) > 0 for x_n ∈ C_1 and w^T φ(x_n) < 0 for x_n ∈ C_2.

Target coding: t = +1 for x ∈ C_1 and t = −1 for x ∈ C_2. Then we want, for all patterns,

t_n w^T φ(x_n) > 0

Perceptron criterion: only look at the misclassified patterns M

E_P(w) = − Σ_{n∈M} t_n w^T φ(x_n)

so the total error is piecewise linear.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 191

Page 193: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Stochastic gradient descent: choose a pattern n. If it is misclassified, update according to

w^{τ+1} = w^τ − η ∇E_P^{(n)}(w) = w^τ + η φ_n t_n

It can be shown that this algorithm converges in a finite number of steps if a solution exists (i.e. if the data is linearly separable).

The learning parameter can be held constant; however, ||w|| may grow.
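A minimal sketch of the algorithm (NumPy assumed; here φ(x) = (1, x_1, x_2) and the toy data are our own choices):

```python
import numpy as np

def perceptron(Phi, t, eta=1.0, max_epochs=100):
    """Perceptron learning on features Phi (N x M) with targets t in {-1, +1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:        # misclassified (or on the boundary)
                w += eta * phi_n * t_n        # w <- w + eta * phi_n * t_n
                errors += 1
        if errors == 0:                       # all patterns correct: converged
            break
    return w

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([-2, -2], 0.5, (30, 2)), rng.normal([2, 2], 0.5, (30, 2))])
t = np.r_[-np.ones(30), np.ones(30)]
Phi = np.hstack([np.ones((60, 1)), X])
print((np.sign(Phi @ perceptron(Phi, t)) == t).all())
```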

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 192

Page 194: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Convergence of perceptron learning

[Figure: four snapshots of the perceptron decision boundary in feature space, converging as misclassified points are processed one by one.]

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 193

Page 195: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Perceptron critique

Minsky and Papert: perceptron can only learn linearly separable problems:

Most functions are not linearly separable: e.g. the problem of determining whether objects are singly connected, using local receptive fields.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 194

Page 196: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Probabilistic generative models

Probabilistic models for classification

p(C_1|x) = p(x|C_1)p(C_1) / ( p(x|C_1)p(C_1) + p(x|C_2)p(C_2) ) = 1 / (1 + exp(−a)) = σ(a)

where a is the "log odds"

a = ln [ p(x|C_1)p(C_1) / ( p(x|C_2)p(C_2) ) ]

and σ the logistic sigmoid function (i.e. S-shaped function)

σ(a) = 1 / (1 + exp(−a))

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 195

Page 197: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Sigmoid function

[Figure: the logistic sigmoid σ(a) rising from 0 to 1, plotted for a from −5 to 5.]

Properties: σ(−a) = 1 − σ(a)

Inverse: a = ln( σ / (1 − σ) )

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 196

Page 198: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Sigmoid/softmax function

Softmax: generalization to K classes

p(C_k|x) = p(x|C_k)p(C_k) / Σ_j p(x|C_j)p(C_j) = exp(a_k) / Σ_j exp(a_j)

where a_k = ln p(x|C_k)p(C_k)

Note that softmax is invariant under ak → ak + const

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 197

Page 199: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Continuous inputs

The discriminant function for a Gaussian distribution is

a_k(x) = log p(x|C_k) + log p(C_k)
       = −(1/2)(x − µ_k)^T Σ_k^{-1} (x − µ_k) − (1/2) log |Σ_k| + log p(C_k)

The decision boundaries ak(x) = al(x) are quadratic functions in d dimensions.

If Σ_k^{-1} = Σ^{-1} (shared covariance), then

a_k(x) = −(1/2) x^T Σ^{-1} x + µ_k^T Σ^{-1} x − (1/2) µ_k^T Σ^{-1} µ_k − (1/2) log |Σ| + log p(C_k)
       = w_k^T x + w_{k0} + const

with

w_k^T = µ_k^T Σ^{-1},    w_{k0} = −(1/2) µ_k^T Σ^{-1} µ_k + log p(C_k)

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 198

Page 200: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

The discriminant function is linear, and the decision boundary is a straight line (or hyperplane in d dimensions).

The class probabilities again follow from p(C_k|x) = softmax_k(a).

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 199

Page 201: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

[Figure: two-dimensional example of the decision regions obtained with Gaussian class-conditional densities.]

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 200

Page 202: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Continuous inputs, maximum likelihood solution

Once we have specified a parametric functional form for p(x|C_k) and p(C_k), we can determine the parameters using maximum (joint!) likelihood (as usual!).

Actually, for classification, we should aim for the conditional likelihood, but the joint likelihood is in general easier. In a perfect model it makes no difference, but with imperfect models it certainly does!

Example of joint ML: a binary problem with Gaussian class-conditionals with a shared covariance matrix. The data are x_n, t_n with labels coded as t = 1, 0 for classes C_1, C_2.

Class probabilities are parametrized as

p(C1) = π, p(C2) = 1− π

and class-conditional densities as

p(xn|C1) = N (xn|µ1, Σ), p(xn|C2) = N (xn|µ2, Σ)

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 201

Page 203: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Then the likelihood is given by

p(t1, x1, . . . , tN , xn|π, µ1, µ2, Σ) =∏n

p(tn, xn|π, µ1, µ2, Σ)

=∏n

[πN (xn|µ1, Σ)]tn[(1− π)N (xn|µ2, Σ)]1−tn

The maximum likelihood solution follows as in page 201.
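A sketch of the resulting joint-ML estimators and the induced posterior p(C_1|x) (NumPy assumed; the helper names are ours):

```python
import numpy as np

def fit_gaussian_generative(X, t):
    """Joint-ML estimates for the shared-covariance Gaussian model, t in {0, 1}."""
    X1, X2 = X[t == 1], X[t == 0]
    pi = len(X1) / len(X)                     # p(C1) = N1 / N
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / len(X)
    return pi, mu1, mu2, S                    # S is the shared covariance

def posterior_c1(x, pi, mu1, mu2, S):
    """p(C1|x) = sigma(w^T x + w0); linear because the covariance is shared."""
    Si = np.linalg.inv(S)
    w = Si @ (mu1 - mu2)
    w0 = -0.5 * mu1 @ Si @ mu1 + 0.5 * mu2 @ Si @ mu2 + np.log(pi / (1 - pi))
    return 1.0 / (1.0 + np.exp(-(x @ w + w0)))
```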

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 202

Page 204: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Probabilistic discriminative models

Discriminative versus generative models

Discriminative models:

• no effort wasted on modelling joint probabilities; aims directly at the conditional probabilities of interest

• usually fewer parameters

• improved performance when the class-conditional assumptions p(x|C) are poor (compared with joint ML)

• basis functions can be employed

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 203

Page 205: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Fixed basis functions

[Figure: data that is not linearly separable in the original space (x_1, x_2) can become linearly separable in the basis-function space (φ_1, φ_2).]

• Basis functions φ(x), φn = φ(xn)

• Problems that are not linearly separable in x might be linearly separable in φ(x).

• Note: the use of basis functions φ in place of the variables x is not obvious in generative models: N(x, x², x³ | µ, Σ)??

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 204

Page 206: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Logistic regression

Two-class classification:

p(C_1|φ) = σ(w^T φ)

M-dimensional feature space: M parameters, while Gaussian class-conditional densities would require M(M + 5)/2 + 1 parameters.

Maximum likelihood to determine the parameters (t_n ∈ {0, 1}):

p(t_1, . . . , t_N | w, x_1, . . . , x_N) = Π_n σ(w^T φ_n)^{t_n} [1 − σ(w^T φ_n)]^{1−t_n}

i.e.,

E(w) = − ln p = − Σ_n [ t_n ln σ(w^T φ_n) + (1 − t_n) ln(1 − σ(w^T φ_n)) ]

NB: entropic error function for classification, rather than squared error!

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 205

Page 207: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Logistic regression

∇E(w) = Σ_n ( σ(w^T φ_n) − t_n ) φ_n

There is no closed-form solution. Optimization is done by gradient descent (or e.g. Newton-Raphson).

Overfitting risk when data is linearly separable: w →∞ (i.e. σ → step function).
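A minimal batch gradient-descent sketch (NumPy assumed; the step size, iteration count and toy data are our own choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logreg_fit(Phi, t, eta=0.5, n_iter=2000):
    """Gradient descent on the cross-entropy error E(w)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (sigmoid(Phi @ w) - t)   # grad E = sum_n (sigma(w^T phi_n) - t_n) phi_n
        w -= eta * grad / len(t)
    return w

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([-1, 0], 1.0, (40, 2)), rng.normal([2, 1], 1.0, (40, 2))])
t = np.r_[np.zeros(40), np.ones(40)]
Phi = np.hstack([np.ones((80, 1)), X])
w = logreg_fit(Phi, t)
print(((sigmoid(Phi @ w) > 0.5) == t).mean())
```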

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 206

Page 208: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Laplace approximation

Assume a distribution p(z) is given up to normalization, i.e. in the form

p(z) = (1/Z) f(z),    Z = ∫ f(z) dz

where f(z) is given, but Z is unknown (and the integral is infeasible).

Goal: approximate by a Gaussian q(z), centered around the mode of p(z).

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 207

Page 209: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Laplace approximation

The mode z_0 is a maximum of p(z), i.e. dp(z)/dz|_{z_0} = 0, or

df(z)/dz |_{z_0} = 0

The logarithm of a Gaussian is a quadratic function of the variables, so it makes sense to make a second-order Taylor expansion of ln f around the mode (this would be exact if p were Gaussian):

ln f(z) ≈ ln f(z_0) − (1/2) A (z − z_0)²

where

A = − d² ln f(z) / dz² |_{z_0}

Note that the first-order term is absent since we expand around the maximum. Taking the exponent we obtain

f(z) ≈ f(z_0) exp( −(1/2) A (z − z_0)² )

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 208

Page 210: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

and the Gaussian approximation is obtained by normalization

q(z) = ( A / (2π) )^{1/2} exp( −(1/2) A (z − z_0)² )

[Figure: a non-Gaussian density and its Laplace (Gaussian) approximation around the mode (left), and the corresponding negative logarithms (right).]
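A numerical sketch of the one-dimensional case (SciPy's scalar minimizer and a finite-difference second derivative are our own implementation choices; the example f is an assumption):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_1d(log_f, h=1e-4):
    """Laplace approximation N(z0, 1/A) and estimated Z for p(z) proportional to f(z)."""
    z0 = minimize_scalar(lambda z: -log_f(z)).x                       # mode of f
    A = -(log_f(z0 + h) - 2 * log_f(z0) + log_f(z0 - h)) / h ** 2     # A = -d^2 log f / dz^2 at z0
    Z = np.exp(log_f(z0)) * np.sqrt(2 * np.pi / A)                    # approximate normalization
    return z0, A, Z

# example: f(z) = exp(-z^2 / 2) * sigmoid(20 z + 4)
log_f = lambda z: -0.5 * z ** 2 - np.log1p(np.exp(-(20 * z + 4)))
print(laplace_1d(log_f))
```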

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 209

Page 211: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Laplace approximation

In M dimensions the results are similar,

q(z) = ( det A )^{1/2} / (2π)^{M/2} · exp( −(1/2) (z − z_0)^T A (z − z_0) )

where

A_ij = − ∂²/∂z_i ∂z_j ln f(z) |_{z = z_0}

Usually z0 is found by numerical optimization.

A weakness of the Laplace approximation is that it relies only on the local properties of the mode. Other methods (e.g. sampling or variational methods) are based on more global properties of p.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 210

Page 212: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Model comparison and BIC

Approximation of Z:

Z = ∫ f(z) dz ≈ f(z_0) ∫ exp( −(1/2) (z − z_0)^T A (z − z_0) ) dz = f(z_0) (2π)^{M/2} / det(A)^{1/2}

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 211

Page 213: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Model comparison and BIC

Application: approximation of model evidence.

f(θ) = p(D|θ) p(θ)

Z = P(D) = ∫ p(D|θ) p(θ) dθ

Then, applying the Laplace approximation, we obtain

ln P(D) ≈ ln p(D|θ_MAP) + ln p(θ_MAP) + (M/2) ln 2π − (1/2) ln det(A)

where

A = −∇∇ ln p(D|θ)p(θ) |_{θ_MAP} = −∇∇ ln p(θ|D) |_{θ_MAP}

which can be interpreted as the inverse width of the posterior.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 212

Page 214: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Model comparison and BIC

Under some assumptions, A = Σ_n A_n ≈ N Â with Â full rank, so that

ln det(A) ≈ ln det(N Â) = ln( N^M det(Â) ) ≈ M ln N

This leads to the Bayesian Information Criterion (BIC)

ln P(D) ≈ ln p(D|θ_MAP) − (1/2) M ln N (+ const.)
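A trivial sketch of using the criterion to compare two fitted models (the log-likelihood values are assumed for illustration):

```python
import numpy as np

def bic_score(log_lik_map, M, N):
    """ln P(D) up to a constant: higher is better with this sign convention."""
    return log_lik_map - 0.5 * M * np.log(N)

# two hypothetical models fitted to the same N = 100 points
print(bic_score(-210.0, M=3, N=100), bic_score(-205.0, M=10, N=100))
```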

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 213

Page 215: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Bayesian logistic regression

Prior p(w) = N (w|m0, S0)

Log posterior = log prior + log likelihood + const

ln p(w|t) = −(1/2)(w − m_0)^T S_0^{-1} (w − m_0) + Σ_n [ t_n ln y_n + (1 − t_n) ln(1 − y_n) ] + const

where y_n = σ(w^T φ_n).

Posterior distribution p(w|t)? Laplace approximation: find w_MAP and compute the second derivatives:

S_N^{-1} = −∇∇ ln p(w|t) = S_0^{-1} + Σ_n y_n (1 − y_n) φ_n φ_n^T

and

q(w) = N(w | w_MAP, S_N)

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 214

Page 216: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Predictive distribution

p(C_1|φ, t) = ∫ σ(w^T φ) p(w|t) dw ≈ ∫ σ(w^T φ) q(w) dw

The function σ(w^T φ) depends on w only via its projection on φ. We can marginalize out the other directions, since q is a Gaussian. The marginal is then again a Gaussian, for which we can compute the parameters:

∫ σ(w^T φ) q(w) dw = ∫ σ(a) N(a | µ_a, σ_a²) da

where the parameters turn out to be µ_a = w_MAP^T φ and σ_a² = φ^T S_N φ.

Unfortunately, this integral cannot be expressed analytically. However, σ(x) is well approximated by the probit function, i.e. the cumulative Gaussian

Φ(x) = ∫_{−∞}^{x} N(u|0, 1) du

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 215

Page 217: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

In particular σ(x) ≈ Φ(√(π/8) x). With additional manipulations the predictive distribution can be shown to be approximated by

p(C_1|φ, t) ≈ σ( (1 + π σ_a²/8)^{-1/2} µ_a ) = σ( κ(σ_a²) µ_a )

Note that the MAP predictive distribution is

p(C1|φ,wMAP ) = σ(µa).

The decision boundary p(C_1|φ, . . .) = 0.5 is the same in both approximations, namely at µ_a = 0. Since κ < 1, the Laplace approximation is less certain about the classifications.
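A small sketch of the corrected predictive probability (it assumes w_MAP and S_N from the Laplace approximation above; NumPy assumed):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_prob(phi, w_map, SN):
    """Approximate p(C1 | phi, t) with the probit-based correction."""
    mu_a = w_map @ phi
    var_a = phi @ SN @ phi                          # sigma_a^2 = phi^T S_N phi
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var_a / 8)  # kappa(sigma_a^2)
    return sigmoid(kappa * mu_a)

# compared with the MAP prediction sigmoid(mu_a), the probability is pulled
# towards 0.5 when the posterior variance var_a is large
```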

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 216

Page 218: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Chapter 5 Neural Networks

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 217

Page 219: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Feed-forward neural networks

Non-linear methods using a fixed set of basis functions (polynomials) suffer from the curse of dimensionality.

A successful alternative is to adapt the basis functions to the problem.
- SVMs: convex optimisation, number of support vectors increases with the data
- MLPs: a.k.a. feed-forward neural networks, non-convex optimisation

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 218

Page 220: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.1 Feed-forward Network functions

We extend the previous regression model with fixed basis functions

y(x, w) = f( Σ_{j=1}^{M} w_j φ_j(x) )

to a model where φ_j is adaptive:

φ_j(x) = h( Σ_{i=0}^{D} w_{ji}^{(1)} x_i )

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 219

Page 221: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.1 Feed-forward Network functions

In the case of K outputs

y_k(x, w) = h_2( Σ_{j=1}^{M} w_{kj}^{(2)} h_1( Σ_{i=0}^{D} w_{ji}^{(1)} x_i ) )

h_2(x) is σ(x) or x, depending on the problem. h_1(x) is σ(x) or tanh(x).


Left) Two layer architecture. Right) general feed-forward network with skip-layer connections.

If h_1, h_2 are linear, the model is linear. If M < D, K it computes principal components (12.4.2).
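A minimal forward pass for the two-layer architecture, in code (NumPy assumed; the bias handling and toy sizes are our own conventions):

```python
import numpy as np

def forward(x, W1, W2, h1=np.tanh, h2=lambda a: a):
    """y_k = h2( sum_j W2[k, j] h1( sum_i W1[j, i] x_i ) ).

    x includes the bias component x_0 = 1; W1 is (M, D+1) and W2 is (K, M+1)."""
    z = np.concatenate(([1.0], h1(W1 @ x)))   # hidden activations, with bias unit z_0 = 1
    return h2(W2 @ z)

# toy usage: D = 2 inputs, M = 3 hidden units, K = 1 output
rng = np.random.default_rng(5)
W1, W2 = rng.normal(0, 1, (3, 3)), rng.normal(0, 1, (1, 4))
print(forward(np.array([1.0, 0.5, -0.2]), W1, W2))
```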

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 220

Page 222: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.1 Feed-forward Network functions


Left four) Two-layer NN with 3 hidden units can approximate many functions. Right) Two-layer NN with two inputs and 2 hidden units for classification. Dashed lines are hidden unit activities.

Feed-forward neural networks have good approximation properties. A two-layer network with linear outputs and tanh activation functions can approximate essentially any continuous function (given enough hidden units).

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 221

Page 223: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.1.1. Weight space symmetries

For any solution of the weights, there are many equivalent solutions due to symmetry:
- for any hidden unit j with tanh activation function, change w_ji → −w_ji and w_kj → −w_kj: 2^M solutions
- relabel the hidden units: M! solutions

Thus a total of M! 2^M equivalent solutions; similar symmetries exist for activation functions other than tanh.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 222

Page 224: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.2. Network training

Regression: t_n continuous-valued, h_2(x) = x, and one usually minimizes the squared error (one output)

E(w) = (1/2) Σ_{n=1}^{N} ( y(x_n, w) − t_n )² = − log Π_{n=1}^{N} N( t_n | y(x_n, w), β^{-1} ) + . . .

Classification: t_n ∈ {0, 1}, h_2(x) = σ(x), and y(x_n, w) is the probability of belonging to class 1.

E(w) = − Σ_{n=1}^{N} [ t_n log y(x_n, w) + (1 − t_n) log(1 − y(x_n, w)) ] = − log Π_{n=1}^{N} y(x_n, w)^{t_n} (1 − y(x_n, w))^{1−t_n}

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 223

Page 225: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.2. Network training

More than two classes: consider a network with K outputs y_k(x_n, w), with t_nk = 1 if x_n belongs to class k and zero otherwise.

E(w) = − Σ_{n=1}^{N} Σ_{k=1}^{K} t_nk log p_k(x_n, w)

p_k(x, w) = exp( y_k(x, w) ) / Σ_{k′=1}^{K} exp( y_{k′}(x, w) )

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 224

Page 226: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.2.1 Parameter optimization

[Figure: error surface E(w) over weight space (w_1, w_2), showing a local minimum w_A, the global minimum w_B, and the local gradient ∇E at a point w_C.]

E is minimal when ∇E(w) = 0, but not vice versa!

As a consequence, gradient-based methods find a local minimum, not necessarily the global minimum.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 225

Page 227: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Gradient descent optimization

The simplest procedure to optimize E is to start with a random w and iterate

w^{τ+1} = w^τ − η ∇E(w^τ)

This is called batch learning, where all training data are included in the computation of ∇E.

One can also consider on-line learning, where only one or a subset of the training patterns is considered for computing ∇E:

E(w) = Σ_n E^n(w),    w^{τ+1} = w^τ − η ∇E^n(w^τ)

This may be efficient for large data sets. It results in a stochastic dynamics in w that can help to escape local minima.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 226

Page 228: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.3.1 Error backpropagation

Error is a sum of errors per pattern

E(w) = Σ_n E^n(w),    E^n(w) = (1/2) ‖y(x_n, w) − t_n‖²

y_k(x, w) = h_2( w_k0 + Σ_{j=1}^{M} w_kj h_1( Σ_{i=0}^{D} w_ji x_i ) ) = h_2( w_k0 + Σ_{j=1}^{M} w_kj h_1(a_j) ) = h_2(a_k)

a_j = w_j0 + Σ_{i=1}^{D} w_ji x_i,    a_k = w_k0 + Σ_{j=1}^{M} w_kj h_1(a_j)

Define z_i = x_i for i in the input layer, z_j = h(a_j) for j in the hidden layer, and z_0 = 1 for the biases. Then

a_k = Σ_j w_kj z_j

where j runs over the units that send connections to k, including j = 0 for the bias.

We do each pattern separately, so we consider E^n.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 227

Page 229: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.3.1 Error backpropagation

We compute the gradient of E^n w.r.t. w_ji (the weight from node i to node j) by observing that E^n depends on w_ji only via a_j (since w_ji affects the output only via a_j). Chain rule:

∂E^n/∂w_ji = (∂E^n/∂a_j)(∂a_j/∂w_ji) = δ_j z_i,    δ_j ≡ ∂E^n/∂a_j

E^n depends on a_j only via the a_k. Again apply the chain rule:

δ_j = ∂E^n/∂a_j = Σ_k (∂E^n/∂a_k)(∂a_k/∂a_j) = Σ_k δ_k (∂a_k/∂a_j) = h_1′(a_j) Σ_k w_kj δ_k

δ_k = ∂E^n/∂a_k = ( y_k(x_n, w) − t_kn ) h_2′(a_k)

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 228

Page 230: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.3.1 Error backpropagation

[Figure: backpropagation of the δ's: unit i sends z_i forward through weight w_ji to unit j; the output errors δ_k are sent backward through the weights w_kj to compute δ_j.]

1. Take a pattern x_n, and compute by forward propagation all activations a_j and a_k

2. Compute the δ_k for the output units, and back-propagate the δ's to obtain δ_j for each hidden unit j

3. Compute ∂E^n/∂w_ji = δ_j z_i

4. For batch mode, ∂E/∂w_ji = Σ_n ∂E^n/∂w_ji

E is a function of O(|w|) variables. In general, the computation of E requires O(|w|) operations, so computing ∇E component by component would require O(|w|²) operations.

The backpropagation method allows ∇E to be computed efficiently, in O(|w|) operations.
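A compact sketch of these equations for the tanh/linear two-layer network used above (same weight-matrix conventions as the forward-pass sketch; NumPy assumed):

```python
import numpy as np

def backprop(x, t, W1, W2):
    """Gradients of E^n = 0.5 ||y - t||^2 for a tanh hidden layer and linear outputs.

    x includes the bias x_0 = 1; W1 is (M, D+1), W2 is (K, M+1)."""
    a_hidden = W1 @ x                                   # a_j
    z = np.concatenate(([1.0], np.tanh(a_hidden)))      # z_0 = 1, z_j = h1(a_j)
    y = W2 @ z                                          # linear outputs: y_k = a_k
    delta_k = y - t                                     # delta_k = (y_k - t_k) h2'(a_k), h2' = 1
    delta_j = (1 - np.tanh(a_hidden) ** 2) * (W2[:, 1:].T @ delta_k)  # h1'(a_j) sum_k w_kj delta_k
    grad_W2 = np.outer(delta_k, z)                      # dE^n/dw_kj = delta_k z_j
    grad_W1 = np.outer(delta_j, x)                      # dE^n/dw_ji = delta_j x_i (z_i = x_i)
    return grad_W1, grad_W2
```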

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 229

Page 231: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Error functions

Error functions can be derived from E = − log likelihood.

1) Regression: Gaussian noise assumption t = y + ε

E^n = (1/2) Σ_k ( y_k(x_n, w) − t_nk )²

• What is δ_k at the outputs for y_k = a_k and for y_k = tanh(a_k)?⁴

2) Classification (2 classes, t = 0, 1): motivated from p(t|x) = y^t (1 − y)^{1−t}

E^n = −[ t_n log y_n + (1 − t_n) log(1 − y_n) ]

• What is δ at the output for y = 1/(1 + exp(−a)) (logistic activation)?⁵

⁴ δ_k = (y_k − t_k) g′(a_k).   ⁵ δ = y_n − t_n.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 230

Page 232: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.3.3. Efficiency/ numerical differentiation

• Efficiency is O(W) (rather than O(W²)). Unfortunately, backprop is often quite slow.

• Numerical differentiation: good for software checking

∂E^n/∂w_ji ≈ ( E^n(w_ji + ε) − E^n(w_ji) ) / ε
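A small gradient-check sketch along these lines (the helper name and the forward-difference choice are ours; a central difference is usually more accurate):

```python
import numpy as np

def numerical_grad(E, w, eps=1e-6):
    """Finite-difference estimate of dE/dw for a scalar function E of a weight vector w."""
    g = np.zeros_like(w)
    E0 = E(w)
    for i in range(len(w)):
        w_plus = w.copy()
        w_plus[i] += eps
        g[i] = (E(w_plus) - E0) / eps        # forward difference, as on the slide
    return g
```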

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 231

Page 233: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Newtons method

One can also use Hessian information for optimization. As an example, consider a quadratic approximation to E around w_0:

E(w) = E(w_0) + b^T (w − w_0) + (1/2) (w − w_0)^T H (w − w_0)

b_i = ∂E(w_0)/∂w_i,    H_ij = ∂²E(w_0)/(∂w_i ∂w_j)

∇E(w) = b + H(w − w_0)

We can solve ∇E(w) = 0 and obtain

w = w_0 − H^{-1} ∇E(w_0)

This is called Newton's method.

[Figure: elliptical contours of the quadratic approximation around the minimum w*, with principal axes along the Hessian eigenvectors u_1, u_2 and lengths proportional to λ_1^{-1/2}, λ_2^{-1/2}.]

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 232

Page 234: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Newtons method

Note that this is expensive, since H is large and needs to be inverted. Simpler alternatives are to consider only the diagonal elements of the Hessian (section 5.4.1), the Levenberg-Marquardt approximation (section 5.4.2), or approximate inverse Hessians / conjugate gradient methods (section 5.4.3).

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 233

Page 235: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Practical applications

Classical ones:

• Backgammon, Protein folding, Hand Written ZIP-code Recognition, ...

But nowadays quite common (www.mbfys.kun.nl/snn/Research/siena/cases/)

• Prediction of Yarn Properties in Chemical Process Technology (AKZO)

• Current Prediction for Shipping Guidance (Rijkswaterstaat)

• Recognition of Exploitable Oil and Gas Wells (SHELL)

• Prediction of Newspaper Sales (Telegraaf)

• ...

Standard tool in many packages (SAS, Matlab, ...)

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 234

Page 236: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.5. Regularization

[Figure: network fits to the sinusoidal data with M = 1, M = 3 and M = 10 hidden units.]

Complexity of neural network solution is controlled by number of hidden units


Sum-squared test error for different numbers of hidden units and different weight initializations. The error is also affected by local minima.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 235

Page 237: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.5.2 Early stopping

[Figures: training-set and validation-set error against the number of iterations, and the weight-space picture of early stopping: the trajectory from small initial weights towards w_ML is halted at a point similar to the weight-decay solution.]

Early stopping: stop training when the error on a validation (test) set starts increasing.

Early stopping with small initial weights has the effect of weight decay:

E(w) = λ_1 (w_1 − w*_1)² + λ_2 (w_2 − w*_2)² + λ (w_1² + w_2²)

∂E/∂w_i = λ_i (w_i − w*_i) + λ w_i = 0,    i = 1, 2

w_i = λ_i / (λ_i + λ) · w*_i

When λ_1 ≪ λ ≪ λ_2: w_1 ≈ (λ_1/λ) w*_1 ≈ 0 and w_2 ≈ w*_2.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 236

Page 238: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.5.3 Invariances

Sometimes the network output should be invariant to a transformation of the input pattern. We discuss two approaches: transforming the training patterns, and convolutional networks.

Transforming the training patterns:

Each pixel of a training pattern is displaced with a displacement field, which is generated from

smoothing randomly generated displacements.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 237

Page 239: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.5.6 Convolutional networks

[Figure: input image → convolutional layer → sub-sampling layer.]

Convolutional networks have three ingredients: 1) local receptive fields, 2) weight sharing, 3) subsampling.

As a result, a shift or rotation of the input yields only a reduced shift or rotation of the activity in the convolutional layer.

Subsampling averages local activity.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 238

Page 240: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Learning proceeds through a variant of backpropagation (see LeCun's website for a tutorial and demos).

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 239

Page 241: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.6. Mixture density models

In many problems the mapping to be learned is not single valued.


Given joint angles θ1, θ2 the robot end point (x1, x2) is uniquely defined. The inverse problem is

ill-posed.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 240

Page 242: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.6. Mixture density models

Neural networks have problems learning such multi-valued mappings.


Left: t = x + sin(2πx) + noise. Right: t and x reversed. Red is the solution of a 6-hidden-unit neural network.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 241

Page 243: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.6. Mixture density models

We can train a density model for these multi-modal cases

p(t|x) = Σ_k π_k(x) N( t | µ_k(x), σ_k²(x) )

π_k(x) = exp(a_k^π) / Σ_l exp(a_l^π),    σ_k(x) = exp(a_k^σ)

a_k^π(x), a_k^σ(x) and µ_k(x) are computed by neural networks. Training is done by maximizing the likelihood.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 242

Page 244: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.6. Mixture density models


Top left: π_k(x), k = 1, 2, 3. Top right: µ_k(x), k = 1, 2, 3. Bottom left: contours of p(t|x). Bottom right: argmax_t p(t|x) (red).

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 243

Page 245: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

5.7. Bayesian neural networks

We illustrate the Bayesian procedure in the Laplace approximation for a classification task.

log p(D|w) = Σ_{n=1}^{N} [ t_n log y_n + (1 − t_n) log(1 − y_n) ]

E(w) = − log p(D|w) + (α/2) w^T w

We first minimize E to obtain w_MAP and the Hessian

A = ∇∇E = α I − ∇∇ log p(D|w) |_{w_MAP}

The posterior is approximately

q(w) ∝ exp( −E(w_MAP) − (1/2) (w − w_MAP)^T A (w − w_MAP) ).

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 244

Page 246: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

We compute the evidence

log p(D|α) = log ∫ dw p(D|w) p(w|α) ≈ −E(w_MAP) − (1/2) log |A| + (W/2) log α − (N/2) log 2π

We numerically maximize this wrt α starting from a small value.


Left: two-layer network with 8 hidden units for classification (α = 0, black); result with the optimal α from the evidence framework; optimal decision boundary (green); predictive distribution using p(t|x, w_MAP, α). Middle: the same solution with more contour lines. Right: improvement from Bayesian integration of p(t|x, w, α) with the Gaussian approximation of the posterior.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 245

Page 247: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Gradient descent

In this brief mathematical section, we want to introduce the concepts of gradient descent and linear differential equations. As a motivating example, consider data x_n, n = 1, . . . , N drawn from a Gaussian distribution N(x|µ, σ) and the minus log-likelihood

L(µ, σ) = − Σ_{n=1}^{N} log N(x_n|µ, σ)

Using 1.46, we obtain 1.54

L(µ, σ) = (1/(2σ²)) Σ_{n=1}^{N} (x_n − µ)² + (N/2) log 2πσ²

For this simple case, we can maximize the likelihood in closed form, and the solution is 1.55 and 1.56. For more complex models (neural networks) this is no longer possible. A general approach is then to use gradient descent to minimize the minus log-likelihood.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 246

Page 248: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Gradient descent

The idea: let L(θ) be the (minus log-)likelihood as a function of the parameters θ.

1. Start with an initial value of θ and ε small.

2. While the "change in θ is large": compute θ := θ − ε dL/dθ

The stop criterion is dL/dθ ≈ 0, which means that we stop in a local minimum of L.

Does this algorithm converge? Yes, if ε is "sufficiently small" and L is bounded from below.

Proof: denote Δθ = −ε dL/dθ. Then

L(θ + Δθ) ≈ L(θ) + (dL/dθ) Δθ = L(θ) − ε (dL/dθ)² ≤ L(θ)

In each gradient descent step the value of L is lowered. Since L is bounded from below, the procedure must converge asymptotically.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 247

Page 249: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Gradient descent

We apply this to the Gaussian likelihood (divided by N) for µ:

Δµ = ε (1/σ²) ( (1/N) Σ_{n=1}^{N} x_n − µ ) = ε′ (µ_ML − µ)

So this defines a 'dynamics' for µ:

µ(i + 1) = µ(i) + ε′ (µ_ML − µ(i))

Consider i = t as time, with 1 a time step of size dt = 1. Then

µ(t + dt) = µ(t) + ε′ (µ_ML − µ(t)) dt

In the limit dt → 0 we obtain the differential equation

dµ/dt = lim_{dt→0} ( µ(t + dt) − µ(t) ) / dt = ε′ (µ_ML − µ)

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 248

Page 250: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Intermezzo: linear differential equation

Solving a linear differential equation:

dx/dt = −ax + b

with initial condition x(t = 0) = x_0.

The solution of a linear differential equation is an exponential:

x(t) = A exp(λt) + B

with λ, A, B to be determined (exercise).

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 249

Page 251: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Intermezzo: linear differential equation

Substituting yields:

λA exp(λt) = −a ( A exp(λt) + B ) + b  ⇒  λ = −a,   −aB + b = 0

Thus x(t) = A exp(−at) + b/a. Note that a > 0 is required to have a finite solution for t → ∞.

Initial condition: x_0 = A + b/a ⇒ A = x_0 − b/a, thus

x(t) = x_0 exp(−at) + (b/a) (1 − exp(−at))

Indeed x(0) = x_0 and x(∞) = b/a.

1/a is the characteristic time.

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 250

Page 252: Bert Kappen, Wim Wiegerinck SNN University of Nijmegen ...wimw/dirPRforAI/sheets_prml.pdf · Introduction to Pattern Recognition Bert Kappen, Wim Wiegerinck SNN University of Nijmegen,

Gradient descent

Apply this to the gradient descent dynamics:

dµ/dt = ε′ (µ_ML − µ)

Then a = ε′ and b = ε′ µ_ML. When µ(0) = µ_0, the solution is

µ(t) = µ_0 exp(−ε′t) + µ_ML ( 1 − exp(−ε′t) )

The characteristic time is 1/ε′, which increases for small learning rates.
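A small numerical check of this relaxation (the data, learning rate and print times are our own choices):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(3.0, 1.0, 200)
mu_ml, sigma2 = x.mean(), x.var()
eps = 0.05
eps_prime = eps / sigma2
mu = 0.0                                      # mu(0) = mu_0 = 0
for i in range(1, 101):
    mu += eps * (mu_ml - mu) / sigma2         # Delta mu = eps (mu_ML - mu) / sigma^2
    if i in (10, 50, 100):
        analytic = mu_ml * (1 - np.exp(-eps_prime * i))
        print(i, round(mu, 3), round(analytic, 3))
# for small eps' the discrete iterates track the continuous-time solution closely
```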

Bert Kappen and Wim Wiegerinck figures from http://research.microsoft.com/~cmbishop/PRML/webfigs.htm Pattern Recognition 251