Statistical Machine Learning
© 2020 Ong & Walder & Webers
Data61 | CSIRO, The Australian National University

Page 1

Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Statistical Machine Learning

Christian Walder

Machine Learning Research Group, CSIRO Data61

and

College of Engineering and Computer Science, The Australian National University

Canberra, Semester One, 2020.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Page 2

Part VI

Linear Classification 2

Page 3

Three Models for Decision Problems

In increasing order of complexity:

Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1 Solve the inference problem of determining the posterior class probabilities p(Ck | x).
2 Use decision theory to assign each new x to one of the classes.

Generative Models
1 Solve the inference problem of determining the class-conditional probabilities p(x | Ck).
2 Also, infer the prior class probabilities p(Ck).
3 Use Bayes' theorem to find the posterior p(Ck | x).
4 Alternatively, model the joint distribution p(x, Ck) directly.
5 Use decision theory to assign each new x to one of the classes.

Page 4

Data Generating Process

Given:
class prior p(t)
class-conditional p(x | t)

To generate data from the model we may do the following:
1 Sample the class label from the class prior p(t).
2 Sample the data features from the class-conditional distribution p(x | t).
(More about sampling later — this is called ancestral sampling.)

Thinking about the data generating process is a useful modelling step, especially when we have more prior knowledge.
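To make the two sampling steps concrete, here is a minimal ancestral-sampling sketch for a two-class model with Gaussian class-conditionals; the prior π, the means and the shared covariance below are made-up values, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up model parameters, purely for illustration.
pi = 0.3                                    # class prior p(t = 1)
mu = {0: np.array([0.0, 0.0]),              # class-conditional means
      1: np.array([2.0, 1.0])}
Sigma = np.array([[1.0, 0.3],               # shared covariance
                  [0.3, 1.0]])

def sample(n):
    """Ancestral sampling: first the label from p(t), then x from p(x | t)."""
    t = rng.binomial(1, pi, size=n)                                      # step 1
    x = np.stack([rng.multivariate_normal(mu[ti], Sigma) for ti in t])   # step 2
    return x, t

X, t = sample(5)
print(t)
print(X)
```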

Page 5

Probabilistic Generative Models

Generative approach: model class-conditional densities p(x | Ck) and class priors (not parameter priors!) p(Ck) to calculate the posterior probability for class C1:

\[
p(C_1 \mid x) = \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_1)\,p(C_1) + p(x \mid C_2)\,p(C_2)}
             = \frac{1}{1 + \exp(-a(x))} \equiv \sigma(a(x))
\]

where a and the logistic sigmoid function σ(a) are given by

\[
a(x) = \ln \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_2)\,p(C_2)} = \ln \frac{p(x, C_1)}{p(x, C_2)},
\qquad
\sigma(a) = \frac{1}{1 + \exp(-a)}.
\]

One point of this re-writing: we may learn a(x) directly as e.g. a deep neural network.

Page 6

Logistic Sigmoid

The logistic sigmoid function is called a “squashing function” because it squashes the real axis into a finite interval (0, 1).
Well known properties (derive them):
Symmetry: σ(−a) = 1 − σ(a)
Derivative: dσ(a)/da = σ(a)σ(−a) = σ(a)(1 − σ(a))
The inverse of σ is called the logit function: a(σ) = ln(σ/(1 − σ)).

[Plot: the sigmoid σ(a) = 1/(1 + exp(−a)) and its inverse, the logit a(σ) = ln(σ/(1 − σ)).]
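A quick numerical check of these properties, as a small NumPy sketch (the evaluation points are arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    return np.log(s / (1.0 - s))

a = np.linspace(-5, 5, 11)

# Symmetry: sigma(-a) = 1 - sigma(a)
assert np.allclose(sigmoid(-a), 1.0 - sigmoid(a))

# Derivative: d/da sigma(a) = sigma(a)(1 - sigma(a)), checked by finite differences
eps = 1e-6
num_grad = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert np.allclose(num_grad, sigmoid(a) * (1.0 - sigmoid(a)), atol=1e-8)

# Inverse: logit(sigma(a)) = a
assert np.allclose(logit(sigmoid(a)), a)
print("all sigmoid/logit identities hold numerically")
```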

Page 7

Probabilistic Generative Models - Multiclass

The normalised exponential is given by

\[
p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{\sum_j p(x \mid C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}
\]

where

\[
a_k = \ln\big(p(x \mid C_k)\,p(C_k)\big).
\]

Usually called the softmax function as it is a smoothed version of the argmax function; in particular,

\[
a_k \gg a_j \;\;\forall j \neq k
\;\Rightarrow\;
\big( p(C_k \mid x) \approx 1 \;\wedge\; p(C_j \mid x) \approx 0 \big).
\]

So, softargmax is a more descriptive though less common name.
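A minimal softmax sketch; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result, since the softmax is invariant to adding a constant to every a_k.

```python
import numpy as np

def softmax(a):
    """Normalised exponential p(C_k | x) = exp(a_k) / sum_j exp(a_j)."""
    a = np.asarray(a, dtype=float)
    a = a - a.max()                 # stability: softmax is shift-invariant
    e = np.exp(a)
    return e / e.sum()

# If one a_k dominates, the softmax output approaches a one-hot argmax.
print(softmax([1.0, 2.0, 3.0]))     # smooth distribution
print(softmax([1.0, 2.0, 30.0]))    # approximately [0, 0, 1]
```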

Page 8

Probabil. Generative Model - Continuous Input

Assume the class-conditional probabilities are Gaussian, with the same covariance and different means.
Let's characterise the posterior probabilities. We may separate the quadratic and linear terms in x:

\[
\begin{aligned}
p(x \mid C_k)
&= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}}
   \exp\left\{ -\tfrac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\} \\
&= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}}
   \exp\left\{ -\tfrac{1}{2} x^T \Sigma^{-1} x + \mu_k^T \Sigma^{-1} x - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k \right\}
\end{aligned}
\]

Page 9

Probabil. Generative Model - Continuous Input

For two classes

\[
p(C_1 \mid x) = \sigma(a(x))
\]

and a(x) is linear because the quadratic terms in x cancel (c.f. the previous slide):

\[
a(x) = \ln \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_2)\,p(C_2)}
     = \ln \frac{\exp\left\{ \mu_1^T \Sigma^{-1} x - \tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 \right\}}
                {\exp\left\{ \mu_2^T \Sigma^{-1} x - \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 \right\}}
       + \ln \frac{p(C_1)}{p(C_2)}
\]

Therefore

\[
p(C_1 \mid x) = \sigma(w^T x + w_0)
\]

where

\[
w = \Sigma^{-1} (\mu_1 - \mu_2), \qquad
w_0 = -\tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}.
\]
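A sketch that evaluates these formulas for made-up µ1, µ2, Σ and priors, and cross-checks σ(w^T x + w0) against the posterior computed directly via Bayes' theorem (SciPy is used only for the Gaussian density):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up parameters for illustration.
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
Sigma = np.array([[1.0, 0.2], [0.2, 0.5]])
p1, p2 = 0.4, 0.6                       # class priors p(C1), p(C2)

Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(p1 / p2)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x = np.array([0.3, -0.2])
post_linear = sigmoid(w @ x + w0)

# Direct Bayes' theorem for comparison.
num = multivariate_normal.pdf(x, mu1, Sigma) * p1
den = num + multivariate_normal.pdf(x, mu2, Sigma) * p2
post_bayes = num / den

print(post_linear, post_bayes)          # the two agree
```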

Page 10

Probabil. Generative Model - Continuous Input

[Figure] Class-conditional densities for two classes (left). Posterior probability p(C1 | x) (right). Note the logistic sigmoid of a linear function of x.

Page 11

General Case - K Classes, Shared Covariance

Use the normalised exponential

\[
p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{\sum_j p(x \mid C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)},
\qquad
a_k = \ln\big( p(x \mid C_k)\,p(C_k) \big),
\]

to get a linear function of x,

\[
a_k(x) = w_k^T x + w_{k0},
\]

where

\[
w_k = \Sigma^{-1} \mu_k, \qquad
w_{k0} = -\tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k).
\]
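The same idea for K classes, as a sketch with made-up means, a shared covariance and priors (three classes in two dimensions), classified with the softmax from the earlier slide:

```python
import numpy as np

# Made-up parameters for K = 3 classes in D = 2 dimensions.
mus = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
Sigma = np.eye(2)
priors = np.array([0.5, 0.3, 0.2])

Sinv = np.linalg.inv(Sigma)
W = mus @ Sinv                                              # row k is w_k^T
w0 = -0.5 * np.einsum('kd,de,ke->k', mus, Sinv, mus) + np.log(priors)

def posterior(x):
    a = W @ x + w0                      # a_k(x) = w_k^T x + w_k0
    a -= a.max()                        # numerically stable softmax
    e = np.exp(a)
    return e / e.sum()

print(posterior(np.array([1.0, 1.0])))
```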

Page 12

General Case - K Classes, Different Covariance

If the class-conditional distributions have different covariances Σk, the quadratic terms −½ x^T Σ_k^{-1} x do not cancel out. We get a quadratic discriminant.


Page 13

Parameter Estimation

Given the functional form of the class-conditional densities p(x | Ck), how can we determine the parameters µ and Σ and the class prior?
Simplest is maximum likelihood.
Given also a data set (xn, tn) for n = 1, . . . , N (using the coding scheme where tn = 1 corresponds to class C1 and tn = 0 denotes class C2).
Assume the class-conditional densities to be Gaussian with the same covariance, but different means.
Denote the prior probability p(C1) = π, and therefore p(C2) = 1 − π.
Then

\[
p(x_n, C_1) = p(C_1)\,p(x_n \mid C_1) = \pi\,\mathcal{N}(x_n \mid \mu_1, \Sigma)
\]
\[
p(x_n, C_2) = p(C_2)\,p(x_n \mid C_2) = (1 - \pi)\,\mathcal{N}(x_n \mid \mu_2, \Sigma)
\]


Page 15

Maximum Likelihood Solution

Thus the likelihood for the whole data set X and t is given by

\[
p(\mathbf{t}, \mathbf{X} \mid \pi, \mu_1, \mu_2, \Sigma)
= \prod_{n=1}^{N} \big[ \pi\,\mathcal{N}(x_n \mid \mu_1, \Sigma) \big]^{t_n}
  \times \big[ (1 - \pi)\,\mathcal{N}(x_n \mid \mu_2, \Sigma) \big]^{1 - t_n}
\]

Maximise the log likelihood. The term depending on π is

\[
\sum_{n=1}^{N} \big( t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \big)
\]

which is maximal for (derive it)

\[
\pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}
\]

where N1 is the number of data points in class C1.

Page 16

Maximum Likelihood Solution

Similarly, we can maximise the likelihood p(t, X | π, µ1, µ2, Σ) w.r.t. the means µ1 and µ2, to get

\[
\mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n\, x_n, \qquad
\mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n)\, x_n
\]

For each class, these are the means of all input vectors assigned to that class.

Page 17

Maximum Likelihood Solution

Finally, the log likelihood ln p(t, X | π, µ1, µ2, Σ) can be maximised for the covariance Σ, resulting in

\[
\Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2, \qquad
S_k = \frac{1}{N_k} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T.
\]
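Putting the three maximum-likelihood estimates together, a sketch that fits π, µ1, µ2 and the shared Σ from a labelled data set (rows of X are the xn, and t ∈ {0, 1}^N uses the coding scheme above):

```python
import numpy as np

def fit_shared_cov_gaussians(X, t):
    """ML estimates for the two-class generative model with shared covariance.

    t_n = 1 denotes class C1, t_n = 0 denotes class C2 (as on the slides)."""
    X, t = np.asarray(X, float), np.asarray(t, float)
    N = len(t)
    N1, N2 = t.sum(), N - t.sum()

    pi = N1 / N                                    # prior p(C1)
    mu1 = (t[:, None] * X).sum(axis=0) / N1        # mean of class-C1 points
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2  # mean of class-C2 points

    d1 = X[t == 1] - mu1
    d2 = X[t == 0] - mu2
    S1 = d1.T @ d1 / N1
    S2 = d2.T @ d2 / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2          # weighted average of class covariances
    return pi, mu1, mu2, Sigma

# Toy usage with synthetic data.
rng = np.random.default_rng(1)
t = rng.binomial(1, 0.4, size=500)
X = np.where(t[:, None] == 1,
             rng.multivariate_normal([2, 0], np.eye(2), size=500),
             rng.multivariate_normal([0, 1], np.eye(2), size=500))
print(fit_shared_cov_gaussians(X, t))
```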

Page 18

Discrete Features - Naïve Bayes

Assume the input space consists of discrete features, in the simplest case xi ∈ {0, 1}.
For a D-dimensional input space, a general distribution would be represented by a table with 2^D entries.
Together with the normalisation constraint, these are 2^D − 1 independent variables.
This grows exponentially with the number of features.
The Naïve Bayes assumption is that, given the class Ck, the features are independent of each other:

\[
p(x \mid C_k) = \prod_{i=1}^{D} p(x_i \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}
\]

Page 19

Discrete Features - Naïve Bayes

With the naïve Bayes

\[
p(x \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}
\]

we can then again find the factors ak in the normalised exponential

\[
p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{\sum_j p(x \mid C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}
\]

as a linear function of the xi:

\[
a_k(x) = \sum_{i=1}^{D} \big\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \big\} + \ln p(C_k).
\]
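A sketch of this Bernoulli naïve Bayes classifier: estimate the µki and priors from counts, then classify via the linear ak(x) and a softmax. The Laplace-smoothing pseudo-count alpha is an extra assumption, added only to keep the logarithms finite; it is not part of the slide.

```python
import numpy as np

def fit_bernoulli_nb(X, t, K, alpha=1.0):
    """X: (N, D) binary features, t: (N,) labels in {0, ..., K-1}.
    alpha is a smoothing pseudo-count (an extra modelling choice)."""
    N, D = X.shape
    mu = np.zeros((K, D))
    log_prior = np.zeros(K)
    for k in range(K):
        Xk = X[t == k]
        mu[k] = (Xk.sum(axis=0) + alpha) / (len(Xk) + 2 * alpha)   # estimate of mu_ki
        log_prior[k] = np.log(len(Xk) / N)                         # ln p(C_k)
    return mu, log_prior

def predict_proba(x, mu, log_prior):
    # a_k(x) = sum_i { x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki) } + ln p(C_k)
    a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + log_prior
    a -= a.max()                        # numerically stable softmax
    e = np.exp(a)
    return e / e.sum()

# Toy usage: 2 classes, 3 binary features.
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
t = np.array([0, 0, 1, 1])
mu, log_prior = fit_bernoulli_nb(X, t, K=2)
print(predict_proba(np.array([1, 0, 1]), mu, log_prior))
```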

Page 20

Three Models for Decision Problems

In increasing order of complexity:

Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1 Solve the inference problem of determining the posterior class probabilities p(Ck | x).
2 Use decision theory to assign each new x to one of the classes.

Generative Models
1 Solve the inference problem of determining the class-conditional probabilities p(x | Ck).
2 Also, infer the prior class probabilities p(Ck).
3 Use Bayes' theorem to find the posterior p(Ck | x).
4 Alternatively, model the joint distribution p(x, Ck) directly.
5 Use decision theory to assign each new x to one of the classes.

Page 21

Probabilistic Discriminative Models

Discriminative training: learn only to discriminate between the classes.
Maximise a likelihood function defined through the conditional distribution p(Ck | x) directly.
Typically fewer parameters to be determined.
As we learn the posterior p(Ck | x) directly, prediction may be better than with a generative model where the class-conditional density assumptions p(x | Ck) poorly approximate the true distributions.
But: discriminative models cannot create synthetic data, as p(x) is not modelled.
As an aside: certain theoretical analyses show that generative models converge faster to their — albeit worse — asymptotic classification performance and are superior in some regimes.

Page 22

Original Input versus Feature Space

So far in classification, we used the direct input x.
All classification algorithms work also if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x).
Example: Use two Gaussian basis functions centered at the green crosses in the input space.

[Figure: input space (x1, x2) with the two basis-function centres, and the corresponding feature space (φ1, φ2).]
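A sketch of such a feature map with two Gaussian basis functions; the centres and width below are arbitrary stand-ins for the "green crosses" in the figure.

```python
import numpy as np

# Two Gaussian basis functions; centres and width are illustrative choices.
centres = np.array([[-0.5, 0.0],
                    [ 0.5, 0.0]])
s = 0.5                                    # basis-function width

def phi(x):
    """Map a 2-D input x to feature space (phi_1(x), phi_2(x))."""
    x = np.asarray(x, dtype=float)
    d2 = ((x - centres) ** 2).sum(axis=1)  # squared distance to each centre
    return np.exp(-d2 / (2 * s ** 2))

print(phi([0.0, 0.0]))   # a point between the centres activates both features
print(phi([2.0, 2.0]))   # a distant point gives features near zero
```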

Page 23

Original Input versus Feature Space

Linear decision boundaries in the feature space generally correspond to nonlinear boundaries in the input space.
Classes which are NOT linearly separable in the input space may become linearly separable in the feature space:

[Figure: data in input space (x1, x2) and in feature space (φ1, φ2); a linear boundary in feature space maps to a nonlinear boundary in input space.]

If classes overlap in input space, they will also overlap in feature space — nonlinear features φ(x) cannot remove the overlap; but they may increase it.

Page 24

Original Input versus Feature Space

Fixed basis functions do not adapt to the data and therefore have important limitations (see the discussion in Linear Regression).
Understanding of more advanced algorithms becomes easier if we introduce the feature space now and use it instead of the original input space.
Some applications use fixed features successfully by avoiding the limitations.
We will therefore use φ instead of x from now on.

Page 25

Logistic Regression

Two classes where the posterior of class C1 is a logistic sigmoid σ(·) acting on a linear function of the input:

\[
p(C_1 \mid \phi) = y(\phi) = \sigma(w^T \phi), \qquad
p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi)
\]

The model dimension is equal to the dimension M of the feature space.
Compare this to fitting two Gaussians, which has a quadratic number of parameters in M:

\[
\underbrace{2M}_{\text{means}} + \underbrace{M(M+1)/2}_{\text{shared covariance}}
\]

For larger M, the logistic regression model has a clear advantage.

Page 26

Logistic Regression

Determine the parameters w via maximum likelihood for data (φn, tn), n = 1, . . . , N, where φn = φ(xn). The class membership is coded as tn ∈ {0, 1}.
Likelihood function:

\[
p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}
\]

where yn = p(C1 | φn).
Error function: the negative log likelihood, resulting in the cross-entropy error function

\[
E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\}
\]

Page 27

Logistic Regression

Error function (cross-entropy loss):

\[
E(w) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\},
\qquad
y_n = p(C_1 \mid \phi_n) = \sigma(w^T \phi_n)
\]

We obtain the gradient of the error function using the chain rule and the sigmoid result dσ/da = σ(1 − σ) (derive it):

\[
\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n
\]

For each data point, the error contribution is the product of the deviation yn − tn and the basis function vector φn.
We can now use gradient descent.
We may easily modify this to reduce over-fitting by using a regularised error or MAP (how?).
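A minimal gradient-descent sketch for this model; the step size, iteration count and toy data are arbitrary choices, and the optional lam parameter is the L2 (Gaussian-prior / MAP-style) penalty hinted at above.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def fit_logistic(Phi, t, lr=0.1, iters=2000, lam=0.0):
    """Phi: (N, M) design matrix of features phi_n, t: (N,) targets in {0, 1}."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t) + lam * w   # lam > 0 adds an L2 (MAP-style) penalty
        w -= lr * grad / N
    return w

# Toy usage: two 1-D Gaussian clusters with a bias feature.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 100), rng.normal(1, 1, 100)])
t = np.concatenate([np.zeros(100), np.ones(100)])
Phi = np.column_stack([np.ones_like(x), x])   # phi(x) = (1, x)
w = fit_logistic(Phi, t)
print(w)
```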

Page 28

Laplace Approximation

Given a continuous distribution p(x) which is not Gaussian, can we approximate it by a Gaussian q(x)?
Need to find a mode of p(x). Try to find a Gaussian with the same mode:

[Plot: p.d.f. of the non-Gaussian distribution (yellow) and its Gaussian approximation (red).]
[Plot: negative log p.d.f. of the non-Gaussian distribution (yellow) and its Gaussian approximation (red).]

Page 29

Laplace Approximation

Cheap and nasty, but sometimes effective.
Assume p(z) can be written as

\[
p(z) = \frac{1}{Z} f(z)
\]

with normalisation Z = ∫ f(z) dz.
We do not even need to know Z to find the Laplace approximation.
A mode of p(z) is at a point z0 where p′(z0) = 0.
Taylor expansion of ln f(z) at z0:

\[
\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2
\]

where

\[
A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}.
\]

Page 30

Laplace Approximation

Exponentiating

\[
\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2
\]

we get

\[
f(z) \simeq f(z_0) \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}.
\]

And after normalisation we get the Laplace approximation

\[
q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}.
\]

Only defined for precision A > 0, as only then does p(z) have a maximum.
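A numerical sketch of the one-dimensional case on a made-up non-Gaussian density: find a mode with a generic optimiser, estimate the precision A by finite differences, and form q(z).

```python
import numpy as np
from scipy.optimize import minimize_scalar

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Unnormalised, non-Gaussian toy density (an illustrative choice).
f = lambda z: np.exp(-z**2 / 2.0) * sigmoid(20 * z + 4)
neg_log_f = lambda z: -np.log(f(z))

# 1. Find a mode z0 of f (equivalently a minimum of -ln f).
z0 = minimize_scalar(neg_log_f, bounds=(-5, 5), method="bounded").x

# 2. Precision A = -d^2/dz^2 ln f(z) at z0, via central finite differences.
eps = 1e-4
A = (neg_log_f(z0 + eps) - 2 * neg_log_f(z0) + neg_log_f(z0 - eps)) / eps**2

# Laplace approximation q(z) = N(z | z0, 1/A).
q = lambda z: np.sqrt(A / (2 * np.pi)) * np.exp(-0.5 * A * (z - z0)**2)
print("mode:", z0, "precision:", A, "q(z0):", q(z0))
```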

Page 31

Laplace Approximation - Vector Space

Approximate p(z) for z ∈ R^M, where

\[
p(z) = \frac{1}{Z} f(z).
\]

We get the Taylor expansion

\[
\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} (z - z_0)^T A (z - z_0)
\]

where the Hessian A is defined as

\[
A = -\nabla\nabla \ln f(z) \big|_{z = z_0}.
\]

The Laplace approximation of p(z) is then

\[
q(z) \propto \exp\left\{ -\frac{1}{2} (z - z_0)^T A (z - z_0) \right\}
\quad\Rightarrow\quad
q(z) = \mathcal{N}(z \mid z_0, A^{-1}).
\]

Page 32

Bayesian Logistic Regression

Exact Bayesian inference for logistic regression is intractable.
Why? We need to normalise a product of the prior and the likelihood, which is itself a product of logistic sigmoid functions, one for each data point.
Evaluation of the predictive distribution is also intractable.
Therefore we will use the Laplace approximation.
The predictive distribution remains intractable even under the Laplace approximation to the posterior distribution, but it can be approximated.

Page 33

Bayesian Logistic Regression

Assume a Gaussian prior

\[
p(w) = \mathcal{N}(w \mid m_0, S_0)
\]

for fixed hyperparameters m0 and S0.
Hyperparameters are parameters of a prior distribution. In contrast to the model parameters w, they are not learned.
For a set of training data (xn, tn), where n = 1, . . . , N, the posterior is given by

\[
p(w \mid \mathbf{t}) \propto p(w)\, p(\mathbf{t} \mid w)
\]

where t = (t1, . . . , tN)^T.

Page 34

Bayesian Logistic Regression

Using our previous result for the cross-entropy error function

\[
E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\}
\]

we can now calculate the log of the posterior

\[
p(w \mid \mathbf{t}) \propto p(w)\, p(\mathbf{t} \mid w)
\]

using the notation yn = σ(w^T φn) as

\[
\ln p(w \mid \mathbf{t}) = -\frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0)
+ \sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} + \text{const.}
\]

Page 35

Bayesian Logistic Regression

To obtain a Gaussian approximation to

\[
\ln p(w \mid \mathbf{t}) = -\frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0)
+ \sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} + \text{const.}
\]

1 Find wMAP which maximises ln p(w | t). This defines the mean of the Gaussian approximation. (Note: this is a nonlinear function of w because yn = σ(w^T φn).)
2 Calculate the second derivative of the negative log posterior to get the inverse covariance of the Laplace approximation

\[
S_N^{-1} = -\nabla\nabla \ln p(w \mid \mathbf{t}) = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T.
\]

Nowadays the gradient and Hessian would be computed with automatic differentiation; one need only implement ln p(w | t).
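A sketch of this two-step procedure using a few Newton steps with the closed-form gradient and Hessian (rather than automatic differentiation); the broad prior m0 = 0, S0 = 100 I in the toy usage is an assumed choice.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, iters=20):
    """Return (w_MAP, S_N) of the Gaussian approximation q(w) = N(w | w_MAP, S_N)."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        # Gradient of -ln p(w | t).
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)
        # Hessian of -ln p(w | t): S_0^{-1} + sum_n y_n (1 - y_n) phi_n phi_n^T.
        H = S0_inv + Phi.T @ (Phi * (y * (1 - y))[:, None])
        w = w - np.linalg.solve(H, grad)              # Newton step towards w_MAP
    # Final Hessian at w_MAP gives the precision S_N^{-1}.
    y = sigmoid(Phi @ w)
    H = S0_inv + Phi.T @ (Phi * (y * (1 - y))[:, None])
    return w, np.linalg.inv(H)

# Toy usage with a broad, assumed prior.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
t = np.concatenate([np.zeros(50), np.ones(50)])
Phi = np.column_stack([np.ones_like(x), x])
w_map, S_N = laplace_posterior(Phi, t, m0=np.zeros(2), S0=100 * np.eye(2))
print(w_map)
print(S_N)
```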

Page 36

Bayesian Logistic Regression

The Gaussian approximation (via the Laplace approximation) of the posterior distribution is now

\[
q(w) = \mathcal{N}(w \mid w_{\mathrm{MAP}}, S_N)
\]

where

\[
S_N^{-1} = -\nabla\nabla \ln p(w \mid \mathbf{t}) = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T.
\]