Statistical Machine Learning
© 2020 Ong & Walder & Webers
Data61 | CSIRO, The Australian National University

Page 1

Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Statistical Machine Learning

Christian Walder

Machine Learning Research Group, CSIRO Data61

and

College of Engineering and Computer Science, The Australian National University

Canberra, Semester One, 2020.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Page 2

Part VI

Linear Classification 2

Page 3

Three Models for Decision Problems

In increasing order of complexity:

Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1 Solve the inference problem of determining the posterior class probabilities p(Ck | x).
2 Use decision theory to assign each new x to one of the classes.

Generative Models
1 Solve the inference problem of determining the class-conditional probabilities p(x | Ck).
2 Also, infer the prior class probabilities p(Ck).
3 Use Bayes' theorem to find the posterior p(Ck | x).
4 Alternatively, model the joint distribution p(x, Ck) directly.
5 Use decision theory to assign each new x to one of the classes.

Page 4

Data Generating Process

Given:
class prior p(t)
class-conditional p(x | t)

To generate data from the model we may do the following:
1 Sample the class label from the class prior p(t).
2 Sample the data features from the class-conditional distribution p(x | t).
(More about sampling later — this is called ancestral sampling.)

Thinking about the data generating process is a useful modelling step, especially when we have more prior knowledge.
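To make the two sampling steps concrete, here is a minimal ancestral-sampling sketch for a two-class model with Gaussian class-conditionals; the prior π, the means and the shared covariance below are made-up values, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up model parameters, purely for illustration.
pi = 0.3                                    # class prior p(t = 1)
mu = {0: np.array([0.0, 0.0]),              # class-conditional means
      1: np.array([2.0, 1.0])}
Sigma = np.array([[1.0, 0.3],               # shared covariance
                  [0.3, 1.0]])

def sample(n):
    """Ancestral sampling: first the label from p(t), then x from p(x | t)."""
    t = rng.binomial(1, pi, size=n)                                      # step 1
    x = np.stack([rng.multivariate_normal(mu[ti], Sigma) for ti in t])   # step 2
    return x, t

X, t = sample(5)
print(t)
print(X)
```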

Page 5

Probabilistic Generative Models

Generative approach: model class-conditional densities p(x | Ck) and class priors (not parameter priors!) p(Ck) to calculate the posterior probability for class C1:

\[
p(C_1 \mid x) = \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_1)\,p(C_1) + p(x \mid C_2)\,p(C_2)}
             = \frac{1}{1 + \exp(-a(x))} \equiv \sigma(a(x))
\]

where a and the logistic sigmoid function σ(a) are given by

\[
a(x) = \ln \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_2)\,p(C_2)} = \ln \frac{p(x, C_1)}{p(x, C_2)},
\qquad
\sigma(a) = \frac{1}{1 + \exp(-a)}.
\]

One point of this re-writing: we may learn a(x) directly as e.g. a deep neural network.

Page 6

Logistic Sigmoid

The logistic sigmoid function is called a “squashing function” because it squashes the real axis into a finite interval (0, 1).
Well known properties (derive them):
Symmetry: σ(−a) = 1 − σ(a)
Derivative: dσ(a)/da = σ(a)σ(−a) = σ(a)(1 − σ(a))
The inverse of σ is called the logit function: a(σ) = ln(σ/(1 − σ)).

[Plot: the sigmoid σ(a) = 1/(1 + exp(−a)) and its inverse, the logit a(σ) = ln(σ/(1 − σ)).]
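A quick numerical check of these properties, as a small NumPy sketch (the evaluation points are arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    return np.log(s / (1.0 - s))

a = np.linspace(-5, 5, 11)

# Symmetry: sigma(-a) = 1 - sigma(a)
assert np.allclose(sigmoid(-a), 1.0 - sigmoid(a))

# Derivative: d/da sigma(a) = sigma(a)(1 - sigma(a)), checked by finite differences
eps = 1e-6
num_grad = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert np.allclose(num_grad, sigmoid(a) * (1.0 - sigmoid(a)), atol=1e-8)

# Inverse: logit(sigma(a)) = a
assert np.allclose(logit(sigmoid(a)), a)
print("all sigmoid/logit identities hold numerically")
```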

Page 7

Probabilistic Generative Models - Multiclass

The normalised exponential is given by

\[
p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{\sum_j p(x \mid C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}
\]

where

\[
a_k = \ln\big(p(x \mid C_k)\,p(C_k)\big).
\]

Usually called the softmax function as it is a smoothed version of the argmax function; in particular,

\[
a_k \gg a_j \;\;\forall j \neq k
\;\Rightarrow\;
\big( p(C_k \mid x) \approx 1 \;\wedge\; p(C_j \mid x) \approx 0 \big).
\]

So, softargmax is a more descriptive though less common name.
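A minimal softmax sketch; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result, since the softmax is invariant to adding a constant to every a_k.

```python
import numpy as np

def softmax(a):
    """Normalised exponential p(C_k | x) = exp(a_k) / sum_j exp(a_j)."""
    a = np.asarray(a, dtype=float)
    a = a - a.max()                 # stability: softmax is shift-invariant
    e = np.exp(a)
    return e / e.sum()

# If one a_k dominates, the softmax output approaches a one-hot argmax.
print(softmax([1.0, 2.0, 3.0]))     # smooth distribution
print(softmax([1.0, 2.0, 30.0]))    # approximately [0, 0, 1]
```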

Page 8

Probabil. Generative Model - Continuous Input

Assume the class-conditional probabilities are Gaussian, with the same covariance and different means.
Let's characterise the posterior probabilities. We may separate the quadratic and linear terms in x:

\[
\begin{aligned}
p(x \mid C_k)
&= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}}
   \exp\left\{ -\tfrac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\} \\
&= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}}
   \exp\left\{ -\tfrac{1}{2} x^T \Sigma^{-1} x + \mu_k^T \Sigma^{-1} x - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k \right\}
\end{aligned}
\]

Page 9

Probabil. Generative Model - Continuous Input

For two classes

\[
p(C_1 \mid x) = \sigma(a(x))
\]

and a(x) is linear because the quadratic terms in x cancel (c.f. the previous slide):

\[
a(x) = \ln \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_2)\,p(C_2)}
     = \ln \frac{\exp\left\{ \mu_1^T \Sigma^{-1} x - \tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 \right\}}
                {\exp\left\{ \mu_2^T \Sigma^{-1} x - \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 \right\}}
       + \ln \frac{p(C_1)}{p(C_2)}
\]

Therefore

\[
p(C_1 \mid x) = \sigma(w^T x + w_0)
\]

where

\[
w = \Sigma^{-1} (\mu_1 - \mu_2), \qquad
w_0 = -\tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}.
\]
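A sketch that evaluates these formulas for made-up µ1, µ2, Σ and priors, and cross-checks σ(w^T x + w0) against the posterior computed directly via Bayes' theorem (SciPy is used only for the Gaussian density):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up parameters for illustration.
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
Sigma = np.array([[1.0, 0.2], [0.2, 0.5]])
p1, p2 = 0.4, 0.6                       # class priors p(C1), p(C2)

Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(p1 / p2)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x = np.array([0.3, -0.2])
post_linear = sigmoid(w @ x + w0)

# Direct Bayes' theorem for comparison.
num = multivariate_normal.pdf(x, mu1, Sigma) * p1
den = num + multivariate_normal.pdf(x, mu2, Sigma) * p2
post_bayes = num / den

print(post_linear, post_bayes)          # the two agree
```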

Page 10

Probabil. Generative Model - Continuous Input

[Figure] Class-conditional densities for two classes (left). Posterior probability p(C1 | x) (right). Note the logistic sigmoid of a linear function of x.

Page 11

General Case - K Classes, Shared Covariance

Use the normalised exponential

\[
p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{\sum_j p(x \mid C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)},
\qquad
a_k = \ln\big( p(x \mid C_k)\,p(C_k) \big),
\]

to get a linear function of x,

\[
a_k(x) = w_k^T x + w_{k0},
\]

where

\[
w_k = \Sigma^{-1} \mu_k, \qquad
w_{k0} = -\tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k).
\]
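The same idea for K classes, as a sketch with made-up means, a shared covariance and priors (three classes in two dimensions), classified with the softmax from the earlier slide:

```python
import numpy as np

# Made-up parameters for K = 3 classes in D = 2 dimensions.
mus = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
Sigma = np.eye(2)
priors = np.array([0.5, 0.3, 0.2])

Sinv = np.linalg.inv(Sigma)
W = mus @ Sinv                                              # row k is w_k^T
w0 = -0.5 * np.einsum('kd,de,ke->k', mus, Sinv, mus) + np.log(priors)

def posterior(x):
    a = W @ x + w0                      # a_k(x) = w_k^T x + w_k0
    a -= a.max()                        # numerically stable softmax
    e = np.exp(a)
    return e / e.sum()

print(posterior(np.array([1.0, 1.0])))
```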

Page 12

General Case - K Classes, Different Covariance

If the class-conditional distributions have different covariances Σk, the quadratic terms −½ x^T Σ_k^{-1} x do not cancel out. We get a quadratic discriminant.


Page 13

Parameter Estimation

Given the functional form of the class-conditional densities p(x | Ck), how can we determine the parameters µ and Σ and the class prior?
Simplest is maximum likelihood.
Given also a data set (xn, tn) for n = 1, . . . , N (using the coding scheme where tn = 1 corresponds to class C1 and tn = 0 denotes class C2).
Assume the class-conditional densities to be Gaussian with the same covariance, but different means.
Denote the prior probability p(C1) = π, and therefore p(C2) = 1 − π.
Then

\[
p(x_n, C_1) = p(C_1)\,p(x_n \mid C_1) = \pi\,\mathcal{N}(x_n \mid \mu_1, \Sigma)
\]
\[
p(x_n, C_2) = p(C_2)\,p(x_n \mid C_2) = (1 - \pi)\,\mathcal{N}(x_n \mid \mu_2, \Sigma)
\]


Page 15

Maximum Likelihood Solution

Thus the likelihood for the whole data set X and t is given by

\[
p(\mathbf{t}, \mathbf{X} \mid \pi, \mu_1, \mu_2, \Sigma)
= \prod_{n=1}^{N} \big[ \pi\,\mathcal{N}(x_n \mid \mu_1, \Sigma) \big]^{t_n}
  \times \big[ (1 - \pi)\,\mathcal{N}(x_n \mid \mu_2, \Sigma) \big]^{1 - t_n}
\]

Maximise the log likelihood. The term depending on π is

\[
\sum_{n=1}^{N} \big( t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \big)
\]

which is maximal for (derive it)

\[
\pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}
\]

where N1 is the number of data points in class C1.

Page 16

Maximum Likelihood Solution

Similarly, we can maximise the likelihood p(t, X | π, µ1, µ2, Σ) w.r.t. the means µ1 and µ2, to get

\[
\mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n\, x_n, \qquad
\mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n)\, x_n
\]

For each class, these are the means of all input vectors assigned to that class.

Page 17

Maximum Likelihood Solution

Finally, the log likelihood ln p(t, X | π, µ1, µ2, Σ) can be maximised for the covariance Σ, resulting in

\[
\Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2, \qquad
S_k = \frac{1}{N_k} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T.
\]
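Putting the three maximum-likelihood estimates together, a sketch that fits π, µ1, µ2 and the shared Σ from a labelled data set (rows of X are the xn, and t ∈ {0, 1}^N uses the coding scheme above):

```python
import numpy as np

def fit_shared_cov_gaussians(X, t):
    """ML estimates for the two-class generative model with shared covariance.

    t_n = 1 denotes class C1, t_n = 0 denotes class C2 (as on the slides)."""
    X, t = np.asarray(X, float), np.asarray(t, float)
    N = len(t)
    N1, N2 = t.sum(), N - t.sum()

    pi = N1 / N                                    # prior p(C1)
    mu1 = (t[:, None] * X).sum(axis=0) / N1        # mean of class-C1 points
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2  # mean of class-C2 points

    d1 = X[t == 1] - mu1
    d2 = X[t == 0] - mu2
    S1 = d1.T @ d1 / N1
    S2 = d2.T @ d2 / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2          # weighted average of class covariances
    return pi, mu1, mu2, Sigma

# Toy usage with synthetic data.
rng = np.random.default_rng(1)
t = rng.binomial(1, 0.4, size=500)
X = np.where(t[:, None] == 1,
             rng.multivariate_normal([2, 0], np.eye(2), size=500),
             rng.multivariate_normal([0, 1], np.eye(2), size=500))
print(fit_shared_cov_gaussians(X, t))
```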

Page 18

Discrete Features - Naïve Bayes

Assume the input space consists of discrete features, in the simplest case xi ∈ {0, 1}.
For a D-dimensional input space, a general distribution would be represented by a table with 2^D entries.
Together with the normalisation constraint, these are 2^D − 1 independent variables.
This grows exponentially with the number of features.
The Naïve Bayes assumption is that, given the class Ck, the features are independent of each other:

\[
p(x \mid C_k) = \prod_{i=1}^{D} p(x_i \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}
\]

Page 19

Discrete Features - Naïve Bayes

With the naïve Bayes

\[
p(x \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}
\]

we can then again find the factors ak in the normalised exponential

\[
p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{\sum_j p(x \mid C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}
\]

as a linear function of the xi:

\[
a_k(x) = \sum_{i=1}^{D} \big\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \big\} + \ln p(C_k).
\]
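A sketch of this Bernoulli naïve Bayes classifier: estimate the µki and priors from counts, then classify via the linear ak(x) and a softmax. The Laplace-smoothing pseudo-count alpha is an extra assumption, added only to keep the logarithms finite; it is not part of the slide.

```python
import numpy as np

def fit_bernoulli_nb(X, t, K, alpha=1.0):
    """X: (N, D) binary features, t: (N,) labels in {0, ..., K-1}.
    alpha is a smoothing pseudo-count (an extra modelling choice)."""
    N, D = X.shape
    mu = np.zeros((K, D))
    log_prior = np.zeros(K)
    for k in range(K):
        Xk = X[t == k]
        mu[k] = (Xk.sum(axis=0) + alpha) / (len(Xk) + 2 * alpha)   # estimate of mu_ki
        log_prior[k] = np.log(len(Xk) / N)                         # ln p(C_k)
    return mu, log_prior

def predict_proba(x, mu, log_prior):
    # a_k(x) = sum_i { x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki) } + ln p(C_k)
    a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + log_prior
    a -= a.max()                        # numerically stable softmax
    e = np.exp(a)
    return e / e.sum()

# Toy usage: 2 classes, 3 binary features.
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
t = np.array([0, 0, 1, 1])
mu, log_prior = fit_bernoulli_nb(X, t, K=2)
print(predict_proba(np.array([1, 0, 1]), mu, log_prior))
```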

Page 20

Three Models for Decision Problems

In increasing order of complexity:

Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1 Solve the inference problem of determining the posterior class probabilities p(Ck | x).
2 Use decision theory to assign each new x to one of the classes.

Generative Models
1 Solve the inference problem of determining the class-conditional probabilities p(x | Ck).
2 Also, infer the prior class probabilities p(Ck).
3 Use Bayes' theorem to find the posterior p(Ck | x).
4 Alternatively, model the joint distribution p(x, Ck) directly.
5 Use decision theory to assign each new x to one of the classes.

Page 21

Probabilistic Discriminative Models

Discriminative training: learn only to discriminate between the classes.
Maximise a likelihood function defined through the conditional distribution p(Ck | x) directly.
Typically fewer parameters to be determined.
As we learn the posterior p(Ck | x) directly, prediction may be better than with a generative model where the class-conditional density assumptions p(x | Ck) poorly approximate the true distributions.
But: discriminative models cannot create synthetic data, as p(x) is not modelled.
As an aside: certain theoretical analyses show that generative models converge faster to their — albeit worse — asymptotic classification performance and are superior in some regimes.

Page 22

Original Input versus Feature Space

So far in classification, we used the direct input x.
All classification algorithms work also if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x).
Example: Use two Gaussian basis functions centered at the green crosses in the input space.

[Figure: input space (x1, x2) with the two basis-function centres, and the corresponding feature space (φ1, φ2).]
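A sketch of such a feature map with two Gaussian basis functions; the centres and width below are arbitrary stand-ins for the "green crosses" in the figure.

```python
import numpy as np

# Two Gaussian basis functions; centres and width are illustrative choices.
centres = np.array([[-0.5, 0.0],
                    [ 0.5, 0.0]])
s = 0.5                                    # basis-function width

def phi(x):
    """Map a 2-D input x to feature space (phi_1(x), phi_2(x))."""
    x = np.asarray(x, dtype=float)
    d2 = ((x - centres) ** 2).sum(axis=1)  # squared distance to each centre
    return np.exp(-d2 / (2 * s ** 2))

print(phi([0.0, 0.0]))   # a point between the centres activates both features
print(phi([2.0, 2.0]))   # a distant point gives features near zero
```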

Page 23

Original Input versus Feature Space

Linear decision boundaries in the feature space generally correspond to nonlinear boundaries in the input space.
Classes which are NOT linearly separable in the input space may become linearly separable in the feature space:

[Figure: data in input space (x1, x2) and in feature space (φ1, φ2); a linear boundary in feature space maps to a nonlinear boundary in input space.]

If classes overlap in input space, they will also overlap in feature space — nonlinear features φ(x) cannot remove the overlap; but they may increase it.

Page 24

Original Input versus Feature Space

Fixed basis functions do not adapt to the data and therefore have important limitations (see the discussion in Linear Regression).
Understanding of more advanced algorithms becomes easier if we introduce the feature space now and use it instead of the original input space.
Some applications use fixed features successfully by avoiding the limitations.
We will therefore use φ instead of x from now on.

Page 25

Logistic Regression

Two classes where the posterior of class C1 is a logistic sigmoid σ(·) acting on a linear function of the input:

\[
p(C_1 \mid \phi) = y(\phi) = \sigma(w^T \phi), \qquad
p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi)
\]

The model dimension is equal to the dimension M of the feature space.
Compare this to fitting two Gaussians, which has a quadratic number of parameters in M:

\[
\underbrace{2M}_{\text{means}} + \underbrace{M(M+1)/2}_{\text{shared covariance}}
\]

For larger M, the logistic regression model has a clear advantage.

Page 26

Logistic Regression

Determine the parameters w via maximum likelihood for data (φn, tn), n = 1, . . . , N, where φn = φ(xn). The class membership is coded as tn ∈ {0, 1}.
Likelihood function:

\[
p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}
\]

where yn = p(C1 | φn).
Error function: the negative log likelihood, resulting in the cross-entropy error function

\[
E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\}
\]

Page 27

Logistic Regression

Error function (cross-entropy loss):

\[
E(w) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\},
\qquad
y_n = p(C_1 \mid \phi_n) = \sigma(w^T \phi_n)
\]

We obtain the gradient of the error function using the chain rule and the sigmoid result dσ/da = σ(1 − σ) (derive it):

\[
\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n
\]

For each data point, the error contribution is the product of the deviation yn − tn and the basis function vector φn.
We can now use gradient descent.
We may easily modify this to reduce over-fitting by using a regularised error or MAP (how?).
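A minimal gradient-descent sketch for this model; the step size, iteration count and toy data are arbitrary choices, and the optional lam parameter is the L2 (Gaussian-prior / MAP-style) penalty hinted at above.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def fit_logistic(Phi, t, lr=0.1, iters=2000, lam=0.0):
    """Phi: (N, M) design matrix of features phi_n, t: (N,) targets in {0, 1}."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t) + lam * w   # lam > 0 adds an L2 (MAP-style) penalty
        w -= lr * grad / N
    return w

# Toy usage: two 1-D Gaussian clusters with a bias feature.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 100), rng.normal(1, 1, 100)])
t = np.concatenate([np.zeros(100), np.ones(100)])
Phi = np.column_stack([np.ones_like(x), x])   # phi(x) = (1, x)
w = fit_logistic(Phi, t)
print(w)
```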

Page 28

Laplace Approximation

Given a continuous distribution p(x) which is not Gaussian, can we approximate it by a Gaussian q(x)?
Need to find a mode of p(x). Try to find a Gaussian with the same mode:

[Plot: p.d.f. of the non-Gaussian distribution (yellow) and its Gaussian approximation (red).]
[Plot: negative log p.d.f. of the non-Gaussian distribution (yellow) and its Gaussian approximation (red).]

Page 29

Laplace Approximation

Cheap and nasty, but sometimes effective.
Assume p(z) can be written as

\[
p(z) = \frac{1}{Z} f(z)
\]

with normalisation Z = ∫ f(z) dz.
We do not even need to know Z to find the Laplace approximation.
A mode of p(z) is at a point z0 where p′(z0) = 0.
Taylor expansion of ln f(z) at z0:

\[
\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2
\]

where

\[
A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}.
\]

Page 30

Laplace Approximation

Exponentiating

\[
\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2
\]

we get

\[
f(z) \simeq f(z_0) \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}.
\]

And after normalisation we get the Laplace approximation

\[
q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}.
\]

Only defined for precision A > 0, as only then does p(z) have a maximum.
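A numerical sketch of the one-dimensional case on a made-up non-Gaussian density: find a mode with a generic optimiser, estimate the precision A by finite differences, and form q(z).

```python
import numpy as np
from scipy.optimize import minimize_scalar

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Unnormalised, non-Gaussian toy density (an illustrative choice).
f = lambda z: np.exp(-z**2 / 2.0) * sigmoid(20 * z + 4)
neg_log_f = lambda z: -np.log(f(z))

# 1. Find a mode z0 of f (equivalently a minimum of -ln f).
z0 = minimize_scalar(neg_log_f, bounds=(-5, 5), method="bounded").x

# 2. Precision A = -d^2/dz^2 ln f(z) at z0, via central finite differences.
eps = 1e-4
A = (neg_log_f(z0 + eps) - 2 * neg_log_f(z0) + neg_log_f(z0 - eps)) / eps**2

# Laplace approximation q(z) = N(z | z0, 1/A).
q = lambda z: np.sqrt(A / (2 * np.pi)) * np.exp(-0.5 * A * (z - z0)**2)
print("mode:", z0, "precision:", A, "q(z0):", q(z0))
```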

Page 31

Laplace Approximation - Vector Space

Approximate p(z) for z ∈ R^M, where

\[
p(z) = \frac{1}{Z} f(z).
\]

We get the Taylor expansion

\[
\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} (z - z_0)^T A (z - z_0)
\]

where the Hessian A is defined as

\[
A = -\nabla\nabla \ln f(z) \big|_{z = z_0}.
\]

The Laplace approximation of p(z) is then

\[
q(z) \propto \exp\left\{ -\frac{1}{2} (z - z_0)^T A (z - z_0) \right\}
\quad\Rightarrow\quad
q(z) = \mathcal{N}(z \mid z_0, A^{-1}).
\]

Page 32

Bayesian Logistic Regression

Exact Bayesian inference for logistic regression is intractable.
Why? We need to normalise a product of the prior and the likelihood, which is itself a product of logistic sigmoid functions, one for each data point.
Evaluation of the predictive distribution is also intractable.
Therefore we will use the Laplace approximation.
The predictive distribution remains intractable even under the Laplace approximation to the posterior distribution, but it can be approximated.

Page 33

Bayesian Logistic Regression

Assume a Gaussian prior

\[
p(w) = \mathcal{N}(w \mid m_0, S_0)
\]

for fixed hyperparameters m0 and S0.
Hyperparameters are parameters of a prior distribution. In contrast to the model parameters w, they are not learned.
For a set of training data (xn, tn), where n = 1, . . . , N, the posterior is given by

\[
p(w \mid \mathbf{t}) \propto p(w)\, p(\mathbf{t} \mid w)
\]

where t = (t1, . . . , tN)^T.

Page 34

Bayesian Logistic Regression

Using our previous result for the cross-entropy error function

\[
E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\}
\]

we can now calculate the log of the posterior

\[
p(w \mid \mathbf{t}) \propto p(w)\, p(\mathbf{t} \mid w)
\]

using the notation yn = σ(w^T φn) as

\[
\ln p(w \mid \mathbf{t}) = -\frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0)
+ \sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} + \text{const.}
\]

Page 35

Bayesian Logistic Regression

To obtain a Gaussian approximation to

\[
\ln p(w \mid \mathbf{t}) = -\frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0)
+ \sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} + \text{const.}
\]

1 Find wMAP which maximises ln p(w | t). This defines the mean of the Gaussian approximation. (Note: this is a nonlinear function of w because yn = σ(w^T φn).)
2 Calculate the second derivative of the negative log posterior to get the inverse covariance of the Laplace approximation

\[
S_N^{-1} = -\nabla\nabla \ln p(w \mid \mathbf{t}) = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T.
\]

Nowadays the gradient and Hessian would be computed with automatic differentiation; one need only implement ln p(w | t).
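A sketch of this two-step procedure using a few Newton steps with the closed-form gradient and Hessian (rather than automatic differentiation); the broad prior m0 = 0, S0 = 100 I in the toy usage is an assumed choice.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, iters=20):
    """Return (w_MAP, S_N) of the Gaussian approximation q(w) = N(w | w_MAP, S_N)."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        # Gradient of -ln p(w | t).
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)
        # Hessian of -ln p(w | t): S_0^{-1} + sum_n y_n (1 - y_n) phi_n phi_n^T.
        H = S0_inv + Phi.T @ (Phi * (y * (1 - y))[:, None])
        w = w - np.linalg.solve(H, grad)              # Newton step towards w_MAP
    # Final Hessian at w_MAP gives the precision S_N^{-1}.
    y = sigmoid(Phi @ w)
    H = S0_inv + Phi.T @ (Phi * (y * (1 - y))[:, None])
    return w, np.linalg.inv(H)

# Toy usage with a broad, assumed prior.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
t = np.concatenate([np.zeros(50), np.ones(50)])
Phi = np.column_stack([np.ones_like(x), x])
w_map, S_N = laplace_posterior(Phi, t, m0=np.zeros(2), S0=100 * np.eye(2))
print(w_map)
print(S_N)
```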

Page 36

Bayesian Logistic Regression

The Gaussian approximation (via the Laplace approximation) of the posterior distribution is now

\[
q(w) = \mathcal{N}(w \mid w_{\mathrm{MAP}}, S_N)
\]

where

\[
S_N^{-1} = -\nabla\nabla \ln p(w \mid \mathbf{t}) = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T.
\]