Statistical Machine Learning · Statistical Machine Learning c 2020 Ong & Walder & Webers Data61 |...
Statistical Machine Learning
© 2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Probabilistic Generative Models
Continuous Input
Discrete Features
Probabilistic Discriminative Models
Logistic Regression
Iterative Reweighted Least Squares
Laplace Approximation
Bayesian Logistic Regression
Part VI
Linear Classification 2
Three Models for Decision Problems
In increasing order of complexity:
Find a discriminant function $f(\mathbf{x})$ which maps each input directly onto a class label.
Discriminative Models
  1 Solve the inference problem of determining the posterior class probabilities $p(\mathcal{C}_k \mid \mathbf{x})$.
  2 Use decision theory to assign each new $\mathbf{x}$ to one of the classes.
Generative Models
  1 Solve the inference problem of determining the class-conditional probabilities $p(\mathbf{x} \mid \mathcal{C}_k)$.
  2 Also, infer the prior class probabilities $p(\mathcal{C}_k)$.
  3 Use Bayes' theorem to find the posterior $p(\mathcal{C}_k \mid \mathbf{x})$.
  4 Alternatively, model the joint distribution $p(\mathbf{x}, \mathcal{C}_k)$ directly.
  5 Use decision theory to assign each new $\mathbf{x}$ to one of the classes.
Data Generating Process
Given:
  class prior $p(t)$
  class-conditional $p(\mathbf{x} \mid t)$
to generate data from the model we may do the following:
  1 Sample the class label from the class prior $p(t)$.
  2 Sample the data features from the class-conditional distribution $p(\mathbf{x} \mid t)$.
(More about sampling later; this is called ancestral sampling.)
Thinking about the data generating process is a useful modelling step, especially when we have more prior knowledge.
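The two sampling steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class-conditionals are taken to be one-dimensional Gaussians, and the function name `sample_pair`, the prior 0.3, the class means and the standard deviation are all hypothetical values chosen for the example.

```python
import random

def sample_pair(rng, prior_c1, means, std):
    """Ancestral sampling: draw t from p(t), then x from p(x | t)."""
    t = 1 if rng.random() < prior_c1 else 0   # step 1: sample the class label
    x = rng.gauss(means[t], std)              # step 2: sample the feature given the label
    return x, t

rng = random.Random(0)                        # fixed seed for reproducibility
means = {1: 2.0, 0: -1.0}                     # hypothetical class means
data = [sample_pair(rng, 0.3, means, 0.5) for _ in range(5000)]

n_c1 = sum(t for _, t in data)
frac_c1 = n_c1 / len(data)                            # empirical class frequency
mean_x_c1 = sum(x for x, t in data if t == 1) / n_c1  # empirical mean within class 1
```

With enough samples, `frac_c1` should approach the assumed prior and `mean_x_c1` the assumed class mean.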
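Probabilistic Generative Models
Generative approach: model class-conditional densities $p(\mathbf{x} \mid \mathcal{C}_k)$ and class priors (not parameter priors!) $p(\mathcal{C}_k)$ to calculate the posterior probability for class $\mathcal{C}_1$
$$p(\mathcal{C}_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_1)\, p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_1)\, p(\mathcal{C}_1) + p(\mathbf{x} \mid \mathcal{C}_2)\, p(\mathcal{C}_2)} = \frac{1}{1 + \exp(-a(\mathbf{x}))} \equiv \sigma(a(\mathbf{x}))$$
where $a$ and the logistic sigmoid function $\sigma(a)$ are given by
$$a(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid \mathcal{C}_1)\, p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_2)\, p(\mathcal{C}_2)} = \ln \frac{p(\mathbf{x}, \mathcal{C}_1)}{p(\mathbf{x}, \mathcal{C}_2)}$$
$$\sigma(a) = \frac{1}{1 + \exp(-a)}.$$
One point of this rewriting: we may learn $a(\mathbf{x})$ directly, e.g. as a deep neural network.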
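The identity between the Bayes posterior and the sigmoid of the log-odds can be checked numerically. The sketch below assumes one-dimensional Gaussian class-conditionals with a common variance; the helper names (`gauss_pdf`, `posterior_bayes`, `log_odds`) are hypothetical and only serve the illustration.

```python
import math

def gauss_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def posterior_bayes(x, p1, mu1, mu2, var):
    """p(C1 | x) computed directly from Bayes' theorem."""
    j1 = gauss_pdf(x, mu1, var) * p1          # joint p(x, C1)
    j2 = gauss_pdf(x, mu2, var) * (1.0 - p1)  # joint p(x, C2)
    return j1 / (j1 + j2)

def log_odds(x, p1, mu1, mu2, var):
    """a(x) = ln p(x, C1) - ln p(x, C2)."""
    return (math.log(gauss_pdf(x, mu1, var) * p1)
            - math.log(gauss_pdf(x, mu2, var) * (1.0 - p1)))
```

For any input, `posterior_bayes(x, ...)` and `sigmoid(log_odds(x, ...))` should agree up to floating-point error.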
Logistic Sigmoid
The logistic sigmoid function is called a "squashing function" because it squashes the real axis into the finite interval $(0, 1)$.
Well known properties (derive them):
  Symmetry: $\sigma(-a) = 1 - \sigma(a)$
  Derivative: $\frac{d}{da}\sigma(a) = \sigma(a)\,\sigma(-a) = \sigma(a)\,(1 - \sigma(a))$
  The inverse of $\sigma$ is called the logit function.
[Figure, left: the sigmoid $\sigma(a) = \frac{1}{1 + \exp(-a)}$, plotted for $a$ from $-10$ to $10$. Right: the logit $a(\sigma) = \ln\left(\frac{\sigma}{1 - \sigma}\right)$, plotted for $\sigma$ from $0$ to $1$.]
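The three sigmoid properties (symmetry, derivative, logit as inverse) can be verified numerically before deriving them by hand. This is a small sketch; the branch on the sign of `a` is a common trick to avoid overflow in `exp` and is an implementation choice, not something taken from the slides.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def logit(s):
    """Inverse of the sigmoid on (0, 1)."""
    return math.log(s / (1.0 - s))

def numeric_derivative(f, a, h=1e-6):
    """Central finite-difference approximation of f'(a)."""
    return (f(a + h) - f(a - h)) / (2.0 * h)
```

Comparing `numeric_derivative(sigmoid, a)` with `sigmoid(a) * (1 - sigmoid(a))` at a few points is a quick sanity check of the derivative formula.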
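Probabilistic Generative Models - Multiclass
The normalised exponential is given by
$$p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{\sum_j p(\mathbf{x} \mid \mathcal{C}_j)\, p(\mathcal{C}_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$
where
$$a_k = \ln\big(p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)\big).$$
Usually called the softmax function as it is a smoothed version of the argmax function, in particular:
$$a_k \gg a_j \;\; \forall j \neq k \;\Rightarrow\; \big( p(\mathcal{C}_k \mid \mathbf{x}) \approx 1 \,\wedge\, p(\mathcal{C}_j \mid \mathbf{x}) \approx 0 \big)$$
So, softargmax is a more descriptive though less common name.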
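A short sketch of the normalised exponential illustrates the "soft argmax" behaviour. Subtracting the largest activation before exponentiating is a standard numerical safeguard (it leaves the result unchanged); it is an implementation detail assumed here, not part of the slide.

```python
import math

def softmax(a):
    """Normalised exponential of a list of activations a_k."""
    top = max(a)                               # shift for numerical stability
    exps = [math.exp(ak - top) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]

p_mild = softmax([1.0, 2.0, 3.0])     # smooth preference for the largest a_k
p_sharp = softmax([0.0, 50.0, 1.0])   # one a_k dominates: close to a hard argmax
```

In `p_sharp` nearly all probability mass sits on the second entry, matching the implication stated above.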
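Probabilistic Generative Model - Continuous Input
Assume class-conditional probabilities are Gaussian, with the same covariance and different means.
Let's characterise the posterior probabilities.
We may separate the quadratic and linear terms in $\mathbf{x}$:
$$\begin{aligned}
p(\mathbf{x} \mid \mathcal{C}_k)
&= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right\} \\
&= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} \mathbf{x}^T \boldsymbol{\Sigma}^{-1} \mathbf{x} + \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \mathbf{x} - \frac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k \right\}
\end{aligned}$$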
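Probabilistic Generative Model - Continuous Input
For two classes
$$p(\mathcal{C}_1 \mid \mathbf{x}) = \sigma(a(\mathbf{x}))$$
and $a(\mathbf{x})$ is linear because the quadratic terms in $\mathbf{x}$ cancel (c.f. the previous slide):
$$\begin{aligned}
a(\mathbf{x}) &= \ln \frac{p(\mathbf{x} \mid \mathcal{C}_1)\, p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_2)\, p(\mathcal{C}_2)} \\
&= \ln \frac{\exp\left\{ \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \mathbf{x} - \frac{1}{2} \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 \right\}}{\exp\left\{ \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \mathbf{x} - \frac{1}{2} \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 \right\}} + \ln \frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}
\end{aligned}$$
Therefore
$$p(\mathcal{C}_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + w_0)$$
where
$$\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$
$$w_0 = -\frac{1}{2} \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 + \frac{1}{2} \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 + \ln \frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}$$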
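The closed-form $\mathbf{w}$ and $w_0$ can be compared against the log-odds computed directly from the two Gaussian exponents. The sketch below is restricted to two-dimensional inputs so that the matrix inverse can be written by hand; the helper names and the numeric values of the means, covariance and prior are illustrative assumptions.

```python
import math

def inv2(S):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def linear_params(mu1, mu2, Sigma, p1):
    """w and w0 such that a(x) = w^T x + w0 for shared-covariance Gaussians."""
    P = inv2(Sigma)
    w = matvec(P, [mu1[0] - mu2[0], mu1[1] - mu2[1]])
    w0 = (-0.5 * dot(mu1, matvec(P, mu1)) + 0.5 * dot(mu2, matvec(P, mu2))
          + math.log(p1 / (1.0 - p1)))
    return w, w0

def log_odds_direct(x, mu1, mu2, Sigma, p1):
    """a(x) from the two Gaussian exponents; the shared normalisers cancel."""
    P = inv2(Sigma)
    d1 = [x[0] - mu1[0], x[1] - mu1[1]]
    d2 = [x[0] - mu2[0], x[1] - mu2[1]]
    return (-0.5 * dot(d1, matvec(P, d1)) + 0.5 * dot(d2, matvec(P, d2))
            + math.log(p1 / (1.0 - p1)))

# Hypothetical parameters for the check
mu1, mu2 = [1.0, 0.5], [-1.0, 0.0]
Sigma = [[1.0, 0.3], [0.3, 0.5]]
p1 = 0.4
w, w0 = linear_params(mu1, mu2, Sigma, p1)
```

For any test point `x`, `dot(w, x) + w0` should match `log_odds_direct(x, mu1, mu2, Sigma, p1)`, which confirms that the quadratic terms cancel.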
Probabilistic Generative Model - Continuous Input
[Figure] Class-conditional densities for two classes (left). Posterior probability $p(\mathcal{C}_1 \mid \mathbf{x})$ (right). Note the logistic sigmoid of a linear function of $\mathbf{x}$.
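General Case - K Classes, Shared Covariance
Use the normalised exponential
$$p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{\sum_j p(\mathbf{x} \mid \mathcal{C}_j)\, p(\mathcal{C}_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$
where
$$a_k = \ln\big(p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)\big)$$
to get a linear function of $\mathbf{x}$ (terms common to all classes cancel in the normalised exponential)
$$a_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}$$
where
$$\mathbf{w}_k = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k$$
$$w_{k0} = -\frac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \ln p(\mathcal{C}_k).$$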
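General Case - K Classes, Different Covariance
If the class-conditional distributions have different covariances $\boldsymbol{\Sigma}_k$, the quadratic terms $-\frac{1}{2} \mathbf{x}^T \boldsymbol{\Sigma}_k^{-1} \mathbf{x}$ do not cancel out.
We get a quadratic discriminant.
[Figure: illustration of the quadratic discriminant; horizontal axis from $-2$ to $2$, vertical axis from $-2.5$ to $2.5$.]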
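Parameter Estimation
Given the functional form of the class-conditional densities $p(\mathbf{x} \mid \mathcal{C}_k)$, how can we determine the parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ and the class prior?
Simplest is maximum likelihood.
Given also a data set $(\mathbf{x}_n, t_n)$ for $n = 1, \dots, N$. (Using the coding scheme where $t_n = 1$ corresponds to class $\mathcal{C}_1$ and $t_n = 0$ denotes class $\mathcal{C}_2$.)
Assume the class-conditional densities to be Gaussian with the same covariance, but different means.
Denote the prior probability $p(\mathcal{C}_1) = \pi$, and therefore $p(\mathcal{C}_2) = 1 - \pi$.
Then
$$p(\mathbf{x}_n, \mathcal{C}_1) = p(\mathcal{C}_1)\, p(\mathbf{x}_n \mid \mathcal{C}_1) = \pi\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma})$$
$$p(\mathbf{x}_n, \mathcal{C}_2) = p(\mathcal{C}_2)\, p(\mathbf{x}_n \mid \mathcal{C}_2) = (1 - \pi)\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})$$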
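Maximum Likelihood Solution
Thus the likelihood for the whole data set $\mathbf{X}$ and $\mathbf{t}$ is given by
$$p(\mathbf{t}, \mathbf{X} \mid \pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \big[ \pi\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}) \big]^{t_n} \times \big[ (1 - \pi)\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) \big]^{1 - t_n}$$
Maximise the log likelihood.
The term depending on $\pi$ is
$$\sum_{n=1}^{N} \big( t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \big)$$
which is maximal for (derive it)
$$\pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}$$
where $N_1$ is the number of data points in class $\mathcal{C}_1$ and $N_2$ the number in class $\mathcal{C}_2$.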
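Maximum Likelihood Solution
Similarly, we can maximise the likelihood $p(\mathbf{t}, \mathbf{X} \mid \pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma})$ w.r.t. the means $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$, to get
$$\boldsymbol{\mu}_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n\, \mathbf{x}_n$$
$$\boldsymbol{\mu}_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n)\, \mathbf{x}_n$$
For each class, this is the mean of all input vectors assigned to that class.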
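Maximum Likelihood Solution
Finally, the log likelihood $\ln p(\mathbf{t}, \mathbf{X} \mid \pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma})$ can be maximised for the covariance $\boldsymbol{\Sigma}$ resulting in
$$\boldsymbol{\Sigma} = \frac{N_1}{N} \mathbf{S}_1 + \frac{N_2}{N} \mathbf{S}_2$$
$$\mathbf{S}_k = \frac{1}{N_k} \sum_{n \in \mathcal{C}_k} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T$$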
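The maximum likelihood estimates for $\pi$, $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$ and the shared $\boldsymbol{\Sigma}$ can be sketched as a single pass over the data. The function name `fit_shared_gaussian` and the toy data set are assumptions made for illustration; the weighted sum $\frac{N_1}{N}\mathbf{S}_1 + \frac{N_2}{N}\mathbf{S}_2$ is computed equivalently as the average of the outer products of each point's deviation from its own class mean.

```python
def fit_shared_gaussian(X, t):
    """ML estimates (pi, mu1, mu2, Sigma) for two Gaussians with shared covariance."""
    N, D = len(X), len(X[0])
    N1 = sum(t)
    N2 = N - N1
    pi = N1 / N
    mu1 = [sum(tn * x[i] for x, tn in zip(X, t)) / N1 for i in range(D)]
    mu2 = [sum((1 - tn) * x[i] for x, tn in zip(X, t)) / N2 for i in range(D)]
    Sigma = [[0.0] * D for _ in range(D)]
    for x, tn in zip(X, t):
        mu = mu1 if tn == 1 else mu2           # deviation from the point's own class mean
        d = [x[i] - mu[i] for i in range(D)]
        for i in range(D):
            for j in range(D):
                Sigma[i][j] += d[i] * d[j] / N
    return pi, mu1, mu2, Sigma

# Hypothetical toy data: four points in class C1 (t = 1), two in class C2 (t = 0)
X = [[1.0, 1.0], [3.0, 1.0], [1.0, 3.0], [3.0, 3.0], [6.0, 5.0], [8.0, 5.0]]
t = [1, 1, 1, 1, 0, 0]
pi_ml, mu1_ml, mu2_ml, Sigma_ml = fit_shared_gaussian(X, t)
```

For this toy set the class means are $(2, 2)$ and $(7, 5)$ and $\pi = 4/6$, which can be confirmed by hand.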
Discrete Features - Naïve Bayes
Assume the input space consists of discrete features, in the simplest case $x_i \in \{0, 1\}$.
For a $D$-dimensional input space, a general distribution would be represented by a table with $2^D$ entries.
Together with the normalisation constraint, these are $2^D - 1$ independent variables.
This grows exponentially with the number of features.
The Naïve Bayes assumption is that, given the class $\mathcal{C}_k$, the features are independent of each other:
$$p(\mathbf{x} \mid \mathcal{C}_k) = \prod_{i=1}^{D} p(x_i \mid \mathcal{C}_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$$
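Discrete Features - Naïve Bayes
With the naïve Bayes model
$$p(\mathbf{x} \mid \mathcal{C}_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$$
we can then again find the factors $a_k$ in the normalised exponential
$$p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{\sum_j p(\mathbf{x} \mid \mathcal{C}_j)\, p(\mathcal{C}_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$
as a linear function of the $x_i$
$$a_k(\mathbf{x}) = \sum_{i=1}^{D} \{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \} + \ln p(\mathcal{C}_k).$$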
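The naïve Bayes activations and the resulting posterior can be sketched as follows. The function names and the Bernoulli parameters `mus` are hypothetical; `linear_form` simply regroups $a_k(\mathbf{x})$ as $\mathbf{w}_k^T \mathbf{x} + b_k$ with $w_{ki} = \ln\frac{\mu_{ki}}{1 - \mu_{ki}}$ to make the linearity in $\mathbf{x}$ explicit.

```python
import math

def activation(x, mu_k, prior_k):
    """a_k(x) = sum_i {x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki)} + ln p(C_k)."""
    return sum(xi * math.log(m) + (1 - xi) * math.log(1.0 - m)
               for xi, m in zip(x, mu_k)) + math.log(prior_k)

def linear_form(mu_k, prior_k):
    """Weights w_k and offset b_k with a_k(x) = w_k^T x + b_k."""
    w = [math.log(m / (1.0 - m)) for m in mu_k]
    b = sum(math.log(1.0 - m) for m in mu_k) + math.log(prior_k)
    return w, b

def posterior(x, mus, priors):
    """p(C_k | x) via the normalised exponential of the activations."""
    a = [activation(x, m, p) for m, p in zip(mus, priors)]
    top = max(a)
    e = [math.exp(ak - top) for ak in a]
    total = sum(e)
    return [ek / total for ek in e]

# Hypothetical two-class model with three binary features
mus = [[0.9, 0.8, 0.1], [0.2, 0.3, 0.7]]
priors = [0.5, 0.5]
x = [1, 1, 0]
post = posterior(x, mus, priors)
```

For this input the joint probabilities are $0.324$ and $0.009$, so the posterior for the first class is $0.324 / 0.333$.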
Probabilistic Discriminative Models
Discriminative training: learn only to discriminate between the classes.
Maximise a likelihood function defined through the conditional distribution $p(\mathcal{C}_k \mid \mathbf{x})$ directly.
Typically fewer parameters to be determined.
As we learn the posterior $p(\mathcal{C}_k \mid \mathbf{x})$ directly, prediction may be better than with a generative model where the class-conditional density assumptions $p(\mathbf{x} \mid \mathcal{C}_k)$ poorly approximate the true distributions.
But: discriminative models cannot create synthetic data, as $p(\mathbf{x})$ is not modelled.
As an aside: certain theoretical analyses show that generative models converge faster to their (albeit worse) asymptotic classification performance and are superior in some regimes.
Original Input versus Feature Space
So far in classification, we used the direct input $\mathbf{x}$.
All classification algorithms also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions $\boldsymbol{\phi}(\mathbf{x})$.
Example: Use two Gaussian basis functions centered at the green crosses in the input space.
[Figure, left: input space with axes $x_1$ and $x_2$, each from $-1$ to $1$. Right: feature space with axes $\phi_1$ and $\phi_2$, each from $0$ to $1$.]
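A feature map built from two Gaussian basis functions might look like the sketch below. The centres and the width `s` are hypothetical values chosen for illustration; the slide does not state the ones used in the figure.

```python
import math

def gaussian_basis(x, centres, s):
    """phi_j(x) = exp(-||x - c_j||^2 / (2 s^2)) for each centre c_j."""
    return [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2.0 * s ** 2))
            for c in centres]

centres = [(-1.0, -1.0), (0.0, 0.0)]   # hypothetical centres
phi = gaussian_basis((0.0, 0.0), centres, s=0.5)
```

Each input point is mapped to a pair $(\phi_1, \phi_2)$ in the unit square; a point lying on a centre has the corresponding feature equal to 1.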
Original Input versus Feature Space
Linear decision boundaries in the feature space generally correspond to nonlinear boundaries in the input space.
Classes which are NOT linearly separable in the input space may become linearly separable in the feature space:
[Figure, left: input space with axes $x_1$ and $x_2$, each from $-1$ to $1$. Right: feature space with axes $\phi_1$ and $\phi_2$, each from $0$ to $1$.]
If classes overlap in input space, they will also overlap in feature space: nonlinear features $\boldsymbol{\phi}(\mathbf{x})$ cannot remove the overlap, but they may increase it.
Original Input versus Feature Space
Fixed basis functions do not adapt to the data and therefore have important limitations (see discussion in Linear Regression).
Understanding of more advanced algorithms becomes easier if we introduce the feature space now and use it instead of the original input space.
Some applications use fixed features successfully by avoiding the limitations.
We will therefore use $\boldsymbol{\phi}$ instead of $\mathbf{x}$ from now on.
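Logistic Regression
Two classes where the posterior of class $\mathcal{C}_1$ is a logistic sigmoid $\sigma(\cdot)$ acting on a linear function of the input:
$$p(\mathcal{C}_1 \mid \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^T \boldsymbol{\phi})$$
$$p(\mathcal{C}_2 \mid \boldsymbol{\phi}) = 1 - p(\mathcal{C}_1 \mid \boldsymbol{\phi})$$
Model dimension is equal to the dimension of the feature space $M$.
Compare this to fitting two Gaussians, which has a quadratic number of parameters in $M$:
$$\underbrace{2M}_{\text{means}} + \underbrace{M(M + 1)/2}_{\text{shared covariance}}$$
For larger $M$, the logistic regression model has a clear advantage.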
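The gap between the two parameter counts is easy to tabulate. The two function names below are hypothetical helpers; the formulas are the ones stated above.

```python
def logistic_param_count(M):
    """Logistic regression: one weight per feature dimension."""
    return M

def gaussian_param_count(M):
    """Two Gaussians with shared covariance: 2M means plus M(M+1)/2 covariance entries."""
    return 2 * M + M * (M + 1) // 2

counts = {M: (logistic_param_count(M), gaussian_param_count(M)) for M in (2, 10, 100)}
```

For $M = 100$ this gives 100 parameters for logistic regression against 5250 for the Gaussian model.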
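Logistic Regression
Determine the parameters via maximum likelihood for data $(\boldsymbol{\phi}_n, t_n)$, $n = 1, \dots, N$, where $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$. The class membership is coded as $t_n \in \{0, 1\}$.
Likelihood function
$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$
where $y_n = p(\mathcal{C}_1 \mid \boldsymbol{\phi}_n)$.
Error function: negative log likelihood resulting in the cross-entropy error function
$$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$$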
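Logistic Regression
Error function (cross-entropy loss)
$$E(\mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$$
$$y_n = p(\mathcal{C}_1 \mid \boldsymbol{\phi}_n) = \sigma(\mathbf{w}^T \boldsymbol{\phi}_n)$$
We obtain the gradient of the error function using the chain rule and the sigmoid result $\frac{d\sigma}{da} = \sigma(1 - \sigma)$ (derive it):
$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n)\, \boldsymbol{\phi}_n$$
For each data point, the contribution to the gradient is the product of the deviation $y_n - t_n$ and the basis function vector $\boldsymbol{\phi}_n$.
We can now use gradient descent.
We may easily modify this to reduce over-fitting by using a regularised error or MAP (how?).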
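A plain gradient descent loop built on the gradient formula above can be sketched as follows. The step size, the number of steps and the toy data (a bias feature plus one input, with overlapping classes so that the maximum likelihood solution is finite) are assumptions chosen for illustration.

```python
import math

def sigmoid(a):
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def predict(w, phi):
    """y = sigma(w^T phi)."""
    return sigmoid(sum(wi * p for wi, p in zip(w, phi)))

def cross_entropy(w, Phi, t):
    """E(w) = -sum_n {t_n ln y_n + (1 - t_n) ln(1 - y_n)}."""
    total = 0.0
    for phi, tn in zip(Phi, t):
        y = predict(w, phi)
        total -= tn * math.log(y) + (1 - tn) * math.log(1.0 - y)
    return total

def gradient(w, Phi, t):
    """grad E(w) = sum_n (y_n - t_n) phi_n."""
    g = [0.0] * len(w)
    for phi, tn in zip(Phi, t):
        err = predict(w, phi) - tn
        for i, p in enumerate(phi):
            g[i] += err * p
    return g

def fit(Phi, t, lr=0.1, steps=500):
    """Gradient descent from w = 0 with a fixed step size."""
    w = [0.0] * len(Phi[0])
    for _ in range(steps):
        g = gradient(w, Phi, t)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical data: phi = (1, x) with overlapping classes
Phi = [[1.0, x] for x in (-2.0, -1.0, 0.0, 1.0, 2.0, 3.0)]
t = [0, 0, 1, 0, 1, 1]
w_fit = fit(Phi, t)
```

A finite-difference comparison against `cross_entropy` is a cheap way to confirm the gradient formula before relying on it.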
Laplace Approximation
Given a continuous distribution $p(x)$ which is not Gaussian, can we approximate it by a Gaussian $q(x)$?
Need to find a mode of $p(x)$. Try to find a Gaussian with the same mode:
[Figure, left: p.d.f. of the non-Gaussian distribution (yellow) and its Gaussian approximation (red). Right: negative log p.d.f. of the non-Gaussian distribution (yellow) and its Gaussian approximation (red).]
Laplace Approximation
Cheap and nasty but sometimes effective.
Assume $p(z)$ can be written as
$$p(z) = \frac{1}{Z} f(z)$$
with normalisation $Z = \int f(z)\, dz$.
We do not even need to know $Z$ to find the Laplace approximation.
A mode of $p(z)$ is at a point $z_0$ where $p'(z_0) = 0$.
Taylor expansion of $\ln f(z)$ at $z_0$ (the first-order term vanishes because $z_0$ is a stationary point):
$$\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2$$
where
$$A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}$$
Laplace Approximation
Exponentiating
$$\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2$$
we get
$$f(z) \simeq f(z_0) \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}.$$
And after normalisation we get the Laplace approximation
$$q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}.$$
Only defined for precision $A > 0$ as only then $p(z)$ has a maximum.
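The one-dimensional recipe (find the mode, then take the negative second derivative of $\ln f$ there) can be sketched numerically. The example density $f(z) = z^4 e^{-2z}$ is an assumption chosen because its mode $z_0 = 2$ and precision $A = 4 / z_0^2 = 1$ are known in closed form; the ternary search and the finite-difference second derivative are implementation choices for this sketch, not the method prescribed by the slides.

```python
import math

def laplace_1d(log_f, lo, hi, h=1e-4):
    """Return (z0, A) for a log density with a single mode inside [lo, hi]."""
    for _ in range(200):                       # ternary search for the maximiser
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_f(m1) < log_f(m2):
            lo = m1
        else:
            hi = m2
    z0 = 0.5 * (lo + hi)
    # A = -d^2/dz^2 ln f(z) at z0, via a central finite difference
    A = -(log_f(z0 + h) - 2.0 * log_f(z0) + log_f(z0 - h)) / h ** 2
    return z0, A

def q(z, z0, A):
    """Laplace approximation: Gaussian with mean z0 and precision A."""
    return math.sqrt(A / (2.0 * math.pi)) * math.exp(-0.5 * A * (z - z0) ** 2)

def log_f(z):
    """Unnormalised example density f(z) = z^4 exp(-2 z), z > 0."""
    return 4.0 * math.log(z) - 2.0 * z

z0, A = laplace_1d(log_f, 0.1, 10.0)
```

Note that the normalisation constant $Z$ of $f$ is never needed: only `log_f` enters the computation.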
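Laplace Approximation - Vector Space
Approximate $p(\mathbf{z})$ for $\mathbf{z} \in \mathbb{R}^M$
$$p(\mathbf{z}) = \frac{1}{Z} f(\mathbf{z}).$$
We get the Taylor expansion
$$\ln f(\mathbf{z}) \simeq \ln f(\mathbf{z}_0) - \frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^T \mathbf{A} (\mathbf{z} - \mathbf{z}_0)$$
where the Hessian $\mathbf{A}$ is defined as
$$\mathbf{A} = -\nabla\nabla \ln f(\mathbf{z}) \,\big|_{\mathbf{z} = \mathbf{z}_0}.$$
The Laplace approximation of $p(\mathbf{z})$ is then
$$q(\mathbf{z}) \propto \exp\left\{ -\frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^T \mathbf{A} (\mathbf{z} - \mathbf{z}_0) \right\} \;\Rightarrow\; q(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \mathbf{z}_0, \mathbf{A}^{-1})$$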
Bayesian Logistic Regression
Exact Bayesian inference for logistic regression is intractable.
Why? We need to normalise the product of the prior and the likelihood, which is itself a product of logistic sigmoid functions, one for each data point.
Evaluation of the predictive distribution is also intractable.
Therefore we will use the Laplace approximation.
The predictive distribution remains intractable even under the Laplace approximation to the posterior distribution, but it can be approximated.
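Bayesian Logistic Regression
Assume a Gaussian prior:
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$$
for fixed hyperparameters $\mathbf{m}_0$ and $\mathbf{S}_0$.
Hyperparameters are parameters of a prior distribution. In contrast to the model parameters $\mathbf{w}$, they are not learned.
For a set of training data $(\mathbf{x}_n, t_n)$, where $n = 1, \dots, N$, the posterior is given by
$$p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{w})\, p(\mathbf{t} \mid \mathbf{w})$$
where $\mathbf{t} = (t_1, \dots, t_N)^T$.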
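Bayesian Logistic Regression
Using our previous result for the cross-entropy function
$$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$$
we can now calculate the log of the posterior
$$p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{w})\, p(\mathbf{t} \mid \mathbf{w})$$
using the notation $y_n = \sigma(\mathbf{w}^T \boldsymbol{\phi}_n)$ as
$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{1}{2} (\mathbf{w} - \mathbf{m}_0)^T \mathbf{S}_0^{-1} (\mathbf{w} - \mathbf{m}_0) + \sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \} + \text{const}$$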
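Bayesian Logistic Regression
To obtain a Gaussian approximation to
$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{1}{2} (\mathbf{w} - \mathbf{m}_0)^T \mathbf{S}_0^{-1} (\mathbf{w} - \mathbf{m}_0) + \sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \} + \text{const}$$
  1 Find $\mathbf{w}_{\text{MAP}}$ which maximises $\ln p(\mathbf{w} \mid \mathbf{t})$. This defines the mean of the Gaussian approximation. (Note: This is a nonlinear function of $\mathbf{w}$ because $y_n = \sigma(\mathbf{w}^T \boldsymbol{\phi}_n)$.)
  2 Calculate the second derivative of the negative log posterior at $\mathbf{w}_{\text{MAP}}$ to get the inverse covariance of the Laplace approximation
$$\mathbf{S}_N^{-1} = -\nabla\nabla \ln p(\mathbf{w} \mid \mathbf{t}) = \mathbf{S}_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \boldsymbol{\phi}_n \boldsymbol{\phi}_n^T.$$
Nowadays the gradient and Hessian would be computed with automatic differentiation; one need only implement $\ln p(\mathbf{w} \mid \mathbf{t})$.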
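Bayesian Logistic Regression
The Gaussian approximation (via the Laplace approximation) of the posterior distribution is now
$$q(\mathbf{w} \mid \boldsymbol{\phi}) = \mathcal{N}(\mathbf{w} \mid \mathbf{w}_{\text{MAP}}, \mathbf{S}_N)$$
where the inverse covariance, evaluated at $\mathbf{w}_{\text{MAP}}$, is
$$\mathbf{S}_N^{-1} = -\nabla\nabla \ln p(\mathbf{w} \mid \mathbf{t}) = \mathbf{S}_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \boldsymbol{\phi}_n \boldsymbol{\phi}_n^T.$$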
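The two steps (find $\mathbf{w}_{\text{MAP}}$, then evaluate the Hessian of the negative log posterior there) can be sketched for a two-dimensional feature space. This is only an illustration under stated assumptions: the gradient and Hessian are written out by hand rather than obtained by automatic differentiation, $\mathbf{w}_{\text{MAP}}$ is found with undamped Newton steps (which behave well on this small regularised problem but are not guaranteed to converge in general), and the data and the prior $\mathbf{m}_0 = \mathbf{0}$, $\mathbf{S}_0 = \mathbf{I}$ are hypothetical.

```python
import math

def sigmoid(a):
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def grad_and_hessian(w, Phi, t, m0, S0_inv):
    """Gradient and Hessian of -ln p(w | t) for two features."""
    d = [w[0] - m0[0], w[1] - m0[1]]
    g = [S0_inv[0][0] * d[0] + S0_inv[0][1] * d[1],
         S0_inv[1][0] * d[0] + S0_inv[1][1] * d[1]]      # prior term S0^{-1} (w - m0)
    H = [row[:] for row in S0_inv]                        # prior term S0^{-1}
    for phi, tn in zip(Phi, t):
        y = sigmoid(w[0] * phi[0] + w[1] * phi[1])
        for i in range(2):
            g[i] += (y - tn) * phi[i]
            for j in range(2):
                H[i][j] += y * (1.0 - y) * phi[i] * phi[j]
    return g, H

def laplace_posterior(Phi, t, m0, S0_inv, iters=25):
    """Return (w_MAP, S_N^{-1}) using Newton steps on the negative log posterior."""
    w = list(m0)
    for _ in range(iters):
        g, H = grad_and_hessian(w, Phi, t, m0, S0_inv)
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        step = [(H[1][1] * g[0] - H[0][1] * g[1]) / det,  # solve H step = g (2x2)
                (-H[1][0] * g[0] + H[0][0] * g[1]) / det]
        w = [w[0] - step[0], w[1] - step[1]]
    _, SN_inv = grad_and_hessian(w, Phi, t, m0, S0_inv)
    return w, SN_inv

# Hypothetical data and prior
Phi = [[1.0, x] for x in (-2.0, -1.0, 0.0, 1.0, 2.0, 3.0)]
t = [0, 0, 1, 0, 1, 1]
m0 = [0.0, 0.0]
S0_inv = [[1.0, 0.0], [0.0, 1.0]]
w_map, SN_inv = laplace_posterior(Phi, t, m0, S0_inv)
```

The covariance $\mathbf{S}_N$ of the approximation is the inverse of the returned matrix; the data term can only add to the prior precision, so the posterior is at least as concentrated as the prior.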