Bayesian Logistic Regression (srihari/CSE574/Chap4/4.5.1) · 2017-10-31 · Topics in Linear Models for Classification
Bayesian Logistic Regression
Sargur N. Srihari University at Buffalo, State University of New York
USA
Topics in Linear Models for Classification
• Overview
  1. Discriminant Functions
  2. Probabilistic Generative Models
  3. Probabilistic Discriminative Models
  4. The Laplace Approximation
  5. Bayesian Logistic Regression
Machine Learning Srihari
Topics in Bayesian Logistic Regression
• Recap of Logistic Regression
• Roadmap of Bayesian Logistic Regression
• Laplace Approximation
• Evaluation of posterior distribution
  – Gaussian approximation
• Predictive Distribution
  – Convolution of sigmoid and Gaussian
  – Approximate sigmoid with probit
• Variational Bayesian Logistic Regression
Recap of Logistic Regression
• Feature vector φ, two classes C1 and C2
• The posterior probability p(C1|φ) can be written as
  p(C1|φ) = y(φ) = σ(wᵀφ)
  where φ is an M-dimensional feature vector and σ(·) is the logistic sigmoid function
• Goal is to determine the M parameters w
• Known as logistic regression in statistics
  – Although it is a model for classification rather than regression
Determining Logistic Regression Parameters
• Maximum likelihood approach to determine w
• Data set {φn, tn} where tn ∈ {0,1} and φn = φ(xn), n = 1,..,N
  – Since tn is binary we can use the Bernoulli distribution
  – Let yn be the probability that tn = 1, i.e., yn = p(C1|φn)
  – Denote t = (t1,..,tN)ᵀ
• Likelihood function associated with the N observations:
  p(t|w) = ∏_{n=1}^{N} yn^{tn} (1 − yn)^{1−tn}
Simple Sequential Solution
• Error function is the negative of the log-likelihood:
  E(w) = −ln p(t|w) = −∑_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) }
  – Known as the cross-entropy error function
• No closed-form maximum likelihood solution for determining w
• Gradient of the error function (error × feature vector):
  ∇En = (yn − tn) φn
  ∇E(w) = ∑_{n=1}^{N} (yn − tn) φn
• Solve using an iterative approach:
  w^(τ+1) = w^(τ) − η ∇En
• This solution has severe over-fitting problems for linearly separable data, so the IRLS algorithm is used instead
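The sequential update w ← w − η(yn − tn)φn can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the slides: the function name `sgd_logistic`, the learning rate η = 0.1, the epoch count, and the toy data are all assumed choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_logistic(Phi, t, eta=0.1, epochs=100, seed=0):
    """Sequential learning for logistic regression: after each pattern,
    apply w <- w - eta * (y_n - t_n) * phi_n (error times feature vector)."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(epochs):
        for n in rng.permutation(N):      # visit patterns in random order
            y_n = sigmoid(w @ Phi[n])     # y_n = sigma(w^T phi_n)
            w = w - eta * (y_n - t[n]) * Phi[n]
    return w
```

On linearly separable data the weights keep growing without bound (the over-fitting noted above); here the loop is simply cut off after a fixed number of epochs.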
IRLS for Logistic Regression
• Posterior probability of class C1 is p(C1|φ) = y(φ) = σ(wᵀφ)
• Likelihood function for data set {φn, tn}, tn ∈ {0,1}, φn = φ(xn):
  p(t|w) = ∏_{n=1}^{N} yn^{tn} (1 − yn)^{1−tn}
1. Error function: the log-likelihood yields the cross-entropy
  E(w) = −∑_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) }
2. Gradient of error function:
  ∇E(w) = ∑_{n=1}^{N} (yn − tn) φn = Φᵀ(y − t)
3. Hessian:
  H = ∇∇E(w) = ∑_{n=1}^{N} yn(1 − yn) φn φnᵀ = ΦᵀRΦ
  – R is the N×N diagonal matrix with elements Rnn = yn(1 − yn)
  – Φ is the N×M design matrix whose nth row is φnᵀ
  – The Hessian is not constant: it depends on w through R
  – Since H is positive-definite (i.e., uᵀHu > 0 for arbitrary u), the error function is a convex function of w and so has a unique minimum
4. Newton-Raphson update:
  w^(new) = w^(old) − H⁻¹∇E(w)
  Substituting ∇E(w) = Φᵀ(y − t) and H = ΦᵀRΦ:
  w^(new) = w^(old) − (ΦᵀRΦ)⁻¹Φᵀ(y − t)
          = (ΦᵀRΦ)⁻¹{ ΦᵀRΦ w^(old) − Φᵀ(y − t) }
          = (ΦᵀRΦ)⁻¹ΦᵀR z
  where z is the N-dimensional vector z = Φw^(old) − R⁻¹(y − t)
• The update formula is a set of normal equations for a weighted least-squares problem
• Since the Hessian depends on w, apply the equations iteratively, each time using the new weight vector to recompute R and z
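The four IRLS steps above can be sketched directly in NumPy. This is an illustrative sketch, not the slides' own code: the function name `irls`, the fixed iteration count, and the clipping of R (to keep the weighted normal equations numerically solvable) are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=20):
    """Newton-Raphson / IRLS for logistic regression:
       w_new = (Phi^T R Phi)^{-1} Phi^T R z
    with R_nn = y_n (1 - y_n) and z = Phi w_old - R^{-1} (y - t)."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        R = np.clip(y * (1.0 - y), 1e-10, None)  # diagonal of R, clipped for stability
        z = Phi @ w - (y - t) / R                # working response z
        A = Phi.T @ (R[:, None] * Phi)           # Phi^T R Phi
        w = np.linalg.solve(A, Phi.T @ (R * z))  # solve the weighted normal equations
    return w
```

Each iteration is exactly a weighted least-squares solve with weights R and targets z, which is where the name "iteratively reweighted least squares" comes from.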
Roadmap of Bayesian Logistic Regression
• Logistic regression is a discriminative probabilistic linear classifier:
  p(C1|φ) = σ(wᵀφ)
• Exact Bayesian inference for logistic regression is intractable, because of:
1. Evaluation of the posterior distribution p(w|t)
  – Needs normalization of the prior p(w) = N(w|m0, S0) times the likelihood (a product of sigmoids)
    p(t|w) = ∏_{n=1}^{N} yn^{tn} (1 − yn)^{1−tn}
  – Solution: use the Laplace approximation to get a Gaussian q(w)
2. Evaluation of the predictive distribution
  p(C1|φ, t) = ∫ σ(wᵀφ) p(w|t) dw ≈ ∫ σ(wᵀφ) q(w) dw
  – A convolution of a sigmoid and a Gaussian
  – Solution: approximate the sigmoid by a probit
Laplace Approximation (summary)
• Need the mode w0 of the posterior distribution p(w|t)
  – Found by a numerical optimization algorithm
• Fit a Gaussian centered at the mode
  – Needs second derivatives of the log posterior, i.e., the Hessian matrix:
  q(w) = (1/Z) f(w) = |A|^{1/2}/(2π)^{M/2} exp{ −½(w − w0)ᵀA(w − w0) } = N(w|w0, A⁻¹)
  where A = −∇∇ ln f(w)|_{w=w0}
• For logistic regression this gives
  S_N⁻¹ = −∇∇ ln p(w|t) = S0⁻¹ + ∑_{n=1}^{N} yn(1 − yn) φn φnᵀ
Evaluation of Posterior Distribution
• Gaussian prior: p(w) = N(w|m0, S0)
  – where m0 and S0 are hyper-parameters
• Posterior distribution: p(w|t) ∝ p(w) p(t|w)
  where t = (t1,..,tN)ᵀ
• Substituting the likelihood p(t|w) = ∏_{n=1}^{N} yn^{tn} (1 − yn)^{1−tn} gives
  ln p(w|t) = −½(w − m0)ᵀ S0⁻¹ (w − m0) + ∑_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) } + const
  where yn = σ(wᵀφn)
Gaussian Approximation of Posterior
• Maximize the posterior p(w|t) to give the MAP solution wMAP
  – Done by numerical optimization
  – Defines the mean of the Gaussian
• Covariance given by the inverse of the matrix of second derivatives of the negative log posterior:
  S_N⁻¹ = −∇∇ ln p(w|t) = S0⁻¹ + ∑_{n=1}^{N} yn(1 − yn) φn φnᵀ
• Gaussian approximation to the posterior:
  q(w) = N(w|wMAP, S_N)
• Need to marginalize with respect to this distribution to make predictions
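The two ingredients of the Gaussian approximation, wMAP and S_N, can be computed with a short Newton iteration on the log posterior. A minimal sketch, assuming a NumPy setting; the function name `laplace_posterior` and the fixed iteration count are illustrative choices:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, n_iter=50):
    """Laplace approximation q(w) = N(w | w_MAP, S_N).
    w_MAP is found by Newton steps on the negative log posterior, then
    S_N^{-1} = S0^{-1} + sum_n y_n (1 - y_n) phi_n phi_n^T."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t) + S0_inv @ (w - m0)  # gradient of -ln p(w|t)
        R = y * (1.0 - y)
        H = Phi.T @ (R[:, None] * Phi) + S0_inv     # Hessian, positive-definite
        w = w - np.linalg.solve(H, grad)            # Newton step
    y = sigmoid(Phi @ w)
    R = y * (1.0 - y)
    SN = np.linalg.inv(S0_inv + Phi.T @ (R[:, None] * Phi))
    return w, SN
```

Unlike maximum likelihood, this Newton iteration stays well-behaved even for linearly separable data, because the Gaussian prior term S0⁻¹ keeps the Hessian positive-definite and the mode finite.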
Predictive Distribution
• Predictive distribution for class C1, given a new feature vector φ(x)
  – Obtained by marginalizing with respect to the posterior p(w|t):
  p(C1|φ, t) = ∫ p(C1, w|φ, t) dw          (sum rule)
             = ∫ p(C1|φ, t, w) p(w|t) dw   (product rule)
             = ∫ p(C1|φ, w) p(w|t) dw      (given φ and w, C1 is independent of t)
             ≈ ∫ σ(wᵀφ) q(w) dw            (approximate p(w|t) by the Gaussian q(w))
• The corresponding probability for class C2 is p(C2|φ, t) = 1 − p(C1|φ, t)
Predictive Distribution is a Convolution
• p(C1|φ, t) ≈ ∫ σ(wᵀφ) q(w) dw
  – The function σ(wᵀφ) depends on w only through its projection onto φ
  – Denoting a = wᵀφ we have
    σ(wᵀφ) = ∫ δ(a − wᵀφ) σ(a) da
    where δ is the Dirac delta function
  – Thus
    ∫ σ(wᵀφ) q(w) dw = ∫ σ(a) p(a) da   where p(a) = ∫ δ(a − wᵀφ) q(w) dw
• Can evaluate p(a) because
  – the delta function imposes a linear constraint on w
  – since q(w) is Gaussian, its marginal p(a) is also Gaussian
• Evaluate its mean and variance:
  μa = E[a] = ∫ a p(a) da = ∫ q(w) wᵀφ dw = wMAPᵀφ
  σa² = var[a] = ∫ p(a) { a² − E[a]² } da = ∫ q(w) { (wᵀφ)² − (wMAPᵀφ)² } dw = φᵀ S_N φ
Variational Approximation to Predictive Distribution
• Predictive distribution is
  p(C1|t) = ∫ σ(a) p(a) da = ∫ σ(a) N(a|μa, σa²) da
• The convolution of a sigmoid and a Gaussian is intractable
• Use the probit instead of the logistic sigmoid
  [Figure: sigmoid-shaped curve plotted over (−5, 5), range 0 to 1]
Approximation Using Probit
• Use the probit function, which is similar to the logistic sigmoid
  – Defined as Φ(a) = ∫_{−∞}^{a} N(θ|0,1) dθ
• Approximate σ(a) by Φ(λa)
  – Find λ by requiring that the two functions have the same slope at the origin, which yields λ² = π/8
• The convolution of a probit with a Gaussian is another probit:
  ∫ Φ(λa) N(a|μ, σ²) da = Φ( μ / (λ⁻² + σ²)^{1/2} )
• Thus the predictive distribution becomes
  p(C1|φ, t) = ∫ σ(a) N(a|μa, σa²) da ≈ σ( κ(σa²) μa )
  where κ(σ²) = (1 + πσ²/8)^{−1/2}
  [Figure: logistic sigmoid and scaled probit plotted over (−5, 5), range 0 to 1]
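The slope-matching condition that yields λ² = π/8 can be verified numerically: σ′(0) = σ(0)(1 − σ(0)) = 1/4, while d/da Φ(λa) at a = 0 equals λ·N(0|0,1) = λ/√(2π), and the two agree exactly when λ = √(π/8). A quick check:

```python
import math

# Slope of the logistic sigmoid at the origin: sigma'(0) = 1/2 * (1 - 1/2)
sigmoid_slope = 0.25

# Slope of the scaled probit Phi(lambda * a) at the origin: lambda * N(0|0,1)
lam = math.sqrt(math.pi / 8.0)
probit_slope = lam / math.sqrt(2.0 * math.pi)
```

Here λ/√(2π) = √(π/(16π)) = 1/4, so the two slopes coincide up to floating-point error.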
Probit Classification
• Applying the probit result to p(C1|φ, t) = ∫ σ(a) N(a|μa, σa²) da we have
  p(C1|φ, t) ≈ σ( κ(σa²) μa )
  where μa = wMAPᵀφ and σa² = φᵀ S_N φ
• The decision boundary corresponding to p(C1|φ, t) = 0.5 is given by μa = 0
  – This is the same solution as wMAPᵀφ = 0, i.e., using the MAP value of w
• Thus marginalization has no effect when minimizing the misclassification rate with equal prior probabilities
• For more complex decision criteria it plays an important role
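The full approximate predictive distribution σ(κ(σa²)μa) is then a one-liner given wMAP and S_N. A minimal sketch; the function name `predictive` and the example values in the usage below are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive(phi, w_map, SN):
    """p(C1|phi,t) ~= sigma(kappa(sigma_a^2) * mu_a), with
    mu_a = w_MAP^T phi, sigma_a^2 = phi^T S_N phi,
    and kappa(s2) = (1 + pi * s2 / 8)^{-1/2}."""
    mu_a = w_map @ phi
    s2_a = phi @ SN @ phi
    kappa = 1.0 / np.sqrt(1.0 + np.pi * s2_a / 8.0)
    return sigmoid(kappa * mu_a)
```

Note how κ < 1 shrinks the probability toward 0.5 as the posterior variance σa² grows, while leaving the 0.5 decision boundary (μa = 0) unchanged, exactly as stated above.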
Summary
• Logistic regression is a linear probabilistic discriminative model: p(C1|φ) = σ(wᵀφ)
• Exact Bayesian logistic regression is intractable
• Using the Laplace approximation, the posterior parameter distribution p(w|t) can be approximated as a Gaussian q(w)
• The predictive distribution is a convolution of a sigmoid and a Gaussian:
  p(C1|φ, t) ≈ ∫ σ(wᵀφ) q(w) dw
  – Approximating the sigmoid by a probit makes this convolution tractable, since the convolution of a probit with a Gaussian is another probit