Bayesian Logistic Regression (srihari/CSE574/Chap4/4.5.1) · 2017-10-31 · Topics in Linear Models for Classification
Bayesian Logistic Regression
Sargur N. Srihari University at Buffalo, State University of New York
USA
Topics in Linear Models for Classification
• Overview
  1. Discriminant Functions
  2. Probabilistic Generative Models
  3. Probabilistic Discriminative Models
  4. The Laplace Approximation
  5. Bayesian Logistic Regression
Machine Learning Srihari
Topics in Bayesian Logistic Regression
• Recap of Logistic Regression
• Roadmap of Bayesian Logistic Regression
• Laplace Approximation
• Evaluation of posterior distribution
  – Gaussian approximation
• Predictive Distribution
  – Convolution of sigmoid and Gaussian
  – Approximate sigmoid with probit
• Variational Bayesian Logistic Regression
Recap of Logistic Regression
• Feature vector φ, two classes C1 and C2
• The posterior probability p(C1|φ) can be written as
  p(C1|φ) = y(φ) = σ(wᵀφ)
  where φ is an M-dimensional feature vector and σ(·) is the logistic sigmoid function
• Goal is to determine the M parameters w
• Known as logistic regression in statistics
  – Although it is a model for classification rather than regression
Determining Logistic Regression Parameters
• Maximum likelihood approach to determine w
• Data set {φn, tn} where tn ∈ {0,1} and φn = φ(xn), n = 1,..,N
  – Since tn is binary we can use the Bernoulli distribution
  – Let yn be the probability that tn = 1, i.e., yn = p(C1|φn)
  – Denote t = (t1,..,tN)ᵀ
• Likelihood function associated with the N observations:
  p(t|w) = ∏_{n=1}^{N} yn^{tn} (1 − yn)^{1−tn}
Simple Sequential Solution
• Error function is the negative of the log-likelihood:
  E(w) = −ln p(t|w) = −∑_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) }
  – Known as the cross-entropy error function
• No closed-form maximum likelihood solution for determining w
• Gradient of the error function (error × feature vector):
  ∇En = (yn − tn) φn
  ∇E(w) = ∑_{n=1}^{N} (yn − tn) φn
• Solve using an iterative approach:
  w^(τ+1) = w^(τ) − η ∇En
• This solution has severe over-fitting problems for linearly separable data, so the IRLS algorithm is used instead
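The sequential update w ← w − η(yn − tn)φn can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the slides: the function name `sgd_logistic`, the learning rate η = 0.1, the epoch count, and the toy data are all assumed choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_logistic(Phi, t, eta=0.1, epochs=100, seed=0):
    """Sequential learning for logistic regression: after each pattern,
    apply w <- w - eta * (y_n - t_n) * phi_n (error times feature vector)."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(epochs):
        for n in rng.permutation(N):      # visit patterns in random order
            y_n = sigmoid(w @ Phi[n])     # y_n = sigma(w^T phi_n)
            w = w - eta * (y_n - t[n]) * Phi[n]
    return w
```

On linearly separable data the weights keep growing without bound (the over-fitting noted above); here the loop is simply cut off after a fixed number of epochs.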
IRLS for Logistic Regression
• Posterior probability of class C1 is p(C1|φ) = y(φ) = σ(wᵀφ)
• Likelihood function for data set {φn, tn}, tn ∈ {0,1}, φn = φ(xn):
  p(t|w) = ∏_{n=1}^{N} yn^{tn} (1 − yn)^{1−tn}
1. Error function: the log-likelihood yields the cross-entropy
  E(w) = −∑_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) }
2. Gradient of error function:
  ∇E(w) = ∑_{n=1}^{N} (yn − tn) φn = Φᵀ(y − t)
3. Hessian:
  H = ∇∇E(w) = ∑_{n=1}^{N} yn(1 − yn) φn φnᵀ = ΦᵀRΦ
  – R is the N×N diagonal matrix with elements Rnn = yn(1 − yn)
  – Φ is the N×M design matrix whose nth row is φnᵀ
  – The Hessian is not constant: it depends on w through R
  – Since H is positive-definite (i.e., uᵀHu > 0 for arbitrary u), the error function is a convex function of w and so has a unique minimum
4. Newton-Raphson update:
  w^(new) = w^(old) − H⁻¹∇E(w)
  Substituting ∇E(w) = Φᵀ(y − t) and H = ΦᵀRΦ:
  w^(new) = w^(old) − (ΦᵀRΦ)⁻¹Φᵀ(y − t)
          = (ΦᵀRΦ)⁻¹{ ΦᵀRΦ w^(old) − Φᵀ(y − t) }
          = (ΦᵀRΦ)⁻¹ΦᵀR z
  where z is the N-dimensional vector z = Φw^(old) − R⁻¹(y − t)
• The update formula is a set of normal equations for a weighted least-squares problem
• Since the Hessian depends on w, apply the equations iteratively, each time using the new weight vector to recompute R and z
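The four IRLS steps above can be sketched directly in NumPy. This is an illustrative sketch, not the slides' own code: the function name `irls`, the fixed iteration count, and the clipping of R (to keep the weighted normal equations numerically solvable) are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=20):
    """Newton-Raphson / IRLS for logistic regression:
       w_new = (Phi^T R Phi)^{-1} Phi^T R z
    with R_nn = y_n (1 - y_n) and z = Phi w_old - R^{-1} (y - t)."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        R = np.clip(y * (1.0 - y), 1e-10, None)  # diagonal of R, clipped for stability
        z = Phi @ w - (y - t) / R                # working response z
        A = Phi.T @ (R[:, None] * Phi)           # Phi^T R Phi
        w = np.linalg.solve(A, Phi.T @ (R * z))  # solve the weighted normal equations
    return w
```

Each iteration is exactly a weighted least-squares solve with weights R and targets z, which is where the name "iteratively reweighted least squares" comes from.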
Roadmap of Bayesian Logistic Regression
• Logistic regression is a discriminative probabilistic linear classifier:
  p(C1|φ) = σ(wᵀφ)
• Exact Bayesian inference for logistic regression is intractable, because of:
1. Evaluation of the posterior distribution p(w|t)
  – Needs normalization of the prior p(w) = N(w|m0, S0) times the likelihood (a product of sigmoids)
    p(t|w) = ∏_{n=1}^{N} yn^{tn} (1 − yn)^{1−tn}
  – Solution: use the Laplace approximation to get a Gaussian q(w)
2. Evaluation of the predictive distribution
  p(C1|φ, t) = ∫ σ(wᵀφ) p(w|t) dw ≈ ∫ σ(wᵀφ) q(w) dw
  – A convolution of a sigmoid and a Gaussian
  – Solution: approximate the sigmoid by a probit
Laplace Approximation (summary)
• Need the mode w0 of the posterior distribution p(w|t)
  – Found by a numerical optimization algorithm
• Fit a Gaussian centered at the mode
  – Needs second derivatives of the log posterior, i.e., the Hessian matrix:
  q(w) = (1/Z) f(w) = |A|^{1/2}/(2π)^{M/2} exp{ −½(w − w0)ᵀA(w − w0) } = N(w|w0, A⁻¹)
  where A = −∇∇ ln f(w)|_{w=w0}
• For logistic regression this gives
  S_N⁻¹ = −∇∇ ln p(w|t) = S0⁻¹ + ∑_{n=1}^{N} yn(1 − yn) φn φnᵀ
Evaluation of Posterior Distribution
• Gaussian prior: p(w) = N(w|m0, S0)
  – where m0 and S0 are hyper-parameters
• Posterior distribution: p(w|t) ∝ p(w) p(t|w)
  where t = (t1,..,tN)ᵀ
• Substituting the likelihood p(t|w) = ∏_{n=1}^{N} yn^{tn} (1 − yn)^{1−tn} gives
  ln p(w|t) = −½(w − m0)ᵀ S0⁻¹ (w − m0) + ∑_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) } + const
  where yn = σ(wᵀφn)
Gaussian Approximation of Posterior
• Maximize the posterior p(w|t) to give the MAP solution wMAP
  – Done by numerical optimization
  – Defines the mean of the Gaussian
• Covariance given by the inverse of the matrix of second derivatives of the negative log posterior:
  S_N⁻¹ = −∇∇ ln p(w|t) = S0⁻¹ + ∑_{n=1}^{N} yn(1 − yn) φn φnᵀ
• Gaussian approximation to the posterior:
  q(w) = N(w|wMAP, S_N)
• Need to marginalize with respect to this distribution to make predictions
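The two ingredients of the Gaussian approximation, wMAP and S_N, can be computed with a short Newton iteration on the log posterior. A minimal sketch, assuming a NumPy setting; the function name `laplace_posterior` and the fixed iteration count are illustrative choices:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, n_iter=50):
    """Laplace approximation q(w) = N(w | w_MAP, S_N).
    w_MAP is found by Newton steps on the negative log posterior, then
    S_N^{-1} = S0^{-1} + sum_n y_n (1 - y_n) phi_n phi_n^T."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t) + S0_inv @ (w - m0)  # gradient of -ln p(w|t)
        R = y * (1.0 - y)
        H = Phi.T @ (R[:, None] * Phi) + S0_inv     # Hessian, positive-definite
        w = w - np.linalg.solve(H, grad)            # Newton step
    y = sigmoid(Phi @ w)
    R = y * (1.0 - y)
    SN = np.linalg.inv(S0_inv + Phi.T @ (R[:, None] * Phi))
    return w, SN
```

Unlike maximum likelihood, this Newton iteration stays well-behaved even for linearly separable data, because the Gaussian prior term S0⁻¹ keeps the Hessian positive-definite and the mode finite.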
Predictive Distribution
• Predictive distribution for class C1, given a new feature vector φ(x)
  – Obtained by marginalizing with respect to the posterior p(w|t):
  p(C1|φ, t) = ∫ p(C1, w|φ, t) dw          (sum rule)
             = ∫ p(C1|φ, t, w) p(w|t) dw   (product rule)
             = ∫ p(C1|φ, w) p(w|t) dw      (given φ and w, C1 is independent of t)
             ≈ ∫ σ(wᵀφ) q(w) dw            (approximate p(w|t) by the Gaussian q(w))
• The corresponding probability for class C2 is p(C2|φ, t) = 1 − p(C1|φ, t)
Predictive Distribution is a Convolution
• p(C1|φ, t) ≈ ∫ σ(wᵀφ) q(w) dw
  – The function σ(wᵀφ) depends on w only through its projection onto φ
  – Denoting a = wᵀφ we have
    σ(wᵀφ) = ∫ δ(a − wᵀφ) σ(a) da
    where δ is the Dirac delta function
  – Thus
    ∫ σ(wᵀφ) q(w) dw = ∫ σ(a) p(a) da   where p(a) = ∫ δ(a − wᵀφ) q(w) dw
• Can evaluate p(a) because
  – the delta function imposes a linear constraint on w
  – since q(w) is Gaussian, its marginal p(a) is also Gaussian
• Evaluate its mean and variance:
  μa = E[a] = ∫ a p(a) da = ∫ q(w) wᵀφ dw = wMAPᵀφ
  σa² = var[a] = ∫ p(a) { a² − E[a]² } da = ∫ q(w) { (wᵀφ)² − (wMAPᵀφ)² } dw = φᵀ S_N φ
Variational Approximation to Predictive Distribution
• Predictive distribution is
  p(C1|t) = ∫ σ(a) p(a) da = ∫ σ(a) N(a|μa, σa²) da
• The convolution of a sigmoid and a Gaussian is intractable
• Use the probit instead of the logistic sigmoid
  [Figure: sigmoid-shaped curve plotted over (−5, 5), range 0 to 1]
Approximation Using Probit
• Use the probit function, which is similar to the logistic sigmoid
  – Defined as Φ(a) = ∫_{−∞}^{a} N(θ|0,1) dθ
• Approximate σ(a) by Φ(λa)
  – Find λ by requiring that the two functions have the same slope at the origin, which yields λ² = π/8
• The convolution of a probit with a Gaussian is another probit:
  ∫ Φ(λa) N(a|μ, σ²) da = Φ( μ / (λ⁻² + σ²)^{1/2} )
• Thus the predictive distribution becomes
  p(C1|φ, t) = ∫ σ(a) N(a|μa, σa²) da ≈ σ( κ(σa²) μa )
  where κ(σ²) = (1 + πσ²/8)^{−1/2}
  [Figure: logistic sigmoid and scaled probit plotted over (−5, 5), range 0 to 1]
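The slope-matching condition that yields λ² = π/8 can be verified numerically: σ′(0) = σ(0)(1 − σ(0)) = 1/4, while d/da Φ(λa) at a = 0 equals λ·N(0|0,1) = λ/√(2π), and the two agree exactly when λ = √(π/8). A quick check:

```python
import math

# Slope of the logistic sigmoid at the origin: sigma'(0) = 1/2 * (1 - 1/2)
sigmoid_slope = 0.25

# Slope of the scaled probit Phi(lambda * a) at the origin: lambda * N(0|0,1)
lam = math.sqrt(math.pi / 8.0)
probit_slope = lam / math.sqrt(2.0 * math.pi)
```

Here λ/√(2π) = √(π/(16π)) = 1/4, so the two slopes coincide up to floating-point error.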
Probit Classification
• Applying the probit result to p(C1|φ, t) = ∫ σ(a) N(a|μa, σa²) da we have
  p(C1|φ, t) ≈ σ( κ(σa²) μa )
  where μa = wMAPᵀφ and σa² = φᵀ S_N φ
• The decision boundary corresponding to p(C1|φ, t) = 0.5 is given by μa = 0
  – This is the same solution as wMAPᵀφ = 0, i.e., using the MAP value of w
• Thus marginalization has no effect when minimizing the misclassification rate with equal prior probabilities
• For more complex decision criteria it plays an important role
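The full approximate predictive distribution σ(κ(σa²)μa) is then a one-liner given wMAP and S_N. A minimal sketch; the function name `predictive` and the example values in the usage below are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive(phi, w_map, SN):
    """p(C1|phi,t) ~= sigma(kappa(sigma_a^2) * mu_a), with
    mu_a = w_MAP^T phi, sigma_a^2 = phi^T S_N phi,
    and kappa(s2) = (1 + pi * s2 / 8)^{-1/2}."""
    mu_a = w_map @ phi
    s2_a = phi @ SN @ phi
    kappa = 1.0 / np.sqrt(1.0 + np.pi * s2_a / 8.0)
    return sigmoid(kappa * mu_a)
```

Note how κ < 1 shrinks the probability toward 0.5 as the posterior variance σa² grows, while leaving the 0.5 decision boundary (μa = 0) unchanged, exactly as stated above.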
Summary
• Logistic regression is a linear probabilistic discriminative model: p(C1|φ) = σ(wᵀφ)
• Exact Bayesian logistic regression is intractable
• Using the Laplace approximation, the posterior parameter distribution p(w|t) can be approximated as a Gaussian q(w)
• The predictive distribution is a convolution of a sigmoid and a Gaussian:
  p(C1|φ, t) ≈ ∫ σ(wᵀφ) q(w) dw
  – Approximating the sigmoid by a probit makes this convolution tractable, since the convolution of a probit with a Gaussian is another probit