CS 4900/5900: Machine Learning Logistic Regression

Transcript of CS 4900/5900: Machine Learning Logistic Regression

Page 1: CS 4900/5900: Machine Learning Logistic Regression

Razvan C. Bunescu

School of Electrical Engineering and Computer Science

[email protected]

Logistic Regression

CS 4900/5900: Machine Learning

Page 2: CS 4900/5900: Machine Learning Logistic Regression

Supervised Learning

• Task = learn an (unknown) function t : X → T that maps input instances x ∈ X to output targets t(x) ∈ T:
  – Classification: the output t(x) ∈ T is one of a finite set of discrete categories.
  – Regression: the output t(x) ∈ T is continuous, or has a continuous component.

• Target function t(x) is known (only) through a (noisy) set of training examples:

(x1,t1), (x2,t2), … (xn,tn)

Page 3: CS 4900/5900: Machine Learning Logistic Regression

Supervised Learning

[Diagram: Training examples (xk, tk) → Learning Algorithm → Model h (Training); Test examples (x, t) → Model h → Generalization Performance (Testing).]

Page 4: CS 4900/5900: Machine Learning Logistic Regression

Parametric Approaches to Supervised Learning

• Task = build a function h(x) such that:
  – h matches t well on the training data:
      => h is able to fit data that it has seen.
  – h also matches t well on test data:
      => h is able to generalize to unseen data.

• Task = choose h from a “nice” class of functions that depend on a vector of parameters w:
  – h(x) ≡ hw(x) ≡ h(w,x)
  – what classes of functions are “nice”?

Page 5: CS 4900/5900: Machine Learning Logistic Regression

Neurons

• Soma is the central part of the neuron:
  – where the input signals are combined.
• Dendrites are cellular extensions:
  – where the majority of the input occurs.
• Axon is a fine, long projection:
  – carries nerve signals to other neurons.
• Synapses are molecular structures between axon terminals and other neurons:
  – where the communication takes place.

Page 6: CS 4900/5900: Machine Learning Logistic Regression

Neuron Models
https://www.research.ibm.com/software/IBMResearch/multimedia/IJCNN2013.neuron-model.pdf

Page 7: CS 4900/5900: Machine Learning Logistic Regression

Spiking/LIF Neuron Function http://ee.princeton.edu/research/prucnal/sites/default/files/06497478.pdf

Page 8: CS 4900/5900: Machine Learning Logistic Regression

Neuron Models
https://www.research.ibm.com/software/IBMResearch/multimedia/IJCNN2013.neuron-model.pdf

Page 9: CS 4900/5900: Machine Learning Logistic Regression

McCulloch-Pitts Neuron Function

[Diagram: inputs x0 = 1, x1, x2, x3 are multiplied by weights w0, w1, w2, w3 and summed (Σ wi xi), then passed through an activation / output function f to produce hw(x).]

• Algebraic interpretation:
  – The output of the neuron is a linear combination of inputs from other neurons, rescaled by the synaptic weights.
      • weights wi correspond to the synaptic weights (activating or inhibiting).
      • summation corresponds to combination of signals in the soma.
  – It is often transformed through an activation / output function.

Page 10: CS 4900/5900: Machine Learning Logistic Regression

Activation Functions

• logistic: f(z) = 1 / (1 + e^−z)   →  Logistic Regression
• unit step: f(z) = 0 if z < 0, 1 if z ≥ 0   →  Perceptron
• identity: f(z) = z   →  Linear Regression

Page 11: CS 4900/5900: Machine Learning Logistic Regression

Linear Regression

• Polynomial curve fitting is Linear Regression:

    x = φ(x) = [1, x, x^2, ..., x^M]ᵀ

    h(x) = wᵀx

[Diagram: inputs x0 = 1, x1, x2, x3 with weights w0, w1, w2, w3 feed the summation Σ wi xi; the activation / output function is the identity f(z) = z, so hw(x) = Σ wi xi.]
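A minimal NumPy sketch of this polynomial feature map and linear model (illustrative only; the scalar input, degree, and weight values below are hypothetical):

import numpy as np

def poly_features(x, M):
    # phi(x) = [1, x, x^2, ..., x^M]^T for a scalar input x
    return np.array([x**m for m in range(M + 1)])

x = 0.7                               # hypothetical scalar input
phi = poly_features(x, M=3)           # feature vector [1, x, x^2, x^3]
w = np.array([1.0, -2.0, 0.5, 0.1])   # hypothetical weight vector
h = w.dot(phi)                        # h(x) = w^T phi(x), identity activation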

Page 12: CS 4900/5900: Machine Learning Logistic Regression

McCulloch-Pitts Neuron Function

[Diagram: inputs x0 = 1, x1, x2, x3 are multiplied by weights w0, w1, w2, w3 and summed (Σ wi xi), then passed through an activation / output function f to produce hw(x).]

• Algebraic interpretation:
  – The output of the neuron is a linear combination of inputs from other neurons, rescaled by the synaptic weights.
      • weights wi correspond to the synaptic weights (activating or inhibiting).
      • summation corresponds to combination of signals in the soma.
  – It is often transformed through a monotonic activation / output function.

Page 13: CS 4900/5900: Machine Learning Logistic Regression

Logistic Regression

• Training set is (x1,t1), (x2,t2), … (xn,tn).

    x = [1, x1, x2, ..., xk]ᵀ

    h(x) = σ(wᵀx)

• Can be used for both classification and regression:
  – Classification: T = {C1, C2} = {1, 0}.
  – Regression: T = [0, 1] (i.e. output needs to be normalized).

[Diagram: inputs x0 = 1, x1, x2, x3 with weights w0, w1, w2, w3 feed the summation Σ wi xi; the activation function is the logistic f(z) = 1 / (1 + exp(−z)), so hw(x) = 1 / (1 + exp(−wᵀx)).]

Page 14: CS 4900/5900: Machine Learning Logistic Regression

Logistic Regression for Binary Classification

• Model output can be interpreted as posterior class probabilities:

    p(C1 | x) = σ(wᵀx) = 1 / (1 + exp(−wᵀx))

    p(C2 | x) = 1 − σ(wᵀx) = exp(−wᵀx) / (1 + exp(−wᵀx))

• How do we train a logistic regression model?
  – What error/cost function to minimize?
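A minimal NumPy sketch of these two posteriors (illustrative only; the weight and input vectors below are hypothetical, and x includes the bias feature x0 = 1):

import numpy as np

def sigmoid(z):
    # logistic function sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([-0.5, 2.0, -1.0])   # hypothetical weights [w0, w1, w2]
x = np.array([1.0, 0.3, 1.2])     # input with bias feature x0 = 1

p_C1 = sigmoid(w.dot(x))          # p(C1 | x) = sigma(w^T x)
p_C2 = 1.0 - p_C1                 # p(C2 | x) = 1 - sigma(w^T x)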

Page 15: CS 4900/5900: Machine Learning Logistic Regression

Logistic Regression Learning

• Learning = finding the “right” parameters wᵀ = [w0, w1, … , wk]
  – Find w that minimizes an error function E(w) which measures the misfit between h(xn,w) and tn.
  – Expect that h(x,w) performing well on training examples xn ⇒ h(x,w) will perform well on arbitrary test examples x ∈ X.

• Least Squares error function?

    E(w) = (1/2) Σ_{n=1}^{N} { h(xn,w) − tn }²

  – Differentiable => can use gradient descent ✓
  – Non-convex => not guaranteed to find the global optimum ✗

Page 16: CS 4900/5900: Machine Learning Logistic Regression

Maximum Likelihood

Training set is D = {⟨xn, tn⟩ | tn ∈ {0,1}, n ∈ 1…N}

Let hn = p(C1 | xn), i.e. hn = p(tn = 1 | xn) = σ(wᵀxn).

Maximum Likelihood (ML) principle: find parameters that maximize the likelihood of the labels.

• The likelihood function is:

    p(t | w) = ∏_{n=1}^{N} hn^{tn} (1 − hn)^{1−tn}

• The negative log-likelihood (cross entropy) error function:

    E(w) = − ln p(t | w) = − Σ_{n=1}^{N} { tn ln hn + (1 − tn) ln(1 − hn) }
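A minimal NumPy sketch of this cross entropy error (illustrative only; it assumes the convention used later in these slides that example xn is stored in column X[:,n] and that t is a vector of 0/1 labels):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, t):
    # E(w) = -sum_n [ t_n ln h_n + (1 - t_n) ln(1 - h_n) ], with h_n = sigma(w^T x_n)
    h = sigmoid(w.dot(X))             # vector of h_n, one per column of X
    return -np.sum(t * np.log(h) + (1 - t) * np.log(1 - h))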

Page 17: CS 4900/5900: Machine Learning Logistic Regression

Maximum Likelihood Learning for Logistic Regression

• The ML solution is:

    wML = argmax_w p(t | w) = argmin_w E(w),   where E(w) is convex in w.

• ML solution is given by ∇E(w) = 0.
  – Cannot solve analytically => solve numerically with gradient based methods: (stochastic) gradient descent, conjugate gradient, L-BFGS, etc.
  – Gradient is (prove it):

    ∇E(w) = Σ_{n=1}^{N} (hn − tn) xnᵀ
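Since ∇E(w) = 0 cannot be solved analytically, one numerical option is plain batch gradient descent. The sketch below is illustrative, not part of the slides: the learning rate and iteration count are arbitrary choices, and X again stores example xn in column X[:,n]:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, t, lr=0.1, iters=1000):
    # Batch gradient descent on E(w): w <- w - lr * sum_n (h_n - t_n) x_n
    w = np.zeros(X.shape[0])
    for _ in range(iters):
        grad = X.dot(sigmoid(w.dot(X)) - t)
        w -= lr * grad
    return w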

Page 18: CS 4900/5900: Machine Learning Logistic Regression

Regularized Logistic Regression

• Use a Gaussian prior over the parameters w = [w0, w1, … , wM]ᵀ:

    p(w) = N(w | 0, α⁻¹I) = (α / 2π)^{(M+1)/2} exp(− (α/2) wᵀw)

• Bayes’ Theorem:

    p(w | t) = p(t | w) p(w) / p(t) ∝ p(t | w) p(w)

• MAP solution:

    wMAP = argmax_w p(w | t)

Page 19: CS 4900/5900: Machine Learning Logistic Regression

Regularized Logistic Regression

• MAP solution:

    wMAP = argmax_w p(w | t)
         = argmax_w p(t | w) p(w)
         = argmin_w − ln p(t | w) p(w)
         = argmin_w − ln p(t | w) − ln p(w)
         = argmin_w ED(w) − ln p(w)
         = argmin_w ED(w) + (α/2) wᵀw
         = argmin_w ED(w) + Ew(w)

  – data term: ED(w) = − Σ_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) },  where yn = σ(wᵀxn)

  – regularization term: Ew(w) = (α/2) wᵀw

Page 20: CS 4900/5900: Machine Learning Logistic Regression

Regularized Logistic Regression

• MAP solution:

    wMAP = argmin_w ED(w) + Ew(w)

• The MAP solution is given by ∇E(w) = 0:

    ∇E(w) = ∇ED(w) + ∇Ew(w) = Σ_{n=1}^{N} (hn − tn) xnᵀ + αwᵀ,   where hn = σ(wᵀxn)

    (E(w) is still convex in w.)

• Cannot solve analytically => solve numerically:
  – (stochastic) gradient descent [PRML 3.1.3], Newton-Raphson iterative optimization [PRML 4.3.3], conjugate gradient, L-BFGS.
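A minimal NumPy sketch of this regularized gradient (illustrative; X stores xn in column X[:,n], and, following the formula above, the bias weight is regularized along with the rest):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_gradient(w, X, t, alpha):
    # grad E(w) = sum_n (h_n - t_n) x_n + alpha * w
    h = sigmoid(w.dot(X))
    return X.dot(h - t) + alpha * w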

Page 21: CS 4900/5900: Machine Learning Logistic Regression

Softmax Regression = Logistic Regression for Multiclass Classification

• Multiclass classification:

    T = {C1, C2, ..., CK} = {1, 2, ..., K}.

• Training set is (x1,t1), (x2,t2), … (xn,tn).

    x = [1, x1, x2, ..., xM]
    t1, t2, … tn ∈ {1, 2, ..., K}

• One weight vector per class [PRML 4.3.4]:

    p(Ck | x) = exp(wkᵀx) / Σ_j exp(wjᵀx)

Page 22: CS 4900/5900: Machine Learning Logistic Regression

Softmax Regression (K ≥ 2)

• Inference:

    C* = argmax_{Ck} p(Ck | x)
       = argmax_{Ck} exp(wkᵀx) / Σ_j exp(wjᵀx)     (the denominator Z(x) is a normalization constant)
       = argmax_{Ck} exp(wkᵀx)
       = argmax_{Ck} wkᵀx

• Training using:
  – Maximum Likelihood (ML)
  – Maximum A Posteriori (MAP) with a Gaussian prior on w.
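A minimal NumPy sketch of this inference rule (illustrative; the K = 3 weight vectors and the input below are hypothetical, with W holding one weight vector per row):

import numpy as np

W = np.array([[ 0.2, -1.0,  0.5],
              [ 1.1,  0.3, -0.7],
              [-0.4,  0.8,  0.2]])   # hypothetical weights, one row per class
x = np.array([1.0, 0.5, -1.2])       # input with bias feature x0 = 1

scores = W.dot(x)                    # w_k^T x for every class k
predicted = np.argmax(scores)        # C* = argmax_k w_k^T x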

Page 23: CS 4900/5900: Machine Learning Logistic Regression

Softmax Regression

• The negative log-likelihood error function is:

    ED(w) = −(1/N) ln ∏_{n=1}^{N} p(tn | xn) = −(1/N) Σ_{n=1}^{N} ln [ exp(w_{tn}ᵀxn) / Z(xn) ]

    (convex in w)

• The Maximum Likelihood solution is:

    wML = argmin_w ED(w)

• The gradient is (prove it):

    ∇wk ED(w) = −(1/N) Σ_{n=1}^{N} ( δk(tn) − p(Ck | xn) ) xn

    where δx(t) = 1 if t = x, 0 otherwise, is the Kronecker delta function.

Page 24: CS 4900/5900: Machine Learning Logistic Regression

Regularized Softmax Regression

• The new cost function is:

    E(w) = ED(w) + Ew(w) = −(1/N) Σ_{n=1}^{N} ln [ exp(w_{tn}ᵀxn) / Z(xn) ] + (α/2) Σ_k wkᵀwk

• The new gradient is (prove it):

    ∇wk E(w) = −(1/N) Σ_{n=1}^{N} ( δk(tn) − p(Ck | xn) ) xnᵀ + αwkᵀ

Page 25: CS 4900/5900: Machine Learning Logistic Regression

Softmax Regression

• ML solution is given by ∇ED(w) = 0.
  – Cannot solve analytically.
  – Solve numerically, by plugging [cost, gradient] = [E(w), ∇E(w)] values into general convex solvers:
      • L-BFGS
      • Newton methods
      • conjugate gradient
      • (stochastic / minibatch) gradient-based methods:
          – gradient descent (with / without momentum)
          – AdaGrad, AdaDelta
          – RMSProp
          – ADAM, ...

Page 26: CS 4900/5900: Machine Learning Logistic Regression

Implementation

• Need to compute [cost, gradient]:

    § cost = −(1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} δk(tn) ln p(Ck | xn) + (α/2) Σ_{k=1}^{K} wkᵀwk

    § gradientk = −(1/N) Σ_{n=1}^{N} ( δk(tn) − p(Ck | xn) ) xnᵀ + αwkᵀ

• => need to compute, for k = 1, ..., K:

    § output p(Ck | xn) = exp(wkᵀxn) / Σ_j exp(wjᵀxn)

      Overflow when the wkᵀxn are too large.

Page 27: CS 4900/5900: Machine Learning Logistic Regression

Implementation: Preventing Overflows

• Subtract from each product wkᵀxn the maximum product:

    cn = max_{1≤k≤K} wkᵀxn

    p(Ck | xn) = exp(wkᵀxn − cn) / Σ_j exp(wjᵀxn − cn)
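A minimal NumPy sketch of this overflow-safe computation (illustrative; it assumes W is K x (M+1) with one weight vector per row and X is (M+1) x N with example xn in column X[:,n]):

import numpy as np

def softmax_probs(W, X):
    # p(C_k | x_n) computed with c_n = max_k w_k^T x_n subtracted per example
    scores = W.dot(X)                          # K x N matrix of w_k^T x_n
    scores = scores - np.amax(scores, axis=0)  # subtract c_n column-wise
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores, axis=0)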

Page 28: CS 4900/5900: Machine Learning Logistic Regression

Implementation: Gradient Checking

• Want to minimize J(θ), where θ is a scalar.

• Mathematical definition of derivative:

    d/dθ J(θ) = lim_{ε→0} [ J(θ + ε) − J(θ − ε) ] / (2ε)

• Numerical approximation of derivative:

    d/dθ J(θ) ≈ [ J(θ + ε) − J(θ − ε) ] / (2ε),   where ε = 0.0001

Page 29: CS 4900/5900: Machine Learning Logistic Regression

Implementation: Gradient Checking

• If θ is a vector of parameters θi,
  – Compute the numerical derivative with respect to each θi.
• Create a vector v that is ε in position i and 0 everywhere else:
  – How do you do this without a for loop in NumPy?
• Compute Gnum(θi) = (J(θ + v) − J(θ − v)) / 2ε
  – Aggregate all derivatives into numerical gradient Gnum(θ).
• Compare numerical gradient Gnum(θ) with implementation of gradient Gimp(θ):

    ‖Gnum(θ) − Gimp(θ)‖ / ‖Gnum(θ) + Gimp(θ)‖ ≤ 10⁻⁶
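A minimal NumPy sketch of this check (illustrative; building all perturbation vectors at once with eps * np.eye(...) is one possible answer to the question above, not necessarily the intended one):

import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    # G_num[i] = (J(theta + v_i) - J(theta - v_i)) / (2 * eps),
    # where v_i is eps in position i and 0 elsewhere
    V = eps * np.eye(len(theta))   # every perturbation vector as one row
    return np.array([(J(theta + v) - J(theta - v)) / (2 * eps) for v in V])

def gradient_check(J, grad_imp, theta):
    # Compare the numerical gradient with the implemented gradient grad_imp(theta)
    g_num = numerical_gradient(J, theta)
    g_imp = grad_imp(theta)
    return np.linalg.norm(g_num - g_imp) / np.linalg.norm(g_num + g_imp) <= 1e-6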

Page 30: CS 4900/5900: Machine Learning Logistic Regression

Implementation: Vectorization of LR

• Version 1: Compute gradient component-wise.

– Assume example xn is stored in column X[:,n] in data matrix X.

    ∇E(w) = Σ_{n=1}^{N} (hn − tn) xnᵀ

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

grad = np.zeros(K)
for n in range(N):
    h = sigmoid(w.dot(X[:,n]))         # h_n = sigma(w^T x_n)
    temp = h - t[n]                    # h_n - t_n
    for k in range(K):
        grad[k] = grad[k] + temp * X[k,n]

Page 31: CS 4900/5900: Machine Learning Logistic Regression

Implementation: Vectorization of LR

• Version 2: Compute gradient, partially vectorized.

    ∇E(w) = Σ_{n=1}^{N} (hn − tn) xnᵀ

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

grad = np.zeros(K)
for n in range(N):
    grad = grad + (sigmoid(w.dot(X[:,n])) - t[n]) * X[:,n]

Page 32: CS 4900/5900: Machine Learning Logistic Regression

Implementation: Vectorization of LR

• Version 3: Compute gradient, vectorized.

    ∇E(w) = Σ_{n=1}^{N} (hn − tn) xnᵀ

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

grad = X.dot(sigmoid(w.dot(X)) - t)

Page 33: CS 4900/5900: Machine Learning Logistic Regression

Vectorization of Softmax

• Need to compute [cost, gradient]:

    § cost = −(1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} δk(tn) ln p(Ck | xn) + (α/2) Σ_{k=1}^{K} wkᵀwk

    § gradientk = −(1/N) Σ_{n=1}^{N} ( δk(tn) − p(Ck | xn) ) xnᵀ + αwkᵀ

• => compute ground truth matrix G such that G[k,n] = δk(tn):

    from scipy.sparse import coo_matrix
    groundTruth = coo_matrix((np.ones(N, dtype = np.uint8),
                              (labels, np.arange(N)))).toarray()

Page 34: CS 4900/5900: Machine Learning Logistic Regression

Vectorization of Softmax

• Compute cost = −(1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} δk(tn) ln p(Ck | xn) + (α/2) Σ_{k=1}^{K} wkᵀwk
  – Compute the matrix of wkᵀxn.
  – Compute the matrix of wkᵀxn − cn.
  – Compute the matrix of exp(wkᵀxn − cn).
  – Compute the matrix of ln p(Ck | xn).
  – Compute the log-likelihood.
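A minimal vectorized sketch following these steps (illustrative; it assumes W is K x (M+1), X is (M+1) x N with example xn per column, G is the K x N ground truth matrix from the previous slide, and alpha is the regularization strength):

import numpy as np

def softmax_cost(W, X, G, alpha):
    N = X.shape[1]
    scores = W.dot(X)                                             # matrix of w_k^T x_n
    scores = scores - np.amax(scores, axis=0)                     # matrix of w_k^T x_n - c_n
    log_probs = scores - np.log(np.sum(np.exp(scores), axis=0))   # matrix of ln p(C_k | x_n)
    data_term = -np.sum(G * log_probs) / N                        # negative log-likelihood
    reg_term = 0.5 * alpha * np.sum(W * W)                        # (alpha/2) * sum_k w_k^T w_k
    return data_term + reg_term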

Page 35: CS 4900/5900: Machine Learning Logistic Regression

Vectorization of Softmax

• Compute gradk = −(1/N) Σ_{n=1}^{N} ( δk(tn) − p(Ck | xn) ) xnᵀ + αwkᵀ
  – Compute the matrix of p(Ck | xn).
  – Compute the matrix of the gradient of the data term.
  – Compute the matrix of the gradient of the regularization term.

§ Gradient = [grad1 | grad2 | … | gradK]
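A minimal vectorized sketch of this gradient under the same (hypothetical) conventions as the cost sketch above; each row of the returned matrix is gradk:

import numpy as np

def softmax_gradient(W, X, G, alpha):
    N = X.shape[1]
    scores = W.dot(X)
    scores = scores - np.amax(scores, axis=0)
    exp_scores = np.exp(scores)
    P = exp_scores / np.sum(exp_scores, axis=0)    # matrix of p(C_k | x_n)
    data_grad = -(G - P).dot(X.T) / N              # gradient of the data term
    reg_grad = alpha * W                           # gradient of the regularization term
    return data_grad + reg_grad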

Page 36: CS 4900/5900: Machine Learning Logistic Regression

Vectorization of Softmax

• Useful Numpy functions:
  – np.dot()
  – np.amax()
  – np.argmax()
  – np.exp()
  – np.sum()
  – np.log()
  – np.mean()

Page 37: CS 4900/5900: Machine Learning Logistic Regression

import scipy

• scipy.sparse.coo_matrix()

    groundTruth = coo_matrix((np.ones(numCases, dtype = np.uint8),
                              (labels, np.arange(numCases)))).toarray()

• scipy.optimize:
  – scipy.optimize.fmin_l_bfgs_b()

        theta, _, _ = fmin_l_bfgs_b(softmaxCost, theta,
                                    args = (numClasses, inputSize, decay, images, labels),
                                    maxiter = 100, disp = 1)

  – scipy.optimize.fmin_cg()
  – scipy.optimize.minimize()

https://docs.scipy.org/doc/scipy-0.10.1/reference/tutorial/optimize.html

Page 38: CS 4900/5900: Machine Learning Logistic Regression

Multiclass Logistic Regression (K ≥ 2)

1) Train one weight vector per class [PRML Chapter 4.3.4]:

    p(Ck | x) = exp(wkᵀφ(x)) / Σ_j exp(wjᵀφ(x))

2) More general approach, with a single weight vector w and joint features φ(x, C):

    p(Ck | x) = exp(wᵀφ(x, Ck)) / Σ_j exp(wᵀφ(x, Cj))

  – Inference:

    C* = argmax_{Ck} p(Ck | x)

Page 39: CS 4900/5900: Machine Learning Logistic Regression

Logistic Regression (K ≥ 2)

2) Inference in the more general approach:

    C* = argmax_{Ck} p(Ck | x)
       = argmax_{Ck} exp(wᵀφ(x, Ck)) / Σ_j exp(wᵀφ(x, Cj))     (the denominator Z(x) is the partition function)
       = argmax_{Ck} exp(wᵀφ(x, Ck))
       = argmax_{Ck} wᵀφ(x, Ck)

• Training using:
  – Maximum Likelihood (ML)
  – Maximum A Posteriori (MAP) with a Gaussian prior on w.

Page 40: CS 4900/5900: Machine Learning Logistic Regression

Logistic Regression (K ≥ 2) with ML

• The negative log-likelihood error function is:

    ED(w) = − ln ∏_{n=1}^{N} p(tn | xn) = − Σ_{n=1}^{N} ln [ exp(wᵀφ(xn, tn)) / Z(xn) ]

    (convex in w)

• The Maximum Likelihood solution is:

    wML = argmin_w ED(w)

• The gradient is (prove it):

    ∇ED(w) = [ ∂ED(w)/∂w0, ∂ED(w)/∂w1, …, ∂ED(w)/∂wM ]

    ∂ED(w)/∂wi = − Σ_{n=1}^{N} φi(xn, tn) + Σ_{n=1}^{N} Σ_{k=1}^{K} p(Ck | xn) φi(xn, Ck)

Page 41: CS 4900/5900: Machine Learning Logistic Regression

Logistic Regression (K ≥ 2) with ML

• Set ∇ED(w) = 0 ⇒ the ML solution satisfies:

    Σ_{n=1}^{N} φi(xn, tn) = Σ_{n=1}^{N} Σ_{k=1}^{K} p(Ck | xn) φi(xn, Ck)

  ⇒ for every feature φi, the observed value on D should be the same as the expected value on D!

• Solve numerically:
  – Stochastic gradient descent [chapter 3.1.3].
  – Newton-Raphson iterative optimization (large Hessian!).
  – Limited memory Newton methods (e.g. L-BFGS).

Page 42: CS 4900/5900: Machine Learning Logistic Regression

The Maximum Entropy Principle

• Principle of Insufficient Reason
• Principle of Indifference
  – can be traced back to Pierre Laplace and Jacob Bernoulli.

➢ A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1).
  – “model all that is known and assume nothing about that which is unknown”.
  – “given a collection of facts, choose a model consistent with all the facts, but otherwise as uniform as possible”.

Page 43: CS 4900/5900: Machine Learning Logistic Regression

Maximum Likelihood ⇔ Maximum Entropy

1) Maximize conditional likelihood:

    p(t | w) = ∏_{n=1}^{N} p(tn | xn) = ∏_{n=1}^{N} exp(wᵀφ(xn, tn)) / Z(xn)

    wML = argmax_w p(t | w)

    pME(tn | xn) = p_{wML}(tn | xn) = exp(wMLᵀφ(xn, tn)) / Z(xn)

2) Maximize conditional entropy:

    pME = argmax_p − Σ_{n=1}^{N} Σ_{k=1}^{K} p(Ck | xn) log p(Ck | xn)

    subject to:

    Σ_{n=1}^{N} φ(xn, tn) = Σ_{n=1}^{N} Σ_{k=1}^{K} p(Ck | xn) φ(xn, Ck)