CS 4900/5900: Machine Learning Logistic Regression
Transcript of CS 4900/5900: Machine Learning Logistic Regression
![Page 1: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/1.jpg)
Razvan C. Bunescu
School of Electrical Engineering and Computer Science
Logistic Regression
CS 4900/5900: Machine Learning
![Page 2: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/2.jpg)
Supervised Learning
• Task = learn an (unkown) function t : X ® T that maps input instances x Î X to output targets t(x) Î T:– Classification:
• The output t(x) Î T is one of a finite set of discrete categories.– Regression:
• The output t(x) Î T is continuous, or has a continuous component.
• Target function t(x) is known (only) through (noisy) set of training examples:
(x1,t1), (x2,t2), … (xn,tn)
![Page 3: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/3.jpg)
Supervised Learning
Training Examples(xk, tk)
Test Examples(x, t)
Learning Algorithm Model h
Model h
Training
Testing
Generalization Performance
![Page 4: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/4.jpg)
Parametric Approaches to Supervised Learning
• Task = build a function h(x) such that:– h matches t well on the training data:
=> h is able to fit data that it has seen.– h also matches t well on test data:
=> h is able to generalize to unseen data.
• Task = choose h from a “nice” class of functions that depend on a vector of parameters w:– h(x) º hw(x) º h(w,x)– what classes of functions are “nice”?
![Page 5: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/5.jpg)
Neurons
Soma is the central part of the neuron:• where the input signals are combined.
Dendrites are cellular extensions:• where majority of the input occurs.
Axon is a fine, long projection:• carries nerve signals to other neurons.
Synapses are molecular structures between axon terminals and other neurons:• where the communication takes place.
![Page 6: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/6.jpg)
Neuron Modelshttps://www.research.ibm.com/software/IBMResearch/multimedia/IJCNN2013.neuron-model.pdf
![Page 7: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/7.jpg)
Spiking/LIF Neuron Function http://ee.princeton.edu/research/prucnal/sites/default/files/06497478.pdf
![Page 8: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/8.jpg)
Neuron Modelshttps://www.research.ibm.com/software/IBMResearch/multimedia/IJCNN2013.neuron-model.pdf
![Page 9: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/9.jpg)
McCulloch-Pitts Neuron Function
Σ f
1x0
x1
x2
x3
wixi∑ hw(x)
activation / outputfunction
w0
w1
w2
w3
• Algebraic interpretation:– The output of the neuron is a linear combination of inputs from other neurons,
rescaled by the synaptic weights.• weights wi correspond to the synaptic weights (activating or inhibiting).• summation corresponds to combination of signals in the soma.
– It is often transformed through an activation / output function.
![Page 10: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/10.jpg)
Activation Functions
1
0
f (z) = 11+ e−z
logistic
f (z) = 0 if z < 01 if z ≥ 0
"#$
%$unit step
f (z) = zidentity
Perceptron
Logistic RegressionLinear Regression
![Page 11: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/11.jpg)
Linear Regression
• Polynomial curve fitting is Linear Regression:x = φ(x) = [1, x, x2, ..., xM]T
h(x) = wTx
Σ f
1x0
x1
x2
x3
wixi∑ hw(x) =
activation / outputfunction
w0
w1
w2
w3 f (z) = z wixi∑
![Page 12: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/12.jpg)
McCulloch-Pitts Neuron Function
Σ f
1x0
x1
x2
x3
wixi∑ hw(x)
activation / outputfunction
w0
w1
w2
w3
• Algebraic interpretation:– The output of the neuron is a linear combination of inputs from other neurons,
rescaled by the synaptic weights.• weights wi correspond to the synaptic weights (activating or inhibiting).• summation corresponds to combination of signals in the soma.
– It is often transformed through a monotonic activation / output function.
![Page 13: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/13.jpg)
Logistic Regression
• Training set is (x1,t1), (x2,t2), … (xn,tn).x = [1, x1, x2, ..., xk]T
h(x) = σ(wTx)• Can be used for both classification and regression:
• Classification: T = {C1, C2} = {1, 0}.• Regression: T = [0, 1] (i.e. output needs to be normalized).
Σ
1x0
x1
x2
x3
wixi∑ hw(x)
activationfunction f
w0
w1
w2
w3=
11+ exp(−wTx)f (z) = 1
1+ exp(−z)
![Page 14: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/14.jpg)
Logistic Regression for Binary Classification
• Model output can be interpreted as posterior class probabilities:
• How do we train a logistic regression model?– What error/cost function to minimize?
p(C1 | x) =σ (wTx) = 1
1+ exp(−wTx))
p(C2 | x) =1−σ (wTx) = exp(−wTx)
1+ exp(−wTx)
![Page 15: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/15.jpg)
Logistic Regression Learning
• Learning = finding the “right” parameters wT = [w0, w1, … , wk ]– Find w that minimizes an error function E(w) which measures the
misfit between h(xn,w) and tn.– Expect that h(x,w) performing well on training examples xn Þ
h(x,w) will perform well on arbitrary test examples x Î X.
• Least Squares error function?
E(w) = 12
{h(xn,w)− tn}2
n=1
N
∑
– Differentiable => can use gradient descent ✓– Non-convex => not guaranteed to find the global optimum ✗
![Page 16: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/16.jpg)
Maximum Likelihood
Training set is D = {áxn, tnñ | tnÎ {0,1}, n Î 1…N}
Let
Maximum Likelihood (ML) principle: find parameters that maximize the likelihood of the labels.
• The likelihood function is
• The negative log-likelihood (cross entropy) error function:
p(t |w) = hntn (1− hn )
(1−tn )
n=1
N
∏
hn = p(C1 | xn )⇔ hn = p(tn =1| xn ) =σ (wTxn )
E(w) = − ln p(t | x) = − tn lnhn + (1− tn )ln(1− hn ){ }n=1
N
∑
![Page 17: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/17.jpg)
Maximum Likelihood Learningfor Logistic Regression
• The ML solution is:
• ML solution is given by ÑE(w) = 0.– Cannot solve analytically => solve numerically with gradient
based methods: (stochastic) gradient descent, conjugate gradient, L-BFGS, etc.
– Gradient is (prove it):
∇E(w) = (hn − tn )xnT
n=1
N
∑
wML = argmaxw p(t |w) = argminwE(w)
convex in w
![Page 18: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/18.jpg)
Regularized Logistic Regression
• Use a Gaussian prior over the parameters:w = [w0, w1, … , wM]T
• Bayes’ Theorem:
• MAP solution:
þýü
îíì-÷
øö
çèæ==
+- wwI0w T
M
Νp2
exp2
),()(2/)1(
1 apaa
)()|()(
)()|()|( wwttwwttw pp
pppp µ=
)|(maxarg twwwpMAP =
![Page 19: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/19.jpg)
Regularized Logistic Regression
• MAP solution:)|(maxarg tww
wpMAP = )()|(maxarg wwt
wpp=
)()|(lnminarg wwtw
pp-=
)(ln)|(lnminarg wwtw
pp --=
)(ln)(minarg www
pED -=
wwww
TDE 2
)(minarg a+= )()(minarg ww ww
EED +=
{ }å=
--+-=N
nnnnnD ytytE
1)1ln()1(ln)(w
wwwwTE
2)( a=
data term
regularization term
![Page 20: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/20.jpg)
Regularized Logistic Regression
• MAP solution:
• ML solution is given by ÑE(w) = 0.
ÑE(w) = ÑED(w) + ÑEw(w)
• Cannot solve analytically => solve numerically:– (stochastic) gradient descent [PRML 3.1.3], Newton Raphson
iterative optimization [PRML 4.3.3], conjugate gradient, LBFGS.
)()(minarg www wwEEDMAP +=
= (hn − tn )xnT +αwT
n=1
N
∑
still convex in w
where hn =σ (wTxn )
![Page 21: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/21.jpg)
Softmax Regression = Logistic Regressionfor Multiclass Classification
• Multiclass classification:T = {C1, C2, ..., CK} = {1, 2, ..., K}.
• Training set is (x1,t1), (x2,t2), … (xn,tn).x = [1, x1, x2, ..., xM]t1, t2, … tn Î {1, 2, ..., K}
• One weight vector per class [PRML 4.3.4]:
p(Ck | x) =exp(wk
Tx))exp(w j
Tx)j∑
![Page 22: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/22.jpg)
Softmax Regression (K ³ 2)
• Inference:
• Training using:– Maximum Likelihood (ML)– Maximum A Posteriori (MAP) with a Gaussian prior on w.
)|(maxarg* xkCCpC
k
=
= argmaxCk
exp(wkTx)
exp(w jTx)
j∑Z(x) a normalization constant
= argmaxCkexp(wk
Tx)
= argmaxCkwk
Tx
![Page 23: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/23.jpg)
Softmax Regression
• The negative log-likelihood error function is:
• The Maximum Likelihood solution is:
• The gradient is (prove it):
ED (w) = −1Nln p(tn | xn )
n=1
N
∏convex in w
= −1N
lnexp(wtn
T xn )Z(xn )n=1
N
∑
îíì
¹=
=txtx
xt 01
)(dwhere is the Kronecker delta function.
)(minarg www DML E=
∇wkED (w) = −
1N
δk (tn )− p(Ck | xn )( )n=1
N
∑ xn
![Page 24: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/24.jpg)
Regularized Softmax Regression
• The new cost function is:
• The new gradient is (prove it):
E(w) = ED (w)+Ew (w)
∇wkE(w) = − 1
Nδk (tn )− p(Ck | xn )( )xnT
n=1
N
∑ +αwkT
= −1𝑁%&'(
)
lnexp 𝐰01
2 𝐱&𝑍 𝐱&
+𝛼2𝐰 8
![Page 25: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/25.jpg)
Softmax Regression
• ML solution is given by ÑED(w) = 0 .– Cannot solve analytically.
– Solve numerically, by pluging [cost, gradient] = [E(w), ÑE(w)] values into general convex solvers:
• L-BFGS• Newton methods• conjugate gradient• (stochastic / minibatch) gradient-based methods.
– gradient descent (with / without momentum).– AdaGrad, AdaDelta– RMSProp– ADAM, ...
![Page 26: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/26.jpg)
Implementation
• Need to compute [cost, gradient]:
§ cost
§ gradientk
=> need to compute, for k = 1, ..., K:
§ output
= −1N
δk (tn )ln p(Ck | xn )k=1
K
∑n=1
N
∑ +α2
wkT
k=1
K
∑ wk
= −1N
δk (tn )− p(Ck | xn )( )xnTn=1
N
∑ +αwkT
p(Ck | xn ) =exp(wk
Txn ))exp(w j
Txn )j∑ Overflow when wkTxn
are too large.
![Page 27: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/27.jpg)
Implementation: Preventing Overflows
• Subtract from each product wkTxn the maximum product:
c =max1≤k≤K
wkTxn
p(Ck | xn ) =exp(wk
Txn − c))exp(w j
Txn − c)j∑
n
n
n
![Page 28: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/28.jpg)
Implementation: Gradient Checking
• Want to minimize J(θ), where θ is a scalar.
• Mathematical definition of derivative:
• Numerical approximation of derivative:
ddθ
J(θ ) ≈ J(θ +ε)− J(θ −ε)2ε
where ε = 0.0001
𝑑𝑑𝜃 𝐽 𝜃 = lim
>→@
𝐽 𝜃 + 𝜀 − 𝐽(𝜃 − 𝜀)2𝜀
![Page 29: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/29.jpg)
Implementation: Gradient Checking
• If θ is a vector of parameters θi, – Compute numerical derivative with respect to each θi.
• Create a vector v that is ε in position i and 0 everywhere else:– How do you do this without a for loop in NumPy?
• Compute Gnum(θi) = (J(θ +v) − J(θ − v)) / 2ε– Aggregate all derivatives into numerical gradient Gnum(θ).
• Compare numerical gradient Gnum(θ) with implementation of gradient Gimp(θ):
Gnum (θ)−Gimp(θ)Gnum (θ)+Gimp(θ)
≤10−6
![Page 30: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/30.jpg)
Implementation: Vectorization of LR
• Version 1: Compute gradient component-wise.
– Assume example xn is stored in column X[:,n] in data matrix X.
grad = np.zeros(K)for n in range(N):
h = sigmoid(w.dot(X[:,n])temp = h − t[n]for k in range(K):grad[k] = grad[k] + temp * X[k,n]
∇E(w) = (hn − tn )xnT
n=1
N
∑
def sigmoid(x):return 1 / (1 + np.exp(−x))
![Page 31: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/31.jpg)
Implementation: Vectorization of LR
• Version 2: Compute gradient, partially vectorized.
grad = np.zeros(K)for n in range(N):
grad = grad + (sigmoid(w.dot(X[:,n])) − t[n]) * X[:,n]
∇E(w) = (hn − tn )xnT
n=1
N
∑
def sigmoid(x):return 1 / (1 + np.exp(−x))
![Page 32: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/32.jpg)
Implementation: Vectorization of LR
• Version 3: Compute gradient, vectorized.
grad = X.dot(sigmoid(w.dot(X)) − t)
∇E(w) = (hn − tn )xnT
n=1
N
∑
def sigmoid(x):return 1 / (1 + np.exp(−x))
![Page 33: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/33.jpg)
Vectorization of Softmax
• Need to compute [cost, gradient]:
§ cost
§ gradientk
=> compute ground truth matrix G such that G[k,n] = 𝛿k(tn)
from scipy.sparse import coo_matrixgroundTruth = coo_matrix((np.ones(N, dtype = np.uint8),
(labels, np.arange(N)))).toarray()
= −1N
δk (tn )ln p(Ck | xn )k=1
K
∑n=1
N
∑ +α2
wkT
k=1
K
∑ wk
= −1N
δk (tn )− p(Ck | xn )( )xnTn=1
N
∑ +αwkT
![Page 34: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/34.jpg)
Vectorization of Softmax
• Compute cost
– Compute matrix of 𝐰E2𝐱&.
– Compute matrix of 𝐰E2𝐱& − 𝑐&.
– Compute matrix of exp(𝐰E2𝐱& − 𝑐&).
– Compute matrix of ln 𝑝(𝐶E|𝐱&).
– Compute log-likelihood.
= −1N
δk (tn )ln p(Ck | xn )k=1
K
∑n=1
N
∑ +α2
wkT
k=1
K
∑ wk
![Page 35: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/35.jpg)
Vectorization of Softmax
• Compute gradk
§ Gradient = [grad1 | grad2 | … | gradK]
– Compute matrix of 𝑝(𝐶E|𝐱&).
– Compute matrix of gradient of data term.
– Compute matrix of gradient of regularization term.
= −1N
δk (tn )− p(Ck | xn )( )xnTn=1
N
∑ +αwkT
![Page 36: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/36.jpg)
Vectorization of Softmax
• Useful Numpy functions:– np.dot()– np.amax()– np.argmax()– np.exp()– np.sum()– np.log()– np.mean()
![Page 37: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/37.jpg)
import scipy
• scipy.sparse.coo_matrix()groundTruth = coo_matrix((np.ones(numCases, dtype = np.uint8),
(labels, np.arange(numCases)))).toarray()• scipy.optimize:
– scipy.optimize.fmin_l_bfgs_b()theta, _, _ = fmin_l_bfgs_b(softmaxCost, theta,
args = (numClasses, inputSize, decay, images, labels),maxiter = 100, disp = 1)
– scipy.optimize.fmin_cg()– scipy.minimizehttps://docs.scipy.org/doc/scipy-0.10.1/reference/tutorial/optimize.html
![Page 38: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/38.jpg)
Multiclass Logistic Regression (K ³ 2)
1) Train one weight vector per class [PRML Chapter 4.3.4]:
2) More general approach:
- Inference:
å=
jTj
Tk
kCp ))(exp())(exp()|(xwxwxj
j
å=
j jT
kT
k CCCp
)),(exp()),(exp()|(
xwxwxj
j
)|(maxarg* xkCCpC
k
=
![Page 39: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/39.jpg)
Logistic Regression (K ³ 2)
2) Inference in more general approach:
• Training using:– Maximum Likelihood (ML)– Maximum A Posteriori (MAP) with a Gaussian prior on w.
)|(maxarg* xkCCpC
k
=
å=
j jT
kT
C CC
k )),(exp()),(exp(maxarg
xwxwj
j Z(x) the partition function.
)),(exp(maxarg kT
CC
k
xw j=
),(maxarg kT
CC
k
xw j=
![Page 40: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/40.jpg)
Logistic Regression (K ³ 2) with ML
• The negative log-likelihood error function is:
• The gradient is (prove it):
åÕ==
-=-=N
n n
nnTN
nnnD Z
ttpE11 )(
)),(exp(ln)|(ln)(xxwxw j
úû
ùêë
é¶
¶¶
¶¶
¶=Ñ
M
DDDD w
Ew
Ew
EE )(,,)(,)()(10
wwww !
),()|(),()(1 11
kni
N
n
K
knk
N
nnni
i
D CCptw
E xxxw jj ååå= ==
+-=¶
¶
)(minarg www DML E=
convex in w
![Page 41: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/41.jpg)
Logistic Regression (K ³ 2) with ML
• Set ÑED(w) = 0 Þ ML solution satisfies:
Þ for every feature ji, the observed value on D should be the same as the expected value on D!
• Solve numerically:– Stochastic gradient descent [chapter 3.1.3].– Newton Raphson iterative optimization (large Hessian!).– Limited memory Newton methods (e.g. L-BFGS).
),()|(),(1 11
kni
N
n
K
knk
N
nnni CCpt xxx jj ååå
= ==
=
![Page 42: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/42.jpg)
The Maximum Entropy Principle
• Principle of Insufficient Reason• Principle of Indifference
– can be traced back to Pierre Laplace and Jacob Bernoulli.
Ø A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. 1996.A maximum entropy approach to natural language processing.Computational Linguistics, 22(1).– “model all that is known and assume nothing about that which is
unknown”.– “given a collection of facts, choose a model consistent with all the
facts, but otherwise as uniform as possible”.
![Page 43: CS 4900/5900: Machine Learning Logistic Regression](https://reader030.fdocuments.in/reader030/viewer/2022032801/623de746f3caac559610cef6/html5/thumbnails/43.jpg)
Maximum Likelihood Û Maximum Entropy
1) Maximize conditional likelihood:
ÕÕ==
==N
n n
nnTN
nnn Z
ttpp11 )(
)),(exp()|()|(xxwxwt w
j)|(maxarg wtw
wpML =
)()),(exp()|()|(
n
nnTML
nnnnME Zttptp
ML xxwxx w
j==
2) Maximize conditional entropy:
subject to:)|(log)|(maxarg
1 1nk
N
n
K
knkpME CpCpp xxåå
= =
-=
),()|(),(1 11
kn
N
n
K
knk
N
nnn CCpt xxx jj ååå
= ==
=