Linear regression (cont). Logistic regression
Source: people.cs.pitt.edu/~milos/courses/cs2710-Fall2017/
CS 2710 Foundations of Machine Learning
Lecture 24
Milos Hauskrecht
5329 Sennott Square
Linear regression (cont)
Logistic regression
Linear regression
• Vector definition of the model
– Include the bias constant in the input vector: x = (1, x_1, x_2, …, x_d)
– f(x, w) = w_0 + w_1 x_1 + w_2 x_2 + … + w_d x_d = w^T x
– w_0, w_1, …, w_d are the parameters (weights)
[Figure: a linear unit with inputs 1, x_1, x_2, …, x_d, weights w_0, w_1, w_2, …, w_d, and output f(x, w)]
Linear regression. Error.
• Data: D = {d_1, d_2, …, d_n}, where d_i = <x_i, y_i>
• Function: f(x) = w^T x, giving predictions f(x_i)
• We would like to have y_i ≈ f(x_i) for all i = 1, …, n
• Error function
– measures how much our predictions deviate from the desired answers
– Mean-squared error: J_n = (1/n) Σ_{i=1..n} (y_i − f(x_i))²
• Learning: we want to find the weights minimizing the error!
Linear regression. Example
• 1-dimensional input x = (x_1)
[Figure: data points and the fitted regression line for a 1-dimensional input]
Linear regression. Example.
• 2-dimensional input x = (x_1, x_2)
[Figure: data points and the fitted regression plane for a 2-dimensional input]
Linear regression. Optimization.
• We want the weights minimizing the error:
J_n(w) = (1/n) Σ_{i=1..n} (y_i − f(x_i))² = (1/n) Σ_{i=1..n} (y_i − w^T x_i)²
• For the optimal set of parameters, the derivatives of the error with respect to each parameter must be 0
• Vector of derivatives:
grad_w(J_n(w)) = ∇_w J_n(w) = −(2/n) Σ_{i=1..n} (y_i − w^T x_i) x_i = 0
• For the j-th weight:
∂J_n(w)/∂w_j = −(2/n) Σ_{i=1..n} (y_i − (w_0 x_{i,0} + w_1 x_{i,1} + … + w_d x_{i,d})) x_{i,j} = 0
Solving linear regression
By rearranging the terms we get a system of linear equations with d+1 unknowns. Starting from
∂J_n(w)/∂w_j = −(2/n) Σ_{i=1..n} (y_i − (w_0 x_{i,0} + w_1 x_{i,1} + … + w_d x_{i,d})) x_{i,j} = 0,
the j-th unknown gives the equation
w_0 Σ_{i=1..n} x_{i,0} x_{i,j} + w_1 Σ_{i=1..n} x_{i,1} x_{i,j} + … + w_j Σ_{i=1..n} x_{i,j} x_{i,j} + … + w_d Σ_{i=1..n} x_{i,d} x_{i,j} = Σ_{i=1..n} y_i x_{i,j}
For example, for j = 0 (with x_{i,0} = 1):
w_0 Σ_{i=1..n} x_{i,0} + w_1 Σ_{i=1..n} x_{i,1} + … + w_d Σ_{i=1..n} x_{i,d} = Σ_{i=1..n} y_i
and for j = 1:
w_0 Σ_{i=1..n} x_{i,0} x_{i,1} + w_1 Σ_{i=1..n} x_{i,1} x_{i,1} + … + w_d Σ_{i=1..n} x_{i,d} x_{i,1} = Σ_{i=1..n} y_i x_{i,1}
Collecting all d+1 equations gives the matrix form A w = b.
Solving linear regression
• The optimal set of weights satisfies:
∇_w J_n(w) = −(2/n) Σ_{i=1..n} (y_i − w^T x_i) x_i = 0
This leads to a system of linear equations (SLE) with d+1 unknowns of the form
w_0 Σ_{i=1..n} x_{i,0} x_{i,j} + w_1 Σ_{i=1..n} x_{i,1} x_{i,j} + … + w_j Σ_{i=1..n} x_{i,j} x_{i,j} + … + w_d Σ_{i=1..n} x_{i,d} x_{i,j} = Σ_{i=1..n} y_i x_{i,j}
that is, A w = b.
Solution to SLE:
• matrix inversion: w = A⁻¹ b
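The matrix-inversion route can be made concrete with a small sketch. The following assumes a 1-dimensional input with a bias term, so A is a 2×2 matrix that can be inverted in closed form; the function name and the toy data are illustrative, not from the lecture.

```python
# A minimal sketch of solving A w = b for linear regression with a
# 1-dimensional input plus bias, so A is 2x2 and invertible by hand.

def fit_normal_equations(xs, ys):
    """Solve A w = b, where A = (1/n) sum x_i x_i^T and b = (1/n) sum y_i x_i,
    with the augmented input x_i = (1, x_i)."""
    n = len(xs)
    # Entries of the symmetric 2x2 matrix A and the vector b.
    a00 = 1.0
    a01 = sum(xs) / n
    a11 = sum(x * x for x in xs) / n
    b0 = sum(ys) / n
    b1 = sum(x * y for x, y in zip(xs, ys)) / n
    # Closed-form inverse of the 2x2 system.
    det = a00 * a11 - a01 * a01
    w0 = (a11 * b0 - a01 * b1) / det
    w1 = (a00 * b1 - a01 * b0) / det
    return w0, w1

# Noise-free data generated from y = 1 + 2x.
xs = [-1.0, 0.0, 1.0, 2.0]
ys = [1.0 + 2.0 * x for x in xs]
w0, w1 = fit_normal_equations(xs, ys)  # recovers (w0, w1) = (1, 2)
```

Because the toy data are noise-free and consistent, the exact generating weights are recovered.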
Gradient descent solution
Goal: the weight optimization in the linear regression model
J_n(w) = Error(w) = (1/n) Σ_{i=1..n} (y_i − f(x_i, w))²
An alternative to the SLE solution:
• Gradient descent
Idea:
– Adjust the weights in the direction that improves the error
– The gradient tells us what the right direction is
w ← w − α ∇_w Error(w)
α > 0 is a learning rate (scales the gradient changes)
Gradient descent method
• Descend using the gradient information
• Change the value of w according to the gradient:
w ← w − α ∇_w Error(w)|_{w*}
[Figure: Error(w) plotted against w; the gradient at the current point w* gives the direction of the descent]
Gradient descent method
• New value of the parameter, for all j:
w_j ← w_j* − α ∂Error(w)/∂w_j |_{w*}
α > 0 is a learning rate (scales the gradient changes)
[Figure: one descent step along the Error(w) curve, starting from w*]
Gradient descent method
• Iteratively approaches the optimum of the Error function
[Figure: successive iterates w(0), w(1), w(2), w(3) descending the Error(w) curve]
Batch vs Online regression algorithm
• The error function is defined on the complete dataset D:
J_n(w) = Error(w) = (1/n) Σ_{i=1..n} (y_i − f(x_i, w))²
• We say we are learning the model in the batch mode:
– All examples are available at the time of learning
– Weights are optimized with respect to all training examples
• An alternative is to learn the model in the online mode:
– Examples arrive sequentially
– Model weights are updated after every example
– If needed, examples seen so far can be forgotten
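The batch mode can be sketched as plain gradient descent on the mean-squared error, again for a 1-dimensional input with bias; the step size, iteration count, and toy data are illustrative assumptions.

```python
# A small sketch of batch gradient descent on
# J_n(w) = (1/n) sum_i (y_i - f(x_i, w))^2 with f(x, w) = w0 + w1*x.

def batch_gradient_descent(xs, ys, alpha=0.1, steps=500):
    n = len(xs)
    w0, w1 = 0.0, 0.0  # initial weights
    for _ in range(steps):
        # Gradient of J_n: -(2/n) sum_i (y_i - f(x_i)) x_i, with x_i = (1, x).
        g0 = -(2.0 / n) * sum(y - (w0 + w1 * x) for x, y in zip(xs, ys))
        g1 = -(2.0 / n) * sum((y - (w0 + w1 * x)) * x for x, y in zip(xs, ys))
        # Move against the gradient (direction of the descent).
        w0 -= alpha * g0
        w1 -= alpha * g1
    return w0, w1

xs = [-1.0, 0.0, 1.0, 2.0]
ys = [1.0 + 2.0 * x for x in xs]  # noise-free target y = 1 + 2x
w0, w1 = batch_gradient_descent(xs, ys)
```

For this small quadratic error surface, a fixed step size of 0.1 is well inside the stable range, so the iterates converge to the same weights the normal equations would give.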
Online gradient algorithm
• The error function is defined for the complete dataset D:
J_n(w) = Error(w) = (1/n) Σ_{i=1..n} (y_i − f(x_i, w))²
• Error for one example D_i = <x_i, y_i>:
J_online(D_i, w) = Error_i(w) = (1/2) (y_i − f(x_i, w))²
• Online gradient method: changes the weights after every example
w_j ← w_j − α ∂Error_i(w)/∂w_j
• Vector form:
w ← w − α ∇_w Error_i(w)
α > 0 is a learning rate that depends on the number of updates
Online gradient method
Linear model: f(x) = w^T x
On-line error: J_online(D_i, w) = (1/2) (y_i − f(x_i, w))²
On-line algorithm: generates a sequence of online updates.
(i)-th update step with D_i = <x_i, y_i>:
w_j^{(i)} = w_j^{(i−1)} − α(i) ∂Error_i(w)/∂w_j |_{w^{(i−1)}}
j-th weight:
w_j^{(i)} = w_j^{(i−1)} + α(i) (y_i − f(x_i, w^{(i−1)})) x_{i,j}
Annealed learning rate: α(i) ~ 1/i
– Gradually rescales the changes
Fixed learning rate: α(i) = C
– Use a small constant
Online regression algorithm
Online-linear-regression (stopping_criterion)
  initialize weights w = (w_0, w_1, w_2, …, w_d)
  initialize i = 1
  while stopping_criterion = FALSE
    select the next data point D_i = <x_i, y_i>
    set the learning rate α(i)
    update the weight vector: w ← w + α(i) (y_i − f(x_i, w)) x_i
    i = i + 1
  end
  return weights w
Advantages: very easy to implement; works for continuous data streams
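A runnable sketch of this algorithm, with a repeated toy dataset standing in for a data stream and a fixed learning rate C = 0.1 (the annealed rate α(i) = 1/i would also work); the function name and data are illustrative.

```python
# A sketch of the online linear regression algorithm: one weight update
# per example, model f(x, w) = w0 + w1*x with augmented input (1, x).

def online_linear_regression(stream, alpha=0.1):
    """stream is an iterable of (x, y) pairs; fixed learning rate alpha."""
    w0, w1 = 0.0, 0.0  # initialize weights
    for x, y in stream:
        error = y - (w0 + w1 * x)      # y_i - f(x_i, w)
        # w <- w + alpha * (y_i - f(x_i, w)) * x_i, with x_i = (1, x)
        w0 += alpha * error * 1.0
        w1 += alpha * error * x
    return w0, w1

data = [(-1.0, -1.0), (0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # y = 1 + 2x
stream = data * 1000  # simulate a long stream by cycling the toy data
w0, w1 = online_linear_regression(stream)
```

With consistent noise-free data, each per-example update is a contraction toward the generating weights, so cycling long enough recovers them to high precision.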
On-line learning. Example
[Figure: four snapshots (panels 1-4) of the fitted line over a 1-dimensional dataset after successive online updates]
Adaptive models
Linear model: f(x) = w^T x
On-line error: J_online(D_i, w) = (1/2) (y_i − f(x_i, w))²
On-line algorithm:
• Sequence of online updates (one example at a time)
• Useful for continuous data streams
Adaptive models:
• the underlying model is not stationary and can change over time
• Example: seasonal changes
• The on-line algorithm can be made adaptive by keeping the learning rate at some constant value: α(i) = C
Extensions of simple linear model
Replace the inputs to the linear units with feature (basis) functions to model nonlinearities:
f(x) = w_0 + Σ_{j=1..m} w_j φ_j(x)
where φ_j(x) is an arbitrary function of x.
[Figure: a linear unit with inputs 1, φ_1(x), φ_2(x), …, φ_m(x), weights w_0, w_1, w_2, …, w_m, and output f(x)]
The same techniques as before can be used to learn the weights!
Extensions of the linear model
• Models linear in the parameters we want to fit:
f(x) = w_0 + Σ_{k=1..m} w_k φ_k(x)
– w_0, w_1, …, w_m are the parameters
– φ_1(x), φ_2(x), …, φ_m(x) are the feature (basis) functions
• Basis function examples:
– a higher-order polynomial, one-dimensional input x = (x_1):
φ_1(x) = x, φ_2(x) = x², φ_3(x) = x³
– multidimensional quadratic, x = (x_1, x_2):
φ_1(x) = x_1, φ_2(x) = x_1², φ_3(x) = x_2, φ_4(x) = x_2², φ_5(x) = x_1 x_2
– other types of basis functions:
φ_1(x) = sin x, φ_2(x) = cos x
Example. Regression with polynomials.
Regression with polynomials of degree m
• Data points: pairs of <x, y>
• Feature functions: m feature functions φ_i(x) = x^i, for i = 1, 2, …, m
• Function to learn:
f(x, w) = w_0 + Σ_{i=1..m} w_i φ_i(x) = w_0 + Σ_{i=1..m} w_i x^i
[Figure: a linear unit with inputs 1, φ_1(x) = x, φ_2(x) = x², …, φ_m(x) = x^m and weights w_0, w_1, w_2, …, w_m]
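A sketch of fitting such a polynomial: expand each input into the features φ_i(x) = x^i and run the same batch gradient descent as for the plain linear model. The degree, step size, and data below are illustrative choices.

```python
# Regression with polynomial basis functions phi_i(x) = x^i, learned by
# batch gradient descent on the mean-squared error over the features.

def features(x, m):
    """phi(x) = (1, x, x^2, ..., x^m)."""
    return [x ** i for i in range(m + 1)]

def fit_polynomial(xs, ys, m, alpha=0.1, steps=5000):
    n = len(xs)
    w = [0.0] * (m + 1)
    phis = [features(x, m) for x in xs]
    for _ in range(steps):
        # Gradient of the mean-squared error w.r.t. each weight w_j.
        grads = [0.0] * (m + 1)
        for phi, y in zip(phis, ys):
            err = y - sum(wj * pj for wj, pj in zip(w, phi))
            for j in range(m + 1):
                grads[j] += -(2.0 / n) * err * phi[j]
        w = [wj - alpha * gj for wj, gj in zip(w, grads)]
    return w

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [1.0 - x + 2.0 * x * x for x in xs]  # quadratic target 1 - x + 2x^2
w = fit_polynomial(xs, ys, m=2)
```

Because the model stays linear in the weights, nothing about the learning procedure changes; only the inputs are replaced by features.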
Multidimensional model example
[Figure: a nonlinear regression surface fitted over a 2-dimensional input]
Regularized linear regression
• If the number of parameters is large relative to the number of data points used to train the model, we face the threat of overfitting (the generalization error of the model goes up)
• The prediction accuracy can often be improved by setting some coefficients to zero
– Increases the bias, but reduces the variance of the estimates
• Solutions:
– Subset selection
– Ridge regression
– Lasso regression
– Principal component regression
• Next: ridge regression
Ridge regression
• Error function for the standard least-squares estimates:
J_n(w) = (1/n) Σ_{i=1..n} (y_i − w^T x_i)²
• We seek:
w* = argmin_w (1/n) Σ_{i=1..n} (y_i − w^T x_i)²
• Ridge regression:
J_n(w) = (1/n) Σ_{i=1..n} (y_i − w^T x_i)² + λ ‖w‖²
• where ‖w‖² = Σ_{i=0..d} w_i² and λ ≥ 0
• What does the new error function do?
Ridge regression
• Standard regression:
J_n(w) = (1/n) Σ_{i=1..n} (y_i − w^T x_i)²
• Ridge regression:
J_n(w) = (1/n) Σ_{i=1..n} (y_i − w^T x_i)² + λ ‖w‖₂²
• The penalty λ ‖w‖₂² = λ Σ_{i=0..d} w_i² penalizes non-zero weights with a cost proportional to λ (a shrinkage coefficient)
• If an input attribute x_j has a small effect on improving the error function, its weight is "shut down" by the penalty term
• Inclusion of a shrinkage penalty is often referred to as regularization.
(Ridge regression is related to Tikhonov regularization.)
Regularized linear regression
How do we solve the least-squares problem if the error function is enriched by the regularization term λ ‖w‖²?
Answer: The optimal set of weights w is obtained again by solving a set of linear equations.
Standard linear regression:
∇_w J_n(w) = −(2/n) Σ_{i=1..n} (y_i − w^T x_i) x_i = 0
Solution: w* = (XᵀX)⁻¹ Xᵀ y
where X is an n×d matrix with rows corresponding to examples and columns to inputs, and y is the vector of outputs.
Regularized linear regression:
w* = (XᵀX + λI)⁻¹ Xᵀ y
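A hand-worked sketch of the regularized solution w* = (XᵀX + λI)⁻¹Xᵀy for a single input with a bias column, so the system is 2×2 and can be inverted in closed form; the data and λ values are illustrative.

```python
# Ridge regression solved in closed form for X with rows (1, x_i):
# w* = (X^T X + lambda * I)^(-1) X^T y, here as an explicit 2x2 inverse.
# (Note this sketch penalizes the bias too, matching the formula above.)

def fit_ridge(xs, ys, lam):
    n = len(xs)
    # Entries of X^T X with lambda added on the diagonal.
    a00 = n + lam
    a01 = sum(xs)
    a11 = sum(x * x for x in xs) + lam
    b0 = sum(ys)
    b1 = sum(x * y for x, y in zip(xs, ys))
    det = a00 * a11 - a01 * a01
    w0 = (a11 * b0 - a01 * b1) / det
    w1 = (a00 * b1 - a01 * b0) / det
    return w0, w1

xs = [-1.0, 0.0, 1.0, 2.0]
ys = [1.0 + 2.0 * x for x in xs]       # y = 1 + 2x
w_ols = fit_ridge(xs, ys, lam=0.0)     # lambda = 0 recovers least squares
w_reg = fit_ridge(xs, ys, lam=1.0)     # lambda > 0 shrinks the weights
```

Setting λ = 0 reproduces the ordinary least-squares weights; a positive λ pulls both weights toward zero, which is exactly the shrinkage behaviour described above.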
Lasso regression
• Standard regression:
J_n(w) = (1/n) Σ_{i=1..n} (y_i − w^T x_i)²
• Lasso regression/regularization:
J_n(w) = (1/n) Σ_{i=1..n} (y_i − w^T x_i)² + λ ‖w‖₁
• The penalty λ ‖w‖₁ = λ Σ_{i=0..d} |w_i| penalizes non-zero weights with a cost proportional to λ.
• The L1 penalty is more aggressive in pushing the weights to 0 than L2.
Classification
• Data: D = {d_1, d_2, …, d_n}, where d_i = <x_i, y_i>
– y_i represents a discrete class value
• Goal: learn f: X → Y
• Binary classification
– A special case when Y = {0, 1}
• First step:
– we need to devise a model of the function f
Discriminant functions
• A common way to represent a classifier is by using
– Discriminant functions
• Works for both the binary and the multi-way classification
• Idea:
– For every class i, define a function g_i(x) mapping X
– When the decision on input x should be made, choose the class with the highest value of g_i(x):
y* = argmax_i g_i(x)
• So what happens with the input space? Assume a binary case.
Discriminant functions
• The two discriminant functions split the input space:
– g_1(x) > g_0(x) in the region assigned to class 1
– g_1(x) < g_0(x) in the region assigned to class 0
• Decision boundary: the set of points where the discriminant functions are equal, g_1(x) = g_0(x)
[Figure: two classes of points in a 2-dimensional input space, the regions where g_1(x) > g_0(x) and g_1(x) < g_0(x), and the decision boundary g_1(x) = g_0(x) between them]
Quadratic decision boundary
[Figure: two classes of points with a quadratic decision boundary g_1(x) = g_0(x); g_1(x) > g_0(x) on one side and g_1(x) < g_0(x) on the other]
Logistic regression model
• Defines a linear decision boundary
• Discriminant functions:
g_1(x) = g(w^T x)    g_0(x) = 1 − g(w^T x)
f(x, w) = g_1(w^T x) = g(w^T x)
• where g(z) = 1/(1 + e^{−z}) is a logistic function
[Figure: a logistic unit: inputs 1, x_1, x_2, …, x_d with weights w_0, w_1, w_2, …, w_d feed the linear combination z, which passes through the logistic function to produce f(x, w)]
Logistic function
Function: g(z) = 1/(1 + e^{−z})
• Is also referred to as a sigmoid function
• Takes a real number and outputs a number in the interval [0,1]
• Models a smooth switching function; replaces the hard threshold function
[Figure: logistic (smooth) switching vs. threshold (hard) switching, both rising from 0 to 1 over z in [−20, 20]]
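The properties above are easy to check numerically; a minimal sketch:

```python
# The logistic (sigmoid) function g(z) = 1 / (1 + e^{-z}).
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# g(0) = 0.5, g is symmetric (g(-z) = 1 - g(z)), and it saturates
# toward 0 and 1 for large negative / positive inputs.
mid = logistic(0.0)
lo, hi = logistic(-10.0), logistic(10.0)
```

This smooth saturation is what replaces the hard threshold: predictions near the decision boundary stay close to 0.5, while points far from it get probabilities near 0 or 1.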
Logistic regression model
• Discriminant functions:
g_1(x) = g(w^T x)    g_0(x) = 1 − g(w^T x)
• Values of the discriminant functions vary in the interval [0,1]
– Probabilistic interpretation:
f(x, w) = p(y = 1 | x, w) = g_1(x) = g(w^T x)
[Figure: the same logistic unit, now with the output interpreted as p(y = 1 | x, w)]
Logistic regression
• We learn a probabilistic function f: X → [0,1]
– where f describes the probability of class 1 given x:
f(x, w) = g(w^T x) = p(y = 1 | x, w)
• Note that: p(y = 0 | x, w) = 1 − p(y = 1 | x, w)
• Making decisions with the logistic regression model:
If p(y = 1 | x) ≥ 1/2 then choose 1, else choose 0
Linear decision boundary
• The logistic regression model defines a linear decision boundary
• Why?
• Answer: Compare the two discriminant functions.
• Decision boundary: g_1(x) = g_0(x)
• For the boundary it must hold that the log-odds are zero:
log [g_1(x)/g_0(x)] = log [ g(w^T x) / (1 − g(w^T x)) ] = 0
• Expanding with g(z) = e^z/(1 + e^z):
log [g_1(x)/g_0(x)] = log [ (exp(w^T x)/(1 + exp(w^T x))) / (1/(1 + exp(w^T x))) ] = log exp(w^T x) = w^T x
• So the boundary is the hyperplane w^T x = 0.
Logistic regression model. Decision boundary
• LR defines a linear decision boundary
• Example: 2 classes (blue and red points)
[Figure: two classes of points in the plane separated by the linear decision boundary wᵀx = 0]
Logistic regression: parameter learning
Likelihood of outputs
• Let D_i = <x_i, y_i> and μ_i = p(y_i = 1 | x_i, w) = g(z_i) = g(w^T x_i)
• Then the likelihood of the outputs is
L(D, w) = Π_{i=1..n} p(y = y_i | x_i, w) = Π_{i=1..n} μ_i^{y_i} (1 − μ_i)^{1−y_i}
• Find the weights w that maximize the likelihood of the outputs
– Apply the log-likelihood trick: the optimal weights are the same for both the likelihood and the log-likelihood
l(D, w) = log Π_{i=1..n} μ_i^{y_i} (1 − μ_i)^{1−y_i} = Σ_{i=1..n} y_i log μ_i + (1 − y_i) log(1 − μ_i)
Logistic regression: parameter learning
• Notation: μ_i = p(y_i = 1 | x_i, w) = g(z_i) = g(w^T x_i)
• Log-likelihood:
l(D, w) = Σ_{i=1..n} y_i log μ_i + (1 − y_i) log(1 − μ_i)
• Derivatives of the log-likelihood (nonlinear in the weights!):
∂l(D, w)/∂w_j = Σ_{i=1..n} x_{i,j} (y_i − g(z_i))
∇_w l(D, w) = Σ_{i=1..n} x_i (y_i − g(w^T x_i)) = Σ_{i=1..n} x_i (y_i − f(x_i, w))
• Gradient ascent (we maximize the log-likelihood, so we move along the gradient):
w^{(k)} = w^{(k−1)} + α(k) [∇_w l(D, w)]|_{w^{(k−1)}}
  = w^{(k−1)} + α(k) Σ_{i=1..n} (y_i − f(x_i, w^{(k−1)})) x_i
Derivation of the gradient
• Log-likelihood:
l(D, w) = Σ_{i=1..n} y_i log μ_i + (1 − y_i) log(1 − μ_i)
• Derivative with respect to z_i:
∂l(D, w)/∂z_i = ∂/∂z_i [ y_i log g(z_i) + (1 − y_i) log(1 − g(z_i)) ]
= y_i (1/g(z_i)) ∂g(z_i)/∂z_i + (1 − y_i) (1/(1 − g(z_i))) (−∂g(z_i)/∂z_i)
• Derivative of the logistic function: ∂g(z_i)/∂z_i = g(z_i)(1 − g(z_i))
• Substituting:
∂l(D, w)/∂z_i = y_i (1 − g(z_i)) − (1 − y_i) g(z_i) = y_i − g(z_i)
• Chain rule with ∂z_i/∂w_j = x_{i,j}:
∂l(D, w)/∂w_j = Σ_{i=1..n} (y_i − g(z_i)) x_{i,j}
and therefore
∇_w l(D, w) = Σ_{i=1..n} x_i (y_i − g(w^T x_i)) = Σ_{i=1..n} x_i (y_i − f(x_i, w))
Logistic regression. Online gradient descent
• On-line component of the (negative) log-likelihood for D_k = <x_k, y_k>:
J_online(D_k, w) = −[ y_k log μ_k + (1 − y_k) log(1 − μ_k) ]
• On-line learning update for the weights:
w^{(k)} = w^{(k−1)} − α(k) [∇_w J_online(D_k, w)]|_{w^{(k−1)}}
• k-th update for the logistic regression:
w^{(k)} = w^{(k−1)} + α(k) (y_k − f(x_k, w^{(k−1)})) x_k
Online logistic regression algorithm
Online-logistic-regression (stopping_criterion)
  initialize weights w = (w_0, w_1, w_2, …, w_d)
  while stopping_criterion = FALSE
    do select the next data point D_i = <x_i, y_i>
    set the learning rate α(i)
    update the weights (in parallel): w ← w + α(i) (y_i − f(x_i, w)) x_i
  end
  return weights w
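A runnable sketch of the algorithm, with a toy linearly separable stream in place of live data and a fixed learning rate (an illustrative assumption; the slide leaves α(i) open):

```python
# Online logistic regression: model p(y=1|x) = g(w0 + w1*x),
# one weight update per example.
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_logistic_regression(stream, alpha=0.5):
    """stream is an iterable of (x, y) pairs with y in {0, 1}."""
    w0, w1 = 0.0, 0.0  # initialize weights
    for x, y in stream:
        f = logistic(w0 + w1 * x)          # f(x_i, w)
        # w <- w + alpha * (y_i - f(x_i, w)) * x_i, with x_i = (1, x)
        w0 += alpha * (y - f) * 1.0
        w1 += alpha * (y - f) * x
    return w0, w1

# Toy stream: class 1 for positive x, class 0 for negative x.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w0, w1 = online_logistic_regression(data * 200)
p_neg = logistic(w0 + w1 * -2.0)   # should be close to 0
p_pos = logistic(w0 + w1 * 2.0)    # should be close to 1
```

After cycling through the separable toy data, the learned boundary puts high probability on class 1 for positive inputs and low probability for negative ones; the decision rule p(y = 1 | x) ≥ 1/2 then classifies the stream correctly.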
Online algorithm. Example.
[Figures: example runs of the online algorithm; the images are not included in the transcript]