Linear regression (cont). Logistic regression
Source: people.cs.pitt.edu/~milos/courses/cs2710-Fall2017/
CS 2710 Foundations of Machine Learning
Lecture 24
Milos Hauskrecht
5329 Sennott Square
Linear regression (cont)
Logistic regression
Linear regression
• Vector definition of the model
– Include the bias constant in the input vector: x = (1, x_1, x_2, …, x_d)
– f(x, w) = w_0 + w_1 x_1 + w_2 x_2 + … + w_d x_d = w^T x
– w_0, w_1, …, w_d are the parameters (weights)
[Figure: a linear unit with inputs 1, x_1, x_2, …, x_d, weights w_0, w_1, w_2, …, w_d, and output f(x, w)]
Linear regression. Error.
• Data: D = {d_1, d_2, …, d_n}, where d_i = <x_i, y_i>
• Function: f(x) = w^T x, giving predictions f(x_i)
• We would like to have y_i ≈ f(x_i) for all i = 1, …, n
• Error function
– measures how much our predictions deviate from the desired answers
– Mean-squared error: J_n = (1/n) Σ_{i=1..n} (y_i − f(x_i))²
• Learning: we want to find the weights minimizing the error!
Linear regression. Example
• 1-dimensional input x = (x_1)
[Figure: data points and the fitted regression line for a 1-dimensional input]
Linear regression. Example.
• 2-dimensional input x = (x_1, x_2)
[Figure: data points and the fitted regression plane for a 2-dimensional input]
Linear regression. Optimization.
• We want the weights minimizing the error:
J_n(w) = (1/n) Σ_{i=1..n} (y_i − f(x_i))² = (1/n) Σ_{i=1..n} (y_i − w^T x_i)²
• For the optimal set of parameters, the derivatives of the error with respect to each parameter must be 0
• Vector of derivatives:
grad_w(J_n(w)) = ∇_w J_n(w) = −(2/n) Σ_{i=1..n} (y_i − w^T x_i) x_i = 0
• For the j-th weight:
∂J_n(w)/∂w_j = −(2/n) Σ_{i=1..n} (y_i − (w_0 x_{i,0} + w_1 x_{i,1} + … + w_d x_{i,d})) x_{i,j} = 0
Solving linear regression
By rearranging the terms we get a system of linear equations with d+1 unknowns. Starting from
∂J_n(w)/∂w_j = −(2/n) Σ_{i=1..n} (y_i − (w_0 x_{i,0} + w_1 x_{i,1} + … + w_d x_{i,d})) x_{i,j} = 0,
the j-th unknown gives the equation
w_0 Σ_{i=1..n} x_{i,0} x_{i,j} + w_1 Σ_{i=1..n} x_{i,1} x_{i,j} + … + w_j Σ_{i=1..n} x_{i,j} x_{i,j} + … + w_d Σ_{i=1..n} x_{i,d} x_{i,j} = Σ_{i=1..n} y_i x_{i,j}
For example, for j = 0 (with x_{i,0} = 1):
w_0 Σ_{i=1..n} x_{i,0} + w_1 Σ_{i=1..n} x_{i,1} + … + w_d Σ_{i=1..n} x_{i,d} = Σ_{i=1..n} y_i
and for j = 1:
w_0 Σ_{i=1..n} x_{i,0} x_{i,1} + w_1 Σ_{i=1..n} x_{i,1} x_{i,1} + … + w_d Σ_{i=1..n} x_{i,d} x_{i,1} = Σ_{i=1..n} y_i x_{i,1}
Collecting all d+1 equations gives the matrix form A w = b.
Solving linear regression
• The optimal set of weights satisfies:
∇_w J_n(w) = −(2/n) Σ_{i=1..n} (y_i − w^T x_i) x_i = 0
This leads to a system of linear equations (SLE) with d+1 unknowns of the form
w_0 Σ_{i=1..n} x_{i,0} x_{i,j} + w_1 Σ_{i=1..n} x_{i,1} x_{i,j} + … + w_j Σ_{i=1..n} x_{i,j} x_{i,j} + … + w_d Σ_{i=1..n} x_{i,d} x_{i,j} = Σ_{i=1..n} y_i x_{i,j}
that is, A w = b.
Solution to SLE:
• matrix inversion: w = A⁻¹ b
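The matrix-inversion route can be made concrete with a small sketch. The following assumes a 1-dimensional input with a bias term, so A is a 2×2 matrix that can be inverted in closed form; the function name and the toy data are illustrative, not from the lecture.

```python
# A minimal sketch of solving A w = b for linear regression with a
# 1-dimensional input plus bias, so A is 2x2 and invertible by hand.

def fit_normal_equations(xs, ys):
    """Solve A w = b, where A = (1/n) sum x_i x_i^T and b = (1/n) sum y_i x_i,
    with the augmented input x_i = (1, x_i)."""
    n = len(xs)
    # Entries of the symmetric 2x2 matrix A and the vector b.
    a00 = 1.0
    a01 = sum(xs) / n
    a11 = sum(x * x for x in xs) / n
    b0 = sum(ys) / n
    b1 = sum(x * y for x, y in zip(xs, ys)) / n
    # Closed-form inverse of the 2x2 system.
    det = a00 * a11 - a01 * a01
    w0 = (a11 * b0 - a01 * b1) / det
    w1 = (a00 * b1 - a01 * b0) / det
    return w0, w1

# Noise-free data generated from y = 1 + 2x.
xs = [-1.0, 0.0, 1.0, 2.0]
ys = [1.0 + 2.0 * x for x in xs]
w0, w1 = fit_normal_equations(xs, ys)  # recovers (w0, w1) = (1, 2)
```

Because the toy data are noise-free and consistent, the exact generating weights are recovered.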
Gradient descent solution
Goal: the weight optimization in the linear regression model
J_n(w) = Error(w) = (1/n) Σ_{i=1..n} (y_i − f(x_i, w))²
An alternative to the SLE solution:
• Gradient descent
Idea:
– Adjust the weights in the direction that improves the error
– The gradient tells us what the right direction is
w ← w − α ∇_w Error(w)
α > 0 is a learning rate (scales the gradient changes)
Gradient descent method
• Descend using the gradient information
• Change the value of w according to the gradient:
w ← w − α ∇_w Error(w)|_{w*}
[Figure: Error(w) plotted against w; the gradient at the current point w* gives the direction of the descent]
Gradient descent method
• New value of the parameter, for all j:
w_j ← w_j* − α ∂Error(w)/∂w_j |_{w*}
α > 0 is a learning rate (scales the gradient changes)
[Figure: one descent step along the Error(w) curve, starting from w*]
Gradient descent method
• Iteratively approaches the optimum of the Error function
[Figure: successive iterates w(0), w(1), w(2), w(3) descending the Error(w) curve]
Batch vs Online regression algorithm
• The error function is defined on the complete dataset D:
J_n(w) = Error(w) = (1/n) Σ_{i=1..n} (y_i − f(x_i, w))²
• We say we are learning the model in the batch mode:
– All examples are available at the time of learning
– Weights are optimized with respect to all training examples
• An alternative is to learn the model in the online mode:
– Examples arrive sequentially
– Model weights are updated after every example
– If needed, examples seen so far can be forgotten
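The batch mode can be sketched as plain gradient descent on the mean-squared error, again for a 1-dimensional input with bias; the step size, iteration count, and toy data are illustrative assumptions.

```python
# A small sketch of batch gradient descent on
# J_n(w) = (1/n) sum_i (y_i - f(x_i, w))^2 with f(x, w) = w0 + w1*x.

def batch_gradient_descent(xs, ys, alpha=0.1, steps=500):
    n = len(xs)
    w0, w1 = 0.0, 0.0  # initial weights
    for _ in range(steps):
        # Gradient of J_n: -(2/n) sum_i (y_i - f(x_i)) x_i, with x_i = (1, x).
        g0 = -(2.0 / n) * sum(y - (w0 + w1 * x) for x, y in zip(xs, ys))
        g1 = -(2.0 / n) * sum((y - (w0 + w1 * x)) * x for x, y in zip(xs, ys))
        # Move against the gradient (direction of the descent).
        w0 -= alpha * g0
        w1 -= alpha * g1
    return w0, w1

xs = [-1.0, 0.0, 1.0, 2.0]
ys = [1.0 + 2.0 * x for x in xs]  # noise-free target y = 1 + 2x
w0, w1 = batch_gradient_descent(xs, ys)
```

For this small quadratic error surface, a fixed step size of 0.1 is well inside the stable range, so the iterates converge to the same weights the normal equations would give.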
Online gradient algorithm
• The error function is defined for the complete dataset D:
J_n(w) = Error(w) = (1/n) Σ_{i=1..n} (y_i − f(x_i, w))²
• Error for one example D_i = <x_i, y_i>:
J_online(D_i, w) = Error_i(w) = (1/2) (y_i − f(x_i, w))²
• Online gradient method: changes the weights after every example
w_j ← w_j − α ∂Error_i(w)/∂w_j
• Vector form:
w ← w − α ∇_w Error_i(w)
α > 0 is a learning rate that depends on the number of updates
Online gradient method
Linear model: f(x) = w^T x
On-line error: J_online(D_i, w) = (1/2) (y_i − f(x_i, w))²
On-line algorithm: generates a sequence of online updates.
(i)-th update step with D_i = <x_i, y_i>:
w_j^{(i)} = w_j^{(i−1)} − α(i) ∂Error_i(w)/∂w_j |_{w^{(i−1)}}
j-th weight:
w_j^{(i)} = w_j^{(i−1)} + α(i) (y_i − f(x_i, w^{(i−1)})) x_{i,j}
Annealed learning rate: α(i) ~ 1/i
– Gradually rescales the changes
Fixed learning rate: α(i) = C
– Use a small constant
Online regression algorithm
Online-linear-regression (stopping_criterion)
  initialize weights w = (w_0, w_1, w_2, …, w_d)
  initialize i = 1
  while stopping_criterion = FALSE
    select the next data point D_i = <x_i, y_i>
    set the learning rate α(i)
    update the weight vector: w ← w + α(i) (y_i − f(x_i, w)) x_i
    i = i + 1
  end
  return weights w
Advantages: very easy to implement; works for continuous data streams
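A runnable sketch of this algorithm, with a repeated toy dataset standing in for a data stream and a fixed learning rate C = 0.1 (the annealed rate α(i) = 1/i would also work); the function name and data are illustrative.

```python
# A sketch of the online linear regression algorithm: one weight update
# per example, model f(x, w) = w0 + w1*x with augmented input (1, x).

def online_linear_regression(stream, alpha=0.1):
    """stream is an iterable of (x, y) pairs; fixed learning rate alpha."""
    w0, w1 = 0.0, 0.0  # initialize weights
    for x, y in stream:
        error = y - (w0 + w1 * x)      # y_i - f(x_i, w)
        # w <- w + alpha * (y_i - f(x_i, w)) * x_i, with x_i = (1, x)
        w0 += alpha * error * 1.0
        w1 += alpha * error * x
    return w0, w1

data = [(-1.0, -1.0), (0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # y = 1 + 2x
stream = data * 1000  # simulate a long stream by cycling the toy data
w0, w1 = online_linear_regression(stream)
```

With consistent noise-free data, each per-example update is a contraction toward the generating weights, so cycling long enough recovers them to high precision.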
On-line learning. Example
[Figure: four snapshots (panels 1-4) of the fitted line over a 1-dimensional dataset after successive online updates]
Adaptive models
Linear model: f(x) = w^T x
On-line error: J_online(D_i, w) = (1/2) (y_i − f(x_i, w))²
On-line algorithm:
• Sequence of online updates (one example at a time)
• Useful for continuous data streams
Adaptive models:
• the underlying model is not stationary and can change over time
• Example: seasonal changes
• The on-line algorithm can be made adaptive by keeping the learning rate at some constant value: α(i) = C
Extensions of simple linear model
Replace the inputs to the linear units with feature (basis) functions to model nonlinearities:
f(x) = w_0 + Σ_{j=1..m} w_j φ_j(x)
where φ_j(x) is an arbitrary function of x.
[Figure: a linear unit with inputs 1, φ_1(x), φ_2(x), …, φ_m(x), weights w_0, w_1, w_2, …, w_m, and output f(x)]
The same techniques as before can be used to learn the weights!
Extensions of the linear model
• Models linear in the parameters we want to fit:
f(x) = w_0 + Σ_{k=1..m} w_k φ_k(x)
– w_0, w_1, …, w_m are the parameters
– φ_1(x), φ_2(x), …, φ_m(x) are the feature (basis) functions
• Basis function examples:
– a higher-order polynomial, one-dimensional input x = (x_1):
φ_1(x) = x, φ_2(x) = x², φ_3(x) = x³
– multidimensional quadratic, x = (x_1, x_2):
φ_1(x) = x_1, φ_2(x) = x_1², φ_3(x) = x_2, φ_4(x) = x_2², φ_5(x) = x_1 x_2
– other types of basis functions:
φ_1(x) = sin x, φ_2(x) = cos x
Example. Regression with polynomials.
Regression with polynomials of degree m
• Data points: pairs of <x, y>
• Feature functions: m feature functions φ_i(x) = x^i, for i = 1, 2, …, m
• Function to learn:
f(x, w) = w_0 + Σ_{i=1..m} w_i φ_i(x) = w_0 + Σ_{i=1..m} w_i x^i
[Figure: a linear unit with inputs 1, φ_1(x) = x, φ_2(x) = x², …, φ_m(x) = x^m and weights w_0, w_1, w_2, …, w_m]
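A sketch of fitting such a polynomial: expand each input into the features φ_i(x) = x^i and run the same batch gradient descent as for the plain linear model. The degree, step size, and data below are illustrative choices.

```python
# Regression with polynomial basis functions phi_i(x) = x^i, learned by
# batch gradient descent on the mean-squared error over the features.

def features(x, m):
    """phi(x) = (1, x, x^2, ..., x^m)."""
    return [x ** i for i in range(m + 1)]

def fit_polynomial(xs, ys, m, alpha=0.1, steps=5000):
    n = len(xs)
    w = [0.0] * (m + 1)
    phis = [features(x, m) for x in xs]
    for _ in range(steps):
        # Gradient of the mean-squared error w.r.t. each weight w_j.
        grads = [0.0] * (m + 1)
        for phi, y in zip(phis, ys):
            err = y - sum(wj * pj for wj, pj in zip(w, phi))
            for j in range(m + 1):
                grads[j] += -(2.0 / n) * err * phi[j]
        w = [wj - alpha * gj for wj, gj in zip(w, grads)]
    return w

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [1.0 - x + 2.0 * x * x for x in xs]  # quadratic target 1 - x + 2x^2
w = fit_polynomial(xs, ys, m=2)
```

Because the model stays linear in the weights, nothing about the learning procedure changes; only the inputs are replaced by features.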
Multidimensional model example
[Figure: a nonlinear regression surface fitted over a 2-dimensional input]
Regularized linear regression
• If the number of parameters is large relative to the number of data points used to train the model, we face the threat of overfitting (the generalization error of the model goes up)
• The prediction accuracy can often be improved by setting some coefficients to zero
– Increases the bias, but reduces the variance of the estimates
• Solutions:
– Subset selection
– Ridge regression
– Lasso regression
– Principal component regression
• Next: ridge regression
Ridge regression
• Error function for the standard least-squares estimates:
J_n(w) = (1/n) Σ_{i=1..n} (y_i − w^T x_i)²
• We seek:
w* = argmin_w (1/n) Σ_{i=1..n} (y_i − w^T x_i)²
• Ridge regression:
J_n(w) = (1/n) Σ_{i=1..n} (y_i − w^T x_i)² + λ ‖w‖²
• where ‖w‖² = Σ_{i=0..d} w_i² and λ ≥ 0
• What does the new error function do?
Ridge regression
• Standard regression:
J_n(w) = (1/n) Σ_{i=1..n} (y_i − w^T x_i)²
• Ridge regression:
J_n(w) = (1/n) Σ_{i=1..n} (y_i − w^T x_i)² + λ ‖w‖₂²
• The penalty λ ‖w‖₂² = λ Σ_{i=0..d} w_i² penalizes non-zero weights with a cost proportional to λ (a shrinkage coefficient)
• If an input attribute x_j has a small effect on improving the error function, its weight is "shut down" by the penalty term
• Inclusion of a shrinkage penalty is often referred to as regularization.
(Ridge regression is related to Tikhonov regularization.)
Regularized linear regression
How do we solve the least-squares problem if the error function is enriched by the regularization term λ ‖w‖²?
Answer: The optimal set of weights w is obtained again by solving a set of linear equations.
Standard linear regression:
∇_w J_n(w) = −(2/n) Σ_{i=1..n} (y_i − w^T x_i) x_i = 0
Solution: w* = (XᵀX)⁻¹ Xᵀ y
where X is an n×d matrix with rows corresponding to examples and columns to inputs, and y is the vector of outputs.
Regularized linear regression:
w* = (XᵀX + λI)⁻¹ Xᵀ y
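A hand-worked sketch of the regularized solution w* = (XᵀX + λI)⁻¹Xᵀy for a single input with a bias column, so the system is 2×2 and can be inverted in closed form; the data and λ values are illustrative.

```python
# Ridge regression solved in closed form for X with rows (1, x_i):
# w* = (X^T X + lambda * I)^(-1) X^T y, here as an explicit 2x2 inverse.
# (Note this sketch penalizes the bias too, matching the formula above.)

def fit_ridge(xs, ys, lam):
    n = len(xs)
    # Entries of X^T X with lambda added on the diagonal.
    a00 = n + lam
    a01 = sum(xs)
    a11 = sum(x * x for x in xs) + lam
    b0 = sum(ys)
    b1 = sum(x * y for x, y in zip(xs, ys))
    det = a00 * a11 - a01 * a01
    w0 = (a11 * b0 - a01 * b1) / det
    w1 = (a00 * b1 - a01 * b0) / det
    return w0, w1

xs = [-1.0, 0.0, 1.0, 2.0]
ys = [1.0 + 2.0 * x for x in xs]       # y = 1 + 2x
w_ols = fit_ridge(xs, ys, lam=0.0)     # lambda = 0 recovers least squares
w_reg = fit_ridge(xs, ys, lam=1.0)     # lambda > 0 shrinks the weights
```

Setting λ = 0 reproduces the ordinary least-squares weights; a positive λ pulls both weights toward zero, which is exactly the shrinkage behaviour described above.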
Lasso regression
• Standard regression:
J_n(w) = (1/n) Σ_{i=1..n} (y_i − w^T x_i)²
• Lasso regression/regularization:
J_n(w) = (1/n) Σ_{i=1..n} (y_i − w^T x_i)² + λ ‖w‖₁
• The penalty λ ‖w‖₁ = λ Σ_{i=0..d} |w_i| penalizes non-zero weights with a cost proportional to λ.
• The L1 penalty is more aggressive in pushing the weights to 0 than L2.
Classification
• Data: D = {d_1, d_2, …, d_n}, where d_i = <x_i, y_i>
– y_i represents a discrete class value
• Goal: learn f: X → Y
• Binary classification
– A special case when Y = {0, 1}
• First step:
– we need to devise a model of the function f
Discriminant functions
• A common way to represent a classifier is by using
– Discriminant functions
• Works for both the binary and the multi-way classification
• Idea:
– For every class i, define a function g_i(x) mapping X
– When the decision on input x should be made, choose the class with the highest value of g_i(x):
y* = argmax_i g_i(x)
• So what happens with the input space? Assume a binary case.
Discriminant functions
• The two discriminant functions split the input space:
– g_1(x) > g_0(x) in the region assigned to class 1
– g_1(x) < g_0(x) in the region assigned to class 0
• Decision boundary: the set of points where the discriminant functions are equal, g_1(x) = g_0(x)
[Figure: two classes of points in a 2-dimensional input space, the regions where g_1(x) > g_0(x) and g_1(x) < g_0(x), and the decision boundary g_1(x) = g_0(x) between them]
Quadratic decision boundary
[Figure: two classes of points with a quadratic decision boundary g_1(x) = g_0(x); g_1(x) > g_0(x) on one side and g_1(x) < g_0(x) on the other]
Logistic regression model
• Defines a linear decision boundary
• Discriminant functions:
g_1(x) = g(w^T x)    g_0(x) = 1 − g(w^T x)
f(x, w) = g_1(w^T x) = g(w^T x)
• where g(z) = 1/(1 + e^{−z}) is a logistic function
[Figure: a logistic unit: inputs 1, x_1, x_2, …, x_d with weights w_0, w_1, w_2, …, w_d feed the linear combination z, which passes through the logistic function to produce f(x, w)]
Logistic function
Function: g(z) = 1/(1 + e^{−z})
• Is also referred to as a sigmoid function
• Takes a real number and outputs a number in the interval [0,1]
• Models a smooth switching function; replaces the hard threshold function
[Figure: logistic (smooth) switching vs. threshold (hard) switching, both rising from 0 to 1 over z in [−20, 20]]
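The properties above are easy to check numerically; a minimal sketch:

```python
# The logistic (sigmoid) function g(z) = 1 / (1 + e^{-z}).
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# g(0) = 0.5, g is symmetric (g(-z) = 1 - g(z)), and it saturates
# toward 0 and 1 for large negative / positive inputs.
mid = logistic(0.0)
lo, hi = logistic(-10.0), logistic(10.0)
```

This smooth saturation is what replaces the hard threshold: predictions near the decision boundary stay close to 0.5, while points far from it get probabilities near 0 or 1.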
Logistic regression model
• Discriminant functions:
g_1(x) = g(w^T x)    g_0(x) = 1 − g(w^T x)
• Values of the discriminant functions vary in the interval [0,1]
– Probabilistic interpretation:
f(x, w) = p(y = 1 | x, w) = g_1(x) = g(w^T x)
[Figure: the same logistic unit, now with the output interpreted as p(y = 1 | x, w)]
Logistic regression
• We learn a probabilistic function f: X → [0,1]
– where f describes the probability of class 1 given x:
f(x, w) = g(w^T x) = p(y = 1 | x, w)
• Note that: p(y = 0 | x, w) = 1 − p(y = 1 | x, w)
• Making decisions with the logistic regression model:
If p(y = 1 | x) ≥ 1/2 then choose 1, else choose 0
Linear decision boundary
• The logistic regression model defines a linear decision boundary
• Why?
• Answer: Compare the two discriminant functions.
• Decision boundary: g_1(x) = g_0(x)
• For the boundary it must hold that the log-odds are zero:
log [g_1(x)/g_0(x)] = log [ g(w^T x) / (1 − g(w^T x)) ] = 0
• Expanding with g(z) = e^z/(1 + e^z):
log [g_1(x)/g_0(x)] = log [ (exp(w^T x)/(1 + exp(w^T x))) / (1/(1 + exp(w^T x))) ] = log exp(w^T x) = w^T x
• So the boundary is the hyperplane w^T x = 0.
Logistic regression model. Decision boundary
• LR defines a linear decision boundary
• Example: 2 classes (blue and red points)
[Figure: two classes of points in the plane separated by the linear decision boundary wᵀx = 0]
Logistic regression: parameter learning
Likelihood of outputs
• Let D_i = <x_i, y_i> and μ_i = p(y_i = 1 | x_i, w) = g(z_i) = g(w^T x_i)
• Then the likelihood of the outputs is
L(D, w) = Π_{i=1..n} p(y = y_i | x_i, w) = Π_{i=1..n} μ_i^{y_i} (1 − μ_i)^{1−y_i}
• Find the weights w that maximize the likelihood of the outputs
– Apply the log-likelihood trick: the optimal weights are the same for both the likelihood and the log-likelihood
l(D, w) = log Π_{i=1..n} μ_i^{y_i} (1 − μ_i)^{1−y_i} = Σ_{i=1..n} y_i log μ_i + (1 − y_i) log(1 − μ_i)
Logistic regression: parameter learning
• Notation: μ_i = p(y_i = 1 | x_i, w) = g(z_i) = g(w^T x_i)
• Log-likelihood:
l(D, w) = Σ_{i=1..n} y_i log μ_i + (1 − y_i) log(1 − μ_i)
• Derivatives of the log-likelihood (nonlinear in the weights!):
∂l(D, w)/∂w_j = Σ_{i=1..n} x_{i,j} (y_i − g(z_i))
∇_w l(D, w) = Σ_{i=1..n} x_i (y_i − g(w^T x_i)) = Σ_{i=1..n} x_i (y_i − f(x_i, w))
• Gradient ascent (we maximize the log-likelihood, so we move along the gradient):
w^{(k)} = w^{(k−1)} + α(k) [∇_w l(D, w)]|_{w^{(k−1)}}
  = w^{(k−1)} + α(k) Σ_{i=1..n} (y_i − f(x_i, w^{(k−1)})) x_i
Derivation of the gradient
• Log-likelihood:
l(D, w) = Σ_{i=1..n} y_i log μ_i + (1 − y_i) log(1 − μ_i)
• Derivative with respect to z_i:
∂l(D, w)/∂z_i = ∂/∂z_i [ y_i log g(z_i) + (1 − y_i) log(1 − g(z_i)) ]
= y_i (1/g(z_i)) ∂g(z_i)/∂z_i + (1 − y_i) (1/(1 − g(z_i))) (−∂g(z_i)/∂z_i)
• Derivative of the logistic function: ∂g(z_i)/∂z_i = g(z_i)(1 − g(z_i))
• Substituting:
∂l(D, w)/∂z_i = y_i (1 − g(z_i)) − (1 − y_i) g(z_i) = y_i − g(z_i)
• Chain rule with ∂z_i/∂w_j = x_{i,j}:
∂l(D, w)/∂w_j = Σ_{i=1..n} (y_i − g(z_i)) x_{i,j}
and therefore
∇_w l(D, w) = Σ_{i=1..n} x_i (y_i − g(w^T x_i)) = Σ_{i=1..n} x_i (y_i − f(x_i, w))
Logistic regression. Online gradient descent
• On-line component of the (negative) log-likelihood for D_k = <x_k, y_k>:
J_online(D_k, w) = −[ y_k log μ_k + (1 − y_k) log(1 − μ_k) ]
• On-line learning update for the weights:
w^{(k)} = w^{(k−1)} − α(k) [∇_w J_online(D_k, w)]|_{w^{(k−1)}}
• k-th update for the logistic regression:
w^{(k)} = w^{(k−1)} + α(k) (y_k − f(x_k, w^{(k−1)})) x_k
Online logistic regression algorithm
Online-logistic-regression (stopping_criterion)
  initialize weights w = (w_0, w_1, w_2, …, w_d)
  while stopping_criterion = FALSE
    do select the next data point D_i = <x_i, y_i>
    set the learning rate α(i)
    update the weights (in parallel): w ← w + α(i) (y_i − f(x_i, w)) x_i
  end
  return weights w
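A runnable sketch of the algorithm, with a toy linearly separable stream in place of live data and a fixed learning rate (an illustrative assumption; the slide leaves α(i) open):

```python
# Online logistic regression: model p(y=1|x) = g(w0 + w1*x),
# one weight update per example.
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_logistic_regression(stream, alpha=0.5):
    """stream is an iterable of (x, y) pairs with y in {0, 1}."""
    w0, w1 = 0.0, 0.0  # initialize weights
    for x, y in stream:
        f = logistic(w0 + w1 * x)          # f(x_i, w)
        # w <- w + alpha * (y_i - f(x_i, w)) * x_i, with x_i = (1, x)
        w0 += alpha * (y - f) * 1.0
        w1 += alpha * (y - f) * x
    return w0, w1

# Toy stream: class 1 for positive x, class 0 for negative x.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w0, w1 = online_logistic_regression(data * 200)
p_neg = logistic(w0 + w1 * -2.0)   # should be close to 0
p_pos = logistic(w0 + w1 * 2.0)    # should be close to 1
```

After cycling through the separable toy data, the learned boundary puts high probability on class 1 for positive inputs and low probability for negative ones; the decision rule p(y = 1 | x) ≥ 1/2 then classifies the stream correctly.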
Online algorithm. Example.
[Figures: example runs of the online algorithm; the images are not included in the transcript]