Linear Model (III)


Transcript of Linear Model (III)

Page 1: Linear Model (III)

Linear Model (III)

Rong Jin

Page 2: Linear Model (III)

Announcement

• Homework 2 is out and is due 02/05/2004 (next Tuesday)
• Homework 1 is handed out

Page 3: Linear Model (III)

Recap: Logistic Regression Model

• Assume the inputs and outputs are related through the log-linear function
• Estimate weights: MLE approach

$$p(y \mid \mathbf{x}; \mathbf{w}) = \frac{1}{1 + \exp\left(-y\,(\mathbf{x} \cdot \mathbf{w} + c)\right)}, \qquad \mathbf{w} = \{w_1, w_2, \ldots, w_m, c\}$$

$$\mathbf{w}^{*} = \arg\max_{\mathbf{w}} \; l(D_{\mathrm{train}}) = \arg\max_{\mathbf{w}} \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i; \mathbf{w}) = \arg\max_{\mathbf{w}} \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left(-y_i\,(\mathbf{x}_i \cdot \mathbf{w} + c)\right)}$$

Page 4: Linear Model (III)

Example: Text Classification

• Input x: a binary vector
  • Each word is a different dimension
  • xi = 0 if the i-th word does not appear in the document
  • xi = 1 if it appears in the document
• Output y: interesting document or not
  • +1: interesting
  • -1: uninteresting
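To make this encoding concrete, here is a minimal Python sketch; the vocabulary, function name, and example document are illustrative assumptions, not part of the original slides.

```python
# Minimal sketch of the binary bag-of-words encoding described above.
def binary_vector(document, vocabulary):
    """Map a document to a binary vector: 1 if the word appears, else 0."""
    words = set(document.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

vocabulary = ["the", "world", "people", "company", "center"]
doc1 = "The purpose of the Lady Bird Johnson Wildflower Center is to educate people around the world"
print(binary_vector(doc1, vocabulary))  # -> [1, 1, 1, 0, 1]
```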

Page 5: Linear Model (III)

Example: Text Classification

Doc 1

The purpose of the Lady Bird Johnson Wildflower Center is to educate people around the world, …

Doc 2

Rain Bird is one of the leading irrigation manufacturers in the world, providing complete irrigation solutions for people…

term     the   world   people   company   center   …
Doc 1     1      1       1         0         1     …
Doc 2     1      1       1         1         0     …

Page 6: Linear Model (III)

Example 2: Text Classification

• Logistic regression model: every term ti is assigned a weight wi
• Learning parameters: MLE approach
• Need numerical solutions

$$p(y \mid d; \mathbf{w}) = \frac{1}{1 + \exp\left(-y\left(\sum_i w_i t_i + c\right)\right)}, \qquad \mathbf{w} = \{w_1, w_2, \ldots, w_m, c\}$$

$$l(D_{\mathrm{train}}) = \sum_{i=1}^{N} \log p\!\left(+ \mid d_i^{(+)}\right) + \sum_{i=1}^{N} \log p\!\left(- \mid d_i^{(-)}\right) = \sum_{i=1}^{N} \log \frac{1}{1 + \exp\left(-\sum_j w_j t_{i,j}^{(+)} - c\right)} + \sum_{i=1}^{N} \log \frac{1}{1 + \exp\left(\sum_j w_j t_{i,j}^{(-)} + c\right)}$$
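The model and log-likelihood above can be written out directly. The following is a minimal Python sketch, not the lecture's code; the function names and the pure-Python style are assumptions.

```python
import math

def predict_prob(y, t, w, c):
    """p(y | d; w) for y in {+1, -1}, where t is the document's binary term vector."""
    score = sum(wi * ti for wi, ti in zip(w, t)) + c
    return 1.0 / (1.0 + math.exp(-y * score))

def log_likelihood(docs, labels, w, c):
    """Training log-likelihood: sum of log p(y_i | d_i; w) over labeled documents."""
    return sum(math.log(predict_prob(y, t, w, c)) for t, y in zip(docs, labels))
```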

Page 7: Linear Model (III)

Example 2: Text Classification

Weight wi:
• wi > 0: term ti is positive evidence
• wi < 0: term ti is negative evidence
• wi = 0: term ti is irrelevant to whether or not the document is interesting
• The larger |wi| is, the more important term ti is in determining whether the document is interesting

Threshold c:
• $\sum_{t_i \in d} w_i t_i + c > 0$: more likely to be an interesting document
• $\sum_{t_i \in d} w_i t_i + c < 0$: more likely to be an uninteresting document
• $\sum_{t_i \in d} w_i t_i + c = 0$: decision boundary
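A short sketch of this decision rule (illustrative only; the function name is an assumption):

```python
def classify(t, w, c):
    """Return +1 (interesting) if the score w.t + c is positive, else -1 (uninteresting)."""
    score = sum(wi * ti for wi, ti in zip(w, t)) + c
    return 1 if score > 0 else -1
```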

Page 8: Linear Model (III)

Example 2: Text Classification

• Dataset: Reuters-21578

• Classification accuracy

• Naïve Bayes: 77%

• Logistic regression: 88%

Page 9: Linear Model (III)

Why Does Logistic Regression Work Better for Text Classification?

• Common words
  • Small weights in logistic regression
  • Large weights in naïve Bayes: weight ~ p(w|+) – p(w|-)
• Independence assumption
  • Naïve Bayes assumes that each word is generated independently
  • Logistic regression is able to take the correlation of words into account

Page 10: Linear Model (III)

Comparison

Generative Model
• Model P(x|y)
• Model the input patterns
• Usually fast convergence
• Cheap computation
• Robust to noisy data
But:
• Usually performs worse

Discriminative Model
• Model P(y|x) directly
• Model the decision boundary
• Usually good performance
But:
• Slow convergence
• Expensive computation
• Sensitive to noisy data

Page 11: Linear Model (III)

Problems with Logistic Regression?

$$p(y \mid \mathbf{x}; \mathbf{w}) = \frac{1}{1 + \exp\left(-y\,(w_1 x_1 + w_2 x_2 + \ldots + w_m x_m + c)\right)}, \qquad \mathbf{w} = \{w_1, w_2, \ldots, w_m, c\}$$

How about words that only appear in one class?

Page 12: Linear Model (III)

Overfitting Problem with Logistic Regression

• Consider a word t that only appears in one document d, and d is a positive document. Let w be its associated weight.
• Consider the derivative of l(Dtrain) with respect to w:

$$l(D_{\mathrm{train}}) = \sum_{i=1}^{N} \log p\!\left(+ \mid d_i^{(+)}\right) + \sum_{i=1}^{N} \log p\!\left(- \mid d_i^{(-)}\right) = \log p(+ \mid d) + \sum_{d_i^{(+)} \neq d} \log p\!\left(+ \mid d_i^{(+)}\right) + \sum_{i=1}^{N} \log p\!\left(- \mid d_i^{(-)}\right) = \log p(+ \mid d) + l_1 + l_2$$

$$\frac{\partial l(D_{\mathrm{train}})}{\partial w} = \frac{\partial \log p(+ \mid d)}{\partial w} + \frac{\partial l_1}{\partial w} + \frac{\partial l_2}{\partial w} = \frac{1}{1 + \exp(\mathbf{x} \cdot \mathbf{w} + c)} + 0 + 0 > 0$$

• Since t appears only in d, the terms $l_1$ and $l_2$ do not depend on w, and the remaining derivative is strictly positive for every finite w, so the likelihood keeps increasing as w grows: w will be infinite!

Page 13: Linear Model (III)

Solution: Regularization

• Regularized log-likelihood
• Large weights → small weights: prevent weights from being too large
• Small weights → zero: sparse weights

$$l_{\mathrm{reg}}(D_{\mathrm{train}}) = l(D_{\mathrm{train}}) - s\,\|\mathbf{w}\|_2^2 = \sum_{i=1}^{N} \log p\!\left(+ \mid d_i^{(+)}\right) + \sum_{i=1}^{N} \log p\!\left(- \mid d_i^{(-)}\right) - s \sum_{i=1}^{m} w_i^2$$
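To connect this penalty to the overfitting example on the previous slide, one can differentiate the regularized objective for the single-word case. This step is not on the slides, but it follows directly from the two formulas above:

$$\frac{\partial l_{\mathrm{reg}}(D_{\mathrm{train}})}{\partial w} = \frac{1}{1 + \exp(\mathbf{x} \cdot \mathbf{w} + c)} - 2 s w$$

Unlike before, this derivative is no longer positive for all w: the first term is at most 1, so once w exceeds roughly 1/(2s) the penalty dominates and the gradient turns negative. The maximizer is therefore finite.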

Page 14: Linear Model (III)

Why Do We Need a Sparse Solution?

• Two types of solutions:
  1. Many non-zero weights, but many of them are small
  2. Only a small number of non-zero weights, and many of them are large
• Occam's Razor: the simpler, the better
  • A simpler model that fits the data is unlikely to be a coincidence
  • A complicated model that fits the data might be a coincidence
  • Smaller number of non-zero weights → less evidence to consider → simpler model → case 2 is preferred

Page 15: Linear Model (III)

Occam’s Razor

[Figure: data points to be fit; x axis 0–5, y axis 0–2.5]

Page 16: Linear Model (III)

Occam’s Razor: Power = 1

$$y = a_1 x$$

[Figure: degree-1 fit to the data; x axis 0–5, y axis 0–2.5]

Page 17: Linear Model (III)

Occam’s Razor: Power = 3

$$y = a_1 x + a_2 x^2 + a_3 x^3$$

[Figure: degree-3 fit to the data; x axis 0–5, y axis 0–2.5]

Page 18: Linear Model (III)

Occam’s Razor: Power = 10

$$y = a_1 x + a_2 x^2 + \ldots + a_{10} x^{10}$$

[Figure: degree-10 fit to the data; x axis 0–5, y axis 0–2.5]
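The plots on these slides are not reproduced here, but the same effect can be sketched with synthetic data; the data, random seed, and test point below are assumptions, not the slides' example.

```python
# Illustrative sketch: fit the same points with a degree-1 and a degree-10
# polynomial. The low-degree fit follows the trend; the degree-10 fit chases
# every noisy point (overfitting).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 12)
y = 0.4 * x + rng.normal(scale=0.15, size=x.size)  # roughly linear data with noise

fit1 = np.polyfit(x, y, deg=1)     # Power = 1
fit10 = np.polyfit(x, y, deg=10)   # Power = 10 (may warn: poorly conditioned)

x_new = 4.75  # a point between training samples
print(np.polyval(fit1, x_new))     # close to the underlying trend
print(np.polyval(fit10, x_new))    # can swing far away from it
```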

Page 19: Linear Model (III)

Finding Optimal Solutions

• Concave objective function → no local maximum
• Many standard optimization algorithms work

$$l_{\mathrm{reg}}(D_{\mathrm{train}}) = l(D_{\mathrm{train}}) - s\,\|\mathbf{w}\|_2^2 = \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i) - s \sum_{i=1}^{m} w_i^2 = \sum_{i=1}^{N} \log \frac{1}{1 + \exp\left(-y_i\,(c + \mathbf{x}_i \cdot \mathbf{w})\right)} - s \sum_{i=1}^{m} w_i^2$$

Page 20: Linear Model (III)

Gradient Ascent

• Maximize the log-likelihood by iteratively adjusting the parameters in small increments
• In each iteration, we adjust w in the direction that increases the log-likelihood (toward the gradient)

$$\mathbf{w} \leftarrow \mathbf{w} + \varepsilon \frac{\partial}{\partial \mathbf{w}}\left[\sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i) - s \sum_{i=1}^{m} w_i^2\right] = \mathbf{w} + \varepsilon\left[\sum_{i=1}^{N} \mathbf{x}_i y_i \left(1 - p(y_i \mid \mathbf{x}_i)\right) - 2 s \mathbf{w}\right]$$

$$c \leftarrow c + \varepsilon \frac{\partial}{\partial c}\left[\sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i) - s \sum_{i=1}^{m} w_i^2\right] = c + \varepsilon \sum_{i=1}^{N} y_i \left(1 - p(y_i \mid \mathbf{x}_i)\right)$$

where $\varepsilon$ is the learning rate.

• The $(1 - p(y_i \mid \mathbf{x}_i))$ factor measures the prediction errors; the $-2s\mathbf{w}$ term prevents the weights from becoming too large.
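A minimal Python sketch of one update step implementing the two rules above; the function names, learning rate, and penalty value are illustrative assumptions, not the lecture's code.

```python
import math

def sigmoid_prob(y, x, w, c):
    """p(y | x; w) = 1 / (1 + exp(-y (w.x + c))) for y in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(-y * (sum(wj * xj for wj, xj in zip(w, x)) + c)))

def gradient_ascent_step(X, Y, w, c, eps=0.1, s=0.01):
    """One update of w and c toward the gradient of the regularized log-likelihood."""
    m = len(w)
    grad_w = [-2.0 * s * w[j] for j in range(m)]    # regularization: keeps weights small
    grad_c = 0.0
    for x, y in zip(X, Y):
        err = y * (1.0 - sigmoid_prob(y, x, w, c))  # prediction error term
        for j in range(m):
            grad_w[j] += x[j] * err
        grad_c += err
    w = [w[j] + eps * grad_w[j] for j in range(m)]
    c = c + eps * grad_c
    return w, c
```

Iterating `gradient_ascent_step` on the binary document vectors and ±1 labels from the earlier slides moves w and c toward the maximizer of the regularized log-likelihood.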

Page 21: Linear Model (III)

Graphical Illustration

No regularization case

Page 22: Linear Model (III)

[Figure: learning curves over iterations, comparing "Using regularization" with "Without regularization"]

Page 23: Linear Model (III)

When Should We Stop?

• The gradient ascent learning method converges when there is no incentive to move the parameters in any particular direction:

$$\frac{\partial}{\partial \mathbf{w}}\left[\sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i) - s \sum_{i=1}^{m} w_i^2\right] = \sum_{i=1}^{N} \mathbf{x}_i y_i \left(1 - p(y_i \mid \mathbf{x}_i)\right) - 2 s \mathbf{w} = 0$$

$$\frac{\partial}{\partial c}\left[\sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i) - s \sum_{i=1}^{m} w_i^2\right] = \sum_{i=1}^{N} y_i \left(1 - p(y_i \mid \mathbf{x}_i)\right) = 0$$

• In many cases this can be very tricky: the first-order derivative is small close to the maximum point.
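In practice the gradient never becomes exactly zero. A common stopping rule (an assumption here, not stated on the slides) is to iterate until the gradient norm falls below a small tolerance:

```python
import math

def converged(grad_w, grad_c, tol=1e-4):
    """True when the gradient of the regularized log-likelihood is close enough to zero."""
    norm = math.sqrt(sum(g * g for g in grad_w) + grad_c * grad_c)
    return norm < tol
```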