Classification: Logistic Regression from Data
(jbg/teaching/CSCI_5622/03b.pdf)

Transcript of Classification: Logistic Regression from Data

Page 1:

Classification: Logistic Regression from Data

Machine Learning: Jordan Boyd-Graber
University of Colorado Boulder
Lecture 3

Slides adapted from William Cohen


Pages 2–6:

Content Questions

Pages 7–9:

Administrivia Questions

Page 10:

Reminder: Logistic Regression

$$P(Y = 0 \mid X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \qquad (1)$$

$$P(Y = 1 \mid X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i X_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \qquad (2)$$

• Discriminative prediction: p(y | x)
• Classification uses: ad placement, spam detection
• What we didn't talk about is how to learn β from data
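To make equation (2) concrete, here is a minimal sketch in Python/NumPy (the name predict_prob is ours, not from the slides):

```python
import numpy as np

def predict_prob(beta0, beta, x):
    """P(Y = 1 | X) from equation (2): the logistic function of the linear score."""
    score = beta0 + np.dot(beta, x)
    return np.exp(score) / (1.0 + np.exp(score))

# With all-zero weights the score is 0, so the prediction is 0.5.
print(predict_prob(0.0, np.zeros(4), np.array([4.0, 3.0, 1.0, 0.0])))  # 0.5
```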


Page 11:

Logistic Regression: Objective Function

$$\mathcal{L} \equiv \ln p(Y \mid X, \beta) = \sum_j \ln p\left(y^{(j)} \mid x^{(j)}, \beta\right) \qquad (3)$$

$$= \sum_j \left[\, y^{(j)} \left(\beta_0 + \sum_i \beta_i x_i^{(j)}\right) - \ln\left(1 + \exp\left(\beta_0 + \sum_i \beta_i x_i^{(j)}\right)\right) \right] \qquad (4)$$
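A direct transcription of equation (4) as a sketch, assuming a NumPy feature matrix X (one row per example) and labels y in {0, 1}; log_likelihood is a hypothetical helper, not from the slides:

```python
import numpy as np

def log_likelihood(beta0, beta, X, y):
    """Equation (4): sum over examples of y_j * score_j - ln(1 + exp(score_j))."""
    scores = beta0 + X.dot(beta)          # one linear score per example
    return np.sum(y * scores - np.log1p(np.exp(scores)))

# At beta = 0 every example contributes -ln 2.
print(log_likelihood(0.0, np.zeros(2),
                     np.array([[1.0, 2.0], [3.0, 4.0]]),
                     np.array([1.0, 0.0])))
```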


Page 12:

Algorithm

1. Initialize the vector β to all zeros.

2. For t = 1, . . . , T:

◦ For each example (x⃗_i, y_i) and feature j:

• Compute π_i ≡ Pr(y_i = 1 | x⃗_i)
• Set β[j] = β[j]′ + λ(y_i − π_i) x_{i,j}

3. Output the parameters β_1, . . . , β_d.
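A minimal Python/NumPy sketch of this loop; the function name and the convention that column 0 of X holds an always-on bias feature are our own assumptions, not from the slides:

```python
import numpy as np

def sgd_logistic(X, y, n_epochs=1, lam=1.0):
    """Stochastic gradient ascent on the log-likelihood, steps 1-3 above.
    Column 0 of X is assumed to hold the always-on bias feature."""
    beta = np.zeros(X.shape[1])                        # 1. initialize to zeros
    for _ in range(n_epochs):                          # 2. for t = 1, ..., T
        for x_i, y_i in zip(X, y):
            pi = 1.0 / (1.0 + np.exp(-x_i.dot(beta)))  # pi_i = Pr(y_i = 1 | x_i)
            beta += lam * (y_i - pi) * x_i             # beta[j] += lam*(y_i - pi)*x_ij
    return beta                                        # 3. output the parameters
```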


Pages 13–30:

Example Documents

Update rule: β[j] = β[j]′ + λ(y_i − π_i) x_{i,j}

β⃗ = ⟨β_bias = 0, β_A = 0, β_B = 0, β_C = 0, β_D = 0⟩

y_1 = 1: A A A A B B B C
y_2 = 0: B C C C D D D D

(Assume step size λ = 1.0.)

You first see the positive example. First, compute π_1:

π_1 = Pr(y_1 = 1 | x⃗_1) = exp(βᵀx⃗_1) / (1 + exp(βᵀx⃗_1)) = exp(0) / (exp(0) + 1) = 0.5

What's the update for β_bias?
β_bias = β′_bias + λ · (y_1 − π_1) · x_{1,bias} = 0.0 + 1.0 · (1.0 − 0.5) · 1.0 = 0.5

What's the update for β_A?
β_A = β′_A + λ · (y_1 − π_1) · x_{1,A} = 0.0 + 1.0 · (1.0 − 0.5) · 4.0 = 2.0

What's the update for β_B?
β_B = β′_B + λ · (y_1 − π_1) · x_{1,B} = 0.0 + 1.0 · (1.0 − 0.5) · 3.0 = 1.5

What's the update for β_C?
β_C = β′_C + λ · (y_1 − π_1) · x_{1,C} = 0.0 + 1.0 · (1.0 − 0.5) · 1.0 = 0.5

What's the update for β_D?
β_D = β′_D + λ · (y_1 − π_1) · x_{1,D} = 0.0 + 1.0 · (1.0 − 0.5) · 0.0 = 0.0

Pages 31–48:

Example Documents (continued)

Update rule: β[j] = β[j]′ + λ(y_i − π_i) x_{i,j}

β⃗ = ⟨0.5, 2, 1.5, 0.5, 0⟩

y_1 = 1: A A A A B B B C
y_2 = 0: B C C C D D D D

(Assume step size λ = 1.0.)

Now you see the negative example. What's π_2?

π_2 = Pr(y_2 = 1 | x⃗_2) = exp(βᵀx⃗_2) / (1 + exp(βᵀx⃗_2)) = exp(0.5 + 1.5 + 1.5 + 0) / (exp(0.5 + 1.5 + 1.5 + 0) + 1) ≈ 0.97

What's the update for β_bias?
β_bias = β′_bias + λ · (y_2 − π_2) · x_{2,bias} = 0.5 + 1.0 · (0.0 − 0.97) · 1.0 = −0.47

What's the update for β_A?
β_A = β′_A + λ · (y_2 − π_2) · x_{2,A} = 2.0 + 1.0 · (0.0 − 0.97) · 0.0 = 2.0

What's the update for β_B?
β_B = β′_B + λ · (y_2 − π_2) · x_{2,B} = 1.5 + 1.0 · (0.0 − 0.97) · 1.0 = 0.53

What's the update for β_C?
β_C = β′_C + λ · (y_2 − π_2) · x_{2,C} = 0.5 + 1.0 · (0.0 − 0.97) · 3.0 = −2.41

What's the update for β_D?
β_D = β′_D + λ · (y_2 − π_2) · x_{2,D} = 0.0 + 1.0 · (0.0 − 0.97) · 4.0 = −3.88
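The same numbers fall out of the sgd_logistic sketch from the Algorithm slide when the two documents are encoded as count vectors ⟨bias, A, B, C, D⟩:

```python
import numpy as np

X = np.array([[1.0, 4.0, 3.0, 1.0, 0.0],   # y1 = 1: A A A A B B B C
              [1.0, 0.0, 1.0, 3.0, 4.0]])  # y2 = 0: B C C C D D D D
y = np.array([1.0, 0.0])
print(sgd_logistic(X, y, n_epochs=1, lam=1.0))
# -> approximately [-0.47, 2.0, 0.53, -2.41, -3.88], matching the hand computation
```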

Pages 49–53:

How does the gradient change with regularization?

• You can do your normal update.
• Then shrink the weight:

$$\beta_j = \beta_j' - \lambda \, 2\mu \, \beta_j' = \beta_j' \cdot (1 - 2\lambda\mu) \qquad (5)$$

• This step doesn't depend on X or Y. It just makes all your weights smaller.
• But it is difficult to update every feature every time (if there are many features).
• Following this up, we note that we can perform m_j successive "regularization" updates at once by letting β_j = β′_j · (1 − 2λµ)^{m_j}.

Basic Idea

Don't perform regularization updates for zero-valued x_j's; instead, simply keep track of how many such updates would need to be performed before updating β_j.
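A sketch of this lazy scheme under the slide's update rule (the names lazy_regularized_sgd and owed are ours): each feature keeps a counter of the shrink steps it owes, and the debt is settled only when the feature actually appears in an example:

```python
import numpy as np

def lazy_regularized_sgd(X, y, n_epochs=1, lam=1.0, mu=0.25):
    """SGD with L2 shrinkage applied lazily: features with x_ij = 0 are left
    untouched, and each feature tracks how many shrink steps it still owes."""
    beta = np.zeros(X.shape[1])
    owed = np.zeros(X.shape[1], dtype=int)    # m_j: pending shrinks per feature
    shrink = 1.0 - 2.0 * lam * mu             # one regularization update
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            owed += 1                         # each example owes every feature one shrink
            active = x_i != 0                 # only features appearing in this example
            pi = 1.0 / (1.0 + np.exp(-x_i.dot(beta)))  # pi from lazily stored weights
            beta[active] = ((beta[active] + lam * (y_i - pi) * x_i[active])
                            * shrink ** owed[active])
            beta[active] = beta[active]       # slide's rule: (beta' + grad) * shrink^m_j
            owed[active] = 0                  # these features are now caught up
    beta *= shrink ** owed                    # settle any remaining debt at output
    return beta
```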

Pages 54–70:

Example Documents (Regularized)

Update rule: β[j] = (β[j]′ + λ(y − π) x_{i,j}) · (1 − 2λµ)^{m_j}

β⃗ = ⟨0, 0, 0, 0, 0⟩

y_1 = 1: A A A A B B B C
y_2 = 0: B C C C D D D D

(Assume step size λ = 1.0 and µ = 1/4, so each regularization update multiplies a weight by (1 − 2 · 1.0 · 1/4) = 0.5.)

You first see the positive example. π_1 is still 0.5.

What's the update for β_bias?
β_bias = (β′_bias + λ · (y_1 − π_1) · x_{1,bias}) · (1 − 2λµ)^{m_bias} = (0.0 + 1.0 · (1.0 − 0.5) · 1.0) · 0.5¹ = 0.25

What's the update for β_A?
β_A = (0.0 + 1.0 · (1.0 − 0.5) · 4.0) · 0.5¹ = 1.0

What's the update for β_B?
β_B = (0.0 + 1.0 · (1.0 − 0.5) · 3.0) · 0.5¹ = 0.75

What's the update for β_C?
β_C = (0.0 + 1.0 · (1.0 − 0.5) · 1.0) · 0.5¹ = 0.25

What's the update for β_D?
We don't care (x_{1,D} = 0): leave it for later.

Pages 71–87:

Example Documents (Regularized, continued)

Update rule: β[j] = (β[j]′ + λ(y − π) x_{i,j}) · (1 − 2λµ)^{m_j}

β⃗ = ⟨0.25, 1, 0.75, 0.25, 0⟩

y_1 = 1: A A A A B B B C
y_2 = 0: B C C C D D D D

(Assume step size λ = 1.0 and µ = 1/4.)

Now you see the negative example. What's π_2?

π_2 = Pr(y_2 = 1 | x⃗_2) = exp(βᵀx⃗_2) / (1 + exp(βᵀx⃗_2)) = exp(0.25 + 0.75 + 0.75 + 0) / (exp(0.25 + 0.75 + 0.75 + 0) + 1) ≈ 0.85

What's the update for β_bias?
β_bias = (β′_bias + λ · (y_2 − π_2) · x_{2,bias}) · (1 − 2λµ)^{m_bias} = (0.25 + 1.0 · (0.0 − 0.85) · 1.0) · 0.5¹ = −0.30

What's the update for β_A?
We don't care (x_{2,A} = 0): leave it for later.

What's the update for β_B?
β_B = (0.75 + 1.0 · (0.0 − 0.85) · 1.0) · 0.5¹ = −0.05

What's the update for β_C?
β_C = (0.25 + 1.0 · (0.0 − 0.85) · 3.0) · 0.5¹ = −1.15

What's the update for β_D?
β_D = (0.0 + 1.0 · (0.0 − 0.85) · 4.0) · 0.5² = −0.85

(β_D owes two regularization updates, m_D = 2, because it was skipped on the positive example.)
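Running the lazy_regularized_sgd sketch from above on the same count vectors reproduces these values; β_A additionally receives its one owed shrink at output time, giving 0.5 rather than the intermediate 1.0:

```python
import numpy as np

X = np.array([[1.0, 4.0, 3.0, 1.0, 0.0],   # y1 = 1: A A A A B B B C
              [1.0, 0.0, 1.0, 3.0, 4.0]])  # y2 = 0: B C C C D D D D
y = np.array([1.0, 0.0])
print(lazy_regularized_sgd(X, y, n_epochs=1, lam=1.0, mu=0.25))
# -> approximately [-0.30, 0.5, -0.05, -1.15, -0.85]
```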

Page 88:

Next time . . .

• Multinomial logistic regression in sklearn (more than one option)
• Crafting effective features
• Preparation for the third homework