Discriminant Functions Generative Models Discriminative Models Linear Models for Classiﬁcation Bishop PRML Ch. 4 Alireza Ghane


### Transcript of Linear Models for Classiﬁcation -...

Discriminant Functions Generative Models Discriminative Models

Linear Models for Classification

Bishop PRML Ch. 4

Alireza Ghane


Classification: Hand-written Digit Recognition

$x_i$ = [Figure: image of a hand-written digit]

$t_i = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)$

• Each input vector classified into one of $K$ discrete classes
• Denote classes by $C_k$
• Represent input image as a vector $x_i \in \mathbb{R}^{784}$
• We have target vector $t_i \in \{0, 1\}^{10}$
• Given a training set $\{(x_1, t_1), \ldots, (x_N, t_N)\}$, learning problem is to construct a “good” function $y(x)$ from these
• $y : \mathbb{R}^{784} \to \mathbb{R}^{10}$
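The encoding above can be sketched in a few lines. This is only an illustration: the helper name `one_hot` and the 28 × 28 image size are assumptions (the slide states only that the flattened vector has 784 entries), and the image here is a blank placeholder rather than real data.

```python
import numpy as np

def one_hot(label, num_classes=10):
    """Return a 1-of-K target vector with a single 1 at index `label`."""
    t = np.zeros(num_classes, dtype=int)
    t[label] = 1
    return t

# Target for the digit "3", matching t_i on the slide.
t_i = one_hot(3)

# Assumed 28 x 28 grayscale image (placeholder zeros), flattened to R^784.
image = np.zeros((28, 28))
x_i = image.reshape(-1)
```

A classifier $y(x)$ then maps vectors shaped like `x_i` to 10 scores, one per class.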

Linear Models for Classification Alireza Ghane / Greg Mori 1


Generalized Linear Models

• Similar to previous chapter on linear models for regression, we will use a “linear” model for classification:

$$y(x) = f(w^T x + w_0)$$

• This is called a generalized linear model
• $f(\cdot)$ is a fixed non-linear function
• e.g.

$$f(u) = \begin{cases} 1 & \text{if } u \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

• Decision boundary between classes will be linear function of $x$
• Can also apply non-linearity to $x$, as in $\phi_i(x)$ for regression
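A minimal sketch of this model with the step function as $f$ follows. The weights, bias and test points are hypothetical values chosen for illustration, not taken from the slides.

```python
import numpy as np

def step(u):
    """Threshold activation f(u): 1 where u >= 0, otherwise 0."""
    return (np.asarray(u) >= 0).astype(int)

def glm_predict(X, w, w0):
    """Evaluate y(x) = f(w^T x + w0) for every row x of X."""
    return step(X @ w + w0)

# Hypothetical parameters: the decision boundary is the line x1 + x2 = 1.
w = np.array([1.0, 1.0])
w0 = -1.0

X = np.array([[0.0, 0.0],   # below the line
              [1.0, 1.0],   # above the line
              [0.5, 0.5]])  # exactly on the line, so u = 0 and f(u) = 1
labels = glm_predict(X, w, w0)
```

Although $f$ is non-linear, the set of points where the prediction changes is still the hyperplane $w^T x + w_0 = 0$.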


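Generalized Linear Model

[Figure: two plot panels, one with both axes spanning −1 to 1 and one with both axes spanning 0 to 1]

$$y(x) = f(w^T \phi(x) + w_0)$$

$$y\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = f\left(\begin{bmatrix} w_1 & w_2 \end{bmatrix} \begin{bmatrix} \phi_1(x_1, x_2) \\ \phi_2(x_1, x_2) \end{bmatrix} + w_0\right)$$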


Outline

Discriminant Functions

Generative Models

Discriminative Models


Discriminant Functions with Two Classes

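[Figure: geometry of a linear discriminant in two dimensions, with axes $x_1$ and $x_2$. The decision surface $y = 0$ is perpendicular to $w$ and separates region $R_1$ ($y > 0$) from region $R_2$ ($y < 0$). Labelled quantities: the offset of the surface from the origin, $-w_0 / \|w\|$; the signed distance of a point $x$ from the surface, $y(x) / \|w\|$; and $x_\perp$, the projection of $x$ onto the surface]

• Simple linear discriminant

$$y(x) = w^T x + w_0$$

apply threshold function to get classification

• Projection of $x$ in $w$ dir. is $\frac{w^T x}{\|w\|}$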
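The geometric quantities in the figure can be checked numerically. The sketch below uses a hypothetical weight vector and bias; `signed_distance` is an illustrative helper name, not something defined in the slides.

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed perpendicular distance y(x) / ||w|| of x from the surface y = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

# Hypothetical parameters with ||w|| = 5.
w = np.array([3.0, 4.0])
w0 = -5.0
x = np.array([3.0, 4.0])

r = signed_distance(x, w, w0)             # (9 + 16 - 5) / 5
x_perp = x - r * w / np.linalg.norm(w)    # foot of the perpendicular on y = 0
```

Moving from `x` a distance `r` against the direction of `w` lands on the decision surface, which is why $y(x_\perp) = 0$.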


Multiple Classes

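[Figure: two panels, each showing decision regions $R_1$, $R_2$, $R_3$ and an ambiguous region marked “?”. One panel uses boundaries $C_1$ / not $C_1$ and $C_2$ / not $C_2$; the other uses pairwise boundaries between $C_1$ and $C_2$, $C_1$ and $C_3$, and $C_2$ and $C_3$]

• A linear discriminant between two classes separates with a hyperplane
• How to use this for multiple classes?
• One-versus-the-rest method: build $K - 1$ classifiers, between $C_k$ and all others
• One-versus-one method: build $K(K - 1)/2$ classifiers, between all pairs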


Multiple Classes

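[Figure: decision regions $R_i$, $R_j$, $R_k$ with two points $x_A$ and $x_B$]

• A solution is to build $K$ linear functions:

$$y_k(x) = w_k^T x + w_{k0}$$

assign $x$ to class $\arg\max_k y_k(x)$

• Gives connected, convex decision regions: for $x_A, x_B \in R_k$ and $0 \le \lambda \le 1$,

$$\hat{x} = \lambda x_A + (1 - \lambda) x_B$$

$$y_k(\hat{x}) = \lambda y_k(x_A) + (1 - \lambda) y_k(x_B)$$

$$\Rightarrow y_k(\hat{x}) > y_j(\hat{x}), \quad \forall j \neq k$$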
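The argmax rule and the convexity argument can be illustrated with a small sketch. The three weight vectors and the two points are hypothetical; the code only checks that every point on the segment between two points of one region receives the same label.

```python
import numpy as np

# Hypothetical parameters for K = 3 classes in 2-D: row k holds w_k.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
w0 = np.zeros(3)

def classify(x):
    """Assign x to argmax_k y_k(x), where y_k(x) = w_k^T x + w_k0."""
    return int(np.argmax(W @ x + w0))

x_a = np.array([2.0, 0.5])
x_b = np.array([3.0, -1.0])

# Label every point x_hat = lam * x_a + (1 - lam) * x_b on the segment.
segment_labels = [classify(lam * x_a + (1 - lam) * x_b)
                  for lam in np.linspace(0.0, 1.0, 11)]
```

Because each $y_k$ is linear, the winning score at both endpoints stays the winning score along the whole segment.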


Least Squares for Classification

• How do we learn the decision boundaries $(w_k, w_{k0})$?
• One approach is to use least squares, similar to regression
• Find $W$ to minimize squared error over all examples and all components of the label vector:

$$E(W) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} (y_k(x_n) - t_{nk})^2$$

• Some algebra, we get a solution using the pseudo-inverse as in regression
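A minimal sketch of the pseudo-inverse solution, assuming the usual convention of prepending a constant 1 to each input so that the bias $w_{k0}$ is absorbed into the weight matrix. The function names and the toy data are illustrative assumptions.

```python
import numpy as np

def add_bias(X):
    """Prepend a column of ones so the bias becomes part of the weights."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_least_squares(X, T):
    """Return the weights minimizing the sum-of-squares error E(W)."""
    return np.linalg.pinv(add_bias(X)) @ T

def predict(W_tilde, X):
    """Assign each row of X to the class with the largest output y_k(x)."""
    return np.argmax(add_bias(X) @ W_tilde, axis=1)

# Hypothetical 1-D data: two separated clusters with one-hot targets.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

W_tilde = fit_least_squares(X, T)
pred = predict(W_tilde, X)
```

`np.linalg.pinv` is used instead of an explicit inverse so the sketch also runs when the design matrix is rank deficient.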


Problems with Least Squares

[Figure: two panels, each with a horizontal axis from −4 to 8 and a vertical axis from −8 to 4, showing decision boundaries before and after extra points are added]

• Looks okay... least squares decision boundary
• Similar to logistic regression decision boundary (more later)
• Gets worse by adding easy points?!
• Why?
• If target value is 1, points far from boundary will have high value, say 10; this is a large error so the boundary is moved
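The effect can be reproduced in one dimension. The sketch below is an illustration under stated assumptions: the data are hypothetical, a single 0/1 target is regressed with `np.polyfit`, and the boundary is taken where the fitted line crosses 0.5.

```python
import numpy as np

def ls_boundary(x, t):
    """Fit t ~ a + b*x by least squares; return the x where the fit equals 0.5."""
    b, a = np.polyfit(x, t, 1)   # polyfit returns the highest-degree coefficient first
    return (0.5 - a) / b

# Hypothetical data: target 0 on the left, target 1 on the right.
x = np.array([-2.0, -1.0, 1.0, 2.0])
t = np.array([0.0, 0.0, 1.0, 1.0])
before = ls_boundary(x, t)

# Add three "easy" target-1 points far on the correct side of the boundary.
x_more = np.concatenate([x, [10.0, 11.0, 12.0]])
t_more = np.concatenate([t, [1.0, 1.0, 1.0]])
after = ls_boundary(x_more, t_more)
```

With these numbers the boundary starts at 0, midway between the classes, and shifts to about 0.93, almost touching the nearest target-1 point at $x = 1$, even though the added points were never in danger of being misclassified.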


More Least Squares Problems

[Figure: two panels, each with both axes spanning −6 to 6]

• We’ll address these problems later with better models
• First, a look at a different criterion for linear discriminant


Fisher’s Linear Discriminant

• The two-class linear discriminant acts as a projection:

$$y = w^T x$$

followed by a threshold, $y \ge -w_0$

• In which direction $w$ should we project?
• One which separates classes “well”


Fisher’s Linear Discriminant

[Figure: two panels, each with a horizontal axis from −2 to 6 and a vertical axis from −2 to 4]

• A natural idea would be to project in the direction of the line connecting class means
• However, problematic if classes have variance in this direction
• Fisher criterion: maximize ratio of inter-class separation (between) to intra-class variance (inside)


Math time - FLD

• Projection $y_n = w^T x_n$
• Inter-class separation is distance between class means (good):

$$m_k = \frac{1}{N_k} \sum_{n \in C_k} w^T x_n$$

• Intra-class variance of the projected data:

$$s_k^2 = \sum_{n \in C_k} (y_n - m_k)^2$$

• Fisher criterion:

$$J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$$

maximize wrt $w$


Math time - FLD

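$$J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \frac{w^T S_B w}{w^T S_W w}$$

Between-class covariance:

$$S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$$

Within-class covariance:

$$S_W = \sum_{n \in C_1} (x_n - \mathbf{m}_1)(x_n - \mathbf{m}_1)^T + \sum_{n \in C_2} (x_n - \mathbf{m}_2)(x_n - \mathbf{m}_2)^T$$

Here $\mathbf{m}_k = \frac{1}{N_k} \sum_{n \in C_k} x_n$ is the mean vector of class $C_k$ in input space, so the projected mean is $m_k = w^T \mathbf{m}_k$.

Lots of math:

$$w \propto S_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$$

If covariance $S_W$ is isotropic, reduces to class mean difference vector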
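The closed-form direction can be sketched and compared against the naive mean-difference direction. The data below are synthetic (two Gaussian clouds with a shared, strongly correlated covariance chosen for illustration), and the function names are assumptions.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Unit vector proportional to S_W^{-1} (m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)   # avoids forming the inverse explicitly
    return w / np.linalg.norm(w)

def fisher_criterion(w, X1, X2):
    """J(w) = (m2 - m1)^2 / (s1^2 + s2^2) computed on the projected data."""
    y1, y2 = X1 @ w, X2 @ w
    within = ((y1 - y1.mean()) ** 2).sum() + ((y2 - y2.mean()) ** 2).sum()
    return (y2.mean() - y1.mean()) ** 2 / within

# Synthetic data: both classes share a correlated covariance.
rng = np.random.default_rng(0)
cov = np.array([[4.0, 3.5], [3.5, 4.0]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([2.0, 0.0], cov, size=200)

w_fld = fisher_direction(X1, X2)
diff = X2.mean(axis=0) - X1.mean(axis=0)
w_means = diff / np.linalg.norm(diff)   # naive direction joining the class means

j_fld = fisher_criterion(w_fld, X1, X2)
j_means = fisher_criterion(w_means, X1, X2)
```

Since the Fisher direction maximizes $J(w)$ for the sample statistics, `j_fld` can never be smaller than `j_means`; the gap grows as the within-class covariance becomes less isotropic.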


FLD Summary

• FLD is a dimensionality reduction technique (more later in the course)
• Criterion for choosing projection based on class labels
• Still suffers from outliers (e.g. earlier least squares example)


Perceptrons

• “Perceptron” is used to refer to many neural network structures (more in Chapter 5)
• The classic type is a fixed non-linear transformation of input, one layer of adaptive weights, and a threshold:

$$y(x) = f(w^T \phi(x))$$

• Developed by Rosenblatt in the 50s
• The main difference compared to the methods we’ve seen so far is the learning algorithm


Perceptron Learning

• Two class problem
• For ease of notation, we will use $t = 1$ for class $C_1$ and $t = -1$ for class $C_2$
• We saw that squared error was problematic
• Instead, we’d like to minimize the number of misclassified examples
• An example is mis-classified if $w^T \phi(x_n) t_n < 0$
• Perceptron criterion:

$$E_P(w) = -\sum_{n \in \mathcal{M}} w^T \phi(x_n) t_n$$

sum over mis-classified examples only


Perceptron Learning Algorithm

• Minimize the error function using stochastic gradient descent (gradient descent per example):

$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_P(w) = w^{(\tau)} + \underbrace{\eta \phi(x_n) t_n}_{\text{if incorrect}}$$

• Iterate over all training examples, only change $w$ if the example is mis-classified
• Guaranteed to converge if data are linearly separable
• Will not converge if not
• May take many iterations
• Sensitive to initialization
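The update rule can be sketched as a short training loop. Several details are assumptions made for the illustration: weights start at zero, an example with $w^T \phi(x_n) t_n = 0$ is also treated as a mistake (otherwise the zero vector would never be updated), a constant feature $\phi_0 = 1$ carries the bias, and `max_epochs` caps the loop because the algorithm does not terminate on non-separable data.

```python
import numpy as np

def train_perceptron(Phi, t, eta=1.0, max_epochs=100):
    """Cycle through the examples, updating w only on misclassified ones."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_n, t_n in zip(Phi, t):
            if (w @ phi_n) * t_n <= 0:        # wrong side of, or on, the boundary
                w = w + eta * phi_n * t_n     # w <- w + eta * phi(x_n) * t_n
                mistakes += 1
        if mistakes == 0:                     # a full pass with no updates: done
            break
    return w

# Hypothetical linearly separable data; the first column is the bias feature.
Phi = np.array([[1.0, 2.0, 1.0],
                [1.0, 1.0, 3.0],
                [1.0, -1.0, -2.0],
                [1.0, -2.0, -1.0]])
t = np.array([1, 1, -1, -1])

w = train_perceptron(Phi, t)
```

A different visiting order or starting point would generally produce a different separating hyperplane, which is the sensitivity to initialization noted above.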


Perceptron Learning Illustration

[Figure: four panels, each with both axes spanning −1 to 1]

• Note there are many hyperplanes with 0 error
• Support vector machines (in a few weeks) have a nice way of choosing one


Limitations of Perceptrons

• Perceptrons can only solve linearly separable problems in feature space
• Same as the other models in this chapter
• Canonical example of non-separable problem is X-OR
• Real datasets can look like this too

Expressiveness of perceptrons

Consider a perceptron with $g$ = step function (Rosenblatt, 1957, 1960)

Can represent AND, OR, NOT, majority, etc.

Represents a linear separator in input space:

$$\sum_j W_j x_j > 0 \quad \text{or} \quad W \cdot x > 0$$

[Figure: three panels over binary inputs $I_1$ and $I_2$: (a) $I_1$ and $I_2$, (b) $I_1$ or $I_2$, (c) $I_1$ xor $I_2$, the last marked “?”]
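A short argument shows why no linear separator exists for X-OR. The notation is an assumption made for this sketch: weights $w_1, w_2$, an explicit bias $w_0$, and output 1 exactly when $w_1 I_1 + w_2 I_2 + w_0 > 0$. The four input patterns would require

$$\begin{aligned}
(0, 1) \mapsto 1 &: \quad w_2 + w_0 > 0 \\
(1, 0) \mapsto 1 &: \quad w_1 + w_0 > 0 \\
(0, 0) \mapsto 0 &: \quad w_0 \le 0 \\
(1, 1) \mapsto 0 &: \quad w_1 + w_2 + w_0 \le 0
\end{aligned}$$

Adding the first two inequalities gives $w_1 + w_2 + 2 w_0 > 0$, while adding the last two gives $w_1 + w_2 + 2 w_0 \le 0$, a contradiction. The same data become separable once a non-linear feature is added, for example $\phi(I_1, I_2) = (I_1, I_2, I_1 I_2)$, which is why the limitation is stated in terms of feature space.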


Probabilistic Generative Models

• Up to now we’ve looked at learning classification by choosing parameters to minimize an error function
• We’ll now develop a probabilistic approach
• With 2 classes, $C_1$ and $C_2$:

$$p(C_1|x) = \frac{p(x|C_1)p(C_1)}{p(x)} \quad \text{(Bayes’ rule)}$$

$$p(C_1|x) = \frac{p(x|C_1)p(C_1)}{p(x, C_1) + p(x, C_2)} \quad \text{(sum rule)}$$

$$p(C_1|x) = \frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1) + p(x|C_2)p(C_2)} \quad \text{(product rule)}$$

• In generative models we specify the distribution $p(x|C_k)$ which generates the data for each class


Probabilistic Generative Models - Example

• Let’s say we observe $x$ which is the current temperature
• Determine if we are in Vancouver ($C_1$) or Honolulu ($C_2$)
• Generative model:

$$p(C_1|x) = \frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1) + p(x|C_2)p(C_2)}$$

• $p(x|C_1)$ is distribution over typical temperatures in Vancouver
• e.g. $p(x|C_1) = \mathcal{N}(x; 10, 5)$
• $p(x|C_2)$ is distribution over typical temperatures in Honolulu
• e.g. $p(x|C_2) = \mathcal{N}(x; 25, 5)$
• Class priors $p(C_1) = 0.1$, $p(C_2) = 0.9$
• $p(C_1|x = 15) = \frac{0.0484 \cdot 0.1}{0.0484 \cdot 0.1 + 0.0108 \cdot 0.9} \approx 0.33$
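The arithmetic in this example can be reproduced directly. The sketch assumes that the second parameter of $\mathcal{N}(x; 10, 5)$ is a standard deviation rather than a variance, since that reading yields the likelihood values 0.0484 and 0.0108 quoted on the slide; the variable names are illustrative.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

x = 15.0
prior_vancouver, prior_honolulu = 0.1, 0.9

lik_vancouver = gaussian_pdf(x, 10.0, 5.0)   # p(x | C1)
lik_honolulu = gaussian_pdf(x, 25.0, 5.0)    # p(x | C2)

# Bayes' rule with the evidence expanded by the sum and product rules.
evidence = lik_vancouver * prior_vancouver + lik_honolulu * prior_honolulu
posterior_vancouver = lik_vancouver * prior_vancouver / evidence
```

Even though 15 degrees is much more typical of Vancouver under these densities, the strong prior toward Honolulu keeps the posterior for Vancouver near one third.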


Generalized Linear Models

• We can write the classifier in another form

$$p(C_1|x) = \frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1) + p(x|C_2)p(C_2)} = \frac{1}{1 + \exp(-a)} \equiv \sigma(a)$$

where $a = \ln \frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}$

• This looks like gratuitous math, but if $a$ takes a simple form this is another generalized linear model which we have been studying
• Of course, we will see how such a simple form $a = w^T x + w_0$ arises naturally
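The step from the posterior to the sigmoid is a single division. Dividing the numerator and denominator by $p(x|C_1)p(C_1)$ and using the definition of $a$ gives

$$\begin{aligned}
p(C_1|x) &= \frac{1}{1 + \dfrac{p(x|C_2)p(C_2)}{p(x|C_1)p(C_1)}} \\
&= \frac{1}{1 + \exp\left(-\ln \dfrac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}\right)} \\
&= \frac{1}{1 + \exp(-a)}
\end{aligned}$$

so $a$ is the log of the ratio of the two joint probabilities $p(x, C_1)$ and $p(x, C_2)$.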


Logistic Sigmoid

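[Figure: plot of the logistic sigmoid $\sigma(a)$ for $a$ from −5 to 5, with the vertical axis from 0 to 1]

• The function $\sigma(a) = \frac{1}{1 + \exp(-a)}$ is known as the logistic sigmoid
• It squashes the real axis down to the interval $(0, 1)$
• It is continuous and differentiable
• It avoids the problems encountered with the too correct least-squares error fitting (later)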
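A minimal sketch of the function follows. The branch on the sign of $a$ is an implementation choice made here, not something from the slides: it keeps the argument of `exp` non-positive so that large negative inputs do not overflow.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a)), evaluated without overflow."""
    a = np.asarray(a, dtype=float)
    e = np.exp(-np.abs(a))                  # always in (0, 1]
    # For a >= 0 use 1 / (1 + exp(-a)); for a < 0 use exp(a) / (1 + exp(a)).
    return np.where(a >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

vals = np.linspace(-5.0, 5.0, 11)
outputs = sigmoid(vals)
symmetry = sigmoid(vals) + sigmoid(-vals)   # sigma(a) + sigma(-a) = 1
```

The two branches are algebraically identical; multiplying the numerator and denominator of the first by $\exp(a)$ gives the second.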


Multi-class Extension

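• There is a generalization of the logistic sigmoid to $K > 2$ classes:

$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\,p(C_k)}{\sum_j p(\mathbf{x}|C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$

where $a_k = \ln\left[p(\mathbf{x}|C_k)\,p(C_k)\right]$

• a.k.a. the softmax function

• If $a_k \gg a_j$ for all $j \neq k$, then $p(C_k|\mathbf{x})$ goes to 1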
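The softmax can be sketched in a few lines. Subtracting the largest activation before exponentiating is a common numerical safeguard assumed here; it leaves the ratios unchanged and is not part of the slide.

```python
import math

def softmax(a):
    """Return exp(a_k) / sum_j exp(a_j) for a list of activations a."""
    shift = max(a)  # subtracting the max does not change the result but avoids overflow
    exps = [math.exp(ak - shift) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]
```

With two classes, `softmax([a, 0.0])[0]` reduces to the logistic sigmoid of `a`, and when one activation dominates the rest its probability approaches 1.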


Gaussian Class-Conditional Densities

• Back to that $a$ in the logistic sigmoid for 2 classes

• Let’s assume the class-conditional densities $p(\mathbf{x}|C_k)$ are Gaussians, and have the same covariance matrix $\Sigma$:

$$p(\mathbf{x}|C_k) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}_k)\right\}$$

• $a$ takes a simple form:

$$a = \ln \frac{p(\mathbf{x}|C_1)\,p(C_1)}{p(\mathbf{x}|C_2)\,p(C_2)} = \mathbf{w}^T\mathbf{x} + w_0$$

• Note that the quadratic terms $\mathbf{x}^T\Sigma^{-1}\mathbf{x}$ cancel
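For reference, the cancellation can be written out. The slide does not state $\mathbf{w}$ and $w_0$ explicitly; the expressions below are the standard result (Bishop PRML eqs. 4.66 and 4.67), obtained by expanding the log of each Gaussian and subtracting.

```latex
\ln p(\mathbf{x}|C_k) = -\tfrac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x}
    + \boldsymbol{\mu}_k^T\Sigma^{-1}\mathbf{x}
    - \tfrac{1}{2}\boldsymbol{\mu}_k^T\Sigma^{-1}\boldsymbol{\mu}_k + \text{const}

a = \ln p(\mathbf{x}|C_1) - \ln p(\mathbf{x}|C_2) + \ln\frac{p(C_1)}{p(C_2)}
  = \underbrace{(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\Sigma^{-1}}_{\mathbf{w}^T}\,\mathbf{x}
    \underbrace{- \tfrac{1}{2}\boldsymbol{\mu}_1^T\Sigma^{-1}\boldsymbol{\mu}_1
    + \tfrac{1}{2}\boldsymbol{\mu}_2^T\Sigma^{-1}\boldsymbol{\mu}_2
    + \ln\frac{p(C_1)}{p(C_2)}}_{w_0}
```

The $-\tfrac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x}$ term and the normalization constant are identical for both classes because $\Sigma$ is shared, which is why they drop out of the difference.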


Maximum Likelihood Learning

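• We can fit the parameters to this model using maximum likelihood
  • Parameters are $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, $\Sigma^{-1}$, $p(C_1) \equiv \pi$, $p(C_2) \equiv 1-\pi$
  • Refer to them collectively as $\theta$

• For a datapoint $\mathbf{x}_n$ from class $C_1$ ($t_n = 1$):

$$p(\mathbf{x}_n, C_1) = p(C_1)\,p(\mathbf{x}_n|C_1) = \pi\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \Sigma)$$

• For a datapoint $\mathbf{x}_n$ from class $C_2$ ($t_n = 0$):

$$p(\mathbf{x}_n, C_2) = p(C_2)\,p(\mathbf{x}_n|C_2) = (1-\pi)\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2, \Sigma)$$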


Maximum Likelihood Learning

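• The likelihood of the training data is:

$$p(\mathbf{t}|\pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma) = \prod_{n=1}^{N} \left[\pi\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1,\Sigma)\right]^{t_n} \left[(1-\pi)\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2,\Sigma)\right]^{1-t_n}$$

• As usual, ln is our friend:

$$\ell(\mathbf{t};\theta) = \sum_{n=1}^{N} \Big[ \underbrace{t_n \ln \pi + (1-t_n)\ln(1-\pi)}_{\pi} + \underbrace{t_n \ln \mathcal{N}_1 + (1-t_n)\ln \mathcal{N}_2}_{\boldsymbol{\mu}_1,\,\boldsymbol{\mu}_2,\,\Sigma} \Big]$$

where $\mathcal{N}_k$ is shorthand for $\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\Sigma)$

• Maximize with respect to each group of parameters separately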


Maximum Likelihood Learning - Class Priors

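• Maximization with respect to the class priors parameter $\pi$ is straightforward:

$$\frac{\partial}{\partial \pi}\ell(\mathbf{t};\theta) = \sum_{n=1}^{N} \left[\frac{t_n}{\pi} - \frac{1-t_n}{1-\pi}\right] = 0 \;\Rightarrow\; \pi = \frac{N_1}{N_1+N_2}$$

• $N_1$ and $N_2$ are the numbers of training points in classes $C_1$ and $C_2$
• The prior is simply the fraction of points in each class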
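The intermediate algebra, written out as a short worked step. It uses only $\sum_n t_n = N_1$ and $\sum_n (1 - t_n) = N_2$, which follow from the 0/1 coding of the labels.

```latex
\sum_{n=1}^{N}\frac{t_n}{\pi} = \sum_{n=1}^{N}\frac{1-t_n}{1-\pi}
\;\Longrightarrow\;
\frac{N_1}{\pi} = \frac{N_2}{1-\pi}
\;\Longrightarrow\;
N_1(1-\pi) = N_2\,\pi
\;\Longrightarrow\;
\pi = \frac{N_1}{N_1+N_2}
```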


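Maximum Likelihood Learning - Gaussian Parameters

• The other parameters can also be found in the same fashion

• Class means:

$$\boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{n=1}^{N} t_n \mathbf{x}_n \qquad \boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{n=1}^{N} (1-t_n)\,\mathbf{x}_n$$

  • Means of training examples from each class

• Shared covariance matrix:

$$\Sigma = \frac{N_1}{N}\,\frac{1}{N_1}\sum_{n\in C_1}(\mathbf{x}_n-\boldsymbol{\mu}_1)(\mathbf{x}_n-\boldsymbol{\mu}_1)^T + \frac{N_2}{N}\,\frac{1}{N_2}\sum_{n\in C_2}(\mathbf{x}_n-\boldsymbol{\mu}_2)(\mathbf{x}_n-\boldsymbol{\mu}_2)^T$$

  • Weighted average of class covariances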
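The maximum likelihood estimates above can be sketched with NumPy as follows. The function names are hypothetical. The conversion to $\mathbf{w}$ and $w_0$ is not given on the slides; it uses the standard shared-covariance result, $\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ with the matching bias (Bishop PRML eqs. 4.66 and 4.67).

```python
import numpy as np

def fit_shared_covariance_gaussians(X, t):
    """ML estimates for two Gaussian classes sharing one covariance matrix.

    X: (N, D) array of inputs; t: (N,) array of labels, 1 for C1 and 0 for C2.
    """
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    prior1 = N1 / N                                  # pi = N1 / (N1 + N2)
    mu1 = (t[:, None] * X).sum(axis=0) / N1          # mean of class C1
    mu2 = ((1.0 - t)[:, None] * X).sum(axis=0) / N2  # mean of class C2
    d1 = X[t == 1] - mu1
    d2 = X[t == 0] - mu2
    # (N1/N) S1 + (N2/N) S2 simplifies to the pooled scatter divided by N.
    sigma = (d1.T @ d1 + d2.T @ d2) / N
    return prior1, mu1, mu2, sigma

def linear_parameters(prior1, mu1, mu2, sigma):
    """w and w0 such that p(C1|x) = sigmoid(w @ x + w0)."""
    sigma_inv = np.linalg.inv(sigma)
    w = sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ sigma_inv @ mu1
          + 0.5 * mu2 @ sigma_inv @ mu2
          + np.log(prior1 / (1.0 - prior1)))
    return w, w0
```

Every quantity is a closed-form average over the training set, so no iterative optimization is needed for the generative model. The sketch assumes both classes are non-empty and that the pooled covariance is invertible.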


Gaussian with Different Covariances

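[Figure: two-dimensional plot; horizontal axis from −2 to 2, vertical axis from −2.5 to 2.5]

$$a = \ln \frac{p(\mathbf{x}|C_b)\,p(C_b)}{p(\mathbf{x}|C_r)\,p(C_r)}$$

$$a = \underbrace{\ln p(\mathbf{x}|C_b) - \ln p(\mathbf{x}|C_r)}_{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_b)^T\Sigma_b^{-1}(\mathbf{x}-\boldsymbol{\mu}_b) \,+\, \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_r)^T\Sigma_r^{-1}(\mathbf{x}-\boldsymbol{\mu}_r) \,+\, \text{const.}} + \underbrace{\ln p(C_b) - \ln p(C_r)}_{\text{const.}}$$

• With $\Sigma_b \neq \Sigma_r$ the quadratic terms in $\mathbf{x}$ no longer cancel, so $a$ is quadratic rather than linear in $\mathbf{x}$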


Probabilistic Generative Models Summary

• Fitting Gaussians using the ML criterion is sensitive to outliers

• A simple linear form for $a$ in the logistic sigmoid occurs for more than just Gaussian distributions
  • It arises for any distribution in the exponential family, a large class of distributions


Outline

Discriminant Functions

Generative Models

Discriminative Models


Probabilistic Discriminative Models

• Generative model made assumptions about the form of the class-conditional distributions (e.g. Gaussian)
  • Resulted in a logistic sigmoid of a linear function of $\mathbf{x}$

• Discriminative model - explicitly use the functional form

$$p(C_1|\mathbf{x}) = \frac{1}{1+\exp\left(-(\mathbf{w}^T\mathbf{x} + w_0)\right)}$$

and find $\mathbf{w}$ directly

• For the generative model we had $2M + M(M+1)/2 + 1$ parameters
  • $M$ is the dimensionality of $\mathbf{x}$

• Discriminative model will have $M + 1$ parameters
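To get a feel for how quickly the gap grows with dimensionality, the two counts can be tabulated. This is a small sketch and the function names are made up for the example.

```python
def generative_param_count(M: int) -> int:
    """Two means (2M), one shared symmetric covariance (M(M+1)/2), one prior (1)."""
    return 2 * M + M * (M + 1) // 2 + 1

def discriminative_param_count(M: int) -> int:
    """Weight vector w (M entries) plus the bias w0 (1)."""
    return M + 1

# Compare the counts for a few input dimensionalities.
for M in (2, 10, 100):
    print(M, generative_param_count(M), discriminative_param_count(M))
```

The generative count grows quadratically in $M$ because of the covariance matrix, while the discriminative count grows linearly.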


Generative vs. Discriminative

• Generative models
  • Can generate synthetic example data
    • Perhaps accurate classification is equivalent to accurate synthesis
    • e.g. vision and graphics
  • Tend to have more parameters
  • Require good model of class distributions

• Discriminative models
  • Only usable for classification
  • Don’t solve a harder problem than you need to
  • Tend to have fewer parameters
  • Require good model of decision boundary


Maximum Likelihood Learning - Discriminative Model

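• As usual we can use the maximum likelihood criterion for learning

$$p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}\{1-y_n\}^{1-t_n}; \quad \text{where } y_n = p(C_1|\mathbf{x}_n)$$

• Taking ln and derivative gives:

$$\nabla \ell(\mathbf{w}) = \sum_{n=1}^{N}(t_n - y_n)\,\mathbf{x}_n$$

• This time there is no closed-form solution, since $y_n = \sigma(\mathbf{w}^T\mathbf{x}_n)$ depends nonlinearly on $\mathbf{w}$

• Could use (stochastic) gradient descent
  • But there’s a better iterative technique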
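A minimal sketch of the gradient-based option, written as gradient ascent on the log-likelihood (equivalently, descent on its negative). The step size, iteration count, and function names are arbitrary assumptions for illustration. The bias $w_0$ is handled by appending a column of ones to `X`.

```python
import numpy as np

def sigmoid(a):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood_gradient(w, X, t):
    """Gradient sum_n (t_n - y_n) x_n, with y_n = sigmoid(w^T x_n)."""
    y = sigmoid(X @ w)
    return X.T @ (t - y)

def fit_gradient_ascent(X, t, step=0.1, iters=2000):
    """Maximize ln p(t|w) with fixed-step gradient ascent.

    step and iters are illustrative defaults, not tuned values.
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        # Divide by N so the step size does not depend on the dataset size.
        w = w + step * log_likelihood_gradient(w, X, t) / len(t)
    return w
```

At the maximum likelihood solution the gradient vanishes, so the norm of `log_likelihood_gradient` is a convenient convergence check. For linearly separable data the weights grow without bound, so this sketch is only meaningful on overlapping classes.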


Iterative Reweighted Least Squares

• Iterative reweighted least squares (IRLS) is a descent method
  • As in gradient descent, start with an initial guess, improve it
  • Gradient descent - take a step (how large?) along the (negative) gradient direction

• IRLS is a special case of a Newton-Raphson method
  • Approximate the function using a second-order Taylor expansion around $\mathbf{w}$:

$$\hat{f}(\mathbf{w}+\mathbf{v}) = f(\mathbf{w}) + \nabla f(\mathbf{w})^T\mathbf{v} + \frac{1}{2}\mathbf{v}^T\nabla^2 f(\mathbf{w})\,\mathbf{v}$$

  • Closed-form solution to minimize this is straightforward: quadratic, derivatives linear

• In IRLS this second-order Taylor expansion ends up being a weighted least-squares problem, as in the regression case from last week
  • Hence the name IRLS
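A sketch of the Newton-Raphson update for logistic regression follows, under stated assumptions. It uses $f(\mathbf{w}) = -\ln p(\mathbf{t}|\mathbf{w})$, whose gradient is $X^T(\mathbf{y} - \mathbf{t})$ and whose Hessian is $X^T R X$ with $R = \operatorname{diag}(y_n(1 - y_n))$, as in Bishop PRML Sec. 4.3.3. The small `ridge` term and the fixed iteration count are additions made here for numerical safety and are not part of the textbook update.

```python
import numpy as np

def irls_logistic(X, t, iters=20, ridge=1e-8):
    """Newton-Raphson (IRLS) updates for logistic regression.

    X: (N, D) design matrix (include a column of ones for the bias w0).
    t: (N,) labels in {0, 1}.
    ridge: small diagonal term, an assumption added here to keep the
           Hessian invertible; it is not part of the textbook update.
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        y = 1.0 / (1.0 + np.exp(-(X @ w)))
        r = y * (1.0 - y)                        # per-point weights, recomputed each pass
        grad = X.T @ (y - t)                     # gradient of -ln p(t|w)
        hess = X.T @ (X * r[:, None]) + ridge * np.eye(X.shape[1])
        w = w - np.linalg.solve(hess, grad)      # Newton step
    return w
```

The weights `r` change at every iteration because they depend on the current `w`, which is where "reweighted" comes from. No step size has to be chosen, in contrast to gradient descent. Like plain Newton-Raphson, this sketch has no line search, so it is only a starting point for well-behaved, non-separable data.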


Newton-Raphson

[Figure 9.16 from Boyd and Vandenberghe: The function $f$ (shown solid) and its second-order approximation $\hat{f}$ at $x$ (dashed), with the points $(x, f(x))$ and $(x + \Delta x_{\mathrm{nt}}, f(x + \Delta x_{\mathrm{nt}}))$ marked. The Newton step $\Delta x_{\mathrm{nt}}$ is what must be added to $x$ to give the minimizer of $\hat{f}$.]

9.5 Newton’s method

9.5.1 The Newton step

For $x \in \operatorname{dom} f$, the vector

$$\Delta x_{\mathrm{nt}} = -\nabla^2 f(x)^{-1}\,\nabla f(x)$$

is called the Newton step (for $f$, at $x$). Positive definiteness of $\nabla^2 f(x)$ implies that

$$\nabla f(x)^T \Delta x_{\mathrm{nt}} = -\nabla f(x)^T\,\nabla^2 f(x)^{-1}\,\nabla f(x) < 0$$

unless $\nabla f(x) = 0$, so the Newton step is a descent direction (unless $x$ is optimal). The Newton step can be interpreted and motivated in several ways.

Minimizer of second-order approximation

The second-order Taylor approximation (or model) $\hat{f}$ of $f$ at $x$ is

$$\hat{f}(x+v) = f(x) + \nabla f(x)^T v + \frac{1}{2}v^T\nabla^2 f(x)\,v, \tag{9.28}$$

which is a convex quadratic function of $v$, and is minimized when $v = \Delta x_{\mathrm{nt}}$. Thus, the Newton step $\Delta x_{\mathrm{nt}}$ is what should be added to the point $x$ to minimize the second-order approximation of $f$ at $x$. This is illustrated in figure 9.16.

This interpretation gives us some insight into the Newton step. If the function $f$ is quadratic, then $x + \Delta x_{\mathrm{nt}}$ is the exact minimizer of $f$. If the function $f$ is nearly quadratic, intuition suggests that $x + \Delta x_{\mathrm{nt}}$ should be a very good estimate of the minimizer of $f$, i.e., $x^\star$. Since $f$ is twice differentiable, the quadratic model of $f$ will be very accurate when $x$ is near $x^\star$. It follows that when $x$ is near $x^\star$, the point $x + \Delta x_{\mathrm{nt}}$ should be a very good estimate of $x^\star$. We will see that this intuition is correct.

• Figure from Boyd and Vandenberghe, Convex Optimization
• Excellent reference, free for download online: http://www.stanford.edu/~boyd/cvxbook/
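The claim that a single Newton step lands exactly on the minimizer of a quadratic can be checked with a small sketch. The matrix `A`, vector `b`, and starting point are hypothetical values picked for the example.

```python
import numpy as np

def newton_step(grad, hess):
    """Return the Newton step -hess^{-1} grad without forming the inverse."""
    return -np.linalg.solve(hess, grad)

# Hypothetical quadratic f(x) = 0.5 x^T A x - b^T x with A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

x = np.array([10.0, -7.0])          # arbitrary starting point
grad = A @ x - b                    # gradient of f at x; the Hessian is A everywhere
step = newton_step(grad, A)
x_new = x + step                    # minimizer of f, since f is exactly quadratic
```

At `x_new` the gradient `A @ x_new - b` is zero up to rounding, and `grad @ step` is negative, matching the descent-direction inequality above. For a non-quadratic $f$ the same step is repeated from each new point.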


Conclusion

• Readings: Ch. 4.1.1-4.1.4, 4.1.7, 4.2.1-4.2.2, 4.3.1-4.3.3

• Generalized linear models $y(\mathbf{x}) = f(\mathbf{w}^T\mathbf{x} + w_0)$
  • Threshold/max function for $f(\cdot)$
    • Minimize with least squares
    • Fisher criterion - class separation
    • Perceptron criterion - mis-classified examples
  • Probabilistic models: logistic sigmoid / softmax for $f(\cdot)$
    • Generative model - assume class-conditional densities in the exponential family; obtain sigmoid
    • Discriminative model - directly model the posterior using a sigmoid (a.k.a. logistic regression, though it is used for classification)
    • Can learn either using maximum likelihood

• All of these models are limited to linear decision boundaries in feature space
