Linear Models for Classification

Oliver Schulte - CMPT 726

Bishop PRML Ch. 4

Classification: Hand-written Digit Recognition

xi = ⟨28×28 image of a handwritten digit (here a “3”), flattened to a 784-dimensional pixel vector⟩

ti = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)

• Each input vector is classified into one of K discrete classes
• Denote the classes by Ck
• Represent the input image as a vector xi ∈ R784
• We have a target vector ti ∈ {0, 1}10
• Given a training set {(x1, t1), . . . , (xN, tN)}, the learning problem is to construct a “good” function y(x) from these
• y : R784 → R10


Generalized Linear Models

• As in the previous chapter on linear models for regression, we will use a “linear” model for classification:

y(x) = f (wTx + w0)

• This is called a generalized linear model
• f (·) is a fixed non-linear function, e.g. the threshold

f (u) = 1 if u ≥ 0, 0 otherwise

• The decision boundary between classes will be a linear function of x
• We can also apply a non-linearity to x, as in φi(x) for regression (see the sketch below)
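A minimal sketch of such a model in numpy, with a threshold non-linearity; the weights and toy inputs are made up for illustration.

```python
import numpy as np

def f(u):
    """Fixed non-linearity: threshold at zero."""
    return (u >= 0).astype(int)

def y(x, w, w0):
    """Generalized linear model y(x) = f(w^T x + w0)."""
    return f(x @ w + w0)

# Hypothetical weights for 2-D inputs; the boundary x1 - x2 - 0.5 = 0 is linear in x.
w, w0 = np.array([1.0, -1.0]), -0.5
X = np.array([[2.0, 0.5], [0.0, 1.0]])
print(y(X, w, w0))  # [1 0]
```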



Overview

• Linear regression for classification
• The Fisher Linear Discriminant, or How to Draw a Line Between Classes
• The Perceptron, or The Smallest Neural Net
• Logistic Regression, the Statistician's Classifier


Outline

Discriminant Functions

Generative Models

Discriminative Models


Discriminant Functions with Two Classes

⟨Figure: geometry of a two-class linear discriminant in 2-D. The decision surface y = 0 separates regions R1 (y > 0) and R2 (y < 0); w is orthogonal to the surface, x⊥ is the projection of a point x onto it, y(x)/‖w‖ is the signed distance of x from the surface, and −w0/‖w‖ is the distance of the surface from the origin.⟩

• Start with the 2-class problem, ti ∈ {0, 1}
• Simple linear discriminant:

y(x) = wTx + w0

apply a threshold function to get the classification
• The decision surface is a line (in general, a hyperplane), orthogonal to w
• The projection of x in the direction of w is wTx / ‖w‖


Multiple Classes

⟨Figure: ambiguous regions (marked “?”) that arise when combining two-class discriminants. Left: one-versus-the-rest, with boundaries C1 / not C1 and C2 / not C2. Right: one-versus-one, with pairwise boundaries between C1, C2, and C3. In both cases parts of the regions R1, R2, R3 are claimed by no class or by more than one.⟩

• A linear discriminant between two classes separates with a hyperplane
• How to use this for multiple classes? Two constructions (sketched below):
• One-versus-the-rest method: build K − 1 classifiers, between Ck and all others
• One-versus-one method: build K(K − 1)/2 classifiers, between all pairs
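A minimal sketch of both constructions, assuming a helper fit_binary(X, t) that trains any two-class linear discriminant; the helper name is made up for this example, and both schemes still leave the ambiguous regions shown in the figure.

```python
import itertools

def one_vs_rest(X, t, K, fit_binary):
    # One classifier per class: class k (label 1) versus all others (label 0).
    return [fit_binary(X, [1 if ti == k else 0 for ti in t]) for k in range(K)]

def one_vs_one(X, t, K, fit_binary):
    # K(K-1)/2 classifiers, one per unordered pair of classes (j, k).
    models = {}
    for j, k in itertools.combinations(range(K), 2):
        keep = [i for i, ti in enumerate(t) if ti in (j, k)]
        models[(j, k)] = fit_binary([X[i] for i in keep],
                                    [1 if t[i] == j else 0 for i in keep])
    return models
```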



Multiple Classes

⟨Figure: decision regions Ri, Rj, Rk of a multi-class linear discriminant; two points xA, xB ∈ Rk and the segment joining them, which lies entirely in Rk.⟩

• A solution is to build K linear functions:

yk(x) = wkTx + wk0

and assign x to the class argmaxk yk(x)
• This gives connected, convex decision regions: for x̂ = λxA + (1 − λ)xB with xA, xB ∈ Rk and 0 ≤ λ ≤ 1, linearity gives

yk(x̂) = λyk(xA) + (1 − λ)yk(xB) ⇒ yk(x̂) > yj(x̂), ∀ j ≠ k

so x̂ ∈ Rk as well (a numeric check follows below)
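A small numpy sketch of the argmax rule, with made-up weights, checking the convexity argument at a midpoint:

```python
import numpy as np

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # rows are the w_k (toy values)
w0 = np.array([0.0, 0.0, 0.5])                        # biases w_k0

def classify(x):
    return int(np.argmax(W @ x + w0))   # assign x to argmax_k y_k(x)

xA, xB = np.array([3.0, 0.0]), np.array([2.0, -1.0])  # both classified as class 0
x_hat = 0.5 * xA + 0.5 * xB                           # a point on the segment
print(classify(xA), classify(xB), classify(x_hat))    # 0 0 0 -- the region is convex
```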



Least Squares for Classification

• How do we learn the decision boundaries (wk, wk0)?
• One approach is to use least squares, similar to regression
• Find W to minimize the squared error over all examples and all components of the label vector:

E(W) = (1/2) ∑n=1..N ∑k=1..K (yk(xn) − tnk)²

• After some algebra, we get a solution using the pseudo-inverse, as in regression (see the sketch below)
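A minimal numpy sketch of this approach, assuming the bias wk0 is absorbed into W by appending a constant 1 to each input; lstsq computes the pseudo-inverse solution.

```python
import numpy as np

def fit_least_squares(X, T):
    """X: (N, M) inputs, T: (N, K) one-of-K targets. Returns W of shape (M+1, K)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend 1 to absorb the bias
    W, *_ = np.linalg.lstsq(Xb, T, rcond=None)      # pseudo-inverse solution
    return W

def predict(X, W):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(Xb @ W, axis=1)                # assign to argmax_k y_k(x)
```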



Problems with Least Squares

⟨Figure: two-class data in 2-D with the least-squares and logistic-regression decision boundaries. Left: both boundaries look reasonable. Right: after adding easily classified points far from the boundary, the least-squares boundary is dragged toward them and misclassifies points near the original boundary.⟩

• Looks okay... this is the least-squares decision boundary
• It is similar to the logistic regression decision boundary (more later)
• It gets worse when we add easy points?!
• Why?
• If the target value is 1, points far from the boundary will have a high output value, say 10; this is a large squared error, so the boundary is moved (a numeric check follows below)
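A quick numeric check of that last point: with target t = 1, a point whose raw output is large and correct still incurs a large squared error, so least squares moves the boundary to shrink outputs it should leave alone. The numbers are made up for illustration.

```python
# Target t = 1; y_out is the raw linear output y(x), not a probability.
for y_out in [1.0, 10.0]:
    err = 0.5 * (y_out - 1.0) ** 2
    print(f"y(x) = {y_out:5.1f}  ->  squared error {err:4.1f}")
# y(x) =   1.0  ->  squared error  0.0
# y(x) =  10.0  ->  squared error 40.5   <- a "too correct" point dominates the loss
```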



More Least Squares Problems

⟨Figure: three classes in 2-D that are easily separated by hyperplanes; the least-squares solution nevertheless assigns only a sliver of the input space to the middle class.⟩

• Easily separated by hyperplanes, but not found using least squares!
• Remember that least squares is maximum likelihood under a Gaussian noise model for a continuous target; here we have discrete targets
• We'll address these problems later with better models
• First, a look at a different criterion for a linear discriminant


Fisher’s Linear Discriminant

• The two-class linear discriminant acts as a projection:

y = wTx

followed by a threshold: classify as C1 if y ≥ −w0
• In which direction w should we project?
• One which separates the classes “well”
• Intuition: we want the (projected) centres of the classes to be far apart, and each class (projection) to be clustered around its centre



Fisher’s Linear Discriminant

⟨Figure: two overlapping class clouds projected onto a line. Left: projecting onto the line joining the class means leaves considerable class overlap. Right: the Fisher direction also accounts for within-class variance and separates the projected classes much better.⟩

• A natural idea would be to project in the direction of the line connecting the class means
• However, this is problematic if the classes have variance in this direction (i.e. are not clustered around the mean)
• Fisher criterion: maximize the ratio of inter-class separation (between classes) to intra-class variance (within each class)



Math time - FLD

• Projection: yn = wTxn
• Inter-class separation is the distance between the projected class means (good):

mk = (1/Nk) ∑n∈Ck wTxn

• Intra-class variance (bad):

s²k = ∑n∈Ck (yn − mk)²

• Fisher criterion:

J(w) = (m2 − m1)² / (s²1 + s²2)

maximize with respect to w



Math time - FLD

J(w) = (m2 − m1)² / (s²1 + s²2) = wTSBw / wTSWw

Between-class covariance:

SB = (m2 − m1)(m2 − m1)T

Within-class covariance:

SW = ∑n∈C1 (xn − m1)(xn − m1)T + ∑n∈C2 (xn − m2)(xn − m2)T

Lots of math:

w ∝ SW−1 (m2 − m1)

If the covariance SW is isotropic (proportional to the unit matrix, so little variance within each class), this reduces to the class-mean difference vector (a sketch follows below)
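A minimal numpy sketch of the Fisher direction under these definitions; the two elongated toy clouds are made up, and illustrate that w tilts away from the plain mean-difference direction when SW is not isotropic.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher linear discriminant direction: w proportional to S_W^{-1} (m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    w = np.linalg.solve(Sw, m2 - m1)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], [2.0, 0.3], size=(100, 2))  # class clouds elongated in x1
X2 = rng.normal([3.0, 1.0], [2.0, 0.3], size=(100, 2))
print(fisher_direction(X1, X2))  # not parallel to the mean difference (3, 1)
```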



FLD Summary

• FLD is a dimensionality reduction technique (more later in the course)
• Its criterion for choosing the projection is based on class labels
• It still suffers from outliers (e.g. the earlier least squares example)


Perceptrons

• “Perceptron” is used to refer to many neural network structures (more next week)
• The classic type is a fixed non-linear transformation of the input, one layer of adaptive weights, and a threshold:

y(x) = f (wTφ(x))

• Developed by Rosenblatt in the 1950s
• The main difference from the methods we've seen so far is the learning algorithm


Perceptron Learning

• Two-class problem
• For ease of notation, we will use t = 1 for class C1 and t = −1 for class C2
• We saw that squared error was problematic
• Instead, we'd like to minimize the number of misclassified examples
• An example is misclassified if wTφ(xn)tn < 0
• Perceptron criterion:

EP(w) = − ∑n∈M wTφ(xn)tn

where the sum runs over the set M of misclassified examples only



Perceptron Learning Algorithm

• Minimize the error function using stochastic gradient descent (gradient descent per example):

w(τ+1) = w(τ) − η∇EP(w) = w(τ) + ηφ(xn)tn   (update applied only if xn is misclassified)

• Iterate over all training examples; only change w if the example is misclassified (see the sketch below)
• Guaranteed to converge if the data are linearly separable
• Will not converge if they are not
• May take many iterations
• Sensitive to initialization
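A minimal sketch of this algorithm with φ(x) = x (plus a constant bias component), η = 1, and t ∈ {−1, +1}; the toy data are made up, and the loop simply gives up after a fixed number of passes if the data are not separable.

```python
import numpy as np

def perceptron(X, t, eta=1.0, max_passes=100):
    """X: (N, M) feature vectors phi(x_n); t: (N,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])                # initialization matters in general
    for _ in range(max_passes):
        errors = 0
        for xn, tn in zip(X, t):
            if tn * (w @ xn) <= 0:          # misclassified (or on the boundary)
                w += eta * tn * xn          # perceptron update
                errors += 1
        if errors == 0:                     # all examples correct: converged
            return w
    return w                                # may not have converged (not separable)

X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0],     # last component = 1 (bias)
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
t = np.array([1, 1, -1, -1])
print(perceptron(X, t))
```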



Perceptron Learning Illustration

⟨Figure: four snapshots of perceptron learning on 2-D data. At each step a misclassified point is selected, its (signed) feature vector is added to w, and the decision boundary rotates accordingly, until all points are correctly classified.⟩

• Note there are many hyperplanes with 0 error
• Support vector machines (in a few weeks) have a nice way of choosing one



Limitations of Perceptrons

• Perceptrons can only solve linearly separable problems in feature space
• The same is true of the other models in this chapter
• The canonical example of a non-separable problem is X-OR
• Real datasets can look like this too

Expressiveness of perceptrons

Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)

Can represent AND, OR, NOT, majority, etc.

Represents a linear separator in input space:

∑j Wjxj > 0 or W · x > 0

⟨Figure: the input space (I1, I2) of three Boolean functions. (a) I1 AND I2 and (b) I1 OR I2 are linearly separable; (c) I1 XOR I2 is not: no single line separates its positive and negative points.⟩


Outline

Discriminant Functions

Generative Models

Discriminative Models


Intuition for Logistic Regression in Generative Models

• Classification with joint probabilities, for 2 classes C1 and C2:
• Choose C1 if

1 < p(C1|x) / p(C2|x) = p(C1, x) / p(C2, x)

⇐⇒ 0 < ln( p(C1|x) / p(C2|x) ) = ln p(C1|x) − ln p(C2|x)

• The quantity

a = ln( p(C1|x) / p(C2|x) )

is called the log-odds
• Logistic regression assumes that the log-odds are a linear function of the feature vector: a = wTx + w0
• Often true given assumptions about the class-conditional densities p(x|Ck)



Probabilistic Generative Models

• With 2 classes, C1 and C2:

p(C1|x) = p(x|C1)p(C1) / p(x)   (Bayes' rule)

p(C1|x) = p(x|C1)p(C1) / (p(x, C1) + p(x, C2))   (sum rule)

p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2))   (product rule)

• In generative models we specify the class-conditional distribution p(x|Ck), which generates the data for each class



Probabilistic Generative Models - Example

• Let's say we observe x, the current temperature
• Determine if we are in Vancouver (C1) or Honolulu (C2)
• Generative model:

p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2))

• p(x|C1) is the distribution over typical temperatures in Vancouver, e.g. p(x|C1) = N (x; 10, 5)
• p(x|C2) is the distribution over typical temperatures in Honolulu, e.g. p(x|C2) = N (x; 25, 5)
• Class priors: p(C1) = 0.1, p(C2) = 0.9
• p(C1|x = 15) = (0.0484 · 0.1) / (0.0484 · 0.1 + 0.0108 · 0.9) ≈ 0.33 (verified numerically below)
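A quick numeric check of this example, assuming the second parameter of N (·; µ, σ) is the standard deviation (which matches the density values quoted on the slide):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

lik_van = normal_pdf(15, 10, 5)    # ~0.0484
lik_hon = normal_pdf(15, 25, 5)    # ~0.0108
prior_van, prior_hon = 0.1, 0.9
post_van = lik_van * prior_van / (lik_van * prior_van + lik_hon * prior_hon)
print(round(post_van, 2))          # 0.33
```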



Generalized Linear Models

• Suppose we have built a model for predicting the log-odds. We can use it to compute the class probability as follows (divide the numerator and denominator by p(x|C1)p(C1)):

p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2)) = 1 / (1 + exp(−a)) ≡ σ(a)

where

a = ln[ p(x|C1)p(C1) / (p(x|C2)p(C2)) ] = ln[ p(x, C1) / p(x, C2) ]



Logistic Sigmoid

⟨Figure: plot of the logistic sigmoid σ(a): an S-shaped curve rising from 0 to 1 with σ(0) = 0.5.⟩

• The function σ(a) = 1 / (1 + exp(−a)) is known as the logistic sigmoid
• It squashes the real axis down to (0, 1)
• It is continuous and differentiable (a sketch follows below)
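A small sketch of the sigmoid and its derivative; the identity σ′(a) = σ(a)(1 − σ(a)) is standard and is handy when differentiating the likelihood later.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.array([-5.0, 0.0, 5.0])
s = sigmoid(a)
print(s)              # [0.0067 0.5    0.9933] -- squashed into (0, 1)
print(s * (1.0 - s))  # the derivative sigma'(a) = sigma(a) * (1 - sigma(a))
```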


Multi-class Extension

• There is a generalization of the logistic sigmoid to K > 2 classes:

p(Ck|x) = p(x|Ck)p(Ck) / ∑j p(x|Cj)p(Cj) = exp(ak) / ∑j exp(aj)

where ak = ln p(x|Ck)p(Ck)
• a.k.a. the softmax function (a sketch follows below)
• If some ak ≫ aj (for all j ≠ k), p(Ck|x) goes to 1
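A minimal softmax sketch; subtracting max(a) before exponentiating is the usual overflow guard and leaves the result unchanged, since a constant shift cancels in the ratio.

```python
import numpy as np

def softmax(a):
    z = np.exp(a - np.max(a))   # shift-invariant: exp(a_k - c) / sum_j exp(a_j - c)
    return z / z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))    # [0.659 0.242 0.099]
print(softmax(np.array([100.0, 1.0, 0.1])))  # ~[1. 0. 0.]: a dominant a_k wins
```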



Example Logistic Regression

⟨Figure: example of a logistic regression decision boundary on 2-D data (axes from −6 to 6).⟩


Gaussian Class-Conditional Densities

• Back to the log-odds a in the logistic sigmoid for 2 classes
• Let's assume the class-conditional densities p(x|Ck) are Gaussians that share the same covariance matrix Σ:

p(x|Ck) = 1 / ((2π)^(D/2) |Σ|^(1/2)) · exp{ −(1/2) (x − µk)TΣ−1(x − µk) }

• Then a takes a simple form:

a = ln[ p(x|C1)p(C1) / (p(x|C2)p(C2)) ] = wTx + w0

• Note that the quadratic terms xTΣ−1x cancel (the resulting w and w0 are written out below)
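Carrying the algebra through (a standard result; see Bishop PRML Sec. 4.2.1) gives explicit expressions for the weights:

```latex
w   = \Sigma^{-1}(\mu_1 - \mu_2),
w_0 = -\tfrac{1}{2}\,\mu_1^{\mathsf{T}}\Sigma^{-1}\mu_1
      + \tfrac{1}{2}\,\mu_2^{\mathsf{T}}\Sigma^{-1}\mu_2
      + \ln\frac{p(C_1)}{p(C_2)}
```

So the shared covariance rotates and scales the mean difference, and the class priors only shift the bias w0.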



Maximum Likelihood Learning

• We can fit the parameters of this model using maximum likelihood
• The parameters are µ1, µ2, Σ, p(C1) ≡ π, p(C2) ≡ 1 − π
• Refer to them collectively as θ
• For a datapoint xn from class C1 (tn = 1):

p(xn, C1) = p(C1)p(xn|C1) = πN (xn|µ1, Σ)

• For a datapoint xn from class C2 (tn = 0):

p(xn, C2) = p(C2)p(xn|C2) = (1 − π)N (xn|µ2, Σ)



Maximum Likelihood Learning

• The likelihood of the training data is:

p(t|π, µ1, µ2, Σ) = ∏n=1..N [πN (xn|µ1, Σ)]^tn [(1 − π)N (xn|µ2, Σ)]^(1−tn)

• As usual, ln is our friend:

ℓ(t; θ) = ∑n=1..N [ tn ln π + (1 − tn) ln(1 − π) ]   (terms involving π)
        + ∑n=1..N [ tn ln N (xn|µ1, Σ) + (1 − tn) ln N (xn|µ2, Σ) ]   (terms involving µ1, µ2, Σ)

• Maximize each group separately



Maximum Likelihood Learning - Class Priors

• Maximization with respect to the class prior parameter π is straightforward: set the derivative to zero,

∂ℓ(t; θ)/∂π = ∑n=1..N [ tn/π − (1 − tn)/(1 − π) ] = 0 ⇒ π = N1 / (N1 + N2)

• N1 and N2 are the numbers of training points in each class
• The prior is simply the fraction of points in each class



Maximum Likelihood Learning - Gaussian Parameters

• The other parameters can also be found in the same fashion
• Class means:

µ1 = (1/N1) ∑n=1..N tn xn        µ2 = (1/N2) ∑n=1..N (1 − tn) xn

the means of the training examples from each class
• Shared covariance matrix:

Σ = (N1/N) · (1/N1) ∑n∈C1 (xn − µ1)(xn − µ1)T + (N2/N) · (1/N2) ∑n∈C2 (xn − µ2)(xn − µ2)T

a weighted average of the two class covariances (a sketch follows below)
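A compact numpy sketch of these ML estimates for the shared-covariance Gaussian generative model; X is an (N, D) array and t a 0/1 label vector.

```python
import numpy as np

def fit_generative(X, t):
    """ML estimates of pi, mu1, mu2 and the shared covariance Sigma."""
    t = np.asarray(t, dtype=float)
    N1, N2 = t.sum(), (1.0 - t).sum()
    pi = N1 / (N1 + N2)                               # fraction of class-1 points
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1.0 - t)[:, None] * X).sum(axis=0) / N2
    D1, D2 = X[t == 1] - mu1, X[t == 0] - mu2
    Sigma = (D1.T @ D1 + D2.T @ D2) / (N1 + N2)       # = (N1/N) S1 + (N2/N) S2
    return pi, mu1, mu2, Sigma
```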



Probabilistic Generative Models Summary

• Fitting Gaussians using the ML criterion is sensitive to outliers
• The simple linear form for a in the logistic sigmoid occurs for more than just Gaussian distributions
• It arises for any distribution in the exponential family, a large class of distributions


Outline

Discriminant Functions

Generative Models

Discriminative Models


Probabilistic Discriminative Models

• The generative model made assumptions about the form of the class-conditional distributions (e.g. Gaussian)
• This resulted in a logistic sigmoid of a linear function of x
• A discriminative model instead explicitly uses the functional form

p(C1|x) = 1 / (1 + exp(−(wTx + w0)))

and finds w directly
• For the generative model we had 2M + M(M + 1)/2 + 1 parameters, where M is the dimensionality of x
• The discriminative model will have only M + 1 parameters



Maximum Likelihood Learning - Discriminative Model

• As usual, we can use the maximum likelihood criterion for learning:

p(t|w) = ∏n=1..N yn^tn (1 − yn)^(1−tn), where yn = p(C1|xn)

• Taking ln and the derivative gives:

∇ℓ(w) = ∑n=1..N (yn − tn) xn

• This time there is no closed-form solution, since yn = σ(wTxn)
• We could use (stochastic) gradient descent
• But Iterative Reweighted Least Squares (IRLS) is a better technique (a sketch follows below)
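A minimal IRLS (Newton-Raphson) sketch under these definitions; the update w ← w − (ΦTRΦ)−1ΦT(y − t) with R = diag(yn(1 − yn)) is the standard one, and the bias is assumed absorbed into Φ via a constant column.

```python
import numpy as np

def irls(Phi, t, n_iter=10):
    """Logistic regression by IRLS. Phi: (N, M) design matrix; t: (N,) in {0, 1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))   # y_n = sigma(w^T phi_n)
        R = y * (1.0 - y)                    # diagonal weights of the Hessian
        grad = Phi.T @ (y - t)               # gradient of the negative log-likelihood
        H = Phi.T @ (R[:, None] * Phi)       # Hessian Phi^T R Phi
        w -= np.linalg.solve(H, grad)        # Newton step
    return w
```

For perfectly separable data the weights diverge and H becomes ill-conditioned; in practice a small regularization term is usually added to the Hessian.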



Generative vs. Discriminative

• Generative models:
  • Can generate synthetic example data
  • Perhaps accurate classification is equivalent to accurate synthesis
  • Support learning with missing data
  • Tend to have more parameters
  • Require a good model of the class distributions
• Discriminative models:
  • Only usable for classification
  • Don't solve a harder problem than you need to
  • Tend to have fewer parameters
  • Require a good model of the decision boundary


Conclusion

• Readings: Ch. 4.1.1-4.1.4, 4.1.7, 4.2.1-4.2.2, 4.3.1-4.3.3
• Generalized linear models y(x) = f (wTx + w0):
  • Threshold/max function for f (·)
  • Minimize with least squares
  • Fisher criterion: class separation
  • Perceptron criterion: misclassified examples
• Probabilistic models: logistic sigmoid / softmax for f (·):
  • Generative model: assume class-conditional densities in the exponential family; obtain the sigmoid
  • Discriminative model: directly model the posterior using the sigmoid (a.k.a. logistic regression, though it does classification)
• Can learn either using maximum likelihood
• All of these models are limited to linear decision boundaries in feature space