Linear Methods for Classification


Page 1: Linear Methods for Classification


Linear Methods for Classification

Lecture Notes for CMPUT 466/551

Nilanjan Ray

Page 2: Linear Methods for Classification


Linear Classification

• What is meant by linear classification?
– The decision boundaries in the feature (input) space are linear

• Should the regions be contiguous?

[Figure: four regions R1–R4 in the (X1, X2) plane — piecewise linear decision boundaries in 2D input space]

Page 3: Linear Methods for Classification


Linear Classification…

• There is a discriminant function $\delta_k(x)$ for each class k

• Classification rule: $R_k = \{x : k = \arg\max_j \delta_j(x)\}$

• In higher dimensional space the decision boundaries are piecewise hyperplanar

• Remember that the 0-1 loss function led to the classification rule: $R_k = \{x : k = \arg\max_j \Pr(G=j \mid X=x)\}$

• So, $\Pr(G=k \mid X=x)$ can serve as $\delta_k(x)$

Page 4: Linear Methods for Classification


Linear Classification…

• All we require here is that the class boundaries $\{x : \delta_k(x) = \delta_j(x)\}$ be linear for every (k, j) pair

• One can achieve this if the $\delta_k(x)$ themselves are linear, or if any monotone transform of $\delta_k(x)$ is linear. An example:

$$\Pr(G=1 \mid X=x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}, \qquad \Pr(G=2 \mid X=x) = \frac{1}{1 + \exp(\beta_0 + \beta^T x)}$$

So that

$$\log\left[\frac{\Pr(G=1 \mid X=x)}{\Pr(G=2 \mid X=x)}\right] = \beta_0 + \beta^T x \qquad \text{(Linear)}$$
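The linear log-odds can be checked numerically. Below is a minimal sketch (Python/NumPy) with hypothetical values for beta0 and beta; the two posteriors sum to 1 and their log-ratio reproduces $\beta_0 + \beta^T x$:

```python
import numpy as np

# Minimal sketch of the two-class logit example above; beta0 and beta
# are hypothetical parameter values, not from the lecture.
beta0, beta = -0.5, np.array([2.0, -1.0])

def posteriors(x):
    """Return (Pr(G=1|x), Pr(G=2|x)) for the two-class model."""
    z = beta0 + beta @ x
    p1 = np.exp(z) / (1.0 + np.exp(z))
    return p1, 1.0 - p1

x = np.array([0.3, 1.2])
p1, p2 = posteriors(x)
print(p1 + p2)               # 1.0
print(np.log(p1 / p2))       # equals beta0 + beta @ x: linear in x
print(beta0 + beta @ x)
```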

Page 5: Linear Methods for Classification


Linear Classification as a Linear Regression

2D input space: $X = (X_1, X_2)$. Number of classes/categories K = 3, so output $Y = (Y_1, Y_2, Y_3)$. Training sample, size N = 5:

$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \\ 1 & x_{41} & x_{42} \\ 1 & x_{51} & x_{52} \end{pmatrix}, \qquad \mathbf{Y} = \begin{pmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \\ y_{31} & y_{32} & y_{33} \\ y_{41} & y_{42} & y_{43} \\ y_{51} & y_{52} & y_{53} \end{pmatrix}$$

$\mathbf{Y}$ is the indicator matrix: each row has exactly one 1, indicating the category/class.

Regression output:

$$\hat{Y}((x_1, x_2)) = (1\ x_1\ x_2)\,(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$

so that, writing $\hat{\beta}_k$ for the k-th column of $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$,

$$\hat{Y}_1((x_1,x_2)) = (1\ x_1\ x_2)\,\hat{\beta}_1, \quad \hat{Y}_2((x_1,x_2)) = (1\ x_1\ x_2)\,\hat{\beta}_2, \quad \hat{Y}_3((x_1,x_2)) = (1\ x_1\ x_2)\,\hat{\beta}_3$$

Or, classification rule:

$$\hat{G}((x_1,x_2)) = \arg\max_k \hat{Y}_k((x_1,x_2))$$
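A minimal NumPy sketch of this procedure on a made-up training set of N = 5 points with K = 3 classes, mirroring the slide; np.linalg.lstsq stands in for the explicit $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ computation:

```python
import numpy as np

# Made-up training set: N = 5 points in 2D, K = 3 classes (labels 0, 1, 2).
X = np.array([[1, 0.5, 1.2],   # each row: (1, x1, x2)
              [1, 2.0, 0.3],
              [1, 1.1, 2.2],
              [1, 3.0, 1.0],
              [1, 0.2, 0.8]])
g = np.array([0, 1, 2, 1, 0])

Y = np.eye(3)[g]               # indicator matrix: exactly one 1 per row

# Least-squares coefficients; column k of Bhat is beta_hat_k
Bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)

def classify(x1, x2):
    yhat = np.array([1.0, x1, x2]) @ Bhat   # (Yhat_1, Yhat_2, Yhat_3)
    return int(np.argmax(yhat))             # Ghat = argmax_k Yhat_k

print(classify(0.4, 1.0))
```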

Page 6: Linear Methods for Classification


The Masking

Linear regression of the indicator matrix can lead to masking; LDA can avoid this masking.

[Figure: 2D input space with three classes; the three fitted regression outputs $\hat{Y}_1 = (1\ x_1\ x_2)\hat{\beta}_1$, $\hat{Y}_2 = (1\ x_1\ x_2)\hat{\beta}_2$, $\hat{Y}_3 = (1\ x_1\ x_2)\hat{\beta}_3$ are plotted along a viewing direction; the middle class is masked because its fitted value is never the maximum]

Page 7: Linear Methods for Classification


Linear Discriminant Analysis

Posterior probability (application of Bayes rule):

$$\Pr(G = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

$\pi_k$ is the prior probability for class k; $f_k(x)$ is the class conditional density or likelihood.

LDA is essentially the minimum-error Bayes' classifier. It assumes that the class conditional densities are (multivariate) Gaussian, with equal covariance for every class:

$$f_k(x) = \frac{1}{(2\pi)^{p/2}|\mathbf{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \mathbf{\Sigma}^{-1} (x - \mu_k)\right)$$

Page 8: Linear Methods for Classification


LDA…

$$\log\frac{\Pr(G=k \mid X=x)}{\Pr(G=l \mid X=x)} = \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} = \log\frac{\pi_k}{\pi_l} - \frac{1}{2}\mu_k^T\mathbf{\Sigma}^{-1}\mu_k + \frac{1}{2}\mu_l^T\mathbf{\Sigma}^{-1}\mu_l + x^T\mathbf{\Sigma}^{-1}(\mu_k - \mu_l)$$

This log-ratio is linear in x, so we may take as discriminant functions $\delta_k(x)$, $\delta_l(x)$:

$$\delta_k(x) = x^T\mathbf{\Sigma}^{-1}\mu_k - \frac{1}{2}\mu_k^T\mathbf{\Sigma}^{-1}\mu_k + \log\pi_k$$

Classification rule:

$$\hat{G}(x) = \arg\max_k \delta_k(x)$$

is equivalent to:

$$\hat{G}(x) = \arg\max_k \Pr(G=k \mid X=x)$$

The good old Bayes classifier!

Page 9: Linear Methods for Classification


LDA…

Training data: $(x_i, g_i),\ i = 1, \ldots, N$. Total N input-output pairs; $N_k$ = number of pairs in class k; total number of classes: K.

When are we going to use the training data? It is utilized to estimate:

Prior probabilities: $\hat{\pi}_k = N_k / N$

Means: $\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$

Covariance matrix: $\hat{\mathbf{\Sigma}} = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$
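These estimates plug directly into the discriminant $\delta_k(x)$ from the previous slide. A minimal sketch, assuming made-up two-class Gaussian data:

```python
import numpy as np

def lda_fit(X, g, K):
    """Estimate priors, means, and the pooled covariance from training data."""
    N, p = X.shape
    priors = np.array([(g == k).mean() for k in range(K)])        # Nk / N
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])  # mu_hat_k
    Sigma = np.zeros((p, p))
    for k in range(K):                                            # pooled covariance
        D = X[g == k] - means[k]
        Sigma += D.T @ D
    return priors, means, Sigma / (N - K)

def lda_predict(x, priors, means, Sigma):
    Sinv = np.linalg.inv(Sigma)
    # delta_k(x) = x^T Sinv mu_k - (1/2) mu_k^T Sinv mu_k + log pi_k
    deltas = [x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(pi)
              for m, pi in zip(means, priors)]
    return int(np.argmax(deltas))

rng = np.random.default_rng(0)                        # made-up data
X = np.vstack([rng.normal([0, 0], 1, (20, 2)),
               rng.normal([3, 3], 1, (20, 2))])
g = np.repeat([0, 1], 20)
print(lda_predict(np.array([2.5, 2.5]), *lda_fit(X, g, K=2)))   # expect 1
```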

Page 10: Linear Methods for Classification


LDA: Example

LDA was able to avoid masking here

Page 11: Linear Methods for Classification


Quadratic Discriminant Analysis

• Relaxes the equal-covariance assumption: the class conditional densities (still multivariate Gaussians) are allowed to have different covariance matrices

• The class decision boundaries are then not linear but quadratic:

$$\log\frac{\Pr(G=k \mid X=x)}{\Pr(G=l \mid X=x)} = \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} = \log\frac{\pi_k}{\pi_l} - \frac{1}{2}\log\frac{|\mathbf{\Sigma}_k|}{|\mathbf{\Sigma}_l|} - \frac{1}{2}(x-\mu_k)^T\mathbf{\Sigma}_k^{-1}(x-\mu_k) + \frac{1}{2}(x-\mu_l)^T\mathbf{\Sigma}_l^{-1}(x-\mu_l)$$

so the discriminant functions $\delta_k(x)$, $\delta_l(x)$ are quadratic in x:

$$\delta_k(x) = \log\pi_k - \frac{1}{2}\log|\mathbf{\Sigma}_k| - \frac{1}{2}(x-\mu_k)^T\mathbf{\Sigma}_k^{-1}(x-\mu_k)$$
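A minimal sketch of the QDA discriminant, assuming the per-class parameters (prior $\pi_k$, mean $\mu_k$, covariance $\mathbf{\Sigma}_k$) have already been estimated; the numbers below are made up:

```python
import numpy as np

def qda_delta(x, pi_k, mu_k, Sigma_k):
    """delta_k(x) = log pi_k - 0.5 log|Sigma_k| - 0.5 (x-mu_k)^T Sigma_k^{-1} (x-mu_k)"""
    d = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)    # numerically stable log-determinant
    return np.log(pi_k) - 0.5 * logdet - 0.5 * d @ np.linalg.inv(Sigma_k) @ d

# Two classes with different (made-up) covariances; classify by argmax_k delta_k(x)
params = [(0.5, np.array([0.0, 0.0]), np.eye(2)),
          (0.5, np.array([3.0, 3.0]), np.array([[2.0, 0.5], [0.5, 1.0]]))]
x = np.array([2.0, 2.5])
print(np.argmax([qda_delta(x, *p) for p in params]))
```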

Page 12: Linear Methods for Classification


QDA and Masking

Better than linear regression in terms of handling masking.

Usually computationally more expensive than LDA

Page 13: Linear Methods for Classification


Fisher’s Linear Discriminant[DHS]

From training set we want to find out a direction where the separationbetween the class means is high and overlap between the classes is small

Page 14: Linear Methods for Classification


Fisher’s LD…

Projection of a vector x on a unit vector w: $w^T x$

Geometric interpretation: [Figure: the vector x and its projection $w^T x$ along the unit vector w]

From the training set we want to find a direction w along which the separation between the projections of the class means is high and the overlap between the projections of the classes is small.

Page 15: Linear Methods for Classification


Fisher’s LD…

Class means:

$$m_1 = \frac{1}{N_1}\sum_{x_i \in R_1} x_i, \qquad m_2 = \frac{1}{N_2}\sum_{x_i \in R_2} x_i$$

Projected class means:

$$\tilde{m}_1 = \frac{1}{N_1}\sum_{x_i \in R_1} w^T x_i = w^T m_1, \qquad \tilde{m}_2 = \frac{1}{N_2}\sum_{x_i \in R_2} w^T x_i = w^T m_2$$

Difference between projected class means:

$$\tilde{m}_2 - \tilde{m}_1 = w^T(m_2 - m_1)$$

Scatter of projected data (this will indicate overlap between the classes):

$$\tilde{s}_1^2 = \sum_{y_i : x_i \in R_1} (y_i - \tilde{m}_1)^2 = \sum_{x_i \in R_1} (w^T x_i - w^T m_1)^2 = w^T S_1 w, \qquad S_1 = \sum_{x_i \in R_1} (x_i - m_1)(x_i - m_1)^T$$

$$\tilde{s}_2^2 = \sum_{y_i : x_i \in R_2} (y_i - \tilde{m}_2)^2 = \sum_{x_i \in R_2} (w^T x_i - w^T m_2)^2 = w^T S_2 w, \qquad S_2 = \sum_{x_i \in R_2} (x_i - m_2)(x_i - m_2)^T$$

Page 16: Linear Methods for Classification


Fisher’s LD…

Ratio of the difference of projected means over total scatter (a Rayleigh quotient):

$$r(w) = \frac{(\tilde{m}_2 - \tilde{m}_1)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T S_B w}{w^T S_w w}$$

where

$$S_B = (m_2 - m_1)(m_2 - m_1)^T, \qquad S_w = S_1 + S_2$$

We want to maximize r(w). The solution is

$$w = S_w^{-1}(m_2 - m_1)$$

Page 17: Linear Methods for Classification


Fisher’s LD: Classifier

So far so good. However, how do we get the classifier? All we know at this point is that the direction $w = S_w^{-1}(m_2 - m_1)$ separates the projected data very well.

Since we know that the projected class means are well separated, we can choose the average of the two projected means as a threshold for classification.

Classification rule: x is in R2 if y(x) > 0, else x is in R1, where

$$y(x) = w^T x - \frac{1}{2}(\tilde{m}_1 + \tilde{m}_2) = (m_2 - m_1)^T S_w^{-1}\left(x - \frac{1}{2}(m_1 + m_2)\right)$$
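Putting the pieces together, a minimal sketch of the two-class Fisher LD on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)                 # made-up training data
X1 = rng.normal([0, 0], 1, (30, 2))            # class R1
X2 = rng.normal([4, 1], 1, (30, 2))            # class R2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)      # class means
S1 = (X1 - m1).T @ (X1 - m1)                   # scatter matrices
S2 = (X2 - m2).T @ (X2 - m2)
Sw = S1 + S2

w = np.linalg.solve(Sw, m2 - m1)               # w = Sw^{-1} (m2 - m1)
threshold = 0.5 * (w @ m1 + w @ m2)            # average of projected means

def classify(x):
    return 2 if w @ x - threshold > 0 else 1   # y(x) > 0  =>  R2

print(classify(np.array([3.5, 1.0])))          # expect 2
print(classify(np.array([0.2, -0.3])))         # expect 1
```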

Page 18: Linear Methods for Classification


Fisher’s LD: Multiple Classes

There are k classes $C_1, \ldots, C_k$, with $n_i$ elements in the i-th class.

Compute means for the classes:

$$m_i = \frac{1}{n_i}\sum_{x \in C_i} x$$

Compute the grand mean:

$$m = \frac{1}{n}(n_1 m_1 + \cdots + n_k m_k)$$

Compute the scatter matrices:

$$S_w = \sum_{x \in C_1}(x - m_1)(x - m_1)^T + \cdots + \sum_{x \in C_k}(x - m_k)(x - m_k)^T$$

$$S_B = n_1(m_1 - m)(m_1 - m)^T + \cdots + n_k(m_k - m)(m_k - m)^T$$

Maximize the Rayleigh ratio:

$$r(w) = \frac{w^T S_B w}{w^T S_w w}$$

The solution is the largest eigenvector of $S_w^{-1} S_B$.

At most (k−1) eigenvalues will be non-zero: dimensionality reduction.
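A minimal sketch of the multi-class construction on made-up three-class data in 3D; the discriminant directions come out as eigenvectors of $S_w^{-1} S_B$, and only k − 1 = 2 eigenvalues are (numerically) non-zero:

```python
import numpy as np

rng = np.random.default_rng(2)                 # made-up data: 3 classes in 3D
Xs = [rng.normal(c, 1, (25, 3))
      for c in ([0, 0, 0], [4, 0, 1], [0, 4, 2])]

n = sum(len(X) for X in Xs)
means = [X.mean(axis=0) for X in Xs]
grand = sum(len(X) * m for X, m in zip(Xs, means)) / n     # grand mean

Sw = sum((X - m).T @ (X - m) for X, m in zip(Xs, means))   # within-class scatter
Sb = sum(len(X) * np.outer(m - grand, m - grand)           # between-class scatter
         for X, m in zip(Xs, means))

evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(evals.real)[::-1]
W = evecs.real[:, order[:2]]                   # top k-1 = 2 directions

print(np.round(evals.real[order], 4))          # third eigenvalue ~ 0
print((Xs[0] @ W).shape)                       # 3D data reduced to 2D
```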

Page 19: Linear Methods for Classification


Fisher’s LD and LDA

They become the same when:

(1) Prior probabilities are the same

(2) There is a common covariance matrix for the class conditional densities

(3) Both class conditional densities are multivariate Gaussian

Ex. Show that Fisher's LD classifier and LDA produce the same rule of classification given the above assumptions.

Note: (1) Fisher's LD does not assume Gaussian densities. (2) Fisher's LD can be used for dimension reduction in a multiple-class scenario.

Page 20: Linear Methods for Classification


Logistic Regression

• The output of regression is the posterior probability, i.e., Pr(output | input)

• Always ensures that the sum of the output variables is 1 and each output is non-negative

• A linear classification method

• We need to know about two concepts to understand logistic regression:
– Newton-Raphson method
– Maximum likelihood estimation

Page 21: Linear Methods for Classification

21

Newton-Raphson Method

A technique for solving the non-linear equation f(x) = 0.

Taylor series:

$$f(x_{n+1}) \approx f(x_n) + (x_{n+1} - x_n) f'(x_n)$$

If $x_{n+1}$ is a root or very close to the root, then $f(x_{n+1}) \approx 0$, so:

$$0 = f(x_n) + (x_{n+1} - x_n) f'(x_n)$$

After rearrangement, the rule for iteration:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

Need an initial guess $x_0$.
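A minimal sketch of the iteration; the equation f(x) = x² − 2 and the starting point are assumptions for illustration:

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson: iterate x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0                                # initial guess
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:               # stop when the update is tiny
            break
    return x

# Example: solve x^2 - 2 = 0; converges quickly to sqrt(2)
print(newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0))
```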

Page 22: Linear Methods for Classification


Newton-Raphson in Multi-dimensions

We want to solve the system of equations:

$$f_1(x_1, x_2, \ldots, x_N) = 0,\quad f_2(x_1, x_2, \ldots, x_N) = 0,\quad \ldots,\quad f_N(x_1, x_2, \ldots, x_N) = 0$$

Taylor series:

$$f_j(x + \delta x) \approx f_j(x) + \sum_{k=1}^{N} \frac{\partial f_j}{\partial x_k}\,\delta x_k, \qquad j = 1, \ldots, N$$

After some rearrangement etc., the rule for iteration (need an initial guess):

$$\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}^{n+1} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}^{n} - \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_N}{\partial x_1} & \cdots & \dfrac{\partial f_N}{\partial x_N} \end{pmatrix}^{-1} \begin{pmatrix} f_1(x_1^n, \ldots, x_N^n) \\ \vdots \\ f_N(x_1^n, \ldots, x_N^n) \end{pmatrix}$$

The matrix of partial derivatives (evaluated at the current iterate) is the Jacobian matrix.

Page 23: Linear Methods for Classification


Newton-Raphson: Example

Solve:

$$f_1(x_1, x_2) = x_1^2 + \cos(x_1 x_2) = 0$$
$$f_2(x_1, x_2) = x_2^3 + \sin(x_1 x_2) = 0$$

Iteration rule (need an initial guess):

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}^{n+1} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}^{n} - \begin{pmatrix} 2x_1^n - x_2^n\sin(x_1^n x_2^n) & -x_1^n\sin(x_1^n x_2^n) \\ x_2^n\cos(x_1^n x_2^n) & 3(x_2^n)^2 + x_1^n\cos(x_1^n x_2^n) \end{pmatrix}^{-1} \begin{pmatrix} f_1(x_1^n, x_2^n) \\ f_2(x_1^n, x_2^n) \end{pmatrix}$$
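A minimal sketch of the multi-dimensional iteration rule, using a finite-difference Jacobian and demonstrated on an assumed 2×2 system ($x_1^2 + x_2^2 - 4 = 0$, $x_1 x_2 - 1 = 0$) chosen purely for illustration:

```python
import numpy as np

def F(x):
    # An assumed example system, not the one from the lecture
    return np.array([x[0]**2 + x[1]**2 - 4.0,
                     x[0]*x[1] - 1.0])

def jacobian(F, x, h=1e-7):
    """Finite-difference approximation of the Jacobian J[j, k] = dF_j/dx_k."""
    fx = F(x)
    J = np.empty((len(fx), len(x)))
    for k in range(len(x)):
        xk = x.copy()
        xk[k] += h
        J[:, k] = (F(xk) - fx) / h
    return J

x = np.array([2.0, 0.5])                         # initial guess
for _ in range(20):
    step = np.linalg.solve(jacobian(F, x), F(x))
    x = x - step                                 # x^{n+1} = x^n - J^{-1} F(x^n)
    if np.abs(step).max() < 1e-12:
        break
print(x, F(x))                                   # F(x) ~ 0 at the root
```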

Page 24: Linear Methods for Classification


Maximum Likelihood Parameter Estimation

Let's start with an example. We want to find the unknown parameters, the mean and standard deviation of a Gaussian pdf, given N independent samples from it:

$$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Samples: $x_1, \ldots, x_N$

Form the likelihood function:

$$L(\mu, \sigma) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$

Estimate the parameters that maximize the likelihood function:

$$(\hat{\mu}, \hat{\sigma}) = \arg\max_{\mu,\sigma} L(\mu, \sigma)$$

Let's find out $(\hat{\mu}, \hat{\sigma})$.

Page 25: Linear Methods for Classification


Logistic Regression Model

$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}, \qquad k = 1, \ldots, K-1$$

$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}$$

The method directly models the posterior probabilities as the output of regression

Note that the class boundaries are linear

How can we show this linear nature?

What is the discriminant function for every class in this model?
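A minimal sketch of the model for K = 3 in 2D with made-up parameters; it confirms that the outputs are non-negative and sum to 1, and that the log-odds against class K are linear in x (which answers the question about the linear nature):

```python
import numpy as np

beta0 = np.array([0.2, -0.4])        # made-up intercepts, classes 1..K-1
B = np.array([[1.0, -2.0],
              [0.5,  1.5]])          # made-up beta_k rows, classes 1..K-1

def posterior(x):
    z = np.exp(beta0 + B @ x)        # exp(beta_k0 + beta_k^T x), k < K
    return np.append(z, 1.0) / (1.0 + z.sum())   # classes 1..K-1, then K

x = np.array([0.7, -0.1])
p = posterior(x)
print(p, p.sum())                    # non-negative, sums to 1
# Log-odds vs. class K are linear in x:
print(np.log(p[0] / p[-1]), beta0[0] + B[0] @ x)   # the two agree
```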

Page 26: Linear Methods for Classification


Logistic Regression Computation

Let’s fit the logistic regression model for K=2, i.e., number of classes is 2

Training set: $(x_i, g_i)$, i = 1, …, N; encode the two classes as $y_i \in \{0, 1\}$, and let $x_i$ include a leading 1 so the intercept is absorbed into $\beta$.

Log-likelihood:

$$l(\beta) = \sum_{i=1}^{N}\log\Pr(G = y_i \mid X = x_i) = \sum_{i=1}^{N}\left[y_i\log\Pr(G=1 \mid X=x_i) + (1-y_i)\log\Pr(G=0 \mid X=x_i)\right]$$

$$= \sum_{i=1}^{N}\left[y_i\log\frac{\exp(\beta^T x_i)}{1+\exp(\beta^T x_i)} + (1-y_i)\log\frac{1}{1+\exp(\beta^T x_i)}\right] = \sum_{i=1}^{N}\left[y_i\,\beta^T x_i - \log(1+\exp(\beta^T x_i))\right]$$

We want to maximize the log-likelihood in order to estimate $\beta$.

Page 27: Linear Methods for Classification


Logistic Regression Computation…

Setting the gradient of the log-likelihood to zero:

$$\frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i\left(y_i - \frac{\exp(\beta^T x_i)}{1+\exp(\beta^T x_i)}\right) = 0$$

These are (p+1) non-linear equations. Solve by the Newton-Raphson method:

$$\beta^{new} = \beta^{old} - \left[\mathrm{Jacobian}\left(\frac{\partial l(\beta)}{\partial \beta}\right)\bigg|_{\beta^{old}}\right]^{-1}\frac{\partial l(\beta)}{\partial \beta}\bigg|_{\beta^{old}}$$

Let's work out the details hidden in the above equation. In the process we'll learn a bit about vector differentiation etc.