Linear Methods for Classification


Page 1: Linear Methods for Classification


Linear Methods for Classification

Lecture Notes for CMPUT 466/551

Nilanjan Ray

Page 2: Linear Methods for Classification


Linear Classification

• What is meant by linear classification?
– The decision boundaries in the feature (input) space are linear

• Should the regions be contiguous?

[Figure: four regions R1–R4 in the (X1, X2) plane — piecewise linear decision boundaries in 2D input space]

Page 3: Linear Methods for Classification


Linear Classification…

• There is a discriminant function $\delta_k(x)$ for each class k

• Classification rule: $R_k = \{x : k = \arg\max_j \delta_j(x)\}$

• In higher dimensional space the decision boundaries are piecewise hyperplanar

• Remember that the 0-1 loss function led to the classification rule: $R_k = \{x : k = \arg\max_j \Pr(G=j \mid X=x)\}$

• So, $\Pr(G=k \mid X=x)$ can serve as $\delta_k(x)$

Page 4: Linear Methods for Classification


Linear Classification…

• All we require here is that the class boundaries $\{x : \delta_k(x) = \delta_j(x)\}$ be linear for every (k, j) pair

• One can achieve this if the $\delta_k(x)$ themselves are linear, or if any monotone transform of $\delta_k(x)$ is linear. An example:

$$\Pr(G=1 \mid X=x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}, \qquad \Pr(G=2 \mid X=x) = \frac{1}{1 + \exp(\beta_0 + \beta^T x)}$$

So that

$$\log\left[\frac{\Pr(G=1 \mid X=x)}{\Pr(G=2 \mid X=x)}\right] = \beta_0 + \beta^T x \qquad \text{(Linear)}$$
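The linear log-odds can be checked numerically. Below is a minimal sketch (Python/NumPy) with hypothetical values for beta0 and beta; the two posteriors sum to 1 and their log-ratio reproduces $\beta_0 + \beta^T x$:

```python
import numpy as np

# Minimal sketch of the two-class logit example above; beta0 and beta
# are hypothetical parameter values, not from the lecture.
beta0, beta = -0.5, np.array([2.0, -1.0])

def posteriors(x):
    """Return (Pr(G=1|x), Pr(G=2|x)) for the two-class model."""
    z = beta0 + beta @ x
    p1 = np.exp(z) / (1.0 + np.exp(z))
    return p1, 1.0 - p1

x = np.array([0.3, 1.2])
p1, p2 = posteriors(x)
print(p1 + p2)               # 1.0
print(np.log(p1 / p2))       # equals beta0 + beta @ x: linear in x
print(beta0 + beta @ x)
```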

Page 5: Linear Methods for Classification


Linear Classification as a Linear Regression

2D input space: $X = (X_1, X_2)$. Number of classes/categories K = 3, so output $Y = (Y_1, Y_2, Y_3)$. Training sample, size N = 5:

$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \\ 1 & x_{41} & x_{42} \\ 1 & x_{51} & x_{52} \end{pmatrix}, \qquad \mathbf{Y} = \begin{pmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \\ y_{31} & y_{32} & y_{33} \\ y_{41} & y_{42} & y_{43} \\ y_{51} & y_{52} & y_{53} \end{pmatrix}$$

$\mathbf{Y}$ is the indicator matrix: each row has exactly one 1, indicating the category/class.

Regression output:

$$\hat{Y}((x_1, x_2)) = (1\ x_1\ x_2)\,(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$

so that, writing $\hat{\beta}_k$ for the k-th column of $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$,

$$\hat{Y}_1((x_1,x_2)) = (1\ x_1\ x_2)\,\hat{\beta}_1, \quad \hat{Y}_2((x_1,x_2)) = (1\ x_1\ x_2)\,\hat{\beta}_2, \quad \hat{Y}_3((x_1,x_2)) = (1\ x_1\ x_2)\,\hat{\beta}_3$$

Or, classification rule:

$$\hat{G}((x_1,x_2)) = \arg\max_k \hat{Y}_k((x_1,x_2))$$
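A minimal NumPy sketch of this procedure on a made-up training set of N = 5 points with K = 3 classes, mirroring the slide; np.linalg.lstsq stands in for the explicit $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ computation:

```python
import numpy as np

# Made-up training set: N = 5 points in 2D, K = 3 classes (labels 0, 1, 2).
X = np.array([[1, 0.5, 1.2],   # each row: (1, x1, x2)
              [1, 2.0, 0.3],
              [1, 1.1, 2.2],
              [1, 3.0, 1.0],
              [1, 0.2, 0.8]])
g = np.array([0, 1, 2, 1, 0])

Y = np.eye(3)[g]               # indicator matrix: exactly one 1 per row

# Least-squares coefficients; column k of Bhat is beta_hat_k
Bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)

def classify(x1, x2):
    yhat = np.array([1.0, x1, x2]) @ Bhat   # (Yhat_1, Yhat_2, Yhat_3)
    return int(np.argmax(yhat))             # Ghat = argmax_k Yhat_k

print(classify(0.4, 1.0))
```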

Page 6: Linear Methods for Classification


The Masking

Linear regression of the indicator matrix can lead to masking; LDA can avoid this masking.

[Figure: 2D input space with three classes; the three fitted regression outputs $\hat{Y}_1 = (1\ x_1\ x_2)\hat{\beta}_1$, $\hat{Y}_2 = (1\ x_1\ x_2)\hat{\beta}_2$, $\hat{Y}_3 = (1\ x_1\ x_2)\hat{\beta}_3$ are plotted along a viewing direction; the middle class is masked because its fitted value is never the maximum]

Page 7: Linear Methods for Classification


Linear Discriminant Analysis

Posterior probability (application of Bayes rule):

$$\Pr(G = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

$\pi_k$ is the prior probability for class k; $f_k(x)$ is the class conditional density or likelihood.

LDA is essentially the minimum-error Bayes' classifier. It assumes that the class conditional densities are (multivariate) Gaussian, with equal covariance for every class:

$$f_k(x) = \frac{1}{(2\pi)^{p/2}|\mathbf{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \mathbf{\Sigma}^{-1} (x - \mu_k)\right)$$

Page 8: Linear Methods for Classification


LDA…

$$\log\frac{\Pr(G=k \mid X=x)}{\Pr(G=l \mid X=x)} = \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} = \log\frac{\pi_k}{\pi_l} - \frac{1}{2}\mu_k^T\mathbf{\Sigma}^{-1}\mu_k + \frac{1}{2}\mu_l^T\mathbf{\Sigma}^{-1}\mu_l + x^T\mathbf{\Sigma}^{-1}(\mu_k - \mu_l)$$

This log-ratio is linear in x, so we may take as discriminant functions $\delta_k(x)$, $\delta_l(x)$:

$$\delta_k(x) = x^T\mathbf{\Sigma}^{-1}\mu_k - \frac{1}{2}\mu_k^T\mathbf{\Sigma}^{-1}\mu_k + \log\pi_k$$

Classification rule:

$$\hat{G}(x) = \arg\max_k \delta_k(x)$$

is equivalent to:

$$\hat{G}(x) = \arg\max_k \Pr(G=k \mid X=x)$$

The good old Bayes classifier!

Page 9: Linear Methods for Classification


LDA…

Training data: $(x_i, g_i),\ i = 1, \ldots, N$. Total N input-output pairs; $N_k$ = number of pairs in class k; total number of classes: K.

When are we going to use the training data? It is utilized to estimate:

Prior probabilities: $\hat{\pi}_k = N_k / N$

Means: $\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$

Covariance matrix: $\hat{\mathbf{\Sigma}} = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$
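These estimates plug directly into the discriminant $\delta_k(x)$ from the previous slide. A minimal sketch, assuming made-up two-class Gaussian data:

```python
import numpy as np

def lda_fit(X, g, K):
    """Estimate priors, means, and the pooled covariance from training data."""
    N, p = X.shape
    priors = np.array([(g == k).mean() for k in range(K)])        # Nk / N
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])  # mu_hat_k
    Sigma = np.zeros((p, p))
    for k in range(K):                                            # pooled covariance
        D = X[g == k] - means[k]
        Sigma += D.T @ D
    return priors, means, Sigma / (N - K)

def lda_predict(x, priors, means, Sigma):
    Sinv = np.linalg.inv(Sigma)
    # delta_k(x) = x^T Sinv mu_k - (1/2) mu_k^T Sinv mu_k + log pi_k
    deltas = [x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(pi)
              for m, pi in zip(means, priors)]
    return int(np.argmax(deltas))

rng = np.random.default_rng(0)                        # made-up data
X = np.vstack([rng.normal([0, 0], 1, (20, 2)),
               rng.normal([3, 3], 1, (20, 2))])
g = np.repeat([0, 1], 20)
print(lda_predict(np.array([2.5, 2.5]), *lda_fit(X, g, K=2)))   # expect 1
```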

Page 10: Linear Methods for Classification


LDA: Example

LDA was able to avoid masking here

Page 11: Linear Methods for Classification


Quadratic Discriminant Analysis

• Relaxes the equal-covariance assumption: the class conditional densities (still multivariate Gaussians) are allowed to have different covariance matrices

• The class decision boundaries are then not linear but quadratic:

$$\log\frac{\Pr(G=k \mid X=x)}{\Pr(G=l \mid X=x)} = \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} = \log\frac{\pi_k}{\pi_l} - \frac{1}{2}\log\frac{|\mathbf{\Sigma}_k|}{|\mathbf{\Sigma}_l|} - \frac{1}{2}(x-\mu_k)^T\mathbf{\Sigma}_k^{-1}(x-\mu_k) + \frac{1}{2}(x-\mu_l)^T\mathbf{\Sigma}_l^{-1}(x-\mu_l)$$

so the discriminant functions $\delta_k(x)$, $\delta_l(x)$ are quadratic in x:

$$\delta_k(x) = \log\pi_k - \frac{1}{2}\log|\mathbf{\Sigma}_k| - \frac{1}{2}(x-\mu_k)^T\mathbf{\Sigma}_k^{-1}(x-\mu_k)$$
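A minimal sketch of the QDA discriminant, assuming the per-class parameters (prior $\pi_k$, mean $\mu_k$, covariance $\mathbf{\Sigma}_k$) have already been estimated; the numbers below are made up:

```python
import numpy as np

def qda_delta(x, pi_k, mu_k, Sigma_k):
    """delta_k(x) = log pi_k - 0.5 log|Sigma_k| - 0.5 (x-mu_k)^T Sigma_k^{-1} (x-mu_k)"""
    d = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)    # numerically stable log-determinant
    return np.log(pi_k) - 0.5 * logdet - 0.5 * d @ np.linalg.inv(Sigma_k) @ d

# Two classes with different (made-up) covariances; classify by argmax_k delta_k(x)
params = [(0.5, np.array([0.0, 0.0]), np.eye(2)),
          (0.5, np.array([3.0, 3.0]), np.array([[2.0, 0.5], [0.5, 1.0]]))]
x = np.array([2.0, 2.5])
print(np.argmax([qda_delta(x, *p) for p in params]))
```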

Page 12: Linear Methods for Classification


QDA and Masking

Better than linear regression in terms of handling masking.

Usually computationally more expensive than LDA

Page 13: Linear Methods for Classification


Fisher’s Linear Discriminant[DHS]

From training set we want to find out a direction where the separationbetween the class means is high and overlap between the classes is small

Page 14: Linear Methods for Classification


Fisher’s LD…

Projection of a vector x on a unit vector w: $w^T x$

Geometric interpretation: [Figure: the vector x and its projection $w^T x$ along the unit vector w]

From the training set we want to find a direction w along which the separation between the projections of the class means is high and the overlap between the projections of the classes is small.

Page 15: Linear Methods for Classification


Fisher’s LD…

Class means:

$$m_1 = \frac{1}{N_1}\sum_{x_i \in R_1} x_i, \qquad m_2 = \frac{1}{N_2}\sum_{x_i \in R_2} x_i$$

Projected class means:

$$\tilde{m}_1 = \frac{1}{N_1}\sum_{x_i \in R_1} w^T x_i = w^T m_1, \qquad \tilde{m}_2 = \frac{1}{N_2}\sum_{x_i \in R_2} w^T x_i = w^T m_2$$

Difference between projected class means:

$$\tilde{m}_2 - \tilde{m}_1 = w^T(m_2 - m_1)$$

Scatter of projected data (this will indicate overlap between the classes):

$$\tilde{s}_1^2 = \sum_{y_i : x_i \in R_1} (y_i - \tilde{m}_1)^2 = \sum_{x_i \in R_1} (w^T x_i - w^T m_1)^2 = w^T S_1 w, \qquad S_1 = \sum_{x_i \in R_1} (x_i - m_1)(x_i - m_1)^T$$

$$\tilde{s}_2^2 = \sum_{y_i : x_i \in R_2} (y_i - \tilde{m}_2)^2 = \sum_{x_i \in R_2} (w^T x_i - w^T m_2)^2 = w^T S_2 w, \qquad S_2 = \sum_{x_i \in R_2} (x_i - m_2)(x_i - m_2)^T$$

Page 16: Linear Methods for Classification


Fisher’s LD…

Ratio of the difference of projected means over total scatter (a Rayleigh quotient):

$$r(w) = \frac{(\tilde{m}_2 - \tilde{m}_1)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T S_B w}{w^T S_w w}$$

where

$$S_B = (m_2 - m_1)(m_2 - m_1)^T, \qquad S_w = S_1 + S_2$$

We want to maximize r(w). The solution is

$$w = S_w^{-1}(m_2 - m_1)$$

Page 17: Linear Methods for Classification


Fisher’s LD: Classifier

So far so good. However, how do we get the classifier? All we know at this point is that the direction $w = S_w^{-1}(m_2 - m_1)$ separates the projected data very well.

Since we know that the projected class means are well separated, we can choose the average of the two projected means as a threshold for classification.

Classification rule: x is in R2 if y(x) > 0, else x is in R1, where

$$y(x) = w^T x - \frac{1}{2}(\tilde{m}_1 + \tilde{m}_2) = (m_2 - m_1)^T S_w^{-1}\left(x - \frac{1}{2}(m_1 + m_2)\right)$$
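Putting the pieces together, a minimal sketch of the two-class Fisher LD on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)                 # made-up training data
X1 = rng.normal([0, 0], 1, (30, 2))            # class R1
X2 = rng.normal([4, 1], 1, (30, 2))            # class R2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)      # class means
S1 = (X1 - m1).T @ (X1 - m1)                   # scatter matrices
S2 = (X2 - m2).T @ (X2 - m2)
Sw = S1 + S2

w = np.linalg.solve(Sw, m2 - m1)               # w = Sw^{-1} (m2 - m1)
threshold = 0.5 * (w @ m1 + w @ m2)            # average of projected means

def classify(x):
    return 2 if w @ x - threshold > 0 else 1   # y(x) > 0  =>  R2

print(classify(np.array([3.5, 1.0])))          # expect 2
print(classify(np.array([0.2, -0.3])))         # expect 1
```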

Page 18: Linear Methods for Classification


Fisher’s LD: Multiple Classes

There are k classes $C_1, \ldots, C_k$, with $n_i$ elements in the i-th class.

Compute means for the classes:

$$m_i = \frac{1}{n_i}\sum_{x \in C_i} x$$

Compute the grand mean:

$$m = \frac{1}{n}(n_1 m_1 + \cdots + n_k m_k)$$

Compute the scatter matrices:

$$S_w = \sum_{x \in C_1}(x - m_1)(x - m_1)^T + \cdots + \sum_{x \in C_k}(x - m_k)(x - m_k)^T$$

$$S_B = n_1(m_1 - m)(m_1 - m)^T + \cdots + n_k(m_k - m)(m_k - m)^T$$

Maximize the Rayleigh ratio:

$$r(w) = \frac{w^T S_B w}{w^T S_w w}$$

The solution is the largest eigenvector of $S_w^{-1} S_B$.

At most (k−1) eigenvalues will be non-zero: dimensionality reduction.
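A minimal sketch of the multi-class construction on made-up three-class data in 3D; the discriminant directions come out as eigenvectors of $S_w^{-1} S_B$, and only k − 1 = 2 eigenvalues are (numerically) non-zero:

```python
import numpy as np

rng = np.random.default_rng(2)                 # made-up data: 3 classes in 3D
Xs = [rng.normal(c, 1, (25, 3))
      for c in ([0, 0, 0], [4, 0, 1], [0, 4, 2])]

n = sum(len(X) for X in Xs)
means = [X.mean(axis=0) for X in Xs]
grand = sum(len(X) * m for X, m in zip(Xs, means)) / n     # grand mean

Sw = sum((X - m).T @ (X - m) for X, m in zip(Xs, means))   # within-class scatter
Sb = sum(len(X) * np.outer(m - grand, m - grand)           # between-class scatter
         for X, m in zip(Xs, means))

evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(evals.real)[::-1]
W = evecs.real[:, order[:2]]                   # top k-1 = 2 directions

print(np.round(evals.real[order], 4))          # third eigenvalue ~ 0
print((Xs[0] @ W).shape)                       # 3D data reduced to 2D
```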

Page 19: Linear Methods for Classification


Fisher’s LD and LDA

They become the same when:

(1) Prior probabilities are the same

(2) There is a common covariance matrix for the class conditional densities

(3) Both class conditional densities are multivariate Gaussian

Ex. Show that Fisher's LD classifier and LDA produce the same rule of classification given the above assumptions.

Note: (1) Fisher's LD does not assume Gaussian densities. (2) Fisher's LD can be used for dimension reduction in a multiple-class scenario.

Page 20: Linear Methods for Classification


Logistic Regression

• The output of regression is the posterior probability, i.e., Pr(output | input)

• Always ensures that the sum of the output variables is 1 and each output is non-negative

• A linear classification method

• We need to know about two concepts to understand logistic regression:
– Newton-Raphson method
– Maximum likelihood estimation

Page 21: Linear Methods for Classification

21

Newton-Raphson Method

A technique for solving the non-linear equation f(x) = 0.

Taylor series:

$$f(x_{n+1}) \approx f(x_n) + (x_{n+1} - x_n) f'(x_n)$$

If $x_{n+1}$ is a root or very close to the root, then $f(x_{n+1}) \approx 0$, so:

$$0 = f(x_n) + (x_{n+1} - x_n) f'(x_n)$$

After rearrangement, the rule for iteration:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

Need an initial guess $x_0$.
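A minimal sketch of the iteration; the equation f(x) = x² − 2 and the starting point are assumptions for illustration:

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson: iterate x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0                                # initial guess
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:               # stop when the update is tiny
            break
    return x

# Example: solve x^2 - 2 = 0; converges quickly to sqrt(2)
print(newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0))
```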

Page 22: Linear Methods for Classification


Newton-Raphson in Multi-dimensions

We want to solve the system of equations:

$$f_1(x_1, x_2, \ldots, x_N) = 0,\quad f_2(x_1, x_2, \ldots, x_N) = 0,\quad \ldots,\quad f_N(x_1, x_2, \ldots, x_N) = 0$$

Taylor series:

$$f_j(x + \delta x) \approx f_j(x) + \sum_{k=1}^{N} \frac{\partial f_j}{\partial x_k}\,\delta x_k, \qquad j = 1, \ldots, N$$

After some rearrangement etc., the rule for iteration (need an initial guess):

$$\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}^{n+1} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}^{n} - \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_N}{\partial x_1} & \cdots & \dfrac{\partial f_N}{\partial x_N} \end{pmatrix}^{-1} \begin{pmatrix} f_1(x_1^n, \ldots, x_N^n) \\ \vdots \\ f_N(x_1^n, \ldots, x_N^n) \end{pmatrix}$$

The matrix of partial derivatives (evaluated at the current iterate) is the Jacobian matrix.

Page 23: Linear Methods for Classification


Newton-Raphson: Example

Solve:

$$f_1(x_1, x_2) = x_1^2 + \cos(x_1 x_2) = 0$$
$$f_2(x_1, x_2) = x_2^3 + \sin(x_1 x_2) = 0$$

Iteration rule (need an initial guess):

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}^{n+1} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}^{n} - \begin{pmatrix} 2x_1^n - x_2^n\sin(x_1^n x_2^n) & -x_1^n\sin(x_1^n x_2^n) \\ x_2^n\cos(x_1^n x_2^n) & 3(x_2^n)^2 + x_1^n\cos(x_1^n x_2^n) \end{pmatrix}^{-1} \begin{pmatrix} f_1(x_1^n, x_2^n) \\ f_2(x_1^n, x_2^n) \end{pmatrix}$$
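A minimal sketch of the multi-dimensional iteration rule, using a finite-difference Jacobian and demonstrated on an assumed 2×2 system ($x_1^2 + x_2^2 - 4 = 0$, $x_1 x_2 - 1 = 0$) chosen purely for illustration:

```python
import numpy as np

def F(x):
    # An assumed example system, not the one from the lecture
    return np.array([x[0]**2 + x[1]**2 - 4.0,
                     x[0]*x[1] - 1.0])

def jacobian(F, x, h=1e-7):
    """Finite-difference approximation of the Jacobian J[j, k] = dF_j/dx_k."""
    fx = F(x)
    J = np.empty((len(fx), len(x)))
    for k in range(len(x)):
        xk = x.copy()
        xk[k] += h
        J[:, k] = (F(xk) - fx) / h
    return J

x = np.array([2.0, 0.5])                         # initial guess
for _ in range(20):
    step = np.linalg.solve(jacobian(F, x), F(x))
    x = x - step                                 # x^{n+1} = x^n - J^{-1} F(x^n)
    if np.abs(step).max() < 1e-12:
        break
print(x, F(x))                                   # F(x) ~ 0 at the root
```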

Page 24: Linear Methods for Classification


Maximum Likelihood Parameter Estimation

Let's start with an example. We want to find the unknown parameters, the mean and standard deviation of a Gaussian pdf, given N independent samples from it:

$$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Samples: $x_1, \ldots, x_N$

Form the likelihood function:

$$L(\mu, \sigma) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$

Estimate the parameters that maximize the likelihood function:

$$(\hat{\mu}, \hat{\sigma}) = \arg\max_{\mu,\sigma} L(\mu, \sigma)$$

Let's find out $(\hat{\mu}, \hat{\sigma})$.

Page 25: Linear Methods for Classification


Logistic Regression Model

$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}, \qquad k = 1, \ldots, K-1$$

$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}$$

The method directly models the posterior probabilities as the output of regression

Note that the class boundaries are linear

How can we show this linear nature?

What is the discriminant function for every class in this model?
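A minimal sketch of the model for K = 3 in 2D with made-up parameters; it confirms that the outputs are non-negative and sum to 1, and that the log-odds against class K are linear in x (which answers the question about the linear nature):

```python
import numpy as np

beta0 = np.array([0.2, -0.4])        # made-up intercepts, classes 1..K-1
B = np.array([[1.0, -2.0],
              [0.5,  1.5]])          # made-up beta_k rows, classes 1..K-1

def posterior(x):
    z = np.exp(beta0 + B @ x)        # exp(beta_k0 + beta_k^T x), k < K
    return np.append(z, 1.0) / (1.0 + z.sum())   # classes 1..K-1, then K

x = np.array([0.7, -0.1])
p = posterior(x)
print(p, p.sum())                    # non-negative, sums to 1
# Log-odds vs. class K are linear in x:
print(np.log(p[0] / p[-1]), beta0[0] + B[0] @ x)   # the two agree
```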

Page 26: Linear Methods for Classification


Logistic Regression Computation

Let’s fit the logistic regression model for K=2, i.e., number of classes is 2

Training set: $(x_i, g_i)$, i = 1, …, N; encode the two classes as $y_i \in \{0, 1\}$, and let $x_i$ include a leading 1 so the intercept is absorbed into $\beta$.

Log-likelihood:

$$l(\beta) = \sum_{i=1}^{N}\log\Pr(G = y_i \mid X = x_i) = \sum_{i=1}^{N}\left[y_i\log\Pr(G=1 \mid X=x_i) + (1-y_i)\log\Pr(G=0 \mid X=x_i)\right]$$

$$= \sum_{i=1}^{N}\left[y_i\log\frac{\exp(\beta^T x_i)}{1+\exp(\beta^T x_i)} + (1-y_i)\log\frac{1}{1+\exp(\beta^T x_i)}\right] = \sum_{i=1}^{N}\left[y_i\,\beta^T x_i - \log(1+\exp(\beta^T x_i))\right]$$

We want to maximize the log-likelihood in order to estimate $\beta$.

Page 27: Linear Methods for Classification


Logistic Regression Computation…

Setting the gradient of the log-likelihood to zero:

$$\frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i\left(y_i - \frac{\exp(\beta^T x_i)}{1+\exp(\beta^T x_i)}\right) = 0$$

These are (p+1) non-linear equations. Solve by the Newton-Raphson method:

$$\beta^{new} = \beta^{old} - \left[\mathrm{Jacobian}\left(\frac{\partial l(\beta)}{\partial \beta}\right)\bigg|_{\beta^{old}}\right]^{-1}\frac{\partial l(\beta)}{\partial \beta}\bigg|_{\beta^{old}}$$

Let's work out the details hidden in the above equation. In the process we'll learn a bit about vector differentiation etc.