Linear Classifiers


Page 1: Linear Classifiers

Linear Classifiers

Dept. Computer Science & Engineering,
Shanghai Jiao Tong University

Page 2: Linear Classifiers


Outline

• Linear Regression
• Linear and Quadratic Discriminant Functions
• Reduced Rank Linear Discriminant Analysis
• Logistic Regression
• Separating Hyperplanes

Page 3: Linear Classifiers


Linear Regression

• The response classes are coded by indicator variables: $K$ classes are represented by $K$ indicator variables $Y_k$, $k = 1, 2, \ldots, K$, each indicating membership in one class. The $N$ training instances of the indicator vector form an $N \times K$ indicator response matrix $\mathbf{Y}$.

• For the data set $\mathbf{X}$ there is a map $\mathbf{X} \to \mathbf{Y}$.

• According to the linear regression model, $\hat{f}(x)^T = (1, x^T)\hat{\mathbf{B}}$ with $\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$, so the fitted values are

$$\hat{\mathbf{Y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}.$$

Page 4: Linear Classifiers


Linear Regression

• Given $x$, the classification is

$$\hat{G}(x) = \arg\max_{k \in \mathcal{G}} \hat{f}_k(x).$$

• In another form, a target $t_k$ is constructed for each class, where $t_k$ is the $k$th column of the $K \times K$ identity matrix. The fit minimizes the sum-of-squared-norm criterion

$$\min_{\mathbf{B}} \sum_{i=1}^{N} \left\| y_i - \left[(1,\, x_i^T)\mathbf{B}\right]^T \right\|^2,$$

and the classification is

$$\hat{G}(x) = \arg\min_{k} \left\| \hat{f}(x) - t_k \right\|^2.$$
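To make this concrete, here is a minimal numpy sketch of indicator-matrix linear regression and the argmax classification rule; the synthetic data and all function names are illustrative assumptions, not from the slides.

```python
# Classification by linear regression of an indicator response matrix:
# fit B_hat = argmin_B sum_i ||y_i - [(1, x_i^T) B]^T||^2, then classify
# by G_hat(x) = argmax_k f_k(x).
import numpy as np

def fit_indicator_regression(X, g, K):
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                        # one-hot indicator responses
    X1 = np.hstack([np.ones((N, 1)), X])            # prepend an intercept column
    B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)  # least-squares fit
    return B_hat

def predict(X, B_hat):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return (X1 @ B_hat).argmax(axis=1)              # argmax_k f_k(x)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, (30, 2)) for m in (-2.0, 0.0, 2.0)])
g = np.repeat([0, 1, 2], 30)
B_hat = fit_indicator_regression(X, g, K=3)
print("training error:", np.mean(predict(X, B_hat) != g))
```

With the three class means roughly collinear, as in this toy data, the middle class tends to be masked; that is exactly the failure mode illustrated on the next slides.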

Page 5: Linear Classifiers


Problems of the linear regression

• The data come from three classes in $\mathbb{R}^2$, easily separated by linear decision boundaries.

• The left plot shows the boundaries found by linear regression of the indicator response variables.

• The middle class is completely masked.

The rug plot at the bottom indicates the positions and class membership of each observation; the three curves are the regressions fitted to the three-class indicator variables.

Page 6: Linear Classifiers


Problems of the linear regression

• The left plot shows the boundaries found by linear discriminant analysis; the right shows the regressions fitted to the three-class indicator variables.

Page 7: Linear Classifiers


Linear Discriminant Analysis

• According to the Bayes-optimal classification mentioned in Chapter 2, the posterior probabilities are needed. Assume:

  – $f_k(x)$ is the class-conditional density of $X$ in class $G = k$;
  – $\pi_k$ is the prior probability of class $k$, with $\sum_{k=1}^{K} \pi_k = 1$.

Bayes' theorem gives us the discriminant:

$$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}.$$

Page 8: Linear Classifiers


Linear Discriminant Analysis

• Multivariate Gaussian density:

$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)$$

• Comparing two classes $k$ and $l$, assume a common covariance $\Sigma_k = \Sigma$ for all $k$. Then

$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log\frac{f_k(x)}{f_l(x)} = \log\frac{\pi_k}{\pi_l} - \tfrac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x^T \Sigma^{-1} (\mu_k - \mu_l).$$

Page 9: Linear Classifiers


Linear Discriminant Analysis

• The linear log-odds function above implies that the decision boundary between classes $k$ and $l$ is linear in $x$; in $p$ dimensions, a hyperplane.

• Linear discriminant function:

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log\pi_k$$

So we estimate the parameters $\hat{\pi}_k$, $\hat{\mu}_k$, and $\hat{\Sigma}$ from the training data.

Page 10: Linear Classifiers


Parameter Estimation

$$\hat{\pi}_k = N_k / N, \quad \text{where } N_k \text{ is the number of class-}k\text{ observations};$$

$$\hat{\mu}_k = \sum_{g_i = k} x_i / N_k;$$

$$\hat{\Sigma} = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K).$$

Page 11: Linear Classifiers


LDA Rule

$$\delta_k(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_k - \tfrac{1}{2}\hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + \log\hat{\pi}_k$$

Decision rule: $G(x) = \arg\max_k \delta_k(x)$

Decision boundary: $\{x \mid \delta_k(x) = \delta_l(x)\}$
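To tie the estimation slide and the rule together, here is a minimal numpy sketch; function and variable names are illustrative assumptions.

```python
# LDA: estimate pi_k, mu_k, and the pooled covariance Sigma, then classify
# by G(x) = argmax_k delta_k(x) with the linear discriminant function above.
import numpy as np

def fit_lda(X, g, K):
    N, p = X.shape
    pi = np.array([np.mean(g == k) for k in range(K)])         # N_k / N
    mu = np.array([X[g == k].mean(axis=0) for k in range(K)])  # class means
    Sigma = np.zeros((p, p))
    for k in range(K):                                         # pooled covariance
        Xc = X[g == k] - mu[k]
        Sigma += Xc.T @ Xc
    Sigma /= (N - K)
    return pi, mu, Sigma

def lda_predict(X, pi, mu, Sigma):
    Sinv = np.linalg.inv(Sigma)
    # delta_k(x) = x^T Sinv mu_k - 0.5 mu_k^T Sinv mu_k + log pi_k, vectorized
    delta = X @ Sinv @ mu.T - 0.5 * np.sum((mu @ Sinv) * mu, axis=1) + np.log(pi)
    return delta.argmax(axis=1)
```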

Page 12: Linear Classifiers


Linear Discriminant AnalysisLinear Discriminant Analysis

• Three Gaussian distributions with the same covariance and different means. The Bayes boundaries are shown on the left (solid lines). On the right are the LDA boundaries fitted to a sample of 30 drawn from each Gaussian distribution.

Page 13: Linear Classifiers


Quadratic Discriminant Analysis

• When the covariances differ:

$$\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log\pi_k$$

• This is the quadratic discriminant function.
• The decision boundary is described by a quadratic equation: $\{x : \delta_k(x) = \delta_l(x)\}$.
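A short numpy sketch of this quadratic discriminant function, assuming the per-class parameters have already been estimated (names illustrative):

```python
# Quadratic discriminant function with a class-specific covariance.
import numpy as np

def qda_delta(x, mu_k, Sigma_k, pi_k):
    _, logdet = np.linalg.slogdet(Sigma_k)           # log|Sigma_k|, computed stably
    diff = x - mu_k
    mahal = diff @ np.linalg.solve(Sigma_k, diff)    # (x-mu_k)^T Sigma_k^{-1} (x-mu_k)
    return -0.5 * logdet - 0.5 * mahal + np.log(pi_k)
```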

Page 14: Linear Classifiers


LDA & QDALDA & QDA

• Boundaries on a 3-classes problem found by both the linear discriminant analysis in the original 2-dimensional space XX11, X, X22 (the left) and in a 5-dimensional space XX11, X, X22, X, X1212, X, X11

22, X, X222 2 (the right).

Page 15: Linear Classifiers


LDA & QDA

• Boundaries on the three-class problem found by LDA in the five-dimensional space above (left) and by quadratic discriminant analysis (right).

Page 16: Linear Classifiers


Regularized Discriminant Analysis

• Shrink the separate covariances of QDA toward a common covariance, as in LDA (regularized QDA):

$$\hat{\Sigma}_k(\alpha) = \alpha\hat{\Sigma}_k + (1 - \alpha)\hat{\Sigma}, \quad \alpha \in [0, 1]$$

• $\hat{\Sigma}$ itself may be shrunk toward the scalar covariance (regularized LDA):

$$\hat{\Sigma}(\gamma) = \gamma\hat{\Sigma} + (1 - \gamma)\hat{\sigma}^2 I, \quad \gamma \in [0, 1]$$

• Together: $\hat{\Sigma}(\alpha, \gamma)$.
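A brief numpy sketch of the combined shrinkage; taking $\hat{\sigma}^2 = \mathrm{tr}(\hat{\Sigma})/p$ is one common choice for the scalar estimate, assumed here for illustration.

```python
# Two-level shrinkage Sigma_hat(alpha, gamma): first shrink the pooled
# covariance toward a scalar multiple of the identity, then shrink each
# class covariance toward the result.
import numpy as np

def regularized_covariance(Sigma_k, Sigma, alpha, gamma):
    p = Sigma.shape[0]
    sigma2 = np.trace(Sigma) / p                     # scalar covariance estimate
    Sigma_gamma = gamma * Sigma + (1.0 - gamma) * sigma2 * np.eye(p)
    return alpha * Sigma_k + (1.0 - alpha) * Sigma_gamma
```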

Page 17: Linear Classifiers


Regularized Discriminant Analysis

• Could use $\hat{\Sigma}(\gamma) = \gamma\hat{\Sigma} + (1 - \gamma)\,\mathrm{diag}(\hat{\Sigma})$.

• In recent work on gene-expression arrays, we can use

$$\delta_k(x) = -\tfrac{1}{2}\sum_{j=1}^{p} \frac{(x_j - \hat{u}_{jk})^2}{s_j^2} + \log\pi_k,$$

where the $\hat{u}_{jk}$ are "shrunken centroids".

Page 18: Linear Classifiers


• Test and training errors for the vowel data, using regularized discriminant analysis with a series of values of $\alpha \in [0, 1]$. The optimum for the test data occurs around $\alpha \approx 0.9$, close to quadratic discriminant analysis.

Page 19: Linear Classifiers


Computations for LDA

• The eigendecomposition for each class: $\hat{\Sigma}_k = U_k D_k U_k^T$, where $U_k$ is $p \times p$ orthonormal and $D_k$ is a diagonal matrix of positive eigenvalues $d_{kl}$.

• So the ingredients for $\delta_k(x)$ are:

$$(x - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (x - \hat{\mu}_k) = \left[U_k^T (x - \hat{\mu}_k)\right]^T D_k^{-1} \left[U_k^T (x - \hat{\mu}_k)\right]$$

$$\log|\hat{\Sigma}_k| = \sum_l \log d_{kl}$$
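A minimal sketch of $\delta_k(x)$ computed through this eigendecomposition (names illustrative):

```python
# delta_k(x) via Sigma_k = U_k D_k U_k^T: sphere the point with U_k^T,
# then use the diagonal D_k for both the quadratic form and log|Sigma_k|.
import numpy as np

def delta_k_eig(x, mu_k, Sigma_k, pi_k):
    d, U = np.linalg.eigh(Sigma_k)        # eigenvalues d_kl and orthonormal U_k
    z = U.T @ (x - mu_k)                  # U_k^T (x - mu_k)
    mahal = np.sum(z ** 2 / d)            # [.]^T D_k^{-1} [.]
    logdet = np.sum(np.log(d))            # log|Sigma_k| = sum_l log d_kl
    return -0.5 * logdet - 0.5 * mahal + np.log(pi_k)
```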

Page 20: Linear Classifiers


Reduced Rank LDA

• Let $\hat{\Sigma} = UDU^T$.
• Let $X^* = D^{-1/2}U^T X$; the common covariance estimate of $X^*$ is then the identity.
• LDA: $\delta_k(x^*) = -\tfrac{1}{2}\|x^* - \hat{\mu}_k^*\|^2 + \log\hat{\pi}_k$, i.e. classify to the closest centroid in the sphered space (apart from $\log\hat{\pi}_k$).
• We can project the data onto the $(K-1)$-dimensional subspace spanned by $\hat{\mu}_1^*, \ldots, \hat{\mu}_K^*$ and lose nothing!
• We can project even lower using the principal components of the centroids $\hat{\mu}_k^*$, $k = 1, \ldots, K$.

Page 21: Linear Classifiers


Reduced Rank LDA

• Compute the $K \times p$ matrix of class centroids $M$ and the within-class covariance $W$.
• Compute $M^* = M W^{-1/2}$.
• Compute $B^*$, the covariance matrix of $M^*$, and its eigendecomposition $B^* = V^* D_B V^{*T}$.
• $Z_l = v_l^T X$ with $v_l = W^{-1/2} v_l^*$ is the $l$-th discriminant variable (canonical variable).
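A compact numpy sketch of this recipe, assuming integer class labels; all names are illustrative.

```python
# Reduced-rank LDA: centroids M, within-class covariance W, M* = M W^{-1/2},
# eigendecomposition of B*, and discriminant variables Z_l = v_l^T X.
import numpy as np

def inv_sqrt(W):
    d, U = np.linalg.eigh(W)                  # W = U diag(d) U^T
    return U @ np.diag(d ** -0.5) @ U.T       # W^{-1/2}

def discriminant_variables(X, g, K):
    N, p = X.shape
    M = np.array([X[g == k].mean(axis=0) for k in range(K)])  # K x p centroids
    W = np.zeros((p, p))
    for k in range(K):
        Xc = X[g == k] - M[k]
        W += Xc.T @ Xc
    W /= (N - K)                              # within-class covariance
    Wis = inv_sqrt(W)
    M_star = M @ Wis                          # sphered centroids
    B_star = np.cov(M_star, rowvar=False)     # covariance of M*
    evals, V_star = np.linalg.eigh(B_star)
    V = Wis @ V_star[:, ::-1]                 # v_l = W^{-1/2} v_l*, largest first
    return X @ V                              # columns are Z_l = v_l^T X
```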

Page 22: Linear Classifiers


• A two-dimensional plot of the vowel training data. There are eleven classes with $X \in \mathbb{R}^{10}$, and this is the best view in terms of an LDA model. The heavy circles are the projected mean vectors for each class.

Page 23: Linear Classifiers


• Projections onto different pairs of canonical variates.

Page 24: Linear Classifiers


Reduced Rank LDA

• Although the line joining the centroids defines the direction of greatest centroid spread, the projected data overlap because of the covariance (left panel).

• The discriminant direction minimizes this overlap for Gaussian data (right panel).

Page 25: Linear Classifiers


Fisher's problem

• Find $Z = a^T X$ such that the between-class variance is maximized relative to the within-class variance.

• Maximize the Rayleigh quotient:

$$\max_a \frac{a^T B a}{a^T W a}$$

equivalently, $\max_a a^T B a$ subject to $a^T W a = 1$;

then $\max_a a^T B a$ subject to $a^T W a = 1$ and $a^T W a_1 = 0$; etc.
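Maximizing the Rayleigh quotient leads to the generalized eigenproblem $Ba = \lambda Wa$; a minimal scipy sketch (names illustrative):

```python
# Fisher's problem as the generalized eigenproblem B a = lambda W a: the
# top eigenvectors maximize a^T B a subject to a^T W a = 1 and
# W-orthogonality to the earlier solutions.
import numpy as np
from scipy.linalg import eigh

def fisher_directions(B, W):
    evals, A = eigh(B, W)               # generalized symmetric eigenproblem
    order = np.argsort(evals)[::-1]     # largest Rayleigh quotients first
    return A[:, order], evals[order]
```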

Page 26: Linear Classifiers


• Training and test error rates for the vowel data, as a function of the dimension of the discriminant subspace.

• In this case the best rate is for dimension 2.

Page 27: Linear Classifiers


Page 28: Linear Classifiers


South African Heart Disease Data

Page 29: Linear Classifiers


Logistic Regression

• Model:

$$\log\frac{\Pr(G = 1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{10} + \beta_1^T x$$

$$\log\frac{\Pr(G = 2 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{20} + \beta_2^T x$$

$$\vdots$$

$$\log\frac{\Pr(G = K-1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{(K-1)0} + \beta_{K-1}^T x$$

so that

$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}, \quad k = 1, \ldots, K-1,$$

$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}.$$

• Log-likelihood:

$$\ell(\theta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta)$$
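As a simple illustration for $K = 2$, here is a numpy sketch fitting this model by gradient ascent on the log-likelihood (Newton/IRLS is the standard algorithm; plain gradient steps keep the sketch short; names illustrative):

```python
# Two-class logistic regression fit by gradient ascent on l(theta).
import numpy as np

def fit_logistic(X, y, lr=0.5, n_iter=5000):
    """y in {0, 1}; returns the coefficient vector (beta_0, beta)."""
    N = X.shape[0]
    X1 = np.hstack([np.ones((N, 1)), X])          # intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-X1 @ beta))   # Pr(G = 1 | x)
        beta += lr * X1.T @ (y - prob) / N        # gradient of the log-likelihood
    return beta
```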

Page 30: Linear Classifiers


Logistic Regression / LDA

• For LDA, the log-posterior odds are

$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \log\frac{\pi_k}{\pi_K} - \tfrac{1}{2}(\mu_k + \mu_K)^T \Sigma^{-1} (\mu_k - \mu_K) + x^T \Sigma^{-1} (\mu_k - \mu_K) = \alpha_{k0} + \alpha_k^T x,$$

which has the same linear form as logistic regression:

$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x$$

• The difference lies in how the coefficients are estimated:

  – LDA maximizes the full likelihood based on the joint density $\Pr(X, G = k) = \phi(X; \mu_k, \Sigma)\,\pi_k$.
  – Logistic regression maximizes the conditional likelihood $\Pr(G = k \mid X = x)$, leaving the marginal density of $X$ unspecified.

Page 31: Linear Classifiers


Rosenblatt's Perceptron Learning Algorithm

• Let $M$ be the set of misclassified observations, with responses $y_i \in \{-1, 1\}$.

• $D(\beta, \beta_0) = -\sum_{i \in M} y_i (x_i^T \beta + \beta_0)$ is proportional to the distance of the points in $M$ to the boundary.

• Stochastic gradient descent:

$$\frac{\partial D(\beta, \beta_0)}{\partial \beta} = -\sum_{i \in M} y_i x_i, \qquad \frac{\partial D(\beta, \beta_0)}{\partial \beta_0} = -\sum_{i \in M} y_i$$

$$\begin{pmatrix} \beta \\ \beta_0 \end{pmatrix} \leftarrow \begin{pmatrix} \beta \\ \beta_0 \end{pmatrix} + \rho \begin{pmatrix} y_i x_i \\ y_i \end{pmatrix}$$
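A minimal numpy sketch of this update rule (names illustrative):

```python
# Rosenblatt's perceptron: sweep the data, and for each misclassified
# point apply (beta, beta0) <- (beta, beta0) + rho (y_i x_i, y_i).
import numpy as np

def perceptron(X, y, rho=1.0, n_epochs=1000):
    """y in {-1, +1}; converges to a separating hyperplane if one exists."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(n_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:   # x_i is misclassified
                beta += rho * yi * xi
                beta0 += rho * yi
                mistakes += 1
        if mistakes == 0:                        # no misclassified points left
            break
    return beta, beta0
```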

Page 32: Linear Classifiers


Separating Hyperplanes

• A toy example with two classes separable by a hyperplane. The orange line is the least-squares solution, which misclassifies one of the training points. Also shown are two blue separating hyperplanes found by the perceptron learning algorithm with different random starts.

Page 33: Linear Classifiers


Optimal Separating Hyperplanes

$$\max_{\beta,\, \beta_0,\, \|\beta\| = 1} C$$

subject to

$$y_i (x_i^T \beta + \beta_0) \geq C, \quad i = 1, \ldots, N.$$
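This maximum-margin problem corresponds to the hard-margin linear SVM; as a hedged illustration, scikit-learn's SVC with a very large C approximates the separable case (the data here are synthetic and purely illustrative):

```python
# Approximate the hard-margin optimal separating hyperplane with a linear
# SVM; after rescaling to ||beta|| = 1, the achieved margin equals C above.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

svm = SVC(kernel="linear", C=1e10).fit(X, y)      # ~hard margin
beta, beta0 = svm.coef_[0], svm.intercept_[0]
print("margin C ~", 1.0 / np.linalg.norm(beta))   # margin of the fitted hyperplane
```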