Lecture 13: Multiclass Classification (I)
Wenbin Lu
Department of Statistics
North Carolina State University
Fall 2019
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 1 / 43
Multiclass Classification
General setup
Bayes rule for multiclass problems
under equal costs
under unequal costs
Linear multiclass methods
Linear regression model
Linear discriminant analysis (LDA)
Visualization of LDA
Generalizing LDA
Multiple logistic regression Models
General Setup
Multiclass Problems
Class label Y ∈ {1, ...,K},K ≥ 3.
The classifier f : R^d → {1, ..., K}.
Examples: zip code classification, multiple-type cancer classification, vowel recognition.
The loss function is

L(Y, f(X)) = Σ_{k=1}^K Σ_{l=1}^K C(l, k) I(Y = l, f(X) = k),

where

C(l, k) = cost of classifying a sample in class l to class k.

In general, C(k, k) = 0 for any k = 1, ..., K. The loss can be described as a K × K matrix. For example, for K = 3,

    C = [ 0 1 1
          1 0 1
          1 1 0 ].
General Setup
Terminologies
Prior class probabilities (marginal distribution of Y ):
πk = P(Y = k), k = 1, · · · ,K
Conditional distribution of X given Y : assume that
gk is the conditional density of X given Y = k , k = 1, · · · ,K .
Marginal density of X (a mixture):
g(x) = Σ_{k=1}^K π_k g_k(x).

Posterior class probabilities (conditional dist. of Y given X):

P(Y = k | X = x) = π_k g_k(x) / g(x),   k = 1, ..., K.
Bayes Rule for Multiclass Problems
Bayes Rule
The optimal (Bayes) rule minimizes the average loss over the population:

φ_B = arg min_f E_{X,Y} L(Y, f(X)) = arg min_f E_X E_{Y|X} L(Y, f(X)),

where

E_{Y|X} L(Y, f(X)) = Σ_{k=1}^K I(f(X) = k) Σ_{l=1}^K C(l, k) P(Y = l | X).
Bayes Rule for Multiclass Problems
Bayes Rule Under Equal Costs
If C(k, l) = I(k ≠ l), then R[f] = E_X E_{Y|X} I(Y ≠ f(X)) and

E_{Y|X} L(Y, f(X)) = Σ_{k=1}^K I(f(X) = k) Σ_{l≠k} P(Y = l | X)
                   = Σ_{k=1}^K I(f(X) = k) [1 − P(Y = k | X)].

Given x, the minimizer of E_{Y|X=x} L(Y, f(X)) is f(x) = k*, where

k* = arg min_{k=1,...,K} {1 − P(Y = k | x)}.

The Bayes rule is

φ_B(x) = k*  if  P(Y = k* | x) = max_{k=1,...,K} P(Y = k | x),

which assigns x to the most probable class under P(Y = k | X = x).
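The equal-cost rule above is simply a row-wise arg max over the posterior probabilities. A minimal NumPy sketch (the posterior matrix here is illustrative, not estimated from data):

```python
import numpy as np

def bayes_rule_equal_costs(posterior):
    """Assign each point to its most probable class.

    posterior : (n, K) array whose rows are P(Y = k | X = x_i).
    Returns labels in {1, ..., K}, matching the slides' convention.
    """
    return np.argmax(posterior, axis=1) + 1

post = np.array([[0.5, 0.3, 0.2],
                 [0.1, 0.2, 0.7]])
print(bayes_rule_equal_costs(post))  # -> [1 3]
```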
Bayes Rule for Multiclass Problems
Bayes Rule Under Unequal Costs
For the general loss function, the Bayes rule is

φ_B(x) = k*  if  k* = arg min_{k=1,...,K} Σ_{l=1}^K C(l, k) P(Y = l | x).
There is not a simple analytic form for the solution.
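Although there is no closed form, the rule is easy to evaluate numerically once the posterior and cost matrix are given. A sketch, with an illustrative cost matrix C:

```python
import numpy as np

def bayes_rule_general(posterior, C):
    """Pick the class minimizing the expected cost sum_l C[l, k] P(Y = l | x).

    posterior : (n, K), rows are P(Y = l | X = x_i);
    C         : (K, K), C[l, k] = cost of calling a class-l sample class k.
    """
    expected_cost = posterior @ C        # column k holds the risk of predicting k
    return np.argmin(expected_cost, axis=1) + 1

# With 0-1 costs this reduces to the equal-cost rule (arg max of the posterior).
C01 = 1.0 - np.eye(3)
post = np.array([[0.5, 0.3, 0.2]])
print(bayes_rule_general(post, C01))  # -> [1]
```

Making one row of C large shifts the rule toward the corresponding class, which is how unequal costs change the decision boundaries.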
Bayes Rule for Multiclass Problems
Decision (Discriminating) Functions
For multiclass problems, we generally need to estimate multiple discriminant functions f_k(x), k = 1, ..., K.
Each f_k(x) is associated with class k.
f_k(x) represents the strength of evidence that a sample (x, y) belongs to class k.
The decision rule constructed from the f_k's is

f(x) = k*, where k* = arg max_{k=1,...,K} f_k(x).

The decision boundary of the rule f between classes k and l is

{x : f_k(x) = f_l(x)}.
Traditional Methods for Multiclass Problems
Traditional Methods: Divide and Conquer
Main ideas:
(i) Decompose the multiclass classification problem into multiple binary classification problems.
(ii) Use the majority voting principle (a combined decision from the committee) to predict the label.
Common approaches: simple but effective
One-vs-rest (one-vs-all) approaches
Pairwise (one-vs-one, all-vs-all) approaches
Traditional Methods for Multiclass Problems
One-vs-rest Approach
One of the simplest multiclass classifiers; commonly used with SVMs; also known as the one-vs-all (OVA) approach.
(i) Solve K different binary problems: classify "class k" versus "the rest" for k = 1, ..., K.
(ii) Assign a test sample to the class with the largest (most positive) f_k(x), where f_k(x) is the solution of the kth problem.
Properties:
Very simple to implement; performs well in practice.
Not asymptotically optimal: the decision rule is not Fisher consistent if there is no dominating class (i.e., max_k p_k(x) < 1/2).
Read: Rifkin and Klautau (2004), "In Defense of One-vs-all Classification".
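The two steps above can be sketched in a few lines. Here least squares on ±1 targets stands in for the binary learner (an assumption for illustration; any method returning a real-valued score f_k works):

```python
import numpy as np

def ova_fit(X, y, K):
    """Fit K 'class k vs. rest' scorers; least squares is the stand-in learner."""
    Xd = np.hstack([np.ones((len(X), 1)), X])   # add an intercept column
    cols = []
    for k in range(1, K + 1):
        z = np.where(y == k, 1.0, -1.0)         # +1 for class k, -1 for the rest
        bk, *_ = np.linalg.lstsq(Xd, z, rcond=None)
        cols.append(bk)
    return np.array(cols).T                     # (d+1, K) coefficient matrix

def ova_predict(X, B):
    """Assign each point to the class with the largest (most positive) score."""
    Xd = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(Xd @ B, axis=1) + 1

# Three well-separated, non-collinear classes in R^2:
X = np.array([[0.0, 0.0], [0.2, 0.1], [10.0, 0.0], [9.8, 0.1], [5.0, 8.0], [5.1, 7.9]])
y = np.array([1, 1, 2, 2, 3, 3])
B = ova_fit(X, y, K=3)
print(ova_predict(X, B))
```

Note the non-collinear centroids: with three classes lined up on a line, the linear scorers would mask the middle class, exactly as discussed later for linear regression.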
Traditional Methods for Multiclass Problems
Pairwise Approach
Also known as the all-vs-all (AVA) approach.
(i) Solve K(K−1)/2 different binary problems: classify "class k" versus "class j" for all j ≠ k. Each classifier is denoted g_{kj}.
(ii) For prediction at a point, each classifier is queried once and issues a vote. The class with the maximum number of (weighted) votes is the winner.
Properties:
Training is efficient, since each binary problem is small.
If K is large, there are many problems to solve; e.g., K = 10 requires training 45 binary classifiers.
Simple to implement; performs competitively in practice.
Read: Park and Fürnkranz (2007), "Efficient Pairwise Classification".
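The voting step can be sketched as follows; `predict_pair` is a placeholder for any trained binary rule g_{kj} (here a toy nearest-centroid rule on 1-D data, purely for illustration):

```python
from itertools import combinations
import numpy as np

def pairwise_vote(predict_pair, X, K):
    """Query all K(K-1)/2 binary rules; the class with the most votes wins."""
    votes = np.zeros((len(X), K), dtype=int)
    for k, j in combinations(range(1, K + 1), 2):
        labels = predict_pair(k, j, X)      # each returned label is k or j
        for i, lab in enumerate(labels):
            votes[i, lab - 1] += 1
    return np.argmax(votes, axis=1) + 1

# Toy binary rule: pick the nearer of the two class centroids (1-D inputs).
centroids = {1: 0.0, 2: 5.0, 3: 10.0}
def predict_pair(k, j, X):
    return np.where(np.abs(X - centroids[k]) <= np.abs(X - centroids[j]), k, j)

X = np.array([0.5, 4.8, 9.9])
print(pairwise_vote(predict_pair, X, K=3))  # -> [1 2 3]
```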
Linear Regression Models
Linear Classifier for Multiclass Problems
Assume the decision boundaries are linear. Common linear classification methods:
Multivariate linear regression methods
Linear log-odds (logit) models
Linear discriminant analysis (LDA)
Multiple logistic regression
Separating hyperplanes (explicitly model the boundaries as linear)
Perceptron model, SVMs
Linear Regression Models Multivariate Linear Regression
Coding for Linear Regression
For each class k, code it using an indicator variable Z_k:
Z_k = 1 if Y = k;  Z_k = 0 if Y ≠ k.
The response y is coded by the vector z = (z_1, ..., z_K): if Y = k, only the kth component of z equals 1.
The entire training sample can be coded as an n × K matrix, denoted Z.
Call the matrix Z the indicator response matrix.
Linear Regression Models Multivariate Linear Regression
Multivariate Linear Regression
Consider the multivariate model
Z = XB + E,  or  Z_k = f_k ≡ X b_k + ε_k,  k = 1, ..., K.

Z_k is the kth column of the indicator response matrix Z.
X is the n × (d + 1) input matrix.
E is the n × K matrix of errors.
B is the (d + 1) × K matrix of parameters; b_k is the kth column of B.
Linear Regression Models Multivariate Linear Regression
Ordinary Least Squares Estimates
Minimize the residual sum of squares

RSS(B) = Σ_{k=1}^K Σ_{i=1}^n [z_{ik} − β_{0k} − Σ_{j=1}^d β_{jk} x_{ij}]² = tr[(Z − XB)^T (Z − XB)].

The LS estimate has the same form as in the univariate case:

B̂ = (X^T X)^{−1} X^T Z,   Ẑ = X(X^T X)^{−1} X^T Z ≡ X B̂.
The coefficient b̂_k for Z_k is the same as the univariate OLS estimate (multiple outputs do not affect the OLS estimates).
For a point x, compute f̂(x) = [(1, x^T) B̂]^T and classify it as

f(x) = arg max_{k=1,...,K} f̂_k(x).
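The whole procedure fits in a few lines of NumPy. This is a sketch of the indicator-response fit B̂ = (XᵀX)⁻¹XᵀZ (computed via lstsq for numerical stability) together with the arg max prediction:

```python
import numpy as np

def indicator_fit(X, y, K):
    """Least-squares fit of the n x K indicator response matrix Z on (1, X)."""
    n = len(y)
    Z = np.zeros((n, K))
    Z[np.arange(n), y - 1] = 1.0                 # row i has a 1 in column y_i
    Xd = np.hstack([np.ones((n, 1)), X])         # intercept column of 1's
    B, *_ = np.linalg.lstsq(Xd, Z, rcond=None)
    return B                                     # (d+1) x K

def indicator_predict(X, B):
    Xd = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(Xd @ B, axis=1) + 1

X = np.array([[0.0, 0.0], [0.2, 0.1], [10.0, 0.0], [9.8, 0.1], [5.0, 8.0], [5.1, 7.9]])
y = np.array([1, 1, 2, 2, 3, 3])
B = indicator_fit(X, y, K=3)
F = np.hstack([np.ones((len(X), 1)), X]) @ B
print(np.allclose(F.sum(axis=1), 1.0))  # fitted values sum to 1 with an intercept
```

The row sums equal 1 because the rows of Z sum to 1 and the constant vector lies in the column space of (1, X), illustrating the justification on the next slide.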
Linear Regression Models Multivariate Linear Regression
About Multiple Linear Regression
Justification: each f̂_k is an estimate of the conditional expectation of Y_k, which equals the conditional probability of class k:

E(Y_k | X = x) = Pr(Y = k | X = x).

If the intercept is in the model (a column of 1's in X), then Σ_{k=1}^K f̂_k(x) = 1 for any x.

If the linear assumption is appropriate, then the f̂_k give consistent estimates of the class probabilities, k = 1, ..., K, as the sample size n goes to infinity.
Linear Regression Models Multivariate Linear Regression
Limitations of Linear Regression
There is no guarantee that all f̂_k(x) ∈ [0, 1]; some can be negative or greater than 1.
Linear structure of X ’s can be rigid.
Serious masking problem for K ≥ 3 due to rigid linear structure.
Linear Regression Models Multivariate Linear Regression
[Figure 4.2, Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 4. Left panel: "Linear Regression"; right panel: "Linear Discriminant Analysis"; axes X1 and X2. Caption: The data come from three classes in R² and are easily separated by linear decision boundaries. The right plot shows the boundaries found by linear discriminant analysis. The left plot shows the boundaries found by linear regression of the indicator response variables. The middle class is completely masked (never dominates).]
Linear Regression Models Multivariate Linear Regression
[Figure 4.3, Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 4. Panel titles: "Degree = 1; Error = 0.25" and "Degree = 2; Error = 0.03". Caption: The effects of masking on linear regression in R for a three-class problem. The rug plot at the base indicates the positions and class membership of each observation. The three curves in each panel are the fitted regressions to the three-class indicator variables; for example, for the red class, y_red is 1 for the red observations and 0 for the green and blue. The fits are linear and quadratic polynomials. Above each plot is the training error rate. The Bayes error rate is 0.025 for this problem, as is the LDA error rate.]
Linear Regression Models Multivariate Linear Regression
About Masking Problems
One-dimensional example (see the previous figure):
The data are projected onto the line joining the three centroids.
There is no information in the northwest-southeast direction.
The three curves are the regression fits.
The left panel shows the linear regression fit: the middle class is never dominant!
The right panel shows the quadratic fit: the masking problem is solved!
Linear Regression Models Multivariate Linear Regression
How to Solve Masking Problems
In the previous example with K = 3:
Use quadratic regression rather than linear regression.
Linear regression error: 0.25; quadratic regression error: 0.03; Bayes error: 0.025.
In general:
If K ≥ 3 classes are lined up, we need polynomial terms up to degree K − 1 to resolve the masking problem.
In the worst case, O(d^{K−1}) terms are needed.
Although LDA is also based on linear functions, it does not suffer from masking problems.
Linear Regression Models Multivariate Linear Regression
Large K , Small d Problem
Masking often occurs for large K and small d.
Example: vowel data with K = 11 and d = 10 (shown in two dimensions via the LDA projection of the training data).
A difficult classification problem: the class overlap is considerable.
The best methods achieve about 40% test error.

Method               Training Error   Test Error
Linear regression    0.48             0.67
LDA                  0.32             0.56
QDA                  0.01             0.53
Logistic regression  0.22             0.51
Linear Regression Models Multivariate Linear Regression
[Figure, Elements of Statistical Learning, Chapter 4: "Linear Discriminant Analysis" — the vowel training data plotted in the first two LDA coordinates (Coordinate 1 vs. Coordinate 2 for the training data), with the class centroids marked.]
Linear Regression Models Linear Discriminant Analysis
Model Assumptions for Linear Discriminant Analysis
Model setup:
Let π_k be the prior probability of class k.
Let g_k(x) be the class-conditional density of X in class k.
The posterior probability is

Pr(Y = k | X = x) = π_k g_k(x) / Σ_{l=1}^K π_l g_l(x).

Model assumptions:
Each class density is multivariate Gaussian, N(μ_k, Σ_k).
Further, assume Σ_k = Σ for all k (equal covariance is required!).
Linear Regression Models Linear Discriminant Analysis
Linear Discriminant Function
The log-ratio between classes k and l is

log [Pr(Y = k | X = x) / Pr(Y = l | X = x)]
  = log(π_k/π_l) + log[g_k(x)/g_l(x)]
  = [x^T Σ^{−1} μ_k − (1/2) μ_k^T Σ^{−1} μ_k + log π_k] − [x^T Σ^{−1} μ_l − (1/2) μ_l^T Σ^{−1} μ_l + log π_l].

For each class k, the discriminant function is

f_k(x) = x^T Σ^{−1} μ_k − (1/2) μ_k^T Σ^{−1} μ_k + log π_k.

The quadratic term in x cancels because of the equal covariance matrices. The decision rule is

Ŷ(x) = arg max_{k=1,...,K} f_k(x).
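The discriminant functions transcribe directly into NumPy. A sketch, where μ_k, Σ, and π_k are assumed known or already estimated:

```python
import numpy as np

def lda_discriminants(x, mus, Sigma, priors):
    """f_k(x) = x' Sigma^{-1} mu_k - 0.5 mu_k' Sigma^{-1} mu_k + log pi_k."""
    Sinv = np.linalg.inv(Sigma)
    return np.array([x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(p)
                     for mu, p in zip(mus, priors)])

def lda_classify(x, mus, Sigma, priors):
    return int(np.argmax(lda_discriminants(x, mus, Sigma, priors))) + 1

# With Sigma = I and equal priors this is nearest-centroid classification:
mus = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
print(lda_classify(np.array([0.5, 0.2]), mus, np.eye(2), [1/3, 1/3, 1/3]))  # -> 1
```

Unequal priors shift the boundaries: a point equidistant from two centroids is assigned to the class with the larger π_k.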
Linear Regression Models Linear Discriminant Analysis
[Figure 4.5, Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 4. Caption: The left panel shows three Gaussian distributions with the same covariance and different means, including the contours of constant density enclosing 95% of the probability in each case. The Bayes decision boundaries between each pair of classes are shown (broken straight lines), and the Bayes decision boundaries separating all three classes are the thicker solid lines (a subset of the former). On the right is a sample of 30 drawn from each Gaussian distribution, with the fitted LDA decision boundaries.]
Linear Regression Models Linear Discriminant Analysis
Interpretation of LDA
log Pr(Y = k | X = x) = −(1/2)(x − μ_k)^T Σ^{−1} (x − μ_k) + log π_k + const.

The constant term does not involve x.
If the prior probabilities are equal, LDA classifies x to the class whose centroid is closest to x in squared Mahalanobis distance, d_Σ(x, μ), based on the common covariance matrix.
If d_Σ(x, μ_k) = d_Σ(x, μ_l), the priors determine the classification.

Special case: Σ_k = I for all k:

f_k(x) = −(1/2) ‖x − μ_k‖² + log π_k;

only the Euclidean distance is needed!
Linear Regression Models Linear Discriminant Analysis
Mahalanobis Distance
The Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936.
Def: The Mahalanobis distance of x = (x_1, ..., x_d)^T from a set of points with mean μ = (μ_1, ..., μ_d)^T and covariance matrix Σ is defined as

d_Σ(x, μ) = [(x − μ)^T Σ^{−1} (x − μ)]^{1/2}.

It differs from the Euclidean distance in that
it takes into account the correlations of the data set, and
it is scale-invariant.
It can also be defined as a dissimilarity measure between x and x′ from the same distribution with covariance matrix Σ:

d_Σ(x, x′) = [(x − x′)^T Σ^{−1} (x − x′)]^{1/2}.
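A minimal sketch of the definition, illustrating both properties just listed:

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """d_Sigma(x, mu) = sqrt((x - mu)' Sigma^{-1} (x - mu))."""
    diff = np.asarray(x, float) - np.asarray(mu, float)
    return float(np.sqrt(diff @ np.linalg.inv(Sigma) @ diff))

# With Sigma = I it reduces to the Euclidean distance:
print(mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2)))  # -> 5.0

# Scale invariance: rescaling a coordinate (and Sigma with it) changes nothing.
print(np.isclose(mahalanobis([1.0, 2.0], [0.0, 0.0], np.diag([1.0, 4.0])),
                 mahalanobis([1.0, 4.0], [0.0, 0.0], np.diag([1.0, 16.0]))))
```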
Linear Regression Models Linear Discriminant Analysis
Special Cases
If Σ is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.
If Σ is diagonal, diag(σ_1², ..., σ_d²), the resulting distance measure is called the normalized Euclidean distance:

d_Σ(x, x′) = [Σ_{i=1}^d (x_i − x′_i)² / σ_i²]^{1/2},

where σ_i is the standard deviation of x_i and x′_i.
Linear Regression Models Linear Discriminant Analysis
Normalized Distance
Suppose we want to decide whether x belongs to a class. Intuitively, the decision relies on the distance from x to the class center μ: the closer to μ, the more likely x belongs to the class.
To decide whether a given distance is large or small, we need to know whether the points in the class are spread over a large range or a small range. Statistically, we measure the spread by the standard deviation of the distances of the sample points to μ.
If the distance from x to μ is less than one standard deviation, it is highly probable that x belongs to the class; the further away it is, the less likely that x belongs to the class. This is the concept of the normalized distance.
Linear Regression Models Linear Discriminant Analysis
Motivation of Mahalanobis Distance
The drawback of the normalized distance is that the sample points are assumed to be distributed around μ in a spherical manner.
For non-spherical situations, for instance ellipsoidal ones, the probability that x belongs to the class depends not only on the distance from μ but also on the direction.
In directions where the ellipsoid has a short axis, x must be closer; in directions where the axis is long, x can be further from the center.
The ellipsoid that best represents the class's probability distribution can be estimated from the sample covariance matrix.
Linear Regression Models Linear Discriminant Analysis
Linear Decision Boundary of LDA
Linear discriminant functions:

f_k(x) = β_{1k}^T x + β_{0k},   f_j(x) = β_{1j}^T x + β_{0j}.

The decision boundary between classes k and j is

(β_{1k} − β_{1j})^T x + (β_{0k} − β_{0j}) = β_1^T x + β_0 = 0.

Since β_{1k} = Σ^{−1} μ_k and β_{1j} = Σ^{−1} μ_j, the decision boundary has directional vector

β_1 = Σ^{−1} (μ_k − μ_j).

β_1 is generally not in the direction of μ_k − μ_j.
The discriminant direction β_1 minimizes the class overlap for Gaussian data.
Linear Regression Models Linear Discriminant Analysis
Parameter Estimation in LDA
In practice, we estimate the parameters from the training data:
π̂_k = n_k / n, where n_k is the sample size in class k;
μ̂_k = Σ_{Y_i = k} x_i / n_k.
The sample covariance matrix for the kth class is

S_k = (1 / (n_k − 1)) Σ_{Y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T.

The (unbiased) pooled sample covariance is the weighted average

Σ̂ = Σ_{k=1}^K [(n_k − 1) / Σ_{l=1}^K (n_l − 1)] S_k = Σ_{k=1}^K Σ_{Y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T / (n − K).
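These plug-in estimates can be sketched in NumPy as follows (labels are assumed to be 1, ..., K):

```python
import numpy as np

def lda_estimates(X, y, K):
    """pi_hat_k = n_k/n, mu_hat_k = class means, pooled Sigma_hat with n - K."""
    n, d = X.shape
    priors, mus = [], []
    S = np.zeros((d, d))
    for k in range(1, K + 1):
        Xk = X[y == k]
        priors.append(len(Xk) / n)
        mu = Xk.mean(axis=0)
        mus.append(mu)
        R = Xk - mu
        S += R.T @ R                    # sum of within-class outer products
    return np.array(priors), np.array(mus), S / (n - K)

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 1.0], [12.0, 1.0]])
y = np.array([1, 1, 2, 2])
pi_hat, mu_hat, Sigma_hat = lda_estimates(X, y, K=2)
print(pi_hat, mu_hat[0], Sigma_hat[0, 0])
```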
Linear Regression Models Linear Discriminant Analysis
Implementation of LDA
Compute the eigen-decomposition

Σ̂ = U D U^T.

Sphere the data using Σ̂ via the transformation

X* = D^{−1/2} U^T X.

The common covariance estimate of X* is then the identity matrix.
In the transformed space, classify a point to the closest transformed centroid, adjusting for the π̂_k's.
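A sketch of the sphering step; after the transformation with Σ̂ = UDUᵀ, the sample covariance of the transformed data is (numerically) the identity:

```python
import numpy as np

def sphere(X, Sigma):
    """Return X* with rows x* = D^{-1/2} U' x, where Sigma = U D U'."""
    D, U = np.linalg.eigh(Sigma)           # eigh: Sigma is symmetric
    return (X @ U) / np.sqrt(D)            # divide column j by sqrt(D_j)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.3, 0.7]])
Sigma = np.cov(X, rowvar=False)
Xs = sphere(X, Sigma)
print(np.allclose(np.cov(Xs, rowvar=False), np.eye(3)))  # -> True
```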
Linear Regression Models Reduced-Rank LDA
An Alternative View
Fisher's reduced-rank LDA:
Find the linear combination Z = a^T X such that the between-class variance is maximized relative to the within-class variance.
The first discriminant coordinate solves

max_{a_1} (a_1^T B a_1) / (a_1^T W a_1),

where B is the covariance matrix of the class centroid matrix M (M is K × d) and W is the within-class covariance.
The other discriminant coordinates solve

max_{a_ℓ} (a_ℓ^T B a_ℓ) / (a_ℓ^T W a_ℓ)  subject to  a_ℓ^T W a_j = 0, j = 1, ..., ℓ − 1.

The solution can be obtained by the eigen-decomposition of W^{−1} B.
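The eigen-decomposition step, sketched for given between-class and within-class matrices B and W (W is assumed invertible):

```python
import numpy as np

def discriminant_coordinates(B, W):
    """Directions a maximizing a'Ba / a'Wa: eigenvectors of W^{-1} B,
    returned in order of decreasing eigenvalue."""
    vals, vecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(-vals.real)
    return vals.real[order], vecs.real[:, order]

# With W = I and B = diag(3, 1), the top coordinate is the first axis.
vals, vecs = discriminant_coordinates(np.diag([3.0, 1.0]), np.eye(2))
print(vals)                 # -> [3. 1.]
print(np.abs(vecs[:, 0]))   # -> [1. 0.] (direction is defined up to sign)
```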
Linear Regression Models Reduced-Rank LDA
Dimension Reduction in LDA
LDA allows us to visualize informative low-dimensional projections of the data.
The K centroids in the d-dimensional input space span a subspace H, with dim(H) ≤ K − 1.
Project x* onto H and make the distance comparisons there.
We can ignore distances orthogonal to H, since they contribute equally to each class.
The data can be viewed in H without losing any information; when K ≪ d, this is a dimension reduction!
When K = 3, we can view the data in a two-dimensional plot.
When K > 3, we may wish to find a subspace H_L ⊂ H that is optimal for LDA in some sense.
Linear Regression Models Reduced-Rank LDA
Subspace Optimality
Fisher's optimality criterion for H_L: the projected centroids should be spread out as much as possible in terms of variance.
This amounts to finding the principal component subspaces of the centroids.
These principal components are called discriminant coordinates or canonical variates.
Pick the top-ranked discriminant coordinates; the lower the rank, the less spread out the centroids.
For visualization, we project the data onto the principal components of the K centroids.
Linear Regression Models Reduced-Rank LDA
[Figure 4.8, Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 4: "Linear Discriminant Analysis". Caption: Four projections onto pairs of canonical variates. Notice that as the rank of the canonical variates increases, the centroids become less spread out. In the lower right panel they appear to be superimposed, and the classes are most confused.]
Linear Regression Models Reduced-Rank LDA
[Figure, Elements of Statistical Learning, Chapter 4: "Classification in Reduced Subspace" — the vowel data and class centroids plotted against Canonical Coordinate 1 and Canonical Coordinate 2.]
Linear Regression Models Reduced-Rank LDA
Generalizing Linear Discriminant Analysis
More flexible, more complex decision boundaries (Chapter 12):
FDA: flexible discriminant analysis
allows nonlinear boundaries: recast the LDA problem as a linear regression problem and then use basis expansions to do the discrimination.
PDA: penalized discriminant analysis
selects variables: penalize the coefficients to be smooth or sparse, which is more interpretable;
sparse discriminant analysis (Clemmensen et al., 2011); R package "sparseLDA".
MDA: mixture discriminant analysis
models each class by a mixture of two or more Gaussians with different centroids, where each component Gaussian (within and between classes) shares the same covariance matrix.
Linear Regression Models Multiple Logistic Regression
Multiple Logistic Models
Model the K posterior probabilities by linear functions of x, using logit transformations:

log [Pr(Y = 1 | X = x) / Pr(Y = K | X = x)] = β_{10} + β_1^T x
log [Pr(Y = 2 | X = x) / Pr(Y = K | X = x)] = β_{20} + β_2^T x
...
log [Pr(Y = K−1 | X = x) / Pr(Y = K | X = x)] = β_{(K−1)0} + β_{K−1}^T x.

The choice of denominator in the odds ratios is arbitrary. The parameter vector is

θ = {β_{10}, β_1, ..., β_{(K−1)0}, β_{K−1}}.
Linear Regression Models Multiple Logistic Regression
Probability Estimates
We use the logit transformation to ensure that
the total probabilities sum to one, and
each estimated probability lies in [0, 1]:

p_k(x) ≡ Pr(Y = k | x) = exp(β_{k0} + β_k^T x) / [1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x)],   k = 1, ..., K − 1,

p_K(x) ≡ Pr(Y = K | x) = 1 / [1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x)].

Check: Σ_{k=1}^K p_k(x) = 1 for any x.
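A sketch of these formulas with class K as the reference (the β values below are illustrative, not fitted):

```python
import numpy as np

def multilogit_probs(x, beta0, beta):
    """p_k = exp(eta_k)/(1 + sum_l exp(eta_l)) for k < K; p_K = 1/(1 + sum_l exp(eta_l)).

    beta0 : (K-1,) intercepts; beta : (K-1, d) slopes; class K is the baseline.
    """
    eta = beta0 + beta @ x            # eta_k = beta_k0 + beta_k' x
    e = np.exp(eta)
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)   # (p_1, ..., p_{K-1}, p_K)

# With all coefficients zero, every class gets probability 1/K:
p = multilogit_probs(np.array([1.0, -2.0]), np.zeros(2), np.zeros((2, 2)))
print(p, p.sum())
```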
Linear Regression Models Multiple Logistic Regression
Maximum Likelihood Estimate for Logistic Models
The joint conditional log-likelihood of the y_i given the x_i is

ℓ(θ) = Σ_{i=1}^n log p_{y_i}(θ; x_i).

Using the Newton-Raphson method:
solve iteratively via reweighted least squares (IRLS);
the estimates are consistent if the model is specified correctly.