Chapter 8 Discriminant Analysis


8.1 Introduction

Classification is an important issue in multivariate analysis and data mining.

Classification: constructs a model from the training set and the values (class labels) of a classifying attribute, then uses that model to classify new data, i.e., to predict unknown or missing class values.

Classification—A Two-Step Process

Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae

Prediction: classifying future or unknown objects
  Estimate the accuracy of the model
    The known label of each test sample is compared with the classified result from the model
    The accuracy rate is the percentage of test set samples that are correctly classified by the model
    The test set must be independent of the training set; otherwise over-fitting inflates the accuracy estimate
  If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process: Model Construction

Training Data -> Classification Algorithms -> Classifier (Model)

Training Data

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model)

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process: Use the Model in Prediction

The classifier is applied first to the testing data and then to unseen data.

Testing Data

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data

(Jeff, Professor, 4)

Tenured?
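The prediction step can be sketched in a few lines of Python. This is only an illustration of the slide's rule applied to the slide's testing data; the helper name `predict_tenured` and the tuple layout are assumptions made for the example, not part of the original material.

```python
# Rule from the model-construction slide:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank: str, years: int) -> str:
    """Apply the slide's classification rule (hypothetical helper name)."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


# Testing data from the slide: (name, rank, years, known label)
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy rate: share of test tuples whose predicted label matches the known label
correct = sum(
    predict_tenured(rank, years) == label for _, rank, years, label in test_set
)
accuracy = correct / len(test_set)

# Unseen tuple from the slide: (Jeff, Professor, 4)
jeff_prediction = predict_tenured("Professor", 4)

print(f"accuracy on testing data: {accuracy:.2f}")
print(f"Jeff tenured? {jeff_prediction}")
```

On these four test tuples the rule misclassifies only Merlisa (7 years, not tenured), so the accuracy rate is 3/4, and the unseen tuple for Jeff is classified as tenured because his rank is Professor.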


Supervised vs. Unsupervised Learning

Supervised learning (classification)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data are classified based on the training set

Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Discrimination: Introduction

Discrimination is a technique concerned with allocating new observations to previously defined groups.

There are k samples from k distinct populations:

\[
G_1:\ \begin{pmatrix} x^{(1)}_{11} & \cdots & x^{(1)}_{1p} \\ \vdots & & \vdots \\ x^{(1)}_{n_1 1} & \cdots & x^{(1)}_{n_1 p} \end{pmatrix},
\quad \ldots, \quad
G_k:\ \begin{pmatrix} x^{(k)}_{11} & \cdots & x^{(k)}_{1p} \\ \vdots & & \vdots \\ x^{(k)}_{n_k 1} & \cdots & x^{(k)}_{n_k p} \end{pmatrix}
\]

One wants to find the so-called discriminant function and the related rule to classify new observations.

Example 11.3 Bivariate case

Discriminant function and rule

Discriminant function:
\[ w(x) = l'x \]

Rule:
\[ x \in G_1 \ \text{if}\ w(x) \ge a, \qquad x \in G_2 \ \text{if}\ w(x) < a \]

Example 11.1: Riding mowers

Consider two groups in a city: riding-mower owners and those without riding mowers. In order to identify the best sales prospects for an intensive sales campaign, a riding-mower manufacturer is interested in classifying families as prospective owners or non-owners on the basis of income and lot size.

x1: income in $1000s; x2: lot size in 1000 ft^2

π1: Riding-mower owners      π2: Nonowners
x1       x2                  x1       x2
60       18.4                75       19.6
85.5     16.8                52.8     20.8
64.8     21.6                64.8     17.2
61.5     20.8                43.2     20.4
87       23.6                84       17.6
110.1    19.2                49.2     17.6
108      17.6                59.4     16
82.8     22.4                66       18.4
69       20                  47.4     16.4
93       20.8                33       18.8
51       22                  51       14
81       20                  63       14.8

Example 11.1: Riding mowers (classification results)

              Classify as
True          G1     G2
G1            10      2
G2             2     10
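The table can be summarized as a single error rate. The short Python sketch below reads the counts off the slide, assuming rows are the true groups and columns the assigned groups; the variable names are illustrative only.

```python
# Classification counts from the slide: rows = true group, columns = classified group
confusion = [
    [10, 2],   # true G1 (owners)
    [2, 10],   # true G2 (nonowners)
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
error_rate = (total - correct) / total

print(f"misclassified {total - correct} of {total}: error rate = {error_rate:.3f}")
```

Because the same 24 families were used to build the rule and to evaluate it, this figure is an optimistic estimate of the error on new families, in line with the earlier remark that the test set should be independent of the training set.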

8.2 Discriminant by Distance

Assume k = 2 for simplicity:
\[ G_1: N_p(\mu_1, \Sigma_1), \qquad G_2: N_p(\mu_2, \Sigma_2) \]

Discriminant function:
\[ w(x) = d^2(x, G_2) - d^2(x, G_1) \]

Rule:
\[ x \in G_1 \ \text{if}\ w(x) \ge 0, \qquad x \in G_2 \ \text{if}\ w(x) < 0 \]

Consider the Mahalanobis distance
\[ d^2(x, G_j) = (x - \mu_j)' \Sigma_j^{-1} (x - \mu_j), \qquad j = 1, 2. \]

When $\Sigma_1 = \Sigma_2 = \Sigma$,
\[
\begin{aligned}
w(x) &= (x - \mu_2)' \Sigma^{-1} (x - \mu_2) - (x - \mu_1)' \Sigma^{-1} (x - \mu_1) \\
     &= 2 \left( x - \frac{\mu_1 + \mu_2}{2} \right)' \Sigma^{-1} (\mu_1 - \mu_2).
\end{aligned}
\]

Let
\[ \bar\mu = \frac{\mu_1 + \mu_2}{2}, \qquad c = \Sigma^{-1} (\mu_1 - \mu_2). \]

When $\Sigma_1 = \Sigma_2 = \Sigma$, $d^2(x, G_2) - d^2(x, G_1) = 2 (x - \bar\mu)' \Sigma^{-1} (\mu_1 - \mu_2)$, and a positive factor does not change the sign. The discriminant function can therefore be
\[ w(x) = (x - \bar\mu)' \Sigma^{-1} (\mu_1 - \mu_2) = c'(x - \bar\mu). \]
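A minimal Python sketch of this linear form, w(x) = c'(x - μ̄), for the bivariate case. The parameter values are made up for illustration, and the helper names (`inv2`, `matvec`, `distance_discriminant`, `classify`) are assumptions of the sketch; only the formula and the rule "G1 if w(x) >= 0" come from the slides.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def inv2(m: Matrix) -> List[List[float]]:
    """Inverse of a 2x2 matrix (enough for a bivariate illustration)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m: Matrix, v: Vector) -> List[float]:
    """Matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def distance_discriminant(x: Vector, mu1: Vector, mu2: Vector, sigma: Matrix) -> float:
    """w(x) = c'(x - mu_bar), c = Sigma^{-1}(mu1 - mu2), mu_bar = (mu1 + mu2)/2."""
    mu_bar = [(a + b) / 2 for a, b in zip(mu1, mu2)]
    c = matvec(inv2(sigma), [a - b for a, b in zip(mu1, mu2)])
    return sum(ci * (xi - mi) for ci, xi, mi in zip(c, x, mu_bar))


def classify(x: Vector, mu1: Vector, mu2: Vector, sigma: Matrix) -> int:
    """Return 1 if w(x) >= 0 (x is allocated to G1), otherwise 2."""
    return 1 if distance_discriminant(x, mu1, mu2, sigma) >= 0 else 2


# Illustrative (made-up) population parameters with a common covariance matrix
mu1, mu2 = [1.0, 0.0], [-1.0, 0.0]
sigma = [[1.0, 0.0], [0.0, 1.0]]

print(classify([0.5, 3.0], mu1, mu2, sigma))    # nearer to mu1 in Mahalanobis distance
print(classify([-0.2, -1.0], mu1, mu2, sigma))  # nearer to mu2
```

With the identity covariance used here the Mahalanobis distance is the ordinary Euclidean distance, so the rule simply picks the nearer mean.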

When $\mu_1, \mu_2, \Sigma$ are unknown, their estimators are
\[
\hat\mu_j = \bar x_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x^{(j)}_i, \quad j = 1, 2, \qquad
\tilde\Sigma = \frac{1}{n_1 + n_2 - 2} (A_1 + A_2),
\]
where
\[ A_j = \sum_{i=1}^{n_j} \left( x^{(j)}_i - \bar x_j \right) \left( x^{(j)}_i - \bar x_j \right)', \qquad j = 1, 2. \]
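These estimators are straightforward to compute. The following Python sketch uses two tiny made-up samples so the result can be checked by hand; the function names are illustrative, and the divisor n1 + n2 - 2 is the usual pooled-covariance choice, assumed here.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(sample: Sequence[Vector]) -> List[float]:
    """Sample mean vector x_bar."""
    n = len(sample)
    return [sum(x[d] for x in sample) / n for d in range(len(sample[0]))]


def scatter_matrix(sample: Sequence[Vector]) -> List[List[float]]:
    """A_j = sum_i (x_i - x_bar)(x_i - x_bar)'."""
    xbar = mean_vector(sample)
    p = len(xbar)
    return [
        [sum((x[r] - xbar[r]) * (x[c] - xbar[c]) for x in sample) for c in range(p)]
        for r in range(p)
    ]


def pooled_covariance(sample1: Sequence[Vector], sample2: Sequence[Vector]) -> List[List[float]]:
    """(A_1 + A_2) / (n_1 + n_2 - 2)."""
    a1, a2 = scatter_matrix(sample1), scatter_matrix(sample2)
    dof = len(sample1) + len(sample2) - 2
    return [[(u + v) / dof for u, v in zip(r1, r2)] for r1, r2 in zip(a1, a2)]


# Two small made-up samples (corners of two squares), easy to verify by hand
g1 = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
g2 = [(4.0, 4.0), (6.0, 4.0), (4.0, 6.0), (6.0, 6.0)]

xbar1, xbar2 = mean_vector(g1), mean_vector(g2)
sigma_tilde = pooled_covariance(g1, g2)

print(xbar1, xbar2)
print(sigma_tilde)
```

Each sample here has scatter matrix 4I, so the pooled estimate is (4I + 4I)/6 = (4/3)I.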

Example: Univariate case with equal variance

\[ G_1: N(\mu_1, \sigma_1^2), \qquad G_2: N(\mu_2, \sigma_2^2), \qquad \sigma_1 = \sigma_2 \]

Taking $\mu_1 < \mu_2$, the distance rule reduces to a single cutting point $a$:
\[ x \in G_1 \ \text{if}\ x \le a, \qquad x \in G_2 \ \text{if}\ x > a, \qquad a = \frac{\mu_1 + \mu_2}{2}. \]

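Example: Univariate case (continued)

\[ G_1: N(\mu_1, \sigma_1^2), \qquad G_2: N(\mu_2, \sigma_2^2) \]

When the variances differ, the cutting point is
\[ a^* = \frac{\mu_1 \sigma_2 + \mu_2 \sigma_1}{\sigma_1 + \sigma_2}, \]
the point at which the standardized distances to the two means are equal, $(a^* - \mu_1)/\sigma_1 = (\mu_2 - a^*)/\sigma_2$. It reduces to $(\mu_1 + \mu_2)/2$ when $\sigma_1 = \sigma_2$.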
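The two cutting points can be compared numerically. The sketch below assumes μ1 < μ2 and uses made-up parameter values; `cutting_point` is a hypothetical helper name.

```python
def cutting_point(mu1: float, mu2: float, sigma1: float, sigma2: float) -> float:
    """a* = (mu1*sigma2 + mu2*sigma1) / (sigma1 + sigma2)."""
    return (mu1 * sigma2 + mu2 * sigma1) / (sigma1 + sigma2)


def classify(x: float, cut: float) -> int:
    """Assuming mu1 < mu2: allocate x to G1 if x <= cut, otherwise to G2."""
    return 1 if x <= cut else 2


# Made-up parameters for illustration
mu1, mu2 = 0.0, 2.0

a_equal = cutting_point(mu1, mu2, 1.0, 1.0)  # equal variances: the midpoint
a_star = cutting_point(mu1, mu2, 1.0, 3.0)   # G2 is more spread out

print(a_equal, a_star)
print(classify(0.8, a_equal), classify(0.8, a_star))
```

With σ2 three times σ1 the cutting point moves from the midpoint toward μ1, so the observation 0.8 is allocated to G1 under equal variances but to G2 once the larger spread of G2 is taken into account.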


8.3 Fisher’s Discriminant Function

Idea: projection, ANOVA

Training samples

\[
\begin{aligned}
G_1 &: N_p(\mu_1, \Sigma): \quad x^{(1)}_1, \ldots, x^{(1)}_{n_1} \\
    &\ \ \vdots \\
G_k &: N_p(\mu_k, \Sigma): \quad x^{(k)}_1, \ldots, x^{(k)}_{n_k}
\end{aligned}
\]

Projecting the data onto a direction $l \in R^p$, the F-statistic is
\[ F(l) = \frac{l'Bl / (k-1)}{l'El / (n-k)}, \]
where
\[
B = \sum_{a=1}^{k} n_a (\bar x_a - \bar x)(\bar x_a - \bar x)', \qquad
E = \sum_{a=1}^{k} \sum_{j=1}^{n_a} \left( x^{(a)}_j - \bar x_a \right) \left( x^{(a)}_j - \bar x_a \right)',
\]
and $n = n_1 + \cdots + n_k$.

To find $l^* \in R^p$ such that
\[ F(l^*) = \max_{l \in R^p} F(l). \]

The solution $l^*$ is the eigenvector associated with the largest eigenvalue of $|B - \lambda E| = 0$.

Discriminant function: $u(x) = l'x$, where $l = l^*$.

(B) Two Populations

\[ B = n_1 (\bar x_1 - \bar x)(\bar x_1 - \bar x)' + n_2 (\bar x_2 - \bar x)(\bar x_2 - \bar x)' \]

Note
\[ \bar x = \frac{n_1 \bar x_1 + n_2 \bar x_2}{n_1 + n_2}. \]

We have $E = A_1 + A_2$ and
\[ B = \frac{n_1 n_2}{n_1 + n_2} (\bar x_1 - \bar x_2)(\bar x_1 - \bar x_2)'. \]

There is only one non-zero eigenvalue of $|B - \lambda E| = 0$, as $\operatorname{rank}(B) = 1$.

The associated eigenvector is $E^{-1}(\bar x_1 - \bar x_2)$.

Discriminant function:
\[ u(x) = (\bar x_1 - \bar x_2)' E^{-1} x = c'x \]

Rule, when $\Sigma_1 = \Sigma_2$:
\[ x \in G_1 \ \text{if}\ u(x) \ge \bar u, \qquad x \in G_2 \ \text{if}\ u(x) < \bar u, \]
where
\[ \bar u = \frac{1}{2} c'(\bar x_1 + \bar x_2). \]

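When $\Sigma_1 \ne \Sigma_2$, the cutting point $\tfrac{1}{2} \hat c'(\bar x_1 + \bar x_2)$ is replaced by
\[ u^* = \frac{\hat\sigma_2\, \hat c' \bar x_1 + \hat\sigma_1\, \hat c' \bar x_2}{\hat\sigma_1 + \hat\sigma_2}, \]
where $\hat c = (A_1 + A_2)^{-1} (\bar x_1 - \bar x_2)$ and
\[
\hat\sigma_1^2 = \frac{1}{n_1 - 1}\, \hat c' A_1 \hat c
= \frac{1}{n_1 - 1} (\bar x_1 - \bar x_2)' (A_1 + A_2)^{-1} A_1 (A_1 + A_2)^{-1} (\bar x_1 - \bar x_2),
\]
\[
\hat\sigma_2^2 = \frac{1}{n_2 - 1}\, \hat c' A_2 \hat c
= \frac{1}{n_2 - 1} (\bar x_1 - \bar x_2)' (A_1 + A_2)^{-1} A_2 (A_1 + A_2)^{-1} (\bar x_1 - \bar x_2).
\]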

Example: Insect Classification

Table 2.1 Data of two species of insects

Species 1

No.   x1     x2     n.g.   c.g.   y
1     6.36   5.24   1      1      2.4713
2     5.92   5.12   1      2      2.3335
3     5.92   5.36   1      1      2.3663
4     6.44   5.64   1      1      2.5481
5     6.40   5.16   1      1      2.4714
6     6.56   5.56   1      1      2.5702
7     6.64   5.36   1      1      2.5650
8     6.68   4.96   1      1      2.5213
9     6.72   5.48   1      1      2.6034
10    6.76   5.60   1      1      2.6309
11    6.72   5.08   1      1      2.5488

Species 2

No.   x1     x2     n.g.   c.g.   y
1     6.00   4.88   2      2      2.3227
2     5.60   4.64   2      2      2.1796
3     5.65   4.96   2      2      2.2343
4     5.76   4.80   2      2      2.2456
5     5.96   5.08   2      2      2.3391
6     5.72   5.04   2      2      2.2674
7     5.64   4.96   2      2      2.2343
8     5.44   4.88   2      2      2.1682
9     5.04   4.44   2      2      1.9977
10    4.56   4.04   2      2      1.8106
11    5.48   4.20   2      2      2.0863
12    5.76   4.80   2      2      2.2456

Note: x1 and x2 are the characteristics of the insects (Hoel, 1947); n.g. means natural group (species); c.g. means the classified group; y is the value of the discriminant function.

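\[
\bar x_1 = \begin{pmatrix} 6.4654 \\ 5.3236 \end{pmatrix}, \qquad
\bar x_2 = \begin{pmatrix} 5.5500 \\ 4.7267 \end{pmatrix}, \qquad
\bar x = \begin{pmatrix} 5.9878 \\ 5.0122 \end{pmatrix}
\]
\[
E = \begin{pmatrix} 2.6765 & 1.2942 \\ 1.2942 & 1.7545 \end{pmatrix}, \qquad
B = \begin{pmatrix} 4.8097 & 3.1364 \\ 3.1364 & 2.0453 \end{pmatrix}
\]

The eigenvalue of $|B - \lambda E| = 0$ is 1.9187 and the associated eigenvector is
\[ E^{-1}(\bar x_1 - \bar x_2) = \begin{pmatrix} 0.2759 \\ 0.1367 \end{pmatrix}. \]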
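These figures can be checked approximately with a short Python sketch. It starts from the rounded values of E, B and the group means printed on the slide, so the results agree with the slide only up to rounding; the 2x2 helper functions are assumptions of the sketch.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]

# Rounded values as printed on the slide
E = [[2.6765, 1.2942], [1.2942, 1.7545]]
B = [[4.8097, 3.1364], [3.1364, 2.0453]]
xbar1 = [6.4654, 5.3236]
xbar2 = [5.5500, 4.7267]


def inv2(m: Matrix) -> List[List[float]]:
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(a: Matrix, b: Matrix) -> List[List[float]]:
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def matvec(m: Matrix, v: Sequence[float]) -> List[float]:
    """Matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


E_inv = inv2(E)

# The roots of |B - lambda E| = 0 are the eigenvalues of E^{-1} B.
M = matmul2(E_inv, B)
trace = M[0][0] + M[1][1]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
largest_eigenvalue = (trace + math.sqrt(trace ** 2 - 4 * det)) / 2

# Fisher direction for two populations: E^{-1} (xbar1 - xbar2)
direction = matvec(E_inv, [a - b for a, b in zip(xbar1, xbar2)])

print(f"largest eigenvalue ~ {largest_eigenvalue:.4f}")
print(f"direction ~ ({direction[0]:.4f}, {direction[1]:.4f})")
```

The second eigenvalue comes out close to zero, which reflects the rank-one structure of B for two populations.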

The discriminant function is
\[ u(x) = 0.2759\, x_1 + 0.1367\, x_2, \qquad x = (x_1, x_2)', \]
and the associated value of each observation is given in the table. The cutting point is 2.3447.

Classification is

              Classify as
True          G1     G2
G1            10      1
G2             0     12

If we use the cutting point 2.3831, obtained with $\hat\sigma_1 = 0.0939$ and $\hat\sigma_2 = 0.1497$, we have the same classification.
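The classification table for the cutting point 2.3447 can be reproduced from Table 2.1 with the following Python sketch. The coefficients and cutting point are the rounded values from the slide, and the data are copied from the table as printed; function and variable names are illustrative.

```python
COEF = (0.2759, 0.1367)  # discriminant coefficients from the slide
CUT = 2.3447             # cutting point from the slide

# (x1, x2) pairs from Table 2.1, as printed
species1 = [
    (6.36, 5.24), (5.92, 5.12), (5.92, 5.36), (6.44, 5.64), (6.40, 5.16), (6.56, 5.56),
    (6.64, 5.36), (6.68, 4.96), (6.72, 5.48), (6.76, 5.60), (6.72, 5.08),
]
species2 = [
    (6.00, 4.88), (5.60, 4.64), (5.65, 4.96), (5.76, 4.80), (5.96, 5.08), (5.72, 5.04),
    (5.64, 4.96), (5.44, 4.88), (5.04, 4.44), (4.56, 4.04), (5.48, 4.20), (5.76, 4.80),
]


def u(x):
    """Value of the discriminant function u(x) = 0.2759*x1 + 0.1367*x2."""
    return COEF[0] * x[0] + COEF[1] * x[1]


def classify(x):
    """Allocate to G1 if u(x) >= cutting point, otherwise to G2."""
    return 1 if u(x) >= CUT else 2


# Rows = true group, columns = classified group
confusion = [[0, 0], [0, 0]]
for true_group, sample in ((1, species1), (2, species2)):
    for x in sample:
        confusion[true_group - 1][classify(x) - 1] += 1

print(confusion)
```

The single misclassified insect is observation 2 of species 1, whose discriminant value (about 2.33) falls just below the cutting point.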


8.4 Bayes’ Discriminant Analysis

A. Idea

There are k populations $G_1, \ldots, G_k$ in $R^p$.

A partition of $R^p$, $R_1, \ldots, R_k$, is determined based on a training sample.

Rule: $x \in G_i$ if $x$ falls into $R_i$.

Loss: $c(j|i)$ is the loss incurred when $x$ is from $G_i$ but falls into $R_j$.

The probability of this misclassification is
\[ P(j|i) = \int_{R_j} p_i(x)\, dx, \]
where $p_i(x)$ is the density of $x \in G_i$.

Expected cost of misclassification is
\[ \mathrm{ECM}(R_1, \ldots, R_k) = \sum_{i=1}^{k} q_i \sum_{j=1}^{k} c(j|i)\, P(j|i), \]
where $q_1, \ldots, q_k$ are prior probabilities.

We want to minimize $\mathrm{ECM}(R_1, \ldots, R_k)$ with respect to $R_1, \ldots, R_k$.
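As a small numerical illustration, the ECM of a given partition can be evaluated once the misclassification probabilities P(j|i) are known. The priors, costs and probabilities below are made up, and `expected_cost` is a hypothetical helper name.

```python
from typing import Sequence


def expected_cost(
    priors: Sequence[float],
    cost: Sequence[Sequence[float]],
    prob: Sequence[Sequence[float]],
) -> float:
    """ECM = sum_i q_i * sum_j c(j|i) * P(j|i).

    cost[i][j] holds c(j|i) and prob[i][j] holds P(j|i), with i the true population.
    """
    return sum(
        q * sum(c * p for c, p in zip(cost[i], prob[i]))
        for i, q in enumerate(priors)
    )


# Made-up two-population example
priors = [0.5, 0.5]
cost = [[0.0, 1.0],   # c(1|1) = 0, c(2|1) = 1
        [1.0, 0.0]]   # c(1|2) = 1, c(2|2) = 0
prob = [[0.9, 0.1],   # P(1|1), P(2|1)
        [0.2, 0.8]]   # P(1|2), P(2|2)

ecm = expected_cost(priors, cost, prob)
print(f"ECM = {ecm:.3f}")
```

With unit costs the ECM is just the overall misclassification probability, here 0.5 * 0.1 + 0.5 * 0.2 = 0.15.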

B. Method

Theorem 6.4.1

Let
\[ h_t(x) = \sum_{i=1,\, i \ne t}^{k} q_i\, p_i(x)\, c(t|i). \]

Then the optimal $R_t$'s are
\[ R_t = \{ x : h_t(x) < h_j(x),\ j \ne t,\ j = 1, \ldots, k \}, \qquad t = 1, \ldots, k. \]

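Corollary 1

Take $c(j|i) = 1 - \delta_{ij}$, where $\delta_{ij} = 1$ if $i = j$ and $0$ if $i \ne j$.

Then
\[ R_t = \{ x : q_t\, p_t(x) > q_j\, p_j(x),\ j \ne t,\ j = 1, \ldots, k \}, \qquad t = 1, \ldots, k. \]

Proof:
\[ h_t(x) = \sum_{i \ne t} q_i\, p_i(x) = \sum_{i=1}^{k} q_i\, p_i(x) - q_t\, p_t(x) = c(x) - q_t\, p_t(x), \]
where $c(x) = \sum_{i=1}^{k} q_i\, p_i(x)$ does not depend on $t$, so the smallest $h_t(x)$ corresponds to the largest $q_t\, p_t(x)$.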
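Under this 0-1 loss the rule is simply "pick the population with the largest q_t p_t(x)". A minimal Python sketch with three made-up univariate normal populations follows; the density helper and all parameter values are assumptions for illustration.

```python
import math
from typing import Callable, Sequence


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def bayes_classify(
    x: float,
    priors: Sequence[float],
    densities: Sequence[Callable[[float], float]],
) -> int:
    """Return the 1-based index t that maximizes q_t * p_t(x)."""
    scores = [q * p(x) for q, p in zip(priors, densities)]
    return scores.index(max(scores)) + 1


# Three made-up populations: N(0, 1), N(2, 1), N(4, 1)
densities = [lambda x, m=m: normal_pdf(x, m, 1.0) for m in (0.0, 2.0, 4.0)]

equal_priors = [1 / 3, 1 / 3, 1 / 3]
skewed_priors = [0.05, 0.90, 0.05]

print(bayes_classify(0.9, equal_priors, densities))   # nearest mean decides
print(bayes_classify(0.9, skewed_priors, densities))  # a large prior can override distance
```

With equal priors the observation 0.9 goes to the first population, whose mean is nearest; with a prior of 0.90 on the second population the same observation is allocated there instead.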

Corollary 2

In the case of k = 2,
\[ h_1(x) = q_2\, p_2(x)\, c(1|2), \qquad h_2(x) = q_1\, p_1(x)\, c(2|1), \]
we have
\[
\begin{aligned}
R_1 &= \{ x : q_2\, p_2(x)\, c(1|2) \le q_1\, p_1(x)\, c(2|1) \}, \\
R_2 &= \{ x : q_2\, p_2(x)\, c(1|2) > q_1\, p_1(x)\, c(2|1) \}.
\end{aligned}
\]

Discriminant function:
\[ u(x) = \frac{p_1(x)}{p_2(x)} \]

Rule:
\[ x \in G_1 \ \text{if}\ u(x) \ge d, \qquad x \in G_2 \ \text{if}\ u(x) < d, \]
where
\[ d = \frac{q_2\, c(1|2)}{q_1\, c(2|1)}. \]

Corollary 3

In the case of k = 2 and
\[
x \sim \begin{cases} N_p(\mu_1, \Sigma) & \text{if } x \in G_1, \\ N_p(\mu_2, \Sigma) & \text{if } x \in G_2, \end{cases}
\]
we have
\[ u(x) = \frac{p_1(x)}{p_2(x)} = \exp\{ w(x) \}, \]
where
\[ w(x) = \left( x - \frac{\mu_1 + \mu_2}{2} \right)' \Sigma^{-1} (\mu_1 - \mu_2). \]

Rule:
\[ x \in G_1 \ \text{if}\ w(x) \ge \ln d, \qquad x \in G_2 \ \text{if}\ w(x) < \ln d, \]
with $d = q_2\, c(1|2) / (q_1\, c(2|1))$.
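The normal-theory rule can be sketched in Python for the bivariate case. Population parameters, priors and costs below are made up for illustration, and the helper names are assumptions; the rule itself (compare w(x) with ln d) is the one stated above.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def inv2(m: Matrix) -> List[List[float]]:
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def w(x: Vector, mu1: Vector, mu2: Vector, sigma: Matrix) -> float:
    """w(x) = (x - (mu1 + mu2)/2)' Sigma^{-1} (mu1 - mu2)."""
    s_inv = inv2(sigma)
    diff = [a - b for a, b in zip(mu1, mu2)]
    c = [sum(s_inv[i][j] * diff[j] for j in range(2)) for i in range(2)]
    centered = [xi - (a + b) / 2 for xi, a, b in zip(x, mu1, mu2)]
    return sum(ci * zi for ci, zi in zip(c, centered))


def bayes_normal_classify(
    x: Vector, mu1: Vector, mu2: Vector, sigma: Matrix,
    q1: float, q2: float, c12: float = 1.0, c21: float = 1.0,
) -> int:
    """Allocate x to G1 if w(x) >= ln d, with d = q2*c(1|2) / (q1*c(2|1))."""
    d = (q2 * c12) / (q1 * c21)
    return 1 if w(x, mu1, mu2, sigma) >= math.log(d) else 2


# Made-up parameters
mu1, mu2 = [1.0, 0.0], [-1.0, 0.0]
sigma = [[1.0, 0.0], [0.0, 1.0]]
x = [0.1, 0.0]

print(bayes_normal_classify(x, mu1, mu2, sigma, q1=0.5, q2=0.5))  # threshold ln 1 = 0
print(bayes_normal_classify(x, mu1, mu2, sigma, q1=0.1, q2=0.9))  # threshold ln 9
```

With equal priors and costs the threshold is ln 1 = 0 and the rule coincides with the distance rule of Section 8.2; a larger prior or misclassification cost on the G2 side raises the threshold and shrinks the region allocated to G1.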

C. Example 11.3: Detection of hemophilia A carriers

To construct a procedure for detecting potential hemophilia A carriers, blood samples were assayed for two groups of women and measurements on two variables were recorded. The first group of 30 women was selected from a population of women who did not carry the hemophilia gene. This group was called the normal group. The second group of 22 women was selected from known hemophilia A carriers. This group was called the obligatory carriers.

Variables:
  log10(AHF activity)
  log10(AHF-like antigen)

Populations:
  population of women who did not carry the hemophilia gene (n1 = 30)
  population of women who are known hemophilia A carriers (n2 = 45)

Data set

Normal

log10(AHF activity)
-0.0056 -0.1698 -0.3469 -0.0894 -0.1679 -0.0836 -0.1979 -0.0762 -0.1913 -0.1092
-0.5268 -0.0842 -0.0225 0.0084 -0.1827 0.1237 -0.4702 -0.1519 0.0006 -0.2015
-0.1932 0.1507 -0.1259 -0.1551 -0.1952 0.0291 -0.228 -0.0997 -0.1972 -0.0867

log10(AHF-like antigen)
-0.1657 -0.1585 -0.1879 0.0064 0.0713 0.0106 -0.0005 0.0392 -0.2123 -0.119
-0.4773 0.0248 -0.058 0.0782 -0.1138 0.214 -0.3099 -0.0686 -0.1153 -0.0498
-0.2293 0.0933 -0.0669 -0.1232 -0.1007 0.0442 -0.171 -0.0733 -0.0607 -0.056

Obligatory carrier

log10(AHF activity)
-0.3478 -0.3618 -0.4986 -0.5015 -0.1326 -0.6911 -0.3608 -0.4535 -0.3479 -0.3539
-0.4719 -0.361 -0.3226 -0.4319 -0.2734 -0.5573 -0.3755 -0.495 -0.5107 -0.1652
-0.2447 -0.4232 -0.2375 -0.2205 -0.2154 -0.3447 -0.254 -0.3778 -0.4046 -0.0639
-0.3351 -0.0149 -0.0312 -0.174 -0.1416 -0.1508 -0.0964 -0.2642 -0.0234 -0.3352
-0.1878 -0.1744 -0.4055 -0.2444 -0.4784

log10(AHF-like antigen)
0.1151 -0.2008 -0.086 -0.2984 0.0097 -0.339 0.1237 -0.1682 -0.1721 0.0722
-0.1079 -0.0399 0.167 -0.0687 -0.002 0.0548 -0.1865 -0.0153 -0.2483 0.2132
-0.0407 -0.0998 0.2876 0.0046 -0.0219 0.0097 -0.0573 -0.2682 -0.1162 0.1569
-0.1368 0.1539 0.14 -0.0776 0.1642 0.1137 0.0531 0.0867 0.0804 0.0875
0.251 0.1892 -0.2418 0.1614 0.0282

SAS output for Example 11.3 (not recovered in this transcript)