Page 1

Prototype Classification Methods

Fu Chang

Institute of Information Science

Academia Sinica

2788-3799 ext. 1819

[email protected]

Page 2

Types of Prototype Methods

Crisp model (K-means, KM): prototypes are the centers of non-overlapping clusters

Fuzzy model (fuzzy c-means, FCM): prototypes are weighted averages of all samples

Gaussian mixture model (GM): prototypes are the components of a mixture of distributions

Linear discriminant analysis (LDA): prototypes are projected sample means

K-nearest neighbor classifier (K-NN)

Learning vector quantization (LVQ)

Page 3

Prototypes thru Clustering

Given the number k of prototypes, find k clusters whose centers serve as the prototypes

Commonality: all of these methods use an iterative algorithm aimed at decreasing an objective function; they may converge to local minima; and the number k, as well as an initial solution, must be specified

Page 4

Clustering Objectives

The aim of the iterative algorithm is to decrease the value of an objective function

Notations: samples $x_1, x_2, \ldots, x_n$; prototypes $p_1, p_2, \ldots, p_k$

L2-distance: $\|x_i - p_j\|^2 = \sum_{k=1}^{d} (x_{ik} - p_{jk})^2$

Page 5

Objectives (cnt’d)

Crisp objective:

$\sum_{i=1}^{n} \min_{j \in \{1, 2, \ldots, k\}} \|x_i - p_j\|^2$

Fuzzy objective:

$\sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^m \|p_i - x_j\|^2$

Gaussian mixture objective:

$\sum_{i=1}^{n} \log \sum_{j=1}^{k} p(x_i \mid c_j)\, p(c_j)$

Page 6

K-Means Clustering

Page 7

The Algorithm

Initialize k prototype seeds p1, p2, …, pk

Grouping: assign each sample to its nearest prototype, forming non-overlapping clusters out of the samples

Centering: the centers of the clusters become the new prototypes

Repeat the grouping and centering steps until convergence (a sketch follows below)
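As a concrete illustration, here is a minimal NumPy sketch of this grouping/centering loop. The function name, the seeding by sampling k training points, and the movement-based stopping test are choices of this sketch, not prescribed by the slides:

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Grouping/centering loop of K-means on data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Initialize k seeds by sampling k distinct training points
    p = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Grouping: assign each sample to its nearest prototype
        d2 = ((X[:, None, :] - p[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared L2
        labels = d2.argmin(axis=1)
        # Centering: each cluster's mean becomes the new prototype
        new_p = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else p[j]
                          for j in range(k)])
        if np.linalg.norm(new_p - p) < tol:  # prototypes stopped moving
            p = new_p
            break
        p = new_p
    return p, labels
```

Each pass performs one grouping step and one centering step, so the crisp objective can only decrease (until a local minimum is reached).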

Page 8

Justification

Grouping: assigning samples to their nearest prototypes helps to decrease the objective

$\sum_{i=1}^{n} \min_{j \in \{1, 2, \ldots, k\}} \|x_i - p_j\|^2$

Centering: also helps to decrease the above objective, because for any vectors $y_1, \ldots, y_m$ and any $w$,

$\sum_{i=1}^{m} \|y_i - w\|^2 = \sum_{i=1}^{m} \|y_i - \bar{y}\|^2 + m\,\|\bar{y} - w\|^2 \geq \sum_{i=1}^{m} \|y_i - \bar{y}\|^2$

and equality holds only if $w = \bar{y} = \frac{1}{m}\sum_{i=1}^{m} y_i$
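Spelling out the standard expansion behind this identity, around the mean $\bar{y} = \frac{1}{m}\sum_{i=1}^{m} y_i$:

```latex
\begin{aligned}
\sum_{i=1}^{m} \|y_i - w\|^2
  &= \sum_{i=1}^{m} \|(y_i - \bar{y}) + (\bar{y} - w)\|^2 \\
  &= \sum_{i=1}^{m} \|y_i - \bar{y}\|^2
     + 2\Big(\sum_{i=1}^{m} (y_i - \bar{y})\Big)^{T}(\bar{y} - w)
     + m\,\|\bar{y} - w\|^2 \\
  &= \sum_{i=1}^{m} \|y_i - \bar{y}\|^2 + m\,\|\bar{y} - w\|^2
  \;\geq\; \sum_{i=1}^{m} \|y_i - \bar{y}\|^2,
\end{aligned}
```

since $\sum_{i=1}^{m} (y_i - \bar{y}) = 0$; the surplus term $m\,\|\bar{y} - w\|^2$ vanishes exactly when $w = \bar{y}$.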

Page 9

Exercise:

1. Prove that for any group of vectors $y_i$ and any vector $w$, the following inequality is always true:
$\sum_{i=1}^{m} \|y_i - w\|^2 \geq \sum_{i=1}^{m} \|y_i - \bar{y}\|^2$, where $\bar{y} = \frac{1}{m}\sum_{i=1}^{m} y_i$

2. Prove that the equality holds only when $w = \bar{y}$

3. Use this fact to prove that the centering step helps to decrease the objective function

Page 10

Fuzzy c-Means Clustering

Page 11

Crisp vs. Fuzzy Membership

Membership matrix: $U_{c \times n}$

$u_{ij}$ is the grade of membership of sample $j$ with respect to prototype $i$

Crisp membership:

$u_{ij} = \begin{cases} 1, & \text{if } \|p_i - x_j\|^2 = \min_k \|p_k - x_j\|^2 \\ 0, & \text{otherwise} \end{cases}$

Fuzzy membership:

$u_{ij} \in [0, 1]$, with $\sum_{i=1}^{c} u_{ij} = 1, \; j = 1, \ldots, n$
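As a small worked illustration (the toy data here is hypothetical, not from the slides), both membership types can be computed from a matrix of squared distances; the fuzzy formula anticipates the FCM update rule derived later:

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows of X), 2 prototypes (rows of P)
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
P = np.array([[0.0, 0.1], [1.0, 1.0]])
m = 2.0  # fuzzifier, m > 1

d2 = ((P[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # (c, n) squared L2 distances

# Crisp membership: u_ij = 1 iff prototype i is nearest to sample j
U_crisp = (d2 == d2.min(axis=0, keepdims=True)).astype(float)

# Fuzzy membership: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)); each column sums to 1
U_fuzzy = 1.0 / ((d2[:, None, :] / d2[None, :, :]) ** (1.0 / (m - 1))).sum(axis=1)
```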

Page 12

Fuzzy c-means (FCM)

The objective function of FCM is

$J = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m d_{ij}^2 = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m \|p_i - x_j\|^2$

where $m > 1$ is the fuzzifier

Page 13

FCM (Cnt’d)

Introducing the Lagrange multiplier λ with respect to the constraint

$\sum_{i=1}^{c} u_{ij} = 1$

we rewrite the objective function (for each sample $j$) as:

$J = \sum_{i=1}^{c} u_{ij}^m d_{ij}^2 + \lambda \left( \sum_{i=1}^{c} u_{ij} - 1 \right)$

Page 14

FCM (Cnt’d)

Setting the partial derivatives to zero, we obtain

$\frac{\partial J}{\partial u_{ij}} = m\, u_{ij}^{m-1} d_{ij}^2 + \lambda = 0$

$\frac{\partial J}{\partial \lambda} = \sum_{k=1}^{c} u_{kj} - 1 = 0$

Page 15

FCM (Cnt’d)

From the 2nd equation, we obtain $\sum_{k=1}^{c} u_{kj} = 1$

From this fact and the 1st equation, we obtain

$u_{ij} = \left( \frac{-\lambda}{m\, d_{ij}^2} \right)^{\frac{1}{m-1}}$ and $\left( \frac{-\lambda}{m} \right)^{\frac{1}{m-1}} \sum_{k=1}^{c} \left( \frac{1}{d_{kj}^2} \right)^{\frac{1}{m-1}} = 1$

Page 16

FCM (Cnt’d)

Therefore,

$\left( \frac{-\lambda}{m} \right)^{\frac{1}{m-1}} = \frac{1}{\sum_{k=1}^{c} \left( 1 / d_{kj}^2 \right)^{\frac{1}{m-1}}}$

and

$u_{ij} = \frac{\left( 1 / d_{ij}^2 \right)^{\frac{1}{m-1}}}{\sum_{k=1}^{c} \left( 1 / d_{kj}^2 \right)^{\frac{1}{m-1}}}$

Page 17

FCM (Cnt’d)

Together with the 2nd equation, we obtain the updating rule for $u_{ij}$:

$u_{ij} = \frac{\left( 1 / d_{ij}^2 \right)^{\frac{1}{m-1}}}{\sum_{k=1}^{c} \left( 1 / d_{kj}^2 \right)^{\frac{1}{m-1}}} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{kj} \right)^{\frac{2}{m-1}}}$

Page 18

FCM (Cnt’d)

On the other hand, setting the derivative of J with respect to $p_i$ to zero, we obtain

$0 = \frac{\partial J}{\partial p_i} = \frac{\partial}{\partial p_i} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m \|p_i - x_j\|^2 = \sum_{j=1}^{n} u_{ij}^m \frac{\partial}{\partial p_i} \|p_i - x_j\|^2 = \sum_{j=1}^{n} u_{ij}^m \frac{\partial}{\partial p_i} (p_i - x_j)^T (p_i - x_j) = 2 \sum_{j=1}^{n} u_{ij}^m (p_i - x_j)$

Page 19

FCM (Cnt’d)

It follows that

$\frac{\partial J}{\partial p_i} = \sum_{j=1}^{n} u_{ij}^m (p_i - x_j) = 0$

Finally, we obtain the update rule for $p_i$:

$p_i = \frac{\sum_{j=1}^{n} u_{ij}^m\, x_j}{\sum_{j=1}^{n} u_{ij}^m}$

Page 20

FCM (Cnt’d)

To summarize:

$p_i = \frac{\sum_{j=1}^{n} u_{ij}^m\, x_j}{\sum_{j=1}^{n} u_{ij}^m}$

$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{kj} \right)^{\frac{2}{m-1}}}$
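Putting the two update rules together gives the usual alternating loop. A minimal sketch, where the function name, the random membership initialization, the small epsilon guarding against zero distances, and the stopping rule are choices of this sketch:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Alternate the u_ij and p_i update rules until the prototypes settle."""
    rng = np.random.default_rng(seed)
    # Initialize fuzzy memberships U (c x n) with columns summing to 1
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0, keepdims=True)
    P = None
    for _ in range(n_iter):
        Um = U ** m
        # Prototype update: weighted means of all samples
        P_new = Um @ X / Um.sum(axis=1, keepdims=True)
        # Membership update: u_ij = 1 / sum_k (d_ij/d_kj)^(2/(m-1))
        d2 = ((P_new[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) + 1e-12
        U = 1.0 / ((d2[:, None, :] / d2[None, :, :]) ** (1.0 / (m - 1))).sum(axis=1)
        if P is not None and np.linalg.norm(P_new - P) < tol:
            P = P_new
            break
        P = P_new
    return P, U
```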

Page 21

K-means vs. Fuzzy c-means

[Figure: sample points used for the comparison]

Page 22

K-means vs. Fuzzy c-means

[Figure: clustering results, K-means (left) vs. fuzzy c-means (right)]

Page 23

Expectation-Maximization (EM) Algorithm

Page 24

What Is Given

Observed data: X = {x1, x2, …, xn}, each drawn independently from a mixture of probability distributions with the density

$p(x \mid \Theta) = \sum_{k=1}^{m} \alpha_k\, p_k(x \mid \theta_k)$

where

$\Theta = (\alpha_1, \alpha_2, \ldots, \alpha_m, \theta_1, \theta_2, \ldots, \theta_m)$ and $\sum_{k=1}^{m} \alpha_k = 1$

Page 25

Incomplete vs. Complete Data

The incomplete-data log-likelihood is given by:

$\log L(\Theta \mid X) = \sum_{i=1}^{n} \log p(x_i \mid \Theta)$

which is difficult to optimize

The complete-data log-likelihood

$\log L(\Theta \mid X, H) = \log p(X, H \mid \Theta)$

can be handled much more easily, where H is the set of hidden random variables

How do we compute the distribution of H?

Page 26

EM Algorithm

E-Step: first find the expected value

$Q(\Theta, \Theta^{(i-1)}) = E\left[ \log p(X, H \mid \Theta) \mid X, \Theta^{(i-1)} \right] = \sum_{k} \log p(X, k \mid \Theta)\, f(k \mid X, \Theta^{(i-1)})$

where $\Theta^{(i-1)}$ is the current estimate of $\Theta$

M-Step: update the estimate

$\Theta^{(i)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(i-1)})$

Repeat the process until convergence

Page 27

E-M Steps

Page 28

Justification

The expected value (the circled term) is a lower bound of the log-likelihood:

$\log p(X \mid \Theta) = \log \sum_{h} p(X, h \mid \Theta) = \log \sum_{h} q(h)\, \frac{p(X, h \mid \Theta)}{q(h)}$

$\geq \sum_{h} q(h) \log \frac{p(X, h \mid \Theta)}{q(h)}$ (Jensen's inequality)

$= \sum_{h} q(h) \log \frac{p(h \mid X, \Theta)\, p(X \mid \Theta)}{q(h)}$

$= -\sum_{h} q(h) \log \frac{q(h)}{p(h \mid X, \Theta)} + \log p(X \mid \Theta)$ (1)

Page 29

Justification (Cnt’d)

The maximum of the lower bound equals the log-likelihood:

The first term of (1) is the negative relative entropy of q(h) with respect to $p(h \mid X, \Theta)$

The second term, $\log p(X \mid \Theta)$, is a quantity that does not depend on h

We obtain the maximum of (1) when the relative entropy becomes zero, i.e., when $q(h) = p(h \mid X, \Theta)$

With this choice, the first term becomes zero and (1) achieves its upper bound, which is $\log p(X \mid \Theta)$

Page 30

Details of EM Algorithm

Let $\Theta^g = (\alpha_1^g, \alpha_2^g, \ldots, \alpha_m^g, \theta_1^g, \theta_2^g, \ldots, \theta_m^g)$ be the guessed values of $\Theta$

For the given $\Theta^g$, we can compute

$p(k \mid x_i, \Theta^g) = \frac{\alpha_k^g\, p_k(x_i \mid \theta_k^g)}{p(x_i \mid \Theta^g)} = \frac{\alpha_k^g\, p_k(x_i \mid \theta_k^g)}{\sum_{k'=1}^{m} \alpha_{k'}^g\, p_{k'}(x_i \mid \theta_{k'}^g)}$

Page 31

Details (Cnt’d)

We then consider the expected value:

$Q(\Theta, \Theta^g) = \sum_{h} \log L(\Theta \mid X, h)\, p(h \mid X, \Theta^g)$

$= \sum_{k=1}^{m} \sum_{i=1}^{n} \log\left( \alpha_k\, p_k(x_i \mid \theta_k) \right) p(k \mid x_i, \Theta^g)$

$= \sum_{k=1}^{m} \sum_{i=1}^{n} \log(\alpha_k)\, p(k \mid x_i, \Theta^g) + \sum_{k=1}^{m} \sum_{i=1}^{n} \log\left( p_k(x_i \mid \theta_k) \right) p(k \mid x_i, \Theta^g)$

Page 32

Details (Cnt’d)

Lagrangian and partial derivative equation:

$\frac{\partial}{\partial \alpha_k} \left[ \sum_{k=1}^{m} \sum_{i=1}^{n} \log(\alpha_k)\, p(k \mid x_i, \Theta^g) + \lambda \left( \sum_{k} \alpha_k - 1 \right) \right] = 0$

$\sum_{i=1}^{n} \frac{1}{\alpha_k}\, p(k \mid x_i, \Theta^g) + \lambda = 0$ (2)

Page 33

Details (Cnt’d)

From (2), we derive that λ = −n and

$\alpha_k = \frac{1}{n} \sum_{i=1}^{n} p(k \mid x_i, \Theta^g)$

Based on these values, we can derive the optimal $\theta_k$ for $Q(\Theta, \Theta^g)$, of which only the following part involves $\theta_k$:

$E(\Theta, \Theta^g) = \sum_{k=1}^{m} \sum_{i=1}^{n} \log\left( p_k(x_i \mid \theta_k) \right) p(k \mid x_i, \Theta^g)$

Page 34

Exercise:

4. Deduce from (2) that λ = −n and

$\alpha_k = \frac{1}{n} \sum_{i=1}^{n} p(k \mid x_i, \Theta^g)$

Page 35

Gaussian Mixtures

The Gaussian distribution is given by:

$p_k(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_k|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right]$

For Gaussian mixtures (dropping terms that do not depend on $\mu_k$ and $\Sigma_k$),

$E(\Theta, \Theta^g) = \sum_{k=1}^{m} \sum_{i=1}^{n} \left( -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \right) p(k \mid x_i, \Theta^g)$

Page 36

Gaussian Mixtures (Cnt’d)

Partial derivative:

$\frac{\partial E(\Theta, \Theta^g)}{\partial \mu_k} = \sum_{i=1}^{n} \Sigma_k^{-1} (x_i - \mu_k)\, p(k \mid x_i, \Theta^g)$

Setting this to zero, we obtain

$\mu_k = \frac{\sum_{i=1}^{n} x_i\, p(k \mid x_i, \Theta^g)}{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)}$

Page 37

Gaussian Mixtures (Cnt’d)

Taking the derivative of $E(\Theta, \Theta^g)$ with respect to $\Sigma_k$ and setting it to zero, we get

$\Sigma_k = \frac{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)\, (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)}$

(many details are omitted)

Page 38

Gaussian Mixtures (Cnt’d)

To summarize:

$\alpha_k^{new} = \frac{1}{n} \sum_{i=1}^{n} p(k \mid x_i, \Theta^g)$

$\mu_k^{new} = \frac{\sum_{i=1}^{n} x_i\, p(k \mid x_i, \Theta^g)}{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)}$

$\Sigma_k^{new} = \frac{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)\, (x_i - \mu_k^{new})(x_i - \mu_k^{new})^T}{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)}$
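These three updates form the M-step; the responsibilities $p(k \mid x_i, \Theta^g)$ form the E-step. A compact NumPy sketch of one full EM pass for a full-covariance Gaussian mixture (the helper names and the small ridge added to each covariance for numerical stability are choices of this sketch):

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_step(X, alphas, mus, Sigmas):
    """One E-step + M-step for a Gaussian mixture with m components."""
    n, m = len(X), len(alphas)
    # E-step: responsibilities r[i, k] = p(k | x_i, Theta^g)
    r = np.column_stack([alphas[k] * gaussian_pdf(X, mus[k], Sigmas[k])
                         for k in range(m)])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: the three update rules above
    nk = r.sum(axis=0)                          # effective counts per component
    alphas_new = nk / n                         # alpha_k = (1/n) sum_i r_ik
    mus_new = (r.T @ X) / nk[:, None]           # responsibility-weighted means
    Sigmas_new = []
    for k in range(m):
        diff = X - mus_new[k]
        Sigmas_new.append((r[:, k, None] * diff).T @ diff / nk[k]
                          + 1e-6 * np.eye(X.shape[1]))  # small ridge for stability
    return alphas_new, mus_new, np.array(Sigmas_new)
```

Iterating em_step until the parameters stop changing implements the full algorithm; each pass cannot decrease the incomplete-data log-likelihood.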

Page 39

Linear Discriminant Analysis (LDA)

Page 40

Illustration

[Figure: samples of class 1 and class 2 projected onto a projection direction]

Page 41

Definitions

Given: Samples x1, x2, …, xn

Classes: ni of them are of class i, i = 1, 2, …, c

Definition: Sample mean for class i:

$m_i = \frac{1}{n_i} \sum_{x \in \text{class } i} x$

Scatter matrix for class i:

$S_i = \sum_{x \in \text{class } i} (x - m_i)(x - m_i)^T$

Page 42

Scatter Matrices

Total scatter matrix:

$S_{total} = \sum_{i=1}^{n} (x_i - m)(x_i - m)^T$, where m is the mean of all samples

Within-class scatter matrix:

$S_W = \sum_{i=1}^{c} S_i$

Between-class scatter matrix:

$S_B = S_{total} - S_W = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T$

Page 43

Multiple Discriminant Analysis

We seek vectors $w_i$, i = 1, 2, …, c−1

And project the samples x to the (c−1)-dimensional space:

$y = (w_1^T x, w_2^T x, \ldots, w_{c-1}^T x)$

The criterion for $W = (w_1, w_2, \ldots, w_{c-1})$ is

$\max_W W^T S_B W$ subject to $W^T S_W W = 1$, or equivalently $\max_W \frac{W^T S_B W}{W^T S_W W}$

Page 44

Multiple Discriminant Analysis (Cnt’d)

Consider the Lagrangian

$J(W) = W^T S_B W - \lambda \left( W^T S_W W - 1 \right)$

Take the partial derivative

$\frac{\partial J}{\partial W} = 2 S_B W - 2\lambda\, S_W W$

Setting the derivative to zero, we obtain

$S_B W = \lambda\, S_W W$, or $S_B w_i = \lambda_i S_W w_i$

Page 45

Multiple Discriminant Analysis (Cnt’d)

Find the roots of the characteristic equation

$|S_B - \lambda_i S_W| = 0$

as eigenvalues, and then solve

$(S_B - \lambda_i S_W)\, w_i = 0$

for $w_i$ for the largest c−1 eigenvalues

Page 46

LDA Prototypes

The prototype of each class is the mean of the projected samples of that class, where the projection is thru the matrix W

In the testing phase:

All test samples are projected thru the same optimal W

The nearest prototype is the winner
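A minimal sketch of this whole pipeline under the definitions above. The use of scipy.linalg.eigh for the generalized eigenproblem, the small ridge on S_W, and the function names are choices of this sketch:

```python
import numpy as np
from scipy.linalg import eigh

def lda_fit(X, y, n_dirs):
    """Solve S_B w = lambda S_W w; keep the n_dirs largest eigenvectors."""
    classes = np.unique(y)
    m = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)             # within-class scatter
        S_B += len(Xc) * np.outer(mc - m, mc - m)  # between-class scatter
    # Generalized symmetric eigenproblem; the ridge keeps S_W positive definite
    vals, vecs = eigh(S_B, S_W + 1e-8 * np.eye(d))
    W = vecs[:, np.argsort(vals)[::-1][:n_dirs]]   # top n_dirs directions
    # Class prototypes: means of the projected samples of each class
    prototypes = {c: (X[y == c] @ W).mean(axis=0) for c in classes}
    return W, prototypes

def lda_predict(X, W, prototypes):
    """Project test samples thru W and pick the nearest prototype."""
    Z = X @ W
    labels = list(prototypes)
    D = np.stack([np.linalg.norm(Z - prototypes[c], axis=1) for c in labels])
    return np.array(labels)[D.argmin(axis=0)]
```

For c classes one would typically call lda_fit with n_dirs = c − 1, matching the criterion above.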

Page 47

K-Nearest Neighbor (K-NN) Classifier

Page 48

K-NN Classifier

For each test sample x, find the K nearest training samples and classify x according to the vote among those K neighbors

The asymptotic error rate of the nearest-neighbor rule satisfies

$1 - \sum_{k=1}^{K} p_k^2(x) \leq 2\left( 1 - p^*(x) \right) - \frac{K}{K-1} \left( 1 - p^*(x) \right)^2$

where

$p^*(x) = \max_k p_k(x)$

and K here denotes the number of classes

This shows that the error rate is at most twice the Bayes error
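A straightforward sketch of the voting rule (brute-force distance computation and a Counter-based majority vote are choices of this sketch):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, K=3):
    """Classify each test sample by majority vote among its K nearest neighbors."""
    preds = []
    for x in X_test:
        d2 = ((X_train - x) ** 2).sum(axis=1)    # squared L2 to all training samples
        nearest = np.argsort(d2)[:K]             # indices of the K closest
        votes = Counter(y_train[i] for i in nearest)
        preds.append(votes.most_common(1)[0][0]) # winning class label
    return np.array(preds)
```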

Page 49

Learning Vector Quantization (LVQ)

Page 50

LVQ Algorithm

1. Initialize R prototypes for each class: $m_1(k), m_2(k), \ldots, m_R(k)$, where k = 1, 2, …, K

2. Sample a training sample x and find the prototype $m_j(k)$ nearest to x

a) If x and $m_j(k)$ match in class type, $m_j(k) \leftarrow m_j(k) + \epsilon\, (x - m_j(k))$

b) Otherwise, $m_j(k) \leftarrow m_j(k) - \epsilon\, (x - m_j(k))$

3. Repeat step 2, decreasing ε at each iteration (see the sketch after this list)
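A sketch of this procedure (the prototype initialization near each class mean and the linear ε decay schedule are choices of this sketch, not specified by the slides):

```python
import numpy as np

def lvq_train(X, y, R=2, eps0=0.3, n_epochs=20, seed=0):
    """LVQ: pull the nearest prototype toward same-class samples, push it away otherwise."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    # R prototypes per class, seeded near each class mean
    protos = np.vstack([X[y == c].mean(axis=0)
                        + 0.01 * rng.standard_normal((R, X.shape[1]))
                        for c in classes])
    proto_y = np.repeat(classes, R)
    for epoch in range(n_epochs):
        eps = eps0 * (1 - epoch / n_epochs)  # decrease epsilon at each iteration
        for i in rng.permutation(len(X)):
            j = ((protos - X[i]) ** 2).sum(axis=1).argmin()   # nearest prototype
            sign = 1.0 if proto_y[j] == y[i] else -1.0        # match: pull; else: push
            protos[j] += sign * eps * (X[i] - protos[j])
    return protos, proto_y
```

Classification then proceeds as with any prototype method: a test sample takes the class of its nearest prototype.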

Page 51

References

F. Höppner, F. Klawonn, R. Kruse, and T. Runkler, Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition, John Wiley & Sons, 1999.

J. A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," www.cs.berkeley.edu/~daf/appsem/WordsAndPictures/Papers/bilmes98gentle.pdf

T. P. Minka, "Expectation-Maximization as Lower Bound Maximization," www.stat.cmu.edu/~minka/papers/em.html

R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley Interscience, 2001.

T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.