Prototype Classification Methods
Fu Chang
Institute of Information Science
Academia Sinica
2788-3799 ext. 1819
Types of Prototype Methods
Crisp model (K-means, KM): prototypes are centers of non-overlapping clusters
Fuzzy model (fuzzy c-means, FCM): prototypes are weighted averages of all samples
Gaussian mixture model (GM): each prototype carries one component distribution of a mixture
Linear discriminant analysis (LDA): prototypes are projected sample means
K-nearest neighbor classifier (K-NN)
Learning vector quantization (LVQ)
Prototypes thru Clustering
Given the number k of prototypes, find k clusters whose centers serve as the prototypes
Commonalities:
Use an iterative algorithm aimed at decreasing an objective function
May converge to local minima
The number k, as well as an initial solution, must be specified
Clustering Objectives
The aim of the iterative algorithm is to decrease the value of an objective function
Notations:
Samples: $x_1, x_2, \ldots, x_n$
Prototypes: $p_1, p_2, \ldots, p_k$
L2-distance: $\|x_i - p_j\|^2 = \sum_{k=1}^{d} (x_{ik} - p_{jk})^2$
Objectives (cnt’d)
Crisp objective:
$$\sum_{i=1}^{n} \min_{j \in \{1,\ldots,k\}} \|x_i - p_j\|^2$$
Fuzzy objective:
$$\sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^m\, \|p_i - x_j\|^2$$
Gaussian mixture objective:
$$\sum_{i=1}^{n} \log \sum_{j=1}^{k} p(x_i \mid c_j)\, p(c_j)$$
K-Means Clustering
The Algorithm
Initiate k seed prototypes p1, p2, …, pk
Grouping: assign each sample to its nearest prototype, forming non-overlapping clusters
Centering: the centers of the clusters become the new prototypes
Repeat the grouping and centering steps until convergence
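The grouping and centering loop can be sketched in NumPy as follows (a minimal illustration, not reference code; seeding with the first k samples is an assumption made for reproducibility):

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal K-means: alternate the grouping and centering steps.
    For simplicity the first k samples serve as the initial seeds."""
    p = X[:k].astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Grouping: assign each sample to its nearest prototype
        d = ((X[:, None, :] - p[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Centering: cluster centers become the new prototypes
        new_p = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else p[j] for j in range(k)])
        if np.allclose(new_p, p):
            break  # prototypes stopped moving: converged
        p = new_p
    return p, labels

# two well-separated groups of points
X = np.array([[0., 0.], [10., 10.], [0., 1.], [1., 0.],
              [1., 1.], [10., 11.], [11., 10.], [11., 11.]])
protos, labels = kmeans(X, 2)
# protos ≈ [[0.5, 0.5], [10.5, 10.5]]
```

Each pass computes all sample-to-prototype distances at once via broadcasting; the empty-cluster guard simply keeps the old prototype.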
Justification
Grouping: assigning samples to their nearest prototypes helps to decrease the objective
$$\sum_{i=1}^{n} \min_{j \in \{1,\ldots,k\}} \|x_i - p_j\|^2$$
Centering: also helps to decrease the above objective, because for any vectors $y_1, \ldots, y_m$ and any $w$,
$$\sum_{i=1}^{m} \|y_i - w\|^2 = \sum_{i=1}^{m} \|y_i - \bar{y}\|^2 + m\,\|\bar{y} - w\|^2 \ge \sum_{i=1}^{m} \|y_i - \bar{y}\|^2$$
and equality holds only if
$$w = \bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i$$
Exercise:
1. Prove that for any group of vectors $y_i$, the following inequality is always true:
$$\sum_{i=1}^{m} \|y_i - w\|^2 \ge \sum_{i=1}^{m} \|y_i - \bar{y}\|^2$$
2. Prove that the equality holds only when $w = \bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i$.
3. Use this fact to prove that the centering step helps to decrease the objective function.
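The identity behind this exercise can be checked numerically (a quick sanity check, not a proof; the random vectors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(10, 3))      # an arbitrary group of vectors y_i
w = rng.normal(size=3)            # an arbitrary reference vector w
ybar = y.mean(axis=0)             # the centroid of the y_i

lhs = ((y - w) ** 2).sum()
rhs = ((y - ybar) ** 2).sum() + len(y) * ((ybar - w) ** 2).sum()
# identity: sum_i ||y_i - w||^2 = sum_i ||y_i - ybar||^2 + m ||ybar - w||^2
assert np.isclose(lhs, rhs)
# hence sum_i ||y_i - w||^2 >= sum_i ||y_i - ybar||^2, with equality iff w = ybar
assert lhs >= ((y - ybar) ** 2).sum()
```

The cross term vanishes because $\sum_i (y_i - \bar{y}) = 0$, which is exactly why the centroid minimizes the within-cluster sum of squares.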
Fuzzy c-Means Clustering
Crisp vs. Fuzzy Membership
Membership matrix: $U_{c \times n}$
$u_{ij}$ is the grade of membership of sample $j$ with respect to prototype $i$
Crisp membership:
$$u_{ij} = \begin{cases} 1, & \text{if } \|p_i - x_j\|^2 = \min_k \|p_k - x_j\|^2 \\ 0, & \text{otherwise} \end{cases}$$
Fuzzy membership: $u_{ij} \in [0, 1]$ with
$$\sum_{i=1}^{c} u_{ij} = 1, \quad j = 1, \ldots, n$$
Fuzzy c-means (FCM)
The objective function of FCM is
$$J = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m\, d_{ij}^2 = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m\, \|p_i - x_j\|^2$$
FCM (Cnt’d)
Introducing the Lagrange multiplier λ with respect to the constraint
$$\sum_{i=1}^{c} u_{ij} = 1,$$
we rewrite the objective function as:
$$J = \sum_{i=1}^{c} u_{ij}^m\, d_{ij}^2 + \lambda \left( \sum_{i=1}^{c} u_{ij} - 1 \right)$$
FCM (Cnt’d)
Setting the partial derivatives to zero, we obtain
$$\frac{\partial J}{\partial u_{ij}} = m\, u_{ij}^{m-1}\, d_{ij}^2 + \lambda = 0$$
$$\frac{\partial J}{\partial \lambda} = \sum_{k=1}^{c} u_{kj} - 1 = 0$$
FCM (Cnt’d)
From the 2nd equation, we obtain $\sum_{k=1}^{c} u_{kj} = 1$.
From this fact and the 1st equation, which gives
$$u_{ij} = \left( \frac{-\lambda}{m\, d_{ij}^2} \right)^{\frac{1}{m-1}},$$
we obtain
$$1 = \sum_{k=1}^{c} \left( \frac{-\lambda}{m\, d_{kj}^2} \right)^{\frac{1}{m-1}} = \left( \frac{-\lambda}{m} \right)^{\frac{1}{m-1}} \sum_{k=1}^{c} \left( \frac{1}{d_{kj}^2} \right)^{\frac{1}{m-1}}$$
FCM (Cnt’d)
Therefore,
$$\left( \frac{-\lambda}{m} \right)^{\frac{1}{m-1}} = \frac{1}{\sum_{k=1}^{c} \left( \frac{1}{d_{kj}^2} \right)^{\frac{1}{m-1}}}$$
and
$$u_{ij} = \frac{\left( \frac{1}{d_{ij}^2} \right)^{\frac{1}{m-1}}}{\sum_{k=1}^{c} \left( \frac{1}{d_{kj}^2} \right)^{\frac{1}{m-1}}}$$
FCM (Cnt’d)
Together with the 2nd equation, we obtain the updating rule for $u_{ij}$:
$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d_{ij}^2}{d_{kj}^2} \right)^{\frac{1}{m-1}}}$$
FCM (Cnt’d)
On the other hand, setting the derivative of J with respect to $p_i$ to zero, we obtain
$$0 = \frac{\partial J}{\partial p_i} = \frac{\partial}{\partial p_i} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m\, \|p_i - x_j\|^2 = \sum_{j=1}^{n} u_{ij}^m\, \frac{\partial}{\partial p_i} \|p_i - x_j\|^2 = \sum_{j=1}^{n} u_{ij}^m \cdot 2\,(p_i - x_j)$$
FCM (Cnt’d)
It follows that
$$\sum_{j=1}^{n} u_{ij}^m\, (p_i - x_j) = 0$$
Finally, we can obtain the update rule for $p_i$:
$$p_i = \frac{\sum_{j=1}^{n} u_{ij}^m\, x_j}{\sum_{j=1}^{n} u_{ij}^m}$$
FCM (Cnt’d)
To summarize:
$$p_i = \frac{\sum_{j=1}^{n} u_{ij}^m\, x_j}{\sum_{j=1}^{n} u_{ij}^m}, \qquad u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{kj}} \right)^{\frac{2}{m-1}}}$$
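The two FCM update rules can be alternated in a short NumPy sketch (a minimal illustration under the assumption of a random initial membership matrix; the distance floor is a numerical safeguard, not part of the derivation):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Minimal FCM: alternate the prototype and membership update rules."""
    rng = np.random.default_rng(seed)
    n = len(X)
    u = rng.random((c, n))
    u /= u.sum(axis=0)                       # columns sum to 1
    for _ in range(n_iter):
        um = u ** m
        # p_i = sum_j u_ij^m x_j / sum_j u_ij^m
        p = um @ X / um.sum(axis=1, keepdims=True)
        # d_ij^2 = ||p_i - x_j||^2, floored to avoid division by zero
        d2 = ((p[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)
        # u_ij = 1 / sum_k (d_ij^2 / d_kj^2)^(1/(m-1))
        new_u = 1.0 / (d2 ** (1.0 / (m - 1))
                       * (1.0 / d2 ** (1.0 / (m - 1))).sum(axis=0))
        if np.abs(new_u - u).max() < tol:
            u = new_u
            break
        u = new_u
    return p, u

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.],
              [10., 10.], [10., 11.], [11., 10.], [11., 11.]])
p, u = fuzzy_c_means(X, 2)
```

By construction each column of the returned membership matrix sums to 1, mirroring the constraint handled by the Lagrange multiplier above.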
K-means vs. Fuzzy c-means
[Figures: sample points, and the clusterings produced by K-means and by fuzzy c-means]
Expectation-Maximization (EM) Algorithm
What Is Given
Observed data: $X = \{x_1, x_2, \ldots, x_n\}$, each of them drawn independently from a mixture of probability distributions with the density
$$p(x \mid \Theta) = \sum_{k=1}^{m} \alpha_k\, p_k(x \mid \theta_k)$$
where $\Theta = (\alpha_1, \ldots, \alpha_m, \theta_1, \ldots, \theta_m)$ and $\sum_{k=1}^{m} \alpha_k = 1$.
Incomplete vs. Complete Data
The incomplete-data log-likelihood is given by
$$\log L(\Theta \mid X) = \sum_{i=1}^{n} \log p(x_i \mid \Theta),$$
which is difficult to optimize.
The complete-data log-likelihood
$$\log L(\Theta \mid X, H) = \log p(X, H \mid \Theta)$$
can be handled much more easily, where H is the set of hidden random variables.
How do we compute the distribution of H?
EM Algorithm
E-Step: first find the expected value
$$Q(\Theta, \Theta^{(i-1)}) = E\left[\log p(X, H \mid \Theta) \,\middle|\, X, \Theta^{(i-1)}\right] = \sum_{k} \log p(X, k \mid \Theta)\, f(k \mid X, \Theta^{(i-1)})$$
where $\Theta^{(i-1)}$ is the current estimate of $\Theta$.
M-Step: update the estimate
$$\Theta^{(i)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(i-1)})$$
Repeat the process until convergence.
E-M Steps
Justification
The expected value (the circled term) is a lower bound of the log-likelihood:
$$\log p(X \mid \Theta) = \log \sum_{h} p(X, h \mid \Theta) = \log \sum_{h} q(h)\, \frac{p(X, h \mid \Theta)}{q(h)}$$
$$\ge \sum_{h} q(h) \log \frac{p(X, h \mid \Theta)}{q(h)} \quad \text{(Jensen's inequality)}$$
$$= \sum_{h} q(h) \log \frac{p(h \mid X, \Theta)\, p(X \mid \Theta)}{q(h)} = -\sum_{h} q(h) \log \frac{q(h)}{p(h \mid X, \Theta)} + \log p(X \mid \Theta) \quad (1)$$
Justification (Cnt’d)
The maximum of the lower bound equals the log-likelihood:
The first term of (1) is the negative relative entropy of q(h) with respect to $p(h \mid X, \Theta)$
The second term is a quantity that does not depend on h
We obtain the maximum of (1) if the relative entropy becomes zero, i.e., $q(h) = p(h \mid X, \Theta)$
With this choice, the first term becomes zero and (1) achieves its upper bound, which is $\log p(X \mid \Theta)$
Details of EM Algorithm
Let $\Theta^g = (\alpha_1^g, \ldots, \alpha_m^g, \theta_1^g, \ldots, \theta_m^g)$ be the guessed values of $\Theta$.
For the given $\Theta^g$, we can compute
$$p(k \mid x_i, \Theta^g) = \frac{\alpha_k^g\, p_k(x_i \mid \theta_k^g)}{p(x_i \mid \Theta^g)} = \frac{\alpha_k^g\, p_k(x_i \mid \theta_k^g)}{\sum_{k'=1}^{m} \alpha_{k'}^g\, p_{k'}(x_i \mid \theta_{k'}^g)}$$
Details (Cnt’d)
We then consider the expected value:
$$Q(\Theta, \Theta^g) = \sum_{h} \log L(\Theta \mid X, h)\, p(h \mid X, \Theta^g) = \sum_{k=1}^{m} \sum_{i=1}^{n} \log\!\left(\alpha_k\, p_k(x_i \mid \theta_k)\right) p(k \mid x_i, \Theta^g)$$
$$= \sum_{k=1}^{m} \sum_{i=1}^{n} \log(\alpha_k)\, p(k \mid x_i, \Theta^g) + \sum_{k=1}^{m} \sum_{i=1}^{n} \log\!\left(p_k(x_i \mid \theta_k)\right) p(k \mid x_i, \Theta^g)$$
Details (Cnt’d)
Lagrangian and partial-derivative equation:
$$\frac{\partial}{\partial \alpha_k} \left[ \sum_{k=1}^{m} \sum_{i=1}^{n} \log(\alpha_k)\, p(k \mid x_i, \Theta^g) + \lambda \left( \sum_{k} \alpha_k - 1 \right) \right] = 0$$
$$\sum_{i=1}^{n} \frac{1}{\alpha_k}\, p(k \mid x_i, \Theta^g) + \lambda = 0 \quad (2)$$
Details (Cnt’d)
From (2), we derive that λ = −n and
$$\alpha_k = \frac{1}{n} \sum_{i=1}^{n} p(k \mid x_i, \Theta^g)$$
Based on these values, we can derive the optimal $\theta_k$ for $Q(\Theta, \Theta^g)$, of which only the following part involves $\theta_k$:
$$E(\theta, \Theta^g) = \sum_{k=1}^{m} \sum_{i=1}^{n} \log\!\left(p_k(x_i \mid \theta_k)\right) p(k \mid x_i, \Theta^g)$$
Exercise:
4. Deduce from (2) that λ = −n and
$$\alpha_k = \frac{1}{n} \sum_{i=1}^{n} p(k \mid x_i, \Theta^g)$$
Gaussian Mixtures
The Gaussian distribution is given by:
$$p(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_k|^{1/2}} \exp\!\left[ -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right]$$
For Gaussian mixtures,
$$E(\theta, \Theta^g) = \sum_{k=1}^{m} \sum_{i=1}^{n} \left( -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \right) p(k \mid x_i, \Theta^g)$$
Gaussian Mixtures (Cnt’d)
Partial derivative:
$$\frac{\partial E(\theta, \Theta^g)}{\partial \mu_k} = \sum_{i=1}^{n} \Sigma_k^{-1} (x_i - \mu_k)\, p(k \mid x_i, \Theta^g)$$
Setting this to zero, we obtain
$$\mu_k = \frac{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)\, x_i}{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)}$$
Gaussian Mixtures (Cnt’d)
Taking the derivative of $E(\theta, \Theta^g)$ with respect to $\Sigma_k$ and setting it to zero, we get (many details are omitted)
$$\Sigma_k = \frac{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)\, (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)}$$
Gaussian Mixtures (Cnt’d)
To summarize:
$$\alpha_k^{\text{new}} = \frac{1}{n} \sum_{i=1}^{n} p(k \mid x_i, \Theta^g)$$
$$\mu_k^{\text{new}} = \frac{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)\, x_i}{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)}$$
$$\Sigma_k^{\text{new}} = \frac{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)\, (x_i - \mu_k^{\text{new}})(x_i - \mu_k^{\text{new}})^T}{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)}$$
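The three update rules can be turned into a short EM loop (a minimal sketch with assumed simplifications: means seeded from the first m samples, covariances seeded from the full-data covariance, and a small ridge for numerical stability):

```python
import numpy as np

def gmm_em(X, m, n_iter=50):
    """Minimal EM for a Gaussian mixture, following the update rules above."""
    n, d = X.shape
    alpha = np.full(m, 1.0 / m)
    mu = X[:m].astype(float).copy()                 # assumed simple seeding
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(m)])
    for _ in range(n_iter):
        # E-step: responsibilities p(k | x_i, Theta^g)
        resp = np.empty((m, n))
        for k in range(m):
            diff = X - mu[k]
            inv = np.linalg.inv(sigma[k])
            norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma[k]))
            resp[k] = alpha[k] * norm * np.exp(
                -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))
        resp /= resp.sum(axis=0)
        # M-step: the alpha, mu, Sigma update rules
        Nk = resp.sum(axis=1)
        alpha = Nk / n
        mu = resp @ X / Nk[:, None]
        for k in range(m):
            diff = X - mu[k]
            sigma[k] = ((resp[k][:, None] * diff).T @ diff / Nk[k]
                        + 1e-6 * np.eye(d))
    return alpha, mu, sigma

X = np.array([[0., 0.], [10., 10.], [0., 1.], [1., 0.],
              [1., 1.], [10., 11.], [11., 10.], [11., 11.]])
alpha, mu, sigma = gmm_em(X, 2)
```

On two tight, well-separated groups the responsibilities become nearly crisp, so the mixture means settle close to the group means.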
Linear Discriminant Analysis (LDA)
Illustration
[Figure: samples of Class 1 and Class 2 projected onto a discriminant direction]
Definitions
Given: Samples x1, x2, …, xn
Classes: ni of them are of class i, i = 1, 2, …, c
Definition: sample mean for class i:
$$m_i = \frac{1}{n_i} \sum_{x \in \text{class } i} x$$
Scatter matrix for class i:
$$S_i = \sum_{x \in \text{class } i} (x - m_i)(x - m_i)^T$$
Scatter Matrices
Total scatter matrix:
$$S_{\text{total}} = \sum_{j=1}^{n} (x_j - m)(x_j - m)^T, \quad \text{where } m \text{ is the mean of all samples}$$
Within-class scatter matrix:
$$S_W = \sum_{i=1}^{c} S_i$$
Between-class scatter matrix:
$$S_B = S_{\text{total}} - S_W = \sum_{i=1}^{c} n_i\, (m_i - m)(m_i - m)^T$$
Multiple Discriminant Analysis
We seek vectors $w_i$, $i = 1, 2, \ldots, c-1$, and project the samples x to the (c−1)-dimensional space:
$$y = \left( w_1^T x,\ w_2^T x,\ \ldots,\ w_{c-1}^T x \right)$$
The criterion for $W = (w_1, w_2, \ldots, w_{c-1})$ is
$$\max_W\ W^T S_B W \ \text{ subject to } \ W^T S_W W = 1, \quad \text{or} \quad \max_W \frac{W^T S_B W}{W^T S_W W}$$
Multiple Discriminant Analysis (Cnt’d)
Consider the Lagrangian
$$J(W) = W^T S_B W - \lambda\, (W^T S_W W - 1)$$
Take the partial derivative
$$\frac{\partial J}{\partial W} = 2\, S_B W - 2\lambda\, S_W W$$
Setting the derivative to zero, we obtain
$$S_B W = \lambda\, S_W W, \quad \text{or} \quad S_B w_i = \lambda_i\, S_W w_i$$
Multiple Discriminant Analysis (Cnt’d)
Find the roots of the characteristic function
$$|S_B - \lambda_i S_W| = 0$$
as eigenvalues, and then solve
$$(S_B - \lambda_i S_W)\, w_i = 0$$
for $w_i$ for the largest c−1 eigenvalues.
LDA Prototypes
The prototype of each class is the mean of the projected samples of that class; the projection is through the matrix W.
In the testing phase:
All test samples are projected through the same optimal W
The nearest prototype is the winner
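The whole pipeline, from scatter matrices to nearest projected prototype, can be sketched as follows (an illustrative implementation; solving the generalized eigenproblem via $S_W^{-1} S_B$ is one common choice among several and assumes $S_W$ is invertible):

```python
import numpy as np

def lda_prototypes(X, y, n_dim):
    """LDA sketch: solve S_B w = lambda S_W w, then project class means."""
    classes = np.unique(y)
    m = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))
    Sb = np.zeros_like(Sw)
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)              # within-class scatter
        Sb += len(Xc) * np.outer(mc - m, mc - m)   # between-class scatter
    # eigenvectors of Sw^{-1} Sb for the largest eigenvalues
    vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1]
    W = vecs[:, order[:n_dim]].real
    # prototypes: projected class means
    protos = np.array([(X[y == c] @ W).mean(axis=0) for c in classes])
    return W, protos

def classify(x, W, protos, classes):
    """Testing phase: project x through W; nearest prototype wins."""
    z = x @ W
    return classes[np.argmin(((protos - z) ** 2).sum(axis=1))]

X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.],
              [5., 5.], [6., 5.], [5., 6.], [6., 6.]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
W, protos = lda_prototypes(X, y, 1)
```

For two classes this reduces to a single discriminant direction, so the prototypes are two scalars on the projected axis.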
K-Nearest Neighbor (K-NN) Classifier
K-NN Classifier
For each test sample x, find the nearest K training samples and classify x according to the vote among the K neighbors.
The asymptotic error rate satisfies
$$\sum_{k=1}^{K} p_k(x)\,\big(1 - p_k(x)\big) \le 2\,\big(1 - p^*(x)\big) - \frac{K}{K-1}\,\big(1 - p^*(x)\big)^2$$
where $p^*(x) = \max_k p_k(x)$ and K here denotes the number of classes.
This shows that the error rate is at most twice the Bayes error.
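The voting rule itself takes only a few lines (a minimal sketch; the sample data and K=3 are illustrative choices):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, K=3):
    """Classify x by majority vote among its K nearest training samples."""
    d2 = ((X_train - x) ** 2).sum(axis=1)   # squared L2 distances
    nearest = np.argsort(d2)[:K]            # indices of the K nearest samples
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

X_train = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.],
                    [10., 10.], [10., 11.], [11., 10.], [11., 11.]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
label = knn_classify(np.array([0.5, 0.5]), X_train, y_train, K=3)
# label == 0: the point sits inside the first group
```

Unlike the clustering methods, K-NN has no training step: every stored sample acts as a prototype.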
Learning Vector Quantization (LVQ)
LVQ Algorithm
1. Initialize R prototypes for each class: $m_1(k), m_2(k), \ldots, m_R(k)$, where $k = 1, 2, \ldots, K$.
2. Draw a training sample x and find the nearest prototype $m_j(k)$ to x.
   a) If x and $m_j(k)$ match in class type, $m_j(k) \leftarrow m_j(k) + \epsilon\,(x - m_j(k))$
   b) Otherwise, $m_j(k) \leftarrow m_j(k) - \epsilon\,(x - m_j(k))$
3. Repeat step 2, decreasing ε at each iteration.
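The three steps above can be sketched as an epoch-based LVQ1 loop (a minimal illustration; the per-epoch decay schedule and initial prototype positions are assumptions):

```python
import numpy as np

def lvq_train(X, y, protos, proto_labels, eps=0.1, n_epochs=20,
              decay=0.9, seed=0):
    """LVQ1 sketch: attract the nearest prototype to same-class samples,
    repel it from different-class samples, decreasing eps each epoch."""
    rng = np.random.default_rng(seed)
    protos = protos.astype(float).copy()
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):        # step 2: sample training data
            x = X[i]
            j = ((protos - x) ** 2).sum(axis=1).argmin()  # nearest prototype
            if proto_labels[j] == y[i]:
                protos[j] += eps * (x - protos[j])   # a) classes match: attract
            else:
                protos[j] -= eps * (x - protos[j])   # b) mismatch: repel
        eps *= decay                                 # step 3: decrease epsilon
    return protos

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.],
              [10., 10.], [10., 11.], [11., 10.], [11., 11.]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
protos = lvq_train(X, y, np.array([[2., 2.], [8., 8.]]), np.array([0, 1]))
```

After training, classification is the same nearest-prototype rule used throughout these slides, but with prototypes tuned by class labels rather than by clustering alone.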
References
F. Höppner, F. Klawonn, R. Kruse, and T. Runkler, Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition, John Wiley & Sons, 1999.
J. A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," www.cs.berkeley.edu/~daf/appsem/WordsAndPictures/Papers/bilmes98gentle.pdf
T. P. Minka, "Expectation-Maximization as Lower Bound Maximization," www.stat.cmu.edu/~minka/papers/em.html
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley Interscience, 2001.
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.