Prototype Classification Methods
Fu Chang
Institute of Information Science
Academia Sinica
2788-3799 ext. 1819
Types of Prototype Methods
Crisp model (K-means, KM): prototypes are centers of non-overlapping clusters
Fuzzy model (fuzzy c-means, FCM): prototypes are weighted averages of all samples
Gaussian mixture model (GM): each prototype carries one component distribution of a mixture
Linear discriminant analysis (LDA): prototypes are projected sample means
K-nearest neighbor classifier (K-NN)
Learning vector quantization (LVQ)
Prototypes thru Clustering
Given the number k of prototypes, find k clusters whose centers serve as the prototypes
Commonalities:
Use an iterative algorithm aimed at decreasing an objective function
May converge to local minima
The number k, as well as an initial solution, must be specified
Clustering Objectives
The aim of the iterative algorithm is to decrease the value of an objective function
Notations:
Samples: $x_1, x_2, \ldots, x_n$
Prototypes: $p_1, p_2, \ldots, p_k$
L2-distance: $\|x_i - p_j\|^2 = \sum_{k=1}^{d} (x_{ik} - p_{jk})^2$
Objectives (cnt’d)
Crisp objective:
$$\sum_{i=1}^{n} \min_{j \in \{1,\ldots,k\}} \|x_i - p_j\|^2$$
Fuzzy objective:
$$\sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^m\, \|p_i - x_j\|^2$$
Gaussian mixture objective:
$$\sum_{i=1}^{n} \log \sum_{j=1}^{k} p(x_i \mid c_j)\, p(c_j)$$
K-Means Clustering
The Algorithm
Initiate k seed prototypes p1, p2, …, pk
Grouping: assign each sample to its nearest prototype, forming non-overlapping clusters
Centering: the centers of the clusters become the new prototypes
Repeat the grouping and centering steps until convergence
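The grouping and centering loop can be sketched in NumPy as follows (a minimal illustration, not reference code; seeding with the first k samples is an assumption made for reproducibility):

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal K-means: alternate the grouping and centering steps.
    For simplicity the first k samples serve as the initial seeds."""
    p = X[:k].astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Grouping: assign each sample to its nearest prototype
        d = ((X[:, None, :] - p[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Centering: cluster centers become the new prototypes
        new_p = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else p[j] for j in range(k)])
        if np.allclose(new_p, p):
            break  # prototypes stopped moving: converged
        p = new_p
    return p, labels

# two well-separated groups of points
X = np.array([[0., 0.], [10., 10.], [0., 1.], [1., 0.],
              [1., 1.], [10., 11.], [11., 10.], [11., 11.]])
protos, labels = kmeans(X, 2)
# protos ≈ [[0.5, 0.5], [10.5, 10.5]]
```

Each pass computes all sample-to-prototype distances at once via broadcasting; the empty-cluster guard simply keeps the old prototype.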
Justification
Grouping: assigning samples to their nearest prototypes helps to decrease the objective
$$\sum_{i=1}^{n} \min_{j \in \{1,\ldots,k\}} \|x_i - p_j\|^2$$
Centering: also helps to decrease the above objective, because for any vectors $y_1, \ldots, y_m$ and any $w$,
$$\sum_{i=1}^{m} \|y_i - w\|^2 = \sum_{i=1}^{m} \|y_i - \bar{y}\|^2 + m\,\|\bar{y} - w\|^2 \ge \sum_{i=1}^{m} \|y_i - \bar{y}\|^2$$
and equality holds only if
$$w = \bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i$$
Exercise:
1. Prove that for any group of vectors $y_i$, the following inequality is always true:
$$\sum_{i=1}^{m} \|y_i - w\|^2 \ge \sum_{i=1}^{m} \|y_i - \bar{y}\|^2$$
2. Prove that the equality holds only when $w = \bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i$.
3. Use this fact to prove that the centering step helps to decrease the objective function.
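The identity behind this exercise can be checked numerically (a quick sanity check, not a proof; the random vectors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(10, 3))      # an arbitrary group of vectors y_i
w = rng.normal(size=3)            # an arbitrary reference vector w
ybar = y.mean(axis=0)             # the centroid of the y_i

lhs = ((y - w) ** 2).sum()
rhs = ((y - ybar) ** 2).sum() + len(y) * ((ybar - w) ** 2).sum()
# identity: sum_i ||y_i - w||^2 = sum_i ||y_i - ybar||^2 + m ||ybar - w||^2
assert np.isclose(lhs, rhs)
# hence sum_i ||y_i - w||^2 >= sum_i ||y_i - ybar||^2, with equality iff w = ybar
assert lhs >= ((y - ybar) ** 2).sum()
```

The cross term vanishes because $\sum_i (y_i - \bar{y}) = 0$, which is exactly why the centroid minimizes the within-cluster sum of squares.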
Fuzzy c-Means Clustering
Crisp vs. Fuzzy Membership
Membership matrix: $U_{c \times n}$
$u_{ij}$ is the grade of membership of sample $j$ with respect to prototype $i$
Crisp membership:
$$u_{ij} = \begin{cases} 1, & \text{if } \|p_i - x_j\|^2 = \min_k \|p_k - x_j\|^2 \\ 0, & \text{otherwise} \end{cases}$$
Fuzzy membership: $u_{ij} \in [0, 1]$ with
$$\sum_{i=1}^{c} u_{ij} = 1, \quad j = 1, \ldots, n$$
Fuzzy c-means (FCM)
The objective function of FCM is
$$J = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m\, d_{ij}^2 = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m\, \|p_i - x_j\|^2$$
FCM (Cnt’d)
Introducing the Lagrange multiplier λ with respect to the constraint
$$\sum_{i=1}^{c} u_{ij} = 1,$$
we rewrite the objective function as:
$$J = \sum_{i=1}^{c} u_{ij}^m\, d_{ij}^2 + \lambda \left( \sum_{i=1}^{c} u_{ij} - 1 \right)$$
FCM (Cnt’d)
Setting the partial derivatives to zero, we obtain
$$\frac{\partial J}{\partial u_{ij}} = m\, u_{ij}^{m-1}\, d_{ij}^2 + \lambda = 0$$
$$\frac{\partial J}{\partial \lambda} = \sum_{k=1}^{c} u_{kj} - 1 = 0$$
FCM (Cnt’d)
From the 2nd equation, we obtain $\sum_{k=1}^{c} u_{kj} = 1$.
From this fact and the 1st equation, which gives
$$u_{ij} = \left( \frac{-\lambda}{m\, d_{ij}^2} \right)^{\frac{1}{m-1}},$$
we obtain
$$1 = \sum_{k=1}^{c} \left( \frac{-\lambda}{m\, d_{kj}^2} \right)^{\frac{1}{m-1}} = \left( \frac{-\lambda}{m} \right)^{\frac{1}{m-1}} \sum_{k=1}^{c} \left( \frac{1}{d_{kj}^2} \right)^{\frac{1}{m-1}}$$
FCM (Cnt’d)
Therefore,
$$\left( \frac{-\lambda}{m} \right)^{\frac{1}{m-1}} = \frac{1}{\sum_{k=1}^{c} \left( \frac{1}{d_{kj}^2} \right)^{\frac{1}{m-1}}}$$
and
$$u_{ij} = \frac{\left( \frac{1}{d_{ij}^2} \right)^{\frac{1}{m-1}}}{\sum_{k=1}^{c} \left( \frac{1}{d_{kj}^2} \right)^{\frac{1}{m-1}}}$$
FCM (Cnt’d)
Together with the 2nd equation, we obtain the updating rule for $u_{ij}$:
$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d_{ij}^2}{d_{kj}^2} \right)^{\frac{1}{m-1}}}$$
FCM (Cnt’d)
On the other hand, setting the derivative of J with respect to $p_i$ to zero, we obtain
$$0 = \frac{\partial J}{\partial p_i} = \frac{\partial}{\partial p_i} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m\, \|p_i - x_j\|^2 = \sum_{j=1}^{n} u_{ij}^m\, \frac{\partial}{\partial p_i} \|p_i - x_j\|^2 = \sum_{j=1}^{n} u_{ij}^m \cdot 2\,(p_i - x_j)$$
FCM (Cnt’d)
It follows that
$$\sum_{j=1}^{n} u_{ij}^m\, (p_i - x_j) = 0$$
Finally, we can obtain the update rule for $p_i$:
$$p_i = \frac{\sum_{j=1}^{n} u_{ij}^m\, x_j}{\sum_{j=1}^{n} u_{ij}^m}$$
FCM (Cnt’d)
To summarize:
$$p_i = \frac{\sum_{j=1}^{n} u_{ij}^m\, x_j}{\sum_{j=1}^{n} u_{ij}^m}, \qquad u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{kj}} \right)^{\frac{2}{m-1}}}$$
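The two FCM update rules can be alternated in a short NumPy sketch (a minimal illustration under the assumption of a random initial membership matrix; the distance floor is a numerical safeguard, not part of the derivation):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Minimal FCM: alternate the prototype and membership update rules."""
    rng = np.random.default_rng(seed)
    n = len(X)
    u = rng.random((c, n))
    u /= u.sum(axis=0)                       # columns sum to 1
    for _ in range(n_iter):
        um = u ** m
        # p_i = sum_j u_ij^m x_j / sum_j u_ij^m
        p = um @ X / um.sum(axis=1, keepdims=True)
        # d_ij^2 = ||p_i - x_j||^2, floored to avoid division by zero
        d2 = ((p[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)
        # u_ij = 1 / sum_k (d_ij^2 / d_kj^2)^(1/(m-1))
        new_u = 1.0 / (d2 ** (1.0 / (m - 1))
                       * (1.0 / d2 ** (1.0 / (m - 1))).sum(axis=0))
        if np.abs(new_u - u).max() < tol:
            u = new_u
            break
        u = new_u
    return p, u

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.],
              [10., 10.], [10., 11.], [11., 10.], [11., 11.]])
p, u = fuzzy_c_means(X, 2)
```

By construction each column of the returned membership matrix sums to 1, mirroring the constraint handled by the Lagrange multiplier above.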
K-means vs. Fuzzy c-means
[Figures: sample points, and the clusterings produced by K-means and by fuzzy c-means]
Expectation-Maximization (EM) Algorithm
What Is Given
Observed data: $X = \{x_1, x_2, \ldots, x_n\}$, each of them drawn independently from a mixture of probability distributions with the density
$$p(x \mid \Theta) = \sum_{k=1}^{m} \alpha_k\, p_k(x \mid \theta_k)$$
where $\Theta = (\alpha_1, \ldots, \alpha_m, \theta_1, \ldots, \theta_m)$ and $\sum_{k=1}^{m} \alpha_k = 1$.
Incomplete vs. Complete Data
The incomplete-data log-likelihood is given by
$$\log L(\Theta \mid X) = \sum_{i=1}^{n} \log p(x_i \mid \Theta),$$
which is difficult to optimize.
The complete-data log-likelihood
$$\log L(\Theta \mid X, H) = \log p(X, H \mid \Theta)$$
can be handled much more easily, where H is the set of hidden random variables.
How do we compute the distribution of H?
EM Algorithm
E-Step: first find the expected value
$$Q(\Theta, \Theta^{(i-1)}) = E\left[\log p(X, H \mid \Theta) \,\middle|\, X, \Theta^{(i-1)}\right] = \sum_{k} \log p(X, k \mid \Theta)\, f(k \mid X, \Theta^{(i-1)})$$
where $\Theta^{(i-1)}$ is the current estimate of $\Theta$.
M-Step: update the estimate
$$\Theta^{(i)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(i-1)})$$
Repeat the process until convergence.
E-M Steps
Justification
The expected value (the circled term) is a lower bound of the log-likelihood:
$$\log p(X \mid \Theta) = \log \sum_{h} p(X, h \mid \Theta) = \log \sum_{h} q(h)\, \frac{p(X, h \mid \Theta)}{q(h)}$$
$$\ge \sum_{h} q(h) \log \frac{p(X, h \mid \Theta)}{q(h)} \quad \text{(Jensen's inequality)}$$
$$= \sum_{h} q(h) \log \frac{p(h \mid X, \Theta)\, p(X \mid \Theta)}{q(h)} = -\sum_{h} q(h) \log \frac{q(h)}{p(h \mid X, \Theta)} + \log p(X \mid \Theta) \quad (1)$$
Justification (Cnt’d)
The maximum of the lower bound equals the log-likelihood:
The first term of (1) is the negative relative entropy of q(h) with respect to $p(h \mid X, \Theta)$
The second term is a quantity that does not depend on h
We obtain the maximum of (1) if the relative entropy becomes zero, i.e., $q(h) = p(h \mid X, \Theta)$
With this choice, the first term becomes zero and (1) achieves its upper bound, which is $\log p(X \mid \Theta)$
Details of EM Algorithm
Let $\Theta^g = (\alpha_1^g, \ldots, \alpha_m^g, \theta_1^g, \ldots, \theta_m^g)$ be the guessed values of $\Theta$.
For the given $\Theta^g$, we can compute
$$p(k \mid x_i, \Theta^g) = \frac{\alpha_k^g\, p_k(x_i \mid \theta_k^g)}{p(x_i \mid \Theta^g)} = \frac{\alpha_k^g\, p_k(x_i \mid \theta_k^g)}{\sum_{k'=1}^{m} \alpha_{k'}^g\, p_{k'}(x_i \mid \theta_{k'}^g)}$$
Details (Cnt’d)
We then consider the expected value:
$$Q(\Theta, \Theta^g) = \sum_{h} \log L(\Theta \mid X, h)\, p(h \mid X, \Theta^g) = \sum_{k=1}^{m} \sum_{i=1}^{n} \log\!\left(\alpha_k\, p_k(x_i \mid \theta_k)\right) p(k \mid x_i, \Theta^g)$$
$$= \sum_{k=1}^{m} \sum_{i=1}^{n} \log(\alpha_k)\, p(k \mid x_i, \Theta^g) + \sum_{k=1}^{m} \sum_{i=1}^{n} \log\!\left(p_k(x_i \mid \theta_k)\right) p(k \mid x_i, \Theta^g)$$
Details (Cnt’d)
Lagrangian and partial-derivative equation:
$$\frac{\partial}{\partial \alpha_k} \left[ \sum_{k=1}^{m} \sum_{i=1}^{n} \log(\alpha_k)\, p(k \mid x_i, \Theta^g) + \lambda \left( \sum_{k} \alpha_k - 1 \right) \right] = 0$$
$$\sum_{i=1}^{n} \frac{1}{\alpha_k}\, p(k \mid x_i, \Theta^g) + \lambda = 0 \quad (2)$$
Details (Cnt’d)
From (2), we derive that λ = −n and
$$\alpha_k = \frac{1}{n} \sum_{i=1}^{n} p(k \mid x_i, \Theta^g)$$
Based on these values, we can derive the optimal $\theta_k$ for $Q(\Theta, \Theta^g)$, of which only the following part involves $\theta_k$:
$$E(\theta, \Theta^g) = \sum_{k=1}^{m} \sum_{i=1}^{n} \log\!\left(p_k(x_i \mid \theta_k)\right) p(k \mid x_i, \Theta^g)$$
Exercise:
4. Deduce from (2) that λ = −n and
$$\alpha_k = \frac{1}{n} \sum_{i=1}^{n} p(k \mid x_i, \Theta^g)$$
Gaussian Mixtures
The Gaussian distribution is given by:
$$p(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_k|^{1/2}} \exp\!\left[ -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right]$$
For Gaussian mixtures,
$$E(\theta, \Theta^g) = \sum_{k=1}^{m} \sum_{i=1}^{n} \left( -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \right) p(k \mid x_i, \Theta^g)$$
Gaussian Mixtures (Cnt’d)
Partial derivative:
$$\frac{\partial E(\theta, \Theta^g)}{\partial \mu_k} = \sum_{i=1}^{n} \Sigma_k^{-1} (x_i - \mu_k)\, p(k \mid x_i, \Theta^g)$$
Setting this to zero, we obtain
$$\mu_k = \frac{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)\, x_i}{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)}$$
Gaussian Mixtures (Cnt’d)
Taking the derivative of $E(\theta, \Theta^g)$ with respect to $\Sigma_k$ and setting it to zero, we get (many details are omitted)
$$\Sigma_k = \frac{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)\, (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)}$$
Gaussian Mixtures (Cnt’d)
To summarize:
$$\alpha_k^{\text{new}} = \frac{1}{n} \sum_{i=1}^{n} p(k \mid x_i, \Theta^g)$$
$$\mu_k^{\text{new}} = \frac{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)\, x_i}{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)}$$
$$\Sigma_k^{\text{new}} = \frac{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)\, (x_i - \mu_k^{\text{new}})(x_i - \mu_k^{\text{new}})^T}{\sum_{i=1}^{n} p(k \mid x_i, \Theta^g)}$$
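The three update rules can be turned into a short EM loop (a minimal sketch with assumed simplifications: means seeded from the first m samples, covariances seeded from the full-data covariance, and a small ridge for numerical stability):

```python
import numpy as np

def gmm_em(X, m, n_iter=50):
    """Minimal EM for a Gaussian mixture, following the update rules above."""
    n, d = X.shape
    alpha = np.full(m, 1.0 / m)
    mu = X[:m].astype(float).copy()                 # assumed simple seeding
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(m)])
    for _ in range(n_iter):
        # E-step: responsibilities p(k | x_i, Theta^g)
        resp = np.empty((m, n))
        for k in range(m):
            diff = X - mu[k]
            inv = np.linalg.inv(sigma[k])
            norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma[k]))
            resp[k] = alpha[k] * norm * np.exp(
                -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))
        resp /= resp.sum(axis=0)
        # M-step: the alpha, mu, Sigma update rules
        Nk = resp.sum(axis=1)
        alpha = Nk / n
        mu = resp @ X / Nk[:, None]
        for k in range(m):
            diff = X - mu[k]
            sigma[k] = ((resp[k][:, None] * diff).T @ diff / Nk[k]
                        + 1e-6 * np.eye(d))
    return alpha, mu, sigma

X = np.array([[0., 0.], [10., 10.], [0., 1.], [1., 0.],
              [1., 1.], [10., 11.], [11., 10.], [11., 11.]])
alpha, mu, sigma = gmm_em(X, 2)
```

On two tight, well-separated groups the responsibilities become nearly crisp, so the mixture means settle close to the group means.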
Linear Discriminant Analysis (LDA)
Illustration
[Figure: samples of Class 1 and Class 2 projected onto a discriminant direction]
Definitions
Given: Samples x1, x2, …, xn
Classes: ni of them are of class i, i = 1, 2, …, c
Definition: sample mean for class i:
$$m_i = \frac{1}{n_i} \sum_{x \in \text{class } i} x$$
Scatter matrix for class i:
$$S_i = \sum_{x \in \text{class } i} (x - m_i)(x - m_i)^T$$
Scatter Matrices
Total scatter matrix:
$$S_{\text{total}} = \sum_{j=1}^{n} (x_j - m)(x_j - m)^T, \quad \text{where } m \text{ is the mean of all samples}$$
Within-class scatter matrix:
$$S_W = \sum_{i=1}^{c} S_i$$
Between-class scatter matrix:
$$S_B = S_{\text{total}} - S_W = \sum_{i=1}^{c} n_i\, (m_i - m)(m_i - m)^T$$
Multiple Discriminant Analysis
We seek vectors $w_i$, $i = 1, 2, \ldots, c-1$, and project the samples x to the (c−1)-dimensional space:
$$y = \left( w_1^T x,\ w_2^T x,\ \ldots,\ w_{c-1}^T x \right)$$
The criterion for $W = (w_1, w_2, \ldots, w_{c-1})$ is
$$\max_W\ W^T S_B W \ \text{ subject to } \ W^T S_W W = 1, \quad \text{or} \quad \max_W \frac{W^T S_B W}{W^T S_W W}$$
Multiple Discriminant Analysis (Cnt’d)
Consider the Lagrangian
$$J(W) = W^T S_B W - \lambda\, (W^T S_W W - 1)$$
Take the partial derivative
$$\frac{\partial J}{\partial W} = 2\, S_B W - 2\lambda\, S_W W$$
Setting the derivative to zero, we obtain
$$S_B W = \lambda\, S_W W, \quad \text{or} \quad S_B w_i = \lambda_i\, S_W w_i$$
Multiple Discriminant Analysis (Cnt’d)
Find the roots of the characteristic function
$$|S_B - \lambda_i S_W| = 0$$
as eigenvalues, and then solve
$$(S_B - \lambda_i S_W)\, w_i = 0$$
for $w_i$ for the largest c−1 eigenvalues.
LDA Prototypes
The prototype of each class is the mean of the projected samples of that class; the projection is through the matrix W.
In the testing phase:
All test samples are projected through the same optimal W
The nearest prototype is the winner
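The whole pipeline, from scatter matrices to nearest projected prototype, can be sketched as follows (an illustrative implementation; solving the generalized eigenproblem via $S_W^{-1} S_B$ is one common choice among several and assumes $S_W$ is invertible):

```python
import numpy as np

def lda_prototypes(X, y, n_dim):
    """LDA sketch: solve S_B w = lambda S_W w, then project class means."""
    classes = np.unique(y)
    m = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))
    Sb = np.zeros_like(Sw)
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)              # within-class scatter
        Sb += len(Xc) * np.outer(mc - m, mc - m)   # between-class scatter
    # eigenvectors of Sw^{-1} Sb for the largest eigenvalues
    vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1]
    W = vecs[:, order[:n_dim]].real
    # prototypes: projected class means
    protos = np.array([(X[y == c] @ W).mean(axis=0) for c in classes])
    return W, protos

def classify(x, W, protos, classes):
    """Testing phase: project x through W; nearest prototype wins."""
    z = x @ W
    return classes[np.argmin(((protos - z) ** 2).sum(axis=1))]

X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.],
              [5., 5.], [6., 5.], [5., 6.], [6., 6.]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
W, protos = lda_prototypes(X, y, 1)
```

For two classes this reduces to a single discriminant direction, so the prototypes are two scalars on the projected axis.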
K-Nearest Neighbor (K-NN) Classifier
K-NN Classifier
For each test sample x, find the nearest K training samples and classify x according to the vote among the K neighbors.
The asymptotic error rate satisfies
$$\sum_{k=1}^{K} p_k(x)\,\big(1 - p_k(x)\big) \le 2\,\big(1 - p^*(x)\big) - \frac{K}{K-1}\,\big(1 - p^*(x)\big)^2$$
where $p^*(x) = \max_k p_k(x)$ and K here denotes the number of classes.
This shows that the error rate is at most twice the Bayes error.
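The voting rule itself takes only a few lines (a minimal sketch; the sample data and K=3 are illustrative choices):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, K=3):
    """Classify x by majority vote among its K nearest training samples."""
    d2 = ((X_train - x) ** 2).sum(axis=1)   # squared L2 distances
    nearest = np.argsort(d2)[:K]            # indices of the K nearest samples
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

X_train = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.],
                    [10., 10.], [10., 11.], [11., 10.], [11., 11.]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
label = knn_classify(np.array([0.5, 0.5]), X_train, y_train, K=3)
# label == 0: the point sits inside the first group
```

Unlike the clustering methods, K-NN has no training step: every stored sample acts as a prototype.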
Learning Vector Quantization (LVQ)
LVQ Algorithm
1. Initialize R prototypes for each class: $m_1(k), m_2(k), \ldots, m_R(k)$, where $k = 1, 2, \ldots, K$.
2. Draw a training sample x and find the nearest prototype $m_j(k)$ to x.
   a) If x and $m_j(k)$ match in class type, $m_j(k) \leftarrow m_j(k) + \epsilon\,(x - m_j(k))$
   b) Otherwise, $m_j(k) \leftarrow m_j(k) - \epsilon\,(x - m_j(k))$
3. Repeat step 2, decreasing ε at each iteration.
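The three steps above can be sketched as an epoch-based LVQ1 loop (a minimal illustration; the per-epoch decay schedule and initial prototype positions are assumptions):

```python
import numpy as np

def lvq_train(X, y, protos, proto_labels, eps=0.1, n_epochs=20,
              decay=0.9, seed=0):
    """LVQ1 sketch: attract the nearest prototype to same-class samples,
    repel it from different-class samples, decreasing eps each epoch."""
    rng = np.random.default_rng(seed)
    protos = protos.astype(float).copy()
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):        # step 2: sample training data
            x = X[i]
            j = ((protos - x) ** 2).sum(axis=1).argmin()  # nearest prototype
            if proto_labels[j] == y[i]:
                protos[j] += eps * (x - protos[j])   # a) classes match: attract
            else:
                protos[j] -= eps * (x - protos[j])   # b) mismatch: repel
        eps *= decay                                 # step 3: decrease epsilon
    return protos

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.],
              [10., 10.], [10., 11.], [11., 10.], [11., 11.]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
protos = lvq_train(X, y, np.array([[2., 2.], [8., 8.]]), np.array([0, 1]))
```

After training, classification is the same nearest-prototype rule used throughout these slides, but with prototypes tuned by class labels rather than by clustering alone.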
References
F. Höppner, F. Klawonn, R. Kruse, and T. Runkler, Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition, John Wiley & Sons, 1999.
J. A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," www.cs.berkeley.edu/~daf/appsem/WordsAndPictures/Papers/bilmes98gentle.pdf
T. P. Minka, "Expectation-Maximization as Lower Bound Maximization," www.stat.cmu.edu/~minka/papers/em.html
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley Interscience, 2001.
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.