Clustering Validity

Adriano Joaquim de O Cruz ©2006

NCE/UFRJ

adriano@nce.ufrj.br

Clustering Validity

The number of clusters is not always previously known.

In many problems the number of classes is known, but it is not the best configuration.

It is necessary to study methods to indicate and/or validate the number of classes.

Clustering Validity Example 1

Consider the problem of number recognition.

It is known that there are 10 classes (10 digits).

The number of clusters, however, may be greater than 10.

This results from different handwriting styles for the same digit.

Clustering Validity Example 2

Consider the problem of segmenting a thermal image of a room.

It is known that there are 2 classes of temperatures: body temperature and room temperature.

This is a problem where the number of classes is well defined.

Clustering Validity Problem

First, the data is partitioned into different numbers of clusters.

It is also important to try different initial conditions for the same number of partitions.

Validity measures are applied to these partitions to estimate their quality.

The quality must be estimated both when the number of partitions changes and, for the same number, when the initial conditions differ.

Clustering Validity

L-Clusters

Initial Definitions

d(e_i, e_k) is the dissimilarity between elements e_i and e_k.

Euclidean distance is an example of a measure of dissimilarity.

L-Cluster Definition

C is an L-cluster if for each object e_i belonging to C:

$$\max_{e_k \in C} d(e_i, e_k) < \min_{e_h \notin C} d(e_i, e_h)$$

The maximum distance between e_i and any element e_k of C is smaller than the minimum distance between e_i and any element e_h from another cluster.
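The L-cluster condition can be checked directly from a dissimilarity matrix. The following is a minimal sketch in base MATLAB, assuming Euclidean distance; the data `X`, the assignment `labels`, and all variable names are hypothetical and not part of the original slides.

```matlab
% Toy data: two well-separated groups (illustrative values only).
X = [0 0; 0 1; 1 0; 10 10; 10 11];
labels = [1 1 1 2 2];          % hypothetical cluster assignment
inC = (labels == 1);           % cluster C under test

% Pairwise Euclidean dissimilarities d(e_i, e_k).
n = size(X, 1);
D = zeros(n);
for i = 1:n
    for k = 1:n
        D(i, k) = norm(X(i, :) - X(k, :));
    end
end

% For each e_i in C: largest distance inside C and smallest distance
% to any element outside C.
maxInside  = max(D(inC, inC), [], 2);
minOutside = min(D(inC, ~inC), [], 2);

% C is an L-cluster when the condition holds for every e_i in C.
isLCluster = all(maxInside < minOutside);
disp(isLCluster);
```

The comparison is made per element, so each e_i is tested against its own nearest outsider.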

L-cluster (figure showing cluster C)

L* – Definition

C is an L*-cluster if for each object e_i belonging to C:

$$\max_{e_k \in C} d(e_i, e_k) < \min_{e_l \in C,\; e_h \notin C} d(e_l, e_h)$$
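The L* condition is stricter than the L condition because the right-hand side is the smallest distance between any element of C and any outsider, not just the distance from e_i. The sketch below, in base MATLAB with hypothetical data and variable names, shows a cluster that satisfies the L condition but not the L* condition.

```matlab
% Three collinear points form C; one outsider sits above the middle point.
X = [0 0; 2 0; 4 0; 2 3.5];
inC = logical([1 1 1 0]);      % hypothetical membership in C

% Pairwise Euclidean dissimilarities.
n = size(X, 1);
D = zeros(n);
for i = 1:n
    for k = 1:n
        D(i, k) = norm(X(i, :) - X(k, :));
    end
end

maxInside  = max(D(inC, inC), [], 2);     % per e_i, farthest member of C
minOutside = min(D(inC, ~inC), [], 2);    % per e_i, nearest outsider
minCross   = min(min(D(inC, ~inC)));      % nearest (member, outsider) pair

isL     = all(maxInside < minOutside);    % L-cluster test
isLstar = all(maxInside < minCross);      % L*-cluster test
fprintf('L-cluster: %d, L*-cluster: %d\n', isL, isLstar);
```

Here the end points of C are 4 apart, while the middle point is only 3.5 from the outsider, which is what breaks the L* condition.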

L*-cluster (figure showing cluster C)

Clustering Validity

Silhouettes

Introduction

P. J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 1987.

Each cluster is represented by one silhouette, showing which objects lie well within the cluster.

The user can compare the quality of the clusters.

Method - I

Consider a cluster A.

For each element e_i in A, calculate the average dissimilarity to all other objects of A, a(e_i) = d(e_i, A).

Therefore, A cannot be a singleton.

Euclidean distance is an example of dissimilarity.

Method - II

Consider all clusters C_k different from A.

Calculate d_k(e_i, C_k), the average dissimilarity of e_i to all elements of C_k.

Select b(e_i) = min_k d_k(e_i, C_k).

Let us call B the cluster whose dissimilarity is b(e_i).

This is the second-best choice for e_i.

Method - III

The silhouette s(e_i) is equal to

s(e_i) = 1 - [a(e_i) / b(e_i)]   if a(e_i) < b(e_i)

s(e_i) = 0                       if a(e_i) = b(e_i)

s(e_i) = [b(e_i) / a(e_i)] - 1   if a(e_i) > b(e_i)

or

s(e_i) = [b(e_i) - a(e_i)] / max(b(e_i), a(e_i))

-1 <= s(e_i) <= +1
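The three method steps can be followed by hand on a small data set. The sketch below uses only base MATLAB and Euclidean distance; the data, labels, and variable names are hypothetical, and it is meant as an illustration of the definitions rather than a replacement for the toolbox `silhouette` function used in the examples.

```matlab
% Toy data: two tight pairs far from each other (illustrative values).
X = [0 0; 0 1; 5 5; 5 6];
labels = [1; 1; 2; 2];         % hypothetical cluster assignment

n = size(X, 1);
clusters = unique(labels);
s = zeros(n, 1);
for i = 1:n
    % Euclidean distances from e_i to every element.
    d = sqrt(sum((X - repmat(X(i, :), n, 1)).^2, 2));

    % a(e_i): average dissimilarity to the other members of its cluster A.
    % (A must not be a singleton, otherwise this mean is undefined.)
    own = (labels == labels(i));
    own(i) = false;
    a = mean(d(own));

    % b(e_i): smallest average dissimilarity to any other cluster C_k.
    b = inf;
    for k = clusters'
        if k ~= labels(i)
            b = min(b, mean(d(labels == k)));
        end
    end

    % Compact form of the silhouette definition.
    s(i) = (b - a) / max(a, b);
end
disp(s');
```

Values close to +1 for every element indicate that each point is much closer to its own cluster than to the second-best choice.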

Understanding s(e_i)

s(e_i) ≈ 1: within dissimilarity a(e_i) << b(e_i), e_i is well classified.

s(e_i) ≈ 0: a(e_i) ≈ b(e_i), e_i may belong to either cluster.

s(e_i) ≈ -1: within dissimilarity a(e_i) >> b(e_i), e_i is misclassified and should belong to B.

Silhouette

The silhouette of the cluster A is the plot of all s(e_i) ranked in decreasing order.

The average of all s(e_i) of all elements in the cluster is called the average silhouette.

Example of use I

QTY = 100;
% Two Gaussian clouds centred at (0.5, 0.5) and (-0.5, -0.5): heavy overlap
X = [randn(QTY,2) + 0.5*ones(QTY,2); randn(QTY,2) - 0.5*ones(QTY,2)];
opts = statset('Display','final');
% k-means with 2 clusters, city-block distance, best of 5 replicates
[cidx, ctrs] = kmeans(X, 2, 'Distance','cityblock', ...
                      'Replicates',5, 'Options',opts);
figure;
% Points of each cluster in a different colour, centres marked with an x
plot(X(cidx==1,1), X(cidx==1,2), 'r.', ...
     X(cidx==2,1), X(cidx==2,2), 'b.', ...
     ctrs(:,1), ctrs(:,2), 'kx');
figure;
% Silhouette plot of the partition, using squared Euclidean distance
[s, h] = silhouette(X, cidx, 'sqEuclidean');

Ex Silhouette 1 (figure)

Ex Silhouette 2 (figure)

Example of use II

QTY = 100;
% Two Gaussian clouds centred at (2, 2) and (-2, -2): well separated
X = [randn(QTY,2) + 2*ones(QTY,2); randn(QTY,2) - 2*ones(QTY,2)];
opts = statset('Display','final');
% k-means with 2 clusters, city-block distance, best of 5 replicates
[cidx, ctrs] = kmeans(X, 2, 'Distance','cityblock', ...
                      'Replicates',5, 'Options',opts);
figure;
% Points of each cluster in a different colour, centres marked with an x
plot(X(cidx==1,1), X(cidx==1,2), 'r.', ...
     X(cidx==2,1), X(cidx==2,2), 'b.', ...
     ctrs(:,1), ctrs(:,2), 'kx');
figure;
% Silhouette plot of the partition, using squared Euclidean distance
[s, h] = silhouette(X, cidx, 'sqEuclidean');

Ex Silhouette 3 (figure)

Ex Silhouette 4 (figure)

Cluster Validity

Partition Coefficient

Partition Coefficient

This coefficient is defined as

$$F = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} (\mu_{ij})^2, \qquad \frac{1}{c} \le F \le 1$$

Partition Coefficient comments

F is inversely proportional to the number of clusters.

F is not appropriate for finding the best number of partitions.

F is best suited to validating the best partition among those with the same number of clusters.

Partition Coefficient

When F = 1/c the system is entirely fuzzy, since every element belongs to all clusters with the same degree of membership.

When F = 1 the system is rigid and membership values are either 1 or 0.

This measurement can only be applied to fuzzy partitions.

Partition Coefficient Example

The Partition Matrix is

$$U = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix} \quad \text{(columns } w_1, w_2, w_3, w_4\text{)}$$

$$F = \frac{1^2 + 1^2 + 1^2 + 1^2}{4} = 1$$

Partition Coefficient Example

The Partition Matrix is

$$U = \begin{bmatrix} 0.5 & 0.5 & 0.5 & 0.5 \\ 0.5 & 0.5 & 0.5 & 0.5 \end{bmatrix} \quad \text{(columns } w_1, w_2, w_3, w_4\text{)}$$

$$F = \frac{8 \times 0.5^2}{4} = 0.5 = 1/2 = 1/c$$

Partition Coefficient Example

The Partition Matrix is

$$U = \begin{bmatrix} 0.8 & 0.7 & 0.1 & 1 & 0 & 0.5 \\ 0.2 & 0.3 & 0.9 & 0 & 1 & 0.5 \end{bmatrix} \quad \text{(columns } X_1, \dots, X_6\text{)}$$

$$F = \frac{0.8^2 + 0.7^2 + 0.1^2 + 1^2 + 0.5^2 + 0.2^2 + 0.3^2 + 0.9^2 + 1^2 + 0.5^2}{6} = \frac{4.58}{6} \approx 0.7633$$
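The three examples can be reproduced numerically. The sketch below, in base MATLAB, applies the definition of F to a rigid partition, an entirely fuzzy partition, and the six-sample mixed partition; the helper name `pc` and the variable names are illustrative assumptions.

```matlab
% Partition matrices: one row per cluster, one column per sample.
U1 = [1 1 0 0; 0 0 1 1];                             % rigid partition
U2 = 0.5 * ones(2, 4);                               % entirely fuzzy
U3 = [0.8 0.7 0.1 1 0 0.5; 0.2 0.3 0.9 0 1 0.5];     % mixed memberships

% Partition coefficient: sum of squared memberships divided by n.
pc = @(U) sum(U(:).^2) / size(U, 2);

F1 = pc(U1);
F2 = pc(U2);
F3 = pc(U3);
fprintf('F1 = %.4f, F2 = %.4f, F3 = %.4f\n', F1, F2, F3);
```

F1 and F2 sit at the two ends of the interval 1/c <= F <= 1 for c = 2, and F3 falls between them.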

Cluster Validity

Partition Entropy

Partition Entropy

Partition Entropy is defined as

$$H = -\frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij} \log(\mu_{ij}), \qquad 0 \le H \le \log c$$

When H = 0 the partition is rigid.

When H = log(c) the fuzziness is maximum.

0 <= 1 - F <= H
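A short numerical check of the two extreme cases may help. The sketch below, in base MATLAB, evaluates the partition entropy with the usual convention that a zero membership contributes nothing to the sum; the helper names `pe` and `pc` are illustrative assumptions.

```matlab
% Partition matrices: one row per cluster, one column per sample.
Ucrisp = [1 1 0 0; 0 0 1 1];       % rigid partition
Ufuzzy = 0.5 * ones(2, 4);         % maximum fuzziness for c = 2

% Partition entropy; zero memberships are skipped (0*log(0) taken as 0).
pe = @(U) -sum(U(U > 0) .* log(U(U > 0))) / size(U, 2);
% Partition coefficient, used to check the bound 1 - F <= H.
pc = @(U) sum(U(:).^2) / size(U, 2);

Hcrisp = pe(Ucrisp);
Hfuzzy = pe(Ufuzzy);
fprintf('H (rigid) = %.4f, H (fuzzy) = %.4f, log(c) = %.4f\n', ...
        Hcrisp, Hfuzzy, log(2));
```

The rigid partition gives H = 0 and the uniform one reaches log(c), matching the bounds stated above.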

Partition Entropy comments

Partition Entropy (H) is directly proportional to the number of partitions.

H is more appropriate for validating the best partition among several runs of an algorithm.

H is strictly a fuzzy measure.

Cluster Validity

Compactness and Separation

Compactness and Separation

CS is defined as

$$CS = \frac{J_m}{n \, (d_{min})^2}$$

J_m is the objective function minimized by the FCM algorithm.

n is the number of elements.

d_min is the minimum Euclidean distance between the centers of two clusters.

Compactness and Separation

The minimum distance is defined as

$$d_{min} = \min_{i,j,\; i \ne j} \| v_i - v_j \|$$

The complete formula is

$$CS = \frac{\sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij}^{m} \, \| v_i - x_j \|^2}{n \, \min_{i,j,\; i \ne j} \| v_i - v_j \|^2}$$

where v_i is the center of cluster i.
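The complete formula translates almost line by line into code. The following base MATLAB sketch computes CS for a tiny data set; the data `X`, the centers `V`, the membership matrix `U`, and the fuzzifier value are hypothetical inputs that would normally come from an FCM run.

```matlab
X = [0 0; 0 1; 5 5; 5 6];      % samples, one per row (illustrative)
V = [0 0.5; 5 5.5];            % cluster centers, one per row (assumed)
U = [1 1 0 0; 0 0 1 1];        % memberships: clusters x samples (assumed)
m = 2;                         % fuzzifier used by FCM

[c, n] = size(U);

% Numerator: the FCM objective function J_m (compactness).
Jm = 0;
for i = 1:c
    for j = 1:n
        Jm = Jm + U(i, j)^m * sum((X(j, :) - V(i, :)).^2);
    end
end

% Denominator: squared minimum distance between two centers (separation).
dmin2 = inf;
for i = 1:c-1
    for k = i+1:c
        dmin2 = min(dmin2, sum((V(i, :) - V(k, :)).^2));
    end
end

CS = Jm / (n * dmin2);
fprintf('CS = %.4f\n', CS);
```

Smaller values correspond to compact clusters whose centers are far apart.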

Compactness and Separation

This is a very complete validation measure.

It validates the number of clusters and checks the separation among clusters.

In our experiments it worked well even when the degree of superposition was high.

Cluster Validity

Fuzzy Linear Discriminant

Fisher Linear Discriminant

Fisher's Linear Discriminant (FLD) is an important technique used in pattern recognition problems to evaluate the compactness and separation of the partitions produced by crisp clustering techniques.

Fisher Linear Discriminant

It is easier to handle classification problems in which the sampled data has few characteristics.

So it is important to reduce the problem dimensionality.

When FLD is applied to a crisply partitioned space, it produces an operator (W) that maps the original set (R^p) into a new set (R^k), where k < p.

Fisher Linear Discriminant

Figure: projection of samples arranged in 2 classes (axes x1, x2) onto a line W by Fisher's Linear Discriminant.

FLD

FLD measures the compactness and separation of all categories when crisp partitions are created.

FLD uses two matrices:

S_B: Between Classes Scatter Matrix

S_W: Within Classes Scatter Matrix

FLD – S_B Matrix

Measures the quality of separation between classes

$$S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T$$

$$m = \frac{1}{n} \sum_{j=1}^{n} x_j, \qquad m_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_j, \; x_j \in c_i$$

FLD – S_B Matrix

m is the average of all samples.

m_i is the average of all samples belonging to cluster i.

n is the number of samples.

n_i is the number of samples belonging to cluster i.

FLD – S_W Matrix

Measures the compactness of all classes.

It is the sum of all internal scattering.

$$S_W = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_j - m_i)(x_j - m_i)^T, \; x_j \in c_i$$

Total Scattering

The total scattering is the sum of the internal scattering and the scattering between the classes.

S_T = S_W + S_B

In an optimal partition the separation between classes (S_B) must be maximum and the scattering within the classes (S_W) minimum.

J criterion

Fisher defined the J criterion, which must be maximized:

$$J = \frac{|S_B|}{|S_W|}$$

A simplified way to evaluate J is

$$J = \frac{\mathrm{trace}(S_B)}{\mathrm{trace}(S_W)}$$
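The scatter matrices and the trace form of J can be computed in a few lines. The sketch below uses base MATLAB on a hypothetical crisp partition; the data, labels, and variable names are assumptions made for illustration.

```matlab
X = [0 0; 0 1; 5 5; 5 6];      % samples, one per row (illustrative)
labels = [1; 1; 2; 2];         % hypothetical crisp partition

[n, p] = size(X);
m = mean(X, 1);                % average of all samples
SB = zeros(p);
SW = zeros(p);
for i = unique(labels)'
    Xi = X(labels == i, :);    % samples of cluster i
    ni = size(Xi, 1);
    mi = mean(Xi, 1);          % average of cluster i
    SB = SB + ni * (mi - m)' * (mi - m);      % between-classes scatter
    Di = Xi - repmat(mi, ni, 1);
    SW = SW + Di' * Di;                       % within-classes scatter
end

% Total scatter, used to check S_T = S_W + S_B.
Dt = X - repmat(m, n, 1);
ST = Dt' * Dt;

J = trace(SB) / trace(SW);     % simplified (trace) form of the criterion
fprintf('J = %.2f\n', J);
```

Only the trace form is evaluated here; with so few samples S_W is singular, so the ratio of determinants would not be defined for this toy data.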

J comments

J may vary in the interval 0 <= J <= ∞.

J is strictly rigid.

J loses precision as the sample overlapping increases.

EFLD

EFLD measures the compactness and separation of all categories when fuzzy partitions are created.

EFLD uses two matrices:

S_Be: Between Classes Scatter Matrix

S_We: Within Classes Scatter Matrix

EFLD – S_Be Matrix

Measures the quality of separation between classes.

$$m = \frac{1}{n} \sum_{j=1}^{n} x_j$$

$$S_{Be} = \sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij} (m_{ei} - m)(m_{ei} - m)^T$$

$$m_{ei} = \frac{\sum_{j=1}^{n} \mu_{ij} x_j}{\sum_{j=1}^{n} \mu_{ij}}$$

EFLD – S_We Matrix

Measures the compactness of all classes.

It is the sum of all internal scattering.

$$S_{We} = \sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij} (x_j - m_{ei})(x_j - m_{ei})^T$$

Total Scattering

The total scattering is the sum of the internal scattering and the scattering between the classes.

S_Te = S_We + S_Be

In an optimal partition the separation between classes (S_Be) must be maximum and the scattering within the classes (S_We) minimum.

J_e criterion

J_e is the criterion that must be maximized:

$$J_e = \frac{|S_{Be}|}{|S_{We}|}$$

A simplified way to evaluate J_e is

$$J_e = \frac{\mathrm{trace}(S_{Be})}{\mathrm{trace}(S_{We})}$$

Simplifying the J_e criterion

It can be proved that S_T is constant, with trace equal to

$$s_T = \mathrm{trace}(S_T) = \sum_{j=1}^{n} \| x_j - m \|^2$$

so J_e can be evaluated as

$$J_e = \frac{s_{Be}}{s_{We}} = \frac{s_{Be}}{s_T - s_{Be}}$$
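The simplification can be verified numerically: once s_T is known, only s_Be has to be recomputed for each partition. The base MATLAB sketch below works with the traces directly (the trace of an outer product is a squared norm); the data and the membership matrix `U` are hypothetical, and each column of `U` is assumed to sum to 1.

```matlab
X = [0 0; 0 1; 5 5; 5 6];                  % samples, one per row
U = [0.9 0.8 0.1 0.2; 0.1 0.2 0.9 0.8];    % fuzzy memberships (assumed)

[c, n] = size(U);
m = mean(X, 1);                            % average of all samples

sBe = 0;                                   % trace of S_Be
sWe = 0;                                   % trace of S_We
for i = 1:c
    mei = (U(i, :) * X) / sum(U(i, :));    % fuzzy center of cluster i
    sBe = sBe + sum(U(i, :)) * sum((mei - m).^2);
    for j = 1:n
        sWe = sWe + U(i, j) * sum((X(j, :) - mei).^2);
    end
end

% Trace of the total scatter: does not depend on the partition.
sT = sum(sum((X - repmat(m, n, 1)).^2));

Je           = sBe / sWe;                  % direct trace form
JeSimplified = sBe / (sT - sBe);           % simplified form
fprintf('Je = %.4f, simplified = %.4f\n', Je, JeSimplified);
```

Both expressions agree because s_We = s_T - s_Be whenever the memberships of each sample add up to 1.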

J_e comments

J_e may vary in the interval 0 <= J_e <= ∞.

J_e is strictly rigid.

J_e loses precision as the sample overlapping increases.

Applying EFLD

EFLD by number of categories

Samples   2        3        4        5        6
X1        4.6815   4.9136   0.2943   0.2559   0.3157
X2        0.3271   0.8589   0.8757   0.9608   1.0674

Cluster Validity

Inter Class Contrast

Comments

EFLD

Increases as the number of clusters rises.

Increases when classes have a high degree of overlapping.

Reaches its maximum for a wrong number of clusters.

ICC

Evaluates crisp and fuzzy clustering algorithms.

Measures:
Partition Compactness
Partition Separation

ICC must be maximized.

ICC

$$ICC = \frac{s_{Be}}{n} \, D_{min} \, \sqrt{c}$$

s_Be – estimates the quality of the placement of the centres.

1/n – scale factor. Compensates for the influence of the number of points on s_Be.

ICC - 2

D_min – minimum Euclidean distance between all pairs of centres.

Neutralizes the tendency of s_Be to grow, avoiding the maximum being reached for a number of clusters greater than the ideal value.

When 2 or more clusters represent a class, D_min decreases abruptly.

ICC Fuzzy Application

Five classes with 500 points each
No class overlapping
X1 – centres at (1, 2), (6, 2), (1, 6), (6, 6), (3.5, 9); standard deviation 0.3
Apply FCM for m = 2 and c = 2 ... 10

ICC Fuzzy Application Results

                   Number of clusters
Measures   Goal    2        3        4        5
ICC        M       7.596    41.99    51.92    96.70
ICCTra     M       7.596    41.99    51.92    96.70
ICCDet     M       IND      154685   259791   673637
EFLD       M       0.185    0.986    1.877    13.65
EFLDTra    M       0.185    0.986    1.877    13.65
EFLDDet    M       IND      0.955    3.960    182.70
CS         m       0.350    0.096    0.070    0.011
F          M       0.705    0.713    0.795    0.943
MinHT      M       0.647    0.572    2.124    1.994
MeanHT     M       0.519    0.496    1.327    1.887
MinRF      0       0.100    0.316    0        0

Goal: M = measure to be maximized, m = measure to be minimized.

ICC Fuzzy Application Time

           Number of categories
Time       2        3        4        5        Rank
ICC        0.0061   0.0069   0.0082   0.0091   4
ICCTra     0.0078   0.0060   0.0088   0.0110
ICCDet     0.0110   0.0088   0.0110   0.0132
EFLD       0.0053   0.0071   0.0063   0.0080
EFLDTra    0.7678   1.0870   1.4780   1.8982
EFLDDet    0.7800   1.1392   1.5510   2.0160
CS         0.0226   0.0261   0.0382   0.0476
NFI        0.0061   0.0056   0.0058   0.0060   3
F          0.0044   0.0045   0.0049   0.0049   1
FPI        0.0061   0.0045   0.0049   0.0053   2

Application with Overlapping

Five classes with 500 points each
High cluster overlapping
X1 – centres at (1, 2), (6, 2), (1, 6), (6, 6), (3.5, 9); standard deviation 0.3
Apply FCM for m = 2 and c = 2 ... 10

Application Overlapping Results

                   Number of clusters
Measures   Goal    2        3        4        5        10
ICC        M       5.065    4.938    6.191    7.829    5.69
ICCTra     M       5.065    4.938    6.191    7.829    5.69
ICCDet     M       IND      715.19   3572     7048     6024
EFLD       M       0.450    0.585    0.839    1.095    1.344
EFLDTra    M       0.450    0.585    0.839    1.095    1.344
EFLDDet    M       IND      0.049    0.315    0.743    1.200
CS         m       0.164    0.225    0.191    0.122    0.223
F          M       0.754    0.621    0.591    0.586    0.439
MeanHT     M       0.632    0.485    0.550    0.597    0.429
MinRF      0       0.170    0.294    0.194    0.210    0.402
MPE        m       0.568    0.601    0.561    0.525    0.565

Goal: M = measure to be maximized, m = measure to be minimized.

Application Time Results

           Number of clusters
Time       2        3        4        5        Rank
ICC        0.0060   0.0064   0.0077   0.0088   1
ICCTra     0.0066   0.0060   0.0098   0.0110
ICCDet     0.0110   0.0078   0.0110   0.0120
EFLD       0.0063   0.0088   0.0096   0.0110
EFLDTra    0.7930   2.1038   1.7598   2.2584
EFLDDet    0.9720   1.2580   1.6090   1.8450
CS         0.0220   0.0283   0.0362   0.0590   3
F          0.0112   0.0121   0.0061   0.0164
MPE        0.0167   0.0271   0.0319   0.0397   2

ICC conclusions

Fast and efficient
Works with fuzzy and crisp partitions
Efficient even with highly overlapping clusters
High rate of right results