Cluster Analysis
SUSHIL KULKARNI
Cluster Analysis
– What is Cluster Analysis?
– Types of Data in Cluster Analysis
– A Categorization of Major Clustering Methods
– Partitioning Methods
– Hierarchical Methods
– Density-Based Methods
– Grid-Based Methods
– Model-Based Clustering Methods
– Outlier Analysis
– Summary
What is a Cluster?
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
Similarity measures:
– Euclidean distance, if attributes are continuous.
– Other problem-specific measures.
Outliers
Outliers are objects that do not belong to any cluster or form clusters of very small cardinality
In some applications we are interested in discovering outliers, not clusters (outlier analysis)
(Figure: a scatter of points forming one dense cluster, with a few outliers lying away from it.)
Cluster Analysis
– What is Cluster Analysis?
– Types of Data in Cluster Analysis
– A Categorization of Major Clustering Methods
– Partitioning Methods
– Hierarchical Methods
Data Structures

Data matrix (two modes): the "classic" data input, with n tuples/objects as rows and p attributes/dimensions as columns:

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity or distance matrix (one mode): the desired data input to some clustering algorithms, an objects-by-objects matrix:

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
Measuring Similarity in Clustering
Dissimilarity/similarity metric: the dissimilarity d(i, j) between two objects i and j is expressed in terms of a distance function, which is typically a metric:
1. d(i, j) ≥ 0 (non-negativity)
2. d(i, i) = 0 (isolation)
3. d(i, j) = d(j, i) (symmetry)
4. d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality)
Types of data in cluster analysis
– Interval-scaled variables, e.g., salary, height
– Binary variables, e.g., gender (M/F), has_cancer (T/F)
– Nominal (categorical) variables, e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
Similarity and Dissimilarity Between Objects
Distance metrics are normally used to measure the similarity or dissimilarity between two data objects
Similarity and Dissimilarity Between Objects (Cont.)
Euclidean distance:

$$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{in}-x_{jn}|^2}$$

Properties:
– d(i, j) ≥ 0
– d(i, i) = 0
– d(i, j) = d(j, i)
– d(i, j) ≤ d(i, k) + d(k, j)

One can also use a weighted distance:

$$d(i,j) = \sqrt{w_1|x_{i1}-x_{j1}|^2 + w_2|x_{i2}-x_{j2}|^2 + \cdots + w_n|x_{in}-x_{jn}|^2}$$
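As a quick illustration (not part of the original slides; the helper name and the sample points are ours), both formulas translate directly into Python:

```python
import math

def weighted_euclidean(x, y, w=None):
    """Weighted Euclidean distance; with w = None it reduces to the
    plain Euclidean distance d(i, j) defined above."""
    if w is None:
        w = [1.0] * len(x)  # unweighted case
    return math.sqrt(sum(wf * abs(xf - yf) ** 2
                         for wf, xf, yf in zip(w, x, y)))

print(weighted_euclidean((1, 2), (4, 6)))          # 5.0
print(weighted_euclidean((1, 2), (4, 6), (4, 1)))  # sqrt(4*9 + 1*16) ~ 7.21
```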
Binary Variables

A binary variable has two states: 0 (absent) and 1 (present).

A contingency table for binary data:

                object j
              1       0      sum
object i  1   a       b      a+b
          0   c       d      c+d
         sum  a+c     b+d     p

Jaccard coefficient distance (non-invariant if the binary variable is asymmetric):

$$d(i,j) = \frac{b+c}{a+b+c}$$
Binary Variables
– A 1-1 match is a stronger indicator of similarity than a 0-0 match.
– It may therefore be reasonable to ignore 0-0 matches.
– The simple matching coefficient gives equal weight to 1-1 matches and 0-0 matches.
– The Jaccard coefficient ignores 0-0 matches.
Dissimilarity between Binary Variables
Example (Jaccard coefficient):

Name    Fever  Cough  Test-1  Test-2  Test-3  Test-4
Tina      1      0      1       0       0       0
Dina      1      0      1       0       1       0
Meena     1      1      0       0       0       0
Dissimilarity between Binary Variables
Consider the pair Tina and Dina:
– Number of attributes where both names are 1: a = 2 (Fever, Test-1)
– Number of attributes where name 1 is 1 and name 2 is 0: b = 0
– Number of attributes where name 1 is 0 and name 2 is 1: c = 1 (Test-3)
– Number of attributes where both names are 0: d = 3 (Cough, Test-2, Test-4)
Using the Jaccard coefficient distance

$$d(i,j) = \frac{b+c}{a+b+c}$$

with a = 2, b = 0, c = 1, d = 3 for Tina and Dina (and the analogous counts for the other pairs), we get:

d(Tina, Dina) = (0 + 1) / (2 + 0 + 1) ≈ 0.33
d(Tina, Meena) = (1 + 1) / (1 + 1 + 1) ≈ 0.67
d(Meena, Dina) = (1 + 2) / (1 + 1 + 2) = 0.75

Based on the magnitude of the dissimilarities, Tina and Dina are the most similar pair and Meena and Dina the least similar; the remaining pair lies between these extremes.
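These numbers can be checked with a short Python sketch (our own helper, not from the slides); the three vectors follow the table above:

```python
def jaccard_distance(x, y):
    """d(i, j) = (b + c) / (a + b + c); 0-0 matches (d) are ignored."""
    a = sum(xi == 1 and yi == 1 for xi, yi in zip(x, y))
    b = sum(xi == 1 and yi == 0 for xi, yi in zip(x, y))
    c = sum(xi == 0 and yi == 1 for xi, yi in zip(x, y))
    return (b + c) / (a + b + c)

# Attribute order: Fever, Cough, Test-1, Test-2, Test-3, Test-4
tina, dina, meena = (1,0,1,0,0,0), (1,0,1,0,1,0), (1,1,0,0,0,0)
print(round(jaccard_distance(tina, dina), 2),    # 0.33
      round(jaccard_distance(tina, meena), 2),   # 0.67
      round(jaccard_distance(meena, dina), 2))   # 0.75
```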
Cluster Analysis
– What is Cluster Analysis?
– Types of Data in Cluster Analysis
– A Categorization of Major Clustering Methods
– Partitioning Methods
– Hierarchical Methods
Major Clustering Approaches
– Partitioning algorithms: construct random partitions and then iteratively refine them by some criterion.
– Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
Cluster Analysis
– What is Cluster Analysis?
– Types of Data in Cluster Analysis
– A Categorization of Major Clustering Methods
– Partitioning Methods
– Hierarchical Methods
Partitioning Algorithms: Basic Concepts
Partitioning method: construct a partition of a database D of n objects into a set of k clusters.

Methods: the k-means and k-medoids algorithms.
– k-means (MacQueen '67): each cluster is represented by the center of the cluster.
– k-medoids or PAM, Partition Around Medoids (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster.
(Figure: a cluster's centroid, its mean point, versus its medoid, the most centrally located actual object.)
The k-means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. (The centroid is the center, i.e., the mean point, of the cluster.)
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when there are no more new assignments.
K-Means Example
Given: {2,4,10,12,3,20,30,11,25}. Assume we want two clusters. Write the elements in increasing order: {2,3,4,10,11,12,20,25,30}. A small code sketch follows the steps.
– Randomly assign means m1 = 3, m2 = 4: K1 = {2,3}, K2 = {4,10,11,12,20,25,30}
– Means become m1 = 2.5, m2 = 16: K1 = {2,3,4}, K2 = {10,11,12,20,25,30}
– Means become m1 = 3, m2 = 18: K1 = {2,3,4,10}, K2 = {11,12,20,25,30}
– Means become m1 = 4.75, m2 = 19.6: K1 = {2,3,4,10,11,12}, K2 = {20,25,30}
– Means become m1 = 7, m2 = 25: stop, since with these means no more points jump from K2 to K1.
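A minimal sketch of the same computation (our own code, assuming one-dimensional points, the initial means from the example, and that no cluster ever goes empty):

```python
def kmeans_1d(points, means, max_iter=100):
    """Plain 1-D k-means: assign each point to the nearest mean,
    recompute the means, and stop when the means no longer change."""
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        new_means = [sum(c) / len(c) for c in clusters]  # assumes no empty cluster
        if new_means == means:
            break
        means = new_means
    return clusters, means

data = [2, 3, 4, 10, 11, 12, 20, 25, 30]
clusters, means = kmeans_1d(data, means=[3, 4])
print(clusters)  # [[2, 3, 4, 10, 11, 12], [20, 25, 30]]
print(means)     # [7.0, 25.0]
```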
The k-means Clustering Method
Example: (Figure: four scatter plots of 2-D points on a 0-10 grid, showing k-means iterations: initial assignment, updated centroids, reassignment, and the final clustering.)
Comments on the k-means Method
– Applicable only when the mean is defined; what about categorical data?
– Need to specify k, the number of clusters, in advance.
– Unable to handle noisy data and outliers.
The K-Medoids Clustering Method
A medoid is the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e., it is the most centrally located point in the given data set.
The K-Medoids Clustering Algorithm
The medoid clustering algorithm is as follows (a brief code sketch follows the steps):
1. Begin with an arbitrary selection of k objects as medoid points out of the n data points (n > k).
2. After selecting the k medoid points, associate each data object in the given data set with its most similar medoid, where similarity is defined using a distance measure (e.g., Euclidean distance).
3. Randomly select a non-medoid object O′.
4. Compute the total cost S of swapping the initial medoid object with O′.
5. If S < 0, swap the initial medoid with the new one (there is then a new set of medoids).
6. Repeat steps 2-5 until there is no change in the medoids.
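A minimal PAM-style sketch of these steps (our own code, using Manhattan distance as the cost; a production implementation would cache distances rather than recompute them):

```python
import random

def cost(x, c):
    """Manhattan distance between a point and a medoid."""
    return sum(abs(xi - ci) for xi, ci in zip(x, c))

def total_cost(points, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(cost(p, m) for m in medoids) for p in points)

def k_medoids(points, k):
    medoids = random.sample(points, k)          # step 1: arbitrary medoids
    while True:
        improved = False
        for o in points:                        # step 3: candidate non-medoid O'
            if o in medoids:
                continue
            for i in range(k):                  # step 4: cost of swapping m_i -> O'
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids, improved = candidate, True  # step 5: S < 0, swap
        if not improved:                        # step 6: medoids unchanged, stop
            return medoids
```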
The K-Medoids Clustering Method
Cluster the following data set of ten objects into two clusters, i.e., k = 2. Consider the data set:

Object   X   Y
X1       2   6
X2       3   4
X3       3   8
X4       4   7
X5       6   2
X6       6   4
X7       7   3
X8       7   4
X9       8   5
X10      7   6
Step 1

Initialize the k centres: let us assume c1 = (3,4) and c2 = (7,4), so c1 and c2 are selected as the medoids. Calculate distances so as to associate each data object with its nearest medoid. Cost is calculated using the Minkowski distance metric (here with p = 1).
Similarity and Dissimilarity Between Objects

The most popular distance measures conform to the Minkowski distance:

$$L_p(i,j) = \left(|x_{i1}-x_{j1}|^p + |x_{i2}-x_{j2}|^p + \cdots + |x_{in}-x_{jn}|^p\right)^{1/p}$$

where i = (x_{i1}, x_{i2}, …, x_{in}) and j = (x_{j1}, x_{j2}, …, x_{jn}) are two n-dimensional data objects, and p is a positive integer.

If p = 1, then L1 is the Manhattan distance:

$$L_1(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{in}-x_{jn}|$$
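As a sketch (our own helper, not from the slides), both cases in one function:

```python
def minkowski(x, y, p):
    """L_p(i, j); p = 1 gives the Manhattan distance used as the cost
    in this example, p = 2 the Euclidean distance."""
    return sum(abs(xf - yf) ** p for xf, yf in zip(x, y)) ** (1 / p)

print(minkowski((3, 4), (2, 6), p=1))  # 3.0 -> first row of the cost table below
print(minkowski((3, 4), (2, 6), p=2))  # ~2.24
```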
Costs from c1 = (3,4):

Data object Xi   Cost (distance)
(2,6)            3
(3,8)            4
(4,7)            4
(6,2)            5
(6,4)            3
(7,3)            5
(8,5)            6
(7,6)            6

Costs from c2 = (7,4):

Data object Xi   Cost (distance)
(2,6)            7
(3,8)            8
(4,7)            6
(6,2)            3
(6,4)            1
(7,3)            1
(8,5)            2
(7,6)            2
So the clusters become:
Cluster 1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster 2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}
Since the points (2,6), (3,8) and (4,7) are closest to c1, they form one cluster, while the remaining points form the other. The cost between any two points is found using the formula

$$\mathrm{cost}(x, c) = \sum_{i=1}^{d} |x_i - c_i|$$

where x is any data object, c is the medoid, and d is the dimension of the object, which in this case is 2. The total cost is the sum of the costs of the data objects from the medoids of their clusters, so here:
Total cost = {cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7))} + {cost((7,4),(6,2)) + cost((7,4),(6,4)) + cost((7,4),(7,3)) + cost((7,4),(8,5)) + cost((7,4),(7,6))}
= 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20

(Figure: the two clusters after step 1.)
Step 2

Select a non-medoid object O′ randomly: let us assume O′ = (7,3). The candidate medoids are now c1 = (3,4) and O′ = (7,3). With c1 and O′ as the new medoids, calculate the total cost involved, using the same formula as in step 1.
Costs from c1 = (3,4):

Data object Xi   Cost (distance)
(2,6)            3
(3,8)            4
(4,7)            4
(6,2)            5
(6,4)            3
(7,4)            4
(8,5)            6
(7,6)            6

Costs from O′ = (7,3):

Data object Xi   Cost (distance)
(2,6)            8
(3,8)            9
(4,7)            7
(6,2)            2
(6,4)            2
(7,4)            1
(8,5)            3
(7,6)            3
Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22

So the cost of swapping the medoid from c2 to O′ is
S = current total cost − past total cost = 22 − 20 = 2 > 0.
So moving to O′ would be a bad idea: the previous choice was good, and the algorithm terminates here (i.e., there is no change in the medoids).
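Both totals, and the sign of S, can be verified with a few lines of Python (our own sketch, reusing the Manhattan cost from above):

```python
def cost(x, c):
    return sum(abs(xi - ci) for xi, ci in zip(x, c))

def total_cost(points, medoids):
    # Each point contributes its distance to the nearest medoid.
    return sum(min(cost(p, m) for m in medoids) for p in points)

data = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
        (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

before = total_cost(data, [(3, 4), (7, 4)])  # 20
after  = total_cost(data, [(3, 4), (7, 3)])  # 22
print(before, after, after - before)         # 20 22 2 -> S > 0, keep old medoids
```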
Cluster Analysis
– What is a Cluster?
– Types of Data in Cluster Analysis
– A Categorization of Major Clustering Methods
– Partitioning Methods
– Hierarchical Methods
Hierarchical Clustering
Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
(Figure: objects a, b, c, d, e over steps 0-4. Agglomerative clustering (AGNES) proceeds left to right, merging {a,b}, then {d,e}, then {c,d,e}, and finally {a,b,c,d,e}; divisive clustering (DIANA) runs the same steps in reverse, from step 4 back to step 0.)
AGNES (Agglomerative Nesting)

– Implemented in statistical analysis packages, e.g., S-Plus.
– Uses the single-link method and the dissimilarity matrix.
– Merges the objects that have the least dissimilarity.
– Goes on in a non-descending fashion.
– Eventually all objects belong to the same cluster.

Single-link: each time, merge the pair of clusters (C1, C2) connected by the shortest single link of objects, i.e., the pair minimizing $\min_{p \in C_1, q \in C_2} \mathrm{dist}(p, q)$.
(Figure: three scatter plots of 2-D points on a 0-10 grid, showing clusters growing as single-link merging proceeds.)
– Uses the dissimilarity/distance matrix as input.
– Starts with each individual item in its own cluster.
– Merges the nodes that have the least dissimilarity.
– Goes on in a non-descending fashion.
– Eventually all nodes belong to the same cluster.
– The output is a dendrogram, represented as a set of ordered triples (d, k, K), where d is the threshold distance, k is the number of clusters, and K is the set of clusters.
AGNES (Agglomerative Nesting): Minimum Distance Method

(Figure: the weighted graph on vertices A, B, C, D, E with all pairwise distances; the graph with threshold dmin = 1, keeping only the edges A–B and C–D; and the start of the single-link dendrogram.)
(Figure: the graphs with thresholds dmin = 2, dmin = 3, and dmin = 4, adding edges of increasing length, and the resulting single-link dendrogram over A, B, C, D, E.)
A Dendrogram Shows How the Clusters are Merged Hierarchically

Decompose the data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

E.g., level 1 gives 4 clusters: {a,b}, {c}, {d}, {e}; level 2 gives 3 clusters: {a,b}, {c}, {d,e}; level 3 gives 2 clusters: {a,b}, {c,d,e}; etc.

(Figure: dendrogram over a, b, c, d, e with cut levels 1-4.)
Agglomerative Example

     A  B  C  D  E
A    0  1  2  2  3
B    1  0  2  4  3
C    2  2  0  1  5
D    2  4  1  0  3
E    3  3  5  3  0

(Figure: the corresponding weighted graph on A, B, C, D, E and the single-link dendrogram, with merge thresholds marked on a scale from 1 to 5.)
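A small sketch (our own code, not from the slides) that runs single-link AGNES on this matrix and prints the (d, k, K) triples described earlier:

```python
labels = "ABCDE"
dist = {("A","B"): 1, ("A","C"): 2, ("A","D"): 2, ("A","E"): 3,
        ("B","C"): 2, ("B","D"): 4, ("B","E"): 3,
        ("C","D"): 1, ("C","E"): 5, ("D","E"): 3}

def d(p, q):
    return dist.get((p, q)) or dist.get((q, p))

def single_link(clusters):
    """Repeatedly merge the pair of clusters with the shortest single link,
    yielding (threshold d, number of clusters k, clustering K) triples."""
    yield 0, len(clusters), clusters
    while len(clusters) > 1:
        # Pair of clusters with the minimum single-link distance.
        (i, j), t = min((((i, j),
                          min(d(p, q) for p in clusters[i] for q in clusters[j]))
                         for i in range(len(clusters))
                         for j in range(i + 1, len(clusters))),
                        key=lambda e: e[1])
        clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                    + [clusters[i] | clusters[j]])
        yield t, len(clusters), clusters

for t, k, K in single_link([{x} for x in labels]):
    print(t, k, K)
# Merges happen at d = 1 ({A,B}), 1 ({C,D}), 2 ({A,B,C,D}), 3 (all five objects).
```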
MST Example

     A  B  C  D  E
A    0  1  2  2  3
B    1  0  2  4  3
C    2  2  0  1  5
D    2  4  1  0  3
E    3  3  5  3  0

(Figure: a minimum spanning tree of the weighted graph on A, B, C, D, E.)
DIANA (Divisive Analysis)
– Implemented in statistical analysis packages, e.g., S-Plus.
– Inverse order of AGNES.
– Eventually each node forms a cluster on its own.
(Figure: three scatter plots of 2-D points on a 0-10 grid, showing a divisive clustering splitting one cluster into progressively finer clusters.)
– Inverse order of AGNES.
– Initially, all items are placed in one cluster.
– Clusters are split when some elements are not sufficiently close to the other elements.
– Eventually each node forms a cluster on its own.
– One simple example of a divisive algorithm is based on the MST version of the single-link algorithm: edges are cut out of the minimum spanning tree from the largest to the smallest (sketched in code below).
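A sketch of that idea in Python (our own code): build the MST with Kruskal's algorithm, then read the splits off the edges from largest to smallest. Ties in edge weights mean the MST is not unique, so it may differ slightly from the tree drawn on the next slide:

```python
# Edges (weight, u, v) of the A-E graph from the agglomerative example.
edges = sorted([(1, "A", "B"), (2, "A", "C"), (2, "A", "D"), (3, "A", "E"),
                (2, "B", "C"), (4, "B", "D"), (3, "B", "E"),
                (1, "C", "D"), (5, "C", "E"), (3, "D", "E")])

parent = {v: v for v in "ABCDE"}
def find(v):                       # union-find root lookup
    while parent[v] != v:
        v = parent[v]
    return v

mst = []
for w, u, v in edges:              # Kruskal: take the shortest edge that
    if find(u) != find(v):         # does not close a cycle
        parent[find(u)] = find(v)
        mst.append((w, u, v))

print(mst)   # [(1, 'A', 'B'), (1, 'C', 'D'), (2, 'A', 'C'), (3, 'A', 'E')]
# Cutting MST edges from largest to smallest first splits off {E},
# then separates {A,B} from {C,D}, and finally leaves singletons.
```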
(Figure: the full weighted graph on A, B, C, D, E and its minimum spanning tree, with edges A–B = 1, C–D = 1, B–C = 2, D–E = 3.)
Cut the largest edge, E–D (weight 3): the cluster {A,B,C,D,E} is split into two clusters, {E} and {A,B,C,D}.
(Figure: the remaining tree on A, B, C, D with edges A–B = 1, B–C = 2, C–D = 1.)

The two clusters are now {E} and {A,B,C,D}. Next, cut the largest remaining edge, between B and C: the cluster {A,B,C,D} is split into {A,B} and {C,D}.
(Figure: the remaining edges A–B = 1 and C–D = 1.)

In the next step these are split as well, finally giving the singleton clusters {A}, {B}, {C}, {D}, {E}.
More on Hierarchical Clustering Methods
Integration of hierarchical with distance-based clustering:
– BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters.
– CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction.
– CHAMELEON (1999): hierarchical clustering using dynamic modeling.