CLUSTERING (Segmentation)
Saed Sayad
www.ismartsoft.com
Data Mining Steps
What is Clustering?
Given a set of records, organize the records into clusters.
A cluster is a subset of records that are similar.
[Figure: scatter plot of records with axes Age and Income]
Clustering Requirements
• Ability to discover some or all of the hidden clusters.
• Within-cluster similarity and between-cluster dissimilarity.
• Ability to deal with various types of attributes.
• Ability to deal with noise and outliers.
• Ability to handle high dimensionality.
• Scalability, interpretability and usability.
Similarity - Distance Measure
To measure similarity or dissimilarity between objects, we need a distance measure. The usual axioms for a distance measure D are:
• D(x, x) = 0
• D(x, y) = D(y, x)
• D(x, y) ≤ D(x, z) + D(z, y) (the triangle inequality)
Similarity - Distance Measure
Euclidean: $D(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$
Manhattan: $D(x, y) = \sum_{i=1}^{k} |x_i - y_i|$
Minkowski: $D(x, y) = \left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q}$
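The three distance measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names and the two sample records (age, income in thousands) are made up for the example.

```python
import math

def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length records."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def manhattan(x, y):
    """Manhattan distance: the Minkowski distance with q = 1."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Euclidean distance: the Minkowski distance with q = 2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical records: (age, income in thousands)
a = (25, 40)
b = (28, 44)

print(manhattan(a, b))  # 7
print(euclidean(a, b))  # 5.0
```

Setting q = 1 or q = 2 in `minkowski` gives the Manhattan and Euclidean values respectively, up to floating-point rounding.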
Similarity - Correlation
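[Figure: two scatter plots of Credit$ against Age, one labeled Similar and one labeled Dissimilar]
$r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$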
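As a rough sketch, the correlation coefficient r_xy can be computed directly from its definition. The function name and the age and credit values below are invented for illustration only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical data: credit rising with age, and credit falling with age
ages = [20, 30, 40, 50]
credit_up = [1000, 2000, 3000, 4000]
credit_down = [4000, 3000, 2000, 1000]

r_up = pearson_r(ages, credit_up)      # perfectly linear, increasing
r_down = pearson_r(ages, credit_down)  # perfectly linear, decreasing
```

Here `r_up` is close to 1 and `r_down` is close to -1. The sketch does not guard against a constant sequence, which would make the denominator zero.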
Similarity – Hamming Distance
Gene 1 A A T C C A G T
Gene 2 T C T C A A G C
Hamming Distance 1 1 0 0 1 0 0 1
$D_H = \sum_{i=1}^{k} |x_i - y_i|$
where $|x_i - y_i|$ is 0 when the two sequences match at position i and 1 otherwise.
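The gene example above can be checked with a short sketch that counts mismatching positions. The function name is an assumption made for this example.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(1 for a, b in zip(x, y) if a != b)

gene_1 = "AATCCAGT"
gene_2 = "TCTCAAGC"

print(hamming_distance(gene_1, gene_2))  # 4
```

The result, 4, is the sum of the per-position values 1 1 0 0 1 0 0 1 shown in the table.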
Clustering Methods
• Exclusive vs. Overlapping
• Hierarchical vs. Partitive
• Deterministic vs. Probabilistic
• Incremental vs. Batch learning
Exclusive vs. Overlapping
[Figure: two scatter plots with axes Age and Income]
Hierarchical vs. Partitive
[Figure: scatter plot with axes Age and Income]
Hierarchical Clustering
• Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom. For example, all files and folders on a hard disk are organized in a hierarchy.
• There are two types of hierarchical clustering:
  – Agglomerative
  – Divisive
Hierarchical Clustering
[Figure: two panels, labeled Agglomerative and Divisive]
Hierarchical Clustering - Agglomerative
1. Assign each observation to its own cluster.
2. Compute the similarity (e.g., distance) between each of the clusters.
3. Join the two most similar clusters.
4. Repeat steps 2 and 3 until there is only a single cluster left.
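The agglomerative steps can be sketched as a simple loop. This is an illustrative, unoptimized version: the slide leaves the cluster-to-cluster similarity open, so the sketch assumes the smallest pairwise Euclidean distance between members, and the function names and sample points are made up.

```python
import math

def cluster_distance(c1, c2, points):
    # Assumption: distance between clusters is the smallest
    # pairwise Euclidean distance between their members.
    return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points):
    """Return the merge history as a list of (cluster_a, cluster_b) pairs,
    where each cluster is a tuple of point indices."""
    clusters = [(i,) for i in range(len(points))]  # step 1: one cluster per point
    merges = []
    while len(clusters) > 1:                       # step 4: repeat until one cluster
        best = None
        for a in range(len(clusters)):             # step 2: compare every pair
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merges.append((clusters[a], clusters[b]))  # step 3: join the closest pair
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical observations forming two tight groups
points = [(1, 1), (1, 2), (8, 8), (9, 8)]
merges = agglomerative(points)
```

With these points, the two nearby pairs are joined first and the final merge joins the two groups. The merge history is what a dendrogram would draw from bottom to top.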
Hierarchical Clustering - Divisive
1. Assign all of the observations to a single cluster.
2. Partition the cluster into the two least similar clusters.
3. Proceed recursively on each cluster until there is one cluster for each observation.
Hierarchical Clustering – Single Linkage
$L(r, s) = \min(D(x_{ri}, x_{sj}))$
[Figure: two clusters, labeled r and s]
Hierarchical Clustering – Complete Linkage
$L(r, s) = \max(D(x_{ri}, x_{sj}))$
[Figure: two clusters, labeled r and s]
Hierarchical Clustering – Average Linkage
$L(r, s) = \frac{1}{n_r n_s} \sum_{i=1}^{n_r} \sum_{j=1}^{n_s} D(x_{ri}, x_{sj})$
[Figure: two clusters, labeled r and s]
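The single, complete and average linkage criteria differ only in how they summarize the pairwise distances between the members of clusters r and s. A minimal sketch, assuming D is the Euclidean distance (the function names and the two sample clusters are hypothetical):

```python
import math
from itertools import product

def pairwise_distances(r, s):
    """All distances D(x_ri, x_sj) between members of clusters r and s."""
    return [math.dist(a, b) for a, b in product(r, s)]

def single_linkage(r, s):
    # Shortest distance between any member of r and any member of s
    return min(pairwise_distances(r, s))

def complete_linkage(r, s):
    # Longest distance between any member of r and any member of s
    return max(pairwise_distances(r, s))

def average_linkage(r, s):
    # Mean of all n_r * n_s pairwise distances
    return sum(pairwise_distances(r, s)) / (len(r) * len(s))

# Hypothetical clusters
r = [(0, 0), (0, 1)]
s = [(3, 0), (4, 0)]

print(single_linkage(r, s))    # 3.0
print(complete_linkage(r, s))  # sqrt(17), about 4.12
print(average_linkage(r, s))   # lies between the two values above
```

For any pair of clusters the average linkage value falls between the single and complete linkage values, since it is a mean of the same set of distances.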
K Means Clustering
1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign observations to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid or mean of all instances in each cluster (this is the "mean" part of the name).
5. Repeat steps 3 and 4, using the new centroids as cluster centers, until the same points are assigned to each cluster in consecutive rounds.
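A compact sketch of the K-Means procedure follows. It reads the final step as repeating the assignment and centroid updates with the new centroids as centers. The function name, the `seed` and `max_iter` parameters, the handling of empty clusters and the sample points are all assumptions made for this example.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centers, assignment), where assignment[i] is the index
    of the cluster that points[i] belongs to."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # select k points at random as centers
    assignment = None
    for _ in range(max_iter):
        # Assign each observation to its closest center (Euclidean distance)
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Stop when the assignment is unchanged between consecutive rounds
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each center as the mean of its assigned observations
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

# Hypothetical observations forming two well-separated groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignment = kmeans(points, k=2)
```

On this data the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random starting centers.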
K Means Clustering
[Figure: scatter plot with axes Age and Income]
K Means Clustering
Sum of Squares function:
$J = \sum_{j=1}^{K} \sum_{n \in S_j} \lVert x_n - \mu_j \rVert^2$
where $S_j$ is the set of points in cluster j and $\mu_j$ is the centroid of that cluster.
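The sum of squares function J can be evaluated for a given clustering as in the sketch below. It assumes each cluster's reference point is the mean of its members and that the distance is Euclidean; the function name and the sample clusters are hypothetical.

```python
def sum_of_squares(clusters):
    """J: total squared Euclidean distance from each point to the
    centroid of its own cluster. `clusters` is a list of lists of points."""
    total = 0.0
    for members in clusters:
        # Centroid: coordinate-wise mean of the cluster's members
        centroid = [sum(c) / len(members) for c in zip(*members)]
        for p in members:
            total += sum((a - m) ** 2 for a, m in zip(p, centroid))
    return total

# Hypothetical clustering with two clusters of two points each
clusters = [[(1, 1), (1, 3)], [(8, 8), (10, 8)]]

print(sum_of_squares(clusters))  # 4.0
```

Each of the four points lies at distance 1 from its cluster centroid, so J is 4.0. Lower values of J mean tighter clusters.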
Clustering Evaluation
• Sarle’s Cubic Clustering Criterion
• The Pseudo-F Statistic
• The Pseudo-T² Statistic
• Beale’s F-Type Statistic
• Target-based
Clustering Evaluation
Chi2 Test
                Actual Y   Actual N
Predicted Y     n11        n12
Predicted N     n21        n22

$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$
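As an illustration, the chi-square statistic can be computed from a table of observed counts. The slide does not define e_ij; this sketch assumes the usual expected count under independence, (row total × column total) / grand total. The function name and the counts are invented for the example.

```python
def chi_square(table):
    """Chi-square statistic for a table of observed counts n_ij,
    given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Assumption: expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are Predicted Y/N, columns are Actual Y/N
table = [[30, 10],
         [10, 30]]

print(chi_square(table))  # 20.0
```

Every expected count here is 20, and each cell contributes (10)² / 20 = 5, giving a statistic of 20.0. The sketch does not handle tables with an empty row or column, where an expected count would be zero.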
Analysis of Variance (ANOVA)
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square       F             P
Between Groups        SSB              dfB                  MSB = SSB/dfB     F = MSB/MSW   P(F)
Within Groups         SSW              dfW                  MSW = SSW/dfW
Total                 SST              dfT
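The entries of the ANOVA table can be filled in with a short sketch. It assumes a one-way layout, with dfB equal to the number of groups minus one and dfW equal to the number of observations minus the number of groups. The function name and the data are hypothetical, and the P value is left out because it requires the F distribution, which the Python standard library does not provide.

```python
def one_way_anova(groups):
    """Return the ANOVA table entries for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Between-groups sum of squares: group means against the grand mean
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-groups sum of squares: values against their own group mean
    ssw = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_b = len(groups) - 1
    df_w = len(all_values) - len(groups)
    msb = ssb / df_b
    msw = ssw / df_w
    return {"SSB": ssb, "SSW": ssw, "SST": ssb + ssw,
            "dfB": df_b, "dfW": df_w, "dfT": df_b + df_w,
            "MSB": msb, "MSW": msw, "F": msb / msw}

# Hypothetical values of one variable in three clusters
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
result = one_way_anova(groups)

print(result["F"])  # 27.0
```

For this data SSB = 54, SSW = 6 and SST = 60, so MSB = 27, MSW = 1 and F = 27.0. A large F suggests the cluster means differ on that variable.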
Clustering - Applications
• Marketing: finding groups of customers with similar behavior.
• Insurance & Banking: identifying fraud.
• Biology: classification of plants and animals given their features.
• Libraries: book ordering.
• City-planning: identifying groups of houses according to their house type, value and geographical location.
• World Wide Web: document classification; clustering weblog data to discover groups with similar access patterns.
Summary
• Clustering is the process of organizing objects (records or variables) into groups whose members are similar in some way.
• Hierarchical and K-Means are the two most widely used clustering techniques.
• The effectiveness of a clustering method depends on the similarity function.
• The result of a clustering algorithm can be interpreted and evaluated in different ways.
Questions?