What Is Good Clustering?
![Page 1: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/1.jpg)
8
What Is Good Clustering?
A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity.
The quality of a clustering result depends on the similarity measure used by the method.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
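One rough way to check the "high intra-class, low inter-class similarity" criterion is to compare average pairwise distances inside clusters with those across clusters. The sketch below assumes Euclidean distance as the dissimilarity measure and a hypothetical helper name `mean_intra_inter`; neither comes from the slides.

```python
import math
from itertools import combinations

def mean_intra_inter(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes Euclidean distance and that at least one pair of each kind exists.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Pairs sharing a label count as intra-cluster, the rest as inter-cluster.
        (intra if lp == lq else inter).append(math.dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter(points, labels)
print(intra, inter)
```

A small intra-cluster mean together with a large inter-cluster mean suggests compact, well-separated clusters under the chosen distance.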
![Page 2: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/2.jpg)
9
Vocabulary of Clustering
Records, data points, samples, items, objects, patterns…
Attributes, features, variables…
Similarity, dissimilarity, distances.
Centre, Centroid, Prototype.
Hard Clustering (Crisp Clustering)
![Page 3: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/3.jpg)
10
Requirements of Clustering
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
Insensitive to the initial conditions
High dimensionality
![Page 4: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/4.jpg)
11
Clustering Algorithms
![Page 5: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/5.jpg)
12
Clustering Algorithms
![Page 6: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/6.jpg)
13
Data Representation
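Data matrix (two mode): n objects with p attributes

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity matrix (one mode): d(i,j) is the dissimilarity between objects i and j, each with p attributes

$$
\begin{bmatrix}
0      &        &        &   \\
d(2,1) & 0      &        &   \\
d(3,1) & d(3,2) & 0      &   \\
\vdots & \vdots & \vdots &   \\
d(n,1) & d(n,2) & \cdots & 0
\end{bmatrix}
$$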
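As a minimal sketch of how the one-mode matrix can be derived from the two-mode matrix, the code below builds the lower-triangular dissimilarity matrix row by row. Euclidean distance is an assumption made for illustration; the slides do not fix a particular d(i,j), and `dissimilarity_matrix` is a hypothetical name.

```python
import math

def dissimilarity_matrix(data):
    """Return the lower-triangular dissimilarity matrix of a data matrix.

    `data` is a list of n rows, each holding p numeric attributes.
    Euclidean distance is assumed; any other d(i, j) could be substituted.
    """
    matrix = []
    for i in range(len(data)):
        # Row i holds d(i, 0) ... d(i, i-1) followed by the zero diagonal entry.
        row = [math.dist(data[i], data[j]) for j in range(i)]
        row.append(0.0)
        matrix.append(row)
    return matrix

data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
result = dissimilarity_matrix(data)
for row in result:
    print(row)
```

Only the lower triangle is stored because d(i,j) = d(j,i) and d(i,i) = 0.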
![Page 7: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/7.jpg)
14
How to deal with missing values?
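$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$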
![Page 8: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/8.jpg)
15
Types of Clusters: Well-Separated
Well-separated clusters: a cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
3 well-separated clusters
![Page 9: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/9.jpg)
16
Types of Clusters: Center-Based
Center-based: a cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its own cluster than to the center of any other cluster.
The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster
4 center-based clusters
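The difference between a centroid and a medoid can be seen on a small cluster with one outlier. This is only a sketch: "most representative" is interpreted here as the member with the smallest total Euclidean distance to the other members, which is a common reading but an assumption relative to the slide.

```python
import math

def centroid(points):
    """Component-wise average of all points in the cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """Cluster member with the smallest total distance to all members (assumed definition)."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (14.0, 0.0)]
print(centroid(cluster))  # (4.0, 0.0), not an actual data point
print(medoid(cluster))    # (2.0, 0.0), an actual data point
```

The centroid is pulled toward the outlier at x = 14, while the medoid stays on a real member of the cluster.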
![Page 10: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/10.jpg)
17
Types of Clusters: Contiguity-Based
Contiguous cluster (nearest neighbor or transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
8 contiguous clusters
![Page 11: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/11.jpg)
18
Types of Clusters: Density-Based
Density-based: a cluster is a dense region of points that is separated from other regions of high density by low-density regions.
Used when the clusters are irregular or intertwined, and when noise and outliers are present.
6 density-based clusters
![Page 12: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/12.jpg)
19
Types of Clusters: Conceptual Clusters
Shared property or conceptual clusters: finds clusters that share some common property or represent a particular concept.
2 Overlapping Circles
![Page 13: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/13.jpg)
20
Types of Clusters: Objective Function
Clusters defined by an objective function: finds clusters that minimize or maximize an objective function.
Enumerate all possible ways of dividing the points into clusters and evaluate the "goodness" of each potential set of clusters by using the given objective function.
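The enumeration idea can be sketched for a tiny one-dimensional data set. The objective used here, the sum of squared errors (SSE) around each cluster mean, is an assumption chosen for illustration; the slide does not name a specific objective function. Exhaustive search tries k^n labelings, so it is feasible only for very small n.

```python
from itertools import product

def sse(points, labels, k):
    """Sum of squared distances of 1-D points to their cluster means (assumed objective)."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            return float("inf")  # reject labelings that leave a cluster empty
        center = sum(members) / len(members)
        total += sum((p - center) ** 2 for p in members)
    return total

def best_partition(points, k):
    """Try every labeling of the points with k labels and keep the lowest SSE."""
    return min(product(range(k), repeat=len(points)),
               key=lambda labels: sse(points, labels, k))

points = [1.0, 2.0, 9.0, 10.0]
labels = best_partition(points, 2)
print(labels, sse(points, labels, 2))
```

Practical algorithms avoid this exhaustive search and instead look for a good partition heuristically.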
![Page 14: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/14.jpg)
April 20, 2023 21
Type of data in clustering analysis
![Page 15: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/15.jpg)
22
Symbol Table
![Page 16: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/16.jpg)
23
Symbol Table
![Page 17: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/17.jpg)
24
Frequency Table
![Page 18: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/18.jpg)
25
Frequency Table
![Page 19: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/19.jpg)
26
Frequency Table
![Page 20: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/20.jpg)
27
Frequency Table
![Page 21: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/21.jpg)
28
Type of data in clustering analysis
Binary variables
Nominal variables
Ordinal variables
Interval-scaled variables
Ratio variables
Variables of mixed types
![Page 22: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/22.jpg)
29
Binary variables
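Contingency table for binary data:

|             | Object j: 1 | Object j: 0 | sum   |
|-------------|-------------|-------------|-------|
| Object i: 1 | a           | b           | a + b |
| Object i: 0 | c           | d           | c + d |
| sum         | a + c       | b + d       | p     |

The binary variable is symmetric (simple matching coefficient):

$$
d(i,j) = \frac{b + c}{a + b + c + d}
$$

The binary variable is asymmetric (Jaccard coefficient):

$$
d(i,j) = \frac{b + c}{a + b + c}
$$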
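The two coefficients can be sketched directly from the contingency counts a, b, c and d. The function names are hypothetical, and returning 0.0 when two objects have no 1s at all in the asymmetric case is an assumption added to avoid a division by zero.

```python
def contingency(x, y):
    """Count a, b, c, d for two equal-length 0/1 vectors (x is object i, y is object j)."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def symmetric_dissimilarity(x, y):
    """Simple matching: both 1-1 and 0-0 agreements count as matches."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)

def asymmetric_dissimilarity(x, y):
    """Jaccard-style: 0-0 agreements (d) are ignored."""
    a, b, c, _ = contingency(x, y)
    if a + b + c == 0:
        return 0.0  # assumption: two all-zero objects are treated as identical
    return (b + c) / (a + b + c)

x = [1, 0, 1, 0, 0, 0]
y = [1, 0, 1, 0, 1, 0]
print(contingency(x, y))
print(symmetric_dissimilarity(x, y), asymmetric_dissimilarity(x, y))
```

Dropping d makes the asymmetric measure larger whenever the two objects share many 0s, which is the intended effect when a 0 carries little information.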
![Page 23: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/23.jpg)
30
Binary variables
![Page 24: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/24.jpg)
31
Dissimilarity between Binary Variables
Example
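gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0

| Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4 |
|------|--------|-------|-------|--------|--------|--------|--------|
| Jack | M      | Y     | N     | P      | N      | N      | N      |
| Mary | F      | Y     | N     | P      | N      | P      | N      |
| Jim  | M      | Y     | P     | N      | N      | N      | N      |

$$
d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33
$$

$$
d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67
$$

$$
d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75
$$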
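The three values in this example can be reproduced with a short script. It is a sketch that uses only the six asymmetric attributes (Fever, Cough, Test-1 to Test-4), as the fractions in the example do; the dictionary layout and function names are illustrative choices.

```python
from itertools import combinations

# Asymmetric attributes only: Fever, Cough, Test-1, Test-2, Test-3, Test-4.
records = {
    "jack": ["Y", "N", "P", "N", "N", "N"],
    "mary": ["Y", "N", "P", "N", "P", "N"],
    "jim":  ["Y", "P", "N", "N", "N", "N"],
}

def encode(values):
    """Map Y and P to 1 and N to 0, as stated in the example."""
    return [1 if v in ("Y", "P") else 0 for v in values]

def asymmetric_dissimilarity(x, y):
    """(b + c) / (a + b + c): mismatches over positions where at least one value is 1."""
    mismatches = sum(1 for u, v in zip(x, y) if u != v)           # b + c
    both_one = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # a
    return mismatches / (both_one + mismatches)

encoded = {name: encode(values) for name, values in records.items()}
distances = {
    (i, j): round(asymmetric_dissimilarity(encoded[i], encoded[j]), 2)
    for i, j in combinations(encoded, 2)
}
print(distances)
```

Jim and Mary come out as the most dissimilar pair, and Jack and Mary as the most similar one.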
![Page 25: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/25.jpg)
32
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: simple matching, where m is the number of matches and p is the total number of variables:

$$
d(i,j) = \frac{p - m}{p}
$$

Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states
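Both methods can be sketched in a few lines. The attribute values below are made up for illustration, and `nominal_dissimilarity` and `one_hot` are hypothetical helper names.

```python
def nominal_dissimilarity(x, y):
    """Method 1: d(i, j) = (p - m) / p, with m matches among p nominal variables."""
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one new binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]

i = ["red", "monday", "summer"]
j = ["red", "friday", "summer"]
print(nominal_dissimilarity(i, j))  # two of three variables match

colors = ["red", "yellow", "blue", "green"]
print(one_hot("blue", colors))      # [0, 0, 1, 0]
```

After the encoding in Method 2, the binary-variable coefficients can be applied to the resulting 0/1 vectors.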
![Page 26: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/26.jpg)
33
Nominal Variables
Examples: eye color, days of the week, religion, seasons, job title
![Page 27: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/27.jpg)
34
Nominal Variables
Find the Proximity Matrix?
![Page 28: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/28.jpg)
35
Ordinal Variables
Order is important, e.g., rank
Can be treated like interval-scaled variables:
replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$
z_{if} = \frac{r_{if} - 1}{M_f - 1}
$$

compute the dissimilarity using methods for interval-scaled variables
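The rank mapping can be sketched as follows, assuming the ordered list of states for the variable is known in advance and contains at least two states. The grade labels are invented for illustration.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each ordinal value by z = (r - 1) / (M - 1), where r is its rank in 1..M."""
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m_f = len(ordered_states)  # number of ordered states, M_f (must be >= 2)
    return [(rank[v] - 1) / (m_f - 1) for v in values]

grades = ["fair", "good", "excellent", "fair"]
z = ordinal_to_unit_interval(grades, ["fair", "good", "excellent"])
print(z)  # [0.0, 0.5, 1.0, 0.0]
```

The resulting z values can then be fed to any interval-scaled distance, for example the absolute difference.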
![Page 29: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/29.jpg)
36
Ordinal Variables
Find the Proximity Matrix?
![Page 30: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/30.jpg)
37
Interval-valued variables
Examples: temperature, weight, time, age, length
![Page 31: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/31.jpg)
38
Interval-valued variables
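Standardize data
Calculate the mean absolute deviation:

$$
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)
$$

where

$$
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)
$$

Calculate the standardized measurement (z-score):

$$
z_{if} = \frac{x_{if} - m_f}{s_f}
$$

Using mean absolute deviation is more robust than using standard deviation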
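The standardization steps can be sketched for a single variable f, given as a plain list of measurements. The sample values are made up, and the code assumes the column is not constant, so that the mean absolute deviation is nonzero.

```python
def standardize(column):
    """Return z-scores based on the mean absolute deviation of one variable."""
    n = len(column)
    m_f = sum(column) / n                        # mean of the variable
    s_f = sum(abs(x - m_f) for x in column) / n  # mean absolute deviation
    return [(x - m_f) / s_f for x in column]

column = [2.0, 4.0, 4.0, 6.0]
z = standardize(column)
print(z)  # [-2.0, 0.0, 0.0, 2.0]
```

Because deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation, so outliers remain more visible after standardization.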
![Page 32: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/32.jpg)
39
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
Methods:
treat them like interval-scaled variables: not a good choice! (why?)
apply logarithmic transformation: $y_{if} = \log(x_{if})$
treat them as continuous ordinal data and treat their rank as interval-scaled
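A small sketch of the logarithmic transformation is given below. The natural logarithm and the sample values are assumptions made for illustration; the slide does not specify the base.

```python
import math

def log_transform(column):
    """Apply y = log(x) to a strictly positive ratio-scaled variable."""
    return [math.log(x) for x in column]

raw = [1.0, 10.0, 100.0, 1000.0]
logged = log_transform(raw)

raw_gaps = [b - a for a, b in zip(raw, raw[1:])]
log_gaps = [b - a for a, b in zip(logged, logged[1:])]
print(raw_gaps)  # [9.0, 90.0, 900.0]
print(log_gaps)  # three (nearly) equal gaps of about log(10)
```

On the raw scale, distances are dominated by the largest values, which is one likely reason that treating such variables as interval-scaled is discouraged; after the transformation, equal ratios become equal differences.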
![Page 33: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/33.jpg)
40
Ratio-Scaled Variables
Find the Proximity Matrix?
![Page 34: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/34.jpg)
Variables of Mixed Types
A database may contain all the six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
One may use a weighted formula to combine their effects:

$$
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
$$

f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled: compute ranks $r_{if}$ and

$$
z_{if} = \frac{r_{if} - 1}{M_f - 1}
$$

and treat $z_{if}$ as interval-scaled
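The weighted formula can be sketched as below. Several details are assumptions not stated on the slide: the indicator δ is taken to be 0 when either value is missing (`None`) or when an asymmetric binary variable is 0 for both objects, and 1 otherwise; the normalized interval distance is taken to be |x_if - x_jf| divided by the range of variable f; ordinal and ratio-scaled variables are assumed to be converted to z values beforehand and passed as interval variables with range 1.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Weighted mixed-type dissimilarity between two objects.

    kinds[f] is "nominal", "asymmetric" or "interval" (hypothetical labels);
    ranges[f] is max - min of variable f, used only for interval variables.
    """
    numerator = 0.0
    denominator = 0.0
    for f, (u, v) in enumerate(zip(x, y)):
        if u is None or v is None:
            continue  # delta = 0: a missing value contributes nothing
        if kinds[f] == "asymmetric" and u == 0 and v == 0:
            continue  # delta = 0: a 0-0 match on an asymmetric binary variable
        if kinds[f] == "interval":
            d_f = abs(u - v) / ranges[f]   # assumed range normalization
        else:
            d_f = 0.0 if u == v else 1.0   # binary or nominal mismatch
        numerator += d_f
        denominator += 1.0
    # Assumption: objects with no comparable variables get dissimilarity 0.
    return numerator / denominator if denominator else 0.0

kinds = ("nominal", "interval", "asymmetric", "interval")
ranges = (None, 40.0, None, 10.0)
obj_i = ("red", 20.0, 1, None)
obj_j = ("blue", 30.0, 0, 5.0)
d_ij = mixed_dissimilarity(obj_i, obj_j, kinds, ranges)
print(d_ij)  # (1 + 0.25 + 1) / 3 = 0.75
```

Skipping variables with δ = 0 is also one simple way to handle missing values, since only the variables observed for both objects enter the average.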
![Page 35: What Is Good Clustering?](https://reader036.fdocuments.in/reader036/viewer/2022062321/56813116550346895d9771f8/html5/thumbnails/35.jpg)
42
Variables of Mixed Types
Find the Proximity Matrix?