COMP9318: Data Warehousing and Data Mining (cs9318/20t1/lect/8clst.pdf)
COMP9318: Data Warehousing and Data Mining
— L8: Clustering —
- What is Cluster Analysis?
Cluster Analysis
- Motivations
  - Arranging objects into groups is a natural and necessary skill that we all share
Human Being’s Approach
Computer’s Approach
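| # | sex | glasses | moustache | smile | hat |
|---|---|---|---|---|---|
| 1 | m | y | n | y | n |
| 2 | f | n | n | y | n |
| 3 | m | y | n | n | n |
| 4 | m | n | n | n | n |
| 5 | m | n | n | y? | n |
| 6 | m | n | y | n | y |
| 7 | m | y | n | y | n |
| 8 | m | n | n | y | n |
| 9 | m | y | y | y | n |
| 10 | f | n | n | n | n |
| 11 | m | n | y | n | n |
| 12 | f | n | n | n | n |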
Object Tracking in Computer Vision
- Colour Histograms (Swain & Ballard, 1990)
- We see a person
- Computers see numbers
- Create a target model
- Model the target statistically
- Find similarity
- Find matching model in subsequent video frames

| Bucket | 1 | 2 | … | 12 |
|---|---|---|---|---|
| 1 | 40 | 70 | … | 70 |
| 2 | 36 | 71 | … | 69 |
IRDM WS 2009
Example 1: Clustering of Flickr Tags
Example 2: Clustering of Social Tags
Grigory Begelmann et al: Automated Tag Clustering: Improving search and exploration in the tag space. http://www.pui.ch/phred/automated_tag_clustering
Example 3: Clustering in Social Networks
Jeffrey Heer, Danah Boyd: Vizster: Visualizing Online Social Networks. INFOVIS 2005
Example 4: Clustering Search Results for Visualization and Navigation
http://www.grokker.com/
What is Cluster Analysis?
- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- Cluster analysis
  - Grouping a set of data objects into clusters
- Clustering belongs to unsupervised classification: no predefined classes
- Typical applications
  - As a stand-alone tool to get insight into data distribution
  - As a preprocessing step for other algorithms
General Applications of Clustering
- Pattern Recognition
- Spatial Data Analysis
  - create thematic maps in GIS by clustering feature spaces
  - detect spatial clusters and explain them in spatial data mining
- Image Processing
- Economic Science (especially market research)
- WWW
  - Document classification
  - Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
- Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: Identification of areas of similar land use in an earth observation database
- Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
- City-planning: Identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
What Is Good Clustering?
- A good clustering method will produce high quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Able to deal with noise and outliers
- Insensitive to order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
Chapter 8. Cluster Analysis
- Preliminaries
Typical Inputs
- Data matrix (n × m)
  - n objects, each represented by an m-dimensional feature vector

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1m} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{im} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{nm}
\end{bmatrix}
$$

- Dissimilarity matrix (n × n)
  - A square matrix giving distances between all pairs of objects.
  - If similarity functions are used ⇒ similarity matrix

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

Key component for clustering: the dissimilarity/similarity metric d(i, j)
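A minimal sketch of how the n × n dissimilarity matrix might be derived from an n × m data matrix. Euclidean distance is assumed here as d(i, j); the function names and the sample data are illustrative and not part of the lecture.

```python
import math

def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dissimilarity_matrix(data, dist=euclidean):
    """Return the n x n matrix D with D[i][j] = dist(data[i], data[j])."""
    n = len(data)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            # Fill the lower triangle and mirror it (d is symmetric).
            D[i][j] = D[j][i] = dist(data[i], data[j])
    return D

# Hypothetical data matrix: n = 3 objects, m = 2 features.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(X)
```

Any other d(i, j) can be passed through the `dist` argument; only the lower triangle needs to be stored when d is symmetric.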
Comments
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on applications and data semantics, or appropriate preprocessing is needed.
- There is a separate “quality” function that measures the “goodness” of a cluster.
- It is hard to define “similar enough” or “good enough”
  - the answer is typically highly subjective.
Type of data in clustering analysis
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types
Interval-valued variables
- Standardize data
  - Calculate the mean absolute deviation:
    $$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
    where
    $$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$
  - Calculate the standardized measurement (z-score):
    $$z_{if} = \frac{x_{if} - m_f}{s_f}$$
- Using mean absolute deviation is more robust than using standard deviation
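The standardization of one variable f can be sketched as follows. This is an illustrative helper, not code from the lecture: the function name and the sample values are assumptions.

```python
def standardize(column):
    """z-scores of one variable f, using the mean absolute deviation s_f."""
    n = len(column)
    m_f = sum(column) / n                         # mean of the variable
    s_f = sum(abs(x - m_f) for x in column) / n   # mean absolute deviation
    if s_f == 0:
        raise ValueError("constant column: s_f is zero")
    return [(x - m_f) / s_f for x in column]

values = [2.0, 4.0, 4.0, 6.0]   # hypothetical measurements of variable f
z = standardize(values)          # m_f = 4, s_f = 1
```

Because the deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation.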
Similarity and Dissimilarity Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects
- A popular choice is the Minkowski distance, or the $L_p$ norm of the difference vector:
  $$d(i,j) = \left(\sum_{f=1}^{m} |x_{if} - x_{jf}|^p\right)^{1/p}$$
- Special cases:
  - if p = 1, d is the Manhattan distance
  - if p = 2, d is the Euclidean distance
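A small sketch of the Minkowski distance and its two special cases. The function name and the example vectors are illustrative assumptions.

```python
def minkowski(u, v, p):
    """L_p distance between two equal-length vectors, for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

i, j = [1.0, 2.0], [4.0, 6.0]
manhattan = minkowski(i, j, 1)   # |1 - 4| + |2 - 6|
euclid = minkowski(i, j, 2)      # sqrt(3^2 + 4^2)
```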
Similarity and Dissimilarity Between Objects (Cont.)
- Other similarity/distance functions:
  - Mahalanobis distance
  - Jaccard, Dice, cosine similarity, Pearson correlation coefficient
- Metric distance
  - Properties
    - $d(i,j) \ge 0$ (positiveness)
    - $d(i,i) = 0$ (reflexivity)
    - $d(i,j) = d(j,i)$ (symmetry)
    - $d(i,j) \le d(i,k) + d(k,j)$ (triangular inequality)
  - Positiveness, reflexivity, and symmetry are common to all distance functions
Areas within a unit distance from q under different Lp distances
[Figure: the unit-distance region around q under the L1, L2, L8, and L∞ distances]
Binary Variables
- A contingency table for binary data

|  | Object j = 1 | Object j = 0 | sum |
|---|---|---|---|
| Object i = 1 | a | b | a + b |
| Object i = 0 | c | d | c + d |
| sum | a + c | b + d | p |

- Simple matching coefficient (invariant, if the binary variable is symmetric):
  $$d(i,j) = \frac{b + c}{a + b + c + d}$$
- Jaccard coefficient (noninvariant if the binary variable is asymmetric):
  $$d(i,j) = \frac{b + c}{a + b + c}$$

| Obj | Vector Representation |
|---|---|
| i | [0, 1, 0, 1, 0, 0, 1, 0] |
| j | [0, 0, 0, 0, 1, 0, 1, 1] |
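As a rough illustration, the contingency counts and both coefficients can be computed for the two vectors i and j given on this slide. The function names are assumptions made for this sketch.

```python
def contingency(i, j):
    """Counts (a, b, c, d) of the 2 x 2 contingency table for 0/1 vectors."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return a, b, c, d

def simple_matching_distance(i, j):
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)

def jaccard_distance(i, j):
    a, b, c, _ = contingency(i, j)
    return (b + c) / (a + b + c)   # undefined when a + b + c == 0

obj_i = [0, 1, 0, 1, 0, 0, 1, 0]
obj_j = [0, 0, 0, 0, 1, 0, 1, 1]
counts = contingency(obj_i, obj_j)            # a=1, b=2, c=2, d=3
smc = simple_matching_distance(obj_i, obj_j)  # 4 / 8
jac = jaccard_distance(obj_i, obj_j)          # 4 / 5
```

The Jaccard form ignores the d joint absences, which is why it suits asymmetric binary variables where a shared 0 carries little information.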
Dissimilarity between Binary Variables
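- Example

| Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4 |
|---|---|---|---|---|---|---|---|
| Jack | M | Y | N | P | N | N | N |
| Mary | F | Y | N | P | N | P | N |
| Jim | M | Y | P | N | N | N | N |

- gender is a symmetric attribute
- the remaining attributes are asymmetric binary
- let the values Y and P be set to 1, and the value N be set to 0

Using $d(i,j) = \frac{b + c}{a + b + c}$ on the asymmetric attributes:

$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$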
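The three values in this example can be checked with a short script. This is a sketch only: the string encoding of each patient (Fever, Cough, Test-1 to Test-4, with Gender left out because it is symmetric) and the helper names are assumptions made for illustration.

```python
# Asymmetric attributes only: Fever, Cough, Test-1, Test-2, Test-3, Test-4.
records = {
    "jack": "YNPNNN",
    "mary": "YNPNPN",
    "jim":  "YPNNNN",
}

def encode(values):
    """Map Y and P to 1 and N to 0."""
    return [1 if v in ("Y", "P") else 0 for v in values]

def asymmetric_binary_distance(u, v):
    """d(i, j) = (b + c) / (a + b + c); joint absences (d) are ignored."""
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

vec = {name: encode(vals) for name, vals in records.items()}
d_jack_mary = asymmetric_binary_distance(vec["jack"], vec["mary"])
d_jack_jim = asymmetric_binary_distance(vec["jack"], vec["jim"])
d_jim_mary = asymmetric_binary_distance(vec["jim"], vec["mary"])
```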
Nominal Variables
- A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
- Method 1: Simple matching
  - m: # of matches, p: total # of variables
  $$d(i,j) = \frac{p - m}{p}$$
- Method 2: One-hot encoding
  - creating a new binary variable for each of the M nominal states
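Both methods for nominal variables can be sketched in a few lines. The helper names and the example objects are hypothetical.

```python
def simple_matching(obj_i, obj_j):
    """d(i, j) = (p - m) / p for objects described by p nominal variables."""
    p = len(obj_i)
    m = sum(1 for x, y in zip(obj_i, obj_j) if x == y)   # number of matches
    return (p - m) / p

def one_hot(value, states):
    """One binary variable per nominal state; exactly one of them is 1."""
    return [1 if value == s else 0 for s in states]

colours = ["red", "yellow", "blue", "green"]
d = simple_matching(["red", "small", "round"], ["red", "large", "round"])
encoded = one_hot("blue", colours)
```

After one-hot encoding, the binary-variable measures from the earlier slides can be applied to the new variables.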
Ordinal Variables
- An ordinal variable can be discrete or continuous
- Order is important, e.g., rank
- Can be treated like interval-scaled
  - replace $x_{if}$ by their rank $r_{if} \in \{1, \ldots, M_f\}$
  - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    $$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
  - compute the dissimilarity using methods for interval-scaled variables
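A minimal sketch of the rank mapping, assuming the ordered list of states is known in advance and has at least two entries. The function name and the example states are illustrative.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values to z = (r - 1) / (M - 1), r being the 1-based rank."""
    M = len(ordered_states)   # must be at least 2
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (M - 1) for v in values]

levels = ["bronze", "silver", "gold"]   # hypothetical ordered states
z = ordinal_to_unit_interval(["gold", "bronze", "silver"], levels)
```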
Ratio-Scaled Variables
- Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
- Methods:
  1. treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted), or
  2. apply logarithmic transformation $y_{if} = \log(x_{if})$, or
  3. treat them as continuous ordinal data; treat their rank as interval-scaled
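To see why the logarithmic transformation helps, the sketch below generates values of the form $Ae^{Bt}$ with assumed constants A = 2 and B = 0.5 (chosen only for illustration) and shows that the transformed values are evenly spaced.

```python
import math

def log_transform(column):
    """y = log(x) for a ratio-scaled variable; all x must be positive."""
    if any(x <= 0 for x in column):
        raise ValueError("ratio-scaled values must be positive")
    return [math.log(x) for x in column]

# Hypothetical measurements x = A * exp(B * t) with A = 2, B = 0.5.
x = [2.0 * math.exp(0.5 * t) for t in range(4)]
y = log_transform(x)
# After the transform, consecutive values differ by the constant B.
gaps = [y[k + 1] - y[k] for k in range(len(y) - 1)]
```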
- A Categorization of Major Clustering Methods
Major Clustering Approaches
- Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
- Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
- Graph-based algorithms: Spectral clustering
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model
- Partitioning Methods
Partitioning Algorithms: Problem Definition
- Partitioning method: Construct a “good” partition of a database of n objects into a set of k clusters
  - Input: an n × m data matrix
- How to measure the “goodness” of a given partitioning scheme?
  - Cost of a cluster:
    $$\mathrm{cost}(C_i) = \sum_{x_j \in C_i} \lVert x_j - \mathrm{center}(C_i) \rVert_2^2$$
    - Note: L2 distance used
  - Analogy with binning?
  - How to choose the center of a cluster?
    - Centroid (i.e., average) of the $x_j$ ⇒ minimizes cost($C_i$)
  - Cost of k clusters: sum of cost($C_i$)
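The cost definition can be sketched as below. The names `sq_dist`, `centroid`, `cluster_cost`, and `total_cost`, and the three sample points, are assumptions made for this illustration.

```python
def sq_dist(u, v):
    """Squared L2 distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def centroid(cluster):
    """Component-wise average of the objects in a cluster."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def cluster_cost(cluster, center=None):
    """Sum of squared distances to the center (the centroid by default)."""
    if center is None:
        center = centroid(cluster)
    return sum(sq_dist(x, center) for x in cluster)

def total_cost(clusters):
    """Cost of a partition: the sum of the per-cluster costs."""
    return sum(cluster_cost(c) for c in clusters)

C1 = [[1.0, 1.0], [3.0, 1.0], [2.0, 4.0]]      # hypothetical cluster
at_centroid = cluster_cost(C1)                  # center = [2, 2]
elsewhere = cluster_cost(C1, [2.0, 3.0])        # any other center costs more
```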
Example (2D)
Partitioning Algorithms: Basic Concept
- It’s an optimization problem!
- Global optimal:
  - NP-hard (for a wide range of cost functions)
  - Requires exhaustively enumerating all partitions
    - Stirling numbers of the second kind:
      $$\left\{ {n \atop k} \right\} = \Theta\!\left(\frac{k^n}{k!}\right)$$
- Heuristic methods:
  - k-means: an instance of the EM (expectation-maximization) algorithm
  - Many variants
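To get a feel for how quickly the number of partitions grows, the Stirling numbers of the second kind can be computed with the standard recurrence S(n, k) = k·S(n−1, k) + S(n−1, k−1). The function name is an assumption of this sketch.

```python
from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n objects into k non-empty clusters."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Object n either joins one of k existing clusters or forms a new one.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

exact = stirling2(10, 3)            # partitions of 10 objects into 3 clusters
approx = 3 ** 10 / factorial(3)     # the k^n / k! estimate
```

Even for 10 objects and 3 clusters there are thousands of partitions, which is why exhaustive enumeration is impractical.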
The K-Means Clustering Method
- Lloyd’s Algorithm:
  1. Initialize k centers randomly
  2. While stopping condition is not met
     i. Assign each object to the cluster with the nearest center
     ii. Compute the new center for each cluster.
- Stopping condition = ?
- What are the final clusters?

Cost function: Total squared distance of points to their cluster representative
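One possible reading of Lloyd's algorithm is sketched below. The stopping condition used here (no assignment changes, or a maximum number of rounds) is an assumption, since the slide leaves it as a question; the optional `centers` argument is added only to make the example reproducible.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def lloyd(points, k, centers=None, seed=0, max_iter=100):
    """Lloyd's algorithm; returns (centers, assignment)."""
    if centers is None:
        # Step 1: pick k distinct objects at random as initial centers.
        centers = random.Random(seed).sample(points, k)
    centers = [list(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        # Step 2(i): assign each object to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:   # assumed stopping condition
            break
        assignment = new_assignment
        # Step 2(ii): recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:   # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, labels = lloyd(points, k=2, centers=[(1, 1), (8, 8)])
```

With random initialization, different seeds may lead to different local optima, so the result should not be read as the global optimum.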
The K-Means Clustering Method
- Example (K = 2)
  1. Arbitrarily choose K objects as initial cluster centers
  2. Assign each object to the most similar center
  3. Update the cluster means
  4. Reassign
  5. Update the cluster means
  6. Reassign

[Figure: scatter plots on 0-10 axes, one per step above]
Comments on the K-Means Method
- Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
  - Comparing: PAM: $O(k(n-k)^2)$, CLARA: $O(ks^2 + k(n-k))$
- Comment:
  - Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
  - No guarantee on the quality. Use k-means++.
- Weakness
  - Applicable only when the mean is defined; what about categorical data?
  - Need to specify k, the number of clusters, in advance
  - Unable to handle noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means Method
- A few variants of the k-means which differ in
  - Selection of the initial k means
  - Dissimilarity calculations
  - Strategies to calculate cluster means
- Handling categorical data: k-modes (Huang’98)
  - Replacing means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update modes of clusters
  - A mixture of categorical and numerical data: k-prototype method
k-Means++ [Arthur and Vassilvitskii, SODA 2007]
- A simple initialization routine that is guaranteed to find a solution that is, in expectation, O(log k) competitive with the optimal k-means solution.
- Algorithm:
  1. Choose the first center uniformly at random
  2. For each data point x, compute D(x) as the distance to its nearest center
  3. Randomly sample one point as the new center, with probabilities proportional to $D^2(x)$
  4. Go to 2 if there are fewer than k centers
  5. Run the normal k-means with the k centers
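Steps 1 to 4 (the seeding only, not the subsequent k-means run) might look as follows. The function name, the `seed` parameter, and the sample points are assumptions made for this sketch.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance, i.e. D^2 for a single center."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeanspp_init(points, k, seed=0):
    """Pick k initial centers from `points` by D^2-weighted sampling."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]            # step 1: uniform at random
    while len(centers) < k:                   # step 4: repeat until k centers
        # Step 2: squared distance of each point to its nearest center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        # Step 3: sample the next center with probability proportional to D^2.
        # (Raises ValueError if every point coincides with a chosen center.)
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
seeds = kmeanspp_init(points, k=2)
```

A point that is already a center has D(x) = 0 and so cannot be drawn again, which pushes the initial centers apart.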
k-means: Special Matrix Factorization
- $X_{n \times d} \approx U_{n \times k} V_{k \times d}$
- Loss function: $\lVert X - UV \rVert_F^2$
  - Squared Frobenius norm
- Constraints:
  - Rows of U must be a one-hot encoding
- Alternative view
  - $X_{j,*} \approx U_{j,*} V$ ⇒ $X_{j,*}$ can be explained as a “special” linear combination of rows in V

$$
x_2 \approx \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}
\begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix}
$$
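The factorization view can be checked numerically on a toy case: with one-hot rows in U and the cluster centers as rows of V, the squared Frobenius loss equals the k-means cost. The matrices below are made-up values, and the helper names are assumptions.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def frobenius_sq(A, B):
    """Squared Frobenius norm of A - B."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

X = [[1.0, 1.0], [3.0, 1.0], [8.0, 9.0]]   # n = 3 objects, d = 2
U = [[1, 0], [1, 0], [0, 1]]               # one-hot cluster membership, k = 2
V = [[2.0, 1.0], [8.0, 9.0]]               # rows are the cluster centers

UV = matmul(U, V)              # each row of X is replaced by its center
loss = frobenius_sq(X, UV)     # same value as the k-means cost
```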
Expectation Maximization Algorithm
- $X_{n \times d} \approx U_{n \times k} V_{k \times d}$
- Loss function: $\lVert X - UV \rVert_F^2$
- Finding the best U and V simultaneously is hard, but
  - Expectation step:
    - Given V, find the best U ⇒ easy
  - Maximization step:
    - Given U, find the best V ⇒ easy
- Iterate until converging at a local minimum.

$$
x_2 \approx \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}
\begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix}
$$
What is the problem of the k-Means Method?
- The k-means algorithm is sensitive to outliers!
  - Since an object with an extremely large value may substantially distort the distribution of the data.
- K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

[Figure: two scatter plots on 0-10 axes]
K-medoids (PAM)
- k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
The K-Medoids Clustering Method
- Find representative objects, called medoids, in clusters
- PAM (Partitioning Around Medoids, 1987)
  - starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  - PAM works effectively for small data sets, but does not scale well for large data sets
- CLARA (Kaufman & Rousseeuw, 1990)
- CLARANS (Ng & Han, 1994): Randomized sampling
- Focusing + spatial data structure (Ester et al., 1995)
Typical k-medoids algorithm (PAM)
K = 2
1. Arbitrarily choose k objects as initial medoids
2. Assign each remaining object to the nearest medoid
3. For each nonmedoid object Oa, compute the total cost of swapping
4. Swap O and Oa if quality is improved
5. Loop until no change

[Figure: scatter plots on 0-10 axes illustrating the steps above; panel labels: "Total Cost = 20" and "Total Cost = 26"]
PAM (Partitioning Around Medoids) (1987)
- PAM (Kaufman and Rousseeuw, 1987), built in Splus
- Use real object to represent the cluster
  1. Select k representative objects arbitrarily
  2. For each pair of non-selected object h and selected object i, calculate the total swapping cost $TC_{ih}$
  3. For each pair of i and h,
     - If $TC_{ih} < 0$, i is replaced by h
     - Then assign each non-selected object to the most similar representative object
  4. Repeat steps 2-3 until there is no change
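A compact sketch of the swap loop follows. It is a greedy, first-improvement variant written for illustration: a swap is accepted as soon as its swapping cost is negative, and the swapping cost is computed as the difference in total distance, which is an assumption about how $TC_{ih}$ is evaluated. Euclidean distance and all names are likewise assumptions. It needs Python 3.8+ for `math.dist`.

```python
import math

def total_distance(points, medoids):
    """Sum over all objects of the distance to the nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, initial_medoids):
    """Return (medoid indices, label per object, total distance)."""
    medoids = list(initial_medoids)
    best = total_distance(points, medoids)
    improved = True
    while improved:                      # repeat until there is no change
        improved = False
        for pos in range(len(medoids)):  # selected object i
            for h in range(len(points)):  # non-selected object h
                if h in medoids:
                    continue
                candidate = medoids[:pos] + [h] + medoids[pos + 1:]
                cost = total_distance(points, candidate)
                if cost - best < 0:      # swapping cost is negative
                    medoids, best, improved = candidate, cost, True
    # Assign each object to its most similar representative object.
    labels = [min(medoids, key=lambda m: math.dist(p, points[m]))
              for p in points]
    return medoids, labels, best

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
medoids, labels, cost = pam(points, initial_medoids=[0, 1])
```

Each pass evaluates on the order of k(n-k) candidate swaps, and each evaluation scans all objects, which illustrates why PAM does not scale to large data sets.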
What is the problem with PAM?
- PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
- PAM works efficiently for small data sets but does not scale well for large data sets.
  - $O(k(n-k)^2)$ for each iteration, where n is # of data, k is # of clusters
- ⇒ Sampling based method, CLARA (Clustering LARge Applications)
Gaussian Mixture Model for Clustering
• k-means can be deemed a special case of the EM algorithm for GMM
• GMM
  – allows "soft" cluster assignment: it models Pr(C | x)
  – is also a good example of
      – a generative model
      – a latent variable model
• Use the Expectation-Maximization (EM) algorithm to obtain a locally optimal solution
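A small sketch may help make the "soft" assignment concrete. It runs EM on a one-dimensional mixture; the function names, the initial means, the fixed iteration count, and the sample data are all assumptions chosen for illustration, and the variance floor of 1e-6 is only a guard against degenerate components.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, means, n_iter=50):
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: responsibilities Pr(C = c | x), the "soft" assignment
        resp = []
        for x in xs:
            joint = [weights[c] * gaussian_pdf(x, means[c], variances[c])
                     for c in range(k)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate weight, mean and variance of each component
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = nc / len(xs)
            means[c] = sum(r[c] * x for r, x in zip(resp, xs)) / nc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, xs)) / nc
            variances[c] = max(var, 1e-6)
    return weights, means, variances, resp

# Hypothetical data: two groups around 1 and 5.
xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
weights, means, variances, resp = em_gmm_1d(xs, means=[0.0, 6.0])
```

Each row of `resp` sums to 1 and plays the role of Pr(C | x). Replacing those rows with a hard 0/1 choice of the nearest mean, with equal fixed variances, gives the k-means update.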
![Page 53: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/53.jpg)
• Hierarchical Methods
![Page 54: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/54.jpg)
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits
  – A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
[Figure: dendrogram with leaf order 1 3 2 5 4 6 and heights from 0 to 0.2, next to the corresponding nested clusters of six points; labels: Objects, Distance]
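Cutting a dendrogram at a level amounts to keeping only the merges at or below that height and reading off the connected components. The sketch below does this with a small union-find; the function name, the merge list, and the heights are hypothetical values invented for the example, not the ones in the figure, and the approach assumes merge heights never decrease (no inversions).

```python
def cut_dendrogram(n, merges, level):
    """merges: (i, j, height) triples joining the clusters containing objects i and j."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i, j, height in merges:
        if height <= level:                 # merges above the cut are ignored
            parent[find(i)] = find(j)

    groups = {}
    for obj in range(n):
        groups.setdefault(find(obj), []).append(obj)
    return sorted(groups.values())

# Hypothetical merge history over objects 0..5.
merges = [(0, 1, 0.05), (2, 3, 0.08), (4, 5, 0.10), (0, 2, 0.15), (0, 4, 0.20)]
three_clusters = cut_dendrogram(6, merges, level=0.12)
two_clusters = cut_dendrogram(6, merges, level=0.16)
```

Raising the cut level yields fewer, larger clusters, which is why no particular number of clusters has to be fixed in advance.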
![Page 55: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/55.jpg)
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
  – Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
• They may correspond to meaningful taxonomies
  – Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
![Page 56: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/56.jpg)
Hierarchical Clustering
• Two main types of hierarchical clustering
  – Agglomerative:
      – Start with the points as individual clusters
      – At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  – Divisive:
      – Start with one, all-inclusive cluster
      – At each step, split a cluster until each cluster contains a single point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
  – Merge or split one cluster at a time
![Page 57: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/57.jpg)
Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward
  1. Compute the proximity matrix (i.e., matrix of pair-wise distances)
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters (different from that of two points)
• Different approaches to defining the distance between clusters distinguish the different algorithms
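The basic algorithm can be sketched as follows. This is a deliberately naive illustration: instead of storing and updating a proximity matrix (steps 1 and 5), it recomputes cluster proximities on every pass. The function names, the optional stopping parameter `k`, and the sample points are assumptions for the example; the inter-cluster distance is passed in as a function so that different definitions can be swapped.

```python
import math
from itertools import combinations

def single_link(a, b, dist):
    """Distance between the two closest points of clusters a and b."""
    return min(dist(p, q) for p in a for q in b)

def complete_link(a, b, dist):
    """Distance between the two most distant points of clusters a and b."""
    return max(dist(p, q) for p in a for q in b)

def agglomerative(points, k=1, linkage=single_link, dist=math.dist):
    clusters = [[p] for p in points]                 # each data point is a cluster
    merges = []
    while len(clusters) > k:                         # repeat ... until k clusters remain
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], dist))
        d = linkage(clusters[i], clusters[j], dist)
        merges.append((clusters[i], clusters[j], d)) # record the merge and its height
        merged = clusters[i] + clusters[j]           # merge the two closest clusters
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical data: two tight pairs and one far-away point.
pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = agglomerative(pts, k=2)
```

Passing `linkage=complete_link` changes only how the proximity of two clusters is measured; the merge loop itself stays the same.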
![Page 58: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/58.jpg)
Starting Situation
• Start with clusters of individual points and a proximity matrix
[Figure: individual points p1, p2, p3, p4, p5, … and the proximity matrix indexed by p1 to p5]
![Page 59: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/59.jpg)
Intermediate Situation
• After some merging steps, we have some clusters
[Figure: clusters C1 to C5 and the proximity matrix indexed by C1 to C5]
![Page 60: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/60.jpg)
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1 to C5 and the proximity matrix indexed by C1 to C5]
![Page 61: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/61.jpg)
After Merging
• The question is "How do we update the proximity matrix?"
[Figure: clusters C1, C3, C4 and the merged cluster C2 ∪ C5; the proximity-matrix entries in the row and column of C2 ∪ C5 are marked "?"]
![Page 62: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/62.jpg)
How to Define Inter-Cluster Distance
[Figure: points p1 to p5 in two clusters, labelled "Similarity?", and the proximity matrix indexed by p1 to p5]
• MIN
• MAX
• Centroid-based
• Group Average
• Other methods driven by an objective function
  – Ward's Method uses squared error
![Page 63: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/63.jpg)
![Page 64: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/64.jpg)
![Page 65: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/65.jpg)
![Page 66: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/66.jpg)
How to Define Inter-Cluster Similarity
[Figure: points p1 to p5 in two clusters and the proximity matrix indexed by p1 to p5]
• MIN
• MAX
• Centroid-based
• Group Average
• Other methods driven by an objective function
  – Ward's Method uses squared error
Note: not the simple average distance between the clusters
![Page 67: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/67.jpg)
Cluster Similarity: MIN or Single Link/LINK
• Similarity of two clusters is based on the two most similar (closest) points in the different clusters
  – i.e., dissim(Ci, Cj) = min(dissim(px, py)) for px ∈ Ci, py ∈ Cj, or equivalently sim(Ci, Cj) = max(sim(px, py))
  – Determined by one pair of points, i.e., by one link in the proximity graph
• Usually called single-link to avoid confusion (min of similarity or of dissimilarity?)

| | P1 | P2 | P3 | P4 | P5 |
|---|---|---|---|---|---|
| P1 | 1.00 | 0.90 | 0.10 | 0.65 | 0.20 |
| P2 | 0.90 | 1.00 | 0.70 | 0.60 | 0.50 |
| P3 | 0.10 | 0.70 | 1.00 | 0.40 | 0.30 |
| P4 | 0.65 | 0.60 | 0.40 | 1.00 | 0.80 |
| P5 | 0.20 | 0.50 | 0.30 | 0.80 | 1.00 |

[Figure: points 1 to 5]
![Page 68: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/68.jpg)
Single-Link Example
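[Figure: points 1 to 5]

Pairwise similarities (symmetric; lower triangle omitted):

| | P1 | P2 | P3 | P4 | P5 |
|---|---|---|---|---|---|
| P1 | 1.00 | 0.90 | 0.10 | 0.65 | 0.20 |
| P2 | | 1.00 | 0.70 | 0.60 | 0.50 |
| P3 | | | 1.00 | 0.40 | 0.30 |
| P4 | | | | 1.00 | 0.80 |
| P5 | | | | | 1.00 |

After merging P1 and P2 into cluster 12:

| | 12 | P3 | P4 | P5 |
|---|---|---|---|---|
| 12 | 1.00 | 0.70 | 0.65 | 0.50 |
| P3 | | 1.00 | 0.40 | 0.30 |
| P4 | | | 1.00 | 0.80 |
| P5 | | | | 1.00 |

After merging P4 and P5 into cluster 45:

| | 12 | P3 | 45 |
|---|---|---|---|
| 12 | 1.00 | 0.70 | 0.65 |
| P3 | | 1.00 | 0.40 |
| 45 | | | 1.00 |

sim(Ci, Cj) = max(sim(px, py))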
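The updated entries in this example can be checked mechanically. The sketch below applies sim(Ci, Cj) = max(sim(px, py)) to the slide's similarity matrix; the variable and function names are illustrative, and points are referred to by 0-based index (P1 is index 0).

```python
# Similarity matrix from the slide, rows and columns ordered P1..P5.
SIM = [
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
]

def single_link_sim(ci, cj):
    """Single-link similarity: the most similar pair across the two clusters."""
    return max(SIM[x][y] for x in ci for y in cj)

c12 = [0, 1]                      # P1 and P2 merged first (similarity 0.90)
row_12 = [single_link_sim(c12, [p]) for p in (2, 3, 4)]
print(row_12)                     # [0.7, 0.65, 0.5]

c45 = [3, 4]                      # P4 and P5 merged next (similarity 0.80)
sim_12_45 = single_link_sim(c12, c45)
sim_3_45 = single_link_sim([2], c45)
print(sim_12_45, sim_3_45)        # 0.65 0.4
```

With these values the next single-link merge joins cluster 12 with P3, since 0.70 is the largest remaining similarity.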
![Page 69: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/69.jpg)
Hierarchical Clustering: MIN
[Figure: two panels, "Nested Clusters" and "Dendrogram", for six points; dendrogram leaf order 3 6 2 5 4 1, heights from 0 to 0.2]
![Page 70: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/70.jpg)
Strength of MIN
Original Points Two Clusters
• Can handle non-elliptical shapes
![Page 71: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/71.jpg)
Limitations of MIN
Original Points Two Clusters
• Sensitive to noise and outliers
![Page 72: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/72.jpg)
Cluster Similarity: MAX or Complete Link (CLINK)
• Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
  – i.e., dissim(Ci, Cj) = max(dissim(px, py)) for px ∈ Ci, py ∈ Cj, or equivalently sim(Ci, Cj) = min(sim(px, py))
  – Determined by all pairs of points in the two clusters

| | P1 | P2 | P3 | P4 | P5 |
|---|---|---|---|---|---|
| P1 | 1.00 | 0.90 | 0.10 | 0.65 | 0.20 |
| P2 | 0.90 | 1.00 | 0.70 | 0.60 | 0.50 |
| P3 | 0.10 | 0.70 | 1.00 | 0.40 | 0.30 |
| P4 | 0.65 | 0.60 | 0.40 | 1.00 | 0.80 |
| P5 | 0.20 | 0.50 | 0.30 | 0.80 | 1.00 |

[Figure: points 1 to 5]
![Page 73: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/73.jpg)
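Complete-Link Example

Pairwise similarities (symmetric; lower triangle omitted):

| | P1 | P2 | P3 | P4 | P5 |
|---|---|---|---|---|---|
| P1 | 1.00 | 0.90 | 0.10 | 0.65 | 0.20 |
| P2 | | 1.00 | 0.70 | 0.60 | 0.50 |
| P3 | | | 1.00 | 0.40 | 0.30 |
| P4 | | | | 1.00 | 0.80 |
| P5 | | | | | 1.00 |

After merging P1 and P2 into cluster 12:

| | 12 | P3 | P4 | P5 |
|---|---|---|---|---|
| 12 | 1.00 | 0.10 | 0.60 | 0.20 |
| P3 | | 1.00 | 0.40 | 0.30 |
| P4 | | | 1.00 | 0.80 |
| P5 | | | | 1.00 |

After merging P4 and P5 into cluster 45:

| | 12 | P3 | 45 |
|---|---|---|---|
| 12 | 1.00 | 0.10 | 0.20 |
| P3 | | 1.00 | 0.30 |
| 45 | | | 1.00 |

[Figure: points 1 to 5]

sim(Ci, Cj) = min(sim(px, py))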
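Running the whole merge loop with sim(Ci, Cj) = min(sim(px, py)) on the same matrix reproduces the tables above and shows where complete link departs from the single-link result. The code is a sketch with illustrative names; clusters are tuples of 0-based point indices, and at each step the pair of clusters with the highest complete-link similarity is merged.

```python
from itertools import combinations

# Similarity matrix from the slide, rows and columns ordered P1..P5.
SIM = [
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
]

def complete_link_sim(ci, cj):
    """Complete-link similarity: the least similar pair across the two clusters."""
    return min(SIM[x][y] for x in ci for y in cj)

def complete_link_merges():
    clusters = [(i,) for i in range(len(SIM))]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: complete_link_sim(*pair))
        merges.append((a, b, complete_link_sim(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

merges = complete_link_merges()
for a, b, s in merges:
    print(a, b, s)
```

The third merge joins P3 with cluster 45 at similarity 0.30, whereas under single link P3 is closest to cluster 12 (0.70 in the single-link table).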
![Page 74: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/74.jpg)
Hierarchical Clustering: MAX
[Figure: two panels, "Nested Clusters" and "Dendrogram", for six points; dendrogram leaf order 3 6 4 1 2 5, heights from 0 to 0.4]
![Page 75: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/75.jpg)
Strength of MAX
Original Points Two Clusters
• Less susceptible to noise and outliers
![Page 76: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/76.jpg)
Limitations of MAX
Original Points Two Clusters
• Tends to break large clusters
• Biased towards globular clusters
![Page 77: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/77.jpg)
Cluster Similarity: Group Average
• GAAC (Group Average Agglomerative Clustering)
• Similarity of two clusters is the average of pair-wise similarity between points in the two clusters.
• Why not use the simple average distance? This method guarantees that no inversions can occur.

$$
\mathrm{similarity}(Cluster_i, Cluster_j) =
\frac{\displaystyle\sum_{\substack{p_i,\, p_j \in Cluster_i \cup Cluster_j \\ p_j \neq p_i}} \mathrm{similarity}(p_i, p_j)}
     {(|Cluster_i| + |Cluster_j|) \cdot (|Cluster_i| + |Cluster_j| - 1)}
$$

[Figure: example points (1, 1), (5, 1), and (3, 4.5)]
![Page 78: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/78.jpg)
Group Average Example

Pairwise similarities (symmetric; lower triangle omitted):

| | P1 | P2 | P3 | P4 | P5 |
|---|---|---|---|---|---|
| P1 | 1.00 | 0.90 | 0.10 | 0.65 | 0.20 |
| P2 | | 1.00 | 0.70 | 0.60 | 0.50 |
| P3 | | | 1.00 | 0.40 | 0.30 |
| P4 | | | | 1.00 | 0.80 |
| P5 | | | | | 1.00 |

After merging P1 and P2 into cluster 12:

| | 12 | P3 | P4 | P5 |
|---|---|---|---|---|
| 12 | 1.00 | 0.567 | 0.717 | 0.533 |
| P3 | | 1.00 | 0.40 | 0.30 |
| P4 | | | 1.00 | 0.80 |
| P5 | | | | 1.00 |

After merging P4 and P5 into cluster 45:

| | 12 | P3 | 45 |
|---|---|---|---|
| 12 | 1.00 | 0.567 | 0.608 |
| P3 | | 1.00 | 0.50 |
| 45 | | | 1.00 |

sim(Ci, Cj) = avg(sim(pi, pj))
Sim(12,3) = 2*(0.1+0.7+0.9)/6 = 0.567
Sim(12,45) = 2*(0.9+0.65+0.2+0.6+0.5+0.8)/12 = 0.608

[Figure: points 1 to 5]
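The arithmetic in this example follows directly from the group-average formula: sum the similarity over every ordered pair of distinct points in the union of the two clusters and divide by the number of such pairs. A short sketch (function and variable names are illustrative, points are 0-based indices) that reproduces the table entries:

```python
from itertools import combinations

# Similarity matrix from the slide, rows and columns ordered P1..P5.
SIM = [
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
]

def group_average_sim(ci, cj):
    """Average similarity over all pairs of distinct points in Ci ∪ Cj."""
    members = list(ci) + list(cj)
    n = len(members)
    # Each unordered pair counts twice in the formula's sum over ordered pairs.
    total = 2 * sum(SIM[x][y] for x, y in combinations(members, 2))
    return total / (n * (n - 1))

sim_12_3 = group_average_sim([0, 1], [2])       # about 0.567
sim_12_4 = group_average_sim([0, 1], [3])       # about 0.717
sim_12_5 = group_average_sim([0, 1], [4])       # about 0.533
sim_12_45 = group_average_sim([0, 1], [3, 4])   # about 0.608
sim_3_45 = group_average_sim([2], [3, 4])       # about 0.5
```

Note that the sum includes pairs inside each cluster (for example P1 and P2 in Sim(12,3)), which is what distinguishes this measure from a simple average over cross-cluster pairs.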
![Page 79: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/79.jpg)
Hierarchical Clustering: Centroid-based and Group Average
• Compromise between Single and Complete Link
• Strengths
  – Less susceptible to noise and outliers
• Limitations
  – Biased towards globular clusters
![Page 80: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/80.jpg)
More on Hierarchical Clustering Methods
• Major weaknesses of agglomerative clustering methods
  – Do not scale well: time complexity of at least O(n^2), where n is the total number of objects
  – Can never undo what was done previously
• Integration of hierarchical with distance-based clustering
  – BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  – CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
  – CHAMELEON (1999): hierarchical clustering using dynamic modeling
![Page 81: COMP9318: Data Warehousing and Data Miningcs9318/20t1/lect/8clst.pdf · COMP9318: Data Warehousing and Data Mining 17 What Is Good Clustering? n A good clusteringmethod will produce](https://reader033.fdocuments.in/reader033/viewer/2022042218/5ec45ba0729db15f1a3085c6/html5/thumbnails/81.jpg)
Spectral Clustering
• See additional slides.