Data Mining - Massey University
Clustering
credits: Padhraic Smyth lecture notes
Hand et al., Chapter 9
Data Mining - Massey University
Clustering Outline
• Introduction to clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering
Data Mining - Massey University
Clustering
• “automated detection of group structure in data”
– typically: partition N data points into K groups (clusters) such that the points in each group are more similar to each other than to points in other groups
– a descriptive technique (contrast with predictive)
– identifies “natural” groups of data objects and qualitatively describes groups of the data
• often useful, if a bit reductionist
– for real-valued vectors, clusters can be thought of as clouds of points in p-dimensional space
Data Mining - Massey University
Clustering
Sometimes easy
Sometimes impossible
and usually in between
Data Mining - Massey University
What is Cluster Analysis?
• A good cluster analysis produces objects that are:
– similar (close) to one another within the same cluster
– dissimilar (far) from the objects in other clusters
• In other words:
– high intra-cluster similarity
– low inter-cluster similarity
• Typical applications:
– as a stand-alone tool to gain insight into the data distribution
– as a preprocessing step for other algorithms
Data Mining - Massey University
Example
[Figure: scatterplot of unlabeled points with visible group structure]
Data Mining - Massey University
Why is Clustering useful?
• “Discovery” of new knowledge from data
– contrast with supervised classification (where labels are known)
– can be very useful for summarizing large data sets, especially for large n and/or high dimensionality
• Applications of clustering
– WWW: clustering of documents produced by a search engine; clustering weblog data to determine usage patterns
– pattern recognition / image processing, e.g., Google face finder (&imgtype=face)
– segmentation of customers for an e-commerce store
– spatial data analysis: geographical clusters of events (cancer rates, sales, etc.)
– clustering of genes with similar expression profiles
– many more
Data Mining - Massey University
General Issues in Clustering
• No golden truth! The answer is often subjective.
• Cluster representation: what types or “shapes” of clusters are we looking for? What defines a cluster?
• Score: a clustering is an assignment of n objects to K clusters; the score is a quantitative criterion used to evaluate different clusterings.
• Other issues:
– the distance function D[x(i), x(j)] is a critical aspect of clustering, both for distances between individual pairs of objects and for distances of individual objects from clusters
– how is K selected?
Data Mining - Massey University
Clustering Outline
• Introduction to clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering
Data Mining - Massey University
Distance Measures
• In order to cluster, we need some kind of “distance” between points.
• Sometimes distances are not obvious, but we can create them
case sex glasses moustache smile hat
1 0 1 0 1 0
2 1 0 0 1 0
3 0 1 0 0 0
4 0 0 0 0 0
5 0 0 0 1 0
6 0 0 1 0 1
7 0 1 0 1 0
8 0 0 0 1 0
9 0 1 1 1 0
10 1 0 0 0 0
11 0 0 1 0 0
12 1 0 0 0 0
Data Mining - Massey University
Euclidean Vs. Non-Euclidean
• A Euclidean space has some number of real-valued dimensions and “dense” points.
– There is a notion of the “average” of two points.
– A Euclidean distance is based on the locations of points in such a space.
• A non-Euclidean distance is based on properties of points, but not on their “location” in a space.
Data Mining - Massey University
Some Euclidean Distances
• L2 norm: d(x,y) = the square root of the sum of the squares of the differences between x and y in each dimension.
– The most common notion of “distance.”
• L1 norm: the sum of the absolute differences in each dimension.
– Manhattan distance = the distance if you had to travel along coordinates only.
Data Mining - Massey University
Examples of Euclidean Distances
x = (5,5), y = (9,8)
L2 norm: dist(x,y) = √(4² + 3²) = 5
L1 norm: dist(x,y) = 4 + 3 = 7
[Figure: right triangle between x and y with horizontal leg 4, vertical leg 3, hypotenuse 5]
Data Mining - Massey University
Non-Euclidean Distances
• Some observations are not appropriate for Euclidean distance:
– binary vectors: Jaccard coefficient, cosine
– strings: edit distance
– ordinal variables: transformation of ranks
– categorical variables: matching sums
Data Mining - Massey University
Distances for Binary Vectors
• Jaccard: intersection over union
– Example: p1 = 10111; p2 = 10011.
– Size of intersection = 3; size of union = 4; Jaccard similarity (not distance) = 3/4.
• We need a distance function satisfying the triangle inequality and the other metric laws.
• d(x,y) = 1 − (Jaccard similarity) works.
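A minimal sketch in R (the course's lab language), assuming 0/1 vectors; jaccard_dist is a hypothetical helper, not a library function:

  jaccard_dist <- function(x, y) {
    inter <- sum(x & y)   # size of the intersection
    union <- sum(x | y)   # size of the union
    1 - inter / union     # d(x,y) = 1 - Jaccard similarity
  }
  p1 <- c(1, 0, 1, 1, 1)  # 10111
  p2 <- c(1, 0, 0, 1, 1)  # 10011
  jaccard_dist(p1, p2)    # 1 - 3/4 = 0.25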
Data Mining - Massey University
Distances for Binary Vectors
• A contingency table for binary data:

              object j
                 1        0       sum
  object i  1    a        b      a + b
            0    c        d      c + d
       sum     a + c    b + d      p

• 1 − Jaccard similarity (intersection over union):

  d(i,j) = (b + c) / (a + b + c)
Data Mining - Massey University
Cosine Distance (similarity)
• Think of a point as a vector from the origin (0,0,…,0) to its location.
• Two points’ vectors make an angle; the cosine of this angle is a measure of similarity.
– Recall cos(0°) = 1; cos(90°) = 0.
– Also: the cosine is the normalized dot product of the vectors: p1·p2 / (|p1| |p2|).
– Example: p1 = 00111; p2 = 10011.
– p1·p2 = 2; |p1| = |p2| = √3.
– cos(θ) = 2/3; θ is about 48 degrees.
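A minimal R sketch of the same computation; cosine_sim is a hypothetical helper:

  cosine_sim <- function(p1, p2) {
    sum(p1 * p2) / (sqrt(sum(p1^2)) * sqrt(sum(p2^2)))  # normalized dot product
  }
  p1 <- c(0, 0, 1, 1, 1)   # 00111
  p2 <- c(1, 0, 0, 1, 1)   # 10011
  s <- cosine_sim(p1, p2)  # 2/3
  acos(s) * 180 / pi       # about 48.19 degrees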
Data Mining - Massey University
Cosine-Measure Diagram
[Figure: vectors p1 and p2 with angle θ between them; the projection of p1 onto p2 has length p1·p2 / |p2|]
Data Mining - Massey University
Edit Distance for strings
• The edit distance of two strings is the number of inserts and deletes of characters needed to turn one into the other.
• Equivalently: d(x,y) = |x| + |y| − 2|LCS(x,y)|.
– LCS = longest common subsequence = the longest string obtained both by deleting from x and by deleting from y.
Data Mining - Massey University
Example
• x = abcde; y = bcduve.
• Turn x into y by deleting a, then inserting u and v after d.
– Edit distance = 3.
• Or: LCS(x,y) = bcde, so |x| + |y| − 2|LCS(x,y)| = 5 + 6 − 2·4 = 3.
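A minimal R sketch of the LCS identity above; lcs_length and edit_dist are hypothetical helpers implementing the insert/delete-only edit distance:

  lcs_length <- function(x, y) {
    a <- strsplit(x, "")[[1]]
    b <- strsplit(y, "")[[1]]
    m <- matrix(0, length(a) + 1, length(b) + 1)  # DP table of LCS lengths
    for (i in seq_along(a)) for (j in seq_along(b)) {
      m[i + 1, j + 1] <- if (a[i] == b[j]) m[i, j] + 1
                         else max(m[i, j + 1], m[i + 1, j])
    }
    m[length(a) + 1, length(b) + 1]
  }
  edit_dist <- function(x, y) nchar(x) + nchar(y) - 2 * lcs_length(x, y)
  edit_dist("abcde", "bcduve")   # 5 + 6 - 2*4 = 3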
Data Mining - Massey University
Categorical Variables
• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.
• Method 1: simple matching
– m: # of matches, p: total # of variables

  d(i,j) = (p − m) / p

• Method 2: use a large number of binary variables
– create a new binary variable for each of the M nominal states
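A minimal R sketch of Method 1, assuming equal-length categorical vectors; matching_dist is a hypothetical helper:

  matching_dist <- function(i, j) {
    p <- length(i)     # total number of variables
    m <- sum(i == j)   # number of matches
    (p - m) / p
  }
  matching_dist(c("red", "blue", "green"),
                c("red", "green", "green"))   # (3 - 2) / 3 = 1/3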
Data Mining - Massey University
Ordinal Variables
• An ordinal variable can be discrete or continuous.
• Order is important, e.g., rank.
• Can be treated like interval-scaled variables:
– replace x_if by its rank r_if ∈ {1, …, M_f}
– map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

  z_if = (r_if − 1) / (M_f − 1)

– compute the dissimilarity using methods for interval-scaled variables
– not always such a great idea…
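A minimal R sketch of the rank mapping, assuming ranks 1…M_f are already assigned:

  r  <- c(1, 3, 2, 5, 4)    # ranks r_if for one ordinal variable f
  Mf <- max(r)              # number of states M_f
  z  <- (r - 1) / (Mf - 1)  # mapped onto [0, 1]
  z                         # 0.00 0.50 0.25 1.00 0.75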
Data Mining - Massey University
Clustering Methods
• Enough about distances!
• Now we have an (n × n) matrix of distances.
• Two major types of clustering algorithms:
– partitioning
• partitions the set into clusters with defined boundaries
• place each point in its nearest cluster
– hierarchical
• agglomerative: each point starts in its own cluster; iteratively combine
• divisive: all data start in one cluster; iteratively dissect
Data Mining - Massey University
Clustering Outline
• Introduction to clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering
Data Mining - Massey University
k –Means Algorithm(s)
• Assumes a Euclidean space.
• Start by picking k, the number of clusters.
• Initialize clusters by picking one point per cluster.
– typically, k random points
Data Mining - Massey University
K-Means Algorithm
1. Arbitrarily select K objects from the data (e.g., K customers) to be the initial cluster centers.
2. For each of the remaining objects: assign the object to the cluster whose center it is closest to.
[Figure: points on a 0–10 × 0–10 grid, with two objects marked as the initial cluster centers]
Data Mining - Massey University
Then repeat the following 3 steps until the clusters converge (no change in clusters):
1. Compute the new centers of the current clusters.
[Figure: the new centers computed as the means of the current clusters]
K-Means Algorithm
Data Mining - Massey University
2. Assign each object to the cluster whose center it is closest to.
3. Go back to Step 1, or stop if the centers do not change.
[Figure: objects reassigned to the nearest of the updated centers]
K-Means Algorithm
Data Mining - Massey University
The K-Means Clustering Method
• Example
[Figure: four snapshots of the k-means iterations, from initial centers to converged clusters]
Data Mining - Massey University
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1=3,m2=4
• Solve for the rest ….
Data Mining - Massey University
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1 = 3, m2 = 4
• K1 = {2,3}, K2 = {4,10,12,20,30,11,25}; m1 = 2.5, m2 = 16
• K1 = {2,3,4}, K2 = {10,12,20,30,11,25}; m1 = 3, m2 = 18
• K1 = {2,3,4,10}, K2 = {12,20,30,11,25}; m1 = 4.75, m2 = 19.6
• K1 = {2,3,4,10,11,12}, K2 = {20,30,25}; m1 = 7, m2 = 25
• Stop, as the clusters with these means are the same.
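A minimal R sketch reproducing this 1-D trace; the loop is a bare-bones version of the algorithm, not base R's kmeans():

  x <- c(2, 4, 10, 12, 3, 20, 30, 11, 25)
  m <- c(3, 4)                                    # initial means m1, m2
  repeat {
    # assign each point to its nearest mean
    cl <- apply(abs(outer(x, m, "-")), 1, which.min)
    new_m <- as.numeric(tapply(x, cl, mean))      # recompute the means
    if (all(new_m == m)) break                    # stop: means unchanged
    m <- new_m
  }
  m             # 7 25
  split(x, cl)  # K1 = {2,3,4,10,11,12}, K2 = {20,30,25}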
Data Mining - Massey University
Getting k Right
• Hard! Often done subjectively (by feel).
• Try different k, looking at the change in the average distance to the centroid as k increases.
• Look for a balance between within-cluster variance and between-cluster variance.
• The average falls rapidly until the right k, then changes little.
[Figure: average distance to centroid vs. k; the best value of k is where the curve flattens]
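A hedged sketch of this heuristic with base R's kmeans() on synthetic data; tot.withinss (the total within-cluster sum of squares) stands in for the average distance to centroid:

  set.seed(1)
  X <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 5),  ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))   # three true groups
  wss <- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
  plot(1:8, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")
  # the curve falls rapidly until k = 3, then changes little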
Data Mining - Massey University
Example
[Figure: the scatterplot clustered with too few clusters; many long distances to centroid]
Data Mining - Massey University
Example
[Figure: the scatterplot with k just right; distances to centroid are rather short]
Data Mining - Massey University
Example
[Figure: the scatterplot with too many clusters; little improvement in average distance]
Data Mining - Massey University
Comments on the K-Means Method
• Strengths
– relatively efficient and easy to implement; often comes up with good, if not optimal, solutions
– intuitive
• Weaknesses
– applicable only when a mean is defined; what about categorical data?
– need to specify k, the number of clusters, in advance
– unable to handle noisy data and outliers
– not suited to discovering clusters with non-convex shapes
– quite sensitive to the initial starting points: it finds a local optimum (although methods exist for seeking global optima)
Data Mining - Massey University
Variations on k-means
• Make it more robust by using k-modes or k-medoids.
– k-medoids: a medoid is the most centrally located object in a cluster.
• Make the initialization better:
– take a small random sample and cluster it to find a starting point
– from a sample, pick a random point, and then k − 1 more points, each as far from the previously selected points as possible (see the sketch below)
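A minimal R sketch of that farthest-point initialization (farthest_first is a hypothetical helper); the result can be passed to kmeans() as its starting centers:

  farthest_first <- function(X, k) {
    centers <- X[sample(nrow(X), 1), , drop = FALSE]  # one random point
    while (nrow(centers) < k) {
      # squared distance from each point to its nearest chosen center
      d <- apply(X, 1, function(p) min(colSums((t(centers) - p)^2)))
      centers <- rbind(centers, X[which.max(d), ])    # take the farthest point
    }
    centers
  }
  set.seed(2)
  X <- matrix(rnorm(200), ncol = 2)
  km <- kmeans(X, centers = farthest_first(X, 3))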
Data Mining - Massey University
Clustering Outline
• Introduction to clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering
Data Mining - Massey University
Hierarchical Clustering
• Representation: a tree of nested clusters.
• Works from a distance matrix.
– advantage: the x’s can be any type of object
– disadvantage: computation
• Two basic approaches:
– merge points (agglomerative)
– divide superclusters (divisive)
• Visualize both via “dendrograms”:
– shows nesting structure
– merges or splits = tree nodes
• Applications:
– e.g., clustering of gene expression data
– useful for seeing hierarchical structure, for relatively small data sets
Data Mining - Massey University
Simple example of hierarchical clustering
Data Mining - Massey University
Distances Between Clusters
• Single link:
– smallest distance between points
– nearest neighbor
– can be outlier-sensitive
• Complete link:
– largest distance between points
– enforces “compactness”
• Average link:
– mean: gets average behavior
– centroid: more robust
• Ward’s measure:
– merge the clusters that minimize the increase in within-cluster distances, SS(C_{i+j}) − (SS(C_i) + SS(C_j)) (see the sketch below)
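A hedged sketch comparing linkages with base R's hclust() on the built-in Old Faithful data (used in the dendrogram slides below); "single", "complete", "average", and "ward.D2" are hclust's method names:

  d <- dist(scale(faithful))   # distance matrix on standardized data
  par(mfrow = c(1, 2))
  plot(hclust(d, method = "single"),  labels = FALSE, main = "single link")
  plot(hclust(d, method = "ward.D2"), labels = FALSE, main = "Ward")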
Data Mining - Massey University
Hierarchical Clustering
• This method does not require the number of clusters k as an input.
• Can be done forward or backward:
[Figure: objects a, b, c, d, e merged step by step into {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e}; agglomerative clustering runs left to right (steps 0 to 4), divisive clustering runs the same steps in reverse (steps 4 to 0)]
Data Mining - Massey University
Agglomerative Hierarchical Clustering
• most common in statistical packages
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
[Figure: three snapshots of agglomerative merging on a 0–10 × 0–10 scatterplot]
Data Mining - Massey University
Divisive Hierarchical Clustering
• Start with one cluster with all data points
• Eventually each node forms a cluster on its own
[Figure: three snapshots of divisive splitting on a 0–10 × 0–10 scatterplot]
Data Mining - Massey University
A Dendrogram Shows How the Clusters are Merged Hierarchically
Decomposes data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster (see the sketch below).
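A minimal R sketch of cutting a dendrogram with cutree(); the Old Faithful data and Ward linkage are illustrative choices:

  hc <- hclust(dist(scale(faithful)), method = "ward.D2")
  groups <- cutree(hc, k = 2)    # cut so two connected components remain
  table(groups)                  # cluster sizes
  plot(faithful, col = groups)   # points coloured by cluster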
Data Mining - Massey University
Data Mining - Massey University
[Figure: agglomerative dendrogram; the height of each cross-bar shows the change in within-cluster SS]
Data Mining - Massey University
Dendrogram Using Single-Link Method
Old Faithful eruption duration vs. wait data. Notice how single link tends to “chain”.
(dendrogram y-axis = the crossbar’s distance score)
Data Mining - Massey University
Dendrogram Using Ward’s SSE Distance
Old Faithful eruption duration vs. wait data. More balanced than single link.
Data Mining - Massey University
Hierarchical Clustering
• Pros
– don’t have to specify k beforehand
– visual representation of various cluster characteristics from the dendrogram
• Cons
– different linkage options give very different results
Data Mining - Massey University
Clustering Outline
• Introduction to clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering
Data Mining - Massey University
Estimating Probability Densities
• Using probability densities is one way to describe data.
• Finite mixtures of probability densities can be viewed as clusters.
• The negative log-likelihood is a common score function:

  S_L(θ) = −∑_{i=1}^{n} log p(x(i); θ)

• It can be amended to penalize complexity (BIC):

  S_BIC(M_k) = 2 S_L(θ̂_k; M_k) + d_k log n
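A one-line R sketch of the penalized score (bic_score is a hypothetical helper; the example values are illustrative only):

  bic_score <- function(neg_loglik, d_k, n) 2 * neg_loglik + d_k * log(n)
  bic_score(neg_loglik = 1234.5, d_k = 5, n = 500)  # d_k = 5 for a 2-component 1-D Gaussian mixture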
Data Mining - Massey University
Mixture Models
“Two-stage model” for weekly credit card usage, a two-component Poisson mixture:

  f(x) = p · e^(−λ1) λ1^x / x! + (1 − p) · e^(−λ2) λ2^(x−52) / (x − 52)!

In general, a mixture density has the form:

  f(x) = ∑_{k=1}^{K} π_k f_k(x; θ_k)
Data Mining - Massey University
Mixture Models and EM
• How do we find the models to mix over?
• EM (Expectation / Maximization) is a widely used technique that converges to a solution for finding mixture models.
• Assume multivariate normal components. To apply EM:
– take an initial solution
– calculate the probability that each point comes from each component and assign it (E-step)
– re-estimate parameters for the components based on the new assignments (M-step)
– repeat until convergence.
• Can be slow to converge; can find local maxima
Data Mining - Massey University
Probabilistic Clustering: Mixture Models
• Assume a probabilistic model for each component cluster.
• Mixture model: f(x) = ∑_{k=1…K} w_k f_k(x; θ_k)
• The w_k are K mixing weights:
– 0 ≤ w_k ≤ 1 and ∑_{k=1…K} w_k = 1
• The K component densities f_k(x; θ_k) can be:
– Gaussian
– Poisson
– exponential
– ...
• Note:
– assumes a model for the data (advantages and disadvantages)
– results in probabilistic membership: p(cluster k | x)
Data Mining - Massey University
Learning Mixture Models from Data
• Score function = log-likelihood L(θ)
– L(θ) = log p(X | θ) = log ∑_H p(X, H | θ)
– H = hidden variables (the cluster memberships of each x)
– L(θ) cannot be optimized directly
• EM procedure
– a general technique for maximizing the log-likelihood with missing data
– for mixtures:
• E-step: compute “memberships” p(k | x) = w_k f_k(x; θ_k) / f(x)
• M-step: pick a new θ to maximize the expected data log-likelihood
• iterate: guaranteed to climb to a (local) maximum of L(θ) (see the sketch below)
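A hedged R sketch of these two steps for a two-component 1-D Gaussian mixture; all names are hypothetical and the synthetic data is for illustration:

  set.seed(3)
  x <- c(rnorm(150, 0, 1), rnorm(100, 5, 1))      # data from two components
  w <- c(0.5, 0.5); mu <- c(-1, 1); s <- c(1, 1)  # initial solution theta
  for (iter in 1:100) {
    # E-step: memberships p(k | x) = w_k f_k(x; theta_k) / f(x)
    f1 <- w[1] * dnorm(x, mu[1], s[1])
    f2 <- w[2] * dnorm(x, mu[2], s[2])
    r  <- f1 / (f1 + f2)                          # p(component 1 | x)
    # M-step: re-estimate theta from the soft assignments
    w  <- c(mean(r), mean(1 - r))
    mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
    s  <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
            sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))
  }
  round(c(w, mu, s), 2)  # climbs to a (local) maximum near the generating values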
Data Mining - Massey University
The E (Expectation) Step
Current K clusters and parameters θ; n data points.
E step: compute p(data point i is in group k).
Data Mining - Massey University
The M (Maximization) Step
New parameters θ for the K clusters; n data points.
M step: compute θ, given the n data points and memberships.
Data Mining - Massey University
Comments on Mixtures and EM Learning
• Probabilistic assignment to clusters… not a partition.
• K-means is a special case of EM:
– Gaussian mixtures with isotropic (diagonal, equi-variance) Σ_k’s
– the E-step is approximated by choosing the most likely cluster (instead of using the membership probabilities)
Data Mining - Massey University
Selecting K in mixture models
• Cannot just choose the K that maximizes the likelihood:
– the likelihood L(θ) is always larger for larger K
• Model selection alternatives for choosing K:
– 1) penalizing complexity:
• e.g., BIC = L(θ̂) − (d/2) log n, where d = # parameters (Bayesian information criterion)
• easy to implement; asymptotically correct
– 2) Bayesian: compute posteriors p(k | data)
• p(k | data) requires computation of p(data | k), the marginal likelihood
• can be tricky to compute for mixture models
– 3) (cross-)validation:
• split the data into train and validation sets
• score different models by the likelihood of the held-out data, log p(X_test | θ̂)
• works well on large data sets
• can be noisy on small data (log L is sensitive to outliers)
– Note: all of these methods evaluate the quality of the clustering as a density estimator, rather than with any explicit notion of clustering
Data Mining - Massey University
Example of BIC Score for Red-Blood Cell Data
Data Mining - Massey University
Example of BIC Score for Red-Blood Cell Data
[Figure: BIC vs. number of components; the true number of classes (2) is selected by BIC]
Data Mining - Massey University
Model Based Clustering
• Set k.
• Choose a parametric model for each cluster.
• Use EM or Bayesian methods to fit the clusters.
• Use library(mclust) in R.

  f(x) = ∑_{k=1…K} w_k f_k(x; θ_k)
Name  Distribution  Volume    Shape     Orientation
EII   Spherical     equal     equal     NA
VII   Spherical     variable  equal     NA
EEI   Diagonal      equal     equal     coordinate axes
VEI   Diagonal      variable  equal     coordinate axes
VVI   Diagonal      variable  variable  coordinate axes
EEE   Ellipsoidal   equal     equal     equal
EEV   Ellipsoidal   equal     equal     variable
VEV   Ellipsoidal   variable  equal     variable
VVV   Ellipsoidal   variable  variable  variable
Data Mining - Massey University
Model-based clustering: red blood cells
set k=2 with Gaussian clusters…
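A hedged R sketch with the mclust package named earlier; since the red-blood-cell data is not distributed with these notes, the built-in faithful data stands in:

  library(mclust)                     # install.packages("mclust") if needed
  fit <- Mclust(faithful, G = 2)      # k = 2 Gaussian clusters, fit by EM
  summary(fit)                        # covariance model, mixing weights, means
  head(fit$z)                         # per-observation membership probabilities
  plot(fit, what = "classification")  # cluster assignments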
Data Mining - Massey University
[Figure: EM fit snapshots at iterations 0, 1, 2, 5, 10, and 25]
Data Mining - Massey University
(Dis)Advantages of the Probabilistic Approach
• Provides a full distributional description for each component: different variances (the iron-deficient group has greater spread).
• For each observation, provides a K-component vector of probabilities of class membership (we can find those at risk).
• Only limited by the imagination of likelihoods… different distributions, shapes, clusters-within-clusters, etc.
• Can make inferences about the number of clusters, via a penalized-likelihood model.
• But... it is computationally somewhat costly, and we have to assume a distributional form.
Data Mining - Massey University
Lab #6 - Clustering
• Task 1: Australian crabs data
– use model-based clustering to find clusters
• Task 2: Music data
– use hierarchical clustering to see if you can reconstruct the genres