Data Mining - Massey University
Clustering
credits: Padhraic Smyth lecture notes
Hand et al., Chapter 9
Data Mining - Massey University
Clustering Outline
• Introduction to clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering
Data Mining - Massey University
Clustering
• “automated detection of group structure in data”
– typically: partition N data points into K groups (clusters) such that the points in each group are more similar to each other than to points in other groups
– a descriptive technique (contrast with predictive)
– identifies “natural” groups of data objects and qualitatively describes groups of the data
• often useful, if a bit reductionist
– for real-valued vectors, clusters can be thought of as clouds of points in p-dimensional space
Data Mining - Massey University
Clustering
Sometimes easy
Sometimes impossible
and usually in between
Data Mining - Massey University
What is Cluster Analysis?
• A good cluster analysis produces objects that are:
– similar (close) to one another within the same cluster
– dissimilar (far) from the objects in other clusters
• In other words:
– high intra-cluster similarity
– low inter-cluster similarity
• Typical applications:
– as a stand-alone tool to gain insight into the data distribution
– as a preprocessing step for other algorithms
Data Mining - Massey University
Example
[Figure: scatterplot of unlabeled points with visible group structure]
Data Mining - Massey University
Why is Clustering useful?
• “Discovery” of new knowledge from data
– contrast with supervised classification (where labels are known)
– can be very useful for summarizing large data sets, especially for large n and/or high dimensionality
• Applications of clustering
– WWW: clustering of documents produced by a search engine; clustering weblog data to determine usage patterns
– pattern recognition / image processing, e.g., Google face finder (&imgtype=face)
– segmentation of customers for an e-commerce store
– spatial data analysis: geographical clusters of events (cancer rates, sales, etc.)
– clustering of genes with similar expression profiles
– many more
Data Mining - Massey University
General Issues in Clustering
• No golden truth! The answer is often subjective.
• Cluster representation: what types or “shapes” of clusters are we looking for? What defines a cluster?
• Score: a clustering is an assignment of n objects to K clusters; the score is a quantitative criterion used to evaluate different clusterings.
• Other issues:
– the distance function D[x(i), x(j)] is a critical aspect of clustering, both for distances between individual pairs of objects and for distances of individual objects from clusters
– how is K selected?
Data Mining - Massey University
Clustering Outline
• Introduction to clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering
Data Mining - Massey University
Distance Measures
• In order to cluster, we need some kind of “distance” between points.
• Sometimes distances are not obvious, but we can create them
case sex glasses moustache smile hat
1 0 1 0 1 0
2 1 0 0 1 0
3 0 1 0 0 0
4 0 0 0 0 0
5 0 0 0 1 0
6 0 0 1 0 1
7 0 1 0 1 0
8 0 0 0 1 0
9 0 1 1 1 0
10 1 0 0 0 0
11 0 0 1 0 0
12 1 0 0 0 0
Data Mining - Massey University
Euclidean Vs. Non-Euclidean
• A Euclidean space has some number of real-valued dimensions and “dense” points.
– There is a notion of the “average” of two points.
– A Euclidean distance is based on the locations of points in such a space.
• A non-Euclidean distance is based on properties of points, but not on their “location” in a space.
Data Mining - Massey University
Some Euclidean Distances
• L2 norm: d(x,y) = the square root of the sum of the squares of the differences between x and y in each dimension.
– The most common notion of “distance.”
• L1 norm: the sum of the absolute differences in each dimension.
– Manhattan distance = the distance if you had to travel along coordinates only.
Data Mining - Massey University
Examples of Euclidean Distances
x = (5,5), y = (9,8)
L2 norm: dist(x,y) = √(4² + 3²) = 5
L1 norm: dist(x,y) = 4 + 3 = 7
[Figure: right triangle between x and y with horizontal leg 4, vertical leg 3, hypotenuse 5]
Data Mining - Massey University
Non-Euclidean Distances
• Some observations are not appropriate for Euclidean distance:
– binary vectors: Jaccard coefficient, cosine
– strings: edit distance
– ordinal variables: transformation of ranks
– categorical variables: matching sums
Data Mining - Massey University
Distances for Binary Vectors
• Jaccard: intersection over union
– Example: p1 = 10111; p2 = 10011.
– Size of intersection = 3; size of union = 4; Jaccard similarity (not distance) = 3/4.
• We need a distance function satisfying the triangle inequality and the other metric laws.
• d(x,y) = 1 − (Jaccard similarity) works.
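A minimal sketch in R (the course's lab language), assuming 0/1 vectors; jaccard_dist is a hypothetical helper, not a library function:

  jaccard_dist <- function(x, y) {
    inter <- sum(x & y)   # size of the intersection
    union <- sum(x | y)   # size of the union
    1 - inter / union     # d(x,y) = 1 - Jaccard similarity
  }
  p1 <- c(1, 0, 1, 1, 1)  # 10111
  p2 <- c(1, 0, 0, 1, 1)  # 10011
  jaccard_dist(p1, p2)    # 1 - 3/4 = 0.25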
Data Mining - Massey University
Distances for Binary Vectors
• A contingency table for binary data:

              object j
                 1        0       sum
  object i  1    a        b      a + b
            0    c        d      c + d
       sum     a + c    b + d      p

• 1 − Jaccard similarity (intersection over union):

  d(i,j) = (b + c) / (a + b + c)
Data Mining - Massey University
Cosine Distance (similarity)
• Think of a point as a vector from the origin (0,0,…,0) to its location.
• Two points’ vectors make an angle; the cosine of this angle is a measure of similarity.
– Recall cos(0°) = 1; cos(90°) = 0.
– Also: the cosine is the normalized dot product of the vectors: p1·p2 / (|p1| |p2|).
– Example: p1 = 00111; p2 = 10011.
– p1·p2 = 2; |p1| = |p2| = √3.
– cos(θ) = 2/3; θ is about 48 degrees.
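A minimal R sketch of the same computation; cosine_sim is a hypothetical helper:

  cosine_sim <- function(p1, p2) {
    sum(p1 * p2) / (sqrt(sum(p1^2)) * sqrt(sum(p2^2)))  # normalized dot product
  }
  p1 <- c(0, 0, 1, 1, 1)   # 00111
  p2 <- c(1, 0, 0, 1, 1)   # 10011
  s <- cosine_sim(p1, p2)  # 2/3
  acos(s) * 180 / pi       # about 48.19 degrees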
Data Mining - Massey University
Cosine-Measure Diagram
[Figure: vectors p1 and p2 with angle θ between them; the projection of p1 onto p2 has length p1·p2 / |p2|]
Data Mining - Massey University
Edit Distance for strings
• The edit distance of two strings is the number of inserts and deletes of characters needed to turn one into the other.
• Equivalently: d(x,y) = |x| + |y| − 2|LCS(x,y)|.
– LCS = longest common subsequence = the longest string obtained both by deleting from x and by deleting from y.
Data Mining - Massey University
Example
• x = abcde; y = bcduve.
• Turn x into y by deleting a, then inserting u and v after d.
– Edit distance = 3.
• Or: LCS(x,y) = bcde, so |x| + |y| − 2|LCS(x,y)| = 5 + 6 − 2·4 = 3.
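A minimal R sketch of the LCS identity above; lcs_length and edit_dist are hypothetical helpers implementing the insert/delete-only edit distance:

  lcs_length <- function(x, y) {
    a <- strsplit(x, "")[[1]]
    b <- strsplit(y, "")[[1]]
    m <- matrix(0, length(a) + 1, length(b) + 1)  # DP table of LCS lengths
    for (i in seq_along(a)) for (j in seq_along(b)) {
      m[i + 1, j + 1] <- if (a[i] == b[j]) m[i, j] + 1
                         else max(m[i, j + 1], m[i + 1, j])
    }
    m[length(a) + 1, length(b) + 1]
  }
  edit_dist <- function(x, y) nchar(x) + nchar(y) - 2 * lcs_length(x, y)
  edit_dist("abcde", "bcduve")   # 5 + 6 - 2*4 = 3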
Data Mining - Massey University
Categorical Variables
• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.
• Method 1: simple matching
– m: # of matches, p: total # of variables

  d(i,j) = (p − m) / p

• Method 2: use a large number of binary variables
– create a new binary variable for each of the M nominal states
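A minimal R sketch of Method 1, assuming equal-length categorical vectors; matching_dist is a hypothetical helper:

  matching_dist <- function(i, j) {
    p <- length(i)     # total number of variables
    m <- sum(i == j)   # number of matches
    (p - m) / p
  }
  matching_dist(c("red", "blue", "green"),
                c("red", "green", "green"))   # (3 - 2) / 3 = 1/3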
Data Mining - Massey University
Ordinal Variables
• An ordinal variable can be discrete or continuous.
• Order is important, e.g., rank.
• Can be treated like interval-scaled variables:
– replace x_if by its rank r_if ∈ {1, …, M_f}
– map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

  z_if = (r_if − 1) / (M_f − 1)

– compute the dissimilarity using methods for interval-scaled variables
– not always such a great idea…
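A minimal R sketch of the rank mapping, assuming ranks 1…M_f are already assigned:

  r  <- c(1, 3, 2, 5, 4)    # ranks r_if for one ordinal variable f
  Mf <- max(r)              # number of states M_f
  z  <- (r - 1) / (Mf - 1)  # mapped onto [0, 1]
  z                         # 0.00 0.50 0.25 1.00 0.75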
Data Mining - Massey University
Clustering Methods
• Enough about distances!
• Now we have an (n × n) matrix of distances.
• Two major types of clustering algorithms:
– partitioning
• partitions the set into clusters with defined boundaries
• place each point in its nearest cluster
– hierarchical
• agglomerative: each point starts in its own cluster; iteratively combine
• divisive: all data start in one cluster; iteratively dissect
Data Mining - Massey University
Clustering Outline
• Introduction to clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering
Data Mining - Massey University
k –Means Algorithm(s)
• Assumes a Euclidean space.
• Start by picking k, the number of clusters.
• Initialize clusters by picking one point per cluster.
– typically, k random points
Data Mining - Massey University
K-Means Algorithm
1. Arbitrarily select K objects from the data (e.g., K customers) to be the initial cluster centers.
2. For each of the remaining objects: assign the object to the cluster whose center it is closest to.
[Figure: points on a 0–10 × 0–10 grid, with two objects marked as the initial cluster centers]
Data Mining - Massey University
Then repeat the following 3 steps until the clusters converge (no change in clusters):
1. Compute the new centers of the current clusters.
[Figure: the new centers computed as the means of the current clusters]
K-Means Algorithm
Data Mining - Massey University
2. Assign each object to the cluster whose center it is closest to.
3. Go back to Step 1, or stop if the centers do not change.
[Figure: objects reassigned to the nearest of the updated centers]
K-Means Algorithm
Data Mining - Massey University
The K-Means Clustering Method
• Example
[Figure: four snapshots of the k-means iterations, from initial centers to converged clusters]
Data Mining - Massey University
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1=3,m2=4
• Solve for the rest ….
Data Mining - Massey University
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1 = 3, m2 = 4
• K1 = {2,3}, K2 = {4,10,12,20,30,11,25}; m1 = 2.5, m2 = 16
• K1 = {2,3,4}, K2 = {10,12,20,30,11,25}; m1 = 3, m2 = 18
• K1 = {2,3,4,10}, K2 = {12,20,30,11,25}; m1 = 4.75, m2 = 19.6
• K1 = {2,3,4,10,11,12}, K2 = {20,30,25}; m1 = 7, m2 = 25
• Stop, as the clusters with these means are the same.
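A minimal R sketch reproducing this 1-D trace; the loop is a bare-bones version of the algorithm, not base R's kmeans():

  x <- c(2, 4, 10, 12, 3, 20, 30, 11, 25)
  m <- c(3, 4)                                    # initial means m1, m2
  repeat {
    # assign each point to its nearest mean
    cl <- apply(abs(outer(x, m, "-")), 1, which.min)
    new_m <- as.numeric(tapply(x, cl, mean))      # recompute the means
    if (all(new_m == m)) break                    # stop: means unchanged
    m <- new_m
  }
  m             # 7 25
  split(x, cl)  # K1 = {2,3,4,10,11,12}, K2 = {20,30,25}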
Data Mining - Massey University
Getting k Right
• Hard! Often done subjectively (by feel).
• Try different k, looking at the change in the average distance to the centroid as k increases.
• Look for a balance between within-cluster variance and between-cluster variance.
• The average falls rapidly until the right k, then changes little.
[Figure: average distance to centroid vs. k; the best value of k is where the curve flattens]
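A hedged sketch of this heuristic with base R's kmeans() on synthetic data; tot.withinss (the total within-cluster sum of squares) stands in for the average distance to centroid:

  set.seed(1)
  X <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 5),  ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))   # three true groups
  wss <- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
  plot(1:8, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")
  # the curve falls rapidly until k = 3, then changes little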
Data Mining - Massey University
Example
[Figure: the scatterplot clustered with too few clusters; many long distances to centroid]
Data Mining - Massey University
Example
[Figure: the scatterplot with k just right; distances to centroid are rather short]
Data Mining - Massey University
Example
[Figure: the scatterplot with too many clusters; little improvement in average distance]
Data Mining - Massey University
Comments on the K-Means Method
• Strengths
– relatively efficient and easy to implement; often comes up with good, if not optimal, solutions
– intuitive
• Weaknesses
– applicable only when a mean is defined; what about categorical data?
– need to specify k, the number of clusters, in advance
– unable to handle noisy data and outliers
– not suited to discovering clusters with non-convex shapes
– quite sensitive to the initial starting points: it finds a local optimum (although methods exist for seeking global optima)
Data Mining - Massey University
Variations on k-means
• Make it more robust by using k-modes or k-medoids.
– k-medoids: a medoid is the most centrally located object in a cluster.
• Make the initialization better:
– take a small random sample and cluster it to find a starting point
– from a sample, pick a random point, and then k − 1 more points, each as far from the previously selected points as possible (see the sketch below)
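A minimal R sketch of that farthest-point initialization (farthest_first is a hypothetical helper); the result can be passed to kmeans() as its starting centers:

  farthest_first <- function(X, k) {
    centers <- X[sample(nrow(X), 1), , drop = FALSE]  # one random point
    while (nrow(centers) < k) {
      # squared distance from each point to its nearest chosen center
      d <- apply(X, 1, function(p) min(colSums((t(centers) - p)^2)))
      centers <- rbind(centers, X[which.max(d), ])    # take the farthest point
    }
    centers
  }
  set.seed(2)
  X <- matrix(rnorm(200), ncol = 2)
  km <- kmeans(X, centers = farthest_first(X, 3))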
Data Mining - Massey University
Clustering Outline
• Introduction to clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering
Data Mining - Massey University
Hierarchical Clustering
• Representation: a tree of nested clusters.
• Works from a distance matrix.
– advantage: the x’s can be any type of object
– disadvantage: computation
• Two basic approaches:
– merge points (agglomerative)
– divide superclusters (divisive)
• Visualize both via “dendrograms”:
– shows nesting structure
– merges or splits = tree nodes
• Applications:
– e.g., clustering of gene expression data
– useful for seeing hierarchical structure, for relatively small data sets
Data Mining - Massey University
Simple example of hierarchical clustering
Data Mining - Massey University
Distances Between Clusters
• Single link:
– smallest distance between points
– nearest neighbor
– can be outlier-sensitive
• Complete link:
– largest distance between points
– enforces “compactness”
• Average link:
– mean: gets average behavior
– centroid: more robust
• Ward’s measure:
– merge the clusters that minimize the increase in within-cluster distances, SS(C_{i+j}) − (SS(C_i) + SS(C_j)) (see the sketch below)
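A hedged sketch comparing linkages with base R's hclust() on the built-in Old Faithful data (used in the dendrogram slides below); "single", "complete", "average", and "ward.D2" are hclust's method names:

  d <- dist(scale(faithful))   # distance matrix on standardized data
  par(mfrow = c(1, 2))
  plot(hclust(d, method = "single"),  labels = FALSE, main = "single link")
  plot(hclust(d, method = "ward.D2"), labels = FALSE, main = "Ward")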
Data Mining - Massey University
Hierarchical Clustering
• This method does not require the number of clusters k as an input.
• Can be done forward or backward:
[Figure: objects a, b, c, d, e merged step by step into {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e}; agglomerative clustering runs left to right (steps 0 to 4), divisive clustering runs the same steps in reverse (steps 4 to 0)]
Data Mining - Massey University
Agglomerative Hierarchical Clustering
• most common in statistical packages
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
[Figure: three snapshots of agglomerative merging on a 0–10 × 0–10 scatterplot]
Data Mining - Massey University
Divisive Hierarchical Clustering
• Start with one cluster with all data points
• Eventually each node forms a cluster on its own
[Figure: three snapshots of divisive splitting on a 0–10 × 0–10 scatterplot]
Data Mining - Massey University
A Dendrogram Shows How the Clusters are Merged Hierarchically
Decomposes data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster (see the sketch below).
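A minimal R sketch of cutting a dendrogram with cutree(); the Old Faithful data and Ward linkage are illustrative choices:

  hc <- hclust(dist(scale(faithful)), method = "ward.D2")
  groups <- cutree(hc, k = 2)    # cut so two connected components remain
  table(groups)                  # cluster sizes
  plot(faithful, col = groups)   # points coloured by cluster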
Data Mining - Massey University
Data Mining - Massey University
[Figure: agglomerative dendrogram; the height of each cross-bar shows the change in within-cluster SS]
Data Mining - Massey University
Dendrogram Using Single-Link Method
Old Faithful eruption duration vs. wait data. Notice how single link tends to “chain”.
(dendrogram y-axis = the crossbar’s distance score)
Data Mining - Massey University
Dendrogram Using Ward’s SSE Distance
Old Faithful eruption duration vs. wait data. More balanced than single link.
Data Mining - Massey University
Hierarchical Clustering
• Pros
– don’t have to specify k beforehand
– visual representation of various cluster characteristics from the dendrogram
• Cons
– different linkage options give very different results
Data Mining - Massey University
Clustering Outline
• Introduction to clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering
Data Mining - Massey University
Estimating Probability Densities
• Using probability densities is one way to describe data.
• Finite mixtures of probability densities can be viewed as clusters.
• The negative log-likelihood is a common score function:

  S_L(θ) = −∑_{i=1}^{n} log p(x(i); θ)

• It can be amended to penalize complexity (BIC):

  S_BIC(M_k) = 2 S_L(θ̂_k; M_k) + d_k log n
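A one-line R sketch of the penalized score (bic_score is a hypothetical helper; the example values are illustrative only):

  bic_score <- function(neg_loglik, d_k, n) 2 * neg_loglik + d_k * log(n)
  bic_score(neg_loglik = 1234.5, d_k = 5, n = 500)  # d_k = 5 for a 2-component 1-D Gaussian mixture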
Data Mining - Massey University
Mixture Models
“Two-stage model” for weekly credit card usage, a two-component Poisson mixture:

  f(x) = p · e^(−λ1) λ1^x / x! + (1 − p) · e^(−λ2) λ2^(x−52) / (x − 52)!

In general, a mixture density has the form:

  f(x) = ∑_{k=1}^{K} π_k f_k(x; θ_k)
Data Mining - Massey University
Mixture Models and EM
• How do we find the models to mix over?
• EM (Expectation / Maximization) is a widely used technique that converges to a solution for finding mixture models.
• Assume multivariate normal components. To apply EM:
– take an initial solution
– calculate the probability that each point comes from each component and assign it (E-step)
– re-estimate parameters for the components based on the new assignments (M-step)
– repeat until convergence.
• Can be slow to converge; can find local maxima
Data Mining - Massey University
Probabilistic Clustering: Mixture Models
• Assume a probabilistic model for each component cluster.
• Mixture model: f(x) = ∑_{k=1…K} w_k f_k(x; θ_k)
• The w_k are K mixing weights:
– 0 ≤ w_k ≤ 1 and ∑_{k=1…K} w_k = 1
• The K component densities f_k(x; θ_k) can be:
– Gaussian
– Poisson
– exponential
– ...
• Note:
– assumes a model for the data (advantages and disadvantages)
– results in probabilistic membership: p(cluster k | x)
Data Mining - Massey University
Learning Mixture Models from Data
• Score function = log-likelihood L(θ)
– L(θ) = log p(X | θ) = log ∑_H p(X, H | θ)
– H = hidden variables (the cluster memberships of each x)
– L(θ) cannot be optimized directly
• EM procedure
– a general technique for maximizing the log-likelihood with missing data
– for mixtures:
• E-step: compute “memberships” p(k | x) = w_k f_k(x; θ_k) / f(x)
• M-step: pick a new θ to maximize the expected data log-likelihood
• iterate: guaranteed to climb to a (local) maximum of L(θ) (see the sketch below)
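A hedged R sketch of these two steps for a two-component 1-D Gaussian mixture; all names are hypothetical and the synthetic data is for illustration:

  set.seed(3)
  x <- c(rnorm(150, 0, 1), rnorm(100, 5, 1))      # data from two components
  w <- c(0.5, 0.5); mu <- c(-1, 1); s <- c(1, 1)  # initial solution theta
  for (iter in 1:100) {
    # E-step: memberships p(k | x) = w_k f_k(x; theta_k) / f(x)
    f1 <- w[1] * dnorm(x, mu[1], s[1])
    f2 <- w[2] * dnorm(x, mu[2], s[2])
    r  <- f1 / (f1 + f2)                          # p(component 1 | x)
    # M-step: re-estimate theta from the soft assignments
    w  <- c(mean(r), mean(1 - r))
    mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
    s  <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
            sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))
  }
  round(c(w, mu, s), 2)  # climbs to a (local) maximum near the generating values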
Data Mining - Massey University
The E (Expectation) Step
Current K clusters and parameters θ; n data points.
E step: compute p(data point i is in group k).
Data Mining - Massey University
The M (Maximization) Step
New parameters θ for the K clusters; n data points.
M step: compute θ, given the n data points and memberships.
Data Mining - Massey University
Comments on Mixtures and EM Learning
• Probabilistic assignment to clusters… not a partition.
• K-means is a special case of EM:
– Gaussian mixtures with isotropic (diagonal, equi-variance) Σ_k’s
– the E-step is approximated by choosing the most likely cluster (instead of using the membership probabilities)
Data Mining - Massey University
Selecting K in mixture models
• Cannot just choose the K that maximizes the likelihood:
– the likelihood L(θ) is always larger for larger K
• Model selection alternatives for choosing K:
– 1) penalizing complexity:
• e.g., BIC = L(θ̂) − (d/2) log n, where d = # parameters (Bayesian information criterion)
• easy to implement; asymptotically correct
– 2) Bayesian: compute posteriors p(k | data)
• p(k | data) requires computation of p(data | k), the marginal likelihood
• can be tricky to compute for mixture models
– 3) (cross-)validation:
• split the data into train and validation sets
• score different models by the likelihood of the held-out data, log p(X_test | θ̂)
• works well on large data sets
• can be noisy on small data (log L is sensitive to outliers)
– Note: all of these methods evaluate the quality of the clustering as a density estimator, rather than with any explicit notion of clustering
Data Mining - Massey University
Example of BIC Score for Red-Blood Cell Data
Data Mining - Massey University
Example of BIC Score for Red-Blood Cell Data
[Figure: BIC vs. number of components; the true number of classes (2) is selected by BIC]
Data Mining - Massey University
Model Based Clustering
• Set k.
• Choose a parametric model for each cluster.
• Use EM or Bayesian methods to fit the clusters.
• Use library(mclust) in R.

  f(x) = ∑_{k=1…K} w_k f_k(x; θ_k)
Name  Distribution  Volume    Shape     Orientation
EII   Spherical     equal     equal     NA
VII   Spherical     variable  equal     NA
EEI   Diagonal      equal     equal     coordinate axes
VEI   Diagonal      variable  equal     coordinate axes
VVI   Diagonal      variable  variable  coordinate axes
EEE   Ellipsoidal   equal     equal     equal
EEV   Ellipsoidal   equal     equal     variable
VEV   Ellipsoidal   variable  equal     variable
VVV   Ellipsoidal   variable  variable  variable
Data Mining - Massey University
Model-based clustering: red blood cells
set k=2 with Gaussian clusters…
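A hedged R sketch with the mclust package named earlier; since the red-blood-cell data is not distributed with these notes, the built-in faithful data stands in:

  library(mclust)                     # install.packages("mclust") if needed
  fit <- Mclust(faithful, G = 2)      # k = 2 Gaussian clusters, fit by EM
  summary(fit)                        # covariance model, mixing weights, means
  head(fit$z)                         # per-observation membership probabilities
  plot(fit, what = "classification")  # cluster assignments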
Data Mining - Massey University
[Figure: EM fit snapshots at iterations 0, 1, 2, 5, 10, and 25]
Data Mining - Massey University
(Dis)Advantages of the Probabilistic Approach
• Provides a full distributional description for each component: different variances (the iron-deficient group has greater spread).
• For each observation, provides a K-component vector of probabilities of class membership (we can find those at risk).
• Only limited by the imagination of likelihoods… different distributions, shapes, clusters-within-clusters, etc.
• Can make inferences about the number of clusters, via a penalized-likelihood model.
• But... it is computationally somewhat costly, and we have to assume a distributional form.
Data Mining - Massey University
Lab #6 - Clustering
• Task 1: Australian crabs data
– use model-based clustering to find clusters
• Task 2: Music data
– use hierarchical clustering to see if you can reconstruct the genres