
Data Mining - Massey University

Clustering

credits: Padhraic Smyth lecture notes

Hand, et al Chapter 9


Clustering Outline

• Introduction to Clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering


Clustering

• “Automated detection of group structure in data”
– Typically: partition N data points into K groups (clusters) such that the points in each group are more similar to each other than to points in other groups

– descriptive technique (contrast with predictive)

– Identify “natural” groups of data objects - qualitatively describe groups of the data

• often useful, if a bit reductionist

– for real-valued vectors, clusters can be thought of as clouds of points in p-dimensional space


Clustering

Sometimes easy

Sometimes impossible

and usually in between


What is Cluster Analysis?

• A good cluster analysis results in objects that are
– similar (close) to one another within the same cluster
– dissimilar (far) from the objects in other clusters

• In other words

– high intra-cluster similarity

– low inter-cluster similarity

• Typical applications
– As a stand-alone tool to get insight into the data distribution
– As a preprocessing step for other algorithms


Example

[Scatter plot: example data points falling into several visually separate groups]


Why is Clustering useful?

• “Discovery” of new knowledge from data
– Contrast with supervised classification (where labels are known)
– Can be very useful for summarizing large data sets
• for large n and/or high dimensionality
• Applications of clustering
– WWW
• clustering of documents produced by a search engine
• clustering weblog data to determine usage patterns
– Pattern recognition / image processing
• Google face finder (&imgtype=face)
– Segmentation of customers for an e-commerce store
– Spatial data analysis
• geographical clusters of events: cancer rates, sales, etc.
– Clustering of genes with similar expression profiles
– many more


General Issues in Clustering

• No gold standard! The answer is often subjective

• Cluster representation:
– What types or “shapes” of clusters are we looking for? What defines a cluster?

• Score:
– A clustering = an assignment of n objects to K clusters
– Score = a quantitative criterion used to evaluate different clusterings

• Other issues
– The distance function D[x(i), x(j)] is a critical aspect of clustering, both
• the distance between individual pairs of objects
• the distance of individual objects from clusters

– How is K selected?


Clustering Outline

• Introduction to Clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering


Distance Measures

• In order to cluster, we need some kind of “distance” between points.

• Sometimes distances are not obvious, but we can construct them from the objects’ attributes, as in the table below

case  sex  glasses  moustache  smile  hat
 1    0    1        0          1      0
 2    1    0        0          1      0
 3    0    1        0          0      0
 4    0    0        0          0      0
 5    0    0        0          1      0
 6    0    0        1          0      1
 7    0    1        0          1      0
 8    0    0        0          1      0
 9    0    1        1          1      0
10    1    0        0          0      0
11    0    0        1          0      0
12    1    0        0          0      0
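
A minimal R sketch of this idea, re-entering the table above and letting dist() turn the binary attributes into pairwise distances (the choice of dist() methods is illustrative):

    # Binary feature matrix from the table above (rows = cases)
    faces <- matrix(c(
      0,1,0,1,0,  1,0,0,1,0,  0,1,0,0,0,  0,0,0,0,0,
      0,0,0,1,0,  0,0,1,0,1,  0,1,0,1,0,  0,0,0,1,0,
      0,1,1,1,0,  1,0,0,0,0,  0,0,1,0,0,  1,0,0,0,0),
      ncol = 5, byrow = TRUE,
      dimnames = list(1:12, c("sex","glasses","moustache","smile","hat")))

    dist(faces, method = "binary")     # 1 - Jaccard similarity, pairwise
    dist(faces, method = "manhattan")  # simple count of mismatching attributes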


Euclidean Vs. Non-Euclidean

• A Euclidean space has some number of real-valued dimensions and “dense” points.
– There is a notion of the “average” of two points.
– A Euclidean distance is based on the locations of points in such a space.

• A Non-Euclidean distance is based on properties of points, but not their “location” in a space.


Some Euclidean Distances

• L2 norm: d(x,y) = the square root of the sum of the squared differences between x and y in each dimension.
– The most common notion of “distance.”

• L1 norm: the sum of the absolute differences in each dimension.
– Manhattan distance = the distance if you had to travel along coordinate axes only.


Examples of Euclidean Distances

x = (5,5), y = (9,8)

L2 norm: dist(x,y) = √(4² + 3²) = 5
L1 norm: dist(x,y) = 4 + 3 = 7

[Figure: right triangle between x and y with legs 4 and 3; the hypotenuse, length 5, is the L2 distance]
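
A quick R check of the example above:

    x <- c(5, 5); y <- c(9, 8)
    sqrt(sum((x - y)^2))                      # L2 (Euclidean) distance: 5
    sum(abs(x - y))                           # L1 (Manhattan) distance: 7
    dist(rbind(x, y))                         # same L2 result via dist()
    dist(rbind(x, y), method = "manhattan")   # same L1 result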


Non-Euclidean Distances

• Some observations are not appropriate for Euclidean distance:

• Binary vectors: Jaccard coefficient, cosine
• Strings: edit distance
• Ordinal variables: transformation of ranks
• Categorical variables: simple matching counts


Distances for Binary Vectors

• Intersection over union
– Example: p1 = 10111; p2 = 10011.

• Size of intersection = 3; size of union = 4, Jaccard similarity (not distance) = 3/4.

• Need to make a distance function satisfying triangle inequality and other laws.

• d(x,y) = 1 – (Jaccard similarity) works.
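
A minimal R sketch of the example above:

    p1 <- c(1,0,1,1,1); p2 <- c(1,0,0,1,1)
    jaccard_sim <- sum(p1 & p2) / sum(p1 | p2)    # 3/4
    1 - jaccard_sim                               # Jaccard distance = 1/4
    dist(rbind(p1, p2), method = "binary")        # same distance from dist()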


Distances for Binary Vectors

• A contingency table for binary data (objects i and j):

                   Object j
                   1        0        sum
    Object i  1    a        b        a + b
              0    c        d        c + d
            sum    a + c    b + d    p

• 1 − Jaccard similarity (intersection over union):

    d(i,j) = (b + c) / (a + b + c)


Cosine Distance (similarity)

• Think of a point as a vector from the origin (0,0,…,0) to its location.

• Two points’ vectors make an angle; the cosine of this angle is a measure of similarity
– Recall cos(0°) = 1; cos(90°) = 0
– Also: the cosine is the normalized dot-product of the vectors: p1·p2 / (|p1| |p2|)
– Example: p1 = 00111; p2 = 10011
– p1·p2 = 2; |p1| = |p2| = √3
– cos(θ) = 2/3; θ is about 48 degrees
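
The same example checked in R:

    p1 <- c(0,0,1,1,1); p2 <- c(1,0,0,1,1)
    cos_sim <- sum(p1 * p2) / (sqrt(sum(p1^2)) * sqrt(sum(p2^2)))
    cos_sim                     # 2/3
    acos(cos_sim) * 180 / pi    # angle in degrees, about 48.2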


Cosine-Measure Diagram

[Diagram: vectors p1 and p2 drawn from the origin; the projection p1·p2 / |p2| illustrates the angle whose cosine is the similarity]


Edit Distance for strings

• The edit distance of two strings is the number of inserts and deletes of characters needed to turn one into the other.

• Equivalently: d(x,y) = |x| + |y| − 2·|LCS(x,y)|.
– LCS = longest common subsequence = the longest string obtained both by deleting from x and by deleting from y.


Example

• x = abcde; y = bcduve.
• Turn x into y by deleting a, then inserting u and v after d.
– Edit distance = 3.
• Or: LCS(x,y) = bcde, so |x| + |y| − 2·|LCS(x,y)| = 5 + 6 − 2·4 = 3.
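
A small R sketch of the LCS-based computation (the helper names are made up for illustration):

    lcs_length <- function(x, y) {
      a <- strsplit(x, "")[[1]]; b <- strsplit(y, "")[[1]]
      m <- length(a); n <- length(b)
      L <- matrix(0L, m + 1, n + 1)          # dynamic-programming table
      for (i in 1:m) for (j in 1:n) {
        L[i + 1, j + 1] <- if (a[i] == b[j]) L[i, j] + 1L
                           else max(L[i, j + 1], L[i + 1, j])
      }
      L[m + 1, n + 1]
    }
    edit_distance <- function(x, y) nchar(x) + nchar(y) - 2 * lcs_length(x, y)
    edit_distance("abcde", "bcduve")   # 3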


Categorical Variables

• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

• Method 1: Simple matching
– m: # of matches, p: total # of variables

    d(i,j) = (p − m) / p

• Method 2: use a large number of binary variables
– create a new binary variable for each of the M nominal states
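
A minimal R sketch of simple matching for two categorical records (the colour/size/shape values are hypothetical):

    i <- c(colour = "red",  size = "large", shape = "round")
    j <- c(colour = "blue", size = "large", shape = "round")
    p <- length(i)        # total number of variables
    m <- sum(i == j)      # number of matching variables
    (p - m) / p           # simple-matching dissimilarity d(i,j) = 1/3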


Ordinal Variables

• An ordinal variable can be discrete or continuous

• order is important, e.g., rank

• Can be treated like interval-scaled variables:
– replace x_if by its rank r_if ∈ {1, …, M_f}
– map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

    z_if = (r_if − 1) / (M_f − 1)

– compute the dissimilarity using methods for interval-scaled variables
– not always such a great idea…
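
A short R sketch of this rank transformation (the ordinal levels are hypothetical):

    x <- c("low", "medium", "high", "medium", "low")                 # ordinal variable
    r <- as.integer(factor(x, levels = c("low", "medium", "high")))  # ranks r_if
    M <- max(r)
    z <- (r - 1) / (M - 1)   # mapped onto [0, 1]: 0.0, 0.5, 1.0, 0.5, 0.0
    dist(z)                  # then use an interval-scale (Euclidean) distance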


Clustering Methods

• enough about distances!

• Now we have a matrix (n x n) of distances.

• Two major types of clustering algorithms:
– partitioning
• partitions the set into clusters with defined boundaries
• place each point in its nearest cluster
– hierarchical
• agglomerative: each point starts in its own cluster; iteratively merge
• divisive: all data start in one cluster; iteratively split


Clustering Outline

• Introduction to Clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering


k-Means Algorithm(s)

• Assumes Euclidean space.
• Start by picking k, the number of clusters.
• Initialize clusters by picking one point per cluster.
– typically, k random points


K-Means Algorithm

1. Arbitrarily select K objects from the data (e.g., K customers) to be the initial cluster centers.

2. For each of the remaining objects: assign it to the cluster whose center it is closest to.

[Scatter plot on a 0–10 grid: two objects marked as the initial cluster centers]


Then repeat the following steps until the clusters converge (no change in cluster membership):

1. Compute the new centers of the current clusters.

[Scatter plot: the cluster centers recomputed as the means of the current clusters]


2. Assign each object to the cluster whose center it is closest to.

3. Go back to Step 1, or stop if the centers do not change.

[Scatter plot: objects reassigned to the cluster with the nearest updated center]


The K-Means Clustering Method

• Example

[Four scatter plots showing successive k-means iterations on a small two-dimensional data set: initial centers, first assignment, recomputed centers, and the converged clustering]


K-Means Example

• Given: {2,4,10,12,3,20,30,11,25}, k=2

• Randomly assign means: m1=3,m2=4

• Solve for the rest ….


K-Means Example

• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1=3, m2=4
• K1={2,3}, K2={4,10,12,20,30,11,25}; m1=2.5, m2=16
• K1={2,3,4}, K2={10,12,20,30,11,25}; m1=3, m2=18
• K1={2,3,4,10}, K2={12,20,30,11,25}; m1=4.75, m2=19.6
• K1={2,3,4,10,11,12}, K2={20,30,25}; m1=7, m2=25
• Stop, as the clusters with these means are the same.
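
A quick R check of the worked example above; a sketch only, but with Lloyd's algorithm and these starting centers it should converge to the same means, 7 and 25:

    x <- c(2, 4, 10, 12, 3, 20, 30, 11, 25)
    fit <- kmeans(x, centers = c(3, 4), algorithm = "Lloyd")
    fit$centers   # cluster means: 7 and 25
    fit$cluster   # cluster membership of each point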


Getting k Right

• Hard! Often done subjectively (by feel).
• Try different k, looking at the change in the average distance to centroid as k increases.
• Look for a balance between within-cluster variance and between-cluster variance.
• The average falls rapidly until the right k, then changes little (see the sketch below).

[Plot: average distance to centroid (y-axis) against k (x-axis); the curve drops steeply and then flattens, and the bend marks the best value of k]
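
A minimal R sketch of this elbow heuristic, using total within-cluster sum of squares as the score; the use of R's built-in faithful data here is an assumption, chosen because the same data appears later in these slides:

    xs <- scale(faithful)                        # standardize the two variables
    wss <- sapply(1:8, function(k)
      kmeans(xs, centers = k, nstart = 20)$tot.withinss)
    plot(1:8, wss, type = "b", xlab = "k",
         ylab = "total within-cluster SS")       # look for the elbow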


Example

[Scatter plot: the example data clustered with too few clusters; many long distances to centroid]


Example

[Scatter plot: the example data clustered with the right number of clusters; distances to centroid are rather short]


Example

[Scatter plot: the example data clustered with too many clusters; little improvement in average distance]


Comments on the K-Means Method

• Strengths
– Relatively efficient and easy to implement; often comes up with good, if not optimal, solutions
– Intuitive

• Weaknesses
– Applicable only when a mean is defined; what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers well
– Not suited to discovering clusters with non-convex shapes
– Quite sensitive to the initial starting points: will find a local optimum (although methods exist for seeking better optima)


Variations on k-means

• Make it more robust by using k-modes or k-medoids
– k-medoids: a medoid is the most centrally located object in a cluster

• Make the initialization better
– Take a small random sample and cluster it to find starting points
– From a sample, pick a random point, and then k − 1 more points, each as far from the previously selected points as possible (see the sketch below)
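
A rough R sketch of that farthest-point initialization idea; this is an illustrative implementation (function name and use of the faithful data are assumptions), not code from the lecture:

    farthest_point_init <- function(x, k) {
      x <- as.matrix(x)
      centers <- x[sample(nrow(x), 1), , drop = FALSE]   # first center at random
      while (nrow(centers) < k) {
        # distance from every point to its nearest already-chosen center
        d <- apply(x, 1, function(p) min(sqrt(colSums((t(centers) - p)^2))))
        centers <- rbind(centers, x[which.max(d), , drop = FALSE])
      }
      centers
    }
    init <- farthest_point_init(scale(faithful), k = 3)
    kmeans(scale(faithful), centers = init)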


Clustering Outline

• Introduction to Clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering


Hierarchical Clustering

• Representation: a tree of nested clusters
• Works from a distance matrix
– advantage: the x's can be any type of object
– disadvantage: computation
• Two basic approaches:
– merge points (agglomerative)
– divide superclusters (divisive)
• Visualize both via “dendrograms”
– shows the nesting structure
– merges or splits = tree nodes
• Applications
– e.g., clustering of gene expression data
– useful for seeing hierarchical structure, for relatively small data sets


Simple example of hierarchical clustering


Distances Between Clusters

• Single link:
– smallest distance between points
– nearest neighbor
– can be outlier sensitive
• Complete link:
– largest distance between points
– enforces “compactness”
• Average link:
– mean: gets average behavior
– centroid: more robust
– Ward's measure
• merge the clusters that minimize the increase in within-cluster sum of squares, SS(C_{i+j}) − (SS(C_i) + SS(C_j)) (see the R sketch below)
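
A minimal R sketch comparing linkage choices; the pairing with the Old Faithful data used later in these slides is an illustrative assumption, not the exact figures shown:

    d <- dist(scale(faithful))                   # n x n distance matrix
    hc_single   <- hclust(d, method = "single")    # nearest-neighbor linkage
    hc_complete <- hclust(d, method = "complete")
    hc_ward     <- hclust(d, method = "ward.D2")   # Ward's minimum-variance criterion
    par(mfrow = c(1, 3))
    plot(hc_single,   labels = FALSE, main = "single")
    plot(hc_complete, labels = FALSE, main = "complete")
    plot(hc_ward,     labels = FALSE, main = "Ward")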


Hierarchical Clustering

• This method does not require the number of clusters k as an input.

• Can be done forward or backward:

[Diagram: five objects a, b, c, d, e. Agglomerative clustering proceeds Step 0 → Step 4, merging a and b into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde; divisive clustering runs the same steps in reverse, Step 4 → Step 0.]


Agglomerative Hierarchical Clustering

• most common in statistical packages

• Merge nodes that have the least dissimilarity

• Go on in a non-descending fashion

• Eventually all nodes belong to the same cluster

[Three scatter plots on a 0–10 grid showing points being merged into successively larger clusters]


Divisive Hierarchical Clustering

• Start with one cluster with all data points

• Eventually each node forms a cluster on its own

[Three scatter plots on a 0–10 grid showing one all-inclusive cluster being split into successively smaller clusters]


A Dendrogram Shows How the Clusters are Merged Hierarchically

Decompose the data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
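
In R this corresponds to cutting an hclust tree; a minimal sketch, again using the Old Faithful data as a stand-in:

    hc <- hclust(dist(scale(faithful)), method = "ward.D2")
    plot(hc, labels = FALSE)            # the dendrogram
    clusters <- cutree(hc, k = 2)       # cut to obtain 2 clusters
    table(clusters)                     # cluster sizes
    # alternatively, cut at a given height: cutree(hc, h = 10)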


Agglomerative dendrogram: the height of each cross-bar shows the change in within-cluster SS.


Dendrogram Using Single-Link Method

Old Faithful Eruption Duration vs Wait Data. Notice how single-link tends to “chain”.

dendrogram y-axis = crossbar’s distance score


Dendrogram Using Ward's SSE Distance

Old Faithful Eruption Duration vs Wait Data. More balanced than single-link.


Hierarchical Clustering

• Pros
– don't have to specify k beforehand
– visual representation of various cluster characteristics from the dendrogram

• Cons
– different linkage options give very different results


Clustering Outline

• Introduction to Clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering


Estimating Probability Densities

• Using probability densities is one way to describe data.

• Finite mixtures of probability densities can be viewed as clusters.

• The negative log-likelihood is a common score function:

    S_L(θ) = − Σ_{i=1…n} log p(x(i); θ)

• It can be amended to penalize complexity, e.g. BIC:

    S_BIC(M_k) = 2 S_L(θ̂_k; M_k) + d_k log n


Mixture Models

“Two-stage model” for weekly credit card usage (x = number of weeks the card is used, out of 52):

    f(x) = p · e^{−λ1} λ1^x / x!  +  (1 − p) · e^{−λ2} λ2^{52−x} / (52 − x)!

General finite mixture:

    f(x) = Σ_{k=1…K} π_k f_k(x; θ_k)


Mixture Models and EM

• How do we find the models to mix over?

• EM (Expectation / Maximization) is a widely used technique that converges to a solution for finding mixture models.

• Assume multivariate normal components. To apply EM:

– take an initial solution

– calculate the probability that each point comes from each component and assign it (E-step)

– re-estimate parameters for the components based on the new assignments (M-step)

– repeat until convergence.

• Can be slow to converge; can find local maxima
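
A compact R sketch of these E and M steps for a one-dimensional, two-component Gaussian mixture; an illustrative implementation (function name and the use of the faithful waiting times are assumptions), not the lecture's code:

    em_gmm2 <- function(x, n_iter = 100) {
      # initial solution: two rough centers and a common spread
      mu  <- as.numeric(quantile(x, c(0.25, 0.75)))
      sig <- rep(sd(x), 2)
      w   <- c(0.5, 0.5)
      for (it in 1:n_iter) {
        # E-step: membership probabilities p(k | x_i)
        dens <- cbind(w[1] * dnorm(x, mu[1], sig[1]),
                      w[2] * dnorm(x, mu[2], sig[2]))
        resp <- dens / rowSums(dens)
        # M-step: re-estimate weights, means, and spreads from the memberships
        nk  <- colSums(resp)
        w   <- nk / length(x)
        mu  <- c(sum(resp[, 1] * x), sum(resp[, 2] * x)) / nk
        sig <- sqrt(c(sum(resp[, 1] * (x - mu[1])^2),
                      sum(resp[, 2] * (x - mu[2])^2)) / nk)
      }
      list(weights = w, means = mu, sds = sig)
    }
    fit <- em_gmm2(faithful$waiting)
    fit$means   # roughly the two modes of the eruption waiting times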


Probabilistic Clustering: Mixture Models

• assume a probabilistic model for each component cluster

• Mixture model: f(x) = Σ_{k=1…K} w_k f_k(x; θ_k)
• The w_k are K mixing weights
– 0 ≤ w_k ≤ 1 and Σ_{k=1…K} w_k = 1

• The K component densities f_k(x; θ_k) can be:
– Gaussian
– Poisson
– exponential
– ...

• Note:
– Assumes a model for the data (advantages and disadvantages)
– Results in probabilistic membership: p(cluster k | x)


Learning Mixture Models from Data

• Score function = log-likelihood L(θ)
– L(θ) = log p(X | θ) = log Σ_H p(X, H | θ)
– H = hidden variables (the cluster membership of each x)
– L(θ) cannot be optimized directly

• EM procedure
– A general technique for maximizing the log-likelihood with missing data
– For mixtures:
• E-step: compute “memberships” p(k | x) = w_k f_k(x; θ_k) / f(x)
• M-step: pick a new θ to maximize the expected data log-likelihood
• Iterate: guaranteed to climb to a (local) maximum of L(θ)


The E (Expectation) Step

[Diagram] E step: given the current K clusters and their parameters, and the n data points, compute p(data point i is in group k).


The M (Maximization) Step

[Diagram] M step: given the n data points and their memberships, compute new parameters θ for the K clusters.


Comments on Mixtures and EM Learning

• Probabilistic assignment to clusters…not a partition

• K-means is a special case of EM
– Gaussian mixtures with isotropic (diagonal, equal-variance) covariance matrices Σ_k
– Approximate the E-step by choosing the most likely cluster (instead of using the membership probabilities)


Selecting K in mixture models

• Cannot just choose the K that maximizes the likelihood
– the likelihood L(θ̂) is always larger for larger K

• Model selection alternatives for choosing K:
– 1) Penalizing complexity
• e.g., BIC = L(θ̂) − (d/2) log n, d = # parameters (Bayesian Information Criterion)
• easy to implement; asymptotically correct
– 2) Bayesian: compute posteriors p(K | data)
• p(K | data) requires computation of p(data | K) = the marginal likelihood
• can be tricky to compute for mixture models
– 3) (Cross-)validation:
• split the data into train and validation sets
• score different models by the likelihood of the test data, log p(X_test | θ̂)
• works well on large data sets
• can be noisy on small data (log L is sensitive to outliers)

– Note: all of these methods evaluate the quality of the clustering as a density estimator, rather than with any explicit notion of clustering
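
In R, the mclust package (used in the lab) automates the BIC route; a minimal sketch, again on the faithful data as a stand-in:

    library(mclust)
    bic <- mclustBIC(faithful, G = 1:9)   # fit mixtures with 1..9 components
    plot(bic)                             # BIC curves by covariance model type
    summary(bic)                          # best (model, K) combinations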


Example of BIC Score for Red-Blood Cell Data


True number of classes (2) selected by BIC


Model Based Clustering

• Set k
• Choose a parametric model for each cluster
• Use EM or Bayesian methods to fit the clusters
• Use library(mclust) in R

f(x) = Σ_{k=1…K} w_k f_k(x; θ_k)

Name  Distribution  Volume    Shape     Orientation
EII   Spherical     equal     equal     NA
VII   Spherical     variable  equal     NA
EEI   Diagonal      equal     equal     coordinate axes
VEI   Diagonal      variable  equal     coordinate axes
VVI   Diagonal      variable  variable  coordinate axes
EEE   Ellipsoidal   equal     equal     equal
EEV   Ellipsoidal   equal     equal     variable
VEV   Ellipsoidal   variable  equal     variable
VVV   Ellipsoidal   variable  variable  variable


Model-based clustering: red blood cells

set k=2 with Gaussian clusters…
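
A minimal mclust sketch of this step; the red blood cell data set is not included with these notes, so the built-in faithful data is used here as a stand-in:

    library(mclust)
    fit <- Mclust(faithful, G = 2)        # two Gaussian clusters; covariance model chosen by BIC
    summary(fit)
    plot(fit, what = "classification")    # hard assignments
    head(fit$z)                           # soft membership probabilities per observation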


[Figure: the fitted two-component model at EM iterations 0, 1, 2, 5, 10, and 25]


(Dis)Advantages of the Probabilistic Approach

• Provides a full distributional description for each component - different variances (iron-deficient group has greater spread)

• For each observation, provides a K-component vector of probabilities of class membership (we can find those at risk).

• Only limited by the imagination of our likelihoods… different distributions, shapes, clusters-within-clusters, etc.

• Can make inferences about the number of clusters via a penalized likelihood model.

• But... it's computationally somewhat costly, and we have to assume a distributional form.


Lab #6 - Clustering

• Task 1: Australian crabs data
– use model-based clustering to find clusters

• Task 2: Music data
– use hierarchical clustering to see if you can reconstruct the genres