Clustering. Credit: Padhraic Smyth, University of California, Irvine.


Transcript of the Clustering lecture slides by Padhraic Smyth, University of California, Irvine.

Page 1: Clustering Credit: Padhraic Smyth University of California, Irvine.

Clustering

Credit: Padhraic Smyth, University of California, Irvine

Page 2: Clustering Credit: Padhraic Smyth University of California, Irvine.

Clustering

• “Automated detection of group structure in data”
  – Typically: partition N data points into K groups (clusters) such that the points in each group are more similar to each other than to points in other groups
  – Descriptive technique (contrast with predictive)
  – For real-valued vectors, clusters can be thought of as clouds of points in p-dimensional space

Page 3: Clustering Credit: Padhraic Smyth University of California, Irvine.

Clustering

Sometimes easy

Sometimes impossible

and sometimes in between

Page 4: Clustering Credit: Padhraic Smyth University of California, Irvine.

Why is Clustering useful?

• “Discovery” of new knowledge from data
  – Contrast with supervised classification (where labels are known)
  – Long history in the sciences of categories, taxonomies, etc.
  – Can be very useful for summarizing large data sets
    • For large n and/or high dimensionality

• Applications of clustering
  – Clustering of documents produced by a search engine
  – Segmentation of customers for an e-commerce store
  – Discovery of new types of galaxies in astronomical data
  – Clustering of genes with similar expression profiles
  – Clustering pixels in an image into regions of similar intensity
  – ... many more

Page 5: Clustering Credit: Padhraic Smyth University of California, Irvine.

General Issues in Clustering

• Clustering algorithm = Representation + Score + Optimization

• Cluster Representation:
  – What types or “shapes” of clusters are we looking for? What defines a cluster?

• Score:
  – A clustering = assignment of n objects to K clusters
  – Score = quantitative criterion used to evaluate different clusterings

• Optimization and Search:
  – Finding the optimal (minimal/maximal score) clustering is typically NP-hard
  – Greedy algorithms to optimize the score are widely used

• Other issues:
  – The distance function D[x(i), x(j)] is a critical aspect of clustering, both for
    • distances between individual pairs of objects
    • distances of individual objects from clusters
  – How is K selected?
  – Different types of data
    • Real-valued versus categorical
    • Attribute-valued vectors vs. an N x N distance matrix

Page 6: Clustering Credit: Padhraic Smyth University of California, Irvine.

Different Types of Clustering Algorithms

• Partition-based clustering
  – e.g., K-means

• Probabilistic model-based clustering
  – e.g., fuzzy k-means, mixture models

[both of the above work with measurement data, e.g., feature vectors]

• Hierarchical clustering
  – e.g., hierarchical agglomerative clustering

• Graph-based clustering
  – e.g., min-cut algorithms

[both of the above work with distance data, e.g., a distance matrix]

Page 7: Clustering Credit: Padhraic Smyth University of California, Irvine.

Different Types of Input to Clustering Algorithms

• Data matrix:
  – N rows, d columns

• Distance matrix:
  – N x N matrix of distances between objects
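As a rough illustration (not part of the original slides), a minimal Python sketch of the two input types, assuming numpy and scipy are available:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data matrix: N = 5 objects, d = 3 attributes
X = np.random.default_rng(0).normal(size=(5, 3))

# N x N distance matrix derived from the data matrix
D = squareform(pdist(X, metric="euclidean"))

print(X.shape)  # (5, 3) -> data-matrix input
print(D.shape)  # (5, 5) -> distance-matrix input
```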

Page 8: Clustering Credit: Padhraic Smyth University of California, Irvine.

Partition-Based Clustering

• Input:
  – n data points X = {x(1) ... x(n)}, each of dimension d
  – K = number of clusters

• Output: C = {C1 ... CK} = specification of K clusters
  – Implicit representation:
    • each x(i) is assigned to a unique Cj (hard assignment)
  – Explicit representation:
    • each Cj is specified in some manner, e.g., as a mean or a region in input space

• Optimization algorithm
  – Require that score[C] is minimized (or maximized)
    • e.g., sum of squared within-cluster distances
  – Exhaustive search is intractable
  – Combinatorial optimization problem: assign n objects to K classes
  – Large search space: the number of possible clusterings is approximately K^n / K!
    • so, use a greedy iterative method
    • will be subject to local optima

Page 9: Clustering Credit: Padhraic Smyth University of California, Irvine.

Score Functions for Partition-Based Clustering

• Want compact clusters
  – minimize within-cluster distances wc(C)

• Want different clusters far apart
  – maximize between-cluster distances bc(C)

• Given a cluster partitioning C, find centers c1 ... cK
  – e.g., for vectors, use the centroids of the points in each cluster Ci:
    • ci = (1/ni) Σ_{x ∈ Ci} x
  – wc(C) = sum of squared within-cluster distances (minimize)
    • wc(C) = Σ_{i=1...K} wc(Ci), where wc(Ci) = Σ_{x ∈ Ci} d(x, ci)^2
  – bc(C) = distance between clusters (maximize)
    • bc(C) = Σ_{i,j=1...K} d(ci, cj)^2
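A minimal Python sketch of these two scores, assuming numpy and that X is an n x d array with one integer cluster label per row (illustrative only, not the lecture's code):

```python
import numpy as np

def wc_bc_scores(X, labels):
    """Within-cluster (wc) and between-cluster (bc) sum-of-squares,
    following the slide's definitions (illustrative sketch)."""
    ks = np.unique(labels)
    centers = np.array([X[labels == k].mean(axis=0) for k in ks])
    # wc(C): squared distances of points to their own cluster centroid
    wc = sum(((X[labels == k] - centers[i]) ** 2).sum()
             for i, k in enumerate(ks))
    # bc(C): squared distances between centroids (summed over distinct pairs)
    bc = sum(((centers[i] - centers[j]) ** 2).sum()
             for i in range(len(ks)) for j in range(i + 1, len(ks)))
    return wc, bc
```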

Page 10: Clustering Credit: Padhraic Smyth University of California, Irvine.

K-means Clustering

• Basic idea:
  – Score = wc(C) = sum of squared within-cluster distances
  – Start with randomly chosen cluster centers c1 ... cK
  – Repeat until no cluster memberships change:
    • assign each point x to the cluster with the nearest center
      – i.e., find the smallest d(x, ci) over all c1 ... cK
    • recompute each cluster center over the data assigned to it:
      – ci = (1/ni) Σ_{x ∈ Ci} x

• The algorithm terminates (in a finite number of steps)
  – Score(C) decreases at each iteration (if any membership changes)

• Converges to at least a local minimum of Score(C)
  – not necessarily the global minimum ...
  – different initial centers (seeds) can lead to different local minima
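An illustrative Python sketch of this loop (a generic K-means, not the lecture's reference implementation), assuming numpy:

```python
import numpy as np

def kmeans(X, K, seed=0, max_iter=100):
    """Minimal K-means sketch. X: (n, d) array, K: number of clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # random seeds
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # assign each point to the nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # no membership changed -> stop
            break
        labels = new_labels
        # recompute each center as the mean of its assigned points
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels
```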

Page 11: Clustering Credit: Padhraic Smyth University of California, Irvine.

Squared Errors and Cluster Centers

• Squared error (distance) between a data point x and a cluster center c:
  d[x, c] = Σ_j (xj - cj)^2

• Total squared error between a cluster center ck and all Nk points assigned to that cluster:
  Sk = Σ_i d[xi, ck]
  (Distance is usually defined as Euclidean distance.)

• Total squared error summed across the K clusters:
  SSE = Σ_k Sk


Page 12: Clustering Credit: Padhraic Smyth University of California, Irvine.

K-means Complexity

• Time complexity = O(I e n K) << exhaustive ~ K^n / K!
  – I = number of iterations (steps)
  – e = cost of a distance computation (e = p for Euclidean distance in p dimensions)

• Approximations/speed-up tricks for very large n:
  – use the x(i)'s nearest to the means as cluster centers instead of the actual means
    • allows reuse of cached distances from the n^2 distance matrix D (lowers the effective “e”)
  – “condense”: reduce “n” by replacing a group of points with a prototype
  – Additional references:

Page 13: Clustering Credit: Padhraic Smyth University of California, Irvine.

K-means
1. Ask user how many clusters they'd like (e.g., K = 5)

(Example is courtesy of Andrew Moore, CMU)

Page 14: Clustering Credit: Padhraic Smyth University of California, Irvine.

K-means
1. Ask user how many clusters they'd like (e.g., K = 5)
2. Randomly guess K cluster Center locations

Page 15: Clustering Credit: Padhraic Smyth University of California, Irvine.

K-means
1. Ask user how many clusters they'd like (e.g., K = 5)
2. Randomly guess K cluster Center locations
3. Each datapoint finds out which Center it's closest to (thus each Center “owns” a set of datapoints)

Page 16: Clustering Credit: Padhraic Smyth University of California, Irvine.

K-means
1. Ask user how many clusters they'd like (e.g., K = 5)
2. Randomly guess K cluster Center locations
3. Each datapoint finds out which Center it's closest to
4. Each Center finds the centroid of the points it owns

Page 17: Clustering Credit: Padhraic Smyth University of California, Irvine.

K-means
1. Ask user how many clusters they'd like (e.g., K = 5)
2. Randomly guess K cluster Center locations
3. Each datapoint finds out which Center it's closest to
4. Each Center finds the centroid of the points it owns
5. New Centers => new boundaries
6. Repeat until no change

Page 18: Clustering Credit: Padhraic Smyth University of California, Irvine.

K-means clustering of RGB (3-value) pixel color intensities, K = 11 segments (from David Forsyth, UC Berkeley)

Image

Page 19: Clustering Credit: Padhraic Smyth University of California, Irvine.

K-means clustering of RGB (3-value) pixel color intensities, K = 11 segments (from David Forsyth, UC Berkeley)

Image and clusters on color
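A rough sketch of this kind of color segmentation, assuming scikit-learn and an image already loaded as an (H, W, 3) numpy array; the function name and parameters are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_by_color(image, K=11):
    """Cluster RGB pixel intensities and return a label image (sketch)."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)        # n x 3 data matrix
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)
    return km.labels_.reshape(h, w)                    # one segment id per pixel
```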

Page 20: Clustering Credit: Padhraic Smyth University of California, Irvine.

Issues in K-means clustering

• Simple, but useful
  – tends to select compact “isotropic” cluster shapes
  – can be useful for initializing more complex methods
  – many algorithmic variations on the basic theme
    • e.g., in signal processing/data compression it is similar to vector quantization

• Choice of distance measure
  – Euclidean distance
  – Weighted Euclidean distance
  – Many others possible

• Selection of K
  – “scree diagram”: plot SSE versus K and look for the “knee” of the curve
    • Limitation: there may not be any clear K value
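A small sketch of the scree-diagram idea, assuming scikit-learn's KMeans (whose inertia_ attribute is the within-cluster SSE):

```python
from sklearn.cluster import KMeans

def sse_vs_k(X, k_values=range(1, 11)):
    """Return (K, SSE) pairs for a scree plot; look for the knee by eye."""
    sses = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sses.append(km.inertia_)   # inertia_ = within-cluster sum of squares
    return list(zip(k_values, sses))
```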

Page 21: Clustering Credit: Padhraic Smyth University of California, Irvine.

Convergence to Global Minimum

• Does K-means always converge to the best possible solution?
  – i.e., the set of K centers that minimizes the SSE?

• No: it always converges to *some* solution, but not necessarily the best
  – depends on the starting point chosen

• To think about: prove that the SSE always decreases after every iteration of the K-means algorithm, until convergence (hint: show that the assignment step and the recomputation of the cluster centers both decrease the SSE).
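Because of this dependence on the starting point, a common practice (not prescribed by the slides) is to run K-means from several random seeds and keep the lowest-SSE solution; a sketch reusing the kmeans function from the earlier sketch:

```python
# assumes the kmeans() sketch defined earlier in these notes
def kmeans_best_of(X, K, n_restarts=10):
    """Run K-means from several random seeds and keep the lowest-SSE result."""
    best = None
    for seed in range(n_restarts):
        centers, labels = kmeans(X, K, seed=seed)
        sse = sum(((X[labels == k] - centers[k]) ** 2).sum() for k in range(K))
        if best is None or sse < best[0]:
            best = (sse, centers, labels)
    return best
```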

Page 22: Clustering Credit: Padhraic Smyth University of California, Irvine.

Local Search and Local Minima


Page 23: Clustering Credit: Padhraic Smyth University of California, Irvine.

Suboptimal Results from K-means on Simulated Data

Why does k-means not perform so well on this example?

Page 24: Clustering Credit: Padhraic Smyth University of California, Irvine.

Finite Mixture Models

p(x) = ?

Page 25: Clustering Credit: Padhraic Smyth University of California, Irvine.

Finite Mixture Models

p(x) = Σ_{k=1...K} p(x, ck)

Page 26: Clustering Credit: Padhraic Smyth University of California, Irvine.

Finite Mixture Models

p(x) = Σ_{k=1...K} p(x, ck) = Σ_{k=1...K} p(x | ck) p(ck)

Page 27: Clustering Credit: Padhraic Smyth University of California, Irvine.

Finite Mixture Models

p(x) = Σ_{k=1...K} p(x, ck) = Σ_{k=1...K} p(x | ck) p(ck) = Σ_{k=1...K} wk p(x | ck)

Page 28: Clustering Credit: Padhraic Smyth University of California, Irvine.

Finite Mixture Models

p(x) = Σ_{k=1...K} p(x, ck) = Σ_{k=1...K} p(x | ck) p(ck) = Σ_{k=1...K} wk p(x | ck; θk)

  wk = weight of component k
  p(x | ck; θk) = component model k
  θk = parameters of component k

Page 29: Clustering Credit: Padhraic Smyth University of California, Irvine.


[Figure: densities of Component 1 and Component 2, and the resulting Mixture Model density p(x), plotted versus x.]

Page 30: Clustering Credit: Padhraic Smyth University of California, Irvine.


[Figure: densities of Component 1 and Component 2, and the resulting Mixture Model density p(x), plotted versus x.]

Page 31: Clustering Credit: Padhraic Smyth University of California, Irvine.


[Figure: the individual Component Models, and the resulting Mixture Model density p(x), plotted versus x.]

Page 32: Clustering Credit: Padhraic Smyth University of California, Irvine.

Interpretation of Mixtures

1. C has a direct (physical) interpretation
   e.g., C = {age of fish}, C = {male, female}

Page 33: Clustering Credit: Padhraic Smyth University of California, Irvine.

Interpretation of Mixtures

1. C has a direct (physical) interpretation
   e.g., C = {age of fish}, C = {male, female}

2. C is a convenient hidden variable (i.e., the cluster variable)
   - focuses attention on subsets of the data, e.g., for visualization, clustering, etc.
   - C might have a physical/real interpretation, but not necessarily so

Page 34: Clustering Credit: Padhraic Smyth University of California, Irvine.

Probabilistic Clustering: Mixture Models

• Assume a probabilistic model for each component cluster

• Mixture model: f(x) = Σ_{k=1...K} wk fk(x; θk)

• The wk are the K mixing weights
  – 0 ≤ wk ≤ 1 and Σ_{k=1...K} wk = 1

• The K component densities fk(x; θk) can be:
  – Gaussian
  – Poisson
  – exponential
  – ...

• Note:
  – Assumes a model for the data (advantages and disadvantages)
  – Results in probabilistic membership: p(cluster k | x)
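An illustrative sketch of evaluating such a mixture density in one dimension, assuming scipy and hypothetical Gaussian components:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D Gaussian mixture: weights, means, standard deviations
w  = np.array([0.4, 0.6])
mu = np.array([0.0, 4.0])
sd = np.array([1.0, 1.5])

def mixture_density(x):
    """f(x) = sum_k wk * fk(x; mu_k, sd_k) for a 1-D Gaussian mixture."""
    return sum(w[k] * norm.pdf(x, loc=mu[k], scale=sd[k]) for k in range(len(w)))

print(mixture_density(2.0))
```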


Page 35: Clustering Credit: Padhraic Smyth University of California, Irvine.

Gaussian Mixture Models (GMM)

• Model for the k-th component is normal, N(μk, Σk)
  – often assume a diagonal covariance: Σjj = σj^2, Σij = 0 for i ≠ j
  – or sometimes even simpler: Σjj = σ^2, Σij = 0

• f(x) = Σ_{k=1...K} wk fk(x; θk), with θk = <μk, Σk> or <μk, σk>

• Generative model:
  – randomly choose a component
    • component k is selected with probability wk
  – generate x ~ N(μk, Σk)
  – note: μk and σk are both d-dimensional vectors
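A minimal sketch of this generative process, assuming numpy and hypothetical mixture parameters:

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Sample n points from a Gaussian mixture: (1) choose a component k
    with probability wk, then (2) draw x ~ N(mu_k, Sigma_k). Sketch only."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(weights), size=n, p=weights)                  # step 1
    X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in ks])  # step 2
    return X, ks
```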

Page 36: Clustering Credit: Padhraic Smyth University of California, Irvine.

Learning Mixture Models from Data

• Score function = log-likelihood L(θ)
  – L(θ) = log p(X | θ) = log Σ_H p(X, H | θ)
  – H = hidden variables (the cluster memberships of each x)
  – L(θ) cannot be optimized directly

• EM procedure
  – General technique for maximizing the log-likelihood with missing data
  – For mixtures:
    • E-step: compute “memberships” p(k | x) = wk fk(x; θk) / f(x)
    • M-step: pick a new θ to maximize the expected data log-likelihood
    • Iterate: guaranteed to climb to a (local) maximum of L(θ)
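An illustrative EM sketch for a one-dimensional Gaussian mixture (a generic implementation, not the lecture's code), assuming numpy and scipy:

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=50, seed=0):
    """EM sketch for a 1-D Gaussian mixture. x: 1-D numpy array."""
    rng = np.random.default_rng(seed)
    w  = np.full(K, 1.0 / K)                      # mixing weights
    mu = rng.choice(x, size=K, replace=False)     # initial means from the data
    sd = np.full(K, x.std())                      # initial standard deviations
    for _ in range(n_iter):
        # E-step: membership probabilities p(k | x_i) = wk fk(x_i) / f(x_i)
        r = np.array([w[k] * norm.pdf(x, mu[k], sd[k]) for k in range(K)])  # K x n
        r /= r.sum(axis=0, keepdims=True)
        # M-step: re-estimate weights, means, and variances from memberships
        nk = r.sum(axis=1)
        w  = nk / len(x)
        mu = (r @ x) / nk
        sd = np.sqrt((r * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / nk)
    return w, mu, sd
```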

Page 37: Clustering Credit: Padhraic Smyth University of California, Irvine.

The E (Expectation) Step

Current K clusters and parameters (mean and covariance for each cluster)

n datapoints

E step: Compute p(data point i is in group k) given mean and covariance for each cluster

Page 38: Clustering Credit: Padhraic Smyth University of California, Irvine.

The M (Maximization) Step

New parameters for the K clusters

n datapoints

M step: Given the n data points and their memberships, compute a new estimate of the mean and covariance for each cluster

Page 39: Clustering Credit: Padhraic Smyth University of California, Irvine.

Complexity of EM for mixtures

K models, n datapoints

Complexity per iteration scales as O( n K f(p) )

Page 40: Clustering Credit: Padhraic Smyth University of California, Irvine.

Comments on Mixtures and EM Learning

• Complexity of each EM iteration
  – Depends on the probabilistic model being used
    • e.g., for Gaussians, the E-step is O(nK) and the M-step is O(nKp^2)
  – Sometimes the E-step or M-step is not closed form
    • => can require numerical optimization or sampling within each iteration
    • Generalized EM (GEM): instead of maximizing the likelihood, just increase it
    • EM can be thought of as hill-climbing with the direction and step size provided automatically

• K-means as a special case of EM
  – Gaussian mixtures with isotropic (diagonal, equal-variance) Σk's
  – Approximate the E-step by choosing the most likely cluster (instead of using membership probabilities)

• Generalizations...
  – Mixtures of multinomials for text data
  – Mixtures of Markov chains for Web sequences
  – + more
  – Will be discussed later in lectures on text and Web data

Page 41: Clustering Credit: Padhraic Smyth University of California, Irvine.


[Figure: ANEMIA PATIENTS AND CONTROLS. Scatter plot of Red Blood Cell Hemoglobin Concentration versus Red Blood Cell Volume.]

Page 42: Clustering Credit: Padhraic Smyth University of California, Irvine.


[Figure: EM ITERATION 1. Red Blood Cell Hemoglobin Concentration versus Red Blood Cell Volume.]

Page 43: Clustering Credit: Padhraic Smyth University of California, Irvine.


[Figure: EM ITERATION 3. Red Blood Cell Hemoglobin Concentration versus Red Blood Cell Volume.]

Page 44: Clustering Credit: Padhraic Smyth University of California, Irvine.


[Figure: EM ITERATION 5. Red Blood Cell Hemoglobin Concentration versus Red Blood Cell Volume.]

Page 45: Clustering Credit: Padhraic Smyth University of California, Irvine.


[Figure: EM ITERATION 10. Red Blood Cell Hemoglobin Concentration versus Red Blood Cell Volume.]

Page 46: Clustering Credit: Padhraic Smyth University of California, Irvine.


[Figure: EM ITERATION 15. Red Blood Cell Hemoglobin Concentration versus Red Blood Cell Volume.]

Page 47: Clustering Credit: Padhraic Smyth University of California, Irvine.


[Figure: EM ITERATION 25. Red Blood Cell Hemoglobin Concentration versus Red Blood Cell Volume.]

Page 48: Clustering Credit: Padhraic Smyth University of California, Irvine.


[Figure: ANEMIA DATA WITH LABELS. Red Blood Cell Hemoglobin Concentration versus Red Blood Cell Volume.]

Page 49: Clustering Credit: Padhraic Smyth University of California, Irvine.


[Figure: LOG-LIKELIHOOD AS A FUNCTION OF EM ITERATIONS. Log-Likelihood versus EM Iteration.]

Page 50: Clustering Credit: Padhraic Smyth University of California, Irvine.

Selecting K in mixture models

• Cannot just choose the K that maximizes the likelihood
  – the likelihood L(θ) is always larger for larger K

• Model selection alternatives:
  – 1) Penalize complexity
    • e.g., BIC = L(θ) - (d/2) log n, where d = number of parameters (Bayesian Information Criterion)
    • asymptotically correct under certain assumptions
    • often used in practice for mixture models even though the assumptions behind the theory are not met
  – 2) Bayesian: compute posteriors p(k | data)
    • p(k | data) requires computation of p(data | k) = the marginal likelihood
    • can be tricky to compute for mixture models
    • recent work on Dirichlet process priors has made this more practical
  – 3) (Cross-)validation:
    • score different models by log p(Xtest | θ)
    • split the data into train and validation sets
    • works well on large data sets
    • can be noisy on small data sets (the log-likelihood is sensitive to outliers)
  – Note: all of these methods evaluate the quality of the clustering as a density estimator, rather than with any explicit notion of clustering
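A small sketch of option 1 using scikit-learn's GaussianMixture (an assumption about tooling, not part of the slides); note that scikit-learn's .bic() is on a -2 log-likelihood scale, so lower is better:

```python
from sklearn.mixture import GaussianMixture

def choose_k_by_bic(X, k_values=range(1, 8)):
    """Fit a Gaussian mixture for each K and pick the K with the best BIC.
    scikit-learn's .bic() is minimized (it equals -2 logL + d log n), which
    corresponds, up to sign and scale, to the slide's L(theta) - d/2 log n."""
    bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
            for k in k_values}
    return min(bics, key=bics.get), bics
```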

Page 51: Clustering Credit: Padhraic Smyth University of California, Irvine.

Example of BIC Score for Red-Blood Cell Data

Page 52: Clustering Credit: Padhraic Smyth University of California, Irvine.

Example of BIC Score for Red-Blood Cell Data

True number of classes (2) selected by BIC

Page 53: Clustering Credit: Padhraic Smyth University of California, Irvine.

Hierarchical Clustering

• Representation: tree of nested clusters
• Works from a distance matrix
  – advantage: the x's can be any type of object
  – disadvantage: computation

• Two basic approaches:
  – merge points (agglomerative)
  – divide superclusters (divisive)

• Visualize both via “dendrograms”
  – shows nesting structure
  – merges or splits = tree nodes

• Applications
  – e.g., clustering of gene expression data
  – useful for seeing hierarchical structure, for relatively small data sets

Page 54: Clustering Credit: Padhraic Smyth University of California, Irvine.

Simple example of hierarchical clustering

Page 55: Clustering Credit: Padhraic Smyth University of California, Irvine.

Agglomerative Methods: Bottom-Up

• Algorithm, based on distances between clusters:
  – for i = 1 to n, let Ci = { x(i) }, i.e., start with n singletons
  – while more than one cluster is left:
    • let Ci and Cj be the cluster pair with minimum distance, dist[Ci, Cj]
    • merge them, via Ci = Ci ∪ Cj, and remove Cj

• Time complexity = O(n^2) to O(n^3)
  – n iterations (start: n clusters; end: 1 cluster)
  – 1st iteration: O(n^2) to find the nearest singleton pair

• Space complexity = O(n^2)
  – accesses all distances between the x(i)'s

• Interpreting a large-n dendrogram is difficult anyway (like decision trees)
  – large-n idea: use partition-based clusters at the leaves

Page 56: Clustering Credit: Padhraic Smyth University of California, Irvine.

Distances Between Clusters

• single link / nearest-neighbor measure:
  – D(Ci, Cj) = min { d(x, y) | x ∈ Ci, y ∈ Cj }
  – can be outlier/noise sensitive

Page 57: Clustering Credit: Padhraic Smyth University of California, Irvine.

Distances Between Clusters

• single link / nearest-neighbor measure:
  – D(Ci, Cj) = min { d(x, y) | x ∈ Ci, y ∈ Cj }
  – can be outlier/noise sensitive

• complete link / furthest-neighbor measure:
  – D(Ci, Cj) = max { d(x, y) | x ∈ Ci, y ∈ Cj }
  – enforces more “compact” clusters

Page 58: Clustering Credit: Padhraic Smyth University of California, Irvine.

Distances Between Clusters

• single link / nearest-neighbor measure:
  – D(Ci, Cj) = min { d(x, y) | x ∈ Ci, y ∈ Cj }
  – can be outlier/noise sensitive

• complete link / furthest-neighbor measure:
  – D(Ci, Cj) = max { d(x, y) | x ∈ Ci, y ∈ Cj }
  – enforces more “compact” clusters

• intermediates between those extremes:
  – average link: D(Ci, Cj) = avg { d(x, y) | x ∈ Ci, y ∈ Cj }
  – centroid: D(Ci, Cj) = d(ci, cj), where ci, cj are the cluster centroids
    • note that the centroid method requires that a vector mean can be defined

• Which to choose? Different methods may be used for exploratory purposes; it depends on the goals and the application.
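An illustrative comparison of these linkage choices using scipy's hierarchical-clustering routines (the data set and plotting are placeholders, not from the lecture):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(30, 2))   # hypothetical small data set

# Build agglomerative trees under different between-cluster distance measures
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)       # (n-1) x 4 merge table
    plt.figure()
    dendrogram(Z)
    plt.title(f"{method} linkage")
plt.show()
```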

Page 59: Clustering Credit: Padhraic Smyth University of California, Irvine.

Old-Faithful Eruption Timing Data Set

• Notice that these variables are not scaled, so waiting time will receive much more weight in Euclidean distance calculations than duration
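A minimal sketch of the usual fix, standardizing each variable before computing distances (an assumption about preprocessing, not shown in the slides):

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance so that no single
    variable (e.g., waiting time) dominates Euclidean distances."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```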

Page 60: Clustering Credit: Padhraic Smyth University of California, Irvine.

Dendrogram Using Single-Link Method

Old Faithful eruption duration vs. wait data. Notice how single-link tends to “chain”.

dendrogram y-axis = crossbar’s distance score

Page 61: Clustering Credit: Padhraic Smyth University of California, Irvine.

Dendrogram Using Ward's SSE Distance

Old Faithful eruption duration vs. wait data. More balanced than single-link.

Page 62: Clustering Credit: Padhraic Smyth University of California, Irvine.

Hierarchical Cluster Structure of languages

Page 63: Clustering Credit: Padhraic Smyth University of California, Irvine.

Scalability of Hierarchical Clustering

• N objects to cluster
• Hierarchical clustering algorithms scale as O(N^2) to O(N^3)

– Why?

• This is problematic for large N...
• Solutions?
  – Use K-means (or a similar algorithm) to create an initial set of K clusters, then apply hierarchical clustering from there (see the sketch below)
  – Use approximate fast algorithms
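A sketch of the first workaround, assuming scikit-learn and scipy (the function name and the choice of 50 coarse clusters are hypothetical):

```python
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage

def two_stage_hierarchical(X, n_coarse=50, method="ward"):
    """Reduce N points to n_coarse K-means centers, then cluster the
    centers hierarchically (a common scalability workaround, sketched here)."""
    km = KMeans(n_clusters=n_coarse, n_init=10, random_state=0).fit(X)
    Z = linkage(km.cluster_centers_, method=method)
    return km, Z          # Z can be passed to scipy's dendrogram()
```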


Page 64: Clustering Credit: Padhraic Smyth University of California, Irvine.

Divisive Methods: Top-Down

• Algorithm:
  – begin with a single cluster containing all the data
  – split into components; repeat until the clusters are single points

• Two major types:
  – monothetic:
    • split on one variable at a time -- restricts the search space
    • analogous to decision trees
  – polythetic:
    • splits on all variables at once -- the many choices make this difficult

• Less commonly used than agglomerative methods
  – generally more computationally intensive
    • more choices in the search space

Page 65: Clustering Credit: Padhraic Smyth University of California, Irvine.

MATLAB Demo

• In-class MATLAB demo comparing different hierarchical algorithms and K-means on the same data sets

Page 66: Clustering Credit: Padhraic Smyth University of California, Irvine.

Spectral/Graph-based Clustering

Idea:
• think of the distance matrix as a weighted graph where
  - objects are nodes
  - edges exist between objects with high similarity

Page 67: Clustering Credit: Padhraic Smyth University of California, Irvine.

Spectral/Graph-based Clustering

Idea:
• think of the distance matrix as a weighted graph where
  - objects are nodes
  - edges exist between objects with high similarity

• finding a good grouping of the objects is somewhat equivalent to finding low-weight subgraphs
  - can be reduced to “min-cut” algorithms
  - related to the eigenstructure of the distance matrix
  - e.g., Shi and Malik, 1997, for image segmentation
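An illustrative sketch of this idea using scikit-learn's SpectralClustering on a similarity matrix built from pairwise distances (the Gaussian-kernel bandwidth choice is an assumption, not from the lecture):

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(0).normal(size=(40, 2))     # hypothetical data

# Turn distances into similarities (edge weights) with a Gaussian kernel
D = squareform(pdist(X))
sigma = np.median(D)                    # assumed bandwidth choice
W = np.exp(-(D ** 2) / (2 * sigma ** 2))

# Graph-based clustering on the similarity (affinity) matrix
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(W)
```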

Page 68: Clustering Credit: Padhraic Smyth University of California, Irvine.

Clustering non-vector objects

• E.g., sequences, images, documents, etc.
  – can be of varying lengths or sizes

• Distance matrix approach (see the sketch after this list)
  – e.g., compute edit distances/transformations for pairs of sequences
  – apply clustering (e.g., hierarchical) based on the distance matrix
  – however... does not scale well computationally

• “Vectorization”
  – represent each object as a vector
  – cluster the resulting vectors using a vector-space algorithm
  – however... can lose (e.g., sequence) information by going to vector space

• Probabilistic model-based clustering
  – treat the data as a mixture of stochastic models (e.g., Markov models)
  – can naturally handle variable lengths/sizes
  – will discuss an application to Web session clustering later in the quarter
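As noted above, an illustrative distance-matrix sketch: a plain Levenshtein edit distance between hypothetical sequences, followed by hierarchical clustering with scipy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def edit_distance(a, b):
    """Plain Levenshtein edit distance (illustrative helper)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

seqs = ["ACGTT", "ACGT", "TTGCA", "TTGCAA", "ACGTA"]    # hypothetical sequences
D = np.array([[edit_distance(s, t) for t in seqs] for s in seqs], dtype=float)

# Hierarchical clustering on the pairwise distance matrix
Z = linkage(squareform(D), method="average")    # squareform -> condensed form
labels = fcluster(Z, t=2, criterion="maxclust")
```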

Page 69: Clustering Credit: Padhraic Smyth University of California, Irvine.

Clustering of tropical cyclones, Western North Pacific 1983-2002 (from Scott Gaffney's PhD thesis, 2004; uses mixture-of-regressions clustering)

Page 70: Clustering Credit: Padhraic Smyth University of California, Irvine.

K-Means Clustering (components of the algorithm)

Task: Clustering
Representation: Partition based on K centers
Score Function: Within-cluster sum of squared errors
Search/Optimization: Iterative greedy search
Data Management: None specified
Models, Parameters: K centers

Page 71: Clustering Credit: Padhraic Smyth University of California, Irvine.

Probabilistic Model-Based Clustering (components of the algorithm)

Task: Clustering
Representation: Mixture of probability components
Score Function: Log-likelihood
Search/Optimization: EM (iterative)
Data Management: None specified
Models, Parameters: Probability model

Page 72: Clustering Credit: Padhraic Smyth University of California, Irvine.

Single-Link Hierarchical Clustering (components of the algorithm)

Task: Clustering
Representation: Tree of nested groupings
Score Function: No global score
Search/Optimization: Iterative merging of nearest neighbors
Data Management: None specified
Models, Parameters: Dendrogram

Page 73: Clustering Credit: Padhraic Smyth University of California, Irvine.

Summary

• Many different approaches and algorithms
• No “optimal” or “best” approach
  – What type of cluster structure are you looking for?
• Computational complexity may be an issue for large n
• Dimensionality is also an issue
• Validation/selection of K is often an ill-posed problem
  – Often there is no “right answer” for what the optimal number of clusters is