David M. Blei - Computer Science Department at Princeton University


Transcript of David M. Blei - Computer Science Department at Princeton University


Hierarchical clustering

David M. Blei

COS424, Princeton University

February 28, 2008


Hierarchical clustering

• Hierarchical clustering is a widely used data analysis tool.

• The idea is to build a binary tree of the data that successively merges similar groups of points.

• Visualizing this tree provides a useful summary of the data.


Hierarchical clustering vs. k-means

• Recall that k-means or k-medoids requires:

• A number of clusters k
• An initial assignment of data to clusters
• A distance measure between data, d(x_n, x_m)

• Hierarchical clustering only requires a measure of similarity between groups of data points.
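For contrast, here is a minimal sketch of the inputs each method needs. It assumes scikit-learn and SciPy are available; the toy data, the choice k = 3, and average linkage are all illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # toy data: 20 points in 2 dimensions

# k-means needs k up front, an initialization (n_init restarts here),
# and a distance between data points (Euclidean, implicit in the objective).
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical clustering needs only a rule for intergroup similarity
# (the `method` argument); no k and no initial assignment.
Z = linkage(X, method="average")                     # full merge tree
cut_labels = fcluster(Z, t=3, criterion="maxclust")  # a cut can come later
```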


Agglomerative clustering

• We will talk about agglomerative clustering.

• Algorithm:

1 Place each data point into its own singleton group
2 Repeat: iteratively merge the two closest groups
3 Until: all the data are merged into a single cluster
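A from-scratch sketch of these three steps, assuming Euclidean distance and single linkage between groups (both illustrative choices); the cubic loop is for exposition, not efficiency.

```python
import numpy as np

def agglomerate(X):
    """Return the merge sequence as (group_a, group_b, distance) triples."""
    groups = [[i] for i in range(len(X))]   # 1. every point starts as a singleton
    merges = []
    while len(groups) > 1:                  # 3. stop when one cluster remains
        best = None
        for a in range(len(groups)):        # 2. find the two closest groups
            for b in range(a + 1, len(groups)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in groups[a] for j in groups[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((groups[a], groups[b], d))
        groups[a] = groups[a] + groups[b]   # merge the closest pair of groups
        del groups[b]                       # (b > a, so index a is unaffected)
    return merges

rng = np.random.default_rng(0)
print(agglomerate(rng.normal(size=(5, 2))))
```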


Example

[Figure: scatter plots of the example data (axes V1 and V2), showing first the raw data and then iterations 001 through 024 of agglomerative clustering as the closest groups are successively merged.]


Agglomerative clustering

• Each level of the resulting tree is a segmentation of the data

• The algorithm results in a sequence of groupings

• It is up to the user to choose a “natural” clustering from this sequence.
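To make the sequence concrete, here is a sketch (assuming SciPy; the data and cut levels are illustrative) that cuts the merge tree at several levels, yielding one segmentation per cut:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))      # illustrative data
Z = linkage(X, method="average")  # the full sequence of merges

# Each cut is one segmentation from the sequence; which one is
# "natural" is the user's call.
for k in (2, 3, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, labels)
```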


Dendrogram

• Agglomerative clustering is monotonic

• The similarity between merged clusters is monotone decreasing with the level of the merge.

• Dendrogram: plot each merge at the (negative) similarity between the two merged groups.

• Provides an interpretable visualization of the algorithm and data.

• Useful summarization tool, and part of why hierarchical clustering is popular.
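One way to produce such a plot, as a sketch assuming SciPy and matplotlib are available (the two-group toy data are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(10, 2)),   # two loose groups,
               rng.normal(5, 1, size=(10, 2))])  # purely illustrative

Z = linkage(X, method="complete")  # merge heights are monotone increasing
dendrogram(Z)                      # each merge is drawn at its height
plt.ylabel("Height")
plt.show()
```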


Dendrogram of example data

[Figure: cluster dendrogram of the example data, computed with hclust(*, "complete") on dist(x); the vertical axis ("Height") shows the level of each merge.]

Groups that merge at high values relative to the merger values of their subgroups are candidates for natural clusters. (Tibshirani et al., 2001)


Group similarity

• Given a distance measure between points, the user has many choices for how to define intergroup similarity.

• Three most popular choices:

• Single-linkage: the similarity of the closest pair

    d_SL(G, H) = min_{i ∈ G, j ∈ H} d_ij

• Complete linkage: the similarity of the furthest pair

    d_CL(G, H) = max_{i ∈ G, j ∈ H} d_ij

• Group average: the average similarity between groups

    d_GA(G, H) = (1 / (N_G N_H)) Σ_{i ∈ G} Σ_{j ∈ H} d_ij
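The three rules transcribe directly into code. A sketch, assuming a precomputed pairwise-distance matrix D and groups given as lists of row indices (the function names and toy data are illustrative):

```python
import numpy as np

def single_linkage(D, G, H):
    return D[np.ix_(G, H)].min()   # d_SL: the closest pair across groups

def complete_linkage(D, G, H):
    return D[np.ix_(G, H)].max()   # d_CL: the furthest pair across groups

def group_average(D, G, H):
    return D[np.ix_(G, H)].mean()  # d_GA: mean over all N_G * N_H pairs

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # Euclidean distances
print(single_linkage(D, [0, 1, 2], [3, 4, 5]))
```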


Properties of intergroup similarity

• Single linkage can produce “chaining,” where a sequence of close observations in different groups causes early merges of those groups.

• Complete linkage has the opposite problem: it might not merge close groups because of outlier members that are far apart.

• Group average represents a natural compromise, but it depends on the scale of the similarities. Applying a monotone transformation to the similarities can change the results.
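A small sketch of the scale-dependence point (the data and groups are illustrative): squaring the distances is a monotone transformation, so comparisons based on a min or max (single and complete linkage) are unchanged, while the group average can come to prefer a different merge.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances

G, H1, H2 = [0, 1], [2, 3, 4], [5, 6, 7]  # illustrative groups

def avg(D, A, B):
    return D[np.ix_(A, B)].mean()

# min/max comparisons are invariant under the monotone map d -> d**2,
# but the two average-linkage comparisons below may disagree:
print(avg(D, G, H1) < avg(D, G, H2))
print(avg(D**2, G, H1) < avg(D**2, G, H2))
```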


Caveats

• Hierarchical clustering should be treated with caution.

• Different decisions about group similarities can lead to vastly different dendrograms.

• The algorithm imposes a hierarchical structure on the data, even data for which such structure is not appropriate.


Examples

• “Repeated Observation of Breast Tumor Subtypes in Independent Gene Expression Data Sets” (Sorlie et al., 2003)

• Hierarchical clustering of gene expression data led to new theories.

• Later, these theories were tested in the lab.


Examples

• “The Balance of Roger de Piles” (Studdert-Kennedy and Davenport, 1974)

• Roger de Piles rated 57 paintings along different dimensions.

• These authors cluster them using different methods, including hierarchical clustering.

• They discuss the different clusters. (They are art critics.)


Good: They are cautious. “The value of this analysis...will depend on any interesting speculation it may provoke.”


Examples

• “Similarity Grouping of Australian Universities” (Stanley and Reynolds, 1994)

• Use hierarchical clustering on Australian universities.

• Use features such as:

• # of staff in different departments
• entry scores
• funding
• evaluations


One dendrogram

[Figure: a dendrogram of the universities data.]


Another dendrogram. Notice that it’s different.

[Figure: a second dendrogram of the universities data; its grouping differs from the first.]


• Split values: they notice that there is no kink and conclude that there is no cluster structure in Australian universities.

• Good: cautious interpretation of the clustering, and analysis based on multiple subsets of the features.

• Bad: their conclusion, that Australian universities cannot be clustered, ignores all the algorithmic choices that were made.


Examples

• “Comovement of International Equity Markets: A Taxonomic Approach” (Panton et al., 1976)

• Data: weekly rates of return for stocks in twelve countries.

• Run agglomerative clustering year by year.

• Interpret the structure and examine its stability over different time periods.


Examples

Good: Cautious. “This study is only descriptive...A logical subsequent research area is to explain observed structural properties and the causes of structural change.”
