An Overview of Clustering Methods Michael D. Kane, Ph.D.

An Overview of Clustering Methods

Michael D. Kane, Ph.D.

Topics

• What is clustering?

• Clustering mechanics (how the computer does it).

• Parameter choices and their effect.

• Examples.

What is clustering?

Grouping by similarity.

Similar genes.

Group genes that have similar expression profiles when observed over multiple samples.

[Figure: expression matrix of genes by samples, with the genes clustered.]

Similar samples.

Group samples that are similar when observed over multiple genes.

[Figure: expression matrix of genes by samples, with the samples clustered.]
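Gene clustering and sample clustering work on the same expression matrix, read in two directions. The sketch below illustrates this with made-up values; the gene names, sample names, and numbers are hypothetical and only serve to show the transpose.

```python
# Hypothetical expression values: one row per gene, one column per sample.
sample_names = ["s1", "s2", "s3"]
expression = {
    "gene_a": [2.0, 1.5, -0.5],
    "gene_b": [1.8, 1.4, -0.4],
    "gene_c": [-1.0, -0.8, 2.2],
}

# Gene clustering compares rows: each gene is a profile across samples.
gene_profiles = {gene: tuple(values) for gene, values in expression.items()}

# Sample clustering compares columns: transpose the matrix so that
# each sample becomes a profile across genes.
columns = zip(*expression.values())
sample_profiles = dict(zip(sample_names, columns))

print(gene_profiles["gene_a"])  # (2.0, 1.5, -0.5)
print(sample_profiles["s1"])    # (2.0, 1.8, -1.0)
```

Any distance measure or clustering method can then be applied to either set of profiles.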

Why cluster?

• Similar gene expression implies common biology.

Function of uncharacterized genes may be deduced from co-expression with known genes.

• Associate expression patterns with:

Response to environmental change.

Disease pathology/progression.

Clustering Mechanics

[Figure: genes a-f listed with their expression in experiments E1 and E2, and plotted as points on E1/E2 axes running from - to +.]

For gene clustering, we must measure similarity between genes.

Distance (similarity) measure

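[Figure: genes a-f plotted on E1/E2 axes, with the distance d_be between genes b and e marked.]

Euclidean distance

The two genes compared, b and e, lie at (4.6, 0.5) and (1.0, 1.7):

$$d_{be} = \sqrt{(4.6 - 1.0)^2 + (0.5 - 1.7)^2} \approx 3.79$$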
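As a quick check of the Euclidean distance between the two points given on this slide, here is a minimal Python sketch. The function name `euclidean_distance` is an illustrative choice, not something from the slides.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two equal-length expression vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# The two points from the slide (expression in experiments E1 and E2).
d_be = euclidean_distance((4.6, 0.5), (1.0, 1.7))
print(round(d_be, 2))  # 3.79
```

The same function works unchanged for profiles measured over more than two experiments.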

Distance Measure

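Pearson Correlation

$$S(a, b) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{a_i - \bar{a}}{\sigma_a} \right) \left( \frac{b_i - \bar{b}}{\sigma_b} \right)$$

Here \bar{a} and \sigma_a are the mean and standard deviation of gene a over the N samples, and likewise for gene b.

S ranges from -1 to +1.

Used in “Eisen” clustering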
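The Pearson correlation between two expression profiles can be sketched as follows. This assumes the usual definition (mean-centered profiles scaled by their population standard deviation); the function name and the example profiles are hypothetical.

```python
import math

def pearson_similarity(a, b):
    """Pearson correlation of two profiles; +1 = same shape, -1 = opposite."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("profiles must be non-empty and of equal length")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Population standard deviations (divide by N, matching the 1/N sum).
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if sd_a == 0 or sd_b == 0:
        raise ValueError("correlation is undefined for a flat profile")
    return sum(((x - mean_a) / sd_a) * ((y - mean_b) / sd_b)
               for x, y in zip(a, b)) / n

# Made-up profiles: same shape at a different scale, and the reverse shape.
rising = [1.0, 2.0, 3.0, 4.0]
rising_scaled = [10.0, 20.0, 30.0, 40.0]
falling = [4.0, 3.0, 2.0, 1.0]

print(round(pearson_similarity(rising, rising_scaled), 6))  # 1.0
print(round(pearson_similarity(rising, falling), 6))        # -1.0
```

Unlike Euclidean distance, the correlation ignores the magnitude of expression and compares only the shape of the profiles, which is why `rising` and `rising_scaled` score as identical.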

Hierarchical Clustering

[Figure: genes a-f plotted on E1/E2 axes, with the resulting dendrogram (leaves in the order a, b, c, d, e, f).]

Measuring distance between clusters

Single linkage

The minimum distance between a member of one cluster and a member of the other.

Produces “chained” clusters.

May form loose clusters.

Complete linkage

The maximum distance between a member of one cluster and a member of the other.

Tends to form compact clusters.
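The two linkage rules differ only in which pairwise distance they keep. A minimal sketch, assuming Euclidean distance between points; the clusters `left` and `right` are made-up data, and `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """All distances between one member of each cluster."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of members."""
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Distance between the farthest pair of members."""
    return max(pairwise_distances(cluster_a, cluster_b))

# Hypothetical clusters on a line, so the distances are easy to read off.
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (6.0, 0.0)]

print(single_linkage(left, right))    # 2.0
print(complete_linkage(left, right))  # 6.0
```

Because single linkage needs only one close pair to join two clusters, a string of neighboring points can link otherwise distant groups; complete linkage requires every pair to be close.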

Methods for joining clusters

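UPGMA: unweighted pair group method with arithmetic mean (average linkage)

The average distance between all pairs of members, one from each cluster.

WPGMA: weighted pair group method

Same as UPGMA, except that when two clusters merge, each contributes equally to the new cluster's distances regardless of its size.

Use when clusters are expected to be significantly uneven in size.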
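Average linkage can drive a simple agglomerative loop: repeatedly merge the two closest clusters until one remains. The following is a small, unoptimized sketch under that assumption, not the procedure of any particular software package; the point coordinates are made up.

```python
import math
from itertools import combinations

def average_linkage(cluster_a, cluster_b):
    """Mean distance over all pairs with one member from each cluster."""
    total = sum(math.dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerate(points, linkage=average_linkage):
    """Merge the two closest clusters until one is left; return the merges."""
    clusters = {name: [xy] for name, xy in points.items()}
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        dist, n1, n2 = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x, y in combinations(clusters, 2)
        )
        # Replace the pair with their union, named by concatenation.
        clusters[n1 + n2] = clusters.pop(n1) + clusters.pop(n2)
        history.append((n1, n2, round(dist, 3)))
    return history

# Hypothetical genes placed on a line.
genes = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 0.0), "d": (6.5, 0.0)}
for merge in agglomerate(genes):
    print(merge)
# ('a', 'b', 1.0)
# ('c', 'd', 1.5)
# ('ab', 'cd', 5.25)
```

The merge history is exactly what a dendrogram draws: each merge becomes a branch point at the height of its linkage distance.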

Effect of distance measure

Euclidean, single linkage

Euclidean, complete linkage

Effect of distance measure

Euclidean, UPGMA

Euclidean, Ward’s method

Alternatives to hierarchical clustering

• Number of clusters specified by user.

• Good when prior knowledge available.

k-means

k-means clustering

[Figure: genes a-f plotted on E1/E2 axes.]

1. Number of clusters specified by user.

2. Genes randomly assigned to clusters.

3. Assess inter- and intra-cluster similarity.

4. Move genes to alternative cluster if distance is reduced.

5. Repeat steps 3 and 4 until no move reduces the distance.
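The steps above can be sketched in Python as follows. This is the common batch form of k-means (every gene is reassigned to its nearest cluster centroid in each pass), which is an assumption about how steps 3 and 4 are carried out; reseeding an empty cluster from a random gene is likewise an illustrative choice, and the data are made up.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: assign every gene to a random cluster.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iter):
        # Step 3: summarize each cluster by its centroid.
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster is reseeded from a randomly chosen gene.
            centers.append(centroid(members) if members else rng.choice(points))
        # Step 4: move each gene to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # no gene moved, so the result is stable
            break
        labels = new_labels
    return labels

# Hypothetical genes forming two well-separated groups.
genes = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels = k_means(genes, k=2)
print(labels)
```

Cluster numbers themselves are arbitrary and depend on the random start; only the grouping of genes is meaningful.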

Alternatives to hierarchical clustering

• Number of clusters specified by user.

• Good when prior knowledge available.

SOM: self-organizing maps

SOM

[Figure: expression profiles (+, 0, -) of genes a-f across E1 and E2, shown alongside the expression representations of clusters 1, 2, and 3 at each stage of training.]

1. User specifies the number of clusters (three in the figure).

2. Each cluster is initially given a random expression representation.

3. For a gene, find the most similar cluster representation.

4. Increase the similarity by adjusting the cluster representation (“training”).

5. Iteratively train the cluster representations.

6. After training, assign each gene to the most similar cluster.
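The SOM procedure on this slide can be sketched in a simplified form. The sketch assumes a one-dimensional chain of cluster nodes, a linearly decaying learning rate, and a half-strength update for the winner's immediate neighbors; these details, the function names, and the data are illustrative assumptions rather than anything specified in the slides.

```python
import math
import random

def nearest(nodes, gene):
    """Index of the cluster representation most similar to the gene."""
    return min(range(len(nodes)), key=lambda j: math.dist(nodes[j], gene))

def train_som(genes, n_clusters, epochs=50, rate=0.5, seed=0):
    rng = random.Random(seed)
    dims = len(genes[0])
    # Each cluster starts with a random expression representation.
    nodes = [[rng.uniform(-1, 1) for _ in range(dims)]
             for _ in range(n_clusters)]
    for epoch in range(epochs):
        alpha = rate * (1 - epoch / epochs)  # learning rate shrinks over time
        for gene in rng.sample(genes, len(genes)):  # random visiting order
            winner = nearest(nodes, gene)
            for j, node in enumerate(nodes):
                gap = abs(j - winner)
                if gap > 1:
                    continue  # only the winner and its neighbors are trained
                step = alpha if gap == 0 else alpha / 2
                # Pull the representation toward the gene's profile.
                for d in range(dims):
                    node[d] += step * (gene[d] - node[d])
    return nodes

def assign(genes, nodes):
    """After training, give each gene the index of its most similar cluster."""
    return [nearest(nodes, g) for g in genes]

# Hypothetical two-experiment profiles (E1, E2) for six genes.
genes = [(1.0, 1.0), (0.9, 0.8), (-1.0, 1.0), (-0.8, 0.9), (0.0, -1.0), (0.1, -0.9)]
nodes = train_som(genes, n_clusters=3)
labels = assign(genes, nodes)
print(labels)
```

Updating neighbors along with the winner is what distinguishes a SOM from k-means: adjacent clusters end up with related representations.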

Gene clustering

Eisen et al., Cluster analysis and display of genome-wide expression patterns. PNAS v95, 14863-14868, 1998.

24-hour time course after re-introduction of serum to serum-deprived human fibroblasts.

Pearson correlation, average linkage.

[Figure: clustered expression data with gene clusters labeled cholesterol biosynthesis, cell cycle, immediate-early response, signaling, and wound healing.]

Sample clustering

Ross et al., Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics v24, 227-235, 2000.

64 cancer cell lines clustered.

8,000 genes.

Clustering performed with 2 different subsets of genes, with similar results.

Pearson correlation, average linkage.

Note the breast cancer cell lines derived from the same patient.

Summary

• Different methods often provide different clusters.

• No overall “best” clustering method.

• Clustering applied to unrelated data will still provide clusters.

• Use biological insight in method selection and interpretation.
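The point that unrelated data will still yield clusters can be illustrated with a toy example. The sketch below splits uniformly random numbers into two groups with a one-dimensional two-means loop; the function name and the choice of 200 random values are assumptions made for illustration.

```python
import random

def two_means_1d(values, iterations=20):
    """Split numbers into two groups around two iteratively updated centers."""
    lo, hi = min(values), max(values)  # start the centers at the extremes
    for _ in range(iterations):
        left = [v for v in values if abs(v - lo) <= abs(v - hi)]
        right = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = sum(left) / len(left), sum(right) / len(right)
    return left, right

# Uniform random noise: there is no real group structure here.
rng = random.Random(1)
noise = [rng.random() for _ in range(200)]

left, right = two_means_1d(noise)
print(len(left), len(right))
```

The algorithm returns two tidy, non-overlapping groups even though the input has no structure, which is why cluster output needs biological interpretation before it is trusted.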
