An Overview of Clustering Methods

Michael D. Kane, Ph.D.

Topics

• What is clustering?

• Clustering mechanics (how the computer does it).

• Parameter choices and their effect.

• Examples.

What is clustering?

Grouping by similarity.

Similar genes.

Group genes that have similar expression profiles when observed over multiple samples.

[Figure: gene clustering, with samples along one axis]

Similar samples.

Group samples that are similar when observed over multiple genes.

[Figure: sample clustering, with samples along one axis]

Why cluster?

• Similar gene expression suggests common biology.

Function of uncharacterized genes may be deduced from co-expression with known genes.

• Associate expression patterns with:

Response to environmental change.

Disease pathology/progression.

Clustering Mechanics

[Figure: genes a to f plotted on axes E1 and E2]

For gene clustering, we must measure similarity between genes.

Distance (similarity) measure

Euclidean distance

For the points (4.6, 0.5) and (1.0, 1.7):

$d_{be} = \sqrt{(4.6 - 1.0)^2 + (0.5 - 1.7)^2}$
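The Euclidean distance between the two example points can be checked with a few lines of Python. This is a minimal sketch; the function name `euclidean` is an illustrative choice, not something from the slides.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# The two points from the slide, read as (E1, E2) expression values.
d = euclidean((4.6, 0.5), (1.0, 1.7))
print(round(d, 2))  # 3.79
```

The same function works unchanged for profiles measured over more than two samples.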

Distance Measure

Pearson Correlation

S ranges from -1 to +1.

Used in “Eisen” clustering
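A small sketch of the standard Pearson correlation shows why it is useful for expression profiles: it compares the shape of two profiles and ignores their magnitude. The three gene profiles below are made-up values for illustration only, and the exact variant used in any particular software package may differ.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (range -1 to +1).

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical profiles over four samples.
gene_1 = [1.0, 2.0, 3.0, 4.0]
gene_2 = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_1, ten times larger
gene_3 = [4.0, 3.0, 2.0, 1.0]      # mirror image of gene_1

same_shape = pearson(gene_1, gene_2)  # close to +1
opposite = pearson(gene_1, gene_3)    # close to -1
```

Under Euclidean distance, gene_1 and gene_2 would be far apart; under Pearson correlation they are as similar as two profiles can be.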

Hierarchical Clustering

[Figure: dendrogram of genes a to f]

Measuring distance between clusters

Single linkage

The minimum distance between clusters.

May form loose, “chained” clusters.

Complete linkage

The maximum distance between clusters.

Tends to form compact clusters.

Methods for joining clusters

UPGMA: unweighted pair group method with arithmetic mean (average linkage)

The average distance between clusters.

Weighted pair group method

Same as UPGMA but the distance is weighted by cluster size.

Use when clusters are expected to be significantly uneven in size!
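The linkage rules above can be sketched as a small agglomerative procedure: start with every point in its own cluster and repeatedly merge the two closest clusters. This is an illustrative sketch, not the code behind the slides; the function names, the `linkage` strings, and the six sample points are assumptions made for the example.

```python
import math
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_distance(c1, c2, linkage):
    """Distance between two clusters under the chosen linkage rule."""
    pairwise = [euclidean(p, q) for p in c1 for q in c2]
    if linkage == "single":      # minimum distance between clusters
        return min(pairwise)
    if linkage == "complete":    # maximum distance between clusters
        return max(pairwise)
    if linkage == "average":     # average distance between clusters
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, n_clusters, linkage="average"):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

# Hypothetical points along a line with slowly growing gaps.
points = [(0.0, 0.0), (1.0, 0.0), (2.1, 0.0), (3.3, 0.0), (4.6, 0.0), (6.0, 0.0)]

single = agglomerate(points, 2, linkage="single")
complete = agglomerate(points, 2, linkage="complete")
print(sorted(len(c) for c in single))    # [1, 5]
print(sorted(len(c) for c in complete))  # [2, 4]
```

On this data single linkage chains five points into one cluster and leaves the last point alone, while complete linkage produces a more balanced split.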

Effect of distance measure

[Figure panels: Euclidean distance with single linkage, complete linkage, UPGMA, and Ward’s method]

Alternatives to hierarchical clustering

• Number of clusters specified by user.

• Good when prior knowledge available.

k-means

k-means clustering

1. Number of clusters specified by user.

2. Genes randomly assigned to clusters.

3. Assess inter- and intra-cluster similarity.

4. Move genes to alternative cluster if distance is reduced.

5. Repeat steps 3 and 4 until no gene changes cluster.
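The k-means steps above can be sketched in plain Python. This is one common formulation, assuming Euclidean distance and cluster centroids as the measure of similarity; the function names, the `max_iter` and `seed` parameters, the handling of empty clusters, and the six sample profiles are all illustrative assumptions.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_profile(members):
    """Centroid: the per-sample average of the member profiles."""
    return [sum(column) / len(members) for column in zip(*members)]

def nearest(gene, centroids):
    """Index of the centroid closest to this gene."""
    return min(range(len(centroids)), key=lambda c: euclidean(gene, centroids[c]))

def kmeans(genes, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: genes randomly assigned to clusters.
    labels = [rng.randrange(k) for _ in genes]
    for _ in range(max_iter):
        # Step 3: summarize each cluster by its centroid.
        centroids = []
        for c in range(k):
            members = [g for g, label in zip(genes, labels) if label == c]
            if not members:
                # Assumption: re-seed an empty cluster with a random gene.
                members = [rng.choice(genes)]
            centroids.append(mean_profile(members))
        # Step 4: move each gene to the cluster whose centroid is closest.
        new_labels = [nearest(g, centroids) for g in genes]
        if new_labels == labels:
            break  # no gene changed cluster
        labels = new_labels
    return labels, centroids

# Hypothetical expression profiles over two samples.
genes = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (8.0, 8.3), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(genes, k=2)
```

Because the starting assignment is random, different seeds can give different final clusters, which is one reason results should be checked against biological knowledge.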

Self-organizing maps (SOM)

[Figure: genes a to f plotted on axes E1 and E2, with representations for cluster 1, cluster 2, and cluster 3]

1. User specifies the number of clusters.

2. Each cluster is initially given a random expression representation.

3. For a gene, find the most similar cluster representation.

4. Increase the similarity by adjusting the cluster representation (“training”).

5. Iteratively train the cluster representations.

6. After training, assign each gene to the most similar cluster.
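The training procedure described above can be sketched as a very small one-dimensional SOM. The slides do not give the update rule, so several details here are assumptions: Euclidean distance as the similarity measure, a linearly decaying learning rate, nodes arranged in a line, and immediate neighbors of the best-matching node updated at half the rate. All function and parameter names are illustrative.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def train_som(genes, n_nodes, epochs=50, rate0=0.5, seed=0):
    rng = random.Random(seed)
    dims = len(genes[0])
    lo = [min(g[d] for g in genes) for d in range(dims)]
    hi = [max(g[d] for g in genes) for d in range(dims)]
    # Each cluster starts with a random representation inside the data range.
    nodes = [[rng.uniform(lo[d], hi[d]) for d in range(dims)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        rate = rate0 * (1 - epoch / epochs)  # assumed linear decay
        for gene in rng.sample(genes, len(genes)):  # visit genes in random order
            # Find the most similar cluster representation.
            best = min(range(n_nodes), key=lambda i: euclidean(gene, nodes[i]))
            # Pull it, and its immediate neighbors, toward the gene.
            for i in range(n_nodes):
                gap = abs(i - best)
                if gap > 1:
                    continue
                step = rate if gap == 0 else rate / 2
                nodes[i] = [w + step * (x - w) for w, x in zip(nodes[i], gene)]
    return nodes

def assign(genes, nodes):
    """After training, give each gene the index of its most similar node."""
    return [min(range(len(nodes)), key=lambda i: euclidean(g, nodes[i])) for g in genes]

# Hypothetical expression profiles over two samples.
genes = [(1.0, 1.2), (0.8, 1.0), (4.0, 4.5), (4.2, 4.1), (8.0, 8.3), (7.9, 8.1)]
nodes = train_som(genes, n_nodes=3)
labels = assign(genes, nodes)
```

Updating the neighbors as well as the best-matching node is what distinguishes a SOM from plain k-means style updates: adjacent nodes end up representing similar profiles.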

Gene clustering

Eisen et al., Cluster analysis and display of genome-wide expression patterns. PNAS v95, 14863-14868, 1998.

24-hour time course after re-introduction of serum to serum-deprived human fibroblasts.

Pearson correlation, average linkage.

[Figure: clustered expression data with labeled clusters: cholesterol biosynthesis, cell cycle, immediate-early response, signaling, wound healing]

Sample clustering

Ross et al., Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics v24, 227-235, 2000.

64 cancer cell lines clustered.

8,000 genes.

Clustering performed with 2 different subsets of genes. Similar results.

Pearson correlation, average linkage.

Note the breast cancer cell lines derived from the same patient.

Summary

• Different methods often provide different clusters.

• No overall “best” clustering method.

• Clustering applied to unrelated data will still provide clusters.

• Use biological insight in method selection and interpretation.
