The Broad Institute of MIT and Harvard Clustering.
Clustering
Clustering Preliminaries
• Log2 transformation
• Row centering and normalization
• Filtering
Log2 Transformation
Advantages of log2 transformation:
• The noise becomes roughly independent of the mean, and equal differences have the same meaning across the dynamic range of the values.
– We would like dist(100, 200) = dist(1000, 2000).
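A quick numpy check of the dist(100, 200) = dist(1000, 2000) property (illustrative values only):

```python
import numpy as np

# On the raw scale a two-fold change looks very different at the low and
# high ends of the dynamic range; after log2 both pairs are one doubling apart.
raw_a = np.array([100.0, 1000.0])
raw_b = np.array([200.0, 2000.0])

raw_dist = np.abs(raw_a - raw_b)                    # scale-dependent: 100 vs 1000
log_dist = np.abs(np.log2(raw_a) - np.log2(raw_b))  # both exactly 1 (one doubling)
```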
Row Centering & Normalization
x → y = x − mean(x) → z = y / stdev(y)
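The x → y → z transform above, sketched on a hypothetical 3×4 gene-by-sample matrix:

```python
import numpy as np

# Rows are genes, columns are samples (hypothetical values).
x = np.array([[2.0, 4.0, 6.0, 8.0],
              [1.0, 1.0, 3.0, 3.0],
              [5.0, 5.0, 5.0, 7.0]])

y = x - x.mean(axis=1, keepdims=True)  # row centering: each row has mean 0
z = y / y.std(axis=1, keepdims=True)   # row normalization: each row has stdev 1
```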
Filtering genes
• Filtering is very important for unsupervised analysis, since many noisy genes can completely mask the structure in the data.
• After finding a hypothesis one can identify marker genes in a larger dataset via supervised analysis.
(Diagram: all genes → filtering → clustering; then supervised analysis / marker selection on a larger dataset.)
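A minimal sketch of variation filtering (the 4×4 matrix and the cutoff are hypothetical; GenePattern provides its own preprocessing and filtering modules):

```python
import numpy as np

# Keep only genes whose variance across samples exceeds a threshold,
# so that flat, noisy genes do not mask the structure in the data.
data = np.array([[5.0, 5.1, 4.9, 5.0],   # flat gene, ~no variation
                 [2.0, 8.0, 2.0, 8.0],   # gene that separates two groups
                 [3.0, 3.1, 2.9, 3.0],   # another flat gene
                 [1.0, 9.0, 1.0, 9.0]])  # another informative gene

threshold = 1.0                          # hypothetical cutoff
keep = data.var(axis=1) > threshold
filtered = data[keep]                    # only the two high-variance genes remain
```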
Clustering/Class Discovery
• Aim: Partition data (e.g. genes or samples) into sub-groups (clusters), such that points of the same cluster are “more similar”.
• Challenge: Not well defined. No single objective function / evaluation criterion
• Example: How many clusters? 2 + noise, 3 + noise, 20, or (hierarchically) 2–3 + noise
• One has to choose:
– a similarity/distance measure
– a clustering method
– how to evaluate the clusters
Clustering in GenePattern
• Representative-based: find representatives/centroids
– K-means: KMeansClustering
– Self-Organizing Maps (SOM): SOMClustering
• Bottom-up (agglomerative): HierarchicalClustering hierarchically unites clusters
– single linkage
– complete linkage
– average linkage
• Clustering-like:
– NMFConsensus
– PCA (Principal Components Analysis)
There is no single BEST method: for easy problems, most of them work. Each algorithm has its own assumptions, strengths, and weaknesses.
K-means Clustering
• Aim: Partition the data points into K subsets and associate each subset with a centroid such that the sum of squared distances between the data points and their associated centroids is minimal.
K-means: Algorithm
• Initialize centroids at random positions
• Iterate:
– Assign each data point to its closest centroid
– Move each centroid to the center of its assigned points
• Stop when converged
• Guaranteed to reach a local minimum
(Figure: iterations 0, 1, and 2 of the algorithm with K = 3.)
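The loop above can be sketched in a few lines of numpy (a minimal illustration on hypothetical 2-D blobs, not the KMeansClustering module itself):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means: random init, assign, move, stop on convergence."""
    rng = np.random.default_rng(seed)
    # initialize centroids at random positions (here: random data points)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each data point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the center of its assigned points
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # converged: a local minimum
            break
        centroids = new
    return labels, centroids

# three well-separated hypothetical blobs of 20 points each
rng = np.random.default_rng(1)
pts = np.vstack([c + rng.normal(0, 0.1, size=(20, 2))
                 for c in ([0, 0], [5, 5], [0, 5])])
labels, cents = kmeans(pts, k=3)
```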
K-means: Summary
• Result depends on the initial centroid positions
• Fast algorithm: only needs to compute distances from data points to centroids
• Must preset the number of clusters
• Fails for non-spherical distributions
Hierarchical Clustering
(Figure: points 1–5 are joined pairwise into a dendrogram; the vertical axis shows the distance between joined clusters.)
The dendrogram induces a linear ordering of the data points (up to a left/right flip at each split).
Hierarchical Clustering
Need to define the distance between the new cluster and the other clusters:
• Single linkage: distance between the closest pair
• Complete linkage: distance between the farthest pair
• Average linkage: average distance between all pairs, or distance between cluster centers
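The three update rules can be checked on two small hypothetical clusters (collinear points chosen so the distances are easy to verify by hand):

```python
import numpy as np

# Two clusters of 2-D points on a line: A = {(0,0), (1,0)}, B = {(4,0), (6,0)}
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# all pairwise distances between a point of A and a point of B
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single   = pairwise.min()   # closest pair:  |1 - 4| = 3
complete = pairwise.max()   # farthest pair: |0 - 6| = 6
average  = pairwise.mean()  # mean of all pairs: (4 + 6 + 3 + 5) / 4 = 4.5
centers  = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # |0.5 - 5| = 4.5
```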
Average Linkage
Leukemia samples and genes
Single and Complete Linkage
Single-linkage Complete-linkage
Leukemia samples and genes
Similarity/Distance Measures
Decide which samples/genes should be clustered together:
– Euclidean: the "ordinary" straight-line distance between two points that one would measure with a ruler, given by the Pythagorean formula
– Pearson correlation: a parametric measure of the strength of linear dependence between two variables
– Absolute Pearson correlation: the absolute value of the Pearson correlation
– Spearman rank correlation: a non-parametric measure of dependence between two variables
– Uncentered correlation: same as Pearson, but assumes the mean is 0
– Absolute uncentered correlation: the absolute value of the uncentered correlation
– Kendall's tau: a non-parametric measure of the degree of correspondence between two rankings
– City-block/Manhattan: the distance traveled between two points if a grid-like path is followed
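A small numpy comparison of three of these measures (hypothetical profiles; note that y is a scaled and shifted copy of x, so correlation-based measures see a perfect linear relationship even though the point-to-point distances are large):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 10.0 * x + 5.0                   # [15., 25., 35., 45.]

euclidean = np.linalg.norm(x - y)    # large: profiles are far apart as points
manhattan = np.abs(x - y).sum()      # City-block: 14 + 23 + 32 + 41 = 110
pearson   = np.corrcoef(x, y)[0, 1]  # ~1: perfect linear dependence
```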
Reasonable Distance Measure
(Figure: expression profiles of Genes 1–4 across samples. Genes: close → correlated. Samples: a similar profile gives Gene 1 and Gene 2 a similar contribution to the distance between Sample 1 and Sample 5.)
Use Euclidean distance on samples and genes with row-centered and normalized data.
Pitfalls in Clustering
• Elongated clusters
• Filament
• Clusters of different sizes
Compact Separated Clusters
• All methods work
Adapted from E. Domany
Elongated Clusters
Single linkage succeeds in partitioning; average linkage fails.
Filament
• Single linkage not robust
Adapted from E. Domany
Filament with Point Removed
• Single linkage not robust
Adapted from E. Domany
Two-way Clustering
• Two independent cluster analyses, one on genes and one on samples, are used to reorder the data (two-way clustering).
Hierarchical Clustering: Summary
• Results depend on the distance update method
– Single linkage: elongated clusters
– Complete linkage: sphere-like clusters
• Greedy iterative process
• NOT robust against noise
• No inherent measure to choose the clusters – we return to this point in cluster validation
Clustering Protocol
Validating Number of Clusters
How do we know how many real clusters exist in the dataset?
Consensus Clustering
• Generate n "perturbed" datasets D1, D2, …, Dn from the original dataset
• Apply the clustering algorithm to each Di, yielding Clustering1, Clustering2, …, Clusteringn
• Compute the consensus matrix over samples s1, …, sn: the proportion of times each pair of samples is clustered together
– 1: the two samples always cluster together
– 0: the two samples never cluster together
• Compute a dendrogram based on the consensus matrix
Consensus Clustering
• Reordering the consensus matrix according to the dendrogram reveals the clusters as blocks (C1, C2, C3) along the diagonal
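A numpy-only sketch of building the consensus matrix by subsampling (the 1-D toy data and the tiny 2-means routine are illustrative assumptions; GenePattern's ConsensusClustering module implements this properly for real data):

```python
import numpy as np

def two_means_labels(values, n_iter=20):
    """Trivial 1-D 2-means, standing in for the clustering algorithm."""
    c = np.array([values.min(), values.max()])  # initial centroids
    for _ in range(n_iter):
        labels = np.abs(values[:, None] - c[None, :]).argmin(axis=1)
        for j in (0, 1):
            if np.any(labels == j):
                c[j] = values[labels == j].mean()
    return labels

rng = np.random.default_rng(0)
# two well-separated hypothetical groups of 10 samples each
data = np.concatenate([rng.normal(0, 0.5, 10), rng.normal(10, 0.5, 10)])
n = len(data)

together = np.zeros((n, n))  # times i and j were clustered together
counted  = np.zeros((n, n))  # times i and j appeared in the same subsample

for _ in range(50):
    idx = rng.choice(n, size=int(0.8 * n), replace=False)  # perturbed dataset
    labels = two_means_labels(data[idx])
    same = labels[:, None] == labels[None, :]
    together[np.ix_(idx, idx)] += same
    counted[np.ix_(idx, idx)] += 1

# proportion of co-sampled runs in which each pair clustered together
consensus = together / np.maximum(counted, 1)
```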
Validation: Consistency / Robustness Analysis
• Aim: Measure agreement between clustering results on "perturbed" versions of the data.
• Method:
– Iterate N times:
• Generate a "perturbed" version of the original dataset by subsampling, resampling with repeats, or adding noise
• Cluster the perturbed dataset
– Calculate the fraction of iterations in which different samples belong to the same cluster
– Optimize the number of clusters K by choosing the value of K that yields the most consistent results
Consensus Clustering in GenePattern
Clustering Cookbook
• Reduce the number of genes by variation filtering
– Use stricter parameters than for comparative marker selection
• Choose a method for cluster discovery (e.g. hierarchical clustering)
• Select a number of clusters
– Check the sensitivity of the clusters to the filtering and clustering parameters
– Validate on independent data sets
– Internally test the robustness of the clusters with consensus clustering