The Broad Institute of MIT and Harvard Clustering.
Clustering
Clustering Preliminaries
• Log2 transformation
• Row centering and normalization
• Filtering
Log2 Transformation
Advantages of log2 transformation:
• The noise becomes roughly independent of the mean, and equal differences have the same meaning across the dynamic range of the values.
– We would like dist(100, 200) = dist(1000, 2000).
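A quick numpy check of the dist(100, 200) = dist(1000, 2000) property (illustrative values only):

```python
import numpy as np

# On the raw scale a two-fold change looks very different at the low and
# high ends of the dynamic range; after log2 both pairs are one doubling apart.
raw_a = np.array([100.0, 1000.0])
raw_b = np.array([200.0, 2000.0])

raw_dist = np.abs(raw_a - raw_b)                    # scale-dependent: 100 vs 1000
log_dist = np.abs(np.log2(raw_a) - np.log2(raw_b))  # both exactly 1 (one doubling)
```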
Row Centering & Normalization
x → y = x − mean(x) → z = y / stdev(y)
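The x → y → z transform above, sketched on a hypothetical 3×4 gene-by-sample matrix:

```python
import numpy as np

# Rows are genes, columns are samples (hypothetical values).
x = np.array([[2.0, 4.0, 6.0, 8.0],
              [1.0, 1.0, 3.0, 3.0],
              [5.0, 5.0, 5.0, 7.0]])

y = x - x.mean(axis=1, keepdims=True)  # row centering: each row has mean 0
z = y / y.std(axis=1, keepdims=True)   # row normalization: each row has stdev 1
```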
Filtering genes
• Filtering is very important for unsupervised analysis, since many noisy genes can completely mask the structure in the data.
• After finding a hypothesis one can identify marker genes in a larger dataset via supervised analysis.
(Diagram: all genes → filtering → clustering; then supervised analysis / marker selection on a larger dataset.)
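A minimal sketch of variation filtering (the 4×4 matrix and the cutoff are hypothetical; GenePattern provides its own preprocessing and filtering modules):

```python
import numpy as np

# Keep only genes whose variance across samples exceeds a threshold,
# so that flat, noisy genes do not mask the structure in the data.
data = np.array([[5.0, 5.1, 4.9, 5.0],   # flat gene, ~no variation
                 [2.0, 8.0, 2.0, 8.0],   # gene that separates two groups
                 [3.0, 3.1, 2.9, 3.0],   # another flat gene
                 [1.0, 9.0, 1.0, 9.0]])  # another informative gene

threshold = 1.0                          # hypothetical cutoff
keep = data.var(axis=1) > threshold
filtered = data[keep]                    # only the two high-variance genes remain
```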
Clustering/Class Discovery
• Aim: Partition data (e.g. genes or samples) into sub-groups (clusters), such that points of the same cluster are “more similar”.
• Challenge: Not well defined. No single objective function / evaluation criterion
• Example: How many clusters? 2 + noise, 3 + noise, 20, or (hierarchically) 2–3 + noise
• One has to choose:
– a similarity/distance measure
– a clustering method
– how to evaluate the clusters
Clustering in GenePattern
• Representative-based: find representatives/centroids
– K-means: KMeansClustering
– Self-Organizing Maps (SOM): SOMClustering
• Bottom-up (agglomerative): HierarchicalClustering hierarchically unites clusters
– single linkage
– complete linkage
– average linkage
• Clustering-like:
– NMFConsensus
– PCA (Principal Components Analysis)
There is no single BEST method: for easy problems, most of them work. Each algorithm has its own assumptions, strengths, and weaknesses.
K-means Clustering
• Aim: Partition the data points into K subsets and associate each subset with a centroid such that the sum of squared distances between the data points and their associated centroids is minimal.
K-means: Algorithm
• Initialize centroids at random positions
• Iterate:
– Assign each data point to its closest centroid
– Move each centroid to the center of its assigned points
• Stop when converged
• Guaranteed to reach a local minimum
(Figure: iterations 0, 1, and 2 of the algorithm with K = 3.)
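The loop above can be sketched in a few lines of numpy (a minimal illustration on hypothetical 2-D blobs, not the KMeansClustering module itself):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means: random init, assign, move, stop on convergence."""
    rng = np.random.default_rng(seed)
    # initialize centroids at random positions (here: random data points)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each data point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the center of its assigned points
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # converged: a local minimum
            break
        centroids = new
    return labels, centroids

# three well-separated hypothetical blobs of 20 points each
rng = np.random.default_rng(1)
pts = np.vstack([c + rng.normal(0, 0.1, size=(20, 2))
                 for c in ([0, 0], [5, 5], [0, 5])])
labels, cents = kmeans(pts, k=3)
```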
K-means: Summary
• Result depends on the initial centroid positions
• Fast algorithm: only needs to compute distances from data points to centroids
• Must preset the number of clusters
• Fails for non-spherical distributions
Hierarchical Clustering
(Figure: points 1–5 are joined pairwise into a dendrogram; the vertical axis shows the distance between joined clusters.)
The dendrogram induces a linear ordering of the data points (up to a left/right flip at each split).
Hierarchical Clustering
Need to define the distance between the new cluster and the other clusters:
• Single linkage: distance between the closest pair
• Complete linkage: distance between the farthest pair
• Average linkage: average distance between all pairs, or distance between cluster centers
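The three update rules can be checked on two small hypothetical clusters (collinear points chosen so the distances are easy to verify by hand):

```python
import numpy as np

# Two clusters of 2-D points on a line: A = {(0,0), (1,0)}, B = {(4,0), (6,0)}
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# all pairwise distances between a point of A and a point of B
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single   = pairwise.min()   # closest pair:  |1 - 4| = 3
complete = pairwise.max()   # farthest pair: |0 - 6| = 6
average  = pairwise.mean()  # mean of all pairs: (4 + 6 + 3 + 5) / 4 = 4.5
centers  = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # |0.5 - 5| = 4.5
```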
Average Linkage
Leukemia samples and genes
Single and Complete Linkage
Single-linkage Complete-linkage
Leukemia samples and genes
Similarity/Distance Measures
Decide which samples/genes should be clustered together:
– Euclidean: the "ordinary" straight-line distance between two points that one would measure with a ruler, given by the Pythagorean formula
– Pearson correlation: a parametric measure of the strength of linear dependence between two variables
– Absolute Pearson correlation: the absolute value of the Pearson correlation
– Spearman rank correlation: a non-parametric measure of dependence between two variables
– Uncentered correlation: same as Pearson, but assumes the mean is 0
– Absolute uncentered correlation: the absolute value of the uncentered correlation
– Kendall's tau: a non-parametric measure of the degree of correspondence between two rankings
– City-block/Manhattan: the distance traveled between two points if a grid-like path is followed
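A small numpy comparison of three of these measures (hypothetical profiles; note that y is a scaled and shifted copy of x, so correlation-based measures see a perfect linear relationship even though the point-to-point distances are large):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 10.0 * x + 5.0                   # [15., 25., 35., 45.]

euclidean = np.linalg.norm(x - y)    # large: profiles are far apart as points
manhattan = np.abs(x - y).sum()      # City-block: 14 + 23 + 32 + 41 = 110
pearson   = np.corrcoef(x, y)[0, 1]  # ~1: perfect linear dependence
```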
Reasonable Distance Measure
(Figure: expression profiles of Genes 1–4 across samples. Genes: close → correlated. Samples: a similar profile gives Gene 1 and Gene 2 a similar contribution to the distance between Sample 1 and Sample 5.)
Use Euclidean distance on samples and genes with row-centered and normalized data.
Pitfalls in Clustering
• Elongated clusters
• Filament
• Clusters of different sizes
Compact Separated Clusters
• All methods work
Adapted from E. Domany
Elongated Clusters
Single linkage succeeds in partitioning; average linkage fails.
Filament
• Single linkage not robust
Adapted from E. Domany
Filament with Point Removed
• Single linkage not robust
Adapted from E. Domany
Two-way Clustering
• Two independent cluster analyses, one on genes and one on samples, are used to reorder the data (two-way clustering).
Hierarchical Clustering: Summary
• Results depend on the distance update method
– Single linkage: elongated clusters
– Complete linkage: sphere-like clusters
• Greedy iterative process
• NOT robust against noise
• No inherent measure to choose the clusters – we return to this point in cluster validation
Clustering Protocol
Validating Number of Clusters
How do we know how many real clusters exist in the dataset?
Consensus Clustering
• Generate n "perturbed" datasets D1, D2, …, Dn from the original dataset
• Apply the clustering algorithm to each Di, yielding Clustering1, Clustering2, …, Clusteringn
• Compute the consensus matrix over samples s1, …, sn: the proportion of times each pair of samples is clustered together
– 1: the two samples always cluster together
– 0: the two samples never cluster together
• Compute a dendrogram based on the consensus matrix
Consensus Clustering
• Reordering the consensus matrix according to the dendrogram reveals the clusters as blocks (C1, C2, C3) along the diagonal
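A numpy-only sketch of building the consensus matrix by subsampling (the 1-D toy data and the tiny 2-means routine are illustrative assumptions; GenePattern's ConsensusClustering module implements this properly for real data):

```python
import numpy as np

def two_means_labels(values, n_iter=20):
    """Trivial 1-D 2-means, standing in for the clustering algorithm."""
    c = np.array([values.min(), values.max()])  # initial centroids
    for _ in range(n_iter):
        labels = np.abs(values[:, None] - c[None, :]).argmin(axis=1)
        for j in (0, 1):
            if np.any(labels == j):
                c[j] = values[labels == j].mean()
    return labels

rng = np.random.default_rng(0)
# two well-separated hypothetical groups of 10 samples each
data = np.concatenate([rng.normal(0, 0.5, 10), rng.normal(10, 0.5, 10)])
n = len(data)

together = np.zeros((n, n))  # times i and j were clustered together
counted  = np.zeros((n, n))  # times i and j appeared in the same subsample

for _ in range(50):
    idx = rng.choice(n, size=int(0.8 * n), replace=False)  # perturbed dataset
    labels = two_means_labels(data[idx])
    same = labels[:, None] == labels[None, :]
    together[np.ix_(idx, idx)] += same
    counted[np.ix_(idx, idx)] += 1

# proportion of co-sampled runs in which each pair clustered together
consensus = together / np.maximum(counted, 1)
```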
Validation: Consistency / Robustness Analysis
• Aim: Measure agreement between clustering results on "perturbed" versions of the data.
• Method:
– Iterate N times:
• Generate a "perturbed" version of the original dataset by subsampling, resampling with repeats, or adding noise
• Cluster the perturbed dataset
– Calculate the fraction of iterations in which different samples belong to the same cluster
– Optimize the number of clusters K by choosing the value of K that yields the most consistent results
Consensus Clustering in GenePattern
Clustering Cookbook
• Reduce the number of genes by variation filtering
– Use stricter parameters than for comparative marker selection
• Choose a method for cluster discovery (e.g. hierarchical clustering)
• Select a number of clusters
– Check the sensitivity of the clusters to the filtering and clustering parameters
– Validate on independent data sets
– Internally test the robustness of the clusters with consensus clustering