Clustering megavariate data
Dhammika Amaratunga
Team Leader - Statistics in Drug Discovery
Senior Research Fellow - Nonclinical Statistics
Rutgers Biostatistics Day, April 2010
Joint work with
Javier Cabrera, Yauheniya Cherkas, Vladimir Kovtun, YungSeop Lee, and others
Cluster analysis
Data are collected for N samples. For each sample, measurements are made on G variables, so the data are represented as a G x N matrix.
The objective is to cluster the N samples into a few classes in such a way that samples within a class are collectively more similar to each other than to samples in any other class.
[Figure: schematic of samples partitioned into six clusters, C1 through C6.]
Cluster analysis methods
There are many standard approaches available (e.g., partitioning methods such as K-means, hierarchical methods such as average linkage, and machine learning methods such as self-organizing maps).
For example, hierarchical clustering is one of the more popular clustering methods:
-- Define an inter-sample dissimilarity (e.g., Euclidean distance, 1-Correlation)
-- Define an inter-cluster dissimilarity (e.g., the dissimilarity between a pair of clusters is the average dissimilarity between a sample in one cluster and a sample in the other cluster)
-- Combine “close” samples/clusters sequentially
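As a concrete sketch, the two choices above (an inter-sample dissimilarity, then average linkage between clusters) can be run in a few lines with scipy; the toy data and the number of clusters are illustrative, not from the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Toy G x N expression matrix: 50 genes (rows), 6 samples (columns).
X = rng.normal(size=(50, 6))
X[:10, 3:] += 5.0                 # shift 10 genes so samples 4-6 form a group

samples = X.T                     # rows = samples, as pdist expects
d = pdist(samples, metric="euclidean")           # inter-sample dissimilarity
# 1 - correlation is another common choice: pdist(samples, "correlation")
Z = linkage(d, method="average")                 # inter-cluster: average linkage
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 classes
print(labels)
```

Cutting the tree at 2 clusters recovers the two groups of samples built into the toy matrix.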
Hierarchical clustering: how it works
[Figure: dendrogram over SAMPLE 1 through SAMPLE 7, with samples merged sequentially into clusters.]
The catch
In many contemporary settings, the data are megavariate, i.e., N<<G (e.g., in high throughput gene expression studies G is around 1,000-50,000 while N is around 10-500); in such cases, most predictors are noninformative and could overwhelm the dissimilarity estimates.
Example: Use gene expression data to discover unexpected novel classes among the samples (e.g., in leukemia patients, subtypes of leukemia).
Case study
Experiment: Compare the gene expression profiles of 6 KO mice vs 6 WT mice using a microarray with 45101 genes.
WT: C1 C2 C3 C4 C5 C6
KO: T1 T2 T3 T4 T5 T6
Note 1: Data are available for both early-stage and late-stage development of these mice.
Note 2: These data are useful for illustration but are not representative of a cluster analysis situation, as here the classes are known.
Gene expression data
Gene expression levels (measured via microarrays) for G genes in N samples:
C1 C2 C3 C4 C5 C6 …
G1 83 94 82 111 130 122
G2 16 14 7 2 11 33
G3 490 879 193 604 1031 962
G4 46458 49268 74059 44849 42235 44611
G5 32 70 185 20 25 19
G6 1067 891 546 906 1038 1098
G7 118 111 95 896 536 695
G8 10 30 25 24 31 28
G9 166 132 162 27 109 213
G10 136 139 44 62 23 135
. . . . . . . . . . . .
Preprocess and analyze
Biplots of data from knockout experiment
[Figure: biplots of the samples, early stage (left) and late stage (right).]
Clustering of data from knockout experiment
[Figure: hierarchical clustering dendrograms; early stage, MR = 5/12; late stage, MR = 0/12 (MR = misclassification rate).]
Filtering
Problem: With megavariate data, most predictors are noninformative and will overwhelm the dissimilarity estimates.
Usual (partial) resolution: Filter the genes based on variance or coefficient of variation to reduce the error rates (but which genes are informative?).
Resolution: Ensemble approach: Filter genes repeatedly and apply an ensemble technique.
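A minimal sketch of the usual variance filter, assuming a numpy matrix with genes as rows; the matrix and the cutoff of 100 retained genes are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 12))   # G x N: 1000 genes, 12 samples
X[:20, 6:] += 6.0                 # 20 informative genes (unknown in practice)

gene_var = X.var(axis=1)          # per-gene variance across samples
keep = np.argsort(gene_var)[-100:]   # keep the 100 most variable genes
X_filtered = X[np.sort(keep), :]
print(X_filtered.shape)           # (100, 12)

# Coefficient of variation is an alternative criterion for positive data:
# cv = X.std(axis=1) / X.mean(axis=1)
```

The filter keeps the informative genes here because they were built to be high-variance, but in real data there is no guarantee the retained genes are the informative ones, which is what motivates the ensemble approach.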
[Figure: ABC illustrated on a gene expression matrix with samples S1-S6. Repeatedly select n samples and g genes, cluster, and update a similarity matrix counting how often each pair of samples clusters together; the accumulated counts (e.g., S5 and S6 cluster together in 10 of the iterations shown) yield the final clusters {S1,S2,S3,S4} and {S5,S6}.]
12
Data
Simple random sample of cases
Random sample of genes
Cluster analysis
Iterate
ABC dissimilarities
ABC(i,j) = 1-relative frequency
of how often samples i and j
cluster together
Ref: Amaratunga, Cabrera and Kovtun (Biostatistics, 2007)
Simple or weighted
based on variance
HC (Ave, Ward’s),
Kmeans, …
Input to clustering
algorithm
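The loop above can be sketched as follows. This is an illustrative re-implementation of the ABC idea, not the authors' code: it uses unweighted gene sampling and Ward's method inside the loop, and all sizes (B, g, n, k) are arbitrary choices.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

def abc_dissimilarity(X, k=2, B=200, g=50, n=None, seed=0):
    """X is G x N (genes x samples); returns an N x N ABC dissimilarity matrix."""
    rng = np.random.default_rng(seed)
    G, N = X.shape
    n = N if n is None else n
    together = np.zeros((N, N))   # times a pair landed in the same cluster
    sampled = np.zeros((N, N))    # times a pair was drawn together
    for _ in range(B):
        cases = rng.choice(N, size=n, replace=False)   # random sample of cases
        genes = rng.choice(G, size=g, replace=False)   # random sample of genes
        sub = X[np.ix_(genes, cases)].T                # n samples x g genes
        labels = fcluster(linkage(pdist(sub), method="ward"),
                          t=k, criterion="maxclust")
        for a in range(n):
            for b in range(n):
                sampled[cases[a], cases[b]] += 1
                together[cases[a], cases[b]] += labels[a] == labels[b]
    D = 1.0 - together / np.maximum(sampled, 1)        # ABC(i,j)
    np.fill_diagonal(D, 0.0)
    return D

# Toy data: samples 5-8 carry a shift on 100 of 500 genes.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))
X[:100, 4:] += 5.0
D = abc_dissimilarity(X, k=2, B=100, g=50, n=6, seed=1)
# D is symmetric by construction, so the validity checks can be skipped.
final = fcluster(linkage(squareform(D, checks=False), method="average"),
                 t=2, criterion="maxclust")
print(final)
```

Clustering the ABC dissimilarities with average linkage then recovers the two groups built into the toy matrix.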
ABC clustering of data from knockout experiment
[Figure: dendrograms from ABC clustering; early stage, MR = 2/12; late stage, MR = 0/12.]
ABC-MDS plot of data from knockout experiment
[Figure: MDS plots based on ABC dissimilarities, early stage (left) and late stage (right).]
Within-cluster and between-cluster dissimilarities
More proof-of-concept examples
Try on data in which the clusters are known.
Misclassification Rates
Method Golub AMS ALL Colon
Ward's with ABC 18.1 1.4 0.0 9.7
Ward’s with 1-Cor 23.6 9.7 2.3 48.4
Single Linkage 47.0 47.0 25.0 37.0
Complete Linkage 37.5 23.6 41.4 45.0
Average Linkage 47.2 27.8 26.5 38.7
K-means 20.8 5.5 42.2 48.4
PAM 23.6 8.3 2.3 16.1
Random Forest 43.0 26.4 48.0 43.5
More proof-of-concept examples (ctd)
… with feature selection
Misclassification Rates
Method Golub AMS ALL Colon
Ward's with ABC 18.1 1.4 0.0 9.7
Ward’s with 1-Cor 6.9 13.9 0.0 24.2
Single Linkage 45.8 58.3 26.6 35.5
Complete Linkage 29.2 13.9 0.0 27.4
Average Linkage 5.6 30.6 0.0 37.1
K-means 6.9 6.9 0.0 14.5
PAM 8.3 13.9 0.0 12.9
Random Forest 23.6 12.5 0.0 11.3
Hepatotoxicity example (1)
In this experiment N=87 compounds were tested in rats for a certain type of hepatotoxicity.
Hepatotoxicity example (2)
ABC was run on this dataset.
[Figure: MDS plot of the 87 compounds based on ABC dissimilarities, each point labeled with an abbreviated compound name.]
Hepatotoxicity example (3)
In this case, it was known that there are 3 genes thought to be implicated in the toxicity of interest.
Hepatotoxicity example (4)
[Figure: MDS plot of the 87 compounds from weighted ABC, each point labeled with an abbreviated compound name.]
Running ABC with weights proportional to the maximum correlation to these 3 genes gave a much more interesting result.
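A sketch of that weighting scheme: each gene's weight is its maximum absolute correlation with the implicated genes, and the gene sample in each ABC iteration is drawn with probabilities proportional to those weights. The data and the marker indices are hypothetical, not from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))   # G x N expression matrix (hypothetical)
markers = [10, 11, 12]            # hypothetical rows of the 3 implicated genes

C = np.corrcoef(X)                      # G x G gene-gene correlation matrix
w = np.abs(C[:, markers]).max(axis=1)   # max |correlation| with any marker
p = w / w.sum()                         # sampling probabilities for the gene draw

# Weighted (rather than simple) random sample of genes for one ABC iteration:
genes = rng.choice(X.shape[0], size=50, replace=False, p=p)
```

The marker genes themselves get weight 1 (correlation with themselves), so they and the genes co-expressed with them dominate the gene samples.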
Extension: ensemble classifiers
-- Take a simple random sample of subjects and a simple random sample of genes from the data
-- Construct a classifier (tree (Random Forest*), LDA, …)
-- Iterate and collate the results
-- Predict using the classifiers; the prediction is by majority vote
Ref: Breiman (Machine Learning, 2001), Amaratunga et al (2009)
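The scheme above can be sketched as follows; a nearest-centroid rule stands in for the tree/LDA base learner, and all sizes (B, subsample sizes, toy data) are illustrative assumptions.

```python
import numpy as np

def fit_centroids(Xs, ys):
    """Nearest-centroid base learner (a simple stand-in for a tree or LDA)."""
    classes = np.unique(ys)
    return classes, np.array([Xs[ys == c].mean(axis=0) for c in classes])

def predict_centroids(model, Xs):
    classes, cents = model
    d2 = ((Xs[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]

def ensemble_classify(X, y, X_new, B=50, n_sub=None, g_sub=50, seed=0):
    """X is N x G (subjects x genes); majority vote over B subsampled learners."""
    rng = np.random.default_rng(seed)
    N, G = X.shape
    n_sub = n_sub or max(2, N // 2)
    votes = np.zeros((X_new.shape[0], B), dtype=int)
    for b in range(B):
        subj = rng.choice(N, size=n_sub, replace=False)  # sample of subjects
        gene = rng.choice(G, size=g_sub, replace=False)  # sample of genes
        model = fit_centroids(X[np.ix_(subj, gene)], y[subj])
        votes[:, b] = predict_centroids(model, X_new[:, gene])
    # Collate the B base-learner predictions by majority vote.
    return np.array([np.bincount(v).argmax() for v in votes])

# Toy data: class-1 subjects are shifted on the first 30 of 200 genes.
rng = np.random.default_rng(3)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 200))
X[y == 1, :30] += 4.0
X_new = rng.normal(size=(4, 200))
X_new[2:, :30] += 4.0
pred = ensemble_classify(X, y, X_new, B=40, g_sub=60)
print(pred)
```

Each base learner sees only a subset of subjects and genes, so no single noisy gene set dominates; the vote across learners is what stabilizes the prediction.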
Case study: KO experiment
Try on data in which the classes are known.
Out-of-bag error rates:

                            RF     RF(p)  ERF    E-LDA  EE-LDA
Slc17A5 Day 0               0.583  0.583  0.167  0.583  0.083
Slc17A5 Day 18              0.083  0.083  0.000  0.000  0.000
Slc17A5 Day 0 (scrambled)   0.750  0.750  0.833  0.833  0.833
Slc17A5 Day 18 (scrambled)  0.583  0.667  0.667  0.583  0.583

Ref: Amaratunga, Cabrera & Lee (Bioinformatics, 2008)
Wrap Up
Megavariate data are becoming more and more prevalent.
Megavariate data introduce special challenges:
-- overparametrized and undersampled
-- overfitting and redundancy
-- computationally challenging
In this setting, ensemble methods are among the best choices for classification.
Wrap Up (ctd)
Scientific collaborators: Michael McMillian, Jennifer Sasaki
References:
D Amaratunga and J Cabrera (2004). Exploration and Analysis of DNA Microarray and Protein Array Data. John Wiley.
D Amaratunga, J Cabrera and V Kovtun (2008). Microarray learning with ABC. Biostatistics.
D Amaratunga, J Cabrera and Y S Lee (2008). Enriched random forests. Bioinformatics.
D Amaratunga, J Cabrera, Y Cherkas and Y S Lee (2009). Ensemble classifiers. In review.
Website (recent papers and software):
www.amaratunga.com
www.rci.rutgers.edu/~cabrera/DNAMR
Email: [email protected]