Classification and Clustering for Hit Identification in High Content RNAi Screens
Rajarshi Guha, Ph.D., NIH Center for Translational Therapeutics
January 11, 2012
DNA Re-replication
Sivaprasad et al Cell Division
DNA replication is a tightly controlled and well-studied process. Proteins including geminin, cyclin A, and Emi1 can help prevent DNA re-replication.
Levels of geminin increase as cells enter S phase, which helps to prevent a second round of DNA replication.
After mitosis, levels of geminin and cyclins decrease through ubiquitin-mediated degradation.
Collaborators: Mel Depamphilis, NICHD; Wenge Zhu, Georgetown U
DNA Re-replication
Certain cancer cells may have fewer safeguards against DNA re-replication than normal cells (i.e., an Achilles heel). Induction of re-replication results in apoptosis.
Zhu et al, Cancer Res, 2009
Screening Protocol
• HCT-116 colon cancer cells are fixed and stained (Hoechst)
• Image at 4X on ImageXpress
• MetaXpress used to perform cell cycle analysis to quantify cells with >4N DNA content
• Screens were run with singles and pools
Screen Summary
• Qiagen druggable genome library (6,866 genes)
• 94 plates, 36K wells including controls
• Good screen performance, some poorer plates were redone
[Figure: trimmed Z' and SSMD plotted against plate index]
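The two plate-quality statistics named in the figure can be sketched as follows. This is a minimal illustration using the standard textbook definitions of Z' and SSMD; the slides do not say how the trimming was done, so the `trimmed` helper and its 10% fraction are assumptions, not the authors' procedure.

```python
import math
import statistics


def trimmed(values, fraction=0.1):
    """Drop the lowest and highest `fraction` of values (hypothetical trimming rule)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k] if k else ordered


def z_prime(pos, neg):
    """Z' = 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    spread = statistics.stdev(pos) + statistics.stdev(neg)
    return 1.0 - 3.0 * spread / abs(statistics.mean(pos) - statistics.mean(neg))


def ssmd(pos, neg):
    """SSMD = (mean_pos - mean_neg) / sqrt(var_pos + var_neg)."""
    diff = statistics.mean(pos) - statistics.mean(neg)
    return diff / math.sqrt(statistics.variance(pos) + statistics.variance(neg))
```

For a trimmed Z', one would call `z_prime(trimmed(pos), trimmed(neg))` on the control-well values of each plate.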
Goals
• Can we identify genes with GMNN-like phenotypes?
– We already identified a set of genes via thresholding the %G2 parameter
– We’d like to see what we get when we use a multi-dimensional representation
• Employ predictive modeling to “learn” the phenotype
• Apply clustering and identify biologically relevant clusters
What Do GMNN Wells Look Like?
Cell-Level Modeling
• A first approach was to match distributions of individual wells with the overall distribution from the positive control wells
– Expected that the distribution for GMNN wells should match the positive control
– Use the KS test to identify wells with similar distributions
– Doesn’t work too well, even for GMNN itself
– Considers 1 parameter at a time (though a 2D KS test is possible)
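The distribution-matching idea can be illustrated with the two-sample KS statistic, which is the largest gap between the empirical CDFs of a well and of the pooled positive controls. This is a minimal sketch for one parameter at a time; the function name is illustrative and no p-value is computed.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The supremum of the ECDF difference is attained at an observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

A small statistic means the well's cell-level distribution resembles the positive control. In practice a library routine such as R's `ks.test` or SciPy's `ks_2samp` would also supply the p-value.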
Random Forest Model
• Ensemble of decision trees (Breiman 2001)
• Not always the most accurate, but great for exploratory modeling
– Implicit feature selection
– Does not overfit as more trees are added
– Provides a measure of feature importance
• Employ the randomForest package from R
http://proteomics.bioengr.uic.edu/malibu/docs/meta_classifiers.html
Cell-Level Modeling
• Removed cells with “incomplete” parameters
• Still leaves 291K positive cases and 3M negative cases
• Developed a random forest model, sampling from negatives to maintain balanced classes
– Predict whether a cell is GMNN-like
– Models from multiple samples of the negative control exhibited similar performance
                   Predicted Positive   Predicted Negative
Actual Positive               220,636               72,498
Actual Negative                35,614              257,520
Overall 18% error, 25% error on positive class and 12% error on negative class
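The quoted error rates follow directly from the confusion matrix, reading rows as actual classes and columns as predicted classes. A short check (the function and argument names are illustrative):

```python
def error_rates(true_pos, false_neg, false_pos, true_neg):
    """Return (overall, positive-class, negative-class) error rates."""
    pos_error = false_neg / (true_pos + false_neg)
    neg_error = false_pos / (false_pos + true_neg)
    total = true_pos + false_neg + false_pos + true_neg
    overall = (false_neg + false_pos) / total
    return overall, pos_error, neg_error


# Counts taken from the cell-level confusion matrix on the slide.
overall, pos_error, neg_error = error_rates(220636, 72498, 35614, 257520)
print(f"overall {overall:.1%}, positive {pos_error:.1%}, negative {neg_error:.1%}")
```

The rounded values match the slide: about 18% overall, 25% on the positive class, and 12% on the negative class.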
Cell-Level Modeling
• Significant overlap between distributions for the negative and positive controls
Cell-Level Predictions
• Aggregate predictions for all cells in a well to label a well as GMNN-like
• Identify genes with >= 2 siRNAs (i.e., wells) labeled as GMNN-like
– 31 genes identified (GMNN, KIF11, ESPL1, …)
• Identified expected genes and most of the set were functionally relevant
– Also identified a few interesting, novel genes
• Reconfirmation based on Ambion sequences was relatively low (9/31)
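The aggregation from cells to wells to genes could look roughly like the sketch below. The slides do not state the rule used to label a well, so the majority-vote threshold (`min_fraction=0.5`) is an assumption, and all identifiers are hypothetical.

```python
from collections import defaultdict


def gmnn_like_wells(cell_calls, min_fraction=0.5):
    """cell_calls: iterable of (well_id, is_gmnn_like) pairs, one per cell.

    A well is labeled GMNN-like when more than `min_fraction` of its cells
    are predicted GMNN-like (assumed rule, not taken from the slides).
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for well, call in cell_calls:
        totals[well] += 1
        hits[well] += 1 if call else 0
    return {w for w in totals if hits[w] / totals[w] > min_fraction}


def genes_with_multiple_hits(well_to_gene, positive_wells, min_sirnas=2):
    """Genes for which at least `min_sirnas` wells (siRNAs) are GMNN-like."""
    counts = defaultdict(int)
    for well in positive_wells:
        counts[well_to_gene[well]] += 1
    return sorted(g for g, n in counts.items() if n >= min_sirnas)
```

Each well holds one siRNA in the singles format, which is why counting wells per gene is equivalent to counting siRNAs.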
Well-Level Modeling
• Started with 27 parameters from MetaXpress
• Performed automated feature selection
– Remove undefined, constant features
– Manually removed a few highly correlated features
• Work with 12 parameters
• Convert to Z-scores
• Positive & negative controls are nicely separated
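The feature-preparation steps can be sketched as below. Two points are assumptions: the slides say correlated features were removed manually, so the greedy correlation filter and its 0.95 threshold are only a stand-in, and the Z-scores here are computed against each column's own mean and standard deviation, which the slides do not specify.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_constant(columns):
    """Remove features that take a single value across all wells."""
    return {name: vals for name, vals in columns.items() if len(set(vals)) > 1}


def drop_correlated(columns, threshold=0.95):
    """Greedily keep a feature only if it is not highly correlated with one already kept."""
    kept = {}
    for name, vals in columns.items():
        if all(abs(pearson(vals, other)) < threshold for other in kept.values()):
            kept[name] = vals
    return kept


def z_scores(values):
    """Standardize one feature to zero mean and unit (sample) standard deviation."""
    mu, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mu) / sd for v in values]
```

`drop_constant` should run first, since the correlation is undefined for a constant column.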
[Figure panels: All Wells; Control Wells]
Parameter Distributions
Model Performance
• Classification model trained using the positive (GMNN-like) and negative (not GMNN-like) controls
• Perfect classification!
– Worrying: overfitting?
– Nearly: 99% of the control wells were confidently classified as positive or negative
                   Predicted Positive   Predicted Negative
Actual Positive                  1504                    0
Actual Negative                     0                 1504
Descriptor Importance
• What does the model identify as the most relevant descriptors?
• Some parameters are moderately correlated
[Figure: descriptor importance as MeanDecreaseGini (axis 0 to 300) for the 12 descriptors, listed as they appear in the plot: Cell.MitoticAverageIntensity, Cell.DNAAverageIntensity, X.SPhase, G2Cells, DNABackgroundValue, Cell.DNAArea, X.G0.G1, Cell.DNAIntegratedIntensity, Cell.MitoticIntegratedIntensity, X.G2, SPhaseCells, G0.G1Cells]
Random Forest Predictions
• We use the model to predict the class for all the remaining wells
• All four siRNAs targeting GMNN are classified as Geminin-like with high confidence
[Figure: histogram of the probability of being Geminin-like (x-axis, 0.0 to 1.0) against percent of total (y-axis)]
Random Forest Predictions
• Select genes for which > 75% of its siRNAs are predicted to be Geminin-like with probability > 0.8
• Good overlap with cell-level model
[Figure: probability of being Geminin-like (0.0 to 1.0) for the selected genes: AURKA, AURKB, BRD8, C8orf79, CDCA5, CDCA8, CRAT, ESPL1, F12, FBXO5, GMNN, GUSB, INCENP, ITPKA, JUN, KCNH6, KIF11, MLL4, OR10A2, PLK1, PSMA1, PSMB4, ROBO2, RPLP2, SNRK, TOP2A, TRIM64, TTK, UBC, WRN]
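The gene-selection rule on this slide (more than 75% of a gene's siRNAs predicted Geminin-like with probability above 0.8) can be written as a small filter. The input layout, a mapping from gene to per-siRNA probabilities, is an assumption made for illustration.

```python
def select_genes(probabilities, min_prob=0.8, min_fraction=0.75):
    """probabilities: dict mapping gene -> list of per-siRNA class probabilities.

    Keep a gene when strictly more than `min_fraction` of its siRNAs
    have a Geminin-like probability strictly above `min_prob`.
    """
    selected = []
    for gene, probs in probabilities.items():
        confident = sum(1 for p in probs if p > min_prob)
        if probs and confident / len(probs) > min_fraction:
            selected.append(gene)
    return sorted(selected)
```

Note that for a gene with four siRNAs, as with GMNN, the strict "> 75%" cutoff requires all four to pass.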
GO Enrichment
• GO Biological Processes enriched by this set of selected genes are relevant to the biology
• Similarly with pathways (from GeneGo)
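The slides do not say which statistic was used for the enrichment. A common choice is the one-sided hypergeometric test, sketched here as an assumption rather than as the method actually applied.

```python
from math import comb


def hypergeom_pvalue(population, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `population` genes, of which `annotated` carry the GO term."""
    total = comb(population, selected)
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

Here `population` would be the number of genes screened, `selected` the size of the hit list, and `overlap` the number of hits annotated with the term. A correction for testing many terms would normally follow.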
Clustering
• RF classification is useful, but doesn’t directly tell us much about finer groups of genes that might be phenotypically related
• So we apply unsupervised clustering (PAM)
– Explore different numbers of clusters
– Evaluate statistical cluster quality metrics
– Evaluate biologically motivated quality metrics
• We considered both plate-wise and experiment-wise clustering protocols
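To make the medoid idea concrete, here is a simplified k-medoids sketch. It alternates between assigning points to the nearest medoid and re-picking each cluster's medoid, which is not the full PAM build-and-swap search; the naive initialization (first k points) and the Euclidean distance are also assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_medoids(points, k, dist=euclidean, max_iter=100):
    """Return (medoid indices, cluster label per point)."""
    medoids = list(range(k))  # naive initialization: the first k points
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # can happen with duplicate points; keep the old medoid
                new_medoids.append(m)
                continue
            best = min(members, key=lambda c: sum(dist(points[c], points[j]) for j in members))
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    labels = [min(range(k), key=lambda c: dist(p, points[medoids[c]])) for p in points]
    return medoids, labels
```

Because medoids are actual wells, each cluster can be inspected through a representative image, which is one practical reason to prefer this family over k-means. For real analyses, R's `pam` in the `cluster` package implements the full algorithm.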
Platewise Clustering (k=4)
• Cluster assignments can’t be directly compared across plates
• Good to see that control columns are distinctly clustered
• Certain plates show no membership to the ‘GMNN cluster’
Experimentwise Clustering (k=2)
• Encouraging to see clean separation between control columns
• Bulk of wells are identified as inactive
• We can compare results from this clustering to RF classification
– 6 genes identified, with multiple siRNAs clustered with negative control
Experimentwise Clustering (k=2)
• 6 genes identified with multiple siRNAs clustered with the negative control
• These were confidently identified by the RF model
[Figure: probability of being Geminin-like (0.0 to 1.0) for each selected gene, same gene set as in the Random Forest Predictions figure]
How Many Clusters?
• A priori, difficult to decide how many clusters there should be
– Manual spot checks did not identify distinctly different morphologies, counts
• Evaluate clusters with varying k and calculate average silhouette width
• Clustering based on the Euclidean metric doesn’t do a good job
[Figure: average silhouette width (y-axis, 0.2 to 0.7) against number of clusters (x-axis, 2 to 20)]
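The average silhouette width used to compare values of k can be computed as in the sketch below. The distance function is a parameter because the slide notes that the Euclidean metric performed poorly; the Euclidean default here is only for illustration.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_silhouette(points, labels, dist=euclidean):
    """Mean of s(i) = (b - a) / max(a, b) over all points, where a is the mean
    distance to the point's own cluster and b the mean distance to the
    nearest other cluster."""
    members = {}
    for i, lab in enumerate(labels):
        members.setdefault(lab, []).append(i)
    if len(members) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for i in range(len(points)):
        own = [j for j in members[labels[i]] if j != i]
        if not own:
            continue  # a singleton cluster contributes s(i) = 0
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in idx) / len(idx)
            for lab, idx in members.items() if lab != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or misassigned points. R's `silhouette` in the `cluster` package provides the same quantity.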
How Many Clusters?
• One approach is to ignore clusterings that have spread all GMNN siRNAs across multiple clusters
• The current data suggests that we stick to k = 5
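One way to express this filter is to measure, for each k, the largest share of GMNN siRNAs that land in a single cluster and discard clusterings where that share is too low. The 0.75 cutoff and all identifiers below are hypothetical; the slides do not give a numeric criterion.

```python
from collections import Counter


def largest_cluster_fraction(cluster_of, control_ids):
    """Largest share of the control siRNAs that fall in one cluster."""
    counts = Counter(cluster_of[s] for s in control_ids)
    return counts.most_common(1)[0][1] / len(control_ids)


def acceptable_k(clusterings, control_ids, min_fraction=0.75):
    """clusterings: dict mapping k -> {siRNA id: cluster label}.

    Keep the k values whose clustering holds at least `min_fraction`
    of the control siRNAs together (assumed cutoff).
    """
    return sorted(
        k for k, cluster_of in clusterings.items()
        if largest_cluster_fraction(cluster_of, control_ids) >= min_fraction
    )
```

This biological check complements the silhouette width: a clustering can score well statistically and still scatter the positive-control siRNAs.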
Biological Enrichment in Clusters
• Considering 5 clusters
• Some clusters are annotated with more relevant terms
[Figure annotation: cluster containing 3 of the 4 GMNN siRNAs]
Signal Enhancement in Clusters
• Signal is significantly enhanced in some clusters versus others
• Clusters 1, 2 and 4 did not contain any siRNA’s above Z = 3
Making a Final Hitlist
• Off-target effects are a major confounding factor
• We are able to assess OTE on a gene by gene basis using Common Seed Analysis
• Select genes from individual clusters, using % G2 and number of siRNA’s as secondary filters
• Combine with hits from random forest model
Marine, S. et al, J. Biomol. Screen., 2011, ASAP
Reconfirmation
• 18/211 genes selected based on thresholding from the primary screen reconfirmed using Ambion sequences
• Considering just the genes selected by the random forest and/or clustering methods
– 11/30 genes selected by RF reconfirmed using Ambion libraries
– 5/6 genes identified by RF & clustering reconfirmed using multiple libraries
  • ESPL1, FBXO5, INCENP, KIF11 reconfirmed very strongly
• Based on k = 5 clustering,
– 23/181 genes from cluster 3 reconfirmed
– 5/5 genes from cluster 5 reconfirmed
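Converting the counts on this slide to rates makes the comparison between selection methods easier to read. The counts are taken from the slide; the dictionary keys are just labels.

```python
# (reconfirmed, selected) counts as reported on the slide
reconfirmation = {
    "Primary thresholding": (18, 211),
    "Random forest": (11, 30),
    "Random forest and clustering": (5, 6),
    "Cluster 3 (k = 5)": (23, 181),
    "Cluster 5 (k = 5)": (5, 5),
}

rates = {name: hits / total for name, (hits, total) in reconfirmation.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

The rates are roughly 8.5% for thresholding, 36.7% for the random forest, 83.3% for genes found by both the random forest and clustering, 12.7% for cluster 3, and 100% for cluster 5. The smaller sets rest on very few genes, so these percentages should be read with that in mind.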
Outlook
• Complements traditional threshold-based selection methods
• The random forest approach is sufficiently accurate and lets us avoid explicitly selecting features up front
• Combining it with clustering lets us zoom into biologically relevant clusters of genes
Acknowledgements
• Scott Martin
• Pinar Tuzmen
• Carleen Klump
• Eugen Buehler