Classification and Clustering for Hit Identification in High Content RNAi Screens

Rajarshi Guha, Ph.D., NIH Center for Translational Therapeutics

January 11, 2012

DNA Re-replication

Sivaprasad et al., Cell Division

DNA replication is a tightly controlled and well-studied process. Proteins including geminin, cyclin A, and Emi1 can help prevent DNA re-replication.

Levels of geminin increase as cells enter S phase, which helps to prevent a second round of DNA replication.

After mitosis, levels of geminin and cyclins decrease through ubiquitin-mediated degradation.

Collaborators: Mel Depamphilis, NICHD; Wenge Zhu, Georgetown U

DNA Re-replication

Certain cancer cells may have fewer safeguards against DNA re-replication than normal cells (i.e., an Achilles heel). Induction of re-replication results in apoptosis.

Zhu et al, Cancer Res, 2009

Screening Protocol

•  HCT-116 colon cancer cells are fixed and stained (Hoechst)

•  Image at 4X on ImageXpress

•  MetaXpress used to perform cell cycle analysis to quantify cells with >4N DNA content

•  Screens were run with singles and pools

Screen Summary

•  Qiagen druggable genome library (6,866 genes)

•  94 plates, 36K wells including controls

•  Good screen performance, some poorer plates were redone

[Figure: per-plate quality statistics versus plate index (0 to 100). Panels: Trimmed Z' (axis 0.5 to 0.8) and SSMD (axis 4 to 14).]
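The two plate-level statistics named on this slide can be sketched as follows. This is a minimal illustration, not the screen's actual pipeline: the formulas are the standard Z' factor and the SSMD for two independent groups, while the trimming rule (dropping a fixed fraction from each tail) and all well values are assumptions made for the example.

```python
import math
import statistics

def trimmed(values, frac=0.1):
    """Drop the lowest and highest `frac` of values (an assumed trimming rule)."""
    xs = sorted(values)
    k = int(len(xs) * frac)
    return xs[k:len(xs) - k] if k > 0 else xs

def z_prime(pos, neg):
    """Z' factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    spread = statistics.stdev(pos) + statistics.stdev(neg)
    return 1.0 - 3.0 * spread / abs(statistics.mean(pos) - statistics.mean(neg))

def ssmd(pos, neg):
    """Strictly standardized mean difference for two independent groups."""
    diff = statistics.mean(pos) - statistics.mean(neg)
    return diff / math.sqrt(statistics.variance(pos) + statistics.variance(neg))

# Hypothetical % >4N readouts for the control wells of one plate.
pos_wells = [41.0, 43.5, 39.8, 42.2, 40.9, 44.1, 41.7, 60.0]  # last value is an outlier
neg_wells = [8.1, 7.4, 9.0, 8.6, 7.9, 8.3, 8.8, 7.7]

plate_z = z_prime(trimmed(pos_wells, 0.125), trimmed(neg_wells, 0.125))
plate_z_untrimmed = z_prime(pos_wells, neg_wells)
plate_ssmd = ssmd(pos_wells, neg_wells)
```

Trimming matters here because a single outlying control well inflates the standard deviation and pulls the untrimmed Z' down.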

Goals

•  Can we identify genes with GMNN-like phenotypes?

– We already identified a set of genes via thresholding the %G2 parameter

– We'd like to see what we get when we use a multi-dimensional representation

•  Employ predictive modeling to "learn" the phenotype

•  Apply clustering and identify biologically relevant clusters

What Do GMNN Wells Look Like?

Cell-Level Modeling

•  A first approach was to match the distributions of individual wells with the overall distribution from the positive control wells

– Expected that the distribution for GMNN wells should match the positive control

– Use the KS test to identify wells with similar distributions

– Doesn't work too well, even for GMNN itself

– Considers one parameter at a time (though a 2D KS test is possible)
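The distribution-matching idea can be illustrated with the two-sample Kolmogorov-Smirnov statistic, the largest gap between two empirical CDFs. The sketch below computes only the statistic (no p-value) for a single parameter. The per-cell values and well names are hypothetical.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # Both ECDFs are step functions, so the maximum gap occurs at a data point.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical per-cell values of one parameter (e.g. DNA intensity).
positive_control = [4.2, 4.5, 4.8, 5.1, 5.3, 5.6, 6.0, 6.4]
well_a = [4.3, 4.6, 4.9, 5.0, 5.4, 5.7, 6.1, 6.3]  # resembles the positive control
well_b = [2.0, 2.1, 2.2, 2.4, 2.5, 2.7, 3.0, 4.4]  # does not

d_a = ks_statistic(positive_control, well_a)
d_b = ks_statistic(positive_control, well_b)
```

A small statistic means the well's distribution is close to the positive control for that one parameter, which is exactly the single-parameter limitation noted above.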

Random Forest Model

•  Ensemble of decision trees (Breiman 1984)

•  Not always the most accurate, but great for exploratory modeling

– Implicit feature selection

– Resistant to overfitting as more trees are added

– Provides a measure of feature importance

•  Employ the randomForest package from R

http://proteomics.bioengr.uic.edu/malibu/docs/meta_classifiers.html


•  Removed cells with "incomplete" parameters

•  Still leaves 291K positive cases and 3M negative cases

•  Developed a random forest model, sampling from negatives to maintain balanced classes

– Predict whether a cell is GMNN-like

– Models from multiple samples of the negative control exhibited similar performance

                Predicted
Observed        Positive    Negative
Positive        220,636     72,498
Negative        35,614      257,520

Overall 18% error, 25% error on positive class and 12% error on negative class
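The quoted error rates follow directly from the confusion matrix. The short check below assumes rows are the observed class and columns the predicted class, which is the reading that reproduces the 25% and 12% class errors.

```python
# Counts from the slide; outer key = observed class, inner key = predicted class
# (an assumed orientation, chosen because it matches the quoted error rates).
confusion = {
    "positive": {"positive": 220_636, "negative": 72_498},
    "negative": {"positive": 35_614, "negative": 257_520},
}

def class_error(matrix, label):
    """Fraction of cells of class `label` that were assigned to the other class."""
    row = matrix[label]
    return 1.0 - row[label] / sum(row.values())

def overall_error(matrix):
    """Fraction of all cells that were misclassified."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[label][label] for label in matrix)
    return 1.0 - correct / total

positive_error = class_error(confusion, "positive")   # about 0.247
negative_error = class_error(confusion, "negative")   # about 0.121
total_error = overall_error(confusion)                # about 0.184
```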


•  Significant overlap between distributions for the negative and positive controls

Cell-Level Predictions

•  Aggregate predictions for all cells in a well to label a well as GMNN-like

•  Identify genes with >= 2 siRNAs (i.e., wells) labeled as GMNN-like

– 31 genes identified (GMNN, KIF11, ESPL1, …)

•  Identified expected genes and most of the set were functionally relevant

– Also identified a few interesting, novel genes

•  Reconfirmation based on Ambion sequences was relatively low (9/31)
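The roll-up from cells to wells to genes can be sketched as below. The slides do not state the aggregation rule, so the majority-vote threshold (`min_fraction`) is an assumption, and the well identifiers and gene names are placeholders. Only the ">= 2 siRNAs per gene" criterion comes from the slide.

```python
from collections import defaultdict

def label_wells(cell_calls, min_fraction=0.5):
    """Return wells in which more than `min_fraction` of cells are called GMNN-like.

    cell_calls maps a well id to a list of per-cell booleans. The threshold
    is an assumed aggregation rule, not one taken from the screen.
    """
    return {
        well for well, calls in cell_calls.items()
        if calls and sum(calls) / len(calls) > min_fraction
    }

def genes_with_support(well_to_gene, positive_wells, min_wells=2):
    """Genes with at least `min_wells` siRNA wells labeled GMNN-like."""
    counts = defaultdict(int)
    for well in positive_wells:
        counts[well_to_gene[well]] += 1
    return sorted(gene for gene, n in counts.items() if n >= min_wells)

# Hypothetical per-cell predictions; each well holds one siRNA.
cell_calls = {
    "P1:A3": [True, True, False, True],
    "P1:A4": [True, True, True, False],
    "P1:B7": [False, False, True, False],
    "P2:C1": [True, False, True, True],
}
well_to_gene = {"P1:A3": "GENE_A", "P1:A4": "GENE_A", "P1:B7": "GENE_B", "P2:C1": "GENE_C"}

positive_wells = label_wells(cell_calls)
hit_genes = genes_with_support(well_to_gene, positive_wells)
```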

Well-Level Modeling

•  Started with 27 parameters from MetaXpress

•  Performed automated feature selection

– Remove undefined, constant features

– Manually removed a few highly correlated features

•  Work with 12 parameters

•  Convert to Z-scores

•  Positive & negative controls are nicely separated
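The automated part of this preprocessing (dropping undefined or constant features, then converting to Z-scores) might look like the sketch below. The reference population for the Z-score is not given on the slide, so standardizing across all supplied wells is an assumption. `X.G2` and `G2Cells` are parameter names from the deck, while `ConstantFeature`, `UndefinedFeature`, and all values are made up.

```python
import math
import statistics

def usable(values):
    """True if every value is defined and the feature is not constant."""
    if any(v is None or math.isnan(v) for v in values):
        return False
    return statistics.pstdev(values) > 0

def zscores(values):
    """Standardize to zero mean and unit (sample) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical well-level table: feature name -> one value per well.
wells = {
    "X.G2": [12.0, 15.0, 11.0, 30.0],
    "G2Cells": [120.0, 160.0, 115.0, 310.0],
    "ConstantFeature": [1.0, 1.0, 1.0, 1.0],
    "UndefinedFeature": [0.4, float("nan"), 0.2, 0.1],
}

kept = {name: zscores(vals) for name, vals in wells.items() if usable(vals)}
```

Removing highly correlated features was done manually in the original work and is not reproduced here.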

[Figure panels: All Wells; Control Wells]

Parameter Distributions

Model Performance

•  Classification model trained using the positive (GMNN-like) and negative (not GMNN-like) controls

•  Perfect classification!

– Worrying: overfitting?

– Nearly 99% of the control wells were confidently classified as positive or negative

                Predicted
Observed        Positive    Negative
Positive        1504        0
Negative        0           1504

Descriptor Importance

•  What does the model identify as the most relevant descriptors?

•  Some parameters are moderately correlated

[Figure: random forest descriptor importance (MeanDecreaseGini, axis 0 to 300). Descriptors in the order shown: Cell.MitoticAverageIntensity, Cell.DNAAverageIntensity, X.SPhase, G2Cells, DNABackgroundValue, Cell.DNAArea, X.G0.G1, Cell.DNAIntegratedIntensity, Cell.MitoticIntegratedIntensity, X.G2, SPhaseCells, G0.G1Cells]

Random Forest Predictions

•  We use the model to predict the class for all the remaining wells

•  All four siRNAs targeting GMNN are classified as Geminin-like with high confidence

[Figure: histogram of the probability of being Geminin-like (x-axis, 0.0 to 1.0) versus percent of total (y-axis, 0 to 10)]

Random Forest Predictions

•  Select genes for which > 75% of their siRNAs are predicted to be Geminin-like with probability > 0.8

•  Good overlap with cell-level model
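The gene selection rule above is simple to express in code. In this sketch the thresholds come from the slide, while the gene names and probabilities are hypothetical.

```python
def select_genes(probabilities, min_prob=0.8, min_fraction=0.75):
    """Genes for which more than `min_fraction` of siRNAs exceed `min_prob`.

    probabilities maps a gene to the predicted Geminin-like probability
    of each of its siRNA wells.
    """
    selected = []
    for gene, probs in probabilities.items():
        confident = sum(p > min_prob for p in probs)
        if probs and confident / len(probs) > min_fraction:
            selected.append(gene)
    return sorted(selected)

# Hypothetical per-siRNA probabilities from the well-level model.
probabilities = {
    "GENE_A": [0.95, 0.90, 0.85, 0.99],  # 4/4 confident
    "GENE_B": [0.95, 0.90, 0.85, 0.50],  # 3/4 confident
    "GENE_C": [0.10, 0.05, 0.30, 0.20],  # 0/4 confident
}

selected = select_genes(probabilities)
```

Because the inequality is strict, a gene with four siRNAs (as GMNN has here) is selected only when all four pass. Three of four is exactly 75% and does not qualify.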

[Figure: probability of being Geminin-like (y-axis, 0.0 to 1.0) for the selected genes: AURKA, AURKB, BRD8, C8orf79, CDCA5, CDCA8, CRAT, ESPL1, F12, FBXO5, GMNN, GUSB, INCENP, ITPKA, JUN, KCNH6, KIF11, MLL4, OR10A2, PLK1, PSMA1, PSMB4, ROBO2, RPLP2, SNRK, TOP2A, TRIM64, TTK, UBC, WRN]

GO Enrichment

•  GO Biological Processes enriched in this set of selected genes are relevant to the biology

•  Similarly with pathways (from GeneGo)

Clustering

•  RF classification is useful, but doesn't directly tell us much about finer groups of genes that might be phenotypically related

•  So we apply unsupervised clustering (PAM)

– Explore different numbers of clusters

– Evaluate statistical cluster quality metrics

– Evaluate biologically motivated quality metrics

•  We considered both plate-wise and experiment-wise clustering protocols
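For readers unfamiliar with PAM (partitioning around medoids), here is a simplified k-medoids sketch. It is not the implementation used for the screen: it starts from a naive choice of medoids instead of PAM's greedy BUILD phase and then applies improving swaps until none remain. The points stand in for wells described by Z-scored parameters and are hypothetical.

```python
import math
from itertools import product

def total_cost(points, medoids, dist):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def k_medoids(points, k, dist=math.dist):
    """Swap-based k-medoids; returns (medoid indices, cluster label per point)."""
    medoids = list(range(k))  # naive start; full PAM uses a greedy BUILD phase
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        # Try replacing each medoid with each non-medoid point.
        for slot, candidate in product(range(k), range(len(points))):
            if candidate in medoids:
                continue
            trial = medoids[:slot] + [candidate] + medoids[slot + 1:]
            cost = total_cost(points, trial, dist)
            if cost < best - 1e-12:
                medoids, best, improved = trial, cost, True
    labels = [
        min(range(k), key=lambda j: dist(p, points[medoids[j]]))
        for p in points
    ]
    return medoids, labels

# Hypothetical wells in a two-parameter space: two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
medoids, labels = k_medoids(points, k=2)
```

Medoids are actual wells, which makes each cluster easy to inspect. The `dist` argument can be replaced to try a metric other than Euclidean.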

Platewise Clustering (k=4)

•  Cluster assignments can’t be directly compared across plates

•  Good to see that control columns are distinctly clustered

•  Certain plates show no membership to the ‘GMNN cluster’

Experimentwise Clustering (k=2)

•  Encouraging to see clean separation between control columns

•  Bulk of wells are identified as inactive

•  We can compare results from this clustering to RF classification

– 6 genes identified, with multiple siRNAs clustered with negative control

Experimentwise Clustering (k=2)

•  6 genes identified with multiple siRNAs clustered with the negative control

•  These were confidently identified by the RF model

[Figure: probability of being Geminin-like per gene, repeated from the Random Forest Predictions slide]

How Many Clusters?

•  A priori, difficult to decide how many clusters there should be

– Manual spot checks did not identify distinctly different morphologies or counts

•  Evaluate clusters with varying k and calculate average silhouette width

•  Clustering based on the Euclidean metric doesn’t do a good job
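The average silhouette width used to compare values of k can be computed as in the sketch below. For each point, a is the mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster, giving a silhouette of (b - a) / max(a, b). The sketch assumes at least two clusters, and the points and labelings are hypothetical.

```python
import math

def average_silhouette(points, labels, dist=math.dist):
    """Mean silhouette over all points; requires at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: silhouette is taken as 0
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical wells: two tight, well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
good = average_silhouette(points, [0, 0, 0, 1, 1, 1])
bad = average_silhouette(points, [0, 1, 0, 1, 0, 1])
```

Values near 1 indicate compact, well-separated clusters. Passing a different `dist` shows how the choice of metric changes the picture.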

[Figure: average silhouette width (y-axis, 0.2 to 0.7) versus number of clusters (x-axis, 2 to 20)]

How Many Clusters?

•  One approach is to ignore clusterings that have spread all GMNN siRNAs across multiple clusters

•  The current data suggests that we stick to k = 5
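One way to make this filter concrete is to ask, for each k, what share of the GMNN siRNA wells lands in a single cluster. The slides do not give an exact rule, so the 0.75 cutoff and the cluster labels below are purely illustrative assumptions.

```python
from collections import Counter

def largest_share(labels):
    """Fraction of items that fall in the single most common cluster."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def acceptable_ks(gmnn_labels_by_k, min_share=0.75):
    """Values of k whose clustering keeps most GMNN siRNAs together.

    `min_share` is an assumed cutoff, not one taken from the screen.
    """
    return sorted(
        k for k, labels in gmnn_labels_by_k.items()
        if largest_share(labels) >= min_share
    )

# Hypothetical cluster labels of the four GMNN siRNA wells at several k.
gmnn_labels_by_k = {
    2: [1, 1, 1, 1],
    5: [1, 1, 1, 2],
    8: [2, 6, 6, 7],
    11: [1, 4, 9, 10],
}

kept_ks = acceptable_ks(gmnn_labels_by_k)
```

Among the values of k that survive this filter, statistical quality metrics such as silhouette width can then guide the final choice.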

Biological Enrichment in Clusters

•  Considering 5 clusters

•  Some clusters are annotated with more relevant terms

Cluster containing ¾ GMNN siRNAs

Signal Enhancement in Clusters

•  Signal is significantly enhanced in some clusters versus others

•  Clusters 1, 2 and 4 did not contain any siRNAs above Z = 3

Making a Final Hitlist

•  Off-target effects (OTE) are a major confounding factor

•  We are able to assess OTE on a gene-by-gene basis using Common Seed Analysis

•  Select genes from individual clusters, using %G2 and number of siRNAs as secondary filters

•  Combine with hits from random forest model

Marine, S. et al, J. Biomol. Screen., 2011, ASAP

Reconfirmation

•  18/211 genes selected by thresholding in the primary screen reconfirmed using Ambion sequences

•  Considering just the genes selected by the random forest and/or clustering methods

– 11/30 genes selected by RF reconfirmed using Ambion libraries

– 5/6 genes identified by RF & clustering reconfirmed using multiple libraries

•  ESPL1, FBXO5, INCENP, KIF11 reconfirmed very strongly

•  Based on k = 5 clustering,

– 23/181 genes from cluster 3 reconfirmed

– 5/5 genes from cluster 5 reconfirmed
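Converting these counts to rates makes the selection methods easier to compare. The counts are taken from the bullets above, and the dictionary keys are just descriptive labels.

```python
# (reconfirmed, selected) counts as reported on this slide.
reconfirmation = {
    "primary thresholding": (18, 211),
    "random forest": (11, 30),
    "random forest & clustering": (5, 6),
    "k = 5, cluster 3": (23, 181),
    "k = 5, cluster 5": (5, 5),
}

# Reconfirmation rate as a percentage, rounded to one decimal place.
rates = {
    name: round(100.0 * hits / total, 1)
    for name, (hits, total) in reconfirmation.items()
}

for name, pct in rates.items():
    print(f"{name}: {pct}%")
```

The rate rises from roughly 9% for thresholding alone to about 37% for the random forest selection and about 83% for genes picked by both the random forest and clustering, although the smaller sets contain far fewer genes.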

Outlook

•  Complements traditional threshold-based selection methods

•  The random forest approach is sufficiently accurate and lets us avoid explicitly selecting features up front

•  Combining it with clustering lets us zoom in on biologically relevant clusters of genes

Acknowledgements

•  Scott Martin

•  Pinar Tuzmen

•  Carleen Klump

•  Eugen Buehler
