Classification and Clustering for Hit Identification in High Content RNAi Screens

Post on 10-May-2015



Rajarshi Guha, Ph.D.
NIH Center for Translational Therapeutics

January 11, 2012

DNA Re-replication

Sivaprasad et al., Cell Division

DNA replication is a tightly controlled and well-studied process. Proteins including geminin, cyclin A, and Emi1 can help prevent DNA re-replication.

Levels of geminin increase as cells enter S phase, which helps to prevent a second round of DNA replication.

After mitosis, levels of geminin and cyclins decrease through ubiquitin-mediated degradation.

Collaborators: Mel DePamphilis, NICHD; Wenge Zhu, Georgetown U

DNA Re-replication

Certain cancer cells may have fewer safeguards against DNA re-replication than normal cells (i.e., an Achilles heel). Induction of re-replication results in apoptosis.

Zhu et al., Cancer Res., 2009

Screening Protocol

• HCT-116 colon cancer cells are fixed and stained (Hoechst)
• Image at 4X on ImageXpress
• MetaXpress used to perform cell cycle analysis to quantify cells with >4N DNA content
• Screens were run with both single siRNAs and pools

Screen Summary

• Qiagen druggable genome library (6,866 genes)
• 94 plates, 36K wells including controls
• Good screen performance; some poorer plates were redone

 

[Figure: per-plate screen quality statistics, Trimmed Z' (axis 0.5 to 0.8) and SSMD (axis 4 to 14), plotted against plate index (0 to 100)]
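The two plate statistics named on this slide have standard definitions, and a minimal sketch of them may help. The slide does not say how the "trimmed" Z' was trimmed, so the `trimmed` helper below (dropping a fixed fraction from each tail) is an assumption, and the control-well values are hypothetical.

```python
import statistics

def trimmed(values, fraction=0.1):
    """Drop the lowest and highest `fraction` of values (assumed trimming rule)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k]

def z_prime(pos, neg):
    """Z' = 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    spread = statistics.stdev(pos) + statistics.stdev(neg)
    return 1.0 - 3.0 * spread / abs(statistics.mean(pos) - statistics.mean(neg))

def ssmd(pos, neg):
    """SSMD = (mean_pos - mean_neg) / sqrt(var_pos + var_neg)."""
    diff = statistics.mean(pos) - statistics.mean(neg)
    return diff / (statistics.variance(pos) + statistics.variance(neg)) ** 0.5

# Hypothetical control-well readouts for one plate; the last positive well is an outlier.
pos_wells = [41.0, 39.5, 40.2, 42.1, 38.8, 40.6, 39.9, 55.0]
neg_wells = [5.1, 4.8, 5.5, 4.9, 5.2, 5.0, 4.7, 5.3]

plain = z_prime(pos_wells, neg_wells)
robust = z_prime(trimmed(pos_wells, 0.125), trimmed(neg_wells, 0.125))
print(f"Z' = {plain:.2f}, trimmed Z' = {robust:.2f}, SSMD = {ssmd(pos_wells, neg_wells):.1f}")
```

Trimming before computing Z' keeps a single outlying control well from dominating the standard deviation, which is presumably why a trimmed variant was reported.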

Goals

• Can we identify genes with GMNN-like phenotypes?
  – We already identified a set of genes via thresholding the %G2 parameter
  – We'd like to see what we get when we use a multi-dimensional representation
• Employ predictive modeling to "learn" the phenotype
• Apply clustering and identify biologically relevant clusters

What Do GMNN Wells Look Like?

Cell-Level Modeling

• A first approach was to match distributions of individual wells with the overall distribution from the positive control wells
  – Expected that the distribution for GMNN wells should match the positive control
  – Use the KS test to identify wells with similar distributions
  – Doesn't work too well, even for GMNN itself
  – Considers 1 parameter at a time (though a 2D KS test is possible)
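As a rough illustration of the distribution-matching idea, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical CDFs of two samples. The sketch below computes only the statistic (no p-value), and the per-cell intensity values are hypothetical; in practice a library routine such as SciPy's `ks_2samp` would normally be used.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The maximum gap is always attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical per-cell DNA intensities for the pooled positive control and two wells.
positive_control = [2.1, 2.4, 3.9, 4.2, 4.4, 4.8, 5.1]
well_similar = [2.0, 2.6, 3.8, 4.1, 4.6, 5.0]
well_different = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]

print(ks_statistic(positive_control, well_similar))    # small gap: similar distributions
print(ks_statistic(positive_control, well_different))  # 1.0: no overlap at all
```

A small statistic marks a well whose single-parameter distribution resembles the positive control. Because each parameter is compared separately, the multi-parameter phenotype is never used, which fits the limitation noted on the slide.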

Random Forest Model

• Ensemble of decision trees (Breiman, 2001)
• Not always the most accurate, but great for exploratory modeling
  – Implicit feature selection
  – Shown not to overfit as more trees are added
  – Provides a measure of feature importance
• Employ the randomForest package from R

http://proteomics.bioengr.uic.edu/malibu/docs/meta_classifiers.html

Cell-Level Modeling

• Removed cells with "incomplete" parameters
• Still leaves 291K positive cases and 3M negative cases
• Developed a random forest model, sampling from negatives to maintain balanced classes
  – Predict whether a cell is GMNN-like
  – Models from multiple samples of the negative control exhibited similar performance

                     Predicted positive   Predicted negative
Actual positive           220,636               72,498
Actual negative            35,614              257,520

Overall 18% error, 25% error on positive class and 12% error on negative class
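The quoted error rates can be reproduced directly from the confusion matrix. The sketch below assumes rows are actual classes and columns are predictions, which is the orientation that reproduces the per-class errors on the slide. The `balanced_sample` helper is a hypothetical illustration of downsampling the negative class, not the authors' code.

```python
import random

# Rows: actual class, columns: predicted class (counts taken from the slide).
confusion = {
    "positive": {"positive": 220_636, "negative": 72_498},
    "negative": {"positive": 35_614, "negative": 257_520},
}

def class_error(actual):
    """Fraction of cells of class `actual` that were assigned to the other class."""
    row = confusion[actual]
    return 1.0 - row[actual] / sum(row.values())

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[c][c] for c in confusion)
overall_error = 1.0 - correct / total

def balanced_sample(positives, negatives, seed=0):
    """Keep all positives and draw an equally sized random subset of the negatives."""
    rng = random.Random(seed)
    return positives, rng.sample(negatives, len(positives))

print(f"overall {overall_error:.1%}, "
      f"positive {class_error('positive'):.1%}, negative {class_error('negative'):.1%}")
```

Repeating `balanced_sample` with different seeds corresponds to the slide's remark that models built from multiple samples of the negative control performed similarly.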

Cell-Level Modeling

• Significant overlap between distributions for the negative and positive controls

Cell-Level Predictions

• Aggregate predictions for all cells in a well to label a well as GMNN-like
• Identify genes with >= 2 siRNAs (i.e., wells) labeled as GMNN-like
  – 31 genes identified (GMNN, KIF11, ESPL1, …)
• Identified expected genes, and most of the set were functionally relevant
  – Also identified a few interesting, novel genes
• Reconfirmation based on Ambion sequences was relatively low (9/31)
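The aggregation from cells to wells to genes can be sketched as follows. The slide does not state how cell predictions were aggregated, so the majority-style cutoff `min_fraction` is an assumption, and the well identifiers, the gene `GENE_X`, and the per-cell calls are hypothetical.

```python
from collections import defaultdict

def gmnn_like_wells(cell_calls, min_fraction=0.5):
    """cell_calls maps well id -> list of per-cell booleans (True = predicted GMNN-like).

    A well is labeled GMNN-like when at least `min_fraction` of its cells are
    (the cutoff is an assumed aggregation rule).
    """
    return {well for well, calls in cell_calls.items()
            if calls and sum(calls) / len(calls) >= min_fraction}

def genes_with_support(well_to_gene, hit_wells, min_sirnas=2):
    """Genes for which at least `min_sirnas` wells (one siRNA per well) were hits."""
    counts = defaultdict(int)
    for well in hit_wells:
        counts[well_to_gene[well]] += 1
    return sorted(gene for gene, n in counts.items() if n >= min_sirnas)

# Hypothetical per-cell predictions for four wells, two siRNAs per gene.
cell_calls = {
    "A01": [True, True, False, True],
    "A02": [True, False, True, True],
    "B01": [False, False, True, False],
    "B02": [True, True, True, False],
}
well_to_gene = {"A01": "GMNN", "A02": "GMNN", "B01": "GENE_X", "B02": "GENE_X"}

hit_wells = gmnn_like_wells(cell_calls)
hit_genes = genes_with_support(well_to_gene, hit_wells)
print(hit_genes)
```

Requiring two independent siRNAs per gene is a common guard against a single off-target reagent driving the call.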

Well-Level Modeling

• Started with 27 parameters from MetaXpress
• Performed automated feature selection
  – Remove undefined, constant features
  – Manually removed a few highly correlated features
• Work with 12 parameters
• Convert to Z-scores
• Positive & negative controls are nicely separated

[Figure panels: All Wells; Control Wells]
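The automated part of this preprocessing (dropping undefined or constant features, then converting to Z-scores) can be sketched as below. It assumes Z-scores are computed per feature across all wells, which the slide does not specify. The feature values are hypothetical, and the manual removal of correlated features is not shown.

```python
import math
import statistics

def usable(column):
    """Keep a feature only if every value is defined and the values are not all equal."""
    if any(v is None or math.isnan(v) for v in column):
        return False
    return max(column) > min(column)

def z_scores(column):
    """Standardize one feature to zero mean and unit (sample) standard deviation."""
    mu, sd = statistics.mean(column), statistics.stdev(column)
    return [(v - mu) / sd for v in column]

def preprocess(features):
    """features maps name -> per-well values; returns Z-scored usable features."""
    return {name: z_scores(col) for name, col in features.items() if usable(col)}

# Hypothetical well-level values; only the first feature survives filtering.
features = {
    "X.G2": [10.0, 12.0, 14.0, 40.0],
    "constant_feature": [1.0, 1.0, 1.0, 1.0],
    "undefined_feature": [0.5, float("nan"), 0.7, None],
}
scaled = preprocess(features)
print(scaled)
```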

Parameter Distributions

Model Performance

• Classification model trained using the positive (GMNN-like) and negative (not GMNN-like) controls
• Perfect classification!
  – Worrying: overfitting?
  – Nearly 99% of the control wells were confidently classified as positive or negative

                     Predicted positive   Predicted negative
Actual positive             1504                    0
Actual negative                0                 1504

Descriptor Importance

• What does the model identify as the most relevant descriptors?
• Some parameters are moderately correlated

[Figure: random forest descriptor importance (MeanDecreaseGini, axis 0 to 300) for the 12 parameters: Cell.MitoticAverageIntensity, Cell.DNAAverageIntensity, X.SPhase, G2Cells, DNABackgroundValue, Cell.DNAArea, X.G0.G1, Cell.DNAIntegratedIntensity, Cell.MitoticIntegratedIntensity, X.G2, SPhaseCells, G0.G1Cells]
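The MeanDecreaseGini axis refers to the decrease in Gini node impurity attributed to splits on each variable, averaged over the trees of the forest. The sketch below shows only the per-split building block, using hypothetical labels. How the randomForest package weights and accumulates these decreases internally is not reproduced here.

```python
def gini(labels):
    """Gini impurity of a collection of class labels: 1 - sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(parent, left, right):
    """Impurity decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Hypothetical node holding four positive and four negative control wells.
parent = ["pos"] * 4 + ["neg"] * 4
clean_split = gini_decrease(parent, ["pos"] * 4, ["neg"] * 4)
useless_split = gini_decrease(parent, ["pos", "pos", "neg", "neg"], ["pos", "pos", "neg", "neg"])
print(clean_split, useless_split)
```

A descriptor that often produces clean splits accumulates a large total decrease, which is what places it high in the importance ranking.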

Random Forest Predictions

• We use the model to predict the class for all the remaining wells
• All four siRNAs targeting GMNN are classified as Geminin-like with high confidence

[Figure: histogram of the predicted probability of being Geminin-like (x-axis, 0.0 to 1.0) against percent of total wells (y-axis, 0 to 10)]

Random Forest Predictions

• Select genes for which > 75% of their siRNAs are predicted to be Geminin-like with probability > 0.8
• Good overlap with cell-level model

[Figure: probability of being Geminin-like (y-axis, 0.0 to 1.0) for the siRNAs of each selected gene: AURKA, AURKB, BRD8, C8orf79, CDCA5, CDCA8, CRAT, ESPL1, F12, FBXO5, GMNN, GUSB, INCENP, ITPKA, JUN, KCNH6, KIF11, MLL4, OR10A2, PLK1, PSMA1, PSMB4, ROBO2, RPLP2, SNRK, TOP2A, TRIM64, TTK, UBC, WRN]
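The gene selection rule from this slide (more than 75% of a gene's siRNAs predicted Geminin-like with probability above 0.8) is easy to state in code. The probabilities and the gene `GENE_A` below are hypothetical, not values from the screen.

```python
def select_genes(sirna_probs, min_prob=0.8, min_fraction=0.75):
    """sirna_probs maps gene -> predicted P(Geminin-like) for each of its siRNAs.

    A gene is selected when strictly more than `min_fraction` of its siRNAs
    have a probability strictly above `min_prob`.
    """
    selected = []
    for gene, probs in sirna_probs.items():
        confident = sum(p > min_prob for p in probs)
        if probs and confident / len(probs) > min_fraction:
            selected.append(gene)
    return sorted(selected)

# Hypothetical probabilities, four siRNAs per gene.
sirna_probs = {
    "GMNN": [0.95, 0.90, 0.99, 0.85],
    "GENE_A": [0.90, 0.90, 0.90, 0.50],  # 3 of 4 confident: exactly 75%, so not selected
}
print(select_genes(sirna_probs))
```

With four siRNAs per gene, the strict "> 75%" threshold means all four must pass. Three out of four is exactly 75% and falls short.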

GO Enrichment

• GO Biological Processes enriched in this set of selected genes are relevant to the biology
• Similarly with pathways (from GeneGo)

Clustering

• RF classification is useful, but doesn't directly tell us much about finer groups of genes that might be phenotypically related
• So we apply unsupervised clustering (PAM)
  – Explore different numbers of clusters
  – Evaluate statistical cluster quality metrics
  – Evaluate biologically motivated quality metrics
• We considered both plate-wise and experiment-wise clustering protocols
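PAM (partitioning around medoids) represents each cluster by one of its own data points. The sketch below is a simplified alternating k-medoids heuristic with a deterministic farthest-point start, not the full BUILD/SWAP procedure of PAM as implemented in R's cluster package. The two-dimensional "wells" are hypothetical stand-ins for the Z-scored parameter vectors.

```python
import math

def init_medoids(points, k):
    """Start from point 0, then repeatedly add the point farthest from the chosen medoids."""
    medoids = [0]
    while len(medoids) < k:
        far = max(range(len(points)),
                  key=lambda i: min(math.dist(points[i], points[m]) for m in medoids))
        medoids.append(far)
    return medoids

def assign(points, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: math.dist(p, points[medoids[c]]))
            for p in points]

def k_medoids(points, k, max_iter=100):
    """Alternate between assigning points and re-picking each cluster's medoid."""
    medoids = init_medoids(points, k)
    for _ in range(max_iter):
        labels = assign(points, medoids)
        updated = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:            # keep the old medoid if a cluster emptied out
                updated.append(medoids[c])
                continue
            # The new medoid is the member with the smallest total distance to the others.
            updated.append(min(members, key=lambda i: sum(
                math.dist(points[i], points[j]) for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(points, medoids)

# Hypothetical wells forming two well-separated groups.
wells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels = k_medoids(wells, 2)
print(medoids, labels)
```

Because medoids are actual wells, each cluster can be summarized by a real siRNA well, which makes manual spot checks of cluster representatives straightforward.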

Platewise Clustering (k=4)

• Cluster assignments can't be directly compared across plates
• Good to see that control columns are distinctly clustered
• Certain plates show no membership to the 'GMNN cluster'

Experimentwise Clustering (k=2)

• Encouraging to see clean separation between control columns
• Bulk of wells are identified as inactive
• We can compare results from this clustering to RF classification
  – 6 genes identified, with multiple siRNAs clustered with negative control

Experimentwise Clustering (k=2)

• 6 genes identified with multiple siRNAs clustered with the negative control
• These were confidently identified by the RF model

[Figure: probability of being Geminin-like (y-axis, 0.0 to 1.0) for the RF-selected genes, AURKA through WRN, as on the earlier Random Forest Predictions slide]

How Many Clusters?

• A priori, difficult to decide how many clusters there should be
  – Manual spot checks did not identify distinctly different morphologies, counts
• Evaluate clusters with varying k and calculate average silhouette width
• Clustering based on the Euclidean metric doesn't do a good job

[Figure: average silhouette width (y-axis, 0.2 to 0.7) versus number of clusters (x-axis, 2 to 20)]
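For reference, the silhouette width of a point compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The sketch below averages this over all points. The points and labels are hypothetical, at least two clusters are assumed, and the distance function is a parameter because the slide notes that the Euclidean metric did not work well.

```python
import math

def average_silhouette(points, labels, dist=math.dist):
    """Mean silhouette width over all points; requires at least two clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # singleton cluster: width is 0 by convention
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(sum(dist(points[i], points[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Hypothetical wells: two tight groups, scored with a good and a poor labeling.
wells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = average_silhouette(wells, [0, 0, 0, 1, 1, 1])
poor = average_silhouette(wells, [0, 1, 0, 1, 0, 1])
print(f"good labeling: {good:.2f}, poor labeling: {poor:.2f}")
```

Computing this value for each candidate k and plotting it gives the kind of curve shown on the slide.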

How Many Clusters?

• One approach is to ignore clusterings that have spread all GMNN siRNAs across multiple clusters
• The current data suggests that we stick to k = 5

Biological Enrichment in Clusters

• Considering 5 clusters
• Some clusters are annotated with more relevant terms

[Figure label: cluster containing 3/4 of the GMNN siRNAs]

Signal Enhancement in Clusters

• Signal is significantly enhanced in some clusters versus others
• Clusters 1, 2 and 4 did not contain any siRNAs above Z = 3

Making a Final Hitlist

• Off-target effects (OTE) are a major confounding factor
• We are able to assess OTE on a gene-by-gene basis using Common Seed Analysis
• Select genes from individual clusters, using %G2 and number of siRNAs as secondary filters
• Combine with hits from random forest model

Marine, S. et al., J. Biomol. Screen., 2011, ASAP

Reconfirmation

• 18/211 genes selected based on thresholding from the primary screen reconfirmed using Ambion sequences
• Considering just the genes selected by the random forest and/or clustering methods
  – 11/30 genes selected by RF reconfirmed using Ambion libraries
  – 5/6 genes identified by RF & clustering reconfirmed using multiple libraries
    • ESPL1, FBXO5, INCENP, KIF11 reconfirmed very strongly
• Based on k = 5 clustering,
  – 23/181 genes from cluster 3 reconfirmed
  – 5/5 genes from cluster 5 reconfirmed
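Expressing the reconfirmation counts as rates makes the selection methods easier to compare. The counts below are taken from the slides (including the 9/31 figure from the cell-level model); the labels are shorthand introduced here.

```python
# (reconfirmed, selected) counts as quoted on the slides.
reconfirmed = {
    "Thresholding on the primary screen": (18, 211),
    "Cell-level random forest": (9, 31),
    "Well-level random forest": (11, 30),
    "Random forest and clustering": (5, 6),
    "Cluster 3 (k = 5)": (23, 181),
    "Cluster 5 (k = 5)": (5, 5),
}

rates = {name: hits / total for name, (hits, total) in reconfirmed.items()}

# Print the methods from highest to lowest reconfirmation rate.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.1%}")
```

The small, model-derived gene sets reconfirm at a much higher rate than the larger threshold-based list, although the denominators for the best-performing sets are only five or six genes.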

Outlook

• Complements traditional threshold-based selection methods
• The random forest approach is sufficiently accurate and lets us avoid explicitly selecting features up front
• Combining it with clustering lets us zoom into biologically relevant clusters of genes

Acknowledgements

• Scott Martin
• Pinar Tuzmen
• Carleen Klump
• Eugen Buehler