Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Making protein function and subcellular localization predictions: challenges and opportunities. Fiona Brinkman, Department of Molecular Biology and Biochemistry (Associate, Faculty of Health Sciences and School of Computing Sciences), Simon Fraser University, Greater Vancouver, BC, Canada. April 2014

description

Fiona Brinkman talk at Automated Function Prediction SIG, ISMB 2014, Boston, MA, USA

Transcript of Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Page 1: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Making protein function and subcellular localization predictions: challenges and opportunities

Fiona Brinkman

Department of Molecular Biology and Biochemistry (Associate, Faculty of Health Sciences and School of Computing Sciences)

Simon Fraser University, Greater Vancouver, BC, Canada

April 2014

 

Page 2: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

• Improving seq similarity/orthology-based predictions, a keystone of many predictors

• Improving pathway/network-based analysis to identify protein functions

• Future challenges and opportunities (using protein localization as an example of what is to come)

What we MUST do to move AFP forward…

Page 3: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

At >80% sequence identity, one-to-one orthologs in particular are more functionally similar to each other than other orthologs or paralogs are

Functional similarity measured by GO annotation similarity (13 species). Altenhoff AM et al. PLoS Comput Biol. 2012

Page 4: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities


Page 5: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities
Page 6: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

If true ortholog is missing… (gene loss, or incomplete genome)

Reciprocal Best BLAST Hit (RBBH): FAIL

One of the orthologous genes diverges faster…

[Figure: species tree and gene trees for Ingroup1, Ingroup2, and Outgroup, with the RBBH drawn between paralogs. Labels: Usual Divergence, Paralog, RBBH.]
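The failure mode on this slide can be sketched in a few lines. This is a minimal illustration of the reciprocal-best-hit rule only, assuming the best hits have already been computed; the gene names and the `reciprocal_best_hits` helper are hypothetical and are not taken from the talk or from any BLAST tool.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Toy best-hit tables (hypothetical names, not real BLAST output).
# The true ortholog of g1_y is absent from genome 2, so the best
# remaining hit is a paralog, and it is still reciprocal.
ingroup1_to_ingroup2 = {"g1_x": "g2_x", "g1_y": "g2_y_paralog"}
ingroup2_to_ingroup1 = {"g2_x": "g1_x", "g2_y_paralog": "g1_y"}

pairs = reciprocal_best_hits(ingroup1_to_ingroup2, ingroup2_to_ingroup1)
print(pairs)
```

The second pair is reported even though it links a gene to a paralog, which is why a reciprocal best hit alone cannot confirm orthology.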

Page 7: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Ortholuge

Uses phyletic ratios to differentiate Supporting Species Divergence (SSD) orthologs vs proteins more divergent than expected (non-SSD)

Ratio1 = distance{ingroup1-ingroup2} / distance{ingroup1-outgroup}

Ratio2 = distance{ingroup1-ingroup2} / distance{ingroup2-outgroup}

[Figure: Ortholuge analysis comparing Burkholderia cepacia & B. cenocepacia (outgroup: B. pseudomallei), with SSD and non-SSD regions marked.]

Whiteside et al 2013 PMID 23203876
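The two ratios can be computed directly once the three pairwise distances are known. The sketch below assumes the distances are already available as numbers; the function name and the example values are hypothetical. It does not reproduce how Ortholuge derives distances or sets its SSD/non-SSD cut-offs.

```python
def ortholuge_ratios(d_in1_in2, d_in1_out, d_in2_out):
    """Return (ratio1, ratio2) as defined on the slide.

    ratio1 = distance(ingroup1, ingroup2) / distance(ingroup1, outgroup)
    ratio2 = distance(ingroup1, ingroup2) / distance(ingroup2, outgroup)
    """
    if d_in1_out <= 0 or d_in2_out <= 0:
        raise ValueError("outgroup distances must be positive")
    return d_in1_in2 / d_in1_out, d_in1_in2 / d_in2_out


# Hypothetical distances for one candidate ortholog pair.
ratio1, ratio2 = ortholuge_ratios(d_in1_in2=0.1, d_in1_out=0.4, d_in2_out=0.5)
print(ratio1, ratio2)
```

Ratios well below 1 are what one would expect when the two ingroup genes are closer to each other than either is to the outgroup; unusually high ratios are the pattern the slide describes as more divergent than expected.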

Page 8: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

[Figure: Predicted Orthologs in 600 Pairs of Bacterial Species. Y-axis: Proportion (0 to 1). Categories: KEGG Orthology, Pfam Domains, TIGRFAM Annotations, Subcellular Localizations. Series: SSD Ortholog, Non-SSD. * p-value < 0.05, marked on all four categories.]

[Figure: One or more homologs (based on BLAST hits). Y-axis: Proportion (0 to 0.7). Series: SSD orthologs, Non-SSD. * p-value < 0.05.]

Non-SSD "Orthologs" more likely:

- Functionally dissimilar
- Have one or more homologs

Page 9: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

A Database of Ortholuge Evaluations: OrtholugeDB (tinyurl.com/ortholugeDB)

• Provides pre-computed ortholog predictions for >1400 bacteria and archaea (update coming next month!), with further Ortholuge assessments

• Covers all genes in fully sequenced bacterial and archaeal genomes

• Facilitates visualization and evaluation of ortholog predictions

Page 10: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Similar issue with initial metagenomics seq functional evaluation

1. Simulated reads from Pseudomonas aeruginosa PAO1

2. Created databases at different levels of clade exclusion
   • E.g. for species clade exclusion, removed all Pseudomonas aeruginosa genomes from the database

3. Used RAPSearch2 and MEGAN5 to assign functional categories to the simulated reads

4. Calculated proportion of reads assigned to each functional category relative to how many reads expected
   • E.g.:

Category | Expected # assigned | Actual # assigned | Relative Proportion
Membrane Transport | 567 | 583 | 1.02822
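Step 4 is a simple ratio of actual to expected read counts. A minimal sketch, using only the Membrane Transport row from the slide; the function and dictionary names are illustrative, not from the talk.

```python
def relative_proportion(expected, actual):
    """Reads actually assigned to a category, relative to the number expected."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return actual / expected


# (expected, actual) read counts per functional category, from the slide.
counts = {"Membrane Transport": (567, 583)}

for category, (expected, actual) in counts.items():
    print(f"{category}: {relative_proportion(expected, actual):.5f}")
```

A value near 1 means the category is predicted about as often as expected; values notably above 1 indicate overprediction.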

Page 11: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Most functional categories are predicted well but some are overpredicted (ratio notably >1)

[Figure: Ratio of assigned relative to expected (y-axis, 0 to 2.5) per functional category. Level of clade exclusion: None, Species, Family, Class.]

E.g. Endocrine system: 3 problematic orthology groups, all with high numbers of proteins (one has 3538 when the median is 54!)

Page 12: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

The relative proportions of functional categories stay relatively consistent as clade exclusion level increases

[Figure: Proportion of reads assigned (y-axis, 0% to 100%) by clade exclusion level (None, Species, Family, Class). Legend: Xenobiotics Biodegradation and Metabolism; Transcription; Signal Transduction; Replication and Repair; Infectious Diseases; Nucleotide Metabolism; Neurodegenerative Diseases; Metabolism of Other Amino Acids; Metabolism of Cofactors and Vitamins; Membrane Transport.]

Page 13: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Improving pathway-based analysis

Issue: Biomolecular pathway classifications can bias analyses of pathways found to be upregulated or downregulated by transcriptome (or other omics-level) analysis

What you identify depends on how everything is classified…

Need better "signatures" of pathways…

Page 14: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Dealing with PART of the issue…

[Figure: Distribution of the number of associated pathways for human genes in KEGG. Categories: 1, 2, 3, 4, 5, 6, 7-45.]

Membership of a gene in multiple pathways is the norm, not the exception…

Foroushani et al, 2014 PMCID: PMC3883547

Page 15: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Not all genes are equal…

[Figure legend: Maroon: pathway member. White: no membership.]

All genes are not equivalent signatures of a given pathway

Foroushani et al, 2014 PMCID: PMC3883547

Page 16: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Example: Treated vs Untreated Mouse Severe Inflammation – Gene Expression Dataset

Individual Gene ORA:
• Antigen processing and presentation
• Graft-versus-host disease
• Natural killer cell mediated cytotoxicity
• Viral myocarditis
• Allograft rejection
• Cell adhesion molecules (CAMs)
• Chemokine signaling pathway
• Type I diabetes mellitus
• Toll-like receptor signaling pathway
• Cytokine-cytokine receptor interaction

Standard Over-Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA) treat all genes in a given pathway as equal indicators that that pathway is significant. → Emphasizes generalist genes/pathways

Foroushani et al, 2014 PMCID: PMC3883547

Page 17: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Pathway Signatures using SIGORA: Identifying genes/gene pairs uniquely associated with a single pathway

SIGORA identifies statistically significant enrichment of Pathway Signatures in a gene list of interest.

Foroushani et al, 2014 PMCID: PMC3883547
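The idea of a single-gene pathway signature can be sketched as follows. This is a simplified illustration, not the SIGORA implementation: it only finds genes that belong to exactly one pathway, and it omits the gene-pair signatures and the statistical enrichment test described on the slide. The pathway and gene names are hypothetical.

```python
from collections import defaultdict


def single_pathway_genes(pathway_to_genes):
    """Map each pathway to the genes that occur in that pathway and no other."""
    gene_to_pathways = defaultdict(set)
    for pathway, genes in pathway_to_genes.items():
        for gene in genes:
            gene_to_pathways[gene].add(pathway)

    signatures = defaultdict(set)
    for gene, pathways in gene_to_pathways.items():
        if len(pathways) == 1:
            (only_pathway,) = pathways
            signatures[only_pathway].add(gene)
    return dict(signatures)


# Hypothetical toy annotation: gene2 and gene3 are shared "generalist" genes.
toy_pathways = {
    "Pathway A": {"gene1", "gene2", "gene3"},
    "Pathway B": {"gene2", "gene3", "gene4"},
}

signatures = single_pathway_genes(toy_pathways)
print(signatures)
```

In this toy case only gene1 and gene4 are informative about which pathway is involved; the shared genes would count toward both pathways in a standard over-representation analysis.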

Page 18: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Example: Treated vs Untreated Mouse Severe Inflammation – Gene Expression Dataset  

SIGORA avoids many biologically less plausible results seen with other methods that over-emphasize generalist genes/pathways.

For example, 6/8 up-regulated genes in the "Type I diabetes mellitus" pathway are also in the "Antigen processing and presentation" pathway.

Individual Gene ORA:
• Antigen processing and presentation
• Graft-versus-host disease
• Natural killer cell mediated cytotoxicity
• Viral myocarditis
• Allograft rejection
• Cell adhesion molecules (CAMs)
• Chemokine signaling pathway
• Type I diabetes mellitus
• Toll-like receptor signaling pathway
• Cytokine-cytokine receptor interaction

SIGORA:
• Antigen processing and presentation
• Natural killer cell mediated cytotoxicity
• Complement and coagulation cascades
• Toll-like receptor signaling pathway
• Cytokine-cytokine receptor interaction
• Leukocyte transendothelial migration
• Cell adhesion molecules (CAMs)
• Cytosolic DNA-sensing pathway
• Chemokine signaling pathway

Page 19: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Future challenges and opportunities

(using bacterial protein localization as an example of what is to come)

(Gardy & Brinkman 2006 Nature Reviews Microbiology 4:741)

Page 20: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Bacterial protein subcellular localization prediction

• Aids genome annotation and prediction of protein function

• Used to identify cell surface/secreted targets for drugs and diagnostics, as well as potential vaccine components

• Many pathogen-associated virulence factors predicted as secreted

(Gardy & Brinkman 2006 Nature Reviews Microbiology 4:741)

Page 21: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

PSORTb: bacterial protein subcellular localization (SCL) prediction software

Signal peptides: Non-cytoplasmic

Amino acid composition/patterns: All localizations
 - Support Vector Machines trained with amino acid compositions or frequent subsequences

Transmembrane helices: Cytoplasmic membrane
 - HMMTOP

PROSITE motifs with 100% precision: All localizations

Outer membrane motifs: Outer membrane
 - Identified by association-rule mining

Homology to proteins of experimentally known localization: All loc.
 - "SCL-BLAST" against proteins of known localization
 - E=10e-10 and length restriction for precision

Integration with a Bayesian Network

Yu et al (2010) Bioinformatics 26:1608

Page 22: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

PSORTb: version 3

Main localizations predicted

Bacteria and Archaea predictions

Sub-category localization predictions
• Type III secretion apparatus
• Pili/fimbria
• Host-associated SCL
• Flagellum
• Spore
• Gas vesicle

Page 23: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

PSORTb v3.0: high precision, improved sensitivity/recall and genome prediction coverage

Organism group | Software | Precision | Recall
Gram-negative | PSORTb v3.0 | 96.8 | 88.0
Gram-negative | PSORTb v2.0 | 95.7 | 81.5
Gram-positive | PSORTb v3.0 | 97.0 | 93.2
Gram-positive | PSORTb v2.0 | 96.7 | 89.3
Archaea | PSORTb v3.0 | 95.0 | 93.3

[Figure and table panel titles: Five-fold cross validation; Genome prediction coverage. Chart: y-axis 0 to 100, PSORTb v.2.0 vs PSORTb v.3.0, for Gram-negative and Gram-positive.]

A computational predictor more accurate than related high-throughput lab methods

Page 24: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Challenge: Organismal diversity

Classic Gram positive bacteria, monoderms: Thick peptidoglycan, no outer membrane
Classic Gram negative bacteria, diderms: Thin peptidoglycan + outer membrane

…but can have Gram negatives with no outer membrane (e.g. Mycoplasma) or a different outer membrane (Synergistetes, Sphingomonas), or Gram positive (thick peptidoglycan) with a different outer membrane (Deinococcus: 6 layers in cell envelope!), or "acid fast" with asymmetric lipid-containing thick cell wall (Mycobacteria)

Plus bacterial organelles and other substructures (e.g. magnetosome of Magnetospirillum)...

Solution:
- For whole genome (deduced-proteome) analysis, detect key protein markers of a particular cell type (e.g. Omp85, essential for classic Gram negative membrane)
- For single protein analysis, learn from above analysis, plus literature curation, the most likely cell type for a given phylum

…then make predictions assuming that cell "type"

Reproduced under Fair Use
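The whole-genome marker idea can be sketched as a simple lookup over a deduced proteome's annotations. This is only an illustration under strong assumptions: it matches marker names in free-text annotation strings, whereas a real implementation would more plausibly use sequence similarity. Omp85 is the only marker named on the slide; the function name and the toy annotations are hypothetical.

```python
def has_marker(proteome_annotations, markers=("Omp85",)):
    """True if any marker name appears in any annotation string (case-insensitive)."""
    return any(
        marker.lower() in annotation.lower()
        for annotation in proteome_annotations
        for marker in markers
    )


# Hypothetical annotation strings for two deduced proteomes.
proteome_with_marker = ["DNA gyrase subunit A", "outer membrane protein Omp85"]
proteome_without_marker = ["DNA gyrase subunit A", "ribosomal protein L2"]

print(has_marker(proteome_with_marker))     # marker found
print(has_marker(proteome_without_marker))  # marker not found
```

The result would then select which cell "type" to assume before localization predictions are made, as the slide proposes.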

Page 25: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Challenge: Temporal, contextual diversity

Proteins can be associated with multiple subcellular localizations

e.g. Cell division proteins, Autotransporters, "protein A dependent on protein B"

Solution: Note all possible localizations, since temporal, contextual predictions are non-trivial: not enough knowledge for most

Kjærgaard K et al. J. Bacteriol. 2000;182:4789-4796

Page 26: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Challenge: Metagenomics

High demand for PSORTb to be able to analyze metagenomic sequences… under development

Need taxonomy data to aid predictions (then enable appropriate cell type analysis)

Page 27: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

             

Through over a decade of curating for, making and evaluating predictors of protein localization, genomic islands, etc.

What makes a great predictor?

Page 28: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

             

Through over a decade of curating for, making and evaluating predictors of protein localization, genomic islands, etc.

What makes a great predictor? (besides it being right) ☺

Page 29: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Bioinformatics Predictor's Code of Conduct

- Never force predictions: always have a prediction option/category of "unknown"

Inspired by the classic "Data Provider's Code of Conduct" in Stein (2002) Nature 417, 119-120
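One simple way to avoid forced predictions is to return "Unknown" whenever no category scores high enough. The sketch below assumes a predictor that yields a score per localization; the threshold value, the function name, and the example scores are all hypothetical and are not taken from PSORTb.

```python
def predict_with_unknown(scores, threshold=0.75):
    """Return the top-scoring label, or "Unknown" if its score is below threshold."""
    if not scores:
        return "Unknown"
    best_label = max(scores, key=scores.get)
    if scores[best_label] < threshold:
        return "Unknown"
    return best_label


# Hypothetical per-localization scores for two proteins.
confident = {"Cytoplasmic": 0.92, "Outer membrane": 0.05, "Extracellular": 0.03}
ambiguous = {"Cytoplasmic": 0.40, "Outer membrane": 0.35, "Extracellular": 0.25}

print(predict_with_unknown(confident))
print(predict_with_unknown(ambiguous))
```

Raising the threshold trades recall for precision: fewer proteins get a prediction, but those that do are more likely to be right.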

Page 30: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Example of forced predictions: PSORT I prediction method

Nakai & Kanehisa, Proteins: Structure, Function, Genetics (1991) Overall Accuracy = 69%

What’s wrong here?

Page 31: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Example of forced predictions: PSORT I prediction method

Nakai & Kanehisa, Proteins: Structure, Function, Genetics (1991) Overall Accuracy = 69%

No secreted/extracellular localization!

Page 32: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Bioinformatics Predictor's Code of Conduct

Inspired by the classic "Data Provider's Code of Conduct" in Stein (2002) Nature 417, 119-120

- Never force predictions: always have "unknown" option/category
- Ensure open source: enable viewing of prediction method details
- Predictor should easily be trainable with different datasets (if applicable; so others can robustly evaluate accuracy)
- Have ability to run locally or over web (with an API is preferred)
- Provide access to old versions (at minimum when transitioning to new version)
- Encourage continuing curation from the literature/lab experiments! Incorporate some curation efforts into predictor funding applications

Page 33: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Bioinformatics Predictor's Code of Conduct: evaluation

- Evaluate precision and recall (and accuracy measure combos thereof) with x-fold cross validation and/or new datasets (like CAFA!)
- ID errors, biases and provide guidance to users re issues to watch for
  - bias in training and/or testing datasets ("homology reduction", "clade exclusion" may help)
  - errors in "gold standard" lab-based measure
  - contextual/temporal changes in proteins, impacting prediction (e.g. function changes when another protein/compound present)

What we MUST do: Guide users to not just blindly use a predictor and its default output.
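Precision and recall for a predictor that is allowed to answer "Unknown" can be sketched as below. The convention used here is an assumption, not one stated in the talk: precision is computed over the predictions actually made, and recall over all proteins with a known label. The labels are hypothetical.

```python
def precision_recall(true_labels, predicted_labels, unknown="Unknown"):
    """Precision over predictions made; recall over all labelled proteins."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    made = [(t, p) for t, p in zip(true_labels, predicted_labels) if p != unknown]
    correct = sum(1 for t, p in made if t == p)
    precision = correct / len(made) if made else 0.0
    recall = correct / len(true_labels) if true_labels else 0.0
    return precision, recall


# Hypothetical held-out fold: one protein left as Unknown, one predicted wrongly.
true_labels = ["Cytoplasmic", "Outer membrane", "Extracellular", "Cytoplasmic"]
predicted = ["Cytoplasmic", "Unknown", "Cytoplasmic", "Cytoplasmic"]

precision, recall = precision_recall(true_labels, predicted)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In x-fold cross validation, the same calculation would be repeated on each held-out fold and the results averaged.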

Page 34: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Bioinformatics Predictor's Code of Conduct

What we MUST do:

Guide users to not just blindly use a predictor and its default output.

Curators, experimentalists, and automated function predictor developers must coordinate efforts more
• Experimentalists working on what they think best…
• Curators curating what they prioritize…
• Function predictors optimizing prediction using existing data…

Function predictors/bioinformaticists need to get in the driver's seat more for research

Page 35: Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Brinkman Lab Kayaking Trip, Summer 2013

(Next up, Archery Tag!)

Amir Foroushani
Matthew Laird
David Lynn
Raymond Lo
Mike Peabody
Thea Van Rossum
Matthew Whiteside
Nancy Yu