
InterPro ontology mapped into the Chado schema cvterm_relationship table

Protocols for Representation of Protein Domain Annotations in Clade-Oriented Databases: a Case Study at the Legume Information System using Chado/Tripal

Pooja E. Umale, Andrew D. Farmer

National Center for Genome Resources (NCGR), Santa Fe, NM 87505, USA

Introduction  Methods  Results

InterPro Consortium Databases

PROSITE
HAMAP
PFAM
PRINTS
ProDom
SMART
TIGRFAMS
PIRSF
SUPERFAMILY
CATH-Gene3D
PANTHER

Input FASTA amino acid sequences
Score BLAST hits
Tokenize BLAST hits
Score the tokens (lexical analysis)
Gene Ontology annotation
Assign best scoring description
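The token-scoring steps in the workflow above can be illustrated with a deliberately simplified sketch. This is not AHRD's actual scoring scheme (the AHRD repository documents that); the weighting by bit score, the stop-word list, the function names, and the example hits are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop words; AHRD's real filters are configurable and differ.
STOP_WORDS = {"protein", "putative", "predicted", "family", "like"}


def tokenize(description):
    """Lower-case a description and split it into informative word tokens."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return [w for w in words if w not in STOP_WORDS]


def best_description(hits):
    """Pick the hit description whose tokens are best supported across hits.

    hits: list of (description, bit_score) pairs for one query protein.
    """
    if not hits:
        return None
    # Score each token by the summed bit scores of the hits that contain it.
    token_score = defaultdict(float)
    for description, bit_score in hits:
        for token in set(tokenize(description)):
            token_score[token] += bit_score

    def description_score(description):
        tokens = set(tokenize(description))
        if not tokens:
            return 0.0
        # Mean token score, so long descriptions are not favoured automatically.
        return sum(token_score[t] for t in tokens) / len(tokens)

    # max() keeps the first hit on ties, i.e. the earliest-listed description.
    return max(hits, key=lambda hit: description_score(hit[0]))[0]


# Hypothetical BLAST hits (description, bit score) for a single query.
hits = [
    ("Phenylalanine ammonia-lyase", 850.0),
    ("Putative phenylalanine ammonia-lyase protein", 800.0),
    ("Uncharacterized protein", 790.0),
]
print(best_description(hits))
```

Averaging token scores rather than summing them is one possible design choice: it keeps a short, well-supported description competitive with a longer one that merely contains more words.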

InterPro is a searchable database that we use in our project to elucidate protein function and annotation. The InterProScan tool scans query sequences against the InterPro protein signature databases. We employed AHRD (https://github.com/groupschoof/AHRD) to assign human-readable descriptions to predicted proteins. For a better user experience and visualization, we also incorporated the protein domain annotations into the MSA view provided by Jalview. The Chado database, Drupal (an open source content management system) and GMOD’s Tripal are the software tools used for data storage and module/website development.

Acknowledgements  

Web-based presentation of protein domain data and its annotations is available at http://www.legumeinfo.org/search/protein_domains. We developed a shareable Tripal extension module for this purpose, enabling search by domains and interlinking our domain-oriented representation to other modules that showcase genes and gene families of legumes.

Gene  family  set  sharing  common  domain  

Chado schema representation of InterProScan results

Example: Jalview display of protein domain annotations on the consensus sequence of a gene family

AHRD  tool  workflow  

Chado schema diagram (tables and columns shown):

feature table (match$1_26_518), protein_hmm_match domain feature: feature_id, organism_id, uniquename, type_id
featureloc table (for source feature 1): featureloc_id, feature_id, srcfeature_id, fmin, fmax
featureloc table (for source feature 2): featureloc_id, feature_id, srcfeature_id, fmin, fmax
organism table: organism_id, genus, species
cvterm table: cvterm_id, cv_id, name
feature table (PF00221), HMM representation of domain: feature_id
feature table (glyma.Glyma.10G209800.1), polypeptide feature: feature_id
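The tables in the schema diagram above can be sketched with an in-memory SQLite database. This is a much-reduced stand-in for Chado, not the real schema: only the columns named on the poster are modelled, plus a `rank` column on featureloc to tell the two locations of a match apart. The coordinates, the `polypeptide_domain` type for the HMM feature, the choice of rank 0 for the polypeptide and rank 1 for the HMM, and the function names are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE organism (organism_id INTEGER PRIMARY KEY, genus TEXT, species TEXT);
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, cv_id INTEGER, name TEXT);
CREATE TABLE feature (
    feature_id  INTEGER PRIMARY KEY,
    organism_id INTEGER REFERENCES organism(organism_id),
    uniquename  TEXT,
    type_id     INTEGER REFERENCES cvterm(cvterm_id)
);
CREATE TABLE featureloc (
    featureloc_id INTEGER PRIMARY KEY,
    feature_id    INTEGER REFERENCES feature(feature_id),
    srcfeature_id INTEGER REFERENCES feature(feature_id),
    fmin INTEGER,
    fmax INTEGER,
    rank INTEGER
);
"""


def build_demo_db():
    """Load one polypeptide, one HMM feature, and the match that links them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO organism VALUES (1, 'Glycine', 'max')")
    conn.executemany("INSERT INTO cvterm VALUES (?, ?, ?)", [
        (1, 1, "polypeptide"),
        (2, 1, "protein_hmm_match"),
        (3, 1, "polypeptide_domain"),  # assumed type for the HMM feature
    ])
    conn.executemany("INSERT INTO feature VALUES (?, ?, ?, ?)", [
        (1, 1, "glyma.Glyma.10G209800.1", 1),
        # Real Chado requires organism_id; the sketch leaves it NULL for the HMM.
        (2, None, "PF00221", 3),
        (3, 1, "match$1_26_518", 2),
    ])
    # The match feature gets two locations; all coordinates are hypothetical.
    conn.executemany("INSERT INTO featureloc VALUES (?, ?, ?, ?, ?, ?)", [
        (1, 3, 1, 25, 518, 0),  # location on the polypeptide (source feature 1)
        (2, 3, 2, 0, 490, 1),   # location on the HMM (source feature 2)
    ])
    return conn


def proteins_with_domain(conn, domain_accession):
    """Return (polypeptide, fmin, fmax) for every match to the given domain."""
    return conn.execute(
        """
        SELECT prot.uniquename, on_prot.fmin, on_prot.fmax
        FROM feature AS hmm
        JOIN featureloc AS on_hmm  ON on_hmm.srcfeature_id = hmm.feature_id
        JOIN featureloc AS on_prot ON on_prot.feature_id = on_hmm.feature_id
                                  AND on_prot.rank = 0
        JOIN feature AS prot ON prot.feature_id = on_prot.srcfeature_id
        WHERE hmm.uniquename = ? AND on_hmm.rank = 1
        ORDER BY prot.uniquename
        """,
        (domain_accession,),
    ).fetchall()


conn = build_demo_db()
print(proteins_with_domain(conn, "PF00221"))
```

Because the match is itself a feature located on both the polypeptide and the HMM, the same join retrieves every protein that shares a domain, which is the kind of query a domain-centred search page needs.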

Display of the set of genes that share a common domain

Protein domains can be conceptualized from a number of perspectives, from their role in defining an individual protein’s structure and function to their evolutionary role in creating novel molecular functions through duplication and recombination into unique multi-domain protein architectures. Although many species- and clade-oriented databases use standard protein domain analyses to characterize the putative functions and cellular localizations of the gene products represented in the genomes and transcriptomes of their species of interest, this is often limited to treating the matched domains as properties of the genes that are simply an aid to their classification and retrieval. While this gene-centric perspective is clearly of great importance, elevating domains to a prominent position in the context of such databases has the potential to provide insights into many interesting biological questions, from the role of domains in constraining and shaping intra-species diversity patterns (including SNPs, splice isoforms, and gene fusions) to their role in providing the basis for the definition of gene family groupings of orthologous and paralogous genes as well as providing insights into their evolutionary dynamics.
We have utilized and extended a set of widely used open source tools for analysis, storage and web-based presentation of protein domain data to populate the Chado database underlying the Legume Information System (http://legumeinfo.org) and to make this data available through a shareable Tripal extension module for enabling search by domains, exploiting the ontological structure of InterPro and interlinking our domain-oriented representation to other modules for presentation of genes and gene families.

Protein  domain  search  page  

dbxref table (IPR001106)
cvterm table (Aromatic amino acid lyase)
Linking columns shown: cvterm_id, dbxref_id
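How an ontology stored in cvterm, dbxref, and cvterm_relationship can be walked for searching is sketched below with SQLite's recursive queries. The tables are reduced to a few columns, the child accessions are placeholders rather than real InterPro entries, and `type_id` 100 simply stands in for an is_a relationship term; the sketch follows the Chado convention that `subject_id` is the child and `object_id` the parent.

```python
import sqlite3


def build_demo_ontology():
    """Create a tiny parent/child hierarchy under IPR001106 (children are fake)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY, accession TEXT);
    CREATE TABLE cvterm (
        cvterm_id INTEGER PRIMARY KEY,
        name      TEXT,
        dbxref_id INTEGER REFERENCES dbxref(dbxref_id)
    );
    CREATE TABLE cvterm_relationship (
        cvterm_relationship_id INTEGER PRIMARY KEY,
        type_id    INTEGER,
        subject_id INTEGER REFERENCES cvterm(cvterm_id),
        object_id  INTEGER REFERENCES cvterm(cvterm_id)
    );
    """)
    conn.executemany("INSERT INTO dbxref VALUES (?, ?)", [
        (1, "IPR001106"),
        (2, "IPR_CHILD_A"),     # placeholder accession
        (3, "IPR_CHILD_B"),     # placeholder accession
        (4, "IPR_GRANDCHILD"),  # placeholder accession
        (5, "IPR_UNRELATED"),   # placeholder accession
    ])
    conn.executemany("INSERT INTO cvterm VALUES (?, ?, ?)", [
        (1, "Aromatic amino acid lyase", 1),
        (2, "hypothetical child A", 2),
        (3, "hypothetical child B", 3),
        (4, "hypothetical grandchild", 4),
        (5, "hypothetical unrelated term", 5),
    ])
    # subject is_a object; type_id 100 stands in for the is_a term.
    conn.executemany("INSERT INTO cvterm_relationship VALUES (?, ?, ?, ?)", [
        (1, 100, 2, 1),
        (2, 100, 3, 1),
        (3, 100, 4, 2),
    ])
    return conn


def term_and_descendants(conn, accession):
    """Return the accession itself plus every term below it, sorted."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(cvterm_id) AS (
            SELECT c.cvterm_id
            FROM cvterm AS c
            JOIN dbxref AS d ON d.dbxref_id = c.dbxref_id
            WHERE d.accession = ?
            UNION
            SELECT r.subject_id
            FROM cvterm_relationship AS r
            JOIN subtree AS s ON r.object_id = s.cvterm_id
        )
        SELECT d.accession
        FROM subtree AS s
        JOIN cvterm AS c ON c.cvterm_id = s.cvterm_id
        JOIN dbxref AS d ON d.dbxref_id = c.dbxref_id
        ORDER BY d.accession
        """,
        (accession,),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_demo_ontology()
print(term_and_descendants(conn, "IPR001106"))
```

A search for a parent entry could then expand to all descendant accessions before looking up matching proteins, so that hits annotated only with a more specific child entry are still returned.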

References/Publications

The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Research, Jan 2015; doi:10.1093/nar/gku1243

InterProScan 5: genome-scale protein function classification. Bioinformatics, Jan 2014; doi:10.1093/bioinformatics/btu031

Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ (2009) Jalview Version 2 - a multiple sequence alignment editor and analysis workbench. Bioinformatics 25: 1189-1191. doi:10.1093/bioinformatics/btp033

Ficklin S.P., Sanderson L.A., Cheng C.H., Staton M.E., Lee T., Cho I.H., Jung S., Bett K.E., Main D. Tripal: a construction toolkit for online genome databases. Database. 2011:bar044.

Example  

GFF  file  storing  iprscan  results  
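A minimal sketch of reading one InterProScan match from a GFF3 line and converting it to Chado-style coordinates follows. The sample line is constructed for illustration from identifiers that appear on this poster: the score, the `protein_match` type, the attribute keys, and the 26..518 coordinates (read off the match ID) are assumptions, and real InterProScan output carries more attributes than shown here.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            # GFF3 values may be percent-encoded; drop surrounding quotes too.
            attrs[key.strip()] = unquote(value.strip().strip('"'))
    return attrs


def parse_match_line(line):
    """Return a dict for a 9-column GFF3 feature line, or None otherwise."""
    cols = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(cols) != 9:
        return None
    attrs = parse_attributes(cols[8])
    start, end = int(cols[3]), int(cols[4])
    return {
        "protein": cols[0],
        "analysis": cols[1],
        "type": cols[2],
        "signature": attrs.get("Name"),
        "match_id": attrs.get("ID"),
        "dbxref": attrs.get("Dbxref"),
        # GFF3 is 1-based inclusive; Chado featureloc is interbase (0-based).
        "fmin": start - 1,
        "fmax": end,
    }


# Hypothetical line assembled from identifiers shown on the poster.
sample = "\t".join([
    "glyma.Glyma.10G209800.1", "Pfam", "protein_match", "26", "518",
    "1.0E-100", "+", ".",
    'Name=PF00221;ID=match$1_26_518;Dbxref="InterPro:IPR001106"',
])
print(parse_match_line(sample))
```

The `fmin`/`fmax` pair produced here is what a loader would write to the featureloc row that places the match on the polypeptide.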


Future Directions

•  Use of the ontology structure of InterPro to enhance searching
•  Display of intraspecific variation in the context of the domain architecture (similar to how we are now displaying interspecific variation in the MSAs)