PharmaMatrix Workshop 2010: Bioinformatic Databases, 14 July 2010, Philip Winter & Ishwar Hosamani

Page 1:

PharmaMatrix Workshop 2010 Bioinformatic Databases

14 July 2010, Philip Winter & Ishwar Hosamani

Page 2:

Database Growth

Source: http://www.kokocinski.net/bioinformatics/databases.php

Page 6:

Database Survey

•  Genes & Proteins: UniProt, GenBank, dbSNP, PDB, GEO, Pfam, TGI

•  Gene & Protein Interactions: BioGRID, KEGG, NetPath

•  Cheminformatics (Drugs & Metabolites): ZINC, DrugBank, PubChem, SciFinder

Page 7:

Curation

•  Manual curation (or just curation): A human creates and annotates the database entry

•  Automatic curation: A computer program creates and annotates the database entry

•  Semi-automatic curation: A combination of manual and automatic

Page 8:

Database Identifiers

•  Every database record will have a unique identifier; often this will be called an accession number, which is assigned when the record is first added to the database

•  Be careful: databases will often permit a record to be modified but keep the same accession number; you should record the version number as well

•  Furthermore, databases may have different rules for handling records that are merged or split
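
One way to follow the advice about version numbers is to store the accession and the version as separate fields. The sketch below assumes the ACCESSION.VERSION convention used by GenBank and RefSeq. The function name `split_versioned` is illustrative only, and other databases may version their records differently.

```python
def split_versioned(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    Returns (identifier, None) when there is no numeric version suffix.
    """
    accession, dot, version = identifier.rpartition(".")
    if dot and accession and version.isdigit():
        return accession, int(version)
    return identifier, None


for raw in ["NM_178014.2", "AB088100.1", "AB088100"]:
    accession, version = split_versioned(raw)
    if version is None:
        print(f"{raw}: no version recorded")
    else:
        print(f"{raw}: accession {accession}, version {version}")
```

Keeping the version as its own field makes it easy to notice later that a record has been updated under the same accession.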

Page 9:

Database Identifier Cheat Sheets

Page 10:

Pattern | Identifier Name | Examples | Entity | Database | URL
[optional “GI:”][digits] | GenInfo Identifier | GI:34222261 | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/
[letter][5 digits] OR [2 letters][6 digits] | GenBank ACCESSION | AB088100 | Nucleotide sequence | GenBank | http://www.ncbi.nlm.nih.gov/
[2 letter type code]_[digits] | RefSeq ACCESSION | NM_178014 | Nucleotide or protein sequence | RefSeq | http://www.ncbi.nlm.nih.gov/
[GenBank or RefSeq ACCESSION].[version number] | GenBank or RefSeq VERSION | AB088100.1, NM_178014.2 | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/
(identical to accession for recent entries) | GenBank LOCUS | | Nucleotide or protein sequence | GenBank, RefSeq | http://www.ncbi.nlm.nih.gov/

Page 11:

Pattern | Identifier Name | Examples | Entity | Database | URL
[Protein code]_[Species code] | Swiss-Prot ID (entry name) | TBB5_HUMAN | Protein sequence | UniProtKB/Swiss-Prot | http://www.uniprot.org/
[UniProt AC]_[Species code] | UniProt ID (entry name) | Q9BUU9_HUMAN | Protein sequence | UniProtKB/TrEMBL | http://www.uniprot.org/
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9] OR [O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9] | UniProt AC (accession number) | P07437 | Protein sequence | UniProtKB | http://www.uniprot.org/

Page 12:

Pattern | Identifier Name | Examples | Entity | Database | URL
[capital letters or digits; no initial digit] | HGNC gene symbol | TUBB, TUBB1 | Human gene | HGNC database | http://www.genenames.org/
GO:[7 digits] | GO accession number | GO:0005874 | Gene class | AmiGO | http://www.geneontology.org/
[0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9] | PDB ID | 1TUB | Protein, nucleic acid, or complex structure | PDB | http://www.rcsb.org/
[2 or 3 letters or digits] | PDB ligand ID | CN2 | Ligand | PDB | http://www.rcsb.org/

Page 13:

Pattern | Identifier Name | Examples | Entity | Database | URL
[up to 7 digits]-[2 digits]-[1 digit] | CAS registry number | 64-86-8 | Chemical structure | SciFinder | https://scifinder-cas-org.login.ezproxy.library.ualberta.ca/
[digits] | PubChem CID (compound ID) | 6167 | Chemical structure | PubChem | http://pubchem.ncbi.nlm.nih.gov/
ZINC[8 digits] OR [digits] | ZINC ID | ZINC00621853, 621853 | Chemical structure | ZINC | http://zinc.docking.org/
DB[5 digits] | DrugBank accession number | DB01394 | Drug (chemical structure) | DrugBank | http://www.drugbank.ca/
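
The cheat-sheet patterns can be turned into regular expressions to guess where an identifier might come from. The sketch below is a rough transcription of a subset of the patterns. The names `PATTERNS` and `candidates` are illustrative, and the expressions are only as strict as the cheat sheets, so several identifier types can match the same string.

```python
import re

# Regular expressions transcribed from the cheat-sheet patterns (a subset).
# Matching is case-sensitive and must cover the whole string.
PATTERNS = {
    "GenInfo Identifier (GI)": r"(?:GI:)?\d+",
    "GenBank accession": r"[A-Z]\d{5}|[A-Z]{2}\d{6}",
    "RefSeq accession": r"[A-Z]{2}_\d+",
    "UniProt AC": r"[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9]|[OPQ][0-9][A-Z0-9]{3}[0-9]",
    "HGNC gene symbol": r"[A-Z][A-Z0-9]*",
    "GO accession": r"GO:\d{7}",
    "PDB ID": r"[0-9][A-Z0-9]{3}",
    "CAS registry number": r"\d{1,7}-\d{2}-\d",
    "PubChem CID": r"\d+",
    "ZINC ID": r"ZINC\d{8}|\d+",
    "DrugBank accession": r"DB\d{5}",
}


def candidates(identifier):
    """Return the names of every pattern that the identifier fully matches."""
    return [
        name
        for name, pattern in PATTERNS.items()
        if re.fullmatch(pattern, identifier)
    ]


# Example identifiers taken from the cheat sheets.
for example in ["GI:34222261", "AB088100", "NM_178014", "P07437",
                "TUBB1", "GO:0005874", "1TUB", "64-86-8", "DB01394"]:
    print(example, "->", candidates(example))
```

Returning every match rather than a single answer is deliberate: a string such as P07437 fits the UniProt AC pattern, the one-letter GenBank accession pattern, and the HGNC symbol pattern, so the pattern alone cannot settle which database a record belongs to.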

Page 14:

Key File Formats for Sequences and Structures

•  Sequences – FASTA format: .fasta .fst .txt

•  Macromolecule structures – PDB format: .pdb .ent
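
FASTA is simple enough to read without a library: each record is a header line starting with ">" followed by one or more sequence lines. The following is a minimal sketch of a reader. The function name `parse_fasta` and the two example records are made up for illustration, and a dedicated package would be a better choice for real work.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, for illustration only.
EXAMPLE = """>seq1 hypothetical example record
ATGGCCATTG
TAATGGGCC
>seq2 another made-up record
MREIVHIQAG
"""

for header, sequence in parse_fasta(EXAMPLE):
    print(header, len(sequence))
```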

Page 15:

Accessing Databases

•  Web interface

•  Query string, e.g. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=34222261&rettype=fasta&retmode=fasta

•  Web services (SOAP)

•  FTP -> local copy
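
The query-string approach is easy to script. The sketch below rebuilds the example URL from its parameters using only the Python standard library. The helper names `efetch_url` and `fetch` are illustrative, the parameter values are copied from the slide, and the service itself may have changed since 2010 (for instance, it may now expect https and enforce usage limits), so treat this as a starting point.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address as given on the slide.
EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db, record_id, rettype="fasta", retmode="fasta"):
    """Build an efetch query string from its parameters."""
    params = {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    return EFETCH + "?" + urlencode(params)


def fetch(url, timeout=30):
    """Download and decode a response (needs network access; not called here)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


url = efetch_url("nucleotide", "34222261")
print(url)
```

Building the URL from a dictionary keeps the parameters readable and lets `urlencode` handle any escaping.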

Page 17:

Cheminformatic Database Survey

•  CAS SciFinder: https://scifinder-cas-org.login.ezproxy.library.ualberta.ca/

•  PubChem: http://pubchem.ncbi.nlm.nih.gov/

•  DrugBank: http://www.drugbank.ca/

•  ZINC: http://zinc.docking.org/

Page 18:

Cheminformatic Database Survey

•  CAS SciFinder: >52 million organic compounds, >61 million inorganic compounds; physical property info

•  PubChem: >27 million unique structures, >23 million with 3D conformations; mostly organic, biologically interesting compounds

•  DrugBank: ~4,800 drugs, >1,350 FDA approved drugs; includes drug target info

•  ZINC: >13 million purchasable compounds; ready to dock

Page 19:

Stereochemistry Issues

5 Chaetocin structures from PubChem (structure drawings not reproduced):

•  CID 161591: no stereochemistry

•  CID 5390098: bad stereochemistry

•  CID 11657687: natural product stereochemistry

•  CID 46191942: enantiomer of natural product

•  CID 11563851: incomplete stereochemistry

Page 20:

Other Cheminformatic Issues

•  Tautomers / protonation states?

•  Salt forms?

•  Implicit or explicit hydrogens?

•  2D connectivity only or 3D conformation?

•  Non-organic elements?
   – Many programs only handle: CHNOPS + halogens
   – But some drugs have B, Pt, Hg, As, …

Page 21:

SMILES

CC(=O)N[C@H]1CCC2=CC(=C(C(=C2C3=CC=C(C(=O)C=C13)OC)OC)OC)OC

•  Isomeric SMILES
   – Allows specification of stereochemistry

•  Canonical SMILES
   – Canonicalization will generate a unique string for a molecule, regardless of atom order
   – Different programs will canonicalize differently

•  SMARTS
   – Chemical patterns for searching or filtering

http://www.daylight.com/smiles/index.html
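
A quick, purely textual check can flag SMILES strings that carry no stereochemical information at all. The sketch below counts chirality tags ("@" or "@@") and looks for the "/" and "\" bond marks used for double-bond geometry. The function name `stereo_summary` is illustrative, and this is a crude heuristic: it does not parse the molecule or tell whether every stereocentre is specified, which needs a cheminformatics toolkit.

```python
import re


def stereo_summary(smiles):
    """Count chirality tags and detect double-bond stereo marks in a SMILES string."""
    return {
        # "@" and "@@" each count as one tagged tetrahedral centre.
        "tetrahedral_tags": len(re.findall(r"@+", smiles)),
        # "/" and "\" mark cis/trans geometry around double bonds.
        "double_bond_stereo": "/" in smiles or "\\" in smiles,
    }


# The isomeric SMILES from the slide, and the same string with the
# chirality tag removed (derived here for comparison).
ISOMERIC = "CC(=O)N[C@H]1CCC2=CC(=C(C(=C2C3=CC=C(C(=O)C=C13)OC)OC)OC)OC"
FLAT = "CC(=O)NC1CCC2=CC(=C(C(=C2C3=CC=C(C(=O)C=C13)OC)OC)OC)OC"

print(stereo_summary(ISOMERIC))
print(stereo_summary(FLAT))
```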

Page 22:

File formats

•  MDL Molfile: .mol
   – Allows a 3D conformation to be stored

•  SDF: .sdf
   – Wraps Molfile format; multiple structures; annotations

•  PDB: .pdb .ent
   – Not the best for small molecules

Need to convert? -> Try OpenBabel: http://openbabel.org/wiki/Main_Page
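
To illustrate how SDF wraps Molfile blocks, the sketch below splits SDF text into records. It relies on three conventions of the format: a Molfile block ends with an "M  END" line, annotations follow as "> <FIELD>" headers with their values, and each record is closed by a "$$$$" line. The function names and the tiny sample records are made up for illustration, and the Molfile block is not validated. For real conversions a toolkit such as OpenBabel is the safer route.

```python
import re


def parse_sdf(text):
    """Split SDF text into a list of (molblock, {field: value}) records."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":  # record terminator
            records.append(_parse_record(current))
            current = []
        else:
            current.append(line)
    return records


def _parse_record(lines):
    # The Molfile block runs up to and including the "M  END" line.
    end = next(i for i, line in enumerate(lines) if line.startswith("M  END"))
    molblock = "\n".join(lines[: end + 1])
    fields, name, value = {}, None, []
    for line in lines[end + 1:]:
        header = re.match(r">.*<([^>]+)>", line)
        if header:
            name, value = header.group(1), []
        elif name is not None and not line.strip():
            fields[name] = "\n".join(value)  # a blank line ends the value
            name = None
        elif name is not None:
            value.append(line)
    if name is not None:
        fields[name] = "\n".join(value)
    return molblock, fields


# Hypothetical single-atom Molfile block, reused for two sample records.
MOL = [
    "methane",
    "  hypothetical",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
]
SAMPLE = "\n".join(
    MOL + ["> <NAME>", "methane", "", "$$$$"]
    + MOL + ["> <NAME>", "methane again", "", "$$$$"]
)

for molblock, fields in parse_sdf(SAMPLE):
    print(fields)
```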

Page 24:

Pathway and Interaction Databases

•  KEGG Pathways: http://www.genome.jp/kegg/

•  NetPath: http://www.netpath.org/

•  BioGRID: http://thebiogrid.org/

Page 25:

Pathway and Interaction Databases

•  KEGG Pathways: Manually drawn pathways of metabolism, signaling, and other biological processes; >300 pathways + organism-specific versions

•  NetPath: Curated protein signal pathways in humans; 20 pathways, 1,800 interactions

•  BioGRID: A repository for protein and gene interaction data; 345,620 interactions

Page 26:

Pathway Formats

•  SBML: .xml
   – The Systems Biology Markup Language: http://sbml.org/Main_Page

•  Also check out the BioPAX format: http://www.biopax.org/

Page 27:

Pathway Tools

•  libSBML: http://sbml.org/Software/libSBML

•  CellDesigner: http://www.celldesigner.org/

•  Cytoscape: http://www.cytoscape.org/

Page 28:

<?xml version="1.0" encoding="UTF-8"?>
<sbml level="2" version="3" xmlns="http://www.sbml.org/sbml/level2/version3">
...
<listOfSpecies>
  <species compartment="cytosol" id="ES" />
  <species compartment="cytosol" id="P" />
  <species compartment="cytosol" id="S" />
  <species compartment="cytosol" id="E" />
</listOfSpecies>
<listOfReactions>
  <reaction id="veq">
    <listOfReactants>
      <speciesReference species="E"/>
      <speciesReference species="S"/>
    </listOfReactants>
    <listOfProducts>
      <speciesReference species="ES"/>
    </listOfProducts>
    <kineticLaw>
      <math xmlns="http://www.w3.org/1998/Math/MathML">
        <apply>
          <times/>
          <ci>cytosol</ci>
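
Because SBML is XML, the species and reactions can be listed with a generic XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` rather than libSBML. The sample document reuses the species and the reaction from the fragment above, but the `<model>` wrapper and its id are assumptions, and the kinetic law is left out because the fragment is truncated there.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the fragment above.
NS = {"sbml": "http://www.sbml.org/sbml/level2/version3"}

# Assumed minimal document: the <model> element and its id are illustrative.
SBML_TEXT = """<?xml version="1.0" encoding="UTF-8"?>
<sbml level="2" version="3" xmlns="http://www.sbml.org/sbml/level2/version3">
  <model id="example">
    <listOfSpecies>
      <species compartment="cytosol" id="ES" />
      <species compartment="cytosol" id="P" />
      <species compartment="cytosol" id="S" />
      <species compartment="cytosol" id="E" />
    </listOfSpecies>
    <listOfReactions>
      <reaction id="veq">
        <listOfReactants>
          <speciesReference species="E"/>
          <speciesReference species="S"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="ES"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""

root = ET.fromstring(SBML_TEXT.encode("utf-8"))

# Collect species ids in document order.
species = [s.get("id") for s in root.findall(".//sbml:species", NS)]

# Map each reaction id to its (reactants, products).
reactions = {}
for reaction in root.findall(".//sbml:reaction", NS):
    reactants = [ref.get("species") for ref in
                 reaction.findall("sbml:listOfReactants/sbml:speciesReference", NS)]
    products = [ref.get("species") for ref in
                reaction.findall("sbml:listOfProducts/sbml:speciesReference", NS)]
    reactions[reaction.get("id")] = (reactants, products)

print(species)
print(reactions)
```

A generic parser is fine for quick inspection like this; libSBML is the better fit when the model needs validating or its kinetic laws need interpreting.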

Page 29:

KEGG: Pathways in Cancer

Page 30:

NetPath: EGFR1 pathway

Page 31:

Exercises

1.  What databases are these identifiers from?
    a.  3KYL
    b.  EZH2
    c.  Q15910
    d.  GO:0008017
    e.  GI:8017
    f.  A9145C

2.  Try finding the corresponding entries online

Page 32:

Exercise Answers

1.  What databases are these identifiers from?
    a.  3KYL -> PDB (a protein-RNA structure for telomerase reverse transcriptase, catalytic region)
    b.  EZH2 -> HGNC (a human gene for a histone lysine methyltransferase)
    c.  Q15910 -> UniProt (a protein sequence for EZH2)
    d.  GO:0008017 -> AmiGO (microtubule binding gene ontology)
    e.  GI:8017 -> GenBank (a DNA sequence from D. melanogaster)
    f.  A9145C -> this one’s a trick: it’s a chemical compound; you can look it up in PubChem with CID: 6438632