Pathway/Genome Databases:
Concepts and Software Tools
Peter D. Karp, Ph.D.
Bioinformatics Research Group
SRI International
http://www.ai.sri.com/pkarp/
http://BioCyc.org/

SRI International Bioinformatics
Overview
- Pathway/genome databases
- Pathway Tools software
- EcoCyc and MetaCyc
- Characterization of the E. coli metabolic network

What to do When Theories Become Larger than Minds can Grasp?
- Example: E. coli genetic network
  - Control by 97 transcription factors of 1174 genes in 630 transcription units
- Example: E. coli metabolic network
  - 160 pathways involving 744 reactions and 791 substrates
- Partition theories across multiple minds
- Rely on the printed word

Limitations
- Cannot effectively:
  - Evaluate them for internal consistency
  - Evaluate them for consistency with new data: microarrays
  - Refine them with respect to new data
  - Integrate across them to produce system understanding
- They are too large and complex
- The printed word cannot be manipulated effectively

Solution: Biological Knowledge Bases
- Store biological knowledge and theories in computers in a declarative form
  - Amenable to computational analysis and generative user interfaces
- It is accepted practice to store data in computers, but not knowledge
  - Refined, interpreted, consensus views
- Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases
  - Such knowledge bases are an integral part of the scientific enterprise

Organism-Specific Pathway/Genome Databases
- Layer functional information above the genome
- Rich ontology to encode biological information with high fidelity
  - Chromosomes, genes, operons, gene products, reactions, pathways
- Curated by experts for that organism
- Integrate literature and computational predictions

Pathway/Genome Database
[Diagram labels: CELL; Chromosomes, Plasmids; Genes; Proteins; Reactions; Pathways; Compounds; Operons, Promoters, DNA Binding Sites]

Pathway Tools Software
- PathoLogic
  - Prediction of metabolic network from genome
  - Computational creation of new Pathway/Genome Databases
- Pathway/Genome Editors
  - Distributed curation of genome annotations
  - Distributed object database system
  - Interactive editing tools
- Pathway/Genome Navigator
  - WWW publishing of PGDBs
  - Graphic depictions of pathways, chromosomes, operons
  - Analysis operations
    - Pathway visualization of gene-expression data
    - Global comparisons of metabolic networks

Sequence Project Workflow
[Workflow diagram labels: Raw Sequence; Phred; Phrap; CONSED; BLAST, BLOCKS; GeneMark/Glimmer; PathoLogic; P/G Navigator; P/G Editors; WWW Publishing, Analyses]

BioCyc Collection of Pathway/Genome DBs
Literature-based Datasets:
- MetaCyc
- Escherichia coli (EcoCyc)
Computationally Derived Datasets:
- Agrobacterium tumefaciens
- Caulobacter crescentus
- Chlamydia trachomatis
- Bacillus subtilis
- Helicobacter pylori
- Haemophilus influenzae
- Mycobacterium tuberculosis
- Mycoplasma pneumoniae
- Pseudomonas aeruginosa
- Saccharomyces cerevisiae
- Treponema pallidum
http://BioCyc.org/

EcoCyc Project Overview
- E. coli Encyclopedia
  - Model-Organism Database for E. coli
  - Tracks the evolving annotation of the E. coli genome
  - Over 3500 literature citations
- Collaborative development via Internet
  - Karp (SRI) -- Bioinformatics architect
  - Riley (MBL) -- Metabolic pathways, signal transduction
  - Saier (UCSD) and Paulsen (TIGR) -- Transport
  - Collado (UNAM) -- Regulation of gene expression
- Ontology: 1000 biological classes
- Database content: 16,000 instances

EcoCyc = E. coli Dataset + Pathway/Genome Navigator
Genes: 4,393
Proteins: 4,273
Reactions: 2,760
Pathways: 165
Compounds: 774
Transcription Units: 684
Transcription Factors: 108
Enzymes: 914
Transporters: 162
Promoters: 781
TransFac Sites: 910
Citations: 3,508
http://BioCyc.org/

EcoCyc Pathways
- Biosynthesis of amino acids, purines, pyrimidines, fatty acids, cofactors (heme, biotin, folic acid, etc.)
- Catabolism of fatty acids, D-glucuronate, L-alanine, L-arabinose, fucose, galactonate, galactose, glucose, mannose, ribose, xylose
- Entner-Doudoroff pathway, TCA cycle, fermentation, gluconeogenesis, glycerol metabolism, glycolysis, glyoxylate cycle, pentose phosphate pathway

Motivations for Understanding Schema
- Pathway Tools visualizations and analyses depend upon the software being able to find precise information in precise places within a Pathway/Genome DB
- When writing complex Lisp queries to PGDBs, those queries must name classes and slots within the schema
- A Pathway/Genome Database is a web of interconnected objects; each object represents a biological entity

Web of Relationships for One Enzyme
[Diagram labels: polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; genes sdhA, sdhB, sdhC, sdhD; reaction Succinate + FAD = fumarate + FADH2; Enzymatic-reaction; Succinate dehydrogenase; TCA Cycle]

Frames
- Entities with which facts are associated
- Kinds of frames:
  - Classes: Genes, Pathways, Biosynthetic Pathways
  - Instances (objects): trpA, TCA cycle
- Classes:
  - Superclass(es)
  - Subclass(es)
  - Instance(s)
- A symbolic frame name (id, key) uniquely identifies each frame

Slots
- Encode attributes/properties of a frame
  - Integer, real number, string
- Represent relationships between frames
  - The value of a slot is the identifier of another frame
- Every slot is described by a “slot frame” in a KB that defines meta information about that slot
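
The frame and slot ideas above can be illustrated with a minimal Python sketch. This is not the Ocelot or Pathway Tools API: the `Frame` and `KnowledgeBase` classes, the `is_a` helper, and the frame id `TrpA-protein` are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

SlotValue = Union[int, float, str]  # attribute value, or the id of another frame


@dataclass
class Frame:
    """A class frame or instance frame, identified by a unique symbolic name."""
    frame_id: str
    is_class: bool = False
    parents: List[str] = field(default_factory=list)  # superclasses, or classes of an instance
    slots: Dict[str, List[SlotValue]] = field(default_factory=dict)


class KnowledgeBase:
    """Hypothetical in-memory KB: a dictionary of frames keyed by frame id."""

    def __init__(self) -> None:
        self.frames: Dict[str, Frame] = {}

    def add(self, frame: Frame) -> Frame:
        if frame.frame_id in self.frames:
            raise ValueError(f"duplicate frame id: {frame.frame_id}")
        self.frames[frame.frame_id] = frame
        return frame

    def slot_values(self, frame_id: str, slot: str) -> List[SlotValue]:
        return self.frames[frame_id].slots.get(slot, [])

    def is_a(self, frame_id: str, class_id: str) -> bool:
        """True if frame_id is class_id or inherits from it through its parents."""
        if frame_id == class_id:
            return True
        return any(self.is_a(p, class_id) for p in self.frames[frame_id].parents)


kb = KnowledgeBase()
kb.add(Frame("Pathways", is_class=True))
kb.add(Frame("Biosynthetic-Pathways", is_class=True, parents=["Pathways"]))
kb.add(Frame("Genes", is_class=True))
# "TrpA-protein" is an illustrative frame id, not taken from EcoCyc.
kb.add(Frame("trpA", parents=["Genes"],
             slots={"Common-Name": ["trpA"], "Product": ["TrpA-protein"]}))

print(kb.is_a("trpA", "Genes"))           # instance-of check
print(kb.slot_values("trpA", "Product"))  # a slot whose value names another frame
```

Storing relationships as frame ids rather than object references keeps every link inspectable as plain data, which matches the slide's point that a slot value is the identifier of another frame.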

Slot Links
[Diagram labels: polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; genes sdhA, sdhB, sdhC, sdhD; reaction Succinate + FAD = fumarate + FADH2; Enzymatic-reaction; Succinate dehydrogenase; TCA Cycle; slot links: product, component-of, catalyzes, reaction, in-pathway]
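
As a rough sketch of how such slot links can be followed, the Python below walks from a gene to its pathway using the slot names shown on this slide. The dictionary layout and the `follow` helper are assumptions made for illustration, and the pairing of each gene with a polypeptide follows the column order of the diagram.

```python
from typing import Dict, List, Set

# frame id -> slot name -> ids of linked frames (toy data modeled on the slide)
KB: Dict[str, Dict[str, List[str]]] = {
    "sdhA": {"product": ["Sdh-flavo"]},
    "sdhB": {"product": ["Sdh-Fe-S"]},
    "sdhC": {"product": ["Sdh-membrane-1"]},
    "sdhD": {"product": ["Sdh-membrane-2"]},
    "Sdh-flavo": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-Fe-S": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-membrane-1": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-membrane-2": {"component-of": ["Succinate dehydrogenase"]},
    "Succinate dehydrogenase": {"catalyzes": ["Enzymatic-reaction"]},
    "Enzymatic-reaction": {"reaction": ["Succinate + FAD = fumarate + FADH2"]},
    "Succinate + FAD = fumarate + FADH2": {"in-pathway": ["TCA Cycle"]},
}


def follow(start: str, slot_path: List[str]) -> Set[str]:
    """Follow a sequence of slots from one frame, returning the frames reached."""
    current = {start}
    for slot in slot_path:
        current = {value
                   for frame_id in current
                   for value in KB.get(frame_id, {}).get(slot, [])}
    return current


GENE_TO_PATHWAY = ["product", "component-of", "catalyzes", "reaction", "in-pathway"]
print(follow("sdhA", GENE_TO_PATHWAY))
```

A query of this kind only works if it names the right slots in the right order, which is why the schema matters to anyone writing queries against a PGDB.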

Representation of Function
[Diagram labels: polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; genes sdhA, sdhB, sdhC, sdhD; reaction Succinate + FAD = fumarate + FADH2; Enzymatic-reaction; Succinate dehydrogenase; TCA Cycle; attributes: EC#, Keq, Cofactors, Inhibitors, Molecular wt, pI, Left-end-position]

Monofunctional Monomer
[Diagram labels: Gene; Monomer; Enzymatic-reaction; Reaction; Pathway]

Bifunctional Monomer
[Diagram labels: Gene; Monomer; Enzymatic-reaction (x2); Reaction (x2); Pathway]

Monofunctional Multimer
[Diagram labels: Gene (x4); Monomer (x4); Multimer; Enzymatic-reaction; Reaction; Pathway]

Pathway and Substrates
[Diagram labels: Pathway; Reaction (x4); Reactant-1, Reactant-2; Product-1, Product-2; slot links: in-pathway, left, right]

Transcriptional Regulation
[Diagram labels: site001; pro001; trpE, trpD, trpC, trpB, trpA, trpL; Int003; RpoSig70; TrpR*trp; Int001; trpLEDCBA; trp; apoTrpR; Int005]

Principal Classes
- Class names are capitalized, plural
- Genetic-Elements, with subclasses:
  - Chromosomes
  - Plasmids
- Genes
- Transcription-Units
- RNAs
- Proteins, with subclasses:
  - Polypeptides
  - Protein-Complexes

Principal Classes
- Reactions, with subclasses:
  - Transport-Reactions
- Enzymatic-Reactions
- Pathways
- Compounds-And-Elements

Slots in Multiple Classes
- Common-Name
- Synonyms
- Names (computed as union of Common-Name, Synonyms)
- Comment
- Citations
- DB-Links

Genes Slots
- Chromosome
- Left-End-Position
- Right-End-Position
- Centisome-Position
- Transcription-Direction
- Product

Proteins Slots
- Molecular-Weight-Seq
- Molecular-Weight-Exp
- pI
- Locations
- Modified-Form
- Unmodified-Form
- Component-Of

Polypeptides Slots
- Gene

Protein-Complexes Slots
- Components

Reactions Slots
- EC-Number
- Left, Right
- Substrates (computed as union of Left, Right)
- DeltaG0
- Keq
- Spontaneous?
- Species
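
A computed slot such as Substrates can be pictured as a value derived on demand from stored slots. The Python sketch below is one way to express that idea; the `Reaction` class and its property are illustrative assumptions, not the Pathway Tools implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reaction:
    """Toy reaction frame holding only the stored Left and Right slots."""
    left: List[str] = field(default_factory=list)
    right: List[str] = field(default_factory=list)

    @property
    def substrates(self) -> List[str]:
        """Computed slot: union of Left and Right, order-preserving, no duplicates."""
        merged: List[str] = []
        for compound in self.left + self.right:
            if compound not in merged:
                merged.append(compound)
        return merged


rxn = Reaction(left=["succinate", "FAD"], right=["fumarate", "FADH2"])
print(rxn.substrates)
```

Because the value is derived rather than stored, it cannot drift out of step with Left and Right when a curator edits either side. The same pattern would apply to Names as the union of Common-Name and Synonyms.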

Enzymatic-Reactions Slots
- Enzyme
- Reaction
- Activators
- Inhibitors
- Physiologically-Relevant
- Cofactors
- Prosthetic-Groups
- Alternative-Substrates
- Alternative-Cofactors

Pathways Slots
- Reaction-List
- Predecessors
- Primaries

MetaCyc Overview
- Meta Metabolic Encyclopedia
  - 445 pathways, 1115 enzymes, 4218 reactions
  - 173 E. coli pathways; 158 organisms
  - 2381 citations
- Literature-based DB with extensive references and commentary
- Pathways, reactions, enzymes, substrates

MetaCyc Frequent Organisms
Pathways  Organism
7         M. pneumoniae
7         P. putida
8         S. cerevisiae
12        M. capricolum
15        Hp. influenzae
17        Pseudomonas
18        Soybean
18        B. subtilis
20        Sf. solfataricus
31        Ho. sapiens
35        Sm. typhimurium
173       E. coli

MetaCyc Data
- MetaCyc contains one DB object for each distinct pathway
  - Distinct in terms of reaction steps
  - Each pathway labeled with species it occurs in
- MetaCyc pathways are experimentally determined
- 4218 reactions in MetaCyc
  - 401 lack EC numbers

MetaCyc Enzyme Data
- Reaction(s) catalyzed
- Alternative substrates
- Cofactors / prosthetic groups
- Activators and inhibitors
- Subunit structure
- Molecular weight, pI
- Comment, literature citations
- Species

MetaCyc Super-Pathways
- Groups of pathways linked by common substrates
- Example: Super-pathway containing
  - Chorismate biosynthesis
  - Tryptophan biosynthesis
  - Phenylalanine biosynthesis
  - Tyrosine biosynthesis
- Super-pathways defined by listing their component pathways
- Multiple levels of super-pathways can be defined
- Pathway layout algorithms accommodate super-pathways
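
Defining a super-pathway by listing its components, with several levels allowed, amounts to a small recursive structure. The Python sketch below expands a super-pathway into its base pathways. The names `aromatic-superpathway` and `hypothetical-top-level` and the `base_pathways` function are invented for illustration and are not MetaCyc identifiers.

```python
from typing import Dict, List, Tuple

# super-pathway name -> component pathways (components may be super-pathways too)
SUPER_PATHWAYS: Dict[str, List[str]] = {
    "aromatic-superpathway": [          # hypothetical name for the slide's example
        "chorismate biosynthesis",
        "tryptophan biosynthesis",
        "phenylalanine biosynthesis",
        "tyrosine biosynthesis",
    ],
    "hypothetical-top-level": [         # a second level, purely illustrative
        "aromatic-superpathway",
        "TCA cycle",
    ],
}


def base_pathways(name: str, _stack: Tuple[str, ...] = ()) -> List[str]:
    """Expand a (super-)pathway into its base pathways, preserving listing order."""
    if name in _stack:
        raise ValueError(f"cyclic super-pathway definition at {name!r}")
    components = SUPER_PATHWAYS.get(name)
    if components is None:
        return [name]  # not a super-pathway: already a base pathway
    result: List[str] = []
    for component in components:
        for pathway in base_pathways(component, _stack + (name,)):
            if pathway not in result:
                result.append(pathway)
    return result


print(base_pathways("hypothetical-top-level"))
```

The cycle check is a defensive addition of this sketch; the slides do not say how the real software validates super-pathway definitions.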

Comparison of MetaCyc to KEGG
- Data
  - KEGG has no literature citations, no comments
  - KEGG has no detailed information about enzymes (inhibitors, subunits)
  - KEGG pathways are composites of pathways found in many organisms
    - Unclear what sub-pathways occur in what organisms
- Software tools
  - KEGG has no algorithmic visualization tools
  - KEGG has no queryable metabolic-map overview diagram
  - KEGG has no interactive editing tools

EcoCyc/MetaCyc Availability
- WWW EcoCyc-Plus freely available
  - EcoCyc, MetaCyc
  - Pathway/genome DBs for 12 other organisms
  - http://BioCyc.org/
- On-site EcoCyc-Plus freely available to non-profits
  - Flatfiles
  - Binary executable: Hardware requirements
    - Sun UltraSparc-170 w/ 64MB memory
    - PC, 500MHz CPU, 64MB memory, Windows-98

EcoCyc and MetaCyc: Resources for Microbial Genome Analysis
- E. coli has a large fraction of gene functions identified experimentally
- Assigning function by similarity to E. coli genes is less likely to introduce annotation errors
- Predict metabolic pathways of other microbes using MetaCyc

Applications of EcoCyc and MetaCyc
- Reference sources on E. coli and metabolism
- Sequence/pathway analysis of microbial genomes
- Analysis of gene-expression data
- Computer-aided education
- Anti-microbial drug discovery
- Pathway engineering
- Investigations of
  - Comparative metabolism
  - Global properties of E. coli metabolic network

Pathway Tools Software
- PathoLogic
  - Prediction of metabolic network from genome
  - Computational creation of new Pathway/Genome Databases
- Pathway/Genome Editors
  - Distributed curation of genome annotations
  - Distributed object database system
  - Interactive editing tools
- Pathway/Genome Navigator
  - WWW publishing of PGDBs
  - Graphic depictions of pathways, chromosomes, operons
  - Analysis operations
    - Pathway visualization of gene-expression data
    - Global comparisons of metabolic networks

Implementation Details
- Allegro Common Lisp
- Sun and PC platforms
- Ocelot object database
- Lisp-based WWW server at BioCyc.org
  - CWEST-based
  - Manages 14 organism DBs

Pathway Tools Architecture
[Diagram labels: Object DBMS; GFP API; Pathway/Genome Navigator; WWW Server; X-Windows Graphics; Object Editor, Pathway Editor, Reaction Editor; Oracle]

Ocelot Knowledge Server Architecture
- Frame data model
  - Classes, instances, inheritance
  - Classes and instances both treated as data
- Persistent storage via disk files, Oracle DBMS
  - Concurrent development: Oracle
  - Single-user development: disk files
  - Read-only delivery: bundle data into binary program
- Transaction logging facility
- Optimistic concurrency-control protocol
- Schema evolution
- Local disk cache to improve Internet performance
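
The slides name an optimistic concurrency-control protocol without describing it. The Python sketch below shows the general technique only, using per-frame version numbers checked at commit time. It is a generic illustration, not Ocelot's actual protocol, and `FrameStore`, `ConflictError`, and the frame id `trpA` as used here are assumptions.

```python
from typing import Any, Dict, Tuple


class ConflictError(Exception):
    """Raised when a frame changed after the committing editor read it."""


class FrameStore:
    """Hypothetical shared store: frame id -> (version, value)."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[int, Any]] = {}

    def read(self, frame_id: str) -> Tuple[int, Any]:
        """Return (version, value); unknown frames are at version 0."""
        return self._data.get(frame_id, (0, None))

    def commit(self, frame_id: str, read_version: int, new_value: Any) -> int:
        """Write only if nobody else committed since read_version was observed."""
        current_version, _ = self.read(frame_id)
        if current_version != read_version:
            raise ConflictError(
                f"{frame_id}: read version {read_version}, now {current_version}")
        self._data[frame_id] = (current_version + 1, new_value)
        return current_version + 1


store = FrameStore()
# Two curators read the same frame without taking any lock.
version_a, _ = store.read("trpA")
version_b, _ = store.read("trpA")
store.commit("trpA", version_a, {"Comment": "edited by curator A"})
try:
    store.commit("trpA", version_b, {"Comment": "edited by curator B"})
except ConflictError as err:
    print("conflict detected:", err)
```

No locks are held while editing; a conflict is detected only at commit, which suits curation workloads where two people rarely edit the same object at the same time.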

EcoCyc WWW Server

Visualization and Editing Tools
- Full Metabolic Map
- Pathways
- Reactions
- Compounds
- Enzymes, Transporters, Transcription Factors
- Genes
- Chromosomes
- Operons

Inference of Metabolic Pathways
[Diagram labels: ANNOTATED GENOME (Structured ASCII Text File): List of Genes/ORFs, List of Gene Products, DNA Sequence; MetaCyc; PathoLogic; Pathway/Genome Database: Genomic Map, Genes, Gene Products, Reactions, Pathways, Compounds]

PathoLogic Analysis Phases
- Trial parsing of input data files
- Automated build of initial PGDB
  - Initialize schema of new PGDB
  - Create DB objects for chromosomes, genes, proteins
  - Predict reactions and pathways present
- Define protein complexes
- Define metabolic overview diagram

PathoLogic Pathway Prediction
- Create associations between enzymes and metabolic reactions
  - Reactions and substrates imported from MetaCyc
    - Automatically via EC numbers
    - Automatically via enzyme name matching
    - Manually
  - CC0092 / galE / “UDP-glucose-4-epimerase” / EC 5.1.3.2
    - UDP-D-glucose → UDP-galactose
- Import from MetaCyc all pathways associated with inferred reactions
  - UDP-D-glucose → UDP-galactose is a reaction of:
    - galactose metabolism, UDP-glucose conversion,
    - lactose degradation 4, colanic acid building blocks biosynthesis
- Prune out pathways with insufficient evidence
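
The first two steps above can be sketched in Python using the slide's galE example. The three lookup tables stand in for MetaCyc and are toy data; `AnnotatedGene`, `infer_reaction`, and `candidate_pathways` are hypothetical names, and the second gene record is invented to show a gene with no match.

```python
from typing import Dict, List, NamedTuple, Optional, Set


class AnnotatedGene(NamedTuple):
    gene_id: str
    name: str
    product: str
    ec_number: Optional[str]


EPIMERASE_RXN = "UDP-D-glucose -> UDP-galactose"

# Toy stand-ins for MetaCyc lookups (assumed structure, not the real schema).
REACTION_BY_EC: Dict[str, str] = {"5.1.3.2": EPIMERASE_RXN}
REACTION_BY_ENZYME_NAME: Dict[str, str] = {"udp-glucose-4-epimerase": EPIMERASE_RXN}
PATHWAYS_BY_REACTION: Dict[str, List[str]] = {
    EPIMERASE_RXN: [
        "galactose metabolism",
        "UDP-glucose conversion",
        "lactose degradation 4",
        "colanic acid building blocks biosynthesis",
    ],
}


def infer_reaction(gene: AnnotatedGene) -> Optional[str]:
    """Match by EC number first, then fall back to the enzyme name."""
    if gene.ec_number and gene.ec_number in REACTION_BY_EC:
        return REACTION_BY_EC[gene.ec_number]
    return REACTION_BY_ENZYME_NAME.get(gene.product.strip().lower())


def candidate_pathways(genes: List[AnnotatedGene]) -> Dict[str, Set[str]]:
    """Map each candidate pathway to the genes that provide evidence for it."""
    evidence: Dict[str, Set[str]] = {}
    for gene in genes:
        reaction = infer_reaction(gene)
        if reaction is None:
            continue  # left for manual assignment
        for pathway in PATHWAYS_BY_REACTION.get(reaction, []):
            evidence.setdefault(pathway, set()).add(gene.gene_id)
    return evidence


genes = [
    AnnotatedGene("CC0092", "galE", "UDP-glucose-4-epimerase", "5.1.3.2"),
    AnnotatedGene("hypothetical-1", "orfX", "hypothetical protein", None),
]
candidates = candidate_pathways(genes)
for pathway in sorted(candidates):
    print(pathway, sorted(candidates[pathway]))
```

Every pathway that contains an inferred reaction becomes a candidate at this stage, so a single enzyme can pull in several pathways. That is why a pruning step has to follow.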

PathoLogic Prunes Pathways With Insufficient Evidence
- No unique enzyme AND EITHER
  - Only 1 reaction present for a pathway of more than 2 steps
  - Set of reactions present is a subset of reactions present in another pathway
  - There exists a variant pathway with more evidence
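
One possible reading of this rule is sketched in Python below. Several interpretations are assumptions of the sketch rather than statements from the slides: a "unique enzyme" is taken to mean a present reaction that belongs to no other candidate pathway, "more evidence" is taken to mean more reactions present, and the `variants` mapping and the function name are hypothetical.

```python
from typing import Dict, List, Set


def pathways_to_prune(pathway_reactions: Dict[str, Set[str]],
                      present: Set[str],
                      variants: Dict[str, List[str]]) -> Set[str]:
    """Return candidate pathways that lack sufficient evidence (one interpretation)."""
    found = {p: rxns & present for p, rxns in pathway_reactions.items()}
    pruned: Set[str] = set()
    for pathway, rxns in pathway_reactions.items():
        others = [q for q in pathway_reactions if q != pathway]
        # Assumed meaning of "unique enzyme": a present reaction found in no other pathway.
        has_unique = any(all(r not in pathway_reactions[q] for q in others)
                         for r in found[pathway])
        if has_unique:
            continue
        single_hit = len(found[pathway]) == 1 and len(rxns) > 2
        subsumed = any(found[pathway] <= found[q] for q in others)
        # Assumed meaning of "more evidence": the variant has more reactions present.
        weaker_than_variant = any(
            len(found.get(v, set())) > len(found[pathway])
            for v in variants.get(pathway, []))
        if single_hit or subsumed or weaker_than_variant:
            pruned.add(pathway)
    return pruned


# Toy data: pathway "A" shares its only present reaction with "B".
PATHWAY_REACTIONS = {
    "A": {"r1", "r2", "r3"},
    "B": {"r1", "r4"},
    "C": {"r5", "r6"},
}
PRESENT = {"r1", "r4", "r5"}
print(pathways_to_prune(PATHWAY_REACTIONS, PRESENT, variants={}))
```

In the toy data, pathway A is pruned because its single present reaction also occurs in B, while B and C each keep a reaction that no other pathway explains.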

PathoLogic: Inference of Pathway Complement
- Extends the paradigm of genome analysis
  - Predicted genes placed in their biochemical context
- Information reduction device
- Assess coherence of the set of genes in a genome
  - Identifies pathway holes and singleton enzymes
- Provides a framework for analysis of functional-genomics data

Pathway Comparisons
       Eco  Mtb  Bsu  Hin  Sce  Hpy
Eco    130  103   92   90   84   73
Mtb         103   84   79   82   70
Bsu               96   77   72   65
Hin                    90   67   61
Sce                         84   64
Hpy                              74
Mp
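
A table of this shape can be produced by intersecting the pathway sets of each pair of databases. The Python sketch below does this on toy data. The pathway assignments are invented for illustration and do not reproduce the numbers in the table, and the function name is hypothetical.

```python
from itertools import combinations_with_replacement
from typing import Dict, Set, Tuple


def shared_pathway_counts(pgdbs: Dict[str, Set[str]]) -> Dict[Tuple[str, str], int]:
    """Count pathways shared by each pair of organisms (upper triangle plus diagonal)."""
    return {
        (a, b): len(pgdbs[a] & pgdbs[b])
        for a, b in combinations_with_replacement(pgdbs, 2)
    }


# Toy pathway sets, not real PGDB content.
PGDBS: Dict[str, Set[str]] = {
    "Eco": {"glycolysis", "TCA cycle", "tryptophan biosynthesis"},
    "Hin": {"glycolysis", "tryptophan biosynthesis"},
    "Hpy": {"glycolysis", "urea degradation"},
}

counts = shared_pathway_counts(PGDBS)
for (a, b), n in counts.items():
    print(f"{a} {b} {n}")
```

On the diagonal each organism is compared with itself, so the entry is simply the size of its own pathway set.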

Summary
- Pathway/Genome Databases
  - 14 PGDBs available through SRI at BioCyc.org
  - Computational theories of biochemical machinery
- Pathway Tools software
  - Extract pathways from genomes
  - Distributed curation tools
  - Query, visualization, WWW publishing
  - Analysis algorithms

Acknowledgements
- SRI: Suzanne Paley, Pedro Romero, John Pick
- EcoCyc Project: Milton Saier, Julio Collado, Ian Paulsen, Monica Riley
- Stanford: Harley McAdams, Lucy Shapiro, Gary Schoolnik, Russ Altman
- Funding sources:
  - NIH National Center for Research Resources
  - Department of Energy Microbial Cell Project
  - DARPA BioSpice, UPC
[email protected]
http://BioCyc.org/