SRI International Bioinformatics

The BioCyc Ontologies

Markus Krummenacker
Bioinformatics Research Group
SRI International
kr@ai.sri.com

BioCyc.org
EcoCyc.org, MetaCyc.org, HumanCyc.org


Overview

Pathway/Genome Databases (PGDBs)
  BioCyc collection
  EcoCyc, MetaCyc

Pathway Tools Software & Applications
  Visualization, Editing, Analysis, Omics data
  Inference tools: PathoLogic, Operon predictor, Pathway hole filler
  Tools for debugging a predicted metabolic network

Some Ontology Details
  Pathways, Reactions and Compounds, Enzymes, Genes
  Regulation
  Integration with other efforts: BioPAX, GO, NCBI Taxonomy


Model Organism Databases / PGDBs

DBs that describe the genome and molecular machinery of one specific organism.

Integrating many diverse types of data into a coherent model of a cell

Every sequenced organism with an active experimental community requires a MOD

Integrate genome data with information about the biochemical and genetic network of the organism

Integrate literature-based information with computational predictions

Ongoing updating of sequence, gene positions and functions, regulatory sites, pathways

MODs are platforms for global analyses of the organism
  Interpret omics data in a pathway context
  In silico prediction of essential genes
  Characterize systems properties of metabolic and genetic networks


BioCyc Collection of Pathway/Genome Databases

Pathway/Genome Database (PGDB): combines information about
  Pathways, reactions, substrates
  Enzymes, transporters
  Genes, replicons
  Transcription factors/sites, promoters, operons

Tier 1: Literature-Derived PGDBs
  MetaCyc
  EcoCyc -- Escherichia coli K-12

Tier 2: Computationally-derived DBs, Some Curation -- 20 PGDBs
  HumanCyc
  Mycobacterium tuberculosis

Tier 3: Computationally-derived DBs, No Curation -- 349 DBs


Pathway Tools: PathoLogic Inference

[Diagram of components: Annotated Genome, MetaCyc Reference Pathway DB, PathoLogic, Pathway/Genome Database, Pathway/Genome Editors, Pathway/Genome Navigator]


Pathway Tools Software: PGDBs Created Outside SRI

1,300+ licensees: 75+ groups applying software to 200+ organisms

Saccharomyces cerevisiae, SGD project, Stanford University
Mouse, MGD, Jackson Laboratory
dictyBase, Northwestern University
Under development:
  CGD (Candida albicans), Stanford University
  Drosophila, P. Ebert in collaboration with FlyBase
  C. elegans, P. Ebert in collaboration with WormBase
Planned:
  RGD (Rat), Medical College of Wisconsin

Arabidopsis thaliana, TAIR, Carnegie Institution of Washington
PlantCyc, ~20 plant PGDBs, Carnegie Institution of Washington
Six Solanaceae species, Cornell University
GrameneDB, Cold Spring Harbor Laboratory
Medicago truncatula, Samuel Roberts Noble Foundation


Pathway Tools Software: PGDBs Created Outside SRI

BioHealthBase (M. tuberculosis, F. tularensis), PATRIC, ApiDB
Gary Xie, Los Alamos Lab, Dental pathogens
F. Brinkman, Simon Fraser Univ, Pseudomonas aeruginosa
V. Schachter, Genoscope, Acinetobacter
M. Bibb, John Innes Centre, Streptomyces coelicolor
G. Church, Harvard, Prochlorococcus marinus, multiple strains
E. Uberbacher, ORNL and G. Serres, MBL, Shewanella oneidensis
R.J.S. Baerends, University of Groningen, Lactococcus lactis IL1403, Lactococcus lactis MG1363, Streptococcus pneumoniae TIGR4, Bacillus subtilis 168, Bacillus cereus ATCC14579
Matthew Berriman, Sanger Centre, Trypanosoma brucei, Leishmania major
Herbert Chiang, Washington University, Bacteroides thetaiotaomicron
Sergio Encarnacion, UNAM, Sinorhizobium meliloti
Gregory Fournier, MIT, Mesoplasma florum
Mark van der Giezen, University of London, Entamoeba histolytica, Giardia intestinalis
Michael Gottfert, Technische Universität Dresden, Bradyrhizobium japonicum
Artiva Maria Goudel, Universidade Federal de Santa Catarina, Brazil, Chromobacterium violaceum ATCC 12472


Pathway Tools Software: PGDBs Created Outside SRI

Large scale users:
  C. Medigue, Genoscope, 150+ PGDBs
  G. Burger, U Montreal, 60+ PGDBs
  Bart Weimer, Utah State University, Lactococcus lactis, Brevibacterium linens, Lactobacillus acidophilus, Lactobacillus plantarum, Lactobacillus johnsonii, Listeria monocytogenes

Partial listing of outside PGDBs at BioCyc.org


Pathway Evidence


Pathway Tools Overviews and Omics Viewers

Designed to avoid the hairball effect
Generated automatically from PGDB
Magnify, interrogate
Omics viewers paint omics data onto overview diagrams
  Different perspectives on same dataset
  Use animation for multiple time points or conditions
  Paint any data that associates numbers with genes, proteins, reactions, or metabolites

Provide genome-scale visualizations of cellular networks
Harness human visual system to interpret patterns in biological contexts


Regulatory Overview and Omics Viewer

Show regulatory relationships among gene groups


Comparative Analysis

Via Cellular Overview

Comparative genome browser

Comparative pathway table

Comparative analysis reports
  Compare reaction complements
  Compare pathway complements
  Compare transporter complements


Pathway Tools Ontology

1621 Classes
  Main classes such as: Pathways, Reactions, Compounds, Macromolecules, Proteins, Replicons, DNA-Segments (Genes, Operons, Promoters)
  Taxonomies for Pathways, Reactions (EC), Compounds
  Cell Component Ontology
  Protein Feature ontology

221 Slots for attributes and relationships
  Meta-data: Creator, Creation-Date
  Comment, Citations, Common-Name, Synonyms
  Attributes: Molecular-Weight, DNA-Footprint-Size
  Relationships: Catalyzes, Component-Of, Product

Evidence codes, supporting citations


Pathway/Genome Database Schema


Protein Feature Ontology


Advanced Query Form

Intuitive construction of complex database queries with the power of SQL


Enzymatic-Reactions

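[Diagram: frames for the genes sdhA, sdhB, sdhC, sdhD, their polypeptide products Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2, the complex Succinate dehydrogenase, an Enzymatic-reaction frame, the reaction Succinate + FAD = fumarate + FADH2, and the pathway TCA Cycle, linked by the slots product, component-of, catalyzes, reaction, in-pathway]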


Need for Enzymatic-Reactions

Reactions can have isozymes
Enzymes can be multi-functional

Enzymatic-Reaction frames are needed to decouple the many-to-many relationships

Isozymes may have different inhibitors, etc.

[Figure: Gene-Reaction schema diagrams]
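
The decoupling role of Enzymatic-Reaction frames can be sketched in a few lines of Python. This is only an illustration, not the Pathway Tools implementation: the class names, the `inhibitors` field, and the entries marked hypothetical are assumptions made for the example. Only succinate dehydrogenase, its genes, and its reaction come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Reaction:
    name: str
    equation: str


@dataclass
class Enzyme:
    name: str
    genes: list[str] = field(default_factory=list)


@dataclass
class EnzymaticReaction:
    """Join frame: one enzyme catalyzing one reaction, plus pairing-specific data."""
    enzyme: Enzyme
    reaction: Reaction
    inhibitors: list[str] = field(default_factory=list)


def isozymes(reaction, enzrxns):
    """Names of all enzymes linked to the given reaction."""
    return [er.enzyme.name for er in enzrxns if er.reaction is reaction]


def functions(enzyme, enzrxns):
    """Names of all reactions linked to the given enzyme."""
    return [er.reaction.name for er in enzrxns if er.enzyme is enzyme]


# From the slides: succinate dehydrogenase and its reaction.
sdh_rxn = Reaction("succinate-dehydrogenase-rxn", "succinate + FAD = fumarate + FADH2")
sdh = Enzyme("Succinate dehydrogenase", ["sdhA", "sdhB", "sdhC", "sdhD"])

# Hypothetical entries, added only to show the many-to-many case.
other_enzyme = Enzyme("hypothetical-isozyme", ["geneX"])
other_rxn = Reaction("hypothetical-second-rxn", "A = B")

enzrxns = [
    EnzymaticReaction(sdh, sdh_rxn),
    # The isozyme carries its own inhibitor list, independent of sdh.
    EnzymaticReaction(other_enzyme, sdh_rxn, inhibitors=["hypothetical-inhibitor"]),
    # A multi-functional enzyme: the same enzyme linked to a second reaction.
    EnzymaticReaction(other_enzyme, other_rxn),
]

print(isozymes(sdh_rxn, enzrxns))
print(functions(other_enzyme, enzrxns))
```

Because inhibitors live on the join frame rather than on the enzyme or the reaction, two isozymes of the same reaction can record different inhibitors without duplicating either object.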


New Representation of Regulation

Previously, regulation was represented idiosyncratically:
  One representation for modulation of enzymes
  Completely different representation for regulation of transcription initiation

Now unified under a single Regulation class with subclasses

This enables us to easily add support for new kinds of regulation, e.g.
  Transcriptional attenuation (done)
  Regulation of translation by small RNAs (in progress)

New tools for display and editing of new Regulation classes


Operons and Transcription Units

Operon: A set of two or more genes that are transcribed as a unit. May include multiple promoters.

Transcription Unit: A set of one or more genes that are transcribed as a unit from a single promoter.

Pathway Tools schema does not represent operons explicitly, only transcription-units
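
Since only transcription units are stored, an operon has to be derived when it is needed. One plausible derivation, sketched below, merges transcription units that share genes and keeps the merged groups with two or more genes. This merging rule is an assumption made for illustration, not a description of what Pathway Tools does, and the transcription-unit and gene names are hypothetical.

```python
def operons_from_transcription_units(tus):
    """Merge transcription units that share genes; keep groups of >= 2 genes.

    tus: dict mapping a transcription-unit name to its set of gene names.
    Returns a list of gene sets, one per derived operon.
    """
    groups = []  # pairwise-disjoint gene sets built so far
    for genes in tus.values():
        merged = set(genes)
        remaining = []
        for group in groups:
            if group & merged:
                merged |= group          # overlaps the new unit: absorb it
            else:
                remaining.append(group)  # unrelated: keep as is
        remaining.append(merged)
        groups = remaining
    # By the definition on the slide, an operon has two or more genes.
    return [g for g in groups if len(g) >= 2]


# Hypothetical example: tu2 starts at an internal promoter inside tu1.
transcription_units = {
    "tu1": {"geneA", "geneB", "geneC"},
    "tu2": {"geneB", "geneC"},
    "tu3": {"geneD"},
}

print(operons_from_transcription_units(transcription_units))
```

Here tu1 and tu2 collapse into one three-gene operon with two promoters, while the single-gene unit tu3 is a transcription unit but not an operon.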


Ontology for Transcriptional Regulation

[Diagram: frames trpLEDCBA, trpLEDCBAp1, trpL, trpE, trpD, trpC, trpB, trpA, site001, reg001, TrpR*trp, apoTrpR, trp, and BR001, linked by the slots components, left, right, regulated-by, associated-binding-site, regulator]


Representation of Transcriptional Regulation

Transcription-Unit
  Components include genes, a single promoter, zero or more terminators

Binding-Sites
  Linked to regulation frames

Regulation frames
  Transcriptional Initiation: defines a 3-way pairing between promoter, transcription factor and binding-site
  Transcriptional Attenuation: defines relationship between terminator and the entity (tRNA, protein, small molecule) that regulates it
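
The 3-way pairing can be made concrete with a minimal frame sketch for the trp example. The frame identifiers and slot names are those shown on the slides; the `Frame` class, the class labels, and the choice of which frame carries which slot are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A frame: an identifier, a class label, and multi-valued slots."""
    frame_id: str
    frame_class: str
    slots: dict[str, list[str]] = field(default_factory=dict)


FRAMES = [
    Frame("trpLEDCBA", "Transcription-Unit",
          {"components": ["trpLEDCBAp1", "site001",
                          "trpL", "trpE", "trpD", "trpC", "trpB", "trpA"]}),
    Frame("trpLEDCBAp1", "Promoter", {"regulated-by": ["reg001"]}),
    Frame("site001", "Binding-Site"),
    # The regulation frame ties the transcription factor to the binding site;
    # the promoter points at it through its regulated-by slot.
    Frame("reg001", "Regulation",
          {"regulator": ["TrpR*trp"], "associated-binding-site": ["site001"]}),
    Frame("TrpR*trp", "Transcription-Factor"),
]
KB = {f.frame_id: f for f in FRAMES}


def regulators_of(kb, tu_id):
    """Follow components -> regulated-by -> regulator for a transcription unit."""
    found = []
    for component_id in kb[tu_id].slots.get("components", []):
        component = kb.get(component_id)
        if component is None:
            continue  # gene frames are omitted from this small sketch
        for reg_id in component.slots.get("regulated-by", []):
            found.extend(kb[reg_id].slots.get("regulator", []))
    return found


print(regulators_of(KB, "trpLEDCBA"))
```

A transcriptional attenuation frame would follow the same pattern, with a terminator in place of the promoter.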


Infer Anti-Microbial Drug Targets

Infer drug targets as genes coding for enzymes that catalyze chokepoint reactions

Two types of chokepoint reactions:

Genome Research 14:917 2004
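
A minimal sketch of chokepoint detection follows. It assumes the usual definition: a reaction is a chokepoint if it is the only consumer of some metabolite or the only producer of some metabolite. The toy network and the function name are hypothetical, and reversible reactions are not handled.

```python
from collections import defaultdict


def find_chokepoints(reactions):
    """Return (consuming, producing) chokepoint reaction names.

    reactions: dict mapping a reaction name to (reactants, products),
    each a set of metabolite names. Reactions are treated as one-directional.
    """
    consumers = defaultdict(set)
    producers = defaultdict(set)
    for name, (reactants, products) in reactions.items():
        for metabolite in reactants:
            consumers[metabolite].add(name)
        for metabolite in products:
            producers[metabolite].add(name)
    # A metabolite with exactly one consumer (or producer) makes that
    # reaction a chokepoint.
    consuming = {next(iter(r)) for r in consumers.values() if len(r) == 1}
    producing = {next(iter(r)) for r in producers.values() if len(r) == 1}
    return consuming, producing


# Hypothetical toy network.
network = {
    "r1": ({"A"}, {"B"}),
    "r2": ({"C"}, {"B"}),
    "r3": ({"B"}, {"D"}),
    "r4": ({"B"}, {"E"}),
    "r5": ({"D"}, {"F"}),
}

consuming, producing = find_chokepoints(network)
print(sorted(consuming), sorted(producing))
```

In this network B has two producers (r1, r2) and two consumers (r3, r4), so none of those reactions is a chokepoint on account of B alone; each qualifies only through a metabolite it handles exclusively.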


Reachability Analysis of Metabolic Network

Given:
  A PGDB for an organism
  A set of initial metabolites

Infer:
  What set of products can be synthesized by the small-molecule metabolism of the organism

Can known growth medium yield known essential compounds?

Romero and Karp, Pacific Symposium on Biocomputing, 2001


Algorithm: Forward Propagation Through Production System

Each reaction becomes a production rule
Each metabolite in nutrient set becomes an axiom

[Diagram with labels: Nutrient set, Transport, Metabolite set, Reactants, "Fire" reactions, Products, PGDB reaction pool]
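
Forward propagation can be sketched as a fixed-point loop: a reaction fires once all of its reactants are in the metabolite set, and its products are then added to that set. The code below is a simplified illustration under that assumption, not the published implementation; it ignores transport and reversibility, and the network is hypothetical.

```python
def forward_propagate(reactions, nutrients):
    """Compute the metabolites reachable from a nutrient set.

    reactions: dict mapping a reaction name to (reactants, products),
    each a set of metabolite names (the production rules).
    nutrients: iterable of metabolite names (the axioms).
    Returns (reachable metabolites, names of reactions that fired).
    """
    metabolites = set(nutrients)
    fired = set()
    changed = True
    while changed:
        changed = False
        for name, (reactants, products) in reactions.items():
            # Fire each reaction at most once, when all reactants are present.
            if name not in fired and reactants <= metabolites:
                fired.add(name)
                metabolites |= products
                changed = True
    return metabolites, fired


# Hypothetical toy network.
network = {
    "r1": ({"A", "B"}, {"C"}),
    "r2": ({"C"}, {"D"}),
    "r3": ({"E"}, {"F"}),
}

reachable, fired = forward_propagate(network, {"A", "B"})
essential = {"D", "F"}
missing = essential - reachable
print(sorted(reachable), sorted(fired), sorted(missing))
```

Comparing the reachable set against a list of essential compounds, as in the last lines, gives the kind of check posed on the previous slide: whether a known growth medium can yield the known essential compounds.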


Results

Phase I: Forward propagation
  21 initial compounds yielded only half of the 41 essential compounds for E. coli

Phase II: Manually identify
  Bugs in EcoCyc (e.g., two objects for tryptophan)
    A → B    B' → C
  Incomplete knowledge of E. coli metabolic network
    A + B → C + D
  "Bootstrap compounds"
  Missing initial protein substrates (e.g., ACP)
  Protein synthesis not represented

Phase III: Forward propagation with 11 more initial metabolites
  Yielded all 41 essential compounds


Integration with other efforts

Export of
  BioPAX
  SBML

Import of
  Enzyme DB (EC hierarchy of reactions)
  GO
  NCBI Taxonomy
  BioPAX (work in progress)


Near Future

Signalling pathways
  Validating the design

Regulation
  Small RNAs, and other additional types

Higher Eukaryotes
  Gene expression, Multiple splice forms
  Cell types, localization


Summary

Pathway/Genome Databases
  MetaCyc non-redundant DB of literature-derived pathways
  370 organism-specific PGDBs available through SRI at BioCyc.org
  Computational theories of biochemical machinery

Pathway Tools software
  Extract pathways from genomes
  Morph annotated genome into structured ontology
  Distributed curation tools for MODs
  Query, visualization, WWW publishing


BioCyc and Pathway Tools Availability

BioCyc.org Web site and database files freely available to all

Pathway Tools freely available to non-profits
  Macintosh, PC/Windows, PC/Linux

References
  Pathway Tools User's Guide
    Appendix A: Guide to the Pathway Tools Schema
  Ontology Papers section of http://biocyc.org/publications.shtml


Acknowledgements

SRI
  Suzanne Paley, Ron Caspi, Ingrid Keseler, Carol Fulcher, Markus Krummenacker, Alex Shearer, Tomer Altman, Joe Dale, Fred Gilham, Pallavi Kaipa

EcoCyc Collaborators
  Julio Collado-Vides, Robert Gunsalus, Ian Paulsen

MetaCyc Collaborators
  Sue Rhee, Peifen Zhang, Kate Dreher
  Lukas Mueller, Anuradha Pujar

Funding sources:
  NIH National Center for Research Resources
  NIH National Institute of General Medical Sciences
  NIH National Human Genome Research Institute

BioCyc.org

Learn more from BioCyc webinars: biocyc.org/webinar.shtml


BioWarehouse: A Bioinformatics Database Warehouse

Peter D. Karp, Tom J. Lee, Valerie Wagner

[Diagram: BioWarehouse, on Oracle (10g) or MySQL (4.1.11), surrounded by the source databases UniProt, ENZYME, Genbank, Taxonomy, BioCyc, BioPAX, GO, MAGE-ML, KEGG, CMR, Eco2DBase]

BMC Bioinformatics 7:170 2006
bioinformatics.ai.sri.com/biowarehouse/


Motivations

Hundreds of bioinformatics DBs exist

Important problems involve queries across multiple DBs


Why is the Multidatabase Approach Alone Not Sufficient?

Multidatabase query approaches assume databases are in a queryable DBMS

Most sites that do operate DBMSs do not allow remote query access because of security and loading concerns

Users want to control data stability
Users want to control speed of their hardware
Internet bandwidth limits query throughput
Users need to capture, integrate and publish locally produced data of different types

Multidatabase and Warehouse approaches complementary


Key Challenges for BioWarehouse

Designing a schema that accurately captures the contents of source DBs

Designing a schema that is understandable and scalable

Addressing poorly-specified syntax & semantics of source DBs

Balancing the preservation of source data with mapping into common semantics


Technical Approach

Multi-platform support: Oracle (10g) and MySQL
Schema support for multitude of bioinformatics datatypes
Create loaders for public bioinformatics DBs
  Parse file format of the source DB
  Semantic transformations
  Insert DB contents into warehouse tables
Provide Warehouse query access mechanisms
  SQL queries via ODBC, JDBC, OAA
Operate public BioWarehouse server: publichouse

BMC Bioinformatics 7:170 2006


PublicHouse Server

Publicly queryable BioWarehouse server operated by SRI

Manages a set of biological DBs constructed using BioWarehouse
  CMR
  Open BioCyc DBs
  ENZYME
  NCBI Taxonomy
  UniProt

Large-scale data mining using
  Dashboard Warehouse Query Analyzer
  MySQL client command line

See: http://bioinformatics.ai.sri.com/biowarehouse/publichouse.html

Host: publichouse.sri.com
Port: 3306
Database: biospice


BioWarehouse Schema

Manages many bioinformatics datatypes simultaneously
  Pathways, Reactions, Chemicals
  Proteins, Genes, Replicons
  Sequences, Sequence Features
  Organisms, Taxonomic relationships
  Computations (sequence matches)
  Citations, Controlled vocabularies
  Links to external databases
  Gene expression datasets
  Protein-protein interactions datasets
  Flow cytometry datasets

Each type of warehouse object implemented through one or more relational tables (currently ~150)


Warehouse Schema

Manages multiple datasets simultaneously
  Dataset = Single version of a database

Version comparison

Multiple software tools or experiments that require access to different versions

Each dataset is a warehouse entity

Every warehouse object is registered in a dataset


Warehouse Schema

Different databases storing the same biological datatypes are coerced into same warehouse tables

Design of most datatypes inspired by multiple databases

Representational tricks to decrease schema bloat
  Single space of primary keys
  Single set of satellite tables such as for synonyms, citations, comments, etc.
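
The two representational tricks, together with dataset registration, can be illustrated with a small in-memory SQLite sketch. The table and column names below are hypothetical and much simpler than the real BioWarehouse schema. The point is only that one identifier sequence serves every table, so a single synonym table can refer to any object.

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (wid INTEGER PRIMARY KEY, name TEXT, version TEXT);
    -- Every warehouse object is registered here with its dataset.
    CREATE TABLE entry   (wid INTEGER PRIMARY KEY,
                          dataset_wid INTEGER REFERENCES dataset(wid));
    CREATE TABLE protein (wid INTEGER PRIMARY KEY REFERENCES entry(wid), name TEXT);
    CREATE TABLE gene    (wid INTEGER PRIMARY KEY REFERENCES entry(wid), name TEXT);
    -- One satellite table serves all object types.
    CREATE TABLE synonym (other_wid INTEGER REFERENCES entry(wid), syn TEXT);
""")

# Single space of primary keys: one counter shared by all tables.
_wids = itertools.count(1)


def add_dataset(name, version):
    wid = next(_wids)
    conn.execute("INSERT INTO dataset VALUES (?, ?, ?)", (wid, name, version))
    return wid


def add_object(table, dataset_wid, name, synonyms=()):
    if table not in ("protein", "gene"):
        raise ValueError(f"unknown object table: {table}")
    wid = next(_wids)
    conn.execute("INSERT INTO entry VALUES (?, ?)", (wid, dataset_wid))
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (wid, name))
    conn.executemany("INSERT INTO synonym VALUES (?, ?)",
                     [(wid, s) for s in synonyms])
    return wid


def synonyms_of(wid):
    rows = conn.execute(
        "SELECT syn FROM synonym WHERE other_wid = ? ORDER BY syn", (wid,))
    return [row[0] for row in rows]


# Hypothetical data.
ds = add_dataset("ExampleDB", "1.0")
p = add_object("protein", ds, "protein-1", ["P1", "prot one"])
g = add_object("gene", ds, "gene-1", ["g1"])

print(synonyms_of(p), synonyms_of(g))
```

Because a protein and a gene can never share an identifier, `synonyms_of` needs no type column to tell them apart, and loading a second version of a source database only means registering its objects under a new dataset row.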
