The CROP (Common Reference Ontologies for Plants) Initiative Barry Smith September 13, 2013 1.

Post on 28-Dec-2015


The CROP (Common Reference Ontologies for Plants)

Initiative

Barry Smith

September 13, 2013

http://ontology.buffalo.edu/smith

Agenda

• The OBO Foundry
• Principles
• Reference ontologies vs. application ontologies
• Other ontology consortia
• The CROP Initiative
• Examples of ontologies within CROP

On June 22, 1799, in Paris, everything changed

International System of Units


How to find data?

How to find other people’s data?

How to reason with data when you find it?

How to work out what data does not yet exist?

How to solve the problem of making the data we find queryable and re-usable by others?

Part of the solution must involve: standardized terminologies and coding schemes

But there are multiple kinds of standardization for biological data, and they do not work well together

Proposed solution: Ontology-based annotation of data


ontologies = standardized labels designed for use in annotations

to make the data cognitively accessible to human beings

and algorithmically accessible to computers


ontologies = high quality controlled structured vocabularies for the annotation (description) of data, images, journal articles …

Ramirez et al., “Linking of Digital Images to Phylogenetic Data Matrices Using a Morphological Ontology”, Syst. Biol. 56(2):283–294, 2007

what cellular component?

what molecular function?

what biological process?

ontologies used in curation of literature

Proposed framework: the Semantic Web

• HTML demonstrated the power of the Web to allow sharing of information

• can we use semantic technology to create a Web 2.0 which would allow algorithmic reasoning with online information based on a common Web Ontology Language (OWL)?

• can we use netcentricity, common URLs, to break down silos and create useful integration of online data and information?

Ontology success stories, and some reasons for failure

A fragment of the “Linked Open Data” in the biomedical domain


http://bioportal.bioontology.org/


The more ontology-building is successful, the more it fails

OWL breaks down data silos via controlled vocabularies for the description of data dictionaries

Unfortunately the very success of this approach led to the creation of multiple, new, semantic silos – because multiple ontologies are being created in ad hoc ways


http://bioportal.bioontology.org/

Many ontologies in BioPortal are created by importing content from existing ontologies and giving the imported terms new names and new IDs

The result is chaos, with bits and pieces of the same ontologies chopped up and scattered across multiple different places.

Leads to massively redundant effort, forking and doom

• It is easier to write useful software if one works with a simplified model

• (“…we can’t know what reality is like in any case; we only have our concepts…”)

• This looks like a useful model to me

• (One week goes by:) This other thing looks like a useful model to him

• Data in Pittsburgh does not interoperate with data in Vancouver

• Science is siloed

A standard engineering methodology

A good solution to this silo problem must be:

• modular
• incremental
• independent of hardware and software
• bottom-up
• evidence-based
• revisable
• incorporate a strategy for motivating potential developers and users

Uses of ‘ontology’ in PubMed abstracts


main reason for GO’s success

Gene Ontology and associated databases

“make it possible to systematically dissect large gene lists in an attempt to assemble a summary of the most enriched and pertinent biology” (PMC2615629)

GO provides a controlled system of terms for use in annotating (describing, tagging) data

• multi-species, multi-disciplinary, open source

• contributing to the cumulativity of scientific results obtained by distinct research communities

• compare use of kilograms, meters, seconds in formulating experimental results


GO is 3 ontologies

biological process

cellular component

molecular function

Top-Level Architecture

• Continuant
– Independent Continuant
– Dependent Continuant
• Occurrent (Process, Event)

[Figure: universals and their instances]

Problem with the GO

• it covers only three types of entities

• no diseases

• no laboratory artifacts

• no anatomy (above the cell)

• only species-terms for development

• no phenotypes


The Open Biomedical Ontologies (OBO) Foundry

GRANULARITY (rows) × RELATION TO TIME (columns)

• ORGAN AND ORGANISM
– CONTINUANT, INDEPENDENT: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
– CONTINUANT, DEPENDENT: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
– OCCURRENT: Biological Process (GO)

• CELL AND CELLULAR COMPONENT
– CONTINUANT, INDEPENDENT: Cell (CL); Cellular Component (FMA, GO)
– CONTINUANT, DEPENDENT: Cellular Function (GO)

• MOLECULE
– CONTINUANT, INDEPENDENT: Molecule (ChEBI, SO, RnaO, PrO)
– CONTINUANT, DEPENDENT: Molecular Function (GO)
– OCCURRENT: Molecular Process (GO)

rationale of OBO Foundry coverage

GRANULARITY (rows) × RELATION TO TIME (columns)

• ORGAN AND ORGANISM
– CONTINUANT, INDEPENDENT: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
– CONTINUANT, DEPENDENT: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
– OCCURRENT: Organism-Level Process (GO)

• CELL AND CELLULAR COMPONENT
– CONTINUANT, INDEPENDENT: Cell (CL); Cellular Component (FMA, GO)
– CONTINUANT, DEPENDENT: Cellular Function (GO)
– OCCURRENT: Cellular Process (GO)

• MOLECULE
– CONTINUANT, INDEPENDENT: Molecule (ChEBI, SO, RNAO, PRO)
– CONTINUANT, DEPENDENT: Molecular Function (GO)
– OCCURRENT: Molecular Process (GO)

a shared portal for (so far) 58 ontologies (low regimentation)

http://obo.sourceforge.net NCBO BioPortal

First step (2001)


OBO builds on the principles successfully implemented by the GO

recognizing that ontologies need to be developed in tandem

The OBO Foundry

http://obofoundry.org/

Second step (2006)

Building out from the original GO

[Figure: OBO Foundry coverage grid. Rows (GRANULARITY): ORGAN AND ORGANISM; CELL AND CELLULAR COMPONENT; MOLECULE. Columns (RELATION TO TIME): CONTINUANT (INDEPENDENT, DEPENDENT); OCCURRENT. Cells name the ontology covering each combination, e.g. Cell (CL), Molecular Function (GO), Biological Process (GO).]

initial OBO Foundry coverage

[Figure: OBO Foundry coverage grid. Rows (GRANULARITY): ORGAN AND ORGANISM; CELL AND CELLULAR COMPONENT; MOLECULE. Columns (RELATION TO TIME): CONTINUANT (INDEPENDENT, DEPENDENT); OCCURRENT. Occurrent cells: Organism-Level Process (GO), Cellular Process (GO), Molecular Process (GO).]

OBO Foundry Principles

• common formal architecture

• clearly delineated content (redundant – overlaps with orthogonality)

• the ontology is well-documented (overlaps with rules for definitions; needs expanding, for developers, for users, minimal metadata)

• plurality of independent users

• single locus of authority, trackers, help desk

OBO Foundry Principles

• textual definitions plus formal definitions

• all definitions should be of the genus-species form

A =def. a B which Cs

where B is the parent term of A in the ontology hierarchy

• formal definitions use OBO format or OWL
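
The genus-species pattern can be illustrated with a minimal sketch. The `Term` class, the `render_definition` helper, and the example terms are hypothetical and chosen only for illustration; they are not part of OBO format, OWL, or any Foundry tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    """A term with its parent (genus) and differentia (the 'which Cs' part)."""
    label: str
    parent: Optional["Term"] = None
    differentia: str = ""


def render_definition(term: Term) -> str:
    """Render 'A =def. a B which Cs', where B is the parent of A."""
    if term.parent is None:
        raise ValueError(f"{term.label!r} is a root term and has no genus")
    if not term.differentia:
        raise ValueError(f"{term.label!r} lacks a differentia")
    return f"{term.label} =def. a {term.parent.label} which {term.differentia}"


# Illustrative terms only; the wording is not taken from any published ontology.
cell = Term("cell")
plant_cell = Term("plant cell", parent=cell, differentia="is part of a plant")

print(render_definition(plant_cell))
# plant cell =def. a cell which is part of a plant
```

Tying the genus to the parent term means a definition cannot drift away from the hierarchy: moving a term to a new parent forces its definition to change with it.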


Orthogonality

• For each domain, there should be convergence upon a single ontology that is recommended for use by those who wish to become involved with the Foundry initiative

• Part of the goal here is to avoid the need for mappings – which are in any case too expensive, too fragile, too difficult to keep up-to-date as mapped ontologies change

• Orthogonality means:
– everyone knows where to look to find out how to annotate each kind of data
– everyone knows where to look to find content for application ontologies

Orthogonality = non-redundancy for the reference ontologies inside the Foundry

• application ontologies can overlap, but then only in those areas where common coverage is supplied by a reference ontology
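
As a rough illustration of non-redundancy, the sketch below flags any domain claimed by more than one reference ontology. The registry, the domain labels, and the name `LocalCellOnt` are assumptions made up for this example; the Foundry does not publish its review process in this form.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical registry: reference ontology name -> the domain it claims.
registry: Dict[str, str] = {
    "CL": "cell types",
    "LocalCellOnt": "cell types",      # invented competitor, for illustration
    "PATO": "phenotypic qualities",
}


def overlapping_domains(reg: Dict[str, str]) -> Dict[str, List[str]]:
    """Return each domain that is claimed by more than one ontology."""
    by_domain: Dict[str, List[str]] = defaultdict(list)
    for name, domain in reg.items():
        by_domain[domain].append(name)
    return {
        domain: sorted(names)
        for domain, names in by_domain.items()
        if len(names) > 1
    }


print(overlapping_domains(registry))
# {'cell types': ['CL', 'LocalCellOnt']}
```

An empty result corresponds to the orthogonal case: one recommended reference ontology per domain.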


COMMON FORMAL ARCHITECTURE: The ontology uses relations which are unambiguously defined following the pattern of definitions laid down in the Basic Formal Ontology (BFO)

http://www.ifomis.uni-saarland.de/bfo/

‘formal’ = domain neutral

PRINCIPLES

Basic Formal Ontology

• Continuant
– Independent Continuant: cell component
– Dependent Continuant: molecular function
• Occurrent: biological process

OBO Foundry

provides guidelines (traffic laws) to new groups of ontology developers in ways which can counteract current dispersion of effort

New principle: Employ the methodology of cross-products

compound terms in ontologies are to be defined as cross-products of simpler terms. E.g., elevated blood glucose is a cross-product of PATO: increased concentration with FMA: blood and ChEBI: glucose.

= factoring out of ontologies into discipline-specific modules (orthogonality)
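
A minimal sketch of the cross-product idea, using the elevated blood glucose example from the slide. The `TermRef` and `CrossProduct` classes are hypothetical names introduced here; real cross-product definitions are written in OBO format or OWL, not in Python.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class TermRef:
    """A reference to a term in some source ontology (prefix plus label)."""
    ontology: str
    label: str

    def __str__(self) -> str:
        return f"{self.ontology}: {self.label}"


@dataclass(frozen=True)
class CrossProduct:
    """A compound term defined purely by reference to simpler terms."""
    label: str
    components: Tuple[TermRef, ...]

    def source_ontologies(self) -> Set[str]:
        """The discipline-specific modules this compound term depends on."""
        return {component.ontology for component in self.components}


elevated_blood_glucose = CrossProduct(
    label="elevated blood glucose",
    components=(
        TermRef("PATO", "increased concentration"),
        TermRef("FMA", "blood"),
        TermRef("ChEBI", "glucose"),
    ),
)

print(sorted(elevated_blood_glucose.source_ontologies()))
# ['ChEBI', 'FMA', 'PATO']
```

Because the compound term stores only references, a revision to any source ontology is inherited rather than copied, which is the point of factoring content into modules.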


The methodology of cross-products

enforcing use of common relations in linking terms drawn from Foundry ontologies serves

• to ensure that the ontologies are maintained and revised in tandem

• logically defined relations serve to bind terms in different ontologies together to create a network


Building out from the original GO

[Figure: OBO Foundry coverage grid. Rows (GRANULARITY): ORGAN AND ORGANISM; CELL AND CELLULAR COMPONENT; MOLECULE. Columns (RELATION TO TIME): CONTINUANT (INDEPENDENT, DEPENDENT); OCCURRENT. Cells name the ontology covering each combination, e.g. Cell (CL), Molecular Function (GO), Biological Process (GO).]

Population-level ontologies

[Figure: OBO Foundry coverage grid (GRANULARITY × RELATION TO TIME) extended with a row COMPLEX OF ORGANISMS: Family, Community, Deme, Population (independent continuant); Population Phenotype (dependent continuant); Population Process (occurrent). The rows ORGAN AND ORGANISM, CELL AND CELLULAR COMPONENT, and MOLECULE appear as before.]

Environment Ontology

[Figure: OBO Foundry coverage grid (GRANULARITY × RELATION TO TIME) with an added element labeled “environments”.]

Extension Strategy + Modular Organization

• top level: Basic Formal Ontology (BFO)

• mid-level: Information Artifact Ontology (IAO); Ontology for Biomedical Investigations (OBI); Spatial Ontology (BSPO)

• domain level: Anatomy Ontology (FMA*, CARO); Environment Ontology (EnvO); Infectious Disease Ontology (IDO*); Biological Process Ontology (GO*); Cell Ontology (CL); Cellular Component Ontology (FMA*, GO*); Phenotypic Quality Ontology (PaTO); Subcellular Anatomy Ontology (SAO); Sequence Ontology (SO*); Molecular Function (GO*); Protein Ontology (PRO*)

Third step: Creation of new ontology consortia, modeled on the OBO Foundry

OBO Foundry: Open Biological and Biomedical Ontologies

NIF Standard: Neuroscience Information Framework

eagle-i: Ontologies used by VIVO and CTSAconnect

IDO Consortium: Infectious Disease Ontology

A good solution to the silo problem must be:

• modular
• incremental
• independent of software and hardware
• bottom-up
• evidence-based
• revisable
• incorporate a strategy for motivating potential developers and users

Because the ontologies in the Foundry are built as orthogonal modules which form an incrementally evolving network

• scientists are motivated to commit to developing ontologies because they will need in their own work ontologies that fit into this network

• users are motivated by the assurance that the ontologies they turn to are maintained by experts


More benefits of orthogonality

• helps those new to ontology to find what they need
• to find models of good practice
• ensures mutual consistency of ontologies (trivially)
• and thereby ensures additivity of annotations

More benefits of orthogonality

• it rules out the sorts of simplification and partiality which may be acceptable under more pluralistic regimes

• thereby brings an obligation on the part of ontology developers to commit to scientific accuracy and domain-completeness


More benefits of orthogonality

• No need to reinvent the wheel for each new domain
• Can profit from storehouse of lessons learned
• Can more easily reuse what is made by others
• Can more easily reuse training
• Can more easily inspect and criticize results of others’ work
• Leads to innovations (e.g. MIREOT, OntoFox) in strategies for combining ontologies

Reference Ontologies vs. Application Ontologies

Reference ontology = an ontology that captures generic content and is designed for aggressive reuse in multiple different types of context. Our assumption is that most reference ontologies will be created manually on the basis of explicit assertion of the taxonomical and other relations between their terms.

Reference Ontologies vs. Application Ontologies

By ‘application ontology’ we mean an ontology that is tied to specific local applications. Each application ontology is created by using ontology merging software to combine new, local content with generic content taken over from relevant reference ontologies

Xiang, et al., “OntoFox: Web-Based Support for Ontology Reuse”, BMC Research Notes. 2010, 3:175.
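
The idea of combining local content with generic content taken over from a reference ontology can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm used by OntoFox or any other merging tool; the `REF:` and `LOCAL:` identifiers and both function names are hypothetical.

```python
from typing import Dict, Iterable, Optional

# Hypothetical reference ontology: term ID -> parent ID (None for the root).
REFERENCE: Dict[str, Optional[str]] = {
    "REF:0001": None,
    "REF:0002": "REF:0001",
    "REF:0003": "REF:0002",
    "REF:0004": "REF:0001",
}


def extract_with_ancestors(
    reference: Dict[str, Optional[str]], wanted: Iterable[str]
) -> Dict[str, Optional[str]]:
    """Copy the wanted terms and all their ancestors, keeping original IDs."""
    subset: Dict[str, Optional[str]] = {}
    for term_id in wanted:
        current: Optional[str] = term_id
        while current is not None and current not in subset:
            subset[current] = reference[current]
            current = reference[current]
    return subset


def build_application_ontology(
    reference: Dict[str, Optional[str]],
    wanted: Iterable[str],
    local_terms: Dict[str, Optional[str]],
) -> Dict[str, Optional[str]]:
    """Merge reused reference content with new, local content."""
    merged = extract_with_ancestors(reference, wanted)
    for term_id, parent in local_terms.items():
        if term_id in reference:
            raise ValueError(f"{term_id} would shadow a reference term")
        merged[term_id] = parent
    return merged


app = build_application_ontology(
    REFERENCE,
    wanted=["REF:0003"],
    local_terms={"LOCAL:0001": "REF:0003"},  # local term under a reused parent
)
print(sorted(app))
# ['LOCAL:0001', 'REF:0001', 'REF:0002', 'REF:0003']
```

Reused terms keep their original identifiers, so annotations made with the application ontology remain comparable with annotations made against the reference ontology.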

Normalization of the ontology space – content from reference ontologies is maximally re-used, e.g. in formulation of compound terms and of cross-product definitions

(Compare normalization of a vector space)

(Compare, again, SI System of Units)

International System of Units


Infectious Disease Ontology (IDO)


We have data, e.g.:
• TBDB: Tuberculosis Database, including Microarray data
• VFDB: Virulence Factor DB
• TropNetEurop Dengue Case Data
• ISD: Influenza Sequence Database at LANL
• MPD/MRD/CPP: Protein Data of PIR Resource Center for Biodefense Proteomics Research
• PathPort: Pathogen Portal Project

Purpose of Infectious Disease Ontology (IDO)

• Retrieval and integration of infectious disease relevant data
– Sequence and protein data for pathogens
– Case report data for patients
– Clinical trial data for drugs, vaccines
– Epidemiological Data for surveillance, prevention
– ...

• Goal: to make data deriving from different sources comparable and computable

IDO Strategy

• Reference ontology (IDO Core) with terms relevant to any infectious disease

• Disease- and organism-specific application ontologies
– for different types of host, types of vector, types of pathogen, types of disease

Infectious Disease Ontology (IDO)

• Member of the OBO Foundry
• A suite of ontologies
– IDO Core:
  • General terms in the ID domain.
  • A hub for all IDO extensions.
– IDO Extensions:
  • Disease specific.
  • Developed by subject matter experts.
• Provides:
– Clear, precise, and consistent natural language definitions
– Computable logical representations (OWL, OBO)

How IDO evolves

CORE and SPOKES: Domain ontologies

SEMI-LATTICE: By subject matter experts in different communities of interest.

[Figure: IDOCore as the hub, with the extensions IDOSa, IDOHumanSa, IDORatSa, IDOMRSa, IDOStrep, IDORatStrep, IDOHumanStrep, IDOHumanBacterial, IDOAntibioticResistant, IDOMAL, IDOHIV, IDOFLU]

IDO Process Model

Sample Application: A lattice of infectious disease application ontologies from NARSA isolate data

• Expose value of Genotype-Phenotype Linked Data by converting a free-text database from NARSA (Network on Antimicrobial Resistance in Staphylococcus aureus) into a computational resource

Ways of differentiating Staphylococcus aureus infectious diseases

• Infectious Disease
– By host type
– By (sub-)species of pathogen
– By antibiotic resistance
– By anatomical site of infection

• Bacterial Infectious Disease
– By PFGE (Strain)
– By MLST (Sequence Type)
– By BURST (Clonal Complex)

• Sa Infectious Disease
– By SCCmec type
  • By ccr type
  • By mec class
– spa type

http://www.sccmec.org/Pages/SCC_ClassificationEN.html

ido.owl

narsa.owl

narsa-isolates.owl

ndf-rt

NRS701’s resistance to clindamycin

Further extensions of IDO
• Vaccine (Vaccine Ontology)
• Plant IDO

from ICBO 2012:

Founding CROP

The ontologies in CROP

General ontologies taken over from OBO Foundry
• ChEBI Chemistry ontology
• GO Gene Ontology
• PRO Protein Ontology
• ENVO Environment Ontology + GAZ Gazetteer built on ontological principles
• PATO Phenotype Ontology

Plant specific ontologies to be developed by CROP group

PO Plant Ontology
TO Trait Ontology
EO Plant Environment Ontology
Plant IDO
Plant Disease

Action items:
fix relation between EnvO and EO
fix relation between PATO and TO

Taxonomy resource (for diseases of host and causal organisms + vectors/secondary hosts)

NCBI Taxonomy has most of the hosts, but not the viruses

Next steps in CROP:

PRO-PO-GO Meeting
Buffalo, Spring 2013

PRO = protein ontology
PO = plant ontology
GO = gene ontology

The Environment Ontology

OBO Foundry
Genomic Standards Consortium
Natural Environment Research Council (UK)
USDA, Gramene, J. Craig Venter Institute ...

Applications of EnvO in biology


How EnvO currently works for information retrieval

Retrieve all experiments on organisms obtained from:
– deep-sea thermal vents
– arctic ice cores
– rainforest canopy
– alpine melt zone

Retrieve all data on organisms sampled from:
– hot and dry environments
– cold and wet environments
– a height above 5,000 meters

Retrieve all the omic data from soil organisms subject to:
– moderate heavy metal contamination
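
Queries of this kind rely on the ontology hierarchy: a sample annotated with a specific environment term should also be returned for any more general term. The sketch below shows that mechanism only. The term labels, the `IS_A` hierarchy, and the sample annotations are invented for illustration and are not actual EnvO classes, identifiers, or data.

```python
from typing import Dict, List, Set

# Hypothetical is_a hierarchy (child -> parents); labels are illustrative.
IS_A: Dict[str, List[str]] = {
    "hot desert": ["hot and dry environment"],
    "deep-sea thermal vent": ["marine environment"],
    "arctic ice core": ["cold environment"],
    "hot and dry environment": ["environment"],
    "marine environment": ["environment"],
    "cold environment": ["environment"],
}

# Hypothetical sample annotations: sample ID -> environment term.
SAMPLES: Dict[str, str] = {
    "sample-1": "hot desert",
    "sample-2": "deep-sea thermal vent",
    "sample-3": "hot and dry environment",
}


def ancestors_or_self(term: str) -> Set[str]:
    """Return the term plus everything reachable from it via is_a."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(IS_A.get(current, []))
    return seen


def retrieve(query_term: str) -> List[str]:
    """Find samples annotated with the query term or any of its subtypes."""
    return sorted(
        sample
        for sample, term in SAMPLES.items()
        if query_term in ancestors_or_self(term)
    )


print(retrieve("hot and dry environment"))
# ['sample-1', 'sample-3']
```

The sample annotated only as "hot desert" is found by the broader query without anyone having tagged it a second time.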


extending EnvO to clinical and translational research

• we have public health, community and population data

• we need to make this data available for search and algorithmic processing

• we create a consensus-based ontology which can interoperate with ontologies for neighboring domains of medicine and basic biology


Environment = totality of circumstances external to a living organism or group of organisms

– pH
– evapotranspiration
– turbidity
– available light
– predominant vegetation
– predatory pressure
– nutrient limitation …

extend EnvO to the clinical domain
– dietary patterns (Food Ontology: FAO, USDA) ... allergies
– neighborhood patterns
  • built environment, living conditions
  • climate
  • social networking
  • crime, transport
  • education, religion, work
  • health, hygiene
– disease patterns
  • bio-environment (bacteriological, ...)
  • patterns of disease transmission (links to IDO)

Aligning EnvO to the Basic Formal Ontology

habitat

• Habitat =def. An ecosystem which can support the life of a given organism, population, or community

• Realized niche =def. An ecosystem which is that part of a habitat which supports the life of a given organism, population or community

Aligning EnvO to the Basic Formal Ontology

Hutchinsonian niche (niche as volume in a functionally defined hyperspace)

• =def. an n-dimensional hyper-volume whose dimensions correspond to resource gradients over which species are distributed
– degree of slope, exposure to sunlight, soil fertility, foliage density, salinity...

G.E. Hutchinson (1957, 1965)
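
To make the hyper-volume definition concrete, the sketch below tests whether a site's conditions fall inside a niche. Treating the hyper-volume as an axis-aligned box with one (min, max) range per gradient is a simplifying assumption of this example, and the dimension names and values are invented, not taken from EnvO or from Hutchinson.

```python
from typing import Dict, Tuple

# Hypothetical niche: each dimension is a resource gradient with a tolerated
# (min, max) range. The axis-aligned box is an assumption of this sketch.
Niche = Dict[str, Tuple[float, float]]

example_niche: Niche = {
    "slope_degrees": (0.0, 15.0),
    "sunlight_hours": (4.0, 10.0),
    "salinity_ppt": (0.0, 5.0),
}


def within_niche(niche: Niche, conditions: Dict[str, float]) -> bool:
    """True if every niche dimension is measured and falls inside its range."""
    for dimension, (low, high) in niche.items():
        value = conditions.get(dimension)
        if value is None or not (low <= value <= high):
            return False
    return True


site_a = {"slope_degrees": 8.0, "sunlight_hours": 6.5, "salinity_ppt": 1.2}
site_b = {"slope_degrees": 8.0, "sunlight_hours": 6.5, "salinity_ppt": 9.0}

print(within_niche(example_niche, site_a))  # True
print(within_niche(example_niche, site_b))  # False
```

Site B fails on a single gradient (salinity), which is enough to place it outside the n-dimensional volume.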

Aligning EnvO to the Basic Formal Ontology

part_of
