Ontologies for Semantic Normalization of Immunological Data



Yannick Pouliot (1), Atul J. Butte (1, 2)

1. Division of Systems Medicine, Stanford University School of Medicine
2. Center for Pediatric Bioinformatics, Lucile Packard Children’s Hospital, Palo Alto, California

Results & Discussion

Acknowledgments

NIAID, Hewlett Packard Foundation, Butte Lab

57 ontologies within these domains were then screened for preliminary suitability for HIPC data/metadata according to four broad criteria:

Design
• Must be integrative (able to draw on similar or identical concepts from other ontologies)
• Must minimize overlap with other ontologies, consistent with providing terms that can be inter-related across ontologies
• Should be “relatable” to clinical applications
• Must be applicable to humans, and perhaps one animal model
• Must be an ontology, not just a controlled vocabulary (with a few exceptions)

Developmental state
• Evidence of ongoing development and maintenance
• Developed with, or accepted by, standards or professional organizations
• Adheres to standards such as the Basic Formal Ontology

Usage
• Must be released (post beta)
• Reasonably widely adopted

Content
Must exhibit a good balance between expressiveness and usability/understandability:
• Conceptual clarity (e.g., no ambiguous classifications for dual-use organs such as reproductive/urinary organs)
• Limited redundancy of synonymous concepts
• Usable definitions of concepts describing HIPC data or metadata, including experiment design
• Completeness (frequency of missing concepts)
• Correctness (how accurately a concept is expressed)

Ontologies provided by BioPortal [1] were first selected based on their domain of application. Each entry below gives the domain, its description, and an example ontology:

• Analysis: process of data analysis. Example ontology: Ontology of Data Mining
• Anatomy: anatomical structures at all levels of resolution except molecular. Example ontology: Cell Ontology
• Disease: disease states manifested by organisms at anatomical, spatial, temporal and functional levels. Example ontology: Infectious Disease Ontology (IDO)
• Experimental conditions: conditions/specifications associated with a scientific or clinical protocol. Example ontology: Ontology of Clinical Research (OCRe)
• Modeling: process/properties/data types of modeling, computational or otherwise. Example ontology: Interaction Network Ontology (INO)
• Molecule: aspects of biomolecules (structure, sequence, function). Example ontology: ONTIE - Ontology of Immune Epitopes
• Pathways: biochemical and signaling pathways used by organisms. Example ontology: Pathway Ontology
• Phenotype: states manifested by organisms at anatomical, spatial, temporal and functional levels. “Anatomy” and “Disease” are components of “Phenotype” but are treated distinctly. Example ontology: Phenotypic Quality


These ontologies were then analyzed for their ability to recognize terms in text obtained from Build 1 datasets, protocols, and metadata, as well as from Stanford’s DataMt database [2] (which stores many Stanford HIPC datasets). An automated pipeline was written that uses the National Center for Biomedical Ontology’s Annotator [3], accessed through BioPortal’s Web services, to parse the text and attempt to map it to the reference ontologies.
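The annotation step described above might look roughly like the following Python sketch. It is not the authors' pipeline. It assumes the current public BioPortal REST endpoint for the Annotator (with its `text`, `ontologies`, and `apikey` parameters), assumes each returned annotation carries its ontology under `annotatedClass.links.ontology`, and assumes that a per-ontology count and percentage are obtained by counting the input terms that received at least one annotation; the poster does not state how its figures were computed. All function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlencode

# Assumption: the current public BioPortal REST endpoint for the Annotator.
ANNOTATOR_URL = "https://data.bioontology.org/annotator"


def build_annotator_url(text, ontologies, api_key):
    """Build a GET URL asking the Annotator to tag `text` against `ontologies`."""
    params = {
        "text": text,
        "ontologies": ",".join(ontologies),  # ontology acronyms, e.g. "NCIT"
        "apikey": api_key,
    }
    return f"{ANNOTATOR_URL}?{urlencode(params)}"


def ontology_of(annotation):
    """Return the ontology acronym of one annotation record (assumed JSON shape)."""
    ontology_link = annotation["annotatedClass"]["links"]["ontology"]
    return ontology_link.rstrip("/").rsplit("/", 1)[-1]


def coverage_table(responses):
    """Map each ontology to (n, percent) of input terms it annotated at least once.

    `responses` maps each input term to the parsed JSON list returned for it.
    """
    counts = Counter()
    for annotations in responses.values():
        # A term counts once per ontology, however many classes matched it.
        for acronym in {ontology_of(a) for a in annotations}:
            counts[acronym] += 1
    total = len(responses)
    return {acr: (n, round(100.0 * n / total, 1)) for acr, n in counts.items()}
```

Fetching each URL (for example with `urllib.request.urlopen`) and decoding the JSON body would supply the lists that `coverage_table` expects; that network step is left out so the sketch stays self-contained.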


The Problem

Methods

References

1. Noy et al. (2009) “BioPortal: Ontologies and Integrated Data Resources at the Click of a Mouse”, Nucl. Acids Res., 37:W170-W173.
2. Siebert, J., Munsil, D. & Maecker, H. (2011) “A Novel Approach for Integrating and Exploring Heterogeneous Translational Data”, manuscript in preparation.
3. Jonquet et al. (2009) “The Open Biomedical Annotator”, Summit on Transla. Bioinfo., 56-60.

Ontology | Build 1: (%), n | ImmPort CV: (%), n | DataMt: (%), n

NCI Thesaurus 16.7% 7 33.3% 14 52.6% 92

Medical Subject Headings 7.1% 3 21.4% 9 22.9% 40

Molecule role 14.3% 6 0.0% 24.6% 43

SNOMED Clinical Terms 2.4% 1 11.9% 5 17.7% 31

PRotein Ontology (PRO) 16.7% 7 0.0% 10.3% 18

Cell Cycle Ontology 7.1% 3 0.0% 12.6% 22

Ontology for Biomedical Investigations 2.4% 1 14.3% 6 5.7% 10

Experimental Factor Ontology 7.1% 3 7.4% 13

SemanticScience Integrated Ontology 2.9% 5

Units of measurement 2.4% 1 2.3% 4

Phenotypic quality 2.4% 1 2.3% 4

EDAM 7.1% 3 0.6% 1

Foundational Model of Anatomy 2.3% 4

Vaccine Ontology 2.4% 1 1.1% 2

ICPC-2 PLUS 1.1% 2

MGED Ontology 2.4% 1

Measurement Method Ontology 2.4% 1

Gene Ontology 2.4% 1

Ontology of Clinical Research (OCRe) 0.6% 1

Protein-protein interaction 0.6% 1

Mammalian phenotype 0.6% 1

• Many mapping failures are attributable to the lack of definitions for commercial objects within the reference ontologies (e.g., the “Anti-CD27” antibody from BD)

Solution: contact ontology owners to have them add commercial terms to their ontologies

• Many mapping failures are easily correctable

Example: adding a pre-processor able to recognize instances of the “anti-” problem (e.g., “anti-CD20” is not recognized even though “CD20” is known)
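A pre-processor of the kind suggested in the example could be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the regular expression, the function names, and the choice to treat whatever follows “anti-” as the target name are all hypothetical.

```python
import re

# Assumption: an antibody mention is "anti", a hyphen (ASCII or Unicode),
# then a target name made of letters, digits and hyphens (e.g. "CD20", "HLA-DR").
ANTI_TARGET = re.compile(
    r"\banti[-\u2010\u2011]\s*([A-Za-z0-9][A-Za-z0-9-]*)",
    re.IGNORECASE,
)


def find_anti_terms(text):
    """Return (mention, target) pairs, e.g. ("anti-CD20", "CD20")."""
    return [(m.group(0), m.group(1)) for m in ANTI_TARGET.finditer(text)]


def strip_anti_prefix(text):
    """Replace each "anti-X" mention with "X" so the target can be annotated."""
    return ANTI_TARGET.sub(lambda m: m.group(1), text)
```

Running `strip_anti_prefix` on the input before annotation would let the known term “CD20” be matched; keeping the pairs from `find_anti_terms` preserves the fact that the original mention was an antibody against that target. Words such as “antibody” are left alone because the pattern requires a hyphen after “anti”.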

We conclude that ImmPort should be able to migrate toward ontologically-based encodings.

Data from experiments probing the immune system are inherently complex because of the diversity of data types and assay types and the number of biological agents involved. This complexity is further increased by the multi-center nature of the data generated by HIPC. One of the goals of HIPC is to deliver a database able to support broad community access to these complex data sets. Critical to the success of this database will be its ability to provide conceptual characterizations of experiments and their results (“data and metadata encoding”). Such encodings identify data sets according to experimental properties so that users can quickly narrow their searches to the most pertinent results. To this end, conceptual encodings that rely on “industry-standard” ontologies are preferred.

Since ImmPort will be the repository of HIPC data, we evaluated its use of ontologies. Upon determining that ImmPort is not ontology-compliant, we analyzed the universe of ontologies to determine the extent to which existing ontologies can be used to encode HIPC data, and ImmPort’s ability to support the application of these concepts.