Pipeline for automated structure-based classification in the ChEBI ontology

Janna Hastings, Coordinator, Cheminformatics and Metabolism. www.ebi.ac.uk/chebi. ACS Symposium on Chemical Ontologies, Taxonomies and Schemas. Dallas, 16 March 2014

description

Presented at the ACS in Dallas: ChEBI is a database and ontology of chemical entities of biological interest, organised into a structure-based and role-based classification hierarchy. Each entry is extensively annotated with a name, definition and synonyms, other metadata such as cross-references, and chemical structure information where appropriate. In addition to the classification hierarchy, the ontology also contains diverse chemical and ontological relationships. While ChEBI is primarily manually maintained, recent developments have focused on improvements in curation through partial automation of common tasks. We will describe a pipeline we have developed for structure-based classification of chemicals into the ChEBI structural classification. The pipeline connects class-level structural knowledge encoded in Web Ontology Language (OWL) axioms as an extension to the ontology, and structural information specified in standard MOLfiles. We make use of the Chemistry Development Kit, the OWL API and the OWLTools library. Harnessing the pipeline, we are able to suggest the best structural classes for the classification of novel structures within the ChEBI ontology.

Transcript of Pipeline for automated structure-based classification in the ChEBI ontology

Page 1: Pipeline for automated structure-based classification in the ChEBI ontology

Pipeline for automated structure-based classification in the ChEBI ontology


Janna Hastings

Coordinator, Cheminformatics and Metabolism

www.ebi.ac.uk/chebi

ACS Symposium on Chemical Ontologies, Taxonomies and Schemas. Dallas, 16 March 2014

Page 2: Pipeline for automated structure-based classification in the ChEBI ontology

Chemical Entities of Biological Interest

• Freely available online and for download in full

• Low molecular weight, i.e. no proteins

• Definitions, relationships, hierarchy

• E.g. metabolites, drugs, pesticides

• 38,215 entries last release

Page 3: Pipeline for automated structure-based classification in the ChEBI ontology

What does ChEBI provide?

• Chemical structures and visualisations

• Names and synonyms: caffeine; 1,3,7-trimethylxanthine; methyltheobromine

• Chemical data: Formula: C8H10N4O2; Charge: 0; Mass: 194.19

• Ontology – classifications: metabolite; CNS stimulant; trimethylxanthines

• Links to more information: MSDchem: CFF; KEGG DRUG: D00528; PubMed citations

• Chemical Informatics: InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3

• SMILES: CN1C(=O)N(C)c2ncn(C)c2C1=O
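As a quick check on the chemical data above, the mass of caffeine can be recomputed from its formula. This is a minimal sketch, not ChEBI's own calculation: the atomic masses are rounded standard values, and the helper names `parse_formula` and `average_mass` are made up for illustration.

```python
import re

# Rounded standard atomic masses; only the elements needed for this example.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

def parse_formula(formula):
    """Return {element: count} for a simple formula without brackets or charges."""
    counts = {}
    for element, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(digits) if digits else 1)
    return counts

def average_mass(formula):
    """Sum the atomic masses weighted by the element counts."""
    return sum(ATOMIC_MASS[el] * n for el, n in parse_formula(formula).items())

caffeine_mass = round(average_mass("C8H10N4O2"), 2)
print(caffeine_mass)  # agrees with the mass shown on the slide
```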

Page 4: Pipeline for automated structure-based classification in the ChEBI ontology

Example ChEBI entry page

Page 5: Pipeline for automated structure-based classification in the ChEBI ontology

Example entry page (continued)

Page 6: Pipeline for automated structure-based classification in the ChEBI ontology

Example entry page (continued)

Page 7: Pipeline for automated structure-based classification in the ChEBI ontology

Structure-based classification in ChEBI

Page 8: Pipeline for automated structure-based classification in the ChEBI ontology

Challenges with manual classification

• May be incomplete

• May be inconsistent

• Difficult to maintain (even with extensive use of computationally expensive automatic validations)

• Blocks automatic loading of otherwise high-quality externally annotated chemical data into ChEBI (as no classification available)

Page 9: Pipeline for automated structure-based classification in the ChEBI ontology

SOCO (SMARTS, OWL)

Leonid Chepelev, Michel Dumontier, collaborators

• Given a training set of classified molecules, examine structures for consensus features across all (using fragmentation and feature detection)

• Capture features hierarchically

• Use OWL to classify

Chepelev et al. BMC Bioinformatics 2012 13:3   doi:10.1186/1471-2105-13-3
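The consensus-feature idea can be illustrated with plain set intersection. This is only a toy sketch under stated assumptions: the training data and feature labels are invented, real SOCO detects features by fragmentation and expresses them as SMARTS, and the `classify` function below is a simple stand-in for OWL reasoning.

```python
# Hypothetical training set: class name -> {molecule name: detected features}.
TRAINING = {
    "carboxylic acid": {
        "acetic acid": {"carboxy group", "methyl group"},
        "benzoic acid": {"carboxy group", "benzene ring"},
    },
    "benzenes": {
        "benzoic acid": {"carboxy group", "benzene ring"},
        "toluene": {"benzene ring", "methyl group"},
    },
}

def consensus_features(members):
    """Features shared by every training molecule of one class."""
    feature_sets = list(members.values())
    return set.intersection(*feature_sets) if feature_sets else set()

CONSENSUS = {cls: consensus_features(members) for cls, members in TRAINING.items()}

def classify(features):
    """Classes whose (non-empty) consensus features are all present."""
    return sorted(cls for cls, required in CONSENSUS.items()
                  if required and required <= features)

# Hypothetical feature set for a new molecule.
print(classify({"benzene ring", "carboxy group", "hydroxy group"}))
```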

Page 10: Pipeline for automated structure-based classification in the ChEBI ontology

Limitations of SOCO

• No support for negation

• Only “min” (at least) counting supported, not max or exact. Thus, dicarboxylic acid is_a monocarboxylic acid (Every two-legged human is also a one-legged human in the sense that they have at least one leg…)

• SMARTS is powerful, but not very human-readable, and ChEBI is intended for use by human biologists and chemists. E.g. SMARTS for the class of aliphatic amines: [$([NH2][CX4]),$([NH]([CX4])[CX4]),$([NX3]([CX4])([CX4])[CX4])]

Can we do better at making definitions accessible?
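The counting limitation can be made concrete with a small sketch. The group counts are hard-coded here as an assumption (no substructure matching is performed), and the function names are hypothetical.

```python
# Number of carboxy groups per molecule, supplied by hand for illustration.
CARBOXY_GROUPS = {"acetic acid": 1, "oxalic acid": 2}

def is_monocarboxylic_min(n_groups):
    """'min' semantics: at least one carboxy group."""
    return n_groups >= 1

def is_monocarboxylic_exact(n_groups):
    """'exact' semantics: exactly one carboxy group."""
    return n_groups == 1

for name, n in CARBOXY_GROUPS.items():
    print(name, is_monocarboxylic_min(n), is_monocarboxylic_exact(n))
```

With only "min" counting, oxalic acid (a dicarboxylic acid) also satisfies the monocarboxylic acid test; exact counting keeps the two classes apart.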

Page 11: Pipeline for automated structure-based classification in the ChEBI ontology

A new pipeline for automated structure-based ontology classification in ChEBI

Pipeline diagram:

• Inputs: Definitions (OWL), ChEBI structures, Novel structure

• OWL Parser => logical cheminformatics definitions

• Matching => Candidate classes

• Ranking => Best classes: save is_a relations
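The matching and ranking stages might be sketched as follows. This is not the actual pipeline, which builds on the Chemistry Development Kit, the OWL API and OWLTools. Everything here is an assumption for illustration: the structure summary dictionary, the `depth` values, and in particular the ranking rule (prefer classes deeper in the hierarchy), which the slides do not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class ClassDefinition:
    name: str
    depth: int                       # hypothetical depth in the is_a hierarchy
    matches: Callable[[Dict], bool]  # stand-in for a parsed logical definition

# Hypothetical parsed definitions.
DEFINITIONS = [
    ClassDefinition("molecular entity", 0, lambda s: True),
    ClassDefinition("monocyclic compound", 1, lambda s: s["cycles"] == 1),
    ClassDefinition("organic ion", 2, lambda s: s["organic"] and s["charge"] != 0),
]

def candidate_classes(structure: Dict, definitions: List[ClassDefinition]):
    """Matching: keep every class whose definition the structure satisfies."""
    return [d for d in definitions if d.matches(structure)]

def rank(candidates: List[ClassDefinition]):
    """Ranking (assumed rule): most specific, i.e. deepest, classes first."""
    return sorted(candidates, key=lambda d: d.depth, reverse=True)

def classify(structure: Dict, definitions, top_n: int = 2) -> List[Tuple[str, str, str]]:
    """Return the best classes as is_a relations ready to be saved."""
    best = rank(candidate_classes(structure, definitions))[:top_n]
    return [(structure["name"], "is_a", d.name) for d in best]

# Hypothetical summary of a novel structure.
novel = {"name": "pyridinium", "organic": True, "charge": 1, "cycles": 1}
print(classify(novel, DEFINITIONS))
```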

Page 12: Pipeline for automated structure-based classification in the ChEBI ontology

Human-readable definitions, mapped to structures in ChEBI knowledgebase

thiadiazoles: molecular_entity and has_part some ( 1,2,3-thiadiazole or 1,2,4-thiadiazole or 1,2,5-thiadiazole or 1,3,4-thiadiazole )

diterpenoid: organic_molecular_entity and has_part exactly 2 terpenoid

organic ion: organic_molecular_entity and ( has_charge some int[>0] or has_charge some int[<0] )

monocyclic compound: molecular_entity and has_cycles value "1"^^int

• Logical operators

• Counts (min, max and exact)

• Properties

• Parts
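The four definitions above combine logical operators, counts, properties and parts. A minimal sketch of how such definitions could be evaluated is shown below. It assumes a hypothetical structure summary (a dictionary with `organic`, `parts`, `charge` and `cycles` keys) in place of real substructure detection, and hand-written combinators in place of an OWL parser.

```python
from collections import Counter

def all_of(*preds):            # logical "and"
    return lambda s: all(p(s) for p in preds)

def any_of(*preds):            # logical "or"
    return lambda s: any(p(s) for p in preds)

def has_part_some(*parts):     # has_part some (a or b or ...)
    return lambda s: any(s["parts"][p] >= 1 for p in parts)

def has_part_exactly(n, part): # has_part exactly n part
    return lambda s: s["parts"][part] == n

molecular_entity = lambda s: True
organic_molecular_entity = lambda s: s["organic"]

DEFINITIONS = {
    "thiadiazoles": all_of(molecular_entity, has_part_some(
        "1,2,3-thiadiazole", "1,2,4-thiadiazole",
        "1,2,5-thiadiazole", "1,3,4-thiadiazole")),
    "diterpenoid": all_of(organic_molecular_entity,
                          has_part_exactly(2, "terpenoid")),
    "organic ion": all_of(organic_molecular_entity, any_of(
        lambda s: s["charge"] > 0, lambda s: s["charge"] < 0)),
    "monocyclic compound": all_of(molecular_entity,
                                  lambda s: s["cycles"] == 1),
}

def matching_classes(structure):
    return sorted(name for name, test in DEFINITIONS.items() if test(structure))

# Hypothetical summary of a neutral molecule containing one 1,3,4-thiadiazole ring.
# Counter returns 0 for parts that are absent.
example = {"organic": True, "parts": Counter({"1,3,4-thiadiazole": 1}),
           "charge": 0, "cycles": 1}
print(matching_classes(example))
```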

Page 13: Pipeline for automated structure-based classification in the ChEBI ontology

Planned integration into ChEBI tools

• ChEBI internal data loader and bulk submissions

• ChEBI online submission tool

Pre-population of matched classes

Page 14: Pipeline for automated structure-based classification in the ChEBI ontology

Acknowledgements – Thanks!

ChEBI team:

Christoph Steinbeck, Gareth Owen, Adriano Dekker, Namrata Kale, Steve Turner, Venkatesh Muthukrishnan

Collaborators:

Colin Batchelor, RSC; Lian Duan, ETH; Leonid Chepelev, Ottawa; Michel Dumontier, Stanford; Despoina Magka, Oxford; Ilinca Tudose and John May, EBI

Funding:

BBSRC “Continued development of ChEBI towards better usability for the systems biology and metabolic modelling communities” BB/K019783/1

Page 15: Pipeline for automated structure-based classification in the ChEBI ontology

Questions?

Thank you for listening!

[email protected]
