Ontology and the National Cancer Institute Thesaurus (2005)

Ontology and the NCI Thesaurus. Barry Smith, with thanks to Werner Ceusters and Louis Goldberg

description

The National Cancer Institute Thesaurus is described by its authors as "a biomedical vocabulary that provides consistent, unambiguous codes and definitions for concepts used in cancer research" and as one that "exhibits ontology-like properties in its construction and use". We performed a qualitative analysis of the Thesaurus in order to assess its conformity with principles of good practice in terminology and ontology design.

MATERIALS AND METHODS: We used both the online browsable version of the Thesaurus and its OWL representation (version 04.08b, released on August 2, 2004), measuring each against the requirements put forward in relevant ISO terminology standards and against ontological principles advanced in the recent literature.

RESULTS: We found many mistakes and inconsistencies with respect to the term-formation principles used and the underlying knowledge representation system, as well as missing or inappropriately assigned verbal and formal definitions.

CONCLUSION: Version 04.08b of the NCI Thesaurus suffers from the same broad range of problems that have been observed in other biomedical terminologies. For its further development, we recommend a more principled approach that allows the Thesaurus to be tested not just for internal consistency but also for its degree of correspondence to that part of reality which it is designed to represent.

Transcript of Ontology and the National Cancer Institute Thesaurus (2005)

Page 1: Ontology and the National Cancer Institute Thesaurus (2005)

Ontology and the NCI Thesaurus

Barry Smith

with thanks to Werner Ceusters and Louis Goldberg

Page 2: Ontology and the National Cancer Institute Thesaurus (2005)

Ontology developments in Buffalo

Department of Philosophy: 8 full-time ontologists

National Center for Ontological Research (http://ncor.us)

NYS Center of Excellence in Bioinformatics & Life Sciences

Werner Ceusters Referent Tracking Pilot EHR

Page 3: Ontology and the National Cancer Institute Thesaurus (2005)

GO + OBO

National Center for Biomedical Ontology

– Berkeley Drosophila Genome Project
– Cambridge University Department of Genetics
– Mayo Clinic
– University of Oregon Institute of Neuroscience
– University of California San Francisco Medical Center
– University at Buffalo Department of Philosophy

http://ncbo.us

Page 4: Ontology and the National Cancer Institute Thesaurus (2005)

A methodology for quality assurance of ontologies

rules for ontology building based on two millennia of philosophical research on classification and categorization

targets thus far in the biomedical domain:

– FMA
– SNOMED
– GALEN
– Gene Ontology
– UMLS Semantic Network
– ICF (International Classification of Functioning, Disability and Health)
– ISO Terminology Standards
– HL7-RIM

Page 5: Ontology and the National Cancer Institute Thesaurus (2005)

– FMA

• SNOMED

• GALEN

– Gene Ontology

» UMLS-SN

» ICF

» HL7-RIM

Page 6: Ontology and the National Cancer Institute Thesaurus (2005)

Ontologies of Reality vs. Information Models

Data: sequence, expression, genotype, structure

Data structures: patterns, clusters, alignments, ...

UMLS-SN: amino acid sequence is_a idea or concept

Swimming is healthy and has 8 letters

Page 7: Ontology and the National Cancer Institute Thesaurus (2005)

New criteria for admission to the OBO (Open Biomedical Ontologies) Library

Satisfaction of basic principles of ontology design

Goal: to move beyond information retrieval and statistical clustering to automatic reasoning

Page 8: Ontology and the National Cancer Institute Thesaurus (2005)

First Rule: Univocity

Terms should have the same meanings on every occasion of use.

They should refer to the same kinds of entities in reality.

Page 9: Ontology and the National Cancer Institute Thesaurus (2005)

Second Rule: Positivity

Complements of kinds are not themselves kinds.

Terms such as ‘non-mammal’ or ‘non-membrane’ or ‘other metalworker in New Zealand’ do not designate genuine kinds in reality.

Page 10: Ontology and the National Cancer Institute Thesaurus (2005)

Third Rule: Objectivity

Which kinds exist is not a function of our knowledge.

Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.
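The Positivity and Objectivity rules lend themselves to a simple automated screen. The following is a minimal sketch, assuming that violations can be spotted from marker words in the term itself; the marker lists and the function name `lint_term` are hypothetical and cover only the examples quoted on these slides, so a real check would still need human review.

```python
import re

# Hypothetical marker lists: they cover only the examples quoted on the slides.
POSITIVITY_MARKERS = re.compile(r"\b(non|other)\b", re.IGNORECASE)
OBJECTIVITY_MARKERS = re.compile(
    r"\b(unknown|unclassified|unlocalized)\b", re.IGNORECASE
)


def lint_term(term):
    """Return the names of the rules that a term appears to violate."""
    violations = []
    if POSITIVITY_MARKERS.search(term):
        # The term names a complement or leftover ("non-X", "other X").
        violations.append("positivity")
    if OBJECTIVITY_MARKERS.search(term):
        # The term reflects the state of our knowledge, not a kind in reality.
        violations.append("objectivity")
    return violations


for term in [
    "non-mammal",
    "Other Plant",
    "Unclassified Drugs and Chemicals",
    "Vascular Plant",
]:
    print(term, "->", lint_term(term) or "ok")
```

A purely lexical screen of this kind will miss violations that are phrased differently, so it is best read as a first-pass filter rather than a test of the rules themselves.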

Page 11: Ontology and the National Cancer Institute Thesaurus (2005)

Fourth Rule: Single Inheritance

No kind in a classificatory hierarchy should have more than one is_a parent at the immediately higher level.
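The Single Inheritance rule can be checked mechanically once the is_a links are available as data. The sketch below is illustrative only: the function name `multiple_parents` is hypothetical, and the edges are an assumed reading of the thing / car / blue thing / blue car example on the following slide.

```python
from collections import defaultdict


def multiple_parents(is_a_edges):
    """Map each term with more than one is_a parent to its sorted parents."""
    parents = defaultdict(set)
    for child, parent in is_a_edges:
        parents[child].add(parent)
    return {
        child: sorted(found)
        for child, found in parents.items()
        if len(found) > 1
    }


# Assumed edges for the example; "blue car" is given two is_a parents.
edges = [
    ("car", "thing"),
    ("blue thing", "thing"),
    ("blue car", "car"),
    ("blue car", "blue thing"),
]

print(multiple_parents(edges))
```

Under these assumed edges, only "blue car" is reported, since it alone has two immediate parents.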

Page 12: Ontology and the National Cancer Institute Thesaurus (2005)

Basic ontological relations such as is_a and part_of should be shared by all ontologies

[Diagram: nodes thing, car, blue thing and blue car, with two links labeled is_a1 and is_a2]

Page 13: Ontology and the National Cancer Institute Thesaurus (2005)

Fifth Rule

Use common upper-level categories and relations (is_a, part_of ...)

• with precise formal definitions for machine purposes

• with equivalent natural language definitions for human beings

Page 14: Ontology and the National Cancer Institute Thesaurus (2005)

Sixth Rule: Intelligibility of Definitions

The terms used in a definition should be simpler (more intelligible) than the term to be defined

otherwise the definition provides no assistance

– to human understanding
– to machine processing

Definitions should be intuitively meaningful (should not contradict common sense)

Page 15: Ontology and the National Cancer Institute Thesaurus (2005)

The National Cancer Institute Thesaurus (NCIT)

part of OBO

but does not (yet) satisfy these principles

Page 16: Ontology and the National Cancer Institute Thesaurus (2005)

NCIT

“a biomedical vocabulary that provides consistent, unambiguous codes and definitions for concepts used in cancer research”

“exhibits ontology-like properties in its construction and use”.

Page 17: Ontology and the National Cancer Institute Thesaurus (2005)

Goals

to make use of current terminology “best practices” to relate relevant concepts to one another in a formal structure, so that computers as well as humans can use the Thesaurus for a variety of purposes, including the support of automatic reasoning;

to speed the introduction of new concepts and new relationships in response to the emerging needs of basic researchers, clinical trials, information services and other users.

Page 18: Ontology and the National Cancer Institute Thesaurus (2005)

Formal Definitions

Of 37,261 nodes, 33,720 were stipulated to be primitive in the DL (description logic) sense

Thus only a small portion of the NCIT ontology (the 3,541 non-primitive nodes, about 9.5% of the total) can be used for purposes of automatic classification and error-checking.

Page 19: Ontology and the National Cancer Institute Thesaurus (2005)

Verbal Definitions

About half the NCIT terms are assigned verbal definitions

Unfortunately, some are assigned more than one

Page 20: Ontology and the National Cancer Institute Thesaurus (2005)

Disease Progression

Definition1: Cancer that continues to grow or spread.

Definition2: Increase in the size of a tumor or spread of cancer in the body.

Definition3: The worsening of a disease over time. This concept is most often used for chronic and incurable diseases where the stage of the disease is an important determinant of therapy and prognosis.

Page 21: Ontology and the National Cancer Institute Thesaurus (2005)

To make matters worse, Disease Progression has the subclass:

Cancer Progression

Definition:

The worsening of a cancer over time. This concept is most often used for incurable cancers where the stage of the cancer is an important determinant of therapy and prognosis.

Page 22: Ontology and the National Cancer Institute Thesaurus (2005)

Cancer

an object (which can grow and spread)

a process (of getting better or worse)

Page 23: Ontology and the National Cancer Institute Thesaurus (2005)

Confuses definitions with descriptions

Tuberculosis

Definition:

A chronic, recurrent infection caused by the bacterium Mycobacterium tuberculosis. Tuberculosis (TB) may affect almost any tissue or organ of the body with the lungs being the most common site of infection. The clinical stages of TB are primary or initial infection, latent or dormant infection, and recrudescent or adult-type TB. Ninety to 95% of primary TB infections may go unrecognized. Histopathologically, tissue lesions consist of granulomas which usually undergo central caseation necrosis. Local symptoms of TB vary according to the part affected; acute symptoms include hectic fever, sweats, and emaciation; serious complications include granulomatous erosion of pulmonary bronchi associated with hemoptysis. If untreated, progressive TB may be associated with a high degree of mortality. This infection is frequently observed in immunocompromised individuals with AIDS or a history of illicit IV drug use.

Page 24: Ontology and the National Cancer Institute Thesaurus (2005)

A better solution

Tuberculosis

Definition:

A chronic, recurrent infection caused by the bacterium Mycobacterium tuberculosis.

Page 25: Ontology and the National Cancer Institute Thesaurus (2005)

Inherits ontological and terminological incoherence from source vocabularies such as UMLS-SN

Conceptual Entities

Definition

An organizational header for concepts representing mostly abstract entities.

Confuses use and mention (swimming is healthy and has eight letters)

Includes as subtypes:

action, change, color, death, event, fluid, injection, temperature

Page 26: Ontology and the National Cancer Institute Thesaurus (2005)

and imprecision

Duratec, Lactobutyrin, Stilbene Aldehyde classified as Unclassified Drugs and Chemicals

Page 27: Ontology and the National Cancer Institute Thesaurus (2005)

and problematic synonyms

Anatomic Structure, System, or Substance ~ Anatomic Structures and Systems

Does ‘anatomic’ apply only to structure or also to system and substance?

Biological Function ~ Biological Process

– some biological processes are the exercises of biological functions
– others (e.g. pathological processes) not

Genetic Abnormality ~ Molecular Abnormality (with subtype: Molecular Genetic Abnormality) (definitions not supplied)

Page 28: Ontology and the National Cancer Institute Thesaurus (2005)

more problematic synonyms

Diseases and Disorders ~ Disease ~ Disorder

Definition1 for Disease: A disease is any abnormal condition of the body or mind that causes discomfort, dysfunction, or distress to the person affected or those in contact with the person. ...

Definition2 for Disease: A definite pathologic process with a characteristic set of signs and symptoms. ...

Definition1 treats a disease as a condition; Definition2 treats it as a process.

Definition2 contradicts NCIT’s own classification hierarchy

Page 29: Ontology and the National Cancer Institute Thesaurus (2005)

Ontological problems

Three disjoint classes of plants:

Vascular Plant

Non-vascular Plant

Other Plant

Page 30: Ontology and the National Cancer Institute Thesaurus (2005)

Ontological problems

Abnormal Cell is a top-level class (thus not subsumed by Cell)

Cell is a subclass of Other Anatomic Concept (so that cells themselves are concepts)

Normal Cell is a subclass of Microanatomy.
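The practical effect of this placement can be seen by walking the is_a links upward. The sketch below encodes only the three assertions listed on this slide; the dictionary `NCIT_FRAGMENT` and the helper names are hypothetical, and any NCIT link not mentioned here is treated as absent, which is an assumption about the fragment rather than a claim about the full Thesaurus.

```python
# is_a parents exactly as listed on the slide; unlisted links are assumed absent.
NCIT_FRAGMENT = {
    "Cell": ["Other Anatomic Concept"],
    "Normal Cell": ["Microanatomy"],
    "Abnormal Cell": [],  # top-level class, no parent
}


def ancestors(term, is_a):
    """Collect every class reachable from term by following is_a upward."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def subsumed_by(term, candidate, is_a):
    """True if candidate is among the is_a ancestors of term."""
    return candidate in ancestors(term, is_a)


for term in ["Abnormal Cell", "Normal Cell"]:
    print(term, "is_a Cell?", subsumed_by(term, "Cell", NCIT_FRAGMENT))
```

On this fragment neither query succeeds, so a reasoner that relies on is_a alone has no path from either kind of cell to Cell.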

Page 31: Ontology and the National Cancer Institute Thesaurus (2005)

Next step

Alignment of OBO ontologies through a common system of top-level categories in the OBO-UBO (Upper Biomedical Ontology) and through a common system of formally defined relations in the OBO-RO (Relation Ontology)

see “Relations in Biomedical Ontologies”, Genome Biology Apr. 2005

Donnelly, M., Bittner, T. and Rosse, C. 2005. 'A Formal Theory for Spatial Representation and Reasoning in Biomedical Ontologies'. Artificial Intelligence in Medicine

Page 32: Ontology and the National Cancer Institute Thesaurus (2005)

is_a

A is_a B

Definition

For all x, t if x instance_of A at t then x instance_of B at t

allows reliable cross-ontology inferences from ‘abnormal cell’ to ‘cell’
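To make the definition concrete, it can be tested against a finite set of instance data. This is a minimal sketch, assuming instance data is available as (instance, class, time) triples; the function name `is_a_holds` and the sample identifiers are hypothetical. A finite data set can refute an is_a assertion but cannot establish it in general, since the definition quantifies over all instances and times.

```python
def is_a_holds(a, b, instance_of):
    """Check 'A is_a B' against (x, cls, t) triples.

    Returns True when every x that instantiates a at t also instantiates b at t.
    """
    return all(
        (x, b, t) in instance_of
        for (x, cls, t) in instance_of
        if cls == a
    )


# Hypothetical instance data: c1 is an abnormal cell, c2 is a cell only.
facts = {
    ("c1", "abnormal cell", 1),
    ("c1", "cell", 1),
    ("c2", "cell", 1),
}

print(is_a_holds("abnormal cell", "cell", facts))
print(is_a_holds("cell", "abnormal cell", facts))
```

The second call fails because c2 is a cell without being an abnormal cell, which is the asymmetry the definition is meant to capture.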

Page 33: Ontology and the National Cancer Institute Thesaurus (2005)

part_of

A part_of B

Definition

For all x, t if x instance_of A at t then there is some y, y instance_of B at t and x part_of y

‘part_of’ is the instance-level part relation, e.g. between this nucleus and this cell

The all-some structure of such definitions allows cascading of inferences

(i) within ontologies
(ii) between ontologies
(iii) between ontologies and EHR repositories of instance-data
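The all-some structure of the part_of definition can be illustrated on a small set of instance data. The sketch below assumes (instance, class, time) triples for instance_of and (part, whole) pairs for the instance-level part relation; the function name `part_of_holds` and the identifiers n1, n2 and c1 are hypothetical.

```python
def part_of_holds(a, b, instance_of, part_of_inst):
    """Check 'A part_of B': every instance of a at t is part of some b at t.

    instance_of: set of (x, cls, t) triples.
    part_of_inst: set of (part, whole) pairs between instances.
    Holds vacuously when a has no instances in the data.
    """
    for x, cls, t in instance_of:
        if cls != a:
            continue
        wholes = {whole for (part, whole) in part_of_inst if part == x}
        if not any((y, b, t) in instance_of for y in wholes):
            return False
    return True


# Hypothetical data: this nucleus n1 is part of this cell c1.
instances = {("n1", "nucleus", 1), ("c1", "cell", 1)}
parts = {("n1", "c1")}

print(part_of_holds("nucleus", "cell", instances, parts))

# Adding a nucleus with no containing cell breaks the all-some pattern.
instances_with_stray = instances | {("n2", "nucleus", 1)}
print(part_of_holds("nucleus", "cell", instances_with_stray, parts))
```

The check is deliberately one-directional: it asks that all instances of A have some B as whole, not that every B has an A as part.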

Page 34: Ontology and the National Cancer Institute Thesaurus (2005)

Cascading inferences

Whichever A you choose, the B that includes it will itself be included in some C, which will thus also include as a part the A with which you began
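The cascade can be sketched as a transitive closure over class-level part_of assertions. This is an illustration under a stated assumption, namely that the instance-level part relation is transitive, so that all-some assertions chain; the function name `cascade` is hypothetical, and A, B and C stand for arbitrary classes that may come from different ontologies.

```python
def cascade(part_of_edges):
    """Return the transitive closure of class-level part_of assertions."""
    closure = set(part_of_edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for b2, c in list(closure):
                # A part_of B and B part_of C together yield A part_of C.
                if b == b2 and (a, c) not in closure:
                    closure.add((a, c))
                    changed = True
    return closure


# A part_of B may be asserted in one ontology and B part_of C in another.
asserted = {("A", "B"), ("B", "C")}
inferred = cascade(asserted) - asserted
print(inferred)
```

Here the only new assertion is A part_of C, which neither source states on its own.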

The same principle applies to the other relations in the OBO-RO: located_at, transformation_of, derived_from, etc.

(UML treatment here very poor)

Page 35: Ontology and the National Cancer Institute Thesaurus (2005)

NCIT as now constituted will block such automatic reasoning

Neither Normal Cell nor Abnormal Cell is a Cell within the context of the NCIT

Page 36: Ontology and the National Cancer Institute Thesaurus (2005)

Some consolations

NCIT is open source

NCIT has broad coverage

NCIT has some formal structure (DL)

NCIT has realized the errors of its ways

NCIT is much, much better than (for example) the HL7-RIM