Ontology and the National Cancer Institute Thesaurus (2005)
1
Ontology and the NCI Thesaurus
Barry Smith
with thanks to Werner Ceusters and Louis Goldberg
2
Ontology developments in Buffalo
Department of Philosophy: 8 full-time ontologists
National Center for Ontological Research (http://ncor.us)
NYS Center of Excellence in Bioinformatics & Life Sciences
Werner Ceusters Referent Tracking Pilot EHR
3
GO + OBO
National Center for Biomedical Ontology
Berkeley Drosophila Genome Project
Cambridge University Department of Genetics
Mayo Clinic
University of Oregon Institute of Neuroscience
University of California San Francisco Medical Center
University at Buffalo Department of Philosophy
http://ncbo.us
4
A methodology for quality assurance of ontologies
rules for ontology building based on two millennia of philosophical research on classification and categorization
targets thus far in the biomedical domain:
– FMA
– SNOMED
– GALEN
– Gene Ontology
– UMLS Semantic Network
– ICF (International Classification of Functioning, Disability and Health)
– ISO Terminology Standards
– HL7-RIM
5
–FMA
• SNOMED
• GALEN
– Gene Ontology
» UMLS-SN
» ICF
» HL7-RIM
6
Ontologies of Reality vs. Information Models
Data: sequence, expression, genotype, structure
Data structures: patterns, clusters, alignments, ...
UMLS-SN: amino acid sequence is_a idea or concept
Swimming is healthy and has 8 letters
7
New criteria for admission to OBO (Open Biomedical Ontologies) Library
Satisfaction of basic principles of ontology design
Goal: to move beyond information retrieval and statistical clustering to automatic reasoning
8
First Rule: Univocity
Terms should have the same meanings on every occasion of use.
They should refer to the same kinds of entities in reality
9
Second Rule: Positivity
Complements of kinds are not themselves kinds.
Terms such as ‘non-mammal’ or ‘non-membrane’ or ‘other metalworker in New Zealand’ do not designate genuine kinds in reality.
10
Third Rule: Objectivity
Which kinds exist is not a function of our knowledge.
Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.
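The Second and Third Rules lend themselves to a simple label check. The following is a minimal sketch, not part of the original slides: the patterns are assumptions chosen to match the example terms quoted above, and the function name `rule_violations` is hypothetical. A real checker would need a curated list and human review.

```python
import re

# Hypothetical patterns, chosen only to match the examples on the slides.
NEGATIVE = re.compile(r"^(non|not)[-\s]", re.IGNORECASE)   # 'non-mammal'
RESIDUAL = re.compile(r"^other\b", re.IGNORECASE)          # 'other metalworker ...'
EPISTEMIC = re.compile(r"\b(unknown|unclassified|unlocalized)\b", re.IGNORECASE)


def rule_violations(term):
    """Return the names of the rules a term label appears to break."""
    found = []
    if NEGATIVE.search(term) or RESIDUAL.search(term):
        found.append("positivity")
    if EPISTEMIC.search(term):
        found.append("objectivity")
    return found
```

Applied to labels quoted elsewhere in this talk, `rule_violations("Non-vascular Plant")` and `rule_violations("Other Plant")` both flag positivity, while `rule_violations("Unclassified Drugs and Chemicals")` flags objectivity.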
11
Fourth Rule: Single Inheritance
No kind in a classificatory hierarchy should have more than one is_a parent on the immediate higher level
12
Basic ontological relations such as is_a and part_of should be shared by all ontologies
thing
car
blue thing
blue car
is_a1 is_a2
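The Fourth Rule and the blue car diagram can be sketched as a check for terms with more than one is_a parent. This is an illustrative sketch only: the edge list is an assumed reading of the diagram (the drawing itself did not survive in this transcript), and the function name `multiple_parents` is hypothetical.

```python
from collections import defaultdict

# is_a edges as (child, parent) pairs; an assumed reading of the diagram.
IS_A = [
    ("car", "thing"),
    ("blue thing", "thing"),
    ("blue car", "car"),
    ("blue car", "blue thing"),
]


def multiple_parents(edges):
    """Map each term with more than one is_a parent to its sorted parents."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}


print(multiple_parents(IS_A))  # {'blue car': ['blue thing', 'car']}
```

Only 'blue car' is reported, since it is the one node that sits under two parents at once.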
13
Fifth Rule
Use common upper-level categories and relations (is_a, part_of ...)
• with precise formal definitions for machine purposes
• with equivalent natural language definitions for human beings
14
Sixth Rule: Intelligibility of Definitions
The terms used in a definition should be simpler (more intelligible) than the term to be defined
otherwise the definition provides no assistance
– to human understanding
– to machine processing
Definitions should be intuitively meaningful (should not contradict common sense)
15
The National Cancer Institute Thesaurus (NCIT)
part of OBO
but does not (yet) satisfy these principles
16
NCIT
“a biomedical vocabulary that provides consistent, unambiguous codes and definitions for concepts used in cancer research”
“exhibits ontology-like properties in its construction and use”.
17
Goals
to make use of current terminology “best practices” to relate relevant concepts to one another in a formal structure, so that computers as well as humans can use the Thesaurus for a variety of purposes, including the support of automatic reasoning;
to speed the introduction of new concepts and new relationships in response to the emerging needs of basic researchers, clinical trials, information services and other users.
18
Formal Definitions
of 37,261 nodes, 33,720 were stipulated to be primitive in the DL sense
Thus only the remaining 3,541 nodes (about 9.5%) have full DL definitions, so only a small portion of the NCIT ontology can be used for purposes of automatic classification and error-checking.
19
Verbal Definitions
About half the NCIT terms are assigned verbal definitions
Unfortunately some are assigned more than one
20
Disease Progression
Definition 1: Cancer that continues to grow or spread.
Definition 2: Increase in the size of a tumor or spread of cancer in the body.
Definition 3: The worsening of a disease over time. This concept is most often used for chronic and incurable diseases where the stage of the disease is an important determinant of therapy and prognosis.
21
To make matters worse Disease Progression has subclass:
Cancer Progression
Definition:
The worsening of a cancer over time. This concept is most often used for incurable cancers where the stage of the cancer is an important determinant of therapy and prognosis.
22
Cancer
an object (which can grow and spread)
a process (of getting better or worse)
23
Confuses definitions with descriptions
Tuberculosis
Definition:
A chronic, recurrent infection caused by the bacterium Mycobacterium tuberculosis. Tuberculosis (TB) may affect almost any tissue or organ of the body with the lungs being the most common site of infection. The clinical stages of TB are primary or initial infection, latent or dormant infection, and recrudescent or adult-type TB. Ninety to 95% of primary TB infections may go unrecognized. Histopathologically, tissue lesions consist of granulomas which usually undergo central caseation necrosis. Local symptoms of TB vary according to the part affected; acute symptoms include hectic fever, sweats, and emaciation; serious complications include granulomatous erosion of pulmonary bronchi associated with hemoptysis. If untreated, progressive TB may be associated with a high degree of mortality. This infection is frequently observed in immunocompromised individuals with AIDS or a history of illicit IV drug use.
24
A better solution
Tuberculosis
Definition:
A chronic, recurrent infection caused by the bacterium Mycobacterium tuberculosis.
25
Inherits ontological and terminological incoherence from source vocabularies such as UMLS-SN
Conceptual Entities
Definition
An organizational header for concepts representing mostly abstract entities.
Confuses use and mention (swimming is healthy and has eight letters)
Includes as subtypes:
action, change, color, death, event, fluid, injection, temperature
26
and imprecision
Duratec, Lactobutyrin, Stilbene Aldehyde classified as Unclassified Drugs and Chemicals
27
and problematic synonyms
Anatomic Structure, System, or Substance ~ Anatomic Structures and Systems
Does ‘anatomic’ apply only to structure or also to system and substance?
Biological Function ~ Biological Process
some biological processes are the exercises of biological functions
others (e.g. pathological processes) not
Genetic Abnormality ~ Molecular Abnormality (with subtype: Molecular Genetic Abnormality) (definitions not supplied)
28
more problematic synonyms
Diseases and Disorders ~ Disease ~ Disorder
Definition 1 for Disease: A disease is any abnormal condition of the body or mind that causes discomfort, dysfunction, or distress to the person affected or those in contact with the person. ...
Definition 2 for Disease: A definite pathologic process with a characteristic set of signs and symptoms. ...
Condition vs. Process
Definition 2 contradicts NCIT’s own classification hierarchy
29
Ontological problems
Three disjoint classes of plants:
Vascular Plant
Non-vascular Plant
Other Plant
30
Ontological problems
Abnormal Cell is a top-level class (thus not subsumed by Cell)
Cell is a subclass of Other Anatomic Concept (so that cells themselves are concepts)
Normal Cell is a subclass of Microanatomy.
31
Next step
Alignment of OBO ontologies through a common system of top-level categories in the OBO-UBO (Upper Biomedical Ontology) and through a common system of formally defined relations in the OBO-RO (Relation Ontology)
see “Relations in Biomedical Ontologies”, Genome Biology Apr. 2005
Donnelly, M., Bittner, T. and Rosse, C. 2005. 'A Formal Theory for Spatial Representation and Reasoning in Biomedical Ontologies'. Artificial Intelligence in Medicine
32
is_a
A is_a B
Definition
For all x, t if x instance_of A at t then x instance_of B at t
allows reliable cross-ontology inferences from ‘abnormal cell’ to ‘cell’
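The is_a definition above can be tested against recorded instance data. The following is a minimal sketch under stated assumptions: the instance identifiers, the time points, and the function name `holds_is_a` are all hypothetical. Because the definition quantifies over all instances and times, such a check can refute an is_a assertion but cannot prove one.

```python
# Hypothetical instance data: class name -> set of (instance, time) pairs.
INSTANCE_OF = {
    "cell": {("c1", 0), ("c1", 1), ("c2", 0)},
    "abnormal cell": {("c1", 1)},
}


def holds_is_a(a, b, instance_of):
    """A is_a B: every (x, t) with x instance_of A at t also has
    x instance_of B at t, as far as the recorded data goes."""
    return instance_of.get(a, set()) <= instance_of.get(b, set())
```

With this data, `holds_is_a("abnormal cell", "cell", INSTANCE_OF)` holds, while the converse fails because c2 is a cell that is never recorded as abnormal.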
33
part_of
A part_of B
Definition
For all x, t if x instance_of A at t then there is some y, y instance_of B at t and x part_of y
‘part_of’ is the instance-level part relation, e.g. between this nucleus and this cell
The all-some structure of such definitions allows cascading of inferences
(i) within ontologies
(ii) between ontologies
(iii) between ontologies and EHR repositories of instance-data
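The all-some structure of the part_of definition can be sketched as a check over instance data. This is an illustration under assumptions, not the OBO-RO implementation: the instance identifiers, the sample data, and the function name `holds_part_of` are hypothetical, and the instance-level relation is stored without a time index, as in the definition above.

```python
# Hypothetical instance data: class name -> set of (instance, time) pairs.
INSTANCE_OF = {
    "cell nucleus": {("n1", 0)},
    "cell": {("c1", 0)},
}

# Instance-level part relation as (part, whole) pairs.
PART_OF = {("n1", "c1")}


def holds_part_of(a, b, instance_of, part_of):
    """Class-level A part_of B: for every x instance_of A at t there is
    some y with y instance_of B at t and x part_of y."""
    b_instances = instance_of.get(b, set())
    for x, t in instance_of.get(a, set()):
        wholes = {y for (p, y) in part_of if p == x}
        if not any((y, t) in b_instances for y in wholes):
            return False
    return True
```

Here `holds_part_of("cell nucleus", "cell", INSTANCE_OF, PART_OF)` holds, while the converse fails: the cell c1 is not recorded as part of any nucleus.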
34
Cascading inferences
Whichever A you choose, its including B will be included in some C, which will include as part also the A with which you begin
The same principle applies to the other relations
located_at, transformation_of, derived_from
etc. in the OBO-RO
(UML treatment here very poor)
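The cascading pattern for part_of can be sketched as a transitive closure over class-level assertions. This is a minimal sketch, assuming that the instance-level part relation is transitive; the function name `cascade` is hypothetical, and A, B and C are the placeholder classes used on this slide.

```python
def cascade(part_of_edges):
    """Close a set of (part, whole) class-level assertions under
    transitivity: A part_of B and B part_of C yield A part_of C."""
    closure = set(part_of_edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for b2, c in list(closure):
                if b == b2 and (a, c) not in closure:
                    closure.add((a, c))
                    changed = True
    return closure
```

Starting from A part_of B and B part_of C, `cascade` adds A part_of C, which mirrors the step from "its including B" to "some C" described above.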
35
NCIT as now constituted will block such automatic reasoning
Neither Normal Cell nor Abnormal Cell is a Cell within the context of the NCIT
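The blockage can be sketched by walking the is_a links reported earlier in this talk. This is an illustration only: it records just the three parent links stated on the "Ontological problems" slide, it omits any parents of Microanatomy or Other Anatomic Concept because they are not given here, and the function name `ancestors` is hypothetical.

```python
# is_a parents as reported on the slides; None marks a top-level class.
PARENT = {
    "Abnormal Cell": None,
    "Cell": "Other Anatomic Concept",
    "Normal Cell": "Microanatomy",
}


def ancestors(term, parent):
    """Follow is_a links upward and collect every recorded ancestor."""
    found = []
    current = parent.get(term)
    while current is not None:
        found.append(current)
        current = parent.get(current)
    return found
```

On this data "Cell" appears in neither `ancestors("Normal Cell", PARENT)` nor `ancestors("Abnormal Cell", PARENT)`, so a reasoner following is_a has no route from either class to Cell.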
36
Some consolations
NCIT is open source
NCIT has broad coverage
NCIT has some formal structure (DL)
NCIT has realized the errors of its ways
NCIT is much, much better than (for example) the HL7-RIM