Conference Review An upper-level ontology for the...
Transcript of Conference Review An upper-level ontology for the...
Comparative and Functional GenomicsComp Funct Genom 2003; 4: 80–84.Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.255
Conference Review
An upper-level ontology for the biomedicaldomain
Alexa T. McCray*National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, USA
*Correspondence to:Alexa T. McCray, NationalLibrary of Medicine, 8600Rockville Pike, Bethesda,MD, USA.E-mail: [email protected]
Received: 22 October 2002Accepted: 4 December 2002
AbstractAt the US National Library of Medicine we have developed the Unified MedicalLanguage System (UMLS), whose goal it is to provide integrated access to a largenumber of biomedical resources by unifying the vocabularies that are used to accessthose resources. The UMLS currently interrelates some 60 controlled vocabularies inthe biomedical domain. The UMLS coverage is quite extensive, including not onlymany concepts in clinical medicine, but also a large number of concepts applicable tothe broad domain of the life sciences. In order to provide an overarching conceptualframework for all UMLS concepts, we developed an upper-level ontology, called theUMLS semantic network. The semantic network, through its 134 semantic types,provides a consistent categorization of all concepts represented in the UMLS. The 54links between the semantic types provide the structure for the network and representimportant relationships in the biomedical domain. Because of the growing number ofinformation resources that contain genetic information, the UMLS coverage in thisarea is being expanded. We recently integrated the taxonomy of organisms developedby the NLM’s National Center for Biotechnology Information, and we are currentlyworking together with the developers of the Gene Ontology to integrate this resource,as well. As additional, standard, ontologies become publicly available, we expect tointegrate these into the UMLS construct. Published in 2003 by John Wiley & Sons,Ltd.
Keywords: ontologies; semantic networks; controlled vocabularies; Unified MedicalLanguage System
The development of an ontology is generally moti-vated by a particular problem that its develop-ers are attempting to solve. In most cases, theproblem itself will have a significant impact onthe design, implementation and further develop-ment of the ontology. In addition, the develop-ers’ domain expertise, experience in knowledgerepresentation methodology, and ability to main-tain the ontology over a long period of time, alldetermine the final outcome. While there is somedisagreement about what qualifies as an ontol-ogy, most agree that an ontology is a represen-tation of a domain of interest, which, at a mini-mum, involves naming the basic concepts in thatdomain [6]. Such a definition would allow a sim-ple list of terms to be an ontology. A more formal
definition of an ontology would, however, requirethat the relationships between and among theseconcepts be made explicit. These relationships maybe taxonomic links, resulting in hierarchically orga-nized concepts, or they may include additionalnon-hierarchical relationships, together with someconstraints on how these relationships are to beinterpreted [2,8,9].
The purpose to which the ontology will beput determines the nature and type of ontologythat is created. While a simple list of controlledterms can be sufficient for indexing documents andother datasets, even here some complexity, e.g. inthe form of synonyms, is often added once theterminology is put to use. Placing the concepts ina hierarchy provides another level of complexity,
This article is a US Government work and is in the public domain in the USA.
A biomedical ontology (UMLS) 81
but it also allows for greater flexibility in searching,making it possible for a user to formulate a querythat asks for items indexed not only under aparticular concept but also, for example, under allthe descendants of that concept in the hierarchy.This may be all that is ever needed for information-retrieval purposes, but if the ontology is intended tobe used in an application that requires reasoning ina knowledge-based application, such as a decisionsupport system, then a richer set of relationshipsbetween concepts will be needed.
At the US National Library of Medicine, we havedeveloped a system whose goal it is to provideintegrated access to a large number of biomedi-cal resources by unifying the domain vocabulariesthat are used to access those resources [4,10]. TheUnified Medical Language System (UMLS) projectcurrently interrelates some 60 controlled vocabu-laries in the biomedical domain. The vocabulariesvary in nature, size and scope and have been cre-ated for widely differing purposes. Some consistof a list of a few hundred terms, while otherscontain tens of thousands of interrelated concepts.Some vocabularies have been created for documentretrieval systems, others for coding medical recordsfor billing and administrative purposes, and yet oth-ers have been created for use in medical decisionsupport systems. Some are highly specific to a par-ticular medical specialty, such as the National Can-cer Institute’s Physician Data Query (PDQ) system,the psychiatrists’ Diagnostic and Statistical Manualof Mental Disorders (DSM-III and DSM-IV), andthe nurses’ Classification of Nursing Diagnoses.Others are targeted to particular fields of study,such as the University of Washington’s anatomyterminology. Several large vocabularies are quitebroad and deep in their scope, including the Sys-tematized Nomenclature of Medicine and the Med-ical Subject Headings (MeSH). The Metathesauruscontains almost 900 000 concepts drawn from itsmultiple vocabularies. Its coverage is quite exten-sive, including not only many concepts in clinicalmedicine but also a large number of concepts appli-cable to the broad domain of the life sciences, e.g.MeSH, a thesaurus of some 19 000 concepts inthe biomedical domain, has many concepts in theareas of anatomy, biology, physiology, organisms,diseases and chemicals. It also has a large numberof concepts in molecular biology and genetics, e.g.cytogenetics, medical genetics, genetic recombina-tion and mutations.
When a new vocabulary is added to the UMLS,its constituent terms are linked whenever possi-ble to existing Metathesaurus concepts. Thus, ifa new clinical vocabulary has, for example, thedisease name ‘lymphogenous leukemia’ and if theconcept ‘lymphocytic leukemia’ already exists inthe Metathesaurus, then the new name is added tothe existing concept as a synonym. Similarly, ifthe new vocabulary has the term ‘acute lympho-cytic leukemia’, and this concept does not alreadyexist in the Metathesaurus, then a new concept isformed, and it is linked to the most closely relatedUMLS concept. In this case, since ‘acute lympho-cytic leukemia’ is not a synonym of ‘lymphocyticleukemia’, it is linked to the latter concept as anarrower concept.
Early in the UMLS project, in order to pro-vide an overarching conceptual framework forall UMLS concepts, we developed an upper-levelontology, which we call the UMLS semantic net-work. Semantic networks have been created andused in artificial intelligence applications for sometime [3,7], and a number of groups are currentlycollaborating in the development of standards forupper-level ontologies encompassing a variety ofdomains [5]. Each UMLS concept is assigned oneor more semantic types from the semantic network.The internal structure of the constituent vocabularyis maintained, so that it is always possible to viewthe original contexts in which a particular concepthas appeared. The role of the semantic network isto provide the higher-level framework in which allconcepts are given a consistent and semanticallycoherent representation.
The UMLS semantic network currently consistsof 134 semantic types and 54 relationships. Thenetwork is defined at the highest level by twohierarchies, one for entities and another for events.Each semantic type is linked to its parent bythe ‘is a’ link, e.g. ‘Human’ is a leaf node inthe ‘Entity’ hierarchy. Traversing the ‘is a’ linksfrom ‘Human’ to ‘Entity’ allows the followingstatements: a human is a mammal, which is avertebrate; a vertebrate is an animal, which isan organism; an organism is a physical object,which is an entity. In addition to the definitionalpower of the network itself, each semantic type isgiven a textual definition. The definition is helpfulfor assigning, as well as interpreting, semantictypes linked to Metathesaurus concepts, e.g. thedefinition for the semantic type ‘Mammal’ is ‘a
Published in 2003 by John Wiley & Sons, Ltd. Comp Funct Genom 2003; 4: 80–84.
82 A. T. McCray
vertebrate having a constant body temperature andcharacterized by the presence of hair, mammaryglands and sweat glands’. Accompanying eachdefinition are examples of instances of that typefound in the Metathesaurus. In this case, someinstances are ‘bears’, ‘Guernsey cow’, ‘Rattusnorvegicus’, and ‘whales’. In some cases, usagenotes are included with the definitional informationand these serve as clear guidelines for semantictype assignment by UMLS curators, e.g. mostdrugs can be viewed from the perspective of boththeir therapeutic or functional activities and theirunderlying structural properties. Usage notes fordrug semantic type assignment, therefore, indicatethat both a functional and a structural semantic typeshould be chosen.
The semantic network is further defined by a setof associative relationships, which themselves forma hierarchy. The top-level associative relationshipsare ‘physically related to’, ‘spatially related to’,‘functionally related to’, ‘temporally related to’,and ‘conceptually related to’. Table 1 shows thecomplete set of relationships currently available inthe UMLS semantic network.
A typical assertion might be ‘PharmacologicSubstance treats Disease or Syndrome’, where‘Pharmacologic Substance’ and ‘Disease or Syn-drome’ are semantic types, and ‘treats’ is one of therelationships that obtains between them. Figure 1shows a portion of the UMLS semantic network,illustrating some of the relationships that interrelatethe semantic types.
Note that the associative relationships are statedat the highest possible level, and are inheritedby the descendants of those types, e.g. becausebiological function is a process of an organism, it isalso a process of an animal, a vertebrate, a mammaland a human. Analogously, since a genetic functionis a molecular, physiologic, and biologic function,it is also a process of an organism, and thereforealso of a human.
Because of the growing number of informationresources that contain genetic information, it hasbecome clear that the UMLS coverage in thisarea needs to be extended. Some of the UMLSvocabularies contain terminology at the cellularand molecular level, but none has been createdspecifically for genetic resources. As a first stepin extending the coverage of the UMLS with con-cepts relevant to the genomic domain, we recentlyintegrated the taxonomy of organisms developed
Table 1. UMLS semanticnetwork relationships
is aassociated with
physically related topart ofconsists ofcontainsconnected tointerconnectsbranch oftributary ofingredient of
spatially related tolocation ofadjacent tosurroundstraverses
functionally related toaffects
managestreatsdisruptscomplicatesinteracts withprevents
brings aboutproducescauses
performscarries outexhibitspractices
occurs inprocess of
usesmanifestation ofindicatesresult of
temporally related toco occurs withprecedes
conceptually related toevaluation ofdegree ofanalyzes
assesses effect ofmeasurement ofmeasuresdiagnosesproperty ofderivative ofdevelopmental form ofmethod ofconceptual part ofissue in
Published in 2003 by John Wiley & Sons, Ltd. Comp Funct Genom 2003; 4: 80–84.
A biomedical ontology (UMLS) 83
Organism
PlantAnimalBacteriumVirusFungusArchaeon Rickettsiaor
Chlamydia
AlgaVertebrateInvertebrate
Amphibian Bird Fish Reptile Mammal
Human
process of
evaluation of
BiologicFunction
PhysiologicFunction
OrganismFunction
MentalProcess
GeneticFunction
NeoplasticProcess
Mental orBehavioralDysfunction
Organ orTissue
Function
CellFunction
MolecularFunction
Cell orMolecular
Dysfunction
Disease orSyndrome
Experimentalmodel
of Disease
PathologicFunction
OrganismAttribute
Finding
Laboratory orTest Result
Sign orSymptom
AnatomicalStructure
AnatomicalAbnormality
EmbryonicStructure
CongentialAbnormality
AcquiredAbnormality
Fully FormedAnatomicalStructure
Injury orPoisoning
BodySubstance
Body Spaceor Junction
Body Locationor Region
Gene orGenome
Body System
CellComponent
CellTissueBody Part, Organ orOrgan Component
part of
disruptsdisrupts
part ofpart of
part of part of
conceptualpart of
conceptualpart of
conceptualpart of
conceptualpart of
containedin
evaluation of
property of
isa linksnon-isa relations
Figure 1. A portion of the UMLS semantic network
by the NLM’s National Center for BiotechnologyInformation (NCBI). The NCBI taxonomy is arapidly evolving taxonomy of organisms, incor-porating both phylogenetic and taxonomic knowl-edge from a variety of sources [12]. The taxon-omy currently contains more than 100 000 organ-ism names, including archaea, bacteria, eukary-ota and viruses. The coverage of the taxonomyis determined by the names for organisms whosesequences have been made public in a varietyof sequence databases, including GenBank, Euro-pean Molecular Biology Laboratory (EMBL), theSWISS-PROT protein sequence database and Pro-tein Information Resource (PIR).
We are working together with the developers ofthe Gene Ontology (GO) to integrate this ontol-ogy with the UMLS [1]. Our initial algorithmicmapping indicates that the three GO components,molecular function, biological process and cellu-lar component, map unevenly to existing UMLSconcepts. The greatest number of concepts mappedto the UMLS is found in the molecular func-tion component. About 43% of the GO molecu-lar function component terms mapped to UMLS
concepts, and about 35% of the cellular com-ponents mapped. A relatively small number ofthe GO biological processes were found (about5%). The algorithmic mappings are currently beingreviewed and modified by a GO curator in col-laboration with UMLS curators at the NLM. Aspart of this effort, we are reviewing the semanticnetwork for its coverage of the genetic domain.Semantic types and relationships that will beuseful include, for example, ‘Cell Component’,‘Biologically Active Substance’, ‘Genetic Func-tion’, ‘Nucleotide Sequence’, ‘Enzyme’, ‘Molec-ular Biology Research Technique’, ‘part of’, ‘pro-cess of’, ‘interacts with’, but it is likely that morewill be needed.
Knowledge in genomics continues to evolve ata pace that could not have been imagined even adecade ago, and there are ongoing efforts in thegenomics community to develop standard nomen-clatures for this domain [11]. As additional, stan-dard, ontologies become publicly available, weexpect to integrate these into the UMLS construct,which is regularly updated and readily available tothe community. As our understanding of biological
Published in 2003 by John Wiley & Sons, Ltd. Comp Funct Genom 2003; 4: 80–84.
84 A. T. McCray
processes, particularly at the cellular and molecularlevel, continues to increase, so, too, will our under-standing of the pathogenesis of disease. Genomicdatabases provide the raw data for making thesediscoveries. Results of these investigations are pub-lished in the scientific literature and, in time, thisknowledge will be stored in a variety of clini-cal information systems. At every step, standardontologies play an important role. To the extentthat we, as a community, are able to successfullydevelop, maintain, integrate and use these ontolo-gies, we will be making a contribution to scientificprogress in this important domain.
References
1. The Gene Ontology Consortium. 2001. Creating the geneontology resource: design and implementation. Genome Res11(8): 1425–1433.
2. Guarino N. 1995. Formal ontology, conceptual analysis andknowledge representation. Int J Hum Comput St 43: 625–640.
3. Lehmann F. 1992. Semantic Networks in Artificial Intelligence.Pergamon: Tarrytown, New York.
4. McCray AT, Nelson S. 1995. The representation of meaningin the UMLS. Methods Inf Med 34(1–2): 193–201.
5. Niles I, Pease A. 2001. Towards a standard upper ontology. InFormal Ontologies in Information Systems, Welty C, Smith B(eds). ACM Press: New York; 2–9.
6. Poli R. 1996. Ontology for knowledge representation. InKnowledge Organization and Change, Green R (ed.). Indeks:Frankfurt; 313–319.
7. Quillian M. 1968. Semantic memory. In Semantic InformationProcessing, Minsky M (ed.). MIT Press: Cambridge, MA;227–270.
8. Schulze-Kremer S. 1998. Ontologies for molecular biology. InPac Symp Biocomput 695–706.
9. Stevens R, Goble CA, Bechhofer S. 2000. Ontology-basedknowledge representation for bioinformatics. Brief Bioinform1(4): 398–414.
10. Unified Medical Language System (UMLS): http://umlsinfo.nlm.nih.gov/
11. Wain HM, Bruford EA, Lovering RC, et al. 2002. Guidelinesfor human gene nomenclature. Genomics 79(40): 464–470.
12. Wheeler DL, Chappey C, Lash AE, et al. 2000. Databaseresources of the National Center for BiotechnologyInformation. Nucleic Acids Res 28(1): 10–14.
Published in 2003 by John Wiley & Sons, Ltd. Comp Funct Genom 2003; 4: 80–84.
Submit your manuscripts athttp://www.hindawi.com
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Hindawi Publishing Corporation http://www.hindawi.com
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
The Scientific World JournalHindawi Publishing Corporation http://www.hindawi.com Volume 2014
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttp://www.hindawi.com
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
International Journal of
Microbiology