Post on 20-Dec-2015
Battling Scylla and Charybdis: The Search for Redundancy and Ambiguity
in the 2001 UMLS Metathesaurus
James J. Cimino
Department of Medical Informatics
Columbia University
2001 Metathesaurus
• 99 sources (92 in 2000)
• 1,734,707 strings (1,598,176 in 2000)
• 797,360 concepts (730,155 in 2000)
Lumping vs. Splitting
Cold (infection)
Cold (temperature)
COLD (COPD)
COLD (temperature)
Ambiguity!
Redundancy!
Three Auditing Methods
• Ambiguity through multiple semantic types
• Redundancy through semantic string matching
• Inconsistency in parent-child semantic types
Previous Results: 1995
Possible ambiguity 1,817
Possible redundancy 5,031
Actually redundant 3,274
Parent-Child problems 544
* Cimino JJ. Auditing the Unified Medical Language System with semantic methods. Journal of the American Medical Informatics Association; 1998;5:41-51.
Tools and Rules
• Simple Metathesaurus data model
• Normalized word index
• “Mutually exclusive semantic types”
• “Mutual concept subsumption”
Simple Metathesaurus Data Model
L0009264:S0829315: “COLD <3>”
C0024117: Chronic Obstructive Airway Disease
L0486186:S0837575: “Chronic Obstructive Airway Disease”
L0486186:S0837576: “Chronic Obstructive Lung Disease”
Semantic type: T04: Disease or Syndrome
S0474508: “COLD”
Simple Metathesaurus Data Model
S0829315: “COLD <3>”
C0024117: Chronic Obstructive Airway Disease
S0837575: “Chronic Obstructive Airway Disease”
S0837576: “Chronic Obstructive Lung Disease”
Semantic type: T04: Disease or Syndrome
S0474508: “COLD”
Simple Metathesaurus Data Model
“COLD <3>”
C0024117: Chronic Obstructive Airway Disease
“Chronic Obstructive Airway Disease”
“Chronic Obstructive Lung Disease”
Semantic type: T04: Disease or Syndrome
“COLD”
Simple Metathesaurus Data Model
C0024117: Chronic Obstructive Airway Disease
Semantic type: T04: Disease or Syndrome
COLD <3>
Chronic Obstructive Airway Disease
Chronic Obstructive Lung Disease
COLD
C0035242: Respiratory Tract Diseases
Semantic type: T04: Disease or Syndrome
Parent-Child (is-a)
Mutually Inclusive Semantic Types
Physical Object
  Organism
    Animal
      Invertebrate
    Plant
      Alga
  Substance
    Food
Mutually Exclusive Semantic Types
Physical Object
  Organism
    Animal
      Invertebrate
    Plant
      Alga
  Substance
    Food
Rules for Multiple Semantic Types
3. Concepts can have two Substance types, except:
   a) Element, Ion or Isotope and Chemicals Viewed Structurally
   b) Inorganic Chemical and Organic Chemicals
5. Concepts can have two Conceptual Entity types, except:
   Molecular Sequence and Geographic Area
   Molecular Sequence and Body Location or Region
   Geographic Area and Body Location or Region
7. Concepts can have two Event types, except:
   Diagnostic Procedure and Laboratory Procedure
8. Concepts can have two types that are ancestors/descendants
Detection of Ambiguity by Mutually Exclusive Semantic Types
If a concept has multiple semantic types
And if any pair of the types are mutually exclusive
Then the concept may have multiple meanings (ambiguity)
Or the semantic type assignment is incorrect
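The rule above can be sketched in a few lines of Python. This is only an illustration, not the tool used for the audit: the hierarchy is the toy fragment from the slides, the function names are hypothetical, and the exclusivity test is a deliberate simplification that allows only ancestor/descendant pairs (rule 8) and treats every other pair as mutually exclusive.

```python
from itertools import combinations

# Toy fragment of the semantic type hierarchy (child -> parent), from the slides.
PARENT = {
    "Organism": "Physical Object",
    "Animal": "Organism",
    "Invertebrate": "Animal",
    "Plant": "Organism",
    "Alga": "Plant",
    "Substance": "Physical Object",
    "Food": "Substance",
}

def ancestors(sem_type):
    """Return the set of all ancestors of a semantic type."""
    found = set()
    while sem_type in PARENT:
        sem_type = PARENT[sem_type]
        found.add(sem_type)
    return found

def mutually_exclusive(x, y):
    # Simplifying assumption: types are exclusive unless one is an
    # ancestor of the other. The real rule set has more exceptions.
    return x != y and x not in ancestors(y) and y not in ancestors(x)

def exclusive_pairs(sem_types):
    """Return the pairs of types that make a concept possibly ambiguous."""
    return [(x, y) for x, y in combinations(sorted(sem_types), 2)
            if mutually_exclusive(x, y)]

print(exclusive_pairs({"Alga", "Invertebrate"}))    # flagged pair
print(exclusive_pairs({"Animal", "Invertebrate"}))  # nothing flagged
```

A non-empty result means only that the concept deserves manual review: either it has more than one meaning or one of its semantic types was assigned incorrectly.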
Ambiguity Examples
C0015155: Euglena gracilis
Alga and Invertebrate
C0223537: Fourth lumbar vertebra
Body Part, Organ, or Organ Component and Disease or Syndrome
C0035510: Toxicodendron
Plant and Disease or Syndrome
C0242789: Crown-Rump Length
Organism Attribute and Diagnostic Procedure
C0007608: Cell Movement
Cell Function and Biomedical Occupation or Discipline
C0030756: Lice Infestations
Invertebrate and Disease or Syndrome
C0008715: Chronically Ill
Disease or Syndrome and Patient or Disabled Group
Normalized Word Index
• UMLS Normalized Word Index
– e.g., “lungs” → “lung”
– 293,004 words
• Keyword synonyms
– e.g., “lung” → “pulmonary”
– 9,650 mappings
• Translated strings
• Built word index
Word Normalization
C0024117: Chronic Obstructive Airway Disease
Semantic type: T04: Disease or Syndrome
COLD <3>
Chronic Obstructive Airway Disease
Chronic Obstructive Lung Disease
COLD
C0035242: Respiratory Tract Diseases
Semantic type: T04: Disease or Syndrome
Parent-Child (is-a)
Word Normalization
Parent-Child (is-a)
C0035242: Respiratory Tract Diseases
Semantic type: T04: Disease or Syndrome
C0024117: Chronic Obstructive Airway Disease
Semantic type: T04: Disease or Syndrome
cold 3
chronic obstructive airway disease
chronic obstructive lung disease
cold
Word Normalization
Parent-Child (is-a)
C0035242: Respiratory Tract Diseases
Semantic type: T04: Disease or Syndrome
C0024117: Chronic Obstructive Airway Disease
Semantic type: T04: Disease or Syndrome
cold 3
chronic obstructive airway disease
chronic obstructive pulmonary disease
cold
Word Normalization
Parent-Child (is-a)
C0035242: Respiratory Tract Diseases
Semantic type: T04: Disease or Syndrome
C0024117: Chronic Obstructive Airway Disease
Semantic type: T04: Disease or Syndrome
cold 3
chronic obstructive airway disorder
chronic obstructive pulmonary disorder
cold
Word Normalization
Parent-Child (is-a)
C0035242: Respiratory Tract Diseases
Semantic type: T04: Disease or Syndrome
C0024117: Chronic Obstructive Airway Disease
Semantic type: T04: Disease or Syndrome
cold three
chronic obstructive airway disorder
chronic obstructive pulmonary disorder
cold
Word Index
Word index: airway, chronic, cold, disorder, obstructive, pulmonary, three
Parent-Child (is-a)
C0035242: Respiratory Tract Diseases
Semantic type: T04: Disease or Syndrome
C0024117: Chronic Obstructive Airway Disease
Semantic type: T04: Disease or Syndrome
cold three
chronic obstructive airway disorder
chronic obstructive pulmonary disorder
cold
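The normalization steps walked through above (lowercase and strip punctuation, apply the normalized word index, apply keyword synonyms, then collect the distinct words) might look roughly like this. The two small dictionaries are hypothetical stand-ins for the UMLS normalized word index and the keyword synonym table, and handling "3" → "three" through the synonym table is an assumption made for brevity.

```python
import re

# Hypothetical stand-ins for the real tables (293,004 words; 9,650 mappings).
LEXICAL_NORM = {"lungs": "lung", "diseases": "disease"}
SYNONYMS = {"lung": "pulmonary", "disease": "disorder", "3": "three"}

def normalize(string):
    """Lowercase, drop punctuation, then map each word to its canonical form."""
    words = re.findall(r"[a-z0-9]+", string.lower())
    words = [LEXICAL_NORM.get(w, w) for w in words]
    return [SYNONYMS.get(w, w) for w in words]

def word_index(strings):
    """Sorted list of the distinct normalized words across a concept's strings."""
    return sorted({w for s in strings for w in normalize(s)})

strings = [
    "COLD <3>",
    "Chronic Obstructive Airway Disease",
    "Chronic Obstructive Lung Disease",
    "COLD",
]
print(word_index(strings))
```

For the four strings of C0024117 this reproduces the word index shown on the slide.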
Mutual String Subsumption
1) If Concept A has String A1
And all words in A1 are in Concept B’s word list
Then B subsumes A1
2) If B subsumes any string in A
And A subsumes any string in B
Then A and B are mutually subsumptive
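A minimal sketch of the two rules, assuming each concept is represented simply as a list of already-normalized strings (the function names are illustrative, not taken from the original tool):

```python
def word_list(strings):
    """All distinct words appearing in any of a concept's strings."""
    return {w for s in strings for w in s.split()}

def subsumes(strings_b, string_a):
    """Rule 1: B subsumes a string if B's word list contains all of its words."""
    return set(string_a.split()) <= word_list(strings_b)

def mutually_subsumptive(strings_a, strings_b):
    """Rule 2: each concept subsumes at least one string of the other."""
    b_subsumes_a = any(subsumes(strings_b, s) for s in strings_a)
    a_subsumes_b = any(subsumes(strings_a, s) for s in strings_b)
    return b_subsumes_a and a_subsumes_b

common_cold = ["common cold", "cold two", "cold"]
cold_temperature = ["cold temperature", "cold one", "cold"]
print(mutually_subsumptive(common_cold, cold_temperature))
print(mutually_subsumptive(["common cold"], ["cold temperature"]))
```

The shared one-word string "cold" is enough to make the first pair mutually subsumptive, which is why short or partial names produce many candidate pairs.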
Mutual String Subsumption
C0009443: Common Cold
T04: Disease or Syndrome
Strings: common cold, cold two, cold
Word list: cold, common, two
C0009264: cold temperature
T070: Natural Phenomenon or Process
Strings: cold temperature, cold one, cold
Word list: cold, one, temperature
C0024117: Chronic Obstructive Airway Disease
T04: Disease or Syndrome
Strings: chronic obstructive airway disorder, chronic obstructive pulmonary disorder, cold three, cold
Word list: airway, chronic, cold, disorder, obstructive, pulmonary, three
Detection of Redundancy by String Subsumption
If A and B are mutually subsumptive
And semantic types of A and B are mutually inclusive
Then A and B may be redundant
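Putting the pieces together for the three "cold" concepts gives the following sketch. It assumes, purely for this toy example, that two semantic types are mutually inclusive only when they are identical; the actual test also takes the Semantic Net hierarchy and the exception rules into account.

```python
from itertools import combinations

# Normalized strings and semantic types, as shown on the slides.
CONCEPTS = {
    "C0009264": {"type": "Natural Phenomenon or Process",
                 "strings": ["cold temperature", "cold one", "cold"]},
    "C0009443": {"type": "Disease or Syndrome",
                 "strings": ["common cold", "cold two", "cold"]},
    "C0024117": {"type": "Disease or Syndrome",
                 "strings": ["chronic obstructive airway disorder",
                             "chronic obstructive pulmonary disorder",
                             "cold three", "cold"]},
}

def word_list(concept):
    return {w for s in concept["strings"] for w in s.split()}

def subsumes_any(b, a):
    """True if b's word list contains every word of at least one string of a."""
    words_b = word_list(b)
    return any(set(s.split()) <= words_b for s in a["strings"])

def types_inclusive(x, y):
    # Assumption for this sketch: only identical types count as inclusive.
    return x == y

def possibly_redundant(concepts):
    flagged = []
    for id_a, id_b in combinations(sorted(concepts), 2):
        a, b = concepts[id_a], concepts[id_b]
        if (subsumes_any(b, a) and subsumes_any(a, b)
                and types_inclusive(a["type"], b["type"])):
            flagged.append((id_a, id_b))
    return flagged

print(possibly_redundant(CONCEPTS))
```

Only the pair of Disease or Syndrome concepts is flagged, and it is not a true redundancy, so the output is a candidate list for manual review rather than a verdict.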
Detection of Redundancy by String Subsumption
C0009264: cold temperature
T070: Natural Phenomenon or Process
Strings: cold temperature, cold one, cold
Word list: cold, one, temperature
C0009443: Common Cold
T04: Disease or Syndrome
Strings: common cold, cold two, cold
Word list: cold, common, two
C0024117: Chronic Obstructive Airway Disease
T04: Disease or Syndrome
Strings: chronic obstructive airway disorder, chronic obstructive pulmonary disorder, cold three, cold
Word list: airway, chronic, cold, disorder, obstructive, pulmonary, three
Redundancy Examples
C0673603: NPS-R-467 (Organic Chemical)
C0673604: NPS R-467 (Organic Chemical)
C0673769: des-Arg(10)-(Leu(9))kallidin (Amino Acid, Peptide or Protein)
C0673771: kallidin, des-Arg(10)-(Leu(9))-) (Amino Acid, Peptide or Protein)
C0266133: Congenital diverticulum of esophagus (Congenital Abnormality)
C0555218: Congenital esophageal pouch (Congenital Abnormality)
Redundancy False Positives
• Partial names as synonyms:
C0687720: Central Diabetes Insipidus has “Diabetes Insipidus” as a synonym, so it is mutually subsumptive with C0011848: Diabetes Insipidus
• Incorrect synonymy (MeSH translations):
C0013005: Dolphins has synonyms “ORCA” (Span.) and “FALSA BALEIA ASSASSINA” (Port.), so it is mutually subsumptive with C0325138: Whale, False Killer, which has synonym “FALSA ORCA” (Span.)
Detecting Semantic Type Problems through Parent-Child Relations
If Concept A is Parent of Concept B
And Concept A has semantic type X
And Concept B has semantic type Y
And if X and Y are different
And X is not an ancestor of Y (in Semantic Net)
Then one (or both) semantic types are wrong
Or the parent-child relation is wrong
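The check can be sketched as follows. The type hierarchy is a small hand-built fragment, the function names are hypothetical, and the extra "nonspecific" outcome (child type is an ancestor of the parent's type) is an illustrative refinement of the rule rather than part of its statement above.

```python
# Fragment of the Semantic Net hierarchy (child type -> parent type).
TYPE_PARENT = {
    "Organism": "Physical Object",
    "Animal": "Organism",
    "Vertebrate": "Animal",
    "Fish": "Vertebrate",
    "Manufactured Object": "Physical Object",
}

def type_ancestors(sem_type):
    """Return the set of all ancestors of a semantic type."""
    found = set()
    while sem_type in TYPE_PARENT:
        sem_type = TYPE_PARENT[sem_type]
        found.add(sem_type)
    return found

def check_pair(parent_type, child_type):
    """Classify one parent-child concept pair by its semantic types X and Y."""
    if parent_type == child_type or parent_type in type_ancestors(child_type):
        return "OK"
    if child_type in type_ancestors(parent_type):
        return "nonspecific semantic type"
    return "wrong type or wrong concept"

for child_type in ["Vertebrate", "Fish", "Animal", "Manufactured Object"]:
    print(child_type, "->", check_pair("Vertebrate", child_type))
```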
Detecting Semantic Type Problems through Parent-Child Relations
Parent: Cartilaginous Fish (vertebrate)
Child: Shark (vertebrate): OK
Child: Dogfish (fish): OK
Child: Stingray (animal): Nonspecific Semantic Type
Child: Skate (manufactured object): Wrong Type or Wrong Concept
Parent-Child Examples
C0013769: Elbow
has type Body Location or Region,
which is in the Conceptual Entity hierarchy
Is parent of:
C0230353: Right elbow
has type Body Part, Organ, or Organ Component,
which is in the Physical Object hierarchy
Results: 1995 vs. 2001
                          1995              2001
Possible ambiguity        1,817 (0.82%)     8,082 (1.01%)
Possible redundancy       5,031 (2.26%)     38,140 (4.78%)
Actually redundant        3,274 (1.47%)     not done
Parent-Child problems     544 (0.54%)       2,868 (0.47%)
Number of concepts        222,927           797,359 (3.6x)
Parent-Child relations    100,586           607,043 (6.0x)
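As a quick arithmetic check, the percentages can be recomputed from the counts. The sketch below assumes what the numbers themselves suggest: ambiguity and redundancy rates are taken over the number of concepts, while the parent-child rate is taken over the number of parent-child relations.

```python
# Counts copied from the results slide: (1995 value, 2001 value).
CONCEPTS = (222_927, 797_359)
RELATIONS = (100_586, 607_043)

FINDINGS = {
    "Possible ambiguity": ((1_817, 8_082), CONCEPTS),
    "Possible redundancy": ((5_031, 38_140), CONCEPTS),
    "Parent-Child problems": ((544, 2_868), RELATIONS),
}

def rates(findings):
    """Percentage of flagged items per year, rounded to two decimals."""
    return {
        name: tuple(round(100 * count / denom, 2)
                    for count, denom in zip(counts, denoms))
        for name, (counts, denoms) in findings.items()
    }

for name, (pct_1995, pct_2001) in rates(FINDINGS).items():
    print(f"{name}: {pct_1995}% (1995), {pct_2001}% (2001)")
```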
Discussion: Ambiguity Detection
• Small number (1.01%) is a good sign
• Allows focusing manual review
• Semantic type definitions need to be clarified
• Semantic type assignment rules need to be clarified
Discussion: Redundancy Detection
• Specificity is worse, without improved sensitivity
• Normalized string index is part of the reason
• “Incomplete” names are a bigger part of the reason
• Manual review will be relatively inefficient
• Incorrect mappings detected, especially foreign language
Discussion: Parent-Child Relations
• Mostly detects errors in semantic type assignment
• Strict hierarchy in Semantic Net causes problems
Conclusions
• Specific “answers” not possible
– Domain expertise needed for assessment of chemical names
– Assessments are necessarily subjective
– NLM gets to make the rules
– NLM hasn’t finished making the rules
• Methods provide focus for manual review
• Methods highlight where clearer definitions are needed
• The results show the UMLS is doing well at a difficult task