ALINA PETROVA
EMCL WORKSHOP 1 8.02.2014
Learning formal definitions for biomedical concepts
Examples of reasoning over structured biomedical knowledge 2
1) Covert et al. 2012: Whole-cell simulation
• computational model of all processes in a bacterium
• 2 years, >1000 articles
2) King et al. 2009: Automation of science
• Adam the Robot Scientist
• generate functional genomic hypotheses about a yeast
• used knowledge bases and ontologies for hypothesis generation and analysis
• experimental validation
The growth of biomedical scientific literature 3
Tsatsaronis et al. 2013
Existing biomedical ontologies 4
Ontology    # concepts   Year   Research/Production   Definitions
UMLS        1,000,000    1986   R, P                  textual, triples
SNOMED CT   300,000      1965   P                     formal
FMA         75,000       1995   P                     triples
GO          42,000       1998   P                     textual
GALEN       29,000       1991   R                     formal
MeSH        25,000       1963   P                     textual
Great need to convert textual definitions to formal representation!
Formalizing biomedical knowledge 5
Atelectasis (Lung collapse) example:
Absence of air in the entire or part of a lung, such as an incompletely inflated neonate lung or a collapsed adult lung. Pulmonary atelectasis can be caused by airway obstruction, lung compression, fibrotic contraction, or other factors.
vs.
Atelectasis = Disorder of lung ⊓ ∃has_associated_morphology(Collapse) ⊓ ∃has_finding_site(Lung structure) ⊓ ∃has_episodicity(Episodicities) ⊓ ∃has_clinical_course(Courses)
…
An example of a MeSH definition 6
Arthritis is a form of joint disorder that results from joint inflammation. When bone surfaces become less well protected by cartilage, bone may be exposed and damaged.
Is it easy to formalize a definition? 7
Arthritis is a form of joint disorder that results from joint inflammation.
Arthritis = Joint_Disorder ⊓ ∃results_from.Joint_Inflammation
YES!
Is it easy to formalize a definition? 8
When bone surfaces become less well protected by cartilage, bone may be exposed and damaged.
Temporal logic?
Modal logic?
Situation calculus?
???
NO!
DL? which one?
Sources of problems 9
• Conceptual modeling
  ◦ Joint_Inflammation or Inflammation – related_to – Joints?
• Expressive modeling
  ◦ what exactly do we want to model? to what degree of sophistication? using which formalism?
• Text mining
  ◦ how to establish the dependencies between words in a definition?
The Goal 10
A is a B that has property C.
A ≡ B ⊓ ∃property.C
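As a minimal sketch of the target pattern, the snippet below renders a parent concept plus (property, filler) pairs as a description-logic axiom string. The function name `to_dl_axiom` and the tuple layout are assumptions made for illustration; they are not part of the presented system.

```python
def to_dl_axiom(concept, parent, restrictions):
    """Render 'A ≡ B ⊓ ∃property.C' from a parent and (property, filler) pairs."""
    # Each restriction becomes an existential restriction ∃property.filler.
    parts = [parent] + [f"∃{prop}.{filler}" for prop, filler in restrictions]
    return f"{concept} ≡ " + " ⊓ ".join(parts)


# Usage with the arthritis example from the earlier slide.
print(to_dl_axiom("Arthritis", "Joint_Disorder",
                  [("results_from", "Joint_Inflammation")]))
```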
How to extract formal definitions? 11
CONCEPT ANNOTATION
RELATION EXTRACTION
RELATION CLASSIFICATION
Example 12
Abdominal Wall: the outer margins of the abdomen, extending from the osteocartilaginous thoracic cage to the pelvis.
Step 1: Concept annotation 13
Abdominal Wall: the outer margins of the abdomen, extending from the osteocartilaginous thoracic cage to the pelvis.
the abdomen -> ‘Abdomen’
the osteocartilaginous thoracic cage -> ‘Thorax’
the pelvis -> ‘Pelvis’
Step 2: Relation extraction 14
Abdominal Wall: the outer margins of the abdomen, extending from the osteocartilaginous thoracic cage to the pelvis.
“outer margins of” (Abdominal wall, Abdomen)
“that extends from” (Abdominal wall, Thorax)
“that extends to” (Abdominal wall, Pelvis)
Step 3: Relation classification 15
“outer margins of” (Abdominal wall, Abdomen) -> location(Abdominal wall, Abdomen)
“that extends from” (Abdominal wall, Thorax) -> starts(Abdominal wall, Thorax)
“that extends to” (Abdominal wall, Pelvis) -> ends(Abdominal wall, Pelvis)
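The three steps on the Abdominal Wall example can be sketched with toy lookup tables. The dictionaries `CONCEPTS` and `RELATION_LABELS` are hypothetical stand-ins for a real concept annotator and a trained relation classifier, and step 2 is supplied by hand here rather than extracted.

```python
# Hypothetical toy lexicons; a real system would use an annotator and a classifier.
CONCEPTS = {
    "the abdomen": "Abdomen",
    "the osteocartilaginous thoracic cage": "Thorax",
    "the pelvis": "Pelvis",
}
RELATION_LABELS = {
    "outer margins of": "location",
    "that extends from": "starts",
    "that extends to": "ends",
}


def annotate(definition):
    """Step 1: find known terms in the definition and map them to concepts."""
    return {term: cid for term, cid in CONCEPTS.items() if term in definition}


def classify(triples):
    """Step 3: map each relational string to a formal relation label."""
    return [(RELATION_LABELS[rel], a, b) for rel, a, b in triples]


definition = ("the outer margins of the abdomen, extending from the "
              "osteocartilaginous thoracic cage to the pelvis.")
# Step 2 output, written by hand as (relational string, concept A, concept B).
triples = [
    ("outer margins of", "Abdominal wall", "Abdomen"),
    ("that extends from", "Abdominal wall", "Thorax"),
    ("that extends to", "Abdominal wall", "Pelvis"),
]
print(annotate(definition))
print(classify(triples))
```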
How to extract formal definitions? 16
CONCEPT ANNOTATION
RELATION EXTRACTION
RELATION CLASSIFICATION
RELATION EXTRACTION
17
SUPERVISED
Approach #1: align existing resources 18
Atelectasis (Lung collapse) example:
Absence of air in the entire or part of a lung, such as an incompletely inflated neonate lung or a collapsed adult lung. Pulmonary atelectasis can be caused by airway obstruction, lung compression, fibrotic contraction, or other factors.
vs.
Atelectasis = Disorder of lung ⊓ ∃has_associated_morphology(Collapse) ⊓ ∃has_finding_site(Lung structure) ⊓ ∃has_episodicity(Episodicities) ⊓ ∃has_clinical_course(Courses)
…
Results 19
• Relations: extract 3 SNOMED relations from MeSH textual definitions
• Results: 75% success rate for single-label classification
A – relational string – B
A – relation label – B
Results 20
• How to improve 75%?
  ◦ add new features
  ◦ use resources with consistent modeling
Be data-driven!
A – relational string – B
A – relation label – B
Approach #2: annotate a corpus 21
SemRep:
• a rule-based system for biomedical relation extraction
• 26 relations
• a corpus of 500 annotated sentences
• 1300 relation instances
Top relations: process_of, location_of, part_of, treats, isa, affects, causes, interacts_with, uses, etc.
SemRep relations 22
Two key improvements 23
• Consistent modeling
  Before: MeSH texts vs. SNOMED CT relations
  After: SemRep texts vs. SemRep relations
• The use of concept types
  Before: lexical features (ngrams)
  After: ngrams + concept types of relation arguments
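The second improvement, combining lexical n-grams with the concept types of both arguments, could look roughly like the sketch below. The feature-name prefixes, the trigram length, and the boolean feature values are assumptions made for this illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in a relational string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def features(relational_string, type_a, type_b, n=3):
    """Combine lexical n-gram features with the concept types of both arguments."""
    # Boolean features: 1 means "this n-gram / type is present".
    feats = {f"ng={g}": 1 for g in char_ngrams(relational_string, n)}
    feats[f"type_a={type_a}"] = 1
    feats[f"type_b={type_b}"] = 1
    return feats


print(features("causes", "Body Substance", "Anatomical Abnormality"))
```

A dictionary of named features like this can be passed to any classifier that accepts sparse feature vectors.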
Concept types 24
Motivation: every relation has a domain and a range → only specific types of concepts can be used as arguments
UMLS (biggest knowledge source for biomedicine, thesaurus, upper ontology etc.): 133 semantic types
Tissue, Cell Function, Animal, Behavior, Physical Object, Molecular Sequence etc.
Hormone – affects – Cell Function
Body Substance – causes – Anatomical Abnormality
Why are concept types useful? 25
Given concepts A, B
MeSH triple: A “is in some relation with” B
Before:
A – relation R1 – B
A – relation R2 – B
both are candidates!
After:
A → type At, B → type Bt
R1 ⊆ At × Bt
R2 ⊆ Ct × Dt
only R1 remains a candidate
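A minimal sketch of this domain/range filter follows, using the two type-level triples from the Concept types slide. The `SIGNATURES` table and the function name are hypothetical; a real table would be derived from the UMLS Semantic Network.

```python
# Hypothetical signatures: relation -> set of allowed (domain type, range type) pairs.
SIGNATURES = {
    "affects": {("Hormone", "Cell Function")},
    "causes": {("Body Substance", "Anatomical Abnormality")},
}


def candidate_relations(type_a, type_b, signatures=SIGNATURES):
    """Keep only relations R for which (type_a, type_b) lies in R's domain × range."""
    return sorted(r for r, pairs in signatures.items() if (type_a, type_b) in pairs)


print(candidate_relations("Hormone", "Cell Function"))
print(candidate_relations("Animal", "Behavior"))
```

Relations whose signature does not match the argument types are dropped before classification, which shrinks the label space the classifier has to choose from.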
RESULTS 26
Before: 424 instances, top 3 relations, 75%
After:
860 instances, top 5 relations, 94%
1144 instances, top 10 relations, 89%
1357 instances, all 26 relations, 83%
Comparison with SemRep 27
                  SemRep                                        ML method
Quality, top 5    95%                                           94%
Quality, top 10   94%                                           89%
Quality, all      94%                                           83%
Scalability       not scalable                                  scalable
Training speed    manually annotated corpus + rules = months    annotated corpus + ML = minutes
still rely on the labeled corpus → approach #3
RELATION EXTRACTION
28
UNSUPERVISED
Why is no annotated corpus needed? 29
Original approach:
term A – relational string – term B
↓ annotation
concept A – formal relation – concept B
Now add the concept types!
The corpus is not manually annotated! 30
term A – relational string – term B
concept A – relational string – concept B
concept type A’ – formal relation – concept type B’
known from taxonomy/thesaurus!
known from the corpus
Still we use SemRep as a background. Can we do better?
Approach #3: unsupervised relation extraction 31
Yes!
• no manual annotation
• no predefined relations
• only taxonomy and annotation needed
• semantic clustering
term A – relational string – term B
concept A – relational string – concept B
concept type A’ – verb – concept type B’
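The lifting from concept-level triples to type-level patterns can be sketched as below. The `CONCEPT_TYPE` lookup is a hypothetical stand-in for a taxonomy or thesaurus, its entries are illustrative rather than verified UMLS assignments, and the verbs are assumed to be already extracted from the relational strings.

```python
from collections import Counter

# Hypothetical concept -> semantic type lookup (illustrative entries only).
CONCEPT_TYPE = {
    "Insulin": "Hormone",
    "Glucagon": "Hormone",
    "Glycolysis": "Cell Function",
}


def lift(triples, concept_type=CONCEPT_TYPE):
    """Replace each concept by its type and count the resulting type-level patterns."""
    return Counter((concept_type[a], verb, concept_type[b]) for a, verb, b in triples)


counts = lift([
    ("Insulin", "affects", "Glycolysis"),
    ("Glucagon", "affects", "Glycolysis"),
])
print(counts)
```

Frequent type-level patterns are the raw material for grouping verbs, with no manually labeled relation instances involved.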
Cluster examples 32
• {attach, bind}
• {cause, produce, induce}
• {transmit, convey, carry}
• {limit, inhibit, reduce}
• {result, lead}
etc.
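One simple way such clusters could arise is by grouping verbs that occur with similar (type A, type B) argument pairs. The sketch below uses Jaccard overlap and a greedy grouping pass. The similarity measure, the threshold, and the toy contexts are all assumptions; the slides do not specify the clustering method actually used.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_verbs(contexts, threshold=0.5):
    """Greedy grouping: a verb joins the first cluster containing a similar verb."""
    clusters = []
    for verb in sorted(contexts):
        for cluster in clusters:
            if any(jaccard(contexts[verb], contexts[v]) >= threshold for v in cluster):
                cluster.add(verb)
                break
        else:
            # No sufficiently similar cluster found: start a new one.
            clusters.append({verb})
    return clusters


# Hypothetical contexts: verb -> set of (type A, type B) pairs it was seen with.
contexts = {
    "cause": {("Body Substance", "Anatomical Abnormality"), ("Virus", "Disease")},
    "induce": {("Body Substance", "Anatomical Abnormality"), ("Virus", "Disease")},
    "bind": {("Hormone", "Receptor")},
    "attach": {("Hormone", "Receptor")},
}
print(cluster_verbs(contexts))
```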
Conclusions 33
• decompose the task of formal definition generation
  ◦ review of the existing approaches
  ◦ adaptation/creation
  ◦ implementation
  ◦ evaluation
• explore non-taxonomic relation extraction
  ◦ feature analysis
  ◦ performance of 94% on the top 5 relations, on a par with SemRep (95%)
• suggest workflow for unsupervised relation extraction
  ◦ faster
  ◦ less resource dependent
  ◦ can be generalized to different domains and applications
QUESTIONS?
Thank you!
35
TRIPLE EXTRACTION
Back-up slides
Example 37
Abdominal Wall: the outer margins of the abdomen, extending from the osteocartilaginous thoracic cage to the pelvis.
STEP #2: triple extraction
“outer margins of the” (Abdominal wall, Abdomen)
“that extends from the osteocartilaginous” (Abdominal wall, Thorax)
“to the” (Abdominal wall, Pelvis)
Triple extraction steps 38
1. separate the definition into head and body
2. find the parent term, if there is one
3. group coordinated concepts together
4. organize concepts into concept pairs
5. extract relational string for every pair
6. detect negation
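The first two steps can be sketched with plain string handling. The colon-based split and the "is a ... that/which" pattern are assumptions for illustration; real definitions vary far more than this pattern covers.

```python
import re


def split_head_body(definition):
    """Step 1: split 'Head: body' into the defined term and its definition text."""
    head, _, body = definition.partition(":")
    return head.strip(), body.strip()


# Assumed pattern for step 2: "<X> is a/an <parent> that|which ...".
PARENT = re.compile(r"^(?P<child>.+?) is an? (?P<parent>.+?) (?:that|which)\b",
                    re.IGNORECASE)


def find_parent(sentence):
    """Step 2: return ('IS_A', child, parent) if the sentence names a parent term."""
    m = PARENT.match(sentence)
    return ("IS_A", m["child"], m["parent"].capitalize()) if m else None


print(split_head_body("Abdominal Wall: the outer margins of the abdomen"))
print(find_parent("Cancer is a disease that spreads"))
```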
Triple extraction steps 39
• separate the definition into head and body
  Head: Abdominal wall
  Body: the outer margins of the abdomen…
• find the parent term, if there is one
• group coordinated concepts together
• organize concepts into concept pairs
• extract relational string for every pair
• detect negation
Triple extraction steps 40
• separate the definition into head and body
• find the parent term, if there is one
  “Cancer is a disease that…” → IS_A(Cancer, Disease)
• group coordinated concepts together
• organize concepts into concept pairs
• extract relational string for every pair
• detect negation
Triple extraction steps 41
• separate the definition into head and body
• find the parent term, if there is one
• group coordinated concepts together
  “X causes swelling and rashes” → causes(X, Swelling), causes(X, Rash)
• organize concepts into concept pairs
• extract relational string for every pair
• detect negation
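The coordination step can be sketched as splitting a coordinated object phrase and emitting one triple per item. The function name and the splitting pattern are assumptions. Mapping surface forms to concepts (for example "rashes" to Rash) is left to the annotation step and not shown.

```python
import re


def expand_coordination(subject, relation, objects_phrase):
    """Split 'X and Y' / 'X, Y or Z' into items and emit one triple per item."""
    items = re.split(r",\s*(?:and|or)\s+|,\s*|\s+(?:and|or)\s+", objects_phrase)
    return [(relation, subject, item.strip()) for item in items if item.strip()]


print(expand_coordination("X", "causes", "swelling and rashes"))
```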
Triple extraction steps 42
• separate the definition into head and body
• find the parent term, if there is one
• group coordinated concepts together
• organize concepts into concept pairs
• extract relational string for every pair
  “that extends from the osteocartilaginous” (Abdominal wall, Thorax)
• detect negation
Triple extraction steps 43
• separate the definition into head and body
• find the parent term, if there is one
• group coordinated concepts together
• organize concepts into concept pairs
• extract relational string for every pair
• detect negation
  “that does not respond to the ordinary” (Refractory anemia, Treatment) → NEGATION
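A minimal sketch of cue-based negation detection on a relational string follows. The cue list is an assumption for illustration; a real system would need a fuller lexicon and some handling of negation scope.

```python
import re

# Assumed cue list, matched as whole words only.
NEGATION_CUES = re.compile(r"\b(?:not|no|never|without|absence of)\b", re.IGNORECASE)


def is_negated(relational_string):
    """Flag a relational string that contains an explicit negation cue."""
    return bool(NEGATION_CUES.search(relational_string))


print(is_negated("that does not respond to the ordinary"))
print(is_negated("that extends from the osteocartilaginous"))
```

The word-boundary anchors keep substrings such as the "no" inside "osteocartilaginous" from triggering a false positive.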
ANNOTATION
44
Back-up slides
Attribute Alignment Annotator 45
Problem # 1: missing annotations 46
Problem #2: ambiguity 47
DETAILS OF THE APPROACH
Back-up slides
Improvements since the last meeting 49
                          Old approach                                            New approach
Text source               MeSH definitions                                        MEDLINE abstracts
Relation set R source     SNOMED CT                                               UMLS
Feature sources           text of a definition                                    text of a definition + concept types
Feature representations   BoW, token and character ngrams, combination            character ngrams
Weighting schemes         boolean, per-class weights                              boolean
Classification algorithm  SVMs, Random Forests, Logistic Regression, Naïve Bayes  SVMs