Knowledge-driven Implicit Information Extraction

1 Knowledge-driven Implicit Information Extraction Sujan Perera Dissertation Committee : Drs. Amit P. Sheth (advisor), Krishnaprasad Thirunarayan, Michael Raymer, Pablo N. Mendes (IBM Research) Ph.D. Dissertation Defense


Page 1: Knowledge-driven Implicit Information Extraction

1

Knowledge-driven Implicit Information Extraction

Sujan Perera

Dissertation Committee: Drs. Amit P. Sheth (advisor), Krishnaprasad Thirunarayan, Michael Raymer, Pablo N. Mendes (IBM Research)

Ph.D. Dissertation Defense

Page 2: Knowledge-driven Implicit Information Extraction

2

Information Extraction

• More than 70% of the data in organizations exists in unstructured form¹

• Extraction of structured information from unstructured data is a fundamental task

“All home medications although his insulin dose (nph 20 qPM) was halved (--> NPH 10 qPM) on the floor, and his sugars were running in the 150s-250s range.”

[Figure: knowledge graph fragment around Insulin. Nodes: Insulin, Cisapride, Diabetes Mellitus, Hyperglycemia, Proinsulin, Porcine Insulin, Insulin Glulisine. Edge labels: contradicting drug, may_treat, is_a.]

¹ https://en.wikipedia.org/wiki/Unstructured_data

Page 3: Knowledge-driven Implicit Information Extraction

3

Information Extraction

• Almost exclusively focused on explicit information

“Bob Smith is a 61-year-old man referred by Dr. Davis for outpatient cardiac catheterization because of a positive exercise tolerance test. Recently, he started to have left shoulder twinges and tingling in his hands. A stress test done on 2013-06-02 revealed that the patient exercised for 6 1/2 minutes, stopped due to fatigue. However, Mr. Smith is comfortably breathing in room air. He also showed accumulation of fluid in his extremities. He does not have any chest pain.”

Page 4: Knowledge-driven Implicit Information Extraction

4

Information Extraction

• Almost exclusively focused on explicit information

Named Entity Recognition, Relationship Extraction, Entity Linking

“Bob Smith is a 61-year-old man referred by Dr. Davis for outpatient cardiac catheterization because of a positive exercise tolerance test. Recently, he started to have left shoulder twinges and tingling in his hands. A stress test done on 2013-06-02 revealed that the patient exercised for 6 1/2 minutes, stopped due to fatigue. However, Mr. Smith is comfortably breathing in room air. He also showed accumulation of fluid in his extremities. He does not have any chest pain.”

Person Person C0018795

C0015672

C0008031

Page 5: Knowledge-driven Implicit Information Extraction

5

Information Extraction

• Misses the implicit information

“Bob Smith is a 61-year-old man referred by Dr. Davis for outpatient cardiac catheterization because of a positive exercise tolerance test. Recently, he started to have left shoulder twinges and tingling in his hands. A stress test done on 2013-06-02 revealed that the patient exercised for 6 1/2 minutes, stopped due to fatigue. However, Mr. Smith is comfortably breathing in room air. He also showed accumulation of fluid in his extremities. He does not have any chest pain.”

Person Person C0018795

C0015672

C0008031

No shortness of breath

edema

Named Entity Recognition, Relationship Extraction, Entity Linking, Implicit Information Extraction

Page 6: Knowledge-driven Implicit Information Extraction

6

Thesis Statement

Implicit factual information in unstructured text can be efficiently extracted by bridging syntactic and semantic gaps in natural language usage and augmenting information extraction techniques with relevant domain knowledge.

Page 7: Knowledge-driven Implicit Information Extraction

7

• Express sarcasm/sentiment
  – “I'm striving to be positive in what I say on Twitter. So I'll refrain from making a comment about the latest Michael Bay movie.”
• Provide descriptive information
  – “small fluid adjacent to the gallbladder with gallstones which may represent inflammation”
• Emphasize features of the entity
  – “Mason Evans 12 year long shoot won big in golden globe”
• Communicate the common understanding
  – “He is suffering from nausea and severe headaches. Dolasteron was prescribed.”
• Stylistic preferences
  – “Democratic candidate Bernie Sanders … The Vermont senator …”

Credit: http://bit.ly/2b9Bnjk

Page 8: Knowledge-driven Implicit Information Extraction

8

Significance

• Volume
  – 20% of movie references and 40% of book references in tweets are implicit
  – 35% of edema references and 40% of shortness of breath references in clinical narratives are implicit
• Value

[Figure: Explicit Information; Structured Information; applications: Computer Assisted Coding, 30-day Readmission Prediction, Sentiment Analysis.]

Page 9: Knowledge-driven Implicit Information Extraction

9

Significance

• Volume
  – 20% of movie references and 40% of book references in tweets are implicit
  – 35% of edema references and 40% of shortness of breath references in clinical narratives are implicit
• Value

Ignoring implicit information in text would adversely affect downstream applications

[Figure: Explicit Information and Implicit Information; Structured Information; applications: Computer Assisted Coding, 30-day Readmission Prediction, Sentiment Analysis.]

Page 10: Knowledge-driven Implicit Information Extraction

10

Role of Knowledge

Tweet: “New Sandra Bullock astronaut lost in space movie looks absolutely terrifying”
Knowledge Bases: Sandra Bullock, Gravity

Clinical narrative: “The patient showed accumulation of fluid in his extremities, but respirations were unlabored and there were no use of accessory muscles.”
UMLS:
  – Edema: Accumulation of an excessive amount of watery fluid in cells or intercellular tissues
  – Shortness of breath: Labored or difficult breathing associated with a variety of disorders
WordNet

Image credits: http://bit.ly/2b5HPDQ and icon made by Freepik from www.flaticon.com
Credits: http://bit.ly/2bi34FG, http://bit.ly/1x3sack, http://bit.ly/2b9CejW, http://bit.ly/2aXM97v

Page 11: Knowledge-driven Implicit Information Extraction

11

Implicit Information Extraction

[Figure: framework stages: Knowledge Acquisition, Knowledge Modeling, Detecting Implicit Information, Information Extraction.]

Page 12: Knowledge-driven Implicit Information Extraction

12

Dissertation Focus

Implicit Information Extraction

Entities Relationships

Organized Text Unorganized Text

Clinical Narratives Tweets

Disorders Symptoms Movies Books

Clinical Narratives

Disorders and Symptoms


Page 14: Knowledge-driven Implicit Information Extraction

14

Implicit Entities in Clinical Documents

Sentence: “small fluid adjacent to the gallbladder with gallstones which may represent inflammation.”
Entity: Cholecystitis

Sentence: “His tip of the appendix is inflamed.”
Entity: Appendicitis

Sentence: “The respirations were unlabored and there were no use of accessory muscles.”
Entity: Shortness of breath (NEG)

• One should know the physiological observations that characterize a particular entity
• Negations are embedded in the phrases indicating entities
  – “Patient denies shortness of breath”
  – “The respirations were unlabored”

Page 15: Knowledge-driven Implicit Information Extraction

15

Knowledge Acquisition

• Unified Medical Language System (UMLS) – integrates many health and biomedical vocabularies
• Linguistic knowledge – WordNet
  – Synonyms/antonyms
  – Syntactic variations of the same term

UMLS table columns shown: (CUI, AUI, STR), (CUI, TUI), (CUI, STR, DEF, SAB)

Definitions for shortness of breath:

A disorder characterized by an uncomfortable sensation of difficulty breathing

Difficult or labored breathing

Labored or difficulty breathing associated with a variety of disorders, indicating inadequate ventilation or low blood oxygen or a subjective experience of breathing discomfort

Page 16: Knowledge-driven Implicit Information Extraction

16

Knowledge Modeling

• Each entity has multiple definitions
• Each definition is processed to create an entity indicator
• The representative power of each term (e.g., r1) is calculated with a measure inspired by TF-IDF
• A collection of entity indicators constitutes the entity model

[Figure: definition1, definition2, and definition3 yield Entity Indicator1, Entity Indicator2, and Entity Indicator3, which together form the Entity Model.]

Definition: A disorder characterized by an uncomfortable sensation of difficulty breathing
Entity Indicator: (uncomfortable, r1), (sensation, r2), (difficulty, r3), (breathing, r4)

Definition: Difficult or labored breathing
Entity Indicator: (difficult, r5), (labored, r6), (breathing, r4)
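The slide only says that representative power comes from a measure "inspired by TF-IDF", so the following is a minimal sketch under assumptions rather than the dissertation's actual formula. It treats all definitions of one entity as a single document, uses a smoothed IDF, and relies on a hypothetical stop list (`STOP_WORDS`) that also drops generic definition words such as "disorder". All function and variable names are illustrative.

```python
import math
import re
from collections import Counter

# Hypothetical stop list: function words plus generic definition words.
STOP_WORDS = {"a", "an", "and", "by", "in", "of", "or", "the", "with",
              "disorder", "characterized"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_entity_models(definitions_by_entity):
    """Return {entity: [indicator, ...]}; each indicator maps term -> weight."""
    n_entities = len(definitions_by_entity)
    term_freq = {}          # per-entity term counts over all its definitions
    doc_freq = Counter()    # number of entities whose definitions use a term
    for entity, definitions in definitions_by_entity.items():
        counts = Counter(t for d in definitions for t in tokenize(d))
        term_freq[entity] = counts
        doc_freq.update(counts.keys())
    models = {}
    for entity, definitions in definitions_by_entity.items():
        indicators = []
        for definition in definitions:
            indicator = {}
            for term in tokenize(definition):
                # Smoothed IDF (an assumption, not taken from the slides).
                idf = math.log((1 + n_entities) / (1 + doc_freq[term])) + 1.0
                indicator[term] = term_freq[entity][term] * idf
            indicators.append(indicator)
        models[entity] = indicators
    return models

DEFINITIONS = {
    "shortness of breath": [
        "A disorder characterized by an uncomfortable sensation of difficulty breathing",
        "Difficult or labored breathing",
    ],
    "edema": [
        "Accumulation of an excessive amount of watery fluid in cells or intercellular tissues",
    ],
}
MODELS = build_entity_models(DEFINITIONS)
```

Because the weight of a term is computed per entity, "breathing" receives the same weight in both indicators, which mirrors the shared r4 on the slide.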

Page 17: Knowledge-driven Implicit Information Extraction

17

Detecting Sentences with Implicit Entities

• A sentence that contains an entity's representative terms but not the entity's name may mention the entity implicitly.

“However, Mr. Smith is comfortably breathing in room air.”

Candidate sentence for shortness of breath
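The detection rule on this slide can be sketched as a simple filter. This is an illustrative reading, not the author's code: `is_candidate`, the name list, and the representative-term set are all hypothetical.

```python
import re

def is_candidate(sentence, entity_names, representative_terms):
    """True if the sentence has a representative term but no explicit entity name."""
    text = sentence.lower()
    if any(name.lower() in text for name in entity_names):
        return False  # explicit mention, handled by ordinary entity linking
    tokens = set(re.findall(r"[a-z]+", text))
    return bool(tokens & representative_terms)

# Hypothetical inputs for the entity "shortness of breath".
NAMES = ["shortness of breath", "dyspnea"]
TERMS = {"breathing", "labored", "difficult", "difficulty"}

EXAMPLE = "However, Mr. Smith is comfortably breathing in room air."
IS_CANDIDATE = is_candidate(EXAMPLE, NAMES, TERMS)
```

With these inputs the example sentence qualifies because it contains "breathing" and never names the entity.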

Page 18: Knowledge-driven Implicit Information Extraction

18

Information Extraction – Entity Linking

• The similarity between the entity model and the pruned sentence is measured to annotate the sentence with a positive or negative label
• We developed a semantic similarity measure that accounts for synonyms and antonyms

[Figure: the Candidate Sentence is compared with Indicator1, Indicator2, and Indicator3 of the Entity Model, giving sim1, sim2, and sim3.]

Page 19: Knowledge-driven Implicit Information Extraction

19

Information Extraction – Entity Linking

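[Figure: the Candidate Sentence terms (ct1, ct2, ct3, ct4) are compared with the Entity Indicator terms (et5, et6, et7) using WordNet.]

For each entity indicator term $et$:

$$sim_{et} = \begin{cases} -1 & \text{if antonym} \\ \text{max similarity} & \text{otherwise} \end{cases}$$

The similarity between the candidate sentence and the entity indicator is

$$\frac{\sum_{et} sim_{et} \cdot rp_{et}}{\sum_{et} rp_{et}}$$

Positive Annotation if the value is $> t_1$; Negative Annotation if the value is $< t_2$.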
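A minimal sketch of this weighted similarity for a single entity indicator follows. It assumes the "-1 if antonym, else max similarity" rule is applied per entity indicator term against all candidate sentence terms, and it replaces WordNet with toy lookup tables so the example stays self-contained. The thresholds `T1` and `T2` and every name here are hypothetical; the slides do not give threshold values or say how sim1, sim2, and sim3 are combined.

```python
def term_score(entity_term, sentence_terms, similarity, antonyms):
    """-1 if any sentence term is an antonym, else the best similarity found."""
    for ct in sentence_terms:
        if (entity_term, ct) in antonyms or (ct, entity_term) in antonyms:
            return -1.0
    return max((similarity(entity_term, ct) for ct in sentence_terms), default=0.0)

def indicator_similarity(indicator, sentence_terms, similarity, antonyms):
    """Average of per-term scores weighted by representative power."""
    total = sum(indicator.values())
    if total == 0:
        return 0.0
    weighted = sum(term_score(et, sentence_terms, similarity, antonyms) * rp
                   for et, rp in indicator.items())
    return weighted / total

def annotate(score, t1, t2):
    """Positive above t1, negative below t2, otherwise no annotation."""
    if score > t1:
        return "positive"
    if score < t2:
        return "negative"
    return None

# Toy stand-ins for WordNet lookups (assumptions for illustration only).
ANTONYMS = {("labored", "unlabored")}
SIMILAR = {("breathing", "respirations"): 0.8}

def toy_similarity(a, b):
    if a == b:
        return 1.0
    return SIMILAR.get((a, b), SIMILAR.get((b, a), 0.0))

INDICATOR = {"labored": 2.0, "breathing": 1.0}   # hypothetical weights
SENTENCE_TERMS = ["respirations", "unlabored"]
T1, T2 = 0.3, -0.3                                # hypothetical thresholds

SCORE = indicator_similarity(INDICATOR, SENTENCE_TERMS, toy_similarity, ANTONYMS)
LABEL = annotate(SCORE, T1, T2)
```

Here the antonym pair contributes -1 with weight 2.0 and the synonym-like pair contributes 0.8 with weight 1.0, so the score is (-2.0 + 0.8) / 3.0 and the sentence receives a negative annotation.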

Page 20: Knowledge-driven Implicit Information Extraction

20

Evaluation

• Re-annotated the SemEval-2014 task 7 dataset for implicit entities

• Entities are selected considering the frequency of appearance and with expert feedback

• 857 sentences selected for 8 entities

• Annotated by three domain experts

• Annotation agreement 0.58

Entity                 Positive Annotations   Negative Annotations
Shortness of Breath    93                     94
Edema                  115                    35
Syncope                96                     92
Cholecystitis          78                     36
Gastrointestinal Gas   18                     14
Colitis                12                     11
Cellulitis             8                      2
Fasciitis              7                      3

Page 21: Knowledge-driven Implicit Information Extraction

21

Annotation Performance

(P = precision, R = recall)

Algorithm   Positive P   Positive R   Positive F1   Negative P   Negative R   Negative F1
Our         0.66         0.87         0.75          0.73         0.73         0.73
MCS         0.50         0.93         0.65          0.31         0.76         0.44
SVM         0.73         0.82         0.77          0.66         0.67         0.67

Adding similarity value as a feature for the supervised algorithm:

Algorithm   Positive P   Positive R   Positive F1   Negative P   Negative R   Negative F1
SVM+MCS     0.73         0.82         0.77          0.66         0.66         0.66
SVM+Our     0.77         0.85         0.81          0.72         0.75         0.73

• Baselines
  – MCS algorithm (Mihalcea 2006)
  – SVM (trained on n-grams)
• Our algorithm outperforms the selected baselines in the negative category.
• SVM is able to leverage the supervision to beat our algorithm in the positive category.

Page 22: Knowledge-driven Implicit Information Extraction

22

Similarity as a Feature to Supervised Algorithm

• Added similarity value of unsupervised algorithms as a feature to the SVM.

[Figure: two charts, Positive Annotations and Negative Annotations.]

Page 23: Knowledge-driven Implicit Information Extraction

23

Annotation Performance – A Study with the Confidence

• Each annotation has a confidence value ranging from 1 to 5

• Low confidence reflects incomplete or ambiguous information

• Annotation performance increases as the confidence increases

• The negative class shows a significant improvement

Page 24: Knowledge-driven Implicit Information Extraction

24

Dissertation Focus

Implicit Information Extraction

Entities Relationships

Organized Text Unorganized Text

Clinical Narratives Tweets

Disorders Symptoms Movies Books

Clinical Narratives

Disorders and Symptoms

Page 25: Knowledge-driven Implicit Information Extraction

25

Tweets with Implicit Entities

• Use diverse characteristics of the entity
  – “New Sandra Bullock astronaut lost in space movie looks absolutely terrifying”
  – “ISRO sends probe to Mars for less money than it takes Hollywood to send a woman to space.”
  – “oh yeah there is that new space movie coming out that looks terrifying i am going to go see it”
• Use time-sensitive phrases

[Figure: timeline of time-sensitive phrases. Movies: Gravity, Furious 7, The Martian. Dates: Fall 2013, April 2014, Fall 2015. Phrases: “space movie”, “fastest movie to earn $1 billion”, “Paul Walker's last movie”.]

Credit: http://bit.ly/2bkePJ6

Page 26: Knowledge-driven Implicit Information Extraction

26

Tweets with Implicit Entities

• Use diverse characteristics of the entity
  – “… Richard Linklater movie …”
  – “… Ellar Coltrane on his 12-year movie …”
  – “… 12-year long movie shoot …”
  – “… Mason Evan's childhood movie …”
• Use time-sensitive phrases

[Figure: timeline of time-sensitive phrases. Movies: Gravity, Furious 7, The Martian. Dates: Fall 2013, April 2014, Fall 2015. Phrases: “space movie”, “fastest movie to earn $1 billion”, “Paul Walker's last movie”.]

Credit: http://bit.ly/2bk8xdp

Page 27: Knowledge-driven Implicit Information Extraction

27

Knowledge Acquisition

• Acquiring factual knowledge
  – Source – DBpedia
  – Not all factual knowledge is important – a movie has ‘starring’ and ‘director’ as well as ‘billed’ and ‘license’
  – Rank the relationships based on joint probability with the entity type
  – Values of top-k relationships and the value of rdfs:comment are obtained
• Acquiring contextual knowledge
  – Source – contemporary tweets
  – We collect 1000 tweets with explicit mentions of the entity
• Number of views for the entity’s Wikipedia page within last t days

Page 28: Knowledge-driven Implicit Information Extraction

28

Knowledge Acquisition

[Figure: processing flow with the steps Contemporary tweets, Clean tweets, Factual knowledge, Generate n-grams, Wikipedia page titles and anchor texts, Generate semantic cues.]

• Need to extract meaningful phrases from acquired knowledge
• Meaningful phrases = Wikipedia titles + anchor texts
  – Matching n-grams are added to semantic cues
  – Non-matching n-grams are added to semantic cues after removing stop words
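The cue-generation rule can be illustrated with a short sketch. It assumes whitespace tokenization, n-grams up to length 3, and small hypothetical sets standing in for the Wikipedia titles and anchor texts and for the stop-word list; none of these choices are stated in the slides.

```python
def ngrams(tokens, max_n=3):
    """All contiguous n-grams of length 1..max_n, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def semantic_cues(text, meaningful_phrases, stop_words, max_n=3):
    """Keep matching n-grams as they are; strip stop words from the rest."""
    cues = set()
    for gram in ngrams(text.lower().split(), max_n):
        if gram in meaningful_phrases:
            cues.add(gram)
        else:
            kept = [w for w in gram.split() if w not in stop_words]
            if kept:
                cues.add(" ".join(kept))
    return cues

# Hypothetical stand-ins for Wikipedia titles/anchor texts and stop words.
PHRASES = {"woman in space"}
STOPS = {"in", "a", "the", "to"}

CUES = semantic_cues("woman in space movie", PHRASES, STOPS)
```

In this example "woman in space" survives intact because it matches a meaningful phrase, while "in space movie" is reduced to "space movie".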

Page 29: Knowledge-driven Implicit Information Extraction

29

Knowledge Modeling – Entity Model Network

• A property graph reflecting the topical relationships between entities

$$specificity(c_j) = \frac{|N|}{|N_{c_j}|}$$

$$frequency = \text{total number of times the phrase occurs in tweets}$$

$N$ – total number of entities, $N_{c_j}$ – number of adjacent entities

Entity nodes: number of Wikipedia views

[Figure: entity model network. Entities: Gravity, Interstellar, The Martian. Knowledge nodes: Sandra Bullock, Alfonso Cuarón, Christopher Nolan, Matt Damon, Mars orbiter mission, Woman in space, astronaut. Legend: Factual Knowledge, Contextual Knowledge, Entity.]

Page 30: Knowledge-driven Implicit Information Extraction

30

Detecting Tweets with Implicit Entities

• Tweets are filtered with keywords – movie, film, book, novel
• Applied simple annotation technique – dictionary matching
• The tweets that are not annotated with an entity of the types we are looking for are considered to have implicit entity mentions

[Figure: Keywords and Entity Dictionary feed Annotating Tweets.]
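The filtering steps above can be sketched as follows. The keyword list comes from the slide; the case-insensitive substring matching and the tiny `DICTIONARY` are assumptions made for illustration, and a real dictionary matcher would need to handle ambiguous titles.

```python
import re

KEYWORDS = {"movie", "film", "book", "novel"}  # keywords listed on the slide

def has_keyword(tweet):
    """True if the tweet contains one of the filter keywords as a token."""
    return bool(set(re.findall(r"[a-z]+", tweet.lower())) & KEYWORDS)

def explicit_mentions(tweet, dictionary):
    """Naive dictionary matching: case-insensitive substring lookup."""
    text = tweet.lower()
    return [name for name in dictionary if name.lower() in text]

def is_implicit_candidate(tweet, dictionary):
    """Keyword present but no entity of the target types annotated."""
    return has_keyword(tweet) and not explicit_mentions(tweet, dictionary)

DICTIONARY = ["Gravity", "The Martian"]  # hypothetical entity dictionary
TWEET = "New Sandra Bullock astronaut lost in space movie looks absolutely terrifying"
IMPLICIT = is_implicit_candidate(TWEET, DICTIONARY)
```

The example tweet passes the keyword filter and has no dictionary hit, so it is kept as a tweet with a possible implicit entity mention.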

Page 31: Knowledge-driven Implicit Information Extraction

31

Information Extraction – Entity Linking

• Two-step process
• Step 1: Candidate selection and filtering
  – Objective - prune the search space to reduce the number of entities from the Entity Model Network (EMN) to be considered in the disambiguation step
• Step 2: Disambiguation
  – Objective - sort the selected candidate entities to place the implicitly mentioned entity in the top position

Page 32: Knowledge-driven Implicit Information Extraction

32

Entity Linking - Candidate selection and filtering

“ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space”

[Figure: entity model network with entity nodes m1 to m8 and cue nodes c1 to c9. Legend: Entity, Factual Knowledge, Contextual Knowledge.]

Page 33: Knowledge-driven Implicit Information Extraction

33

Entity Linking - Candidate selection and filtering

“ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space”

[Figure: entity model network with entity nodes m1 to m8 and cue nodes c1 to c9; the cues c2, c5, c7, and c8 are shown with the tweet. Legend: Factual Knowledge, Contextual Knowledge, Entity.]

Page 34: Knowledge-driven Implicit Information Extraction

34

Entity Linking - Candidate selection and filtering

“ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space”

[Figure: entity model network with entity nodes m1 to m8 and cue nodes c1 to c9; the cues c2, c5, c7, and c8 are shown together with the entities m1 to m8. Legend: Factual Knowledge, Contextual Knowledge, Entity.]

Page 35: Knowledge-driven Implicit Information Extraction

35

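Entity Linking - Candidate selection and filtering

“ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space”

$$score_{m_i} = \sum_{c_j \in \mathbb{C}} specificity(c_j) \cdot frequency(c_j, m_i)$$

where $\mathbb{C}$ is the set of matching cues.

[Figure: entity model network with entity nodes m1 to m8 and cue nodes c1 to c9; the cues c2, c5, c7, and c8 are shown together with the entities m1 to m8, and the entities m2, m3, m4, m6, and m7 are shown after scoring. Legend: Factual Knowledge, Contextual Knowledge, Entity.]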
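The candidate score can be sketched as below. This assumes specificity is the plain ratio of the total number of entities to the number of entities adjacent to the cue, which is one reading of the garbled formula on the Entity Model Network slide, and it uses the top-25 cutoff mentioned in the evaluation. The cue-to-entity map, the frequency table, and all names are hypothetical toy data.

```python
def specificity(cue, cue_to_entities, n_entities):
    """Assumed form: total entities divided by entities adjacent to the cue."""
    return n_entities / len(cue_to_entities[cue])

def score_candidates(tweet_cues, cue_to_entities, frequency, n_entities, top_k=25):
    """Sum specificity * frequency over the matching cues of each entity."""
    scores = {}
    for cue in tweet_cues:
        if cue not in cue_to_entities:
            continue  # cue does not occur in the network
        spec = specificity(cue, cue_to_entities, n_entities)
        for entity in cue_to_entities[cue]:
            freq = frequency.get((cue, entity), 0)
            scores[entity] = scores.get(entity, 0.0) + spec * freq
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]

# Toy network (hypothetical values).
CUE_TO_ENTITIES = {
    "astronaut": {"Gravity", "The Martian", "Interstellar"},
    "sandra bullock": {"Gravity"},
}
FREQUENCY = {
    ("astronaut", "Gravity"): 5,
    ("astronaut", "The Martian"): 8,
    ("astronaut", "Interstellar"): 3,
    ("sandra bullock", "Gravity"): 10,
}

RANKED = score_candidates({"astronaut", "sandra bullock"},
                          CUE_TO_ENTITIES, FREQUENCY, n_entities=4)
```

A cue shared by many entities, such as "astronaut", contributes little, while a cue tied to a single entity, such as "sandra bullock", dominates the score.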

Page 36: Knowledge-driven Implicit Information Extraction

36

Entity Linking - Disambiguation

• Formulated as a ranking problem
• SVMrank to rank candidates
  – Similarity between the candidate entity and the tweet
  – Temporal salience of the candidate entity

$$\frac{temporal\ salience_{e_i}}{\sum_{e \in E_c} temporal\ salience_{e}}$$

where $E_c$ is the selected candidate set.

[Figure: features x1, x2, x3, …, xn (xj) for the candidates m2, m6, m4, m3, and m7; the top-ranked candidate is the Winner.]
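The normalized temporal salience feature is simple arithmetic and can be sketched as follows, taking Wikipedia page views as the raw salience as the knowledge acquisition slide suggests. The view counts are made up for illustration, and the SVMrank model itself is not reproduced here.

```python
def normalized_temporal_salience(page_views):
    """Divide each candidate's salience by the total over the candidate set."""
    total = sum(page_views.values())
    if total == 0:
        return {entity: 0.0 for entity in page_views}
    return {entity: views / total for entity, views in page_views.items()}

# Hypothetical page-view counts for a selected candidate set.
VIEWS = {"Gravity": 6000, "The Martian": 3000, "Interstellar": 1000}
SALIENCE = normalized_temporal_salience(VIEWS)
```

The normalized values sum to 1 over the candidate set, so the feature expresses how prominent a candidate is relative to its competitors rather than in absolute terms.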

Page 37: Knowledge-driven Implicit Information Extraction

37

Evaluation Dataset

Entity Type   Annotation   Tweets   Entities
Movie         Explicit     391      107
              Implicit     207      54
              NULL         117      0
Book          Explicit     200      24
              Implicit     190      53
              NULL         70       0

• Tweets were collected in August 2014 using keywords
• The tweets were manually annotated with the DBpedia URL of entities
• The tweets annotated with NULL have neither an explicit nor an implicit mention of an entity

Page 38: Knowledge-driven Implicit Information Extraction

38

Entity Model Network Creation

• 15,000 tweets for movies and books in July 2014

• 617 movies and 102 books

• Recent 1000 tweets per entity to build its contextual knowledge

• May 2014 version of DBpedia used to extract factual knowledge

• Temporal salience is obtained for July 2014

[Figure: entity model network with entity nodes m1 to m7 and cue nodes c1 to c9. Legend: Factual Knowledge, Contextual Knowledge, Entity.]

Page 39: Knowledge-driven Implicit Information Extraction

39

Evaluation - Implicit Entity Linking

• How many tweets had the correct entity within the selected candidate set (top-25)?
• How many entities were correctly linked by our disambiguation approach?

Entity Type   Candidate Selection Recall   Disambiguation Accuracy
Movie         90.33%                       60.97%
Book          94.73%                       61.05%

• Importance of contextual knowledge

Step                         Entity Type   Without Contextual Knowledge   With Contextual Knowledge
Candidate Selection Recall   Movie         77.29%                         90.33%
                             Book          76.84%                         94.73%
Disambiguation Accuracy      Movie         51.7%                          60.97%
                             Book          50.0%                          61.05%

Page 40: Knowledge-driven Implicit Information Extraction

40

Qualitative Error Analysis

Error: Lack of contextual knowledge
Tweet: ‘That Movie Where Shailene Woodley Has Her First Nude Scene? The Trailer Is RIGHT HERE!: No one can say Shailene Woodley isn't brave!’
Entity: White Bird in a Blizzard

Error: Novel entities
Tweet: ‘"hey, what's wrawng widdis goose?" RT @TIME: Mark Wahlberg could be starring in a movie about the BP oil spill http://ti.me/1oZh55V’
Entity: Deepwater Horizon

Error: Cold start of entities
Tweet: ‘Video: George R.R. Martin's Children's Book Gets Re-release http://bit.ly/1qNNH5r’
Entity: The Ice Dragon

Error: Multiple implicit entity mentions
Tweet: ‘That moment when you realize that hazel grace and Augustus are brother and sister in one movie and in love battling cancer in another’
Entity: Divergent, The Fault in Our Stars

Page 41: Knowledge-driven Implicit Information Extraction

41

Dissertation Focus

Implicit Information Extraction

Entities Relationships

Organized Text Unorganized Text

Clinical Narratives Tweets

Disorders Symptoms Movies Books

Clinical Narratives

Disorders and Symptoms

Page 42: Knowledge-driven Implicit Information Extraction

42

Implicit Relationships in Clinical Narratives

[Figure: example graph of clinical entities. Diseases: atrial fibrillation, hypertension, diabetes. Symptoms: chest pain, weight gain, headache. Medications: lisinopril, warfarin, insulin, atenolol. Edge labels: is_treated_with, has_symptom.]

Page 43: Knowledge-driven Implicit Information Extraction

43

Implicit Relationships in Clinical Narratives

• Implicit relationships:
  – Exist between symptoms, disorders, medications, and procedures
  – Can be established by leveraging domain knowledge
• The existing knowledge bases fall short in eliciting relationships
• Data + Knowledge can help to elicit such implicit relationships efficiently

Page 44: Knowledge-driven Implicit Information Extraction

44

A Scenario

Atrial fibrillation, Hypertension, Diabetes

Fatigue, Syncope, Weight loss, Chest pain, Discomfort in chest, Dizzy, Shortness of Breath, Nausea, Vomiting, Headache, Cough, Weight gain

Page 45: Knowledge-driven Implicit Information Extraction

45

A Scenario

Atrial fibrillation, Hypertension, Diabetes

Fatigue, Syncope, Weight loss, Chest pain, Discomfort in chest, Dizzy, Shortness of Breath, Nausea, Vomiting, Headache, Cough, Weight gain

Observed Disorders: Atrial fibrillation, Hypertension, Diabetes

Observed Symptoms: Chest pain, Weight gain, Discomfort in chest, Cough, Headache, Edema, Shortness of Breath

The knowledge base does not know about edema. Now edema can be a symptom of any disorder in the document.

Page 46: Knowledge-driven Implicit Information Extraction

46

Knowledge Acquisition

• Hierarchical knowledge and non-hierarchical knowledge

Hierarchical Knowledge: retrieved from UMLS

MRHIER
CUI        AUI        PAUI       PTR
C0013404   A0052186   A0111363   A0434168.A2367943. …
C0013604   A0052723   A0135504   A0434168.A2367943

MRCONSO
CUI        AUI        SAB   STR
C0013404   A0052186   MSH   Shortness of breath
C0013604   A0052723   MSH   Edema

Non-hierarchical Knowledge: extracted from Web resources + feedback from domain expert
  – www.nlm.nih.gov
  – www.en.wikipedia.org
  – www.webmd.com
  – www.mayoclinic.com
  – www.clevelandclinic.org
  – www.healthline.org

Page 47: Knowledge-driven Implicit Information Extraction

47

Knowledge Modeling

[Figure: knowledge model. Groups: Classes of disorders, Classes of symptoms, Instances of disorders, Instances of symptoms. Nodes: Hypertension, Diastolic Hypertension, Pulmonary Hypertension, Renal Hypertension, Episodic Pulmonary Hypertension, Solitary Pulmonary Hypertension, Breathing Problems, Shortness of Breath, Asthma. Edge labels: is_symptom_of, rdfs:subClassOf, rdf:type.]

Page 48: Knowledge-driven Implicit Information Extraction

48

Detecting Unexplained Symptoms

• Clinical documents were semantically annotated for entities using cTAKES
• Known relationships are populated
• Unexplained symptoms were detected

[Figure: Modeled Knowledge.]

Credit: http://bit.ly/2aMWVAd

Page 49: Knowledge-driven Implicit Information Extraction

49

Information Extraction – Unknown Relationships

• A naïve method would assume a relationship between the unexplained symptom and all disorders in the clinical narrative

• Can we leverage the knowledge we have about the symptom to find the most plausible disorders?

• Intuition: a symptom is most likely to be shared by similar disorders

Page 50: Knowledge-driven Implicit Information Extraction

50

Information Extraction – Unknown Relationships

1. All co-occurring disorders are candidates

[Figure: unexplained symptom S with co-occurring disorders D1, D2, D3, D4, and D5.]

Page 51: Knowledge-driven Implicit Information Extraction

51

Information Extraction – Unknown Relationships

2. Find known disorders of the symptom

[Figure: symptom S with co-occurring disorders D1, D2, D3, D4, and D5, and known disorders D6 and D7.]

Page 52: Knowledge-driven Implicit Information Extraction

52

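Information Extraction – Unknown Relationships

3. Collect more knowledge about known relationships

[Figure: symptom S with co-occurring disorders D1 to D5 and known disorders D6 and D7; the collected knowledge adds D7, D8, D2, D10, D11, D12, D4, and D14.]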

Page 53: Knowledge-driven Implicit Information Extraction

53

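Information Extraction – Unknown Relationships

4. Compare co-occurring disorders with collected knowledge

[Figure: symptom S with co-occurring disorders D1 to D5 and known disorders D6 and D7; the collected knowledge contains D7, D8, D2, D10, D11, D12, D4, and D14.]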

Page 54: Knowledge-driven Implicit Information Extraction

54

Information Extraction – Unknown Relationships

5. Eliminate non-matching candidate disorders

[Figure: symptom S remains connected to D2 and D4.]

We are left with the most plausible disorders for the unexplained symptom. If this scenario occurs frequently, it increases the confidence in this relationship.
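The five steps can be read as a set intersection, sketched below with the node labels from the figures. How "more knowledge about known relationships" is collected is not spelled out in the slides; this sketch assumes a hypothetical `related_disorders` map (for example, neighbors in the disorder hierarchy), and the split of the collected disorders between D6 and D7 is invented for the example.

```python
def plausible_disorders(symptom, cooccurring, known_disorders, related_disorders):
    """Keep co-occurring disorders that match the expanded known disorders."""
    # Step 2: disorders already known to have this symptom.
    known = known_disorders.get(symptom, set())
    # Step 3: expand with disorders related to the known ones (assumed map).
    expanded = set(known)
    for disorder in known:
        expanded |= related_disorders.get(disorder, set())
    # Steps 4 and 5: compare with the candidates and drop non-matching ones.
    return set(cooccurring) & expanded

COOCCURRING = {"D1", "D2", "D3", "D4", "D5"}           # step 1 candidates
KNOWN = {"S": {"D6", "D7"}}
RELATED = {                                            # hypothetical grouping
    "D6": {"D7", "D8", "D2"},
    "D7": {"D10", "D11", "D12", "D4", "D14"},
}

PLAUSIBLE = plausible_disorders("S", COOCCURRING, KNOWN, RELATED)
```

With these inputs only D2 and D4 survive, matching the final figure. A symptom with no known disorders yields an empty set, so such cases would still need expert input.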

Page 55: Knowledge-driven Implicit Information Extraction

55

Evaluation

• A corpus of 1,500 electronic medical records was used
• The records were annotated with cTAKES, and the most frequent entities were selected
• UMLS semantic types were used to categorize disorders and symptoms
• Initial knowledge base - 86 disorders, 42 symptoms, 255 disorder-symptom relationships

Page 56: Knowledge-driven Implicit Information Extraction

56

Evaluation – Relationship Prediction

• There were 29 distinct unexplained symptoms
• Precision of the questions generated
  – 1st iteration - 105 correct from 142 (73.94%)
  – 2nd iteration - 20 correct from 29 (68.96%)
  – 3rd iteration - 4 correct from 9 (44.44%)

Top 5 unexplained symptoms

Symptom           Number of unexplained instances
Edema             910
Syncope           336
Systolic Murmur   168
Tachycardia       143
Angina            136

Top 5 co-occurring disorders with edema

Disorder                   Number of co-occurrences
Hypertension               647
Hyperlipidemia             641
Claudication               454
Coronary atherosclerosis   395
Coronary artery disease    242

Page 57: Knowledge-driven Implicit Information Extraction

57

Evaluation – Increment in Explainability

Knowledge base           Number of unexplained relationships   Increment in explainability
Initial knowledge base   2251                                  0%
After 1st iteration      878                                   60.99%
After 2nd iteration      806                                   64.19%
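The increment column appears to be the share of initially unexplained relationships that the enriched knowledge base now explains. A short check of that reading, with an illustrative function name, is:

```python
def explainability_increment(initial_unexplained, remaining_unexplained):
    """Percentage of initially unexplained relationships now explained."""
    explained = initial_unexplained - remaining_unexplained
    return 100.0 * explained / initial_unexplained

AFTER_FIRST = explainability_increment(2251, 878)   # (2251 - 878) / 2251
AFTER_SECOND = explainability_increment(2251, 806)  # (2251 - 806) / 2251
```

Both results agree with the table to within rounding, which supports reading the column as measured against the initial 2251 unexplained relationships.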

Page 58: Knowledge-driven Implicit Information Extraction

58

Summary

• Implicit information occurs frequently in text, and ignoring it would adversely affect downstream applications.

• Linguistic and world knowledge play an important role in decoding implicit information.

• This dissertation demonstrated the characteristics of implicit information and developed solutions to capture factual implicit constructs.

[Figure: framework stages Knowledge Acquisition, Knowledge Modeling, Detecting Implicit Information, and Information Extraction, with the labels UMLS; Taxonomical, Definitional; Non-taxonomical, Associational; Representative terms; Domain Semantics; Semi-supervised, Supervised, Unsupervised.]

Page 59: Knowledge-driven Implicit Information Extraction

59

Contributions

• Identify and demonstrate the value of implicit information.
• Study the characteristics of the implicit information manifestation.
• Demonstrate the value of knowledge in extracting factual implicit information.
  – Linguistic
  – Domain
  – Contextual
• Developed a framework for factual implicit information extraction.
• Demonstrated the usage of the framework to solve three implicit information extraction problems.

Page 60: Knowledge-driven Implicit Information Extraction

60

Graduate [email protected]

Journal Publications:
• Sujan Perera, Cory Henson, Krishnaprasad Thirunarayan, Amit Sheth, Suhas Nair, Semantics Driven Approach for Knowledge Acquisition from EMRs, IEEE Journal of Biomedical and Health Informatics.
• Raminta Daniulaityte, Robert Carlson, Russel Falck, Delroy Cameron, Sujan Perera, Lu Chen and Amit Sheth. 'I just wanted to tell you that loperamide WILL WORK': A Web-Based Study of Extra-Medical Use of Loperamide.

Conference Publications:
• Sujan Perera, Pablo Mendes, Adarsh Alex, Amit Sheth, Krishnaprasad Thirunarayan, Implicit Entity Linking in Tweets, ESWC 2016
• Sujan Perera, Pablo Mendes, Amit Sheth, Krishnaprasad Thirunarayan, Adarsh Alex, Christopher Heid, Greg Mott, Implicit Entity Recognition in Clinical Documents, *SEM 2015
• Sujan Perera, Cory Henson, Krishnaprasad Thirunarayan, Amit Sheth, Suhas Nair, Data Driven Knowledge Acquisition Method for Domain Knowledge Enrichment in the Healthcare, BIBM 2012
• Menasha Thilakaratne, Ruvan Weerasinghe, Sujan Perera, Knowledge-driven Approach to Predict Personality Traits by Leveraging Social Media Data, WI 2016

Workshop and Posters:
• Sujan Perera, Amit Sheth, Krishnaprasad Thirunarayan, Challenges in Understanding Clinical Notes: Why NLP Engines Fall Short and Where Background Knowledge Can Help, DARE 2013
• Raminta Daniulaityte, Robert Carlson, Russel Falck, Delroy Cameron, Sujan Perera, Lu Chen, Amit Sheth. A Web-Based Study of Self-Treatment of Opioid Withdrawal Symptoms with Loperamide, CPDD 2012

Internships:
• ezDI Summer 2012
• IBM Watson Summer 2014 and 2015

Awards and grants:
• George Thomas Graduate Fellowship
• NSF travel grants: BIBM and ICHI

PC Committee:
• DARE (2013), EKAW (2014, 2016), ISWC 2015, IJCAI 2016

External Reviewer:
• ISWC, ESWC, IJSWIS, IEEE Intelligent Systems, Applied Ontology, ODBASE

Proposal Contributions:
• eDrugTrends (NIH R01)
• Healthcare Outcome Prediction (NSF-SCH)

Mentoring:
• Adarsh Alex (MSc)
• Menasha Tilakaratne (BSc)

Page 61: Knowledge-driven Implicit Information Extraction

61

Thank You

Mentors Collaborators

Page 62: Knowledge-driven Implicit Information Extraction

62

Coffee Mates and Colleagues

Thank You

Funding
• ezDI
• George Thomas Fellowship
• NSF: CNS 1513721 Context-Aware Harassment Detection on Social Media