K NOWLEDGE - BASED M ETHOD FOR D ETERMINING THE M EANING OF A MBIGUOUS B IOMEDICAL T ERMS U SING I...

Post on 19-Dec-2015

214 views 0 download

Tags:

Transcript of K NOWLEDGE - BASED M ETHOD FOR D ETERMINING THE M EANING OF A MBIGUOUS B IOMEDICAL T ERMS U SING I...

1

KNOWLEDGE-BASED METHOD FOR DETERMINING THE MEANING OF AMBIGUOUS BIOMEDICAL TERMS USING INFORMATION CONTENT MEASURES OF SIMILARITY

Bridget McInnes

Ted Pedersen

Ying Liu

Genevieve B. Melton

Serguei Pakhomov

2

OBJECTIVE OF THIS WORK

Develop and evaluate a method than can disambiguate terms in biomedical text by exploiting similarity information extrapolated from the Unified Medical

Language System

Evaluate the efficacy of Information Content-based similarity measures over path-based similarity measures

for Word Sense Disambiguation, WSD

3

WORD SENSE DISAMBIGUATION

Word sense disambiguation is the task of determining the appropriate sense of a term

given context in which it is used.

TERM: tolerance

DrugTolerance

ImmuneTolerance

4

WORD SENSE DISAMBIGUATION

Word sense disambiguation is the task of determining the appropriate sense of a term

given context in which it is used.

Busprione attenuates tolerance to morphine in mice with skin cancer

DrugTolerance

ImmuneTolerance

5

SENSE INVENTORY: UNIFIED MEDICAL LANGUAGE SYSTEM

Unified Medical Language Sources (UMLS) Semantic Network Metathesaurus

~1.7 million biomedical and clinical concepts; integrated semi-automatically

CUIs (Concept Unique Identifiers), linked: Hierarchical: PAR/CHD and RB/RN Non-hierarchical: SIB, RO

Sources viewed together or independently Medical Subject Heading (MSH)

SPECIALIST Lexicon Biomedical and clinical terms, including variants

6

WORD SENSE DISAMBIGUATION

Busprione attenuates tolerance to morphine in mice with skin cancer

DrugTolerance: C0013220

ImmuneTolerance:C0020963

Concept Unique Identifiers: CUIs

7

SENSERELATE ALGORITHM

Each possible sense of a target word is assigned a score [sum similarity between it and its surrounding terms]

Assign target word the sense with highest score

Proposed by Patwardhan and Pedersen 2003 using WordNet

UMLS::SenseRelate is a modification of this algorithm using information from the UMLS

NEXT UP: an example

8

SENSERELATE EXAMPLEBusprione attenuates tolerance to morphine

in mice with skin cancer

9

SENSERELATE EXAMPLEBusprione attenuates tolerance to morphine

in mice with skin cancer

DrugTolerance: C0013220

ImmuneTolerance:C0020963

10

SENSERELATE EXAMPLEBusprione attenuates tolerance to morphine

in mice with skin cancer

DrugTolerance: C0013220

ImmuneTolerance:C0020963

Busprione:

C0006462

Morphine:C0026549

Mice: C0026809

Skin cancer:

C0007114

11

SENSERELATE EXAMPLE

0.090.16

0.11

Busprione attenuates tolerance to morphine in mice with skin cancer

0.09

DrugTolerance: C0013220

ImmuneTolerance:C0020963

Busprione:

C0006462

Morphine:C0026549

Mice: C0026809

Skin cancer:

C0007114

12

SENSERELATE EXAMPLE

0.090.16

0.11

Busprione attenuates tolerance to morphine in mice with skin cancer

0.09

DrugTolerance: C0013220

ImmuneTolerance:C0020963

Busprione:

C0006462

Morphine:C0026549

Mice: C0026809

Skin cancer:

C0007114

Drug ToleranceScore = 0.09 + 0.09 + 0.16 + 0.11 = 0.45

13

SENSERELATE EXAMPLE

0.090.16

0.11

0.090.050.04

Busprione attenuates tolerance to morphine in mice with skin cancer

0.090.09

DrugTolerance: C0013220

ImmuneTolerance:C0020963

Busprione:

C0006462

Morphine:C0026549

Mice: C0026809

Skin cancer:

C0007114

Drug ToleranceScore = 0.09 + 0.09 + 0.16 + 0.11 = 0.45

14

SENSERELATE EXAMPLE

0.090.16

0.11

0.090.050.04

Busprione attenuates tolerance to morphine in mice with skin cancer

0.090.09

DrugTolerance: C0013220

ImmuneTolerance:C0020963

Busprione:

C0006462

Morphine:C0026549

Mice: C0026809

Skin cancer:

C0007114

Drug ToleranceScore = 0.09 + 0.09 + 0.16 + 0.11 = 0.45

Immune ToleranceScore = 0.09 + 0.09 + 0.05 + 0.05 = 0.27

15

SENSERELATE EXAMPLE

0.090.16

0.11

0.090.050.04

Busprione attenuates tolerance to morphine in mice with skin cancer

0.090.09

DrugTolerance: C0013220

ImmuneTolerance:C0020963

Busprione:

C0006462

Morphine:C0026549

Mice: C0026809

Skin cancer:

C0007114

Drug ToleranceScore = 0.09 + 0.09 + 0.16 + 0.11 = 0.45

Immune ToleranceScore = 0.09 + 0.09 + 0.05 + 0.05 = 0.27

16

SENSE RELATE ASSUMPTION

An ambiguous word is often used in the sense

that is most similar to the sense of the terms that surround it

17

SENSERELATE COMPONENTS

Identifying the concepts of surrounding terms

Calculating semantic similarity

18

IDENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS

Use the SPECIALIST LEXICON to identify the terms and map the terms doing a string match to the MRCONSO table in

the UMLS

19

IDENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS

Use the SPECIALIST LEXICON to identify the terms and map the terms doing a string match to the MRCONSO table in

the UMLSBusprione attenuates tolerance to morphine

in mice with skin cancer

20

IDENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS

Use the SPECIALIST LEXICON to identify the terms and map the terms doing a string match to the MRCONSO table in

the UMLS

...skin cancerskin graftingskin disease

...

SPECIALISTLEXICON

Busprione attenuates tolerance to morphine in mice with skin cancer

21

IDENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS

Use the SPECIALIST LEXICON to identify the terms and map the terms doing a string match to the MRCONSO table in

the UMLS

...skin cancerskin graftingskin disease

...

...skin cancer C0007114skin grafting C0037297skin disease C0037274

...

SPECIALISTLEXICON

MRCONSO

Busprione attenuates tolerance to morphine in mice with skin cancer

22

SEMANTIC SIMILARITY MEASURES

Path-based measures Path Wu and Palmer Leacock and Chodorow Ngyuen and Al-Mubaid

Information content (IC)-based measures Resnik Lin Jiang and Conrath

23

PATH-BASED SIMILARITY MEASURES

Use only the path information obtained from a taxonomy

24

PATH-BASED SIMILARITY MEASURES

Use only the path information obtained from a taxonomy

Path measure sim(c1,c2) = 1 / minpath(c2,c2)

where minpath is the shortest path between the two concepts

25

PATH-BASED SIMILARITY MEASURES

Use only the path information obtained from a taxonomy

Path measure sim(c1,c2) = 1/minpath(c2,c2)

where minpath is the shortest path between the two concepts

Wu and Palmer, 1994 sim(c1,c2) = (2*depth(LCS(c2,c2))) /

(depth(c1)+depth(c2)) where LCS is the least common subsumer of the

two concepts

26

PATH-BASED SIMILARITY MEASURES

Use only the path information obtained from a taxonomy

Path measure sim(c1,c2) = 1/ minpath(c2,c2)

where minpath is the shortest path between the two concepts

Wu and Palmer, 1994 sim(c1,c2) = (2*depth(LCS(c2,c2))) / (depth(c1)+depth(c2))

where LCS is the least common subsumer of the two concepts

Leacock and Chodorow, 1998 sim(c1,c2) = -log( minpath(c1,c2) / (2D) )

where D is the total depth of the taxonomy

27

PATH-BASED SIMILARITY MEASURES Use only the path information obtained from a taxonomy

Path measure sim(c1,c2) = 1/ minpath(c2,c2)

where minpath is the shortest path between the two concepts

Leacock and Chodorow, 1998 sim(c1,c2) = -log( minpath(c1,c2) / (2D) )

where D is the total depth of the taxonomy

Wu and Palmer, 1994 sim(c1,c2) = (2*depth(LCS(c2,c2))) / (depth(c1)+depth(c2))

where LCS is the least common subsumer of the two concepts

Nyguen and Al-Mubaid, 2006 sim(c1,c2) = log ( (2 + minpath(c1,c2) - 1) *

(D - depth(LCS(c1,c2))) )

28

PATH-BASED SIMILARITY MEASURES

USE ONLY THE PATH INFORMATION OBTAINED FROM A TAXONOMY

Disease: C0012634

Drug Related

Disorder: C0277579

DrugTolerance: C0013220

Neoplasm: C1302761

Neoplastic Disease: C1882062

Malignant Neoplasm: C0006826

Skin cancer:

C0007114

29

INFORMATION CONTENT-BASED MEASURES

Incorporate the probability of the concepts

IC = -log(P(concept))

30

INFORMATION CONTENT-BASED MEASURES

Incorporate the probability of the concepts

IC = -log(P(concept))

P(concept)

Calculated by summing the probability of the concept and the probability of its descendants

Probabilities are obtained from an external corpus

31

INFORMATION CONTENT-BASED MEASURES

Incorporate the probability of the concepts

IC = -log(P(concept)

Resnik, 1995 sim(c1,c2) = IC(LCS(c1,c2))

32

INFORMATION CONTENT-BASED MEASURES

Incorporate the probability of the concepts

IC = -log(P(concept)

Resnik, 1995 sim(c1,c2) = IC(LCS(c2,c2))

Jiang and Conrath, 1997 sim(c1,c2) = 1 / (IC(c1)+IC(c2) – 2*

IC(LCS(c1,c2))

33

INFORMATION CONTENT-BASED MEASURES

Incorporate the probability of the concepts

IC = -log(P(concept)

Resnik, 1995 sim(c1,c2) = IC(LCS(c2,c2))

Jiang and Conrath, 1997 sim(c1,c2) = 1 ÷ (IC(c1)+IC(c2) – 2* IC(LCS(c1,c2))

Lin, 1998 sim(c1,c2) = (2*IC(LCS(c2,c2))) / (IC(c1)+IC(c2))

34

IC-BASED SIMILARITY MEASURES

Disease: C001263

4Drug

Related Disorder: C0277579Drug

Tolerance:

C0013220

Neoplasm:

C1302761Neoplasti

c Disease: C188206

2Malignan

t Neoplas

m: C000682

6Skin cancer: C000711

4

+

PATH INFORMATIONPROBABILITY OF

CONCEPTS

EXTERNAL CORPUS

35

EXPERIMENTAL FRAMEWORK

Use open-source UMLS::Similarity package to obtain the similarity between the terms and possible senses in the SenseRelate algorithm

Path information: parent/child relations in MSH source

Information content: calculated using the UMLSonMedline dataset created by NLM

Consists of concepts from 2009AB UMLS and the frequency they occurred in Medline using the Essie Search Engine (Ide et al 2007)

Medline: database of citations of biomedical/clinical articles

36

EVALUATION DATA: MSH WSD

MSH-WSD dataset (Jimeno-Yepes, et al 2011) 203 target words (ambiguous word) from Medline

106 terms e.g. tolerance 88 acronyms e.g. CA (calcium, california) 9 mixtures e.g. bat (brown adipose tissue)

Each target word contains ~187 instances (Medline abstracts) abstract = ~ 500 words

Each target word in the instances assigned a concept from MSH by exploiting the manually assigned MSH concepts assigned to the abstract

Average of 2.08 possible senses per target word Majority sense over all the target words is 54.5%

37

RESULTS

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.55

0.7200000000000010.690000000

0000010.700000000

000001

0.720000000000001

0.730000000000001

0.740000000000001

0.740000000000001

baseline path

lch wup

nam

res jcn

accuracy

Path-based IC-based

lin

38

COMPARISON ACROSS SUBSETS OF MSH-WSD

Terms Acronyms Mixture Overall0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0.55 0.54 0.53 0.55

0.670000000000002

0.8 0.730000000000001

0.7400000000000020.71000000000

0001

0.870000000000002 0.88

0.8

0.670000000000002

0.850000000000001

0.93

0.78

BaselineSenseRe-lateMRD2-MRD

accuracy

39

COMPARISON ACROSS SUBSETS OF MSH-WSD

Terms Acronyms Mixture Overall0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0.55 0.54 0.53 0.55

0.670000000000002

0.8 0.730000000000001

0.7400000000000020.71000000000

0001

0.870000000000002 0.88

0.8

0.670000000000002

0.850000000000001

0.93

0.78

BaselineSenseRe-lateMRD2-MRD

accuracy

40

COMPARISON ACROSS SUBSETS OF MSH-WSD

Terms Acronyms Mixture Overall0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0.55 0.54 0.53 0.55

0.670000000000002

0.8 0.730000000000001

0.7400000000000020.71000000000

0001

0.870000000000002 0.88

0.8

0.670000000000002

0.850000000000001

0.93

0.78

BaselineSenseRe-lateMRD2-MRD

accuracy

41

COMPARISON ACROSS SUBSETS OF MSH-WSD

Terms Acronyms Mixture Overall0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0.55 0.54 0.53 0.55

0.670000000000002

0.8 0.730000000000001

0.7400000000000020.71000000000

0001

0.870000000000002 0.88

0.8

0.670000000000002

0.850000000000001

0.93

0.78

BaselineSenseRe-lateMRD2-MRD

accuracy

42

COMPARISON ACROSS SUBSETS OF MSH-WSD

Terms Acronyms Mixture Overall0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0.55 0.54 0.53 0.55

0.670000000000002

0.8 0.730000000000001

0.7400000000000020.71000000000

0001

0.870000000000002 0.88

0.8

0.670000000000002

0.850000000000001

0.93

0.78

BaselineSenseRe-lateMRD2-MRD

accuracy

43

WINDOW SIZES

Use the terms surrounding the target word within a specified window: 1, 2, 5, 10, 25, 50, 60, 70

Busprione attenuates tolerance to morphine in mice with skin_cancer

WINDOW SIZE = 2

44

COMPARISON OF WINDOW SIZES FOR LIN

0 1 2 5 10 25 50 60 700

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.50.53

0.650000000000002

0.690000000000001

0.710000000000001

0.740000000000002

0.740000000000002

0.740000000000002

0.740000000000002

lin

accuracy

window size

45

SURROUNDING TERMS

Not all terms have a concept in the UMLS

therefore

Not all surrounding terms in the window mapped to CUIs

46

WINDOW SIZES VERSUS MAPPED TERMS

0 1 2 5 10 25 50 60 700

2

4

6

8

10

12

14

16

18

0 0.27 0.79

1.85

3.49

7.6

12.96

14.28

15.64

lin

number

of

mappings

window size

47

FUTURE WORK: MAPPING TERMS

Currently looking at mapping the terms to CUIs using information from the concept mapping system MetaMap

Obtain the terms from MetaMap and do a dictionary look up in MRCONSO Hypothesis – the terms obtained by MetaMap are more

accurate than using the SPECIALIST Lexicon

Obtain the CUIs from MetaMap Hypothesis – the CUIs obtained by MetaMap will be

more accurate than the dictionary look-up

48

OBJECTIVE #1

Develop and evaluate a method than can disambiguate terms in biomedical text by

exploiting similarity information extrapolated from the UMLS

UMLS::SenseRelate statistically significantly higher disambiguation accuracy than the baseline

On par with previous unsupervised methods for terms

49

OBJECTIVE #2

Evaluate the efficacy of IC-based similarity measures over path-based measures on a

secondary task

There is no statistically significant difference between the accuracies obtained by the IC-based measures

There is a statistically significant difference between the IC-based measures and the path-based measures

50

TAKE HOME MESSAGE:

An ambiguous word is often used in the sense

that is most similar to the sense of the concepts

of the terms that surround it

51

RESOURCES

Software: UMLS::SenseRelate

http://search.cpan.org/dist/UMLS-SenseRelate/ UMLS::Similarity

http://search.cpan.org/dist/UMLS-Similarity/

Data MSH-WSD

http://wsd.nlm.nih.gov/collaboration.shtml

52

RESOURCES

Software: UMLS::SenseRelate

http://search.cpan.org/dist/UMLS-SenseRelate/ UMLS::Similarity

http://search.cpan.org/dist/UMLS-Similarity/

Data MSH-WSD

http://wsd.nlm.nih.gov/collaboration.shtml

THANK YOU

53

RESOURCES

Software: UMLS::SenseRelate

http://search.cpan.org/dist/UMLS-SenseRelate/ UMLS::Similarity

http://search.cpan.org/dist/UMLS-Similarity/

Data MSH-WSD

http://wsd.nlm.nih.gov/collaboration.shtml

QUESTIONS?