Design and Evaluation of Semantic Similarity Measures for Concepts Stemming from the Same or Different Ontologies

Euripides G.M. Petrakis, Giannis Varelas, Angelos Hliaoutakis, Paraskevi Raftopoulou

June 19-21, 2006, WMS'06, Chania, Crete



Semantic Similarity

Relates to computing the conceptual similarity between terms that are not necessarily lexically similar, e.g., “car”-“automobile”-“vehicle”, “drug”-“medicine”

A tool for making knowledge commonly understandable in applications such as information retrieval (IR) and information communication in general


Methodology

Terms from different communicating sources are represented by ontologies

Map two terms to an ontology and compute their relationship in that ontology

Terms from different ontologies: Discover linguistic relationships or affinities between terms in different ontologies


Contributions

We investigate several Semantic Similarity Methods and we evaluate their performance http://www.intelligence.tuc.gr/similarity

We propose a novel semantic similarity measure for comparing concepts from different ontologies


Ontologies

Tools of information representation on a subject

Hierarchical categorization of terms from general to most specific terms, e.g., object, artifact, construction, stadium

Domain Ontologies: representing knowledge of a domain, e.g., the MeSH medical ontology

General Ontologies: representing common sense knowledge about the world, e.g., WordNet


WordNet

A vocabulary and a thesaurus offering a hierarchical categorization of natural language terms

More than 100,000 terms

Nouns, verbs, adjectives and adverbs are grouped into synonym sets (synsets)

Synsets represent terms or concepts with similar meaning, e.g., stadium, bowl, arena, sports stadium – (a large structure for open-air sports or entertainments)


WordNet Hierarchies

The synsets are also organized into senses

Senses: different meanings of the same term

The synsets are related to other synsets higher or lower in the hierarchy by different types of relationships, e.g.,
Hyponym/Hypernym (Is-A relationships)
Meronym/Holonym (Part-Of relationships)

Nine noun and several verb Is-A hierarchies


A Fragment of the WordNet Is-A Hierarchy


MeSH

MeSH: ontology for medical and biological terms by the National Library of Medicine (N.L.M.)

Organized in IS-A hierarchies

More than 15 taxonomies, more than 22,000 terms

No part-of relationships

The terms are organized into synsets called “entry terms”


A Fragment of the MeSH Is-A Hierarchy


Semantic Similarity Methods

Map terms to an ontology and compute their relationship in that ontology

Four main categories of methods:

Edge counting: path length between terms

Information content: as a function of their probability of occurrence in a corpus

Feature based: similarity between their properties (e.g., definitions) or based on their relationships to other similar terms

Hybrid: combine the above ideas
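
The edge counting idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PARENT` table is a hypothetical Is-A fragment made up for the example (it is not the WordNet hierarchy itself), and `path_length` is an illustrative helper name, not part of any of the evaluated methods.

```python
# Hypothetical Is-A fragment (child -> parent), invented for illustration.
PARENT = {
    "artifact": "object",
    "construction": "artifact",
    "instrumentality": "artifact",
    "stadium": "construction",
    "conveyance": "instrumentality",
    "ceramic": "instrumentality",
}


def ancestors(term):
    """Return [term, parent, grandparent, ...] up to the root."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_length(a, b):
    """Count the Is-A edges on the shortest path between a and b."""
    up_a = ancestors(a)
    steps_from_b = {node: i for i, node in enumerate(ancestors(b))}
    # The first ancestor of a that is also an ancestor of b is their
    # most specific common subsumer in a tree-shaped hierarchy.
    for steps_from_a, node in enumerate(up_a):
        if node in steps_from_b:
            return steps_from_a + steps_from_b[node]
    raise ValueError("terms do not share a root")


print(path_length("conveyance", "ceramic"))   # siblings in this toy fragment
print(path_length("stadium", "conveyance"))   # related only through "artifact"
```

A shorter path means more similar terms; edge counting methods differ mainly in how they turn this length (and possibly node depth) into a similarity score.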


Example

Edge counting: the distance between “conveyance” and “ceramic” is 2

An information content method would associate the two terms with their common subsumer and with their probabilities of occurrence in a corpus
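
A minimal sketch of the information content idea follows, assuming the usual definition IC(c) = -log p(c), where p(c) counts occurrences of c and of every term it subsumes. The hierarchy and the corpus frequencies in `FREQ` are hypothetical values chosen for the example. `resnik` returns the IC of the common subsumer and `lin` normalizes it by the IC of the two terms, following the general form of those two measures; neither function is taken from the authors' implementation.

```python
import math

# Hypothetical Is-A fragment (child -> parent), invented for illustration.
PARENT = {
    "artifact": "object",
    "construction": "artifact",
    "instrumentality": "artifact",
    "stadium": "construction",
    "conveyance": "instrumentality",
    "ceramic": "instrumentality",
}

# Hypothetical corpus frequencies of each term on its own.
FREQ = {
    "object": 10, "artifact": 20, "construction": 10, "instrumentality": 20,
    "stadium": 5, "conveyance": 20, "ceramic": 15,
}


def ancestors(term):
    """Return [term, parent, grandparent, ...] up to the root."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def concept_counts():
    """A concept's count includes every occurrence of the terms it subsumes."""
    counts = dict.fromkeys(FREQ, 0)
    for term, n in FREQ.items():
        for node in ancestors(term):
            counts[node] += n
    return counts


COUNTS = concept_counts()
TOTAL = COUNTS["object"]  # the root subsumes everything


def ic(term):
    """Information content: rarer (more specific) concepts score higher."""
    return -math.log(COUNTS[term] / TOTAL)


def common_subsumer(a, b):
    """Most specific concept that subsumes both a and b."""
    up_b = set(ancestors(b))
    return next(node for node in ancestors(a) if node in up_b)


def resnik(a, b):
    return ic(common_subsumer(a, b))


def lin(a, b):
    # Undefined when both terms are the root (IC of the root is 0).
    return 2 * ic(common_subsumer(a, b)) / (ic(a) + ic(b))


print(resnik("conveyance", "ceramic"), lin("conveyance", "ceramic"))
```

With these made-up counts, the pair that shares the more specific subsumer ("instrumentality") scores higher than a pair that only shares "artifact".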


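X-Similarity

Relies on matching between synsets and term description sets

\[
S_{\text{descriptions}}(a,b) = \frac{|A \cap B|}{|A \cup B|}
\]

A, B: synsets or term description sets (the same ratio computed over synsets gives S_synsets)

Do the same with all IS-A, Part-Of relationships and take their maximum

\[
S_{\text{neighborhoods}}(a,b) = \max_i S_i(a,b)
\]

\[
Sim(a,b) =
\begin{cases}
1, & \text{if } S_{\text{synsets}}(a,b) > 0; \\
\max\{S_{\text{neighborhoods}}(a,b),\ S_{\text{descriptions}}(a,b)\}, & \text{if } S_{\text{synsets}}(a,b) = 0.
\end{cases}
\]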
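
The matching scheme above can be sketched as follows. This is a rough illustration, not the authors' implementation: it assumes each term is already represented as word sets (how definitions are tokenized and filtered is not specified here), the dictionary keys `synset`, `description` and `neighborhoods` are hypothetical names, and the three example terms are made-up data.

```python
def set_overlap(a, b):
    """|A ∩ B| / |A ∪ B|, taken as 0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def x_similarity(term_a, term_b):
    """Sketch of the piecewise X-Similarity rule over pre-built sets."""
    # Any shared synonym means the two terms are treated as the same concept.
    if set_overlap(term_a["synset"], term_b["synset"]) > 0:
        return 1.0
    s_descriptions = set_overlap(term_a["description"], term_b["description"])
    # One overlap score per relationship type (e.g. Is-A, Part-Of);
    # a type missing from one ontology simply contributes 0.
    relation_types = set(term_a["neighborhoods"]) | set(term_b["neighborhoods"])
    s_neighborhoods = max(
        (set_overlap(term_a["neighborhoods"].get(r, ()),
                     term_b["neighborhoods"].get(r, ()))
         for r in relation_types),
        default=0.0,
    )
    return max(s_neighborhoods, s_descriptions)


# Hypothetical term records, invented for illustration.
car = {
    "synset": {"car", "automobile"},
    "description": {"motor", "vehicle", "four", "wheels"},
    "neighborhoods": {"is_a": {"motor vehicle", "vehicle"},
                      "part_of": {"engine"}},
}
auto = {
    "synset": {"automobile", "auto"},
    "description": {"passenger", "vehicle"},
    "neighborhoods": {"is_a": {"vehicle"}},
}
truck = {
    "synset": {"truck"},
    "description": {"motor", "vehicle", "carrying", "goods"},
    "neighborhoods": {"is_a": {"motor vehicle", "vehicle", "conveyance"}},
}

print(x_similarity(car, auto))   # shared synonym
print(x_similarity(car, truck))  # falls back to neighborhoods/descriptions
```

Taking the maximum over relationship types lets the measure work when one ontology lacks a relationship type that the other has, as with Part-Of in MeSH.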


WordNet term: “Hypothyroidism”
MeSH term: “Hyperthyroidism”

<term> hypothyroidism
<definition> An underactive thyroid gland; a glandular disorder resulting from insufficient production of thyroid hormones. </definition>
<synset> Hypothyroidism </synset>
<hypernyms> glandular disease, disorder, condition, state </hypernyms>
<hyponyms> myxedema, cretinism </hyponyms>
</term>

<term> hyperthyroidism
<definition> Hypersecretion of Thyroid Hormones from Thyroid Gland. Elevated levels of thyroid hormones increase Basal Metabolic Rate. </definition>
<synset> Hyperthyroidism </synset>
<hypernyms> disease, thyroid, Endocrine System Diseases, diseases </hypernyms>
<hyponyms> thyrotoxicosis, thyrotoxicoses </hyponyms>
</term>

Example: S(Hypothyroidism, Hyperthyroidism) = 0.387


Evaluation

The most popular methods are evaluated

All methods are applied to a set of 38 term pairs

Their similarity values are correlated with scores obtained by humans

The higher a method's correlation, the better the method
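
The evaluation step can be sketched as below. The slides do not say which correlation coefficient was used, so Pearson's coefficient is an assumption here, and all scores are hypothetical numbers invented for the example rather than data from the experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical similarity scores for four term pairs.
human = [0.9, 0.7, 0.4, 0.1]        # averaged human judgments
method_a = [0.8, 0.75, 0.3, 0.2]    # tracks the human ranking closely
method_b = [0.5, 0.1, 0.6, 0.4]     # largely unrelated to it

for name, scores in (("method_a", method_a), ("method_b", method_b)):
    print(name, round(pearson(human, scores), 2))
```

Under this criterion `method_a` would be ranked above `method_b`, since its scores rise and fall with the human scores.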


Evaluation on WordNet

Method           Type            Correlation
Rada 1989        Edge Counting   0.59
Wu 1994          Edge Counting   0.74
Li 2003          Edge Counting   0.82
Leacock 1998     Edge Counting   0.82
Richardson 1994  Edge Counting   0.63
Resnik 1999      Info. Content   0.79
Lin 1993         Info. Content   0.82
Lord 2003        Info. Content   0.79
Jiang 1998       Info. Content   0.83
Tversky 1977     Feature Based   0.73
X-Similarity     Feature Based   0.74
Rodriguez 2003   Hybrid          0.71


Evaluation on MeSH

Method           Type            Correlation
Rada 1989        Edge Counting   0.50
Wu 1994          Edge Counting   0.67
Li 2003          Edge Counting   0.70
Leacock 1998     Edge Counting   0.74
Richardson 1994  Edge Counting   0.64
Resnik 1999      Info. Content   0.71
Lin 1993         Info. Content   0.72
Lord 2003        Info. Content   0.70
Jiang 1998       Info. Content   0.71
Tversky 1977     Feature Based   0.67
X-Similarity     Feature Based   0.71
Rodriguez 2003   Hybrid          0.71


Cross Ontology Measures

We used 40 MeSH term pairs

One of the terms in each pair is also a WordNet term

We measured correlation with scores obtained by experts

Method         Type            Correlation
X-Similarity   Feature-Based   0.70
Rodriguez      Hybrid          0.55


Comments

Edge counting/Info. Content methods work by exploiting structure information

Good methods take the position of the terms into account

Higher similarity for terms which are close together but lower in the hierarchy, e.g., [Li et al. 2003]

X-Similarity performs at least as well as other Feature-Based methods

It outperforms other Cross-Ontology methods
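
To make the remark about position concrete, here is a small sketch of a depth-aware edge counting score in the form usually attributed to Li et al. (2003): similarity decays exponentially with path length l and grows with the depth h of the common subsumer through a hyperbolic tangent. The function name and the default values alpha = 0.2 and beta = 0.6 are assumptions made for this example; they are not taken from these slides.

```python
import math


def li_similarity(path_length, subsumer_depth, alpha=0.2, beta=0.6):
    """exp(-alpha * l) * tanh(beta * h): shorter paths and deeper subsumers score higher."""
    return math.exp(-alpha * path_length) * math.tanh(beta * subsumer_depth)


# Same path length (2 edges), but the common subsumer sits at different depths.
print(li_similarity(2, 1))  # pair near the top of the hierarchy
print(li_similarity(2, 5))  # pair deep in the hierarchy

# Same subsumer depth, different path lengths.
print(li_similarity(2, 3))
print(li_similarity(4, 3))
```

Two sibling terms deep in the hierarchy are more specific, and so score as more similar, than two siblings directly under the root, even though the path length is the same.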


Conclusions

Semantic similarity methods approximate the human notion of similarity, reaching correlation up to 0.83

Cross ontology similarity is a difficult problem that requires further investigation

Work towards integrating semantic similarity within the IntelliSearch information retrieval system for Web documents: http://www.intelligence.tuc.gr/intellisearch


Try our system on the Web

http://www.intelligence.tuc.gr/similarity

Implementation: Giannis Varelas, Spyros Argyropoulos
