Measuring Semantic Similarity between Words Using HowNet ICCSIT 2008 Liuling DAI, Yuning XIA, Bin...

Post on 18-Jan-2018


Measuring Semantic Similarity between Words Using HowNet

ICCSIT 2008
Liuling DAI, Yuning XIA, Bin LIU, ShiKun WU

School of Computer Science, Beijing Institute of Technology

HowNet

• W_C=工夫
• DEF={Ability|能力:host={human|人}}
• DEF={Strength|力量:host={group|群體}{human|人}}
• DEF={time|時間}

• Word: 工夫
• Concept: {Ability|能力:host={human|人}}
• Sememe: Ability|能力

Algorithms

• Similarity between sememes
• Similarity between concepts
• Similarity between words

• Amendment with thesaurus

Similarity between sememes

• Strategy 1

• Strategy 2

• d: Distance between S1 and S2
• h: Depth of the first common parent node of the two sememes
• α, β: Parameters to adjust d, h
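The two quantities d and h can be read off a sememe hierarchy. The following is a minimal sketch, assuming the hierarchy is a tree stored as a child-to-parent map and that the root has depth 0. The tree entries and the names `PARENT`, `path_to_root` and `distance_and_depth` are hypothetical and are not taken from HowNet or from the slides. It only computes d and h; it does not implement Strategy 1 or Strategy 2.

```python
# Toy hierarchy (child -> parent). Illustrative entries only, not HowNet data.
PARENT = {
    "entity": None,
    "thing": "entity",
    "animate": "thing",
    "human": "animate",
    "animal": "animate",
    "attribute": "entity",
    "ability": "attribute",
}


def path_to_root(sememe):
    """Return the list of nodes from a sememe up to the root, inclusive."""
    path = []
    while sememe is not None:
        path.append(sememe)
        sememe = PARENT[sememe]
    return path


def distance_and_depth(s1, s2):
    """Return (d, h) for two sememes.

    d is the number of edges on the path between s1 and s2.
    h is the depth of their first common parent node (root depth = 0, an
    assumed convention).
    """
    path1, path2 = path_to_root(s1), path_to_root(s2)
    steps_in_path2 = {node: steps for steps, node in enumerate(path2)}
    for steps1, node in enumerate(path1):
        if node in steps_in_path2:
            d = steps1 + steps_in_path2[node]
            h = len(path_to_root(node)) - 1
            return d, h
    raise ValueError("sememes do not share a root")


print(distance_and_depth("human", "animal"))   # (2, 2)
print(distance_and_depth("human", "ability"))  # (5, 0)
```

In this toy tree, "human" and "animal" are close (small d) and share a deep common parent (larger h), while "human" and "ability" only meet at the root.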

Similarity between concepts

• Word “Doctor”
• DEF={human|人:{own|有:possession={Status|身分:domain={education|教育},modifier={HighRank|高等:degree={most|最}}},possessor={~}}}

• Human → Primary sememe
• Status, own … → Modifying sememes
• Possession, domain … → Descriptors
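The three-way split of a DEF can be illustrated with a few lines of Python. This is a rough sketch, assuming that every sememe appears as `{English|Chinese` and every descriptor as `name=`. The function name `split_def` is hypothetical, and the regular expressions are a simplification rather than a full HowNet DEF parser.

```python
import re

# DEF of the word "Doctor", copied from the slide (without the "DEF=" prefix).
DOCTOR_DEF = (
    "{human|人:{own|有:possession={Status|身分:domain={education|教育},"
    "modifier={HighRank|高等:degree={most|最}}},possessor={~}}}"
)


def split_def(definition):
    """Split a DEF string into (primary sememe, modifying sememes, descriptors)."""
    # A sememe is the English label that follows "{" and precedes "|".
    sememes = re.findall(r"\{([A-Za-z]+)\|", definition)
    # A descriptor is the label that precedes "=".
    descriptors = re.findall(r"([A-Za-z]+)=", definition)
    if not sememes:
        raise ValueError("no sememe found in DEF")
    # The first sememe is treated as the primary one; the rest modify it.
    return sememes[0], sememes[1:], descriptors


primary, modifying, descriptors = split_def(DOCTOR_DEF)
print(primary)      # human
print(modifying)    # ['own', 'Status', 'education', 'HighRank', 'most']
print(descriptors)  # ['possession', 'domain', 'modifier', 'degree', 'possessor']
```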

Similarity between concepts

• P, Q: Two concepts. Assume P has fewer modifying sememes than Q.
• P_i, Q_j: The i-th and j-th modifying sememes of P and Q.
• S, T: Descriptor sets of P and Q.
• α, β, γ: Weights of the 3 parts.

Similarity between words

• One word may have many concepts.
• Choose the most similar pair of concepts.
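Choosing the most similar pair amounts to taking a maximum over all concept pairs of the two words. A minimal sketch follows, assuming some concept-level similarity function is already available. Here `toy_concept_sim` and its scores are invented placeholders, not values from the paper.

```python
from itertools import product


def word_similarity(concepts1, concepts2, concept_sim):
    """Return the highest concept similarity over all pairs of concepts.

    concepts1, concepts2: the concepts (DEFs) of the two words.
    concept_sim: any function that scores a pair of concepts.
    """
    return max(
        (concept_sim(c1, c2) for c1, c2 in product(concepts1, concepts2)),
        default=0.0,  # a word without concepts gets similarity 0 (assumption)
    )


# Invented scores, only to exercise the function.
TOY_SCORES = {("ability", "skill"): 0.8, ("time", "skill"): 0.1}


def toy_concept_sim(c1, c2):
    return TOY_SCORES.get((c1, c2), 0.0)


print(word_similarity(["ability", "strength", "time"], ["skill"], toy_concept_sim))  # 0.8
```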

Amendment with thesaurus

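• Some words are missing from HowNet, and some DEFs in HowNet are too coarse.

• Uses the Chinese thesaurus Tongyici Cilin (同義詞詞林). Note: this should be the Tongyici Cilin Extended Version (同義詞詞林擴展版) from the Information Retrieval Lab (IR-Lab) of Harbin Institute of Technology.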

• d : Distance between W1 and W2

Similarity between words

• Sim1: Eq. 6 (Similarity in HowNet)
• Sim2: Eq. 7 (Similarity in Tongyici Cilin)
• α, β, γ, η: Parameters to scale the weights of the two parts.

Evaluation

• Dataset
  – RG-65
    • Rubenstein and Goodenough established synonymy judgments for 65 pairs of nouns. They invited 51 human judges to assign every pair a score between 0.0 and 4.0 to indicate semantic similarity.
  – MC-28
    • Miller and Charles followed this idea and restricted themselves to 30 pairs of nouns selected from Rubenstein and Goodenough’s list, divided equally among words with high, intermediate and low similarity.
• To measure similarity between Chinese words, RG-65 was translated into Chinese manually.

Evaluation

• Parameters
  – Similarity between sememes
    • Strategy 1: α = 1.6, β = 0.16
    • Strategy 2: α = 0.2, β = 0.16
  – Similarity between concepts
    • α = 0.54, β = 0.36, γ = 0.1
  – Similarity between words
    • On Chinese dataset: α = 0.95, β = 0.05, γ = 0.95, η = 0.05
    • On English dataset: α = 0.95, β = 0.05, γ = 0.45, η = 0.55

Result

– HAPI: HowNet_Get_Concept_Similarity in HowNet API

Result

• In addition, they compare the results with eight groups of measures that rely on WordNet.
• Table 1. Correlation coefficients of the algorithms

Approach         RG-28    MC-28    RG-65
Hirst-St.Onge    0.671    0.682    0.732
Jiang            0.67     0.682    0.732
Leacock          0.801    0.82     0.852
Lin              0.773    0.814    0.834
Resnik           0.706    0.763    0.8
Yang             0.889    0.921    0.897
Li               0.8914   0.882    N/A
Alvarez          0.9      0.913    N/A
S1-English       0.9238   0.9074   0.8764
S2-English       0.9286   0.9056   0.8744
HAPI-English     0.5371   0.5113   0.6089
S1-Chinese       0.8617   0.8401   0.8958
S2-Chinese       0.8679   0.846    0.895
HAPI-Chinese     0.5328   0.5001   0.6752
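A correlation coefficient of this kind compares the scores produced by an algorithm with the human judgments on the same word pairs. The sketch below assumes the Pearson correlation coefficient is meant, which the slides do not state explicitly. The two score lists are invented placeholders, not RG-65 or MC-28 data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Invented example: human ratings on the 0.0-4.0 scale and algorithm scores in [0, 1].
human_scores = [3.9, 3.1, 1.2, 0.4]
algorithm_scores = [0.95, 0.70, 0.35, 0.10]
print(round(pearson(human_scores, algorithm_scores), 4))
```

Because the coefficient is invariant to linear rescaling, the human ratings (0.0 to 4.0) and the algorithm scores (0 to 1) do not need to be brought onto a common scale first.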

RG-65

MC-30 & RG-30