Measuring Semantic Similarity between Words Using HowNet ICCSIT 2008 Liuling DAI, Yuning XIA, Bin LIU, ShiKun WU School of Computer Science, Beijing Institute of Technology


Page 1

Measuring Semantic Similarity between Words Using HowNet

ICCSIT 2008

Liuling DAI, Yuning XIA, Bin LIU, ShiKun WU

School of Computer Science, Beijing Institute of Technology

Page 2

HowNet

• W_C=工夫
• DEF={Ability|能力:host={human|人}}
• DEF={Strength|力量:host={group|群體}{human|人}}
• DEF={time|時間}

• Word: 工夫
• Concept: {Ability|能力:host={human|人}}
• Sememe: Ability|能力
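The word, concept and sememe levels shown above can be pictured as a small data structure. The following is a minimal sketch, not HowNet's own API: the `LEXICON` dictionary and the `primary_sememe` helper are hypothetical names introduced here, and the regular expression assumes the first sememe of a DEF is the text between the opening brace and the first `:` or `}`.

```python
import re

# Hypothetical in-memory lexicon: one word maps to several concepts (DEFs).
# The entries are the three DEFs of 工夫 listed on the slide.
LEXICON = {
    "工夫": [
        "{Ability|能力:host={human|人}}",
        "{Strength|力量:host={group|群體}{human|人}}",
        "{time|時間}",
    ],
}


def primary_sememe(definition: str) -> str:
    """Return the first sememe of a DEF string (assumed to be the primary one)."""
    match = re.match(r"\{\s*([^:{}]+?)\s*[:}]", definition)
    if match is None:
        raise ValueError(f"not a DEF string: {definition!r}")
    return match.group(1)


if __name__ == "__main__":
    for concept in LEXICON["工夫"]:
        print(primary_sememe(concept), "<-", concept)
```

Running the script lists one primary sememe per concept, which is the unit the sememe-level similarity operates on.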

Page 3

Algorithms

• Similarity between sememes
• Similarity between concepts
• Similarity between words
• Amendment with thesaurus

Page 4

Similarity between sememes

• Strategy 1

• Strategy 2

• d: Distance between S1 and S2
• h: Depth of the first common parent node of the two sememes
• α, β: Parameters to adjust d and h
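The formulas for the two strategies did not survive in this transcript, so the sketch below only shows how the two inputs d and h might be obtained from a sememe hierarchy. The `PARENT` tree is a made-up toy hierarchy, not HowNet's real taxonomy, and counting the root as depth 0 is an assumption.

```python
# Toy child -> parent map (hypothetical; HowNet's actual tree is much larger).
PARENT = {
    "human": "animate",
    "animal": "animate",
    "animate": "thing",
    "time": "thing",
}


def path_to_root(node, parent):
    """List the nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def distance_and_depth(s1, s2, parent):
    """Return (d, h): path length between s1 and s2, and the depth of
    their first common parent node (root depth assumed to be 0)."""
    path1 = path_to_root(s1, parent)
    steps_from_s2 = {node: i for i, node in enumerate(path_to_root(s2, parent))}
    for steps_from_s1, node in enumerate(path1):
        if node in steps_from_s2:
            d = steps_from_s1 + steps_from_s2[node]
            h = len(path_to_root(node, parent)) - 1
            return d, h
    raise ValueError("the two sememes share no common parent")


if __name__ == "__main__":
    print(distance_and_depth("human", "animal", PARENT))  # (2, 1)
    print(distance_and_depth("human", "time", PARENT))    # (3, 0)
```

A smaller d and a larger h both indicate closer sememes; how α and β turn them into a score is given by the strategy formulas on the original slide.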

Page 5

Similarity between concepts

• Word “Doctor”
• DEF={human|人:{own|有:possession={Status|身分:domain={education|教育},modifier={HighRank|高等:degree={most|最}}},possessor={~}}}
• Human → Primary sememe
• Status, own … → Modifying sememe
• Possession, domain … → Descriptors

Page 6

Similarity between concepts

• P, Q: Two concepts. Assume P has fewer modifying sememes than Q.
• P_i, Q_j: The i-th and j-th modifying sememes of P and Q.
• S, T: Descriptor sets of P and Q.
• α, β, γ: Weights of the three parts.
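The concept-level equation is missing from this transcript, so the following is only one plausible reading of "three weighted parts", not the authors' formula. It assumes a weighted sum of (1) primary-sememe similarity, (2) the average best match of each P_i against the Q_j, and (3) the overlap of the descriptor sets S and T. The dictionary layout and the `concept_similarity` name are hypothetical; the default weights are the values reported in the evaluation slide.

```python
def concept_similarity(p, q, sememe_sim, alpha=0.54, beta=0.36, gamma=0.1):
    """Weighted sum of three parts (assumed form, see the lead-in)."""
    # The slide assumes P is the concept with fewer modifying sememes.
    if len(p["modifying"]) > len(q["modifying"]):
        p, q = q, p

    # Part 1: similarity of the primary sememes.
    primary = sememe_sim(p["primary"], q["primary"])

    # Part 2: each P_i is matched with its most similar Q_j, then averaged.
    if p["modifying"]:
        modifying = sum(
            max(sememe_sim(pi, qj) for qj in q["modifying"])
            for pi in p["modifying"]
        ) / len(p["modifying"])
    else:
        modifying = 0.0 if q["modifying"] else 1.0

    # Part 3: Jaccard overlap of the descriptor sets S and T (an assumption).
    s, t = set(p["descriptors"]), set(q["descriptors"])
    descriptors = len(s & t) / len(s | t) if (s | t) else 1.0

    return alpha * primary + beta * modifying + gamma * descriptors


if __name__ == "__main__":
    exact = lambda a, b: 1.0 if a == b else 0.0  # stand-in sememe similarity
    doctor = {
        "primary": "human",
        "modifying": ["own", "Status"],
        "descriptors": ["possession", "domain"],
    }
    print(concept_similarity(doctor, doctor, exact))
```

In practice `sememe_sim` would be one of the two sememe strategies rather than the exact-match stand-in used here.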

Page 7

Similarity between words

• A word may have several concepts.
• Choose the most similar pair of concepts.
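Choosing the most similar pair amounts to taking a maximum over all concept pairs of the two words. A minimal sketch follows; `word_similarity` is a hypothetical name, `concept_sim` is any concept-level similarity function, and returning 0.0 for a word with no concepts is an assumption.

```python
from itertools import product


def word_similarity(concepts1, concepts2, concept_sim):
    """Similarity of two words = similarity of their most similar concept pair."""
    if not concepts1 or not concepts2:
        return 0.0  # assumption: a word without concepts matches nothing
    return max(concept_sim(c1, c2) for c1, c2 in product(concepts1, concepts2))


if __name__ == "__main__":
    # Toy concept similarity backed by a lookup table (illustrative values).
    table = {("a", "c"): 0.2, ("b", "c"): 0.7}
    toy_sim = lambda x, y: table.get((x, y), 0.0)
    print(word_similarity(["a", "b"], ["c"], toy_sim))  # 0.7
```

Taking the maximum means that one closely related sense is enough to make two polysemous words similar.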

Page 8

Amendment with thesaurus

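• Some words are missing from HowNet, and some DEFs are too coarse.

• Use the Chinese thesaurus Tongyici Cilin (同義詞詞林). Note: this should be the extended edition of Tongyici Cilin from the HIT IR-Lab (the Information Retrieval Laboratory of Harbin Institute of Technology).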

• d : Distance between W1 and W2

Page 9

Similarity between words

• Sim1: Eq. 6 (similarity in HowNet)
• Sim2: Eq. 7 (similarity in Tongyici Cilin)
• α, β, γ, η: Parameters to scale the weights of the two parts.

Page 10

Evaluation

• Dataset
  – RG-65
    • Rubenstein and Goodenough established synonymy judgments for 65 pairs of nouns. They invited 51 human judges to assign every pair a score between 0.0 and 4.0 to indicate semantic similarity.
  – MC-28
    • Miller and Charles followed this idea and restricted themselves to 30 pairs of nouns selected from Rubenstein and Goodenough’s list, divided equally amongst words with high, intermediate and low similarity.
• To measure similarity between Chinese words, RG-65 was translated into Chinese manually.

Page 11

Evaluation

• Parameters
  – Similarity between sememes
    • Strategy 1: α = 1.6, β = 0.16
    • Strategy 2: α = 0.2, β = 0.16
  – Similarity between concepts
    • α = 0.54, β = 0.36, γ = 0.1
  – Similarity between words
    • On Chinese dataset: α = 0.95, β = 0.05, γ = 0.95, η = 0.05
    • On English dataset: α = 0.95, β = 0.05, γ = 0.45, η = 0.55

Page 12

Result

– HAPI: HowNet_Get_Concept_Similarity in the HowNet API

Page 13

Result

• In addition, they compare their results with eight groups of measures that rely on WordNet.
• Table 1. Correlation coefficients of the algorithms

Approach        RG-28    MC-28    RG-65
Hirst-St.Onge   0.671    0.682    0.732
Jiang           0.67     0.682    0.732
Leacock         0.801    0.82     0.852
Lin             0.773    0.814    0.834
Resnik          0.706    0.763    0.8
Yang            0.889    0.921    0.897
Li              0.8914   0.882    N/A
Alvarez         0.9      0.913    N/A
S1-English      0.9238   0.9074   0.8764
S2-English      0.9286   0.9056   0.8744
HAPI-English    0.5371   0.5113   0.6089
S1-Chinese      0.8617   0.8401   0.8958
S2-Chinese      0.8679   0.846    0.895
HAPI-Chinese    0.5328   0.5001   0.6752
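Each figure in Table 1 is a correlation between human ratings and the scores of one measure on a dataset. The slides do not say which coefficient was used; the sketch below assumes Pearson's r, the usual choice for RG-65 style benchmarks, and the sample scores are invented for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented example: human ratings on the 0.0-4.0 scale vs. model scores.
    human = [3.9, 3.1, 1.2, 0.4]
    model = [0.95, 0.70, 0.35, 0.10]
    print(round(pearson(human, model), 4))
```

A value close to 1 means the measure ranks and spaces the word pairs much as the human judges did.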

Page 14

RG-65

Page 15

MC-30 & RG-30