STL: A similarity measure based on semantic and linguistic information

Copyright 2011 Digital Enterprise Research Institute. All rights reserved. Digital Enterprise Research Institute www.deri.ie Enabling Networked Knowledge STL: A Similarity Measure Based on Semantic, Terminological and Linguistic Information Nitish Aggarwal, joint work with Tobias Wunner, Mihael Arcan DERI, NUI Galway [email protected] Friday, 19th Aug, 2011 DERI, Friday Meeting

Transcript of STL: A similarity measure based on semantic and linguistic information

Page 1: STL: A similarity measure based on semantic and linguistic information

Copyright 2011 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute www.deri.ie

Enabling Networked Knowledge

STL: A Similarity Measure Based on Semantic, Terminological and Linguistic Information

Nitish Aggarwal

Joint work with Tobias Wunner, Mihael Arcan

DERI, NUI Galway

[email protected]

Friday, 19th Aug, 2011

DERI, Friday Meeting

Page 2: STL: A similarity measure based on semantic and linguistic information

Overview

Motivation & Applications

Why STL?

Semantic

Terminology

Linguistic

Evaluation

Conclusion and future work

Page 3: STL: A similarity measure based on semantic and linguistic information

Motivation & Applications

Semantic Annotation

Similarity between corpus data and ontology concepts

SAP AG held €1615 million in short-term liquid assets (2009)

“dbpedia:SAP_AG” “xEBR:LiquidAssets” at “dbpedia:year:2009”

Page 4: STL: A similarity measure based on semantic and linguistic information

Motivation & Applications

Semantic Search

Similarity between query and index object

SAP liquid asset in 2010

Net cash of SAP in 2010

Current asset of SAP last year

SAP total amount received in 2010

“dbpedia:SAP_AG” “xEBR:liquid asset” at “dbpedia:year:2010”

Page 5: STL: A similarity measure based on semantic and linguistic information

Motivation & Applications

Ontology Matching & Alignment

Similarity between ontology concepts

[Figure: two ontology fragments placed side by side, with "Similarity = ?" marked between concepts of the two.]

xEBR fragment (xebr:KeyBalanceSheet): Assets, xebr:SubscribedCapitalUnpaid, xebr:FixedAssets, xebr:CurrentAssets, xebr:TangibleFixedAssets, xebr:IntangibleFixedAssets, xebr:AmountReceivable, xebr:LiquidAssets

IFRS fragment (ifrs:StatementOfFinancialPosition): ifrs:Assets, ifrs:CurrentAssets, ifrs:BiologicalAssets, ifrs:NonCurrentAssets, ifrs:Inventories, ifrs:TradeAndOtherCurrentReceivables, ifrs:PropertyPlantAndEquipment, ifrs:CashAndCashEquivalents

Page 6: STL: A similarity measure based on semantic and linguistic information

Classical Approaches

String Similarity

Levenshtein distance, Dice coefficient

Corpus-based

LSA, ESA, Google distance, Vector-Space Model

Ontology-based

Path distance, Information content

Syntax Similarity

Word-order, Part of Speech
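
As an illustration of the string-similarity family listed above, here is a minimal sketch of Levenshtein distance and the Dice coefficient. Computing Dice over character bigrams is an assumption; the slide does not say which unit was used.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def dice_coefficient(a: str, b: str) -> float:
    """Dice coefficient over sets of character bigrams (assumed unit)."""
    bigrams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    bigrams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not bigrams_a and not bigrams_b:
        return 1.0 if a == b else 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))
```

Both measures look only at surface characters, so labels such as "Staff Costs" and "Wages And Salaries" score low even though they are closely related.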

Page 7: STL: A similarity measure based on semantic and linguistic information

Why STL?

Semantic

Semantic structure and relations

Terminology

Complex terms expressing the same concept

Linguistic

Phrase and dependency structure

Page 8: STL: A similarity measure based on semantic and linguistic information

STL

Definition

Linear combination of semantic, terminological and linguistic similarity scores, with weights obtained by linear regression

Formula used

STL = w1*S + w2*T + w3*L + Constant

– w1, w2, w3 represent the contribution of each component
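
The slide says the weights come from a linear regression but gives no procedure. The sketch below shows one way such weights could be fitted: ordinary least squares solved through the normal equations in plain Python. The function name `fit_linear_weights` and the input layout are assumptions for illustration, not the authors' implementation.

```python
def fit_linear_weights(rows, targets):
    """Least-squares fit of y ~ w1*S + w2*T + w3*L + constant.

    rows:    list of (S, T, L) component scores, one tuple per term pair
    targets: gold similarity score for each term pair
    Returns [w1, w2, w3, constant].
    """
    design = [list(r) + [1.0] for r in rows]  # trailing 1.0 models the constant
    n = len(design[0])
    # Augmented normal-equation matrix [X^T X | X^T y].
    aug = [[sum(x[i] * x[j] for x in design) for j in range(n)]
           + [sum(x[i] * y for x, y in zip(design, targets))]
           for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    # A singular system (too few or degenerate rows) raises ZeroDivisionError.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]
```

In practice the rows would be the S, T and L scores of the benchmark term pairs and the targets the human similarity judgements.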

Page 9: STL: A similarity measure based on semantic and linguistic information

Semantic

Wu-Palmer

2 * depth(MSCA) / (depth(c1) + depth(c2))

where MSCA is the most specific common abstraction of c1 and c2

Resnik’s Information Content

IC(c) = -log p(c)

Intrinsic Information Content (Pirro09)

Avoids the need to analyse large corpora
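
A minimal sketch of the Wu-Palmer measure and Resnik's information content follows. The toy taxonomy borrows xEBR-style labels, but its shape, the `PARENT` table and the convention that the root has depth 1 are all assumptions made for illustration.

```python
import math

# Toy taxonomy as child -> parent (hypothetical tree shape).
PARENT = {
    "Assets": None,
    "Fixed Assets": "Assets",
    "Current Assets": "Assets",
    "Tangible Fixed Assets": "Fixed Assets",
    "Intangible Fixed Assets": "Fixed Assets",
    "Amount Receivable": "Current Assets",
}


def path_to_root(concept):
    """Concept followed by its ancestors, ending at the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    return len(path_to_root(concept))  # assumed convention: root has depth 1


def msca(c1, c2):
    """Deepest concept that subsumes both c1 and c2."""
    ancestors = set(path_to_root(c1))
    for candidate in path_to_root(c2):
        if candidate in ancestors:
            return candidate  # first shared node walking up from c2
    raise ValueError("concepts share no ancestor")


def wu_palmer(c1, c2):
    return 2 * depth(msca(c1, c2)) / (depth(c1) + depth(c2))


def information_content(probability):
    """Resnik's IC(c) = -log p(c), with p(c) estimated from a corpus."""
    return -math.log(probability)
```

With this tree, two siblings under "Fixed Assets" score higher than concepts that meet only at the root, which is the behaviour the formula is meant to capture.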

Page 10: STL: A similarity measure based on semantic and linguistic information

Cont.

Intrinsic information content (iIC)

[Equation not preserved in the transcript]

where sub(c) is the number of sub-concepts of a given concept c.

Pirro_Similarity
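
The transcript does not carry the iIC equation itself. A commonly cited definition of intrinsic information content, due to Seco et al., is iIC(c) = 1 - log(sub(c) + 1) / log(max_con), where max_con is the total number of concepts in the ontology. The sketch below implements that form; whether the slide used exactly this variant is an assumption.

```python
import math


def intrinsic_ic(sub_count: int, max_concepts: int) -> float:
    """Intrinsic IC in the Seco-style form (assumed, not confirmed by the slide).

    sub_count:    number of sub-concepts of the concept, sub(c)
    max_concepts: total number of concepts in the ontology, max_con
    """
    return 1.0 - math.log(sub_count + 1) / math.log(max_concepts)
```

A leaf concept (no sub-concepts) gets the maximum value of 1, and the value falls towards 0 as a concept subsumes more of the ontology, so no corpus statistics are needed.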

Page 11: STL: A similarity measure based on semantic and linguistic information

Cont.

[Figure: xEBR asset hierarchy used for the worked example. Node labels: Assets; Subscribed Capital Unpaid; Fixed Assets; Current Assets; Tangible Fixed Assets; Amount Receivable; Stocks; Amount Receivable [total]; Amount Receivable within one year; Amount Receivable after more than one year; Trade Debtors; Other Debtors; Other Tangible Fixed Assets; Property, Plant and Equipment; Property, Plant and Equipment [Total]; Other Property, Plant and Equipment; Land and Building; Plant and Machinery; Furniture Fixture and Equipment; Other Fixture; Payments on account and asset in construction]

Pirro_Sim = ?

MSCA: subconcepts = 48, IC = 0.32

Tangible Fixed Assets (TFA): subconcepts = 9, IC (TFA) = 0.60

Amount Receivable (AR): subconcepts = 6, IC (AR) = 0.69

Pirro_Sim = 0.33
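
The transcript does not state the similarity formula, but the numbers on this slide are consistent with a ratio of the form IC(MSCA) / (IC(c1) + IC(c2) - IC(MSCA)). The sketch below checks that arithmetic; treating this ratio as the formula behind Pirro_Sim is an inference from the figures, and the function name is hypothetical.

```python
def ratio_similarity(ic_c1: float, ic_c2: float, ic_msca: float) -> float:
    """IC-based ratio: shared information over total information (assumed form)."""
    return ic_msca / (ic_c1 + ic_c2 - ic_msca)


# IC values read off the slide: TFA = 0.60, AR = 0.69, MSCA = 0.32.
score = ratio_similarity(0.60, 0.69, 0.32)
print(round(score, 2))  # 0.33
```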

Page 12: STL: A similarity measure based on semantic and linguistic information

Limitation

Does semantic structure reflect a good similarity?

Not necessarily

– e.g. in xEBR, the parent-child relation is used for describing the layout of concepts

– “Work in progress” is not a type of asset, although both are linked via the parent-child relationship

Page 13: STL: A similarity measure based on semantic and linguistic information

Terminology

Definition

Common naming convention

N-grams vs. subterms

In the financial domain, the bigram “Intangible Fixed” is a substring of “Other Intangible Fixed Assets” but not a subterm.

Terminological similarity

Maximal subterm overlap

Page 14: STL: A similarity measure based on semantic and linguistic information

Cont.

Trade Debts Payable After More Than One Year

[[Trade][Debts]][Payable][After More Than One Year]

Financial Debts Payable After More Than One Year

Financial[Debts][Payable][After More Than One Year]

Subterm sources: [Investopedia:Trade], [Investoword:Debt], [FinanceDict:Trade Debts], [SAP:Payable], [Ifrs:After More Than One Year]
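
The decomposition on this slide can be sketched as a lookup of contiguous word sequences in a term bank, followed by an overlap score. The `TERM_BANK` contents and the Dice-style scoring are assumptions; the slides only say "maximal subterm overlap" without giving a formula.

```python
# Hypothetical term bank: subterms assumed to be attested in external
# terminological resources such as those named on the slide.
TERM_BANK = {
    "trade", "debts", "trade debts", "payable", "after more than one year",
}


def subterms(term, term_bank=TERM_BANK):
    """Every contiguous word sequence of `term` that appears in the term bank."""
    words = term.lower().split()
    found = set()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            candidate = " ".join(words[start:end])
            if candidate in term_bank:
                found.add(candidate)
    return found


def subterm_overlap(term_a, term_b, term_bank=TERM_BANK):
    """Dice-style overlap of the two subterm sets (assumed scoring choice)."""
    a, b = subterms(term_a, term_bank), subterms(term_b, term_bank)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))
```

Unlike plain n-grams, a word sequence such as "payable after" is ignored here because it is not an attested term, which is the distinction the previous slide draws with "Intangible Fixed".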

Page 15: STL: A similarity measure based on semantic and linguistic information

Multilingual Subterms

Translated subterms

Available in other languages

Advantage

Reflect terminological similarities that may be available in one language but not in others.

“Property Plant and Equipment”@en

“Tangible Fixed Asset”@en

“Sachanlagen”@de
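
One way to read the example above is that two English labels with little surface overlap can share a label in another language. The sketch below checks for such shared labels. The `LABELS` table is hypothetical, and attaching the German label to both English terms is an assumption made for illustration.

```python
# Hypothetical multilingual labels keyed by English term.
LABELS = {
    "Property Plant and Equipment": {"en": "Property Plant and Equipment",
                                     "de": "Sachanlagen"},
    "Tangible Fixed Asset": {"en": "Tangible Fixed Asset",
                             "de": "Sachanlagen"},
}


def shared_languages(term_a, term_b, labels=LABELS):
    """Languages in which the two terms carry the same label."""
    a, b = labels[term_a], labels[term_b]
    return {lang for lang in a.keys() & b.keys()
            if a[lang].lower() == b[lang].lower()}
```

Here the two terms differ in English but coincide in German, so a multilingual comparison finds a match that a monolingual one would miss.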

Page 16: STL: A similarity measure based on semantic and linguistic information

Linguistic

Syntactic Information

Beyond simple word order

– Phrase structure

– Dependency structure

Phrase structure

Intangible fixed : adj adj > ??

Intangible fixed assets : adj adj n > NP

Dependency structure

Amounts receivable : N Adv : receive:mod, amounts:head

Received amounts : V N : receive:mod, amounts:head
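
The dependency example can be sketched by comparing sets of (relation, word) pairs. The analyses below are written by hand in the slide's notation and stand in for parser output; the `ANALYSES` table and the Jaccard scoring are assumptions.

```python
# Hand-written analyses (hypothetical data), following the slide's notation:
# each term maps to a set of (relation, word) pairs.
ANALYSES = {
    "Amounts receivable": {("head", "amounts"), ("mod", "receive")},
    "Received amounts": {("head", "amounts"), ("mod", "receive")},
    "Intangible fixed assets": {("head", "assets"), ("mod", "intangible"),
                                ("mod", "fixed")},
}


def dependency_similarity(term_a, term_b, analyses=ANALYSES):
    """Jaccard overlap of the two dependency sets (assumed scoring choice)."""
    a, b = analyses[term_a], analyses[term_b]
    return len(a & b) / len(a | b)
```

"Amounts receivable" and "Received amounts" differ in word order and part of speech, yet their dependency sets are identical, so they score 1.0 under this comparison.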

Page 17: STL: A similarity measure based on semantic and linguistic information

Evaluation

Data Set

xEBR finance vocabulary

269 terms (concept labels)

72,361 (269 × 269) term pairs

Benchmarks

SimSem59: sample of 59 term pairs

SimSem200 : sample of 200 term pairs (under construction)

Page 18: STL: A similarity measure based on semantic and linguistic information

Experiment

An overview of similarity measures

Page 19: STL: A similarity measure based on semantic and linguistic information

Experiment Results (SimSem59)

STL formula used

STL = 0.1531 * S + 0.5218 * T + 0.1041 * L + 0.1791

Correlation between similarity scores & SimSem59

[Chart: semantic contribution, terminology contribution, linguistic contribution]
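
The fitted formula on this slide can be applied directly once the three component scores are known. The sketch below uses the coefficients from the slide; the function name `stl_score` is hypothetical, and the assumption that S, T and L each lie in [0, 1] is not stated in the transcript.

```python
def stl_score(s: float, t: float, l: float) -> float:
    """STL score using the SimSem59 regression coefficients from the slide."""
    return 0.1531 * s + 0.5218 * t + 0.1041 * l + 0.1791
```

The terminological weight (0.5218) is more than three times either of the other two, which matches the conclusion that T contributes most. If all three inputs are 1, the score is 0.9581.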

Page 20: STL: A similarity measure based on semantic and linguistic information

Conclusion

STL outperforms more traditional similarity measures

Largest contribution by T (Terminological Analysis)

Multilingual subterms perform better than monolingual ones

Page 21: STL: A similarity measure based on semantic and linguistic information

Future work

Evaluation on larger data set and vocabularies (IFRS)

3000+ terms

9M term pairs

Richer set of linguistic operations

“recognise” => “recognition”

by derivation rule verb_lemma + “ion”

Similarity between subterms

“Staff Costs” and "Wages And Salaries"
