STL: A similarity measure based on semantic and linguistic information
Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
Digital Enterprise Research Institute www.deri.ie
Enabling Networked Knowledge
STL: A Similarity Measure Based on
Semantic, Terminological and Linguistic
Information
Nitish Aggarwal
joint work with Tobias Wunner, Mihael Arcan
DERI, NUI Galway
Friday, 19th Aug, 2011
DERI, Friday Meeting
Overview
Motivation & Applications
Why STL?
Semantic
Terminology
Linguistic
Evaluation
Conclusion and future work
Motivation & Applications
Semantic Annotation
Similarity between corpus data and ontology concepts
SAP AG held €1615 million in
short-term liquid assets (2009)
“dbpedia:SAP_AG” “xEBR:LiquidAssets”
at “dbpedia:year:2009”
Semantic Search
Similarity between Query and index object
SAP liquid asset in 2010
Net cash of SAP in 2010
Current asset of SAP last year
SAP total amount received in 2010
“dbpedia:SAP_AG” “xEBR:liquid asset” at “dbpedia:year:2010”
Motivation & Applications
Motivation & Applications
Ontology Matching & Alignment
Similarity between ontology concepts
[Figure: two balance-sheet concept hierarchies]
xEBR: Assets, xebr:KeyBalanceSheet, xebr:SubscribedCapitalUnpaid, xebr:FixedAssets, xebr:CurrentAssets, xebr:TangibleFixedAssets, xebr:AmountReceivable, xebr:LiquidAssets, xebr:IntangibleFixedAssets
IFRS: ifrs:StatementOfFinancialPosition, Ifrs:Assets, Ifrs:CurrentAssets, ifrs:BiologicalAssets, Ifrs:NonCurrentAssets, Ifrs:Inventories, Ifrs:TradeAndOtherCurrentReceivables, ifrs:PropertyPlantAndEquipment, ifrs:CashAndCashEquivalents
Similarity = ?
Classical Approaches
String Similarity
Levenshtein distance, Dice coefficient
Corpus-based
LSA, ESA, Google distance, Vector-Space Model
Ontology-based
Path distance, Information content
Syntax Similarity
Word-order, Part of Speech
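A minimal sketch of two of the string measures named above, Levenshtein distance and the Dice coefficient over character bigrams; the slides give no code, so this is only an illustration:

```python
# Two classical string-similarity measures: Levenshtein edit distance
# and the Dice coefficient over character-bigram sets.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def dice(a: str, b: str) -> float:
    """Dice coefficient over character-bigram sets."""
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y)) if x or y else 1.0

print(levenshtein("FixedAssets", "LiquidAssets"))
print(dice("Fixed Assets", "Intangible Fixed Assets"))
```

Note that both measures score on surface form only, which is exactly the limitation STL is meant to address.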
Why STL?
Semantic
Semantic structure and relations
Terminology
complex terms expressing the same concept
Linguistic
Phrase and dependency structure
STL
Definition
A linear combination of semantic, terminological and linguistic similarities,
with weights obtained using linear regression
Formula used
STL = w1*S + w2*T + w3*L + Constant
– w1, w2, w3 represent the contribution of each component
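The combination can be sketched as follows; the default coefficient values are the regression-fit weights reported later in the talk's experiment-results slide, used here purely as illustrative defaults:

```python
# STL as a weighted sum of the three component similarities S, T, L.
# Defaults are the fitted weights from the evaluation on SimSem59.

def stl(s: float, t: float, l: float,
        w1: float = 0.1531, w2: float = 0.5218, w3: float = 0.1041,
        const: float = 0.1791) -> float:
    """STL = w1*S + w2*T + w3*L + Constant."""
    return w1 * s + w2 * t + w3 * l + const

print(stl(0.5, 0.8, 0.6))
```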
Semantic
Wu-Palmer
sim(c1, c2) = 2*depth(MSCA) / (depth(c1) + depth(c2))
Resnik’s Information Content
IC(c) = -log p(c)
Intrinsic Information Content (Pirro09)
Avoids the need to analyse large corpora
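A minimal Wu-Palmer sketch over a small parent map; the concept names echo xEBR labels, but the hierarchy here is invented for illustration:

```python
# Wu-Palmer similarity: 2*depth(MSCA) / (depth(c1) + depth(c2)),
# computed over a toy child -> parent map (hypothetical hierarchy).

PARENT = {"FixedAssets": "Assets", "CurrentAssets": "Assets",
          "TangibleFixedAssets": "FixedAssets", "LiquidAssets": "CurrentAssets"}

def ancestors(c):
    """Chain from c up to the root: [c, parent(c), ..., root]."""
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain

def depth(c):
    return len(ancestors(c))  # the root has depth 1

def msca(c1, c2):
    """Most specific common ancestor: first shared node walking up from c2."""
    a1 = ancestors(c1)
    for a in ancestors(c2):
        if a in a1:
            return a

def wu_palmer(c1, c2):
    return 2 * depth(msca(c1, c2)) / (depth(c1) + depth(c2))

print(wu_palmer("TangibleFixedAssets", "LiquidAssets"))
```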
Cont.
Intrinsic Information Content (iIC)
iIC(c) = 1 - log(sub(c) + 1) / log(N)
where sub(c) is the number of sub-concepts of concept c and N is the total number of concepts in the taxonomy.
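A small sketch of intrinsic information content as a function of the sub-concept count, following the Seco-style formulation iIC(c) = 1 - log(sub(c)+1)/log(N); taking N = 269 (the xEBR term count from the evaluation) as the taxonomy size is an assumption made here:

```python
import math

# Intrinsic information content: concepts subsuming many sub-concepts
# are less informative; leaves are maximally informative.

def iic(sub_count: int, n_concepts: int) -> float:
    """iIC(c) = 1 - log(sub(c) + 1) / log(N)."""
    return 1 - math.log(sub_count + 1) / math.log(n_concepts)

print(iic(0, 269))    # a leaf: no sub-concepts, maximal information
print(iic(268, 269))  # the root: subsumes everything, no information
print(iic(9, 269))    # e.g. a mid-level concept with 9 sub-concepts
```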
Pirro_Similarity
Pirro_Sim(c1, c2) = 3*iIC(MSCA(c1, c2)) - iIC(c1) - iIC(c2)
Cont.
[Figure: xEBR Assets hierarchy, with sub-concepts including Subscribed Capital Unpaid, Fixed Assets, Current Assets, Tangible Fixed Assets, Amount Receivable (within one year / after more than one year), Trade Debtors, Other Debtors, Property Plant and Equipment, Land and Building, Plant and Machinery, Furniture Fixture and Equipment, Payments on Account and Assets in Construction, Stocks]
Pirro_Sim(Tangible Fixed Assets, Amount Receivable) = ?
MSCA: sub-concepts = 48, IC = 0.32
Tangible Fixed Assets: sub-concepts = 9, IC = 0.60
Amount Receivable: sub-concepts = 6, IC = 0.69
Pirro_Sim = 0.33
Limitation
Does semantic structure reflect a good similarity?
Not necessarily
– e.g. in xEBR, the parent-child relation is used to describe the layout of concepts
– “Work in progress” is not a type of asset, although both are linked via the parent-child relationship
Terminology
Definition
Common naming convention
N-grams vs subterms
In the financial domain, the bigram ”Intangible Fixed” is a substring of
”Other Intangible Fixed Assets” but not a subterm.
Terminological similarity
maximal subterm overlap
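The maximal-subterm-overlap idea can be sketched as a Dice coefficient over subterm sets; the subterm inventories below are hand-made for illustration, not the talk's actual term resources:

```python
# Terminological similarity as overlap between recognised subterm sets.
# Subterm inventories here are illustrative stand-ins for real lexicons.

SUBTERMS = {
    "Trade Debts Payable After More Than One Year":
        {"Trade", "Debts", "Trade Debts", "Payable", "After More Than One Year"},
    "Financial Debts Payable After More Than One Year":
        {"Financial", "Debts", "Payable", "After More Than One Year"},
}

def term_sim(t1: str, t2: str) -> float:
    """Dice coefficient over the two terms' subterm sets."""
    s1, s2 = SUBTERMS[t1], SUBTERMS[t2]
    return 2 * len(s1 & s2) / (len(s1) + len(s2))

print(term_sim("Trade Debts Payable After More Than One Year",
               "Financial Debts Payable After More Than One Year"))
```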
Cont.
Trade Debts Payable After More Than One Year => [[Trade][Debts]][Payable][After More Than One Year]
Financial Debts Payable After More Than One Year => [Financial][Debts][Payable][After More Than One Year]
Subterm evidence: [FinanceDict:Trade Debts], [Investopedia:Trade], [Investoword:Debt], [SAP:Payable], [Ifrs:After More Than One Year]
Multilingual Subterms
Translated subterms
Available in other languages
Advantage
Reflect terminological similarities that may be available in one
language but not in others.
”Property Plant and Equipment”@en
”Tangible Fixed Asset” @en
”Sachanlagen”@de
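A toy sketch of the multilingual-subterm advantage: the two English labels above share no obvious English subterm, yet both translate to ”Sachanlagen”@de. The translation table, including the German gloss for Liquid Assets, is assumed for illustration:

```python
# Multilingual subterm matching: two labels match if they share at
# least one translated subterm. Translation table is illustrative.

TRANSLATIONS = {
    "Property Plant and Equipment": {"Sachanlagen"},      # @de
    "Tangible Fixed Asset": {"Sachanlagen"},              # @de
    "Liquid Assets": {"Fluessige Mittel"},                # @de (assumed gloss)
}

def multilingual_match(t1: str, t2: str) -> bool:
    """True if the two labels share at least one translation."""
    return bool(TRANSLATIONS.get(t1, set()) & TRANSLATIONS.get(t2, set()))

print(multilingual_match("Property Plant and Equipment", "Tangible Fixed Asset"))
```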
Linguistic
Syntactic Information
Beyond simple word order
– phrase structure
– Dependency structure
Phrase structure
Intangible fixed : adj adj > ??
Intangible fixed assets : adj adj n > NP
Dependency structure
Amounts receivable : N Adv : receive:mod, amounts:head
Received amounts : V N : receive:mod, amounts:head
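The dependency example above can be sketched as follows; the analyses are hand-coded stand-ins for real parser output:

```python
# Dependency-structure matching: "amounts receivable" and
# "received amounts" differ in word order and part of speech, but both
# normalise to the same head/modifier pair.

DEP = {
    "amounts receivable": {"head": "amounts", "mod": "receive"},
    "received amounts":   {"head": "amounts", "mod": "receive"},
    "fixed assets":       {"head": "assets",  "mod": "fix"},
}

def dep_match(p1: str, p2: str) -> bool:
    """True if two phrases share the same head/modifier analysis."""
    return DEP[p1] == DEP[p2]

print(dep_match("amounts receivable", "received amounts"))
```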
Evaluation
Data Set
xEBR finance vocabulary
269 terms (concept labels)
72,361 (269 × 269) term pairs
Benchmarks
SimSem59: a sample of 59 term pairs
SimSem200: a sample of 200 term pairs (under construction)
Experiment
An overview of similarity measures
Experiment Results (SimSem59)
STL formula used
STL = 0.1531 * S + 0.5218 * T + 0.1041 * L + 0.1791
Correlation between similarity scores & SimSem59
[Chart: correlation per measure, with the semantic, terminological and linguistic contributions annotated]
Conclusion
STL outperforms the classical similarity measures
The largest contribution comes from T (terminological analysis)
Multilingual subterms perform better than monolingual ones
Future work
Evaluation on larger data set and vocabularies (IFRS)
3000+ terms
9M term pairs
Richer set of linguistic operations
– “recognise” => “recognition”, via the derivation rule verb_lemma + "ion"
Similarity between subterms
“Staff Costs” and "Wages And Salaries"
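The derivation rule mentioned above (“recognise” => “recognition” via verb_lemma + "ion") can be sketched as follows; the suffix-stripping lemma step is a crude assumption, not the talk's actual morphological analysis:

```python
# Derivation rule verb_lemma + "ion", with a very crude lemma step:
# strip a final "-ise"/"-ize" and rewrite it as "-ition".

def nominalise(verb: str) -> str:
    """e.g. "recognise" -> "recognition" (crude sketch, not general)."""
    for suffix, repl in (("ise", "ition"), ("ize", "ition")):
        if verb.endswith(suffix):
            return verb[:-len(suffix)] + repl
    return verb + "ion"

print(nominalise("recognise"))  # recognition
```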