Cross-lingual ontology lexicalisation, translation and information
extraction
Net2 workshop, University of South Africa
Tobias Wunner, Paul Buitelaar
DERI, National University of Ireland, Galway
Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
us-gaap:GainLossOnSaleOfOilAndGasProperty
ifrs:Revenue
de-gaap:BilanzsummeSummeAktiva
be-gaap:MinderwaardenBijDeRealisatieVanVasteActiva
Multilingual Ontologies for networked knowledge
enabling networked knowledge
Outline
1. Research challenge and motivation
2. Ontology Translation
3. Lexicalization (lemon)
4. CLOBIE (CL Ontology-based Inf. Extraction)
Context and Motivation
• Monnet use case in financial domain
– query financial information
  • Cross-vocabulary
  • Cross-lingual
  • Get result in your own language
• Research challenges
– localization & translation of vocabularies
– cross-lingual ontology-based information extraction
Finance Terminology is complex!
“minimum finance lease payments receivable, at present value, end of period not later than one year”
– representative term of financial domain
– 16 words
– complex structure (conceptually & linguistically)
Some insight into finance terminology
– XBRL (IFRS): domain-specific
– Domain terminology (SAPTerm): domain-related
– Dictionary (WordNet): domain-independent
Break down complexity
3-faceted lexical enrichment by term decomposition
• Semantic
– financial asset is-a asset
– non-financial asset is-a asset
– available-for-sale financial asset is-a financial asset
• Terminological
– [asset]
– [available-for-sale]
– [financial]
– [financial asset]
– [non-financial asset]
– [available-for-sale financial asset]
• Linguistic
– Noun_Sing: asset
– Noun_Plural: assets
– ?P: available-for-sale
– Adjective: financial
– NP: available-for-sale fin. asset
– VP: to sell financial assets
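The terminological facet of this decomposition can be sketched as a lookup of every contiguous sub-term in a known vocabulary. The Python below is a minimal illustration only: the function name `decompose` and the toy set `VOCABULARY` are assumptions made for this example and stand in for lookups against resources such as IFRS, SAPTerm or a dictionary.

```python
def decompose(term, vocabulary):
    """Return all contiguous word sequences of `term` found in `vocabulary`.

    Sub-terms are returned shortest first, then in order of appearance.
    """
    words = term.split()
    found = []
    for length in range(1, len(words) + 1):
        for start in range(len(words) - length + 1):
            candidate = " ".join(words[start:start + length])
            if candidate in vocabulary and candidate not in found:
                found.append(candidate)
    return found


# Hypothetical vocabulary; a real system would query terminology resources.
VOCABULARY = {
    "asset",
    "financial",
    "available-for-sale",
    "financial asset",
    "available-for-sale financial asset",
}

subterms = decompose("available-for-sale financial asset", VOCABULARY)
for subterm in subterms:
    print("[" + subterm + "]")
```

Exact string matching is the simplest possible choice here; it would miss inflected variants such as "assets", which is where the linguistic facet comes in.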
XBRL – Semantic Analysis
“Enhance semantics to facilitate translation and information extraction.”
XBRL – Terminological Analysis
Minimum finance lease payments receivable, at present value
– Domain-specific: ifrs:MinimumFinanceLeasePaymentsReceivable, ifrs:MinimumFinanceLeasePaymentsReceivableAtPresentValue
– Domain-related: sapTerm:payments, sapTerm:financeLease
– Domain-independent: googleDefine:leasePayments, googleDefine:Finance_lease
XBRL – Linguistic Analysis
• simple
– financial text: “… lease payment …” (singular)
– XBRL term: … lease payments … (plural)
• complex
– financial text: “… received minimum finance lease payments …” (verb)
– XBRL term: minimum finance lease payments receivable (adverb)
2. Ontology Translation
Translation using STL
• Models developed in Monnet
– English / German / Spanish / Dutch
• … Net2
– Afrikaans?
– Zulu?
– Xhosa?
– …
ifrs:Revenue
ifrs:ProfitLossBeforeTax
ifrs:MinimumFinanceLeasePaymentsPayable
Application in Machine Translation (in Dutch)
available-for-sale financial assets
1. term analysis using: IFRS, SAPTerm, GoogleDefine
[available-for-sale] [financial] [assets]
2. translate subterms using: domain TM (IFRS), Linked Open Data (DBPedia), Translation services (GoogleTranslate)
[voor verkoop beschikbare] [financiële] [activa]
3. term synthesis using: grammars (rules, statistical models)
voor verkoop beschikbare financiële activa
Application in Machine Translation (in Afrikaans)
available-for-sale financial assets
1. term analysis using: IFRS, SAPTerm, GoogleDefine
[available-for-sale] [financial] [assets]
2. translate subterms using: domain TM (IFRS), Linked Open Data (DBPedia), Translation services (GoogleTranslate)
[beskikbaar vir verkoop] [finansiële] [bates]
3. term synthesis using: grammars (rules, statistical models)
finansiële bates beskikbaar vir verkoop
Application in Machine Translation (in Spanish)
available-for-sale financial assets
1. term analysis using: IFRS, SAPTerm, GoogleDefine
[available-for-sale] [financial] [assets]
2. translate subterms using: domain TM (IFRS), Linked Open Data (DBPedia), Translation services (GoogleTranslate)
[disponibles para la venta] [financia] [activos]
3. term synthesis using: grammars (rules, statistical models)
activos financieros disponibles para la venta
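The three steps (term analysis, subterm translation, term synthesis) can be sketched in Python as follows. This is a toy illustration under stated assumptions: the subterm translations are copied from the Dutch and Afrikaans slides, while the `ORDER` table is a hypothetical stand-in for the grammars or statistical models that a real system would use for synthesis. The names `SUBTERMS`, `TRANSLATIONS`, `ORDER` and `translate_term` are invented for this example.

```python
# Step 1 output: sub-terms of the English term (assumed to come from term analysis).
SUBTERMS = ["available-for-sale", "financial", "assets"]

# Step 2: toy translation memory with the entries shown on the slides.
TRANSLATIONS = {
    "nl": {
        "available-for-sale": "voor verkoop beschikbare",
        "financial": "financiële",
        "assets": "activa",
    },
    "af": {
        "available-for-sale": "beskikbaar vir verkoop",
        "financial": "finansiële",
        "assets": "bates",
    },
}

# Step 3: hypothetical reordering rules. Each list gives the source positions
# in target-language order; real synthesis would use grammars or models.
ORDER = {
    "nl": [0, 1, 2],
    "af": [1, 2, 0],
}


def translate_term(subterms, lang):
    """Translate each sub-term, then reorder the results for the target language."""
    translated = [TRANSLATIONS[lang][subterm] for subterm in subterms]
    return " ".join(translated[position] for position in ORDER[lang])


for lang in ("nl", "af"):
    print(lang, translate_term(SUBTERMS, lang))
```

Spanish is left out of the sketch because its synthesis step also has to handle agreement, which a pure reordering table cannot express.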
3. Lexicalization (lemon)
Why do we need a lexicon?
“loads of unlinked domain-specific terminology on the web!”
http://en.wikipedia.org/wiki/Finance_lease
http://www.investopedia.com/terms/l/lease-payments.asp
• An interoperable web for … ?
– re-use
– enable multilinguality
– cross-lingual search
– cross-lingual fact extraction
Lexicon standards overview
• ISO (XML)
– TEI (Text Encoding Initiative)
– LMF (Lexical Markup Framework)
• W3C & Semantic Web (RDF / OWL)
– built-in rdfs:label
– lightweight linguistic representations (SKOS, SKOS-XL)
– rich linguistic representations (GOLD, LexInfo)
SKOS – Multilingual Information
[Diagram: ifrs:MinimumFinanceLeasePayments, dbpedia:Lease_payments and dbpedia:Finance_lease linked by skos:broader, skos:narrower and skos:related]
• SKOS concepts with…
– term relations
– multilingual labels
– resource references
SKOS – Multilingual Information
Not much uptake yet? from http://data.nytimes.com/
Ontology-Text Mismatch
‘Edificio-historico’ vs. ‘…edificio, declarado Monumento Histórico…’
>> goes beyond SKOS (monolingual & multilingual term variants)
>> requires representation of lexical information to compute linguistic variants, e.g.
‘edificio historico[apposVP[NP[Adj]]]’
A Lexicon Model for Ontologies
• Requirements for ‘ontology-lexicon’ model
– Represent linguistic information relative to ontology
• Avoid unnecessary ambiguities by representing only lexical features relevant to semantics of underlying application
– Keep semantics separate from linguistic info
• Separate clearly ‘world’ (properties of objects referred to by words) from ‘word’ (properties of words) knowledge
– Modular, minimal design
• Provide simple core model that can be easily extended upon need
Was there a solution already? - SKOS
• Simple Knowledge Organization System – SKOS
– General model for formalizing thesauri, terminologies and related semantic and knowledge resources
– Formalization of terminology in focus - terminology, classification, Semantic Web communities
– Does not address linguistic aspects of terminology, and therefore not the lexicon-ontology interface
– http://www.w3.org/2004/02/skos/
Was there a solution already? - GOLD
• General Ontology for Linguistic Description – GOLD
– Community-based ontology of linguistics
– Linguistic study in focus - linguistics community
– Formal model of linguistics as an ontology, but not about connecting lexical features to ontological semantics
– Other issues: very big, modularity?
– http://linguistics-ontology.org/gold/2010
Was there a solution already? - OWN
• OntoWordNet – OWN
– Formal specification of WordNet through extension and axiomatization of its conceptual relations
– Formal knowledge representation in focus - logic, knowledge representation, Semantic Web communities
– Turns WordNet into an ontology but not about connecting lexical features to ontological semantics
– http://wiki.loa-cnr.it/index.php/LoaWiki:OWN
Was there a solution already? - LMF
• Lexical Markup Framework – LMF
– General model for formalizing and sharing of machine-readable dictionaries
– Lexical knowledge representation in focus - lexicography, NLP communities
– Very close to ontology-lexicon requirements, but no view on how lexical features link to ontological semantics – semantics is limited to a notion of sense based on synsets
– Other issues: incomplete formal model, focus on classes, less on properties/relations
– http://www.lexicalmarkupframework.org/
lemon
• lexicon model for ontologies: ‘lemon’
– General model for formalizing lexical features relative to independently defined ontological semantics
– http://www.monnet-project.eu/lemon
• Two-level modelling
– Abstract level (meta-model): lemon
– Instantiation level (lexicon model): e.g. ‘LexInfo2’
– http://lexinfo.net/
Many solutions…
…with an a priori amount of linguistics or semantics!
lemon: Overview
lemon: Lexicon
LexicalEntry can be a Word, Phrase, or Part - such as an Affix
Lexicon: wild animals
– entry: LE: Kudu
– entry: LE: shaped like a Kudu
lemon: Form
[Diagram: in the lexicon “wild animals”, lexical entries (LE) are linked to forms (F) by canonicalForm, otherForm and abstractForm; the example written forms are “kudu”, “great” and “greater”]
lemon: Structure
LexicalEntry can be decomposed into one or more Components and compositional structure can be represented
LE: shaped like a Kudu
LE: shaped
LE: like
LE: a
LE: Kudu
?
lemon: Structure - Example
shaped like a kudu : LexicalEntry
decomposition into Components, each pointing via element to a LexicalEntry and attached as leaf to a phrase structure node:
– shaped, lemma=“shape” (leaf of node constituent:VBN)
– like, lemma=“like” (leaf of node constituent:IN)
– a (leaf of node constituent:DT)
– Kudu (leaf of node constituent:NNP)
phrase structure nodes connected by edge: constituent:VP, constituent:PP, constituent:NP
lemon: Meaning & Reference
[Diagram: LE: kudu is linked by sense to a lexical sense (LS), which is linked by reference to the ontology; the labels “lexeme” and “sememe” annotate the diagram]
lemon: Meaning & Reference
[Diagram: LE: kudu and LE: greater kudu each have a sense (LS) with a reference; the two senses are related by narrower; a further label reads “preSem”]
lemon: Meaning & Reference
[Diagram: LE: greater kudu and LE: lesser kudu each have a sense (LS) with a reference (dbpedia:Kudu); the two senses are marked incompatible: lexical incompatibility]
lemon: Meaning & Reference
[Diagram: LE: kudu and LE: goat each have a sense (LS) with a reference; the referenced ontology entities are related by owl:disjointWith: ontological incompatibility]
lemon: Lexical Projection
LexicalEntry can introduce a syntactic frame with arguments that are mapped to LexicalSense and indirectly to ontological semantic objects/properties
Lexical projection (Verb Frame)
SAP AG sold long-term fixed rate conventional mortgage loans
syntactic frame
S ( NP VP( VB NP ) )
…with semantic sugar!
…more frames with LexInfo2
SAP AG sold Company X mortgage loans
ditransitive : Frame (extends verb : Frame)
– synarg: subject : Argument
– synarg: direct object : Argument
– synarg: indirect object : Argument
http://lexinfo.net/ontology/2.0/lexinfo#DitransitiveFrame
…more frames with LexInfo2
SAP AG sold mortgage loans to Company X
ditransitive_to : Frame (extends ditransitive : Frame)
– synarg: subject : Argument
– synarg: direct object : Argument
– synarg: indirect object : Argument
http://lexinfo.net/ontology/2.0/lexinfo#DitransitiveFrame_To
Or Zulu morphology…
:zuluNC7_8 a lemon:MorphPattern ;
    lemon:transform [
        lemon:rule "isi(?=[^aeiou])~" ;
        lemon:rule "is(?=[aeiou])~" ;
        lemon:generates [ lexinfo:number lexinfo:singular ]
    ] , [
        lemon:rule "izi(?=[^aeiou])~" ;
        lemon:rule "iz(?=[aeiou])~" ;
        lemon:generates [ lexinfo:number lexinfo:plural ]
    ] .
class = lemon:MorphologicalPattern
LE: tolo (with sense and pattern links): isitolo (shop)
LE: angoma (with sense and pattern links): izangoma (witch doctors)
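How such rules might generate forms can be sketched in Python. The rule strings are copied from the pattern above; reading `~` as a placeholder for the stem and the lookahead as a condition on the stem's first letter is an assumption about the notation, and the names `RULES` and `generate` are invented for this example rather than taken from lemon tooling.

```python
import re

# Rules copied from the slide; "~" is read here as a placeholder for the stem
# (an assumption about the rule notation).
RULES = {
    "singular": ["isi(?=[^aeiou])~", "is(?=[aeiou])~"],
    "plural": ["izi(?=[^aeiou])~", "iz(?=[aeiou])~"],
}


def generate(stem, number):
    """Return the first form licensed by a rule for `number`, or None."""
    for rule in RULES[number]:
        # Regex over the full form, e.g. "isi(?=[^aeiou])tolo".
        pattern = rule.replace("~", re.escape(stem))
        # Candidate form: the rule with its lookahead removed, e.g. "isitolo".
        candidate = re.sub(r"\(\?=.*?\)", "", rule).replace("~", stem)
        if re.fullmatch(pattern, candidate):
            return candidate
    return None


print(generate("tolo", "singular"))
print(generate("angoma", "plural"))
```

The lookahead decides between the two prefix variants: a consonant-initial stem such as "tolo" takes the long prefix, a vowel-initial stem such as "angoma" the short one.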
Lemon Editor and Generator
• http://monnetproject.deri.ie/Lemon-Editor
– “asset-backed-debts”
[Diagram labels: Finance Ontology, lemon lexicon, lemon Lexical Entries]
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix lemon: <http://www.monnet-project.eu/lemon#> .
@prefix financeV4: <http://fadyart.com/financeV4#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> .
@prefix pennbank: <http://www.monnet-project.eu/pennbank#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
…
<file:test#assetbackeddebt>
    lemon:phraseRoot [
        lemon:edge [
            lemon:edge [
                lemon:edge [ lemon:leaf _:n6 ] ;
                lemon:constituent pennbank:NNP
            ] ;
            lemon:constituent pennbank:NP
        ] , [
            lemon:edge [
                lemon:edge [ lemon:leaf _:n88 ] ;
                lemon:constituent pennbank:VBD
            ] , [
                lemon:edge [
                    lemon:edge [ lemon:leaf _:n69 ] ;
                    lemon:constituent pennbank:NN
                ] ;
                lemon:constituent pennbank:NP
            ] ;
            lemon:constituent pennbank:VP
        ] ;
        lemon:constituent pennbank:S
    ] ;
    lemon:decomposition ( _:n6 _:n88 _:n69 ) ;
    lemon:sense [ lemon:reference financeV4:AssetBackedDebt ] ;
    lemon:canonicalForm [ lemon:writtenRep "Asset backed debt"@en ] .
…
<file:test#back>
    lexinfo:partOfSpeech lexinfo:verb ;
    lemon:canonicalForm [
        lexinfo:tense lexinfo:past ;
        lexinfo:verbFormMood lexinfo:indicative ;
        lemon:writtenRep "backed"@en ;
        lexinfo:aspect lexinfo:perfective
    ] .
_:n88 rdf:type lemon:Component ;
    lexinfo:tense lexinfo:past ;
    lemon:element <file:test#back> ;
    lexinfo:verbFormMood lexinfo:indicative ;
    lexinfo:aspect lexinfo:perfective .
4. CLOBIE (Cross-lingual Ontology-based Information Extraction)
What is CLOBIE
• Information Extraction
– Monolingual
– No semantics
• Cross-lingual Information Extraction
– Multilingual
• Ontology-based Information Extraction
– Semantics in the background
What is CLOBIE
Example: “SAP sold risk securities at a value of 12b EUR.”
General pattern: .*[COMPANY] sell [ASSETS] .*
• Information extraction (monolingual)
PATTERN: .*SAP.*(sells|sold|issues).*(risk securities).*[0-9]+b (EUR|USD).*
• Information extraction (multilingual)
PATTERN_DE: .*SAP.*verkaufte*.*(Risiko Wertpapiere).*[0-9]+b (EUR|USD).*
• Information extraction with semantics
PATTERN: .*$COMPANY.*(sells|sold|issues).*$ASSETS.*$MONETARY_VALUE.*
Ontology fragment:
– financial assets: risk securities
– non-financial assets: Property, Plant & Equipment
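The step from a hard-coded pattern to a pattern "with semantics" can be sketched as filling class placeholders from an ontology before compiling the expression. The Python below is a rough illustration: the dictionary `ONTOLOGY`, its entries other than "SAP" and "risk securities", the fixed monetary-value expression and the function name `compile_pattern` are all assumptions made for this example.

```python
import re

# Hypothetical mini-ontology: class name -> known lexicalisations.
ONTOLOGY = {
    "COMPANY": ["SAP", "Tesco"],
    "ASSETS": ["risk securities", "mortgage loans"],
}

# Template with $CLASS placeholders; the monetary value is a fixed expression here.
TEMPLATE = (
    r"(?P<company>$COMPANY)\b.*\b(?:sells|sold|issues)\b.*"
    r"\b(?P<assets>$ASSETS)\b.*?(?P<value>[0-9]+b (?:EUR|USD))"
)


def compile_pattern(template, ontology):
    """Replace each $CLASS placeholder by an alternation of its lexicalisations."""
    for name, labels in ontology.items():
        alternation = "|".join(re.escape(label) for label in labels)
        template = template.replace("$" + name, alternation)
    return re.compile(template)


pattern = compile_pattern(TEMPLATE, ONTOLOGY)
match = pattern.search("SAP sold risk securities at a value of 12b EUR.")
if match:
    print(match.group("company"), match.group("assets"), match.group("value"))
```

Adding a new asset type then means extending the ontology (or its lexicon) rather than editing every extraction pattern.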
Application in Information Extraction (IE)
“…The fair value of the Group’s finance lease receivables at 23 February 2008 was £5m…” (Tesco’s Annual Report 2009)
“..As at December 31, 2008, the future minimum lease payments expected to be received was €16 million…” (SAP Annual Report 2008)
linguistic analysis: payments received, receivables
term analysis: Minimum finance lease payments receivable
semantically lifted:
:MinimumFinanceLeasePaymentsReceivable rdfs:subClassOf xbrli:monetaryItemType ;
    rdfs:label "Minimum finance lease payments receivable"@en .
CLOBIE Interdisciplinary
• Semantic Web: Ontologies, SKOS, lemon, SPARQL queries
• NLP: Corpus query, Term analysis, POS tagging, Morph analysis
• Machine Translation: Statistical MT, Rule-based MT, Localization
• Information Retrieval: TF-IDF, Web query, ranking algorithms, CLIR (ESA, MT-based)
• Information Extraction: Term extraction, Relation extraction, Extract. grammars
Why CLOBIE?
• Many unstructured resources (News, FinReps)
• Knowledge in SW is often:
– Not dynamic (no regular, only manual updates)
– Knowledge across languages/countries not integrated
CLOBIE blackboard architecture
Semantic / Terminological / Linguistic Enrichment Process
• Annotators (read / contribute to the Blackboard)
– Basic NLP: Splitter, Tok. / POS
– Linguistic Analyzer: Morphology, Dependency Parser
– Term Analyzer: Terminology DB
– Semantic Analyzer
• Blackboard entries: token_id / POS, token_id / token_id, sent_id / term, sent_id / concept, …
• CLOBIE Search (read)
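The read / contribute cycle of a blackboard architecture can be sketched in a few lines of Python. This is not the CLOBIE implementation: the class `Blackboard`, the three annotator functions, the example sentence and the toy terminology are assumptions chosen to show how each annotator reads earlier layers and contributes a new one.

```python
class Blackboard:
    """Shared store that annotators read from and contribute to."""

    def __init__(self, text):
        self.text = text
        self.layers = {}  # layer name -> list of annotations

    def read(self, layer):
        return self.layers.get(layer, [])

    def contribute(self, layer, annotations):
        self.layers[layer] = annotations


def tokenizer(board):
    # Basic NLP: naive whitespace tokenisation (placeholder for a real tokenizer).
    board.contribute("tokens", board.text.rstrip(".").split())


def term_analyzer(board, terminology):
    # Term Analyzer: keep terminology entries that occur as whole-word sequences.
    joined = " " + " ".join(token.lower() for token in board.read("tokens")) + " "
    board.contribute("terms", [t for t in terminology if " " + t + " " in joined])


def semantic_analyzer(board, concepts):
    # Semantic Analyzer: lift recognised terms to ontology concepts.
    board.contribute(
        "concepts", [concepts[t] for t in board.read("terms") if t in concepts]
    )


board = Blackboard("SAP reported finance lease receivables of 5m.")
tokenizer(board)
term_analyzer(board, ["finance lease", "lease payments"])
semantic_analyzer(board, {"finance lease": "dbpedia:Finance_lease"})
print(board.read("terms"), board.read("concepts"))
```

Because annotators only communicate through the shared layers, a search component can read the finished annotations without knowing which annotator produced them.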
CLOBIE Data set (Wind Energy)
• 10 companies in Wind Energy domain
• Financial reports in
– German / Spanish / English / Dutch
– IFRS / DE-GAAP
• Semantics defined by
– IFRS vocabulary
– xEBR vocabulary
Next steps…
• Benchmark development and evaluation on the basis of a data set in finance domain
– financial reports and news from different companies in wind energy domain
  • multilingual (German, Dutch, Spanish, English)
  • multi-vocabulary (IFRS, European local GAAPs, DBPedia)
• Cross-lingual ontology-based information retrieval system
• Generate ontology-based information extraction grammars from lemon ontology-lexicons