Multilingual support to a proposed Semantic Web architecture

Andrea Ferrato

TOP-UIC MS Thesis, 2003/’04

Advisor: Laura Farinetti

Purpose of this work

- Design and (partially) implement multilingual support on a pre-existing Semantic Web platform
- Provide an approach that is as generic as possible
- Exploit features of the pre-existing architecture
- Cope with the typically chaotic structure of the resources currently available

Outline

- Semantic Web
- Multilinguality
- The DOSE platform
- Proposed solution
- Given implementation
- Experimental results
- Conclusions

Semantic Web

- The next evolutionary stage for the WWW
- Goal: make network data usable by intelligent agents
- Deployable only on top of existing infrastructure
- Two pressing tasks
  - Transform existing contents to include semantics
  - Set up ad hoc user agents to work on them

Transform existing contents

- Basic data units: resources
  - Every single information entity that can be semantically isolated
- Features to be given
  - Identification: URI
  - Structure: XML
  - Meaning: RDF
  - Knowledge: ontologies

Set up ad hoc user agents

- Major players in Semantic Web deployment
- Invoked by users, can proceed autonomously
- Key facilities to be supported
  - Logic
  - Proof
  - Trust

Semantic Web: layer cake view (Berners-Lee)

[Figure: layer cake diagram. Recoverable labels: Unicode, URI; XML + NS + XML Schema; RDF + RDF Schema; Ontology vocabulary; Self-desc. doc.; Data]

Multilinguality

- The extension to multiple languages of tasks already performed in a monolingual context
- Typical issues from cross-language mapping
  - Lexical gaps
  - Role of the context
  - Lack of pre-acquired knowledge

Multilinguality and Semantic Web

- A problem of Text Retrieval in multiple languages (NLP)
  - Start from popular approaches (Controlled Vocabulary, Free text, etc.)
- Two main requirements
  - Recognize the language ID of resources
  - Map contents independently of language

Language ID retrieval

- Two possible scenarios
  - Retrieve a given ID via resource parsing
  - Recreate the ID via resource analysis
- When retrieving a given language attribute, conform to existing language specification standards

Language ID specification

- Content-Language
- CSS-level declarations
- “lang” attribute
- Language inheritance
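A sketch of how two of these declarations could be read back by parsing a resource, in Python. It assumes HTML input and looks only at the `lang` attribute and at a `<meta http-equiv="Content-Language">` element; CSS-level declarations and HTTP headers are left out. The names `LangDeclarationParser` and `declared_languages` are hypothetical and are not DOSE modules.

```python
from html.parser import HTMLParser


class LangDeclarationParser(HTMLParser):
    """Collects explicit language declarations found in an HTML resource.

    Hypothetical helper for illustration; it is not part of DOSE.
    """

    def __init__(self):
        super().__init__()
        self.content_language = None  # from <meta http-equiv="Content-Language">
        self.lang_attributes = []     # (tag, lang) pairs in document order

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)      # html.parser lowercases tag and attribute names
        http_equiv = (attributes.get("http-equiv") or "").lower()
        if tag == "meta" and http_equiv == "content-language":
            self.content_language = attributes.get("content")
        if attributes.get("lang"):
            self.lang_attributes.append((tag, attributes["lang"]))


def declared_languages(html_text):
    """Return (document-level language or None, list of (tag, lang) pairs)."""
    parser = LangDeclarationParser()
    parser.feed(html_text)
    parser.close()
    return parser.content_language, parser.lang_attributes


sample = (
    '<html><head><meta http-equiv="Content-Language" content="it"></head>'
    '<body><H1 lang="fr">Le Bilboquet</H1>'
    "<P>C'était un vieux passe-temps…</P></body></html>"
)
print(declared_languages(sample))  # ('it', [('h1', 'fr')])
```

Elements without their own `lang` value, such as the `<P>` here, do not appear in the result; their language has to come from inheritance or from analysis of the text.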

Language-independent contents mapping

- Investigate the form/meaning relationship
- Ontology design is crucial
- Three main requirements
  1. Consistency (based on linguistic evidence)
  2. Flexibility (meaningful for all languages)
  3. Extendibility (easy addition of new languages)

Ontology models

- Conceptual: founded upon general knowledge
- Language-based: built on a particular language
- Interlingua: a combination of the above two
- None is definitely superior for multilinguality

The DOSE platform

Distributed Open Semantic Elaboration platform

- Key features
  - Modularity
  - Scalability
  - Semantic integration
- Main functionalities offered
  - Annotation
  - Search

DOSE: layered view

[Figure: DOSE layered architecture. Layers: Front-end layer, Service layer, Back-end layer. Modules: Indexer, Search Engine, Semantic Mapper, Fragment Retriever, Substructure Extractor, Annotation Repository, Ontology, Synset]

DOSE: distributed view

[Figure: DOSE distributed view. Modules: Ontology, Synset, Fragment Retriever, Substructure Extractor, Semantic Mapper, Annotation Repository, Indexer, Search Engine, connected through an XML-RPC infrastructure]

DOSE: annotation

[Figure: modules involved in annotation: The Web, Fragment Retriever, Substructure Extractor, Semantic Mapper, Indexer]

DOSE: search

[Figure: modules involved in search: Search Engine, Semantic Mapper, Fragment Retriever, The Web]

DOSE and multilinguality

- Traditionally: a new ontology for each different language
- DOSE: the ontology language is totally independent of the synset language
  - Use synsets to store lexical representations only
  - Let the ontology focus on knowledge modeling

Practical requirements for multilinguality

- Indexing
  - Recognize the language of resources and set up the system accordingly
  - Store language IDs with annotations
- Search
  - Interpret user queries coming in natural languages
  - Allow for cross-language search tasks

Extension to language

- Proposed approach: one ontology, many synsets
  - A concept is expressed by a different synset for each supported language
  - Each synset contains multiple lexical representations of a related concept in a single language
- Separate semantic and textual layers

Extension to language (cont’d)

[Figure: one concept, three synsets]
- Italian: lavoro, stipendio, datore di lavoro, …
- English: salary, job, employment, …
- French: travail, chômeur, …
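The "one ontology, many synsets" idea can be sketched as a small data structure in Python. This is only an illustration under assumptions: the concept identifier `employment` is hypothetical (the slide does not name the concept), the word lists are the ones shown on the slide, which are themselves elided, and the function names are not taken from DOSE.

```python
# Hypothetical concept identifier; the slide does not name the concept.
CONCEPT = "employment"

# One synset per supported language for the same ontology concept.
# Words are the ones shown on the slide (which elides the rest with "…").
SYNSETS = {
    CONCEPT: {
        "it": {"lavoro", "stipendio", "datore di lavoro"},
        "en": {"salary", "job", "employment"},
        "fr": {"travail", "chomeur"},
    }
}


def concepts_for(word, language, synsets=SYNSETS):
    """Return the concepts whose synset in `language` contains `word`."""
    word = word.lower()
    return sorted(
        concept
        for concept, by_language in synsets.items()
        if word in by_language.get(language, set())
    )


def add_language(synsets, language, words_by_concept):
    """Adding a language only adds synsets; the concepts stay untouched."""
    for concept, words in words_by_concept.items():
        synsets.setdefault(concept, {})[language] = set(words)


print(concepts_for("lavoro", "it"))  # ['employment']
print(concepts_for("job", "en"))     # ['employment']
print(concepts_for("job", "it"))     # []
```

Both "lavoro" and "job" resolve to the same concept, while the lookup stays language-specific: "job" is not matched when the fragment is known to be Italian.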

Advantages

- Reduced implementation requirements
  - Ontology design
  - Resource occupation
- Simplicity (in ontology management)
- Flexibility
  - A new language just brings a new bag of synsets
- Expansion of indexing word set

Language recognition

- Proposed approach
  - Retrieve language IDs whenever present
  - Otherwise, recognize language(s)
- Design constraints
  - To be activated in the annotation phase
  - Refined at the document substructure level
  - Has to deal with the typically low authoring quality of Web documents

Language recognition (cont’d)

1. Validate explicit request

2. Retrieve “lang” value

3. Guess via heuristics

4. Retrieve from ancestor

5. Accept default

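[Figure: examples for the steps above. Recoverable labels: “Russian”, “Hindi synset”]
- “There was an Old Man of Coblenz, The length of whose legs was immense…” is recognized as English
- default = "it" gives Italian
- <H1 lang="fr">Le Bilboquet</H1><P>C’était un vieux passe-temps… means <P> is French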
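The five steps can be read as a fallback chain. The Python sketch below is one possible reading, not the thesis implementation: it assumes that "validating" an explicit request means checking that a synset exists for the requested language, and it uses a tiny stopword count as a stand-in for the heuristics, which the slides do not describe. `STOPWORDS`, `guess_by_stopwords` and `resolve_language` are hypothetical names.

```python
import re

# Illustrative stopword lists only; a real detector would need far more data.
STOPWORDS = {
    "en": {"the", "of", "was", "an", "and", "whose"},
    "it": {"il", "di", "che", "è", "un", "per"},
    "fr": {"le", "un", "était", "des", "les", "et"},
}


def guess_by_stopwords(text, min_hits=2):
    """Stand-in heuristic: pick the language with the most stopword hits."""
    words = re.findall(r"\w+", text.lower())
    hits = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] >= min_hits else None


def resolve_language(text, supported, requested=None, lang_attr=None,
                     ancestor_lang=None, default="it"):
    """Walk the five steps in order and return (language, step that decided)."""
    # 1. Validate explicit request (assumed: a synset exists for that language).
    if requested is not None and requested in supported:
        return requested, "explicit request"
    # 2. Retrieve the "lang" value declared on the fragment itself.
    if lang_attr:
        return lang_attr, "lang attribute"
    # 3. Guess via heuristics.
    guessed = guess_by_stopwords(text)
    if guessed is not None:
        return guessed, "heuristics"
    # 4. Retrieve from ancestor.
    if ancestor_lang:
        return ancestor_lang, "ancestor"
    # 5. Accept default.
    return default, "default"


limerick = "There was an Old Man of Coblenz, The length of whose legs was immense…"
print(resolve_language(limerick, {"it", "en"}))                    # ('en', 'heuristics')
print(resolve_language("Gomuku", {"it", "en"}, ancestor_lang="fr"))  # ('fr', 'ancestor')
print(resolve_language("Gomuku", {"it", "en"}))                    # ('it', 'default')
```

Returning the deciding step next to the language makes it easy to count how often each step is actually used when a batch of documents is processed.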

Current implementation

- A new English synset to couple with a disability ontology (~500 concepts)
- A set of 20 bilingual documents (Italian, English) on disability
- A basic Language Detector XML-RPC module implemented in Java
- Testing scenarios
  - Parallel annotation
  - Language recognition

Implementation work

- Language Detector module (Java, ~1000 lines of code)
- Additions to pre-existing modules (Java, ~1000 lines of code)
- English synset (RDF, ~3500 lines of code)
- ~24 MB of annotations produced
- Simulation results analysis (a 600x40 .XLS for <BODY>, a 925x250 .XLS for <Hx>)

Multilingual DOSE in action

Parallel annotation

- Two parallel documents have
  - The same structure elements with the same contents
  - Two different languages of expression
- Goal: demonstrate that two sets of parallel documents are (almost) symmetrically mapped to the same concepts (“parallel annotation”)
- Both sets indexed separately, with language explicitly specified

Parallel annotation (cont’d)

- Test methodology: “Vector Space Model”
- Document fragments described as vectors
  - Dimensions are ontology concepts
  - Components are weighted (tf/idf) occurrences of such concepts
- The correlation between two fragments is quantified as the cosine of the angle between their vectors
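A minimal Python sketch of how such concept vectors might be built. The slides only say "tf/idf"; the specific weighting used here, raw term frequency times log(N / df), is one common variant and is an assumption, as are the function name `tfidf_vectors` and the sample fragment identifiers.

```python
import math
from collections import Counter


def tfidf_vectors(fragments):
    """fragments: {fragment_id: list of concept names found in it}.

    Returns {fragment_id: {concept: weight}} with weight = tf * log(N / df),
    where df is the number of fragments mentioning the concept.
    """
    n = len(fragments)
    df = Counter()
    for concepts in fragments.values():
        df.update(set(concepts))          # count each concept once per fragment
    vectors = {}
    for frag_id, concepts in fragments.items():
        tf = Counter(concepts)            # occurrences inside this fragment
        vectors[frag_id] = {c: tf[c] * math.log(n / df[c]) for c in tf}
    return vectors


# Hypothetical fragments; concept names are the two shown on the slides.
fragments = {
    "p[1]": ["Part-time job", "Part-time job", "Retirement"],
    "p[2]": ["Retirement"],
    "p[3]": ["Retirement"],
}
vectors = tfidf_vectors(fragments)
print(vectors["p[1]"])
```

With this variant, a concept that occurs in every fragment ("Retirement" here) gets weight 0, so it no longer helps to tell fragments apart.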

Parallel annotation (cont’d)

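[Figure: two parallel fragments drawn as vectors over the concept axes X and Y; the correlation is measured between the Italian and the English vector]
- IT /html/body/p[3]: X: Part-time job (2.5), Y: Retirement (0)
- EN /html/body/p[3]: X: Part-time job (1.5), Y: Retirement (1.5)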
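As a worked check of the example above, the cosine between the two fragment vectors can be computed directly. The sketch is in Python; the function name `cosine` and the choice to return 0 for an empty vector are assumptions made here, not details from the thesis.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors ({concept: weight})."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for fragments with no concepts
    return dot / (norm_u * norm_v)


# Weights taken from the slide for /html/body/p[3] in the two languages.
italian = {"Part-time job": 2.5, "Retirement": 0.0}
english = {"Part-time job": 1.5, "Retirement": 1.5}
print(round(cosine(italian, english), 3))  # 0.707
```

The dot product is 2.5 × 1.5 = 3.75 and the norms are 2.5 and √4.5, which gives a correlation of about 0.707, i.e. an angle of 45 degrees between the two fragments.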

Parallel annotation results at <BODY> level

[Figure: distribution of the correlation factor (0 to 1) for parallel fragments and for the other fragment pairs]

Correlation results at <BODY> level

[Figure: correlation factor for each pair of Italian and English pages, shown in bands 0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8]

Correlation results at <BODY> level (alt)

[Figure: alternative view of the correlation factor between Italian and English pages, shown in bands 0.2-0.4, 0.4-0.6, 0.6-0.8]

Parallel annotation results at <Hx> level

[Figure: distribution of the correlation factor (0 to 1) for parallel fragments and for the other fragment pairs]

Parallel annotation: notes

- Parallel and nonparallel pairs can be grouped as two different distributions, i.e. Gaussian distributions
- Average values of the two distributions are clearly separated, both for <BODY> and <Hx> levels
  - This proves that the indexing system is able to annotate relevant document fragments independently of language

Language recognition

- Separate testing on the same document set
- Italian and English documents are alternated in batch processing
  - Avoid reuse of default settings for contiguous documents of the same language
- Two ways to retrieve ancestor language
  - Via Annotation Repository (acceptable)
  - Via a “Language Stack” (still inefficient)

Annotation Repository vs. Language Stack

<H1 lang="it">Passatempi</H1>

<H2 lang="en">Board Games</H2>

<P>Gomuku</P><P>Dama</P>…

- All cyan, underlined words are to be annotated (they are included in the synsets)
- Language Stack: Dama is ignored (language “en” inherited from <H2>)
- Annotation Repository: Dama is annotated (language “it” inherited from <H1>, annotated)

Language recognition results (via Annotation Repository)

[Figure: hit percentage (%) and hit average (%) over the analyzed pages (1 to 39)]

Conclusions

- Typical issues discussed
- Overall validity of the approach shown
- Further work and improvements
  - Synset composition
  - Annotation testing with more languages
  - Optimize proposed language recognition techniques, add new ones

Thank you…

Questions?

Language recognition (2)

[Figure: hit percentage and hit average over the analyzed pages (1 to 40), Annotation Repository (“Anno”) compared with Language Stack (“Stack”)]
