Semantic Enrichment of Mappings


Semantic Enrichment of Mappings

Patrick Arnold

WDI-Lab, Abteilung für Datenbanken, Universität Leipzig


Outline


1. Motivation
2. Goals
3. Related Work
4. Determining the Relation Type
5. Implementation
6. First Results
7. Conclusions


1. Motivation

Classic approaches in schema/ontology matching provide only little information about the correspondences: source node, target node, confidence.
Further details are commonly omitted:
  What kind of relation? (equal, is-a, part-of, overlap)
  Simple correspondence vs. complex correspondence? ((first name, last name) ↔ name)
  Transformation functions? (gross price = net price * (1 + sales taxes); name = first name + “ ” + last name)
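
As a minimal illustration, the two transformation functions named above could be written as follows (function names, signatures and the tax rate are assumptions for illustration, not part of the talk):

```python
# Illustrative sketch of the two transformation functions named above;
# names and parameters are assumptions, not from the talk.
def gross_price(net_price: float, sales_tax_rate: float) -> float:
    # gross price = net price * (1 + sales taxes)
    return round(net_price * (1 + sales_tax_rate), 2)

def full_name(first_name: str, last_name: str) -> str:
    # name = first name + " " + last name
    return first_name + " " + last_name

print(gross_price(100.0, 0.19))        # 119.0
print(full_name("Patrick", "Arnold"))  # Patrick Arnold
```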


1. Motivation

Our intention: mapping enrichment
  Enhance a mapping by adding further or more specific information to its correspondences
  Useful for merging and transforming schemas/ontologies
Workflow:
  Input: a mapping
  Mapping enrichment is carried out in an independent system (black box)
  Output is an enriched mapping, which implies a new, more specific format


1. Motivation

Typical relation types: equal, is-a, part-of, overlap
Inverse types: equal, inverse is-a, has-a, overlap
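
A small sketch of how these relation types and their inverses could be represented (class and constant names are assumptions for illustration, not from the talk):

```python
from enum import Enum

class Relation(Enum):
    EQUAL = "equal"
    IS_A = "is-a"
    INVERSE_IS_A = "inverse is-a"
    PART_OF = "part-of"
    HAS_A = "has-a"
    OVERLAP = "overlap"

# Reading a correspondence in the other direction flips is-a/inverse is-a and
# part-of/has-a, while equal and overlap are their own inverses.
INVERSE = {
    Relation.EQUAL: Relation.EQUAL,
    Relation.IS_A: Relation.INVERSE_IS_A,
    Relation.INVERSE_IS_A: Relation.IS_A,
    Relation.PART_OF: Relation.HAS_A,
    Relation.HAS_A: Relation.PART_OF,
    Relation.OVERLAP: Relation.OVERLAP,
}

assert INVERSE[Relation.IS_A] is Relation.INVERSE_IS_A
```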


2. Goals

First focus: detecting the relation type of a correspondence
  Investigate linguistic methods on the element level
  Extension by existing strategies is possible
  Relation types: equal, is-a, inverse is-a
Later:
  Relation type detection on the instance level
  Exploiting background knowledge
  Correspondence type, transformation rules, …


3. Related Work

Several projects deal with this problem, mainly based on the following methods:
  Using dictionaries, thesauri, corpora (WordNet, GermaNet); includes tokenization, normalization of strings, etc.
  Using background knowledge (The Open University: using Swoogle to retrieve multiple ontologies referring to a concept)
  Exploiting the structure between ontologies
  Exploiting reasoning, Bayes nets, feature vectors, etc.
  Search engines (Google)


3. Related Work

SMatch
  Complex strategy using WordNet to determine the following relations: equal, more general, less general, overlap, mismatch
  “Overlap” offers little interesting information (the concepts are somehow related…)
  Approach: annotate each word in a label with all meanings of this word found in WordNet
  Compare/match the meanings of the words
  Exploit the relations offered by WordNet
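
A rough sketch of this idea using NLTK's WordNet interface; this is a simplification for illustration, not the actual S-Match implementation:

```python
# Rough sketch of the WordNet-based comparison described above (simplified,
# not the S-Match code). Requires: pip install nltk; nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def hypernyms_of(sense):
    """All (transitive) hypernyms of a WordNet sense."""
    return {h for path in sense.hypernym_paths() for h in path}

def word_relation(word_a: str, word_b: str) -> str:
    for sa in wn.synsets(word_a, pos=wn.NOUN):      # all meanings of word A
        for sb in wn.synsets(word_b, pos=wn.NOUN):  # all meanings of word B
            if sa == sb:
                return "equal"           # shared sense: the words are synonyms
            if sb in hypernyms_of(sa):
                return "less general"    # A is-a B
            if sa in hypernyms_of(sb):
                return "more general"    # B is-a A
    return "unknown"

print(word_relation("car", "vehicle"))   # less general
```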


3. Related Work

TaxoMap
  Focus on geographic ontologies
  Detects the relations equal, is-a, inverse is-a and is-close
  Focus is rather on the correspondence itself, not on the type
  Is-a relation if the label of node S appears in node T as a full word (see the sketch below)
  Uses WordNet as an additional source, working on manually pre-defined branches of WordNet instead of the entire thesaurus
  Useful for domain-specific ontologies
  Recall: 23 %, Precision: 83 %
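
A minimal sketch of the full-word rule mentioned above (a simplification for illustration, not TaxoMap's actual strategy; the direction of the derived is-a is an assumption here):

```python
# Sketch of the full-word containment rule (simplified; not TaxoMap's code).
def contains_as_full_word(short_label: str, long_label: str) -> bool:
    """True if short_label occurs in long_label as a complete token."""
    return short_label.lower() in long_label.lower().split()

# The longer label is taken to denote the more specific concept:
print(contains_as_full_word("ship", "container ship"))  # True  -> container ship is-a ship
print(contains_as_full_word("ship", "friendship"))      # False -> no full word, no is-a
```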


3. Related Work

LogMap
  Uses reasoning algorithms to repair/discover mappings, based on Horn logic and the Dowling-Gallier algorithm
  Uses background knowledge (thesauri)
  Detects full correspondences and weak correspondences
  No specific relation detection per se


4. Relation Type Determination: 4.1 Introduction

Typically, there is no link between the syntax and semantics of words:
  stool, chair, seat, … refer to the same object
  stool, school, tool, pool, wool, … have nothing in common!
Things change when it comes to compounds:
  blackbird is a bird
  high school is a school


4. Relation Type Determination: 4.1 Introduction

Compound: two words A, B of a language form a new word AB
  apple + tree → apple tree
  sun + glasses → sunglasses
  forth + with → forthwith
A and B can be nouns, verbs, adjectives/adverbs or prepositions
We are normally interested in nouns


4. Relation Type Determination: 4.1 Introduction


The following are not compounds (see the sketch after this list):
  Compositions AB where A (or B) is not an official word: broom, nausea
  Derivations: discard, unload, increase, compound
  Compositions AB where A and B are not semantically related: door (do + or), wither (wit + her)
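
A minimal sketch of the corresponding check for closed-form terms, using a toy word list as a stand-in for a real dictionary; note that it filters the first class (broom) but not the third (door):

```python
# A closed-form term AB is only a compound candidate if it can be split into
# two dictionary words A and B. The word list below is a toy stand-in.
DICTIONARY = {"black", "bird", "hand", "bag", "do", "or", "room", "wit", "her"}

def compound_splits(term: str):
    """All splits of term into two dictionary words (A, B)."""
    return [(term[:i], term[i:]) for i in range(1, len(term))
            if term[:i] in DICTIONARY and term[i:] in DICTIONARY]

print(compound_splits("blackbird"))  # [('black', 'bird')] -> compound candidate
print(compound_splits("broom"))      # [] -> 'b' is not a word, no compound
print(compound_splits("door"))       # [('do', 'or')] -> still a false positive:
                                     #   the parts are not semantically related
```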


4. Relation Type Determination: 4.1 Introduction

Unlike for non-compounds, the semantics can generally be derived from the compound's syntax, especially for nouns:
  blackboard is a board
  handbag is a bag
Germanic languages are left-branching (see the sketch below):
  Germanic: school bus, central intelligence agency
  Romance: rio de las palmas (= palm river)
In English, no changes are applied to the words:
  German: Ort + Eingang → Ortseingang, Stadt + Bau → Städtebau
  English: city + limit → city limit, city + planning → city planning
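
Because English compounds keep the head on the right, the is-a target of an open-form compound can be read off directly; a minimal sketch (the helper names are illustrative):

```python
# Sketch of the head rule for English open-form compounds: the last token is
# the head, so "A B" is usually an is-a child of "B".
def head(label: str) -> str:
    return label.lower().split()[-1]

def modifiers(label: str) -> str:
    return " ".join(label.lower().split()[:-1])

print(head("central intelligence agency"))       # agency
print(modifiers("central intelligence agency"))  # central intelligence
# -> 'central intelligence agency' is-a 'agency'
```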


4. Relation Type Determination: 4.2 Classification

From a linguistic point of view:
  Endocentric: A+B denotes something more specific than B. Examples: lecture hall, blackboard. Relation (AB : B): AB ⊂ B
  Exocentric: A+B denotes something more specific than an unexpressed term C*. Examples: doughnut, buttercup. Relation (AB : B): AB ⊂ C
  Copulative / appositional: A+B denotes the sum of what A and B denote. Examples: bittersweet, Bosnia-Herzegovina, actor-director. Relation (AB : B): AB ⊂ (A ∪ B)
  * C ⊈ A, C ⊈ B, and AB is not related to B

4. Relation Type Determination: 4.2 Classification

From the English point of view:
  Closed form: database, playground, blackbird
  Hyphenated form: bus-driver, single-minded, small-appliance industry
  Open form: web space, container ship, computer scientist
From a POS point of view: noun-noun, adjective-noun, verb-verb, …


4. Relation Type Determination: 4.3 First Conclusions

From the knowledge gained so far, we can enrich correspondences in schemas in two ways (see the sketch below):
  (1) Set the relation type to is-a instead of equal
  (2) Remove or at least doubt an existing correspondence
For (1) we assume that AB ⊂ B:
  (cookbook, book, 0.8, equal) → (cookbook, book, 0.8, is-a)
For (2) we assume that if A is not a word in AB, the correspondence is likely to be false:
  (stool, tool, 0.9, equal) → false?
  (refund, fund, 0.7, equal) → false?
  (discharge, charge, 0.7, equal) → false?
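
A sketch of both rules applied to (source, target, confidence, type) tuples; the word list is a toy stand-in for a dictionary lookup, not the talk's implementation:

```python
# Sketch of rules (1) and (2) above; not the actual implementation.
DICTIONARY = {"cook", "book", "black", "board", "hand", "bag"}

def enrich(source: str, target: str, confidence: float, rel_type: str):
    src, tgt = source.lower(), target.lower()
    if src.endswith(tgt) and src != tgt:
        modifier = src[:-len(tgt)]
        if modifier in DICTIONARY:
            # (1) AB ⊂ B: valid compound, so the relation is is-a, not equal
            return (source, target, confidence, "is-a")
        # (2) A is not a word: the correspondence itself is doubtful
        return (source, target, confidence, "doubtful")
    return (source, target, confidence, rel_type)

print(enrich("cookbook", "book", 0.8, "equal"))  # ('cookbook', 'book', 0.8, 'is-a')
print(enrich("stool", "tool", 0.9, "equal"))     # ('stool', 'tool', 0.9, 'doubtful')
```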


4. Relation Type Determination: 4.4 Mismatches

A word changed its spelling over the centuries:
  butterfly (“flutter-by”, “beat fly”, …)
  Weiße Elster (from Czech: alstra = water)
A compound has a non-literal (metaphorical) meaning:
  Completely different meaning: computer mouse, gravy train, buttercup
  Obvious origin (related in a broad sense): airport, birdhouse, downtown, snowman


4. Relation Type Determination: 4.4 Mismatches

Inaccuracies in (vernacular) language, e.g., in biology: strawberry, blackberry, raspberry, etc.
  None of them is a berry in the biological sense (while tomato, banana, grape, pumpkin, melon, etc. are)


4. Relation Type Determination: 4.4 Mismatches

For detecting the relation type, the mismatch problem has no negative effect on the mapping; the correspondence is wrong after all:
  (buttercup, cup, equal) is as wrong as (buttercup, cup, is-a)
Enrichment has no negative effect on the mapping per se
Still, enhanced methods can be used to reduce the mismatches


5. Implementation: 5.1 Goals

Specify the following relation types using linguistic methods: equal (default), is-a, inverse is-a
  Missing: part-of and overlap
English and German language, with the main focus on English
Possibly apply mapping repair: remove correspondences that seem clearly wrong
Test & evaluation


5. Implementation: 5.1 Goals

First concentrate on the element level
  Use linguistic knowledge as presented before
  Different cases have to be distinguished: single items vs. itemizations


5. Implementation: 5.2 Cases

Simple case (1:1): source and target node consist of one item
  blackboard ↔ board
  high school ↔ school
  international database conference ↔ conference


5. Implementation: 5.2 Cases

Complex cases (1:n, n:1, n:m): source/target node consists of several items
  blackboard, whiteboard ↔ board
  wine ↔ white wine, red wine
  beer, wine ↔ wine
  computers, laptops ↔ computers


5. Implementation: 5.3 Node Level vs. Path Level

The relation type depends on the perspective: node level vs. path level
The relation is often:
  is-a on the node level
  equal on the path level


5. Implementation: 5.3 Node Level vs. Path Level

Example (two category trees per mapping, shown side by side on the slide):
  Source: Apparel > Children > Shoes, Caps, …   vs.   Target: Apparel > Children Shoes, Caps, …
  Source: Kids > Apparel > Shoes, Caps, …   vs.   Target: Clothing > Children Shoes, Caps, …
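
One way to make the distinction concrete is to compare concatenated path labels instead of the node labels alone; a sketch in which the path separator and normalization are assumptions:

```python
# Sketch: node-level vs. path-level comparison. A node's path label is the
# concatenation of all labels from the root down to the node.
def path_label(path):
    return " ".join(path).lower()

source_path = ["Apparel", "Children", "Shoes"]
target_path = ["Apparel", "Children Shoes"]

print(path_label(source_path) == path_label(target_path))  # True: equal on path level
print(source_path[-1], "vs.", target_path[-1])  # on node level, 'Children Shoes'
                                                #   is rather an is-a child of 'Shoes'
```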


5. Implementation: 5.4 Requirements

Benchmarks / gold standards (English language), manually defined
Dictionary / thesauri
A more specific data structure (see the sketch below):
  Correspondence: source node, target node, confidence, type
  Node: a list of items
  Item: a list of words
  Word: single word vs. compound
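
A minimal sketch of such a data structure; class and field names are assumptions for illustration, not the talk's implementation:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Word:
    text: str
    is_compound: bool = False      # single word vs. compound

@dataclass
class Item:
    words: List[Word] = field(default_factory=list)  # an item is a list of words

@dataclass
class Node:
    items: List[Item] = field(default_factory=list)  # a node is a list of items

@dataclass
class Correspondence:
    source: Node
    target: Node
    confidence: float
    relation_type: str = "equal"   # equal, is-a, inverse is-a, ...
```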


5. Implementation: 5.5 Generating Benchmarks

Benchmarks
  More difficult than for standard mappings
  In some cases, even for humans difficult to decide:
    Birdhouse is a house? Airport is a port?
How to judge correspondences in an evaluation?
  car = bike → FALSE
  car = auto → TRUE
  motorbike ⊂ bike → ?


5. Implementation: 5.6 Challenges

Exocentric compounds: airport, buttercup, saw tooth, …
Compounds in itemizations (see the sketch below):
  (French wine, German wine ↔ French wine): inverse is-a
  (French wine, German wine ↔ European wine): is-a
  (French wine, German wine ↔ Mosel wine): overlap
  (French wine, German wine ↔ Italian wine): mismatch
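
A sketch of why these cases are hard: the itemization is split into items and each item is compared individually, but aggregating the item-level results into a node-level relation is left open here, and some relations (e.g. French wine is-a European wine) are not derivable from the compound structure alone. The helper names are illustrative.

```python
# Sketch: split an itemization into items and test each item against the other
# node with a compound-based is-a check; aggregating these item-level results
# into a node-level relation (is-a, inverse is-a, overlap, mismatch) is the hard part.
def split_items(label: str):
    return [item.strip() for item in label.split(",")]

def item_is_a(item: str, other: str) -> bool:
    """Compound-based check: 'French wine' is-a 'wine' (full-word suffix)."""
    a, b = item.lower().split(), other.lower().split()
    return len(b) < len(a) and a[-len(b):] == b

source = split_items("French wine, German wine")
print([item_is_a(i, "wine") for i in source])           # [True, True]
print([item_is_a(i, "European wine") for i in source])  # [False, False] -> the is-a to
                                                        #   'European wine' needs background knowledge
```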


5. Implementation: 5.6 Challenges

Plurals (see the normalization sketch below): (Christian churches ↔ church), (red wine, white wine ↔ wines)
Short forms: infant colic ↔ colic (equal instead of is-a)
Node level vs. path level: a compound extending/skipping levels in the schema
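
For the plural problem, a simple normalization step could reduce labels to their singular forms, for example with NLTK's WordNet lemmatizer (an assumption about the toolkit, not the talk's implementation):

```python
# Possible normalization for the plural problem.
# Requires: pip install nltk; nltk.download('wordnet').
from nltk.stem import WordNetLemmatizer

lemmatize = WordNetLemmatizer().lemmatize

def normalize(label: str) -> str:
    """Lower-case the label and reduce every token to its singular (noun) form."""
    return " ".join(lemmatize(token) for token in label.lower().split())

print(normalize("Christian churches"))  # christian church
print(normalize("wines"))               # wine
```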


5. Implementation: 5.6 Challenges

Limited recall
  Strong dependency on the input (mapping)
  Some is-a relations cannot be detected with simple linguistic methods: (car, vehicle), (wine, beverage), (cell phones, communication devices)


6. First Results

Web ↔ Yahoo
  421 correspondences, 68 subset correspondences
  Found 50 subset relations, 34 of them correct
  Recall: 50.0 %, Precision: 68.0 %, F-measure: 59.0 %


6. First Results

Google Health ↔ Yahoo Health (excerpt)
  396 correspondences, 31 subset correspondences
  Found 20 subset relations, 15 of them correct
  Recall: 48.3 %, Precision: 75.0 %, F-measure: 61.6 %


6. First Results

Main issues observed:
  Imprecise labels:
    infant colic ↔ colic (equal)
    Uterine-Fibroids ↔ Uterus.Fibroids (equal)
    picture frames ↔ frames (equal in the field “arts”)
  Node vs. path discrepancies
  “No-compound” subsets: vehicle ↔ car (is-a)


7. Conclusions

Mapping enrichment:
  Relation type
  Simple vs. complex correspondences
  Transformation rules
Relation type determination:
  Linguistic approach on the element level
  Compounds, itemizations
Advanced methods:
  Instance level, background knowledge, etc.
  Increase recall, keep up precision


Discussion

Thank You!