
Benchmarks: From Usage To Evaluation

Schema Matching and Mapping Systems

Yannis Velegrakis (University of Trento), Angela Bonifati (CNR)

EDBT 2011, Uppsala, Sweden, March 21st-25th

2 EDBT 2011, A. Bonifati & Y. Velegrakis

Talk Outline

Introduction

Matching and mapping: techniques & tools

Benchmarks and evaluation principles

Designing a matching & mapping benchmark

Using real-world scenarios for evaluation

Generating synthetic evaluation scenarios

Measuring efficiency and effectiveness

Conclusions and future directions


Introduction

Data is inherently heterogeneous

Due to the explosion of online data repositories

Due to the variety of users, who develop a wealth of applications

At different times

With disparate requirements in mind

A fundamental requirement is to translate data across different formats

How data is transformed from one format to another is specified through mappings


Mappings Are All Around

Data integration [Lenzerini 2002]

to specify the relationship between local and global schemas

[Figure: sources S1, S2, S3 with instances I1, I2, I3 mapped to a global schema T.]


Mappings Are All Around

Schema integration [Batini et al. 1986]

to specify the relationship between the input schemas and the integrated schema

[Figure: input schemas S1, S2, S3 merged into an integrated schema.]


Mappings Are All Around

Data exchange [Fagin et al. 2005]

to specify the relationship between source and target schemas

[Figure: a source schema S and a target schema T connected by mappings; an instance I of S is translated into an instance J of T.]


Mappings Are All Around

Schema evolution [Lerner 2000]

to specify the relationship between the old and new version of an evolved schema

[Figure: an evolving schema S1 with successive versions S1' and S1''.]


How Did It All Start

One of the first systems to deal with this problem was EXPRESS (EXtraction, Processing and REStructuring System), developed at IBM in 1977 [Shu et al. 1977]. It consists of two languages:

DEFINE, which works as a DDL (Data Definition Language)

CONVERT, which works as a DTL (Data Translation Language) and has a total of 9 operators, each of which receives as input a data file, performs the respective transformation, and generates an output data file

EXPRESS required the user's familiarity with the languages and was customized to only one model (hierarchical)

After that, inter-model transformations were also studied [Tork-Roth et al. 1997] [Atzeni et al. 1997]


Emphasis on Data Translation

[Abiteboul et al. 1997] proposed a declarative framework for data translation

[Davidson et al. 1997] focused on constraint satisfaction

[Milo et al. 1998] leveraged a library of transformation rules and pattern-matching techniques

[Cluet et al. 1998] emphasized type-checking

[Beeri et al. 1999] focused on tree-based transformations for XML data structures


A Data Transfer Example

Source: Rcd
  projects: Set of
    project: Rcd
      name
      status
  grants: Set of
    grant: Rcd
      gid
      project
      recipient
      manager
      supervisor
  contacts: Set of
    contact: Rcd
      cid
      email
      phone
  companies: Set of
    company: Rcd
      name
      official

Target: Rcd
  projects: Set of
    project: Rcd
      code
      funds: Set of
        fund: Rcd
          fid
          finId
  finances: Set of
    finance: Rcd
      finId
      mPhone
      company
  companies: Set of
    company: Rcd
      coid
      name

Source instance:

Projects
  name        status
  PIX         Active
  E-services  Active
  Clio        Inactive

Grants
  gid  project     recipient  manager      supervisor
  g1   PIX         AT&T       Fernandez    Belanger
  g2   PIX         AT&T       Shrivastava  Belanger
  g3   E-services  Bell-labs  Benedikt     Hull

Contacts
  cid          email                            phone
  Benedikt     benedikt@research.bell-labs.com  5827766
  Hull         hull@research.bell-labs.com      5824509
  Shrivastava  divesh@research.att.com          3608776
  Belanger     dgb@research.att.com             3608600
  Fernandez    mff@research.att.com             3608679

Companies
  name    official
  AT&T    AT&T Research Labs
  Lucent  Lucent Technologies, Bell Labs Innovations

Desired Target Instance

[Figure: the source and target schemas of the data transfer example above.]

Target instance:

Projects
  code: PIX
    Funds
      fid  finId
      g1   ???
      g2   ???
  code: E-services
    Funds
      fid  finId
      g3   ???

Finances
  finId  mPhone   company
  ???    3608679  ???
  ???    3608776  ???
  ???    5827766  ???

Companies
  coid         name
  Sk2(AT&T)    AT&T
  Sk2(Lucent)  Lucent
  ???          ???
  ???          ???
  ???          ???

The Needed Transformation Query

LET $doc0 := document("inputXMLfile")
RETURN
<T>
  { distinct-values (
      FOR $x0 IN $doc0/S/grant, $x1 IN $doc0/S/project,
          $x2 IN $doc0/S/contact, $x3 IN $doc0/S/contact
      WHERE $x2/cid/text() = $x0/manager/text()
        AND $x0/supervisor/text() = $x3/cid/text()
        AND $x0/project/text() = $x1/name/text()
      RETURN
      <project>
        <code> { $x0/project/text() } </code>
        { distinct-values (
            FOR $x0L1 IN $doc0/S/grant, $x1L1 IN $doc0/S/project,
                $x2L1 IN $doc0/S/contact, $x3L1 IN $doc0/S/contact
            WHERE $x2L1/cid/text() = $x0L1/manager/text()
              AND $x0L1/supervisor/text() = $x3L1/cid/text()
              AND $x0L1/project/text() = $x1L1/name/text()
              AND $x0/project/text() = $x0L1/project/text()
            RETURN
            <funding>
              <fid> { $x0L1/gid/text() } </fid>
              <finId> { "Sk52(", $x0L1/gid/text(), ", ", $x0L1/project/text(), ")" } </finId>
            </funding> ) }
      </project> ) }
  { distinct-values (
      FOR $x0 IN $doc0/S/grant, $x1 IN $doc0/S/project,
          $x2 IN $doc0/S/contact, $x3 IN $doc0/S/contact
      WHERE $x2/cid/text() = $x0/manager/text()
        AND $x0/supervisor/text() = $x3/cid/text()
        AND $x0/project/text() = $x1/name/text()
      RETURN
      <finance>
        <finId> { $x0/gid/text() } </finId>
        <mPhone> { $x2/phone/text() } </mPhone>
        <company> { "Sk46(", $x2/phone/text(), ", ", $x0/gid/text(), ")" } </company>
      </finance> ) }
  { distinct-values (
      FOR $x0 IN $doc0/S/grant, $x1 IN $doc0/S/project,
          $x2 IN $doc0/S/contact, $x3 IN $doc0/S/contact
      WHERE $x2/cid/text() = $x0/manager/text()
        AND $x0/supervisor/text() = $x3/cid/text()
        AND $x0/project/text() = $x1/name/text()
      RETURN
      <company>
        <coid> { "Sk46(", $x2/phone/text(), ", ", $x0/gid/text(), ")" } </coid>
        <name> { "Sk49(", $x2/phone/text(), ", ", $x0/gid/text(), ")" } </name>
      </company> ) }
  { distinct-values (
      FOR $x0 IN $doc0/S/company
      RETURN
      <company>
        <coid> { "Sk93(", $x0/cname/text(), ")" } </coid>
        <name> { $x0/cname/text() } </name>
      </company> ) }
</T>


The Road To Mapping Systems

The design of data transformations remained a manual task for a long while

Designers had to be familiar with the transformation language

As schemas became larger and more complex, the task became too laborious, time-consuming and error-prone

The need to raise the level of abstraction and to automate the task was soon recognized

The idea: Mapping Systems


Generating Mappings

Different techniques exist to generate mappings:

Manual, e.g.

by means of high-level mapping languages, such as [Bernstein et al. 2007]

by means of sophisticated user interfaces [Altova 2008]

Semi-automatic, e.g.

By means of designer guidance [Alexe et al. 2008]

Via advanced algorithms that do the reasoning in place of the mapping designer [Madhavan et al. 2001] [Popa et al. 2002] [Do et al. 2002] [Bonifati et al. 2008]


The First Step of a Mapping Task

[Figure: the source and target schemas of the data transfer example, shown side by side together with the source instance; the first step is to relate the elements of the two schemas.]

Matching

Given two schemas as input, a source and a target schema, matching is the process that produces as output a set of matches, also called correspondences or (simply) lines, between the elements of the two schemas

A match is a triple <Es, Et, e>, where Es is a set of elements of the source schema, Et is a set of elements of the target schema, and e specifies a simple relationship (equality or set inclusion) or a complex relationship between the elements in Es and Et
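In code, such a match can be modeled as a small record; a minimal sketch (the class and field names are illustrative, not taken from any of the tools discussed here):

```python
from dataclasses import dataclass

# A match <Es, Et, e>: two sets of schema elements plus the
# relationship e claimed to hold between them.
@dataclass(frozen=True)
class Match:
    source_elements: frozenset  # Es: elements of the source schema
    target_elements: frozenset  # Et: elements of the target schema
    relationship: str           # e: "=", "subset", or a complex expression

# A simple 1-1 equality match and a complex 1-1 match:
m1 = Match(frozenset({"Name"}), frozenset({"Title"}), "=")
m2 = Match(frozenset({"velocity"}), frozenset({"speed"}),
           "speed = velocity * 2.237")
```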


The Matching Relationship e

Depends on the cardinalities of Es and Et

Depends on the semantics:

Can be a function

Can be an arithmetic operation

Can be a set-theoretic relation (e.g., ≡ or overlaps)

Can be a conceptual modeling relationship (e.g. part-of, subclass-of)


Matching: An Alternative Definition

The matching process [Euzenat et al. 2007] can be seen as a function f that takes a pair of schemas S and T, an optional input alignment A, a set of matching parameters p, and a set of resources r, and returns an alignment A':

A' = f(S, T, A, p, r)

Ultimately, an alignment is a set of correspondences between elements in S and elements in T


Matching Examples

Simple relationships:
  Name ≡ Title
  Location ≡ Address

Complex relationships:
  speed = velocity x 2.237
  speed x 0.447 = velocity
  speed = concat(velocity x 2.237, 'MPH')
  speed ≥ velocity

[Figure: a source schema Company (Name, Location) matched to a target schema Organization (Title, Address).]


The matching process

Can be roughly divided into three steps:

Pre-match: training of classifiers for machine-learning-based matchers, setting of matching parameters (weights, thresholds), adjustment of resources such as thesauri and constraints

Match: the actual matching task

Post-match: the user may check and modify the displayed matches


Some Schema Matchers

Cupid [Madhavan et al. 2001]: based on structural and name similarity

S-Match [Giunchiglia et al. 2004]: based on semantic closeness

Coma++ [Aumueller et al. 2005]: based on matching reuse

LSD [Doan et al. 2001]: based on data value analysis and machine-learning techniques

iMap [Dhamankar et al. 2004]: suited for complex matching expressions e

Similarity Flooding [Melnik et al. 2002]: based on graph similarity
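None of the systems above is this simple, but the flavor of a name-based matcher (in the spirit of Cupid's name similarity) can be sketched with Python's standard library; the element names and the 0.55 threshold are illustrative assumptions:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Normalized string similarity between two element names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_by_name(source, target, threshold=0.55):
    """Return candidate 1-1 correspondences whose name similarity
    exceeds the threshold (a toy stand-in for a real matcher)."""
    return [(s, t, round(name_similarity(s, t), 2))
            for s in source for t in target
            if name_similarity(s, t) >= threshold]

# Element names from the running data-transfer example:
pairs = match_by_name(["name", "phone", "gid"],
                      ["cname", "mPhone", "fid", "status"])
```

A real matcher would combine several such similarity signals (structure, data values, thesauri) rather than rely on names alone.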


Similarity Flooding


COMA++


Matchings Are Not Enough

[Figure: the source and target schemas of the data transfer example, with correspondences drawn between their elements, together with the source instance.]

Matchings cannot describe the full details of the transformation

The Mapping Generation Process

[Figure: a Matcher takes the source and target schemas as input and produces matchings.]

Matching is just the beginning of any mapping generation process

Mappings

Given the source and the target schemas, mapping is the process that takes as input a set of matches between the elements of the two schemas and produces a relationship or constraint e that must hold between their respective instances

In other words, a mapping is a triple <S, T, e>, where S is the source schema, T is the target schema, and e specifies either a constraint that any pair of instances of S and T must satisfy, or an executable statement that transforms instances of S into instances of T


A Mapping Example

[Figure: the source and target schemas of the data transfer example.]

project(na,st), grant(gid,na,re,ma,su), contact(ma,em,ph)
  → project(na,FUND), fund(gid,finId), finance(finId,ph,company), company(company,name)

Mappings & Instances

Mappings are the basic ingredients of many tasks, such as information integration, P2P query answering, data exchange etc.

In particular, mappings as inter-schema constraints may not be enough to fully specify a unique target instance: there may exist multiple target instances satisfying the mappings

Finding the best target instance is the goal of the data exchange problem [Fagin et al. 2005]: the mapping is converted into an executable transformation script to obtain that particular instance


A Data Exchange Example

[Figure: the source and target schemas of the data transfer example.]

project(na,st), grant(gid,na,re,ma,su), contact(ma,em,ph)
  → project(na,FUND), fund(gid,finId), finance(finId,ph,company), company(company,name)

Target instance:

Projects
  code: PIX
    Funds
      fid  finId
      g1   ???
      g2   ???
  code: E-services
    Funds
      fid  finId
      g3   ???

Finances
  finId  mPhone   company
  ???    3608679  ???
  ???    3608776  ???
  ???    3608600  ???

Companies
  coid  name
  ???   AT&T
  ???   Lucent

The Mapping Generation Process

[Figure: the full pipeline. A Matcher takes the source and target schemas and produces matchings; guided by the user, a Mapping Generation Engine turns the matchings into mappings (dependencies); a Query Engine or a Data Exchange Engine compiles the mappings into transformation scripts that translate the source instance into the target instance.]

Research Prototype Systems

Mapping generation and data exchange are separate tasks: Clio [Popa et al. 2002], HePToX [Bonifati et al. 2010], Spicy [Bonifati et al. 2008]

Mapping generation: the mappings are expressed as high-level assertions in a logical formalism. A mapping is a source-to-target tuple-generating dependency (s-t tgd, for short):

  φ(x) → ∃y ψ(x, y)

where φ(x) (resp. ψ(x, y)) is a conjunction of atoms over the source (resp. target)

Data exchange: the respective module transforms the high-level mappings into transformation scripts (in SQL or XQuery) to generate the target instance
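To make the tgd semantics concrete, here is a hand-rolled check of a single s-t tgd on toy instances (not code from Clio, HePToX, or Spicy); the relations follow the student/advisor example used later in the tutorial, and encoding tuples as Python sets is an assumption made for the sketch:

```python
# Check the s-t tgd  PTStud(age, name) -> ∃facid Advised(name, facid):
# every part-time student's name must appear as an advised student in
# the target, for some (possibly unknown) faculty id.
def satisfies_tgd(ptstud, advised):
    advised_names = {sname for (sname, _facid) in advised}
    return all(name in advised_names for (_age, name) in ptstud)

source_ptstud = {(27, "Bob"), (30, "Ann")}
target_advised = {("Bob", "N3"), ("Ann", "N4")}  # N3, N4: labeled nulls

holds = satisfies_tgd(source_ptstud, target_advised)
broken = satisfies_tgd(source_ptstud, {("Bob", "N3")})  # Ann not advised
```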


Clio


Spicy


HepToX


Commercial Mapping Systems

[Figure: the same pipeline, with the matcher, the mapping generation engine, and the data exchange engine merged into a single Mapping Engine.]

Mapping generation and data exchange are merged into one step: the system directly creates the final transformation script in some native language

Popular Commercial Systems

Altova Mapforce

Stylus Studio

IBM Rational Data Architect

BizTalk mapper

Adeptia

BEA Aqualogic


Stylus Studio


Altova MapForce


Adeptia


BizTalk Mapper


IBM Rational Data Architect


A Mapping Tool Categorization

All tools provide to the mapping designer:

A graphical representation of the two schemas

A set of graphical transformation constructs

The granularity and power of these constructs is a main factor of differentiation among the tools

[Figure: a spectrum ranging from detailed specification by the designer (roughly, the commercial mapping tools) to intelligence of the mapping tool and effort in post-verification (roughly, the research prototypes).]


Issues in Data Exchange

When multiple target instances exist, how do we compute the best one?

Is a given target instance better than another?

Universal solutions

Introduced in [Fagin et al. 2005]

These are the “most general” target instances, and also represent the entire space of solutions

Among the universal solutions, the smallest and most compact one is called the “core”
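What makes a solution universal can be illustrated with a brute-force homomorphism test: labeled nulls may be renamed while constants must be preserved, and a universal solution maps homomorphically into every other solution. A toy sketch under invented conventions (nulls are strings starting with "N"; the instances and the constants F7, F8, F9 are made up for illustration):

```python
from itertools import product

def is_null(v):
    # Toy convention: labeled nulls are strings starting with "N".
    return isinstance(v, str) and v.startswith("N")

def homomorphism_exists(src, dst):
    """src, dst: sets of (relation, tuple) facts. Try every mapping of
    src's nulls to dst's values; constants must map to themselves."""
    nulls = sorted({v for (_, t) in src for v in t if is_null(v)})
    values = {v for (_, t) in dst for v in t}
    for image in product(values, repeat=len(nulls)):
        h = dict(zip(nulls, image))
        mapped = {(r, tuple(h.get(v, v) for v in t)) for (r, t) in src}
        if mapped <= dst:
            return True
    return False

core = {("Advised", ("Bob", "N3")), ("Advised", ("Ann", "N4")),
        ("Advised", ("John", "N1"))}
# Another solution that resolved the nulls to plain constants:
solution = {("Advised", ("Bob", "F7")), ("Advised", ("Ann", "F8")),
            ("Advised", ("John", "F9"))}
```

The core maps into the other solution (rename N3→F7, N4→F8, N1→F9), but not vice versa, since constants cannot be renamed.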


Universal Core Instances

S: Rcd
  PTStud: Rcd
    age
    name
  GradStud: Rcd
    age
    name

T: Rcd
  Advised: Rcd
    sname
    facid
  WorksWith: Rcd
    sname
    facid

Mappings:
  PTStud(x,y) → Advised(y,z)
  GradStud(x,y) → Advised(y,z), WorksWith(y,z)

Source instance:

PTStud
  age  name
  27   Bob
  30   Ann

GradStud
  age  name
  32   John
  30   Ann

A solution:

Advised
  sname  facid
  Bob    N3
  Ann    N4
  John   N1
  N1     Cathy

WorksWith
  sname  facid
  Bob    N3
  Ann    N4

A universal solution:

Advised
  sname  facid
  Bob    N3
  Ann    N4
  John   N1
  N2     Ann

WorksWith
  sname  facid
  Bob    N3
  Ann    N4

The core:

Advised
  sname  facid
  Bob    N3
  Ann    N4
  John   N1

WorksWith
  sname  facid
  Bob    N3
  Ann    N4

Commercial vs. Research Prototype Systems

Whereas research prototypes (e.g., Clio, Spicy++) tend to produce target instances that look more and more like the core,

commercial tools leave the task to the users, who have to manually interact with sophisticated GUIs and write pieces of the transformation by hand

No notion of a core is even considered ("Core? No, thanks")


Mixing Matching and Mapping

Matching and mapping are not always performed by separate tools

Clio has as an add-on a matcher based on attribute feature analysis [Naumann et al. 2002]

Bernstein's model management considers the matcher a fully integrated and indistinguishable component

Spicy [Bonifati et al. 2008] has a matcher based on instance-based structural analysis


Limitations of Current Systems

Manual approaches are not applicable to large-scale mapping tasks

The user/developer has to become familiar with the mapping language and the user interfaces

The outcome of the mapping process may not respect the user requirements and desired semantics (unsurprisingly!)

Specifications may be incomplete and dependent on system peculiarities

Thus, there is a need for a verification and guidance process


The Verification Process

[Figure: the mapping generation pipeline extended with a Verification and Selection step, in which the user checks the generated target instance against data examples and an expected target instance.]

A-Posteriori Verification

The main problem with matching and mapping is the dichotomy between the expected results and the generated answers

Some tools allow a post-verification

by using data examples

Tupelo [Fletcher et al. 2006], Muse [Alexe et al. 2008], Clio [Alexe et al. 2010]

by using an automatic instance comparison,

Spicy [Bonifati et al. 2008]

by means of manual user feedback

unfeasible for large-scale tasks

via debugging techniques

Routes [Chiticariu et al. 2006]


ETL systems

Extract-Transform-Load tools are data transformation tools based on graphical flowcharts with nodes encoding transformation primitives and edges encoding the transformation flow

Can be considered a special form of mapping system that generates transformations via:

  a GUI

  an intermediate language (an algebra for ETL)

  an output (the transformation scripts)

They are not mapping tools in the classical sense: they focus only on data transformation operators
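In the same spirit, a toy ETL flow can be written as composed row-level transformation primitives; the node names (not_null, surrogate_key, phone_format) echo the ETL flowchart but are otherwise illustrative, not from any ETL product:

```python
# Each node is a function on a list of rows; edges in the flowchart
# correspond to function composition.
def not_null(rows, field):
    # Filter out rows whose production key is missing.
    return [r for r in rows if r.get(field) is not None]

def surrogate_key(rows, field):
    # Replace the production key with a generated surrogate key SK(...).
    return [{**r, field: f"SK{i}"} for i, r in enumerate(rows, start=1)]

def phone_format(rows):
    # Normalize phone numbers by stripping dashes.
    return [{**r, "phone": r["phone"].replace("-", "")} for r in rows]

customers = [{"custkey": 10, "phone": "360-8776"},
             {"custkey": None, "phone": "582-7766"}]  # fails Not Null
loaded = phone_format(surrogate_key(not_null(customers, "custkey"),
                                    "custkey"))
```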


An ETL data flowchart

[Figure: an ETL flow over Customer.new and Customer.old with numbered nodes applying Not Null(CustKey), SK(custkey), PhoneFormat, and New − Old transformations, producing Cnew, Cold, and an error output.]

How Can One Decide If A Product Is Good?


Importance Of A Benchmark

Helps designers and developers to improve their tools, by assessing their usefulness and constantly evaluating their performance

Helps users to compare the different available tools and evaluate their suitability for their needs [Haas et al. 2007]

Helps researchers to compare themselves to others

Should exist for the long term, to allow adequate measurements and to track the evolution of the field

Helps assess absolute results: the properties of the results, and how they compare to the others


Benchmark

Well-designed tests (scenarios) with which the results of a system can be evaluated [Castro et al. 2004]

A standardized application scenario that serves as a basis for testing, evaluation, and comparison [Merriam-Webster]

Clearly specified scenarios that everyone can implement

Clearly specify the factors that are measured, and under what conditions they should be measured.

Should measure the degree of achievement

Should be reproducible and stable

Can be used repeatedly


Principles

Systematic Procedure

Continuity

Quality and equity

Dissemination

Intelligibility


Types of evaluation

Competence benchmarks

  Measure competence and performance with respect to a task

  Aim at characterizing the kind of tasks each method is good for

  For designers to improve their systems

Comparative evaluation

  Comparison of the results of various systems on a common task

  Aims at finding the best system; tuning of the systems is an issue

  Comparison of systems, aiming at general field improvement

Application-specific (competitive) evaluation

  Comparison of various systems on a specific task


Evaluation Steps

Planning

Specifying task, software, hardware, input, output

Processing

Analysis

Result evaluation according to predefined measures


Bottom Line: Benchmarks Are Great!


Generic Matching/Mapping Benchmark Goals

Compare in terms of

Performance

Usability

Effectiveness

Applicability to real-world scenarios

Improve the quality of the matching and mapping generation process


Query vs Matching/Mapping Benchmarks

Query benchmark:

  Evaluation scenarios: a setting (a database instance/schema + a query) and the outcome

  The query engine should support the scenarios (mainly, it should be able to evaluate the query of each scenario)

  Supporting the scenario: query engine result = expected outcome

  Good query engine = fast (correct) responses

  Vary the characteristics of the data instance to measure how well the engine scales

Schema mapping benchmark:

  For a mapping system, what is the input and what is the output?

  A mapping tool's input language should be able to express the transformations of interest. What are they?

  How do you compare a mapping system's output with an expected output?

  What do we measure? Effort? Expressiveness?

  What do we scale?

The Scenario Input

Source schema S

Target schema T

Possibly an instance of the source schema S

A specification of what we need to achieve

  Matching systems: typically there is no specification; just the source and the target schema, plus a complete set of matches assumed to be correct

  Mapping systems: a major issue


The Specification for Mapping Systems

An expected (desired) transformation

Mapping Systems try to guess it

Issues

No formal semantic framework to express it

No formal relationship to the outcome.

Note: query engine benchmarks (e.g., TPC-H or XMark) leverage the semantics of the query language

  It is clear what the scenario is asking

  It is clear how to compare the result sets


Expressing Desired Transformations

Natural Language

Too generic and ambiguous

Complete specification formalism (query?)

Defeats the purpose of a mapping system

Comparing the mapping generated by the tool to the precise specification amounts to deciding the equivalence of two mappings (a hard problem!)


Expressing Desired Transformations

Graphical interface

  Different constructs are offered by different tools

  Typically a GUI for the query language

  Continuously evolving

Simple specification (correspondences?)

  1-1 or many-1, between atomic or complex elements, nested, with or without annotations, GUI constructs

  Can get so complex that they become the same as the actual mappings

  Ambiguous: the same set of correspondences is interpreted differently by different mapping tools. Without a standard way to interpret them? Risky!


A Simple Ambiguous Scenario

Different interpretations may arise from a simple "copy" scenario

[Figure: Company (Name, Location) in the source matched to Organization (Title, Address) in the target.]

<Source>

<Company>

<Name>IBM</Name>

<Location>NY</Location>

</Company>

<Company>

<Name>MS</Name>

<Location>WA</Location>

</Company>

</Source>

<Target>

<Organization>

<Title>IBM</Title>

<Title>MS</Title>

<Address>NY</Address>

<Address>WA</Address>

</Organization>

</Target>


A Simple Ambiguous Scenario

Different tools might generate different instances with the same arrows

<Source>

<Company>

<Name>IBM</Name>

<Location>NY</Location>

</Company>

<Company>

<Name>MS</Name>

<Location>WA</Location>

</Company>

</Source>

<Target>

<Organization>

<Title>IBM</Title>

<Address>NY</Address>

</Organization>

</Target>

[Figure: the same correspondences between Company (Name, Location) and Organization (Title, Address).]


A Simple Ambiguous Scenario

Arrows between non-leaf nodes are not allowed in all tools

<Source>

<Company>

<Name>IBM</Name>

<Location>NY</Location>

</Company>

<Company>

<Name>MS</Name>

<Location>WA</Location>

</Company>

</Source>

<Target>

<Organization>

<Title>IBM</Title>

<Address>NY</Address>

</Organization>

<Organization>

<Title>MS</Title>

<Address>WA</Address>

</Organization>

</Target>

[Figure: the same correspondences between Company (Name, Location) and Organization (Title, Address).]


The Issue of The User Input

Many tools allow the mapping designer to manually edit the generated transformation

The edits can have the full power of the underlying language

Tools differ in the shortcuts and abstraction levels they offer


The Scenario Output

The output has to be correct

Satisfy the desired transformation

Compare to the expected transformation

Matching: the output is a set of matches

Mapping: it is not clear what the output is

  The transformation scripts?

  The transformed data? (The same data may be generated by different mappings)


Evaluation Challenges

What are we testing?

Expressiveness

Performance

of the tool?

of the generated mappings?

Quality

of the generated mappings?

of the integrated schema?

of the target data? [Dong et al. 2009]

User effort

Heavily depends on the mapping interface

Measuring these factors is hard without a formal (and standard) agreement on expressing specifications


Matching vs. Mapping Benchmarking

Matching system evaluation is typically a set comparison

  Considers only the semantics of the schemas

  More automatic

Mapping system evaluation is more challenging

  Considers the semantics of both schemas and transformations

  Requires more human intervention
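The set comparison against a gold standard is usually reported as precision, recall, and F-measure; a minimal sketch, reusing the Name/Title, Location/Address correspondences from the earlier example:

```python
def precision_recall_f1(found, gold):
    """Compare a produced set of correspondences against a gold standard."""
    tp = len(found & gold)  # true positives: correct correspondences found
    precision = tp / len(found) if found else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("Name", "Title"), ("Location", "Address")}
found = {("Name", "Title"), ("Name", "Address")}  # one hit, one miss
p, r, f = precision_recall_f1(found, gold)
```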


The User Can Also Help …

By being presented with …

The mappings

Difficult to overcome the heterogeneity of languages

The generated target instance

Not feasible for large and complex instances [Velegrakis et al. 2005]

A representative sample of the target instance

Appealing alternative based on positive/negative examples, but still in its infancy [Alexe et al. 2008]

Presented in detail later on

How do all these get into the evaluation function?


A Common Design Pattern

Example: TPC-H

Sets of test cases (With their expected output)

Find those that can be successfully executed

Characterize the system

Sets of Matching or Mapping Scenarios

79 EDBT 2011, A. Bonifati & Y. Velegrakis

Examples of Data Sets

Public and well-designed schemas

Meaningful overlap

Limited by the existence of real schemas

Need to be discriminating

Who is the Oracle? External knowledge or a human?

80 EDBT 2011, A. Bonifati & Y. Velegrakis

Large Scale Ontology Sets

[Zhang et al. 2004]

Two large ontologies from the anatomy domain

Foundational Model of Anatomy & GALEN

Thousands of classes, no instances

[Lambrix et al. 2003]

Gene & Signal Ontology

Partial overlap

81 EDBT 2011, A. Bonifati & Y. Velegrakis

OAEI Data

Artificial data set

33 classes, 64 properties, 76 individuals

Initial ontology distorted

Result: ~50 pairs of ontologies

Correct by construction

82 EDBT 2011, A. Bonifati & Y. Velegrakis

Data Set Factors Affecting The Evaluation

Heterogeneity of the modeling language (schemas/ontologies)

The language itself

Number of schemas (1-to-1 or many-to-1)

From-scratch matching, or is there a head start?

Multiplicity: how many elements in one schema can match how many in the other

Are Oracles permitted? Is user input permitted?

Can there be a-priori training?

External methods and auxiliary inputs

Is justification of the output needed?

Relations of the correspondences (only = or others as well)

Is there a time limit for the matching/mapping?

Can the matching be on leaves only or not?

83 EDBT 2011, A. Bonifati & Y. Velegrakis

OAEI Evaluation Example

[Euzenat et al., 2006]

Ontology Alignment Evaluation Initiative

oaei.ontologymatching.org

Yearly contest

Participants:

Provided with OAEI API

Execute all tests

Provide their results & Paper

Make the results public

84 EDBT 2011, A. Bonifati & Y. Velegrakis

Building Large Ontology Sets

[Avesani et al. 2005]

Test sets for matching web directories and classifications

Two web directories are similar if their web pages are similar

It can be considered a matching technique by itself

85 EDBT 2011, A. Bonifati & Y. Velegrakis

Thesauri

Thesauri covering large hierarchies of concepts and textual knowledge

Digital Libraries and Museums

Large need to match them

Example:

AGROVOC (FAO): 16K terms

NAL (US Department of Agriculture): 41K terms

86 EDBT 2011, A. Bonifati & Y. Velegrakis

Various examples

Illinois Semantic Integration Archive

http://pages.cs.wisc.edu/~anhai/wisc-si-archive

Collection of different schemas & Data

Faculties

Courses

Real Estate

87 EDBT 2011, A. Bonifati & Y. Velegrakis

Real Examples Lack Systematic Design

Existing Datasets not systematic

Completeness ?

Correctness?

Deduplicated?

Clarity?

Mainly testbeds or standardized tests

But not benchmarks

Benchmark tests should be

Consistent

Complete

Minimal

88 EDBT 2011, A. Bonifati & Y. Velegrakis

Real-World Matching Problems

[Kopcke et al. 2010]

Collection of matching problems

[Giunchiglia et al. 2009]

4500 matches between 3 web directories

Error free

Low complexity

High discriminative capacity

89 EDBT 2011, A. Bonifati & Y. Velegrakis

XBenchMatch

[Duchateau et al. 2007]

Criteria for testing and evaluating matching tools

Focuses on assessment of matching tools

Quality

Time

10 Datasets for matching

Classified according to:

Data level, e.g., degree of heterogeneity

Process level, e.g., scale

90 EDBT 2011, A. Bonifati & Y. Velegrakis

STBenchmark

[Alexe et al. 2008] www.stbenchmark.org

Evaluate the effectiveness of the mapping system

Derived from real applications DBLP, BioWarehouse, …

Derived from the Information Integration literature [Lerner, 2000], [Carey, 2000], etc.

Minimum set of transformations that should be supported

One scenario = one transformation, described by:

Source & target schemas

Transformation query

Instance of the source schema

Captures the most practically relevant transformation cases:

Copying, Constant Value Generation, Horizontal Partition, Surrogate Key Assignment, Vertical Partition, Unnesting (Flattening), Nesting, Self-Joins, Denormalization, Keys and Object Fusion, Atomic Value Management, Aggregation, Ordering (Order By), Flipping Metadata to Data, Flipping Data to Metadata, Flipping Data to Nested Metadata

91 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Copy

Source:

Protein * name accession created

Target:

Protein * Name Accession Created

for $x0 in $doc/Source/Protein
return
  <Protein>
    <Name> $x0/name/text() </Name>
    <Accession> $x0/accession/text() </Accession>
    <Created> $x0/created/text() </Created>
  </Protein>

Textual description +

92 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Value Generation

Target:

DataSet * Name LoadingDate

“SwissProt” “July 4th”

<DataSet> <Name>SwissProt</Name> <LoadingDate>July 4th</LoadingDate> </DataSet>

93 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Horizontal Partitioning

Source:

gene * txt type protein

Target:

Gene * Name Protein

Synonym * Name Protein

If type == "primary"

for $x0 in $doc/Source/Gene
where $x0/type/text() == "primary"
return
  <Gene>
    <Name> $x0/name/text() </Name>
    <Protein> $x0/protein/text() </Protein>
  </Gene>

for $x0 in $doc/Source/Gene
where $x0/type/text() != "primary"
return
  <Synonym>
    <Name> $x0/name/text() </Name>
    <Protein> $x0/protein/text() </Protein>
  </Synonym>

94 EDBT 2011, A. Bonifati & Y. Velegrakis
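A minimal Python sketch of the horizontal-partitioning idea, assuming a simplified record layout with sample values (the Gene/Synonym split on the "primary" discriminator follows the scenario above; the field names and data are hypothetical):

```python
# Horizontal partitioning: route each source gene record into one of two
# target sets depending on the value of its "type" discriminator.
def horizontally_partition(genes):
    target = {"Gene": [], "Synonym": []}
    for g in genes:
        row = {"Name": g["name"], "Protein": g["protein"]}
        if g["type"] == "primary":
            target["Gene"].append(row)
        else:
            target["Synonym"].append(row)
    return target

source = [
    {"name": "TP53", "type": "primary", "protein": "P04637"},
    {"name": "p53",  "type": "alias",   "protein": "P04637"},
]
result = horizontally_partition(source)
```

Each source record ends up in exactly one partition, mirroring the two complementary where-clauses of the transformation query.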

Scenario: Surrogate Key Assignment

Source:

gene * txt type protein

Target:

Gene * Name Protein WID

Synonym * Name Protein WID

If type == "primary"

Id(), Id'()

for $x0 in $doc/Source/Gene
where $x0/type/text() == "primary"
return
  <Gene>
    <Name> $x0/name/text() </Name>
    <Protein> $x0/protein/text() </Protein>
    <WID> genID() </WID>
  </Gene>

for $x0 in $doc/Source/Gene
where $x0/type/text() != "primary"
return
  <Synonym>
    <Name> $x0/name/text() </Name>
    <Protein> $x0/protein/text() </Protein>
    <WID> genID() </WID>
  </Synonym>

95 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Vertical Partition

Source:

Reaction * entry name comment orthology definition equation

Target:

Reaction * Entry Name Comment CoFactor

ChemicalInfo * Orthology Definition Equation CoFactor

Labeled Nulls

for $x0 in $doc/Source/Reaction
let $id := genID()
return (
  <Reaction>
    <Entry> $x0/entry/text() </Entry>
    <Name> $x0/name/text() </Name>
    <Comment> $x0/comment/text() </Comment>
    <CoFactor> $id </CoFactor>
  </Reaction>,
  <ChemicalInfo>
    <Orthology> $x0/orthology/text() </Orthology>
    <Definition> $x0/definition/text() </Definition>
    <Equation> $x0/equation/text() </Equation>
    <CoFactor> $id </CoFactor>
  </ChemicalInfo>
)

Normalization. Note that no key information is assumed; as such, duplication is allowed.

96 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Join Path Selection

Target:

Taxon * Id Name UniqueName Class Parent Rank EmblCode

Source:

Name * id name uniqueName class

Node * taxid parentId rank emblCode

for $x0 in $doc/Source/Name,
    $x1 in $doc/Source/Node
where $x0/id/text() = $x1/taxId/text()
return
  <Taxon>
    <Id> $x0/id/text() </Id>
    <Name> $x0/name/text() </Name>
    <UniqueName> $x0/uniqueName/text() </UniqueName>
    <Class> $x0/class/text() </Class>
    <Parent> $x1/parentId/text() </Parent>
    <Rank> $x1/rank/text() </Rank>
    <EmblCode> $x1/emblCode/text() </EmblCode>
  </Taxon>

Denormalization

97 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Cyclic Joins

Source:

Gene * name type protein

Target:

Gene * Name Protein

Synonym * Name GeneWID

If type == "primary"

for $x0 in $doc/Source/Gene
where $x0/type/text() == "primary"
return (
  <Gene>
    <Name> $x0/name/text() </Name>
    <Protein> $x0/protein/text() </Protein>
  </Gene>,
  for $x1 in $doc/Source/Gene
  where $x1/type/text() != "primary"
    and $x1/protein/text() == $x0/protein/text()
  return
    <Synonym>
      <Name> $x1/name/text() </Name>
      <GeneWID> $x0/name/text() </GeneWID>
    </Synonym>
)

98 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Un-Nesting Structures

Target:

Publication * Title Year PublishedIn Name

Source:

Reference * title year publishedIn Author * name

for $x0 in $doc/Source/Reference,
    $x1 in $x0/Author
return
  <Publication>
    <Title> $x0/title/text() </Title>
    <Year> $x0/year/text() </Year>
    <PublishedIn> $x0/publishedIn/text() </PublishedIn>
    <Name> $x1/name/text() </Name>
  </Publication>

99 EDBT 2011, A. Bonifati & Y. Velegrakis
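The unnesting (flattening) transformation above can be sketched in Python over an assumed dictionary representation of the nested source; one output Publication tuple is produced per (Reference, Author) pair:

```python
# Unnesting: flatten nested Author records into one Publication per
# (reference, author) pair, duplicating the reference-level attributes.
def unnest(references):
    out = []
    for ref in references:
        for author in ref["authors"]:
            out.append({
                "Title": ref["title"],
                "Year": ref["year"],
                "PublishedIn": ref["publishedIn"],
                "Name": author["name"],
            })
    return out

refs = [{"title": "T1", "year": 2011, "publishedIn": "EDBT",
         "authors": [{"name": "A"}, {"name": "B"}]}]
flat = unnest(refs)
```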

Scenario: Nesting Structures

Target:

Period * Year Author * Name Publication * Title PublishedIn

Source:

Publication * title year publishedIn name

for $x0 in distinct-values($doc/Source/Publication/year)
return
  <Period>
    <Year> $x0 </Year>
    for $x1 in distinct-values($doc/Source/Publication[year=$x0]/name)
    return
      <Author>
        <Name> $x1 </Name>
        for $x2 in $doc/Source/Publication
        where $x2/year/text() = $x0 and $x2/name/text() = $x1
        return
          <Publication>
            <Title> $x2/title/text() </Title>
            <PublishedIn> $x2/publishedIn/text() </PublishedIn>
          </Publication>
      </Author>
  </Period>

100 EDBT 2011, A. Bonifati & Y. Velegrakis
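The same nesting logic, grouping flat publications first by year and then by author name within each year, can be sketched with `itertools.groupby` (the flat record layout is a hypothetical simplification of the scenario's source):

```python
from itertools import groupby

# Nesting: group flat publications by year, then by author name inside
# each year, mirroring the two distinct-values loops of the query.
def nest(publications):
    pubs = sorted(publications, key=lambda p: (p["year"], p["name"]))
    periods = []
    for year, by_year in groupby(pubs, key=lambda p: p["year"]):
        authors = []
        for name, by_author in groupby(list(by_year), key=lambda p: p["name"]):
            authors.append({
                "Name": name,
                "Publications": [{"Title": p["title"],
                                  "PublishedIn": p["publishedIn"]}
                                 for p in by_author],
            })
        periods.append({"Year": year, "Authors": authors})
    return periods

flat = [
    {"title": "T1", "year": 2011, "publishedIn": "EDBT", "name": "A"},
    {"title": "T2", "year": 2011, "publishedIn": "VLDB", "name": "A"},
    {"title": "T3", "year": 2010, "publishedIn": "ICDE", "name": "B"},
]
nested = nest(flat)
```

`groupby` only groups adjacent records, so the input must be sorted on the grouping key first; that is why the function sorts on (year, name) up front.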

Scenario: Keys & Object Fusion

Target:

Experiment * Contact Date Description ExperimentalData * Data Role

Source:

Experiment * eid contact date description ExperimentalData * data role

FlowCytometrySample id contact date Probe * data type

Step 1, fuse into an intermediate Source2:

<Source2>
  for $x0 in $doc/Source/Experiment,
      $x1 in $x0/ExperimentalData
  return
    <Datum>
      <id> genID($x0/contact/text(), $x0/date/text()) </id>
      <Contact> $x0/contact/text() </Contact>
      <Date> $x0/date/text() </Date>
      <Description> $x0/description/text() </Description>
      <Data> $x1/data/text() </Data>
      <Role> $x1/role/text() </Role>
    </Datum>
  for $x0 in $doc/Source/FlowCytometrySample,
      $x1 in $x0/Probe
  return
    <Datum>
      <id> genID($x0/contact/text(), $x0/date/text()) </id>
      <Contact> $x0/contact/text() </Contact>
      <Date> $x0/date/text() </Date>
      <Data> $x1/data/text() </Data>
      <Role> $x1/type/text() </Role>
    </Datum>
</Source2>

Step 2, group the Source2 data by key:

for $x0 in distinct-values($doc/Source2/Datum/id)
return
  <Experiment>
    for $x1 in ($doc/Source2/Datum[id=$x0])[1]
    return $x1/Contact $x1/Date $x1/Description
    for $x3 in $doc/Source2/Datum
    where $x3/id/text() = $x0
    return
      <ExperimentalData> $x3/Data $x3/Role </ExperimentalData>
  </Experiment>

101 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Atomic Value Manipulation

Target:

Contact * FirstName LastName Address Phone

Source:

Contact * name address street city zip phone

GetFirstName(…)

Type Discrepancy Handling

for $x0 in $doc/Source/Contact
return
  <Contact>
    <FirstName> GetFirstName( $x0/name/text() ) </FirstName>
    <LastName> GetLastName( $x0/name/text() ) </LastName>
    <Address> Concat( $x0/street/text(), $x0/city/text(), $x0/zip/text() ) </Address>
    <Phone> String2Int( $x0/phone/text() ) </Phone>
  </Contact>

102 EDBT 2011, A. Bonifati & Y. Velegrakis
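A Python sketch of the atomic-value manipulations in this scenario: splitting a name, concatenating an address, and casting a string-typed phone to an integer. The concrete field layout and sample values are assumptions, not part of the benchmark:

```python
# Atomic value manipulation: split, concatenate, and cast atomic values
# while restructuring a contact record (hypothetical field layout).
def transform_contact(c):
    first, _, last = c["name"].partition(" ")
    return {
        "FirstName": first,
        "LastName": last,
        "Address": ", ".join([c["street"], c["city"], c["zip"]]),
        "Phone": int(c["phone"]),   # type-discrepancy handling: string -> int
    }

src = {"name": "Ada Lovelace", "street": "1 Main St",
       "city": "London", "zip": "00001", "phone": "5551234"}
tgt = transform_contact(src)
```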

Scenarios: Aggregation & Order

103 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenarios: Data Meta-data

104 EDBT 2011, A. Bonifati & Y. Velegrakis

Thalia

[Hammer et al. 2005]

Integration tools benchmark

Rich set of test data

Source schemas

Syntactic and semantic heterogeneity

12 test queries for the integrated schema

105 EDBT 2011, A. Bonifati & Y. Velegrakis

Real Data Is Not Enough for Benchmarking...

… but definitely not for the reason Dilbert thinks !

106 EDBT 2011, A. Bonifati & Y. Velegrakis

Talk Outline

Introduction

Matching and mapping: techniques & tools

Benchmarks and evaluation principles

Designing a matching & mapping benchmark

Using real-world scenarios for evaluation

Generating synthetic evaluation scenarios

Measuring efficiency and effectiveness

Conclusions and future directions

107 EDBT 2011, A. Bonifati & Y. Velegrakis

Why Synthetic Data & Scenarios

To stress test the system

To understand performance in diverse situations

To create additional realistic test cases

To ensure that unforeseen situations are also tested

108 EDBT 2011, A. Bonifati & Y. Velegrakis

Top-down Scenario Construction

Start with a big schema and divide/extract

TaxME2 [Giunchiglia et al. 2009]: preserves correctness, complexity, performance

[Okawara et al. 2006]: techniques on how a benchmark should be built

[Hammer et al. 2005] – Thalia: large dataset + filters

[Lee et al. 2007] – eTuner: duplicate the schema, split the data in two

Modify the first half

Limited kinds of modifications; not very natural

109 EDBT 2011, A. Bonifati & Y. Velegrakis

Bottom-up Scenario Construction

Create the schema from scratch

STBenchmark [Alexe et al. 2008]

Schema Generator

Expands basic scenarios

Changing basic characteristics of the scenario

Data Generator

Hand-to-Hand with the Schema Generator

ToXGen [Barbosa et al. 2002]

110 EDBT 2011, A. Bonifati & Y. Velegrakis

Parameters of Schema Generation

• Number of subelements

• Nesting depth

• Join size

• Join width

• Join kind (star / chain)

• Function arity

f(…)

Source R1 [0…*] A1 A2 A3

R2 [0…*] A4 A5

R3 [0…*] A6

R4 [0…*] A7 A8 A9

R5 [0…*] A10 A11

R6 [0…*] A12

Parameter values:

Sampled from normal distributions given by average and standard deviation

111 EDBT 2011, A. Bonifati & Y. Velegrakis
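Sampling structural parameters from normal distributions can be sketched as follows; the parameter names (`num_relations`, `num_subelements`) and the clamping policy are hypothetical, not STBenchmark's actual configuration keys:

```python
import random

# Sample one structural parameter from a normal distribution, clamped to
# a minimum so generated schemas stay well-formed.
def sample_param(avg, stdev, minimum=1):
    return max(minimum, round(random.gauss(avg, stdev)))

# Generate a flat schema: a list of relations, each with a sampled number
# of attributes (nesting depth, join size, etc. would be sampled the same way).
def generate_schema(cfg, rng_seed=0):
    random.seed(rng_seed)
    relations = []
    for i in range(sample_param(*cfg["num_relations"])):
        arity = sample_param(*cfg["num_subelements"])
        relations.append({"name": f"R{i + 1}",
                          "attrs": [f"A{i + 1}_{j + 1}" for j in range(arity)]})
    return relations

cfg = {"num_relations": (3, 1), "num_subelements": (2, 1)}   # (average, stdev)
schema = generate_schema(cfg)
```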

Stretching The Unnesting Scenario

Source Reference [0…*] title year publishedIn Author [0…*] name Affiliation [0…*] university country Students [0…*] sname

Target Publication [0…*] Title Year PublishedIn Name University Country StudentName

Unnesting

Basic Scenario

112 EDBT 2011, A. Bonifati & Y. Velegrakis

Stretching The Horizontal Partitioning

Vary the number of partitions

Vary the number of elements that exist in each partition

Source:

gene * txt type protein hpAttr1 hpAttr2

Target:

Gene * Name Protein HpAttr1 HpAttr2

Synonym * Name Protein HpAttr1 HpAttr2

HPRel1 * Name Protein HpAttr1 HpAttr2

113 EDBT 2011, A. Bonifati & Y. Velegrakis

Combining Mapping Scenarios

Lack of diversity? Combine scenarios

Based on a set of configuration parameters, generate a complex mapping scenario by concatenating scaled-up mapping scenarios

S1 T1

S2 T2

P1

P2

Concatenation of (S1, T1, P1) and (S2, T2, P2)

114 EDBT 2011, A. Bonifati & Y. Velegrakis

Stretched Mapping Scenarios

Source Reference [0…*] title year publishedIn Author [0…*] name Affiliation [0…*] university country Students [0…*] sname

Target Publication [0…*] Title Year PublishedIn Name University Country StudentName

Horizontal

Partitioning

Source Schema Target Schema

Horizontal

Partitioning

Copy

Unnesting Unnesting

Copy

Complex Mapping Scenario

115 EDBT 2011, A. Bonifati & Y. Velegrakis

Composing Mapping Scenarios

Intermix of basic mapping scenario transformations

capture cases where different types of transformations occur simultaneously on the same part of a schema

Main idea:

Generate a source schema S

Evolve S to obtain a target schema.

All based on the configuration parameters

Example next:

116 EDBT 2011, A. Bonifati & Y. Velegrakis

Composed Mapping Scenarios

R1 [0…*] A1 A2 SE1 [0…*] A3 A4 SE2 [0…*] A5 R2 [0…*] A7 A8 SE3 [0…*] A9 SE4 [0…*] A10 A11 R3 [0…*] A12 A13 A14

F1 A1 A2 A3 A4 A5 F2 A7 A8 A9 A10 A11 F3 A12 A13 A14

F1 A1 A3 A5 F2 A7 A8 A10 A11 F3 A12 A13 A14

F1 A1 A3 A5 A7b F2 A7 A8 A10 A11 A12 F3 A13 A14

F1 A1 A3 A5 A7b A15 (id) F2 A7 A8 A10 A11 A12 A16(=“June”) F3 A13 A14 A17(=A3*A14)

R4 [0…*] A1 A3 SE5 [0…*] A5 A7b A15 R6 [0…*] A7 A8 A10 A11 A12 A16 R7 [0…*] A13 A14 SE6 [0…*] A17

Transformation

query

S T

P

• unnesting

• removal

• duplication

• migration

• addition

• nesting

117 EDBT 2011, A. Bonifati & Y. Velegrakis

Synthetic Examples for Matching

[Ferrara et al. 2008]

ISLab Instance Matching Benchmark

Creates a reference ontology and populates it

Using the web

Performs a sequence of modifications

Variations in data values

Structural heterogeneity

Semantic variations

118 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenarios Are Useless Without Metrics

119 EDBT 2011, A. Bonifati & Y. Velegrakis

Talk Outline

Introduction

Matching and mapping: techniques & tools

Benchmarks and evaluation principles

Designing a matching & mapping benchmark

Using real-world scenarios for evaluation

Generating synthetic evaluation scenarios

Measuring efficiency and effectiveness

Conclusions and future directions

120 EDBT 2011, A. Bonifati & Y. Velegrakis

Metric Categorization

Qualitative metrics

Compliance measures

Quantitative metrics

Performance measures

User-specific metrics

Application specific metrics

121 EDBT 2011, A. Bonifati & Y. Velegrakis

Qualitative = Compliance

Evaluate the degree of compliance of the system with respect to some standard

Matching: Precision, Recall, F-measure and Fallout [Euzenat et al. 2007]

Measure the difference of the system output to some reference (expected) output

An expert user is typically assumed to provide the expected matches [Duchateau et al. 2007] [Euzenat et al. 2004]

They do not provide any measure of the post-match effort

They do not consider the time spent by the user in doing verification during intermediate stages

122 EDBT 2011, A. Bonifati & Y. Velegrakis

Terminology

E − G : False Negatives

G − E : False Positives

G ∩ E : True Positives

U − (G ∪ E) : True Negatives

E: Expected, G: Generated, U: all possible correspondences

123 EDBT 2011, A. Bonifati & Y. Velegrakis

Hamming Distance

Measures the dissimilarity between matches

H(G, E) = 1 − |G ∩ E| / |G ∪ E|

Example:

E = {Book-Volume, Person-Human, Science-Essay}

G = {Product-Volume, Person-Writer, Science-Essay}

Only Science-Essay is shared, so |G ∩ E| = 1, |G ∪ E| = 5, and H(G, E) = 1 − 1/5 = 4/5

124 EDBT 2011, A. Bonifati & Y. Velegrakis
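Applying the set-based formula H(G, E) = 1 − |G ∩ E| / |G ∪ E| to the slide's example (representing each correspondence as a pair):

```python
# Hamming distance between a generated (G) and an expected (E) set of
# correspondences: the fraction of their union that they do not share.
def hamming(G, E):
    return 1 - len(G & E) / len(G | E)

E = {("Book", "Volume"), ("Person", "Human"), ("Science", "Essay")}
G = {("Product", "Volume"), ("Person", "Writer"), ("Science", "Essay")}
h = hamming(G, E)   # only Science-Essay is shared: 1 - 1/5
```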

Precision

Originated from IR [van Rijsbergen, 1975]

Adopted to matching [Do et al. 2002]

Ratio of correctly found correspondences (true positives) over the total number of returned correspondences (true and false positives)

Intuitively: The degree of correctness

Precision(G, E) = |G ∩ E| / |G|

125 EDBT 2011, A. Bonifati & Y. Velegrakis

Recall

Originated from IR [van Rijsbergen, 1975]

Adopted to matching [Do et al. 2002]

Ratio of correctly found correspondences (true positives) over the total number of expected correspondences (true positives and false negatives)

Intuitively: The degree of completeness

Recall(G, E) = |G ∩ E| / |E|

126 EDBT 2011, A. Bonifati & Y. Velegrakis

Fallout

The percentage of the found matches that are false positives

Intuitively: How much error has been made

Fallout(G, E) = (|G| − |G ∩ E|) / |G| = |G − E| / |G|

127 EDBT 2011, A. Bonifati & Y. Velegrakis

F-Measure

Precision & Recall are not always consistent

Neither are their complements, Noise & Silence

Aggregate Precision & Recall

FMeasure_α(G, E) = Precision(G, E) × Recall(G, E) / ((1 − α) × Precision(G, E) + α × Recall(G, E))

For α = 1, F-Measure equals Precision; for α = 0, Recall. For α = 0.5, it is the harmonic mean

128 EDBT 2011, A. Bonifati & Y. Velegrakis
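The four compliance measures above, computed on a small made-up pair of correspondence sets (the opaque labels c1..c5 are placeholders for correspondences):

```python
# Compliance measures over generated (G) and expected (E) correspondence sets.
def precision(G, E):
    return len(G & E) / len(G)          # degree of correctness

def recall(G, E):
    return len(G & E) / len(E)          # degree of completeness

def fallout(G, E):
    return len(G - E) / len(G)          # share of false positives returned

def f_measure(G, E, alpha=0.5):
    p, r = precision(G, E), recall(G, E)
    return p * r / ((1 - alpha) * p + alpha * r)

E = {"c1", "c2", "c3", "c4"}
G = {"c1", "c2", "c5"}
```

Note how the α parameter interpolates: α = 1 recovers Precision, α = 0 recovers Recall, and α = 0.5 is the harmonic mean.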

Overall

Like an edit-distance [Melnik et al. 2002]

Ratio of errors over the total number of expected correspondences (true positives and false negatives)

Overall < F-Measure. Ranges in [−1, 1]

Intuitively: the effort required to fix a matching

Overall(G, E) = Recall(G, E) × (2 − 1 / Precision(G, E))
= 1 − (|G ∪ E| − |G ∩ E|) / |E|
= 1 − (|E − G| + |G − E|) / |E|

129 EDBT 2011, A. Bonifati & Y. Velegrakis
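The two forms of Overall, Recall × (2 − 1/Precision) and the set-difference form 1 − (|E − G| + |G − E|)/|E|, can be checked against each other numerically on the same toy sets used before (c1..c5 are placeholder correspondences):

```python
# Overall [Melnik et al. 2002]: an edit-distance-like measure of the
# effort needed to fix a matching.
def overall(G, E):
    p = len(G & E) / len(G)     # precision
    r = len(G & E) / len(E)     # recall
    return r * (2 - 1 / p)

E = {"c1", "c2", "c3", "c4"}
G = {"c1", "c2", "c5"}
# Equivalent set-difference form: errors over expected correspondences.
alt = 1 - (len(E - G) + len(G - E)) / len(E)
```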

Strength-based Similarity

Takes into consideration the degree of confidence

SBS(G, E) = 2 × Σ_{c ∈ G ∩ E} |strength_G(c) − strength_E(c)| / (|G| + |E|)

130 EDBT 2011, A. Bonifati & Y. Velegrakis

All-or-nothing vs. Approximation

[Figure: two category trees, a source tree (Product, DVD, Book, Science, Textbook, Popular, …) and an expected target tree (Volume, Essay, Politics, Biography, Pocket, …); a generated match can be close to the expected one (e.g., Essay) or far from it (e.g., Pocket)]

131 EDBT 2011, A. Bonifati & Y. Velegrakis

Relaxed Precision & Recall

[Ehrig et al. 2005]

Precision_ω(G, E) = ω(G, E) / |G|

Recall_ω(G, E) = ω(G, E) / |E|

ω(G, E) = Σ_{(a,r) ∈ M(G,E)} σ(a, r)

σ: correspondence similarity function

M(G, E): best matches with regard to σ

ω(G, E) is used instead of |G ∩ E|

132 EDBT 2011, A. Bonifati & Y. Velegrakis
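A sketch of relaxed precision and recall: the exact intersection size is replaced by the total similarity ω of the best pairing between generated and expected correspondences. The greedy pairing and the toy similarity function below are stand-ins for M(G, E) and σ; [Ehrig et al. 2005] do not prescribe these particular choices:

```python
# omega: total similarity of a (greedy) best pairing between generated
# and expected correspondences; each expected item is used at most once.
def omega(G, E, sigma):
    total, used = 0.0, set()
    for g in G:
        best = max(((sigma(g, e), e) for e in E if e not in used),
                   default=(0.0, None))
        if best[1] is not None and best[0] > 0:
            used.add(best[1])
            total += best[0]
    return total

def relaxed_precision(G, E, sigma):
    return omega(G, E, sigma) / len(G)

def relaxed_recall(G, E, sigma):
    return omega(G, E, sigma) / len(E)

# Toy similarity: 1.0 for an exact match, 0.5 when only the source side agrees.
sigma = lambda a, b: 1.0 if a == b else (0.5 if a[0] == b[0] else 0.0)
G = [("Science", "Essay"), ("Person", "Writer")]
E = [("Science", "Essay"), ("Person", "Human")]
```

A near-miss like Person-Writer vs. Person-Human now contributes partial credit instead of counting as a plain error.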

Weighted Harmonic Mean

Integrates multiple similarity measures

Given n similarity measures M_i, each with a weight w_i such that w_i ∈ [0, 1] and Σ_{i=1..n} w_i = 1

WHM(G, E) = 1 / Σ_{i=1..n} (w_i / M_i(G, E))

133 EDBT 2011, A. Bonifati & Y. Velegrakis
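A sketch of the weighted harmonic mean under the reading WHM = 1 / Σ(w_i / M_i), one plausible reconstruction of the slide's formula:

```python
# Weighted harmonic mean of several similarity measures, with weights
# summing to 1 (assumed reading of the aggregation formula).
def whm(values, weights):
    assert abs(sum(weights) - 1.0) < 1e-9
    return 1.0 / sum(w / v for w, v in zip(weights, values))

# With two measures and equal weights this reduces to the classic
# harmonic mean of precision and recall, i.e. the balanced F-measure.
f = whm([0.8, 0.4], [0.5, 0.5])
```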

Evaluating Mapping Systems

Efficiency

Mapping Generation Time

Data Translation Performance

Time

Parallelization

Human effort

134 EDBT 2011, A. Bonifati & Y. Velegrakis

Matching/Mapping Generation Time

Matching tools: little human intervention

Time has been measured for matching tasks in [Yatskevic et al. 2003], discussed in a recent benchmark [Kopcke et al. 2010], and also addressed in XBenchMatch [Duchateau et al. 2007]

Mapping tools, e.g., Clio, HePToX, Spicy, and STBenchmark, do not elaborate on the issue

It is hard to measure time in a process in which human participation is part of the process

It includes the time to guide, verify and tune the mapping tool

135 EDBT 2011, A. Bonifati & Y. Velegrakis

Translation Time as Performance Metric

Time to execute the transformation script

Indirectly a quality metric on the generated mappings

Attention to avoid evaluation of the query engines

Same engine & Hardware

Need to be fair

Same target instance

Efficient Core generation

[Mecca et al. 2009] [tenCate et al. 2009]

Time performance of ETL workflows

[Simitsis et al. 2009]

136 EDBT 2011, A. Bonifati & Y. Velegrakis

Time and Parallelization for ETL tools

Beyond time performance, other factors may be quite relevant, such as:

Workflow execution throughput (under failures or not)

Average latency per tuple

Along with the above factors, it is important to increase parallelization:

Pipelining: tasks of the ETL workflow are executed in parallel by different processors, and the output can be consumed by the next task without waiting for the overall completion

Partitioning: data is partitioned and the transformation is applied to chunks of data

137 EDBT 2011, A. Bonifati & Y. Velegrakis

Human Effort In Matching Tools

The amount of work required to remove false positives and add false negatives

Since no human intervention takes place during the matching process

Human-spared resources [Duchateau 2009]

It counts the number of user interactions to obtain a 100% F-measure, i.e. the effort to remove false positives and add false negatives, and also to discover missing correspondences

138 EDBT 2011, A. Bonifati & Y. Velegrakis

Human Effort In Mapping Tools

Mapping tools can be seen as graphic tools

HCI study can be used

Comparing the GUI of the tools is hard

Schema mapping tools are a new technology; the tools are evolving and keep improving their interfaces

STBenchmark provides a first-cut measure on the effort required to implement a mapping scenario through the visual interface of a mapping system [Alexe et al. 2008]

139 EDBT 2011, A. Bonifati & Y. Velegrakis

A Simple Model

STBenchmark model

Cost of implementing a mapping scenario:

4*L + S + 2*D + 4*K

L – mouse dragging actions

S – single mouse clicks

D – double mouse clicks

K – keystrokes

[MacKenzie et al. ’91]

Mouse dragging is slower and more error-prone than clicking

It is easier to make mistakes when typing

Scenario / System    A     B     C     D
Scenario 1           8     16    16    4
Scenario 2           72    78    65    110
Scenario 3           32    52    37    7
Scenario 4           66    81    65    200

140 EDBT 2011, A. Bonifati & Y. Velegrakis
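The STBenchmark effort model is a simple weighted count of user actions; the weights 4/1/2/4 come from the slide, while the per-scenario figures in the table are already totals per system, so this sketch only exposes the cost function itself:

```python
# STBenchmark's GUI-effort model: weighted count of the user actions
# needed to implement a mapping scenario through a visual interface.
WEIGHTS = {"L": 4, "S": 1, "D": 2, "K": 4}   # drags, clicks, double-clicks, keystrokes

def effort(L=0, S=0, D=0, K=0):
    return (WEIGHTS["L"] * L + WEIGHTS["S"] * S
            + WEIGHTS["D"] * D + WEIGHTS["K"] * K)
```

Dragging and typing are weighted most heavily, reflecting the cited finding [MacKenzie et al. '91] that they are the slowest and most error-prone actions.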

Usability study for HePToX/Clio

• [Bonifati et al. 2010]

• Whereas Clio could implement considerably more scenarios, HePToX required less effort than Clio in the majority of the scenarios it could implement.

• HePToX is click-and-drag oriented, while Clio is click-and-select oriented.

141 EDBT 2011, A. Bonifati & Y. Velegrakis

Evaluating Mapping Systems

Efficiency

Mapping Generation Time

Data Translation Performance

Time

Parallelization

Human effort

Effectiveness

Supported Scenarios

Quality

Generated Mappings

Target instance

Target schema

142 EDBT 2011, A. Bonifati & Y. Velegrakis

Enumerating Supported Scenarios

System A System B System C

Scenario 1

Scenario 2

Scenario 3

Scenario 4

Scenario 5

Scenario 6

Scenario 7

……………..

143 EDBT 2011, A. Bonifati & Y. Velegrakis

Before talking about quality…

In order to check the quality of the generated mappings or the generated target instance:

Somebody has to provide you with the "ideal" or "desired" mappings or target instance

Normally, the user provides those

Or, alternatively, a "core" target instance can be used

Or the set of mappings suggested by a benchmark

The focus here is not on who provides the above components, but on quality!

144 EDBT 2011, A. Bonifati & Y. Velegrakis

Quality of the Generated Mappings

Things are tricky for mapping tools:

Measuring the quality of generated mappings amounts to checking Query Containment or Query Equivalence

An NP-complete problem

It is preferable to check the quality of the results of a mapping:

i.e. the generated target instance

Current efforts try to characterize mappings in a quantitative way:

By their information loss, while inverting them

The notion of maximum extended recovery has been introduced in [Arenas2008, Fagin2009]

145 EDBT 2011, A. Bonifati & Y. Velegrakis

Quality of the Generated Target instance

A mapping is inherently a query

Same transformation, multiple ways.

Generation time

Readability

Mapping result

Whether it produces the respective results

Yes/No is too harsh

Similarity of queries … too difficult to compute

Target instance is an alternative

Observe the result of the mappings

146 EDBT 2011, A. Bonifati & Y. Velegrakis

The Spicy Way

[Bonifati et al. 2008] – Spicy

Schema Matcher (internal or external)

Mapping Generator (internal or external)

s.A t.E, 0.87

s.C t.F, 0.76

s.D t.F, 0.98

Ranked

mappings:

mapping 1, 0.97

mapping 2, 0.87

mapping 3, 0.72

match

line selection

mapping generation

Structural analysis

mapping

verification

Spicy

source

target

147 EDBT 2011, A. Bonifati & Y. Velegrakis

The Spicy Way

Structural Analysis

uses the model of electrical circuits

uses sampling

uses a set of features on samples, e.g.:

length and character distribution

entropy of values

density of null values etc.

148 EDBT 2011, A. Bonifati & Y. Velegrakis

Structural Analysis

For an attribute A with sample sample(A), the atomic piece of circuit is the following

149 EDBT 2011, A. Bonifati & Y. Velegrakis

Trees Into Circuits

Circuits can be obtained for nested structures

150 EDBT 2011, A. Bonifati & Y. Velegrakis

The Spicy way: lessons learned

Spicy is a system for schema mapping verification

using comparison of instances to gauge the quality

iterating the mapping search algorithm until mapping quality is acceptable

Open issues:

Special element types (images, full-text), complex models (e.g. ontologies) and more complex classes of lines

Other instance comparison techniques other than structural analysis may be needed

What if we look at the efficiency of transformations (e.g. core computation) and their quality at the same time?

151 EDBT 2011, A. Bonifati & Y. Velegrakis

Quality of target instance for ETL

The quality of target instances is also important for ETL systems

Target instances can be characterized in terms of:

Ease of maintenance [Simitsis et al. 2009]

Resilience to failures

Data freshness

Compliance to business rules

152 EDBT 2011, A. Bonifati & Y. Velegrakis

Data examples

Since the expected/generated target instance may be large and the generated mappings may be numerous, samples of the expected/generated target instance can be used instead

The importance of data examples goes back to [Yan et al. 2001]

Each mapping is a connected graph G = (N, E), where N is the set of nodes (source schema relations) and E represents conjunctions of join predicates among the nodes

Data associations (subgraphs of G) can be derived as relations that contain the maximum number of attributes that can be joined along E

Data associations are leveraged to understand what has to be included in a mapping

153 EDBT 2011, A. Bonifati & Y. Velegrakis

Routes

Source-to-target dependencies, st:

m1: CardHolders(cn,l,s,n) -> Accounts(cn,L,s), Clients(s,n)

m2: Dependents(an,s,n) -> Clients(s,n)

Target dependencies, t:

m3: Clients(s,n) -> (Accounts(A,L,s))

MANHATTAN CREDIT (source schema S):

CardHolders: cardNo, limit, ssn, name

Dependents: accNo, ssn, name

FARGO FINANCE (target schema T):

Accounts: accNo, creditLine, accHolder

Clients: ssn, name

Source instance I:

CardHolders: (123, $15K, ID1, Alice)

Dependents: (123, ID2, Bob)

Target instance J, a solution for I under the schema mapping:

Accounts: (123, L1, ID1), (A2, L2, ID2)

Clients: (ID1, Alice), (ID2, Bob)   (fk1)

Allow the Inspection of the flow of the mappings [Chiticariu and Tan, 2006]

154 EDBT 2011, A. Bonifati & Y. Velegrakis

Example Debugging Scenario 1

Unknown credit limit?

15K is not copied over to the target

Source instance I:

CardHolders: (123, $15K, ID1, Alice)

Dependents: (123, ID2, Bob)

Target instance J:

Accounts: (123, L1, ID1), (A2, L2, ID2)

Clients: (ID1, Alice), (ID2, Bob)

A route for the Accounts tuple (123, L1, ID1): CardHolders(123, $15K, ID1, Alice) --m1--> Accounts(123, L1, ID1), Clients(ID1, Alice)

155 EDBT 2011, A. Bonifati & Y. Velegrakis


Example Debugging Scenario 2

Unknown account number?

123 is not copied over to the target

as Bob’s account number

Source instance I:

CardHolders: (123, $15K, ID1, Alice)

Dependents: (123, ID2, Bob)

Target instance J:

Accounts: (123, L1, ID1), (A2, L2, ID2)

Clients: (ID1, Alice), (ID2, Bob)

Route for the Accounts tuple with accNo A2: Dependents(123, ID2, Bob) --m2--> Clients(ID2, Bob) --m3--> Accounts(A2, L2, ID2)

157 EDBT 2011, A. Bonifati & Y. Velegrakis


The SPIDER System

Based on Routes

159 EDBT 2011, A. Bonifati & Y. Velegrakis

Data Examples as Evaluation Tools

The use of data examples as evaluation tools is underway [Chiticariu et al. 2008]

Examples are used to understand and refine mappings towards the desired specification

Universal data examples are data examples derived from universal solutions [Alexe et al. 2010]

If S and T contain only unary relations, with only Σst, a mapping is characterized by a set of:

Positive data examples (I, J) such that (I, J) ⊨ Σ

Negative data examples (I, J) such that (I, J) ⊭ Σ

160 EDBT 2011, A. Bonifati & Y. Velegrakis

Muse

[Chiticariu et al. 2008] Muse

Builds ad-hoc probes for each attribute, such that a small source example is built and two differentiating target examples are obtained

After that, Muse asks the designer: "Which target instance looks correct?"

This eliminates the mappings that lead to the unchosen target instance

The result is:

A set of correct homomorphically equivalent target instances

It also allows the design of Skolem functions, not addressed in Routes

161 EDBT 2011, A. Bonifati & Y. Velegrakis

MUSE Workflow

[Figure: MUSE workflow. Generation: the mapping specification plus a real source instance (if available) yield real/synthetic data examples. Examination: the mapping designer inspects the data examples, giving essentially Yes/No answers. Refinement: grouping semantics, disambiguation]

162 EDBT 2011, A. Bonifati & Y. Velegrakis

Example

CompDB: Rcd
  Companies: Set of Company: Rcd (cbranch, cname, location)
  Projects: Set of Project: Rcd (pid, pname, cbranch, manager)
  Employees: Set of Employee: Rcd (eid, ename, contact)

OrgDB: Rcd
  Orgs: Set of Org: Rcd (oname, Projects: Set of Project: Rcd (pname, manager))
  Employees: Set of Employee: Rcd (eid, ename)

Declarative Mapping

for c in CompDB.Companies,
    p in CompDB.Projects,
    e in CompDB.Employees
satisfy p.cbranch = c.cbranch
    and e.eid = p.manager
exists o in OrgDB.Orgs,
       p1 in o.Projects,
       e1 in OrgDB.Employees
satisfy p1.manager = e1.eid
where c.cname = o.oname
  and e.eid = e1.eid
  and e.ename = e1.ename
  and p.pname = p1.pname

163 EDBT 2011, A. Bonifati & Y. Velegrakis

Example

Grouping Projects:

Example source:

Companies: (Redmond, Microsoft, USA), (S. Valley, Microsoft, USA)

Projects: (P1, DB, Redmond, e4), (P2, Web, S. Valley, e5)

Group by cbranch:

Orgs: (Microsoft, Projects: {(DB, e4)}), (Microsoft, Projects: {(Web, e5)})

Group by cname:

Orgs: (Microsoft, Projects: {(DB, e4), (Web, e5)})



Example

Declarative Mapping

for     c in CompDB.Companies,
        p in CompDB.Projects,
        e in CompDB.Employees
satisfy p.cbranch = c.cbranch,
        e.eid = p.manager
exists  o in OrgDB.Orgs,
        p1 in o.Projects,
        e1 in OrgDB.Employees
satisfy p1.manager = e1.eid
where   c.cname = o.oname,
        e.eid = e1.eid,
        e.ename = e1.ename,
        p.pname = p1.pname,
        o.Projects = SKProjects(c.cbranch, c.cname, c.location)

The last where clause is the grouping function.

But group by what subset of {cbranch, cname, location} ?
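One way to read SKProjects is as a memoized constructor of set identifiers: equal arguments return the same Projects-set id, so the chosen argument subset determines the grouping. A sketch (the `Skolem` class is illustrative, not Muse's actual machinery):

```python
import itertools

class Skolem:
    """Memoized Skolem function: same argument tuple -> same fresh set id."""
    def __init__(self, name):
        self.name = name
        self.memo = {}               # argument tuple -> set id
        self.fresh = itertools.count()
    def __call__(self, *args):
        if args not in self.memo:
            self.memo[args] = f"{self.name}#{next(self.fresh)}"
        return self.memo[args]

SKProjects = Skolem("Projects")
# Group by cname only: pass just cname as the Skolem argument, so both
# Microsoft branches share one Projects set while Google gets its own.
a = SKProjects("Microsoft")   # Redmond branch
b = SKProjects("Microsoft")   # S. Valley branch -> same Projects set
c = SKProjects("Google")      # a distinct Projects set
```

Passing the full tuple (cbranch, cname, location) instead would give every branch its own set, which is exactly the ambiguity the question above points at.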



Muse-G: Grouping Semantics Design

Goal: infer a grouping function that has the same effect as the one intended by the designer

Muse-G probes each possible grouping attribute, starting with cbranch.

Example source:
  Companies: (Redmond, Microsoft, USA), (S. Valley, Microsoft, USA)
  Projects:  (P1, DB, Redmond, e4), (P2, Web, S. Valley, e5)
  Employees: (e4, John, x234), (e5, Anna, x888)

Target Scenario 1 (group by cbranch; Skolem terms SK(Redmond, y), SK(S. Valley, y)):
  Orgs: Microsoft [Projects: (DB, e4)],
        Microsoft [Projects: (Web, e5)]
  Employees: (e4, John), (e5, Anna)

Target Scenario 2 (do not group by cbranch; Skolem term SK(y)):
  Orgs: Microsoft [Projects: (DB, e4), (Web, e5)]
  Employees: (e4, John), (e5, Anna)

where y is a subset of {Microsoft, USA}
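The two differentiating scenarios can be sketched by nesting projects with and without the probed attribute among the Skolem arguments (toy data; the `nest` helper is hypothetical):

```python
# Sketch: build Muse-G's two differentiating target scenarios for a probed
# attribute by grouping projects with and without it in the grouping key.
source_projects = [
    {"pname": "DB",  "cbranch": "Redmond",   "cname": "Microsoft", "manager": "e4"},
    {"pname": "Web", "cbranch": "S. Valley", "cname": "Microsoft", "manager": "e5"},
]

def nest(projects, group_key):
    """Group projects by a tuple of attributes (the Skolem arguments)."""
    groups = {}
    for p in projects:
        key = tuple(p[a] for a in group_key)
        groups.setdefault(key, []).append((p["pname"], p["manager"]))
    return groups

# Scenario 1: group by cbranch -> two Projects sets, one per branch
scenario1 = nest(source_projects, ["cname", "cbranch"])
# Scenario 2: do not group by cbranch -> one shared Projects set
scenario2 = nest(source_projects, ["cname"])
```

The designer's choice between the two instances tells the wizard whether cbranch belongs in the grouping key.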


Muse-G: Second Question

The next probed attribute is cname.

Example source:
  Companies: (S. Valley, Microsoft, USA), (Mt. View, Google, USA)
  Projects:  (P1, DB, S. Valley, e4), (P4, Web, Mt. View, e6)
  Employees: (e4, John, x234), (e6, Kat, x331)

Target Scenario 1 (group by cname; Skolem terms SK(Microsoft, y), SK(Google, y)):
  Orgs: Microsoft [Projects: (DB, e4)],
        Google [Projects: (Web, e6)]
  Employees: (e4, John), (e6, Kat)

Target Scenario 2 (do not group by cname; Skolem term SK(y)):
  Orgs: Microsoft [Projects: (DB, e4), (Web, e6)],
        Google [Projects: (DB, e4), (Web, e6)]
  Employees: (e4, John), (e6, Kat)

where y is a subset of {USA}

The wizard continues to probe the remaining possible grouping attributes
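The wizard's overall loop thus reduces to narrowing the set of candidate grouping attributes with one yes/no answer per probe. A minimal sketch (the `answers` dict stands in for the designer's choices):

```python
# Sketch: Muse-G keeps each probed attribute in the grouping key iff the
# designer chose the 'group by attr' target scenario for that probe.
def infer_grouping(attributes, answers):
    """answers[attr] is True iff the designer picked 'group by attr'."""
    keep = []
    for attr in attributes:  # probe cbranch, then cname, then location
        if answers[attr]:
            keep.append(attr)
    return keep

answers = {"cbranch": False, "cname": True, "location": False}
grouping_key = infer_grouping(["cbranch", "cname", "location"], answers)
# → ["cname"]: the inferred Skolem function groups Projects by cname only
```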


Quality of the Generated Target Schema

When mappings are used in schema integration, we assess the quality of the generated integrated schema w.r.t. the intended integrated schema

Three metrics have been conceived:

Completeness [Batista07]: the fraction of the intended schema's elements that also appear in the generated schema

Minimality [Batista07]: one minus the fraction of extra (non-intended) elements over the size of the intended schema

Structurality: the degree to which the shared elements preserve the structure they have in the intended schema



An example

[Figure: two schema trees. The generated schema contains nodes A, B, C, D, E, F plus extra nodes X and Z; the intended schema contains nodes A, B, C, D, E, F, G.]

Nr. of common elements = 6; extra elements α = 2

Completeness = 6/7; Minimality = 1 - 2/7 = 5/7

Structurality = (1 + 1 + 1 + 0 + 1/4 + 1/2) / 6 = 0.625

Proximity = 0.73
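The slide's numbers can be recomputed directly. A sketch, assuming (consistently with the 0.73 above) that Proximity is the plain average of the three metrics, and taking the per-element structurality scores from the slide:

```python
# Sketch: recompute the example's quality metrics from the two node sets.
from fractions import Fraction

generated = {"A", "B", "C", "D", "E", "F", "X", "Z"}
intended  = {"A", "B", "C", "D", "E", "F", "G"}

common = generated & intended          # 6 shared elements
extra  = generated - intended          # alpha = 2 extra elements (X, Z)

completeness = Fraction(len(common), len(intended))     # 6/7
minimality   = 1 - Fraction(len(extra), len(intended))  # 5/7
# per-element structurality scores of the 6 shared elements, from the slide:
structurality = (1 + 1 + 1 + 0 + Fraction(1, 4) + Fraction(1, 2)) / 6
proximity = (completeness + minimality + structurality) / 3   # ~0.73
```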


Talk Outline

Introduction

Matching and mapping: techniques & tools

Benchmarks and evaluation principles

Challenges in evaluating matching & mapping

Using real-world scenarios for evaluation

Generating synthetic evaluation scenarios

Measuring efficiency and effectiveness

Conclusions and References


What We Talked About

The importance of Matching/Mappings

Explained the matching/mapping tasks

Presented existing tools & their functionality

What is a benchmark

Generic evaluation principles

Why a benchmark for mapping systems is a challenge

Finding real-world evaluation scenarios

Generating synthetic evaluation scenarios

Metrics for effectiveness, efficiency and quality


Main Sources

[Euzenat et al. 2007] J. Euzenat and P. Shvaiko, "Ontology matching", Springer-Verlag, 2007

[Bellahsene et al. 2011] Z. Bellahsene, A. Bonifati and E. Rahm, "Schema Matching and Mapping", Springer-Verlag, 2011


Thank you !!

Are you taking questions?

Of course ! Go ahead


Partial List of References

[Alexe et al. 2008] Alexe B, Chiticariu L, Miller RJ, Tan WC (2008) “Muse: Mapping Understanding and deSign by Example”. In: ICDE, pp 10-19

[Bonifati et al. 2008] Bonifati A, Mecca G, Pappalardo A, Raunich S, Summa G (2008) “Schema Mapping Verification: The Spicy Way”. In: EDBT, pp 85-96

[Lenzerini 2002] Lenzerini M (2002) “Data Integration: A Theoretical Perspective”. In: PODS, pp 233-246

[Miller et al. 2000] Miller RJ, Haas LM, Hernandez MA (2000) “Schema Mapping as Query Discovery”. In: VLDB, pp 77-88

[Rahm et al. 2001] Rahm E, Bernstein PA (2001) “A survey of approaches to automatic schema matching”. VLDB Journal 10(4):334-350

[Velegrakis 2005] Velegrakis Y (2005) “Managing Schema Mappings in Highly Heterogeneous Environments”. PhD thesis, University of Toronto


Partial List of References (cont’d)

[Altova 2008] Altova (2008) MapForce. http://www.altova.com

[Batini et al. 1986] Batini C, Lenzerini M, Navathe SB (1986) A Comparative Analysis of Methodologies for Database Schema Integration. ACM Comp. Surv. 18(4):323-364

[Bernstein et al. 2007] Bernstein PA, Melnik S (2007) Model management 2.0: manipulating richer mappings. In: SIGMOD, pp 1–12

[Do et al. 2002] Do HH, Rahm E (2002) COMA - A System for Flexible Combination of Schema Matching Approaches. In: VLDB, pp 610–621

[Euzenat et al. 2007] Euzenat J, Shvaiko P (2007) Ontology matching. Springer Verlag, Heidelberg

[Fagin et al. 2005] Fagin R, Kolaitis PG, Miller RJ, Popa L (2005) Data exchange: semantics and query answering. Theoretical Computer Science 336(1):89-124

[Lerner 2000] Lerner BS (2000) A Model for Compound Type Changes Encountered in Schema Evolution. ACM TODS 25(1):83–127

[Popa et al. 2002] Popa L, Velegrakis Y, Miller RJ, Hernandez MA, Fagin R (2002) Translating Web Data. In: VLDB, pp 598–609


Partial List of References (cont’d)

[Alexe2010] Alexe B, Kolaitis PG, Tan WC (2010) Characterizing Schema Mappings via Data Examples. In: PODS

[Aumueller2006] Aumueller D, Do HH, Massmann S, Rahm E (2005) Schema and ontology matching with COMA++. In: SIGMOD, pp 906–908

[Bonifati et al. 2010] Angela Bonifati, Elaine Qing Chang, Terence Ho, Laks V. S. Lakshmanan, Rachel Pottinger, Yongik Chung: Schema mapping and query Translation in heterogeneous P2P XML databases. VLDB J. 19(2): 231-256 (2010)

[Chiticariu et al. 2006] Chiticariu L, Tan WC (2006) Debugging Schema Mappings with Routes. In: VLDB, pp 79–90

[Dhamankar et al. 2004] Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Y. Halevy, Pedro Domingos: iMAP: Discovering Complex Mappings between Database Schemas. SIGMOD Conference 2004: 383-394

[Doan et al. 2001] Doan A, Domingos P, Halevy AY (2001) Reconciling schemas of disparate data sources: A machine-learning approach. In: SIGMOD, pp 509–520


Partial List of References (cont’d)

[Fletcher et al. 2006] Fletcher GHL, Wyss CM (2006) Data Mapping as Search. In: EDBT, pp 95–111

[Giunchiglia et al. 2004] Giunchiglia F, Shvaiko P, Yatskevich M (2004) S-Match: an Algorithm and an Implementation of Semantic Matching. In: ESWS, pp 61–75

[Madhavan et al. 2001] Madhavan J, Bernstein PA, Rahm E (2001) Generic Schema Matching with Cupid. In: VLDB, pp 49–58

[Naumann et al. 2002] Naumann F, Ho CT, Tian X, Haas LM, Megiddo N (2002) Attribute Classification Using Feature Analysis. In: ICDE, p 271

[Shu et al. 1977] N. C. Shu, B. C. Housel, R. W. Taylor, S. P. Ghosh, and V. Y. Lum. EXPRESS: A Data EXtraction, Processing and REstructuring System. ACM Transactions on Database Systems (TODS), 2(2):134–174, 1977.