
Benchmarks: From Usage To Evaluation

Schema Matching and Mapping Systems

Yannis Velegrakis (University of Trento), Angela Bonifati (CNR)

EDBT 2011, Uppsala, Sweden, March 21st-25th

2 EDBT 2011, A. Bonifati & Y. Velegrakis

Talk Outline

Introduction

Matching and mapping: techniques & tools

Benchmarks and evaluation principles

Designing a matching & mapping benchmark

Using real-world scenarios for evaluation

Generating synthetic evaluation scenarios

Measuring efficiency and effectiveness

Conclusions and future directions


Introduction

Data is inherently heterogeneous

Due to the explosion of online data repositories

Due to the variety of users, who develop a wealth of applications

At different times

With disparate requirements in mind

A fundamental requirement is to translate data across different formats

How data is transformed from one format to another is specified through mappings


Mappings Are All Around

Data integration [Lenzerini 2002]

to specify the relationship between local and global schemas

[Figure: sources S1, S2, S3 with instances I1, I2, I3 mapped to a global schema T.]


Mappings Are All Around

Schema integration [Batini et al. 1986]

to specify the relationship between the input schemas and the integrated schema

[Figure: input schemas S1, S2, S3 merged into an integrated schema.]


Mappings Are All Around

Data exchange [Fagin et al. 2005]

to specify the relationship between source and target schemas

[Figure: a source schema S and a target schema T connected by mappings; an instance I of S is translated into an instance J of T.]


Mappings Are All Around

Schema evolution [Lerner 2000]

to specify the relationship between the old and new version of an evolved schema

[Figure: an evolving schema S1 with successive versions S1' and S1''.]


How Did It All Start

One of the first systems to deal with this problem was EXPRESS (EXtraction, Processing and REStructuring System), developed at IBM in 1977 [Shu et al. 1977]. It consists of two languages:

DEFINE, which works as a DDL (Data Definition Language)

CONVERT, which works as a DTL (Data Translation Language) and has a total of 9 operators, each of which receives as input a data file, performs the respective transformation, and generates an output data file

EXPRESS required the user's familiarity with the languages and was customized to only one model (hierarchical)

After that, inter-model transformations were also studied [Tork-Roth et al. 1997] [Atzeni et al. 1997]


Emphasis on Data Translation

[Abiteboul et al. 1997] proposed a declarative framework for data translation

[Davidson et al. 1997] focused on constraint satisfaction

[Milo et al. 1998] leveraged a library of transformation rules and pattern-matching techniques

[Cluet et al. 1998] emphasized type-checking

[Beeri et al. 1999] focused on tree-based transformations for XML data structures


A Data Transfer Example

Source: Rcd
  projects: Set of
    project: Rcd
      name
      status
  grants: Set of
    grant: Rcd
      gid
      project
      recipient
      manager
      supervisor
  contacts: Set of
    contact: Rcd
      cid
      email
      phone
  companies: Set of
    company: Rcd
      name
      official

Target: Rcd
  projects: Set of
    project: Rcd
      code
      funds: Set of
        fund: Rcd
          fid
          finId
  finances: Set of
    finance: Rcd
      finId
      mPhone
      company
  companies: Set of
    company: Rcd
      coid
      name

Source instance:

Projects
  name        status
  PIX         Active
  E-services  Active
  Clio        Inactive

Grants
  gid  project     recipient  manager      supervisor
  g1   PIX         AT&T       Fernandez    Belanger
  g2   PIX         AT&T       Shrivastava  Belanger
  g3   E-services  Bell-labs  Benedikt     Hull

Contacts
  cid          email                            phone
  Benedikt     benedikt@research.bell-labs.com  5827766
  Hull         hull@research.bell-labs.com      5824509
  Shrivastava  divesh@research.att.com          3608776
  Belanger     dgb@research.att.com             3608600
  Fernandez    mff@research.att.com             3608679

Companies
  name    official
  AT&T    AT&T Research Labs
  Lucent  Lucent Technologies, Bell Labs Innovations

Desired Target Instance

[Figure: the source and target schemas of the data transfer example above.]

Target instance:

Projects
  code: PIX
    Funds
      fid  finId
      g1   ???
      g2   ???
  code: E-services
    Funds
      fid  finId
      g3   ???

Finances
  finId  mPhone   company
  ???    3608679  ???
  ???    3608776  ???
  ???    5827766  ???

Companies
  coid         name
  Sk2(AT&T)    AT&T
  Sk2(Lucent)  Lucent
  ???          ???
  ???          ???
  ???          ???

The Needed Transformation Query

LET $doc0 := document("inputXMLfile")
RETURN
<T>
  { distinct-values (
      FOR $x0 IN $doc0/S/grant, $x1 IN $doc0/S/project,
          $x2 IN $doc0/S/contact, $x3 IN $doc0/S/contact
      WHERE $x2/cid/text() = $x0/manager/text()
        AND $x0/supervisor/text() = $x3/cid/text()
        AND $x0/project/text() = $x1/name/text()
      RETURN
      <project>
        <code> { $x0/project/text() } </code>
        { distinct-values (
            FOR $x0L1 IN $doc0/S/grant, $x1L1 IN $doc0/S/project,
                $x2L1 IN $doc0/S/contact, $x3L1 IN $doc0/S/contact
            WHERE $x2L1/cid/text() = $x0L1/manager/text()
              AND $x0L1/supervisor/text() = $x3L1/cid/text()
              AND $x0L1/project/text() = $x1L1/name/text()
              AND $x0/project/text() = $x0L1/project/text()
            RETURN
            <funding>
              <fid> { $x0L1/gid/text() } </fid>
              <finId> { "Sk52(", $x0L1/gid/text(), ", ", $x0L1/project/text(), ")" } </finId>
            </funding> ) }
      </project> ) }
  { distinct-values (
      FOR $x0 IN $doc0/S/grant, $x1 IN $doc0/S/project,
          $x2 IN $doc0/S/contact, $x3 IN $doc0/S/contact
      WHERE $x2/cid/text() = $x0/manager/text()
        AND $x0/supervisor/text() = $x3/cid/text()
        AND $x0/project/text() = $x1/name/text()
      RETURN
      <finance>
        <finId> { $x0/gid/text() } </finId>
        <mPhone> { $x2/phone/text() } </mPhone>
        <company> { "Sk46(", $x2/phone/text(), ", ", $x0/gid/text(), ")" } </company>
      </finance> ) }
  { distinct-values (
      FOR $x0 IN $doc0/S/grant, $x1 IN $doc0/S/project,
          $x2 IN $doc0/S/contact, $x3 IN $doc0/S/contact
      WHERE $x2/cid/text() = $x0/manager/text()
        AND $x0/supervisor/text() = $x3/cid/text()
        AND $x0/project/text() = $x1/name/text()
      RETURN
      <company>
        <coid> { "Sk46(", $x2/phone/text(), ", ", $x0/gid/text(), ")" } </coid>
        <name> { "Sk49(", $x2/phone/text(), ", ", $x0/gid/text(), ")" } </name>
      </company> ) }
  { distinct-values (
      FOR $x0 IN $doc0/S/company
      RETURN
      <company>
        <coid> { "Sk93(", $x0/cname/text(), ")" } </coid>
        <name> { $x0/cname/text() } </name>
      </company> ) }
</T>


The Road To Mapping Systems

The design of data transformations remained a manual task for a long while

Designers had to be familiar with the transformation language

As schemas became larger and more complex, the task became too laborious, time-consuming and error-prone

The need to raise the level of abstraction and to automate the task was soon recognized

The idea: Mapping Systems


Generating Mappings

Different techniques exist to generate mappings:

Manual, e.g.

by means of high-level mapping languages, such as [Bernstein et al. 2007]

by means of sophisticated user interfaces [Altova 2008]

Semi-automatic, e.g.

By means of designer guidance [Alexe et al. 2008]

Via advanced algorithms that do the reasoning in place of the mapping designer [Madhavan et al. 2001] [Popa et al. 2002] [Do et al. 2002] [Bonifati et al. 2008]


The First Step of a Mapping Task

[Figure: the source and target schemas of the data transfer example, shown side by side together with the source instance; the first step is to relate the elements of the two schemas.]

Matching

Given two schemas as input, a source and a target schema, matching is the process that produces as output a set of matches, also called correspondences or (simply) lines, between the elements of the two schemas

A match is a triple <Es, Et, e>, where Es is a set of elements of the source schema, Et is a set of elements of the target schema, and e specifies a simple relationship (equality or set inclusion) or a complex relationship between the elements in Es and Et
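In code, such a match can be modeled as a small record; a minimal sketch (the class and field names are illustrative, not taken from any of the tools discussed here):

```python
from dataclasses import dataclass

# A match <Es, Et, e>: two sets of schema elements plus the
# relationship e claimed to hold between them.
@dataclass(frozen=True)
class Match:
    source_elements: frozenset  # Es: elements of the source schema
    target_elements: frozenset  # Et: elements of the target schema
    relationship: str           # e: "=", "subset", or a complex expression

# A simple 1-1 equality match and a complex 1-1 match:
m1 = Match(frozenset({"Name"}), frozenset({"Title"}), "=")
m2 = Match(frozenset({"velocity"}), frozenset({"speed"}),
           "speed = velocity * 2.237")
```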


The Matching Relationship e

Depends on the cardinalities of Es and Et

Depends on the semantics:

Can be a function

Can be an arithmetic operation

Can be a set-theoretic relation (e.g., ≡ or overlaps)

Can be a conceptual modeling relationship (e.g. part-of, subclass-of)


Matching: An Alternative Definition

The matching process [Euzenat et al. 2007] can be seen as a function f that takes a pair of schemas S and T, an optional input alignment A, a set of matching parameters p, and a set of resources r, and returns an alignment A':

A' = f(S, T, A, p, r)

Ultimately, an alignment is a set of correspondences between elements in S and elements in T


Matching Examples

Simple relationships:
  Name ≡ Title
  Location ≡ Address

Complex relationships:
  speed = velocity x 2.237
  speed x 0.447 = velocity
  speed = concat(velocity x 2.237, 'MPH')
  speed ≥ velocity

[Figure: a source schema Company (Name, Location) matched to a target schema Organization (Title, Address).]


The matching process

Can be roughly divided into three steps:

Pre-match: training of classifiers for machine-learning-based matchers, setting of matching parameters (weights, thresholds), adjustment of resources such as thesauri and constraints

Match: the actual matching task

Post-match: the user may check and modify the displayed matches


Some Schema Matchers

Cupid [Madhavan et al. 2001]: based on structural and name similarity

S-Match [Giunchiglia et al. 2004]: based on semantic closeness

Coma++ [Aumueller et al. 2005]: based on matching reuse

LSD [Doan et al. 2001]: based on data value analysis and machine-learning techniques

iMap [Dhamankar et al. 2004]: suited for complex matching expressions e

Similarity Flooding [Melnik et al. 2002]: based on graph similarity
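None of the systems above is this simple, but the flavor of a name-based matcher (in the spirit of Cupid's name similarity) can be sketched with Python's standard library; the element names and the 0.55 threshold are illustrative assumptions:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Normalized string similarity between two element names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_by_name(source, target, threshold=0.55):
    """Return candidate 1-1 correspondences whose name similarity
    exceeds the threshold (a toy stand-in for a real matcher)."""
    return [(s, t, round(name_similarity(s, t), 2))
            for s in source for t in target
            if name_similarity(s, t) >= threshold]

# Element names from the running data-transfer example:
pairs = match_by_name(["name", "phone", "gid"],
                      ["cname", "mPhone", "fid", "status"])
```

A real matcher would combine several such similarity signals (structure, data values, thesauri) rather than rely on names alone.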


Similarity Flooding


COMA++


Matchings Are Not Enough

[Figure: the source and target schemas of the data transfer example, with correspondences drawn between their elements, together with the source instance.]

Matchings cannot describe the full details of the transformation

The Mapping Generation Process

[Figure: a Matcher takes the source and target schemas as input and produces matchings.]

Matching is just the beginning of any mapping generation process

Mappings

Given the source and the target schemas, mapping is the process that takes as input a set of matches between the elements of the two schemas and produces a relationship or constraint e that must hold between their respective instances

In other words, a mapping is a triple <S, T, e>, where S is the source schema, T is the target schema, and e specifies either a constraint that any pair of instances of S and T must satisfy, or an executable statement that transforms instances of S into instances of T


A Mapping Example

[Figure: the source and target schemas of the data transfer example.]

project(na,st), grant(gid,na,re,ma,su), contact(ma,em,ph)
  → project(na,FUND), fund(gid,finId), finance(finId,ph,company), company(company,name)

Mappings & Instances

Mappings are the basic ingredients of many tasks, such as information integration, P2P query answering, data exchange etc.

In particular, mappings as inter-schema constraints may not be enough to fully specify a unique target instance: there may exist multiple target instances satisfying the mappings

Finding the best target instance is the goal of the data exchange problem [Fagin et al. 2005]: the mapping is converted into an executable transformation script to obtain that particular instance


A Data Exchange Example

[Figure: the source and target schemas of the data transfer example.]

project(na,st), grant(gid,na,re,ma,su), contact(ma,em,ph)
  → project(na,FUND), fund(gid,finId), finance(finId,ph,company), company(company,name)

Target instance:

Projects
  code: PIX
    Funds
      fid  finId
      g1   ???
      g2   ???
  code: E-services
    Funds
      fid  finId
      g3   ???

Finances
  finId  mPhone   company
  ???    3608679  ???
  ???    3608776  ???
  ???    3608600  ???

Companies
  coid  name
  ???   AT&T
  ???   Lucent

The Mapping Generation Process

[Figure: the full pipeline. A Matcher takes the source and target schemas and produces matchings; guided by the user, a Mapping Generation Engine turns the matchings into mappings (dependencies); a Query Engine or a Data Exchange Engine compiles the mappings into transformation scripts that translate the source instance into the target instance.]

Research Prototype Systems

Mapping generation and data exchange are separate tasks: Clio [Popa et al. 2002], HePToX [Bonifati et al. 2010], Spicy [Bonifati et al. 2008]

Mapping generation: the mappings are expressed as high-level assertions in a logical formalism. A mapping is a source-to-target tuple-generating dependency (s-t tgd, for short):

  φ(x) → ∃y ψ(x, y)

where φ(x) (resp. ψ(x, y)) is a conjunction of atoms over the source (resp. target)

Data exchange: the respective module transforms the high-level mappings into transformation scripts (in SQL or XQuery) to generate the target instance
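To make the tgd semantics concrete, here is a hand-rolled check of a single s-t tgd on toy instances (not code from Clio, HePToX, or Spicy); the relations follow the student/advisor example used later in the tutorial, and encoding tuples as Python sets is an assumption made for the sketch:

```python
# Check the s-t tgd  PTStud(age, name) -> ∃facid Advised(name, facid):
# every part-time student's name must appear as an advised student in
# the target, for some (possibly unknown) faculty id.
def satisfies_tgd(ptstud, advised):
    advised_names = {sname for (sname, _facid) in advised}
    return all(name in advised_names for (_age, name) in ptstud)

source_ptstud = {(27, "Bob"), (30, "Ann")}
target_advised = {("Bob", "N3"), ("Ann", "N4")}  # N3, N4: labeled nulls

holds = satisfies_tgd(source_ptstud, target_advised)
broken = satisfies_tgd(source_ptstud, {("Bob", "N3")})  # Ann not advised
```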


Clio


Spicy


HepToX


Commercial Mapping Systems

[Figure: the same pipeline, with the matcher, the mapping generation engine, and the data exchange engine merged into a single Mapping Engine.]

Mapping generation and data exchange are merged into one step: the system directly creates the final transformation script in some native language

Popular Commercial Systems

Altova Mapforce

Stylus Studio

IBM Rational Data Architect

BizTalk mapper

Adeptia

BEA Aqualogic


Stylus Studio


Altova MapForce


Adeptia


BizTalk Mapper


IBM Rational Data Architect


A Mapping Tool Categorization

All tools provide to the mapping designer:

A graphical representation of the two schemas

A set of graphical transformation constructs

The granularity and power of these constructs is a main factor of differentiation among the tools

[Figure: a spectrum ranging from detailed specification by the designer (roughly, the commercial mapping tools) to intelligence of the mapping tool and effort in post-verification (roughly, the research prototypes).]


Issues in Data Exchange

When multiple target instances exist, how do we compute the best one?

Is a given target instance better than another?

Universal solutions

Introduced in [Fagin et al. 2005]

These are the “most general” target instances, and also represent the entire space of solutions

Among the universal solutions, the smallest and most compact one is called the “core”
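What makes a solution universal can be illustrated with a brute-force homomorphism test: labeled nulls may be renamed while constants must be preserved, and a universal solution maps homomorphically into every other solution. A toy sketch under invented conventions (nulls are strings starting with "N"; the instances and the constants F7, F8, F9 are made up for illustration):

```python
from itertools import product

def is_null(v):
    # Toy convention: labeled nulls are strings starting with "N".
    return isinstance(v, str) and v.startswith("N")

def homomorphism_exists(src, dst):
    """src, dst: sets of (relation, tuple) facts. Try every mapping of
    src's nulls to dst's values; constants must map to themselves."""
    nulls = sorted({v for (_, t) in src for v in t if is_null(v)})
    values = {v for (_, t) in dst for v in t}
    for image in product(values, repeat=len(nulls)):
        h = dict(zip(nulls, image))
        mapped = {(r, tuple(h.get(v, v) for v in t)) for (r, t) in src}
        if mapped <= dst:
            return True
    return False

core = {("Advised", ("Bob", "N3")), ("Advised", ("Ann", "N4")),
        ("Advised", ("John", "N1"))}
# Another solution that resolved the nulls to plain constants:
solution = {("Advised", ("Bob", "F7")), ("Advised", ("Ann", "F8")),
            ("Advised", ("John", "F9"))}
```

The core maps into the other solution (rename N3→F7, N4→F8, N1→F9), but not vice versa, since constants cannot be renamed.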


Universal Core Instances

S: Rcd
  PTStud: Rcd
    age
    name
  GradStud: Rcd
    age
    name

T: Rcd
  Advised: Rcd
    sname
    facid
  WorksWith: Rcd
    sname
    facid

Mappings:
  PTStud(x,y) → Advised(y,z)
  GradStud(x,y) → Advised(y,z), WorksWith(y,z)

Source instance:

PTStud
  age  name
  27   Bob
  30   Ann

GradStud
  age  name
  32   John
  30   Ann

A solution:

Advised
  sname  facid
  Bob    N3
  Ann    N4
  John   N1
  N1     Cathy

WorksWith
  sname  facid
  Bob    N3
  Ann    N4

A universal solution:

Advised
  sname  facid
  Bob    N3
  Ann    N4
  John   N1
  N2     Ann

WorksWith
  sname  facid
  Bob    N3
  Ann    N4

The core:

Advised
  sname  facid
  Bob    N3
  Ann    N4
  John   N1

WorksWith
  sname  facid
  Bob    N3
  Ann    N4

Commercial vs. Research Prototype Systems

Whereas research prototypes (e.g., Clio, Spicy++) tend to produce target instances that look more and more like the core,

commercial tools leave the task to the users, who have to manually interact with sophisticated GUIs and write pieces of the transformation by hand

No notion of a core is even considered ("Core? No, thanks")


Mixing Matching and Mapping

Matching and mapping are not always performed by separate tools

Clio has as an add-on a matcher based on attribute feature analysis [Naumann et al. 2002]

Bernstein's model management considers the matcher a fully integrated and indistinguishable component

Spicy [Bonifati et al. 2008] has a matcher based on instance-based structural analysis


Limitations of Current Systems

Manual approaches are not applicable to large-scale mapping tasks

The user/developer has to become familiar with the mapping language and the user interfaces

The outcome of the mapping process may not respect the user requirements and desired semantics (unsurprisingly!)

Specifications may be incomplete and dependent on system peculiarities

Thus, there is a need for a verification and guidance process


The Verification Process

[Figure: the mapping generation pipeline extended with a Verification and Selection step, in which the user checks the generated target instance against data examples and an expected target instance.]

A-Posteriori Verification

The main problem with matching and mapping is the dichotomy between the expected results and the generated answers

Some tools allow a post-verification

by using data examples

Tupelo [Fletcher et al. 2006], Muse [Alexe et al. 2008], Clio [Alexe et al. 2010]

by using an automatic instance comparison,

Spicy [Bonifati et al. 2008]

by means of manual user feedback

unfeasible for large-scale tasks

via debugging techniques

Routes [Chiticariu et al. 2006]


ETL systems

Extract-Transform-Load tools are data transformation tools based on graphical flowcharts with nodes encoding transformation primitives and edges encoding the transformation flow

Can be considered a special form of mapping system that generates transformations via:

  a GUI

  an intermediate language (an algebra for ETL)

  an output (the transformation scripts)

They are not mapping tools in the classical sense: they focus only on data transformation operators
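In the same spirit, a toy ETL flow can be written as composed row-level transformation primitives; the node names (not_null, surrogate_key, phone_format) echo the ETL flowchart but are otherwise illustrative, not from any ETL product:

```python
# Each node is a function on a list of rows; edges in the flowchart
# correspond to function composition.
def not_null(rows, field):
    # Filter out rows whose production key is missing.
    return [r for r in rows if r.get(field) is not None]

def surrogate_key(rows, field):
    # Replace the production key with a generated surrogate key SK(...).
    return [{**r, field: f"SK{i}"} for i, r in enumerate(rows, start=1)]

def phone_format(rows):
    # Normalize phone numbers by stripping dashes.
    return [{**r, "phone": r["phone"].replace("-", "")} for r in rows]

customers = [{"custkey": 10, "phone": "360-8776"},
             {"custkey": None, "phone": "582-7766"}]  # fails Not Null
loaded = phone_format(surrogate_key(not_null(customers, "custkey"),
                                    "custkey"))
```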


An ETL data flowchart

[Figure: an ETL flow over Customer.new and Customer.old with numbered nodes applying Not Null(CustKey), SK(custkey), PhoneFormat, and New − Old transformations, producing Cnew, Cold, and an error output.]

How Can One Decide If A Product Is Good?


Importance Of A Benchmark

Helps designers and developers to improve their tools, by assessing their usefulness and constantly evaluating their performance

Helps users to compare the different available tools and evaluate their suitability for their needs [Haas et al. 2007]

Helps researchers to compare themselves to others

Should exist for the long term, to allow adequate measurements and to track the evolution of the field

Helps assess absolute results: the properties of the results, and how they compare to the others


Benchmark

Well-designed tests (scenarios) with which the results of a system can be evaluated [Castro et al. 2004]

A standardized application scenario that serves as a basis for testing, evaluation, and comparison [Merriam-Webster]

Clearly specified scenarios that everyone can implement

Clearly specify the factors that are measured, and under what conditions they should be measured.

Should measure the degree of achievement

Should be reproducible and stable

Can be used repeatedly


Principles

Systematic Procedure

Continuity

Quality and equity

Dissemination

Intelligibility


Types of evaluation

Competence benchmarks

  Measure competence and performance with respect to a task

  Aim at characterizing the kind of tasks each method is good for

  For designers to improve their systems

Comparative evaluation

  Comparison of the results of various systems on a common task

  Aims at finding the best system; tuning of the systems is an issue

  Comparison of systems, aiming at general field improvement

Application-specific (competitive) evaluation

  Comparison of various systems on a specific task


Evaluation Steps

Planning

Specifying task, software, hardware, input, output

Processing

Analysis

Result evaluation according to predefined measures


Bottom Line: Benchmarks Are Great!


Generic Matching/Mapping Benchmark Goals

Compare in terms of

Performance

Usability

Effectiveness

Applicability to real-world scenarios

Improve the quality of the matching and mapping generation process


Query vs Matching/Mapping Benchmarks

Query benchmark:

  Evaluation scenarios: a setting (a database instance/schema + a query) and the outcome

  The query engine should support the scenarios (mainly, it should be able to evaluate the query of each scenario)

  Supporting the scenario: query engine result = expected outcome

  Good query engine = fast (correct) responses

  Vary the characteristics of the data instance to measure how well the engine scales

Schema mapping benchmark:

  For a mapping system, what is the input and what is the output?

  A mapping tool's input language should be able to express the transformations of interest. What are they?

  How do you compare a mapping system's output with an expected output?

  What do we measure? Effort? Expressiveness?

  What do we scale?

The Scenario Input

Source schema S

Target schema T

Possibly an instance of the source schema S

A specification of what we need to achieve

  Matching systems: typically there is no specification; just the source and the target schema, plus a complete set of matches assumed to be correct

  Mapping systems: a major issue


The Specification for Mapping Systems

An expected (desired) transformation

Mapping Systems try to guess it

Issues

No formal semantic framework to express it

No formal relationship to the outcome.

Note: query engine benchmarks (e.g., TPC-H or XMark) leverage the semantics of the query language

  It is clear what the scenario is asking

  It is clear how to compare the result sets


Expressing Desired Transformations

Natural Language

Too generic and ambiguous

Complete specification formalism (query?)

Defeats the purpose of a mapping system

Comparing the mapping generated by the tool to the precise specification amounts to deciding the equivalence of two mappings (a hard problem!)


Expressing Desired Transformations

Graphical interface

  Different constructs are offered by different tools

  Typically a GUI for the query language

  Continuously evolving

Simple specification (correspondences?)

  1-1 or many-1, between atomic or complex elements, nested, with or without annotations, GUI constructs

  Can get so complex that they become the same as the actual mappings

  Ambiguous: the same set of correspondences is interpreted differently by different mapping tools. Without a standard way to interpret them? Risky!


A Simple Ambiguous Scenario

Different interpretations may arise from a simple "copy" scenario

[Figure: Company (Name, Location) in the source matched to Organization (Title, Address) in the target.]

<Source>

<Company>

<Name>IBM</Name>

<Location>NY</Location>

</Company>

<Company>

<Name>MS</Name>

<Location>WA</Location>

</Company>

</Source>

<Target>

<Organization>

<Title>IBM</Title>

<Title>MS</Title>

<Address>NY</Address>

<Address>WA</Address>

</Organization>

</Target>


A Simple Ambiguous Scenario

Different tools might generate different instances with the same arrows

<Source>

<Company>

<Name>IBM</Name>

<Location>NY</Location>

</Company>

<Company>

<Name>MS</Name>

<Location>WA</Location>

</Company>

</Source>

<Target>

<Organization>

<Title>IBM</Title>

<Address>NY</Address>

</Organization>

</Target>

[Figure: the same correspondences between Company (Name, Location) and Organization (Title, Address).]


A Simple Ambiguous Scenario

Arrows between non-leaf nodes are not allowed in all tools

<Source>

<Company>

<Name>IBM</Name>

<Location>NY</Location>

</Company>

<Company>

<Name>MS</Name>

<Location>WA</Location>

</Company>

</Source>

<Target>

<Organization>

<Title>IBM</Title>

<Address>NY</Address>

</Organization>

<Organization>

<Title>MS</Title>

<Address>WA</Address>

</Organization>

</Target>

[Figure: the same correspondences between Company (Name, Location) and Organization (Title, Address).]


The Issue of The User Input

Many tools allow the mapping designer to manually edit the generated transformation

The edits can have the full power of the underlying language

Tools differ in the shortcuts and abstraction levels they offer


The Scenario Output

The output has to be correct

Satisfy the desired transformation

Compare to the expected transformation

Matching: the output is a set of matches

Mapping: it is not clear what the output is

  The transformation scripts?

  The transformed data? (The same data may be generated by different mappings)


Evaluation Challenges

What are we testing?

Expressiveness

Performance

of the tool?

of the generated mappings?

Quality

of the generated mappings?

of the integrated schema?

of the target data? [Dong et al. 2009]

User effort

Heavily depends on the mapping interface

Measuring these factors is hard without a formal (and standard) agreement on expressing specifications


Matching vs. Mapping Benchmarking

Matching system evaluation is typically a set comparison

  Considers only the semantics of the schemas

  More automatic

Mapping system evaluation is more challenging

  Considers the semantics of both schemas and transformations

  Requires more human intervention
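The set comparison against a gold standard is usually reported as precision, recall, and F-measure; a minimal sketch, reusing the Name/Title, Location/Address correspondences from the earlier example:

```python
def precision_recall_f1(found, gold):
    """Compare a produced set of correspondences against a gold standard."""
    tp = len(found & gold)  # true positives: correct correspondences found
    precision = tp / len(found) if found else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("Name", "Title"), ("Location", "Address")}
found = {("Name", "Title"), ("Name", "Address")}  # one hit, one miss
p, r, f = precision_recall_f1(found, gold)
```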


The User Can Also Help …

By being presented with …

The mappings

Difficult to overcome the heterogeneity of languages

The generated target instance

Not feasible for large and complex instances [Velegrakis et al. 2005]

A representative sample of the target instance

Appealing alternative based on positive/negative examples, but still in its infancy [Alexe et al. 2008]

Presented in detail later on

How do all these get into the evaluation function?


A Common Design Pattern

Example: TPC-H

Sets of test cases (With their expected output)

Find those that can be successfully executed

Characterize the system

Sets of Matching or Mapping Scenarios

79 EDBT 2011, A. Bonifati & Y. Velegrakis

Examples of Data Sets

Public and well-designed schemas

Meaningful overlap

Limited by the existence of real schemas

Need to be discriminating

Who is the Oracle? External knowledge or a human?

80 EDBT 2011, A. Bonifati & Y. Velegrakis

Large Scale Ontology Sets

[Zhang et al. 2004]

Two large ontologies from the anatomy domain

Foundational Model of Anatomy & GALEN

Thousands of classes, no instances

[Lambrix et al. 2003]

Gene & Signal Ontology

Partial overlap

81 EDBT 2011, A. Bonifati & Y. Velegrakis

OAEI Data

Artificial data set

33 classes, 64 properties, 76 individuals

Initial ontology distorted

Result: ~50 pairs of ontologies

Correct by construction

82 EDBT 2011, A. Bonifati & Y. Velegrakis

Data Set Factors Affecting The Evaluation

Heterogeneity of the modeling language (schemas/ontologies)

The language itself

Number of schemas (1-to-1 or many-to-1)

From-scratch matching, or is there a head start?

Multiplicity: how many elements in one schema can match how many in the other

Are Oracles permitted? Is user input permitted?

Can there be a-priori training?

External methods and auxiliary inputs

Is justification of the output needed?

Relations of the correspondences (only = or others as well)

Is there a time limit for the matching/mapping?

Can the matching be on leaves only or not?

83 EDBT 2011, A. Bonifati & Y. Velegrakis

OAEI Evaluation Example

[Euzenat et al., 2006]

Ontology Alignment Evaluation Initiative

oaei.ontologymatching.org

Yearly contest

Participants:

Provided with OAEI API

Execute all tests

Provide their results & Paper

Make the results public

84 EDBT 2011, A. Bonifati & Y. Velegrakis

Building Large Ontology Sets

[Avesani et al. 2005]

Test sets for matching web directories and classifications

Two web directories are similar if their web pages are similar

It can be considered a matching technique by itself

85 EDBT 2011, A. Bonifati & Y. Velegrakis

Thesauri

Thesauri covering large hierarchies of concepts and textual knowledge

Digital Libraries and Museums

Large need to match them

Example:

AGROVOC (FAO): 16K terms

NAL (US Department of Agriculture): 41K terms

86 EDBT 2011, A. Bonifati & Y. Velegrakis

Various examples

Illinois Semantic Integration Archive

http://pages.cs.wisc.edu/~anhai/wisc-si-archive

Collection of different schemas & Data

Faculties

Courses

Real Estate

87 EDBT 2011, A. Bonifati & Y. Velegrakis

Real Examples Lack Systematic Design

Existing Datasets not systematic

Completeness ?

Correctness?

Deduplicated?

Clarity?

Mainly testbeds or standardized tests

But not benchmarks

Benchmark tests should be

Consistent

Complete

Minimal

88 EDBT 2011, A. Bonifati & Y. Velegrakis

Real-World Matching Problems

[Kopcke et al. 2010]

Collection of matching problems

[Giunchiglia et al. 2009]

4500 matches between 3 web directories

Error free

Low complexity

High discriminative capacity

89 EDBT 2011, A. Bonifati & Y. Velegrakis

XBenchMatch

[Duchateau et al. 2007]

Criteria for testing and evaluating matching tools

Focuses on assessment of matching tools

Quality

Time

10 Datasets for matching

Classified according to:

Data level, e.g., degree of heterogeneity

Process level, e.g., scale

90 EDBT 2011, A. Bonifati & Y. Velegrakis

STBenchmark

[Alexe et al. 2008] www.stbenchmark.org

Evaluate the effectiveness of the mapping system

Derived from real applications DBLP, BioWarehouse, …

Derived from the Information Integration literature [Lerner, 2000], [Carey, 2000], etc.

Minimum set of transformations that should be supported

One scenario = one transformation, described by:

Source & target schemas

Transformation query

Instance of the source schema

Captures the most practically relevant transformation cases:

Copying, Constant Value Generation, Horizontal Partition, Surrogate Key Assignment, Vertical Partition, Unnesting (Flattening), Nesting, Self-Joins, Denormalization, Keys and Object Fusion, Atomic Value Management, Aggregation, Ordering (Order By), Flipping Metadata to Data, Flipping Data to Metadata, Flipping Data to Nested Metadata

91 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Copy

Source:

Protein * name accession created

Target:

Protein * Name Accession Created

for $x0 in $doc/Source/Protein
return
  <Protein>
    <Name> $x0/name/text() </Name>
    <Accession> $x0/accession/text() </Accession>
    <Created> $x0/created/text() </Created>
  </Protein>

Textual description +

92 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Value Generation

Target:

DataSet * Name LoadingDate

“SwissProt” “July 4th”

<DataSet> <Name>SwissProt</Name> <LoadingDate>July 4th</LoadingDate> </DataSet>

93 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Horizontal Partitioning

Source:

gene * txt type protein

Target:

Gene * Name Protein

Synonym * Name Protein

If type == "primary"

for $x0 in $doc/Source/Gene
where $x0/type/text() == "primary"
return
  <Gene>
    <Name> $x0/name/text() </Name>
    <Protein> $x0/protein/text() </Protein>
  </Gene>

for $x0 in $doc/Source/Gene
where $x0/type/text() != "primary"
return
  <Synonym>
    <Name> $x0/name/text() </Name>
    <Protein> $x0/protein/text() </Protein>
  </Synonym>

94 EDBT 2011, A. Bonifati & Y. Velegrakis
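A minimal Python sketch of the horizontal-partitioning idea, assuming a simplified record layout with sample values (the Gene/Synonym split on the "primary" discriminator follows the scenario above; the field names and data are hypothetical):

```python
# Horizontal partitioning: route each source gene record into one of two
# target sets depending on the value of its "type" discriminator.
def horizontally_partition(genes):
    target = {"Gene": [], "Synonym": []}
    for g in genes:
        row = {"Name": g["name"], "Protein": g["protein"]}
        if g["type"] == "primary":
            target["Gene"].append(row)
        else:
            target["Synonym"].append(row)
    return target

source = [
    {"name": "TP53", "type": "primary", "protein": "P04637"},
    {"name": "p53",  "type": "alias",   "protein": "P04637"},
]
result = horizontally_partition(source)
```

Each source record ends up in exactly one partition, mirroring the two complementary where-clauses of the transformation query.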

Scenario: Surrogate Key Assignment

Source:

gene * txt type protein

Target:

Gene * Name Protein WID

Synonym * Name Protein WID

If type == "primary"

Id(), Id'()

for $x0 in $doc/Source/Gene
where $x0/type/text() == "primary"
return
  <Gene>
    <Name> $x0/name/text() </Name>
    <Protein> $x0/protein/text() </Protein>
    <WID> genID() </WID>
  </Gene>

for $x0 in $doc/Source/Gene
where $x0/type/text() != "primary"
return
  <Synonym>
    <Name> $x0/name/text() </Name>
    <Protein> $x0/protein/text() </Protein>
    <WID> genID() </WID>
  </Synonym>

95 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Vertical Partition

Source:

Reaction * entry name comment orthology definition equation

Target:

Reaction * Entry Name Comment CoFactor

ChemicalInfo * Orthology Definition Equation CoFactor

Labeled Nulls

for $x0 in $doc/Source/Reaction
let $id := genID()
return (
  <Reaction>
    <Entry> $x0/entry/text() </Entry>
    <Name> $x0/name/text() </Name>
    <Comment> $x0/comment/text() </Comment>
    <CoFactor> $id </CoFactor>
  </Reaction>,
  <ChemicalInfo>
    <Orthology> $x0/orthology/text() </Orthology>
    <Definition> $x0/definition/text() </Definition>
    <Equation> $x0/equation/text() </Equation>
    <CoFactor> $id </CoFactor>
  </ChemicalInfo>
)

Normalization. Note that no key information is assumed; as such, duplication is allowed.

96 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Join Path Selection

Target:

Taxon * Id Name UniqueName Class Parent Rank EmblCode

Source:

Name * id name uniqueName class

Node * taxid parentId rank emblCode

for $x0 in $doc/Source/Name,
    $x1 in $doc/Source/Node
where $x0/id/text() = $x1/taxId/text()
return
  <Taxon>
    <Id> $x0/id/text() </Id>
    <Name> $x0/name/text() </Name>
    <UniqueName> $x0/uniqueName/text() </UniqueName>
    <Class> $x0/class/text() </Class>
    <Parent> $x1/parentId/text() </Parent>
    <Rank> $x1/rank/text() </Rank>
    <EmblCode> $x1/emblCode/text() </EmblCode>
  </Taxon>

Denormalization

97 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Cyclic Joins

Source:

Gene * name type protein

Target:

Gene * Name Protein

Synonym * Name GeneWID

If type == "primary"

for $x0 in $doc/Source/Gene
where $x0/type/text() == "primary"
return (
  <Gene>
    <Name> $x0/name/text() </Name>
    <Protein> $x0/protein/text() </Protein>
  </Gene>,
  for $x1 in $doc/Source/Gene
  where $x1/type/text() != "primary"
    and $x1/protein/text() == $x0/protein/text()
  return
    <Synonym>
      <Name> $x1/name/text() </Name>
      <GeneWID> $x0/name/text() </GeneWID>
    </Synonym>
)

98 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Un-Nesting Structures

Target:

Publication * Title Year PublishedIn Name

Source:

Reference * title year publishedIn Author * name

for $x0 in $doc/Source/Reference,
    $x1 in $x0/Author
return
  <Publication>
    <Title> $x0/title/text() </Title>
    <Year> $x0/year/text() </Year>
    <PublishedIn> $x0/publishedIn/text() </PublishedIn>
    <Name> $x1/name/text() </Name>
  </Publication>

99 EDBT 2011, A. Bonifati & Y. Velegrakis
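The unnesting (flattening) transformation above can be sketched in Python over an assumed dictionary representation of the nested source; one output Publication tuple is produced per (Reference, Author) pair:

```python
# Unnesting: flatten nested Author records into one Publication per
# (reference, author) pair, duplicating the reference-level attributes.
def unnest(references):
    out = []
    for ref in references:
        for author in ref["authors"]:
            out.append({
                "Title": ref["title"],
                "Year": ref["year"],
                "PublishedIn": ref["publishedIn"],
                "Name": author["name"],
            })
    return out

refs = [{"title": "T1", "year": 2011, "publishedIn": "EDBT",
         "authors": [{"name": "A"}, {"name": "B"}]}]
flat = unnest(refs)
```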

Scenario: Nesting Structures

Target:

Period * Year Author * Name Publication * Title PublishedIn

Source:

Publication * title year publishedIn name

for $x0 in distinct-values($doc/Source/Publication/year)
return
  <Period>
    <Year> $x0 </Year>
    for $x1 in distinct-values($doc/Source/Publication[year=$x0]/name)
    return
      <Author>
        <Name> $x1 </Name>
        for $x2 in $doc/Source/Publication
        where $x2/year/text() = $x0 and $x2/name/text() = $x1
        return
          <Publication>
            <Title> $x2/title/text() </Title>
            <PublishedIn> $x2/publishedIn/text() </PublishedIn>
          </Publication>
      </Author>
  </Period>

100 EDBT 2011, A. Bonifati & Y. Velegrakis
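The same nesting logic, grouping flat publications first by year and then by author name within each year, can be sketched with `itertools.groupby` (the flat record layout is a hypothetical simplification of the scenario's source):

```python
from itertools import groupby

# Nesting: group flat publications by year, then by author name inside
# each year, mirroring the two distinct-values loops of the query.
def nest(publications):
    pubs = sorted(publications, key=lambda p: (p["year"], p["name"]))
    periods = []
    for year, by_year in groupby(pubs, key=lambda p: p["year"]):
        authors = []
        for name, by_author in groupby(list(by_year), key=lambda p: p["name"]):
            authors.append({
                "Name": name,
                "Publications": [{"Title": p["title"],
                                  "PublishedIn": p["publishedIn"]}
                                 for p in by_author],
            })
        periods.append({"Year": year, "Authors": authors})
    return periods

flat = [
    {"title": "T1", "year": 2011, "publishedIn": "EDBT", "name": "A"},
    {"title": "T2", "year": 2011, "publishedIn": "VLDB", "name": "A"},
    {"title": "T3", "year": 2010, "publishedIn": "ICDE", "name": "B"},
]
nested = nest(flat)
```

`groupby` only groups adjacent records, so the input must be sorted on the grouping key first; that is why the function sorts on (year, name) up front.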

Scenario: Keys & Object Fusion

Target:

Experiment * Contact Date Description ExperimentalData * Data Role

Source:

Experiment * eid contact date description ExperimentalData * data role

FlowCytometrySample id contact date Probe * data type

Step 1, fuse into an intermediate Source2:

<Source2>
  for $x0 in $doc/Source/Experiment,
      $x1 in $x0/ExperimentalData
  return
    <Datum>
      <id> genID($x0/contact/text(), $x0/date/text()) </id>
      <Contact> $x0/contact/text() </Contact>
      <Date> $x0/date/text() </Date>
      <Description> $x0/description/text() </Description>
      <Data> $x1/data/text() </Data>
      <Role> $x1/role/text() </Role>
    </Datum>
  for $x0 in $doc/Source/FlowCytometrySample,
      $x1 in $x0/Probe
  return
    <Datum>
      <id> genID($x0/contact/text(), $x0/date/text()) </id>
      <Contact> $x0/contact/text() </Contact>
      <Date> $x0/date/text() </Date>
      <Data> $x1/data/text() </Data>
      <Role> $x1/type/text() </Role>
    </Datum>
</Source2>

Step 2, group the Source2 data by key:

for $x0 in distinct-values($doc/Source2/Datum/id)
return
  <Experiment>
    for $x1 in ($doc/Source2/Datum[id=$x0])[1]
    return $x1/Contact $x1/Date $x1/Description
    for $x3 in $doc/Source2/Datum
    where $x3/id/text() = $x0
    return
      <ExperimentalData> $x3/Data $x3/Role </ExperimentalData>
  </Experiment>

101 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenario: Atomic Value Manipulation

Target:

Contact * FirstName LastName Address Phone

Source:

Contact * name address street city zip phone

GetFirstName(…)

Type Discrepancy Handling

for $x0 in $doc/Source/Contact
return
  <Contact>
    <FirstName> GetFirstName( $x0/name/text() ) </FirstName>
    <LastName> GetLastName( $x0/name/text() ) </LastName>
    <Address> Concat( $x0/street/text(), $x0/city/text(), $x0/zip/text() ) </Address>
    <Phone> String2Int( $x0/phone/text() ) </Phone>
  </Contact>

102 EDBT 2011, A. Bonifati & Y. Velegrakis
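A Python sketch of the atomic-value manipulations in this scenario: splitting a name, concatenating an address, and casting a string-typed phone to an integer. The concrete field layout and sample values are assumptions, not part of the benchmark:

```python
# Atomic value manipulation: split, concatenate, and cast atomic values
# while restructuring a contact record (hypothetical field layout).
def transform_contact(c):
    first, _, last = c["name"].partition(" ")
    return {
        "FirstName": first,
        "LastName": last,
        "Address": ", ".join([c["street"], c["city"], c["zip"]]),
        "Phone": int(c["phone"]),   # type-discrepancy handling: string -> int
    }

src = {"name": "Ada Lovelace", "street": "1 Main St",
       "city": "London", "zip": "00001", "phone": "5551234"}
tgt = transform_contact(src)
```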

Scenarios: Aggregation & Order

103 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenarios: Data Meta-data

104 EDBT 2011, A. Bonifati & Y. Velegrakis

Thalia

[Hammer et al. 2005]

Integration tools benchmark

Rich set of test data

Source schemas

Syntactic and semantic heterogeneity

12 test queries for the integrated schema

105 EDBT 2011, A. Bonifati & Y. Velegrakis

Real Data Is Not Enough for Benchmarking...

… but definitely not for the reason Dilbert thinks !

106 EDBT 2011, A. Bonifati & Y. Velegrakis

Talk Outline

Introduction

Matching and mapping: techniques & tools

Benchmarks and evaluation principles

Designing a matching & mapping benchmark

Using real-world scenarios for evaluation

Generating synthetic evaluation scenarios

Measuring efficiency and effectiveness

Conclusions and future directions

107 EDBT 2011, A. Bonifati & Y. Velegrakis

Why Synthetic Data & Scenarios

To stress test the system

To understand performance in diverse situations

To create additional realistic test cases

To ensure that unforeseen situations are also tested

108 EDBT 2011, A. Bonifati & Y. Velegrakis

Top-down Scenario Construction

Start with a big schema and divide/extract

TaxME2 [Giunchiglia et al. 2009]: preserves correctness, complexity, performance

[Okawara et al. 2006]: techniques on how a benchmark should be built

[Hammer et al. 2005] – Thalia: large dataset + filters

[Lee et al. 2007] – eTuner: duplicate the schema, split the data in two

Modify the first half

Limited kinds of modifications; not very natural

109 EDBT 2011, A. Bonifati & Y. Velegrakis

Bottom-up Scenario Construction

Create the schema from scratch

STBenchmark [Alexe et al. 2008]

Schema Generator

Expands basic scenarios

Changing basic characteristics of the scenario

Data Generator

Hand-to-Hand with the Schema Generator

ToXGen [Barbosa et al. 2002]

110 EDBT 2011, A. Bonifati & Y. Velegrakis

Parameters of Schema Generation

• Number of subelements

• Nesting depth

• Join size

• Join width

• Join kind (star / chain)

• Function arity

f(…)

Source R1 [0…*] A1 A2 A3

R2 [0…*] A4 A5

R3 [0…*] A6

R4 [0…*] A7 A8 A9

R5 [0…*] A10 A11

R6 [0…*] A12

Parameter values:

Sampled from normal distributions given by average and standard deviation

111 EDBT 2011, A. Bonifati & Y. Velegrakis
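Sampling structural parameters from normal distributions can be sketched as follows; the parameter names (`num_relations`, `num_subelements`) and the clamping policy are hypothetical, not STBenchmark's actual configuration keys:

```python
import random

# Sample one structural parameter from a normal distribution, clamped to
# a minimum so generated schemas stay well-formed.
def sample_param(avg, stdev, minimum=1):
    return max(minimum, round(random.gauss(avg, stdev)))

# Generate a flat schema: a list of relations, each with a sampled number
# of attributes (nesting depth, join size, etc. would be sampled the same way).
def generate_schema(cfg, rng_seed=0):
    random.seed(rng_seed)
    relations = []
    for i in range(sample_param(*cfg["num_relations"])):
        arity = sample_param(*cfg["num_subelements"])
        relations.append({"name": f"R{i + 1}",
                          "attrs": [f"A{i + 1}_{j + 1}" for j in range(arity)]})
    return relations

cfg = {"num_relations": (3, 1), "num_subelements": (2, 1)}   # (average, stdev)
schema = generate_schema(cfg)
```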

Stretching The Unnesting Scenario

Source Reference [0…*] title year publishedIn Author [0…*] name Affiliation [0…*] university country Students [0…*] sname

Target Publication [0…*] Title Year PublishedIn Name University Country StudentName

Unnesting

Basic Scenario

112 EDBT 2011, A. Bonifati & Y. Velegrakis

Stretching The Horizontal Partitioning

Vary the number of partitions

Vary the number of elements that exist in each partition

Source:

gene * txt type protein hpAttr1 hpAttr2

Target:

Gene * Name Protein HpAttr1 HpAttr2

Synonym * Name Protein HpAttr1 HpAttr2

HPRel1 * Name Protein HpAttr1 HpAttr2

113 EDBT 2011, A. Bonifati & Y. Velegrakis

Combining Mapping Scenarios

Lack of diversity? Combine scenarios

Based on a set of configuration parameters, generate a complex mapping scenario by concatenating scaled-up mapping scenarios

S1 T1

S2 T2

P1

P2

Concatenation of (S1, T1, P1) and (S2, T2, P2)

114 EDBT 2011, A. Bonifati & Y. Velegrakis

Stretched Mapping Scenarios

Source Reference [0…*] title year publishedIn Author [0…*] name Affiliation [0…*] university country Students [0…*] sname

Target Publication [0…*] Title Year PublishedIn Name University Country StudentName

Horizontal

Partitioning

Source Schema Target Schema

Horizontal

Partitioning

Copy

Unnesting Unnesting

Copy

Complex Mapping Scenario

115 EDBT 2011, A. Bonifati & Y. Velegrakis

Composing Mapping Scenarios

Intermix of basic mapping scenario transformations

capture cases where different types of transformations occur simultaneously on the same part of a schema

Main idea:

Generate a source schema S

Evolve S to obtain a target schema.

All based on the configuration parameters

Example next:

116 EDBT 2011, A. Bonifati & Y. Velegrakis

Composed Mapping Scenarios

R1 [0…*] A1 A2 SE1 [0…*] A3 A4 SE2 [0…*] A5 R2 [0…*] A7 A8 SE3 [0…*] A9 SE4 [0…*] A10 A11 R3 [0…*] A12 A13 A14

F1 A1 A2 A3 A4 A5 F2 A7 A8 A9 A10 A11 F3 A12 A13 A14

F1 A1 A3 A5 F2 A7 A8 A10 A11 F3 A12 A13 A14

F1 A1 A3 A5 A7b F2 A7 A8 A10 A11 A12 F3 A13 A14

F1 A1 A3 A5 A7b A15 (id) F2 A7 A8 A10 A11 A12 A16(=“June”) F3 A13 A14 A17(=A3*A14)

R4 [0…*] A1 A3 SE5 [0…*] A5 A7b A15 R6 [0…*] A7 A8 A10 A11 A12 A16 R7 [0…*] A13 A14 SE6 [0…*] A17

Transformation

query

S T

P

• unnesting

• removal

• duplication

• migration

• addition

• nesting

117 EDBT 2011, A. Bonifati & Y. Velegrakis

Synthetic Examples for Matching

[Ferrara et al. 2008]

ISLab Instance Matching Benchmark

Creates a reference ontology and populates it

Using the web

Performs a sequence of modifications

Variations in data values

Structural heterogeneity

Semantic variations

118 EDBT 2011, A. Bonifati & Y. Velegrakis

Scenarios Are Useless Without Metrics

119 EDBT 2011, A. Bonifati & Y. Velegrakis

Talk Outline

Introduction

Matching and mapping: techniques & tools

Benchmarks and evaluation principles

Designing a matching & mapping benchmark

Using real-world scenarios for evaluation

Generating synthetic evaluation scenarios

Measuring efficiency and effectiveness

Conclusions and future directions

120 EDBT 2011, A. Bonifati & Y. Velegrakis

Metric Categorization

Qualitative metrics

Compliance measures

Quantitative metrics

Performance measures

User-specific metrics

Application specific metrics

121 EDBT 2011, A. Bonifati & Y. Velegrakis

Qualitative = Compliance

Evaluate the degree of compliance of the system with respect to some standard

Matching: Precision, Recall, F-measure and Fallout [Euzenat et al. 2007]

Measure the difference of the system output to some reference (expected) output

An expert user is typically assumed to provide the expected matches [Duchateau et al. 2007] [Euzenat et al. 2004]

They do not provide any measure of the post-match effort

They do not consider the time spent by the user in doing verification during intermediate stages

122 EDBT 2011, A. Bonifati & Y. Velegrakis

Terminology

E − G : False Negatives

G − E : False Positives

G ∩ E : True Positives

U − (G ∪ E) : True Negatives

E: Expected, G: Generated, U: all possible correspondences

123 EDBT 2011, A. Bonifati & Y. Velegrakis

Hamming Distance

Measures the dissimilarity between matches

H(G, E) = 1 − |G ∩ E| / |G ∪ E|

Example:

E = {Book-Volume, Person-Human, Science-Essay}

G = {Product-Volume, Person-Writer, Science-Essay}

Only Science-Essay is shared, so |G ∩ E| = 1, |G ∪ E| = 5, and H(G, E) = 1 − 1/5 = 4/5

124 EDBT 2011, A. Bonifati & Y. Velegrakis
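Applying the set-based formula H(G, E) = 1 − |G ∩ E| / |G ∪ E| to the slide's example (representing each correspondence as a pair):

```python
# Hamming distance between a generated (G) and an expected (E) set of
# correspondences: the fraction of their union that they do not share.
def hamming(G, E):
    return 1 - len(G & E) / len(G | E)

E = {("Book", "Volume"), ("Person", "Human"), ("Science", "Essay")}
G = {("Product", "Volume"), ("Person", "Writer"), ("Science", "Essay")}
h = hamming(G, E)   # only Science-Essay is shared: 1 - 1/5
```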

Precision

Originated from IR [van Rijsbergen, 1975]

Adopted to matching [Do et al. 2002]

Ratio of correctly found correspondences (true positives) over the total number of returned correspondences (true and false positives)

Intuitively: The degree of correctness

Precision(G, E) = |G ∩ E| / |G|

125 EDBT 2011, A. Bonifati & Y. Velegrakis

Recall

Originated from IR [van Rijsbergen, 1975]

Adopted to matching [Do et al. 2002]

Ratio of correctly found correspondences (true positives) over the total number of expected correspondences (true positives and false negatives)

Intuitively: The degree of completeness

Recall(G, E) = |G ∩ E| / |E|

126 EDBT 2011, A. Bonifati & Y. Velegrakis

Fallout

The percentage of the found matches that are false positives

Intuitively: How much error has been made

Fallout(G, E) = (|G| − |G ∩ E|) / |G| = |G − E| / |G|

127 EDBT 2011, A. Bonifati & Y. Velegrakis

F-Measure

Precision & Recall are not always consistent

Neither are their complements, Noise & Silence

Aggregate Precision & Recall

FMeasure_α(G, E) = Precision(G, E) × Recall(G, E) / ((1 − α) × Precision(G, E) + α × Recall(G, E))

For α = 1, F-Measure equals Precision; for α = 0, Recall. For α = 0.5, it is the harmonic mean

128 EDBT 2011, A. Bonifati & Y. Velegrakis
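The four compliance measures above, computed on a small made-up pair of correspondence sets (the opaque labels c1..c5 are placeholders for correspondences):

```python
# Compliance measures over generated (G) and expected (E) correspondence sets.
def precision(G, E):
    return len(G & E) / len(G)          # degree of correctness

def recall(G, E):
    return len(G & E) / len(E)          # degree of completeness

def fallout(G, E):
    return len(G - E) / len(G)          # share of false positives returned

def f_measure(G, E, alpha=0.5):
    p, r = precision(G, E), recall(G, E)
    return p * r / ((1 - alpha) * p + alpha * r)

E = {"c1", "c2", "c3", "c4"}
G = {"c1", "c2", "c5"}
```

Note how the α parameter interpolates: α = 1 recovers Precision, α = 0 recovers Recall, and α = 0.5 is the harmonic mean.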

Overall

Like an edit-distance [Melnik et al. 2002]

Ratio of errors over the total number of expected correspondences (true positives and false negatives)

Overall < F-Measure. Ranges in [−1, 1]

Intuitively: the effort required to fix a matching

Overall(G, E) = Recall(G, E) × (2 − 1 / Precision(G, E))
= 1 − (|G ∪ E| − |G ∩ E|) / |E|
= 1 − (|E − G| + |G − E|) / |E|

129 EDBT 2011, A. Bonifati & Y. Velegrakis
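The two forms of Overall, Recall × (2 − 1/Precision) and the set-difference form 1 − (|E − G| + |G − E|)/|E|, can be checked against each other numerically on the same toy sets used before (c1..c5 are placeholder correspondences):

```python
# Overall [Melnik et al. 2002]: an edit-distance-like measure of the
# effort needed to fix a matching.
def overall(G, E):
    p = len(G & E) / len(G)     # precision
    r = len(G & E) / len(E)     # recall
    return r * (2 - 1 / p)

E = {"c1", "c2", "c3", "c4"}
G = {"c1", "c2", "c5"}
# Equivalent set-difference form: errors over expected correspondences.
alt = 1 - (len(E - G) + len(G - E)) / len(E)
```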

Strength-based Similarity

Takes into consideration the degree of confidence

SBS(G, E) = 2 × Σ_{c ∈ G ∩ E} |strength_G(c) − strength_E(c)| / (|G| + |E|)

130 EDBT 2011, A. Bonifati & Y. Velegrakis

All-or-nothing vs. Approximation

[Figure: two category trees, a source tree (Product, DVD, Book, Science, Textbook, Popular, …) and an expected target tree (Volume, Essay, Politics, Biography, Pocket, …); a generated match can be close to the expected one (e.g., Essay) or far from it (e.g., Pocket)]

131 EDBT 2011, A. Bonifati & Y. Velegrakis

Relaxed Precision & Recall

[Ehrig et al. 2005]

Precision_ω(G, E) = ω(G, E) / |G|

Recall_ω(G, E) = ω(G, E) / |E|

ω(G, E) = Σ_{(a,r) ∈ M(G,E)} σ(a, r)

σ: correspondence similarity function

M(G, E): best matches with regard to σ

ω(G, E) is used instead of |G ∩ E|

132 EDBT 2011, A. Bonifati & Y. Velegrakis
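A sketch of relaxed precision and recall: the exact intersection size is replaced by the total similarity ω of the best pairing between generated and expected correspondences. The greedy pairing and the toy similarity function below are stand-ins for M(G, E) and σ; [Ehrig et al. 2005] do not prescribe these particular choices:

```python
# omega: total similarity of a (greedy) best pairing between generated
# and expected correspondences; each expected item is used at most once.
def omega(G, E, sigma):
    total, used = 0.0, set()
    for g in G:
        best = max(((sigma(g, e), e) for e in E if e not in used),
                   default=(0.0, None))
        if best[1] is not None and best[0] > 0:
            used.add(best[1])
            total += best[0]
    return total

def relaxed_precision(G, E, sigma):
    return omega(G, E, sigma) / len(G)

def relaxed_recall(G, E, sigma):
    return omega(G, E, sigma) / len(E)

# Toy similarity: 1.0 for an exact match, 0.5 when only the source side agrees.
sigma = lambda a, b: 1.0 if a == b else (0.5 if a[0] == b[0] else 0.0)
G = [("Science", "Essay"), ("Person", "Writer")]
E = [("Science", "Essay"), ("Person", "Human")]
```

A near-miss like Person-Writer vs. Person-Human now contributes partial credit instead of counting as a plain error.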

Weighted Harmonic Mean

Integrates multiple similarity measures

Given n similarity measures M_i, each with a weight w_i such that w_i ∈ [0, 1] and Σ_{i=1..n} w_i = 1

WHM(G, E) = 1 / Σ_{i=1..n} (w_i / M_i(G, E))

133 EDBT 2011, A. Bonifati & Y. Velegrakis
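A sketch of the weighted harmonic mean under the reading WHM = 1 / Σ(w_i / M_i), one plausible reconstruction of the slide's formula:

```python
# Weighted harmonic mean of several similarity measures, with weights
# summing to 1 (assumed reading of the aggregation formula).
def whm(values, weights):
    assert abs(sum(weights) - 1.0) < 1e-9
    return 1.0 / sum(w / v for w, v in zip(weights, values))

# With two measures and equal weights this reduces to the classic
# harmonic mean of precision and recall, i.e. the balanced F-measure.
f = whm([0.8, 0.4], [0.5, 0.5])
```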

Evaluating Mapping Systems

Efficiency

Mapping Generation Time

Data Translation Performance

Time

Parallelization

Human effort

134 EDBT 2011, A. Bonifati & Y. Velegrakis

Matching/Mapping Generation Time

Matching tools: little human intervention

Time has been measured for matching tasks in [Yatskevic et al. 2003], discussed in a recent benchmark [Kopcke et al. 2010], and also addressed in XBenchMatch [Duchateau et al. 2007]

Mapping tools, e.g., Clio, HePToX, Spicy, and STBenchmark, do not elaborate on the issue

It is hard to measure time in a process in which human participation is part of the process

It includes the time to guide, verify and tune the mapping tool

135 EDBT 2011, A. Bonifati & Y. Velegrakis

Translation Time as Performance Metric

Time to execute the transformation script

Indirectly a quality metric on the generated mappings

Attention to avoid evaluation of the query engines

Same engine & Hardware

Need to be fair

Same target instance

Efficient Core generation

[Mecca et al. 2009] [tenCate et al. 2009]

Time performance of ETL workflows

[Simitsis et al. 2009]

136 EDBT 2011, A. Bonifati & Y. Velegrakis

Time and Parallelization for ETL tools

Beyond time performance, other factors may be quite relevant, such as:

Workflow execution throughput (under failures or not)

Average latency per tuple

Along with the above factors, it is important to increase parallelization:

Pipelining: tasks of the ETL workflow are executed in parallel by different processors, and the output can be consumed by the next task without waiting for the overall completion

Partitioning: data is partitioned and the transformation is applied to chunks of data

137 EDBT 2011, A. Bonifati & Y. Velegrakis

Human Effort In Matching Tools

The amount of work required to remove false positives and add false negatives

Since no human intervention takes place during the matching process

Human-spared resources [Duchateau 2009]

It counts the number of user interactions to obtain a 100% F-measure, i.e. the effort to remove false positives and add false negatives, and also to discover missing correspondences

138 EDBT 2011, A. Bonifati & Y. Velegrakis

Human Effort In Mapping Tools

Mapping tools can be seen as graphic tools

HCI study can be used

Comparing the GUI of the tools is hard

Schema mapping tools are a new technology; the tools are evolving and keep improving their interfaces

STBenchmark provides a first-cut measure on the effort required to implement a mapping scenario through the visual interface of a mapping system [Alexe et al. 2008]

139 EDBT 2011, A. Bonifati & Y. Velegrakis

A Simple Model

STBenchmark model

Cost of implementing a mapping scenario:

4*L + S + 2*D + 4*K

L – mouse dragging actions

S – single mouse clicks

D – double mouse clicks

K – keystrokes

[MacKenzie et al. ’91]

Mouse dragging is slower and more error-prone than clicking

It is easier to make mistakes when typing

Scenario / System    A     B     C     D
Scenario 1           8     16    16    4
Scenario 2           72    78    65    110
Scenario 3           32    52    37    7
Scenario 4           66    81    65    200

140 EDBT 2011, A. Bonifati & Y. Velegrakis
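The STBenchmark effort model is a simple weighted count of user actions; the weights 4/1/2/4 come from the slide, while the per-scenario figures in the table are already totals per system, so this sketch only exposes the cost function itself:

```python
# STBenchmark's GUI-effort model: weighted count of the user actions
# needed to implement a mapping scenario through a visual interface.
WEIGHTS = {"L": 4, "S": 1, "D": 2, "K": 4}   # drags, clicks, double-clicks, keystrokes

def effort(L=0, S=0, D=0, K=0):
    return (WEIGHTS["L"] * L + WEIGHTS["S"] * S
            + WEIGHTS["D"] * D + WEIGHTS["K"] * K)
```

Dragging and typing are weighted most heavily, reflecting the cited finding [MacKenzie et al. '91] that they are the slowest and most error-prone actions.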

Usability study for HePToX/Clio

• [Bonifati et al. 2010]

• Whereas Clio could implement considerably more scenarios, HePToX required less effort than Clio in the majority of the scenarios it could implement.

• HePToX is click-and-drag oriented, while Clio is click-and-select oriented.

141 EDBT 2011, A. Bonifati & Y. Velegrakis

Evaluating Mapping Systems

Efficiency

Mapping Generation Time

Data Translation Performance

Time

Parallelization

Human effort

Effectiveness

Supported Scenarios

Quality

Generated Mappings

Target instance

Target schema

142 EDBT 2011, A. Bonifati & Y. Velegrakis

Enumerating Supported Scenarios

System A System B System C

Scenario 1

Scenario 2

Scenario 3

Scenario 4

Scenario 5

Scenario 6

Scenario 7

……………..

143 EDBT 2011, A. Bonifati & Y. Velegrakis

Before talking about quality…

In order to check the quality of the generated mappings or the generated target instance:

Somebody has to provide you with the "ideal" or "desired" mappings or target instance

Normally, the user provides those

Or, alternatively, a "core" target instance can be used

Or the set of mappings suggested by a benchmark

The focus here is not on who provides the above components, but on quality!

144 EDBT 2011, A. Bonifati & Y. Velegrakis

Quality of the Generated Mappings

Things are tricky for mapping tools:

Measuring the quality of generated mappings amounts to checking Query Containment or Query Equivalence

An NP-complete problem

It is preferable to check the quality of the results of a mapping:

i.e. the generated target instance

Current efforts try to characterize mappings in a quantitative way:

By their information loss, while inverting them

The notion of maximum extended recovery has been introduced in [Arenas2008, Fagin2009]

145 EDBT 2011, A. Bonifati & Y. Velegrakis

Quality of the Generated Target instance

A mapping is inherently a query

Same transformation, multiple ways.

Generation time

Readability

Mapping result

Whether it produces the respective results

Yes/No is too harsh

Similarity of queries … too difficult to compute

Target instance is an alternative

Observe the result of the mappings

146 EDBT 2011, A. Bonifati & Y. Velegrakis

The Spicy Way

[Bonifati et al. 2008] – Spicy

Schema Matcher (internal or external)

Mapping Generator (internal or external)

s.A t.E, 0.87

s.C t.F, 0.76

s.D t.F, 0.98

Ranked

mappings:

mapping 1, 0.97

mapping 2, 0.87

mapping 3, 0.72

match

line selection

mapping generation

Structural analysis

mapping

verification

Spicy

source

target

147 EDBT 2011, A. Bonifati & Y. Velegrakis

The Spicy Way

Structural Analysis

uses the model of electrical circuits

uses sampling

uses a set of features on samples, e.g.:

length and character distribution

entropy of values

density of null values etc.

148 EDBT 2011, A. Bonifati & Y. Velegrakis

Structural Analysis

For an attribute A with sample sample(A), the atomic piece of circuit is the following

149 EDBT 2011, A. Bonifati & Y. Velegrakis

Trees Into Circuits

Circuits can be obtained for nested structures

150 EDBT 2011, A. Bonifati & Y. Velegrakis

The Spicy way: lessons learned

Spicy is a system for schema mapping verification

using comparison of instances to gauge the quality

iterating the mapping search algorithm until mapping quality is acceptable

Open issues:

Special element types (images, full-text), complex models (e.g. ontologies) and more complex classes of lines

Other instance comparison techniques other than structural analysis may be needed

What if we look at the efficiency of transformations (e.g. core computation) and their quality at the same time?

151 EDBT 2011, A. Bonifati & Y. Velegrakis

Quality of target instance for ETL

The quality of target instances is also important for ETL systems

Target instances can be characterized in terms of:

Ease of maintenance [Simitsis et al. 2009]

Resilience to failures

Data freshness

Compliance to business rules

152 EDBT 2011, A. Bonifati & Y. Velegrakis

Data examples

Since the expected/generated target instance may be large and the generated mappings may be numerous, samples of the expected/generated target instance can be used instead

The importance of data examples goes back to [Yan et al. 2001]

Each mapping is a connected graph G = (N, E), where N is the set of nodes (source schema relations) and E represents conjunctions of join predicates among the nodes

Data associations (subgraphs of G) can be derived as relations that contain the maximum number of attributes that can be joined along E

Data associations are leveraged to understand what has to be included in a mapping

153 EDBT 2011, A. Bonifati & Y. Velegrakis

Routes

Source-to-target dependencies, st:

m1: CardHolders(cn,l,s,n) -> Accounts(cn,L,s), Clients(s,n)

m2: Dependents(an,s,n) -> Clients(s,n)

Target dependencies, t:

m3: Clients(s,n) -> (Accounts(A,L,s))

MANHATTAN CREDIT (source schema S):

CardHolders: cardNo, limit, ssn, name

Dependents: accNo, ssn, name

FARGO FINANCE (target schema T):

Accounts: accNo, creditLine, accHolder

Clients: ssn, name

Source instance I:

CardHolders: (123, $15K, ID1, Alice)

Dependents: (123, ID2, Bob)

Target instance J, a solution for I under the schema mapping:

Accounts: (123, L1, ID1), (A2, L2, ID2)

Clients: (ID1, Alice), (ID2, Bob)   (fk1)

Allow the Inspection of the flow of the mappings [Chiticariu and Tan, 2006]

154 EDBT 2011, A. Bonifati & Y. Velegrakis

Example Debugging Scenario 1

Unknown credit limit?

15K is not copied over to the target

Source instance I:

CardHolders: (123, $15K, ID1, Alice)

Dependents: (123, ID2, Bob)

Target instance J:

Accounts: (123, L1, ID1), (A2, L2, ID2)

Clients: (ID1, Alice), (ID2, Bob)

A route for the Accounts tuple (123, L1, ID1): CardHolders(123, $15K, ID1, Alice) --m1--> Accounts(123, L1, ID1), Clients(ID1, Alice)

155 EDBT 2011, A. Bonifati & Y. Velegrakis


Example Debugging Scenario 2

Unknown account number?

123 is not copied over to the target

as Bob’s account number

Source instance I:

CardHolders: (123, $15K, ID1, Alice)

Dependents: (123, ID2, Bob)

Target instance J:

Accounts: (123, L1, ID1), (A2, L2, ID2)

Clients: (ID1, Alice), (ID2, Bob)

Route for the Accounts tuple with accNo A2: Dependents(123, ID2, Bob) --m2--> Clients(ID2, Bob) --m3--> Accounts(A2, L2, ID2)

157 EDBT 2011, A. Bonifati & Y. Velegrakis


The SPIDER System

Based on Routes

159 EDBT 2011, A. Bonifati & Y. Velegrakis

Data Examples as Evaluation Tools

The use of data examples as evaluation tools is underway [Chiticariu et al. 2008]

Examples are used to understand and refine mappings towards the desired specification

Universal data examples are data examples derived from universal solutions [Alexe et al. 2010]

If S and T contain only unary relations, with only Σst, a mapping is characterized by a set of:

Positive data examples (I, J) such that (I, J) ⊨ Σ

Negative data examples (I, J) such that (I, J) ⊭ Σ

160 EDBT 2011, A. Bonifati & Y. Velegrakis

Muse

[Chiticariu et al. 2008] Muse

Builds ad-hoc probes for each attribute, such that a small source example is built and two differentiating target examples are obtained

After that, Muse asks the designer: "Which target instance looks correct?"

This eliminates the mappings that lead to the unchosen target instance

The result is:

A set of correct homomorphically equivalent target instances

It also allows the design of Skolem functions, not addressed in Routes

161 EDBT 2011, A. Bonifati & Y. Velegrakis

MUSE Workflow

[Figure: MUSE workflow. Generation: the mapping specification plus a real source instance (if available) yield real/synthetic data examples. Examination: the mapping designer inspects the data examples, giving essentially Yes/No answers. Refinement: grouping semantics, disambiguation]

162 EDBT 2011, A. Bonifati & Y. Velegrakis

Example

CompDB: Rcd
  Companies: Set of Company: Rcd (cbranch, cname, location)
  Projects: Set of Project: Rcd (pid, pname, cbranch, manager)
  Employees: Set of Employee: Rcd (eid, ename, contact)

OrgDB: Rcd
  Orgs: Set of Org: Rcd (oname, Projects: Set of Project: Rcd (pname, manager))
  Employees: Set of Employee: Rcd (eid, ename)

Declarative Mapping

for c in CompDB.Companies,
    p in CompDB.Projects,
    e in CompDB.Employees
satisfy p.cbranch = c.cbranch
    and e.eid = p.manager
exists o in OrgDB.Orgs,
       p1 in o.Projects,
       e1 in OrgDB.Employees
satisfy p1.manager = e1.eid
where c.cname = o.oname
  and e.eid = e1.eid
  and e.ename = e1.ename
  and p.pname = p1.pname

163 EDBT 2011, A. Bonifati & Y. Velegrakis

Example

Grouping Projects:

Example source:

Companies: (Redmond, Microsoft, USA), (S. Valley, Microsoft, USA)

Projects: (P1, DB, Redmond, e4), (P2, Web, S. Valley, e5)

Group by cbranch:

Orgs: (Microsoft, Projects: {(DB, e4)}), (Microsoft, Projects: {(Web, e5)})

Group by cname:

Orgs: (Microsoft, Projects: {(DB, e4), (Web, e5)})



Example

Declarative Mapping

for     c in CompDB.Companies,
        p in CompDB.Projects,
        e in CompDB.Employees
satisfy p.cbranch = c.cbranch,
        e.eid = p.manager
exists  o in OrgDB.Orgs,
        p1 in o.Projects,
        e1 in OrgDB.Employees
satisfy p1.manager = e1.eid
where   c.cname = o.oname,
        e.eid = e1.eid,
        e.ename = e1.ename,
        p.pname = p1.pname,
        o.Projects = SKProjects(c.cbranch, c.cname, c.location)

The last where clause is the grouping function.

But group by what subset of {cbranch, cname, location} ?
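One way to read SKProjects is as a memoized constructor of set identifiers: equal arguments return the same Projects-set id, so the chosen argument subset determines the grouping. A sketch (the `Skolem` class is illustrative, not Muse's actual machinery):

```python
import itertools

class Skolem:
    """Memoized Skolem function: same argument tuple -> same fresh set id."""
    def __init__(self, name):
        self.name = name
        self.memo = {}               # argument tuple -> set id
        self.fresh = itertools.count()
    def __call__(self, *args):
        if args not in self.memo:
            self.memo[args] = f"{self.name}#{next(self.fresh)}"
        return self.memo[args]

SKProjects = Skolem("Projects")
# Group by cname only: pass just cname as the Skolem argument, so both
# Microsoft branches share one Projects set while Google gets its own.
a = SKProjects("Microsoft")   # Redmond branch
b = SKProjects("Microsoft")   # S. Valley branch -> same Projects set
c = SKProjects("Google")      # a distinct Projects set
```

Passing the full tuple (cbranch, cname, location) instead would give every branch its own set, which is exactly the ambiguity the question above points at.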



Muse-G: Grouping Semantics Design

Goal: infer a grouping function that has the same effect as the one intended by the designer

Muse-G probes each possible grouping attribute, starting with cbranch.

Example source:
  Companies: (Redmond, Microsoft, USA), (S. Valley, Microsoft, USA)
  Projects:  (P1, DB, Redmond, e4), (P2, Web, S. Valley, e5)
  Employees: (e4, John, x234), (e5, Anna, x888)

Target Scenario 1 (group by cbranch; Skolem terms SK(Redmond, y), SK(S. Valley, y)):
  Orgs: Microsoft [Projects: (DB, e4)],
        Microsoft [Projects: (Web, e5)]
  Employees: (e4, John), (e5, Anna)

Target Scenario 2 (do not group by cbranch; Skolem term SK(y)):
  Orgs: Microsoft [Projects: (DB, e4), (Web, e5)]
  Employees: (e4, John), (e5, Anna)

where y is a subset of {Microsoft, USA}
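The two differentiating scenarios can be sketched by nesting projects with and without the probed attribute among the Skolem arguments (toy data; the `nest` helper is hypothetical):

```python
# Sketch: build Muse-G's two differentiating target scenarios for a probed
# attribute by grouping projects with and without it in the grouping key.
source_projects = [
    {"pname": "DB",  "cbranch": "Redmond",   "cname": "Microsoft", "manager": "e4"},
    {"pname": "Web", "cbranch": "S. Valley", "cname": "Microsoft", "manager": "e5"},
]

def nest(projects, group_key):
    """Group projects by a tuple of attributes (the Skolem arguments)."""
    groups = {}
    for p in projects:
        key = tuple(p[a] for a in group_key)
        groups.setdefault(key, []).append((p["pname"], p["manager"]))
    return groups

# Scenario 1: group by cbranch -> two Projects sets, one per branch
scenario1 = nest(source_projects, ["cname", "cbranch"])
# Scenario 2: do not group by cbranch -> one shared Projects set
scenario2 = nest(source_projects, ["cname"])
```

The designer's choice between the two instances tells the wizard whether cbranch belongs in the grouping key.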


Muse-G: Second Question

The next probed attribute is cname.

Example source:
  Companies: (S. Valley, Microsoft, USA), (Mt. View, Google, USA)
  Projects:  (P1, DB, S. Valley, e4), (P4, Web, Mt. View, e6)
  Employees: (e4, John, x234), (e6, Kat, x331)

Target Scenario 1 (group by cname; Skolem terms SK(Microsoft, y), SK(Google, y)):
  Orgs: Microsoft [Projects: (DB, e4)],
        Google [Projects: (Web, e6)]
  Employees: (e4, John), (e6, Kat)

Target Scenario 2 (do not group by cname; Skolem term SK(y)):
  Orgs: Microsoft [Projects: (DB, e4), (Web, e6)],
        Google [Projects: (DB, e4), (Web, e6)]
  Employees: (e4, John), (e6, Kat)

where y is a subset of {USA}

The wizard continues to probe the remaining possible grouping attributes
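The wizard's overall loop thus reduces to narrowing the set of candidate grouping attributes with one yes/no answer per probe. A minimal sketch (the `answers` dict stands in for the designer's choices):

```python
# Sketch: Muse-G keeps each probed attribute in the grouping key iff the
# designer chose the 'group by attr' target scenario for that probe.
def infer_grouping(attributes, answers):
    """answers[attr] is True iff the designer picked 'group by attr'."""
    keep = []
    for attr in attributes:  # probe cbranch, then cname, then location
        if answers[attr]:
            keep.append(attr)
    return keep

answers = {"cbranch": False, "cname": True, "location": False}
grouping_key = infer_grouping(["cbranch", "cname", "location"], answers)
# → ["cname"]: the inferred Skolem function groups Projects by cname only
```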


Quality of the Generated Target Schema

When mappings are used in schema integration, we assess the quality of the generated integrated schema w.r.t. the intended integrated schema

Three metrics have been conceived:

Completeness [Batista07]: the fraction of the intended schema's elements that also appear in the generated schema

Minimality [Batista07]: one minus the fraction of extra (non-intended) elements over the size of the intended schema

Structurality: the degree to which the shared elements preserve the structure they have in the intended schema



An example

[Figure: two schema trees. The generated schema contains nodes A, B, C, D, E, F plus extra nodes X and Z; the intended schema contains nodes A, B, C, D, E, F, G.]

Nr. of common elements = 6; extra elements α = 2

Completeness = 6/7; Minimality = 1 - 2/7 = 5/7

Structurality = (1 + 1 + 1 + 0 + 1/4 + 1/2) / 6 = 0.625

Proximity = 0.73
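The slide's numbers can be recomputed directly. A sketch, assuming (consistently with the 0.73 above) that Proximity is the plain average of the three metrics, and taking the per-element structurality scores from the slide:

```python
# Sketch: recompute the example's quality metrics from the two node sets.
from fractions import Fraction

generated = {"A", "B", "C", "D", "E", "F", "X", "Z"}
intended  = {"A", "B", "C", "D", "E", "F", "G"}

common = generated & intended          # 6 shared elements
extra  = generated - intended          # alpha = 2 extra elements (X, Z)

completeness = Fraction(len(common), len(intended))     # 6/7
minimality   = 1 - Fraction(len(extra), len(intended))  # 5/7
# per-element structurality scores of the 6 shared elements, from the slide:
structurality = (1 + 1 + 1 + 0 + Fraction(1, 4) + Fraction(1, 2)) / 6
proximity = (completeness + minimality + structurality) / 3   # ~0.73
```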


Talk Outline

Introduction

Matching and mapping: techniques & tools

Benchmarks and evaluation principles

Challenges in evaluating matching & mapping

Using real-world scenarios for evaluation

Generating synthetic evaluation scenarios

Measuring efficiency and effectiveness

Conclusions and References


What We Talked About

The importance of Matching/Mappings

Explained the matching/mapping tasks

Presented existing tools & their functionality

What is a benchmark

Generic evaluation principles

Why a benchmark for mapping systems is a challenge

Finding real-world evaluation scenarios

Generating synthetic evaluation scenarios

Metrics for effectiveness, efficiency and quality


Main Sources

[Euzenat et al. 2007] J. Euzenat and P. Shvaiko, "Ontology matching", Springer-Verlag, 2007

[Bellahsene et al. 2011] Z. Bellahsene, A. Bonifati and E. Rahm, "Schema Matching and Mapping", Springer-Verlag, 2011


Thank you !!

Are you taking questions?

Of course ! Go ahead


Partial List of References

[Alexe et al. 2008] Alexe B, Chiticariu L, Miller RJ, Tan WC (2008) “Muse: Mapping Understanding and deSign by Example”. In: ICDE, pp 10-19

[Bonifati et al. 2008] Bonifati A, Mecca G, Pappalardo A, Raunich S, Summa G (2008) “Schema Mapping Verification: The Spicy Way”. In: EDBT, pp 85-96

[Lenzerini 2002] Lenzerini M (2002) “Data Integration: A Theoretical Perspective”. In: PODS, pp 233-246

[Miller et al. 2000] Miller RJ, Haas LM, Hernandez MA (2000) “Schema Mapping as Query Discovery”. In: VLDB, pp 77-88

[Rahm et al. 2001] Rahm E, Bernstein PA (2001) “A survey of approaches to automatic schema matching”. VLDB Journal 10(4):334-350

[Velegrakis 2005] Velegrakis Y (2005) “Managing Schema Mappings in Highly Heterogeneous Environments”. PhD thesis, University of Toronto


Partial List of References (cont’d)

[Altova 2008] Altova (2008) MapForce. http://www.altova.com

[Batini et al. 1986] Batini C, Lenzerini M, Navathe SB (1986) A Comparative Analysis of Methodologies for Database Schema Integration. ACM Comp. Surv. 18(4):323-364

[Bernstein et al. 2007] Bernstein PA, Melnik S (2007) Model management 2.0: manipulating richer mappings. In: SIGMOD, pp 1–12

[Do et al. 2002] Do HH, Rahm E (2002) COMA - A System for Flexible Combination of Schema Matching Approaches. In: VLDB, pp 610–621

[Euzenat et al. 2007] Euzenat J, Shvaiko P (2007) Ontology matching. Springer Verlag, Heidelberg

[Fagin et al. 2005] Fagin R, Kolaitis PG, Miller RJ, Popa L (2005) Data exchange: semantics and query answering. Theoretical Computer Science 336(1):89-124

[Lerner 2000] Lerner BS (2000) A Model for Compound Type Changes Encountered in Schema Evolution. ACM TODS 25(1):83–127

[Popa et al. 2002] Popa L, Velegrakis Y, Miller RJ, Hernandez MA, Fagin R (2002) Translating Web Data. In: VLDB, pp 598–609


Partial List of References (cont’d)

[Alexe2010] Alexe B, Kolaitis PG, Tan WC (2010) Characterizing Schema Mappings via Data Examples. In: PODS

[Aumueller2006] Aumueller D, Do HH, Massmann S, Rahm E (2005) Schema and ontology matching with COMA++. In: SIGMOD, pp 906–908

[Bonifati et al. 2010] Angela Bonifati, Elaine Qing Chang, Terence Ho, Laks V. S. Lakshmanan, Rachel Pottinger, Yongik Chung: Schema mapping and query Translation in heterogeneous P2P XML databases. VLDB J. 19(2): 231-256 (2010)

[Chiticariu et al. 2006] Chiticariu L, Tan WC (2006) Debugging Schema Mappings with Routes. In: VLDB, pp 79–90

[Dhamankar et al. 2004] Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Y. Halevy, Pedro Domingos: iMAP: Discovering Complex Mappings between Database Schemas. SIGMOD Conference 2004: 383-394

[Doan et al. 2001] Doan A, Domingos P, Halevy AY (2001) Reconciling schemas of disparate data sources: A machine-learning approach. In: SIGMOD, pp 509–520


Partial List of References (cont’d)

[Fletcher et al. 2006] Fletcher GHL, Wyss CM (2006) Data Mapping as Search. In: EDBT, pp 95–111

[Giunchiglia et al. 2004] Giunchiglia F, Shvaiko P, Yatskevich M (2004) S-Match: an Algorithm and an Implementation of Semantic Matching. In: ESWS, pp 61–75

[Madhavan et al. 2001] Madhavan J, Bernstein PA, Rahm E (2001) Generic Schema Matching with Cupid. In: VLDB, pp 49–58

[Naumann et al. 2002] Naumann F, Ho CT, Tian X, Haas LM, Megiddo N (2002) Attribute Classification Using Feature Analysis. In: ICDE, p 271

[Shu et al. 1977] N. C. Shu, B. C. Housel, R. W. Taylor, S. P. Ghosh, and V. Y. Lum. EXPRESS: A Data EXtraction, Processing and REstructuring System. ACM Transactions on Database Systems (TODS), 2(2):134–174, 1977.