Benchmarks From Usage To Evaluation -...

Yannis Velegrakis (University of Trento) Angela Bonifati (CNR) EDBT 2011, Uppsala, Sweden, March 21 st -25 th Benchmarks From Usage To Evaluation Schema Matching and Mapping Systems

Transcript of Benchmarks From Usage To Evaluation -...

Page 1: Benchmarks From Usage To Evaluation - CNRstaff.icar.cnr.it/bonifati/pubs/EDBT11MappingBenchmark... · 2011-03-24 · mPhone contacts: Set of contact: Rcd cid companyemail coid phone

Yannis Velegrakis (University of Trento) Angela Bonifati (CNR)

EDBT 2011, Uppsala, Sweden, March 21st-25th

Benchmarks

From Usage To Evaluation

Schema Matching and Mapping Systems


2 EDBT 2011, A. Bonifati & Y. Velegrakis

Talk Outline

Introduction

Matching and mapping: techniques & tools

Benchmarks and evaluation principles

Designing a matching & mapping benchmark

Using real-world scenarios for evaluation

Generating synthetic evaluation scenarios

Measuring efficiency and effectiveness

Conclusions and future directions



Introduction

Data is inherently heterogeneous

Due to the explosion of online data repositories

Due to the variety of users, who develop a wealth of applications

At different times

With disparate requirements in mind

A fundamental requirement is to translate data across different formats

Mappings specify how data is transformed from one format to another


Mappings Are All Around

Data integration [Lenzerini 2002]

to specify the relationship between local and global schemas

[Figure: sources S1, S2, S3 with instances I1, I2, I3, related to a Global Schema T]


Mappings Are All Around

Schema integration [Batini et al. 1986]

to specify the relationship between the input schemas and the integrated schemas

[Figure: input schemas S1, S2, S3 and the resulting Integrated Schema]


Mappings Are All Around

Data exchange [Fagin et al. 2005]

to specify the relationship between source and target schemas

[Figure: Source schema S and Target Schema T connected by mappings, with instances I and J]


Mappings Are All Around

Schema evolution [Lerner 2000]

to specify the relationship between the old and new version of an evolved schema

[Figure: Evolving Schema S1, with successive versions S1, S1', S1'']


How Did It All Start

One of the first systems to deal with this problem was developed at IBM in 1977: EXPRESS (EXtraction, Processing and REStructuring System) [Shu et al. 1977]

It consists of two languages:

DEFINE, which works as a DDL (Data Definition Language)

CONVERT, which works as a DTL (Data Translation Language) and has a total of 9 operators, each of which receives a data file as input, performs the respective transformation and generates an output data file

EXPRESS required users to be familiar with both languages and was tailored to only one data model (hierarchical)

After that, inter-model transformations were also studied [Tork-Roth et al. 1997] [Atzeni et al. 1997]


Emphasis on Data Translation

[Abiteboul et al. 1997] proposed a declarative framework for data translation

[Davidson et al. 1997] focused on constraint satisfaction

[Milo et al. 1998] leveraged a library of transformation rules and pattern-matching techniques

[Cluet et al. 1998] emphasized type-checking

[Beeri et al. 1999] focused on tree-based transformations for XML data structures


Talk Outline

Introduction

Matching and mapping: techniques & tools

Benchmarks and evaluation principles

Designing a matching & mapping benchmark

Using real-world scenarios for evaluation

Generating synthetic evaluation scenarios

Measuring efficiency and effectiveness

Conclusions and future directions


A Data Transfer Example

Source: Rcd
  projects: Set of
    project: Rcd
      name
      status
  grants: Set of
    grant: Rcd
      gid
      project
      recipient
      manager
      supervisor
  contacts: Set of
    contact: Rcd
      cid
      email
      phone
  companies: Set of
    company: Rcd
      name
      official

Target: Rcd
  projects: Set of
    project: Rcd
      code
      funds: Set of
        fund: Rcd
          fid
          finId
  finances: Set of
    finance: Rcd
      finId
      mPhone
      company
  companies: Set of
    company: Rcd
      coid
      name

Source Instance

Projects
  name        status
  PIX         Active
  E-services  Active
  Clio        Inactive

Grants
  gid  project     recipient  manager      supervisor
  g1   PIX         AT&T       Fernandez    Belanger
  g2   PIX         AT&T       Shrivastava  Belanger
  g3   E-services  Bell-labs  Benedikt     Hull

Contacts
  cid          email              phone
  Benedikt     [email protected]  5827766
  Hull         [email protected]  5824509
  Shrivastava  [email protected]  3608776
  Belanger     [email protected]  3608600
  Fernandez    [email protected]  3608679

Companies
  name    official
  AT&T    AT&T Research Labs
  Lucent  Lucent Technologies, Bell Labs Innovations


Desired Target Instance

[Figure: the Source and Target schemas from "A Data Transfer Example", shown next to the desired target instance]

Target instance

Projects
  code: E-services
    Funds
      fid  finId
      g3   ???
  code: PIX
    Funds
      fid  finId
      g1   ???
      g2   ???

Finances
  finId  mPhone   company
  ???    3608679  ???
  ???    3608776  ???
  ???    5827766  ???

Companies
  coid         name
  Sk2(AT&T)    AT&T
  Sk2(Lucent)  Lucent
  ???          ???
  ???          ???
  ???          ???


The Needed Transformation Query

LET $doc0 := document("inputXMLfile")
RETURN
<T>
{ distinct-values (
    FOR $x0 IN $doc0/S/grant, $x1 IN $doc0/S/project,
        $x2 IN $doc0/S/contact, $x3 IN $doc0/S/contact
    WHERE $x2/cid/text() = $x0/manager/text()
      AND $x0/supervisor/text() = $x3/cid/text()
      AND $x0/project/text() = $x1/name/text()
    RETURN
      <project>
        <code> { $x0/project/text() } </code>
        { distinct-values (
            FOR $x0L1 IN $doc0/S/grant, $x1L1 IN $doc0/S/project,
                $x2L1 IN $doc0/S/contact, $x3L1 IN $doc0/S/contact
            WHERE $x2L1/cid/text() = $x0L1/manager/text()
              AND $x0L1/supervisor/text() = $x3L1/cid/text()
              AND $x0L1/project/text() = $x1L1/name/text()
              AND $x0/project/text() = $x0L1/project/text()
            RETURN
              <funding>
                <fid> { $x0L1/gid/text() } </fid>
                <finId> { "Sk52(", $x0L1/gid/text(), ", ", $x0L1/project/text(), ")" } </finId>
              </funding> ) }
      </project> ) }
{ distinct-values (
    FOR $x0 IN $doc0/S/grant, $x1 IN $doc0/S/project,
        $x2 IN $doc0/S/contact, $x3 IN $doc0/S/contact
    WHERE $x2/cid/text() = $x0/manager/text()
      AND $x0/supervisor/text() = $x3/cid/text()
      AND $x0/project/text() = $x1/name/text()
    RETURN
      <finance>
        <finId> { $x0/gid/text() } </finId>
        <mPhone> { $x2/phone/text() } </mPhone>
        <company> { "Sk46(", $x2/phone/text(), ", ", $x0/gid/text(), ")" } </company>
      </finance> ) }
{ distinct-values (
    FOR $x0 IN $doc0/S/grant, $x1 IN $doc0/S/project,
        $x2 IN $doc0/S/contact, $x3 IN $doc0/S/contact
    WHERE $x2/cid/text() = $x0/manager/text()
      AND $x0/supervisor/text() = $x3/cid/text()
      AND $x0/project/text() = $x1/name/text()
    RETURN
      <company>
        <coid> { "Sk46(", $x2/phone/text(), ", ", $x0/gid/text(), ")" } </coid>
        <name> { "Sk49(", $x2/phone/text(), ", ", $x0/gid/text(), ")" } </name>
      </company> ) }
{ distinct-values (
    FOR $x0 IN $doc0/S/company
    RETURN
      <company>
        <coid> { "Sk93(", $x0/cname/text(), ")" } </coid>
        <name> { $x0/cname/text() } </name>
      </company> ) }
</T>


The Road To Mapping Systems

The design of data transformations was a manual task for a long time

Designers had to be familiar with the transformation language

As schemas became larger and more complex, the task became too laborious, time-consuming and error-prone

The need to raise the level of abstraction and to automate the task was soon realized

The idea …

Mapping Systems


Generating Mappings

Different techniques exist to generate mappings:

Manual, e.g.

by means of high-level mapping languages, such as [Bernstein et al. 2007]

by means of sophisticated user interfaces [Altova 2008]

Semi-automatic, e.g.

by means of designer guidance [Alexe et al. 2008]

via advanced algorithms that do the reasoning instead of the mapping designer [Madhavan et al. 2001] [Popa et al. 2002] [Do et al. 2002] [Bonifati et al. 2008]


The First Step of a Mapping Task

[Figure: the Source and Target schemas and the source instance from "A Data Transfer Example", shown again]


Matching

Given two schemas as input, the source and the target schema, matching is a process that produces as output a set of matches (also called correspondences, or simply lines) between the elements of the two schemas

A match is a triple <Es, Et, e>, where Es is a set of elements of the source schema, Et is a set of elements of the target schema, and e specifies a simple relationship (equality or set inclusion) or a complex relationship between the elements in Es and Et
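The triple <Es, Et, e> can be written down as a small data structure. The following Python sketch is only an illustration: the class name `Match`, its field names, and the encoding of the relationship e as a string are assumptions made here, not the representation used by any particular matcher.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Match:
    """A match <Es, Et, e> (hypothetical representation)."""
    source_elements: FrozenSet[str]   # Es: elements of the source schema
    target_elements: FrozenSet[str]   # Et: elements of the target schema
    relationship: str                 # e: "equality", "set-inclusion", or an expression

    def is_one_to_one(self) -> bool:
        # A 1:1 match relates exactly one source element to one target element.
        return len(self.source_elements) == 1 and len(self.target_elements) == 1


# Element names taken from the running example (gid in the source, fid in the target).
m = Match(frozenset({"grant.gid"}), frozenset({"fund.fid"}), "equality")
print(m.is_one_to_one())
```

Using sets for Es and Et keeps the structure general enough for n:m matches, where several source elements jointly relate to one or more target elements.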


The Matching Relationship e

Depends on the cardinalities of Es and Et

Depends on the semantics:

Can be a function

Can be an arithmetic operation

Can be a set-theoretic relation (e.g. ≡, overlaps)

Can be a conceptual modeling relationship (e.g. part-of, subclass-of)


Matching: An Alternative Definition

The matching process [Euzenat et al. 2007] can be seen as a function f from a pair of schemas S and T, an optional input alignment A, a set of matching parameters p and a set of resources r:

A’ = f(S, T, A?, p, r)

Ultimately, an alignment is a set of correspondences between elements in S and elements in T
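To make the signature A' = f(S, T, A?, p, r) concrete, here is a minimal Python sketch of a name-based matcher. Everything in it is an assumption for illustration: schemas are reduced to lists of element names, p is a dictionary holding a single `threshold`, r is a tiny thesaurus, and string similarity comes from the standard `difflib` module rather than from any published matching algorithm.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Set, Tuple

Correspondence = Tuple[str, str]


def match(S: List[str], T: List[str],
          A: Optional[Set[Correspondence]] = None,
          p: Optional[Dict[str, float]] = None,
          r: Optional[Dict[str, str]] = None) -> Set[Correspondence]:
    """Return an alignment A' = f(S, T, A?, p, r) based on name similarity only."""
    threshold = (p or {}).get("threshold", 0.8)   # p: matching parameters
    synonyms = r or {}                            # r: external resource (thesaurus)
    alignment = set(A or set())                   # start from the optional input alignment
    for s in S:
        s_norm = synonyms.get(s.lower(), s.lower())
        for t in T:
            t_norm = synonyms.get(t.lower(), t.lower())
            if SequenceMatcher(None, s_norm, t_norm).ratio() >= threshold:
                alignment.add((s, t))
    return alignment


result = match(["Name", "Location"], ["Title", "Address"],
               r={"name": "title", "location": "address"})
print(sorted(result))
```

Without the thesaurus the two names in each pair share almost no characters, so the same call returns an empty alignment; this is one way to see why resources r are part of the function's input.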


Matching Examples

Simple relationship:
  Name ≡ Title
  Location ≡ Address

Complex relationship:
  speed = velocity x 2.237
  speed x 0.447 = velocity
  speed = concat(velocity x 2.237, "MPH")
  speed ≥ velocity

[Figure: Source element Company (Location, Name) and Target element Organization (Address, Title)]
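The complex relationships on this slide are ordinary functions over attribute values. The short Python sketch below spells them out; it assumes that velocity is expressed in metres per second and speed in miles per hour (which is what the factors 2.237 and 0.447 suggest), and the function names are invented for the example.

```python
MPS_TO_MPH = 2.237   # factor from the slide: speed = velocity x 2.237
MPH_TO_MPS = 0.447   # factor from the slide: speed x 0.447 = velocity


def speed_from_velocity(velocity_mps: float) -> float:
    return velocity_mps * MPS_TO_MPH


def velocity_from_speed(speed_mph: float) -> float:
    return speed_mph * MPH_TO_MPS


def speed_label(velocity_mps: float) -> str:
    # speed = concat(velocity x 2.237, "MPH")
    return f"{speed_from_velocity(velocity_mps):.1f}MPH"


print(speed_label(10.0))
```

Because both factors are rounded, converting back and forth is only approximately the identity, which is worth remembering when such expressions are used to verify a generated target instance.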


The matching process

Can be roughly divided into three steps:

Pre-match: training of classifiers for machine learning-based matchers, configuration of matching parameters (weights, thresholds), and adjustment of resources such as thesauri and constraints

Match: the actual matching task

Post-match: the user may check and modify the displayed matches
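The three steps can be sketched as three small functions. This is a toy Python illustration under stated assumptions: the only parameter is a threshold, the similarity function `same_prefix` is a deliberately naive stand-in for a real matcher, and the user's feedback is modelled as two sets of rejected and added pairs.

```python
from typing import Callable, Dict, List, Set, Tuple

Pair = Tuple[str, str]


def pre_match(threshold: float = 0.5) -> Dict[str, float]:
    # Pre-match: fix the parameters (here only a threshold) before matching.
    return {"threshold": threshold}


def run_match(S: List[str], T: List[str], params: Dict[str, float],
              similarity: Callable[[str, str], float]) -> Set[Pair]:
    # Match: keep every pair whose similarity reaches the threshold.
    return {(s, t) for s in S for t in T if similarity(s, t) >= params["threshold"]}


def post_match(candidates: Set[Pair], rejected: Set[Pair], added: Set[Pair]) -> Set[Pair]:
    # Post-match: apply the user's corrections to the proposed matches.
    return (candidates - rejected) | added


def same_prefix(a: str, b: str) -> float:
    # Toy similarity: 1.0 if the first three letters agree, else 0.0.
    return 1.0 if a[:3].lower() == b[:3].lower() else 0.0


proposed = run_match(["company", "phone"], ["companies", "mPhone"], pre_match(), same_prefix)
final = post_match(proposed, rejected=set(), added={("phone", "mPhone")})
print(sorted(final))
```

The naive matcher misses the pair (phone, mPhone), and the post-match step is where the user adds it by hand.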


Some Schema Matchers

Cupid [Madhavan et al. 2001] : based on structural and name similarity

S-Match [Giunchiglia et al. 2004]: based on semantic closeness

Coma++ [Aumueller et al. 2005]: based on matching reuse

LSD [Doan et al. 2001]: based on data value analysis and machine-learning techniques

iMap [Dhamankar et al. 2004]: suited for complex e expressions

Similarity Flooding [Melnik et al. 2002]: based on graph similarity


Similarity Flooding


COMA++


Matchings Are Not Enough

[Figure: the Source and Target schemas and the source instance from "A Data Transfer Example", shown again]

Matchings alone cannot describe the full details of the transformation


The Mapping Generation Process

[Figure: Source Schema and Target Schema, a Matcher, and the resulting Matchings]

Matching is just the beginning of any mapping generation process


Mappings

Given the source and the target schemas, mapping is a process that takes as input a set of matches between the elements of the two schemas and produces a relationship or constraint e that must hold between their respective instances

In other words, a mapping is a triple <S, T, e>

where S is the source schema, T is the target schema, and e specifies either a constraint that any instances adhering to S and T must satisfy, or an executable statement that transforms an instance of S into an instance of T


A Mapping Example

[Figure: the Source and Target schemas from "A Data Transfer Example", with the following mapping between them]

project(na,st), grant(gid,na,re,ma,su), contact(ma,em,ph) →
  project(na,FUND), fund(gid,finId), finance(finId,ph,company), company(company,name)


Mappings & Instances

Mappings are the basic ingredients of many tasks, such as information integration, P2P query answering, data exchange etc.

In particular, mappings as inter-schema constraints may not be enough to fully specify a unique target instance

There may exist multiple target instances satisfying the mappings

Finding the best target instance is the goal of the data exchange problem [Fagin et al. 2005]

The mapping is converted into an executable transformation script to obtain that particular instance


A Data Exchange Example

[Figure: source schema S and target schema T, with the same structure as the Source and Target schemas of "A Data Transfer Example"]

Mapping:
  project(na,st), grant(gid,na,re,ma,su), contact(ma,em,ph) →
    project(na,FUND), fund(gid,finId), finance(finId,ph,company), company(company,name)

Target instance

Projects
  code: E-services
    Funds
      fid  finId
      g3   ???
  code: PIX
    Funds
      fid  finId
      g1   ???
      g2   ???

Finances
  finId  mPhone   company
  ???    3608679  ???
  ???    3608776  ???
  ???    3608600  ???

Companies
  coid  name
  ???   AT&T
  ???   Lucent


The Mapping Generation Process

[Figure: the full pipeline. Components shown: Source Schema, Target Schema, Matcher, Matchings, User, Mapping Generation Engine, Mappings (Dependencies), Query Engine, Transformation Scripts, Data Exchange Engine, Source Instance, Target Instance]


Research Prototype Systems

Mapping generation and data exchange are separate tasks

Clio [Popa et al. 2002], HePToX [Bonifati et al. 2010], Spicy [Bonifati et al. 2008]

Mapping generation

The mappings are expressed as high-level assertions in a logical formalism

A mapping is a source-to-target tuple-generating dependency (s-t tgd for short)

$\phi(x) \rightarrow \exists y\, \psi(x, y)$

where φ(x) (ψ(x,y), resp.) is a conjunction of atoms over the source (target, resp.)

Data exchange

The respective module transforms the high-level mappings into transformation scripts (in SQL or XQuery) that generate the target instance
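As an illustration of what such a transformation script does, the Python sketch below applies the s-t tgd of the running example, project(na,st), grant(gid,na,re,ma,su), contact(ma,em,ph) → project(na,FUND), fund(gid,finId), finance(finId,ph,company), company(company,name), to the source instance. It is a hand-written stand-in, not the SQL or XQuery that a mapping system would generate. The helper `skolem`, the Skolem function names, the choice of gid as their only argument, and the flat encoding of the nested fund set as (project code, fid, finId) are all assumptions.

```python
from typing import Dict, Set, Tuple

# Source instance (rows copied from the running example; emails omitted).
projects = [("PIX", "Active"), ("E-services", "Active"), ("Clio", "Inactive")]
grants = [("g1", "PIX", "AT&T", "Fernandez", "Belanger"),
          ("g2", "PIX", "AT&T", "Shrivastava", "Belanger"),
          ("g3", "E-services", "Bell-labs", "Benedikt", "Hull")]
contacts = {"Benedikt": "5827766", "Hull": "5824509", "Shrivastava": "3608776",
            "Belanger": "3608600", "Fernandez": "3608679"}


def skolem(name: str, *args: str) -> str:
    # A labeled null: the same arguments always yield the same placeholder value.
    return f"{name}({', '.join(args)})"


def exchange() -> Dict[str, Set[Tuple[str, ...]]]:
    target: Dict[str, Set[Tuple[str, ...]]] = {
        "fund": set(), "finance": set(), "company": set()}
    for (na, _st) in projects:
        for (gid, proj, _re, ma, _su) in grants:
            if proj != na or ma not in contacts:
                continue                      # premise phi(x) is not satisfied
            ph = contacts[ma]
            fin_id = skolem("SkFin", gid)     # existential variable finId
            co_id = skolem("SkCo", gid)       # existential variable company
            # fund is nested under project, so the project code is carried along.
            target["fund"].add((na, gid, fin_id))
            target["finance"].add((fin_id, ph, co_id))
            target["company"].add((co_id, skolem("SkName", gid)))
    return target


target = exchange()
print(sorted(target["fund"]))
```

Every value written as `???` in the target instance corresponds to one of these Skolem terms: a value that the mapping requires to exist but does not determine.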


Clio


Spicy


HePToX


Commercial Mapping Systems

[Figure: the mapping generation pipeline (Source Schema, Target Schema, Matcher, Matchings, User, Mapping Generation Engine, Mappings (Dependencies), Query Engine, Transformation Scripts, Data Exchange Engine, Source Instance, Target Instance), with a single Mapping Engine box]

Mapping generation and data exchange are merged into one: the system directly creates the final transformation script in some native language


Popular Commercial Systems

Altova MapForce

Stylus Studio

IBM Rational Data Architect

BizTalk Mapper

Adeptia

BEA Aqualogic


Stylus Studio


Altova MapForce


Adeptia


BizTalk Mapper


IBM Rational Data Architect


A Mapping Tool Categorization

All tools provide to the mapping designer:

A graphical representation of the two schemas

A set of graphical transformation constructs

The granularity and power of these constructs are a main factor of differentiation among the tools

[Figure: a spectrum between "Detailed Specification by the Designer" and "Intelligence of the mapping tool / Effort in post-verification", on which commercial mapping tools and research prototypes are roughly placed]


Issues in Data Exchange

When multiple target instances exist, how do we compute the best one?

Is a given target instance better than another?

Universal solutions

Introduced in [Fagin et al. 2005]

These are the “most general” target instances, and also represent the entire space of solutions

Among the universal solutions, the smallest of all and the most compact one is called the “core”
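"Most general" has a precise meaning: a universal solution can be mapped by a homomorphism into every other solution, that is, its labeled nulls can be replaced by values so that all of its facts appear in the other instance. The brute-force Python check below illustrates this on two tiny instances. The fact encoding, the convention that nulls are written N1, N2, ..., and the sample data are assumptions for the example, and the exhaustive search is only suitable for toy inputs.

```python
from itertools import product
from typing import Dict, Set, Tuple

Fact = Tuple[str, ...]          # (relation, value, value, ...)


def is_null(v: str) -> bool:
    # Convention assumed here: labeled nulls are written N1, N2, ...
    return v.startswith("N") and v[1:].isdigit()


def has_homomorphism(I: Set[Fact], J: Set[Fact]) -> bool:
    """True if the nulls of I can be replaced by values of J so that I becomes a subset of J."""
    nulls = sorted({v for f in I for v in f[1:] if is_null(v)})
    values = sorted({v for f in J for v in f[1:]})
    for image in product(values, repeat=len(nulls)):
        h: Dict[str, str] = dict(zip(nulls, image))
        # Constants are mapped to themselves; nulls follow the candidate assignment h.
        if all((f[0],) + tuple(h.get(v, v) for v in f[1:]) in J for f in I):
            return True
    return False


universal = {("Advised", "Bob", "N3"), ("Advised", "Ann", "N4")}
specific = {("Advised", "Bob", "Cathy"), ("Advised", "Ann", "Cathy")}
print(has_homomorphism(universal, specific))
print(has_homomorphism(specific, universal))
```

The instance with nulls maps into the one that commits to a concrete advisor, but not the other way round, which is why the former is the more general of the two.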


Universal Core Instances

S: Rcd
  PTStud: Rcd
    age
    name
  GradStud: Rcd
    age
    name

T: Rcd
  Advised: Rcd
    sname
    facid
  WorksWith: Rcd
    sname
    facid

Mappings:
  PTStud(x,y) → Advised(y,z)
  GradStud(x,y) → Advised(y,z), WorksWith(y,z)

Source instance

PTStud
  age  name
  27   Bob
  30   Ann

GradStud
  age  name
  32   John
  30   Ann

Target instances shown on the slide, labeled "A solution", "A universal solution" and "The core":

WorksWith
  sname  facid
  Bob    N3
  Ann    N4
Advised
  sname  facid
  Bob    N3
  Ann    N4
  John   N1
  N1     Cathy

WorksWith
  sname  facid
  Bob    N3
  Ann    N4
Advised
  sname  facid
  Bob    N3
  Ann    N4
  John   N1
  N2     Ann

WorksWith
  sname  facid
  Bob    N3
  Ann    N4
Advised
  sname  facid
  Bob    N3
  Ann    N4
  John   N1


Commercial vs. Research Prototype Systems

Research prototypes (e.g. Clio, Spicy++) tend to produce target instances that look more and more like the core

Commercial tools, in contrast, leave the task to the users, who have to interact with sophisticated GUIs and write pieces of the transformation by hand

No core definition is even considered

Core? No, thanks


Mixing Matching and Mapping

Matching and mapping are not always performed by separate tools

Clio has as an add-on a matcher based on attribute feature analysis [Naumann et al. 2002]

Bernstein's model management considers the matcher as a fully integrated and indistinguishable component

Spicy [Bonifati et al. 2008] has a matcher based on instance-based structural analysis


Limitations of Current Systems

Manual approaches are not applicable to large-scale mapping tasks

The user/developer has to become familiar with the mapping language and the user interfaces

The outcome of the mapping process may not respect the user requirements and desired semantics (unsurprisingly!)

Specifications may be incomplete and dependent on system peculiarities

Thus, there is a need for a verification and guidance process


The Verification Process

[Figure: the mapping generation pipeline (Source Schema, Target Schema, Matcher, Matchings, User, Mapping Generation Engine, Mappings (Dependencies), Query Engine, Transformation Scripts, Data Exchange Engine, Source Instance, Target Instance), extended with Data Examples, an Expected Target Instance, and a Verification and Selection step involving the User]


A-Posteriori Verification

The main problem with matching and mapping is the dichotomy between the expected results and the generated answers

Some tools allow a post-verification

by using data examples

Tupelo [Fletcher et al. 2006], Muse [Alexe et al. 2008], Clio [Alexe et al. 2010]

by using an automatic instance comparison

Spicy [Bonifati et al. 2008]

by means of manual user feedback

unfeasible for large-scale tasks

via debugging techniques

Routes [Chiticariu et al. 2006]


ETL systems

Extract-Transform-Load tools are data transformation tools based on graphical flowcharts with nodes encoding transformation primitives and edges encoding the transformation flow

Can be considered as a special form of mapping system Generate transformations

GUI

An intermediate language (An algebra for ETL)

Output (transformation scripts)

They are not mapping tools in the classical sense

Focus only on data transformation operators


An ETL data flowchart

[Figure: ETL flowchart with nodes numbered 1 to 6. Recoverable labels: Customer.new, Customer.old, Cnew, Cold, Not Null (CustKey), SK(custkey), PhoneFormat, New - Old, Error.]


How Can One Decide If A Product is Good ?


Talk Outline

Introduction

Matching and mapping: techniques & tools

Benchmarks and evaluation principles

Designing a matching & mapping benchmark

Using real-world scenarios for evaluation

Generating synthetic evaluation scenarios

Measuring efficiency and effectiveness

Conclusions and future directions


Importance Of A Benchmark

Help

Designers and developers to improve their tools by assessing their usefulness and constantly evaluating their performance

Users to compare the different available tools and evaluate suitability for their needs [Haas et al. 2007]

Researchers to compare themselves to others

Exist for a long term

To allow adequate measurements

Evolution in the field

Help assess absolute results

Properties of the results

How they compare to the others


Benchmark

Well-designed tests (scenarios) with which the results of a system can be evaluated [Castro et al. 2004]

A standardized application scenario that serves as a basis for testing, evaluation, and comparison [Merriam-Webster]

Clearly specified scenarios that everyone can implement

Clearly specify the factors that are measured, and under what conditions they should be measured.

Should measure the degree of achievement

Should be reproducible and stable

Can be used repeatedly


Principles

Systematic Procedure

Continuity

Quality and equity

Dissemination

Intelligibility


Types of evaluation

Competence Benchmarks

Measure competences and performance with respect to a task

Aim at characterizing the kind of tasks each method is good for

For designers to improve their systems

Comparative evaluation

Comparison of results of various systems on a common task

Aim at finding the best system

Tuning of the system is an issue

Comparison of systems and aim at general field improvement

Application-specific

Comparison of various systems on a specific task

Competitive evaluation


Evaluation Steps

Planning

Specifying task, software, hardware, input, output

Processing

Analysis

Result evaluation according to predefined measures


Bottom Line: Benchmarks Are Great !


Talk Outline

Introduction

Matching and mapping: techniques & tools

Benchmarks and evaluation principles

Designing a matching & mapping benchmark

Using real-world scenarios for evaluation

Generating synthetic evaluation scenarios

Measuring efficiency and effectiveness

Conclusions and future directions


Generic Matching/Mapping Benchmark Goals

Compare in terms of

Performance

Usability

Effectiveness

Applicability to real-world scenarios

Improve the quality of the matching and mapping generation process


Query vs Matching/Mapping Benchmarks

Query Benchmark: Evaluation scenarios: a setting (a database instance/schema + a query) and the outcome
Schema Mapping Benchmark: For a mapping system, what is the input and what is the output?

Query Benchmark: The query engine should support the scenarios (mainly, it should be able to evaluate the query of each scenario)
Schema Mapping Benchmark: A mapping tool input language should be able to express the transformations of interest. What are they?

Query Benchmark: Supporting the scenario: search engine result = expected outcome
Schema Mapping Benchmark: How do you compare a mapping system output with an expected output?

Query Benchmark: Good Query Engine = Fast (Correct) Responses
Schema Mapping Benchmark: What do we measure? Effort? Expressiveness?

Query Benchmark: Range the characteristics of the data instance to measure how well the engine scales
Schema Mapping Benchmark: What do we scale?


The Scenario Input

Source Schema S

Target Schema T

Maybe an Instance of the Source Schema S

A specification of what we need to achieve

Matching Systems

Typically there is no specification

Just the Source and the Target Schema

A complete set of matches assumed as correct

Mapping Systems

Major Issue


The Specification for Mapping Systems

An expected (desired) transformation

Mapping Systems try to guess it

Issues

No formal semantic framework to express it

No formal relationship to the outcome.

Note: Query engine benchmarks (e.g., TPC-H or XMark) leverage the semantics of the query language

Clear what the scenario is asking

Clear what to compare the result sets against


Expressing Desired Transformations

Natural Language

Too generic and ambiguous

Complete specification formalism (query?)

Defeats the purpose of a mapping system

Comparing the mapping generated by the tool to the precise specification amounts to deciding the equivalence of two mappings (a hard problem!)


Expressing Desired Transformations

Graphical Interface

Different constructs expressed in different tools

Typically a GUI for the query language

Continuously evolve

Simple specification (Correspondences?)

1-1, many-1, between atomic or complex elements, nested, w/o annotations, GUI constructs

Can get so complex that they become the same as the actual mappings

Ambiguous. The same set of correspondences is interpreted differently by different mapping tools

Without a standard way to interpret them? Risky!


A Simple Ambiguous Scenario

Different interpretations may arise from a simple “copy” scenario

[Figure: Source schema Company with children Name and Location, connected by correspondence arrows to Target schema Organization with children Title and Address.]

<Source>
  <Company>
    <Name>IBM</Name>
    <Location>NY</Location>
  </Company>
  <Company>
    <Name>MS</Name>
    <Location>WA</Location>
  </Company>
</Source>

<Target>
  <Organization>
    <Title>IBM</Title>
    <Title>MS</Title>
    <Address>NY</Address>
    <Address>WA</Address>
  </Organization>
</Target>


A Simple Ambiguous Scenario

Different tools might generate different instances with the same arrows

[Figure: Source schema Company with children Name and Location, connected by correspondence arrows to Target schema Organization with children Title and Address.]

<Source>
  <Company>
    <Name>IBM</Name>
    <Location>NY</Location>
  </Company>
  <Company>
    <Name>MS</Name>
    <Location>WA</Location>
  </Company>
</Source>

<Target>
  <Organization>
    <Title>IBM</Title>
    <Address>NY</Address>
  </Organization>
</Target>


A Simple Ambiguous Scenario

Arrows between non-leaf nodes are not allowed in all tools

[Figure: Source schema Company with children Name and Location, connected by correspondence arrows to Target schema Organization with children Title and Address.]

<Source>
  <Company>
    <Name>IBM</Name>
    <Location>NY</Location>
  </Company>
  <Company>
    <Name>MS</Name>
    <Location>WA</Location>
  </Company>
</Source>

<Target>
  <Organization>
    <Title>IBM</Title>
    <Address>NY</Address>
  </Organization>
  <Organization>
    <Title>MS</Title>
    <Address>WA</Address>
  </Organization>
</Target>
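The ambiguity of this scenario can be made concrete with a small sketch. It assumes the two arrows are Name to Title and Location to Address (inferred from the instances shown, not stated explicitly on the slides). The function names and the dictionary encoding are hypothetical and do not describe the behaviour of any particular tool.

```python
# Source instance from the slides, encoded as a list of dictionaries.
companies = [
    {"Name": "IBM", "Location": "NY"},
    {"Name": "MS", "Location": "WA"},
]


def one_organization_for_all(rows):
    """Interpretation A: a single Organization collects every copied value."""
    return [{
        "Title": [row["Name"] for row in rows],
        "Address": [row["Location"] for row in rows],
    }]


def one_organization_per_company(rows):
    """Interpretation B: each Company becomes its own Organization."""
    return [{"Title": [row["Name"]], "Address": [row["Location"]]} for row in rows]


flat = one_organization_for_all(companies)
grouped = one_organization_per_company(companies)
```

Both functions honour the same two correspondences, yet `flat` and `grouped` are different target instances, which is why a benchmark cannot rely on correspondences alone as a specification.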


The Issue of The User Input

Many tools allow the mapping designer to manually edit the generated transformation

As much power as the underlying language provides.

Shortcuts and abstraction levels.


The Scenario Output

The output has to be correct

Satisfy the desired transformation

Compare to the expected transformation

Matching

A set of Matches

Mapping

It is not clear what the output is

Mapping

The transformation scripts

The transformed data

Same data may be generated by different mappings


Evaluation Challenges

What are we testing?

Expressiveness

Performance

of the tool?

of the generated mappings?

Quality

of the generated mappings?

of the integrated schema?

of the target data? [Dong et al. 2009]

User effort

Heavily depends on the mapping interface

Measuring these factors is hard without a formal (and standard) agreement on expressing specifications


Matching vs. Mapping Benchmarking

Matching System Evaluation is typically set comparison

Considers only the semantics of the schema

More automatic

Mapping System Evaluation is more challenging

Considers semantics of schemas and transformations

Requires more human intervention
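Since matching evaluation is typically a set comparison, it can be sketched in a few lines. The example below assumes that a match is encoded as a (source element, target element) pair and that a reference set of correct matches is available. The function name, the encoding, and the sample pairs are illustrative assumptions.

```python
def evaluate_matches(found, expected):
    """Compare a matcher's output with the reference matches as plain sets."""
    found, expected = set(found), set(expected)
    true_positives = len(found & expected)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical reference matches and matcher output.
expected = {
    ("Company/Name", "Organization/Title"),
    ("Company/Location", "Organization/Address"),
}
found = {
    ("Company/Name", "Organization/Title"),
    ("Company/Name", "Organization/Address"),
}
precision, recall, f_measure = evaluate_matches(found, expected)
```

No comparable one-liner exists for mappings, because two syntactically different mappings can describe the same transformation.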


The User Can Also Help …

By being presented with …

The mappings

Difficult to overcome the heterogeneity of languages

The generated target instance

Not feasible for large and complex instances [Velegrakis et al. 2005]

A representative sample of the target instance

Appealing alternative based on positive/negative examples, but still in its infancy [Alexe et al. 2008]

Presented in detail later on

How do all these get into the evaluation function?


Talk Outline

Introduction

Matching and mapping: techniques & tools

Benchmarks and evaluation principles

Designing a matching & mapping benchmark

Using real-world scenarios for evaluation

Generating synthetic evaluation scenarios

Measuring efficiency and effectiveness

Conclusions and future directions


A Common Design Pattern

Example: TPC-H

Sets of test cases (With their expected output)

Find those that can be successfully executed

Characterize the system

Sets of Matching or Mapping Scenarios


Examples of Data Sets

Public and well-designed schemas

Meaningful overlap

Limited by the existence of real schemas

Need to be discriminating

Who is the Oracle? External knowledge or a human?


Large Scale Ontology Sets

[Zhang et al. 2004]

Two large ontologies from the anatomy domain

Foundational Model of Anatomy & GALEN

Thousands of classes, no instances

[Lambrix et al. 2003]

Gene & Signal Ontology

Partial overlap


OAEI Data

Artificial data set

33 classes, 64 properties, 76 individuals

Initial ontology distorted

Result: ~50 pairs of ontologies

Correct by construction


Data Set Factors Affecting The Evaluation

Heterogeneity of the modeling language (schemas/ontologies)

The language itself

Number of schemas (1-to-1 or many-to-1)

From-scratch matching, or is there a head start?

Multiplicity: how many elements in one schema can match with how many in the other

Are oracles permitted?

Is user input permitted?

Can there be a-priori training?

External methods and auxiliary inputs

Is justification of the output needed?

Relations of the correspondences (only = or others as well)

Is there a time limit for the matching/mapping?

Can the matching be on leaves only or not?


OAEI Evaluation Example

[Euzenat et al., 2006]

Ontology Alignment Evaluation Initiative

oaei.ontologymatching.org

Yearly contest

Participants:

Provided with OAEI API

Execute all tests

Provide their results & Paper

Make the results public


Building Large Ontology Sets

[Avesani et al. 2005]

Test sets for matching web directories and classifications

Two web directories are similar if their web pages are similar

It can be considered a matching technique by itself


Thesauri

Thesauri covering large hierarchies of concepts and textual knowledge

Digital Libraries and Museums

Large need to match them

Example:

AGROVOC (FAO): 16K terms

NAL (US Agricultural Dept.): 41K terms


Various examples

Illinois Semantic Integration Archive

http://pages.cs.wisc.edu/~anhai/wisc-si-archive

Collection of different schemas & Data

Faculties

Courses

Real Estate


Real Examples Lack Systematic Design

Existing Datasets not systematic

Completeness ?

Correctness?

Deduplicated?

Clarity?

Mainly testbeds or standardized tests

But not benchmarks

Benchmark tests should be

Consistent

Complete

Minimal


Real-World Matching Problems

[Kopcke et al. 2010]

Collection of matching problems

[Giunchiglia et al. 2009]

4500 matches between 3 web directories

Error free

Low complexity

High discriminative capacity


XBenchMatch

[Duchateau et al. 2007]

Criteria for testing and evaluating matching tools

Focuses on assessment of matching tools

Quality

Time

10 Datasets for matching

Classified according to:

Data level, e.g., degree of heterogeneity

Process level, e.g., scale


STBenchmark

[Alexe et al. 2008] www.stbenchmark.org

Evaluate the effectiveness of the mapping system

Derived from real applications: DBLP, BioWarehouse, …

Derived from Information Integration Literature: [Lerner, 2000], [Carey, 2000], etc.

Minimum set of transformations that should be supported

1 scenario - 1 transformation

Described by: Source & Target Schemas, Transformation Query, Instance of the Source Schema

Capture most practically relevant transformation cases:

Copying

Constant Value Generation

Horizontal Partition

Surrogate Key Assignment

Vertical Partition

Unnesting (Flattening)

Nesting

Self-Joins

Denormalization

Keys and Object Fusion

Atomic Value Management

Aggregation

Order

Ordered By

Flipping Metadata to Data

Flipping Data to Metadata

Flipping Data to Nested Metadata


Scenario: Copy

Source:

Protein * name accession created

Target:

Protein * Name Accession Created

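(: Copy every source Protein into a target Protein, renaming its children :)
for $x0 in $doc/Source/Protein
return
  <Protein>
    <Name>{ $x0/name/text() }</Name>
    <Accession>{ $x0/accession/text() }</Accession>
    <Created>{ $x0/created/text() }</Created>
  </Protein>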

Textual description +


Scenario: Value Generation

Target:

DataSet * Name LoadingDate

“SwissProt” “July 4th”

<DataSet>
  <Name>SwissProt</Name>
  <LoadingDate>July 4th</LoadingDate>
</DataSet>


Scenario: Horizontal Partitioning

Source:

gene * txt type protein

Target:

Gene * Name Protein

Synonym * Name Protein

If type == “primary”

(: Primary genes go to Gene, all other rows go to Synonym :)
(
  for $x0 in $doc/Source/Gene
  where $x0/type/text() = 'primary'
  return
    <Gene>
      <Name>{ $x0/name/text() }</Name>
      <Protein>{ $x0/protein/text() }</Protein>
    </Gene>,
  for $x0 in $doc/Source/Gene
  where $x0/type/text() != 'primary'
  return
    <Synonym>
      <Name>{ $x0/name/text() }</Name>
      <Protein>{ $x0/protein/text() }</Protein>
    </Synonym>
)
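The same horizontal partitioning can be sketched outside XQuery. The Python version below assumes the source rows are dictionaries with the keys `name`, `type` and `protein`, mirroring the query rather than any actual STBenchmark data. The sample rows are made up.

```python
def horizontal_partition(genes):
    """Route each source row to Gene or Synonym based on its type."""
    target = {"Gene": [], "Synonym": []}
    for gene in genes:
        relation = "Gene" if gene["type"] == "primary" else "Synonym"
        target[relation].append({"Name": gene["name"], "Protein": gene["protein"]})
    return target


# Hypothetical sample rows, not taken from the benchmark data.
genes = [
    {"name": "g1", "type": "primary", "protein": "p1"},
    {"name": "g1-alt", "type": "synonym", "protein": "p1"},
]
partitioned = horizontal_partition(genes)
```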


Scenario: Surrogate Key Assignment

Source:

gene * txt type protein

Target:

Gene * Name Protein WID

Synonym * Name Protein WID

If type == “primary”

Id()

Id’()

(: genID() stands for the fresh-identifier generator used on the slide :)
(
  for $x0 in $doc/Source/Gene
  where $x0/type/text() = 'primary'
  return
    <Gene>
      <Name>{ $x0/name/text() }</Name>
      <Protein>{ $x0/protein/text() }</Protein>
      <WID>{ genID() }</WID>
    </Gene>,
  for $x0 in $doc/Source/Gene
  where $x0/type/text() != 'primary'
  return
    <Synonym>
      <Name>{ $x0/name/text() }</Name>
      <Protein>{ $x0/protein/text() }</Protein>
      <WID>{ genID() }</WID>
    </Synonym>
)


Scenario: Vertical Partition

Source:

Reaction * entry name comment orthology definition equation

Target:

Reaction * Entry Name Comment CoFactor

ChemicalInfo * Orthology Definition Equation CoFactor

Labeled Nulls

(: One fresh $id per source Reaction links the two target fragments :)
for $x0 in $doc/Source/Reaction
let $id := genID()
return (
  <Reaction>
    <Entry>{ $x0/entry/text() }</Entry>
    <Name>{ $x0/name/text() }</Name>
    <Comment>{ $x0/comment/text() }</Comment>
    <CoFactor>{ $id }</CoFactor>
  </Reaction>,
  <ChemicalInfo>
    <Orthology>{ $x0/orthology/text() }</Orthology>
    <Definition>{ $x0/definition/text() }</Definition>
    <Equation>{ $x0/equation/text() }</Equation>
    <CoFactor>{ $id }</CoFactor>
  </ChemicalInfo>
)

Normalization

Note that no key information is assumed; as such, duplication is allowed
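As a rough illustration of the vertical partition, the sketch below splits each source Reaction into a Reaction row and a ChemicalInfo row that share a freshly generated CoFactor value. The counter-based identifier is a hypothetical stand-in for the slide's genID(), and the sample rows are invented.

```python
from itertools import count


def vertical_partition(reactions):
    """Split each Reaction into two target rows linked by a fresh CoFactor value."""
    fresh_ids = count(1)  # hypothetical stand-in for genID()
    target = {"Reaction": [], "ChemicalInfo": []}
    for reaction in reactions:
        cofactor = f"N{next(fresh_ids)}"  # plays the role of a labeled null
        target["Reaction"].append({
            "Entry": reaction["entry"],
            "Name": reaction["name"],
            "Comment": reaction["comment"],
            "CoFactor": cofactor,
        })
        target["ChemicalInfo"].append({
            "Orthology": reaction["orthology"],
            "Definition": reaction["definition"],
            "Equation": reaction["equation"],
            "CoFactor": cofactor,
        })
    return target


# Hypothetical sample rows.
reactions = [
    {"entry": "e1", "name": "n1", "comment": "c1",
     "orthology": "o1", "definition": "d1", "equation": "q1"},
    {"entry": "e2", "name": "n2", "comment": "c2",
     "orthology": "o2", "definition": "d2", "equation": "q2"},
]
split = vertical_partition(reactions)
```

Because the identifier is generated per source row and not derived from a key, two identical source rows would still yield two distinct pairs, in line with the note that duplication is allowed.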


Scenario: Join Path Selection

Target:

Taxon * Id Name UniqueName Class Parent Rank EmblCode

Source:

Name * id name uniqueName class

Node * taxid parentId rank emblCode

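(: Join Name and Node on id = taxid to build one Taxon per matching pair :)
for $x0 in $doc/Source/Name, $x1 in $doc/Source/Node
where $x0/id/text() = $x1/taxid/text()
return
  <Taxon>
    <Id>{ $x0/id/text() }</Id>
    <Name>{ $x0/name/text() }</Name>
    <UniqueName>{ $x0/uniqueName/text() }</UniqueName>
    <Class>{ $x0/class/text() }</Class>
    <Parent>{ $x1/parentId/text() }</Parent>
    <Rank>{ $x1/rank/text() }</Rank>
    <EmblCode>{ $x1/emblCode/text() }</EmblCode>
  </Taxon>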

Denormalization
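The join behind this denormalization can be sketched as a simple hash join. The code below assumes both source relations are lists of dictionaries keyed by the attribute names in the schema above. The function name and sample rows are hypothetical.

```python
def denormalize(names, nodes):
    """Join Name and Node on id = taxid and emit one Taxon per matching pair."""
    nodes_by_taxid = {}
    for node in nodes:
        nodes_by_taxid.setdefault(node["taxid"], []).append(node)
    return [
        {
            "Id": name["id"],
            "Name": name["name"],
            "UniqueName": name["uniqueName"],
            "Class": name["class"],
            "Parent": node["parentId"],
            "Rank": node["rank"],
            "EmblCode": node["emblCode"],
        }
        for name in names
        for node in nodes_by_taxid.get(name["id"], [])
    ]


# Hypothetical sample rows; the second Name has no matching Node.
names = [
    {"id": "1", "name": "n1", "uniqueName": "u1", "class": "c1"},
    {"id": "2", "name": "n2", "uniqueName": "u2", "class": "c2"},
]
nodes = [{"taxid": "1", "parentId": "0", "rank": "r1", "emblCode": "XX"}]
taxa = denormalize(names, nodes)
```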


Scenario: Cyclic Joins

Source:

Gene * name type protein

Target:

Gene * Name Protein

Synonym * Name GeneWID

If type == “primary”

(: The inner loop is nested so that $x0 stays in scope for the self-join :)
for $x0 in $doc/Source/Gene
where $x0/type/text() = 'primary'
return (
  <Gene>
    <Name>{ $x0/name/text() }</Name>
    <Protein>{ $x0/protein/text() }</Protein>
  </Gene>,
  for $x1 in $doc/Source/Gene
  where $x1/type/text() != 'primary'
    and $x1/protein/text() = $x0/protein/text()
  return
    <Synonym>
      <Name>{ $x1/name/text() }</Name>
      <GeneWID>{ $x0/name/text() }</GeneWID>
    </Synonym>
)


Scenario: Un-Nesting Structures

Target:

Publication * Title Year PublishedIn Name

Source:

Reference * title year publishedIn Author * name

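(: One flat Publication per (Reference, Author) pair :)
for $x0 in $doc/Source/Reference, $x1 in $x0/Author
return
  <Publication>
    <Title>{ $x0/title/text() }</Title>
    <Year>{ $x0/year/text() }</Year>
    <PublishedIn>{ $x0/publishedIn/text() }</PublishedIn>
    <Name>{ $x1/name/text() }</Name>
  </Publication>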
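Unnesting amounts to emitting one flat row per parent and child pair. The sketch below assumes each Reference is a dictionary whose nested authors sit under a hypothetical `authors` key. The sample data is invented.

```python
def unnest(references):
    """Flatten Reference/Author into one Publication row per author."""
    return [
        {
            "Title": reference["title"],
            "Year": reference["year"],
            "PublishedIn": reference["publishedIn"],
            "Name": author["name"],
        }
        for reference in references
        for author in reference["authors"]
    ]


# Hypothetical sample instance with one reference and two authors.
references = [
    {"title": "t1", "year": "2008", "publishedIn": "v1",
     "authors": [{"name": "a1"}, {"name": "a2"}]},
]
publications = unnest(references)
```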


Scenario: Nesting Structures

Target:

Period * Year Author * Name Publication * Title PublishedIn

Source:

Publication * title year publishedIn name

(: Group publications by year, then by author name :)
for $x0 in distinct-values($doc/Source/Publication/year)
return
  <Period>
    <Year>{ $x0 }</Year>
    {
      for $x1 in distinct-values($doc/Source/Publication[year = $x0]/name)
      return
        <Author>
          <Name>{ $x1 }</Name>
          {
            for $x2 in $doc/Source/Publication
            where $x2/year/text() = $x0 and $x2/name/text() = $x1
            return
              <Publication>
                <Title>{ $x2/title/text() }</Title>
                <PublishedIn>{ $x2/publishedIn/text() }</PublishedIn>
              </Publication>
          }
        </Author>
    }
  </Period>
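The two-level grouping of the nesting scenario can be sketched with nested dictionaries. The code assumes flat Publication rows with the keys `title`, `year`, `publishedIn` and `name`. The output encoding (`Authors`, `Publications` lists) is a hypothetical choice made for this illustration.

```python
def nest(publications):
    """Group flat Publication rows by year, then by author name."""
    periods = {}
    for publication in publications:
        authors = periods.setdefault(publication["year"], {})
        authors.setdefault(publication["name"], []).append({
            "Title": publication["title"],
            "PublishedIn": publication["publishedIn"],
        })
    return [
        {
            "Year": year,
            "Authors": [
                {"Name": name, "Publications": entries}
                for name, entries in authors.items()
            ],
        }
        for year, authors in periods.items()
    ]


# Hypothetical sample rows.
flat_publications = [
    {"title": "t1", "year": "2008", "publishedIn": "v1", "name": "a1"},
    {"title": "t2", "year": "2008", "publishedIn": "v2", "name": "a1"},
    {"title": "t1", "year": "2008", "publishedIn": "v1", "name": "a2"},
    {"title": "t3", "year": "2009", "publishedIn": "v1", "name": "a1"},
]
nested_periods = nest(flat_publications)
```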


Scenario: Keys & Object Fusion

Target:

Experiment * Contact Date Description ExperimentalData * Data Role

Source:

Experiment * eid contact date description ExperimentalData * data role

FlowCytometrySample id contact date Probe * data type

(: Step 1: bring both sources into a common Datum shape, keyed by a generated id :)
<Source2>{
  for $x0 in $doc/Source/Experiment, $x1 in $x0/ExperimentalData
  return
    <Datum>
      <id>{ genID($x0/contact/text(), $x0/date/text()) }</id>
      <Contact>{ $x0/contact/text() }</Contact>
      <Date>{ $x0/date/text() }</Date>
      <Description>{ $x0/description/text() }</Description>
      <Data>{ $x1/data/text() }</Data>
      <Role>{ $x1/role/text() }</Role>
    </Datum>,
  for $x0 in $doc/Source/FlowCytometrySample, $x1 in $x0/Probe
  return
    <Datum>
      <id>{ genID($x0/contact/text(), $x0/date/text()) }</id>
      <Contact>{ $x0/contact/text() }</Contact>
      <Date>{ $x0/date/text() }</Date>
      <Data>{ $x1/data/text() }</Data>
      <Role>{ $x1/type/text() }</Role>
    </Datum>
}</Source2>

(: Step 2: fuse all Datum elements that share an id into one Experiment :)
for $x0 in distinct-values($doc/Source2/Datum/id)
return
  <Experiment>{
    for $x1 in ($doc/Source2/Datum[id = $x0])[1]
    return ($x1/Contact, $x1/Date, $x1/Description),
    for $x3 in $doc/Source2/Datum
    where $x3/id/text() = $x0
    return <ExperimentalData>{ $x3/Data, $x3/Role }</ExperimentalData>
  }</Experiment>
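Object fusion can be sketched as merging records that agree on a key. The example below assumes the key is the pair (contact, date), as suggested by the arguments passed to genID() on the slide. The nested lists under the hypothetical keys `data` and `probes`, and the rule of keeping the first available description, are simplifying assumptions.

```python
def fuse(experiments, samples):
    """Merge records from both sources that share the key (contact, date)."""
    fused = {}

    def entry(contact, date):
        # One target Experiment per key, created on first use.
        return fused.setdefault((contact, date), {
            "Contact": contact,
            "Date": date,
            "Description": None,
            "ExperimentalData": [],
        })

    for experiment in experiments:
        target = entry(experiment["contact"], experiment["date"])
        target["Description"] = target["Description"] or experiment["description"]
        target["ExperimentalData"] += [
            {"Data": item["data"], "Role": item["role"]} for item in experiment["data"]
        ]
    for sample in samples:
        target = entry(sample["contact"], sample["date"])
        target["ExperimentalData"] += [
            {"Data": probe["data"], "Role": probe["type"]} for probe in sample["probes"]
        ]
    return list(fused.values())


# Hypothetical sample instances sharing the same contact and date.
experiments = [{"contact": "c1", "date": "d1", "description": "desc",
                "data": [{"data": "x", "role": "r1"}]}]
samples = [{"contact": "c1", "date": "d1",
            "probes": [{"data": "y", "type": "r2"}]}]
fused_experiments = fuse(experiments, samples)
```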


Scenario: Atomic Value Manipulation

Target:

Contact * FirstName LastName Address Phone

Source:

Contact * name address street city zip phone

GetFirstName(…)

Type Discrepancy Handling

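(: GetFirstName, GetLastName, Concat and String2Int are the value functions named on the slide :)
for $x0 in $doc/Source/Contact
return
  <Contact>
    <FirstName>{ GetFirstName($x0/name/text()) }</FirstName>
    <LastName>{ GetLastName($x0/name/text()) }</LastName>
    <Address>{ Concat($x0/street/text(), $x0/city/text(), $x0/zip/text()) }</Address>
    <Phone>{ String2Int($x0/phone/text()) }</Phone>
  </Contact>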
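A minimal sketch of these value-level functions follows. It assumes that a name is split at its first space, that address parts are joined with a single space, and that the phone value contains only digits. None of these conventions is specified on the slide, so each is a labeled assumption.

```python
def transform_contact(contact):
    """Apply the atomic value functions of the scenario to one source Contact."""
    # Assumption: the first space separates first name from last name.
    first_name, _, last_name = contact["name"].partition(" ")
    return {
        "FirstName": first_name,
        "LastName": last_name,
        # Assumption: address parts are concatenated with a single space.
        "Address": " ".join([contact["street"], contact["city"], contact["zip"]]),
        # Type discrepancy handling: the phone is stored as a string in the source.
        "Phone": int(contact["phone"]),
    }


# Hypothetical sample contact.
source_contact = {"name": "Ada Lovelace", "street": "1 Main St",
                  "city": "Uppsala", "zip": "75105", "phone": "5551234"}
target_contact = transform_contact(source_contact)
```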


Scenarios: Aggregation & Order


Scenarios: Data Meta-data


Thalia

[Hammer et al. 2005]

Integration tools benchmark

Rich set of test data

Source schemas

Syntactic and semantic heterogeneity

12 test queries for the integrated schema


Real Data Is Not Enough for Benchmarking...

… but definitely not for the reason Dilbert thinks !


Talk Outline

Introduction

Matching and mapping: techniques & tools

Benchmarks and evaluation principles

Designing a matching & mapping benchmark

Using real-world scenarios for evaluation

Generating synthetic evaluation scenarios

Measuring efficiency and effectiveness

Conclusions and future directions


Why Synthetic Data & Scenarios

To stress test the system

To understand performance in diverse situations

To create additional realistic test cases

To ensure that unforeseen situations are also tested


Top-down Scenario Construction

Start with a big schema and divide/extract

TaxME2 [Giunchiglia et al. 2009]

Preserves correctness, complexity, performance

[Okawara et al. 2006]

Techniques on how a benchmark should be

Thalia [Hammer et al. 2005]

Large dataset + filters

eTuner [Lee et al. 2007]

Duplicate schema, split the data in 2

Modify the first half

Limited kinds of modifications, not very natural


Bottom-up Scenario Construction

Create the schema from scratch

STBenchmark [Alexe et al. 2008]

Schema Generator

Expands basic scenarios

Changing basic characteristics of the scenario

Data Generator

Hand in hand with the Schema Generator

ToXgene [Barbosa et al. 2002]

Parameters of Schema Generation

• Number of subelements

• Nesting depth

• Join size

• Join width

• Join kind (star / chain)

• Function arity

[Figure: example source schema with set elements R1 [0…*] (A1, A2, A3), R2 [0…*] (A4, A5), R3 [0…*] (A6), R4 [0…*] (A7, A8, A9), R5 [0…*] (A10, A11), R6 [0…*] (A12), and a function f(…)]

Parameter values: sampled from normal distributions given by average and standard deviation
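The sampling of parameter values can be sketched as follows. This is a minimal illustration, not the STBenchmark generator itself: the parameter names, the (average, standard deviation) pairs, and the clamping to a minimum of 1 are assumptions made for the example.

```python
import random


def sample_parameter(rng, mean, std, minimum=1):
    """Draw one integer value from a normal distribution, clamped from below."""
    return max(minimum, round(rng.gauss(mean, std)))


def sample_schema_parameters(config, seed=0):
    """Sample every generation parameter from its own (mean, std) pair."""
    rng = random.Random(seed)
    return {name: sample_parameter(rng, mean, std)
            for name, (mean, std) in config.items()}


# Hypothetical configuration: parameter name -> (average, standard deviation).
config = {
    "num_subelements": (3, 1),
    "nesting_depth": (2, 1),
    "join_size": (3, 1),
    "join_width": (2, 0.5),
    "function_arity": (2, 1),
}

params = sample_schema_parameters(config, seed=42)
```

Fixing the seed makes a generated scenario reproducible, which matters when several systems are compared on the same synthetic input.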

Stretching The Unnesting Scenario

Source: Reference [0…*] (title, year, publishedIn), Author [0…*] (name), Affiliation [0…*] (university, country), Students [0…*] (sname)

Target: Publication [0…*] (Title, Year, PublishedIn, Name, University, Country, StudentName)

[Figure labels: Unnesting; Basic Scenario]

Stretching The Horizontal Partitioning

Vary the number of partitions

Vary the number of elements that exist in each partition

Source: gene * (txt, type, protein, hpAttr1, hpAttr2)

Target:
  Gene * (Name, Protein, HpAttr1, HpAttr2)
  Synonym * (Name, Protein, HpAttr1, HpAttr2)
  HPRel1 * (Name, Protein, HpAttr1, HpAttr2)
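A horizontal partitioning of this kind can be sketched in a few lines. The code below is only an illustration under stated assumptions: the values of `type` used to route tuples ("primary", "synonym", "hp1"), the renaming of `txt` to `Name`, and the helper name `horizontal_partition` are hypothetical and not taken from the benchmark.

```python
def horizontal_partition(rows, selector, routing, rename):
    """Route each source tuple to one target relation based on a selector value.

    The selector attribute is dropped and the remaining attributes are renamed.
    """
    partitions = {target: [] for target in routing.values()}
    for row in rows:
        target = routing[row[selector]]
        record = {rename[k]: v for k, v in row.items() if k != selector}
        partitions[target].append(record)
    return partitions


# Hypothetical routing of 'type' values to the target relations.
routing = {"primary": "Gene", "synonym": "Synonym", "hp1": "HPRel1"}
rename = {"txt": "Name", "protein": "Protein",
          "hpAttr1": "HpAttr1", "hpAttr2": "HpAttr2"}

genes = [
    {"txt": "g1", "type": "primary", "protein": "p1", "hpAttr1": 1, "hpAttr2": 2},
    {"txt": "g1b", "type": "synonym", "protein": "p1", "hpAttr1": 3, "hpAttr2": 4},
    {"txt": "g2", "type": "hp1", "protein": "p2", "hpAttr1": 5, "hpAttr2": 6},
]

target = horizontal_partition(genes, "type", routing, rename)
```

Stretching the scenario then amounts to adding entries to `routing` (more partitions) or to `rename` (more elements per partition).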

Combining Mapping Scenarios

Lack of diversity: combine scenarios

Based on a set of configuration parameters, generate a complex mapping scenario by concatenating scaled-up mapping scenarios

[Figure: concatenation of (S1, T1, P1) and (S2, T2, P2), where P1 relates S1 to T1 and P2 relates S2 to T2]

Stretched Mapping Scenarios

[Figure: complex mapping scenario. The source schema and the target schema are each assembled from stretched basic scenarios: Horizontal Partitioning, Copy, Unnesting, Copy]

Unnesting part of the figure:

Source: Reference [0…*] (title, year, publishedIn), Author [0…*] (name), Affiliation [0…*] (university, country), Students [0…*] (sname)

Target: Publication [0…*] (Title, Year, PublishedIn, Name, University, Country, StudentName)

Composing Mapping Scenarios

Intermix of basic mapping scenario transformations

capture cases where different types of transformations occur simultaneously on the same part of a schema

Main idea:

Generate a source schema S

Evolve S to obtain a target schema.

All based on the configuration parameters

Example next:

Composed Mapping Scenarios

Source S:
  R1 [0…*] A1 A2, SE1 [0…*] A3 A4, SE2 [0…*] A5
  R2 [0…*] A7 A8, SE3 [0…*] A9, SE4 [0…*] A10 A11
  R3 [0…*] A12 A13 A14

• unnesting: F1 (A1 A2 A3 A4 A5), F2 (A7 A8 A9 A10 A11), F3 (A12 A13 A14)

• removal: F1 (A1 A3 A5), F2 (A7 A8 A10 A11), F3 (A12 A13 A14)

• duplication and migration: F1 (A1 A3 A5 A7b), F2 (A7 A8 A10 A11 A12), F3 (A13 A14)

• addition: F1 (A1 A3 A5 A7b A15 (id)), F2 (A7 A8 A10 A11 A12 A16 (=“June”)), F3 (A13 A14 A17 (=A3*A14))

• nesting: produces the target T

Target T:
  R4 [0…*] A1 A3, SE5 [0…*] A5 A7b A15
  R6 [0…*] A7 A8 A10 A11 A12 A16
  R7 [0…*] A13 A14, SE6 [0…*] A17

P: transformation query from S to T

Synthetic Examples for Matching

[Ferrara et al. 2008]

ISLab Instance Matching Benchmark

Creates a reference ontology and populates it

Using the web

Performs a sequence of modifications

Variations in data values

Structural heterogeneity

Semantic variations

Scenarios Are Useless Without Metrics

Metric Categorization

Qualitative metrics

Compliance measures

Quantitative metrics

Performance measures

User-specific metrics

Application specific metrics

Qualitative = Compliance

Evaluate the degree of compliance of the system with respect to some standard

Matching: Precision, Recall, F-measure and Fallout [Euzenat et al. 2007]

Measure the difference of the system output to some reference (expected) output

An expert user is typically assumed to provide the expected matches [Duchateau et al. 2007] [Euzenat et al. 2004]

They do not provide any measure of the post-match effort

They do not consider the time spent by the user in doing verification during intermediate stages

Terminology

False Negatives: $E - G$

False Positives: $G - E$

True Positives: $G \cap E$

True Negatives: $U - (G \cup E)$

E: Expected, G: Generated
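The four sets can be computed directly with set operations. The sketch below is illustrative only: the correspondences are written as (source, target) pairs, and the universe U is assumed here to be all pairs over a small hypothetical vocabulary, since the slide does not define U.

```python
def confusion_sets(generated, expected, universe):
    """Split the universe of correspondences into TP, FP, FN and TN."""
    g, e, u = set(generated), set(expected), set(universe)
    return {
        "true_positives": g & e,
        "false_positives": g - e,
        "false_negatives": e - g,
        "true_negatives": u - (g | e),
    }


E = {("Book", "Volume"), ("Person", "Human"), ("Science", "Essay")}
G = {("Product", "Volume"), ("Person", "Writer"), ("Science", "Essay")}

# Hypothetical universe: every pairing of the elements mentioned above.
source_elements = ["Book", "Person", "Science", "Product"]
target_elements = ["Volume", "Human", "Essay", "Writer"]
U = {(s, t) for s in source_elements for t in target_elements}

sets = confusion_sets(G, E, U)
```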

Hamming Distance

Measures the dissimilarity between matches

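$$H(G,E) = 1 - \frac{|G \cap E|}{|G \cup E|}$$

Example:

E = {Book-Volume, Person-Human, Science-Essay}

G = {Product-Volume, Person-Writer, Science-Essay}

Only Science-Essay is shared, and the union contains five distinct correspondences:

$$H(G,E) = 1 - \frac{1}{5} = \frac{4}{5}$$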
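As a small check of the set-based formula, the sketch below evaluates it on the example correspondences with exact fractions. The function name and the handling of two empty sets are assumptions made for the illustration. Under the formula as stated, the union of the two example sets contains five distinct correspondences and the intersection one.

```python
from fractions import Fraction


def hamming_distance(generated, expected):
    """Dissimilarity 1 - |G ∩ E| / |G ∪ E| between two sets of correspondences."""
    g, e = set(generated), set(expected)
    union = g | e
    if not union:
        # Assumption: two empty match sets are treated as identical.
        return Fraction(0)
    return 1 - Fraction(len(g & e), len(union))


E = {"Book-Volume", "Person-Human", "Science-Essay"}
G = {"Product-Volume", "Person-Writer", "Science-Essay"}

distance = hamming_distance(G, E)
```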

Precision

Originated from IR [van Rijsbergen, 1975]

Adapted to matching [Do et al. 2002]

Ratio of correctly found correspondences (true positives) over the total number of returned correspondences (true and false positives)

Intuitively: The degree of correctness

$$\mathrm{Precision}(G,E) = \frac{|G \cap E|}{|G|}$$

Recall

Originated from IR [van Rijsbergen, 1975]

Adapted to matching [Do et al. 2002]

Ratio of correctly found correspondences (true positives) over the total number of expected correspondences (true positives and false negatives)

Intuitively: The degree of completeness

$$\mathrm{Recall}(G,E) = \frac{|G \cap E|}{|E|}$$
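Precision and Recall follow directly from the set definitions. The sketch below is a minimal illustration using exact fractions; returning 0 for an empty G or E is an assumption, since the slides do not say how that case is handled.

```python
from fractions import Fraction


def precision(generated, expected):
    """|G ∩ E| / |G|: share of returned correspondences that are correct."""
    g, e = set(generated), set(expected)
    return Fraction(len(g & e), len(g)) if g else Fraction(0)


def recall(generated, expected):
    """|G ∩ E| / |E|: share of expected correspondences that were found."""
    g, e = set(generated), set(expected)
    return Fraction(len(g & e), len(e)) if e else Fraction(0)


E = {"Book-Volume", "Person-Human", "Science-Essay"}
G = {"Product-Volume", "Person-Writer", "Science-Essay"}

p = precision(G, E)
r = recall(G, E)
```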

Fallout

The percentage of the found matches that are false positives

Intuitively: How much error has been made

$$\mathrm{Fallout}(G,E) = \frac{|G| - |G \cap E|}{|G|} = \frac{|G - E|}{|G|}$$
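The two forms of the Fallout formula can be checked against each other on a small example. This is only a sketch; the function names and the value 0 for an empty G are assumptions.

```python
from fractions import Fraction


def fallout(generated, expected):
    """|G - E| / |G|: share of returned correspondences that are false positives."""
    g, e = set(generated), set(expected)
    return Fraction(len(g - e), len(g)) if g else Fraction(0)


def fallout_via_intersection(generated, expected):
    """Equivalent form (|G| - |G ∩ E|) / |G|."""
    g, e = set(generated), set(expected)
    return Fraction(len(g) - len(g & e), len(g)) if g else Fraction(0)


E = {"Book-Volume", "Person-Human", "Science-Essay"}
G = {"Product-Volume", "Person-Writer", "Science-Essay"}

f = fallout(G, E)
```

With these definitions, Fallout is the complement of Precision on the same G and E.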

F-Measure

Precision & Recall are not always consistent

Neither are their complements, Noise & Silence

Aggregate Precision & Recall

Percentage of the false positive found matches

Intuitively: How much error has been made

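$$\mathrm{FMeasure}_{\alpha}(G,E) = \frac{\mathrm{Precision}(G,E) \times \mathrm{Recall}(G,E)}{(1-\alpha) \times \mathrm{Precision}(G,E) + \alpha \times \mathrm{Recall}(G,E)}$$

For α=1, F-Measure is equal to Precision and for α=0 to Recall. For α=0.5, it is the harmonic mean of the two.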
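The three special cases of the weighting parameter can be verified numerically. The sketch below takes Precision and Recall as inputs; the function name and the value 0 for a zero denominator are assumptions made for the illustration.

```python
from fractions import Fraction


def f_measure(precision, recall, alpha):
    """Weighted aggregation P*R / ((1 - alpha)*P + alpha*R)."""
    denominator = (1 - alpha) * precision + alpha * recall
    if denominator == 0:
        # Assumption: define the measure as 0 when it is otherwise undefined.
        return Fraction(0)
    return precision * recall / denominator


# Hypothetical precision and recall values for one matching result.
p, r = Fraction(1, 2), Fraction(1, 4)

only_precision = f_measure(p, r, Fraction(1))
only_recall = f_measure(p, r, Fraction(0))
harmonic = f_measure(p, r, Fraction(1, 2))
```

For α = 0.5 the result coincides with the usual harmonic mean 2PR / (P + R).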

Overall

Like an edit-distance [Melnik et al. 2002]

Ratio of errors over the total number of expected correspondences (true positives and false negatives)

Overall < F-Measure. Ranges [-1,1]

Intuitively: The effort required to fix a matching

$$\mathrm{Overall}(G,E) = \mathrm{Recall}(G,E) \times \left(2 - \frac{1}{\mathrm{Precision}(G,E)}\right) = 1 - \frac{|G \cup E| - |G \cap E|}{|E|} = 1 - \frac{|E - G| + |G - E|}{|E|}$$
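The equivalence of the Precision/Recall form and the set form of Overall can be checked on an example. This is a sketch under the assumption that E is non-empty and that G shares at least one correspondence with E (otherwise the first form is undefined); the function names are illustrative.

```python
from fractions import Fraction


def overall_from_sets(generated, expected):
    """1 - (|E - G| + |G - E|) / |E|: errors relative to the expected matches."""
    g, e = set(generated), set(expected)
    return 1 - Fraction(len(e - g) + len(g - e), len(e))


def overall_from_pr(generated, expected):
    """Recall * (2 - 1 / Precision); requires a non-empty intersection."""
    g, e = set(generated), set(expected)
    p = Fraction(len(g & e), len(g))
    r = Fraction(len(g & e), len(e))
    return r * (2 - 1 / p)


E = {"Book-Volume", "Person-Human", "Science-Essay"}
G = {"Product-Volume", "Person-Writer", "Science-Essay"}

score = overall_from_sets(G, E)
```

A negative score, as in this example, indicates that fixing the generated matching takes more work than producing the expected one from scratch.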

Strength-based Similarity

Takes into consideration the degree of confidence

$$\mathrm{SBS}(G,E) = \frac{2 \times \sum_{c \in G \cap E} |\mathrm{strength}_G(c) - \mathrm{strength}_E(c)|}{|G| + |E|}$$

All-or-nothing vs. Approximation

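[Figure: two taxonomies, one containing Product, DVD, Book, Science, Textbook, Popular, … and the other containing Volume, Essay, Politics, Biography, Pocket, …; candidate correspondences are labeled Expected, Close, or Far]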

Relaxed Precision & Recall

[Ehrig et al. 2005]

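$$\mathrm{Precision}_{a}(G,E) = \frac{\omega(G,E)}{|G|} \qquad \mathrm{Recall}_{a}(G,E) = \frac{\omega(G,E)}{|E|}$$

$$\omega(G,E) = \sum_{(a,r) \in M(G,E)} \sigma(a,r)$$

σ: correspondence similarity function

M(G,E): best matches with regard to σ

ω(G,E) is used instead of |G ∩ E| (written |A ∩ R| in the notation of [Ehrig et al. 2005])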

Weighted Harmonic Mean

Integrates multiple similarity measures

Given n similarity measures $M_i$, each with a weight $w_i$ such that $w_i \in [0,1]$ and $\sum_{i=0..n} w_i = 1$:

$$\mathrm{WHM}(G,E) = \frac{\prod_{i \in I} M_i(G,E)}{\sum_{i \in I} w_i \times M_i(G,E)}$$

Evaluating Mapping Systems

Efficiency

Mapping Generation Time

Data Translation Performance

Time

Parallelization

Human effort

Matching/Mapping Generation Time

Matching tools: little human intervention

Time has been measured for matching tasks in [Yatskevich et al. 2003], discussed in a recent benchmark [Köpcke et al. 2010], and also addressed in XBenchMatch [Duchateau et al. 2007]

Mapping tools, e.g., Clio, HePToX, Spicy, and STBenchmark do not elaborate on the issue

It is hard to measure time in a process in which human participation is part of the process

It includes the time to guide, verify and tune the mapping tool

Translation Time as Performance Metric

Time to execute the transformation script

Indirectly a quality metric on the generated mappings

Attention to avoid evaluation of the query engines

Same engine & Hardware

Need to be fair

Same target instance

Efficient Core generation

[Mecca et al. 2009] [ten Cate et al. 2009]

Time performance of ETL workflows

[Simitsis et al. 2009]

Time and Parallelization for ETL tools

Beyond time performance, other factors may be quite relevant, such as:

Workflow execution throughput (under failures or not)

Avg latency per tuple

Along with the above factors, it is important to increase parallelization:

Pipelining: tasks of the ETL workflow are executed in parallel by different processors, and the output can be consumed by the next task without waiting for the overall completion

Partitioning: data is partitioned and the transformation is applied to chunks of data
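The two ideas can be illustrated with a small single-process sketch. It is not a real ETL engine: Python generators only mimic pipelining (each tuple is handed to the next task as soon as it is produced), and the chunks are processed one after another rather than on different processors. The task names and the sample rows are hypothetical.

```python
def extract(rows):
    """First task: emit tuples one by one."""
    for row in rows:
        yield row


def clean(rows):
    """Second task: consumes each tuple as soon as the previous task yields it."""
    for row in rows:
        yield {**row, "name": row["name"].strip().title()}


def load(rows, target):
    """Last task: append transformed tuples to the target."""
    for row in rows:
        target.append(row)


def partitions(rows, size):
    """Split the input into chunks that could be transformed independently."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


source = [{"name": " alice "}, {"name": "BOB"}, {"name": "carol"}]

# Pipelining: the three tasks are chained, no intermediate result is materialized.
pipelined = []
load(clean(extract(source)), pipelined)

# Partitioning: the same workflow is applied to each chunk of the data.
partitioned = []
for chunk in partitions(source, 2):
    load(clean(extract(chunk)), partitioned)
```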

Human Effort In Matching Tools

The amount of work required to remove false positives and add false negatives

Since no human intervention takes place during the matching process

Human-spared resources [Duchateau 2009]

It counts the number of user interactions to obtain a 100% F-measure, i.e. the effort to remove false positives and add false negatives, and also to discover missing correspondences

Human Effort In Mapping Tools

Mapping tools can be seen as graphic tools

HCI study can be used

Comparing the GUI of the tools is hard

Schema mapping tools are a new technology; they are evolving and their interfaces keep improving daily

STBenchmark provides a first-cut measure on the effort required to implement a mapping scenario through the visual interface of a mapping system [Alexe et al. 2008]

A Simple Model

STBenchmark model

Cost of implementing a mapping scenario:

4*L + S + 2*D + 4*K

L – mouse dragging actions
S – single mouse clicks
D – double mouse clicks
K – keystrokes

[MacKenzie et al. ’91]: mouse dragging is slower and more error-prone than clicking

It is easier to make mistakes when typing

Scenario / System    A    B    C    D
Scenario 1           8   16   16    4
Scenario 2          72   78   65  110
Scenario 3          32   52   37    7
Scenario 4          66   81   65  200
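The cost model is a weighted sum of interaction counts and can be written as a one-line function. The sketch below uses the weights from the formula above; the interaction counts in the example are hypothetical and do not correspond to any cell of the table.

```python
def mapping_effort(drags, single_clicks, double_clicks, keystrokes):
    """Effort of implementing a scenario: 4*L + S + 2*D + 4*K."""
    return 4 * drags + single_clicks + 2 * double_clicks + 4 * keystrokes


# Hypothetical interaction log for one scenario on one system.
effort = mapping_effort(drags=1, single_clicks=2, double_clicks=3, keystrokes=4)
```

Because dragging and typing carry the highest weights, an interface that replaces them with clicks scores lower for the same scenario.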

Usability study for HePToX/Clio

• [Bonifati et al. 2010]

• Whereas Clio could implement more scenarios, HePToX required less effort than Clio in the majority of the scenarios it could implement.

• HePToX is click-and-drag oriented, while Clio is click-and-select oriented.

Evaluating Mapping Systems

Efficiency

Mapping Generation Time

Data Translation Performance

Time

Parallelization

Human effort

Effectiveness

Supported Scenarios

Quality

Generated Mappings

Target instance

Target schema

Enumerating Supported Scenarios

[Table: Scenario 1 to Scenario 7, … as rows and System A, System B, System C as columns; each cell marks whether the system supports the scenario]

Before talking about quality…

In order to check the quality of the generated mappings or the generated target instance:

Somebody has to provide you with the 'ideal' or 'desired' mappings or target instance

Normally, the user provides those

Or, alternatively, a 'core' target instance can be used

Or the set of mappings suggested by a benchmark

The focus here is not on who provides the above components, but on quality!

Quality of the Generated Mappings

Things are tricky for mapping tools:

Measuring the quality of generated mappings amounts to checking Query Containment or Query Equivalence

An NP-complete problem

It is preferable to check the quality of the results of a mapping:

i.e. the generated target instance

Current efforts try to characterize mappings in a quantitative way:

By their information loss, while inverting them

The notion of maximum extended recovery has been introduced in [Arenas2008, Fagin2009]

Quality of the Generated Target instance

A mapping is inherently a query

Same transformation, multiple ways.

Generation time

Readability

Mapping result

Whether it produces the respective results

Yes/No is too harsh

Similarity of queries …. too difficult to compute

Target instance is an alternative

Observe the result of the mappings

The Spicy Way

[Bonifati et al. 2008] – Spicy

Schema Matcher (internal or external)

Mapping Generator (internal or external)

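[Figure: the Spicy architecture. The source and target feed the Schema Matcher (match), which produces weighted lines, e.g. s.A → t.E, 0.87; s.C → t.F, 0.76; s.D → t.F, 0.98. After line selection, the Mapping Generator (mapping generation) produces candidate mappings; structural analysis and mapping verification yield ranked mappings, e.g. mapping 1, 0.97; mapping 2, 0.87; mapping 3, 0.72]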

The Spicy Way

Structural Analysis

uses the model of electrical circuits

uses sampling

uses a set of features on samples, e.g.:

length and character distribution

entropy of values

density of null values etc.

Structural Analysis

For an attribute A with sample sample(A), the atomic piece of circuit is the following

Trees Into Circuits

Circuits can be obtained for nested structures

The Spicy way: lessons learned

Spicy is a system for schema mapping verification

using comparison of instances to gauge the quality

iterating the mapping search algorithm until mapping quality is acceptable

Open issues:

Special element types (images, full-text), complex models (e.g. ontologies) and more complex classes of lines

Instance comparison techniques other than structural analysis may be needed

What if we look at the efficiency of transformations (e.g. core computation) and their quality at the same time?

Quality of target instance for ETL

The quality of target instances is also important for ETL systems

Target instances can be characterized in terms of:

Ease of maintenance [Simitsis et al. 2009]

Resilience to failures

Data freshness

Compliance to business rules

Data examples

Since the expected/generated target instance may be large and the generated mappings may be numerous, samples of the expected/generated target instance can be used in turn

The importance of data examples goes back to [Yan et al. 2001]

Each mapping is a connected graph G = (N, E), where N is the set of nodes or source schema relations, and E represents conjunctions of join predicates among the nodes

Data associations (subgraphs of G) can be derived as relations that contain the maximum number of attributes that can be joined along E

Data associations are leveraged to understand what has to be included in a mapping

Routes

Source-to-target dependencies, Σst:

m1: CardHolders(cn,l,s,n) -> ∃L (Accounts(cn,L,s), Clients(s,n))

m2: Dependents(an,s,n) -> Clients(s,n)

Target dependencies, Σt:

m3: Clients(s,n) -> ∃A,L (Accounts(A,L,s))

S: MANHATTAN CREDIT
  CardHolders(cardNo, limit, ssn, name)
  Dependents(accNo, ssn, name)

T: FARGO FINANCE
  Accounts(accNo, creditLine, accHolder)
  Clients(ssn, name)
  fk1

[Figure: m1 connects CardHolders to Accounts and Clients, m2 connects Dependents to Clients, m3 connects Clients to Accounts]

Source instance I:
  CardHolders: (123, $15K, ID1, Alice)
  Dependents: (123, ID2, Bob)

Target instance J (a solution for I under the schema mapping):
  Accounts: (123, L1, ID1), (A2, L2, ID2)
  Clients: (ID1, Alice), (ID2, Bob)

Routes allow the inspection of the flow of the mappings [Chiticariu and Tan, 2006]
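How the target instance J arises from I can be sketched with a hand-coded application of the three dependencies. This is a minimal illustration, not the SPIDER or Routes implementation: the dependencies are hard-wired rather than parsed, uppercase variables are treated as existential and filled with fresh labeled nulls, and the null naming scheme (`L1`, `A2`, `L2`) is an assumption chosen to match the example.

```python
from itertools import count


def chase(card_holders, dependents):
    """Apply m1 and m2 to the source, then m3 on the target, with fresh nulls."""
    fresh = count(1)
    accounts, clients = set(), set()

    # m1: CardHolders(cn,l,s,n) -> Accounts(cn,L,s), Clients(s,n); L is a null.
    for card_no, _limit, ssn, name in card_holders:
        i = next(fresh)
        accounts.add((card_no, f"L{i}", ssn))
        clients.add((ssn, name))

    # m2: Dependents(an,s,n) -> Clients(s,n); the account number is not copied.
    for _acc_no, ssn, name in dependents:
        clients.add((ssn, name))

    # m3: Clients(s,n) -> Accounts(A,L,s); fire only if no such account exists.
    for ssn, _name in sorted(clients):
        if not any(holder == ssn for _, _, holder in accounts):
            i = next(fresh)
            accounts.add((f"A{i}", f"L{i}", ssn))

    return accounts, clients


card_holders = [("123", "$15K", "ID1", "Alice")]
dependents = [("123", "ID2", "Bob")]

accounts, clients = chase(card_holders, dependents)
```

The nulls make the two symptoms visible: Alice's limit is replaced by `L1` because m1 does not carry `l` into Accounts, and Bob's account number becomes `A2` because only m3 creates his Accounts tuple.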

Example Debugging Scenario 1

Unknown credit limit?

15K is not copied over to the target

Source instance I:
  CardHolders: (123, $15K, ID1, Alice)
  Dependents: (123, ID2, Bob)

Target instance J:
  Accounts: (123, L1, ID1), (A2, L2, ID2)
  Clients: (ID1, Alice), (ID2, Bob)

A route for the Accounts tuple (123, L1, ID1):

CardHolders(123, $15K, ID1, Alice) --m1--> Accounts(123, L1, ID1), Clients(ID1, Alice)

Example Debugging Scenario 2

Unknown account number?

123 is not copied over to the target as Bob’s account number

Source instance I:
  CardHolders: (123, $15K, ID1, Alice)
  Dependents: (123, ID2, Bob)

Target instance J:
  Accounts: (123, L1, ID1), (A2, L2, ID2)
  Clients: (ID1, Alice), (ID2, Bob)

Route for the Accounts tuple with accNo A2:

Dependents(123, ID2, Bob) --m2--> Clients(ID2, Bob) --m3--> Accounts(A2, L2, ID2)

The SPIDER System

Based on Routes

Data Examples as Evaluation Tools

The use of data examples as evaluation tools is underway [Chiticariu et al. 2008]

Examples are used to understand and refine mappings towards the desired specification

Universal data examples are data examples derived from universal solutions [Alexe et al. 2010]

If S and T contain only unary relations, with only Σst, a mapping is characterized by a set of

Positive data examples (I, J) such that (I,J) ⊨ Σ

Negative data examples (I, J) such that (I,J) ⊭ Σ

Muse

[Chiticariu et al. 2008] Muse

Builds ad hoc probes for each attribute, such that a small source example is built and two differentiating target examples are obtained

After that, Muse asks the designer 'Which target instance looks correct?'

This eliminates the mappings that lead to the unchosen target instance

The result is:

A set of correct homomorphically equivalent target instances

It also allows the design of Skolem functions, not addressed in Routes

MUSE Workflow

[Figure: MUSE workflow. Inputs: the mapping specification and a real source instance (if available). Generation: MUSE produces real/synthetic data examples. Examination: the mapping designer inspects the data examples and gives essentially Yes/No answers. Refinement: grouping semantics and disambiguation]

Example

CompDB: Rcd
  Companies: Set of
    Company: Rcd
      cbranch
      cname
      location
  Projects: Set of
    Project: Rcd
      pid
      pname
      cbranch
      manager
  Employees: Set of
    Employee: Rcd
      eid
      ename
      contact

OrgDB: Rcd
  Orgs: Set of
    Org: Rcd
      oname
      Projects: Set of
        Project: Rcd
          pname
          manager
  Employees: Set of
    Employee: Rcd
      eid
      ename

Declarative Mapping

for
  c in CompDB.Companies
  p in CompDB.Projects
  e in CompDB.Employees
satisfy
  p.cbranch = c.cbranch
  e.eid = p.manager
exists
  o in OrgDB.Orgs
  p1 in o.Projects
  e1 in OrgDB.Employees
satisfy
  p1.manager = e1.eid
where
  c.cname = o.oname
  e.eid = e1.eid
  e.ename = e1.ename
  p.pname = p1.pname

Example

Grouping Projects:

Example source:

Companies (cbranch, cname, location)
  Redmond     Microsoft   USA
  S. Valley   Microsoft   USA

Projects (pid, pname, cbranch, manager)
  P1   DB    Redmond     e4
  P2   Web   S. Valley   e5

Group by cbranch:

Orgs
  Microsoft
    Projects: DB e4
  Microsoft
    Projects: Web e5

Group by cname:

Orgs
  Microsoft
    Projects: DB e4, Web e5
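The effect of the grouping choice can be reproduced with a short sketch. It is an illustration only, not the Muse implementation: the helper `group_projects` is hypothetical, and the nested Projects set of each Org is identified by a tuple of the chosen company attributes, which plays the role of the Skolem (grouping) function arguments.

```python
def group_projects(companies, projects, group_by):
    """Nest projects under organizations, one Projects set per grouping key."""
    orgs = {}
    for company in companies:
        for project in projects:
            if project["cbranch"] != company["cbranch"]:
                continue  # join condition p.cbranch = c.cbranch
            # Arguments of the grouping function for this Projects set.
            sk_args = tuple(company[attr] for attr in group_by)
            key = (company["cname"], sk_args)
            orgs.setdefault(key, set()).add((project["pname"], project["manager"]))
    return orgs


companies = [
    {"cbranch": "Redmond", "cname": "Microsoft", "location": "USA"},
    {"cbranch": "S. Valley", "cname": "Microsoft", "location": "USA"},
]
projects = [
    {"pid": "P1", "pname": "DB", "cbranch": "Redmond", "manager": "e4"},
    {"pid": "P2", "pname": "Web", "cbranch": "S. Valley", "manager": "e5"},
]

by_branch = group_projects(companies, projects, ["cbranch"])
by_name = group_projects(companies, projects, ["cname"])
```

Grouping by `cbranch` yields two Microsoft entries with one project each, while grouping by `cname` yields a single Microsoft entry holding both projects, as in the two target instances above.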

Example

Declarative Mapping

for
  c in CompDB.Companies
  p in CompDB.Projects
  e in CompDB.Employees
satisfy
  p.cbranch = c.cbranch
  e.eid = p.manager
exists
  o in OrgDB.Orgs
  p1 in o.Projects
  e1 in OrgDB.Employees
satisfy
  p1.manager = e1.eid
where
  c.cname = o.oname
  e.eid = e1.eid
  e.ename = e1.ename
  p.pname = p1.pname
  o.Projects = SKProjects(c.cbranch, c.cname, c.location)

SKProjects is the grouping function

But group by what subset of {cbranch, cname, location}?



Muse-G: Grouping Semantics Design

Goal: infer a grouping function that has the same effect as the one intended by the designer

Muse-G probes each possible grouping attribute: start with cbranch

Example source

Companies (cbranch, cname, location)
  Redmond    Microsoft  USA
  S. Valley  Microsoft  USA

Projects (pid, pname, cbranch, manager)
  P1  DB   Redmond    e4
  P2  Web  S. Valley  e5

Employees (eid, ename, contact)
  e4  John  x234
  e5  Anna  x888

Target Scenario 1: group by cbranch

Orgs
  Microsoft
    Projects [SK(Redmond, y)]: DB e4
  Microsoft
    Projects [SK(S. Valley, y)]: Web e5
Employees
  e4  John
  e5  Anna

Target Scenario 2: do not group by cbranch

Orgs
  Microsoft
    Projects [SK(y)]: DB e4, Web e5
Employees
  e4  John
  e5  Anna

y is a subset of {Microsoft, USA}


Muse-G: Second Question

The next probed attribute is cname

Example source

Companies (cbranch, cname, location)
  S. Valley  Microsoft  USA
  Mt. View   Google     USA

Projects (pid, pname, cbranch, manager)
  P1  DB   S. Valley  e4
  P4  Web  Mt. View   e6

Employees (eid, ename, contact)
  e4  John  x234
  e6  Kat   x331

Target Scenario 1: group by cname

Orgs
  Microsoft
    Projects [SK(Microsoft, y)]: DB e4
  Google
    Projects [SK(Google, y)]: Web e6
Employees
  e4  John
  e6  Kat

Target Scenario 2: do not group by cname

Orgs
  Microsoft
    Projects [SK(y)]: DB e4, Web e6
  Google
    Projects [SK(y)]: DB e4, Web e6
Employees
  e4  John
  e6  Kat

y is a subset of {USA}

The wizard continues to probe the remaining possible grouping attributes
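
The probing dialogue can be sketched in Python as follows. This is an illustration under stated assumptions, not the Muse-G implementation: the names `exchange` and `infer_grouping` are hypothetical, the designer is simulated by an `intended` attribute list instead of a human choosing between the two target scenarios, and a probe whose two scenarios coincide is simply skipped. Only the two probes shown on the slides are included; probing location would need a further example source.

```python
from collections import defaultdict


def exchange(companies, projects, group_by):
    """Build Orgs as [(oname, [(pname, manager), ...])]. Each nested
    project set is identified by the values of the `group_by`
    attributes, mimicking SKProjects(<group_by values>)."""
    sets = defaultdict(list)  # set identifier -> nested projects
    orgs = []                 # (oname, set identifier), first-seen order
    for c in companies:
        set_id = tuple(c[a] for a in group_by)
        if (c["cname"], set_id) not in orgs:
            orgs.append((c["cname"], set_id))
        for p in projects:
            item = (p["pname"], p["manager"])
            if p["cbranch"] == c["cbranch"] and item not in sets[set_id]:
                sets[set_id].append(item)
    return [(oname, sets[set_id]) for oname, set_id in orgs]


def company(cbranch, cname, location):
    return {"cbranch": cbranch, "cname": cname, "location": location}


def project(pname, cbranch, manager):
    return {"pname": pname, "cbranch": cbranch, "manager": manager}


# One example source per probed attribute, copied from the slides.
probes = {
    "cbranch": (
        [company("Redmond", "Microsoft", "USA"),
         company("S. Valley", "Microsoft", "USA")],
        [project("DB", "Redmond", "e4"), project("Web", "S. Valley", "e5")],
    ),
    "cname": (
        [company("S. Valley", "Microsoft", "USA"),
         company("Mt. View", "Google", "USA")],
        [project("DB", "S. Valley", "e4"), project("Web", "Mt. View", "e6")],
    ),
}


def infer_grouping(probes, intended):
    """Probe one attribute at a time and keep those the (simulated)
    designer wants to group by."""
    accepted = []
    for attr, (companies, projects) in probes.items():
        grouped = exchange(companies, projects, accepted + [attr])
        ungrouped = exchange(companies, projects, accepted)
        if grouped == ungrouped:
            continue  # this example cannot tell the two scenarios apart
        # Simulated answer: the scenario matching the intended target.
        if exchange(companies, projects, intended) == grouped:
            accepted.append(attr)
    return accepted


print(infer_grouping(probes, intended=["cname"]))
```

Each probe costs the designer one choice between two small targets, so the number of questions grows with the number of candidate attributes rather than with the number of attribute subsets.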


Quality of the Generated Target Schema

When mappings are used in schema integration:

The quality of the generated integrated schema is measured with respect to the intended integrated schema

Three metrics have been conceived:

Completeness [Batista07]:

Minimality [Batista07]:


An example

Number of common elements = 6; α = 2

Completeness = 6/7; Minimality = 5/7

Structurality = (1 + 1 + 1 + 0 + 1/4 + 1/2)/6

Proximity = 0.73

[Figure: two schema trees. Generated schema: elements A, B, C, D, E, F, X, Z. Intended schema: elements A, B, C, D, E, F, G.]
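
The numbers on this slide can be checked with a short Python sketch. The formulas below are assumptions inferred from the slide's own figures, not taken from [Batista07]: completeness as common elements over intended elements, minimality as 1 minus α over intended elements (reading α as the number of extra generated elements), structurality as the mean of six per-element terms read as 1, 1, 1, 0, 1/4, 1/2, and proximity as the unweighted mean of the three metrics. The element sets are likewise read off the figure labels.

```python
from fractions import Fraction

# Element sets as read from the figure (assumption).
generated = set("ABCDEFXZ")
intended = set("ABCDEFG")

common = generated & intended    # 6 common elements
alpha = len(generated - intended)  # 2 extra elements: X and Z

completeness = Fraction(len(common), len(intended))
minimality = 1 - Fraction(alpha, len(intended))

# Per-element structural scores as listed on the slide (assumed reading).
terms = [Fraction(1), Fraction(1), Fraction(1), Fraction(0),
         Fraction(1, 4), Fraction(1, 2)]
structurality = sum(terms) / len(common)

# Assumed combination: plain average of the three metrics.
proximity = (completeness + minimality + structurality) / 3

print(completeness, minimality, structurality)
print(proximity, round(float(proximity), 2))  # 41/56, about 0.73
```

Under these assumptions the average reproduces the slide's proximity of 0.73, which supports reading the structurality numerator as six terms, one per common element.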


Talk Outline

Introduction

Matching and mapping: techniques & tools

Benchmarks and evaluation principles

Challenges in evaluating matching & mapping

Using real-world scenarios for evaluation

Generating synthetic evaluation scenarios

Measuring efficiency and effectiveness

Conclusions and References


What We Talked About

The importance of matching and mapping

The matching and mapping tasks

Existing tools and their functionality

What a benchmark is

Generic evaluation principles

Why a benchmark for mapping systems is a challenge

Finding real-world evaluation scenarios

Generating synthetic evaluation scenarios

Metrics for effectiveness, efficiency and quality


Main Sources

[Euzenat et al. 2007] J. Euzenat and P. Shvaiko, "Ontology matching", Springer-Verlag, 2007

[Bellahsene et al. 2011] Z. Bellahsene, A. Bonifati and E. Rahm, "Schema Matching and Mapping", Springer-Verlag, 2011


Thank you!

Are you taking questions?

Of course! Go ahead.


Partial List of References

[Alexe et al. 2008] Alexe B, Chiticariu L, Miller RJ, Tan WC (2008) “Muse: Mapping Understanding and deSign by Example”. In: ICDE, pp 10-19

[Bonifati et al. 2008] Bonifati A, Mecca G, Pappalardo A, Raunich S, Summa G (2008) “Schema Mapping Verification: The Spicy Way”. In: EDBT, pp 85-96

[Lenzerini 2002] Lenzerini M (2002) “Data Integration: A Theoretical Perspective”. In: PODS, pp 233-246

[Miller et al. 2000] Miller RJ, Haas LM, Hernandez MA (2000) “Schema Mapping as Query Discovery”. In: VLDB, pp 77-88

[Rahm et al. 2001] Rahm E, Bernstein PA (2001) “A survey of approaches to automatic schema matching”. VLDB Journal 10(4):334-350

[Velegrakis 2005] Velegrakis Y (2005) “Managing Schema Mappings in Highly Heterogeneous Environments”. PhD thesis, University of Toronto


Partial List of References (cont’d)

[Altova 2008] Altova (2008) MapForce. http://www.altova.com

[Batini et al. 1986] Batini C, Lenzerini M, Navathe SB (1986) A Comparative Analysis of Methodologies for Database Schema Integration. ACM Comp. Surv. 18(4):323-364

[Bernstein et al. 2007] Bernstein PA, Melnik S (2007) Model management 2.0: manipulating richer mappings. In: SIGMOD, pp 1–12

[Do et al. 2002] Do HH, Rahm E (2002) COMA - A System for Flexible Combination of Schema Matching Approaches. In: VLDB, pp 610–621

[Euzenat et al. 2007] Euzenat J, Shvaiko P (2007) Ontology matching. Springer Verlag, Heidelberg

[Fagin et al. 2005] Fagin R, Kolaitis PG, Miller RJ, Popa L (2005) Data exchange: semantics and query answering. Theoretical Computer Science 336(1):89-124

[Lerner 2000] Lerner BS (2000) A Model for Compound Type Changes Encountered in Schema Evolution. ACM TODS 25(1):83–127

[Popa et al. 2002] Popa L, Velegrakis Y, Miller RJ, Hernandez MA, Fagin R (2002) Translating Web Data. In: VLDB, pp 598–609


Partial List of References (cont’d)

[Alexe2010] Alexe B, Kolaitis PG, Tan WC (2010) Characterizing Schema Mappings via Data Examples. In: PODS

[Aumueller2006] Aumueller D, Do HH, Massmann S, Rahm E (2005) Schema and ontology matching with COMA++. In: SIGMOD, pp 906–908

[Bonifati et al. 2010] Angela Bonifati, Elaine Qing Chang, Terence Ho, Laks V. S. Lakshmanan, Rachel Pottinger, Yongik Chung: Schema mapping and query Translation in heterogeneous P2P XML databases. VLDB J. 19(2): 231-256 (2010)

[Chiticariu et al. 2006] Chiticariu L, Tan WC (2006) Debugging Schema Mappings with Routes. In: VLDB, pp 79–90

[Dhamankar et al. 2004] Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Y. Halevy, Pedro Domingos: iMAP: Discovering Complex Mappings between Database Schemas. SIGMOD Conference 2004: 383-394

[Doan et al. 2001] Doan A, Domingos P, Halevy AY (2001) Reconciling schemas of disparate data sources: A machine-learning approach. In: SIGMOD, pp 509–520


Partial List of References (cont’d)

[Fletcher et al. 2006] Fletcher GHL, Wyss CM (2006) Data Mapping as Search. In: EDBT, pp 95–111

[Giunchiglia et al. 2004] Giunchiglia F, Shvaiko P, Yatskevich M (2004) S-Match: an Algorithm and an Implementation of Semantic Matching. In: ESWS, pp 61–75

[Madhavan et al. 2001] Madhavan J, Bernstein PA, Rahm E (2001) Generic Schema Matching with Cupid. In: VLDB, pp 49–58.

[Naumann et al. 2002] Naumann F, Ho CT, Tian X, Haas LM, Megiddo N (2002) Attribute Classification Using Feature Analysis. In: ICDE, p 271

[Shu et al. 1977] N. C. Shu, B. C. Housel, R. W. Taylor, S. P. Ghosh, and V. Y. Lum. EXPRESS: A Data EXtraction, Processing and REstructuring System. ACM Transactions on Database Systems (TODS), 2(2):134–174, 1977.