CSE 636 Data Integration Schema Matching Cupid Fall 2006.


CSE 636 Data Integration

Schema Matching

Cupid

Fall 2006

2

Virtual Integration Architecture

[Figure: mediator-based architecture. Labels: End User, Query, Result, Mediator, Global Schema, Wrapper (twice), Local Schema (twice), Data Source (twice), XML, Web Services.]

Design-Time
• Mediation Language
• Schema Matching

Run-Time
• Query Reformulation
• Optimization & Execution

3

Schema Heterogeneity

Independently created schemas might be modeling similar information in slightly different ways.

[Figure: three schema trees. Element labels:
DB1: ugrad *, ugradID, name, enrollment *, courseID, ugradID, grade
DB2: course *, courseID, title, student *, studentID, name, type, evaluation
DB3: student *, studentID, name, type, course *, courseID, title ?, type, letter]

4

Schema Heterogeneity

[Figure: the DB1, DB2, and DB3 schema trees from the previous slide.]

• Similar entities represented
• Dissimilar structures (inverted nesting)
• Different element names for similar data values
• Similar element names for different data values

5

Schema Matching vs. Schema Mapping

• GAV and LAV are schema mapping languages
• Mappings:
  – set of queries
  – associations + semantics
• Match:
  – set of associations only
• Schema Matching:
  – Identifying associations
  – First step towards constructing mappings

6

Schema Matching vs. Schema Mapping

[Figure: the DB1 and DB3 schema trees, annotated with the labels "Associations" and "Semantics".]

for $s1 in DB3/student
where $s1/type = 'UGRAD'
return
  <DB1>
    <ugrad>
      <ugradID>{$s1/studentID}</ugradID>
      <name>{$s1/name}</name>
    </ugrad>
  </DB1>

LAV Mapping: DB1 Q(DB3)
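The difference between a match and a mapping can be sketched in Python. This is only an illustration: the path strings, the `associations` set, the `db3_to_db1` function, and the sample records are hypothetical names and data, with plain dictionaries standing in for XML instances.

```python
# A match is only a set of associations between schema paths (no semantics).
associations = {
    ("DB3/student/studentID", "DB1/ugrad/ugradID"),
    ("DB3/student/name", "DB1/ugrad/name"),
}


def db3_to_db1(db3):
    """A mapping adds semantics: an executable query from DB3 to DB1.

    Mirrors the XQuery on this slide: keep students whose type is 'UGRAD'
    and expose studentID under the name ugradID.
    """
    return {
        "ugrad": [
            {"ugradID": s["studentID"], "name": s["name"]}
            for s in db3["student"]
            if s["type"] == "UGRAD"
        ]
    }


# Hypothetical instance data, for illustration only.
db3 = {
    "student": [
        {"studentID": 1, "name": "Ann", "type": "UGRAD"},
        {"studentID": 2, "name": "Bob", "type": "GRAD"},
    ]
}
print(db3_to_db1(db3))
```

The associations alone say which elements correspond; only the function states the selection condition and how values are restructured.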

7

The Problem of Schema Matching

Input

• Schemas S1 and S2

• Possibly data instances for S1 and S2

• Background knowledge
  – thesauri
  – validated matches
  – standard schemas
  – reference instances
  – ontologies
  – constraints (keys, data types etc.)

Output

• Associations between S1 and S2

Goal

• Schema matching tools with significant automated support

8

Schema Matching

How is the match result expressed?

[Figure: the DB2 and DB3 schema trees from the Schema Heterogeneity slide.]

• Pairs of paths
• Lists of paths
• Schema names

9

Schema Matching

What do we match?

• Depends on the queries we want to ask
  1. Elements in isolation (leaves in particular)
  2. Substructures
  3. Whole schemas

10

Motivation

• Important component in many applications
  – Data Integration
  – Data Migration
  – E-Commerce
• Model Management [Bernstein, Halevy, Pottinger ’00]
  – Algebra for manipulating models and mappings
  – Match, Merge, Compose …

11

Problems

• Minimize user involvement (semi-automatic)
• Data model independent matching (generic)
• Schema matching is a hard problem
  – Naming and structural differences in schemas
  – Similar, but non-identical concepts modeled
  – Multiple data models: SQL DDL, XML, ODMG…

12

Schema Matching Approaches

How to match?

• Individual matchers
  – Schema-based
    Per-Element
      Linguistic: names, descriptions
      Constraint-based: types, keys
    Structural
      Constraint-based: graph matching
  – Content-based
    Per-Element
      Linguistic: IR (word frequencies, key terms)
      Constraint-based: value pattern and ranges
• Combined matchers
  – Hybrid
  – Composite: manual composition, automatic composition

Taxonomy-based survey: Rahm and Bernstein, VLDB J, 2001

13

Cupid

[Figure: the matcher taxonomy from the previous slide, repeated to position Cupid within it.]

Madhavan, Bernstein and Rahm, VLDB, 2001

14

Cupid Example

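[Figure: two purchase-order schema trees to be matched. Element labels:
Schema 1: PO, POLines, Item, Line, Qty, UoM, POShipTo, City, Street
Schema 2: PurchaseOrder, Items, Item, ItemNumber, Quantity, UnitofMeasure, DeliverTo, Address, City, Street
The label Name appears twice.]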

15

Cupid Architecture

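[Figure: Cupid pipeline. Schema 1 and Schema 2 feed Linguistic Matching, which consults a Thesaurus and produces LSIM. Structure Matching then produces SSIM and WSIM, and Generate Mapping produces the Output Mapping.]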

16

Linguistic Matching

• Heuristic name matching
  – Tokenization of names
      POOrderNum → PO, Order, Num
  – Expansion of short-forms, acronyms
      PO → Purchase, Order; Num → Number
  – Clustering of schema elements based on keywords and data-types
      Street, City, POAddress → Address
  – Thesaurus of synonyms, hypernyms, acronyms
  – Linguistic Similarity coefficient (LSIM) in [0,1]
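The tokenization, expansion, and thesaurus steps above can be sketched as follows. This is a minimal illustration, not Cupid's implementation: the `EXPANSIONS` and `SYNONYMS` tables, their scores, and the symmetric best-match average used for `lsim` are all assumptions made for the example.

```python
import re

# Hypothetical short-form table and thesaurus (assumed entries and scores).
EXPANSIONS = {"po": ["purchase", "order"], "num": ["number"], "qty": ["quantity"]}
SYNONYMS = {frozenset({"ship", "deliver"}): 0.8, frozenset({"bill", "invoice"}): 0.8}


def tokenize(name):
    """Split a name at case changes, e.g. 'POOrderNum' into PO, Order, Num."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", name)


def normalize(name):
    """Tokenize, lowercase, and expand known short forms."""
    tokens = []
    for tok in tokenize(name):
        tokens.extend(EXPANSIONS.get(tok.lower(), [tok.lower()]))
    return tokens


def token_sim(a, b):
    """1.0 for identical tokens, the thesaurus score for synonyms, else 0.0."""
    if a == b:
        return 1.0
    return SYNONYMS.get(frozenset({a, b}), 0.0)


def lsim(name1, name2):
    """Linguistic similarity in [0, 1]: average best match in both directions."""
    t1, t2 = normalize(name1), normalize(name2)
    if not t1 or not t2:
        return 0.0
    best1 = sum(max(token_sim(a, b) for b in t2) for a in t1)
    best2 = sum(max(token_sim(a, b) for a in t1) for b in t2)
    return (best1 + best2) / (len(t1) + len(t2))


print(tokenize("POOrderNum"))
print(lsim("Qty", "Quantity"), lsim("POShipTo", "DeliverTo"))
```

Averaging in both directions keeps the score symmetric, so a short name that is fully covered by a long one does not automatically score 1.0.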

17

Structure Matching

[Figure: the two purchase-order schema trees from the Cupid example (PO and PurchaseOrder).]

18

Structure Matching: Mutually Reinforcing Similarity

[Figure: the subtrees PO, POLines, Item (Line, Qty, UoM) and PurchaseOrder, Items, Item (ItemNum, Quantity, UnitofMeasure). Annotations: "WSIM > th_high" (twice) and "SSIM++" (three times).]

19

Structure Matching: Context Dependent Disambiguation

[Figure: PO with POShipTo (Street, City) and POBillTo (Street, City), next to PurchaseOrder with InvoiceTo and DeliverTo, each holding an Address (Street, City). Annotations: "SSIM++" (twice) and "SSIM--" (once).]
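The "SSIM++" and "SSIM--" annotations on the two Structure Matching slides can be read as an update rule: when two inner nodes turn out to be very similar, the structural similarity of the leaf pairs below them is raised, and when they are clearly dissimilar it is lowered. The sketch below assumes a multiplicative update; the function name `reinforce`, the lower threshold `th_low`, the factors `c_inc` and `c_dec`, and all numeric values are illustrative assumptions, not taken from the slides.

```python
def reinforce(ssim, leaf_pairs, wsim_value,
              th_high=0.6, th_low=0.35, c_inc=1.2, c_dec=0.9):
    """Adjust SSIM of leaf pairs after two inner nodes have been compared.

    ssim       : dict {(leaf_in_schema1, leaf_in_schema2): similarity}
    leaf_pairs : leaf pairs lying under the two inner nodes just compared
    wsim_value : weighted similarity of those two inner nodes
    """
    if wsim_value > th_high:
        factor = c_inc          # "SSIM++": similar context supports the leaves
    elif wsim_value < th_low:
        factor = c_dec          # "SSIM--": dissimilar context penalizes them
    else:
        return ssim             # inconclusive context, leave values unchanged
    for pair in leaf_pairs:
        ssim[pair] = min(1.0, ssim[pair] * factor)
    return ssim


ship = ("POShipTo/City", "DeliverTo/Address/City")
bill = ("POShipTo/City", "InvoiceTo/Address/City")
ssim = {ship: 0.5, bill: 0.5}
reinforce(ssim, [ship], wsim_value=0.8)   # POShipTo vs DeliverTo: similar
reinforce(ssim, [bill], wsim_value=0.2)   # POShipTo vs InvoiceTo: dissimilar
print(ssim)
```

Both City pairs start with the same score; only the context of their ancestors separates them, which is the disambiguation effect the second slide illustrates.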

20

Intuition

• Atomic elements (leaves) are similar if
  – They are linguistically and data-type similar
  – Their ancestors are similar
• Compound elements (non-leaf) are similar if
  – They are linguistically similar
  – The subtrees rooted at the elements are similar
• Mutually recursive
  – Leaves determine internal node similarity
  – Similarity of internal nodes leads to an increase in leaf similarity

21

Structure Match Details

• Subtrees are similar if
  – Immediate children are similar
  – Leaf sets are similar
• Subtree Similarity (nodes s and t)
  – Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t and vice-versa
  – Less sensitive to variation in intermediate structure
• Pruning the number of comparisons
  – Elements must have comparable number of leaves
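The subtree similarity and the pruning rule above can be sketched as follows. The acceptance threshold `th_accept`, the pruning `ratio`, the function names, and the sample scores are assumptions for illustration; the slides only state that the measure is a fraction of mappable leaves and that leaf counts must be comparable.

```python
def subtree_ssim(leaves_s, leaves_t, leaf_wsim, th_accept=0.5):
    """Fraction of leaves in either subtree with a good match in the other.

    leaf_wsim: dict {(leaf_of_s, leaf_of_t): similarity}
    """
    def has_link(pairs):
        return any(leaf_wsim.get(p, 0.0) >= th_accept for p in pairs)

    linked_s = sum(has_link([(x, y) for y in leaves_t]) for x in leaves_s)
    linked_t = sum(has_link([(x, y) for x in leaves_s]) for y in leaves_t)
    total = len(leaves_s) + len(leaves_t)
    return (linked_s + linked_t) / total if total else 0.0


def comparable(n_s, n_t, ratio=2.0):
    """Pruning test: compare only subtrees whose leaf counts are close."""
    if n_s == 0 or n_t == 0:
        return False
    return max(n_s, n_t) <= ratio * min(n_s, n_t)


item_1 = ["Line", "Qty", "UoM"]
item_2 = ["ItemNumber", "Quantity", "UnitofMeasure"]
scores = {
    ("Qty", "Quantity"): 0.9,
    ("UoM", "UnitofMeasure"): 0.8,
    ("Line", "ItemNumber"): 0.3,   # below the assumed threshold
}
if comparable(len(item_1), len(item_2)):
    print(subtree_ssim(item_1, item_2, scores))
```

Because only leaves are counted, inserting or removing an intermediate element such as Address does not change the score, which matches the "less sensitive to variation in intermediate structure" point.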

22

Referential Integrity

[Figure: Schema A contains Purchase Order (Order ID, Product Name, Customer ID) and Customer (Customer ID, Name, Address), with the join node Order-Customer-fk. Schema B contains Customer-Purchase-Order.]

• Join nodes added to the schema tree for each referential integrity constraint

• Views can be similarly used
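One way to picture the join-node idea is to add, for every foreign key, an extra node to the schema tree that groups the columns of the two tables it connects. The sketch below is an assumption-laden illustration: the dictionary representation, the `add_join_node` helper, and the choice to give the join node the columns of both tables as children are not spelled out on the slide.

```python
def add_join_node(schema, fk_name, referencing, referenced):
    """Return a copy of the schema with one join node for a foreign key.

    schema: dict {table_name: [column_name, ...]}
    Assumption: the join node's children are the qualified columns of the
    referencing table followed by those of the referenced table.
    """
    augmented = {table: list(cols) for table, cols in schema.items()}
    augmented[fk_name] = (
        [f"{referencing}.{c}" for c in schema[referencing]]
        + [f"{referenced}.{c}" for c in schema[referenced]]
    )
    return augmented


schema_a = {
    "Purchase Order": ["Order ID", "Product Name", "Customer ID"],
    "Customer": ["Customer ID", "Name", "Address"],
}
augmented_a = add_join_node(schema_a, "Order-Customer-fk",
                            "Purchase Order", "Customer")
print(augmented_a["Order-Customer-fk"])
```

With the join node in place, a structure matcher can compare Order-Customer-fk in Schema A against the single Customer-Purchase-Order element in Schema B, since both now group order and customer attributes under one node.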

23

Cupid Architecture

[Figure: the Cupid architecture diagram again, annotated with sample similarity values.]

Linguistic Similarity (LSIM)

Element 1         Element 2      LSIM
InvoiceTo         BillTo         0.7
UoM               UnitMeasure    0.9
City              City           1.0

Structural (SSIM), Weighted (WSIM) Similarity

Element 1         Element 2      SSIM   WSIM
InvoiceTo         BillTo         0.8    0.7
UoM               UnitMeasure    0.7    0.8
InvoiceTo/City    BillTo/City    0.8    0.9
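The slides do not spell out how the weighted similarity is computed. The sketch below assumes the convex combination of structural and linguistic similarity described in the Cupid paper (reference 1); the function name `wsim` and the weight value `w_struct = 0.5` are illustrative assumptions.

```python
def wsim(ssim, lsim, w_struct=0.5):
    """Weighted similarity: w_struct * SSIM + (1 - w_struct) * LSIM."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must lie in [0, 1]")
    return w_struct * ssim + (1.0 - w_struct) * lsim


# UoM / UnitMeasure values from this slide: SSIM 0.7, LSIM 0.9.
print(round(wsim(0.7, 0.9), 2))
```

With the assumed weight of 0.5, the UoM / UnitMeasure values shown on this slide (SSIM 0.7, LSIM 0.9) combine to 0.8. A larger `w_struct` lets structure dominate, which matters for elements such as City whose names match everywhere.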

24

Mapping Generation

• Individual mapping elements computed from WSIM values:

– Consider only mapping pairs that have WSIM greater than threshold

– For each element of target find most similar source element

– Mappings that are not accepted but have high similarity are also returned, to help the user modify the map
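The three steps above can be sketched as a small function. The threshold value, the function name `generate_mapping`, and the sample paths and scores are assumptions for illustration only.

```python
def generate_mapping(wsim, th_accept=0.5):
    """Pick the best source element for each target element.

    wsim: dict {(source, target): weighted similarity}
    Returns (accepted, alternatives):
      accepted     - dict {target: source} with the most similar source
      alternatives - (source, target, score) triples above the threshold
                     that were not chosen, best first, as hints for the user
    """
    best = {}
    alternatives = []
    for (source, target), score in wsim.items():
        if score <= th_accept:
            continue                      # below threshold: not considered
        if target not in best:
            best[target] = (source, score)
        elif score > best[target][1]:
            old_source, old_score = best[target]
            alternatives.append((old_source, target, old_score))
            best[target] = (source, score)
        else:
            alternatives.append((source, target, score))
    accepted = {target: source for target, (source, _) in best.items()}
    alternatives.sort(key=lambda triple: triple[2], reverse=True)
    return accepted, alternatives


scores = {
    ("POShipTo/City", "DeliverTo/Address/City"): 0.9,
    ("POBillTo/City", "DeliverTo/Address/City"): 0.7,
    ("POShipTo/Street", "DeliverTo/Address/City"): 0.2,
}
accepted, alternatives = generate_mapping(scores)
print(accepted)
print(alternatives)
```

Returning the runner-up pairs separately keeps the output mapping one-to-one per target while still showing the user which near misses could replace an accepted pair.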

25

Cupid Architecture

[Figure: the Cupid architecture diagram again, now with an additional "Input hint" label.]

26

Work Needed

• A more robust solution
  – Auto-tuning parameters
  – Thesaurus Generation and Evolution
• Schema matching component architecture
  – Easily extensible by adding multiple techniques
  – Data Instances for matching
  – Look at COMA & ProtoPlasm systems

27

References

1. J. Madhavan, P. A. Bernstein, E. Rahm: Generic Schema Matching with Cupid. VLDB, 2001

2. H. H. Do, E. Rahm: COMA - A System for Flexible Combination of Schema Matching Approaches. VLDB, 2002

3. P. A. Bernstein, S. Melnik, M. Petropoulos, C. Quix: Industrial-Strength Schema Matching. SIGMOD Record 33(4), 2004