MANAGING UNCERTAINTY OF XML SCHEMA MATCHING Reynold Cheng, Jian Gong, David W. Cheung ICDE’2010


MANAGING UNCERTAINTY OF XML SCHEMA MATCHING

Reynold Cheng, Jian Gong, David W. Cheung

ICDE’2010


THE DATA INTEGRATION PROBLEM

Querying the source data through a target query interface
E.g., querying multiple data sources through a mediated query interface

[Figure: data source, source schema, schema mapping, target schema, query interface]


SCHEMA MATCHING & MAPPING

Schema matching: finding element correspondences, with similarities, between schemas
Schema mapping: a set of one-to-one correspondences between two schemas
Generation: pick the best correspondences
Sample mapping: Order-ORDER, BP-IP, BCN-ICN, ……


SCHEMA MAPPING AND UNCERTAINTY

The mapping between schemas can be uncertain
Example: Purchase Order schemas
Uncertain mappings (which one is correct?)
  M1: Order-ORDER, …, BCN-ICN, …
  M2: Order-ORDER, …, RCN-ICN, …
  …
Compute Pr(Mi) by: 1) aggregating similarities of correspondences, and 2) normalizing probabilities of top-k mappings
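The two-step computation of Pr(Mi) on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "aggregating" means summing the similarities (as the mapping score is defined later in the talk) and that "normalizing" means dividing by the total score of the given top-k mappings. The function name and the similarity values are hypothetical.

```python
def mapping_probabilities(mappings, similarity):
    """Return one probability per mapping.

    mappings: list of mappings, each a list of (source, target) correspondences.
    similarity: dict from correspondence to its similarity score.
    """
    # Step 1: aggregate the similarities of each mapping (a sum, by assumption).
    scores = [sum(similarity[c] for c in m) for m in mappings]
    # Step 2: normalize over the given top-k mappings (divide by the total, by assumption).
    total = sum(scores)
    return [score / total for score in scores]


# Hypothetical similarity scores, for illustration only.
similarity = {("Order", "ORDER"): 0.9, ("BCN", "ICN"): 0.6, ("RCN", "ICN"): 0.5}
m1 = [("Order", "ORDER"), ("BCN", "ICN")]
m2 = [("Order", "ORDER"), ("RCN", "ICN")]
probabilities = mapping_probabilities([m1, m2], similarity)
```

With these made-up scores, M1 receives a slightly higher probability than M2, and the probabilities sum to 1.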


DATA INTEGRATION RELOADED

Managing uncertainty of XML schema matching
Issues: mapping generation and storage, query evaluation, etc.
[Figure: data source, source schema, uncertain schema mapping, mediated schema, query interface]


OBSERVATION

Sharing among uncertain mappings

Overlapping among the uncertain mappings:
  “Order~ORDER” shared by m1-m5
  “BP~IP” shared by m1, m2, m4, m5
  “BCN~ICN” shared by m1, m2
  …


OBSERVATION

How much overlap is there in real-world schema mappings?
Overlapping ratio (o-ratio): the average overlap of the top-100 possible schema mappings
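The slide does not spell out how the overlap of two mappings is measured, so the sketch below makes an explicit assumption: the overlap of two mappings is the Jaccard similarity of their correspondence sets, and the o-ratio is the average over all pairs. The function name `o_ratio` and the sample mappings are hypothetical.

```python
from itertools import combinations


def o_ratio(mappings):
    """Average pairwise overlap of a list of non-empty mappings (sets of correspondences)."""
    pairs = list(combinations(mappings, 2))
    if not pairs:
        return 0.0
    # Overlap of two mappings: shared correspondences over all correspondences
    # (Jaccard similarity, an assumption of this sketch).
    overlaps = [len(a & b) / len(a | b) for a, b in pairs]
    return sum(overlaps) / len(overlaps)


m1 = {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN")}
m2 = {("Order", "ORDER"), ("BP", "IP"), ("RCN", "ICN")}
ratio = o_ratio([m1, m2])
```

Here the two mappings share two of four distinct correspondences, so the ratio is 0.5 under the assumed measure.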


OUR CONTRIBUTION

Propose the block tree: a novel data structure to represent a set of mappings
  Definition
  Efficient generation
Propose the probabilistic twig query (PTQ)
  Definition
  Efficient evaluation with the block tree
  Top-k PTQ, and its computation issues
Improve the possible mapping generation process
  A divide-and-conquer approach
Conduct experiments on real data to validate our methods


RELATED WORK

Schema matching approaches and tools [RB01]
  COMA [DR02]
Managing uncertainty in schema matching
  Top-k schema mappings [Gal06]
  Generating top-k mappings [Murty86]
Query evaluation in data integration
  Theoretical foundation [Len02]
  Data integration with uncertainty [DHY07]
  XML query rewriting for data integration [YP04]
XML query evaluation
  Twig query [QYD07]
  Querying probabilistic XML document [KYS08]


OUTLINE

Introduction
Problem
  Data model
  Query model
Techniques
Results
Conclusion


DATA MODEL

XML schema and document [QYD07]
  Node-labeled tree
  Document nodes may carry text values
Schema mapping [DHY07]
  One-to-one mapping
[Figure: two schemas and a document]
Uncertain mappings
  M1: Order-ORDER, …, BCN-ICN, …
  M2: Order-ORDER, …, RCN-ICN, …
  …


QUERY MODEL (SINGLE MAPPING)

Twig query through a target schema [YP04]
  Step 1: rewrite the target query into a source query, based on the schema mapping
  M1: Order-ORDER, BP-IP, BCN-ICN, …
[Figure: source schema and target schema, with the source query and the target query]


QUERY MODEL (SINGLE MAPPING)

Twig query through a target schema [YP04]
  Step 1: rewrite the target query into a source query, based on the schema mapping
  Step 2: evaluate the source query on the source document
[Figure: the source query and the source document]


QUERY MODEL (UNCERTAIN MAPPINGS)

Query evaluation with uncertain mappings [DHY07]
  Mappings: pM = {(M1, Pr(M1)), …, (Mh, Pr(Mh))}
  The query answers from mapping Mi have probability Pr(Mi)
[Figure: the target query QT is rewritten under each mapping (Mi, Pr(Mi)) into a source query QSi, whose evaluation gives the result (Ri, Pr(Mi))]
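The rewrite-then-evaluate pipeline above can be sketched as follows, under strong simplifying assumptions: the target query is a single element name, a mapping is a target-to-source dictionary, and the source document is a dictionary from element name to text values. All function names, the document contents, and the probabilities are hypothetical.

```python
def rewrite(target_query, mapping):
    """Rewrite a one-element target query using a target -> source mapping."""
    return mapping.get(target_query)


def evaluate(source_query, source_document):
    """Return the text values stored under the queried source element."""
    if source_query is None:
        return []
    return source_document.get(source_query, [])


def answers_with_probabilities(target_query, p_mappings, source_document):
    """Pair the result R_i obtained under each mapping M_i with Pr(M_i)."""
    results = []
    for mapping, probability in p_mappings:
        source_query = rewrite(target_query, mapping)
        results.append((evaluate(source_query, source_document), probability))
    return results


# Hypothetical document and probabilities, for illustration only.
source_document = {"BCN": ["Alice"], "RCN": ["Bob"]}
p_mappings = [
    ({"ORDER": "Order", "ICN": "BCN"}, 0.6),
    ({"ORDER": "Order", "ICN": "RCN"}, 0.4),
]
results = answers_with_probabilities("ICN", p_mappings, source_document)
```

Each mapping yields its own answer set, tagged with that mapping's probability, which mirrors the (Ri, Pr(Mi)) pairs in the figure.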


OUTLINE

Introduction
Problem
Techniques
  Block tree
  Query evaluation
  Mapping generation
Results
Conclusion


THE BLOCK

Each block, which is attached to a target schema element, consists of:
  C: a set of correspondences
  M: a set of mappings
Semantics: the mappings in M share the correspondences in C
Drawback: an exponential number of blocks to handle
[Figure: example blocks]


THE C-BLOCK

A c-block (constrained block) is a block which:
  Contains correspondences for all elements in its sub-tree (so that it is more useful for query evaluation)
  Is shared by more mappings than a threshold (otherwise it is not worth storing)
[Figure: a c-block, with |pM| = 5 and threshold = 0.4]
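The two c-block conditions can be checked with a short predicate. This is a sketch under assumptions: mappings are sets of (source, target) pairs, the threshold is a fraction of |pM|, and the comparison is taken as inclusive (>=), which the slide does not state. The function name, the sub-tree {IP, ICN}, and the filler mappings are hypothetical.

```python
def is_c_block(correspondences, members, subtree_elements, all_mappings, threshold):
    """Check the two c-block conditions for a candidate block (C, M).

    correspondences: set of (source, target) pairs, the block's C.
    members: list of mappings (sets of pairs), the block's M.
    subtree_elements: target elements in the sub-tree of the block's element.
    all_mappings: the full list of possible mappings pM.
    """
    # Block semantics: every mapping in M contains every correspondence in C.
    shared = all(correspondences <= m for m in members)
    # Condition 1: C covers every target element of the sub-tree.
    covered = {target for _, target in correspondences}
    covers_subtree = set(subtree_elements) <= covered
    # Condition 2: enough mappings share C (inclusive comparison, by assumption).
    frequent = len(members) / len(all_mappings) >= threshold
    return shared and covers_subtree and frequent


block_c = {("BP", "IP"), ("BCN", "ICN")}
m1 = block_c | {("Order", "ORDER")}
m2 = block_c | {("Order", "ORDER"), ("OCN", "SCN")}
others = [{("Order", "ORDER")} for _ in range(3)]  # placeholder mappings
all_mappings = [m1, m2] + others
ok = is_c_block(block_c, [m1, m2], {"IP", "ICN"}, all_mappings, 0.4)
partial = is_c_block({("BP", "IP")}, [m1, m2], {"IP", "ICN"}, all_mappings, 0.4)
```

With |pM| = 5 and a threshold of 0.4, a block shared by two mappings passes the frequency test, while a block that omits ICN fails the coverage test.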


THE BLOCK TREE

Creation of the block tree
  Follows the structure of the target schema
  A bottom-up method
Lemma 1 (informal): the c-blocks for an element can be created from the c-blocks of its children. (detail)
Lemma 2 (informal): if an element has no c-block, then its parent (if any) has no c-block.


THE BLOCK TREE

Reducing the storage cost of uncertain mappings
If part of a mapping is in the block tree, then replace it with a link
[Figure: a block tree over the target elements ORDER, IP, ICN, SP, SCN, ..., with blocks including
  C: Order~ORDER; M: m1, m2, m3, m4, m5
  C: BP~IP; M: m1, m2, m4, m5
  C: BP~IP, BCN~ICN; M: m1, m2
  C: BCN~ICN; M: m1, m2
  C: RCN~ICN; M: m3, m4
  C: OCN~SCN; M: m2, m3
  C: BCN~SCN; M: m4, m5
and the mappings m1-m5, stored with links (such as b2.C, b4.C, b5.C) in place of the correspondences held in blocks]
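The link-replacement idea can be sketched as below. The slide does not say how blocks are chosen when several apply, so this sketch greedily prefers larger blocks, which is an assumption. The block identifiers (`blk_*`) and the function name are hypothetical and do not correspond to the labels in the figure.

```python
def compress_mapping(mapping, blocks):
    """Replace the parts of a mapping that are stored in blocks with links.

    mapping: set of (source, target) correspondences.
    blocks: dict from block id to the block's correspondence set C.
    Returns (links, remaining): the block ids used, and the correspondences
    that still have to be stored explicitly.
    """
    remaining = set(mapping)
    links = []
    # Try larger blocks first (a greedy choice, assumed for this sketch).
    for block_id, block_c in sorted(blocks.items(), key=lambda kv: -len(kv[1])):
        if block_c <= remaining:
            links.append(block_id)
            remaining -= block_c
    return links, remaining


blocks = {
    "blk_icn": {("BCN", "ICN")},
    "blk_ip": {("BP", "IP"), ("BCN", "ICN")},
    "blk_order": {("Order", "ORDER")},
}
m1 = {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN"), ("RCN", "SCN")}
links, remaining = compress_mapping(m1, blocks)
```

Only the correspondence that appears in no block is stored explicitly; the rest of the mapping becomes two links.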


OUTLINE

Introduction
Problem
Techniques
  Block tree
  Query evaluation
  Mapping generation
Results
Conclusion


QUERY EVALUATION AND UNCERTAINTY

The uncertainty in mappings may affect query answers
Uncertain mappings
  M1: Order-ORDER, …, BCN-ICN, …
  M2: Order-ORDER, …, RCN-ICN, …
  …
Target query Q: //ICN, which finds all ICNs (contact names of invoice parties) in the purchase order
[Figure: an example source document, marking the nodes returned by M1 and by M2]


THE BASELINE APPROACH

Evaluate QT with each mapping in pM separately
Drawback
  When the mappings Mi are large, or h is large, the computation is expensive
[Figure: QT is rewritten under each mapping (Mi, Pr(Mi)) into QSi, and each QSi is evaluated on the source document DS to give (Ri, Pr(Mi))]


QUERY EVALUATION WITH BLOCK TREE

Consider the root of a query
  Case 1): the root is found in the block tree, then use the blocks to evaluate the whole query


QUERY EVALUATION WITH BLOCK TREE

Case 1): the root is found in the block tree, then use the blocks to evaluate the whole query
  Only one mapping in the block is used
  Deal with the remaining mappings
[Figure: a query over the elements IP and ICN]
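Case 1 can be sketched as follows: the query is evaluated once per usable block, the answer is shared by all mappings in that block, and mappings outside every block are evaluated individually. This is a simplified illustration, not the authors' algorithm: the query is flattened to a tuple of target element names, and the `evaluate` callback, the mapping ids, and the data are hypothetical.

```python
def evaluate_with_blocks(query, blocks, mappings, evaluate):
    """Evaluate `query` once per usable block, then once per remaining mapping.

    query: tuple of target element names (a simplification of a twig query).
    blocks: list of (C, M) pairs; C maps target -> source, M is a set of mapping ids.
    mappings: dict from mapping id to a full target -> source dict.
    evaluate: callable that runs a rewritten source query.
    """
    results = {}
    for correspondences, members in blocks:
        if not all(t in correspondences for t in query):
            continue
        # All mappings in the block agree on C, so one evaluation serves them all.
        answer = evaluate(tuple(correspondences[t] for t in query))
        for mapping_id in members:
            results.setdefault(mapping_id, answer)
    # Mappings not covered by any block are evaluated one by one.
    for mapping_id, mapping in mappings.items():
        if mapping_id not in results:
            results[mapping_id] = evaluate(tuple(mapping.get(t) for t in query))
    return results


calls = []


def evaluate(source_query):
    """Stand-in evaluator that records each call and returns a path string."""
    calls.append(source_query)
    return "/".join(str(e) for e in source_query)


mappings = {
    "m1": {"IP": "BP", "ICN": "BCN"},
    "m2": {"IP": "BP", "ICN": "BCN"},
    "m3": {"IP": "SP", "ICN": "RCN"},
}
blocks = [({"IP": "BP", "ICN": "BCN"}, {"m1", "m2"})]
results = evaluate_with_blocks(("IP", "ICN"), blocks, mappings, evaluate)
```

In this toy run the evaluator is called twice for three mappings, because m1 and m2 share one block.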


QUERY EVALUATION WITH BLOCK TREE

Consider the root of a query
  Case 1): the root is found in the block tree, then use the blocks to evaluate the whole query
  Case 2): the root is not found, decompose the query (if possible), invoke recursion, and join partial answers


QUERY EVALUATION WITH BLOCK TREE

Case 2): the root is not found, decompose the query (if possible), invoke recursion, and join partial answers
[Figure: a query over ORDER, IP, ICN, and SP is decomposed into ORDER + IP/ICN + SP; the parts are labeled direct query, recursion, and direct query]


OUTLINE

Introduction
Problem
  Data model
  Query model
Techniques
  Block tree
  Query evaluation
  Mapping generation
Results
Conclusion


MAPPING GENERATION

A mapping m for a schema S with another schema T contains a set of correspondences (es, et)
  et may be EMPTY, i.e., es matches no element in T
  Each element in S occurs exactly once in m
  Each element in T occurs at most once in m
  m's score is the sum of the similarities of its correspondences
Problem definition
  Given: two schemas S and T, and a set of correspondences (es, et) with similarities (the schema matching results)
  Return: h mappings m1, …, mh, whose scores are the highest
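The mapping definition above translates directly into a validity check and a score function. This is a minimal sketch: `EMPTY` is represented by `None`, and a correspondence to EMPTY is assumed to contribute 0 to the score, which the slide does not state. The function names and similarity values are hypothetical.

```python
EMPTY = None  # marks a source element that matches no target element


def is_valid_mapping(mapping, source_elements):
    """Check the occurrence constraints on a list of (es, et) correspondences."""
    sources = [s for s, _ in mapping]
    targets = [t for _, t in mapping if t is not EMPTY]
    # Each element in S occurs exactly once in m.
    each_source_once = sorted(sources) == sorted(source_elements)
    # Each element in T occurs at most once in m.
    targets_unique = len(targets) == len(set(targets))
    return each_source_once and targets_unique


def mapping_score(mapping, similarity):
    """Sum of the similarities of the correspondences (EMPTY counts as 0, by assumption)."""
    return sum(similarity.get((s, t), 0.0) for s, t in mapping if t is not EMPTY)


# Hypothetical similarity scores, for illustration only.
similarity = {("Order", "ORDER"): 0.9, ("BCN", "ICN"): 0.6, ("RCN", "ICN"): 0.5}
source_elements = ["Order", "BCN", "RCN"]
good = [("Order", "ORDER"), ("BCN", "ICN"), ("RCN", EMPTY)]
bad = [("Order", "ORDER"), ("BCN", "ICN"), ("RCN", "ICN")]  # ICN used twice
```

The second mapping is rejected because the target element ICN occurs twice.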


MAPPING GENERATION

Baseline solution
  Finding the h-maximum bipartite matchings (min-cost flow)
  Polynomial in the size of the bipartite graph


MAPPING GENERATION

Observation: XML schema matching is usually sparse
Improvement: a divide-and-conquer approach
  Derive partitions (maximal connected sub-graphs) of the bipartite graph
  Find the top-h partial mappings from each partition
  Merge
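The three divide-and-conquer steps can be sketched as below. To stay short, the per-partition search is a brute-force enumeration rather than the ranked-assignment or min-cost-flow method of the baseline, so it only suits tiny partitions. Source elements with no candidate correspondence are left out (they would map to EMPTY). All names and similarity values are hypothetical.

```python
import heapq
from itertools import product


def partitions(edges):
    """Split the bipartite graph into connected components (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for s, t in edges:
        parent[find(("S", s))] = find(("T", t))
    groups = {}
    for (s, t), sim in edges.items():
        groups.setdefault(find(("S", s)), {})[(s, t)] = sim
    return list(groups.values())


def top_h_partial(part, h):
    """Brute-force the h best partial mappings inside one partition."""
    sources = sorted({s for s, _ in part})
    found = []

    def extend(i, used, chosen, score):
        if i == len(sources):
            found.append((score, tuple(chosen)))
            return
        s = sources[i]
        extend(i + 1, used, chosen + [(s, None)], score)  # s matches EMPTY
        for (s2, t), sim in part.items():
            if s2 == s and t not in used:
                extend(i + 1, used | {t}, chosen + [(s, t)], score + sim)

    extend(0, frozenset(), [], 0.0)
    return heapq.nlargest(h, found, key=lambda r: r[0])


def top_h_mappings(edges, h):
    """Merge per-partition results, keeping only the h best combinations."""
    best = [(0.0, ())]
    for part in partitions(edges):
        combos = [(sa + sb, ma + mb)
                  for (sa, ma), (sb, mb) in product(best, top_h_partial(part, h))]
        best = heapq.nlargest(h, combos, key=lambda r: r[0])
    return best


# Hypothetical matching result, for illustration only.
edges = {("Order", "ORDER"): 0.9, ("BCN", "ICN"): 0.6, ("RCN", "ICN"): 0.5}
top2 = top_h_mappings(edges, 2)
```

Because the score of a mapping is a sum over partitions, the h best full mappings are always combinations of the h best partial mappings of each partition, which is why the merge step may discard everything else.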


OUTLINE

Introduction
Problem
Techniques
Results
Conclusion


DATASET AND RESULTS

XML schemas and documents
  7 schemas for purchase orders, obtained from various e-commerce standards (e.g., XCBL, OpenTrans)
  Accompanying sample XML documents
Schema matching
  Tool: COMA++, with different schema matching methods
  10 datasets: (source-schema, target-schema, matching-method)
Target query
  10 handwritten queries


RESULTS

Uncertain mappings: do they really overlap?


RESULTS

How much space does the block tree save for storing uncertain mappings? And why?


RESULTS

Is the block tree effective?
Intuitively, larger blocks tend to be more useful


RESULTS

The block tree can be efficiently created
Fast, and controllable


RESULTS

Can the block tree really help to improve query performance?
Varying the total number of mappings


RESULTS

Can it scale?
Probabilistic twig query and top-k query


RESULTS

Top-h mapping generation
Performance gain of partitioning


CONCLUSION

We study the problem of handling uncertainty in XML schema matching
Observation
  Overlapping mappings, sparse bipartite graphs, etc.
Approach
  The block tree
  Query evaluation with the block tree
  Generating uncertain mappings more efficiently
Future work
  Other types of queries, probabilistic documents, index update, relational scenarios, etc.


THANKS!

Q & A


REFERENCES

[Len02] Lenzerini, “Data integration: a theoretical perspective”, in PODS, 2002
[YP04] Yu et al., “Constraint-based XML query rewriting for data integration”, in SIGMOD, 2004
[DR02] Do et al., “COMA: a system for flexible combination of schema matching approaches”, in VLDB, 2002
[Gal06] Gal, “Managing uncertainty in schema matching with top-k schema mappings”, in J. Data Semantics VI, 2006
[DHY07] Dong et al., “Data integration with uncertainty”, in VLDB, 2007
[QYD07] Qin et al., “TwigList: make twig pattern matching fast”, in DASFAA, 2007
[Murty86] Murty, “An algorithm for ranking all the assignments in increasing order of cost”, Operations Research, vol 16, 1986
[RB01] Rahm et al., “A survey of approaches to automatic schema matching”, VLDB J, vol 10, 2001
[KYS08] Kimelfeld et al., “Query efficiency in probabilistic XML models”, in SIGMOD, 2008
…


QUERY REWRITING

Given
  A target twig query QT
  A schema mapping m between S and T, which is a set of correspondences (es, et)
Mapping semantics
  For each sub-tree in the source document DS which contains a set of source elements in m, there exists a sub-tree in the target document DT which contains the corresponding target elements
Procedure
  For each element in QT, replace it with a source element
  Connect all the source elements
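The rewriting procedure can be sketched as below, with a twig query represented as a nested `(label, children)` tuple. As a simplification, the "connect" step keeps the shape of the target query instead of consulting the source schema, and a query is treated as not rewritable when any of its elements has no correspondence. The function name and the representation are hypothetical.

```python
def rewrite_twig(query, mapping):
    """Rewrite a target twig query into a source twig query.

    query: (label, children), where children is a list of sub-queries.
    mapping: dict from target element to source element.
    Returns the rewritten query, or None if some element has no correspondence.
    """
    label, children = query
    if label not in mapping:
        return None
    rewritten_children = []
    for child in children:
        rewritten = rewrite_twig(child, mapping)
        if rewritten is None:
            return None
        rewritten_children.append(rewritten)
    # Replace the target element with its source element, keeping the edges.
    return (mapping[label], rewritten_children)


m1 = {"ORDER": "Order", "IP": "BP", "ICN": "BCN"}
target_query = ("ORDER", [("IP", [("ICN", [])])])
source_query = rewrite_twig(target_query, m1)
```

Under M1 the target query ORDER/IP/ICN becomes the source query Order/BP/BCN.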


LEMMA 1

Lemma 1 (conceptually): the c-blocks for a schema element t can be created from the c-blocks of t's children. (detail)

An example
[Figure: a schema tree with the elements Order, InvoiceTo, DeliverTo, Contact, name, Address, email, street, city, and country, annotated with numbers such as 27|24|25|24, 51|49, and 100; blocks b1.M: 1-52, b2.M: 53-100, b3.M: 1,3,5,…, b4.M: 2,4,6,...]


RESULTS

What kinds of queries did we use?