CSE 636 Data Integration
Schema Matching
Cupid
Fall 2006
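Virtual Integration Architecture
[Figure: architecture diagram. An End User exchanges Query and Result with a Mediator that exposes a Global Schema; Wrappers sit in front of the Data Sources, each with its own Local Schema. Design-Time components: Mediation Language, Schema Matching. Run-Time components: Query Reformulation, Optimization & Execution. Also labeled: XML, Web Services.]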
Schema Heterogeneity
Independently created schemas might be modeling similar information in slightly different ways.
[Figure: three schema trees (* marks a repeated element, ? an optional one)
  DB1: ugrad* (ugradID, name); enrollment* (courseID, ugradID, grade)
  DB2: course* (courseID, title) containing student* (studentID, name, type, evaluation)
  DB3: student* (studentID, name, type) containing course* (courseID, type, title?, letter)]
Schema Heterogeneity
[Figure: the DB1, DB2, and DB3 schema trees, repeated.]
• Similar entities represented
• Dissimilar structures (inverted nesting)
• Different element names for similar data values
• Similar element names for different data values
Schema Matching vs. Schema Mapping
• GAV and LAV are schema mapping languages
• Mappings:
  – set of queries
  – associations + semantics
• Match:
  – set of associations only
• Schema Matching:
  – Identifying associations
  – First step towards constructing mappings
Schema Matching vs. Schema Mapping
LAV Mapping: DB1 is described by a query Q over DB3

    for $s1 in DB3/student
    where $s1/type = 'UGRAD'
    return
      <DB1>
        <ugrad>
          <ugradID>{$s1/studentID}</ugradID>
          <name>{$s1/name}</name>
        </ugrad>
      </DB1>

[Figure: the DB1 and DB3 schema trees, annotated with the labels "Associations" and "Semantics".]
The Problem of Schema Matching
Input
• Schemas S1 and S2
• Possibly data instances for S1 and S2
• Background knowledge
  – thesauri
  – validated matches
  – standard schemas
  – reference instances
  – ontologies
  – constraints (keys, data types, etc.)
Output
• Associations between S1 and S2
Goal
• Schema matching tools with significant automated support
Schema Matching
How is the match result expressed?
[Figure: the DB3 and DB2 schema trees.]
• Pairs of paths
• Lists of paths
• Schema names
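A match result expressed as pairs of paths can be pictured with the minimal Python sketch below. The `Association` type, the `targets_for` helper, and the specific path pairs are assumptions made for illustration, based on one reading of the DB2 and DB3 example schemas; they are not the output of any particular matching tool.

```python
from typing import NamedTuple


class Association(NamedTuple):
    """One correspondence between a DB3 path and a DB2 path (hypothetical type)."""
    source_path: str
    target_path: str


# Hand-written pairs for illustration only; a matcher would propose these.
match_result = [
    Association("DB3/student", "DB2/course/student"),
    Association("DB3/student/studentID", "DB2/course/student/studentID"),
    Association("DB3/student/course/letter", "DB2/course/student/evaluation"),
]


def targets_for(source_path: str, match: list[Association]) -> list[str]:
    """Return every target path associated with the given source path."""
    return [a.target_path for a in match if a.source_path == source_path]
```

Note that the pairs carry no semantics: nothing here says how a `letter` value is converted into an `evaluation` value, which is what a mapping would add.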
Schema Matching
What do we match?
• Depends on the queries we want to ask
  1. Elements in isolation (leaves in particular)
  2. Substructures
  3. Whole schemas
Motivation
• Important component in many applications
  – Data Integration
  – Data Migration
  – E-Commerce
• Model Management [Bernstein, Halevy, Pottinger ’00]
  – Algebra for manipulating models and mappings
  – Match, Merge, Compose …
Problems
• Minimize user involvement (semi-automatic)
• Data model independent matching (generic)
• Schema matching is a hard problem
  – Naming and structural differences in schemas
  – Similar, but non-identical concepts modeled
  – Multiple data models: SQL DDL, XML, ODMG…
Schema Matching Approaches
How to match?
• Individual matchers
  – Schema-based
    • Per-Element
      – Linguistic: names, descriptions
      – Constraint-based: types, keys
    • Structural
      – Constraint-based: graph matching
  – Content-based
    • Per-Element
      – Linguistic: IR (word frequencies, key terms)
      – Constraint-based: value pattern and ranges
• Combined matchers
  – Hybrid
  – Composite
    • manual composition
    • automatic composition
Taxonomy-based survey: Rahm and Bernstein, VLDB J, 2001
Cupid
[Figure: the matcher taxonomy (individual matchers: schema-based and content-based; combined matchers: hybrid and composite), shown again to position Cupid.]
Madhavan, Bernstein and Rahm, VLDB, 2001
Cupid Example
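[Figure: two purchase-order schema trees to be matched.
  PO: POLines containing Item (Line, Qty, UoM); POShipTo (Street, City)
  PurchaseOrder: Items containing Item (ItemNumber, Quantity, UnitofMeasure); DeliverTo containing Address (Street, City)
  Two Name labels also appear in the diagram.]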
Cupid Architecture
[Figure: Schema 1 and Schema 2 feed Linguistic Matching, which consults a Thesaurus and produces LSIM; Structure Matching produces SSIM and WSIM; Generate Mapping produces the Output Mapping.]
Linguistic Matching
• Heuristic name matching
  – Tokenization of names: POOrderNum → PO, Order, Num
  – Expansion of short-forms, acronyms: PO → Purchase, Order; Num → Number
  – Clustering of schema elements based on keywords and data-types: Street, City, POAddress → Address
  – Thesaurus of synonyms, hypernyms, acronyms
  – Linguistic Similarity coefficient (LSIM) in [0,1]
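The tokenization, expansion, and scoring steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `ABBREVIATIONS` and `SYNONYMS` tables are hand-made stand-ins for the thesaurus, and the averaging rule in `lsim` is a simplified scoring choice made for this sketch, not necessarily the exact formula used by Cupid.

```python
import re

# Hypothetical lookup tables standing in for the thesaurus.
ABBREVIATIONS = {"po": ["purchase", "order"], "num": ["number"], "qty": ["quantity"]}
SYNONYMS = {frozenset({"ship", "deliver"}): 0.9, frozenset({"bill", "invoice"}): 0.9}


def tokenize(name: str) -> list[str]:
    """Split a name such as 'POOrderNum' into ['PO', 'Order', 'Num']."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", name)


def expand(tokens: list[str]) -> list[str]:
    """Lower-case tokens and expand known short forms (PO -> purchase, order)."""
    expanded: list[str] = []
    for token in tokens:
        expanded.extend(ABBREVIATIONS.get(token.lower(), [token.lower()]))
    return expanded


def token_sim(a: str, b: str) -> float:
    """1.0 for identical tokens, a thesaurus score for synonyms, else 0.0."""
    if a == b:
        return 1.0
    return SYNONYMS.get(frozenset({a, b}), 0.0)


def lsim(name1: str, name2: str) -> float:
    """Average best-match token similarity in both directions, in [0, 1]."""
    t1, t2 = expand(tokenize(name1)), expand(tokenize(name2))
    if not t1 or not t2:
        return 0.0
    best1 = sum(max(token_sim(a, b) for b in t2) for a in t1)
    best2 = sum(max(token_sim(a, b) for a in t1) for b in t2)
    return (best1 + best2) / (len(t1) + len(t2))
```

With these toy tables, `lsim("Qty", "Quantity")` is 1.0 because the short form expands to the full word, while `lsim("City", "Street")` is 0.0 because no thesaurus entry links the two.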
Structure Matching
[Figure: the PO and PurchaseOrder schema trees from the Cupid example, including POShipTo (Street, City) and DeliverTo containing Address (Street, City).]
Structure Matching: Mutually Reinforcing Similarity
[Figure: subtree PO / POLines / Item (Line, Qty, UoM) against PurchaseOrder / Items / Item (ItemNum, Quantity, UnitofMeasure). Two element pairs are marked WSIM > th_high, and three pairs are marked SSIM++.]
Structure Matching: Context Dependent Disambiguation
[Figure: PO with POShipTo (Street, City) and POBillTo (Street, City); PurchaseOrder with DeliverTo containing Address (Street, City) and InvoiceTo containing Address (Street, City). Annotations: SSIM++ (twice) and SSIM-- (once).]
Intuition
• Atomic elements are similar if
  – Linguistically and data-type similar
  – Their ancestors are similar
• Compound elements (non-leaf) are similar if
  – Linguistically similar
  – Subtrees rooted at the elements are similar
• Mutually recursive
  – Leaves determine internal node similarity
  – Similarity of internal nodes leads to an increase in leaf similarity
Structure Match Details
• Subtrees are similar if
  – Immediate children are similar
  – Leaf sets are similar
• Subtree Similarity (nodes s and t)
  – Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t, and vice versa
  – Less sensitive to variation in intermediate structure
• Pruning the number of comparisons
  – Elements must have a comparable number of leaves
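The leaf-set similarity, the pruning rule, and the mutual reinforcement between leaves and internal nodes can be sketched together in Python. This is a simplified illustration, not Cupid's implementation: the `Node` type, every threshold and constant (`w_struct`, `th_accept`, `th_high`, `th_low`, `c_inc`, `c_dec`, `leaf_ratio`), the uniform starting value of 0.5 for leaf pairs, and the additive update are all assumptions chosen for readability. The `lsim` argument is any function returning a linguistic similarity for two element names.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Callable


@dataclass(eq=False)  # identity-based hashing, so nodes can be dictionary keys
class Node:
    name: str
    children: list["Node"] = field(default_factory=list)


def leaves(n: Node) -> list[Node]:
    """All leaf nodes in the subtree rooted at n."""
    return [n] if not n.children else [l for c in n.children for l in leaves(c)]


def post_order(n: Node) -> list[Node]:
    """Children before parents, so leaf similarities exist before they are used."""
    return [x for c in n.children for x in post_order(c)] + [n]


def tree_match(s_root: Node, t_root: Node, lsim: Callable[[str, str], float],
               w_struct: float = 0.5, th_accept: float = 0.5, th_high: float = 0.6,
               th_low: float = 0.3, c_inc: float = 0.1, c_dec: float = 0.1,
               leaf_ratio: float = 2.0) -> dict[tuple[Node, Node], float]:
    """Return weighted similarities for the element pairs that were compared."""
    ssim: dict[tuple[Node, Node], float] = {}
    wsim: dict[tuple[Node, Node], float] = {}

    def weighted(a: Node, b: Node) -> float:
        return w_struct * ssim[a, b] + (1 - w_struct) * lsim(a.name, b.name)

    # Assumed starting point: every leaf pair is equally data-type compatible.
    for pair in product(leaves(s_root), leaves(t_root)):
        ssim[pair] = 0.5

    for s, t in product(post_order(s_root), post_order(t_root)):
        ls, lt = leaves(s), leaves(t)
        # Pruning: only compare elements with a comparable number of leaves.
        if max(len(ls), len(lt)) > leaf_ratio * min(len(ls), len(lt)):
            continue
        compound = bool(s.children or t.children)
        if compound:
            # Fraction of leaves on each side strongly linked to the other side.
            linked_s = sum(any(weighted(a, b) > th_accept for b in lt) for a in ls)
            linked_t = sum(any(weighted(a, b) > th_accept for a in ls) for b in lt)
            ssim[s, t] = (linked_s + linked_t) / (len(ls) + len(lt))
        wsim[s, t] = weighted(s, t)
        # Mutual reinforcement: similar ancestors raise leaf similarity (SSIM++),
        # dissimilar ancestors lower it (SSIM--).
        delta = c_inc if wsim[s, t] > th_high else -c_dec if wsim[s, t] < th_low else 0.0
        if compound and delta:
            for pair in product(ls, lt):
                ssim[pair] = min(1.0, max(0.0, ssim[pair] + delta))

    # Refresh the weighted values so leaf pairs reflect the final SSIM.
    return {pair: weighted(*pair) for pair in wsim}
```

Because nodes are visited in post-order, a pair of internal nodes is scored only after its leaves have been compared, and its score then feeds back into those leaves. That feedback is what lets two `City` leaves under matching parents end up more similar than two `City` leaves under unrelated parents.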
Referential Integrity
[Figure: Schema A has Purchase Order (Order ID, Product Name, Customer ID) and Customer (Customer ID, Name, Address), linked by the foreign key Order-Customer-fk, which is drawn as an additional node. Schema B has a single Customer-Purchase-Order element.]
• Join nodes added to the schema tree for each referential integrity constraint
• Views can be similarly used
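Adding a join node per referential integrity constraint might look like the Python sketch below. The `SchemaNode` and `ForeignKey` types and the `add_join_nodes` function are hypothetical, and the choice to give the join node the columns of both tables as children is one plausible reading of the slide rather than a confirmed detail.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaNode:
    name: str
    children: list["SchemaNode"] = field(default_factory=list)


@dataclass
class ForeignKey:
    name: str        # e.g. "Order-Customer-fk"
    from_table: str  # referencing table
    to_table: str    # referenced table


def add_join_nodes(root: SchemaNode, fks: list[ForeignKey]) -> SchemaNode:
    """Append one join node per foreign key under the schema root.

    Assumption: the join node's children are the columns of both tables,
    so it resembles the joined relation.
    """
    tables = {t.name: t for t in root.children}
    for fk in fks:
        columns = tables[fk.from_table].children + tables[fk.to_table].children
        root.children.append(SchemaNode(fk.name, list(columns)))
    return root
```

With such a node in place, Schema A gains an element whose leaf set resembles that of Customer-Purchase-Order in Schema B, so a leaf-based structural comparison has something to match against.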
Cupid Architecture
[Figure: the Cupid architecture diagram, repeated, with example similarity values attached to its stages.]

Linguistic Similarity (LSIM)

| Element | Element | LSIM |
|---|---|---|
| InvoiceTo | BillTo | 0.7 |
| UoM | UnitMeasure | 0.9 |
| City | City | 1.0 |

Structural (SSIM), Weighted (WSIM) Similarity

| Element | Element | SSIM | WSIM |
|---|---|---|---|
| InvoiceTo | BillTo | 0.8 | 0.7 |
| UoM | UnitMeasure | 0.7 | 0.8 |
| InvoiceTo/City | BillTo/City | 0.8 | 0.9 |
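In the Cupid paper [1], the weighted similarity is a convex combination of the other two coefficients: WSIM = w_struct × SSIM + (1 − w_struct) × LSIM. The short Python sketch below applies that formula to values listed on the slide; the function name and the default `w_struct = 0.5` are assumptions for illustration.

```python
def weighted_similarity(ssim: float, lsim: float, w_struct: float = 0.5) -> float:
    """Combine structural and linguistic similarity into WSIM."""
    return w_struct * ssim + (1.0 - w_struct) * lsim


# (element pair, SSIM, LSIM) taken from the slide's example values
rows = [
    ("UoM / UnitMeasure", 0.7, 0.9),
    ("InvoiceTo/City / BillTo/City", 0.8, 1.0),
]
for pair, ssim, lsim in rows:
    print(pair, round(weighted_similarity(ssim, lsim), 2))
```

With the assumed weight of 0.5 this gives 0.8 and 0.9, matching the WSIM column for those two rows. The InvoiceTo / BillTo row comes out at 0.75 rather than the listed 0.7, so the slide's numbers are best read as illustrative.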
Mapping Generation
• Individual mapping elements computed from WSIM values:
  – Consider only mapping pairs whose WSIM is greater than a threshold
  – For each element of the target, find the most similar source element
  – Mappings that are not accepted but have high similarity are also returned, to help the user modify the map
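The three steps above can be sketched in Python as follows. The function name, the dictionary layout of `wsim`, and the default threshold are assumptions made for illustration; the slide does not give a concrete threshold value.

```python
def generate_mapping(wsim: dict[tuple[str, str], float], th_accept: float = 0.5):
    """Pick, for each target element, the best source element above the threshold.

    wsim maps (source, target) to a weighted similarity in [0, 1].
    Returns (accepted, runners_up), where runners_up lists the remaining
    above-threshold pairs, highest similarity first, for the user to review.
    """
    candidates = [(s, t, w) for (s, t), w in wsim.items() if w > th_accept]
    best: dict[str, tuple[str, float]] = {}
    for s, t, w in candidates:
        if t not in best or w > best[t][1]:
            best[t] = (s, w)
    accepted = {t: s for t, (s, _) in best.items()}
    runners_up = sorted(
        ((s, t, w) for s, t, w in candidates if accepted[t] != s),
        key=lambda c: c[2],
        reverse=True,
    )
    return accepted, runners_up
```

For example, if both POBillTo and POShipTo score above the threshold against InvoiceTo, only the higher-scoring one is accepted and the other is reported as a runner-up.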
Cupid Architecture
[Figure: the Cupid architecture diagram, repeated with an additional "Input hint" label.]
Work Needed
• A more robust solution
  – Auto-tuning parameters
  – Thesaurus Generation and Evolution
• Schema matching component architecture
  – Easily extensible by adding multiple techniques
  – Data Instances for matching
  – Look at COMA & ProtoPlasm systems
References
1. J. Madhavan, P. A. Bernstein, E. Rahm: Generic Schema Matching with Cupid. VLDB, 2001
2. H. H. Do, E. Rahm: COMA - A System for Flexible Combination of Schema Matching Approaches. VLDB, 2002
3. P. A. Bernstein, S. Melnik, M. Petropoulos, C. Quix: Industrial-Strength Schema Matching. SIGMOD Record 33(4), 2004