Generic Schema Matching using Cupid

Jayant Madhavan, University of Washington
Philip A. Bernstein, Microsoft Research
Erhard Rahm, University of Leipzig

September 11th 2001

VLDB 2001 Roma Italy

Schema Matching

[Figure: two purchase-order schema trees. Left schema elements: PO, POLines, Item, Line, Qty, UoM, POShipTo, City, Street. Right schema elements: PurchaseOrder, Items, Item, ItemNumber, Quantity, UnitofMeasure, DeliverTo, Address, Name, City, Street. Highlighted correspondences: POShipTo ↔ DeliverTo, Line ↔ ItemNumber, Qty ↔ Quantity, UoM ↔ UnitofMeasure.]

The Problem

• Given two schemas, obtain a mapping between them that identifies corresponding elements
• A hard problem
– Naming and structural differences in schemas
– Similar, but non-identical concepts modeled
– Multiple data models: SQL DDL, XML, ODMG…

– Minimize user involvement (semi-automatic)
– Data model independent matching (generic)

Motivation

• Important component in many applications
– Data Integration
– Data Migration
– E-Commerce
• Model Management [Bernstein, Halevy, Pottinger ’00]
– Algebra for manipulating models and mappings
– Match, Merge, Compose …

Schema Matching Approaches

Taxonomy based survey [Rahm, Bernstein ’00]

• Individual matchers
– Schema-based, per-element, linguistic: names, descriptions
– Schema-based, per-element, constraint-based: types, keys
– Schema-based, structural, constraint-based: graph matching
– Content-based, per-element, linguistic: IR (word frequencies, key terms)
– Content-based, per-element, constraint-based: value patterns and ranges
• Combined matchers
– Hybrid
– Composite: manual composition, automatic composition

Related Work

• Hybrid approaches for schema integration
– DIKE [Palopoli, Sacca, Ursino, Terracina]
– MOMIS [Bergamaschi, Castano, Vincini]
• Linguistic and instance based
– SEMINT, DELTA [Clifton, Hausman, Rosenthal, Li]
• Instance based multi-strategy learning
– LSD [Doan, Domingos, Halevy]
• Others
– Hybrid rule based: TranScm [Milo, Zohar]
– Query discovery: CLIO [Haas, Hernandez, Miller]

Contributions

• Taxonomy of schema matching approaches
• Cupid system that exploits linguistic, data-type, structure and referential integrity information
– New algorithm that exploits schema structure
• Experimental validation and comparison with other systems

Cupid architecture

[Figure: Schema 1 and Schema 2 → Linguistic Matching (with Thesaurus) → LSIM → Structure Matching → SSIM, WSIM → Generate Mapping → Output Mapping.]

Linguistic Matching

• Heuristic name matching
– Tokenization of names: POOrderNum → PO, Order, Num
– Expansion of short-forms, acronyms: PO → Purchase, Order; Num → Number
– Clustering of schema elements based on keywords and data-types: Street, City, POAddress → Address
– Thesaurus of synonyms, hypernyms, acronyms
– Linguistic similarity coefficient (lsim) ∈ [0,1]

Structure Matching

[Figure: the PO and PurchaseOrder schema trees, with candidate correspondences drawn between elements such as Qty ↔ Quantity, UoM ↔ UnitofMeasure, Line ↔ ItemNumber, Item ↔ Item, POShipTo ↔ DeliverTo, and the City, Street and Name leaves.]

Structure Match: Mutually Reinforcing Similarity

[Figure: PO (POLines, Item, Line, Qty, UoM) matched against PurchaseOrder (Items, Item, ItemNum, Quantity, UnitofMeasure). Where Wsim > th_high for a pair of elements, the Ssim of their leaf pairs is incremented (Ssim++). Pairs shown: Qty ↔ Quantity, UoM ↔ UnitofMeasure, Line ↔ ItemNum, Item ↔ Item, POLines ↔ Items, PO ↔ PurchaseOrder.]

Structure Match: Context dependent disambiguation

[Figure: PO contains POShipTo and POBillTo, each with Street and City. PurchaseOrder contains DeliverTo and InvoiceTo, each with an Address holding Street and City. Ssim is incremented (Ssim++) for City pairs whose contexts match (POShipTo ↔ DeliverTo, POBillTo ↔ InvoiceTo) and decremented (Ssim--) for City pairs whose contexts do not match.]

Intuition

• Atomic elements are similar if
– Linguistically and data-type similar
– Their ancestors are similar
• Compound elements (non-leaf) are similar if
– Linguistically similar
– Subtrees rooted at the elements are similar
• Mutually recursive
– Leaves determine internal node similarity
– Similarity of internal nodes leads to increase in leaf similarity

Structure Match details

• Subtrees are similar if
– Immediate children are similar
– Leaf sets are similar
• Subtree Similarity (nodes s and t)
– Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t and vice-versa
– Less sensitive to variation in intermediate structure
• Pruning the number of comparisons
– Elements must have comparable number of leaves
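A minimal sketch of the leaf-based subtree similarity and the pruning rule described above, in Python. The dict-of-children tree representation, the `th_accept` cut-off of 0.5 and the `max_ratio` of 2 are assumptions made for the example, not parameters quoted from the talk.

```python
def leaves(tree, node):
    """Return the leaf names below `node` in a dict-of-children tree."""
    children = tree.get(node, [])
    if not children:
        return [node]
    result = []
    for child in children:
        result.extend(leaves(tree, child))
    return result


def subtree_similarity(leaves_s, leaves_t, leaf_sim, th_accept=0.5):
    """Fraction of leaves on both sides that can be mapped to the other side."""
    linked_s = [x for x in leaves_s
                if any(leaf_sim(x, y) >= th_accept for y in leaves_t)]
    linked_t = [y for y in leaves_t
                if any(leaf_sim(x, y) >= th_accept for x in leaves_s)]
    return (len(linked_s) + len(linked_t)) / (len(leaves_s) + len(leaves_t))


def comparable(leaves_s, leaves_t, max_ratio=2.0):
    """Pruning rule: only compare elements with a comparable number of leaves."""
    small, large = sorted((len(leaves_s), len(leaves_t)))
    return small > 0 and large / small <= max_ratio


# Toy trees; the nesting here is illustrative.
po = {"POShipTo": ["Street", "City"]}
purchase_order = {"DeliverTo": ["Address", "Name"], "Address": ["Street", "City"]}


def exact(a, b):
    """Stand-in leaf similarity: 1.0 for identical names, else 0.0."""
    return 1.0 if a == b else 0.0


ls = leaves(po, "POShipTo")
lt = leaves(purchase_order, "DeliverTo")
if comparable(ls, lt):
    print(subtree_similarity(ls, lt, exact))
```

Because only leaves are compared, the extra Address level under DeliverTo does not lower the score; only the unmatched Name leaf does.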

Referential Integrity

• Join nodes added to the schema tree for each referential integrity constraint
• Views can be similarly used

[Figure: Schema A contains Purchase Order (Order ID, Product Name, Customer ID) and Customer (Customer ID, Name, Address), connected by the join node Order-Customer-fk. Schema B contains a single Customer-Purchase-Order element.]
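One way to picture the join-node idea is the Python sketch below. It assumes a schema stored as a dict from table name to column list, and it assumes the join node takes the columns of both participating tables as its children; the column-to-table assignment follows one reading of the figure, and `add_join_nodes` is a hypothetical helper, not part of Cupid.

```python
def add_join_nodes(tree, foreign_keys):
    """Return a copy of `tree` with one join node per referential constraint.

    tree:         dict mapping table name -> list of column names
    foreign_keys: dict mapping constraint name -> (source table, target table)
    """
    augmented = {parent: list(children) for parent, children in tree.items()}
    for fk_name, (source, target) in foreign_keys.items():
        # Assumption: the join node's children are the columns of both tables,
        # with shared column names listed once.
        columns = list(tree[source])
        columns += [c for c in tree[target] if c not in columns]
        augmented[fk_name] = columns
    return augmented


schema_a = {
    "PurchaseOrder": ["OrderID", "ProductName", "CustomerID"],
    "Customer": ["CustomerID", "Name", "Address"],
}
constraints = {"Order-Customer-fk": ("PurchaseOrder", "Customer")}
print(add_join_nodes(schema_a, constraints)["Order-Customer-fk"])
```

With the join node in place, a structure matcher can compare Order-Customer-fk in Schema A against the single Customer-Purchase-Order element in Schema B like any other pair of subtrees.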

Cupid architecture

[Figure: Schema 1 and Schema 2 → Linguistic Matching (with Thesaurus) → Lsim → Structure Matching → Ssim, Wsim → Generate Mapping → Output Mapping.]

Linguistic Similarity (Lsim)

Element 1        Element 2        Lsim
InvoiceTo        BillTo           0.7
UoM              UnitMeasure      0.9
City             City             1.0

Structural (Ssim), Weighted (Wsim) similarity

Element 1        Element 2        Ssim   Wsim
InvoiceTo        BillTo           0.8    0.7
UoM              UnitMeasure      0.7    0.8
InvoiceTo/City   BillTo/City      0.8    0.9

Mapping Generation

• Individual mapping elements computed from Wsim values
– Consider only mapping pairs that have Wsim greater than threshold
– For each element of target find most similar source element
– Mappings that are not accepted but have high similarity are returned to help the user modify the map
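The three rules above can be sketched as a small Python function. The threshold value, the dictionary layout and the Wsim numbers in the example are made up for illustration; only the selection logic follows the slide.

```python
def generate_mapping(wsim, th_accept=0.5):
    """Pick the best source for each target from weighted similarities.

    wsim: dict mapping (source, target) -> weighted similarity in [0, 1]
    Returns (mapping, alternatives): mapping is target -> source, and
    alternatives lists above-threshold pairs that were not chosen.
    """
    best = {}
    for (source, target), score in wsim.items():
        if score <= th_accept:  # keep only pairs with Wsim greater than threshold
            continue
        if target not in best or score > best[target][1]:
            best[target] = (source, score)
    mapping = {target: source for target, (source, _) in best.items()}
    # Pairs that passed the threshold but lost to a better source are reported
    # so that a user can review or override the mapping.
    alternatives = sorted(
        ((s, t, sc) for (s, t), sc in wsim.items()
         if sc > th_accept and mapping.get(t) != s),
        key=lambda item: -item[2],
    )
    return mapping, alternatives


# Illustrative Wsim values (not from the talk).
scores = {
    ("POBillTo", "InvoiceTo"): 0.8,
    ("POShipTo", "InvoiceTo"): 0.6,
    ("POShipTo", "DeliverTo"): 0.75,
    ("POBillTo", "DeliverTo"): 0.4,
}
mapping, alternatives = generate_mapping(scores)
print(mapping)
print(alternatives)
```

Here InvoiceTo is mapped to POBillTo and DeliverTo to POShipTo, while the POShipTo/InvoiceTo pair is returned as a rejected but plausible alternative.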

Cupid Architecture

[Figure: Schema 1 and Schema 2 → Linguistic Matching (with Thesaurus) → Lsim → Structure Matching → Ssim, Wsim → Generate Mapping → Output Mapping, with an additional "Input hint" input.]

Experimental Validation

• DIKE
– Graph matching of ER models
– No Lsim component (LSPD entries)
• MOMIS
– Class level matching of OO descriptions
– Word senses manually chosen from WordNet

[Figure: MOMIS, DIKE and Cupid compared on canonical examples and real world examples.]

Evaluation Insights: Linguistic Similarity

• Cupid is less sensitive to name variations due to token level manipulations
• MOMIS is able to infer linguistic relationships based on intra-schema properties using Description Logic techniques
• MOMIS has an interface to WordNet
– Word senses need to be chosen manually
– Choosing a single sense is not always possible
• Matching performance without thesaurus depends on similarity of terms used and on available structure (tokenization helps Cupid)

Evaluation Insights: Structural Similarity

• DIKE and Cupid exploit structural similarity beyond the immediate neighborhood of schema elements
• Leaf structure for sub-tree similarity relaxes requirements on intermediate structure match
• Class-level structural similarity in MOMIS can be restrictive while matching schemas with different nesting
• Context-dependent matching in Cupid resolves mapping ambiguity
• Linguistic similarity with complete path names (and no structural similarity) is insufficient

Summary

• Taxonomy of schema matching approaches
• Cupid system that performs linguistic and structural matching
• New algorithm for exploiting schema structure
• Comparative evaluation

Future Work

• Towards a more robust solution
– Auto-tuning parameters
– Thesaurus generation and evolution
– More scalability testing
• Schema matching component architecture
– Easily extensible by adding multiple techniques
– Data instances for matching
– Mapping, expression and query discovery
• Model Management

Model Management

• Other recent publications
– A Model Theory for Generic Schema Management, DBPL 2001
– Generic Model Management: A Database Infrastructure for Schema Manipulation, CoopIS 2001
– A Vision for Management of Complex Models, SIGMOD Record, Dec 2000
– Data Warehouse Scenarios for Model Management, ER 2000
• More information
– http://data.cs.washington.edu/model/
– http://www.cs.washington.edu/homes/jayant
– MSR Technical Report
• Talk to us for a demo

End of the talk

Schema Matching

[Figure: source schema PO (Lines, Item, Line, Qty, Unit) mapped to target schema PurchaseOrder (Items, Item, ItemNumber, Quantity, Price, Count), annotated with the expressions below.]

For each Lines create Items
  For each Item create Item
    ItemNumber = concat("Itm", Line)
    Price = "Unknown"
    Quantity = Pounds2Kgs(Qty)
  Count = Number of Item in Lines
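To make the mapping expressions concrete, here is a hedged Python sketch that applies them to a PO document held as nested dicts. The document layout (a "Lines" list of items, and "Count" stored next to the item list) is an assumption made for the example; `pounds_to_kgs` stands in for the Pounds2Kgs function named on the slide.

```python
LB_TO_KG = 0.45359237  # kilograms per international pound


def pounds_to_kgs(pounds):
    """Stand-in for Pounds2Kgs: convert a weight in pounds to kilograms."""
    return pounds * LB_TO_KG


def transform_po(po):
    """Apply the mapping expressions to a PO document (assumed dict layout)."""
    items = []
    for item in po["Lines"]:                          # For each Item create Item
        items.append({
            "ItemNumber": "Itm" + str(item["Line"]),  # concat("Itm", Line)
            "Price": "Unknown",
            "Quantity": pounds_to_kgs(item["Qty"]),   # Pounds2Kgs(Qty)
        })
    # Count = number of Item elements in Lines
    return {"Items": {"Item": items, "Count": len(items)}}


po = {"Lines": [{"Line": 1, "Qty": 10, "Unit": "lb"},
                {"Line": 2, "Qty": 2.5, "Unit": "lb"}]}
print(transform_po(po))
```

The example shows why a match result alone is not enough for data translation: the element correspondences still have to be completed with expressions such as the unit conversion and the constant Price.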

Tree Match

Tree Match (Schema tree S, Schema tree T)
  For each pair of leaves, initialize ssim to be their data-type compatibility
  For each s in S (post order)
    For each t in T (post order)
      Compute ssim(s,t) = structural-similarity(s,t)
              wsim(s,t) = g(lsim(s,t), ssim(s,t))
      If (wsim(s,t) > th_high)
        Inc-struct-similarity(leaves(s), leaves(t))
      If (wsim(s,t) < th_low)
        Dec-struct-similarity(leaves(s), leaves(t))
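The pseudocode above can be turned into a runnable Python sketch. Several pieces are assumptions rather than details given on the slide: `g` is taken to be a weighted average of lsim and ssim, structural similarity is the fraction of leaves with a strong link to the other subtree, data-type compatibility is 0.5 for equal types and 0 otherwise, and all thresholds and factors (`w_struct`, `th_high`, `th_low`, `th_accept`, `c_inc`, `c_dec`) are illustrative values. The final recomputation of leaf wsim values is also an addition of this sketch.

```python
class Node:
    """A schema tree element; leaves carry a data type."""

    def __init__(self, name, dtype=None, children=()):
        self.name, self.dtype, self.children = name, dtype, list(children)

    def post_order(self):
        for child in self.children:
            yield from child.post_order()
        yield self

    def leaves(self):
        return [n for n in self.post_order() if not n.children]


def tree_match(S, T, lsim, w_struct=0.5, th_high=0.6, th_low=0.35,
               th_accept=0.5, c_inc=1.2, c_dec=0.9):
    """Return wsim for every (s, t) node pair, keyed by node objects."""
    ssim, wsim = {}, {}

    def g(l, s):  # assumed combination: weighted average
        return w_struct * s + (1 - w_struct) * l

    def strong(x, y):  # leaf pair counts as linked if its wsim is acceptable
        return g(lsim(x.name, y.name), ssim[x, y]) >= th_accept

    # Leaves start with their data-type compatibility (assumed 0.5 or 0.0).
    for s in S.leaves():
        for t in T.leaves():
            ssim[s, t] = 0.5 if s.dtype == t.dtype else 0.0

    for s in S.post_order():
        ls = s.leaves()
        for t in T.post_order():
            lt = t.leaves()
            if s.children or t.children:
                linked = (sum(any(strong(x, y) for y in lt) for x in ls)
                          + sum(any(strong(x, y) for x in ls) for y in lt))
                ssim[s, t] = linked / (len(ls) + len(lt))
            wsim[s, t] = g(lsim(s.name, t.name), ssim[s, t])
            if wsim[s, t] > th_high:
                factor = c_inc      # similar contexts reinforce their leaves
            elif wsim[s, t] < th_low:
                factor = c_dec      # dissimilar contexts weaken their leaves
            else:
                continue
            for x in ls:
                for y in lt:
                    ssim[x, y] = min(1.0, ssim[x, y] * factor)

    # Leaf ssim values changed after the leaf pairs were visited: refresh wsim.
    for s in S.leaves():
        for t in T.leaves():
            wsim[s, t] = g(lsim(s.name, t.name), ssim[s, t])
    return wsim


def toy_lsim(a, b):
    """Stand-in linguistic matcher with a tiny hand-written thesaurus."""
    same = {("Qty", "Quantity"), ("UoM", "UnitofMeasure"), ("PO", "PurchaseOrder")}
    return 1.0 if a == b or (a, b) in same else 0.0


line, qty, uom = Node("Line", "int"), Node("Qty", "int"), Node("UoM", "str")
item_s = Node("Item", children=[line, qty, uom])
po = Node("PO", children=[Node("POLines", children=[item_s])])

item_no, quantity = Node("ItemNumber", "int"), Node("Quantity", "int")
unit = Node("UnitofMeasure", "str")
item_t = Node("Item", children=[item_no, quantity, unit])
purchase_order = Node("PurchaseOrder", children=[Node("Items", children=[item_t])])

result = tree_match(po, purchase_order, toy_lsim)
print(result[item_s, item_t], result[qty, quantity], result[qty, item_no])
```

In this toy run the two Item elements end up with a high weighted similarity because two of their three leaves on each side are linked, which illustrates how leaf similarity drives the similarity of internal nodes.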

Tree Match (example)

[Figure: source schema elements PO, POLines, Item, Line, Qty, UoM, Count, POShipTo (City, Street), POBillTo (City, Street); target schema elements PurchaseOrder, Items, Item, ItemNumber, Quantity, UnitofMeasure, ItemCount, DeliverTo (Address: City, Street), InvoiceTo (Address: City, Street).]

Canonical Examples

Test case                                                      MOMIS  DIKE  Cupid
Identical schemas                                              Y      Y     Y
Attributes with identical names, but different data-types      Y      Y     Y
Attributes with same data-types, but slightly different names  Y      N     Y
Different class names, but same attribute names                N      Y     Y
Different nesting of schema elements                           N      Y     Y
Type substitution                                              N      Y     Y

Real world example

[Figure: two real world purchase order schemas.
CIDX Purchase Order elements: PO, POHeader, PODate, PONumber, Contact, ContactName, ContactEmail, ContactFunctionCode, ContactPhone, POBillTo and POShipTo (each with attn, entityIdentifier, Street1, Street2, Street3, Street4, City, StateProvince, PostalCode, Country), startAt, POLines, count, Item, line, partno, qty, uom, unitPrice.
Excel Purchase Order elements: PurchaseOrder, Header, orderNum, orderDate, ourAccountCode, yourAccountCode, DeliverTo, InvoiceTo, Address, street1, street2, street3, street4, city, stateProvince, postalCode, country, Contact, contactName, companyName, e-mail, telephone, Items, itemCount, Item, itemNumber, partNumber, yourPartNumber, partDescription, Quantity, unitOfMeasure, unitPrice, Footer, totalValue.]