Generic Schema Matching using Cupid

  • Generic Schema Matching using Cupid

    Jayant Madhavan, University of Washington
    Philip A. Bernstein, Microsoft Research
    Erhard Rahm, University of Leipzig

    VLDB 2001 Roma Italy

  • Schema Matching

  • The Problem

    Given two schemas, obtain a mapping between them that identifies corresponding elements

    A hard problem
      • Naming and structural differences in schemas
      • Similar, but non-identical concepts modeled
      • Multiple data models: SQL DDL, XML, ODMG

    Minimize user involvement (semi-automatic)

    Data model independent matching (generic)

  • Motivation

    Important component in many applications
      • Data Integration
      • Data Migration
      • E-Commerce

    Model Management [Bernstein, Halevy, Pottinger 00]
      • Algebra for manipulating models and mappings
      • Match, Merge, Compose

  • Schema Matching Approaches

    Combined matchers

    Taxonomy based survey [Rahm, Bernstein 00]

  • Related Work

    Hybrid approaches for schema integration
      • DIKE [Palopoli, Sacca, Ursino, Terracina]
      • MOMIS [Bergamaschi, Castano, Vincini]

    Linguistic and instance based
      • SEMINT, DELTA [Clifton, Hausman, Rosenthal, Li]

    Instance based, multi-strategy learning
      • LSD [Doan, Domingos, Halevy]

    Others
      • Hybrid rule based: Transcm [Milo, Zohar]
      • Query Discovery: CLIO [Haas, Hernandez, Miller]

  • Contributions

    Taxonomy of schema matching approaches

    Cupid system that exploits linguistic, data-type, structure and referential integrity information
      • New algorithm that exploits schema structure

    Experimental validation and comparison with other systems

  • Cupid architecture

    [Figure: Cupid architecture diagram, labeled LSIM, SSIM and WSIM]

  • Linguistic Matching

    Tokenization of names
      • POOrderNum → PO, Order, Num

    Expansion of short-forms, acronyms
      • PO → Purchase, Order; Num → Number

    Clustering of schema elements based on keywords and data-types
      • Street, City, POAddress → Address

    Thesaurus of synonyms, hypernyms, acronyms

    Linguistic Similarity coefficient (lsim) in [0,1]

    Heuristic name matching
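
The tokenization and expansion steps on this slide can be sketched in a few lines of Python. This is only an illustration: the abbreviation table `EXPANSIONS`, the regular expression, and the token-overlap (Jaccard) score used for `lsim` are assumptions made here, standing in for Cupid's thesaurus-based coefficient, which the slides do not spell out.

```python
import re

# Hypothetical abbreviation table; the slide only gives PO and Num as examples.
EXPANSIONS = {
    "po": ["purchase", "order"],
    "num": ["number"],
    "qty": ["quantity"],
}

# Token patterns, tried in order: an acronym followed by a capitalized word,
# a capitalized or lowercase word, a remaining run of capitals, a digit run.
TOKEN_RE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def tokenize(name):
    """Split an element name into tokens at case changes."""
    return TOKEN_RE.findall(name)


def expand(tokens):
    """Lowercase tokens and replace known short-forms with their expansions."""
    out = []
    for tok in tokens:
        out.extend(EXPANSIONS.get(tok.lower(), [tok.lower()]))
    return out


def lsim(name_a, name_b):
    """Token-overlap (Jaccard) similarity in [0, 1]; an assumed stand-in."""
    a = set(expand(tokenize(name_a)))
    b = set(expand(tokenize(name_b)))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


print(tokenize("POOrderNum"))                   # ['PO', 'Order', 'Num']
print(lsim("POOrderNum", "PurchaseOrderNumber"))  # 1.0
```

A set-based score ignores synonyms and hypernyms entirely, which is exactly what the thesaurus on the slide is for; a fuller version would compare tokens through thesaurus lookups rather than exact equality.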

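  • Structure Matching

    [Figure: two example schema trees. One is rooted at PO with elements POLines, Item, Line, Qty, UoM, City, Street; the other is rooted at PurchaseOrder with elements Items, Item, ItemNumber, Quantity, UnitofMeasure, Address, Name, City, Street]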

  • Structure Match

    Mutually Reinforcing Similarity

    [Figure: schema trees PO (POLines, Item, Line, Qty, UoM) and PurchaseOrder (Items, Item, ItemNum, Quantity, UnitofMeasure), with matching pairs annotated "Wsim > th_high"]

  • Structure Match

    Context dependent disambiguation

    [Figure: PO with POShipTo and POBillTo (each with Street, City) compared against PurchaseOrder with DeliverTo and InvoiceTo (each with Address, Street, City); the Address nodes are annotated "Ssim++"]

  • Intuition

    Atomic elements are similar if
      • They are linguistically and data-type similar
      • Their ancestors are similar

    Compound elements (non-leaf) are similar if
      • They are linguistically similar
      • The subtrees rooted at the elements are similar

    Mutually recursive
      • Leaves determine internal node similarity
      • Similarity of internal nodes leads to an increase in leaf similarity

  • Structure Match details

    Subtrees are similar if
      • Immediate children are similar
      • Leaf sets are similar

    Subtree Similarity (nodes s and t)
      • Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t, and vice-versa
      • Less sensitive to variation in intermediate structure

    Pruning the number of comparisons
      • Elements must have a comparable number of leaves
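
One way to read the leaf-based subtree similarity and the pruning rule is sketched below in Python. The acceptance threshold `th_accept`, the pruning ratio `max_ratio`, and the dictionary layout of `wsim` are assumptions for illustration; the slide gives only the idea of a fraction of mappable leaves.

```python
def comparable(leaves_s, leaves_t, max_ratio=2.0):
    """Pruning test: only compare nodes whose leaf counts are within max_ratio."""
    small, large = sorted((len(leaves_s), len(leaves_t)))
    return small > 0 and large <= max_ratio * small


def subtree_similarity(leaves_s, leaves_t, wsim, th_accept=0.5):
    """Fraction of leaves, counted on both sides, that map to the other side.

    wsim maps (leaf_of_s, leaf_of_t) pairs to a weighted similarity; a leaf
    counts as mapped if some pair involving it scores above th_accept.
    """
    total = len(leaves_s) + len(leaves_t)
    if total == 0:
        return 0.0
    mapped_s = sum(
        any(wsim.get((s, t), 0.0) > th_accept for t in leaves_t) for s in leaves_s
    )
    mapped_t = sum(
        any(wsim.get((s, t), 0.0) > th_accept for s in leaves_s) for t in leaves_t
    )
    return (mapped_s + mapped_t) / total


# Illustrative scores (not taken from the slides).
leaves_s = ["Line", "Qty", "UoM"]
leaves_t = ["ItemNumber", "Quantity", "UnitofMeasure"]
wsim = {
    ("Qty", "Quantity"): 0.9,
    ("UoM", "UnitofMeasure"): 0.8,
    ("Line", "ItemNumber"): 0.3,
}
print(subtree_similarity(leaves_s, leaves_t, wsim))  # 4 of 6 leaves are mapped
```

Because only leaves are counted, two subtrees with different intermediate nesting can still score highly, which is the "less sensitive to variation in intermediate structure" point on the slide.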

  • Referential Integrity

    Join nodes added to the schema tree for each referential integrity constraint

    Views can be similarly used

    [Figure: Schema A, with Purchase Order (Product Name, Order ID, Customer ID) and Customer (Customer ID, Name, Address)]
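
A rough Python sketch of adding join nodes follows. The slide says only that a join node is added per referential integrity constraint; the layout used here (a new node whose children are the columns of both participating tables) and the function name `add_join_nodes` are assumptions for illustration.

```python
def add_join_nodes(tables, foreign_keys):
    """Return a copy of the schema tree with one join node per foreign key.

    tables: dict mapping table name -> list of column names.
    foreign_keys: list of (referencing_table, referenced_table) pairs.
    """
    tree = {name: list(cols) for name, cols in tables.items()}
    for src, dst in foreign_keys:
        # Assumed layout: the join node's children are the columns of both
        # tables, qualified by table name so shared names stay distinct.
        children = [f"{src}.{col}" for col in tables[src]]
        children += [f"{dst}.{col}" for col in tables[dst]]
        tree[f"join({src},{dst})"] = children
    return tree


# Schema A from the slide, written with identifier-style names.
tables = {
    "PurchaseOrder": ["ProductName", "OrderID", "CustomerID"],
    "Customer": ["CustomerID", "Name", "Address"],
}
tree = add_join_nodes(tables, [("PurchaseOrder", "Customer")])
print(tree["join(PurchaseOrder,Customer)"])
```

With the join node in the tree, structure matching can treat the joined columns as one subtree, so a nested schema on the other side has something comparable to match against.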

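  • Cupid architecture

    [Figure: Linguistic Matching (with Thesaurus) produces Linguistic Similarity (Lsim); Structure Matching produces Structural (Ssim) and Weighted (Wsim) similarity; Generate Mapping produces the Output Mapping]

    Lsim
    Source element   Target element   Lsim
    InvoiceTo        BillTo           0.7
    UoM              UnitMeasure      0.9
    City             City             1.0

    Ssim, Wsim
    Source element   Target element   Ssim   Wsim
    InvoiceTo        BillTo           0.8    0.7
    UoM              UnitMeasure      0.7    0.8
    InvoiceTo/City   BillTo/City      0.8    0.9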

  • Mapping Generation

    Individual mapping elements computed from Wsim values
      • Consider only mapping pairs that have Wsim greater than a threshold

    For each element of the target, find the most similar source element

    Mappings that are not accepted but have high similarity are also returned, to help the user modify the map
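
The selection rule above can be sketched as follows in Python. The two thresholds `th_accept` and `th_suggest`, and the choice to return rejected candidates sorted by score, are assumptions for illustration rather than values from the talk.

```python
def generate_mapping(wsim, th_accept=0.5, th_suggest=0.4):
    """Pick the best source for each target and list near-miss candidates.

    wsim: dict mapping (source, target) -> weighted similarity.
    Returns (accepted, suggestions): accepted maps target -> source,
    suggestions lists rejected (source, target, score) triples, best first.
    """
    best = {}
    for (src, tgt), score in wsim.items():
        # Keep only pairs above the threshold, and only the top one per target.
        if score > th_accept and score > best.get(tgt, (None, 0.0))[1]:
            best[tgt] = (src, score)
    accepted = {tgt: src for tgt, (src, _) in best.items()}
    suggestions = sorted(
        (
            (src, tgt, score)
            for (src, tgt), score in wsim.items()
            if score > th_suggest and accepted.get(tgt) != src
        ),
        key=lambda item: -item[2],
    )
    return accepted, suggestions


# Illustrative scores; only the first and third echo values shown in the talk.
wsim = {
    ("InvoiceTo", "BillTo"): 0.7,
    ("DeliverTo", "BillTo"): 0.6,
    ("UoM", "UnitMeasure"): 0.8,
    ("Qty", "UnitMeasure"): 0.2,
}
accepted, suggestions = generate_mapping(wsim)
print(accepted)      # {'BillTo': 'InvoiceTo', 'UnitMeasure': 'UoM'}
print(suggestions)   # [('DeliverTo', 'BillTo', 0.6)]
```

Keeping the runner-up list separate from the accepted map lets a user override a choice without rerunning the matcher.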

  • Cupid Architecture

    [Figure: Linguistic Matching (with Thesaurus) produces Lsim; Structure Matching produces Ssim and Wsim; Generate Mapping produces the Output Mapping]

  • Experimental Validation

    DIKE
      • Graph Matching of ER models
      • No Lsim component (LSPD entries)

    MOMIS
      • Class Level Matching of OO descriptions
      • Word senses manually chosen from WordNet

  • Evaluation Insights: Linguistic Similarity

    Cupid is less sensitive to name variations due to token level manipulations

    MOMIS is able to infer linguistic relationships based on intra-schema properties using Description Logic techniques

    MOMIS has an interface to WordNet
      • Word senses need to be chosen manually
      • Choosing a single sense is not always possible

    Matching performance without thesaurus depends on similarity of terms used and on available structure (tokenization helps Cupid)

  • Evaluation Insights: Structural Similarity

    DIKE and Cupid exploit structural similarity beyond the immediate neighborhood of schema elements

    Leaf structure for sub-tree similarity relaxes requirements on intermediate structure match

    Class-level structural similarity in MOMIS can be restrictive while matching schemas with different nesting

    Context-dependent matching in Cupid resolves mapping ambiguity

    Linguistic similarity with complete path names (and no structural similarity) is insufficient

  • Summary

    Taxonomy of schema matching approaches

    Cupid system that performs linguistic and structural matching

    New algorithm for exploiting schema structure

    Comparative evaluation

  • Future Work

    Towards a more robust solution
      • Auto-tuning parameters
      • Thesaurus Generation and Evolution
      • More scalability testing

    Schema matching component architecture
      • Easily extensible by adding multiple techniques
      • Data Instances for matching
      • Mapping, Expression and Query Discovery

    Model Management

  • Model Management

    Other recent publications
      • A Model Theory for Generic Schema Management, DBPL 2001
      • Generic Model Management: A Database Infrastructure for Schema Manipulation, CoopIS 2001
      • A Vision for Management of Complex Models, SIGMOD Record, Dec 2000
      • Data Warehouse Scenarios for Model Management, ER 2000

    More information
      • http://data.cs.washington.edu/model/
      • http://www.cs.washington.edu/homes/jayant
      • MSR Technical Report

    Talk to us for a demo

  • End of the talk

  • Schema Matching

    For each Lines create Items
    For each Item create Item
    ItemNumber = concat(Itm, Line)
    Price = Unknown
    Quantity = Pounds2Kgs(Qty)
    Count = Number of Item in Lines

    [Figure: schema trees PO (Lines, Item, Line, Qty, Unit) and PurchaseOrder (Items, Item, ItemNumber, Quantity, Price, Count)]

  • Tree Match (Schema tree S, Schema tree T)

    For each pair of leaves, initialize ssim to be their data-type compatibility
    For each s in S (post order)
        For each t in T (post order)
            Compute ssim(s,t) = structural-similarity(s,t)
            wsim(s,t) = g(lsim(s,t), ssim(s,t))
            If (wsim(s,t) > th_high) Inc-struct-similarity(leaves(s), leaves(t))
            If (wsim(s,t) < th_low) Dec-struct-similarity(leaves(s), leaves(t))
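
The pseudocode on this slide can be turned into a small runnable Python sketch. Several pieces are assumptions made here because the slide leaves them open: `g` is taken to be a weighted average with weight `w_struct`, data-type compatibility is 0.5 for equal types and 0 otherwise, the increase and decrease steps multiply by `c_inc` and `c_dec`, and all threshold values are illustrative. The leaf-count pruning step is omitted for brevity, and `lsim` is passed in as a function of two element names.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based hashing so nodes can be dict keys
class Node:
    name: str
    dtype: str = ""  # data type; only meaningful for leaves
    children: list = field(default_factory=list)


def post_order(node):
    for child in node.children:
        yield from post_order(child)
    yield node


def leaves(node):
    return [n for n in post_order(node) if not n.children]


def tree_match(S, T, lsim, w_struct=0.5, th_high=0.6, th_low=0.35,
               th_accept=0.5, c_inc=1.2, c_dec=0.9):
    """Return wsim as a dict keyed by (source node, target node)."""
    ssim, wsim = {}, {}

    def weighted(s, t):
        # Assumed form of g: weighted average of structural and linguistic parts.
        return w_struct * ssim[s, t] + (1 - w_struct) * lsim(s.name, t.name)

    def structural_similarity(ls, lt):
        # Fraction of leaves on either side with a strong link to the other.
        hit_s = sum(any(weighted(x, y) > th_accept for y in lt) for x in ls)
        hit_t = sum(any(weighted(x, y) > th_accept for x in ls) for y in lt)
        return (hit_s + hit_t) / (len(ls) + len(lt))

    # Leaf pairs start from an assumed data-type compatibility score.
    for s in leaves(S):
        for t in leaves(T):
            ssim[s, t] = 0.5 if s.dtype == t.dtype else 0.0

    for s in post_order(S):
        for t in post_order(T):
            ls, lt = leaves(s), leaves(t)
            if s.children or t.children:
                ssim[s, t] = structural_similarity(ls, lt)
            wsim[s, t] = weighted(s, t)
            if wsim[s, t] > th_high:
                factor = c_inc   # reinforce the leaves under similar nodes
            elif wsim[s, t] < th_low:
                factor = c_dec   # weaken the leaves under dissimilar nodes
            else:
                continue
            for x in ls:
                for y in lt:
                    ssim[x, y] = min(1.0, ssim[x, y] * factor)

    # Leaf ssim values may have changed after a pair was visited; refresh them.
    for s in leaves(S):
        for t in leaves(T):
            wsim[s, t] = weighted(s, t)
    return wsim
```

Post-order traversal guarantees that every leaf pair under `s` and `t` has been scored before the internal pair `(s, t)` is reached, which is what lets the leaf-based structural similarity be computed in a single pass.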

  • Tree Match (example)

    [Figure: tree match example; extracted label text "POShipToItem"]

  • Canonical Examples

    Example                                                          MOMIS  DIKE  Cupid
    Identical schemas                                                Y      Y     Y
    Attributes with identical names, but different data-types       Y      Y     Y
    Attributes with same data-types, but slightly different names   Y      N     Y
    Different class names, but same attribute names                  N      Y     Y
    Different nesting of schema elements                             N      Y     Y
    Type substitution                                                N      Y     Y

  • Real world example

    [Figure: schema trees for the CIDX Purchase Order (rooted at PO, with POHeader, Contact, POShipTo, POBillTo and POLines) and the Excel Purchase Order (rooted at PurchaseOrder, with Header, Footer, Contact, Address, InvoiceTo, DeliverTo and Items). Figure labels: RDB Schema, Excel Purchase Order, CIDX Purchase Order]

    Cupid the match-maker
    Schema Matching.
    E-Commerce: two businesses willing to co-operate
    Naming differences (example)
    Structural differences (click)
    The problem is to identify corresponding elements in the two schemas
    Desiderata: reduce the need for domain experts by minimizing user involvement; data model independent way
    Uncomfortable about the bit about the hard problem
    DI: Multiple source schema descriptions have