Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach

AnHai Doan, Pedro Domingos, Alon Halevy

CS 652 Information Extraction and Information Integration

Problem & Solution

Problem
  • Large-scale data integration systems
  • Bottleneck: semantic mappings

Solution
  • Multi-strategy learning
  • Integrity constraints
  • XML structure learner
  • 1-1 mappings

Learning Source Descriptions (LSD)

Components
  • Base learners
  • Meta-learner
  • Prediction converter
  • Constraint handler

Operating Phases
  • Training phase
  • Matching phase

Learners

Base Learners
  • Name Matcher (WHIRL)
  • Content Matcher (WHIRL)
  • Naïve Bayes Learner
  • County-Name Recognizer
  • XML Learner

Meta-Learner (Stacking)

Naïve Bayes Learner

Input instance = bag of tokens
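The slide gives only the input representation, so the following is a minimal sketch of a Naïve Bayes classifier over bags of tokens, not the learner used in LSD. The tokenizer, the Laplace smoothing, the class name `NaiveBayesLearner`, and the PHONE label and sample values are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase and split on non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


class NaiveBayesLearner:
    """Hypothetical bag-of-tokens classifier with Laplace smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        # examples: iterable of (instance text, mediated-schema label)
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        # Returns a dict label -> normalized confidence score.
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        unnorm = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(unnorm.values())
        return {l: v / z for l, v in unnorm.items()}


learner = NaiveBayesLearner()
learner.train([
    ("Miami, FL", "ADDRESS"),
    ("Boston, MA", "ADDRESS"),
    ("(305) 729 0831", "PHONE"),
    ("(617) 253 1429", "PHONE"),
])
scores = learner.predict("Miami, FL")
```

Here `scores` maps each label to a confidence that sums to 1, which is the kind of output a meta-learner could then combine with the other base learners' predictions.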

XML Learner

Input instance = bag of tokens, including text tokens and structure tokens
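As a rough illustration of what "text tokens and structure tokens" could look like, the sketch below turns an XML fragment into a bag of tokens. The split of structure tokens into node tokens (child tag names) and edge tokens (parent-child pairs), the function name `xml_learner_tokens`, and the sample fragment are assumptions; the slide does not say how the tokens are generated.

```python
import xml.etree.ElementTree as ET


def xml_learner_tokens(xml_fragment):
    """Return a bag of (kind, value) tokens for one XML instance.

    Assumed token kinds:
      "text": a lowercased word from element text
      "node": the tag name of a sub-element
      "edge": a parent->child tag pair
    Text that follows a closing tag (ElementTree's .tail) is ignored here.
    """
    root = ET.fromstring(xml_fragment)
    tokens = []

    def visit(node):
        for word in (node.text or "").split():
            tokens.append(("text", word.lower()))
        for child in node:
            tokens.append(("node", child.tag))
            tokens.append(("edge", f"{node.tag}->{child.tag}"))
            visit(child)

    visit(root)
    return tokens


fragment = "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"
tokens = xml_learner_tokens(fragment)
```

The resulting bag can be fed to the same kind of bag-of-tokens classifier as plain text, which is why the two learners share an input representation.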

Domain Constraint Handler

Domain Constraints
  • Impose semantic regularities on schemas and source data in the domain
  • Can be specified at the beginning, when creating a mediated schema
  • Independent of any actual source schema

Constraint Handler
  • Inputs: domain constraints, prediction converter output, users’ feedback
  • Output: mappings
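One way to picture the constraint handler is as a search for the highest-scoring assignment of labels to source tags that violates no constraint. The sketch below uses exhaustive search purely for illustration; the scoring rule (product of confidences), the helper `at_most_one`, the tag names, and the DESCRIPTION label are assumptions, and user feedback could be modeled as one more constraint.

```python
from itertools import product


def best_mapping(predictions, constraints):
    """Pick the label assignment with the highest product of confidences
    among those that satisfy every constraint (brute force, small inputs only)."""
    tags = list(predictions)
    labels = sorted({l for scores in predictions.values() for l in scores})
    best, best_score = None, -1.0
    for combo in product(labels, repeat=len(tags)):
        mapping = dict(zip(tags, combo))
        if not all(check(mapping) for check in constraints):
            continue
        score = 1.0
        for tag, label in mapping.items():
            score *= predictions[tag].get(label, 0.0)
        if score > best_score:
            best, best_score = mapping, score
    return best


def at_most_one(label):
    # Hypothetical domain constraint: at most one source tag maps to `label`.
    return lambda mapping: list(mapping.values()).count(label) <= 1


predictions = {
    "area": {"ADDRESS": 0.6, "DESCRIPTION": 0.4},
    "extra-info": {"ADDRESS": 0.7, "DESCRIPTION": 0.3},
}
unconstrained = best_mapping(predictions, [])
constrained = best_mapping(predictions, [at_most_one("ADDRESS")])
```

Without constraints both tags are assigned ADDRESS; with the constraint, `area` falls back to DESCRIPTION because `extra-info` has the stronger ADDRESS score.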

Training Phase
  • Manually specify mappings for several sources
  • Extract source data
  • Create training data for each base learner
  • Train the base learners
  • Train the meta-learner
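The last training step can be sketched as learning one weight per base learner. The example below assumes stacking is done by least-squares regression of a 0/1 correctness target on the base learners' confidence scores, and it handles exactly two learners so the normal equations can be solved by hand. The function name and the training rows are made up for illustration.

```python
def fit_two_learner_weights(rows):
    """Least-squares weights (w1, w2) minimizing sum (y - w1*s1 - w2*s2)^2.

    rows: list of (s1, s2, y), where s1 and s2 are the two base learners'
    confidence scores for a label and y is 1 if that label is correct, else 0.
    """
    a = sum(s1 * s1 for s1, _, _ in rows)
    b = sum(s1 * s2 for s1, s2, _ in rows)
    c = sum(s2 * s2 for _, s2, _ in rows)
    d = sum(s1 * y for s1, _, y in rows)
    e = sum(s2 * y for _, s2, y in rows)
    det = a * c - b * b
    if det == 0:
        raise ValueError("scores are collinear; weights are not unique")
    return (d * c - b * e) / det, (a * e - b * d) / det


def combine(w1, w2, s1, s2):
    # Meta-learner prediction: weighted sum of the base learners' scores.
    return w1 * s1 + w2 * s2


# Hypothetical held-out predictions: learner 2 tracks the truth more closely.
rows = [(0.9, 0.8, 1), (0.7, 0.2, 0), (0.2, 0.9, 1), (0.4, 0.1, 0)]
w1, w2 = fit_two_learner_weights(rows)
```

On these rows the second learner receives the larger weight, which is the intended effect of stacking: learners that are more reliable for a label count for more in the combined prediction.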

Example1 (Training Phase)

Example1 (Cont.)

Source Data Training Data

Example1 (Cont.)

Source Data: (location: Miami, FL)

Training Data:
  • (“location”, ADDRESS)
  • (“Miami, FL”, ADDRESS)
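The example above suggests that each source instance yields one training example built from the tag name and one built from the data value. A minimal sketch of that step, with a hypothetical function name and data layout:

```python
def make_training_examples(source_instances, manual_mappings):
    """Build training data from extracted source data.

    source_instances: list of (source tag, data value) pairs
    manual_mappings:  dict source tag -> mediated-schema label
    Returns (name_examples, content_examples): the first pairs the tag name
    with its label, the second pairs the data value with its label.
    """
    name_examples, content_examples = [], []
    for tag, value in source_instances:
        label = manual_mappings[tag]
        name_examples.append((tag, label))
        content_examples.append((value, label))
    return name_examples, content_examples


name_examples, content_examples = make_training_examples(
    [("location", "Miami, FL")],
    {"location": "ADDRESS"},
)
# name_examples    == [("location", "ADDRESS")]
# content_examples == [("Miami, FL", "ADDRESS")]
```

Which base learner consumes which list is an assumption here: name-based examples would suit a learner that looks at tag names, and value-based examples would suit learners that look at data content.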

Matching Phase
  • Extract and collect data
  • Match each source-DTD tag
  • Apply the constraint handler
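To match a source-DTD tag, the predictions made for its individual data instances have to be turned into a single prediction for the tag. The sketch below assumes this is done by averaging the per-instance confidence scores; the function name, labels, and numbers are illustrative only.

```python
def combine_instance_predictions(instance_predictions):
    """Average per-instance label scores into one tag-level prediction.

    instance_predictions: non-empty list of dicts label -> confidence,
    one dict per data instance collected for the tag.
    """
    totals = {}
    for scores in instance_predictions:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    n = len(instance_predictions)
    return {label: total / n for label, total in totals.items()}


tag_prediction = combine_instance_predictions([
    {"ADDRESS": 0.8, "DESCRIPTION": 0.2},
    {"ADDRESS": 0.6, "DESCRIPTION": 0.4},
])
best_label = max(tag_prediction, key=tag_prediction.get)
```

The tag-level scores for all tags would then be passed to the constraint handler, which selects the final mappings.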

Example2 (Matching Phase)

Example2 (Cont.)

Example2 (Cont.)

Experimental Evaluation

Measures
  • Matching accuracy of a source
  • Average matching accuracy of a source
  • Average matching accuracy of a domain
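The slide names the measures without defining them. The sketch below assumes that the matching accuracy of a source is the fraction of its matchable tags that were mapped correctly, that a source's average is taken over repeated runs, and that a domain's average is taken over its sources. Function names and data are hypothetical.

```python
def matching_accuracy(predicted, correct):
    """Fraction of matchable source tags whose predicted label is correct.

    predicted: dict source tag -> predicted label
    correct:   dict source tag -> true label, or None if the tag is unmatchable
    """
    matchable = [tag for tag, label in correct.items() if label is not None]
    hits = sum(1 for tag in matchable if predicted.get(tag) == correct[tag])
    return hits / len(matchable)


def average(values):
    return sum(values) / len(values)


correct = {"location": "ADDRESS", "comments": "DESCRIPTION", "ad-id": None}
run_1 = {"location": "ADDRESS", "comments": "ADDRESS", "ad-id": "ADDRESS"}
run_2 = {"location": "ADDRESS", "comments": "DESCRIPTION", "ad-id": "ADDRESS"}

# Average matching accuracy of one source over two runs.
source_average = average([
    matching_accuracy(run_1, correct),
    matching_accuracy(run_2, correct),
])
# Average matching accuracy of a domain: mean of its sources' averages
# (0.5 stands in for a second, hypothetical source).
domain_average = average([source_average, 0.5])
```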

Experiment Results
  • Average matching accuracy for different domains
  • Contributions of base learners and domain constraint handler
  • Contributions of schema information and instance information
  • Performance sensitivity to the number of data instances

Limitations
  • Enough training data
  • Domain-dependent learners
  • Ambiguities in sources
  • Efficiency
  • Overlapping of schemas

Conclusion and Future Work

  • Improve over time
  • Extensible framework
  • Multiple types of knowledge
  • Non 1-1 mappings?