Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach

AnHai Doan, Pedro Domingos, Alon Halevy

CS 652 Information Extraction and Information Integration

Problem & Solution

Problem: Large-scale Data Integration Systems; Bottleneck: Semantic Mappings

Solution: Multi-strategy Learning, Integrity Constraints, XML Structure Learner, 1-1 Mappings


Learning Source Descriptions (LSD)

Components: Base learners, Meta-learner, Prediction converter, Constraint handler

Operating Phases: Training phase, Matching phase


Learners

Base Learners: Name Matcher (Whirl), Content Matcher (Whirl), Naïve Bayes Learner, County-Name Recognizer, XML Learner

Meta-Learner (Stacking)
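The combining step of a stacking meta-learner can be sketched as a weighted sum of the base learners' scores for each label. This is a minimal illustration, not the LSD implementation: the learner names, the weight table, and the assumption that weights were already learned (for example by regression on held-out predictions) are all hypothetical.

```python
def combine_predictions(predictions, weights):
    """Combine base-learner scores into one score per label.

    predictions: {learner_name: {label: score}}
    weights:     {(learner_name, label): weight}, assumed to be learned beforehand
    """
    combined = {}
    for learner, scores in predictions.items():
        for label, score in scores.items():
            w = weights.get((learner, label), 0.0)  # unknown pairs contribute nothing
            combined[label] = combined.get(label, 0.0) + w * score
    total = sum(combined.values())
    if total > 0:
        # Normalize so the combined scores sum to 1
        combined = {label: s / total for label, s in combined.items()}
    return combined


# Hypothetical scores from two base learners for one source tag
predictions = {
    "name_matcher": {"ADDRESS": 0.8, "DESCRIPTION": 0.2},
    "naive_bayes": {"ADDRESS": 0.6, "DESCRIPTION": 0.4},
}
weights = {(learner, label): 0.5
           for learner in predictions
           for label in ("ADDRESS", "DESCRIPTION")}
result = combine_predictions(predictions, weights)
```

With equal weights the sketch reduces to a plain average; learned weights let the meta-learner trust each base learner more on the labels it predicts well.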


Naïve Bayes Learner

Input instance = bag of tokens
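A minimal sketch of a Naïve Bayes classifier over bags of tokens may help make the input format concrete. It assumes whitespace tokenization and add-one (Laplace) smoothing; the class name, the labels, and the training strings are illustrative assumptions rather than details taken from LSD.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesLearner:
    """Multinomial Naive Bayes over bags of tokens (illustrative sketch)."""

    def __init__(self):
        self.label_counts = Counter()             # training instances per label
        self.token_counts = defaultdict(Counter)  # token frequencies per label
        self.vocab = set()

    def train(self, examples):
        # examples: iterable of (text, label) pairs
        for text, label in examples:
            tokens = text.lower().split()
            self.label_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, text):
        # Returns {label: probability}, normalized over the known labels
        tokens = text.lower().split()
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens:
                # Add-one smoothing keeps unseen tokens from zeroing the score
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}


# Hypothetical training instances
learner = NaiveBayesLearner()
learner.train([
    ("Miami, FL", "ADDRESS"),
    ("Boston, MA", "ADDRESS"),
    ("Nice house with great view", "DESCRIPTION"),
    ("Great location near beach", "DESCRIPTION"),
])
scores = learner.predict("Miami, FL")
```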


XML Learner

Input instance = bag of tokens, including text tokens and structure tokens
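One way to picture a bag that mixes text tokens and structure tokens is sketched below. The `node:` and `edge:` token formats, the function name, and the sample XML are assumptions made for illustration; the slides do not specify how structure tokens are encoded.

```python
import xml.etree.ElementTree as ET


def xml_tokens(xml_string):
    """Return text tokens and structure tokens for one XML instance."""
    root = ET.fromstring(xml_string)
    tokens = []

    def visit(node):
        if node.text and node.text.strip():
            tokens.extend(node.text.lower().split())        # text tokens
        for child in node:
            tokens.append(f"node:{child.tag}")               # structure token: a nested tag
            tokens.append(f"edge:{node.tag}->{child.tag}")   # structure token: parent-child link
            visit(child)

    visit(root)
    return tokens


# Hypothetical instance
tokens = xml_tokens("<contact><name>Jane Doe</name><phone>555 0100</phone></contact>")
```

The resulting bag can be fed to the same kind of token-based classifier as plain text, so nesting information contributes evidence alongside the words themselves.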


Domain Constraint Handler

Domain Constraints: Impose semantic regularities on schemas and source data in the domain. Can be specified at the beginning, when creating a mediated schema. Independent of any actual source schema.

Constraint Handler: Combines domain constraints, the prediction converter's output, and users' feedback to produce the output mappings.
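The constraint handler's job can be sketched as a search for the highest-scoring assignment of labels to source tags that violates no domain constraint. The sketch below uses exhaustive enumeration purely for clarity, which is a stand-in and not necessarily the search strategy LSD uses; the tags, scores, and the "at most one ADDRESS" constraint are hypothetical.

```python
from itertools import product


def best_mapping(tag_scores, constraints):
    """Pick the most probable tag->label mapping that satisfies every constraint.

    tag_scores:  {source_tag: {label: score}}
    constraints: list of functions mapping -> bool
    """
    tags = list(tag_scores)
    best, best_score = None, 0.0
    # Enumerate every combination of one label per tag
    for labels in product(*(tag_scores[t] for t in tags)):
        mapping = dict(zip(tags, labels))
        if not all(ok(mapping) for ok in constraints):
            continue
        score = 1.0
        for tag, label in mapping.items():
            score *= tag_scores[tag][label]
        if score > best_score:
            best, best_score = mapping, score
    return best


def at_most_one_address(mapping):
    # Hypothetical domain constraint: only one source tag may map to ADDRESS
    return list(mapping.values()).count("ADDRESS") <= 1


tag_scores = {
    "location": {"ADDRESS": 0.7, "DESCRIPTION": 0.3},
    "comments": {"ADDRESS": 0.6, "DESCRIPTION": 0.4},
}
unconstrained = best_mapping(tag_scores, [])
constrained = best_mapping(tag_scores, [at_most_one_address])
```

Without the constraint both tags would map to ADDRESS; with it, the weaker ADDRESS prediction is overridden. User feedback could be modeled the same way, as one more constraint function.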


Training Phase

Manually Specify Mappings for Several Sources

Extract Source Data

Create Training Data for each Base Learner

Train the Base Learners

Train the Meta-Learner


Example1 (Training Phase)


Example1 (Cont.)

[Figure: Source Data, Training Data]


Example1 (Cont.)

Source Data: (location: Miami, FL)

Training Data: (“location”, ADDRESS), (“Miami, FL”, ADDRESS)
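The step from source data to training data in this example can be sketched as follows. The function name and the split into name-based and content-based examples are assumptions for illustration; the input and output values are the ones shown in the example.

```python
def make_training_data(source_data, mappings):
    """Turn (tag, value) pairs plus manual mappings into training examples.

    source_data: list of (source_tag, data_value) pairs
    mappings:    {source_tag: mediated_schema_label}, specified manually
    """
    name_examples = []     # for learners that look at tag names
    content_examples = []  # for learners that look at data values
    for tag, value in source_data:
        label = mappings[tag]
        name_examples.append((tag, label))
        content_examples.append((value, label))
    return name_examples, content_examples


names, contents = make_training_data(
    [("location", "Miami, FL")],
    {"location": "ADDRESS"},
)
# names    -> [("location", "ADDRESS")]
# contents -> [("Miami, FL", "ADDRESS")]
```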


Matching Phase

Extract and Collect Data

Match each Source-DTD Tag

Apply the Constraint Handler
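Matching a source-DTD tag requires turning many instance-level predictions into a single prediction for the tag, which is the role of the prediction converter. A minimal sketch, assuming the converter simply averages the per-instance scores for each label (the function name and sample scores are hypothetical):

```python
def convert_predictions(instance_predictions):
    """Average per-instance label scores into one prediction for a source tag.

    instance_predictions: list of {label: score}, one dict per data instance
    """
    totals = {}
    for prediction in instance_predictions:
        for label, score in prediction.items():
            totals[label] = totals.get(label, 0.0) + score
    n = len(instance_predictions)
    return {label: total / n for label, total in totals.items()}


# Hypothetical predictions for two data instances of the same tag
tag_prediction = convert_predictions([
    {"ADDRESS": 0.8, "DESCRIPTION": 0.2},
    {"ADDRESS": 0.6, "DESCRIPTION": 0.4},
])
```

The per-tag predictions produced this way are what the constraint handler then receives as input.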


Example2 (Matching Phase)


Example2 (Cont.)


Experimental Evaluation

Measures: Matching accuracy of a source, Average matching accuracy of a source, Average matching accuracy of a domain

Experiment Results: Average matching accuracy for different domains; Contributions of base learners and domain constraint handler; Contributions of schema information and instance information; Performance sensitivity to the amount of data instances
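The accuracy measures can be sketched in a few lines, under the assumption that the matching accuracy of a source is the fraction of its matchable tags that are mapped correctly, and that the two averages are taken over repeated runs and over the sources of a domain respectively. These definitions and the sample values are assumptions, since the slides only name the measures.

```python
def matching_accuracy(predicted, gold):
    """Fraction of matchable source tags whose predicted label equals the gold label."""
    if not gold:
        raise ValueError("gold mapping must contain at least one tag")
    correct = sum(1 for tag, label in gold.items() if predicted.get(tag) == label)
    return correct / len(gold)


def average(values):
    """Plain mean, usable over runs of one source or over the sources of a domain."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical mappings for one source
gold = {"location": "ADDRESS", "comments": "DESCRIPTION"}
predicted = {"location": "ADDRESS", "comments": "ADDRESS"}
accuracy = matching_accuracy(predicted, gold)   # one of two tags is correct
domain_average = average([accuracy, 1.0])       # averaged with a second, perfect source
```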


Limitations: Enough Training Data, Domain-Dependent Learners, Ambiguities in Sources, Efficiency, Overlapping of Schemas


Conclusion and Future Work

Improve over time, Extensible framework, Multiple types of knowledge, Non 1-1 mappings?
