Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach

AnHai Doan, Pedro Domingos, Alon Halevy

CS 652 Information Extraction and Information Integration

Problem & Solution

Problem: large-scale data integration systems
Bottleneck: semantic mappings

Solution: multi-strategy learning, integrity constraints, XML structure learner, 1-1 mappings

Learning Source Descriptions (LSD)

Components: base learners, meta-learner, prediction converter, constraint handler

Operating phases: training phase, matching phase

Learners

Base learners: Name Matcher (WHIRL), Content Matcher (WHIRL), Naïve Bayes Learner, County-Name Recognizer, XML Learner

Meta-learner: stacking
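As a rough illustration of how a meta-learner might combine base-learner outputs, the sketch below takes a weighted sum of per-label confidence scores. The function name `combine_predictions`, the learner names, and the hand-set weights are all hypothetical. In stacking the weights would be learned from the base learners' predictions on training data, which this sketch does not show.

```python
def combine_predictions(base_predictions, weights):
    """Combine base-learner scores into one score per label.

    base_predictions: {learner_name: {label: confidence}}
    weights:          {(learner_name, label): weight}  (hypothetical, hand-set here)
    """
    combined = {}
    for learner, scores in base_predictions.items():
        for label, score in scores.items():
            w = weights.get((learner, label), 0.0)
            combined[label] = combined.get(label, 0.0) + w * score
    total = sum(combined.values())
    if total <= 0:
        return combined
    # Normalize so the combined scores sum to 1.
    return {label: s / total for label, s in combined.items()}


# Illustrative inputs only; these numbers are made up.
base_predictions = {
    "name_matcher": {"ADDRESS": 0.6, "DESCRIPTION": 0.4},
    "naive_bayes": {"ADDRESS": 0.8, "DESCRIPTION": 0.2},
}
weights = {
    ("name_matcher", "ADDRESS"): 0.3,
    ("name_matcher", "DESCRIPTION"): 0.3,
    ("naive_bayes", "ADDRESS"): 0.7,
    ("naive_bayes", "DESCRIPTION"): 0.7,
}
combined = combine_predictions(base_predictions, weights)
```

Keying the weights by (learner, label) lets one learner be trusted more for some labels than for others.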

Naïve Bayes Learner

Input instance = bag of tokens
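To make the bag-of-tokens idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with Laplace smoothing. The class name `NaiveBayesLearner`, the tokenizer, and the training strings are assumptions for illustration and are not taken from the slides.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"\w+", text.lower())


class NaiveBayesLearner:
    """Multinomial Naive Bayes over bags of tokens (illustrative sketch)."""

    def fit(self, examples):
        # examples: iterable of (text, label) pairs
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in examples:
            tokens = tokenize(text)
            self.label_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            # Laplace smoothing: add 1 to every token count.
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        # Convert log scores into confidences that sum to 1.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}


# Made-up training data for illustration.
learner = NaiveBayesLearner().fit([
    ("Miami, FL", "ADDRESS"),
    ("Boston, MA", "ADDRESS"),
    ("great house with view", "DESCRIPTION"),
    ("fantastic house", "DESCRIPTION"),
])
scores = learner.predict("Miami, FL")
```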

XML Learner

Input instance = bag of tokens, including text tokens and structure tokens
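One way to picture a bag that mixes text tokens and structure tokens is sketched below. The encoding is an assumption made for illustration: each nested element contributes its tag name, written as `<tag>`, as a structure token, and the words in its text become text tokens. The slides do not specify the actual encoding, and the sample XML is made up.

```python
import re
import xml.etree.ElementTree as ET


def xml_bag_of_tokens(xml_fragment):
    """Return text tokens plus structure tokens for one XML element instance."""
    root = ET.fromstring(xml_fragment)
    tokens = []
    for elem in root.iter():
        if elem is not root:
            # Structure token (hypothetical encoding): the nested tag name.
            tokens.append(f"<{elem.tag}>")
        # Text inside the element, plus text trailing a nested element.
        chunks = [elem.text]
        if elem is not root:
            chunks.append(elem.tail)
        for chunk in chunks:
            if chunk:
                tokens.extend(re.findall(r"\w+", chunk.lower()))
    return tokens


# Made-up instance for illustration.
tokens = xml_bag_of_tokens(
    "<contact><name>Jane Doe</name><firm>Acme Realty</firm></contact>"
)
```

A flat text learner would see only the four words. The structure tokens record that the words came from a `name` and a `firm` sub-element.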

Domain Constraint Handler

Domain constraints:
  Impose semantic regularities on schemas and source data in the domain
  Can be specified at the beginning, when creating a mediated schema
  Independent of any actual source schema

Constraint handler: domain constraints + prediction converter + users’ feedback + output mappings
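The role of a constraint handler can be sketched as a search for the highest-scoring assignment of labels to source tags that violates no domain constraint. The version below uses exhaustive search purely for brevity, which is not claimed to be the search strategy LSD uses. The names `best_mapping` and `at_most_one` and the scores are hypothetical.

```python
from itertools import product


def best_mapping(tag_scores, constraints):
    """Pick the best label per source tag subject to constraints.

    tag_scores:  {source_tag: {label: score}} (tag-level predictions)
    constraints: list of functions taking a mapping dict and returning bool
    """
    tags = list(tag_scores)
    best, best_score = None, float("-inf")
    # Enumerate every combination of one label per tag (exponential; sketch only).
    for labels in product(*(tag_scores[t] for t in tags)):
        mapping = dict(zip(tags, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = 1.0
        for tag, label in mapping.items():
            score *= tag_scores[tag][label]
        if score > best_score:
            best, best_score = mapping, score
    return best


def at_most_one(label):
    """Hypothetical domain constraint: at most one source tag maps to `label`."""
    return lambda mapping: list(mapping.values()).count(label) <= 1


# Made-up scores for illustration.
tag_scores = {
    "location": {"ADDRESS": 0.7, "DESCRIPTION": 0.3},
    "comments": {"ADDRESS": 0.6, "DESCRIPTION": 0.4},
}
mapping = best_mapping(tag_scores, [at_most_one("ADDRESS")])
```

Without the constraint both tags would map to ADDRESS. With it, the weaker of the two ADDRESS predictions gives way.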

Training Phase

Manually specify mappings for several sources
Extract source data
Create training data for each base learner
Train the base learners
Train the meta-learner

Example1 (Training Phase)

Example1 (Cont.)

[Figure: source data and training data]

Example1 (Cont.)

Source Data: (location: Miami, FL)

Training Data:
  (“location”, ADDRESS)
  (“Miami, FL”, ADDRESS)
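The example above suggests that one source field yields two kinds of training examples: one pairing the tag name with its label, and one pairing the data value with its label. A minimal sketch of that step follows. The function name `make_training_examples` and the dict-based row format are assumptions.

```python
def make_training_examples(source_rows, mapping):
    """Build training examples from source data and a manual mapping.

    source_rows: list of {source_tag: value} dicts (assumed row format)
    mapping:     {source_tag: mediated_schema_label}, specified by hand
    Returns (name_examples, content_examples).
    """
    # Examples for a learner that looks at tag names.
    name_examples = [(tag, label) for tag, label in mapping.items()]
    # Examples for learners that look at data values.
    content_examples = []
    for row in source_rows:
        for tag, value in row.items():
            if tag in mapping:
                content_examples.append((value, mapping[tag]))
    return name_examples, content_examples


name_examples, content_examples = make_training_examples(
    [{"location": "Miami, FL"}],
    {"location": "ADDRESS"},
)
# name_examples    == [("location", "ADDRESS")]
# content_examples == [("Miami, FL", "ADDRESS")]
```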

Matching Phase

Extract and collect data
Match each source-DTD tag
Apply the constraint handler
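Matching a source-DTD tag requires turning predictions made on individual data instances into one prediction for the tag. The sketch below averages the per-instance scores, which is one plausible way a prediction converter could work and is an assumption here. The function name `tag_level_prediction` and the numbers are hypothetical.

```python
def tag_level_prediction(instance_predictions):
    """Average per-instance label scores into one tag-level prediction.

    instance_predictions: list of {label: score} dicts, one per data
    instance collected for the same source tag.
    """
    if not instance_predictions:
        return {}
    totals = {}
    for scores in instance_predictions:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    n = len(instance_predictions)
    return {label: total / n for label, total in totals.items()}


# Made-up predictions for two instances of the same tag.
tag_scores = tag_level_prediction([
    {"ADDRESS": 0.9, "DESCRIPTION": 0.1},
    {"ADDRESS": 0.5, "DESCRIPTION": 0.5},
])
best_label = max(tag_scores, key=tag_scores.get)
```

The resulting tag-level scores are the kind of input a constraint handler would then refine.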

Example2 (Matching Phase)

Experimental Evaluation

Measures:
  Matching accuracy of a source
  Average matching accuracy of a source
  Average matching accuracy of a domain

Experiment results:
  Average matching accuracy for different domains
  Contributions of base learners and domain constraint handler
  Contributions of schema information and instance information
  Performance sensitivity to the number of data instances
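The accuracy measures can be sketched as follows, assuming that the matching accuracy of a source is the fraction of its matchable tags that receive the correct label and that a domain's average is the mean over its sources. The slides do not spell out these definitions, so treat both as assumptions. The function names and data are hypothetical.

```python
def matching_accuracy(predicted, gold):
    """Fraction of matchable source tags mapped to the correct label.

    predicted: {source_tag: predicted_label}
    gold:      {source_tag: correct_label} for matchable tags only
    """
    if not gold:
        return 0.0
    correct = sum(1 for tag, label in gold.items() if predicted.get(tag) == label)
    return correct / len(gold)


def average_accuracy(accuracies):
    """Mean of a list of accuracies (e.g. over runs or over sources)."""
    return sum(accuracies) / len(accuracies) if accuracies else 0.0


# Made-up mappings for one source.
gold = {"location": "ADDRESS", "comments": "DESCRIPTION",
        "cost": "PRICE", "phone": "AGENT-PHONE"}
predicted = {"location": "ADDRESS", "comments": "DESCRIPTION",
             "cost": "PRICE", "phone": "DESCRIPTION"}
source_acc = matching_accuracy(predicted, gold)    # 3 of 4 tags correct
domain_acc = average_accuracy([source_acc, 0.5])   # second value is made up
```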

Limitations

Enough training data
Domain-dependent learners
Ambiguities in sources
Efficiency
Overlapping of schemas

Conclusion and Future Work

Improve over time
Extensible framework
Multiple types of knowledge
Non 1-1 mapping?