Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach
AnHai Doan, Pedro Domingos, Alon Halevy
CS 652 Information Extraction and Information Integration
Problem & Solution
Problem
  Large-scale data integration systems
  Bottleneck: semantic mappings
Solution
  Multi-strategy learning
  Integrity constraints
  XML structure learner
  1-1 mappings
Learning Source Descriptions (LSD)
Components
  Base learners
  Meta-learner
  Prediction converter
  Constraint handler
Operating Phases
  Training phase
  Matching phase
Learners
Base Learners
  Name Matcher (Whirl)
  Content Matcher (Whirl)
  Naïve Bayes Learner
  County-Name Recognizer
  XML Learner
Meta-Learner (Stacking)
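The slide names stacking as the meta-learner but gives no detail. The following is a minimal sketch of only the combination step, assuming per-learner, per-label weights have already been learned during training. The function name `combine`, the learner names, and all numbers are hypothetical and chosen for illustration.

```python
def combine(base_predictions, weights):
    """Weighted sum of base-learner confidences, computed per label.

    base_predictions: {learner_name: {label: confidence}}
    weights:          {learner_name: {label: weight}} (assumed already learned)
    """
    labels = {label for preds in base_predictions.values() for label in preds}
    return {
        label: sum(
            weights[name].get(label, 0.0) * preds.get(label, 0.0)
            for name, preds in base_predictions.items()
        )
        for label in labels
    }


# Hypothetical predictions for one source tag from two base learners.
base_predictions = {
    "name_matcher": {"ADDRESS": 0.25, "DESCRIPTION": 0.75},
    "naive_bayes": {"ADDRESS": 0.75, "DESCRIPTION": 0.25},
}
# Hypothetical weights: Naive Bayes is trusted more for ADDRESS.
weights = {
    "name_matcher": {"ADDRESS": 0.25, "DESCRIPTION": 0.5},
    "naive_bayes": {"ADDRESS": 0.75, "DESCRIPTION": 0.5},
}
combined = combine(base_predictions, weights)
```

Keeping a separate weight per label lets the meta-learner trust a base learner for some labels and discount it for others.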
Naïve Bayes Learner
  Input instance = bag of tokens
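As an illustration of a learner whose input is a bag of tokens, here is a minimal multinomial Naive Bayes sketch with Laplace smoothing. It is not taken from the slides: the class name `NaiveBayesBagLearner`, the smoothing choice, and the toy training data are all assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesBagLearner:
    """Multinomial Naive Bayes over bags of tokens (Laplace smoothing)."""

    def fit(self, examples):
        # examples: iterable of (list_of_tokens, label) pairs
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in examples:
            self.label_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        # Returns {label: score}, with scores normalized to sum to 1.
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for t in tokens:
                score += math.log((self.token_counts[label][t] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())  # subtract the max for numerical stability
        exp = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(exp.values())
        return {l: v / z for l, v in exp.items()}


# Toy training data (made up for illustration).
train = [
    (["miami", "fl"], "ADDRESS"),
    (["boston", "ma"], "ADDRESS"),
    (["great", "view"], "DESCRIPTION"),
]
model = NaiveBayesBagLearner().fit(train)
pred = model.predict(["miami", "fl"])
```

Token order is ignored, which is what treating an instance as a bag means here.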
XML Learner
  Input instance = bag of tokens, including text tokens and structure tokens
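To make "text tokens and structure tokens" concrete, the sketch below builds such a bag from one XML element. The encoding is an assumption: it uses raw tag names for node and edge tokens, and the slides do not say how LSD actually generates its structure tokens. The sample element is made up.

```python
import xml.etree.ElementTree as ET


def bag_of_tokens(element):
    """Return text tokens plus structure tokens for one XML element."""
    tokens = []
    # Text tokens: lowercase words from all text nested under the element.
    for text in element.itertext():
        tokens.extend(text.lower().split())
    # Structure tokens (hypothetical encoding): one node token per nested
    # sub-element and one edge token per parent-child pair.
    for parent in element.iter():
        for child in parent:
            tokens.append(f"<node:{child.tag}>")
            tokens.append(f"<edge:{parent.tag}->{child.tag}>")
    return tokens


doc = ET.fromstring(
    "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"
)
tokens = bag_of_tokens(doc)
```

A bag built this way can be fed to the same kind of Naive Bayes classifier as plain text, so structure contributes evidence alongside the words.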
Domain Constraint Handler
Domain Constraints
  Impose semantic regularities on schemas and source data in the domain
  Can be specified at the beginning
    When creating a mediated schema
    Independent of any actual source schema
Constraint Handler
  Inputs: domain constraints + prediction converter output + users' feedback
  Output: mappings
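One way to picture the constraint handler is as a search for the best assignment of labels to source tags that violates no domain constraint. The sketch below uses exhaustive search for brevity and makes no claim about the search strategy LSD uses. The function names, the constraint `at_most_one`, and the confidence values are hypothetical.

```python
import math
from itertools import product


def best_mapping(predictions, constraints):
    """Pick the lowest-cost label assignment that satisfies all constraints.

    predictions: {source_tag: {label: confidence}}
    constraints: list of functions taking a candidate {source_tag: label}
                 dict and returning True when the candidate is acceptable.
    """
    tags = sorted(predictions)
    best, best_cost = None, math.inf
    for labels in product(*(sorted(predictions[t]) for t in tags)):
        candidate = dict(zip(tags, labels))
        if not all(ok(candidate) for ok in constraints):
            continue
        # Cost is the negative log of the product of confidences.
        cost = -sum(
            math.log(max(predictions[t][l], 1e-9)) for t, l in candidate.items()
        )
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best


def at_most_one(label):
    """Hypothetical domain constraint: at most one source tag maps to `label`."""
    return lambda m: sum(1 for v in m.values() if v == label) <= 1


predictions = {
    "area": {"ADDRESS": 0.6, "DESCRIPTION": 0.4},
    "extra-info": {"ADDRESS": 0.7, "DESCRIPTION": 0.3},
}
unconstrained = best_mapping(predictions, [])
constrained = best_mapping(predictions, [at_most_one("ADDRESS")])
```

Without the constraint both tags would map to ADDRESS. With it, the search keeps ADDRESS for the tag where the confidence is higher and reassigns the other.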
Training Phase
  Manually specify mappings for several sources
  Extract source data
  Create training data for each base learner
  Train the base learners
  Train the meta-learner
Example 1 (Cont.)
[Figure: Source Data and Training Data]
Example 1 (Cont.)
Source Data: (location: Miami, FL)
  (“location”, ADDRESS)
  (“Miami, FL”, ADDRESS)
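The example above can be sketched as a small function that turns one source data item and a manually specified mapping into one training example for a name-based learner and one for a content-based learner. The function name and the dictionary keys are hypothetical. Only the two resulting pairs come from the slide.

```python
def make_training_examples(tag, content, mapping):
    """Return one training example per kind of base learner for one data item."""
    label = mapping[tag]  # label assigned by the manually specified mapping
    return {
        "name_learner": (tag, label),         # learns from the tag name
        "content_learner": (content, label),  # learns from the data value
    }


mapping = {"location": "ADDRESS"}  # manually specified 1-1 mapping
examples = make_training_examples("location", "Miami, FL", mapping)
```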
Matching Phase
Extract and Collect Data
Match each Source-DTD Tag
Apply the Constraint Handler
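Between matching each tag and applying the constraint handler, predictions made for individual data instances have to be turned into one prediction per source tag, which is the job of the prediction converter listed among the components. The sketch below assumes a simple average over instances. The function name and the numbers are hypothetical.

```python
from collections import defaultdict


def convert_predictions(instance_predictions):
    """Average instance-level predictions into one prediction per source tag.

    instance_predictions: list of (source_tag, {label: confidence}) pairs.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for tag, preds in instance_predictions:
        counts[tag] += 1
        for label, conf in preds.items():
            sums[tag][label] += conf
    return {
        tag: {label: s / counts[tag] for label, s in labels.items()}
        for tag, labels in sums.items()
    }


# Two hypothetical data instances extracted for the same source tag.
instance_predictions = [
    ("area", {"ADDRESS": 0.75, "DESCRIPTION": 0.25}),
    ("area", {"ADDRESS": 0.5, "DESCRIPTION": 0.5}),
]
tag_predictions = convert_predictions(instance_predictions)
```

The per-tag scores are what a constraint handler would then take as input.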
Experimental Evaluation
Measures
  Matching accuracy of a source
  Average matching accuracy of a source
  Average matching accuracy of a domain
Experiment Results
  Average matching accuracy for different domains
  Contributions of base learners and domain constraint handler
  Contributions of schema information and instance information
  Performance sensitivity to the number of data instances
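The slide lists the measures without defining them. The sketch below assumes that the matching accuracy of a source is the fraction of its source tags whose predicted label agrees with the correct mapping, and that the averaged measures are plain means. Both function names and the sample data are hypothetical.

```python
def matching_accuracy(predicted, gold):
    """Fraction of source tags in `gold` whose predicted label is correct."""
    if not gold:
        return 0.0
    correct = sum(1 for tag, label in gold.items() if predicted.get(tag) == label)
    return correct / len(gold)


def average_accuracy(accuracies):
    """Mean of several accuracies, e.g. over runs of a source or sources of a domain."""
    return sum(accuracies) / len(accuracies)


# Hypothetical correct and predicted mappings for one source.
gold = {"location": "ADDRESS", "comments": "DESCRIPTION",
        "phone": "CONTACT", "cost": "PRICE"}
predicted = {"location": "ADDRESS", "comments": "DESCRIPTION",
             "phone": "PRICE", "cost": "PRICE"}
accuracy = matching_accuracy(predicted, gold)
mean_accuracy = average_accuracy([0.75, 0.25])
```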
Limitations
  Enough training data
  Domain-dependent learners
  Ambiguities in sources
  Efficiency
  Overlapping of schemas