AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate...

23
AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Reconciling Schemas of Disparate Data Sources: Data Sources: A Machine Learning Approach A Machine Learning Approach The LSD Project The LSD Project

Transcript of AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate...

Page 1: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

AnHai Doan, Pedro Domingos, Alon Halevy

University of Washington

Reconciling Schemas of Disparate Data Sources: Reconciling Schemas of Disparate Data Sources: A Machine Learning ApproachA Machine Learning Approach

The LSD ProjectThe LSD Project

Page 2: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

2

Data IntegrationData Integration

Find houses with four bathrooms priced under $500,000

mediated schema

homes.comrealestate.com

source schema 2

homeseekers.com

wrapper wrapperwrapper

source schema 3source schema 1

Page 3: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

3

Semantic Mappings between SchemasSemantic Mappings between Schemas

Mediated & source schemas = XML DTDs

house

location contact

house

address

name phone

num-baths

full-baths half-baths

contact-info

agent-name agent-phone

1-1 mapping non 1-1 mapping

Page 4: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

4

Current State of AffairsCurrent State of Affairs

Finding semantic mappings is now the bottleneck!– largely done by hand– labor intensive & error prone

Will only be exacerbated– data sharing & XML become pervasive– proliferation of DTDs– translation of legacy data– reconciling ontologies on the semantic web

Need (semi-)automatic approaches to scale up!

Page 5: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

5

Suppose user wants to integrate 100 data sources

1. User – manually creates mappings for a few sources, say 3– shows LSD these mappings

2. LSD learns from the mappings

3. LSD proposes mappings for remaining 97 sources

The LSD (The LSD (LLearning earning SSource ource DDescriptions) escriptions) Approach Approach

Page 6: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

6

listed-price $250,000 $110,000 ...

address price agent-phone description

Example Example

location Miami, FL Boston, MA ...

phone(305) 729 0831(617) 253 1429 ...

commentsFantastic houseGreat location ...

realestate.com

location listed-price phone comments

Schema of realestate.com

If “fantastic” & “great”

occur frequently in data values =>

description

Learned hypotheses

price $550,000 $320,000 ...

contact-phone(278) 345 7215(617) 335 2315 ...

extra-infoBeautiful yardGreat beach ...

homes.com

If “phone” occurs in the name =>

agent-phone

Mediated schema

Page 7: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

7

Our ContributionsOur Contributions

1. Use of multi-strategy learning– well-suited to exploit multiple types of knowledge– highly modular & extensible

2. Extend learning to incorporate constraints– handle a wide range of domain & user-specified constraints

3. Develop XML learner– exploit hierarchical nature of XML

Page 8: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

8

Multi-Strategy LearningMulti-Strategy Learning

Use a set of base learners– each exploits well certain types of information

Match schema elements of a new source– apply the base learners– combine their predictions using a meta-learner

Meta-learner– uses training sources to measure base learner accuracy– weighs each learner based on its accuracy

Page 9: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

9

Base LearnersBase Learners Input

– schema information: name, proximity, structure, ...– data information: value, format, ...

Output– prediction weighted by confidence score

Examples– Name learner

– agent-name => (name,0.7), (phone,0.3)

– Naive Bayes learner– “Kent, WA” => (address,0.8), (name,0.2)– “Great location” => (description,0.9), (address,0.1)

Page 10: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

10

<location> Boston, MA </> <listed-price> $110,000</> <phone> (617) 253 1429</> <comments> Great location </>

<location> Miami, FL </> <listed-price> $250,000</> <phone> (305) 729 0831</> <comments> Fantastic house </>

Training the Learners Training the Learners

Naive Bayes Learner

(location, address)(listed-price, price)(phone, agent-phone)(comments, description) ...

(“Miami, FL”, address)(“$ 250,000”, price)(“(305) 729 0831”, agent-phone)(“Fantastic house”, description) ...

realestate.com

Name Learner

address price agent-phone description

Schema of realestate.com

Mediated schema

location listed-price phone comments

Page 11: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

11

<extra-info>Beautiful yard</><extra-info>Great beach</><extra-info>Close to Seattle</>

<day-phone>(278) 345 7215</><day-phone>(617) 335 2315</><day-phone>(512) 427 1115</>

<area>Seattle, WA</><area>Kent, WA</><area>Austin, TX</>

Applying the LearnersApplying the Learners

Name LearnerNaive Bayes

Meta-Learner

(address,0.8), (description,0.2)(address,0.6), (description,0.4)(address,0.7), (description,0.3)

(address,0.6), (description,0.4)

Meta-LearnerName LearnerNaive Bayes

(address,0.7), (description,0.3)

(agent-phone,0.9), (description,0.1)

address price agent-phone description

Schema of homes.com Mediated schema

area day-phone extra-info

Page 12: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

12

Domain ConstraintsDomain Constraints

Impose semantic regularities on sources– verified using schema or data

Examples– a = address & b = address a = b– a = house-id a is a key– a = agent-info & b = agent-name b is nested in a

Can be specified up front– when creating mediated schema– independent of any actual source schema

Page 13: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

13

area: address contact-phone: agent-phoneextra-info: description

area: address contact-phone: agent-phoneextra-info: address

area: (address,0.7), (description,0.3)contact-phone: (agent-phone,0.9), (description,0.1)extra-info: (address,0.6), (description,0.4)

The Constraint HandlerThe Constraint Handler

Can specify arbitrary constraints User feedback = domain constraint

– ad-id = house-id Extended to handle domain heuristics

– a = agent-phone & b = agent-name a & b are usually close to each other

0.30.10.40.012

0.70.90.60.378

0.70.90.40.252

Domain Constraintsa = address & b = adderss a = b

Predictions from Meta-Learner

Page 14: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

14

Putting It All Together: the LSD SystemPutting It All Together: the LSD System

L1 L2 Lk

Mediated schema

Source schemas

Data listings

Training datafor base learners Constraint Handler

Mapping Combination

User Feedback

Domain Constraints

Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner Meta-learner

– uses stacking [Ting&Witten99, Wolpert92]– returns linear weighted combination of base learners’ predictions

Matching PhaseTraining Phase

Page 15: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

15

Empirical EvaluationEmpirical Evaluation

Four domains– Real Estate I & II, Course Offerings, Faculty Listings

For each domain– create mediated DTD & domain constraints– choose five sources– extract & convert data listings into XML– mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48

Ten runs for each experiment - in each run:– manually provide 1-1 mappings for 3 sources– ask LSD to propose mappings for remaining 2 sources– accuracy = % of 1-1 mappings correctly identified

Page 16: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

16

High Matching AccuracyHigh Matching Accuracy

0

10

20

30

40

50

60

70

80

90

100

Real Estate I Real Estate II CourseOfferings

FacultyListings

LSD’s accuracy: 71 - 92%

Best single base learner: 42 - 72%

+ Meta-learner: + 5 - 22%

+ Constraint handler: + 7 - 13%

+ XML learner: + 0.8 - 6%

Ave

rage

Mat

chin

g A

cccu

racy

(%

)

Page 17: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

17

Performance SensitivityPerformance Sensitivity

40

50

60

70

80

90

100

0 100 200 300 400 500

Ave

rage

mat

chin

g ac

cura

cy (

%)

Number of data listings per source

Page 18: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

18

0

10

20

30

40

50

60

70

80

90

100

Real Estate I Real Estate II Course Offerings Faculty Listings

Contribution of Schema vs. DataContribution of Schema vs. Data

More experiments in the paper!

Ave

rage

mat

chin

g ac

cura

cy (

%)

Page 19: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

19

Related WorkRelated Work

Rule-based approaches– TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99],

[Palopoli et. al. 98], CUPID [Madhavan et. al. 01]– utilize only schema information

Learner-based approaches– SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]– employ a single learner, limited applicability

Others– DELTA [Clifton et. al. 97], CLIO [Miller et. al. 00][Yan et. al. 01]

Multi-strategy learning in other domains– series of workshops [91,93,96,98,00]– [Freitag98], Proverb [Keim et. al. 99]

Page 20: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

20

SummarySummary

LSD project– applies machine learning to schema matching

Main ideas & contributions– use of multi-strategy learning– extend learning to handle domain & user-specified constraints– develop XML learner

System design: A contribution to generic schema-matching – highly modular & extensible– handle multiple types of knowledge– continuously improve over time

Page 21: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

21

Ongoing & Future WorkOngoing & Future Work

Improve accuracy– address current system limitations

Extend LSD to more complex mappings Apply LSD to other application contexts

– data translation– data warehousing– e-commerce– information extraction– semantic web

www.cs.washington.edu/homes/anhai/lsd.html

Page 22: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

22

Contribution of Each ComponentContribution of Each Component

0

20

40

60

80

100

Real Estate I Course Offerings Faculty Listings Real Estate II

Ave

rage

Mat

chin

g A

cccu

racy

(%

)

Without Name Learner

Without Naive Bayes

Without Whirl Learner

Without Constraint Handler

The complete LSD system

Page 23: AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

23

Existing learners flatten out all structures

Developed XML learner– similar to the Naive Bayes learner

– input instance = bag of tokens

– differs in one crucial aspect– consider not only text tokens, but also structure tokens

Exploiting Hierarchical Structure Exploiting Hierarchical Structure

<description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors.</description>

<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm></contact>