Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan
University of Illinois, USA

Arnon Rosenthal
MITRE Corp., USA

eTuner: Tuning Schema Matching Software using Synthetic Scenarios



Main Points

Tuning matching systems: a long-standing problem
– becomes increasingly worse

We propose a principled solution
– exploits synthetic input/output pairs
– promising, though much work remains

Idea applicable to other contexts


Schema Matching

Schema 1
price      agent-name        address
120,000    George Bush       Crawford, TX
239,900    Hillary Clinton   New York City, NY

Schema 2
listed-price   contact-name   city      state
320K           Jane Brown     Seattle   WA
240K           Mike Smith     Miami     FL

1-1 match: e.g., price and listed-price
complex match: e.g., address and (city, state)
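The two match types on this slide can be made concrete with a small sketch. It is an illustration only: the pairing of attributes (price with listed-price, agent-name with contact-name, address with city plus state) is inferred from the attribute names, and `apply_matches` and the dictionary layout are hypothetical names, not part of any matching system.

```python
# Hypothetical representation of matches as "target attribute -> function
# of a source row". A 1-1 match copies one attribute; a complex match
# combines several attributes into one.

def apply_matches(row, matches):
    """Translate a Schema 2 row into Schema 1 attributes."""
    return {target: fn(row) for target, fn in matches.items()}

matches = {
    "price": lambda r: r["listed-price"],               # 1-1 match
    "agent-name": lambda r: r["contact-name"],          # 1-1 match
    "address": lambda r: f'{r["city"]}, {r["state"]}',  # complex match
}

schema2_row = {
    "listed-price": "320K",
    "contact-name": "Jane Brown",
    "city": "Seattle",
    "state": "WA",
}

translated = apply_matches(schema2_row, matches)
print(translated)
```

Finding such matches automatically is the schema matching problem; the sketch only shows what the output of a matcher lets you do.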


Schema Matching is Ubiquitous

Databases
– data integration
– model management
– data translation
– collaborative data sharing
– keyword querying, schema/view integration
– data warehousing, peer data management, …

AI
– knowledge bases, ontology merging, information gathering agents, ...

Web
– e-commerce, Deep Web, Semantic Web

eGovernment, bio-informatics, scientific data management


Current State of Affairs

Finding semantic mappings is now a key bottleneck!
– largely done by hand, labor intensive & error prone

Numerous matching techniques have been developed
– Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U Leipzig, U Wisconsin, NCSU, U Illinois, Washington, Humboldt-Universität zu Berlin, ...
– AI: Stanford, Karlsruhe University, NEC Japan, ...

Techniques are often synergistic, leading to multi-component matching architectures
– each component employs a particular technique
– final predictions combine those of the components


An Example: LSD [SIGMOD-01]

Schema 1
address        agent-name
Urbana, IL     James Smith
Seattle, WA    Mike Doan

Schema 2
area          contact-agent
Peoria, IL    (206) 634 9435
Kent, WA      (617) 335 4243

Name Matcher and Naive Bayes Matcher feed the Combiner
– e.g., for the names "agent name" and "contact agent", matcher scores 0.5 and 0.1 are combined into 0.3

Combiner output
area => (address, 0.7), (description, 0.3)
contact-agent => (agent-phone, 0.7), (agent-name, 0.3)
comments => (address, 0.6), (desc, 0.4)

Constraint Enforcer
– Only one attribute of Schema 2 matches address

Match Selector
area = address
contact-agent = agent-phone
...
comments = desc
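The flow on this slide (combine matcher scores, enforce a constraint, select matches) can be sketched as follows. This is a minimal illustration using the numbers from the slide; the averaging combiner is an assumption that happens to reproduce 0.3 from 0.5 and 0.1, and the greedy `select_with_unique_targets` function is a hypothetical stand-in, not LSD's actual constraint enforcer or match selector.

```python
def average_combiner(scores):
    """Combine per-matcher scores for one attribute pair by averaging."""
    return sum(scores) / len(scores)

def select_with_unique_targets(candidates, unique_targets):
    """Greedy selection (illustrative only): visit candidate matches by
    descending score, give each source attribute its best remaining
    target, and let each target in unique_targets be used at most once."""
    ranked = sorted(
        ((score, src, tgt)
         for src, cands in candidates.items()
         for tgt, score in cands),
        reverse=True,
    )
    chosen, taken = {}, set()
    for score, src, tgt in ranked:
        if src in chosen:
            continue  # source attribute already matched
        if tgt in unique_targets and tgt in taken:
            continue  # constraint: this target may match only one source
        chosen[src] = tgt
        taken.add(tgt)
    return chosen

# Combined scores as shown on the slide ("description" shortened to "desc").
candidates = {
    "area": [("address", 0.7), ("desc", 0.3)],
    "contact-agent": [("agent-phone", 0.7), ("agent-name", 0.3)],
    "comments": [("address", 0.6), ("desc", 0.4)],
}

combined = average_combiner([0.5, 0.1])
final = select_with_unique_targets(candidates, unique_targets={"address"})
print(combined, final)
```

Without the constraint, comments would be matched to address (0.6); with it, address goes to area and comments falls back to desc, as on the slide.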


Multi-Component Matching Solutions

Such systems are very powerful ...
– maximize accuracy; highly customizable to individual domain

... but place a serious tuning burden on domain users

[Figure: execution graphs of LSD, COMA, SF, and LSD-SF, each assembled from matchers (Matcher 1 … Matcher n), combiners, match selectors, and constraint enforcers]

Developed in many recent works
– e.g., Doan et al., WebDB-00, SIGMOD-01; Do & Rahm, VLDB-02; Embley et al.-02; Bernstein et al., SIGMOD Record-04; Madhavan et al. 05

Now commonly adopted, with industrial-strength systems
– e.g., Protoplasm [MSR], COMA++ [Univ of Leipzig]


Tuning Schema Matching Systems

Library of matching components
– Matchers: q-gram name matcher, decision tree matcher, Naïve Bayes matcher, TF/IDF name matcher, SVM matcher
– Combiners: average combiner, min combiner, max combiner, weighted sum combiner
– Match selectors: threshold selector, bipartite graph selector
– Constraint enforcers: A* search enforcer, relax. labeler, ILP

Execution graph
– Matcher 1 … Matcher n, combiner, match selector, constraint enforcer

Knobs of decision tree matcher
• Characteristics of attr.
• Post-prune?
• Size of validation set
• Split measure
• ...

Given a particular matching situation
– how to select the right components?
– how to adjust the multitude of knobs?

Untuned versions produce inferior accuracy, however ...
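To make "components" and "knobs" concrete, here is a minimal sketch in which a knob configuration picks a combiner from a small library and sets a selector threshold. The `COMBINERS` table, `run_config`, and the score values (other than 0.5 and 0.1) are hypothetical; only three of the four combiners named on the slide are included, and a real system has many more knobs.

```python
def average(scores):
    return sum(scores) / len(scores)

# A tiny library of combiners, keyed by name.
COMBINERS = {"average": average, "min": min, "max": max}

def run_config(matcher_scores, config):
    """Apply one knob configuration: combine the per-matcher scores of each
    attribute pair, then keep pairs at or above the threshold."""
    combine = COMBINERS[config["combiner"]]
    combined = {pair: combine(scores) for pair, scores in matcher_scores.items()}
    return {pair for pair, score in combined.items()
            if score >= config["threshold"]}

# Per-pair scores from two matchers (illustrative values).
matcher_scores = {
    ("contact-agent", "agent-name"): [0.5, 0.1],
    ("area", "address"): [0.8, 0.6],
}

with_max = run_config(matcher_scores, {"combiner": "max", "threshold": 0.5})
with_min = run_config(matcher_scores, {"combiner": "min", "threshold": 0.5})
print(with_max, with_min)
```

The same matcher scores yield different match sets under different configurations, which is why the choice of components and knob values matters.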


... Tuning is Extremely Difficult

Large number of knobs
– e.g., 8-29 in our experiments

Wide variety of techniques
– database, machine learning, IR, information theory, etc.

Complex interaction among components

Not clear how to compare the quality of knob configs

Matching systems are still tuned manually, by trial and error

Multiple component systems make tuning even worse

Developing efficient tuning techniques is crucial to making matching systems attractive in practice


The eTuner Solution

Given schema S & matching system M
– tunes M to maximize average accuracy of matching S with future schemas
– incurs virtually no cost to user

Key challenge 1: Evaluation
– must search for "best" knob config
– how to compute the quality of any knob config C?
– if knowing "ground-truth" matches for a representative workload W = {(S,T1), ..., (S,Tn)}, then can use W to evaluate C
– but often have no such W

Key challenge 2: Search
– how to efficiently evaluate the huge space of knob configs?
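The evaluation step, scoring a knob config C against a workload W with known matches, can be sketched as an average accuracy over the workload. This is a minimal illustration: F1 is used as the accuracy measure (an assumption here), and `name_matcher` with its single `case_sensitive` knob is a hypothetical toy matcher, not a component of any real system.

```python
def f1(predicted, truth):
    """F1 of a predicted match set against the ground-truth match set."""
    predicted, truth = set(predicted), set(truth)
    true_pos = len(predicted & truth)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(truth)
    return 2 * precision * recall / (precision + recall)

def config_quality(matcher, config, workload):
    """Average F1 of matcher under config over (S, T, truth) triples."""
    scores = [f1(matcher(s, t, config), truth) for s, t, truth in workload]
    return sum(scores) / len(scores)

def name_matcher(s, t, config):
    """Toy matcher: pair attributes whose names are equal."""
    norm = (lambda x: x) if config["case_sensitive"] else str.lower
    return {(a, b) for a in s for b in t if norm(a) == norm(b)}

workload = [
    (["id", "Name"], ["id", "name"], {("id", "id"), ("Name", "name")}),
    (["id", "Name"], ["ID", "Name"], {("id", "ID"), ("Name", "Name")}),
]

strict = config_quality(name_matcher, {"case_sensitive": True}, workload)
loose = config_quality(name_matcher, {"case_sensitive": False}, workload)
print(strict, loose)
```

With such a function, comparing two configurations reduces to comparing two numbers; the hard part, as the slide notes, is obtaining a workload with ground-truth matches.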


Key Idea: Generate Synthetic Input/Output Pairs

Need workload W = {(S,T1), (S,T2), …, (S,Tn)}

To generate W
– start with S
– perturb S to generate T1
– perturb S to generate T2
– etc.

Know the perturbation => know matches between S & Ti
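The point that knowing the perturbation gives the matches for free can be sketched as follows. This is a minimal illustration over column names only; the `perturb` rule (drop one random column, prefix the rest with "emp-") and all function names are hypothetical placeholders for a real rule set.

```python
import random

def perturb(schema, rng):
    """Return (perturbed_schema, matches). The matches map each new column
    name back to the original column it was derived from."""
    kept = list(schema)
    kept.remove(rng.choice(kept))             # drop one column at random
    matches = {f"emp-{col}": col for col in kept}  # rename and record origin
    perturbed = list(matches)
    rng.shuffle(perturbed)                    # column order may change too
    return perturbed, matches

def generate_workload(schema, n, seed=0):
    """Build W = [(S, T1, matches1), ..., (S, Tn, matchesn)]."""
    rng = random.Random(seed)
    return [(schema, *perturb(schema, rng)) for _ in range(n)]

S = ["id", "first", "last", "salary"]
W = generate_workload(S, n=3)
for _, t, matches in W:
    print(t, matches)
```

Because each Ti is produced by a recorded transformation of S, the ground-truth matches come out of the generator itself and no manual labeling is needed.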


Key Idea: Generate Synthetic Input/Output Pairs

Schema S

EMPLOYEES
id   first   last    salary ($)
1    Bill    Laup    40,000 $
2    Mike    Brown   60,000 $
3    Jean    Ann     30,000 $
4    Roy     Bond    70,000 $

Split S into V and U with disjoint data tuples

V: EMPLOYEES
id   first   last    salary ($)
1    Bill    Laup    40,000 $
2    Mike    Brown   60,000 $

U: EMPLOYEES
id   first   last    salary ($)
3    Jean    Ann     30,000$
4    Roy     Bond    70,000$

Perturb # of tables

Perturb # of columns in each table

EMPLOYEES
last    id   salary($)
Laup    1    40,000$
Brown   2    60,000$

Perturb column and table names

EMPS
emp-last   id   wage
Laup       1    40,000$
Brown      2    60,000$

Perturb data tuples in each table

V1: EMPS
emp-last   id   wage
Laup       1    45200
Brown      2    59328

Ω1: a set of semantic matches
EMPS.emp-last = EMPLOYEES.last
EMPS.id = EMPLOYEES.id
EMPS.wage = EMPLOYEES.salary($)

Result: (V1, U, Ω1); further perturbations of V yield V2, ..., Vn
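The walk-through on this slide (split S into V and U, then perturb V's columns, names, and data) can be sketched with the EMPLOYEES example. This is an illustrative sketch only: `split_disjoint` and `perturb_table` are hypothetical helpers, the split is a simple half/half cut, and the 10% Gaussian noise level is an arbitrary assumption.

```python
import random

EMPLOYEES = [
    {"id": 1, "first": "Bill", "last": "Laup", "salary": 40000},
    {"id": 2, "first": "Mike", "last": "Brown", "salary": 60000},
    {"id": 3, "first": "Jean", "last": "Ann", "salary": 30000},
    {"id": 4, "first": "Roy", "last": "Bond", "salary": 70000},
]

def split_disjoint(rows):
    """Split the tuples of S into two disjoint halves V and U."""
    half = len(rows) // 2
    return rows[:half], rows[half:]

def perturb_table(rows, drop, rename, noisy, rng, sigma=0.1):
    """Drop columns, rename columns, and add Gaussian noise to numeric
    columns. Returns the perturbed rows and the new-name -> old-name matches."""
    perturbed = []
    for row in rows:
        new_row = {}
        for col, val in row.items():
            if col in drop:
                continue
            if col in noisy:
                val = round(rng.gauss(val, sigma * val))
            new_row[rename.get(col, col)] = val
        perturbed.append(new_row)
    matches = {rename.get(c, c): c for c in rows[0] if c not in drop}
    return perturbed, matches

V, U = split_disjoint(EMPLOYEES)
V1, omega1 = perturb_table(
    V,
    drop={"first"},
    rename={"last": "emp-last", "salary": "wage"},
    noisy={"salary"},
    rng=random.Random(0),
)
print(V1, omega1)
```

Keeping U's tuples disjoint from V's means the matcher cannot succeed simply by spotting identical data values in both schemas.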


Examples of Perturbation Rules

Number of tables
– merge two tables based on a join path
– split a table into two

Structure of table
– merge two columns, e.g., neighboring columns, or columns sharing a prefix/suffix (last-name, first-name)
– drop a column
– swap location of two columns

Names of tables/columns
– rules capture common name transformations
– abbreviation to the first 3-4 characters, dropping all vowels, synonyms, dropping prefixes, adding table name to column name, etc.

Data values
– rules capture common format transformations: 12/4 => Dec 4
– values are changed based on some distributions (e.g., Gaussian)

See paper for details
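A few of the name and format rules listed above are simple enough to sketch directly. The function names are hypothetical, the date rule assumes a month/day reading of "12/4", and the actual rule set is the one described in the paper.

```python
import re

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def abbreviate(name, length=4):
    """Abbreviate a name to its first few characters."""
    return name[:length]

def drop_vowels(name):
    """Remove every vowel from a name."""
    return re.sub(r"[aeiouAEIOU]", "", name)

def drop_prefix(name, sep="-"):
    """Drop a leading prefix, e.g. agent-name -> name."""
    return name.split(sep, 1)[1] if sep in name else name

def add_table_name(column, table, sep="-"):
    """Prepend the table name to a column name."""
    return f"{table}{sep}{column}"

def reformat_date(value):
    """Rewrite a month/day value such as 12/4 as 'Dec 4'."""
    month, day = value.split("/")
    return f"{MONTHS[int(month) - 1]} {int(day)}"

print(abbreviate("employees", 3), drop_vowels("salary"),
      drop_prefix("agent-name"), add_table_name("last", "emp"),
      reformat_date("12/4"))
```

Each rule is a deterministic rewrite, so the generator can always record which original name or value a perturbed one came from.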


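The eTuner Architecture

[Figure: architecture diagram. Components shown: Schema S, Perturbation Rules, Workload Generator, Synthetic Workload consisting of the triples (U, Ω1, V1), (U, Ω2, V2), …, (U, Ωn, Vn), Staged Tuner, Tuning Procedures, Matching Tool M, and the output Tuned Matching Tool M. One input is marked "(Optional)".]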


The Staged Tuner

[Figure: execution graph (Matcher 1 … Matcher n, combiner, match selector, constraint enforcer) divided into Level 1 through Level 4, with an arrow showing the tuning direction]

Tune sequentially starting with lowest-level components

Assume
– execution graph has k levels, m nodes per level
– each node can be assigned one of n components
– each component has p knobs, each of which has q values

Then tuning examines npqkm out of (npq)^(km) knob configs
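The gap between the two counts on this slide, and the level-by-level idea behind it, can be sketched as follows. The values chosen for n, p, q, k, m are arbitrary, and `staged_tune` is a hypothetical simplification (it fixes the best option at each level before moving up), not the tuner described in the paper.

```python
def config_counts(n, p, q, k, m):
    """Configs examined by staged tuning vs. the full search space,
    using the formulas from the slide."""
    staged = n * p * q * k * m
    exhaustive = (n * p * q) ** (k * m)
    return staged, exhaustive

def staged_tune(levels, quality):
    """levels: candidate options per level, lowest level first.
    quality: scores a partial configuration (list of chosen options).
    Fixes the best option at each level, then moves to the next one."""
    chosen, evaluations = [], 0
    for options in levels:
        best, best_score = None, float("-inf")
        for option in options:
            score = quality(chosen + [option])
            evaluations += 1
            if score > best_score:
                best, best_score = option, score
        chosen.append(best)
    return chosen, evaluations

staged, exhaustive = config_counts(n=5, p=4, q=3, k=4, m=2)
print(staged, exhaustive)

# Toy quality function: each option contributes a fixed, made-up amount.
SCORES = {"qgram": 0.2, "tfidf": 0.4, "avg": 0.3, "max": 0.1, "min": 0.0,
          0.3: 0.1, 0.5: 0.2}
levels = [["qgram", "tfidf"], ["avg", "max", "min"], [0.3, 0.5]]
best_config, evaluations = staged_tune(levels, lambda cfg: sum(SCORES[c] for c in cfg))
print(best_config, evaluations)
```

In the toy run the staged search evaluates 2 + 3 + 2 = 7 configurations instead of all 2 × 3 × 2 = 12; the trade-off is that fixing lower levels first can miss interactions between levels.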


Empirical Evaluation

Domains

Domain        # schemas   # tables per schema   # attributes per schema   # tuples per table   reference paper
Real Estate   5           2                     30                        1000                 LSD (SIGMOD'01)
Courses       5           3                     13                        50                   LSD
Inventory     10          4                     10                        20                   Corpus (ICDE'05)
Product       2           2                     50                        120                  iMAP (SIGMOD'04)

Matching systems

LSD: 6 Matchers, 6 Combiners, 1 Constraint enforcer, 2 Match selectors, 21 Knobs

iCOMA: 10 Matchers, 4 Combiners, 2 Match selectors, 20 Knobs

SF: 3 Matchers, 1 Constraint enforcer, 2 Match selectors, 8 Knobs

LSD-SF: 7 Matchers, 7 Combiners, 1 Constraint enforcer, 2 Match selectors, 29 Knobs


Matching Accuracy

[Figure: four bar charts, one per matching system (LSD, COMA, SF, LSD-SF), showing accuracy (y-axis 0 to 0.9) for the Course, Inventory, Product, and Real Estate domains. Legend: Off-the-shelf, Domain-independent, Domain-dependent, Source-dependent, eTUNER: Automatic, eTUNER: Human-assisted.]

eTuner achieves higher accuracy than current best methods, at virtually no cost to the user


Cost of Using eTuner

You have a schema S and a matching system M

Vendor supplies eTuner
– will hook it up with matching system M

Vendor supplies a matching system M
– bundles eTuner inside


Sensitivity Analysis

Adding perturbation rules

Exploiting prior match results (enriching the workload)

[Figure: two plots. First plot: Accuracy (F1), y-axis 0.0 to 0.7, against Schemas in Synthetic Workload (#), x-axis 1, 10, 20, 25, 40, 50, with curves for Average, Inventory Domain, and Real Estate Domain. Second plot: y-axis 0.0 to 0.9 against Previous matches in collection (%), x-axis 0, 22, 44, 66, 88, for Tuned LSD.]


Summary: The eTuner Project @ Illinois

Tuning matching systems is crucial
– long-standing problem, is getting worse
– a next logical step in schema matching research

eTuner provides an automatic & principled solution
– generates a synthetic workload, employs it to tune efficiently
– incurs virtually no cost to human users
– exploits user assistance whenever available

Extensive experiments over 4 domains with 4 systems

Future directions
– find optimal synthetic workload
– apply to other matching scenarios
– adapt ideas to scenarios beyond schema matching (see 3rd speaker)


Backup: User Assistance

S(phone1, phone2, …)

Generate V by dropping phone2: V(phone1, …)

Rename phone1 in V: V(x, …)

Problem:
– x matches phone1, x does not match phone2

User:
– group phone1 and phone2
– so if x matches phone1, it will also match phone2

Intuition: tell the system not to bother trying to distinguish phone1 from phone2
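One way to read the grouping idea is as a relaxed correctness check during evaluation: a prediction counts as correct if it names any attribute in the same user-defined group as the true match. The sketch below illustrates that reading; `is_correct` and the group representation are assumptions, not the mechanism described in the paper.

```python
def is_correct(predicted, truth, groups=()):
    """True if predicted equals the true match, or if both fall into the
    same user-supplied group of interchangeable attributes."""
    if predicted == truth:
        return True
    return any(predicted in group and truth in group for group in groups)

# x was derived from phone1, so the recorded ground truth is phone1.
truth_for_x = "phone1"
user_groups = [{"phone1", "phone2"}]

without_help = is_correct("phone2", truth_for_x)
with_help = is_correct("phone2", truth_for_x, user_groups)
print(without_help, with_help)
```

Without the group, a matcher that maps x to phone2 is penalized even though the two attributes are indistinguishable in the synthetic schema; with it, that prediction is accepted.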