

Statistical Schema Matching across Web Query Interfaces

Bin He, Kevin Chen-Chuan Chang

SIGMOD 2003


Background: Large-Scale Integration of the Deep Web

[Figure labels: Query, Result, The Deep Web]


Challenge: matching query interfaces (QIs)

[Figure: example query interfaces from the Book domain and the Music domain]


Traditional approaches to schema matching: Pairwise Attribute Correspondence

Scale is a challenge
  Only small scale
  Large scale is a must for our task
Scale is an opportunity
  Useful context

Pairwise Attribute Correspondence
  S1: author, title, subject, ISBN
  S2: writer, title, category, format
  S3: name, title, keyword, binding
  S1.author – S3.name
  S1.subject – S2.category


Deep Web Observation

Proliferating sources

Converging vocabularies


A hidden schema model exists?

Our View (Hypothesis):

[Figure: a hidden model M (finite vocabulary, statistical model) generates QIs with different probabilities P]

Instantiation probability of an interface QI1: P(QI1|M)


A hidden schema model exists? (Cont.)

Our View (Hypothesis): a hidden model M (finite vocabulary, statistical model) generates QIs with different probabilities P; the instantiation probability of QI1 is P(QI1|M)

Now the problem is: given QIs, can we discover M?


MGS framework & Goal

Framework:
  Hypothesis modeling
  Hypothesis generation
  Hypothesis selection

Goal:
  Verify the phenomena
  Validate MGSsd with two metrics


Comparison with Related Work

            | Related Work                                          | Authors' Work
Paradigms   | Match two input sources                               | Match many sources
Techniques  | Machine learning, constraint-based, hybrid ones       | Statistical approach
Input data  | Relational or structured schemas with inconsistency   | Interfaces with consistency
Focuses     | Name match, structure match, etc.                     | Synonym discovery


Outline

MGS
MGSsd: Hypothesis Modeling, Generation, Selection
Deal with Real World Data
Final Algorithm
Case Study
Metrics
Experimental Results
Conclusion and Future Issues
My Assessment


Towards hidden model discovery: Statistical schema matching (MGS)

1. Define the abstract model structure M to solve a target question
   P(QI|M) = …

2. Given QIs, generate the model candidates
   P(QIs|M) > 0
   [Figure: two candidate models, M1 and M2]

3. Select the candidate with the highest confidence
   What is the confidence of a candidate such as M1, given the QIs?


MGSsd: Specialize MGS for Synonym Discovery

MGS is generally applicable to a wide range of schema matching tasks
  E.g., attribute grouping

Focus: discover synonym attributes
  Author – Writer, Subject – Category
  No hierarchical matching: query interface as flat schema
  No complex matching: (LastName, FirstName) – Author


Hypothesis Modeling: Structure

Goal: capture synonym relationship

Two-level model structure
  Concepts: mutually independent
  Attributes of a concept: mutually exclusive
  No overlapping concepts

Possible schemas: I1={author, title, subject, ISBN}, I2={title, category, ISBN}
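As a rough illustration of the two-level structure, the sketch below enumerates every schema a model can instantiate: each concept is either skipped or contributes exactly one of its attributes. The concept grouping in `CONCEPTS` is an assumption (the slide's figure is not in the transcript), chosen so that the two example schemas I1 and I2 come out as valid instances; `possible_schemas` is a hypothetical helper name.

```python
from itertools import product

# Assumed concept grouping (hypothetical; the original figure is not available).
# Each inner list is one concept; its attributes are treated as synonyms.
CONCEPTS = [
    ["author"],
    ["title"],
    ["subject", "category"],
    ["ISBN"],
]


def possible_schemas(concepts):
    """Enumerate all schemas the model can instantiate.

    Concepts are mutually independent, so every combination of per-concept
    choices is allowed. Attributes within a concept are mutually exclusive,
    so each concept contributes at most one attribute (None = concept absent).
    """
    choices = [[None] + list(attrs) for attrs in concepts]
    for pick in product(*choices):
        yield frozenset(a for a in pick if a is not None)


schemas = set(possible_schemas(CONCEPTS))
print(len(schemas))  # 2 * 2 * 3 * 2 = 24, counting the empty schema
print(frozenset({"author", "title", "subject", "ISBN"}) in schemas)  # I1
print(frozenset({"title", "category", "ISBN"}) in schemas)           # I2
print(frozenset({"subject", "category"}) in schemas)  # two synonyms together
```

Under this grouping a schema containing both subject and category is never produced, which is exactly the mutual-exclusion property the model relies on to recognise synonyms.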


Hypothesis Modeling: Formula

Definition and Formula:

Probability that M can generate schema I:
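The formula itself did not survive in this transcript. A plausible reconstruction, inferred from the worked examples on the next slide and therefore an assumption rather than the authors' exact notation, writes $\alpha_i = P(C_i \mid M)$ for the probability that concept $C_i$ is used and $\beta_j = P(A_j \mid C_i)$ for the probability that attribute $A_j$ is chosen to express it:

```latex
P(A_j \mid M) = \alpha_i \, \beta_j \quad (A_j \in C_i), \qquad
P(I \mid M) =
\begin{cases}
0, & \text{if some concept contributes two or more attributes to } I,\\[4pt]
\displaystyle \prod_{A_j \in I} P(A_j \mid M)
\prod_{C_i \,:\, C_i \cap I = \emptyset} \bigl(1 - \alpha_i\bigr), & \text{otherwise.}
\end{cases}
```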


Hypothesis Modeling: Instantiation probability

1. Observing an attribute
   P(author|M) = P(C1|M) * P(author|C1) = α1 * β1

2. Observing a schema
   P({author, ISBN, subject}|M) = P(author|M) * P(ISBN|M) * P(subject|M) * (1 – P(C2|M))

3. Observing a schema set
   P(QIs|M) = Π P(QIi|M)
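The three cases above can be sketched in a few lines of Python. The model below is hypothetical: the concept grouping is guessed so that C2 (title) is the only concept missing from {author, ISBN, subject}, as in the schema example, and all α and β values are made up for illustration. `schema_probability` and `observation_probability` are illustrative names, not the authors' code.

```python
from math import prod

# Hypothetical model: concept name -> (alpha, {attribute: beta}).
# alpha = P(concept | M); beta = P(attribute | concept), summing to 1 per concept.
MODEL = {
    "C1": (0.9, {"author": 1.0}),
    "C2": (0.8, {"title": 1.0}),
    "C3": (0.5, {"subject": 0.7, "category": 0.3}),
    "C4": (0.6, {"ISBN": 1.0}),
}


def schema_probability(schema, model):
    """P(schema | M): one factor per concept, present or absent."""
    schema = set(schema)
    p = 1.0
    covered = set()
    for alpha, attrs in model.values():
        present = [a for a in attrs if a in schema]
        if len(present) > 1:
            return 0.0  # two attributes of one concept: mutually exclusive
        if present:
            p *= alpha * attrs[present[0]]  # P(attribute | M) = alpha * beta
            covered.add(present[0])
        else:
            p *= 1.0 - alpha  # concept not expressed in this schema
    if covered != schema:
        return 0.0  # schema uses an attribute outside the model's vocabulary
    return p


def observation_probability(schemas, model):
    """P(QIs | M): product over independently observed schemas."""
    return prod(schema_probability(s, model) for s in schemas)


# 0.9 (author) * 0.2 (no title) * 0.5 * 0.7 (subject) * 0.6 (ISBN)
print(schema_probability({"author", "ISBN", "subject"}, MODEL))
print(schema_probability({"subject", "category"}, MODEL))  # 0.0
```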


Consistency check

A set of schemas I as schema observations <Ii, Bi>: number of occurrences Bi for each Ii
M is consistent if Pr(I|M) > 0
Find consistent models as candidates


Hypothesis Generation

Two sub-steps

1. Consistent Concept Construction

2. Build Hypothesis Space


Hypothesis Generation: Space pruning

Prune the space of model candidates
  Generate M such that P(QI|M) > 0 for any observed QI
  Mutual exclusion assumption
  Co-occurrence graph

Example:
  Observations: QI1 = {author, subject} and QI2 = {author, category}
  Space of models: any set partition of {author, subject, category}

[Figure: the five candidate models M1 to M5, one per set partition; M1 has three concepts (C1, C2, C3), M2 to M4 have two concepts each (C1, C2), and M5 has a single concept (C1)]


Hypothesis Generation

Prune the space of model candidates
  Generate M such that P(QI|M) > 0 for any observed QI
  Mutual exclusion assumption

Example:
  Observations: QI1 = {author, subject} and QI2 = {author, category}
  Space of models: any set partition of {author, subject, category}
  Model candidates after pruning:

[Figure: the same five candidates M1 to M5, with the inconsistent ones pruned]
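The pruning in this example is small enough to check by brute force. The sketch below enumerates all set partitions of the three attributes and keeps those that never place two co-occurring attributes in the same concept, which (assuming all α and β are positive) is what P(QI|M) > 0 amounts to under the mutual exclusion assumption. `set_partitions` and `is_consistent` are hypothetical helper names; this is not the co-occurrence-graph construction used by the authors, only an exhaustive check that gives the same answer on a toy input.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partition


def is_consistent(partition, observations):
    """True if no observed schema holds two attributes of the same concept."""
    return all(
        sum(attr in schema for attr in concept) <= 1
        for concept in partition
        for schema in observations
    )


observations = [{"author", "subject"}, {"author", "category"}]
attributes = ["author", "subject", "category"]

candidates = list(set_partitions(attributes))
survivors = [p for p in candidates if is_consistent(p, observations)]

print(len(candidates))  # 5 set partitions of three attributes
for model in survivors:
    print(model)
```

Only two partitions survive: the one with three singleton concepts, and the one that groups subject with category while keeping author separate. Exhaustive enumeration grows very quickly with the number of attributes, which is consistent with the exponential time complexity noted later in the deck.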


Hypothesis Generation (Cont.)

Build Probability Functions
  Maximum likelihood estimation
  Estimate αi and βj that maximize Pr(I|M)
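For a consistent model the likelihood factorises per concept (each observed schema either uses one attribute of the concept or none), so the maximising parameters have a closed form: α is the fraction of observations that express the concept, and β is the attribute's share within its concept. This closed form is derived here from that factorisation rather than quoted from the paper, and the observation counts below are hypothetical.

```python
from collections import Counter


def estimate_parameters(concepts, observations):
    """Closed-form maximum likelihood estimates for a consistent model.

    concepts: list of lists of attribute names (the model structure).
    observations: list of (schema, count) pairs.
    Returns {concept index: (alpha, {attribute: beta})}.
    """
    total = sum(count for _, count in observations)
    occurrences = Counter()
    for schema, count in observations:
        for attr in schema:
            occurrences[attr] += count

    estimates = {}
    for i, concept in enumerate(concepts):
        concept_count = sum(occurrences[a] for a in concept)
        alpha = concept_count / total
        betas = (
            {a: occurrences[a] / concept_count for a in concept}
            if concept_count
            else {}
        )
        estimates[i] = (alpha, betas)
    return estimates


# Hypothetical observation counts, for illustration only.
observations = [
    ({"author", "subject"}, 5),
    ({"author", "category"}, 3),
    ({"subject"}, 2),
]
concepts = [["author"], ["subject", "category"]]

for i, (alpha, betas) in estimate_parameters(concepts, observations).items():
    print(f"C{i + 1}: alpha={alpha:.2f}, betas={betas}")
```

With these counts author appears in 8 of 10 schemas (α = 0.8), while the subject/category concept appears in all 10 (α = 1.0) and splits 0.7 / 0.3 between its two attributes.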


Hypothesis Selection

Rank the model candidates
Select the model that generates the closest distribution to the observations
Approach: hypothesis testing

Example: select schema model at significance level 0.05
  χ² statistic = 3.93; 3.93 < 7.815: accept
  χ² statistic = 20.20; 20.20 > 14.067: reject
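The thresholds 7.815 and 14.067 in the example are the upper 5% critical values of the chi-square distribution with 3 and 7 degrees of freedom. A minimal sketch of such a goodness-of-fit check follows; the observed counts and model probabilities are hypothetical, the function names are illustrative, and the number of degrees of freedom is taken as an input rather than derived, since the slide does not say how it is computed.

```python
# Upper 5% critical values of the chi-square distribution,
# keyed by degrees of freedom (standard table values).
CRITICAL_0_05 = {
    1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488,
    5: 11.070, 6: 12.592, 7: 14.067,
}


def chi_square_statistic(observed, expected_probs):
    """Pearson statistic: sum of (observed - expected)^2 / expected.

    observed: {schema: count}; expected_probs: {schema: P(schema | M)}.
    Assumes every schema listed in expected_probs has non-zero probability.
    """
    total = sum(observed.values())
    statistic = 0.0
    for schema, p in expected_probs.items():
        expected = total * p
        statistic += (observed.get(schema, 0) - expected) ** 2 / expected
    return statistic


def accept(statistic, dof, table=CRITICAL_0_05):
    """Accept the model when the statistic is below the critical value."""
    return statistic < table[dof]


# Hypothetical data: 100 observed interfaces and a candidate model's probabilities.
AS = frozenset({"author", "subject"})
AC = frozenset({"author", "category"})
S = frozenset({"subject"})
C = frozenset({"category"})
observed = {AS: 45, AC: 15, S: 30, C: 10}
model_probs = {AS: 0.42, AC: 0.18, S: 0.28, C: 0.12}

stat = chi_square_statistic(observed, model_probs)
print(round(stat, 2), accept(stat, dof=3))
print(accept(3.93, dof=3), accept(20.20, dof=7))  # the two cases on the slide
```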


Dealing with the Real World Data

Head-often, tail-rare distribution

Attribute Selection
  Systematically remove rare attributes
Rare Schema Smoothing
  Aggregate infrequent schemas into a conceptual event I(rare)
Consensus Projection
  Follow concept mutual independence assumption
  Extract and aggregate
  New input schemas with re-estimated parameters
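The first two techniques can be sketched as simple preprocessing passes over the observed schemas. The details are assumptions made for illustration: attribute frequency is taken to be the fraction of interfaces that contain the attribute, the default threshold `f=0.10` mirrors the 10% used in the case studies, and the `min_count` cut-off for rare schemas is a made-up parameter. Consensus projection is not covered here.

```python
from collections import Counter


def select_attributes(observations, f=0.10):
    """Drop attributes whose observed frequency is below the threshold f.

    observations: list of (schema, count) pairs.
    Returns the kept attributes and the schema counts after projection.
    """
    total = sum(count for _, count in observations)
    freq = Counter()
    for schema, count in observations:
        for attr in schema:
            freq[attr] += count
    keep = {a for a, n in freq.items() if n / total >= f}

    projected = Counter()
    for schema, count in observations:
        projected[frozenset(schema) & keep] += count
    return keep, projected


def smooth_rare_schemas(schema_counts, min_count=2):
    """Merge schemas seen fewer than min_count times into one event, I(rare)."""
    smoothed = Counter()
    for schema, count in schema_counts.items():
        key = schema if count >= min_count else "I(rare)"
        smoothed[key] += count
    return smoothed


# Hypothetical observations: 20 interfaces over four attributes.
observations = [
    ({"author", "title"}, 12),
    ({"writer", "title"}, 7),
    ({"title", "binding"}, 1),
]
keep, projected = select_attributes(observations)
print(sorted(keep))  # binding occurs in 1 of 20 interfaces and is dropped
print(smooth_rare_schemas(projected))
```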


Final Algorithm

Two phases:

Build initial hypothesis space

Discover the hidden model

Attribute Selection

Extract the common parts of model candidates of last iteration

Hypothesis Generation

Hypothesis Selection

Combine rare interfaces


Experiment Setup in Case Studies

Over 200 sources on four domains
Threshold f = 10%
Significance level: 0.05
  Can be specified by users


Example of the MGSsd Algorithm

M1={(ti), (is), (kw), (pr), (fm), (pd), (pu), (su,cg), (au,ln), (fn)}

M2={(ti), (is), (kw), (pr), (fm), (pd), (pu), (su,cg), (au,fn), (ln)}


Metrics

1. How close it is to the correct schema model
   Precision:
   Recall:

2. How well it can answer the target question
   Precision:
   Recall:


Examples on Metrics

I={<I1,6>, <I2,3>, <I3,1>}
I1={author, subject}, I2={author, category}, I3={subject}
M1={(author:1):0.6, (subject:0.7,category:0.3):1}
M2={(author:1):0.6, (subject:1):0.7, (category:1):0.3}

Metrics 1:
  Pm(M2,Mc)=0.196+0.036+0.294+0.054=0.58
  Rm(M2,Mc)=0.28+0.12+0.42+0.18=1
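The Metrics 1 values can be reproduced with a short sketch. Two things are assumed here, read off the arithmetic rather than stated on the slide: the correct model Mc is M1 (the recall terms are exactly M1's schema probabilities), and precision sums P(I|M2) while recall sums P(I|Mc) over the schemas that both models can generate. `instance_probs` and `model_precision_recall` are hypothetical names.

```python
from itertools import product


def instance_probs(model):
    """Map every schema with non-zero probability to P(schema | model).

    model: list of concepts, each given as (alpha, {attribute: beta}).
    """
    per_concept = []
    for alpha, attrs in model:
        # Either the concept is absent, or exactly one of its attributes appears.
        options = [(None, 1.0 - alpha)]
        options += [(a, alpha * b) for a, b in attrs.items()]
        per_concept.append(options)

    probs = {}
    for combo in product(*per_concept):
        p = 1.0
        for _, q in combo:
            p *= q
        if p > 0:
            schema = frozenset(a for a, _ in combo if a is not None)
            probs[schema] = p
    return probs


def model_precision_recall(hypothesis, correct):
    """Sum each model's probability mass over the schemas both can generate."""
    ph, pc = instance_probs(hypothesis), instance_probs(correct)
    shared = ph.keys() & pc.keys()
    precision = sum(ph[s] for s in shared)
    recall = sum(pc[s] for s in shared)
    return precision, recall


M1 = [(0.6, {"author": 1.0}), (1.0, {"subject": 0.7, "category": 0.3})]
M2 = [(0.6, {"author": 1.0}), (0.7, {"subject": 1.0}), (0.3, {"category": 1.0})]

precision, recall = model_precision_recall(M2, M1)
print(round(precision, 2), round(recall, 2))  # 0.58 1.0
```

M2 treats subject and category as independent concepts, so it also spends probability on schemas that M1 can never produce (for example one containing both subject and category); that wasted mass is why its precision falls to 0.58 while its recall stays at 1.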

Metrics 2:


Experimental Results

This approach can identify most concepts correctly
Incorrect matchings are due to a small number of observations
Two suites of metrics are needed
Time complexity is exponential

Can generate all correct instances

The discovered synonyms are all correct ones


Advantages

Scalability: large-scale matching
Solvability: exploit statistical information
Generality

Holistic Model Discovery vs. Pairwise Attribute Correspondence

Input schemas (both approaches)
  S1: author, title, subject, ISBN
  S2: writer, title, category, format
  S3: name, title, keyword, binding

Holistic Model Discovery
  Attributes grouped across all sources at once: author, name, writer; subject, category

Pairwise Attribute Correspondence
  S1.author – S3.name
  S1.subject – S2.category


Conclusions & Future Work

Holistic statistical schema matching of massive sources
  MGS framework to find synonym attributes
  Discover hidden models
  Suited for large-scale databases
Results verify the observed phenomena and show accuracy and effectiveness

Future Issues
  Complex matching: (Last Name, First Name) – Author
  More efficient approximation algorithm
  Incorporating other matching techniques


My Assessments

Promise
  Use minimal “light-weight” information: attribute name
  Effective with sufficient instances
  Leverage challenge as opportunity

Limitation
  Need sufficient observations
  Simple assumptions
  Exponential time complexity
  Homonyms


Questions