Query Processing over Incomplete Autonomous Web Databases MS Thesis Defense by Hemal Khatri...

Date posted: 18-Dec-2015

Page 1: Query Processing over Incomplete Autonomous Web Databases

MS Thesis Defense by Hemal Khatri

Committee Members:
Prof. Subbarao Kambhampati (chair)
Prof. Chitta Baral
Prof. Yi Chen
Prof. Huan Liu

Page 2: Introduction to Web databases

Many websites let users pose queries through a form-based interface and are supported by backend databases

Consider used-car sales websites such as Cars.com, Yahoo! Autos, etc.

[Figure label: Autonomous Database]

Page 3: Incompleteness in Web databases

Web databases are often populated by lay individuals without any curation, e.g. Cars.com, Yahoo! Autos

Web databases are also being populated using automated information extraction techniques, which are inherently imperfect

The local schema of a data source may not support certain attributes supported by the global schema

Incomplete/Uncertain tuple: a tuple in which one or more attributes have a missing value

Website          # of attributes   # of tuples   Incomplete tuples   Body style   Engine
autotrader.com   13                25127         33.67%              3.6%         8.1%
carsdirect.com   14                32564         98.74%              55.7%        55.8%

Page 4: Problem Statement

Many entities corresponding to tuples with missing values might be relevant to the user query

Current query processing techniques return only answers that exactly satisfy the user query
– Such techniques return results with high precision but low recall

Relevant uncertain tuple: a tuple that does not exactly satisfy the query predicates, but whose underlying entity might be relevant to the query

How can we support query processing over incomplete autonomous databases in order to retrieve ranked uncertain results?

Example: Q: Make=Honda; tuple: (null, Accord, 2003, sedan)

Page 5: Challenges Involved

How to predict missing values in autonomous databases?

As autonomous databases are accessible only through form-based interfaces, how to retrieve relevant uncertain answers?
– How to keep query processing cost manageable in retrieving uncertain tuples?

How to rank the retrieved uncertain answers?

Page 6: Related Work

Probabilistic databases
– Incomplete databases are similar to probabilistic databases once we assess the probabilities for missing values
– TRIO: uncertainty with lineage
– ConQuer: handling inconsistency over databases
  • Assume probability distributions are given for uncertain or inconsistent attributes
– We assess the probability distribution for the missing attribute and use it to rank rewritten queries to retrieve relevant answers, since the probabilities cannot be stored in the databases
– Our query rewriting framework is general and can be used by these systems if the databases are autonomous

Handling Missing Values
– EM algorithm, Bayes Net, Association rules

Page 7: Possible Approaches

For a query Q: body style = convt

1. Certain Answers Only (CAO): Return certain answers only, as in traditional databases [Low recall]

2. All Uncertain Answers (AUA): Null matches any concrete value, hence return all answers having body style = convt along with answers having body style as null [Low precision, infeasible]

3. Relevant Uncertain Answers (RUA): Rank answers by predicting the values of the missing attribute [Costly, infeasible]
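The three approaches can be contrasted on a toy relation. The sketch below is only illustrative: the rows, the `None`-as-null convention, and the `toy_predict` function (a stand-in for whatever model predicts the missing value) are assumptions, not part of the thesis.

```python
# Toy in-memory relation; None stands for a missing (null) value.
CARS = [
    {"make": "Audi", "model": "a4", "body_style": "convt"},
    {"make": "BMW", "model": "z4", "body_style": None},
    {"make": "Honda", "model": "civic", "body_style": "sedan"},
    {"make": "Audi", "model": "a4", "body_style": None},
]


def certain_answers_only(rows, attr, value):
    """CAO: keep only tuples that exactly satisfy attr = value."""
    return [r for r in rows if r[attr] == value]


def all_uncertain_answers(rows, attr, value):
    """AUA: a null on attr matches any value, so null tuples are returned too."""
    return [r for r in rows if r[attr] == value or r[attr] is None]


def relevant_uncertain_answers(rows, attr, value, predict):
    """RUA: certain answers first, then null tuples ordered by predicted probability."""
    certain = certain_answers_only(rows, attr, value)
    uncertain = [r for r in rows if r[attr] is None]
    uncertain.sort(key=lambda r: predict(r, attr, value), reverse=True)
    return certain + uncertain


def toy_predict(row, attr, value):
    """Hypothetical predictor: hard-coded P(body_style=convt | model)."""
    return {"z4": 1.0, "a4": 0.4}.get(row["model"], 0.0)


cao = certain_answers_only(CARS, "body_style", "convt")
aua = all_uncertain_answers(CARS, "body_style", "convt")
rua = relevant_uncertain_answers(CARS, "body_style", "convt", toy_predict)
```

Note that both AUA and RUA touch every null tuple, which is what makes them impractical against a form-based source that cannot be scanned in full.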

Page 8: Outline

Introduction
QPIAD: Query Processing over Incomplete Autonomous Databases
Data Integration over Incomplete Autonomous Databases
Other Contributions
Conclusion

Page 9: QPIAD System Architecture

Page 10: RRUA: Generating Rewritten Queries

The Restricted Relevant Uncertain Answers (RRUA) approach retrieves only relevant incomplete tuples instead of retrieving all tuples as in AUA and RUA

Consider a query Q: Body style=convt

Base Result Set: RS(Q)

Make      Model     Year   Price   Body style
Audi      a4        2004   20000   convt
BMW       z4        2003   17000   convt
Porsche   boxster   2000   13000   convt
…         …         …      …       …

Rewritten queries are based on the determining set from the AFD for Body style: Model ~~> Body style (confidence 0.9)

Determining attribute set (dtrSet): Model

Q1: model='a4'
Q2: model='z4'
Q3: model='boxster'
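The rewriting step can be sketched as a projection of the base result set onto the determining set, with one rewritten query per distinct value combination. The function name and the dictionary representation of a query are assumptions made for this illustration.

```python
def generate_rewritten_queries(base_result_set, dtr_set):
    """Project RS(Q) onto dtr_set and emit one query per distinct value combination."""
    seen = set()
    queries = []
    for row in base_result_set:
        values = tuple(row.get(attr) for attr in dtr_set)
        if None in values or values in seen:
            continue  # skip projections with nulls and duplicates
        seen.add(values)
        queries.append(dict(zip(dtr_set, values)))
    return queries


# Base result set for Q: body style = convt, taken from the slide.
base_result_set = [
    {"make": "Audi", "model": "a4", "year": 2004, "price": 20000, "body_style": "convt"},
    {"make": "BMW", "model": "z4", "year": 2003, "price": 17000, "body_style": "convt"},
    {"make": "Porsche", "model": "boxster", "year": 2000, "price": 13000, "body_style": "convt"},
]

# AFD Model ~~> Body style, so the determining set is just the model attribute.
rewritten = generate_rewritten_queries(base_result_set, ["model"])
```

Each rewritten query would then be issued to the source to fetch tuples whose body style is missing but whose model matches.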

Page 11: Learning Attribute Correlations

AFD: VIN ~~> Model, where VIN is an Approximate Key (AKey) with high confidence

VIN will not be useful for query rewriting and feature selection, since it will not be able to retrieve additional new tuples

Pipeline: Sample Database -> TANE Algorithm -> AFDs and AKeys -> Prune AFDs based on AKeys -> AFDs for Query Rewriting and Feature Selection in classifier
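The pruning step might look like the sketch below. It assumes the AFDs and AKeys have already been mined (for example by TANE) and are handed over as plain tuples. The subset-based pruning rule, the `akey_min_conf` threshold, and the confidence values in the example are all assumptions for illustration.

```python
def prune_afds(afds, akeys, akey_min_conf=0.9):
    """Drop AFDs whose determining set contains a high-confidence approximate key.

    afds:  list of (determining_attrs, dependent_attr, confidence)
    akeys: list of (key_attrs, confidence)
    """
    strong_keys = [frozenset(attrs) for attrs, conf in akeys if conf >= akey_min_conf]
    kept = []
    for determining, dependent, conf in afds:
        det = frozenset(determining)
        if any(key <= det for key in strong_keys):
            continue  # a near-unique determining set retrieves no new tuples
        kept.append((determining, dependent, conf))
    return kept


# Hypothetical mined output; the confidences are made up for the example.
afds = [
    (("vin",), "model", 0.99),
    (("model",), "body_style", 0.9),
]
akeys = [(("vin",), 0.98)]

useful_afds = prune_afds(afds, akeys)
```

The intuition follows the slide: a query on VIN matches at most the tuple it was derived from, so it cannot bring back additional uncertain answers.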

Page 12: RRUA: Ranking Rewritten Queries

Not all queries are equally good at retrieving relevant answers
– "z4" model cars are more likely to be convertibles than "a4" model cars

When database or network resources are limited, the mediator can choose to issue only the top K queries to get the most relevant uncertain answers

Page 13: Learning Value Distributions

Used to rank queries based on the determining set of attributes from the AFD for the query attribute

We use a Naïve Bayes classifier with m-estimates, with AFDs as a feature selection step

Rank of a rewritten query Qi: R(Qi) = P(Am=vm | ti), where ti ∈ Π_dtrSet(Am)(RS(Q))
– Q1: model='a4', R(Q1) = P(bodystyle=convt | model=a4) = 0.4
– Q2: model='z4', R(Q2) = P(bodystyle=convt | model=z4) = 1.0
– Q3: model='boxster', R(Q3) = P(bodystyle=convt | model=boxster) = 0.7

R(Q2) > R(Q3) > R(Q1)

Relevant uncertain answers are ranked based on the rank of the rewritten query that retrieved them
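A minimal sketch of the ranking step follows, assuming a small sample of complete tuples is available for training. The sample data, the function names, the choice of m = 1 with a uniform prior for the m-estimate, and the unsmoothed class prior are assumptions; the smoothed probabilities it produces differ from the rounded values on the slide.

```python
from collections import Counter


def nb_posterior(sample, evidence, target_attr, target_value, m=1.0):
    """P(target_attr = target_value | evidence) under Naive Bayes with m-estimates."""
    class_counts = Counter(row[target_attr] for row in sample)
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # class prior (left unsmoothed in this sketch)
        for attr, value in evidence.items():
            n_values = len({row[attr] for row in sample})
            n_match = sum(
                1 for row in sample if row[target_attr] == cls and row[attr] == value
            )
            # m-estimate of P(attr = value | cls) with a uniform prior 1 / n_values
            score *= (n_match + m / n_values) / (n_cls + m)
        scores[cls] = score
    return scores.get(target_value, 0.0) / sum(scores.values())


def rank_rewritten_queries(sample, queries, target_attr, target_value, k=None):
    """Score each rewritten query and return the top k as (rank, query) pairs."""
    scored = [(nb_posterior(sample, q, target_attr, target_value), q) for q in queries]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical training sample of complete tuples.
SAMPLE = (
    [{"model": "a4", "body_style": "convt"}] * 2
    + [{"model": "a4", "body_style": "sedan"}] * 3
    + [{"model": "z4", "body_style": "convt"}] * 4
    + [{"model": "boxster", "body_style": "convt"}] * 3
    + [{"model": "boxster", "body_style": "coupe"}] * 1
)

queries = [{"model": "a4"}, {"model": "z4"}, {"model": "boxster"}]
ranked = rank_rewritten_queries(SAMPLE, queries, "body_style", "convt")
top_two = rank_rewritten_queries(SAMPLE, queries, "body_style", "convt", k=2)
```

Passing `k` corresponds to the resource-limited case where the mediator issues only the top K rewritten queries.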

Page 14: Combining AFDs and Classifiers

More than one AFD may exist for some attributes

Experimented with several approaches:
– Only best-AFD having highest confidence
– All attributes ignoring AFDs
– Hybrid One-AFD
– Ensemble of classifiers

Page 15: Empirical Evaluation of QPIAD

Test Databases: AutoTrader database containing 100K tuples and Census database from the UCI Repository containing 50K tuples

Oracular study: To evaluate the effectiveness of our system against a ground truth, we artificially insert missing values into 10% of the tuples within these databases
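The oracular setup can be sketched as follows: mask one attribute in a random 10% of the tuples, remember the hidden values as ground truth, and score any retrieval method against them. The function names, the fixed seed, and the synthetic rows are assumptions for illustration and do not reproduce the thesis experiments.

```python
import random


def inject_nulls(rows, attr, fraction=0.1, seed=0):
    """Null out attr in a random fraction of rows; return masked rows and hidden values."""
    rng = random.Random(seed)
    n_masked = int(len(rows) * fraction)
    masked_ids = rng.sample(range(len(rows)), n_masked)
    masked_rows = [dict(r) for r in rows]  # copy so the input is left untouched
    ground_truth = {}
    for i in masked_ids:
        ground_truth[i] = masked_rows[i][attr]
        masked_rows[i][attr] = None
    return masked_rows, ground_truth


def precision_recall(retrieved_ids, ground_truth, value):
    """Precision and recall of retrieved row ids against the hidden true values."""
    relevant = {i for i, v in ground_truth.items() if v == value}
    retrieved = set(retrieved_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Synthetic table: alternating body styles, purely for demonstration.
rows = [{"body_style": "convt" if i % 2 == 0 else "sedan"} for i in range(100)]
masked_rows, truth = inject_nulls(rows, "body_style", fraction=0.1)

# Returning every masked tuple mimics AUA on the uncertain part of the answer.
aua_precision, aua_recall = precision_recall(list(truth), truth, "convt")
```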

Page 16: RRUA vs AUA vs RUA

[Figure: two precision-recall plots (x-axis: Recall, y-axis: Precision, both from 0 to 1).
 Panel Q: education=bachelors, with curves AUA (467), RUA (467), RRUA (204).
 Panel Q: body style=convt, with curves AUA (1245), RUA (1245), RRUA (209).]

Page 17: Precision over Top K Tuples

[Figure: two plots of Precision (0 to 1) against Top K Tuples (0 to 100), each with curves AUA, RUA and RRUA.
 Panels: Q: body style=convt and Q: education=bachelors.]

Page 18: Ranking the Rewritten Queries

[Figure: two plots of Avg. Accumulated Precision against Kth Query (0 to 100), one for the Cars database and one for the Census database.]

Page 19: Robustness of QPIAD

[Figure: Accumulated Precision (0 to 1) against Kth Query (0 to 100) for Q: workshop=private, with curves labeled 3%, 5%, 10% and 15%.]

Page 20: User Relevance Issues with QPIAD

When the query processor presents incomplete tuples, it becomes a recommender system

How do we convince users to believe the system's results?

For a query Q: year=2000

Make    Model   Year   Price   Mileage
Honda   Civic   null   15000   18000

Explanation: We have determined that this car's year is 60% likely to be 2000, based on price=15000 and mileage=18000

Page 21: Outline

Introduction
QPIAD: Query Processing over Incomplete Autonomous Databases
Data Integration over Incomplete Autonomous Databases
Other Contributions
Conclusion

Page 22: Leveraging Correlations between Data Sources

Mediator: GS(Make, Model, Year, Price, Mileage, Bodystyle)

Q: Body style=coupe

Page 23: Correlated Source and Maximum Correlated Source

Consider four sources with schema:
– S1(Make, Model, Year, Price)
– S2(Engine, Drive, Body style)
  • AFD: {Engine, Drive} -> Body style, confidence 0.7
– S3(Make, Model, Body style)
  • AFD: Model -> Body style, confidence 0.8
– S4(Make, Price, Body style)
  • AFD: {Make, Price} -> Body style, confidence 0.6
– Mediator global schema GS(Make, Model, Year, Price, Body style, Engine, Drive)

S3 and S4 are correlated sources with S1 on the Body style attribute

S3 is the maximum correlated source for S1 on the Body style attribute
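The slide's example suggests a simple selection rule, sketched below under that reading: a source is correlated with S1 on an attribute when it has an AFD for that attribute whose determining set S1 supports, and the maximum correlated source is the one with the highest AFD confidence. S2 drops out because S1 has neither Engine nor Drive. The data layout and function names are assumptions.

```python
SOURCES = {
    "S1": {"schema": {"Make", "Model", "Year", "Price"}, "afds": []},
    "S2": {
        "schema": {"Engine", "Drive", "Body style"},
        "afds": [({"Engine", "Drive"}, "Body style", 0.7)],
    },
    "S3": {
        "schema": {"Make", "Model", "Body style"},
        "afds": [({"Model"}, "Body style", 0.8)],
    },
    "S4": {
        "schema": {"Make", "Price", "Body style"},
        "afds": [({"Make", "Price"}, "Body style", 0.6)],
    },
}


def correlated_sources(sources, target, attr):
    """Sources with an AFD for attr whose determining set the target supports."""
    target_schema = sources[target]["schema"]
    found = []
    for name, src in sources.items():
        if name == target:
            continue
        for determining, dependent, conf in src["afds"]:
            if dependent == attr and set(determining) <= target_schema:
                found.append((name, conf))
    return sorted(found, key=lambda pair: pair[1], reverse=True)


def maximum_correlated_source(sources, target, attr):
    """The correlated source whose AFD has the highest confidence, or None."""
    ranked = correlated_sources(sources, target, attr)
    return ranked[0][0] if ranked else None


correlated = correlated_sources(SOURCES, "S1", "Body style")
best = maximum_correlated_source(SOURCES, "S1", "Body style")
```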

Page 24: Retrieving Relevant Uncertain Answers from CarsDirect.com

Consider a query Q: body style = coupe (GS)

Cars.com has an AFD: Model ~~> Body style (0.9)

Cars.com is the maximum correlated source for CarsDirect.com, which doesn't support Body style but supports the Model attribute

Make    Model     Year   Price   Body style
Honda   Accord    2003   19000   coupe
Ford    Mustang   2004   29100   coupe
Acura   Legend    1997   12000   coupe
BMW     325       2003   28000   coupe

Q1: model=Accord
Q2: model=Mustang
Q3: model=Legend
Q4: model=325

Page 25: Empirical Evaluation of using Correlation between Data Sources

We consider a mediator performing data integration over three sources: Cars.com, Yahoo! Autos and CarsDirect.com

Yahoo! Autos and CarsDirect.com do not allow querying on body style, but once tuples are retrieved we can check their body style attribute to determine whether a retrieved tuple has the body style specified in the query

Evaluation uses attribute correlations and value distributions learned from Cars.com for 5 test queries on the body style attribute

Page 26: Retrieving Relevant Answers using Correlations from Cars.com

[Figure: two plots of Precision against Kth Tuple.
 Panel Yahoo! Autos: Kth Tuple from 0 to 40, Precision from 0 to 1.
 Panel Carsdirect.com: Kth Tuple from 0 to 100, Precision from 0 to 0.8.]

Page 27: Handling Joins over Incomplete Autonomous databases

Mediator performing data integration across two sources:
– Source S1 is incomplete
– Source S2 is complete

Source   Local Schema
S1       Cars(Make, Model, Year, Price)
S2       Review(Model, Ratings)

Mediator View
UsedCars(Make, Model, Year, Price, Ratings) :- Cars(Make, Model, Year, Price), Review(Model, Ratings)

Page 28: Issues in Handling Joins

Performing joins over probabilistic databases will lead to a disjunction in join results

Consider joining uncertain tuples from the two sources:

Make    Model                           Year   Price
Honda   null [0.6 Civic] [0.4 Accord]   2003   18000

Model    Ratings
Civic    5
Accord   4

Join result:

Make    Model    Year   Price   Ratings   Probability
Honda   Civic    2003   18000   5         0.6
  or
Honda   Accord   2003   18000   4         0.4

Approximation
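The disjunction can be made concrete with a small sketch that expands the uncertain tuple into one joined alternative per predicted model. Keeping only the most probable alternative is one plausible reading of the "Approximation" label, not a rule stated on the slide. The function names and the dictionary layout are assumptions.

```python
def join_uncertain(car, model_distribution, reviews):
    """Expand a tuple with an uncertain Model into its weighted join alternatives."""
    alternatives = []
    for model, prob in model_distribution.items():
        if model in reviews:
            joined = dict(car, Model=model, Ratings=reviews[model])
            alternatives.append((joined, prob))
    alternatives.sort(key=lambda pair: pair[1], reverse=True)
    return alternatives


def approximate_join(car, model_distribution, reviews):
    """Assumed approximation: keep only the most probable join alternative."""
    alternatives = join_uncertain(car, model_distribution, reviews)
    return alternatives[0] if alternatives else None


car = {"Make": "Honda", "Model": None, "Year": 2003, "Price": 18000}
model_distribution = {"Civic": 0.6, "Accord": 0.4}
reviews = {"Civic": 5, "Accord": 4}

alternatives = join_uncertain(car, model_distribution, reviews)
best_row, best_prob = approximate_join(car, model_distribution, reviews)
```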

Page 29: Handling Join Queries

Q: σ Make=Honda (UsedCars)

Assume AFDs: {Make, Year} ~~> Model, Model ~~> Make

Make     Model (FK)   Year   Price
Honda    Odyssey      2000   10000
Honda    Accord       2004   20000
Honda    null         2000   15000
null     Accord       2002   18000
Toyota   Camry        2003   16000

Model (PK)   Ratings
Civic        5
Corolla      4
Accord       4
Altima       3
Camry        5
Odyssey      3

Rewritten queries:
Q1: Model=Odyssey: R(Q1)=1
Q2: Model=Accord: R(Q2)=1

Predicted Model for the tuple with a null Model: 0.6 Civic, 0.4 Accord

Queries on source S2 to join:
Q3: Model=Odyssey: R(Q3)=1
Q4: Model=Accord: R(Q4)=1
Q5: Model=Civic: R(Q5)=0.6

Join results (the values 1.0 and 0.6 appear next to the two uncertain rows):

Make    Model     Year   Price   Ratings
Honda   Odyssey   2000   10000   3
Honda   Accord    2004   20000   4
null    Accord    2002   18000   4         1.0
Honda   null      2000   15000   5         0.6

Page 30: Experimental Results Joins

[Figure: three precision-recall plots (x-axis: Recall, y-axis: Precision).
 Panel Q: ratings=4, with curves RUA (2475), RRUA (157).
 Panel Q: make=audi, with curves RUA (4892), RRUA (58).
 Panel Q: model=Civic, with curves RUA (2475), RRUA (24).]

Page 31: Outline

Introduction
QPIAD: Query Processing over Incomplete Autonomous Databases
Data Integration over Incomplete Autonomous Databases
Other Contributions
Conclusion

Page 32: QUIC: Querying under Imprecision and Incompleteness

Consider a query Q: model like Civic (Cars)

User might be interested in similar cars like "Accord", "Camry", etc.

Ranking results in the presence of both similar and incomplete tuples

Id   Make     Model     Year   Body style
1    Honda    Civic     2000   Sedan
2    Honda    Accord    2004   Coupe
3    Toyota   Camry     2001   Sedan
4    Honda    null      2004   Coupe
5    Honda    null      2000   Sedan
6    Honda    Civic     2004   Coupe
7    BMW      3series   2001   convt
8    Toyota   null      1999   sedan

Page 33: Other Contributions [*Collaboration with Garrett Wolf]

Handling multi-attribute selection queries for incomplete databases*

QUIC system for query processing under imprecision and incompleteness

Online learning of value distribution based on base result set to avoid sample biases

Page 34: Conclusion

Thesis proposed a framework for query processing over incomplete autonomous web databases:
– QPIAD: Query processing over incomplete autonomous databases
– QPIAD: Data Integration over multiple incomplete data sources

Results of empirical evaluation on real-world databases show that our system returns relevant answers with high precision while keeping the query processing cost manageable

Page 35: Thank You!!

Questions??