Query Processing over Incomplete Autonomous Databases
Presented By Garrett Wolf, Hemal Khatri, Bhaumik Chokshi, Jianchun Fan, Yi Chen, Subbarao Kambhampati
Arizona State University
2008-02-04
Summarized by Sungchan Park
Copyright 2008 by CEBT
Introduction
More and more data is becoming accessible via web servers which are supported by backend autonomous databases
E.g., Cars.com, Realtor.com, Google Base, etc.
Center for E-Business Technology
[Figure: Mediator over three Autonomous Databases]
Web DBs are Incomplete!
Incomplete Entry
Inaccurate Extraction
Heterogeneous Schemas
User-Defined Schemas
Problem
Current autonomous database systems only return certain answers, namely those which exactly satisfy all the user query constraints
Although there has been work on handling incompleteness in databases, much of it has been focused on single databases on which the query processor has complete control.
Such work modifies the database directly, replacing null values with likely values.
– Not applicable to autonomous databases
Possible Naïve Approaches
Query Q: (Body Style = Convt)
CERTAINONLY
Return only certain answers
– Low Recall
ALLRETURNED
Return all answers having Body Style = Convt or Body Style = Null
– Low Precision, Infeasible
ALLRANKED
Return all answers having Body Style = Convt. Additionally, rank all answers having Body Style = Null by predicting their missing values, and return them to the user
– Costly, Infeasible
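The three naïve approaches can be contrasted on a small in-memory table. This is only a sketch: the rows, the function names, and the `toy_predict` scorer are made up for illustration, and a real autonomous source would typically not accept a "Body Style = Null" query at all.

```python
# Toy car table; None marks a missing (null) value. All rows are made up.
CARS = [
    {"make": "BMW", "model": "Z4", "body_style": "Convt"},
    {"make": "BMW", "model": "Z4", "body_style": None},
    {"make": "Honda", "model": "Civic", "body_style": "Sedan"},
    {"make": "Honda", "model": "Civic", "body_style": None},
    {"make": "Porsche", "model": "Boxster", "body_style": "Convt"},
]


def certain_only(rows, attr, value):
    """CERTAINONLY: tuples that exactly satisfy the constraint."""
    return [r for r in rows if r[attr] == value]


def all_returned(rows, attr, value):
    """ALLRETURNED: certain answers plus every tuple with a null, unranked."""
    return [r for r in rows if r[attr] == value or r[attr] is None]


def all_ranked(rows, attr, value, predict):
    """ALLRANKED: certain answers first, then all null tuples ordered by
    the predicted probability that the missing value equals `value`."""
    certain = certain_only(rows, attr, value)
    uncertain = [r for r in rows if r[attr] is None]
    uncertain.sort(key=lambda r: predict(r, value), reverse=True)
    return certain + uncertain


def toy_predict(row, value):
    # Hypothetical stand-in for a learned model, not part of QPIAD.
    return 0.9 if row["model"] == "Z4" else 0.1


if __name__ == "__main__":
    print(len(certain_only(CARS, "body_style", "Convt")))
    print(len(all_returned(CARS, "body_style", "Convt")))
    print([r["model"] for r in all_ranked(CARS, "body_style", "Convt", toy_predict)])
```

Note that both ALLRETURNED and ALLRANKED have to fetch every tuple with a null Body Style before anything else can happen, which is the cost the slide calls infeasible.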
QPIAD
Solves the problem by generating rewritten queries according to a set of mined attribute correlation rules.
Approximate Functional Dependency (AFD)
Naïve Bayesian Classifier
QPIAD Solution
QPIAD Architecture
Overall Process
1. Learn
2. Rewrite
3. Rank
4. Explain
#1. Learn - AFD
Learn Attribute Correlations
Approximate Functional Dependencies (AFDs)
Approximate Keys (AKeys)
– For pruning
Learned with the TANE algorithm
Y. Huhtala, et al. Efficient discovery of functional and approximate dependencies using partitions. 1998.
Pruning example
AFD {A1, A2} ~> A3
Akey {A1}
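A minimal sketch of how AFD and AKey confidence can be measured, and of one reading of the pruning example above. It is not the TANE algorithm, which searches the attribute lattice efficiently; it only scores candidates that are already given. The function names and toy rows are made up, and the pruning rule (drop an AFD whose determining set contains an AKey) is an assumption about what the example illustrates.

```python
from collections import Counter, defaultdict


def afd_confidence(rows, lhs, rhs):
    """Fraction of tuples that can be kept so that lhs -> rhs holds exactly
    (one minus the minimum fraction of tuples that must be removed)."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(rows)


def akey_confidence(rows, attrs):
    """Fraction of tuples that can be kept so that attrs form an exact key."""
    distinct = {tuple(r[a] for a in attrs) for r in rows}
    return len(distinct) / len(rows)


def prune_afds(afds, akeys):
    """Assumed pruning rule: drop AFDs whose determining set contains an AKey,
    since key-like attributes determine everything trivially."""
    return [(lhs, rhs) for lhs, rhs in afds
            if not any(set(key) <= set(lhs) for key in akeys)]


# Toy relation: A1 is unique per tuple, so it behaves like a key.
ROWS = [
    {"A1": 1, "A2": "x", "A3": "p"},
    {"A1": 2, "A2": "x", "A3": "p"},
    {"A1": 3, "A2": "y", "A3": "q"},
    {"A1": 4, "A2": "y", "A3": "p"},
]

if __name__ == "__main__":
    print(akey_confidence(ROWS, ["A1"]))
    print(afd_confidence(ROWS, ["A1", "A2"], "A3"))
    print(afd_confidence(ROWS, ["A2"], "A3"))
    candidates = [(("A1", "A2"), "A3"), (("A2",), "A3")]
    print(prune_afds(candidates, [("A1",)]))
```

On this toy relation {A1, A2} ~> A3 scores perfectly only because A1 is unique, so it says nothing useful about tuples that are not yet in the sample; that is the motivation for pruning it.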
#1. Learn - Naïve Bayesian Classifier
Learn value distributions with a Naïve Bayesian Classifier (NBC)
Use the determining set of a mined AFD as the selected features
E.g.
– AFD {Make, Body} ~> Model
– P(Model = Accord | Make = Honda, Body = Coupe) = ?
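The example above can be sketched as a small Naïve Bayes classifier whose features are restricted to the determining set of the AFD. The training rows are made up, and the add-one (Laplace) smoothing is an assumption chosen for simplicity; the slides do not say which smoothing is used.

```python
from collections import Counter, defaultdict


def train_nbc(rows, features, target):
    """Count class and per-feature value frequencies over complete tuples."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (feature, class) -> value counts
    domains = defaultdict(set)           # feature -> observed values
    for r in rows:
        if r[target] is None or any(r[f] is None for f in features):
            continue  # learn only from tuples without missing values
        c = r[target]
        class_counts[c] += 1
        for f in features:
            value_counts[(f, c)][r[f]] += 1
            domains[f].add(r[f])
    return class_counts, value_counts, domains


def predict_distribution(model, features, evidence):
    """Return P(target = c | evidence) for every class c (add-one smoothing)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(c)
        for f in features:
            seen = value_counts[(f, c)][evidence[f]]
            score *= (seen + 1) / (n_c + len(domains[f]))
        scores[c] = score
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}


# Made-up training sample.
TRAIN = [
    {"make": "Honda", "body": "Coupe", "model": "Accord"},
    {"make": "Honda", "body": "Coupe", "model": "Accord"},
    {"make": "Honda", "body": "Coupe", "model": "Civic"},
    {"make": "Honda", "body": "Sedan", "model": "Civic"},
    {"make": "BMW", "body": "Convt", "model": "Z4"},
]

# AFD {Make, Body} ~> Model supplies the feature set.
FEATURES = ["make", "body"]

if __name__ == "__main__":
    nbc = train_nbc(TRAIN, FEATURES, "model")
    dist = predict_distribution(nbc, FEATURES, {"make": "Honda", "body": "Coupe"})
    for model_name, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
        print(model_name, round(prob, 3))
```

Restricting the features to the AFD's determining set keeps the classifier small and focuses it on attributes that are known to correlate with the one being predicted.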
#1. Learn - Selectivity
Estimated selectivity of a rewritten query Q = SmplSel(Q) * SmplRatio(R) * PerInc(R)
SmplSel(Q) = Selectivity of the rewritten query when issued on the sample
SmplRatio(R) = Ratio of the original database size to the sample size
PerInc(R) = Percentage of incomplete tuples encountered while creating the sample
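The arithmetic of the estimate can be shown in a few lines. This sketch assumes SmplSel(Q) is a count of matching sample tuples and PerInc(R) is a fraction between 0 and 1; the function name and the numbers are invented for illustration.

```python
def estimated_selectivity(smpl_sel, smpl_ratio, per_inc):
    """Product of the three factors defined above.

    smpl_sel   -- assumed here to be the number of sample tuples matching Q
    smpl_ratio -- original database size divided by sample size
    per_inc    -- fraction (0..1) of incomplete tuples seen while sampling
    """
    if smpl_sel < 0 or smpl_ratio <= 0 or not 0 <= per_inc <= 1:
        raise ValueError("inputs out of range")
    return smpl_sel * smpl_ratio * per_inc


if __name__ == "__main__":
    # Invented numbers: 12 matches in the sample, database 20x the sample,
    # 10% of tuples incomplete.
    print(estimated_selectivity(12, 20, 0.10))
```

Scaling by the sample ratio projects the sample count onto the full database, and scaling by the incompleteness rate narrows it to the tuples a rewritten query is actually meant to recover.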
#2. Rewrite
1. Get the base result (certain answers)
2. Generate rewritten queries from the base result and the learned AFDs
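The two steps can be sketched as follows, assuming a hypothetical AFD Model ~> Body Style and a made-up table. Each distinct value of the determining set found in the base result becomes one rewritten query. Because the source is assumed not to support null bindings, the rewritten query constrains only the determining set, and tuples with a missing Body Style are kept afterwards.

```python
# Made-up source table; None marks a missing value.
CARS = [
    {"make": "BMW", "model": "Z4", "body_style": "Convt"},
    {"make": "BMW", "model": "Z4", "body_style": None},
    {"make": "Honda", "model": "Civic", "body_style": "Sedan"},
    {"make": "Honda", "model": "Civic", "body_style": None},
    {"make": "Porsche", "model": "Boxster", "body_style": "Convt"},
    {"make": "Porsche", "model": "Boxster", "body_style": None},
]


def base_result(rows, attr, value):
    """Step 1: certain answers to the original query."""
    return [r for r in rows if r[attr] == value]


def rewrite_queries(base, determining_set):
    """Step 2: one rewritten query (a dict of bindings) per distinct
    combination of determining-set values in the base result."""
    queries = []
    for row in base:
        binding = {a: row[a] for a in determining_set}
        if None in binding.values():
            continue  # cannot bind on a missing value
        if binding not in queries:
            queries.append(binding)
    return queries


def run_rewritten(rows, binding, constrained_attr):
    """Issue a rewritten query, then keep only tuples whose constrained
    attribute is missing (the uncertain answers)."""
    matches = [r for r in rows if all(r[a] == v for a, v in binding.items())]
    return [r for r in matches if r[constrained_attr] is None]


if __name__ == "__main__":
    base = base_result(CARS, "body_style", "Convt")
    for q in rewrite_queries(base, ["model"]):  # hypothetical AFD: Model ~> Body Style
        print(q, run_rewritten(CARS, q, "body_style"))
```

The Civic tuple with a missing Body Style is never retrieved, because no certain answer to the original query has Model = Civic.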
[Figure: Rewritten Queries]
#3. Rank
1. Select top-k queries based on F-Measure
2. Reorder the selected queries based on P
3. Retrieve tuples
P = learned probability, R = selectivity
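A minimal sketch of the first two ranking steps. Two things here are assumptions rather than something the slides state: the weighted F-measure is written as (1 + alpha) * P * R / (alpha * P + R), and R is taken as each query's estimated selectivity normalized by the total over all rewritten queries. The query names and numbers are invented.

```python
def f_measure(p, r, alpha=1.0):
    """Weighted F-measure; alpha = 0 reduces to precision only, and larger
    alpha gives recall more weight (assumed form)."""
    denom = alpha * p + r
    if denom == 0:
        return 0.0
    return (1 + alpha) * p * r / denom


def select_and_order(queries, k, alpha=1.0):
    """Pick the top-k rewritten queries by F-measure, then order them by P."""
    total = sum(q["selectivity"] for q in queries)
    scored = []
    for q in queries:
        recall = q["selectivity"] / total if total else 0.0
        scored.append((f_measure(q["precision"], recall, alpha), q))
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    return sorted((q for _, q in top), key=lambda q: q["precision"], reverse=True)


# Invented rewritten queries with their learned probability (P) and
# estimated selectivity.
QUERIES = [
    {"name": "Model=Z4", "precision": 0.90, "selectivity": 10},
    {"name": "Model=Boxster", "precision": 0.95, "selectivity": 2},
    {"name": "Model=Civic", "precision": 0.05, "selectivity": 50},
    {"name": "Model=A4", "precision": 0.60, "selectivity": 20},
]

if __name__ == "__main__":
    for q in select_and_order(QUERIES, k=2):
        print(q["name"])
```

Selecting by F-measure decides which queries are worth issuing at all, while issuing them in order of P means tuples arrive in ranked order and need no re-sorting.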
#4. Explain
Other Issues: Correlated Source
Other Issues: Handling Aggregation
Empirical Evaluation: Quality
QPIAD vs. ALLRETURNED
ALLRETURNED has low precision because not all tuples with missing values on the constrained attributes are relevant to the query
QPIAD has a much higher precision than ALLRETURNED as it aims to retrieve tuples with missing values on the constrained attributes which are very likely to be relevant to the query
Empirical Evaluation: Efficiency
QPIAD vs. ALLRANKED
The ALLRANKED approach is often infeasible because sources rarely allow direct retrieval of tuples with null values
QPIAD achieves the same level of recall as ALLRANKED while retrieving far fewer tuples
Empirical Evaluation: Robustness
Robustness w.r.t. Sample Size
QPIAD is robust even when faced with a relatively small data sample
Empirical Evaluation: Extensions
Aggregates
Prediction of missing values increases the fraction of queries that achieve higher levels of accuracy
Approximately 20% more queries achieve 100% accuracy when prediction is used
Join
As alpha is increased, we obtain a higher recall without sacrificing much precision
Related Work
Querying Incomplete Databases
– Possible World Approaches: track the completions of incomplete tuples (Codd Tables, V-Tables, Conditional Tables)
– Probabilistic Approaches: quantify the distribution over completions to distinguish between the likelihoods of the various possible answers
Probabilistic Databases
– Tuples are associated with an attribute describing the probability of their existence
– However, in our work, the mediator does not have the capability to modify the underlying autonomous databases
Query Reformulation / Relaxation
– Aims to return similar or approximate answers to the user after returning, or in the absence of, exact answers
– Our focus is on retrieving tuples with missing values on constrained attributes
Learning Missing Values
– Common imputation approaches replace missing values by substituting the mean, most common value, or default value, or by using kNN, association rules, etc.
– Our work requires schema-level dependencies between attributes as well as distribution information over missing values
Contribution
Efficiently retrieve relevant uncertain answers from autonomous sources given only limited query access patterns
– Query Rewriting
Retrieves answers with missing values on constrained attributes without modifying the underlying databases
– AFD-Enhanced Classifiers
Rewriting & ranking considers the natural tension between precision and recall
– F-Measure based ranking
AFDs play a major role in:
– Query Rewriting
– Feature Selection
– Explanations