Probabilistic Ranking of Database Query Results
![Page 1: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/1.jpg)
Probabilistic Ranking of Database Query Results
Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik
Presented by Weimin He, CSE@UTA
![Page 2: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/2.jpg)
04/19/23 Weimin He CSE@UTA 2
Outline
- Motivation
- Problem Definition
- System Architecture
- Construction of Ranking Function
- Implementation
- Experiments
- Conclusion and open problems
![Page 3: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/3.jpg)
Motivating example
Realtor DB: Table D = (TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)
SQL query: Select * From D Where City=“Seattle” And View=“Waterfront”
![Page 4: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/4.jpg)
Motivation
- Many-answers problem
- Two alternative solutions:
  - Query reformulation
  - Automatic ranking
- Apply probabilistic model in IR to DB tuple ranking
![Page 5: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/5.jpg)
Problem Definition
Given a database table $D$ with $n$ tuples $\{t_1, \dots, t_n\}$ over a set of $m$ categorical attributes $A = \{A_1, \dots, A_m\}$, and a query
Q: SELECT * FROM D WHERE X1=x1 AND … AND Xs=xs
where each $X_i$ is an attribute from $A$ and $x_i$ is a value in its domain.
The set of attributes $X = \{X_1, \dots, X_s\}$ is known as the set of attributes specified by the query, while the set $Y = A - X$ is known as the set of unspecified attributes.
Let $S \subseteq \{t_1, \dots, t_n\}$ be the answer set of $Q$.
How to rank the tuples in $S$ and return the top-k tuples to the user?
![Page 6: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/6.jpg)
System Architecture
![Page 7: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/7.jpg)
Intuition for Ranking Function
Select * From D Where City=“Seattle” And View=“Waterfront”
Score of a result tuple t depends on:
- Global Score: global importance of unspecified attribute values. E.g., homes with good school districts are globally desirable.
- Conditional Score: correlations between specified and unspecified attribute values. E.g., Waterfront ⇒ BoatDock.
![Page 8: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/8.jpg)
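Probabilistic Model in IR
Bayes’ Rule:
$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$$
Product Rule:
$$p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c)$$
Document $t$, query $Q$
$R$: relevant document set
$\bar{R} = D - R$: irrelevant document set
$$Score(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)} = \frac{p(t \mid R)\, p(R) / p(t)}{p(t \mid \bar{R})\, p(\bar{R}) / p(t)} \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}$$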
![Page 9: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/9.jpg)
Adaptation of PIR to DB
- Tuple t is considered as a document
- Partition t into t(X) and t(Y)
- t(X) and t(Y) are written as X and Y
- Derive from the initial scoring function until the final ranking function is obtained
![Page 10: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/10.jpg)
Preliminary Derivation
![Page 11: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/11.jpg)
Limited Independence Assumptions
Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed:
$$p(X \mid C) = \prod_{x \in X} p(x \mid C)$$
$$p(Y \mid C) = \prod_{y \in Y} p(y \mid C)$$
![Page 12: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/12.jpg)
Continuing Derivation
![Page 13: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/13.jpg)
Workload-based Estimation of $p(y \mid R)$
- Assume a collection of “past” queries exists in the system
- Workload W is represented as a set of “tuples”
- Given query Q and specified attribute set X, approximate R as all query “tuples” in W that also request X
- All properties of the relevant tuple set R can be obtained by examining only the subset of the workload that contains queries that also request X
$$p(y \mid R) \approx p(y \mid X, W)$$
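The approximation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the representation of a workload query as a dictionary of attribute/value conditions, the function name `estimate_p_y_given_r`, and the sample workload are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Condition = Tuple[str, str]  # (attribute name, attribute value)


def estimate_p_y_given_r(workload: List[Dict[str, str]],
                         specified: Dict[str, str],
                         y: Condition) -> float:
    """Approximate p(y | R) by p(y | X, W).

    Only workload queries that also request every condition in X are kept;
    among those, the fraction that additionally requests y is returned.
    """
    matching = [q for q in workload
                if all(q.get(attr) == val for attr, val in specified.items())]
    if not matching:
        return 0.0  # no evidence in the workload for this X
    attr, val = y
    hits = sum(1 for q in matching if q.get(attr) == val)
    return hits / len(matching)


# Hypothetical workload: each past query is a set of attribute=value conditions.
workload = [
    {"City": "Seattle", "View": "Waterfront", "BoatDock": "Yes"},
    {"City": "Seattle", "View": "Waterfront", "SchoolDistrict": "Excellent"},
    {"City": "Seattle", "View": "Waterfront", "BoatDock": "Yes", "Pool": "Yes"},
    {"City": "Bellevue", "BoatDock": "Yes"},
]
x = {"City": "Seattle", "View": "Waterfront"}
p_boatdock = estimate_p_y_given_r(workload, x, ("BoatDock", "Yes"))
```

Here three workload queries request both City=Seattle and View=Waterfront, and two of them also request BoatDock=Yes, so the estimate is 2/3.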
![Page 14: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/14.jpg)
Final Ranking Function
![Page 15: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/15.jpg)
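The formula on this slide survives only as an image. A form consistent with the four atomic probabilities that the deck pre-computes, $p(y \mid W)$, $p(y \mid D)$, $p(x \mid y, W)$ and $p(x \mid y, D)$, would be the following; treat it as a reconstruction under that assumption rather than a transcription of the slide.

```latex
Score(t) \;\propto\;
  \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}
  \;\prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}
```

Read this way, the first product plays the role of the global score and the double product plays the role of the conditional score from the intuition slide.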
Pre-computing Atomic Probabilities in Ranking Function
- $p(y \mid W)$: relative frequency in W
- $p(y \mid D)$: relative frequency in D
- $p(x \mid y, W)$: (# of tuples in W that contain x, y) / total # of tuples in W
- $p(x \mid y, D)$: (# of tuples in D that contain x, y) / total # of tuples in D
![Page 16: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/16.jpg)
Example for Computing Atomic Probabilities
Select * From D Where City=“Seattle” And View=“Waterfront”
Y={SchoolDistrict, BoatDock, …}
D = 10,000; W = 1000; W{excellent} = 10; W{waterfront & yes} = 5
- p(excellent|W) = 10/1000 = 0.01
- p(excellent|D) = 10/10,000 = 0.001
- p(waterfront|yes,W) = 5/1000 = 0.005
- p(waterfront|yes,D) = 5/10,000 = 0.0005
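Counting-based estimates of this kind can be sketched as below. The helper names and the ten-row sample table are hypothetical. Note one deliberate difference, stated as an assumption: the slides divide the co-occurrence count by the total number of tuples, whereas `p_conditional` divides by the number of tuples containing y, which is the usual definition of a conditional relative frequency. Both ratios are shown so they can be compared.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Cond = Tuple[str, str]  # (attribute name, attribute value)


def count(rows: List[Row], *conds: Cond) -> int:
    """Number of rows that satisfy every given attribute=value condition."""
    return sum(1 for r in rows if all(r.get(a) == v for a, v in conds))


def p_single(rows: List[Row], y: Cond) -> float:
    """Relative frequency of value y in the table."""
    return count(rows, y) / len(rows) if rows else 0.0


def p_joint_over_total(rows: List[Row], x: Cond, y: Cond) -> float:
    """Ratio as stated on the slide: (# rows with x and y) / (total # rows)."""
    return count(rows, x, y) / len(rows) if rows else 0.0


def p_conditional(rows: List[Row], x: Cond, y: Cond) -> float:
    """Assumed alternative: (# rows with x and y) / (# rows with y)."""
    n_y = count(rows, y)
    return count(rows, x, y) / n_y if n_y else 0.0


# Hypothetical 10-row table (could stand for either W or D).
rows = (
    [{"View": "Waterfront", "BoatDock": "Yes"}] * 2
    + [{"View": "Waterfront", "BoatDock": "No"}] * 1
    + [{"View": "Greenbelt", "BoatDock": "Yes"}] * 2
    + [{"View": "Greenbelt", "BoatDock": "No"}] * 5
)
x, y = ("View", "Waterfront"), ("BoatDock", "Yes")
freq_y = p_single(rows, y)                 # 4 of 10 rows have BoatDock=Yes
slide_ratio = p_joint_over_total(rows, x, y)  # 2 of 10 rows have both
cond_ratio = p_conditional(rows, x, y)        # 2 of the 4 BoatDock=Yes rows
```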
![Page 17: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/17.jpg)
Indexing Atomic Probabilities
- $p(y \mid W)$: {AttName, AttVal, Prob}, B+ tree index on (AttName, AttVal)
- $p(y \mid D)$: {AttName, AttVal, Prob}, B+ tree index on (AttName, AttVal)
- $p(x \mid y, W)$: {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}, B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
- $p(x \mid y, D)$: {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}, B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
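As a rough illustration of these auxiliary tables, the sketch below creates two of them in an in-memory SQLite database through Python's standard `sqlite3` module. The column names follow the slide; the table names (`GlobalProbW`, `CondProbW`), the index names, and the stored probabilities are made up for the example, and SQLite's built-in B-tree indexes merely stand in for the B+ tree indexes of the original system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table per atomic quantity; only the W variants are shown here.
    CREATE TABLE GlobalProbW (AttName TEXT, AttVal TEXT, Prob REAL);
    CREATE INDEX idx_GlobalProbW ON GlobalProbW (AttName, AttVal);

    CREATE TABLE CondProbW (
        AttNameLeft TEXT, AttValLeft TEXT,
        AttNameRight TEXT, AttValRight TEXT,
        Prob REAL
    );
    CREATE INDEX idx_CondProbW
        ON CondProbW (AttNameLeft, AttValLeft, AttNameRight, AttValRight);
""")

# Illustrative values only, not taken from the slides.
conn.execute("INSERT INTO GlobalProbW VALUES (?, ?, ?)",
             ("SchoolDistrict", "Excellent", 0.25))
conn.execute("INSERT INTO CondProbW VALUES (?, ?, ?, ?, ?)",
             ("View", "Waterfront", "BoatDock", "Yes", 0.5))


def lookup_global(db, name, val):
    """Fetch p(y | W) for y = (name, val); None if it was never stored."""
    row = db.execute(
        "SELECT Prob FROM GlobalProbW WHERE AttName = ? AND AttVal = ?",
        (name, val)).fetchone()
    return row[0] if row else None


def lookup_cond(db, x_name, x_val, y_name, y_val):
    """Fetch p(x | y, W) for x = (x_name, x_val), y = (y_name, y_val)."""
    row = db.execute(
        "SELECT Prob FROM CondProbW WHERE AttNameLeft = ? AND AttValLeft = ? "
        "AND AttNameRight = ? AND AttValRight = ?",
        (x_name, x_val, y_name, y_val)).fetchone()
    return row[0] if row else None


p_excellent = lookup_global(conn, "SchoolDistrict", "Excellent")
p_wf_given_dock = lookup_cond(conn, "View", "Waterfront", "BoatDock", "Yes")
```

Both lookups are equality searches on the full index key, which is the access pattern the composite indexes are meant to serve.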
![Page 18: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/18.jpg)
Scan Algorithm
Preprocessing - Atomic Probabilities Module
- Computes and indexes the quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for all distinct values x and y
Execution
- Select tuples that satisfy the query
- Scan and compute score for each result tuple
- Return top-K tuples
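The execution phase can be sketched as follows. Several things here are assumptions rather than facts from the slides: the score is taken to be a product of ratios of the four pre-computed quantities, the probabilities are passed in as plain dictionaries instead of indexed tables, a missing probability falls back to a small `eps`, and the three-row table is invented for the example.

```python
import heapq
from typing import Dict, List, Tuple

Cond = Tuple[str, str]  # (attribute name, attribute value)


def score(row: Dict[str, object], query: Dict[str, str],
          p_y_w: Dict[Cond, float], p_y_d: Dict[Cond, float],
          p_xy_w: Dict[Tuple[Cond, Cond], float],
          p_xy_d: Dict[Tuple[Cond, Cond], float],
          eps: float = 1e-6) -> float:
    """Assumed score: product over unspecified values y of
    p(y|W)/p(y|D) times, for each specified x, p(x|y,W)/p(x|y,D)."""
    s = 1.0
    for attr, val in row.items():
        if attr == "TID" or attr in query:
            continue  # only unspecified attributes contribute
        y = (attr, val)
        s *= p_y_w.get(y, eps) / p_y_d.get(y, eps)
        for x in query.items():
            s *= p_xy_w.get((x, y), eps) / p_xy_d.get((x, y), eps)
    return s


def scan_top_k(table: List[Dict[str, object]], query: Dict[str, str], k: int,
               p_y_w, p_y_d, p_xy_w, p_xy_d) -> List[object]:
    """Select tuples satisfying the query, score each one, return top-k TIDs."""
    answers = [r for r in table
               if all(r.get(a) == v for a, v in query.items())]
    scored = ((score(r, query, p_y_w, p_y_d, p_xy_w, p_xy_d), r["TID"])
              for r in answers)
    return [tid for _, tid in heapq.nlargest(k, scored)]


# Hypothetical data.
table = [
    {"TID": 1, "City": "Seattle", "View": "Waterfront", "BoatDock": "Yes"},
    {"TID": 2, "City": "Seattle", "View": "Waterfront", "BoatDock": "No"},
    {"TID": 3, "City": "Bellevue", "View": "Waterfront", "BoatDock": "Yes"},
]
query = {"City": "Seattle", "View": "Waterfront"}
p_y_w = {("BoatDock", "Yes"): 0.4, ("BoatDock", "No"): 0.1}
p_y_d = {("BoatDock", "Yes"): 0.2, ("BoatDock", "No"): 0.8}
top = scan_top_k(table, query, 2, p_y_w, p_y_d, {}, {})
```

With empty conditional dictionaries every conditional ratio is `eps / eps = 1`, so the ranking in this example is driven by the global ratios alone. The cost is one score computation per tuple in the answer set.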
![Page 19: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/19.jpg)
Beyond Scan Algorithm
- Scan algorithm is inefficient: many tuples in the answer set
- Another extreme: pre-compute top-K tuples for all possible queries; still infeasible in practice
- Trade-off solution: pre-compute ranked lists of tuples for all possible atomic queries; at query time, merge ranked lists to get top-K tuples
![Page 20: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/20.jpg)
Two Kinds of Ranked Lists
- CondList $C_x$: {AttName, AttVal, TID, CondScore}, B+ tree index on (AttName, AttVal, CondScore)
- GlobList $G_x$: {AttName, AttVal, TID, GlobScore}, B+ tree index on (AttName, AttVal, GlobScore)
![Page 21: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/21.jpg)
Index Module
![Page 22: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/22.jpg)
List Merge Algorithm
![Page 23: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/23.jpg)
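The algorithm itself is only on the slide image. One common way to merge pre-computed ranked lists is a threshold-style merge, sketched below under explicit assumptions: each list holds (TID, score) pairs sorted by descending non-negative score, a tuple's total score is the product of its scores across all lists, and a tuple missing from any list does not satisfy the query. The function name and sample lists are hypothetical, and this is not claimed to be the exact procedure from the slide.

```python
import heapq
from typing import Hashable, List, Tuple

RankedList = List[Tuple[Hashable, float]]  # (TID, score), descending by score


def list_merge_top_k(lists: List[RankedList],
                     k: int) -> List[Tuple[float, Hashable]]:
    """Threshold-style merge: read all lists in parallel, look up each newly
    seen TID in every list, and stop once the k-th best total score is at
    least the best total any unseen TID could still reach."""
    if k <= 0 or not lists:
        return []
    lookups = [dict(lst) for lst in lists]   # random access by TID
    seen = set()
    top: List[Tuple[float, Hashable]] = []   # min-heap holding at most k items
    depth = 0
    # Once any list is exhausted, every qualifying TID has already been seen.
    while all(depth < len(lst) for lst in lists):
        threshold = 1.0
        for lst in lists:
            tid, s = lst[depth]
            threshold *= s                   # upper bound for unseen TIDs
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in lk for lk in lookups):
                total = 1.0
                for lk in lookups:
                    total *= lk[tid]
                if len(top) < k:
                    heapq.heappush(top, (total, tid))
                elif total > top[0][0]:
                    heapq.heapreplace(top, (total, tid))
        if len(top) == k and top[0][0] >= threshold:
            break
        depth += 1
    return sorted(top, reverse=True)


# Hypothetical ranked lists for the two atomic conditions of the example query.
list_seattle = [(1, 0.9), (2, 0.5), (3, 0.2)]
list_waterfront = [(2, 0.8), (1, 0.6), (4, 0.3)]
top2 = list_merge_top_k([list_seattle, list_waterfront], 2)
top2_tids = [tid for _, tid in top2]
```

In this example the merge stops after reading two entries from each list: the threshold at that depth is 0.5 × 0.6 = 0.30, which is below the second-best total already found, so the third entries are never read.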
Experimental Setup
Datasets:
- MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
- Internet Movie Database (http://www.imdb.com)
Software and Hardware:
- Microsoft SQL Server 2000 RDBMS
- P4 2.8-GHz PC, 1 GB RAM
- C#, connected to RDBMS through DAO
![Page 24: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/24.jpg)
Quality Experiments
- Conducted on Seattle Homes and Movies tables
- Collect a workload from users
- Compare the Conditional Ranking Method in the paper with the Global Method [CIDR03]
![Page 25: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/25.jpg)
Quality Experiment-Average Precision
For each query Qi , generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
Let each user mark 10 tuples in Hi as most relevant to Qi
Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
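The slide does not spell out the matching measure. A simple reading, assumed here, is the fraction of the algorithm's returned tuples that the user also marked as relevant; the function name and the sample TIDs are hypothetical.

```python
from typing import Hashable, Iterable


def precision(user_marked: Iterable[Hashable],
              returned: Iterable[Hashable]) -> float:
    """Fraction of returned tuples that the user marked as relevant."""
    returned_set = set(returned)
    if not returned_set:
        return 0.0
    return len(set(user_marked) & returned_set) / len(returned_set)


# Hypothetical TIDs: the user marks 10 of the 30 tuples in H_i,
# and the algorithm returns its own top 10.
marked_by_user = range(1, 11)
returned_by_algorithm = [1, 2, 3, 4, 5, 6, 7, 21, 22, 23]
p = precision(marked_by_user, returned_by_algorithm)  # 7 of 10 overlap
```

Averaging this value over all queries and users would give one average-precision figure per algorithm.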
![Page 26: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/26.jpg)
Quality Experiment - Fraction of Users Preferring Each Algorithm
- 5 new queries
- Users were given the top-5 results
![Page 27: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/27.jpg)
Performance Experiments
Datasets:

| Table | NumTuples | Database Size (MB) |
|---|---|---|
| Seattle Homes | 17463 | 1.936 |
| US Homes | 1380762 | 140.432 |

Compare 2 algorithms:
- Scan algorithm
- List Merge algorithm
![Page 28: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/28.jpg)
Performance Experiments – Pre-computation Time
![Page 29: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/29.jpg)
Performance Experiments – Execution Time
![Page 30: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/30.jpg)
Performance Experiments – Execution Time
![Page 31: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/31.jpg)
Performance Experiments – Execution Time
![Page 32: Probabilistic Ranking of Database Query Results](https://reader036.fdocuments.in/reader036/viewer/2022062522/56812ee9550346895d9485fb/html5/thumbnails/32.jpg)
Conclusion and Open Problems
- Automatic ranking for the many-answers problem
- Adaptation of PIR to DB
- Open problems: multiple-table queries, non-categorical attributes