IO-Top-k: Index-access Optimized Top-k Query Processing
Debapriyo Majumdar
Max-Planck-Institut für Informatik
Saarbrücken, Germany
Joint work with Holger Bast, Ralf Schenkel, Martin Theobald, Gerhard Weikum
VLDB 2006, Seoul, Korea
Setup
Pre-computed index-lists over multiple attributes

price              resolution         zoom
camera 1  €300     camera 5  8MP      camera 3  7x
camera 3  €330     camera 1  7MP      camera 1  5x
camera 5  €490     camera 4  6MP      camera 2  4x
camera 4  €580     camera 2  4MP      camera 5  4x
…                  …                  …

single numeric score for every item for each attribute
lists are accessible by both sorted and random accesses
combine scores by some monotonic aggregation function, e.g. a weighted sum $w_1 \cdot \text{res} + w_2 \cdot \text{zoom} - w_3 \cdot \text{price}$
Goal: find the top-k items with highest total scores
Top-k algorithms: example
lists sorted by score

List 1          List 2          List 3
item 25  0.6    item 17  0.6    item 83  0.9
item 78  0.5    item 38  0.6    item 17  0.7
item 83  0.4    item 14  0.6    item 61  0.3
item 17  0.3    item 5   0.6    item 81  0.2
item 21  0.2    item 83  0.5    item 65  0.1
item 91  0.1    item 21  0.3    item 10  0.1
item 44  0.1
Fagin’s NRA Algorithm: round 1
read one item from every list
Candidates [current score, best-score]:
item 83 [0.9, 2.1]
item 17 [0.6, 2.1]
item 25 [0.6, 2.1]
min top-2 score: 0.6
maximum score for unseen items: 2.1
min-top-2 < best-score of candidates
Fagin’s NRA Algorithm: round 2
read one item from every list
Candidates [current score, best-score]:
item 17 [1.3, 1.8]
item 83 [0.9, 2.0]
item 25 [0.6, 1.9]
item 38 [0.6, 1.8]
item 78 [0.5, 1.8]
min top-2 score: 0.9
maximum score for unseen items: 1.8
min-top-2 < best-score of candidates
Fagin’s NRA Algorithm: round 3
read one item from every list
Candidates [current score, best-score]:
item 83 [1.3, 1.9]
item 17 [1.3, 1.7]
item 25 [0.6, 1.5]
item 78 [0.5, 1.4]
min top-2 score: 1.3
maximum score for unseen items: 1.3
min-top-2 < best-score of candidates
no more new items can get into top-2
but, extra candidates left in queue
Fagin’s NRA Algorithm: round 4
read one item from every list
Candidates [current score, best-score]:
item 17 1.6
item 83 [1.3, 1.9]
item 25 [0.6, 1.4]
min top-2 score: 1.3
maximum score for unseen items: 1.1
min-top-2 < best-score of candidates
no more new items can get into top-2
but, extra candidates left in queue
Fagin’s NRA Algorithm: round 5
read one item from every list
Candidates:
item 83 1.8
item 17 1.6
min top-2 score: 1.6
maximum score for unseen items: 0.8
no extra candidate in queue
Done!
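The five rounds above can be replayed with a short sketch. It assumes summation as the aggregation function and a round-robin scan, and uses exact fractions so that the threshold comparisons are not disturbed by floating-point rounding. The function name `nra` and its return format are illustrative choices, not taken from the talk.

```python
from fractions import Fraction

# Index lists from the example, each sorted by descending score.
RAW_LISTS = [
    [(25, 0.6), (78, 0.5), (83, 0.4), (17, 0.3), (21, 0.2), (91, 0.1), (44, 0.1)],
    [(17, 0.6), (38, 0.6), (14, 0.6), (5, 0.6), (83, 0.5), (21, 0.3)],
    [(83, 0.9), (17, 0.7), (61, 0.3), (81, 0.2), (65, 0.1), (10, 0.1)],
]
# Exact rational scores keep the bound comparisons free of rounding noise.
LISTS = [[(item, Fraction(str(s))) for item, s in lst] for lst in RAW_LISTS]


def nra(lists, k):
    """Round-robin NRA with summation; returns (top-k, number of sorted accesses)."""
    m = len(lists)
    seen = {}                              # item -> {list index: score}
    high = [lst[0][1] for lst in lists]    # last score read from each list
    top, depth = [], 0
    for depth in range(max(len(lst) for lst in lists)):
        for i, lst in enumerate(lists):    # one sorted access per list
            if depth < len(lst):
                item, score = lst[depth]
                seen.setdefault(item, {})[i] = score
                high[i] = score
            else:
                high[i] = Fraction(0)      # list exhausted
        worst = {d: sum(parts.values()) for d, parts in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in parts)
                for d, parts in seen.items()}
        top = sorted(seen, key=lambda d: worst[d], reverse=True)[:k]
        if len(top) < k:
            continue
        min_k = worst[top[-1]]
        rest_pruned = all(best[d] <= min_k for d in seen if d not in top)
        if sum(high) <= min_k and rest_pruned:
            break                          # nothing outside the top-k can still win
    return [(d, float(worst[d])) for d in top], m * (depth + 1)


top2, sorted_accesses = nra(LISTS, 2)
print(top2, sorted_accesses)   # [(83, 1.8), (17, 1.6)] 15
```

The run stops after round 5, i.e. after 15 sorted accesses, which matches the walkthrough.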
Top-k algorithms
NRA (No Random Access) performs only sorted accesses (SA)
Random access (RA)
– lookup actual (final) score of an item
– costlier than SA (100 – 100,000 times), cR/cS := (cost of RA)/(cost of SA)
– often very useful
CA (Combined Algorithm) (Fagin et al., 2001)
– one RA after every cR/cS SAs
– total cost of SA ~ total cost of RA
Measure of effectiveness (access cost): #SA + cR/cS x #RA
Full-merge: compute scores for all items followed by partial sort
– simple and efficient
– important baseline for any top-k algorithm
Problems with NRA, CA
– high bookkeeping overhead: cannot beat full-merge in runtime
– for “high” values of k, even the gain in access cost is not significant
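A minimal sketch of the full-merge baseline and of the access-cost measure, assuming summation as the aggregation function. The function names are illustrative; the lists are the ones from the running example.

```python
import heapq

LISTS = [
    [(25, 0.6), (78, 0.5), (83, 0.4), (17, 0.3), (21, 0.2), (91, 0.1), (44, 0.1)],
    [(17, 0.6), (38, 0.6), (14, 0.6), (5, 0.6), (83, 0.5), (21, 0.3)],
    [(83, 0.9), (17, 0.7), (61, 0.3), (81, 0.2), (65, 0.1), (10, 0.1)],
]


def access_cost(num_sa, num_ra, ratio):
    """Access cost #SA + (cR/cS) x #RA."""
    return num_sa + ratio * num_ra


def full_merge(lists, k):
    """Sum the scores of every item over all lists, then partially sort."""
    totals = {}
    for lst in lists:
        for item, score in lst:
            totals[item] = totals.get(item, 0.0) + score
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])


top2 = full_merge(LISTS, 2)
# Full-merge reads every entry once by sorted access and needs no random access.
merge_cost = access_cost(sum(len(lst) for lst in LISTS), 0, ratio=1000)
```

On this tiny example full-merge costs 19 sorted accesses, against 15 for the NRA run above; the bookkeeping needed to save those accesses is what makes full-merge hard to beat in running time.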
Top-k algorithms
Greedy heuristics for sorted access scheduling, based on crude estimate of scores (Güntzer, Balke, Kiessling, ITCC 2001)
RankSQL: ordering of binary rank joins at query planning time (Ilyas et al., SIGMOD ’04 and Li et al., SIGMOD ’05)
Scheduling RAs on “expensive predicates”, where SAs may not even be possible on all attributes (our setting is different)
– MPro (Chang and Hwang, SIGMOD 2002)
– Upper, Pick (Bruno, Gravano and Marian, ICDE ’02, ACM TODS ’04)
Probabilistic pruning of candidates, RA scheduling (Theobald, Schenkel and Weikum, VLDB ’04, VLDB ’05)
Main related previous works: NRA, CA
Our algorithm: IO-Top-k
Round 1: same as NRA
Candidates [current score, best-score]:
item 83 [0.9, 2.1]
item 17 [0.6, 2.1]
item 25 [0.6, 2.1]
min top-2 score: 0.6
maximum score for unseen items: 2.1
min-top-2 < best-score of candidates
sorted accesses in later rounds: not necessarily round robin
Our algorithm: IO-Top-k, round 2
sorted accesses: not necessarily round robin
Candidates [current score, best-score]:
item 17 [1.3, 1.8]
item 83 [0.9, 2.0]
item 25 [0.6, 1.9]
item 78 [0.5, 1.4]
min top-2 score: 0.9
maximum score for unseen items: 1.4
min-top-2 < best-score of candidates
Our algorithm: IO-Top-k, round 3
sorted accesses: not necessarily round robin
Candidates [current score, best-score]:
item 17 1.6
item 83 [1.3, 1.9]
item 25 [0.6, 1.4]
min top-2 score: 1.3
maximum score for unseen items: 1.1
min-top-2 < best-score of candidates
potential candidate for top-2
Our algorithm: IO-Top-k, round 4
Round 4: random access for item 83
Candidates:
item 83 1.8
item 17 1.6
min top-2 score: 1.6
maximum score for unseen items: 1.1
no extra candidate in queue
Done!
fewer sorted accesses (not necessarily round robin)
carefully scheduled random access
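The bounds in this walkthrough can be recomputed for any combination of scan depths. The sketch below assumes summation as the aggregation function; the helper names `bounds` and `random_access` are illustrative. The scan depths 4, 1 and 4 are an inference from the round 3 values (they reproduce the unseen-item bound 1.1 and the intervals of items 17, 83 and 25), not something the slides state.

```python
from fractions import Fraction

RAW_LISTS = [
    [(25, 0.6), (78, 0.5), (83, 0.4), (17, 0.3), (21, 0.2), (91, 0.1), (44, 0.1)],
    [(17, 0.6), (38, 0.6), (14, 0.6), (5, 0.6), (83, 0.5), (21, 0.3)],
    [(83, 0.9), (17, 0.7), (61, 0.3), (81, 0.2), (65, 0.1), (10, 0.1)],
]
LISTS = [[(item, Fraction(str(s))) for item, s in lst] for lst in RAW_LISTS]


def bounds(lists, depths):
    """Return ({item: (current score, best-score)}, bound for unseen items)."""
    # high[i]: last score read in list i (top score if the list is untouched).
    high = [lst[d - 1][1] if d > 0 else lst[0][1] for lst, d in zip(lists, depths)]
    seen = {}
    for i, (lst, d) in enumerate(zip(lists, depths)):
        for item, score in lst[:d]:
            seen.setdefault(item, {})[i] = score
    cand = {}
    for item, parts in seen.items():
        current = sum(parts.values())
        best = current + sum(h for i, h in enumerate(high) if i not in parts)
        cand[item] = (current, best)
    return cand, sum(high)


def random_access(lists, item, list_index):
    """Look up the score of one item in one list (0 if the item is absent)."""
    return dict(lists[list_index]).get(item, Fraction(0))


cand, unseen_bound = bounds(LISTS, depths=(4, 1, 4))   # 9 sorted accesses
# One random access resolves item 83 in list 2 (index 1).
final_83 = cand[83][0] + random_access(LISTS, 83, 1)
min_top2 = min(final_83, cand[17][0])
# Every remaining candidate must be pruned by the new min-top-2 score.
others_pruned = all(best <= min_top2 for d, (_, best) in cand.items()
                    if d not in (83, 17))
```

With these depths the run uses 9 sorted accesses and 1 random access, compared with 15 sorted accesses for round-robin NRA.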
Outline
Our contributions
– Inverted block-index data structure
– Sorted access scheduling
– Random access scheduling
– Lower bound
Experiments
Conclusion
Inverted block-index
Lists are first sorted by score
sort each block by item-id
Top-k algorithm with block-index
split into blocks
blocks are sorted by item ids, efficiently merged by full-merge!
[Figure: blocks labelled 1, 2, 3 in the order they are full-merged, and so on; the remaining blocks are pruned]
Choose block size balancing disk seek time and data transfer rate
Low overhead: prune once every round
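A minimal in-memory sketch of the block layout described above: sort a list by score, cut it into fixed-size blocks, and sort each block by item id so that blocks from different lists can be merged in one pass. The function names and the block size of 2 are illustrative; a real index would size blocks according to disk characteristics.

```python
import heapq
from itertools import groupby


def build_block_index(postings, block_size):
    """postings: (item_id, score) pairs. Returns blocks, each sorted by item id."""
    by_score = sorted(postings, key=lambda p: p[1], reverse=True)
    blocks = [by_score[i:i + block_size]
              for i in range(0, len(by_score), block_size)]
    return [sorted(block, key=lambda p: p[0]) for block in blocks]


def merge_blocks(blocks):
    """Merge id-sorted blocks (e.g. one per list) and sum the scores per item."""
    merged = heapq.merge(*blocks, key=lambda p: p[0])
    return [(item, sum(score for _, score in group))
            for item, group in groupby(merged, key=lambda p: p[0])]


list1 = [(25, 0.6), (78, 0.5), (83, 0.4), (17, 0.3), (21, 0.2), (91, 0.1), (44, 0.1)]
index1 = build_block_index(list1, block_size=2)
```

Because every block is id-sorted, `merge_blocks` never needs a hash table lookup per entry, which is the source of the low overhead claimed for the block-wise merge.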
Sorted access scheduling
General Paradigm
Inverted Block-Index:

List 1   List 2   List 3
b11      b21      b31
b12      b22      b32
b13      b23      b33
b14      b24      b34

We assign benefits to every block of each list
Optimization problem
– Goal: choose a total of 3 blocks from any of the lists such that the total benefit is maximized
– We can show: this problem is NP-Hard, the well known Knapsack problem reduces to it
– But, the number of blocks to choose and number of lists to choose from are small: we can solve it by enumerating all possibilities
– We choose the schedule with maximum benefit, and continue to next round
Result: scans to different depths in lists
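The enumeration step can be sketched as follows, assuming that sorted access forces each list to be read as a prefix of its remaining blocks. The benefit values are made up for illustration, and `best_schedule` is a hypothetical name.

```python
from itertools import product


def best_schedule(benefits, budget):
    """benefits[i][j]: benefit of the j-th unread block of list i.

    Tries every way of splitting `budget` blocks over the lists and returns
    (total benefit, number of blocks to read from each list).
    """
    best = None
    ranges = [range(min(budget, len(b)) + 1) for b in benefits]
    for counts in product(*ranges):
        if sum(counts) != budget:
            continue
        # Sorted access reads blocks in order, so a list contributes a prefix.
        total = sum(sum(b[:c]) for b, c in zip(benefits, counts))
        if best is None or total > best[0]:
            best = (total, counts)
    return best


# Hypothetical benefits for three lists with four unread blocks each.
benefits = [[5, 1, 1, 1], [2, 2, 2, 2], [4, 3, 0, 0]]
total, counts = best_schedule(benefits, budget=3)
print(total, counts)   # 12 (1, 0, 2)
```

With three lists and a budget of three blocks there are only ten feasible splits, so exhaustive search is cheap even though the general problem is NP-hard.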
Sorted access scheduling
Knapsack for Score Reduction (KSR)
Pre-compute score reduction $\Delta_{ij}$ of every block j of each list i: (max-score of the block – min-score of the block)
Candidate item d [sd, bd]; lists scanned till some depth
Candidate item d is already seen in list 3. If we scan list 3 further, score sd and best-score bd of d do not change: no benefit
In list 2, d is not yet seen. If we scan one block (block 22) from list 2
– with high probability d will not be found in that block: best-score bd of d decreases by $\Delta_{22}$
Benefit of block B in list i:
$\sum_d \Delta_B \, (1 - \Pr[d \text{ found in } B]) \approx \sum_d \Delta_B$
sum taken over all candidates d not yet seen in list i
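Under the approximation Pr[d found in B] ≈ 0, the KSR benefit of a block is its score reduction times the number of candidates not yet seen in that list. The sketch below illustrates this; the block contents and candidate states are hypothetical, and list indices are 0-based (list 2 is index 1).

```python
def score_reduction(block):
    """Score reduction of a block: max-score minus min-score."""
    scores = [score for _, score in block]
    return max(scores) - min(scores)


def ksr_benefit(delta, seen_in, list_index):
    """delta times the number of candidates not yet seen in the given list.

    seen_in maps each candidate to the set of list indices where it was seen.
    """
    unseen_here = sum(1 for lists in seen_in.values() if list_index not in lists)
    return delta * unseen_here


# Hypothetical next block of list 2 and hypothetical candidate states.
block = [(14, 0.75), (5, 0.5), (83, 0.25)]
delta = score_reduction(block)
seen_in = {17: {1, 2}, 83: {0, 2}, 25: {0}}
benefit_list2 = ksr_benefit(delta, seen_in, list_index=1)
```

Here items 83 and 25 are unseen in list 2, so reading the block lowers two best-scores by 0.5 each and the benefit is 1.0.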
Random access scheduling
Redundant random accesses of CA
– CA: one RA after every cR/cS SAs
– Many RAs turn out to be redundant: CA makes an RA for item d, but d is found anyway in a subsequent SA round
Our strategy: two-phase algorithm
– First sorted access rounds only, then switch to random access: no redundant random access
– Switch from SA to RA, when
  – max-score for unseen ≤ min-top-k score
  – estimated RA-cost ≤ total SA-cost so far
– cost of SA ~ cost of RA
– need to estimate cost of RA
Random access scheduling
Estimating number of random accesses
[Figure: lists scanned till some depths by sorted access; candidate items sorted by best score (CA style), each shown with its current score and best-score against the current min-top-3 score; random access prunes candidates]
A crude upper estimate: # of items in queue
Each random access can prune some candidates, so a better estimate of #RAs is necessary
Random access scheduling
Estimating number of random accesses
candidate items sorted by best score (CA style); current min-top-3 score; item d [sd, bd]
If there are at least three items before d with final score > bd, d will be pruned before random access
If there are fewer than three items before d with final score > bd, a random access for d must be made (next: RA for d)
Random access scheduling
Estimating number of random accesses
candidate items sorted by best score; current min-top-k score; item d [sd, bd]
For general k: there will be a random access for d if and only if the # of items before d (j-1 items) with final score > bd is less than k
Let d be the j-th item $d_j$ by best-score ordering
For all i < j, define random variables $F_{i,j}$:
$F_{i,j} = 1$ if final-score($d_i$) > best-score(d), 0 otherwise
We compute $\Pr[F_{i,j} = 1]$ using histograms of the score distributions of the lists
Observation:
$\Pr[\text{RA is made for } d] = \Pr[F_{1,j} + \dots + F_{j-1,j} < k]$
Expected # of random accesses:
$\sum_j \Pr[F_{1,j} + \dots + F_{j-1,j} < k]$
the sum is taken over all candidate items
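Given the probabilities Pr[F_{i,j} = 1], the expected number of random accesses can be evaluated directly. The sketch below assumes the F_{i,j} are independent and computes Pr[F_{1,j} + ... + F_{j-1,j} < k] exactly with a small dynamic program over the number of successes. This exact evaluation is an illustrative substitute, not necessarily the computation used in the talk, and the probabilities in the usage example are made up.

```python
def prob_fewer_than_k(probs, k):
    """Pr[sum of independent Bernoulli(p) variables < k]."""
    dist = [1.0]                      # dist[c] = Pr[exactly c successes so far]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for c, q in enumerate(dist):
            nxt[c] += q * (1 - p)     # this variable is 0
            nxt[c + 1] += q * p       # this variable is 1
        dist = nxt
    return sum(dist[:k])


def expected_random_accesses(pr_f, k):
    """pr_f[j] holds Pr[F_ij = 1] for all items i ranked before candidate j."""
    return sum(prob_fewer_than_k(row, k) for row in pr_f)


# Four candidates in best-score order, k = 2 (hypothetical probabilities).
pr_f = [[], [1.0], [1.0, 1.0], [0.5, 0.5, 1.0]]
estimate = expected_random_accesses(pr_f, k=2)
print(estimate)   # 2.25
```

The first two candidates always need a random access, the third never does, and the fourth needs one only if neither of the two uncertain predecessors ends above its best-score, which has probability 0.25.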
Experiments: estimate of RA
TREC Terabyte data, TREC 2005 adhoc task queries
After all sorted accesses: # items in queue, #RA estimated (EST) and #RA actually done (DONE)
[Figure: total RA for 50 queries; bars for queue size, EST and DONE]
Lower bound: what is the best possible?
Try every possible SA-schedule
Count essential number of RAs that must be done
carefully engineered dynamic programming to try out all schedules

              #SA          +  cR/cS x #RA  =  Total cost
Schedule 1    6 x 10000    +  1000 x 75    =  135,000
Schedule 2    9 x 10000    +  1000 x 12    =  102,000
Schedule 3    12 x 10000   +  1000 x 3     =  123,000
…             …               …               …
Lower bound                                   102,000

block size 10,000
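The arithmetic in the table can be checked with a few lines. The sketch only covers the three schedules listed; the actual lower bound is the minimum over all schedules, which the talk finds by dynamic programming.

```python
BLOCK_SIZE = 10_000   # sorted accesses per block
RATIO = 1_000         # cR/cS

# (number of blocks scanned, essential random accesses) per schedule.
schedules = {"Schedule 1": (6, 75), "Schedule 2": (9, 12), "Schedule 3": (12, 3)}


def total_cost(blocks, num_ra):
    """#SA + cR/cS x #RA, with #SA = blocks x block size."""
    return blocks * BLOCK_SIZE + RATIO * num_ra


costs = {name: total_cost(b, r) for name, (b, r) in schedules.items()}
cheapest = min(costs, key=costs.get)
print(cheapest, costs[cheapest])   # Schedule 2 102000
```

Scanning deeper reduces the number of essential random accesses, but beyond some point the extra sorted accesses cost more than the random accesses they save.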
Experiments: TREC
TREC Terabyte benchmark collection: over 25 million documents, 426 GB raw data
50 queries from TREC 2005 adhoc task
[Figure: average cost (#SA + 1000 x #RA) vs. k, y-axis up to 4,000,000; curves: full merge, NRA, CA, IO-Top-k (OUR), lower bound]
[Figure: average running time (milliseconds) vs. k, y-axis up to 250; curves: full merge, NRA, CA, IO-Top-k (OUR)]
Experiments: HTTP logs
FIFA World Cup 1998 HTTP logs
– 1.3 billion HTTP requests
– schema Log(interval, user-id, bytes), aggregated for each user within one-day intervals
– typical query: find k users with most usage during June 1-10
[Figure: curves for full merge, NRA, CA, IO-Top-k (OUR), lower bound]
Experiments: IMDB
IMDB movie data
– more than 375,000 movies, 1,200,000 persons
– attributes: Title, Genre, Actors, Description
– 20 human generated queries
[Figure: curves for full merge, CA, NRA, IO-Top-k (OUR), lower bound]
Conclusion
We presented
An inverted block-index data structure
– efficient: optimizes disk access
– performs fast merge in blocks, minimizes overhead
Integrated sorted access and random access scheduling
– SA scheduling: maximizes benefit of scanning blocks
– RA scheduling: effectively estimate RA-cost at every round
– postpone RA till the end of all SA: save redundant RAs
Lower Bound
– shows that our algorithm is close to the best possible
Thank you!
Appendix
Sorted access scheduling
Knapsack for Benefit Aggregation (KBA)
Inverted Block-Index:

List 1   List 2   List 3
e11      e21      e31
e12      e22      e32
e13      e23      e33
e14      e24      e34

Pre-compute expected score $e_{ij}$ of an item seen in block j of list i: (average score of the block)
Pre-compute score reduction $\Delta_{ij}$ of every block j of each list i: (max-score of the block – min-score of the block)
Candidate item d [sd, bd] is already seen in list 3. If we scan list 3 further, score sd and best-score bd of d do not change
In list 2, d is not yet seen. If we scan one block from list 2
– either d is found in that block: score sd of d increases, expected increase = $e_{22}$
– or d is not found in that block: best-score bd of d decreases by $\Delta_{22}$
Benefit of block B in list i:
$\sum_d \left( e_B \Pr[d \text{ found in } B] + \Delta_B \, (1 - \Pr[d \text{ found in } B]) \right)$
The sum is taken over all candidates d not yet seen in list i
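The KBA benefit formula translates directly into code. In this sketch the per-candidate probabilities Pr[d found in B] are taken as given inputs, since how they are estimated is not shown here; the numbers in the usage example are hypothetical.

```python
def kba_benefit(expected_score, delta, p_found):
    """Benefit of a block under KBA.

    expected_score: average score of the block (e_B)
    delta:          max-score minus min-score of the block
    p_found:        Pr[d found in B] for each candidate d unseen in this list
    """
    return sum(expected_score * p + delta * (1 - p) for p in p_found)


# Three unseen candidates with hypothetical probabilities of being in the block.
benefit = kba_benefit(expected_score=0.5, delta=0.25, p_found=[0.0, 1.0, 0.5])
print(benefit)   # 1.125
```

Setting every probability to 0 reduces this to the KSR benefit, i.e. the score reduction times the number of unseen candidates.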
Random access scheduling: details
Estimating number of random accesses
candidate items sorted by best score; current min-top-k score; item d [sd, bd]
There will be a random access for d if and only if the # of items before d (j-1 items) with final score > bd is less than k
Let d be the j-th item by best-score ordering
For all i < j, define random variables $F_{i,j}$ which take value 1 if the final score of the i-th item is greater than the best-score of d, 0 otherwise
Compute $\Pr[F_{i,j} = 1]$ using the expected score gain of the i-th item from lists where it is not yet seen
Also define a random variable $R_j$ which takes value 1 if a random access is made for d, 0 otherwise
Observation: $\Pr[R_j = 1] = \Pr[F_{1,j} + \dots + F_{j-1,j} < k]$
Let $X_j := F_{1,j} + \dots + F_{j-1,j}$
Assume the $F_{i,j}$ are independent; then $X_j$ approximately follows a Poisson distribution with mean $\sum_i \Pr[F_{i,j} = 1]$
We can compute $\Pr[X_j < k]$ using the incomplete gamma function
Expected number of random accesses is
$\sum_j E(R_j) = \sum_j \Pr[R_j = 1] = \sum_j \Pr[X_j < k]$
the sum is taken over all candidate items
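A sketch of the Poisson-based estimate, using only the standard library. For integer k, Pr[X < k] for a Poisson variable equals a finite sum of k terms, which is equivalent to evaluating the regularized incomplete gamma function; the finite sum is used here for simplicity. The function names and the probabilities in the usage example are illustrative.

```python
import math


def poisson_prob_below(k, mean):
    """Pr[X < k] for X ~ Poisson(mean), i.e. Pr[X <= k - 1]."""
    return sum(math.exp(-mean) * mean ** x / math.factorial(x) for x in range(k))


def expected_ra_poisson(pr_f, k):
    """pr_f[j] holds Pr[F_ij = 1] for all items i ranked before candidate j."""
    return sum(poisson_prob_below(k, sum(row)) for row in pr_f)


# Two candidates, k = 2: no predecessors, then two predecessors at 0.5 each.
estimate = expected_ra_poisson([[], [0.5, 0.5]], k=2)
```

For the second candidate the Poisson mean is 1.0, giving Pr[X < 2] = 2/e, roughly 0.736, so the estimate is about 1.736 random accesses.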
Other Experiments
TREC Terabyte collection indexed with BM25 scores
For different values of cost of RA compared to cost of SA
– cR/cS ratio: 100, 1000 and 10000
Varying query size
– title fields: average size 3
– description fields: average size 8
[Figure: cost vs. (cost of RA)/(cost of SA); one panel for query size 3, one for query size 8]
End of appendix