Max-Planck-Institut Informatik / University of Patras, NetCInS Lab
KLEE: A Framework for Distributed Top-k Query Algorithms
Sebastian Michel, Max-Planck Institute for Informatics, Saarbrücken, [email protected]
Peter Triantafillou, RACTI / Univ. of Patras, Rio, [email protected]
Gerhard Weikum, Max-Planck Institute for Informatics, Saarbrücken, [email protected]
VLDB 2005, Trondheim
Overview
• Problem Statement
• Related Work
• KLEE
• The Histogram Bloom Structure
• Candidate Filtering
• Evaluation
• Conclusion / Future Work

Computational Model
• Distributed aggregation queries: a query with m terms, with index lists spread across m peers P1 ... Pm
[Figure: network of peers P1 ... P5]
Applications:
• Internet traffic monitoring
• Sensor networks
• P2P Web search

Problem Statement
• Consider
  – network consumption
  – per peer load
  – latency (query response time)
    • network
    • I/O
    • processing
Query initiator P0 serves as per-query coordinator
[Figure: coordinator P0 and cohort peers P1, P2, P3 with score-sorted index lists]
  t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
  t2: d64 0.8, d23 0.6, d10 0.6, d10 0.2, d78 0.1, …
  t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, …

Related Work
Existing Methods:
• Distributed NRA/TA: extend NRA/TA (Fagin et al. '99/'03, Güntzer et al. '01, Nepal et al. '99) with batched access
• TPUT (Cao/Wang 2004):
  1) fetch the k best entries (d, sj) from each of P1 ... Pm and aggregate (Σ j=1..m sj(d)) at P0
  2) ask each of P1 ... Pm for all entries with sj > min-k / m and aggregate results at P0
  3) fetch missing scores for all candidates by random lookups at P1 ... Pm
+ DNRA aims to minimize per-peer work
- DTA/DNRA incur many messages
+ TPUT guarantees a fixed number of message rounds
- TPUT incurs high per-peer load and net BW
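
The three TPUT phases listed on this slide can be sketched as a single-process simulation. This is a minimal sketch, not the authors' or Cao/Wang's implementation: each peer's index list is assumed to be an in-memory dict from docId to score, the function name `tput` and the toy data `TOY_LISTS` are hypothetical, and the candidate pruning that TPUT performs before phase 3 is left out for brevity.

```python
import heapq
from collections import defaultdict


def tput(index_lists, k):
    """Simulate the three TPUT phases; index_lists holds one dict per peer."""
    m = len(index_lists)

    # Phase 1: every peer sends its k best (docId, score) entries to P0.
    seen = [dict(heapq.nlargest(k, lst.items(), key=lambda e: e[1]))
            for lst in index_lists]
    partial = defaultdict(float)
    for peer_entries in seen:
        for doc, score in peer_entries.items():
            partial[doc] += score
    # min-k: the k-th largest partial sum known at the coordinator.
    min_k = heapq.nlargest(k, partial.values())[-1]
    threshold = min_k / m

    # Phase 2: every peer sends all entries with score > min-k / m.
    for j, lst in enumerate(index_lists):
        for doc, score in lst.items():
            if score > threshold:
                seen[j][doc] = score
    candidates = set().union(*seen)

    # Phase 3: random lookups fetch the missing scores of all candidates.
    totals = {doc: sum(lst.get(doc, 0.0) for lst in index_lists)
              for doc in candidates}
    return heapq.nlargest(k, totals.items(), key=lambda e: e[1])


# Hypothetical toy data: three peers, one index list each.
TOY_LISTS = [
    {"d78": 0.9, "d23": 0.8, "d10": 0.8, "d1": 0.7, "d88": 0.2},
    {"d64": 0.8, "d23": 0.6, "d10": 0.6, "d78": 0.1},
    {"d10": 0.7, "d78": 0.5, "d64": 0.4, "d99": 0.2, "d34": 0.1},
]
print(tput(TOY_LISTS, 2))
```

The sketch makes the cost drivers visible: phase 2 ships every entry above min-k / m, and phase 3 issues one lookup per candidate and peer.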

TPUT
[Figure: TPUT message flow between coordinator peer P0 and cohort peers Pi and Pj. Each cohort sends the top-k entries of its score-sorted index list; P0 maintains the current top-k candidate set; the cohorts then send all candidates scoring above min-k / m; finally P0 retrieves the missing scores.]

KLEE: Key Ideas
• if min-k / m is small, TPUT retrieves a lot of data in Phase 2 → high network traffic
• random accesses → high per-peer load
KLEE:
• Different philosophy: approximate answers!
• Efficiency:
  – reduces (docId, score)-pair transfers
  – no random accesses at each peer
• Two pillars:
  – the HistogramBlooms structure
  – the Candidate List Filter structure

The KLEE Algorithms
• KLEE: 3 or 4 steps
  1. Exploration Step: … to get a better approximation of the min-k score threshold
  2. Optimization Step: decide: 3 or 4 steps?
  3. Candidate Filtering: … a docID is a good candidate if it is high-scored in many peers
  4. Candidate Retrieval: get all good docID candidates

Histogram Bloom Structure
• Each peer pre-computes for each index list an equi-width histogram
  + Bloom filter for each cell
  + average score per cell
  + upper/lower score
[Figure: score histogram (x-axis: score, y-axis: #docs) with a Bloom filter bit vector attached to each cell]
“increase” the min-k / m threshold
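
One way to picture the per-list structure described on this slide is the sketch below. It rests on assumptions that are not in the slides: scores are taken to lie in [0, 1], the upper/lower score of a cell is taken to be its boundary, each cell's Bloom filter is packed into a Python integer, and the names `Cell`, `build_histogram_blooms`, and `may_contain` are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Cell:
    lower: float          # lower score bound of the cell
    upper: float          # upper score bound of the cell
    count: int = 0        # number of docs whose score falls into the cell
    score_sum: float = 0.0
    bloom: int = 0        # Bloom filter bits packed into one integer

    @property
    def avg_score(self):
        return self.score_sum / self.count if self.count else 0.0


def bloom_positions(doc_id, bits, hashes=2):
    # Derive `hashes` bit positions by salting a single hash function.
    for i in range(hashes):
        digest = hashlib.sha256(f"{i}:{doc_id}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % bits


def build_histogram_blooms(index_list, cells=10, bits=64):
    """Equi-width histogram over [0, 1] with one Bloom filter per cell."""
    hist = [Cell(i / cells, (i + 1) / cells) for i in range(cells)]
    for doc_id, score in index_list.items():
        cell = hist[min(int(score * cells), cells - 1)]
        cell.count += 1
        cell.score_sum += score
        for pos in bloom_positions(doc_id, bits):
            cell.bloom |= 1 << pos
    return hist


def may_contain(cell, doc_id, bits=64):
    # True for every inserted doc; may also be true for others (false positive).
    return all((cell.bloom >> pos) & 1 for pos in bloom_positions(doc_id, bits))
```

A coordinator holding such a structure can probe a cell's filter for a docId and use the cell's average score as an estimate of that peer's score.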

Bloom Filter
• bit array of size m
• k hash functions hi: docId_space → {1,..,m}
• insert n docs by hashing the ids and setting the corresponding bits
• Membership Queries:
  – document is in the Bloom Filter if the corresponding bits are set
• probability of false positives (pfp): $pfp = (1 - e^{-kn/m})^k$
• tradeoff accuracy vs. efficiency
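
A minimal Bloom filter along the lines of this slide might look as follows. It is a sketch under assumptions: the k hash functions are simulated by salting SHA-256, positions are 0-based rather than {1,..,m}, and the names `BloomFilter` and `false_positive_probability` are hypothetical. Here m is the size of the bit array, not the number of peers.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, m, k):
        self.m = m              # size of the bit array
        self.k = k              # number of hash functions
        self.bits = [0] * m

    def _positions(self, doc_id):
        # Simulate k independent hash functions by salting one hash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{doc_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, doc_id):
        for pos in self._positions(doc_id):
            self.bits[pos] = 1

    def __contains__(self, doc_id):
        # Membership query: all k corresponding bits must be set.
        return all(self.bits[pos] for pos in self._positions(doc_id))


def false_positive_probability(m, k, n):
    """pfp = (1 - e^(-k*n/m))^k after inserting n docs."""
    return (1.0 - math.exp(-k * n / m)) ** k


bf = BloomFilter(m=1024, k=4)
for i in range(100):
    bf.add(f"d{i}")
print("d42" in bf, false_positive_probability(1024, 4, 100))
```

Growing m lowers the false-positive probability at the price of a larger filter to ship, which is the accuracy vs. efficiency tradeoff named above.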

Exploration and Candidate Retrieval
[Figure: coordinator peer P0 with the current top-k candidate set and cohort peers Pi and Pj with their index lists. Each cohort sends its top-k entries together with a histogram of c cells, each cell carrying a Bloom filter of b bits; candidates above min-k / m are then retrieved.]

Candidate List Filter Matrix
• Goal: filter out unpromising candidate documents in step 2
• estimate the max number of docs that are above the min-k / m threshold
• send this number and the threshold to the cohort peers
[Figure: score distribution of an index list (x-axis: score, y-axis: number of documents) with the min-k / m threshold marked]
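
One plausible reading of the estimation step is that it uses the histogram cell counts: every cell whose upper bound exceeds the threshold may contribute all of its documents. The sketch below follows that reading; the slides do not spell out the computation, and the function name `max_docs_above` and the (lower, upper, count) cell layout are assumptions.

```python
def max_docs_above(histogram, threshold):
    """Upper bound on the number of docs scoring above `threshold`.

    histogram: list of (lower, upper, count) cells of one index list.
    A cell counts fully as soon as its upper bound exceeds the threshold,
    because any of its docs might lie above the threshold.
    """
    return sum(count for _lower, upper, count in histogram if upper > threshold)


# Hypothetical 4-cell histogram of one index list.
cells = [(0.0, 0.25, 40), (0.25, 0.5, 20), (0.5, 0.75, 8), (0.75, 1.0, 2)]
print(max_docs_above(cells, 0.6))
```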

Candidate List Filter Matrix (2)
• Each cohort returns a Bloom Filter that “contains” all docs above the min-k / m threshold
Candidate List Filter Matrix (CLFM)
010101001011110101001001010101001
010010011001011111001001010111110
101010101010100110010010011110000
000000001000000100000000000100000
Select all columns with at least R bits set
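
The column selection can be sketched in a few lines. The assumptions here are that each cohort's Bloom filter arrives as a bit string of equal length and that the matrix is the list of those strings; the function name `select_columns` and the small example matrix are hypothetical and not taken from the slide.

```python
def select_columns(clfm, r):
    """Return the column indices that have at least r bits set.

    clfm: one bit string per cohort peer, all of the same length.
    """
    width = len(clfm[0])
    return [col for col in range(width)
            if sum(row[col] == "1" for row in clfm) >= r]


# Hypothetical matrix: three cohorts, four bit positions.
matrix = ["0101", "0110", "1100"]
print(select_columns(matrix, 2))
```

A small R keeps more columns and therefore more candidates; a large R filters harder but risks dropping documents that score well at only a few peers.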

KLEE – Candidate Set Reduction
[Figure: coordinator peer P0 holds the current top-k candidate set and the min-k / m threshold. Cohort peers Pi and Pj send Bloom filter bit vectors (010010000100010001 and 100010100000010001) for their candidates above min-k / m; P0 combines them in the candidate filter matrix and sends the resulting filter 0000100000100000001 back to the cohorts.]

KLEE – Candidate Retrieval
[Figure: coordinator peer P0 with the current top-k candidate set and the candidate filter matrix; cohort peers Pi and Pj hold the filter 0000100000100000001, scan their index lists down to an early stopping point, and send the matching candidates to P0.]

Enhanced Filtering
• BF representation can be improved …
  Index list: (d1, 0.9) (d2, 0.6) (d5, 0.5) (d3, 0.3) (d4, 0.25) (d17, 0.08) (d9, 0.07)
  d1, d2, and d5 are promising documents, but e.g. s1 - s3 = 0.4!
• Send byte-array with cell-numbers instead of bits
  Bits:
    000001000000100000000000
    000001000000100010001000
    000001000000100000001000
  Cell numbers:
    28,23,14,85,34,23,11,24,54,33,60,34
    43,13,15,25,54,21,71,44,64,71,72,31
    31,43,21,75,21,71,64,34,74,63,50,62
• Select “columns” with sum over upper-bounds > min-k
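
The selection rule "sum over upper-bounds > min-k" can be illustrated with the three cell-number rows from this slide. How a cell number maps to a score upper bound is not stated in the slides; the sketch assumes 100 equi-width cells over [0, 1], so that cell number c has upper bound c / 100. The function name `select_by_upper_bound` and the min-k value of 1.5 are hypothetical.

```python
def select_by_upper_bound(cell_rows, min_k, cells=100):
    """Select columns whose summed score upper bounds exceed min-k.

    cell_rows: one list of histogram cell numbers per cohort peer.
    Assumption: cell number c corresponds to the score upper bound c / cells.
    """
    width = len(cell_rows[0])
    selected = []
    for col in range(width):
        upper_bound = sum(row[col] for row in cell_rows) / cells
        if upper_bound > min_k:
            selected.append(col)
    return selected


CELL_ROWS = [
    [28, 23, 14, 85, 34, 23, 11, 24, 54, 33, 60, 34],
    [43, 13, 15, 25, 54, 21, 71, 44, 64, 71, 72, 31],
    [31, 43, 21, 75, 21, 71, 64, 34, 74, 63, 50, 62],
]
print(select_by_upper_bound(CELL_ROWS, min_k=1.5))
```

Unlike plain bits, the cell numbers let the coordinator tell a column that is high-scored at all peers from one that merely appears at all peers.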

Architecture/Testbed
[Figure: testbed architecture. The KLEE Algorithmic Framework sits on top of ExtendedIndexLists (with Bloom filters, histograms, and batched access), which offer open(), get(k), getWithBF(..), getAbove(score), and close(). The extended lists access the index lists via open(), next(), and close(); the index lists are stored in an Oracle DB with a B+ index and read via SQL.]

Evaluation: Benchmarks
• GOV: TREC .GOV collection + 50 TREC-2003 Web queries, e.g. juvenile delinquency
• XGOV: TREC .GOV collection + 50 manually expanded queries, e.g. juvenile delinquency youth minor crime law jurisdiction offense prevention
• IMDB: Movie Database, queries like
  – actor = John Wayne; genre = western
• Synthetic Distribution (Zipf, different skewness): GOV collection but with synthetic scores
• Synthetic Distribution + Synthetic Correlation: 10 index lists

Evaluation: Metrics
• Relative recall w.r.t. the actual results
• Score error
• Bandwidth consumption
• Rank distance
• Number of random accesses (RA) and number of sorted accesses (SA)
• Query response time
  - network cost (150 ms RTT, 800 Kb/s data transfer rate)
  - local I/O cost (8 ms rotation latency + 8 MB/s transfer delay)
  - processing cost
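
Relative recall, the first metric on this slide, is commonly computed as the fraction of the exact top-k results that also appear in the approximate result. The sketch below uses that definition as an assumption, since the slides do not give a formula; the function name `relative_recall` is hypothetical.

```python
def relative_recall(approx_top_k, exact_top_k):
    """Fraction of the exact top-k docIds found in the approximate top-k."""
    exact = set(exact_top_k)
    return len(exact & set(approx_top_k)) / len(exact)


print(relative_recall(["d10", "d78", "d1"], ["d10", "d78", "d23"]))
```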

Evaluated Algorithms
• DTA:
  – batched distributed threshold algorithm, batch size k
• TPUT
• X-TPUT:
  – approximate TPUT; no random accesses
• KLEE-3
• KLEE-4
C = 10% of the score mass

Synthetic Score Benchmarks

Zipf-XGOV, c=10%   Total # of Bytes   Total Time in ms   Average Recall
DTA                617,009,260        39,582,682         1
TPUT               377,928,880        1,599,581          1
X-TPUT             377,097,644        1,521,220          0.98
KLEE 3             287,294,812        1,189,891          0.91
KLEE 4             165,077,807        375,077            0.92

[Figure: Zipf-XGOV, c=10%, q=0.7: bandwidth in bytes vs. number of query terms (10 to 18) for DTA, TPUT, X-TPUT, KLEE 3, and KLEE 4]

Zipf-GOV, c=10%    Total # of Bytes   Total Time in ms   Average Recall
DTA                17,752,769         3,532,180          1
TPUT               53,494,903         576,713            1
X-TPUT             53,011,252         404,991            0.99
KLEE 3             49,861,342         367,931            0.97
KLEE 4             25,057,920         160,585            0.94
q = 0.7

Synthetic Correlation Benchmark
• W: randomly insert top k documents from list i in the top W documents of list j

[Figure: Overlap, c=10%, q=0.7: bandwidth in bytes vs. W in % (10 to 100) for DTA, TPUT, X-TPUT, KLEE 3, and KLEE 4]

Overlap+Zipf, c=10%   Total # of Bytes   Total Time in ms   Average Recall
DTA                   1,146.32           157,420            1
TPUT                  9,150,904          29,270             1
X-TPUT                9,150,904          28,335             1
KLEE 3                3,678,780          12,971             0.92
KLEE 4                1,192,704          6,546              0.91
W = 30%

GOV / XGOV

[Figure: GOV, c=10%: bandwidth in bytes vs. number of query terms (2 to 4), one curve per algorithm]

GOV, c=10%   Total # of Bytes   Total Time in ms   Average Recall
DTA          1,172,446          190,259            1
TPUT         1,505,290          185,049            1
X-TPUT       597,991            31,432             0.89
KLEE 3       722,664            28,319             0.9
KLEE 4       440,868            39,564             0.9

[Figure: XGOV, c=10%: bandwidth in bytes vs. number of query terms (4 to 18), one curve per algorithm]

XGOV, c=10%  Total # of Bytes   Total Time in ms   Average Recall
DTA          92,587,264         3,740,677          1
TPUT         70,044,884         2,346,882          1
X-TPUT       19,236,084         96,153             0.91
KLEE 3       16,690,912         88,271             0.83
KLEE 4       7,920,774          56,609             0.79

Conclusion / Future Work
• Conclusion
  – KLEE: approximate top-k algorithms for wide-area networks
  – significant performance benefits can be enjoyed, at only small penalties in result quality
  – flexible framework for top-k algorithms, allowing for trading off
    • efficiency versus result quality and
    • bandwidth savings versus the number of communication phases
  – various fine-tuning parameters
• Future Work
  – Reasoning about parameter values
  – Consider “moving” coordinator
Thanks for your attention!
Questions?
Comments?