Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed...

25
Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed Top-k Query Algorithms Sebastian Michel Planck Institute for Informatics Saarbrücken, Germany [email protected] Peter Triantafillou RACTI / Univ. of Patras Rio, Greece [email protected] Gerhard Weikum Max-Planck Institute for Informat Saarbrücken, Germany [email protected]
  • date post

    18-Dec-2015
  • Category

    Documents

  • view

    215
  • download

    1

Transcript of Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed...

Page 1: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

Max-Planck-Institut University of PatrasNetCInS LabInformatik

KLEE: A Framework for Distributed Top-k Query Algorithms

KLEE: A Framework for Distributed Top-k Query Algorithms

Sebastian MichelMax-Planck Institute for Informatics

Saarbrücken, [email protected]

Peter TriantafillouRACTI / Univ. of Patras

Rio, [email protected]

Gerhard WeikumMax-Planck Institute for Informatics

Saarbrücken, [email protected]

Page 2: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 2KLEE: A Framework for Distributed Top-k Query Algorithms

Overview

• Problem Statement

• Related Work

• KLEE

• The Histogram Bloom Structure

• Candidate Filtering

• Evaluation

• Conclusion / Future Work

Page 3: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 3KLEE: A Framework for Distributed Top-k Query Algorithms

Computational Model• Distributed aggregation queries:

Query with m terms with index lists spread across m peers P1 ... Pm

P4

P5P2

P3P1

P4

P5P2

P3P1

Applications:• Internet traffic monitoring• Sensor networks• P2P Web search

Page 4: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 4KLEE: A Framework for Distributed Top-k Query Algorithms

Problem Statement

• Consider– network consumption– per peer load– latency (query response time)

• network• I/O• processing

P0

P1

P2

P3

Query initiator P0 serves as per-query coordinator

…t1d780.9

d10.7

d880.2

d230.8

d100.8

…d100.2

d780.1t2

d640.8

d230.6

d100.6

…d990.2

d340.1t3

d100.7

d780.5

d640.4

Page 5: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 5KLEE: A Framework for Distributed Top-k Query Algorithms

Existing Methods:• Distributed NRA/TA: Extend NRA/TA (Fagin et

al. `99/‘03, Güntzer et al. `01, Nepal et al. `99) with batched access

• TPUT (Cao/Wang 2004): 1) fetch k best entries (d, sj) from each of P1 ...

Pm and aggregate (j=1..m sj(d)) at P0 2) ask each of P1 ... Pm for all entries with sj >

min-k / m and aggregate results at P0 3) fetch missing scores for all candidates by

random lookups at P1 ... Pm + DNRA aims to minimize per-peer work- DTA/DNRA incur many messages+ TPUT guarantees fixed number of message rounds- TPUT incurs high per-peer load and net BW

Related Work

Page 6: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 6KLEE: A Framework for Distributed Top-k Query Algorithms

TPUT

...

Index List

CohortPeer Pi

Coordinator Peer P0

current top-k-

candidate set

...

score

Index List

CohortPeer Pj

score

topk

topk

can

did

ates

can

did

ates

min-k / m min-k / m

min-k / m

Ret

riev

e m

issi

ng

sco

res

Retrieve missing scores

Page 7: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 7KLEE: A Framework for Distributed Top-k Query Algorithms

KLEE: Key Ideas

• if mink / m is small TPUT retrieves a lot of data in Phase 2

high network traffic

• random accesses high per-peer load

KLEE: Different philosophy: approximate answers! Efficiency:

Reduces (docId, score)-pair transfers no random accesses at each peer

Two pillars: The HistogramBlooms structure The Candidate List Filter structure

Page 8: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 8KLEE: A Framework for Distributed Top-k Query Algorithms

The KLEE Algorithms

• KLEE 3 or 4 steps:1. Exploration Step: … to get a better

approximation of min-k score threshold

2. Optimization Step: – decide: 3 or 4 steps ?

3. Candidate Filtering: … a docID is a good candidate if high-scored in many peers.

4. Candidate Retrieval: get all good docID candidates.

Page 9: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 9KLEE: A Framework for Distributed Top-k Query Algorithms

Histogram Bloom Structure• Each peer pre-computes for each index list:

an equi-width histogram

+ Bloom filter for each cell+ average score per cell+ upper/lower scorescore

#do

cs

01100

00101

01100

00101

00110

00101

01110

00101

01100

00101

10

“increase” the mink / m threshold

Page 10: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 10

KLEE: A Framework for Distributed Top-k Query Algorithms

Bloom Filter

• bit array of size m• k hash functions

hi: docId_space {1,..,m}

• insert n docs by hashing the ids and settings the corresponding bits

• Membership Queries:– document is in the Bloom Filter if the

corresponding bits are set

• probability of false positives (pfp)• tradeoff accuracy vs. efficiency

kmknepfp )1( /

Page 11: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 11

KLEE: A Framework for Distributed Top-k Query Algorithms

Exploration and Candidate Retrieval

...

Index List

CohortPeer Pi

Coordinator Peer P0

current top-k-

candidate set

...

score

Index List

CohortPeer Pj

Histogram Histogram

b bits

00

01

01

10

00

01

01

10

01

01

10

10

01

01

10

10

01

00

10

10

01

00

10

10

00

01

00

10

00

01

00

10

01

00

01

11

01

00

01

11c cells

b bits

01

01

01

01

01

01

01

01

00

01

11

01

00

01

11

01

01

00

00

00

01

00

00

00

00

00

00

10

00

00

00

10

01

00

11

10

01

00

11

10

c cells

score

topk

topk

can

did

ates

can

did

ates

min-k / m min-k / m

Page 12: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 12

KLEE: A Framework for Distributed Top-k Query Algorithms

Candidate List Filter Matrix

• Goal: filter out unpromising candidate documents in step 2

• estimate the max number of docs that are above the mink / m threshold

• send this number and the threshold to the cohort peers

score

nu

mb

er

of

do

cu

me

nts

min-k / m threshold

Page 13: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 13

KLEE: A Framework for Distributed Top-k Query Algorithms

Candidate List Filter Matrix (2)• Each cohort returns a Bloom Filter that

“contains” all docs above the mink / m threshold

Candidate List Filter Matrix (CLFM)010101001011110101001001010101001

010010011001011111001001010111110

101010101010100110010010011110000

000000001000000100000000000100000

Select all columns with at least R bits set

Page 14: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 14

KLEE: A Framework for Distributed Top-k Query Algorithms

KLEE– Candidate Set Reduction

...

score

010010000100010001

Index List

CohortPeer Pi

topk

Coordinator Peer P0

min-k / m

current top-k

candidate set

0000100000100000001

xx x

can

did

ates

min

-k / m

candidate filter matrix

CohortPeer Pj

100010100000010001

0000100000100000001 0000100000100000001

Page 15: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 15

KLEE: A Framework for Distributed Top-k Query Algorithms

KLEE – Candidate Retrieval

...

score

010010000100010001

Index List

CohortPeer Pi

topk

Coordinator Peer P0

min-k / m

current top-k

candidate set

0000100000100000001

xx x

can

did

ates

early stoppingpoint

candidate filter matrix

CohortPeer Pj

100010100000010001

0000100000100000001 0000100000100000001

Page 16: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 16

KLEE: A Framework for Distributed Top-k Query Algorithms

Enhanced Filtering

• BF represenation can be improved …(d1, 0.9)(d2, 0.6)(d5, 0.5)(d3, 0.3)(d4, 0.25)(d17, 0.08)(d9, 0.07)

d1, d2, and d5 are promising documents but e.g. s1-s3 = 0.4 !

• Send byte-array with cell-numbers instead of bits

• Select „columns“ withSum over upper-bounds > min-k

(d1, 0.9)(d2, 0.6)(d5, 0.5)(d3, 0.3)(d4, 0.25)(d17, 0.08)(d9, 0.07)

(d1, 0.9)(d2, 0.6)(d5, 0.5)(d3, 0.3)(d4, 0.25)(d17, 0.08)(d9, 0.07)

000001000000100000000000

000001000000100010001000

000001000000100000001000

28,23,14,85,34,23,11,24,54,33,60,34

43,13,15,25,54,21,71,44,64,71,72,31

31,43,21,75,21,71,64,34,74,63,50,62

Page 17: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 17

KLEE: A Framework for Distributed Top-k Query Algorithms

Architecture/Testbed

open() get(k) getWithBF(..)

next()

Index Lists

Oracle DB

close()open()

SQL

KLEEAlgorithmic Framework

ExtendedIndexLists

with BloomFilters, Histograms,

and Batched Access

12

4 3

...getAbove(score)

00 01 11 10

01 01 00 10

B+ Index

Page 18: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 18

KLEE: A Framework for Distributed Top-k Query Algorithms

Evaluation: Benchmarks• GOV: TREC .GOV collection + 50 TREC-2003

Web queries, e.g. juvenile delinquency

• XGOV: TREC .GOV collection + 50 manually expanded queries, e.g. juvenile delinquency youth minor crime law jurisdiction offense prevention

• IMDB: Movie Database, queries like– actor = John Wayne; genre =western

• Synthetic Distribution (Zipf, different skewness): GOV collection but with synthetic scores

• Synthetic Distribution + Synthetic Correlation: 10 index lists

Page 19: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 19

KLEE: A Framework for Distributed Top-k Query Algorithms

Evaluation: Metrics• Relative recall w.r.t. to the actual results• Score error• Bandwidth consumption• Rank distance• Number of RA and number of SA• Query response time

- network cost (150ms RTT,

800Kb/s data transfer rate)

- local I/O cost (8ms rotation latency

+ 8MB/s transfer delay)

- processing cost

Page 20: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 20

KLEE: A Framework for Distributed Top-k Query Algorithms

Evaluated Algorithms

• DTA:

– batched distributed threshold algorithm, batch size k.

• TPUT• X-TPUT:

– approximate TPUT. No random accesses.

• KLEE-3

• KLEE-4C = 10% of the score mass

Page 21: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 21

KLEE: A Framework for Distributed Top-k Query Algorithms

Synthetic Score Benchmarks

Zipf-XGOV Total Total Average

c=10% # of Bytes Time in ms Recall

DTA 617,009,260 39,582,682 1

TPUT 377,928,880 1,599,581 1

X-TPUT 377,097,644 1,521,220 0.98

KLEE 3 287,294,812 1,189,891 0.91

KLEE 4 165,077,807 375,077 0.92

Zipf-XGOV, c=10%, q=0.7

0.0E+00

5.0E+06

1.0E+07

1.5E+07

2.0E+07

2.5E+07

10 11 12 13 14 15 18

Number of Query Terms

Ba

nd

wid

th in

By

tes

DTA

TPUT

X-TPUT

KLEE 3

KLEE 4

Zipf-GOV Total Total Average

c=10% # of Bytes Time in ms Recall

DTA 17,752,769 3,532,180 1

TPUT 53,494,903 576,713 1

X-TPUT 53,011,252 404,991 0.99

KLEE 3 49,861,342 367,931 0.97

KLEE 4 25,057,920 160,585 0.94

q = 0.7

Page 22: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 22

KLEE: A Framework for Distributed Top-k Query Algorithms

Synthetic Correlation Benchmark

Overlap, c=10%, q=0.7

0

1,000,000

2,000,000

3,000,000

4,000,000

5,000,000

6,000,000

7,000,000

8,000,000

9,000,000

10,000,000

10 20 30 40 50 60 70 80 90 100 W in %

Ba

nd

wid

th in

By

tes DTA

TPUT

X-TPUT

KLEE 3

KLEE 4

W

randomly insert top k documents from list i in the top W documents of list j

Overlap+Zipf Total Total Average

c=10% # of Bytes Time in ms Recall

DTA 1,146.32 157,420 1

TPUT 9,150,904 29,270 1

X-TPUT 9,150,904 28,335 1

KLEE 3 3,678,780 12,971 0.92

KLEE 4 1,192,704 6,546 0.91

W = 30%

Page 23: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 23

KLEE: A Framework for Distributed Top-k Query Algorithms

GOV / XGOV

XGOV, c=10%

0

1,000,000

2,000,000

3,000,000

4,000,000

5,000,000

6,000,000

4 5 6 7 8 9 10 11 12 13 14 15 18

Number of Query Terms

Ba

nd

wid

th in

By

tes

TA

TPUT

X-TPUT

KLEE 3

KLEE 4

GOV Total Total Average

c=10% # of Bytes Time in ms Recall

DTA 1,172,446 190,259 1

TPUT 1,505,290 185,049 1

X-TPUT 597,991 31,432 0.89

KLEE 3 722,664 28,319 0.9

KLEE 4 440,868 39,564 0.9

GOV, c=10%

0

20,000

40,000

60,000

80,000

100,000

120,000

140,000

2 3 4Number of Query Terms

Ba

nd

wid

th in

By

tes DTA

TPUT

X-TPUT

KLEE 3

KLEE 4

XGOV Total Total Average

c=10% # of Bytes Time in ms Recall

DTA 92,587,264 3,740,677 1

TPUT 70,044,884 2,346,882 1

X-TPUT 19,236,084 96,153 0.91

KLEE 3 16,690,912 88,271 0.83

KLEE 4 7,920,774 56,609 0.79

Page 24: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 24

KLEE: A Framework for Distributed Top-k Query Algorithms

Conclusion / Future Work

• Conclusion– KLEE: approximate top-k algorithms for wide-area

networks– significant performance benefits can be enjoyed, at

only small penalties in result quality– flexible framework for top-k algorithms, allowing

for trading-off• efficiency versus result quality and • bandwidth savings versus the number of communication

phases.

– various fine-tuning parameters

• Future Work– Reasoning about parameter values– Consider “moving” coordinator

Page 25: Max-Planck-Institut University of Patras NetCInS Lab Informatik KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed.

VLDB 2005, Trondheim 25

KLEE: A Framework for Distributed Top-k Query Algorithms

Thanks for your attention!

Questions?

Comments?