
A Thin Monitoring Layer for Top-k Aggregation Queries over a Database

Foteini Alvanaki, Sebastian Michel

Saarland University

DBRank 2013, Riva Del Garda, Italy

Data Cube

[Figure: data cube with dimensions Brand, Country, and Product Type; measure sum(price*quantity)]

Data Cube

Brand    Product Type   Country    sum(Price*Quantity)
Brand1   Type1          Country1   1234
Brand1   Type2          Country1   3522
Brand1   Type1          -          1234
Brand1   Type2          -          3522
Brand1   -              Country1   4756
-        Type1          Country1   1234
-        Type2          Country1   3522
Brand1   -              -          4756
-        Type1          -          1234
-        Type2          -          3522
-        -              Country1   4756

(- marks a dimension that has been aggregated over)
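The cube rows above can be reproduced from the two base rows. The following is a minimal sketch, assuming the cube is obtained by summing revenue over every subset of the three dimensions; the names `DIMENSIONS`, `data_cube` and `base_rows` are hypothetical and not from the talk. It also yields the grand total (all dimensions aggregated), which the slide's table does not list.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("brand", "product_type", "country")


def data_cube(rows):
    """Sum revenue for every subset of the dimensions.

    Each row is (brand, product_type, country, revenue). A None in a
    result key means that dimension has been aggregated over.
    """
    cube = defaultdict(int)
    for size in range(len(DIMENSIONS) + 1):
        for kept in combinations(range(len(DIMENSIONS)), size):
            for row in rows:
                key = tuple(row[i] if i in kept else None
                            for i in range(len(DIMENSIONS)))
                cube[key] += row[3]
    return dict(cube)


# The two base rows from the slide.
base_rows = [
    ("Brand1", "Type1", "Country1", 1234),
    ("Brand1", "Type2", "Country1", 3522),
]
cube = data_cube(base_rows)
print(cube[("Brand1", None, "Country1")])  # revenue of Brand1 in Country1
```

Each top-k question on the next slide ranks the cells of one such grouping by their aggregated value.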

1. What are the top-2 product types with the highest revenue of each brand in each country?

2. What are the top-2 brands with the highest revenue in each country?

Top-k Queries

• Primary Attribute: the attribute/dimension over which the selection is performed (e.g. product type)
• Secondary Attributes: used to filter specific results (e.g. brand, country)
• Aggregated Attributes: used to compute an aggregated score (e.g. price, quantity)
• Aggregate Function: e.g. sum

One top-k query for each combination of secondary attribute instances (filtering condition)

Filtering Conditions: Example (1)

brand = {X} - country = {Y, W}

brand=X AND country=Y

brand=X AND country=W

SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=Y GROUP BY type ORDER BY SUM(price*quantity) DESC LIMIT K

SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=W GROUP BY type ORDER BY SUM(price*quantity) DESC LIMIT K

Filtering Conditions: Example (2)

country = {Y, W} - brand = {X}

country=Y

country=W

brand=X

SELECT type, SUM(price*quantity) FROM relation WHERE country=Y GROUP BY type ORDER BY SUM(price*quantity) DESC LIMIT K

SELECT type, SUM(price*quantity) FROM relation WHERE country=W GROUP BY type ORDER BY SUM(price*quantity) DESC LIMIT K

SELECT type, SUM(price*quantity) FROM relation WHERE brand=X GROUP BY type ORDER BY SUM(price*quantity) DESC LIMIT K

Filtering Conditions: Example (3)

country = {Y, W} - brand = {X}

SELECT type, SUM(price*quantity) FROM relation GROUP BY type ORDER BY SUM(price*quantity) DESC LIMIT K
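Taken together, the three examples list six queries for brand = {X} and country = {Y, W}. The sketch below reproduces that set under the assumption that each secondary attribute is either fixed to one of its instances or left unconstrained. The helpers `filtering_conditions` and `top_k_sql` are hypothetical names; the SQL is built by string formatting purely for display, and `DESC` is used so that the highest totals come first.

```python
from itertools import product


def filtering_conditions(secondary):
    """Yield one dict per filtering condition.

    `secondary` maps each secondary attribute to its instances; every
    attribute is either fixed to one instance or left unconstrained.
    """
    names = list(secondary)
    choices = [[None] + list(secondary[name]) for name in names]
    for combo in product(*choices):
        yield {n: v for n, v in zip(names, combo) if v is not None}


def top_k_sql(condition, k):
    """Render the top-k query for one filtering condition as SQL text.

    Illustration only: real code should bind values as parameters.
    """
    where = " AND ".join(f"{attr}='{val}'" for attr, val in condition.items())
    return (
        "SELECT type, SUM(price*quantity) FROM relation"
        + (f" WHERE {where}" if where else "")
        + " GROUP BY type ORDER BY SUM(price*quantity) DESC"
        + f" LIMIT {k}"
    )


conditions = list(filtering_conditions({"brand": ["X"], "country": ["Y", "W"]}))
for cond in conditions:
    print(top_k_sql(cond, 2))
```

With one brand and two countries this gives (1+1)*(2+1) = 6 conditions, including the unfiltered query of Example (3).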

Updates

Insertions to the underlying database that contain all information related to the top-k queries

INSERT INTO relation (type, brand, country, price, quantity)

VALUES (T, X, Y, 100, 3)

Problem

How to maintain all these queries in the presence of fast updates?

Outline

Setting/Problem
• Algorithms
  – Naïve Approach
  – Estimates Approach
  – Groups Approach
• Experimental Results
• Conclusions

Example

SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=Y GROUP BY type ORDER BY SUM(price*quantity) DESC LIMIT 2

Update: (type, X, Y, 300)

Naïve Approach

Case 1: type in the top-2, e.g. (B,X,Y,300)

Before the update:

Type  Score
A     3452
B     2406   +300

After the update:

Type  Score
A     3452
B     2706

Case 2: type NOT in the top-2, e.g. (K,X,Y,300)

Verification Query: SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=Y AND type=K GROUP BY type
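The two cases can be sketched against an in-memory SQLite database. This is a rough illustration, not the authors' implementation: the slides only state that Case 2 issues a verification query, so replacing the weakest top-k entry when the verified score is higher is an assumption. The function name `naive_update` and the seed rows are hypothetical.

```python
import sqlite3


def naive_update(conn, top, k, update):
    """Apply one insertion to the ranking for one brand/country pair.

    `top` maps type -> exact score for the current top-k.
    Returns True if a verification query was sent to the database.
    """
    type_, brand, country, price, quantity = update
    conn.execute(
        "INSERT INTO relation (type, brand, country, price, quantity) "
        "VALUES (?, ?, ?, ?, ?)", update)
    if type_ in top:                       # Case 1: update in memory
        top[type_] += price * quantity
        return False
    # Case 2: ask the database for the exact score of this type
    (score,) = conn.execute(
        "SELECT SUM(price*quantity) FROM relation "
        "WHERE brand=? AND country=? AND type=?",
        (brand, country, type_)).fetchone()
    if len(top) < k:
        top[type_] = score
    else:
        weakest = min(top, key=top.get)
        if score > top[weakest]:           # assumed replacement rule
            del top[weakest]
            top[type_] = score
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relation (type TEXT, brand TEXT, country TEXT, "
             "price INTEGER, quantity INTEGER)")
conn.executemany("INSERT INTO relation VALUES (?, ?, ?, ?, ?)", [
    ("A", "X", "Y", 3452, 1),   # hypothetical seed rows matching the slide's scores
    ("B", "X", "Y", 2406, 1),
    ("K", "X", "Y", 2200, 1),
])
top = {"A": 3452, "B": 2406}
print(naive_update(conn, top, 2, ("B", "X", "Y", 100, 3)), top)
print(naive_update(conn, top, 2, ("K", "X", "Y", 100, 3)), top)
```

Every update to a type outside the top-k costs one database round trip, which is the overhead the Estimates and Groups approaches aim to reduce.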

Estimates Approach

In-memory Structures

• top-(k+N) instances with exact aggregated scores
• B instances with estimated aggregated scores
• estimated score = best possible score (basic score) + inserted values

Top-5 (the first two rows are the top-2):

Type  Score
A     3452
B     2406
C     2356
D     2167
E     1987

Buffer:

Type  Score
O     1990
P     2112
Q     2076
R     1997

Estimates Approach

Case 1.1: type in the top-2, e.g. (B,X,Y,300)

Before the update (top-5; the first two rows are the top-2):

Type  Score
A     3452
B     2406   +300
C     2356
D     2167
E     1987

After the update:

Type  Score
A     3452
B     2706
C     2356
D     2167
E     1987

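Estimates Approach

Case 1.2: type in the top-5, e.g. (D,X,Y,300)

Before the update (top-5; the first two rows are the top-2):

Type  Score
A     3452
B     2406
C     2356
D     2167   +300
E     1987

After adding the inserted value:

Type  Score
A     3452
B     2406
C     2356
D     2467
E     1987

After re-sorting (D enters the top-2):

Type  Score
A     3452
D     2467
B     2406
C     2356
E     1987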

Estimates Approach

Case 2: type in the Buffer, e.g. (P,X,Y,300)

Top-5 (the first two rows are the top-2):

Type  Score
A     3452
B     2406
C     2356
D     2167
E     1987

Buffer before the update:

Type  Score
O     1990
P     2112   +300
Q     2076
R     1997

Buffer after the update:

Type  Score
O     1990
P     2412
Q     2076
R     1997

Verification Query: SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=Y AND type=P GROUP BY type

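Estimates Approach

Sub-case 2.1: score(P) < score(E), e.g. score(P) = 756

Top-5 (unchanged; the first two rows are the top-2):

Type  Score
A     3452
B     2406
C     2356
D     2167
E     1987

Buffer with the verified score of P:

Type  Score
O     1990
P     756
Q     2076
R     1997

Buffer after removing P:

Type  Score
O     1990
Q     2076
R     1997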

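Estimates Approach

Sub-case 2.2: score(P) > score(E), e.g. score(P) = 2178

Top-5 before (the first two rows are the top-2):

Type  Score
A     3452
B     2406
C     2356
D     2167
E     1987

Buffer with the verified score of P:

Type  Score
O     1990
P     2178
Q     2076
R     1997

Top-5 after (P replaces E):

Type  Score
A     3452
B     2406
C     2356
P     2178
D     2167

Buffer after:

Type  Score
O     1990
Q     2076
R     1997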

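Estimates Approach

Sub-case 2.3: score(P) > score(B), e.g. score(P) = 2407

Top-5 before (the first two rows are the top-2):

Type  Score
A     3452
B     2406
C     2356
D     2167
E     1987

Buffer with the verified score of P:

Type  Score
O     1990
P     2407
Q     2076
R     1997

Top-5 after (P enters the top-2, E drops out):

Type  Score
A     3452
P     2407
B     2406
C     2356
D     2167

Buffer after:

Type  Score
O     1990
Q     2076
R     1997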
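Cases 1 and 2 can be sketched with two plain dictionaries. The slides do not say when the verification query fires; the code assumes it is issued once a buffered estimate exceeds the k-th exact score, which fits the example (2412 > 2406). The names `apply_update`, `ranking` and the `verify` callback, which stands in for the database query, are hypothetical.

```python
def apply_update(top, buffer, k, inst, value, verify):
    """Handle an update for an instance already held in memory.

    top    -- exact scores of the top-(k+N) instances
    buffer -- estimated scores (upper bounds) of buffered instances
    verify -- callable returning the exact score of one instance; it
              stands in for the verification query
    Returns True if the database was queried.
    """
    if inst in top:                       # Cases 1.1 and 1.2
        top[inst] += value
        return False
    buffer[inst] += value                 # Case 2
    kth_score = sorted(top.values(), reverse=True)[k - 1]
    if buffer[inst] <= kth_score:         # assumed trigger: cannot reach top-k yet
        return False
    exact = verify(inst)
    del buffer[inst]
    weakest = min(top, key=top.get)
    if exact > top[weakest]:              # Sub-cases 2.2 and 2.3
        del top[weakest]
        top[inst] = exact
    return True                           # otherwise Sub-case 2.1: instance dropped


def ranking(top):
    """Instances ordered by descending exact score."""
    return sorted(top, key=top.get, reverse=True)


# Sub-case 2.3 with the numbers from the slides.
top = {"A": 3452, "B": 2406, "C": 2356, "D": 2167, "E": 1987}
buffer = {"O": 1990, "P": 2112, "Q": 2076, "R": 1997}
queried = apply_update(top, buffer, 2, "P", 300, verify=lambda inst: 2407)
print(queried, ranking(top), sorted(buffer))
```

Because estimates never underestimate the true score, an update that leaves the estimate at or below the k-th score cannot change the top-k and needs no database access.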

Estimates Approach

Case 3: type NOT in in-memory structures, e.g. (T,X,Y,300)

Estimated Score(T) = basic score + 300 = 1987 + 300 = 2287

Buffer Full: Reset Query

SELECT type, SUM(price*quantity) FROM relation WHERE brand=X AND country=Y AND type IN (O,P,Q,R) GROUP BY type

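Estimates Approach

Reset query results: score(O)=1254, score(P)=432, score(Q)=2050, score(R)=1990

Top-5 before the reset (the first two rows are the top-2):

Type  Score
A     3452
B     2406
C     2356
D     2167
E     1987

Buffer before the reset:

Type  Score
O     1990
P     2112
Q     2076
R     1997

Buffer after the reset:

Type  Score
T     2287

Top-5 after the reset:

Type  Score
A     3452
B     2406
C     2356
D     2167
Q     2050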
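The buffer reset of Case 3 can be sketched as follows. Two points are assumptions inferred from the example rather than stated on the slides: the basic score is taken to be the lowest exact score held in the top-(k+N) (2287 - 300 = 1987 = score(E)), and the estimate is computed before the reset. The names `handle_unseen` and `reset_query` are hypothetical; the callback stands in for the reset query.

```python
def handle_unseen(top, buffer, capacity, inst, value, reset_query):
    """Case 3: the updated instance is in neither in-memory structure.

    reset_query -- callable taking a list of instances and returning
                   their exact scores as a dict (stands in for the
                   buffer reset query)
    Returns the estimated score stored for the new instance.
    """
    size = len(top)
    estimate = min(top.values()) + value     # basic score + inserted value
    if len(buffer) >= capacity:              # buffer full: reset it
        exact = reset_query(list(buffer))
        buffer.clear()
        merged = {**top, **exact}
        best = sorted(merged, key=merged.get, reverse=True)[:size]
        kept = {name: merged[name] for name in best}
        top.clear()
        top.update(kept)
    buffer[inst] = estimate
    return estimate


# Numbers from the slides.
top = {"A": 3452, "B": 2406, "C": 2356, "D": 2167, "E": 1987}
buffer = {"O": 1990, "P": 2112, "Q": 2076, "R": 1997}
exact_scores = {"O": 1254, "P": 432, "Q": 2050, "R": 1990}
estimate = handle_unseen(
    top, buffer, 4, "T", 300,
    reset_query=lambda names: {n: exact_scores[n] for n in names})
print(estimate, top, buffer)
```

A single reset query verifies all buffered instances at once, so the cost of one round trip is shared across B instances. The sketch does not cover the situation where the new estimate already exceeds the k-th score, which would presumably need a verification query as in Case 2.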


Queries Characteristics

• SAME primary attribute
• SAME aggregate attributes
• SAME aggregate function
• SAME top-k condition
• DIFFERENT filtering condition

Lattice organisation

Groups Approach

• The updates are forwarded from top to bottom in the lattice

• Each ranking forwards the queried results to the rankings lying in lower levels in the lattice
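One way to picture the lattice is to place a ranking above another when its filtering condition is strictly more general. The sketch below assumes that orientation (the unfiltered ranking at the top), which is inferred from the example where the brand=X ranking forwards results to rankings that also constrain country. `is_more_general` and `build_lattice` are hypothetical helper names.

```python
def is_more_general(a, b):
    """True if condition `a` keeps a strict subset of the constraints of `b`."""
    return len(a) < len(b) and all(b.get(attr) == val for attr, val in a.items())


def build_lattice(conditions):
    """Map each condition index to the indices one level below it."""
    children = {i: [] for i in range(len(conditions))}
    for i, a in enumerate(conditions):
        for j, b in enumerate(conditions):
            if len(b) == len(a) + 1 and is_more_general(a, b):
                children[i].append(j)
    return children


conditions = [
    {},                                   # no filter (top of the lattice)
    {"brand": "X"},
    {"country": "Y"},
    {"country": "W"},
    {"brand": "X", "country": "Y"},
    {"brand": "X", "country": "W"},
]
lattice = build_lattice(conditions)
for parent, kids in lattice.items():
    print(conditions[parent], "->", [conditions[c] for c in kids])
```

An update or a query result that enters at a general ranking can then be passed along these edges to every more specific ranking it may affect.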

Groups Approach: Example

SELECT type, SUM(price*quantity) FROM relation WHERE brand=X GROUP BY type ORDER BY SUM(price*quantity) DESC LIMIT 2

Update: (type, X, Y, 300)

Ranking: brand=X, country=ANY

Groups Approach

Case 2: type in the Buffer, e.g. (P,X,Y,300)

Verification Query: SELECT type, brand, country, price*quantity FROM relation WHERE brand=X AND type=P

Buffer Reset Query: SELECT type, brand, country, price*quantity FROM relation WHERE brand=X AND type IN (O,P,Q,R)


Groups Approach

• Each ranking queries tuples (type, brand, country, price*quantity), limited to those satisfying its filtering condition

• It uses them to compute the scores

• It forwards them to the rankings lower in the lattice

• Rankings that receive tuples use those satisfying their own filtering condition to compute their scores
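The forwarding step can be sketched as below: the ranking for brand=X fetches raw tuples once, and the rankings beneath it derive their own scores by filtering those tuples instead of querying the database again. The tuple values are made up for illustration, and `matches`, `scores_from_tuples` and `forward` are hypothetical names.

```python
from collections import defaultdict


def matches(condition, tup):
    """True if the tuple satisfies every constraint of the condition."""
    return all(tup[attr] == val for attr, val in condition.items())


def scores_from_tuples(condition, tuples):
    """Aggregate price*quantity per type over tuples satisfying `condition`."""
    scores = defaultdict(int)
    for tup in tuples:
        if matches(condition, tup):
            scores[tup["type"]] += tup["amount"]
    return dict(scores)


def forward(tuples, lower_conditions):
    """Let each lower ranking compute its scores from the forwarded tuples."""
    return {name: scores_from_tuples(cond, tuples)
            for name, cond in lower_conditions.items()}


# Hypothetical result of the verification query for brand=X AND type=P;
# "amount" stands for price*quantity.
tuples = [
    {"type": "P", "brand": "X", "country": "Y", "amount": 1200},
    {"type": "P", "brand": "X", "country": "W", "amount": 900},
    {"type": "P", "brand": "X", "country": "Y", "amount": 300},
]
own = scores_from_tuples({"brand": "X"}, tuples)
lower = forward(tuples, {
    "X,Y": {"brand": "X", "country": "Y"},
    "X,W": {"brand": "X", "country": "W"},
})
print(own, lower)
```

This is why the Groups queries select raw tuples rather than a SUM: the unaggregated rows are what lets one database answer serve several rankings.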

Outline

Problem
Algorithms
  Naïve Approach
  Estimates Approach
  Groups Approach
• Experimental Results
• Conclusions

Experiments (1)

• TPC-H data
• Select on part.p_partKey (200,000 unique values)
• Filter on customer.c_mktsegment, orders.o_orderpriority and region.r_name
• Aggregation sum on lineitem.l_quantity
• 216 total rankings
• 30,000 updates/insertions

Experiments (2)

Updates
• Random: inserts a quantity between 1 and 50 for a random part.p_partKey.
• 80-20: inserts a quantity between 1 and 50 for a part.p_partKey selected according to the 80-20 rule.
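The two update workloads can be sketched as small generators. The slides do not define the 80-20 rule precisely; the code assumes that 80% of the updates go to a fixed 20% of the part keys, and that those are simply the lowest-numbered keys. `random_update` and `skewed_update` are hypothetical names.

```python
import random


def random_update(rng, num_parts):
    """Uniformly random part key with a quantity between 1 and 50."""
    return rng.randint(1, num_parts), rng.randint(1, 50)


def skewed_update(rng, num_parts, hot_fraction=0.2, hot_probability=0.8):
    """Part key drawn so that about 80% of updates hit 20% of the keys.

    Assumption: the hot keys are the first `hot_fraction` of the key range.
    """
    hot_count = max(1, int(num_parts * hot_fraction))
    if rng.random() < hot_probability:
        key = rng.randint(1, hot_count)
    else:
        key = rng.randint(hot_count + 1, num_parts)
    return key, rng.randint(1, 50)


rng = random.Random(0)
num_parts = 200_000                       # unique p_partKey values in the setup
print([random_update(rng, num_parts) for _ in range(3)])
print([skewed_update(rng, num_parts) for _ in range(3)])
```

Under the skewed workload the same instances are updated repeatedly, so more updates are likely to hit instances that are already held in memory.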

N-extra Gap
• Difference between top-k and top-(k+N) scores
  100% (1*50) and 200% (2*50)

[Figure: 80-20 Updates: Queries]

[Figure: 80-20 Updates: Time]

[Figure: Random Updates: Queries]

[Figure: Random Updates: Time]

Naïve Approach

• 80-20 updates: 239,985 Verification Queries, 4 secs/update

• Random updates: 239,977 Verification Queries, 4 secs/update

Outline

Problem
Algorithms
  Naïve Approach
  Estimates Approach
  Groups Approach
Experimental Results
• Conclusions

Conclusion

• Two algorithms to maintain top-k rankings in the presence of fast updates arriving in an underlying database

• Exact top-k results

• Both are faster than the Naïve Approach, and the Groups Approach further limits the communication with the database

• Preliminary results that provide insights into the impact of the various parameters on the effectiveness of our methods

Thank you!

Additional Instances
