Incremental Aggregation on Multiple Continuous Queries

1

Incremental Aggregation on Multiple Continuous Queries

Chun Jin, Carnegie Mellon University

09/28/2006, ISMIS, Bari, Italy

2

Stream Processing

• Intelligence monitoring
• Fraud detection
• Onset epidemic patterns
• Network intrusion detection
• Geospatial changes

• Transactions
• Sensor network readings
• Network traffic data

3

Problem

• Aggregate queries

• Continuous evaluation

• Multiple concurrent queries

4

Solutions

• Incremental aggregation

• Incremental multiple aggregate query optimization (incremental sharing)

5

Roadmap

• System overview

• Query examples

• Incremental Aggregation

• Incremental sharing

• Evaluation

6

System Architecture

Components: Query Network, Query Coordinator, System Catalog, Common Computation Identifier (CCI), Network Operation Manager (NOM), Code Assembler, Sharing Optimizer (SO), Projection Manager (PM), Engine, Generator, Oracle

New Query Insertion:
1. Index query network
2. Identify common computation
3. Select optimal sharing path
4. Expand query network

Query Network Execution:
1. Code assembly
2. Incremental aggregation
3. Periodic execution

7

Query Examples

(a) Query A

    SELECT dis_cat, hospital, vdate, COUNT(*), AVERAGE(fee)
    FROM Med
    GROUP BY CAT(disease) AS dis_cat, hospital, DAY(visit_time) AS vdate

(b) Query B

    SELECT hospital, vdate, AVERAGE(fee)
    FROM Med
    GROUP BY hospital, DAY(visit_time) AS vdate

Query network: S → A → B

A: dis_cat | hospital | vdate | COUNT(*) | SUM(fee) | AVERAGE(fee)
B: hospital | vdate | COUNT(*) | SUM(fee) | AVERAGE(fee)

SH, SN: historical and new data of stream S. AH, AN: the corresponding aggregate results of A.

8

Roadmap

• System overview

• Query examples

• Incremental Aggregation

• Incremental sharing

• Evaluation

9

Aggregate Function Types

• Distributive: computable by applying the aggregate function itself to partial results. Examples: SUM, COUNT.

• Algebraic: computable from a finite set of aggregate functions. Example: AVERAGE.

• Holistic: no such finite set exists. Example: quantiles.
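
The three categories can be illustrated with a minimal Python sketch. The function names and the fee values are hypothetical and only serve the illustration; they are not taken from the system described in these slides.

```python
from statistics import median

def merge_sum(hist_sum, new_sum):
    # Distributive: SUM over all data is the SUM of the partial SUMs.
    return hist_sum + new_sum

def merge_count(hist_count, new_count):
    # Distributive: COUNT over all data is the SUM of the partial COUNTs.
    return hist_count + new_count

def merge_average(hist, new):
    # Algebraic: AVERAGE is derived from a finite set of
    # distributive aggregates, here a (sum, count) pair.
    total = merge_sum(hist[0], new[0])
    count = merge_count(hist[1], new[1])
    return total / count

def holistic_median(history, new):
    # Holistic: no fixed-size summary suffices in general, so the
    # entire history is revisited together with the new data.
    return median(history + new)

historical_fees = [10.0, 20.0, 30.0]   # made-up values
new_fees = [40.0, 50.0]                # made-up values

hist_state = (sum(historical_fees), len(historical_fees))
new_state = (sum(new_fees), len(new_fees))

avg = merge_average(hist_state, new_state)
med = holistic_median(historical_fees, new_fees)
```

Only the (sum, count) pair has to be kept for AVERAGE, while the median needs the full list of historical values.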

Incremental Aggregation

10

Holistic Aggregation

• Revisiting the entire history.

• Usage:
– For holistic aggregates.
– For post-non-incrementally-evaluated aggregates.
– Baseline to incremental aggregation.


11

Algorithm

AH and AN have the same columns:

GID | COUNT(*) AS COUNTA | SUM(fee) AS SUMA | AVERAGE(fee) AS AVGA

t1: a tuple in AH
t2: a tuple in AN

0: PreUpdate State

1: Aggregate AN (from SN)

2: Merge Groups
   t2.COUNTA = t1.COUNTA + t2.COUNTA
   t2.SUMA = t1.SUMA + t2.SUMA

3: Compute Algebraic Aggregate
   t2.AVGA = t2.SUMA / t2.COUNTA

4: Drop Duplicates

5: Insert New Results
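
The five steps of the algorithm can be sketched in Python as follows. This is a minimal in-memory illustration, assuming AH is a dictionary keyed by group ID and SN is a list of (group ID, fee) pairs. The names `aggregate` and `incremental_update` and the sample data are hypothetical, and the system on these slides runs on a DBMS rather than on Python dictionaries.

```python
from collections import defaultdict

def aggregate(s_n):
    """Step 1: aggregate the new chunk SN into AN, keyed by group ID."""
    groups = defaultdict(lambda: {"COUNTA": 0, "SUMA": 0.0})
    for gid, fee in s_n:
        groups[gid]["COUNTA"] += 1
        groups[gid]["SUMA"] += fee
    return dict(groups)

def incremental_update(a_h, s_n):
    """Update the historical aggregates AH in place with the chunk SN."""
    a_n = aggregate(s_n)                          # step 1
    for gid, t2 in a_n.items():
        t1 = a_h.get(gid)
        if t1 is not None:                        # step 2: merge groups
            t2["COUNTA"] = t1["COUNTA"] + t2["COUNTA"]
            t2["SUMA"] = t1["SUMA"] + t2["SUMA"]
        t2["AVGA"] = t2["SUMA"] / t2["COUNTA"]    # step 3: algebraic aggregate
    for gid in a_n:                               # step 4: drop stale groups from AH
        a_h.pop(gid, None)
    a_h.update(a_n)                               # step 5: insert new results
    return a_h

# Hypothetical data: one historical group, one chunk touching two groups.
a_h = {"h1": {"COUNTA": 2, "SUMA": 30.0, "AVGA": 15.0}}
s_n = [("h1", 30.0), ("h2", 5.0)]
result = incremental_update(a_h, s_n)
```

Only the groups that appear in SN are touched; the rest of AH is left as it was.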


12

Complexity

1. Aggregate SN. T1 = O(|SN|)

2. Merge groups in AH to AN. Tcurr2 = O(|AH| + |AN|), Thash2 = O(|AH| + |AN|), Tprefetch2 = O(|AN|)

3. Compute algebraic aggregates in AN. T3 = O(|AN|)

4. Drop duplicates. Tcurr4 = O(|AN| * |A_H^N|) = O(|AN|^2), Thash4 = O(|AH| + |AN|), Tprefetch4 = O(|AN|)

5. Insert new results. T5 = O(|AN|)

Incremental Aggregation

13

Implementation

• System catalog:
– AggreRules
– AggreBasics

• Incremental aggregation instantiation


14

System Catalog


AggreRules

Function | Category | Incremental Aggregation Rule | Vertical Expansion Rule
AVERAGE  | A        | SUMX/COUNTW                  | SUMX/COUNTW
SUM      | D        | SUMX(H)+SUMX(N)              | SUM(SUMX)
MEDIAN   | H        | NULL                         | NULL
COUNT    | D        | COUNTW(H)+COUNTW(N)          | SUM(COUNTW)

Category: A = algebraic, D = distributive, H = holistic.

AggreBasics

Function | Basics   | Basic ID
AVERAGE  | COUNT(W) | COUNTW
AVERAGE  | SUM(X)   | SUMX
SUM      | SUM(X)   | SUMX
COUNT    | COUNT(W) | COUNTW

15

Instantiation

New Query A: AVERAGE(fee)

1. Parse
   AVERAGE(fee): function AVERAGE, argument fee

2. Retrieve rules
   AggreRules:
     AVERAGE = SUMX / COUNTW
     SUMX(N) = SUMX(H) + SUMX(N)
     COUNTW(N) = COUNTW(H) + COUNTW(N)
   AggreBasics:
     AVERAGE: SUM(X): SUMX
     AVERAGE: COUNT(W): COUNTW

3. Substitute
   SUM(X): SUMX becomes SUM(fee): SUMX
   COUNT(W): COUNTW becomes COUNT(*): COUNTW
   AVERAGE(fee) = SUMX / COUNTW

4. Insert columns
   GroupColumns:
     SUM(fee): SUMA
     COUNT(*): COUNTA
     AVERAGE(fee): AVGA
   Name Mapping:
     SUM(fee): SUMX → SUMA
     COUNT(*): COUNTW → COUNTA
     AVERAGE(fee) → AVGA

5. Substitute
   AVGA = SUMA / COUNTA
   SUMA(N) = SUMA(H) + SUMA(N)
   COUNTA(N) = COUNTA(H) + COUNTA(N)

   t2.AVGA = t2.SUMA / t2.COUNTA
   t2.SUMA = t1.SUMA + t2.SUMA
   t2.COUNTA = t1.COUNTA + t2.COUNTA

Incremental Aggregation
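
The instantiation steps (retrieve rules, then substitute the query's column names) can be sketched as plain string substitution. This is a minimal sketch: the dictionary layout, the rule strings, and the function name `instantiate` are assumptions made for illustration, not the actual catalog schema or code of the system.

```python
# Hypothetical in-memory stand-in for the AggreRules catalog table.
AGGRE_RULES = {
    "AVERAGE": [
        "AVERAGE = SUMX / COUNTW",
        "SUMX(N) = SUMX(H) + SUMX(N)",
        "COUNTW(N) = COUNTW(H) + COUNTW(N)",
    ],
}

def instantiate(function, name_mapping):
    """Rewrite the generic rules of `function` with concrete column names."""
    instantiated = []
    for rule in AGGRE_RULES[function]:
        for generic, concrete in name_mapping.items():
            rule = rule.replace(generic, concrete)
        instantiated.append(rule)
    return instantiated

# Name mapping for Query A: generic basic IDs to the columns of node A.
mapping_a = {"SUMX": "SUMA", "COUNTW": "COUNTA", "AVERAGE": "AVGA"}
rules_a = instantiate("AVERAGE", mapping_a)
```

For Query A this yields `AVGA = SUMA / COUNTA` together with the two merge rules on SUMA and COUNTA. A real implementation would more likely substitute on a parsed expression than on raw strings, since raw replacement breaks if one identifier is a prefix of another.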

16

Roadmap

• System overview

• Query examples


• Incremental Aggregation

• Incremental sharing

• Evaluation

17

Incremental Multiple Query Optimization (Incremental Sharing)

• Index existing query plan information R.

• Given a new query Q, identify the sharable computations from R.

• Select the optimal sharing path.

• Expand R to compute Q.

Incremental Sharing

18

Expanding Query Network

• Limited sharing on holistic aggregates

• Sharing on distributive/algebraic aggregates through vertical expansion

Incremental Sharing

19

Vertical Expansion (A → B)

AH: BID | Rest ID | COUNT(*) AS COUNTA | SUM(fee) AS SUMA | AVERAGE(fee) AS AVGA

BH: BID | COUNT(*) AS COUNTB | SUM(fee) AS SUMB | AVERAGE(fee) AS AVGB

1: Further Aggregate:
   COUNTB = SUM(COUNTA)
   SUMB = SUM(SUMA)
   GROUP BY BID

2: AVGB = SUMB / COUNTB

Incremental Sharing

20

Vertical Expansion (A → B)

AN: BID | Rest ID | COUNT(*) AS COUNTA | SUM(fee) AS SUMA | …

AH: BID | Rest ID | …

BH, BN: BID | COUNT(*) AS COUNTB | SUM(fee) AS SUMB | AVERAGE(fee) AS AVGB

1: Further Aggregate
   COUNTB = SUM(COUNTA)
   SUMB = SUM(SUMA)
   GROUP BY BID

2: Merge Groups
   t2.COUNTA = t1.COUNTA + t2.COUNTA
   t2.SUMA = t1.SUMA + t2.SUMA

3: Compute Algebraic Aggregate
   AVGB = SUMB / COUNTB

4: Drop Duplicates

5: Insert New Results
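
The "Further Aggregate" step of vertical expansion (COUNTB = SUM(COUNTA), SUMB = SUM(SUMA), GROUP BY BID) can be sketched in Python. This is a minimal sketch that assumes the rows of A are available as dictionaries; the function name `vertical_expansion`, the key `REST_ID`, and the sample rows are hypothetical.

```python
from collections import defaultdict

def vertical_expansion(a_rows):
    """Compute B from A by further aggregating A's groups on BID."""
    b = defaultdict(lambda: {"COUNTB": 0, "SUMB": 0.0})
    for row in a_rows:
        group = b[row["BID"]]
        group["COUNTB"] += row["COUNTA"]   # COUNTB = SUM(COUNTA)
        group["SUMB"] += row["SUMA"]       # SUMB = SUM(SUMA)
    for group in b.values():
        # The algebraic aggregate is recomputed from the merged basics;
        # averaging A's AVGA values directly would give a wrong result.
        group["AVGB"] = group["SUMB"] / group["COUNTB"]
    return dict(b)

# Hypothetical rows of A: BID is the grouping shared with B,
# REST_ID stands for A's remaining group columns.
a_rows = [
    {"BID": "h1", "REST_ID": "cat1", "COUNTA": 2, "SUMA": 30.0},
    {"BID": "h1", "REST_ID": "cat2", "COUNTA": 1, "SUMA": 30.0},
    {"BID": "h2", "REST_ID": "cat1", "COUNTA": 4, "SUMA": 20.0},
]
b_rows = vertical_expansion(a_rows)
```

B is computed from the already aggregated rows of A, so the cost depends on the size of A rather than on the size of the stream S.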

21

Vertical Expansion Complexity

• TVcurr = O(|AN|^2 + |BH|)

• TVhash = O(|AN| + |BH|)

• TVprefetch = O(|AN|)

Incremental Sharing

22

System Catalog

GroupTopology: Original | DirectParent | NodeName | GroupID

GroupColumns: Expr (Original, Canonical) | ColumnName | NodeName

GroupExprIndex: GroupExpr (Original, Canonical) | GroupExprID

GroupExprSet: GroupExprID | GroupID

Incremental Sharing

23

Select Optimal Sharing Path

• Select least-size node for sharing
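
Least-size selection can be sketched as a filter followed by a minimum. This is a minimal sketch under stated assumptions: a node is treated as sharable when its group columns and basic aggregates cover those of the new query, and the catalog layout, node names, and sizes below are made up for illustration.

```python
def sharable_nodes(catalog, query_groups, query_basics):
    """Nodes whose group columns and basic aggregates cover the query's."""
    return [
        name for name, node in catalog.items()
        if query_groups <= node["groups"] and query_basics <= node["basics"]
    ]

def select_sharing_node(catalog, query_groups, query_basics):
    """Pick the sharable node with the fewest rows, or None if none exists."""
    candidates = sharable_nodes(catalog, query_groups, query_basics)
    if not candidates:
        return None
    return min(candidates, key=lambda name: catalog[name]["size"])

# Hypothetical catalog of existing aggregate nodes (sizes are made up).
catalog = {
    "A": {"groups": {"dis_cat", "hospital", "vdate"},
          "basics": {"SUM(fee)", "COUNT(*)"}, "size": 5000},
    "C": {"groups": {"doctor", "hospital", "vdate"},
          "basics": {"SUM(fee)", "COUNT(*)"}, "size": 20000},
    "D": {"groups": {"hospital"},
          "basics": {"SUM(fee)", "COUNT(*)"}, "size": 50},
}

# Query B groups by hospital and vdate and needs SUM(fee) and COUNT(*).
chosen = select_sharing_node(catalog, {"hospital", "vdate"},
                             {"SUM(fee)", "COUNT(*)"})
```

Here node D is the smallest but cannot be used because it lacks the vdate column, so the smaller of the two covering nodes is chosen.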

Incremental Sharing

24

Rerouting

[Animated diagram: evolution of the query network over nodes S, A, and B]

Incremental Sharing

25

Roadmap

• System overview

• Query examples


• Incremental Aggregation

• Incremental sharing

• Evaluation

26

Evaluation

• Databases:
– Synthesized FedWire money transfers
– Anonymized medical patient admission records

• Queries:
– Seed queries
– Generate sharable queries from seeds
– A wide range of queries (aggregates in this paper)

• Simulation:
– Historical data (300000 on Fed, and 600000 on Med)
– Chunks of new data (4000 per chunk)

Evaluation

27


Total execution time in seconds

                            | Fed (350 queries) | Med (450 queries)
Incremental Aggregation     | 662               | 316
Non Incremental Aggregation | 6236              | 938

Evaluation

28

[Figure (a) Fed: execution time (s) versus number of FED queries (0 to 350); series: SIA, NS-IA]

Evaluation

29

[Figure (a) Med: execution time (s) versus number of MED queries (0 to 450); series: SIA, NS-IA]

Evaluation

30

Conclusion

• Multiple aggregates over streams

• Solutions:
– Incremental aggregation
– Incremental MQO (incremental sharing)
– Built atop DBMSs for direct practical utility

• Big performance improvement

• Future work:
– A broad range of queries
– Built atop DSMSs

31

Acknowledgement

• Work with Professor Jaime Carbonell.

• Part of ARGUS by CMU and Dynamix.

• Team: Phil Hayes, Santosh Ananthraman, Bob Frederking, Eugene Fink, Dwight Dietrich, Ganesh Mani, Johny Mathew.

• Thanks to Professor Chris Olston for helpful discussion.

32

[Figure (a) Pair 1, FED Query Pair 1: ITT (average individual-tuple execution time, s) and IBT (incremental-batch execution time, s) versus incremental size |SN| (1 to 30000); series: Non-VE ITT, VE ITT, Non-VE IBT, VE IBT]

Evaluation
