Incremental Aggregation on Multiple Continuous Queries
1
Incremental Aggregation on Multiple Continuous Queries
Chun Jin, Carnegie Mellon University
09/28/2006, ISMIS, Bari, Italy
2
Stream Processing
• Intelligence monitoring
• Fraud detection
• Onset epidemic patterns
• Network intrusion detection
• Geospatial changes

• Transactions
• Sensor network readings
• Network traffic data
3
Problem
• Aggregate queries
• Continuous evaluation
• Multiple concurrent queries
4
Solutions
• Incremental aggregation
• Incremental multiple aggregate query optimization (incremental sharing)
5
Roadmap
• System overview
• Query examples
• Incremental Aggregation
• Incremental sharing
• Evaluation
6
System Architecture
Components: Query Network, Query Coordinator, System Catalog, Common Computation Identifier (CCI), Network Operation Manager (NOM), Code Assembler, Sharing Optimizer (SO), Projection Manager (PM), Engine, Generator, Oracle
New Query Insertion:
1. Index query network
2. Identify common computation
3. Select optimal sharing path
4. Expand query network
Query Network Execution:
1. Code assembly
2. Incremental aggregation
3. Periodical execution
7
Query Examples
(a) Query A
SELECT dis_cat, hospital, vdate, COUNT(*), AVERAGE(fee)
FROM Med
GROUP BY CAT(disease) AS dis_cat, hospital, DAY(visit_time) AS vdate
(b) Query B
SELECT hospital, vdate, AVERAGE(fee)
FROM Med
GROUP BY hospital, DAY(visit_time) AS vdate
[Figure: query network over S, A, and B, with data partitions SH, SN, AH, AN.
A columns: dis_cat, hospital, vdate, COUNT(*), SUM(fee), AVERAGE(fee)
B columns: hospital, vdate, COUNT(*), SUM(fee), AVERAGE(fee)]
8
Roadmap
• System overview
• Query examples
• Incremental Aggregation
• Incremental sharing
• Evaluation
9
Aggregate Function Types
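• Distributive: can be computed from partial results using the aggregate function itself. Examples: SUM, COUNT.
• Algebraic: can be computed from a finite set of aggregate functions. Example: AVERAGE (from SUM and COUNT).
• Holistic: no such finite set exists. Example: quantiles.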
Incremental Aggregation
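The three categories can be illustrated with a minimal Python sketch, assuming the data is split into a historical part and a new part that are summarized separately and then combined. The function names and fee values are hypothetical and are not taken from the talk.

```python
from statistics import median

# Hypothetical fee values: a historical partition and a new partition.
history = [100.0, 250.0, 50.0]
new = [200.0, 900.0]


def combine_sum(partial_sums):
    # Distributive: the SUM over all data is the SUM of the partial SUMs.
    return sum(partial_sums)


def combine_count(partial_counts):
    # Distributive: the COUNT over all data is the SUM of the partial COUNTs.
    return sum(partial_counts)


def combine_average(partial_sums, partial_counts):
    # Algebraic: AVERAGE is derived from a finite set of aggregates (SUM, COUNT).
    return combine_sum(partial_sums) / combine_count(partial_counts)


total_sum = combine_sum([sum(history), sum(new)])
total_count = combine_count([len(history), len(new)])
total_avg = combine_average([sum(history), sum(new)], [len(history), len(new)])

# Holistic: combining per-partition medians does not give the true median
# in general, so the full data has to be revisited.
true_median = median(history + new)
median_of_medians = median([median(history), median(new)])
```

With these values the combined average equals the average over all five fees, while the median of the two partition medians differs from the true median.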
10
Holistic Aggregation
• Revisiting the entire history.
• Usage:
  – For holistic aggregates.
  – For post-non-incrementally-evaluated aggregates.
  – Baseline to incremental aggregation.
Incremental Aggregation
11
Algorithm
AH and AN columns: GID, COUNT(*) AS COUNTA, SUM(fee) AS SUMA, AVERAGE(fee) AS AVGA
t1: tuple in AH; t2: tuple in AN
0: PreUpdate State (SH, SN)
1: Aggregate SN into AN
2: Merge Groups: t2.COUNTA = t1.COUNTA + t2.COUNTA, t2.SUMA = t1.SUMA + t2.SUMA
3: Compute Algebraic Aggregate: t2.AVGA = t2.SUMA / t2.COUNTA
4: Drop Duplicates
5: Insert New Results
Incremental Aggregation
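The five steps above can be sketched in Python as follows. This is a minimal in-memory illustration, not the system's generated code: it assumes AH is held as a dictionary keyed by GID, that SN arrives as (gid, fee) tuples, and that "drop duplicates" means removing from AH the groups that were merged into AN. The function names are hypothetical.

```python
def aggregate(rows):
    """Step 1: aggregate raw (gid, fee) tuples into per-group COUNTA and SUMA."""
    groups = {}
    for gid, fee in rows:
        group = groups.setdefault(gid, {"COUNTA": 0, "SUMA": 0.0})
        group["COUNTA"] += 1
        group["SUMA"] += fee
    return groups


def incremental_update(a_h, s_n):
    """Fold the new chunk s_n into the stored aggregates a_h (modified in place)."""
    a_n = aggregate(s_n)                      # 1: aggregate SN into AN
    for gid, t2 in a_n.items():               # 2: merge groups from AH into AN
        t1 = a_h.get(gid)
        if t1 is not None:
            t2["COUNTA"] += t1["COUNTA"]
            t2["SUMA"] += t1["SUMA"]
    for t2 in a_n.values():                   # 3: compute algebraic aggregate
        t2["AVGA"] = t2["SUMA"] / t2["COUNTA"]
    for gid in a_n:                           # 4: drop duplicates (assumed: stale AH groups)
        a_h.pop(gid, None)
    a_h.update(a_n)                           # 5: insert new results
    return a_h


# Usage: build the historical state, then apply one new chunk.
a_h = incremental_update({}, [("h1", 100.0), ("h1", 300.0), ("h2", 50.0)])
a_h = incremental_update(a_h, [("h1", 200.0), ("h3", 80.0)])
```

Only groups that appear in the new chunk are touched; group "h2" keeps its stored values without the history being rescanned.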
12
Complexity
1. Aggregate SN. T1 = O(|SN|)
2. Merge groups in AH to AN. T2_curr = O(|AH| + |AN|), T2_hash = O(|AH| + |AN|), T2_prefetch = O(|AN|)
3. Compute algebraic aggregates in AN. T3 = O(|AN|)
4. Drop duplicates. T4_curr = O(|AN| * |A_H^N|) = O(|AN|^2), T4_hash = O(|AH| + |AN|), T4_prefetch = O(|AN|)
5. Insert new results. T5 = O(|AN|)
Incremental Aggregation
13
Implementation
• System catalog:
  – AggreRules
  – AggreBasics
• Incremental aggregation instantiation
Incremental Aggregation
14
System Catalog

AggreRules
Function   Category   Incremental Aggregation Rule   Vertical Expansion Rule
AVERAGE    A          SUMX/COUNTW                    SUMX/COUNTW
SUM        D          SUMX(H)+SUMX(N)                SUM(SUMX)
MEDIAN     H          NULL                           NULL
COUNT      D          COUNTW(H)+COUNTW(N)            SUM(COUNTW)

AggreBasics
Function   Basics     Basic ID
AVERAGE    COUNT(W)   COUNTW
AVERAGE    SUM(X)     SUMX
SUM        SUM(X)     SUMX
COUNT      COUNT(W)   COUNTW

Incremental Aggregation
15
Instantiation
New Query A: AVERAGE(fee)
1. Parse: AVERAGE(fee)
2. Retrieve rules:
   AggreRules:
     AVERAGE = SUMX / COUNTW
     SUMX(N) = SUMX(H) + SUMX(N)
     COUNTW(N) = COUNTW(H) + COUNTW(N)
   AggreBasics:
     AVERAGE: SUM(X): SUMX
     AVERAGE: COUNT(W): COUNTW
3. Substitute:
     SUM(X): SUMX becomes SUM(fee): SUMX
     COUNT(W): COUNTW becomes COUNT(*): COUNTW
     AVERAGE(fee) = SUMX / COUNTW
4. Insert columns (GroupColumns):
     SUM(fee): SUMA
     COUNT(*): COUNTA
     AVERAGE(fee): AVGA
Name Mapping:
     SUM(fee): SUMX becomes SUMA
     COUNT(*): COUNTW becomes COUNTA
     AVERAGE(fee): AVGA
Instantiated rules:
     AVGA = SUMA / COUNTA
     SUMA(N) = SUMA(H) + SUMA(N)
     COUNTA(N) = COUNTA(H) + COUNTA(N)
     t2.AVGA = t2.SUMA / t2.COUNTA
     t2.SUMA = t1.SUMA + t2.SUMA
     t2.COUNTA = t1.COUNTA + t2.COUNTA
Incremental Aggregation
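The parse, retrieve, substitute, and name-mapping steps can be sketched in Python. This is an illustrative sketch only: the dictionary layout, the `instantiate` function, and the naming convention (replace the last letter of the basic ID with the node name, so SUMX becomes SUMA) are assumptions chosen to reproduce the mapping shown on the slide, not the system's actual implementation.

```python
# Catalog contents copied from the AggreRules and AggreBasics tables.
AGGRE_RULES = {
    "AVERAGE": {"category": "A", "incremental": "SUMX/COUNTW"},
    "SUM": {"category": "D", "incremental": "SUMX(H)+SUMX(N)"},
    "MEDIAN": {"category": "H", "incremental": None},
    "COUNT": {"category": "D", "incremental": "COUNTW(H)+COUNTW(N)"},
}
AGGRE_BASICS = {
    "AVERAGE": [("COUNT(W)", "COUNTW"), ("SUM(X)", "SUMX")],
    "SUM": [("SUM(X)", "SUMX")],
    "COUNT": [("COUNT(W)", "COUNTW")],
}


def instantiate(function, argument, node):
    """Return (columns, rule) for an aggregate such as AVERAGE(fee) on node 'A'."""
    columns = {}   # instantiated basic expression -> column name
    name_map = {}  # basic ID -> column name
    for basic_expr, basic_id in AGGRE_BASICS.get(function, []):
        # Substitute the formal arguments: X -> the query argument, W -> *.
        expr = basic_expr.replace("(X)", f"({argument})").replace("(W)", "(*)")
        column = basic_id[:-1] + node  # assumed convention: SUMX -> SUMA
        columns[expr] = column
        name_map[basic_id] = column
    rule = AGGRE_RULES[function]["incremental"]
    if rule is not None:
        for basic_id, column in name_map.items():
            rule = rule.replace(basic_id, column)
    return columns, rule


columns, rule = instantiate("AVERAGE", "fee", "A")
```

For AVERAGE(fee) on node A this yields the columns SUM(fee): SUMA and COUNT(*): COUNTA and the rule SUMA/COUNTA; a holistic function such as MEDIAN yields no rule.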
16
Roadmap
• System overview
• Query examples
• Incremental Aggregation
• Incremental sharing
• Evaluation
17
Incremental Multiple Query Optimization (Incremental Sharing)
• Index existing query plan information R.
• Given a new query Q, identify the sharable computations from R.
• Select the optimal sharing path.
• Expand R to compute Q.
Incremental Sharing
18
Expanding Query Network
• Limited sharing on holistic aggregates
• Sharing on distributive/algebraic aggregates through vertical expansion
Incremental Sharing
19
Vertical Expansion (A to B)
AH columns: BID, Rest ID, COUNT(*) AS COUNTA, SUM(fee) AS SUMA, AVERAGE(fee) AS AVGA
BH columns: BID, COUNT(*) AS COUNTB, SUM(fee) AS SUMB, AVERAGE(fee) AS AVGB
1: Further Aggregate: COUNTB = SUM(COUNTA), SUMB = SUM(SUMA), GROUP BY BID
2: AVGB = SUMB / COUNTB
Incremental Sharing
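A minimal Python sketch of vertical expansion, assuming node A's rows are available in memory as dictionaries and that BID is the part of A's group key that B also groups by (here hospital and visit date, with the disease category as the rest of the key). The function name, the `REST` field, and the sample rows are hypothetical.

```python
def vertical_expand(a_rows):
    """Compute node B from node A's aggregates instead of from the stream S."""
    b = {}
    for row in a_rows:
        # 1: further aggregate, grouping by BID
        group = b.setdefault(row["BID"], {"COUNTB": 0, "SUMB": 0.0})
        group["COUNTB"] += row["COUNTA"]   # COUNTB = SUM(COUNTA)
        group["SUMB"] += row["SUMA"]       # SUMB = SUM(SUMA)
    for group in b.values():
        # 2: compute the algebraic aggregate from its basics
        group["AVGB"] = group["SUMB"] / group["COUNTB"]
    return b


# Hypothetical contents of AH: grouped by (disease category, hospital, day).
a_h = [
    {"BID": ("h1", "d1"), "REST": "flu", "COUNTA": 2, "SUMA": 300.0},
    {"BID": ("h1", "d1"), "REST": "injury", "COUNTA": 1, "SUMA": 150.0},
    {"BID": ("h2", "d1"), "REST": "flu", "COUNTA": 4, "SUMA": 200.0},
]
b_h = vertical_expand(a_h)
```

Note that AVGB is recomputed from SUMB and COUNTB rather than from A's AVGA values, which matches the vertical expansion rule SUMX/COUNTW for AVERAGE in the catalog.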
20
Vertical Expansion (A to B)
AN columns: BID, Rest ID, COUNT(*) AS COUNTA, SUM(fee) AS SUMA, …
AH columns: BID, Rest ID, …
BH and BN columns: BID, COUNT(*) AS COUNTB, SUM(fee) AS SUMB, AVERAGE(fee) AS AVGB
1: Further Aggregate: COUNTB = SUM(COUNTA), SUMB = SUM(SUMA), GROUP BY BID
2: Merge Groups: t2.COUNTA = t1.COUNTA + t2.COUNTA, t2.SUMA = t1.SUMA + t2.SUMA
3: Compute Algebraic Aggregate: AVGB = SUMB / COUNTB
4: Drop Duplicates
5: Insert New Results
21
Vertical Expansion Complexity
• TV_curr = O(|AN|^2 + |BH|)
• TV_hash = O(|AN| + |BH|)
• TV_prefetch = O(|AN|)
Incremental Sharing
22
System Catalog
[Figure: catalog tables GroupTopology, GroupExprSet, GroupExprIndex, and GroupColumns. Column labels shown in the figure: Original, DirectParent, NodeName, GroupID; Original Expr, Canonical, ColumnName, NodeName; Original GroupExpr, Canonical GroupExpr, ID; GroupExprID, GroupID.]
Incremental Sharing
23
Select Optimal Sharing Path
• Select least-size node for sharing
Incremental Sharing
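The least-size heuristic can be sketched in a few lines of Python, assuming the sharing optimizer already knows which existing nodes can serve the new query and how many rows each holds. The function name and the size of node A are hypothetical; 600000 is the Med historical data size given in the evaluation setup.

```python
def select_sharing_node(candidates, sizes):
    """Pick the sharable node with the fewest rows as the input for the new query."""
    return min(candidates, key=lambda node: sizes[node])


# Query B can be computed from the stream S or from the aggregate node A.
sizes = {"S": 600000, "A": 12000}  # the size of A is a made-up value
chosen = select_sharing_node(["S", "A"], sizes)
```

Here the smaller node A is chosen, so B is computed from far fewer rows than a scan of S would require.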
24
Rerouting
[Animation: evolution of the query network over nodes S, A, and B]
Incremental Sharing
25
Roadmap
• System overview
• Query examples
• Incremental Aggregation
• Incremental sharing
• Evaluation
26
Evaluation
• Databases:
  – Synthesized FedWire money transfers
  – Anonymized medical patient admission records
• Queries:
  – Seed queries
  – Generate sharable queries from seeds
  – A wide range of queries (aggregates in this paper)
• Simulation:
  – Historical data (300,000 on Fed and 600,000 on Med)
  – Chunks of new data (4,000 per chunk)
Evaluation
27
Incremental Aggregation
Total execution time in seconds
                              Fed (350 queries)   Med (450 queries)
Incremental Aggregation       662                 316
Non-Incremental Aggregation   6236                938
Evaluation
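As a quick worked check of the table, the speedup of incremental over non-incremental aggregation can be computed directly from the reported totals. The variable names are illustrative; the four timings are the ones in the table.

```python
# Total execution times in seconds, as reported in the table.
times = {
    "Fed": {"incremental": 662, "non_incremental": 6236},
    "Med": {"incremental": 316, "non_incremental": 938},
}

# Speedup = non-incremental time / incremental time.
speedup = {
    db: t["non_incremental"] / t["incremental"] for db, t in times.items()
}
rounded = {db: round(s, 1) for db, s in speedup.items()}
```

This gives a speedup of roughly 9.4x on Fed and 3.0x on Med.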
28
[Figure (a) Fed: execution time (s) versus number of FED queries (0 to 350); series: SIA, NS-IA]
Evaluation
29
[Figure (a) Med: execution time (s) versus number of MED queries (0 to 450); series: SIA, NS-IA]
Evaluation
30
Conclusion
• Multiple aggregates over streams
• Solutions:
  – Incremental aggregation
  – Incremental MQO (incremental sharing)
  – Built atop DBMSs for direct practical utility
• Big performance improvement
• Future work:
  – A broad range of queries
  – Built atop DSMSs.
31
Acknowledgement
• Work with Professor Jaime Carbonell.
• Part of ARGUS by CMU and Dynamix.
• Team: Phil Hayes, Santosh Ananthraman, Bob Frederking, Eugene Fink, Dwight Dietrich, Ganesh Mani, Johny Mathew.
• Thanks to Professor Chris Olston for helpful discussion.
32
[Figure (a) Pair 1: FED Query Pair 1. x-axis: incremental size |SN| (1 to 30000). y-axes: ITT, average individual-tuple execution time (s), and IBT, incremental-batch execution time (s). Series: Non-VE ITT, VE ITT, Non-VE IBT, VE IBT]
Evaluation