1 Semantics and Evaluation Techniques for Window Aggregates in Data Streams Jin Li, David Maier,...
1
Semantics and Evaluation Techniques for Window Aggregates in Data Streams
Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter Tucker
This work was supported by NSF grant IIS 0086002
2
Motivation
Traffic: <sid, speed, ts>
Q1: "For every minute, find the max speed of the past 5 minutes for each sensor."

Input from traffic sensors (sid, speed, ts (hh:mm:ss)):
t1 (s5, 40, 01:06:30)
t2 (s6, 42, 01:07:45)
t3 (s5, 45, 01:08:15)
t4 (s5, 47, 01:10:10)
t5 (s6, 48, 01:10:30)
t6 (s6, 46, 01:11:02)

Windows:
01:06:00 – 01:11:00
01:07:00 – 01:12:00
01:08:00 – 01:13:00

Output (sid, max):
(s5, 47)
(s6, 48)
3
Limitations
Window semantics definition and implementation
  Assumptions on data arrival order
  Data arrival affects query answer and result production
Query evaluation performance
  Space: Internal buffer space to hold a window
  Time: Tuple access – each tuple is accessed multiple times
  Latency: Window aggregate computation is tied with window completion
4
Outline
WID overview
Window semantics definition and its implementation in WID
Disorder
Sharing panes – an optimization technique using sub-windows (panes)
Conclusion
5
WID Overview
Q1:
SELECT sid, max(speed)
FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts]
GROUP-BY sid

A punctuation is a message embedded in the data to indicate the end of a sub-stream.

Input stream (sid, speed, ts):
t1 ( s5, 40, 01:06:30 )
t2 ( s6, 42, 01:07:45 )
t3 ( s5, 45, 01:08:15 )
t4 ( s5, 47, 01:10:10 )
t5 ( s6, 48, 01:10:30 )
p1 ( s6, *, 01:11:00 )
t6 ( s6, 46, 01:11:02 )

After tagging with window-id (sid, speed, ts, window-id):
t1 ( s5, 40, 01:06:30, 70-74 )
t2 ( s6, 42, 01:07:45, 70-74 )
t3 ( s5, 45, 01:08:15, 70-74 )
t4 ( s5, 47, 01:10:10, 70-74 )
t5 ( s6, 48, 01:10:30, 70-74 )
p1 ( s6, *, *, 70 )
t6 ( s6, 46, 01:11:02, 71-75 )

Aggregate state (wid, sid, max), successive snapshots as tuples arrive:
wid 70 … 74: s5 max: 40; wid 70 … 74: s6 max: 42
wid 70 … 74: s5 max: 45; wid 70 … 74: s6 max: 48
wid 70 … 74: s5 max: 47; wid 70 … 74: s6 max: 48
wid 71 … 74: s6 max: 48; wid 75: s6 max: 46

Output (sid, window-id, max):
( s6, 70, 48 )
6
Window Semantics Framework
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids

windows: (T, S) → W
  Defines the set of window-ids to be used
extent: (T, S, w) → U ⊆ T, where w ∈ W
  Specifies which tuples belong to a given window
wids: (T, S, t) → V ⊆ W, where t ∈ T
  Determines the set of window-ids to which a tuple belongs
  Is the dual of extent
7
Defining Window Semantics - sliding window
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids

windows (T, S [RANGE, SLIDE, WATTR]) = {0, 1, 2, …}
extent (w, T, S [RANGE, SLIDE, WATTR]) = { t ∈ T | (w+1) * SLIDE − RANGE ≤ t.WATTR < (w+1) * SLIDE }
wids (t, T, S [RANGE, SLIDE, WATTR]) = { w ∈ W | t.WATTR / SLIDE − 1 < w ≤ (t.WATTR + RANGE) / SLIDE − 1 }

Q1:
SELECT sid, max(speed)
FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts]
GROUP-BY sid

windows (T, S [5, 1, ts]) = {0, 1, 2, …}
extent (w, T, S [5, 1, ts]) = { t ∈ T | (w+1) * 1 − 5 ≤ t.ts < (w+1) * 1 }
wids (t, T, S [5, 1, ts]) = { w ∈ W | t.ts / 1 − 1 < w ≤ (t.ts + 5) / 1 − 1 }

For t4 (s5, 47, 01:10:10), where t4.ts is 01:10:10 ≈ 70.17 minutes:
wids (t4, T, S [5, 1, ts])
  = { w ∈ W | t4.ts / 1 − 1 < w ≤ (t4.ts + 5) / 1 − 1 }
  = { w ∈ W | 69.17 < w ≤ 74.17 }
  = { w ∈ W | 70 ≤ w ≤ 74 }
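The sliding-window definitions above can be checked with a few lines of Python. This is only a sketch under stated assumptions: timestamps are converted to whole seconds since midnight (so RANGE is 300 and SLIDE is 60), and the helper names `to_seconds`, `wids` and `in_extent` are hypothetical, not part of WID. For integer inputs, the smallest integer strictly greater than ts/SLIDE − 1 is floor(ts/SLIDE), which lets the sketch avoid floating point.

```python
def to_seconds(hh, mm, ss):
    """Convert an hh:mm:ss timestamp to seconds since midnight (assumed unit)."""
    return hh * 3600 + mm * 60 + ss


def wids(ts, range_, slide):
    """Window-ids w with ts/slide - 1 < w <= (ts + range_)/slide - 1."""
    first = ts // slide                  # smallest integer above ts/slide - 1
    last = (ts + range_) // slide - 1    # largest integer at or below the upper bound
    return list(range(first, last + 1))


def in_extent(w, ts, range_, slide):
    """True if a tuple with windowing attribute ts belongs to window w."""
    return (w + 1) * slide - range_ <= ts < (w + 1) * slide


RANGE, SLIDE = 5 * 60, 60                # Q1: RANGE 5 minutes, SLIDE 1 minute
t4_ts = to_seconds(1, 10, 10)            # t4 (s5, 47, 01:10:10)
t4_wids = wids(t4_ts, RANGE, SLIDE)
print(t4_wids)                           # [70, 71, 72, 73, 74]
```

The result agrees with the worked example for t4, and `in_extent` holds for exactly those window-ids, which is what "wids is the dual of extent" means in practice.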
8
Window Semantics Implementation in WID – sliding window
SELECT sid, max(speed)
FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts]
GROUP BY sid

Query plan: streamscan → bucket (RANGE 5 minutes, SLIDE 1 minute, WATTR ts) → max (group on window-id, sid)

Input to bucket (sid, speed, ts):
t1 ( s5, 40, 01:06:30 )
p1 ( s5, *, 01:11:00 )

Output of bucket (sid, speed, ts, window-id):
t1 ( s5, 40, 01:06:30, 70-74 )
p1 ( s5, *, *, 70 )

Output of max (sid, window-id, max):
( s5, 70, 40 )

1. Bucket implements wids function;
2. Bucket for sliding windows is stateless
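A minimal Python sketch of this pipeline, offered as an illustration rather than the WID implementation. It assumes timestamps in seconds since midnight, and it assumes that a punctuation (sid, *, ts) means "no more tuples for this sid with a timestamp below ts", so the newest window it closes is ts // SLIDE − 1. The names `bucket` and `max_by_window` and the item layout are hypothetical. Note that `bucket` keeps no state between calls; only the aggregate does.

```python
RANGE, SLIDE = 300, 60  # Q1 in seconds: RANGE 5 minutes, SLIDE 1 minute


def bucket(item):
    """Stateless tagging: map a tuple or punctuation to window-ids."""
    kind, sid, speed, ts = item
    if kind == "tuple":
        # wids: ts/SLIDE - 1 < w <= (ts + RANGE)/SLIDE - 1
        return kind, sid, speed, range(ts // SLIDE, (ts + RANGE) // SLIDE)
    # Punctuation: the newest window that ends at or before ts.
    return kind, sid, None, ts // SLIDE - 1


def max_by_window(stream):
    """Group on (window-id, sid); emit a result when a punctuation closes it."""
    state = {}   # (window-id, sid) -> running max
    out = []     # emitted (sid, window-id, max) results
    for item in stream:
        kind, sid, speed, w = bucket(item)
        if kind == "tuple":
            for wid in w:
                key = (wid, sid)
                state[key] = max(state.get(key, speed), speed)
        else:
            closed = sorted(k for k in state if k[1] == sid and k[0] <= w)
            for key in closed:
                out.append((sid, key[0], state.pop(key)))
    return out, state


stream = [
    ("tuple", "s5", 47, 1 * 3600 + 10 * 60 + 10),   # t4 (s5, 47, 01:10:10)
    ("tuple", "s6", 48, 1 * 3600 + 10 * 60 + 30),   # t5 (s6, 48, 01:10:30)
    ("punct", "s6", None, 1 * 3600 + 11 * 60),      # p1 (s6, *, 01:11:00)
]
out, state = max_by_window(stream)
print(out)  # [('s6', 70, 48)]
```

Each tuple is touched once on arrival and never buffered; the only memory held is one partial aggregate per open (window-id, sid) pair.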
9
Defining Window Semantics - partitioned window
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids

Q2: SELECT sid, max(speed) FROM Traffic [RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR sid]

windows (T, S [RANGE, SLIDE, row-num, PATTR]) = { (i, p) | i ∈ {0, 1, 2, …}, p ∈ T.PATTR }
extent ((i, p), T, S [RANGE, SLIDE, row-num, PATTR]) = { t ∈ T | t.PATTR = p, (i+1) * SLIDE − RANGE ≤ rank(t.row-num, PATTR, T) < (i+1) * SLIDE }
10
Defining Window Semantics - partitioned window (cont.)
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids

Q2: SELECT sid, max(speed) FROM Traffic [RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR sid]

wids (t, T, S [RANGE, SLIDE, row-num, PATTR]) = { (i, p) ∈ W | t.PATTR = p, r / SLIDE − 1 < i ≤ (r + RANGE) / SLIDE − 1 }
  where r = rank (t, row-num, PATTR, T)
11
Window Semantics Implementation in WID – partitioned window

Query plan: streamscan → bucket (RANGE 1000 rows, SLIDE 100 rows, WATTR row-num, PATTR sid) → max(speed) (group on window-id, sid)

Input to bucket (sid, speed, row-num):
t1 ( s5, 47, 507 )

Output of bucket (sid, window-id, speed, row-num):
t1 ( s5, 3-12, 47, 507 )
p1 ( s5, 3, *, * )

Output of max (sid, window-id, max):
( s5, 3, 47 )

1. Bucket generates punctuations;
2. Bucket for partitioned windows maintains states (count for each partition)
SELECT sid, max(speed)FROM Traffic [RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR sid]
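The stateful bucket for partitioned windows can be sketched as follows. This is an illustration under assumptions the slides do not state: rank is taken to be the 0-based arrival count within a partition, tuples of a partition are assumed to arrive in row-num order, and a punctuation for window (i, p) is generated as soon as the tuple with rank (i+1) * SLIDE − 1 has been tagged. The class name `PartitionedBucket` and its method `tag` are hypothetical.

```python
from collections import defaultdict


class PartitionedBucket:
    """Tags tuples with (i, p) window-ids; keeps one counter per partition."""

    def __init__(self, range_rows, slide_rows):
        self.range_rows = range_rows
        self.slide_rows = slide_rows
        self.count = defaultdict(int)   # partition value -> tuples seen so far

    def tag(self, pattr):
        """Return (rank, window-ids, punctuation or None) for the next tuple."""
        r = self.count[pattr]           # assumed: 0-based rank within the partition
        self.count[pattr] += 1
        first = r // self.slide_rows    # smallest i with r / SLIDE - 1 < i
        last = (r + self.range_rows) // self.slide_rows - 1
        ids = [(i, pattr) for i in range(first, last + 1)]
        punct = None
        if (r + 1) % self.slide_rows == 0:
            # r is the last rank inside window (r + 1) / SLIDE - 1: it is complete.
            punct = ((r + 1) // self.slide_rows - 1, pattr)
        return r, ids, punct


b = PartitionedBucket(range_rows=1000, slide_rows=100)   # Q2
for _ in range(350):            # 350 earlier tuples for sensor s5
    b.tag("s5")
b.tag("s6")                     # other partitions do not affect s5's rank
rank, ids, punct = b.tag("s5")
print(rank, ids[0], ids[-1], punct)   # 350 (3, 's5') (12, 's5') None
```

Any rank between 300 and 399 yields window-ids 3 through 12 for Q2, which is the range shown for t1 on the slide; the slide does not state the actual rank, so 350 is an arbitrary choice for the example.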
12
WID Advantages
Window semantics definition
  Separated from physical implementation and data arrival order
  Flexible – covers varieties of windows, e.g., sliding, tumbling, landmark, time-based, tuple-based; allows user-specified windowing attribute
Implementation of query evaluation
  Window semantics localized in Bucket
  Insensitive to data arrival order
  Punctuation can guarantee progress
  Gaps in tuple arrival need not affect result production
  Performance gains in space, execution time and latency
13
WID vs. Buffering – execution time comparison (overview)
14
WID vs. Buffering – execution time comparison (zoom-in)
15
Outline
WID overview
Window semantics definition and its implementation in WID
Disorder
Sharing panes – an optimization technique using sub-windows
Conclusion
16
Sources of Disorder
Merging different data sources
Various network transmission delay
Data prioritization
Query processing algorithms, e.g., shared window joins [Hammad, et al.]
Multiple possible windowing attributes, e.g., two timestamps
17
Handling Disorder
Generally dealt with by buffering
  Slack – BSort in Aurora
  Output buffering in a shared-window join
Punctuation + Window-id
Heartbeat
18
Disorder Handling - WID
Q1:
SELECT sid, max(speed)
FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts]
GROUP-BY sid

Input to bucket (sid, speed, ts):
p1 ( s6, *, 01:11:00 )
t6 ( s6, 46, 01:11:02 )
t7 ( s5, 52, 01:10:15 )

Output of bucket (sid, speed, ts, window-id):
p1 ( s6, *, *, 70 )
t6 ( s6, 46, 01:11:02, 71-75 )
t7 ( s5, 52, 01:10:15, 70-74 )

Aggregate state (wid, sid, max), snapshots as shown on the slide:
wid 70 … 74: s5 max: 47; wid 70 … 74: s6 max: 48
wid 71 … 74: s6 max: 48; wid 75: s6 max: 46
wid 70 … 74: s5 max: 52; wid 70 … 74: s6 max: 48

Output (sid, window-id, max):
( s6, 70, 48 )
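Because each tuple carries its own window-ids, the per-window aggregate does not depend on arrival order. The sketch below illustrates this for Q1 by feeding the same four tuples in every possible order and comparing the resulting states. It assumes timestamps in seconds since midnight, omits punctuations entirely (so nothing is emitted or purged), and uses the hypothetical helper name `window_max`.

```python
from itertools import permutations


def window_max(tuples, range_=300, slide=60):
    """Running max per (window-id, sid), updated one tuple at a time."""
    state = {}
    for sid, speed, ts in tuples:
        # wids for a sliding window: ts/slide - 1 < w <= (ts + range_)/slide - 1
        for wid in range(ts // slide, (ts + range_) // slide):
            key = (wid, sid)
            state[key] = max(state.get(key, speed), speed)
    return state


tuples = [
    ("s5", 47, 4210),   # t4 (s5, 47, 01:10:10)
    ("s6", 48, 4230),   # t5 (s6, 48, 01:10:30)
    ("s6", 46, 4262),   # t6 (s6, 46, 01:11:02)
    ("s5", 52, 4215),   # t7 (s5, 52, 01:10:15), arrives after t6
]

reference = window_max(tuples)
same_for_all_orders = all(
    window_max(list(order)) == reference for order in permutations(tuples)
)
print(same_for_all_orders)          # True
print(reference[(70, "s5")])        # 52
```

The late tuple t7 simply updates the partial aggregates for windows 70 to 74 of sensor s5; no reordering buffer is involved. With punctuations in play, a late tuple is only usable if its windows have not already been closed for that sensor.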
19
Outline
WID overview
Window semantics definition and its implementation in WID
Disorder
Sharing panes – an optimization technique using sub-windows
Conclusion
20
Sharing Panes
Q3:
SELECT sid, count(*)
FROM Traffic [RANGE 4 minutes SLIDE 1 minute WATTR ts]
GROUP BY sid

[Figure: windows W1 – W5 laid over panes P1 – P8, …]
21
Pane Implementation
SELECT sid, count(*)
FROM Traffic [RANGE 4 minutes SLIDE 1 minute WATTR ts]
GROUP BY sid

Query plan: streamscan → bucket B1 as pane-id (RANGE 1 min, SLIDE 1 min, WATTR ts) → count(*) (group on pane-id, sid) → bucket B2 as window-id (RANGE 4, SLIDE 1, WATTR pane-id) → sum(*) (group on window-id, sid)

Input to B1 (sid, speed, ts):
t1 ( s5, 47, 01:10:10 )
p1 ( s5, *, 01:11:00 )
t2 ( s6, 48, 01:10:30 )

Output of B1 (sid, speed, ts, pane-id):
t1 ( s5, 47, 01:10:10, 70-70 )
p1 ( s5, *, *, 70 )
t2 ( s6, 48, 01:10:30, 70-70 )

Output of count (sid, ts, pane-id, count):
m0 ( s5, 01:10:10, 70, 8 )

Output of B2 (sid, ts, pane-id, count, window-id):
m0 ( s5, 01:10:10, 70, 8, 70-74 )
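The two-level plan can be sketched in Python as a per-pane count followed by a sum over the panes of each window. This is an illustration only: timestamps are assumed to be in seconds since midnight, pane-ids are assigned as ts // 60, window-ids for a pane follow the sliding-window wids formula with RANGE 4 and SLIDE 1 applied to the pane-id, and the function names are hypothetical. A direct per-tuple computation is included to check that both routes give the same counts.

```python
from collections import Counter

PANE = 60               # B1: RANGE 1 min, SLIDE 1 min, in seconds
PANES_PER_WINDOW = 4    # B2: RANGE 4, SLIDE 1 over pane-id


def pane_counts(tuples):
    """First level: count(*) grouped on (pane-id, sid)."""
    counts = Counter()
    for sid, ts in tuples:
        counts[(ts // PANE, sid)] += 1
    return counts


def window_counts_from_panes(counts):
    """Second level: sum pane counts grouped on (window-id, sid)."""
    totals = Counter()
    for (pane, sid), n in counts.items():
        # wids over pane-id: pane - 1 < w <= pane + PANES_PER_WINDOW - 1
        for wid in range(pane, pane + PANES_PER_WINDOW):
            totals[(wid, sid)] += n
    return totals


def window_counts_direct(tuples, range_=240, slide=60):
    """Single-level plan: every tuple updates every window it belongs to."""
    totals = Counter()
    for sid, ts in tuples:
        for wid in range(ts // slide, (ts + range_) // slide):
            totals[(wid, sid)] += 1
    return totals


tuples = [("s5", 4210), ("s5", 4215), ("s6", 4230), ("s5", 4262), ("s6", 4325)]
via_panes = window_counts_from_panes(pane_counts(tuples))
direct = window_counts_direct(tuples)
print(via_panes == direct)   # True
```

In the pane plan each input tuple updates a single pane aggregate, and the fan-out to windows happens once per pane result instead of once per tuple, which is where the savings come from when panes hold many tuples.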
22
When are panes better than windows?
SELECT sid, max(*)
FROM Traffic [RANGE X rows SLIDE Y rows WATTR row-num]
GROUP BY sid

1. Panes are better when cost ratio is less than 1
2. The number of tuples per pane affects whether using panes is better

[Figure: cost ratio (pane / window), y-axis from 0 to 1.6, plotted against tuples per pane (1, 3, 5, 10, 20, 30, 40, 60, 80); legend values 2, 5, 10, 20, 50, 100]
23
Conclusion and Future Work
Conclusion
  A framework for defining window semantics
  A one pass, non-buffering, disorder-tolerant query evaluation technique
  Initial investigation on disorder
  Sharing panes
Future work
  Disorder-tolerant window join
  Sharing panes among multiple aggregate queries
24
Related Work
STREAM@Stanford
  Heartbeat, Sub-aggregation
TelegraphCQ@Berkeley
  Sliding window aggregates
Aurora&Borealis@Brown&MIT&Brandeis
  Slack
Gigascope@AT&T
  Ordering Update Token, Sub-aggregation