Semantics and Evaluation Techniques for Window Aggregates in Data Streams

Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter Tucker

This work was supported by NSF grant IIS 0086002


Motivation

Traffic: <sid, speed, ts>

Q1: "For every minute, find the max speed of the past 5 minutes for each sensor."

Input from the traffic sensors (sid, speed, ts (hh:mm:ss)):
t1 (s5, 40, 01:06:30)
t2 (s6, 42, 01:07:45)
t3 (s5, 45, 01:08:15)
t4 (s5, 47, 01:10:10)
t5 (s6, 48, 01:10:30)
t6 (s6, 46, 01:11:02)

Windows:
01:06:00 – 01:11:00
01:07:00 – 01:12:00
01:08:00 – 01:13:00

Result (sid, max):
(s5, 47)
(s6, 48)

Limitations

Window semantics definition and implementation
  Assumptions on data arrival order
  Data arrival affects query answer and result production
Query evaluation performance
  Space: internal buffer space to hold a window
  Time: tuple access; each tuple is accessed multiple times
  Latency: window aggregate computation is tied to window completion

Outline

WID overview
Window semantics definition and its implementation in WID
Disorder
Sharing panes: an optimization technique using sub-windows (panes)
Conclusion

WID Overview

Q1:
SELECT sid, max(speed)
FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts]
GROUP BY sid

A punctuation is a message embedded in the data to indicate the end of a sub-stream.

Input (sid, speed, ts):
t1 ( s5, 40, 01:06:30 )
t2 ( s6, 42, 01:07:45 )
t3 ( s5, 45, 01:08:15 )
t4 ( s5, 47, 01:10:10 )
t5 ( s6, 48, 01:10:30 )
p1 ( s6, *, 01:11:00 )
t6 ( s6, 46, 01:11:02 )

After tagging with window-id (sid, speed, ts, window-id):
t1 ( s5, 40, 01:06:30, 70-74 )
t2 ( s6, 42, 01:07:45, 70-74 )
t3 ( s5, 45, 01:08:15, 70-74 )
t4 ( s5, 47, 01:10:10, 70-74 )
t5 ( s6, 48, 01:10:30, 70-74 )
p1 ( s6, *, *, 70 )
t6 ( s6, 46, 01:11:02, 71-75 )

Aggregate table (wid, sid, max), successive states:
  70 s5 max: 40 … 74 s5 max: 40; 70 s6 max: 42 … 74 s6 max: 42
  70 s5 max: 45 … 74 s5 max: 45; 70 s6 max: 48 … 74 s6 max: 48
  70 s5 max: 47 … 74 s5 max: 47; 70 s6 max: 48 … 74 s6 max: 48
  71 s6 max: 48 … 74 s6 max: 48; 75 s6 max: 46

Output (sid, window-id, max):
( s6, 70, 48 )

Window Semantics Framework

T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids

windows: (T, S) → W
  Defines the set of window-ids to be used
extent: (T, S, w) → U ⊆ T, where w ∈ W
  Specifies which tuples belong to a given window
wids: (T, S, t) → V ⊆ W, where t ∈ T
  Determines the set of window-ids to which a tuple belongs
  Is the dual of extent

Defining Window Semantics - sliding window

T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids

windows(T, S[RANGE, SLIDE, WATTR]) = {0, 1, 2, …}
extent(w, T, S[RANGE, SLIDE, WATTR]) = { t ∈ T | (w+1) * SLIDE - RANGE ≤ t.WATTR < (w+1) * SLIDE }
wids(t, T, S[RANGE, SLIDE, WATTR]) = { w ∈ W | t.WATTR / SLIDE - 1 < w ≤ (t.WATTR + RANGE) / SLIDE - 1 }

Q1:
SELECT sid, max(speed)
FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts]
GROUP BY sid

windows(T, S[5, 1, ts]) = {0, 1, 2, …}
extent(w, T, S[5, 1, ts]) = { t ∈ T | (w+1) * 1 - 5 ≤ t.ts < (w+1) * 1 }
wids(t, T, S[5, 1, ts]) = { w ∈ W | t.ts / 1 - 1 < w ≤ (t.ts + 5) / 1 - 1 }

For t4 (s5, 47, 01:10:10), where t4.ts is 01:10:10 ≈ 70.17 minutes:
wids(t4, T, S[5, 1, ts]) = { w ∈ W | t4.ts / 1 - 1 < w ≤ (t4.ts + 5) / 1 - 1 }
  = { w ∈ W | 69.17 < w ≤ 74.17 }
  = { w ∈ W | 70 ≤ w ≤ 74 }
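
As a minimal sketch of the sliding-window definitions above, the following Python computes wids and tests membership in extent using integer seconds, which avoids rounding 01:10:10 to 70.17 minutes. The function names (`to_seconds`, `wids`, `in_extent`) and the choice of seconds as the unit are assumptions made for illustration; they do not come from the slides.

```python
def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss timestamp to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def wids(t_attr: int, range_: int, slide: int) -> range:
    """Window-ids w with t/SLIDE - 1 < w <= (t + RANGE)/SLIDE - 1."""
    lo = max(t_attr // slide, 0)          # smallest integer above t/SLIDE - 1
    hi = (t_attr + range_) // slide - 1   # largest integer within the upper bound
    return range(lo, hi + 1)

def in_extent(w: int, t_attr: int, range_: int, slide: int) -> bool:
    """True when (w+1)*SLIDE - RANGE <= t < (w+1)*SLIDE."""
    return (w + 1) * slide - range_ <= t_attr < (w + 1) * slide

RANGE, SLIDE = 300, 60                    # Q1: 5 minutes and 1 minute, in seconds
t4 = to_seconds("01:10:10")
print(list(wids(t4, RANGE, SLIDE)))       # [70, 71, 72, 73, 74]
```

Because wids is the dual of extent, `w in wids(t4, RANGE, SLIDE)` holds exactly when `in_extent(w, t4, RANGE, SLIDE)` does.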

Window Semantics Implementation in WID: sliding window

SELECT sid, max(speed)
FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts]
GROUP BY sid

Query plan:
streamscan
bucket: RANGE 5 minutes, SLIDE 1 minute, WATTR ts
max (group on window-id, sid)

Input to bucket (sid, speed, ts):
t1 ( s5, 40, 01:06:30 )
p1 ( s5, *, 01:11:00 )

Output of bucket (sid, speed, ts, window-id):
t1 ( s5, 40, 01:06:30, 70-74 )
p1 ( s5, *, *, 70 )

Output of max (sid, window-id, max):
( s5, 70, 40 )

1. Bucket implements the wids function
2. Bucket for sliding windows is stateless
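
The plan on this slide can be sketched as a stateless bucket step feeding a hash-based max aggregate that emits a result when a punctuation closes a window. This is an illustrative sketch only: the item layout `(kind, sid, speed, ts)`, the helper names, and the assumption that a punctuation timestamp falls exactly on a window boundary are not from the slides.

```python
RANGE, SLIDE = 300, 60  # Q1 in seconds: RANGE 5 minutes, SLIDE 1 minute

def secs(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def bucket(item):
    """Stateless tagging step: applies wids to data, maps punctuation to a window-id."""
    kind, sid, speed, ts = item
    if kind == "punct":
        # Assumption: ts is a window boundary, so it closes window ts/SLIDE - 1.
        return ("punct", sid, None, ts // SLIDE - 1)
    return ("data", sid, speed, (ts // SLIDE, (ts + RANGE) // SLIDE - 1))

def evaluate(stream):
    """Max grouped on (window-id, sid); only running maxima are kept, no tuples."""
    state, results = {}, []
    for item in stream:
        kind, sid, speed, w = bucket(item)
        if kind == "data":
            first, last = w
            for wid in range(first, last + 1):
                key = (wid, sid)
                state[key] = max(state.get(key, speed), speed)
        elif (w, sid) in state:
            # Punctuation: the window is complete for this sid, emit and purge it.
            results.append((sid, w, state.pop((w, sid))))
    return results, state

stream = [
    ("data", "s5", 40, secs("01:06:30")),     # t1
    ("data", "s6", 42, secs("01:07:45")),     # t2
    ("data", "s5", 45, secs("01:08:15")),     # t3
    ("data", "s5", 47, secs("01:10:10")),     # t4
    ("data", "s6", 48, secs("01:10:30")),     # t5
    ("punct", "s6", None, secs("01:11:00")),  # p1
    ("data", "s6", 46, secs("01:11:02")),     # t6
]
results, state = evaluate(stream)
print(results)  # [('s6', 70, 48)]
```

Each tuple is touched once, and the memory held is one running maximum per open (window-id, sid) pair rather than a buffer of the window's tuples.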

Defining Window Semantics - partitioned window

T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids

Q2:
SELECT sid, max(speed)
FROM Traffic [RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR sid]

windows(T, S[RANGE, SLIDE, row-num, PATTR]) = { (i, p) | i ∈ {0, 1, 2, …}, p ∈ T.PATTR }
extent((i, p), T, S[RANGE, SLIDE, row-num, PATTR]) = { t ∈ T | t.PATTR = p, (i+1) * SLIDE - RANGE ≤ rank(t.row-num, PATTR, T) < (i+1) * SLIDE }

Defining Window Semantics - partitioned window (cont.)

T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids

Q2:
SELECT sid, max(speed)
FROM Traffic [RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR sid]

wids(t, T, S[RANGE, SLIDE, row-num, PATTR]) = { (i, p) ∈ W | t.PATTR = p, r / SLIDE - 1 < i ≤ (r + RANGE) / SLIDE - 1 }
where r = rank(t, row-num, PATTR, T)

Window Semantics Implementation in WID: partitioned window

SELECT sid, max(speed)
FROM Traffic [RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR sid]

Query plan:
streamscan
bucket: RANGE 1000 rows, SLIDE 100 rows, WATTR row-num, PATTR sid
max(speed) (group on window-id, sid)

Input to bucket (sid, speed, row-num):
t1 ( s5, 47, 507 )

Output of bucket (sid, window-id, speed, row-num):
t1 ( s5, 3-12, 47, 507 )
p1 ( s5, 3, *, * )

Output of max (sid, window-id, max):
( s5, 3, 47 )

1. Bucket generates punctuations
2. Bucket for partitioned windows maintains state (a count for each partition)
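
The stateful bucket for partitioned windows might look like the sketch below, which keeps one counter per partition, derives each tuple's rank from it, and generates a punctuation whenever a window fills up. Several things here are assumptions rather than details from the slides: the class name `PartitionedBucket`, a 0-based rank, tuples of a partition arriving in row-num order, and the small RANGE and SLIDE values used to keep the demonstration short.

```python
from collections import defaultdict

class PartitionedBucket:
    """Tags tuples with window-ids per partition and emits punctuations."""

    def __init__(self, range_, slide):
        self.range_ = range_
        self.slide = slide
        self.count = defaultdict(int)  # tuples seen so far in each partition

    def process(self, sid, speed):
        r = self.count[sid]            # assumed 0-based rank within the partition
        self.count[sid] += 1
        first = r // self.slide                     # smallest i above r/SLIDE - 1
        last = (r + self.range_) // self.slide - 1  # largest i within the bound
        out = [("data", sid, speed, (first, last))]
        if (r + 1) % self.slide == 0:
            # Every rank below (i+1)*SLIDE has been seen: window i is complete.
            out.append(("punct", sid, None, (r + 1) // self.slide - 1))
        return out

demo = PartitionedBucket(range_=4, slide=2)
for sid, speed in [("s5", 47), ("s6", 48), ("s5", 45), ("s5", 50)]:
    print(demo.process(sid, speed))
```

The counters are the only state kept; the aggregate downstream is the same group-on-(window-id, sid) operator used for sliding windows.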

WID Advantages

Window semantics definition
  Separated from physical implementation and data arrival order
  Flexible: covers varieties of windows, e.g., sliding, tumbling, landmark, time-based, tuple-based; allows a user-specified windowing attribute
Implementation of query evaluation
  Window semantics localized in Bucket
  Insensitive to data arrival order
  Punctuation can guarantee progress
  Gaps in tuple arrival need not affect result production
  Performance gains in space, execution time and latency

[Figure: WID vs. Buffering, execution time comparison (overview)]

[Figure: WID vs. Buffering, execution time comparison (zoom-in)]

Outline

WID overview
Window semantics definition and its implementation in WID
Disorder
Sharing panes: an optimization technique using sub-windows
Conclusion

Sources of Disorder

Merging different data sources
Various network transmission delays
Data prioritization
Query processing algorithms, e.g., shared window joins [Hammad, et al.]
Multiple possible windowing attributes, e.g., two timestamps

Handling Disorder

Generally dealt with by buffering
  Slack: BSort in Aurora
  Output buffering in a shared-window join
Punctuation + Window-id
Heartbeat

Disorder Handling - WID

Q1:
SELECT sid, max(speed)
FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts]
GROUP BY sid

Input to bucket (sid, speed, ts):
p1 ( s6, *, 01:11:00 )
t6 ( s6, 46, 01:11:02 )
t7 ( s5, 52, 01:10:15 )

Output of bucket (sid, speed, ts, window-id):
p1 ( s6, *, *, 70 )
t6 ( s6, 46, 01:11:02, 71-75 )
t7 ( s5, 52, 01:10:15, 70-74 )

Aggregate table (wid, sid, max), successive states:
  70 s5 max: 47 … 74 s5 max: 47; 70 s6 max: 48 … 74 s6 max: 48
  71 s6 max: 48 … 74 s6 max: 48; 75 s6 max: 46
  70 s5 max: 52 … 74 s5 max: 52; 70 s6 max: 48 … 74 s6 max: 48

Output (sid, window-id, max):
( s6, 70, 48 )
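
The reason a late tuple such as t7 causes no trouble can be illustrated with a small sketch: every tuple is folded into the running maximum of each window-id it is tagged with, so the table contents do not depend on arrival order as long as the affected windows have not yet been closed by a punctuation. The function name `max_per_window` and the use of integer seconds are assumptions for illustration, and punctuation handling is left out to keep the example short.

```python
from itertools import permutations

RANGE, SLIDE = 300, 60  # Q1 in seconds

def secs(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def max_per_window(tuples):
    """Tag each tuple with its window-ids and fold it into a running max."""
    state = {}
    for sid, speed, ts in tuples:
        # Window-ids ts//SLIDE .. (ts+RANGE)//SLIDE - 1, as given by wids.
        for wid in range(ts // SLIDE, (ts + RANGE) // SLIDE):
            key = (wid, sid)
            state[key] = max(state.get(key, speed), speed)
    return state

t4 = ("s5", 47, secs("01:10:10"))
t5 = ("s6", 48, secs("01:10:30"))
t6 = ("s6", 46, secs("01:11:02"))
t7 = ("s5", 52, secs("01:10:15"))  # arrives after t6 although its ts is earlier

in_order = max_per_window([t4, t7, t5, t6])
# Every arrival order of the same four tuples yields the same table.
assert all(max_per_window(list(p)) == in_order for p in permutations([t4, t5, t6, t7]))
print(in_order[(70, "s5")])  # 52
```

This works for max because the aggregate is commutative and associative; the same argument applies to count and sum.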

Outline

WID overview
Window semantics definition and its implementation in WID
Disorder
Sharing panes: an optimization technique using sub-windows
Conclusion

Sharing Panes

[Figure: windows W1 to W5 drawn over panes P1 to P8, …]

Q3:
SELECT sid, count(*)
FROM Traffic [RANGE 4 minutes SLIDE 1 minute WATTR ts]
GROUP BY sid

Pane Implementation

SELECT sid, count(*)
FROM Traffic [RANGE 4 minutes SLIDE 1 minute WATTR ts]
GROUP BY sid

Query plan:
streamscan
bucket B1 as pane-id: RANGE 1 min, SLIDE 1 min, WATTR ts
count(*) (group on pane-id, sid)
bucket B2 as window-id: RANGE 4, SLIDE 1, WATTR pane-id
sum(*) (group on window-id, sid)

Input to B1 (sid, speed, ts):
t1 ( s5, 47, 01:10:10 )
p1 ( s5, *, 01:11:00 )
t2 ( s6, 48, 01:10:30 )

Output of B1 (sid, speed, ts, pane-id):
t1 ( s5, 47, 01:10:10, 70-70 )
p1 ( s5, *, *, 70 )
t2 ( s6, 48, 01:10:30, 70-70 )

Output of count(*) (sid, ts, pane-id, count):
m0 ( s5, 01:10:10, 70, 8 )

Output of B2 (sid, ts, pane-id, count, window-id):
m0 ( s5, 01:10:10, 70, 8, 70-74 )
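
A rough sketch of the two-level plan above: tuples are first counted per one-minute pane, and each pane's count is then added to every window that contains the pane, so a tuple is touched once instead of once per window. The function names and the sample timestamps are assumptions for illustration; `direct_counts` is included only to check that the pane-based result matches a per-tuple evaluation of Q3.

```python
from collections import Counter

PANE = 60            # pane length in seconds (B1: RANGE 1 min, SLIDE 1 min)
RANGE, SLIDE = 4, 1  # B2, measured in panes

def pane_counts(tuples):
    """count(*) grouped on (pane-id, sid)."""
    counts = Counter()
    for sid, ts in tuples:
        counts[(ts // PANE, sid)] += 1
    return counts

def window_counts(panes):
    """sum of pane counts grouped on (window-id, sid), with pane-id as WATTR."""
    totals = Counter()
    for (pid, sid), n in panes.items():
        for wid in range(pid // SLIDE, (pid + RANGE) // SLIDE):
            totals[(wid, sid)] += n
    return totals

def direct_counts(tuples):
    """Reference: count(*) per window computed tuple by tuple, without panes."""
    totals = Counter()
    for sid, ts in tuples:
        step, width = SLIDE * PANE, RANGE * PANE
        for wid in range(ts // step, (ts + width) // step):
            totals[(wid, sid)] += 1
    return totals

# Timestamps in seconds: 01:10:10, 01:10:15, 01:10:30, 01:11:02
tuples = [("s5", 4210), ("s5", 4215), ("s6", 4230), ("s5", 4262)]
by_pane = window_counts(pane_counts(tuples))
assert by_pane == direct_counts(tuples)
print(by_pane[(71, "s5")])  # 3
```

Splitting count into a per-pane count followed by a sum relies on the aggregate being decomposable in this way; the sketch does not address aggregates that are not.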

When are panes better than windows?

SELECT sid, max(*)
FROM Traffic [RANGE X rows SLIDE Y rows WATTR row-num]
GROUP BY sid

1. Panes are better when the cost ratio is less than 1
2. The number of tuples per pane affects whether using panes is better

[Figure: cost ratio (pane / window), on an axis from 0 to 1.6, plotted against tuples per pane (1, 3, 5, 10, 20, 30, 40, 60, 80); legend values: 2, 5, 10, 20, 50, 100]

Conclusion and Future Work

Conclusion
  A framework for defining window semantics
  A one-pass, non-buffering, disorder-tolerant query evaluation technique
  Initial investigation on disorder
  Sharing panes
Future work
  Disorder-tolerant window join
  Sharing panes among multiple aggregate queries

Related Work

STREAM@Stanford
  Heartbeat, Sub-aggregation
TelegraphCQ@Berkeley
  Sliding window aggregates
Aurora&Borealis@Brown&MIT&Brandeis
  Slack
Gigascope@AT&T
  Ordering Update Token, Sub-aggregation