Aggregate Sharing for User-Defined Data Stream Windows

Post on 16-Apr-2017


@CIKM16

Cutty: Aggregate Sharing for User-Defined Windows

Paris Carbone <parisc@kth.se, senorcarbone@apache.org>
Jonas Traub <jonas.traub@tu-berlin.de>
Asterios Katsifodimos <asterios.katsifodimos@tu-berlin.de>
Seif Haridi <haridi@kth.se>
Volker Markl <volker.markl@tu-berlin.de>

Presentation: Paris Carbone, PhD Candidate @ KTH Sweden, Committer @ Apache Flink

4 Reasons Not to Check Your Email During This Talk

1. Windows are the backbone of data stream analysis.

2. We generalise the concept of data stream windows.

3. Our technique makes aggregations on general stream windows more efficient than ever.

4. We can multiplex and share aggregations of diverse types of sliding windows that run simultaneously.


Outline

• Partial Sliding Window Aggregation

• Fundamental Limitations of Existing Approaches

• Introducing User-Defined Windows (UDWs)

• Multi-Query Aggregation of UDWs with Cutty

• Performance Comparison

Window Aggregation

[Figure: a stream discretization function fd splits the record stream 1, 2, 3, …, 14, … into overlapping windows (1 to 5), (3 to 7), (5 to 9); an aggregation function fa then maps each window to one aggregate: A1, A2, A3.]

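Partial Aggregation

Three steps over three types (record type, partial type, aggr type):

1. lift: record -> partial

2. combine: (partial, partial) -> partial

3. lower: partial -> aggregate

Example - AVG

lift: record -> (val, count)

combine: (val1 + val2, count1 + count2)

lower: (val, count) -> val/count

For the window 1 2 3 4 5: (1,1) and (2,1) combine to (3,2); adding (3,1) gives (6,3); adding (4,1) gives (10,4); adding (5,1) gives (15,5); lower then yields A1 = 3.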
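The AVG example can be written out as a minimal Python sketch. The function names follow the slide (lift, combine, lower); representing a partial as a `(val, count)` tuple and folding with `functools.reduce` are illustration choices, not part of the original deck.

```python
from functools import reduce

def lift(record):
    """Turn one raw record into a partial aggregate (val, count)."""
    return (record, 1)

def combine(a, b):
    """Merge two partials by adding sums and counts component-wise."""
    return (a[0] + b[0], a[1] + b[1])

def lower(partial):
    """Turn the final partial into the aggregate value (the average)."""
    val, count = partial
    return val / count

window = [1, 2, 3, 4, 5]
partial = reduce(combine, map(lift, window))  # folds left to right
average = lower(partial)
```

Here `partial` ends up as `(15, 5)` and `average` as `3.0`, matching the chain of partials on the slide.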

Partial Aggregation

• #Invocations of combine <-> Computational Complexity

• Commutativity & Associativity are typically assumed.

Redundancy in Sliding Window Aggregation

[Figure: the windows (1 to 5), (3 to 7), (5 to 9), … over the stream 1, 2, 3, …, 14, … overlap.]

Overlapping means redundant combine calls.

We need to optimise…
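To make the redundancy concrete, the sketch below counts combine calls when every window is aggregated from scratch. It assumes a count-based window with range 5 and slide 2 (as in the figure) and uses a plain sum as the aggregate; the function name `naive_sliding_sum` is hypothetical.

```python
def naive_sliding_sum(records, window_range, slide):
    """Aggregate each window independently and count the combine (+) calls."""
    results, combine_calls = [], 0
    for start in range(0, len(records) - window_range + 1, slide):
        acc = records[start]
        for value in records[start + 1:start + window_range]:
            acc += value          # one combine call per additional record
            combine_calls += 1
        results.append(acc)
    return results, combine_calls

records = list(range(1, 15))      # the stream 1 .. 14
results, calls = naive_sliding_sum(records, window_range=5, slide=2)
```

With these inputs the five complete windows cost 20 combine calls in total, because every record that falls into an overlap is combined once per window that contains it.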

Optimise… which windows?

Window types: tumbling, single-type periodic, Punctuation, Snapshot, FCF/CF, Lower-Bound, Session, multi-type, ADWIN, Delta-based, FCA

Periodic windows: slicing (pre-compute non-overlapping partials)

Slicing

If a sliding window can be defined in terms of a fixed range and slide, the system can pre-aggregate consecutive, non-overlapping slices (periodic windows).

Example - Count Window, range: 10, slide: 3

[Figure: the stream 1, 2, 3, …, 19 cut into slices by Panes, by Pairs, and by Cutty (preview).]

Panes [1]: slice length gcd(range, slide)

Pairs [2]: two slice lengths, p2: range % slide and p1: slide - p2

1. No Pane, No Gain: Efficient Evaluation of Sliding-Window Aggregates over Data Streams - SIGMOD 2005

2. On-the-Fly Sharing for Streamed Aggregation - SIGMOD 2006
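The slice lengths for the example (range 10, slide 3) can be checked with a short Python sketch. The helper names are hypothetical, and `cut_points` assumes that window k covers records k*slide + 1 to k*slide + range and that a slice closes wherever a window begins or ends; it is an illustration of the idea, not code from either paper.

```python
from math import gcd

def pane_length(window_range, slide):
    """Panes: a single slice length, gcd(range, slide)."""
    return gcd(window_range, slide)

def pair_lengths(window_range, slide):
    """Pairs: two slice lengths, p2 = range % slide and p1 = slide - p2."""
    p2 = window_range % slide
    p1 = slide - p2
    return p1, p2

def cut_points(window_range, slide, horizon):
    """Positions after which a slice closes (every window begin and end)."""
    cuts = set()
    for start in range(0, horizon + 1, slide):
        if start > 0:
            cuts.add(start)                 # a window begins right after `start`
        if start + window_range <= horizon:
            cuts.add(start + window_range)  # a window ends with this record
    return sorted(cuts)

cuts = cut_points(10, 3, 19)
lengths = [b - a for a, b in zip([0] + cuts, cuts)]
```

For this input the pane length is 1 and the pair lengths are (2, 1). The computed slice lengths start with three warm-up slices of length 3 (no window has ended yet) and then alternate between 1 and 2, which are exactly p2 and p1.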

Slicing - Observations

• Computational Complexity -> update: O(1), merge: O(#partials)

• Space Complexity (#stored sliced partials):

Panes: $\lceil \mathit{range} / \gcd(\mathit{range}, \mathit{slide}) \rceil$

Pairs: $\lceil 2 \times \mathit{range} / \mathit{slide} \rceil$

…from previous example (range: 10, slide: 3): Panes -> 10 partials, Pairs -> 7 partials

• Similar space requirements when range is a multiple of slide

• Pairs has been extended for multi-query aggregation sharing
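The two memory bounds are easy to evaluate for any range and slide. This small sketch (hypothetical function names, integer ceiling division to avoid floating point) reproduces the numbers quoted for the previous example.

```python
from math import gcd

def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)

def panes_partials(window_range, slide):
    """Stored partials with Panes: ceil(range / gcd(range, slide))."""
    return ceil_div(window_range, gcd(window_range, slide))

def pairs_partials(window_range, slide):
    """Stored partials with Pairs: ceil(2 * range / slide)."""
    return ceil_div(2 * window_range, slide)

panes = panes_partials(10, 3)
pairs = pairs_partials(10, 3)
```

For range 10 and slide 3 this gives 10 partials for Panes and 7 for Pairs.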

Optimise… which windows?

Periodic windows: slicing (pre-compute non-overlapping partials)

Non-Periodic windows: eager pre-aggregation (pre-compute overlapping partials for arbitrary aggregation lookups)

Eager Pre-Aggregation

When windowing cannot be expressed simply by a range and slide: eagerly pre-compute partial aggregates and update a binary tree, bottom-up.

[Figure: binary aggregate tree over the leaves 1 2 3 4 5 6 7 8, with inner partials 3 7 11 15, then 10 26, and root 36. The n leaves (~ records) plus n pre-computed partials give 2n stored values; arbitrary window lookups take log n.]
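A binary aggregate tree of this kind can be sketched as an array-backed tree, shown below in Python. This is a generic illustration of bottom-up updates and logarithmic range lookups under the assumption of an associative combine function with an identity element; the class name `AggregateTree` is hypothetical, and the code is not the FlatFAT or B-Int implementation.

```python
class AggregateTree:
    """Array-backed binary tree of partial aggregates over a fixed set of leaves."""

    def __init__(self, capacity, combine, identity):
        size = 1
        while size < capacity:
            size *= 2                      # round up to a power of two
        self.size = size
        self.combine = combine
        self.identity = identity
        self.nodes = [identity] * (2 * size)  # index 1 is the root

    def update(self, index, partial):
        """Set leaf `index` and recompute its ancestors bottom-up."""
        pos = self.size + index
        self.nodes[pos] = partial
        pos //= 2
        while pos >= 1:
            self.nodes[pos] = self.combine(self.nodes[2 * pos], self.nodes[2 * pos + 1])
            pos //= 2

    def query(self, lo, hi):
        """Combine leaves lo .. hi-1, keeping left-to-right order."""
        left, right = self.identity, self.identity
        lo += self.size
        hi += self.size
        while lo < hi:
            if lo & 1:
                left = self.combine(left, self.nodes[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                right = self.combine(self.nodes[hi], right)
            lo //= 2
            hi //= 2
        return self.combine(left, right)

tree = AggregateTree(8, lambda a, b: a + b, 0)
for i, record in enumerate(range(1, 9)):
    tree.update(i, record)
window_sum = tree.query(2, 7)   # records 3 .. 7
```

After inserting the records 1 to 8 the root holds 36 and the second level holds 10 and 26, as in the figure. The array has 2n slots for n leaves, which is the space cost the slide points out.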

Eager Pre-Aggregation - Observations

• Implementations: FlatFAT [1], B-Int [2]

• High Space Complexity (#raw records… twice)

• Most pre-aggregates are never used

• Update + Aggregation complexity: log(leaves)

• Generic and Suitable for Ad-Hoc Queries

• Potential for Multi-Query Window Pre-Aggregation

1. General incremental sliding-window aggregation - VLDB 15

2. Resource sharing in continuous sliding-window aggregates - VLDB 04

[Figure: the same window types, first split into Periodic (efficient slicing) and Non-Periodic (generic, high-cost pre-aggregation), then re-split into Deterministic (efficient slicing) and Non-Deterministic (generic, high-cost pre-aggregation).]

Deterministic Windows: Intuition

[Figure: price [in USD] over time [in min.], marking Window, Window Begin, Threshold, and Pre-Aggregate; slices at the bottom with higher-order partials above them.]

Only need to determine when new windows start.

User-Defined Windows

Deterministic: expressed as a UDF that assigns each record to a number of new or complete windows.

Trivial templating of existing window types.

Non-Deterministic: expressed as a UDF that assigns a record to complete windows and a reference to their beginning.
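As an illustration of the deterministic case, the sketch below templates a count-based sliding window as a function that tells, for each record, how many windows begin with it and how many end with it. The interface (a closure returning a `(begins, ends)` pair per stream position) is a hypothetical simplification for this transcript and is not the UDW API contributed to Apache Flink.

```python
def count_sliding_udw(window_range, slide):
    """Template a count window: window k covers positions k*slide + 1 .. k*slide + range."""
    def on_record(position):
        # position is the 1-based index of the record in the stream
        begins = 1 if (position - 1) % slide == 0 else 0
        ends = 1 if position >= window_range and (position - window_range) % slide == 0 else 0
        return begins, ends
    return on_record

udw = count_sliding_udw(window_range=5, slide=2)
events = {position: udw(position) for position in range(1, 10)}
```

For range 5 and slide 2 the windows (1 to 5), (3 to 7), (5 to 9) begin at positions 1, 3, 5 and end at positions 5, 7, 9, so position 5 reports both one new window and one complete window.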

Cutty Concept

[Figure: the price-over-time example from Deterministic Windows: Intuition, with Slicing as the bottom layer (slices) and Eager Pre-Aggregation on top of it (higher-order partials).]

Cutty Overview

Exploits Deterministic Windows for the most efficient aggregation slicing yet.

Utilises eager pre-aggregation at a low memory cost over optimally sliced partials.

Supports both single and multi-query multiplexed execution out-of-the-box for efficient operator sharing.

Non-Deterministic Windows can still utilise eager pre-aggregation.
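The slicing idea can be sketched in a few lines of Python: a slice is cut only when a new window begins, and a completed window is assembled from the stored slice partials plus the active partial. This is a deliberately simplified illustration under several assumptions that are not from the talk: partials are kept in a plain list and merged lazily instead of in an aggregate tree, stored partials are never evicted, windows complete in the order they began, and the class and parameter names are hypothetical.

```python
class SliceAggregator:
    """Cut slices at window begins only; assemble windows from slice partials."""

    def __init__(self, combine):
        self.combine = combine
        self.stored = []         # partials of closed slices, oldest first
        self.active = None       # partial of the currently open slice
        self.open_windows = []   # for each open window: index of its first slice

    def on_record(self, partial, begins=0, ends=0):
        results = []
        if begins:
            if self.active is not None:
                self.stored.append(self.active)   # close the current slice
                self.active = None
            self.open_windows.extend([len(self.stored)] * begins)
        self.active = partial if self.active is None else self.combine(self.active, partial)
        for _ in range(ends):
            first = self.open_windows.pop(0)      # assumption: FIFO completion
            acc = self.active
            for stored_partial in reversed(self.stored[first:]):
                acc = self.combine(stored_partial, acc)
            results.append(acc)
        return results

agg = SliceAggregator(lambda a, b: a + b)
window_range, slide = 10, 3
outputs = []
for position in range(1, 20):                     # records 1 .. 19, value = position
    begins = 1 if (position - 1) % slide == 0 else 0
    ends = 1 if position >= window_range and (position - window_range) % slide == 0 else 0
    outputs.extend(agg.on_record(position, begins, ends))
```

For the count window with range 10 and slide 3, this stores one partial per slide (six closed slices after 19 records), because a window end never forces a cut: the window that ends at record 10 is answered from three stored partials plus the active one.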

Cutty Architecture

Cutty - Demo

[Animation over the records 1 to 10, tracking the Active Partial, the Stored Partials, and the emitted Windows. The active partial grows from 1 to 3 and is then stored; later frames show the stored partials 3, 12, 13 and the window results 15, 21, 33.]

Implementation

• Apache Flink

  • UDW API (Contributed to Apache Flink - 0.9)

  • Shared Aggregation Operator (experimental)

  • Optimiser collocates parallel windows in operators

• Aggregate Store

  • Adaptation of FlatFAT [1]

  • Circular Resizable Buffer Strategies

  • Non-Eager Strategy Supported for Experiments

1. General incremental sliding-window aggregation - VLDB 15

Performance Analysis: Periodic Window Aggregation (DEBS12 dataset)

[Figures: (1) Number of Partials (×10^5) vs. Number of Queries, for Cutty and Pairs/Pairs+; (2) COUNT-RANGES and COUNT-SLIDES, in Number of Records; (3) Throughput (records/sec) vs. Number of Queries, for Cutty, Pairs+, and RA; (4) Total Reduce Calls (log scale) vs. Number of Queries, for Cutty (eager), Cutty (lazy), Pairs+, Pairs, RA, and Naive.]

Performance Analysis: Session Window Aggregation (DEBS12 dataset)

[Figures: (1) SESSION LENGTHS, in Number of Records; (2) Total Reduce Calls (log scale) vs. Number of Queries, for Cutty (UPD), Cutty (MERGE), RA (UPD), and RA (MERGE); (3) Max Allocation (#partials, log scale) vs. Number of Queries.]

No limits in multiplexing

[Figure: a stream plotted by distance [in km] and time [in min.], marking Slice, Window, Window Begin, and Record.]

Summary

• UDWs extend the potential of pre-aggregation in window classes beyond fixed periodic windows.

• Cutty takes slicing a step further in computational efficiency and combines seamlessly with eager aggregation.

• First work that addresses multi-query aggregation across diverse window types.


Thank you!


@SenorCarbone

https://flink.apache.org/
https://github.com/apache/flink