QCon London 2018 Foundations of streaming SQL · Reconciling streams & tables w/ the Beam Model How...

Post on 15-Aug-2020


1

Foundations of streaming SQL
or: how I learned to love stream & table theory

Slides: https://s.apache.org/streaming-sql-qcon-london

Tyler Akidau
Apache Beam PMC
Software Engineer at Google
@takidau

Covering ideas from across the Apache Beam, Apache Calcite, Apache Kafka, and Apache Flink communities, with thoughts and contributions from Julian Hyde, Fabian Hueske, Shaoxuan Wang, Kenn Knowles, Ben Chambers, Reuven Lax, Mingmin Xu, James Xu, Martin Kleppmann, Jay Kreps and many more, not to mention that whole database community thing...

QCon London 2018

2

Table of Contents

01 Stream & Table Theory (Chapter 7)
   A Basics
   B The Beam Model

02 Streaming SQL (Chapter 9)
   A Time-varying relations
   B SQL language extensions

3

01 Stream & Table Theory
TFW you realize everything you do was invented by the database community decades ago...

A Basics
B The Beam Model

4

Stream & table basics

https://www.confluent.io/blog/making-sense-of-stream-processing/
https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/

5

Special theory of stream & table relativity

streams → tables: The aggregation of a stream of updates over time yields a table.

tables → streams: The observation of changes to a table over time yields a stream.
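The two conversions can be sketched in plain Java. This is a minimal illustration, not code from the talk: the class `StreamTableDuality` and its methods `update`, `materialize`, and `observe` are hypothetical names, updates are assumed to be upserts (last value wins), and deletions are ignored.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of stream/table duality; not part of any library. */
final class StreamTableDuality {

    /** Builds one (key, value) update; an update here means "set key to value". */
    static Map.Entry<String, Integer> update(String key, int value) {
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    /** streams -> tables: applying a stream of updates in order yields a table. */
    static Map<String, Integer> materialize(List<Map.Entry<String, Integer>> updates) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> u : updates) {
            table.put(u.getKey(), u.getValue());
        }
        return table;
    }

    /** tables -> streams: comparing two versions of a table yields the stream of changes. */
    static List<Map.Entry<String, Integer>> observe(
            Map<String, Integer> before, Map<String, Integer> after) {
        List<Map.Entry<String, Integer>> changes = new ArrayList<>();
        for (Map.Entry<String, Integer> e : after.entrySet()) {
            // A row is a change if it is new or its value differs from the old version.
            if (!e.getValue().equals(before.get(e.getKey()))) {
                changes.add(update(e.getKey(), e.getValue()));
            }
        }
        return changes;
    }
}
```

Feeding the changes returned by `observe` back into `materialize` on top of the old version reproduces the new version, which is the round trip the two definitions describe.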

6

Non-relativistic stream & table definitions

Tables are data at rest.

Streams are data in motion.

7

01 Stream & Table Theory
TFW you realize everything you do was invented by the database community decades ago...

A Basics
B The Beam Model

8

The Beam Model

What results are calculated?

Where in event time are results calculated?

When in processing time are results materialized?

How do refinements of results relate?

9

Reconciling streams & tables w/ the Beam Model

● How does batch processing fit into all of this?

● What is the relationship of streams to bounded and unbounded datasets?

● How do the four what, where, when, how questions map onto a streams/tables world?

10

MapReduce

[Diagram: input → Map → Reduce → output]

11

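MapReduce

[Diagram: input → MapRead → Map → MapWrite → ReduceRead → Reduce → ReduceWrite → output]

12

MapReduce

[Diagram: the same six stages; the input, the output, and every connection between stages are each marked "?".]

13

MapReduce

[Diagram: the same six stages; the input and the output are labeled "table", and the five connections between stages are still marked "?".]

14

Map phase

[Diagram: table → MapRead → ? → Map → ? → MapWrite → ?]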

15

Map phase API

void map(K1 key, V1 value, Emit<K2, V2>);


17

Map phase

[Diagram: table → MapRead → stream → Map → ? → MapWrite → ?]

18

Map phase API

void map(K1 key, V1 value, Emit<K2, V2>);

19

Map phase

[Diagram: table → MapRead → stream → Map → stream → MapWrite → ?]

20

Map phase API

void map(K1 key, V1 value, Emit<K2, V2>);

void reduce(K2 key, Iterable<V2> value, Emit<V3>);

21

Map phase

[Diagram: table → MapRead → stream → Map → stream → MapWrite → table]

22

MapReduce

[Diagram: table → MapRead → stream → Map → stream → MapWrite → table → ReduceRead → ? → Reduce → ? → ReduceWrite → table]

23

Map phase API

void map(K1 key, V1 value, Emit<K2, V2>);

void reduce(K2 key, Iterable<V2> value, Emit<V3>);


25

MapReduce

[Diagram: table → MapRead → stream → Map → stream → MapWrite → table → ReduceRead → stream → Reduce → stream → ReduceWrite → table]

26

Reconciling streams & tables w/ the Beam Model

● How does batch processing fit into all of this?

● What is the relationship of streams to bounded and unbounded datasets?

● How do the four what, where, when, how questions map onto a streams/tables world?

1. Tables are read into streams.

2. Streams are processed into new streams until a grouping operation is hit.

3. Grouping turns the stream into a table.

4. Repeat steps 1-3 until you run out of operations.
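The steps above can be traced through a toy MapReduce written in plain Java. This is a sketch under stated assumptions: `MapReducePhases`, the `"team,score"` input format, and the per-team sum are all invented for illustration, and everything runs in memory on one machine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical single-process MapReduce that labels each stage as table or stream. */
final class MapReducePhases {

    /** Map phase: input table -> stream -> stream -> table grouped by key. */
    static Map<String, List<Integer>> mapPhase(List<String> inputTable) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : inputTable) {
            // MapRead (table -> stream): iterating puts the stored rows in motion.
            // Map (stream -> stream): element-wise parsing, no grouping involved.
            String[] parts = line.split(",");
            String team = parts[0].trim();
            int score = Integer.parseInt(parts[1].trim());
            // MapWrite (stream -> table): grouping by key brings the data to rest.
            grouped.computeIfAbsent(team, k -> new ArrayList<>()).add(score);
        }
        return grouped;
    }

    /** Reduce phase: grouped table -> stream -> stream -> output table. */
    static Map<String, Integer> reducePhase(Map<String, List<Integer>> grouped) {
        Map<String, Integer> output = new TreeMap<>();
        // ReduceRead (table -> stream): iterate over the per-key groups.
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            // Reduce (stream -> stream): collapse each group to a single value.
            int sum = 0;
            for (int score : group.getValue()) {
                sum += score;
            }
            // ReduceWrite (stream -> table): store one row per key.
            output.put(group.getKey(), sum);
        }
        return output;
    }
}
```

For the input rows `"red,3"`, `"blue,5"`, `"red,4"`, the map phase comes to rest as `{blue=[5], red=[3, 4]}` and the reduce phase as `{blue=5, red=7}`.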

27

Reconciling streams & tables w/ the Beam Model

● How does batch processing fit into all of this?

● What is the relationship of streams to bounded and unbounded datasets?

● How do the four what, where, when, how questions map onto a streams/tables world?

Streams are the in-motion form of data, both bounded and unbounded.

28

Reconciling streams & tables w/ the Beam Model

● How does batch processing fit into all of this?

● What is the relationship of streams to bounded and unbounded datasets?

● How do the four what, where, when, how questions map onto a streams/tables world?

29

The Beam Model

What results are calculated?

Where in event time are results calculated?

When in processing time are results materialized?

How do refinements of results relate?


31

Example data: individual user scores

32

What is calculated?

PCollection<KV<Team, Score>> input = IO.read(...)
    .apply(ParDo.of(new ParseFn()))
    .apply(Sum.integersPerKey());

What is calculated?

34

The Beam Model

What results are calculated?

Where in event time are results calculated?

When in processing time are results materialized?

How do refinements of results relate?

35

Windowing divides data into event-time-based finite chunks.

Often required when doing aggregations over unbounded data.

[Figure: fixed, sliding, and session windows, drawn for Key 1, Key 2, and Key 3 along a time axis.]

Where in event time?

36

Where in event time?

PCollection<KV<User, Score>> input = IO.read(...)
    .apply(ParDo.of(new ParseFn()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());

Where in event time?
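What fixed windowing does to a per-key sum can be sketched without Beam. The class `FixedWindowSum`, its `Event` type, and the `user@[start,end)` key format are hypothetical; event times are expressed as minutes past 12:00, and the sample values in the usage note are borrowed from the UserScores rows shown later in the deck.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of summing per user within 2-minute event-time windows. */
final class FixedWindowSum {
    static final long WINDOW_MINUTES = 2;

    /** One score, stamped with the event time at which it occurred. */
    static final class Event {
        final String user;
        final long eventMinute;
        final int score;

        Event(String user, long eventMinute, int score) {
            this.user = user;
            this.eventMinute = eventMinute;
            this.score = score;
        }
    }

    /** Start of the fixed window that contains the given event time. */
    static long windowStart(long eventMinute) {
        return eventMinute - Math.floorMod(eventMinute, WINDOW_MINUTES);
    }

    /** Groups by (user, window) instead of by user alone, then sums the scores. */
    static Map<String, Integer> sumPerUserAndWindow(List<Event> events) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Event e : events) {
            long start = windowStart(e.eventMinute);
            String key = e.user + "@[" + start + "," + (start + WINDOW_MINUTES) + ")";
            sums.merge(key, e.score, Integer::sum);
        }
        return sums;
    }
}
```

With scores of 7, 1, and 4 for Julie at minutes 1, 3, and 7, the sums land in three separate windows, `[0,2)`, `[2,4)`, and `[6,8)`, rather than in one total of 12. The window an element belongs to depends only on its event time, not on when it arrives.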

38

The Beam Model

What results are calculated?

Where in event time are results calculated?

When in processing time are results materialized?

How do refinements of results relate?

39

• Triggers control when results are emitted.

• Triggers are often relative to the watermark.

[Figure: processing time plotted against event time, showing the Ideal line, the approximate watermark (~Watermark), and the Skew between them.]

When in processing time?

40

When in processing time?

PCollection<KV<User, Score>> input = IO.read(...)
    .apply(ParDo.of(new ParseFn()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
                 .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());

When in processing time?

42

The Beam Model: asking the right questions

What results are calculated?

Where in event time are results calculated?

When in processing time are results materialized?

How do refinements of results relate?

43

How do refinements relate?

PCollection<KV<User, Score>> input = IO.read(...)
    .apply(ParDo.of(new ParseFn()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
                 .triggering(AtWatermark().withLateFirings(AtCount(1)))
                 .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());

How do refinements relate?
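The effect of a watermark trigger with per-element late firings and accumulating panes can be approximated for a single window as follows. This is a simplified model rather than how a runner implements triggers: `WatermarkTriggerSketch` and `accumulatingPanes` are hypothetical names, and the watermark is reduced to a single assumption, namely that the first `onTimeCount` elements arrive before it passes the end of the window.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical single-window model of on-time and late firings in accumulating mode. */
final class WatermarkTriggerSketch {

    /**
     * Returns the value of every pane emitted for one window.
     * Requires 0 <= onTimeCount <= scoresInArrivalOrder.length.
     */
    static List<Integer> accumulatingPanes(int[] scoresInArrivalOrder, int onTimeCount) {
        List<Integer> panes = new ArrayList<>();
        int total = 0;
        // Elements that arrive before the watermark passes the end of the window
        // are buffered in the sum without producing output.
        for (int i = 0; i < onTimeCount; i++) {
            total += scoresInArrivalOrder[i];
        }
        // The watermark passes the end of the window: emit the on-time pane.
        panes.add(total);
        // Each late element fires immediately; accumulating mode re-emits the full sum.
        for (int i = onTimeCount; i < scoresInArrivalOrder.length; i++) {
            total += scoresInArrivalOrder[i];
            panes.add(total);
        }
        return panes;
    }
}
```

For scores 5, 7, 3 with two on-time elements, the window emits 12 when the watermark passes and then 15 when the late 3 arrives, so each later pane refines the earlier one by replacing it.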

45

What/Where/When/How Summary

1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Late Data Handling

46

Reconciling streams & tables w/ the Beam Model

● How does batch processing fit into all of this?

● What is the relationship of streams to bounded and unbounded datasets?

● How do the four what, where, when, how questions map onto a streams/tables world?

47

General theory of stream & table relativity

Pipelines : tables + streams + operations
Tables : data at rest
Streams : data in motion
Operations : (stream | table) → (stream | table) transformations

● stream → stream: Non-grouping (element-wise) operations
  Leaves stream data in motion, yielding another stream.

● stream → table: Grouping operations
  Brings stream data to rest, yielding a table.
  Windowing adds the dimension of time to grouping.

● table → stream: Ungrouping (triggering) operations
  Puts table data into motion, yielding a stream.
  Accumulation dictates the nature of the stream (deltas, values, retractions).

● table → table: (none)
  Impossible to go from rest and back to rest without being put into motion.
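The three stream flavors named under table → stream can be illustrated for a single row whose value changes over time. The sketch assumes a numeric aggregate; `AccumulationModes` and the `emit`/`retract` labels are hypothetical and chosen only for readability.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical renderings of one table row's successive values as a stream. */
final class AccumulationModes {

    /** Values: every emission carries the full current value. */
    static List<Integer> values(int[] tableValues) {
        List<Integer> out = new ArrayList<>();
        for (int v : tableValues) {
            out.add(v);
        }
        return out;
    }

    /** Deltas: every emission carries only the change since the previous one. */
    static List<Integer> deltas(int[] tableValues) {
        List<Integer> out = new ArrayList<>();
        int previous = 0;
        for (int v : tableValues) {
            out.add(v - previous);
            previous = v;
        }
        return out;
    }

    /** Retractions: the previous value is explicitly withdrawn before the new one is emitted. */
    static List<String> retractions(int[] tableValues) {
        List<String> out = new ArrayList<>();
        Integer previous = null;
        for (int v : tableValues) {
            if (previous != null) {
                out.add("retract " + previous);
            }
            out.add("emit " + v);
            previous = v;
        }
        return out;
    }
}
```

For successive sums 7, 8, 12, the values stream is `[7, 8, 12]`, the deltas stream is `[7, 1, 4]`, and the retractions stream interleaves `retract 7` and `retract 8` between the emissions. A consumer that sums its input needs deltas or retractions, while a consumer that overwrites by key can use values.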

48

02 Streaming SQL
Contorting relational algebra for fun and profit

A Time-varying relations
B SQL language extensions

49

Relational algebra

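Relation (UserScores):

-------------------------
| User  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
| Frank | 3     | 12:03 |
| Julie | 1     | 12:03 |
| Julie | 4     | 12:07 |
-------------------------

Relational algebra:

π Score,Time (UserScores)

-----------------
| Score | Time  |
-----------------
| 7     | 12:01 |
| 3     | 12:03 |
| 1     | 12:03 |
| 4     | 12:07 |
-----------------

SQL:

SELECT Score, Time FROM UserScores;
-----------------
| Score | Time  |
-----------------
| 7     | 12:01 |
| 3     | 12:03 |
| 1     | 12:03 |
| 4     | 12:07 |
-----------------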

50

Relations evolve over time

12:00> SELECT * FROM UserScores;
-------------------------
| Name  | Score | Time  |
-------------------------
-------------------------

12:01> SELECT * FROM UserScores;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
-------------------------

12:03> SELECT * FROM UserScores;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
| Frank | 3     | 12:03 |
| Julie | 1     | 12:03 |
-------------------------

12:07> SELECT * FROM UserScores;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
| Frank | 3     | 12:03 |
| Julie | 1     | 12:03 |
| Julie | 4     | 12:07 |
-------------------------

51

Classic SQL vs Streaming SQL

Classic SQL :: classic relations :: single point in time

Streaming SQL :: time-varying relations :: every point in time


53

Classic relations

12:00> SELECT * FROM UserScores;
-------------------------
| Name  | Score | Time  |
-------------------------
-------------------------

12:01> SELECT * FROM UserScores;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
-------------------------

12:03> SELECT * FROM UserScores;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
| Frank | 3     | 12:03 |
| Julie | 1     | 12:03 |
-------------------------

12:07> SELECT * FROM UserScores;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
| Frank | 3     | 12:03 |
| Julie | 1     | 12:03 |
| Julie | 4     | 12:07 |
-------------------------

Time-varying relation

12:07> SELECT * FROM UserScores;
-----------------------------------------------------------------------------------------------------------------
|       [-inf, 12:01)       |       [12:01, 12:03)      |       [12:03, 12:07)      |       [12:07, now)        |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| |       |       |       | | | Julie | 7     | 12:01 | | | Julie | 7     | 12:01 | | | Julie | 7     | 12:01 | |
| |       |       |       | | |       |       |       | | | Frank | 3     | 12:03 | | | Frank | 3     | 12:03 | |
| |       |       |       | | |       |       |       | | | Julie | 1     | 12:03 | | | Julie | 1     | 12:03 | |
| |       |       |       | | |       |       |       | | |       |       |       | | | Julie | 4     | 12:07 | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
-----------------------------------------------------------------------------------------------------------------

54

Closure property of relational algebra remains intact with time-varying relations.
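One way to read the closure claim is that a classic relational operator, applied to every snapshot of a time-varying relation, produces another time-varying relation. The sketch below models that reading in plain Java. `TvrClosure`, the interval-start keys, and the `{Name, Score, Time}` row layout are assumptions made for illustration, with the data taken from the UserScores slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

/** Hypothetical model of a time-varying relation as interval start -> classic relation. */
final class TvrClosure {

    /** Applies a classic relational operator to each snapshot; the result is again a TVR. */
    static NavigableMap<String, List<String[]>> apply(
            NavigableMap<String, List<String[]>> tvr,
            UnaryOperator<List<String[]>> operator) {
        NavigableMap<String, List<String[]>> result = new TreeMap<>();
        for (Map.Entry<String, List<String[]>> snapshot : tvr.entrySet()) {
            result.put(snapshot.getKey(), operator.apply(snapshot.getValue()));
        }
        return result;
    }

    /** Classic selection (WHERE Name = name) over rows laid out as {Name, Score, Time}. */
    static List<String[]> whereName(List<String[]> relation, String name) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : relation) {
            if (row[0].equals(name)) {
                kept.add(row);
            }
        }
        return kept;
    }

    /** UserScores as a TVR with one snapshot per interval: -inf, 12:01, 12:03, 12:07. */
    static NavigableMap<String, List<String[]>> userScores() {
        String[][] rows = {
            {"Julie", "7", "12:01"}, {"Frank", "3", "12:03"},
            {"Julie", "1", "12:03"}, {"Julie", "4", "12:07"}
        };
        NavigableMap<String, List<String[]>> tvr = new TreeMap<>();
        // "-inf" sorts before every "HH:mm" string, so its snapshot stays empty.
        for (String start : new String[] {"-inf", "12:01", "12:03", "12:07"}) {
            List<String[]> snapshot = new ArrayList<>();
            for (String[] row : rows) {
                if (row[2].compareTo(start) <= 0) {
                    snapshot.add(row);
                }
            }
            tvr.put(start, snapshot);
        }
        return tvr;
    }
}
```

Calling `TvrClosure.apply(TvrClosure.userScores(), r -> TvrClosure.whereName(r, "Julie"))` returns a map with the same four intervals, each holding only Julie's rows, which can itself be passed to `apply` again.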

55

Time-varying relations: variations

12:07> SELECT * FROM UserScores;
-----------------------------------------------------------------------------------------------------------------
|       [-inf, 12:01)       |       [12:01, 12:03)      |       [12:03, 12:07)      |       [12:07, now)        |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| |       |       |       | | | Julie | 7     | 12:01 | | | Julie | 7     | 12:01 | | | Julie | 7     | 12:01 | |
| |       |       |       | | |       |       |       | | | Frank | 3     | 12:03 | | | Frank | 3     | 12:03 | |
| |       |       |       | | |       |       |       | | | Julie | 1     | 12:03 | | | Julie | 1     | 12:03 | |
| |       |       |       | | |       |       |       | | |       |       |       | | | Julie | 4     | 12:07 | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
-----------------------------------------------------------------------------------------------------------------

56

Time-varying relations: filtering

12:07> SELECT * FROM UserScores;
-----------------------------------------------------------------------------------------------------------------
|       [-inf, 12:01)       |       [12:01, 12:03)      |       [12:03, 12:07)      |       [12:07, now)        |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| |       |       |       | | | Julie | 7     | 12:01 | | | Julie | 7     | 12:01 | | | Julie | 7     | 12:01 | |
| |       |       |       | | |       |       |       | | | Frank | 3     | 12:03 | | | Frank | 3     | 12:03 | |
| |       |       |       | | |       |       |       | | | Julie | 1     | 12:03 | | | Julie | 1     | 12:03 | |
| |       |       |       | | |       |       |       | | |       |       |       | | | Julie | 4     | 12:07 | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
-----------------------------------------------------------------------------------------------------------------

12:07> SELECT * FROM UserScores WHERE Name = 'Julie';
-----------------------------------------------------------------------------------------------------------------
|       [-inf, 12:01)       |       [12:01, 12:03)      |       [12:03, 12:07)      |       [12:07, now)        |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| |       |       |       | | | Julie | 7     | 12:01 | | | Julie | 7     | 12:01 | | | Julie | 7     | 12:01 | |
| |       |       |       | | |       |       |       | | | Julie | 1     | 12:03 | | | Julie | 1     | 12:03 | |
| |       |       |       | | |       |       |       | | |       |       |       | | | Julie | 4     | 12:07 | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
-----------------------------------------------------------------------------------------------------------------

57

Time-varying relations: grouping

12:07> SELECT * FROM UserScores;
-----------------------------------------------------------------------------------------------------------------
|       [-inf, 12:01)       |       [12:01, 12:03)      |       [12:03, 12:07)      |       [12:07, now)        |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| |       |       |       | | | Julie | 7     | 12:01 | | | Julie | 7     | 12:01 | | | Julie | 7     | 12:01 | |
| |       |       |       | | |       |       |       | | | Frank | 3     | 12:03 | | | Frank | 3     | 12:03 | |
| |       |       |       | | |       |       |       | | | Julie | 1     | 12:03 | | | Julie | 1     | 12:03 | |
| |       |       |       | | |       |       |       | | |       |       |       | | | Julie | 4     | 12:07 | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
-----------------------------------------------------------------------------------------------------------------

12:07> SELECT Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-----------------------------------------------------------------------------------------------------------------
|       [-inf, 12:01)       |       [12:01, 12:03)      |       [12:03, 12:07)      |       [12:07, now)        |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| |       |       |       | | | Julie | 7     | 12:01 | | | Julie | 8     | 12:03 | | | Julie | 12    | 12:07 | |
| |       |       |       | | |       |       |       | | | Frank | 3     | 12:03 | | | Frank | 3     | 12:03 | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
-----------------------------------------------------------------------------------------------------------------

58

How does this relate to streams & tables?

59

Time-varying relations: tables

12:07> SELECT Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-----------------------------------------------------------------------------------------------------------------
|       [-inf, 12:01)       |       [12:01, 12:03)      |       [12:03, 12:07)      |       [12:07, now)        |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| |       |       |       | | | Julie | 7     | 12:01 | | | Julie | 8     | 12:03 | | | Julie | 12    | 12:07 | |
| |       |       |       | | |       |       |       | | | Frank | 3     | 12:03 | | | Frank | 3     | 12:03 | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
-----------------------------------------------------------------------------------------------------------------

12:00> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
-------------------------

12:01> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
-------------------------

12:03> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 8     | 12:03 |
| Frank | 3     | 12:03 |
-------------------------

12:07> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 12    | 12:07 |
| Frank | 3     | 12:03 |
-------------------------

60

Time-varying relations: tables

12:07> SELECT Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-----------------------------------------------------------------------------------------------------------------
|       [-inf, 12:01)       |       [12:01, 12:03)      |       [12:03, 12:07)      |       [12:07, now)        |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| |       |       |       | | | Julie | 7     | 12:01 | | | Julie | 8     | 12:03 | | | Julie | 12    | 12:07 | |
| |       |       |       | | |       |       |       | | | Frank | 3     | 12:03 | | | Frank | 3     | 12:03 | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
-----------------------------------------------------------------------------------------------------------------

12:01> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
-------------------------

12:07> SELECT TABLE Name, SUM(Score), MAX(Time) AS OF SYSTEM TIME '12:01' FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
-------------------------

61

Time-varying relations: tables

12:07> SELECT Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-----------------------------------------------------------------------------------------------------------------
|       [-inf, 12:01)       |       [12:01, 12:03)      |       [12:03, 12:07)      |       [12:07, now)        |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | | | Name  | Score | Time  | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
| |       |       |       | | | Julie | 7     | 12:01 | | | Julie | 8     | 12:03 | | | Julie | 12    | 12:07 | |
| |       |       |       | | |       |       |       | | | Frank | 3     | 12:03 | | | Frank | 3     | 12:03 | |
| ------------------------- | ------------------------- | ------------------------- | ------------------------- |
-----------------------------------------------------------------------------------------------------------------

12:00> SELECT STREAM Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
...

[Current time: 12:00]

12:00> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
-------------------------

62

Time-varying relations: streams

[Time-varying relation as on slide 61.]

12:00> SELECT STREAM Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
...

[Current time: 12:01]

12:00> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
-------------------------

63

Time-varying relations: streams

[Time-varying relation and stream output as on slide 62; current time: 12:01.]

12:01> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
-------------------------

64

Time-varying relations: streams

[Time-varying relation as on slide 61.]

12:00> SELECT STREAM Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
| Frank | 3     | 12:03 |
| Julie | 8     | 12:03 |
...

[Current time: 12:03]

12:01> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
-------------------------

65

Time-varying relations: streams

[Time-varying relation and stream output as on slide 64; current time: 12:03.]

12:03> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 8     | 12:03 |
| Frank | 3     | 12:03 |
-------------------------

66

Time-varying relations: streams

[Time-varying relation as on slide 61.]

12:00> SELECT STREAM Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 7     | 12:01 |
| Frank | 3     | 12:03 |
| Julie | 8     | 12:03 |
| Julie | 12    | 12:07 |
...

[Current time: 12:07]

12:03> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 8     | 12:03 |
| Frank | 3     | 12:03 |
-------------------------

67

Time-varying relations: streams

[Time-varying relation and stream output as on slide 66; current time: 12:07.]

12:07> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;
-------------------------
| Name  | Score | Time  |
-------------------------
| Julie | 12    | 12:07 |
| Frank | 3     | 12:03 |
-------------------------

68

How does this relate to streams & tables?

Tables capture a point-in-time snapshot of a time-varying relation.

Streams capture the evolution of a time-varying relation over time.
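Both views can be derived from the same underlying data, as in this minimal sketch. `TvrViews`, `tableAsOf`, and `stream` are hypothetical names, the rows are the UserScores data from the earlier slides with times stored as minutes past 12:00, and the grouped query is assumed to be the SUM(Score) GROUP BY Name example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical table and stream views over SUM(Score) GROUP BY Name. */
final class TvrViews {
    // UserScores rows in arrival order; MINUTES holds minutes past 12:00.
    static final String[] NAMES = {"Julie", "Frank", "Julie", "Julie"};
    static final int[] SCORES = {7, 3, 1, 4};
    static final int[] MINUTES = {1, 3, 3, 7};

    /** Table view: the snapshot of the grouped relation as of the given minute. */
    static Map<String, Integer> tableAsOf(int minute) {
        Map<String, Integer> snapshot = new LinkedHashMap<>();
        for (int i = 0; i < NAMES.length; i++) {
            if (MINUTES[i] <= minute) {
                snapshot.merge(NAMES[i], SCORES[i], Integer::sum);
            }
        }
        return snapshot;
    }

    /** Stream view: one element per change to the grouped relation, in order. */
    static List<String> stream() {
        Map<String, Integer> sums = new LinkedHashMap<>();
        List<String> changes = new ArrayList<>();
        for (int i = 0; i < NAMES.length; i++) {
            int sum = sums.merge(NAMES[i], SCORES[i], Integer::sum);
            // Rows arrive in time order, so MAX(Time) is the current row's time.
            changes.add(String.format("%s %d 12:%02d", NAMES[i], sum, MINUTES[i]));
        }
        return changes;
    }
}
```

`tableAsOf(3)` yields Julie 8 and Frank 3, matching the 12:03 snapshot on the slides, while `stream()` yields the four changes Julie 7, Frank 3, Julie 8, Julie 12 that the STREAM query accumulates.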

69

02 Streaming SQL
Contorting relational algebra for fun and profit

A Time-varying relations
B SQL language extensions

70

When do you need SQL extensions for streaming?

How is output consumed?

As a table: SQL extensions rarely needed.

As a stream: SQL extensions sometimes needed (good defaults = often not needed).

71

When do you need SQL extensions for streaming?*

Explicit table / stream selection
● SELECT TABLE * from X;
● SELECT STREAM * from X;

Timestamps and windowing
● Event-time columns
● Windowing. E.g.,
  SELECT * FROM X GROUP BY SESSION(<COLUMN>, INTERVAL '5' MINUTE);
  ○ Grouping by timestamp
  ○ Complex multi-row transactions inexpressible in declarative SQL (e.g., session windows)

Sane default table / stream selection
● If all inputs are tables, output is a table
● If any inputs are streams, output is a stream

Simple triggers
● Implicitly defined by characteristics of the sink
● Optionally configured outside of the query
● Per-query, e.g.: SELECT * from X EMIT <WHEN>;
● Focused set of use cases:
  ○ Repeated updates
    ... EMIT AFTER <TIMEDELTA>
  ○ Completeness
    ... EMIT WHEN WATERMARK PAST <COLUMN>
  ○ Repeated updates + completeness (e.g., early/on-time/late pattern)
    ... EMIT AFTER <TIMEDELTA> AND WHEN WATERMARK PAST <COLUMN>

* Most of these extensions are theoretical at this point; very few have concrete implementations.

72

Summary

streams ⇄ tables

streams & tables : Beam Model

time-varying relations

SQL language extensions

73

Thank you!

In early release now: streamingsystems.net

These slides: http://s.apache.org/streaming-sql-big-data-spain

Streaming SQL spec (WIP: Apex, Beam, Calcite, Flink): http://s.apache.org/streaming-sql-spec
Streaming in Calcite (Julian Hyde): https://calcite.apache.org/docs/stream.html
Streams, joins & temporal tables (Julian Hyde): http://s.apache.org/streams-joins-and-temporal-tables

Streaming 101: http://oreilly.com/ideas/the-world-beyond-batch-streaming-101
Streaming 102: http://oreilly.com/ideas/the-world-beyond-batch-streaming-102

Apache Beam: http://beam.apache.org
Apache Calcite: http://calcite.apache.org
Apache Flink: http://flink.apache.org

Twitter: @takidau