Table & SQL API: unified APIs for batch and stream processing

Timo Walther, Apache Flink PMC (@twalthr)
Flink Forward @ San Francisco - April 11th, 2017




Motivation


DataStream API is great…


▪ Very expressive stream processing
  • Transform data, update state, define windows, aggregate, etc.

▪ Highly customizable windowing logic
  • Assigners, Triggers, Evictors, Lateness

▪ Asynchronous I/O
  • Improve communication to external systems

▪ Low-level Operations
  • ProcessFunction gives access to timestamps and timers


… but it is not for Everyone!


▪ Writing DataStream programs is not always easy
  • Stream processing technology spreads rapidly
  • New streaming concepts (time, state, windows, ...)

▪ Requires knowledge & skill
  • Continuous applications have special requirements
  • Programming experience (Java / Scala)

▪ Users want to focus on their business logic


Why not a Relational API?


▪ Relational API is declarative
  • User says what is needed, system decides how to compute it

▪ Queries can be effectively optimized
  • Fewer black boxes, well-researched field

▪ Queries are efficiently executed
  • Let Flink handle state, time, and common mistakes

▪ “Everybody” knows and uses SQL!


Goals

▪ Easy, declarative, and concise relational API

▪ Tool for a wide range of use cases

▪ Relational API as a unifying layer
  • Queries on batch tables terminate and produce a finite result
  • Queries on streaming tables run continuously and produce a result stream

▪ Same syntax & semantics for both queries


Table API & SQL


Table API & SQL

▪ Flink features two relational APIs
  • Table API: LINQ-style API for Java & Scala (since Flink 0.9.0)
  • SQL: Standard SQL (since Flink 1.1.0)


[Figure: API stack with the layers SQL, Table API, DataSet API / DataStream API, and Flink Dataflow Runtime.]


Table API & SQL Example


// env, myParser, and Customer are defined elsewhere
val tEnv = TableEnvironment.getTableEnvironment(env)

// configure your data source
val customerSource = CsvTableSource.builder()
  .path("/path/to/customer_data.csv")
  .field("name", Types.STRING)
  .field("prefs", Types.STRING)
  .build()

// register as a table
tEnv.registerTableSource("cust", customerSource)

// define your table program with the Table API ...
val table = tEnv.scan("cust").select('name.lowerCase(), myParser('prefs))
// ... or, as an alternative to the line above, with SQL
// val table = tEnv.sql("SELECT LOWER(name), myParser(prefs) FROM cust")

// convert
val ds: DataStream[Customer] = table.toDataStream[Customer]


Windowing in Table API


val sensorData: DataStream[(String, Long, Double)] = ???

// convert DataStream into Table
val sensorTable: Table = sensorData
  .toTable(tableEnv, 'location, 'rowtime, 'tempF)

// define query on Table
val avgTempCTable: Table = sensorTable
  .window(Tumble over 1.day on 'rowtime as 'w)
  .groupBy('location, 'w)
  .select('w.start as 'day, 'location, (('tempF.avg - 32) * 0.556) as 'avgTempC)
  .where('location like "room%")
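To make the semantics of the windowed query concrete, here is a minimal plain-Scala sketch (no Flink dependency) of what it computes on a finite collection: filter by location, group by location and one-day tumbling window, average the Fahrenheit readings, and convert to Celsius. The names `Reading`, `windowStart`, and `avgTempC` are made up for illustration, and timestamps are assumed to be non-negative epoch milliseconds. It is not how Flink executes the query.

```scala
// Plain-Scala sketch of the result of the tumbling-window query above.
object TumbleSketch {
  // Hypothetical record type: location, event time (epoch millis), temperature in °F.
  final case class Reading(location: String, rowtime: Long, tempF: Double)

  val DayMillis: Long = 24L * 60 * 60 * 1000

  // Start of the tumbling window containing ts (assumes ts >= 0).
  def windowStart(ts: Long, size: Long): Long = ts - (ts % size)

  // Group by (1-day window, location), average tempF, convert with the slide's
  // formula (avg - 32) * 0.556. Filtering on the grouping key before grouping
  // gives the same rows as filtering the grouped result.
  def avgTempC(readings: Seq[Reading]): Map[(Long, String), Double] =
    readings
      .filter(_.location.startsWith("room"))
      .groupBy(r => (windowStart(r.rowtime, DayMillis), r.location))
      .map { case (key, rs) =>
        val avgF = rs.map(_.tempF).sum / rs.size
        (key, (avgF - 32) * 0.556)
      }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Reading("room1", 1000L, 68.0),
      Reading("room1", 2000L, 72.0),
      Reading("room1", DayMillis + 5L, 50.0),
      Reading("hall", 3000L, 60.0)
    )
    avgTempC(data).toSeq.sortBy(_._1).foreach(println)
  }
}
```

On a stream, the same result would be emitted per window as time progresses rather than computed once over a finished collection.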


Windowing in SQL


val sensorData: DataStream[(String, Long, Double)] = ???

// register DataStream
tableEnv.registerDataStream(
  "sensorData", sensorData, 'location, 'rowtime, 'tempF)

// query registered Table
val avgTempCTable: Table = tableEnv.sql("""
  SELECT TUMBLE_START(TUMBLE(rowtime, INTERVAL '1' DAY)) AS day,
         location,
         AVG((tempF - 32) * 0.556) AS avgTempC
  FROM sensorData
  WHERE location LIKE 'room%'
  GROUP BY location, TUMBLE(rowtime, INTERVAL '1' DAY)
""")


Architecture

2 APIs [SQL, Table API] * 2 backends [DataStream, DataSet] = 4 different translation paths?


Architecture

[Figure: architecture diagram. Components shown: Table API and SQL API; Table API Validator; Calcite Parser & Validator; Calcite Catalog; DataSet, DataStream, and Table Sources; Calcite Logical Plan; Calcite Optimizer; DataSet Rules and DataStream Rules; DataSet Plan and DataSet; DataStream Plan and DataStream.]


Translation to Logical Plan


sensorTable
  .window(Tumble over 1.day on 'rowtime as 'w)
  .groupBy('location, 'w)
  .select('w.start as 'day, 'location, (('tempF.avg - 32) * 0.556) as 'avgTempC)
  .where('location like "room%")

Table API Validation produces the Table Nodes; Translation turns them into the Calcite Logical Plan:

Table Nodes          Calcite Logical Plan
Catalog Node         Logical Table Scan
Window Aggregate     Logical Window Aggregate
Project              Logical Project
Filter               Logical Filter


Translation to DataStream Plan

Calcite Logical Plan: Logical Table Scan, Logical Window Aggregate, Logical Project, Logical Filter
  (Optimize)
Optimized Plan: Logical Table Scan, Logical Window Aggregate, Logical Calc
  (Transform)
DataStream Plan: DataStream Scan, DataStream Calc, DataStream Aggregate
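The Optimize step on this slide replaces the Logical Project and Logical Filter with a single Logical Calc. The following is a minimal sketch of such a rewrite on a toy plan tree. The node classes and the `optimize` function are hypothetical simplifications written for this example; they are not Calcite's or Flink's classes. Expressions are kept as plain strings, so the field-reference rewriting that a real rule would have to do is left out.

```scala
// Toy plan tree and one rewrite rule: Filter over Project becomes a single Calc.
object PlanSketch {
  sealed trait Node
  final case class Scan(table: String) extends Node
  final case class WindowAggregate(input: Node, keys: List[String], aggs: List[String]) extends Node
  final case class Project(input: Node, exprs: List[String]) extends Node
  final case class Filter(input: Node, condition: String) extends Node
  // Calc combines a projection with an optional filter condition.
  final case class Calc(input: Node, exprs: List[String], condition: Option[String]) extends Node

  // Applies the rule bottom-up; every other node is rebuilt with optimized inputs.
  def optimize(node: Node): Node = node match {
    case Filter(in, cond) =>
      optimize(in) match {
        case Project(inner, exprs) => Calc(inner, exprs, Some(cond))
        case other                 => Filter(other, cond)
      }
    case Project(in, exprs)              => Project(optimize(in), exprs)
    case WindowAggregate(in, keys, aggs) => WindowAggregate(optimize(in), keys, aggs)
    case Calc(in, exprs, cond)           => Calc(optimize(in), exprs, cond)
    case s: Scan                         => s
  }

  def main(args: Array[String]): Unit = {
    // Same shape as the slide: Scan, Window Aggregate, Project, Filter.
    val logical = Filter(
      Project(
        WindowAggregate(Scan("sensorData"), List("location", "w"), List("AVG(tempF)")),
        List("w.start AS day", "location", "avgTempC")),
      "location LIKE 'room%'")
    println(optimize(logical))
  }
}
```

Applied to the example plan, the result has the three nodes listed under Optimized Plan: a scan, a window aggregate, and one calc on top.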


Translation to Flink Program

DataStream Plan: DataStream Scan, DataStream Calc, DataStream Aggregate
  (Translate & Code-generate)
DataStream Program: (Forwarding), FlatMap Function, Aggregate & WindowFunction


Current State (in master)

▪ Batch support
  • Selection, Projection, Sort, Inner & Outer Joins, Set operations
  • Group-Windows for Slide, Tumble, Session

▪ Streaming support
  • Selection, Projection, Union
  • Group-Windows for Slide, Tumble, Session
  • Different SQL OVER-Windows (RANGE/ROWS)

▪ UDFs, UDTFs, custom rules


Use Cases for Streaming SQL

▪ Continuous ETL & Data Import

▪ Live Dashboards & Reports


Outlook: Dynamic Tables


Dynamic Tables Model

▪ Dynamic tables change over time

▪ Dynamic tables are treated like static batch tables
  • Dynamic tables are queried with standard SQL / Table API
  • Every query returns another Dynamic Table

▪ “Stream / Table Duality”
  • Stream ←→ Dynamic Table conversions without information loss


Stream to Dynamic Table

▪ Append Mode:

▪ Update Mode:
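The figures for these two modes are not part of this transcript. As a rough sketch, assuming the usual reading of the terms: in append mode every stream record is inserted as a new row, while in update mode the table has a unique key and a record inserts or overwrites the row with that key. The `Click` type and both functions are made up for this example, and deletions (which update mode can also express) are left out.

```scala
// Plain-Scala sketch of turning a stream into a table in append vs. update mode.
object StreamToTableSketch {
  // Hypothetical stream record: a user name and a clicked URL.
  final case class Click(user: String, url: String)

  // Append mode: every record becomes a new row; existing rows never change.
  def appendMode(stream: Seq[Click]): Vector[Click] =
    stream.foldLeft(Vector.empty[Click])((table, rec) => table :+ rec)

  // Update mode (assumed key: user): a record inserts a new row or replaces
  // the row that has the same key.
  def updateMode(stream: Seq[Click]): Map[String, Click] =
    stream.foldLeft(Map.empty[String, Click])((table, rec) => table.updated(rec.user, rec))

  def main(args: Array[String]): Unit = {
    val stream = Seq(Click("Mary", "/home"), Click("Bob", "/cart"), Click("Mary", "/prod"))
    println(appendMode(stream)) // three rows, one per record
    println(updateMode(stream)) // two rows, one per user
  }
}
```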


Querying Dynamic Tables

▪ Dynamic tables change over time
  • A[t]: Table A at a specific point in time t

▪ Dynamic tables are queried with relational semantics
  • Result of a query changes as input table changes
  • q(A[t]): Evaluate query q on table A at time t

▪ Query result is continuously updated as t progresses
  • Similar to maintaining a materialized view
  • t is current event time
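The notation q(A[t]) can be illustrated with a small plain-Scala sketch: take the snapshot of an append-only table at time t and evaluate an ordinary relational query on it. `Row`, `snapshot`, and the count-per-user query `q` are illustrative choices, not part of the talk. The sketch recomputes the result from scratch for each t, whereas a streaming engine would maintain it incrementally.

```scala
// Plain-Scala sketch of evaluating a query q on the snapshot A[t] of a dynamic table.
object ContinuousQuerySketch {
  // Hypothetical append-only input row: event time t and a user name.
  final case class Row(t: Long, user: String)

  // A[t]: the rows of table A that exist at time t.
  def snapshot(a: Seq[Row], t: Long): Seq[Row] = a.filter(_.t <= t)

  // q: here, SELECT user, COUNT(*) FROM A GROUP BY user.
  def q(table: Seq[Row]): Map[String, Int] =
    table.groupBy(_.user).map { case (user, rows) => (user, rows.size) }

  def main(args: Array[String]): Unit = {
    val a = Seq(Row(1L, "Mary"), Row(2L, "Bob"), Row(3L, "Mary"))
    // The result of q(A[t]) changes as t advances, like a refreshed materialized view.
    for (t <- 1L to 3L) println(s"t=$t -> ${q(snapshot(a, t))}")
  }
}
```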


Querying a Dynamic Table


▪ Can we run any query on Dynamic Tables? No!

▪ State may not grow infinitely as more data arrives
  • Set clean-up timeout or key constraints.

▪ Input may only trigger partial re-computation

▪ Queries with possibly unbounded state or computation are rejected


Dynamic Table to Stream

▪ Convert Dynamic Table modifications into stream messages

▪ Similar to database logging techniques
  • Undo: previous value of a modified element
  • Redo: new value of a modified element
  • Undo+Redo: old and the new value of a changed element

▪ For Dynamic Tables: Redo or Undo+Redo
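The two encodings can be sketched in plain Scala by comparing a keyed result table before and after a change. The message types and function names below are hypothetical and chosen for this example only; rows that disappear from the table are not handled.

```scala
// Plain-Scala sketch of encoding table changes as a Redo or an Undo+Redo stream.
object ChangelogSketch {
  sealed trait Message
  final case class Undo(key: String, value: Int) extends Message // previous value of a changed row
  final case class Redo(key: String, value: Int) extends Message // new value of a changed row

  // Undo+Redo: for a changed row, emit the old value before the new one.
  // Keys are processed in sorted order to keep the output deterministic.
  def undoRedo(before: Map[String, Int], after: Map[String, Int]): List[Message] =
    after.toList.sortBy(_._1).flatMap { case (k, v) =>
      before.get(k) match {
        case Some(old) if old == v => Nil
        case Some(old)             => List(Undo(k, old), Redo(k, v))
        case None                  => List(Redo(k, v))
      }
    }

  // Redo only: emit just the new values; the consumer overwrites by key.
  def redoOnly(before: Map[String, Int], after: Map[String, Int]): List[Message] =
    undoRedo(before, after).collect { case r: Redo => r }

  def main(args: Array[String]): Unit = {
    val before = Map("Mary" -> 1, "Bob" -> 1)
    val after  = Map("Mary" -> 2, "Bob" -> 1, "Liz" -> 1)
    println(undoRedo(before, after))
    println(redoOnly(before, after))
  }
}
```

A redo-only stream is more compact, but the consumer needs a key to find the row being replaced; an undo+redo stream carries the old value explicitly.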


Dynamic Table to Stream

▪ Undo+Redo Stream (because A is in Append Mode):


Dynamic Table to Stream

▪ Redo Stream (because A is in Update Mode):


Result computation & refinement

[Figure: timeline of result computation and refinement. Labels: First result (end - x); Last result (end + x); State is purged; Late updates (on new data); Update rate (every x); Complete result (end + x); Complete result can be computed (end).]


Contributions welcome!

▪ Huge interest and many contributors
  • Adding more window operators
  • Introducing dynamic tables

▪ And there is a lot more to do
  • New operators and features for streaming and batch
  • Performance improvements
  • Tooling and integration

▪ Try it out, give feedback, and start contributing!


Thank you! @twalthr @ApacheFlink @dataArtisans