Flink 0.10 - Upcoming Features

Upcoming Features: Apache Flink™ 0.10 Aljoscha Krettek [email protected]


What to Expect

• High availability of the master node (JobManager)
• Live monitoring
• Event time, watermarks and windowing improvements
• Demo: fault tolerance

These are only the highlights; more is being worked on!

High Availability


Status Quo

[Diagram: a single JobManager and a TaskManager, captioned "PANIC!"]

With High Availability

[Diagram: JobManager, stand-by JobManager, TaskManager and Apache ZooKeeper™, captioned "KEEP GOING"]

Some Details

Flink uses ZooKeeper™ for two things:
• Leader election (in case of multiple JobManagers)
• Reliable storage of the dataflow graph and checkpoint metadata (more on that later)
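The two roles listed above can be pictured with a toy, in-memory coordinator. This is only a sketch of the general idea: `ToyCoordinator`, its methods and the names `jobmanager-1` / `jobmanager-2` are hypothetical, and neither the ZooKeeper API nor Flink's actual implementation is used here. The rule "lowest live sequence number leads" is an assumption chosen for illustration.

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service (hypothetical, not ZooKeeper)."""

    def __init__(self):
        self._next_seq = 0
        self._candidates = {}  # sequence number -> candidate name
        self._metadata = {}    # small key/value store for job metadata

    def register(self, name):
        """Register a leader candidate and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._candidates[seq] = name
        return seq

    def unregister(self, seq):
        """Simulate a crashed candidate whose registration disappears."""
        self._candidates.pop(seq, None)

    def leader(self):
        """The live candidate with the lowest sequence number leads."""
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]

    def put(self, key, value):
        self._metadata[key] = value

    def get(self, key):
        return self._metadata.get(key)


coordinator = ToyCoordinator()
active = coordinator.register("jobmanager-1")
standby = coordinator.register("jobmanager-2")
coordinator.put("latest_checkpoint", 7)

leader_before = coordinator.leader()
coordinator.unregister(active)  # the active JobManager fails
leader_after = coordinator.leader()
recovered_checkpoint = coordinator.get("latest_checkpoint")
print(leader_before, leader_after, recovered_checkpoint)
```

Because the metadata lives in the coordinator rather than in the failed process, the new leader can still find the latest checkpoint after taking over.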

Live Monitoring


Live Monitoring

Before:
• Accumulators only available after the job finishes

Now:
• Accumulators updated while the job is running
• System accumulators (number of bytes/records processed…)
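As a rough illustration of the "before" and "now" behaviour, the sketch below samples two counters while a toy job is still running. `Accumulator` and `run_job` are hypothetical helpers written for this example; they are not Flink's accumulator API.

```python
class Accumulator:
    """A simple counter that can be read at any time (hypothetical helper)."""

    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount


def run_job(records, records_seen, bytes_seen):
    """Process records one at a time, updating the accumulators as it goes."""
    for record in records:
        records_seen.add(1)
        bytes_seen.add(len(record.encode("utf-8")))
        yield record.upper()  # the job's actual output


records_seen = Accumulator()
bytes_seen = Accumulator()
job = run_job(["a", "bb", "ccc"], records_seen, bytes_seen)

progress = []
for _ in job:
    # A monitor can sample the accumulators while the job is still running.
    progress.append((records_seen.value, bytes_seen.value))
print(progress)
```

Reading the counters only after the loop would give the final totals alone, which corresponds to the "before" situation.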

Timestamps, Watermarks and the Rest™


Why all the Fuss?

[Diagram: elements with timestamps 11, 2, 13, 1, 14, 3 flow into a WindowOperator; one element is annotated "Payload: 0x45FD, Timestamp: 13". Two windows are marked "?".]

Elements do not arrive ordered by timestamp.

Processing Time Windows

[Diagram: the same elements (11, 2, 13, 1, 14, 3) flow into the WindowOperator; one window contains 1, 14, 3 and the other contains 11, 2, 13.]

Elements do not arrive ordered by timestamp.

Event Time Windows

[Diagram: the same elements (11, 2, 13, 1, 14, 3) flow into the WindowOperator; one window contains 11, 13, 14 and the other contains 3, 1, 2.]

Elements do not arrive ordered by timestamp.

Problem: How do you know when to process windows?
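The difference between the two kinds of windows can be sketched in a few lines. The timestamps are read from the diagrams as 11, 2, 13, 1, 14, 3, which is an assumption about the slides. The window size of 10, the rule "three arrivals per window" (standing in for a wall-clock interval) and both function names are also hypothetical.

```python
# Timestamps in arrival order, as read from the slides (an assumption).
arrivals = [11, 2, 13, 1, 14, 3]


def processing_time_windows(timestamps, per_window):
    """Group elements by arrival order, ignoring their timestamps."""
    return [
        timestamps[i:i + per_window]
        for i in range(0, len(timestamps), per_window)
    ]


def event_time_windows(timestamps, window_size):
    """Group elements by the window their own timestamp falls into."""
    windows = {}
    for ts in timestamps:
        start = (ts // window_size) * window_size
        windows.setdefault(start, []).append(ts)
    return dict(sorted(windows.items()))


by_arrival = processing_time_windows(arrivals, per_window=3)
by_event_time = event_time_windows(arrivals, window_size=10)
print(by_arrival)
print(by_event_time)
```

Grouping by arrival mixes old and new timestamps in one window. Grouping by event time keeps them apart, but only works if the operator knows when a window has received all of its elements.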

Watermarks to the Rescue

[Diagram: a Source emits timestamped elements with watermarks interleaved in the stream; one marker is labelled "This is a Watermark".]

Some Details

• The window operator waits for watermarks
• Upon watermark arrival, elements with timestamps lower than the watermark can be processed
• Operators forward watermarks once they know they cannot emit elements with a lower timestamp
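The behaviour described above can be sketched with a toy operator that buffers elements and only emits a window once a watermark has passed its end. `Watermark`, `ToyWindowOperator`, the window size of 10 and the input stream are all hypothetical; this is not Flink's operator code, and forwarding watermarks downstream is left out.

```python
from collections import defaultdict


class Watermark:
    """Marker stating that no element with a lower timestamp will follow."""

    def __init__(self, timestamp):
        self.timestamp = timestamp


class ToyWindowOperator:
    """Buffers elements per event-time window and fires windows on watermarks."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.buffers = defaultdict(list)  # window start -> buffered timestamps

    def process(self, item):
        """Return the list of (window_start, elements) fired by this item."""
        if isinstance(item, Watermark):
            fired = []
            for start in sorted(self.buffers):
                # A window [start, start + size) is complete once the
                # watermark has reached its end.
                if start + self.window_size <= item.timestamp:
                    fired.append((start, sorted(self.buffers.pop(start))))
            return fired
        start = (item // self.window_size) * self.window_size
        self.buffers[start].append(item)
        return []


operator = ToyWindowOperator(window_size=10)
stream = [11, 2, 13, 1, 3, Watermark(10), 14, Watermark(20)]
results = []
for item in stream:
    results.extend(operator.process(item))
print(results)
```

The element 14 arrives after the first watermark but still lands in the correct window, because the window starting at 10 is held back until the watermark reaches 20.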

Fault Tolerance


Streaming Fault Tolerance

Ensure that operators see all events
• "At least once"
• Solved by replaying a stream from a checkpoint, e.g., from a past Kafka offset

Ensure that operators do not perform duplicate updates to their state
• "Exactly once"
• Several solutions
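The gap between the two guarantees shows up in a small replay experiment. The sketch assumes a list standing in for a replayable source, a hand-written checkpoint, and a `live_state` that survives the failure (for example because it lives in an external store). All names are hypothetical.

```python
def run(events, start_offset, state, fail_at=None):
    """Sum events from start_offset; optionally crash before offset fail_at."""
    for offset in range(start_offset, len(events)):
        if offset == fail_at:
            raise RuntimeError("simulated failure")
        state["sum"] += events[offset]
    return state


events = [1, 2, 3, 4, 5]
# Checkpoint taken after the first two events were processed.
checkpoint = {"offset": 2, "state": {"sum": 3}}

# The first attempt crashes after processing four events.
live_state = {"sum": 0}
try:
    run(events, 0, live_state, fail_at=4)
except RuntimeError:
    pass

# Replay without restoring state: every event is seen, some are counted twice.
at_least_once = run(events, checkpoint["offset"], dict(live_state))

# Replay with state restored from the same checkpoint: each event counts once.
exactly_once = run(events, checkpoint["offset"], dict(checkpoint["state"]))
print(at_least_once["sum"], exactly_once["sum"])
```

Replaying from the offset alone guarantees that no event is missed. Restoring the state that belongs to that offset is what removes the duplicate updates.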

Exactly-Once Approaches

Discretized streams (Spark Streaming)
• Treat streaming as a series of small atomic computations
• "Fast track" to fault tolerance, but restricts the computational and programming model (e.g., cannot mutate state across "mini-batches", window functions correlated with mini-batch size)

MillWheel (Google Cloud Dataflow)
• State update and derived events committed as an atomic transaction to a high-throughput transactional store
• Requires a very high-throughput transactional store

Chandy-Lamport distributed snapshots (Flink)
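The snapshot idea can be sketched with a marker that travels through the stream together with the data: each operator records its state when the marker passes and then forwards it. The toy pipeline below has a single input channel, so it needs no alignment of markers across channels. `BARRIER` and both operator classes are hypothetical and only illustrate the principle, not Flink's implementation.

```python
BARRIER = object()  # hypothetical marker injected into the stream by the source


class CountingOperator:
    """Counts records; snapshots its count when a barrier passes through."""

    def __init__(self):
        self.count = 0
        self.snapshots = []

    def process(self, item):
        if item is BARRIER:
            self.snapshots.append(self.count)  # state as of the barrier
            return BARRIER                     # forward the barrier downstream
        self.count += 1
        return item


class SummingOperator:
    """Sums records; snapshots its total when a barrier passes through."""

    def __init__(self):
        self.total = 0
        self.snapshots = []

    def process(self, item):
        if item is BARRIER:
            self.snapshots.append(self.total)
            return BARRIER
        self.total += item
        return item


stream = [5, 7, BARRIER, 1, 2]
counter, summer = CountingOperator(), SummingOperator()
for item in stream:
    summer.process(counter.process(item))
print(counter.snapshots, summer.snapshots, counter.count, summer.total)
```

Both snapshots reflect exactly the records that preceded the marker (two records, sum 12), even though processing continued afterwards without a pause.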


Best of all Worlds for Streaming

Low latency
• Thanks to the pipelined engine

Exactly-once guarantees
• Variation of Chandy-Lamport

High throughput
• Controllable checkpointing overhead

Separates app logic from recovery
• Checkpointing interval is just a config parameter

Demo time


flink-forward.org

I ♥ Flink, do you?


If you find this exciting, get involved and start a discussion on Flink's mailing list, or stay tuned by subscribing to [email protected], following flink.apache.org/blog, and following @ApacheFlink on Twitter.