Introduction to Apache Flink™: How Stream Processing is Shaping ...
Transcript of Introduction to Apache Flink™: How Stream Processing is Shaping ...
Introduction to Apache Flink™:How Stream Processing is Shaping
the Data Engineering Space
Tzu-Li (Gordon) Tai
@tzulitai
● 戴資力(Gordon)● Apache Flink Committer● Co-organizer of Apache Flink Taiwan User Group● Software Engineer @ VMFive● Java, Scala● Enjoy developing distributed systems
Who am I?
Stream processing is enabling the obvious: continuous processing on data that is continuously produced
2
Streaming is the next programming paradigm for data applications, and you need to start thinking in terms
of streams
3
01 The Traditional Batch Wayt
...
HDFSFile
MapReduce /Spark / Flink
Jobs
● Continouslyingesting data
● Periodicbatch files
● Periodicbatch jobs
4
01 The Traditional Batch Wayt
...
cross boundary
intermediateresults
● Jobs often has “dangling” results near batch boundaries
● Need to save them, and input into the next batch job
5
02 Key Observations for Batch
6
● Way too many moving parts
● Implicit treatment of time (the batch boundaries)
● Treating continuous state as discrete
● Troublesome to get accurate, correct results
03 The “Ideal” Streaming Wayt
...
Streaming processor that handles …
(1) continuous state(2) out-of-order events
scalably, robustly, and efficiently
7
04 Apache Flink
Apache Flinkan open-source platform for distributed stream and batch data processing
● Apache Top-Level Project since Jan. 2015
● Streaming Dataflow Engine at its core○ Low latency○ High Throughput○ Stateful○ Accurate○ Distributed
8
04 Apache Flink
Apache Flinkan open-source platform for distributed stream and batch data processing
● ~260 contributors, ~25 Committers / PMC
● Used adoption:○ Alibaba - realtime search optimization○ Uber - ride request fulfillment marketplace○ Netflix - Stream Processing as a Service (SPaaS)○ Kings Gaming - realtime data science dashboard○ ...
9
05 Scala Collection-like APIcase class Word (word: String, count: Int)
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap(_.split(“ ”)).map(word => Word(word,1)).groupBy(“word”).sum(“count”).print()
val lines: DataStream[String] = env.addSource(new KafkaSource(...))
lines.flatMap(_.split(“ ”)).map(word => Word(word,1)).keyBy(“word”).timeWindow(Time.seconds(5)).sum(“count”).print()
DataSet API
DataStream API
11
05 Scala Collection-like API
.filter(...).flatmap(...).map(...).groupBy(...).reduce(...)
● Becoming the de facto standard for new generation API to express data pipelines
● Apache Spark, Apache Flink, Apache Beam ...
12
06 What does Flink’s Engine do?
YourCode
process records one-at-a-time
...
● Computation on a never-ending stream of data records
13
06 What does Flink’s Engine do?
YourCode...
● System distributes the computation across the cluster
YourCode...
YourCode...
14
07 Streaming Dataflow Runtime
JobManager
TaskManager
TaskManager
TaskManager
TaskManager
TaskManager
TaskManager
ExecutionGraph
(parallel) TaskManager
TaskManager
TaskManager
Application Code
(DataSet /DataStream)
Optimizer / Graph
Generator
JobGraph(logical)
Client
concurrently executed
distributed queues as push-based data shipping channels
15
07 Streaming Dataflow Runtime
1. Record “A” enters Task 1, and is processed
2. The record is serialized into an output buffer at Task 1
3. The buffer is shipped to Task 2’s input buffer
Observation: Buffers need to be available throughout the process (think blocking queues used between threads)
● A slightly closer look into the transmission of data ...
Taken from anoutput buffer pool
Taken from aninput buffer pool
16
07 Streaming Dataflow Runtime● Natural, built-in backpressure
● Receiving data at a higher rate than a system can process during a temporary load spike
○ ex. GC talls at processing tasks○ ex. data source natural load spike
Normal stable case:
Temporary load spike:
Ideal backpressure handling:
17
● Due to one-at-a-time processing, Flink has very powerful built-in windowing (certainly among the best in the current streaming framework solutions)
○ Time-driven: Tumbling window, Sliding window○ Data-driven: Count window, Session window
07 Flexible Windows
18
08 What does Flink’s Engine do?
YourCode
...
State
● Computation and state, ex.:○ counters○ in-progress
windows○ state machines○ trained ML
models
● Results depend onhistory of stream
● A stateful streamprocessor gives tools to manage state
22
09 What does Flink’s Engine do?
YourCode
...
State
● Processingdepends on timestamps of whenevents were generated
● Core mechanics is called watermarks: basically a way to measure and advance clock time, instead of relying on machine time
t1t2 t3t4 t1 - t2 t3 - t4
23
09 Why Wall Time is Incorrect
● Think Twitter hash-tag count every 5 minutes
○ We would want the result to reflect the number of Twitter tweets actually tweeted in a 5 minute window
○ Not the number of tweet events the stream processor receives within 5 minutes
25
09 Why Wall Time is Incorrect
● Think replaying a Kafka topic on a windowed streaming application …
○ If you’re replaying a queue, windows are definitely wrong if using a wall clock
26
10 Flink’s Streaming Fault Tolerance
YourCode
YourCode
YourCode
State
State
State
...
...
...
...
...
...
...
...
...
YourCode
YourCode
YourCode
State
State
State
● Any operator in a Flink streaming topology can be stateful● How to ensure that the states are correct upon failure?
27
10 Flink’s Streaming Fault Tolerance
● First, a recap of some guarantee concepts:
○ At-least-once: records may be processed more than once.Think counting: may over count, resulting in wrong state
○ Exactly-once “state”: records appear to be processed only once, with respect to the state.Think counting: even on failure, each record is counted exactly once
○ End-to-end exactly-once: records appear to be processed only once, even to external systemsThink counting: for results stored externally, even after failure, the results remain correct
28
11 Flink’s Streaming Fault Tolerance
YourCode
YourCode
YourCode
State
State
State
...
...
...
...
...
...
...
...
...
YourCode
YourCode
YourCode
State
State
State
● Flink checkpoints: a combined snapshot of all operator state, with the corresponding position in the source
● Based on Chandly-Lamport Algorithm: does not haltany computation while taking consistent snapshots
29
12 Flink’s Savepoints
● Flink checkpoints: consistent snapshots of the whole topology state that the system periodically takes
● Flink savepoints: manually triggered checkpoints that can be persisted, and used to initialize state for a new streaming job
tt1 t2 t3
savepointstate at t1
savepointstate at t2
savepointstate at t3
30
13 So, back to this ...
t
...
Streaming processor that handles …
(1) continuous state(2) out-of-order events
scalably, robustly, and efficiently
32
14 What Flink provides, in a nutshell example
● Batch is inherently unsuitable for the nature of continuously generated data
● State is corrupt at boundaries
35
14 What Flink provides, in a nutshell example
● Flink’s Stateful streaming naturally treats state continuously as it processes your continuous data, and continuously generates results
36
14 What Flink provides, in a nutshell example
● On reprocessing: initial state for the job reflects all previous history data in the stream
37
14 What Flink provides, in a nutshell example
● On reprocessing: event-time processing guarantees correct results, even when fast-forwarding to the head of stream
event-time processing
38
15 Final Takeaways
● Stateful Streaming correctly embraces the nature of continuously generated data, and is the new programming paradigm for their applications.
● Streaming isn’t only about real-time. Realtime is only a natural advantage of streaming.
39
15 Final Takeaways● The choice is all about your data, and your code.
● Think:○ Is your data unbounded, or bounded?
■ Unbounded: click streams, page visits, impressions …■ Bounded: (???)
● Think:○ Does your code change faster than your data?
■ Data exploration, data mining, feature engineering …■ In this case, it doesn’t really matter whether you use batch or streaming
○ Or does your data change faster than your code?■ Production ETL pipelines, warehousing, serving, etc.■ For accuracy and robustness, definitely think and design in terms of
streaming
40
15 Final Takeaways
● Upcoming features in Flink:
○ Dynamic scaling, with stateful streaming○ Queryable state○ Incremental state checkpointing○ Even more savepoint functionality
41
15 Final Takeaways
● How Flink’s technology covers the application space:
Application
Realtime applications
Continuous applications
Analytics on historical data
Request/Response Apps
Technology
Low-latency stateful streaming
High-latency stateful streaming
Batch as special case of streaming
Queryable state
42