Transcript of Flink meetup
Get your hands on implementing a Flink app: A tutorial
Christos Hadjinikolis & Satyasheel | DataReply.uk
Tutorial Overview:
- What is Apache Flink?
- Why Flink?
- Processing both bounded and unbounded data
- Anatomy of a Flink App
- Windowing in Flink
- Event time & processing time in Flink
2/22/17
What is Apache Flink?
“A distributed data processing platform…”
Flink is a distributed stream- & batch-data processing platform.

Stream processing: the real-time processing of data continuously, concurrently, and in a record-by-record fashion, where the data is not static.

Batch processing: the execution of a series of programs, each on a set or "batch" of static inputs, rather than on a single input (which would instead be a custom job).
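The contrast above can be sketched in a few lines of plain Python (a toy illustration, not Flink code): a batch job runs once over a static input, while a streaming job emits an updated result for every record it sees.

```python
# Toy contrast (not Flink's API): batch processing runs once over a
# static input; stream processing updates a result per record.

def batch_sum(static_input):
    # Batch: the whole dataset is available up front; one final answer.
    return sum(static_input)

def stream_sums(records):
    # Stream: emit an updated running total after every record.
    total = 0
    for r in records:
        total += r
        yield total

print(batch_sum([1, 2, 3]))           # 6
print(list(stream_sums([1, 2, 3])))   # [1, 3, 6]
```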
…distributed processing dataset types

Unbounded: infinite datasets that are appended to continuously:
- End users interacting with mobile or web applications
- Physical sensors providing measurements
- Financial markets
- Machine log data
- Surveillance camera frames
…distributed processing dataset types
Bounded: finite, unchanging datasets:
- Pictures
- Documents
- Database tables
Why Flink?
“The world is turning more and more towards stream processing…”
Opt for Flink because it:
- Provides results that are accurate;
- Is stateful and fault-tolerant, and can seamlessly recover from failures;
- Performs at large scale.
…exactly-once semantics
Stateful… apps can maintain summaries of processed data.
Checkpointing… a mechanism that ensures that in the event of failure no duplicate re-computation of an event will take place.
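A minimal way to picture checkpointing (a hypothetical model, not Flink's implementation): periodically snapshot the operator state together with the input offset it corresponds to; on failure, roll back to the last snapshot and resume from that offset, so no event is double-counted.

```python
# Minimal model (not Flink's implementation): a checkpoint pairs the
# operator state with the input offset it was taken at, so recovery
# resumes exactly where the snapshot left off.

def run_with_checkpoints(events, checkpoint_every, fail_at=None):
    state = {"count": 0}
    checkpoint = ({"count": 0}, 0)           # (state snapshot, offset)
    i = 0
    events = list(events)
    while i < len(events):
        if i == fail_at:
            # Simulated crash: discard in-flight state, restore snapshot.
            state, i = dict(checkpoint[0]), checkpoint[1]
            fail_at = None                   # recover only once
            continue
        state["count"] += 1                  # "process" events[i]
        i += 1
        if i % checkpoint_every == 0:
            checkpoint = (dict(state), i)    # snapshot state + offset
    return state["count"]

# With or without a mid-stream failure, each event is counted exactly once.
print(run_with_checkpoints(range(10), 3))             # 10
print(run_with_checkpoints(range(10), 3, fail_at=7))  # 10
```

Because the post-checkpoint state is thrown away on recovery, the replayed events are not re-counted on top of stale results, which is the essence of the "no duplicate re-computation" guarantee.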
…event time semantics
…event-time-based windowing

Event time makes it easy to compute accurate results over streams where events arrive out of order and where events may arrive delayed.
… flexible windowing

Windows can be customized with flexible triggering conditions to support sophisticated streaming patterns based on:
- Time;
- Count, and;
- Sessions.
… lightweight fault tolerance
Recovers from failures with zero data loss, while the trade-off between reliability and latency is negligible.
… lightweight fault tolerance
Savepoints provide a state versioning mechanism. Applications can update and reprocess historic data with no lost state.
… Scalable
Designed to run on large-scale clusters with many thousands of nodes.
So, in summary… Flink is an open-source stream processing framework which:
- Eliminates the “performance vs. reliability” problem, and;
- Performs consistently in both categories.
Processing both bounded & unbounded data!
“Unbounding the boundaries…”
…the streaming model & bounded datasets

DataStream API: un-bounded data
DataSet API: bounded data

A bounded dataset is handled inside Flink as a “finite stream”, with only a few minor differences from how Flink manages un-bounded datasets.
Anatomy of a Flink App
“Let’s get this started…”
…Flink programs transform collections of data

Each program consists of the same basic parts:
- Obtain an execution environment,
- Load/create the initial data,
- Specify transformations on this data,
- Specify where to put the results of your computations,
- Trigger the program execution.
- Create execution environment
- Load streaming data
- Trigger transformations
- Specify dumping location
- Execute
…Lazy evaluation
When the program’s main method is executed, each operation is created and added to the program’s plan; execution is explicitly triggered by an execute() call. This helps with constructing an optimised data-flow as a holistically planned unit.
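The plan-then-execute idea above can be sketched with a toy class (hypothetical names, plain Python, not Flink's API): calling map or filter only records the operation, and nothing runs until execute().

```python
# Toy sketch of lazy evaluation (not Flink's API): operations are only
# recorded in a plan; work happens when execute() is called, so the
# whole data-flow can be planned as one unit.

class LazyStream:
    def __init__(self, data):
        self.data = data
        self.plan = []                  # recorded operations, not yet run

    def map(self, fn):
        self.plan.append(("map", fn))
        return self

    def filter(self, pred):
        self.plan.append(("filter", pred))
        return self

    def execute(self):
        out = list(self.data)
        for op, fn in self.plan:        # only now does any work happen
            if op == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

stream = LazyStream([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(len(stream.plan))   # 2 operations recorded, nothing computed yet
print(stream.execute())   # [20, 30, 40]
```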
Let’s take 15 mins…
Windowing in Flink
“…a simple word count app.”
…so what is a window?

A window is a way to get a {snapshot} of the streaming data. A {snapshot} can be based on time or other variables. One can define the window based on the number of records or other stream-specific variables.
…enough with theory! Give us some code!
A streaming word count example with no windowing
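The transcript does not include the code shown at this point, so here is a hedged Python sketch of the idea (not the actual Flink example): with no windowing, the count state is updated on every incoming record and an updated (word, count) pair is emitted immediately.

```python
from collections import defaultdict

# Sketch of a streaming word count with no windowing (not the talk's
# actual Flink code): state is updated record by record, and an
# updated (word, count) pair is emitted for every word seen.

def streaming_word_count(lines):
    counts = defaultdict(int)           # the operator's running state
    for line in lines:
        for word in line.split():
            counts[word] += 1
            yield word, counts[word]    # emit one update per record

updates = list(streaming_word_count(["to be", "or not to be"]))
print(updates[0])    # ('to', 1)
print(updates[-1])   # ('be', 2): state kept growing across lines
```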
…updating states
Flink automatically updates its states without the user explicitly doing so. To better appreciate this, it is worth contrasting Flink with Spark:

Spark relies on micro-batches: this means one has to define the batch size, either in terms of time or of size.

Flink does not require defining a batch size; it can process each and every new event individually (it is true stream processing!).
Let’s see an example
…
Windowing in Flink
“Don't waste a minute not being happy. If one window closes, run to the next window - or break down a door. …”
…so why use windowing at all?
Aggregation on a DataStream is different from aggregation on a DataSet: one cannot count all records in an infinite stream. DataStream aggregation makes sense on a windowed stream.
…what types of windowing can you use?

- Tumbling Windows: aligned, fixed-length, non-overlapping windows.
- Sliding Windows: aligned, fixed-length, overlapping windows.
- Session Windows: non-aligned, variable-length windows.
- Count Windows: a fixed number of records/events, non-overlapping windows.
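The difference between the time-based types above comes down to window-assignment arithmetic, sketched here in plain Python (not Flink's API): a tumbling window assigns each timestamp to exactly one window, while a sliding window assigns it to size/slide overlapping windows.

```python
# Sketch of window-assignment arithmetic (not Flink's API):
# which window(s) does an event with timestamp ts fall into?

def tumbling_window(ts, size):
    # Aligned, fixed-length, non-overlapping: exactly one window.
    start = (ts // size) * size
    return [(start, start + size)]

def sliding_windows(ts, size, slide):
    # Aligned, fixed-length, overlapping: the event belongs to every
    # window whose span [start, start + size) covers ts.
    last_start = (ts // slide) * slide
    starts = range(last_start, ts - size, -slide)
    return [(s, s + size) for s in starts if s <= ts < s + size]

print(tumbling_window(7, 5))      # [(5, 10)] - one window only
print(sliding_windows(7, 10, 5))  # [(5, 15), (0, 10)] - two overlapping
```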
…anatomy of the window API
3 window functions:

- Window Assigner: responsible for assigning a given element to a window. Depending upon the definition of the window, one element can belong to one or more windows at a time.
- Trigger: defines the condition for triggering window evaluation. This function controls when a given window, created by the window assigner, is evaluated.
- Evictor: an optional function which defines the preprocessing done before firing window operations.
…understanding count window
Window Assigner (user-defined, for count-based windows):
- There is no start or end to the window, therefore the window is non-time-based.
- For these windows we use the GlobalWindows window assigner.
- For a given key, all key-values are filled into the same window:

keyValue.window(GlobalWindows.create())

- The window API allows us to add the window assigner to the window.
- Every window assigner has a default trigger; for global windows that trigger is NeverTrigger, which never fires.
- So, this window assigner has to be used with a custom trigger.
…understanding count window
Count trigger: once we have the window assigner, we have to define when the window needs to be triggered, for example:

trigger(CountTrigger.of(2))

This results in the window being evaluated every two records.

Evictor: in addition to these, an evictor can be used for further preprocessing tasks before firing a window operation, e.g. to remove every 3rd element of a window. Some default evictors: CountEvictor, DeltaEvictor, TimeEvictor.
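The assigner-trigger-evictor pipeline above can be simulated in a few lines of Python (a toy model, not Flink's API): all elements for a key pile into one global window, a count trigger fires every N elements, and an optional evictor preprocesses the window contents before the window function runs.

```python
# Toy simulation (not Flink's API) of a global window with a count
# trigger, mirroring GlobalWindows + trigger(CountTrigger.of(n)).

def count_window(events, count, evictor=None):
    window, results = [], []
    for e in events:
        window.append(e)                # GlobalWindows: one window per key
        if len(window) == count:        # CountTrigger: fire every `count`
            fired = evictor(window) if evictor else window
            results.append(sum(fired))  # the window function, here a sum
            window = []
    return results

print(count_window([1, 2, 3, 4, 5, 6], 2))   # [3, 7, 11]

# A hypothetical evictor that drops the first element before firing:
print(count_window([1, 2, 3, 4, 5, 6], 3, evictor=lambda w: w[1:]))  # [5, 11]
```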
The anatomy of a window API
…
Tumbling Windows
…
Sliding Windows …
Let’s take 15 mins…
Timing in Flink
“The two most powerful warriors are patience and time.”
…the time concept in streaming

- A streaming application is an always-running application.
- We need to take snapshots of the stream at various points.
- These points can be defined using a time component.
- We can group and correlate different events happening in the stream.
- Some of the constructs, like windows, heavily use the time component.
- Most streaming frameworks support a single meaning of time, which is mostly tied to the processing time.
…time in Flink

When we say the last "t" seconds, what do we mean exactly? Well, in Flink it’s one of three things:

- Processing Time: “…the records that arrived in the last 't' seconds for processing.”
- Event Time: “…all the records generated in those last 't' seconds at the source.”
- Ingestion Time: the time when events are ingested into the system; it sits in between event time and processing time.
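The distinction matters for out-of-order streams, as this Python sketch shows (a toy model, not Flink's API): the same stream is windowed by arrival order versus by the timestamp each event carries, and only the event-time grouping puts the late event where it belongs.

```python
# Sketch (not Flink's API): the same out-of-order stream windowed by
# processing (arrival) order vs. by the event timestamp it carries.
# Each event is (event_time, value); list order is arrival order.

events = [(1, "a"), (2, "b"), (11, "c"), (3, "d")]  # (3, "d") arrives late

def by_processing_order(stream, per_window):
    # Group events into windows of `per_window` in arrival order.
    return [
        [v for _, v in stream[i:i + per_window]]
        for i in range(0, len(stream), per_window)
    ]

def by_event_time(stream, size):
    # Group events by the timestamp they carry: the late event still
    # lands in the window it belongs to, giving accurate results.
    windows = {}
    for ts, v in stream:
        windows.setdefault(ts // size, []).append(v)
    return [windows[k] for k in sorted(windows)]

print(by_processing_order(events, 2))  # [['a', 'b'], ['c', 'd']]
print(by_event_time(events, 10))       # [['a', 'b', 'd'], ['c']]
```

In the processing-order grouping, the late "d" is lumped in with "c"; event-time windowing recovers the grouping the source actually produced.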
…time in Flink
Time in Flink…
Thanks for your attention!