Practical Big Data Processing
An Overview of Apache Flink

Tilmann Rabl
Berlin Big Data Center

www.dima.tu-berlin.de | bbdc.berlin | [email protected]

With slides from Volker Markl and data artisans

© DIMA 2017 | © 2013 Berlin Big Data Center • All Rights Reserved

What is Apache Flink?

Apache Flink is an open source platform for scalable batch and stream data processing.

http://flink.apache.org

• The core of Flink is a distributed streaming dataflow engine.

• Executing dataflows in parallel on clusters

• Providing a reliable foundation for various workloads

• DataSet and DataStream programming abstractions are the foundation for user programs and higher layers

What can I do with it?

Flink is a big data processing system that can natively support all of these workloads:

• Stream processing
• Batch processing
• Machine learning at scale
• Graph analysis

Flink in the Analytics Ecosystem

• Applications & Languages: Hive, Mahout, Cascading, Pig, Crunch, Giraph
• Data processing engines: MapReduce, Flink, Spark, Storm, Tez
• App and resource management: YARN, Mesos
• Storage, streams: HDFS, HBase, Kafka

Rich set of operators

Map, Reduce, Join, CoGroup, Union, Iterate, Delta Iterate, Filter, FlatMap, GroupReduce, Project, Aggregate, Distinct, Vertex-Update, Accumulators, …

All you will ever need for any big data processing.

Sneak peek: Two of Flink’s APIs

case class Word(word: String, frequency: Int)

DataStream API (streaming):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
val lines: DataStream[String] = env.fromSocketStream(...)

lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .keyBy("word")
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS)) // 5 s window, evaluated every 1 s
  .sum("frequency")
  .print()

env.execute() // a streaming job only starts once execute() is called

DataSet API (batch):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()

Flink Execution Model
• Flink program = DAG (directed acyclic graph) of operators and intermediate streams
• Operator = computation + state
• Intermediate streams = logical stream of records
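
The "operator = computation + state" idea can be illustrated with a minimal plain-Java sketch. It does not use Flink; `OperatorSketch`, `NORMALIZE`, and `RunningCount` are hypothetical names chosen for this illustration. A stateless operator feeds a stateful one, forming a two-node linear dataflow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Plain-Java illustration only; none of these names are Flink APIs.
final class OperatorSketch {

    /** Stateless operator: pure computation on each record. */
    static final Function<String, String> NORMALIZE =
            record -> record.trim().toLowerCase(Locale.ROOT);

    /** Stateful operator: computation plus state (a running count per key). */
    static final class RunningCount implements Function<String, Integer> {
        private final Map<String, Integer> countsByKey = new HashMap<>();

        @Override
        public Integer apply(String key) {
            // merge() updates the state and returns the new count for this key
            return countsByKey.merge(key, 1, Integer::sum);
        }
    }

    /** Linear dataflow: source -> NORMALIZE -> RunningCount -> collected output. */
    static List<Integer> run(List<String> source) {
        Function<String, Integer> pipeline = NORMALIZE.andThen(new RunningCount());
        List<Integer> output = new ArrayList<>();
        for (String record : source) {
            output.add(pipeline.apply(record)); // records flow through one at a time
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("Flink", " flink ", "Spark", "FLINK")));
    }
}
```

The state lives inside the operator and survives from one record to the next, which is what distinguishes `RunningCount` from the purely functional `NORMALIZE`.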

Architecture
• Hybrid MapReduce and MPP database runtime
• Pipelined/Streaming engine
  – Complete DAG deployed

[Figure: Job Manager with Worker 1, Worker 2, Worker 3, Worker 4]

Built-in vs. driver-based looping

Driver-based looping:
• Loop outside the system, in driver program
• Iterative program looks like many independent jobs

Built-in looping (Flink):
• Dataflows with feedback edges
• System is iteration-aware, can optimize the job

[Figure: a client issuing a sequence of separate steps, compared with a Flink dataflow (map, join, reduce, join) that contains a feedback edge]
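
The difference between the two looping styles can be sketched in plain Java. This is an illustration under stated assumptions, not Flink code: `Engine` is a hypothetical stand-in for a processing system that merely counts submitted jobs, and `STEP` (one Newton step towards the square root of 2) stands in for the body of an iterative algorithm.

```java
import java.util.function.DoubleUnaryOperator;

// Plain-Java illustration only; Engine is a hypothetical stand-in, not a Flink API.
final class LoopingSketch {

    /** One Newton step towards sqrt(2); stands in for the body of an iterative algorithm. */
    static final DoubleUnaryOperator STEP = x -> (x + 2.0 / x) / 2.0;

    static final class Engine {
        int submittedJobs = 0;

        /** Driver-based looping: every call is an independent job. */
        double runSingleStep(DoubleUnaryOperator step, double input) {
            submittedJobs++;
            return step.applyAsDouble(input);
        }

        /** Built-in looping: one job; the engine feeds each result back as the next input. */
        double iterate(DoubleUnaryOperator step, double initial, int iterations) {
            submittedJobs++;
            double value = initial;
            for (int i = 0; i < iterations; i++) {
                value = step.applyAsDouble(value); // feedback edge
            }
            return value;
        }
    }

    public static void main(String[] args) {
        Engine driverBased = new Engine();
        double x = 1.0;
        for (int i = 0; i < 10; i++) {
            x = driverBased.runSingleStep(STEP, x); // the loop lives in the client
        }

        Engine builtIn = new Engine();
        double y = builtIn.iterate(STEP, 1.0, 10);  // the loop lives in the engine

        System.out.println(x + " after " + driverBased.submittedJobs + " jobs");
        System.out.println(y + " after " + builtIn.submittedJobs + " job");
    }
}
```

Both variants compute the same value, but only in the second does the engine see the whole loop, which is the precondition for iteration-aware optimization.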

Managed Memory
• Language APIs automatically convert objects to tuples
  – Tuples mapped to pages/buffers of bytes
  – Operators can work on pages/buffers
• Full control over memory, out-of-core enabled
• Operators (e.g., Hybrid Hash Join) address individual fields (without deserializing the object): robust

Example layout of a Tuple3<Integer, Double, Person>, with public class Person { int id; String name; }:

Size             Field          Serialized as
4b               F0: int        IntSerialized
8b               F1: double     DoubleSerialized
1b               Person         Pojo (header)
4b               Id: int        IntSerialized
Variable length  Name: String   StringSerialized

[Figure label: Memory runs out]
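
The byte layout above can be imitated with `java.nio.ByteBuffer`. This is a minimal sketch, not Flink's actual serializer code: the header value `1` and the 4-byte length prefix for the string are assumptions made for the illustration, and `ManagedMemorySketch` is a hypothetical name. It shows how a fixed-offset field can be read without deserializing the whole object.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Plain-Java illustration only; this is not Flink's serialization format.
final class ManagedMemorySketch {

    static final int F1_OFFSET = 4;                 // after the 4-byte int F0
    static final int PERSON_ID_OFFSET = 4 + 8 + 1;  // after F0, F1 and the 1-byte header

    /** Layout: 4b int | 8b double | 1b header | 4b int | length-prefixed UTF-8 name. */
    static ByteBuffer serialize(int f0, double f1, int personId, String personName) {
        byte[] name = personName.getBytes(StandardCharsets.UTF_8);
        ByteBuffer page = ByteBuffer.allocate(4 + 8 + 1 + 4 + 4 + name.length);
        page.putInt(f0);
        page.putDouble(f1);
        page.put((byte) 1);       // assumed header flag: 1 = object present
        page.putInt(personId);
        page.putInt(name.length); // assumed length prefix for the variable-length field
        page.put(name);
        page.flip();              // switch the buffer from writing to reading
        return page;
    }

    /** Reads a single field at its fixed offset; nothing else is deserialized. */
    static int readPersonId(ByteBuffer page) {
        return page.getInt(PERSON_ID_OFFSET);
    }

    static double readF1(ByteBuffer page) {
        return page.getDouble(F1_OFFSET);
    }

    public static void main(String[] args) {
        ByteBuffer page = serialize(42, 3.5, 7, "Ada");
        System.out.println("person id = " + readPersonId(page) + ", f1 = " + readF1(page));
    }
}
```

Because the fixed-size fields sit at known offsets, an operator such as a join can compare keys directly on the bytes, and the buffers themselves can be spilled to disk when memory is short.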

Mini-Batching vs Native Streaming

Discretized Streams (D-Streams): a stream discretizer turns the stream into a sequence of jobs

while (true) {
  // get next few records
  // issue batch computation
}

Native streaming: long-standing operators

while (true) {
  // process next record
}

Problems of Mini-Batch
• Latency
  – Each mini-batch schedules a new job, loads user libraries, establishes DB connections, etc.
• Programming model
  – Does not separate business logic from recovery: changing the mini-batch size changes query results
• Power
  – Keeping and updating state across mini-batches only possible by immutable computations
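
The programming-model point, that the batch size leaks into query results, can be illustrated with a small plain-Java sketch. It assumes a hypothetical query "count records per mini-batch" and hypothetical arrival times in milliseconds; `MiniBatchSketch` is not part of any real engine.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; not the API of any real mini-batch engine.
final class MiniBatchSketch {

    /** Counts records per mini-batch; the batch interval decides which records are grouped. */
    static Map<Long, Integer> countPerBatch(long[] arrivalMillis, long batchIntervalMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : arrivalMillis) {
            long batchStart = t - (t % batchIntervalMillis); // batch this record falls into
            counts.merge(batchStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] arrivals = {0, 900, 1100, 1900, 2100, 3500}; // hypothetical arrival times
        System.out.println("1 s batches: " + countPerBatch(arrivals, 1000));
        System.out.println("2 s batches: " + countPerBatch(arrivals, 2000));
    }
}
```

The same input yields different groupings for the two intervals, so an operational tuning knob ends up changing the answer of the query.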

Flink’s Windowing
• Windows can be any combination of (multiple) triggers & evictions
  – Arbitrary tumbling, sliding, session, etc. windows can be constructed.
• Common triggers/evictions part of the API
  – Time (processing vs. event time), Count
• Even more flexibility: define your own UDF trigger/eviction
• Examples:
dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5)));
dataStream.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5)));

Yahoo! Benchmark Results
Performed by Yahoo! Engineering, Dec 16, 2015

[..] Storm 0.10.0, 0.11.0-SNAPSHOT and Flink 0.10.1 show sub-second latencies at relatively high throughputs [..]. Spark streaming 1.5.1 supports high throughputs, but at a relatively higher latency.

Flink achieves highest throughput with competitive low latency!

Source: http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

Our benchmarks

[Figure panels: Streaming; Windowed Aggregations / Joins; Iterative Algorithms; Batch]

Flink consistently outperforms other streaming engines in throughput and latency

Flink in the BBDC Stack

See details in the demos!

Thank You

Contact: Tilmann [email protected]