Data Analysis with Apache Flink (Hadoop Summit, 2015)
Transcript of Data Analysis with Apache Flink (Hadoop Summit, 2015)
![Page 1: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/1.jpg)
Aljoscha Krettek / Till Rohrmann
Flink committers
Co-founders @ data Artisans
[email protected] / [email protected]

Data Analysis With Apache Flink
![Page 2: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/2.jpg)
What is Apache Flink?
• Functional API
• Relational API
• Graph API
• Machine Learning
• …
• Iterative Dataflow Engine
![Page 3: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/3.jpg)
Apache Flink Stack

[Stack diagram: libraries (Gelly, Table, Flink ML, SAMOA, Dataflow, Hadoop M/R, Python) on top of the DataSet (Java/Scala) and DataStream (Java/Scala) APIs; Batch Optimizer and Stream Builder; Distributed Runtime; execution targets: Local, Remote, Yarn, Tez, Embedded]

*current Flink master + few PRs
![Page 4: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/4.jpg)
Example Use Case: Log Analysis
![Page 5: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/5.jpg)
What Seems to be the Problem?
• Collect clicks from a webserver log
• Find interesting URLs
• Combine with user data

[Diagram: web server log and user database feed Extract Clicks, Combine, and Massage steps that produce the interesting user data]
![Page 6: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/6.jpg)
The Execution Environment
• Entry point for all Flink programs
• Creates DataSets from data sources
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
![Page 7: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/7.jpg)
Getting at Those Clicks
DataSet&lt;String> log = env.readTextFile("hdfs:///log");

DataSet&lt;Tuple2&lt;String, Integer>> clicks = log.flatMap(
    (String line, Collector&lt;Tuple2&lt;String, Integer>> out) -> {
        String[] parts = line.split("*magic regex*");
        if (isClick(parts)) {
            out.collect(new Tuple2&lt;>(parts[1], Integer.parseInt(parts[2])));
        }
    });

post /foo/bar… 313
get /data/pic.jpg 128
post /bar/baz… 128
post /hello/there… 42
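The actual parsing regex is elided on the slide ("*magic regex*"), so here is a rough plain-Java sketch of the flatMap body only, under an assumed, simplified `click <url> <userId>` log format (not the real format and not Flink API):

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the flatMap body above (not Flink API).
// Assumed, simplified log format: "click <url> <userId>"; the real
// regex is elided on the slide.
public class ClickExtractor {

    // Emits a (url, userId) pair for click lines and nothing otherwise,
    // mirroring the Collector-based flatMap on the slide.
    public static void extractClick(String line, List<String[]> out) {
        String[] parts = line.split("\\s+");
        if (parts.length == 3 && parts[0].equals("click")) {
            out.add(new String[] { parts[1], parts[2] });
        }
    }

    public static void main(String[] args) {
        List<String[]> clicks = new ArrayList<>();
        extractClick("click /foo/bar 313", clicks);
        extractClick("get /data/pic.jpg 128", clicks); // filtered out
        System.out.println(clicks.size() + " " + clicks.get(0)[0]); // 1 /foo/bar
    }
}
```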
![Page 8: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/8.jpg)
The Table Environment
• Environment for dealing with Tables
• Converts between DataSet and Table
TableEnvironment tableEnv = new TableEnvironment();
![Page 9: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/9.jpg)
Counting those Clicks
Table clicksTable = tableEnv.toTable(clicks, "url, userId");

Table urlClickCounts = clicksTable
    .groupBy("url, userId")
    .select("url, userId, url.count as count");
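What this Table API query computes can be sketched in plain Java, with a HashMap standing in for the grouped count (a conceptual sketch, not Flink code):

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the Table API query above: count clicks per
// (url, userId) pair. The composite key is a simple string join here.
public class ClickCounter {

    public static Map<String, Integer> countClicks(String[][] clicks) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] click : clicks) {
            String key = click[0] + "," + click[1]; // (url, userId)
            counts.merge(key, 1, Integer::sum);     // groupBy + count
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] clicks = {
            { "/foo", "42" }, { "/foo", "42" }, { "/bar", "7" }
        };
        // Two clicks on /foo by user 42, one on /bar by user 7.
        System.out.println(countClicks(clicks));
    }
}
```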
![Page 10: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/10.jpg)
Getting the User Information
Table userInfo = tableEnv.toTable(…, "name, id, …");

Table resultTable = urlClickCounts.join(userInfo)
    .where("userId = id && count > 10")
    .select("url, count, name, …");
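The join above can be sketched as a plain-Java hash join that matches click counts to user records on userId = id and keeps only counts above 10 (field names follow the slide; the data and helper are made up for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java hash-join sketch of the Table API join above (not Flink).
public class ClickUserJoin {

    // users: id -> name; clickCounts rows: { url, userId, count }
    public static List<String> join(Map<String, String> users, String[][] clickCounts) {
        List<String> result = new ArrayList<>();
        for (String[] row : clickCounts) {
            String url = row[0], userId = row[1];
            int count = Integer.parseInt(row[2]);
            String name = users.get(userId);   // probe the hash table
            if (name != null && count > 10) {  // userId = id && count > 10
                result.add(url + " " + count + " " + name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> users = new HashMap<>();
        users.put("42", "alice");
        String[][] counts = { { "/foo", "42", "17" }, { "/bar", "42", "3" } };
        System.out.println(join(users, counts)); // [/foo 17 alice]
    }
}
```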
![Page 11: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/11.jpg)
The Final Step
class Result {
    public String url;
    public int count;
    public String name;
    …
}

DataSet&lt;Result> set = tableEnv.toSet(resultTable, Result.class);

DataSet&lt;Result> result = set
    .groupBy("url")
    .reduceGroup(new ComplexOperation());

result.writeAsText("hdfs:///result");
env.execute();
![Page 12: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/12.jpg)
API in a Nutshell
• Element-wise: map, flatMap, filter
• Group-wise: groupBy, reduce, reduceGroup, combineGroup, mapPartition, aggregate, distinct
• Binary: join, coGroup, union, cross
• Iterations: iterate, iterateDelta
• Physical re-organization: rebalance, partitionByHash, sortPartition
• Streaming: window, windowMap, coMap, ...
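As a plain-Java analogue of the element-wise and group-wise categories (java.util.stream here, deliberately not the Flink API), mapping, filtering, and a grouped count compose like this:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// java.util.stream analogue of the operator categories above:
// element-wise map/filter followed by a group-wise count.
public class OperatorSketch {

    public static Map<String, Long> urlCounts(List<String> lines) {
        return lines.stream()
            .map(String::trim)                       // element-wise: map
            .filter(l -> l.startsWith("click"))      // element-wise: filter
            .map(l -> l.split("\\s+")[1])            // project the URL
            .collect(Collectors.groupingBy(          // group-wise: groupBy + count
                url -> url, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("click /a 1", "get /b 2", "click /a 3");
        System.out.println(urlCounts(lines)); // {/a=2}
    }
}
```

The big difference, of course, is that Flink runs these operators as a distributed, optimized dataflow rather than on a single in-memory collection.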
![Page 13: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/13.jpg)
What happens under the hood?
![Page 14: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/14.jpg)
From Program to Dataflow
Flink Program → Dataflow Plan → Optimized Plan
![Page 15: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/15.jpg)
Distributed Execution
• Master: orchestration, recovery
• Worker: memory management, serialization
• Streaming network between workers
![Page 16: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/16.jpg)
Advanced Analysis: Website Recommendation
![Page 17: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/17.jpg)
Going Further
• Log analysis result: which user visited which website how often
• Which other websites might they like?
• Recommendation by collaborative filtering
![Page 18: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/18.jpg)
Collaborative Filtering
• Recommend items based on users with similar preferences
• Latent factor models capture underlying characteristics of items and preferences of users
• Predicted preference: [formula shown on slide]
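The prediction formula itself is only an image on the slide; in the standard latent factor formulation (which ALS-style factorization uses, though the slide's exact notation is not recoverable here), the predicted preference is the dot product of the user's and the item's factor vectors:

```java
// Standard latent-factor prediction: the predicted preference of user u
// for item i is the dot product of their factor vectors, x_u . y_i.
public class LatentFactorPrediction {

    public static double predict(double[] userFactors, double[] itemFactors) {
        double score = 0.0;
        for (int k = 0; k < userFactors.length; k++) {
            score += userFactors[k] * itemFactors[k];
        }
        return score;
    }

    public static void main(String[] args) {
        double[] user = { 0.5, 1.0 };  // made-up factor vectors
        double[] item = { 2.0, 3.0 };
        System.out.println(predict(user, item)); // 4.0
    }
}
```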
![Page 19: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/19.jpg)
Matrix Factorization
![Page 20: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/20.jpg)
Alternating Least Squares
• Iterative approximation:
  1. Fix X and optimize Y
  2. Fix Y and optimize X
• Communication and computation intensive
[Diagram: R approximated as the product X × Y]
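To make the alternation concrete, here is a toy rank-1, unregularized ALS sweep in plain Java. This is a deliberate simplification: Flink's actual ALS is distributed, regularized, and works with many latent factors, but the "fix one side, solve the other in closed form" structure is the same:

```java
// Toy rank-1 ALS sweep (plain Java, no regularization). With the item
// factors y fixed, each user factor x[u] has a closed-form least-squares
// solution, and vice versa: the "fix X, optimize Y / fix Y, optimize X"
// alternation from the slide, shrunk to one latent factor.
public class ToyALS {

    // Squared reconstruction error of the rank-1 model x * y^T against R.
    public static double error(double[][] r, double[] x, double[] y) {
        double err = 0.0;
        for (int u = 0; u < x.length; u++) {
            for (int i = 0; i < y.length; i++) {
                double d = r[u][i] - x[u] * y[i];
                err += d * d;
            }
        }
        return err;
    }

    // One alternating sweep: solve for x given y, then for y given x.
    public static void sweep(double[][] r, double[] x, double[] y) {
        for (int u = 0; u < x.length; u++) {
            double num = 0.0, den = 0.0;
            for (int i = 0; i < y.length; i++) {
                num += y[i] * r[u][i];
                den += y[i] * y[i];
            }
            x[u] = num / den;
        }
        for (int i = 0; i < y.length; i++) {
            double num = 0.0, den = 0.0;
            for (int u = 0; u < x.length; u++) {
                num += x[u] * r[u][i];
                den += x[u] * x[u];
            }
            y[i] = num / den;
        }
    }

    public static void main(String[] args) {
        double[][] r = { { 5, 3 }, { 4, 1 } }; // tiny ratings matrix
        double[] x = { 1, 1 };                 // user factors
        double[] y = { 1, 1 };                 // item factors
        double before = error(r, x, y);
        sweep(r, x, y);
        // Each half-step solves an exact least-squares problem, so the
        // error can only go down or stay the same.
        System.out.println(error(r, x, y) <= before); // true
    }
}
```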
![Page 21: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/21.jpg)
Matrix Factorization Pipeline
val featureExtractor = HashingFT()
val factorizer = ALS()

val pipeline = featureExtractor.chain(factorizer)

val clickstreamDS = env.readCsvFile[(String, String, Int)](clickStreamData)

val parameters = ParameterMap()
  .add(HashingFT.NumFeatures, 1000000)
  .add(ALS.Iterations, 10)
  .add(ALS.NumFactors, 50)
  .add(ALS.Lambda, 1.5)

val factorization = pipeline.fit(clickstreamDS, parameters)

[Pipeline diagram: Clickstream Data → Hashing Feature Extractor → ALS Matrix Factorization]
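The idea behind a hashing feature extractor like HashingFT is the hashing trick: a feature string is mapped straight to a bounded index by hashing modulo the feature-space size, so no dictionary of all URLs has to be built or shipped. A minimal plain-Java sketch of that mapping (not the FlinkML implementation):

```java
// Sketch of the hashing trick behind a hashing feature extractor:
// hash the feature string and reduce it modulo the feature-space size.
public class HashingSketch {

    public static int featureIndex(String feature, int numFeatures) {
        // Math.floorMod keeps the index non-negative even when
        // hashCode() is negative.
        return Math.floorMod(feature.hashCode(), numFeatures);
    }

    public static void main(String[] args) {
        int n = 1_000_000; // matches HashingFT.NumFeatures on the slide
        int idx = featureIndex("/foo/bar", n);
        System.out.println(idx >= 0 && idx < n); // always in range
    }
}
```

The trade-off is that distinct features can collide in the same bucket, which is why NumFeatures is set generously (1,000,000 on the slide).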
![Page 22: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/22.jpg)
Does it Scale?
• 40-node GCE cluster, highmem-8
• 10 ALS iterations with 50 latent factors
• Based on Spark MLlib’s implementation
Scale of Netflix or Spotify
![Page 23: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/23.jpg)
What Else Can You Do?
• Classification using SVMs: conversion goal prediction
• Clustering: visitor segmentation
• Multiple linear regression: visitor prediction
![Page 24: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/24.jpg)
Closing
![Page 25: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/25.jpg)
What Have You Seen?
• Flink is a general-purpose analytics system
• Highly expressive Table API
• Advanced analysis with Flink’s machine learning library
• Jobs are executed on a powerful distributed dataflow engine
![Page 26: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/26.jpg)
Flink Roadmap for 2015
• Additions to Machine Learning library
• Streaming Machine Learning
• Support for interactive programs
• Optimization for Table API queries
• SQL on top of Table API
![Page 27: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/27.jpg)
![Page 28: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/28.jpg)
flink.apache.org
@ApacheFlink
![Page 29: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/29.jpg)
Backup Slides
![Page 30: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/30.jpg)
WordCount in DataSet API

case class Word(word: String, frequency: Int)

val env = ExecutionEnvironment.getExecutionEnvironment()

val lines = env.readTextFile(...)

lines
  .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()

env.execute()
Java and Scala APIs offer the same functionality.
![Page 31: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/31.jpg)
Log Analysis Code
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
TableEnvironment tableEnv = new TableEnvironment();

DataSet&lt;String> log = env.readTextFile("hdfs:///log");

DataSet&lt;Tuple2&lt;String, Integer>> clicks = log.flatMap(
    new FlatMapFunction&lt;String, Tuple2&lt;String, Integer>>() {
        public void flatMap(String in, Collector&lt;Tuple2&lt;String, Integer>> out) {
            String[] parts = in.split("*magic regex*");
            if (parts[0].equals("click")) {
                out.collect(new Tuple2&lt;>(parts[1], Integer.parseInt(parts[4])));
            }
        }
    });

Table clicksTable = tableEnv.toTable(clicks, "url, userId");

Table urlClickCounts = clicksTable
    .groupBy("url, userId")
    .select("url, userId, url.count as count");

Table userInfo = tableEnv.toTable(…, "name, id, …");

Table resultTable = urlClickCounts.join(userInfo)
    .where("userId = id && count > 10")
    .select("url, count, name, …");

DataSet&lt;Result> result = tableEnv.toSet(resultTable, Result.class);

result.writeAsText("hdfs:///result");
env.execute();
![Page 32: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/32.jpg)
Log Analysis Dataflow Graph
[Dataflow graph: Log → Map → Agg, joined with Users → Join → Group → Result; the optimized plan inserts combine, partition, sort, and merge steps between these operators]
![Page 33: Data Analysis with Apache Flink (Hadoop Summit, 2015)](https://reader033.fdocuments.in/reader033/viewer/2022052606/58f9a9a9760da3da068b70f8/html5/thumbnails/33.jpg)
Pipelined Execution
• Only 1 stage (depending on join strategy)
• Data transfer in-memory, spilling to disk if needed
• Note: intermediate DataSets are not necessarily “created”!