Milan – July 13 2016
Data Intensive Applications with Apache Flink
Simone Robutti
Machine Learning Engineer at Radicalbit
@SimoneRobutti
Agenda
1. Brief Introduction to Apache Flink
○ Why
○ What
○ How
2. Machine Learning on Flink
○ Present landscape
○ Future of the Ecosystem
3. Closing notes on Radicalbit (shameless plug ahead)
100% Buzzword-free guaranteed
Big Data
Machine Intelligence
Web-scale
400x
It’s like the human brain
Exactly-once
Why Flink (and not Spark/Storm/Samza...)
Because it’s
production-ready
streaming-first
low-latency
fault-tolerant
high-throughput
processing engine
Flink: what is it?
From Flink’s Documentation
Connectors and integrations
Flink’s Runtime
From Flink’s Documentation
Flink’s DataFlow
From Flink’s Documentation
Written by the user through DataSet/DataStream API
Compiled and optimized in the client
Flink’s DataFlow
From Flink’s Documentation
The compiled job is translated to distributed tasks by
the master and executed by workers
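As a rough illustration of the two DataFlow slides, here is a minimal sketch in plain Python. It is not Flink's API: `tokenize`, `count_words`, `partition` and `run_job` are hypothetical names, and a thread pool stands in for the workers. The user writes the operator, a "master" step splits the input into partitions, and each partition is processed by one parallel subtask before the partial results are merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def tokenize(line):
    """User-defined operator: split a line into lowercase words."""
    return line.lower().split()


def count_words(lines):
    """One subtask: apply the operator to a single partition."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


def partition(records, parallelism):
    """Stand-in for the master: assign records to subtasks round-robin."""
    return [records[i::parallelism] for i in range(parallelism)]


def run_job(records, parallelism=2):
    """Run one subtask per partition on a worker pool, then merge."""
    parts = partition(records, parallelism)
    with ThreadPoolExecutor(max_workers=parallelism) as workers:
        partials = list(workers.map(count_words, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


lines = ["to be or not to be", "to stream or to batch"]
result = run_job(lines, parallelism=2)
```

The result does not depend on the chosen parallelism, which is the property that lets an engine pick the degree of parallelism independently of the user's program.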
Machine Learning on Flink
Ready and awesome for parallel ML
Work in progress for distributed ML
ML on Flink
Flink for Model Evaluation Pipelines
[Diagram: Source, Data Preparation, Evaluation, Postprocessing, Sink, plus a second Source]
Composable, modular Flink Operator
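The stages named on this slide can be sketched as composable functions. This is plain Python with generators, not Flink operators; the record layout, the scaling step, the threshold and the identity "model" are all assumptions made for the example.

```python
def source(records):
    """Source: emit raw records one at a time."""
    yield from records


def prepare(stream):
    """Data preparation: drop incomplete records and rescale the feature."""
    for rec in stream:
        if rec.get("x") is not None:
            yield {"id": rec["id"], "x": rec["x"] / 100.0}


def evaluate(stream, model):
    """Evaluation: attach the model's score to each record."""
    for rec in stream:
        yield {**rec, "score": model(rec["x"])}


def postprocess(stream, threshold=0.5):
    """Postprocessing: turn the score into a decision."""
    for rec in stream:
        decision = "accept" if rec["score"] >= threshold else "reject"
        yield (rec["id"], decision)


def run_pipeline(records, model):
    """Sink: collect the decisions into a list."""
    return list(postprocess(evaluate(prepare(source(records)), model)))


raw = [{"id": 1, "x": 80}, {"id": 2, "x": None}, {"id": 3, "x": 20}]
# The identity function is a hypothetical stand-in for a trained model.
decisions = run_pipeline(raw, model=lambda x: x)
```

Because every stage only consumes and produces a stream of records, a stage can be swapped or tested in isolation, which is the point of keeping operators modular.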
Evaluation with Flink-JPMML
[Diagram: Source Operator, Flink-JPMML Operator, Sink Operator, plus a second Source Operator for model.pmml]
Small library that implements basic model evaluation
operations on top of JPMML (GitLab)
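To give an idea of what evaluating a PMML model involves, here is a minimal Python sketch that reads a linear regression from a PMML-like document and scores records with it. It does not use JPMML or Flink-JPMML and says nothing about their APIs. The document below is deliberately simplified (no namespace, `DataDictionary` or `MiningSchema`, so it is not a complete PMML file), and the field names and coefficients are made up.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical model document (not a complete PMML file).
PMML_DOC = """
<PMML version="4.2">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="age" coefficient="0.5"/>
      <NumericPredictor name="income" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_linear_model(pmml_text):
    """Read the intercept and coefficients of the first RegressionTable."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find(".//RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def predict(model, record):
    """Score one record: intercept + sum(coefficient * field value)."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_linear_model(PMML_DOC)
records = [{"age": 2.0, "income": 1.0}, {"age": 0.0, "income": 0.0}]
scores = [predict(model, r) for r in records]
```

The model is loaded once and then applied record by record, so the same `predict` call could sit inside the evaluation stage of a streaming pipeline.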
Data Preparation
“I have seen people insisting on using Hadoop for
datasets that could easily fit on a flash drive and could
easily be processed on a laptop.”
- Yann LeCun
ML on Flink
FlinkML
What: Out-of-the-box workhorse algorithms (ALS,
SVM, LinReg, LogReg …)
Status: early phase, slow development
FlinkML
Pro: available out of the box, written with Flink API
Cons: reinvents the wheel, only a few algorithms,
no model persistence
Samsara
What: Linear algebra framework
Status: mature
Samsara
Pro: generic algorithms with platform-specific
bindings, skilled community
Cons: covers only a few use cases
SAMOA
What: Online learning algorithm framework (VHT,
AMR, …)
Status: early phase, complicated relationship with
the industry
SAMOA
Pro: many powerful generic online learning
algorithms, backed by academics (MOA, Weka)
Cons: not production ready, academic focus
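For readers new to online learning, the core idea is that the model is updated one example at a time as data arrives, instead of being fit on a stored dataset. The sketch below uses plain stochastic gradient descent on a linear model as a stand-in. It is not one of SAMOA's algorithms (VHT, AMR), and the synthetic stream, learning rate and function names are assumptions.

```python
def sgd_step(weights, bias, features, target, learning_rate=0.1):
    """One online update for squared loss on a single example."""
    prediction = bias + sum(w * x for w, x in zip(weights, features))
    error = prediction - target
    new_weights = [w - learning_rate * error * x
                   for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


def train_online(stream, n_features, learning_rate=0.1):
    """Consume (features, target) pairs one by one, never storing them."""
    weights, bias = [0.0] * n_features, 0.0
    for features, target in stream:
        weights, bias = sgd_step(weights, bias, features, target, learning_rate)
    return weights, bias


def example_stream(n):
    """Hypothetical noise-free stream following y = 2*x + 1."""
    for i in range(n):
        x = (i % 4) / 4.0
        yield [x], 2.0 * x + 1.0


weights, bias = train_online(example_stream(5000), n_features=1)
```

On this noise-free stream the learned parameters approach the slope 2 and intercept 1 used to generate the data. A real online learner also has to deal with noise, concept drift and distribution across workers.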
ML on Flink: the future of the ecosystem
Apache Beam
Programming model for data processing pipelines
● Streaming first, batch as a bounded stream
● Layered API: What, Where, When, How
● Platform agnostic: same program, different
runners
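The "batch as a bounded stream" idea can be sketched in a few lines of plain Python. This is not the Beam SDK: `fixed_windows` is a hypothetical helper that only covers the "What" (a sum per key) and the "Where" (fixed event-time windows). It emits results only at the end of the input, so the "When" and "How" layers (triggers, refinements) are left out.

```python
from collections import defaultdict


def fixed_windows(events, size):
    """Sum values per key and per fixed event-time window.

    `events` is any iterable of (timestamp, key, value) tuples. A finite
    list (batch) and a generator (bounded stream) go through the same code.
    """
    sums = defaultdict(int)
    for timestamp, key, value in events:
        window_start = (timestamp // size) * size  # Where: which window
        sums[(window_start, key)] += value         # What: sum per key
    return dict(sums)


def event_stream():
    """Hypothetical bounded stream of (timestamp, key, value) events."""
    yield (1, "a", 1)
    yield (4, "a", 2)
    yield (11, "b", 5)
    yield (12, "a", 3)


batch_result = fixed_windows(list(event_stream()), size=10)
stream_result = fixed_windows(event_stream(), size=10)
```

Both calls produce the same windows because the logic depends only on event timestamps, not on how or when the data is delivered.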
Apache Beam - Runners
● Flink
● Spark (Partial)
● Google Cloud Dataflow
● Plain Java
● Gearpump (WIP)
● Apex (WIP)
BeamML: a runner-agnostic ML library
FlinkML Roadmap
● More algorithms!
● Evaluation framework
● Persistence/export
● Online Learning Framework
Proteus
Online Learning Platform - based on Flink
Source: Proteus’ website
The role of Radicalbit
Contributions
● Cassandra Connector
● Scala API extensions
● FlinkML (Linear Algebra Framework, MinHash)
● Akka Connector
Our vision
Flink can become the ideal choice to build real-time decision-heavy applications with high data-throughput
To achieve this:
● Ambitious applications (aim for real-time services)
● Reliable distributed online learning (Proteus?)
● A Pipelining Framework (experiment fast, increase testability and
modularity)
Q&A
THANKS!
Simone Robutti
Mail: [email protected]
Medium: @simone.robutti
Twitter: @SimoneRobutti, @weareradicalbit