Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed...

Page 1: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

Collaborative Predictive Intelligence

via DDF-on-Flink using Distributed DataFrame

Christopher Nguyen, PhD—CEO & Co-Founder, Arimo

Rohit Rai—CEO, Tuplejump

Bringing BigApps to Flink

@arimoinc @pentagoniac http://ddf.io

Page 3: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

What Are Adatao Big Apps?

§ Predictive: Predictive Analytics for Business Users

§ Collaborative: Real-time Collaboration with Data Scientists

Page 4: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

Demo

Page 5: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

The EXPLOSION of Data & Compute Engines

The CIO Challenge

[Diagram: Scala, Java, Python, and R clients; Spark, Flink, Presto, and Ignite; HDFS, S3, Redshift, BigQ, Cassandra, and RDBMS]

Page 6: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

The Solution: DDF Data Integration

[Diagram: Scala, Java, Python, and R on top of DDF; a DDF layer for each of Spark, Flink, Ignite, Presto, HDFS, S3, Redshift, BigQ, Cassandra, and RDBMS; labels "Data in Memory", "Data at Rest", "DWs", "DBs", and "Enterprise Data Bus"]

Page 7: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

Benefits of DDF Data Integration

§ FOR DATA ENGINEERS

§ Unified API across data sources and engines

§ HDFS, S3, Cassandra, Redshift, BigQuery, RDBMS, Salesforce, Spark, Flink, Ignite …

§ FOR DATA SCIENTISTS

§ Uniform high-level DataFrame abstractions: ETL, ML, Streaming

Page 8: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

Custom Apps

Adatao AppBuilder

Adatao PredictiveEngine

Arimo Predictive Intelligence Platform

Big Compute

Big Data

Big Apps

Distributed DataFrame (DDF), Open Sourced

Data Scientist, Business User, Data Engineer

Page 9: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

Why Flink?

§ Emerging engine with unique strengths (e.g., streaming)

§ Driven by Customer & Partner conversations

Page 10: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

Demo

Page 11: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

DDF: “Under the Hood”

[Diagram: Java, Python, and R clients, each on DDF; engines Spark, Flink, Redshift; Unified DDF APIs (ETL Interfaces, ML Interfaces, Streaming Interfaces); Spark APIs (RDD, DataFrame, DStream); Flink APIs (DataSet, Table, DataStream); …]

Page 12: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

DDF API in a Nutshell

// To start working with an engine
DDFManager manager = DDFManager.get("flink"); // or "spark"

// Then, data can be loaded into a DDF as follows:
DDF table = manager.sql2ddf("select * from airline");

// ETL, transform
table = table.transform("dist = round(distance/2, 2)");

// Run machine learning using MLlib, then run prediction
KMeansModel kmeansModel = (KMeansModel) table.ML.train("kmeans", 5, 5).getRawModel();
Int prediction = table.ML.applyModel(kmeansModel, false, true);

Page 13: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

Demo

Page 14: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

Lessons Learned

§ It was easy for us to implement DDF on Flink

§ Flink API close to functional collection API

Page 15: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

Lessons Learned

§ With DDF, it’s easy to port applications from one engine to another
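
The portability point can be illustrated with a minimal, self-contained sketch. It does not use the real DDF library: `Engine`, `LoopEngine`, `StreamEngine`, and the registry are hypothetical names invented for illustration, and only the string-keyed lookup mirrors the `DDFManager.get("flink")` call from the "DDF API in a Nutshell" slide. The application method depends on the interface alone, so switching back ends means changing one string.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

public class EnginePortabilitySketch {

    /** Hypothetical engine-neutral interface; NOT the real DDF API. */
    public interface Engine {
        double mean(List<Double> column);
    }

    /** Toy back end that computes the mean with a plain loop. */
    static class LoopEngine implements Engine {
        public double mean(List<Double> column) {
            double sum = 0.0;
            for (double v : column) {
                sum += v;
            }
            return column.isEmpty() ? 0.0 : sum / column.size();
        }
    }

    /** Toy back end that computes the same mean with Java streams. */
    static class StreamEngine implements Engine {
        public double mean(List<Double> column) {
            return column.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        }
    }

    // Hypothetical registry: engine name -> factory for that back end.
    private static final Map<String, Supplier<Engine>> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("loop", LoopEngine::new);
        REGISTRY.put("stream", StreamEngine::new);
    }

    /** Looks up a back end by name, loosely analogous to picking an engine by string. */
    public static Engine get(String engineName) {
        Supplier<Engine> factory = REGISTRY.get(engineName);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown engine: " + engineName);
        }
        return factory.get();
    }

    /** Application code: depends only on the Engine interface, not on a back end. */
    public static double halfMeanDistance(String engineName, List<Double> distances) {
        return get(engineName).mean(distances) / 2.0;
    }

    public static void main(String[] args) {
        List<Double> distances = Arrays.asList(100.0, 200.0, 300.0);
        System.out.println(halfMeanDistance("loop", distances));
        System.out.println(halfMeanDistance("stream", distances));
    }
}
```

Both calls in `main` run the same application logic; only the engine name differs.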

Page 16: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

Lessons Learned

§ There’s now an opportunity to use Flink for interactive applications

§ Backtracking scheduler, session management, better graph analysis

Page 17: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

Lessons Learned

§ Null/missing value handling in Flink

§ Null value support needed in RowSerializer
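
One common way to give a row serializer null support is to write a per-field null marker ahead of each value. The sketch below shows that idea using only `java.io` streams. It is not Flink's `RowSerializer`; `NullAwareRowCodec` and its methods are hypothetical, and the sketch assumes every field is a nullable `Double`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class NullAwareRowCodec {

    /** Writes the field count, then for each field a null flag followed by the value if present. */
    public static byte[] serialize(Double[] row) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(row.length);
            for (Double field : row) {
                out.writeBoolean(field == null); // null marker
                if (field != null) {
                    out.writeDouble(field);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads rows written by serialize, restoring nulls from the per-field markers. */
    public static Double[] deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int arity = in.readInt();
            Double[] row = new Double[arity];
            for (int i = 0; i < arity; i++) {
                if (in.readBoolean()) {
                    row[i] = null; // marker says the field was null; no value follows
                } else {
                    row[i] = in.readDouble();
                }
            }
            return row;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Double[] row = {1.5, null, 3.0};
        Double[] restored = deserialize(serialize(row));
        System.out.println(Arrays.toString(restored));
    }
}
```

The marker costs one byte per field in this sketch; a bitmask over all fields would be more compact but follows the same principle.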

Page 18: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

Lessons Learned

§ Map vs MapPartitions vs Accumulators

§ Map for aggregations can cause a lot of object creation overhead

§ Accumulators may fail for huge datasets
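
The object-creation concern can be sketched without Flink at all. In the example below, plain Java lists stand in for partitions, and the class and method names are hypothetical. The per-record variant allocates one intermediate `{sum, count}` pair for every element before reducing, while the per-partition variant keeps a single running accumulator per partition and emits one partial result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionAggregationSketch {

    /** Map-style: wrap every record in a new {sum, count} pair, then reduce the pairs. */
    public static long[] perRecord(List<List<Integer>> partitions) {
        List<long[]> mapped = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            for (int value : partition) {
                mapped.add(new long[] {value, 1}); // one allocation per record
            }
        }
        long[] total = {0, 0};
        for (long[] pair : mapped) {
            total[0] += pair[0];
            total[1] += pair[1];
        }
        return total;
    }

    /** MapPartitions-style: one accumulator per partition, one partial result emitted per partition. */
    public static long[] perPartition(List<List<Integer>> partitions) {
        List<long[]> partials = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            long[] acc = {0, 0}; // one allocation per partition
            for (int value : partition) {
                acc[0] += value;
                acc[1]++;
            }
            partials.add(acc);
        }
        long[] total = {0, 0};
        for (long[] partial : partials) {
            total[0] += partial[0];
            total[1] += partial[1];
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2, 3),
                Arrays.asList(4, 5));
        System.out.println(Arrays.toString(perRecord(partitions)));
        System.out.println(Arrays.toString(perPartition(partitions)));
    }
}
```

Both variants return the same `{sum, count}` result; they differ only in how many intermediate objects are created along the way.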

Page 19: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

Lessons Learned

§ Use caution when copying arrays in the Table API

Page 20: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

DDF: Where is it heading?

§ More Engines: DBs & DWs: BigQuery, Cassandra, Teradata, Presto, Ignite

§ Enterprise Databus to seamlessly move data across sources

§ Richer APIs

Page 21: Bringing BigApps to Flink | Collaborative Predictive Intelligence via DDF-on-Flink using Distributed DataFrame

Get Started with DDF

§ Increase your productivity & build engine-agnostic apps

• Build your analytics apps on existing modules

• Flink, Spark, JDBC

§ Expand possibilities. Contribute to DDF

• Enrich existing plugins: Data APIs, ML APIs...

• Add new DDF plugins:

• BigQuery, Cassandra

• Marketo

• Ignite, Presto

§ Spread the word!

www.ddf.io/gettingstarted
