Danny Bickson - Python based predictive analytics with GraphLab Create

Post on 16-Apr-2017

Dato Confidential

GraphLab Create Training
UvA School of Business

Danny Bickson, Co-Founder and VP EMEA
bickson@dato.com

Dato: We Intelligent Applications

Business must be intelligent

Machine learning applications

• Recommenders
• Fraud detection
• Ad targeting
• Financial models
• Personalized medicine
• Churn prediction
• Smart UX (video & text)
• Personal assistants
• IoT
• Social networks
• Log analysis

Last decade: Data management
Last 5 years: Traditional analytics
Now: Intelligent apps

Example Intelligent Applications
- images
- text
- graphs
- tabular data

Creating a model

[Diagram labels: data, exploration, modeling, pipeline]

Creating a model pipeline

Ingest → Transform → Model → Deploy

Unstructured Data
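
The four stages named on this slide can be sketched as plain Python functions chained together. This is only an illustration of the stage boundaries and is not GraphLab Create code: the function names (`ingest`, `transform`, `train_model`, `deploy`), the toy word-count features, and the word-overlap scoring rule are all hypothetical.

```python
from collections import Counter

def ingest(raw_documents):
    """Ingest: turn unstructured text into records (hypothetical stage)."""
    return [{"id": i, "text": text} for i, text in enumerate(raw_documents)]

def transform(records):
    """Transform: derive bag-of-words features for each record."""
    for record in records:
        record["words"] = Counter(record["text"].lower().split())
    return records

def train_model(records, labels):
    """Model: count how often each word appears under each label."""
    model = {}
    for record, label in zip(records, labels):
        model.setdefault(label, Counter()).update(record["words"])
    return model

def deploy(model):
    """Deploy: wrap the model in a callable an application could query."""
    def predict(text):
        words = Counter(text.lower().split())
        scores = {
            label: sum(counts[word] * n for word, n in words.items())
            for label, counts in model.items()
        }
        return max(scores, key=scores.get)
    return predict

raw = ["great product fast delivery", "terrible support slow refund"]
records = transform(ingest(raw))
predict = deploy(train_model(records, ["positive", "negative"]))
prediction = predict("fast delivery and great value")
```

Keeping each stage a separate function means the same `transform` step runs at training time and at prediction time, which is the main point of treating the model as a pipeline.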

The Dato Machine Learning Platform

• Dato Predictive Services: Predictive Engine, REST Client, Model Mgmt
• GraphLab Create: Machine Learning Toolkits, Canvas, SDK, SGraph, SFrame, Engine
• Engine – SFrame on GitHub
• Free for academic usage

GraphLab Create Benefits

Why use GraphLab Create?

• Efficient storage

GraphLab SFrame compressed column store:
• x20 smaller than pandas
• x2 smaller than Gzip

Size on disk (the lower the better!)
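
A rough way to see why a compressed column store saves space is to compress the same table once row by row and once column by column. The sketch below uses `zlib` from the standard library as a stand-in; it does not reproduce SFrame's actual on-disk format, and the table contents are made up. Values within one column tend to resemble each other, so per-column compression usually does well, but the exact ratios depend on the data.

```python
import zlib

# Hypothetical table: 10,000 rows with three columns.
n = 10_000
columns = {
    "user_id": [str(i) for i in range(n)],
    "country": [("NL", "US", "IL")[i % 3] for i in range(n)],
    "amount": [str((i * 7) % 100) for i in range(n)],
}

def encode_rows(cols):
    """Row layout: one CSV line per record."""
    names = list(cols)
    lines = (",".join(record) for record in zip(*(cols[name] for name in names)))
    return "\n".join(lines).encode("utf-8")

def encode_columns(cols):
    """Column layout: each column is compressed on its own."""
    return {
        name: zlib.compress("\n".join(values).encode("utf-8"))
        for name, values in cols.items()
    }

def decode_columns(blobs):
    """Inverse of encode_columns."""
    return {
        name: zlib.decompress(blob).decode("utf-8").split("\n")
        for name, blob in blobs.items()
    }

raw_rows = encode_rows(columns)
column_blobs = encode_columns(columns)

sizes = {
    "raw": len(raw_rows),
    "row_compressed": len(zlib.compress(raw_rows)),
    "column_compressed": sum(len(blob) for blob in column_blobs.values()),
}
print(sizes)
```

A column layout also lets a query read only the columns it needs, which matters as much as the compression ratio when the data lives on disk.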

No need for huge RAM!

[Chart: Effective Delay vs RAM; annotations: x2, x5]

Data size limited by disk size

My data is larger than my machine RAM
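
The idea of working from disk instead of RAM can be sketched with a streaming aggregate: read the file in bounded chunks and keep only running totals in memory. This is a generic standard-library illustration, not SFrame's implementation; `streaming_mean`, the `chunk_rows` parameter, and the generated CSV file are hypothetical.

```python
import csv
import os
import tempfile

def streaming_mean(path, column, chunk_rows=1000):
    """Compute a column mean while holding at most chunk_rows values in memory."""
    total, count, chunk = 0.0, 0, []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            chunk.append(float(row[column]))
            if len(chunk) == chunk_rows:
                total += sum(chunk)
                count += len(chunk)
                chunk = []
    # Fold in the final, partially filled chunk.
    total += sum(chunk)
    count += len(chunk)
    return total / count if count else float("nan")

# Write a small hypothetical dataset to disk, then aggregate it in chunks.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["value"])
    for i in range(10_000):
        writer.writerow([i])
    data_path = tmp.name

mean_value = streaming_mean(data_path, "value", chunk_rows=256)
os.remove(data_path)
```

Memory use here depends on `chunk_rows`, not on the file size, so the dataset is bounded by disk capacity, matching the claim on the slide.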

Comparison to sklearn

Try it here: http://blog.dato.com/how-fast-are-out-of-core-algorithms

Summary of differences vs. sklearn

• Better multicore support

• Out of core implementation (working from disk)

• Automatic feature expansion

• Automatic parameter selection

• Support for model serving

• Additional algorithms
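
As an illustration of what automatic feature expansion can mean, the sketch below flattens rows with mixed column types into numeric features. The expansion rules and the helper name `expand_features` are assumptions made for this example; they are not taken from the slides and may differ from what GraphLab Create does internally.

```python
def expand_features(rows):
    """Expand mixed-type rows into flat numeric feature dicts.

    Assumed rules (illustrative only):
      - numbers pass through unchanged
      - strings become one-hot indicators named "column=value"
      - dicts contribute one feature per key, named "column.key"
      - lists contribute one feature per position, named "column[i]"
    """
    expanded = []
    for row in rows:
        features = {}
        for name, value in row.items():
            if isinstance(value, (int, float)):
                features[name] = float(value)
            elif isinstance(value, str):
                features[f"{name}={value}"] = 1.0
            elif isinstance(value, dict):
                for key, inner in value.items():
                    features[f"{name}.{key}"] = float(inner)
            elif isinstance(value, (list, tuple)):
                for index, inner in enumerate(value):
                    features[f"{name}[{index}]"] = float(inner)
            else:
                raise TypeError(f"unsupported type for column {name!r}")
        expanded.append(features)
    return expanded

rows = [
    {"age": 34, "country": "NL", "clicks": {"ads": 2, "news": 5}, "scores": [0.1, 0.9]},
]
features = expand_features(rows)[0]
```

With scikit-learn, a user would typically write this kind of encoding step by hand before fitting a model.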

Some of our Customers

Dato on Coursera

40,000 students in 4 months
https://www.coursera.org/learn/ml-foundations

Specialization content:

● Machine Learning Foundations
● Regression
● Classification
● Clustering & Retrieval
● Recommendation Systems & Dimensionality Reduction
● Capstone: An Intelligent Application with Deep Learning

Remco Frijling

Create an intelligent world!

Data Engineering
• Fast & scalable
• Rich data types
• Built for ML

Sophisticated ML
• App-oriented ML
• Scalable ML
• Extensibility

Deployment
• Batch & always-on
• RESTful interface
• Elastic & robust

bickson@dato.com

Appendix: Performance

Confidential – Dato internal use only. ©2015 Dato, Inc.

Performance Highlights

Dato’s Platform outperforms other frameworks on most tasks: data munging, machine learning essentials, and graph analytics.

● Data Munging - SFrame, the columnar, out-of-core abstraction, enables tabular queries on a single node that are faster than or comparable to queries on 5-node clusters for systems like Spark & Redshift.

● Machine Learning - Unparalleled speed & accuracy for tasks including classification, recommendation, and deep learning on images compared to systems like MLlib, H2O, and scikit-learn.

● Graph Analysis - Orders of magnitude faster than comparable frameworks like GraphX & Giraph for common graph analytics tasks. Tasks complete in reasonable times (mins) even on the world’s largest publicly available webgraph. The only other known system to complete these tasks is one that runs on non-commodity, custom hardware.

Machine Learning – Deep Learning

Digit recognition benchmark
[Chart: test error (0.60% to 0.85%) vs. hours (0 to 12); annotations: "4 min on 4 GPUs", "10 machines/80 cores"]

Graph Analytics - 1

Connected components in Twitter graph
Twitter: 41 million Nodes, 1.4 billion Edges

System             Runtime
GraphLab Create    70 sec
GraphX             251 sec
Giraph             200 sec
Spark              2,128 sec

[Chart labels: SGraph; 16 machines; 1 machine]

Source(s): Gonzalez et al. (OSDI 2014)
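
For readers unfamiliar with the task being benchmarked, connected components assigns every node a label so that two nodes share a label exactly when a path connects them. Below is a minimal single-threaded sketch using union-find. It illustrates the problem only; it is not how SGraph or any of the benchmarked systems implement it, and the tiny edge list is made up.

```python
def connected_components(num_nodes, edges):
    """Label each node with a component id using union-find."""
    parent = list(range(num_nodes))

    def find(node):
        # Path halving keeps the trees shallow.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root id under the smaller one, so each
            # component ends up labeled by its smallest node id.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(node) for node in range(num_nodes)]

edges = [(0, 1), (1, 2), (3, 4)]
labels = connected_components(6, edges)
print(labels)  # [0, 0, 0, 3, 3, 5]
```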

PageRank on Common Crawl Graph
3.5 billion Nodes and 128 billion Edges

[Chart: minutes per iteration (0 to 10); labels: 1 machine, 16 machines, 300 machines, 16 CPUs, 256 CPUs]
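
The chart measures minutes per PageRank iteration. As a reminder of what one iteration does, here is a small power-iteration sketch in pure Python. The damping factor of 0.85, the handling of nodes without out-links, and the four-node example graph are assumptions for illustration; the benchmarked implementations are not shown in the slides.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=20):
    """Power-iteration PageRank over a list of (source, target) edges."""
    out_degree = [0] * num_nodes
    for source, _ in edges:
        out_degree[source] += 1

    ranks = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(r for r, d in zip(ranks, out_degree) if d == 0)
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_ranks = [base] * num_nodes
        # One pass over every edge: this is the per-iteration cost.
        for source, target in edges:
            new_ranks[target] += damping * ranks[source] / out_degree[source]
        ranks = new_ranks
    return ranks

edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
ranks = pagerank(edges, num_nodes=4)
```

Each iteration touches every edge once, so on a graph with 128 billion edges the per-iteration time is dominated by how fast edges can be streamed.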

Criteo Terabyte Click Prediction
4.4 Billion Rows
13 Features
½ TB of data

[Chart: runtime (0 to 4000) vs. #Machines (0 to 16); annotations: "Linear Speedup", 3630s, 225s]
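
The two runtimes annotated on the chart can be turned into a speedup figure. The calculation below assumes that 3630s is the single-machine runtime and 225s is the runtime on 16 machines; the transcript does not state which point each label belongs to, so treat that pairing as an assumption.

```python
def speedup(baseline_seconds, scaled_seconds):
    """How many times faster the scaled run is than the baseline."""
    return baseline_seconds / scaled_seconds

def parallel_efficiency(baseline_seconds, scaled_seconds, machines):
    """Speedup divided by machine count; 1.0 means perfectly linear scaling."""
    return speedup(baseline_seconds, scaled_seconds) / machines

# Assumed pairing: 3630 s on 1 machine, 225 s on 16 machines.
s = speedup(3630, 225)
e = parallel_efficiency(3630, 225, 16)
print(f"speedup: {s:.2f}x, efficiency: {e:.2f}")  # speedup: 16.13x, efficiency: 1.01
```

Under that assumption the speedup is about 16x on 16 machines, which is consistent with the "Linear Speedup" label.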

Machine Learning – Logistic Reg. Accuracy

Dataset Source(s): LIBLINEAR binary classification datasets.

Data Munging

SELECT pageURL, pageRank FROM rankings WHERE pageRank > X

[Chart legend: 5 Nodes, 1 Node]

Source(s): https://amplab.cs.berkeley.edu/benchmark/, Armbrust et al. (SIGMOD 2015)

Dataset: Extracted from 775M visits to 90M documents in the Common Crawl corpus
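
The benchmark query is a simple scan with a filter. To make it concrete, the sketch below runs it against an in-memory SQLite table using Python's standard `sqlite3` module. The four rows are made up, the threshold `X` is bound as a query parameter, and an `ORDER BY` clause is added only to make the output deterministic; none of this reflects how the benchmarked systems execute the query.

```python
import sqlite3

# Hypothetical rows standing in for the benchmark's `rankings` table.
rows = [
    ("http://example.com/a", 12),
    ("http://example.com/b", 340),
    ("http://example.com/c", 87),
    ("http://example.com/d", 1500),
]

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER)")
connection.executemany("INSERT INTO rankings VALUES (?, ?)", rows)

def pages_above(threshold):
    """Run the scan query with X bound to `threshold`."""
    cursor = connection.execute(
        "SELECT pageURL, pageRank FROM rankings "
        "WHERE pageRank > ? ORDER BY pageRank",
        (threshold,),
    )
    return cursor.fetchall()

result = pages_above(100)
```

Because the query reads only two columns and filters on one of them, it is the kind of workload where a columnar layout helps.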

Appendix: Pricing & Deployment Scenarios

• Subscription license which includes support and upgrades

• Licensed by user for Create & by machine for production use

• Training & technical services also available

• Discounts available for 10 or more users

Deployment Scenarios

Key: GraphLab Create, Dato Predictive Services, Dato Distributed

“Getting Started”

GraphLab Create – installed on each team member machine
• Working with data, training new models, doing ad-hoc analysis

GraphLab Create – installed on central team server
• Trains production models periodically (ex. nightly)
• Generates predictions and records to data store

“Real-time Predictions”

GraphLab Create – installed on each team member machine
• Installed on team member laptops
• Working with data, ad-hoc analysis, training new models
• Deploy new models to Predictive Services deployment

GraphLab Create – installed on central team server
• Trains production models periodically (ex. nightly)
• Deploys models to Dato Predictive Services

Dato Predictive Services – installed on central team cluster
• Hosting & Serving deployed models
• REST API for application integration

“Scaling Up”

GraphLab Create – installed on each team member machine
• Working with data, training new models, doing ad-hoc analysis
• Deploys models to Predictive Services
• Submits jobs to Distributed

Dato Distributed – installed on central team cluster
• Train models in parallel on larger dataset periodically (ex. nightly)
• Deploys newly trained models to Dato Predictive Services

Dato Predictive Services – installed on central team cluster
• Hosting deployed models
• REST API for application integration
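
To make "REST API for application integration" concrete, here is a generic sketch of a prediction endpoint built with Python's standard library. It is not the Dato Predictive Services API, whose request format is not shown in the slides. The JSON payload shape, the `churn_score` rule, and the handler class are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler

def churn_score(features):
    """Stand-in model: a hand-written rule, not a trained predictor."""
    return 0.9 if features.get("days_inactive", 0) > 30 else 0.1

def handle_prediction(request_body):
    """Turn a JSON request body into a (status, JSON response body) pair."""
    try:
        payload = json.loads(request_body)
        score = churn_score(payload["features"])
    except (ValueError, KeyError, TypeError, AttributeError):
        return 400, json.dumps({"error": "expected a JSON object with a 'features' object"})
    return 200, json.dumps({"score": score})

class PredictionHandler(BaseHTTPRequestHandler):
    """Hypothetical POST endpoint that delegates to handle_prediction."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_prediction(self.rfile.read(length).decode("utf-8"))
        encoded = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(encoded)))
        self.end_headers()
        self.wfile.write(encoded)

# The request logic can be exercised without starting a server.
status, body = handle_prediction('{"features": {"days_inactive": 45}}')
```

To serve it locally, one could pass `PredictionHandler` to `http.server.HTTPServer` and call `serve_forever()`. Keeping the model call in `handle_prediction` separates serving from training, which is the split the scenarios above describe between GraphLab Create and Predictive Services.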