Danny Bickson - Python based predictive analytics with GraphLab Create
Uploader: pydata
Category: Data & Analytics
Dato Confidential
GraphLab Create Training
UvA School of Business
Danny Bickson, Co-Founder and VP [email protected]
Dato: We Intelligent Applications
Business must be intelligent
Machine learning applications
• Recommenders
• Fraud detection
• Ad targeting
• Financial models
• Personalized medicine
• Churn prediction
• Smart UX (video & text)
• Personal assistants
• IoT
• Social networks
• Log analysis
Last decade: Data management
Last 5 years: Traditional analytics
Now: Intelligent apps
Example Intelligent Applications
- images
- text
- graphs
- tabular data
Creating a model pipeline
• Data exploration
• Modeling
Creating a model pipeline
Unstructured Data
• Ingest
• Transform
• Model
• Deploy
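The four pipeline stages named on this slide (ingest, transform, model, deploy) can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use GraphLab Create, the function names and the tab-separated input format are hypothetical, and the "model" is a toy word-overlap scorer standing in for a real learner.

```python
from collections import Counter

def ingest(raw_lines):
    """Ingest: parse unstructured 'label<TAB>text' lines into records."""
    records = []
    for line in raw_lines:
        label, _, text = line.partition("\t")
        records.append({"label": label.strip(), "text": text.strip()})
    return records

def transform(records):
    """Transform: turn free text into bag-of-words count features."""
    return [
        {"label": r["label"], "features": Counter(r["text"].lower().split())}
        for r in records
    ]

def train(rows):
    """Model: sum word counts per label (a toy stand-in for a real learner)."""
    model = {}
    for row in rows:
        model.setdefault(row["label"], Counter()).update(row["features"])
    return model

def deploy(model):
    """Deploy: wrap the trained model in a callable that serves predictions."""
    def predict(text):
        words = Counter(text.lower().split())
        # Score each label by how many query words overlap its training words.
        scores = {
            label: sum(min(counts[w], n) for w, n in words.items())
            for label, counts in model.items()
        }
        return max(scores, key=scores.get)
    return predict

# Hypothetical sample data, invented for this sketch.
raw = [
    "spam\tWin a free prize now",
    "spam\tFree money, claim your prize",
    "ham\tMeeting moved to Monday morning",
    "ham\tLunch on Monday?",
]
predict = deploy(train(transform(ingest(raw))))
print(predict("claim your free prize"))
```

Keeping each stage as a separate function means any one of them can be swapped out (for example, a different feature transform) without touching the others.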
The Dato Machine Learning Platform
Dato Predictive Services
• Predictive Engine
• REST Client
• Model Mgmt
GraphLab Create (free for academic usage)
• Machine Learning Toolkits
• Canvas
• SDK
• SGraph
• SFrame
• Engine (SFrame on GitHub)
GraphLab Create Benefits
Why use GraphLab Create?
• Efficient storage
GraphLab SFrame compressed column store:
• 20x smaller than pandas
• 2x smaller than Gzip
Figure: size on disk (the lower the better!)
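One reason a column store can compress well is that values of the same column sit next to each other, so repeated values form long runs. The sketch below shows run-length encoding, one common column-compression technique. Whether SFrame uses this particular encoding is not stated in the slides, so treat it purely as an illustration; the table contents are hypothetical.

```python
from itertools import groupby

def rle_encode(column):
    """Run-length encode a column as [(value, run_length), ...]."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]

# Hypothetical table stored row by row: values of different columns interleave.
rows = [("NL", "click")] * 3 + [("NL", "view")] * 2 + [("US", "view")] * 4

# Column layout: all values of one column are stored together.
country = [r[0] for r in rows]
event = [r[1] for r in rows]

country_runs = rle_encode(country)
event_runs = rle_encode(event)
print(country_runs)  # nine values collapse into two runs
print(event_runs)
```

Decoding with `rle_decode` restores the original column exactly, so the encoding is lossless.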
No need for huge RAM!
Data size limited by disk size
Figure: effective delay vs. RAM (annotations: x2, x5; region labeled "My data is larger than my machine RAM")
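The idea that data size is bounded by disk rather than RAM can be illustrated with a streaming computation: read the file row by row and keep only a running total in memory. This is a minimal sketch using the Python standard library, not GraphLab Create's implementation; the file layout and column names are hypothetical.

```python
import csv
import os
import tempfile

def write_sample(path, n_rows):
    """Write a sample CSV to disk (stands in for a dataset larger than RAM)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "amount"])
        for i in range(n_rows):
            writer.writerow([i, i % 10])

def streaming_mean(path, column):
    """Compute a column mean in one pass, holding one row in memory at a time."""
    total, count = 0.0, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row[column])
            count += 1
    return total / count if count else float("nan")

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    write_sample(sample_path, 1000)
    mean_amount = streaming_mean(sample_path, "amount")

print(mean_amount)
```

Memory use stays constant no matter how many rows the file has; the price is that every pass over the data pays for disk reads.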
Comparison to sklearn
Try it here: http://blog.dato.com/how-fast-are-out-of-core-algorithms
Summary of differences vs. sklearn
• Better multicore support
• Out of core implementation (working from disk)
• Automatic feature expansion
• Automatic parameter selection
• Support for model serving
• Additional algorithms
Some of our Customers
Dato on Coursera
40,000 students in 4 months
https://www.coursera.org/learn/ml-foundations
Specialization content:
● Machine Learning Foundations
● Regression
● Classification
● Clustering & Retrieval
● Recommendation Systems & Dimensionality Reduction
● Capstone: An Intelligent Application with Deep Learning
Remco Frijling
Create an intelligent world!
Data Engineering
• Fast & scalable
• Rich data types
• Built for ML
Sophisticated ML
• App-oriented ML
• Scalable ML
• Extensibility
Deployment
• Batch & always-on
• RESTful interface
• Elastic & robust
Appendix: Performance
Confidential – Dato internal use only. ©2015 Dato, Inc.
Performance Highlights
Dato’s Platform outperforms other frameworks on most tasks: data munging, machine learning essentials, and graph analytics.
● Data Munging - SFrame, the columnar and out-of-core abstraction, enables tabular queries on a single node that are faster than or comparable to queries on 5-node clusters for systems like Spark & Redshift.
● Machine Learning - Unparalleled speed & accuracy for tasks including classification, recommendation, and deep learning on images compared to systems like MLLib, H2O, and scikit-learn.
● Graph Analysis - Orders of magnitude faster than comparable frameworks like GraphX & Giraph for common graph analytics tasks. Tasks complete in reasonable times (mins) even on the world’s largest publicly available webgraph. The only other known system to complete these tasks is one that runs on non-commodity, custom hardware.
Machine Learning – Deep Learning
Figure: Digit recognition benchmark, test error (0.60% to 0.85%) vs. hours (0 to 12). Annotations: "4 min on 4 GPUs", "10 machines/80 cores".
Graph Analytics - 1
Connected components in Twitter graph
• GraphLab Create: 70 sec
• GraphX: 251 sec
• Giraph: 200 sec
• Spark: 2,128 sec
Legend: SGraph; 16 machines; 1 machine
Twitter: 41 million nodes, 1.4 billion edges
Source(s): Gonzalez et al. (OSDI 2014)
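For readers unfamiliar with the connected components task benchmarked on this slide, here is a minimal single-machine sketch using union-find. It is not the algorithm used by SGraph, GraphX, or Giraph (the slides do not say which algorithms those use); it only shows what the task computes, on a tiny hypothetical edge list.

```python
def connected_components(num_nodes, edges):
    """Label each node with a component id using union-find."""
    parent = list(range(num_nodes))

    def find(x):
        # Path halving: point nodes closer to the root while walking up.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            # Attach the larger root id under the smaller one, so each
            # component ends up labeled by its smallest node id.
            parent[max(root_u, root_v)] = min(root_u, root_v)

    return [find(x) for x in range(num_nodes)]

# Hypothetical graph: nodes 0-2 form one component, 3-4 another, 5 is isolated.
edges = [(0, 1), (1, 2), (3, 4)]
labels = connected_components(6, edges)
print(labels)
print(len(set(labels)))
```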
PageRank on Common Crawl Graph
3.5 billion nodes and 128 billion edges
Figure: minutes per iteration (0 to 10) by cluster size. Labels: 1 machine, 16 machines, 300 machines; 16 CPUs, 256 CPUs.
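The benchmark on this slide is reported per iteration because PageRank is typically computed by repeating the same update until the ranks settle. The sketch below is a plain power-iteration version on a tiny hypothetical graph; the damping factor of 0.85 and the handling of nodes without out-links are conventional choices assumed here, not details taken from the slides.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [nodes it links to]}."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            targets = out_links[node]
            for target in targets:
                # Each node splits its damped rank evenly among its out-links.
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank

# Hypothetical four-page web graph; every link target is also a key.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))
```

Each pass touches every edge once, which is why the cost of one iteration scales with the number of edges in the graph.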
Criteo Terabyte Click Prediction
4.4 billion rows
13 features
½ TB of data
Figure: runtime (0 to 4000) vs. #machines (0 to 16). Annotations: "Linear Speedup", "3630s", "225s".
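The "linear speedup" label can be checked with simple arithmetic on the two runtimes shown. The sketch below assumes that 3630s is the single-machine runtime and 225s is the 16-machine runtime; the extracted slide text does not say which point each label belongs to, so that mapping is an assumption.

```python
def speedup(baseline_seconds, parallel_seconds):
    """How many times faster the parallel run is than the baseline."""
    return baseline_seconds / parallel_seconds

def efficiency(baseline_seconds, parallel_seconds, machines):
    """Speedup divided by machine count; 1.0 means perfectly linear scaling."""
    return speedup(baseline_seconds, parallel_seconds) / machines

# Assumed mapping: 3630s on 1 machine, 225s on 16 machines.
s = speedup(3630, 225)
e = efficiency(3630, 225, 16)
print(round(s, 2), round(e, 2))
```

Under that assumption the speedup is about 16.13x on 16 machines, which is consistent with the slide's "linear" annotation.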
Machine Learning – Logistic Reg. Accuracy
Dataset Source(s): LIBLINEAR binary classification datasets.
Data Munging
SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
Figure legend: 5 Nodes, 1 Node
Source(s): https://amplab.cs.berkeley.edu/benchmark/, Armbrust et al. (SIGMOD 2015)
Dataset: Extracted from 775M visits to 90M documents in the Common Crawl corpus
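The scan query benchmarked on this slide can be tried locally with Python's built-in sqlite3 module. This is only a way to see what the query does, not a reproduction of the benchmark: the three rows are hypothetical, the threshold stands in for X, and an ORDER BY is added here solely to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER)")
# Hypothetical rows standing in for the benchmark's rankings table.
conn.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [("a.example/1", 5), ("b.example/2", 120), ("c.example/3", 1500)],
)

def scan(threshold):
    """Run the slide's scan query with X bound to `threshold`."""
    cursor = conn.execute(
        "SELECT pageURL, pageRank FROM rankings "
        "WHERE pageRank > ? ORDER BY pageRank",
        (threshold,),
    )
    return cursor.fetchall()

result = scan(100)
print(result)
```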
Appendix: Pricing & Deployment Scenarios
• Subscription license which includes support and upgrades
• Licensed by user for Create & by machine for production use
• Training & technical services also available
• Discounts available for 10 or more users
Deployment Scenarios
Key: GraphLab Create, Dato Predictive Services, Dato Distributed
“Getting Started”
GraphLab Create – installed on each team member machine
• Working with data, training new models, doing ad-hoc analysis
GraphLab Create – installed on central team server
• Trains production models periodically (ex. nightly)
• Generates predictions and records to data store
“Real-time Predictions”
GraphLab Create – installed on each team member machine
• Installed on team member laptops
• Working with data, ad-hoc analysis, training new models
• Deploy new models to Predictive Services deployment
GraphLab Create – installed on central team server
• Trains production models periodically (ex. nightly)
• Deploys models to Dato Predictive Services
Dato Predictive Services – installed on central team cluster
• Hosting & serving deployed models
• REST API for application integration
“Scaling Up”
GraphLab Create – installed on each team member machine
• Working with data, training new models, doing ad-hoc analysis
• Deploys models to Predictive Services
• Submits jobs to Distributed
Dato Distributed – installed on central team cluster
• Train models in parallel on larger dataset periodically (ex. nightly)
• Deploys newly trained models to Dato Predictive Services
Dato Predictive Services – installed on central team cluster
• Hosting deployed models
• REST API for application integration