Scaling Tensorflow to 100s of GPUs with Spark and Hops Hadoop
Global AI Conference, Santa Clara, January 18th 2018
Hops
Jim Dowling
Associate Prof @ KTH
Senior Researcher @ RISE SICS
CEO @ Logical Clocks AB
AI Hierarchy of Needs
[Pyramid, top to bottom:]
- DDL (Distributed Deep Learning)
- Deep Learning, RL, Automated ML
- A/B Testing, Experimentation, ML
- B.I. Analytics, Metrics, Aggregates, Features, Training/Test Data
- Reliable Data Pipelines, ETL, Unstructured and Structured Data Storage, Real-Time Data Ingestion
[Adapted from https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007?gi=7e13a696e469 ]
The lower layers of the hierarchy serve Analytics; the upper layers serve Prediction. Hops addresses the full hierarchy, from data ingestion up to distributed deep learning.
More Data means Better Predictions
[Figure: prediction performance vs. amount of labelled data, 1980s-2020s. With little labelled data, hand-crafted traditional AI can outperform; with enough labelled data, Deep Neural Nets win.]
What about More Compute?
“Methods that scale with computation are the future of AI”*
- Rich Sutton (A Founding Father of Reinforcement Learning)
* https://www.youtube.com/watch?v=EeMCEQa85tw
More Compute should mean Faster Training
[Figure: training performance vs. available compute, 2015-2018. Single-host training plateaus while distributed training keeps scaling.]
Reduce DNN Training Time from 2 weeks to 1 hour
In 2017, Facebook reduced training time on ImageNet for a CNN from 2 weeks to 1 hour by scaling out to 256 GPUs using Ring-AllReduce on Caffe2.
https://arxiv.org/abs/1706.02677
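The accuracy trick behind that result is the linear scaling rule with gradual warmup: multiply the base learning rate by the factor by which the global batch size grew, and ramp up to it over the first few epochs. A minimal sketch (function names are mine, not the paper's):

def scaled_lr(base_lr, global_batch_size, base_batch_size=256):
    # Linear scaling rule: the learning rate grows with the global batch size.
    return base_lr * global_batch_size / base_batch_size

def warmup_lr(base_lr, target_lr, epoch, warmup_epochs=5.0):
    # Gradual warmup: ramp linearly from base_lr to target_lr over the
    # first warmup_epochs, then hold at target_lr.
    if epoch >= warmup_epochs:
        return target_lr
    return base_lr + (target_lr - base_lr) * epoch / warmup_epochs

# e.g. 256 GPUs x 32 images/GPU = global batch 8192: lr 0.1 -> 3.2
target = scaled_lr(0.1, 8192)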
DNN Training Time and Researcher Productivity
•Distributed Deep Learning
 - Interactive analysis!
 - Instant gratification!
•Single-Host Deep Learning
 - "My Model's Training." (suffering from Google-Envy)
Distributed Training: Theory and Practice
Image from @hardmaru on Twitter.
Distributed Algorithms are not all Created Equal
[Figure: training performance vs. available compute. AllReduce scales better than Parameter Servers.]
Ring-AllReduce vs Parameter Server(s)
[Diagram: with Ring-AllReduce, GPU 0-GPU 3 form a ring, each GPU sending to and receiving from its neighbours; with Parameter Server(s), GPU 1-GPU 4 each send gradients to and receive updated parameters from central parameter servers.]
Network Bandwidth is the Bottleneck for Distributed Training
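To make the ring concrete, here is a minimal single-process simulation of ring-allreduce in NumPy (a sketch of the algorithm, not Horovod's implementation). Each worker splits its gradient into n segments; a scatter-reduce pass sums each segment as it travels around the ring, and an all-gather pass then distributes the reduced segments, so each link only ever carries 1/n of the gradient per step:

import numpy as np

def ring_allreduce(grads):
    # grads: one equal-length gradient vector per worker.
    n = len(grads)
    segs = [list(np.array_split(g.astype(float), n)) for g in grads]

    # Scatter-reduce: in step t, worker i sends segment (i - t) % n to its
    # neighbour (i + 1) % n, which accumulates it. Sends within a step are
    # concurrent, so buffer them before applying.
    for t in range(n - 1):
        sends = [(i, (i - t) % n, segs[i][(i - t) % n].copy()) for i in range(n)]
        for i, s, data in sends:
            segs[(i + 1) % n][s] += data

    # Worker i now holds the fully reduced segment (i + 1) % n.
    # All-gather: pass the reduced segments once more around the ring.
    for t in range(n - 1):
        sends = [(i, (i + 1 - t) % n, segs[i][(i + 1 - t) % n].copy()) for i in range(n)]
        for i, s, data in sends:
            segs[(i + 1) % n][s] = data

    return [np.concatenate(s) for s in segs]

# 4 workers: every worker ends up with the same summed gradient.
out = ring_allreduce([np.ones(8) * w for w in range(4)])
assert all(np.allclose(o, out[0]) for o in out)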
AllReduce outperforms Parameter Servers
16 servers, each with 4 P100 GPUs (64 GPUs total), connected by a RoCE-capable 25 Gbit/s network (synthetic data). Speed is images processed per second.*
*https://github.com/uber/horovod
For Bigger Models, Parameter Servers don’t scale
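A back-of-envelope calculation shows why. With N workers and a model of M bytes, a single parameter server moves roughly 2·M·N bytes through its NIC per step (gradients in, parameters out), while ring-allreduce moves about 2·M·(N-1)/N bytes per link per step, independent of N. A sketch (the 100 MB model size is an arbitrary assumption):

M = 100e6  # model size in bytes (assumed for illustration)
for N in (4, 16, 64):
    ps_nic = 2 * M * N               # bytes through one parameter server's NIC
    ring_link = 2 * M * (N - 1) / N  # bytes per ring link, bounded by 2M
    print('N=%2d  PS NIC: %7.1f GB   ring link: %.2f GB'
          % (N, ps_nic / 1e9, ring_link / 1e9))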
Multiple GPUs on a Single Server
NVLink vs PCI-E Single Root Complex
On a single host, the bus can be the bottleneck for distributed training: NVLink – 80 GB/s vs PCI-E – 16 GB/s.
[Images from: https://www.microway.com/product/octoputer-4u-10-gpu-server-single-root-complex/ ]
Scale: Remove Bus and Network Bandwidth Bottlenecks
A single slow worker, bus, or network link is enough to bottleneck DNN training time.
Ring-AllReduce removes both bottlenecks.
The Cloud is full of Bottlenecks…
[Figure: training performance vs. available compute. InfiniBand on-premise scales further than public cloud (10 GbE).]
Deep Learning Hierarchy of Scale
Training time for ImageNet at each level of scale:
- Single GPU: Weeks
- Many GPUs on a Single GPU Server: Days
- Parallel Experiments on GPU Servers: Days/Hours
- DDL with GPU Servers and Parameter Servers: Hours
- DDL AllReduce on GPU Servers: Minutes
Lots of good GPUs > A few great GPUs
100 x Nvidia 1080Ti (DeepLearning11) vs 8 x Nvidia P/V100 (DGX-1): both cost the same, about $150K (2017).
A consumer GPU server with 10 x 1080Ti costs around $15K.
https://www.oreilly.com/ideas/distributed-tensorflow
Cluster of Commodity GPU Servers
InfiniBand interconnect; max 1-2 GPU servers per rack (2-4 kW per server).
TensorFlow Spark Platforms
•TensorFlow-on-Spark
•Deep Learning Pipelines
•Horovod
•Hops
Hops – Running Parallel Experiments

def model_fn(learning_rate, dropout):
    # "Pure" TensorFlow code, run inside the Spark Executor
    import tensorflow as tf
    from hops import tensorboard, hdfs, devices
    …

from hops import tflauncher
args_dict = {'learning_rate': [0.001], 'dropout': [0.5]}
tflauncher.launch(spark, model_fn, args_dict)

Launch TF jobs as Mappers in Spark.
Hops – Parallel Experiments

def model_fn(learning_rate, dropout):
    …

from hops import tflauncher
args_dict = {'learning_rate': [0.001, 0.005, 0.01], 'dropout': [0.5, 0.6]}
tflauncher.launch(spark, model_fn, args_dict)

Launches 6 Executors (3 learning rates x 2 dropout rates), each with a different hyperparameter combination. Each Executor can have 1-N GPUs. A sketch of what such a launcher does follows.
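For intuition, here is a rough sketch of what a grid launcher like this might do with Spark's RDD API (a hypothetical re-implementation, not the actual hops code):

import itertools

def grid_launch(spark, map_fn, args_dict):
    # Expand the dict of value lists into one dict per combination.
    names = sorted(args_dict)
    combos = [dict(zip(names, values))
              for values in itertools.product(*(args_dict[n] for n in names))]
    # One partition per combination, so each runs as its own Spark task.
    return (spark.sparkContext
                 .parallelize(combos, len(combos))
                 .map(lambda kwargs: map_fn(**kwargs))
                 .collect())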
Hops AllReduce/Horovod/TensorFlow

import horovod.tensorflow as hvd

def conv_model(feature, target, mode):
    …

def main(_):
    hvd.init()
    # Wrap the optimizer so gradients are averaged with ring-allreduce
    opt = hvd.DistributedOptimizer(opt)
    # Every worker starts from rank 0's initial variables
    hooks = [hvd.BroadcastGlobalVariablesHook(0), …]
    if hvd.local_rank() == 0:
        # e.g. only one worker per host writes checkpoints/summaries
        …
    …

from hops import allreduce
allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')

“Pure” TensorFlow code
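Conceptually, hvd.DistributedOptimizer wraps an existing optimizer and all-reduces each gradient before it is applied. A simplified sketch of the idea (not Horovod's actual implementation):

import tensorflow as tf
import horovod.tensorflow as hvd

class AllreduceOptimizer(tf.train.Optimizer):
    # Simplified sketch of what hvd.DistributedOptimizer does.
    def __init__(self, opt, name='AllreduceOptimizer', use_locking=False):
        super(AllreduceOptimizer, self).__init__(use_locking, name)
        self._opt = opt

    def compute_gradients(self, *args, **kwargs):
        # Average each gradient across all workers with ring-allreduce.
        grads_and_vars = self._opt.compute_gradients(*args, **kwargs)
        return [(hvd.allreduce(grad) if grad is not None else None, var)
                for grad, var in grads_and_vars]

    def apply_gradients(self, *args, **kwargs):
        return self._opt.apply_gradients(*args, **kwargs)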
TensorFlow and Hops Hadoop
Don’t do this: Different Clusters for Big Data and ML
Hops: Single ML and Big Data Cluster
[Diagram: one shared cluster (DataLake, GPUs, Compute, Kafka, Elasticsearch) serving IT, Data Engineering, and Data Science, partitioned into Project1…ProjectN.]
HopsFS: Next Generation HDFS*
- 16x higher throughput (faster)
- 37x more files (bigger)
- Scale Challenge Winner (2017)
- HopsFS now stores small files in the database
*https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
Size Matters: Improving the Performance of Small Files in HDFS. Salman Niazi, Seif Haridi, Jim Dowling. Poster, EuroSys 2017.
GPUs supported as a Resource in Hops 2.8.2*
Hops is the only Hadoop distribution to support GPUs-as-a-Resource.
*Robin Andersson, GPU Integration for Deep Learning on YARN, MSc Thesis, 2017
GPU Resource Requests in Hops
HopsYARN can satisfy requests such as:
- 4 GPUs on any host
- 10 GPUs on 1 host
- 100 GPUs on 10 hosts with 'Infiniband'
- 20 GPUs on 2 hosts with 'Infiniband_P100'
A mix of commodity and more powerful GPUs is good for (1) parallel experiments and (2) distributed training.
Hopsworks Data Platform
[Diagram: Develop → Train → Test → Serve. Hopsworks (Projects, Datasets, Users) exposes a REST API over Jupyter, Zeppelin, Jobs, Kibana, and Grafana; Spark, Flink, and TensorFlow run on HopsFS/YARN; backed by MySQL Cluster, Hive, InfluxDB, ElasticSearch, and Kafka.]
Python is a First-Class Citizen in Hopsworks
- Custom Python environments with Conda
- Python libraries are usable by Spark/TensorFlow
What is Hopsworks used for?

ETL Workloads
[Diagram: Hopsworks Jobs trigger Hive on YARN over HopsFS, storing Parquet; runs on public cloud or on-premise.]
Business Intelligence Workloads
[Diagram: Jupyter/Zeppelin or Jobs run Hive queries on YARN over HopsFS (Parquet); results surfaced as Kibana reports.]

Streaming Analytics in Hopsworks
[Diagram: Kafka ingests from data sources; Zeppelin jobs on YARN process streams into HopsFS (Parquet), MySQL, Elastic/Kibana, and InfluxDB/Grafana; Hive serves batch analytics.]
TensorFlow in Hopsworks
[Diagram: Experiments run on YARN over HopsFS with TensorBoard; a FeatureStore and Hive hold data; Kafka streams events; TensorFlow Serving serves trained models.]
One Click Deployment of TensorFlow Models
Hops API
•Java/Scala library
 - Secure streaming analytics with Kafka/Spark/Flink
 - SSL/TLS certs, Avro schemas, endpoints for Kafka/Hopsworks/etc.
•Python library
 - Manage TensorBoard; load/save models in HopsFS
 - Distributed TensorFlow in Python
 - Parameter sweeps for parallel experiments (see the sketch below)
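As a sketch of how those Python pieces fit into a model_fn, assuming the hops helper tensorboard.logdir() returns the experiment's TensorBoard directory (treat the exact call as illustrative; the random batches stand in for real data read from the project's datasets in HopsFS):

def model_fn(learning_rate, dropout):
    import numpy as np
    import tensorflow as tf
    from hops import tensorboard

    # A toy classifier, parameterised by the swept hyperparameters.
    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.placeholder(tf.float32, [None, 10])
    hidden = tf.layers.dropout(
        tf.layers.dense(x, 256, activation=tf.nn.relu),
        rate=dropout, training=True)
    logits = tf.layers.dense(hidden, 10)
    loss = tf.losses.softmax_cross_entropy(onehot_labels=y, logits=logits)
    train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)
    tf.summary.scalar('loss', loss)
    merged = tf.summary.merge_all()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # Write summaries where Hopsworks' TensorBoard looks for them (assumed helper).
        writer = tf.summary.FileWriter(tensorboard.logdir(), sess.graph)
        for step in range(100):
            xb = np.random.rand(32, 784).astype(np.float32)
            yb = np.eye(10, dtype=np.float32)[np.random.randint(0, 10, 32)]
            _, summary = sess.run([train_op, merged], feed_dict={x: xb, y: yb})
            writer.add_summary(summary, step)
        writer.close()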
TensorFlow-as-a-Service in RISE SICS ICE
•Hops
 - Spark/Flink/Kafka/TensorFlow/Hadoop-as-a-service
 - www.hops.site
•RISE SICS ICE
 - 250 kW datacenter, ~400 servers
 - Research and test environment
 - https://www.sics.se/projects/sics-ice-data-center-in-lulea
Summary
•Distribution can make Deep Learning practitioners more productive.
 https://www.oreilly.com/ideas/distributed-tensorflow
•Hopsworks is a new data platform built on HopsFS, with first-class support for Python and Deep Learning / ML (TensorFlow / Spark).
The Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed.
Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, ArunaKumari Yedurupaka, Tobias Johansson, Roberto Bampi.
Thank You.
Follow us: @hopshadoop
Star us: http://github.com/hopshadoop/hopsworks
Join us: http://www.hops.io