DLoBD: An Emerging Paradigm of Deep Learning over Big Data Stacks on RDMA-enabled Clusters
Talk at OFA Workshop 2018
by
Xiaoyi Lu, The Ohio State University
E-mail: [email protected]  http://www.cse.ohio-state.edu/~luxi
• Deep Learning is a subset of Machine Learning
– But, it is perhaps the most radical and revolutionary subset
• Deep Learning is going through a resurgence
– Model: Excellent accuracy for deep/convolutional neural networks
– Data: Public availability of versatile datasets like MNIST, CIFAR, and ImageNet
– Capability: Unprecedented computing and communication capabilities: Multi-/Many-Core, GPGPUs, Xeon Phi, InfiniBand, RoCE, etc.
• Big Data has become one of the most important elements in business analytics
– Increasing demand for getting Big Value out of Big Data to keep revenue growing continuously
Why is Deep Learning so hot?
Courtesy: http://www.zdnet.com/article/caffe2-deep-learning-wide-ambitions-flexibility-scalability-and-advocacy/
[Figures: MNIST handwritten digits; Deep Neural Network]
Application Example of DL: Flickr's Magic View Photo Filtering
Courtesy: https://thenextweb.com/opinion/2015/05/22/flickrs-new-magic-view-photo-filtering-feature-works-so-well-it-convinced-me-to-ditch-iphoto/#.tnw_RaZEaD6g
• Image recognition to divide pictures into surprisingly accurate categories
• Magic of AI/DL: Generate accurate tags for billions of pictures
[Figure: DLoBD pipeline: (1) Prepare Datasets @Scale, (2) Deep Learning @Scale, (3) Non-deep-learning analytics @Scale, (4) Apply ML model @Scale]
• Deep Learning over Big Data (DLoBD) is one of the most efficient analytics paradigms
• More and more deep learning tools or libraries (e.g., Caffe, TensorFlow) are starting to run over big data stacks, such as Apache Hadoop and Spark
• Benefits of the DLoBD approach
– Easily build a powerful data analytics pipeline
• E.g., Flickr DL/ML Pipeline, “How Deep Learning Powers Flickr”, http://bit.ly/1KIDfof
– Better data locality
– Efficient resource sharing and cost effective
Deep Learning over Big Data (DLoBD)
• CaffeOnSpark
• SparkNet
• TensorFlowOnSpark
• TensorFrames
• DeepLearning4J
• BigDL
• MMLSpark (CNTKOnSpark)
• Many others…
Examples of DLoBD Stacks
• Layers of DLoBD Stacks
– Deep learning application layer
– Deep learning library layer
– Big data analytics framework layer
– Resource scheduler layer
– Distributed file system layer
– Hardware resource layer
• How much performance benefit can we achieve for end deep learning applications?
Overview of DLoBD Stacks
Sub-optimal performance?
[Diagram: convergence of Big Data (Hadoop, Spark, HBase, Memcached, etc.), Deep Learning (Caffe, TensorFlow, BigDL, etc.), and HPC (MPI, RDMA, Lustre, etc.)]
Increasing Usage of HPC, Big Data and Deep Learning
Convergence of HPC, Big Data, and Deep Learning!!!
• BLAS Libraries – the heart of math operations (a quick way to check which BLAS a Python stack links against is sketched after this slide)
– ATLAS/OpenBLAS
– NVIDIA cuBLAS
– Intel Math Kernel Library (MKL)
• DNN Libraries – the heart of convolutions!
– NVIDIA cuDNN (already reached its 7th iteration – cuDNN v7)
– Intel MKL-DNN (MKL 2017) – recent but a very promising development
• Communication Libraries – the heart of model parameter updating
– RDMA
– GPUDirect RDMA
Highly-Optimized Underlying Libraries with HPC Technologies
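Which of these BLAS backends a given installation actually uses often has to be checked explicitly. A minimal sketch, assuming a Python-based numeric stack built on NumPy (frameworks such as Caffe or BigDL record their BLAS choice at build/configure time instead):

```python
# Quick check of which BLAS/LAPACK implementation backs a Python numeric stack.
import time
import numpy as np

# Prints the BLAS/LAPACK build information NumPy was compiled against
# (e.g., OpenBLAS vs. Intel MKL).
np.__config__.show()

# Rough runtime sanity check: a large matrix multiply exercises the BLAS GEMM path.
a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
t0 = time.time()
np.dot(a, b)
print("2000x2000 GEMM took %.3f seconds" % (time.time() - t0))
```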
Outline
• Accelerating Big Data Stacks
• Benchmarking and Characterizing DLoBD Stacks
– CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL
• Accelerating DLoBD Stacks
– BigDL on RDMA-Spark
– TensorFlow
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 280 organizations from 34 countries
• More than 25,750 downloads from the project site
The High-Performance Big Data (HiBD) Project
Available for InfiniBand and RoCE (also runs on Ethernet); available for x86 and OpenPOWER; the upcoming release will have support for Singularity and Docker
HiBD Release Timeline and Downloads
[Chart: cumulative number of downloads over time, annotated with HiBD release milestones: RDMA-Hadoop 1.x 0.9.0, 0.9.8, 0.9.9; RDMA-Hadoop 2.x 0.9.1, 0.9.6, 0.9.7, 1.0.0, 1.3.0; RDMA-Memcached 0.9.1, 0.9.4, 0.9.6; RDMA-Spark 0.9.1, 0.9.4, 0.9.5; RDMA-HBase 0.9.1; OHB 0.7.1, 0.9.3]
Example: RDMA-based Apache Spark
Apache Spark versions and their default design choices, tracked by the RDMA-based Apache Spark designs (HotI'14, BigData'16, HotI'17):
– 0.9.1: Hash-based Shuffle, NIO-based data transfer
– 1.2.0: Sort-based Shuffle, NIO-based data transfer
– 1.3.1: Sort-based Shuffle, Netty-based data transfer
– 1.4.0~2.0.0+: Sort-based Shuffle, Netty-based data transfer; new choice: Tungsten-Sort!
– 1.6.0~2.2.0+: New workloads – DLoBD: CaffeOnSpark, TensorFlowOnSpark, BigDL, etc.
• Design Features
– RDMA-based shuffle plugin
– SEDA-based architecture
– Dynamic connection management and sharing
– Non-blocking data transfer
– Off-JVM-heap buffer management
– InfiniBand/RoCE support
Design Overview of Spark with RDMA
• Enables high-performance RDMA communication, while supporting the traditional socket interface (a configuration sketch follows the diagram below)
• JNI layer bridges Scala-based Spark with a communication library written in native code
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014
X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData '16, Dec. 2016.
[Architecture diagram: Apache Spark benchmarks/applications/libraries/frameworks run on Spark Core, whose Shuffle Manager (Sort, Hash, Tungsten-Sort) uses a Block Transfer Service (Netty, NIO, or the RDMA plugin) with matching Netty/NIO/RDMA servers and clients; the Java Socket Interface targets 1/10/40/100 GigE and IPoIB networks, while the Java Native Interface (JNI) bridges to the native RDMA-based communication engine over RDMA-capable networks (IB, iWARP, RoCE).]
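To illustrate where such a plugin attaches, the sketch below uses only stock Spark configuration keys from the Spark 1.x/2.x line covered above; the concrete property names shipped with the RDMA-Spark package are defined in its own user guide, so nothing RDMA-specific is asserted here.

```python
# Minimal PySpark sketch showing where the shuffle/block-transfer layer is configured.
# Consult the RDMA-Spark user guide for the actual properties used by its plugin.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("dlobd-shuffle-config-sketch")
        # Spark's pluggable shuffle layer: "sort" (default), "hash" (old), "tungsten-sort".
        .set("spark.shuffle.manager", "sort")
        # In Spark 1.x the block transfer service was selectable ("netty" or "nio");
        # an RDMA design plugs in at this layer instead of the socket-based transports.
        .set("spark.shuffle.blockTransferService", "netty"))

sc = SparkContext(conf=conf)
# A small shuffle-heavy job to exercise whichever transport is configured.
print(sc.parallelize(range(1000))
        .map(lambda x: (x % 10, 1))
        .reduceByKey(lambda a, b: a + b)
        .collect())
sc.stop()
```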
• High-Performance Design of Spark over RDMA-enabled Interconnects
– High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for Spark
– RDMA-based data shuffle and SEDA-based shuffle architecture
– Non-blocking and chunk-based data transfer
– Off-JVM-heap buffer management
– Support for OpenPOWER
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.5
– Based on Apache Spark 2.1.0
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms (x86, POWER)
• RAM disks, SSDs, and HDDs
– http://hibd.cse.ohio-state.edu
RDMA for Apache Spark Distribution
• RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and available on SDSC Comet.
– Examples for various modes of usage are available in:
• RDMA for Apache Hadoop 2.x: /share/apps/examples/HADOOP
• RDMA for Apache Spark: /share/apps/examples/SPARK/
– Please email [email protected] (reference Comet as the machine, and SDSC as the site) if you have any further questions about usage and configuration.
• RDMA for Apache Hadoop is also available on Chameleon Cloud as an appliance
– https://www.chameleoncloud.org/appliances/17/
HiBD Packages on SDSC Comet and Chameleon Cloud
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE'16, July 2016
• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536M, 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
– 32 nodes/768 cores: total time reduced by 37% over IPoIB (56 Gbps)
– 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56 Gbps)
Performance Evaluation on SDSC Comet – HiBench PageRank
[Charts: PageRank total time (sec) vs. data size (Huge, BigData, Gigantic) for IPoIB and RDMA; left: 32 worker nodes, 768 cores; right: 64 worker nodes, 1536 cores. RDMA reduces total time by 37% and 43%, respectively.]
Outline
• Accelerating Big Data Stacks
• Benchmarking and Characterizing DLoBD Stacks
– CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL
• Accelerating DLoBD Stacks
– BigDL on RDMA-Spark
– TensorFlow
• Choose proper DL workloads, models, and datasets
– Varied sizes to cover big and small models, small and large data sets
– Cover different kinds of combinations
• Choose representative DLoBD stacks
– CaffeOnSpark, TensorFlowOnSpark, and BigDL
– Running over Spark, YARN, HDFS
Benchmarking and Characterization Methodology
• Define characterization dimensions
– Processor type
– Parameter updating approach (i.e., communication)
– Network protocol (IPoIB, RDMA)
• Generate evaluation reports
– Performance (end-to-end training time; time to a certain accuracy; epoch execution time)
– Accuracy, scalability, resource utilization
– Breakdown
• Spark Driver: job launching and job control
• Spark Executor: for data feeding and task control
• Model Synchronizer: communicates across nodes with RDMA / TCP, and outputs the model file on HDFS
• Scalable and communication intensive
– Server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates the scalability bottleneck
– Out-of-band communication
Overview of Representative DLoBD Stacks – CaffeOnSpark
• Spark executors acting as containers used to run TensorFlow code
• Two different modes of ingesting data (see the launch sketch below)
– Read data directly from HDFS using built-in TensorFlow modules
– Feed data from Spark RDDs to Spark executors (TensorFlow core)
• Scalable and communication intensive
– Parameter-server-based approach
– Embedded inside one Spark executor and talks to other workers over gRPC or gRPC with RDMA
– Out-of-band communication
Overview of Representative DLoBD Stacks – TensorFlowOnSpark
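A minimal launch sketch, assuming the TensorFlowOnSpark Python API (`TFCluster.run` with its two `InputMode` options); argument order follows the project's published examples and may differ slightly across versions, and the map function here is only a stub:

```python
# Sketch: launching TensorFlow tasks inside Spark executors with TensorFlowOnSpark.
from pyspark import SparkConf, SparkContext
from tensorflowonspark import TFCluster

def map_fun(args, ctx):
    # Runs inside each Spark executor; ctx carries the job name ("ps"/"worker") and task index.
    # A real job builds its TensorFlow graph here: with InputMode.TENSORFLOW it reads
    # training data directly from HDFS, with InputMode.SPARK it consumes RDD partitions
    # fed by the driver via cluster.train(rdd, ...).
    print("Starting %s task %d" % (ctx.job_name, ctx.task_index))

sc = SparkContext(conf=SparkConf().setAppName("tfos-sketch"))
num_executors = 4   # TensorFlow workers and parameter servers run inside these executors
num_ps = 1

cluster = TFCluster.run(sc, map_fun, None, num_executors, num_ps,
                        tensorboard=False, input_mode=TFCluster.InputMode.TENSORFLOW)
cluster.shutdown()  # wait for the TensorFlow tasks to finish
sc.stop()
```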
• Integrates the Microsoft Cognitive Toolkit (CNTK) and OpenCV into Spark Machine Learning pipelines without data transfer overhead
• Data fed to the CNTK core (e.g., images or text) can be read directly from HDFS by Spark executors
• Scalable and communication intensive
– Embedded inside one Spark executor and talks to other workers over MPI (RDMA, TCP)
– Out-of-band communication
Overview of Representative DLoBD Stacks – CNTKOnSpark/MMLSpark
• Users can write deep learning applications as Spark programs (see the sketch below)
• Users can load pre-trained Caffe or Torch models into Spark programs using BigDL
• Data is fed to the BigDL core by Spark executors, which can load it directly from HDFS
• High performance
– Supports Intel MKL
– Supports both Xeon and Xeon Phi (e.g., KNL)
• Scalable and communication intensive
– Spark block manager as parameter server
– Organically designed and integrated with the Spark architecture
– In-band communication
• RDMA communication can be achieved through our RDMA-Spark package!
Overview of Representative DLoBD Stacks – BigDL
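A minimal sketch of the "deep learning application as a Spark program" idea, written against the BigDL 0.x Python API (module paths and `Optimizer` arguments follow that API and may differ in later releases; the random data and the tiny model are illustrative only, whereas real DLoBD runs use models such as VGG-16 on CIFAR-10 loaded from HDFS):

```python
# Sketch: training a small model with BigDL inside a PySpark program.
import numpy as np
from pyspark import SparkContext
from bigdl.util.common import init_engine, create_spark_conf, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

sc = SparkContext(conf=create_spark_conf().setAppName("bigdl-sketch"))
init_engine()  # initialize BigDL's engine (MKL threads, etc.) on the executors

# Tiny multi-layer perceptron standing in for a real model.
model = (Sequential().add(Linear(784, 128)).add(ReLU())
                     .add(Linear(128, 10)).add(LogSoftMax()))

# RDD of BigDL Samples (features, 1-based label); real jobs build this from HDFS data.
train_rdd = sc.parallelize(range(1024)).map(
    lambda i: Sample.from_ndarray(np.random.rand(784), np.array([float(i % 10) + 1])))

optimizer = Optimizer(model=model,
                      training_rdd=train_rdd,
                      criterion=ClassNLLCriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(5),
                      batch_size=256)
# Parameter synchronization runs in-band through Spark's block manager,
# which is why the RDMA-Spark package can accelerate it transparently.
trained_model = optimizer.optimize()
```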
Dataset: MNIST | CIFAR-10 | ImageNet
Category: Digit Classification | Object Classification | Object Classification
Resolution: 28 × 28 B&W | 32 × 32 Color | 256 × 256 Color
Classes: 10 | 10 | 1000
Training Images: 60 K | 50 K | 1.2 M
Testing Images: 10 K | 10 K | 100 K
Selected Datasets and Models
Model | Layers (Conv. / Fully-connected) | Dataset | Framework
LeNet | 2 / 2 | MNIST | CaffeOnSpark, TensorFlowOnSpark
SoftMax Regression | NA / NA | MNIST | TensorFlowOnSpark
CIFAR-10 Quick | 3 / 1 | CIFAR-10 | CaffeOnSpark, TensorFlowOnSpark, MMLSpark
VGG-16 | 13 / 3 | CIFAR-10 | BigDL
AlexNet | 5 / 3 | ImageNet | CaffeOnSpark
GoogLeNet | 22 / 0 | ImageNet | CaffeOnSpark
ResNet-50 | 53 / 1 | Synthetic | TensorFlow
Performance Characterization for CPU-/GPU-based Deep Learning with CaffeOnSpark
[Charts: end-to-end time (secs) vs. number of nodes (one device/node; 1, 2, 4, 8, 16 nodes) for CIFAR-10 Quick and LeNet on MNIST, comparing CPU+OpenBLAS, CPU+MKL, GPU, and GPU+cuDNN, with annotated relative improvements.]
• DL workloads can benefit from the high performance of the DLoBD stacks.
• The network will become a bottleneck at some point if the sub-optimal IPoIB network protocol is used.
• GPU/GPU+cuDNN can get the best performance. GPU + cuDNN degrades at large scale (e.g., 16 nodes).
• For some models, solutions with CPU + MKL may outperform GPU-based solutions.
Performance Characterization for IPoIB and RDMA with CaffeOnSpark and TensorFlowOnSpark (IB EDR)
• CaffeOnSpark benefits from the high performance of RDMA compared to IPoIB once communication overhead becomes significant.
• Our experiments show that the default RDMA design in TensorFlowOnSpark is not fully optimized yet. For MNIST tests, RDMA does not show obvious benefits.
[Charts: end-to-end time (secs) with IPoIB vs. RDMA for CaffeOnSpark on CIFAR-10 Quick and LeNet on MNIST (GPU and GPU+cuDNN; 2, 4, 8, 16 nodes) and for TensorFlowOnSpark on CIFAR-10 (2 GPUs/node) and MNIST (1 GPU/node) with 2, 4, and 8 GPUs; annotated RDMA improvements: 14.2%, 13.3%, 51.2%, 45.6%, 33.01%, and 4.9%.]
Performance Characterization with MMLSpark
[Charts: end-to-end time (secs) vs. number of nodes (one device/node; 1, 2, 4, 8, 16) for CPU+OpenBLAS, CPU+MKL, and GPU+cuDNN; and CIFAR-10 end-to-end time with IPoIB vs. RDMA on 2, 4, 8, and 16 nodes.]
• The GPU + cuDNN solution performs best, up to 55x faster than CPU + OpenBLAS and up to 15x faster than CPU + MKL.
• OpenMPI-based communication over IPoIB and RDMA shows similar performance; the latency and bandwidth of IPoIB in this cluster are sufficient for small models.
• We could not find other benchmarks with bigger models for MMLSpark.
Characterization of Performance and Accuracy
[Charts: accuracy (%) vs. time (secs) with IPoIB and RDMA for AlexNet on ImageNet with CaffeOnSpark and GoogLeNet on ImageNet with CaffeOnSpark.]
• Performance evaluation of CaffeOnSpark (training time to achieve 70% accuracy)
– RDMA reduces the overall time cost by 22% in training AlexNet on ImageNet
– RDMA reduces the overall time cost by 15% in training GoogLeNet on ImageNet
• Performance evaluation of BigDL (training time to achieve 70% accuracy)
– RDMA reduces the overall time cost by 48% in training VGG on CIFAR-10
[Chart: accuracy (%) vs. time (secs) with IPoIB and RDMA for VGG on CIFAR-10 with BigDL; at 70% accuracy, the annotated reductions are 48% (VGG), 22% (AlexNet), and 15% (GoogLeNet).]
Memory and Network Utilization of CaffeOnSpark
• CIFAR-10 Quick model and CIFAR-10 dataset
• GPU-based solutions use less host memory than CPU-based ones, as they mostly use GPU memory.
• The CPU + MKL solution uses host memory more efficiently and has better performance than CPU + OpenBLAS.
• RDMA utilizes the network resources more efficiently than IPoIB in CaffeOnSpark.
• CaffeOnSpark still does not fully utilize the high-throughput characteristic of RDMA and the memory resource.
[Charts: host memory usage (GB) over time (min) for CPU+OpenBLAS, CPU+MKL, GPU, and GPU+cuDNN; and network throughput (MB/s) over time (min) for IPoIB and RDMA.]
• SoftMax Regression model over the MNIST dataset
• Up to 15.5% of the time is spent in the Apache Hadoop YARN scheduler layer
• Up to 18.1% of the execution time is spent in the Spark job execution layer
• The data size is small, so we do not count the time spent on accessing the HDFS layer.
• More effort is needed to reduce the overhead across the different layers of DLoBD stacks
• It may be amortized in long-running deep learning jobs
Performance Overhead across Layers in DLoBD Stacks
[Chart: time (seconds) broken down into actual training, YARN overhead, Spark overhead, and other, for TensorFlowOnSpark vs. native TensorFlow with IPoIB and RDMA.]
• RDMA can benefit DL workloads
– Up to 2.7x speedup with RDMA compared to the IPoIB scheme for deep learning workloads
– RDMA can scale better and utilize resources more efficiently than IPoIB over InfiniBand clusters
• GPU-based DL designs can outperform CPU-based designs, but not always
– For LeNet on MNIST, CPU + MKL achieved better performance than GPU and GPU + cuDNN on 8/16 nodes
• Large room for further improvement in DLoBD stacks!!!
• We need more benchmarks, public datasets, and analysis tools!!!
Insights and Guidance
X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 2017.
X. Lu, H. Shi, R. Biswas, M. H. Javed, and D. K. Panda, DLoBD: A Comprehensive Study on the Emerging Paradigm of Deep Learning over Big Data Stacks, (Under Review).
Outline
• Accelerating Big Data Stacks
• Benchmarking and Characterizing DLoBD Stacks
– CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL
• Accelerating DLoBD Stacks
– BigDL on RDMA-Spark
– TensorFlow
Epoch-Level Evaluation with BigDL on SDSC Comet
[Chart: accuracy (%) and accumulative epoch time (secs) vs. epoch number (1-18) for IPoIB and RDMA, training VGG on CIFAR-10 with BigDL.]
• Epoch-level evaluation of training the VGG model using BigDL on default Spark with IPoIB and on our RDMA-based Spark.
• The RDMA version consistently takes less time than the IPoIB version to finish every epoch.
– RDMA finishes epoch 18 2.6x faster than IPoIB
Scalability Evaluation with BigDL on SDSC Comet
• Using BigDL with IPoIB & RDMA Spark
• For the VGG model trained with BigDL, RDMA-based Spark scales better than default IPoIB Spark
• For 384 CPU cores, 18 epochs, and the same batch size, RDMA takes about 870 seconds while IPoIB takes 2,372 seconds
• A speedup of 2.7x using RDMA for the epoch-level training time
[Chart: accumulative epoch time (secs) vs. number of CPU cores (96, 192, 384) for IPoIB and RDMA.]
Outline
• Accelerating Big Data Stacks
• Benchmarking and Characterizing DLoBD Stacks
– CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL
• Accelerating DLoBD Stacks
– BigDL on RDMA-Spark
– TensorFlow
Overview of gRPC with TensorFlow
Worker services communicate among each other using gRPC, or gRPC+X!
[Diagram: a client and master coordinate worker services (/job:PS/task:0 and /job:Worker/task:0); each worker task runs a gRPC server/client over its CPU and GPU resources.]
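A minimal sketch of how these pieces map onto the TensorFlow 1.x distributed-runtime API: the cluster spec names the PS and worker jobs from the diagram, and the `protocol` argument of `tf.train.Server` selects the channel the worker services use (stock "grpc" here; builds that include the verbs transport also accept "grpc+verbs", while the RDMA-gRPC design evaluated on the following slides accelerates the gRPC channel itself). Host names are placeholders.

```python
# Sketch: two-task TensorFlow 1.x cluster matching the diagram (one PS, one worker).
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],       # placeholder host:port
    "worker": ["worker0.example.com:2222"],   # placeholder host:port
})

# Each task launches a server; worker services exchange tensors over the selected
# protocol, e.g., "grpc" (default sockets) or "grpc+verbs" (RDMA verbs transport).
server = tf.train.Server(cluster, job_name="worker", task_index=0, protocol="grpc")

with tf.device("/job:ps/task:0"):
    w = tf.Variable(tf.zeros([784, 10]))      # parameters hosted on the PS task

with tf.device("/job:worker/task:0"):
    x = tf.placeholder(tf.float32, [None, 784])
    logits = tf.matmul(x, w)                  # worker pulls w from the PS to compute

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
```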
Performance Benefits of RDMA-gRPC with Micro-Benchmarks
RDMA-gRPC RPC Latency
• gRPC-RDMA latency on SDSC-Comet-FDR
– Up to 2.7x performance speedup over IPoIB for the latency of small messages
– Up to 2.8x performance speedup over IPoIB for the latency of medium messages
– Up to 2.5x performance speedup over IPoIB for the latency of large messages
[Charts: RPC latency (us) vs. payload size for Default gRPC and OSU RDMA gRPC: small messages (2 B-8 KB), medium messages (16 KB-512 KB), and large messages (1 MB-8 MB).]
R. Biswas, X. Lu, and D. K. Panda, Accelerating gRPC and TensorFlow with RDMA for High-Performance Deep Learning over InfiniBand, Under Review.
[Chart (4 nodes): images/second vs. batch size (16, 32, 64) for gRPC, Verbs, MPI, and gRPC+RDMA channels; gRPC+RDMA shows up to 35% speedup.]
Performance Benefit for TensorFlow – ResNet-50
[Chart (8 nodes): images/second vs. batch size (16, 32, 64) for gRPC, Verbs, MPI, and gRPC+RDMA channels; gRPC+RDMA shows up to 41% speedup.]
• TensorFlow ResNet-50 performance evaluation on an IB EDR cluster
– Up to 35% performance speedup over IPoIB on 4 nodes
– Up to 41% performance speedup over IPoIB on 8 nodes
[Chart (4 nodes): images/second vs. batch size (16, 32, 64) for gRPC, Verbs, MPI, and gRPC+RDMA channels.]
Performance Benefit for TensorFlow – Inception3
[Chart (8 nodes): images/second vs. batch size (16, 32, 64) for gRPC, Verbs, MPI, and gRPC+RDMA channels.]
• TensorFlow Inception3 performance evaluation on an IB EDR cluster
– Up to 27% performance speedup over IPoIB on 4 nodes
– Up to 36% performance speedup over IPoIB on 8 nodes
• Discussed challenges in benchmarking, characterizing, and accelerating Deep Learning over Big Data (DLoBD) stacks
• RDMA can benefit DL workloads, as shown by our RDMA-Spark, AR-gRPC, and other RDMA designs
• Many other open issues still need to be solved
• Will enable the Big Data and Deep Learning communities to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
Concluding Remarks
The 4th International Workshop on High-Performance Big Data Computing (HPBDC)
HPBDC 2018 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), Vancouver, British Columbia, Canada, May 2018
Workshop Date: May 21st, 2018
Keynote Talk: Prof. Geoffrey Fox, Twister2: A High-Performance Big Data Programming Environment
Six Regular Research Papers and Two Short Research Papers
Panel Topic: Which Framework is the Best for High-Performance Deep Learning: Big Data Framework or HPC Framework?
http://web.cse.ohio-state.edu/~luxi/hpbdc2018
HPBDC 2017 was held in conjunction with IPDPS'17
http://web.cse.ohio-state.edu/~luxi/hpbdc2017
HPBDC 2016 was held in conjunction with IPDPS'16
http://web.cse.ohio-state.edu/~luxi/hpbdc2016
Two More Presentations
• Wednesday (04/11/18) at 11:30 am
Building Efficient Clouds for HPC, Big Data, and Neuroscience Applications over SR-IOV-enabled InfiniBand Clusters
• Thursday (04/12/18) at 04:00 pm
High-Performance Big Data Analytics with RDMA over NVM and NVMe-SSD
[email protected]
http://www.cse.ohio-state.edu/~luxi
Thank You!
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/