Inferno: Scalable Deep Learning on Spark
Matthias Langer (m.langer@latrobe.edu.au)
Dr. Zhen He (z.he@latrobe.edu.au)
Prof. Wenny Rahayu (w.rahayu@latrobe.edu.au)
Department of Computer Science & Computer Engineering
Topics
• Deep Learning – Introduction
• Spark & Deep Learning
• Our solution: La Trobe University's Deep Learning System
• Conclusion, Timeline, Q&A
Deep Learning – Introduction
Source: CerCo (Brain and Cognition Research Centre), Toulouse
Object/Action Recognition
• Automatic Captioning
• Navigating Artificial Agents
• Deep Learning performs 100% better than the best non-deep-learning algorithms in many Computer Vision tasks!
Source: Research @ Facebook (left), google.com/selfdrivingcar (right)
Voice Recognition
• Deep Learning performs 30% better than the best non-deep learning algorithms!
Natural Language Processing
• Translation
• Thought Vector Q&A
• …
• Deep Learning tends to perform “better” than traditional machine learning algorithms!
Source: Google Inc. / Google Translate
Source: Google Brain; Google, Inc.
Spark & DL – How they could be an ideal tandem, but there are challenges…
Why do you want to use a cluster to train Deep Neural Networks?
This model took about 22 days to train.
I trained it 50 times from scratch until I found hyper-parameters that work well.
Deep Learning is SLOW
Two approaches to speed up DL: Scaling Up vs. Scaling Out
• Scaling Up
  - Superior scaling until fundamental limits of the hardware are reached (max. number of PCIe lanes, max. read speed of the HDD)
  - Costs scale up non-linearly (DGX-1 = $129,000)
  Source: https://developer.nvidia.com/devbox
• Scaling Out
  - Highly scalable
  - No relevant hardware limits
  - Extensible
  [Diagram: a Master node coordinating Workers 1–3]
• You already have all your valuable data in Spark/Hadoop
• DL (often) requires a lot of data to train
• You need a lot of memory
• Pre-processing has massive I/O requirements (disk & network)
More reasons why you would want to use Hadoop/Spark for DL?
How could you implement DL on Spark?
[Diagram: a Spark RDD holds the mini-batches of data; Workers 1–3 and the Master each hold a copy of the model]
1. Draw a mini-batch of data from the Spark RDD.
2. Map: compute an updated model in each worker.
3. Reduce: assemble the worker models into a "better" model via the Master node.
4. Broadcast the "better" model and repeat.
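This loop maps directly onto plain Spark primitives. Below is a generic sketch of one synchronous model-averaging step; the localUpdate function stands in for a real gradient step, and the whole thing is an illustration of the scheme, not Inferno's actual implementation:

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // One synchronous training step: broadcast the model, let every
  // partition compute a locally updated copy, then average the copies.
  def trainStep(
      sc: SparkContext,
      miniBatch: RDD[(Array[Float], Int)],  // (features, label) samples
      weights: Array[Float],
      localUpdate: (Iterator[(Array[Float], Int)], Array[Float]) => Array[Float])
    : Array[Float] = {
    val bcast = sc.broadcast(weights)       // ship the current model to all workers

    // Map: each worker derives an updated model from its slice of the batch.
    val updated = miniBatch.mapPartitions { samples =>
      Iterator.single((localUpdate(samples, bcast.value.clone()), 1L))
    }

    // Reduce: sum the per-worker models on the master...
    val (sum, n) = updated.reduce { case ((w1, n1), (w2, n2)) =>
      (w1.zip(w2).map { case (a, b) => a + b }, n1 + n2)
    }

    // ...and average them into the "better" model for the next round.
    sum.map(_ / n)
  }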
Problem 1: Big Parameters = High shuffle cost!
[Chart: share of iteration time – Compute: 5%, Communication: 95%]
[Diagram: Workers 1–3 and the Master each hold a 500 MB copy of the model]
• Compute updated models: typically 50 – 500 ms
• Reduce models: at best 5 s over 1 GbE
• Broadcast combined model: at best 5 s over 1 GbE
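The ~5 s figure is simple bandwidth arithmetic (assuming an idealized, uncontended 1 Gbit/s link with no protocol overhead):

  $t \geq \frac{500\,\mathrm{MB} \times 8\,\mathrm{bit/byte}}{1\,\mathrm{Gbit/s}} = 4\,\mathrm{s}$

per direction, so one reduce + broadcast round spends roughly 8–10 s on pure transmission while each worker computes for only 50–500 ms.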
Problem 2: Node communication is synchronous
[Diagram: Workers 1–3 exchange their models with the Master in lock-step – the synchronous exchange is the bottleneck!]
La Trobe University DL-System
• Blaze – Scala-based standalone deep learning system (single machine)
• CUBlaze – GPU acceleration for Blaze
• Inferno – coordinates distributed computation of Blaze models in a synchronous Spark environment (cluster)
A (probably biased) comparison
Inferno vs. SparkNet (Caffe), CaffeOnSpark, deeplearning4j and H2O, along these criteria:
• ConvNets, AutoEncoders, etc. ("planned" for one of the compared systems)
• Communication protocol during training – Inferno: Spark MR; SparkNet: Spark MR; CaffeOnSpark: MPI/RDMA; deeplearning4j: Spark MR, among others; H2O: gRPC/MPI/RDMA
• Building complex models (e.g. ResNet) – "some" for one system
• Dynamic branching support (path altering / dropping)
• Pluggable pre-processing pipeline – "partial" for one system
• Pluggable update policies for hyper-parameters
• Pluggable & visualizable online cross-validation
• Entire execution path determined in a single runtime environment
• Model description language – Inferno: JVM code; SparkNet: config file; CaffeOnSpark: config file; deeplearning4j: JVM code; H2O: multiple
• GPU acceleration
Blaze: High-Performance Deep Learning Engine
Module Library
• Standard Modules: Add-Bias (C/U/S/B), Immediate-Filter (C/U/S/B), Convolution, Convolution-Decoder, Linear, Linear-Decoder, Locally-Connected, Locally-Connected-Decoder, L2-Pooling, Max-Pooling, Mean-Pooling, Batch-Normalization, Dropout, LCN, LRN, Normalization (C/U/S/B), Reshape, Weight-Decay (L1/L2)
• Nonlinearities: Abs, Add-Noise, ELU, Exp, Hard-Tanh, LeakyReLU, Ln, Pow, PReLU, ReLU, ReQU, (Log-)Sigmoid, SmoothAbs, (Log-)Softmax, SoftPlus, Sq, Sqrt, SReLU, Tanh
• Optimizers: AdaDelta, AdaGrad, Adam, ConjugateGradientDescent, Rprop, RMSProp, SGD (traditional, local learning rates, momentum)
• Constraints (can be injected everywhere!): BCE, ClassLL, ClassNLL, KLDivergence, MSE
• Containers: Sequence, Auto-Encoder, Branch (Parallel)
• Branching: Alternate-Path, Drop-Path, Random-Path
• Tensor Table Operations: Select, Concatenate (C/U/S/B), Merge (add/mean/lerp)
• Visualization & Benchmarking: Benchmark-Wrapper, Visualize-Histogram, Visualize-MeanAndStdDev (C/U/S/B)

C/U/S/B = these operations can be applied at [C]hannel, [U]nit, [S]ample, or [B]atch level.
Performance – AlexNet OWT
[Chart: forward / backward time in ms –
  Torch (cuDNN): 27 / 53
  TensorFlow: 26 / 55
  CUBlaze (1 GB workspace limit): 37 / 56
  Torch (fbfft): 31 / 72
  cuda-convnet2: 42 / 135
  Caffe (native): 121 / 203
  Torch-7 (native): 132 / 210]
All benchmarks done using NVIDIA TitanX GPUs on comparable setups; Source: https://github.com/soumith/convnet-benchmarks
Performance – VGG A
[Chart: forward / backward time in ms –
  Torch (cuDNN): 162 / 331
  TensorFlow: 158 / 382
  CUBlaze (1 GB workspace limit): 167 / 378
  Torch (fbfft): 355 / 737
  cuda-convnet2: 408 / 821
  Caffe (native): 323 / 745
  Torch-7 (native): 350 / 755]
All benchmarks done using NVIDIA TitanX GPUs on comparable setups; Source: https://github.com/soumith/convnet-benchmarks
How Blaze works (example)
[Diagram: a Data Source (HDD, Spark RDD, HDFS) feeds an Augmenter and an optional fixed-weight Model (fprop only); a Sample Merger combines the results, and an asynchronous Prefetcher keeps a queue of Cached Samples ready. The Optimizer trains the Model (tunable Weights), steered by Hyper-Parameters and Objectives grouped by Scope Delimiters, and reports progress to Terminal, File, Showoff, etc.]
Easy Setup: Model
• Blaze automatically infers most layer parameters based on the actual input
• Usually no need to specify input and output dimensions, or whether to use CPU or GPU

  val noClasses = 100

  // Kernels
  val kernelConv1 = Kernel2D(dims = (11, 11), stride = (4, 4), padding = (2, 2))
  val kernelConv2 = Kernel2D.centered((3, 3))
  val kernelPool  = Kernel2D((3, 3), (2, 2))

  // Layers
  val bias = AddBiasBuilder()
  val relu = ReLUBuilder()
  val lrn  = LateralResponseNormalizationBuilder(n = 5, k = 2, alpha = 1e-4f, beta = 0.75f)
  val pool = MaxPoolingBuilder(kernelPool)

  // Lego!
  val mb = SequenceBuilder(
    ConvolutionFilterBuilder(kernelConv1, 48), bias, relu, pool, lrn,
    ConvolutionFilterBuilder(kernelConv2, 192), bias, relu,
    ConvolutionFilterBuilder(kernelConv2, 128), bias, relu, pool,
    ReshapeBuilder.collapseDimensions(),
    LinearBuilder(noClasses), bias,
    SoftmaxBuilder(),
    ClassLLConstraintBuilder()
  )
Easy Setup: CPU and GPU
• Blaze maintains a variant table for each module type.
• When you "build" an instance of a module, all variants are scored and the "best" variant for the current situation is selected automatically. You can configure what "best" means.

  // Input data
  val data = Array[Batch](...)

  // Inspect batches
  val hints = BuildHints.derive(data)

  // Build compatible model
  val m = mb.build(hints)

  19:25:20 INFO  Scoring ConvolutionFilter[Kernel2[(3, 3), (1, 1)] x 2, 0/1 = filter]:
  19:25:20 DEBUG 0000800a => CUDA_CUDNN, preferred, input type matches
  19:25:20 DEBUG 0000400a => JVM_BLAS_IMPLICITMM, preferred
  19:25:20 DEBUG 00000004 => JVM_BLAS_MM
  19:25:20 DEBUG 0000000a => JVM_BREEZE_MM, preferred
  19:25:20 DEBUG 00000002 => JVM_BREEZE_SPARSEMM
  19:25:20 INFO  CUDA_CUDNN selected!
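The selection mechanism can be pictured roughly like this (a minimal sketch with hypothetical names; not Blaze's actual internals):

  // Minimal sketch of a variant table (illustrative only).
  trait Variant[T] {
    def flags: Long                            // e.g. 0x0000800a in the log above
    def score(hints: BuildHints): Option[Int]  // None = variant not applicable
    def instantiate(hints: BuildHints): T
  }

  // Score every registered variant against the build hints and
  // instantiate the highest-scoring one.
  def selectBest[T](variants: Seq[Variant[T]], hints: BuildHints): T = {
    val scored = variants.flatMap(v => v.score(hints).map(s => (s, v)))
    scored.maxBy(_._1)._2.instantiate(hints)
  }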
Working with large models!

  val mb = SequenceBuilder(...)
  val hints = ...
  val g = mb.toGraph(hints)
  SvgRenderer.render(g)

Visualizing pre-processing pipelines

  val apb = AsynchronousPrefetcherBuilder(...)
  val g = apb.toGraph()
  SvgRenderer.render(g)
Easy Setup: Optimizer

  val ob = MomentumBuilder()

  // Configure hyper-parameters
  ob.learningRate = DiscreteStepsBuilder(
        0 -> 1e-2f,
    40000 -> 1e-3f,
    80000 -> 1e-4f
  )

  // Setup objectives
  ob.objectives += IterationCountLimitBuilder(1000)
                += CrossValidationBuilder(dataSource, ... preprocessing pipeline ...)
                += PrintStatusBuilder() >> FileSinkBuilder(HadoopFileHandle.userHome ++ "results/optimization.log")
                += objectives.Presets.visualizePerformance() >> ShowoffSinkBuilder("Cross Validation Performance")

  // Add more advanced stuff like regularizers
  ...

  // Go!
  val o = ob.build(m, dataSource)
  o.run()
Other Features
• Tensor Memory Management
  - Automatically monitors the dependencies between all tensors
  - Reallocates space occupied by unneeded tensors on the fly
  - Will automatically toggle "inPlace" processing when it is safe
• Intermediate results are stored separately from the model
  - Forward passes yield backpropagation contexts that can be consumed or discarded at any time. A very interesting property for: Live Query/Training, Fancy Optimizers, Hyper-Parameter Search
Saves up to 40% GPU memory during training!
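In user code, this separation might look roughly as follows (hypothetical method names sketched from the description above; not Blaze's published API):

  // Hypothetical: the forward pass returns the prediction together with
  // a backpropagation context that owns all intermediate tensors.
  val (prediction, ctx) = model.forwardWithContext(batch)

  // Live query: use the prediction and drop the context; its tensors
  // are reclaimed immediately.
  // Training / fancy optimizers: consume the context whenever you like.
  val gradients = ctx.backward(errorSignal)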
Inferno: Training Deep Learning Models faster with Apache Spark
Starting an Inferno cluster
[Diagram: a SparkConf is handed to the ClusterCoordinator (build(), then run()), which claims executors on the Master and Workers 1–3, assesses them, tailors the Spark context, and loads plugins (e.g. CUBlaze); a SampleDataRDD then loads data from hdfs://…, creates samples, and is cache()d on each worker]

Loading meta-data of HDFS files
[Chart: Spark binary RDD vs. Inferno FileRDD – 689 s vs. 6 s, and 9999 s vs. 35 s]
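For reference, the stock Spark baseline from the chart can be reproduced with the standard binaryFiles API (the HDFS path below is only illustrative):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("metadata-baseline"))

  // Spark's generic binary-file input: enumerating and splitting many
  // small HDFS files is what makes this meta-data pass so expensive.
  val files = sc.binaryFiles("hdfs:///data/train")  // RDD[(path, PortableDataStream)]
  println(files.count())                            // forces the meta-data scan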
Distributed Optimizer
[Diagram: the InfernoOptimizer wraps a Blaze Model, a Blaze Optimizer and a Pre-processing Pipeline, and drives them over a SampleDataRDD through the ClusterCoordinator (Master plus Workers 1–3). Weights, Hyper-Parameters, Objectives and Scope Delimiters attached at the Inferno level are applied with cluster-wide focus; those attached at the Blaze level are applied independently in each worker.]
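Wiring these pieces together might look roughly like this (the InfernoOptimizerBuilder name and its signatures are assumptions read off the diagram, not released API):

  // Blaze parts, built exactly as on a single machine.
  val mb = SequenceBuilder(/* ... layers ... */)
  val ob = MomentumBuilder()                 // executes inside each worker

  // Hypothetical Inferno wrapper that coordinates the workers through
  // the ClusterCoordinator.
  val iob = InfernoOptimizerBuilder(ob)

  // Objectives attached at the Blaze level apply independently in each
  // worker; objectives attached at the Inferno level apply cluster-wide.
  ob.objectives  += IterationCountLimitBuilder(1000)
  iob.objectives += PrintStatusBuilder()

  val o = iob.build(mb.build(hints), sampleDataRDD)
  o.run()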
Performance – ResNet 34 on ImageNet
[Chart: time to reach 20% Top-1 accuracy –
  Blaze (2 x 8-core Xeon CPU + 1 x NVIDIA TitanX): 2 hours, 42 minutes
  Inferno over 1 GbE (8 x 8-core Xeon CPU + 4 x NVIDIA TitanX): 57 minutes]
Reached 20% Top-1 accuracy 2.84 times faster!
Performance – PreAct ResNet 152 on ImageNet
[Chart: Top-1 and Top-5 accuracy (0–80%) over ~50 h of training, comparing 1x TitanX against an Inferno cluster (5x TitanX, 1 GbE)]
Reached 30% Top-1 accuracy 4.81 times faster using 5 GPUs (vs. a single TitanX)!
Conclusion
• Blaze & CUBlaze
  - Fast
  - Huge, extensible module library
  - Easy to use
• Inferno
  - Allows you to accelerate Blaze DL tasks on Spark
  - Uses Spark MR methods for all data transmissions:*
    - Can run rather nicely alongside other Spark jobs
    - Can be used without high-speed / low-latency equipment (usually required to make RDMA solutions perform well); plain old (and even slow) Ethernet is enough!
* Note that using "Showoff" to visualize progress may open separate HTTP connections to the Showoff-Server.
Where can I get it?
• Blaze & CUBlaze & example code
  - Stable; we have been training models with it for months already.
  - A snapshot of the current stable release is available at: https://github.com/bashimao/ltudl (Apache License 2.0)
• Showoff
  - Multi-purpose live visualization system developed by Aiden Nibali (La Trobe University): https://github.com/anibali/showoff
• Inferno
  - I am writing a paper about Inferno's optimization system right now. Once it has been accepted for publication, we will release the full source code on GitHub.
  - The best way to prepare for Inferno is to download Blaze now and get familiar with it.
Questions?
Matthias Langer, PhD cand. (m.langer@latrobe.edu.au)
Supervisors:
Dr. Zhen He (z.he@latrobe.edu.au)
Prof. Wenny Rahayu (w.rahayu@latrobe.edu.au)
Deep Learning & Spark @ La Trobe
Students
• Master of Data Science degree: http://tinyurl.com/hf4wmn2
  - Advanced data science lab established in 2016 with the newest hardware
  - CSE5BDC: Big Data Management on the Cloud (I tutor this!)
  - CSE5DEV: Data Exploration and Visualization (~50% of lectures on deep learning)
  - CSE5WDC: Web Development on the Cloud
• Research
  - GPU research cluster capable of running distributed deep learning tasks
  - In-house development of a distributed deep learning system
  - Dedicated research group working with various deep learning systems
  - CSE4DLJ: Weekly Deep Learning Journal Club
Industry
• If you have a data analytics problem:
  - … we have a dedicated deep learning research team!
  - … and probably also a deep learning solution for it!
• Spark & Deep Learning workshops for Torch available on demand.
• Past & current machine learning research collaborations: Alfred Hospital, ZenDesk, AIS (Australian Institute of Sport)
• Contact: z.he@latrobe.edu.au