Inferno: Scalable Deep Learning on Spark


Matthias Langer (m.langer@latrobe.edu.au)

Dr. Zhen He (z.he@latrobe.edu.au)

Prof. Wenny Rahayu (w.rahayu@latrobe.edu.au)

Department of Computer Science & Computer Engineering

Topics

• Deep Learning – Introduction

• Spark & Deep Learning

• Our solution: La Trobe University’s Deep Learning System

• Conclusion, Timeline, Q&A

Deep Learning: Introduction

Source: CerCo (Brain and Cognition Research Centre), Toulouse

Object/Action Recognition

• Automatic Captioning
• Navigating Artificial Agents
• Deep Learning performs 100% better than the best non-deep-learning algorithms in many computer vision tasks!

Source: Research @ Facebook (left), google.com/selfdrivingcar (right)

Voice Recognition

• Deep Learning performs 30% better than the best non-deep learning algorithms!

Natural Language Processing

• Translation

• Thought Vector Q&A

• …

• Deep Learning tends to perform “better” than traditional machine learning algorithms!

Source: Google Inc. / Google Translate

Source: Google Brain; Google, Inc.

Spark & DL: How they could be an ideal tandem, but there are challenges…

Why do you want to use a cluster to train Deep Neural Networks?

This model took about 22 days to train.

I trained it 50x from scratch until I found hyper-parameters that work well.

Deep Learning is SLOW

Two approaches to speed up DL

Scaling Up
• Superior scaling until fundamental limits of the hardware are reached:
  – Max. number of PCIe lanes
  – Max. read speed of the HDD
• Costs scale up non-linearly (DGX-1 = $129,000)
Source: https://developer.nvidia.com/devbox

Scaling Out
• Highly scalable
• No relevant hardware limits
• Extensible
[Diagram: a Master node coordinating Worker 1, Worker 2, and Worker 3]

• You already have all your valuable data in Spark/Hadoop
• DL (often) requires a lot of data to train
• You need a lot of memory
• Pre-processing has massive I/O requirements (disk & network)

More reasons why you would want to use Hadoop/Spark for DL?


How could you implement DL on Spark?

[Diagram: a Spark RDD holds mini-batches of data; Worker 1–3 each hold a copy of the model (e.g. $b_2 x_2 + b_3 x_3 + \dots$), coordinated by the Master.]

1. Draw a mini-batch of data from the Spark RDD.
2. Map: compute an updated model in each worker.
3. Reduce: assemble the workers' results into a “better” model via the Master node.
4. Broadcast the “better” model and repeat.
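To make the loop concrete, here is a minimal, self-contained sketch of this scheme in plain Spark, assuming a toy linear model and synchronous model averaging; every name, value, and design choice below is ours for illustration and is not Inferno's implementation:

import org.apache.spark.{SparkConf, SparkContext}

object SyncSgdSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sync-sgd").setMaster("local[*]"))

    // Toy dataset of (features, label) pairs; w = (1.0, 2.0) solves it exactly.
    val data = sc.parallelize(Seq(
      (Array(1.0, 2.0), 5.0),
      (Array(2.0, 1.0), 4.0),
      (Array(3.0, 3.0), 9.0)
    ), 3).cache()

    var weights = Array(0.0, 0.0)
    val lr      = 0.05

    for (_ <- 1 to 100) {
      val bcW = sc.broadcast(weights) // broadcast the current model

      // Map: every partition refines its own copy of the model on local data.
      val modelSum = data.mapPartitions { it =>
        var w = bcW.value.clone()
        it.foreach { case (x, y) =>
          val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
          w = w.zip(x).map { case (wi, xi) => wi - lr * err * xi }
        }
        Iterator.single(w)
      // Reduce: assemble the per-worker models into a "better" model...
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })

      // ...here simply their average, which the next iteration re-broadcasts.
      weights = modelSum.map(_ / data.getNumPartitions)
      bcW.destroy()
    }
    println(weights.mkString("w = (", ", ", ")"))
  }
}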

Problem 1: Big parameters = high shuffle cost!

[Pie chart: ~5% compute vs. ~95% communication per iteration.]

[Diagram: Worker 1–3 each send their 500 MB model to the Master (reduce), which sends a 500 MB combined model back (broadcast).]

• Compute updated models: typically 50–500 ms
• Reduce models: at best 5 s over 1 GbE
• Broadcast combined model: at best 5 s over 1 GbE
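To sanity-check these numbers: 1 GbE moves at most 1 Gbit/s = 125 MB/s, so a single 500 MB model transfer already needs 500 MB / 125 MB/s = 4 s. With serialization and protocol overhead, ~5 s each for the reduce and the broadcast really is a best case, paid for every mini-batch that itself only costs 50–500 ms of compute.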

Problem 2: Node communication is synchronous

[Diagram: Worker 1–3 all exchange their models with the Master at the same time; the synchronous reduce/broadcast through the Master is marked “Bottleneck!”.]
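Putting both problems together: one synchronous iteration costs roughly 0.5 s compute + 5 s reduce + 5 s broadcast ≈ 10.5 s, of which only about 5% is computation. That is exactly the 5%/95% compute-to-communication split shown above, and every worker sits idle while the Master shuffles weights.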

La Trobe University DL System

Single machine:
• Blaze – Scala-based standalone deep learning system
• CUBlaze – GPU acceleration for Blaze

Cluster:
• Inferno – coordinates distributed computation of Blaze models in a synchronous Spark environment

A (probably biased) comparison: Inferno vs. SparkNet (Caffe), CaffeOnSpark, deeplearning4j, H2O

• Communication protocol during training:
  Inferno: Spark MR | SparkNet: Spark MR | CaffeOnSpark: MPI/RDMA | deeplearning4j: Spark MR, among others | H2O: Grpc/MPI/RDMA
• Model description language:
  Inferno: JVM code | SparkNet: config file | CaffeOnSpark: config file | deeplearning4j: JVM code | H2O: multiple
• Further feature rows, where support ranges from full over “some”/“partial” to “planned” depending on the system:
  ConvNets, AutoEncoders, etc.; building complex models (e.g. ResNet); dynamic branching support (path altering/dropping); pluggable preprocessing pipeline; pluggable update policies for hyper-parameters; pluggable & visualizable online cross-validation; entire execution path determined in a single runtime environment; GPU acceleration


Blaze: High-Performance Deep Learning Engine

Module Library

• Standard Modules: Add-Bias (C/U/S/B), Immediate-Filter (C/U/S/B), Convolution, Convolution-Decoder, Linear, Linear-Decoder, Locally-Connected, Locally-Connected-Decoder, L2-Pooling, Max-Pooling, Mean-Pooling, Batch-Normalization, Dropout, LCN, LRN, Normalization (C/U/S/B), Reshape, Weight-Decay (L1/L2)
• Nonlinearities: Abs, Add-Noise, ELU, Exp, Hard-Tanh, LeakyReLU, Ln, Pow, PReLU, ReLU, ReQU, (Log-)Sigmoid, SmoothAbs, (Log-)Softmax, SoftPlus, Sq, Sqrt, SReLU, Tanh
• Optimizers: AdaDelta, AdaGrad, Adam, ConjugateGradientDescent, Rprop, RMSProp, SGD (traditional, local learning rates, momentum)
• Constraints (can be injected everywhere!): BCE, ClassLL, ClassNLL, KLDivergence, MSE
• Containers: Sequence, Auto-Encoder, Branch (Parallel)
• Branching: Alternate-Path, Drop-Path, Random-Path
• Tensor Table Operations: Select, Concatenate (C/U/S/B), Merge (add/mean/lerp)
• Visualization & Benchmarking: Benchmark-Wrapper, Visualize-Histogram, Visualize-MeanAndStdDev (C/U/S/B)

C/U/S/B = these operations can be applied either on [C]hannel, [U]nit, [S]ample, or [B]atch level.

Performance – AlexNet OWT

Framework                  forward (ms)   backward (ms)
Torch (CuDNN)                    27             53
TensorFlow                       26             55
CUBlaze (1 GB WS limit)          37             56
Torch (fbfft)                    31             72
cuda-convnet2                    42            135
Caffe (native)                  121            203
Torch-7 (native)                132            210

All benchmarks done using NVIDIA TitanX GPUs on comparable setups; Source: https://github.com/soumith/convnet-benchmarks

Performance – VGG A

Framework                  forward (ms)   backward (ms)
Torch (CuDNN)                   162            331
TensorFlow                      158            382
CUBlaze (1 GB WS limit)         167            378
Torch (fbfft)                   355            737
cuda-convnet2                   408            821
Caffe (native)                  323            745
Torch-7 (native)                350            755

All benchmarks done using NVIDIA TitanX GPUs on comparable setups; Source: https://github.com/soumith/convnet-benchmarks

How Blaze works (example)

[Diagram: a Data Source (HDD, Spark RDD, HDFS) feeds a Prefetcher that holds cached samples. An Augmenter (a model run fprop-only, with fixed weights) and a Sample Merger turn them into training batches. The Optimizer drives the trainable Model (tunable weights), configured through Hyper-Parameters, Objectives, and Scope Delimiters, and reports to Terminal, File, Showoff, etc.]
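As a rough illustration of the prefetching stage, here is a toy simplification in plain Scala; it is our own sketch, not Blaze's implementation, and all names in it are invented:

import java.util.concurrent.ArrayBlockingQueue

object PrefetchSketch {
  type Sample = Array[Float]

  def main(args: Array[String]): Unit = {
    val cache = new ArrayBlockingQueue[Sample](8) // the "Cached Sample" slots

    // Prefetcher: a background thread that loads and augments samples.
    val prefetcher = new Thread(new Runnable {
      def run(): Unit = for (i <- 0 until 32) {
        val raw       = Array.fill(4)(i.toFloat) // "load" a sample from the data source
        val augmented = raw.map(_ + 0.1f)        // fprop-only augmenter (fixed weights)
        cache.put(augmented)                     // blocks while the cache is full
      }
    })
    prefetcher.start()

    // Sample merger: drain the cache into batches the optimizer would consume.
    for (batch <- 1 to 8) {
      val merged = Array.fill(4)(cache.take())   // blocks until 4 samples are ready
      println(s"batch $batch ready: ${merged.length} samples")
    }
    prefetcher.join()
  }
}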

Easy Setup: Model

• Blaze automatically infers most layer parameters based on the actual input

• Usually no need to specify input and output dimensions or whether to use CPU or GPU

val noClasses = 100

// Kernels
val kernelConv1 = Kernel2D(dims = (11, 11), stride = (4, 4), padding = (2, 2))
val kernelConv2 = Kernel2D.centered((3, 3))
val kernelPool  = Kernel2D((3, 3), (2, 2))

// Layers
val bias = AddBiasBuilder()
val relu = ReLUBuilder()
val lrn  = LateralResponseNormalizationBuilder(n = 5, k = 2, alpha = 1e-4f, beta = 0.75f)
val pool = MaxPoolingBuilder(kernelPool)

// Lego!
val mb = SequenceBuilder(
  ConvolutionFilterBuilder(kernelConv1, 48), bias, relu, pool, lrn,
  ConvolutionFilterBuilder(kernelConv2, 192), bias, relu,
  ConvolutionFilterBuilder(kernelConv2, 128), bias, relu, pool,
  ReshapeBuilder.collapseDimensions(),
  LinearBuilder(noClasses), bias,
  SoftmaxBuilder(),
  ClassLLConstraintBuilder())

Easy Setup: CPU and GPU

• Blaze maintains a variant table for each module type.

• When you “build” an instance of a module, all variants are scored and the “best” variant for the current situation is selected automatically. You can configure what “best” means.

// Input data
val data = Array[Batch](...)

// Inspect batches
val hints = BuildHints.derive(data)

// Build a compatible model
val m = mb.build(hints)

19:25:20 INFO  Scoring ConvolutionFilter[Kernel2[(3, 3), (1, 1)] x 2, 0/1 = filter]:
19:25:20 DEBUG 0000800a => CUDA_CUDNN, preferred, input type matches
19:25:20 DEBUG 0000400a => JVM_BLAS_IMPLICITMM, preferred
19:25:20 DEBUG 00000004 => JVM_BLAS_MM
19:25:20 DEBUG 0000000a => JVM_BREEZE_MM, preferred
19:25:20 DEBUG 00000002 => JVM_BREEZE_SPARSEMM
19:25:20 INFO  CUDA_CUDNN selected!
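The mechanism can be pictured with a toy, self-contained sketch; Hints, Variant, and the scores below are invented for illustration and are not Blaze's actual types or values:

object VariantDemo {
  // A build-time description of the current situation.
  case class Hints(gpuAvailable: Boolean, sparseInput: Boolean)
  // One implementation variant plus its situational score (< 0 = unusable).
  case class Variant(name: String, score: Hints => Int)

  val convolutionVariants = Seq(
    Variant("CUDA_CUDNN",          h => if (h.gpuAvailable) 100 else -1),
    Variant("JVM_BLAS_IMPLICITMM", _ => 50),
    Variant("JVM_BREEZE_SPARSEMM", h => if (h.sparseInput) 60 else 5)
  )

  // "Building" a module scores all variants and picks the best usable one.
  def select(variants: Seq[Variant], hints: Hints): String =
    variants.filter(_.score(hints) >= 0).maxBy(_.score(hints)).name

  def main(args: Array[String]): Unit = {
    println(select(convolutionVariants, Hints(gpuAvailable = true,  sparseInput = false))) // CUDA_CUDNN
    println(select(convolutionVariants, Hints(gpuAvailable = false, sparseInput = true)))  // JVM_BREEZE_SPARSEMM
  }
}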

Working with large models!

val mb = SequenceBuilder(...)
val hints = ...
val g = mb.toGraph(hints)
SvgRenderer.render(g)

Visualizing pre-processing pipelines

val apb = AsynchronousPrefetcherBuilder(...)
val g = apb.toGraph()
SvgRenderer.render(g)

Easy Setup: Optimizer

val ob = MomentumBuilder()

// Configure hyper-parameters
ob.learningRate = DiscreteStepsBuilder(
      0 -> 1e-2f,
  40000 -> 1e-3f,
  80000 -> 1e-4f
)

// Setup objectives
ob.objectives +=
  IterationCountLimitBuilder(1000) +=
  CrossValidationBuilder(dataSource, ... preprocessing pipeline ...) +=
  PrintStatusBuilder() >> FileSinkBuilder(HadoopFileHandle.userHome ++ "results/optimization.log") +=
  objectives.Presets.visualizePerformance() >> ShowoffSinkBuilder("Cross Validation Performance")

// Add more advanced stuff like regularizers...

// Go!
val o = ob.build(m, dataSource)
o.run()

Other Features

• Tensor Memory Management
  – Automatically monitors the dependencies between all tensors
  – Reallocates space occupied by unneeded tensors on the fly
  – Automatically toggles “inPlace” processing when it is safe
  – Saves up to 40% GPU memory during training!

• Intermediate results are stored separately from the model
  – Forward passes yield backpropagation contexts that can be consumed or discarded at any time (see the sketch below). A very interesting property for: live query/training, fancy optimizers, hyper-parameter search
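A hypothetical sketch of why detached backpropagation contexts are so convenient; forward, backward, and BackpropContext are assumed names for illustration, not Blaze's actual API:

object BackpropContextSketch {
  // Everything the backward pass would need, kept outside the model.
  case class BackpropContext(intermediates: Seq[Array[Float]])

  trait Model {
    // forward() yields the output AND a context that backprop could consume.
    def forward(batch: Array[Float]): (Array[Float], BackpropContext)
    def backward(ctx: BackpropContext, gradOut: Array[Float]): Unit
  }

  // Live queries and training share one code path: both run forward(), but
  // only training consumes the context; otherwise it is simply discarded
  // and its memory can be reclaimed immediately.
  def step(m: Model, batch: Array[Float], train: Boolean): Array[Float] = {
    val (out, ctx) = m.forward(batch)
    if (train) m.backward(ctx, gradOut = out.map(_ * 2.0f)) // toy loss gradient
    out
  }
}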


Inferno: Training Deep Learning Models Faster with Apache Spark

Starting an Inferno cluster

[Diagram: a SparkConf is handed to the ClusterCoordinator, which sets up the Master and Worker 1–3 as an Inferno cluster.]

FileRDD

Loading meta-data of HDFS files:
[Chart: Spark BinaryRDD vs. Inferno FileRDD; 689 s vs. 6 s in one benchmark, 9999 s vs. 35 s in another.]

[Diagram: the ClusterCoordinator claims, assesses, and tailors the Spark context. A SampleDataRDD is built by loading hdfs://… paths, creating samples, and loading plugins (e.g. CUBlaze); build() constructs the distributed optimizer, run() starts it, and the intermediate RDDs are cache()d.]

Distributed Optimizer

[Diagram: on the Master, the InfernoOptimizer wraps the ClusterCoordinator and the SampleDataRDD; each worker runs a Blaze model, a Blaze optimizer, and a pre-processing pipeline with its own weights, hyper-parameters, objectives, and scope delimiters. Some hyper-parameters and objectives are applied with cluster-wide focus; others are applied independently in each worker.]

Performance – ResNet 34 on ImageNet

• Blaze (2 x 8-core Xeon CPU + 1 x NVIDIA TitanX): 2 hours, 42 minutes
• Inferno over 1 GbE (8 x 8-core Xeon CPU + 4 x NVIDIA TitanX): 57 minutes

Reached 20% Top-1 accuracy 2.84 times faster (162 min / 57 min ≈ 2.84)!

Performance – PreAct ResNet 152 on ImageNet

[Chart: Top-1 and Top-5 accuracy (0–80%) over training time (0–50 h) for 1x TitanX vs. an Inferno cluster (5x TitanX, 1 GbE).]

Reached 30% Top-1 accuracy 4.81 times faster using 5 GPUs!

Conclusion

• Blaze & CUBlaze
  – Fast
  – Huge, extensible module library
  – Easy to use

• Inferno
  – Allows you to accelerate Blaze DL tasks on Spark
  – Uses Spark MR methods for all data transmissions:*
    Can run rather nicely alongside other Spark jobs.
    Can be used without high-speed / low-latency equipment (usually required to make RDMA solutions perform well); plain old (and even slow) Ethernet is enough!

* Note that using “Showoff” to visualize progress may open separate HTTP connections to the Showoff server.

Where can I get it?

• Blaze & CUBlaze & example code
  Stable; we have already been training models with it for months. A snapshot of the current stable release is available at https://github.com/bashimao/ltudl (Apache License 2.0).

• Showoff
  Multi-purpose live visualization system developed by Aiden Nibali (La Trobe University): https://github.com/anibali/showoff

• Inferno
  I am writing a paper about Inferno's optimization system right now. Once it has been accepted for publication, we will release the full source code on GitHub. The best way to prepare for Inferno is to download Blaze now and get familiar with it.

Questions?

Matthias Langer, PhD cand. (m.langer@latrobe.edu.au)

Supervisors:
Dr. Zhen He (z.he@latrobe.edu.au)
Prof. Wenny Rahayu (w.rahayu@latrobe.edu.au)

Deep Learning & Spark @ La Trobe

Students
• Master of Data Science degree: http://tinyurl.com/hf4wmn2
  Advanced data science lab established in 2016 with the newest hardware.
• CSE5BDC – Big Data Management on the Cloud (I tutor this!)
• CSE5DEV – Data Exploration and Visualization (~50% of lectures on deep learning)
• CSE5WDC – Web Development on the Cloud

Research
• GPU research cluster capable of running distributed deep learning tasks.
• In-house development of a distributed deep learning system.
• Dedicated research group working with various deep learning systems.
• CSE4DLJ – Weekly Deep Learning Journal Club

Industry
• If you have a data analytics problem:
  … we have a dedicated deep learning research team!
  … and probably also a deep learning solution for it!
• Spark & Deep Learning workshops for Torch available on demand.
• Past & current machine learning research collaborations: Alfred Hospital, ZenDesk, AIS (Australian Institute of Sport)
• Contact: z.he@latrobe.edu.au