
Transcript of Large-Scale Deep Learning - Hebrew University of Jerusalem (2015-10-25)

Page 1:

Large-Scale Deep Learning

Tal Ben-Nun

Advanced Seminar in Deep Learning, October 2015

The Hebrew University of Jerusalem

Page 2:

Outline

• Motivation

• Training Parallelism

• Distributed Deep Learning

• Specialized Hardware

• Further Optimizations

Page 3:

Why Scale Up?

• Enormous amounts of data
  • ImageNet (1k): 180 GB
  • ImageNet (22k): A few TB
  • Industry: Much larger

• Large neural network architectures
  • Around 20 layers deep today, ~100M-2B parameters

• Faster prototyping
  • GoogLeNet GPU throughput: 66-452 images per second
  • Epoch time: tens of hours to days (and weeks)

Page 4:

Outline

• Motivation

• Training Parallelism

• Distributed Deep Learning

• Specialized Hardware

• Further Optimizations

Page 5:

Backpropagation Algorithm

• Update rule: $W_t = g\left(W_{t-1}, \nabla L_t(W_{t-1}), [\text{hyperparameters}\ldots]\right)$

• $\nabla L_t(W)$ is an average direction of the gradient over a mini-batch of size $b$ (a minimal sketch of this update appears after this list):

$$\nabla L_t(W) = \frac{1}{b} \sum_{i=1}^{b} \nabla \ell\left(W; (x_i, y_i)\right)$$

• A CNN is a Directed Acyclic Graph (DAG)

• At each layer in backpropagation, derivatives are estimated w.r.t.:
  • Layer parameters (if necessary)
  • Data (chain rule)

• Additional memory storage required for training
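
As a concrete illustration, here is a minimal NumPy sketch of the mini-batch update above with $g$ taken to be plain SGD; the helper name grad_loss and the least-squares loss are assumptions for the example, not part of the slides.

```python
import numpy as np

def minibatch_sgd_step(W, batch, grad_loss, lr=0.01):
    """One update W_t = g(W_{t-1}, grad L_t(W_{t-1})), with g chosen as plain SGD.

    grad_loss(W, x, y) is a hypothetical helper returning the per-example gradient.
    """
    # Average the per-example gradients over the mini-batch of size b.
    grad = np.mean([grad_loss(W, x, y) for x, y in batch], axis=0)
    return W - lr * grad

# Example: least-squares loss l(W; (x, y)) = 0.5 * (W @ x - y)^2
def grad_loss(W, x, y):
    return (W @ x - y) * x

rng = np.random.default_rng(0)
W = rng.normal(size=3)
batch = [(rng.normal(size=3), 1.0) for _ in range(8)]
W = minibatch_sgd_step(W, batch, grad_loss, lr=0.1)
```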

[Figure: example CNN DAG with Input, Convolution, Pooling, Concatenation, Fully Connected, and Softmax layers]

Page 6:

GoogLeNet [Szegedy et al., 2014]

• ~6.8 million parameters

• 22 layers deep

Szegedy et al. "Going deeper with convolutions." (2014).

Page 7:

Simple CNN Architecture

[Figure: Images → Multi-Convolution → Pooling → Fully Connected → Loss]

Page 8:

Data Parallelism

[Figure: the CNN pipeline (Images → Multi-Convolution → Pooling → Fully Connected → Loss) replicated across Proc. 1, 2, 3, each processor handling a portion of the mini-batch]

Page 9:

Data Parallelism

✓ Good for forward pass (independent)

✓ Backpropagation requires all-to-all communication only when accumulating results

× Requires allocation of all parameters on each processor
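
To make the data-parallel scheme concrete, here is a toy single-process NumPy sketch (the names data_parallel_step and grad_loss are assumptions): each "processor" holds a full parameter copy and a slice of the mini-batch, and gradients are combined only at the end; in a real system that combination would be an all-reduce across processors.

```python
import numpy as np

def data_parallel_step(W, minibatch, grad_loss, num_procs=3, lr=0.01):
    """Toy data parallelism: every processor has a full copy of W and a slice
    of the mini-batch; gradients are combined only when accumulating results.
    Assumes len(minibatch) >= num_procs."""
    shards = np.array_split(np.arange(len(minibatch)), num_procs)
    partials = []
    for shard in shards:                              # conceptually runs in parallel
        g = np.mean([grad_loss(W, *minibatch[i]) for i in shard], axis=0)
        partials.append((len(shard), g))
    # Weighted average (the "all-reduce") reproduces the single-processor step.
    total = sum(n for n, _ in partials)
    grad = sum(n * g for n, g in partials) / total
    return W - lr * grad
```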

Page 10:

Model Parallelism

[Figure: the same CNN pipeline with each layer's parameters partitioned into slices 1-3 across Proc. 1, 2, 3]

Page 11:

Model Parallelism

✓ Parameters can be divided across processors

× Mini-batch has to be copied to all processors

× Backpropagation requires all-to-all communication every layer
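
A matching toy sketch of model parallelism for one fully connected layer: the weight matrix is partitioned by output neurons, so each processor stores only its slice of the parameters, but the input must be broadcast to all slices and the partial outputs gathered at every layer. Function names are illustrative.

```python
import numpy as np

def split_fc_layer(W, num_procs=3):
    """Partition a fully connected weight matrix (out_dim x in_dim) row-wise,
    so each processor stores only its own slice of the parameters."""
    return np.array_split(W, num_procs, axis=0)

def model_parallel_forward(W_shards, x):
    # The mini-batch input x has to be copied to every processor ...
    partials = [Wi @ x for Wi in W_shards]            # conceptually runs in parallel
    # ... and the partial activations are gathered (all-to-all) at every layer.
    return np.concatenate(partials, axis=0)

rng = np.random.default_rng(0)
W, x = rng.normal(size=(12, 8)), rng.normal(size=8)
assert np.allclose(model_parallel_forward(split_fc_layer(W), x), W @ x)
```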

Page 12:

Hybrid Data/Model Parallelism

• Conjecture [Krizhevsky, 2014]: most of the computation is performed in the convolutional portion, while most of the parameters are stored in the fully connected portion

• Proposed solution: use data parallelism on the convolutional portion and model parallelism on the FC portion

Krizhevsky, Alex. "One weird trick for parallelizing convolutional neural networks." (2014).

Page 13:

Hybrid Data/Model Parallelism [Krizhevsky, 2014]

[Figure: convolutional/pooling layers replicated across Proc. 1, 2, 3 (data parallelism), fully connected layers split into slices 1-3 across the same processors (model parallelism)]

Krizhevsky, Alex. "One weird trick for parallelizing convolutional neural networks." (2014).

Page 14:

Hybrid Data/Model Parallelism Results

• AlexNet, ILSVRC 2012:

Krizhevsky, Alex. "One weird trick for parallelizing convolutional neural networks." (2014).

Page 15:

Outline

• Motivation

• Training Parallelism

• Distributed Deep Learning

• Specialized Hardware

• Further Optimizations

Page 16:

Distributed Deep Learning

• Runs on a computer cluster

• Each node runs partially autonomously

• Inter-node communication from time to time

• Best result is gathered from the nodes

• Training data can be split into per-node “shards”


Page 17:

Distributed Deep Learning – Opportunities

• Increased memory:
  • More data
  • More parameters

• Fault tolerance
  • Protection against node crashes

• Improved stochasticity


Page 18:

Distributed Deep Learning – Determining Factors

• Computational independence

• Communication efficiency

• Network congestion

• Load balancing

• Points of failure


Page 19:

Distributed Deep Learning – DistBelief

• Distributed learning infrastructure used at Google [Dean et al., 2012]
  • Handles communications automatically
  • Users define computation and messages sent during fprop/bprop

• Each model replica has the same parameters, but optimizes different data

• Each replica is divided among several machines

• Two distributed optimization schemes for training:
  • Online – Downpour SGD
  • Batch – Sandblaster LBFGS

• Uses a centralized parameter server (several machines, sharded)
  • Handles slow and faulty replicas

Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
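
The sketch below is a toy, single-process stand-in for a sharded parameter server in the spirit of DistBelief; the class and method names (ParameterServer, fetch, push_gradient) are assumptions for illustration, and real deployments place each shard on a separate machine.

```python
import numpy as np

class ParameterServer:
    """Toy centralized parameter store, sharded across `num_shards` slices."""
    def __init__(self, init_params, num_shards=4):
        self.shards = list(np.array_split(np.asarray(init_params, dtype=float),
                                          num_shards))

    def fetch(self):
        # Replicas pull the latest parameters (shards may be mutually out-of-sync).
        return np.concatenate(self.shards)

    def push_gradient(self, grad, lr=0.01):
        # Apply a replica's accumulated gradient to each shard independently.
        offset = 0
        for i, shard in enumerate(self.shards):
            self.shards[i] = shard - lr * grad[offset:offset + len(shard)]
            offset += len(shard)
```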

Page 20:

Asynchronous SGD – HOGWILD!

• To achieve coherency in distributed SGD, nodes must synchronize w.r.t. parameters:

• HOGWILD! [Niu et al., 2011] is a method for asynchronous SGD:

Niu, Feng, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." NIPS 2011.

Page 21:

Asynchronous SGD – HOGWILD!

• HOGWILD!:
  • Proven to converge in sparse problems
  • Provides near-linear scaling
  • Assumes shared-memory architecture (e.g., multicore CPUs)

• Formulates ML problems as hypergraphs $G = (V, E)$, where:

$$w^* = \arg\min_w f(w) = \arg\min_w \sum_{e \in E} f_e(w_e)$$

  • Each hyperedge $e \in E$ represents a subset of the $n$ coordinates
  • $w_e$ is $w$ restricted to the coordinates in $e$

Image: Wikipedia

Niu, Feng, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." NIPS 2011.

Page 22:

Asynchronous SGD – HOGWILD!

• Example: Sparse SVM
  • $E$ represents the training set, denoting the parameters participating in each example as $u \in e$

$$\arg\min_w \sum_i \max\left(0,\ 1 - y_i w^T x_i\right) + \lambda \lVert w \rVert^2$$

is formulated as:

$$\arg\min_w \sum_{e \in E} \left[ \max\left(0,\ 1 - y_e w^T x_e\right) + \lambda \sum_{u \in e} \frac{w_u^2}{d_u} \right]$$

• $d_u$ is the number of training examples whose component $u$ is non-zero

Niu, Feng, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." NIPS 2011.
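
To illustrate why the sparse formulation enables lock-free updates, below is a sketch of a single HOGWILD!-style step for one training example (one hyperedge e): only the coordinates appearing in e are read and written. The function name and the sparse (indices, values) representation are assumptions for the example.

```python
import numpy as np

def hogwild_svm_step(w, x_e, y_e, d, lam=0.01, lr=0.1):
    """One update for a single example (hyperedge e) of the sparse SVM above.
    x_e is sparse, given as (indices, values); d[u] counts the training examples
    whose component u is non-zero. In HOGWILD!, many threads run this without locks."""
    idx, vals = x_e
    margin = y_e * np.dot(w[idx], vals)
    # Subgradient of max(0, 1 - y_e w^T x_e), restricted to the coordinates in e.
    g = -y_e * vals if margin < 1 else np.zeros_like(vals)
    # Gradient of the per-edge regularizer lam * sum_{u in e} w_u^2 / d_u.
    g = g + 2 * lam * w[idx] / d[idx]
    w[idx] -= lr * g                      # touches only the coordinates in e
    return w
```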

Page 23:

Asynchronous SGD – HOGWILD!

• Assuming $f$ is $L$-Lipschitz, strongly convex with constant $c$, and its derivatives are bounded by $M$

• Assuming a constant learning rate $\gamma$ with $\gamma c < 1$

Niu, Feng, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." NIPS 2011.

Page 24:

Distributed Deep Learning – Downpour SGD

• Relaxation of HOGWILD! for distributed systems

• Algorithm (a sketch of the replica loop appears after the citation below):
  • Divide training data into subsets and run a replica on each subset
  • Every $n_{\text{fetch}}$ iterations, fetch up-to-date parameters from the server
  • Every $n_{\text{push}}$ iterations, push local gradients to the server

• Note that parameter shards may be “out-of-sync”

Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
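
A minimal sketch of one Downpour replica's loop, reusing the toy ParameterServer above (fetch/push_gradient are assumed names, not DistBelief's API); n_fetch and n_push control how stale the replica's local view may become.

```python
import numpy as np

def downpour_replica(server, data_shard, grad_loss, n_fetch=5, n_push=5,
                     lr=0.01, num_steps=100):
    """One asynchronous replica: optimizes its own data shard and synchronizes
    with the (sharded) parameter server only every few iterations."""
    w = server.fetch()
    acc_grad = np.zeros_like(w)
    for t in range(num_steps):
        if t % n_fetch == 0:
            w = server.fetch()                   # refresh possibly stale parameters
        x, y = data_shard[t % len(data_shard)]
        g = grad_loss(w, x, y)
        w -= lr * g                              # local update
        acc_grad += g
        if (t + 1) % n_push == 0:
            server.push_gradient(acc_grad, lr)   # send accumulated local gradients
            acc_grad[:] = 0.0
```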

Page 26:

BFGS Nonlinear Optimization Algorithm

• Second-order method ($B_k$ estimates the Hessian)

• Uses line search to determine the learning rate per iteration

• Limited-Memory BFGS (LBFGS) uses the $m$ last updates of $x$ instead of $B_k$

    B_0 = I
    for k = 0 to K:
        Solve B_k p_k = −∇f(x_k) to obtain direction p_k
        Find α_k = argmin_α f(x_k + α p_k)
        Set x_{k+1} = x_k + α_k p_k
        Compute B_{k+1}^{-1} using two rank-1 updates on B_k^{-1}
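
For reference, a minimal (non-distributed) L-BFGS run on a toy least-squares problem using SciPy's off-the-shelf implementation; maxcor plays the role of m, the number of stored updates. This only illustrates the optimizer itself, not Sandblaster's distributed version.

```python
import numpy as np
from scipy.optimize import minimize

# Toy strongly convex objective f(x) = 0.5 * ||A x - b||^2 and its gradient.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)

def f(x):
    r = A @ x - b
    return 0.5 * r @ r

def grad_f(x):
    return A.T @ (A @ x - b)

# L-BFGS keeps only the m most recent updates instead of a full B_k matrix.
res = minimize(f, x0=np.zeros(5), jac=grad_f, method="L-BFGS-B",
               options={"maxcor": 10})           # m = 10 stored corrections
print(res.x)
```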

Page 27:

Distributed Deep Learning – Sandblaster LBFGS

• Coordinator process issues commands (dot product, scaling, multiplication, etc.) to slave nodes, each processing a different parameter shard

• Communication is sparser
  • Most of the information is stored locally
  • Coordinator messages are small
  • Slaves fetch parameters at the beginning of each batch and send gradients once in a while for fault tolerance

• Employs computation replication and load balancing
  • Nodes that finish their job get more jobs
  • If one node is slow, additional nodes get the same job

Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
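
The toy sketch below mimics the coordinator/shard interaction in a single process: each shard holds one slice of the parameter vector and answers small vector-operation commands, so only scalars and per-shard slices ever move. Class and method names (Shard, Coordinator, distributed_dot) are illustrative, not DistBelief's actual interface.

```python
import numpy as np

class Shard:
    """Holds one slice of a long vector and executes small coordinator commands."""
    def __init__(self, values):
        self.v = np.asarray(values, dtype=float)

    def dot(self, other_slice):                  # partial dot product
        return float(self.v @ np.asarray(other_slice))

    def scale(self, alpha):                      # v <- alpha * v
        self.v *= alpha

    def axpy(self, alpha, other_slice):          # v <- v + alpha * other
        self.v += alpha * np.asarray(other_slice)

class Coordinator:
    """Issues commands to the shards; messages stay small."""
    def __init__(self, params, num_shards=3):
        self.shards = [Shard(s) for s in np.array_split(params, num_shards)]

    def distributed_dot(self, other):
        slices = np.array_split(np.asarray(other), len(self.shards))
        # Each shard returns a partial result; the coordinator sums the scalars.
        return sum(shard.dot(sl) for shard, sl in zip(self.shards, slices))

coord = Coordinator(np.arange(9.0))
print(coord.distributed_dot(np.ones(9)))         # 36.0
```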

Page 32:

Distributed Deep Learning – Drawbacks

• Centralized
  • Single point of failure: parameter server(s) / coordinator

• Network congestion can be a bottleneck

• Only very relaxed theoretical guarantees for Downpour SGD

Page 33:

Outline

• Motivation

• Training Parallelism

• Distributed Deep Learning

• Specialized Hardware

• Further Optimizations

Page 34:

Specialized Hardware

• GPU
  • Thousands of cores, massively parallel (5-10 TFLOP/s per card)
  • Multi-GPU nodes further increase training performance (using data/model parallelism)
  • Drawback: hard to program efficiently

• FPGA
  • Specialized for certain operations (e.g., convolutions)
  • Drawbacks: network-specific, harder to program

• Convolutional Processing Units?

Page 35:

Specialized Hardware

Page 36:

Distributed Deep Learning with GPUs

• A distributed GPU-based system [Coates et al., 2013] was shown to run DistBelief-scale problems (1,000 machines) with 3 multi-GPU nodes

• 3 tiers of concurrency: GPU, model parallelism, nodes

Coates et al. "Deep learning with COTS HPC systems." ICML 2013.

[Figure: time for a single mini-batch gradient update of a sparse autoencoder]

Page 37:

Outline

• Motivation

• Training Parallelism

• Distributed Deep Learning

• Specialized Hardware

• Further Optimizations

Page 38:

Further Optimizations

• Convolution operator replacement
  • As Toeplitz matrix multiplication (im2col) – see the sketch after this list
  • As an operation in the frequency domain

• Lowering floating point precision

• Specific compilers/assemblers
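
A minimal sketch of the im2col (Toeplitz) trick for a single-channel 2-D case: every k x k patch is unrolled into a row, so the convolution becomes one dense matrix product, which BLAS libraries and GPUs execute very efficiently. Function names are illustrative.

```python
import numpy as np

def im2col(img, k):
    """Unroll all k x k patches of a 2-D image into the rows of a matrix."""
    H, W = img.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((out_h * out_w, k * k))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = img[i:i + k, j:j + k].ravel()
    return cols, (out_h, out_w)

def conv2d_as_matmul(img, kernel):
    """'Valid' cross-correlation (CNN-style convolution) via one matrix product."""
    cols, out_shape = im2col(img, kernel.shape[0])
    return (cols @ kernel.ravel()).reshape(out_shape)

rng = np.random.default_rng(0)
img, kern = rng.normal(size=(6, 6)), rng.normal(size=(3, 3))
direct = np.array([[np.sum(img[i:i + 3, j:j + 3] * kern) for j in range(4)]
                   for i in range(4)])           # naive reference implementation
assert np.allclose(conv2d_as_matmul(img, kern), direct)
```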

Page 39:

Convolution as FFT

• In convolution, each pixel requires an NxN surrounding region

• Convolution can be computed using FFT [Vasilache et al., 2015]:

• The larger the convolution kernel, the better the performance:

[Figure: convolution performance for 3x3, 5x5, 7x7, and 13x13 kernels, plotted by problem size and output size]

Vasilache et al. "Fast Convolutional Nets With fbfft: A GPU Performance Evaluation." ICLR 2015.
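
A sketch of convolution via the convolution theorem, using NumPy's FFT (the paper uses a custom GPU implementation, fbfft). The transform cost depends on the padded image size rather than on the kernel size, which is why large kernels benefit the most; CNNs actually compute cross-correlation, which additionally requires flipping the kernel, omitted here for brevity.

```python
import numpy as np

def conv2d_fft(img, kernel):
    """'Full' 2-D linear convolution: FFT both inputs, multiply pointwise, inverse FFT."""
    H, W = img.shape
    kh, kw = kernel.shape
    sh, sw = H + kh - 1, W + kw - 1          # padded size avoids circular wrap-around
    F_img = np.fft.rfft2(img, s=(sh, sw))
    F_ker = np.fft.rfft2(kernel, s=(sh, sw))
    return np.fft.irfft2(F_img * F_ker, s=(sh, sw))

rng = np.random.default_rng(0)
img, kern = rng.normal(size=(16, 16)), rng.normal(size=(7, 7))
out = conv2d_fft(img, kern)                  # shape (22, 22)
```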

Page 40:

Sacrificing Accuracy for Performance

• Half-precision (16-bit floating point) [Gupta et al., 2015]
  • Memory is stored in 16-bit format
  • Computations are performed in 32 bits
  • Uses stochastic rounding
    • Goal: preserve the value in expectation, E[Round(x)] = x

Gupta et al. "Deep Learning with Limited Numerical Precision." ICML 2015.
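
A sketch of stochastic rounding to a fixed-point grid with FL fractional bits (saturation to the WL-bit range is omitted): a value is rounded up with probability equal to its fractional distance from the lower grid point, so the rounded value is unbiased in expectation. The function name is an assumption.

```python
import numpy as np

def stochastic_round(x, fl, rng=np.random.default_rng()):
    """Round x to a grid with spacing eps = 2**-fl, unbiased in expectation."""
    eps = 2.0 ** -fl
    scaled = np.asarray(x, dtype=np.float64) / eps
    low = np.floor(scaled)
    # P(round up) = fractional part, so E[stochastic_round(x)] = x.
    up = rng.random(low.shape) < (scaled - low)
    return (low + up) * eps

samples = stochastic_round(np.full(100000, 0.3), fl=2)   # grid spacing 0.25
print(samples.mean())   # close to 0.3; round-to-nearest would always give 0.25
```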

Page 41:

Sacrificing Accuracy for Performance

• Results on MNIST with LeNet:

WL = word length (bits), FL = fractional length (bits)

Gupta et al. "Deep Learning with Limited Numerical Precision." ICML 2015.

Page 42:

Benchmark Insights

According to Convnet Benchmarks (as of October 2015):

• GPU + 16-bit floating point + custom assembler is the fastest

• GPU + 32-bit floating point + FFT is a close runner-up

• FFT-based convolution surpasses direct convolution, unless the kernel size is small (e.g., OxfordNet's 3x3 kernels)

• Performance is close to peak hardware throughput (~4 out of 4.7 TFLOP/s)

• Naïve versions can be >16 times slower