Notes from the 2016 Bay Area Deep Learning School

Page 1

Summary of Bay Area Deep Learning School

Niketan Pansare

Page 2: Agenda

• Summary
• Why Deep Learning is gaining popularity?
• Introduction to Deep Learning
• Case study of state-of-the-art networks
• How to train them
• Tricks of the trade
• Overview of the existing deep learning stack

Page 3: Summary

• 1300 applicants for 500 spots (industry + academia)
• Videos are online:
  • Day 1: https://www.youtube.com/watch?v=eyovmAtoUx0
  • Day 2: https://www.youtube.com/watch?v=9dXiAecyJrY
• Mostly high-quality talks from different areas:
  • Computer Vision (Karpathy – OpenAI), Speech (Coates – Baidu), NLP (Socher – Salesforce, Quoc Le – Google), Unsupervised Learning (Salakhutdinov – CMU), Reinforcement Learning (Schulman – OpenAI)
  • Tools (TensorFlow/Theano/Torch)
  • Overview/vision talks (Ng, Bengio and Larochelle)
• Networking:
  • Keras contributor (working in a startup) – CNTK integration, potential for SystemML integration
  • TensorFlow users in Google
  • Discussion on “dynamic operator placement” described in the whitepaper

Page 4: Why Deep Learning is gaining popularity?

Page 5: Why Deep Learning is gaining popularity?

• Efficacy of larger networks

Reference: Andrew Ng (Spark Summit 2016).

Page 6: Why Deep Learning is gaining popularity?

• Efficacy of larger networks
  • Train a large network on a large amount of data
  • Relative ordering of models is not well defined for small data

Reference: Andrew Ng (Spark Summit 2016).

Page 7: Why Deep Learning is gaining popularity?

• Efficacy of larger networks
• Large amount of data, e.g.:
  • Caltech101 dataset (by Fei-Fei Li)
  • Google Street View House Numbers (SVHN) dataset
  • CIFAR-10 dataset
  • Flickr 30K Images

Page 8: Why Deep Learning is gaining popularity?

• Efficacy of larger networks
• Large amount of data
• Compute power necessary to train larger networks
  • VGG: ~2-3 weeks of training with 4 GPUs
  • ResNet-101: 2-3 weeks with 4 GPUs
  • Rocket Fuel*

Page 9: Why Deep Learning is gaining popularity?

• Efficacy of larger networks
• Large amount of data
• Compute power necessary to train larger networks
• Techniques/algorithms/networks to deal with training issues
  • Non-linearities, batch normalization, dropout, ensembles
  • Will discuss these in detail later

Page 10: Why Deep Learning is gaining popularity?

• Efficacy of larger networks
• Large amount of data
• Compute power necessary to train larger networks
• Techniques/algorithms/networks to deal with training issues
• Success stories in vision, speech and text

Page 11: Why Deep Learning is gaining popularity?

• Efficacy of larger networks
• Large amount of data
• Compute power necessary to train larger networks
• Techniques/algorithms/networks to deal with training issues
• Success stories in vision, speech and text
• No feature engineering

Page 12: Why Deep Learning is gaining popularity?

• Efficacy of larger networks
• Large amount of data
• Compute power necessary to train larger networks
• Techniques/algorithms/networks to deal with training issues
• Success stories in vision, speech and text
• No feature engineering
• Transfer learning + open source (network, learned weights, dataset, as well as codebase)
  • https://github.com/BVLC/caffe/wiki/Model-Zoo
  • https://github.com/KaimingHe/deep-residual-networks
  • https://github.com/facebook/fb.resnet.torch
  • https://github.com/baidu-research/warp-ctc
  • https://github.com/NervanaSystems/ModelZoo

Page 13-14: Why Deep Learning is gaining popularity?

• Efficacy of larger networks
• Large amount of data
• Compute power necessary to train larger networks
• Techniques/algorithms/networks to deal with training issues
• Success stories in vision, speech and text
• No feature engineering
• Transfer learning + open source (network, learned weights, dataset, as well as codebase)
• Tooling support for rapid iteration/experimentation
  • Auto-differentiation, general-purpose optimizers (SGD variants)
  • Layered architecture
  • TensorBoard

Will skip: RNN, LSTM, CTC, parameter server, unsupervised and reinforcement deep learning.

Page 15: Not covered in this talk

• DL for Speech (covers CTC + speech pipeline):
  • https://youtu.be/9dXiAecyJrY?t=3h49m40s
  • https://github.com/baidu-research/ba-dls-deepspeech
• DL for NLP (covers word embeddings, RNN, LSTM, seq2seq):
  • https://youtu.be/eyovmAtoUx0?t=3h51m45s (Richard Socher)
  • https://youtu.be/9dXiAecyJrY?t=7h4m12s (Quoc Le)
• Deep Unsupervised Learning (covers RBM, autoencoders, …):
  • https://youtu.be/eyovmAtoUx0?t=7h7m54s
• Deep Reinforcement Learning (covers Q-learning, policy gradients):
  • https://youtu.be/9dXiAecyJrY?t=7m43s
• Tutorials (TensorFlow, Torch, Theano):
  • https://github.com/wolffg/tf-tutorial/
  • https://github.com/alexbw/bayarea-dl-summerschool
  • https://github.com/lamblin/bayareadlschool

Page 16: Introduction to Deep Learning

Page 17: Different abstractions for Deep Learning

• Deep Learning pipeline
• Deep Learning task, e.g., CNN + classifier => image captioning, localization, …
• Deep Neural Network, e.g., CNN: AlexNet, GoogLeNet, …
• Layer, e.g., convolution, pooling, …

Page 18: Common layers

• Fully connected layer

Reference: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
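As a quick illustration (my sketch, not from the slides), a fully connected layer is just an affine map; the shapes below are arbitrary:

```python
import numpy as np

def fc_forward(x, W, b):
    """Fully connected layer: every input unit connects to every output unit."""
    return x @ W + b              # (batch, d_in) @ (d_in, d_out) + (d_out,)

x = np.random.randn(32, 784)      # a mini-batch of 32 flattened 28x28 images
W = np.random.randn(784, 100) * 0.01
b = np.zeros(100)
print(fc_forward(x, W, b).shape)  # (32, 100)
```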

Page 19: Common layers

• Fully connected layer
• Convolution layer
  • Fewer parameters compared to FC
  • Useful to capture local features (spatially)
  • Output #channels = #filters

Reference: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
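A naive NumPy sketch of the convolution forward pass (stride 1, no padding; shapes are illustrative). It makes the points above concrete: filter weights are shared across spatial locations, so there are far fewer parameters than in an FC layer, and the number of output channels equals the number of filters:

```python
import numpy as np

def conv2d_forward(x, filters):
    """x: (C_in, H, W); filters: (C_out, C_in, k, k) -> (C_out, H-k+1, W-k+1)."""
    c_out, c_in, k, _ = filters.shape
    _, h, w = x.shape
    out = np.zeros((c_out, h - k + 1, w - k + 1))
    for f in range(c_out):                  # one output channel per filter
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                out[f, i, j] = np.sum(x[:, i:i+k, j:j+k] * filters[f])
    return out

x = np.random.randn(3, 8, 8)                # 3 input channels
filters = np.random.randn(16, 3, 3, 3)      # 16 filters => 16 output channels
print(conv2d_forward(x, filters).shape)     # (16, 6, 6)
```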

Page 20: Common layers

• Fully connected layer
• Convolution layer
• Pooling layer
  • Useful to tolerate feature deformation such as local shifts
  • Output #channels = input #channels

Reference: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
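A matching max-pooling sketch (again NumPy, non-overlapping k x k windows): the channel count is unchanged, and small spatial shifts within a window do not change the output:

```python
import numpy as np

def maxpool2d(x, k=2):
    """x: (C, H, W) -> (C, H//k, W//k); #output channels = #input channels."""
    c, h, w = x.shape
    x = x[:, :h - h % k, :w - w % k]                  # crop to a multiple of k
    return x.reshape(c, h // k, k, w // k, k).max(axis=(2, 4))

print(maxpool2d(np.random.randn(16, 6, 6)).shape)     # (16, 3, 3)
```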

Page 21: Common layers

• Fully connected layer
• Convolution layer
• Pooling layer
• Activations
  • Sigmoid
  • Tanh
  • ReLU

Reference: Introduction to Feedforward Neural Networks - Larochelle. https://dl.dropboxusercontent.com/u/19557502/hugo_dlss.pdf, http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Page 22: Sigmoid

• Squashes the neuron’s pre-activations into [0, 1]
• Historically popular
• Disadvantages:
  • The gradient vanishes as activations grow in magnitude (i.e., saturated neurons)
  • Sigmoid outputs are not zero-centered
  • exp() is a bit compute-expensive

Reference: Introduction to Feedforward Neural Networks - Larochelle. http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
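A small numeric check of the saturation problem (my NumPy sketch, not from the slides): the sigmoid’s gradient s(a)(1 - s(a)) peaks at 0.25 and collapses toward zero as |a| grows, which is exactly the vanishing-gradient behavior described above (tanh saturates the same way):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)          # at most 0.25, reached at a = 0

for a in [0.0, 2.0, 5.0, 10.0]:
    print(f"a={a:5.1f}  sigmoid={sigmoid(a):.5f}  grad={sigmoid_grad(a):.2e}")
# grad drops from 2.5e-01 at a=0 to ~4.5e-05 at a=10: saturated neurons stop learning
```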

Page 23: Tanh

• Squashes the neuron’s pre-activations into [-1, 1]
• Advantage:
  • Zero-centered
• Disadvantages:
  • The gradient vanishes as activations grow in magnitude
  • exp() is compute-expensive

Reference: Introduction to Feedforward Neural Networks - Larochelle. http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Page 24: ReLU (Rectified Linear Units)

• max(0, a): bounded below by 0 (always non-negative)
• Advantages:
  • Does not saturate (in the positive region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g., 6x)
• Disadvantage:
  • Tends to blow up the activations
• Alternatives:
  • Leaky ReLU: max(0.001*a, a)
  • Parametric ReLU: max(alpha*a, a), with alpha learned
  • Exponential ReLU (ELU): a if a > 0; else alpha*(exp(a)-1)

Reference: Introduction to Feedforward Neural Networks - Larochelle. http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
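The variants listed above in a few lines of NumPy (a sketch; the alpha values are illustrative):

```python
import numpy as np

def relu(a):           return np.maximum(0, a)
def leaky_relu(a):     return np.maximum(0.001 * a, a)           # small fixed slope
def prelu(a, alpha):   return np.maximum(alpha * a, a)           # slope alpha is learned
def elu(a, alpha=1.0): return np.where(a > 0, a, alpha * (np.exp(a) - 1))
```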

Page 25: Which non-linearity to use?

• According to Hinton, why did deep learning not catch on earlier?
  • Our labeled datasets were thousands of times too small.
  • Our computers were millions of times too slow.
  • We initialized the weights in a stupid way.
  • We used the wrong type of non-linearity (i.e., sigmoid/tanh).
• Which non-linearity to use => ReLU, according to:
  • LeCun: http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf
  • Hinton: http://www.cs.toronto.edu/~fritz/absps/reluICML.pdf
  • Bengio: https://www.utc.fr/~bordesan/dokuwiki/_media/en/glorot10nipsworkshop.pdf
• If not satisfied with ReLU:
  • Double-check the learning rates
  • Then try Leaky ReLU / ELU
  • Then try tanh, but don’t expect much
  • Don’t use sigmoid

Reference: Introduction to Feedforward Neural Networks - Larochelle. http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf

Page 26: Common layers

• Fully connected layer
• Convolution layer
• Pooling layer
• Activations
• SoftMax
  • Strictly positive
  • Sums to 1
  • Used for multi-class classification
  • Other losses: hinge, Euclidean, sigmoid cross-entropy, …

Reference: Introduction to Feedforward Neural Networks - Larochelle. https://dl.dropboxusercontent.com/u/19557502/hugo_dlss.pdf
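A numerically stable softmax in NumPy (my sketch, not from the slides): subtracting the maximum score before exponentiating leaves the result unchanged but avoids overflow, the same numerical-stability concern that comes up with the cross-entropy loss later:

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # same output, but exp() cannot overflow
    e = np.exp(shifted)
    return e / np.sum(e)

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                       # strictly positive entries that sum to 1
```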

Page 27: Common layers

• Fully connected layer
• Convolution layer
• Pooling layer
• Activations
• SoftMax
• Dropout
  • Idea: “cripple” the neural network by removing hidden units stochastically
  • Uses a random mask; a different dropout probability could be used, but 0.5 usually works well
  • Beats regular backpropagation on many datasets, but is slower (~2x)
  • Helps to prevent overfitting
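A sketch of “inverted” dropout, one common formulation (assuming NumPy). Scaling the mask by 1/keep_prob at training time keeps the expected activation unchanged, so the test-time forward pass needs no modification:

```python
import numpy as np

def dropout(h, keep_prob=0.5, train=True):
    """Randomly zero hidden units at training time; identity at test time."""
    if not train:
        return h
    mask = (np.random.rand(*h.shape) < keep_prob) / keep_prob
    return h * mask                 # E[output] == h, hence no test-time rescaling
```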

Page 28: Common layers

• Normalization layers
  • Batch Normalization (BN)
  • Networks converge faster if inputs are whitened, i.e., linearly transformed to have zero mean and unit variance, and decorrelated
  • Ioffe and Szegedy (2015) suggested also applying normalization at the level of the hidden layers
  • BN: normalize each layer, for each mini-batch => addresses “internal covariate shift”
  • Greatly accelerates training + less sensitive to initialization + improves regularization
• Two popular approaches for normalizing input images:
  • Subtract the mean image (e.g., AlexNet)
  • Subtract the per-channel mean (e.g., VGGNet)

Reference: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
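A sketch of the BN forward pass for a fully connected layer (assuming NumPy; gamma and beta are the learned scale/shift from the paper, and the running statistics needed at test time are omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the mini-batch,
    then let the network undo it if needed (scale gamma, shift beta)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```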


Page 30: Common layers

• Normalization layers
  • Batch Normalization (BN)
  • BN: normalize each layer, for each mini-batch
  • Greatly accelerates training + less sensitive to initialization + improves regularization
• Variants compared in the paper’s ImageNet experiment (figure legend):
  • Inception: trained with initial learning rate 0.0015
  • BN-Baseline: same as Inception with BN before each non-linearity
  • BN-x5 / BN-x30: initial learning rate increased by 5x (0.0075) and 30x (0.045)
  • BN-x5-Sigmoid: same as BN-x5, but with sigmoid instead of ReLU

Reference: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Page 31: Common layers

• Normalization layers
  • Batch Normalization (BN)
  • Local Response Normalization (LRN)
    • Normalizes each activation across a window of n adjacent channels
    • Used in the AlexNet paper with k=2, alpha=10^-4, beta=0.75, n=5
    • Not common anymore
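For reference, a direct NumPy transcription of AlexNet’s LRN (my sketch): each activation is divided by a term accumulated over n neighboring channels at the same spatial position:

```python
import numpy as np

def lrn(x, k=2, alpha=1e-4, beta=0.75, n=5):
    """x: (C, H, W); AlexNet's cross-channel local response normalization."""
    C = x.shape[0]
    out = np.empty_like(x)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C, i + n // 2 + 1)
        out[i] = x[i] / (k + alpha * np.sum(x[lo:hi] ** 2, axis=0)) ** beta
    return out
```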

Page 32: Different abstractions for Deep Learning

• Deep Learning pipeline
• Deep Learning task, e.g., CNN + classifier => image captioning, localization, …
• Deep Neural Network, e.g., CNN: AlexNet, GoogLeNet, …
• Layer, e.g., convolution, pooling, …

Page 33: Convolutional Neural Networks

Page 34: Convolutional Neural Networks

LeNet for OCR (1990s) vs. AlexNet. Compared to LeCun 1998, AlexNet used:
• More data: 10^6 vs. 10^3 images
• GPU (~20x speedup) => almost 1B FLOPs for a single image
• Deeper: more layers (8 weight layers)
• Fancy regularization (dropout 0.5)
• Fancy non-linearity (first use of ReLU, according to Karpathy)
• Top-5 error on ImageNet (ILSVRC 2012 winner): 16.4%
• Using an ensemble of 7 CNNs: 15.4%

Page 35: Convolutional Neural Networks

ZFNet [Zeiler and Fergus, 2013]
• An improvement on AlexNet obtained by tweaking the architecture hyperparameters:
  • Expanded the size of the middle convolutional layers: CONV 3, 4, 5 use 512, 1024, 512 filters instead of 384, 384, 256
  • Made the stride and filter size of the first layer smaller: CONV 1 changed from (11x11, stride 4) to (7x7, stride 2)
• Top-5 error on ImageNet (ILSVRC 2013 winner): 16.4% -> 14.8%

Reference: http://cs231n.github.io/convolutional-networks/

Page 36: Convolutional Neural Networks

VGGNet [Simonyan and Zisserman, 2014]
• Homogeneous architecture:
  • All convolution layers use small 3x3 filters (compared to AlexNet’s 11x11, 5x5 and 3x3 filters) with stride 1 (compared to AlexNet’s strides of 4 and 1)
  • Number of filters per stage: 64, 128, 256, 512, 512
• Depth of the network is the critical component (19 layers)
• Other details:
  • 5 max-pool layers (2x reduction each)
  • No normalization
  • 3 FC layers (instead of 2) => hold most of the parameters (102,760,448; 16,777,216; 409,600)
• ImageNet top-5 error (ILSVRC 2014 runner-up): 14.8% -> 7.3%
• Why 3x3 layers?
  • Stacked convolution layers have a large receptive field: two 3x3 layers => 5x5 receptive field; three 3x3 layers => 7x7 receptive field
  • More non-linearity
  • Fewer parameters to learn (see the sketch below)

Reference: https://arxiv.org/pdf/1509.07627.pdf, https://arxiv.org/pdf/1409.1556v6.pdf, https://www.youtube.com/watch?v=j1jIoHN3m0s
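A back-of-the-envelope check of the parameter savings (biases ignored; equal input and output channel counts are my assumption for illustration):

```python
C = 256                          # channels, for illustration
one_5x5 = 5 * 5 * C * C          # 1,638,400 weights for a 5x5 receptive field
two_3x3 = 2 * 3 * 3 * C * C      # 1,179,648 weights, plus one extra non-linearity
print(one_5x5, two_3x3)          # stacking small filters wins on both counts
```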

Page 37: New Lego brick or mini-network (Inception module)

For Inception v4, see https://arxiv.org/abs/1602.07261

Page 38: Convolutional Neural Networks

GoogLeNet [Szegedy et al., 2014]
• 9 inception modules
• ILSVRC 2014 winner (6.7% top-5 error)
• Only 5 million parameters! (uses average pooling instead of FC layers)

Page 39: Convolutional Neural Networks

Speed with Torch7 (using a GeForce GTX TITAN X and cuDNN), all times in milliseconds:

                    GoogLeNet   VGG_model_A   AlexNet
updateOutput           130.76        162.74     27.65
updateGradInput        197.86        167.05     24.32
accGradParameters      142.15        199.49     28.99
Forward                130.76        162.74     27.65
Backward               340.01        366.54     53.31
TOTAL                  470.77        529.29     80.96

Compared to AlexNet, GoogLeNet has:
• 12x fewer parameters
• 2x more compute
• 6.67% top-5 error (vs. 16.4%)

Compared to VGGNet, GoogLeNet has:
• 36x fewer parameters
• 22 layers (vs. 19)
• 6.67% top-5 error (vs. 7.3%)

Reference: https://arxiv.org/pdf/1512.00567.pdf, https://github.com/soumith/convnet-benchmarks/blob/master/torch7/imagenet_winners/output.log

Page 40: Analysis of errors by GoogLeNet vs. humans on the ImageNet dataset

• Types of error that both GoogLeNet and humans are susceptible to:
  • Multiple objects (24% of GoogLeNet errors and 16% of human errors)
  • Incorrect annotations
• Types of error that GoogLeNet is more susceptible to than humans:
  • Small or thin objects (21% of GoogLeNet errors)
  • Image filters, e.g., distorted contrast/color distribution (13% of GoogLeNet errors and only 1 human error)
  • Abstract representations, e.g., the shadow on the ground of a child on a swing (6% of GoogLeNet errors)
• Types of error that humans are more susceptible to than GoogLeNet:
  • Fine-grained recognition, e.g., species of dogs (7% of GoogLeNet errors and 37% of human errors)
  • Insufficient training data

Reference: http://arxiv.org/abs/1409.0575

Page 41: Convolutional Neural Networks

Page 42: New Lego brick (Residual block)

• Shortcut connection to address underfitting due to vanishing gradients
  • Occurs even with batch normalization

Reference: http://torch.ch/blog/2016/02/04/resnets.html
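A functional sketch of the block (assuming NumPy; conv1/conv2 stand in for the 3x3 convolution + batch norm stages, which are not spelled out here). The identity shortcut gives gradients a direct path back to earlier layers:

```python
import numpy as np

def relu(a):
    return np.maximum(0, a)

def residual_block(x, conv1, conv2):
    """y = relu(F(x) + x); the block only has to learn the residual F(x)."""
    return relu(conv2(relu(conv1(x))) + x)
```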

Page 43: Convolutional Neural Networks

• ResNet architecture:
  • VGG-style design => just deep
  • All 3x3 convolutions
  • #Filters doubled periodically (x2)
• Other remarks:
  • No max pooling (almost)
  • No FC layers
  • No dropout
• See https://github.com/facebook/fb.resnet.torch

Reference: http://image-net.org/challenges/talks/ilsvrc2015_deep_residual_learning_kaiminghe.pdf

Page 44: Different abstractions for Deep Learning

• Deep Learning pipeline
• Deep Learning task, e.g., CNN + classifier => image captioning, localization, …
• Deep Neural Network, e.g., CNN: AlexNet, GoogLeNet, …
• Layer, e.g., convolution, pooling, …

Page 45: Addressing other tasks … SKIP THIS!!

Reference: https://docs.google.com/presentation/d/1Q1CmVVnjVJM_9CDk3B8Y6MWCavZOtiKmOLQ0XB7s9Vg/edit#slide=id.g17e6880c10_0_926


Page 47: How to train a Deep Neural Network?

Page 48: Training a Deep Neural Network

Page 49: Training a Deep Neural Network

• “Forward propagation”: compute a function via composition of linear transformations followed by element-wise non-linearities
• “Backward propagation”: propagate errors backwards and update the weights according to how much they contributed to the output
  • A special case of “automatic differentiation”, discussed in the next slides

Reference: “You Should Be Using Automatic Differentiation” by Ryan Adams (Twitter)

Page 50: Training a Deep Neural Network

• Training features: x ∈ R^d; training label: y
• Goal: learn the weights W. Define a loss function ℓ(f(x; W), y).
• For numerical stability and mathematical simplicity, we use the negative log-likelihood (often referred to as cross-entropy): ℓ(f(x; W), y) = -log f(x; W)_y, i.e., the negative log of the probability the network assigns to the true class y

Page 51: Training a Deep Neural Network

• Using the loss function ℓ(f(x; W), y), we learn the weights W
• Learning is cast as optimization (minimize the average loss over the training set)
• Popular algorithm: Stochastic Gradient Descent (SGD)
  • Needs to compute the gradients ∂ℓ/∂W and apply the update W <- W - learning_rate * ∂ℓ/∂W (a minimal loop is sketched below)
  • Also needs an initialization of the weights (covered later)
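A minimal SGD loop to make the casting-as-optimization concrete (my sketch: `grad_loss` is a hypothetical function returning the average gradient of the loss over one mini-batch):

```python
import random
import numpy as np

def sgd(W, data, grad_loss, lr=0.01, epochs=10, batch_size=64):
    """data: list of (x, y) pairs; W: weight array; grad_loss: dLoss/dW."""
    for _ in range(epochs):
        random.shuffle(data)                      # the stochastic part of SGD
        for i in range(0, len(data), batch_size):
            xb, yb = zip(*data[i:i + batch_size])
            W = W - lr * grad_loss(W, np.array(xb), np.array(yb))  # step downhill
    return W
```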

Page 52: Methods for differentiating functions

• Example: evaluate the derivative of f(x) = sin(x - 3/x) at x = 0.01
• Symbolic differentiation:
  • Symbolically differentiate the function as an expression, and evaluate it at the required point
  • Low speed + difficult to convert a DNN into an expression
  • Symbolically, f'(x) = cos(x - 3/x)(1 + 3/x^2) … at x = 0.01 => -962.8192798
• Numerical differentiation:
  • Use finite differences, e.g., f'(x) ≈ (f(x + h) - f(x - h)) / (2h)
  • Generally bad numerical stability
• Automatic/Algorithmic Differentiation (AD):
  • “Mechanically calculates derivatives as functions expressed as computer programs, at machine precision, and with complexity guarantees” - Barak Pearlmutter
  • Reverse-mode automatic differentiation is what is used in practice

Reference: http://homes.cs.washington.edu/~naveenks/files/2009_Cranfield_PPT.pdf

Page 53: Examples of AD in practice

• For Python and NumPy: https://github.com/HIPS/autograd
• For Torch (developed by Twitter Cortex): https://github.com/twitter/torch-autograd/
• See http://www.autodiff.org/ for more details
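With the HIPS autograd package, the running example from the previous slide takes three lines, and the reverse-mode result matches the symbolic answer (this snippet is mine, not from the slides):

```python
import autograd.numpy as np   # thinly wrapped NumPy that records operations
from autograd import grad

f = lambda x: np.sin(x - 3.0 / x)
print(grad(f)(0.01))          # ~ -962.8192798, matching cos(x - 3/x) * (1 + 3/x^2)
```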

Page 54: Reverse-mode AD (how it works)

• Convert the algorithm into a sequence of assignments of basic operations, x_i = f_i(x_{parents(i)}), where parents(i) are the variables operation i reads
• Apply the chain rule, differentiating each basic operation f_i in the reverse order: dL/dx_i = sum over the operations j that read x_i of dL/dx_j * ∂f_j/∂x_i (worked example below)

Reference: https://justindomke.wordpress.com/2009/03/24/a-simple-explanation-of-reverse-mode-automatic-differentiation/
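Unrolled by hand for the running example f(x) = sin(x - 3/x) (an illustrative sketch): the forward pass records each basic operation, and the backward pass differentiates them in reverse order via the chain rule:

```python
import math

# Forward pass: a sequence of assignments of basic operations.
x  = 0.01
t1 = 3.0 / x              # t1 = 3/x
t2 = x - t1               # t2 = x - t1
y  = math.sin(t2)         # y  = sin(t2)

# Backward pass: visit the operations in reverse, accumulating dL/d(variable).
dy  = 1.0
dt2 = dy * math.cos(t2)   # d sin(t2)/d t2 = cos(t2)
dt1 = dt2 * (-1.0)        # d (x - t1)/d t1 = -1
dx  = dt2 * 1.0           # d (x - t1)/d x = 1 (direct path)
dx += dt1 * (-3.0 / x**2) # d (3/x)/d x = -3/x^2 (path through t1)
print(dx)                 # ~ -962.8192798, same as the symbolic answer
```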

Page 55-56: Reverse-mode AD (how it works – NN)

[Figures: reverse-mode AD walked step by step through a neural network, from “Neural Networks with Torch” by Alex Wiltschko]

Page 57: Tricks of the Trade

• Normalize your data
• Use mini-batches instead of single-example SGD (leverage matrix-matrix operations)
• Use momentum
• Use adaptive learning rates (sketched below):
  • Adagrad: learning rates are scaled by the square root of the cumulative sum of squared gradients
  • RMSProp: instead of a cumulative sum, use an exponential moving average
  • Adam: essentially combines RMSProp with momentum
• Debug your gradient using the finite-difference method
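Single-tensor update rules for the three methods (a sketch of the usual formulations; g is the current gradient, and the hyperparameter defaults are the commonly used ones, not values from the talk):

```python
import numpy as np

def adagrad(w, g, cache, lr=0.01, eps=1e-8):
    cache += g ** 2                               # cumulative sum of squared gradients
    return w - lr * g / (np.sqrt(cache) + eps), cache

def rmsprop(w, g, cache, lr=1e-3, decay=0.9, eps=1e-8):
    cache = decay * cache + (1 - decay) * g ** 2  # exponential moving average instead
    return w - lr * g / (np.sqrt(cache) + eps), cache

def adam(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g                     # momentum on the gradient
    v = b2 * v + (1 - b2) * g ** 2                # RMSProp-style second moment
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)  # bias correction, t >= 1
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```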


Page 59: Tricks of the Trade

• Initialization matters
• Assume a 10-layer FC network with tanh non-linearity
  • Initialize with zero mean & 0.01 std dev: does not work for deep networks
  • Initialize with zero mean & unit std dev: almost all neurons completely saturated at either -1 or 1; gradients will be all zero

[Figures: layer mean and layer std dev of the activations plotted against layer number for both schemes]

Page 60: Tricks of the Trade

• Initialization matters
• Assume a 10-layer FC network with tanh non-linearity
  • Xavier initialization [Glorot et al., 2010]: use zero mean and 1/fan_in variance
  • Works well for tanh, but not for ReLU
  • He et al. proposed replacing the 1/fan_in variance by 2/fan_in for ReLU (note the additional factor of 2)

[Figure: layer mean and layer std dev of the activations plotted against layer number]
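The three initializations side by side (my NumPy sketch; fan_in is the number of inputs to a unit):

```python
import numpy as np

fan_in, fan_out = 512, 512
W_small  = np.random.randn(fan_in, fan_out) * 0.01                   # activations die out
W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)        # 1/fan_in variance: tanh
W_he     = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)  # extra factor 2: ReLU
```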

Page 61: Tricks of the Trade

• Initialization matters
• Assume a 10-layer FC network with tanh non-linearity
• Batch normalization reduces the strong dependence on initialization

Page 62: Overview of the existing deep learning stack

Page 63: Existing Deep Learning Stack

• Framework: Caffe, Theano, Torch7, TensorFlow, DeepLearning4J, SystemML*
• Libraries with commonly used building blocks: cuDNN; Aparapi (converts bytecode to OpenCL); GPU counterparts of the CPU’s BLAS/LAPACK: cuBLAS, MAGMA, CULA, cuSPARSE, cuSOLVER, cuRAND, etc.
• Driver/Toolkit: CUDA (preferred for Nvidia GPUs), OpenCL (portable)
• Hardware:
  • CPU: multicore, task parallelism, minimize latency (e.g., Unsafe/DirectBuf/GC pauses/NIO)
  • GPU: data parallelism (single task), cost of moving data from CPU to GPU (kernel fusion?), maximize throughput

Rule of thumb: always use libraries!! Caffe (GPU) gives an 11x speedup, but Caffe (cuDNN) gives 14x on AlexNet training (5 convolution + 3 fully connected layers).

*Conditions apply: unified memory model since CUDA 6

Page 64: Comparison of existing frameworks

• Caffe: core language C++; bindings for Python and MatLab; CPU: yes; single GPU: yes; multi-GPU: yes; distributed: see com.yahoo.ml.CaffeOnSpark. Mostly for image classification; models/layers expressed in proto format.
• Theano/PyLearn2: core language Python; CPU: yes; single GPU: yes; multi-GPU: in progress; distributed: no. Transparent use of GPU; auto-diff; general purpose; computation as a DAG.
• Torch7: core language Lua; CPU: yes; single GPU: yes; multi-GPU: yes; distributed: see Twitter’s torch-distlearn. CTC implementation of Baidu’s Deep Speech open-sourced on Torch7; very efficient.
• TensorFlow: core language C++; bindings for Python; CPU: yes; single GPU: yes; multi-GPU: up to 4 GPUs; distributed: not open-sourced. Slower than Theano/Torch; TensorBoard is useful; computation as a DAG.
• DL4J: core language Java; CPU: yes; single GPU: yes; multi-GPU: most likely; distributed: yes. Supports GPUs via CUDA; support for Hadoop/Spark.
• SystemML: core language Java; bindings for Python and Scala; CPU: yes; single GPU: in progress; multi-GPU: not yet; distributed: yes.
• Minerva/CXXNet (Smola): core language C++; bindings for Python; CPU: yes; single GPU: yes; multi-GPU: yes; distributed: yes. https://github.com/dmlc; Minerva ~ Theano and CXXNet ~ Caffe.

Page 65: Thank You!!