INTRODUCTION TO DEEP LEARNING
Dmytro Mishkin
Czech Technical University in Prague
Clear Research Corporation
MY BACKGROUND
CTO of Clear Research. Using deep learning
at work since 2014.
PhD student at the Czech Technical University in
Prague. Beat deep learning approaches at the
VPRiCE Challenge 2015 with classical
methods.
Now working fully in DL; recent paper “All you
need is a good init” was added to the Stanford CS231n
course.
Kaggler. 9th out of 1049 teams at the National
Data Science Bowl.
AGENDA
Why deep learning (DL)? Some applications
and motivations
What is the core idea behind DL?
Basics of convolutional networks (CNN)
Practical recommendations for CNN-based image
classification. State-of-the-art approaches
Deep Learning libraries overview
How to apply CNNs to different tasks
EC2 hands-on experience on Cats-vs-Dogs
competition. Homework
XKCD. NOT TRUE ANYMORE
DEEP LEARNING APPLICATIONS
Alpha Go :)
Image recognition
Speech Recognition. Cortana, Siri
Translation
Anomaly detection
Fraud detection
Video recognition
Robotics
Recommendation systems
DNA, biology, and more...
ALPHAGO
Mastering the game of Go with deep neural networks and tree search. Silver et al., 2016
IMAGE CLASSIFICATION
Select all dogs. Our assignment…almost :)
State of the art since 2012. Krizhevsky et al., 2012
Superhuman level on ImageNet classification since 2015.
He et al., 2015; Szegedy et al., 2015
OBJECT DETECTION
SPEECH RECOGNITION
Cortana
Siri
OK, Google
Figure from Huang et al., 2015.
ANOMALY DETECTION
VIDEO CAPTIONING
Translating Videos to Natural Language Using Deep Recurrent Neural Networks. Venugopalan et al., 2015
TEXT TRANSLATION
From [Bahdanau et al., 2015] slides at ICLR 2015.
DEEP LEARNING FRAMEWORKS FOR REGULATORY GENOMICS AND EPIGENOMICS
https://www.youtube.com/watch?v=2vpKB3j-OY0
ROBOTICS: NAVIGATION
https://www.youtube.com/watch?v=umRdt3zGgpU
FRAUD DETECTION
As simple classification
http://www.slideshare.net/0xdata/paypal-fraud-detection-with-deep-learning-in-h2o-presentationh2oworld2014
DL IS NOT THE BEST CHOICE WHEN
You have a small number of heterogeneous
(enumeration-type) features.
E.g. almost all Kaggle competitions:
Given browser, session id, gender, determine if
customer wants revenge :)
Given some anonymized features, predict stock
price
Given gender, profession, age, etc., predict insurance
risk
How to farm rating on Habr (sorry :)
WHAT IS COMMON IN DL-FRIENDLY TASKS?
Extremely hard to write explicit algorithms
Even if features are obvious – how to extract
them?
Lots of structured homogeneous data (images,
speech, text).
You can, and have to, transform the input. Could you
transform a browser version?
DEEP LEARNING IS HIERARCHICAL REPRESENTATION LEARNING
Quoc V. Le et al., 2011. Building high-level features using large scale unsupervised learning
DEPTH IS ESSENTIAL IN DEEP LEARNING
CONVOLUTIONS? WHY NOT JUST MLP?
http://cs231n.github.io/convolutional-networks/
NN
CNN
WHAT IS CONVOLUTION
https://developer.apple.com/library/ios/documentation/Performance/Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html
A classical NN applied to an image is a convolution with an image-size kernel
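A minimal sketch of the statement above, assuming NumPy and a toy 8x8 single-channel image (the sizes and the helper name `correlate2d_valid` are illustrative, not from the slides). One fully connected neuron with one weight per pixel should give the same number as a "valid" convolution (cross-correlation, which is what CNN libraries usually compute) whose kernel is as large as the image.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Naive 'valid' cross-correlation: slide the kernel, no padding."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
weights = rng.standard_normal((8, 8))  # one FC neuron: one weight per pixel

fc_output = float(image.ravel() @ weights.ravel())
conv_output = correlate2d_valid(image, weights)  # only one valid position
```

With a kernel smaller than the image, the same weights are reused at every position, which is where the savings of a CNN come from.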
CONVOLUTIONS? WHY NOT JUST MLP?
Let's look at the filters:
Local
Most values are close to the mean
(non-informative)
Wasted computation
and memory!!!
Also, lots of parameters
and little data -> overfitting
http://www.cs.toronto.edu/~ranzato/research/projects.html
CONVOLUTIONS? WHY NOT JUST MLP?
Krizhevsky et al. 2012. conv1 filters
11x11x3. Much less wasted space!
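To put rough numbers on "much less wasted space", here is a small sketch that counts parameters. It assumes the AlexNet setup of 96 conv1 filters and a 227x227x3 input; both numbers are assumptions added here, only the 11x11x3 filter size is on the slide. The fully connected layer with the same number of output units is a hypothetical comparison.

```python
def fc_params(in_h, in_w, in_c, units):
    """Weights plus biases of a fully connected layer on a flattened image."""
    return in_h * in_w * in_c * units + units

def conv_params(k_h, k_w, in_c, filters):
    """Weights plus biases of a convolutional layer (shared across positions)."""
    return k_h * k_w * in_c * filters + filters

fc = fc_params(227, 227, 3, 96)    # hypothetical FC layer with 96 units
conv = conv_params(11, 11, 3, 96)  # conv1 with 96 filters of 11x11x3
ratio = fc / conv
```

Under these assumptions the convolutional layer needs a few hundred times fewer parameters than the fully connected one.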
DO WE NEED MATH?
If yes, go to the whiteboard
POOLING
http://cs231n.github.io/convolutional-networks/
MAX POOLING
http://cs231n.github.io/convolutional-networks/
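A minimal NumPy sketch of 2x2 max pooling with stride 2 on a single feature map. The function name and the toy input are illustrative assumptions; real libraries also handle channels, batches and other window sizes.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2; odd trailing rows/columns are dropped."""
    h2, w2 = x.shape[0] // 2, x.shape[1] // 2
    x = x[:h2 * 2, :w2 * 2]
    # Split into 2x2 blocks, then take the maximum inside each block.
    return x.reshape(h2, 2, w2, 2).max(axis=(1, 3))

feature_map = np.array([[1, 1, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]])
pooled = max_pool_2x2(feature_map)  # [[6, 8], [3, 4]]
```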
TYPICAL CNN STRUCTURE (LENET-5)
http://eblearn.sourceforge.net/lib/exe/lenet5.png
• (Conv-ReLU-Pool)xN-Softmax. Simple.
• (Conv-ReLU)xN-Pool-(Conv-ReLU)x2N-Pool…Softmax. Popular.
• Some Inception architecture. Have fun :)
NON-LINEARITIES
NON-LINEARITIES BENCHMARK
https://github.com/ducha-aiki/caffenet-benchmark/blob/master/Activations.md
NON-LINEARITIES
IMAGE PREPROCESSING
Subtract mean pixel (training set), divide by std.
RGB is the best colorspace for CNN
Do nothing more…
…unless you have specific dataset.
Subtract local mean pixel. B. Graham, 2015
Kaggle Diabetic Retinopathy Competition report
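The basic recipe (subtract the training-set mean pixel, divide by the std) could look like the sketch below. It assumes images stored as a float array of shape (N, H, W, 3) and per-channel statistics; the function names and the random stand-in data are hypothetical.

```python
import numpy as np

def fit_pixel_stats(train_images):
    """Per-channel mean and std, computed on the training set only."""
    mean = train_images.mean(axis=(0, 1, 2))
    std = train_images.std(axis=(0, 1, 2))
    return mean, std

def preprocess(images, mean, std, eps=1e-8):
    """Subtract the mean pixel and divide by the std."""
    return (images - mean) / (std + eps)

rng = np.random.default_rng(0)
train = rng.uniform(0, 255, size=(16, 32, 32, 3))  # stand-in for real images
test = rng.uniform(0, 255, size=(4, 32, 32, 3))

mean, std = fit_pixel_stats(train)
train_n = preprocess(train, mean, std)
test_n = preprocess(test, mean, std)  # reuse the training statistics
```

The test images are normalized with the training statistics, so the network sees the same transform at training and test time.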
TRAINING. SOLVERS AND REGULARIZATION
Use SGD with momentum.
Try learning rates 0.01, 0.005, 0.001
Momentums: 0, 0.5, 0.9, 0.95
Try L2 weight decay 0.0005, 0.0001. Prevents
overconfidence
Fancy solvers (Adam, RMSProp, AdaDelta)
sometimes work better, sometimes not.
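A sketch of one SGD-with-momentum update with L2 weight decay, written in plain NumPy. The update form (velocity = momentum * velocity - lr * gradient) is one common convention and is an assumption here, as is the toy quadratic problem; the default hyperparameters are taken from the values suggested above.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update: L2 weight decay adds weight_decay * w to the gradient."""
    grad = grad + weight_decay * w
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy problem: minimize 0.5 * ||w - target||^2, whose gradient is (w - target).
target = np.array([1.0, -2.0, 3.0])
w = np.zeros(3)
v = np.zeros(3)
for _ in range(500):
    w, v = sgd_momentum_step(w, w - target, v)
```

Because of weight decay, the solution is pulled slightly toward zero: here it converges to target / (1 + weight_decay) rather than to target itself.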
ARCHITECTURE
Use filters as small as possible
3x3 + ReLU + 3x3 + ReLU > 5x5 + ReLU.
Exception: 1st layer. Too computationally
inefficient to use 3x3 there.
Convolutional Neural Networks at Constrained Time Cost. He and Sun, 2015
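One common argument for stacking small filters, sketched under the assumption of stride 1, no bias terms and the same number of channels C in and out (C = 64 is an arbitrary illustrative value): two 3x3 layers cover the same 5x5 receptive field as one 5x5 layer, with fewer weights and one extra non-linearity.

```python
def conv_weights(kernel, channels):
    """Weights of a kernel x kernel conv with `channels` inputs and outputs."""
    return kernel * kernel * channels * channels

def stacked_receptive_field(kernels):
    """Receptive field of stride-1 convolutions applied one after another."""
    rf = 1
    for k in kernels:
        rf += k - 1
    return rf

C = 64
two_3x3 = 2 * conv_weights(3, C)  # 18 * C^2 weights
one_5x5 = conv_weights(5, C)      # 25 * C^2 weights
rf_3x3 = stacked_receptive_field([3, 3])
rf_5x5 = stacked_receptive_field([5])
```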
WEIGHTS INITIALIZATION
Preserve var = 1
at the output of
every layer.
How?
There are lots of
papers with
variants
Mishkin and Matas. All you need is a good init. ICLR, 2016
WEIGHTS INITIALIZATION
Gaussian noise with some coefficient:
Xavier:
He (Xavier with the fan-in halved to account for ReLU, i.e. variance 2/fan_in)
Orthonormal (Saxe et al., 2013)
Data-dependent: LSUV
Mishkin and Matas. All you need is a good init. ICLR, 2016
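The schemes listed above can be sketched in NumPy as follows. This is a simplified illustration for bias-free fully connected layers with ReLU, not the reference implementation: the Xavier variant shown uses variance 2 / (fan_in + fan_out) (other variants use 1 / fan_in), and the LSUV loop only follows the general idea of starting from an orthonormal matrix and rescaling each layer until its output variance on a data batch is close to 1. Function names, layer sizes and the tolerance are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Gaussian noise with Var(W) = 2 / (fan_in + fan_out)
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / (fan_in + fan_out))

def he_init(fan_in, fan_out):
    # Gaussian noise with Var(W) = 2 / fan_in (ReLU zeroes about half the units)
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

def orthonormal_init(fan_in, fan_out):
    # QR of a Gaussian matrix gives orthonormal columns (or rows, after transposing)
    rows, cols = max(fan_in, fan_out), min(fan_in, fan_out)
    q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return q if fan_in >= fan_out else q.T

def lsuv_init(layer_sizes, batch, tol=0.01, max_iter=10):
    """Data-dependent init: rescale each layer so its output variance is ~1."""
    weights, x = [], batch
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        w = orthonormal_init(fan_in, fan_out)
        for _ in range(max_iter):
            var = (x @ w).var()
            if abs(var - 1.0) < tol:
                break
            w = w / np.sqrt(var)
        weights.append(w)
        x = np.maximum(x @ w, 0.0)  # ReLU output feeds the next layer
    return weights

batch = rng.standard_normal((256, 100))
weights = lsuv_init([100, 100, 100, 100, 100], batch)

# Check: variance of every layer's pre-activation output on the same batch.
output_vars, x = [], batch
for w in weights:
    pre = x @ w
    output_vars.append(pre.var())
    x = np.maximum(pre, 0.0)
```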
BATCH NORMALIZATION
Ioffe and Szegedy, 2015
DROPOUT
Play with rates. 0.5 is rarely the optimal choice (but
often a good one)
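For reference, a minimal sketch of dropout with a tunable rate, assuming the "inverted" variant that rescales the kept activations at training time so that nothing changes at test time. The function name and the toy activations are illustrative.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Zero each activation with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep
    return x * mask / keep

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
dropped = dropout(activations, rate=0.3, rng=rng)
at_test_time = dropout(activations, rate=0.3, rng=rng, training=False)
```

The rate is an ordinary hyperparameter here, so trying values other than 0.5 is a one-argument change.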
DROPOUT
Dropout_rate * width = constant – doesn't work!
DATA AUGMENTATION
Common (helps in 99% of cases):
Random crop: e.g., 227x227 from 256x256 px
(AlexNet)
Horizontal mirror
Dataset dependent:
Random rotation
Affine transform
Random scale
Color augmentation
Input noise
Thin plate deformation
Unleash your imagination
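The two common augmentations (random crop and horizontal mirror) might be sketched like this, assuming images stored as (H, W, 3) NumPy arrays; the function name and the random stand-in image are hypothetical.

```python
import numpy as np

def random_crop_and_mirror(image, crop, rng):
    """Take a random crop x crop patch and mirror it horizontally half the time."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # flip along the width axis
    return patch

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
patch = random_crop_and_mirror(image, crop=227, rng=rng)
```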
PADDING. VALID AND SAME CONVOLUTION
http://www.johnloomis.org/ece563/notes/filter/conv/convolution.html
Same = padding with zeros by ½ kernel size.
The most common choice
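A small sketch of the difference in output size, assuming stride 1 by default and an odd kernel size so that "half the kernel size" means kernel // 2 pixels on each side. The helper names are illustrative.

```python
import numpy as np

def conv_output_size(in_size, kernel, pad=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (in_size + 2 * pad - kernel) // stride + 1

def pad_same(image, kernel):
    """Zero-pad a 2D image by kernel // 2 on every side."""
    p = kernel // 2
    return np.pad(image, ((p, p), (p, p)), mode="constant")

valid_size = conv_output_size(32, kernel=3)             # shrinks to 30
same_size = conv_output_size(32, kernel=3, pad=3 // 2)  # stays 32
padded = pad_same(np.ones((32, 32)), kernel=3)          # 34 x 34 with a zero border
```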
PADDING
Padding:
Preserves spatial size, does not “wash out”
information
Dropout-like augmentation by zeros
Caffenet128
with conv padding: 47% top-1 acc
w/o conv padding: 41% top-1 acc.
It is a huge difference
SUMMARY FROM CS231N
DEEP LEARNING TOOLBOXES
Caffe
Torch
Theano
TensorFlow
MXNet
…
Nervana
DeepLearning4j
ConvnetJS
CNTK
Veles
H2O... sorry, guys :(
MAIN DEEP LEARNING TOOLBOXES
SPEED BENCHMARK ALEXNET
Library                  Class                      Time (ms)  Forward (ms)  Backward (ms)
CuDNN[R4]-fp16 (Torch)   cudnn.SpatialConvolution   71         25            46
Nervana-neon-fp16        ConvLayer                  78         25            52
CuDNN[R4]-fp32 (Torch)   cudnn.SpatialConvolution   81         27            53
Nervana-neon-fp32        ConvLayer                  87         28            58
fbfft (Torch)            fbnn.SpatialConvolution    104        31            72
TensorFlow               conv2d                     151        34            117
Chainer                  Convolution2D              177        40            136
cudaconvnet2*            ConvLayer                  177        42            135
CuDNN[R2] *              cudnn.SpatialConvolution   231        70            161
Caffe (native)           ConvolutionLayer           324        121           203
Torch-7 (native)         SpatialConvolutionMM       342        132           210
CL-nn (Torch)            SpatialConvolutionMM       963        388           574
Caffe-CLGreenTea         ConvolutionLayer           1442       210           1232
https://github.com/soumith/convnet-benchmarks
SPEED BENCHMARK. GOOGLENET
Library                  Class                      Time (ms)  Forward (ms)  Backward (ms)
Nervana-neon-fp16        ConvLayer                  230        72            157
Nervana-neon-fp32        ConvLayer                  270        84            186
CuDNN[R4]-fp16 (Torch)   cudnn.SpatialConvolution   462        112           349
CuDNN[R4]-fp32 (Torch)   cudnn.SpatialConvolution   470        130           340
Chainer                  Convolution2D              687        189           497
TensorFlow               conv2d                     905        187           718
Caffe                    ConvolutionLayer           1935       786           1148
CL-nn (Torch)            SpatialConvolutionMM       7016       3027          3988
Caffe-CLGreenTea         ConvolutionLayer           9462       746           8716
https://github.com/soumith/convnet-benchmarks
CAFFE
HOW TO DO IT – LET'S GO TO THE WHITEBOARD
Image retrieval: Babenko et al., 2014
Person identification: Chopra et al., 2006
Ranking: Wang et al., 2014
Playing games: Atari (2013), Go (2016)
Text generation: https://github.com/karpathy/char-rnn
Image generation: Radford et al., 2016
Action recognition: Simonyan et al., 2014
Anomaly detection: https://www.youtube.com/watch?v=ds73ULGjnpc&feature=youtu.be
Translation: Cho et al., 2014
Fraud detection at PayPal: http://university.h2o.ai/cds-lp/cds02.html
IMAGE RETRIEVAL
Figure from Babenko et al., 2014
1. Pass the image through an ImageNet-pretrained CNN.
2. Use the activations of some layer as the descriptor.
3. L2-normalize (must!)
4. Put into some fast nearest-neighbor (NN) search structure, like a kd-tree.
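Steps 3 and 4 can be sketched as below. Random vectors stand in for the CNN activations of steps 1 and 2, and a brute-force dot-product search is used in place of a kd-tree to keep the example self-contained; both are assumptions, and the function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale every row to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def search(query, database, top_k=3):
    """Brute-force search. For unit vectors, ranking by dot product
    gives the same order as ranking by L2 distance."""
    scores = database @ query
    return np.argsort(-scores)[:top_k]

rng = np.random.default_rng(0)
database = l2_normalize(rng.standard_normal((1000, 256)))  # stand-in descriptors

# Query: a slightly perturbed copy of database item 42.
noisy = database[42:43] + 0.01 * rng.standard_normal((1, 256))
query = l2_normalize(noisy)[0]
neighbors = search(query, database)
```

For a large database the brute-force scan would be replaced by an index structure, which is the point of step 4.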
EMBEDDINGS WITH SIAMESE NETWORKS
1. Put 2 images through copies of the same network
2. L2 difference < 1 if same person, > 1 if different
https://www.cs.nyu.edu/~sumit/research/research.html
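One way to train toward "distance below 1 for the same person, above 1 otherwise" is a contrastive loss with margin 1. The slide does not name the loss, so this particular form is an assumption; the embeddings below are hand-written stand-ins for network outputs.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Pull matching pairs together, push non-matching pairs beyond the margin."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)
    pos = same * d ** 2
    neg = (1.0 - same) * np.maximum(0.0, margin - d) ** 2
    return 0.5 * np.mean(pos + neg)

emb_a = np.array([[0.0, 0.0], [0.0, 0.0]])
emb_b = np.array([[0.0, 0.0], [3.0, 4.0]])  # distances: 0 and 5
same = np.array([1.0, 0.0])                 # first pair matches, second does not

loss_good = contrastive_loss(emb_a, emb_b, same)        # labels agree with distances
loss_bad = contrastive_loss(emb_a, emb_b, 1.0 - same)   # labels flipped
```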
WHAT ABOUT 3 COPIES? TRIPLETS
http://arxiv.org/abs/1412.6622
1. Put 3 images (x, x+, x-) through copies of the same network
2. D(x, x+) < D(x, x-)
Drawbacks: 1) Slow training :(
2) Have to select hard triplets. Random
ones easily satisfy the inequality above.
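A sketch of why hard triplets matter, assuming a hinge-style triplet loss with a margin (the margin value, the function names and the toy 2D embeddings are illustrative; the linked paper uses a different loss formulation). A random, far-away negative gives zero loss and hence no gradient, while a negative close to the anchor still violates the constraint.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss on D(x, x+) - D(x, x-) + margin."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(0.0, d_pos - d_neg + margin)

def hardest_negative(anchor, candidates):
    """Index of the candidate negative closest to the anchor."""
    return int(np.argmin(np.linalg.norm(candidates - anchor, axis=1)))

anchor = np.array([[0.0, 0.0]])
positive = np.array([[1.0, 0.0]])
easy_negative = np.array([[10.0, 0.0]])
hard_negative = np.array([[1.1, 0.0]])

easy = triplet_loss(anchor, positive, easy_negative)  # already satisfied
hard = triplet_loss(anchor, positive, hard_negative)  # still violated
idx = hardest_negative(anchor[0], np.vstack([easy_negative, hard_negative]))
```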
GENERATING IMAGES WITH GANS
http://soumith.ch/eyescream/
http://arxiv.org/abs/1511.06434
• Generator tries to generate images indistinguishable from natural ones.
• Discriminator tries to distinguish them.
• Both learn simultaneously
AUTO-ENCODERS
http://deeplearning4j.org/deepautoencoder.html
DE-NOISING AUTO-ENCODERS
Clean Input Corrupted input (what net sees) Reconstructed
If we compare the input and the reconstruction, we can detect anomalies
http://www.cs.toronto.edu/~ranzato/research/projects.html
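The idea of comparing input and reconstruction can be sketched without training a network. Here a linear autoencoder obtained from PCA (via SVD) is swapped in for the de-noising auto-encoder, and synthetic data on a 3-dimensional subspace stands in for "normal" inputs; both are assumptions made to keep the example small, as are all names and sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" data: 20-dimensional points that lie on a 3-dimensional subspace.
basis = rng.standard_normal((3, 20))
normal_data = rng.standard_normal((500, 3)) @ basis

# Fit a linear autoencoder (PCA): the top 3 principal directions.
mean = normal_data.mean(axis=0)
_, _, vt = np.linalg.svd(normal_data - mean, full_matrices=False)
components = vt[:3]

def reconstruct(x):
    code = (x - mean) @ components.T  # encoder
    return code @ components + mean   # decoder

def anomaly_score(x):
    """Reconstruction error: large when the input does not look 'normal'."""
    return np.linalg.norm(x - reconstruct(x), axis=1)

threshold = anomaly_score(normal_data).max() + 1e-6

normal_sample = rng.standard_normal((1, 3)) @ basis
anomalous_sample = rng.standard_normal((1, 20)) * 5.0
samples = np.vstack([normal_sample, anomalous_sample])
is_anomaly = anomaly_score(samples) > threshold
```

With a real de-noising auto-encoder the threshold would be chosen on held-out normal data rather than from the training maximum.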
QUESTIONS?