Valentin Leveau and Titouan Lorieul Cocomeet 15/12/17 Deep Learning.pdf · Valentin Leveau -...
Transcript of Valentin Leveau and Titouan Lorieul Cocomeet 15/12/17 Deep Learning.pdf · Valentin Leveau -...
“Demystifying” Deep LearningValentin Leveau and Titouan Lorieul
Cocomeet 15/12/17
Summary1. Introduction2. From shallow neural networks to (very) deep ones3. Why deep CNN get popularized so late?4. Theoretical focus: learning as an optimization problem5. Practical limitations of neural networks
We now are good at mimicking some part of intelligence : Learning
(Machine) Learning = Learning from examples to do a given task and
generalize to new examples.
Predict some variables given others.
Capture statistical relationships / structure
between observed variables.
Introduction: DL vs ML
AIMachine Learning
Deep Learning
Machine Learning : Supervised Learning It is all about learning a prediction function parametrized by some
parameters
A simple function could be a linear model:
Goal is to learn the parameters so that we predict good values of yi given xi
An image can be modeled by its pixel representation:
Very very high dimensional feature space
Not relevant features
Classification in high dimensional spaces
See is why a machine sees !!!
Find Non Linear Invariant in the data (S. Mallat)
Find a mapping such that the problem becomes linearly separable
We want to produce an abstraction of the image containing highly
discriminative features→ Representation Learning
Find Non Linear Invariant in the data (S. Mallat)
Produce handcrafted intermediate representation of images Image abstraction that contain relevant information
Problem: Decide manually which kind of features are good for the different tasks
Solution: Let the system learn the change of variable
→ Toward Deep Representation Learning
Mid-LevelRepresentation
(BoW, Fisher Vectors,...)
Low-LevelRepresentation(SIFT, GIST,...)
Classification
Label
Query Image Intermediate Representation pipeline
Several strategies to go non linear
11/09/2016Valentin Leveau - Kernelizing spatially consistent visual matches for FGC - 9
• Learning several levels of abstraction of the input signal (compositionality).• Progressive Embedding from input pixels to the label space.• Jointly learning all the trainable modules.
Low-level feature
extractionClassification
Image
Intermediate Representation pipeline Local
pooling
(sum, max, avg, etc.)
High-level feature
extraction
Local pooling
(sum, max, avg, etc.)
Deep (Convolutional) Neural Network
From Shallow Neural Networks to (very) Deep Ones
Perceptron : Simple Elementary Neural Unit
Multi Layer Perceptron 2 layers architecture :
Layer 1: several units in parallel + non-linearities
Layer 2 : final linear classifier unit
W1 W2 W3 W4 W5
input data Deep non linear embedding Classification layer
Multi Layer Perceptron• The nonlinearities are crucial !!!
• Linear combination of linear combinations = Linear combination ”depth is useless)
• Nonlinearities allow to bend the space to get samples linearly separable ”click the curved space)
Optimization strategy : Supervised Learning Define an objective function to minimize with respect to the parameters
Example: least square minimization over training samples
Optimization strategy : Gradient Descent (GD) Update parameters of an objective function in
the direction of the gradient
Several GD strategies: a) All the training samples : Batch GD (exact gradient of J) b) Only one training sample : Stochastic GD c) A subsample of the training set: Mini-batch GD (reasonable approximation of a)
E
t
Gradient descent through deep feedforward models Gradient can be hard to compute A lot of parameters with interdependencies from layer to layer Solution: Deep model is a composition of function
Backpropagation of the gradient
We move co linearly in the direction of input data
Convolutional Neural Network
W1 W2 W3 W4 W5
Such deep non linear mapping still suffer from lack of efficiency to learn invariance We still try to map pixel-wise vectorial representation to the target space. We need stronger assumptions about the model to learn useful invariants for vision.
Solution: Replace linear applications by convolution kernels with learnable params !
A convolutionReplace linear applications by a convolution
Applying a convolution is equivalent to sharing local receptive fields parameters !→ Huge numbers of matrix multiplications → But the number of free parameters is low !
Translation Invariance with Conv Layers
image patch at position (x,y)
Weight sharing:
Activation at position ”x,y) for the k-th filter
Pooling layers Progressively reduce spatial resolution (e.g. max pooling):
Backprop gradient only spatial locations that chosen as max
Why ConvNet got famous so late ?
• Not many theoretical justifications ”compared to CV + Kernel Methods)
• Too much parameters for little datasets and too slow algorithms and hardwares
• High convergence problems ”gradient vanishing, bad normalizations, …)
• No good ways to initialize deep models
→ Motivations in Unsupervised Learning ”2000-2011)
Learn a model to: Project data into a non linear intermediate embedding ”Coding)
Reconstruct its own input from the codes ”Decoding)
PCA’s eigen directions span the same space than linear autoencoder’s
Coding function:
Decoding function:
Reconstruction objective:
Unsupervised Learning example: Autoencoder
min ||X - X_hat||²
h
X X_hat
Unsupervised Learning example: Autoencoder It is more efficient in practice to learn the layers separately.
→ Actually with ReLU, that’s okay.
Stacked autoencoders: Train the first autoencoder layer to reconstruct the input
Use the intermediate representation as input to the next layer and re-apply the process
Fine-tune the model to jointly learn the layers
Why ConvNet got famous so late ?• Large scale labelled dataset : ImageNet ”15 M)
• Hardware acceleration : GPU
• The ReLU nonlinearity : • The killer detail : as good (if not better) as unsupervised pretraining !
What Changed ?
What Changed ?• Batch Normalization ”x14 faster !!!)• Dropout
Evolution of the architectures • LeNet1-5
[Y. LeCun et al. 1989]
• AlexNet[A. Krizhevsky et al. 2012]
Evolution of the architectures • VGG Net
[K. Simonyan et al. 2014]
Evolution of the architectures • GoogLeNet
[C. Szegedy et al. 2015]
Evolution of the architectures • Inception modules
Evolution of the architectures • ResNet ”152 layers, even 1,000 ...)
[K. He et al. 2015]
Evolution of the architectures • Densenet:
[Liuzhuang et al. 2016]
Setting hyper-parameters of DL is a Nightmare !
• A lot more of good practices to make it work:
• Data-augmentation ”random flip, crop, rotation, etc …)
• Momentum and adaptive learning rate methods
• Learning rate tuning and policy
• Learn an ensemble of models (and aggregate somehow their predictions)
Setting hyper-parameters of DL is a Nightmare !
• Learning rate decay
Transfer learning (fine-tuning)Problem: CNNs require huge training data to learn the millions of parameters Solution: Learn domain specific features by transfer learning
1. Train CNN on a generalist image dataset with millions of images
Transfer learning (fine-tuning)Problem: CNNs require huge training data to learn the millions of parameters Solution: Learn domain specific features by transfer learning
1. Train CNN on a generalist image dataset with millions of images2. Keep the weights of the lowest layers but remove/reset the top layers
Transfer learning (fine-tuning)Problem: CNNs require huge training data to learn the millions of parameters Solution: Learn domain specific features by transfer learning
1. Train CNN on a generalist image dataset with millions of images2. Keep the weights of the lowest layers but remove/reset the top layers3. Feed forward and back-propagate new domain specific images (with usually
a different number of classes C)
New output layer
Layer 7
The power of transfer learningTransfer learning usually works for any domain
Table 1 - accuracy measured on several fine-grained image classification datasets
Even very specific ones:
Rice seeds varieties recognition 100 classes, 1 500 texture images
GoogLeNet trained from scratch 58.1%
GoogLeNet pre-trained on ImageNet 70.4%
Herbaria species recognition 255 classes, 11K herbaria sheets
Trademark Logos Car models Paris Buildings Aircraft models Bird species Flower species
GoogLeNet trained from scratch 67.7% 59.3% 55.3% 72.7% 24.4% 59.5%
GoogLeNet pre-trained on ImageNet 87.5% 79.9% 71.3% 88.1% 72.4% 89.5%
GoogLeNet trained from scratch 8.8%
GoogLeNet pre-trained on ImageNet 52.4%
Frameworks
From academia…
...to industry
etc.
ToolkitHardware
GPUs…
Mobile…
FPGA… Project Brainwave (Microsoft)
And others… TPUs (Google), etc.
Applications
Plant species recognition: Pl@ntNet
Localization / Segmentation (Deep Mask)
Style Transfer to real and sketch images
Image Generation with GAN[S. Chopra et al. 2015]
Generative Adversarial Network (GAN)[I. GoodFellow et al. 2014]
Logic with Deep Learning
NLP with Memory Networks
NLP with Memory Networks
Image Captioning via attention based models
“Reverse Image Captioning” :
Ressources Cours de Yann LeCun au Collège de France
Intervention de Stéphane Mallat au Collège de France
The Deep Learning Book
Pattern Recognition and Machine Learning
Questions
Theoretical Focus:Learning as an Optimization Problem
Statistical Learning Theory
Data generating process: supervised learning
X × Y probability space Input space X , e.g. Rd, images, ... Output space Y, e.g. R, 0, 1, 0, 1, ...,K − 1, ...
Data drawn from p(x, y) is fixed but unknown probability models uncertainty p(x, y) = p(y|x)p(x)
We have a sample set Tn = (xi, yi), i = 1, 2, ..., n
Objective
Find the function f that best predict y given x.
y = f(x)
Objective
Find the function f that best predict y given x.
y = f(x)
Formalism
Reduce search to a parametrized function class f(.; θ) e.g. linear regression f(x;w, b) = wTx+ b, Θ = R
d × R
Loss function: l(y, y) = l(y, f(x; θ)) e.g. squared loss l(y, y) = (y − y)2
Risk function: J∗(θ) = Ex,y[l(y, f(x; θ))]
Optimization problem 1
minθ∈Θ
J∗(θ)
Statistical Learning Theory
Optimization problem 1
minθ∈Θ
Ex,y[l(y, f(x; θ))]
p(x, y) is unknown, we cannot compute the expectation.
Use empirical risk instead J(θ) = 1
n
∑i l(yi, f(xi; θ))
Optimization problem 2
minθ∈Θ
J(θ)
Optimization algorithms
Usually, we have convex objective functions and Θ ⊂ Rp. We
know how to deal with these optimization problems:
if derivable, first-order gradient approaches: GradientDescent, Stochastic Gradient Descent, ...
if two time derivable, second-order approaches:Newton-Raphson, Quasi-Newton, ...
...
Optimization algorithms
Usually, we have convex objective functions and Θ ⊂ Rp. We
know how to deal with these optimization problems:
if derivable, first-order gradient approaches: GradientDescent, Stochastic Gradient Descent, ...
if two time derivable, second-order approaches:Newton-Raphson, Quasi-Newton, ...
...
However, when measured on a test set, the error is high. Thefitted model does not generalize on new input...
Why?
Underfitting VS Overfitting
Regularization term
Optimization problem 3
minθ∈Θ
J(θ) + λΩ(θ)
where λ is the regularization parameter (hyperparameter)
Examples:
L1 regularization: Ω(θ) = ‖θ‖1 =∑
k |θk|
L2 regularization: Ω(θ) = 1
2‖θ‖2
2= 1
2
∑k θ
2
k
Backed by theory: VC theory, PAC learning
ǫ2 ≤log |H|+ log 1
δ
2n
What about Neural Networks?
Optimization problem
definitely not convex
needs a lot of engineering to make it converge tosomething: good initialization, ReLU activations, lots oftricks to avoid gradient from exploding, etc.
...but somehow Stochastic Gradient Descent works well
Generalization
no regularization term (but other approaches)
huge capacity (universal approximation theorem): classictheory do not provide a good bound on generalization...
In the end, we do not even optimize what we would like(surrogate loss), we just monitor the validation error in order toknow when we stop...
What about Neural Networks?
Open question: Why does it work?
Optimization
empirically: a lot of minima reached by SGD are equivalentin term of generalization error1
Generalization
new theories rising?2
1Choromanska et al. 2015; Goodfellow and Vinyals 2014.2Shwartz-Ziv and Tishby 2017.
Practical Limitations of Neural Networks
Some current limitations
Interpretability
building predictive models can be useful to performanalysis on a set of variables (e.g. economics, ecology, ...)
... given we can understand what the model has learned
but neural networks are black boxes
Can be fooled
no measure of prediction uncertainty usually outputs only some sort of scores that can not be
interpreted as uncertainty estimates Bayesian Neural Networks (BNN)3
easy to find adversarial examples need for formal verification
3Kendall and Gal 2017.
Some failures...
Because neural networks have allowed us to make considerableprogress in a lot of different tasks, we might considerer themsolved... but there are not.
Some failures...
Some failures...
Some failures...
Adversarial examples4
4Goodfellow, Shlens, et al. 2014; Kurakin et al. 2016.
Verification
For safety-critical systems, we need formal guarantees on thebehavior of the algorithms, e.g.:
check properties of DNN implementation of next-generationairborne collision avoidance system for unmanned aircraft5
verify that there is no adversarial example around a testpoint6
5Katz et al. 2017.6Huang et al. 2017.
Conclusion
Deep Learning has been very successful and has a very largearray of applications.
Proved that the current statistical learning theories areinsufficient.
We need to rethink of generalization in high dimension.
We need to find a new learning theory.
Lack of theoretical understanding limits their usage as blackbox modules in software because they do not provide strongcontracts (hence need for formal verification).
There is a current on-going debate in the community (e.g. seeAli Rahimi’s NIPS 2017 Test-of-Time award talk).
Questions?