Dnn Tutorial
Deep Learning: Past, Present and Future (?)
Kyunghyun Cho
Laboratoire d’Informatique des Systèmes Adaptatifs, Département d’informatique et de recherche opérationnelle,
Faculté des arts et des sciences, Université de Montréal
Machine Learning
Learning Inference
?
1. Let the model M learn the data D
2. Let the model M infer unknown quantities
Machine Learning & Perception
Learning Inference
?
Perception is the organization, identification, and interpretation of sensory information in order to represent and understand the environment.
–Wikipedia
→
(Farabet et al., 2013)
Machine Learning & Perception: Examples
Learning Inference
?
Data | Sensory Information | Query
Labeled Images | An image | Is a cat in the image?
Transcribed Speech | A speech segment | What is this person saying?
Paraphrases | A pair of sentences | Is this sentence a paraphrase?
Movie Ratings | Ratings of Y and by X | Will a user X like a movie Y?
Parallel Corpora | A Finnish sentence | What is “moi” in English?
A human possesses the best machinery for perception, called a brain.
But, how does our brain do it?
Deep Learning: Motivated from Human Learning
(Van Essen&Gallant, 1994)
Learn massive data simple functions Multi-layered
(Krizhevsky et al., 2012)
Boltzmann machines? I remember working on them in the 80s and 90s...
– Anonymous Interviewer, 2011 (paraphrased)
Deep Learning: History
1958 Rosenblatt proposed perceptrons
1980 Neocognitron (Fukushima, 1980)
1982 Hopfield network, SOM (Kohonen, 1982), Neural PCA (Oja, 1982)
1985 Boltzmann machines (Ackley et al., 1985)
1986 Multilayer perceptrons and backpropagation (Rumelhart et al., 1986)
1988 RBF networks (Broomhead & Lowe, 1988)
1989 Autoencoders (Baldi & Hornik, 1989), Convolutional network (LeCun, 1989)
1992 Sigmoid belief network (Neal, 1992)
1993 Sparse coding (Field, 1993)
Why all this fuss about deep learning now?
ImageNet: ILSVRC 2012 – Classification Task
Top Rankers
1. SuperVision (0.153): Deep Conv. Neural Network (Krizhevsky et al.)
2. ISI (0.262): Features + FV + Linear classifier (Gunji et al.)
3. OXFORD_VGG (0.270): Features + FV + SVM (Simonyan et al.)
4. XRCE/INRIA (0.271): SIFT + FV + PQ + SVM (Perronnin et al.)
5. University of Amsterdam (0.300): Color desc. + SVM (van de Sande et al.)
(Krizhevsky et al., 2012)
ImageNet: ILSVRC 2013 – Classification Task
Top Rankers
1. Clarifai (0.117): Deep Convolutional Neural Networks (Zeiler)
2. NUS: Deep Convolutional Neural Networks
3. ZF: Deep Convolutional Neural Networks
4. Andrew Howard: Deep Convolutional Neural Networks
5. OverFeat: Deep Convolutional Neural Networks
6. UvA-Euvision: Deep Convolutional Neural Networks
7. Adobe: Deep Convolutional Neural Networks
8. VGG: Deep Convolutional Neural Networks
9. CognitiveVision: Deep Convolutional Neural Networks
10. decaf: Deep Convolutional Neural Networks
11. IBM Multimedia Team: Deep Convolutional Neural Networks
12. Deep Punx (0.209): Deep Convolutional Neural Networks
13. MIL (0.244): Local image descriptors + FV + linear classifier (Hidaka et al.)
14. Minerva-MSRA: Deep Convolutional Neural Networks
15. Orange: Deep Convolutional Neural Networks
16. BUPT-Orange: Deep Convolutional Neural Networks
17. Trimps-Soushen1: Deep Convolutional Neural Networks
18. QuantumLeap: 15 features + RVM (Shu & Shu)
(Sermanet et al., 2013)
You’re already using deep learning!
How do you tell deep learning from not-so-deep machine learning?
Not-So-Deep Machine Learning
1. Feature engineering ← not learned!
2. Learning
3. Inference
[Figure: inputs x_1, x_2, ..., x_N are turned into features in stage (1), followed by stages (2) and (3)]
Separation between domain knowledge and general machine learning
Unsupervised Learning of Representation
1a. Feature engineering
1b. Feature/Representation learning
2. Learning
3. Inference
[Figure: inputs x_1, x_2, ..., x_N are turned into features in stages (1a) and (1b), followed by stages (2) and (3)]
Deep Learning: toward the Ultimate Machine Learning?
1. Jointly learn everything
2. Inference
[Figure: a single jointly learned stage (1), followed by inference (2)]
The data decides –Yoshua Bengio
Why now? Why not 20 years ago?
What has happened in the last 20+ years?
- We have connected the dots, e.g.,
  - PCA ⇔ Neural PCA ⇔ Probabilistic PCA ⇔ Autoencoder
  - Autoencoder ⇔ Belief network ⇔ Restricted Boltzmann machine
- We understand learning better
  - Model structure matters a lot
  - Learning is, but is not, optimization
  - No need to be scared of non-convex optimization
- We understand how learning and inference interact
- Exponential growth of the amount of data and computational power
And Beyond. . .
(Goodfellow, 2013)
Today’s Tutorial: Introduction to Deep Learning
1. Deep Learning: Past, Present and Future
2. Supervised Neural Networks
  - Multilayer Perceptron and Learning
  - Regularization
  - Practical Recipe
  - Task-specific Neural Network
3. Unsupervised Neural Networks
  - Unsupervised Learning
  - Generative Modeling with Neural Networks
    - Density/Distribution Learning
    - Learning to Infer
  - Semi-Supervised Learning and Pretraining
4. Advanced Topics
  - Beyond Computer Vision
  - Advanced Topics
Supervised Neural Networks
Warning!!!
The next 9–10 slides may be extremely boring.
Supervised Learning: Rough Picture
Data:
$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$
Assumption:
$$y_n = f^*(x_n)$$
Find a function $f$, using $D$, that emulates $f^*$ as well as possible on samples potentially not included in $D$.
Supervised Learning: Probabilistic Picture
Underlying distributions:
$$X \quad \text{and} \quad Y \mid X$$
Data:
$$D = \{(x, y) \mid x \sim X \ \text{and}\ y \sim Y \mid X = x\}$$
Find a distribution $p(y \mid x)$, using $D$, that emulates $Y \mid X$ as well as possible on new samples from $p(x)$ potentially not included in $D$.
Supervised Learning: Evaluation and Generalization
Evaluation:
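$$\mathbb{E}_{x \sim p(x)}\left[\|f(x) - f^*(x)\|_p\right] \approx \sum_{x \in D_{\text{test}}} \|f(x) - f^*(x)\|_p$$
or
$$\mathbb{E}_{x \sim p(x)}\left[\mathrm{KL}\left(p(y \mid x) \,\|\, p^*(y \mid x)\right)\right] \approx \sum_{x \in D_{\text{test}}} \mathrm{KL}(p \,\|\, p^*),$$
where $D \neq D_{\text{test}}$.
Why do we test the found solution $f$ or $p$ on $D_{\text{test}}$, not on $D$?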
Linear/Ridge Regression
Underlying assumption:
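$$y = W^* x + b^* + \epsilon,$$
where $\epsilon$ is white Gaussian noise.
Data:
$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\},$$
where $y_n = W^* x_n + b^* + \epsilon$.
Learning:
$$\hat{W}, \hat{b} = \arg\min_{W, b} \frac{1}{N} \sum_{n=1}^{N} \|W x_n + b - y_n\|_2^2 + \lambda\left(\|W\|_F^2 + \|b\|_2^2\right)$$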
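As a rough illustration of the learning rule above, the sketch below solves the scalar case (one input, one output) in closed form by setting the gradient of the regularized objective to zero. The function name `ridge_fit_1d` and the toy data are assumptions made for this example, not part of the tutorial.

```python
def ridge_fit_1d(xs, ys, lam):
    """Minimise (1/N) * sum((w*x + b - y)**2) + lam * (w**2 + b**2) for scalars w, b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    mxx = sum(x * x for x in xs) / n
    mxy = sum(x * y for x, y in zip(xs, ys)) / n
    # Setting the gradient to zero gives a 2x2 linear system:
    #   (mxx + lam) * w + mx * b        = mxy
    #   mx * w          + (1 + lam) * b = my
    a11, a12, a21, a22 = mxx + lam, mx, mx, 1.0 + lam
    det = a11 * a22 - a12 * a21
    w = (mxy * a22 - a12 * my) / det
    b = (a11 * my - a21 * mxy) / det
    return w, b

# Toy data generated from y = 2x + 1 without noise (hypothetical example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w_plain, b_plain = ridge_fit_1d(xs, ys, lam=0.0)  # recovers the generating line
w_ridge, b_ridge = ridge_fit_1d(xs, ys, lam=1.0)  # both parameters shrink toward 0
```

With `lam=0.0` this reduces to ordinary least squares; a positive `lam` pulls both parameters toward zero, which is the effect of the penalty term.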
Multilayer Perceptron: (Binary) Classification
Underlying assumption:
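$$y \sim \mathcal{B}(p = f_{\theta^*}(x)),$$
where $f_{\theta^*}$ is a nonlinear function parameterized with $\theta^*$ and $\mathcal{B}(p)$ is a Bernoulli distribution of mean $p$.
Data:
$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\},$$
where $y_n \sim \mathcal{B}(p = f_{\theta^*}(x_n))$.
Learning (minimize the negative log-likelihood):
$$\hat{\theta} = \arg\min_{\theta} -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log f_\theta(x_n) + (1 - y_n) \log\left(1 - f_\theta(x_n)\right) \right] + \lambda \Omega(\theta, D)$$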
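The data term of this objective is the average negative Bernoulli log-likelihood, often called the binary cross-entropy. Below is a minimal sketch of it without the regularizer; the helper name `bernoulli_nll` and the clamping constant `eps` are assumptions added for numerical safety.

```python
import math

def bernoulli_nll(ps, ys, eps=1e-12):
    """Average negative log-likelihood of labels ys in {0, 1} under predicted means ps."""
    total = 0.0
    for p, y in zip(ps, ys):
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log(0) never occurs
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(ys)

uninformed = bernoulli_nll([0.5, 0.5], [0, 1])  # equals log(2)
confident = bernoulli_nll([0.1, 0.9], [0, 1])   # good predictions, lower cost
wrong = bernoulli_nll([0.9, 0.1], [0, 1])       # bad predictions, higher cost
```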
Learning as an Optimization
Ultimately, learning is (mostly)
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} c\left((x_n, y_n) \mid \theta\right) + \lambda \Omega(\theta, D),$$
where $c((x, y) \mid \theta)$ is a per-sample cost function.
Gradient Descent
Gradient-descent Algorithm:
$$\theta_t = \theta_{t-1} - \eta \nabla L(\theta_{t-1})$$
where, in our case,
$$L(\theta) = \frac{1}{N} \sum_{n=1}^{N} l\left((x_n, y_n) \mid \theta\right).$$
Let us assume that Ω (θ,D) = 0.
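The update rule can be sketched in a few lines. The quadratic cost, starting point, and step size below are illustrative assumptions chosen so that the iteration visibly converges.

```python
def gradient_descent(grad, theta0, eta, steps):
    """Repeat theta_t = theta_{t-1} - eta * grad(theta_{t-1}) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Hypothetical cost L(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_hat = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0, eta=0.1, steps=200)
```

For this cost, each step multiplies the distance to the minimizer by 0.8, so `theta_hat` ends up very close to 3.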
Stochastic Gradient Descent
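Often, it is too costly to compute $L(\theta)$ due to a large training set.
Stochastic gradient descent algorithm:
$$\theta_t = \theta_{t-1} - \eta_t \nabla l\left((x', y') \mid \theta_{t-1}\right),$$
where $(x', y')$ is a randomly chosen sample from $D$, and
$$\sum_{t=1}^{\infty} \eta_t \to \infty \quad \text{and} \quad \sum_{t=1}^{\infty} (\eta_t)^2 < \infty.$$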
Let us assume that Ω (θ,D) = 0.
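A minimal sketch of stochastic gradient descent with the step-size schedule $\eta_t = \eta_0 / t$, which satisfies both conditions on the step sizes. The per-sample cost $l(x \mid \theta) = \frac{1}{2}(\theta - x)^2$, the data, and the function name `sgd` are assumptions for illustration; for this cost, the minimizer of the average cost is the data mean.

```python
import random

def sgd(grad_l, data, theta0, eta0, steps, seed=0):
    """Stochastic gradient descent with step size eta_t = eta0 / t."""
    rng = random.Random(seed)
    theta = theta0
    for t in range(1, steps + 1):
        sample = rng.choice(data)  # one randomly chosen sample from D
        eta_t = eta0 / t           # sum of eta_t diverges, sum of eta_t**2 converges
        theta = theta - eta_t * grad_l(sample, theta)
    return theta

data = [1.0, 2.0, 3.0, 4.0, 5.0]
# Gradient of l(x | theta) = 0.5 * (theta - x)**2 with respect to theta.
theta_hat = sgd(lambda x, theta: theta - x, data, theta0=0.0, eta0=1.0, steps=5000)
```

With `eta0=1.0` the iterate is exactly the running mean of the samples drawn so far, so `theta_hat` approaches the data mean of 3 as the number of steps grows.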
Any question so far?
Almost there. . .
How do we compute the gradient efficiently for deep neural networks?
Backpropagation Algorithm – (1) Forward Pass
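[Figure: computation graph from inputs x_1, x_2 through hidden units h_1, h_2 and output f to the loss L]
Forward Computation:
$$L\left(f\left(h_1(x_1, x_2, \theta_{h_1}),\, h_2(x_1, x_2, \theta_{h_2}),\, \theta_f\right),\, y\right)$$
Multilayer Perceptron with a single hidden layer:
$$L(x, y, \theta) = \frac{1}{2}\left(y - U^\top \phi\left(W^\top x\right)\right)^2$$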
Backpropagation Algorithm – (2) Chain Rule
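[Figure: the same computation graph, with each edge labeled by its local derivative]
Chain rule of derivatives:
$$\frac{\partial L}{\partial x_1} = \frac{\partial L}{\partial f} \frac{\partial f}{\partial x_1} = \frac{\partial L}{\partial f} \left( \frac{\partial f}{\partial h_1} \frac{\partial h_1}{\partial x_1} + \frac{\partial f}{\partial h_2} \frac{\partial h_2}{\partial x_1} \right)$$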
Backpropagation Algorithm – (3) Shared Derivatives
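[Figure: the same computation graph, highlighting the shared derivatives ∂f/∂h_1, ∂f/∂h_2 and ∂L/∂f]
Local derivatives are shared:
$$\frac{\partial L}{\partial x_1} = \frac{\partial L}{\partial f} \left( \frac{\partial f}{\partial h_1} \frac{\partial h_1}{\partial x_1} + \frac{\partial f}{\partial h_2} \frac{\partial h_2}{\partial x_1} \right)$$
$$\frac{\partial L}{\partial x_2} = \frac{\partial L}{\partial f} \left( \frac{\partial f}{\partial h_1} \frac{\partial h_1}{\partial x_2} + \frac{\partial f}{\partial h_2} \frac{\partial h_2}{\partial x_2} \right)$$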
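To make the forward pass and the reuse of shared derivatives concrete, here is a sketch for the single-hidden-layer perceptron $L = \frac{1}{2}(y - U^\top \phi(W^\top x))^2$. The choice $\phi = \tanh$, the function name, and all numbers are assumptions for illustration. Note how `dL_df` and `dL_da` are computed once and reused for every partial derivative.

```python
import math

def forward_backward(x, y, W, U):
    """Loss and gradients for L = 0.5 * (y - U^T tanh(W^T x))**2.

    W is a p-by-q list of lists, U has length q, x has length p.
    """
    p, q = len(x), len(U)
    # Forward pass.
    a = [sum(W[i][j] * x[i] for i in range(p)) for j in range(q)]
    h = [math.tanh(aj) for aj in a]
    f = sum(U[j] * h[j] for j in range(q))
    loss = 0.5 * (y - f) ** 2
    # Backward pass: dL/df is computed once and shared by everything below.
    dL_df = f - y
    dL_dU = [dL_df * h[j] for j in range(q)]
    dL_da = [dL_df * U[j] * (1.0 - h[j] ** 2) for j in range(q)]
    dL_dW = [[x[i] * dL_da[j] for j in range(q)] for i in range(p)]
    dL_dx = [sum(W[i][j] * dL_da[j] for j in range(q)) for i in range(p)]
    return loss, dL_dW, dL_dU, dL_dx

x, y = [0.5, -1.0], 0.3
W = [[0.1, -0.4], [0.7, 0.2]]
U = [0.6, -0.9]
loss, dL_dW, dL_dU, dL_dx = forward_backward(x, y, W, U)
```

A finite-difference comparison (perturb one parameter slightly and recompute the loss) is a simple way to check such hand-derived gradients.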
Backpropagation Algorithm – (4) Local Computation
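[Figure: a single node h with inputs a_1, a_2, ..., a_q; the incoming derivative ∂L/∂h is multiplied by the local derivatives ∂h/∂a_1, ..., ∂h/∂a_q to give ∂L/∂a_1, ..., ∂L/∂a_q]
Each node computes
- Forward: $h(a_1, a_2, \ldots, a_q)$
- Backward: $\frac{\partial h}{\partial a_1}, \frac{\partial h}{\partial a_2}, \ldots, \frac{\partial h}{\partial a_q}$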
Backpropagation Algorithm – Requirements
[Figure: a single node h with inputs a_1, a_2, ..., a_q and the derivatives ∂L/∂h, ∂h/∂a_i and ∂L/∂a_i]
- Each node computes a differentiable function [1]
- Directed Acyclic Graph [2]
[1] Well...?
[2] Well...?
Backpropagation Algorithm – Automatic Differentiation
[Figure: computation graph x_1, x_2 → h_1, h_2 → f → L with the derivatives ∂f/∂h_1, ∂f/∂h_2 and ∂L/∂f]
- Generalized approach to computing partial derivatives
- As long as your neural network fits the requirements, you do not need to derive the derivatives yourself!
- Theano, Torch, ...
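The idea can be sketched as a toy reverse-mode differentiator: each node stores its value plus the local derivatives with respect to its inputs, and a backward sweep over the acyclic graph accumulates the chain rule. This is an illustrative sketch only; the class `Var` and the functions `tanh` and `backward` are hypothetical names and do not mirror the API of Theano or Torch.

```python
import math

class Var:
    """A node in the computation graph: a value plus (parent, local derivative) pairs."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value, ((self, other.value), (other, self.value)))

def tanh(v):
    t = math.tanh(v.value)
    return Var(t, ((v, 1.0 - t * t),))

def backward(out):
    """Accumulate d(out)/d(node) into node.grad for every node that out depends on."""
    order, seen = [], set()

    def visit(v):
        if id(v) in seen:
            return
        seen.add(id(v))
        for parent, _ in v.parents:
            visit(parent)
        order.append(v)  # a node is appended only after all of its inputs

    visit(out)
    out.grad = 1.0
    for v in reversed(order):  # consumers are processed before their inputs
        for parent, local in v.parents:
            parent.grad += v.grad * local

x1, x2 = Var(0.5), Var(-2.0)
f = x1 * x2 + tanh(x1)
backward(f)
# Now x1.grad holds x2 + 1 - tanh(x1)**2 and x2.grad holds x1.
```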
Any question on backpropagation and automatic differentiation?
Regularization – (1) Maximum a Posteriori
Probabilistic Perspective: Find a model M by...
- Maximum Likelihood (ML): $\arg\max_\theta p(D \mid M)$
- Maximum a Posteriori (MAP): $\arg\max_\theta p(M \mid D)$
What is the probability of $\theta$ given the current data $D$?
$$p(M \mid D) = \frac{p(D \mid M)\, p(M)}{\sum_M p(D, M)} \propto p(D \mid M)\, p(M)$$
$p(M)$: What do we think a good model should be?
Regularization – (2) Weight Decay
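$p(M)$: What do we think a good model should be?
Weight-Decay Regularization
- Prior Distribution: $\theta_j \sim \mathcal{N}\left(0, (M\lambda)^{-1}\right)$
Maximum a Posteriori Estimation (taking the logarithm of the posterior and absorbing constant factors into $\lambda$):
$$\hat{\theta} = \arg\max_{\theta} \sum_{n=1}^{N} \log p(y_n \mid x_n, \theta) - \lambda \sum_{j=1}^{M} \theta_j^2.$$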
Regularization – (3) Smoothness and Noise Injection
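Prior on a Model: Smoothness
- $f(x) \approx f(x + \epsilon)$: the model should be insensitive to small changes/noise ⟺ Minimize $\sum_{n=1}^{N} \left| \frac{\partial f(x_n)}{\partial x} \right|^2$
Regularizing $\sum_{n=1}^{N} \left| \frac{\partial f(x_n)}{\partial x} \right|^2$ is equivalent to adding random Gaussian noise in the input (Bishop, 1995)
$$\arg\min_{\theta} \sum_{n=1}^{N} \|f_\theta(x_n) - y_n\|^2 + \lambda \left| \frac{\partial f(x_n)}{\partial x} \right|^2 \approx \sum_{\epsilon} p(\epsilon) \left( \arg\min_{\theta} \sum_{n=1}^{N} \|f_\theta(x_n + \epsilon) - y_n\|^2 \right)$$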
Regularization – (4a) Ensemble Learning and Dropout
Wisdom of the Crowd: Train M classifiers and let them vote
$$f(x) = \frac{1}{M} \sum_{m=1}^{M} f_{M_m}(x)$$
(Ciresan et al., 2012)
Regularization – (4b) Ensemble Learning and Dropout
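Dropout: Train one, but exponentially many classifiers (Hinton et al., 2012)
$$L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log \mathbb{E}_m\left[p(y_n, m \mid x_n, \theta)\right] \geq \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_m\left[\log p(y_n, m \mid x_n, \theta)\right]$$
[Figure: network with inputs x_1, x_2, hidden units h_1, h_2, h_3, ..., h_H and output f]
Each update samples one network out of exponentially many classifiers.
$$\theta_t = \theta_{t-1} - \eta_t \nabla l\left((x', y'), m \mid \theta_{t-1}\right),$$
where $m_{i,l} \sim \mathcal{B}(0.5)$.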
Regularization – (4b) Ensemble Learning and Dropout
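Dropout: When testing, halve the activations
$$L(\theta) \geq \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_m\left[\log p(y_n, m \mid x_n, \theta)\right] \approx \frac{1}{N} \sum_{n=1}^{N} \log p\left(y_n, m = \tfrac{1}{2} \mid x_n, \theta\right)$$
[Figure: network with inputs x_1, x_2, hidden units h_1, h_2, h_3, ..., h_H and output f; each hidden activation is scaled by 1/2]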
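The two phases of dropout can be sketched as follows: during training each hidden activation is multiplied by a mask bit drawn from $\mathcal{B}(0.5)$, and at test time the activations are halved instead. The activations `h`, the output weights `u`, and the function names are hypothetical. For a linear output unit the halved activations reproduce the average over masks exactly; with nonlinearities downstream it is only an approximation.

```python
import random

def dropout_train(h, rng, keep=0.5):
    """Training: keep each activation with probability `keep`, otherwise zero it."""
    return [hi if rng.random() < keep else 0.0 for hi in h]

def dropout_test(h, keep=0.5):
    """Testing: no sampling; scale every activation by `keep` (halve it for keep = 0.5)."""
    return [keep * hi for hi in h]

def output(hidden, u):
    """A linear output unit on top of the hidden layer."""
    return sum(ui * hi for ui, hi in zip(u, hidden))

rng = random.Random(0)
h = [1.0, 2.0, 3.0, 4.0]    # hypothetical hidden activations
u = [0.1, -0.2, 0.3, 0.4]   # hypothetical output weights

n_masks = 20000
avg_train_out = sum(output(dropout_train(h, rng), u) for _ in range(n_masks)) / n_masks
test_out = output(dropout_test(h), u)  # 0.5 * (0.1 - 0.4 + 0.9 + 1.6) = 1.1
```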
Do you see why I spent so much time on regularization?
Common Recipe for Deep Neural Networks
1. Use a piecewise linear hidden unit
  - Rectifier: $h(x) = \max\{0, x\}$ (Glorot & Bengio, 2011)
  - Maxout: $h(x_1, \ldots, x_p) = \max\{x_1, \ldots, x_p\}$ (Goodfellow et al., 2013)
2. Preprocess data and choose features carefully
  - Images: Whitening? Local contrast normalization? Raw? SIFT? HoG?
  - Speech: Raw? Spectrum?
  - Text: Characters? Words? Tree?
  - General: z-Normalization?
3. Use Dropout and other regularization methods
4. Unsupervised Pretraining (Hinton & Salakhutdinov, 2006)
  - Few labeled samples, but a lot of unlabeled samples
5. Carefully search for hyperparameters
  - Random search, Bayesian optimization (Bergstra & Bengio, 2013; Bergstra et al., 2011; Snoek et al., 2012)
6. Often, the deeper the better
7. Build an ensemble of neural networks
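The two piecewise linear units from item 1 and the z-normalization mentioned in item 2 are small enough to sketch directly. The function names are assumptions, and the z-normalization uses the population standard deviation, which is one of several possible conventions.

```python
import math

def rectifier(x):
    """h(x) = max{0, x}."""
    return max(0.0, x)

def maxout(xs):
    """h(x_1, ..., x_p) = max{x_1, ..., x_p}."""
    return max(xs)

def z_normalize(values):
    """Shift to zero mean and scale to unit variance (population standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mean) / std for v in values]

relu_out = [rectifier(v) for v in (-2.0, 0.0, 3.5)]
maxout_out = maxout([0.2, -1.0, 0.7])
normalized = z_normalize([2.0, 4.0, 6.0, 8.0])
```

A rectifier can be seen as a maxout over the pair (0, x), which is one way to relate the two units.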
But, nobody seems to use a vanilla multilayer perceptron, right?
How to Encode Prior/Domain Knowledge?
Data Preprocessing and Feature Extraction:
- Object recognition from images
  - Lighting condition shouldn’t matter → Contrast normalization
- Gesture recognition from skeleton
  - Relative positions of joints are important → Relative coordinate system w.r.t. the body center
- Language Processing
  - Word counts are important → Bag-of-Words representation
Model Architecture Design
Convolutional Neural Networks – (1)
Suitable for Images, Videos and Speech
Prior/Domain Knowledge
- Translation invariance
- Rotation invariance: Images and Videos
- Temporal invariance: Videos and Speech
- Frequency invariance: Speech
Convolutional Neural Networks – (2) Convolution and Pooling
Convolution
- Global
Pooling
- Local
[Figure: three max (pooling) units]
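A one-dimensional sketch of the two operations: the same small kernel is slid over every position of the input (convolution, implemented here as cross-correlation without padding), and a max is then taken over small local windows (pooling). The kernel, signals, and window size are hypothetical values chosen for illustration.

```python
def conv1d(signal, kernel):
    """Slide the same kernel over every valid position of the signal."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool1d(values, size):
    """Take the max over non-overlapping local windows of length `size`."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

kernel = [1.0, -1.0]
a = [0, 0, 1, 0, 0, 0, 0]
b = [0, 0, 0, 1, 0, 0, 0]  # the same pattern shifted by one position

pooled_a = max_pool1d(conv1d(a, kernel), size=2)
pooled_b = max_pool1d(conv1d(b, kernel), size=2)
```

In this example the shifted input produces the same pooled output because the shift stays within one pooling window; larger shifts would move the response to a different window, so the invariance is only local.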
Convolutional Neural Networks – (3) Convolutional Neural Network
(TeraDeep, 2013)
Convolutional Layer
1. Contrast Normalization
2. Convolution
3. Pooling
4. Nonlinearity
Convolutional Neural Networks – (3) Deep Convolutional Neural Network
(Krizhevsky et al., 2012)
Recursive Neural Networks – (1)
Suitable for Text and Variable-Length Sequences
Prior/Domain Knowledge
- Compositionality ≈ Tree-based Grammar (?)
- Location invariance
- Variable Length
Recursive Neural Networks – (2)
Compositional Structure
A small crowd quietly enters the historic church
Small (local) pieces are glued together to form a global structure
Recursive Neural Networks – (3)
Finding a good, compact representation of variable-length sequences
(Socher et al., 2011)
What other architectures can you think of?
Further Topics
- Is learning solved for supervised neural networks?
- Recurrent neural networks: cope with variable-length inputs/outputs
- Beyond sigmoid and rectifier functions
Unsupervised Neural Networks
Warning!!!
The first half of this session can be boring.
Unsupervised Learning
No more labels!
$$D = \{x_1, x_2, \ldots, x_N\}$$
What can we do?
(Exploratory) Data Analysis
The most important step in machine learning
[Figure: 3-D scatter plot with axes x, y and z]
Human Gestures Visualization by DNN
(Cho&Chen, 2014)
Feature Extraction
With domain knowledge → Engineered Features
Without domain knowledge → Learned Features
[Figure: input x mapped to features f?]
Generative Model: Probabilistic Picture
Underlying distribution:
$$X \sim P_D$$
Data:
$$D = \{x \mid x \sim P_D\}$$
Find a distribution $p(x)$, using $D$, that emulates $P_D$ as well as possible on new samples from $p(x)$ potentially not included in $D$.
What should we do with data $D = \{x_n\}_{n=1}^{N}$?
Example target tasks
- Classification $p(x_{\text{class}} \mid x_{\text{ins}})$
- Missing value reconstruction $p(x_m \mid x_o)$
- Denoising $p(x \mid \tilde{x})$
- Structured Output Prediction $p(x_{\text{out}} \mid x_{\text{in}})$
- Outlier Detection $p(x) \overset{?}{>} \tau$
Ultimately, it comes down to learning a distribution of x.
Density/Distribution Estimation – (1)
Latent Variable Models $p_\theta(x)$
$$\theta^* = \arg\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log \sum_{h} p(x_n \mid h)\, p(h)$$
1. Define a parametric form of joint distribution $p_\theta(x, h)$
2. Derive a learning rule for $\theta$
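Density/Distribution Estimation – (2) Restricted Boltzmann Machines
[Figure: bipartite graph with visible units x_1, x_2, ..., x_p and hidden units h_1, h_2, ..., h_q]
1. Joint distribution
$$p_\theta(x, h) = \frac{1}{Z(\theta)} \exp\left( \sum_{i=1}^{p} \sum_{j=1}^{q} x_i h_j w_{ij} \right)$$
2. Marginal distribution
$$\sum_{h} p_\theta(x, h) = \frac{1}{Z(\theta)} \prod_{j=1}^{q} \left( 1 + \exp\left( \sum_{i=1}^{p} w_{ij} x_i \right) \right)$$
3. Learning rule: Maximum Likelihood
$$L(\theta) = \mathbb{E}_{x \sim P_d}\left[ \log \prod_{j=1}^{q} \left( 1 + \exp\left( \sum_{i=1}^{p} w_{ij} x_i \right) \right) - \log Z(\theta) \right]$$
$$\nabla_{w_{ij}} = \langle x_i h_j \rangle_d - \langle x_i h_j \rangle_m$$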
(Smolensky, 1986)
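The closed-form marginal can be checked numerically on a tiny model by summing the joint over all binary hidden configurations. The sketch below assumes binary units and no bias terms, as in the joint distribution above; the weight values and function names are hypothetical. Brute-force enumeration like this is feasible only for a handful of units, which is why $Z(\theta)$ is intractable in general.

```python
import itertools
import math

def unnorm_joint(x, h, W):
    """exp(sum_i sum_j x_i * h_j * w_ij), i.e. the joint without the 1/Z factor."""
    return math.exp(sum(x[i] * h[j] * W[i][j]
                        for i in range(len(x)) for j in range(len(h))))

def unnorm_marginal(x, W):
    """prod_j (1 + exp(sum_i w_ij * x_i)), i.e. the marginal without the 1/Z factor."""
    prod = 1.0
    for j in range(len(W[0])):
        prod *= 1.0 + math.exp(sum(W[i][j] * x[i] for i in range(len(x))))
    return prod

W = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]  # p = 3 visible units, q = 2 hidden units
x = (1, 0, 1)

brute_force = sum(unnorm_joint(x, h, W) for h in itertools.product((0, 1), repeat=2))
closed_form = unnorm_marginal(x, W)

# Normalization constant by enumerating all 2**3 visible configurations.
Z = sum(unnorm_marginal(v, W) for v in itertools.product((0, 1), repeat=3))
p_x = closed_form / Z
```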
Density/Distribution Estimation – (3) Belief Networks
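[Figure: layered directed network]
1. Joint distribution
$$p_\theta(x, h) = p(x \mid h_1)\, p(h_1 \mid h_2) \cdots p(h)$$
2. Marginal distribution
$$\sum_{h} p_\theta(x, h) = \,?$$
3. Learning rule: Maximum Likelihood
$$L(\theta) = \mathbb{E}_{x \sim P_d}\left[ \log \sum_{h} p_\theta(x, h) \right]$$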
(Neal, 1996)
Density/Distribution Estimation – (4) NADE
1. Joint distribution
$$p_\theta(x) = p_\theta(x_1)\, p_\theta(x_2 \mid x_1) \cdots p_\theta(x_d \mid x_1, \ldots, x_{d-1})$$
2. Marginal distribution: no latent variable $h$
3. Learning rule: Maximum Likelihood (fixed order), Order-agnostic (all orders)
(Larochelle&Murray, 2011)
Density/Distribution Estimation – (4) Issues
Intractability! Intractability! Intractability!
(General) Boltzmann Machines
- Normalization Constant $Z(\theta)$
- Marginal Probability $\sum_h p(x, h)$
- Posterior Probability $p(h \mid x)$
- Conditional Probability $p(x_{\text{mis}} \mid x_{\text{obs}})$
Restricted Boltzmann Machines
- Normalization Constant $Z(\theta)$
- Marginal Probability $\sum_h p(x, h)$
- Posterior Probability $p(h \mid x)$
- Conditional Probability $p(x_{\text{mis}} \mid x_{\text{obs}})$
Belief Networks
- Normalization Constant $Z(\theta)$
- Marginal Probability $\sum_h p(x, h)$
- Posterior Probability $p(h \mid x)$
- Conditional Probability $p(x_{\text{mis}} \mid x_{\text{obs}})$
NADE
- Normalization Constant $Z(\theta)$
- Marginal Probability $\sum_h p(x, h)$
- Posterior Probability $p(h \mid x)$
- Conditional Probability $p(x_{\text{mis}} \mid x_{\text{obs}})$
- Somewhat unsatisfactory performance
Do we want to learn the distribution?
Generative Model – (1) Learn to Infer
Example target tasks
- Classification $p(x_{\text{class}} \mid x_{\text{ins}})$
- Missing value reconstruction $p(x_m \mid x_o)$
- Denoising $p(x \mid \tilde{x})$
- Structured Output Prediction $p(x_{\text{out}} \mid x_{\text{in}})$
- Outlier Detection $p(x) \overset{?}{>} \tau$
All we want is to infer the conditional distribution of unknown variables
(Goodfellow et al., 2013; Brakel et al., 2013; Stoyanov et al., 2011, Raiko et al., 2014)
Generative Model – (2) Learn to Infer
Approximate Inference in a Graphical Model
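$$p(x_{\text{mis}} \mid x_{\text{obs}}) \approx Q(x_{\text{mis}} \mid x_{\text{obs}})$$
Methods:
- Loopy belief propagation
- Variational inference/message-passing
At the end of the day...
$$Q^{k}(x_{\text{mis}} \mid x_{\text{obs}}) = f\left(Q^{k-1}(x_{\text{mis}} \mid x_{\text{obs}})\right)$$
until convergence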
Generative Model – (3) Learn to Infer
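[Figure: an RBM with visible units x and hidden units h, unrolled into the chain x<0>, h<1>, x<1>, h<2>, x<2>, ..., h<k>, x<k>]
Approximate Inference in Restricted Boltzmann Machine
$$p(x_{\text{mis}} \mid x_{\text{obs}}) \approx Q(x_{\text{mis}} \mid x_{\text{obs}})$$
Mean-field Fixed-point Iteration
$$\mu_x^{k} = \sigma\left(W \sigma\left(W^\top \mu_x^{k-1} + c\right) + b\right)$$
At the end of the day, a multilayer perceptron with k − 1 hidden layers.
→ Use backpropagation and stochastic gradient descent!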
Generative Model – (4) Learn to Infer – NADE-k
[Figure: the unrolled chain x<0>, h<1>, x<1>, h<2>, x<2>, ..., h<k>, x<k> is generalized to a network v<0>, h<1>[1], h<1>[2], v<1>, h<2>[1], h<2>[2], v<2> with weights W, U and V]
Further Generalization with Deep Neural Networks
$$p(x_{\text{mis}} \mid x_{\text{obs}}) \approx Q(x_{\text{mis}} \mid x_{\text{obs}}) = f_\theta(x_{\text{obs}})$$
Interpret the model as a mixture of NADEs with different orders of variables
- Exact computation of $p(x)$ possible
- Fast inference $p(x_{\text{mis}} \mid x_{\text{obs}})$
- Flexible
(Raiko et al., 2014)
Generative Model – (5) Learn to Infer
Lesson:
- Do not maximize log-likelihood, but minimize the actual cost!
⇐⇒
- Don’t do what a model tells you to do, but do what you aim to do.
But, popular science journalists don’t care about generative models..
Manifold Learning – Semi-Supervised Learning (1)
???
- Which class does the dot belong to, red or blue?
Manifold Learning – Semi-Supervised Learning (2)
???
- Now, which class does the dot belong to, red or blue?
- The black dots are unlabeled
Manifold Learning – New Representation (1)
Representation φ on the data manifold?
[Figure: mapping between the Data space and the Hidden space, κ(x)]
1. φ should reflect changes along the manifold: $\phi(x_i) \neq \phi(x_j)$, for all $x_i, x_j \in D$
2. φ should not reflect any change orthogonal to the manifold: $\phi(x_i + \epsilon) = \phi(x_i)$
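Manifold Learning – New Representation (2) Denoising Autoencoder
(Vincent et al., 2011)
Representations that capture the manifold
1. $\phi(x_i) \neq \phi(x_j)$, for all $x_i, x_j \in D$
2. $\phi(x_i + \epsilon) = \phi(x_i)$
A denoising autoencoder achieves this by
$$\min_{\theta, \theta'} \left\| x - g_{\theta'}\left(f_\theta(x + \epsilon)\right) \right\|^2$$
[Figure: mapping between the Data space and the Hidden space, κ(x)]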
Semi-Supervised Learning in Action (1) Layer-wise Pretraining
[Figure: network from input x through hidden layers h1, h2, h3 to output y]
Semi-Supervised Learning in Action (2) Layer-wise Pretraining
[Figure: pretraining of the 1st layer (x and h[1]), then pretraining of the 2nd layer (h[1] and h[2]), then the full network up to the output y]
(Hinton&Salakhutdinov, 2006; Bengio et al., 2007; Ranzato et al., 2007)
Manifold Embedding and Visualization – (1)
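- Manifold Embedding: $\mathcal{M} \subset \mathbb{R}^d \to \mathbb{R}^q$, $q \ll d$
- If $q = 2$ or $3$, data visualization
[Figure: data mapped to a two-dimensional embedding with coordinates z_1 and z_2]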
(Oja, 1991; Kramer, 1991; Hinton & Salakhutdinov, 2006)
Manifold Embedding and Visualization – (2)
Handwritten Digits: $[0, 1]^{196} \to \mathbb{R}^2$ (legend: digits 0 to 9)
Pose Frame Data: $\mathbb{R}^{30} \to \mathbb{R}^2$ (legend: rotateArmsLBack, rotateArmsRBack, rotateArmsBBack)
What other applications can you think of?
Advanced Topics
Is deep learning all about computer vision and speech recognition?
Deep Reinforcement Learning
[Figure: neural network from the state s through hidden layers h1, h2, h3 to the action a]
Q Learning
- $Q(s, a)$: state-action function
- Action at time $t$ $= \arg\max_{a \in [1, j]} Q(s, a)$
- Update Q on-the-fly
Deep Q Learning
- Model Q with a deep neural network
- Predict $Q(a_j, \cdot)$ for all $j$ at once
- State $s$: visual perception, not internal states!
(Mnih et al., 2013)
Natural Language Processing
In neuropsychology, linguistics and the philosophy of language, a natural language or ordinary language is any language which arises, unpremeditated, in the brains of human beings.
–Wikipedia
Natural Language Processing
To machine learning researchers:
Natural Language is a huge set of variable-length sequences of high-dimensional vectors.
Natural Language Processing – (1) How should we represent a linguistic symbol?
Say, we have four symbols (words):
[EU], [3], [France], [three]
Most naïve, uninformative coding:
[EU] = [1, 0, 0, 0]
[3] = [0, 1, 0, 0]
[France] = [0, 0, 1, 0]
[three] = [0, 0, 0, 1]
Not satisfying..
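One way to see why this coding is unsatisfying is to measure distances between the codes: every pair of distinct words is exactly equally far apart, so the code carries no information about meaning. The helper names below are assumptions for this sketch.

```python
def one_hot(index, size):
    """A vector of zeros with a single one at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

vocab = ["EU", "3", "France", "three"]
codes = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

d_related = sq_dist(codes["3"], codes["three"])   # same meaning
d_unrelated = sq_dist(codes["EU"], codes["3"])    # unrelated meaning
```

Both distances come out as 2, so "3" is no closer to "three" than it is to "EU".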
Natural Language Processing – (2) How should we represent a linguistic symbol?
Say, we have four symbols (words):
[EU], [3], [France], [three]
Is there a representation that preserves the similarities of meanings of symbols?
$$D([\text{EU}], [\text{France}]) < D([\text{EU}], [3]),$$
$$D([3], [\text{three}]) < D([\text{France}], [\text{three}]),$$
$$D([3], [\text{three}]) < \epsilon$$
...
Natural Language Processing – (3) Continuous-Space Representation
Sample sentences:
1. There are three teams left for the qualification.
2. 3 teams have passed the first round.
Task: Predict the following word given the current word [three]
Naïve approach: build a table (so-called n-gram)
- (three, teams), (3, teams)
- The table can grow arbitrarily.
Machine learning: compress the table into a continuous function
- Map three and 3 to nearby points x in a continuous space
- From x, map to [teams].
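The naïve table approach can be sketched with a bigram count table built from the two sample sentences. The tokenization (lowercasing, whitespace splitting, a separate full stop token) and the function name `predict_next` are assumptions made for this example.

```python
from collections import Counter, defaultdict

sentences = [
    "There are three teams left for the qualification .",
    "3 teams have passed the first round .",
]

# table[current_word][next_word] = number of times the pair was observed.
table = defaultdict(Counter)
for sentence in sentences:
    words = sentence.lower().split()
    for current, following in zip(words, words[1:]):
        table[current][following] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    followers = table.get(word)
    return followers.most_common(1)[0][0] if followers else None

after_three = predict_next("three")
after_3 = predict_next("3")
after_four = predict_next("four")  # unseen word: the table has nothing to say
```

The table stores (three, teams) and (3, teams) as unrelated entries and has no answer for an unseen word, which is the limitation that motivates mapping words into a continuous space.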
Natural Language Processing – (4) Continuous-Space Representation
(Cho et al., 2014)
Natural Language Processing – (5) Beyond Word Representation
(Cho et al., 2014)
NN: I am very powerful and can model anything as long as I’m fed enough computational resources.
SVM: But, you have to optimize a high-dimensional, non-convex function which has many, many local minima!
NN: Really?
Advanced Optimization – (1) Statistical Physics says. . .
Not really
Advanced Optimization – (2) Local Minima? Saddle Points?
[Figure: three 3-D surface plots with axes X, Y and Z]
(Dauphin et al., 2014; Pascanu et al., 2014)
Advanced Optimization – (3) Beyond the 2nd-order Method
(Quasi-)Newton Method
$$\theta \leftarrow \theta - H^{-1} \nabla L(\theta)$$
How well does the quadratic approximation hold when training neural networks?
Saddle-Free Newton Method (very new!!) (Dauphin et al., 2014)
$$\theta \leftarrow \theta - |H|^{-1} \nabla L(\theta),$$
where $|H|$ is constructed by
$$|H| = U |\Sigma| V$$
when $H = U \Sigma V$.
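The difference between the two updates shows up already on a two-dimensional saddle. The sketch below uses the hypothetical function $L(x, y) = x^2 - y^2$, whose Hessian is diagonal with eigenvalues 2 and −2, so the decomposition is trivial and taking $|H|$ just means taking absolute values of the diagonal. This is an illustration under that assumption, not a general implementation.

```python
def loss(theta):
    x, y = theta
    return x * x - y * y  # saddle point at the origin

def grad(theta):
    x, y = theta
    return (2.0 * x, -2.0 * y)

HESSIAN_DIAG = (2.0, -2.0)  # eigenvalues of the (diagonal) Hessian

def newton_step(theta):
    """theta - H^{-1} grad: rescales each direction by its signed curvature."""
    return tuple(t - g / h for t, g, h in zip(theta, grad(theta), HESSIAN_DIAG))

def saddle_free_step(theta):
    """theta - |H|^{-1} grad: rescales by the absolute curvature instead."""
    return tuple(t - g / abs(h) for t, g, h in zip(theta, grad(theta), HESSIAN_DIAG))

start = (1.0, 0.5)
after_newton = newton_step(start)            # lands on the saddle point
after_saddle_free = saddle_free_step(start)  # moves away along the negative-curvature direction
```

From the same starting point, the Newton step jumps straight to the saddle, while the saddle-free step lowers the loss by moving further along the direction of negative curvature.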
Last but not least, is there any theoretical ground for using deep neural networks?
Theoretical Analysis – Deep Rectifier Networks Fold the Space
[Figure: (a) folding steps along the horizontal and vertical axes; (b), (c) regions S_1, ..., S_4 of the Input Space are mapped onto regions S'_1, ..., S'_4 in the First Layer Space and the Second Layer Space]
(Montufar et al., 2014; Pascanu et al., 2014)
Is it the beginning of deep learning or the end of deep learning?