The Origin of Deep Learning
Lili Mou, Jan 2015
Acknowledgment
Most of the materials come from G. E. Hinton’s online course.
Outline

• Introduction
• Preliminary
• Boltzmann Machines and RBMs
• Deep Belief Nets
  - The Wake-Sleep Algorithm
  - BNs and RBMs
• Examples
Introduction
Lili Mou | SEKE Team 4/50
A Little History of Neural Networks

• Perceptrons
• Multi-layer neural networks
  - Model capacity
  - Inefficient representations
  - The curse of dimensionality
• Early attempts at deep neural networks
  - Optimization and generalization
• One of the few successful deep architectures: convolutional neural networks
• Successful pretraining methods
  - Deep belief nets
  - Autoencoders
Stacked Restricted Boltzmann Machines and Deep Belief Nets

Stacked Boltzmann machines and deep belief nets are successful pretraining architectures for deep neural networks.

Learning a deep neural network has two stages:

1. Pretrain the model in an unsupervised fashion
2. Initialize the weights of a feed-forward neural network with the pretrained weights
Preliminary
Bayesian Networks
A Bayesian network consists of

• A directed acyclic graph (DAG) G, whose nodes correspond to the random variables X1, X2, · · · , Xn
• For each node Xi, a conditional probability distribution (CPD) P(Xi | Par_G(Xi))

The joint distribution is

P(X) = ∏_{i=1}^{n} P(Xi | Par_G(Xi))
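The factorization above can be checked on a toy network. A minimal sketch, assuming a hypothetical binary chain A → B → C with made-up CPD numbers (none of these values are from the slides):

```python
# Bayesian-network factorization P(X) = prod_i P(X_i | Par_G(X_i))
# on a hypothetical binary chain A -> B -> C (all numbers invented).
p_a = {1: 0.6, 0: 0.4}                      # P(A)
p_b_given_a = {(1, 1): 0.7, (1, 0): 0.2}    # P(B=1 | A=a)
p_c_given_b = {(1, 1): 0.9, (1, 0): 0.1}    # P(C=1 | B=b)

def bernoulli(table, value, parent):
    """P(X = value | parent) from a table storing P(X=1 | parent)."""
    p1 = table[(1, parent)]
    return p1 if value == 1 else 1.0 - p1

def joint(a, b, c):
    """P(A=a, B=b, C=c) as the product of the three CPDs."""
    return p_a[a] * bernoulli(p_b_given_a, b, a) * bernoulli(p_c_given_b, c, b)

# the product of locally normalized CPDs is itself normalized
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

Because each CPD is locally normalized, the products sum to one with no explicit partition function, which is exactly what distinguishes Bayesian networks from the Markov networks later in these slides.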
Dependencies and Conditional Independencies

Is X independent of Y given evidence Z?

• X → Y
• X ← Y
• X → W → Y
• X ← W ← Y
• X ← W → Y
• X → W ← Y

Definition (Active trail): X1 − X2 − · · · − Xn is active given Z iff
• For any v-structure Xi−1 → Xi ← Xi+1, Xi or one of its descendants is in Z
• No other Xi is in Z

Definition (d-separation): X and Y are d-separated given evidence Z if there is no active trail from X to Y given Z.

Theorem: d-separation ⇒ (conditional) independence
Reasoning Patterns
Three typical reasoning patterns:

1. Causal reasoning (prediction): predict "downstream" effects of various factors.
2. Evidential reasoning (explanation): reason from effects back to causes.
3. Intercausal reasoning (explaining away): trickier. The intuition: both flu and cancer cause fever. Given that we have a fever, learning that we have the flu lowers the probability of cancer.
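The explaining-away intuition can be verified numerically. A small sketch with invented probabilities for flu, cancer, and fever (the prior and CPT values below are assumptions, not from the slides):

```python
# Explaining away on a v-structure flu -> fever <- cancer.
# All probabilities are invented for illustration.
p_flu, p_cancer = 0.10, 0.01
# P(fever=1 | flu, cancer), a made-up CPT:
p_fever = {(0, 0): 0.01, (1, 0): 0.80, (0, 1): 0.80, (1, 1): 0.95}

def joint(flu, cancer, fever):
    pf = p_flu if flu else 1 - p_flu
    pc = p_cancer if cancer else 1 - p_cancer
    pe = p_fever[(flu, cancer)]
    return pf * pc * (pe if fever else 1 - pe)

# P(cancer=1 | fever=1): the fever raises the probability of cancer
num = sum(joint(f, 1, 1) for f in (0, 1))
den = sum(joint(f, c, 1) for f in (0, 1) for c in (0, 1))
p_cancer_given_fever = num / den

# P(cancer=1 | fever=1, flu=1): the flu "explains" the fever,
# so the probability of cancer declines again
num2 = joint(1, 1, 1)
den2 = sum(joint(1, c, 1) for c in (0, 1))
p_cancer_given_fever_and_flu = num2 / den2
```

With these numbers, observing the flu drops the posterior probability of cancer back towards its prior, even though flu and cancer are marginally independent.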
Markov Networks
Factors: Φ = {φ1(D1), · · · , φk(Dk)}, where each φi is a factor over scope Di ⊆ {X1, · · · , Xn}

Unnormalized measure: P̃_Φ(X1, · · · , Xn) = ∏_{i=1}^{k} φi(Di)

Partition function: Z_Φ = ∑_{X1,··· ,Xn} P̃_Φ(X1, · · · , Xn)

Joint distribution: P_Φ(X1, · · · , Xn) = (1/Z_Φ) P̃_Φ(X1, · · · , Xn)
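The definitions above in code, for a toy two-variable network with invented factor values:

```python
import itertools

# A toy Markov network over two binary variables X1, X2 with one
# pairwise factor and one singleton factor (factor values invented).
def phi_pair(x1, x2):   # prefers agreement
    return 3.0 if x1 == x2 else 1.0

def phi_single(x1):     # prefers x1 = 1
    return 2.0 if x1 == 1 else 1.0

def unnormalized(x1, x2):
    # the unnormalized measure: product of all factors
    return phi_pair(x1, x2) * phi_single(x1)

# the partition function sums the measure over all assignments
Z = sum(unnormalized(x1, x2) for x1, x2 in itertools.product((0, 1), repeat=2))

def prob(x1, x2):
    return unnormalized(x1, x2) / Z
```

Unlike the Bayesian-network case, the factors are not locally normalized, so the global constant Z is unavoidable; this is the computational burden that reappears in Boltzmann-machine learning later in the slides.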
Conditional Random Fields
Factors: Φ = {φ1(X, Y), · · · , φk(X, Y)}

Unnormalized measure: P̃_Φ(Y | X) = ∏_{i=1}^{k} φi(X, Y)

Partition function: Z_Φ(X) = ∑_Y P̃_Φ(Y | X)

Conditional distribution: P_Φ(Y | X) = (1/Z_Φ(X)) P̃_Φ(Y | X)
Inference for Probabilistic Graphical Models
• Exact inference, e.g., clique tree calibration
• Sampling methods, e.g., Gibbs sampling, the Metropolis-Hastings algorithm
• Variational inference

Gibbs sampling: if the conditional distributions P(Xi | X−i) are known, the chain yields unbiased samples of P(X) once it has mixed:

Wait until mixed {
  Loop over i {
    xi ∼ P(Xi | X−i)
  }
}

In probabilistic graphical models, P(Xi | X−i) is easy to compute: it depends only on the local factors.
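The sampler above can be sketched for a small pairwise binary model. The weights, burn-in length, and sample count below are invented; `cond_p1` uses only the factors that touch x_i, which is the "local factors" point:

```python
import math
import random

random.seed(0)

# Gibbs sampling on a tiny pairwise binary model:
# unnormalized P(x) proportional to exp(sum_{i<j} w[i][j] x_i x_j),
# n = 3 variables (weights invented).
w = [[0.0, 1.0, 0.5],
     [0.0, 0.0, -0.5],
     [0.0, 0.0, 0.0]]
n = 3

def cond_p1(x, i):
    """P(x_i = 1 | x_-i): depends only on the factors touching x_i."""
    s = sum(w[min(i, j)][max(i, j)] * x[j] for j in range(n) if j != i)
    return 1.0 / (1.0 + math.exp(-s))

x = [random.randint(0, 1) for _ in range(n)]
burn_in, n_samples = 500, 5000
count = 0
for t in range(burn_in + n_samples):
    for i in range(n):                       # loop over the variables
        x[i] = 1 if random.random() < cond_p1(x, i) else 0
    if t >= burn_in:                         # "wait until mixed"
        count += x[0]
est = count / n_samples                      # estimate of P(x_0 = 1)
```

For this model the exact marginal P(x_0 = 1), computed by enumerating all eight states, is about 0.69; the Gibbs estimate approaches it as the chain runs.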
Parameter Learning for Bayesian Networks
• Supervised learning: all variables are known in the training set
  - Maximum likelihood estimation (MLE)
  - Maximum a posteriori (MAP)
• Unsupervised/semi-supervised learning: some variables are unknown for at least some training samples
  - Expectation maximization (EM):
    Loop {
      1. Estimate the probabilities of the unknown variables
      2. Estimate the parameters of the Bayesian network (MLE/MAP)
    }

N.B.: Inference is required for un-/semi-supervised learning
Parameter Learning for MRF/CRF
Rewrite the joint probability as

P(X) = (1/Z) exp{∑_i θ_i f_i(X)}

MLE with gradient-based methods:

∂ log P / ∂θ_i = E_data[f_i] − E_model[f_i]
N.B.: Inference is required during each iteration of learning
Boltzmann Machines and RBMs
Boltzmann Machines
We have visible variables v and hidden variables h; both are binary.

Define the energy function

E(v, h) = −∑_i b_i v_i − ∑_k b_k h_k − ∑_{i<j} w_{ij} v_i v_j − ∑_{i,k} w_{ik} v_i h_k − ∑_{k<l} w_{kl} h_k h_l

Define the probability

P(v, h) = (1/Z) exp{−E(v, h)}

where

Z = ∑_{v,h} exp{−E(v, h)}
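The energy and the normalized distribution can be transcribed directly for a tiny machine. All sizes, weights, and biases below are invented; with 2 visible and 2 hidden units the partition function is a sum over only 16 configurations:

```python
import math
from itertools import product

# Direct transcription of the Boltzmann-machine energy on a toy
# machine with 2 visible and 2 hidden units (all parameters invented).
b = {"v": [0.1, -0.2], "h": [0.3, 0.0]}
w_vv = {(0, 1): 0.5}                 # visible-visible pairs, i < j
w_hh = {(0, 1): -0.4}                # hidden-hidden pairs, k < l
w_vh = {(0, 0): 1.0, (0, 1): -0.3,   # visible-hidden, all pairs
        (1, 0): 0.2, (1, 1): 0.7}

def energy(v, h):
    e = -sum(b["v"][i] * v[i] for i in range(2))
    e -= sum(b["h"][k] * h[k] for k in range(2))
    e -= sum(wv * v[i] * v[j] for (i, j), wv in w_vv.items())
    e -= sum(wv * v[i] * h[k] for (i, k), wv in w_vh.items())
    e -= sum(wv * h[k] * h[l] for (k, l), wv in w_hh.items())
    return e

# Z normalizes exp(-E) over all joint configurations of (v, h)
configs = [(v, h) for v in product((0, 1), repeat=2)
                  for h in product((0, 1), repeat=2)]
Z = sum(math.exp(-energy(v, h)) for v, h in configs)

def prob(v, h):
    return math.exp(-energy(v, h)) / Z
```

The exponential cost of this brute-force Z is why the slides turn to MCMC, mean-field, and eventually contrastive divergence.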
Boltzmann Machines as Markov Networks
Two types of factors:

1. Bias factors:

   xi | φ(xi)
   0  | 1
   1  | e^{b_i}

2. Pairwise factors:

   xi xj | φ(xi, xj)
   0  0  | 1
   0  1  | 1
   1  0  | 1
   1  1  | e^{w_{i,j}}

With these factors, the product ∏ φ equals exp{−E(v, h)}, so a Boltzmann machine is a pairwise Markov network.
Maximum Likelihood Estimation
∂ log L(Θ) / ∂θ_i = E_data[f_i] − E_model[f_i]
Training BMs
• Brute force (intractable)

• MCMC:
  p_i ≡ P(s_i = 1 | s_{−i}) = σ(b_i + ∑_j s_j w_{ji})

• Warm starts:
  Keep the hidden configurations h^{(i)} as "particles" for each training sample x^{(i)}; run the Markov chain for only a few steps after each weight update

• Mean-field approximation:
  p_i = σ(b_i + ∑_j p_j w_{ji})
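The mean-field update can be sketched as a fixed-point iteration on a toy symmetric network (the weights, biases, and sweep count below are invented):

```python
import math

# Mean-field fixed point: p_i = sigma(b_i + sum_j w_ij p_j),
# on a toy 3-unit network with symmetric invented weights.
w = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.8],
     [-0.5, 0.8, 0.0]]   # symmetric, zero diagonal
b = [0.1, -0.2, 0.3]

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

p = [0.5, 0.5, 0.5]                 # start from uninformative probabilities
for _ in range(100):                # sequential fixed-point sweeps
    for i in range(3):
        p[i] = sigma(b[i] + sum(w[i][j] * p[j] for j in range(3) if j != i))

# at convergence each p_i satisfies its own update equation
residual = max(abs(p[i] - sigma(b[i] + sum(w[i][j] * p[j]
               for j in range(3) if j != i))) for i in range(3))
```

The iteration replaces the stochastic binary states of the MCMC variant with deterministic real-valued probabilities, trading unbiasedness for speed.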
Restricted Boltzmann Machines (RBM)
To make training tractable, we add constraints to the model:
1. Only one hidden layer and one visible layer
2. No hidden-hidden connections
3. No visible-visible connections (these can also be marginalized out in BMs)

The model becomes a complete bipartite graph, with nice properties:

• h_i ⊥ h_j | v
• v_i ⊥ v_j | h

Learning becomes much faster:

• E_data[v_i h_j] can be computed with a single matrix multiplication
• E_model[v_i h_j] is also easier to estimate, because Gibbs sampling reduces to alternating h|v ⇒ v|h ⇒ h|v · · ·
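The factorized conditionals are what make RBM learning cheap. A sketch with invented toy weights, where `W`, `b`, and `c` play the roles of the bipartite weights and the visible/hidden biases:

```python
import math
import random

random.seed(1)

# Factorized RBM conditionals: given v, each h_k is an independent
# Bernoulli with p(h_k = 1 | v) = sigma(sum_i v_i W_ik + c_k),
# and symmetrically for v given h (toy sizes, invented weights).
nv, nh = 4, 3
W = [[random.gauss(0, 0.5) for _ in range(nh)] for _ in range(nv)]
b = [0.0] * nv          # visible biases
c = [0.0] * nh          # hidden biases

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_h_given_v(v):
    # one "matrix multiplication": activations = v W + c
    return [sigma(sum(v[i] * W[i][k] for i in range(nv)) + c[k])
            for k in range(nh)]

def p_v_given_h(h):
    return [sigma(sum(W[i][k] * h[k] for k in range(nh)) + b[i])
            for i in range(nv)]

def sample(ps):
    return [1 if random.random() < p else 0 for p in ps]

v = [1, 0, 1, 1]
h = sample(p_h_given_v(v))          # h | v
v1 = sample(p_v_given_h(h))         # v | h  -- one alternating Gibbs step
```

Because all hidden units are conditionally independent given v (and vice versa), a whole layer is resampled in one shot instead of unit by unit.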
Contrastive Divergence Learning Algorithm
• MCMC:
  ∆w_{ij} = α(⟨v_i h_j⟩⁰ − ⟨v_i h_j⟩^∞)

• CD-k:
  ∆w_{ij} = α(⟨v_i h_j⟩⁰ − ⟨v_i h_j⟩^k)

The crazy idea: do things wrongly and hope it works.
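CD-1 in code, on made-up binary data for a tiny RBM (the sizes, learning rate, epoch count, and data are all assumptions, and biases are omitted for brevity): positive statistics ⟨v h⟩⁰ come from the data, negative statistics ⟨v h⟩¹ from one reconstruction step.

```python
import math
import random

random.seed(0)

# CD-1 sketch for a tiny bias-free RBM on invented binary data.
nv, nh, alpha = 3, 2, 0.1
W = [[0.0] * nh for _ in range(nv)]
data = [[1, 1, 0], [1, 0, 0], [0, 1, 1]]

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(ps):
    return [1 if random.random() < p else 0 for p in ps]

def h_probs(v):
    return [sigma(sum(v[i] * W[i][j] for i in range(nv))) for j in range(nh)]

def v_probs(h):
    return [sigma(sum(W[i][j] * h[j] for j in range(nh))) for i in range(nv)]

for epoch in range(5):
    for v0 in data:
        ph0 = h_probs(v0)                 # positive phase: <v h>^0
        h0 = sample(ph0)
        v1 = sample(v_probs(h0))          # one Gibbs step: reconstruction
        ph1 = h_probs(v1)                 # negative phase: <v h>^1
        for i in range(nv):
            for j in range(nh):
                W[i][j] += alpha * (v0[i] * ph0[j] - v1[i] * ph1[j])
```

The point of the shortcut: the negative statistics are taken after one step rather than at the stationary distribution, so each update is cheap but biased.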
Why does it work at all? When does it fail?
Why does it work at all?
• Starting from the data, the Markov chain wanders away from the data towards things it likes more.
• Perhaps only one or a few steps are needed to know the direction towards the stationary distribution.

When does it fail?
• When the current low-energy region is far away from any data points.

A good compromise:
• CD-1 ⇒ CD-3 ⇒ CD-k
Exploring “Deep” Structures
• Deep Boltzmann machines (DBMs): Boltzmann machines with multiple hidden layers

• Learning: following the learning rules of BMs, sample half of the net's units at a time (alternating layers are conditionally independent)

What if we train layer by layer greedily?
Deep Belief Nets
Belief Nets
Consider a generative process that generates our data x in a causal fashion.

The conditional probability distribution has the form

p_i ≡ P(s_i = 1) = σ(b_i + ∑_j s_j w_{ji}) = 1 / (1 + exp{−b_i − ∑_j s_j w_{ji}})

where w_{ji} and b_i are the weights to be learned.
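Ancestral sampling from such a net is straightforward, because each unit needs only its parents' states. A toy two-layer sketch with invented sizes and weights:

```python
import math
import random

random.seed(0)

# Ancestral sampling from a toy two-layer sigmoid belief net:
# sample the top layer from its biases, then each child unit with
# p_i = sigma(b_i + sum_j s_j w_ji). All parameters are invented.
n_top, n_bottom = 2, 3
b_top = [0.0, 0.5]
b_bot = [0.1, -0.1, 0.0]
w = [[1.0, -0.5, 0.3],      # w[j][i]: weight from top unit j to bottom unit i
     [0.2, 0.8, -1.0]]

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def generate():
    top = [1 if random.random() < sigma(bj) else 0 for bj in b_top]
    bottom = []
    for i in range(n_bottom):
        p = sigma(b_bot[i] + sum(top[j] * w[j][i] for j in range(n_top)))
        bottom.append(1 if random.random() < p else 0)
    return top, bottom

samples = [generate() for _ in range(1000)]
```

Generation is easy in a belief net; as the next slide explains, it is inference (and hence learning) that is hard.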
Learning DBNs
Learning would be easy if we could get unbiased samples s_i:

∆w_{ji} = α s_j (s_i − p_i)

However, it is intractable to get unbiased samples for a deep, densely connected Bayesian network:

• The posterior is not factorial, even with only one hidden layer (the phenomenon of "explaining away")
• The posterior depends on the prior as well as the likelihood
  ⇒ All the weights interact
• Even worse, to compute the prior of one layer, we need to marginalize out all the hidden layers above
Some Algorithms for Learning DBNs
• Markov chain Monte Carlo methods (Neal, 1992)

• Variational methods to get approximate samples from the posterior (1990s)
The crazy idea: Do things wrongly and hope it works!
The Wake-Sleep Algorithm
The Wake-Sleep Algorithm (Hinton, 1995)
Basic idea: ignore “explaining away”
• Wake phase: use the recognition weights R = P(Y | X) to perform a bottom-up pass, and learn the generative weights

• Sleep phase: use the generative weights W = P(X | Y) to generate samples from the model, and learn the recognition weights

N.B.: The recognition weights are not part of the generative model.
Recall: E_data and E_model
BNs and RBMs
A BIG SURPRISE
An RBM is equivalent to a belief net with infinitely many layers and tied weights.
The Intuition
Complementary Priors

Factorial posteriors, i.e., P(x | y) = ∏_i P(x_i | y) and P(y | x) = ∏_j P(y_j | x)

⇔ The independencies {y_j ⊥ y_k | x, x_i ⊥ x_j | y}

⇔ (Assuming positivity)

P(x, y) = (1/Z) exp{ ∑_{i,j} ψ_{i,j}(x_i, y_j) + ∑_i γ_i(x_i) + ∑_j α_j(y_j) }

This can be extended to infinitely many layers with a few lines of derivation.
CD Learning Revisited

• CD-k: ignore the "explaining away" phenomenon by ignoring the small derivatives contributed by the tied weights in the higher layers

• When the weights are small, the Markov chain mixes quickly.

• When the weights get larger, run more iterations of CD.

• Good news: for multi-layer feature learning (in DBNs), it is just OK to use CD-1. (Why?)
Stacked RBM
What if we train stacked RBMs layer by layer greedily?
Stacked RBM and BNs
• The prior does not exactly cancel the "explaining away" effect

• Do things wrongly anyway

• It can be proven that a variational lower bound on the likelihood improves whenever we add a new hidden layer.
Deep Belief Nets
A deep belief net is a hybrid model with both directed and undirected connections.

Learning, given data D:
• Learn stacked RBMs layer by layer greedily with CD-1
• Fine-tune with the wake-sleep algorithm (why?)

Reconstructing data:
• Start from a random configuration of the top two layers
• Run the Markov chain for a long time
• Generate the lower layers (including the data) in a causal fashion

⇒ Theoretically incorrect, but works well
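The three reconstruction steps above can be sketched for a toy DBN with one top-level RBM and one directed layer (all sizes, weights, and chain lengths below are invented):

```python
import math
import random

random.seed(0)

# Generating from a toy DBN: alternating Gibbs in the top RBM,
# then a single causal (ancestral) pass down to the data layer.
def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(ps):
    return [1 if random.random() < p else 0 for p in ps]

n2, n1, n0 = 3, 3, 4                       # top, penultimate, visible sizes
R = [[random.gauss(0, 0.5) for _ in range(n1)] for _ in range(n2)]  # top RBM
G = [[random.gauss(0, 0.5) for _ in range(n0)] for _ in range(n1)]  # directed

# 1. random configuration of the top two layers, 2. run the chain a while
h1 = [random.randint(0, 1) for _ in range(n1)]
for _ in range(200):                       # alternating Gibbs at the top
    h2 = sample([sigma(sum(h1[i] * R[k][i] for i in range(n1)))
                 for k in range(n2)])
    h1 = sample([sigma(sum(R[k][i] * h2[k] for k in range(n2)))
                 for i in range(n1)])

# 3. one causal top-down pass generates the data layer
v = sample([sigma(sum(h1[i] * G[i][j] for i in range(n1)))
            for j in range(n0)])
```

Only the top two layers form an undirected RBM; everything below is a sigmoid belief net, which is what makes the downward pass a single causal sweep.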
Examples
RBM Modeling MNIST Data
Modeling ’2’
Modeling all 10 digits
RBM for Collaborative Filtering
• An RBM works about as well as matrix factorization, but makes different errors

• Combining RBM and MF is a big win ($1,000,000)
DBM Modeling MNIST Data
DBN Classifying MNIST Digits
• Pretraining
  - Train stacked RBMs layer by layer greedily
  - Fine-tune with the wake-sleep algorithm

• Supervised learning
  - Regard the DBN as a feed-forward neural network
  - Fine-tune the connection weights
Results
Even More Results
What happens when fine-tuning?
Thanks for listening!