Erte Pan, Wireless Eng. Group
Advisor: Dr. Han
Department of Electrical and Computer Engineering, University of Houston, Houston, TX.
Deep Belief Nets and Restricted Boltzmann Machines
Graphical Model
[Figure: two-layer network of hidden units and visible units, with unit indices i and j]
Undirected graphical model:
links have no directional significance
inference (inferring the states of unobserved variables) is easy
learning (adjusting the weights between variables to make the network more likely to generate the observed data) and generation are tricky
Directed graphical model:
links have a particular directionality, indicated by arrows
inference is difficult
learning and generation are simple
Generative model: a graphical model captures the causal process by which the observed data was generated, so it is also called a generative model.
Boltzmann Machine Model
Boltzmann Machine:
one input layer and one hidden layer
typically binary states for every unit
stochastic (vs. deterministic)
recurrent (vs. feed-forward)
generative model (vs. discriminative): estimates the distribution of the observations (say p(image)), whereas traditional discriminative networks only estimate the labels (say p(label|image))
defined by the energy of the network and the probability of a unit's state (the scalar T is referred to as the "temperature"):
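[Figure: Boltzmann Machine]
$$P(s_j = 1) = \frac{1}{1 + e^{-\Delta E_j / T}}, \qquad \Delta E_j = a_j + \sum_{i=1}^{m} w_{ij} s_i$$
$$E(s) = -\sum_i a_i s_i - \sum_{i<j} s_i w_{ij} s_j$$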
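The role of the temperature T can be illustrated with a small sketch. It assumes a symmetric weight matrix with a zero diagonal and takes the energy gap of unit j to be a_j + sum_i w_ij s_i; the function names are illustrative and do not come from the slides.

```python
import math
import random

def energy(s, a, w):
    """E(s) = -sum_i a_i s_i - sum_{i<j} s_i w_ij s_j for binary states s."""
    n = len(s)
    e = -sum(a[i] * s[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            e -= s[i] * w[i][j] * s[j]
    return e

def p_on(j, s, a, w, T):
    """P(s_j = 1) = 1 / (1 + exp(-gap / T)), with gap = E(s_j = 0) - E(s_j = 1)."""
    gap = a[j] + sum(w[i][j] * s[i] for i in range(len(s)) if i != j)
    return 1.0 / (1.0 + math.exp(-gap / T))

def update_unit(j, s, a, w, T, rng=random):
    """Stochastic update: turn unit j on with probability p_on, else off."""
    s[j] = 1 if rng.random() < p_on(j, s, a, w, T) else 0
    return s
```

At very high T the unit is on with probability close to 0.5 whatever the energy gap; as T decreases the update becomes nearly deterministic.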
Restricted Boltzmann Machine Model
[Figure: Restricted Boltzmann Machine]
Restricted Boltzmann Machine:
a bipartite graph: no intralayer connections, feed-forward
the RBM has no temperature factor T; the rest is the same as for the BM
one important feature of the RBM is that, given one layer, the units of the other layer are conditionally independent, which leads to a beautiful result later on:
$$P(h \mid v) = \prod_{j=1}^{n} P(h_j \mid v)$$
$$P(v \mid h) = \prod_{i=1}^{m} P(v_i \mid h)$$
Stochastic Search
Why BM?
Traditional networks and BMs use different optimization criteria:
Traditional: error criterion. The BP method strictly follows the gradient descent direction. Any direction that increases the error is NOT acceptable, so it is easy to get stuck in local minima.
BM: associates the network with an "energy". Simulated annealing allows the energy to increase with a certain probability.
Simulated Annealing
Simulated Annealing for BM:
1. Create an initial solution S (global states of the network)
   Initialize the temperature T >> 1
2. Repeat until T = T-lower-bound:
   Repeat until thermal equilibrium is reached at the current T:
   • Generate a random transition from S to S’
   • Let ΔE = E(S’) - E(S)
   • if ΔE < 0 then S = S’
   • else if exp[-ΔE/T] > rand(0,1) then S = S’
   Reduce the temperature T according to the cooling schedule
3. Return S
The term exp[-ΔE/T] allows "thermal disturbance", which facilitates finding the global minimum.
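The annealing loop can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure used in the slides: a fixed number of proposals per temperature stands in for "thermal equilibrium", the cooling schedule is geometric, and a transition flips one randomly chosen binary unit. All parameter names are illustrative.

```python
import math
import random

def simulated_annealing(energy, state, t_start=10.0, t_min=0.01,
                        cooling=0.9, proposals_per_t=50, seed=0):
    """Minimize energy(state) over binary state vectors by simulated annealing."""
    rng = random.Random(seed)
    s = list(state)
    e = energy(s)
    t = t_start
    while t > t_min:
        for _ in range(proposals_per_t):    # assumed proxy for thermal equilibrium
            k = rng.randrange(len(s))
            s[k] = 1 - s[k]                 # propose S' by flipping one unit
            delta = energy(s) - e           # delta E = E(S') - E(S)
            if delta < 0 or math.exp(-delta / t) > rng.random():
                e += delta                  # accept the transition
            else:
                s[k] = 1 - s[k]             # reject: restore S
        t *= cooling                        # assumed geometric cooling schedule
    return s, e
```

Uphill moves are accepted often while T is large and almost never once T is small, which is what lets the search leave local minima early on and settle later.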
Restricted Boltzmann Machine
Two things define a Restricted Boltzmann Machine:
states of all the units: obtained through a probability distribution.
weights of the network: obtained through training (Contrastive Divergence).
As mentioned before, the objective of the RBM is to estimate the distribution of the input data. Given the input, this estimate is fully determined by the weights.
Energy defined for the RBM:
$$E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j$$
Distribution of the visible layer of the RBM (Boltzmann distribution):
$$P(v) = \frac{1}{Z} \sum_h e^{-E(v,h)}$$
Z is the partition function, defined as the sum of $e^{-E(v,h)}$ over all possible configurations of {v,h}.
Probability that hidden unit j is on (binary state 1), where σ(.) is the logistic/sigmoid function:
$$P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_{i=1}^{m} w_{ij} v_i\Big)$$
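For a very small RBM these quantities can be computed exactly by enumeration, which makes the definitions concrete. The sketch below assumes visible biases a, hidden biases b and a weight matrix w indexed as w[i][j]; the brute-force partition function is only feasible for a handful of units.

```python
import math
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rbm_energy(v, h, a, b, w):
    """E(v,h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i w_ij h_j."""
    e = -sum(a[i] * v[i] for i in range(len(v)))
    e -= sum(b[j] * h[j] for j in range(len(h)))
    e -= sum(v[i] * w[i][j] * h[j]
             for i in range(len(v)) for j in range(len(h)))
    return e

def partition_function(a, b, w):
    """Z: sum of exp(-E(v,h)) over every configuration of {v,h}."""
    return sum(math.exp(-rbm_energy(v, h, a, b, w))
               for v in product((0, 1), repeat=len(a))
               for h in product((0, 1), repeat=len(b)))

def p_visible(v, a, b, w):
    """P(v) = (1/Z) * sum_h exp(-E(v,h))."""
    z = partition_function(a, b, w)
    return sum(math.exp(-rbm_energy(v, h, a, b, w))
               for h in product((0, 1), repeat=len(b))) / z

def p_hidden_on(j, v, a, b, w):
    """P(h_j = 1 | v) = sigmoid(b_j + sum_i w_ij v_i)."""
    return sigmoid(b[j] + sum(w[i][j] * v[i] for i in range(len(v))))
```

The cost of `partition_function` grows as 2^(m+n), which is why the exact distribution is out of reach for realistic networks and sampling is used instead.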
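Restricted Boltzmann Machine
Training for RBM: Maximum Likelihood learning
the probability over a vector x with parameter W (weights) is:
$$p(x;W) = \frac{1}{Z(W)} e^{-E(x;W)}, \qquad Z(W) = \sum_x e^{-E(x;W)}$$
Given i.i.d. samples $\{x_n\}_{n=1}^{N}$, the objective is to maximize the average log-likelihood:
$$L(W;X) = \frac{1}{N}\sum_{n=1}^{N} \log p(x_n;W) = \langle \log p(x;W) \rangle_0 = -\langle E(x;W) \rangle_0 - \log Z(W)$$
the <.>0 denotes an average w.r.t. the data distribution:
$$p_0(x) = \frac{1}{N}\sum_{n=1}^{N} \delta(x - x_n)$$
the gradient is then computed as:
$$\frac{\partial L(W;X)}{\partial W} = -\Big\langle \frac{\partial E(x;W)}{\partial W} \Big\rangle_0 + \Big\langle \frac{\partial E(x;W)}{\partial W} \Big\rangle_\infty$$
the <.>∞ denotes an average w.r.t. the model distribution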
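The step from the log-likelihood to its gradient rests on the derivative of log Z(W). The short derivation below fills in that step, using only p(x;W) = e^{-E(x;W)} / Z(W) and Z(W) = sum over x of e^{-E(x;W)}:

```latex
\begin{aligned}
\frac{\partial \log Z(W)}{\partial W}
  &= \frac{1}{Z(W)} \sum_x \frac{\partial}{\partial W} e^{-E(x;W)}
   = -\sum_x \frac{e^{-E(x;W)}}{Z(W)} \, \frac{\partial E(x;W)}{\partial W}
   = -\left\langle \frac{\partial E(x;W)}{\partial W} \right\rangle_\infty \\
\frac{\partial L(W;X)}{\partial W}
  &= -\left\langle \frac{\partial E(x;W)}{\partial W} \right\rangle_0
     - \frac{\partial \log Z(W)}{\partial W}
   = -\left\langle \frac{\partial E(x;W)}{\partial W} \right\rangle_0
     + \left\langle \frac{\partial E(x;W)}{\partial W} \right\rangle_\infty
\end{aligned}
```

The model average appears because the weights e^{-E(x;W)} / Z(W) in the sum are exactly the model probabilities p(x;W).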
Restricted Boltzmann Machine
Then the update of the weights, W, can be computed as (η is the learning rate):
$$W^{(\tau+1)} = W^{(\tau)} + \eta \left.\frac{\partial L(W;X)}{\partial W}\right|_{W^{(\tau)}}$$
the <.>0 term can be computed using the input samples
the <.>∞ term can be estimated by MCMC, but this is very slow and the estimated gradient suffers from large variance
Solution: Contrastive Divergence
maximizing the log probability of the data is the same as minimizing the KL divergence:
$$KL(p_0 \,\|\, p_\infty) = \sum_x p_0(x) \log \frac{p_0(x)}{p(x;W)}$$
define CD to be:
$$CD_n = KL(p_0 \,\|\, p_\infty) - KL(p_n \,\|\, p_\infty)$$
where n is the small number of steps for which the Markov chain is run
use the approximate gradient of CDn with respect to the weights, multiplied by the learning rate, as the update of the weights
Note: this update direction is NOT the gradient of ANY function, yet it is successful in application…
Restricted Boltzmann Machine
Summarized algorithm for training RBM:
1) Take a training sample v, compute the probabilities of the hidden units, and sample a hidden activation vector h from this probability distribution.
2) Compute the expectation of vh (estimated by the outer product of v and h) and call this the positive gradient (clamped phase, or positive phase).
3) From h, sample a reconstruction v’ of the visible units, then resample the hidden activations h’ from this.
4) Compute the expectation of v’h’ (estimated by the outer product of v’ and h’) and call this the negative gradient (free phase, or negative phase).
5) Let the weight update ΔWij be the positive gradient minus the negative gradient, times some learning rate.
In the RBM, the previous equations then become (for a particular weight between two units):
$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}$$
$$\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \right)$$
these two equations are obtained by substituting the energy function into the learning rule; ε is the learning rate.
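The five steps can be sketched as a single CD-1 update in plain Python. Two choices in this sketch are assumptions rather than something the slides state: the pairwise statistics use the hidden probabilities instead of sampled hidden states (a common variance-reduction choice), and the biases a and b are updated by the analogous rule. The helper names and the `train` wrapper are illustrative.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, b, w, rng):
    probs = [sigmoid(b[j] + sum(v[i] * w[i][j] for i in range(len(v))))
             for j in range(len(b))]
    return probs, [1 if rng.random() < p else 0 for p in probs]

def sample_visible(h, a, w, rng):
    probs = [sigmoid(a[i] + sum(w[i][j] * h[j] for j in range(len(h))))
             for i in range(len(a))]
    return probs, [1 if rng.random() < p else 0 for p in probs]

def cd1_step(v0, a, b, w, lr, rng):
    """One CD-1 update on training vector v0; a, b, w are modified in place."""
    ph0, h0 = sample_hidden(v0, b, w, rng)    # step 1: sample h from P(h|v)
    _, v1 = sample_visible(h0, a, w, rng)     # step 3: reconstruction v'
    ph1, _ = sample_hidden(v1, b, w, rng)     # step 3: hidden activations h'
    for i in range(len(a)):
        for j in range(len(b)):
            # steps 2, 4, 5: (positive - negative statistics) * learning rate
            w[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        a[i] += lr * (v0[i] - v1[i])          # assumed visible-bias rule
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - ph1[j])        # assumed hidden-bias rule
    return v1

def train(data, n_hidden, epochs=100, lr=0.1, seed=0):
    """Hypothetical wrapper: repeat cd1_step over a list of binary vectors."""
    rng = random.Random(seed)
    a = [0.0] * len(data[0])
    b = [0.0] * n_hidden
    w = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in data[0]]
    for _ in range(epochs):
        for v0 in data:
            cd1_step(v0, a, b, w, lr, rng)
    return a, b, w
```

A call such as `train([[1, 1, 0, 0], [0, 0, 1, 1]], n_hidden=2)` returns the learned biases and weights; running the chain for more than one step before collecting the negative statistics would give CD-n.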
General Deep Belief Nets
Problem with DBNs:
Since DBNs are directed graphical models, the posterior of the hidden units given the input data is intractable due to the "explaining away" effect.
Solution: complementary priors, which ensure that the posterior of the hidden units satisfies the independence constraints.
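[Figure: explaining away effect. Two hidden causes, "truck hits house" and "earthquake" (bias -10 each), are each connected with weight 20 to the observed effect "house jumps" (bias -20)]
Posterior over the two causes: p(1,1) = .0001, p(1,0) = .4999, p(0,1) = .4999, p(0,0) = .0001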
Explaining Away Effect
Brief summary of the explaining away effect:
Given the observations, the posteriors of the associated hidden variables are actually NOT independent (the probability that one hidden variable is on or off influences the states of the others), even though the hidden variables are assumed to be independent in their priors.
The reason is the non-independence in the likelihood term:
Posterior (non-indep.) ∝ prior (indep.) × likelihood (non-indep.)
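The house/earthquake example can be checked numerically. The sketch below assumes a sigmoid belief net in which each cause has bias -10, the effect "house jumps" has bias -20, and each cause connects to the effect with weight 20; this assignment of the numbers is a reading of the figure, not something the slides spell out.

```python
import math
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def posterior_given_house_jumps(bias_cause=-10.0, bias_house=-20.0, weight=20.0):
    """Posterior over (truck, earthquake) given that the house jumps."""
    p_cause = sigmoid(bias_cause)            # independent prior of each cause
    joint = {}
    for truck, quake in product((0, 1), repeat=2):
        prior = 1.0
        for c in (truck, quake):
            prior *= p_cause if c else 1.0 - p_cause
        likelihood = sigmoid(bias_house + weight * (truck + quake))
        joint[(truck, quake)] = prior * likelihood
    total = sum(joint.values())
    return {k: val / total for k, val in joint.items()}
```

Under these assumptions almost all posterior mass sits on (1,0) and (0,1), in rough agreement with the values quoted for the figure: once one cause is known to be on, the other becomes very unlikely, so the two causes are anti-correlated in the posterior although they are independent in the prior.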
Eliminate Explaining Away by Complementary Priors
Add extra hidden layers to create a complementary prior (CP) that has the opposite correlations to those in the likelihood term, so that when the likelihood is multiplied by the prior, the posterior becomes factorial.
[Figure: infinite directed network with tied weights; layers v0, h0, v1, h1, v2, h2, etc. are connected by alternating W and W^T]
Complementary Priors
Definition of Complementary Priors:
Consider observations x and hidden variables y. For a given likelihood function P(x|y), the prior P(y) is called the complementary prior of P(x|y) if the joint P(x,y) = P(x|y) P(y) leads to a posterior P(y|x) that factorises exactly.
Infinite directed model with tied weights & Complementary Priors & Gibbs sampling:
Recall that RBMs have the property:
$$P(h \mid v) = \prod_{j=1}^{n} P(h_j \mid v)$$
$$P(v \mid h) = \prod_{i=1}^{m} P(v_i \mid h)$$
The definition of the RBM energy function makes it a proper model with two sets of conditional independencies (complementary priors for both v and h).
Since we need to estimate the distribution of the data, P(v), we can perform alternating Gibbs sampling, drawing h from P(h|v) and then v from P(v|h), for infinitely many steps. This procedure is analogous to unrolling the single RBM into an infinite directed stack of RBMs with tied weights (due to the "complementary priors"), where each RBM takes its input from the hidden layer of the RBM below it.
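Alternating Gibbs sampling can be tried out on a tiny RBM, where the exact P(v) is available by enumeration for comparison. This is a minimal sketch: the chain length, burn-in and seed are arbitrary illustrative choices, and a finite chain only approximates the infinite procedure described above.

```python
import math
import random
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def exact_p_visible(a, b, w):
    """Exact P(v) for a tiny RBM by summing exp(-E(v,h)) over all h."""
    m, n = len(a), len(b)
    unnorm = {}
    for v in product((0, 1), repeat=m):
        total = 0.0
        for h in product((0, 1), repeat=n):
            e = -sum(a[i] * v[i] for i in range(m))
            e -= sum(b[j] * h[j] for j in range(n))
            e -= sum(v[i] * w[i][j] * h[j] for i in range(m) for j in range(n))
            total += math.exp(-e)
        unnorm[v] = total
    z = sum(unnorm.values())
    return {v: u / z for v, u in unnorm.items()}

def gibbs_p_visible(a, b, w, steps=20000, burn_in=1000, seed=0):
    """Estimate P(v) by alternately sampling h ~ P(h|v) and v ~ P(v|h)."""
    rng = random.Random(seed)
    m, n = len(a), len(b)
    v = [0] * m
    counts = {}
    for t in range(burn_in + steps):
        h = [1 if rng.random() < sigmoid(b[j] + sum(v[i] * w[i][j] for i in range(m)))
             else 0 for j in range(n)]
        v = [1 if rng.random() < sigmoid(a[i] + sum(w[i][j] * h[j] for j in range(n)))
             else 0 for i in range(m)]
        if t >= burn_in:
            counts[tuple(v)] = counts.get(tuple(v), 0) + 1
    return {k: c / steps for k, c in counts.items()}
```

Because each layer factorises given the other, every unit in a layer can be sampled independently in one sweep, which is what makes the alternating scheme cheap.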
DBNs based on RBMs
DBNs based on stacks of RBMs:
The top two hidden layers form an undirected associative memory (regarded as a shorthand for the infinite stack), and the remaining hidden layers form a directed acyclic graph.
[Figure: stack of layers data, h1, h2, h3 with weights W1, W2, W3; each pair of adjacent layers is labeled RBM]
The red arrows are NOT part of the generative model. They are used only for inference.
Training Deep Belief Nets
The previous discussion gives an intuition for training stacks of RBMs one layer at a time.
Hinton showed that this greedy learning algorithm is efficient in the sense of expected variance.
First, learn with all the weights tied (learn as a single RBM).
Then freeze the bottom layer and relearn all the other layers (learn as a single RBM).
Then freeze the bottom two layers and relearn all the other layers (learn as a single RBM).
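In practice the greedy procedure is often realized by training one RBM per layer and feeding each trained layer's hidden activations to the next RBM as data. The sketch below follows that common form, which is an assumption about how the steps above are implemented; it uses CD-1 with hidden probabilities for the statistics, and all function names and hyperparameters are illustrative.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, b, w):
    return [sigmoid(b[j] + sum(v[i] * w[i][j] for i in range(len(v))))
            for j in range(len(b))]

def visible_probs(h, a, w):
    return [sigmoid(a[i] + sum(w[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def train_rbm(data, n_hidden, epochs, lr, rng):
    """Train one RBM with CD-1 and return (a, b, w)."""
    n_visible = len(data[0])
    a = [0.0] * n_visible
    b = [0.0] * n_hidden
    w = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
         for _ in range(n_visible)]
    for _ in range(epochs):
        for v0 in data:
            ph0 = hidden_probs(v0, b, w)
            v1 = sample(visible_probs(sample(ph0, rng), a, w), rng)
            ph1 = hidden_probs(v1, b, w)
            for i in range(n_visible):
                a[i] += lr * (v0[i] - v1[i])
                for j in range(n_hidden):
                    w[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            for j in range(n_hidden):
                b[j] += lr * (ph0[j] - ph1[j])
    return a, b, w

def train_dbn(data, layer_sizes, epochs=50, lr=0.1, seed=0):
    """Greedy layer-wise training: one RBM per layer, lower layers frozen."""
    rng = random.Random(seed)
    layers = []
    inputs = data
    for n_hidden in layer_sizes:
        a, b, w = train_rbm(inputs, n_hidden, epochs, lr, rng)
        layers.append((a, b, w))             # this layer is frozen from now on
        # Hidden samples of the frozen layer become data for the next RBM.
        inputs = [sample(hidden_probs(v, b, w), rng) for v in inputs]
    return layers
```

For example, `train_dbn([[1, 1, 0, 0], [0, 0, 1, 1]], [3, 2])` returns two parameter triples, one for the 4-3 RBM and one for the 3-2 RBM stacked on top of it.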
Fine-tuning Deep Belief Nets
Each time we learn a new layer, the inference at the lower layers becomes incorrect, but the variational bound on the log probability of the data improves, as proved by Hinton.
Since the inference at the lower layers becomes incorrect, Hinton uses a fine-tuning procedure, called the wake-sleep algorithm, to adjust the weights.
Wake-sleep algorithm:
wake phase: do a bottom-up pass, sampling h with the recognition weights from the input v for each RBM, and then adjust the generative weights by the RBM learning rule.
sleep phase: do a top-down pass, starting from a random state of h at the top layer and generating v. Then the recognition weights are modified.
[Figure: stack of layers data, h1, h2, h3 with weights W1, W2, W3 and transposed weights W^T]
Deep Belief Nets
Analogies for the wake-sleep algorithm:
wake phase: if reality differs from what is imagined, modify the generative weights to make what is imagined as close as possible to reality.
sleep phase: if the illusions produced by the concepts learned during the wake phase differ from the concepts, modify the recognition weights to make the illusions as close as possible to the concepts.
Questions on DBNs:
training vector vs. training set (Patch Training)
How to perform unsupervised classification?
Performances from DBNs
[Figure: panels A and B]
A: 2-D coded representation of the MNIST handwritten-digit database by PCA
B: 2-D coded representation of MNIST by DBNs
Results produced by Hinton et al.
Performances from DBNs
[Figure: panels A and B]
A: 2-D coded representation of document retrieval data by LSA
B: 2-D coded representation of the same data by DBNs
Results produced by Hinton et al.
Convolutional DBNs
Limitations of DBNs:
unable to process high-dimensional data directly (DBNs transform 2D images into vectors before feeding them into the network, so some spatial information is lost)
even with vectors as the input, DBNs do not scale properly to real image sizes. They are only suitable for small images
directly extending DBNs to fit high-dimensional data suffers from inefficient computation (millions of weights to estimate)
Advantages of CDBNs:
feature detectors are shared across all locations in an image, so they form convolution kernels and reduce computation
max-pooling: shrinks the representation, making it translation-invariant and reducing computation
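Architecture of CDBNs
Energy term and probability are defined similarly to the RBM:
$$E(v,h) = -\sum_{k=1}^{K} \sum_{i,j=1}^{N_H} \sum_{r,s=1}^{N_W} h^k_{ij} W^k_{rs} v_{i+r-1,\,j+s-1} - \sum_{k=1}^{K} b_k \sum_{i,j=1}^{N_H} h^k_{ij} - c \sum_{i,j=1}^{N_V} v_{ij}$$
$$P(v,h) = \frac{1}{Z} e^{-E(v,h)}$$
All units are 2D binary images. Within one group of the detection layer, the weights (convolution kernels) are shared, leading to the convolution operation:
$$E(v,h) = -\sum_{k=1}^{K} h^k \bullet (\tilde{W}^k * v) - \sum_{k=1}^{K} b_k \sum_{i,j=1}^{N_H} h^k_{ij} - c \sum_{i,j=1}^{N_V} v_{ij}$$
CDBNs
Training of CDBNs is done by optimizing the network's energy under sparsity regularization (imposed by max-pooling):
$$E(v,h) = -\sum_{k=1}^{K} \sum_{i,j=1}^{N_H} \left( h^k_{ij} (\tilde{W}^k * v)_{ij} + b_k h^k_{ij} \right) - c \sum_{i,j=1}^{N_V} v_{ij}$$
$$\text{subject to } \sum_{(i,j) \in B_\alpha} h^k_{ij} \le 1, \quad \forall k, \alpha$$
This yields an updating strategy for the weights and biases similar to Contrastive Divergence.
The sparsity constraints also give rise to simple inference in the network:
$$P(h^k_{ij} = 1 \mid v) = \frac{\exp(I(h^k_{ij}))}{1 + \sum_{(i',j') \in B_\alpha} \exp(I(h^k_{i'j'}))}$$
$$P(p^k_\alpha = 0 \mid v) = \frac{1}{1 + \sum_{(i',j') \in B_\alpha} \exp(I(h^k_{i'j'}))}$$
where $B_\alpha$ is a pooling block, $p^k_\alpha$ is its pooling unit, and
$$I(h^k_{ij}) = b_k + (\tilde{W}^k * v)_{ij}$$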
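The inference rule for one pooling block amounts to a softmax over the detection units in the block plus one extra "all off" state. The sketch below assumes the bottom-up signals I(h) for a single block have already been computed; the function name is illustrative, and the shift by m is only there for numerical stability.

```python
import math

def prob_max_pool_block(inputs):
    """Probabilistic max-pooling for one pooling block.

    inputs: bottom-up signals I(h) of the detection units in the block.
    Returns (p_on, p_off): p_on[t] is the probability that detection unit t
    is on, p_off is the probability that the pooling unit is off.
    """
    m = max(0.0, max(inputs))                # shift to avoid overflow in exp
    exps = [math.exp(x - m) for x in inputs]
    off = math.exp(-m)                       # the constant 1, shifted by m
    denom = off + sum(exps)
    return [e / denom for e in exps], off / denom
```

At most one detection unit per block can be on, so the probabilities in `p_on` and `p_off` always sum to one. With four inputs of zero, each detection unit and the off state all get probability 1/5.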
Performance of CDBNs
Results produced by Andrew Y. Ng et al.
Hierarchical representations of the Caltech-101 object classification database by CDBNs. Top: first-layer CDBN output. Bottom: second-layer CDBN output.
References
Review:
Learning deep architectures for AI, Y. Bengio, 2009
Foundations:
A fast learning algorithm for deep belief nets, Hinton, 2006
Reducing the dimensionality of data with neural networks, Hinton, 2006
A practical guide to training restricted Boltzmann machines, Hinton, 2010
On contrastive divergence learning, Hinton, 2005
On the convergence property of contrastive divergence, Tieleman, 2010
Training products of experts by minimizing contrastive divergence, Hinton, 2002
Learning multiple layers of representation, Hinton, 2007
Applications:
Sparse deep belief net model for visual area V2, H. Lee, 2008
Convolutional deep belief network for scalable unsupervised learning of hierarchical representations, H. Lee, 2009
Unsupervised learning of invariant feature hierarchies with applications to object recognition, Y. LeCun, 2007