Advanced Machine Learning / Deep Belief Networks
Daniel Ulbricht
Agenda
● A Short History of Machine Learning
● What's Deep Learning?
● Deriving Learning in General
● Energy Based Models / Restricted Boltzmann Machines
● Bring things together
● You will implement a core learning algorithm yourself
History: 1st wave
"The Perceptron"
Frank Rosenblatt introduced the Perceptron in 1958
● Only simple, linearly separable problems could be solved; XOR famously could not
History: 2nd wave
"Backpropagation"
Developed by Paul Werbos in 1974
● Complex non-linear problems could be solved
History: 3rd wave
"Deep Belief Networks"
Developed mainly in 2006 by Geoff Hinton
The "magic" behind them will be explained in this talk
History: Deep Learning
● An automatic way to learn representations (descriptors) from given data
● Attempt to learn multiple levels of representation of increasing complexity
Borrowed from Andrew Ng
History: Deep Learning
● Backpropagation is already an attempt to perform Deep Learning
● But there are some problems:
○ The gradient gets progressively diluted (decreasing update strength in the lower layers)
○ Initialization of the weights
○ How to label all the given data
Machine Learning in General
[Diagram: input vectors -> weights W -> outputs]
Goal: find weights W that maximize the probability of certain outputs given some input vectors.
Maximize: $P(\text{output} \mid \text{input}; W)$
Machine Learning in General
Learning can be performed using:
● Gradient ascent on: $\log P$
● Gradient descent on: $-\log P$
From optimization theory we know many downhill optimization algorithms (a minimal sketch follows the list):
● (Stochastic) Gradient Descent
● Conjugate Gradient
● Dogleg
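To illustrate the "descend on $-\log P$" idea, here is a minimal Octave sketch of plain gradient descent; `gradFun`, `eta`, and `nSteps` are hypothetical placeholders of my own, not part of the talk's material.

```octave
% Minimal gradient descent on a negative log-likelihood -log P.
% gradFun: function handle returning the gradient of -log P at w (hypothetical).
function w = descend(w, gradFun, eta, nSteps)
  for step = 1:nSteps
    g = gradFun(w);    % gradient of -log P at the current weights
    w = w - eta * g;   % step downhill
  end
end
```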
Maximize the Log-Likelihood of the System

Log-likelihood of the data:

$$L(W) \;=\; \underbrace{\frac{1}{N}\sum_{n=1}^{N} \log \sum_{h} e^{-E(v^{(n)},h)}}_{\text{average log-likelihood per pattern}} \;-\; \underbrace{\log \sum_{v,h} e^{-E(v,h)}}_{\text{log-likelihood of the normalization term}}$$
Rules for Gradient Computation

Sum Rule: $\dfrac{\partial}{\partial w} \sum_i f_i(w) = \sum_i \dfrac{\partial f_i(w)}{\partial w}$

Product Rule: $\dfrac{\partial}{\partial w}\big(f(w)\,g(w)\big) = \dfrac{\partial f(w)}{\partial w}\,g(w) + f(w)\,\dfrac{\partial g(w)}{\partial w}$

1: $\dfrac{\partial \log f(w)}{\partial w} = \dfrac{1}{f(w)}\,\dfrac{\partial f(w)}{\partial w}$

2: $\dfrac{\partial e^{f(w)}}{\partial w} = e^{f(w)}\,\dfrac{\partial f(w)}{\partial w}$
Gradient of first part

Gradient of the first part:

$$\frac{\partial}{\partial w} \log \sum_{h} e^{-E(v,h)} \;\overset{\text{Rule 1}}{=}\; \frac{1}{\sum_{h} e^{-E(v,h)}}\;\frac{\partial}{\partial w} \sum_{h} e^{-E(v,h)} \;\overset{\text{Sum Rule}}{=}\; \frac{1}{\sum_{h} e^{-E(v,h)}} \sum_{h} \frac{\partial}{\partial w} e^{-E(v,h)}$$

Reorder and apply Rule 2:

$$=\; \sum_{h} \frac{e^{-E(v,h)}}{\sum_{h'} e^{-E(v,h')}} \left(-\frac{\partial E(v,h)}{\partial w}\right) \;=\; -\sum_{h} P(h \mid v)\,\frac{\partial E(v,h)}{\partial w}$$

Sum over the posterior ;-)
Gradient of second part

Gradient of the second part:

$$\frac{\partial}{\partial w} \log \sum_{v,h} e^{-E(v,h)} \;=\; \frac{1}{Z} \sum_{v,h} \frac{\partial}{\partial w} e^{-E(v,h)} \;=\; -\sum_{v,h} \frac{e^{-E(v,h)}}{Z}\,\frac{\partial E(v,h)}{\partial w} \;=\; -\sum_{v,h} P(v,h)\,\frac{\partial E(v,h)}{\partial w}$$

Sum over the joint
Full Gradient

Full Gradient:

$$\frac{\partial \log P(v)}{\partial w} \;=\; -\sum_{h} P(h \mid v)\,\frac{\partial E(v,h)}{\partial w} \;+\; \sum_{v,h} P(v,h)\,\frac{\partial E(v,h)}{\partial w}$$

Two averages around the same term. Therefore we can write:

$$\frac{\partial \log P(v)}{\partial w} \;=\; \left\langle -\frac{\partial E}{\partial w} \right\rangle_{P(h \mid v)} \;-\; \left\langle -\frac{\partial E}{\partial w} \right\rangle_{P(v,h)}$$

First term: Hebbian / positive phase. Second term: anti-Hebbian / negative phase.
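For concreteness, here is a short worked instance of this formula for the RBM energy introduced later in the talk; the notation $v_i$, $h_j$, $w_{ij}$ follows the standard RBM convention and is my addition:

```latex
% With the standard RBM energy
%   E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j,
% the derivative with respect to one weight is \partial E / \partial w_{ij} = -v_i h_j, so
\[
\frac{\partial \log P(v)}{\partial w_{ij}}
  \;=\; \langle v_i h_j \rangle_{P(h \mid v)} \;-\; \langle v_i h_j \rangle_{P(v,h)}
\]
% i.e. the Hebbian correlation measured on the data minus the
% anti-Hebbian correlation measured under the model.
```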
Gradient in Sigmoid Belief Nets
Apply this knowledge to normal sigmoid nets
● As used in backpropagation

The joint is automatically normalized (it is a product of conditional probabilities): $Z = 1$

This makes the second gradient term zero: $\dfrac{\partial \log Z}{\partial w} = 0$

Full gradient: only the positive phase survives, which is the well-known Delta Rule:

$$\Delta w_{ij} \;=\; \eta\, x_i\,(t_j - y_j)$$
Energy Based Models (EBM)
Energy Based Probabilistic Models define a probability distribution as follows:

$$P(v) \;=\; \frac{e^{-E(v)}}{Z}, \qquad Z = \sum_{v} e^{-E(v)} \;\;\text{(partition function / normalization term)}$$

High probability -> low energy
Low probability -> high energy
-> Minimize the energy
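As a tiny Octave illustration of this definition on a toy discrete state space (the energy values are made up purely for illustration):

```octave
% Toy example of P(x) = exp(-E(x)) / Z on a discrete state space.
E = [0.5, 2.0, 1.0, 3.5];   % made-up energies for four states
unnorm = exp(-E);           % unnormalized probabilities
Z = sum(unnorm);            % partition function (normalization term)
P = unnorm / Z;             % low energy -> high probability
disp(P)
```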
Energy Based Models with Hidden Units
In reality we can't observe the full state of our data and/or we are not aware of indirect influences -> therefore we add hidden units to increase the expressive power of the model:

$$P(v) \;=\; \frac{\sum_{h} e^{-E(v,h)}}{Z}, \qquad Z = \sum_{v,h} e^{-E(v,h)}$$
Restricted Boltzmann Machine
A fancy name for a simple bipartite graph with bidirectional connections:
● No connections inside the same layer
● No loops
● An energy function is used to perform transitions:

$$E(v,h) \;=\; -\sum_i a_i v_i \;-\; \sum_j b_j h_j \;-\; \sum_{i,j} v_i\, w_{ij}\, h_j$$
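A hedged Octave sketch of this energy for one configuration; the variable names `W`, `a`, `b` are my own and simply mirror the formula above:

```octave
% Energy of a single (v, h) configuration of an RBM.
% v: visible column vector, h: hidden column vector,
% W: visible-by-hidden weight matrix, a/b: visible/hidden bias vectors.
function E = rbmEnergy(v, h, W, a, b)
  E = -(a' * v) - (b' * h) - (v' * W * h);
end
```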
Alternating Gibbs Sampling

Computing the averages over the posterior and the joint is very expensive.

To overcome this, Gibbs sampling is used.

Gibbs sampling inside energy based models leads to the simple sigmoid function:

$$P(h_j = 1 \mid v) \;=\; \sigma\Big(b_j + \sum_i v_i\, w_{ij}\Big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$

The proof is easy to do but would take too long here: use the normal Gibbs algorithm and put in the energy term for the distribution. You can find it on my webpage.
Alternating Gibbs Sampling

Alternating Gibbs sampling (a minimal sketch follows the list):
● Sample up (visible to hidden)
● Sample down (hidden to visible)
● Continue...
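A hedged Octave sketch of the two sampling steps, assuming binary units (introduced later in the talk); the helper names and the Bernoulli sampling via `rand` are my own conventions:

```octave
% One up step and one down step of alternating Gibbs sampling (binary units).
function h = sampleUp(v, W, b)
  pH = 1 ./ (1 + exp(-(W' * v + b)));  % P(h_j = 1 | v): the sigmoid from above
  h = double(rand(size(pH)) < pH);     % stochastic binary hidden states
end

function v = sampleDown(h, W, a)
  pV = 1 ./ (1 + exp(-(W * h + a)));   % P(v_i = 1 | h)
  v = double(rand(size(pV)) < pV);     % stochastic binary visible states
end
```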
Alternating Gibbs Sampling

● Running it for infinitely many iterations would give the exact gradient (Markov chain Monte Carlo)
● Surprisingly, even a single iteration works very well in practice
○ Geoff Hinton tried this (around 2002) and recognized that the system converges well even with a single iteration
○ Called Contrastive Divergence
Alternating Gibbs Sampling

Alternating Gibbs sampling for learning (a sketch follows the list):
● Start with a training vector
● Sample up (visible to hidden) -> Hebbian statistics
● Sample down (hidden to visible)
● Sample up -> anti-Hebbian statistics
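A hedged Octave sketch of this single chain, reusing the `sampleUp`/`sampleDown` helpers sketched above; this is my own illustrative arrangement, not code from the talk's exercise files:

```octave
% One chain of alternating Gibbs sampling (CD-1) for a training vector v0.
h0 = sampleUp(v0, W, b);     % up step   -> Hebbian statistics (v0, h0)
v1 = sampleDown(h0, W, a);   % down step -> reconstruction
h1 = sampleUp(v1, W, b);     % up step   -> anti-Hebbian statistics (v1, h1)
```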
Bring things together
For simplification we use binary input and output units from now on:
● The terms get much easier to compute
● It is also the common choice in practical applications
Bring things together
● Hebbian Part (Up Step): sum over all visible units multiplied with the corresponding hidden-unit weights, plus the bias $b_j$ for the hidden unit:

$$P(h_j = 1 \mid v) \;=\; \sigma\Big(b_j + \sum_i v_i\, w_{ij}\Big)$$

Sigmoid function: we can use it due to the fact that Gibbs sampling inside an EBM is sigmoid.

Make the output stochastic (sample a binary $h_j$ from this probability); this simplifies the next steps.
Bring things together
● Hebbian Part (Down Step):

$$P(v_i = 1 \mid h) \;=\; \sigma\Big(a_i + \sum_j w_{ij}\, h_j\Big)$$

Same as the up step, only using a different bias: the bias $a_i$ for the visible unit.
Bring things together
● Anti-Hebbian Part (Up Step):
Same computation as the up step, applied to the reconstruction; but instead of using a sampled (reconstructed) output, the probability itself is now used.
Bring things together
● Full Gradient (average over the posterior minus average over the joint; a sketch follows below):

$$\Delta w_{ij} \;=\; \eta\,\big(\langle v_i\, h_j \rangle_{\text{data}} \;-\; \langle v_i\, h_j \rangle_{\text{recon}}\big)$$

Don't forget the biases:

$$\Delta a_i \;=\; \eta\,\big(\langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{recon}}\big), \qquad \Delta b_j \;=\; \eta\,\big(\langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{recon}}\big)$$
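Putting the steps together, a hedged Octave sketch of one full CD-1 update including the biases; the names and the learning rate `eta` are my own, and whether to use probabilities or samples in each phase is a common implementation choice rather than something fixed by the talk:

```octave
% Full CD-1 update for one binary training vector v0 (illustrative sketch).
pH0 = 1 ./ (1 + exp(-(W' * v0 + b)));   % Hebbian up step
h0  = double(rand(size(pH0)) < pH0);    % stochastic hidden states
pV1 = 1 ./ (1 + exp(-(W * h0 + a)));    % Hebbian down step
v1  = double(rand(size(pV1)) < pV1);    % reconstruction
pH1 = 1 ./ (1 + exp(-(W' * v1 + b)));   % anti-Hebbian up step: keep the probability

W = W + eta * (v0 * pH0' - v1 * pH1');  % weight update
a = a + eta * (v0 - v1);                % visible bias update
b = b + eta * (pH0 - pH1);              % hidden bias update
```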
So Far We Have
● The knowledge to train a Restricted Boltzmann Machine
● No need for labels -> Our labels are the equilibrium level of the Energy Function
Open Question:
● How to perform Deep Learning without the factorial behaviour?
Stacking RBM's

To perform Deep Learning we stack multiple RBM's, but train them layer by layer (a sketch follows below):

[Diagram, built up over three slides:
Input -> W1 -> Hidden
W1 <- fixed (we don't update it anymore)
Hidden -> W2 -> Hidden]
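A hedged Octave sketch of this greedy layer-by-layer scheme; `trainRBM` stands in for the CD training loop above and is a hypothetical helper, not a function from the talk's exercise files:

```octave
% Greedy layer-wise training of stacked RBMs (illustrative sketch).
% data: one training vector per column; sizes: hypothetical hidden layer sizes.
sizes = [256, 64];
input = data;
for layer = 1:numel(sizes)
  % Train this layer's RBM with CD; the lower layers stay fixed.
  [W{layer}, a{layer}, b{layer}] = trainRBM(input, sizes(layer));
  % Propagate the data up: hidden probabilities become the next layer's input.
  input = 1 ./ (1 + exp(-(W{layer}' * input + b{layer})));
end
```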
Now we have
● A network which learns
○ Without labels
■ Labels are the equilibrium level of the energy term
○ Every layer learns a significant amount
■ Because each layer is trained independently of the other layers
Get hands on:
Download the example matlab/octave files from my homepage
You will recognize that calling runRBM
● will do nothing so far
● It is missing the implementation of "Contrastive Divergence"
● Try to implement "Contrastive Divergence" yourself
○ The solution can also be found on my homepage
Thank you for Listening