Hopfield networks and Boltzmann machines
Geoffrey Hinton et al. Presented by Tambet Matiisen
18.11.2014
Hopfield network
• Binary units
• Symmetrical connections
http://www.nnwj.de/hopfield-net.html
Energy function
• The global energy:

E = −Σ_i s_i b_i − Σ_{i<j} s_i s_j w_ij

• The energy gap:

ΔE_i = E(s_i = 0) − E(s_i = 1) = b_i + Σ_j s_j w_ij

• Update rule:

s_i = 1 if b_i + Σ_j s_j w_ij ≥ 0, otherwise s_i = 0
http://en.wikipedia.org/wiki/Hopfield_network
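As a quick sketch (not part of the original slides), the three formulas above translate directly into code; the two-unit network in the usage note is a made-up example:

```python
import numpy as np

def energy(s, b, W):
    """Global energy: E = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij.
    W is symmetric with a zero diagonal, so the pair sum is 0.5 * s W s."""
    return -s @ b - 0.5 * s @ W @ s

def energy_gap(s, b, W, i):
    """Energy gap of unit i: dE_i = E(s_i=0) - E(s_i=1) = b_i + sum_j s_j w_ij."""
    return b[i] + W[i] @ s - W[i, i] * s[i]

def update(s, b, W, i):
    """Binary threshold update: turn unit i on iff its energy gap is >= 0."""
    s[i] = 1.0 if energy_gap(s, b, W, i) >= 0 else 0.0
    return s
```

For example, with W = [[0, 1], [1, 0]], no biases and both units on, the energy is −1 and both units are stable under the update rule.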
Example

[Figure: a small Hopfield net of binary units with connection weights 3, 2, 3, 3, −1, −4 and −1. The configuration shown has −E = goodness = 3; updating the undecided unit yields a configuration with −E = goodness = 4.]
Deeper energy minimum
[Figure: the same net in a different configuration with −E = goodness = 5, a deeper energy minimum than the one found by sequential updating from the previous state.]
Is updating of a Hopfield network deterministic or non-deterministic?

A. Deterministic
B. Non-deterministic
How to update?
• Nodes must be updated sequentially, usually in randomized order.
• With parallel updating the energy could go up.
• If updates occur in parallel but with random timing, the oscillations are usually destroyed.
[Figure: two units, each with bias +5, joined by a weight of −100. If both units are updated simultaneously they flip on and off together, so the network oscillates instead of settling.]
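A minimal sketch (not from the slides) of the sequential updating described above: units are visited one at a time in random order, so each update can only lower or preserve the energy.

```python
import numpy as np

def settle(s, b, W, rng, max_sweeps=20):
    """Sequentially apply the binary threshold rule in random order
    until no unit changes, i.e. until a local energy minimum."""
    s = s.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(s)):
            new = 1.0 if b[i] + W[i] @ s >= 0 else 0.0
            if new != s[i]:
                s[i], changed = new, True
        if not changed:
            break
    return s
```

Assuming the two units in the figure have biases +5 and a connecting weight of −100, sequential updating settles with exactly one unit on, whereas simultaneous updates would oscillate forever.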
Content-addressable memory
• Using energy minima to represent memories gives a content-addressable memory.
– An item can be accessed by just knowing part of its content.
– Can fill out missing or corrupted pieces of information.
– It is robust against hardware damage.
Classical conditioning
http://changecom.wordpress.com/2013/01/03/classical-conditioning/
Storing memories
• Energy landscape is determined by weights!
• If we use activities of −1 and 1:

Δw_ij = s_i s_j

(the weight increases when s_i = s_j and decreases when s_i ≠ s_j)

• If we use states of 0 and 1:

Δw_ij = 4 (s_i − ½)(s_j − ½)
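A sketch of the one-shot storage rule for ±1 activities (the helper names are mine, not from the slides):

```python
import numpy as np

def store(patterns):
    """One-shot Hebbian storage: delta w_ij = s_i s_j for each +-1 pattern."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for s in patterns:
        W += np.outer(s, s)
    np.fill_diagonal(W, 0.0)      # no self-connections
    return W

def recall(s, W, n_sweeps=5):
    """Settle under the +-1 threshold rule (sign of the total input)."""
    s = s.copy()
    for _ in range(n_sweeps):
        for i in range(len(s)):
            s[i] = 1.0 if W[i] @ s >= 0 else -1.0
    return s
```

Storing one 8-bit pattern and then flipping a single bit, recall restores the original pattern, which is exactly the content-addressable behaviour described earlier.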
Demo
• http://www.tarkvaralabor.ee/doodler/
• (choose Algorithm: Hopfield and Initialize)
How many weights did the example have?

A. 100
B. 1000
C. 10000
Storage capacity
• The capacity of a totally connected net with N units is only about 0.15 N memories.
– With N bits per memory this is only 0.15 N² bits.
• The net has N² weights and biases.
• After storing M memories, each connection weight has an integer value in the range [−M, M].
• So the number of bits required to store the weights and biases is: N² log₂(2M + 1)
How many bits are needed to represent the weights in the example?

A. 1500
B. 50 000
C. 320 000
Spurious minima
• Each time we memorize a configuration, we hope to create a new energy minimum.
• But what if two minima merge to create a minimum at an intermediate location?
Reverse learning
• Let the net settle from a random initial state and then do unlearning.
• This will get rid of deep, spurious minima and increase memory capacity.
Increasing memory capacity
• Instead of trying to store vectors in one shot, cycle through the training set many times.
• Use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in that vector:

x̂_i = f(Σ_j x_j w_ij),  Δw_ij ∝ (x_i − x̂_i) x_j
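One way to sketch this (my reading of the procedure, so treat the details as assumptions): each unit is trained as a perceptron to predict its own ±1 state from the states of all the other units, cycling through the training set until the decisions are correct.

```python
import numpy as np

def train(patterns, n_epochs=50, lr=1.0):
    """Perceptron-style storage: whenever unit i's threshold decision
    disagrees with its target +-1 state in a training vector, nudge
    its incoming weights toward the target."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for _ in range(n_epochs):
        for x in patterns:
            for i in range(n):
                pred = 1.0 if W[i] @ x - W[i, i] * x[i] >= 0 else -1.0
                if pred != x[i]:
                    W[i] += lr * x[i] * x     # perceptron update
                    W[i, i] = 0.0             # no self-connection
    return (W + W.T) / 2                      # keep the weights symmetric
```

After training, each stored vector is a fixed point of the threshold rule: every unit's decision agrees with its state in the vector.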
Hopfield nets with hidden units

• Instead of using the net to store memories, use it to construct interpretations of sensory input.
– The input is represented by the visible units.
– The interpretation is represented by the states of the hidden units.
– The badness of the interpretation is represented by the energy.

[Figure: a layer of hidden units above a layer of visible units.]
3D edges from 2D images
• You can only see one of these 3-D edges at a time because they occlude one another.

[Figure: a picture, the 2-D lines in it, and the candidate 3-D lines each could depict.]
Noisy networks
• A Hopfield net tries to reduce the energy at each step.
– This makes it impossible to escape from local minima.
• We can use random noise to escape from poor minima.
– Start with a lot of noise so it's easy to cross energy barriers.
– Slowly reduce the noise so that the system ends up in a deep minimum. This is "simulated annealing".
Temperature

[Figure: an energy landscape with two minima A and B separated by a barrier.]

• High temperature transition probabilities: p(A→B) = 0.2, p(B→A) = 0.1
• Low temperature transition probabilities: p(A→B) = 0.001, p(B→A) = 0.000001
Stochastic binary units
• Replace the binary threshold units by binary stochastic units that make biased random decisions.
• The "temperature" controls the amount of noise.
• Raising the noise level is equivalent to decreasing all the energy gaps between configurations.

p(s_i = 1) = 1 / (1 + e^{−ΔE_i / T}),  where T is the temperature
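The logistic rule above is a one-liner; a sketch:

```python
import math

def p_on(energy_gap, T):
    """p(s_i = 1) = 1 / (1 + exp(-dE_i / T)); T is the temperature."""
    return 1.0 / (1.0 + math.exp(-energy_gap / T))
```

At an energy gap of 0 the unit is a fair coin; raising T pushes every probability toward 0.5 (equivalent to shrinking all the energy gaps), while T → 0 recovers the deterministic binary threshold unit.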
Why do we need stochastic binary units?

A. Because we cannot get rid of inherent noise.
B. Because it helps to escape local minima.
C. Because we want the system to produce randomized results.
Thermal equilibrium
• Thermal equilibrium is a difficult concept!
– Reaching thermal equilibrium does not mean that the system has settled down into the lowest energy configuration.
– The thing that settles down is the probability distribution over configurations.
– This settles to the stationary distribution.
– Any given system keeps changing its configuration, but the fraction of systems in each configuration does not change.
Modeling binary data
• Given a training set of binary vectors, fit a model that will assign a probability to every possible binary vector.
• The model can be used for generating data with the same distribution as the original data.
• If a particular model (distribution) produced the observed data, then by Bayes' rule:

p(model | data) = p(data | model) p(model) / p(data)
Boltzmann machine
• ...is defined in terms of the energies of joint configurations of the visible and hidden units.
• Probability of a joint configuration:

p(v,h) ∝ e^{−E(v,h)}

• This is the probability of finding the network in that joint configuration after we have updated all of the stochastic binary units many times.
Energy of a joint configuration
The energy with configuration v on the visible units and h on the hidden units:

−E(v,h) = Σ_{i∈vis} v_i b_i + Σ_{k∈hid} h_k b_k + Σ_{i<j} v_i v_j w_ij + Σ_{i,k} v_i h_k w_ik + Σ_{k<l} h_k h_l w_kl

where v_i is the binary state of unit i in v, b_k is the bias of unit k, w_ik is the weight between visible unit i and hidden unit k, and Σ_{i<j} indexes every non-identical pair of i and j once.
From energies to probabilities

• The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations.
• The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it.
p(v,h) = e^{−E(v,h)} / Σ_{u,g} e^{−E(u,g)}

p(v) = Σ_h e^{−E(v,h)} / Σ_{u,g} e^{−E(u,g)}

The denominator Σ_{u,g} e^{−E(u,g)} is the partition function.
An example of how weights define a distribution

[Figure: visible units v1, v2 and hidden units h1, h2; weights read off the figure: v1–h1 = +2, v2–h2 = +1, h1–h2 = −1, no biases.]

v1 v2   h1 h2    −E    e^−E    p(v,h)   p(v)
 1  1    1  1     2    7.39    .186
 1  1    1  0     2    7.39    .186     0.466
 1  1    0  1     1    2.72    .069
 1  1    0  0     0    1       .025
 1  0    1  1     1    2.72    .069
 1  0    1  0     2    7.39    .186     0.305
 1  0    0  1     0    1       .025
 1  0    0  0     0    1       .025
 0  1    1  1     0    1       .025
 0  1    1  0     0    1       .025
 0  1    0  1     1    2.72    .069     0.144
 0  1    0  0     0    1       .025
 0  0    1  1    −1    0.37    .009
 0  0    1  0     0    1       .025
 0  0    0  1     0    1       .025     0.084
 0  0    0  0     0    1       .025
                 Σ e^−E = 39.70
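The whole table can be reproduced by brute-force enumeration. The weights below (v1–h1 = +2, v2–h2 = +1, h1–h2 = −1, no biases) are my reading of the example figure, but they reproduce every −E value in the table:

```python
from itertools import product
import math

# Assumed weights, read off the figure: v1-h1 = +2, v2-h2 = +1, h1-h2 = -1.
W = {("v1", "h1"): 2.0, ("v2", "h2"): 1.0, ("h1", "h2"): -1.0}
UNITS = ("v1", "v2", "h1", "h2")

def neg_energy(s):
    """-E(v,h): sum of w_ij over connected pairs whose units are both on."""
    return sum(w * s[a] * s[b] for (a, b), w in W.items())

states = [dict(zip(UNITS, bits)) for bits in product((0, 1), repeat=4)]
Z = sum(math.exp(neg_energy(s)) for s in states)          # partition function
p_joint = {tuple(s[u] for u in UNITS): math.exp(neg_energy(s)) / Z
           for s in states}

def p_visible(v1, v2):
    """p(v): sum of p(v,h) over all hidden configurations."""
    return sum(pr for (a, b, _, _), pr in p_joint.items() if (a, b) == (v1, v2))
```

This gives Z ≈ 39.69 (the table's 39.70), p(v,h) = .186 for the all-ones configuration, and p(v1=1, v2=1) ≈ 0.466, matching the table.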
Getting a sample from the model

• We cannot compute the normalizing term (the partition function) because it has exponentially many terms.
• So we use Markov Chain Monte Carlo to get samples from the model, starting from a random global configuration:
– Keep picking units at random and allowing them to stochastically update their states based on their energy gaps.
– Run the Markov chain until it reaches its stationary distribution.
• The probability of a global configuration is then related to its energy by the Boltzmann distribution.
Getting a sample from the posterior distribution for a given data vector

• The number of possible hidden configurations is exponential, so we need MCMC to sample from the posterior.
– It is just the same as getting a sample from the model, except that we keep the visible units clamped to the given data vector.
– Only the hidden units are allowed to change states.
• Samples from the posterior are required for learning the weights. Each hidden configuration is an "explanation" of an observed visible configuration. Better explanations have lower energy.
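A sketch of the MCMC procedure from the last two slides (the function name and driver details are mine): pick units at random, update each stochastically according to its energy gap, and optionally clamp the visible units to sample from the posterior instead of the model.

```python
import numpy as np

def gibbs(W, b, n_steps, rng, T=1.0, clamp=None):
    """Run the Markov chain: repeatedly pick a unit at random and turn it
    on with logistic probability of its energy gap. `clamp` maps a unit
    index to a fixed value (clamped visible units never change state)."""
    n = len(b)
    s = rng.integers(0, 2, size=n).astype(float)
    clamp = clamp or {}
    for i, v in clamp.items():
        s[i] = v
    for _ in range(n_steps):
        i = int(rng.integers(n))
        if i in clamp:
            continue                          # clamped units stay fixed
        gap = b[i] + W[i] @ s - W[i, i] * s[i]
        s[i] = float(rng.random() < 1.0 / (1.0 + np.exp(-gap / T)))
    return s
```

With very large bias magnitudes the chain behaves almost deterministically, which makes the stationary behaviour easy to eyeball: a strongly positive-bias unit ends up on, a strongly negative-bias unit ends up off, and a clamped unit keeps its clamped value.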
What does a Boltzmann machine really do?

A. Models the probability distribution of input data.
B. Generates samples from the modeled distribution.
C. Learns the probability distribution of input data from samples.
D. All of the above.
E. None of the above.