Hopfield networks and Boltzmann machines
Geoffrey Hinton et al. Presented by Tambet Matiisen
18.11.2014
Hopfield network
• Binary units
• Symmetrical connections
http://www.nnwj.de/hopfield-net.html
Energy function
• The global energy:

E = −Σ_i s_i b_i − Σ_{i<j} s_i s_j w_ij

• The energy gap:

ΔE_i = E(s_i = 0) − E(s_i = 1) = b_i + Σ_j s_j w_ij

• Update rule:

s_i = 1 if b_i + Σ_j s_j w_ij ≥ 0, otherwise s_i = 0
http://en.wikipedia.org/wiki/Hopfield_network
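As a quick sketch (not part of the original slides), the three formulas above translate directly into code; the two-unit network in the usage note is a made-up example:

```python
import numpy as np

def energy(s, b, W):
    """Global energy: E = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij.
    W is symmetric with a zero diagonal, so the pair sum is 0.5 * s W s."""
    return -s @ b - 0.5 * s @ W @ s

def energy_gap(s, b, W, i):
    """Energy gap of unit i: dE_i = E(s_i=0) - E(s_i=1) = b_i + sum_j s_j w_ij."""
    return b[i] + W[i] @ s - W[i, i] * s[i]

def update(s, b, W, i):
    """Binary threshold update: turn unit i on iff its energy gap is >= 0."""
    s[i] = 1.0 if energy_gap(s, b, W, i) >= 0 else 0.0
    return s
```

For example, with W = [[0, 1], [1, 0]], no biases and both units on, the energy is −1 and both units are stable under the update rule.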
Example

[Figure: a small Hopfield net of binary units with connection weights 3, 2, 3, 3, −1, −4 and −1. The configuration shown has −E = goodness = 3; updating the undecided unit yields a configuration with −E = goodness = 4.]
Deeper energy minimum
[Figure: the same net in a different configuration with −E = goodness = 5, a deeper energy minimum than the one found by sequential updating from the previous state.]
Is updating of a Hopfield network deterministic or non-deterministic?

A. Deterministic
B. Non-deterministic
How to update?
• Nodes must be updated sequentially, usually in randomized order.
• With parallel updating the energy could go up.
• If updates occur in parallel but with random timing, the oscillations are usually destroyed.
[Figure: two units, each with bias +5, joined by a weight of −100. If both units are updated simultaneously they flip on and off together, so the network oscillates instead of settling.]
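A minimal sketch (not from the slides) of the sequential updating described above: units are visited one at a time in random order, so each update can only lower or preserve the energy.

```python
import numpy as np

def settle(s, b, W, rng, max_sweeps=20):
    """Sequentially apply the binary threshold rule in random order
    until no unit changes, i.e. until a local energy minimum."""
    s = s.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(s)):
            new = 1.0 if b[i] + W[i] @ s >= 0 else 0.0
            if new != s[i]:
                s[i], changed = new, True
        if not changed:
            break
    return s
```

Assuming the two units in the figure have biases +5 and a connecting weight of −100, sequential updating settles with exactly one unit on, whereas simultaneous updates would oscillate forever.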
Content-addressable memory
• Using energy minima to represent memories gives a content-addressable memory.
– An item can be accessed by just knowing part of its content.
– Can fill out missing or corrupted pieces of information.
– It is robust against hardware damage.
Classical conditioning
http://changecom.wordpress.com/2013/01/03/classical-conditioning/
Storing memories
• Energy landscape is determined by weights!
• If we use activities of −1 and 1:

Δw_ij = s_i s_j

(the weight increases when s_i = s_j and decreases when s_i ≠ s_j)

• If we use states of 0 and 1:

Δw_ij = 4 (s_i − ½)(s_j − ½)
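A sketch of the one-shot storage rule for ±1 activities (the helper names are mine, not from the slides):

```python
import numpy as np

def store(patterns):
    """One-shot Hebbian storage: delta w_ij = s_i s_j for each +-1 pattern."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for s in patterns:
        W += np.outer(s, s)
    np.fill_diagonal(W, 0.0)      # no self-connections
    return W

def recall(s, W, n_sweeps=5):
    """Settle under the +-1 threshold rule (sign of the total input)."""
    s = s.copy()
    for _ in range(n_sweeps):
        for i in range(len(s)):
            s[i] = 1.0 if W[i] @ s >= 0 else -1.0
    return s
```

Storing one 8-bit pattern and then flipping a single bit, recall restores the original pattern, which is exactly the content-addressable behaviour described earlier.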
Demo
• http://www.tarkvaralabor.ee/doodler/
• (choose Algorithm: Hopfield and Initialize)
How many weights did the example have?

A. 100
B. 1000
C. 10000
Storage capacity
• The capacity of a totally connected net with N units is only about 0.15 N memories.
– With N bits per memory this is only 0.15 N² bits.
• The net has N² weights and biases.
• After storing M memories, each connection weight has an integer value in the range [−M, M].
• So the number of bits required to store the weights and biases is: N² log₂(2M + 1)
How many bits are needed to represent the weights in the example?

A. 1500
B. 50 000
C. 320 000
Spurious minima
• Each time we memorize a configuration, we hope to create a new energy minimum.
• But what if two minima merge to create a minimum at an intermediate location?
Reverse learning
• Let the net settle from a random initial state and then do unlearning.
• This will get rid of deep, spurious minima and increase memory capacity.
Increasing memory capacity
• Instead of trying to store vectors in one shot, cycle through the training set many times.
• Use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in that vector:

x̂_i = f(Σ_j x_j w_ij),  Δw_ij ∝ (x_i − x̂_i) x_j
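One way to sketch this (my reading of the procedure, so treat the details as assumptions): each unit is trained as a perceptron to predict its own ±1 state from the states of all the other units, cycling through the training set until the decisions are correct.

```python
import numpy as np

def train(patterns, n_epochs=50, lr=1.0):
    """Perceptron-style storage: whenever unit i's threshold decision
    disagrees with its target +-1 state in a training vector, nudge
    its incoming weights toward the target."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for _ in range(n_epochs):
        for x in patterns:
            for i in range(n):
                pred = 1.0 if W[i] @ x - W[i, i] * x[i] >= 0 else -1.0
                if pred != x[i]:
                    W[i] += lr * x[i] * x     # perceptron update
                    W[i, i] = 0.0             # no self-connection
    return (W + W.T) / 2                      # keep the weights symmetric
```

After training, each stored vector is a fixed point of the threshold rule: every unit's decision agrees with its state in the vector.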
Hopfield nets with hidden units

• Instead of using the net to store memories, use it to construct interpretations of sensory input.
– The input is represented by the visible units.
– The interpretation is represented by the states of the hidden units.
– The badness of the interpretation is represented by the energy.

[Figure: a layer of hidden units above a layer of visible units.]
3D edges from 2D images
• You can only see one of these 3-D edges at a time because they occlude one another.

[Figure: a picture, the 2-D lines in it, and the candidate 3-D lines each could depict.]
Noisy networks
• A Hopfield net tries to reduce the energy at each step.
– This makes it impossible to escape from local minima.
• We can use random noise to escape from poor minima.
– Start with a lot of noise so it's easy to cross energy barriers.
– Slowly reduce the noise so that the system ends up in a deep minimum. This is "simulated annealing".
Temperature

[Figure: an energy landscape with two minima A and B separated by a barrier.]

• High temperature transition probabilities: p(A→B) = 0.2, p(B→A) = 0.1
• Low temperature transition probabilities: p(A→B) = 0.001, p(B→A) = 0.000001
Stochastic binary units
• Replace the binary threshold units by binary stochastic units that make biased random decisions.
• The "temperature" controls the amount of noise.
• Raising the noise level is equivalent to decreasing all the energy gaps between configurations.

p(s_i = 1) = 1 / (1 + e^{−ΔE_i / T}),  where T is the temperature
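The logistic rule above is a one-liner; a sketch:

```python
import math

def p_on(energy_gap, T):
    """p(s_i = 1) = 1 / (1 + exp(-dE_i / T)); T is the temperature."""
    return 1.0 / (1.0 + math.exp(-energy_gap / T))
```

At an energy gap of 0 the unit is a fair coin; raising T pushes every probability toward 0.5 (equivalent to shrinking all the energy gaps), while T → 0 recovers the deterministic binary threshold unit.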
Why do we need stochastic binary units?

A. Because we cannot get rid of inherent noise.
B. Because it helps to escape local minima.
C. Because we want the system to produce randomized results.
Thermal equilibrium
• Thermal equilibrium is a difficult concept!
– Reaching thermal equilibrium does not mean that the system has settled down into the lowest energy configuration.
– The thing that settles down is the probability distribution over configurations.
– This settles to the stationary distribution.
– Any given system keeps changing its configuration, but the fraction of systems in each configuration does not change.
Modeling binary data
• Given a training set of binary vectors, fit a model that will assign a probability to every possible binary vector.
• The model can be used for generating data with the same distribution as the original data.
• If a particular model (distribution) produced the observed data, then by Bayes' rule:

p(model | data) = p(data | model) p(model) / p(data)
Boltzmann machine
• ...is defined in terms of the energies of joint configurations of the visible and hidden units.
• Probability of a joint configuration:

p(v,h) ∝ e^{−E(v,h)}

• This is the probability of finding the network in that joint configuration after we have updated all of the stochastic binary units many times.
Energy of a joint configuration
The energy with configuration v on the visible units and h on the hidden units:

−E(v,h) = Σ_{i∈vis} v_i b_i + Σ_{k∈hid} h_k b_k + Σ_{i<j} v_i v_j w_ij + Σ_{i,k} v_i h_k w_ik + Σ_{k<l} h_k h_l w_kl

where v_i is the binary state of unit i in v, b_k is the bias of unit k, w_ik is the weight between visible unit i and hidden unit k, and Σ_{i<j} indexes every non-identical pair of i and j once.
From energies to probabilities

• The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations.
• The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it.
p(v,h) = e^{−E(v,h)} / Σ_{u,g} e^{−E(u,g)}

p(v) = Σ_h e^{−E(v,h)} / Σ_{u,g} e^{−E(u,g)}

The denominator Σ_{u,g} e^{−E(u,g)} is the partition function.
An example of how weights define a distribution

[Figure: visible units v1, v2 and hidden units h1, h2; weights read off the figure: v1–h1 = +2, v2–h2 = +1, h1–h2 = −1, no biases.]

v1 v2   h1 h2    −E    e^−E    p(v,h)   p(v)
 1  1    1  1     2    7.39    .186
 1  1    1  0     2    7.39    .186     0.466
 1  1    0  1     1    2.72    .069
 1  1    0  0     0    1       .025
 1  0    1  1     1    2.72    .069
 1  0    1  0     2    7.39    .186     0.305
 1  0    0  1     0    1       .025
 1  0    0  0     0    1       .025
 0  1    1  1     0    1       .025
 0  1    1  0     0    1       .025
 0  1    0  1     1    2.72    .069     0.144
 0  1    0  0     0    1       .025
 0  0    1  1    −1    0.37    .009
 0  0    1  0     0    1       .025
 0  0    0  1     0    1       .025     0.084
 0  0    0  0     0    1       .025
                 Σ e^−E = 39.70
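The whole table can be reproduced by brute-force enumeration. The weights below (v1–h1 = +2, v2–h2 = +1, h1–h2 = −1, no biases) are my reading of the example figure, but they reproduce every −E value in the table:

```python
from itertools import product
import math

# Assumed weights, read off the figure: v1-h1 = +2, v2-h2 = +1, h1-h2 = -1.
W = {("v1", "h1"): 2.0, ("v2", "h2"): 1.0, ("h1", "h2"): -1.0}
UNITS = ("v1", "v2", "h1", "h2")

def neg_energy(s):
    """-E(v,h): sum of w_ij over connected pairs whose units are both on."""
    return sum(w * s[a] * s[b] for (a, b), w in W.items())

states = [dict(zip(UNITS, bits)) for bits in product((0, 1), repeat=4)]
Z = sum(math.exp(neg_energy(s)) for s in states)          # partition function
p_joint = {tuple(s[u] for u in UNITS): math.exp(neg_energy(s)) / Z
           for s in states}

def p_visible(v1, v2):
    """p(v): sum of p(v,h) over all hidden configurations."""
    return sum(pr for (a, b, _, _), pr in p_joint.items() if (a, b) == (v1, v2))
```

This gives Z ≈ 39.69 (the table's 39.70), p(v,h) = .186 for the all-ones configuration, and p(v1=1, v2=1) ≈ 0.466, matching the table.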
Getting a sample from the model

• We cannot compute the normalizing term (the partition function) because it has exponentially many terms.
• So we use Markov Chain Monte Carlo to get samples from the model, starting from a random global configuration:
– Keep picking units at random and allowing them to stochastically update their states based on their energy gaps.
– Run the Markov chain until it reaches its stationary distribution.
• The probability of a global configuration is then related to its energy by the Boltzmann distribution.
Getting a sample from the posterior distribution for a given data vector

• The number of possible hidden configurations is exponential, so we need MCMC to sample from the posterior.
– It is just the same as getting a sample from the model, except that we keep the visible units clamped to the given data vector.
– Only the hidden units are allowed to change states.
• Samples from the posterior are required for learning the weights. Each hidden configuration is an "explanation" of an observed visible configuration. Better explanations have lower energy.
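A sketch of the MCMC procedure from the last two slides (the function name and driver details are mine): pick units at random, update each stochastically according to its energy gap, and optionally clamp the visible units to sample from the posterior instead of the model.

```python
import numpy as np

def gibbs(W, b, n_steps, rng, T=1.0, clamp=None):
    """Run the Markov chain: repeatedly pick a unit at random and turn it
    on with logistic probability of its energy gap. `clamp` maps a unit
    index to a fixed value (clamped visible units never change state)."""
    n = len(b)
    s = rng.integers(0, 2, size=n).astype(float)
    clamp = clamp or {}
    for i, v in clamp.items():
        s[i] = v
    for _ in range(n_steps):
        i = int(rng.integers(n))
        if i in clamp:
            continue                          # clamped units stay fixed
        gap = b[i] + W[i] @ s - W[i, i] * s[i]
        s[i] = float(rng.random() < 1.0 / (1.0 + np.exp(-gap / T)))
    return s
```

With very large bias magnitudes the chain behaves almost deterministically, which makes the stationary behaviour easy to eyeball: a strongly positive-bias unit ends up on, a strongly negative-bias unit ends up off, and a clamped unit keeps its clamped value.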
What does a Boltzmann machine really do?

A. Models the probability distribution of input data.
B. Generates samples from the modeled distribution.
C. Learns the probability distribution of input data from samples.
D. All of the above.
E. None of the above.