Associative Memories (II): Stochastic Networks and Boltzmann Machines
Davide Bacciu
Dipartimento di Informatica, Università di Pisa
[email protected]
Applied Brain Science - Computational Neuroscience (CNS)
Deterministic Networks
Consider the Hopfield network model introduced so far:
- Neuron output is a deterministic function of its inputs
- Recurrent network of all visible units
- Learns to encode a set of fixed training patterns V = [v^1 ... v^P]
What models do we obtain if we relax the deterministic input-output mapping?
Stochastic Networks
A network of units whose activation is determined by a stochastic function.
- The state of a unit at a given timestep is sampled from a given probability distribution
- The network learns a probability distribution P(V) from the training patterns
- The network includes both visible units v and hidden units h
- Network activity is a sample from the posterior probability given the inputs (visible data)
Neural Sampling Hypothesis
The activity of each neuron can be sampled from the network distribution.
- No distinction between input and output
- Natural way to deal with incomplete stimuli
- The stochastic nature of activation is coherent with neural response variability
- Spontaneous neural activity can be explained in terms of prior/marginal distributions
Probability and Statistics Refresher
On the blackboard
Stochastic Binary Neurons
- Spiking point neuron with binary output s_j
- Typically a discrete-time model, with time divided into small ∆t intervals
- At each time interval (t + 1 ≡ t + ∆t), the neuron can emit a spike with probability p_j^{(t)}

s_j^{(t)} = \begin{cases} 1, & \text{with probability } p_j^{(t)} \\ 0, & \text{with probability } 1 - p_j^{(t)} \end{cases}

The key is in the definition of the spiking probability, which needs to be a function of the local potential:

p_j^{(t)} \approx \sigma(x_j^{(t)})
General Sigmoidal Stochastic Binary Network
Network of N neurons with binary activations s_j
- Weight matrix M = [M_{ij}], i, j ∈ {1, ..., N}
- Bias vector b = [b_j], j ∈ {1, ..., N}

The local neuron potential x_j is defined as usual:

x_j^{(t+1)} = \sum_{i=1}^{N} M_{ij} s_i^{(t)} + b_j
A chosen neuron fires with spiking probability
p_j^{(t+1)} = P(s_j^{(t+1)} = 1 \mid s^{(t)}) = \sigma(x_j^{(t+1)}) = \frac{1}{1 + e^{-x_j^{(t+1)}}}
This formulation highlights the Markovian dynamics of the state update.
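To make the update rule concrete, here is a minimal NumPy sketch of one parallel update step for such a network; the variable names (M, b, s) and the tiny random demo are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def parallel_update(s, M, b, rng):
    """One synchronous update step: every neuron samples its new binary state."""
    x = M.T @ s + b                      # local potentials x_j = sum_i M_ij s_i + b_j
    p = sigmoid(x)                       # spiking probabilities p_j = sigma(x_j)
    return (rng.random(len(s)) < p).astype(int)

# Tiny demo: N = 5 neurons, random symmetric weights, zero biases (all illustrative)
rng = np.random.default_rng(0)
N = 5
M = rng.normal(size=(N, N))
M = (M + M.T) / 2
np.fill_diagonal(M, 0.0)
b = np.zeros(N)
s = rng.integers(0, 2, size=N)
for _ in range(10):
    s = parallel_update(s, M, b, rng)
print(s)
```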
Neurobiological Foundations
- Variability in transmitter release in synaptic vesicles ≈ Gaussian distribution (Katz 1954)
- Assuming independent distributions and a large number of synaptic connections:
  ⇒ central limit theorem ⇒ local (membrane) potential ≈ Gaussian distribution
- The conditional probability of a neuron spiking is a Gaussian CDF ≈ scaled sigmoid
Network Dynamics (I)
How does the network state (the activation of all neurons) evolve in time?

Assume neurons are updated in parallel every ∆t (parallel dynamics):
P(s^{(t+1)} \mid s^{(t)}) = \prod_{j=1}^{N} P(s_j^{(t+1)} \mid s^{(t)}) = T(s^{(t+1)} \mid s^{(t)})
yielding a Markov process for the state update:
P(s^{(t+1)} = s') = \sum_{s} T(s' \mid s) \, P(s^{(t)} = s)
Network Dynamics (II)
- Parallel dynamics assumes a synchronization clock exists (biologically implausible)
- Alternatively, one neuron at random can be chosen for update at each step (Glauber dynamics)
- There are no fixed-point guarantees for s, but the network has a stationary distribution at equilibrium when its connectivity is symmetric
Given F_j as the state-flip operator for the j-th neuron, if s^{(t+1)} = F_j s^{(t)} then

T(s^{(t+1)} \mid s^{(t)}) = \frac{1}{N} P(s_j^{(t+1)} \mid s^{(t)})

while if s^{(t+1)} = s^{(t)}

T(s^{(t+1)} \mid s^{(t)}) = 1 - \frac{1}{N} \sum_{j} P(s_j^{(t+1)} \mid s^{(t)})
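A corresponding sketch of Glauber dynamics, under the same illustrative naming as the previous snippet: at each step a single neuron, chosen uniformly at random, resamples its state from its spiking probability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glauber_step(s, M, b, rng):
    """Pick one neuron uniformly at random and resample its binary state."""
    j = rng.integers(len(s))
    x_j = M[:, j] @ s + b[j]             # local potential of the chosen neuron
    s = s.copy()
    s[j] = int(rng.random() < sigmoid(x_j))
    return s

# With symmetric M and no self-connections, iterating glauber_step defines a Markov
# chain whose stationary distribution is the Boltzmann-Gibbs distribution below.
```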
The Boltzmann-Gibbs Distribution
Symmetric connectivity enforces detailed balance condition
P(s)T (s′|s) = P(s′)T (s|s′)
This ensures reversible transitions, guaranteeing the existence of an equilibrium (Boltzmann-Gibbs) distribution:
P_\infty(s) = \frac{e^{-E(s)}}{Z}

where E(s) is the energy function and Z = \sum_{s} e^{-E(s)} is the partition function.
Boltzmann Machines
A stochastic recurrent network whose binary unit states are also random variables:
- visible units v ∈ {0, 1}
- latent (hidden) units h ∈ {0, 1}
- joint state s = [v h]
Boltzmann-Gibbs distribution with an energy function that is linear in the parameters:
E(s) = -\frac{1}{2} \sum_{ij} M_{ij} s_i s_j - \sum_{j} b_j s_j = -\frac{1}{2} s^T M s - b^T s
with symmetric connectivity and no self-recurrent connections.
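For a small machine the Boltzmann-Gibbs distribution can be computed exactly by enumerating all 2^N joint states. The sketch below (purely illustrative names, feasible only for small N) computes E(s), the partition function Z and P(s).

```python
import itertools
import numpy as np

def energy(s, M, b):
    """E(s) = -1/2 s^T M s - b^T s"""
    return -0.5 * s @ M @ s - b @ s

def boltzmann_gibbs(M, b):
    """Exact equilibrium distribution P(s) = exp(-E(s)) / Z by enumeration."""
    N = len(b)
    states = np.array(list(itertools.product([0, 1], repeat=N)), dtype=float)
    unnorm = np.exp(np.array([-energy(s, M, b) for s in states]))
    Z = unnorm.sum()                      # partition function
    return states, unnorm / Z
```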
Learning
Ackley, Hinton and Sejnowski (1985): Boltzmann machines can be trained so that the equilibrium distribution tends towards any arbitrary distribution over binary vectors, given samples from that distribution.
A couple of simplifications to start with:
- Bias b absorbed into the weight matrix M
- Consider only visible units, s = v
Use probabilistic learning techniques to fit the parameters, i.e. maximizing the log-likelihood
\mathcal{L}(M) = \frac{1}{L} \sum_{l=1}^{L} \log P(v^l \mid M)

given the L visible training patterns v^l.
Gradient Approach
First, the gradient for a single pattern
\frac{\partial \log P(v \mid M)}{\partial M_{ij}} = -\langle v_i v_j \rangle + v_i v_j

with free expectations \langle v_i v_j \rangle = \sum_{v} P(v) \, v_i v_j
Then, the log-likelihood gradient
\frac{\partial \mathcal{L}}{\partial M_{ij}} = -\langle v_i v_j \rangle + \langle v_i v_j \rangle_c

with clamped expectations \langle v_i v_j \rangle_c = \frac{1}{L} \sum_{l=1}^{L} v_i^l v_j^l
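As a concrete, deliberately brute-force illustration of this gradient, the sketch below computes the clamped and free expectations for a small, fully visible machine by exact enumeration; all names (V, M, b, eta) are illustrative assumptions, not part of the original material.

```python
import itertools
import numpy as np

def loglik_gradient(V, M, b):
    """dL/dM_ij = <v_i v_j>_clamped - <v_i v_j>_free for a fully visible machine."""
    N = M.shape[0]
    # clamped (wake) expectation: average over the L training patterns in V (shape L x N)
    clamped = (V.T @ V) / len(V)
    # free (dream) expectation: exact average under the model (small N only)
    states = np.array(list(itertools.product([0, 1], repeat=N)), dtype=float)
    E = -0.5 * np.einsum('si,ij,sj->s', states, M, states) - states @ b
    P = np.exp(-E)
    P /= P.sum()
    free = np.einsum('s,si,sj->ij', P, states, states)
    return clamped - free

# Gradient ascent then simply does: M += eta * loglik_gradient(V, M, b)
```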
Something we have already seen...
It is Hebbian learning again!
\underbrace{\langle v_i v_j \rangle_c}_{\text{wake}} - \underbrace{\langle v_i v_j \rangle}_{\text{dream}}
- The wake part is the usual Hebb rule, applied to the empirical distribution of the data that the machine sees coming in from the outside world
- The dream part is an anti-Hebbian term concerning the correlation between units when activity is generated by the internal dynamics of the machine
It can only capture quadratic correlations!
Learning with Hidden Units
- To efficiently capture higher-order correlations we need to introduce hidden units h
- Again, log-likelihood gradient ascent (with s = [v h]):
\frac{\partial \log P(v \mid M)}{\partial M_{ij}} = \sum_{h} s_i s_j P(h \mid v) - \sum_{s} s_i s_j P(s) = \langle s_i s_j \rangle_c - \langle s_i s_j \rangle
The expectations generally become intractable due to the partition function Z (exponential complexity).
Boltzmann Mean-field Approximation
Approximate the expectations using a simpler distribution than P(s):
Q(s) = \prod_{i=1}^{N} m_i^{s_i} (1 - m_i)^{1 - s_i}
The new parameters m_i describe the probability of unit i being on, and are chosen to minimize
m^* = \arg\min_{m} KL[Q(s) \,\|\, P(s)]
The solution is obtained by solving the N-dimensional system
m_i = \sigma\Big(\sum_{j} M_{ij} m_j\Big)
The m_i's allow us to approximate the intractable expectations:
\langle s_i s_j \rangle \approx m_i m_j
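The mean-field equations can be solved by simple fixed-point iteration, as in this minimal sketch (illustrative names; damping and convergence checks omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(M, n_iter=200):
    """Iterate m_i = sigma(sum_j M_ij m_j) starting from all units half-on."""
    m = np.full(M.shape[0], 0.5)
    for _ in range(n_iter):
        m = sigmoid(M @ m)
    return m          # then <s_i s_j> is approximated by m[i] * m[j]
```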
Restricted Boltzmann Machines (RBM)
A special Boltzmann machine:
- Bipartite graph
- Connections only between hidden and visible units
Energy function, highlighting the bipartition into hidden (h) and visible (v) units:
E(v, h) = -v^T M h - b^T v - c^T h
Learning (and inference) becomes tractable thanks to the graph bipartition, which factorizes the distribution.
The RBM Catch
Hidden units are conditionally independent given the visible units, and vice versa:
P(h_j = 1 \mid v) = \sigma\Big(\sum_{i} M_{ij} v_i + c_j\Big)

P(v_i = 1 \mid h) = \sigma\Big(\sum_{j} M_{ij} h_j + b_i\Big)
They can be updated in batch!
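Thanks to this factorization, each conditional can be computed and sampled for a whole layer in one vectorized operation. A minimal sketch with illustrative shapes (v of length n_visible, h of length n_hidden, M of shape n_visible x n_hidden):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, M, c, rng):
    """P(h_j = 1 | v) = sigma(sum_i M_ij v_i + c_j), for all hidden units at once."""
    p = sigmoid(v @ M + c)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h, M, b, rng):
    """P(v_i = 1 | h) = sigma(sum_j M_ij h_j + b_i), for all visible units at once."""
    p = sigmoid(h @ M.T + b)
    return (rng.random(p.shape) < p).astype(float), p
```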
Training Restricted Boltzmann Machines
Again by likelihood maximization, which yields
\frac{\partial \mathcal{L}}{\partial M_{ij}} = \underbrace{\langle v_i h_j \rangle_c}_{\text{data}} - \underbrace{\langle v_i h_j \rangle}_{\text{model}}
A Gibbs sampling approach
Wake
- Clamp data on v
- Sample v_i h_j for all pairs of connected units
- Repeat for all elements of the dataset

Dream
- Don't clamp the units
- Let the network reach equilibrium
- Sample v_i h_j for all pairs of connected units
- Repeat many times to get a good estimate
Gibbs-Sampling RBM
\frac{\partial \mathcal{L}}{\partial M_{ij}} = \underbrace{\langle v_i h_j \rangle_c}_{\text{data}} - \underbrace{\langle v_i h_j \rangle}_{\text{model}}
It is difficult to obtain an unbiased sample of the second term
Gibbs-Sampling RBM: Plugging in the Data
1. Start with a training vector on the visible units
2. Alternate between updating all the hidden units in parallel and updating all the visible units in parallel (iterate)
\frac{\partial \mathcal{L}}{\partial M_{ij}} = \underbrace{\langle v_i h_j \rangle_0}_{\text{data}} - \underbrace{\langle v_i h_j \rangle_\infty}_{\text{model}}
Contrastive-Divergence Learning
Gibbs sampling can be painfully slow to converge
1. Clamp a training vector v^l on the visible units
2. Update all hidden units in parallel
3. Update all visible units in parallel to get a reconstruction
4. Update the hidden units again
\underbrace{\langle v_i h_j \rangle_0}_{\text{data}} - \underbrace{\langle v_i h_j \rangle_1}_{\text{reconstruction}}
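Putting the four steps together, here is a minimal CD-1 weight update sketch; the names are illustrative (v0 is a batch of training vectors of shape batch x n_visible, eta a learning rate), and, as is common practice, the last hidden update uses probabilities rather than samples.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, M, b, c, eta, rng):
    """One contrastive-divergence (CD-1) update of the RBM parameters."""
    # 1) clamp the data and sample the hidden units
    ph0 = sigmoid(v0 @ M + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # 2) update the visible units to get a reconstruction
    pv1 = sigmoid(h0 @ M.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    # 3) update the hidden units again (probabilities suffice for the statistics)
    ph1 = sigmoid(v1 @ M + c)
    # 4) <v_i h_j>_0 - <v_i h_j>_1, averaged over the batch
    n = v0.shape[0]
    M = M + eta * (v0.T @ ph0 - v1.T @ ph1) / n
    b = b + eta * (v0 - v1).mean(axis=0)
    c = c + eta * (ph0 - ph1).mean(axis=0)
    return M, b, c
```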
What does Contrastive Divergence Learn?
A very crude approximation of the gradient of the log-likelihood:
- It does not even follow that gradient closely
- It more closely approximates the gradient of an objective function called the Contrastive Divergence
- However, it ignores one tricky term in this objective function, so it is not even following that gradient
- Sutskever and Tieleman (2010) have shown that it is not following the gradient of any function
So Why Use It?
Because He says so!
It works well enough in many significant applications
Character Recognition
Learning good features for reconstructing images of handwritten 2s.
Slide credit goes to G. Hinton
Weight Learning
Final Weights
Digit Reconstruction
Digit Reconstruction (II)
What would happen if we supply the RBM with a test digit that is not a 2?
It will try anyway to see a 2 in whatever we supply!
One Final Reason for Introducing RBMs
Deep Belief Network
- The fundamental building block for one of the most popular deep learning architectures
- A network of stacked RBMs trained layer-wise by Contrastive Divergence, plus a supervised read-out layer
Take Home Messages
- Stochastic networks as a paradigm that can explain neural variability and spontaneous activity in terms of distributions
- Boltzmann Machines:
  - Hidden neurons required to explain high-order correlations
  - Training is a mix of Hebbian and anti-Hebbian terms
  - Multifaceted nature (recurrent network, undirected graphical model and energy-based network)
- Restricted Boltzmann Machines:
  - Tractable model thanks to bipartite connectivity
  - Trained by a very short Gibbs sampling (contrastive divergence)
  - Can be very powerful if stacked (deep learning)
Next Lecture
Part I - Lecture (1h)
- Incremental learning in competitive networks
- Stability-plasticity dilemma
- Adaptive Resonance Theory (ART)

Part II - Lab (2h)
- Restricted Boltzmann Machines
- I suggest you have a good look at reference [1] on the course wiki (pages 3-6)