
Transcript of annealing (2)

  • Slide 1/32

    2806 Neural Computation

    Stochastic Machines: Lecture 10

    2005 Ari Visa

  • Slide 2/32

    Agenda

    Some historical notes

    Some theory

    Metropolis Algorithm

    Simulated Annealing

    Gibbs Sampling

    Boltzmann Machine

    Conclusions

  • Slide 3/32

    Some Historical Notes

    Statistical mechanics encompasses the formal study of macroscopic equilibrium properties of large systems of elements.

    The "canonical distribution" (= Gibbs distribution = Boltzmann distribution) (Willard Gibbs, 1902).

    The use of statistical mechanics as a basis for the study of neural networks (Cragg and Temperley, 1954; Cowan, 1968).

  • Slide 4/32

    Some Historical Notes

    The idea of introducing temperature and simulated annealing into combinatorial optimization problems (Kirkpatrick, Gelatt, and Vecchi, 1983).

    The Boltzmann machine was among the first multilayer learning machines inspired by statistical mechanics (Hinton and Sejnowski, 1983, 1986; Ackley et al., 1985).

  • Slide 5/32

    Some Theory

    Consider a physical system with many degrees of freedom, which can reside in any one of a large number of possible states.

    Let p_i denote the probability of occurrence of state i, with p_i >= 0 for all i and Σ_i p_i = 1.

    Let E_i denote the energy of the system when it is in state i.

    When the system is in thermal equilibrium with its surrounding environment, state i occurs with probability p_i = (1/Z) exp(-E_i / k_B T), where T is the absolute temperature (K) and k_B is Boltzmann's constant (this is the canonical or Gibbs distribution).

    The normalizing quantity Z is called the sum over states or the partition function: Z = Σ_i exp(-E_i / k_B T).

    Note two points:

    1) States of low energy have a higher probability of occurrence than states of high energy.

    2) As the temperature T is reduced, the probability is concentrated on a smaller subset of low-energy states.
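
    As a quick numerical illustration of point 2 (not from the original slides), the sketch below evaluates the Gibbs distribution for a few arbitrary energy levels at several temperatures, with k_B set to 1 as is common in simulations:

```python
import math

def gibbs_distribution(energies, T):
    """Return p_i = exp(-E_i/T) / Z for each energy level (k_B = 1)."""
    weights = [math.exp(-E / T) for E in energies]
    Z = sum(weights)  # partition function: the "sum over states"
    return [w / Z for w in weights]

energies = [0.0, 1.0, 2.0, 5.0]  # arbitrary example energy levels
for T in (10.0, 1.0, 0.1):
    print(f"T={T:5.1f}:", [round(p, 4) for p in gibbs_distribution(energies, T)])
# As T drops, the probability mass concentrates on the lowest-energy state.
```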

  • Slide 6/32

    Some Theory

    The Helmholtz free energy of a physical system, F, is defined in terms of the partition function Z: F = -T log Z.

    The average energy is ⟨E⟩ = Σ_i p_i E_i, and ⟨E⟩ - F = -T Σ_i p_i log p_i = T·H, where H = -Σ_i p_i log p_i is the entropy; hence F = ⟨E⟩ - T·H. For two such systems in thermal contact, the total entropy tends to increase: ΔH + ΔH' >= 0.

    The principle of minimal free energy: the minimum of the free energy of a stochastic system with respect to the variables of the system is achieved at thermal equilibrium, at which point the system is governed by the Gibbs distribution.
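
    The identity F = ⟨E⟩ - T·H can be checked numerically. A minimal sketch (arbitrary energy levels, k_B = 1) comparing F = -T log Z against ⟨E⟩ - T·H:

```python
import math

def free_energy_two_ways(energies, T):
    """Compare F = -T*log(Z) with F = <E> - T*H for a Gibbs distribution."""
    weights = [math.exp(-E / T) for E in energies]
    Z = sum(weights)
    p = [w / Z for w in weights]
    F = -T * math.log(Z)                               # Helmholtz free energy
    avg_E = sum(pi * E for pi, E in zip(p, energies))  # average energy <E>
    H = -sum(pi * math.log(pi) for pi in p)            # entropy
    return F, avg_E - T * H

print(free_energy_two_ways([0.0, 1.0, 2.0], T=1.5))  # both values agree
```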

  • Slide 7/32

    Some Theory

    Consider a system whose evolution is described by a stochastic process {X_n, n = 1, 2, ...}, consisting of a family of random variables.

    The value x_n assumed by the random variable X_n at discrete time n is called the state of the system.

    The space of all possible values that the random variables can assume is called the state space of the system.

  • Slide 8/32

    Some Theory

    If the structure of the stochastic process {X_n, n = 1, 2, ...} is such that the conditional probability distribution of X_{n+1} depends only on the value of X_n and is independent of all previous values, the process is a Markov chain. Markov property: P(X_{n+1} = x_{n+1} | X_n = x_n, ..., X_1 = x_1) = P(X_{n+1} = x_{n+1} | X_n = x_n).

    A sequence of random variables X_1, X_2, ..., X_n, X_{n+1} forms a Markov chain if the probability that the system is in state x_{n+1} at time n+1 depends exclusively on the probability that the system is in state x_n at time n.

  • Slide 9/32

    Some Theory

    In a Markov chain, the transition from one state to another is probabilistic: p_ij = P(X_{n+1} = j | X_n = i) denotes the transition probability from state i at time n to state j at time n+1 (p_ij >= 0 for all (i, j), and Σ_j p_ij = 1 for all i).

    A Markov chain is homogeneous in time if the transition probabilities are fixed and do not change with time.

    Let p_ij(m) denote the m-step transition probability from state i to state j: p_ij(m) = P(X_{n+m} = x_j | X_n = x_i), m = 1, 2, ... We may view it as a sum over all intermediate states k through which the system passes in its transition from state i to state j: p_ij(m+1) = Σ_k p_ik(m) p_kj, m = 1, 2, ..., with p_ij(1) = p_ij.

    More generally, p_ij(m+n) = Σ_k p_ik(m) p_kj(n), m, n = 1, 2, ... (the Chapman-Kolmogorov identity).

    When a state of the chain can only reoccur at time intervals that are multiples of d (d being the largest such number), we say the state has period d.
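
    Since the m-step transition probabilities of a homogeneous chain are the entries of the matrix power P^m, the Chapman-Kolmogorov identity can be verified directly. A minimal check with an arbitrary 3-state transition matrix:

```python
import numpy as np

# An arbitrary 3-state transition matrix (each row sums to 1).
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

P2 = np.linalg.matrix_power(P, 2)  # 2-step transition probabilities
P3 = np.linalg.matrix_power(P, 3)  # 3-step transition probabilities

# Chapman-Kolmogorov: P(m+n) = P(m) P(n).
assert np.allclose(np.linalg.matrix_power(P, 5), P2 @ P3)
print("Chapman-Kolmogorov identity holds for m=2, n=3")
```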

  • Slide 10/32

    Some Theory

    The state i is said to be a recurrent state if the Markov chain returns to state i with probability 1: f_i = P(ever returning to state i) = 1.

    If the probability f_i is less than 1, the state i is said to be a transient state.

    The state j of a Markov chain is said to be accessible from state i if there is a finite sequence of transitions from i to j with positive probability.

    If the states i and j are accessible to each other, the states i and j of the Markov chain are said to communicate with each other.

    If two states of a Markov chain communicate with each other, they belong to the same class. If all the states form a single class, the Markov chain is said to be indecomposable or irreducible.

  • Slide 11/32

    Some Theory

    The mean recurrence time of state i is defined as the expectation of T_i(k) over the returns k; T_i(k) denotes the time that elapses between the (k-1)th and kth returns to state i.

    The steady-state probability of state i, denoted by π_i, is equal to the reciprocal of the mean recurrence time: π_i = 1 / E[T_i(k)].

    If E[T_i(k)] < ∞, that is, π_i > 0, the state i is said to be a positive recurrent state. If E[T_i(k)] = ∞, that is, π_i = 0, the state i is said to be a null recurrent state.

  • Slide 12/32

    Some Theory

    Ergodicity: we may substitute time averages for ensemble averages. In the context of a Markov chain, the long-term proportion of time spent by the chain in state i corresponds to the steady-state probability π_i.

    The proportion of time spent in state i after k returns is v_i(k) = k / Σ_{l=1}^{k} T_i(l).

    Consider an ergodic Markov chain characterized by a stochastic matrix P. Let the row vector π(n-1) denote the state distribution vector of the chain at time n-1; the jth element of π(n-1) is the probability that the chain is in state x_j at time n-1. Then

    π(n) = π(n-1) P

    The state distribution vector of the Markov chain at time n is the product of the initial state distribution vector π(0) and the nth power of the stochastic matrix P: π(n) = π(0) P^n.
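
    A small sketch of both views, reusing an arbitrary ergodic 3-state matrix: iterating π(n) = π(n-1)P from any initial distribution, and cross-checking against the eigenvector characterization of the steady state:

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

pi = np.array([1.0, 0.0, 0.0])  # arbitrary initial state distribution pi(0)
for _ in range(200):
    pi = pi @ P                 # pi(n) = pi(n-1) P
print(pi)                       # the steady-state distribution

# Cross-check: the steady state is the left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
print(v / v.sum())
```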

  • Slide 13/32

    Some Theory

    The ergodicity theorem: Eqs. (11.2)-(11.27) in the course text.

  • Slide 14/32

    Some Theory

    The principle of detailed balance states that at thermal equilibrium, the rate of occurrence of any transition equals the corresponding rate of occurrence of the inverse transition:

    π_i p_ij = π_j p_ji
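
    A minimal numerical check of detailed balance, using a small reversible chain whose steady-state vector is chosen by hand for illustration:

```python
import numpy as np

# A reversible 3-state chain (illustrative) and its steady-state vector.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])

# Detailed balance: pi_i * p_ij == pi_j * p_ji for every pair (i, j),
# i.e. the matrix of probability flows is symmetric.
flows = pi[:, None] * P
assert np.allclose(flows, flows.T)
print("detailed balance holds")
```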

  • Slide 15/32

    Metropolis Algorithm

    The Metropolis algorithm (Metropolis et al., 1953) is a modified Monte Carlo method for stochastic simulation of a collection of atoms in equilibrium at a given temperature.

    The random variable X_n, representing an arbitrary Markov chain, is in state x_i at time n. We randomly generate a new state x_j, representing a realization of another random variable Y_n. The generation of this new state satisfies the symmetry condition: P(Y_n = x_j | X_n = x_i) = P(Y_n = x_i | X_n = x_j).

    Let ΔE = E_j - E_i denote the energy difference resulting from the transition of the system from state X_n = x_i to state Y_n = x_j.

    1) ΔE < 0: π_j / π_i = exp(-ΔE/T) > 1, and the new state is accepted with probability 1: p_ij = τ_ij.

    2) ΔE > 0: π_j / π_i = exp(-ΔE/T) < 1, and the new state is accepted only with probability exp(-ΔE/T): p_ij = τ_ij exp(-ΔE/T), so that π_i p_ij = π_j τ_ij = π_j p_ji and detailed balance holds with respect to the Gibbs distribution.

    The a priori transition probabilities τ_ij are in fact the probabilistic model of the random step in the Metropolis algorithm.
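
    A compact sketch of the algorithm as described above; the energy function, temperature, and ±1 proposal move are illustrative choices (the proposal is symmetric, as the algorithm requires):

```python
import math
import random

def metropolis(energy, propose, x0, T, n_steps=10_000):
    """Metropolis sampling at temperature T (k_B = 1).

    energy(x): energy of state x; propose(x): symmetric random move.
    """
    x, samples = x0, []
    for _ in range(n_steps):
        x_new = propose(x)
        dE = energy(x_new) - energy(x)
        # Accept downhill moves always, uphill moves with prob exp(-dE/T).
        if dE <= 0 or random.random() < math.exp(-dE / T):
            x = x_new
        samples.append(x)
    return samples

# Toy example: sample integers with E(x) = x^2 via +/-1 moves.
samples = metropolis(lambda x: x * x,
                     lambda x: x + random.choice((-1, 1)),
                     x0=5, T=1.0)
print(sum(samples) / len(samples))  # hovers near 0, the energy minimum
```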

  • Slide 16/32

    Simulated Annealing

    Simulated annealing has two basic ingredients:

    1) A schedule that determines the rate at which the temperature is lowered.

    2) An algorithm that iteratively finds the equilibrium distribution at each new temperature in the schedule by using the final state of the system at the previous temperature as the starting point for the new temperature (Kirkpatrick et al., 1983).

    The Metropolis algorithm is the basis for the simulated annealing process. The temperature T plays the role of a control parameter. The simulated annealing process will converge to a configuration of minimal energy provided that the temperature is decreased no faster than logarithmically; this is too slow to be of practical use → finite-time approximation (no longer guaranteed to find a global minimum with probability one).
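
    A sketch of such a finite-time approximation with a geometric cooling schedule T ← αT; the schedule parameters and the toy objective are arbitrary choices, not prescribed by the slides:

```python
import math
import random

def simulated_annealing(energy, propose, x0, T0=10.0, alpha=0.9,
                        n_temps=60, sweeps_per_temp=200):
    """Finite-time simulated annealing with a geometric cooling schedule.

    The geometric schedule T <- alpha*T is a common practical stand-in for
    the logarithmic schedule that guarantees convergence but is too slow.
    """
    x, T = x0, T0
    best, best_E = x0, energy(x0)
    for _ in range(n_temps):
        for _ in range(sweeps_per_temp):   # Metropolis moves at fixed T
            x_new = propose(x)
            dE = energy(x_new) - energy(x)
            if dE <= 0 or random.random() < math.exp(-dE / T):
                x = x_new
                if energy(x) < best_E:
                    best, best_E = x, energy(x)
        T *= alpha  # cool, reusing the final state as the next starting point
    return best, best_E

# Toy multimodal objective over the integers.
E = lambda x: 0.1 * (x - 7) ** 2 + math.sin(x)
print(simulated_annealing(E, lambda x: x + random.choice((-1, 1)), x0=-20))
```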

  • Slide 17/32

    Simulated Annealing

    To implement a finite-time approximation of the simulated annealing algorithm, we must specify a set of parameters governing the convergence of the algorithm. These parameters are combined in a so-called annealing schedule or cooling schedule.

    The annealing schedule specifies a finite sequence of values of the temperature and a finite number of transitions attempted at each value of the temperature.

  • Slide 19/32

    Gibbs Sampling

    The Gibbs sampler generates a Markov chain with the Gibbs distribution as its equilibrium distribution.

    The transition probabilities associated with the Gibbs sampler are nonstationary.

    1) Each component of the random vector X is visited in the natural order, with the result that a total of K new variates are generated on each iteration.

    2) The new value of component X_{k-1} is used immediately when a new value of X_k is drawn, for k = 2, 3, ..., K.

    The Gibbs sampler is an iterative adaptive scheme.
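
    The slides treat the Gibbs sampler abstractly; a standard concrete instance (our choice, not from the slides) is sampling a bivariate normal from its two conditionals, which makes the natural visiting order and the immediate reuse of fresh values explicit:

```python
import random

def gibbs_bivariate_normal(rho, n_samples=5000):
    """Gibbs sampling for a standard bivariate normal with correlation rho.

    Each iteration visits the two components in their natural order, and the
    fresh value of x1 is used immediately when x2 is drawn.
    """
    x1 = x2 = 0.0
    s = (1.0 - rho ** 2) ** 0.5  # std dev of each conditional
    out = []
    for _ in range(n_samples):
        x1 = random.gauss(rho * x2, s)  # draw x1 | x2
        x2 = random.gauss(rho * x1, s)  # draw x2 | x1 (uses the new x1)
        out.append((x1, x2))
    return out

samples = gibbs_bivariate_normal(rho=0.8)
print(round(sum(x for x, _ in samples) / len(samples), 2))  # near 0
```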

  • Slide 20/32

    Gibbs Sampler

    (Eqs. 11.45, 11.46, 11.47 in the course text)

  • Slide 21/32

    "olt#mann Machine

    The primary goal of Boltzmann learning is to produce a neural network that correctly models input patterns according to a Boltzmann distribution.

    The Boltzmann machine consists of stochastic neurons. A stochastic neuron resides in one of two possible states (±1) in a probabilistic manner.

    It uses symmetric synaptic connections between neurons. The stochastic neurons partition into two functional groups: visible and hidden.

  • Slide 22/32

    "olt#mann Machine

    During the training phase of the network, the visible neurons are all clamped onto specific states determined by the environment.

    The hidden neurons always operate freely; they are used to explain underlying constraints contained in the environmental input vectors.

    This is accomplished by capturing higher-order statistical correlations in the clamping vectors.

    The network can perform pattern completion provided that it has learned the training distribution properly.

  • Slide 23/32

    "olt#mann Machine

    Let x denote the state of the Boltzmann machine, with component x_i denoting the state of neuron i. The state x represents a realization of the random vector X. The synaptic connection from neuron i to neuron j is denoted by w_ji, with w_ji = w_ij for all (i, j) and w_ii = 0 for all i.

    The energy of the machine is E(x) = -(1/2) Σ_i Σ_{j, j≠i} w_ji x_i x_j, and P(X = x) = (1/Z) exp(-E(x)/T). The induced conditional for a single neuron is P(X_j = x_j | the states of the other neurons) = φ((x_j/T) Σ_i w_ji x_i), where φ(·) is a sigmoid function of its argument. Gibbs sampling and simulated annealing are used.
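
    A sketch of one Gibbs-sampling sweep over the neurons of a small Boltzmann machine. Under the energy above, the exact conditional for a ±1 unit is P(x_j = +1 | the rest) = 1/(1 + exp(-2v_j/T)) with net input v_j = Σ_i w_ji x_i; the weights below are illustrative:

```python
import math
import random

def boltzmann_sweep(x, w, T):
    """One Gibbs sweep over the +/-1 neurons of a Boltzmann machine.

    w is symmetric with zero diagonal; neuron j is set to +1 with
    probability 1/(1 + exp(-2*v_j/T)), v_j being its net input.
    """
    n = len(x)
    for j in range(n):
        v = sum(w[j][i] * x[i] for i in range(n) if i != j)  # net input v_j
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * v / T))
        x[j] = 1 if random.random() < p_plus else -1
    return x

# Tiny 3-neuron machine with symmetric, illustrative weights.
w = [[0.0, 0.5, -0.2],
     [0.5, 0.0, 0.8],
     [-0.2, 0.8, 0.0]]
x = [random.choice((-1, 1)) for _ in range(3)]
for _ in range(100):
    x = boltzmann_sweep(x, w, T=1.0)
print(x)
```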

  • Slide 24/32

    "olt#mann Machine

    The goal of Boltzmann learning is to maximize the likelihood or log-likelihood function in accordance with the maximum-likelihood principle.

    Positive phase: in this phase the network operates in its clamped condition. Negative phase: in this second phase, the network is allowed to run freely, and therefore with no environmental input.

    The log-likelihood function, given a training set T of visible-state vectors x_α, is L(w) = log Π_{x_α ∈ T} P(X_α = x_α), i.e.

    L(w) = Σ_{x_α ∈ T} (log Σ_{x_β} exp(-E(x)/T) - log Σ_x exp(-E(x)/T))

    Differentiating L(w) with respect to w_ji and introducing the mean correlations ρ+_ji (clamped phase) and ρ-_ji (free-running phase):

    Δw_ji = ε ∂L(w)/∂w_ji = η (ρ+_ji - ρ-_ji), where η is a learning-rate parameter, η = ε/T.

    From a learning point of view, the two terms that constitute the Boltzmann learning rule have opposite meanings: ρ+_ji, corresponding to the clamped condition of the network, is a Hebbian learning rule; ρ-_ji, corresponding to the free-running condition of the network, is an unlearning (forgetting) term. We also have a primitive form of an attention mechanism. The two-phase approach and, specifically, the negative phase also mean increased computational time and sensitivity to statistical errors.
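
    A sketch of one weight update under this rule, assuming lists of network states have already been collected (e.g., by Gibbs sampling with annealing) in the clamped and free-running phases; the helper names are ours, not from the slides:

```python
def correlations(states, n):
    """Average pairwise correlations <x_j x_i> over a list of +/-1 states."""
    rho = [[0.0] * n for _ in range(n)]
    for x in states:
        for j in range(n):
            for i in range(n):
                rho[j][i] += x[j] * x[i]
    return [[v / len(states) for v in row] for row in rho]

def boltzmann_update(w, clamped_states, free_states, eta=0.05):
    """One step of the Boltzmann learning rule: dw_ji = eta*(rho+ - rho-)."""
    n = len(w)
    rho_plus = correlations(clamped_states, n)   # positive (clamped) phase
    rho_minus = correlations(free_states, n)     # negative (free) phase
    for j in range(n):
        for i in range(n):
            if i != j:  # keep the zero diagonal
                w[j][i] += eta * (rho_plus[j][i] - rho_minus[j][i])
    return w
```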

  • Slide 25/32

    Sigmoid "elie% Netors

    Sigmoid belief networks or logistic belief nets (Neal, 1992) were developed to find a stochastic machine that would share with the Boltzmann machine the capacity to learn arbitrary probability distributions over binary vectors, but would not need the negative phase of the Boltzmann machine learning procedure.

    This objective was achieved by replacing the symmetric connections of the Boltzmann machine with directed connections that form an acyclic graph.

    A sigmoid belief network consists of a multilayer architecture with binary stochastic neurons. The acyclic nature of the machine makes it easy to perform probabilistic calculations.

  • Slide 26/32

    Sigmoid "elie% Netors

    Let the vector X, consisting of two-valued random variables X_1, X_2, ..., X_N, define a sigmoid belief network composed of N stochastic neurons.

    The parents of element X_j in X are denoted by pa(X_j) ⊆ {X_1, X_2, ..., X_{j-1}}.

    pa(X_j) is the smallest subset of the random vector X for which P(X_j = x_j | X_1 = x_1, ..., X_{j-1} = x_{j-1}) = P(X_j = x_j | pa(X_j)) = φ((x_j/T) Σ_i w_ji x_i).

    Note that 1) w_ji = 0 for all X_i not belonging to pa(X_j), and 2) w_ji = 0 for all i >= j.
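
    Because w_ji = 0 for i >= j, drawing the components in their natural order conditions only on values that have already been sampled (ancestral sampling). A sketch for ±1 units using the slide's conditional φ((x_j/T) Σ_i w_ji x_i) with a logistic φ; the chain-structured weights are illustrative:

```python
import math
import random

def sample_sbn(w, T=1.0):
    """Ancestral sampling from a sigmoid belief network over +/-1 units.

    w[j][i] is the directed weight from X_i to X_j, with w[j][i] = 0 for
    i >= j, so index order visits parents before children.
    P(X_j = +1 | parents) = phi(v_j/T), phi the logistic sigmoid.
    """
    n = len(w)
    x = [0] * n
    for j in range(n):
        v = sum(w[j][i] * x[i] for i in range(j))  # parents' net input
        p_plus = 1.0 / (1.0 + math.exp(-v / T))
        x[j] = 1 if random.random() < p_plus else -1
    return x

# A 3-unit chain X1 -> X2 -> X3 (lower-triangular weights; illustrative).
w = [[0.0, 0.0, 0.0],
     [1.2, 0.0, 0.0],
     [0.0, -0.7, 0.0]]
print(sample_sbn(w))
```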

  • Slide 27/32

    Sigmoid "elie% Netors

    Learning:

    It is assumed that each sample is two-valued, representing certain attributes. Repetition of training examples is permitted, in proportion to how commonly a particular combination of attributes is known to occur.

    1. Some size for a state vector, x, is decided for the network.

    2. A subset of the state vector, say x_α, is selected to represent the attributes in the training cases; that is, x_α represents the state vector of the visible neurons.

    3. The remaining part of the state vector x, denoted by x_β, defines the state vector of the hidden neurons.

    Different arrangements of visible and hidden neurons may result in different configurations. The log-likelihood function is L(w) = Σ_{x_α ∈ T} log P(X_α = x_α). Differentiating L(w) with respect to w_ji gives Δw_ji = ε ∂L(w)/∂w_ji = η ρ_ji, where η is a learning-rate parameter, η = ε/T, and ρ_ji is Σ_{x_β} P(X = x | X_α = x_α) φ(-(x_j/T) Σ_{i<j} w_ji x_i) x_j x_i, averaged over the training cases: an average correlation between the states of neurons i and j, weighted by the factor φ(-(x_j/T) Σ_{i<j} w_ji x_i).

  • Slide 28/32

    Sigmoid "elie% Netors

    Table 11.2 (in the course text)

  • Slide 29/32

    Helmholtz Machine

    The Helmholtz machine (Dayan et al., 1995; Hinton et al., 1995) uses two entirely different sets of synaptic connections.

    The forward connections constitute the recognition model. The purpose of this model is to infer a probability distribution over the underlying causes of the input vector.

    The backward connections constitute the generative model. The purpose of this second model is to reconstruct an approximation to the original input vector from the underlying representations captured by the hidden layers of the network, thereby enabling it to operate in a self-supervised manner.

    Both the recognition and generative models operate in a strictly feedforward fashion, with no feedback; they interact with each other only via the learning procedure.

  • Slide 31/32

    Mean>@ield (heory

    (he use o% mean>%ield theory asthe mathematical !asis %orderi9ing deterministic

    appro=imation to the stochasticmachines to speed up learning'

    1' Correlations are replaced !ytheir mean>%ield apro=imations'

    2' An intracta!le model isreplaced !y a tracta!le model9ia a 9ariational principle'
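
    A sketch of approach 1 for the Boltzmann machine: each stochastic ±1 neuron is replaced by its deterministic mean activation, which under the conditional used earlier satisfies m_j = tanh(Σ_i w_ji m_i / T) at a fixed point; the weights are illustrative:

```python
import math

def mean_field(w, T=1.0, n_iters=100):
    """Deterministic mean-field iteration for a Boltzmann machine.

    Each stochastic +/-1 neuron is replaced by its mean activation m_j,
    updated as m_j = tanh(sum_i w_ji * m_i / T).
    """
    n = len(w)
    m = [0.1] * n  # small nonzero start so the iteration is not stuck at 0
    for _ in range(n_iters):
        m = [math.tanh(sum(w[j][i] * m[i] for i in range(n) if i != j) / T)
             for j in range(n)]
    return m

w = [[0.0, 0.5, -0.2],
     [0.5, 0.0, 0.8],
     [-0.2, 0.8, 0.0]]
print(mean_field(w))  # deterministic approximations to the means <x_j>
```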

  • Slide 32/32

    Summary

    Some ideas rooted in statistical mechanics have been presented.

    The Boltzmann machine uses hidden and visible neurons that are in the form of stochastic, binary-state units. It exploits the properties of the Gibbs distribution, thereby offering some appealing features:

    Through training, the probability distribution exhibited by the neurons is matched to that of the environment.

    The network offers a generalized approach that is applicable to the basic issues of search, representation, and learning.

    The network is guaranteed to find the global minimum of the energy surface with respect to the states, provided that the annealing schedule in the learning process is performed slowly enough.