
Transcript of annealing (2)

  • Slide 1/32

    2806 Neural Computation

    Stochastic Machines: Lecture 10

    2005 Ari Visa

  • Slide 2/32

    Agenda

    Some historical notes

    Some theory

    Metropolis Algorithm

    Simulated Annealing

    Gibbs Sampling

    Boltzmann Machine

    Conclusions

  • Slide 3/32

    Some Historical Notes

    Statistical mechanics encompasses the formal study of macroscopic equilibrium properties of large systems of elements.

    The "canonical distribution" (= Gibbs distribution = Boltzmann distribution) (Willard Gibbs, 1902).

    The use of statistical mechanics as a basis for the study of neural networks (Cragg and Temperley, 1954; Cowan, 1968).

  • Slide 4/32

    Some Historical Notes

    The idea of introducing temperature and simulated annealing into combinatorial optimization problems (Kirkpatrick, Gelatt, and Vecchi, 1983).

    The Boltzmann machine was among the first multilayer learning machines inspired by statistical mechanics (Hinton and Sejnowski, 1983, 1986; Ackley et al., 1985).

  • Slide 5/32

    Some Theory

    Consider a physical system with many degrees of freedom, which can reside in any one of a large number of possible states.

    Let p_i denote the probability of occurrence of state i, with p_i >= 0 for all i and Σ_i p_i = 1.

    Let E_i denote the energy of the system when it is in state i.

    When the system is in thermal equilibrium with its surrounding environment, state i occurs with probability p_i = (1/Z) exp(-E_i / k_B T), where T is the absolute temperature (K) and k_B is Boltzmann's constant (this is the canonical or Gibbs distribution).

    The normalizing quantity Z is called the sum over states or the partition function: Z = Σ_i exp(-E_i / k_B T).

    Note two points:

    1) States of low energy have a higher probability of occurrence than states of high energy.

    2) As the temperature T is reduced, the probability is concentrated on a smaller subset of low-energy states.
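
    As a quick numerical illustration of point 2 (not from the original slides), the sketch below evaluates the Gibbs distribution for a few arbitrary energy levels at several temperatures, with k_B set to 1 as is common in simulations:

```python
import math

def gibbs_distribution(energies, T):
    """Return p_i = exp(-E_i/T) / Z for each energy level (k_B = 1)."""
    weights = [math.exp(-E / T) for E in energies]
    Z = sum(weights)  # partition function: the "sum over states"
    return [w / Z for w in weights]

energies = [0.0, 1.0, 2.0, 5.0]  # arbitrary example energy levels
for T in (10.0, 1.0, 0.1):
    print(f"T={T:5.1f}:", [round(p, 4) for p in gibbs_distribution(energies, T)])
# As T drops, the probability mass concentrates on the lowest-energy state.
```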

  • Slide 6/32

    Some Theory

    The Helmholtz free energy of a physical system, F, is defined in terms of the partition function Z: F = -T log Z.

    The average energy is ⟨E⟩ = Σ_i p_i E_i, and ⟨E⟩ - F = -T Σ_i p_i log p_i = T·H, where H = -Σ_i p_i log p_i is the entropy; hence F = ⟨E⟩ - T·H. For two such systems in thermal contact, the total entropy tends to increase: ΔH + ΔH' >= 0.

    The principle of minimal free energy: the minimum of the free energy of a stochastic system with respect to the variables of the system is achieved at thermal equilibrium, at which point the system is governed by the Gibbs distribution.
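
    The identity F = ⟨E⟩ - T·H can be checked numerically. A minimal sketch (arbitrary energy levels, k_B = 1) comparing F = -T log Z against ⟨E⟩ - T·H:

```python
import math

def free_energy_two_ways(energies, T):
    """Compare F = -T*log(Z) with F = <E> - T*H for a Gibbs distribution."""
    weights = [math.exp(-E / T) for E in energies]
    Z = sum(weights)
    p = [w / Z for w in weights]
    F = -T * math.log(Z)                               # Helmholtz free energy
    avg_E = sum(pi * E for pi, E in zip(p, energies))  # average energy <E>
    H = -sum(pi * math.log(pi) for pi in p)            # entropy
    return F, avg_E - T * H

print(free_energy_two_ways([0.0, 1.0, 2.0], T=1.5))  # both values agree
```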

  • Slide 7/32

    Some Theory

    Consider a system whose evolution is described by a stochastic process {X_n, n = 1, 2, ...}, consisting of a family of random variables.

    The value x_n assumed by the random variable X_n at discrete time n is called the state of the system.

    The space of all possible values that the random variables can assume is called the state space of the system.

  • Slide 8/32

    Some Theory

    If the structure of the stochastic process {X_n, n = 1, 2, ...} is such that the conditional probability distribution of X_{n+1} depends only on the value of X_n and is independent of all previous values, the process is a Markov chain. Markov property: P(X_{n+1} = x_{n+1} | X_n = x_n, ..., X_1 = x_1) = P(X_{n+1} = x_{n+1} | X_n = x_n).

    A sequence of random variables X_1, X_2, ..., X_n, X_{n+1} forms a Markov chain if the probability that the system is in state x_{n+1} at time n+1 depends exclusively on the probability that the system is in state x_n at time n.

  • Slide 9/32

    Some Theory

    In a Markov chain, the transition from one state to another is probabilistic: p_ij = P(X_{n+1} = j | X_n = i) denotes the transition probability from state i at time n to state j at time n+1 (p_ij >= 0 for all (i, j), and Σ_j p_ij = 1 for all i).

    A Markov chain is homogeneous in time if the transition probabilities are fixed and do not change with time.

    Let p_ij(m) denote the m-step transition probability from state i to state j: p_ij(m) = P(X_{n+m} = x_j | X_n = x_i), m = 1, 2, ... We may view it as a sum over all intermediate states k through which the system passes in its transition from state i to state j: p_ij(m+1) = Σ_k p_ik(m) p_kj, m = 1, 2, ..., with p_ij(1) = p_ij.

    More generally, p_ij(m+n) = Σ_k p_ik(m) p_kj(n), m, n = 1, 2, ... (the Chapman-Kolmogorov identity).

    When a state of the chain can only reoccur at time intervals that are multiples of d (d being the largest such number), we say the state has period d.
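
    Since the m-step transition probabilities of a homogeneous chain are the entries of the matrix power P^m, the Chapman-Kolmogorov identity can be verified directly. A minimal check with an arbitrary 3-state transition matrix:

```python
import numpy as np

# An arbitrary 3-state transition matrix (each row sums to 1).
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

P2 = np.linalg.matrix_power(P, 2)  # 2-step transition probabilities
P3 = np.linalg.matrix_power(P, 3)  # 3-step transition probabilities

# Chapman-Kolmogorov: P(m+n) = P(m) P(n).
assert np.allclose(np.linalg.matrix_power(P, 5), P2 @ P3)
print("Chapman-Kolmogorov identity holds for m=2, n=3")
```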

  • Slide 10/32

    Some Theory

    The state i is said to be a recurrent state if the Markov chain returns to state i with probability 1: f_i = P(ever returning to state i) = 1.

    If the probability f_i is less than 1, the state i is said to be a transient state.

    The state j of a Markov chain is said to be accessible from state i if there is a finite sequence of transitions from i to j with positive probability.

    If the states i and j are accessible to each other, the states i and j of the Markov chain are said to communicate with each other.

    If two states of a Markov chain communicate with each other, they belong to the same class. If all the states form a single class, the Markov chain is said to be indecomposable or irreducible.

  • Slide 11/32

    Some Theory

    The mean recurrence time of state i is defined as the expectation of T_i(k) over the returns k; T_i(k) denotes the time that elapses between the (k-1)th and kth returns to state i.

    The steady-state probability of state i, denoted by π_i, is equal to the reciprocal of the mean recurrence time: π_i = 1 / E[T_i(k)].

    If E[T_i(k)] < ∞, that is, π_i > 0, the state i is said to be a positive recurrent state. If E[T_i(k)] = ∞, that is, π_i = 0, the state i is said to be a null recurrent state.

  • Slide 12/32

    Some Theory

    Ergodicity: we may substitute time averages for ensemble averages. In the context of a Markov chain, the long-term proportion of time spent by the chain in state i corresponds to the steady-state probability π_i.

    The proportion of time spent in state i after k returns is v_i(k) = k / Σ_{l=1}^{k} T_i(l).

    Consider an ergodic Markov chain characterized by a stochastic matrix P. Let the row vector π(n-1) denote the state distribution vector of the chain at time n-1; the jth element of π(n-1) is the probability that the chain is in state x_j at time n-1. Then

    π(n) = π(n-1) P

    The state distribution vector of the Markov chain at time n is the product of the initial state distribution vector π(0) and the nth power of the stochastic matrix P: π(n) = π(0) P^n.
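
    A small sketch of both views, reusing an arbitrary ergodic 3-state matrix: iterating π(n) = π(n-1)P from any initial distribution, and cross-checking against the eigenvector characterization of the steady state:

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

pi = np.array([1.0, 0.0, 0.0])  # arbitrary initial state distribution pi(0)
for _ in range(200):
    pi = pi @ P                 # pi(n) = pi(n-1) P
print(pi)                       # the steady-state distribution

# Cross-check: the steady state is the left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
print(v / v.sum())
```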

  • Slide 13/32

    Some Theory

    The ergodicity theorem: Eqs. (11.2)-(11.27) in the course text.

  • Slide 14/32

    Some Theory

    The principle of detailed balance states that at thermal equilibrium, the rate of occurrence of any transition equals the corresponding rate of occurrence of the inverse transition:

    π_i p_ij = π_j p_ji
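
    A minimal numerical check of detailed balance, using a small reversible chain whose steady-state vector is chosen by hand for illustration:

```python
import numpy as np

# A reversible 3-state chain (illustrative) and its steady-state vector.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])

# Detailed balance: pi_i * p_ij == pi_j * p_ji for every pair (i, j),
# i.e. the matrix of probability flows is symmetric.
flows = pi[:, None] * P
assert np.allclose(flows, flows.T)
print("detailed balance holds")
```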

  • Slide 15/32

    Metropolis Algorithm

    The Metropolis algorithm (Metropolis et al., 1953) is a modified Monte Carlo method for stochastic simulation of a collection of atoms in equilibrium at a given temperature.

    The random variable X_n, representing an arbitrary Markov chain, is in state x_i at time n. We randomly generate a new state x_j, representing a realization of another random variable Y_n. The generation of this new state satisfies the symmetry condition: P(Y_n = x_j | X_n = x_i) = P(Y_n = x_i | X_n = x_j).

    Let ΔE = E_j - E_i denote the energy difference resulting from the transition of the system from state X_n = x_i to state Y_n = x_j.

    1) ΔE < 0: π_j / π_i = exp(-ΔE/T) > 1, and the new state is accepted with probability 1: p_ij = τ_ij.

    2) ΔE > 0: π_j / π_i = exp(-ΔE/T) < 1, and the new state is accepted only with probability exp(-ΔE/T): p_ij = τ_ij exp(-ΔE/T), so that π_i p_ij = π_j τ_ij = π_j p_ji and detailed balance holds with respect to the Gibbs distribution.

    The a priori transition probabilities τ_ij are in fact the probabilistic model of the random step in the Metropolis algorithm.
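
    A compact sketch of the algorithm as described above; the energy function, temperature, and ±1 proposal move are illustrative choices (the proposal is symmetric, as the algorithm requires):

```python
import math
import random

def metropolis(energy, propose, x0, T, n_steps=10_000):
    """Metropolis sampling at temperature T (k_B = 1).

    energy(x): energy of state x; propose(x): symmetric random move.
    """
    x, samples = x0, []
    for _ in range(n_steps):
        x_new = propose(x)
        dE = energy(x_new) - energy(x)
        # Accept downhill moves always, uphill moves with prob exp(-dE/T).
        if dE <= 0 or random.random() < math.exp(-dE / T):
            x = x_new
        samples.append(x)
    return samples

# Toy example: sample integers with E(x) = x^2 via +/-1 moves.
samples = metropolis(lambda x: x * x,
                     lambda x: x + random.choice((-1, 1)),
                     x0=5, T=1.0)
print(sum(samples) / len(samples))  # hovers near 0, the energy minimum
```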

  • Slide 16/32

    Simulated Annealing

    Simulated annealing has two basic ingredients:

    1) A schedule that determines the rate at which the temperature is lowered.

    2) An algorithm that iteratively finds the equilibrium distribution at each new temperature in the schedule by using the final state of the system at the previous temperature as the starting point for the new temperature (Kirkpatrick et al., 1983).

    The Metropolis algorithm is the basis for the simulated annealing process. The temperature T plays the role of a control parameter. The simulated annealing process will converge to a configuration of minimal energy provided that the temperature is decreased no faster than logarithmically; this is too slow to be of practical use → finite-time approximation (no longer guaranteed to find a global minimum with probability one).
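
    A sketch of such a finite-time approximation with a geometric cooling schedule T ← αT; the schedule parameters and the toy objective are arbitrary choices, not prescribed by the slides:

```python
import math
import random

def simulated_annealing(energy, propose, x0, T0=10.0, alpha=0.9,
                        n_temps=60, sweeps_per_temp=200):
    """Finite-time simulated annealing with a geometric cooling schedule.

    The geometric schedule T <- alpha*T is a common practical stand-in for
    the logarithmic schedule that guarantees convergence but is too slow.
    """
    x, T = x0, T0
    best, best_E = x0, energy(x0)
    for _ in range(n_temps):
        for _ in range(sweeps_per_temp):   # Metropolis moves at fixed T
            x_new = propose(x)
            dE = energy(x_new) - energy(x)
            if dE <= 0 or random.random() < math.exp(-dE / T):
                x = x_new
                if energy(x) < best_E:
                    best, best_E = x, energy(x)
        T *= alpha  # cool, reusing the final state as the next starting point
    return best, best_E

# Toy multimodal objective over the integers.
E = lambda x: 0.1 * (x - 7) ** 2 + math.sin(x)
print(simulated_annealing(E, lambda x: x + random.choice((-1, 1)), x0=-20))
```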

  • Slide 17/32

    Simulated Annealing

    To implement a finite-time approximation of the simulated annealing algorithm, we must specify a set of parameters governing the convergence of the algorithm. These parameters are combined in a so-called annealing schedule or cooling schedule.

    The annealing schedule specifies a finite sequence of values of the temperature and a finite number of transitions attempted at each value of the temperature.

  • Slide 19/32

    Gibbs Sampling

    The Gibbs sampler generates a Markov chain with the Gibbs distribution as its equilibrium distribution.

    The transition probabilities associated with the Gibbs sampler are nonstationary.

    1) Each component of the random vector X is visited in the natural order, with the result that a total of K new variates are generated on each iteration.

    2) The new value of component X_{k-1} is used immediately when a new value of X_k is drawn, for k = 2, 3, ..., K.

    The Gibbs sampler is an iterative adaptive scheme.
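
    The slides treat the Gibbs sampler abstractly; a standard concrete instance (our choice, not from the slides) is sampling a bivariate normal from its two conditionals, which makes the natural visiting order and the immediate reuse of fresh values explicit:

```python
import random

def gibbs_bivariate_normal(rho, n_samples=5000):
    """Gibbs sampling for a standard bivariate normal with correlation rho.

    Each iteration visits the two components in their natural order, and the
    fresh value of x1 is used immediately when x2 is drawn.
    """
    x1 = x2 = 0.0
    s = (1.0 - rho ** 2) ** 0.5  # std dev of each conditional
    out = []
    for _ in range(n_samples):
        x1 = random.gauss(rho * x2, s)  # draw x1 | x2
        x2 = random.gauss(rho * x1, s)  # draw x2 | x1 (uses the new x1)
        out.append((x1, x2))
    return out

samples = gibbs_bivariate_normal(rho=0.8)
print(round(sum(x for x, _ in samples) / len(samples), 2))  # near 0
```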

  • Slide 20/32

    Gibbs Sampler

    (Eqs. 11.45, 11.46, 11.47 in the course text)

  • Slide 21/32

    "olt#mann Machine

    The primary goal of Boltzmann learning is to produce a neural network that correctly models input patterns according to a Boltzmann distribution.

    The Boltzmann machine consists of stochastic neurons. A stochastic neuron resides in one of two possible states (±1) in a probabilistic manner.

    It uses symmetric synaptic connections between neurons. The stochastic neurons partition into two functional groups: visible and hidden.

  • Slide 22/32

    "olt#mann Machine

    During the training phase of the network, the visible neurons are all clamped onto specific states determined by the environment.

    The hidden neurons always operate freely; they are used to explain underlying constraints contained in the environmental input vectors.

    This is accomplished by capturing higher-order statistical correlations in the clamping vectors.

    The network can perform pattern completion provided that it has learned the training distribution properly.

  • Slide 23/32

    "olt#mann Machine

    Let x denote the state of the Boltzmann machine, with component x_i denoting the state of neuron i. The state x represents a realization of the random vector X. The synaptic connection from neuron i to neuron j is denoted by w_ji, with w_ji = w_ij for all (i, j) and w_ii = 0 for all i.

    The energy of the machine is E(x) = -(1/2) Σ_i Σ_{j, j≠i} w_ji x_i x_j, and P(X = x) = (1/Z) exp(-E(x)/T). The induced conditional for a single neuron is P(X_j = x_j | the states of the other neurons) = φ((x_j/T) Σ_i w_ji x_i), where φ(·) is a sigmoid function of its argument. Gibbs sampling and simulated annealing are used.
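
    A sketch of one Gibbs-sampling sweep over the neurons of a small Boltzmann machine. Under the energy above, the exact conditional for a ±1 unit is P(x_j = +1 | the rest) = 1/(1 + exp(-2v_j/T)) with net input v_j = Σ_i w_ji x_i; the weights below are illustrative:

```python
import math
import random

def boltzmann_sweep(x, w, T):
    """One Gibbs sweep over the +/-1 neurons of a Boltzmann machine.

    w is symmetric with zero diagonal; neuron j is set to +1 with
    probability 1/(1 + exp(-2*v_j/T)), v_j being its net input.
    """
    n = len(x)
    for j in range(n):
        v = sum(w[j][i] * x[i] for i in range(n) if i != j)  # net input v_j
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * v / T))
        x[j] = 1 if random.random() < p_plus else -1
    return x

# Tiny 3-neuron machine with symmetric, illustrative weights.
w = [[0.0, 0.5, -0.2],
     [0.5, 0.0, 0.8],
     [-0.2, 0.8, 0.0]]
x = [random.choice((-1, 1)) for _ in range(3)]
for _ in range(100):
    x = boltzmann_sweep(x, w, T=1.0)
print(x)
```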

  • Slide 24/32

    "olt#mann Machine

    The goal of Boltzmann learning is to maximize the likelihood or log-likelihood function in accordance with the maximum-likelihood principle.

    Positive phase: in this phase the network operates in its clamped condition. Negative phase: in this second phase, the network is allowed to run freely, and therefore with no environmental input.

    The log-likelihood function, given a training set T of visible-state vectors x_α, is L(w) = log Π_{x_α ∈ T} P(X_α = x_α), i.e.

    L(w) = Σ_{x_α ∈ T} (log Σ_{x_β} exp(-E(x)/T) - log Σ_x exp(-E(x)/T))

    Differentiating L(w) with respect to w_ji and introducing the mean correlations ρ+_ji (clamped phase) and ρ-_ji (free-running phase):

    Δw_ji = ε ∂L(w)/∂w_ji = η (ρ+_ji - ρ-_ji), where η is a learning-rate parameter, η = ε/T.

    From a learning point of view, the two terms that constitute the Boltzmann learning rule have opposite meanings: ρ+_ji, corresponding to the clamped condition of the network, is a Hebbian learning rule; ρ-_ji, corresponding to the free-running condition of the network, is an unlearning (forgetting) term. We also have a primitive form of an attention mechanism. The two-phase approach and, specifically, the negative phase also mean increased computational time and sensitivity to statistical errors.
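
    A sketch of one weight update under this rule, assuming lists of network states have already been collected (e.g., by Gibbs sampling with annealing) in the clamped and free-running phases; the helper names are ours, not from the slides:

```python
def correlations(states, n):
    """Average pairwise correlations <x_j x_i> over a list of +/-1 states."""
    rho = [[0.0] * n for _ in range(n)]
    for x in states:
        for j in range(n):
            for i in range(n):
                rho[j][i] += x[j] * x[i]
    return [[v / len(states) for v in row] for row in rho]

def boltzmann_update(w, clamped_states, free_states, eta=0.05):
    """One step of the Boltzmann learning rule: dw_ji = eta*(rho+ - rho-)."""
    n = len(w)
    rho_plus = correlations(clamped_states, n)   # positive (clamped) phase
    rho_minus = correlations(free_states, n)     # negative (free) phase
    for j in range(n):
        for i in range(n):
            if i != j:  # keep the zero diagonal
                w[j][i] += eta * (rho_plus[j][i] - rho_minus[j][i])
    return w
```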

  • Slide 25/32

    Sigmoid "elie% Netors

    Sigmoid belief networks or logistic belief nets (Neal, 1992) were developed to find a stochastic machine that would share with the Boltzmann machine the capacity to learn arbitrary probability distributions over binary vectors, but would not need the negative phase of the Boltzmann machine learning procedure.

    This objective was achieved by replacing the symmetric connections of the Boltzmann machine with directed connections that form an acyclic graph.

    A sigmoid belief network consists of a multilayer architecture with binary stochastic neurons. The acyclic nature of the machine makes it easy to perform probabilistic calculations.

  • Slide 26/32

    Sigmoid "elie% Netors

    Let the vector X, consisting of two-valued random variables X_1, X_2, ..., X_N, define a sigmoid belief network composed of N stochastic neurons.

    The parents of element X_j in X are denoted by pa(X_j) ⊆ {X_1, X_2, ..., X_{j-1}}.

    pa(X_j) is the smallest subset of the random vector X for which P(X_j = x_j | X_1 = x_1, ..., X_{j-1} = x_{j-1}) = P(X_j = x_j | pa(X_j)) = φ((x_j/T) Σ_i w_ji x_i).

    Note that 1) w_ji = 0 for all X_i not belonging to pa(X_j), and 2) w_ji = 0 for all i >= j.
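
    Because w_ji = 0 for i >= j, drawing the components in their natural order conditions only on values that have already been sampled (ancestral sampling). A sketch for ±1 units using the slide's conditional φ((x_j/T) Σ_i w_ji x_i) with a logistic φ; the chain-structured weights are illustrative:

```python
import math
import random

def sample_sbn(w, T=1.0):
    """Ancestral sampling from a sigmoid belief network over +/-1 units.

    w[j][i] is the directed weight from X_i to X_j, with w[j][i] = 0 for
    i >= j, so index order visits parents before children.
    P(X_j = +1 | parents) = phi(v_j/T), phi the logistic sigmoid.
    """
    n = len(w)
    x = [0] * n
    for j in range(n):
        v = sum(w[j][i] * x[i] for i in range(j))  # parents' net input
        p_plus = 1.0 / (1.0 + math.exp(-v / T))
        x[j] = 1 if random.random() < p_plus else -1
    return x

# A 3-unit chain X1 -> X2 -> X3 (lower-triangular weights; illustrative).
w = [[0.0, 0.0, 0.0],
     [1.2, 0.0, 0.0],
     [0.0, -0.7, 0.0]]
print(sample_sbn(w))
```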

  • Slide 27/32

    Sigmoid "elie% Netors

    Learning:

    It is assumed that each sample is two-valued, representing certain attributes. Repetition of training examples is permitted, in proportion to how commonly a particular combination of attributes is known to occur.

    1. Some size for a state vector, x, is decided for the network.

    2. A subset of the state vector, say x_α, is selected to represent the attributes in the training cases; that is, x_α represents the state vector of the visible neurons.

    3. The remaining part of the state vector x, denoted by x_β, defines the state vector of the hidden neurons.

    Different arrangements of visible and hidden neurons may result in different configurations. The log-likelihood function is L(w) = Σ_{x_α ∈ T} log P(X_α = x_α). Differentiating L(w) with respect to w_ji gives Δw_ji = ε ∂L(w)/∂w_ji = η ρ_ji, where η is a learning-rate parameter, η = ε/T, and ρ_ji is Σ_{x_β} P(X = x | X_α = x_α) φ(-(x_j/T) Σ_{i<j} w_ji x_i) x_j x_i, averaged over the training cases: an average correlation between the states of neurons i and j, weighted by the factor φ(-(x_j/T) Σ_{i<j} w_ji x_i).

  • Slide 28/32

    Sigmoid "elie% Netors

    Table 11.2 (in the course text)

  • Slide 29/32

    Helmholtz Machine

    The Helmholtz machine (Dayan et al., 1995; Hinton et al., 1995) uses two entirely different sets of synaptic connections.

    The forward connections constitute the recognition model. The purpose of this model is to infer a probability distribution over the underlying causes of the input vector.

    The backward connections constitute the generative model. The purpose of this second model is to reconstruct an approximation to the original input vector from the underlying representations captured by the hidden layers of the network, thereby enabling it to operate in a self-supervised manner.

    Both the recognition and generative models operate in a strictly feedforward fashion, with no feedback; they interact with each other only via the learning procedure.

  • Slide 31/32

    Mean>@ield (heory

    (he use o% mean>%ield theory asthe mathematical !asis %orderi9ing deterministic

    appro=imation to the stochasticmachines to speed up learning'

    1' Correlations are replaced !ytheir mean>%ield apro=imations'

    2' An intracta!le model isreplaced !y a tracta!le model9ia a 9ariational principle'
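
    A sketch of approach 1 for the Boltzmann machine: each stochastic ±1 neuron is replaced by its deterministic mean activation, which under the conditional used earlier satisfies m_j = tanh(Σ_i w_ji m_i / T) at a fixed point; the weights are illustrative:

```python
import math

def mean_field(w, T=1.0, n_iters=100):
    """Deterministic mean-field iteration for a Boltzmann machine.

    Each stochastic +/-1 neuron is replaced by its mean activation m_j,
    updated as m_j = tanh(sum_i w_ji * m_i / T).
    """
    n = len(w)
    m = [0.1] * n  # small nonzero start so the iteration is not stuck at 0
    for _ in range(n_iters):
        m = [math.tanh(sum(w[j][i] * m[i] for i in range(n) if i != j) / T)
             for j in range(n)]
    return m

w = [[0.0, 0.5, -0.2],
     [0.5, 0.0, 0.8],
     [-0.2, 0.8, 0.0]]
print(mean_field(w))  # deterministic approximations to the means <x_j>
```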

  • Slide 32/32

    Summary

    Some ideas rooted in statistical mechanics have been presented.

    The Boltzmann machine uses hidden and visible neurons that are in the form of stochastic, binary-state units. It exploits the properties of the Gibbs distribution, thereby offering some appealing features:

    Through training, the probability distribution exhibited by the neurons is matched to that of the environment.

    The network offers a generalized approach that is applicable to the basic issues of search, representation, and learning.

    The network is guaranteed to find the global minimum of the energy surface with respect to the states, provided that the annealing schedule in the learning process is performed slowly enough.