CHAPTER 8

ANNEALING-TYPE ALGORITHMS

• Organization of chapter in ISSO
  – Introduction to simulated annealing
  – Simulated annealing algorithm
    • Basic algorithm with noise-free loss measurements
    • With noisy loss measurements
  – Numerical examples
    • Traveling salesperson problem
    • Continuous problems with single and multiple minima
  – Annealing algorithms based on stochastic approximation with injected “noise”
  – Convergence theory

Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall


Background on Simulated Annealing

• Continues in spirit of Chaps. 2, 6, and 7 in working with only loss measurements (no direct gradients)

• Simulated annealing (SAN) based on analogies to cooling (annealing) of physical substances
  – Optimal θ analogous to minimum energy state

• Primarily designed to be global optimization method

• Based on probabilistic criterion for accepting increased loss value during search process
  – Metropolis criterion
  – Allows for temporary increase in loss as means of reaching global minimum

• Some convergence theory possible (e.g., Hajek, 1988, for discrete Θ [see p. 213 of ISSO]; Sect. 8.6 in ISSO for continuous Θ)


Metropolis Criterion

• In iterative process, suppose we have current value θcurr and candidate new value θnew. Should we accept θnew if θnew is worse than θcurr (i.e., has higher loss value)?

• Metropolis criterion (from famous 1953 paper of Metropolis et al.) gives probability of accepting new value (c_b is constant and T is “temperature”; set c_b = 1 without loss of generality)

• Repeated application of Metropolis criterion (iteration to iteration) provides for convergence of SAN to global minimum
  – Markov chain theory applies for discrete Θ; stochastic approximation for continuous Θ

    P(accept θnew) = exp( −[L(θnew) − L(θcurr)] / (c_b T) )
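To make the accept/reject rule concrete, here is a minimal Python sketch of the criterion above; the uniform random draw plays the role of the Monte Carlo sampling step, and c_b = 1 by default, as on the slide:

```python
import math
import random

def metropolis_accept(loss_curr, loss_new, T, c_b=1.0):
    """Decide whether to accept a candidate whose loss is loss_new at temperature T."""
    if loss_new < loss_curr:
        return True                                    # improvement: always accept
    prob = math.exp(-(loss_new - loss_curr) / (c_b * T))
    return random.random() < prob                      # Monte Carlo sampling step
```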


SAN Algorithm with Noise-Free Loss Measurements

Step 0 (initialization) Set initial temperature T and current parameter θcurr; determine L(θcurr).

Step 1 (candidate value) Randomly determine new value θnew and determine L(θnew).

Step 2 (compare L values) If L(θnew) < L(θcurr), accept θnew. Alternatively, if L(θnew) ≥ L(θcurr), accept θnew with probability given by Metropolis criterion (implemented via Monte Carlo sampling scheme); otherwise keep θcurr.

Step 3 (iterate at fixed temperature) Repeat Steps 1 and 2 until T is changed.

Step 4 (decrease temperature) Lower T according to the annealing schedule and return to Step 1. Continue until effective convergence.
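A minimal Python sketch of Steps 0–4 above. The Gaussian candidate perturbation, the geometric cooling schedule T ← decay·T, and the fixed iteration counts are illustrative tuning assumptions for this sketch, not prescriptions from ISSO:

```python
import math
import random

def san_minimize(loss, theta0, T0=1.0, decay=0.9, iters_per_temp=50,
                 n_temps=100, step_scale=0.1, c_b=1.0):
    """Basic simulated annealing with noise-free loss measurements.

    loss is the function L(theta); theta0 is the initial parameter list.
    Candidate step, cooling schedule, and stopping rule are illustrative choices.
    """
    theta_curr = list(theta0)
    loss_curr = loss(theta_curr)                         # Step 0
    T = T0
    for _ in range(n_temps):
        for _ in range(iters_per_temp):                  # Step 3: iterate at fixed T
            # Step 1: random candidate (Gaussian perturbation is an assumption)
            theta_new = [t + step_scale * random.gauss(0.0, 1.0) for t in theta_curr]
            loss_new = loss(theta_new)
            # Step 2: accept if better, else accept with Metropolis probability
            if loss_new < loss_curr or \
               random.random() < math.exp(-(loss_new - loss_curr) / (c_b * T)):
                theta_curr, loss_curr = theta_new, loss_new
        T *= decay                                       # Step 4: lower the temperature
    return theta_curr, loss_curr
```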


SAN Algorithm with Noisy Loss Measurements

• As with random search (Chap. 2 of ISSO), standard SAN not designed for noisy measurements y = L + ε

• However, SAN sometimes used with noisy measurements
• Standard approach is to form average of loss measurements at each θ in search process
• Alternative is to use threshold idea of Sect. 2.3 of ISSO
  – Only accept new value if noisy loss value is sufficiently bigger or smaller than current noisy loss

• Can use one-sided Chebyshev inequality to characterize likelihood of error at each iteration under general noise distribution

• Very limited convergence theory for SAN with noisy measurements
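A hedged sketch of the measurement-averaging approach mentioned above: average several noisy values y = L + ε at each θ before applying the usual accept/reject test. The number of averages (n_avg) is an arbitrary illustrative choice:

```python
import math
import random

def averaged_loss(noisy_loss, theta, n_avg=10):
    """Average several noisy measurements y = L(theta) + noise to reduce variance."""
    return sum(noisy_loss(theta) for _ in range(n_avg)) / n_avg

def noisy_accept(noisy_loss, theta_curr, theta_new, T, n_avg=10, c_b=1.0):
    """One SAN accept/reject decision based on averaged noisy loss values."""
    y_curr = averaged_loss(noisy_loss, theta_curr, n_avg)
    y_new = averaged_loss(noisy_loss, theta_new, n_avg)
    if y_new < y_curr:
        return True
    return random.random() < math.exp(-(y_new - y_curr) / (c_b * T))
```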


Traveling Salesperson Problem (TSP)

• TSP is famous discrete optimization problem
• Many successful uses of SAN with TSP
• Basic problem is to find best way for salesperson to hit every city in territory once and only once
  – Setting arises in many problems of optimization on networks (communications, transportation, etc.)

• If tour involves n cities, there are (n−1)!/2 possible solutions
  – Extremely rapid growth in solution space as n increases
  – Problem is “NP hard”

• Perturbations in SAN steps based on three operations on network: inversion, translation, and switching
  – Depicted below; a code sketch follows the figure

TSP: Standard Search Operations Applied to 8-City Tour

Inversion reverses order 2-3-4-5; translation removes section 2-3-4-5 and places it between 6-7; switching interchanges order of 2 and 5.
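The three operators can be sketched on a tour stored as a Python list of city labels; the exact index conventions below (which positions bound the segment, where it is reinserted) are illustrative assumptions:

```python
def inversion(tour, i, j):
    """Reverse the order of the segment tour[i:j+1] (e.g., 2-3-4-5 -> 5-4-3-2)."""
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

def translation(tour, i, j, k):
    """Remove segment tour[i:j+1] and reinsert it after position k of the remainder."""
    segment = tour[i:j + 1]
    rest = tour[:i] + tour[j + 1:]
    return rest[:k + 1] + segment + rest[k + 1:]

def switching(tour, i, j):
    """Interchange the cities at positions i and j."""
    new = list(tour)
    new[i], new[j] = new[j], new[i]
    return new

# Example on an 8-city tour labeled 1..8 (indices are illustrative):
tour = [1, 2, 3, 4, 5, 6, 7, 8]
print(inversion(tour, 1, 4))       # reverses the order 2-3-4-5
print(translation(tour, 1, 4, 1))  # moves 2-3-4-5 between 6 and 7
print(switching(tour, 1, 4))       # interchanges cities 2 and 5
```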


TSP (cont’d)

Solution to Trivial 4-City Problem where Cost/Link = Distance (Related to Exercise 8.5 in ISSO)


Some Numerical Results for SAN

• Section 8.3 of ISSO reports on three examples for SAN
  – Small-scale TSP
  – Problem with no local minima other than global minimum
  – Problem with multiple local minima

• All examples based on stepwise temperature decay in basic SAN steps above and noise-free loss measurements

• All SAN runs require algorithm tuning to pick:
  – Initial T
  – Number of iterations at fixed T
  – Choice of decay factor 0 < λ < 1, representing amount of reduction in each temperature decay
  – Method for generating candidate θnew

• Brief descriptions follow on slides below….


Small-Scale TSP (Example 8.1 in ISSO)

• 10-city tour (very small by industrial standards)
  – Know by enumeration that minimum cost of tour = 440
• Randomly chose inversion, translation, or switching at each iteration
  – Tuning required to choose “good” probabilities of selecting these operators
• 8 of 10 SAN runs find minimum cost tour

– Sample mean cost of initial tour is 700; sample mean of final tour is 444

• Essential to success is adequate use of inversion operator; 0 of 10 SAN runs find optimal tour if probability of inversion is 0.50

• SAN successfully used in much larger TSPs
  – E.g., seminal 1983 (!) Kirkpatrick et al. paper in Science considers TSP with 400 cities
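As an illustration of the operator-selection step described above, the perturbation type can be drawn at each iteration from tuned probabilities; the specific values below are placeholders, not the probabilities used in Example 8.1:

```python
import random

def choose_operator(p_inversion=0.75, p_translation=0.125):
    """Pick a perturbation operator; the remaining probability goes to switching."""
    u = random.random()
    if u < p_inversion:
        return "inversion"
    elif u < p_inversion + p_translation:
        return "translation"
    return "switching"
```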


Comparison of SAN and Two Random Search Algorithms (Example 8.2 in ISSO)

• Considered very simple p = 2 “quartic loss” seen earlier:

    L(θ) = t_1^4 + t_1^2 + t_1 t_2 + t_2^2,    θ = [t_1, t_2]^T

• Function has single global minimum; no local minima

• Table below gives sample mean terminal loss value, where initial loss = 4.00 and L(θ*) = 0

• SAN performs well, but random search even better in this problem

    No. of meas.   Random search B   Random search C   SAN (T_init = 0.01)   SAN (T_init = 0.10)
    100            0.00053           0.328             1.86                  0.091
    10,000         2.7×10^−6         2.5×10^−7         0.00038               0.0024
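The quartic loss above, written as a small Python function; the check at [1, 1]^T matching the stated initial loss of 4.00 is an observation about the reconstructed formula, not a claim about the initial condition actually used in the example:

```python
def quartic_loss(theta):
    """Quartic loss from Example 8.2: L(theta) = t1^4 + t1^2 + t1*t2 + t2^2."""
    t1, t2 = theta
    return t1**4 + t1**2 + t1 * t2 + t2**2

print(quartic_loss([0.0, 0.0]))   # 0.0 at the global minimum theta* = 0
print(quartic_loss([1.0, 1.0]))   # 4.0, consistent with the stated initial loss
```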


Evaluation of SAN in Problem with Multiple Local Minima (Example 8.3 in ISSO)

• Many numerical studies in literature showing favorable results for SAN

• Loss function in study of Brooks and Morgan (1995):

    L(θ) = t_1^2 + 2 t_2^2 − 0.3 cos(3π t_1) − 0.4 cos(4π t_2)

  with θ = [t_1, t_2]^T and Θ = [−1, 1]^2

• Function has many local minima with a unique global minimum

• Study compares quasi-Newton method and SAN
  – “Apples vs. oranges” (gradient-based vs. non-gradient-based)

• 20% of quasi-Newton runs and 100% of SAN runs ended near θ* (random initial conditions)
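The multimodal loss above as Python code; the π factors inside the cosines are part of the reconstruction of the garbled formula (they are what produce many local minima on Θ = [−1, 1]^2):

```python
import math

def multimodal_loss(theta):
    """Multimodal test loss (Example 8.3); pi factors are assumed in this reconstruction."""
    t1, t2 = theta
    return (t1**2 + 2.0 * t2**2
            - 0.3 * math.cos(3.0 * math.pi * t1)
            - 0.4 * math.cos(4.0 * math.pi * t2))
```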


Global Optimization via Annealing of Stochastic Approximation

• SAN not only way annealing used for global optimization

• With appropriate annealing, stochastic approximation (SA) can be used in global optimization

• Standard approach is to inject Gaussian noise into r.h.s. of SA recursion:

    θ̂_{k+1} = θ̂_k − a_k G_k(θ̂_k) + b_k w_k        (*)

  where G_k is direct gradient measurement (Chap. 5) or gradient approximation (FDSA or SPSA), b_k → 0 (the “annealing”), and w_k ~ N(0, I_{p×p})

• Injected noise w_k generated by Monte Carlo

• Eqn. (*) has theoretical basis for formal convergence (Sect. 8.4 of ISSO)
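A minimal sketch of recursion (*) with an SPSA gradient approximation standing in for G_k. The gain sequences a_k, b_k, c_k and the Bernoulli ±1 perturbations are illustrative choices for this sketch, not the sequences analyzed in Sect. 8.4 of ISSO:

```python
import math
import random

def spsa_gradient(loss, theta, c_k):
    """Simultaneous perturbation gradient approximation with Bernoulli +/-1 deltas."""
    delta = [random.choice([-1.0, 1.0]) for _ in theta]
    theta_plus = [t + c_k * d for t, d in zip(theta, delta)]
    theta_minus = [t - c_k * d for t, d in zip(theta, delta)]
    diff = (loss(theta_plus) - loss(theta_minus)) / (2.0 * c_k)
    return [diff / d for d in delta]

def annealed_sa(loss, theta0, n_iter=10000, a=0.1, c=0.1, b=0.1):
    """Recursion (*): theta_{k+1} = theta_k - a_k*G_k(theta_k) + b_k*w_k."""
    theta = list(theta0)
    for k in range(n_iter):
        a_k = a / (k + 1)
        c_k = c / (k + 1) ** 0.1667
        b_k = b / math.log(k + 2)                       # decaying injected-noise gain (illustrative)
        g = spsa_gradient(loss, theta, c_k)
        w = [random.gauss(0.0, 1.0) for _ in theta]     # injected Gaussian noise w_k
        theta = [t - a_k * gi + b_k * wi for t, gi, wi in zip(theta, g, w)]
    return theta
```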


Global Optimization via Annealing of Stochastic Approximation (cont’d)

• Careful selection of a_k and b_k required to achieve global convergence

• Stochastic rate of convergence is slow:

    θ̂_k − θ* = O(1/√(log k))       when a_k = a/(k+1)^α, α < 1
    θ̂_k − θ* = O(1/√(log log k))   when a_k = a/(k+1)

• Above slow rates are price to be paid for global convergence

• SPSA without injected randomness (i.e., b_k = 0) is global optimizer under certain conditions
  – Much faster convergence rate, O(1/k^{β/2}) with 0 < β ≤ 2/3
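To get a feel for how slow these rates are, a quick numeric comparison of the reconstructed orders (constants ignored, so only the relative decay matters):

```python
import math

# Compare the decay of the three convergence-rate orders at a few iteration counts.
for k in (10**3, 10**4, 10**5, 10**6):
    r_log = 1.0 / math.sqrt(math.log(k))                 # injected noise, a_k = a/(k+1)**alpha
    r_loglog = 1.0 / math.sqrt(math.log(math.log(k)))    # injected noise, a_k = a/(k+1)
    r_spsa = 1.0 / k ** (1.0 / 3.0)                      # no injected noise (b_k = 0), beta = 2/3
    print(f"k={k:>8}: 1/sqrt(log k)={r_log:.3f}, "
          f"1/sqrt(loglog k)={r_loglog:.3f}, 1/k^(1/3)={r_spsa:.5f}")
```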


Ratio of Asymptotic Estimation Errors with and without Injected Randomness (b_k > 0 and b_k = 0, resp.)

[Figure: plot of the ratio of asymptotic estimation errors with injected randomness (b_k > 0) to errors without it (b_k = 0), versus iterations k = 10^3 to 10^6 (vertical scale 10 to 70); the annotation indicates the ratio grows as constant × k^{1/3}/√(log log k).]