
Benchmarking of quasi-Newton methods
Igor Sokolov
Optimization Class Project. MIPT

Introduction

The most well-known minimization technique for unconstrained problems is Newton's method. In each iteration, the step update is

$x_{k+1} = x_k - (\nabla^2 f_k)^{-1} \nabla f_k.$

However, the inverse of the Hessian has to be computed in every iteration, which takes $O(n^3)$ operations. Moreover, in some applications the second derivatives may be unavailable. One fix to the problem is to use a finite-difference approximation to the Hessian.
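Below is a minimal numpy sketch of one Newton step (the toy quadratic and all names are illustrative, not part of the poster). It solves the Newton linear system instead of forming the inverse explicitly, which is numerically preferable but still costs $O(n^3)$:

```python
import numpy as np

def newton_step(x, grad, hess):
    """One Newton step: solve (grad^2 f_k) p = -grad f_k for p.

    Solving the n-by-n linear system avoids an explicit inverse
    but still costs O(n^3), motivating quasi-Newton approximations.
    """
    p = np.linalg.solve(hess(x), -grad(x))
    return x + p

# Hypothetical toy problem: f(x) = 0.5 x^T A x - b^T x,
# so grad f = A x - b and hess f = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x1 = newton_step(np.zeros(2), lambda x: A @ x - b, lambda x: A)
```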

We consider solving the nonlinear unconstrained minimization problem

$\min f(x), \quad x \in \mathbb{R}^n$

Consider the following quadratic model of the objective function:

$m_k(p) = f_k + \nabla f_k^T p + \frac{1}{2} p^T B_k p$, where $B_k = B_k^T$, $B_k \succ 0$ is an $n \times n$ matrix.

The minimizer $p_k$ of this convex quadratic model, $p_k = -B_k^{-1} \nabla f_k$, is used as the search direction, and the new iterate is

$x_{k+1} = x_k + \alpha p_k$; let $s_k = \alpha p_k$.

The general structure of a quasi-Newton method can be summarized as follows:

• Given $x_0$ and $B_0$ (or $H_0$); set $k \leftarrow 0$.

• For $k = 0, 1, 2, \ldots$

  Evaluate the gradient $g_k$.

  Calculate $s_k$ by a line-search or trust-region method.

  $x_{k+1} \leftarrow x_k + s_k$

  $y_k \leftarrow g_{k+1} - g_k$

  Update $B_{k+1}$ or $H_{k+1}$ according to the quasi-Newton formulas.

• End (for)

The basic requirement in each iteration is the secant equation, $B_{k+1} s_k = y_k$ (or $H_{k+1} y_k = s_k$). A minimal Python sketch of this generic loop is given below.
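The sketch below uses the inverse-Hessian ($H$) form and substitutes a simple Armijo backtracking helper for a full strong-Wolfe search; all names are illustrative, and the poster's actual implementation is in [3]:

```python
import numpy as np

def backtracking(f, grad, x, p, c1=1e-4, shrink=0.5):
    """Armijo backtracking: a crude stand-in for a proper Wolfe search."""
    alpha, fx, gtp = 1.0, f(x), grad(x) @ p
    while f(x + alpha * p) > fx + c1 * alpha * gtp:
        alpha *= shrink
    return alpha

def quasi_newton(f, grad, x0, update, tol=1e-8, max_iter=500):
    """Generic quasi-Newton loop in the inverse-Hessian (H) form.

    `update(H, s, y)` is any rule (BFGS, DFP, PSB, SR1) returning
    H_{k+1} that satisfies the secant equation H_{k+1} y_k = s_k.
    """
    x = np.asarray(x0, dtype=float)
    H = np.eye(x.size)                 # H_0 = I
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                     # search direction p_k = -H_k g_k
        s = backtracking(f, grad, x, p) * p   # s_k = alpha * p_k
        x_new = x + s
        g_new = grad(x_new)
        H = update(H, s, g_new - g)    # y_k = g_{k+1} - g_k
        x, g = x_new, g_new
    return x
```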

Quasi-Newton Formulas for Optimization

BFGS

$\min \|H - H_k\|$  s.t. $H = H^T$, $H y_k = s_k$

$H_{k+1} = (I - \rho s_k y_k^T) H_k (I - \rho y_k s_k^T) + \rho s_k s_k^T$, where $\rho = \frac{1}{y_k^T s_k}$

$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$

DFP

$\min \|B - B_k\|$  s.t. $B = B^T$, $B s_k = y_k$

$B_{k+1} = (I - \gamma y_k s_k^T) B_k (I - \gamma s_k y_k^T) + \gamma y_k y_k^T$, where $\gamma = \frac{1}{y_k^T s_k}$

$H_{k+1} = H_k - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k} + \frac{s_k s_k^T}{y_k^T s_k}$

PSB

$\min \|B - B_k\|$  s.t. $(B - B_k) = (B - B_k)^T$, $B s_k = y_k$

$B_{k+1} = B_k + \frac{(y_k - B_k s_k) s_k^T + s_k (y_k - B_k s_k)^T}{s_k^T s_k} - \frac{s_k^T (y_k - B_k s_k)}{(s_k^T s_k)^2} s_k s_k^T$

$H_{k+1} = H_k + \frac{(s_k - H_k y_k) y_k^T + y_k (s_k - H_k y_k)^T}{y_k^T y_k} - \frac{y_k^T (s_k - H_k y_k)}{(y_k^T y_k)^2} y_k y_k^T$

SR1

$B_{k+1} = B_k + \sigma \nu \nu^T$  s.t. $B_{k+1} s_k = y_k$, which gives

$B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}, \quad H_{k+1} = H_k + \frac{(s_k - H_k y_k)(s_k - H_k y_k)^T}{(s_k - H_k y_k)^T y_k}$
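These updates translate directly into numpy; the following is a minimal sketch (function names are illustrative, and the SR1 skip rule is the standard safeguard from [2], not something stated on the poster). The $H$-form functions plug straight into the generic loop sketched above:

```python
import numpy as np

def bfgs_H(H, s, y):
    """BFGS update of the inverse-Hessian approximation H_k."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

def dfp_H(H, s, y):
    """DFP update of H_k."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)

def psb_B(B, s, y):
    """PSB update of the Hessian approximation B_k."""
    r = y - B @ s                       # secant residual y_k - B_k s_k
    ss = s @ s
    return (B
            + (np.outer(r, s) + np.outer(s, r)) / ss
            - ((s @ r) / ss**2) * np.outer(s, s))

def sr1_B(B, s, y):
    """SR1 update of B_k, with the standard skip safeguard from [2]."""
    r = y - B @ s
    denom = r @ s
    if abs(denom) < 1e-8 * np.linalg.norm(r) * np.linalg.norm(s):
        return B                        # denominator too small: skip update
    return B + np.outer(r, r) / denom
```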

Method: advantages and disadvantages

BFGS
Advantages:
• $H_{k+1} \succ 0$ whenever $H_k \succ 0$; hence if $H_0 \succ 0$, every iterate stays positive definite
• self-correcting property if the Wolfe conditions are enforced
• superlinear convergence
Disadvantages:
• when $y_k^T s_k \approx 0$, the formula produces bad results
• sensitive to round-off error
• sensitive to an inaccurate line search
• can get stuck at a saddle point for nonlinear problems

DFP
Disadvantages:
• can be highly inefficient at correcting large eigenvalues of the matrices
• sensitive to round-off error
• sensitive to an inaccurate line search
• can get stuck at a saddle point for nonlinear problems

PSB
Advantages:
• superlinear convergence

SR1
Advantages:
• satisfies the secant equation $B_{k+1} s_k = y_k$ even when the curvature condition $s_k^T y_k > 0$ is not satisfied (positive definiteness of $B_{k+1}$ is not guaranteed)
• very good approximations to the Hessian matrices, often better than BFGS

Line Search vs. Trust Region

• Line search (strong Wolfe conditions; a sketch of checking them follows this list):

$f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k$

$|\nabla f(x_k + \alpha_k p_k)^T p_k| \le c_2 |\nabla f_k^T p_k|$

• Trust region: both the direction and the step size are found by solving

$\min_{p \in \mathbb{R}^n} m_k(p) = f_k + \nabla f_k^T p + \frac{1}{2} p^T B_k p$  s.t. $\|p\| \le \Delta_k$
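A minimal Python sketch of verifying the strong Wolfe conditions for a trial step (the constants $c_1$, $c_2$ are typical textbook choices, not values reported on the poster):

```python
import numpy as np

def satisfies_strong_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Check both strong Wolfe conditions for a trial step length alpha.

    A line search would shrink or expand alpha until this returns True;
    0 < c1 < c2 < 1 is required.
    """
    g0p = grad(x) @ p                                # grad f_k^T p_k
    x_new = x + alpha * p
    decrease = f(x_new) <= f(x) + c1 * alpha * g0p   # sufficient decrease
    curvature = abs(grad(x_new) @ p) <= c2 * abs(g0p)
    return decrease and curvature
```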

Numerical Experiment

• All quasi-Newton methods (BFGS, DFP, PSB, SR1) with two step strategies (line search, trust region) were implemented in Python: 8 algorithms overall.

• 165 various $N$-dimensional ($N \ge 2$) strong benchmark problems were used.

• For each algorithm, every problem was launched from a random point of the domain 100 times and the results were averaged; a sketch of this protocol follows this list.
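The restart protocol might look like the following sketch, assuming a hypothetical problem interface exposing a random start, the objective, its gradient, and the known optimal value (the actual benchmark harness in [3] may differ):

```python
import numpy as np

def success_rate(solver, problem, n_starts=100, tol=1e-4, seed=0):
    """Fraction of random restarts on which the global minimum is reached.

    `problem` is a hypothetical interface with `sample_start(rng)`,
    `f`, `grad`, and the known optimum `f_min`.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_starts):
        x0 = problem.sample_start(rng)
        x = solver(problem.f, problem.grad, x0)
        if problem.f(x) <= problem.f_min + tol:
            hits += 1
    return hits / n_starts
```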

Examples of benchmark problems

[Figure: 3D surface plots of two example benchmark problems over $x_1, x_2 \in [-10, 10]$: the NewFunction02 test function and the CrossInTray test function.]

Results

[Figure: "Distribution of successfully solved problems (line search)" and "Distribution of successfully solved problems (trust region)": histograms of the number of problems vs. the proportion of successful launches for BFGS, DFP, PSB, and SR1, with the corresponding _LS and _TR suffixes.]

Proportion of problems where the global minimizer was successfully found (a) in more than half of all launches and (b) in more than 99% of launches, together with (c) the area under the performance-profile curve:

Strategy | BFGS (a / b / c)   | DFP (a / b / c)    | PSB (a / b / c)    | SR1 (a / b / c)    | Total (a / b)
LS       | 0.21 / 0.12 / 2.58 | 0.16 / 0.08 / 2.54 | 0.09 / 0.05 / 2.66 | 0.09 / 0.05 / 2.97 | 0.07 / 0.04
TR       | 0.18 / 0.12 / 2.93 | 0.16 / 0.10 / 2.91 | 0.19 / 0.14 / 2.93 | 0.11 / 0.06 / 2.83 | 0.10 / 0.05

Performance evaluation: $n_s$ is the number of solvers, $n_p$ the number of problems, and $t_{s,p}$ the CPU time of solver $s$ on problem $p$; $r_{s,p} = \frac{t_{s,p}}{\min\{t_{s,p} : s \in S\}}$ is the performance ratio.

$\rho_s(\tau) = \frac{1}{n_p} \operatorname{size}\{p : 1 \le p \le n_p,\ \log(r_{s,p}) \le \tau\}$ defines, for solver $s$, the probability that the performance ratio $r_{s,p}$ is within a factor $\tau$ of the best possible ratio.
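A minimal numpy sketch of this computation, assuming natural-log scaling to match the $\tau \in [0, 3]$ axis of the plots below (the time matrix in the example is hypothetical):

```python
import numpy as np

def performance_profile(T, taus):
    """Dolan-More performance profile from a solver-by-problem time matrix.

    T[s, p] is the CPU time of solver s on problem p (np.inf on failure).
    Returns rho[s, i] = fraction of problems with log(r_{s,p}) <= taus[i].
    """
    best = T.min(axis=0)               # best time per problem
    logR = np.log(T / best)            # log performance ratios
    return np.array([[np.mean(logR[s] <= t) for t in taus]
                     for s in range(T.shape[0])])

# Hypothetical example: 2 solvers on 3 problems (times in seconds).
T = np.array([[1.0, 2.0, 4.0],
              [2.0, 1.0, np.inf]])     # solver 1 failed on problem 2
rho = performance_profile(T, np.linspace(0.0, 3.0, 7))
```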

[Figure: performance profiles $\rho_s(\tau)$ for CPU time, $\tau \in [0, 3]$: "Performance profile, line search" with BFGS_LS, DFP_LS, PSB_LS, SR1_LS, and "Performance profile, trust region" with BFGS_TR, DFP_TR, PSB_TR, SR1_TR.]

Conclusions and Further Work

• For all methods except BFGS, the trust-region strategy successfully solved more problems than line search. Its area-under-profile metric also showed better results.

• Though SR1 solved the fewest problems, it found solutions faster than the other methods with line search.

• As further work, one may add L-BFGS-B to the comparison.

The source code of this work is available at [3].

Acknowledgements

The author of the project expresses gratitude to Daniil Merkulov for his lectures on optimization and his support in this work.

[1] Ding, Yang, Enkeleida Lushi, and Qingguo Li. "Investigation of quasi-Newton methods for unconstrained optimization." Simon Fraser University, Canada (2004).

[2] Nocedal, Jorge, and Stephen J. Wright. Numerical Optimization. Springer, 1999.

[3] https://github.com/IgorSokoloff/benchmarking_of_quasi_newton_methods