
Optimization 2: Numerical Optimization

CS5240 Theoretical Foundations in Multimedia

Leow Wee Kheng

Department of Computer Science, School of Computing

National University of Singapore


Last Time...

Newton’s method minimizes f(x) by iteratively updating:

    x_{k+1} = x_k − H^{-1} ∇f(x_k).   (1)

Different ways of estimating H−1 give different methods.

Gradient descent updates variables by

    x_{k+1} = x_k − α ∇f(x_k).   (2)

Gradient descent can be slow. To increase the convergence rate, find a good step length α using line search.


Line Search

Gradient descent minimizes f(x) by updating xk along −∇f(xk):

    x_{k+1} = x_k − α ∇f(x_k).   (3)

In general, other descent directions p_k are possible, as long as ∇f(x_k)^⊤ p_k < 0. Then, update x_k according to

    x_{k+1} = x_k + α p_k.   (4)

Ideally, we want the α > 0 that minimizes f(x_k + α p_k), but computing it exactly is too expensive.

In practice, find an α > 0 that reduces f(x) sufficiently (and is not too small). The simple condition f(x_k + α p_k) < f(x_k) may not be sufficient.


Line Search

A popular sufficient-decrease condition for line search is the Armijo condition:

    f(x_k + α p_k) ≤ f(x_k) + c_1 α ∇f(x_k)^⊤ p_k,   (5)

where c_1 > 0 is usually quite small, say c_1 = 10^{-4}.

The Armijo condition may accept very small α. So, add a second (curvature) condition to rule out steps that are too small:

    ∇f(x_k + α p_k)^⊤ p_k ≥ c_2 ∇f(x_k)^⊤ p_k,   (6)

with 0 < c_1 < c_2 < 1, typically c_2 = 0.9.

These two conditions together are called Wolfe conditions.

Known result: gradient descent converges to a local minimum if each step length α satisfies the Wolfe conditions.
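For concreteness, here is a minimal NumPy sketch (not from the slides) that tests whether a given α satisfies conditions (5) and (6); the function and parameter names are illustrative.

    import numpy as np

    def satisfies_wolfe(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
        """Check the Wolfe conditions (5) and (6) for step length alpha."""
        fx = f(x)
        slope = grad_f(x) @ p                          # directional derivative; < 0 for a descent direction
        x_new = x + alpha * p
        armijo = f(x_new) <= fx + c1 * alpha * slope   # sufficient decrease, Eq. (5)
        curvature = grad_f(x_new) @ p >= c2 * slope    # curvature condition, Eq. (6)
        return armijo and curvature

    # Example: f(x) = x^T x, steepest descent direction from x = (1, 1)
    f = lambda x: x @ x
    grad_f = lambda x: 2.0 * x
    x0 = np.array([1.0, 1.0])
    p0 = -grad_f(x0)
    print(satisfies_wolfe(f, grad_f, x0, p0, alpha=0.25))   # True for this quadratic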


Line Search

Another way is to bound α from below, giving the Goldstein conditions:

    f(x_k) + (1 − c) α ∇f(x_k)^⊤ p_k ≤ f(x_k + α p_k) ≤ f(x_k) + c α ∇f(x_k)^⊤ p_k.   (7)

Another simple way is to start with large α and reduce it iteratively:

Backtracking Line Search

1. Choose a large α > 0 and 0 < ρ < 1.
2. repeat
3.     α ← ρα
4. until the Armijo condition is met.

In this case, the Armijo condition alone is enough.
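A minimal Python sketch of backtracking line search under the Armijo condition (the names and default values ρ = 0.5, c_1 = 10^{-4} are illustrative; unlike the pseudocode above, it shrinks α only when the condition fails):

    import numpy as np

    def backtracking_line_search(f, grad_f, x, p, alpha0=1.0, rho=0.5, c1=1e-4):
        """Shrink alpha until the Armijo condition (5) holds."""
        alpha = alpha0
        fx = f(x)
        slope = grad_f(x) @ p                  # negative for a descent direction
        while f(x + alpha * p) > fx + c1 * alpha * slope:
            alpha *= rho                       # alpha <- rho * alpha
        return alpha

    # Example: one gradient descent step on f(x) = x1^2 + 10 x2^2
    f = lambda x: x[0]**2 + 10.0 * x[1]**2
    grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
    x = np.array([1.0, 1.0])
    p = -grad_f(x)
    alpha = backtracking_line_search(f, grad_f, x, p)
    x_next = x + alpha * p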


Gauss-Newton Method

(Portraits: Carl Friedrich Gauss and Sir Isaac Newton.)


Gauss-Newton Method

The Gauss-Newton method estimates the Hessian H using only first-order derivatives.

Express E(a) as the sum of squares of the residuals r_i(a),

    E(a) = Σ_{i=1}^n r_i(a)²,   (8)

where r_i is the error, or residual, of each data point x_i:

    r_i(a) = v_i − f(x_i; a),   (9)

and

    r = [ r_1 ··· r_n ]^⊤.   (10)

Then,

    ∂E/∂a_j = 2 Σ_{i=1}^n r_i ∂r_i/∂a_j,   (11)

    ∂²E/(∂a_j ∂a_l) = 2 Σ_{i=1}^n [ (∂r_i/∂a_j)(∂r_i/∂a_l) + r_i ∂²r_i/(∂a_j ∂a_l) ]
                    ≈ 2 Σ_{i=1}^n (∂r_i/∂a_j)(∂r_i/∂a_l).   (12)


Gauss-Newton Method

The derivative ∂r_i/∂a_j is the (i, j)-th element of the Jacobian J_r of r:

    J_r = [ ∂r_1/∂a_1  ···  ∂r_1/∂a_m
               ⋮        ⋱        ⋮
            ∂r_n/∂a_1  ···  ∂r_n/∂a_m ].   (13)

The gradient and Hessian are related to the Jacobian by

    ∇E = 2 J_r^⊤ r,   (14)

    H ≈ 2 J_r^⊤ J_r.   (15)

So, the update equation becomes

    a_{k+1} = a_k − (J_r^⊤ J_r)^{-1} J_r^⊤ r.   (16)
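As a rough illustration of update (16), here is a minimal NumPy sketch; residual_fn and jacobian_fn are assumed to be user-supplied functions returning r(a) and J_r(a), and the exponential model in the usage example is made up:

    import numpy as np

    def gauss_newton(residual_fn, jacobian_fn, a0, max_iter=50, tol=1e-6):
        """Minimize E(a) = sum_i r_i(a)^2 using the update (16)."""
        a = np.asarray(a0, dtype=float)
        for _ in range(max_iter):
            r = residual_fn(a)                 # r(a), shape (n,)
            Jr = jacobian_fn(a)                # J_r(a), shape (n, m)
            # Solve (Jr^T Jr) delta = Jr^T r instead of forming the inverse explicitly.
            delta = np.linalg.solve(Jr.T @ Jr, Jr.T @ r)
            a = a - delta                      # a_{k+1} = a_k - (Jr^T Jr)^{-1} Jr^T r
            if np.linalg.norm(delta) < tol:
                break
        return a

    # Example: fit v ~ f(x; a) = a0 * exp(a1 * x) to data (x_i, v_i); r_i = v_i - f(x_i; a).
    x_data = np.array([0.0, 1.0, 2.0, 3.0])
    v_data = np.array([2.0, 2.7, 3.6, 4.9])
    def residual_fn(a):
        return v_data - a[0] * np.exp(a[1] * x_data)
    def jacobian_fn(a):
        # dr_i/da0 = -exp(a1 x_i), dr_i/da1 = -a0 x_i exp(a1 x_i)
        e = np.exp(a[1] * x_data)
        return np.column_stack([-e, -a[0] * x_data * e])
    a_hat = gauss_newton(residual_fn, jacobian_fn, a0=[1.0, 0.1])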


Gauss-Newton Method

We can also use the Jacobian Jf of f(xi):

    J_f = [ ∂f(x_1)/∂a_1  ···  ∂f(x_1)/∂a_m
                 ⋮         ⋱         ⋮
            ∂f(x_n)/∂a_1  ···  ∂f(x_n)/∂a_m ].   (17)

Then, J_f = −J_r, and the update equation becomes

    a_{k+1} = a_k + (J_f^⊤ J_f)^{-1} J_f^⊤ r.   (18)


Gauss-Newton Method

Compare to linear fitting:

    linear fitting    a = (D^⊤ D)^{-1} D^⊤ v
    Gauss-Newton      ∆a = (J_f^⊤ J_f)^{-1} J_f^⊤ r

(J_f^⊤ J_f)^{-1} J_f^⊤ is analogous to the pseudo-inverse.

Compare to gradient descent:

    gradient descent    ∆a = −α ∇E
    Gauss-Newton        ∆a = −(1/2) (J_r^⊤ J_r)^{-1} ∇E

Gauss-Newton updates a along the negative gradient, scaled by (1/2)(J_r^⊤ J_r)^{-1}, i.e., at a variable rate.

The Gauss-Newton method may converge slowly or diverge if

◮ the initial guess a^(0) is far from the minimum, or

◮ the matrix J^⊤ J is ill-conditioned.


Condition Number

The condition number of a (real) matrix D is given by the ratio of its singular values:

    κ(D) = σ_max(D) / σ_min(D).   (19)

If D is normal, i.e., D^⊤ D = D D^⊤, then its condition number is also the ratio of its largest to smallest eigenvalue magnitude:

    κ(D) = max_i |λ_i(D)| / min_i |λ_i(D)|.   (20)

A large condition number means the singular values of D are badly unbalanced, i.e., D is close to singular.


Condition Number

For the linear equation D a = v, the condition number indicates how much the solution a changes when v changes.

◮ Small condition number:
    ◮ A small change in v gives a small change in a: stable.
    ◮ A small error in v gives a small error in a.
    ◮ D is well-conditioned.

◮ Large condition number:
    ◮ A small change in v gives a large change in a: unstable.
    ◮ A small error in v gives a large error in a.
    ◮ D is ill-conditioned.
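For example, NumPy's np.linalg.cond computes κ(D); the small nearly-singular matrix below is an illustrative check of the instability:

    import numpy as np

    D = np.array([[1.0, 1.0],
                  [1.0, 1.0001]])             # nearly singular, hence ill-conditioned
    print(np.linalg.cond(D))                  # sigma_max / sigma_min, about 4e4

    # A tiny perturbation of v causes a large change in the solution a of D a = v.
    v = np.array([2.0, 2.0001])
    a1 = np.linalg.solve(D, v)
    a2 = np.linalg.solve(D, v + np.array([0.0, 1e-4]))
    print(a1, a2)                             # the two solutions differ greatly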


Practical Issues

The update equation, with J = J_f,

    a_{k+1} = a_k + (J^⊤ J)^{-1} J^⊤ r,   (21)

requires computing a matrix inverse, which can be expensive.

Can use singular value decomposition (SVD) to help. The SVD of J is

    J = U Σ V^⊤.   (22)

Substituting it into the update equation yields (Homework)

    a_{k+1} = a_k + V Σ^{-1} U^⊤ r.   (23)
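A small NumPy sketch of computing the step in (23) from the thin SVD instead of inverting J^⊤J (illustrative; it assumes J has full column rank so that Σ^{-1} exists):

    import numpy as np

    def gauss_newton_step_svd(J, r):
        """Return (J^T J)^{-1} J^T r computed as V Sigma^{-1} U^T r, Eq. (23)."""
        U, s, Vt = np.linalg.svd(J, full_matrices=False)   # thin SVD: J = U diag(s) V^T
        return Vt.T @ ((U.T @ r) / s)                      # V Sigma^{-1} U^T r

    # Usage inside the iteration (with J = J_f): a = a + gauss_newton_step_svd(Jf, r)
    J = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
    r = np.array([0.1, -0.2, 0.05])
    print(gauss_newton_step_svd(J, r))
    print(np.linalg.pinv(J) @ r)                           # same result via the pseudo-inverse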


Practical Issues

Convergence criteria: same as gradient descent

    |E_{k+1} − E_k| / E_k < ε,   e.g., ε = 0.0001,

    |a_{j,k+1} − a_{j,k}| / |a_{j,k}| < ε,   for each component j.

If divergence occurs, can introduce a step length α:

    a_{k+1} = a_k + α (J^⊤ J)^{-1} J^⊤ r,   (24)

where 0 < α < 1.

Is there an alternative way to control convergence? Yes: the Levenberg-Marquardt method.


Levenberg-Marquardt Method

The Gauss-Newton method updates a according to

    (J^⊤ J) ∆a = J^⊤ r.   (25)

Levenberg modifies the update equation to

    (J^⊤ J + λ I) ∆a = J^⊤ r.   (26)

◮ λ is adjusted at each iteration.

◮ If the reduction of the error E is large, can use a small λ: behaves more like Gauss-Newton.

◮ If the reduction of the error E is small, can use a large λ: behaves more like gradient descent.


Levenberg-Marquardt Method

If λ is very large, (J^⊤ J + λ I) ≈ λ I, i.e., the update is essentially a (small) gradient descent step.

Marquardt scales each component of the gradient:

    (J^⊤ J + λ diag(J^⊤ J)) ∆a = J^⊤ r,   (27)

where diag(J^⊤ J) is the diagonal matrix of the diagonal elements of J^⊤ J.

The (i, i)-th component of diag(J^⊤ J) is

    Σ_{l=1}^n ( ∂f(x_l)/∂a_i )².   (28)

◮ If the (i, i)-th entry of diag(J^⊤ J) is small, a_i is updated more: faster convergence along the i-th direction.

◮ Then, λ does not have to be too large.
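A minimal NumPy sketch of a Levenberg-Marquardt iteration using Marquardt's scaling (27); the λ adjustment factors (×0.1 on success, ×10 on failure) are a common but illustrative choice, and in practice scipy.optimize.least_squares provides a robust implementation:

    import numpy as np

    def levenberg_marquardt(residual_fn, jacobian_fn, a0, lam=1e-3, max_iter=100, tol=1e-8):
        """Minimize E(a) = sum_i r_i(a)^2 with damping (J^T J + lam * diag(J^T J)) da = -J^T r."""
        a = np.asarray(a0, dtype=float)
        E = np.sum(residual_fn(a)**2)
        for _ in range(max_iter):
            r = residual_fn(a)
            J = jacobian_fn(a)                 # Jacobian J_r of the residuals (J_r = -J_f)
            JtJ = J.T @ J
            delta = np.linalg.solve(JtJ + lam * np.diag(np.diag(JtJ)), -J.T @ r)
            E_new = np.sum(residual_fn(a + delta)**2)
            if E_new < E:                      # error reduced: accept step, behave more like Gauss-Newton
                a, E = a + delta, E_new
                lam *= 0.1
            else:                              # error increased: reject step, behave more like gradient descent
                lam *= 10.0
            if np.linalg.norm(delta) < tol:
                break
        return a

    # Usage (with residual_fn, jacobian_fn as in the Gauss-Newton sketch earlier):
    # a_hat = levenberg_marquardt(residual_fn, jacobian_fn, [1.0, 0.1])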


Quasi-Newton

Quasi-Newton methods also minimize f(x) by iteratively updating x:

    x_{k+1} = x_k + α p_k,   (29)

where the descent direction p_k is

    p_k = −H_k^{-1} ∇f(x_k).   (30)

There are many ways to estimate H_k^{-1}.

The most effective is the method by Broyden, Fletcher, Goldfarb and Shanno (BFGS).


BFGS Method

Let ∆x_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k).

Then, estimate H_k or H_k^{-1} by

    H_{k+1} = H_k − (H_k ∆x_k ∆x_k^⊤ H_k) / (∆x_k^⊤ H_k ∆x_k) + (y_k y_k^⊤) / (y_k^⊤ ∆x_k),   (31)

    H_{k+1}^{-1} = ( I − (∆x_k y_k^⊤) / (y_k^⊤ ∆x_k) ) H_k^{-1} ( I − (y_k ∆x_k^⊤) / (y_k^⊤ ∆x_k) ) + (∆x_k ∆x_k^⊤) / (y_k^⊤ ∆x_k).   (32)

(For derivation, please refer to [1].)
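A direct NumPy transcription of the inverse update (32); the function name is illustrative, and the update assumes y_k^⊤ ∆x_k > 0 (guaranteed when α satisfies the Wolfe conditions):

    import numpy as np

    def bfgs_inverse_update(H_inv, dx, y):
        """Update H_{k+1}^{-1} from H_k^{-1} using Eq. (32)."""
        rho = 1.0 / (y @ dx)                   # 1 / (y_k^T dx_k); requires y_k^T dx_k > 0
        I = np.eye(len(dx))
        A = I - rho * np.outer(dx, y)          # I - (dx y^T) / (y^T dx)
        B = I - rho * np.outer(y, dx)          # I - (y dx^T) / (y^T dx)
        return A @ H_inv @ B + rho * np.outer(dx, dx)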


BFGS Algorithm

1. Set α and initialize x_0 and H_0^{-1}.
2. for k = 0, 1, . . . until convergence do
3.     Compute the search direction p_k = −H_k^{-1} ∇f(x_k).
4.     Perform line search to find α that satisfies the Wolfe conditions.
5.     Compute ∆x_k = α p_k, and set x_{k+1} = x_k + ∆x_k.
6.     Compute y_k = ∇f(x_{k+1}) − ∇f(x_k).
7.     Compute H_{k+1}^{-1} by Eq. 32.

◮ There is no standard way to initialize H_0^{-1}. Can try H_0^{-1} = I.
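In practice, one usually calls a library implementation instead of coding the update; a minimal usage sketch with SciPy's BFGS (the Rosenbrock function is just an example problem):

    import numpy as np
    from scipy.optimize import minimize

    def f(x):                                   # Rosenbrock function
        return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

    def grad_f(x):
        return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                          200.0 * (x[1] - x[0]**2)])

    res = minimize(f, x0=np.array([-1.2, 1.0]), jac=grad_f, method='BFGS')
    print(res.x)                                # approximately [1, 1]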


Approximation of Gradient

All previous methods require the gradient ∇f (or ∇E). In many complex problems, it is very difficult to derive ∇f.

Several methods to overcome this problem:

◮ Use symbolic differentiation to derive the math expression of ∇f. Provided in Matlab, Maple, etc.

◮ Approximate gradient numerically.

◮ Optimize without gradient.


Forward Difference

From the Taylor series expansion,

    f(x + ∆x) = f(x) + ∇f(x)^⊤ ∆x + ···   (33)

Denote the j-th component of ∆x as ε e_j, where e_j is the unit vector of the j-th dimension. Then, the j-th component of ∇f(x) is

    ∂f(x)/∂x_j ≈ ( f(x + ε e_j) − f(x) ) / ε.   (34)

Note that x + ε e_j changes only the j-th component:

    x + ε e_j = [ x_1 ··· x_{j−1} x_j x_{j+1} ··· x_m ]^⊤ + [ 0 ··· 0 ε 0 ··· 0 ]^⊤.   (35)

Then,

    ∇f(x) ≈ [ ∂f(x)/∂x_1 ··· ∂f(x)/∂x_m ]^⊤.   (36)
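A minimal NumPy sketch of the forward-difference approximation (34); the helper name and ε value are illustrative:

    import numpy as np

    def forward_difference_gradient(f, x, eps=1e-6):
        """Approximate grad f(x) using forward differences, Eq. (34)."""
        x = np.asarray(x, dtype=float)
        g = np.zeros_like(x)
        fx = f(x)
        for j in range(x.size):
            x_step = x.copy()
            x_step[j] += eps                   # x + eps * e_j: perturb only the j-th component
            g[j] = (f(x_step) - fx) / eps
        return g

    f = lambda x: x[0]**2 + 3.0 * x[0] * x[1]
    print(forward_difference_gradient(f, [1.0, 2.0]))   # close to the exact gradient [8, 3]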


Central Difference

The forward difference method has a forward bias: it evaluates only at x + ε e_j. The error from this bias can accumulate quickly.

To be less biased, use central difference.

    f(x + ε e_j) ≈ f(x) + ε ∂f(x)/∂x_j,

    f(x − ε e_j) ≈ f(x) − ε ∂f(x)/∂x_j.

So,

    ∂f(x)/∂x_j ≈ ( f(x + ε e_j) − f(x − ε e_j) ) / (2ε).   (37)
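The corresponding central-difference sketch of (37), which is usually more accurate than forward differences for the same ε (again, names and ε are illustrative):

    import numpy as np

    def central_difference_gradient(f, x, eps=1e-5):
        """Approximate grad f(x) using central differences, Eq. (37)."""
        x = np.asarray(x, dtype=float)
        g = np.zeros_like(x)
        for j in range(x.size):
            e = np.zeros_like(x)
            e[j] = eps
            g[j] = (f(x + e) - f(x - e)) / (2.0 * eps)
        return g

    f = lambda x: np.sin(x[0]) * x[1]
    print(central_difference_gradient(f, [0.5, 2.0]))   # close to [2 cos(0.5), sin(0.5)]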


Optimization without Gradient

A simple idea is to descend along each coordinate direction iteratively. Let e_1, . . . , e_m denote the unit vectors along coordinates 1, . . . , m.

Coordinate Descent

1. Choose initial x_0.
2. repeat
3.     for each j = 1, . . . , m do
4.         Find x_{k+1} along e_j that minimizes f.
5. until convergence.


Optimization without Gradient

Alternative for Step 4: Find x_{k+1} along e_j using line search to reduce f(x_k) sufficiently.

◮ Strength: Simplicity.

◮ Weakness: Can be slow. It can iterate around the minimum without ever approaching it.
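For concreteness, a minimal sketch of coordinate descent that uses a crude shrinking step along ±e_j in place of an exact one-dimensional minimization (all names and step sizes are illustrative assumptions):

    import numpy as np

    def coordinate_descent(f, x0, step=1.0, rho=0.5, max_sweeps=100, tol=1e-8):
        """Repeatedly minimize f approximately along each coordinate direction e_j."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_sweeps):
            x_old = x.copy()
            for j in range(x.size):
                e = np.zeros_like(x)
                e[j] = 1.0
                alpha = step
                for _ in range(200):           # crude 1-D search along +/- alpha * e_j
                    if alpha <= tol:
                        break
                    if f(x + alpha * e) < f(x):
                        x = x + alpha * e
                    elif f(x - alpha * e) < f(x):
                        x = x - alpha * e
                    else:
                        alpha *= rho           # neither direction helps: shrink the step
            if np.linalg.norm(x - x_old) < tol:
                break
        return x

    f = lambda x: (x[0] - 1.0)**2 + 4.0 * (x[1] + 2.0)**2
    print(coordinate_descent(f, [0.0, 0.0]))   # approaches [1, -2]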


Powell’s Method

Powell's method overcomes this weakness of coordinate descent.

Basic Idea: Choose directions that do not interfere with or undo previous effort.

For details, refer to [2].


Alternating Optimization

Let S ⊆ Rm denote the set of x that satisfy the constraints.

Then, the constrained optimization problem can be written as

    min_{x ∈ S} f(x).   (38)

Alternating optimization (AO) is similar to coordinate descent, except that the variables x_j, j = 1, . . . , m, are split into groups.

Partition x ∈ S into p non-overlapping parts y_l such that

    x = (y_1, . . . , y_l, . . . , y_p).   (39)

Example: x = (x_1, x_2, x_4, x_3, x_7, x_5, x_6, x_8, x_9), with y_1 = (x_1, x_2, x_4), y_2 = (x_3, x_7), y_3 = (x_5, x_6, x_8, x_9).

Denote S_l ⊂ S as the set that contains y_l.


Alternating Optimization

1. Choose initial x^(0) = (y_1^(0), . . . , y_p^(0)).
2. for k = 0, 1, 2, . . . , until convergence do
3.     for l = 1, . . . , p do
4.         Find y_l^(k+1) such that

               y_l^(k+1) = arg min_{y_l ∈ S_l} f(y_1^(k+1), . . . , y_{l−1}^(k+1), y_l, y_{l+1}^(k), . . . , y_p^(k)).

           That is, find y_l^(k+1) while fixing the other variables.

           (Alternative) Find y_l^(k+1) that decreases f(x) sufficiently:

               f(y_1^(k+1), . . . , y_{l−1}^(k+1), y_l^(k), y_{l+1}^(k), . . . , y_p^(k)) −
                   f(y_1^(k+1), . . . , y_{l−1}^(k+1), y_l^(k+1), y_{l+1}^(k), . . . , y_p^(k)) > c > 0.


Alternating Optimization

Example:

    initially       (y_1^(0), y_2^(0), y_3^(0), . . . , y_l^(0), . . . , y_p^(0))
    k = 0, l = 1    (y_1^(1), y_2^(0), y_3^(0), . . . , y_l^(0), . . . , y_p^(0))
    k = 0, l = 2    (y_1^(1), y_2^(1), y_3^(0), . . . , y_l^(0), . . . , y_p^(0))
    k = 0, l = p    (y_1^(1), y_2^(1), y_3^(1), . . . , y_l^(1), . . . , y_p^(1))
    k = 1, l = 1    (y_1^(2), y_2^(1), y_3^(1), . . . , y_l^(1), . . . , y_p^(1))
    ...

Obviously, we want to partition the variables so that

◮ Each sub-problem of finding y_l is relatively easy to solve.

◮ Solving for y_l does not undo previous partial solutions.


Example: k-means clustering

k-means clustering tries to group data points x into k clusters.

(Figure: data points grouped into three clusters with prototypes c_1, c_2, c_3.)

Let c_j denote the prototype of cluster C_j. Then, k-means clustering finds C_j and c_j, j = 1, . . . , k, that minimize the within-cluster difference

    D = Σ_{j=1}^k Σ_{x ∈ C_j} ‖x − c_j‖².   (40)


Alternating Optimization

Suppose the C_j are already known. Then, the c_j that minimizes D is given by setting

    ∂D/∂c_j = −2 Σ_{x ∈ C_j} (x − c_j) = 0.   (41)

That is, c_j is the centroid of C_j:

    c_j = (1/|C_j|) Σ_{x ∈ C_j} x.   (42)

How to find C_j? Use alternating optimization.


k-means clustering

1. Initialize c_j, j = 1, . . . , k.
2. repeat
3.     Assign each x to its nearest cluster C_j (fix c_j, update C_j):

           ‖x − c_j‖ ≤ ‖x − c_l‖, for l ≠ j.

4.     Update each cluster centroid c_j (fix C_j, update c_j):

           c_j = (1/|C_j|) Σ_{x ∈ C_j} x.

5. until convergence.

◮ Step 4 (the centroid update) is (locally) optimal.

◮ Step 3 (the assignment) does not undo Step 4. Why?

◮ So, k-means clustering can converge to a local minimum.
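A minimal NumPy sketch of the k-means loop above, alternating the assignment step and the centroid update; initializing the prototypes by sampling k data points is an illustrative choice:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Alternating optimization of D: fix c_j -> update clusters, fix clusters -> update c_j."""
        rng = np.random.default_rng(seed)
        C = X[rng.choice(len(X), size=k, replace=False)]        # initialize prototypes c_j
        for _ in range(max_iter):
            # Assignment step: assign each x to its nearest prototype.
            labels = np.argmin(((X[:, None, :] - C[None, :, :])**2).sum(axis=2), axis=1)
            # Update step: recompute each prototype as the centroid of its cluster.
            C_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                              for j in range(k)])
            if np.allclose(C_new, C):
                break
            C = C_new
        return C, labels

    X = np.vstack([np.random.randn(50, 2) + [0, 0],
                   np.random.randn(50, 2) + [5, 5]])
    centroids, labels = kmeans(X, k=2)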


Convergence Rates

Suppose the sequence {x_k} converges to x*.

Then, the sequence converges to x* linearly if there exists µ ∈ (0, 1) such that

    lim_{k→∞} |x_{k+1} − x*| / |x_k − x*| = µ.   (43)

◮ The sequence converges sublinearly if µ = 1.

◮ The sequence converges superlinearly if µ = 0.

If the sequence converges sublinearly and

    lim_{k→∞} |x_{k+2} − x_{k+1}| / |x_{k+1} − x_k| = 1,   (44)

then the sequence converges logarithmically.


Convergence Rates

We can distinguish between different rates of superlinear convergence.

The sequence converges with order q, for q ≥ 1, if there exists µ > 0 such that

    lim_{k→∞} |x_{k+1} − x*| / |x_k − x*|^q = µ > 0.   (45)

◮ If q = 1, the sequence converges q-linearly.

◮ If q = 2, the sequence converges quadratically or q-quadratically.

◮ If q = 3, the sequence converges cubically or q-cubically.

◮ etc.
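As an illustration (not from the slides), the order q can be estimated from consecutive errors; for example, Newton's method on f(x) = x² − 2 converges to √2 quadratically:

    import math

    x_star = math.sqrt(2.0)
    x, errs = 3.0, []
    for _ in range(5):
        x = x - (x * x - 2.0) / (2.0 * x)      # Newton step for f(x) = x^2 - 2
        errs.append(abs(x - x_star))

    # q is approximately log(e_{k+1}/e_k) / log(e_k/e_{k-1})
    for k in range(2, 5):
        q = math.log(errs[k] / errs[k - 1]) / math.log(errs[k - 1] / errs[k - 2])
        print(round(q, 2))                     # the estimates approach 2 (quadratic convergence)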


Known Convergence Rates

◮ Gradient descent converges linearly, and often slowly.

◮ Gauss-Newton converges quadratically, if it converges.

◮ Levenberg-Marquardt converges quadratically.

◮ Quasi-Newton converges superlinearly.

◮ Alternating optimization converges q-linearly if f is convex around a local minimizer x*.


Summary

Method                     For                        Uses
---------------------------------------------------------------------------
Gradient descent           optimization               ∇f
Gauss-Newton               nonlinear least squares    ∇E, estimate of H
Levenberg-Marquardt        nonlinear least squares    ∇E, estimate of H
Quasi-Newton               optimization               ∇f, estimate of H
Newton                     optimization               ∇f, H
Coordinate descent         optimization               f
Alternating optimization   optimization               f


Summary

SciPy supports many optimization methods including

◮ linear least squares

◮ Levenberg-Marquardt method

◮ quasi-Newton BFGS method

◮ line search that satisfies Wolfe conditions

◮ etc.

For more details and other methods, refer to [1].
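A few illustrative calls (the model and data here are made up for the example):

    import numpy as np
    from scipy.optimize import least_squares, minimize

    # Nonlinear least squares with Levenberg-Marquardt (method='lm').
    x_data = np.array([0.0, 1.0, 2.0, 3.0])
    v_data = np.array([2.0, 2.7, 3.6, 4.9])
    residual = lambda a: v_data - a[0] * np.exp(a[1] * x_data)
    fit = least_squares(residual, x0=[1.0, 0.1], method='lm')

    # General minimization with quasi-Newton BFGS.
    f = lambda x: (x[0] - 1.0)**2 + (x[1] + 2.0)**2
    res = minimize(f, x0=[0.0, 0.0], method='BFGS')

    # Linear least squares D a = v.
    D = np.column_stack([np.ones_like(x_data), x_data])
    a, *_ = np.linalg.lstsq(D, v_data, rcond=None)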


Probing Questions

◮ Can you apply an algorithm specialized for nonlinear least squares, e.g., Gauss-Newton, to generic optimization?

◮ Gauss-Newton may converge slowly or diverge if the matrix J⊤J isill-conditioned. How to make the matrix well-conditioned?

◮ How can your alternating optimization program detectinterference of previous update by current update?

◮ If your alternating optimization program detects frequentinterference between updates, what can you do about it?

◮ So far, we have looked at minimizing a scalar function f(x). How to minimize a vector function f(x) or multiple scalar functions f_l(x) simultaneously?

◮ How to ensure that your optimization program will alwaysconverge regardless of the test condition?


Homework

1. Describe the essence of Gauss-Newton, Levenberg-Marquardt,quasi-Newton, coordinate descent and alternating optimizationeach in one sentence.

2. With the SVD of J given by J = U Σ V^⊤, show that

       (J^⊤ J)^{-1} J^⊤ = V Σ^{-1} U^⊤.

   Hint: For invertible matrices A and B, (AB)^{-1} = B^{-1} A^{-1}.

3. Q3(c) of AY2015/16 Final Evaluation.

4. Q5 of AY2015/16 Final Evaluation.


References

1. J. Nocedal and S. J. Wright, Numerical Optimization, Springer-Verlag, 1999.

2. W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes: The Art of Scientific Computing, 3rd Ed., Cambridge University Press, 2007. www.nr.com

3. J. C. Bezdek and R. J. Hathaway, Some notes on alternatingoptimization, in Proc. Int. Conf. on Fuzzy Systems, 2002.
