
Optimization 2: Numerical Optimization

CS5240 Theoretical Foundations in Multimedia

Leow Wee Kheng

Department of Computer Science, School of Computing

National University of Singapore


Last Time...

Newton’s method minimizes f(x) by iteratively updating:

    x_{k+1} = x_k − H^{-1} ∇f(x_k).   (1)

Different ways of estimating H−1 give different methods.

Gradient descent updates variables by

    x_{k+1} = x_k − α ∇f(x_k).   (2)

Gradient descent can be slow. To increase the convergence rate, find a good step length α using line search.


Line Search

Gradient descent minimizes f(x) by updating xk along −∇f(xk):

    x_{k+1} = x_k − α ∇f(x_k).   (3)

In general, other descent directions p_k are possible, as long as ∇f(x_k)^⊤ p_k < 0. Then, update x_k according to

    x_{k+1} = x_k + α p_k.   (4)

Ideally, we want the α > 0 that minimizes f(x_k + α p_k), but computing it exactly is too expensive.

In practice, find an α > 0 that reduces f(x) sufficiently (and is not too small). The simple condition f(x_k + α p_k) < f(x_k) may not be sufficient.


Line Search

A popular sufficient-decrease condition for line search is the Armijo condition:

    f(x_k + α p_k) ≤ f(x_k) + c_1 α ∇f(x_k)^⊤ p_k,   (5)

where c_1 > 0 is usually quite small, say c_1 = 10^{-4}.

The Armijo condition may accept very small α. So, add a second (curvature) condition to rule out steps that are too small:

    ∇f(x_k + α p_k)^⊤ p_k ≥ c_2 ∇f(x_k)^⊤ p_k,   (6)

with 0 < c_1 < c_2 < 1, typically c_2 = 0.9.

These two conditions together are called Wolfe conditions.

Known result: gradient descent converges to a local minimum if each step length α satisfies the Wolfe conditions.
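For concreteness, here is a minimal NumPy sketch (not from the slides) that tests whether a given α satisfies conditions (5) and (6); the function and parameter names are illustrative.

    import numpy as np

    def satisfies_wolfe(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
        """Check the Wolfe conditions (5) and (6) for step length alpha."""
        fx = f(x)
        slope = grad_f(x) @ p                          # directional derivative; < 0 for a descent direction
        x_new = x + alpha * p
        armijo = f(x_new) <= fx + c1 * alpha * slope   # sufficient decrease, Eq. (5)
        curvature = grad_f(x_new) @ p >= c2 * slope    # curvature condition, Eq. (6)
        return armijo and curvature

    # Example: f(x) = x^T x, steepest descent direction from x = (1, 1)
    f = lambda x: x @ x
    grad_f = lambda x: 2.0 * x
    x0 = np.array([1.0, 1.0])
    p0 = -grad_f(x0)
    print(satisfies_wolfe(f, grad_f, x0, p0, alpha=0.25))   # True for this quadratic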


Line Search

Another way is to bound α from below, giving the Goldstein conditions:

    f(x_k) + (1 − c) α ∇f(x_k)^⊤ p_k ≤ f(x_k + α p_k) ≤ f(x_k) + c α ∇f(x_k)^⊤ p_k.   (7)

Another simple way is to start with large α and reduce it iteratively:

Backtracking Line Search

1. Choose a large α > 0 and 0 < ρ < 1.
2. repeat
3.     α ← ρα
4. until the Armijo condition is met.

In this case, the Armijo condition alone is enough.
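A minimal Python sketch of backtracking line search under the Armijo condition (the names and default values ρ = 0.5, c_1 = 10^{-4} are illustrative; unlike the pseudocode above, it shrinks α only when the condition fails):

    import numpy as np

    def backtracking_line_search(f, grad_f, x, p, alpha0=1.0, rho=0.5, c1=1e-4):
        """Shrink alpha until the Armijo condition (5) holds."""
        alpha = alpha0
        fx = f(x)
        slope = grad_f(x) @ p                  # negative for a descent direction
        while f(x + alpha * p) > fx + c1 * alpha * slope:
            alpha *= rho                       # alpha <- rho * alpha
        return alpha

    # Example: one gradient descent step on f(x) = x1^2 + 10 x2^2
    f = lambda x: x[0]**2 + 10.0 * x[1]**2
    grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
    x = np.array([1.0, 1.0])
    p = -grad_f(x)
    alpha = backtracking_line_search(f, grad_f, x, p)
    x_next = x + alpha * p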


Gauss-Newton Method

(Portraits: Carl Friedrich Gauss and Sir Isaac Newton.)


Gauss-Newton Method

The Gauss-Newton method estimates the Hessian H using only first-order derivatives.

Express E(a) as the sum of squares of the residuals r_i(a),

    E(a) = Σ_{i=1}^n r_i(a)²,   (8)

where r_i is the error, or residual, of each data point x_i:

    r_i(a) = v_i − f(x_i; a),   (9)

and

    r = [ r_1 ··· r_n ]^⊤.   (10)

Then,

    ∂E/∂a_j = 2 Σ_{i=1}^n r_i ∂r_i/∂a_j,   (11)

    ∂²E/(∂a_j ∂a_l) = 2 Σ_{i=1}^n [ (∂r_i/∂a_j)(∂r_i/∂a_l) + r_i ∂²r_i/(∂a_j ∂a_l) ]
                    ≈ 2 Σ_{i=1}^n (∂r_i/∂a_j)(∂r_i/∂a_l).   (12)


Gauss-Newton Method

The derivative ∂r_i/∂a_j is the (i, j)-th element of the Jacobian J_r of r:

    J_r = [ ∂r_1/∂a_1  ···  ∂r_1/∂a_m
               ⋮        ⋱        ⋮
            ∂r_n/∂a_1  ···  ∂r_n/∂a_m ].   (13)

The gradient and Hessian are related to the Jacobian by

    ∇E = 2 J_r^⊤ r,   (14)

    H ≈ 2 J_r^⊤ J_r.   (15)

So, the update equation becomes

    a_{k+1} = a_k − (J_r^⊤ J_r)^{-1} J_r^⊤ r.   (16)
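As a rough illustration of update (16), here is a minimal NumPy sketch; residual_fn and jacobian_fn are assumed to be user-supplied functions returning r(a) and J_r(a), and the exponential model in the usage example is made up:

    import numpy as np

    def gauss_newton(residual_fn, jacobian_fn, a0, max_iter=50, tol=1e-6):
        """Minimize E(a) = sum_i r_i(a)^2 using the update (16)."""
        a = np.asarray(a0, dtype=float)
        for _ in range(max_iter):
            r = residual_fn(a)                 # r(a), shape (n,)
            Jr = jacobian_fn(a)                # J_r(a), shape (n, m)
            # Solve (Jr^T Jr) delta = Jr^T r instead of forming the inverse explicitly.
            delta = np.linalg.solve(Jr.T @ Jr, Jr.T @ r)
            a = a - delta                      # a_{k+1} = a_k - (Jr^T Jr)^{-1} Jr^T r
            if np.linalg.norm(delta) < tol:
                break
        return a

    # Example: fit v ~ f(x; a) = a0 * exp(a1 * x) to data (x_i, v_i); r_i = v_i - f(x_i; a).
    x_data = np.array([0.0, 1.0, 2.0, 3.0])
    v_data = np.array([2.0, 2.7, 3.6, 4.9])
    def residual_fn(a):
        return v_data - a[0] * np.exp(a[1] * x_data)
    def jacobian_fn(a):
        # dr_i/da0 = -exp(a1 x_i), dr_i/da1 = -a0 x_i exp(a1 x_i)
        e = np.exp(a[1] * x_data)
        return np.column_stack([-e, -a[0] * x_data * e])
    a_hat = gauss_newton(residual_fn, jacobian_fn, a0=[1.0, 0.1])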


Gauss-Newton Method

We can also use the Jacobian Jf of f(xi):

    J_f = [ ∂f(x_1)/∂a_1  ···  ∂f(x_1)/∂a_m
                 ⋮         ⋱         ⋮
            ∂f(x_n)/∂a_1  ···  ∂f(x_n)/∂a_m ].   (17)

Then, J_f = −J_r, and the update equation becomes

    a_{k+1} = a_k + (J_f^⊤ J_f)^{-1} J_f^⊤ r.   (18)


Gauss-Newton Method

Compare to linear fitting:

    linear fitting    a = (D^⊤ D)^{-1} D^⊤ v
    Gauss-Newton      ∆a = (J_f^⊤ J_f)^{-1} J_f^⊤ r

(J_f^⊤ J_f)^{-1} J_f^⊤ is analogous to the pseudo-inverse.

Compare to gradient descent:

    gradient descent    ∆a = −α ∇E
    Gauss-Newton        ∆a = −(1/2) (J_r^⊤ J_r)^{-1} ∇E

Gauss-Newton updates a along the negative gradient, scaled by (1/2)(J_r^⊤ J_r)^{-1}, i.e., at a variable rate.

The Gauss-Newton method may converge slowly or diverge if

◮ the initial guess a^(0) is far from the minimum, or

◮ the matrix J^⊤ J is ill-conditioned.


Condition Number

The condition number of a (real) matrix D is given by the ratio of its singular values:

    κ(D) = σ_max(D) / σ_min(D).   (19)

If D is normal, i.e., D^⊤ D = D D^⊤, then its condition number is also the ratio of its largest to smallest eigenvalue magnitude:

    κ(D) = max_i |λ_i(D)| / min_i |λ_i(D)|.   (20)

A large condition number means the singular values of D are badly unbalanced, i.e., D is close to singular.


Condition Number

For the linear equation D a = v, the condition number indicates how much the solution a changes when v changes.

◮ Small condition number:
    ◮ A small change in v gives a small change in a: stable.
    ◮ A small error in v gives a small error in a.
    ◮ D is well-conditioned.

◮ Large condition number:
    ◮ A small change in v gives a large change in a: unstable.
    ◮ A small error in v gives a large error in a.
    ◮ D is ill-conditioned.
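For example, NumPy's np.linalg.cond computes κ(D); the small nearly-singular matrix below is an illustrative check of the instability:

    import numpy as np

    D = np.array([[1.0, 1.0],
                  [1.0, 1.0001]])             # nearly singular, hence ill-conditioned
    print(np.linalg.cond(D))                  # sigma_max / sigma_min, about 4e4

    # A tiny perturbation of v causes a large change in the solution a of D a = v.
    v = np.array([2.0, 2.0001])
    a1 = np.linalg.solve(D, v)
    a2 = np.linalg.solve(D, v + np.array([0.0, 1e-4]))
    print(a1, a2)                             # the two solutions differ greatly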


Practical Issues

The update equation, with J = J_f,

    a_{k+1} = a_k + (J^⊤ J)^{-1} J^⊤ r,   (21)

requires computing a matrix inverse, which can be expensive.

Can use singular value decomposition (SVD) to help. The SVD of J is

    J = U Σ V^⊤.   (22)

Substituting it into the update equation yields (Homework)

    a_{k+1} = a_k + V Σ^{-1} U^⊤ r.   (23)
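A small NumPy sketch of computing the step in (23) from the thin SVD instead of inverting J^⊤J (illustrative; it assumes J has full column rank so that Σ^{-1} exists):

    import numpy as np

    def gauss_newton_step_svd(J, r):
        """Return (J^T J)^{-1} J^T r computed as V Sigma^{-1} U^T r, Eq. (23)."""
        U, s, Vt = np.linalg.svd(J, full_matrices=False)   # thin SVD: J = U diag(s) V^T
        return Vt.T @ ((U.T @ r) / s)                      # V Sigma^{-1} U^T r

    # Usage inside the iteration (with J = J_f): a = a + gauss_newton_step_svd(Jf, r)
    J = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
    r = np.array([0.1, -0.2, 0.05])
    print(gauss_newton_step_svd(J, r))
    print(np.linalg.pinv(J) @ r)                           # same result via the pseudo-inverse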


Practical Issues

Convergence criteria: same as gradient descent

    |E_{k+1} − E_k| / E_k < ε,   e.g., ε = 0.0001,

    |a_{j,k+1} − a_{j,k}| / |a_{j,k}| < ε,   for each component j.

If divergence occurs, can introduce a step length α:

    a_{k+1} = a_k + α (J^⊤ J)^{-1} J^⊤ r,   (24)

where 0 < α < 1.

Is there an alternative way to control convergence? Yes: the Levenberg-Marquardt method.


Levenberg-Marquardt Method

The Gauss-Newton method updates a according to

    (J^⊤ J) ∆a = J^⊤ r.   (25)

Levenberg modifies the update equation to

    (J^⊤ J + λ I) ∆a = J^⊤ r.   (26)

◮ λ is adjusted at each iteration.

◮ If the reduction of the error E is large, can use a small λ: behaves more like Gauss-Newton.

◮ If the reduction of the error E is small, can use a large λ: behaves more like gradient descent.


Levenberg-Marquardt Method

If λ is very large, (J^⊤ J + λ I) ≈ λ I, i.e., the update is essentially a (small) gradient descent step.

Marquardt scales each component of the gradient:

    (J^⊤ J + λ diag(J^⊤ J)) ∆a = J^⊤ r,   (27)

where diag(J^⊤ J) is the diagonal matrix of the diagonal elements of J^⊤ J.

The (i, i)-th component of diag(J^⊤ J) is

    Σ_{l=1}^n ( ∂f(x_l)/∂a_i )².   (28)

◮ If the (i, i)-th entry of diag(J^⊤ J) is small, a_i is updated more: faster convergence along the i-th direction.

◮ Then, λ does not have to be too large.
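A minimal NumPy sketch of a Levenberg-Marquardt iteration using Marquardt's scaling (27); the λ adjustment factors (×0.1 on success, ×10 on failure) are a common but illustrative choice, and in practice scipy.optimize.least_squares provides a robust implementation:

    import numpy as np

    def levenberg_marquardt(residual_fn, jacobian_fn, a0, lam=1e-3, max_iter=100, tol=1e-8):
        """Minimize E(a) = sum_i r_i(a)^2 with damping (J^T J + lam * diag(J^T J)) da = -J^T r."""
        a = np.asarray(a0, dtype=float)
        E = np.sum(residual_fn(a)**2)
        for _ in range(max_iter):
            r = residual_fn(a)
            J = jacobian_fn(a)                 # Jacobian J_r of the residuals (J_r = -J_f)
            JtJ = J.T @ J
            delta = np.linalg.solve(JtJ + lam * np.diag(np.diag(JtJ)), -J.T @ r)
            E_new = np.sum(residual_fn(a + delta)**2)
            if E_new < E:                      # error reduced: accept step, behave more like Gauss-Newton
                a, E = a + delta, E_new
                lam *= 0.1
            else:                              # error increased: reject step, behave more like gradient descent
                lam *= 10.0
            if np.linalg.norm(delta) < tol:
                break
        return a

    # Usage (with residual_fn, jacobian_fn as in the Gauss-Newton sketch earlier):
    # a_hat = levenberg_marquardt(residual_fn, jacobian_fn, [1.0, 0.1])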


Quasi-Newton

Quasi-Newton methods also minimize f(x) by iteratively updating x:

    x_{k+1} = x_k + α p_k,   (29)

where the descent direction p_k is

    p_k = −H_k^{-1} ∇f(x_k).   (30)

There are many ways to estimate H_k^{-1}.

The most effective is the method by Broyden, Fletcher, Goldfarb and Shanno (BFGS).


BFGS Method

Let ∆x_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k).

Then, estimate H_k or H_k^{-1} by

    H_{k+1} = H_k − (H_k ∆x_k ∆x_k^⊤ H_k) / (∆x_k^⊤ H_k ∆x_k) + (y_k y_k^⊤) / (y_k^⊤ ∆x_k),   (31)

    H_{k+1}^{-1} = ( I − (∆x_k y_k^⊤) / (y_k^⊤ ∆x_k) ) H_k^{-1} ( I − (y_k ∆x_k^⊤) / (y_k^⊤ ∆x_k) ) + (∆x_k ∆x_k^⊤) / (y_k^⊤ ∆x_k).   (32)

(For derivation, please refer to [1].)
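A direct NumPy transcription of the inverse update (32); the function name is illustrative, and the update assumes y_k^⊤ ∆x_k > 0 (guaranteed when α satisfies the Wolfe conditions):

    import numpy as np

    def bfgs_inverse_update(H_inv, dx, y):
        """Update H_{k+1}^{-1} from H_k^{-1} using Eq. (32)."""
        rho = 1.0 / (y @ dx)                   # 1 / (y_k^T dx_k); requires y_k^T dx_k > 0
        I = np.eye(len(dx))
        A = I - rho * np.outer(dx, y)          # I - (dx y^T) / (y^T dx)
        B = I - rho * np.outer(y, dx)          # I - (y dx^T) / (y^T dx)
        return A @ H_inv @ B + rho * np.outer(dx, dx)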


BFGS Algorithm

1. Set α and initialize x_0 and H_0^{-1}.
2. for k = 0, 1, . . . until convergence do
3.     Compute the search direction p_k = −H_k^{-1} ∇f(x_k).
4.     Perform line search to find α that satisfies the Wolfe conditions.
5.     Compute ∆x_k = α p_k, and set x_{k+1} = x_k + ∆x_k.
6.     Compute y_k = ∇f(x_{k+1}) − ∇f(x_k).
7.     Compute H_{k+1}^{-1} by Eq. 32.

◮ There is no standard way to initialize H_0^{-1}. Can try H_0^{-1} = I.
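In practice, one usually calls a library implementation instead of coding the update; a minimal usage sketch with SciPy's BFGS (the Rosenbrock function is just an example problem):

    import numpy as np
    from scipy.optimize import minimize

    def f(x):                                   # Rosenbrock function
        return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

    def grad_f(x):
        return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                          200.0 * (x[1] - x[0]**2)])

    res = minimize(f, x0=np.array([-1.2, 1.0]), jac=grad_f, method='BFGS')
    print(res.x)                                # approximately [1, 1]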


Approximation of Gradient

All previous methods require the gradient ∇f (or ∇E). In many complex problems, it is very difficult to derive ∇f.

Several methods to overcome this problem:

◮ Use symbolic differentiation to derive the math expression of ∇f. Provided in Matlab, Maple, etc.

◮ Approximate gradient numerically.

◮ Optimize without gradient.


Forward Difference

From the Taylor series expansion,

    f(x + ∆x) = f(x) + ∇f(x)^⊤ ∆x + ···   (33)

Denote the j-th component of ∆x as ε e_j, where e_j is the unit vector of the j-th dimension. Then, the j-th component of ∇f(x) is

    ∂f(x)/∂x_j ≈ ( f(x + ε e_j) − f(x) ) / ε.   (34)

Note that x + ε e_j changes only the j-th component:

    x + ε e_j = [ x_1 ··· x_{j−1} x_j x_{j+1} ··· x_m ]^⊤ + [ 0 ··· 0 ε 0 ··· 0 ]^⊤.   (35)

Then,

    ∇f(x) ≈ [ ∂f(x)/∂x_1 ··· ∂f(x)/∂x_m ]^⊤.   (36)
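A minimal NumPy sketch of the forward-difference approximation (34); the helper name and ε value are illustrative:

    import numpy as np

    def forward_difference_gradient(f, x, eps=1e-6):
        """Approximate grad f(x) using forward differences, Eq. (34)."""
        x = np.asarray(x, dtype=float)
        g = np.zeros_like(x)
        fx = f(x)
        for j in range(x.size):
            x_step = x.copy()
            x_step[j] += eps                   # x + eps * e_j: perturb only the j-th component
            g[j] = (f(x_step) - fx) / eps
        return g

    f = lambda x: x[0]**2 + 3.0 * x[0] * x[1]
    print(forward_difference_gradient(f, [1.0, 2.0]))   # close to the exact gradient [8, 3]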


Central Difference

The forward difference method has a forward bias: it evaluates only at x + ε e_j. The error from this bias can accumulate quickly.

To be less biased, use central difference.

    f(x + ε e_j) ≈ f(x) + ε ∂f(x)/∂x_j,

    f(x − ε e_j) ≈ f(x) − ε ∂f(x)/∂x_j.

So,

    ∂f(x)/∂x_j ≈ ( f(x + ε e_j) − f(x − ε e_j) ) / (2ε).   (37)
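The corresponding central-difference sketch of (37), which is usually more accurate than forward differences for the same ε (again, names and ε are illustrative):

    import numpy as np

    def central_difference_gradient(f, x, eps=1e-5):
        """Approximate grad f(x) using central differences, Eq. (37)."""
        x = np.asarray(x, dtype=float)
        g = np.zeros_like(x)
        for j in range(x.size):
            e = np.zeros_like(x)
            e[j] = eps
            g[j] = (f(x + e) - f(x - e)) / (2.0 * eps)
        return g

    f = lambda x: np.sin(x[0]) * x[1]
    print(central_difference_gradient(f, [0.5, 2.0]))   # close to [2 cos(0.5), sin(0.5)]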


Optimization without Gradient

A simple idea is to descend along each coordinate direction iteratively. Let e_1, . . . , e_m denote the unit vectors along coordinates 1, . . . , m.

Coordinate Descent

1. Choose initial x_0.
2. repeat
3.     for each j = 1, . . . , m do
4.         Find x_{k+1} along e_j that minimizes f.
5. until convergence.


Optimization without Gradient

Alternative for Step 4: Find x_{k+1} along e_j using line search to reduce f(x_k) sufficiently.

◮ Strength: Simplicity.

◮ Weakness: Can be slow. It can iterate around the minimum without ever approaching it.
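For concreteness, a minimal sketch of coordinate descent that uses a crude shrinking step along ±e_j in place of an exact one-dimensional minimization (all names and step sizes are illustrative assumptions):

    import numpy as np

    def coordinate_descent(f, x0, step=1.0, rho=0.5, max_sweeps=100, tol=1e-8):
        """Repeatedly minimize f approximately along each coordinate direction e_j."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_sweeps):
            x_old = x.copy()
            for j in range(x.size):
                e = np.zeros_like(x)
                e[j] = 1.0
                alpha = step
                for _ in range(200):           # crude 1-D search along +/- alpha * e_j
                    if alpha <= tol:
                        break
                    if f(x + alpha * e) < f(x):
                        x = x + alpha * e
                    elif f(x - alpha * e) < f(x):
                        x = x - alpha * e
                    else:
                        alpha *= rho           # neither direction helps: shrink the step
            if np.linalg.norm(x - x_old) < tol:
                break
        return x

    f = lambda x: (x[0] - 1.0)**2 + 4.0 * (x[1] + 2.0)**2
    print(coordinate_descent(f, [0.0, 0.0]))   # approaches [1, -2]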


Powell’s Method

Powell's method overcomes this weakness of coordinate descent.

Basic Idea: Choose directions that do not interfere with or undo previous effort.

For details, refer to [2].


Alternating Optimization

Let S ⊆ Rm denote the set of x that satisfy the constraints.

Then, the constrained optimization problem can be written as

    min_{x ∈ S} f(x).   (38)

Alternating optimization (AO) is similar to coordinate descent, except that the variables x_j, j = 1, . . . , m, are split into groups.

Partition x ∈ S into p non-overlapping parts y_l such that

    x = (y_1, . . . , y_l, . . . , y_p).   (39)

Example: x = (x_1, x_2, x_4, x_3, x_7, x_5, x_6, x_8, x_9), with y_1 = (x_1, x_2, x_4), y_2 = (x_3, x_7), y_3 = (x_5, x_6, x_8, x_9).

Denote S_l ⊂ S as the set that contains y_l.


Alternating Optimization

1. Choose initial x^(0) = (y_1^(0), . . . , y_p^(0)).
2. for k = 0, 1, 2, . . . , until convergence do
3.     for l = 1, . . . , p do
4.         Find y_l^(k+1) such that

               y_l^(k+1) = arg min_{y_l ∈ S_l} f(y_1^(k+1), . . . , y_{l−1}^(k+1), y_l, y_{l+1}^(k), . . . , y_p^(k)).

           That is, find y_l^(k+1) while fixing the other variables.

           (Alternative) Find y_l^(k+1) that decreases f(x) sufficiently:

               f(y_1^(k+1), . . . , y_{l−1}^(k+1), y_l^(k), y_{l+1}^(k), . . . , y_p^(k)) −
                   f(y_1^(k+1), . . . , y_{l−1}^(k+1), y_l^(k+1), y_{l+1}^(k), . . . , y_p^(k)) > c > 0.


Alternating Optimization

Example:

    initially       (y_1^(0), y_2^(0), y_3^(0), . . . , y_l^(0), . . . , y_p^(0))
    k = 0, l = 1    (y_1^(1), y_2^(0), y_3^(0), . . . , y_l^(0), . . . , y_p^(0))
    k = 0, l = 2    (y_1^(1), y_2^(1), y_3^(0), . . . , y_l^(0), . . . , y_p^(0))
    k = 0, l = p    (y_1^(1), y_2^(1), y_3^(1), . . . , y_l^(1), . . . , y_p^(1))
    k = 1, l = 1    (y_1^(2), y_2^(1), y_3^(1), . . . , y_l^(1), . . . , y_p^(1))
    ...

Obviously, we want to partition the variables so that

◮ Each sub-problem of finding y_l is relatively easy to solve.

◮ Solving for y_l does not undo previous partial solutions.


Example: k-means clustering

k-means clustering tries to group data points x into k clusters.

(Figure: data points grouped into three clusters with prototypes c_1, c_2, c_3.)

Let c_j denote the prototype of cluster C_j. Then, k-means clustering finds C_j and c_j, j = 1, . . . , k, that minimize the within-cluster difference

    D = Σ_{j=1}^k Σ_{x ∈ C_j} ‖x − c_j‖².   (40)


Alternating Optimization

Suppose the C_j are already known. Then, the c_j that minimizes D is given by setting

    ∂D/∂c_j = −2 Σ_{x ∈ C_j} (x − c_j) = 0.   (41)

That is, c_j is the centroid of C_j:

    c_j = (1/|C_j|) Σ_{x ∈ C_j} x.   (42)

How to find C_j? Use alternating optimization.


k-means clustering

1. Initialize c_j, j = 1, . . . , k.
2. repeat
3.     Assign each x to its nearest cluster C_j (fix c_j, update C_j):

           ‖x − c_j‖ ≤ ‖x − c_l‖, for l ≠ j.

4.     Update each cluster centroid c_j (fix C_j, update c_j):

           c_j = (1/|C_j|) Σ_{x ∈ C_j} x.

5. until convergence.

◮ Step 4 (the centroid update) is (locally) optimal.

◮ Step 3 (the assignment) does not undo Step 4. Why?

◮ So, k-means clustering can converge to a local minimum.
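A minimal NumPy sketch of the k-means loop above, alternating the assignment step and the centroid update; initializing the prototypes by sampling k data points is an illustrative choice:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Alternating optimization of D: fix c_j -> update clusters, fix clusters -> update c_j."""
        rng = np.random.default_rng(seed)
        C = X[rng.choice(len(X), size=k, replace=False)]        # initialize prototypes c_j
        for _ in range(max_iter):
            # Assignment step: assign each x to its nearest prototype.
            labels = np.argmin(((X[:, None, :] - C[None, :, :])**2).sum(axis=2), axis=1)
            # Update step: recompute each prototype as the centroid of its cluster.
            C_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                              for j in range(k)])
            if np.allclose(C_new, C):
                break
            C = C_new
        return C, labels

    X = np.vstack([np.random.randn(50, 2) + [0, 0],
                   np.random.randn(50, 2) + [5, 5]])
    centroids, labels = kmeans(X, k=2)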


Convergence Rates

Suppose the sequence {x_k} converges to x*.

Then, the sequence converges to x* linearly if there exists µ ∈ (0, 1) such that

    lim_{k→∞} |x_{k+1} − x*| / |x_k − x*| = µ.   (43)

◮ The sequence converges sublinearly if µ = 1.

◮ The sequence converges superlinearly if µ = 0.

If the sequence converges sublinearly and

    lim_{k→∞} |x_{k+2} − x_{k+1}| / |x_{k+1} − x_k| = 1,   (44)

then the sequence converges logarithmically.


Convergence Rates

We can distinguish between different rates of superlinear convergence.

The sequence converges with order q, for q ≥ 1, if there exists µ > 0 such that

    lim_{k→∞} |x_{k+1} − x*| / |x_k − x*|^q = µ > 0.   (45)

◮ If q = 1, the sequence converges q-linearly.

◮ If q = 2, the sequence converges quadratically or q-quadratically.

◮ If q = 3, the sequence converges cubically or q-cubically.

◮ etc.
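As an illustration (not from the slides), the order q can be estimated from consecutive errors; for example, Newton's method on f(x) = x² − 2 converges to √2 quadratically:

    import math

    x_star = math.sqrt(2.0)
    x, errs = 3.0, []
    for _ in range(5):
        x = x - (x * x - 2.0) / (2.0 * x)      # Newton step for f(x) = x^2 - 2
        errs.append(abs(x - x_star))

    # q is approximately log(e_{k+1}/e_k) / log(e_k/e_{k-1})
    for k in range(2, 5):
        q = math.log(errs[k] / errs[k - 1]) / math.log(errs[k - 1] / errs[k - 2])
        print(round(q, 2))                     # the estimates approach 2 (quadratic convergence)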


Known Convergence Rates

◮ Gradient descent converges linearly, and often slowly.

◮ Gauss-Newton converges quadratically, if it converges.

◮ Levenberg-Marquardt converges quadratically.

◮ Quasi-Newton converges superlinearly.

◮ Alternating optimization converges q-linearly if f is convex around a local minimizer x*.


Summary

Method                     For                        Uses
---------------------------------------------------------------------------
Gradient descent           optimization               ∇f
Gauss-Newton               nonlinear least squares    ∇E, estimate of H
Levenberg-Marquardt        nonlinear least squares    ∇E, estimate of H
Quasi-Newton               optimization               ∇f, estimate of H
Newton                     optimization               ∇f, H
Coordinate descent         optimization               f
Alternating optimization   optimization               f


Summary

SciPy supports many optimization methods including

◮ linear least squares

◮ Levenberg-Marquardt method

◮ quasi-Newton BFGS method

◮ line search that satisfies Wolfe conditions

◮ etc.

For more details and other methods, refer to [1].
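A few illustrative calls (the model and data here are made up for the example):

    import numpy as np
    from scipy.optimize import least_squares, minimize

    # Nonlinear least squares with Levenberg-Marquardt (method='lm').
    x_data = np.array([0.0, 1.0, 2.0, 3.0])
    v_data = np.array([2.0, 2.7, 3.6, 4.9])
    residual = lambda a: v_data - a[0] * np.exp(a[1] * x_data)
    fit = least_squares(residual, x0=[1.0, 0.1], method='lm')

    # General minimization with quasi-Newton BFGS.
    f = lambda x: (x[0] - 1.0)**2 + (x[1] + 2.0)**2
    res = minimize(f, x0=[0.0, 0.0], method='BFGS')

    # Linear least squares D a = v.
    D = np.column_stack([np.ones_like(x_data), x_data])
    a, *_ = np.linalg.lstsq(D, v_data, rcond=None)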


Probing Questions

◮ Can you apply an algorithm specialized for nonlinear least squares, e.g., Gauss-Newton, to generic optimization?

◮ Gauss-Newton may converge slowly or diverge if the matrix J⊤J isill-conditioned. How to make the matrix well-conditioned?

◮ How can your alternating optimization program detectinterference of previous update by current update?

◮ If your alternating optimization program detects frequentinterference between updates, what can you do about it?

◮ So far, we have looked at minimizing a scalar function f(x). How to minimize a vector function f(x) or multiple scalar functions f_l(x) simultaneously?

◮ How to ensure that your optimization program will alwaysconverge regardless of the test condition?


Homework

1. Describe the essence of Gauss-Newton, Levenberg-Marquardt,quasi-Newton, coordinate descent and alternating optimizationeach in one sentence.

2. With the SVD of J given by J = U Σ V^⊤, show that

       (J^⊤ J)^{-1} J^⊤ = V Σ^{-1} U^⊤.

   Hint: For invertible matrices A and B, (AB)^{-1} = B^{-1} A^{-1}.

3. Q3(c) of AY2015/16 Final Evaluation.

4. Q5 of AY2015/16 Final Evaluation.


References

1. J. Nocedal and S. J. Wright, Numerical Optimization, Springer-Verlag, 1999.

2. W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes: The Art of Scientific Computing, 3rd Ed., Cambridge University Press, 2007. www.nr.com

3. J. C. Bezdek and R. J. Hathaway, Some notes on alternatingoptimization, in Proc. Int. Conf. on Fuzzy Systems, 2002.
