Lecture 5: Gradient methods
(Chapter 8 from the textbook)
Xiaoqun Zhang
Shanghai Jiao Tong University
Last updated: October 23, 2018
Outline
Gradient methods
Steepest descent method
Convergence analysis
Gradient methods
simple and intuitive
every iteration is inexpensive
does not require second derivatives
extensions: nonsmooth optimization, combination with duality and splitting, coordinate descent, alternating direction, stochastic and online learning
suitable for large-scale problems and parallelization
Geometric interpretation of gradients
∇f(x0), if it is not zero, is orthogonal to the tangent of the level set curve of f passing through x0, and it points outward (in the direction of increasing f).
Descent direction
Descent direction at x: 〈∇f(x), d〉 < 0. Then
f(x + αd) = f(x) + α〈∇f(x), d〉 + o(α) < f(x) for sufficiently small α > 0.
−∇f(x) is the maximal-rate descent direction of f at x, and ‖∇f(x)‖ is the rate.
Proof: pick any direction d with ‖d‖ = 1. The rate of change at x along d is bounded by Cauchy–Schwarz:
〈∇f(x), d〉 ≤ ‖∇f(x)‖‖d‖ = ‖∇f(x)‖
If we set d = ∇f(x)/‖∇f(x)‖, then
〈∇f(x), d〉 = ‖∇f(x)‖
so the bound is attained; equivalently, d = −∇f(x)/‖∇f(x)‖ attains the most negative rate −‖∇f(x)‖.
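A quick numerical sanity check of this claim (illustrative, not from the lecture): sampling random unit directions d, the directional derivative 〈∇f(x), d〉 never falls below −‖∇f(x)‖, the rate attained by d = −∇f(x)/‖∇f(x)‖.

    import numpy as np

    # Illustrative check: among unit directions d, the rate <grad f(x), d>
    # is bounded below by -||grad f(x)||, attained at d = -grad/||grad||.
    rng = np.random.default_rng(0)
    grad_f = lambda x: 2.0 * x            # gradient of f(x) = ||x||^2

    x = np.array([1.0, -2.0, 0.5])
    g = grad_f(x)
    dirs = rng.standard_normal((1000, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit directions
    print(min(dirs @ g), -np.linalg.norm(g))              # min sampled rate vs bound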
Classical gradient method
Consider f : Rn → R differentiable:
min f(x)
Choose x(0) and repeat
x(k) = x(k−1) − αk∇f(x(k−1)), k = 1, 2, · · ·
Small stepsizes: iterations are more likely to converge, but may require more iterations and thus more evaluations of ∇f.
Large stepsizes: make better use of ∇f(x) and may reduce the total number of iterations, but can cause overshooting and zig-zags, which lead to divergent iterations.
Choice of stepsizes: fixed (αk constant); 1D line search (backtracking, exact, Barzilai–Borwein with nonmonotone line search). A sketch with backtracking follows below.
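A minimal sketch of the classical gradient method with a backtracking (Armijo) line search; the function names, the Armijo constant 0.5, and the tolerances are illustrative choices, not from the lecture.

    import numpy as np

    def gradient_descent(f, grad, x0, tol=1e-8, max_iter=10_000):
        """Gradient method with backtracking line search (Armijo condition)."""
        x = x0.astype(float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:        # gradient stopping criterion
                break
            alpha = 1.0
            # halve the step until sufficient decrease holds
            while f(x - alpha * g) > f(x) - 0.5 * alpha * (g @ g):
                alpha *= 0.5
            x = x - alpha * g
        return x

    # usage: minimize (x1 - 1)^2 + 10 (x2 + 2)^2
    f = lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2
    grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
    print(gradient_descent(f, grad, np.array([0.0, 0.0])))  # ~ [1, -2]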
Stopping criteria
Gradient condition: ‖∇f(x(k+1))‖ < ε
Objective condition: |f(x(k+1)) − f(x(k))| < ε or |f(x(k+1)) − f(x(k))| / max{1, |f(x(k))|} < ε
Iterate condition: ‖x(k+1) − x(k)‖ < ε or ‖x(k+1) − x(k)‖ / max{1, ‖x(k)‖} < ε
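These criteria can be coded directly; a small illustrative helper (the function names are mine, not the lecture's), where max{1, ·} guards against division by very small denominators:

    import numpy as np

    def grad_small(g_new, eps):
        # gradient condition
        return np.linalg.norm(g_new) < eps

    def f_stalled(f_new, f_old, eps):
        # relative objective condition
        return abs(f_new - f_old) / max(1.0, abs(f_old)) < eps

    def x_stalled(x_new, x_old, eps):
        # relative iterate condition
        return np.linalg.norm(x_new - x_old) / max(1.0, np.linalg.norm(x_old)) < eps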
Steepest descent method
The method of steepest descent
(Also known as gradient descent with exact line search)
Stepsize αk is determined by
αk = arg min_{α ≥ 0} f(x(k) − α∇f(x(k)))
For quadratic programs, αk has a closed form. Suitable for problems with inexpensive function evaluations but expensive gradient evaluations.
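A minimal sketch of steepest descent where the exact line search is delegated to a bounded 1D minimizer (scipy's minimize_scalar); the bracket (0, 1e3) and the tolerances are illustrative assumptions.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def steepest_descent(f, grad, x0, tol=1e-8, max_iter=1000):
        x = x0.astype(float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            phi = lambda a, g=g, x=x: f(x - a * g)   # 1D restriction along -g
            alpha = minimize_scalar(phi, bounds=(0.0, 1e3), method='bounded').x
            x = x - alpha * g
        return x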
Orthogonality
Proposition: If {x(k)}k≥0 is a steepest descent sequence for a given function f : Rn → R, then for each k the vector x(k+1) − x(k) is orthogonal to the vector x(k+2) − x(k+1).
Proof:
x(k+1) − x(k) = −αk∇f(x(k)), x(k+2) − x(k+1) = −αk+1∇f(x(k+1)).
We want to show 〈∇f(x(k)), ∇f(x(k+1))〉 = 0. Consider αk > 0; then αk is an interior point of the constraint set in
αk = arg min_{α ≥ 0} φ(α), where φ(α) := f(x(k) − α∇f(x(k))).
By FONC, φ′(αk) = 0 ⇒ 〈∇f(x(k) − αk∇f(x(k))), ∇f(x(k))〉 = 0. Thus 〈∇f(x(k+1)), ∇f(x(k))〉 = 0.
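An illustrative numerical check of the proposition on a quadratic, where the exact step has the closed form derived later in the lecture; Q and b are arbitrary s.p.d. test data, not from the slides.

    import numpy as np

    Q = np.array([[3.0, 1.0], [1.0, 2.0]])    # arbitrary s.p.d. matrix
    b = np.array([1.0, -1.0])

    x = np.array([5.0, 5.0])
    g_prev = None
    for _ in range(4):
        g = Q @ x - b                          # gradient
        if g_prev is not None:
            print(g_prev @ g)                  # ~ 0: successive gradients orthogonal
        x = x - (g @ g) / (g @ Q @ g) * g      # exact line-search step
        g_prev = g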
Function decreasing
Proposition: If {x(k)}k≥0 is the steepest descent sequence for f : Rn → R and if ∇f(x(k)) ≠ 0, then f(x(k+1)) < f(x(k)).
Proof: As αk is the minimizer of φk(α) = f(x(k) − α∇f(x(k))) over all α ≥ 0, we have
φk(αk) ≤ φk(α)
for all α ≥ 0. By the chain rule, φ′k(0) = −‖∇f(x(k))‖² < 0, so there is an α̃ > 0 such that φk(0) > φk(α) for all α ∈ (0, α̃]. Hence
f(x(k+1)) = φk(αk) ≤ φk(α̃) < φk(0) = f(x(k)).
Example
f(x1, x2, x3) = (x1 − 4)^4 + (x2 − 3)^2 + 4(x3 + 5)^4
The initial point is x(0) = [4, 2, −1]T.
Step 1: compute ∇f(x(0)) = [0, −2, 1024]T, obtain α0 = 3.967 × 10⁻³ by the secant method, and get x(1) = [4.000, 2.008, −5.062]T.
Step 2: compute ∇f(x(1)) = [0.000, −1.984, −0.003875]T, obtain α1 = 0.5000 by the secant method, and get x(2) = [4.000, 3.000, −5.060]T.
Step 3: compute ∇f(x(2)) = [0.000, 0.000, −0.003525]T, obtain α2 = 16.29 by the secant method, and get x(3) = [4.000, 3.000, −5.002]T, which is close to the minimizer [4, 3, −5]T.
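The three steps can be reproduced numerically; in this sketch a bounded 1D minimizer stands in for the secant method used in the lecture, so the stepsizes may differ slightly.

    import numpy as np
    from scipy.optimize import minimize_scalar

    f = lambda x: (x[0] - 4) ** 4 + (x[1] - 3) ** 2 + 4 * (x[2] + 5) ** 4
    grad = lambda x: np.array([4 * (x[0] - 4) ** 3,
                               2 * (x[1] - 3),
                               16 * (x[2] + 5) ** 3])

    x = np.array([4.0, 2.0, -1.0])             # initial point from the example
    for k in range(3):
        g = grad(x)
        alpha = minimize_scalar(lambda a: f(x - a * g),
                                bounds=(0.0, 100.0), method='bounded').x
        x = x - alpha * g
        print(k + 1, x)                        # approaches [4, 3, -5]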
Steepest descent for quadratic programming
Consider f(x) = (1/2)xT Qx − bT x where Q is symmetric positive definite (s.p.d.). (If Q is not symmetric, replace it by its symmetric part (1/2)(Q + QT).)
By FONC, we have
∇f(x) = Qx − b = 0 → x = Q⁻¹b
and ∇²f(x) = Q is s.p.d., thus x∗ = Q⁻¹b is the unique global minimizer by SOSC.
Steepest descent iteration: start from any x(0), set
x(k+1) = x(k) − αk g(k)
where g(k) = ∇f(x(k)) = Qx(k) − b. Assume g(k) ≠ 0; then αk = arg min_{α ≥ 0} f(x(k) − αg(k)). By FONC, we have
〈∇f(x(k+1)), g(k)〉 = 0 ⇒ αk = (g(k)T g(k)) / (g(k)T Qg(k))
Hence
x(k+1) = x(k) − ((g(k)T g(k)) / (g(k)T Qg(k))) g(k), where g(k) = Qx(k) − b
Computational cost: two matrix–vector multiplications per iteration.
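A minimal sketch of this closed-form iteration; Q, b, and the tolerance are illustrative test data.

    import numpy as np

    def sd_quadratic(Q, b, x0, tol=1e-10, max_iter=10_000):
        """Steepest descent for f(x) = 0.5 x^T Q x - b^T x, Q s.p.d."""
        x = x0.astype(float)
        for _ in range(max_iter):
            g = Q @ x - b                       # first matrix-vector product
            if np.linalg.norm(g) < tol:
                break
            Qg = Q @ g                          # second matrix-vector product
            x = x - (g @ g) / (g @ Qg) * g      # closed-form exact step
        return x

    Q = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    print(sd_quadratic(Q, b, np.zeros(2)), np.linalg.solve(Q, b))  # should agree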
Example
f(x1, x2) = x1^2 + x2^2
It is easy to see that (0, 0)T is the solution. Starting from any initial point x(0), one finds that x(1) is the solution, i.e., the global solution is reached in one iteration.
[Figure: steepest descent iterates on the level sets of f(x) = 15x1^2 + x2^2.]
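An illustrative comparison of the two behaviors (the matrices below encode f = x1^2 + x2^2 and f = 15x1^2 + x2^2 in the form (1/2)xT Qx; the starting point is arbitrary):

    import numpy as np

    def sd_path(Q, x0, n):
        x = x0.astype(float)
        path = [x.copy()]
        for _ in range(n):
            g = Q @ x                                # gradient of 0.5 x^T Q x
            if np.linalg.norm(g) < 1e-14:
                break
            x = x - (g @ g) / (g @ Q @ g) * g        # exact step
            path.append(x.copy())
        return np.array(path)

    print(sd_path(np.diag([2.0, 2.0]), np.array([3.0, 1.0]), 5))   # one step to 0
    print(sd_path(np.diag([30.0, 2.0]), np.array([3.0, 1.0]), 5))  # zig-zags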
Convergence analysis
Convergence for quadratic programming
min (1/2)xT Qx − bT x
where Q is symmetric positive definite.
Theorem: In the steepest descent algorithm, we have x(k) → x∗ for any x(0).
Theorem: For the fixed-step-size gradient algorithm, x(k) → x∗ for any x(0) if and only if
0 < α < 2/λmax(Q).
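An illustrative check of the threshold 2/λmax(Q) (test matrix and stepsizes are mine), comparing a stepsize just below and just above it:

    import numpy as np

    Q = np.diag([10.0, 1.0])        # lambda_max = 10, threshold 2/10 = 0.2
    b = np.zeros(2)

    for alpha in (0.19, 0.21):      # just below / just above the threshold
        x = np.array([1.0, 1.0])
        for _ in range(200):
            x = x - alpha * (Q @ x - b)
        print(alpha, np.linalg.norm(x))   # tiny vs. huge distance to x* = 0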
Convergence rate
Definition: Given a sequence x(k) that converges to x∗, we say the order of convergence is p, p ∈ R, if
0 < lim ‖x(k+1) − x∗‖ / ‖x(k) − x∗‖^p < ∞
If p = 1 and lim ‖x(k+1) − x∗‖ / ‖x(k) − x∗‖ = 1, the convergence is sublinear;
if p = 1 and lim ‖x(k+1) − x∗‖ / ‖x(k) − x∗‖ = γ < 1, linear;
if p = 1 and lim ‖x(k+1) − x∗‖ / ‖x(k) − x∗‖ = 0, superlinear;
if p = 2, quadratic.
Convergence order
Recall that a = O(h) if there exists a constant c such that |a| ≤ c|h| for sufficiently small h.
Theorem: Let x(k) be a sequence that converges to x∗. If
‖x(k+1) − x∗‖ = O(‖x(k) − x∗‖^p),
then the order of convergence (if it exists) is at least p.
Examples
Let x(k) = 1/k, so x(k) → 0. Then
|x(k+1)| / |x(k)|^p = k^p / (k + 1).
If p < 1, it converges to 0; if p > 1, it grows to ∞; if p = 1, it converges to 1, hence sublinear convergence.
Let x(k) = γ^k, where 0 < γ < 1, so x(k) → 0. Then
|x(k+1)| / |x(k)|^p = γ^{k(1−p)} · γ.
If p < 1, it goes to 0; if p > 1, it grows to ∞; if p = 1, it converges to γ, hence linear convergence.
Let x(k) = γ^{q^k}, where q > 1 and 0 < γ < 1, so x(k) → 0. Then
|x(k+1)| / |x(k)|^p = γ^{q^{k+1}} / (γ^{q^k})^p = γ^{(q−p)q^k}.
If p < q, it converges to 0, whereas if p > q, it grows to ∞. If p = q, the sequence converges to 1. Hence the order of convergence is q.
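A quick numerical illustration of the linear case (γ = 0.5 is an arbitrary choice): for x(k) = γ^k the ratio with p = 1 is constant at γ.

    import numpy as np

    gamma = 0.5
    xs = gamma ** np.arange(1, 20)     # x_k = gamma^k
    print(xs[1:] / xs[:-1])            # all ratios equal gamma: linear rate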
Steepest descent for quadratic programming
Define e(k) = x(k) − x∗ and g(k) = Qx(k) − b = Qe(k).
Good cases:
If Qe(k) = λe(k), then e(k+1) = e(k) − (‖g(k)‖² / (λ‖g(k)‖²)) λe(k) = 0: convergence in one more iteration.
If Q has only one distinct eigenvalue (the level sets of f are circles), this holds for every e(k).
General case: define ‖e‖_Q = √(eT Qe) and κ = λmax(Q)/λmin(Q); then
‖e(k)‖_Q ≤ ((κ − 1)/(κ + 1))^k ‖e(0)‖_Q
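An illustrative verification of this bound for a diagonal test matrix with κ = 10 (data chosen by me, with b = 0 so that x∗ = 0):

    import numpy as np

    Q = np.diag([10.0, 1.0])                     # kappa = 10
    rate = (10.0 - 1.0) / (10.0 + 1.0)           # (kappa-1)/(kappa+1)
    qnorm = lambda e: np.sqrt(e @ Q @ e)

    x = np.array([1.0, 1.0])                     # error e = x since x* = 0
    e0 = qnorm(x)
    for k in range(1, 6):
        g = Q @ x
        x = x - (g @ g) / (g @ Q @ g) * g        # exact steepest descent step
        print(qnorm(x) <= rate ** k * e0)        # True at every iteration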
Gradient descent with fixed stepsize for general f
Consider a general differentiable f and the fixed-stepsize iteration
x(k+1) = x(k) − αg(k)
where g(k) = ∇f(x(k)). Assume a minimizer x∗ exists. Then
‖x(k+1) − x∗‖² = ‖x(k) − x∗ − αg(k)‖²
= ‖x(k) − x∗‖² − 2α〈g(k), x(k) − x∗〉 + α²‖g(k)‖²
If α²‖g(k)‖² ≤ 2α〈g(k), x(k) − x∗〉, then
‖x(k+1) − x∗‖² ≤ ‖x(k) − x∗‖²
As g∗ = ∇f(x∗) = 0, the condition is equivalent to
(α/2)‖g(k) − g∗‖² ≤ 〈g(k) − g∗, x(k) − x∗〉
Special case with convex and Lipschitz differentiable f
Definition: A function f is L-Lipschitz differentiable, L ≥ 0, if f ∈ C¹ and
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, ∀x, y ∈ Rn
Theorem (Baillon–Haddad)
If f ∈ C¹ is a convex function, then it is L-Lipschitz differentiable if and only if
‖∇f(x) − ∇f(y)‖² ≤ L〈∇f(x) − ∇f(y), x − y〉, ∀x, y ∈ Rn
(a gradient satisfying this inequality is called 1/L-cocoercive)
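An illustrative check of this inequality for the convex quadratic f(x) = (1/2)xT Qx, where ∇f(x) = Qx and L = λmax(Q) (test data chosen by me):

    import numpy as np

    rng = np.random.default_rng(0)
    Q = np.diag([10.0, 1.0])
    L = 10.0                                  # lambda_max(Q)

    for _ in range(5):
        x, y = rng.standard_normal(2), rng.standard_normal(2)
        dg, dx = Q @ x - Q @ y, x - y
        print(np.linalg.norm(dg) ** 2 <= L * (dg @ dx) + 1e-12)   # True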
Convergence analysis
Let f ∈ C¹ be convex and L-Lipschitz differentiable. If 0 < α < 2/L, then
(α/2)‖g(k) − g∗‖² ≤ 〈g(k) − g∗, x(k) − x∗〉
and thus ‖x(k+1) − x∗‖ ≤ ‖x(k) − x∗‖ for k = 0, 1, · · ·; the iteration stays bounded.
Let f ∈ C¹ be convex and L-Lipschitz differentiable. If 0 < α < 2/L, then both f(x(k)) and ‖∇f(x(k))‖ are monotonically decreasing, and
f(x(k)) − f(x∗) = O(1/k), ‖∇f(x(k))‖ = o(1/k)