Lecture 5: Gradient methods
(Chapter 8 from the textbook)
Xiaoqun Zhang
Shanghai Jiao Tong University
Last updated: October 23, 2018
Outline
Gradient methods
Steepest descent method
Convergence analysis
Gradient methods
simple and intuitive
every iteration is inexpensive
does not require second derivatives
extensions: nonsmooth optimization, combination with duality and splitting, coordinate descent, alternating direction, stochastic and online learning
suitable for large-scale problems and parallelization
Geometric interpretation of gradients
∇f(x0), if it is not zero, is orthogonal to the tangent of the level set curve of f passing through x0, and it points outward (in the direction of increasing f).
Descent direction
Descent direction at x: 〈∇f(x), d〉 < 0. Then
f(x + αd) = f(x) + α〈∇f(x), d〉 + o(α) < f(x) for sufficiently small α > 0.
−∇f(x) is the maximal-rate descent direction of f at x, and ‖∇f(x)‖ is the rate.
Proof: pick any direction d with ‖d‖ = 1. The rate of change at x along d is bounded by Cauchy–Schwarz:
〈∇f(x), d〉 ≤ ‖∇f(x)‖‖d‖ = ‖∇f(x)‖
If we set d = ∇f(x)/‖∇f(x)‖, then
〈∇f(x), d〉 = ‖∇f(x)‖
so the bound is attained; equivalently, d = −∇f(x)/‖∇f(x)‖ attains the most negative rate −‖∇f(x)‖.
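A quick numerical sanity check of this claim (illustrative, not from the lecture): sampling random unit directions d, the directional derivative 〈∇f(x), d〉 never falls below −‖∇f(x)‖, the rate attained by d = −∇f(x)/‖∇f(x)‖.

    import numpy as np

    # Illustrative check: among unit directions d, the rate <grad f(x), d>
    # is bounded below by -||grad f(x)||, attained at d = -grad/||grad||.
    rng = np.random.default_rng(0)
    grad_f = lambda x: 2.0 * x            # gradient of f(x) = ||x||^2

    x = np.array([1.0, -2.0, 0.5])
    g = grad_f(x)
    dirs = rng.standard_normal((1000, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit directions
    print(min(dirs @ g), -np.linalg.norm(g))              # min sampled rate vs bound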
Classical gradient method
Consider f : Rn → R differentiable:
min f(x)
Choose x(0) and repeat
x(k) = x(k−1) − αk∇f(x(k−1)), k = 1, 2, · · ·
Small stepsizes: iterations are more likely to converge, but may require more iterations and thus more evaluations of ∇f.
Large stepsizes: make better use of ∇f(x) and may reduce the total number of iterations, but can cause overshooting and zig-zags, which lead to divergent iterations.
Choice of stepsizes: fixed (αk constant); 1D line search (backtracking, exact, Barzilai–Borwein with nonmonotone line search). A sketch with backtracking follows below.
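A minimal sketch of the classical gradient method with a backtracking (Armijo) line search; the function names, the Armijo constant 0.5, and the tolerances are illustrative choices, not from the lecture.

    import numpy as np

    def gradient_descent(f, grad, x0, tol=1e-8, max_iter=10_000):
        """Gradient method with backtracking line search (Armijo condition)."""
        x = x0.astype(float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:        # gradient stopping criterion
                break
            alpha = 1.0
            # halve the step until sufficient decrease holds
            while f(x - alpha * g) > f(x) - 0.5 * alpha * (g @ g):
                alpha *= 0.5
            x = x - alpha * g
        return x

    # usage: minimize (x1 - 1)^2 + 10 (x2 + 2)^2
    f = lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2
    grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
    print(gradient_descent(f, grad, np.array([0.0, 0.0])))  # ~ [1, -2]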
Stopping criteria
Gradient condition: ‖∇f(x(k+1))‖ < ε
Objective condition: |f(x(k+1)) − f(x(k))| < ε or |f(x(k+1)) − f(x(k))| / max{1, |f(x(k))|} < ε
Iterate condition: ‖x(k+1) − x(k)‖ < ε or ‖x(k+1) − x(k)‖ / max{1, ‖x(k)‖} < ε
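These criteria can be coded directly; a small illustrative helper (the function names are mine, not the lecture's), where max{1, ·} guards against division by very small denominators:

    import numpy as np

    def grad_small(g_new, eps):
        # gradient condition
        return np.linalg.norm(g_new) < eps

    def f_stalled(f_new, f_old, eps):
        # relative objective condition
        return abs(f_new - f_old) / max(1.0, abs(f_old)) < eps

    def x_stalled(x_new, x_old, eps):
        # relative iterate condition
        return np.linalg.norm(x_new - x_old) / max(1.0, np.linalg.norm(x_old)) < eps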
Steepest descent method
The method of steepest descent
(Also known as gradient descent with exact line search)
Stepsize αk is determined by
αk = arg min_{α ≥ 0} f(x(k) − α∇f(x(k)))
For quadratic programs, αk has a closed form. Suitable for problems with inexpensive function evaluations but expensive gradient evaluations.
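A minimal sketch of steepest descent where the exact line search is delegated to a bounded 1D minimizer (scipy's minimize_scalar); the bracket (0, 1e3) and the tolerances are illustrative assumptions.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def steepest_descent(f, grad, x0, tol=1e-8, max_iter=1000):
        x = x0.astype(float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            phi = lambda a, g=g, x=x: f(x - a * g)   # 1D restriction along -g
            alpha = minimize_scalar(phi, bounds=(0.0, 1e3), method='bounded').x
            x = x - alpha * g
        return x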
Orthogonality
Proposition: If {x(k)}k≥0 is a steepest descent sequence for a given function f : Rn → R, then for each k the vector x(k+1) − x(k) is orthogonal to the vector x(k+2) − x(k+1).
Proof:
x(k+1) − x(k) = −αk∇f(x(k)), x(k+2) − x(k+1) = −αk+1∇f(x(k+1)).
We want to show 〈∇f(x(k)), ∇f(x(k+1))〉 = 0. Consider αk > 0; then αk is an interior point of the constraint set in
αk = arg min_{α ≥ 0} φ(α), where φ(α) := f(x(k) − α∇f(x(k))).
By FONC, φ′(αk) = 0 ⇒ 〈∇f(x(k) − αk∇f(x(k))), ∇f(x(k))〉 = 0. Thus 〈∇f(x(k+1)), ∇f(x(k))〉 = 0.
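An illustrative numerical check of the proposition on a quadratic, where the exact step has the closed form derived later in the lecture; Q and b are arbitrary s.p.d. test data, not from the slides.

    import numpy as np

    Q = np.array([[3.0, 1.0], [1.0, 2.0]])    # arbitrary s.p.d. matrix
    b = np.array([1.0, -1.0])

    x = np.array([5.0, 5.0])
    g_prev = None
    for _ in range(4):
        g = Q @ x - b                          # gradient
        if g_prev is not None:
            print(g_prev @ g)                  # ~ 0: successive gradients orthogonal
        x = x - (g @ g) / (g @ Q @ g) * g      # exact line-search step
        g_prev = g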
Function decreasing
Proposition: If {x(k)}k≥0 is the steepest descent sequence for f : Rn → R and if ∇f(x(k)) ≠ 0, then f(x(k+1)) < f(x(k)).
Proof: As αk is the minimizer of φk(α) = f(x(k) − α∇f(x(k))) over all α ≥ 0, we have
φk(αk) ≤ φk(α)
for all α ≥ 0. By the chain rule, φ′k(0) = −‖∇f(x(k))‖² < 0, so there is an α̃ > 0 such that φk(0) > φk(α) for all α ∈ (0, α̃]. Hence
f(x(k+1)) = φk(αk) ≤ φk(α̃) < φk(0) = f(x(k)).
Example
f(x1, x2, x3) = (x1 − 4)^4 + (x2 − 3)^2 + 4(x3 + 5)^4
The initial point is x(0) = [4, 2, −1]T.
Step 1: compute ∇f(x(0)) = [0, −2, 1024]T, obtain α0 = 3.967 × 10⁻³ by the secant method, and get x(1) = [4.000, 2.008, −5.062]T.
Step 2: compute ∇f(x(1)) = [0.000, −1.984, −0.003875]T, obtain α1 = 0.5000 by the secant method, and get x(2) = [4.000, 3.000, −5.060]T.
Step 3: compute ∇f(x(2)) = [0.000, 0.000, −0.003525]T, obtain α2 = 16.29 by the secant method, and get x(3) = [4.000, 3.000, −5.002]T, which is close to the minimizer [4, 3, −5]T.
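The three steps can be reproduced numerically; in this sketch a bounded 1D minimizer stands in for the secant method used in the lecture, so the stepsizes may differ slightly.

    import numpy as np
    from scipy.optimize import minimize_scalar

    f = lambda x: (x[0] - 4) ** 4 + (x[1] - 3) ** 2 + 4 * (x[2] + 5) ** 4
    grad = lambda x: np.array([4 * (x[0] - 4) ** 3,
                               2 * (x[1] - 3),
                               16 * (x[2] + 5) ** 3])

    x = np.array([4.0, 2.0, -1.0])             # initial point from the example
    for k in range(3):
        g = grad(x)
        alpha = minimize_scalar(lambda a: f(x - a * g),
                                bounds=(0.0, 100.0), method='bounded').x
        x = x - alpha * g
        print(k + 1, x)                        # approaches [4, 3, -5]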
Steepest descent for quadratic programming
Consider f(x) = (1/2)xT Qx − bT x where Q is symmetric positive definite (s.p.d.). (If Q is not symmetric, replace it by its symmetric part (1/2)(Q + QT).)
By FONC, we have
∇f(x) = Qx − b = 0 → x = Q⁻¹b
and ∇²f(x) = Q is s.p.d., thus x∗ = Q⁻¹b is the unique global minimizer by SOSC.
Steepest descent iteration: start from any x(0), set
x(k+1) = x(k) − αk g(k)
where g(k) = ∇f(x(k)) = Qx(k) − b. Assume g(k) ≠ 0; then αk = arg min_{α ≥ 0} f(x(k) − αg(k)). By FONC, we have
〈∇f(x(k+1)), g(k)〉 = 0 ⇒ αk = (g(k)T g(k)) / (g(k)T Qg(k))
Hence
x(k+1) = x(k) − ((g(k)T g(k)) / (g(k)T Qg(k))) g(k), where g(k) = Qx(k) − b
Computational cost: two matrix–vector multiplications per iteration.
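A minimal sketch of this closed-form iteration; Q, b, and the tolerance are illustrative test data.

    import numpy as np

    def sd_quadratic(Q, b, x0, tol=1e-10, max_iter=10_000):
        """Steepest descent for f(x) = 0.5 x^T Q x - b^T x, Q s.p.d."""
        x = x0.astype(float)
        for _ in range(max_iter):
            g = Q @ x - b                       # first matrix-vector product
            if np.linalg.norm(g) < tol:
                break
            Qg = Q @ g                          # second matrix-vector product
            x = x - (g @ g) / (g @ Qg) * g      # closed-form exact step
        return x

    Q = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    print(sd_quadratic(Q, b, np.zeros(2)), np.linalg.solve(Q, b))  # should agree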
Example
f(x1, x2) = x1^2 + x2^2
It is easy to see that (0, 0)T is the solution. Starting from any initial point x(0), one finds that x(1) is the solution, i.e., the global solution is reached in one iteration.
[Figure: steepest descent iterates on the level sets of f(x) = 15x1^2 + x2^2.]
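An illustrative comparison of the two behaviors (the matrices below encode f = x1^2 + x2^2 and f = 15x1^2 + x2^2 in the form (1/2)xT Qx; the starting point is arbitrary):

    import numpy as np

    def sd_path(Q, x0, n):
        x = x0.astype(float)
        path = [x.copy()]
        for _ in range(n):
            g = Q @ x                                # gradient of 0.5 x^T Q x
            if np.linalg.norm(g) < 1e-14:
                break
            x = x - (g @ g) / (g @ Q @ g) * g        # exact step
            path.append(x.copy())
        return np.array(path)

    print(sd_path(np.diag([2.0, 2.0]), np.array([3.0, 1.0]), 5))   # one step to 0
    print(sd_path(np.diag([30.0, 2.0]), np.array([3.0, 1.0]), 5))  # zig-zags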
Convergence analysis
Convergence for quadratic programming
min (1/2)xT Qx − bT x
where Q is symmetric positive definite.
Theorem: In the steepest descent algorithm, we have x(k) → x∗ for any x(0).
Theorem: For the fixed-step-size gradient algorithm, x(k) → x∗ for any x(0) if and only if
0 < α < 2/λmax(Q).
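An illustrative check of the threshold 2/λmax(Q) (test matrix and stepsizes are mine), comparing a stepsize just below and just above it:

    import numpy as np

    Q = np.diag([10.0, 1.0])        # lambda_max = 10, threshold 2/10 = 0.2
    b = np.zeros(2)

    for alpha in (0.19, 0.21):      # just below / just above the threshold
        x = np.array([1.0, 1.0])
        for _ in range(200):
            x = x - alpha * (Q @ x - b)
        print(alpha, np.linalg.norm(x))   # tiny vs. huge distance to x* = 0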
Convergence rate
Definition: Given a sequence x(k) that converges to x∗, we say the order of convergence is p, p ∈ R, if
0 < lim ‖x(k+1) − x∗‖ / ‖x(k) − x∗‖^p < ∞
If p = 1 and lim ‖x(k+1) − x∗‖ / ‖x(k) − x∗‖ = 1, the convergence is sublinear;
if p = 1 and lim ‖x(k+1) − x∗‖ / ‖x(k) − x∗‖ = γ < 1, linear;
if p = 1 and lim ‖x(k+1) − x∗‖ / ‖x(k) − x∗‖ = 0, superlinear;
if p = 2, quadratic.
Convergence order
Recall that a = O(h) if there exists a constant c such that |a| ≤ c|h| for sufficiently small h.
Theorem: Let x(k) be a sequence that converges to x∗. If
‖x(k+1) − x∗‖ = O(‖x(k) − x∗‖^p),
then the order of convergence (if it exists) is at least p.
Examples
Let x(k) = 1/k, so x(k) → 0. Then
|x(k+1)| / |x(k)|^p = k^p / (k + 1).
If p < 1, it converges to 0; if p > 1, it grows to ∞; if p = 1, it converges to 1, hence sublinear convergence.
Let x(k) = γ^k, where 0 < γ < 1, so x(k) → 0. Then
|x(k+1)| / |x(k)|^p = γ^{k(1−p)} · γ.
If p < 1, it goes to 0; if p > 1, it grows to ∞; if p = 1, it converges to γ, hence linear convergence.
Let x(k) = γ^{q^k}, where q > 1 and 0 < γ < 1, so x(k) → 0. Then
|x(k+1)| / |x(k)|^p = γ^{q^{k+1}} / (γ^{q^k})^p = γ^{(q−p)q^k}.
If p < q, it converges to 0, whereas if p > q, it grows to ∞. If p = q, the sequence converges to 1. Hence the order of convergence is q.
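A quick numerical illustration of the linear case (γ = 0.5 is an arbitrary choice): for x(k) = γ^k the ratio with p = 1 is constant at γ.

    import numpy as np

    gamma = 0.5
    xs = gamma ** np.arange(1, 20)     # x_k = gamma^k
    print(xs[1:] / xs[:-1])            # all ratios equal gamma: linear rate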
Steepest descent for quadratic programming
Define e(k) = x(k) − x∗ and g(k) = Qx(k) − b = Qe(k).
Good cases:
If Qe(k) = λe(k), then e(k+1) = e(k) − (‖g(k)‖² / (λ‖g(k)‖²)) λe(k) = 0: convergence in one more iteration.
If Q has only one distinct eigenvalue (the level sets of f are circles), this holds for every e(k).
General case: define ‖e‖_Q = √(eT Qe) and κ = λmax(Q)/λmin(Q); then
‖e(k)‖_Q ≤ ((κ − 1)/(κ + 1))^k ‖e(0)‖_Q
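An illustrative verification of this bound for a diagonal test matrix with κ = 10 (data chosen by me, with b = 0 so that x∗ = 0):

    import numpy as np

    Q = np.diag([10.0, 1.0])                     # kappa = 10
    rate = (10.0 - 1.0) / (10.0 + 1.0)           # (kappa-1)/(kappa+1)
    qnorm = lambda e: np.sqrt(e @ Q @ e)

    x = np.array([1.0, 1.0])                     # error e = x since x* = 0
    e0 = qnorm(x)
    for k in range(1, 6):
        g = Q @ x
        x = x - (g @ g) / (g @ Q @ g) * g        # exact steepest descent step
        print(qnorm(x) <= rate ** k * e0)        # True at every iteration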
Gradient descent with fixed stepsize for general f
Consider a general differentiable f and the fixed-stepsize iteration
x(k+1) = x(k) − αg(k)
where g(k) = ∇f(x(k)). Assume a minimizer x∗ exists. Then
‖x(k+1) − x∗‖² = ‖x(k) − x∗ − αg(k)‖²
= ‖x(k) − x∗‖² − 2α〈g(k), x(k) − x∗〉 + α²‖g(k)‖²
If α²‖g(k)‖² ≤ 2α〈g(k), x(k) − x∗〉, then
‖x(k+1) − x∗‖² ≤ ‖x(k) − x∗‖²
As g∗ = ∇f(x∗) = 0, the condition is equivalent to
(α/2)‖g(k) − g∗‖² ≤ 〈g(k) − g∗, x(k) − x∗〉
Special case with convex and Lipschitz differentiable f
Definition: A function f is L-Lipschitz differentiable, L ≥ 0, if f ∈ C¹ and
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, ∀x, y ∈ Rn
Theorem (Baillon–Haddad)
If f ∈ C¹ is a convex function, then it is L-Lipschitz differentiable if and only if
‖∇f(x) − ∇f(y)‖² ≤ L〈∇f(x) − ∇f(y), x − y〉, ∀x, y ∈ Rn
(a gradient satisfying this inequality is called 1/L-cocoercive)
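An illustrative check of this inequality for the convex quadratic f(x) = (1/2)xT Qx, where ∇f(x) = Qx and L = λmax(Q) (test data chosen by me):

    import numpy as np

    rng = np.random.default_rng(0)
    Q = np.diag([10.0, 1.0])
    L = 10.0                                  # lambda_max(Q)

    for _ in range(5):
        x, y = rng.standard_normal(2), rng.standard_normal(2)
        dg, dx = Q @ x - Q @ y, x - y
        print(np.linalg.norm(dg) ** 2 <= L * (dg @ dx) + 1e-12)   # True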
Convergence analysis
Let f ∈ C¹ be convex and L-Lipschitz differentiable. If 0 < α < 2/L, then
(α/2)‖g(k) − g∗‖² ≤ 〈g(k) − g∗, x(k) − x∗〉
and thus ‖x(k+1) − x∗‖ ≤ ‖x(k) − x∗‖ for k = 0, 1, · · ·; the iteration stays bounded.
Let f ∈ C¹ be convex and L-Lipschitz differentiable. If 0 < α < 2/L, then both f(x(k)) and ‖∇f(x(k))‖ are monotonically decreasing, and
f(x(k)) − f(x∗) = O(1/k), ‖∇f(x(k))‖ = o(1/k)