Introductory Course on Non-smooth Optimisation
Transcript of Introductory Course on Non-smooth Optimisation
Introductory Course on Non-smooth Optimisation
Lecture 09 - Non-convex optimisation
Jingwei Liang
Department of Applied Mathematics and Theoretical Physics
Table of contents
1 Examples
2 Non-convex optimisation
3 Convex relaxation
4 Łojasiewicz inequality
5 Kurdyka-Łojasiewicz inequality
Compressed sensing
Forward observationb = Ax,
x ∈ Rn is sparse.
A : Rn → Rm withm << n.
Compressed sensingminx∈Rn
||x||0 s.t. Ax = b.
NB: NP-hard problem.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Image processing
Two-phase segmentation Given an image I, which consists of foreground and background, segmentthe foreground. Ideally,
I = fC + bΩ\C.
Mumford–Shah model
E(u, C) =
∫Ω
(u− I)2dx + λ
(∫Ω\C||∇u||2dx + α|C|
),
where |C| = peri(C).
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Principal component pursuit
Forward mixture modelw = x + y + ε,
where x ∈ Rm×n is κ-sparse, y ∈ Rm×n is σ-low-rank and ε is noise.
Non-convex PCPmin
x,y∈Rm×n
12 ||x + y − w||2
s.t. ||x||0 ≤ κ and rank(y) ≤ σ.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Neural networks
Each layer of NNs is convex
Linear operation, e.g. convolution.
Non-linear activation function, e.g. rectifier maxx,0.
The composition of convex functions is not necessarily convex...
Neural networks are universal function approximators.
Hence need to approximate non-convex functions.
Cannot approximate non-convex functions with convex functions.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Outline
1 Examples
2 Non-convex optimisation
3 Convex relaxation
4 Łojasiewicz inequality
5 Kurdyka-Łojasiewicz inequality
Non-convex optimisation
Non-convex problemAny problem that is not convex/concave is non-convex...
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Challenges
Potentially many local minima.
Saddle points.
Very flat regions.
Widely varying curvature.
NP-hard.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Outline
1 Examples
2 Non-convex optimisation
3 Convex relaxation
4 Łojasiewicz inequality
5 Kurdyka-Łojasiewicz inequality
Convex relaxation
Non-convex optimisation problem
minx
E(x).
Convex optimisation problem
minx
F(x).
What ifArgmin(F) ⊆ Argmin(E),
Subtle and case-dependent.
Somehow, finding F is almost equivalent to solving E.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Convex relaxation
Loose relaxation Ideal relaxation
In practice, it is easier to obtain
Argmin(E) ⊆ Argmin(F).
Loose relaxation will work if two global minima are close enough.
Ideal relaxation will fail if Argmin(F) is too large.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Convolution
For certain problems, non-convexity can be treated as noise...
Original function Convolution
Symmetric boundary condition for the convolution.
Almost convex problem after convolution.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Outline
1 Examples
2 Non-convex optimisation
3 Convex relaxation
4 Łojasiewicz inequality
5 Kurdyka-Łojasiewicz inequality
Smooth problem
Let F ∈ C1L.
Gradient descentxk+1 = xk − γ∇F(xk).
Descent propertyF(xk)− F(xk+1) ≥ γ(1− γL
2 )||∇F(xk)||2.
Let γ ∈]0, 2/L[,
γ(1− γL2 )∑k
i=0 ||∇F(xi)||2 ≤ F(x0)− F(xk+1) ≤ F(x0)− F(x?).
F(x?) > −∞, rhs is a positive constant.
for lhs, let k→ +∞,lim
k→+∞||∇F(xk)||2 = 0.
NB: for smooth case, a critical point is guarantee. For non-smooth problem...
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Semi-algebraic sets and functions
Semi-algebraic setA semi-algebraic subset of Rn is a finite union of sets of the form
x ∈ Rn : fi(x) = 0, gj(x) ≤ 0, i ∈ I, j ∈ J
where I, J are finite and fi, gj : Rn → R are real polynomial functions.
Stability under finite∩,∪ and complementation.
Semi-algebraic setA function or a mapping is semi-algebraic if its graph is a semi-algebraic set.
Same definition for real-extended function or multivalued mappings.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Properties
Tarski-SeidenbergThe image of a semi-algebraic set by a linear projection is semi-algebraic.
The closure of a semi-algebraic set A is semi-algebraic.
Example
The graph of the derivative of a semi-algebraic function is semi-algebraic.
Let A be a semi-algebraic subset of Rn and f : Rn → Rp semi-algebraic. Then f(A) issemi-algebraic.
g(x) = maxF(x, y) : y ∈ S is semi-algebraic if F and S are semi-algebraic.
Other examplesminx
12 ||Ax− b||2 + µ||x||p : p is rational,
minX
12 ||AX − B||2 + µrank(X).
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Subdifferential
Convex subdifferential R ∈ Γ0(Rn)
∂R(x) =g : R(x′) ≥ R(x) + 〈g, x′ − x〉, ∀x′ ∈ Rn.
Fréchet subdifferential
Given x ∈ dom(R), the Fréchet subdifferential ∂R(x) of R at x is the set of vectors v such that
lim infx′→x, x′ 6=x
1||x− x′||
(R(x′)− R(x)− 〈v, x′ − x〉
)≥ 0.
If x /∈ dom(R), then ∂R(x) = ∅.
Limiting subdifferentialThe limiting-subdifferential (or simply subdifferential) of R at x, written as ∂R(x), reads
∂R(x)def= v ∈ Rn : ∃xk → x, R(xk)→ R(x), vk ∈ ∂R(xk)→ v.
∂R is convex and ∂R is closed.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Critical points
Minimal norm subgradient
||∂R(x)||− = min||v|| : v ∈ ∂R(x).
Critical points
Fermat’s rule: if x is a minimiser of R, then 0 ∈ ∂R(x).
Conversely when 0 ∈ ∂R(x), the point x is called a critical point.
When R is convex, any minimiser is a global minimiser.
When R is non-convex
– Local minima.
– Local maxima.
– Saddle point.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Sharpness
SharpnessFunction R : Rn → R ∪ +∞ is called sharp on the slice
[a < R < b]def=x ∈ Rn : a < f(x) < b
.
If there exists α > 0 such that
||∂R(x)||− ≥ α, ∀x ∈ [a < R < b].
Norms, e.g. R(x) = ||x||.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Łojasiewicz inequality
Łojasiewicz inequalityLet R : Rn → R∪ +∞ be proper lower semi-continuous, and moreover continuous along itsdomain. Then R is said to have Łojasiewicz property if: for any critical point x, there exist C, ε > 0and θ ∈ [0, 1[ such that
|R(x)− R(x)|θ ≤ C||v||, ∀x ∈ Bx(ε), v ∈ ∂R(x).
By convention, let 00 = 0.
PropertySuppose that R has Łojasiewicz property.
If S is a connected subset of the set of critical points of R, that is 0 ∈ ∂R(x) for all x ∈ S, thenR is constant on S.
If in addition S is a compact set, then there exist C, ε > 0 and θ ∈ [0, 1[ such that
∀x ∈ Rn, dist(x, S) ≤ ε, ∀v ∈ ∂R(x) : |R(x)− R(x)|θ ≤ C||v||.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Non-convex PPA
Proximal point algorithmLet R : Rn → R∪ +∞ be proper and lower semi-continuous. From arbitrary x0 ∈ Rn,
xk+1 ∈ argminx γR(x) + 12 ||x− xk||2.
Assumption
R is proper, that isinfx∈Rn
R(x) > −∞.
This impliesargminx γR(x) + 1
2 ||x− xk||2
is non-empty and compact.
The restriction of R to its domain is a continuous function.
R has the Łojasiewicz property.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Property
PropertyLet xkk∈N be the sequence generated by non-convex PPA and ω(xk) the set of its limitingpoints. Then
Sequence R(xk)k∈N is decreasing.∑k ||xk − xk+1||2 < +∞.
If R satisfies assumption 2, then ω(xk) ⊂ crit(R).
If moreover, xkk∈N is bounded
ω(xk) is a non-empty compact set, and
dist(xk, ω(xk)
)→ 0.
If R satisfies assumption 2, then R is finite and constant on ω(xk).
NB: Boundedness can be guaranteed if R is coercive.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Convergence
Convergence of PPASuppose the sequence xkk∈N generated by non-convex PPA is bounded, then∑
k||xk − xk+1|| < +∞,
and the whole sequence converges to some critical point x ∈ crit(R).
From definition of xk+1: R(xk+1) + 12γ ||xk − xk+1||2 ≤ R(xk) .
Consider g(s) = s1−θ, s > 0: ∇g(s) = (1− θ)s−θ
g(R(xk)
)− g(R(xk+1)
)≥ (1− θ)
(R(xk+1)
)−θ(R(xk)− R(xk+1))
≥ (1− θ)(R(xk)
)−θ 12γ ||xk − xk+1||2.
WLOG, assume R(x) = 0 for x ∈ ω(xk). Let vk ∈ ∂R(xk), then for all k large enough
0 < R(xk)θ ≤ C||vk|| = Cγ||xk − xk−1||.
There existsM > 0||xk − xk+1||2||xk − xk−1||
≤ M(R(xk)1−θ − R(xk+1)
1−θ).Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Convergence
Convergence of PPASuppose the sequence xkk∈N generated by non-convex PPA is bounded, then∑
k||xk − xk+1|| < +∞,
and the whole sequence converges to some critical point x ∈ crit(R).
Take r ∈]0, 1[, if ||xk − xk+1|| ≥ r||xk − xk−1||, then
||xk − xk+1|| ≤ Mr(R(xk)1−θ − R(xk+1)
1−θ).For all k large enough
||xk − xk+1|| ≤ r||xk − xk−1||+ Mr(R(xk)1−θ − R(xk+1)
1−θ).There exists some K > 0, such that for k ≥ K∑k
i=K||xi − xi+1|| ≤ r1− r ||xK − xK−1||+ M
r(1− r)(R(xK)1−θ − R(xK+1)
1−θ).R(x) is bounded from below. Take k→ +∞...
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Rate of convergence
Convergence rateSuppose the convergence of the non-convex PPA is true. Denote θ the Łojasiewicz exponent ofx∞. The following statements hold
If θ = 0, then xkk∈N converges in finite number of steps.
If θ ∈]0, 1/2], then there exists η ∈]0, 1[ such that
||xk − x∞|| = O(ηk).
If θ ∈]1/2, 1[, then||xk − x∞|| = O(k−
1−θ2θ−1 ).
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Outline
1 Examples
2 Non-convex optimisation
3 Convex relaxation
4 Łojasiewicz inequality
5 Kurdyka-Łojasiewicz inequality
Kurdyka-Łojasiewicz inequality (KL)
Let R : Rn → R∪ +∞ be proper l.s.c. For a, b such that−∞ < a < b < +∞,
[a < R < b]def= x ∈ Rn : a < R(x) < b.
Kurdyka-Łojasiewicz inequalityR is said to have the KL property at x ∈ dom(R) if there exists η ∈]0,+∞], a neighbourhood Uof x and a continuous concave function ϕ : [0, η[→ R+ such that
ϕ(0) = 0.
ϕ is C1 on ]0, η[.
for all s ∈]0, η[, ϕ′(s) > 0.
for all x ∈ U ∩ [R(x) < R < R(x) + η], the KL inequality holds
ϕ′(R(x)− R(x)
)dist(0, ∂R(x)
)≥ 1.
Proper l.s.c. functions are KL at non-critical points.
Proper l.s.c. functions which satisfy KL at each point of dom(∂R) are called KL functions.
Typical KL functions are the class of semi-algebraic functions.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Kurdyka-Łojasiewicz functions
||∇F(x)|| ≥ 0 ||∇(ϕ F)(x)|| ≥ 1
When R(x) = 0, then the condition becomes
||∂(ϕ F)(x)||− ≥ 1.
ϕ is called a desingularising function for R, i.e. sharp up to reparameterization via ϕ.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Abstract descent methods
Let Φ be proper and lower semi-continuous. Suppose a sequence xkk∈N is generated such thatthe following conditions are satisfied.
ConditionsLet c, d > 0 be some constants
A.1 Sufficient decrease conditions For each k ∈ N,
Φ(xk+1) + c||xk+1 − xk||2 ≤ Φ(xk).
A.2 Relative error condition For each k ∈ N, there exists gk+1 ∈ ∂Φ(xk+1) such that
||gk+1|| ≤ d||xk+1 − xk||.
A.3 Continuity condition There exists a subsequence xkjj∈N and x such that
xkj → x, Φ(xkj)→ Φ(x).
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Convergence
ConvergenceLet Φ : Rn → R∪ +∞ be proper and l.s.c. and KL at some x ∈ Rn. Let U, η and ϕ be in the KLproperty. Let δ, ρ > 0 be such that Bx(δ) ⊂ U with ρ ∈]0, δ[. Consider a sequence xkk∈Nwhich satisfies (A.1)-(A.2). Suppose moreover
Φ(x) < Φ(x0) < Φ(x) + η,
||x0 − x||+ 2√
Φ(x0)−Φ(x)c + d
cϕ(Φ(x0)− Φ(x)
)< ρ,
and∀k ∈ N, xk ∈ Bx(ρ)⇒ xk+1 ∈ Bx(δ) with Φ(xk+1) ≥ Φ(x).
Then the sequence xkk∈N satisfies
∀k ∈ N, xk ∈ Bx(δ),∑k ||xk − xk+1|| < +∞,
Φ(xk)→ Φ(x).
and converges to a point x? ∈ Bx(δ) such that Φ(x?) ≤ Φ(x).
If moreover, (A.3) is true, then x? is a critical point and Φ(x?) = Φ(x).
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Convergence
Condition (A.1) implies that Φ(xk)k∈N is non-increasing, and for all k ∈ N
||xk+1 − xk|| ≤√
Φ(xk)−Φ(xk+1)c .
Condition (A.2) and KL inequality
ϕ′(Φ(xk)− Φ(x)
)≥ 1||gk||≥ 1
d||xk − xk−1||.
Since ϕ is concave,
ϕ(Φ(xk)− Φ(x)
)− ϕ
(Φ(xk)− Φ(x)
)≥ ϕ′
(Φ(xk)− Φ(x)
)(Φ(xk)− Φ(xk+1)
)≥ ϕ′
(Φ(xk)− Φ(x)
)c||xk − xk+1||2.
Combining the above two yields||xk − xk+1||2||xk − xk−1||
≤ dc(ϕ(Φ(xk)− Φ(x)
)− ϕ
(Φ(xk)− Φ(x)
)).
Apply the inequality 2√xy ≤ x + y,
2||xk − xk+1|| ≤ ||xk − xk−1||+ dc(ϕ(Φ(xk)− Φ(x))− ϕ(Φ(xk+1)− Φ(x))
).
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Convergence
Continue with (A.1),||x1 − x0|| ≤
√Φ(x0)−Φ(x1)
c ≤√
Φ(x0)−Φ(x)c .
Then||x1 − x|| ≤ ||x1 − x0||+ ||x0 − x|| ≤ ||x1 − x0||+
√Φ(x0)−Φ(x)
c ≤ ρ.
By induction, we can show that for all k ∈ N
xk ∈ Bx(ρ) and∑ki=1||xi+1 − xi||+ ||xk+1 − xk|| ≤ ||x1 − x0||+ d
c(ϕ(Φ(x1)− Φ(x))− ϕ(Φ(xk+1)− Φ(x))
).
The above directly implies∑k||xk − xk+1|| ≤ ||x1 − x0||+ d
cϕ(Φ(x1)− Φ(x)) < +∞.
Hence, there exists x? ∈ ω(xk)
xk → x?, gk → 0, Φ(xk)→ v ≥ Φ(x).
KL inequalityϕ′(v − Φ(x)
)||gk|| ≥ 1
indicates v = Φ(x). Lower semi-continuous yields Φ(x?) ≤ Φ(x).
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Forward–Backward splitting
Consider minimisingminx∈Rn
Φ(x)
def= R(x) + F(x)
,
R : Rn → R∪ +∞ is proper l.s.c. and bounded from below.
F : Rn → R is finite-valued, differentiable and∇F is L-Lipschitz.
Forward–Backward splittingLet γ ∈]0, 1/L[:
xk+1 ∈ proxγR(xk − γ∇F(xk)
).
Sufficient decreaseΦ(xk+1) + 1− γL
2γ ||xk − xk+1||2 ≤ Φ(xj).
Relative error gk+1def= 1
γ(xk − xk+1)−∇F(xk) +∇F(xk+1) ∈ ∂Φ(xk+1)
||gk+1|| ≤(1γ
+ L)||xk − xk+1||.
Continuity sequence xkk∈N is bounded.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
A coupled minimisation problem
A coupled problemConsider minimising
minx∈Rn,y∈Rm
E(x, y)
def= R(x) + F(x, y) + J(y)
,
R : Rn → R∪ +∞, J : Rm → R∪ +∞ are proper l.s.c. and bounded from below.
F : Rn × Rm → R is finite-valued, differentiable and∇F is L-Lipschitz.
Subdifferential
∂E(x, y) =∂R(x) +∇xF(x, y)
×∂J(x) +∇yF(x, y)
= ∂xE(x, y)× ∂yE(x, y).
Separate Lipschitz continuity for F: ∇xF is Lx-Lip. and∇yF is Ly-Lip.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Proximal alternating minimisation (PAM)
PAM is an alternating minimisation algorithm.
PAMLet γx, γy ∈]0, 1/L[:
xk+1 ∈ argminx∈Rn E(x, yk) + 12γx||x− xk||2,
yk+1 ∈ argminy∈Rm E(xk+1, y) + 12γy||y − yk||2.
PAM is an instance of PPA.
Convergence, let Φ(x, y) = E(x, y).
No closed form solution,
xk+1 ∈ argminx∈Rn E(x, yk) + 12γx||x− xk||2
= argminx∈Rn R(x) + F(x, yk) + 12γx||x− xk||2.
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Proximal alternating linearised minimisation (PALM)
PALM is linearised PAM.
F(x, yk) ≤ F(xk, yk) + 〈∇xF(xk, yk), x− xk〉+ 12γx||x− xk||2.
PALMLet γx, γy ∈]0, 1/L[:
xk+1 ∈ proxγxR(xk − γx∇xF(xk, yk)
),
yk+1 ∈ proxγy J(yk − γy∇yF(xk+1, yk)
).
PAM is an instance of Forward–Backward.
Convergence, let Φ(x, y) = E(x, y).
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Remarks
Converges to global minimiser if starts close enough.
Inertial acceleration can be applied to all of them.
Step-size v.s. inertial parameter.
Step-size and critical points.
Stochastic optimisation methods can escape saddle-point or find global minimiser...
Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019
Reference
H. Attouch, and J. Bolte. “On the convergence of the proximal algorithm for nonsmoothfunctions involving analytic features”. Mathematical Programming 116.1-2 (2009): 5-16.
H. Attouch, J. Bolte, and B. Svaiter. “Convergence of descent methods for semi-algebraic andtame problems: proximal algorithms, forward–backward splitting, and regularizedGauss–Seidel methods”. Mathematical Programming 137.1-2 (2013): 91-129.
H. Attouch, et al. “Proximal alternating minimization and projection methods for nonconvexproblems: An approach based on the Kurdyka-Łojasiewicz inequality”. Mathematics ofOperations Research 35.2 (2010): 438-457.
J. Bolte, S. Sabach, and M. Teboulle. “Proximal alternating linearized minimization ornonconvex and nonsmooth problems”. Mathematical Programming 146.1-2 (2014): 459-494.
J. Liang, J. Fadili, and G. Peyré. “A multi-step inertial forward-backward splitting method fornon-convex optimization”. Advances in Neural Information Processing Systems. 2016.