Introductory Course on Non-smooth Optimisation

Introductory Course on Non-smooth Optimisation

Lecture 09 - Non-convex optimisation

Jingwei Liang

Department of Applied Mathematics and Theoretical Physics

Table of contents

1 Examples

2 Non-convex optimisation

3 Convex relaxation

4 Łojasiewicz inequality

5 Kurdyka-Łojasiewicz inequality

Compressed sensing

Forward observationb = Ax,

x ∈ Rn is sparse.

A : Rn → Rm withm << n.

Compressed sensingminx∈Rn

||x||0 s.t. Ax = b.

NB: NP-hard problem.

Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019

Image processing

Two-phase segmentation Given an image I, which consists of foreground and background, segmentthe foreground. Ideally,

I = fC + bΩ\C.

Mumford–Shah model

E(u, C) =

∫Ω

(u− I)2dx + λ

(∫Ω\C||∇u||2dx + α|C|

),

where |C| = peri(C).


Principal component pursuit

Forward mixture modelw = x + y + ε,

where x ∈ Rm×n is κ-sparse, y ∈ Rm×n is σ-low-rank and ε is noise.

Non-convex PCPmin

x,y∈Rm×n

12 ||x + y − w||2

s.t. ||x||0 ≤ κ and rank(y) ≤ σ.


Neural networks

Each layer of NNs is convex

Linear operation, e.g. convolution.

Non-linear activation function, e.g. rectifier maxx,0.

The composition of convex functions is not necessarily convex...

Neural networks are universal function approximators.

Hence need to approximate non-convex functions.

Cannot approximate non-convex functions with convex functions.


Outline

1 Examples


3 Convex relaxation



Non-convex optimisation

Non-convex problemAny problem that is not convex/concave is non-convex...


Challenges

Potentially many local minima.

Saddle points.

Very flat regions.

Widely varying curvature.

NP-hard.


Outline

1 Examples


3 Convex relaxation



Convex relaxation

Non-convex optimisation problem

minx

E(x).

Convex optimisation problem

minx

F(x).

What ifArgmin(F) ⊆ Argmin(E),

Subtle and case-dependent.

Somehow, finding F is almost equivalent to solving E.


Convex relaxation

Loose relaxation Ideal relaxation

In practice, it is easier to obtain

Argmin(E) ⊆ Argmin(F).

Loose relaxation will work if two global minima are close enough.

Ideal relaxation will fail if Argmin(F) is too large.


Convolution

For certain problems, non-convexity can be treated as noise...

Original function Convolution

Symmetric boundary condition for the convolution.

Almost convex problem after convolution.


Outline

1 Examples


3 Convex relaxation



Smooth problem

Let F ∈ C1L.

Gradient descentxk+1 = xk − γ∇F(xk).

Descent propertyF(xk)− F(xk+1) ≥ γ(1− γL

2 )||∇F(xk)||2.

Let γ ∈]0, 2/L[,

γ(1− γL2 )∑k

i=0 ||∇F(xi)||2 ≤ F(x0)− F(xk+1) ≤ F(x0)− F(x?).

F(x?) > −∞, rhs is a positive constant.

for lhs, let k→ +∞,lim

k→+∞||∇F(xk)||2 = 0.

NB: for smooth case, a critical point is guarantee. For non-smooth problem...


Semi-algebraic sets and functions

Semi-algebraic setA semi-algebraic subset of Rn is a finite union of sets of the form

x ∈ Rn : fi(x) = 0, gj(x) ≤ 0, i ∈ I, j ∈ J

where I, J are finite and fi, gj : Rn → R are real polynomial functions.

Stability under finite∩,∪ and complementation.

Semi-algebraic setA function or a mapping is semi-algebraic if its graph is a semi-algebraic set.

Same definition for real-extended function or multivalued mappings.


Properties

Tarski-SeidenbergThe image of a semi-algebraic set by a linear projection is semi-algebraic.

The closure of a semi-algebraic set A is semi-algebraic.

Example

The graph of the derivative of a semi-algebraic function is semi-algebraic.

Let A be a semi-algebraic subset of Rn and f : Rn → Rp semi-algebraic. Then f(A) issemi-algebraic.

g(x) = maxF(x, y) : y ∈ S is semi-algebraic if F and S are semi-algebraic.

Other examplesminx

12 ||Ax− b||2 + µ||x||p : p is rational,

minX

12 ||AX − B||2 + µrank(X).


Subdifferential

Convex subdifferential R ∈ Γ0(Rn)

∂R(x) =g : R(x′) ≥ R(x) + 〈g, x′ − x〉, ∀x′ ∈ Rn.

Fréchet subdifferential

Given x ∈ dom(R), the Fréchet subdifferential ∂R(x) of R at x is the set of vectors v such that

lim infx′→x, x′ 6=x

1||x− x′||

(R(x′)− R(x)− 〈v, x′ − x〉

)≥ 0.

If x /∈ dom(R), then ∂R(x) = ∅.

Limiting subdifferentialThe limiting-subdifferential (or simply subdifferential) of R at x, written as ∂R(x), reads

∂R(x)def= v ∈ Rn : ∃xk → x, R(xk)→ R(x), vk ∈ ∂R(xk)→ v.

∂R is convex and ∂R is closed.


Critical points

Minimal norm subgradient

||∂R(x)||− = min||v|| : v ∈ ∂R(x).

Critical points

Fermat’s rule: if x is a minimiser of R, then 0 ∈ ∂R(x).

Conversely when 0 ∈ ∂R(x), the point x is called a critical point.

When R is convex, any minimiser is a global minimiser.

When R is non-convex

– Local minima.

– Local maxima.

– Saddle point.


Sharpness

SharpnessFunction R : Rn → R ∪ +∞ is called sharp on the slice

[a < R < b]def=x ∈ Rn : a < f(x) < b

.

If there exists α > 0 such that

||∂R(x)||− ≥ α, ∀x ∈ [a < R < b].

Norms, e.g. R(x) = ||x||.


Łojasiewicz inequality

Łojasiewicz inequalityLet R : Rn → R∪ +∞ be proper lower semi-continuous, and moreover continuous along itsdomain. Then R is said to have Łojasiewicz property if: for any critical point x, there exist C, ε > 0and θ ∈ [0, 1[ such that

|R(x)− R(x)|θ ≤ C||v||, ∀x ∈ Bx(ε), v ∈ ∂R(x).

By convention, let 00 = 0.

PropertySuppose that R has Łojasiewicz property.

If S is a connected subset of the set of critical points of R, that is 0 ∈ ∂R(x) for all x ∈ S, thenR is constant on S.

If in addition S is a compact set, then there exist C, ε > 0 and θ ∈ [0, 1[ such that

∀x ∈ Rn, dist(x, S) ≤ ε, ∀v ∈ ∂R(x) : |R(x)− R(x)|θ ≤ C||v||.


Non-convex PPA

Proximal point algorithmLet R : Rn → R∪ +∞ be proper and lower semi-continuous. From arbitrary x0 ∈ Rn,

xk+1 ∈ argminx γR(x) + 12 ||x− xk||2.

Assumption

R is proper, that isinfx∈Rn

R(x) > −∞.

This impliesargminx γR(x) + 1

2 ||x− xk||2

is non-empty and compact.

The restriction of R to its domain is a continuous function.

R has the Łojasiewicz property.


Property

PropertyLet xkk∈N be the sequence generated by non-convex PPA and ω(xk) the set of its limitingpoints. Then

Sequence R(xk)k∈N is decreasing.∑k ||xk − xk+1||2 < +∞.

If R satisfies assumption 2, then ω(xk) ⊂ crit(R).

If moreover, xkk∈N is bounded

ω(xk) is a non-empty compact set, and

dist(xk, ω(xk)

)→ 0.

If R satisfies assumption 2, then R is finite and constant on ω(xk).

NB: Boundedness can be guaranteed if R is coercive.


Convergence

Convergence of PPASuppose the sequence xkk∈N generated by non-convex PPA is bounded, then∑

k||xk − xk+1|| < +∞,

and the whole sequence converges to some critical point x ∈ crit(R).

From definition of xk+1: R(xk+1) + 12γ ||xk − xk+1||2 ≤ R(xk) .

Consider g(s) = s1−θ, s > 0: ∇g(s) = (1− θ)s−θ

g(R(xk)

)− g(R(xk+1)

)≥ (1− θ)

(R(xk+1)

)−θ(R(xk)− R(xk+1))

≥ (1− θ)(R(xk)

)−θ 12γ ||xk − xk+1||2.

WLOG, assume R(x) = 0 for x ∈ ω(xk). Let vk ∈ ∂R(xk), then for all k large enough

0 < R(xk)θ ≤ C||vk|| = Cγ||xk − xk−1||.

There existsM > 0||xk − xk+1||2||xk − xk−1||

≤ M(R(xk)1−θ − R(xk+1)

1−θ).Jingwei Liang, DAMTP Introduction to Non-smooth Optimisation March 13, 2019

Convergence

Convergence of PPASuppose the sequence xkk∈N generated by non-convex PPA is bounded, then∑

k||xk − xk+1|| < +∞,

and the whole sequence converges to some critical point x ∈ crit(R).

Take r ∈]0, 1[, if ||xk − xk+1|| ≥ r||xk − xk−1||, then

||xk − xk+1|| ≤ Mr(R(xk)1−θ − R(xk+1)

1−θ).For all k large enough

||xk − xk+1|| ≤ r||xk − xk−1||+ Mr(R(xk)1−θ − R(xk+1)

1−θ).There exists some K > 0, such that for k ≥ K∑k

i=K||xi − xi+1|| ≤ r1− r ||xK − xK−1||+ M

r(1− r)(R(xK)1−θ − R(xK+1)

1−θ).R(x) is bounded from below. Take k→ +∞...


Rate of convergence

Convergence rateSuppose the convergence of the non-convex PPA is true. Denote θ the Łojasiewicz exponent ofx∞. The following statements hold

If θ = 0, then xkk∈N converges in finite number of steps.

If θ ∈]0, 1/2], then there exists η ∈]0, 1[ such that

||xk − x∞|| = O(ηk).

If θ ∈]1/2, 1[, then||xk − x∞|| = O(k−

1−θ2θ−1 ).


Outline

1 Examples


3 Convex relaxation



Kurdyka-Łojasiewicz inequality (KL)

Let R : Rn → R∪ +∞ be proper l.s.c. For a, b such that−∞ < a < b < +∞,

[a < R < b]def= x ∈ Rn : a < R(x) < b.

Kurdyka-Łojasiewicz inequalityR is said to have the KL property at x ∈ dom(R) if there exists η ∈]0,+∞], a neighbourhood Uof x and a continuous concave function ϕ : [0, η[→ R+ such that

ϕ(0) = 0.

ϕ is C1 on ]0, η[.

for all s ∈]0, η[, ϕ′(s) > 0.

for all x ∈ U ∩ [R(x) < R < R(x) + η], the KL inequality holds

ϕ′(R(x)− R(x)

)dist(0, ∂R(x)

)≥ 1.

Proper l.s.c. functions are KL at non-critical points.

Proper l.s.c. functions which satisfy KL at each point of dom(∂R) are called KL functions.

Typical KL functions are the class of semi-algebraic functions.


Kurdyka-Łojasiewicz functions

||∇F(x)|| ≥ 0 ||∇(ϕ F)(x)|| ≥ 1

When R(x) = 0, then the condition becomes

||∂(ϕ F)(x)||− ≥ 1.

ϕ is called a desingularising function for R, i.e. sharp up to reparameterization via ϕ.


Abstract descent methods

Let Φ be proper and lower semi-continuous. Suppose a sequence xkk∈N is generated such thatthe following conditions are satisfied.

ConditionsLet c, d > 0 be some constants

A.1 Sufficient decrease conditions For each k ∈ N,

Φ(xk+1) + c||xk+1 − xk||2 ≤ Φ(xk).

A.2 Relative error condition For each k ∈ N, there exists gk+1 ∈ ∂Φ(xk+1) such that

||gk+1|| ≤ d||xk+1 − xk||.

A.3 Continuity condition There exists a subsequence xkjj∈N and x such that

xkj → x, Φ(xkj)→ Φ(x).


Convergence

ConvergenceLet Φ : Rn → R∪ +∞ be proper and l.s.c. and KL at some x ∈ Rn. Let U, η and ϕ be in the KLproperty. Let δ, ρ > 0 be such that Bx(δ) ⊂ U with ρ ∈]0, δ[. Consider a sequence xkk∈Nwhich satisfies (A.1)-(A.2). Suppose moreover

Φ(x) < Φ(x0) < Φ(x) + η,

||x0 − x||+ 2√

Φ(x0)−Φ(x)c + d

cϕ(Φ(x0)− Φ(x)

)< ρ,

and∀k ∈ N, xk ∈ Bx(ρ)⇒ xk+1 ∈ Bx(δ) with Φ(xk+1) ≥ Φ(x).

Then the sequence xkk∈N satisfies

∀k ∈ N, xk ∈ Bx(δ),∑k ||xk − xk+1|| < +∞,

Φ(xk)→ Φ(x).

and converges to a point x? ∈ Bx(δ) such that Φ(x?) ≤ Φ(x).

If moreover, (A.3) is true, then x? is a critical point and Φ(x?) = Φ(x).


Convergence

Condition (A.1) implies that Φ(xk)k∈N is non-increasing, and for all k ∈ N

||xk+1 − xk|| ≤√

Φ(xk)−Φ(xk+1)c .

Condition (A.2) and KL inequality

ϕ′(Φ(xk)− Φ(x)

)≥ 1||gk||≥ 1

d||xk − xk−1||.

Since ϕ is concave,

ϕ(Φ(xk)− Φ(x)

)− ϕ

(Φ(xk)− Φ(x)

)≥ ϕ′

(Φ(xk)− Φ(x)

)(Φ(xk)− Φ(xk+1)

)≥ ϕ′

(Φ(xk)− Φ(x)

)c||xk − xk+1||2.

Combining the above two yields||xk − xk+1||2||xk − xk−1||

≤ dc(ϕ(Φ(xk)− Φ(x)

)− ϕ

(Φ(xk)− Φ(x)

)).

Apply the inequality 2√xy ≤ x + y,

2||xk − xk+1|| ≤ ||xk − xk−1||+ dc(ϕ(Φ(xk)− Φ(x))− ϕ(Φ(xk+1)− Φ(x))

).


Convergence

Continue with (A.1),||x1 − x0|| ≤

√Φ(x0)−Φ(x1)

c ≤√

Φ(x0)−Φ(x)c .

Then||x1 − x|| ≤ ||x1 − x0||+ ||x0 − x|| ≤ ||x1 − x0||+

√Φ(x0)−Φ(x)

c ≤ ρ.

By induction, we can show that for all k ∈ N

xk ∈ Bx(ρ) and∑ki=1||xi+1 − xi||+ ||xk+1 − xk|| ≤ ||x1 − x0||+ d

c(ϕ(Φ(x1)− Φ(x))− ϕ(Φ(xk+1)− Φ(x))

).

The above directly implies∑k||xk − xk+1|| ≤ ||x1 − x0||+ d

cϕ(Φ(x1)− Φ(x)) < +∞.

Hence, there exists x? ∈ ω(xk)

xk → x?, gk → 0, Φ(xk)→ v ≥ Φ(x).

KL inequalityϕ′(v − Φ(x)

)||gk|| ≥ 1

indicates v = Φ(x). Lower semi-continuous yields Φ(x?) ≤ Φ(x).


Forward–Backward splitting

Consider minimisingminx∈Rn

Φ(x)

def= R(x) + F(x)

,

R : Rn → R∪ +∞ is proper l.s.c. and bounded from below.

F : Rn → R is finite-valued, differentiable and∇F is L-Lipschitz.

Forward–Backward splittingLet γ ∈]0, 1/L[:

xk+1 ∈ proxγR(xk − γ∇F(xk)

).

Sufficient decreaseΦ(xk+1) + 1− γL

2γ ||xk − xk+1||2 ≤ Φ(xj).

Relative error gk+1def= 1

γ(xk − xk+1)−∇F(xk) +∇F(xk+1) ∈ ∂Φ(xk+1)

||gk+1|| ≤(1γ

+ L)||xk − xk+1||.

Continuity sequence xkk∈N is bounded.


A coupled minimisation problem

A coupled problemConsider minimising

minx∈Rn,y∈Rm

E(x, y)

def= R(x) + F(x, y) + J(y)

,

R : Rn → R∪ +∞, J : Rm → R∪ +∞ are proper l.s.c. and bounded from below.

F : Rn × Rm → R is finite-valued, differentiable and∇F is L-Lipschitz.

Subdifferential

∂E(x, y) =∂R(x) +∇xF(x, y)

×∂J(x) +∇yF(x, y)

= ∂xE(x, y)× ∂yE(x, y).

Separate Lipschitz continuity for F: ∇xF is Lx-Lip. and∇yF is Ly-Lip.


Proximal alternating minimisation (PAM)

PAM is an alternating minimisation algorithm.

PAMLet γx, γy ∈]0, 1/L[:

xk+1 ∈ argminx∈Rn E(x, yk) + 12γx||x− xk||2,

yk+1 ∈ argminy∈Rm E(xk+1, y) + 12γy||y − yk||2.

PAM is an instance of PPA.

Convergence, let Φ(x, y) = E(x, y).

No closed form solution,

xk+1 ∈ argminx∈Rn E(x, yk) + 12γx||x− xk||2

= argminx∈Rn R(x) + F(x, yk) + 12γx||x− xk||2.


Proximal alternating linearised minimisation (PALM)

PALM is linearised PAM.

F(x, yk) ≤ F(xk, yk) + 〈∇xF(xk, yk), x− xk〉+ 12γx||x− xk||2.

PALMLet γx, γy ∈]0, 1/L[:

xk+1 ∈ proxγxR(xk − γx∇xF(xk, yk)

),

yk+1 ∈ proxγy J(yk − γy∇yF(xk+1, yk)

).

PAM is an instance of Forward–Backward.

Convergence, let Φ(x, y) = E(x, y).


Remarks

Converges to global minimiser if starts close enough.

Inertial acceleration can be applied to all of them.

Step-size v.s. inertial parameter.

Step-size and critical points.

Stochastic optimisation methods can escape saddle-point or find global minimiser...


Reference

H. Attouch, and J. Bolte. “On the convergence of the proximal algorithm for nonsmoothfunctions involving analytic features”. Mathematical Programming 116.1-2 (2009): 5-16.

H. Attouch, J. Bolte, and B. Svaiter. “Convergence of descent methods for semi-algebraic andtame problems: proximal algorithms, forward–backward splitting, and regularizedGauss–Seidel methods”. Mathematical Programming 137.1-2 (2013): 91-129.

H. Attouch, et al. “Proximal alternating minimization and projection methods for nonconvexproblems: An approach based on the Kurdyka-Łojasiewicz inequality”. Mathematics ofOperations Research 35.2 (2010): 438-457.

J. Bolte, S. Sabach, and M. Teboulle. “Proximal alternating linearized minimization ornonconvex and nonsmooth problems”. Mathematical Programming 146.1-2 (2014): 459-494.

J. Liang, J. Fadili, and G. Peyré. “A multi-step inertial forward-backward splitting method fornon-convex optimization”. Advances in Neural Information Processing Systems. 2016.

Introductory Course on Non-smooth Optimisation

Documents

Transcript of Introductory Course on Non-smooth Optimisation