Slope and geometry in variational mathematics
Dmitriy Drusvyatskiy, School of ORIE, Cornell University
Joint work with Aris Daniilidis (Barcelona), Alex D. Ioffe (Technion), Martin Larsson (Lausanne),
and Adrian S. Lewis (Cornell)
January 29, 2013
Fix a metric space (X, d) and a function f : X → R.

Slope: “Fastest instantaneous rate of decrease”

    |∇f|(x̄) := limsup_{x → x̄} (f(x̄) − f(x)) / d(x, x̄).

• If f : Rⁿ → R is smooth, then |∇f|(x̄) = |∇f(x̄)|.

Critical points:

    x̄ is critical for f ⇐⇒ |∇f|(x̄) = 0.

Eg: [figure] the origin is critical.
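A quick numerical sanity check of the definition (my own illustration, not from the talk; the quadratic test function and the random sampling scheme are assumptions of the sketch):

```python
import numpy as np

def slope_estimate(f, x0, radius=1e-4, n_samples=50000, seed=0):
    """Crude estimate of |grad f|(x0) = limsup_{x -> x0} (f(x0) - f(x)) / d(x, x0)
    by sampling points in a small ball around x0."""
    rng = np.random.default_rng(seed)
    x0 = np.asarray(x0, dtype=float)
    best = 0.0
    for _ in range(n_samples):
        step = rng.normal(size=x0.shape)
        step *= radius * rng.random() / np.linalg.norm(step)   # small random step
        best = max(best, (f(x0) - f(x0 + step)) / np.linalg.norm(step))
    return best

f = lambda p: p[0] ** 2 + 3.0 * p[1] ** 2            # smooth test function
x0 = np.array([1.0, -0.5])
print(slope_estimate(f, x0))                          # roughly 3.6
print(np.linalg.norm([2 * x0[0], 6 * x0[1]]))         # |grad f(x0)| = sqrt(13) = 3.605...
```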
Method of alternating projections

Common problem:
    Given sets A, B ⊂ Rⁿ, find some point x ∈ A ∩ B.

[figure: sets A and B meeting at a point x]

Distance and projection:
    d_B(x) = min_{y ∈ B} |x − y|   and   P_B(x) = {nearest points of B to x}.

Finding points in P_A and P_B is often easy!

Eg 1 (simple example): Linear programming:
    {x : x ≥ 0} ∩ {x : Ax = b}.

Eg 2 (more interesting): Low-order control:
    {X ⪰ 0 : rank X ≤ r} ∩ {X : A(X) = b}.
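For instance (standard closed-form projections, not specific to the talk), both sets in Eg 1 are easy to project onto: assuming the matrix A has full row rank,

    P_{x ≥ 0}(x) = max(x, 0) componentwise,    P_{Ax = b}(x) = x − Aᵀ(AAᵀ)⁻¹(Ax − b).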
Method of alternating projections

Method of alternating projections (von Neumann ’33):
    x_{k+1} ∈ P_B(x_k),    x_{k+2} ∈ P_A(x_{k+1}).

[figure: the iterates zig-zagging between A and B, starting from x_0]

The “angle” between A and B drives the convergence!

Quantifying the angle:
    ψ(x, y) := |x − y|  if x ∈ A, y ∈ B;   +∞  otherwise.

Comparison of |∇ψ_y|(x) and |∇ψ_x|(y) quantifies the angle!
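A minimal sketch of the iteration on the linear-programming pair from the previous slide (my own illustrative code, not the talk's; here A, b denote the linear system data, A is assumed to have full row rank, and the two sets are assumed to intersect):

```python
import numpy as np

def proj_nonneg(x):
    """Nearest point of the set {x : x >= 0}: clamp negative entries."""
    return np.maximum(x, 0.0)

def proj_affine(x, A, b):
    """Nearest point of the set {x : Ax = b}; assumes A has full row rank."""
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)

def alternating_projections(x0, A, b, iters=200):
    x = x0
    for _ in range(iters):
        x = proj_affine(x, A, b)      # x_{k+1} in P_B(x_k)
        x = proj_nonneg(x)            # x_{k+2} in P_A(x_{k+1})
    return x

A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
x = alternating_projections(np.array([3.0, -2.0, 0.5]), A, b)
print(x, A @ x - b)                   # a nonnegative point with Ax close to b
```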
Outline
• Variational geometry:
  — Subdifferentials
  — The semi-algebraic case
  — Slope and error bounds
• Applications:
  — Alternating projections and transversality
  — Steepest descent trajectories
• To smooth or not to smooth?
• Active sets in optimization
Subdifferentials

Fréchet subdifferential:
    v ∈ ∂f(x̄) ⇐⇒ x̄ is critical for f − ⟨v, ·⟩,
or equivalently
    v ∈ ∂f(x̄) ⇐⇒ f(x) ≥ f(x̄) + ⟨v, x − x̄⟩ + o(|x − x̄|).

• Eg: If f is smooth, then ∂f = {∇f}.

Example:
    ∂(−|·|)(x) = {1}  if x < 0,    ∅  if x = 0,    {−1}  if x > 0.

[figure: y = −|x|]

Subdifferential graph:
    gph ∂f := {(x, v) ∈ Rⁿ × Rⁿ : v ∈ ∂f(x)}.
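A complementary example (standard, added here for contrast with −|·|): for f = |·| on R, the defining inequality |x| ≥ |x̄| + v (x − x̄) + o(|x − x̄|) gives

    ∂|·|(0) = [−1, 1]   and   ∂|·|(x) = {sign(x)} for x ≠ 0,

so the Fréchet subdifferential can be a full interval, a singleton, or, as for −|·| at the origin, empty.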
Subdifferentials

Fundamental theoretical and computational hurdle:
  1. |∇f| is not lsc,
  2. gph ∂f is not closed.

Limiting slope:
    \overline{|∇f|}(x̄) := liminf_{x → x̄} |∇f|(x).

Limiting subdifferential:
    gph \overline{∂}f := cl (gph ∂f).

Relationship (Ioffe ’00):
    \overline{|∇f|}(x̄) = dist(0, \overline{∂}f(x̄)).
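To see why the closure matters, take f = −|·| from the previous slide (a standard example, spelled out here as an illustration): the Fréchet subdifferential at 0 is empty, yet

    \overline{∂}f(0) = {−1, 1}   and   \overline{|∇f|}(0) = 1 = dist(0, \overline{∂}f(0)),

since the points (x, 1) for x < 0 and (x, −1) for x > 0 lie in gph ∂f and converge to (0, ±1).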
Subdifferentials

Are these notions adequate?

Pathology: gph ∂f can be very large!

What do we expect the size of gph ∂f to be?

• If f : Rⁿ → R is smooth, then gph ∂f is an n-dimensional smooth manifold.
• If f : Rⁿ → R is convex, then gph ∂f is an n-dimensional Lipschitz manifold (Minty ’62).

Multiple authors (Rockafellar, Borwein, Wang, . . . ):
There are functions f : Rⁿ → R with “2n-dimensional” gph ∂f.
Semi-algebraic geometry

Q ⊂ Rⁿ is semi-algebraic if it is a finite union of
solution sets to finitely many polynomial inequalities.

Semi-algebraicity is robust (Tarski–Seidenberg theorem).
Eg: f semi-algebraic =⇒ gph ∂f and |∇f| are semi-algebraic.

Semi-algebraic Q “stratify” into finitely many manifolds {M_i}.

Dimension:
    dim Q := max_{i=1,...,k} dim M_i.
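A toy illustration (mine, not from the slides): the closed unit disk Q = {(x, y) : x² + y² ≤ 1} is semi-algebraic, and it stratifies into the open disk (a manifold of dimension 2) and the boundary circle (dimension 1), so dim Q = 2.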
Semi-algebraic geometry

Theorem (D-Ioffe-Lewis)
For semi-algebraic f : Rⁿ → R, we have
    dim gph ∂f = n,
even locally around any point in gph ∂f.

Conclusion: Criticality is meaningful for concrete variational problems!

Semi-algebraic f have only finitely many critical values (cf. Sard’s Theorem)
=⇒ intervals (a, b) of non-critical values.

What can we learn from non-criticality?
Slope and error bounds

Common problem: Estimate
    dist(x, [f ≤ r])      (difficult).

“The residual”:
    f(x) − r              (easy).

Desirable quality: There exists κ with
    dist(x, [f ≤ r]) ≤ κ (f(x) − r).

Restrict f : Rⁿ → R to a “slice” f⁻¹(a, b).

Lemma (Error bound)
The following are equivalent.
Non-criticality:
    |∇f| ≥ 1/κ.
Error bound:
    dist(x, [f ≤ r]) ≤ κ (f(x) − r),   whenever r ∈ (a, f(x)).
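A one-line worked instance (my illustration): for f(x) = α|x| on R with α > 0, the slope is |∇f| ≡ α away from the origin, so κ = 1/α, and indeed for 0 ≤ r < f(x),

    dist(x, [f ≤ r]) = |x| − r/α = (f(x) − r)/α = κ (f(x) − r).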
Alternating projections & transversality
Convergence of alternating projections

Indicator function:
    δ_A(x) = 0 if x ∈ A;   ∞ if x ∉ A.

Coupling function:
    ψ(x, y) = δ_A(x) + |x − y| + δ_B(y).

[figure: the alternating projection iterates between A and B, starting from x_0]

Sufficient condition: There exists κ > 0 with
    max { |∇ψ_x|(y), |∇ψ_y|(x) } ≥ κ
for all x ∈ A and y ∈ B, not in A ∩ B.   (Proof: Error bound)

Equivalently: For v = (x − y)/|x − y|, either
    dist(v, ∂δ_B(y)) ≥ κ   or   dist(−v, ∂δ_A(x)) ≥ κ.

Normal cone: N_B(y) := ∂δ_B(y).
Convergence of alternating projections

    dist(v, N_B(y)) ≥ κ   or   dist(v, −N_A(x)) ≥ κ.

[Figure: normal cones to a set Q at the points x_1, . . . , x_7]

Transversality: N_A(x̄) ∩ −N_B(x̄) = {0}.

Local convergence (D-Ioffe-Lewis ’13):
    A and B transverse at x̄ =⇒ local R-linear convergence.
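Two illustrative cases (mine, not the talk's): if A and B are distinct lines through the origin in R², their normal cones at 0 are distinct lines, so N_A(0) ∩ −N_B(0) = {0} and the theorem gives local R-linear convergence. If instead A is a circle and B a line tangent to it at x̄, then N_A(x̄) and −N_B(x̄) are the same normal line, transversality fails, and the linear-rate guarantee above does not apply.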
Steepest descent trajectories
Steepest descent trajectories

Notation: X is a complete metric space.

Bounded speed: A curve x : [0, T] → X is 1-Lipschitz if
    dist(x(t), x(s)) ≤ |t − s|.

What are trajectories of steepest descent?

Motivation: 1-Lipschitz curves x : [0, T] → X satisfy
    |∇(f ∘ x)| ≤ |∇f|(x).

Definition (Trajectories of near-steepest descent)
A curve x : [0, T] → X is a trajectory of near-steepest descent if
  • x is 1-Lipschitz,
  • f ∘ x is decreasing,
  • |(f ∘ x)′| ≥ |∇f|(x), a.e. on [0, T].
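In the smooth case the motivating inequality is just the chain rule and Cauchy–Schwarz (spelled out here for orientation): if f is C¹ on Rⁿ and x(·) is 1-Lipschitz, then for a.e. t,

    |(f ∘ x)′(t)| = |⟨∇f(x(t)), ẋ(t)⟩| ≤ |∇f(x(t))| · |ẋ(t)| ≤ |∇f|(x(t)),

so along a unit-speed curve the function can decrease no faster than the slope allows; trajectories of near-steepest descent are precisely the curves achieving this bound.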
Example
Figure: f(x, y) = max{x + y, |x − y|} + x(x + 1) + y(y + 1) + 100
Subgradient dynamical systems

Theorem (D-Ioffe-Lewis)
For reasonable f : Rⁿ → R and a curve x : [0, T] → Rⁿ the following are equivalent.
  1. x is a curve of near-steepest descent,
  2. f ∘ x is decreasing and (after reparametrizing)
         ẋ ∈ −∂f(x), a.e. on [0, T].

Remark: When 2. holds,
    ẋ is the shortest element of −∂f(x), a.e. on [0, T].

Reasonable conditions: f is smooth, convex, or semi-algebraic.
Existence

Theorem (Ambrosio et al. ’05, De Giorgi ’93)
For reasonable f : X → R, there exist curves of near-steepest descent
starting from any point.

Proof ingredients:
• Moreau-Yosida approximation:
      x_{k+1} = argmin_{x ∈ X} { f(x) + (1/(2τ)) d²(x, x_k) }.
• Extraneous topologies =⇒ existence of minimizers and convergence.

Proof is opaque and uses heavy machinery!
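A hedged numerical sketch of the resulting proximal-point iteration (the test function, the value of τ, and the use of a generic solver are illustrative choices, not part of the proof):

```python
import numpy as np
from scipy.optimize import minimize

def prox_step(f, x_k, tau):
    """Approximate argmin_x { f(x) + (1/(2*tau)) * |x - x_k|^2 }."""
    obj = lambda x: f(x) + np.sum((x - x_k) ** 2) / (2.0 * tau)
    return minimize(obj, x_k, method="Nelder-Mead").x

f = lambda x: abs(x[0] - 1.0) + 0.5 * x[1] ** 2      # nonsmooth test function
x = np.array([4.0, 3.0])
for _ in range(30):                                   # discrete descent trajectory
    x = prox_step(f, x, tau=0.2)
print(x)                                              # approaches the minimizer (1, 0)
```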
New proof idea: Error bound lemma

Equipartition: For η > 0, partition the interval of values
    f(x_0) − η = τ_k < · · · < τ_1 < τ_0 = f(x_0).

Initialize: j = 0;
while j < k do
    x_{j+1} ← P_{[f ≤ τ_{j+1}]}(x_j);
end

Consider the resulting trajectories as k → ∞.
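A rough numerical sketch of this construction (mine; it uses a smooth test function and a generic constrained solver in place of exact projections onto the sublevel sets):

```python
import numpy as np
from scipy.optimize import minimize

def proj_sublevel(f, x, tau):
    """Approximate nearest point of [f <= tau] to x."""
    res = minimize(lambda y: np.sum((y - x) ** 2), x, method="SLSQP",
                   constraints=[{"type": "ineq", "fun": lambda y: tau - f(y)}])
    return res.x

f = lambda y: y[0] ** 2 + 3.0 * y[1] ** 2
x0 = np.array([2.0, 1.5])
eta, k = f(x0) - 0.1, 20
levels = np.linspace(f(x0), f(x0) - eta, k + 1)       # tau_0 > tau_1 > ... > tau_k
x = x0
for tau in levels[1:]:
    x = proj_sublevel(f, x, tau)                       # x_{j+1} = P_{[f <= tau_{j+1}]}(x_j)
print(x, f(x))                                         # ends up near the 0.1-sublevel set
```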
Lengths of steepest descent curves

One motivation: Algorithm complexity (Eg: Attouch et al. ’11).

Theorem (D-Ioffe-Lewis)
If f is semi-algebraic, then any bounded trajectory of near-steepest descent
(with maximal domain) has bounded length and converges to a critical point.

Historic remarks:
• Łojasiewicz famously proved it for real analytic functions.
• True for convex f (Daniilidis et al. ’12).
• Not true for C∞ functions (Palis, de Melo).
To smooth or not to smooth?
Approximation of functions

Mollification:
    f_ε(x) := (f ∗ φ_ε)(x) = ∫_{Rⁿ} φ_ε(x − y) f(y) dy,
where φ_ε is a scaled standard bump function φ.

• Downside: Not much control on derivatives.
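A hedged one-dimensional illustration of the mollification on a grid (the bump, the grid, and f(x) = |x| are my choices):

```python
import numpy as np

def bump(t, eps):
    """Scaled standard bump phi_eps, supported on (-eps, eps)."""
    out = np.zeros_like(t)
    inside = np.abs(t) < eps
    out[inside] = np.exp(-1.0 / (1.0 - (t[inside] / eps) ** 2))
    return out

h, eps = 0.01, 0.2
x = np.arange(-2.0, 2.0, h)
phi = bump(np.arange(-eps, eps, h), eps)
phi /= phi.sum() * h                                   # normalize so that phi_eps integrates to 1
f_vals = np.abs(x)                                     # f(x) = |x|, nonsmooth at the origin
f_eps = np.convolve(f_vals, phi, mode="same") * h      # discrete version of f * phi_eps
print(np.max(np.abs(f_eps - f_vals)[50:-50]))          # uniformly small away from the grid edges
```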
Approximation of functions

Set-up: Q ↪ Rⁿ together with f : Rⁿ → R.

[figure: a stratified set Q]

Assume Q is a disjoint union of C∞ manifolds
    Q = M_1 ∪ M_2 ∪ . . . ∪ M_{k−1} ∪ M_k.
Approximation of functions

Motivation: Integration by parts
    ∫_Q g Δf dx = ∫_{∂Q} g ⟨∇f, n⟩ dS − ∫_Q ⟨∇g, ∇f⟩ dx.

Goal: approximate f by a C²-smooth f̃ so that ∇f̃ ⊥ n.

Theorem (D-Larsson)
Given a continuous f : Rⁿ → R and any ε > 0, there exists a C¹-smooth f̃ satisfying
  1. Closeness: |f̃(x) − f(x)| < ε for all x ∈ Rⁿ,
  2. Neumann boundary condition:
         x ∈ M_i =⇒ ∇f̃(x) ∈ T_x M_i,
provided {M_i} is a Whitney stratification of Q.

• For semi-algebraic Q, Whitney stratifications always exist!
Approximation of functions

What about higher order of smoothness? Need a more stringent stratification.

Definition (Normally flat stratification)
The stratification {M_i} is normally flat if M_i ⊂ cl M_j implies
    P_{M_i}(x) = P_{M_i} ∘ P_{M_j}(x)   for all x near M_i and M_j.

[figure: strata M_i and M_j, a point x, its projection P_{M_j}(x), and P_{M_i}(x) = P_{M_i} ∘ P_{M_j}(x)]

• If {M_i} is normally flat, one can guarantee that f̃ is C∞-smooth!

Example: the partition of a polyhedron into open faces is normally flat.
Approximation of functions

Any other examples?

Notation:
• Sⁿ are the n × n symmetric matrices.
• λ : Sⁿ → Rⁿ is the eigenvalue map
      λ(A) = (λ_1(A), . . . , λ_n(A)).

Normally flat stratifications “lift” (D-Larsson ’12):
    A symmetric stratification {M_i} of Q ⊂ Rⁿ is normally flat
    ⇐⇒ the stratification {λ⁻¹(M_i)} of λ⁻¹(Q) ⊂ Sⁿ is normally flat.
Approximation of functions

Examples:
    Sⁿ₊ = {X ∈ Sⁿ : X ⪰ 0},
    Rⁿˣⁿ₊ = {X ∈ Rⁿˣⁿ : det(X) ≥ 0},
    Bⁿ₂ = {X ∈ Rⁿˣⁿ : σ_max(X) ≤ 1}.

=⇒ Strongest approximation results become available!

Applications:
1. Probability: properties of the matrix-valued Bessel process (Larsson ’12).
2. PDEs: Geometry of Sobolev spaces.
3. More forthcoming...
Active sets in optimization.
Active sets in optimization

Figure: Q is the 4 × 4 Toeplitz spectrahedron

Definition (Partial Smoothness)
A convex set Q is partly smooth relative to M ⊂ Q if
  1. (Smoothness) M is a smooth manifold,
  2. (Sharpness) N_M = span N_Q on M,
  3. (Continuity) N_Q varies continuously on M.
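A simple instance (my illustration, not from the slides): Q = {x ∈ R² : x ≥ 0} is partly smooth relative to M = {(x_1, 0) : x_1 > 0}. Indeed, M is a smooth manifold; at any x ∈ M the normal cone N_Q(x) is the ray of nonpositive multiples of e_2, so span N_Q(x) = R e_2 = N_M(x) (sharpness); and N_Q is constant, hence continuous, along M.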
Active sets in optimization

Why do optimizers care?

• Many optimization algorithms identify M in finite time!
  Eg: Gradient projection, Newton-like, proximal point.
  =⇒ Acceleration strategies!

How to see this structure in eigenvalue optimization?

Partly smooth manifolds “lift” (Daniilidis-D-Lewis):
    Q partly smooth relative to M
    ⇕
    λ⁻¹(Q) partly smooth relative to λ⁻¹(M),
provided Q is symmetric.
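A toy illustration of finite identification (an assumed example, not from the talk): projected gradient descent over the nonnegative orthant pins down the active face, i.e. the coordinates that are exactly zero at the solution, after finitely many iterations.

```python
import numpy as np

def projected_gradient(grad, x0, step=0.1, iters=100):
    x = x0
    for _ in range(iters):
        x = np.maximum(x - step * grad(x), 0.0)    # gradient step, then project onto {x >= 0}
    return x

# minimize 0.5*|x - c|^2 over {x >= 0}; the solution max(c, 0) has x_2 = x_3 = 0
c = np.array([1.0, -2.0, -0.5])
x = projected_gradient(lambda x: x - c, np.array([3.0, 3.0, 3.0]))
print(x, x == 0.0)                                  # the zero coordinates are identified exactly
```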
Nonsmoothness under symmetry

Actions:
    the permutation group Σ acts on Rⁿ (permutation),
    the orthogonal group Oⁿ acts on Sⁿ (conjugation).

Correspondence:
    Oⁿ-invariant (spectral) ⇐⇒ Σ-invariant (symmetric).

Q is spectral ⇐⇒ Q = λ⁻¹(M) with M symmetric.

Eg: {X : X ⪰ 0} = λ⁻¹({x : x ≥ 0}).

THE TRANSFER PRINCIPLE:
Variational properties of Q and M are in 1-to-1 correspondence!

Eg: Convexity, stratifiability, Whitney conditions, normal flatness, partial smoothness, . . .
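A concrete instance of the transfer principle (this is the standard eigenvalue-clamping fact, stated here as an illustration): since {X : X ⪰ 0} = λ⁻¹({x : x ≥ 0}), projecting a symmetric matrix onto the PSD cone amounts to projecting its eigenvalues onto the nonnegative orthant.

```python
import numpy as np

def proj_psd(X):
    """Nearest positive semidefinite matrix to a symmetric X (Frobenius norm)."""
    w, V = np.linalg.eigh(X)                   # X = V diag(w) V^T
    return (V * np.maximum(w, 0.0)) @ V.T      # clamp eigenvalues at zero and reassemble

X = np.array([[1.0, 2.0],
              [2.0, -3.0]])
P = proj_psd(X)
print(np.linalg.eigvalsh(P))                   # eigenvalues of the projection are >= 0
```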
Summary
[diagram: slope at the center, linking geometry, analysis, and algorithms]
Thank you.
References
• Optimality, identifiability, and sensitivity, D-Lewis, submitted to Math. Programming Ser. A.
• The dimension of semi-algebraic subdifferential graphs, D-Ioffe-Lewis, Nonlinear Analysis, 75(3), 1231-1245, 2012.
• Semi-algebraic functions have small subdifferentials, D-Lewis, to appear in Math. Prog. Ser. B.
• Tilt stability, uniform quadratic growth, and strong metric regularity of the subdifferential, D-Lewis, to appear in SIAM J. on Opt.
• Approximating functions on stratifiable domains, D-Larsson, submitted to the Trans. of the AMS.
• Generic nondegeneracy in convex optimization, D-Lewis, Proc. Amer. Math. Soc. 139 (2011), 2519-2527.
• Trajectories of subgradient dynamical systems, D-Ioffe-Lewis, submitted to SIAM J. on Control and Opt.
• Spectral lifts of identifiable sets and partly smooth manifolds, Daniilidis-D-Lewis, preprint.

Available at http://people.orie.cornell.edu/dd379/