
Transcript of: The Pervasiveness of Proximal Point Iterations – With a Proximal Analysis of Neural Networks

  • The Pervasiveness of Proximal Point Iterations – With a Proximal Analysis of Neural Networks

    Patrick L. Combettes

    Department of Mathematics, North Carolina State University, Raleigh, NC 27695, USA

    BayOpt Meeting, Santa Cruz, May 17, 2019

  • Part 1

    The proximal point algorithm

  • Nonexpansive operators (Browder, Minty)

    H is a real Hilbert space

    T : H → H is nonexpansive if

    (∀x ∈ H)(∀y ∈ H) ‖Tx − Ty‖ ≤ ‖x − y‖,

    firmly nonexpansive if 2T − Id is nonexpansive, i.e.,

    (∀x ∈ H)(∀y ∈ H) ‖Tx − Ty‖² + ‖(Id − T)x − (Id − T)y‖² ≤ ‖x − y‖²,

    and α-averaged (α ∈ ]0, 1]) if

    (∀x ∈ H)(∀y ∈ H) ‖Tx − Ty‖² + ((1 − α)/α)‖(Id − T)x − (Id − T)y‖² ≤ ‖x − y‖²

    Convex combinations and compositions of averaged operators are averaged

    This fact reduces the analysis of most prominent algorithms in optimization to averaged operator iterations
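
    A minimal numerical sketch of the composition fact (my illustration, not from the slides): projections onto closed convex sets are firmly nonexpansive, hence 1/2-averaged, so iterating their composition converges to a common fixed point. The two lines through the origin are an arbitrary choice.

    ```python
    import numpy as np

    # Sketch (illustration only): projections onto two lines in R^2 are firmly
    # nonexpansive, hence 1/2-averaged; their composition is averaged, and its
    # iterates converge to a fixed point, here the intersection {0}.

    def proj_line(x, d):
        """Projection onto the line R*d, with d a unit vector."""
        return np.dot(x, d) * d

    d1 = np.array([1.0, 0.0])
    d2 = np.array([1.0, 1.0]) / np.sqrt(2.0)

    x = np.array([3.0, -2.0])
    for _ in range(200):
        x = proj_line(proj_line(x, d1), d2)  # averaged operator iteration
    print("fixed point:", x)                 # -> [0, 0]
    ```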

  • Monotone operators

    Single-valued monotone operators were introduced independently in 1960 by Kačurovskĭı, Minty, and Zarantonello

    A set-valued operator A : H → 2^H with graph gra A = {(x, x∗) ∈ H × H | x∗ ∈ Ax} is monotone if

    (∀(x, x∗) ∈ gra A)(∀(y, y∗) ∈ gra A) 〈x − y | x∗ − y∗〉 ≥ 0,

    and maximally monotone if there is no monotone operator B : H → 2^H such that gra A ⊂ gra B ≠ gra A

    Theorem (Minty, 1962)

    T : H → H is firmly nonexpansive ⇔ T = JA = (Id + A)⁻¹ (resolvent) for some maximally monotone A : H → 2^H; in this case Fix T = zer A and the reflected resolvent RA = 2JA − Id is nonexpansive

  • Convex analysis (Moreau, Rockafellar, 1962+)

    Γ0(H): lower semicontinuous convex functions f : H → ]−∞,+∞] such that dom f = {x ∈ H | f(x) < +∞} ≠ Ø

    f∗ : x∗ ↦ sup_{x∈H} (〈x | x∗〉 − f(x)) is the conjugate of f; if f ∈ Γ0(H), then f∗ ∈ Γ0(H) and f∗∗ = f

    The subdifferential of f at x ∈ H is

    ∂f(x) = { x∗ ∈ H | (∀y ∈ H) 〈y − x | x∗〉 + f(x) ≤ f(y) },

    where the affine minorant y ↦ 〈y − x | x∗〉 + f(x) is denoted f_{x,x∗}

    [Figure: gra f and epi f, together with the graph of an affine minorant f_{x,x∗}, the graph of 〈· | x∗〉, the point (x, f(x)), and the conjugate value f∗(x∗).]

    Fermat’s rule: x minimizes f ⇔ 0 ∈ ∂f(x)

    ∂f is maximally monotone

    Infimal convolution: (f □ g) : x ↦ inf_{y∈H} (f(y) + g(x − y))

  • Moreau’s proximity operator

    In 1962, motivated by nonsmooth mechanics, J. J. Moreau (1923–2014) introduced the proximity operator of f ∈ Γ0(H)

    prox_f : x ↦ argmin_{y∈H} ( f(y) + (1/2)‖x − y‖² )

    and derived its main properties

    Set q = ‖·‖²/2. Then f □ q + f∗ □ q = q and

    prox_f = ∇(f + q)∗ = ∇(f∗ □ q) = Id − prox_{f∗} = (Id + ∂f)⁻¹ = J_{∂f}, hence

    Fix prox_f = zer ∂f = Argmin f

    (prox_f x, x − prox_f x) ∈ gra ∂f

    Firm nonexpansiveness: ‖prox_f x − prox_f y‖² + ‖prox_{f∗} x − prox_{f∗} y‖² ≤ ‖x − y‖²

    This suggests that xn+1 = prox_f xn ⇀ x ∈ Argmin f
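
    A small numerical sketch of these identities (my illustration, not from the talk) for f = |·| on R: prox_f is soft thresholding with unit threshold, f∗ = ι_{[−1,1]}, and prox_{f∗} is the projection onto [−1, 1]; the function names are my own.

    ```python
    import numpy as np

    def prox_abs(x):
        """prox of f = |.|: soft thresholding with unit threshold."""
        return np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)

    def prox_abs_conj(x):
        """prox of f* = indicator of [-1, 1]: projection onto [-1, 1]."""
        return np.clip(x, -1.0, 1.0)

    # Moreau decomposition: prox_f = Id - prox_{f*}
    x = np.linspace(-3.0, 3.0, 13)
    assert np.allclose(prox_abs(x) + prox_abs_conj(x), x)

    # Firm nonexpansiveness: ||Px - Py||^2 + ||(Id-P)x - (Id-P)y||^2 <= ||x - y||^2
    rng = np.random.default_rng(0)
    u, v = rng.normal(size=5), rng.normal(size=5)
    lhs = np.sum((prox_abs(u) - prox_abs(v)) ** 2) \
        + np.sum((prox_abs_conj(u) - prox_abs_conj(v)) ** 2)
    assert lhs <= np.sum((u - v) ** 2) + 1e-12
    print("Moreau decomposition and firm nonexpansiveness verified")
    ```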

  • The proximal point algorithm for minimization

    [Figure: iterates x0, x1, x2, . . . of three methods plotted in the (ξ1, ξ2)-plane.]

    Steepest descent method in green, its inertial version in blue, and the proximal point algorithm in red. At iteration n, dn = ∇ϕ(xn)/‖∇ϕ(xn)‖ is the normalized gradient at xn.

  • The proximal point algorithm for minimization

    First derived by Martinet (1970/72) with constant parameters, and then by Brézis/P.-L. Lions (1978)

    xn+1 = prox_{γn f} xn ⇀ x ∈ Argmin f if Σ_{n∈N} γn = +∞

    Proximity-preserving transformations (PLC, 2018):

    Set A □ B = (A⁻¹ + B⁻¹)⁻¹ and L ⊲ A = (L ◦ A⁻¹ ◦ L∗)⁻¹

    Define (for (ωi)_{1≤i≤m} in the simplex)

    T = Σ_{i=1}^m ωi L∗i ◦ ( prox_{fi} □ ( ∂gi □ (Mi ⊲ ∂hi) ) ) ◦ Li

    Then T ∈ P(H). More specifically, T = prox_f, where

    f = ( Σ_{i=1}^m ωi ( ( (fi + g∗i + h∗i ◦ M∗i)∗ □ qi ) ◦ Li ) )∗ − q

    Algorithms iterating T are thus proximal point algorithms
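
    A minimal sketch of the scheme (my illustration, not from the slides), with the constant parameters of Martinet's setting and f = |·|, whose prox with parameter γ is soft thresholding; Argmin f = {0}.

    ```python
    import numpy as np

    # Sketch: proximal point algorithm x_{n+1} = prox_{gamma_n f}(x_n) for
    # f = |.| on R with gamma_n = 1. Each prox step moves x toward 0 by gamma.

    def prox_scaled_abs(x, gamma):
        return np.sign(x) * max(abs(x) - gamma, 0.0)

    x, gamma = 10.0, 1.0
    trajectory = [x]
    for n in range(15):
        x = prox_scaled_abs(x, gamma)
        trajectory.append(x)
    print(trajectory)  # 10, 9, ..., 1, 0, 0, ...: reaches Argmin f = {0}
    ```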

  • Proximity-preserving transformations

    Let (Ti)i∈I be a finite family in P(H) and (ωi)i∈I convex weights. Then Σ_{i∈I} ωi Ti ∈ P(H) (Moreau, 1963)

    Let T1 and T2 be in P(H). Then T1 □ T2 ∈ P(H)

    The barycentric projection method (Auslender, 1969)

    xn+1 = Σ_{i∈I} ωi proj_{Ci} xn

    is a proximal algorithm

    Let T1 and T2 be in P(H). Then (T1 − T2 + Id)/2 ∈ P(H)

    Let T ∈ P(H) and let V be a closed vector subspace of H. Then proj_V ◦ T ◦ proj_V ∈ P(H)

  • Proximity-preserving transformations

    K a closed convex cone in H with polar cone K⊖, V a closed vector subspace of H

    Set

    f = ( (1/2) d²_{K⊖} ◦ proj_V )∗ − ‖·‖²/2 and T = proj_V ◦ proj_K ◦ proj_V

    Then T = prox_f

    Let x0 ∈ V and (∀n ∈ N) xn+1 = prox_f xn

    (xn)n∈N is identical to the alternating projection sequence xn+1 = (proj_V ◦ proj_K)xn

    Hundal (2004) constructed a special V and K so that convergence of alternating projections is only weak and not strong. We thus obtain a new instance of the weak but not strong convergence of the proximal point algorithm.

  • The proximal point algorithm for inclusions

    Extension to a maximally monotone operator A by Rockafellar (1976), Brézis/P.-L. Lions (1978), etc.

    xn+1 = xn + λn (J_{γn A} xn − xn), 0 < λn < 2

    This provides a much more powerful framework:

    Applied to saddle operators it covers various algorithms, e.g., the proximal method of multipliers in the case of the ordinary Lagrangian (Rockafellar, 1976)

    It covers the Douglas–Rachford splitting algorithm (Eckstein/Bertsekas, 1992)

    It covers the forward-backward splitting algorithm and, more generally, any averaged operator scheme (PLC, 2018); in particular it covers the Chambolle–Pock algorithm, dual ascent methods, etc.

    Applied to the partial inverse of a monotone operator it yields the method of partial inverses (Spingarn, 1983)
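
    A small sketch of this relaxed iteration (my illustration, not from the slides) for a maximally monotone operator that is not a subdifferential: the skew linear operator A on R² below, with zer A = {0}. Gradient-type steps on such an operator merely rotate, while the resolvent steps converge; γ and λn are arbitrary admissible constants.

    ```python
    import numpy as np

    # Sketch: x_{n+1} = x_n + lam * (J_{gamma A} x_n - x_n) with
    # A = [[0, -1], [1, 0]] (skew-symmetric, hence maximally monotone) and
    # J_{gamma A} = (Id + gamma A)^{-1}. Here zer A = {0}.

    A = np.array([[0.0, -1.0], [1.0, 0.0]])
    gamma, lam = 1.0, 1.0                     # relaxation 0 < lam < 2
    J = np.linalg.inv(np.eye(2) + gamma * A)  # resolvent of gamma*A

    x = np.array([4.0, -3.0])
    for _ in range(100):
        x = x + lam * (J @ x - x)
    print("limit point:", x)                  # -> [0, 0] = zer A
    ```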

  • Example: structured convex minimization

    Solve the primal problem

    minimize_{x ∈ H}   f(x) + Σ_{i=1}^m gi(Li x − oi) − 〈x | z〉

    together with the dual problem

    minimize_{v1 ∈ G1, . . . , vm ∈ Gm}   f∗( z − Σ_{i=1}^m L∗i vi ) + Σ_{i=1}^m ( g∗i(vi) + 〈vi | oi〉 ).

  • Example: structured convex minimization

    Algorithm (PLC et al., 2014):

    pn = prox_f(xn + un + z)
    rn = xn + un − pn
    For i = 1, . . . , m
      qi,n = oi + prox_{gi}(yi,n + vi,n − oi)
      si,n = yi,n + vi,n − qi,n
    tn = Q(rn + Σ_{i=1}^m L∗i si,n)
    wn = Q(pn + Σ_{i=1}^m L∗i qi,n)
    xn+1 = xn − λn tn
    un+1 = un + λn(wn − pn)
    For i = 1, . . . , m
      yi,n+1 = yi,n − λn Li tn
      vi,n+1 = vi,n + λn(Li wn − qi,n)

    This is the method of partial inverses in the primal-dual product space with respect to V = gra L, where L : x ↦ (Li x)_{1≤i≤m}, hence an instance of the proximal point algorithm (here Q = (Id + L∗L)⁻¹)

  • Part 2

    Proximal analysis of neural networks

    Joint work with J.-C. Pesquet (2018, 2019)

  • Feed-forward neural networks structures

    [Diagram: x → W1 · + b1 → R1 → · · · → Wm · + bm → Rm → Tx]

    Fig. 1: m-layer network: Wi is a (linear) weight operator, bi is a bias vector, Ri is a (nonlinear) activation operator.

    ✓ Generic methods for nonlinear approximation [Cybenko, 1989; Funahashi, 1989]

    ✓ Efficient for incorporating prior knowledge from big databases

    ✗ Black-box, empirical approaches

  • Feed-forward neural networks structures

    [Diagram: the m-layer network of Fig. 1.]

    Objective: Use tools from nonlinear analysis to investigate the properties and the asymptotic behavior of feed-forward neural network structures, in particular:

    What is the robustness of the network to perturbations of the input?

    As the number m of layers increases, does Tx converge to something and, if so, to what?

  • Feed-forward neural networks

    [Diagram: the m-layer network of Fig. 1.]

    NEURAL NETWORK MODEL

    (Hi)_{0≤i≤m} are real Hilbert spaces

    For each i ∈ {1, . . . , m}, Ti : Hi−1 → Hi : x ↦ Ri(Wi x + bi), where Wi : Hi−1 → Hi is bounded and linear, bi ∈ Hi, and Ri : Hi → Hi is αi-averaged for some αi ∈ ]0, 1]

    T = Tm ◦ · · · ◦ T1
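
    A minimal finite-dimensional instance of this model (my illustration, not from the slides) with m = 2 on R²: Ri = ReLU componentwise, which is prox of the indicator of [0,+∞[² and hence firmly nonexpansive, and illustrative weight matrices with norm below 1, so that T = T2 ◦ T1 is nonexpansive.

    ```python
    import numpy as np

    def relu(x):                    # prox of the indicator of [0, +inf)^N
        return np.maximum(x, 0.0)

    # Illustrative weights with ||W_i|| < 1 and arbitrary biases
    W1, b1 = np.array([[0.6, -0.2], [0.1, 0.5]]), np.array([0.1, -0.3])
    W2, b2 = np.array([[0.4, 0.3], [-0.2, 0.5]]), np.array([0.0, 0.2])

    def T(x):
        x1 = relu(W1 @ x + b1)      # T_1 : H_0 -> H_1
        return relu(W2 @ x1 + b2)   # T_2 : H_1 -> H_2, so T = T_2 o T_1

    # Empirical nonexpansiveness check: ||Tx - Ty|| <= ||x - y||
    rng = np.random.default_rng(1)
    x, y = rng.normal(size=2), rng.normal(size=2)
    print(np.linalg.norm(T(x) - T(y)) <= np.linalg.norm(x - y))  # True
    ```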

  • Most activation operators are proximity operators

    • Rectified linear unit (ReLU)

      ̺ : R → R : ξ ↦ ξ if ξ ≥ 0; 0 if ξ ≤ 0.

      Then ̺ = prox_{ι[0,+∞[}.

    • Parametric ReLU (α ∈ ]0, 1])

      ̺ : R → R : ξ ↦ ξ if ξ ≥ 0; αξ if ξ ≤ 0.

      Then ̺ = prox_φ, where

      φ : R → R : ξ ↦ 0 if ξ ≥ 0; (1/α − 1)ξ²/2 if ξ ≤ 0.
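
    A numerical sanity check (my illustration, not from the slides) that the parametric ReLU is prox_φ: for each test point x, the argmin of y ↦ φ(y) + (x − y)²/2 over a fine grid should coincide with ̺(x); α = 0.25 is an arbitrary choice.

    ```python
    import numpy as np

    alpha = 0.25

    def prelu(x):
        return np.where(x >= 0, x, alpha * x)

    def phi(y):  # phi from the slide: 0 on [0,+inf), (1/alpha - 1) y^2/2 on (-inf,0]
        return np.where(y >= 0, 0.0, (1.0 / alpha - 1.0) * y ** 2 / 2.0)

    ygrid = np.linspace(-10.0, 10.0, 200001)   # grid step 1e-4
    for x in (-3.0, -0.5, 0.0, 1.7):
        prox_x = ygrid[np.argmin(phi(ygrid) + 0.5 * (x - ygrid) ** 2)]
        assert abs(prox_x - prelu(x)) < 1e-3
    print("parametric ReLU = prox_phi verified on test points")
    ```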

  • Most activation operators are proximity operators

    • Unimodal sigmoid

      ̺ : R → R : ξ ↦ 1/(1 + e^{−ξ}) − 1/2

      Then ̺ = prox_φ, where

      φ : ξ ↦ (ξ + 1/2) ln(ξ + 1/2) + (1/2 − ξ) ln(1/2 − ξ) − (ξ² + 1/4)/2 if |ξ| < 1/2; −1/4 if |ξ| = 1/2; +∞ if |ξ| > 1/2.

    • Elliot function

      ̺ : R → R : ξ ↦ ξ/(1 + |ξ|).

      Then ̺ = prox_φ, where

      φ : R → ]−∞,+∞] : ξ ↦ −|ξ| − ln(1 − |ξ|) − ξ²/2 if |ξ| < 1; +∞ if |ξ| ≥ 1.
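
    For the Elliot function, the prox identity can also be checked through the optimality condition p = prox_φ(x) ⇔ x − p = φ′(p), since φ is differentiable on the interior of its domain. A small sketch (my illustration, not from the slides), with the derivative computed from the slide's formula for φ:

    ```python
    import numpy as np

    def elliot(x):
        return x / (1.0 + np.abs(x))

    def phi_prime(p):  # derivative of -|p| - ln(1 - |p|) - p^2/2 for 0 < |p| < 1
        return -np.sign(p) + np.sign(p) / (1.0 - np.abs(p)) - p

    x = np.linspace(-5.0, 5.0, 11)
    x = x[x != 0]                  # the derivative formula above assumes p != 0
    p = elliot(x)
    assert np.allclose(p + phi_prime(p), x)  # x - p = phi'(p), i.e., p = prox_phi(x)
    print("Elliot activation satisfies the prox optimality condition")
    ```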

  • Most activation operators are proximity operators

    • Logarithmic activation

      ̺ : R → R : ξ ↦ sign(ξ) ln(1 + |ξ|)

      Then ̺ = prox_φ, where

      φ : R → ]−∞,+∞] : ξ ↦ e^{|ξ|} − |ξ| − 1 − ξ²/2.

    • Arctangent

      ̺ = (2/π) arctan

      Then ̺ = prox_φ, where

      φ : R → ]−∞,+∞] : ξ ↦ −(2/π) ln(cos(πξ/2)) − ξ²/2 if |ξ| < 1; +∞ if |ξ| ≥ 1.

  • Most activation operators are proximity operators

    • Inverse square root unit activation

      ̺ : R → R : ξ ↦ ξ/√(1 + ξ²).

      Then ̺ = prox_φ, where

      φ : R → ]−∞,+∞] : ξ ↦ −ξ²/2 − √(1 − ξ²) if |ξ| ≤ 1; +∞ if |ξ| > 1.

    • Inverse square root linear unit activation

      ̺ : ξ ↦ ξ if ξ ≥ 0; ξ/√(1 + ξ²) if ξ < 0.

      Then ̺ = prox_φ, where

      φ : R → ]−∞,+∞] : ξ ↦ 0 if ξ ≥ 0; 1 − ξ²/2 − √(1 − ξ²) if −1 ≤ ξ < 0; +∞ if ξ < −1.

  • Most activation operators are proximity operators

    [Figure: the function φ (top) and the corresponding proximal activation function ̺ (bottom). Inverse square root linear unit in red, arctangent activation function in blue, logarithmic activation function in green.]

  • Most activation operators are proximity operators

    • Softmax

      R : R^N → R^N : (ξk)_{1≤k≤N} ↦ ( exp(ξk) / Σ_{j=1}^N exp(ξj) )_{1≤k≤N} − u,

      where u = (1, . . . , 1)/N ∈ R^N. Then R = prox_ϕ, where ϕ = ψ(· + u) + 〈· | u〉 and

      ψ : R^N → ]−∞,+∞] : (ξk)_{1≤k≤N} ↦ Σ_{k=1}^N ( ξk ln ξk − ξk²/2 ) if (ξk)_{1≤k≤N} ∈ [0, 1]^N and Σ_{k=1}^N ξk = 1; +∞ otherwise.

  • Most activation operators are proximity operators

    • Squashing function used in capsnets

      (∀x ∈ R^N) Rx = (µ‖x‖/(1 + ‖x‖²)) x = prox_{φ∘‖·‖} x, µ = 8/(3√3),

      where

      φ : ξ ↦ µ arctan √(|ξ|/(µ − |ξ|)) − √(|ξ|(µ − |ξ|)) − ξ²/2 if |ξ| < µ; µ(π − µ)/2 if |ξ| = µ; +∞ otherwise.

      [Figure: graph of φ, finite on [−µ, µ] with φ(±µ) = µ(π − µ)/2 and φ = +∞ elsewhere.]

  • Averagedness result

    Goal: Derive properties of compositions of linear operators and firmly nonexpansive mappings

    Difficulty: The operators are defined in different spaces

  • Averagedness result

    Proposition

    Let α ∈ [1/2, 1]. Set W = Wm ◦ · · · ◦ W1, µ = inf_{‖x‖_{H0} = 1} 〈Wx | x〉, and

    θm = ‖W‖ + Σ_{ℓ=1}^{m−1} Σ_{0≤j1<···} […]

    [The remainder of the proposition is truncated in the transcript.]

  • Averagedness result

    Example

    Consider item (i) of the Proposition with m = 2. Then P2 ◦ W2 ◦ P1 ◦ W1 is α-averaged, hence nonexpansive, if

    ‖W2 ◦ W1 − 4(1 − α)Id‖ + ‖W2 ◦ W1‖ + 2‖W2‖ ‖W1‖ ≤ 4α.

    In particular, if α = 1, this condition is clearly less restrictive than requiring that W1 and W2 be nonexpansive.
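
    An empirical illustration of the example (mine, not from the slides) with α = 1 and proximal activations Pi = ReLU: the diagonal weights below give ‖W2 ◦ W1 − 0·Id‖ + ‖W2 ◦ W1‖ + 2‖W2‖‖W1‖ = 0.65 + 0.65 + 1.3 = 2.6 ≤ 4 even though ‖W1‖ = 1.3 > 1, and the composition still behaves nonexpansively.

    ```python
    import numpy as np

    relu = lambda x: np.maximum(x, 0.0)     # a proximal activation
    W1 = np.diag([1.3, 0.1])                # ||W1|| = 1.3 > 1: expansive alone
    W2 = np.diag([0.5, 0.5])                # ||W2|| = 0.5, ||W2 W1|| = 0.65
    T = lambda x: relu(W2 @ relu(W1 @ x))   # P2 o W2 o P1 o W1

    rng = np.random.default_rng(2)
    for _ in range(1000):
        x, y = rng.normal(size=2), rng.normal(size=2)
        assert np.linalg.norm(T(x) - T(y)) <= np.linalg.norm(x - y) + 1e-12
    print("composition is empirically nonexpansive although ||W1|| > 1")
    ```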

  • Asymptotic behavior

    MODEL

    Let x0 ∈ H and let {λn}n∈N ⊂ ]0,+∞[. Iterate

    for n = 0, 1, . . .
      x1,n = R1,n(W1,n xn + b1,n)
      x2,n = R2,n(W2,n x1,n + b2,n)
      ...
      xm,n = Rm,n(Wm,n xm−1,n + bm,n)
      xn+1 = xn + λn(xm,n − xn)

    • Wi,n : Hi−1 → Hi is a bounded linear operator, bi,n ∈ Hi, and Ri,n : Hi → Hi is a (nonlinear) activation operator

    • (Hi)_{0≤i≤m} are real Hilbert spaces such that Hm = H0 = H

  • Asymptotic behavior

    (Same model and iteration as above.)

    Remark

    λn models a skip connection
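
    A toy run of this iteration (my illustration, not from the slides) for a periodic network on R² with ReLU activations, contractive illustrative weights, and λn ≡ 1/2: the relaxed iterates settle at a fixed point of T.

    ```python
    import numpy as np

    relu = lambda x: np.maximum(x, 0.0)
    W1, b1 = np.array([[0.3, -0.4], [0.4, 0.3]]), np.array([1.0, 0.5])
    W2, b2 = np.array([[0.5, 0.2], [-0.2, 0.5]]), np.array([0.2, 1.0])
    T = lambda x: relu(W2 @ relu(W1 @ x + b1) + b2)   # T = T_2 o T_1

    lam = 0.5                                 # skip-connection weight lambda_n
    x = np.array([5.0, -5.0])
    for _ in range(500):
        x = x + lam * (T(x) - x)              # x_{n+1} = x_n + lam (T x_n - x_n)
    print("x* =", x, "residual:", np.linalg.norm(T(x) - x))   # residual ~ 0
    ```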

  • Periodic networks

    ASSUMPTIONS

    • Periodicity: Ri,n ≡ Ri, Wi,n ≡ Wi, bi,n ≡ bi

    • Proximal activation: Ri = prox_{ϕi} for some ϕi ∈ Γ0(Hi) such that ϕi(0) = inf ϕi(Hi).

    NOTATION

    • For every i ∈ {1, . . . , m}, Ti : Hi−1 → Hi : x ↦ Ri(Wi x + bi)

    • F = Fix T with T = Tm ◦ · · · ◦ T1.

  • Associated variational inequality

    Find x1 ∈ H1, . . . , xm ∈ Hm such that

    b1 ∈ x1 − W1 xm + ∂ϕ1(x1)
    b2 ∈ x2 − W2 x1 + ∂ϕ2(x2)
    ...
    bm ∈ xm − Wm xm−1 + ∂ϕm(xm)    (1)

    More compactly, find x ∈ H = H1 ⊕ · · · ⊕ Hm such that

    b ∈ ∂ψ(x) + Bx,  B = Id − W ◦ S,

    where

    H⃗ = Hm ⊕ H1 ⊕ · · · ⊕ Hm−1

    S : H → H⃗ : (x1, . . . , xm−1, xm) ↦ (xm, x1, . . . , xm−1)

    W : H⃗ → H : (xm, x1, . . . , xm−1) ↦ (W1 xm, W2 x1, . . . , Wm xm−1)

    ψ : H → ]−∞,+∞] : x ↦ Σ_{i=1}^m ( ϕi(xi) − 〈xi | bi〉 )

  • Associated variational inequality

    Proposition

    Set F = Fix (Tm ◦ · · · ◦ T1) and

    𝐅 = { x ∈ H | x1 = T1 xm, x2 = T2 x1, . . . , xm = Tm xm−1 }

    The set of solutions to (1) is

    𝐅 = { (T1 xm, (T2 ◦ T1)xm, . . . , (Tm−1 ◦ · · · ◦ T1)xm, xm) | xm ∈ F }.

    Suppose that (Wi)_{1≤i≤m} satisfies averagedness conditions for some α ∈ [1/2, 1]. Then 𝐅 is closed and convex.

    Suppose that one of the following holds:

    ran (Tm ◦ · · · ◦ T1) is bounded.

    There exists j ∈ {1, . . . , m} such that dom ϕj is bounded.

    Then F and 𝐅 are nonempty.

    Remark

    A solution to (1) does not solve a minimization problem.

  • Periodic networks

    Theorem

    Suppose that the following hold:

    F ≠ Ø.

    (Wi)_{1≤i≤m} satisfies averagedness conditions with parameter α.

    One of the following is satisfied:

      λn ≡ 1/α = 1 and T xn − xn → 0.

      (λn)n∈N lies in ]0, 1/α[ and Σ_{n∈N} λn(1 − αλn) = +∞.

    Then (xn)n∈N converges weakly to a point xm ∈ F and (T1 xm, (T2 ◦ T1)xm, . . . , (Tm−1 ◦ · · · ◦ T1)xm, xm) solves (1).

  • Periodic networks

    (Same theorem as above.)

    Remark

    Suppose that (Wi)_{1≤i≤m} satisfies averagedness conditions with α ∈ [1/2, 1], (λn)n∈N lies in [ε, (1/α) − ε] for some ε ∈ ]0, 1/2[, and F = Ø. Then ‖xn‖ → +∞.

  • (Mildly) aperiodic networks

    ASSUMPTIONS

    There exist (ωn)n∈N ∈ ℓ¹₊, (ρn)n∈N ∈ ℓ¹₊, (ηn)n∈N ∈ ℓ¹₊, and (νn)n∈N ∈ ℓ¹₊ for which the following hold, for every i ∈ {1, . . . , m}:

    • There exists a bounded linear operator Wi : Hi−1 → Hi such that (∀n ∈ N) ‖Wi,n − Wi‖ ≤ ωn.

    • There exists a proximal activation operator Ri : Hi → Hi such that (∀n ∈ N)(∀x ∈ Hi) ‖Ri,n x − Ri x‖ ≤ ρn‖x‖ + ηn.

    • There exists bi ∈ Hi such that (∀n ∈ N) ‖bi,n − bi‖ ≤ νn.

    Same asymptotic results

    What about more general networks?

  • Averaged activation operators

    There exist a maximally monotone operator A : H → 2^H and a constant λ ∈ [0, 2] such that R = Id + λ(JA − Id).

    On the real line, an averaged activation operator is of the form R = Id + λ(prox_φ − Id), where φ ∈ Γ0(R) and λ ∈ [0, 2].

    This includes almost all recent proposals for activation in neural networks, e.g.,

    sine activation function R = sin

    piecewise Mexican-hat activation function

    absolute value function R = |·| (scattering networks)

    swish activation function (∀x ∈ R) R(x) = 5x/(6(1 + exp(−x)))

    etc.

    This formalism leads to tight Lipschitz constant derivations for the network → stability guarantees.

  • References

    PLC, Monotone operator theory in convex optimization, Math. Program., vol. B170, 2018.

    PLC/Pesquet, Deep neural network structures solving variational inequalities, arXiv, 2018.

    PLC/Pesquet, Lipschitz certificates for neural network structures driven by averaged activation operators, arXiv, 2019.

    Bauschke/PLC, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd ed., Springer, New York, 2017.

    Chierchia/Chouzenoux/PLC/Pesquet, Proximity Operator Repository, http://proximity-operator.net/
