
Transcript of: The Pervasiveness of Proximal Point Iterations – With a Proximal Analysis of Neural Networks

  • The Pervasiveness of Proximal Point Iterations – With a Proximal Analysis of Neural Networks

    Patrick L. Combettes

    Department of Mathematics, North Carolina State University, Raleigh, NC 27695, USA

    BayOpt Meeting, Santa Cruz, May 17, 2019

  • Part 1

    The proximal point algorithm

  • Nonexpansive operators (Browder, Minty)

    H is a real Hilbert space

    T : H → H is nonexpansive if

    (∀x ∈ H)(∀y ∈ H) ‖Tx − Ty‖ ≤ ‖x − y‖,

    firmly nonexpansive if 2T − Id is nonexpansive, i.e.,

    (∀x ∈ H)(∀y ∈ H) ‖Tx − Ty‖² + ‖(Id − T)x − (Id − T)y‖² ≤ ‖x − y‖²,

    and α-averaged (α ∈ ]0, 1]) if

    (∀x ∈ H)(∀y ∈ H) ‖Tx − Ty‖² + ((1 − α)/α)‖(Id − T)x − (Id − T)y‖² ≤ ‖x − y‖²

    Convex combinations and compositions of averaged operators are averaged

    This fact reduces the analysis of most prominent algorithms in optimization to averaged operator iterations
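
    A minimal numerical sketch of the composition fact (my illustration, not from the slides): projections onto closed convex sets are firmly nonexpansive, hence 1/2-averaged, so iterating their composition converges to a common fixed point. The two lines through the origin are an arbitrary choice.

    ```python
    import numpy as np

    # Sketch (illustration only): projections onto two lines in R^2 are firmly
    # nonexpansive, hence 1/2-averaged; their composition is averaged, and its
    # iterates converge to a fixed point, here the intersection {0}.

    def proj_line(x, d):
        """Projection onto the line R*d, with d a unit vector."""
        return np.dot(x, d) * d

    d1 = np.array([1.0, 0.0])
    d2 = np.array([1.0, 1.0]) / np.sqrt(2.0)

    x = np.array([3.0, -2.0])
    for _ in range(200):
        x = proj_line(proj_line(x, d1), d2)  # averaged operator iteration
    print("fixed point:", x)                 # -> [0, 0]
    ```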

  • Monotone operators

    Single-valued monotone operators were introduced independently in 1960 by Kačurovskĭı, Minty, and Zarantonello

    A set-valued operator A : H → 2^H with graph gra A = {(x, x∗) ∈ H × H | x∗ ∈ Ax} is monotone if

    (∀(x, x∗) ∈ gra A)(∀(y, y∗) ∈ gra A) 〈x − y | x∗ − y∗〉 ≥ 0,

    and maximally monotone if there is no monotone operator B : H → 2^H such that gra A ⊂ gra B ≠ gra A

    Theorem (Minty, 1962)

    T : H → H is firmly nonexpansive ⇔ T = JA = (Id + A)⁻¹ (resolvent) for some maximally monotone A : H → 2^H; in this case Fix T = zer A and the reflected resolvent RA = 2JA − Id is nonexpansive

  • Convex analysis (Moreau, Rockafellar, 1962+)

    Γ0(H): lower semicontinuous convex functions f : H → ]−∞,+∞] such that dom f = {x ∈ H | f(x) < +∞} ≠ Ø

    f∗ : x∗ ↦ sup_{x∈H} (〈x | x∗〉 − f(x)) is the conjugate of f; if f ∈ Γ0(H), then f∗ ∈ Γ0(H) and f∗∗ = f

    The subdifferential of f at x ∈ H is

    ∂f(x) = { x∗ ∈ H | (∀y ∈ H) 〈y − x | x∗〉 + f(x) ≤ f(y) },

    where the affine minorant y ↦ 〈y − x | x∗〉 + f(x) is denoted f_{x,x∗}

    [Figure: gra f and epi f, together with the graph of an affine minorant f_{x,x∗}, the graph of 〈· | x∗〉, the point (x, f(x)), and the conjugate value f∗(x∗).]

    Fermat’s rule: x minimizes f ⇔ 0 ∈ ∂f(x)

    ∂f is maximally monotone

    Infimal convolution: (f □ g) : x ↦ inf_{y∈H} (f(y) + g(x − y))

  • Moreau’s proximity operator

    In 1962, motivated by nonsmooth mechanics, J. J. Moreau (1923–2014) introduced the proximity operator of f ∈ Γ0(H)

    prox_f : x ↦ argmin_{y∈H} ( f(y) + (1/2)‖x − y‖² )

    and derived its main properties

    Set q = ‖·‖²/2. Then f □ q + f∗ □ q = q and

    prox_f = ∇(f + q)∗ = ∇(f∗ □ q) = Id − prox_{f∗} = (Id + ∂f)⁻¹ = J_{∂f}, hence

    Fix prox_f = zer ∂f = Argmin f

    (prox_f x, x − prox_f x) ∈ gra ∂f

    Firm nonexpansiveness: ‖prox_f x − prox_f y‖² + ‖prox_{f∗} x − prox_{f∗} y‖² ≤ ‖x − y‖²

    This suggests that xn+1 = prox_f xn ⇀ x ∈ Argmin f
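
    A small numerical sketch of these identities (my illustration, not from the talk) for f = |·| on R: prox_f is soft thresholding with unit threshold, f∗ = ι_{[−1,1]}, and prox_{f∗} is the projection onto [−1, 1]; the function names are my own.

    ```python
    import numpy as np

    def prox_abs(x):
        """prox of f = |.|: soft thresholding with unit threshold."""
        return np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)

    def prox_abs_conj(x):
        """prox of f* = indicator of [-1, 1]: projection onto [-1, 1]."""
        return np.clip(x, -1.0, 1.0)

    # Moreau decomposition: prox_f = Id - prox_{f*}
    x = np.linspace(-3.0, 3.0, 13)
    assert np.allclose(prox_abs(x) + prox_abs_conj(x), x)

    # Firm nonexpansiveness: ||Px - Py||^2 + ||(Id-P)x - (Id-P)y||^2 <= ||x - y||^2
    rng = np.random.default_rng(0)
    u, v = rng.normal(size=5), rng.normal(size=5)
    lhs = np.sum((prox_abs(u) - prox_abs(v)) ** 2) \
        + np.sum((prox_abs_conj(u) - prox_abs_conj(v)) ** 2)
    assert lhs <= np.sum((u - v) ** 2) + 1e-12
    print("Moreau decomposition and firm nonexpansiveness verified")
    ```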

  • The proximal point algorithm for minimization

    [Figure: iterates x0, x1, x2, . . . of three methods plotted in the (ξ1, ξ2)-plane.]

    Steepest descent method in green, its inertial version in blue, and the proximal point algorithm in red. At iteration n, dn = ∇ϕ(xn)/‖∇ϕ(xn)‖ is the normalized gradient at xn.

  • The proximal point algorithm for minimization

    First derived by Martinet (1970/72) with constant parameters, and then by Brézis/P.-L. Lions (1978)

    xn+1 = prox_{γn f} xn ⇀ x ∈ Argmin f if Σ_{n∈N} γn = +∞

    Proximity-preserving transformations (PLC, 2018):

    Set A □ B = (A⁻¹ + B⁻¹)⁻¹ and L ⊲ A = (L ◦ A⁻¹ ◦ L∗)⁻¹

    Define (for (ωi)_{1≤i≤m} in the simplex)

    T = Σ_{i=1}^m ωi L∗i ◦ ( prox_{fi} □ ( ∂gi □ (Mi ⊲ ∂hi) ) ) ◦ Li

    Then T ∈ P(H). More specifically, T = prox_f, where

    f = ( Σ_{i=1}^m ωi ( ( (fi + g∗i + h∗i ◦ M∗i)∗ □ qi ) ◦ Li ) )∗ − q

    Algorithms iterating T are thus proximal point algorithms
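
    A minimal sketch of the scheme (my illustration, not from the slides), with the constant parameters of Martinet's setting and f = |·|, whose prox with parameter γ is soft thresholding; Argmin f = {0}.

    ```python
    import numpy as np

    # Sketch: proximal point algorithm x_{n+1} = prox_{gamma_n f}(x_n) for
    # f = |.| on R with gamma_n = 1. Each prox step moves x toward 0 by gamma.

    def prox_scaled_abs(x, gamma):
        return np.sign(x) * max(abs(x) - gamma, 0.0)

    x, gamma = 10.0, 1.0
    trajectory = [x]
    for n in range(15):
        x = prox_scaled_abs(x, gamma)
        trajectory.append(x)
    print(trajectory)  # 10, 9, ..., 1, 0, 0, ...: reaches Argmin f = {0}
    ```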

  • Proximity-preserving transformations

    Let (Ti)i∈I be a finite family in P(H) and (ωi)i∈I convex weights. Then Σ_{i∈I} ωi Ti ∈ P(H) (Moreau, 1963)

    Let T1 and T2 be in P(H). Then T1 □ T2 ∈ P(H)

    The barycentric projection method (Auslender, 1969)

    xn+1 = Σ_{i∈I} ωi proj_{Ci} xn

    is a proximal algorithm

    Let T1 and T2 be in P(H). Then (T1 − T2 + Id)/2 ∈ P(H)

    Let T ∈ P(H) and let V be a closed vector subspace of H. Then proj_V ◦ T ◦ proj_V ∈ P(H)

  • Proximity-preserving transformations

    K a closed convex cone in H with polar cone K⊖, V a closed vector subspace of H

    Set

    f = ( (1/2) d²_{K⊖} ◦ proj_V )∗ − ‖·‖²/2 and T = proj_V ◦ proj_K ◦ proj_V

    Then T = prox_f

    Let x0 ∈ V and (∀n ∈ N) xn+1 = prox_f xn

    (xn)n∈N is identical to the alternating projection sequence xn+1 = (proj_V ◦ proj_K)xn

    Hundal (2004) constructed a special V and K so that convergence of alternating projections is only weak and not strong. We thus obtain a new instance of the weak but not strong convergence of the proximal point algorithm.

  • The proximal point algorithm for inclusions

    Extension to a maximally monotone operator A by Rockafellar (1976), Brézis/P.-L. Lions (1978), etc.

    xn+1 = xn + λn (J_{γn A} xn − xn), 0 < λn < 2

    This provides a much more powerful framework:

    Applied to saddle operators it covers various algorithms, e.g., the proximal method of multipliers in the case of the ordinary Lagrangian (Rockafellar, 1976)

    It covers the Douglas–Rachford splitting algorithm (Eckstein/Bertsekas, 1992)

    It covers the forward-backward splitting algorithm and, more generally, any averaged operator scheme (PLC, 2018); in particular it covers the Chambolle–Pock algorithm, dual ascent methods, etc.

    Applied to the partial inverse of a monotone operator it yields the method of partial inverses (Spingarn, 1983)
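
    A small sketch of this relaxed iteration (my illustration, not from the slides) for a maximally monotone operator that is not a subdifferential: the skew linear operator A on R² below, with zer A = {0}. Gradient-type steps on such an operator merely rotate, while the resolvent steps converge; γ and λn are arbitrary admissible constants.

    ```python
    import numpy as np

    # Sketch: x_{n+1} = x_n + lam * (J_{gamma A} x_n - x_n) with
    # A = [[0, -1], [1, 0]] (skew-symmetric, hence maximally monotone) and
    # J_{gamma A} = (Id + gamma A)^{-1}. Here zer A = {0}.

    A = np.array([[0.0, -1.0], [1.0, 0.0]])
    gamma, lam = 1.0, 1.0                     # relaxation 0 < lam < 2
    J = np.linalg.inv(np.eye(2) + gamma * A)  # resolvent of gamma*A

    x = np.array([4.0, -3.0])
    for _ in range(100):
        x = x + lam * (J @ x - x)
    print("limit point:", x)                  # -> [0, 0] = zer A
    ```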

  • Example: structured convex minimization

    Solve the primal problem

    minimize_{x ∈ H}   f(x) + Σ_{i=1}^m gi(Li x − oi) − 〈x | z〉

    together with the dual problem

    minimize_{v1 ∈ G1, . . . , vm ∈ Gm}   f∗( z − Σ_{i=1}^m L∗i vi ) + Σ_{i=1}^m ( g∗i(vi) + 〈vi | oi〉 ).

  • Example: structured convex minimization

    Algorithm (PLC et al., 2014):

    pn = prox_f(xn + un + z)
    rn = xn + un − pn
    For i = 1, . . . , m
      qi,n = oi + prox_{gi}(yi,n + vi,n − oi)
      si,n = yi,n + vi,n − qi,n
    tn = Q(rn + Σ_{i=1}^m L∗i si,n)
    wn = Q(pn + Σ_{i=1}^m L∗i qi,n)
    xn+1 = xn − λn tn
    un+1 = un + λn(wn − pn)
    For i = 1, . . . , m
      yi,n+1 = yi,n − λn Li tn
      vi,n+1 = vi,n + λn(Li wn − qi,n)

    This is the method of partial inverses in the primal-dual product space with respect to V = gra L, where L : x ↦ (Li x)_{1≤i≤m}, hence an instance of the proximal point algorithm (here Q = (Id + L∗L)⁻¹)

  • Part 2

    Proximal analysis of neural networks

    Joint work with J.-C. Pesquet (2018, 2019)

  • Feed-forward neural networks structures

    [Diagram: x → W1 · + b1 → R1 → · · · → Wm · + bm → Rm → Tx]

    Fig. 1: m-layer network: Wi is a (linear) weight operator, bi is a bias vector, Ri is a (nonlinear) activation operator.

    ✓ Generic methods for nonlinear approximation [Cybenko, 1989; Funahashi, 1989]

    ✓ Efficient for incorporating prior knowledge from big databases

    ✗ Black-box, empirical approaches

  • Feed-forward neural networks structures

    [Diagram: the m-layer network of Fig. 1.]

    Objective: Use tools from nonlinear analysis to investigate the properties and the asymptotic behavior of feed-forward neural network structures, in particular:

    What is the robustness of the network to perturbations of the input?

    As the number m of layers increases, does Tx converge to something and, if so, to what?

  • Feed-forward neural networks

    [Diagram: the m-layer network of Fig. 1.]

    NEURAL NETWORK MODEL

    (Hi)_{0≤i≤m} are real Hilbert spaces

    For each i ∈ {1, . . . , m}, Ti : Hi−1 → Hi : x ↦ Ri(Wi x + bi), where Wi : Hi−1 → Hi is bounded and linear, bi ∈ Hi, and Ri : Hi → Hi is αi-averaged for some αi ∈ ]0, 1]

    T = Tm ◦ · · · ◦ T1
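
    A minimal finite-dimensional instance of this model (my illustration, not from the slides) with m = 2 on R²: Ri = ReLU componentwise, which is prox of the indicator of [0,+∞[² and hence firmly nonexpansive, and illustrative weight matrices with norm below 1, so that T = T2 ◦ T1 is nonexpansive.

    ```python
    import numpy as np

    def relu(x):                    # prox of the indicator of [0, +inf)^N
        return np.maximum(x, 0.0)

    # Illustrative weights with ||W_i|| < 1 and arbitrary biases
    W1, b1 = np.array([[0.6, -0.2], [0.1, 0.5]]), np.array([0.1, -0.3])
    W2, b2 = np.array([[0.4, 0.3], [-0.2, 0.5]]), np.array([0.0, 0.2])

    def T(x):
        x1 = relu(W1 @ x + b1)      # T_1 : H_0 -> H_1
        return relu(W2 @ x1 + b2)   # T_2 : H_1 -> H_2, so T = T_2 o T_1

    # Empirical nonexpansiveness check: ||Tx - Ty|| <= ||x - y||
    rng = np.random.default_rng(1)
    x, y = rng.normal(size=2), rng.normal(size=2)
    print(np.linalg.norm(T(x) - T(y)) <= np.linalg.norm(x - y))  # True
    ```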

  • Most activation operators are proximity operators

    • Rectified linear unit (ReLU)

      ̺ : R → R : ξ ↦ ξ if ξ ≥ 0; 0 if ξ ≤ 0.

      Then ̺ = prox_{ι[0,+∞[}.

    • Parametric ReLU (α ∈ ]0, 1])

      ̺ : R → R : ξ ↦ ξ if ξ ≥ 0; αξ if ξ ≤ 0.

      Then ̺ = prox_φ, where

      φ : R → R : ξ ↦ 0 if ξ ≥ 0; (1/α − 1)ξ²/2 if ξ ≤ 0.
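
    A numerical sanity check (my illustration, not from the slides) that the parametric ReLU is prox_φ: for each test point x, the argmin of y ↦ φ(y) + (x − y)²/2 over a fine grid should coincide with ̺(x); α = 0.25 is an arbitrary choice.

    ```python
    import numpy as np

    alpha = 0.25

    def prelu(x):
        return np.where(x >= 0, x, alpha * x)

    def phi(y):  # phi from the slide: 0 on [0,+inf), (1/alpha - 1) y^2/2 on (-inf,0]
        return np.where(y >= 0, 0.0, (1.0 / alpha - 1.0) * y ** 2 / 2.0)

    ygrid = np.linspace(-10.0, 10.0, 200001)   # grid step 1e-4
    for x in (-3.0, -0.5, 0.0, 1.7):
        prox_x = ygrid[np.argmin(phi(ygrid) + 0.5 * (x - ygrid) ** 2)]
        assert abs(prox_x - prelu(x)) < 1e-3
    print("parametric ReLU = prox_phi verified on test points")
    ```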

  • Most activation operators are proximity operators

    • Unimodal sigmoid

      ̺ : R → R : ξ ↦ 1/(1 + e^{−ξ}) − 1/2

      Then ̺ = prox_φ, where

      φ : ξ ↦ (ξ + 1/2) ln(ξ + 1/2) + (1/2 − ξ) ln(1/2 − ξ) − (ξ² + 1/4)/2 if |ξ| < 1/2; −1/4 if |ξ| = 1/2; +∞ if |ξ| > 1/2.

    • Elliot function

      ̺ : R → R : ξ ↦ ξ/(1 + |ξ|).

      Then ̺ = prox_φ, where

      φ : R → ]−∞,+∞] : ξ ↦ −|ξ| − ln(1 − |ξ|) − ξ²/2 if |ξ| < 1; +∞ if |ξ| ≥ 1.
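
    For the Elliot function, the prox identity can also be checked through the optimality condition p = prox_φ(x) ⇔ x − p = φ′(p), since φ is differentiable on the interior of its domain. A small sketch (my illustration, not from the slides), with the derivative computed from the slide's formula for φ:

    ```python
    import numpy as np

    def elliot(x):
        return x / (1.0 + np.abs(x))

    def phi_prime(p):  # derivative of -|p| - ln(1 - |p|) - p^2/2 for 0 < |p| < 1
        return -np.sign(p) + np.sign(p) / (1.0 - np.abs(p)) - p

    x = np.linspace(-5.0, 5.0, 11)
    x = x[x != 0]                  # the derivative formula above assumes p != 0
    p = elliot(x)
    assert np.allclose(p + phi_prime(p), x)  # x - p = phi'(p), i.e., p = prox_phi(x)
    print("Elliot activation satisfies the prox optimality condition")
    ```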

  • Most activation operators are proximity operators

    • Logarithmic activation

      ̺ : R → R : ξ ↦ sign(ξ) ln(1 + |ξ|)

      Then ̺ = prox_φ, where

      φ : R → ]−∞,+∞] : ξ ↦ e^{|ξ|} − |ξ| − 1 − ξ²/2.

    • Arctangent

      ̺ = (2/π) arctan

      Then ̺ = prox_φ, where

      φ : R → ]−∞,+∞] : ξ ↦ −(2/π) ln(cos(πξ/2)) − ξ²/2 if |ξ| < 1; +∞ if |ξ| ≥ 1.

  • Most activation operators are proximity operators

    • Inverse square root unit activation

      ̺ : R → R : ξ ↦ ξ/√(1 + ξ²).

      Then ̺ = prox_φ, where

      φ : R → ]−∞,+∞] : ξ ↦ −ξ²/2 − √(1 − ξ²) if |ξ| ≤ 1; +∞ if |ξ| > 1.

    • Inverse square root linear unit activation

      ̺ : ξ ↦ ξ if ξ ≥ 0; ξ/√(1 + ξ²) if ξ < 0.

      Then ̺ = prox_φ, where

      φ : R → ]−∞,+∞] : ξ ↦ 0 if ξ ≥ 0; 1 − ξ²/2 − √(1 − ξ²) if −1 ≤ ξ < 0; +∞ if ξ < −1.

  • Most activation operators are proximity operators

    [Figure: the function φ (top) and the corresponding proximal activation function ̺ (bottom). Inverse square root linear unit in red, arctangent activation function in blue, logarithmic activation function in green.]

  • Most activation operators are proximity operators

    • Softmax

      R : R^N → R^N : (ξk)_{1≤k≤N} ↦ ( exp(ξk) / Σ_{j=1}^N exp(ξj) )_{1≤k≤N} − u,

      where u = (1, . . . , 1)/N ∈ R^N. Then R = prox_ϕ, where ϕ = ψ(· + u) + 〈· | u〉 and

      ψ : R^N → ]−∞,+∞] : (ξk)_{1≤k≤N} ↦ Σ_{k=1}^N ( ξk ln ξk − ξk²/2 ) if (ξk)_{1≤k≤N} ∈ [0, 1]^N and Σ_{k=1}^N ξk = 1; +∞ otherwise.

  • Most activation operators are proximity operators

    • Squashing function used in capsnets

      (∀x ∈ R^N) Rx = (µ‖x‖/(1 + ‖x‖²)) x = prox_{φ∘‖·‖} x, µ = 8/(3√3),

      where

      φ : ξ ↦ µ arctan √(|ξ|/(µ − |ξ|)) − √(|ξ|(µ − |ξ|)) − ξ²/2 if |ξ| < µ; µ(π − µ)/2 if |ξ| = µ; +∞ otherwise.

      [Figure: graph of φ, finite on [−µ, µ] with φ(±µ) = µ(π − µ)/2 and φ = +∞ elsewhere.]

  • Averagedness result

    Goal: Derive properties of compositions of linear operators and firmly nonexpansive mappings

    Difficulty: The operators are defined in different spaces

  • Averagedness result

    Proposition

    Let α ∈ [1/2, 1]. Set W = Wm ◦ · · · ◦ W1, µ = inf_{‖x‖_{H0} = 1} 〈Wx | x〉, and

    θm = ‖W‖ + Σ_{ℓ=1}^{m−1} Σ_{0≤j1<···} […]

    [The remainder of the proposition is truncated in the transcript.]

  • Averagedness result

    Example

    Consider item (i) of the Proposition with m = 2. Then P2 ◦ W2 ◦ P1 ◦ W1 is α-averaged, hence nonexpansive, if

    ‖W2 ◦ W1 − 4(1 − α)Id‖ + ‖W2 ◦ W1‖ + 2‖W2‖ ‖W1‖ ≤ 4α.

    In particular, if α = 1, this condition is clearly less restrictive than requiring that W1 and W2 be nonexpansive.
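
    An empirical illustration of the example (mine, not from the slides) with α = 1 and proximal activations Pi = ReLU: the diagonal weights below give ‖W2 ◦ W1 − 0·Id‖ + ‖W2 ◦ W1‖ + 2‖W2‖‖W1‖ = 0.65 + 0.65 + 1.3 = 2.6 ≤ 4 even though ‖W1‖ = 1.3 > 1, and the composition still behaves nonexpansively.

    ```python
    import numpy as np

    relu = lambda x: np.maximum(x, 0.0)     # a proximal activation
    W1 = np.diag([1.3, 0.1])                # ||W1|| = 1.3 > 1: expansive alone
    W2 = np.diag([0.5, 0.5])                # ||W2|| = 0.5, ||W2 W1|| = 0.65
    T = lambda x: relu(W2 @ relu(W1 @ x))   # P2 o W2 o P1 o W1

    rng = np.random.default_rng(2)
    for _ in range(1000):
        x, y = rng.normal(size=2), rng.normal(size=2)
        assert np.linalg.norm(T(x) - T(y)) <= np.linalg.norm(x - y) + 1e-12
    print("composition is empirically nonexpansive although ||W1|| > 1")
    ```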

  • Asymptotic behavior

    MODEL

    Let x0 ∈ H and let {λn}n∈N ⊂ ]0,+∞[. Iterate

    for n = 0, 1, . . .
      x1,n = R1,n(W1,n xn + b1,n)
      x2,n = R2,n(W2,n x1,n + b2,n)
      ...
      xm,n = Rm,n(Wm,n xm−1,n + bm,n)
      xn+1 = xn + λn(xm,n − xn)

    • Wi,n : Hi−1 → Hi is a bounded linear operator, bi,n ∈ Hi, and Ri,n : Hi → Hi is a (nonlinear) activation operator

    • (Hi)_{0≤i≤m} are real Hilbert spaces such that Hm = H0 = H

  • Asymptotic behavior

    (Same model and iteration as above.)

    Remark

    λn models a skip connection
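
    A toy run of this iteration (my illustration, not from the slides) for a periodic network on R² with ReLU activations, contractive illustrative weights, and λn ≡ 1/2: the relaxed iterates settle at a fixed point of T.

    ```python
    import numpy as np

    relu = lambda x: np.maximum(x, 0.0)
    W1, b1 = np.array([[0.3, -0.4], [0.4, 0.3]]), np.array([1.0, 0.5])
    W2, b2 = np.array([[0.5, 0.2], [-0.2, 0.5]]), np.array([0.2, 1.0])
    T = lambda x: relu(W2 @ relu(W1 @ x + b1) + b2)   # T = T_2 o T_1

    lam = 0.5                                 # skip-connection weight lambda_n
    x = np.array([5.0, -5.0])
    for _ in range(500):
        x = x + lam * (T(x) - x)              # x_{n+1} = x_n + lam (T x_n - x_n)
    print("x* =", x, "residual:", np.linalg.norm(T(x) - x))   # residual ~ 0
    ```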

  • Periodic networks

    ASSUMPTIONS

    • Periodicity: Ri,n ≡ Ri, Wi,n ≡ Wi, bi,n ≡ bi

    • Proximal activation: Ri = prox_{ϕi} for some ϕi ∈ Γ0(Hi) such that ϕi(0) = inf ϕi(Hi).

    NOTATION

    • For every i ∈ {1, . . . , m}, Ti : Hi−1 → Hi : x ↦ Ri(Wi x + bi)

    • F = Fix T with T = Tm ◦ · · · ◦ T1.

  • Associated variational inequality

    Find x1 ∈ H1, . . . , xm ∈ Hm such that

    b1 ∈ x1 − W1 xm + ∂ϕ1(x1)
    b2 ∈ x2 − W2 x1 + ∂ϕ2(x2)
    ...
    bm ∈ xm − Wm xm−1 + ∂ϕm(xm)    (1)

    More compactly, find x ∈ H = H1 ⊕ · · · ⊕ Hm such that

    b ∈ ∂ψ(x) + Bx,  B = Id − W ◦ S,

    where

    H⃗ = Hm ⊕ H1 ⊕ · · · ⊕ Hm−1

    S : H → H⃗ : (x1, . . . , xm−1, xm) ↦ (xm, x1, . . . , xm−1)

    W : H⃗ → H : (xm, x1, . . . , xm−1) ↦ (W1 xm, W2 x1, . . . , Wm xm−1)

    ψ : H → ]−∞,+∞] : x ↦ Σ_{i=1}^m ( ϕi(xi) − 〈xi | bi〉 )

  • Associated variational inequality

    Proposition

    Set F = Fix (Tm ◦ · · · ◦ T1) and

    𝐅 = { x ∈ H | x1 = T1 xm, x2 = T2 x1, . . . , xm = Tm xm−1 }

    The set of solutions to (1) is

    𝐅 = { (T1 xm, (T2 ◦ T1)xm, . . . , (Tm−1 ◦ · · · ◦ T1)xm, xm) | xm ∈ F }.

    Suppose that (Wi)_{1≤i≤m} satisfies averagedness conditions for some α ∈ [1/2, 1]. Then 𝐅 is closed and convex.

    Suppose that one of the following holds:

    ran (Tm ◦ · · · ◦ T1) is bounded.

    There exists j ∈ {1, . . . , m} such that dom ϕj is bounded.

    Then F and 𝐅 are nonempty.

    Remark

    A solution to (1) does not solve a minimization problem.

  • Periodic networks

    Theorem

    Suppose that the following hold:

    F ≠ Ø.

    (Wi)_{1≤i≤m} satisfies averagedness conditions with parameter α.

    One of the following is satisfied:

      λn ≡ 1/α = 1 and T xn − xn → 0.

      (λn)n∈N lies in ]0, 1/α[ and Σ_{n∈N} λn(1 − αλn) = +∞.

    Then (xn)n∈N converges weakly to a point xm ∈ F and (T1 xm, (T2 ◦ T1)xm, . . . , (Tm−1 ◦ · · · ◦ T1)xm, xm) solves (1).

  • Periodic networks

    (Same theorem as above.)

    Remark

    Suppose that (Wi)_{1≤i≤m} satisfies averagedness conditions with α ∈ [1/2, 1], (λn)n∈N lies in [ε, (1/α) − ε] for some ε ∈ ]0, 1/2[, and F = Ø. Then ‖xn‖ → +∞.

  • (Mildly) aperiodic networks

    ASSUMPTIONS

    There exist (ωn)n∈N ∈ ℓ¹₊, (ρn)n∈N ∈ ℓ¹₊, (ηn)n∈N ∈ ℓ¹₊, and (νn)n∈N ∈ ℓ¹₊ for which the following hold, for every i ∈ {1, . . . , m}:

    • There exists a bounded linear operator Wi : Hi−1 → Hi such that (∀n ∈ N) ‖Wi,n − Wi‖ ≤ ωn.

    • There exists a proximal activation operator Ri : Hi → Hi such that (∀n ∈ N)(∀x ∈ Hi) ‖Ri,n x − Ri x‖ ≤ ρn‖x‖ + ηn.

    • There exists bi ∈ Hi such that (∀n ∈ N) ‖bi,n − bi‖ ≤ νn.

    Same asymptotic results

    What about more general networks?

  • Averaged activation operators

    There exist a maximally monotone operator A : H → 2^H and a constant λ ∈ [0, 2] such that R = Id + λ(JA − Id).

    On the real line, an averaged activation operator is of the form R = Id + λ(prox_φ − Id), where φ ∈ Γ0(R) and λ ∈ [0, 2].

    This includes almost all recent proposals for activation in neural networks, e.g.,

    sine activation function R = sin

    piecewise Mexican-hat activation function

    absolute value function R = |·| (scattering networks)

    swish activation function (∀x ∈ R) R(x) = 5x/(6(1 + exp(−x)))

    etc.

    This formalism leads to tight Lipschitz constant derivations for the network → stability guarantees.

  • References

    PLC, Monotone operator theory in convex optimization, Math. Program., vol. B170, 2018.

    PLC/Pesquet, Deep neural network structures solving variational inequalities, arXiv, 2018.

    PLC/Pesquet, Lipschitz certificates for neural network structures driven by averaged activation operators, arXiv, 2019.

    Bauschke/PLC, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd ed., Springer, New York, 2017.

    Chierchia/Chouzenoux/PLC/Pesquet, Proximity Operator Repository, http://proximity-operator.net/
