L. Vandenberghe ECE236C (Spring 2020)

10. Dual proximal gradient method

• proximal gradient method applied to the dual

• examples

• alternating minimization method

10.1

Dual methods

Subgradient method: converges slowly, step size selection is difficult

Gradient method: requires differentiable dual cost function

• often the dual cost function is not differentiable, or has a nontrivial domain

• dual function can be smoothed by adding small strongly convex term to primal

Augmented Lagrangian method

• equivalent to gradient ascent on a smoothed dual problem

• quadratic penalty in augmented Lagrangian destroys separable primal structure

Proximal gradient method (this lecture): dual cost split in two terms

• one term is differentiable with Lipschitz continuous gradient

• other term has an inexpensive prox operator

Dual proximal gradient method 10.2

Composite primal and dual problem

primal: minimize f(x) + g(Ax)
dual: maximize −g*(z) − f*(−Aᵀz)

the dual problem has the right structure for the proximal gradient method if

• f is strongly convex: this implies f*(−Aᵀz) has a Lipschitz continuous gradient:

‖A∇f*(−Aᵀu) − A∇f*(−Aᵀv)‖₂ ≤ (‖A‖₂²/µ) ‖u − v‖₂

µ is the strong convexity constant of f (see page 5.19)

• prox operator of g (or g∗) is inexpensive (closed form or simple algorithm)

Dual proximal gradient method 10.3

Dual proximal gradient update

minimize g*(z) + f*(−Aᵀz)

• proximal gradient update:

z⁺ = prox_{tg*}(z + tA∇f*(−Aᵀz))

• ∇f* can be computed by minimizing the partial Lagrangian (from pages 5.15, 5.19):

x = argmin_x (f(x) + zᵀAx)

z⁺ = prox_{tg*}(z + tAx)

• partial Lagrangian is a separable function of x if f is separable

• step size t is constant (t ≤ µ/‖A‖₂²) or adjusted by backtracking

• faster variant uses accelerated proximal gradient method of lecture 7
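In code, one pass of this update is only a few lines. The following is a minimal NumPy sketch; grad_f_conj and prox_tg_conj are placeholders for the problem-specific oracles (the partial-Lagrangian minimizer and the prox of g*), both assumed given rather than part of the slides:

    import numpy as np

    def dual_prox_grad(A, grad_f_conj, prox_tg_conj, t, z0, iters=100):
        # minimize g*(z) + f*(-A'z) by the proximal gradient method;
        # grad_f_conj(u) returns grad f*(u) = argmin_x (f(x) - u'x)  [assumed oracle]
        # prox_tg_conj(v, t) returns prox_{t g*}(v)                  [assumed oracle]
        z = z0.copy()
        for _ in range(iters):
            x = grad_f_conj(-A.T @ z)             # minimizes f(x) + z'Ax
            z = prox_tg_conj(z + t * (A @ x), t)  # proximal gradient step
        return x, z

For instance, if f(x) = (1/2)xᵀPx + qᵀx with P ≻ 0, then grad_f_conj(u) is the linear solve np.linalg.solve(P, u - q).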

Dual proximal gradient method 10.4

Dual proximal gradient update

x = argmin_x (f(x) + zᵀAx)

z⁺ = prox_{tg*}(z + tAx)

• Moreau decomposition gives alternate expression for z-update:

z⁺ = z + tAx − t·prox_{t⁻¹g}(t⁻¹z + Ax)

• right-hand side can be written as z + t(Ax − y) where

y = prox_{t⁻¹g}(t⁻¹z + Ax)

  = argmin_y (g(y) + (t/2)‖Ax + t⁻¹z − y‖₂²)

  = argmin_y (g(y) + zᵀ(Ax − y) + (t/2)‖Ax − y‖₂²)

Dual proximal gradient method 10.5

Alternating minimization interpretation

x = argmin_x (f(x) + zᵀAx)

y = argmin_y (g(y) − zᵀy + (t/2)‖Ax − y‖₂²)

z⁺ = z + t(Ax − y)

• first minimize Lagrangian over x, then augmented Lagrangian over y

• compare with augmented Lagrangian method:

(x, y) = argmin_{x,y} (f(x) + g(y) + zᵀ(Ax − y) + (t/2)‖Ax − y‖₂²)

• requires strongly convex f (in contrast to augmented Lagrangian method)

Dual proximal gradient method 10.6

Outline

• proximal gradient method applied to the dual

• examples

• alternating minimization method

Regularized norm approximation

primal: minimize f(x) + ‖Ax − b‖
dual: maximize −bᵀz − f*(−Aᵀz)
      subject to ‖z‖∗ ≤ 1

(see page 5.23)

• we assume f is strongly convex with constant µ, not necessarily differentiable

• we assume projections on unit ‖ · ‖∗-ball are simple

• this is a special case of the problem on page 10.3 with g(y) = ‖y − b‖:

g*(z) = bᵀz if ‖z‖∗ ≤ 1, +∞ otherwise;   prox_{tg*}(z) = P_C(z − tb)

where P_C denotes projection on the unit ‖·‖∗-ball C = {z | ‖z‖∗ ≤ 1}

Dual proximal gradient method 10.7

Dual gradient projection

primal: minimize f(x) + ‖Ax − b‖
dual: maximize −bᵀz − f*(−Aᵀz)
      subject to ‖z‖∗ ≤ 1

• dual gradient projection update (C = {z | ‖z‖∗ ≤ 1}):

z⁺ = P_C(z + t(A∇f*(−Aᵀz) − b))

• gradient of f* can be computed by minimizing the partial Lagrangian:

x = argmin_x (f(x) + zᵀAx)

z⁺ = P_C(z + t(Ax − b))

Dual proximal gradient method 10.8

Example

primal: minimize f(x) + Σᵢ₌₁ᵖ ‖Bᵢx‖₂

dual: maximize −f*(−B₁ᵀz₁ − ⋯ − Bₚᵀzₚ)
      subject to ‖zᵢ‖₂ ≤ 1, i = 1, …, p

Dual gradient projection update (for strongly convex f ):

x = argmin_x (f(x) + (Σᵢ₌₁ᵖ Bᵢᵀzᵢ)ᵀ x)

zᵢ⁺ = P_{Cᵢ}(zᵢ + tBᵢx),  i = 1, …, p

• Cᵢ is the unit Euclidean norm ball in R^{mᵢ}, if Bᵢ ∈ R^{mᵢ×n}

• x-calculation decomposes if f is separable

Dual proximal gradient method 10.9

Example

• we take f(x) = (1/2)‖Cx − d‖₂²

• each iteration requires the solution of a linear equation with coefficient CᵀC

• randomly generated C ∈ R^{2000×1000}, Bᵢ ∈ R^{10×1000}, p = 500

[Figure: relative dual suboptimality versus iteration (0 to 500) for the projected gradient method and FISTA]
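For reference, a minimal NumPy sketch of this experiment on a smaller random instance (the dimensions and iteration count below are illustrative, not the slide's; the FISTA variant is omitted):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, p, mi = 200, 100, 50, 10
    C = rng.standard_normal((m, n)); d = rng.standard_normal(m)
    B = [rng.standard_normal((mi, n)) for _ in range(p)]

    CtC, Ctd = C.T @ C, C.T @ d
    mu = np.linalg.eigvalsh(CtC)[0]              # strong convexity constant of f
    normA2 = np.linalg.eigvalsh(sum(Bi.T @ Bi for Bi in B))[-1]
    t = mu / normA2                              # fixed step size t = mu/||A||_2^2

    def proj_unit_ball(z):                       # projection on {z : ||z||_2 <= 1}
        return z / max(1.0, np.linalg.norm(z))

    z = [np.zeros(mi) for _ in range(p)]
    for _ in range(500):
        # x-update: solve C'C x = C'd - sum_i B_i' z_i (partial Lagrangian minimizer)
        x = np.linalg.solve(CtC, Ctd - sum(Bi.T @ zi for Bi, zi in zip(B, z)))
        # z-update: project each dual block on the unit Euclidean ball
        z = [proj_unit_ball(zi + t * (Bi @ x)) for Bi, zi in zip(B, z)]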

Dual proximal gradient method 10.10

Minimization over intersection of convex sets

minimize f(x)
subject to x ∈ C₁ ∩ ⋯ ∩ Cₚ

• f is strongly convex with constant µ

• we assume each set Ci is closed, convex, and easy to project onto

• this is a special case of the problem on page 10.3 with

g(y₁, …, yₚ) = δC₁(y₁) + ⋯ + δCₚ(yₚ),   A = [I I ⋯ I]ᵀ

with this choice of g and A,

f(x) + g(Ax) = f(x) + δC₁(x) + ⋯ + δCₚ(x)

Dual proximal gradient method 10.11

Dual problem

primal: minimize f(x) + δC₁(x) + ⋯ + δCₚ(x)
dual: maximize −δ*C₁(z₁) − ⋯ − δ*Cₚ(zₚ) − f*(−z₁ − ⋯ − zₚ)

• proximal mapping of δ*Cᵢ: from the Moreau decomposition (page 6.18),

prox_{tδ*Cᵢ}(u) = u − tP_{Cᵢ}(u/t)

• gradient of h(z₁, …, zₚ) = f*(−z₁ − ⋯ − zₚ):

∇h(z) = −A∇f*(−Aᵀz) = −[I ⋯ I]ᵀ ∇f*(−z₁ − ⋯ − zₚ)

• ∇h(z) is Lipschitz continuous with constant ‖A‖₂²/µ = p/µ
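The value ‖A‖₂² = p is a one-line computation from the stacked-identity structure of A:

    AᵀA = I + I + ⋯ + I = pI,   so   ‖A‖₂² = λmax(AᵀA) = p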

Dual proximal gradient method 10.12

Dual proximal gradient method

primal: minimize f(x) + δC₁(x) + ⋯ + δCₚ(x)
dual: maximize −δ*C₁(z₁) − ⋯ − δ*Cₚ(zₚ) − f*(−z₁ − ⋯ − zₚ)

• dual proximal gradient update

s = −z₁ − ⋯ − zₚ

zᵢ⁺ = zᵢ + t∇f*(s) − tP_{Cᵢ}(t⁻¹zᵢ + ∇f*(s)),  i = 1, …, p

• gradient of f* can be computed by minimizing the partial Lagrangian

x = argmin_x (f(x) + (z₁ + ⋯ + zₚ)ᵀx)

zᵢ⁺ = zᵢ + tx − tP_{Cᵢ}(zᵢ/t + x),  i = 1, …, p

• step size is fixed (t ≤ µ/p) or adjusted by backtracking

Dual proximal gradient method 10.13

Euclidean projection on intersection of convex sets

minimize (1/2)‖x − a‖₂²
subject to x ∈ C₁ ∩ ⋯ ∩ Cₚ

• special case of the previous problem with

f(x) = (1/2)‖x − a‖₂²,   f*(u) = (1/2)‖u‖₂² + aᵀu

• strong convexity constant µ = 1; hence step size t = 1/p works

• dual proximal gradient update (with change of variable wᵢ = pzᵢ):

x = a − (1/p)(w₁ + ⋯ + wₚ)

wᵢ⁺ = wᵢ + x − P_{Cᵢ}(wᵢ + x),  i = 1, …, p

• the p projections in the second step can be computed in parallel
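A minimal NumPy sketch of this scheme; `projections` is a list of user-supplied functions, one Euclidean projection per set (an assumption, since the Cᵢ are problem-specific):

    import numpy as np

    def project_intersection(a, projections, iters=200):
        # dual proximal gradient method with t = 1/p for
        #   minimize (1/2)||x - a||_2^2  subject to  x in C_1 ∩ ... ∩ C_p;
        # `projections` holds the Euclidean projections on C_1, ..., C_p
        p = len(projections)
        w = [np.zeros_like(a) for _ in range(p)]
        for _ in range(iters):
            x = a - sum(w) / p
            # the p projections are independent and could run in parallel
            w = [wi + x - P(wi + x) for wi, P in zip(w, projections)]
        return x

For instance, project_intersection(a, [lambda v: np.maximum(v, 0.0), lambda v: np.clip(v, -1.0, 1.0)]) projects a on the intersection of the nonnegative orthant and the unit box.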

Dual proximal gradient method 10.14

Nearest positive semidefinite unit-diagonal Z-matrix

projection in Frobenius norm of A ∈ S^100 on the intersection of two sets:

C₁ = S^100_+,   C₂ = {X ∈ S^100 | diag(X) = 1, Xᵢⱼ ≤ 0 for i ≠ j}

[Figure: relative dual suboptimality versus iteration (0 to 150) for the proximal gradient method and FISTA]
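The two projections in this example have simple closed forms; a sketch (these can be passed directly to the project_intersection routine sketched on the previous page, with a set to the matrix A):

    import numpy as np

    def proj_psd(X):
        # projection on S^n_+: clip the eigenvalues of X at zero
        lam, Q = np.linalg.eigh(X)
        return (Q * np.maximum(lam, 0.0)) @ Q.T

    def proj_zmatrix(X):
        # projection on {X | diag(X) = 1, X_ij <= 0 for i != j}: the constraints
        # are entrywise, so clip off-diagonal entries at zero and reset the diagonal
        Y = np.minimum(X, 0.0)
        np.fill_diagonal(Y, 1.0)
        return Y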

Dual proximal gradient method 10.15

Euclidean projection on polyhedron

• intersection of p halfspaces Cᵢ = {x | aᵢᵀx ≤ bᵢ}, with projection

P_{Cᵢ}(x) = x − (max{aᵢᵀx − bᵢ, 0}/‖aᵢ‖₂²) aᵢ

• example with p = 2000 inequalities and n = 1000 variables

[Figure: relative dual suboptimality versus iteration (0 to 4000) for the proximal gradient method and FISTA]
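The halfspace projection formula above is one line of NumPy; a sketch:

    import numpy as np

    def proj_halfspace(x, a, b):
        # Euclidean projection on the halfspace {x | a'x <= b}
        return x - (max(a @ x - b, 0.0) / (a @ a)) * a

Wrapping one such function per inequality, e.g. [lambda v, a=a, b=b: proj_halfspace(v, a, b) for a, b in zip(A_ineq, b_ineq)] with hypothetical constraint data A_ineq, b_ineq, gives the projection list for the method of page 10.14.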

Dual proximal gradient method 10.16

Decomposition of primal-dual separable problems

minimize Σⱼ₌₁ⁿ fⱼ(xⱼ) + Σᵢ₌₁ᵐ gᵢ(Aᵢ₁x₁ + ⋯ + Aᵢₙxₙ)

• special case of f(x) + g(Ax) with (block-)separable f and g

• for example,

minimize Σⱼ₌₁ⁿ fⱼ(xⱼ)
subject to Σⱼ₌₁ⁿ A₁ⱼxⱼ ∈ C₁,  …,  Σⱼ₌₁ⁿ Aₘⱼxⱼ ∈ Cₘ

• we assume each fⱼ is strongly convex; each gᵢ has an inexpensive prox operator

Dual proximal gradient method 10.17

Decomposition of primal-dual separable problems

primal: minimize Σⱼ₌₁ⁿ fⱼ(xⱼ) + Σᵢ₌₁ᵐ gᵢ(Aᵢ₁x₁ + ⋯ + Aᵢₙxₙ)

dual: maximize −Σᵢ₌₁ᵐ gᵢ*(zᵢ) − Σⱼ₌₁ⁿ fⱼ*(−A₁ⱼᵀz₁ − ⋯ − Aₘⱼᵀzₘ)

Dual proximal gradient update

xⱼ = argmin_{xⱼ} (fⱼ(xⱼ) + Σᵢ₌₁ᵐ zᵢᵀAᵢⱼxⱼ),  j = 1, …, n

zᵢ⁺ = prox_{tgᵢ*}(zᵢ + t Σⱼ₌₁ⁿ Aᵢⱼxⱼ),  i = 1, …, m
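A sketch of one such step in Python; argmin_f and prox_tg_conj are assumed oracles for the j-th primal minimization and the prox of tgᵢ*, and A is a nested list with A[i][j] = Aᵢⱼ (all names hypothetical):

    def decomposition_step(z, A, argmin_f, prox_tg_conj, t):
        # one dual proximal gradient step; A[i][j] holds the block A_ij,
        # argmin_f(j, v) minimizes f_j(x_j) + v'x_j             [assumed oracle]
        # prox_tg_conj(i, v, t) evaluates prox of t*g_i^* at v  [assumed oracle]
        m, n = len(A), len(A[0])
        # n independent x-updates (parallelizable since f is separable)
        x = [argmin_f(j, sum(A[i][j].T @ z[i] for i in range(m))) for j in range(n)]
        # m independent z-updates (parallelizable since g is separable)
        z = [prox_tg_conj(i, z[i] + t * sum(A[i][j] @ x[j] for j in range(n)), t)
             for i in range(m)]
        return x, z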

Dual proximal gradient method 10.18

Outline

• proximal gradient method applied to the dual

• examples

• alternating minimization method

Separable structure with one strongly convex term

minimize f₁(x₁) + f₂(x₂) + g(A₁x₁ + A₂x₂)

• composite problem with separable f (two terms, for simplicity)

• if f₁ and f₂ are strongly convex, the dual method of page 10.4 applies:

x₁ = argmin_{x₁} (f₁(x₁) + zᵀA₁x₁)

x₂ = argmin_{x₂} (f₂(x₂) + zᵀA₂x₂)

z⁺ = prox_{tg*}(z + t(A₁x₁ + A₂x₂))

• we now assume that one function (f₂) is not strongly convex

Dual proximal gradient method 10.19

Separable structure with one strongly convex term

primal: minimize f₁(x₁) + f₂(x₂) + g(A₁x₁ + A₂x₂)

dual: maximize −g*(z) − f₁*(−A₁ᵀz) − f₂*(−A₂ᵀz)

• we split the dual objective in the components −f₁*(−A₁ᵀz) and −g*(z) − f₂*(−A₂ᵀz)

• the component f₁*(−A₁ᵀz) is differentiable with a Lipschitz continuous gradient

• the proximal mapping of h(z) = g*(z) + f₂*(−A₂ᵀz) was discussed on page 8.7:

prox_{th}(w) = w + t(A₂x₂ − y)

where x₂, y minimize a partial augmented Lagrangian

(x₂, y) = argmin_{x₂,y} (f₂(x₂) + g(y) + (t/2)‖A₂x₂ − y + w/t‖₂²)

Dual proximal gradient method 10.20

Dual proximal gradient method

z⁺ = prox_{th}(z + tA₁∇f₁*(−A₁ᵀz))

• evaluate ∇f₁* by minimizing the partial Lagrangian:

x₁ = argmin_{x₁} (f₁(x₁) + zᵀA₁x₁)

z⁺ = prox_{th}(z + tA₁x₁)

• evaluate prox_{th}(z + tA₁x₁) by minimizing the augmented Lagrangian:

(x₂, y) = argmin_{x₂,y} (f₂(x₂) + g(y) + (t/2)‖A₂x₂ − y + z/t + A₁x₁‖₂²)

z⁺ = z + t(A₁x₁ + A₂x₂ − y)

Dual proximal gradient method 10.21

Alternating minimization method

starting at some initial z, repeat the following iteration

1. minimize the Lagrangian over x₁:

x₁ = argmin_{x₁} (f₁(x₁) + zᵀA₁x₁)

2. minimize the augmented Lagrangian over x₂, y:

(x₂, y) = argmin_{x₂,y} (f₂(x₂) + g(y) + (t/2)‖A₁x₁ + A₂x₂ − y + z/t‖₂²)

3. update the dual variable:

z⁺ = z + t(A₁x₁ + A₂x₂ − y)
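A sketch of this loop in Python, with the two inner minimizations as assumed problem-specific oracles (the names are hypothetical):

    def alternating_minimization(z0, A1, A2, t, argmin_lagr_x1, argmin_auglagr_x2y,
                                 iters=100):
        # argmin_lagr_x1(v)     minimizes f1(x1) + v'x1, v = A1'z       [assumed oracle]
        # argmin_auglagr_x2y(u) minimizes f2(x2) + g(y) + (t/2)||A2 x2 - y + u||_2^2
        #                       jointly over (x2, y), u = A1 x1 + z/t   [assumed oracle]
        z = z0.copy()
        for _ in range(iters):
            x1 = argmin_lagr_x1(A1.T @ z)                 # 1. Lagrangian in x1
            x2, y = argmin_auglagr_x2y(A1 @ x1 + z / t)   # 2. augmented Lagrangian
            z = z + t * (A1 @ x1 + A2 @ x2 - y)           # 3. dual update
        return x1, x2, z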

Dual proximal gradient method 10.22

Comparison with augmented Lagrangian method

Augmented Lagrangian method (for problem on page 10.19)

1. compute the minimizer (x₁, x₂, y) of the augmented Lagrangian

f₁(x₁) + f₂(x₂) + g(y) + (t/2)‖A₁x₁ + A₂x₂ − y + z/t‖₂²

2. update the dual variable:

z⁺ = z + t(A₁x₁ + A₂x₂ − y)

Differences with alternating minimization (dual proximal gradient method)

• augmented Lagrangian method does not require strong convexity of f₁

• there is no upper limit on the step size t in the augmented Lagrangian method

• quadratic term in step 1 of the AL method destroys separability of f₁(x₁) + f₂(x₂)

Dual proximal gradient method 10.23

Example

minimize (1/2)x₁ᵀPx₁ + q₁ᵀx₁ + q₂ᵀx₂
subject to B₁x₁ ≼ d₁,  B₂x₂ ≼ d₂
           A₁x₁ + A₂x₂ = b

• without the equality constraint, the problem would separate into an independent QP and LP

• we assume P ≻ 0

Formulation for dual decomposition

minimize f₁(x₁) + f₂(x₂)
subject to A₁x₁ + A₂x₂ = b

• first function is strongly convex:

f₁(x₁) = (1/2)x₁ᵀPx₁ + q₁ᵀx₁,   dom f₁ = {x₁ | B₁x₁ ≼ d₁}

• second function is not: f₂(x₂) = q₂ᵀx₂ with domain {x₂ | B₂x₂ ≼ d₂}

Dual proximal gradient method 10.24

Example

Alternating minimization algorithm

1. compute the solution x₁ of the QP

minimize (1/2)x₁ᵀPx₁ + (q₁ + A₁ᵀz)ᵀx₁
subject to B₁x₁ ≼ d₁

2. compute the solution x₂ of the QP

minimize (q₂ + A₂ᵀz)ᵀx₂ + (t/2)‖A₁x₁ + A₂x₂ − b‖₂²
subject to B₂x₂ ≼ d₂

3. dual update:

z⁺ = z + t(A₁x₁ + A₂x₂ − b)
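A sketch of one iteration of this algorithm using CVXPY for the two inner QPs; the problem data P, q1, q2, A1, A2, B1, B2, d1, d2, b and the step size t are assumed given as NumPy arrays and a scalar:

    import cvxpy as cp

    def iteration(z, P, q1, q2, A1, A2, B1, B2, d1, d2, b, t):
        # 1. QP in x1: minimize (1/2)x1'Px1 + (q1 + A1'z)'x1  s.t.  B1 x1 <= d1
        x1 = cp.Variable(A1.shape[1])
        cp.Problem(cp.Minimize(0.5 * cp.quad_form(x1, P) + (q1 + A1.T @ z) @ x1),
                   [B1 @ x1 <= d1]).solve()

        # 2. QP in x2: linear cost plus augmented Lagrangian penalty, x1 fixed
        x2 = cp.Variable(A2.shape[1])
        cp.Problem(cp.Minimize((q2 + A2.T @ z) @ x2
                               + (t / 2) * cp.sum_squares(A1 @ x1.value + A2 @ x2 - b)),
                   [B2 @ x2 <= d2]).solve()

        # 3. dual update
        return z + t * (A1 @ x1.value + A2 @ x2.value - b)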

Dual proximal gradient method 10.25

References

• P. Tseng, Applications of a splitting algorithm to decomposition in convex programming and variational inequalities, SIAM J. Control and Optimization (1991).

• P. Tseng, Further applications of a splitting algorithm to decomposition in variational inequalities and convex programming, Mathematical Programming (1990).

Dual proximal gradient method 10.26