Recent Advances in Alternating Direction Methods: Practice and Theory

Yin Zhang

Department of Computational and Applied Mathematics, Rice University, Houston, Texas, U.S.A.

“Numerical Methods for Continuous Optimization” IPAM Workshop at UCLA

Los Angeles, October 11, 2010

Outline:

Alternating Direction Method (ADM)
— algorithm, history, convergence

Some Recent Applications and Extensions
— convex and non-convex models

Local Convergence and Rate
— general equality constrained problem
— general splitting and stationary iterations

Summary

The talk involves contributions from (in random order):

Junfeng Yang, Zaiwen Wen, Yilun Wang, Chengbo Li, Yuan Shen, Wotao Yin


Introduction to ADM

Idea:

“Divide and Conquer.” — Julius Caesar (100-44 BC)

“Befriend the distant, attack the nearby”; “defeat them one by one”. — Sun-Tzu (400 BC)


Classic Alternating Direction Method

Convex program with structure (2-block separability):

min_{x,y} f1(x) + f2(y) : Ax + By = b, x ∈ X, y ∈ Y.

Augmented Lagrangian (AL):

L_A(x, y, λ) = f1(x) + f2(y) + (β/2)( ‖Ax + By − b − λ‖² − ‖λ‖² )    (AL)

ADM: for γ ∈ (0, 1.618),

x^{k+1} ← argmin_x { L_A(x, y^k, λ^k) : x ∈ X }
y^{k+1} ← argmin_y { L_A(x^{k+1}, y, λ^k) : y ∈ Y }
λ^{k+1} ← λ^k − γ(Ax^{k+1} + By^{k+1} − b)

“An inexact version” of the AL Method. Hestenes-69, Powell-69

(AL Iterations = Bregman Iterations by Yin-Osher-Goldfarb-Darbon-08)
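To make the scheme concrete, here is a minimal NumPy sketch of the ADM loop above on a toy instance with quadratic f1(x) = ½‖x − p‖² and f2(y) = ½‖y − q‖², chosen by me (not from the talk) so that both argmin steps have closed forms:

```python
import numpy as np

def adm_two_block(A, B, b, p, q, beta=1.0, gamma=1.0, iters=500):
    """Sketch of classic 2-block ADM for
        min 0.5||x - p||^2 + 0.5||y - q||^2  s.t.  Ax + By = b,
    using the scaled augmented Lagrangian from the slide;
    gamma may be taken in (0, 1.618)."""
    m, n1 = A.shape
    n2 = B.shape[1]
    x, y, lam = np.zeros(n1), np.zeros(n2), np.zeros(m)
    # Both subproblems are small linear systems; factor them once.
    Hx = np.linalg.inv(np.eye(n1) + beta * A.T @ A)
    Hy = np.linalg.inv(np.eye(n2) + beta * B.T @ B)
    for _ in range(iters):
        x = Hx @ (p + beta * A.T @ (b + lam - B @ y))  # x-subproblem
        y = Hy @ (q + beta * B.T @ (b + lam - A @ x))  # y-subproblem
        lam = lam - gamma * (A @ x + B @ y - b)        # multiplier update
    return x, y, lam
```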


ADM overview

ADM as we know today

Glowinski-Marocco-75 and Gabay-Mercier-76

Glowinski et al. 81-89, Gabay-83, ...

Connections before Aug. Lagrangian

Douglas, Peaceman, Rachford (mid-1950s)

operator splittings for PDEs (a.k.a. ADI methods)

Some subsequent studies

variational inequality, proximal point (Eckstein-Bertsekas-92)

inexact ADM (He-Liao-Han-Yang-02), ...


ADM Global Convergence

e.g., “Augmented Lagrangian methods ...” Fortin-Glowinski-83

Assumptions required by current theory:

convexity over the entire domain

separability for exactly 2 blocks, no more

exact or high-accuracy minimization for each block

Strength:

differentiability not required

side-constraints allowed: x ∈ X , y ∈ Y

But

some assumptions are restrictive

no rate of convergence result


Some Recent Applications

From PDE to:

Signal/Image Processing

Sparse Optimization


ℓ1-minimization in Compressive Sensing

Sparse signal recovery model: A ∈ R^{m×n} (m < n)

min ‖x‖_1 : Ax = b  ⇔  max b^T y : A^T y ∈ [−1, 1]^n

Split A from the box:

max b^T y : A^T y = z ∈ [−1, 1]^n

ADM (one of the variants in Yang-Z-09):

y ← (AA^T)^{-1}(A(z − x) + b/β)
z ← P_{[−1,1]^n}(A^T y + x)
x ← x − γ(z − A^T y)

— Most efficient when AA^T = I (common in CS).
— Use a steepest descent step for y if AA^T ≠ I
(still good in CS because of a well-conditioned A)
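A minimal NumPy sketch of these three updates, assuming AA^T = I so that the inverse drops out (variable names are mine):

```python
import numpy as np

def dual_adm_l1(A, b, beta=1.0, gamma=1.0, iters=300):
    """Sketch of the dual ADM above for min ||x||_1 s.t. Ax = b,
    assuming A has orthonormal rows (A A^T = I, common in CS)."""
    m, n = A.shape
    x, z, y = np.zeros(n), np.zeros(n), np.zeros(m)
    for _ in range(iters):
        y = A @ (z - x) + b / beta           # (A A^T)^{-1} omitted: A A^T = I
        z = np.clip(A.T @ y + x, -1.0, 1.0)  # projection onto the box [-1, 1]^n
        x = x - gamma * (z - A.T @ y)        # multiplier of the dual constraint
    return x
```

Here x, the multiplier of the dual constraint A^T y = z, serves as the primal iterate.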


Numerical Comparison

ADM solver package YALL1: http://yall1.blogs.rice.edu/

Compared codes: SPGL1, NESTA, SpaRSA, FPC, FISTA, CGD

[Figure: relative error vs. iteration, log-log. Left panel, m/n: 0.20, p/m: 0.20: PADM, DADM, SPGL1, NESTA. Right panel, m/n: 0.30, p/m: 0.20, μ: 1.0e−4: PADM, DADM, SpaRSA, FPC_BB, FISTA, CGD.]

(noisy measurements, average of 50 runs)

Nonasymptotically, the ADMs showed the fastest speed of convergence in reducing the error ‖x^k − x^*‖.


TV-minimization in Image Processing

TV/L2 deconvolution model (Rudin-Osher-Fatemi-92):

min_u Σ_i ‖D_i u‖ + (µ/2)‖Ku − f‖²,

which is equivalent to:

min_{u,w} Σ_i ‖w_i‖ + (µ/2)‖Ku − f‖² : w_i = D_i u, ∀i.

Augmented Lagrangian function L_A(w, u, λ):

Σ_i ( ‖w_i‖ − λ_i^T(w_i − D_i u) + (β/2)‖w_i − D_i u‖² ) + (µ/2)‖Ku − f‖².

Exact minimization for w (shrinkage) and u (FFT) (e.g., Yang-Yin-Z-Wang-08 (λ_i = 0), Goldstein-Osher-08, Esser-09).
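The w-subproblem's closed form is the two-dimensional shrinkage; a minimal NumPy sketch (the 2 × npixels layout of the stacked gradient components is my assumption):

```python
import numpy as np

def shrink2d(V, t):
    """Per-pixel solution of min_w ||w|| + (beta/2)||w - v||^2 with t = 1/beta:
    w_i = max(||v_i|| - t, 0) * v_i / ||v_i||, where v_i = D_i u + lambda_i / beta.
    V holds the two gradient components row-wise, shape (2, npixels)."""
    norms = np.sqrt((V ** 2).sum(axis=0))
    scale = np.maximum(norms - t, 0.0) / np.maximum(norms, 1e-12)
    return V * scale
```

The u-subproblem is a least-squares problem whose normal equations FFTs diagonalize under periodic boundary conditions, which is what keeps the exact u-minimization cheap.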


Multiplier helps: Penalty vs. ADM

[Figure: SNR history (dB) and function values vs. iteration number, comparing FTVd (penalty) and ADM.]

Matlab package FTVd (Wang-Yang-Yin-Z-07∼09): http://www.caam.rice.edu/~optimization/L1/ftvd/

(v1-3 use the quadratic penalty; v4 applies ADM)


Multi-Signal Reconstruction with Joint Sparsity

Recover a set of jointly sparse signals X = [x_1 · · · x_p] ∈ R^{n×p}

min_X Σ_{i=1}^n ‖e_i^T X‖_2 s.t. A_j x_j = b_j, ∀j.

Assume A_j = A for simplicity. Introduce the splitting variable Z ∈ R^{p×n}:

min_X Σ_{i=1}^n ‖Z e_i‖_2 s.t. Z = X^T, AX = B.

ADM:

Z ← shrink(X^T + Λ1, 1/β) (column-wise)
X ← (I + A^T A)^{-1}((Z − Λ1)^T + A^T(B + Λ2))
(Λ1, Λ2) ← (Λ1, Λ2) − γ(Z − X^T, AX − B)

Easy if AA^T = I; else take a steepest descent step in X.
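A minimal NumPy sketch of the three updates above (single A; initialization and names are mine):

```python
import numpy as np

def joint_sparse_adm(A, B, beta=1.0, gamma=1.0, iters=300):
    """Sketch of the ADM above for min sum_i ||Z e_i||_2
    s.t. Z = X^T, A X = B (a single A assumed)."""
    m, n = A.shape
    p = B.shape[1]
    X = np.zeros((n, p))
    Z, L1 = np.zeros((p, n)), np.zeros((p, n))
    L2 = np.zeros((m, p))
    H = np.linalg.inv(np.eye(n) + A.T @ A)     # factor once, reuse every sweep
    for _ in range(iters):
        V = X.T + L1                           # column-wise 2-norm shrinkage
        nrm = np.maximum(np.linalg.norm(V, axis=0), 1e-12)
        Z = V * np.maximum(nrm - 1.0 / beta, 0.0) / nrm
        X = H @ ((Z - L1).T + A.T @ (B + L2))
        L1 = L1 - gamma * (Z - X.T)
        L2 = L2 - gamma * (A @ X - B)
    return X
```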


A Sample of Recent Works on Convex Programs

D. Goldfarb, S. Ma, and K. Scheinberg, “Fast Alternating Linearization Methods for Minimizing the Sum of Two Convex Functions”, TR Columbia, 2010.

M. V. Afonso, J. Bioucas-Dias, and M. Figueiredo, “A fast algorithm for the constrained formulation of compressive image reconstruction and other linear inverse problems”, arXiv, 2009.

S. Setzer, “Split Bregman Algorithm, Douglas-Rachford Splitting and Frame Shrinkage”, Proc. 2nd International Conference on Scale Space and Variational Methods in Computer Vision, Lecture Notes in Computer Science, 2009.


Extensions to Non-convex Territories

Low-Rank/Sparse Matrix Models

Non-convex, Non-separable functions

More than 2 blocks


Matrix Fitting Models (I): Completion

Find low-rank Z to fit data M_ij, (i, j) ∈ Ω

Nuclear-norm minimization is good, but SVDs are expensive.

Non-convex model (Wen-Yin-Z-09): find X ∈ R^{m×k}, Y ∈ R^{k×n}

min_{X,Y,Z} ‖XY − Z‖²_F s.t. P_Ω(Z − M) = 0

An SOR scheme:

Z ← ωZ + (1 − ω)XY
X ← qr(ZY^T)
Y ← X^T Z
Z ← XY + P_Ω(M − XY)

One small QR (m × k) per iteration; no SVD. Much faster than SVD-based codes (FPCA, APGL, ...).
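A minimal NumPy sketch of this SOR scheme (random initialization and names are my assumptions):

```python
import numpy as np

def completion_sor(M, mask, k, omega=1.0, iters=200):
    """Sketch of the SOR-type alternating scheme above for low-rank
    matrix completion: one small (m x k) QR per iteration, no SVD.
    mask is a boolean array marking the observed entries Omega."""
    m, n = M.shape
    X = np.random.randn(m, k)
    Y = np.random.randn(k, n)
    Z = np.where(mask, M, 0.0)
    for _ in range(iters):
        Z = omega * Z + (1.0 - omega) * (X @ Y)
        X, _ = np.linalg.qr(Z @ Y.T)           # orthonormal basis of Z Y^T
        Y = X.T @ Z
        XY = X @ Y
        Z = np.where(mask, M, XY)              # Z = XY + P_Omega(M - XY)
    return X, Y
```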


Nonlinear GS vs SOR

[Figure: normalized residual vs. iteration for SOR and GS with k = 12, 20. (a) n = 1000, r = 10, SR = 0.08; (b) n = 1000, r = 10, SR = 0.15.]

Alternating minimization, but no multiplier, for storage reasons

Is non-convexity a problem for global optimization of this problem?
— “Yes” in theory
— “Not really” in practice


Matrix Fitting Models (II): Separation

Given data D_ij, (i, j) ∈ Ω,

find a low-rank Z so that the difference P_Ω(Z − D) is sparse.

Non-convex model (Shen-Wen-Z-10): U ∈ R^{m×k}, V ∈ R^{k×n}

min_{U,V,Z} ‖P_Ω(Z − D)‖_1 s.t. Z − UV = 0

ADM scheme:

U ← qr((Z − Λ/β)V^T)
V ← U^T(Z − Λ/β)
P_{Ω^c}(Z) ← P_{Ω^c}(UV + Λ/β)
P_Ω(Z) ← P_Ω(shrink(· · · ) + D)
Λ ← Λ − γβ(Z − UV)

— One small QR; no SVD; faster.
— Non-convex, 3 blocks, nonlinear constraint: convergence?
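A minimal NumPy sketch of this scheme; the elided shrink argument is filled in from the closed form of the Z-subproblem on Ω (my derivation): P_Ω(Z) = P_Ω(D + shrink(UV + Λ/β − D, 1/β)).

```python
import numpy as np

def shrink(T, t):
    """Soft-thresholding: sign(T) * max(|T| - t, 0)."""
    return np.sign(T) * np.maximum(np.abs(T) - t, 0.0)

def separation_adm(D, mask, k, beta=1.0, gamma=1.0, iters=200):
    """Sketch of the ADM above for min ||P_Omega(Z - D)||_1 s.t. Z = UV."""
    m, n = D.shape
    U = np.random.randn(m, k)
    V = np.random.randn(k, n)
    Z = np.where(mask, D, 0.0)
    Lam = np.zeros((m, n))
    for _ in range(iters):
        W = Z - Lam / beta
        U, _ = np.linalg.qr(W @ V.T)           # one small QR, no SVD
        V = U.T @ W
        T = U @ V + Lam / beta
        Z = np.where(mask, D + shrink(T - D, 1.0 / beta), T)
        Lam = Lam - gamma * beta * (Z - U @ V)
    return U, V, Z
```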


Matrix Separation: recoverability can be better

[Figure: phase plots of digits of accuracy (2-7) over sparsity % (10-60) and rank (20-120). (a) LMaFit (our solver); (b) IALM (nuclear-norm).]

In S + Z = D, when S dominates Z in magnitude, our model has worse recoverability. So the two models complement each other.


Nonnegative Matrix Factorization (Z-09)

Given A ∈ R^{n×n}, find X, Y ∈ R^{n×k} (k ≪ n),

min ‖XY^T − A‖²_F s.t. X, Y ≥ 0

Splitting:

min ‖XY^T − A‖²_F s.t. X = U1, Y = U2, U1, U2 ≥ 0

ADM scheme:

X ← (AY + β(U1 − Λ1))(Y^T Y + βI)^{-1}
Y^T ← (X^T X + βI)^{-1}(X^T A + β(U2 − Λ2)^T)
(U1, U2) ← P_+(X + Λ1, Y + Λ2)
(Λ1, Λ2) ← (Λ1, Λ2) − γ(X − U1, Y − U2)

— cost/iter: two k × k linear systems plus matrix arithmetic
— better performance than the Matlab built-in function “nnmf”
— non-convex, non-separable, 3 blocks: convergence?
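A minimal NumPy sketch of this ADM scheme (nonnegative random initialization is my choice; the transpose on (U2 − Λ2) is needed for the dimensions to match):

```python
import numpy as np

def nmf_adm(A, k, beta=1.0, gamma=1.0, iters=300):
    """Sketch of the ADM above for min ||X Y^T - A||_F^2 s.t. X, Y >= 0
    via the splitting X = U1, Y = U2; two k x k solves per iteration."""
    n = A.shape[0]
    X = np.abs(np.random.randn(n, k))
    Y = np.abs(np.random.randn(n, k))
    U1, U2 = X.copy(), Y.copy()
    L1, L2 = np.zeros((n, k)), np.zeros((n, k))
    I = np.eye(k)
    for _ in range(iters):
        X = (A @ Y + beta * (U1 - L1)) @ np.linalg.inv(Y.T @ Y + beta * I)
        Y = (np.linalg.inv(X.T @ X + beta * I) @ (X.T @ A + beta * (U2 - L2).T)).T
        U1 = np.maximum(X + L1, 0.0)           # (U1, U2) <- P_+(X + L1, Y + L2)
        U2 = np.maximum(Y + L2, 0.0)
        L1 = L1 - gamma * (X - U1)
        L2 = L2 - gamma * (Y - U2)
    return U1, U2
```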


Minimization on Stiefel Manifold

min_{X ∈ X} f(X) s.t. X^T X = I

Splitting:

min_{X ∈ X} f(X) s.t. X − Z = 0, Z^T Z = I

ADM framework:

X ← argmin { f(X) + (β/2)‖X − (Z + Λ)‖²_F : X ∈ X }
Z ← argmin { ‖Z − (X − Λ)‖²_F : Z^T Z = I }
Λ ← Λ − γ(X − Z)

— X-subproblem: easier in many cases
— Z-subproblem: closed-form solution via an SVD (sketched below)
— a more complex splitting is necessary to avoid the SVD
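A minimal sketch of that closed form (the helper name is mine):

```python
import numpy as np

def stiefel_project(W):
    """Z-subproblem: argmin ||Z - W||_F s.t. Z^T Z = I has the closed
    form Z = P Q^T, where W = P diag(s) Q^T is a thin SVD of W."""
    P, _, Qt = np.linalg.svd(W, full_matrices=False)
    return P @ Qt
```

In the framework above, this is called with W = X − Λ.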


Nonlinear Data-Fitting with Regularization

min_v R(Dv) + (µ/2)‖K(v) − d‖².

Introduce the splitting variable u:

min_{u,v} R(u) + (µ/2)‖K(v) − d‖² s.t. u = Dv.

A general ADM scheme:

u ← argmin_u R(u) + (β/2)‖u − (Dv + λ)‖²,
v ← argmin_v µ‖K(v) − d‖² + β‖Dv − (u − λ)‖²,
λ ← λ − γ(u − Dv),

— u-subproblem: easy for many R(·) (TV, ℓ1, ...)
— v-subproblem: regularized by a Tikhonov term
— v-subproblem: a Gauss-Newton step could suffice (sketched below)
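A sketch of one such Gauss-Newton step: linearize K at the current v and solve the resulting regularized normal equations (the callable interface for K and its Jacobian J is my assumption):

```python
import numpy as np

def v_gauss_newton_step(v, u, lam, K, J, d, D, mu, beta):
    """One Gauss-Newton step on the v-subproblem
        min_v mu ||K(v) - d||^2 + beta ||Dv - (u - lam)||^2."""
    Jv = J(v)                               # Jacobian of K at v
    r1 = K(v) - d
    r2 = D @ v - (u - lam)
    H = mu * Jv.T @ Jv + beta * D.T @ D     # Tikhonov term keeps H well-posed
    g = mu * Jv.T @ r1 + beta * D.T @ r2
    return v - np.linalg.solve(H, g)
```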


Some Theoretical Results

A general setting

Local R-linear convergence

(Joint work in progress with Junfeng Yang)

General Setting: Problem

Consider

min_x f(x) s.t. c(x) = 0

where f : R^n → R and c : R^n → R^m (m < n) are C²-mappings.

Augmented Lagrangian:

L_α(x, y) ≜ αf(x) − y^T c(x) + (1/2)‖c(x)‖²

Augmented saddle point system:

∇_x L_α(x, y) = 0,
c(x) = 0.


Splitting and Iteration Scheme

G : R^n × R^n → R^n is a splitting of F : R^n → R^n if

G(x, x) ≡ F(x), ∀x ∈ R^n.

e.g., if A = L − R, then G(x, z) ≜ Lx − Rz gives G(x, x) ≡ Ax ≜ F(x).

Let G(x, x, y) be a splitting of ∇_x L_α(x, y) in x.

Augmented saddle point system becomes

G(x, x, y) = 0
c(x) = 0

A generalized ADM (gADM) scheme:

x^{k+1} ← G(x, x^k, y^k) = 0 (i.e., solve for x)
y^{k+1} ← y^k − τ c(x^{k+1})


Block Gauss-Seidel for Square System F(x) = 0

Partition the system and variable into s ≤ n consistent blocks:

F = (F_1, F_2, · · · , F_s), x = (x_1, x_2, · · · , x_s)

Block GS iteration: given x^k, for i = 1, 2, . . . , s,

x_i^{k+1} ← F_i(x_1^{k+1}, . . . , x_{i−1}^{k+1}, x_i, x_{i+1}^k, . . . , x_s^k) = 0

or x^{k+1} ← G(x, x^k) = 0, where

G(x, z) = ( F_1(x_1, z_2, . . . , z_s)
            . . .
            F_i(x_1, . . . , x_i, z_{i+1}, . . . , z_s)
            . . .
            F_s(x_1, . . . , x_s) )

(SOR can be similarly defined.)
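For the linear case F(x) = Ax − b, a minimal NumPy sketch of the block GS iteration matching the splitting G(x, z) above (the partition interface is my choice):

```python
import numpy as np

def block_gs(A, b, blocks, x0, iters=100):
    """Block Gauss-Seidel for the square system A x = b: block i is
    solved using already-updated blocks 1..i-1 and old blocks i+1..s."""
    x = x0.copy()
    idx_all = np.arange(len(b))
    for _ in range(iters):
        for idx in blocks:                  # blocks: list of index arrays
            rest = np.setdiff1d(idx_all, idx)
            rhs = b[idx] - A[np.ix_(idx, rest)] @ x[rest]
            x[idx] = np.linalg.solve(A[np.ix_(idx, idx)], rhs)
    return x
```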


Splitting for Gradient Descent: F(x) = ∇f(x)

Gradient descent method (with a constant step size):

x^{k+1} = x^k − αF(x^k),

or x^{k+1} ← G(x, x^k) = 0, where

G(x, z) = (1/α)x − ( (1/α)z − F(z) ).

— gradient descent iterations can be done block-wise
— block GS, SOR, and gradient descent can be mixed
(e.g., 1st block: GS; 2nd block: gradient descent)


Assumptions

Let ∂_i G(x, x, y) be the partial Jacobian of the splitting G w.r.t. the i-th argument, and ∂_i G* ≜ ∂_i G(x*, x*, y*), where x* is a minimizer and y* the associated multiplier.

Assumption 1. (2nd-order sufficiency)
f, c ∈ C², and α > 0 is chosen so that

∇²_x L_α(x*, y*) ≻ 0

Assumption 2. (Requirements on the splitting)
∂_1 G is nonsingular in a neighborhood of (x*, x*, y*), and

ρ([∂_1 G*]^{-1} ∂_2 G*) ≤ 1,

but |λ| = 1 is allowed only for λ = 1.
(e.g., for gradient descent: [∂_1 G*]^{-1} ∂_2 G* = I − α∇²f(x*))


Assumptions are Reasonable

A1. 2nd-order sufficiency guarantees that α > 0 exists so that

α[∇²f(x*) − Σ_i y_i* ∇²c_i(x*)] + A(x*)^T A(x*) ≻ 0,

where A(x) = ∂c(x). Note

∇_x L_α(x, y) = G(x, x, y) =⇒ ∇²_x L_α* = ∂_1 G* + ∂_2 G* ≻ 0

A2. Any convergent linear splitting for matrices ≻ 0 leads to a corresponding nonlinear splitting G satisfying

ρ([∂_1 G*]^{-1} ∂_2 G*) < 1

Hence, A2 is satisfied by block GS (i.e., ADM), SOR, gradient descent (with an appropriate α), and their mixtures.


The Error System

Recall gADM:

x^{k+1} ← G(x, x^k, y^k) = 0 (i.e., solve for x)
y^{k+1} ← y^k − τ c(x^{k+1})

Using the Implicit Function Theorem, we derive an error system

e^{k+1} = M*(τ) e^k + o(‖e^k‖),

where e^k ≜ (x^k, y^k) − (x*, y*) and

M*(τ) = [ −[∂_1 G*]^{-1} ∂_2 G*        [∂_1 G*]^{-1} A*^T
           τ A* [∂_1 G*]^{-1} ∂_2 G*    I − τ A* [∂_1 G*]^{-1} A*^T ]

Key Lemma. (Z-2010) Under Assumptions 1-2, there exists η > 0 such that ρ(M*(τ)) < 1 for all τ ∈ (0, 2η).
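As a numerical companion to the lemma, a short sketch that assembles M*(τ) from given partial Jacobians and evaluates its spectral radius (the inputs are assumed supplied by the user):

```python
import numpy as np

def rho_M(d1G, d2G, Astar, tau):
    """Spectral radius of the error-propagation matrix M*(tau) above,
    given d1G*, d2G* (partial Jacobians at (x*, x*, y*)) and A*."""
    S = np.linalg.solve(d1G, d2G)           # [d1G*]^{-1} d2G*
    T = np.linalg.solve(d1G, Astar.T)       # [d1G*]^{-1} A*^T
    m = Astar.shape[0]
    M = np.vstack([
        np.hstack([-S, T]),
        np.hstack([tau * Astar @ S, np.eye(m) - tau * Astar @ T]),
    ])
    return max(abs(np.linalg.eigvals(M)))
```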


Convergence: τ ∈ (0, 2η)

Theorem [Local convergence]. There exists an open neighborhood U of (x*, y*) such that for any (x^0, y^0) ∈ U, the sequence {(x^k, y^k)} generated by gADM stays in U and converges to (x*, y*).

Theorem [R-linear rate]. The asymptotic convergence rate of gADM is R-linear with R-factor ρ(M*(τ)), i.e.,

lim sup_{k→∞} ‖(x^k, y^k) − (x*, y*)‖^{1/k} = ρ(M*(τ)).

— These follow from the Key Lemma and Ortega-Rockoff-70.

Corollary [Linear case]. If f is quadratic and c is affine, then U = R^n × R^m and the convergence is globally Q-linear with Q-factor ρ(M*(τ)).


Summary: ADM ≈ Splitting + Alternating

powerful tool for constructing inexpensive iterations

great versatility, computationally proven effectiveness

convergence theory extended, many issues remain

Table: assessment of the obtained results

  strength                 | weakness
  -------------------------|------------------------
  non-convex functions     | local convergence only
  non-separable functions  | smooth functions only
  any number of blocks     | no side-constraints
  mixed strategies covered | —
  rate of convergence      | —

Our results extend the classic ADM in several aspects, but do not recover the existing results except for quadratic programs.


Open Question:

Why or when is non-convexity NOT a problem?

Thanks!

Acknowledgments: NSF, ONR, IPAM

References

(FISTA) A. Beck and M. Teboulle, “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems”, SIAM J. Imag. Sci., 2:183–202, 2009.

(NESTA) J. Bobin, S. Becker, and E. Candes, “NESTA: A Fast and Accurate First-order Method for Sparse Recovery”, TR, CalTech, April 2009.

(SPGL1) M. P. Friedlander and E. van den Berg, “Probing the Pareto frontier for basis pursuit solutions”, SIAM J. Sci. Comput., 31(2):890–912, 2008.

(FPC) E. T. Hale, W. Yin, and Y. Zhang, “Fixed-point continuation for ℓ1-minimization: Methodology and convergence”, SIAM J. Optim., 19(3):1107–1130, 2008.

(IALM) Z. Lin, M. Chen, L. Wu, and Y. Ma, “The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrices”, TR UIUC, UILU-ENG-09-2215, Nov. 2009.

(FPCA) S. Ma, D. Goldfarb, and L. Chen, “Fixed Point and Bregman Iterative Methods for Matrix Rank Minimization”, Math. Prog., to appear.

(APGL) K.-C. Toh and S. W. Yun, “An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems”, Pacific J. Optimization.

(SpaRSA) S. Wright, R. Nowak, and M. Figueiredo, “Sparse reconstruction by separable approximation”, IEEE Trans. Signal Process., 57(7):2479–2493, 2009.

(CGD) S. Yun and K.-C. Toh, “A coordinate gradient descent method for ℓ1-regularized convex minimization”, Computational Optimization and Applications, to appear.
