Recent Advances in Alternating Direction Methods: Practice and Theory
Yin Zhang
Department of Computational and Applied Mathematics, Rice University, Houston, Texas, U.S.A.
“Numerical Methods for Continuous Optimization” IPAM Workshop at UCLA
Los Angeles, October 11, 2010
Outline:
Alternating Direction Method (ADM) — algorithm, history, convergence
Some Recent Applications and Extensions — convex and non-convex models
Local Convergence and Rate — general equality-constrained problem; general splittings and stationary iterations
Summary
The talk involves contributions from (in random order):
Junfeng Yang, Zaiwen Wen, Yilun Wang, Chengbo Li, Yuan Shen,Wotao Yin
Yin Zhang, Rice University Recent Advances in Alternating Direction Methods: Practice and Theory 2/33
Introduction to ADM

Idea:
“Divide and Conquer.” — Julius Caesar (100-44 BC)
“Befriend the distant, attack the near”; “defeat them one by one.” (“远交近攻”, “各个击破”.) — Sun-Tzu (400 BC)
Classic Alternating Direction Method
Convex program with structure (2-block separability):
min_{x,y} f1(x) + f2(y) s.t. Ax + By = b, x ∈ X, y ∈ Y.
Augmented Lagrangian (AL):
L_A(x, y, λ) = f1(x) + f2(y) + (β/2)( ‖Ax + By − b − λ‖² − ‖λ‖² )    (AL)

ADM: for γ ∈ (0, 1.618)
x^{k+1} ← argmin_x { L_A(x, y^k, λ^k) : x ∈ X }
y^{k+1} ← argmin_y { L_A(x^{k+1}, y, λ^k) : y ∈ Y }
λ^{k+1} ← λ^k − γ(Ax^{k+1} + By^{k+1} − b)
“An inexact version” of the AL method (Hestenes-69, Powell-69).
(AL Iterations = Bregman Iterations by Yin-Osher-Goldfarb-Darbon-08)
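As a concrete illustration, here is a minimal numerical sketch of this three-step scheme on a toy instance (A = B = I and the quadratic f1, f2 are my choices, not from the talk):

```python
import numpy as np

# Toy 2-block problem: min 0.5||x - a||^2 + 0.5||y - c||^2  s.t.  x + y = b,
# solved with the ADM above (scaled-multiplier augmented Lagrangian).
rng = np.random.default_rng(0)
n = 5
a, c, b = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)
beta, gamma = 1.0, 1.6                      # gamma in (0, 1.618)
x, y, lam = np.zeros(n), np.zeros(n), np.zeros(n)
for _ in range(300):
    # x-step: argmin_x 0.5||x - a||^2 + (beta/2)||x + y - b - lam||^2
    x = (a + beta * (b + lam - y)) / (1.0 + beta)
    # y-step: argmin_y 0.5||y - c||^2 + (beta/2)||x + y - b - lam||^2
    y = (c + beta * (b + lam - x)) / (1.0 + beta)
    # multiplier update
    lam = lam - gamma * (x + y - b)
# the feasibility residual x + y - b is driven to (numerical) zero
```

Both subproblems have closed forms here, which is exactly the situation where ADM shines: each pass costs a few vector operations.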
ADM overview
ADM as we know today
Glowinski-Marrocco-75 and Gabay-Mercier-76
Glowinski et al. 81-89, Gabay-83, ...
Connections before Aug. Lagrangian
Douglas, Peaceman, Rachford (mid-1950s)
operator splittings for PDEs (a.k.a. ADI methods)
Some subsequent studies
variational inequality, proximal point (Eckstein-Bertsekas-92)
inexact ADM (He-Liao-Han-Yang-02), ...
ADM Global Convergence
e.g., “Augmented Lagrangian methods ...” Fortin-Glowinski-83
Assumptions required by current theory:
convexity over the entire domain
separability for exactly 2 blocks, no more
exact or high-accuracy minimization for each block
Strength:
differentiability not required
side-constraints allowed: x ∈ X , y ∈ Y
But
some assumptions are restrictive
no rate of convergence result
Some Recent Applications
From PDE to:
Signal/Image Processing
Sparse Optimization
ℓ1-minimization in Compressive Sensing
Sparse signal recovery model: A ∈ R^{m×n} (m < n)

min ‖x‖1 s.t. Ax = b  ⇔  max bᵀy s.t. Aᵀy ∈ [−1, 1]ⁿ

Split Aᵀy from the box:

max bᵀy s.t. Aᵀy = z, z ∈ [−1, 1]ⁿ
ADM (one of the variants in Yang-Z-09):

y ← (AAᵀ)⁻¹(A(z − x) + b/β)
z ← P_{[−1,1]ⁿ}(Aᵀy + x)
x ← x − γ(z − Aᵀy)

— Most efficient when AAᵀ = I (common in CS).
— Use a steepest-descent step for y if AAᵀ ≠ I (still good in CS because A is well-conditioned).
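A minimal sketch of these three updates on synthetic data with orthonormal rows (so AAᵀ = I; dimensions, seed, and β = 1 are my choices — with this scaling, x itself plays the role of the primal variable):

```python
import numpy as np

# Dual ADM sketch for min ||x||_1 s.t. Ax = b, with A A^T = I.
rng = np.random.default_rng(0)
m, n, k = 80, 200, 8
A = np.linalg.qr(rng.standard_normal((n, m)))[0].T   # orthonormal rows
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
b = A @ x_true
beta, gamma = 1.0, 1.0
x, z, y = np.zeros(n), np.zeros(n), np.zeros(m)
for _ in range(2000):
    y = A @ (z - x) + b / beta            # (A A^T)^{-1} = I here
    z = np.clip(A.T @ y + x, -1.0, 1.0)   # projection onto the box
    x = x - gamma * (z - A.T @ y)
# at convergence z = A^T y and beta*x is primal feasible: A(beta*x) = b
```

With β = 1 the multiplier x approaches the primal ℓ1 minimizer, so the feasibility residual ‖Ax − b‖ is a natural progress measure.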
Numerical Comparison
ADM solver package YALL1: http://yall1.blogs.rice.edu/
Compared codes: SPGL1, NESTA, SpaRSA, FPC, FISTA, CGD
[Figure: relative error vs. iteration (log-log), two panels. Left: m/n = 0.20, p/m = 0.20; solvers PADM, DADM, SPGL1, NESTA. Right: m/n = 0.30, p/m = 0.20, μ = 1.0e−4; solvers PADM, DADM, SpaRSA, FPC_BB, FISTA, CGD.]
(noisy measurements, average of 50 runs)
Non-asymptotically, the ADMs showed the fastest speed of convergence in reducing the error ‖x^k − x*‖.
TV-minimization in Image Processing
TV/L2 deconvolution model (Rudin-Osher-Fatemi-92):

min_u Σ_i ‖D_i u‖ + (µ/2)‖Ku − f‖²,

which is equivalent to:

min_{u,w} Σ_i ‖w_i‖ + (µ/2)‖Ku − f‖² s.t. w_i = D_i u, ∀i.

Augmented Lagrangian function L_A(w, u, λ):

Σ_i ( ‖w_i‖ − λ_iᵀ(w_i − D_i u) + (β/2)‖w_i − D_i u‖² ) + (µ/2)‖Ku − f‖².

Exact minimization for w (shrinkage) and u (FFT) (e.g., Yang-Yin-Z-Wang-08 (λ_i = 0), Goldstein-Osher-08, Esser-09).
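The w-subproblem is solved exactly by isotropic (2-D) shrinkage; a minimal sketch of that operator (the helper name and the safeguarding constant are mine):

```python
import numpy as np

def shrink2d(v, t):
    """Isotropic shrinkage: argmin_w ||w||_2 + (1/(2t)) ||w - v||_2^2,
    applied column-wise to a 2 x N array v of per-pixel gradient vectors."""
    nv = np.linalg.norm(v, axis=0, keepdims=True)
    # scale each column by max(||v|| - t, 0) / ||v||; zero-safe denominator
    return v * (np.maximum(nv - t, 0.0) / np.maximum(nv, 1e-12))

# inside the ADM loop:  w_i <- shrink2d(D_i u + lambda_i / beta, 1/beta)
```

Columns with norm below the threshold t are set exactly to zero, which is what produces the piecewise-constant TV solutions.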
Multiplier helps: Penalty vs. ADM
[Figure: FTVd (quadratic penalty) vs. ADM over iteration number (log scale), two panels. Left: SNR history (dB). Right: function values.]
Matlab package FTVd (Wang-Yang-Yin-Z-07∼09): http://www.caam.rice.edu/~optimization/L1/ftvd/
(v1-3 use a quadratic penalty; v4 applies ADM)
Multi-Signal Reconstruction with Joint Sparsity
Recover a set of jointly sparse signals X = [x1 ··· xp] ∈ R^{n×p}:

min_X Σ_{i=1}^n ‖e_iᵀX‖₂ s.t. A_j x_j = b_j, ∀j.

Assume A_j = A for simplicity. Introduce the splitting Z ∈ R^{p×n}:

min_X Σ_{i=1}^n ‖Z e_i‖₂ s.t. Z = Xᵀ, AX = B.

ADM:

Z ← shrink(Xᵀ + Λ1, 1/β) (column-wise)
X ← (I + AᵀA)⁻¹((Z − Λ1)ᵀ + Aᵀ(B + Λ2))
(Λ1, Λ2) ← (Λ1, Λ2) − γ(Z − Xᵀ, AX − B)

Easy if AAᵀ = I; otherwise take a steepest-descent step in X.
A Sample of Recent Works on Convex Programs
D. Goldfarb, S. Ma, and K. Scheinberg, “Fast Alternating Linearization Methods for Minimizing the Sum of Two Convex Functions”, TR, Columbia, 2010.

M. V. Afonso, J. Bioucas-Dias, and M. Figueiredo, “A fast algorithm for the constrained formulation of compressive image reconstruction and other linear inverse problems”, arXiv, 2009.

S. Setzer, “Split Bregman Algorithm, Douglas-Rachford Splitting and Frame Shrinkage”, Proc. 2nd International Conference on Scale Space and Variational Methods in Computer Vision, Lecture Notes in Computer Science, 2009.
Extensions to Non-convex Territories
Low-Rank/Sparse Matrix Models
Non-convex, Non-separable functions
More than 2 blocks
Matrix Fitting Models (I): Completion
Find a low-rank Z to fit the data M_ij, (i, j) ∈ Ω.
Nuclear-norm minimization is good, but SVDs are expensive.
Non-convex model (Wen-Yin-Z-09): find X ∈ R^{m×k}, Y ∈ R^{k×n}:

min_{X,Y,Z} ‖XY − Z‖²_F s.t. P_Ω(Z − M) = 0

An SOR scheme:

Z ← ωZ + (1 − ω)XY
X ← qr(ZYᵀ)
Y ← XᵀZ
Z ← XY + P_Ω(M − XY)

One small QR (m × k) per iteration; no SVD. Much faster than SVD-based codes (FPCA, APGL, ...).
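A minimal numerical sketch of the scheme above (with ω = 1, i.e., plain nonlinear GS; the sizes, seed, and sampling rate are illustrative choices, not from the talk):

```python
import numpy as np

# Low-rank matrix completion by the QR-based alternating scheme above.
rng = np.random.default_rng(1)
m, n, r, k = 60, 60, 3, 5                # true rank r, model rank k >= r
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
mask = rng.random((m, n)) < 0.5          # observed index set Omega
X = rng.standard_normal((m, k))
Y = rng.standard_normal((k, n))
Z = np.where(mask, M, 0.0)
omega = 1.0                              # omega = 1 gives plain Gauss-Seidel
for _ in range(500):
    Z = omega * Z + (1.0 - omega) * (X @ Y)
    X, _ = np.linalg.qr(Z @ Y.T)         # one small m-by-k QR, no SVD
    Y = X.T @ Z
    XY = X @ Y
    Z = XY + np.where(mask, M - XY, 0.0) # reset observed entries toward M
rel = (np.linalg.norm(np.where(mask, X @ Y - M, 0.0))
       / np.linalg.norm(np.where(mask, M, 0.0)))
```

The only factorization per pass is the reduced QR of an m × k matrix, which is the whole point of avoiding SVDs.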
Nonlinear GS vs SOR
[Figure: normalized residual vs. iteration (semilog) for SOR and GS with k = 12 and k = 20. (a) n = 1000, r = 10, SR = 0.08; (b) n = 1000, r = 10, SR = 0.15.]
Alternating minimization, but no multiplier is used, for storage reasons.
Is non-convexity a problem for global optimization of this problem? — “Yes” in theory — “Not really” in practice
Matrix Fitting Models (II): Separation

Given data D_ij, (i, j) ∈ Ω, find a low-rank Z so that the difference P_Ω(Z − D) is sparse.
Non-convex model (Shen-Wen-Z-10): U ∈ R^{m×k}, V ∈ R^{k×n}:

min_{U,V,Z} ‖P_Ω(Z − D)‖₁ s.t. Z − UV = 0

ADM scheme:

U ← qr((Z − Λ/β)Vᵀ)
V ← Uᵀ(Z − Λ/β)
P_{Ωᶜ}(Z) ← P_{Ωᶜ}(UV + Λ/β)
P_Ω(Z) ← P_Ω(shrink(···) + D)
Λ ← Λ − γβ(Z − UV)

— one small QR; no SVD; faster
— non-convex, 3 blocks, nonlinear constraint: convergence?
Matrix Separation: recoverability can be better
[Figure: digits of accuracy (2–7) over sparsity % (10–60, horizontal) vs. rank (20–120, vertical). (a) LMaFit (our solver); (b) IALM (nuclear-norm).]
In S + Z = D, when S dominates Z in magnitude, our model hasworse recoverability. So the 2 models complement each other.
Nonnegative Matrix Factorization (Z-09)
Given A ∈ R^{n×n}, find X, Y ∈ R^{n×k} (k ≪ n):

min ‖XYᵀ − A‖²_F s.t. X, Y ≥ 0

Splitting:

min ‖XYᵀ − A‖²_F s.t. X = U1, Y = U2, U1, U2 ≥ 0

ADM scheme:

X ← (AY + β(U1 − Λ1))(YᵀY + βI)⁻¹
Yᵀ ← (XᵀX + βI)⁻¹(XᵀA + β(U2 − Λ2)ᵀ)
(U1, U2) ← P₊(X + Λ1, Y + Λ2)
(Λ1, Λ2) ← (Λ1, Λ2) − γ(X − U1, Y − U2)

— cost/iteration: two k × k linear systems plus matrix arithmetic
— better performance than Matlab's built-in function “nnmf”
— non-convex, non-separable, 3 blocks: convergence?
Minimization on Stiefel Manifold
min_{X∈X} f(X) s.t. XᵀX = I

Splitting:

min_{X∈X} f(X) s.t. X − Z = 0, ZᵀZ = I

ADM framework:

X ← argmin { f(X) + (β/2)‖X − (Z + Λ)‖²_F : X ∈ X }
Z ← argmin { ‖Z − (X − Λ)‖²_F : ZᵀZ = I }
Λ ← Λ − γ(X − Z)

— X-subproblem: easier in many cases
— Z-subproblem: closed-form solution with one SVD
— a more complex splitting is necessary to avoid the SVD
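The closed form of the Z-subproblem can be sketched as follows (helper name is mine): with W = X − Λ and SVD W = U diag(s) Vᵀ, the minimizer is Z = UVᵀ.

```python
import numpy as np

def stiefel_project(W):
    """Z-subproblem: argmin_Z ||Z - W||_F^2 s.t. Z^T Z = I,
    solved by one (reduced) SVD of W: Z = U V^T."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

# inside the ADM loop:  Z = stiefel_project(X - Lam)
```

Dropping the singular values while keeping the singular vectors yields the nearest matrix with orthonormal columns in the Frobenius norm.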
Nonlinear Data-Fitting with Regularization
min_v R(Dv) + (µ/2)‖K(v) − d‖².

Introduce a splitting variable u:

min_{u,v} R(u) + (µ/2)‖K(v) − d‖² s.t. u = Dv.

A general ADM scheme:

u ← argmin_u R(u) + (β/2)‖u − (Dv + λ)‖²,
v ← argmin_v µ‖K(v) − d‖² + β‖Dv − (u − λ)‖²,
λ ← λ − γ(u − Dv).

— u-subproblem: easy for many R(·) (TV, ℓ1, ...)
— v-subproblem: regularized by a Tikhonov term
— v-subproblem: a Gauss-Newton step could suffice
Some Theoretical Results
A general setting
Local R-linear convergence
(Joint work in progress with Junfeng Yang)
General Setting: Problem
Consider

min_x f(x) s.t. c(x) = 0,

where f: Rⁿ → R and c: Rⁿ → Rᵐ (m < n) are C²-mappings.

Augmented Lagrangian:

L_α(x, y) ≜ αf(x) − yᵀc(x) + (1/2)‖c(x)‖²

Augmented saddle-point system:

∇_x L_α(x, y) = 0,
c(x) = 0.
Splitting and Iteration Scheme
G: Rⁿ × Rⁿ → Rⁿ is a splitting of F: Rⁿ → Rⁿ if

G(x, x) ≡ F(x), ∀x ∈ Rⁿ.

e.g., if A = L − R, then G(x, z) ≜ Lx − Rz satisfies G(x, x) ≡ Ax ≜ F(x).

Let G(x, x, y) be a splitting of ∇_x L_α(x, y) in x. The augmented saddle-point system becomes

G(x, x, y) = 0
c(x) = 0

A generalized ADM (gADM) scheme:

x^{k+1} ← solve G(x, x^k, y^k) = 0 for x
y^{k+1} ← y^k − τ c(x^{k+1})
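For intuition, here is a tiny gADM run on a one-dimensional toy (mine, not from the talk), using the trivial splitting G(x, x^k, y) = ∇_x L_α(x, y), i.e., an exact x-solve:

```python
# Toy problem: min 0.5*x^2  s.t.  x - 1 = 0, with
# L_alpha(x, y) = alpha*0.5*x^2 - y*(x - 1) + 0.5*(x - 1)^2.
alpha, tau = 1.0, 1.0
x, y = 0.0, 0.0
for _ in range(200):
    # x-step: solve grad_x L_alpha = alpha*x - y + (x - 1) = 0 for x
    x = (y + 1.0) / (alpha + 1.0)
    # multiplier step
    y = y - tau * (x - 1.0)
# converges to x* = 1 and y* = alpha (the multiplier of the scaled problem)
```

At the fixed point, αx* − y* + (x* − 1) = 0 with x* = 1 forces y* = α, matching the saddle-point system.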
Block Gauss-Seidel for Square System F (x) = 0
Partition the system and the variables into s ≤ n consistent blocks:

F = (F1, F2, ..., Fs), x = (x1, x2, ..., xs)

Block GS iteration: given x^k, for i = 1, 2, ..., s solve

F_i(x_1^{k+1}, ..., x_{i−1}^{k+1}, x_i, x_{i+1}^k, ..., x_s^k) = 0 for x_i^{k+1},

or equivalently x^{k+1} ← solve G(x, x^k) = 0, where

G(x, z) = ( F1(x1, z2, ..., zs); ...; F_i(x1, ..., x_i, z_{i+1}, ..., z_s); ...; Fs(x1, ..., xs) )

(SOR can be defined similarly.)
Splitting for Gradient Descent: F (x) = ∇f (x)
Gradient descent method (with a constant step size α):

x^{k+1} = x^k − αF(x^k),

or x^{k+1} ← solve G(x, x^k) = 0, where

G(x, z) = (1/α)x − ( (1/α)z − F(z) ).

— gradient-descent iterations can be done block-wise
— block GS, SOR and gradient descent can be mixed (e.g., 1st block: GS; 2nd block: gradient descent)
Assumptions
Let ∂iG(x, x, y) be the partial Jacobian of the splitting G w.r.t. the i-th argument, and ∂iG* ≜ ∂iG(x*, x*, y*), where x* is a minimizer and y* the associated multiplier.

Assumption 1 (2nd-order sufficiency). f, c ∈ C², and α > 0 is chosen so that

∇²_x L_α(x*, y*) ≻ 0.

Assumption 2 (requirements on the splitting). ∂1G is nonsingular in a neighborhood of (x*, x*, y*), and

ρ([∂1G*]⁻¹ ∂2G*) ≤ 1,

where |λ| = 1 is allowed only for λ = 1.
(e.g., for gradient descent: [∂1G*]⁻¹∂2G* = I − α∇²f(x*))
Assumptions are Reasonable
A1. 2nd-order sufficiency guarantees that α > 0 exists so that

α[ ∇²f(x*) − Σ_i y*_i ∇²c_i(x*) ] + A(x*)ᵀA(x*) ≻ 0,

where A(x) = ∂c(x). Note

∇_x L_α(x, y) = G(x, x, y) ⟹ ∇²_x L_α* = ∂1G* + ∂2G* ≻ 0.

A2. Any convergent linear splitting for matrices ≻ 0 leads to a corresponding nonlinear splitting G satisfying

ρ([∂1G*]⁻¹ ∂2G*) < 1.

Hence, A2 is satisfied by block GS (i.e., ADM), SOR, gradient descent (with appropriate α), and their mixtures.
The Error System
Recall gADM:

x^{k+1} ← solve G(x, x^k, y^k) = 0 for x
y^{k+1} ← y^k − τ c(x^{k+1})

Using the Implicit Function Theorem, we derive the error system

e^{k+1} = M*(τ) e^k + o(‖e^k‖),

where e^k ≜ (x^k, y^k) − (x*, y*) and

M*(τ) = [ −[∂1G*]⁻¹∂2G*        [∂1G*]⁻¹A*ᵀ
           τA*[∂1G*]⁻¹∂2G*     I − τA*[∂1G*]⁻¹A*ᵀ ]

Key Lemma (Z-2010). Under Assumptions 1-2, there exists η > 0 such that ρ(M*(τ)) < 1 for all τ ∈ (0, 2η).
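In the quadratic/affine case everything in the lemma is computable; the following sketch builds M*(τ) for a random SPD instance with a Gauss-Seidel splitting and checks Assumption 2 and the lemma numerically (all problem data are my choices):

```python
import numpy as np

# Quadratic f, affine c: H = alpha*Q + A^T A is the Hessian of L_alpha in x,
# and the GS splitting gives d1G* = tril(H), d2G* = H - tril(H).
rng = np.random.default_rng(7)
n, m, alpha = 6, 2, 1.0
B = rng.standard_normal((n, n))
Q = B @ B.T + n * np.eye(n)              # SPD Hessian of f
A = rng.standard_normal((m, n))          # Jacobian of affine c
H = alpha * Q + A.T @ A
L = np.tril(H)                           # d1G*
R = H - L                                # d2G* (strictly upper part)
Li = np.linalg.inv(L)
# Assumption 2 for GS on an SPD matrix: rho(L^{-1} R) < 1
rho_gs = np.max(np.abs(np.linalg.eigvals(Li @ R)))

def rho_M(tau):
    M = np.block([[-Li @ R,           Li @ A.T],
                  [tau * A @ Li @ R,  np.eye(m) - tau * A @ Li @ A.T]])
    return np.max(np.abs(np.linalg.eigvals(M)))

rhos = [rho_M(t) for t in (0.01, 0.1, 0.5, 1.0)]
```

Per the Key Lemma, some interval (0, 2η) of step sizes τ makes ρ(M*(τ)) < 1; scanning a few small τ values should exhibit this.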
Convergence: τ ∈ (0, 2η)
Theorem [Local convergence]. There exists an open neighborhood U of (x*, y*) such that for any (x⁰, y⁰) ∈ U, the sequence {(x^k, y^k)} generated by gADM stays in U and converges to (x*, y*).
Theorem [R-linear rate]. The asymptotic convergence rate of gADM is R-linear with R-factor ρ(M*(τ)), i.e.,

lim sup_{k→∞} ‖(x^k, y^k) − (x*, y*)‖^{1/k} = ρ(M*(τ)).
— These follow from the Key Lemma and Ortega-Rockoff-70.
Corollary [Linear case]. If f is quadratic and c is affine, then U = Rⁿ × Rᵐ and the convergence is globally Q-linear with Q-factor ρ(M*(τ)).
Summary: ADM ≈ Splitting + Alternating
powerful tool for constructing inexpensive iterations
great versatility, computationally proven effectiveness
convergence theory extended, many issues remain
Table: assessment of the obtained results

    strength                   weakness
    non-convex functions       local convergence only
    non-separable functions    smooth functions only
    any number of blocks       no side-constraints
    mixed strategies covered   —
    rate of convergence        —
Our results extend the classic ADM in several aspects, but do not recover the existing results except for quadratic programs.
References
(FISTA) A. Beck and M. Teboulle, “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems”, SIAM J. Imag. Sci., 2:183-202, 2009.

(NESTA) J. Bobin, S. Becker, and E. Candes, “NESTA: A Fast and Accurate First-order Method for Sparse Recovery”, TR, CalTech, April 2009.

(SPGL1) M. P. Friedlander and E. van den Berg, “Probing the Pareto frontier for basis pursuit solutions”, SIAM J. Sci. Comput., 31(2):890-912, 2008.

(FPC) E. T. Hale, W. Yin, and Y. Zhang, “Fixed-point continuation for ℓ1-minimization: Methodology and convergence”, SIAM J. Optim., 19(3):1107-1130, 2008.

(IALM) Z. Lin, M. Chen, L. Wu, and Y. Ma, “The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrices”, TR, UIUC, UILU-ENG-09-2215, Nov. 2009.

(FPCA) S. Ma, D. Goldfarb, and L. Chen, “Fixed Point and Bregman Iterative Methods for Matrix Rank Minimization”, Math. Prog., to appear.

(APGL) K.-C. Toh and S. Yun, “An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems”, Pacific J. Optimization.

(SpaRSA) S. Wright, R. Nowak, and M. Figueiredo, “Sparse reconstruction by separable approximation”, IEEE Trans. Signal Process., 57(7):2479-2493, 2009.

(CGD) S. Yun and K.-C. Toh, “A coordinate gradient descent method for L1-regularized convex minimization”, Computational Optimization and Applications, to appear.