Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparse Learning

Ryota Tomioka (1), Taiji Suzuki (1), and Masashi Sugiyama (2)
(1) University of Tokyo, (2) Tokyo Institute of Technology

2009-12-12 @ NIPS workshop OPT09
Introduction
Objective
Develop an optimization algorithm for the problem:
$$\mathop{\mathrm{minimize}}_{w\in\mathbb{R}^n} \;\; \underbrace{f(Aw)}_{\text{loss}} \;+\; \underbrace{\phi_\lambda(w)}_{\text{regularizer}}.$$
For example, the lasso:
$$\mathop{\mathrm{minimize}}_{w\in\mathbb{R}^n} \;\; \frac12\|Aw - y\|^2 + \lambda\|w\|_1.$$
- $A \in \mathbb{R}^{m\times n}$: design matrix ($m$: #observations, $n$: #unknowns).
- $f$ is convex and twice differentiable.
- $\phi_\lambda(w)$ is convex but possibly non-differentiable.
- We are interested in algorithms for general $f$ and $\phi_\lambda$ (unlike, e.g., LARS).
Introduction
Where does the difficulty come from?

Conventional view: the non-differentiability of $\phi_\lambda(w)$.
- Replace the regularizer by a differentiable upper bound: FOCUSS (Rao & Kreutz-Delgado, 99), majorization-minimization (Figueiredo et al., 07), iteratively reweighted least squares (IRLS).
- Explicitly handle the non-differentiability: sub-gradient L-BFGS (Andrew & Gao, 07; Yu et al., 08).

Our view: the coupling between variables introduced by A.
Introduction
Where does the difficulty come from?

Our view: the coupling between variables introduced by A.
In fact, when $A = I_n$:
$$\min_{w\in\mathbb{R}^n} \frac12\|y - w\|_2^2 + \lambda\|w\|_1 \;=\; \sum_{j=1}^{n} \min_{w_j\in\mathbb{R}} \Bigl( \frac12 (y_j - w_j)^2 + \lambda|w_j| \Bigr),$$
and the minimum is obtained analytically by soft-thresholding:
$$w_j = \mathrm{ST}_\lambda(y_j) = \begin{cases} y_j - \lambda & (y_j > \lambda),\\ 0 & (-\lambda \le y_j \le \lambda),\\ y_j + \lambda & (y_j < -\lambda). \end{cases}$$
We focus on regularizers $\phi_\lambda$ for which the above minimum can be obtained analytically.
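To make this concrete, here is a minimal numpy sketch of the soft-thresholding operator (the function name soft_threshold is ours); applied elementwise, it solves the $A = I_n$ problem above exactly.

```python
import numpy as np

def soft_threshold(y, lam):
    """Soft-thresholding ST_lam(y): the proximal operator of lam * ||.||_1.

    Elementwise: y - lam if y > lam, 0 if |y| <= lam, y + lam if y < -lam.
    """
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# When A = I_n, the lasso solution is exactly ST_lam(y), coordinate-wise.
y = np.array([2.0, 0.3, -1.5])
print(soft_threshold(y, lam=0.5))  # -> [ 1.5  0.  -1. ]
```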
Introduction
Proximation w.r.t. $\phi_\lambda$ can be computed analytically

Assumption: the proximation w.r.t. $\phi_\lambda$ (soft-thresholding in the $\ell_1$ case),
$$\mathrm{ST}_\lambda(y) = \mathop{\mathrm{argmin}}_{w\in\mathbb{R}^n}\Bigl(\phi_\lambda(w) + \frac12\|y - w\|_2^2\Bigr),$$
can be computed analytically.
Introduction
Outline
1. Introduction: sparse regularized learning; why is it difficult? (not the non-differentiability)
2. Methods: iterative shrinkage-thresholding (IST); dual augmented Lagrangian (proposed)
3. Theoretical results: super-linear convergence under exact and approximate inner minimization
4. Empirical results: comparison against OWLQN, SpaRSA, and FISTA
5. Summary
Methods
Iterative Shrinkage/Thresholding (IST) algorithm (Figueiredo & Nowak, 03; Daubechies et al., 04; ...)

1. Choose an initial solution $w^0$.
2. Repeat until some stopping criterion is satisfied:
$$w^{t+1} \;\leftarrow\; \underbrace{\mathrm{ST}_{\lambda\eta_t}}_{\text{shrink}}\bigl(\,\underbrace{w^t - \eta_t A^\top\nabla f(Aw^t)}_{\text{gradient step}}\,\bigr).$$

Pro: easy to implement. Con: bad for poorly conditioned A.

Also known as: forward-backward splitting (Combettes & Wajs, 05); thresholded Landweber iteration (Daubechies et al., 04).
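As a concrete instance, here is a minimal IST loop for the lasso, where $f(z) = \frac12\|z - y\|^2$ so that $A^\top\nabla f(Aw) = A^\top(Aw - y)$; the constant step size $\eta = 1/\|A\|_2^2$ and the function names are our own choices, not from the slides.

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def ist_lasso(A, y, lam, n_iter=500):
    """IST for min_w 0.5*||Aw - y||^2 + lam*||w||_1.

    Each iteration: gradient step on the loss, then shrink.
    """
    m, n = A.shape
    eta = 1.0 / np.linalg.norm(A, 2) ** 2  # step size 1/L, L = ||A||_2^2
    w = np.zeros(n)
    for _ in range(n_iter):
        grad = A.T @ (A @ w - y)           # A^T grad f(Aw) for squared loss
        w = soft_threshold(w - eta * grad, lam * eta)
    return w

# Tiny example on a random sparse problem
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
w_true = np.zeros(50); w_true[:3] = [1.0, -2.0, 1.5]
y = A @ w_true
print(ist_lasso(A, y, lam=0.1)[:5])
```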
Dual Augmented Lagrangian (DAL) method

Primal problem:
$$\mathop{\mathrm{minimize}}_{w} \;\; \underbrace{f(Aw) + \phi_\lambda(w)}_{=:\,F(w)}$$

Proximal minimization ($\eta_t > 0$):
$$w^{t+1} = \mathop{\mathrm{argmin}}_{w} \Bigl( F(w) + \frac{1}{2\eta_t}\|w - w^t\|^2 \Bigr)$$
- Easy to analyze: $F(w^{t+1}) + \frac{1}{2\eta_t}\|w^{t+1} - w^t\|^2 \le F(w^t)$.
- Not practical! (as difficult as the original problem)

Dual problem:
$$\mathop{\mathrm{maximize}}_{\alpha, v} \;\; -f^*(-\alpha) - \phi_\lambda^*(v) \qquad \text{s.t. } v = A^\top\alpha$$

Augmented Lagrangian (Tomioka & Sugiyama, 09):
$$w^{t+1} = \mathrm{ST}_{\lambda\eta_t}\bigl(w^t + \eta_t A^\top\alpha^t\bigr), \qquad \alpha^t = \mathop{\mathrm{argmin}}_{\alpha} \varphi_t(\alpha).$$
- Minimization of $\varphi_t(\alpha)$ is easy (smooth).
- The step size $\eta_t$ is increased over iterations.
- See Rockafellar 76 for the equivalence: the augmented Lagrangian method on the dual is proximal minimization on the primal.
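For the lasso, one outer DAL iteration can be sketched in a few lines. The formulas follow from this slide and the appendix ($f^*(-\alpha) = -\alpha^\top y + \frac12\|\alpha\|^2$ for the squared loss, and $\nabla\varphi_t(\alpha) = \alpha - y + A\,\mathrm{ST}_{\lambda\eta_t}(w^t + \eta_t A^\top\alpha)$), but the code organization, the use of L-BFGS as the inner solver, and the doubling schedule for $\eta_t$ are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def soft_threshold(u, tau):
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def dal_lasso(A, y, lam, n_outer=10, eta0=1.0, eta_factor=2.0):
    """DAL sketch for min_w 0.5*||Aw - y||^2 + lam*||w||_1.

    Outer loop: augmented-Lagrangian / proximal-point update with an
    increasing step size eta_t. Inner loop: smooth minimization of
    phi_t(alpha) by L-BFGS.
    """
    m, n = A.shape
    w, eta = np.zeros(n), eta0
    for _ in range(n_outer):
        def phi(alpha):
            v = soft_threshold(w + eta * (A.T @ alpha), lam * eta)
            # f*(-alpha) for f(z) = 0.5*||z - y||^2 is -alpha'y + 0.5*||alpha||^2
            val = -alpha @ y + 0.5 * (alpha @ alpha) + (0.5 / eta) * (v @ v)
            grad = alpha - y + A @ v
            return val, grad
        res = minimize(phi, np.zeros(m), jac=True, method="L-BFGS-B")
        w = soft_threshold(w + eta * (A.T @ res.x), lam * eta)
        eta *= eta_factor  # increasing eta_t is what drives the fast rate
    return w

# Usage on a small random problem
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
y = A @ np.concatenate([np.array([1.0, -2.0]), np.zeros(48)])
print(dal_lasso(A, y, lam=0.1)[:4])
```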
Difference: how do we get rid of the couplings?

Proximation w.r.t. $F$ is hard:
$$w^{t+1} = \mathop{\mathrm{argmin}}_{w} \Bigl( \underbrace{f(Aw)}_{\text{variables are coupled}} + \phi_\lambda(w) + \frac{1}{2\eta_t}\|w - w^t\|^2 \Bigr).$$

IST linearly approximates the loss term:
$$f(Aw) \;\approx\; f(Aw^t) + (w - w^t)^\top A^\top \nabla f(Aw^t),$$
tightest at the current point $w^t$.

DAL (proposed) linearly lower-bounds the loss term:
$$f(Aw) \;=\; \max_{\alpha\in\mathbb{R}^m} \bigl( \alpha^\top A w - f^*(\alpha) \bigr),$$
tightest at the next point $w^{t+1}$.

[Figure: the tangent approximation is tight at $w^t$; the parametric lower bound is tight at $w^{t+1}$.]
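Spelled out as a worked restatement of this slide ($\tilde\alpha$ below is our notation for a fixed dual vector):

```latex
% Fenchel-Young: for every fixed \tilde\alpha the conjugate gives a global
% linear lower bound on the loss,
f(Aw) \;=\; \max_{\alpha\in\mathbb{R}^m}\bigl(\alpha^\top Aw - f^*(\alpha)\bigr)
      \;\ge\; \tilde\alpha^\top Aw - f^*(\tilde\alpha)
      \qquad \text{for all } w,
% with equality exactly when \tilde\alpha = \nabla f(Aw). Hence choosing
% \tilde\alpha = \alpha^t \approx \nabla f(Aw^{t+1}) makes the bound tight at
% the *next* iterate w^{t+1}, whereas the IST expansion is tight only at w^t.
```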
Numerical examples
DAL is better when A is poorly conditioned.
[Figure: iterates of IST and DAL on a two-dimensional example (both axes from -2 to 2); DAL reaches the minimizer in far fewer steps when A is poorly conditioned.]
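As a quick sanity check of this claim (our own construction, not the experiment in the figure), the snippet below builds a small A with a prescribed condition number and counts IST iterations until the update stalls; the count grows sharply with the condition number.

```python
import numpy as np

def soft_threshold(u, tau):
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def ist_iters(A, y, lam, tol=1e-6, max_iter=10**6):
    """Run IST; return the number of iterations until the update stalls."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    w = np.zeros(A.shape[1])
    for k in range(max_iter):
        w_new = soft_threshold(w - eta * (A.T @ (A @ w - y)), lam * eta)
        if np.linalg.norm(w_new - w) < tol:
            return k
        w = w_new
    return max_iter

rng = np.random.default_rng(0)
n = 2
for cond in (1.0, 10.0, 100.0):
    # Random 2x2 A with prescribed condition number via its singular values
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    A = U @ np.diag([1.0, 1.0 / cond]) @ V.T
    y = A @ np.array([1.0, -1.0])
    print(f"cond(A) = {cond:5.0f}: IST iterations = {ist_iters(A, y, lam=0.01)}")
```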
Theoretical results
Theorem 1 (exact minimization)

Definitions:
- $w^t$: the sequence generated by the DAL algorithm with $\epsilon_t(\alpha^t) = 0$ (exact inner minimization).
- $w^*$: the unique minimizer of the objective $F$.

Assumption: there is a constant $\sigma > 0$ such that
$$F(w^{t+1}) - F(w^*) \;\ge\; \sigma\,\|w^{t+1} - w^*\|^2 \qquad (t = 0, 1, 2, \ldots).$$

Theorem 1:
$$\|w^{t+1} - w^*\| \;\le\; \frac{1}{1 + \sigma\eta_t}\,\|w^t - w^*\|.$$

I.e., $w^t$ converges super-linearly to $w^*$ if $\eta_t$ is increasing.
Theorem 2 (approximate minimization)

Definition: $w^t$ is the sequence generated by the DAL algorithm with the inner minimization stopped once
$$\epsilon_t(\alpha^t) \;\le\; \sqrt{\gamma/\eta_t}\;\|w^{t+1} - w^t\|,$$
where $1/\gamma$ is the Lipschitz constant of $\nabla f$.

Theorem 2: under the same assumption as in Theorem 1,
$$\|w^{t+1} - w^*\| \;\le\; \frac{1}{\sqrt{1 + 2\sigma\eta_t}}\,\|w^t - w^*\|.$$

I.e., $w^t$ converges super-linearly to $w^*$ if $\eta_t$ is increasing.

Note:
- Convergence is slower than in the exact case ($\epsilon_t(\alpha^t) = 0$), since $1/\sqrt{1 + 2\sigma\eta_t} \ge 1/(1 + \sigma\eta_t)$.
- A faster rate can be obtained if we choose $\epsilon_t(\alpha^t) \le \|w^{t+1} - w^t\| \cdot O(1/\eta_t)$.
Proof (in essence) of Theorem 1

Since $w^{t+1} = \mathop{\mathrm{argmin}}_w \bigl( F(w) + \frac{1}{2\eta_t}\|w - w^t\|^2 \bigr)$, the vector $(w^t - w^{t+1})/\eta_t$ is a subgradient of $F$ at $w^{t+1}$. I.e.,
$$F(w^*) \;\ge\; F(w^{t+1}) + \bigl\langle (w^t - w^{t+1})/\eta_t,\; w^* - w^{t+1} \bigr\rangle.$$
(inspired by Beck & Teboulle 09)
[Figure: the subgradient inequality: the line through $(w^{t+1}, F(w^{t+1}))$ with slope $(w^t - w^{t+1})/\eta_t$ lies below $F$ at $w^*$.]
Proof (in essence) of Theorem 2

With approximate inner minimization, the subgradient inequality holds up to the cost of the approximation:
$$F(w^*) \;\ge\; F(w^{t+1}) + \bigl\langle (w^t - w^{t+1})/\eta_t,\; w^* - w^{t+1} \bigr\rangle - \underbrace{\frac{1}{2\gamma}\,\epsilon_t(\alpha^t)^2}_{\text{cost of approximate minimization}},$$
where $1/\gamma$ is the Lipschitz constant of $\nabla f$.
Empirical results
Empirical results: $\ell_1$-logistic regression

#samples = 1,024, #unknowns = 16,384. Compared methods:
- FISTA: two-step IST (Beck & Teboulle 09)
- OWLQN: orthant-wise L-BFGS (Andrew & Gao 07)
- SpaRSA: step-size improved IST (Wright et al. 09)
[Figure: $\|w^t - w^*\|$ and $F(w^t) - F(w^*)$ plotted against the number of iterations and against CPU time (sec), for DAL (together with the theoretical rates of Theorems 1 and 2), FISTA, OWLQN, and SpaRSA.]
Summary

Why is sparse learning difficult to optimize? Because of the couplings introduced by A; the non-differentiability is not the problem. The cost of the inner minimization is $O(m^2 n_+)$ ($n_+$: number of active variables), so sparsity makes the inner minimization efficient.

How do we get rid of the couplings? Use a linear parametric lower bound instead of a linear approximation. This yields super-linear convergence under both exact and approximate inner minimization.

We improved a classic result in optimization (Rockafellar 76) by specializing the setting to sparse learning, i.e., to the case where the proximation w.r.t. $\phi_\lambda$ can be performed analytically.

Empirical results are promising: DAL is faster than OWLQN, SpaRSA, and FISTA, with the potential to be generalized further.
Appendix
(1) The proximation w.r.t. $\phi_\lambda$ is analytic (though non-smooth):
$$w^{t+1} = \mathrm{ST}_{\lambda\eta_t}\bigl(w^t + \eta_t A^\top \alpha^t\bigr).$$
(2) The inner minimization is smooth:
$$\alpha^t = \mathop{\mathrm{argmin}}_{\alpha\in\mathbb{R}^m} \Bigl( \underbrace{f^*(-\alpha)}_{\text{independent of } A} + \frac{1}{2\eta_t}\bigl\|\mathrm{ST}_{\lambda\eta_t}\bigl(w^t + \eta_t A^\top\alpha\bigr)\bigr\|_2^2 \Bigr) \;=\; \mathop{\mathrm{argmin}}_{\alpha}\, \varphi_t(\alpha),$$
and its evaluation is linear in the number of active variables.

[Figure: plot of $\phi_\lambda(w)$ around $w = 0$.]
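To make the "linear in the number of active variables" point concrete, here is a sketch of the gradient of $\varphi_t$ for the $\ell_1$/squared-loss case (our own code; the function name and signature are illustrative):

```python
import numpy as np

def phi_grad_active(alpha, A, y, w, eta, lam):
    """Gradient of phi_t(alpha) for the lasso, touching only active columns.

    grad phi_t(alpha) = alpha - y + A @ ST_{lam*eta}(w + eta * A.T @ alpha),
    and soft-thresholding zeroes out every inactive coordinate, so the final
    product needs only the columns of A where |w + eta*A'alpha| > lam*eta.
    """
    u = w + eta * (A.T @ alpha)
    active = np.abs(u) > lam * eta                          # n_+ active variables
    v_act = u[active] - lam * eta * np.sign(u[active])      # ST on the active set
    return alpha - y + A[:, active] @ v_act
```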
Comparison to other algorithms

- DAL (this talk): $\|w^k - w^*\| = O(\exp(-k))$.
- SpaRSA (step-size improved IST): convergence shown but no rate given (Wright et al. 09).
- OWLQN (orthant-wise L-BFGS): convergence shown but no rate given (Andrew & Gao 07).
- IST (iterative soft-thresholding): $F(w^k) - F(w^*) = O(1/k)$ (Beck & Teboulle 09).
- FISTA (two-step IST): $F(w^k) - F(w^*) = O(1/k^2)$ (Beck & Teboulle 09).
Comparison to Rockafellar 76

Assumption: the multifunction $\partial F$ is (locally) Lipschitz continuous at the origin:
$$\bigl\|(\partial F)^{-1}(\beta) - (\partial F)^{-1}(0)\bigr\| \;\le\; L\|\beta\|.$$
This implies our assumption with $\sigma = \frac12 \min\bigl(1/L,\; \epsilon/\|w^0 - w^*\|\bigr)$.

Convergence under exact minimization, comparable to Theorem 1:
$$\|w^{t+1} - w^*\| \;\le\; \frac{1}{\sqrt{1 + (\eta_t/L)^2}}\,\|w^t - w^*\|.$$

Convergence under approximate minimization, much worse than Theorem 2:
$$\|w^{t+1} - w^*\| \;\le\; \frac{\theta_t + \delta_t}{1 - \delta_t}\,\|w^t - w^*\|, \qquad \theta_t = \frac{1}{\sqrt{1 + (\eta_t/L)^2}}$$
(assuming $\epsilon_t \le (\delta_t/\eta_t)\,\|w^{t+1} - w^t\|$).
Outline of proof of Theorem 1

1. Since $w^{t+1} = \mathop{\mathrm{argmin}}_w \bigl( F(w) + \frac{1}{2\eta_t}\|w - w^t\|^2 \bigr)$, $(w^t - w^{t+1})/\eta_t$ is a subgradient of $F$ at $w^{t+1}$. I.e.,
$$F(w^*) \;\ge\; F(w^{t+1}) + \bigl\langle (w^t - w^{t+1})/\eta_t,\; w^* - w^{t+1} \bigr\rangle.$$
2. For any $\gamma > 0$,
$$\bigl\langle w^t - w^*,\; w^{t+1} - w^* \bigr\rangle \;\le\; \frac{\gamma}{2}\|w^{t+1} - w^*\|^2 + \frac{1}{2\gamma}\|w^t - w^*\|^2.$$
3. Combining 1 & 2 and using $F(w^{t+1}) - F(w^*) \ge \sigma\|w^{t+1} - w^*\|^2$,
$$\frac{1}{2\gamma}\|w^t - w^*\|^2 \;\ge\; \Bigl( (1 + \sigma\eta_t) - \frac{\gamma}{2} \Bigr)\|w^{t+1} - w^*\|^2.$$
4. Maximize the right-hand side w.r.t. $\gamma$ (worked out below).
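Carrying out step 4 explicitly (a routine calculation, spelled out here for completeness):

```latex
% Multiply step 3 by 2\gamma > 0:
\|w^t - w^*\|^2 \;\ge\; \gamma\bigl(2(1+\sigma\eta_t) - \gamma\bigr)\,\|w^{t+1} - w^*\|^2 .
% The quadratic \gamma(2(1+\sigma\eta_t) - \gamma) is maximized at
% \gamma^* = 1 + \sigma\eta_t, where it equals (1+\sigma\eta_t)^2. Hence
\|w^t - w^*\|^2 \;\ge\; (1+\sigma\eta_t)^2\,\|w^{t+1} - w^*\|^2
\quad\Longleftrightarrow\quad
\|w^{t+1} - w^*\| \;\le\; \frac{1}{1+\sigma\eta_t}\,\|w^t - w^*\| ,
% which is exactly Theorem 1.
```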
Outline of proof of Theorem 2

1. Let $\epsilon_t := \epsilon_t(\alpha^t)$. Then
$$F(w^*) \;\ge\; F(w^{t+1}) + \bigl\langle (w^t - w^{t+1})/\eta_t,\; w^* - w^{t+1} \bigr\rangle - \frac{1}{2\gamma}\,\epsilon_t^2.$$
2. By the assumption $F(w^{t+1}) - F(w^*) \ge \sigma\|w^{t+1} - w^*\|^2$ and the stopping criterion,
$$\epsilon_t^2 \;\le\; \frac{\gamma}{\eta_t}\,\|w^{t+1} - w^t\|^2.$$
3. Combining 1 & 2,
$$\frac12\|w^* - w^t\|^2 \;\ge\; \Bigl(\sigma\eta_t + \frac12\Bigr)\|w^* - w^{t+1}\|^2,$$
which is exactly the bound $\|w^{t+1} - w^*\| \le \|w^t - w^*\|/\sqrt{1 + 2\sigma\eta_t}$ of Theorem 2.
EEG problem: P300 visual speller dataset (subject A)

Number of samples m = 2550. 6-class classification. $W \in \mathbb{R}^{37\times 64}$. Trace-norm regularization.

[Figure: objective value versus time (s) for DAL, SubLBFGS, and ProjGrad.]