Randomized Coordinate Descent Methods on Optimization Problems with Linearly Coupled Constraints
Lijun Xu
Optimization Group Meeting
November 27, 2012
By I. Necoara, Y. Nesterov, and F. Glineur
Outline
• Introduction
• Randomized Block (i,j) Coordinate Descent Method
• RCD Method in the Strongly Convex Case
• Random Pairs Sampling
• Extensions
• Numerical Experiment
• Coordinate Descent Method. Q: How to choose the coordinate?
a) cyclic (convergence is difficult to prove)
b) maximal descent (the provable convergence rate is trivial, worse than the simple Gradient Method in general)
c) random (faster, simpler, robust, amenable to distributed and parallel implementation, etc.; see the sketch below)
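For concreteness, a minimal NumPy sketch contrasting the three selection rules on a toy quadratic; the objective and sizes are illustrative assumptions, not from the slides. Note that the maximal-descent rule needs the full residual $Ax - b$ at each iteration, which is why it is not cheaper than a gradient step:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n))
A = A.T @ A + np.eye(n)            # SPD matrix (illustrative)
b = rng.standard_normal(n)         # f(x) = 0.5 x^T A x - b^T x

def coordinate_descent(choose, iters=2000):
    x = np.zeros(n)
    for k in range(iters):
        i = choose(k, x)
        # Exact minimization of f along coordinate i:
        # (grad f)_i = A[i] @ x - b[i], curvature A[i, i].
        x[i] -= (A[i] @ x - b[i]) / A[i, i]
    return 0.5 * x @ A @ x - b @ x

f_cyclic = coordinate_descent(lambda k, x: k % n)                         # a) cyclic
f_random = coordinate_descent(lambda k, x: rng.integers(n))               # c) random
f_greedy = coordinate_descent(lambda k, x: np.argmax(np.abs(A @ x - b)))  # b) maximal descent
print(f_cyclic, f_random, f_greedy)
```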
Introduction
• Randomized (block) coordinate descent methods:
a) The first analysis of this method, when applied to the problem of minimizing a smooth convex function, was performed by Nesterov (2010) [1].
b) The extension to composite functions was given by Richtárik and Takáč (2011) [2].
[1] Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, CORE Discussion Paper, 2010.
[2] P. Richtárik and M. Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, submitted to Mathematical Programming, 2011.
Problem formulation
• Minimize a separable convex objective function with linearly coupled constraints.
• Extension to problems with a non-separable objective function and general linear constraints.
Motivation for the Formulation
Applications in:
• Resource allocation in economic systems
• Distributed computer systems
• Traffic equilibrium problems
• Network flow, etc.
The dual problem corresponding to the minimization of a sum of convex functions.
Finding a point in the intersection of some convex sets.
Notations
• (2.1) becomes
$$\min_{x \in \mathbb{R}^{Nn}} f(x) = \sum_{i=1}^{N} f_i(x_i) \quad \text{s.t.} \quad Ux = 0, \qquad U = [\, I_n \; \cdots \; I_n \,].$$
• KKT conditions:
$$Ux^* = 0, \qquad \nabla f(x^*) = U^T \lambda^* \;\Leftrightarrow\; \bigl(\nabla f_1(x_1^*)^T, \ldots, \nabla f_N(x_N^*)^T\bigr)^T = \bigl(\lambda^{*T}, \ldots, \lambda^{*T}\bigr)^T$$
$$\Leftrightarrow\; \nabla f_i(x_i^*) = \nabla f_j(x_j^*) \quad \forall\, i \neq j \in \{1, \ldots, N\}.$$
Notations
• Consider the subspace $S = \{x \in \mathbb{R}^{Nn} : Ux = 0\}$ and its orthogonal complement.
• Define the extended norm induced by $G$ and its dual norm (for the gradients); they satisfy a Cauchy–Schwarz inequality.
Notations
• Partition of the identity matrix $I_{Nn} = [\, U_1 \; \cdots \; U_N \,]$, where
$$U_i = \bigl(0_{n \times n} \; \cdots \; I_{n \times n} \; \cdots \; 0_{n \times n}\bigr)^T \in \mathbb{R}^{Nn \times n} \quad \text{($i$-th block entry equal to $I_n$)}.$$
• Then
$$x = \sum_{i=1}^{N} U_i x_i, \quad x_i \in \mathbb{R}^n, \qquad \nabla_i f(x) = U_i^T \nabla f(x) \in \mathbb{R}^n.$$
• For a step $d$ with blocks $x_i, d_i \in \mathbb{R}^n$:
$$x^+ = x + d = \sum_{i=1}^{N} U_i (x_i + d_i), \qquad f(x^+) = \sum_{i=1}^{N} f_i(x_i + d_i).$$
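To make the block notation concrete, here is a small NumPy sketch; the toy function $f(x) = \tfrac{1}{2}\|x\|^2$ and the sizes are illustrative assumptions:

```python
import numpy as np

N, n = 4, 3                                  # number of blocks, block size (illustrative)

def U(i):
    """Partition matrix U_i: the i-th n x n block column of the Nn x Nn identity."""
    e = np.zeros((N, 1)); e[i] = 1.0
    return np.kron(e, np.eye(n))             # shape (N*n, n)

x_blocks = [np.random.randn(n) for _ in range(N)]
x = sum(U(i) @ x_blocks[i] for i in range(N))    # x = sum_i U_i x_i

# Partial gradient along block i for the toy f(x) = 0.5 ||x||^2, whose
# full gradient is simply x (illustrative choice):
grad_f = x
i = 1
grad_i = U(i).T @ grad_f                     # nabla_i f(x) = U_i^T nabla f(x)
print(np.allclose(grad_i, x_blocks[i]))      # True for this toy f
```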
Basic Assumption
• All $f_i$ are convex, and the gradients $\nabla f_i$ are Lipschitz continuous with Lipschitz constants $L_i > 0$, i.e.:
$$\|\nabla f_i(x_i + d_i) - \nabla f_i(x_i)\| \le L_i \|d_i\| \qquad \forall\, x_i, d_i \in \mathbb{R}^n.$$
• The graph $(V, E)$ is undirected and connected, with $N$ nodes $V = \{1, \ldots, N\}$; its edges $(i,j) \in E$ are used as the admissible coordinate pairs.
Randomized Block (i,j) Coordinate Descent Method
• Recall
$$\min_{x \in \mathbb{R}^{Nn}} f(x) = f_1(x_1) + \cdots + f_N(x_N) \quad \text{s.t.} \quad x_1 + \cdots + x_N = 0.$$
• Choose randomly a pair $(i,j) \in E$ with probability $p_{ij} \,(= p_{ji}) > 0$.
• Define the update $x^+ = x + U_i d_i + U_j d_j$; by the Lipschitz continuity of the block gradients,
$$f(x^+) \le f(x) + \langle \nabla_i f(x), d_i \rangle + \langle \nabla_j f(x), d_j \rangle + \frac{L_i}{2}\|d_i\|^2 + \frac{L_j}{2}\|d_j\|^2.$$
• Consider feasibility of $x^+$, i.e. we require $d_i + d_j = 0$.
• Minimize the right-hand side subject to this feasibility constraint:
$$d_i = -d_j = \frac{1}{L_i + L_j}\bigl(\nabla_j f(x) - \nabla_i f(x)\bigr).$$
• We get the following decrease in $f$ (a sketch of the resulting iteration follows below):
$$f(x^+) \le f(x) - \frac{1}{2(L_i + L_j)}\,\|\nabla_i f(x) - \nabla_j f(x)\|^2.$$
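A minimal NumPy sketch of this iteration on a toy separable quadratic; the objective, the sizes, and the uniform pair sampling over a complete graph are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 6, 2                                            # blocks and block size (illustrative)
Q = [np.diag(rng.uniform(1, 5, n)) for _ in range(N)]  # f_i(x_i) = 0.5 x_i^T Q_i x_i
L = [np.linalg.eigvalsh(Qi).max() for Qi in Q]         # Lipschitz constants of grad f_i

x = rng.standard_normal((N, n))
x -= x.mean(axis=0)                    # make the start feasible: sum_i x_i = 0

def grad_block(i):
    return Q[i] @ x[i]                 # nabla_i f(x) for the separable quadratic

for _ in range(200):
    i, j = rng.choice(N, size=2, replace=False)   # uniform pair sampling (illustrative)
    d = (grad_block(j) - grad_block(i)) / (L[i] + L[j])
    x[i] += d                          # d_i = -d_j keeps sum_i x_i = 0
    x[j] -= d

print(np.allclose(x.sum(axis=0), 0))   # feasibility is preserved at every iteration
```

At a fixed point all block gradients are equal, which is exactly the KKT characterization $\nabla f_i(x_i^*) = \nabla f_j(x_j^*)$ from the Notations slide.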
Randomized Block (i,j) Coordinate Descent Method
• Each iteration computes only the two block gradients $\nabla_i f, \nabla_j f$; full gradient methods need the entire $\nabla f$.
• $x^+$ depends on the random pair $(i,j)$, so define the expected value $\mathbb{E}[f(x^+)]$.
• Key inequality (take the expectation of the per-pair decrease):
$$\mathbb{E}[f(x^+)] \le f(x) - \frac{1}{2}\,\nabla f(x)^T \Bigl(\sum_{(i,j) \in E} \frac{p_{ij}}{L_i + L_j}\, G_{ij}\Bigr) \nabla f(x),$$
• where
$$G_{ij} = (e_i - e_j)(e_i - e_j)^T \otimes I_n \in \mathbb{R}^{Nn \times Nn},$$
with $e_i \in \mathbb{R}^N$ the $i$-th standard basis vector.
• Introduce the distance
$$R(x^0) = \max_{x} \Bigl\{ \min_{x^* \in X^*} \|x - x^*\| : f(x) \le f(x^0) \Bigr\},$$
which measures the size of the level set of $f$ given by $x^0$.
• Convergence results: a sublinear rate of order $O(1/k)$ on $\mathbb{E}[f(x^k)] - f^*$.
Randomized Block (i,j) Coordinate Descent Method
• Proof sketch: by convexity, $f(x) - f^* \le \langle \nabla f(x), x - x^* \rangle \le R(x^0)\,\|\nabla f(x)\|_*$; combine this with the key inequality and take expectations (denoting $\phi_k = \mathbb{E}[f(x^k)]$) to obtain the recursion that yields the rate.
Design of the probability
• Uniform probabilities: $p_{ij} = 1/|E|$.
• Probabilities dependent on the Lipschitz constants $L_i$.
• Design the probability to optimize the bound. Recall the convergence rate; the idea is to search for the distribution $p = (p_{ij})_{(i,j) \in E}$ that optimizes it, where each $R_i$ is assumed constant such that $\|x_i - x_i^*\| \le R_i$ for all $i$.
Design of the probability
• Using a semidefinite programming (SDP) relaxation:
where $R = (R_1^2, R_2^2, \ldots, R_N^2)^T$ and the other variables are the multipliers in the Lagrange relaxation.
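As a rough, self-contained illustration of this kind of probability design, the CVXPY sketch below maximizes the smallest restricted eigenvalue of the expected-decrease matrix $\sum_{(i,j)\in E} \frac{p_{ij}}{L_i+L_j} G_{ij}$ over the simplex of edge probabilities. The objective, graph, and Lipschitz constants are all illustrative assumptions; the paper's actual SDP (with the $R_i^2$ weights and Lagrange multipliers) is not reproduced here:

```python
import itertools
import numpy as np
import cvxpy as cp

N, n = 4, 1                                   # illustrative sizes (n = 1 scalar blocks)
L = np.array([1.0, 2.0, 4.0, 8.0])            # illustrative Lipschitz constants
edges = list(itertools.combinations(range(N), 2))   # complete graph

def G(i, j):
    e = np.zeros((N, 1)); e[i], e[j] = 1.0, -1.0
    return np.kron(e @ e.T, np.eye(n))        # G_ij = (e_i - e_j)(e_i - e_j)^T (x) I_n

p = cp.Variable(len(edges), nonneg=True)
t = cp.Variable()
Lam = sum(p[k] / (L[i] + L[j]) * G(i, j) for k, (i, j) in enumerate(edges))

# Restrict the eigenvalue bound to the feasible subspace {x : sum_i x_i = 0} by
# adding a rank-one term along its orthogonal complement (the all-ones direction,
# which spans the nullspace of Lam when n = 1).
ones = np.ones((N * n, 1)) / np.sqrt(N * n)
constraints = [cp.sum(p) == 1, Lam + ones @ ones.T >> t * np.eye(N * n)]
prob = cp.Problem(cp.Maximize(t), constraints)
prob.solve(solver=cp.SCS)
print(dict(zip(edges, np.round(p.value, 3))))   # optimized edge probabilities
```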
• Note:
• Convergence rate under the designed probability:
Comparison with full gradient method
• The comparison uses an $N \times N$ matrix built from the inverse Lipschitz constants $L_1^{-1}, \ldots, L_N^{-1}$.
• Consider a particular case: (a) a complete graph; (b) uniform probabilities $p_{ij}$.
• Upper bound for the randomized (B)CD method: (random).
• Full gradient method: similarly, an upper bound (full) is obtained.
• The two bounds (full) and (random) can then be compared directly.
Strongly Convex Case
• Assume $f$ is strongly convex w.r.t. the extended norm $\|\cdot\|$ with convexity parameter $\sigma$:
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma}{2}\|y - x\|^2.$$
• Combine this with the key inequality by minimizing over the free variable (worked step below): the expected gap then contracts, giving a linear rate.
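The minimization step referred to here is the standard strong-convexity bound; a minimal worked version, in the notation above:

```latex
% Minimize the strong-convexity lower bound over y; the minimum is
% attained at y = x - (1/sigma) * grad f(x), giving
\begin{align*}
f^* \;\ge\; \min_{y}\Bigl\{ f(x) + \langle \nabla f(x),\, y - x\rangle
      + \tfrac{\sigma}{2}\lVert y - x\rVert^{2} \Bigr\}
   \;=\; f(x) - \tfrac{1}{2\sigma}\,\lVert \nabla f(x)\rVert_{*}^{2} .
\end{align*}
% Hence f(x) - f^* <= (1/(2*sigma)) ||grad f(x)||_*^2; substituting this
% into the key inequality bounds the expected gap by a factor strictly
% less than 1 per iteration, i.e. a linear rate of convergence.
```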
• Similarly, choose the optimal probability by solving the following SDP:
Rate of convergence in probability
• The proof uses reasoning similar to Theorem 1 in [14] and is derived from the Markov inequality.
[14] P. Richtárik and M. Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, submitted to Mathematical Programming, 2011.
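The Markov-inequality step referred to here is standard; a minimal sketch (the accuracy $\epsilon$ and confidence level $\rho$ are generic symbols, not taken from the slides):

```latex
% Markov's inequality applied to the nonnegative random variable f(x^k) - f^*:
\begin{align*}
\Pr\bigl[f(x^k) - f^* \ge \epsilon\bigr]
  \;\le\; \frac{\mathbb{E}\bigl[f(x^k) - f^*\bigr]}{\epsilon}.
\end{align*}
% Hence, once the expected gap is at most rho * epsilon, the method has
% reached accuracy epsilon with probability at least 1 - rho.
```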
Random pairs sampling
• The $(RCD)_{(i,j)}$ method needs to choose a pair of coordinates $(i,j)$ at each iteration.
• So we need a fast procedure to generate random pairs.
• Given the probability distribution over the $n_p = |E|$ pairs, re-index the pairs into a vector $l = 1, \ldots, n_p$, with pair $(i_l, j_l)$ having probability $p_{i_l j_l}$; then divide $[0,1]$ into $n_p$ subintervals.
Remark:
• Clearly, the width of the $l$-th subinterval equals the probability $p_{i_l j_l}$.
• Sampling algorithm: draw $u$ uniformly in $[0,1]$ and return the pair whose subinterval contains $u$ (see the sketch below).
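A minimal Python sketch of this interval sampler, using binary search on the cumulative probabilities; the edge list and the uniform distribution are illustrative assumptions:

```python
import bisect
import itertools
import random

# Illustrative edge list and probabilities (must sum to 1).
edges = list(itertools.combinations(range(4), 2))   # pairs (i, j), n_p = |E|
probs = [1.0 / len(edges)] * len(edges)             # uniform, for illustration

# Cumulative right endpoints of the n_p subintervals of [0, 1].
cum = list(itertools.accumulate(probs))

def sample_pair():
    """Draw u ~ U[0,1] and return the pair whose subinterval contains u.

    The l-th subinterval has width p_{i_l j_l}, so pair l is returned
    with exactly that probability; the binary search costs O(log n_p).
    """
    u = random.random()
    l = bisect.bisect_left(cum, u)
    return edges[l]

counts = {e: 0 for e in edges}
for _ in range(60000):
    counts[sample_pair()] += 1
print(counts)   # roughly equal counts under the uniform distribution
```

Precomputing `cum` once makes each draw logarithmic in the number of pairs, so the sampling cost per iteration is negligible.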
Generalizations
• Extension of $(RCD)_{(i,j)}$ to more than one pair: the same rate of convergence is obtained for $(RCD)_M$ as in the previous sections.
Generalizations
• Extension of $(RCD)_{(i,j)}$ to nonseparable objective functions with general equality constraints, where $f$ has a component-wise Lipschitz continuous gradient.
• Assuming the constraint matrix has blocks $A_i$, feasibility of the update now requires $A_i s_i + A_j s_j = 0$, and the step $(s_i, s_j)$ is obtained as the $\arg\min$ of the local upper bound subject to this condition.
• A similar convergence rate is obtained.
• The probability is chosen similarly.
Google Problem (Numerical Experiment)
• Goal: compute the principal eigenvector of the Google (column-stochastic) matrix, the problem used in [1].
Thank you!