Randomized Coordinate Descent Methods on Optimization Problems with Linearly Coupled Constraints
Lijun Xu
Optimization Group Meeting
November 27, 2012
By I. Necoara, Y. Nesterov, and F. Glineur
Outline
• Introduction
• Randomized Block (i,j) Coordinate Descent Method
• RCD Method in the Strongly Convex Case
• Random Pairs Sampling
• Extensions
• Numerical Experiment
• Coordinate Descent Method. Q: How to choose the coordinate?
a) cyclic (convergence is difficult to prove)
b) maximal descent (the provable convergence rate is trivial, worse than the simple Gradient Method in general)
c) random (faster, simpler, robust, amenable to distributed and parallel implementation, etc.; see the sketch below)
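For concreteness, a minimal NumPy sketch contrasting the three selection rules on a toy quadratic; the objective and sizes are illustrative assumptions, not from the slides. Note that the maximal-descent rule needs the full residual $Ax - b$ at each iteration, which is why it is not cheaper than a gradient step:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n))
A = A.T @ A + np.eye(n)            # SPD matrix (illustrative)
b = rng.standard_normal(n)         # f(x) = 0.5 x^T A x - b^T x

def coordinate_descent(choose, iters=2000):
    x = np.zeros(n)
    for k in range(iters):
        i = choose(k, x)
        # Exact minimization of f along coordinate i:
        # (grad f)_i = A[i] @ x - b[i], curvature A[i, i].
        x[i] -= (A[i] @ x - b[i]) / A[i, i]
    return 0.5 * x @ A @ x - b @ x

f_cyclic = coordinate_descent(lambda k, x: k % n)                         # a) cyclic
f_random = coordinate_descent(lambda k, x: rng.integers(n))               # c) random
f_greedy = coordinate_descent(lambda k, x: np.argmax(np.abs(A @ x - b)))  # b) maximal descent
print(f_cyclic, f_random, f_greedy)
```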
Introduction
• Randomized (block) coordinate descent methods:
a) The first analysis of this method, when applied to the problem of minimizing a smooth convex function, was performed by Nesterov (2010) [1].
b) The extension to composite functions was given by Richtárik and Takáč (2011) [2].
[1] Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, CORE Discussion Paper, 2010.
[2] P. Richtárik and M. Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, submitted to Mathematical Programming, 2011.
Problem formulation
• Minimize a separable convex objective function with linearly coupled constraints.
• Extension to problems with a non-separable objective function and general linear constraints.
Motivation for the Formulation
Applications in:
• Resource allocation in economic systems
• Distributed computer systems
• Traffic equilibrium problems
• Network flow, etc.
The dual problem corresponding to the minimization of a sum of convex functions.
Finding a point in the intersection of some convex sets.
Notations
• (2.1) becomes
$$\min_{x \in \mathbb{R}^{Nn}} f(x) = \sum_{i=1}^{N} f_i(x_i) \quad \text{s.t.} \quad Ux = 0, \qquad U = [\, I_n \; \cdots \; I_n \,].$$
• KKT conditions:
$$Ux^* = 0, \qquad \nabla f(x^*) = U^T \lambda^* \;\Leftrightarrow\; \bigl(\nabla f_1(x_1^*)^T, \ldots, \nabla f_N(x_N^*)^T\bigr)^T = \bigl(\lambda^{*T}, \ldots, \lambda^{*T}\bigr)^T$$
$$\Leftrightarrow\; \nabla f_i(x_i^*) = \nabla f_j(x_j^*) \quad \forall\, i \neq j \in \{1, \ldots, N\}.$$
Notations
• Consider the subspace $S = \{x \in \mathbb{R}^{Nn} : Ux = 0\}$ and its orthogonal complement.
• Define the extended norm induced by $G$ and its dual norm (for the gradients); they satisfy a Cauchy–Schwarz inequality.
Notations
• Partition of the identity matrix $I_{Nn} = [\, U_1 \; \cdots \; U_N \,]$, where
$$U_i = \bigl(0_{n \times n} \; \cdots \; I_{n \times n} \; \cdots \; 0_{n \times n}\bigr)^T \in \mathbb{R}^{Nn \times n} \quad \text{($i$-th block entry equal to $I_n$)}.$$
• Then
$$x = \sum_{i=1}^{N} U_i x_i, \quad x_i \in \mathbb{R}^n, \qquad \nabla_i f(x) = U_i^T \nabla f(x) \in \mathbb{R}^n.$$
• For a step $d$ with blocks $x_i, d_i \in \mathbb{R}^n$:
$$x^+ = x + d = \sum_{i=1}^{N} U_i (x_i + d_i), \qquad f(x^+) = \sum_{i=1}^{N} f_i(x_i + d_i).$$
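To make the block notation concrete, here is a small NumPy sketch; the toy function $f(x) = \tfrac{1}{2}\|x\|^2$ and the sizes are illustrative assumptions:

```python
import numpy as np

N, n = 4, 3                                  # number of blocks, block size (illustrative)

def U(i):
    """Partition matrix U_i: the i-th n x n block column of the Nn x Nn identity."""
    e = np.zeros((N, 1)); e[i] = 1.0
    return np.kron(e, np.eye(n))             # shape (N*n, n)

x_blocks = [np.random.randn(n) for _ in range(N)]
x = sum(U(i) @ x_blocks[i] for i in range(N))    # x = sum_i U_i x_i

# Partial gradient along block i for the toy f(x) = 0.5 ||x||^2, whose
# full gradient is simply x (illustrative choice):
grad_f = x
i = 1
grad_i = U(i).T @ grad_f                     # nabla_i f(x) = U_i^T nabla f(x)
print(np.allclose(grad_i, x_blocks[i]))      # True for this toy f
```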
Basic Assumption
• All $f_i$ are convex, and the gradients $\nabla f_i$ are Lipschitz continuous with Lipschitz constants $L_i > 0$, i.e.:
$$\|\nabla f_i(x_i + d_i) - \nabla f_i(x_i)\| \le L_i \|d_i\| \qquad \forall\, x_i, d_i \in \mathbb{R}^n.$$
• The graph $(V, E)$ is undirected and connected, with $N$ nodes $V = \{1, \ldots, N\}$; its edges $(i,j) \in E$ are used as the admissible coordinate pairs.
Randomized Block (i,j) Coordinate Descent Method
• Recall
$$\min_{x \in \mathbb{R}^{Nn}} f(x) = f_1(x_1) + \cdots + f_N(x_N) \quad \text{s.t.} \quad x_1 + \cdots + x_N = 0.$$
• Choose randomly a pair $(i,j) \in E$ with probability $p_{ij} \,(= p_{ji}) > 0$.
• Define the update $x^+ = x + U_i d_i + U_j d_j$; by the Lipschitz continuity of the block gradients,
$$f(x^+) \le f(x) + \langle \nabla_i f(x), d_i \rangle + \langle \nabla_j f(x), d_j \rangle + \frac{L_i}{2}\|d_i\|^2 + \frac{L_j}{2}\|d_j\|^2.$$
• Consider feasibility of $x^+$, i.e. we require $d_i + d_j = 0$.
• Minimize the right-hand side subject to this feasibility constraint:
$$d_i = -d_j = \frac{1}{L_i + L_j}\bigl(\nabla_j f(x) - \nabla_i f(x)\bigr).$$
• We get the following decrease in $f$ (a sketch of the resulting iteration follows below):
$$f(x^+) \le f(x) - \frac{1}{2(L_i + L_j)}\,\|\nabla_i f(x) - \nabla_j f(x)\|^2.$$
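A minimal NumPy sketch of this iteration on a toy separable quadratic; the objective, the sizes, and the uniform pair sampling over a complete graph are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 6, 2                                            # blocks and block size (illustrative)
Q = [np.diag(rng.uniform(1, 5, n)) for _ in range(N)]  # f_i(x_i) = 0.5 x_i^T Q_i x_i
L = [np.linalg.eigvalsh(Qi).max() for Qi in Q]         # Lipschitz constants of grad f_i

x = rng.standard_normal((N, n))
x -= x.mean(axis=0)                    # make the start feasible: sum_i x_i = 0

def grad_block(i):
    return Q[i] @ x[i]                 # nabla_i f(x) for the separable quadratic

for _ in range(200):
    i, j = rng.choice(N, size=2, replace=False)   # uniform pair sampling (illustrative)
    d = (grad_block(j) - grad_block(i)) / (L[i] + L[j])
    x[i] += d                          # d_i = -d_j keeps sum_i x_i = 0
    x[j] -= d

print(np.allclose(x.sum(axis=0), 0))   # feasibility is preserved at every iteration
```

At a fixed point all block gradients are equal, which is exactly the KKT characterization $\nabla f_i(x_i^*) = \nabla f_j(x_j^*)$ from the Notations slide.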
Randomized Block (i,j) Coordinate Descent Method
• Each iteration computes only the two block gradients $\nabla_i f, \nabla_j f$; full gradient methods need the entire $\nabla f$.
• $x^+$ depends on the random pair $(i,j)$, so define the expected value $\mathbb{E}[f(x^+)]$.
• Key inequality (take the expectation of the per-pair decrease):
$$\mathbb{E}[f(x^+)] \le f(x) - \frac{1}{2}\,\nabla f(x)^T \Bigl(\sum_{(i,j) \in E} \frac{p_{ij}}{L_i + L_j}\, G_{ij}\Bigr) \nabla f(x),$$
• where
$$G_{ij} = (e_i - e_j)(e_i - e_j)^T \otimes I_n \in \mathbb{R}^{Nn \times Nn},$$
with $e_i \in \mathbb{R}^N$ the $i$-th standard basis vector.
• Introduce the distance
$$R(x^0) = \max_{x} \Bigl\{ \min_{x^* \in X^*} \|x - x^*\| : f(x) \le f(x^0) \Bigr\},$$
which measures the size of the level set of $f$ given by $x^0$.
• Convergence results: a sublinear rate of order $O(1/k)$ on $\mathbb{E}[f(x^k)] - f^*$.
Randomized Block (i,j) Coordinate Descent Method
• Proof sketch: by convexity, $f(x) - f^* \le \langle \nabla f(x), x - x^* \rangle \le R(x^0)\,\|\nabla f(x)\|_*$; combine this with the key inequality and take expectations (denoting $\phi_k = \mathbb{E}[f(x^k)]$) to obtain the recursion that yields the rate.
Design of the probability
• Uniform probabilities: $p_{ij} = 1/|E|$.
• Probabilities dependent on the Lipschitz constants $L_i$.
• Design the probability to optimize the bound. Recall the convergence rate; the idea is to search for the distribution $p = (p_{ij})_{(i,j) \in E}$ that optimizes it, where each $R_i$ is assumed constant such that $\|x_i - x_i^*\| \le R_i$ for all $i$.
Design of the probability
• Using a semidefinite programming (SDP) relaxation:
where $R = (R_1^2, R_2^2, \ldots, R_N^2)^T$ and the other variables are the multipliers in the Lagrange relaxation.
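As a rough, self-contained illustration of this kind of probability design, the CVXPY sketch below maximizes the smallest restricted eigenvalue of the expected-decrease matrix $\sum_{(i,j)\in E} \frac{p_{ij}}{L_i+L_j} G_{ij}$ over the simplex of edge probabilities. The objective, graph, and Lipschitz constants are all illustrative assumptions; the paper's actual SDP (with the $R_i^2$ weights and Lagrange multipliers) is not reproduced here:

```python
import itertools
import numpy as np
import cvxpy as cp

N, n = 4, 1                                   # illustrative sizes (n = 1 scalar blocks)
L = np.array([1.0, 2.0, 4.0, 8.0])            # illustrative Lipschitz constants
edges = list(itertools.combinations(range(N), 2))   # complete graph

def G(i, j):
    e = np.zeros((N, 1)); e[i], e[j] = 1.0, -1.0
    return np.kron(e @ e.T, np.eye(n))        # G_ij = (e_i - e_j)(e_i - e_j)^T (x) I_n

p = cp.Variable(len(edges), nonneg=True)
t = cp.Variable()
Lam = sum(p[k] / (L[i] + L[j]) * G(i, j) for k, (i, j) in enumerate(edges))

# Restrict the eigenvalue bound to the feasible subspace {x : sum_i x_i = 0} by
# adding a rank-one term along its orthogonal complement (the all-ones direction,
# which spans the nullspace of Lam when n = 1).
ones = np.ones((N * n, 1)) / np.sqrt(N * n)
constraints = [cp.sum(p) == 1, Lam + ones @ ones.T >> t * np.eye(N * n)]
prob = cp.Problem(cp.Maximize(t), constraints)
prob.solve(solver=cp.SCS)
print(dict(zip(edges, np.round(p.value, 3))))   # optimized edge probabilities
```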
• Note:
• Convergence rate under the designed probability:
Comparison with full gradient method
• The comparison uses an $N \times N$ matrix built from the inverse Lipschitz constants $L_1^{-1}, \ldots, L_N^{-1}$.
• Consider a particular case: (a) a complete graph; (b) uniform probabilities $p_{ij}$.
• Upper bound for the randomized (B)CD method: (random).
• Full gradient method: similarly, an upper bound (full) is obtained.
• The two bounds (full) and (random) can then be compared directly.
Strongly Convex Case
• Assume $f$ is strongly convex w.r.t. the extended norm $\|\cdot\|$ with convexity parameter $\sigma$:
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma}{2}\|y - x\|^2.$$
• Combine this with the key inequality by minimizing over the free variable (worked step below): the expected gap then contracts, giving a linear rate.
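The minimization step referred to here is the standard strong-convexity bound; a minimal worked version, in the notation above:

```latex
% Minimize the strong-convexity lower bound over y; the minimum is
% attained at y = x - (1/sigma) * grad f(x), giving
\begin{align*}
f^* \;\ge\; \min_{y}\Bigl\{ f(x) + \langle \nabla f(x),\, y - x\rangle
      + \tfrac{\sigma}{2}\lVert y - x\rVert^{2} \Bigr\}
   \;=\; f(x) - \tfrac{1}{2\sigma}\,\lVert \nabla f(x)\rVert_{*}^{2} .
\end{align*}
% Hence f(x) - f^* <= (1/(2*sigma)) ||grad f(x)||_*^2; substituting this
% into the key inequality bounds the expected gap by a factor strictly
% less than 1 per iteration, i.e. a linear rate of convergence.
```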
• Similarly, choose the optimal probability by solving the following SDP:
Rate of convergence in probability
• The proof uses reasoning similar to Theorem 1 in [14] and is derived from the Markov inequality.
[14] P. Richtárik and M. Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, submitted to Mathematical Programming, 2011.
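The Markov-inequality step referred to here is standard; a minimal sketch (the accuracy $\epsilon$ and confidence level $\rho$ are generic symbols, not taken from the slides):

```latex
% Markov's inequality applied to the nonnegative random variable f(x^k) - f^*:
\begin{align*}
\Pr\bigl[f(x^k) - f^* \ge \epsilon\bigr]
  \;\le\; \frac{\mathbb{E}\bigl[f(x^k) - f^*\bigr]}{\epsilon}.
\end{align*}
% Hence, once the expected gap is at most rho * epsilon, the method has
% reached accuracy epsilon with probability at least 1 - rho.
```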
Random pairs sampling
• The $(RCD)_{(i,j)}$ method needs to choose a pair of coordinates $(i,j)$ at each iteration.
• So we need a fast procedure to generate random pairs.
• Given the probability distribution over the $n_p = |E|$ pairs, re-index the pairs into a vector $l = 1, \ldots, n_p$, with pair $(i_l, j_l)$ having probability $p_{i_l j_l}$; then divide $[0,1]$ into $n_p$ subintervals.
Remark:
• Clearly, the width of the $l$-th subinterval equals the probability $p_{i_l j_l}$.
• Sampling algorithm: draw $u$ uniformly in $[0,1]$ and return the pair whose subinterval contains $u$ (see the sketch below).
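A minimal Python sketch of this interval sampler, using binary search on the cumulative probabilities; the edge list and the uniform distribution are illustrative assumptions:

```python
import bisect
import itertools
import random

# Illustrative edge list and probabilities (must sum to 1).
edges = list(itertools.combinations(range(4), 2))   # pairs (i, j), n_p = |E|
probs = [1.0 / len(edges)] * len(edges)             # uniform, for illustration

# Cumulative right endpoints of the n_p subintervals of [0, 1].
cum = list(itertools.accumulate(probs))

def sample_pair():
    """Draw u ~ U[0,1] and return the pair whose subinterval contains u.

    The l-th subinterval has width p_{i_l j_l}, so pair l is returned
    with exactly that probability; the binary search costs O(log n_p).
    """
    u = random.random()
    l = bisect.bisect_left(cum, u)
    return edges[l]

counts = {e: 0 for e in edges}
for _ in range(60000):
    counts[sample_pair()] += 1
print(counts)   # roughly equal counts under the uniform distribution
```

Precomputing `cum` once makes each draw logarithmic in the number of pairs, so the sampling cost per iteration is negligible.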
Generalizations
• Extension of $(RCD)_{(i,j)}$ to more than one pair: the same rate of convergence is obtained for $(RCD)_M$ as in the previous sections.
Generalizations
• Extension of $(RCD)_{(i,j)}$ to nonseparable objective functions with general equality constraints, where $f$ has a component-wise Lipschitz continuous gradient.
• Assuming the constraint matrix has blocks $A_i$, feasibility of the update now requires $A_i s_i + A_j s_j = 0$, and the step $(s_i, s_j)$ is obtained as the $\arg\min$ of the local upper bound subject to this condition.
• A similar convergence rate is obtained.
• The probability is chosen similarly.
Google Problem (Numerical Experiment)
• Goal: compute the principal eigenvector of the Google (column-stochastic) matrix, the problem used in [1].
Thank you!