Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik...

Page 1:

Accelerated, Parallel and PROXimal coordinate descent

Moscow February 2014

APPROX

Peter Richtárik

(Joint work with Olivier Fercoq - arXiv:1312.5799)

Page 2:

Optimization Problem

Page 3:

Problem

Loss: convex (smooth or nonsmooth)

Regularizer: convex (smooth or nonsmooth), separable, allow [value not captured in the transcript]
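
The formula on this slide was not captured in the transcript. Judging from the labels (loss, regularizer, separable), and assuming the composite form used in the accompanying paper (arXiv:1312.5799), the problem presumably reads as follows. The symbol names f, ψ and x^{(i)} are assumptions, and the value the regularizer is "allowed" to take is presumably +∞, which would let constraints be encoded as indicator functions.

```latex
\min_{x \in \mathbb{R}^N} \; F(x) := \underbrace{f(x)}_{\text{loss}} + \underbrace{\psi(x)}_{\text{regularizer}},
\qquad
\psi(x) = \sum_{i=1}^{n} \psi_i\bigl(x^{(i)}\bigr)
```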

Page 4:

Regularizer: examples

• No regularizer

• Weighted L1 norm (e.g., LASSO)

• Weighted L2 norm

• Box constraints (e.g., SVM dual)
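
Because each of these regularizers is separable, a proximal method only needs a one-dimensional proximal map per coordinate. A minimal sketch of those maps follows. The function names and the `step` and `w` parameters are illustrative assumptions, and the "weighted L2 norm" is assumed to mean the squared norm (w/2)·z², since that is the separable variant.

```python
def prox_none(u, step, w=1.0):
    """No regularizer: the proximal map is the identity."""
    return u

def prox_l1(u, step, w=1.0):
    """argmin_z (z - u)^2 / (2*step) + w*|z|, i.e. soft-thresholding (LASSO)."""
    t = step * w
    if u > t:
        return u - t
    if u < -t:
        return u + t
    return 0.0

def prox_l2_squared(u, step, w=1.0):
    """argmin_z (z - u)^2 / (2*step) + (w/2)*z^2, a simple shrinkage."""
    return u / (1.0 + step * w)

def prox_box(u, step, lo=0.0, hi=1.0):
    """Box constraint lo <= z <= hi (as in an SVM dual): projection onto the box."""
    return min(max(u, lo), hi)
```

In every case the regularizer is handled exactly, in closed form, one coordinate at a time.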

Page 5:

Loss: examples

Quadratic loss

L-infinity

L1 regression

Exponential loss

Logistic loss

Square hinge loss

[BKBG’11], [RT’11b], [TBRS’13], [RT’13a], [FR’13]

Page 6:

RANDOMIZED COORDINATE DESCENT IN 2D

Page 7:

2D Optimization

Goal: find the minimizer of [function not captured in the transcript]

[Figure: contours of a function]

Pages 8-15:

Randomized Coordinate Descent in 2D

[Figure sequence over eight slides: contours of the function with a compass marking the coordinate directions N, S, E, W. Slides 9 to 15 each add one numbered iterate, 1 through 7; the last slide is marked SOLVED!]
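
The 2D walk can be sketched in a few lines. The quadratic below is a hypothetical stand-in, since the slide's function was not captured. Each step picks one of the two coordinate directions at random and minimizes exactly along it.

```python
import random

def f(x):
    # Hypothetical convex quadratic with elliptical contours (not from the slides).
    return x[0] ** 2 + 2.0 * x[1] ** 2 + x[0] * x[1] - 3.0 * x[0] - 4.0 * x[1]

def partial(x, i):
    """Partial derivative of f with respect to coordinate i."""
    if i == 0:
        return 2.0 * x[0] + x[1] - 3.0
    return 4.0 * x[1] + x[0] - 4.0

CURVATURE = (2.0, 4.0)  # second derivative of f along each coordinate

def randomized_cd(x0, iters=200, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        i = rng.randrange(2)                   # east-west (0) or north-south (1)
        x[i] -= partial(x, i) / CURVATURE[i]   # exact minimization along coordinate i
    return x
```

For this particular quadratic the minimizer is (8/7, 5/7), and `randomized_cd([0.0, 0.0])` approaches it after a few dozen direction changes.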

Page 16:

CONTRIBUTIONS

Page 17:

Variants of Randomized Coordinate Descent Methods

• Block – can operate on “blocks” of coordinates – as opposed to just on individual coordinates

• General – applies to “general” (= smooth convex) functions – as opposed to special ones such as quadratics

• Proximal – admits a “nonsmooth regularizer” that is kept intact in solving subproblems – regularizer not smoothed, nor approximated

• Parallel – operates on multiple blocks / coordinates in parallel – as opposed to just 1 block / coordinate at a time

• Accelerated – achieves O(1/k^2) convergence rate for convex functions – as opposed to O(1/k)

• Efficient – complexity of 1 iteration is O(1) per processor on sparse problems – as opposed to O(# coordinates): avoids adding two full vectors

Page 18:

Brief History of Randomized Coordinate Descent Methods

+ new long stepsizes

Page 19:

APPROX

Page 20:

A = “ACCELERATED”

P = “PARALLEL”

PROX = “PROXIMAL”

Page 21:

PCDM (R. & Takáč, 2012) = APPROX if we force

Page 22:

APPROX: Smooth Case

Want this to be as large as possible

Update for coordinate i

Partial derivative of f
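
The algorithm box on this slide was not captured. The following is a minimal sketch of the smooth case (no regularizer), assuming the update rules of arXiv:1312.5799 and a least-squares loss f(x) = 0.5·||Ax − b||². The function names, the dense list-of-rows matrix format and the stepsize formula for `v` (assumed to be the ESO parameters for τ-nice sampling) are assumptions of this sketch, not a transcription of the slide. For clarity it recomputes the full residual at every iteration, which an efficient implementation would avoid.

```python
import math
import random

def objective(A, b, x):
    """f(x) = 0.5 * ||Ax - b||^2 for a dense matrix A given as a list of rows."""
    return 0.5 * sum((sum(a * xi for a, xi in zip(row, x)) - bj) ** 2
                     for row, bj in zip(A, b))

def approx_smooth(A, b, tau=1, iters=3000, seed=0):
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    # Assumed ESO parameters v_i; a smaller v_i means a longer step
    # ("want this to be as large as possible" refers to the stepsize).
    omega = [sum(1 for a in row if a != 0.0) for row in A]
    v = [sum((1.0 + (omega[j] - 1) * (tau - 1) / max(1, n - 1)) * A[j][i] ** 2
             for j in range(m)) for i in range(n)]
    x = [0.0] * n
    z = [0.0] * n
    theta = tau / n
    for _ in range(iters):
        y = [(1.0 - theta) * xi + theta * zi for xi, zi in zip(x, z)]
        r = [sum(a * yi for a, yi in zip(row, y)) - bj for row, bj in zip(A, b)]
        x = y[:]                                        # untouched coordinates keep y
        for i in rng.sample(range(n), tau):             # tau coordinates, chosen uniformly
            g = sum(A[j][i] * r[j] for j in range(m))   # partial derivative of f at y
            t = -tau / (n * theta * v[i]) * g           # update for coordinate i
            z[i] += t
            x[i] = y[i] + (n / tau) * theta * t
        theta = 0.5 * (math.sqrt(theta ** 4 + 4.0 * theta ** 2) - theta ** 2)
    return x
```

With `tau=1` the stepsize parameters reduce to the squared column norms of A, the usual coordinate-wise Lipschitz constants.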

Page 23:

CONVERGENCE RATE

Page 24:

Convergence Rate

average # coordinates updated / iteration

# coordinates

# iterations

implies

Theorem [FR’13b]

Key assumption
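
The formulas on this slide were not captured. A reconstruction consistent with the labels (n = # coordinates, k = # iterations, τ = average # coordinates updated per iteration) is given below, assuming the statement of the main theorem in [FR’13b]. The constant C and the weighted norm ||·||_v are assumptions taken from that paper and should be checked against it.

```latex
% Key assumption (ESO with parameters v):
\mathbb{E}\bigl[f(x + h_{[\hat S]})\bigr] \le f(x) + \frac{\tau}{n}\Bigl(\langle \nabla f(x), h\rangle + \tfrac12 \|h\|_v^2\Bigr)

% Convergence rate:
k \ge \frac{2n}{\tau}\Bigl(\sqrt{C/\epsilon} - 1\Bigr) + 1
\quad\text{implies}\quad
\mathbb{E}\bigl[F(x_k) - F^*\bigr] \le \frac{4n^2 C}{\bigl((k-1)\tau + 2n\bigr)^2} \le \epsilon,

% where
C = \Bigl(1 - \frac{\tau}{n}\Bigr)\bigl(F(x_0) - F^*\bigr) + \tfrac12 \|x_0 - x^*\|_v^2 .
```

The iteration bound follows from the rate by solving 4n²C / ((k−1)τ + 2n)² ≤ ε for k.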

Page 25:

Special Case: Fully Parallel Variant

all coordinates are updated in each iteration

# normalized weights (summing to n)

# iterations

implies

Page 26:

Special Case: Effect of New Stepsizes

Average degree of separability

“Average” of the Lipschitz constants

With the new stepsizes (will mention later!), we have:

Page 27:

“EFFICIENCY” OF APPROX

Page 28:

Cost of 1 Iteration of APPROX

Assume N = n (all blocks are of size 1) and that

Sparse matrix

Scalar function: derivative = O(1)

Then the average cost of 1 iteration of APPROX is

arithmetic ops

= average # nonzeros in a column of A

Page 29:

Bottleneck: Computation of Partial Derivatives

maintained
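
The slide's formulas were not captured. One common way to make partial derivatives cheap, assuming that "maintained" refers to a residual-type vector such as r = Ax − b, is sketched below for f(x) = 0.5·||Ax − b||². The column-wise sparse format (a list of `(row, value)` pairs per column) and all function names are assumptions of this sketch. Both the derivative and the bookkeeping touch only the nonzeros of one column.

```python
def column_dot(col, r):
    """Partial derivative of 0.5*||Ax - b||^2 w.r.t. one coordinate, given r = Ax - b."""
    return sum(a * r[j] for j, a in col)

def update_coordinate(cols, x, r, i, delta):
    """Apply x[i] += delta and keep r = Ax - b up to date in O(nnz of column i)."""
    x[i] += delta
    for j, a in cols[i]:
        r[j] += a * delta

def full_residual(cols, x, b):
    """Reference computation of r = Ax - b from scratch (cost: all nonzeros of A)."""
    r = [-bj for bj in b]
    for i, col in enumerate(cols):
        for j, a in col:
            r[j] += a * x[i]
    return r
```

After any sequence of `update_coordinate` calls, the maintained `r` agrees with `full_residual(cols, x, b)` up to rounding, without ever forming a full matrix-vector product.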

Page 30:

PRELIMINARY EXPERIMENTS

Page 31:

L1 Regularized L1 Regression

Dorothea dataset:

Gradient Method

Nesterov’s Accelerated Gradient Method

SPCDM

APPROX

Page 32:

L1 Regularized L1 Regression

Page 33:

L1 Regularized Least Squares (LASSO)

KDDB dataset:

PCDM

APPROX

Page 34:

Training Linear SVMs

Malicious URL dataset:

Page 35:

Choice of Stepsizes:

How (not) to Parallelize Coordinate Descent

Page 36:

Convergence of Randomized Coordinate Descent

• Strongly convex F (Simple Method)

• Smooth or ‘simple’ nonsmooth F (Accelerated Method)

• ‘Difficult’ nonsmooth F (Accelerated Method) or smooth F (Simple Method)

• ‘Difficult’ nonsmooth F (Simple Method)

Focus on n (big data = big n)

Page 37:

Parallelization Dream

Depends on to what extent we can add up individual updates, which depends on the properties of F and the way coordinates are chosen at each iteration

Serial

Parallel

What do we actually get?

WANT

Page 38:

“Naive” parallelization

Do the same thing as before, but for MORE or ALL coordinates & ADD UP the updates

Pages 39-43:

Failure of naive parallelization

[Figure sequence over five slides: from the starting point 0, the two coordinate updates 1a and 1b are added to give iterate 1; from iterate 1, the updates 2a and 2b are added to give iterate 2. The last slide is marked OOPS!]

Page 44:

Idea: averaging updates may help

[Figure: from the starting point 0, the updates 1a and 1b are averaged to give iterate 1, marked SOLVED!]

Page 45:

Averaging can be too conservative

[Figure: from the starting point 0, the updates 1a and 1b are averaged to give iterate 1; the updates 2a and 2b are averaged to give iterate 2; and so on...]

Page 46:

Averaging may be too conservative 2

WANT

BAD!!!

But we wanted:

Page 47:

What to do?

Averaging:

Summation:

Update to coordinate i

i-th unit coordinate vector

Figure out when one can safely use:
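
The trade-off between summation and averaging can be reproduced on two toy quadratics. Both functions and all names below are hypothetical examples chosen for this sketch, not the functions drawn on the slides. Each coordinate update is the exact one-dimensional minimizer computed from the current point.

```python
def coordinate_steps(grad, curv, x):
    """Exact one-dimensional steps h_i = -grad_i / curv_i for a quadratic."""
    return [-gi / ci for gi, ci in zip(grad(x), curv)]

def combine(x, h, mode):
    """Apply all coordinate updates at once: 'sum' adds them, 'average' scales by 1/n."""
    scale = 1.0 if mode == "sum" else 1.0 / len(x)
    return [xi + scale * hi for xi, hi in zip(x, h)]

# Strongly coupled: f(x) = 0.5 * (x1 + x2)^2
def coupled(x):
    return 0.5 * (x[0] + x[1]) ** 2

def coupled_grad(x):
    return [x[0] + x[1], x[0] + x[1]]

# Fully separable: f(x) = 0.5 * (x1 - 1)^2 + 0.5 * (x2 - 1)^2
def separable(x):
    return 0.5 * (x[0] - 1.0) ** 2 + 0.5 * (x[1] - 1.0) ** 2

def separable_grad(x):
    return [x[0] - 1.0, x[1] - 1.0]

CURV = [1.0, 1.0]  # second derivative along each coordinate, for both functions
```

On the coupled function, starting from (1, 1), summation jumps to (−1, −1) with no decrease in f, while averaging lands on a minimizer in one step. On the separable function, starting from (0, 0), summation solves the problem in one step, while averaging only halves the distance to the minimizer at each iteration.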

Page 48:

ESO: Expected Separable Overapproximation

Page 49:

5 Models for f Admitting Small

1. Smooth partially separable f [RT’11b]

2. Nonsmooth max-type f [FR’13]

3. f with ‘bounded Hessian’ [BKBG’11, RT’13a]

Page 50:

5 Models for f Admitting Small

4. Partially separable f with smooth components [NC’13]

5. Partially separable f with block smooth components [FR’13b]

Page 51:

Randomized Parallel Coordinate Descent Method

Random set of coordinates (sampling)

Current iterate

New iterate

i-th unit coordinate vector

Update to i-th coordinate

Page 52:

ESO: Expected Separable Overapproximation

Definition [RT’11b]

1. Separable in h

2. Can minimize in parallel

3. Can compute updates for only the coordinates in the sampled set

Shorthand:

Minimize in h
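
The inequality defining ESO was not captured in the transcript. Assuming the definition used in the parallel coordinate descent paper (arXiv:1212.0873), it reads as follows. The symbols β (a scalar), w (a vector of positive weights) and the sampling Ŝ are assumptions taken from that paper.

```latex
\mathbb{E}\bigl[f(x + h_{[\hat S]})\bigr]
\;\le\;
f(x) + \frac{\mathbb{E}\bigl[|\hat S|\bigr]}{n}
\Bigl(\langle \nabla f(x), h\rangle + \frac{\beta}{2}\,\|h\|_w^2\Bigr),
\qquad
\|h\|_w^2 = \sum_{i=1}^{n} w_i \bigl(h^{(i)}\bigr)^2

% Shorthand:
(f, \hat S) \sim \mathrm{ESO}(\beta, w)
```

The right-hand side is a sum of one-dimensional functions of the h^{(i)}, which is why it can be minimized in h coordinate by coordinate.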

Page 53:

PART II. ADDITIONAL TOPICS

Page 54:

Partial Separability and Doubly Uniform Samplings

Page 55:

Serial uniform sampling

Probability law:

Page 56:

τ-nice sampling

Probability law:

Good for shared memory systems

Page 57:

Doubly uniform sampling

Probability law:

Can model unreliable processors / machines
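
The probability laws on these three slides were not captured. The sketch below shows one way to draw each kind of sampling, assuming the usual definitions: a serial uniform sampling picks a single coordinate uniformly, a τ-nice sampling picks a uniformly random subset of size τ, and a doubly uniform sampling gives equal probability to all subsets of the same size. The function names and the size distribution `q` are assumptions of this sketch.

```python
import random

def serial_uniform(n, rng):
    """One coordinate, each with probability 1/n."""
    return {rng.randrange(n)}

def tau_nice(n, tau, rng):
    """A subset of exactly tau coordinates, all such subsets equally likely."""
    return set(rng.sample(range(n), tau))

def doubly_uniform(n, q, rng):
    """q[k] is the probability that the set has size k (k = 0..n);
    given the size, all subsets of that size are equally likely."""
    size = rng.choices(range(n + 1), weights=q)[0]
    return set(rng.sample(range(n), size))
```

Unreliable processors can be modelled by choosing `q` as, for example, a binomial distribution: each of τ processors independently delivers its update or fails.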

Page 58:

ESO for partially separable functions and doubly uniform samplings

Theorem [RT’11b]

1. Smooth partially separable f [RT’11b]

Page 59:

PCDM: Theoretical Speedup

Much of Big Data is here!

degree of partial separability

# coordinates

# coordinate updates / iter

WEAK OR NO SPEEDUP: Non-separable (dense) problems

LINEAR OR GOOD SPEEDUP: Nearly separable (sparse) problems
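
The speedup formula itself was not captured. Assuming a τ-nice sampling and the ESO parameter β = 1 + (ω − 1)(τ − 1) / max(1, n − 1) from the PCDM analysis (arXiv:1212.0873), the theoretical speedup over the serial method is τ/β. Here ω is the degree of partial separability, n the number of coordinates and τ the number of coordinate updates per iteration. The function names are illustrative.

```python
def eso_beta(n, omega, tau):
    """Assumed ESO parameter for a tau-nice sampling and degree of separability omega."""
    return 1.0 + (omega - 1) * (tau - 1) / max(1, n - 1)

def theoretical_speedup(n, omega, tau):
    """Ratio of serial to parallel iteration counts predicted by the theory."""
    return tau / eso_beta(n, omega, tau)
```

With n = 1000 and τ = 16, a fully separable problem (ω = 1) gives a speedup of 16, whereas a fully dense one (ω = n) gives a speedup of 1, matching the two regimes named on the slide.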

Page 60:

Page 61:

n = 1000 (# coordinates)

Theory

Page 62:

Practice

n = 1000 (# coordinates)

Page 63:

PCDM: Experiment with a 1 billion-by-2 billion LASSO problem

Page 64:

Optimization with Big Data = Extreme* Mountain Climbing

* in a billion dimensional space on a foggy day

Page 65:

Coordinate Updates

Page 66:

Iterations

Page 67:

Wall Time

Page 68:

Distributed-Memory Coordinate Descent

Page 69:

Distributed τ-nice sampling

Probability law:

Machine 1, Machine 2, Machine 3

Good for a distributed version of coordinate descent
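
A minimal sketch of such a sampling, assuming the coordinates are partitioned evenly across the machines and each machine independently draws a τ-nice sampling from its own block. The contiguous partition and the function names are assumptions of this sketch.

```python
import random

def partition(n, c):
    """Split coordinates 0..n-1 into c contiguous blocks of equal size (assumes c divides n)."""
    s = n // c
    return [list(range(k * s, (k + 1) * s)) for k in range(c)]

def distributed_tau_nice(parts, tau, rng):
    """Each machine picks tau of its own coordinates uniformly at random."""
    return [sorted(rng.sample(part, tau)) for part in parts]
```

For example, `distributed_tau_nice(partition(12, 3), 2, random.Random(0))` returns three lists of two coordinates each, one list per machine, so every iteration updates c·τ coordinates in total.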

Page 70:

ESO: Distributed setting

Theorem [RT’13b]

3. f with ‘bounded Hessian’ [BKBG’11, RT’13a]

spectral norm of the data

Page 71:

Bad partitioning at most doubles # of iterations

spectral norm of the partitioning

Theorem [RT’13b]

# nodes

# iterations = implies

# updates/node

Page 72:

Page 73:

LASSO with a 3TB data matrix

128 Cray XE6 nodes with 4 MPI processes (c = 512) Each node: 2 x 16-cores with 32GB RAM

= # coordinates

Page 74:

References: serial coordinate descent

• Shai Shalev-Shwartz and Ambuj Tewari, Stochastic methods for L1-regularized loss minimization. JMLR 2011.

• Yurii Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.

• [RT’11b] P.R. and Martin Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Prog., 2012.

• Rachael Tappenden, P.R. and Jacek Gondzio, Inexact coordinate descent: complexity and preconditioning. arXiv:1304.5530, 2013.

• Ion Necoara, Yurii Nesterov, and Francois Glineur, Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest, 2012.

• Zhaosong Lu and Lin Xiao, On the complexity analysis of randomized block-coordinate descent methods. Technical report, Microsoft Research, 2013.

Page 75:

References: parallel coordinate descent

• [BKBG’11] Joseph Bradley, Aapo Kyrola, Danny Bickson and Carlos Guestrin, Parallel Coordinate Descent for L1-Regularized Loss Minimization. ICML 2011.

• [RT’12] P.R. and Martin Takáč, Parallel coordinate descent methods for big data optimization. arXiv:1212.0873, 2012.

• Martin Takáč, Avleen Bijral, P.R., and Nathan Srebro, Mini-batch primal and dual methods for SVMs. ICML 2013.

• [FR’13a] Olivier Fercoq and P.R., Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv:1309.5885, 2013.

• [RT’13a] P.R. and Martin Takáč, Distributed coordinate descent method for big data learning. arXiv:1310.2059, 2013.

• [RT’13b] P.R. and Martin Takáč, On optimal probabilities in stochastic coordinate descent methods. arXiv:1310.3438, 2013.

Good entry point to the topic (4p paper)

Page 76:

References: parallel coordinate descent

• P.R. and Martin Takáč, Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings 2012.

• Rachael Tappenden, P.R. and Burak Buke, Separable approximations and decomposition methods for the augmented Lagrangian. arXiv:1308.6774, 2013.

• Indranil Palit and Chandan K. Reddy, Scalable and parallel boosting with MapReduce. IEEE Transactions on Knowledge and Data Engineering, 24(10):1904-1916, 2012.

• [FR’13b] Olivier Fercoq and P.R., Accelerated, Parallel and Proximal coordinate descent. arXiv:1312.5799, 2013.