
Peter Richtárik (joint work with Martin Takáč)

Distributed Coordinate Descent Method

AmpLab All Hands Meeting - Berkeley - October 29, 2013

Randomized Coordinate Descent in 2D

2D Optimization

Goal: find the minimizer.

[Figure: contours of a function in 2D; the minimizer lies at the center of the contours.]

Randomized Coordinate Descent in 2D

[Figure sequence: starting from an initial point, each iteration picks one of the two coordinate directions at random (east-west or north-south) and moves to the minimizer along that line; after steps 1 through 7 the iterates reach the minimizer. SOLVED!]
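A minimal sketch of the iteration pictured above, for a simple smooth function in 2D; the test problem, the 1/L_i step size, and the fixed iteration count are illustrative assumptions, not taken from the talk:

```python
import numpy as np

def randomized_cd_2d(grad, lipschitz, x0, iters=50, seed=0):
    """Randomized coordinate descent: at every step pick one coordinate
    uniformly at random and take a 1/L_i gradient step along it."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    L = np.asarray(lipschitz, dtype=float)
    for _ in range(iters):
        i = rng.integers(x.size)       # random coordinate: an E/W or N/S move
        x[i] -= grad(x)[i] / L[i]      # exact line minimizer for a separable quadratic
    return x

# Illustrative problem: f(x) = 0.5*(a1*x1^2 + a2*x2^2); the minimizer is the origin.
a = np.array([1.0, 4.0])
print(randomized_cd_2d(lambda x: a * x, lipschitz=a, x0=[3.0, -2.0]))
```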

Convergence of Randomized Coordinate Descent

Regimes: strongly convex f; smooth or 'simple' nonsmooth f; 'difficult' nonsmooth f.

Focus on the dependence on d (big data = big d).

Parallelization Dream

Serial vs. Parallel

In reality we get something in between.

How (Not) to Parallelize Coordinate Descent

"Naive" parallelization: do the same thing as before, but with more (or all) coordinates, and simply add up the updates (see the sketch below).
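A sketch of this naive scheme in the same illustrative setting: each of the τ selected coordinates is updated as if it were the only one changing, and the updates are simply summed. The toy problem below is chosen to reproduce the oscillation shown on the next slides:

```python
import numpy as np

def naive_parallel_cd(grad, lipschitz, x0, tau, iters=20, seed=0):
    """'Naive' parallel coordinate descent: pick tau coordinates, compute each
    update exactly as in the serial method, and add all the updates up at once."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    L = np.asarray(lipschitz, dtype=float)
    for _ in range(iters):
        S = rng.choice(x.size, size=tau, replace=False)  # coordinates updated in parallel
        g = grad(x)                                      # all updates use the same old iterate
        for i in S:
            x[i] -= g[i] / L[i]                          # each update ignores the others
    return x

# f(x) = (x1 + x2)^2: the summed step flips the sign of (x1 + x2), so the
# iterates bounce between two points forever instead of converging (OOPS!).
grad = lambda x: 2.0 * (x[0] + x[1]) * np.ones(2)
print(naive_parallel_cd(grad, lipschitz=[2.0, 2.0], x0=[1.0, 0.0], tau=2))
```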

Failure of naive parallelization

[Figure sequence: from the point 0, the two per-coordinate updates (1a, 1b) are each computed as if the other coordinate stayed fixed; summing them overshoots to point 1, the next summed step (2a + 2b) jumps back to point 2, and the iterates keep oscillating without ever converging. OOPS!]

Idea: averaging updates may help

[Figure: the averaged update lands at the minimizer. SOLVED!]

Averaging can be too conservative

[Figure sequence: with averaging, steps 1, 2, ... from the point 0 make only tiny progress toward the minimizer, and so on; the averaged step (BAD!!!) is much shorter than the step we actually want (WANT).]
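For intuition, the same toy setup with the summed step replaced by its average; this is only an illustration of why averaging is safe but conservative, not the method proposed in the talk:

```python
import numpy as np

def averaged_parallel_step(x, grad, lipschitz, tau):
    """One step where the tau summed updates are replaced by their average."""
    g = grad(x)
    return x - (1.0 / tau) * g / np.asarray(lipschitz, dtype=float)

# On f(x) = (x1 + x2)^2 the averaged step lands exactly on the minimizer (SOLVED!),
# but on a separable problem like f(x) = 0.5*||x||^2 every step is tau times
# shorter than it needs to be (BAD!!!).
grad = lambda x: 2.0 * (x[0] + x[1]) * np.ones(2)
print(averaged_parallel_step(np.array([1.0, 0.0]), grad, [2.0, 2.0], tau=2))  # -> [0.5, -0.5]
```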

Minimizing Regularized Loss

Loss: convex (smooth).

Regularizer: convex (smooth or nonsmooth), separable, and allowed to take the value +∞ (so constraints can be encoded).
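In symbols, the problem on this slide is the composite form used throughout the Richtárik-Takáč line of work; the notation below (F, f, Ω, Ω_i) is mine, chosen to match the labels above:

```latex
\min_{x \in \mathbb{R}^d} \; F(x) \;:=\;
\underbrace{f(x)}_{\text{loss: convex, smooth}}
\;+\;
\underbrace{\Omega(x)}_{\text{regularizer: convex, separable}},
\qquad
\Omega(x) \;=\; \sum_{i=1}^{d} \Omega_i(x_i),
\qquad
\Omega_i : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}.
```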

Regularizer: examples

- No regularizer
- Weighted L1 norm (e.g., LASSO)
- Weighted L2 norm
- Box constraints (e.g., SVM dual)
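As formulas, the listed regularizers are typically instantiated as follows; the weights λ_i and the box [l_i, u_i] are generic placeholders, not values from the talk:

```latex
\Omega(x) = 0, \qquad
\Omega(x) = \sum_i \lambda_i |x_i| \ \ (\text{weighted } \ell_1,\ \text{e.g., LASSO}), \qquad
\Omega(x) = \tfrac{1}{2}\sum_i \lambda_i x_i^2 \ \ (\text{weighted } \ell_2), \qquad
\Omega(x) = \sum_i \mathbb{I}_{[l_i,\,u_i]}(x_i) \ \ (\text{box constraints, e.g., SVM dual}).
```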

Structure of f

Considered in [BKBG, ICML 2011]

Loss: examples

- Quadratic loss
- L-infinity
- L1 regression
- Exponential loss
- Logistic loss
- Square hinge loss

References: BKBG'11, RT'11b, TBRS'13, RT'13a, FR'13
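For concreteness, typical forms of a few of the listed losses, written for data pairs (a_j, b_j); these conventions are standard but not copied from the slides:

```latex
f(x) = \tfrac{1}{2}\sum_j \bigl(a_j^\top x - b_j\bigr)^2 \ \ (\text{quadratic}), \qquad
f(x) = \sum_j \log\bigl(1 + e^{-b_j a_j^\top x}\bigr) \ \ (\text{logistic}), \qquad
f(x) = \sum_j \max\bigl(0,\, 1 - b_j a_j^\top x\bigr)^2 \ \ (\text{square hinge}).
```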

Distributed Coordinate Descent Method

I. Distribution of Data

d = # features / variables / coordinates. The data matrix is partitioned across the nodes by coordinate (column) blocks.
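A minimal sketch of the data distribution step described here, assuming the coordinates (columns of the data matrix) are split evenly and at random across the c nodes; the NumPy representation and function name are illustrative:

```python
import numpy as np

def partition_columns(A, c, seed=0):
    """Split the d columns (coordinates) of the data matrix A across c nodes."""
    d = A.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(d)                     # random, roughly balanced partition
    blocks = np.array_split(perm, c)              # coordinate indices owned by each node
    return [(idx, A[:, idx]) for idx in blocks]   # each node stores only its own columns
```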

II. Choice of Coordinates

A random set of coordinates (a 'sampling') is chosen at each iteration.

III. Computing Updates to Selected Coordinates

Given the random set of coordinates, each selected coordinate i receives an update, taking the current iterate to the new iterate. All nodes need to be able to compute this update (communication); see the sketch below.
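A schematic of one round as described on this slide, simulating the c nodes in a single process: each node samples τ coordinates from its own block, computes per-coordinate updates, and the nodes then exchange only the change in the residual so that everyone can compute the next updates. The quadratic loss without regularizer, the 1/(β·wᵢ) style step, and the summation standing in for an all-reduce are assumptions for illustration, not the exact Hydra formulas; `node_blocks` is the output of the `partition_columns` sketch above:

```python
import numpy as np

def distributed_cd_round(x, residual, A, node_blocks, tau, beta_w, rng):
    """One synchronous round: every node updates tau coordinates from its own block.
    `residual` = A @ x - b is the shared state that all nodes keep in sync."""
    delta_residual = np.zeros_like(residual)
    for idx, _ in node_blocks:                        # each node works independently
        S = rng.choice(idx, size=tau, replace=False)  # tau random local coordinates
        for i in S:
            g_i = A[:, i] @ residual                  # partial derivative of 0.5*||Ax - b||^2
            delta = -g_i / beta_w[i]                  # damped step, 1/(beta * w_i) style
            x[i] += delta
            delta_residual += delta * A[:, i]         # this node's contribution to A @ x
    residual += delta_residual                        # "all-reduce": broadcast the change
    return x, residual
```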

Iteration Complexity

Theorem [RT'13]: a bound on the number of iterations that implies an accuracy guarantee. Quantities appearing in the bound:

- # coordinates
- # nodes
- # coordinates updated / node
- strong convexity constant of the loss f
- strong convexity constant of the regularizer

Theorem [RT'13]: the bound also depends on the spectral norm of the "partitioning"; bad partitioning at most doubles the # of iterations.
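Schematically, and only to show which quantities enter where, the bound on this slide has roughly the following shape; the exact constants and the precise probabilistic statement are in [RT'13]:

```latex
k \;\gtrsim\; \frac{d}{c\,\tau}\cdot\frac{\beta}{\mu_f + \mu_\Omega}\cdot\log\frac{1}{\epsilon}
\quad\Longrightarrow\quad
F(x_k) - F^\ast \le \epsilon \ \text{with high probability},
```

where d = # coordinates, c = # nodes, τ = # coordinates updated per node, μ_f and μ_Ω are the strong convexity constants of the loss and the regularizer, and β depends on the spectral norm of the "partitioning" (bad partitioning at most doubles it).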

Experiment 1

1 node (c = 1). LASSO problem: n = 2 billion, d = 1 billion.

[Plots: progress measured in coordinate updates, iterations, and wall time.]

Experiment 2

128 nodes (c = 512, 4096 cores). LASSO problem: n = 1 billion, d = 0.5 billion, data size = 3 TB.

[Plot: LASSO with 3 TB of data on 128 nodes.]