Peter Richtárik
Parallel Coordinate Descent Methods
NIPS 2013, Lake Tahoe
Slide 2
OUTLINE
I. Introduction
II. Regularized Convex Optimization
III. Parallel CD (PCDM)
IV. Accelerated Parallel CD (APCDM)
V. ESO? Good ESO?
VI. Experiments in 10^9 dimensions
VII. Distributed CD (Hydra)
VIII. Nonuniform Parallel CD (NSync)
IX. Mini-batch SDCA for SVMs
POSTER
Slide 3
Part I. Introduction
Slide 4
What is Randomized Coordinate Descent?
Slide 5
2D Optimization
Goal: find the minimizer.
[Figure: contours of a function]
Slide 6
Randomized Coordinate Descent in 2D
[Figure, animated over Slides 6-13: from the starting point, iterates 1-7 each move along one of the coordinate directions N, S, E, W, until the minimizer is reached. SOLVED!]
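For concreteness, a minimal Python sketch of the 2D walk the animation depicts; the quadratic objective, starting point, and step rule are illustrative choices, not taken from the slides.

```python
import numpy as np

# Illustrative 2D quadratic: F(x) = 0.5 * x^T A x - b^T x (not the function drawn on the slides).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

rng = np.random.default_rng(0)
x = np.array([2.0, -2.0])       # starting point "0"

for k in range(20):
    i = rng.integers(2)                      # pick one of the two coordinates uniformly at random
    x[i] -= (A[i] @ x - b[i]) / A[i, i]      # exact minimization along coordinate i (a move N, S, E or W)

print(x, np.linalg.solve(A, b))              # final iterate vs. the true minimizer
```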
Slide 14
Convergence of Randomized Coordinate Descent
[Table of iteration complexities for three problem classes: strongly convex F; smooth or "simple" nonsmooth F; "difficult" nonsmooth F.]
Focus on n (big data = big n)
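For orientation, the standard serial complexity bounds from the coordinate-descent literature, stated here from memory (the slide's table may use different constants and log factors):

\[
\text{strongly convex } F:\; O\!\big(n\log\tfrac{1}{\epsilon}\big),\qquad
\text{smooth or simple nonsmooth } F:\; O\!\big(\tfrac{n}{\epsilon}\big),\qquad
\text{difficult nonsmooth } F:\; O\!\big(\tfrac{n}{\epsilon^{2}}\big).
\]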
Slide 15
Parallelization Dream
Serial vs. Parallel: what we WANT is for a parallel iteration that updates many coordinates to make correspondingly more progress than a serial one. Whether the individual updates can simply be added up depends on the properties of F and on the way coordinates are chosen at each iteration. What do we actually get?
Slide 16
How (not) to Parallelize Coordinate Descent
Slide 17
Naive parallelization: do the same thing as before, but for MORE (or ALL) coordinates, and ADD UP the updates.
Slides 18-22
Failure of naive parallelization
[Animation: from the point 0, the two coordinate-wise updates 1a and 1b are each fine on their own, but adding them up overshoots to point 1; repeating this (2a, 2b, then 2) makes the iterates oscillate instead of converge. OOPS!]

Slide 23
Idea: averaging the updates may help. [On the same example, the averaged step 1 lands at the minimizer. SOLVED!]

Slides 24-25
Averaging can be too conservative
[Animation: with averaging, each iteration (1, 2, ...) covers only a fraction of the remaining distance, and so on... We end up far short of the step we WANTED. BAD!]
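A tiny numeric illustration of the two failure modes above; the functions and points are chosen here for illustration and do not come from the slides.

```python
import numpy as np

# 1) Naive summation can overshoot. For f(x) = 0.5*(x1 + x2)^2, the coordinate-wise
#    optimal updates at x = (1, 1) are h = (-2, -2).
x = np.array([1.0, 1.0])
h = np.array([-2.0, -2.0])
x_sum = x + h        # (-1, -1): f is unchanged, and the iterates keep oscillating ("OOPS!")
x_avg = x + h / 2    # ( 0,  0): the minimizer; averaging fixes this example ("SOLVED!")

# 2) Averaging can be too conservative. For the separable f(x) = 0.5*(x1^2 + x2^2),
#    the coordinate-wise optimal updates at (1, 1) are h = (-1, -1); summation would
#    solve the problem in one step, but averaging only ever moves half-way.
y = np.array([1.0, 1.0])
y_avg = y + np.array([-1.0, -1.0]) / 2   # (0.5, 0.5), then (0.25, 0.25), and so on...
```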
Slide 26
What to do?
[Formulas for the two update rules, Averaging vs. Summation (written out below), where h^(i) is the update to coordinate i and e_i is the i-th unit coordinate vector.]
Goal: figure out when one can safely use the summation rule.
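The two candidate rules, written out (a reconstruction from the slide's labels, with Ŝ the sampled set of coordinates):

\[
\text{Averaging:}\quad x^{+} = x + \frac{1}{|\hat S|}\sum_{i\in \hat S} h^{(i)} e_i,
\qquad
\text{Summation:}\quad x^{+} = x + \sum_{i\in \hat S} h^{(i)} e_i .
\]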
Slide 27
Part II. Regularized Convex Optimization
Slide 28
Problem: minimize Loss + Regularizer.
Loss: convex (smooth or nonsmooth).
Regularizer: convex (smooth or nonsmooth), separable; infinite values are allowed (to encode constraints).
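In symbols (a reconstruction of the standard setup of [RT11b]; the slide itself showed the formula):

\[
\min_{x\in\mathbb{R}^n} \; F(x) = f(x) + \Omega(x),
\qquad
\Omega(x) = \sum_{i=1}^{n} \Omega_i(x_i),
\]

with f convex (the loss) and each \Omega_i : \mathbb{R}\to\mathbb{R}\cup\{+\infty\} convex (the separable regularizer).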
Loss examples: quadratic loss, L1 regression, L-infinity regression, exponential loss, logistic loss, square hinge loss. [BKBG11, RT11b, TBRS13, RT13a, FR13a]
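Two of the listed losses written out in their standard form (my notation, not copied from the slides), for a data matrix A with rows a_j and labels b, y:

\[
\text{quadratic loss: } f(x) = \tfrac{1}{2}\|Ax-b\|_2^2,
\qquad
\text{logistic loss: } f(x) = \sum_j \log\big(1+e^{-y_j a_j^{\top} x}\big).
\]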
Slide 31
3 models for f with small [ESO parameter]:
1. Smooth partially separable f [RT11b]
2. Nonsmooth max-type f [FR13a]
3. f with bounded Hessian [BKBG11, RT13a]
Slide 32
Part III. Parallel Coordinate Descent
P.R. and Martin Takáč, Parallel Coordinate Descent Methods for Big Data Optimization. arXiv:1212.0873, 2012. [IMA Leslie Fox Prize 2013]
Slide 33
Randomized Parallel Coordinate Descent Method
[Iteration formula (written out below), with labels: random set of coordinates (sampling) Ŝ_k, current iterate x_k, new iterate x_{k+1}, i-th unit coordinate vector e_i, update h^(i) to the i-th coordinate.]
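Putting the labels together, the update has the form (a reconstruction from the slide's labels)

\[
x_{k+1} \;=\; x_k + \sum_{i\in \hat S_k} h^{(i)} e_i ,
\]

where Ŝ_k is the random set of coordinates sampled at iteration k. Below is a minimal Python sketch of one such step for a smooth loss and no regularizer (with a regularizer, each coordinate update becomes a one-dimensional prox step); beta and w are the ESO parameters defined on the next slide.

```python
import numpy as np

def pcdm_step(x, partial_grad, beta, w, tau, rng):
    """One parallel coordinate descent step (sketch): sample a tau-nice set S,
    compute every coordinate's update independently, then apply them all at once."""
    n = len(x)
    S = rng.choice(n, size=tau, replace=False)      # tau-nice sampling
    x_new = x.copy()
    for i in S:                                     # in a real implementation this loop runs in parallel
        h_i = -partial_grad(x, i) / (beta * w[i])   # step derived from the ESO parameters (beta, w)
        x_new[i] += h_i                             # x_{k+1} = x_k + sum_{i in S} h_i * e_i
    return x_new
```

For a quadratic loss f(x) = 0.5*||Ax - b||^2, for instance, partial_grad(x, i) is a_i^T (Ax - b) with a_i the i-th column of A, and w_i = ||a_i||^2 is the natural choice of weight.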
Slide 34
ESO: Expected Separable Overapproximation
Definition [RT11b]: an upper bound, in expectation over the sampling Ŝ, on f(x + h_[Ŝ]) by a function that is separable in h and is then minimized in h (see below). Why this is useful:
1. Separable in h
2. Can minimize in parallel
3. Can compute updates for i ∈ Ŝ only
Shorthand: (f, Ŝ) ~ ESO(β, w)
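Written out, the ESO inequality for a uniform sampling Ŝ has the form (stated from memory of [RT11b]; see the paper for the precise assumptions):

\[
\mathbf{E}\left[f\big(x + h_{[\hat S]}\big)\right]
\;\le\;
f(x) + \frac{\mathbf{E}|\hat S|}{n}\left(\langle \nabla f(x), h\rangle + \frac{\beta}{2}\|h\|_w^2\right),
\qquad
h_{[\hat S]} = \sum_{i\in\hat S} h^{(i)} e_i,\quad
\|h\|_w^2 = \sum_{i=1}^{n} w_i \big(h^{(i)}\big)^2 .
\]

The right-hand side is separable in h, which is exactly what makes points 1-3 above possible.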
Slide 35
Convergence rate: convex f
Theorem [RT11b]. [Bound of the form "# iterations ≥ ... implies ..." (schematic form below), in terms of: τ = average # of updated coordinates per iteration, n = # of coordinates, β = stepsize parameter, ε = error tolerance.]
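Schematically (from memory of [RT11b]; constants depending on the initial point and the confidence level ρ are absorbed into c), the bound reads

\[
k \;\ge\; c\,\frac{n\beta}{\tau\,\epsilon}
\quad\Longrightarrow\quad
\mathbb{P}\big(F(x_k)-F^{*} \le \epsilon\big) \ge 1-\rho ,
\]

so the leading term scales as nβ/(τε): doubling τ roughly halves the number of iterations, provided β stays small.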
Slide 36
Convergence rate: strongly convex f
Theorem [RT11b]. [Bound of the form "# iterations ≥ ... implies ..." (schematic form below), in terms of the strong convexity constant of the loss f and the strong convexity constant of the regularizer.]
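Schematically again (from memory of [RT11b]), with μ_f and μ_Ω the strong convexity constants of the loss and the regularizer:

\[
k \;\ge\; \frac{n}{\tau}\,\frac{\beta + \mu_\Omega}{\mu_f + \mu_\Omega}\,
\log\!\left(\frac{F(x_0)-F^{*}}{\epsilon\rho}\right)
\quad\Longrightarrow\quad
\mathbb{P}\big(F(x_k)-F^{*} \le \epsilon\big) \ge 1-\rho .
\]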
Slide 37
Part IV. Accelerated Parallel CD
Olivier Fercoq and P.R., Accelerated, Parallel and Proximal Coordinate Descent. Manuscript, 2013.
Slide 38
3 x YES (accelerated, parallel and proximal, all at once)
Slide 39
The Algorithm
[Pseudocode of the accelerated method; each iteration takes a parallel CD step using the ESO parameters.]
Slide 40
Complexity
Theorem [FR13b]. [Bound of the form "# iterations ≥ ... implies ..." (schematic form below), in terms of: τ = average # of updated coordinates per iteration, n = # of coordinates, ε = error tolerance.]
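The headline of [FR13b] is that acceleration improves the dependence on the error tolerance from 1/ε to 1/√ε; schematically, the iteration count scales as

\[
k \;\sim\; \frac{n}{\tau}\sqrt{\frac{c}{\epsilon}} ,
\]

with c a constant depending on the initial point (exact constants as in the paper).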
Slide 41
Part V. ESO? Good ESO? (e.g., partially separable f &
doubly uniform S)
Slide 42
Serial uniform sampling
Probability law: P(Ŝ = {i}) = 1/n for every coordinate i.
Slide 43
τ-nice sampling
Probability law: Ŝ is a uniformly random subset of size τ, i.e. every subset of cardinality τ is equally likely.
Good for shared-memory systems.
Slide 44
Doubly uniform sampling
Probability law: P(Ŝ = S) depends only on the cardinality |S|.
Can model unreliable processors / machines.
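A small Python sketch of the two samplings just described; the binomial/independent sampling is one concrete example of a doubly uniform sampling, chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                    # number of coordinates (illustrative)

def tau_nice(tau):
    # tau-nice sampling: every subset of size tau is equally likely
    return rng.choice(n, size=tau, replace=False)

def binomial_sampling(tau):
    # Each coordinate is included independently with probability tau/n, so the
    # probability of a set depends only on its cardinality: a doubly uniform
    # sampling. The random set size can model processors that fail to report.
    return np.flatnonzero(rng.random(n) < tau / n)
```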
Slide 45
ESO for partially separable functions and doubly uniform samplings
(Model 1: smooth partially separable f [RT11b])
Theorem [RT11b]. [ESO parameters for this setting, in terms of the degree of partial separability of f; special case below.]
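For the τ-nice sampling this ESO parameter takes a particularly clean form (quoted from memory of [RT11b]/[RT12]), where ω is the degree of partial separability of f:

\[
\beta \;=\; 1 + \frac{(\omega-1)(\tau-1)}{\max(1,\,n-1)} .
\]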
Slide 46
Theoretical speedup
[Plot: speedup as a function of ω = degree of partial separability, n = # coordinates, τ = # coordinate updates / iteration.]
LINEAR OR GOOD SPEEDUP: nearly separable (sparse) problems (much of Big Data is here!)
WEAK OR NO SPEEDUP: non-separable (dense) problems
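This is where the speedup comes from: the gain from updating τ coordinates per iteration is roughly τ/β, so with the τ-nice ESO parameter above,

\[
\text{speedup} \;\approx\; \frac{\tau}{1 + \frac{(\omega-1)(\tau-1)}{n-1}} ,
\]

which is close to τ (linear speedup) when ω is much smaller than n (nearly separable, sparse data) and close to 1 (no speedup) when ω is close to n (dense data).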
Slide 47
Slide 48
Theory: n = 1000 (# coordinates) [theoretical speedup plot]
Slide 49
Practice: n = 1000 (# coordinates) [measured speedup plot]
Slide 50
Part VI. Experiment with a 1 billion-by-2 billion LASSO
problem
Slide 51
Optimization with Big Data = Extreme* Mountain Climbing
(* in a billion dimensional space, on a foggy day)
Slide 52
Coordinate Updates
Slide 53
Iterations
Slide 54
Wall Time
Slide 55
L2-reg logistic loss on rcv1.binary
Slide 56
Part VII. Distributed CD
P.R. and Martin Takáč, Distributed Coordinate Descent Method for Learning with Big Data. arXiv:1310.2059, 2013.
Slide 57
Distributed τ-nice sampling
Probability law: the coordinates are partitioned across machines (Machine 1, Machine 2, Machine 3, ...); each machine independently samples τ of its own coordinates, uniformly at random.
Good for a distributed version of coordinate descent.
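A minimal Python sketch of this distributed sampling; the partition, sizes and parameters are illustrative, not those of the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, tau = 12, 3, 2                               # coordinates, machines, updates per machine (illustrative)
partition = np.array_split(rng.permutation(n), c)  # each machine owns one block of coordinates

def distributed_tau_nice():
    # each machine independently picks tau of its own coordinates, uniformly at random
    return [rng.choice(block, size=tau, replace=False) for block in partition]
```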
Slide 58
ESO: Distributed setting
(Model 3: f with bounded Hessian [BKBG11, RT13a])
Theorem [RT13b]. [ESO parameters in terms of the spectral norm of the data.]
Slide 59
Bad partitioning at most doubles the # of iterations
Theorem [RT13b]. [Bound of the form "# iterations = ... implies ...", in terms of: the spectral norm of the partitioning, the # of nodes, and the # of updates per node.]
Slide 60
Slide 61
LASSO with a 3TB data matrix
128 Cray XE6 nodes with 4 MPI processes each (c = 512)
Each node: 2 x 16 cores with 32GB RAM
= # coordinates
Slide 62
References
Slide 63
References: serial coordinate descent

Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for L1-regularized loss minimization. JMLR, 2011.
Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
[RT11b] P.R. and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2012.
Rachael Tappenden, P.R. and Jacek Gondzio. Inexact coordinate descent: complexity and preconditioning. arXiv:1304.5530, 2013.
Ion Necoara, Yurii Nesterov, and François Glineur. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest, 2012.
Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Technical report, Microsoft Research, 2013.
Slide 64
References: parallel coordinate descent

[BKBG11] Joseph Bradley, Aapo Kyrola, Danny Bickson and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization. ICML 2011.
[RT12] P.R. and Martin Takáč. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873, 2012.
[TBRS13] Martin Takáč, Avleen Bijral, P.R., and Nathan Srebro. Mini-batch primal and dual methods for SVMs. ICML 2013.
[FR13a] Olivier Fercoq and P.R. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv:1309.5885, 2013.
[RT13a] P.R. and Martin Takáč. Distributed coordinate descent method for big data learning. arXiv:1310.2059, 2013.
[RT13b] P.R. and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. arXiv:1310.3438, 2013.
Good entry point to the topic (4p paper)
Slide 65
References: parallel coordinate descent

P.R. and Martin Takáč. Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings, 2012.
[FR13b] Olivier Fercoq and P.R. Accelerated, Parallel and Proximal Coordinate Descent. Manuscript, 2013.
Rachael Tappenden, P.R. and Burak Büke. Separable approximations and decomposition methods for the augmented Lagrangian. arXiv:1308.6774, 2013.
Indranil Palit and Chandan K. Reddy. Scalable and parallel boosting with MapReduce. IEEE Transactions on Knowledge and Data Engineering, 24(10):1904-1916, 2012.
Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. NIPS 2013.