Accelerated, Parallel and PROXimal coordinate descent
Moscow, February 2014
APPROX
Peter Richtárik
(joint work with Olivier Fercoq; arXiv:1312.5799)
Optimization Problem
Problem: minimize $F(x) = f(x) + \psi(x)$ over $x \in \mathbb{R}^N$
Loss $f$: convex (smooth or nonsmooth)
Regularizer $\psi$: convex (smooth or nonsmooth), separable; allowed to take the value $+\infty$, so that constraints can be modeled
Regularizer: examples
• No regularizer
• Weighted L1 norm (e.g., LASSO)
• Weighted L2 norm
• Box constraints (e.g., SVM dual)
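For concreteness, these can be written as follows (standard textbook forms, supplied here for orientation rather than transcribed from the slide images; the $\lambda_i$ are nonnegative weights):

    % standard separable regularizers: psi(x) = sum_i psi_i(x_i)
    \psi(x) = 0                                          % no regularizer
    \psi(x) = \sum_{i=1}^n \lambda_i |x_i|               % weighted L1 norm (LASSO)
    \psi(x) = \tfrac{1}{2} \sum_{i=1}^n \lambda_i x_i^2  % weighted L2 norm
    \psi(x) = \sum_{i=1}^n I_{[a_i, b_i]}(x_i)           % box constraints (e.g., SVM dual)

Each is separable, i.e., a sum of functions of the individual coordinates, which is exactly what the problem formulation above requires.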
Loss: examples
• Quadratic loss
• L-infinity regression
• L1 regression
• Exponential loss
• Logistic loss
• Square hinge loss
(References: BKBG'11, RT'11b, TBRS'13, RT'13a, FR'13)
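One common form of each named loss, for a data matrix $A$ with rows $a_j^\top$, targets $b_j$, and labels $y_j \in \{-1, +1\}$ (standard definitions supplied for orientation, not recovered from the slides):

    f(x) = \tfrac{1}{2} \|Ax - b\|_2^2                               % quadratic loss
    f(x) = \|Ax - b\|_\infty                                         % L-infinity regression
    f(x) = \|Ax - b\|_1                                              % L1 regression
    f(x) = \sum_j \exp(-y_j a_j^\top x)                              % exponential loss
    f(x) = \sum_j \log\big(1 + \exp(-y_j a_j^\top x)\big)            % logistic loss
    f(x) = \tfrac{1}{2} \sum_j \max\big(0,\, 1 - y_j a_j^\top x\big)^2  % square hinge loss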
RANDOMIZED COORDINATE DESCENT IN 2D
2D Optimization
Goal: find the minimizer of a function of two variables, shown via its contour lines.
[Figure sequence: Randomized Coordinate Descent in 2D. Starting from an initial point, each iteration picks one of the two axis directions at random (north-south or east-west) and moves to the minimizer of the function along that line. Steps 1 through 7 zigzag toward the center of the contours until the problem is SOLVED!]
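A minimal sketch of this procedure in code, assuming an illustrative convex quadratic in place of the slide's unnamed contour function (the matrix A and vector b below are made up for the example):

    import numpy as np

    # f(x) = 0.5 * x^T A x - b^T x, a convex quadratic standing in for the
    # contour plot's function; A must be symmetric positive definite.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])

    def rcd_2d(x, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(iters):
            i = rng.integers(2)          # pick one of the two coordinates at random
            g_i = A[i] @ x - b[i]        # partial derivative of f at x
            x[i] -= g_i / A[i, i]        # exact minimization along coordinate i
        return x

    print(rcd_2d(np.zeros(2)), np.linalg.solve(A, b))  # iterate vs. true minimizer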
CONTRIBUTIONS
Variants of Randomized Coordinate Descent Methods
• Block: can operate on "blocks" of coordinates, as opposed to just individual coordinates
• General: applies to "general" (= smooth convex) functions, as opposed to special ones such as quadratics
• Proximal: admits a nonsmooth regularizer that is kept intact when solving the subproblems; the regularizer is neither smoothed nor approximated
• Parallel: operates on multiple blocks / coordinates in parallel, as opposed to just one block / coordinate at a time
• Accelerated: achieves the O(1/k^2) convergence rate for convex functions, as opposed to O(1/k)
• Efficient: the complexity of one iteration is O(1) per processor on sparse problems, as opposed to O(# coordinates); avoids adding two full vectors
Brief History of Randomized Coordinate Descent Methods
[Table of prior methods, classified by whether they are "PROXIMAL", "PARALLEL", and/or "ACCELERATED". APPROX is all three at once (hence the name: Accelerated Parallel PROXimal), and additionally uses new, longer stepsizes.]
PCDM (R. & Takáč, 2012) = APPROX if we switch acceleration off (i.e., force the momentum parameter to remain constant).
APPROX: Smooth Case
Update for coordinate $i$: $z_{k+1}^{(i)} = z_k^{(i)} - \frac{\tau}{n \theta_k v_i} \nabla_i f(y_k)$, where $\nabla_i f$ is the partial derivative of $f$.
We want the stepsize $\frac{\tau}{n \theta_k v_i}$ to be as large as possible, i.e., the weights $v_i$ to be as small as possible.
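A sketch of the resulting method for the smooth quadratic loss f(x) = 0.5||Ax - b||^2, following the update structure above and in [FR'13b]. The ESO weights v and the dense residual recomputation are simplifications flagged in the comments, so this illustrates the iteration rather than reproducing the paper's efficient implementation:

    import numpy as np

    def approx_smooth(A, b, iters=2000, tau=2, seed=0):
        # Sketch of APPROX, smooth case (psi = 0), for f(x) = 0.5*||Ax - b||^2.
        rng = np.random.default_rng(seed)
        m, n = A.shape
        v = tau * (A ** 2).sum(axis=0)   # deliberately conservative ESO weights
        x = np.zeros(n)
        z = x.copy()
        theta = tau / n
        for _ in range(iters):
            y = (1 - theta) * x + theta * z
            S = rng.choice(n, size=tau, replace=False)   # tau-nice sampling
            r = A @ y - b                # residual at y (recomputed; not efficient)
            dz = np.zeros(n)
            for i in S:
                g = A[:, i] @ r          # partial derivative of f at y
                dz[i] = -g * tau / (n * theta * v[i])
            z = z + dz
            x = y + (n * theta / tau) * dz
            theta = 0.5 * (np.sqrt(theta ** 4 + 4 * theta ** 2) - theta ** 2)
        return x

    A = np.random.default_rng(1).standard_normal((20, 10))
    b = np.ones(20)
    x = approx_smooth(A, b)
    print(0.5 * np.linalg.norm(A @ x - b) ** 2)  # approaches the least-squares optimum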
CONVERGENCE RATE
Convergence Rate
Theorem [FR'13b]: let $\tau$ = average # coordinates updated per iteration, $n$ = # coordinates, $k$ = # iterations. Then
$k \ge \frac{2n}{\tau} \sqrt{\frac{C}{\epsilon}}$ implies $\mathbb{E}[F(x_k) - F^*] \le \epsilon$,
where $C$ depends on the initial gap $F(x_0) - F^*$ and the initial distance $\|x_0 - x^*\|_v$.
Key assumption: the ESO inequality (defined below).
Special Case: Fully Parallel Variant ($\tau = n$: all coordinates are updated in each iteration)
$k \ge 2 \sqrt{\frac{C}{\epsilon}}$ implies $\mathbb{E}[F(x_k) - F^*] \le \epsilon$, with the weights $v_i$ normalized so that they sum to $n$.
Special Case: Effect of New Stepsizes
With the new stepsizes (described later), the constant in the bound involves $\bar\omega$, the average degree of partial separability, and an "average" $\bar L$ of the coordinate Lipschitz constants, rather than their worst-case counterparts.
"EFFICIENCY" OF APPROX
Cost of One Iteration of APPROX
Assume N = n (all blocks are of size 1), that the loss is built from scalar functions (each derivative evaluation is O(1)), and that the data matrix A is sparse. Then the average cost of one iteration of APPROX is O(τ c̄) arithmetic ops, where c̄ = average # nonzeros in a column of A and τ = # coordinates updated per iteration.
Bottleneck: Computation of Partial Derivatives
Key trick: suitable residual vectors (products of A with the iterates) are maintained throughout, so each partial derivative costs only O(# nonzeros in one column of A) rather than a full pass over the data.
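A sketch of the residual-maintenance idea for the quadratic loss f(x) = 0.5||Ax - b||^2 with a sparse matrix in CSC format. Serial coordinate descent is shown for simplicity (APPROX maintains analogous vectors for its two sequences), and the loss and data are illustrative assumptions:

    import numpy as np
    import scipy.sparse as sp

    def cd_with_residual(A, b, iters=5000, seed=0):
        # Serial coordinate descent for f(x) = 0.5*||Ax - b||^2, maintaining
        # the residual r = Ax - b so each step touches only one column of A.
        A = sp.csc_matrix(A)
        rng = np.random.default_rng(seed)
        n = A.shape[1]
        L = np.asarray(A.multiply(A).sum(axis=0)).ravel()  # L_i = ||A e_i||^2
        x = np.zeros(n)
        r = -b.astype(float)                 # residual at x = 0
        for _ in range(iters):
            i = int(rng.integers(n))
            if L[i] == 0:
                continue                     # empty column: nothing to update
            lo, hi = A.indptr[i], A.indptr[i + 1]
            rows, vals = A.indices[lo:hi], A.data[lo:hi]
            g = vals @ r[rows]               # partial derivative: O(nnz in column i)
            delta = -g / L[i]
            x[i] += delta
            r[rows] += delta * vals          # residual update: O(nnz in column i)
        return x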
PRELIMINARY EXPERIMENTS
L1 Regularized L1 Regression
[Plots: Dorothea dataset. APPROX compared against the Gradient Method, Nesterov's Accelerated Gradient Method, and SPCDM.]
L1 Regularized Least Squares (LASSO)
[Plot: KDDB dataset. APPROX compared against PCDM.]
Training Linear SVMs
[Plot: Malicious URL dataset.]
Choice of Stepsizes: How (Not) to Parallelize Coordinate Descent
Convergence of Randomized Coordinate Descent
• Strongly convex F (simple method): O(n log(1/ε)) iterations
• Smooth or 'simple' nonsmooth F (accelerated method): O(n/√ε)
• 'Difficult' nonsmooth F (accelerated method) or smooth F (simple method): O(n/ε)
• 'Difficult' nonsmooth F (simple method): O(n/ε²)
Focus on n (big data = big n).
Parallelization Dream
Serial method: one coordinate update per iteration. Parallel method: τ coordinate updates per iteration. What we WANT is a τ-fold speedup. What we actually get depends on the extent to which individual updates can be added up, which in turn depends on the properties of F and on the way coordinates are chosen at each iteration.
"Naive" parallelization: do the same thing as before, but for MORE (or ALL) coordinates, and ADD UP the updates.
[Figure sequence: Failure of naive parallelization. From the starting point 0, each of the two coordinate-wise updates (1a and 1b) is fine on its own, but adding them up (step 1) overshoots; the same happens with updates 2a and 2b (step 2), and the iterates bounce back and forth instead of converging. OOPS!]
[Figure: Idea: averaging the updates may help. On the same example, averaging the two coordinate-wise updates instead of adding them lands at the minimizer: SOLVED!]
[Figure sequence: But averaging can be too conservative. On a separable example, averaging moves each coordinate only part of the way to its minimizer in every iteration (updates 1a, 1b, then 2a, 2b, and so on...), so progress is needlessly slow.]
[Figure: Averaging may be too conservative (2). The averaged step is much shorter than the step we WANTED: BAD!]
What to do?
Averaging: $x_{k+1} = x_k + \frac{1}{\tau} \sum_{i \in S_k} h^{(i)} e_i$
Summation: $x_{k+1} = x_k + \sum_{i \in S_k} h^{(i)} e_i$
(here $h^{(i)}$ is the update to coordinate $i$ and $e_i$ is the $i$-th unit coordinate vector)
Goal: figure out when one can safely use summation. The tool for this is the ESO, described next; a toy illustration of the failure mode follows.
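The toy illustration (the quadratics below are invented for this sketch): exact coordinate-wise steps are computed for all coordinates, then combined by summation or by averaging. On a strongly coupled quadratic, summation diverges; on a separable one, it solves the problem in one step:

    import numpy as np

    def parallel_step(A, x, combine="sum"):
        g = A @ x                        # gradient of f(x) = 0.5 * x^T A x
        h = -g / np.diag(A)              # exact minimizer along each coordinate
        return x + (h if combine == "sum" else h / len(x))

    coupled = np.full((3, 3), 0.9)
    np.fill_diagonal(coupled, 1.0)       # strongly coupled (nearly non-separable)
    separable = np.eye(3)                # fully separable

    for A, name in ((coupled, "coupled"), (separable, "separable")):
        x = np.ones(3)
        for _ in range(10):
            x = parallel_step(A, x, "sum")
        print(name, "after 10 summed steps:", x)

With combine set to averaging, the coupled case converges, but on the separable problem averaging needlessly shrinks the perfect summed step by a factor of n.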
ESO: Expected Separable Overapproximation
5 Models for f Admitting Small ESO Parameters
1. Smooth partially separable f [RT'11b]
2. Nonsmooth max-type f [FR'13]
3. f with 'bounded Hessian' [BKBG'11, RT'13a]
4. Partially separable f with smooth components [NC'13]
5. Partially separable f with block smooth components [FR'13b]
Randomized Parallel Coordinate Descent Method
$x_{k+1} = x_k + \sum_{i \in S_k} h_k^{(i)} e_i$
($S_k$ = random set of coordinates (sampling); $x_k$ = current iterate; $x_{k+1}$ = new iterate; $e_i$ = $i$-th unit coordinate vector; $h_k^{(i)}$ = update to the $i$-th coordinate)
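A compact sketch of a method of this form (essentially PCDM in the smooth case) for f(x) = 0.5||Ax - b||^2 under τ-nice sampling, using ESO weights of the form given in [RT'11b] (stated below); the full-gradient recomputation is for clarity only and the problem data are assumptions of the example:

    import numpy as np

    def pcdm(A, b, tau, iters=3000, seed=0):
        # Parallel coordinate descent for f(x) = 0.5*||Ax - b||^2 with tau-nice
        # sampling; v uses the partially separable ESO of [RT'11b], where
        # omega = max number of nonzeros per row of A. Assumes no empty columns.
        rng = np.random.default_rng(seed)
        m, n = A.shape
        omega = int((A != 0).sum(axis=1).max())
        v = (1 + (omega - 1) * (tau - 1) / max(n - 1, 1)) * (A ** 2).sum(axis=0)
        x = np.zeros(n)
        for _ in range(iters):
            S = rng.choice(n, size=tau, replace=False)   # tau-nice sampling
            g = A.T @ (A @ x - b)        # full gradient, for clarity only
            x[S] -= g[S] / v[S]          # summed per-coordinate updates
        return x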
ESO: Expected Separable Overapproximation
Definition [RT'11b]: f admits an ESO with respect to the sampling $\hat S$ with weights $v = (v_1, \dots, v_n)$ if, for all $x$ and $h$,
$\mathbb{E}\big[f(x + h_{[\hat S]})\big] \le f(x) + \frac{\mathbb{E}|\hat S|}{n} \Big( \langle \nabla f(x), h \rangle + \frac{1}{2} \sum_{i=1}^n v_i (h^{(i)})^2 \Big)$,
where $h_{[\hat S]} = \sum_{i \in \hat S} h^{(i)} e_i$. The point of minimizing this bound in h: 1. the right-hand side is separable in h; 2. it can be minimized in parallel; 3. updates need to be computed only for the coordinates in $\hat S$.
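A quick numerical sanity check of this inequality (small sizes and a random matrix, all invented for the example) for f(x) = 0.5||Ax||^2 under the τ-nice sampling, with the weights from the theorem of [RT'11b] stated below; the expectation is computed exactly by enumerating all τ-subsets:

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(1)
    n, m, tau = 6, 8, 3
    A = rng.standard_normal((m, n)) * (rng.random((m, n)) < 0.5)  # sparse-ish
    omega = max(int((A != 0).sum(axis=1).max()), 1)  # degree of partial separability
    L = (A ** 2).sum(axis=0)                         # coordinate Lipschitz constants
    v = (1 + (omega - 1) * (tau - 1) / (n - 1)) * L  # ESO weights [RT'11b]

    f = lambda x: 0.5 * np.sum((A @ x) ** 2)
    x, h = rng.standard_normal(n), rng.standard_normal(n)

    # Left side: exact expectation over all tau-subsets of {0, ..., n-1}.
    subsets = list(combinations(range(n), tau))
    lhs = np.mean([f(x + np.where(np.isin(np.arange(n), S), h, 0.0)) for S in subsets])
    rhs = f(x) + (tau / n) * ((A @ x) @ (A @ h) + 0.5 * np.sum(v * h ** 2))
    print("ESO holds:", lhs <= rhs + 1e-9)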
PART II. ADDITIONAL TOPICS
Partial Separability and Doubly Uniform Samplings
Serial uniform sampling. Probability law: $P(\hat S = \{i\}) = \frac{1}{n}$ for each coordinate $i$.
τ-nice sampling. Probability law: $P(\hat S = S) = 1 / \binom{n}{\tau}$ for every subset $S$ of size τ. Good for shared-memory systems.
Doubly uniform sampling. Probability law: $P(\hat S = S) = q_{|S|} / \binom{n}{|S|}$, where $q_j = P(|\hat S| = j)$; given its size, the set is uniform. Can model unreliable processors / machines.
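Minimal generators for the three samplings, as a sketch (the probability-law descriptions above fully determine them; the size distribution q is an assumed input):

    import numpy as np

    rng = np.random.default_rng(0)

    def serial_uniform(n):
        return {int(rng.integers(n))}                # P(S = {i}) = 1/n

    def tau_nice(n, tau):
        # uniform over all subsets of size tau
        return set(rng.choice(n, size=tau, replace=False))

    def doubly_uniform(n, q):
        # q[j] = P(|S| = j); given its size, the subset is uniform
        size = int(rng.choice(len(q), p=q))
        return set(rng.choice(n, size=size, replace=False))

    print(serial_uniform(8), tau_nice(8, 3), doubly_uniform(8, [0.1, 0.0, 0.4, 0.5]))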
ESO for Partially Separable Functions and Doubly Uniform Samplings
Theorem [RT'11b] (model 1: smooth partially separable f): if f is partially separable of degree ω, then for the τ-nice sampling f admits an ESO with weights $v_i = \big(1 + \frac{(\omega - 1)(\tau - 1)}{n - 1}\big) L_i$, where the $L_i$ are the coordinate Lipschitz constants of f; an analogous formula holds for any doubly uniform sampling.
PCDM: Theoretical Speedup
The theoretical speedup factor is $\frac{\tau}{1 + \frac{(\omega - 1)(\tau - 1)}{n - 1}}$, where ω = degree of partial separability, n = # coordinates, τ = # coordinate updates per iteration.
• WEAK OR NO SPEEDUP: non-separable (dense) problems (ω close to n)
• LINEAR OR GOOD SPEEDUP: nearly separable (sparse) problems; much of Big Data is here!
[Plots: theoretical and practical speedup for n = 1000 (# coordinates); theory and practice agree.]
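The formula is easy to explore numerically; this snippet (sizes chosen arbitrarily) evaluates it for n = 1000 and several degrees of separability:

    import numpy as np

    def speedup(tau, omega, n=1000):
        # theoretical PCDM speedup: tau / (1 + (omega-1)(tau-1)/(n-1))
        return tau / (1 + (omega - 1) * (tau - 1) / (n - 1))

    for omega in (10, 100, 1000):     # very sparse ... fully dense
        print(f"omega={omega:4d}: speedup at tau=512 is {speedup(512, omega):6.1f}x")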
PCDM: Experiment with a 1-Billion-by-2-Billion LASSO Problem
Optimization with Big Data = Extreme* Mountain Climbing
(* in a billion-dimensional space on a foggy day)
[Plots: progress measured in coordinate updates, iterations, and wall time.]
Distributed-Memory Coordinate Descent
Distributed τ-nice sampling: the coordinates are partitioned across machines (Machine 1, Machine 2, Machine 3, ...), and in each iteration every machine samples τ of its own coordinates. Probability law: uniform over all such choices. Good for a distributed version of coordinate descent.
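A sketch of the distributed τ-nice sampling (the partition and sizes are assumptions of the example):

    import numpy as np

    def distributed_tau_nice(partition, tau, rng):
        # partition: one array of coordinate indices per machine;
        # each machine independently samples tau of its own coordinates
        return [rng.choice(part, size=tau, replace=False) for part in partition]

    rng = np.random.default_rng(0)
    parts = np.array_split(np.arange(12), 3)    # 12 coordinates over 3 machines
    print(distributed_tau_nice(parts, tau=2, rng=rng))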
ESO: Distributed Setting
Theorem [RT'13b] (for model 3: f with 'bounded Hessian' [BKBG'11, RT'13a]): the ESO weights in the distributed setting are governed by two quantities: the spectral norm of the data and the spectral norm of the partitioning. Consequence: a bad partitioning at most doubles the # of iterations.
Theorem [RT'13b]: the resulting iteration bound scales with c (# nodes) and the # of updates per node; # iterations of the stated size implies ε-accuracy.
LASSO with a 3TB Data Matrix
128 Cray XE6 nodes with 4 MPI processes each (c = 512). Each node: 2 x 16 cores with 32GB RAM.
[Plot: progress over time; n = # coordinates.]
References: serial coordinate descent
• Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for L1-regularized loss minimization. JMLR, 2011.
• Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
• [RT'11b] P.R. and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2012.
• Rachael Tappenden, P.R., and Jacek Gondzio. Inexact coordinate descent: complexity and preconditioning. arXiv:1304.5530, 2013.
• Ion Necoara, Yurii Nesterov, and Francois Glineur. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest, 2012.
• Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Technical report, Microsoft Research, 2013.
References: parallel coordinate descent
• [BKBG'11] Joseph Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization. ICML, 2011.
• [RT'12] P.R. and Martin Takáč. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873, 2012.
• [TBRS'13] Martin Takáč, Avleen Bijral, P.R., and Nathan Srebro. Mini-batch primal and dual methods for SVMs. ICML, 2013.
• [FR'13a] Olivier Fercoq and P.R. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv:1309.5885, 2013.
• [RT'13a] P.R. and Martin Takáč. Distributed coordinate descent method for big data learning. arXiv:1310.2059, 2013.
• [RT'13b] P.R. and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. arXiv:1310.3438, 2013.
References: parallel coordinate descent (continued)
• P.R. and Martin Takáč. Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings, 2012. (Good entry point to the topic: a 4-page paper.)
• Rachael Tappenden, P.R., and Burak Buke. Separable approximations and decomposition methods for the augmented Lagrangian. arXiv:1308.6774, 2013.
• Indranil Palit and Chandan K. Reddy. Scalable and parallel boosting with MapReduce. IEEE Transactions on Knowledge and Data Engineering, 24(10):1904-1916, 2012.
• [FR'13b] Olivier Fercoq and P.R. Accelerated, parallel and proximal coordinate descent. arXiv:1312.5799, 2013.