Transcript of Scaling Up LASSO Solvers - …courses.cs.washington.edu/courses/cse547/14wi/slides/...

Page 1


“Scalable” LASSO Solvers:
- Parallel SCD (Shotgun)
- Parallel SGD
- Averaging Solutions
- ADMM

Machine Learning for Big Data CSE547/STAT548, University of Washington

Emily Fox February 6th, 2014

©Emily Fox 2014

Case Study 3: fMRI Prediction

Scaling Up LASSO Solvers

- A simple SCD for LASSO (Shooting)
  - Your HW, a more efficient implementation!
  - Analysis of SCD
- Parallel SCD (Shotgun)
- Other parallel learning approaches for linear models
  - Parallel stochastic gradient descent (SGD)
  - Parallel independent solutions then averaging
- ADMM


Page 2

Stochastic Coordinate Descent for LASSO (aka Shooting Algorithm)

- Repeat until convergence:
  - Pick a coordinate j at random
  - Set:

    \hat{\beta}_j = \begin{cases} (c_j + \lambda)/a_j & c_j < -\lambda \\ 0 & c_j \in [-\lambda, \lambda] \\ (c_j - \lambda)/a_j & c_j > \lambda \end{cases}

  - Where:

    c_j = 2 \sum_{i=1}^{N} x_{ij}\,\big(y_i - \beta_0 - \beta_{-j}^\top x_{i,-j}\big), \qquad a_j = 2 \sum_{i=1}^{N} x_{ij}^2
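The coordinate update above can be sketched in code. A minimal NumPy implementation of the Shooting algorithm, assuming no intercept term (β₀ = 0) and recomputing the residual from scratch each step (your HW asks for the more efficient bookkeeping version):

```python
import numpy as np

def shooting_lasso(X, y, lam, n_iter=5000, seed=0):
    """Stochastic coordinate descent (Shooting) for
    min_beta ||X beta - y||_2^2 + lam * ||beta||_1 (no intercept, for brevity)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = np.zeros(d)
    a = 2.0 * (X ** 2).sum(axis=0)            # a_j = 2 sum_i x_ij^2 (precomputable)
    for _ in range(n_iter):
        j = rng.integers(d)                    # pick a coordinate j at random
        r = y - X @ beta + X[:, j] * beta[j]   # residual excluding coordinate j
        c = 2.0 * X[:, j] @ r                  # c_j = 2 sum_i x_ij (y_i - beta_{-j}.x_{i,-j})
        if c < -lam:
            beta[j] = (c + lam) / a[j]
        elif c > lam:
            beta[j] = (c - lam) / a[j]
        else:
            beta[j] = 0.0                      # soft-threshold to exactly zero
    return beta
```

Because each coordinate step is an exact minimization, the objective decreases monotonically.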

Shotgun: Parallel SCD [Bradley et al ‘11]

Shotgun (Parallel SCD)

While not converged:
- On each of P processors:
  - Choose random coordinate j
  - Update β_j (same as for Shooting)

Lasso:  \min_\beta F(\beta), \quad \text{where } F(\beta) = \|X\beta - y\|_2^2 + \lambda \|\beta\|_1
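Shotgun's semantics can be simulated serially: each round, P "processors" all read the same stale β, compute their Shooting updates, and the updates are applied together (function name and round count are illustrative):

```python
import numpy as np

def shotgun_lasso(X, y, lam, P=4, n_rounds=2000, seed=0):
    """Serial simulation of Shotgun (parallel SCD): per round, P coordinates
    are updated from the SAME stale beta, then all P updates are applied."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = np.zeros(d)
    a = 2.0 * (X ** 2).sum(axis=0)
    for _ in range(n_rounds):
        js = rng.choice(d, size=P, replace=False)  # P simulated processors
        resid = y - X @ beta                       # stale residual shared by all P
        new = beta.copy()
        for j in js:
            c = 2.0 * X[:, j] @ (resid + X[:, j] * beta[j])
            if c < -lam:
                new[j] = (c + lam) / a[j]
            elif c > lam:
                new[j] = (c - lam) / a[j]
            else:
                new[j] = 0.0
        beta = new                                 # apply the collective update
    return beta
```

With nearly uncorrelated features (as in the convergence analysis below, P below the divergence threshold), the stale updates still make progress.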


Page 3

Is SCD inherently sequential?

Coordinate update:

  β_j ← β_j + δβ_j   (closed-form minimization)

Collective update:

  Δβ = (0, …, δβ_i, …, δβ_j, …, 0)^⊤   (all P coordinate updates applied as a single vector)


Convergence Analysis

Theorem (Shotgun Convergence): Assume P < d/ρ + 1, where ρ = spectral radius of XᵀX. Then

  E\left[F(\beta^{(T)})\right] - F(\beta^*) \;\le\; \frac{d\left(\tfrac{1}{2}\|\beta^*\|_2^2 + F(\beta^{(0)})\right)}{T\,P}

Nice case: uncorrelated features:  ρ = __  ⇒  P_max = __
Bad case: correlated features:  ρ = __  ⇒  P_max = __ (at worst)


Page 4

Stepping Back…

- Stochastic coordinate descent
  - Optimization:
  - Parallel SCD:
  - Issue:
  - Solution:
- Natural counterpart:
  - Optimization:
  - Parallel:
  - Issue:
  - Solution:


Parallel SGD with No Locks [e.g., Hogwild!, Niu et al. ‘11]

- Each processor in parallel:
  - Pick data point i at random
  - For j = 1…p:
- Assume atomicity of:
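A toy lock-free sketch in the spirit of Hogwild!: threads update a shared weight vector with no locking, assuming only that per-coordinate writes are atomic. Plain least-squares SGD is used for brevity (the slide's setting adds the L1 term), the function name and step size are illustrative, and under CPython's GIL this simulates the races rather than achieving true parallelism:

```python
import numpy as np
import threading

def hogwild_sgd(X, y, n_threads=4, n_steps=2000, lr=0.01, seed=0):
    """Lock-free parallel SGD sketch for min ||X b - y||_2^2."""
    n, d = X.shape
    beta = np.zeros(d)  # shared across threads, updated without locks

    def worker(tid):
        rng = np.random.default_rng(seed + tid)
        for _ in range(n_steps):
            i = rng.integers(n)                    # pick data point i at random
            g = 2.0 * (X[i] @ beta - y[i]) * X[i]  # gradient may already be stale
            for j in np.nonzero(X[i])[0]:          # for j = 1..p (nonzero coords)
                beta[j] -= lr * g[j]               # per-coordinate atomic write

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return beta
```

Iterating only over the nonzero coordinates of x_i is what makes the interference between processors weak when the data are sparse.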


Page 5

Addressing Interference in Parallel SGD

- Key issues:
  - Old (stale) gradients
  - Processors overwrite each other’s work
- Nonetheless:
  - Can achieve convergence and some parallel speedups
  - Proof relies on weak interactions between updates, obtained through sparsity of the data points


Problem with Parallel SCD and SGD

- Both parallel SCD & SGD assume access to the current estimate of the weight vector
- Works well on shared-memory machines
- Very difficult to implement efficiently in distributed memory
- Open problem: good parallel SGD and SCD for the distributed setting…
  - Let’s look at a trivial approach


Page 6

Simplest Distributed Optimization Algorithm Ever Made

- Given N data points & P machines
- Stochastic optimization problem:
- Distribute data:
- Solve problems independently
- Merge solutions
- Why should this work at all????
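The whole scheme can be sketched in a few lines (using least squares as the per-machine subproblem for illustration; the function name is ours):

```python
import numpy as np

def distribute_then_average(X, y, P=4):
    """Split the N data points across P machines, solve each subproblem
    independently (least squares here), and merge by averaging."""
    Xs, ys = np.array_split(X, P), np.array_split(y, P)
    sols = [np.linalg.lstsq(Xp, yp, rcond=None)[0] for Xp, yp in zip(Xs, ys)]
    return np.mean(sols, axis=0)  # merge solutions by averaging
```

Each machine touches only its own shard, so the only communication is the final P vectors of length d.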


For Convex Functions…

- Convexity:
- Thus:
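The convexity fact being invoked here is presumably Jensen's inequality applied to the average of the per-machine solutions (the blanks on the slide are left unfilled; this is our reading):

```latex
F\!\left(\frac{1}{P}\sum_{p=1}^{P} \hat\beta_p\right) \;\le\; \frac{1}{P}\sum_{p=1}^{P} F(\hat\beta_p)
```

So the averaged solution is no worse, in objective, than the average quality of the independent solutions.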


Page 7

Hopefully…

- Convexity only guarantees:
- But, estimates from independent data!

©Emily Fox 2014 13

The average mixture algorithm

Two pictures: [figure: the independent per-machine estimates and their average; slide from Duchi, “Communication Efficient Optimization,” CDC 2012]

Figure from John Duchi

Analysis of Distribute-then-Average [Zhang et al. ‘12]

- Under some conditions, including strong convexity, lots of smoothness, and more…
- If all data were on one machine, converge at rate:
- With P machines, converge at a rate:


Page 8

Tradeoffs, tradeoffs, tradeoffs,…

- Distribute-then-Average:
  - “Minimum possible” communication
  - Bias term can be a killer with finite data
    - Issue definitely observed in practice
  - Significant issues for L1 problems:
- Parallel SCD or SGD:
  - Can have much better convergence in practice in the multicore setting
  - Preserves sparsity (especially SCD)
  - But hard to implement in the distributed setting


Alternating Direction Method of Multipliers

- A tool for solving convex problems with separable objectives:
- LASSO example:
- Know how to minimize f(β) or g(β) separately


Page 9

ADMM Insight

- Try this instead:
- Solve using the method of multipliers
- Define the augmented Lagrangian:
  - Issue: the L2 penalty destroys separability of the Lagrangian
  - Solution: replace joint minimization over (x, z) by alternating minimization


ADMM Algorithm

- Augmented Lagrangian:

  L_\rho(x, z, y) = f(x) + g(z) + y^\top (x - z) + \frac{\rho}{2}\,\|x - z\|_2^2

- Alternate between:
  1. x
  2. z
  3. y
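Filled out, the three alternating steps are the standard ADMM updates (as in the Boyd et al. reference cited at the end of this deck):

```latex
x^{k+1} = \arg\min_x \; L_\rho(x, z^k, y^k) \\
z^{k+1} = \arg\min_z \; L_\rho(x^{k+1}, z, y^k) \\
y^{k+1} = y^k + \rho\,(x^{k+1} - z^{k+1})
```

The x- and z-steps each see only f or only g, which restores the separability that the joint minimization lost.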

Page 10

ADMM for LASSO

- Objective:
- Augmented Lagrangian (general form, then with x = β, y = a):

  L_\rho(x, z, y) = f(x) + g(z) + y^\top (x - z) + \frac{\rho}{2}\,\|x - z\|_2^2

  L_\rho(\beta, z, a) =

- Alternate between:
  1. β
  2. z
  3. a
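A sketch of the three steps for the LASSO split f(β) = ‖Xβ − y‖₂², g(z) = λ‖z‖₁ with constraint β = z, written with the scaled dual u = a/ρ (the closed forms are standard; ρ and the iteration count here are illustrative choices):

```python
import numpy as np

def admm_lasso(X, y, lam, rho=1.0, n_iter=200):
    """ADMM for min ||X b - y||_2^2 + lam ||b||_1 via the split b = z."""
    n, d = X.shape
    z = np.zeros(d)
    u = np.zeros(d)                      # scaled dual variable u = a / rho
    M = 2.0 * X.T @ X + rho * np.eye(d)  # beta-step is a ridge-like linear system
    Xty2 = 2.0 * X.T @ y
    for _ in range(n_iter):
        beta = np.linalg.solve(M, Xty2 + rho * (z - u))        # 1. beta-minimization
        v = beta + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0)  # 2. z: soft-threshold
        u = u + beta - z                                       # 3. dual update
    return z
```

Note that the L1 term is handled entirely by the z-step's soft-threshold, so z (the returned iterate) is exactly sparse.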

ADMM Wrap-Up

- When does ADMM converge?
  - Under very mild conditions
  - Basically, f and g must be convex
- ADMM is useful in cases where:
  - f(x) + g(x) is challenging to solve due to coupling
  - We can minimize:
    - f(x) + (x − a)²
    - g(x) + (x − a)²
- Reference:
  - Boyd, Parikh, Chu, Peleato, Eckstein (2011). “Distributed optimization and statistical learning via the alternating direction method of multipliers.” Foundations and Trends in Machine Learning, 3(1):1–122.


Page 11

What you need to know

- A simple SCD for LASSO (Shooting)
  - Your HW, a more efficient implementation!
  - Analysis of SCD
- Parallel SCD (Shotgun)
- Other parallel learning approaches for linear models
  - Parallel stochastic gradient descent (SGD)
  - Parallel independent solutions then averaging
- ADMM
  - General idea
  - Application to LASSO
