Distributed Sparse Optimization
Wotao Yin
Computational & Applied Mathematics (CAAM)
Rice University
STATOS 2013 – in memory of Alex Gershman
Zhimin Peng, Ming Yan, Wotao Yin. Parallel and Distributed Sparse Optimization, 2013
Outline
1. Background: Sparse optimization
2. Background: Parallel benefits, speedup, and overhead
3. Parallelize existing algorithms
4. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
5. Parallel greedy coordinate descent
6. Numerical results with big data
Coverage
This talk will not
• survey the existing sparse optimization models or algorithms
• cover decentralized algorithms, due to the time limit
This talk will
• illustrate how to apply parallel computing techniques to some large-scale sparse optimization problems
• discuss parallel speedup and overhead along the way
Sparse optimization
A tool that processes “large” data sets for “structured” solutions
Examples of structures: sparse, joint/group sparse, sparse graph, low-rank,
smooth, low-dim manifold, their combinations, ...
Key points:
• finding the sparsest solution is generally NP-hard¹
• ℓ1 minimization gives the sparsest solution, given sufficient measurements and other conditions²
• ℓ1 minimization is tractable
• has many interesting applications and generalizations (requiring smart designs of the regularization and the sensing matrix)
¹ Natarajan [1995]
² Donoho and Elad [2003], Donoho [2006], Candes, Romberg, and Tao [2006]
Basic approach: ℓ1 minimization
  minimize_x ‖Ψᵀx‖₁  s.t. Ax = b,  (1a)
  minimize_x λ‖Ψᵀx‖₁ + ½‖Ax − b‖₂²,  (1b)
  minimize_x ‖Ψᵀx‖₁  s.t. ‖Ax − b‖₂ ≤ δ.  (1c)
The settings:
• Ψ is the sparsifying transform; Ψᵀx is sparse
• A is a linear operator / sensing operator / feature matrix
• when b or Ψᵀx has noise, (1b) or (1c) is used
Other regularization: joint/group variants, nuclear norm, ...
Standard form of regularization: minimize ∑ᵢ (atomᵢ) — good for parallelism
ℓ1 optimization: single-threaded approaches
Off-the-shelf: subgradient descent, reduction to LP/SOCP/SDP
Smoothing: approximate ℓ1 by the ℓ1+ε-norm, ∑ᵢ √(xᵢ² + ε), the Huber norm, or augmented ℓ1;
then apply gradient-based and quasi-Newton methods.
Splitting: split smooth and nonsmooth terms into simple subproblems
• primal splitting: GPSR/FPC/SpaRSA/FISTA/SPGL1/...
• dual operator splitting: Bregman/ADMM/split Bregman/...
Greedy approaches: OMP and variants, greedy coordinate descent
Non-convex approaches
......
Parallel computing
• a problem is broken into concurrent parts, executed simultaneously,
coordinated by a controller
• uses multiple cores/CPUs/networked computers, or their combinations
• expect to solve a problem in less time, or a larger problem
[Figure: a problem is split into parts t1, t2, …, tN that run simultaneously on multiple CPUs]
Benefits of being parallel
• break single-CPU limits
• save time
• process big data
• access non-local resource
• handle concurrent data streams / enable collaboration
Parallel overhead
• computing overhead: start up time, synchronization wait time, data
communication time, termination (data collection time)
• I/O overhead: slow read and write of non-local data
• algorithm overhead: extra variables, data duplication
• coding overhead: language, library, operating system, debug, maintenance
Parallel speedup
• definition:
  speedup = serial time / parallel time
  where time is measured in the wall-clock sense
• Amdahl's Law: assume infinitely many processors and no overhead
  ideal max speedup = 1 / (1 − P)
  where P = parallelized fraction of code
[Figure: ideal max speedup versus the parallel portion of the code]
Parallel speedup
• assume N processors and no overhead
  ideal speedup = 1 / (P/N + (1 − P))
• note: P may depend on data size, which depends on input size and N
[Figure: ideal speedup versus the number of processors, for P = 25%, 50%, 90%, 95%]
Parallel speedup
• E := parallel overhead
• in the real world,
  actual speedup = 1 / (P/N + (1 − P) + E)
[Figures: actual speedup versus the number of processors for P = 25%, 50%, 90%, 95%; left: when E = N, right: when E = log(N)]
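The three speedup formulas above can be put into a single helper to see how overhead caps the gains; a minimal Python sketch (the function name and the overhead constants are ours):

```python
import math

def speedup(P, N, E=0.0):
    """Speedup of a code whose fraction P is parallelized, run on N
    processors with parallel overhead E, per the formulas above."""
    return 1.0 / (P / N + (1.0 - P) + E)

# Amdahl's Law: even with no overhead and huge N, P = 95% caps speedup near 20
print(round(speedup(0.95, 10**9), 3))   # → 20.0

# overhead growing like log(N) hurts far less than overhead growing like N
c = 0.001
print(speedup(0.95, 1024, E=c * math.log(1024)) >
      speedup(0.95, 1024, E=c * 1024))   # → True
```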
Types of communication
[Figure: common collective communication patterns — Broadcast, Scatter, Gather, Reduce, Allreduce]
Allreduce sum via butterfly communication
  processor:  0   1   2   3   4   5   6   7
  initial:    2   3   6   4   8   1   7   5
  layer 1:    5   5  10  10   9   9  12  12
  layer 2:   15  15  15  15  21  21  21  21
  layer 3:   36  36  36  36  36  36  36  36
log(N) layers, N parallel communications per layer
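The butterfly pattern above is easy to simulate: at layer k, node i adds the value held by node i XOR 2^k. A small Python sketch (not real MPI; the function name is ours):

```python
def butterfly_allreduce(vals):
    """Allreduce-sum via butterfly: log2(N) layers; in each layer
    every node pairs with the node whose id differs in one bit."""
    n = len(vals)                      # assumed to be a power of two
    vals = list(vals)
    dist = 1
    while dist < n:
        vals = [vals[i] + vals[i ^ dist] for i in range(n)]
        dist *= 2
    return vals

print(butterfly_allreduce([2, 3, 6, 4, 8, 1, 7, 5]))
# → [36, 36, 36, 36, 36, 36, 36, 36], matching the table above
```

After log2(8) = 3 layers every node holds the total, with all pairwise exchanges in a layer happening in parallel.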
Decomposable objectives and constraints enable parallelism
• variables x = [x₁, x₂, …, x_N]ᵀ
• embarrassingly parallel (not an interesting case): min ∑ᵢ fᵢ(xᵢ)
• consensus minimization, no constraints:
  min f(x) = ∑_{i=1}^N fᵢ(x) + r(x)
  example: f(x) = (λ/2) ∑ᵢ ‖A₍ᵢ₎x − bᵢ‖₂² + ‖x‖₁
• coupling variable z:
  min_{x,z} f(x, z) = ∑_{i=1}^N [fᵢ(xᵢ, z) + rᵢ(xᵢ)] + r(z)
• coupling constraints:
  min f(x) = ∑_{i=1}^N fᵢ(xᵢ) + rᵢ(xᵢ)
  subject to
  – equality constraints: ∑_{i=1}^N Aᵢxᵢ = b
  – inequality constraints: ∑_{i=1}^N Aᵢxᵢ ≤ b
  – graph-coupling constraints: Ax = b, where A is sparse
• combinations of independent variables, coupling variables, and constraints
  example:
  min_{x,w,z} ∑_{i=1}^N fᵢ(xᵢ, z)
  s.t. ∑ᵢ Aᵢxᵢ ≤ b,  Bᵢxᵢ ≤ w ∀i,  w ∈ Ω
• variable coupling ←→ constraint coupling:
  min ∑_{i=1}^N fᵢ(xᵢ, z) −→ min ∑_{i=1}^N fᵢ(xᵢ, zᵢ) s.t. zᵢ = z ∀i
Approaches toward parallelism
• Keep an existing serial algorithm, and parallelize its expensive part(s)
• Introduce a new parallel algorithm by
  – formulating an equivalent model
  – replacing the model by an approximate model that is easier to parallelize
Outline
1. Background: Sparse optimization
2. Background: Parallel benefits, speedup, and overhead
3. Parallelize existing algorithms
4. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
5. Parallel greedy coordinate descent
6. Numerical results with big data
Example: parallel ISTA
  minimize_x λ‖x‖₁ + f(x)
• prox-linear approximation:
  x^{k+1} = argmin_x λ‖x‖₁ + ⟨∇f(x^k), x − x^k⟩ + (1/(2δ_k))‖x − x^k‖₂²
• iterative soft-thresholding algorithm (ISTA):
  x^{k+1} = shrink(x^k − δ_k∇f(x^k), λδ_k)
• shrink(y, t) has a simple closed form: max(abs(y)-t,0).*sign(y)
• the dominating computation is ∇f(x^k)
• examples: f(x) = ½‖Ax − b‖₂², the logistic loss of Ax − b, the hinge loss, ..., where A is often dense and can be very large-scale
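The shrink operator and one ISTA update, in numpy; a minimal sketch (function names are ours):

```python
import numpy as np

def shrink(y, t):
    """Soft-thresholding, the closed-form prox of t*||.||_1:
    max(|y| - t, 0) * sign(y), applied elementwise."""
    return np.maximum(np.abs(y) - t, 0.0) * np.sign(y)

def ista_step(x, grad, delta, lam):
    """One ISTA update: x+ = shrink(x - delta * grad f(x), lam * delta)."""
    return shrink(x - delta * grad, lam * delta)

print(shrink(np.array([3.0, 0.5, -2.0]), 1.0))
# → [ 2.  0. -1.]
```

Entries with magnitude below the threshold are zeroed, which is what produces sparse iterates.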
Parallel gradient computing of f(x) = g(Ax + b): row partition
Let A be very big and dense, and g(y) = ∑ᵢ gᵢ(yᵢ).
Then f(x) = g(Ax + b) = ∑ᵢ gᵢ(A₍ᵢ₎x + bᵢ)
• let A₍ᵢ₎ and bᵢ be the ith row blocks of A and b, respectively
[Figure: A partitioned into row blocks A₍₁₎, A₍₂₎, …, A₍M₎]
• store A₍ᵢ₎, bᵢ on node i
• goal: compute ∇f(x) = ∑ᵢ A₍ᵢ₎ᵀ∇gᵢ(A₍ᵢ₎x + bᵢ), or similarly ∂f(x)
Parallel gradient computing of f(x) = g(Ax + b): row partition
Steps
1. compute A₍ᵢ₎x + bᵢ ∀i in parallel;
2. compute gᵢ = ∇gᵢ(A₍ᵢ₎x + bᵢ) ∀i in parallel;
3. compute A₍ᵢ₎ᵀgᵢ ∀i in parallel;
4. allreduce ∑_{i=1}^M A₍ᵢ₎ᵀgᵢ.
• steps 1, 2, and 3 are local computations and communication-free;
• step 4 requires a collective communication.
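The four steps can be checked numerically (no real MPI here: each node's work is one loop iteration and allreduce is a plain sum). Take g(y) = ½‖y‖₂², so every ∇gᵢ is the identity; the sizes and the 3-node split are our choices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((12, 5))
b = rng.standard_normal(12)
x = rng.standard_normal(5)

# ground truth: f(x) = g(Ax + b) with g(y) = 0.5*||y||^2, so grad g = identity
full_grad = A.T @ (A @ x + b)

M = 3                                   # number of "nodes"
A_blocks = np.array_split(A, M)         # row blocks A_(i)
b_blocks = np.array_split(b, M)
# steps 1-3: each node works only on its own row block (communication-free)
partials = [Ai.T @ (Ai @ x + bi) for Ai, bi in zip(A_blocks, b_blocks)]
grad = sum(partials)                    # step 4: allreduce(SUM)

print(np.allclose(grad, full_grad))     # → True
```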
Parallel gradient computing of f(x) = g(Ax + b): column partition
[Figure: A partitioned into column blocks A₁, A₂, …, A_N]
  Ax = ∑_{i=1}^N Aᵢxᵢ  ⟹  f(x) = g(Ax + b) = g(∑ᵢ Aᵢxᵢ + b)
  ∇f(x) = [A₁ᵀ∇g(Ax + b); A₂ᵀ∇g(Ax + b); …; A_Nᵀ∇g(Ax + b)], and similarly for ∂f(x)
Parallel gradient computing of f(x) = g(Ax + b): column partition
Steps
1. compute Aᵢxᵢ + (1/N)b ∀i in parallel;
2. allreduce Ax + b = ∑_{i=1}^N (Aᵢxᵢ + (1/N)b);
3. evaluate ∇g(Ax + b) in each process;
4. compute Aᵢᵀ∇g(Ax + b) ∀i in parallel.
• each processor keeps Aᵢ and xᵢ, as well as a copy of b;
• step 2 requires a collective communication;
• step 3 involves duplicated computation in each process.
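The same numerical check for the column partition, again with g(y) = ½‖y‖₂² (illustrative sizes; allreduce is a plain sum):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 12))
b = rng.standard_normal(6)
x = rng.standard_normal(12)
full_grad = A.T @ (A @ x + b)           # again grad g = identity

N = 4
col_blocks = np.array_split(np.arange(12), N)
# steps 1-2: local A_i x_i + b/N, then allreduce reconstructs Ax + b
y = sum(A[:, c] @ x[c] + b / N for c in col_blocks)
# step 3 is duplicated on every node; step 4: local A_i^T grad g(Ax + b)
grad = np.concatenate([A[:, c].T @ y for c in col_blocks])

print(np.allclose(grad, full_grad))     # → True
```

Note the communicated vector now has length m (rows), not n, which is why this partition wins when m ≪ n.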
Two-way partition
[Figure: A partitioned into M × N blocks A_{i,j}]
  f(x) = ∑ᵢ gᵢ(A₍ᵢ₎x + bᵢ) = ∑ᵢ gᵢ(∑ⱼ A_{i,j}xⱼ + bᵢ)
  ∇f(x) = ∑ᵢ A₍ᵢ₎ᵀ∇gᵢ(A₍ᵢ₎x + bᵢ) = ∑ᵢ A₍ᵢ₎ᵀ∇gᵢ(∑ⱼ A_{i,j}xⱼ + bᵢ)
Two-way partition
Steps
1. compute A_{i,j}xⱼ + (1/N)bᵢ ∀(i, j) in parallel;
2. allreduce ∑_{j=1}^N A_{i,j}xⱼ + bᵢ for every i;
3. compute gᵢ = ∇gᵢ(A₍ᵢ₎x + bᵢ) ∀i in parallel;
4. compute A_{i,j}ᵀgᵢ ∀(i, j) in parallel;
5. allreduce ∑_{i=1}^M A_{i,j}ᵀgᵢ for every j.
• processor (i, j) keeps A_{i,j}, xⱼ, and bᵢ;
• steps 2 and 5 require collective communications.
Example: parallel ISTA for LASSO
LASSO
  min f(x) = λ‖x‖₁ + ½‖Ax − b‖₂²
algorithm
  x^{k+1} = shrink(x^k − δ_k AᵀAx^k + δ_k Aᵀb, λδ_k)
Serial pseudo-code
1: initialize x = 0 and δ;
2: pre-compute Aᵀb;
3: while not converged do
4:   g = AᵀAx − Aᵀb;
5:   x = shrink(x − δg, λδ);
6: end while
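The serial pseudo-code above, made runnable in numpy (the step size, problem sizes, and fixed iteration count in place of a convergence test are our choices):

```python
import numpy as np

def shrink(y, t):
    return np.maximum(np.abs(y) - t, 0.0) * np.sign(y)

def ista_lasso(A, b, lam, iters=2000):
    """Serial ISTA for min lam*||x||_1 + 0.5*||Ax - b||_2^2."""
    delta = 1.0 / np.linalg.norm(A, 2) ** 2   # step <= 1/L with L = ||A^T A||
    Atb = A.T @ b                             # pre-computed once
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x) - Atb
        x = shrink(x - delta * g, lam * delta)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x0 = np.zeros(100)
x0[[3, 30, 70]] = [2.0, -1.5, 1.0]
b = A @ x0
x = ista_lasso(A, b, lam=0.1)
```

On this small compressed-sensing-style instance, the iterate ends up sparse with its large entries on the true support.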
Example: parallel ISTA for LASSO
distribute blocks of rows to M nodes
[Figure: A partitioned into row blocks A₍₁₎, …, A₍M₎]
1: initialize x = 0 and δ;
2: i = my processor id;
3: processor i loads A₍ᵢ₎;
4: pre-compute δAᵀb;
5: while not converged do
6:   processor i computes cᵢ = A₍ᵢ₎ᵀA₍ᵢ₎x;
7:   c = allreduce(cᵢ, SUM);
8:   x = shrink(x − δc + δAᵀb, λδ);
9: end while
• assume A ∈ R^{m×n}
• one allreduce per iteration
• speedup ≈ 1 / (P/M + (1 − P) + O(log(M)))
• P is close to 1
• requires synchronization
Example: parallel ISTA for LASSO
distribute blocks of columns to N nodes
[Figure: A partitioned into column blocks A₁, …, A_N]
1: initialize x = 0 and δ;
2: i = my processor id;
3: processor i loads Aᵢ;
4: pre-compute δAᵢᵀb;
5: while not converged do
6:   processor i computes yᵢ = Aᵢxᵢ;
7:   y = allreduce(yᵢ, SUM);
8:   processor i computes zᵢ = Aᵢᵀy;
9:   xᵢ = shrink(xᵢ − δzᵢ + δAᵢᵀb, λδ);
10: end while
• one allreduce per iteration
• speedup ≈ 1 / (P/N + (1 − P) + O((m/n) log(N)))
• P is close to 1
• requires synchronization
• if m ≪ n, this approach is faster
Outline
1. Background: Sparse optimization
2. Background: Parallel benefits, speedup, and overhead
3. Parallelize existing algorithms
4. Primal decomposition / dual decomposition3
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
5. Parallel greedy coordinate descent
6. Numerical results with big data
3Dantzig and Wolfe [1960], Spingarn [1985], Bertsekas and Tsitsiklis [1997]
Unconstrained minimization with coupling variables
x = [x₁, x₂, …, x_N]ᵀ.
z is the coupling/bridging/complicating variable.
  min_{x,z} f(x, z) = ∑_{i=1}^N fᵢ(xᵢ, z)
equivalent to a separable objective with coupling constraints:
  min_{x,z} ∑_{i=1}^N fᵢ(xᵢ, zᵢ) s.t. zᵢ = z ∀i
Examples
• network utility maximization (NUM), extend to multiple layers4
• domain decomposition (z are overlapping boundary variables)
• system of equations: A is block-diagonal but with a few dense columns
4see tutorial: Palomar and Chiang [2006]
Primal decomposition (bi-level optimization)
Introduce easier subproblems
  gᵢ(z) = min_{xᵢ} fᵢ(xᵢ, z), i = 1, …, N
then
  min_{x,z} ∑_{i=1}^N fᵢ(xᵢ, z) ⟺ min_z ∑ᵢ gᵢ(z)
the RHS is called the master problem
parallel computing: iterate
1. fix z, update xi and compute gi(z) in parallel
2. update z using a standard method (typically, subgradient descent)
• good for small-dimensional z
• fast if the master problem converges fast
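A toy instance of the two-step iteration, with fᵢ(xᵢ, z) = ½(xᵢ − aᵢ)² + ½(xᵢ − z)² (our illustrative choice): each subproblem has the closed form xᵢ = (aᵢ + z)/2, so gᵢ(z) = (z − aᵢ)²/4, and the master problem min_z ∑ᵢ gᵢ(z) is solved by gradient descent on the scalar z:

```python
import numpy as np

a = np.array([1.0, 4.0, 7.0])     # per-subproblem data (illustrative)

z = 0.0
for _ in range(100):
    # step 1: fix z; solve each subproblem in parallel (closed form here)
    xs = (a + z) / 2.0                    # x_i = argmin_{x_i} f_i(x_i, z)
    # g_i(z) = (z - a_i)^2 / 4, so d g_i / d z = (z - a_i) / 2
    grad_master = np.sum((z - a) / 2.0)
    # step 2: master update on z (plain gradient descent)
    z -= 0.5 * grad_master

print(round(z, 6))   # → 4.0: the minimizer of sum_i (z - a_i)^2/4 is mean(a)
```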
Dual decomposition
Consider
  min_{x,z} ∑ᵢ fᵢ(xᵢ, zᵢ) s.t. ∑ᵢ Aᵢzᵢ = b
Introduce
  gᵢ(z) = min_{xᵢ} fᵢ(xᵢ, z)
  gᵢ*(y) = max_z yᵀz − gᵢ(z)  // convex conjugate
Dualization:
  ⇔ min_{x,z} max_y ∑ᵢ fᵢ(xᵢ, zᵢ) + yᵀ(b − ∑ᵢ Aᵢzᵢ)
  (generalized minimax thm.) ⇔ max_y min_{x,z} {∑ᵢ (fᵢ(xᵢ, zᵢ) − yᵀAᵢzᵢ) + yᵀb}
  ⇔ max_y {min_z ∑ᵢ (gᵢ(zᵢ) − yᵀAᵢzᵢ) + yᵀb}
  ⇔ min_y ∑ᵢ gᵢ*(Aᵢᵀy) − yᵀb
where gᵢ(z) = min_{xᵢ} fᵢ(xᵢ, z) and gᵢ*(y) = max_z yᵀz − gᵢ(z) is the convex conjugate of gᵢ.
Example: consensus and exchange problems
Consensus problem
  min_z ∑ᵢ gᵢ(z)
Reformulate as
  min ∑ᵢ gᵢ(zᵢ) s.t. zᵢ = z ∀i
Its dual problem is the exchange problem
  min_y ∑ᵢ gᵢ*(yᵢ) s.t. ∑ᵢ yᵢ = 0
Primal-dual objective properties
Constrained separable problem
  min_z ∑ᵢ gᵢ(zᵢ) s.t. ∑ᵢ Aᵢzᵢ = b
Dual problem
  min_y ∑ᵢ gᵢ*(Aᵢᵀy) − bᵀy
• If gᵢ is proper, closed, and convex, gᵢ*(Aᵢᵀy) is sub-differentiable
• If gᵢ is strictly convex, gᵢ*(Aᵢᵀy) is differentiable
• If gᵢ is strongly convex, gᵢ*(Aᵢᵀy) is differentiable and has a Lipschitz gradient
• Easily obtaining z* from y* generally requires strict convexity of gᵢ or strict complementary slackness
Examples (scalar zᵢ, so Aᵢ is a column aᵢ):
• gᵢ(zᵢ) = |zᵢ| and gᵢ*(aᵢᵀy) = ι{−1 ≤ aᵢᵀy ≤ 1}
• gᵢ(zᵢ) = ½|zᵢ|² and gᵢ*(aᵢᵀy) = ½|aᵢᵀy|²
• gᵢ(zᵢ) = exp(zᵢ) and gᵢ*(aᵢᵀy) = (aᵢᵀy) ln(aᵢᵀy) − aᵢᵀy + ι{aᵢᵀy ≥ 0}
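The listed conjugate pairs can be spot-checked numerically from the definition g*(u) = max_z (uz − g(z)); a grid-search sketch (the grid range, function name, and tolerance are ours):

```python
import numpy as np

def conj(g, u, zgrid):
    """Numerical convex conjugate g*(u) = max_z (u*z - g(z)) on a grid."""
    return np.max(u * zgrid - g(zgrid))

zgrid = np.linspace(-30.0, 30.0, 600001)
# g(z) = 0.5*z^2 has conjugate g*(u) = 0.5*u^2
print(round(conj(lambda t: 0.5 * t**2, 2.0, zgrid), 6))   # → 2.0
```

The same check works for g(z) = exp(z), whose conjugate is u·ln(u) − u for u > 0.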
Outline
1. Background: Sparse optimization
2. Background: Parallel benefits, speedup, and overhead
3. Parallelize existing algorithms
4. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman5)
• Parallel ADMM
5. Parallel greedy coordinate descent
6. Numerical results with big data
5Yin, Osher, Goldfarb, and Darbon [2008]
Example: augmented ℓ1
Change |zᵢ| into the strongly convex gᵢ(zᵢ) = |zᵢ| + (1/(2α))|zᵢ|²
Dual problem
  min_y ∑ᵢ gᵢ*(Aᵢᵀy) − bᵀy = ∑ᵢ (α/2)‖shrink(Aᵢᵀy)‖₂² − bᵀy,  with hᵢ(y) := (α/2)‖shrink(Aᵢᵀy)‖₂²
Dual gradient iteration (equivalent to linearized Bregman):
  y^{k+1} = y^k − t_k (∑ᵢ ∇hᵢ(y^k) − b), where ∇hᵢ(y) = αAᵢ shrink(Aᵢᵀy).
• The dual is C¹ and unconstrained. One can apply (accelerated) gradient descent, line search, quasi-Newton methods, etc.
• Global linear convergence.⁶
• Easy to parallelize: one collective communication per iteration.
• Recover xᵢ* = α shrink(Aᵢᵀy*).
6Lai and Yin [2011]
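A serial numpy sketch of the dual gradient iteration above (the problem sizes, the step size t = 1/(α‖A‖²), and the iteration count are our choices; all blocks are handled at once as the full A):

```python
import numpy as np

def shrink(y, t=1.0):
    return np.maximum(np.abs(y) - t, 0.0) * np.sign(y)

def linearized_bregman(A, b, alpha, iters=30000):
    """Gradient descent on the dual of
    min ||x||_1 + (1/(2*alpha))||x||^2  s.t.  Ax = b:
    y+ = y - t*(alpha*A*shrink(A^T y) - b); recover x = alpha*shrink(A^T y)."""
    t = 1.0 / (alpha * np.linalg.norm(A, 2) ** 2)   # step <= 1/Lipschitz
    y = np.zeros(A.shape[0])
    for _ in range(iters):
        y -= t * (alpha * (A @ shrink(A.T @ y)) - b)
    return alpha * shrink(A.T @ y)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
x0 = np.zeros(50)
x0[[5, 17, 33]] = [1.0, -2.0, 1.5]
b = A @ x0
x = linearized_bregman(A, b, alpha=10 * np.abs(x0).max())  # alpha per the slides
```

With α ≥ 10‖x⁰‖∞, the augmented problem's solution matches the ℓ1 solution (exact regularization), so the recovered x should approach the sparse x⁰.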
Augmented ℓ1: exact regularization⁷
Theorem
There exists a finite α₀ > 0 such that whenever α > α₀, the solution to
  (L1+LS) min ‖x‖₁ + (1/(2α))‖x‖₂² s.t. Ax = b
is also a solution to
  (L1) min ‖x‖₁ s.t. Ax = b.
[Figure: geometry of the L1 and L1+LS solutions]
7Friedlander and Tseng [2007], Yin [2010]
Augmented ℓ1: CS recovery guarantees
• min ‖x‖₁ gives exact or stable recovery provided
  #measurements ≥ C · f(signal dim, signal sparsity).
• adding (1/(2α))‖x‖₂² changes the condition to
  #measurements ≥ (C + O(1/(2α))) · f(signal dim, signal sparsity).
• in theory and practice, α ≥ 10‖x⁰‖∞ suffices⁸
8Lai and Yin [2011]
Augmented ℓ1: simple test
• Gaussian random A
• x⁰ has 10% nonzero entries
• recover x⁰ from the under-sampled b = Ax⁰
[Figure: solutions of min{‖x‖₁ : Ax = b}, min{‖x‖₂² : Ax = b}, and min{‖x‖₁ + (1/25)‖x‖₂² : Ax = b}; the augmented-ℓ1 solution is exactly the same as the ℓ1 solution]
Outline
1. Background: Sparse optimization
2. Background: Parallel benefits, speedup, and overhead
3. Parallelize existing algorithms
4. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM9
5. Parallel greedy coordinate descent
6. Numerical results with big data
9Bertsekas and Tsitsiklis [1997]
Alternating direction method of multipliers (ADMM)
step 1: turn the problem into the form
  min_{x,y} f(x) + g(y)
  s.t. Ax + By = b.
f and g are convex, maybe nonsmooth, and can incorporate constraints
step 2: iterate
1. x^{k+1} ← argmin_x f(x) + (β/2)‖Ax + By^k − b − z^k‖₂²,
2. y^{k+1} ← argmin_y g(y) + (β/2)‖Ax^{k+1} + By − b − z^k‖₂²,
3. z^{k+1} ← z^k − (Ax^{k+1} + By^{k+1} − b).
history: dates back to the 1960s, formalized in the 1980s; parallel versions appeared in the late 1980s; revived very recently, with new convergence results.
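As a concrete instance of the two steps, ADMM for LASSO with the splitting x − y = 0 (so A = I, B = −I, b = 0 in the template above); the sizes and β are our choices:

```python
import numpy as np

def shrink(v, t):
    return np.maximum(np.abs(v) - t, 0.0) * np.sign(v)

def admm_lasso(A, b, lam, beta=1.0, iters=300):
    """ADMM for min 0.5*||Ax - b||^2 + lam*||y||_1  s.t.  x - y = 0."""
    n = A.shape[1]
    x = y = z = np.zeros(n)
    Atb = A.T @ b
    inv = np.linalg.inv(A.T @ A + beta * np.eye(n))  # factor once, reuse
    for _ in range(iters):
        x = inv @ (Atb + beta * (y + z))     # x-subproblem: least squares
        y = shrink(x - z, lam / beta)        # y-subproblem: prox of lam*||.||_1
        z = z - (x - y)                      # multiplier update
    return y

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 60))
x0 = np.zeros(60)
x0[[2, 11, 40]] = [1.5, -2.0, 1.0]
b = A @ x0
y = admm_lasso(A, b, lam=0.1)
```

Note the one-time matrix factorization in the x-update; this is the cost the EC2 comparison later charges to parallel ADMM.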
ADMM and parallel/distributed computing
Two approaches:
• parallelize the serial algorithm(s) for ADMM subproblem(s),
• turn problem into a parallel-ready form10
  min_{x,y} ∑ᵢ fᵢ(xᵢ) + g(y)
  s.t. diag(A₁, …, A_M) [x₁; …; x_M] + [B₁; …; B_M] y = [b₁; …; b_M],
i.e., Aᵢxᵢ + Bᵢy = bᵢ for each i; consequently, computing x^{k+1} reduces to parallel subproblems
  xᵢ^{k+1} ← argmin_{xᵢ} fᵢ(xᵢ) + (β/2)‖Aᵢxᵢ + Bᵢy^k − bᵢ − zᵢ^k‖₂², i = 1, …, M
Issue: to obtain a block-diagonal A and a simple B, additional variables are often needed, which increases the problem size and the parallel overhead.
10a recent survey Boyd, Parikh, Chu, Peleato, and Eckstein [2011]
Parallel Linearized Bregman vs ADMM on basis pursuit
Basis pursuit:
  min_x ‖x‖₁ s.t. Ax = b.
• example with dense A ∈ R^{1024×2048};
• distribute A over N computing nodes: A = [A₁ A₂ ⋯ A_N];
• compare
  1. the parallel ADMM in the survey paper Boyd, Parikh, Chu, Peleato, and Eckstein [2011];
  2. parallel linearized Bregman (dual gradient descent), un-accelerated.
• tested N = 1, 2, 4, …, 32 computing nodes.
[Figure: wall-clock time versus the number of cores (log-log) for parallel ADMM and parallel linearized Bregman; parallel linearized Bregman scales much better]
Outline
1. Background: Sparse optimization
2. Background: Parallel benefits, speedup, and overhead
3. Parallelize existing algorithms
4. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
5. Parallel greedy coordinate descent
6. Numerical results with big data
(Block)-coordinate descent/update
Definition: update one block of variables each time, keeping others fixed
Advantage: update is simple
Disadvantages:
• more iterations (with exceptions)
• may get stuck at non-stationary points if the problem is non-convex or non-smooth (with exceptions)
Block selection: cyclic (Gauss-Seidel), parallel (Jacobi), random, greedy
Specific to sparse optimization: the greedy approach¹¹ can be exceptionally effective, because most of the time the correct variables are selected for update and most zero variables are never touched, effectively "reducing" the problem to a much smaller one.
11Li and Osher [2009]
GRock: greedy coordinate-block descent
Example:
  min λ‖x‖₁ + f(Ax − b)
• decompose Ax = ∑ⱼ Aⱼxⱼ; block Aⱼ and xⱼ are kept on node j
Parallel GRock:
1. (parallel) compute a candidate update for each coordinate i
   dᵢ = argmin_d λ·r(xᵢ + d) + gᵢd + ½d², where gᵢ = aᵢᵀ∇f(Ax − b) and aᵢ is the ith column of A
2. (parallel) compute a merit value for each block j
   mⱼ = max{|d| : d is an element of dⱼ};
   let sⱼ be the index of the maximal coordinate within block j
3. (allreduce) select the P blocks with the largest mⱼ, 2 ≤ P ≤ N
4. (parallel) update x_{sⱼ} ← x_{sⱼ} + d_{sⱼ} for each selected block j
5. (allreduce) update Ax
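A serial sketch of these five steps for f = ½‖·‖₂², so step 1 has the closed form dᵢ = shrink(xᵢ − gᵢ, λ) − xᵢ; the allreduce steps become plain array operations, and the sizes and parameters are ours:

```python
import numpy as np

def shrink(v, t):
    return np.maximum(np.abs(v) - t, 0.0) * np.sign(v)

def grock(A, b, lam, nblocks, P, iters=300):
    """GRock sketch for min lam*||x||_1 + 0.5*||Ax - b||^2.
    Columns of A are assumed to have unit 2-norm."""
    n = A.shape[1]
    blocks = np.array_split(np.arange(n), nblocks)
    x = np.zeros(n)
    Ax = np.zeros(A.shape[0])
    for _ in range(iters):
        g = A.T @ (Ax - b)                      # coordinate gradients
        d = shrink(x - g, lam) - x              # step 1: candidate updates
        merits = np.array([np.abs(d[blk]).max() for blk in blocks])  # step 2
        picked = [blocks[j][np.argmax(np.abs(d[blocks[j]]))]
                  for j in np.argsort(merits)[-P:]]   # step 3: best P blocks
        for i in picked:                        # steps 4-5: apply updates
            Ax += A[:, i] * d[i]
            x[i] += d[i]
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 128))
A /= np.linalg.norm(A, axis=0)                  # unit-norm columns
x0 = np.zeros(128)
x0[[7, 50, 90]] = [1.0, -1.0, 2.0]
b = A @ x0
x = grock(A, b, lam=0.05, nblocks=16, P=2, iters=300)
```

With P = 2 and nearly orthogonal unit-norm columns, the block spectral radius condition of the next slide is comfortably satisfied, so each iteration descends.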
How large can P be?
P depends on the block spectral radius
  ρ_P = max_{M∈𝓜} ρ(M),
where 𝓜 is the set of all P × P submatrices that can be obtained from AᵀA by selecting exactly one column from each of the P blocks
Theorem
Assume each column of A has unit 2-norm. If ρ_P < 2, GRock with P parallel updates per iteration gives
  F(x + d) − F(x) ≤ ((ρ_P − 2)/(2β)) ‖d‖₂²;
moreover,
  F(x^k) − F(x*) ≤ [2C² (2L + β√(N/P))² / ((2 − ρ_P)β)] · (1/k).
Compare different block selection rules
[Figures: objective decrease (f − f_min) and relative error versus the normalized number of coordinates computed, for cyclic CD, greedy CD, mixed CD, and GRock]
• LASSO test with A ∈ R^{512×1024}, N = 64 column blocks
• GRock uses P = 8 updates each iteration
• greedy CD¹² uses P = 1
• mixed CD¹³ selects P = 8 random blocks and the best coordinate in each block per iteration
¹² Li and Osher [2009]
¹³ Scherrer, Tewari, Halappanavar, and Haglin [2012]
Outline
1. Background: Sparse optimization
2. Background: Parallel benefits, speedup, and overhead
3. Parallelize existing algorithms
4. Primal decomposition / dual decomposition
• Parallel dual gradient ascent (linearized Bregman)
• Parallel ADMM
5. Parallel greedy coordinate descent
6. Numerical results with big data
Test on Rice cluster STIC
specs:
• 170 Appro GreenBlade E5530 nodes, each with two quad-core 2.4GHz Xeon (Nehalem) CPUs
• each node has 12GB of memory shared by all 8 cores
• # of processes used on each node = 8
test dataset:
              A type    A size       λ     sparsity of x*
  dataset I   Gaussian  1024 × 2048  0.1   100
  dataset II  Gaussian  2048 × 4096  0.01  200
iterations vs cores
[Figures: number of iterations versus the number of cores for p D-ADMM, p FISTA, and GRock; left: dataset I, right: dataset II]
Note: we set P = number of cores
time vs cores
[Figures: wall-clock time (s) versus the number of cores for p D-ADMM, p FISTA, and GRock; left: dataset I (annotated P = 16), right: dataset II]
Note: we set P = number of cores
(% communication time) vs cores
[Figures: communication-time percentage versus the number of cores for p D-ADMM, p FISTA, and GRock; left: dataset I, right: dataset II]
Note: we set P = number of cores
big data LASSO on Amazon EC2
Amazon EC2 is an elastic, pay-as-you-go cluster;
advantage: no hardware investment — anyone can open an account
test dataset
• A dense matrix, 20 billion entries, and 170GB size
• x: 200K entries, 4K nonzeros, Gaussian values
requested system
• 20 “high-memory quadruple extra-large instances”
• each instance has 8 cores and 60GB memory
code
• written in C using GSL (for matrix-vector multiplication) and MPI
parallel Dual-ADMM vs FISTA14 vs GRock
                               p D-ADMM   p FISTA   GRock
  estimate stepsize (min.)     n/a        1.6       n/a
  matrix factorization (min.)  51         n/a       n/a
  iteration time (min.)        105        40        1.7
  number of iterations         2500       2500      104
  communication time (min.)    30.7       9.5       0.5
  stopping relative error      1E-1       1E-3      1E-5
  total time (min.)            156        41.6      1.7
  Amazon charge                $85        $22.6     $0.93
• ADMM’s performance depends on penalty parameter β
we picked β as the best out of only a few trials (we cannot afford more)
• parallel Dual ADMM and FISTA were capped at 2500 iterations
• GRock used adaptive P and stopped at relative error 1E-5
14Beck and Teboulle [2009]
Software codes can be found on the author's website.
Funding agencies: US NSF, DoD
Acknowledgements: Qing Ling (USTC) and Zaiwen Wen (SJTU)
References:
B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on
Computing, 24:227–234, 1995.
D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal)
dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences,
100:2197–2202, 2003.
D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):
1289–1306, 2006.
E.J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal
reconstruction from highly incomplete frequency information. IEEE Transactions on
Information Theory, 52(2):489–509, 2006.
G.B. Dantzig and P. Wolfe. Decomposition principle for linear programs. Operations
Research, 8:101–111, 1960.
J.E. Spingarn. Applications of the method of partial inverses to convex programming:
decomposition. Mathematical Programming, 32(2):199–223, 1985.
D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical
Methods, Second Edition. Athena Scientific, 1997.
D.P. Palomar and M. Chiang. A tutorial on decomposition methods for network utility
maximization. IEEE Journal on Selected Areas in Communications, 24(8):
1439–1451, 2006.
W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for
l1-minimization with applications to compressed sensing. SIAM Journal on Imaging
Sciences, 1(1):143–168, 2008.
M.-J. Lai and W. Yin. Augmented l1 and nuclear-norm models with a globally linearly
convergent algorithm. SIAM Journal on Imaging Sciences, 6(2):1059–1091, 2011.
M.P. Friedlander and P. Tseng. Exact regularization of convex programs. SIAM
Journal on Optimization, 18(4):1326–1350, 2007.
W. Yin. Analysis and generalizations of the linearized Bregman method. SIAM
Journal on Imaging Sciences, 3(4):856–877, 2010.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and
statistical learning via the alternating direction method of multipliers. Foundations
and Trends in Machine Learning, 3(1):1–122, 2011.
Y. Li and S. Osher. Coordinate descent optimization for `1 minimization with
application to compressed sensing: a greedy algorithm. Inverse Problems and
Imaging, 3(3):487–503, 2009.
C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin. Feature clustering for
accelerating parallel coordinate descent. In NIPS, pages 28–36, 2012.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear
inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.