Consortium 2010 - University of British Columbia
Transcript of the slide deck (47 pages): Tim Lin, "Sparse optimization and the ℓ1-norm"

Page 1:

SLIM, University of British Columbia
Consortium 2010
Tim Lin: Sparse optimization and the ℓ1-norm
Thursday, December 9, 2010

Page 2:

Convex optimization

• Often written as:
• Where f₀, fᵢ are convex (if they appear)
• Any locally optimal solution is also a globally optimal solution

[Embedded slide, Boyd & Vandenberghe:]

Convex optimization problem

standard form convex optimization problem

    minimize    f₀(x)
    subject to  fᵢ(x) ≤ 0,   i = 1, ..., m
                aᵢᵀx = bᵢ,   i = 1, ..., p

• f₀, f₁, ..., fₘ are convex; equality constraints are affine
• problem is quasiconvex if f₀ is quasiconvex (and f₁, ..., fₘ convex)

often written as

    minimize    f₀(x)
    subject to  fᵢ(x) ≤ 0,   i = 1, ..., m
                Ax = b

important property: feasible set of a convex optimization problem is convex

(Convex optimization problems, 4–6)

Page 3:

(Same slide as Page 2.)

Page 4:

Convex optimization

• Often written as:
• Where f₀, fᵢ are convex (if they appear)
• Steepest descent, Newton, Gauss-Newton, etc. all converge globally, usually quickly

(The embedded Boyd & Vandenberghe slide "Convex optimization problem" from Page 2 is repeated here.)

Page 5:

Convex functions

[Embedded slide, Boyd & Vandenberghe:]

Definition

f : Rⁿ → R is convex if dom f is a convex set and

    f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y)

for all x, y ∈ dom f, 0 ≤ θ ≤ 1

[Figure: a chord between (x, f(x)) and (y, f(y)) lying above the graph]

• f is concave if −f is convex
• f is strictly convex if dom f is convex and

    f(θx + (1 − θ)y) < θ f(x) + (1 − θ) f(y)

for x, y ∈ dom f, x ≠ y, 0 < θ < 1

(Convex functions, 3–2)
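A small numerical spot-check of this definition (an illustrative Python sketch; the sample functions and grids are arbitrary choices, not from the slides):

```python
# Numerical check of the convexity inequality for a few 1-D functions.
import numpy as np

def satisfies_convexity(f, xs, thetas):
    """True if f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) on all sampled pairs."""
    for x in xs:
        for y in xs:
            for t in thetas:
                lhs = f(t * x + (1 - t) * y)
                rhs = t * f(x) + (1 - t) * f(y)
                if lhs > rhs + 1e-12:          # small tolerance for round-off
                    return False
    return True

xs = np.linspace(-3, 3, 25)
thetas = np.linspace(0, 1, 11)
print(satisfies_convexity(np.abs, xs, thetas))     # True:  |x| is convex
print(satisfies_convexity(np.square, xs, thetas))  # True:  x^2 is convex
print(satisfies_convexity(np.sin, xs, thetas))     # False: sin is not convex
```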

Page 6:

Convex functions

[Figure: two example curves, labelled "convex" and "not convex"]

Page 7:

Convex functions

(The definition slide from Page 5 is repeated here.)

Page 8:

Concave functions

(The definition slide from Page 5 is repeated here; the relevant line is "f is concave if −f is convex".)

Page 9:

Example: least-squares

[Embedded slide, Boyd & Vandenberghe:]

Examples

least-squares

    minimize  ‖Ax − b‖₂²

• analytical solution x★ = A†b (A† is the pseudo-inverse)
• can add linear constraints, e.g., l ≤ x ≤ u

linear program with random cost

    minimize    c̄ᵀx + γ xᵀΣx = E[cᵀx] + γ var(cᵀx)
    subject to  Gx ≤ h,  Ax = b

• c is a random vector with mean c̄ and covariance Σ
• hence, cᵀx is a random variable with mean c̄ᵀx and variance xᵀΣx
• γ > 0 is a risk-aversion parameter; controls the trade-off between expected cost and variance (risk)

(Convex optimization problems, 4–23)

A is an m × n matrix: m = n, m > n, m < n
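A minimal sketch of the analytical solution x★ = A†b with NumPy, covering the three shapes of A listed above (m = n, m > n, m < n); the random test matrices are illustrative only:

```python
# Least-squares via the pseudo-inverse, x* = pinv(A) @ b.
# For m > n this is the usual least-squares fit; for m < n it returns the
# minimum-two-norm solution among all x with Ax = b.
import numpy as np

rng = np.random.default_rng(0)

for m, n in [(50, 50), (100, 30), (30, 100)]:        # m = n, m > n, m < n
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)
    x_pinv = np.linalg.pinv(A) @ b
    x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]   # same answer, cheaper
    print(m, n,
          np.linalg.norm(A @ x_pinv - b),            # residual two-norm
          np.allclose(x_pinv, x_lstsq))
```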

Page 10:

Example: least-squares

• Linear/affine transforms are convex
• Norms are convex
• Composition preserves convexity*

(The least-squares slide from Page 9 is repeated here; braces on the slide mark the pieces of ‖Ax − b‖₂² to which each statement applies. Again: A is an m × n matrix: m = n, m > n, m < n.)

Page 11:

Example: Reg. least-squares

• Signal regularization, etc.
• Makes sure values don't spike too high

    minimize    ‖Ax − b‖₂²
    subject to  ‖x‖₂ ≤ τ

Page 12:

Sparse regularization

• Signal regularization, etc.
• Is it a convex optimization problem?

[Embedded slide, Prof. S. Boyd, EE364b, Stanford University:]

Sparse modeling / regressor selection

fit vector b ∈ Rᵐ as a linear combination of k regressors (chosen from n possible regressors)

    minimize    ‖Ax − b‖₂
    subject to  card(x) ≤ k

• gives k-term model
• chooses subset of k regressors that (together) best fit or explain b
• can solve (in principle) by trying all (n choose k) choices (a brute-force sketch follows this slide)
• variations:
  – minimize card(x) subject to ‖Ax − b‖₂ ≤ ε
  – minimize ‖Ax − b‖₂ + λ card(x)
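The brute-force route mentioned on the slide, sketched for a tiny problem; the helper name best_k_term and the test sizes are made up for illustration:

```python
# Brute-force best-subset selection: try every support of size k and keep the
# least-squares fit with the smallest residual (only feasible for tiny n).
from itertools import combinations
import numpy as np

def best_k_term(A, b, k):
    n = A.shape[1]
    best = (np.inf, None)
    for support in combinations(range(n), k):            # all (n choose k) subsets
        cols = list(support)
        coef, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
        resid = np.linalg.norm(A[:, cols] @ coef - b)
        if resid < best[0]:
            x = np.zeros(n)
            x[cols] = coef
            best = (resid, x)
    return best

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 10))
x_true = np.zeros(10); x_true[[2, 7]] = [1.0, -2.0]
b = A @ x_true
resid, x_hat = best_k_term(A, b, k=2)
print(resid, np.flatnonzero(x_hat))   # residual ~0, recovered support {2, 7}
```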

Page 13:

Sparse signal reconstruction

[Embedded slide, Prof. S. Boyd, EE364b, Stanford University:]

• estimate signal x, given
  – noisy measurement y = Ax + v, v ∼ N(0, σ²I) (A is known; v is not)
  – prior information card(x) ≤ k
• maximum likelihood estimate x̂ml is the solution of

    minimize    ‖Ax − y‖₂
    subject to  card(x) ≤ k

Page 14:

Ex: Denoising

• Know the seismic signal is "sparse" in the FK domain, so take A := F
• Replace card(x) ≤ k with card(x) = k:

    minimize    ‖Fx − b‖₂
    subject to  card(x) = k

• Has a well-defined solution via thresholding (i.e., pick the k largest coefficients and zero the rest); see the sketch below
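A sketch of the thresholding solution for an orthonormal F (here a unitary FFT; the 1-D test signal and the choice of k are illustrative): transform the data, keep the k largest-magnitude coefficients, zero the rest, and transform back.

```python
# Denoising by hard thresholding in a transform domain.
# For a unitary F, the minimizer of ||Fx - b||_2 subject to card(x) = k is
# obtained by keeping the k largest-magnitude transform coefficients of b.
import numpy as np

def denoise_hard_threshold(b, k):
    coeffs = np.fft.fft(b, norm="ortho")        # coefficients of b (adjoint of unitary F)
    keep = np.argsort(np.abs(coeffs))[-k:]      # indices of the k largest coefficients
    x = np.zeros_like(coeffs)
    x[keep] = coeffs[keep]                      # zero the rest
    return np.fft.ifft(x, norm="ortho").real, x # synthesize the denoised signal

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256, endpoint=False)
clean = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)
noisy = clean + 0.3 * rng.standard_normal(t.size)
denoised, x = denoise_hard_threshold(noisy, k=4)
print(np.linalg.norm(noisy - clean), np.linalg.norm(denoised - clean))
```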

Page 15:

Don't know k?

• Sometimes (most of the time) k is difficult to predict
• Your best bet is to solve:

    minimize    card(x)
    subject to  ‖Ax − b‖₂ ≤ σ

Page 16:

Don't know k?

(Same slide as Page 15, with the added annotation:) Not Convex

Page 17:

card(x) in 1D

[Figure: the cardinality function of a scalar]

Page 18:

Don't know k?

• Sometimes (most of the time) k is difficult to predict
• Your best bet is to solve:

    minimize    ‖x‖₁
    subject to  ‖Ax − b‖₂ ≤ σ

Page 19:

Approximating sparsity

Argument 1: the convex envelope of card(x) is ‖x‖₁

[Embedded page from Candès, Wakin & Boyd, "Enhancing Sparsity by Reweighted ℓ1 Minimization":]

Figure 3: At the origin, the canonical ℓ0 sparsity count f₀(t) is better approximated by the log-sum penalty function f_log,ε(t) than by the traditional convex ℓ1 relaxation f₁(t).

[...] many possibilities of this nature and they tend to work well (sometimes better than the log-sum penalty). Because of space limitations, however, we will limit ourselves to empirical studies of the performance of the log-sum penalty, and leave the choice of other penalties for further research.

2.5 Historical progression

The development of the reweighted ℓ1 algorithm has an interesting historical parallel with the use of Iteratively Reweighted Least Squares (IRLS) for robust statistical estimation [37-39]. Consider a regression problem Ax = b where the observation matrix A is overdetermined. It was noticed that standard least-squares regression, in which one minimizes ‖r‖₂ where r = Ax − b is the residual vector, lacked robustness vis-a-vis outliers. To defend against this, IRLS was proposed as an iterative method to minimize instead the objective

    min_x  Σᵢ ρ(rᵢ(x)),

where ρ(·) is a penalty function such as the ℓ1 norm [37, 40]. This minimization can be accomplished by solving a sequence of weighted least-squares problems where the weights {wᵢ} depend on the previous residual, wᵢ = ρ′(rᵢ)/rᵢ. For typical choices of ρ this dependence is in fact inversely proportional (large residuals will be penalized less in the subsequent iteration and vice versa), as is the case with our reweighted ℓ1 algorithm. Interestingly, just as IRLS involved iteratively reweighting the ℓ2-norm in order to better approximate an ℓ1-like criterion, our algorithm involves iteratively reweighting the ℓ1-norm in order to better approximate an ℓ0-like criterion.

3 Numerical experiments

We present a series of experiments demonstrating the benefits of reweighting the ℓ1 penalty. We will see that the requisite number of measurements to recover or approximate a signal is typically reduced, in some cases by a substantial amount. We also demonstrate that the reweighting approach is robust and broadly applicable, providing examples of sparse and compressible signal recovery, noise-aware recovery, model selection, error correction, and 2-dimensional total-variation minimization. Meanwhile, we address important issues such as how one can choose ε wisely, how robust the algorithm is to this choice, and how many reweighting iterations are needed for convergence.

Page 20:

Approximating sparsity

Argument 2: minimizing ‖x‖₁ tends to produce many zeros

[Embedded slide, Boyd & Vandenberghe:]

example (m = 100, n = 30): histogram of residuals for penalties

    φ(u) = |u|,   φ(u) = u²,   φ(u) = max{0, |u| − a},   φ(u) = −log(1 − u²)

[Figure: four residual histograms, one per penalty: p = 1, p = 2, deadzone, log barrier]

shape of penalty function has large effect on distribution of residuals

(Approximation and fitting, 6–5)

Page 21:

Approximating sparsity

Argument 3: geometric (in 2D)

[Figure: the ℓ1 ball and the ℓ2 ball]

Page 22:

Approximating sparsity

[Figure: ℓ1 solution vs. ℓ2 solution]

Page 23:

(Same slide as Page 22.)

Page 24:

Approximating sparsity

[Figure: ℓ1 solution vs. ℓ2 solution]
sparse solutions (the ℓ1 case)

Page 25:

Approximating sparsity

[Figure: ℓ1 solution vs. ℓ2 solution]
dense solution (the ℓ2 case)

Page 26:

Solving L1 minimization

    minimize    ‖x‖₁
    subject to  ‖Ax − b‖₂ ≤ σ

• Method 1: SPG-L1 (projection)
• Method 2: Reweighting
• Method 3: Continuation / Huber norm

Page 27:

Solving L1 minimization

(Same slide as Page 26, with the added annotation:)

Why no steepest descent / Newton? Non-differentiable.

Page 28:

Solving L1 minimization

Use SPGL1 (van den Berg & Friedlander, 2008):
- a projected-gradient based method (seismic data volumes are huge)
- uses root-finding to find the final one-norm

[Embedded page from van den Berg & Friedlander, "Probing the Pareto Frontier for Basis Pursuit Solutions", SIAM J. Sci. Comput.:]

Projected gradient. Our application of the SPG algorithm to solve (LSτ) follows Birgin, Martínez, and Raydan [5] closely for the minimization of general nonlinear functions over arbitrary convex sets. The method they propose combines projected-gradient search directions with the spectral step length that was introduced by Barzilai and Borwein [1]. A nonmonotone line search is used to accept or reject steps. The key ingredient of Birgin, Martínez, and Raydan's algorithm is the projection of the gradient direction onto a convex set, which in our case is defined by the constraint in (LSτ). In their recent report, Figueiredo, Nowak, and Wright [27] describe the remarkable efficiency of an SPG method specialized to (QPλ). Their approach builds on the earlier report by Dai and Fletcher [18] on the efficiency of a specialized SPG method for general bound-constrained quadratic programs (QPs).

2. The Pareto curve. The function φ defined by (1.1) yields the optimal value of the constrained problem (LSτ) for each value of the regularization parameter τ. Its graph traces the optimal trade-off between the one-norm of the solution x and the two-norm of the residual r, which defines the Pareto curve. Figure 2.1 shows the graph of φ for a typical problem.

The Newton-based root-finding procedure that we propose for locating specific points on the Pareto curve, e.g., finding roots of (1.2), relies on several important properties of the function φ. As we show in this section, φ is a convex and differentiable function of τ. The differentiability of φ is perhaps unintuitive, given that the one-norm constraint in (LSτ) is not differentiable. To deal with the nonsmoothness of the one-norm constraint, we appeal to Lagrange duality theory. This approach yields significant insight into the properties of the trade-off curve. We discuss the most important properties below.

2.1. The dual subproblem. The dual of the Lasso problem (LSτ) plays a prominent role in understanding the Pareto curve. In order to derive the dual of (LSτ), we first recast (LSτ) as the equivalent problem

    (2.1)  minimize over r, x:  ‖r‖₂   subject to  Ax + r = b,  ‖x‖₁ ≤ τ.

[Fig. 2.1: A typical Pareto curve (solid line) showing two iterations of Newton's method; the first iteration is available at no cost. Axes: one-norm of solution vs. two-norm of residual. Annotations on the slide mark the feasible region and the Pareto curve.]

Page 29:

Solving L1 minimization

    minimize    ‖x‖₁
    subject to  ‖Ax − b‖₂ ≤ σ

(The van den Berg & Friedlander page from Page 28 is repeated here, with annotations on the Pareto-curve figure: the level σ, and the desired point, the feasible solution with smallest ‖x‖₁.)

Page 30:

Solving L1 minimization

    minimize    ‖x‖₁
    subject to  ‖Ax − b‖₂ ≤ σ

(The van den Berg & Friedlander page from Page 28 is repeated here, with the added annotation:) the derivative of the Pareto curve is given by ‖Aᵀr‖∞ (normalized by ‖r‖₂, where r is the residual). A sketch of the resulting Newton root-finding loop follows below.
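A sketch of the root-finding idea: Newton iterations on φ(τ) = σ, using the Pareto-curve slope quoted above. The helper solve_lasso is a stand-in for any solver of the ‖x‖₁ ≤ τ subproblem (such as the projected-gradient loop sketched on a later page); this is not the interface of the SPGL1 package itself.

```python
# Newton root-finding on the Pareto curve (sketch of the outer loop).
# phi(tau) is the optimal value of:  minimize ||Ax - b||_2  s.t.  ||x||_1 <= tau,
# and its derivative at the solution is  phi'(tau) = -||A^T r||_inf / ||r||_2.
import numpy as np

def newton_on_pareto(A, b, sigma, solve_lasso, max_iter=20, tol=1e-6):
    tau = 0.0                                    # first iterate: x = 0, r = b
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        x = solve_lasso(A, b, tau)               # LASSO subproblem at this tau
        r = b - A @ x
        phi = np.linalg.norm(r)
        if abs(phi - sigma) <= tol * max(1.0, sigma):
            break
        dphi = -np.linalg.norm(A.T @ r, np.inf) / phi   # Pareto-curve slope
        tau = tau + (sigma - phi) / dphi         # Newton step on phi(tau) = sigma
    return x, tau
```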

Page 31:

Solving L1 minimization

Original problem breaks down into a series of new problems, each solved by projected gradient:

    minimize    ‖Ax − b‖₂
    subject to  ‖x‖₁ ≤ τ

[Embedded page from van den Berg & Friedlander, showing Fig. 6.1: corrupted and interpolated images for problem "seismic"; (a) image with missing traces, (b) interpolated image, (c) the Pareto curve and the solution path taken by SPGL1.]

[...] however, as might be expected of an interior-point method based on a conjugate-gradient linear solver, it can require many matrix-vector products.

It may be progressively more difficult to solve (QPλ) as λ → 0 because the regularizing effect from the one-norm term tends to become negligible, and there is less control over the norm of the solution. In contrast, the (LSτ) formulation is guaranteed to maintain a bounded solution norm for all values of τ.

6.4. Sampling the Pareto curve. In situations where little is known about the noise level σ, it may be useful to visualize the Pareto curve in order to understand the trade-offs between the norms of the residual and the solution. In this section we aim to obtain good approximations to the Pareto curve for cases in which it is prohibitively expensive to compute it in its entirety.

We test two approaches for interpolation through a small set of samples i = 1, ..., k. In the first, we generate a uniform distribution of parameters λᵢ = (i/k)‖Aᵀb‖∞ and solve the corresponding problems (QPλᵢ). In the second, we generate a uniform distribution of parameters σᵢ = (i/k)‖b‖₂ and solve the corresponding problems (BPσᵢ). We leverage the convexity and differentiability of the Pareto curve to approximate it with piecewise cubic polynomials that match function and derivative values at each end. When a nonconvex fit is detected, we switch to a quadratic interpolation [...]

Page 32:

Projected Gradient

    minimize    ‖Ax − b‖₂
    subject to  ‖x‖₁ ≤ τ

[Figure: the ℓ1 ball (‖x‖₁ = τ) and the set of possible solutions of Ax = b]

Steepest descent direction: Aᵀ(Ax − b)

Page 33:

Projected Gradient

    minimize    ‖Ax − b‖₂
    subject to  ‖x‖₁ ≤ τ

[Figure: the ℓ1 ball (‖x‖₁ = τ)]

Subtract away the shortest L2 distance to the ball

Page 34:

Projected Gradient

    minimize    ‖Ax − b‖₂
    subject to  ‖x‖₁ ≤ τ

[Figure: the ℓ1 ball (‖x‖₁ = τ)]

"projected gradient"

Page 35:

Projected Gradient

    minimize    ‖Ax − b‖₂
    subject to  ‖x‖₁ ≤ τ

[Figure: the ℓ1 ball (‖x‖₁ = τ)]

Step size: "spectral gradient method"

    α = (Δxᵀ Δx) / (Δxᵀ Δg)
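The projection step can be computed by a standard sort-based routine, sketched below following the usual simplex-projection recipe; SPGL1 has its own projection code, so treat this as one common choice rather than the method's exact implementation. It assumes τ > 0.

```python
# Euclidean projection onto the l1 ball {x : ||x||_1 <= tau} (sort-based sketch).
import numpy as np

def project_l1_ball(v, tau):
    if np.sum(np.abs(v)) <= tau:
        return v.copy()                          # already inside the ball
    u = np.sort(np.abs(v))[::-1]                 # magnitudes, sorted descending
    css = np.cumsum(u)
    ks = np.arange(1, v.size + 1)
    rho = np.max(np.nonzero(u > (css - tau) / ks)[0])  # last index where condition holds
    theta = (css[rho] - tau) / (rho + 1)         # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```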

Page 36:

SPGL1

"Spectral Projected-Gradient for L1 problems"

[Flowchart:]
1. Initialize with zeroes
2. Get gradient (apply adjoint)
3. Scale by spectral step-length
4. Project onto current L1 ball
5. Converged? (check improvement): if No, return to step 2
6. Residual norm small enough? If No, increase the allowed L1 norm and return to step 2; if Yes, terminate

(A minimal sketch of this loop follows below.)
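A heavily simplified sketch of the inner loop in the flowchart: gradient by applying the adjoint, a Barzilai-Borwein spectral step, and projection onto the current ℓ1 ball. The projection is passed in (e.g., project_l1_ball from the previous page's sketch), and the real SPGL1 adds a nonmonotone line search and proper stopping tests. A loop like this could serve as the solve_lasso helper in the Newton sketch above.

```python
# Minimal spectral projected-gradient sketch for:
#   minimize ||Ax - b||_2  subject to  ||x||_1 <= tau.
import numpy as np

def spg_lasso(A, b, tau, project, n_iter=200):
    """project(v, tau) must return the Euclidean projection of v onto the
    l1 ball of radius tau (e.g., project_l1_ball from the earlier sketch)."""
    x = np.zeros(A.shape[1])                     # initialize with zeroes
    g = A.T @ (A @ x - b)                        # gradient: apply adjoint to residual
    alpha = 1.0                                  # initial step length
    for _ in range(n_iter):
        x_new = project(x - alpha * g, tau)      # step, then project onto the l1 ball
        g_new = A.T @ (A @ x_new - b)
        dx, dg = x_new - x, g_new - g
        if np.dot(dx, dg) > 0:                   # spectral (Barzilai-Borwein) step length
            alpha = np.dot(dx, dx) / np.dot(dx, dg)
        x, g = x_new, g_new
    return x
```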

Page 37:

Method 2: Reweighting

1st-order approximation of a log function:

    log(x) ≈ log(x_k) + (1/x_k)(x − x_k)

(The Candès, Wakin & Boyd excerpt from Page 19 is shown again here; the relevant passage is section 2.5 on the parallel between IRLS and reweighted ℓ1.)

Page 38:

Method 2: Reweighting

IRLS: Iteratively re-weighted least-squares

    minimize    ‖ diag(1 / |x_{k−1}|) x ‖₂
    subject to  ‖Ax − b‖₂ ≤ σ

where x_{k−1} is obtained from the previous iteration. (A sketch follows below.)
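A sketch of the reweighting idea in penalized (unconstrained) form, which keeps each subproblem a plain linear solve: weights 1/(|x_{k−1}| + ε) make the weighted two-norm penalty mimic the one-norm. The parameters lam, eps and the iteration count are illustrative; the constrained form on the slide would instead re-solve the ‖Ax − b‖₂ ≤ σ problem with these weights.

```python
# IRLS-style reweighting: repeatedly solve a weighted least-squares problem
# whose weights come from the previous iterate.
import numpy as np

def irls_sparse(A, b, lam=1e-2, eps=1e-6, n_iter=30):
    x = np.linalg.lstsq(A, b, rcond=None)[0]     # start from plain least squares
    for _ in range(n_iter):
        w = 1.0 / (np.abs(x) + eps)              # weights from the previous iterate
        # solve  (A^T A + lam * diag(w)) x = A^T b
        x = np.linalg.solve(A.T @ A + lam * np.diag(w), A.T @ b)
    return x
```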

Page 39:

Method 3: Continuation

Page 40:

Method 3: Continuation

"Huber Norm" [figure]

Page 41:

Method 3: Continuation

"Huber Norm" [figure; the two linear flanks are annotated "Zero Hessian"]

Page 42:

(Same slide as Page 41.)

Page 43:

Method 3: Continuation

Gradual shrinkage of curvature

Page 44:

(Same slide as Page 43.)

Page 45:

Method 3: Continuation

Gradual shrinkage of curvature rapidly accelerates convergence. (A sketch follows below.)
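A sketch of the Huber surrogate and the continuation schedule: the penalty is quadratic for |t| ≤ μ and linear beyond, so it is differentiable, and shrinking μ toward zero across outer sweeps recovers an ℓ1-like penalty. The penalized objective, step-size rule, and μ schedule below are illustrative choices, not the slides' exact setup.

```python
# Huber surrogate for |t| and a simple continuation loop on mu.
import numpy as np

def huber(t, mu):
    a = np.abs(t)
    return np.where(a <= mu, 0.5 * t**2 / mu, a - 0.5 * mu)

def huber_grad(t, mu):
    return np.clip(t / mu, -1.0, 1.0)            # derivative of huber w.r.t. t

def huber_continuation(A, b, lam=0.1, mus=(1.0, 0.3, 0.1, 0.03, 0.01), n_inner=200):
    """Gradient descent on 0.5*||Ax - b||^2 + lam * sum_i huber(x_i, mu),
    gradually shrinking mu (the curvature region) across outer sweeps."""
    x = np.zeros(A.shape[1])
    L_A = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the data term
    for mu in mus:                               # gradual shrinkage of curvature
        step = 1.0 / (L_A + lam / mu)            # huber_grad is (1/mu)-Lipschitz
        for _ in range(n_inner):
            g = A.T @ (A @ x - b) + lam * huber_grad(x, mu)
            x = x - step * g
    return x
```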

Page 46:

Summary

• Use sparsity to exploit structure & a-priori knowledge about the solution
• Not convex, ergo ℓ1
• Not differentiable, ergo tricks
• Three main classes of methods: Projection, Re-weighting, Continuation

Page 47:

Acknowledgements

This work was in part financially supported by the Natural Sciences and Engineering Research Council of Canada Discovery Grant (22R81254) and the Collaborative Research and Development Grant DNOISE II (375142-08). This research was carried out as part of the SINBAD II project with support from the following organizations: BG Group, BP, Chevron, ConocoPhillips, Petrobras, Total SA, and WesternGeco.

SLIM

• BP EPT for motivation & feedback
• Material based on Boyd & Vandenberghe