Transcript of "Sparse Representation and Optimization Methods for L1-regularized Problems" (cjlin/courses/optml2014/course_l1.pdf, 60 slides)

Page 1

Sparse Representation and Optimization Methods for L1-regularized Problems

Chih-Jen Lin
Department of Computer Science

National Taiwan University

Chih-Jen Lin (National Taiwan Univ.) 1 / 60

Page 2

Outline

Sparse Representation
Existing Optimization Methods
Coordinate Descent Methods
Other Methods
Experiments

Chih-Jen Lin (National Taiwan Univ.) 2 / 60

Page 3

Sparse Representation

Outline

Sparse Representation
Existing Optimization Methods
Coordinate Descent Methods
Other Methods
Experiments

Chih-Jen Lin (National Taiwan Univ.) 3 / 60

Page 4

Sparse Representation

Sparse Representation

A mathematical way to model a signal, an image, or a document is

y = Xw = w1 [x11, ..., xl1]^T + · · · + wn [x1n, ..., xln]^T

A signal is a linear combination of others

X and y are given

We would like to find w with as few non-zeros as possible (sparsity)
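As a concrete (made-up) illustration of this model in numpy: the dictionary X and the sparse coefficient vector w below are arbitrary, and y is formed as the linear combination described above.

```python
import numpy as np

# Made-up sizes: l-dimensional signals, n columns (atoms) in X
l, n = 100, 50
rng = np.random.default_rng(0)
X = rng.standard_normal((l, n))      # columns are the known signals

w = np.zeros(n)
w[[3, 17, 42]] = [1.5, -2.0, 0.7]    # only a few non-zero coefficients (sparsity)

y = X @ w                            # observed signal: a sparse linear combination
```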

Chih-Jen Lin (National Taiwan Univ.) 4 / 60

Page 5

Sparse Representation

Example: Image Deblurring

Consider y = Hz

z: original image, H: blur operation

y: observed image

Assume z = Dw

with known dictionary D

Try to min_w ‖y − HDw‖ and get w

Chih-Jen Lin (National Taiwan Univ.) 5 / 60

Page 6

Sparse Representation

Example: Image Deblurring (Cont’d)

We hope w has few non-zeros as each image is generated using only several columns of the dictionary

The restored image is Dw

Chih-Jen Lin (National Taiwan Univ.) 6 / 60

Page 7

Sparse Representation

Example: Face Recognition

Assume a face image is a combination of the same person's other images:

[x11, ..., xl1]^T: 1st image,   [x12, ..., xl2]^T: 2nd image, . . .

l : number of pixels in a face image

Given a face image y and collections of two persons' faces X1 and X2

Chih-Jen Lin (National Taiwan Univ.) 7 / 60

Page 8

Sparse Representation

Example: Face Recognition (Cont’d)

If min_w ‖y − X1w‖ < min_w ‖y − X2w‖,

predict y as the first person

We hope w has few non-zeros as noisy images shouldn't be used
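A minimal numpy sketch of this decision rule; the image matrices X1, X2 and the query y are synthetic placeholders, and plain least squares is used here instead of a sparsity-inducing solver.

```python
import numpy as np

def residual(y, X):
    # Smallest achievable ‖y − Xw‖ via ordinary least squares
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.linalg.norm(y - X @ w)

rng = np.random.default_rng(0)
X1 = rng.standard_normal((64, 5))    # columns: images of person 1 (synthetic)
X2 = rng.standard_normal((64, 5))    # columns: images of person 2 (synthetic)
y = X1 @ np.array([0.4, 0.0, 0.6, 0.0, 0.0])   # query built from person 1's images

prediction = 1 if residual(y, X1) < residual(y, X2) else 2
print(prediction)                    # -> 1
```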

Chih-Jen Lin (National Taiwan Univ.) 8 / 60

Page 9

Sparse Representation

Example: Feature Selection

Given

X = [ x11 . . . x1n ; ... ; xl1 . . . xln ],   [xi1 . . . xin]: ith document

yi = +1 or −1 (two classes)

We hope to find w such that

wTxi > 0 if yi = 1,   wTxi < 0 if yi = −1

Chih-Jen Lin (National Taiwan Univ.) 9 / 60

Page 10

Sparse Representation

Example: Feature Selection (Cont’d)

Try to

min_w  Σ_{i=1}^{l} exp(−yi wTxi)

and hope that w is sparse

That is, we assume that each document is generated from important features

wi ≠ 0 ⇒ important features
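A small numpy sketch of this objective (the data X, y are placeholders; no regularization term yet, matching the slide):

```python
import numpy as np

def exp_loss(w, X, y):
    # sum_i exp(-y_i * w^T x_i); rows of X are documents, y in {+1, -1}
    return np.sum(np.exp(-y * (X @ w)))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
y = np.sign(X[:, 0])
print(exp_loss(np.zeros(30), X, y))  # = l at w = 0, since exp(0) = 1 for every document
```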

Chih-Jen Lin (National Taiwan Univ.) 10 / 60

Page 11

Sparse Representation

L1-norm Minimization I

Finding w with the smallest number of non-zeros is difficult

‖w‖0 : number of nonzeros

Instead, L1-norm minimization

min_w  C‖y − Xw‖² + ‖w‖1

C : a parameter given by users

This is a regression problem
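The slides do not fix a particular solver at this point; as one illustration, here is a short proximal-gradient (ISTA-style) sketch with soft-thresholding for this objective. The step size, iteration count, and data are made-up choices.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_l1(X, y, C, n_iter=500):
    # minimize  C * ||y - X w||^2 + ||w||_1  by proximal gradient
    w = np.zeros(X.shape[1])
    step = 1.0 / (2.0 * C * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = -2.0 * C * (X.T @ (y - X @ w))            # gradient of the smooth part
        w = soft_threshold(w - step * grad, step)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
w_true = np.zeros(50)
w_true[[3, 17, 42]] = [1.5, -2.0, 0.7]
w_hat = ista_l1(X, X @ w_true, C=10.0)
print(np.count_nonzero(w_hat))   # soft-thresholding drives many coordinates to exact zero
```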

Chih-Jen Lin (National Taiwan Univ.) 11 / 60

Page 12

Sparse Representation

L1-norm Minimization II

1-norm versus 2-norm

‖w‖1 = |w1| + · · · + |wn|,   ‖w‖2² = w1² + · · · + wn²

[Two figures: |w| versus w (1-norm) and w² versus w (2-norm)]

Chih-Jen Lin (National Taiwan Univ.) 12 / 60

Page 13

Sparse Representation

L1-norm Minimization III

If using 2-norm, all wi are non-zeros

Using 1-norm, many wi may be zeros

Smaller C, better sparsity

Chih-Jen Lin (National Taiwan Univ.) 13 / 60

Page 14

Sparse Representation

L1-regularized Classifier

Training data {yi, xi}, xi ∈ R^n, i = 1, . . . , l, yi = ±1
l: # of data, n: # of features

min_w  ‖w‖1 + C Σ_{i=1}^{l} ξ(w; xi, yi)

ξ(w; xi, yi): loss function

Logistic loss: log(1 + exp(−y wTx))

L1 and L2 losses: max(1 − y wTx, 0) and max(1 − y wTx, 0)²

We do not consider kernels

Chih-Jen Lin (National Taiwan Univ.) 14 / 60
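As an illustration, a small numpy sketch of this objective with the logistic loss (the data and C are placeholders):

```python
import numpy as np

def l1_logreg_objective(w, X, y, C):
    # f(w) = ||w||_1 + C * sum_i log(1 + exp(-y_i * w^T x_i))
    margins = y * (X @ w)
    loss = np.sum(np.logaddexp(0.0, -margins))   # stable log(1 + e^{-m})
    return np.sum(np.abs(w)) + C * loss

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = np.sign(X[:, 0])
print(l1_logreg_objective(np.zeros(20), X, y, C=1.0))   # = C * l * log(2) at w = 0
```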

Page 15

Sparse Representation

L1-regularized Classifier (Cont’d)

‖w‖1 not differentiable ⇒ causes difficulties in optimization

Loss functions: logistic loss twice differentiable, L2 loss differentiable, and L1 loss not differentiable

We focus on logistic and L2 loss

Sometimes a bias term is added

wTx ⇒ wTx + b

Chih-Jen Lin (National Taiwan Univ.) 15 / 60

Page 16

Sparse Representation

L1-regularized Classifier (Cont’d)

Many available methods; we review existing methods and show details of some methods

Notation:

f(w) ≡ ‖w‖1 + C Σ_{i=1}^{l} ξ(w; xi, yi)

is the function to be minimized, and

L(w) ≡ C Σ_{i=1}^{l} ξ(w; xi, yi).

We do not discuss L1-regularized regression, which has recently become another hot topic

Chih-Jen Lin (National Taiwan Univ.) 16 / 60

Page 17

Existing Optimization Methods

Outline

Sparse Representation
Existing Optimization Methods
Coordinate Descent Methods
Other Methods
Experiments

Chih-Jen Lin (National Taiwan Univ.) 17 / 60

Page 18

Existing Optimization Methods

Decomposition Methods

Working on some variables at a time

Cyclic coordinate descent methods

Working variables sequentially or randomly selected

One-variable case:

min_z  f(w + z ej) − f(w)

ej: indicator vector for the jth element

Examples: Goodman (2004); Genkin et al. (2007); Balakrishnan and Madigan (2005); Tseng and Yun (2009); Shalev-Shwartz and Tewari (2009); Duchi and Singer (2009); Wright (2012)

Chih-Jen Lin (National Taiwan Univ.) 18 / 60

Page 19

Existing Optimization Methods

Decomposition Methods (Cont’d)

Gradient-based working set selection

Higher cost per iteration; larger working set

Examples: Shevade and Keerthi (2003); Tseng and Yun (2009); Yun and Toh (2011)

Active set method

Working set the same as the set of non-zero w elements

Examples: Perkins et al. (2003)

Chih-Jen Lin (National Taiwan Univ.) 19 / 60

Page 20

Existing Optimization Methods

Constrained Optimization

Replace w with w+ − w−:

min_{w+, w−}  Σ_{j=1}^{n} w+_j + Σ_{j=1}^{n} w−_j + C Σ_{i=1}^{l} ξ(w+ − w−; xi, yi)

s.t. w+_j ≥ 0, w−_j ≥ 0, j = 1, . . . , n.

Any bound-constrained optimization method can be used

Examples: Schmidt et al. (2009) used Gafni and Bertsekas (1984); Kazama and Tsujii (2003) used Benson and More (2001); we have considered Lin and More (1999); Koh et al. (2007): interior point method
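One way to use this reformulation, sketched below with SciPy's generic L-BFGS-B bound-constrained solver and the logistic loss; this is only an illustration (not one of the specific methods cited above), and the data and C are placeholders.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def objective(u, X, y, C):
    # u = [w+, w-]; value = sum(w+) + sum(w-) + C * logistic loss at w = w+ - w-
    n = X.shape[1]
    wp, wm = u[:n], u[n:]
    margins = y * (X @ (wp - wm))
    loss = np.sum(np.logaddexp(0.0, -margins))
    grad_w = C * (X.T @ (-y * expit(-margins)))           # dL/dw
    grad = np.concatenate([1.0 + grad_w, 1.0 - grad_w])   # d/dw+, d/dw-
    return np.sum(wp) + np.sum(wm) + C * loss, grad

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 40))
y = np.sign(X[:, 0])
n = X.shape[1]
res = minimize(objective, np.zeros(2 * n), args=(X, y, 1.0), jac=True,
               method="L-BFGS-B", bounds=[(0.0, None)] * (2 * n))
w = res.x[:n] - res.x[n:]
print(np.count_nonzero(np.abs(w) > 1e-6))
```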

Chih-Jen Lin (National Taiwan Univ.) 20 / 60

Page 21

Existing Optimization Methods

Constrained Optimization (Cont’d)

Equivalent problem with non-smooth constraints:

min_w  Σ_{i=1}^{l} ξ(w; xi, yi)

subject to ‖w‖1 ≤ K.

C replaced by a corresponding K

Goes back to LASSO (Tibshirani, 1996) if y ∈ R and least-square loss

Examples: Kivinen and Warmuth (1997); Lee et al. (2006); Donoho and Tsaig (2008); Duchi et al. (2008); Liu et al. (2009); Kim and Kim (2004); Kim et al. (2008); Roth (2004)

Chih-Jen Lin (National Taiwan Univ.) 21 / 60

Page 22

Existing Optimization Methods

Other Methods

Expectation maximization: Figueiredo (2003); Krishnapuram et al. (2004, 2005)

Stochastic gradient descent: Langford et al. (2009); Shalev-Shwartz and Tewari (2009)

Modified quasi Newton: Andrew and Gao (2007); Yu et al. (2010)

Hybrid: easy method first and then interior-point for faster local convergence (Shi et al., 2010)

Chih-Jen Lin (National Taiwan Univ.) 22 / 60

Page 23

Existing Optimization Methods

Other Methods (Cont’d)

Quadratic approximation followed by coordinate descent: Krishnapuram et al. (2005); Friedman et al. (2010); Yuan et al. (2012); a kind of Newton approach

Cutting plane method: Teo et al. (2010)

Some methods find a solution path for different C values; e.g., Rosset (2005), Zhao and Yu (2007), Park and Hastie (2007), and Keerthi and Shevade (2007).

Here we focus on a single C

Chih-Jen Lin (National Taiwan Univ.) 23 / 60

Page 24

Existing Optimization Methods

Strengths and Weaknesses of Existing Methods

Convergence speed: higher-order methods (quasi Newton or Newton) have fast local convergence, but fail to obtain a reasonable model quickly

Implementation efforts: higher-order methods usually more complicated

Large data: if solving linear systems is needed, use iterative methods (e.g., CG) instead of direct methods

Feature correlation: methods working on some variables at a time (e.g., decomposition methods) may be efficient if features are almost independent

Chih-Jen Lin (National Taiwan Univ.) 24 / 60

Page 25

Coordinate Descent Methods

Outline

Sparse Representation
Existing Optimization Methods
Coordinate Descent Methods
Other Methods
Experiments

Chih-Jen Lin (National Taiwan Univ.) 25 / 60

Page 26

Coordinate Descent Methods

Coordinate Descent Methods I

Minimizing the one-variable function

gj(z) ≡ |wj + z| − |wj| + L(w + z ej) − L(w),

where

ej ≡ [0, . . . , 0, 1, 0, . . . , 0]^T  (the single 1 is in the jth position).

No closed form solution

Genkin et al. (2007), Shalev-Shwartz and Tewari (2009), and Yuan et al. (2010)

Chih-Jen Lin (National Taiwan Univ.) 26 / 60

Page 27

Coordinate Descent Methods

Coordinate Descent Methods II

They differ in how to minimize this one-variable problem

While gj(z) is not differentiable, we can have a form similar to Taylor expansion:

gj(z) = gj(0) + g'j(0) z + (1/2) g''j(ηz) z²,

Chih-Jen Lin (National Taiwan Univ.) 27 / 60

Page 28

Coordinate Descent Methods

Coordinate Descent Methods III

Another representation (for our derivation)

min_z  gj(z) = |wj + z| − |wj| + Lj(z; w) − Lj(0; w),

where

Lj(z; w) ≡ L(w + z ej)

is a function of z

Chih-Jen Lin (National Taiwan Univ.) 28 / 60

Page 29

Coordinate Descent Methods

BBR (Genkin et al., 2007) I

They rewrite gj(z) as

gj(z) = gj(0) + g'j(0) z + (1/2) g''j(ηz) z²,

where 0 < η < 1,

g'j(0) ≡ L'j(0) + 1 if wj > 0,   L'j(0) − 1 if wj < 0,   (1)

and g''j(ηz) ≡ L''j(ηz).

Chih-Jen Lin (National Taiwan Univ.) 29 / 60

Page 30

Coordinate Descent Methods

BBR (Genkin et al., 2007) II

gj(z) is not differentiable if wj = 0

BBR finds an upper bound Uj of g''j(z) in a trust region:

Uj ≥ g''j(z), ∀ |z| ≤ ∆j.

Then ḡj(z) is an upper-bound function of gj(z):

ḡj(z) ≡ gj(0) + g'j(0) z + (1/2) Uj z².

Any step z satisfying ḡj(z) < ḡj(0) leads to

gj(z) − gj(0) = gj(z) − ḡj(0) ≤ ḡj(z) − ḡj(0) < 0,

Chih-Jen Lin (National Taiwan Univ.) 30 / 60

Page 31

Coordinate Descent Methods

BBR (Genkin et al., 2007) III

Convergence not proved (no sufficient decrease condition via line search)

Logistic loss

Uj ≡ C Σ_{i=1}^{l} x²ij F(yi wTxi, ∆j |xij|),

where

F(r, δ) = 0.25 if |r| ≤ δ,   F(r, δ) = 1 / (2 + e^(|r|−δ) + e^(δ−|r|)) otherwise.

Chih-Jen Lin (National Taiwan Univ.) 31 / 60

Page 32

Coordinate Descent Methods

BBR (Genkin et al., 2007) IV

The sub-problem solved in practice:

min_z  ḡj(z)

s.t. |z| ≤ ∆j and wj + z ≥ 0 if wj > 0,   wj + z ≤ 0 if wj < 0.

Update rule:

d = min( max( P(−g'j(0)/Uj, wj), −∆j ), ∆j ),

Chih-Jen Lin (National Taiwan Univ.) 32 / 60

Page 33

Coordinate Descent Methods

BBR (Genkin et al., 2007) V

where

P(z, w) ≡ z if sgn(w + z) = sgn(w),   −w otherwise.
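Putting the pieces of the last few slides together, a rough Python sketch of one BBR-style update of coordinate j for the logistic loss; the helper names are mine, and the wj = 0 case and the trust-region update of ∆j are omitted.

```python
import numpy as np

def F(r, delta):
    # Upper-bound factor for the logistic loss (previous slide)
    if abs(r) <= delta:
        return 0.25
    return 1.0 / (2.0 + np.exp(abs(r) - delta) + np.exp(delta - abs(r)))

def P(z, w):
    # Keep w_j + z from crossing zero
    return z if np.sign(w + z) == np.sign(w) else -w

def bbr_step(j, w, X, y, C, delta_j):
    margins = y * (X @ w)
    tau = 1.0 / (1.0 + np.exp(-margins))                       # τ(y_i w^T x_i)
    Lp = C * np.sum(y * X[:, j] * (tau - 1.0))                 # L'_j(0)
    gp = Lp + 1.0 if w[j] > 0 else Lp - 1.0                    # g'_j(0); w_j = 0 needs its own case
    Uj = C * sum(X[i, j] ** 2 * F(margins[i], delta_j * abs(X[i, j]))
                 for i in range(X.shape[0]))
    return min(max(P(-gp / Uj, w[j]), -delta_j), delta_j)      # step d from the update rule
```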

Chih-Jen Lin (National Taiwan Univ.) 33 / 60

Page 34

Coordinate Descent Methods

SCD (Shalev-Shwartz and Tewari, 2009) I

SCD: stochastic coordinate descent

w = w+ − w−

At each step, randomly select a variable from

{w+_1, . . . , w+_n, w−_1, . . . , w−_n}

One-variable sub-problem:

min_z  gj(z) ≡ z + Lj(z; w+ − w−) − Lj(0; w+ − w−),

subject to

w^{k,+}_j + z ≥ 0 or w^{k,−}_j + z ≥ 0,

Chih-Jen Lin (National Taiwan Univ.) 34 / 60

Page 35

Coordinate Descent Methods

SCD (Shalev-Shwartz and Tewari, 2009) II

Second-order approximation similar to BBR:

gj(z) = gj(0) + g'j(0) z + (1/2) Uj z²,

where

g'j(0) = 1 + L'j(0) for w+_j,   g'j(0) = 1 − L'j(0) for w−_j,

and Uj ≥ g''j(z), ∀z.

BBR: Uj an upper bound of g''j(z) in the trust region

SCD: global upper bound

Chih-Jen Lin (National Taiwan Univ.) 35 / 60

Page 36

Coordinate Descent Methods

SCD (Shalev-Shwartz and Tewari, 2009) III

For logistic regression,

Uj = 0.25 C Σ_{i=1}^{l} x²ij ≥ g''j(z), ∀z.

Shalev-Shwartz and Tewari (2009) assume −1 ≤ xij ≤ 1, ∀ i, j, so a simple upper bound is

Uj = 0.25 C l.
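A small sketch of one SCD update of w+_j under this global bound (placeholder data; the analogous w−_j update and the stopping rule are omitted):

```python
import numpy as np

def scd_step_plus(j, wp, wm, X, y, C):
    # One update of w+_j with the global bound U_j = 0.25 * C * l (assumes |x_ij| <= 1)
    margins = y * (X @ (wp - wm))
    Lp = C * np.sum(y * X[:, j] * (1.0 / (1.0 + np.exp(-margins)) - 1.0))  # L'_j(0)
    gp = 1.0 + Lp                       # g'_j(0) for the w+ part
    Uj = 0.25 * C * X.shape[0]
    z = max(-wp[j], -gp / Uj)           # minimize the quadratic bound, keep w+_j >= 0
    wp[j] += z
    return wp
```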

Chih-Jen Lin (National Taiwan Univ.) 36 / 60

Page 37

Coordinate Descent Methods

CDN (Yuan et al., 2010) I

Newton step:

min_z  g'j(0) z + (1/2) g''j(0) z².

That is,

min_z  |wj + z| − |wj| + L'j(0) z + (1/2) L''j(0) z².

Second-order term not replaced by an upper bound

Function value may not be decreasing

Tighter second-order term is useful (will see later)

Chih-Jen Lin (National Taiwan Univ.) 37 / 60

Page 38

Coordinate Descent Methods

CDN (Yuan et al., 2010) II

Assume z is the optimal solution; need line search

Following Tseng and Yun (2009)

gj(λz) − gj(0) ≤ σλ (L'j(0) z + |wj + z| − |wj|),

This is slightly different from the traditional form of line search. Now

|wj + z| − |wj|

must be taken into consideration

Convergence can be proved
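A rough Python sketch of the CDN step described above: the closed-form minimizer of the one-variable second-order model with the one-norm term, followed by the Tseng–Yun style backtracking line search. The constants σ and β and the helper names are my own choices.

```python
import numpy as np

def newton_direction(Lp, Lpp, wj):
    # Closed-form minimizer of |w_j + z| - |w_j| + L'_j(0) z + 0.5 L''_j(0) z^2
    if Lp + 1.0 <= Lpp * wj:
        return -(Lp + 1.0) / Lpp
    if Lp - 1.0 >= Lpp * wj:
        return -(Lp - 1.0) / Lpp
    return -wj

def cdn_step(j, w, X, y, C, sigma=0.01, beta=0.5, max_ls=20):
    margins = y * (X @ w)
    tau = 1.0 / (1.0 + np.exp(-margins))                    # τ(y_i w^T x_i)
    Lp = C * np.sum(y * X[:, j] * (tau - 1.0))              # L'_j(0)
    Lpp = C * np.sum(X[:, j] ** 2 * tau * (1.0 - tau))      # L''_j(0)
    z = newton_direction(Lp, Lpp, w[j])
    decrease = Lp * z + abs(w[j] + z) - abs(w[j])           # right-hand side of the condition
    loss0 = np.sum(np.logaddexp(0.0, -margins))
    lam = 1.0
    for _ in range(max_ls):                                 # backtracking line search
        step = lam * z
        loss1 = np.sum(np.logaddexp(0.0, -(y * (X @ w + step * X[:, j]))))
        lhs = abs(w[j] + step) - abs(w[j]) + C * (loss1 - loss0)   # g_j(λz) - g_j(0)
        if lhs <= sigma * lam * decrease:
            break
        lam *= beta
    w[j] += lam * z
    return w
```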

Chih-Jen Lin (National Taiwan Univ.) 38 / 60

Page 39

Coordinate Descent Methods

Calculating First and Second Order Information I

We have

L'j(0) = dL(w + z ej)/dz |_{z=0} = ∇j L(w)

L''j(0) = d²L(w + z ej)/dz² |_{z=0} = ∇²jj L(w)

Chih-Jen Lin (National Taiwan Univ.) 39 / 60

Page 40

Coordinate Descent Methods

Calculating First and Second Order Information II

For logistic loss:

L'j(0) = C Σ_{i=1}^{l} yi xij (τ(yi wTxi) − 1),

L''j(0) = C Σ_{i=1}^{l} x²ij τ(yi wTxi) (1 − τ(yi wTxi)),

where

τ(s) ≡ 1 / (1 + e^{−s})
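These two formulas translate directly into vectorized numpy (a sketch; overflow handling for large margins is omitted):

```python
import numpy as np

def logistic_derivatives(j, w, X, y, C):
    # L'_j(0) and L''_j(0) for the logistic loss, as on this slide
    tau = 1.0 / (1.0 + np.exp(-(y * (X @ w))))   # τ(y_i w^T x_i)
    Lp = C * np.sum(y * X[:, j] * (tau - 1.0))
    Lpp = C * np.sum(X[:, j] ** 2 * tau * (1.0 - tau))
    return Lp, Lpp
```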

Chih-Jen Lin (National Taiwan Univ.) 40 / 60

Page 41

Other Methods

Outline

Sparse Representation
Existing Optimization Methods
Coordinate Descent Methods
Other Methods
Experiments

Chih-Jen Lin (National Taiwan Univ.) 41 / 60

Page 42

Other Methods

GLMNET (Friedman et al., 2010) I

A quadratic approximation of L(w):

f(w + d) − f(w)
= (‖w + d‖1 + L(w + d)) − (‖w‖1 + L(w))
≈ ∇L(w)Td + (1/2) dT ∇²L(w) d + ‖w + d‖1 − ‖w‖1.

Then w ← w + d

Line search is needed for convergence

Chih-Jen Lin (National Taiwan Univ.) 42 / 60

Page 43

Other Methods

GLMNET (Friedman et al., 2010) II

But how to handle quadratic minimization with some one-norm terms?

GLMNET uses coordinate descent

For logistic regression:

∇L(w) = C Σ_{i=1}^{l} (τ(yi wTxi) − 1) yi xi

∇²L(w) = C XTDX,

where D ∈ R^{l×l} is a diagonal matrix with

Dii = τ(yi wTxi) (1 − τ(yi wTxi))

Chih-Jen Lin (National Taiwan Univ.) 43 / 60
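A numpy sketch of forming this gradient and Hessian for GLMNET's quadratic model; building the dense Hessian here is only for illustration, since a practical implementation would typically use it implicitly inside coordinate descent rather than forming it explicitly.

```python
import numpy as np

def glmnet_quadratic_model(w, X, y, C):
    # Gradient and Hessian of L(w) for the logistic loss, as on this slide
    tau = 1.0 / (1.0 + np.exp(-(y * (X @ w))))
    grad = C * (X.T @ ((tau - 1.0) * y))   # ∇L(w)
    D = tau * (1.0 - tau)                  # diagonal of D
    hess = C * (X.T * D) @ X               # ∇²L(w) = C XᵀDX (dense, for illustration only)
    return grad, hess
```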

Page 44

Other Methods

GLMNET (Friedman et al., 2010) III

See an improved version called newGLMNET at Yuan et al. (2012)

Chih-Jen Lin (National Taiwan Univ.) 44 / 60

Page 45

Other Methods

Bundle Methods (Teo et al., 2010) I

Also called cutting plane method

L(w) a convex loss function

If w^k is the current solution,

L(w) ≥ ∇L(w^k)T (w − w^k) + L(w^k) = a_k^T w + b_k, ∀w,

where

a_k ≡ ∇L(w^k) and b_k ≡ L(w^k) − a_k^T w^k.

Chih-Jen Lin (National Taiwan Univ.) 45 / 60

Page 46

Other Methods

Bundle Methods (Teo et al., 2010) II

Maintains all earlier cutting planes to form a lower-bound function for L(w):

L(w) ≥ L^CP_k(w) ≡ max_{1≤t≤k} a_t^T w + b_t, ∀w.

Obtain w^{k+1} by solving

min_w  ‖w‖1 + L^CP_k(w).

Chih-Jen Lin (National Taiwan Univ.) 46 / 60

Page 47

Other Methods

Bundle Methods (Teo et al., 2010) III

This is a linear program using w = w+ − w−:

min_{w+, w−, ζ}  Σ_{j=1}^{n} w+_j + Σ_{j=1}^{n} w−_j + ζ

subject to  a_t^T (w+ − w−) + b_t ≤ ζ, t = 1, . . . , k,
            w+_j ≥ 0, w−_j ≥ 0, j = 1, . . . , n.
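A sketch of solving this linear program with scipy.optimize.linprog (only an illustration of the sub-problem, not BMRM's actual solver); variables are ordered as [w+, w−, ζ]:

```python
import numpy as np
from scipy.optimize import linprog

def bundle_subproblem(A, b):
    # A: (k, n) with rows a_t; b: (k,) with entries b_t
    k, n = A.shape
    c = np.concatenate([np.ones(2 * n), [1.0]])         # objective: Σ w+ + Σ w- + ζ
    # a_t^T (w+ - w-) + b_t <= ζ   <=>   A w+ - A w- - ζ <= -b
    A_ub = np.hstack([A, -A, -np.ones((k, 1))])
    bounds = [(0.0, None)] * (2 * n) + [(None, None)]   # w+, w- >= 0; ζ free
    res = linprog(c, A_ub=A_ub, b_ub=-b, bounds=bounds, method="highs")
    wp, wm = res.x[:n], res.x[n:2 * n]
    return wp - wm, res.x[-1]
```

Each outer iteration would append the new cutting plane (a_k, b_k) as another row of A and entry of b before re-solving.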

Chih-Jen Lin (National Taiwan Univ.) 47 / 60

Page 48

Experiments

Outline

Sparse Representation
Existing Optimization Methods
Coordinate Descent Methods
Other Methods
Experiments

Chih-Jen Lin (National Taiwan Univ.) 48 / 60

Page 49

Experiments

Data

Data set      l        n          # of non-zeros
real-sim      72,309   20,958     3,709,083
news20        19,996   1,355,191  9,097,916
rcv1          677,399  47,236     49,556,258
yahoo-korea   460,554  3,052,939  156,436,656

l : number of data, n: number of features

They are all document sets

4/5 for training and 1/5 for testing

Select best C by cross validation on training

Chih-Jen Lin (National Taiwan Univ.) 49 / 60

Page 50

Experiments

Compared Methods

Software using wTx without b

BBR (Genkin et al., 2007)

SCD (Shalev-Shwartz and Tewari, 2009)

CDN: our coordinate descent implementation

TRON: our Newton implementation for the bound-constrained formulation

OWL-QN (Andrew and Gao, 2007)

BMRM (Teo et al., 2010)

Chih-Jen Lin (National Taiwan Univ.) 50 / 60

Page 51

Experiments

Compared Methods (Cont’d)

Software using wTx + b

CDN: our coordinate descent implementation

BBR (Genkin et al., 2007)

CGD-GS (Yun and Toh, 2011)

IPM (Koh et al., 2007)

GLMNET (Friedman et al., 2010)

Lassplore (Liu et al., 2009)

Chih-Jen Lin (National Taiwan Univ.) 51 / 60

Page 52

Experiments

Convergence of Objective Values (no b)

[Figures for real-sim, news20, rcv1, and yahoo-korea]

Chih-Jen Lin (National Taiwan Univ.) 52 / 60

Page 53

Experiments

Test Accuracy

[Figures for real-sim, news20, rcv1, and yahoo-korea]

Chih-Jen Lin (National Taiwan Univ.) 53 / 60

Page 54

Experiments

Convergence of Objective Values (with b)

[Figures for real-sim, news20, rcv1, and yahoo-korea]

Chih-Jen Lin (National Taiwan Univ.) 54 / 60

Page 55

Experiments

Observations and Conclusions

Decomposition methods better in the early stage

One-variable sub-problem in coordinate descent

Use tight approximation if possible

Newton (IPM, GLMNET) and quasi Newton (OWL-QN): fast local convergence in the end

We also checked gradient and sparsity

Complete results (on more data sets) and programs are in Yuan et al.; JMLR 2010 (11), 3183–3234

Chih-Jen Lin (National Taiwan Univ.) 55 / 60

Page 56

Experiments

References I

G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In Proceedings of the Twenty Fourth International Conference on Machine Learning (ICML), 2007.

S. Balakrishnan and D. Madigan. Algorithms for sparse linear classifiers in the massive data setting. 2005. URL http://www.stat.rutgers.edu/~madigan/PAPERS/sm.pdf.

S. Benson and J. J. More. A limited memory variable metric method for bound constrained minimization. Preprint MCS-P909-0901, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, Illinois, 2001.

D. L. Donoho and Y. Tsaig. Fast solution of l1 minimization problems when the solution may be sparse. IEEE Transactions on Information Theory, 54:4789–4812, 2008.

J. Duchi and Y. Singer. Boosting with structural sparsity. In Proceedings of the Twenty Sixth International Conference on Machine Learning (ICML), 2009.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the L1-ball for learning in high dimensions. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008.

M. A. T. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:1150–1159, 2003.

J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

Chih-Jen Lin (National Taiwan Univ.) 56 / 60

Page 57

Experiments

References II

E. M. Gafni and D. P. Bertsekas. Two-metric projection methods for constrained optimization. SIAM Journal on Control and Optimization, 22:936–964, 1984.

A. Genkin, D. D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3):291–304, 2007.

J. Goodman. Exponential priors for maximum entropy models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2004.

J. Kazama and J. Tsujii. Evaluation and extension of maximum entropy models with inequality constraints. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 137–144, 2003.

S. S. Keerthi and S. Shevade. A fast tracking algorithm for generalized LARS/LASSO. IEEE Transactions on Neural Networks, 18(6):1826–1830, 2007.

J. Kim, Y. Kim, and Y. Kim. A gradient-based optimization algorithm for LASSO. Journal of Computational and Graphical Statistics, 17(4):994–1009, 2008.

Y. Kim and J. Kim. Gradient LASSO for feature selection. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.

J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132:1–63, 1997.

Chih-Jen Lin (National Taiwan Univ.) 57 / 60

Page 58

Experiments

References III

K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007. URL http://www.stanford.edu/~boyd/l1_logistic_reg.html.

B. Krishnapuram, A. J. Hartemink, L. Carin, and M. A. T. Figueiredo. A Bayesian approach to joint feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):1105–1111, 2004.

B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. Hartemink. Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–968, 2005.

J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:771–801, 2009.

S.-I. Lee, H. Lee, P. Abbeel, and A. Y. Ng. Efficient l1 regularized logistic regression. In Proceedings of the Twenty-first National Conference on Artificial Intelligence (AAAI-06), pages 1–9, Boston, MA, USA, July 2006.

C.-J. Lin and J. J. More. Newton's method for large-scale bound constrained problems. SIAM Journal on Optimization, 9:1100–1127, 1999.

J. Liu, J. Chen, and J. Ye. Large-scale sparse logistic regression. In Proceedings of The 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 547–556, 2009.

Chih-Jen Lin (National Taiwan Univ.) 58 / 60

Page 59

Experiments

References IV

M. Y. Park and T. Hastie. L1 regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society Series B, 69:659–677, 2007.

S. Perkins, K. Lacker, and J. Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.

S. Rosset. Following curved regularized optimization solution paths. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1153–1160, Cambridge, MA, 2005. MIT Press.

V. Roth. The generalized LASSO. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

M. Schmidt, G. Fung, and R. Rosales. Optimization methods for l1-regularization. Technical Report TR-2009-19, University of British Columbia, 2009.

S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1 regularized loss minimization. In Proceedings of the Twenty Sixth International Conference on Machine Learning (ICML), 2009.

S. K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.

J. Shi, W. Yin, S. Osher, and P. Sajda. A fast hybrid algorithm for large scale l1-regularized logistic regression. Journal of Machine Learning Research, 11:713–741, 2010.

C. H. Teo, S. Vishwanathan, A. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11:311–365, 2010.

Chih-Jen Lin (National Taiwan Univ.) 59 / 60

Page 60

Experiments

References V

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58:267–288, 1996.

P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387–423, 2009.

S. J. Wright. Accelerated block-coordinate relaxation for regularized optimization. SIAM Journal on Optimization, 22(1):159–186, 2012.

J. Yu, S. Vishwanathan, S. Gunter, and N. N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. Journal of Machine Learning Research, 11:1–57, 2010.

G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A comparison of optimization methods and software for large-scale l1-regularized linear classification. Journal of Machine Learning Research, 11:3183–3234, 2010. URL http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf.

G.-X. Yuan, C.-H. Ho, and C.-J. Lin. An improved GLMNET for l1-regularized logistic regression. Journal of Machine Learning Research, 13:1999–2030, 2012. URL http://www.csie.ntu.edu.tw/~cjlin/papers/l1_glmnet/long-glmnet.pdf.

S. Yun and K.-C. Toh. A coordinate gradient descent method for l1-regularized convex minimization. Computational Optimizations and Applications, 48(2):273–307, 2011.

P. Zhao and B. Yu. Stagewise lasso. Journal of Machine Learning Research, 8:2701–2726, 2007.

Chih-Jen Lin (National Taiwan Univ.) 60 / 60