Sparse Representation and Optimization Methods for L1-regularized Problems

Chih-Jen Lin
Department of Computer Science
National Taiwan University
Outline

- Sparse Representation
- Existing Optimization Methods
- Coordinate Descent Methods
- Other Methods
- Experiments
Sparse Representation
A mathematical way to model a signal, an image, or a document is

  y = Xw = w_1 [x_{11}, ..., x_{l1}]^T + · · · + w_n [x_{1n}, ..., x_{ln}]^T

A signal is a linear combination of others

X and y are given

We would like to find w with as few non-zeros as possible (sparsity)
Example: Image Deblurring
Consider

  y = Hz

z: original image, H: blur operation, y: observed image

Assume

  z = Dw

with a known dictionary D

Try to

  min_w ‖y − HDw‖

and get w
Example: Image Deblurring (Cont’d)
We hope w has few nonzeros, as each image is generated using only a few columns of the dictionary
The restored image is Dw
Example: Face Recognition
Assume a face image is a combination of the same person's other images:

  [x_{11}, ..., x_{l1}]^T: 1st image,  [x_{12}, ..., x_{l2}]^T: 2nd image, ...

l: number of pixels in a face image

Given a face image y and collections of two persons' faces X_1 and X_2
Example: Face Recognition (Cont’d)
If

  min_w ‖y − X_1 w‖ < min_w ‖y − X_2 w‖,

predict y as the first person

We hope w has few nonzeros, as noisy images shouldn't be used
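A minimal numpy sketch of this decision rule, using plain least squares (the sparsity-seeking variant would add an L1 penalty); function and variable names are illustrative:

```python
import numpy as np

def predict_person(y, X1, X2):
    """Classify y by comparing least-squares residuals against two image sets."""
    w1 = np.linalg.lstsq(X1, y, rcond=None)[0]   # min_w ||y - X1 w||
    w2 = np.linalg.lstsq(X2, y, rcond=None)[0]   # min_w ||y - X2 w||
    r1 = np.linalg.norm(y - X1 @ w1)
    r2 = np.linalg.norm(y - X2 @ w2)
    return 1 if r1 < r2 else 2                   # smaller residual wins
```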
Example: Feature Selection
Given

  X = [x_{ij}] ∈ R^{l×n},  with row i, [x_{i1}, ..., x_{in}], the ith document, and

  y_i = +1 or −1 (two classes)

We hope to find w such that

  w^T x_i > 0 if y_i = 1,
  w^T x_i < 0 if y_i = −1
Example: Feature Selection (Cont’d)
Try to

  min_w  ∑_{i=1}^{l} e^{−y_i w^T x_i}

and hope that w is sparse

That is, we assume that each document is generated from important features

  w_j ≠ 0 ⇒ feature j is important
L1-norm Minimization I
Finding w with the smallest number of non-zeros is difficult

  ‖w‖_0: number of nonzeros

Instead, L1-norm minimization:

  min_w  C‖y − Xw‖² + ‖w‖_1

C: a parameter given by users

This is a regression problem
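Before turning to specialized solvers, here is a minimal proximal-gradient (ISTA) sketch for this problem; ISTA is not among the methods compared later, but it shows how the L1 term yields exact zeros through soft thresholding:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, C, n_iters=500):
    """Proximal gradient (ISTA) for  min_w  C * ||y - X w||^2 + ||w||_1."""
    step = 1.0 / (2.0 * C * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = -2.0 * C * X.T @ (y - X @ w)   # gradient of the smooth part
        w = soft_threshold(w - step * grad, step)
    return w
```

The soft threshold sets any coordinate whose magnitude falls below the step size to exactly zero, which is where the sparsity comes from.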
L1-norm Minimization II
1-norm versus 2-norm:

  ‖w‖_1 = |w_1| + · · · + |w_n|
  ‖w‖_2² = w_1² + · · · + w_n²

[Figures: the curves |w| (left) and w² (right) as functions of a single variable w]
L1-norm Minimization III
If the 2-norm is used, all w_j are non-zero

Using the 1-norm, many w_j may be zero

Smaller C, better sparsity
L1-regularized Classifier
Training data {y_i, x_i}, x_i ∈ R^n, i = 1, ..., l, y_i = ±1
l: # of data, n: # of features

  min_w  ‖w‖_1 + C ∑_{i=1}^{l} ξ(w; x_i, y_i)

ξ(w; x_i, y_i): loss function

Logistic loss:

  log(1 + e^{−y w^T x})

L1 and L2 losses:

  max(1 − y w^T x, 0)  and  max(1 − y w^T x, 0)²

We do not consider kernels
L1-regularized Classifier (Cont’d)
‖w‖_1 is not differentiable ⇒ this causes difficulties in optimization

Loss functions: logistic loss is twice differentiable, L2 loss is differentiable, and L1 loss is not differentiable

We focus on the logistic and L2 losses

Sometimes a bias term is added:

  w^T x ⇒ w^T x + b
L1-regularized Classifier (Cont’d)
Many methods are available; we review existing methods and show details of some

Notation:

  f(w) ≡ ‖w‖_1 + C ∑_{i=1}^{l} ξ(w; x_i, y_i)

is the function to be minimized, and

  L(w) ≡ C ∑_{i=1}^{l} ξ(w; x_i, y_i)

We do not discuss L1-regularized regression, which is another recently active topic
Existing Optimization Methods
Decomposition Methods
Working on some variables at a time

Cyclic coordinate descent methods: working variables sequentially or randomly selected

One-variable case:

  min_z  f(w + z e_j) − f(w)

e_j: indicator vector for the jth element

Examples: Goodman (2004); Genkin et al. (2007); Balakrishnan and Madigan (2005); Tseng and Yun (2009); Shalev-Shwartz and Tewari (2009); Duchi and Singer (2009); Wright (2012)
Decomposition Methods (Cont’d)
Gradient-based working set selection: higher cost per iteration; larger working set

Examples: Shevade and Keerthi (2003); Tseng and Yun (2009); Yun and Toh (2011)

Active set method: working set the same as the set of non-zero elements of w

Example: Perkins et al. (2003)
Constrained Optimization
Replace w with w+ − w−:

  min_{w+, w−}  ∑_{j=1}^{n} w+_j + ∑_{j=1}^{n} w−_j + C ∑_{i=1}^{l} ξ(w+ − w−; x_i, y_i)
  s.t.  w+_j ≥ 0, w−_j ≥ 0, j = 1, ..., n

Any bound-constrained optimization method can be used; a sketch follows below

Examples: Schmidt et al. (2009) used Gafni and Bertsekas (1984); Kazama and Tsujii (2003) used Benson and Moré (2001); we have considered Lin and Moré (1999); Koh et al. (2007): an interior point method
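A sketch of this reformulation with the logistic loss, handed to a generic bound-constrained solver (SciPy's L-BFGS-B here; an illustration, not the TRON or interior-point implementations cited above):

```python
import numpy as np
from scipy.optimize import minimize

def solve_split(X, y, C):
    """min over w+, w- >= 0 of  sum(w+) + sum(w-) + C * logistic_loss(w+ - w-)."""
    l, n = X.shape

    def fun_grad(u):
        w = u[:n] - u[n:]                       # w = w+ - w-
        m = y * (X @ w)                         # margins y_i w^T x_i
        loss = C * np.logaddexp(0.0, -m).sum()  # C * sum log(1 + e^{-m})
        tau = 1.0 / (1.0 + np.exp(-m))
        gL = C * (X.T @ ((tau - 1.0) * y))      # gradient of L(w)
        f = u.sum() + loss                      # one-norm part is sum(w+) + sum(w-)
        return f, np.concatenate([1.0 + gL, 1.0 - gL])

    res = minimize(fun_grad, np.zeros(2 * n), jac=True,
                   method="L-BFGS-B", bounds=[(0.0, None)] * (2 * n))
    return res.x[:n] - res.x[n:]
```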
Constrained Optimization (Cont’d)
Equivalent problem with a non-smooth constraint:

  min_w  ∑_{i=1}^{l} ξ(w; x_i, y_i)
  subject to  ‖w‖_1 ≤ K

C is replaced by a corresponding K

This goes back to LASSO (Tibshirani, 1996) if y_i ∈ R and the least-squares loss is used

Examples: Kivinen and Warmuth (1997); Lee et al. (2006); Donoho and Tsaig (2008); Duchi et al. (2008); Liu et al. (2009); Kim and Kim (2004); Kim et al. (2008); Roth (2004)
Other Methods
Expectation maximization: Figueiredo (2003); Krishnapuram et al. (2004, 2005)

Stochastic gradient descent: Langford et al. (2009); Shalev-Shwartz and Tewari (2009)

Modified quasi-Newton: Andrew and Gao (2007); Yu et al. (2010)

Hybrid: an easy method first, then an interior-point method for faster local convergence (Shi et al., 2010)
Other Methods (Cont’d)
Quadratic approximation followed by coordinate descent: Krishnapuram et al. (2005); Friedman et al. (2010); Yuan et al. (2012); a kind of Newton approach

Cutting plane method: Teo et al. (2010)

Some methods find a solution path for different C values; e.g., Rosset (2005), Zhao and Yu (2007), Park and Hastie (2007), and Keerthi and Shevade (2007)

Here we focus on a single C
Strengths and Weaknesses of Existing Methods

Convergence speed: higher-order methods (quasi-Newton or Newton) have fast local convergence, but fail to obtain a reasonable model quickly

Implementation effort: higher-order methods are usually more complicated

Large data: if solving linear systems is needed, use iterative methods (e.g., CG) instead of direct methods

Feature correlation: methods working on some variables at a time (e.g., decomposition methods) may be efficient if features are almost independent
Coordinate Descent Methods
Coordinate Descent Methods I
Minimizing the one-variable function

  g_j(z) ≡ |w_j + z| − |w_j| + L(w + z e_j) − L(w),

where

  e_j ≡ [0, ..., 0, 1, 0, ..., 0]^T

has the single 1 in position j

No closed-form solution

Genkin et al. (2007), Shalev-Shwartz and Tewari (2009), and Yuan et al. (2010)
Coordinate Descent Methods II
These methods differ in how they minimize the one-variable problem

While g_j(z) is not differentiable, we can have a form similar to a Taylor expansion:

  g_j(z) = g_j(0) + g_j'(0) z + (1/2) g_j''(ηz) z²
Coordinate Descent Methods III
Another representation (for our derivation):

  min_z  g_j(z) = |w_j + z| − |w_j| + L_j(z; w) − L_j(0; w),

where

  L_j(z; w) ≡ L(w + z e_j)

is a function of z
BBR (Genkin et al., 2007) I
They rewrite g_j(z) as

  g_j(z) = g_j(0) + g_j'(0) z + (1/2) g_j''(ηz) z²,

where 0 < η < 1,

  g_j'(0) ≡ L_j'(0) + 1 if w_j > 0,
            L_j'(0) − 1 if w_j < 0,        (1)

and

  g_j''(ηz) ≡ L_j''(ηz)
BBR (Genkin et al., 2007) II
g_j(z) is not differentiable if w_j = 0

BBR finds an upper bound U_j of g_j''(z) in a trust region:

  U_j ≥ g_j''(z), ∀|z| ≤ Δ_j

Then

  ĝ_j(z) ≡ g_j(0) + g_j'(0) z + (1/2) U_j z²

is an upper-bound function of g_j(z)

Any step z satisfying ĝ_j(z) < ĝ_j(0) leads to a function-value decrease, because ĝ_j(0) = g_j(0) and ĝ_j upper-bounds g_j:

  g_j(z) − g_j(0) ≤ ĝ_j(z) − ĝ_j(0) < 0
BBR (Genkin et al., 2007) III
Convergence not proved (no sufficient decrease condition via line search)

For the logistic loss,

  U_j ≡ C ∑_{i=1}^{l} x_{ij}² F(y_i w^T x_i, Δ_j |x_{ij}|),

where

  F(r, δ) = 0.25                               if |r| ≤ δ,
            1 / (2 + e^{|r|−δ} + e^{δ−|r|})    otherwise
BBR (Genkin et al., 2007) IV
The sub-problem solved in practice:

  min_z  ĝ_j(z)
  s.t.  |z| ≤ Δ_j  and  w_j + z ≥ 0 if w_j > 0,
                        w_j + z ≤ 0 if w_j < 0

Update rule:

  d = min( max( P(−g_j'(0)/U_j, w_j), −Δ_j ), Δ_j ),
BBR (Genkin et al., 2007) V
where

  P(z, w) ≡ z    if sgn(w + z) = sgn(w),
            −w   otherwise
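A direct transcription of this update rule; a sketch assuming w_j ≠ 0 (the w_j = 0 case needs the one-sided derivatives of |·|, not shown here):

```python
import numpy as np

def bbr_step(w_j, g1, U_j, delta_j):
    """BBR trust-region step: d = min(max(P(-g1/U_j, w_j), -Delta_j), Delta_j).

    g1 is g_j'(0) = L_j'(0) + sign(w_j); U_j bounds g_j'' over |z| <= delta_j.
    """
    def P(z, w):
        # keep w + z on the same side of zero as w; otherwise stop at zero
        return z if np.sign(w + z) == np.sign(w) else -w
    return min(max(P(-g1 / U_j, w_j), -delta_j), delta_j)
```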
SCD (Shalev-Shwartz and Tewari, 2009) I
SCD: stochastic coordinate descent

  w = w+ − w−

At each step, randomly select a variable from

  {w+_1, ..., w+_n, w−_1, ..., w−_n}

One-variable sub-problem:

  min_z  g_j(z) ≡ z + L_j(z; w+ − w−) − L_j(0; w+ − w−),

subject to

  w^{k,+}_j + z ≥ 0  or  w^{k,−}_j + z ≥ 0
SCD (Shalev-Shwartz and Tewari, 2009) II
Second-order approximation similar to BBR:

  ĝ_j(z) = g_j(0) + g_j'(0) z + (1/2) U_j z²,

where

  g_j'(0) = 1 + L_j'(0) for w+_j,
            1 − L_j'(0) for w−_j,

and U_j ≥ g_j''(z), ∀z

BBR: U_j is an upper bound of g_j''(z) in the trust region

SCD: a global upper bound
SCD (Shalev-Shwartz and Tewari, 2009) III
For logistic regression,

  U_j = 0.25 C ∑_{i=1}^{l} x_{ij}² ≥ g_j''(z), ∀z

Shalev-Shwartz and Tewari (2009) assume −1 ≤ x_{ij} ≤ 1, ∀i, j, so a simple upper bound is

  U_j = 0.25 C l
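A sketch of one SCD step for the logistic loss, using the global bound above (dense numpy for clarity; names are illustrative):

```python
import numpy as np

def scd_step(X, y, C, wp, wm, rng):
    """One stochastic coordinate descent update on w = w+ - w- (logistic loss)."""
    l, n = X.shape
    k = rng.integers(2 * n)                     # pick one of w_1+,...,w_n+,w_1-,...,w_n-
    s, j = (1.0, k) if k < n else (-1.0, k - n)
    tau = 1.0 / (1.0 + np.exp(-y * (X @ (wp - wm))))
    Lp = C * np.sum((tau - 1.0) * y * X[:, j])  # L_j'(0)
    U = 0.25 * C * np.sum(X[:, j] ** 2)         # global upper bound on g_j''
    z = -(1.0 + s * Lp) / U                     # minimize the quadratic upper bound
    if k < n:
        wp[j] = max(wp[j] + z, 0.0)             # truncate to keep w_j+ >= 0
    else:
        wm[j] = max(wm[j] + z, 0.0)             # truncate to keep w_j- >= 0
```

Here rng is a numpy.random.Generator, e.g., np.random.default_rng(0); a real implementation would maintain w^T x_i incrementally instead of recomputing it.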
CDN (Yuan et al., 2010) I
Newton step:

  min_z  g_j'(0) z + (1/2) g_j''(0) z²

That is,

  min_z  |w_j + z| − |w_j| + L_j'(0) z + (1/2) L_j''(0) z²

The second-order term is not replaced by an upper bound

The function value may not be decreasing

A tighter second-order term is useful (we will see this later)
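This one-variable model can be minimized in closed form by a soft-threshold-style case analysis; a sketch:

```python
def cdn_newton_step(w_j, Lp, Lpp):
    """Minimize  |w_j + z| - |w_j| + Lp * z + 0.5 * Lpp * z**2  over z."""
    if Lp + 1.0 <= Lpp * w_j:        # optimum stays on the positive side
        return -(Lp + 1.0) / Lpp
    if Lp - 1.0 >= Lpp * w_j:        # optimum stays on the negative side
        return -(Lp - 1.0) / Lpp
    return -w_j                      # optimum is at w_j + z = 0
```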
CDN (Yuan et al., 2010) II

Assume z is the optimal solution of the sub-problem; a line search is still needed

Following Tseng and Yun (2009), find the largest λ ∈ {1, β, β², ...} such that

  g_j(λz) − g_j(0) ≤ σλ( L_j'(0) z + |w_j + z| − |w_j| )

This is slightly different from the traditional form of line search: now

  |w_j + z| − |w_j|

must be taken into consideration

Convergence can be proved; a sketch of the backtracking loop follows below
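A sketch of the backtracking loop checking this condition, with typical (assumed) constants σ = 0.01 and β = 0.5; gj is a callable evaluating g_j:

```python
def cdn_line_search(gj, w_j, z, Lp, sigma=0.01, beta=0.5, max_steps=30):
    """Shrink lambda until g_j(lam*z) - g_j(0) <= sigma*lam*(Lp*z + |w_j+z| - |w_j|)."""
    rhs = Lp * z + abs(w_j + z) - abs(w_j)   # the one-norm change enters the condition
    lam = 1.0
    for _ in range(max_steps):
        if gj(lam * z) - gj(0.0) <= sigma * lam * rhs:
            return lam * z
        lam *= beta
    return 0.0                               # no acceptable step found
```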
Calculating First- and Second-Order Information I

We have

  L_j'(0) = d L(w + z e_j) / dz |_{z=0} = ∇_j L(w)

  L_j''(0) = d² L(w + z e_j) / dz² |_{z=0} = ∇²_{jj} L(w)
Calculating First- and Second-Order Information II

For the logistic loss:

  L_j'(0) = C ∑_{i=1}^{l} y_i x_{ij} ( τ(y_i w^T x_i) − 1 ),

  L_j''(0) = C ∑_{i=1}^{l} x_{ij}² τ(y_i w^T x_i) ( 1 − τ(y_i w^T x_i) ),

where

  τ(s) ≡ 1 / (1 + e^{−s})
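These two formulas translate directly into code; a dense numpy sketch (an actual solver maintains w^T x_i incrementally rather than recomputing X @ w per coordinate):

```python
import numpy as np

def logistic_Lp_Lpp(X, y, w, j, C):
    """L_j'(0) and L_j''(0) for the logistic loss at the current w."""
    tau = 1.0 / (1.0 + np.exp(-y * (X @ w)))        # tau(y_i w^T x_i)
    xj = X[:, j]
    Lp = C * np.sum(y * xj * (tau - 1.0))
    Lpp = C * np.sum(xj ** 2 * tau * (1.0 - tau))
    return Lp, Lpp
```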
Other Methods
GLMNET (Friedman et al., 2010) I
A quadratic approximation of L(w):

  f(w + d) − f(w)
    = ( ‖w + d‖_1 + L(w + d) ) − ( ‖w‖_1 + L(w) )
    ≈ ∇L(w)^T d + (1/2) d^T ∇²L(w) d + ‖w + d‖_1 − ‖w‖_1

Then

  w ← w + d

Line search is needed for convergence
GLMNET (Friedman et al., 2010) II

But how do we handle quadratic minimization with one-norm terms?

GLMNET uses coordinate descent

For logistic regression:

  ∇L(w) = C ∑_{i=1}^{l} ( τ(y_i w^T x_i) − 1 ) y_i x_i

  ∇²L(w) = C X^T D X,

where D ∈ R^{l×l} is a diagonal matrix with

  D_ii = τ(y_i w^T x_i) ( 1 − τ(y_i w^T x_i) )
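A sketch of the pieces of this quadratic model; the Hessian is kept as an operator v ↦ C X^T D X v rather than formed explicitly (an implementation choice assumed here, natural for large sparse X):

```python
import numpy as np

def glmnet_model(X, y, w, C):
    """Gradient of L(w) and a Hessian-vector operator for the quadratic model."""
    tau = 1.0 / (1.0 + np.exp(-y * (X @ w)))
    grad = C * (X.T @ ((tau - 1.0) * y))       # grad L(w)
    D = tau * (1.0 - tau)                      # diagonal of D
    hv = lambda v: C * (X.T @ (D * (X @ v)))   # v -> C X^T D X v
    return grad, hv
```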
GLMNET (Friedman et al., 2010) III
See the improved version, called newGLMNET, in Yuan et al. (2012)
Bundle Methods (Teo et al., 2010) I
Also called the cutting plane method

L(w): a convex loss function

If w_k is the current solution,

  L(w) ≥ ∇L(w_k)^T (w − w_k) + L(w_k) = a_k^T w + b_k, ∀w,

where

  a_k ≡ ∇L(w_k)  and  b_k ≡ L(w_k) − a_k^T w_k
Bundle Methods (Teo et al., 2010) II
Maintain all earlier cutting planes to form a lower-bound function for L(w):

  L(w) ≥ L_k^{CP}(w) ≡ max_{1≤t≤k} a_t^T w + b_t, ∀w

Obtain w_{k+1} by solving

  min_w  ‖w‖_1 + L_k^{CP}(w)
Bundle Methods (Teo et al., 2010) III
This is a linear program using w = w+ − w−:

  min_{w+, w−, ζ}  ∑_{j=1}^{n} w+_j + ∑_{j=1}^{n} w−_j + ζ

  subject to  a_t^T (w+ − w−) + b_t ≤ ζ, t = 1, ..., k,
              w+_j ≥ 0, w−_j ≥ 0, j = 1, ..., n
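A sketch of this LP with scipy.optimize.linprog; A holds the slopes a_t as rows and b the offsets b_t (names assumed for illustration):

```python
import numpy as np
from scipy.optimize import linprog

def bundle_subproblem(A, b):
    """Solve  min ||w||_1 + max_t (a_t^T w + b_t)  via the LP above."""
    k, n = A.shape
    c = np.concatenate([np.ones(2 * n), [1.0]])          # sum w+ + sum w- + zeta
    A_ub = np.hstack([A, -A, -np.ones((k, 1))])          # a_t^T(w+ - w-) - zeta <= -b_t
    b_ub = -b
    bounds = [(0, None)] * (2 * n) + [(None, None)]      # w+, w- >= 0; zeta free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    u = res.x
    return u[:n] - u[n:2 * n]                            # w = w+ - w-
```

Each outer iteration appends one row to A and one entry to b, then re-solves.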
Experiments
Data
Data set      l        n          # of non-zeros
real-sim      72,309   20,958     3,709,083
news20        19,996   1,355,191  9,097,916
rcv1          677,399  47,236     49,556,258
yahoo-korea   460,554  3,052,939  156,436,656

l: number of data, n: number of features

They are all document sets

4/5 of each set for training and 1/5 for testing

Select the best C by cross validation on the training set
Compared Methods
Software using w^T x (without the bias term b):

BBR (Genkin et al., 2007)

SCD (Shalev-Shwartz and Tewari, 2009)

CDN: our coordinate descent implementation

TRON: our Newton implementation for the bound-constrained formulation

OWL-QN (Andrew and Gao, 2007)

BMRM (Teo et al., 2010)
Compared Methods (Cont’d)
Software using w^T x + b (with the bias term):
CDN: our coordinate descent implementation
BBR (Genkin et al., 2007)
CGD-GS (Yun and Toh, 2011)
IPM (Koh et al., 2007)
GLMNET (Friedman et al., 2010)
Lassplore (Liu et al., 2009)
Convergence of Objective Values (no b)
[Figure: four panels (real-sim, news20, rcv1, yahoo-korea)]
Test Accuracy
[Figure: four panels (real-sim, news20, rcv1, yahoo-korea)]
Convergence of Objective Values (with b)
[Figure: four panels (real-sim, news20, rcv1, yahoo-korea)]
Observations and Conclusions
Decomposition methods are better in the early stage

For the one-variable sub-problem in coordinate descent, use a tight approximation if possible

Newton (IPM, GLMNET) and quasi-Newton (OWL-QN) methods: fast local convergence in the end

We also checked gradient values and sparsity

Complete results (on more data sets) and programs are in Yuan et al. (2010), JMLR 11, 3183–3234
References I
G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In Proceedings of the Twenty Fourth International Conference on Machine Learning (ICML), 2007.

S. Balakrishnan and D. Madigan. Algorithms for sparse linear classifiers in the massive data setting. 2005. URL http://www.stat.rutgers.edu/~madigan/PAPERS/sm.pdf.

S. Benson and J. J. Moré. A limited memory variable metric method for bound constrained minimization. Preprint MCS-P909-0901, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, Illinois, 2001.

D. L. Donoho and Y. Tsaig. Fast solution of l1 minimization problems when the solution may be sparse. IEEE Transactions on Information Theory, 54:4789–4812, 2008.

J. Duchi and Y. Singer. Boosting with structural sparsity. In Proceedings of the Twenty Sixth International Conference on Machine Learning (ICML), 2009.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the L1-ball for learning in high dimensions. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008.

M. A. T. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:1150–1159, 2003.

J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
References II
E. M. Gafni and D. P. Bertsekas. Two-metric projection methods for constrained optimization. SIAM Journal on Control and Optimization, 22:936–964, 1984.

A. Genkin, D. D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3):291–304, 2007.

J. Goodman. Exponential priors for maximum entropy models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2004.

J. Kazama and J. Tsujii. Evaluation and extension of maximum entropy models with inequality constraints. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 137–144, 2003.

S. S. Keerthi and S. Shevade. A fast tracking algorithm for generalized LARS/LASSO. IEEE Transactions on Neural Networks, 18(6):1826–1830, 2007.

J. Kim, Y. Kim, and Y. Kim. A gradient-based optimization algorithm for LASSO. Journal of Computational and Graphical Statistics, 17(4):994–1009, 2008.

Y. Kim and J. Kim. Gradient LASSO for feature selection. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.

J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132:1–63, 1997.
References III

K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007. URL http://www.stanford.edu/~boyd/l1_logistic_reg.html.

B. Krishnapuram, A. J. Hartemink, L. Carin, and M. A. T. Figueiredo. A Bayesian approach to joint feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):1105–1111, 2004.

B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. Hartemink. Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–968, 2005.

J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:771–801, 2009.

S.-I. Lee, H. Lee, P. Abbeel, and A. Y. Ng. Efficient l1 regularized logistic regression. In Proceedings of the Twenty-first National Conference on Artificial Intelligence (AAAI-06), pages 1–9, Boston, MA, USA, July 2006.

C.-J. Lin and J. J. Moré. Newton's method for large-scale bound constrained problems. SIAM Journal on Optimization, 9:1100–1127, 1999.

J. Liu, J. Chen, and J. Ye. Large-scale sparse logistic regression. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 547–556, 2009.
References IV

M. Y. Park and T. Hastie. L1 regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society Series B, 69:659–677, 2007.

S. Perkins, K. Lacker, and J. Theiler. Grafting: fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.

S. Rosset. Following curved regularized optimization solution paths. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1153–1160, Cambridge, MA, 2005. MIT Press.

V. Roth. The generalized LASSO. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

M. Schmidt, G. Fung, and R. Rosales. Optimization methods for l1-regularization. Technical Report TR-2009-19, University of British Columbia, 2009.

S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1 regularized loss minimization. In Proceedings of the Twenty Sixth International Conference on Machine Learning (ICML), 2009.

S. K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.

J. Shi, W. Yin, S. Osher, and P. Sajda. A fast hybrid algorithm for large scale l1-regularized logistic regression. Journal of Machine Learning Research, 11:713–741, 2010.

C. H. Teo, S. Vishwanathan, A. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11:311–365, 2010.
References V

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58:267–288, 1996.

P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387–423, 2009.

S. J. Wright. Accelerated block-coordinate relaxation for regularized optimization. SIAM Journal on Optimization, 22(1):159–186, 2012.

J. Yu, S. Vishwanathan, S. Gunter, and N. N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. Journal of Machine Learning Research, 11:1–57, 2010.

G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A comparison of optimization methods and software for large-scale l1-regularized linear classification. Journal of Machine Learning Research, 11:3183–3234, 2010. URL http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf.

G.-X. Yuan, C.-H. Ho, and C.-J. Lin. An improved GLMNET for l1-regularized logistic regression. Journal of Machine Learning Research, 13:1999–2030, 2012. URL http://www.csie.ntu.edu.tw/~cjlin/papers/l1_glmnet/long-glmnet.pdf.

S. Yun and K.-C. Toh. A coordinate gradient descent method for l1-regularized convex minimization. Computational Optimization and Applications, 48(2):273–307, 2011.

P. Zhao and B. Yu. Stagewise lasso. Journal of Machine Learning Research, 8:2701–2726, 2007.