Transcript of "Sparse Representation and Optimization Methods for L1-regularized Problems" (cjlin/courses/optml2014/course_l1.pdf, 60 slides)

Page 1

Sparse Representation and Optimization Methods for L1-regularized Problems

Chih-Jen Lin
Department of Computer Science

National Taiwan University

Chih-Jen Lin (National Taiwan Univ.) 1 / 60

Page 2

Outline

Sparse Representation
Existing Optimization Methods
Coordinate Descent Methods
Other Methods
Experiments

Chih-Jen Lin (National Taiwan Univ.) 2 / 60

Page 3

Sparse Representation

Outline

Sparse Representation
Existing Optimization Methods
Coordinate Descent Methods
Other Methods
Experiments

Chih-Jen Lin (National Taiwan Univ.) 3 / 60

Page 4

Sparse Representation

Sparse Representation

A mathematical way to model a signal, an image, or a document is

y = Xw = w1 [x11, ..., xl1]^T + · · · + wn [x1n, ..., xln]^T

A signal is a linear combination of others

X and y are given

We would like to find w with as few non-zeros as possible (sparsity)
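As a concrete (made-up) illustration of this model in numpy: the dictionary X and the sparse coefficient vector w below are arbitrary, and y is formed as the linear combination described above.

```python
import numpy as np

# Made-up sizes: l-dimensional signals, n columns (atoms) in X
l, n = 100, 50
rng = np.random.default_rng(0)
X = rng.standard_normal((l, n))      # columns are the known signals

w = np.zeros(n)
w[[3, 17, 42]] = [1.5, -2.0, 0.7]    # only a few non-zero coefficients (sparsity)

y = X @ w                            # observed signal: a sparse linear combination
```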

Chih-Jen Lin (National Taiwan Univ.) 4 / 60

Page 5

Sparse Representation

Example: Image Deblurring

Consider y = Hz

z: original image, H: blur operation

y: observed image

Assume z = Dw

with known dictionary D

Try to min_w ‖y − HDw‖ and get w

Chih-Jen Lin (National Taiwan Univ.) 5 / 60

Page 6

Sparse Representation

Example: Image Deblurring (Cont’d)

We hope w has few non-zeros as each image is generated using only several columns of the dictionary

The restored image is Dw

Chih-Jen Lin (National Taiwan Univ.) 6 / 60

Page 7

Sparse Representation

Example: Face Recognition

Assume a face image is a combination of the same person's other images:

[x11, ..., xl1]^T: 1st image,   [x12, ..., xl2]^T: 2nd image, . . .

l : number of pixels in a face image

Given a face image y and collections of two persons' faces X1 and X2

Chih-Jen Lin (National Taiwan Univ.) 7 / 60

Page 8

Sparse Representation

Example: Face Recognition (Cont’d)

If min_w ‖y − X1w‖ < min_w ‖y − X2w‖,

predict y as the first person

We hope w has few non-zeros as noisy images shouldn't be used
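A minimal numpy sketch of this decision rule; the image matrices X1, X2 and the query y are synthetic placeholders, and plain least squares is used here instead of a sparsity-inducing solver.

```python
import numpy as np

def residual(y, X):
    # Smallest achievable ‖y − Xw‖ via ordinary least squares
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.linalg.norm(y - X @ w)

rng = np.random.default_rng(0)
X1 = rng.standard_normal((64, 5))    # columns: images of person 1 (synthetic)
X2 = rng.standard_normal((64, 5))    # columns: images of person 2 (synthetic)
y = X1 @ np.array([0.4, 0.0, 0.6, 0.0, 0.0])   # query built from person 1's images

prediction = 1 if residual(y, X1) < residual(y, X2) else 2
print(prediction)                    # -> 1
```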

Chih-Jen Lin (National Taiwan Univ.) 8 / 60

Page 9

Sparse Representation

Example: Feature Selection

Given

X = [ x11 . . . x1n ; ... ; xl1 . . . xln ],   [xi1 . . . xin]: ith document

yi = +1 or −1 (two classes)

We hope to find w such that

wTxi > 0 if yi = 1,   wTxi < 0 if yi = −1

Chih-Jen Lin (National Taiwan Univ.) 9 / 60

Page 10

Sparse Representation

Example: Feature Selection (Cont’d)

Try to

min_w  Σ_{i=1}^{l} exp(−yi wTxi)

and hope that w is sparse

That is, we assume that each document is generated from important features

wi ≠ 0 ⇒ important features
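A small numpy sketch of this objective (the data X, y are placeholders; no regularization term yet, matching the slide):

```python
import numpy as np

def exp_loss(w, X, y):
    # sum_i exp(-y_i * w^T x_i); rows of X are documents, y in {+1, -1}
    return np.sum(np.exp(-y * (X @ w)))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
y = np.sign(X[:, 0])
print(exp_loss(np.zeros(30), X, y))  # = l at w = 0, since exp(0) = 1 for every document
```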

Chih-Jen Lin (National Taiwan Univ.) 10 / 60

Page 11

Sparse Representation

L1-norm Minimization I

Finding w with the smallest number of non-zeros is difficult

‖w‖0 : number of nonzeros

Instead, L1-norm minimization

min_w  C‖y − Xw‖² + ‖w‖1

C : a parameter given by users

This is a regression problem
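The slides do not fix a particular solver at this point; as one illustration, here is a short proximal-gradient (ISTA-style) sketch with soft-thresholding for this objective. The step size, iteration count, and data are made-up choices.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_l1(X, y, C, n_iter=500):
    # minimize  C * ||y - X w||^2 + ||w||_1  by proximal gradient
    w = np.zeros(X.shape[1])
    step = 1.0 / (2.0 * C * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = -2.0 * C * (X.T @ (y - X @ w))            # gradient of the smooth part
        w = soft_threshold(w - step * grad, step)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
w_true = np.zeros(50)
w_true[[3, 17, 42]] = [1.5, -2.0, 0.7]
w_hat = ista_l1(X, X @ w_true, C=10.0)
print(np.count_nonzero(w_hat))   # soft-thresholding drives many coordinates to exact zero
```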

Chih-Jen Lin (National Taiwan Univ.) 11 / 60

Page 12

Sparse Representation

L1-norm Minimization II

1-norm versus 2-norm

‖w‖1 = |w1| + · · · + |wn|,   ‖w‖2² = w1² + · · · + wn²

[Two figures: |w| versus w (1-norm) and w² versus w (2-norm)]

Chih-Jen Lin (National Taiwan Univ.) 12 / 60

Page 13

Sparse Representation

L1-norm Minimization III

If using 2-norm, all wi are non-zeros

Using 1-norm, many wi may be zeros

Smaller C, better sparsity

Chih-Jen Lin (National Taiwan Univ.) 13 / 60

Page 14

Sparse Representation

L1-regularized Classifier

Training data {yi, xi}, xi ∈ R^n, i = 1, . . . , l, yi = ±1
l: # of data, n: # of features

min_w  ‖w‖1 + C Σ_{i=1}^{l} ξ(w; xi, yi)

ξ(w; xi, yi): loss function

Logistic loss: log(1 + exp(−y wTx))

L1 and L2 losses: max(1 − y wTx, 0) and max(1 − y wTx, 0)²

We do not consider kernels

Chih-Jen Lin (National Taiwan Univ.) 14 / 60
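As an illustration, a small numpy sketch of this objective with the logistic loss (the data and C are placeholders):

```python
import numpy as np

def l1_logreg_objective(w, X, y, C):
    # f(w) = ||w||_1 + C * sum_i log(1 + exp(-y_i * w^T x_i))
    margins = y * (X @ w)
    loss = np.sum(np.logaddexp(0.0, -margins))   # stable log(1 + e^{-m})
    return np.sum(np.abs(w)) + C * loss

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = np.sign(X[:, 0])
print(l1_logreg_objective(np.zeros(20), X, y, C=1.0))   # = C * l * log(2) at w = 0
```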

Page 15

Sparse Representation

L1-regularized Classifier (Cont’d)

‖w‖1 not differentiable ⇒ causes difficulties in optimization

Loss functions: logistic loss twice differentiable, L2 loss differentiable, and L1 loss not differentiable

We focus on logistic and L2 loss

Sometimes a bias term is added

wTx ⇒ wTx + b

Chih-Jen Lin (National Taiwan Univ.) 15 / 60

Page 16

Sparse Representation

L1-regularized Classifier (Cont’d)

Many available methods; we review existing methods and show details of some methods

Notation:

f(w) ≡ ‖w‖1 + C Σ_{i=1}^{l} ξ(w; xi, yi)

is the function to be minimized, and

L(w) ≡ C Σ_{i=1}^{l} ξ(w; xi, yi).

We do not discuss L1-regularized regression, which has recently become another hot topic

Chih-Jen Lin (National Taiwan Univ.) 16 / 60

Page 17

Existing Optimization Methods

Outline

Sparse Representation
Existing Optimization Methods
Coordinate Descent Methods
Other Methods
Experiments

Chih-Jen Lin (National Taiwan Univ.) 17 / 60

Page 18

Existing Optimization Methods

Decomposition Methods

Working on some variables at a time

Cyclic coordinate descent methods

Working variables sequentially or randomly selected

One-variable case:

min_z  f(w + z ej) − f(w)

ej: indicator vector for the jth element

Examples: Goodman (2004); Genkin et al. (2007); Balakrishnan and Madigan (2005); Tseng and Yun (2009); Shalev-Shwartz and Tewari (2009); Duchi and Singer (2009); Wright (2012)

Chih-Jen Lin (National Taiwan Univ.) 18 / 60

Page 19

Existing Optimization Methods

Decomposition Methods (Cont’d)

Gradient-based working set selection

Higher cost per iteration; larger working set

Examples: Shevade and Keerthi (2003); Tseng and Yun (2009); Yun and Toh (2011)

Active set method

Working set the same as the set of non-zero w elements

Examples: Perkins et al. (2003)

Chih-Jen Lin (National Taiwan Univ.) 19 / 60

Page 20

Existing Optimization Methods

Constrained Optimization

Replace w with w+ − w−:

min_{w+, w−}  Σ_{j=1}^{n} w+_j + Σ_{j=1}^{n} w−_j + C Σ_{i=1}^{l} ξ(w+ − w−; xi, yi)

s.t. w+_j ≥ 0, w−_j ≥ 0, j = 1, . . . , n.

Any bound-constrained optimization method can be used

Examples: Schmidt et al. (2009) used Gafni and Bertsekas (1984); Kazama and Tsujii (2003) used Benson and More (2001); we have considered Lin and More (1999); Koh et al. (2007): interior point method
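One way to use this reformulation, sketched below with SciPy's generic L-BFGS-B bound-constrained solver and the logistic loss; this is only an illustration (not one of the specific methods cited above), and the data and C are placeholders.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def objective(u, X, y, C):
    # u = [w+, w-]; value = sum(w+) + sum(w-) + C * logistic loss at w = w+ - w-
    n = X.shape[1]
    wp, wm = u[:n], u[n:]
    margins = y * (X @ (wp - wm))
    loss = np.sum(np.logaddexp(0.0, -margins))
    grad_w = C * (X.T @ (-y * expit(-margins)))           # dL/dw
    grad = np.concatenate([1.0 + grad_w, 1.0 - grad_w])   # d/dw+, d/dw-
    return np.sum(wp) + np.sum(wm) + C * loss, grad

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 40))
y = np.sign(X[:, 0])
n = X.shape[1]
res = minimize(objective, np.zeros(2 * n), args=(X, y, 1.0), jac=True,
               method="L-BFGS-B", bounds=[(0.0, None)] * (2 * n))
w = res.x[:n] - res.x[n:]
print(np.count_nonzero(np.abs(w) > 1e-6))
```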

Chih-Jen Lin (National Taiwan Univ.) 20 / 60

Page 21

Existing Optimization Methods

Constrained Optimization (Cont’d)

Equivalent problem with non-smooth constraints:

min_w  Σ_{i=1}^{l} ξ(w; xi, yi)

subject to ‖w‖1 ≤ K.

C replaced by a corresponding K

Goes back to LASSO (Tibshirani, 1996) if y ∈ R and least-square loss

Examples: Kivinen and Warmuth (1997); Lee et al. (2006); Donoho and Tsaig (2008); Duchi et al. (2008); Liu et al. (2009); Kim and Kim (2004); Kim et al. (2008); Roth (2004)

Chih-Jen Lin (National Taiwan Univ.) 21 / 60

Page 22

Existing Optimization Methods

Other Methods

Expectation maximization: Figueiredo (2003); Krishnapuram et al. (2004, 2005)

Stochastic gradient descent: Langford et al. (2009); Shalev-Shwartz and Tewari (2009)

Modified quasi Newton: Andrew and Gao (2007); Yu et al. (2010)

Hybrid: easy method first and then interior-point for faster local convergence (Shi et al., 2010)

Chih-Jen Lin (National Taiwan Univ.) 22 / 60

Page 23

Existing Optimization Methods

Other Methods (Cont’d)

Quadratic approximation followed by coordinate descent: Krishnapuram et al. (2005); Friedman et al. (2010); Yuan et al. (2012); a kind of Newton approach

Cutting plane method: Teo et al. (2010)

Some methods find a solution path for different C values; e.g., Rosset (2005), Zhao and Yu (2007), Park and Hastie (2007), and Keerthi and Shevade (2007).

Here we focus on a single C

Chih-Jen Lin (National Taiwan Univ.) 23 / 60

Page 24

Existing Optimization Methods

Strengths and Weaknesses of Existing Methods

Convergence speed: higher-order methods (quasi Newton or Newton) have fast local convergence, but fail to obtain a reasonable model quickly

Implementation efforts: higher-order methods usually more complicated

Large data: if solving linear systems is needed, use iterative methods (e.g., CG) instead of direct methods

Feature correlation: methods working on some variables at a time (e.g., decomposition methods) may be efficient if features are almost independent

Chih-Jen Lin (National Taiwan Univ.) 24 / 60

Page 25

Coordinate Descent Methods

Outline

Sparse Representation
Existing Optimization Methods
Coordinate Descent Methods
Other Methods
Experiments

Chih-Jen Lin (National Taiwan Univ.) 25 / 60

Page 26

Coordinate Descent Methods

Coordinate Descent Methods I

Minimizing the one-variable function

gj(z) ≡ |wj + z| − |wj| + L(w + z ej) − L(w),

where

ej ≡ [0, . . . , 0, 1, 0, . . . , 0]^T  (the single 1 is in the jth position).

No closed form solution

Genkin et al. (2007), Shalev-Shwartz and Tewari (2009), and Yuan et al. (2010)

Chih-Jen Lin (National Taiwan Univ.) 26 / 60

Page 27

Coordinate Descent Methods

Coordinate Descent Methods II

They differ in how to minimize this one-variable problem

While gj(z) is not differentiable, we can have a form similar to Taylor expansion:

gj(z) = gj(0) + g'j(0) z + (1/2) g''j(ηz) z²,

Chih-Jen Lin (National Taiwan Univ.) 27 / 60

Page 28

Coordinate Descent Methods

Coordinate Descent Methods III

Another representation (for our derivation)

min_z  gj(z) = |wj + z| − |wj| + Lj(z; w) − Lj(0; w),

where

Lj(z; w) ≡ L(w + z ej)

is a function of z

Chih-Jen Lin (National Taiwan Univ.) 28 / 60

Page 29

Coordinate Descent Methods

BBR (Genkin et al., 2007) I

They rewrite gj(z) as

gj(z) = gj(0) + g'j(0) z + (1/2) g''j(ηz) z²,

where 0 < η < 1,

g'j(0) ≡ L'j(0) + 1 if wj > 0,   L'j(0) − 1 if wj < 0,   (1)

and g''j(ηz) ≡ L''j(ηz).

Chih-Jen Lin (National Taiwan Univ.) 29 / 60

Page 30

Coordinate Descent Methods

BBR (Genkin et al., 2007) II

gj(z) is not differentiable if wj = 0

BBR finds an upper bound Uj of g''j(z) in a trust region:

Uj ≥ g''j(z), ∀ |z| ≤ ∆j.

Then ḡj(z) is an upper-bound function of gj(z):

ḡj(z) ≡ gj(0) + g'j(0) z + (1/2) Uj z².

Any step z satisfying ḡj(z) < ḡj(0) leads to

gj(z) − gj(0) = gj(z) − ḡj(0) ≤ ḡj(z) − ḡj(0) < 0,

Chih-Jen Lin (National Taiwan Univ.) 30 / 60

Page 31

Coordinate Descent Methods

BBR (Genkin et al., 2007) III

Convergence not proved (no sufficient decrease condition via line search)

Logistic loss

Uj ≡ C Σ_{i=1}^{l} x²ij F(yi wTxi, ∆j |xij|),

where

F(r, δ) = 0.25 if |r| ≤ δ,   F(r, δ) = 1 / (2 + e^(|r|−δ) + e^(δ−|r|)) otherwise.

Chih-Jen Lin (National Taiwan Univ.) 31 / 60

Page 32

Coordinate Descent Methods

BBR (Genkin et al., 2007) IV

The sub-problem solved in practice:

min_z  ḡj(z)

s.t. |z| ≤ ∆j and wj + z ≥ 0 if wj > 0,   wj + z ≤ 0 if wj < 0.

Update rule:

d = min( max( P(−g'j(0)/Uj, wj), −∆j ), ∆j ),

Chih-Jen Lin (National Taiwan Univ.) 32 / 60

Page 33

Coordinate Descent Methods

BBR (Genkin et al., 2007) V

where

P(z, w) ≡ z if sgn(w + z) = sgn(w),   −w otherwise.
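Putting the pieces of the last few slides together, a rough Python sketch of one BBR-style update of coordinate j for the logistic loss; the helper names are mine, and the wj = 0 case and the trust-region update of ∆j are omitted.

```python
import numpy as np

def F(r, delta):
    # Upper-bound factor for the logistic loss (previous slide)
    if abs(r) <= delta:
        return 0.25
    return 1.0 / (2.0 + np.exp(abs(r) - delta) + np.exp(delta - abs(r)))

def P(z, w):
    # Keep w_j + z from crossing zero
    return z if np.sign(w + z) == np.sign(w) else -w

def bbr_step(j, w, X, y, C, delta_j):
    margins = y * (X @ w)
    tau = 1.0 / (1.0 + np.exp(-margins))                       # τ(y_i w^T x_i)
    Lp = C * np.sum(y * X[:, j] * (tau - 1.0))                 # L'_j(0)
    gp = Lp + 1.0 if w[j] > 0 else Lp - 1.0                    # g'_j(0); w_j = 0 needs its own case
    Uj = C * sum(X[i, j] ** 2 * F(margins[i], delta_j * abs(X[i, j]))
                 for i in range(X.shape[0]))
    return min(max(P(-gp / Uj, w[j]), -delta_j), delta_j)      # step d from the update rule
```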

Chih-Jen Lin (National Taiwan Univ.) 33 / 60

Page 34

Coordinate Descent Methods

SCD (Shalev-Shwartz and Tewari, 2009) I

SCD: stochastic coordinate descent

w = w+ − w−

At each step, randomly select a variable from

{w+_1, . . . , w+_n, w−_1, . . . , w−_n}

One-variable sub-problem:

min_z  gj(z) ≡ z + Lj(z; w+ − w−) − Lj(0; w+ − w−),

subject to

w^{k,+}_j + z ≥ 0 or w^{k,−}_j + z ≥ 0,

Chih-Jen Lin (National Taiwan Univ.) 34 / 60

Page 35

Coordinate Descent Methods

SCD (Shalev-Shwartz and Tewari, 2009) II

Second-order approximation similar to BBR:

gj(z) = gj(0) + g'j(0) z + (1/2) Uj z²,

where

g'j(0) = 1 + L'j(0) for w+_j,   g'j(0) = 1 − L'j(0) for w−_j,

and Uj ≥ g''j(z), ∀z.

BBR: Uj an upper bound of g''j(z) in the trust region

SCD: global upper bound

Chih-Jen Lin (National Taiwan Univ.) 35 / 60

Page 36

Coordinate Descent Methods

SCD (Shalev-Shwartz and Tewari, 2009) III

For logistic regression,

Uj = 0.25 C Σ_{i=1}^{l} x²ij ≥ g''j(z), ∀z.

Shalev-Shwartz and Tewari (2009) assume −1 ≤ xij ≤ 1, ∀ i, j, so a simple upper bound is

Uj = 0.25 C l.
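A small sketch of one SCD update of w+_j under this global bound (placeholder data; the analogous w−_j update and the stopping rule are omitted):

```python
import numpy as np

def scd_step_plus(j, wp, wm, X, y, C):
    # One update of w+_j with the global bound U_j = 0.25 * C * l (assumes |x_ij| <= 1)
    margins = y * (X @ (wp - wm))
    Lp = C * np.sum(y * X[:, j] * (1.0 / (1.0 + np.exp(-margins)) - 1.0))  # L'_j(0)
    gp = 1.0 + Lp                       # g'_j(0) for the w+ part
    Uj = 0.25 * C * X.shape[0]
    z = max(-wp[j], -gp / Uj)           # minimize the quadratic bound, keep w+_j >= 0
    wp[j] += z
    return wp
```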

Chih-Jen Lin (National Taiwan Univ.) 36 / 60

Page 37

Coordinate Descent Methods

CDN (Yuan et al., 2010) I

Newton step:

min_z  g'j(0) z + (1/2) g''j(0) z².

That is,

min_z  |wj + z| − |wj| + L'j(0) z + (1/2) L''j(0) z².

Second-order term not replaced by an upper bound

Function value may not be decreasing

Tighter second-order term is useful (will see later)

Chih-Jen Lin (National Taiwan Univ.) 37 / 60

Page 38

Coordinate Descent Methods

CDN (Yuan et al., 2010) II

Assume z is the optimal solution; need line search

Following Tseng and Yun (2009)

gj(λz) − gj(0) ≤ σλ (L'j(0) z + |wj + z| − |wj|),

This is slightly different from the traditional form of line search. Now

|wj + z| − |wj|

must be taken into consideration

Convergence can be proved
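A rough Python sketch of the CDN step described above: the closed-form minimizer of the one-variable second-order model with the one-norm term, followed by the Tseng–Yun style backtracking line search. The constants σ and β and the helper names are my own choices.

```python
import numpy as np

def newton_direction(Lp, Lpp, wj):
    # Closed-form minimizer of |w_j + z| - |w_j| + L'_j(0) z + 0.5 L''_j(0) z^2
    if Lp + 1.0 <= Lpp * wj:
        return -(Lp + 1.0) / Lpp
    if Lp - 1.0 >= Lpp * wj:
        return -(Lp - 1.0) / Lpp
    return -wj

def cdn_step(j, w, X, y, C, sigma=0.01, beta=0.5, max_ls=20):
    margins = y * (X @ w)
    tau = 1.0 / (1.0 + np.exp(-margins))                    # τ(y_i w^T x_i)
    Lp = C * np.sum(y * X[:, j] * (tau - 1.0))              # L'_j(0)
    Lpp = C * np.sum(X[:, j] ** 2 * tau * (1.0 - tau))      # L''_j(0)
    z = newton_direction(Lp, Lpp, w[j])
    decrease = Lp * z + abs(w[j] + z) - abs(w[j])           # right-hand side of the condition
    loss0 = np.sum(np.logaddexp(0.0, -margins))
    lam = 1.0
    for _ in range(max_ls):                                 # backtracking line search
        step = lam * z
        loss1 = np.sum(np.logaddexp(0.0, -(y * (X @ w + step * X[:, j]))))
        lhs = abs(w[j] + step) - abs(w[j]) + C * (loss1 - loss0)   # g_j(λz) - g_j(0)
        if lhs <= sigma * lam * decrease:
            break
        lam *= beta
    w[j] += lam * z
    return w
```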

Chih-Jen Lin (National Taiwan Univ.) 38 / 60

Page 39

Coordinate Descent Methods

Calculating First and Second Order Information I

We have

L'j(0) = dL(w + z ej)/dz |_{z=0} = ∇j L(w)

L''j(0) = d²L(w + z ej)/dz² |_{z=0} = ∇²jj L(w)

Chih-Jen Lin (National Taiwan Univ.) 39 / 60

Page 40

Coordinate Descent Methods

Calculating First and Second Order Information II

For logistic loss:

L'j(0) = C Σ_{i=1}^{l} yi xij (τ(yi wTxi) − 1),

L''j(0) = C Σ_{i=1}^{l} x²ij τ(yi wTxi) (1 − τ(yi wTxi)),

where

τ(s) ≡ 1 / (1 + e^{−s})
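These two formulas translate directly into vectorized numpy (a sketch; overflow handling for large margins is omitted):

```python
import numpy as np

def logistic_derivatives(j, w, X, y, C):
    # L'_j(0) and L''_j(0) for the logistic loss, as on this slide
    tau = 1.0 / (1.0 + np.exp(-(y * (X @ w))))   # τ(y_i w^T x_i)
    Lp = C * np.sum(y * X[:, j] * (tau - 1.0))
    Lpp = C * np.sum(X[:, j] ** 2 * tau * (1.0 - tau))
    return Lp, Lpp
```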

Chih-Jen Lin (National Taiwan Univ.) 40 / 60

Page 41

Other Methods

Outline

Sparse Representation
Existing Optimization Methods
Coordinate Descent Methods
Other Methods
Experiments

Chih-Jen Lin (National Taiwan Univ.) 41 / 60

Page 42

Other Methods

GLMNET (Friedman et al., 2010) I

A quadratic approximation of L(w):

f(w + d) − f(w)
= (‖w + d‖1 + L(w + d)) − (‖w‖1 + L(w))
≈ ∇L(w)Td + (1/2) dT ∇²L(w) d + ‖w + d‖1 − ‖w‖1.

Then w ← w + d

Line search is needed for convergence

Chih-Jen Lin (National Taiwan Univ.) 42 / 60

Page 43

Other Methods

GLMNET (Friedman et al., 2010) II

But how to handle quadratic minimization with some one-norm terms?

GLMNET uses coordinate descent

For logistic regression:

∇L(w) = C Σ_{i=1}^{l} (τ(yi wTxi) − 1) yi xi

∇²L(w) = C XTDX,

where D ∈ R^{l×l} is a diagonal matrix with

Dii = τ(yi wTxi) (1 − τ(yi wTxi))

Chih-Jen Lin (National Taiwan Univ.) 43 / 60
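A numpy sketch of forming this gradient and Hessian for GLMNET's quadratic model; building the dense Hessian here is only for illustration, since a practical implementation would typically use it implicitly inside coordinate descent rather than forming it explicitly.

```python
import numpy as np

def glmnet_quadratic_model(w, X, y, C):
    # Gradient and Hessian of L(w) for the logistic loss, as on this slide
    tau = 1.0 / (1.0 + np.exp(-(y * (X @ w))))
    grad = C * (X.T @ ((tau - 1.0) * y))   # ∇L(w)
    D = tau * (1.0 - tau)                  # diagonal of D
    hess = C * (X.T * D) @ X               # ∇²L(w) = C XᵀDX (dense, for illustration only)
    return grad, hess
```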

Page 44

Other Methods

GLMNET (Friedman et al., 2010) III

See an improved version called newGLMNET at Yuan et al. (2012)

Chih-Jen Lin (National Taiwan Univ.) 44 / 60

Page 45

Other Methods

Bundle Methods (Teo et al., 2010) I

Also called cutting plane method

L(w) a convex loss function

If w^k is the current solution,

L(w) ≥ ∇L(w^k)T (w − w^k) + L(w^k) = a_k^T w + b_k, ∀w,

where

a_k ≡ ∇L(w^k) and b_k ≡ L(w^k) − a_k^T w^k.

Chih-Jen Lin (National Taiwan Univ.) 45 / 60

Page 46

Other Methods

Bundle Methods (Teo et al., 2010) II

Maintains all earlier cutting planes to form a lower-bound function for L(w):

L(w) ≥ L^CP_k(w) ≡ max_{1≤t≤k} a_t^T w + b_t, ∀w.

Obtain w^{k+1} by solving

min_w  ‖w‖1 + L^CP_k(w).

Chih-Jen Lin (National Taiwan Univ.) 46 / 60

Page 47

Other Methods

Bundle Methods (Teo et al., 2010) III

This is a linear program using w = w+ − w−:

min_{w+, w−, ζ}  Σ_{j=1}^{n} w+_j + Σ_{j=1}^{n} w−_j + ζ

subject to  a_t^T (w+ − w−) + b_t ≤ ζ, t = 1, . . . , k,
            w+_j ≥ 0, w−_j ≥ 0, j = 1, . . . , n.
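A sketch of solving this linear program with scipy.optimize.linprog (only an illustration of the sub-problem, not BMRM's actual solver); variables are ordered as [w+, w−, ζ]:

```python
import numpy as np
from scipy.optimize import linprog

def bundle_subproblem(A, b):
    # A: (k, n) with rows a_t; b: (k,) with entries b_t
    k, n = A.shape
    c = np.concatenate([np.ones(2 * n), [1.0]])         # objective: Σ w+ + Σ w- + ζ
    # a_t^T (w+ - w-) + b_t <= ζ   <=>   A w+ - A w- - ζ <= -b
    A_ub = np.hstack([A, -A, -np.ones((k, 1))])
    bounds = [(0.0, None)] * (2 * n) + [(None, None)]   # w+, w- >= 0; ζ free
    res = linprog(c, A_ub=A_ub, b_ub=-b, bounds=bounds, method="highs")
    wp, wm = res.x[:n], res.x[n:2 * n]
    return wp - wm, res.x[-1]
```

Each outer iteration would append the new cutting plane (a_k, b_k) as another row of A and entry of b before re-solving.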

Chih-Jen Lin (National Taiwan Univ.) 47 / 60

Page 48

Experiments

Outline

Sparse Representation
Existing Optimization Methods
Coordinate Descent Methods
Other Methods
Experiments

Chih-Jen Lin (National Taiwan Univ.) 48 / 60

Page 49

Experiments

Data

Data set      l        n          # of non-zeros
real-sim      72,309   20,958     3,709,083
news20        19,996   1,355,191  9,097,916
rcv1          677,399  47,236     49,556,258
yahoo-korea   460,554  3,052,939  156,436,656

l : number of data, n: number of features

They are all document sets

4/5 for training and 1/5 for testing

Select best C by cross validation on training

Chih-Jen Lin (National Taiwan Univ.) 49 / 60

Page 50

Experiments

Compared Methods

Software using wTx without b

BBR (Genkin et al., 2007)

SCD (Shalev-Shwartz and Tewari, 2009)

CDN: our coordinate descent implementation

TRON: our Newton implementation for the bound-constrained formulation

OWL-QN (Andrew and Gao, 2007)

BMRM (Teo et al., 2010)

Chih-Jen Lin (National Taiwan Univ.) 50 / 60

Page 51

Experiments

Compared Methods (Cont’d)

Software using wTx + b

CDN: our coordinate descent implementation

BBR (Genkin et al., 2007)

CGD-GS (Yun and Toh, 2011)

IPM (Koh et al., 2007)

GLMNET (Friedman et al., 2010)

Lassplore (Liu et al., 2009)

Chih-Jen Lin (National Taiwan Univ.) 51 / 60

Page 52

Experiments

Convergence of Objective Values (no b)

[Figures for real-sim, news20, rcv1, and yahoo-korea]

Chih-Jen Lin (National Taiwan Univ.) 52 / 60

Page 53

Experiments

Test Accuracy

[Figures for real-sim, news20, rcv1, and yahoo-korea]

Chih-Jen Lin (National Taiwan Univ.) 53 / 60

Page 54

Experiments

Convergence of Objective Values (with b)

[Figures for real-sim, news20, rcv1, and yahoo-korea]

Chih-Jen Lin (National Taiwan Univ.) 54 / 60

Page 55

Experiments

Observations and Conclusions

Decomposition methods better in the early stage

One-variable sub-problem in coordinate descent

Use tight approximation if possible

Newton (IPM, GLMNET) and quasi Newton (OWL-QN): fast local convergence in the end

We also checked gradient and sparsity

Complete results (on more data sets) and programs are in Yuan et al.; JMLR 2010 (11), 3183–3234

Chih-Jen Lin (National Taiwan Univ.) 55 / 60

Page 56

Experiments

References I

G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In Proceedings of the Twenty Fourth International Conference on Machine Learning (ICML), 2007.

S. Balakrishnan and D. Madigan. Algorithms for sparse linear classifiers in the massive data setting. 2005. URL http://www.stat.rutgers.edu/~madigan/PAPERS/sm.pdf.

S. Benson and J. J. More. A limited memory variable metric method for bound constrained minimization. Preprint MCS-P909-0901, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, Illinois, 2001.

D. L. Donoho and Y. Tsaig. Fast solution of l1 minimization problems when the solution may be sparse. IEEE Transactions on Information Theory, 54:4789–4812, 2008.

J. Duchi and Y. Singer. Boosting with structural sparsity. In Proceedings of the Twenty Sixth International Conference on Machine Learning (ICML), 2009.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the L1-ball for learning in high dimensions. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008.

M. A. T. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:1150–1159, 2003.

J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

Chih-Jen Lin (National Taiwan Univ.) 56 / 60

Page 57

Experiments

References II

E. M. Gafni and D. P. Bertsekas. Two-metric projection methods for constrained optimization. SIAM Journal on Control and Optimization, 22:936–964, 1984.

A. Genkin, D. D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3):291–304, 2007.

J. Goodman. Exponential priors for maximum entropy models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2004.

J. Kazama and J. Tsujii. Evaluation and extension of maximum entropy models with inequality constraints. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 137–144, 2003.

S. S. Keerthi and S. Shevade. A fast tracking algorithm for generalized LARS/LASSO. IEEE Transactions on Neural Networks, 18(6):1826–1830, 2007.

J. Kim, Y. Kim, and Y. Kim. A gradient-based optimization algorithm for LASSO. Journal of Computational and Graphical Statistics, 17(4):994–1009, 2008.

Y. Kim and J. Kim. Gradient LASSO for feature selection. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.

J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132:1–63, 1997.

Chih-Jen Lin (National Taiwan Univ.) 57 / 60

Page 58

Experiments

References III

K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007. URL http://www.stanford.edu/~boyd/l1_logistic_reg.html.

B. Krishnapuram, A. J. Hartemink, L. Carin, and M. A. T. Figueiredo. A Bayesian approach to joint feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):1105–1111, 2004.

B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. Hartemink. Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–968, 2005.

J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:771–801, 2009.

S.-I. Lee, H. Lee, P. Abbeel, and A. Y. Ng. Efficient l1 regularized logistic regression. In Proceedings of the Twenty-first National Conference on Artificial Intelligence (AAAI-06), pages 1–9, Boston, MA, USA, July 2006.

C.-J. Lin and J. J. More. Newton's method for large-scale bound constrained problems. SIAM Journal on Optimization, 9:1100–1127, 1999.

J. Liu, J. Chen, and J. Ye. Large-scale sparse logistic regression. In Proceedings of The 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 547–556, 2009.

Chih-Jen Lin (National Taiwan Univ.) 58 / 60

Page 59

Experiments

References IV

M. Y. Park and T. Hastie. L1 regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society Series B, 69:659–677, 2007.

S. Perkins, K. Lacker, and J. Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.

S. Rosset. Following curved regularized optimization solution paths. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1153–1160, Cambridge, MA, 2005. MIT Press.

V. Roth. The generalized LASSO. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

M. Schmidt, G. Fung, and R. Rosales. Optimization methods for l1-regularization. Technical Report TR-2009-19, University of British Columbia, 2009.

S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1 regularized loss minimization. In Proceedings of the Twenty Sixth International Conference on Machine Learning (ICML), 2009.

S. K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.

J. Shi, W. Yin, S. Osher, and P. Sajda. A fast hybrid algorithm for large scale l1-regularized logistic regression. Journal of Machine Learning Research, 11:713–741, 2010.

C. H. Teo, S. Vishwanathan, A. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11:311–365, 2010.

Chih-Jen Lin (National Taiwan Univ.) 59 / 60

Page 60

Experiments

References V

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58:267–288, 1996.

P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387–423, 2009.

S. J. Wright. Accelerated block-coordinate relaxation for regularized optimization. SIAM Journal on Optimization, 22(1):159–186, 2012.

J. Yu, S. Vishwanathan, S. Gunter, and N. N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. Journal of Machine Learning Research, 11:1–57, 2010.

G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A comparison of optimization methods and software for large-scale l1-regularized linear classification. Journal of Machine Learning Research, 11:3183–3234, 2010. URL http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf.

G.-X. Yuan, C.-H. Ho, and C.-J. Lin. An improved GLMNET for l1-regularized logistic regression. Journal of Machine Learning Research, 13:1999–2030, 2012. URL http://www.csie.ntu.edu.tw/~cjlin/papers/l1_glmnet/long-glmnet.pdf.

S. Yun and K.-C. Toh. A coordinate gradient descent method for l1-regularized convex minimization. Computational Optimizations and Applications, 48(2):273–307, 2011.

P. Zhao and B. Yu. Stagewise lasso. Journal of Machine Learning Research, 8:2701–2726, 2007.

Chih-Jen Lin (National Taiwan Univ.) 60 / 60