Independently Interpretable Lasso: A New Regularizer for Sparse Regression with Uncorrelated Variables

Masaaki Takada1,2 Taiji Suzuki3,4,5 Hironori Fujisawa1,6

[email protected] [email protected] [email protected]

1 School of Advanced Sciences, The Graduate University for Advanced Studies   2 Toshiba Corporation

3 Graduate School of Information Science and Technology, The University of Tokyo   4 PRESTO, Japan Science and Technology Agency, Japan

5 Center for Advanced Integrated Intelligence Research, RIKEN, Tokyo, Japan   6 The Institute of Statistical Mathematics

Abstract

Sparse regularization such as ℓ1 regularization is a powerful and widely used strategy for high dimensional learning problems. The effectiveness of sparse regularization has been supported practically and theoretically by several studies. However, one of the biggest issues in sparse regularization is that its performance is quite sensitive to correlations between features. Ordinary ℓ1 regularization can select variables correlated with each other, which results in deterioration of not only its generalization error but also its interpretability. In this paper, we propose a new regularization method, "Independently Interpretable Lasso" (IILasso). Our proposed regularizer suppresses the selection of correlated variables, so that each active variable affects the objective variable independently in the model. Hence, we can interpret regression coefficients intuitively and also improve performance by avoiding overfitting. We analyze the theoretical properties of IILasso and show that the proposed method has a clear advantage for sign recovery and achieves an almost minimax optimal convergence rate. Synthetic and real data analyses also indicate the effectiveness of IILasso.

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the author(s).

1 Introduction

High dimensional data appears in many fields such as biology, economics and industry. A common approach to high dimensional regression is the sparse regularization strategy, such as Lasso (least absolute shrinkage and selection operator) [25]. Since sparse regularization performs parameter estimation and feature selection simultaneously, (i) it offers interpretable results by identifying informative variables, and (ii) it can effectively avoid overfitting by discarding redundant variables. Because of these properties, sparse regularization has shown huge success in a wide range of data analyses in science and engineering. Moreover, several theoretical studies have been developed to support the effectiveness of sparse regularization, and efficient optimization methods have also been proposed so that sparse learning is easily executed.

However, the performance of sparse regularization is guaranteed only under "small correlation" assumptions, that is, when the features are not strongly correlated with each other. Indeed, typical theoretical guarantees rely on small correlation assumptions such as the restricted eigenvalue condition [3, 2]. When the features are highly correlated, the selected variables are likely to be jointly correlated. As a result, each coefficient cannot be seen as an independent variable contribution, so the model is no longer easy to interpret. This kind of interpretability is called "decomposability" in [14], which represents whether we can decompose a model into parts and interpret each component. In our experience, this kind of interpretability is quite important in many practical modeling settings for decision-making.


In addition, selecting several correlated variables results in worse generalization error, because the redundant representation tends to cause overfitting. Thus, it is preferable to construct the model with uncorrelated variables, both for interpretability and for generalization ability.

Several methods have been proposed to resolve the problems induced by correlations among variables. The first line of research is based on a strategy in which correlated variables are either all selected or not selected at all. Examples of this line are Elastic Net [30], Pairwise Elastic Net [15] and Trace Lasso [11]. These methods select not only important variables but also variables strongly correlated with them. Although they often give better generalization error, this strategy makes the model hard to interpret, because many correlated variables are incorporated into the final model.

The second line of research, which includes our proposed method, is based on another strategy in which we select uncorrelated variables and thus obtain decomposability. Uncorrelated Lasso [5] intends to construct a model with uncorrelated variables. However, it still tends to select "negatively" correlated variables, and hence the correlation problem is not resolved. Exclusive Group Lasso [13] is also in this line, but it requires grouping correlated variables beforehand. The authors suggest grouping variables whose correlations exceed a certain threshold. However, determining the threshold is not a trivial problem, and in practice it causes unstable results.

In this paper, we propose a new regularization method, named "Independently Interpretable Lasso" (IILasso), which offers efficient variable selection, does not select negatively correlated variables, and is free from specific pre-processing such as grouping. Our proposed regularization formulation more aggressively induces sparsity among the active variables and reduces the correlations among them. Hence, we can independently interpret the effects of the active variables in the output model, and the generalization performance is also much improved. To support the usefulness of our proposal, we make the following contributions:

• We show the necessary and sufficient condition for the sign consistency of variable selection. As a result, it is shown that our method achieves sign consistency under a milder condition than Lasso.

• The convergence rate of the estimation error is analyzed. We show that the estimation error achieves the almost minimax optimal rate and gives better performance than Lasso in some situations.

• We propose a coordinate descent algorithm to find a local optimum of the objective function. It is guaranteed to converge to a stationary point.

• We show that every local optimal solution achieves the same statistical error rate as the global optimal solution, and thus is almost minimax optimal, even though the objective function is not necessarily convex (a price paid for better statistical properties).

The rest of this paper is organized as follows. In Section 2, we propose the new regularization formulation and introduce its optimization method. We also state the relationship with existing work. In Section 3, we show theoretical results on sign recovery and the convergence rate. We can see that IILasso is superior to Lasso for correlated designs. In Section 4, both synthetic and real-world data experiments, including 10 microarray datasets, are presented. In Section 5, we summarize the properties of IILasso.

Notations. Let M ∈ R^{n×p}. We use subscripts for the columns of M, i.e., M_j denotes the j-th column. Let v ∈ R^p. Diag(v) ∈ R^{p×p} is the diagonal matrix whose j-th diagonal element is v_j. |v| is the elementwise absolute vector whose j-th element is |v_j|. sgn(v) is the sign vector whose elements are 1 for v_j > 0, −1 for v_j < 0, and 0 for v_j = 0. supp(v) is the support set of v, i.e., {j ∈ {1, · · · , p} | v_j ≠ 0}. Let S be a subset of {1, · · · , p}. |S| is the number of elements in S. S^c is the complement of S, i.e., S^c = {1, · · · , p} \ S. v_S is the vector v restricted to the index set S. M_{S1 S2} is the matrix whose rows are restricted to S1 and whose columns are restricted to S2.

2 Proposed Method

2.1 IILasso: A New Regularization Formulation

Consider the problem of predicting y ∈ R^n given a design matrix X ∈ R^{n×p}, assuming a linear model

\[ y = X\beta + \epsilon, \]

where ϵ ∈ R^n is a noise vector and β ∈ R^p is a regression coefficient. We assume without loss of generality that the features are standardized so that Σ_{i=1}^n X_{ij} = 0, Σ_{i=1}^n X_{ij}^2 / n = 1 and Σ_{i=1}^n y_i = 0. Then, Lasso solves the following optimization problem:

\[ \min_{\beta}\ \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1, \]

where λ > 0 is a regularization parameter. Since the penalty term induces sparsity of the estimated values, the output model contains few variables and hence is tractable.


[Figure 1: Contours of the penalty ‖β‖1 + (1/2)|β|⊤R|β| at levels 0.5, 1, 1.5 with β = [β1, β2], for R with diagonal entries 0 (top row) or 1 (bottom row) and off-diagonal entries 0, 1, 2 (left to right).]

This is critically important for interpretability. However, as described in Section 1, when there are highly correlated variables, Lasso can select correlated variables, especially for small λ, resulting in worse interpretability and poor generalization error.

To overcome this problem, we propose a new regularization formulation as follows:

\[ \min_{\beta}\ \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\left(\|\beta\|_1 + \frac{\alpha}{2}|\beta|^\top R|\beta|\right), \qquad (1) \]

where α > 0 is a regularization parameter for the new regularization term, and R ∈ R^{p×p} is a symmetric matrix whose component R_{jk} ≥ 0 represents the similarity between X_j and X_k. The last term of (1) can also be written as (λα/2) Σ_{j=1}^p Σ_{k=1}^p R_{jk} |β_j| |β_k|. We define R_{jk} for j ≠ k as a monotonically increasing function of the absolute correlation r_{jk} = (1/n)|X_j⊤X_k|, so that correlated variables are unlikely to be selected simultaneously. In particular, when X_j and X_k are strongly correlated, the squared error hardly changes as long as β_j + β_k is constant, but the penalty R_{jk}|β_j||β_k| strongly induces either β_j = 0 or β_k = 0. On the contrary, if X_j and X_k are uncorrelated, i.e., R_{jk} is small, then the penalty for selecting both β_j and β_k is negligible.

We can see the exclusive effect of our regularization term from the constraint regions corresponding to the penalties. Figure 1 illustrates the constraint regions of ‖β‖1 + (1/2)|β|⊤R|β| for the case p = 2. As the diagonal elements of R increase (from the top to the bottom panels), the contours become smooth at the coordinate axes. Because of this, the minimizer tends to select both variables if the two variables are strongly correlated. This is the grouping effect of Elastic Net, as we describe later. On the other hand, as the off-diagonal elements of R increase (from the left to the right panels), the contours become pointed at the coordinate axes and the minimizer tends to be sparser. This is the exclusive effect for correlated variables. Although the shape of the contours resembles that of the ℓq (0 < q < 1) penalty, we emphasize that we use the correlation information R_{jk} and the multiplicative term |β_j||β_k|. Since our regularization term adapts to correlations, our penalty achieves both sparse and stable solutions.

Several definitions of the similarity matrix R can be considered. One natural choice is R_{jk} = r_{jk}^2. In this case R is positive semidefinite, because the Hadamard product of positive semidefinite matrices is also positive semidefinite. Hence, problem (1) becomes convex and the global optimum is easy to obtain. However, this choice may not reduce correlations enough. Another choice is R_{jk} = |r_{jk}|, which reduces correlations more strongly. Yet another effective choice is R_{jk} = |r_{jk}|/(1 − |r_{jk}|) for j ≠ k and R_{jk} = 0 for j = k. In this case, if the correlation between two variables becomes high, i.e., r_{jk} → 1, then the penalty term diverges and IILasso cannot select both of them simultaneously.
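To make this concrete, the following is a minimal sketch of computing a similarity matrix and evaluating the objective in (1). It assumes a standardized design matrix and uses the choice R_{jk} = |r_{jk}|/(1 − |r_{jk}|) described above; the function names and the small eps safeguard are our own illustration, not the authors' implementation.

```python
import numpy as np

def similarity_matrix(X, eps=1e-8):
    """R_jk = |r_jk| / (1 - |r_jk|) for j != k, and R_jj = 0."""
    n = X.shape[0]
    r = np.abs(X.T @ X) / n          # absolute sample correlations (X standardized)
    np.fill_diagonal(r, 0.0)
    return r / (1.0 - r + eps)       # eps guards against |r_jk| = 1

def iilasso_objective(beta, X, y, lam, alpha, R):
    """(1/2n)||y - X beta||_2^2 + lam * (||beta||_1 + (alpha/2) |beta|^T R |beta|)."""
    n = X.shape[0]
    resid = y - X @ beta
    a = np.abs(beta)
    return resid @ resid / (2 * n) + lam * (a.sum() + 0.5 * alpha * a @ R @ a)
```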


[Figure 2: Uncorrelated model vs. correlated model. The same prediction written two ways: (A) 2X1 + X2 and (B) X1 + √2 X3.]

We use the last choice in our numerical experiments because it is favorable from the theoretical viewpoint described in Section 3.

Now, let us emphasize the advantage of uncorrelated models for interpretability with a simple example (Figure 2). Suppose X = [X1, X2, X3] ∈ R^{n×3} is standardized, X1 and X2 are orthogonal, and X3 = (X1 + X2)/√2. Consider two models: (A) y = 2X1 + X2 and (B) y = X1 + √2 X3. Both models output the same prediction. Which model is more interpretable? We believe that model (A) is more interpretable than (B), because its active variables are uncorrelated, so each coefficient can be read as an independent variable contribution. Although Lasso selects (B) because the ℓ1 norm of its coefficients is smaller, IILasso with α large enough selects (A) because our regularization term excludes correlations.
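As a quick numerical check of this example (our own illustration; the column construction simply follows the description above), the ℓ1 norm indeed favors model (B), while the correlation-weighted cross term of (1) is exactly zero for model (A) and strictly positive for (B):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
X1 = rng.standard_normal(n); X1 *= np.sqrt(n) / np.linalg.norm(X1)
X2 = rng.standard_normal(n); X2 -= X1 * (X1 @ X2) / (X1 @ X1)   # orthogonal to X1
X2 *= np.sqrt(n) / np.linalg.norm(X2)
X3 = (X1 + X2) / np.sqrt(2)
X = np.column_stack([X1, X2, X3])

beta_A = np.array([2.0, 1.0, 0.0])            # model (A): 2 X1 + X2
beta_B = np.array([1.0, 0.0, np.sqrt(2.0)])   # model (B): X1 + sqrt(2) X3

r = np.abs(X.T @ X) / n                       # r_12 = 0, r_13 = r_23 ~ 1/sqrt(2)
np.fill_diagonal(r, 0.0)
R = r / (1.0 - r)

for name, b in (("A", beta_A), ("B", beta_B)):
    a = np.abs(b)
    print(name, "l1 =", round(a.sum(), 2), " cross term =", round(0.5 * a @ R @ a, 2))
# A: l1 = 3.0,  cross term = 0.0   (its active variables are uncorrelated)
# B: l1 = 2.41, cross term > 0     (X1 and X3 are strongly correlated)
```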

2.2 Optimization

To solve (1), we use the coordinate descent algorithm (CDA), which was originally proposed for Lasso (the case α = 0 of IILasso) [8, 9]. CDA is a simple and efficient algorithm, particularly in high dimensions. CDA proceeds simply: for each j ∈ {1, · · · , p}, we optimize the objective function with respect to β_j with the remaining elements of β fixed at their most recently updated values.

CDA is applicable even when the quadratic penalty is included. Let L(β) denote the objective function in (1). To derive the update equation, when β_j ≠ 0, differentiating L(β) with respect to β_j yields

\[ \partial_{\beta_j} L(\beta) = -\frac{1}{n} X_j^\top (y - X_{-j}\beta_{-j}) + (1 + \lambda\alpha R_{jj})\beta_j + \lambda\left(1 + \alpha R_{j,-j}|\beta_{-j}|\right)\operatorname{sgn}(\beta_j), \]

where β_{−j} denotes β without the j-th component, X_{−j} denotes X without the j-th column, and R_{j,−j} denotes the j-th row of R without its j-th element.

Algorithm 1 CDA for IILasso

  for λ = λ_max, · · · , λ_min do
      initialize β
      while until convergence do
          for j = 1, · · · , p do
              β_j ← (1 / (1 + λ α R_{jj})) · S( (1/n) X_j⊤ (y − X_{−j} β_{−j}),  λ (1 + α R_{j,−j} |β_{−j}|) )
          end for
      end while
  end for

Solving ∂_{β_j} L(β) = 0, we obtain the update rule

\[ \beta_j \leftarrow \frac{1}{1 + \lambda\alpha R_{jj}}\, S\!\left(\frac{1}{n} X_j^\top (y - X_{-j}\beta_{-j}),\ \lambda\left(1 + \alpha R_{j,-j}|\beta_{-j}|\right)\right), \qquad (2) \]

where S(z, γ) is the soft thresholding function

\[ S(z, \gamma) := \operatorname{sgn}(z)(|z| - \gamma)_+ = \begin{cases} z - \gamma & \text{if } z > 0 \text{ and } \gamma < |z|, \\ z + \gamma & \text{if } z < 0 \text{ and } \gamma < |z|, \\ 0 & \text{if } |z| \le \gamma. \end{cases} \]

The whole algorithm for solving IILasso is described in Algorithm 1. We sweep λ from λ_max down to λ_min. β is initialized at each λ in some way, such as (a) all zeros, (b) the solution at the previous λ, or (c) the solution of ordinary Lasso.
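For concreteness, here is a minimal sketch of the coordinate descent update (2) over a decreasing sequence of λ, in the spirit of Algorithm 1. It uses warm starts from the previous λ and a simple change-based stopping rule; it is our own illustration of the formulation above, not the authors' reference code.

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sgn(z) * (|z| - gamma)_+"""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def iilasso_cda(X, y, R, lambdas, alpha, max_iter=1000, tol=1e-6):
    """Coordinate descent for (1); X is standardized, lambdas is decreasing."""
    n, p = X.shape
    beta = np.zeros(p)                      # warm start carried across lambdas
    path = []
    for lam in lambdas:
        for _ in range(max_iter):
            max_change = 0.0
            for j in range(p):
                beta_mj = beta.copy(); beta_mj[j] = 0.0
                partial_resid = y - X @ beta_mj
                z = X[:, j] @ partial_resid / n
                gamma = lam * (1.0 + alpha * (R[j] @ np.abs(beta_mj)))
                new_bj = soft_threshold(z, gamma) / (1.0 + lam * alpha * R[j, j])
                max_change = max(max_change, abs(new_bj - beta[j]))
                beta[j] = new_bj
            if max_change < tol:
                break
        path.append(beta.copy())
    return path
```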

With Algorithm 1, the objective function monotonically decreases at each update and the estimate converges to a stationary point.

Proposition 1. Let {β^t}_{t=0,1,···} be the sequence of β generated by Algorithm 1. Then, every cluster point of {β^t}_{t ≡ (p−1) mod p} is a stationary point.

Proof. The proof is based on Theorem 4.1 in [26]. First, the level set {β | L(β) ≤ L(β^0)} is compact and L(β) is continuous. Moreover, L(β) has a unique minimum in β_j, given by (2). Therefore, every cluster point of {β^t}_{t ≡ (p−1) mod p} is a coordinate-wise minimum point. In addition, since L(β) can be seen as a locally quadratic function in any direction, L(β) is regular at the cluster point. These facts and Theorem 4.1(c) in [26] conclude the assertion.

2.3 Related Work

There are some existing works that take correlations among variables into account.


We can divide them into two main directions: 1) grouping selection and 2) exclusive selection. The former groups correlated variables and selects them together. The latter excludes correlated variables and selects uncorrelated variables.

The representative method of grouping selection is Elastic Net [30]. Its objective function adds a squared ℓ2 penalty to the ℓ1 penalty:

\[ \min_{\beta}\ \|y - X\beta\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2. \]

Due to the ℓ2 penalty, if the correlation between X_j and X_k is strong, tending to one, then the estimated values of β_j and β_k get closer. In particular, if X_j and X_k are equal, then β_j and β_k must be equal. This behavior is called the grouping effect. Pairwise Elastic Net [15] and Trace Lasso [11] follow the same direction of research (and improve the prediction accuracy). These methods tend to include many correlated variables, so each coefficient no longer indicates an independent variable contribution. As a result, it is hard to interpret which variables are truly active and how the variables affect the objective variable.

The other direction of research is exclusive selection. Uncorrelated Lasso (ULasso) [5] aims to reduce correlations among the active variables. It optimizes the following objective function:

\[ \min_{\beta}\ \|y - X\beta\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2\,\beta^\top R\beta, \qquad (3) \]

where R ∈ R^{p×p} with elements R_{jk} = ((1/n) X_j⊤X_k)^2. ULasso closely resembles our formulation, but there is a critical difference: it uses β instead of |β| in the objective function (3). As a consequence, ULasso does not necessarily select uncorrelated variables. For example, consider the case X = [X1, X2]. The last term of (3) is λ2(β1² + β2² + 2R_{12}β1β2). If R_{12} ≠ 0, the cross term R_{12}β1β2 encourages |β1β2| to be large with β1β2 < 0. This implies that ULasso tends to select correlated variables and assign them coefficients with opposite signs. In particular, when X1 and X2 are strongly correlated, the last term reduces to λ2(β1 + β2)², which induces β1 = −β2. This is not a major problem when X1 and X2 are positively correlated, but it is a significant problem when they are negatively correlated. Our method overcomes this problem, as described in Section 2.1. Therefore, the difference between ULasso and our IILasso is essential and crucial.
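A toy computation (our own illustration, with an assumed similarity value of 0.95) makes this difference concrete: the signed quadratic term of (3) rewards opposite-sign coefficients on correlated variables, whereas the |β|⊤R|β| term of (1) penalizes them just as much as same-sign coefficients.

```python
import numpy as np

# Two highly correlated columns; off-diagonal similarity ~ r_12^2 (assumed value).
R = np.array([[0.0, 0.95],
              [0.95, 0.0]])

same_sign = np.array([1.0, 1.0])     # both coefficients positive
opp_sign  = np.array([1.0, -1.0])    # opposite signs, same magnitudes

ulasso  = lambda b: b @ R @ b                      # signed quadratic term of (3)
iilasso = lambda b: np.abs(b) @ R @ np.abs(b)      # |beta|^T R |beta| term of (1)

print(ulasso(same_sign), ulasso(opp_sign))    #  1.9  -1.9  -> opposite signs are rewarded
print(iilasso(same_sign), iilasso(opp_sign))  #  1.9   1.9  -> both are penalized equally
```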

Exclusive Group Lasso (EGLasso) [13] also follows the exclusive selection direction. It optimizes the following objective function:

\[ \min_{\beta}\ \|y - X\beta\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2 \sum_{k=1}^K \|\beta^{(k)}\|_1^2, \qquad (4) \]

where β^{(k)} consists of the components of β within a group of predictors g_k ⊂ {1, · · · , p} and K is the number of groups. The last term is an ℓ1/ℓ2 penalty, which acts as exclusive feature selection. For example, when p = 3, g_1 = {1, 2} and g_2 = {3}, the last term becomes λ2((|β1| + |β2|)² + |β3|²). This enforces sparsity within each group. The authors suggest putting highly correlated variables into the same group in order to select uncorrelated variables, using |r_{ij}| > θ with θ ≈ 0.90 as a threshold. EGLasso can be seen as a special case of IILasso: let R be a group indicator matrix with R_{jk} = 1 if X_j and X_k belong to the same group and R_{jk} = 0 otherwise; then IILasso reduces to EGLasso. For the above example, if we define the similarity matrix R = [1, 1, 0; 1, 1, 0; 0, 0, 1], then the last term of the IILasso objective (1) becomes (λα/2)(β1² + 2|β1||β2| + β2² + β3²) = (λα/2)((|β1| + |β2|)² + |β3|²), which matches the last term of (4). As we see, EGLasso needs to determine the threshold θ and group the variables beforehand, which can cause severely unstable estimation.
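The special-case relation is easy to verify numerically; the sketch below (our own illustration) builds the group-indicator similarity matrix for the p = 3 example above and checks that |β|⊤R|β| equals the exclusive group penalty Σ_k ‖β^{(k)}‖1².

```python
import numpy as np

# Group-indicator similarity matrix for groups g1 = {1, 2}, g2 = {3}.
R = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

def iilasso_quad(beta, R):
    """|beta|^T R |beta|, the quadratic part of the IILasso penalty in (1)."""
    a = np.abs(beta)
    return a @ R @ a

def eglasso_quad(beta, groups):
    """sum_k ||beta_(k)||_1^2, the exclusive group penalty in (4)."""
    return sum(np.sum(np.abs(beta[list(g)])) ** 2 for g in groups)

beta = np.array([0.5, -1.0, 2.0])
print(iilasso_quad(beta, R))               # 6.25
print(eglasso_quad(beta, [(0, 1), (2,)]))  # 6.25 -- identical
```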

3 Theoretical Results

In this section, we show theoretical properties of IILasso. We first show the sign recovery condition of IILasso. Then, we derive its convergence rate. These results are significantly important for interpretability and generalization ability. In addition, we show a property of local minima, which implies that every local optimal solution achieves the same statistical error rate as the global optimal solution. Throughout this section, let β* denote the true parameter, S denote the true active set, i.e., S = {j ∈ {1, · · · , p} | β*_j ≠ 0}, and s = |S|.

3.1 Sign Recovery

We give the necessary and sufficient condition for sign recovery.

Theorem 2. Define

\[ U := \frac{1}{n} X_S^\top X_S + \lambda\alpha\,\mathrm{Diag}(\mathrm{sgn}(\beta_S^*))\, R_{SS}\, \mathrm{Diag}(\mathrm{sgn}(\beta_S^*)), \]
\[ V := \lambda\,\mathrm{sgn}(\beta_S^*) + \lambda\alpha\,\mathrm{Diag}(\mathrm{sgn}(\beta_S^*))\, R_{SS}\, \mathrm{Diag}(\mathrm{sgn}(\beta_S^*))\,\beta_S^* - \frac{1}{n} X_S^\top \epsilon. \]

Assume U is invertible. Then, there exists a critical point β̂ of (1) with correct sign recovery sgn(β̂) = sgn(β*) if and only if the following two conditions hold:

\[ \mathrm{sgn}\!\left(\beta_S^* - U^{-1}V\right) = \mathrm{sgn}(\beta_S^*), \qquad (5) \]
\[ \left|\frac{1}{n} X_{S^c}^\top X_S U^{-1}V + \frac{1}{n} X_{S^c}^\top \epsilon\right| \le \lambda\left(1 + \alpha R_{S^c S}\left|\beta_S^* - U^{-1}V\right|\right), \qquad (6) \]


where both of these vector inequalities are taken elementwise.

The proof is given in the supplementary material. The sign recovery condition is derived from the standard conditions for optimality. We note that setting α = 0 reduces the condition to the ordinary Lasso condition in [27]. The invertibility assumption on U is not restrictive, because it holds for almost all λ if X_S⊤X_S is invertible, which is the same assumption as in the standard analysis of Lasso.

The condition of IILasso is milder than that of Lasso when R_{SS} is small enough, since (5) is the same as for Lasso and (6) is easier to satisfy unless αR_{S^cS} = 0. This implies that IILasso is more favorable than Lasso from the viewpoint of sign recovery. In addition, we can see that a large value of R_{S^cS} is favorable for sign recovery. That is, IILasso tends to succeed in sign recovery if the true active variables have small correlations and the inactive variables are strongly correlated with the true active variables.

Theorem 2 is not a condition for a global optimal solution. However, if the global optimal solutions are finite in number, each of them must be a critical point. Hence, a global optimal solution with correct sign recovery exists only if (5) and (6) hold.

3.2 Convergence Rate

Here we give the convergence rate of the estimation error of our method. Before stating the result, we prepare an assumption and a definition.

Assumption 1. (ϵ_i)_{i=1}^n is an i.i.d. sub-Gaussian sequence: E[e^{tϵ_i}] ≤ e^{σ²t²/2} (∀t ∈ R) for some σ > 0.

Definition 1 (Generalized Restricted Eigenvalue Condition (GRE(S, C, C′))). For C > 0 and C′ > 0, let the set of vectors B(S, C, C′) be

\[ B(S, C, C') := \left\{ \beta \in \mathbb{R}^p \;\middle|\; \|\beta_{S^c}\|_1 + \frac{C'\alpha}{2}|\beta_{S^c}|^\top R_{S^cS^c}|\beta_{S^c}| + C'\alpha\,|\beta_{S^c}|^\top R_{S^cS}\,|\beta_S + \beta_S^*| \le C\|\beta_S\|_1 \right\}. \]

Then, we assume ϕ_GRE > 0, where

\[ \phi_{\mathrm{GRE}} = \phi_{\mathrm{GRE}}(S, C, C') := \inf_{v \in B(S,C,C')} \frac{v^\top \frac{1}{n} X^\top X v}{\|v\|_2^2}. \]

Definition 1 is a generalization of the restricted eigenvalue condition [3, 2] tailored to our regularization. One can see that, if α = 0 or C′ = 0, then ϕ_GRE reduces to the ordinary restricted eigenvalue [3, 19] used in the analysis of Lasso. Since there are additional terms related to |β|⊤R|β|, the set B(S, C, C′) is smaller than that for the ordinary restricted eigenvalue when the same C is used. In particular, the term |β_{S^c}|⊤R_{S^cS}|β_S + β*_S| on the left-hand side strongly restricts the amplitude of the coefficients of unimportant variables (especially those with large R_{S^cS}), and thus promotes independence among the selected variables.

The GRE condition holds for a quite general class of Gaussian designs, because the RE(S, C) condition is satisfied for a general class of Gaussian designs [18] and Assumption GRE(S, C, C′) is milder than Assumption RE(S, C). Therefore, the GRE condition is not very restrictive.

Here, for any δ > 0, we define γ_n as

\[ \gamma_n = \gamma_n(\delta) := \sigma\sqrt{\frac{2\log(2p/\delta)}{n}}. \]

Then, we obtain the convergence rate of IILasso as follows.

Theorem 3. Suppose that Assumption 1 is satisfied. Suppose R satisfies, for all j, k ∈ S,

\[ 0 \le R_{jk} \le D \]

for some positive constant D, and the estimator β̂ of (1) approximately minimizes the objective function so that

\[ L_{\lambda_n}(\hat\beta) \le L_{\lambda_n}(\beta^*). \qquad (7) \]

Fix any 0 < δ < 1, and let the regularization parameters satisfy

\[ 3\gamma_n \le \lambda_n \quad \text{and} \quad \alpha \le \frac{1}{4D\|\beta_S^*\|_1}. \]

Suppose that Assumption GRE(S, 3, 3/2) (Definition 1) is satisfied. Then, it holds that

\[ \|\hat\beta - \beta^*\|_2^2 \le \frac{16 s \lambda_n^2}{\phi_{\mathrm{GRE}}^2}, \]

with probability 1 − δ.

The proof is given in the supplementary material. The obtained convergence rate is roughly evaluated as

\[ \|\hat\beta - \beta^*\|_2^2 = O_p\!\left(\frac{s\log p}{n}\right), \]

which is almost the minimax optimal rate [20].

As is clear from the proof of Theorem 3, we also have a slightly stricter bound

\[ \|\hat\beta - \beta^*\|_2^2 \le \left(\frac{8}{3} + 5\alpha D\|\beta_S^*\|_1 + \frac{3}{4}\left(\alpha D\|\beta_S^*\|_1\right)^2\right)^2 \frac{s\lambda_n^2}{\phi_{\mathrm{GRE}}^2}, \]

under Assumption GRE(S, C, 3/2) (Definition 1), where C = 2 + (15/4) αD‖β*_S‖1 + (9/16)(αD‖β*_S‖1)², with high probability. This proof is also given in the supplementary material.


When α = 0, the convergence rate analysis reduces to the standard one for ordinary Lasso [3, 2]. In well "interpretable" cases where the true non-zero components S are independent, i.e., R_{SS} = O, the error bounds for Lasso and IILasso are the same except for the term ϕ²_GRE. Since R_{S^cS} and R_{S^cS^c} shrink the set of vectors B(S, C, C′) in Definition 1, the ϕ_GRE of IILasso is larger than that of Lasso. Therefore, our method achieves a better error bound than ordinary ℓ1 regularization in this setting. In addition, our method is even more advantageous when informative and non-informative variables are correlated with each other.

3.3 Local Optimality of the Estimator

The objective function of our method is not necessarily convex, in exchange for the better statistical properties observed above. Our next theoretical interest is the global optimality of our optimization algorithm (Algorithm 1). Since our optimization method is greedy, there is no guarantee that it achieves the global optimum. However, as we see in this section, a local solution achieves almost the same estimation error as the global optimum satisfying (7). For theoretical simplicity, we assume the following slightly stronger condition.

Assumption 2. There exist ϕ > 0 and q_n ≤ p such that, for all V ⊂ {1, . . . , p} satisfying |V| ≤ q_n and V ∩ S = ∅, it holds that

\[ \frac{1}{n} X_{S\cup V}^\top X_{S\cup V} \succ \phi I. \]

Moreover, there exists D̄ such that the maximum absolute eigenvalue of R is bounded as

\[ \sup_{u \in \mathbb{R}^{S\cup V}} u^\top R_{S\cup V,\,S\cup V}\, u \le \bar{D}\|u\|_2^2. \]

Theorem 4. Suppose Assumptions 1 and 2 are satisfied. Assume that β̃ is a local optimal solution of (1) satisfying |supp(β̃)| ≤ |S| + q_n. Let the regularization parameters satisfy

\[ \gamma_n < \lambda_n \quad \text{and} \quad \alpha < \min\left\{ \frac{\sqrt{s}}{2\bar{D}\|\beta^*\|_2},\ \frac{\phi}{2\bar{D}\lambda_n} \right\}, \]

for any δ > 0. Then, β̃ satisfies

\[ \|\tilde\beta - \beta^*\|_2^2 \le \frac{25 s \lambda_n^2}{\phi^2}, \]

with probability 1 − δ.

The proof is given in the supplementary material. Theorem 4 indicates that every local optimum β̃ achieves the same convergence rate as the ideal optimal solution β̂. In other words, there is no local optimal solution with sparsity level |S| + q_n that is far from the true vector β*. In the theorem, we assumed sparsity of the local optimal solution β̃. Such a sparse solution can easily be obtained by running CDA multiple times from different initial solutions.

4 Numerical Experiments

4.1 Synthetic Data

We consider the case in which the true active variables are uncorrelated and many inactive variables are strongly correlated with the active variables. If all active and inactive variables are uncorrelated, it is easy to determine which variables are active. On the other hand, if the inactive variables are strongly correlated with the active variables, it is hard to distinguish which ones are active. We simulate such a situation and validate the effectiveness of IILasso.

First, we generated a design matrix X ∈ R^{n×p} from the Gaussian distribution N(0, Σ), where Σ = Diag(Σ^{(1)}, · · · , Σ^{(b)}) was a block diagonal matrix whose blocks Σ^{(l)} ∈ R^{q×q} had entries Σ^{(l)}_{jk} = 0.95 for j ≠ k and Σ^{(l)}_{jk} = 1 for j = k. We set n = 50, p = 100, b = 10 and q = 10. Thus, there were 10 groups, each containing 10 strongly correlated variables. Next, we generated an objective variable y from the true active variables X1, X11, X21, · · · , X91 as y = 10X1 − 9X11 + 8X21 − 7X31 + · · · + 2X81 − X91 + ϵ, with standard Gaussian noise ϵ. Each group included one active variable. We generated three such datasets for training, validation and test.
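A minimal sketch of this data-generating process (our own code; variable names and the random seed are ours) looks as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, b, q = 50, 100, 10, 10            # samples, features, blocks, block size

# Block-diagonal covariance: within-block correlation 0.95, unit variance.
block = np.full((q, q), 0.95) + 0.05 * np.eye(q)
Sigma = np.kron(np.eye(b), block)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# True coefficients: the first variable of each block is active,
# with coefficients 10, -9, 8, ..., 2, -1.
beta = np.zeros(p)
beta[np.arange(0, p, q)] = [10, -9, 8, -7, 6, -5, 4, -3, 2, -1]

y = X @ beta + rng.standard_normal(n)   # standard Gaussian noise
```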

Then, we compared the performance of Lasso, SCAD [7], MCP [29], EGLasso and IILasso. The evaluation criteria are prediction error (mean squared error), estimation error (ℓ2 norm between the true and estimated coefficients) and model size (the number of non-zero coefficients). SCAD and MCP are representative folded concave penalties, so their objective functions are non-convex, as is ours. They have a tuning parameter γ; we set γ = 2.5, 3.7, 10, 20, 100, 1000 for SCAD and γ = 1.5, 3, 10, 20, 100, 1000 for MCP. EGLasso has a parameter λ2; we set λ2/λ1 = 0.01, 0.1, 1, 10, 100, 1000. For EGLasso, we used the true group information beforehand. We used the R packages ncvreg [4] for Lasso, SCAD and MCP, and sparsenet [16] for MCP. MCP can be solved with either ncvreg or sparsenet, but they differ in their optimization algorithms and initialization. For IILasso, we defined R_{jk} = |r_{jk}|/(1 − |r_{jk}|) for j ≠ k and R_{jk} = 0 for j = k. Hence, R_{SS} takes small values if the active variables are independent, and R_{SS^c} and R_{S^cS^c} take large values if the inactive variables are strongly correlated with other variables, which is favorable given the theoretical results obtained in Section 3.


Table 1: Results of synthetic data (standard errors in parentheses)

Method             Prediction error  Estimation error  Model size
Lasso (ncvreg)     2.67 (0.05)       4.44 (0.06)       34.1 (0.46)
SCAD (ncvreg)      1.52 (0.02)       1.79 (0.04)       14.6 (0.23)
MCP (ncvreg)       1.53 (0.02)       1.79 (0.04)       14.6 (0.24)
MCP (sparsenet)    2.41 (0.11)       3.15 (0.13)       13.4 (0.28)
EGLasso            2.60 (0.04)       4.36 (0.05)       33.3 (0.32)
IILasso (ours)     1.45 (0.02)       1.40 (0.04)       13.5 (0.23)

We set α = 0.01, 0.1, 1, 10, 100, 1000. We tuned the above parameters on the validation data and computed errors on the test data. We repeated this procedure 500 times and report averages and standard errors.

Table 1 shows the performance of each method, with standard errors in parentheses. IILasso achieved the best prediction and estimation among all methods. This is because our penalty term excludes correlations and avoids overfitting. Moreover, the model size of IILasso was much smaller than those of Lasso and EGLasso, and comparable to MCP. As a whole, IILasso could distinguish the true active variables.

4.2 Real Data: Gene Expression Data

We applied our method to various gene expression datasets to validate its effectiveness in real applications. We used the following 10 datasets: 'alon' [1] (colon cancer), 'chiaretti' [6] (leukemia), 'gordon' [10] (lung cancer), 'gravier' [12] (breast cancer), 'pomeroy' [17] (central nervous system disorders), 'shipp' [21] (lymphoma), 'singh' [22] (prostate cancer), 'subramanian' [23] (miscellaneous), 'tian' [24] (myeloma), and 'west' [28] (breast cancer). All of these data are provided by the R package datamicroarray. The datasets are summarized in Table 2. All of them are small-sample, high-dimensional DNA microarray data.

Since the objective variable is binary, logistic regression was applied. Logistic regression with Lasso, SCAD and MCP is supported by ncvreg (sparsenet does not support logistic regression). EGLasso and IILasso can also be formulated as logistic regression; for details, see the supplementary material. We used the same settings for the regularization parameters as described in Section 4.1. We evaluated the log-likelihood, misclassification error and model size using ten-fold cross validation.
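Although the exact logistic formulation is given in their supplementary material (not reproduced here), a natural form consistent with (1), replacing the squared loss by the logistic loss for labels y_i ∈ {0, 1}, would be the following; this is our reading, stated as an assumption:

\[ \min_{\beta_0,\,\beta}\ \frac{1}{n}\sum_{i=1}^n \left[ \log\!\left(1 + e^{\beta_0 + x_i^\top \beta}\right) - y_i\left(\beta_0 + x_i^\top \beta\right) \right] + \lambda\left(\|\beta\|_1 + \frac{\alpha}{2}|\beta|^\top R|\beta|\right). \]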

The results are given in Figure 3. In terms of log-likelihood, IILasso won in 5 out of 10 cases. MCP showed performance similar to IILasso, but won only 2 cases. Lasso, EGLasso and SCAD showed similar performance to one another, but fell behind IILasso and MCP.

Table 2: Summary of microarray data

Data          # samples  # dimensions
alon          62         2000
chiaretti     111        12625
gordon        181        12533
gravier       168        2905
pomeroy       60         7128
shipp         58         6817
singh         102        12600
subramanian   50         10100
tian          173        12625
west          49         7129

We can see a similar tendency for the misclassification errors. IILasso won in 9 out of 10 cases, including 5 ties, and lost only 1 case, in which MCP won. In addition, the model size of IILasso was always smaller than that of Lasso and was the smallest in 6 out of 10 cases, including 1 tie. In particular, IILasso achieved a much smaller misclassification error with a much smaller model size on the 'subramanian' dataset, and smaller model sizes on the 'gordon', 'gravier', 'pomeroy' and 'west' datasets among the methods with almost the same misclassification errors. As a whole, IILasso constructed accurate models with a small number of variables.

5 Conclusion

In this paper, we proposed a new regularization method, "IILasso". IILasso reduces correlations among the active variables, and hence the resulting model is easy to decompose and interpret. We showed that the sign recovery condition of IILasso is milder than that of Lasso when the true active variables are uncorrelated with each other. The convergence rate of IILasso also compares favorably with that of Lasso. In addition, we showed that every local optimum of the coordinate descent algorithm has the same order of convergence rate as the global optimum. Finally, we verified the effectiveness of IILasso by synthetic and real data analyses using 10 gene expression datasets, and we saw that IILasso was superior in many cases on high-dimensional data.

Acknowledgement

TS was partially supported by MEXT KAKENHI (25730013, 25120012, 26280009, 15H01678 and 15H05707), JST-PRESTO and JST-CREST. HF was partially supported by MEXT KAKENHI 17K00065.


Figure 3: Results of 10 microarray datasets

References

[1] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12):6745–6750, 1999.

[2] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.

[3] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.

[4] P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232–253, 2011.

[5] S. B. Chen, C. Ding, B. Luo, and Y. Xie. Uncorrelated lasso. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, pages 166–172, 2013.

[6] S. Chiaretti, X. Li, R. Gentleman, A. Vitale, M. Vignetti, F. Mandelli, J. Ritz, and R. Foa. Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood, 103(7):2771–2778, 2004.

[7] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

[8] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.

[9] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

[10] G. J. Gordon, R. V. Jensen, L. L. Hsiao, S. R. Gullans, J. E. Blumenstock, S. Ramaswamy, W. G. Richards, D. J. Sugarbaker, and R. Bueno. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research, 62(17):4963–4967, 2002.

[11] E. Grave, G. R. Obozinski, and F. R. Bach. Trace lasso: a trace norm regularization for correlated designs. In Advances in Neural Information Processing Systems, pages 2187–2195, 2011.

[12] E. Gravier, G. Pierron, A. Vincent-Salomon, N. Gruel, V. Raynal, A. Savignoni, Y. De Rycke, J. Y. Pierga, C. Lucchesi, F. Reyal, A. Fourquet, S. Roman-Roman, F. Radvanyi, X. Sastre-Garau, B. Asselain, and O. Delattre. A prognostic DNA signature for T1T2 node-negative breast cancer patients. Genes, Chromosomes and Cancer, 49(12):1125–1125, 2010.

[13] D. Kong, R. Fujimaki, J. Liu, F. Nie, and C. Ding. Exclusive feature learning on arbitrary structures via ℓ1,2-norm. In Advances in Neural Information Processing Systems, pages 1655–1663, 2014.

[14] Z. C. Lipton. The mythos of model interpretability. In ICML Workshop on Human Interpretability in Machine Learning, pages 96–100, 2016.


[15] A. Lorbert, D. Eis, V. Kostina, D. Blei, and P. Ramadge. Exploiting covariate similarity in sparse regression via the pairwise elastic net. In Thirteenth International Conference on Artificial Intelligence and Statistics, pages 477–484, 2010.

[16] R. Mazumder, J. H. Friedman, and T. Hastie. SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495):1125–1138, 2011.

[17] S. L. Pomeroy, P. Tamayo, M. Gaasenbeek, L. M. Sturla, M. Angelo, M. E. McLaughlin, J. Y. H. Kim, L. C. Goumnerova, P. M. Black, C. Lau, J. C. Allen, D. Zagzag, J. M. Olson, T. Curran, C. Wetmore, J. A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D. N. Louis, J. P. Mesirov, E. S. Lander, and T. R. Golub. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870):436–442, 2002.

[18] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241–2259, 2010.

[19] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241–2259, 2010.

[20] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.

[21] M. A. Shipp, K. N. Ross, P. Tamayo, A. P. Weng, J. L. Kutok, R. C. T. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G. S. Pinkus, T. S. Ray, M. A. Koval, K. W. Last, A. Norton, T. A. Lister, J. Mesirov, D. S. Neuberg, E. S. Lander, J. C. Aster, and T. R. Golub. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1):68–74, 2002.

[22] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. Renshaw, A. V. D'Amico, J. P. Richie, E. S. Lander, M. Loda, P. W. Kantoff, T. R. Golub, and W. R. Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203–209, 2002.

[23] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545–15550, 2005.

[24] E. Tian, F. Zhan, R. Walker, E. Rasmussen, Y. Ma, B. Barlogie, and J. D. Shaughnessy Jr. The role of the Wnt-signaling antagonist DKK1 in the development of osteolytic lesions in multiple myeloma. New England Journal of Medicine, 349(26):2483–2494, 2003.

[25] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58:267–288, 1996.

[26] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.

[27] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.

[28] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. A. Olson, J. R. Marks, and J. R. Nevins. Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 98(20):11462–11467, 2001.

[29] C. H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.

[30] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.