
2012 IEEE Statistical Signal Processing Workshop (SSP), Ann Arbor, MI, USA

GREEDY DIRTY MODELS: A NEW ALGORITHM FOR MULTIPLE SPARSE REGRESSION

Ali Jalali and Sujay Sanghavi

Electrical and Computer Engineering Department, University of Texas at Austin

alij & [email protected]
(Invited Paper)

ABSTRACT

This paper considers the recovery of multiple sparse vectors, with partially shared supports, from a small number of noisy linear measurements of each. It has recently been shown that it is possible to lower the sample complexity of recovery, for all levels of support sharing, by using a “dirty model”: a superposition of sparse and group-sparse modeling approaches; this is based on convex optimization.

In this paper, we provide a new forward-backward greedy procedure for the dirty model approach. Each forward step involves the addition of either a shared feature common to all vectors, or a unique feature to one of the vectors, chosen in a natural greedy fashion. Each backward step involves greedy removal, again of at most one feature of either type.

Analytical and empirical evidence shows that this outperforms all convex approaches, in terms of both sample and computational complexity.

Index Terms— Multi-task Learning, High-dimensional Statistics, Forward-Backward Greedy Algorithm, Dirty Model, Sparsity, Block-Sparsity

1. SETUP

In this paper¹ we study the multiple sparse linear regression problem, which is the following: there are r vectors β*_1, ..., β*_r ∈ R^p that we wish to recover from noisy linear observations; in particular, we have that

    y_i = X^{(i)} β*_i + w_i    ∀ i ∈ {1, ..., r},    (1)

and we want to recover β*_1, ..., β*_r given (y_i, X^{(i)}), i = 1, ..., r. We will refer to X^{(i)} ∈ R^{n_i×p} as the i-th design matrix and y_i ∈ R^{n_i} as the i-th response vector; these matrices can be different for different i, or all the same. Here we assume w_i ∈ R^{n_i} is i.i.d. zero-mean Gaussian noise with variance σ². Thus, there are n_i observations of the i-th task; for simplicity, we set n = max_i n_i and state our results in terms of n.

¹A longer version with proofs is available on arXiv.

We are interested in the high-dimensional setting, where n < p, often significantly so; to do so we want to leverage the fact that the vectors are sparse – ‖β*_i‖_0 = s*_i for all i, and s* = max_i s*_i is such that s* < p.
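As a concrete illustration of the observation model (1), the following Python sketch generates a small synthetic instance with partially shared supports; the dimensions, sparsity levels, and all names here are illustrative choices of ours, not values taken from the paper.

import numpy as np

def make_multitask_data(r=3, p=100, n=40, s=10, shared_frac=0.5, sigma=0.1, seed=0):
    """Generate (y_i, X^{(i)}) pairs following y_i = X^{(i)} beta*_i + w_i."""
    rng = np.random.default_rng(seed)
    n_shared = int(shared_frac * s)
    shared = rng.choice(p, size=n_shared, replace=False)        # support rows common to all tasks
    betas, Xs, ys = [], [], []
    for _ in range(r):
        own = rng.choice(np.setdiff1d(np.arange(p), shared),    # task-specific support
                         size=s - n_shared, replace=False)
        support = np.concatenate([shared, own])
        beta = np.zeros(p)
        beta[support] = rng.standard_normal(s)
        X = rng.standard_normal((n, p))                         # i.i.d. Gaussian design
        y = X @ beta + sigma * rng.standard_normal(n)           # noisy linear observations
        betas.append(beta); Xs.append(X); ys.append(y)
    return betas, Xs, ys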

In the following, we briefly review existing approaches to this problem in section 1.1, detail our algorithm and its theoretical guarantees in section 2, and present empirical results on both synthetic data and a handwriting dataset in section 3.

1.1. Related Work

The literature on sparse recovery from linear measurements is vast; we limit ourselves to the most directly relevant literature, on multiple sparse recovery. We discuss along two axes: the modeling approach, and the algorithmic approach. Recall that our task is to recover the vectors β*_i, which we can organize as columns of a matrix β*.

There are three main modeling approaches. The first approach is to consider each of the tasks individually and use sparse regression methods such as LASSO [11, 12] to estimate the tasks, using ‖β‖_1 as the regularizer. The second approach focuses on the shared features across tasks [13, 6, 7, 9] and tries to recover the tasks by taking advantage of similarity between tasks, via a “block-sparse” regularizer of the form ‖β‖_{1,q} (the ℓ_1 norm of the ℓ_q norms of each row, with q > 1), e.g. in Group LASSO. It has been shown [7] that, depending on how similar the tasks' feature sets are, there are regimes in which each of the two approaches has an advantage over the other. Thus, not knowing the level of dependency, it is not clear which approach should be taken a priori.

The third modeling approach is the simultaneous use of sparse and block-sparse approaches, by decomposing β = S + B into two matrices and regularizing each appropriately – this is the dirty models approach in [4]. In particular, this advocates using the regularizer λ_1‖S‖_1 + λ_2‖B‖_{1,q}. It has been shown [4] that this method uniformly outperforms both previous approaches, for all levels of sharing.
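For intuition, the following sketch evaluates the three penalties discussed above – the entrywise ℓ_1 norm, the ℓ_{1,q} block norm (here with q = ∞), and the dirty-model combination λ_1‖S‖_1 + λ_2‖B‖_{1,q} – on a p × r coefficient matrix. The split of β into S and B is shown only schematically; in the convex formulation S and B are optimization variables, and the function names here are ours.

import numpy as np

def l1_penalty(beta):
    # entrywise l1 norm, as in per-task LASSO
    return np.abs(beta).sum()

def l1q_penalty(beta, q=np.inf):
    # l1 norm of the l_q norms of each row (block-sparse / group LASSO penalty)
    return np.linalg.norm(beta, ord=q, axis=1).sum()

def dirty_penalty(S, B, lam1, lam2, q=np.inf):
    # dirty-model regularizer: lam1 * ||S||_1 + lam2 * ||B||_{1,q}
    return lam1 * l1_penalty(S) + lam2 * l1q_penalty(B, q)

# toy usage: beta split into a "shared-row" part B and a "single-coordinate" part S
beta = np.zeros((5, 2)); beta[0, :] = 1.0; beta[3, 1] = -0.5
B = np.zeros_like(beta); B[0, :] = beta[0, :]      # shared row
S = beta - B                                       # remaining single entries
print(dirty_penalty(S, B, lam1=0.1, lam2=0.2))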

Algorithmically, in addition to the convex approaches above, greedy methods like OMP etc. have been used to great success, and often have comparable or better performance [5]. In this paper, we develop a greedy method for the “dirty model” approach. We show it achieves “sparsistency” – consistent support recovery – under milder assumptions than convex programming. Empirically, it is seen to outperform even the convex dirty model. Our work is the first greedy algorithm to simultaneously use more than one structural assumption.

2. GREEDY ALGORITHM AND GUARANTEE

Consider the loss function

    L(β) = Σ_{i=1}^{r} (1 / (2 n_i)) ‖y_i − X^{(i)} β_i‖_2².
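A direct Python transcription of this loss, assuming the data are stored as lists of (X^{(i)}, y_i) pairs and β as a p × r matrix whose columns are the β_i (these storage conventions and names are ours):

import numpy as np

def loss(beta, Xs, ys):
    """L(beta) = sum_i 1/(2 n_i) * ||y_i - X^{(i)} beta_i||_2^2, with beta of shape (p, r)."""
    total = 0.0
    for i, (X, y) in enumerate(zip(Xs, ys)):
        resid = y - X @ beta[:, i]
        total += 0.5 * resid @ resid / len(y)
    return total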

The algorithm proceeds by iteratively building and modifying two sets, S_b and S_s, of active row and single variables respectively, adding useful elements and removing useless ones. A “row variable” is an entire row of β, within which each element can be set to any value.

In each forward step it finds both the best single variable, and the best row variable, that give the biggest incremental decrease in the loss function. After weighting these gains appropriately, the algorithm then picks the better of the two and adds it to the corresponding set – S_b for the row variable or S_s for the single variable.

In each backward step the algorithm finds the best single variable, and the best row variable, whose individual removal results in the smallest incremental increase in the loss function. Again after weighting, it removes the one with the smallest increase.
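To make the forward-step comparison concrete for the squared-error loss above: given the current residuals, the best decrease from adding a single coordinate (j, task t) has a closed form, and because the tasks decouple, the best decrease from adding a whole row j is simply the sum over tasks of the per-task gains for that feature. The helper below is our own illustration of this computation, not code from the paper.

import numpy as np

def forward_gains(beta, Xs, ys):
    """Per-(feature, task) loss decreases for single variables, and per-feature
    decreases for row variables, at the current iterate beta (shape (p, r))."""
    p, r = beta.shape
    single_gain = np.zeros((p, r))
    for t, (X, y) in enumerate(zip(Xs, ys)):
        resid = y - X @ beta[:, t]
        corr = X.T @ resid                        # x_j . r_t for every feature j
        col_norm2 = (X ** 2).sum(axis=0)          # ||x_j||^2
        single_gain[:, t] = corr ** 2 / (2 * len(y) * col_norm2)
    row_gain = single_gain.sum(axis=1)            # tasks decouple: row gain = sum of per-task gains
    return single_gain, row_gain

# weighted comparison used by the forward step (eps_s, eps_b are the stopping thresholds):
# mu_s = single_gain.max() / eps_s ;  mu_b = row_gain.max() / eps_b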

The goal of our theoretical analysis is to provide a sparsistency guarantee by imposing some assumptions on the loss function. We say the Hessian matrix Q^{(i)} = X^{(i)} X^{(i)T} satisfies REP(s) if the restricted eigenvalue property (REP) on s-sparse vectors δ ∈ R^p holds with constants C_min and ρ ≥ 1; that is,

    C_min ‖δ‖_2 ≤ ‖Q^{(i)} δ‖_2 ≤ ρ C_min ‖δ‖_2    ∀ ‖δ‖_0 ≤ s.    (2)
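For toy problem sizes one can probe restricted-eigenvalue-type constants by brute force over sparse supports: over all s-sparse δ supported on a set S, the extreme values of ‖Qδ‖/‖δ‖ are the extreme singular values of the corresponding column submatrix of Q. The sketch below (exponential in s, so only for small p) is an illustrative check of ours, not part of the paper's analysis.

import numpy as np
from itertools import combinations

def rep_constants(Q, s):
    """Smallest and largest values of ||Q d|| / ||d|| over all s-sparse d (brute force)."""
    p = Q.shape[1]
    lo, hi = np.inf, 0.0
    for S in combinations(range(p), s):
        sv = np.linalg.svd(Q[:, list(S)], compute_uv=False)   # singular values of the s-column submatrix
        lo, hi = min(lo, sv[-1]), max(hi, sv[0])
    return lo, hi   # REP(s) holds with C_min = lo and rho = hi / lo (when lo > 0)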

The next theorem states our guarantee on the performance of Algorithm 1 for two tasks (i.e. r = 2). The algorithm requires that REP hold for vectors with slightly larger sparsity than s* in order to guarantee the success of the greedy procedure.

Theorem 1 (Two-task Sparsistency). Suppose 1 ≤ ε_b/ε_s ≤ 2 and the Hessian matrices Q^{(i)} = X^{(i)} X^{(i)T} satisfy REP(η s*) for some η ≥ 2 + 4ψ² (√((ψ² − ψ)/s*) + √2)², with ψ = 2 (ε_s/ε_b) ρ². Provided the sample complexity n ≥ K s* log(2p) for some constant K, if we run Algorithm 1 with stopping threshold

    ε_s ≥ 8 c_5² ψ³ η ρ⁴ s* log(2p) / n,

the output β with support S_b ∪ S_s satisfies:

(a) Error Bound:

    ‖β − β*‖_F ≤ √(4 c_5² ψ ρ⁴ s* log(2p) / n) + √(8ψ/ρ²) ε_s.

(b) No False Inclusions: (S_b ∪ S_s) − (S*_b ∪ S*_s) = ∅.

(c) No False Exclusions: If min_{(i,j)∈S*} |β*_ij| > √(32 ψ ε_s), then (S*_b ∪ S*_s) − (S_b ∪ S_s) = ∅.

The result holds with high probability for Gaussian designs when the population matrices Σ^{(i)} satisfy REP.

Algorithm 1  Greedy forward-backward algorithm for finding a sparse + block-sparse optimizer of L(·)

Input: data D := {y_1, X^{(1)}, ..., y_r, X^{(r)}}, stopping thresholds ε_b and ε_s, backward factor ν ∈ (0, 1)
Output: estimate β

β^{(0)} ← 0 ;  S_b^{(0)}, S_s^{(0)} ← ∅ ;  μ_b^{(0)}, μ_s^{(0)} ← 0 ;  k_b, k_s, k ← 1

while true do   {Forward Step}
    (i_b*, α*) ← argmin_{i ∉ S_b^{(k_b−1)}, α ∈ R^r}  L(β^{(k−1)} + e_i α^T; D)
    μ_b^{(k_b)} ← [ L(β^{(k−1)}; D) − L(β^{(k−1)} + e_{i_b*} α*^T; D) ] / ε_b

    ((i_s*, j_s*), γ*) ← argmin_{(i,j) ∉ S_s^{(k_s−1)}, γ ∈ R}  L(β^{(k−1)} + γ e_i e_j^T; D)
    μ_s^{(k_s)} ← [ L(β^{(k−1)}; D) − L(β^{(k−1)} + γ* e_{i_s*} e_{j_s*}^T; D) ] / ε_s

    if max(μ_s^{(k_s)}, μ_b^{(k_b)}) ≤ 1 then
        break
    end if
    if μ_b^{(k_b)} ≥ μ_s^{(k_s)} then
        S_b^{(k_b)} ← S_b^{(k_b−1)} ∪ {(i_b*, j) : ∀ j}  and  k_b ← k_b + 1
    else
        S_s^{(k_s)} ← S_s^{(k_s−1)} ∪ {(i_s*, j_s*)}  and  k_s ← k_s + 1
    end if
    β^{(k)} ← argmin_β  L(β_{S_b^{(k_b−1)} ∪ S_s^{(k_s−1)}}; D)
    k ← k + 1

    while true do   {Backward Step}
        i_b* ← argmin_{i ∈ S_b^{(k_b−1)}}  L(β^{(k−1)} − e_i β_{i·}^{(k−1) T}; D)
        ν_b ← [ L(β^{(k−1)} − e_{i_b*} β_{i_b*·}^{(k−1) T}; D) − L(β^{(k−1)}; D) ] / ( μ_b^{(k_b−1)} ε_b )

        (i_s*, j_s*) ← argmin_{(i,j) ∈ S_s^{(k_s−1)}}  L(β^{(k−1)} − β_{ij}^{(k−1)} e_i e_j^T; D)
        ν_s ← [ L(β^{(k−1)} − β_{i_s* j_s*}^{(k−1)} e_{i_s*} e_{j_s*}^T; D) − L(β^{(k−1)}; D) ] / ( μ_s^{(k_s−1)} ε_s )

        if min(ν_b, ν_s) > ν then
            break
        end if
        if ν_b ≤ ν_s then
            k_b ← k_b − 1  and  S_b^{(k_b−1)} ← S_b^{(k_b)} − {(i_b*, j) : ∀ j}
        else
            k_s ← k_s − 1  and  S_s^{(k_s−1)} ← S_s^{(k_s)} − {(i_s*, j_s*)}
        end if
        β^{(k−1)} ← argmin_β  L(β_{S_b^{(k_b−1)} ∪ S_s^{(k_s−1)}}; D)
        k ← k − 1
    end while
end while
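The following Python sketch is a simplified, self-contained rendering of the forward-backward loop above for the squared-error loss: each forward step compares the best weighted single-coordinate gain against the best weighted row gain, the active set is refit by restricted least squares, and a backward pass prunes variables whose weighted removal cost stays below ν. It is an illustration under our own simplifications (for example, a pruning candidate is evaluated by zeroing it out of the current refit rather than exactly as in Algorithm 1), not the authors' reference implementation; all names and defaults are ours.

import numpy as np

def loss(beta, Xs, ys):
    return sum(0.5 * np.sum((y - X @ beta[:, t]) ** 2) / len(y)
               for t, (X, y) in enumerate(zip(Xs, ys)))

def refit(active, p, Xs, ys):
    # restricted least squares: for each task, fit only its active coordinates;
    # active contains (i, None) for row variables and (i, t) for single variables
    beta = np.zeros((p, len(Xs)))
    for t, (X, y) in enumerate(zip(Xs, ys)):
        cols = sorted({i for (i, j) in active if j is None or j == t})
        if cols:
            beta[cols, t] = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
    return beta

def greedy_dirty(Xs, ys, eps_s, eps_b, nu=0.5, max_iter=100):
    p, r = Xs[0].shape[1], len(Xs)
    rows, singles = set(), set()                 # S_b (row indices) and S_s ((i, t) pairs)
    beta = np.zeros((p, r))
    for _ in range(max_iter):
        # ---- forward step: best single coordinate vs best row ----
        gain = np.zeros((p, r))
        for t, (X, y) in enumerate(zip(Xs, ys)):
            resid = y - X @ beta[:, t]
            gain[:, t] = (X.T @ resid) ** 2 / (2 * len(y) * (X ** 2).sum(axis=0))
        row_gain = gain.sum(axis=1)
        row_gain[list(rows)] = -np.inf           # exclude already-active rows
        for (i, t) in singles:
            gain[i, t] = -np.inf                 # exclude already-active singles
        i_b = int(np.argmax(row_gain)); mu_b = row_gain[i_b] / eps_b
        i_s, t_s = np.unravel_index(np.argmax(gain), gain.shape); mu_s = gain[i_s, t_s] / eps_s
        if max(mu_b, mu_s) <= 1:                 # no addition clears the threshold: stop
            break
        if mu_b >= mu_s:
            rows.add(i_b)
        else:
            singles.add((int(i_s), int(t_s)))
        beta = refit({(i, None) for i in rows} | singles, p, Xs, ys)

        # ---- backward step: prune while the weighted increase stays below nu ----
        while rows or singles:
            base, best, best_inc = loss(beta, Xs, ys), None, np.inf
            for i in rows:
                b = beta.copy(); b[i, :] = 0.0
                inc = (loss(b, Xs, ys) - base) / eps_b
                if inc < best_inc: best, best_inc = ('row', i), inc
            for (i, t) in singles:
                b = beta.copy(); b[i, t] = 0.0
                inc = (loss(b, Xs, ys) - base) / eps_s
                if inc < best_inc: best, best_inc = ('single', (i, t)), inc
            if best_inc > nu:                    # nothing is cheap enough to remove
                break
            kind, idx = best
            if kind == 'row':
                rows.discard(idx)
            else:
                singles.discard(idx)
            beta = refit({(i, None) for i in rows} | singles, p, Xs, ys)
    return beta, rows, singles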

417

Fig. 1. Behavior of the phase transition threshold versus the parameter κ in a 2-task problem for the greedy algorithm, dirty model, LASSO and group LASSO (ℓ_1/ℓ_∞ regularizer). The y-axis is Θ = n / (s log(p − (2 − κ)s)), where n is the number of samples at which the threshold was observed. Here, we let s = p/10 and the values of the parameter and design matrices are i.i.d. standard Gaussians. Also, the noise variance is set to σ = 0.1. The greedy algorithm shows substantial improvement over the other methods in terms of the minimum number of samples required for exact sign support recovery.


Remark. This guarantee is equivalent to those given by convex surrogate optimization [4], except that here we are relaxing the irrepresentability assumption, which is known to be necessary for the convex surrogate optimization [12].

3. EXPERIMENTAL RESULTS

3.1. Synthetic Data

To have a common ground for comparison, we run the same experiment used for the comparison of LASSO, group LASSO and dirty model in [7, 4]. Consider the case where we have r = 2 tasks, each with a support of size s = p/10, and suppose these two tasks share a κ fraction of their supports. The locations of non-zero entries are chosen uniformly at random, and the values of β*_1 and β*_2 are chosen to be standard Gaussian realizations. Each row of the matrices X^{(1)} and X^{(2)} is distributed as N(0, I), and each entry of the noise vectors w_1 and w_2 is a zero-mean Gaussian draw with variance 0.1. We run the experiment for problem sizes p ∈ {128, 256, 512} and for support overlap levels κ ∈ {0.3, 2/3, 0.8}.
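A sketch of this synthetic setup (two tasks sharing a κ fraction of a size-s support, i.i.d. Gaussian designs, noise variance 0.1); the function name and structure are ours.

import numpy as np

def two_task_instance(p, n, kappa, seed=0):
    rng = np.random.default_rng(seed)
    s = p // 10
    n_shared = int(round(kappa * s))
    perm = rng.permutation(p)
    shared = perm[:n_shared]                       # support rows common to both tasks
    own1 = perm[n_shared:s]                        # remaining support of task 1
    own2 = perm[s:2 * s - n_shared]                # remaining support of task 2
    beta1, beta2 = np.zeros(p), np.zeros(p)
    beta1[np.concatenate([shared, own1])] = rng.standard_normal(s)
    beta2[np.concatenate([shared, own2])] = rng.standard_normal(s)
    X1, X2 = rng.standard_normal((n, p)), rng.standard_normal((n, p))
    y1 = X1 @ beta1 + np.sqrt(0.1) * rng.standard_normal(n)   # noise variance 0.1
    y2 = X2 @ beta2 + np.sqrt(0.1) * rng.standard_normal(n)
    return (X1, y1, beta1), (X2, y2, beta2)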

(a) Little support overlap: κ = 0.3
(b) Moderate support overlap: κ = 2/3
(c) High support overlap: κ = 0.8

Fig. 2. Probability of success in recovering the exact sign support using the greedy algorithm, dirty model, LASSO and group LASSO (ℓ_1/ℓ_∞). For a 2-task problem, the probability of success for different values of the feature-overlap fraction κ is plotted. Here, we let s = p/10 and the values of the parameter and design matrices are i.i.d. standard Gaussians. Also, the noise variance is set to σ = 0.1. As we can see, the greedy method outperforms all other methods in the minimum number of samples required for sign support recovery.

We use cross-validation to find the best values of the regularizer coefficients. To do so, we choose ε_s = c · s log(p)/n, where c ∈ [10⁻⁴, 10], and ε_b = k ε_s, where k ∈ [1, 2]. Notice that this search region is motivated by the requirements of our theorem and can be substantially smaller than the region that would need to be searched if ε_s and ε_b were tuned independently. Interestingly, for a small number of samples n, the ratio k tends to be close to 1, whereas for a large number of samples, the ratio tends to be close to 2. We suspect this phenomenon is due to the lack of curvature around the optimal point when we have few samples. The greedy algorithm is more stable if it picks a row as opposed to a single coordinate, even if the improvement from the entire row is comparable to the improvement from a single coordinate.
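A sketch of the threshold grid described above, with ε_s = c · s log(p)/n for c spanning [10⁻⁴, 10] and ε_b = k ε_s for k ∈ [1, 2]; the grid resolution and the function name are our own choices.

import numpy as np

def threshold_grid(s, p, n, n_c=20, n_k=5):
    """Candidate (eps_s, eps_b) pairs for cross-validation."""
    cs = np.logspace(-4, 1, n_c)        # c in [1e-4, 10], log-spaced
    ks = np.linspace(1.0, 2.0, n_k)     # k in [1, 2]
    base = s * np.log(p) / n
    return [(c * base, k * c * base) for c in cs for k in ks]

# usage: pairs = threshold_grid(s=12, p=128, n=60); pick the pair minimizing validation error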

To compare different methods under this regime, we define a re-scaled version of the sample size n, a.k.a. the control parameter,

    Θ = n / (s log(p − (2 − κ)s)).

For different values of κ, we plot the probability of success, obtained by averaging over 100 problems, versus the control parameter Θ in Fig. 2. It can be seen that the greedy method outperforms the others, i.e., it requires fewer samples to recover the exact sign support of β*.
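A small helper for this rescaling, together with the sharp-transition values discussed next (Θ ≈ 2 for LASSO, 4 − 3κ for ℓ_1/ℓ_∞ group LASSO, 2 − κ for the dirty model, and the conjectured 1 − κ/2 for the greedy method), so one can read off the sample size each threshold predicts; this helper is ours.

import numpy as np

def control_parameter(n, s, p, kappa):
    # Theta = n / (s * log(p - (2 - kappa) * s))
    return n / (s * np.log(p - (2 - kappa) * s))

def predicted_sample_size(theta_crit, s, p, kappa):
    # invert Theta to get the sample size at a given transition threshold
    return theta_crit * s * np.log(p - (2 - kappa) * s)

p, kappa = 256, 2.0 / 3.0
s = p // 10
for name, theta in [("LASSO", 2.0), ("group LASSO", 4 - 3 * kappa),
                    ("dirty model", 2 - kappa), ("greedy (conjecture)", 1 - kappa / 2)]:
    print(name, predicted_sample_size(theta, s, p, kappa))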

This result matches the known theoretical guarantees. It is well-known that LASSO has a sharp transition at Θ ≈ 2 [12]¹, group LASSO (ℓ_1/ℓ_∞ regularizer) has a sharp transition at Θ = 4 − 3κ [7], and the dirty model has a sharp transition at Θ = 2 − κ [4]. Although we do not have a theoretical result, our conjecture is that our algorithm has a sharp transition at Θ = 1 − κ/2.


n    Metric                          Greedy   Dirty Model   Group LASSO   LASSO
10   Average Classification Error    6.5%     8.6%          9.9%          10.8%
     Variance of Error               0.4%     0.53%         0.64%         0.51%
     Average Row Support Size        180      171           170           123
     Average Support Size            1072     1651          1700          539
20   Average Classification Error    2.1%     3.0%          3.5%          4.1%
     Variance of Error               0.44%    0.56%         0.62%         0.68%
     Average Row Support Size        185      226           217           173
     Average Support Size            1120     2118          2165          821
40   Average Classification Error    1.4%     2.2%          3.2%          2.8%
     Variance of Error               0.48%    0.57%         0.68%         0.85%
     Average Row Support Size        194      299           368           354
     Average Support Size            1432     2761          3669          2053

Table 1. Handwriting classification results for the greedy algorithm, dirty model, group LASSO and LASSO. The greedy method provides much better classification errors with simpler models. The greedy model selection is more consistent as the number of samples increases.

To investigate our conjecture, we plot the sharp transition thresholds for different methods versus different values of κ ∈ {0.05, 0.3, 2/3, 0.8, 0.95} for problem sizes p ∈ {128, 256, 512}. Fig. 1 shows that the sharp transition threshold for the greedy algorithm follows our conjecture with good precision. A theoretical guarantee for such a tight threshold, however, remains open.

3.2. Handwritten Digits Dataset

We use the handwritten digit dataset [2] that has been used by a number of papers [10, 3, 4] as a reliable dataset for optical handwritten digit recognition algorithms. The dataset contains p = 649 features of handwritten numerals 0–9 (r = 10 tasks) extracted from a collection of Dutch utility maps. The dataset provides 200 samples of each digit, written by different people. We take n/10 samples from each digit and combine them into a single matrix X ∈ R^{n×p}, i.e., we set X^{(i)} = X for all i ∈ {1, ..., 10}. We construct the response vectors y_i to be 1 if the corresponding row in X is an instance of the i-th digit and zero otherwise. Clearly, the y_i's will have disjoint supports. We run all four algorithms on this data and report the results.
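A sketch of this one-vs-all construction, assuming the raw data are already available as a feature matrix features (200 samples per digit, p = 649 columns) and a label vector labels in {0, ..., 9}; loading of the actual dataset is omitted and the variable names are ours.

import numpy as np

def build_multitask_digits(features, labels, n, seed=0):
    """Stack n/10 samples per digit into X and build 0/1 response vectors y_0..y_9."""
    rng = np.random.default_rng(seed)
    rows = []
    for d in range(10):
        idx = np.flatnonzero(labels == d)
        rows.append(rng.choice(idx, size=n // 10, replace=False))
    rows = np.concatenate(rows)
    X = features[rows]                                            # shared design: X^{(i)} = X for all tasks
    ys = [(labels[rows] == d).astype(float) for d in range(10)]   # y_i = 1 iff the row is digit i
    return X, ys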

Table 1 shows the results of our analysis for different sizes of the training set n. We measure the classification error for each digit to get a 10-vector of errors. Then, we report the average and the variance of this error vector to show how the error is distributed over all tasks. Again, in all methods, parameters are chosen via cross-validation. It can be seen that the greedy method provides a more consistent model selection, as the model complexity does not change much as the number of samples increases, while the classification error decreases substantially. In all cases, we get a 25%–30% improvement in classification error.
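The per-task error summary reported in Table 1 can be computed as below, given test data and the estimated coefficient matrix; the thresholding rule used to turn scores into 0/1 predictions is our own simplification, not specified in the paper.

import numpy as np

def error_summary(beta, X_test, y_tests):
    """Average and variance of the per-digit classification errors (10-vector)."""
    errs = []
    for i, y in enumerate(y_tests):
        pred = (X_test @ beta[:, i] > 0.5).astype(float)   # simple thresholding of the regression score
        errs.append(np.mean(pred != y))
    errs = np.array(errs)
    return errs.mean(), errs.var()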

¹The exact expression is n / (s log p) = 2. Here, we ignore the term (2 − κ)s compared to p.

References

[1] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.

[2] R. P. W. Duin. Department of Applied Physics, Delft University of Technology, Delft, The Netherlands, 2002.

[3] X. He and P. Niyogi. Locality preserving projections. In NIPS, 2003.

[4] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In NIPS, 2010.

[5] A. Jalali, C. Johnson, and P. Ravikumar. On learning discrete graphical models using greedy methods. In NIPS, 2011.

[6] K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. van de Geer. Taking advantage of sparsity in multi-task learning. In 22nd Conference On Learning Theory (COLT), 2009.

[7] S. Negahban and M. J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ_{1,∞}-regularization. In Advances in Neural Information Processing Systems (NIPS), 2008.

[8] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Neural Information Processing Systems (NIPS) 22, 2009.

[9] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 2010.

[10] S. Perkins and J. Theiler. Online feature selection using grafting. In ICML, 2003.

[11] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[12] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ_1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55:2183–2202, 2009.

[13] C. Zhang and J. Huang. Model selection consistency of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567–1594, 2008.
