A short introduction to parsimony
and
a complexity analysis
of the Lasso regularization path
Julien Mairal
Inria, LEAR team, Grenoble
Toulouse, October 2014
Julien Mairal Complexity Analysis of the Lasso Regularization Path 1/67
Big data
An ill-defined concept
a "buzz" word: regardless of rationality, you may get some funding and become famous by making extensive use of "big data";
replacing "thinking" by "data" and hoping for the best;
a means to make money from your personal data.
A scientific utopia
converting data into scientific knowledge;
better understanding the world by observing it as much as we can.
In French: mégadonnées.
Collaborator
The analysis of the regularization path is joint work with Bin Yu, from UC Berkeley.
Reference
J. Mairal and B. Yu. Complexity analysis of the Lasso regularization path. ICML, 2012.
Advertisement
The introduction to parsimony is based on the material of the upcoming monograph, which will be freely available on arXiv mid-October:
J. Mairal, F. Bach and J. Ponce. Sparse Modeling for Image and Vision Processing. 2014.
Part I: A Short Introduction to Parsimony
1 A short introduction to parsimony
Early thoughts
Sparsity in the statistics literature from the 60's and 70's
Wavelet thresholding in signal processing from the 90's
The modern parsimony and the ℓ1-norm
Structured sparsity
Early thoughts
(a) Dorothy Wrinch, 1894–1980. (b) Harold Jeffreys, 1891–1989.

The existence of simple laws is, then, apparently, to be regarded as a quality of nature; and accordingly we may infer that it is justifiable to prefer a simple law to a more complex one that fits our observations slightly better.

[Wrinch and Jeffreys, 1921]. Philosophical Magazine Series.
Historical overview of parsimony
14th century: Ockham’s razor;
1921: Wrinch and Jeffreys’ simplicity principle;
1952: Markowitz’s portfolio selection;
60's and 70's: best subset selection in statistics;
70’s: use of the ℓ1-norm for signal recovery in geophysics;
90’s: wavelet thresholding in signal processing;
1996: Olshausen and Field’s dictionary learning;
1996–1999: Lasso (statistics) and basis pursuit (signal processing);
2006–now: compressed sensing (signal processing) and Lasso consistency (statistics); applications in various scientific fields such as image processing, bioinformatics, neuroscience, computer vision...
Sparsity in the statistics literature from the 60's and 70's

Given some observed data points z1, . . . , zn that are assumed to be independent samples from a statistical model with parameters θ in R^p, maximum likelihood estimation (MLE) consists of minimizing

min_{θ∈R^p} [ L(θ) := −∑_{i=1}^n log P_θ(z_i) ].

Example: ordinary least squares
Observations z_i = (y_i, x_i), with y_i in R. Linear model: y_i = x_i⊤θ + ε_i, with ε_i ∼ N(0, 1).

min_{θ∈R^p} ∑_{i=1}^n (1/2)(y_i − x_i⊤θ)².
Sparsity in the statistics literature from the 60’s and 70’s
Motivation for finding a sparse solution:
removing irrelevant variables from the model;
obtaining an easier interpretation;
preventing overfitting.
Sparsity in the statistics literature from the 60’s and 70’s
Why this is highly relevant in a modern big data context:
large n allows learning (better) complex models with large p;
large p leads to poor interpretation and irrelevant variables;
large p and large n lead to high computational cost.
Sparsity in the statistics literature from the 60’s and 70’s
To obtain sparsity, one may constrain the number of non-zero parameters: min_{θ∈R^p} L(θ) s.t. ‖θ‖0 ≤ k (best subset selection). Two questions:
1 how to choose k?
2 how to find the best subset of k variables?
Sparsity in the statistics literature from the 60's and 70's
How to choose k?
Mallows's Cp statistics [Mallows, 1964, 1966];
Akaike information criterion (AIC) [Akaike, 1973];
Bayesian information criterion (BIC) [Schwarz, 1978];
Minimum description length (MDL) [Rissanen, 1978].

These approaches lead to penalized problems

min_{θ∈R^p} L(θ) + λ‖θ‖0,

with different choices of λ depending on the chosen criterion.
Sparsity in the statistics literature from the 60’s and 70’s
How to solve the best k-subset selection problem?
Unfortunately...
...the problem is NP-hard [Natarajan, 1995].
Two strategies
combinatorial exploration with branch-and-bound techniques [Furnival and Wilson, 1974] → leaps and bounds, an exact algorithm but with exponential complexity;
greedy approach: forward selection [Efroymson, 1960] (originally developed for observing intermediate solutions), which already contains all the ideas of matching pursuit algorithms.
Important reference: [Hocking, 1976]. The analysis and selection of variables in linear regression. Biometrics.
Wavelet thresholding in signal processing from the 90’s
A wavelet basis represents a set of functions ϕ1, ϕ2, . . . that are essentially dilated and shifted versions of each other [see Mallat, 2008].

Concept of parsimony with wavelets
When a signal f is "smooth", it is close to an expansion ∑_i α_i ϕ_i where only a few coefficients α_i are non-zero.

[Figures: (a) Meyer's wavelet. (b) Morlet's wavelet.]
Wavelet thresholding in signal processing from the 90’s
Wavelets were the topic of a long quest for representing natural images:
2D-Gabors [Daugman, 1985];
steerable wavelets [Simoncelli et al., 1992];
curvelets [Candes and Donoho, 2002];
contourlets [Do and Vetterli, 2003];
bandlets [Le Pennec and Mallat, 2005];
⋆-lets.
(a) 2D Gabor filter. (b) With shifted phase. (c) With rotation.
Wavelet thresholding in signal processing from 90’s
The theory of wavelets is well developed for continuous signals, e.g., in L2(R), but also for discrete signals x in R^n.
Wavelet thresholding in signal processing from 90’s
Given an orthogonal wavelet basis D = [d1, . . . , dn] in R^{n×n}, the wavelet decomposition of x in R^n is simply

β = D⊤x and we have x = Dβ.

The k-sparse approximation problem

min_{α∈R^p} (1/2)‖x − Dα‖²₂ s.t. ‖α‖0 ≤ k,

is not NP-hard here: since D is orthogonal, it is equivalent to

min_{α∈R^p} (1/2)‖β − α‖²₂ s.t. ‖α‖0 ≤ k.
Wavelet thresholding in signal processing from 90’s
The solution of the k-sparse approximation problem is obtained by hard-thresholding:

α_ht[j] = δ_{|β[j]| ≥ µ} β[j] = { β[j] if |β[j]| ≥ µ; 0 otherwise },

where µ is the k-th largest value among the set {|β[1]|, . . . , |β[p]|}.
Wavelet thresholding in signal processing, 90’s
Another key operator introduced by Donoho and Johnstone [1994] is the soft-thresholding operator:

α_st[j] := sign(β[j]) max(|β[j]| − λ, 0) = { β[j] − λ if β[j] ≥ λ; β[j] + λ if β[j] ≤ −λ; 0 otherwise },

where λ is a parameter playing the same role as µ previously.

With β := D⊤x and D orthogonal, it provides the solution of the following sparse reconstruction problem:

min_{α∈R^p} (1/2)‖x − Dα‖²₂ + λ‖α‖₁,

which will be of high importance later.
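Both operators take a couple of lines of NumPy; this is an illustrative sketch (not code from the talk), with hard-thresholding parameterized by the sparsity level k as on the previous slide:

```python
import numpy as np

def soft_threshold(beta, lam):
    # alpha_st[j] = sign(beta[j]) * max(|beta[j]| - lam, 0)
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def hard_threshold(beta, k):
    # keep the k coefficients of largest magnitude, set the rest to zero
    alpha = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-k:]
    alpha[keep] = beta[keep]
    return alpha
```

With an orthogonal D, applying either operator to β = D⊤x solves the corresponding reconstruction problem exactly.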
Wavelet thresholding in signal processing, 90’s
[Figures: (d) Soft-thresholding operator, α_st = sign(β) max(|β| − λ, 0). (e) Hard-thresholding operator, α_ht = δ_{|β| ≥ µ} β.]

Figure: Soft- and hard-thresholding operators, which are commonly used for signal estimation with orthogonal wavelet bases.
Wavelet thresholding in signal processing, 90’s
Various works have tried to exploit the structure of wavelet coefficients.

[Figure: a binary tree of coefficients α1, . . . , α15.]

Figure: Illustration of a wavelet tree with four scales for one-dimensional signals. We also illustrate the zero-tree coding scheme [Shapiro, 1993].
The modern parsimony and the ℓ1-norm
Sparse linear models in signal processing

Let x in R^n be a signal. Let D = [d1, . . . , dp] ∈ R^{n×p} be a set of elementary signals. We call it a dictionary.

D is "adapted" to x if it can represent it with a few elements, that is, if there exists a sparse vector α in R^p such that x ≈ Dα. We call α the sparse code.

x ≈ Dα, with x ∈ R^n, D ∈ R^{n×p}, and α ∈ R^p sparse.
The modern parsimony and the ℓ1-norm
Sparse linear models: machine learning/statistics point of view

Let (y_i, x_i)_{i=1}^n be a training set, where the vectors x_i are in R^p and are called features. The scalars y_i are in
{−1,+1} for binary classification problems,
R for regression problems.

We assume there exists a relation y ≈ β⊤x, and solve

min_{β∈R^p} (1/n) ∑_{i=1}^n L(y_i, β⊤x_i) + λψ(β),

where the first term is the empirical risk and the second the regularization.
The modern parsimony and the ℓ1-norm
Sparse linear models: machine learning/statistics point of view

A few examples:

Ridge regression: min_{β∈R^p} (1/n) ∑_{i=1}^n (1/2)(y_i − β⊤x_i)² + λ‖β‖²₂.

Linear SVM: min_{β∈R^p} (1/n) ∑_{i=1}^n max(0, 1 − y_i β⊤x_i) + λ‖β‖²₂.

Logistic regression: min_{β∈R^p} (1/n) ∑_{i=1}^n log(1 + e^{−y_i β⊤x_i}) + λ‖β‖²₂.
The squared ℓ2-norm induces "smoothness" in β. When one knows in advance that β should be sparse, one should use a sparsity-inducing regularization such as the ℓ1-norm [Chen et al., 1999, Tibshirani, 1996].
The modern parsimony and the ℓ1-norm
Originally used to induce sparsity in geophysics [Claerbout and Muir, 1973, Taylor et al., 1979], the ℓ1-norm became popular in statistics with the Lasso [Tibshirani, 1996] and in signal processing with the basis pursuit [Chen et al., 1999].

Three "equivalent" formulations

1 min_{α∈R^p} (1/2)‖x − Dα‖²₂ + λ‖α‖₁;

2 min_{α∈R^p} (1/2)‖x − Dα‖²₂ s.t. ‖α‖₁ ≤ µ;

3 min_{α∈R^p} ‖α‖₁ s.t. ‖x − Dα‖²₂ ≤ ε.
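Formulation 1 can be solved by iterative soft-thresholding (ISTA), which alternates a gradient step on the quadratic term with the soft-thresholding operator seen earlier; a minimal NumPy sketch, not the solver used in the talk:

```python
import numpy as np

def ista(x, D, lam, n_iter=500):
    """Minimize 0.5*||x - D a||_2^2 + lam*||a||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)           # gradient of the smooth part
        z = a - grad / L                   # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L)*||.||_1
    return a
```

When D is orthogonal, ISTA reaches the closed-form soft-thresholding solution in one iteration.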
The modern parsimony and the ℓ1-norm
And some variants...

For noiseless problems

min_{α∈R^p} ‖α‖₁ s.t. x = Dα.

Beyond least squares

min_{α∈R^p} f(α) + λ‖α‖₁,

where f : R^p → R is convex.
The modern parsimony and the ℓ1-norm
An important question remains: why does the ℓ1-norm induce sparsity?
The modern parsimony and the ℓ1-norm

Why does the ℓ1-norm induce sparsity? Can we get some intuition from the simplest isotropic case,

α(λ) = argmin_{α∈R^p} (1/2)‖x − α‖²₂ + λ‖α‖₁,

or equivalently the Euclidean projection onto the ℓ1-ball,

α(µ) = argmin_{α∈R^p} (1/2)‖x − α‖²₂ s.t. ‖α‖₁ ≤ µ?

"Equivalent" means that for all λ ≥ 0, there exists µ ≥ 0 such that α(µ) = α(λ) and vice versa. The relation between µ and λ is unknown a priori.
Why does the ℓ1-norm induce sparsity?
Regularizing with the ℓ1-norm

[Figure: the ℓ1-ball ‖α‖₁ ≤ µ in the (α[1], α[2]) plane.]

The projection onto a convex set is "biased" towards singularities.
Why does the ℓ1-norm induce sparsity?
Regularizing with the ℓ2-norm

[Figure: the ℓ2-ball ‖α‖₂ ≤ µ in the (α[1], α[2]) plane.]

The ℓ2-norm is isotropic.
Why does the ℓ1-norm induce sparsity?
In 3D (images produced by G. Obozinski).
Why does the ℓ1-norm induce sparsity?
Regularizing with the ℓ∞-norm

[Figure: the ℓ∞-ball ‖α‖∞ ≤ µ in the (α[1], α[2]) plane.]

The ℓ∞-norm encourages |α[1]| = |α[2]|.
Why does the ℓ1-norm induce sparsity?
Analytical point of view: 1D case

min_{α∈R} (1/2)(x − α)² + λ|α|

Piecewise quadratic function with a kink at zero.
Derivative at 0+: g+ = −x + λ, and at 0−: g− = −x − λ.

Optimality conditions. α is optimal iff:
|α| > 0 and −(x − α) + λ sign(α) = 0, or
α = 0 and g+ ≥ 0 and g− ≤ 0.

The solution is a soft-thresholding:

α⋆ = sign(x)(|x| − λ)+.
Why does the ℓ1-norm induce sparsity?
Comparison with ℓ2-regularization in 1D

ψ(α) = α², with ψ′(α) = 2α;  ψ(α) = |α|, with ψ′−(α) = −1 and ψ′+(α) = 1.

The gradient of the ℓ2-penalty vanishes when α gets close to 0. On its differentiable part, the norm of the gradient of the ℓ1-norm is constant.
Why does the ℓ1-norm induce sparsity?
Physical illustration

[Figure: a spring of energy E1 = (k1/2)(x − α)², attached either to a second spring of energy E2 = (k2/2)α² (an ℓ2-like penalty) or to a weight of potential energy E2 = mgα (an ℓ1-like penalty).]

With the second spring, the equilibrium elongation α is never exactly zero; with the weight, the equilibrium reaches α = 0 exactly.
Why does the ℓ1-norm induce sparsity?

[Figure: coefficient values w1, . . . , w5 as a function of λ.]

Figure: The regularization path of the Lasso.

min_{α∈R^p} (1/2)‖x − Dα‖²₂ + λ‖α‖₁.
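A path like the one in the figure can be traced with any LARS/homotopy implementation; as an illustration, here is a sketch with scikit-learn's lars_path (an assumption of this example; it is not the SPAMS toolbox advertised later):

```python
import numpy as np
from sklearn.linear_model import lars_path

# Small synthetic Lasso problem: only three of the five variables are relevant.
rng = np.random.RandomState(0)
X = rng.randn(30, 5)
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + 0.1 * rng.randn(30)

# method="lasso" runs the LARS/homotopy algorithm: alphas holds the kinks
# (in decreasing order) and coefs[:, k] is the solution at the k-th kink;
# the path is linear between consecutive kinks.
alphas, active, coefs = lars_path(X, y, method="lasso")
```

Plotting each row of coefs against alphas reproduces the piecewise-linear picture above.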
Non-convex sparsity-inducing penalties
Exploiting concave functions with a kink at zero

ψ(α) = ∑_{j=1}^p ϕ(|α[j]|).

ℓq-penalty, with 0 < q < 1: ψ(α) := ∑_{j=1}^p |α[j]|^q [Frank and Friedman, 1993];
log penalty: ψ(α) := ∑_{j=1}^p log(|α[j]| + ε) [Candes et al., 2008].

ϕ is any function that looks like this:

[Figure: the graph of a concave function ϕ(|α|) with a kink at zero.]
Non-convex sparsity-inducing penalties
(a) ℓ0.5-ball, 2-D. (b) ℓ1-ball, 2-D. (c) ℓ2-ball, 2-D.

Figure: Open balls in 2-D corresponding to several ℓq-norms and pseudo-norms.
Non-convex sparsity-inducing penalties
[Figure: the ℓq-ball ‖α‖q ≤ µ with q < 1, in the (α[1], α[2]) plane.]
Elastic-net
The elastic net, introduced by Zou and Hastie [2005]:

ψ(α) = ‖α‖₁ + γ‖α‖²₂.

The penalty provides more stable (but less sparse) solutions.

[Figures: (a) ℓ1-ball, 2-D. (b) elastic-net ball, 2-D. (c) ℓ2-ball, 2-D.]
The elastic-net vs other penalties

[Figures, in the (α[1], α[2]) plane: the ℓq-ball ‖α‖q ≤ µ with q < 1; the ℓ1-ball ‖α‖₁ ≤ µ; the elastic-net ball (1 − γ)‖α‖₁ + γ‖α‖²₂ ≤ µ; the ℓ2-ball ‖α‖²₂ ≤ µ.]
Total variation and fused Lasso
The anisotropic total variation [Rudin et al., 1992],

ψ(α) = ∑_{j=1}^{p−1} |α[j + 1] − α[j]|,

is called the fused Lasso in statistics [Tibshirani et al., 2005]. The penalty encourages piecewise constant signals (and can be extended to images).

[Image borrowed from a talk of J.-P. Vert, representing DNA copy numbers.]
Group Lasso and mixed norms
[Turlach et al., 2005, Yuan and Lin, 2006, Zhao et al., 2009, Grandvalet and Canu, 1999, Bakin, 1999]

the ℓ1/ℓq-norm: ψ(α) = ∑_{g∈G} ‖α[g]‖q.

G is a partition of {1, . . . , p}; q = 2 or q = ∞ in practice;
it can be interpreted as the ℓ1-norm of [‖α[g]‖q]_{g∈G}.

Example: ψ(α) = ‖α[{1, 2}]‖₂ + |α[3]|.
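The proximal operator of the ℓ1/ℓ2 penalty generalizes soft-thresholding from single coefficients to groups: each block is shrunk by λ in Euclidean norm, and set to zero if its norm falls below λ. A minimal NumPy sketch (an illustration, not code from the talk):

```python
import numpy as np

def group_soft_threshold(beta, groups, lam):
    """Prox of lam * sum_g ||a[g]||_2, for a partition of the indices."""
    a = np.zeros_like(beta)
    for g in groups:                      # groups: list of index arrays
        norm = np.linalg.norm(beta[g])
        if norm > lam:                    # shrink the whole block...
            a[g] = (1.0 - lam / norm) * beta[g]
        # ...or zero it out entirely when its norm is below lam
    return a
```

This block-wise thresholding is the basic step of proximal gradient methods for the group Lasso.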
Spectral sparsity
[Fazel et al., 2001, Srebro et al., 2005]

A natural regularization function for matrices is the rank,

rank(A) := |{j : s_j(A) ≠ 0}| = ‖s(A)‖₀,

where s_j is the j-th singular value and s is the spectrum of A.

A successful convex relaxation of the rank is the sum of singular values,

‖A‖∗ := ∑_{j=1}^p s_j(A) = ‖s(A)‖₁,

for A in R^{p×k} with k ≥ p.

The resulting function is a norm, called the trace or nuclear norm.
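The proximal operator of the trace norm soft-thresholds the spectrum, exactly as the ℓ1-prox thresholds coefficients; a small NumPy sketch under the notation above (illustrative, not code from the talk):

```python
import numpy as np

def singular_value_threshold(A, lam):
    """Prox of lam*||.||_*: soft-threshold the singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s = np.maximum(s - lam, 0.0)          # shrink the spectrum by lam
    return U @ np.diag(s) @ Vt            # small singular values vanish: low rank
```

Thresholding the spectrum is what makes trace-norm regularization produce low-rank solutions.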
Structured sparsity (images produced by G. Obozinski)
Structured sparsity: metabolic network of the budding yeast, from Rapaport et al. [2007].
Structured sparsity
Warning: under the name "structured sparsity" appear in fact significantly different formulations!

1 non-convex:
zero-tree wavelets [Shapiro, 1993];
predefined collection of sparsity patterns [Baraniuk et al., 2010];
select a union of groups [Huang et al., 2009];
structure via Markov random fields [Cehver et al., 2008].

2 convex (norms):
tree-structure [Zhao et al., 2009];
select a union of groups [Jacob et al., 2009];
zero-pattern is a union of groups [Jenatton et al., 2011];
other norms [Micchelli et al., 2013].
Structured sparsity
Group Lasso with overlapping groups [Jenatton et al., 2011]

ψ(α) = ∑_{g∈G} ‖α[g]‖q.

What happens when the groups overlap?
the pattern of non-zero variables is an intersection of groups;
the zero pattern is a union of groups.

Example: ψ(α) = ‖α‖₂ + |α[2]| + |α[3]|.
Structured sparsity
Hierarchical norms [Zhao et al., 2009].

(d) Sparsity. (e) Group sparsity. (f) Hierarchical sparsity.
Some thoughts from Hocking [1976]:
The problem of selecting a subset of independent or predictor variables is usually described in an idealized setting. That is, it is assumed that (a) the analyst has data on a large number of potential variables which include all relevant variables and appropriate functions of them plus, possibly, some other extraneous variables and variable functions, and (b) the analyst has available "good" data on which to base the eventual conclusions. In practice, the lack of satisfaction of these assumptions may make a detailed subset selection analysis a meaningless exercise.
Part II: Complexity Analysis of the Lasso Regularization Path
What this work is about
another paper about the Lasso/basis pursuit [Tibshirani, 1996, Chen et al., 1999]:

min_{w∈R^p} (1/2)‖y − Xw‖²₂ + λ‖w‖₁;  (1)

the first complexity analysis of the homotopy method [Ritter, 1962, Osborne et al., 2000, Efron et al., 2004] for solving (1);

a robust homotopy algorithm.

A main message reminiscent of
the simplex algorithm [Klee and Minty, 1972];
the SVM regularization path [Gartner et al., 2010].
The Lasso Regularization Path and the Homotopy
When it exists, the regularization path is piecewise linear:

[Figure: coefficient values w1, . . . , w5 as a function of λ.]
Our Main Results
Theorem - worst case analysis
In the worst case, the regularization path of the Lasso has exactly (3^p + 1)/2 linear segments.

Proposition - approximate analysis
There exists an ε-approximate path with O(1/√ε) linear segments.
Brief Introduction to the Homotopy Algorithm
Optimality conditions of the Lasso
w⋆ in R^p is a solution of Eq. (1) if and only if for all j in {1, . . . , p},

x_j⊤(y − Xw⋆) = λ sign(w⋆_j) if w⋆_j ≠ 0,
|x_j⊤(y − Xw⋆)| ≤ λ otherwise.
Uniqueness of the solution
Define J := {j ∈ {1, . . . , p} : |x_j⊤(y − Xw⋆)| = λ}. If the matrix X_J⊤X_J is invertible, the solution is unique and

w⋆_J = (X_J⊤X_J)^{−1}(X_J⊤y − λη_J) = A + λB,

where η := sign(X⊤(y − Xw⋆)).
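The optimality conditions are easy to verify numerically for a candidate w; a small NumPy sketch (illustrative only; the tolerances are my own additions):

```python
import numpy as np

def check_lasso_optimality(X, y, w, lam, tol=1e-6):
    """Numerically verify the subgradient conditions of Eq. (1) at w."""
    corr = X.T @ (y - X @ w)              # correlations x_j^T (y - Xw)
    nz = np.abs(w) > tol                  # active (non-zero) variables
    ok_active = np.allclose(corr[nz], lam * np.sign(w[nz]), atol=1e-4)
    ok_inactive = np.all(np.abs(corr[~nz]) <= lam + 1e-4)
    return ok_active and ok_inactive
```

With X = I this reduces to checking that w is the soft-thresholding of y at level λ.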
Brief Introduction to the Homotopy Algorithm
Piecewise linearity
Under uniqueness assumptions on the Lasso solution, the regularization path λ ↦ w⋆(λ) is continuous and piecewise linear.

Recipe of the homotopy method - main ideas
1 find a trivial solution w⋆(λ∞) = 0 with λ∞ = ‖X⊤y‖∞;
2 compute the direction of the current linear segment of the path;
3 follow the direction of the path by decreasing λ;
4 stop at the next "kink" and go back to 2.

Caveats - questions
kinks can be very close to each other;
X_J⊤X_J can be ill-conditioned;
what is the complexity?
Worst case analysis
Theorem - worst case analysis
In the worst case, the regularization path of the Lasso has exactly (3^p + 1)/2 linear segments.

[Figure: a pathological regularization path for p = 6; coefficients (log scale) against the kinks.]
Worst case analysis
Consider a Lasso problem (y ∈ R^n, X ∈ R^{n×p}). Define the vector ỹ in R^{n+1} and the matrix X̃ in R^{(n+1)×(p+1)} as follows:

ỹ := [y; y_{n+1}],   X̃ := [X, 2αy; 0, αy_{n+1}],

where y_{n+1} ≠ 0 and 0 < α < λ1/(2y⊤y + y²_{n+1}).

Adversarial strategy
If the regularization path of the Lasso (y, X) has k linear segments, the path of (ỹ, X̃) has 3k − 1 linear segments.
Worst case analysis
Let {η1, . . . , ηk} denote the sequence of k sparsity patterns in {−1, 0, 1}^p encountered along the path of the Lasso (y, X). Writing [η; b] for the stacked pattern in {−1, 0, 1}^{p+1}, the new sequence of sparsity patterns for (ỹ, X̃) is

[η1 = 0; 0], [η2; 0], . . . , [ηk; 0]   (first k patterns),
[ηk; 1], [ηk−1; 1], . . . , [η1 = 0; 1]   (middle k patterns),
[−η2; 1], [−η3; 1], . . . , [−ηk; 1]   (last k − 1 patterns).
Worst case analysis
Some intuition why this is true:

1 the patterns of the new path must be of the form [η_i⊤, 0]⊤ or [±η_i⊤, 1]⊤;
2 the factor α ensures that the (p + 1)-th variable enters the path late;
3 after the first k kinks, we have y ≈ Xw⋆(λ) and thus

X̃ [w⋆(λ); 0] + [0; y_{n+1}] ≈ ỹ ≈ X̃ [−w⋆(λ); 1/α].

We are now in shape to build a pathological path with (3^p + 1)/2 linear segments. Note that this lower-bound complexity is optimal.
Approximate Complexity
Strong Duality

[Figure: a primal objective f(w) and a dual objective g(κ), touching at the optimal pair (w⋆, κ⋆).]

Strong duality means that max_κ g(κ) = min_w f(w).
Approximate Complexity
Duality Gaps

[Figure: the gap δ(w, κ) between f(w) and g(κ) for a non-optimal pair (w, κ).]

Strong duality means that max_κ g(κ) = min_w f(w).
The duality gap guarantees that 0 ≤ f(w) − f(w⋆) ≤ δ(w, κ).
Approximate Complexity
min_w { f_λ(w) := (1/2)‖y − Xw‖²₂ + λ‖w‖₁ },   (primal)

max_κ { g_λ(κ) := −(1/2)κ⊤κ − κ⊤y s.t. ‖X⊤κ‖∞ ≤ λ }.   (dual)

ε-approximate solution
w is an ε-approximate solution when there exists a dual variable κ s.t.

δ_λ(w, κ) = f_λ(w) − g_λ(κ) ≤ ε f_λ(w).

ε-approximate path
A path P : λ ↦ w(λ) is an ε-approximate path if it always contains ε-approximate solutions.

(see Giesen et al. [2010] for generic results)
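Given any primal candidate w, a feasible dual κ is obtained by rescaling the residual Xw − y so that ‖X⊤κ‖∞ ≤ λ, which yields a computable gap; a NumPy sketch under the primal/dual pair above (an illustration, not the paper's code):

```python
import numpy as np

def lasso_duality_gap(X, y, w, lam):
    """Duality gap delta(w, kappa) for the primal/dual pair above."""
    res = X @ w - y                       # primal residual Xw - y
    primal = 0.5 * res @ res + lam * np.sum(np.abs(w))
    # Rescale the residual so that kappa is dual feasible:
    # ||X^T kappa||_inf <= lam (the guard avoids division by zero).
    scale = min(1.0, lam / max(np.max(np.abs(X.T @ res)), 1e-12))
    kappa = scale * res
    dual = -0.5 * kappa @ kappa - kappa @ y
    return primal - dual
```

At an exact solution the residual is already feasible and the gap vanishes, which is the stopping criterion monitored by path algorithms.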
Approximate Complexity
ε-approximate solution
w satisfies APPROX_λ(ε) when there exists a dual variable κ s.t.

δ_λ(w, κ) = f_λ(w) − g_λ(κ) ≤ ε f_λ(w).

ε-approximate path
A path P : λ ↦ w(λ) is an ε-approximate path if it always contains ε-approximate solutions.

(see Giesen et al. [2010] for generic results)
Approximate Complexity
Optimality conditions
w in R^p is a solution of (1) if and only if for all j in {1, . . . , p},

x_j⊤(y − Xw) = λ sign(w_j) if w_j ≠ 0,
|x_j⊤(y − Xw)| ≤ λ otherwise.   (exact)
Approximate Complexity
(ε1, ε2)-approximate optimality conditions
w in R^p satisfies OPT_λ(ε1, ε2) if and only if for all j in {1, . . . , p},

λ(1 − ε2) ≤ x_j⊤(y − Xw) sign(w_j) ≤ λ(1 + ε1) if w_j ≠ 0,
|x_j⊤(y − Xw)| ≤ λ(1 + ε1) otherwise.

Relations between OPT_λ and APPROX_λ

APPROX_λ(0) ⟹ OPT_λ(0, 0)
⟹ OPT_{λ(1−√ε)}(√ε/(1 − √ε), −√ε/(1 − √ε))
⟹ APPROX_{λ(1−√ε)}(ε).

Proposition - approximate analysis
There exists an ε-approximate path with at most ⌈log(λ∞/λ1)/√ε⌉ segments.
Approximate Homotopy
Recipe - main ideas/features
Maintain OPT_λ(ε/2, ε/2) instead of OPT_λ(0, 0);
Take steps in λ of size at least λθ√ε, i.e., move from λ to at most λ(1 − θ√ε);
When the kinks are too close to each other, make a large step and use a first-order method instead;
Between λ∞ and λ1, the number of iterations is upper-bounded by ⌈log(λ∞/λ1)/(θ√ε)⌉.
Conclusion
A few messages
Despite an exponential worst-case complexity, the homotopy algorithm remains extremely powerful in practice;
the main issue of the homotopy algorithm might be its numerical stability;
when one does not care about high precision, the worst-case complexity of the path can be significantly reduced.
Advertisement SPAMS toolbox (open-source)
C++ interfaced with Matlab, R, Python.
proximal gradient methods for ℓ0, ℓ1, elastic-net, fused-Lasso, group-Lasso, tree group-Lasso, tree-ℓ0, sparse group Lasso, overlapping group Lasso...
...for square, logistic, and multi-class logistic loss functions.
handles sparse matrices, provides duality gaps.
fast implementations of OMP and LARS - homotopy.
dictionary learning and matrix factorization (NMF, sparse PCA).
coordinate descent, block coordinate descent algorithms.
fast projections onto some convex sets.
Try it! http://www.di.ens.fr/willow/SPAMS/
References I
H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, volume 1, pages 267–281, 1973.
S. Bakin. Adaptive regression and model selection in data mining problems. PhD thesis, 1999.
R.G. Baraniuk, V. Cevher, M. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.
D.P. Bertsekas. Nonlinear programming. Athena Scientific, 1999. 2nd edition.
J.M. Borwein and A.S. Lewis. Convex analysis and nonlinear optimization: theory and examples. Springer, 2006.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
References II
E. Candes and D. L. Donoho. Recovering edges in ill-posed inverse problems: Optimality of curvelet frames. Annals of Statistics, 30(3):784–842, 2002.
E.J. Candes, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted L1 minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.
V. Cehver, M. Duarte, C. Hedge, and R.G. Baraniuk. Sparse signal recovery using Markov random fields. In Advances in Neural Information Processing Systems (NIPS), 2008.
S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1999.
J. F. Claerbout and F. Muir. Robust modeling with erratic data. Geophysics, 38(5):826–844, 1973.
References III
J. G. Daugman. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2(7):1160–1169, 1985.
M. Do and M. Vetterli. Contourlets. In Beyond Wavelets. Academic Press, 2003.
D. L. Donoho and J. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
M. A. Efroymson. Multiple regression analysis. Mathematical methods for digital computers, 9(1):191–203, 1960.
M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristic with application to minimum order system approximation. In American Control Conference, volume 6, pages 4734–4739, 2001.
References IV
I. E. Frank and J. H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993.
G. M. Furnival and R. W. Wilson. Regressions by leaps and bounds. Technometrics, 16(4):499–511, 1974.
B. Gärtner, M. Jaggi, and C. Maria. An exponential lower bound on the complexity of regularization paths. Preprint arXiv:0903.4817v2, 2010.
J. Giesen, M. Jaggi, and S. Laue. Approximating parameterized convex optimization problems. In Algorithms - ESA, Lecture Notes in Computer Science, 2010.
Y. Grandvalet and S. Canu. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. In Advances in Neural Information Processing Systems (NIPS), 1999.
R. R. Hocking. A Biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics, 32:1–49, 1976.
References V
J. Huang, Z. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), 2009.
L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlaps and graph Lasso. In Proceedings of the International Conference on Machine Learning (ICML), 2009.
R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research, 12:2777–2824, 2011.
V. Klee and G. J. Minty. How good is the simplex algorithm? In O. Shisha, editor, Inequalities, volume III, pages 159–175. Academic Press, New York, 1972.
E. Le Pennec and S. Mallat. Sparse geometric image representations with bandelets. IEEE Transactions on Image Processing, 14(4):423–438, 2005.
References VI
S. Mallat. A wavelet tour of signal processing. Academic Press, 3rd edition, 2008.
C. L. Mallows. Choosing variables in a linear regression: A graphical aid. Unpublished paper presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, Kansas, 1964.
C. L. Mallows. Choosing a subset regression. Unpublished paper presented at the Joint Statistical Meeting, Los Angeles, California, 1966.
C. A. Micchelli, J. M. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38(3):455–489, 2013.
B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24:227–234, 1995.
References VII
M. R. Osborne, B. Presnell, and B. A. Turlach. On the Lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000.
F. Rapaport, A. Zinovyev, M. Dutreix, E. Barillot, and J.-P. Vert. Classification of microarray data using gene networks. BMC Bioinformatics, 8(1):35, 2007.
J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
K. Ritter. Ein Verfahren zur Lösung parameterabhängiger, nichtlinearer Maximum-Probleme. Mathematical Methods of Operations Research, 6(4):149–166, 1962.
L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
References VIII
G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.
J. M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing, 41(12):3445–3462, 1993.
E. P. Simoncelli, W. T. Freeman, E. H. Adelson, and D. J. Heeger. Shiftable multiscale transforms. IEEE Transactions on Information Theory, 38(2):587–607, 1992.
N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems (NIPS), 2005.
H. L. Taylor, S. C. Banks, and J. F. McCoy. Deconvolution with the ℓ1 norm. Geophysics, 44(1):39–52, 1979.
R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B, 58(1):267–288, 1996.
References IX
R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society Series B, 67(1):91–108, 2005.
B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.
D. Wrinch and H. Jeffreys. On certain fundamental principles of scientific inquiry. Philosophical Magazine Series 6, 42(249):369–390, 1921.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68:49–67, 2006.
P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468–3497, 2009.
References X
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B, 67(2):301–320, 2005.
Appendix
Basic convex optimization tools: subgradients
Figure : Gradients and subgradients for smooth and non-smooth functions (left: smooth case; right: non-smooth case).
∂f(α) := {κ ∈ R^p | f(α) + κ⊤(α′ − α) ≤ f(α′) for all α′ ∈ R^p}.
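As a numerical sanity check (a minimal numpy sketch, not from the slides), the subgradient inequality can be verified for f(α) = ‖α‖1, for which κ = sign(α) is a subgradient at α:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = rng.standard_normal(5)

def f(a):
    return np.abs(a).sum()          # f(a) = ||a||_1, convex but non-smooth

kappa = np.sign(alpha)              # a subgradient of ||.||_1 at alpha

# Subgradient inequality: f(alpha) + kappa^T (a' - alpha) <= f(a') for all a'
for _ in range(1000):
    a_prime = 10.0 * rng.standard_normal(5)
    assert f(alpha) + kappa @ (a_prime - alpha) <= f(a_prime) + 1e-12
```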
Basic convex optimization tools: subgradients
Some nice properties
∂f(α) = {g} iff f is differentiable at α and g = ∇f(α).
many calculus rules: ∂(γf + µg) = γ∂f + µ∂g for γ, µ > 0.
for more details, see Boyd and Vandenberghe [2004], Bertsekas [1999],Borwein and Lewis [2006] and S. Boyd’s course at Stanford.
Optimality conditions
For g : Rp → R convex,
g differentiable: α⋆ minimizes g iff ∇g(α⋆) = 0.
g nondifferentiable: α⋆ minimizes g iff 0 ∈ ∂g(α⋆).
Careful: the concept of subgradient requires a function to be above its tangents. It only makes sense for convex functions!
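To illustrate the nondifferentiable optimality condition, a small sketch (toy function chosen purely for illustration): g(α) = |α − 1| + ½α² is minimized at the kink α = 1, where ∂g(1) = 1 + [−1, 1] contains 0.

```python
import numpy as np

# g(a) = |a - 1| + 0.5 * a^2 : convex, non-differentiable at a = 1
def g(a):
    return np.abs(a - 1.0) + 0.5 * a ** 2

# Locate the minimizer on a fine grid.
grid = np.linspace(-3.0, 3.0, 600001)
a_star = grid[np.argmin(g(grid))]

# Subdifferential at the kink: dg(1) = 1 + [-1, 1] = [0, 2] contains 0,
# so a = 1 satisfies the optimality condition 0 in dg(a).
assert abs(a_star - 1.0) < 1e-4
```

The minimum sits exactly at a point where g is not differentiable, which is why the subgradient-based condition 0 ∈ ∂g(α⋆) is needed rather than ∇g(α⋆) = 0.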
Basic convex optimization tools: dual-norm
Definition
Let κ be in Rp,
‖κ‖∗ := max_{α ∈ R^p : ‖α‖ ≤ 1} α⊤κ.
Exercises
‖α‖∗∗ = ‖α‖ (true in finite dimension)
ℓ2 is dual to itself.
ℓ1 and ℓ∞ are dual to each other.
ℓq and ℓq′ are dual to each other if 1/q + 1/q′ = 1.
similar relations for spectral norms on matrices.
∂‖α‖ = {κ ∈ Rp s.t. ‖κ‖∗ ≤ 1 and κ⊤α = ‖α‖}.
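These exercises can be checked numerically; the sketch below (illustrative, not from the slides) verifies that the dual of ℓ∞ is ℓ1, and that ℓ2 is dual to itself:

```python
import numpy as np

rng = np.random.default_rng(1)
kappa = rng.standard_normal(6)

# Dual of the l_inf norm: max { a^T kappa : ||a||_inf <= 1 } = ||kappa||_1,
# attained at a = sign(kappa).
best = np.sign(kappa) @ kappa
assert np.isclose(best, np.abs(kappa).sum())

# Random feasible points never beat the maximizer.
for _ in range(1000):
    a = rng.uniform(-1.0, 1.0, size=6)   # ||a||_inf <= 1
    assert a @ kappa <= best + 1e-12

# l2 is dual to itself: the maximum over the l2 ball is at a = kappa/||kappa||_2.
assert np.isclose((kappa / np.linalg.norm(kappa)) @ kappa, np.linalg.norm(kappa))
```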
Optimality conditions
Let f : R^p → R be convex and differentiable, and let ‖.‖ be any norm. Consider
min_{α ∈ R^p} f(α) + λ‖α‖.
α is a solution if and only if
0 ∈ ∂(f (α) + λ‖α‖) = ∇f (α) + λ∂‖α‖
Since ∂‖α‖ = {κ ∈ Rp s.t. ‖κ‖∗ ≤ 1 and κ⊤α = ‖α‖},
General optimality conditions:
‖∇f (α)‖∗ ≤ λ and −∇f (α)⊤α = λ‖α‖.
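For a toy instance where the solution is known in closed form (f(α) = ½‖y − α‖², so the Lasso solution is the soft-thresholding of y; an illustrative setup, not from the slides), both optimality conditions can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.standard_normal(8)
lam = 0.5

# f(a) = 0.5 * ||y - a||^2, so grad f(a) = a - y; the minimizer of
# f(a) + lam * ||a||_1 is the soft-thresholding of y.
alpha = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)
grad = alpha - y

# General optimality conditions with the l_inf dual norm:
assert np.max(np.abs(grad)) <= lam + 1e-12                    # ||grad f||_* <= lam
assert np.isclose(-grad @ alpha, lam * np.abs(alpha).sum())   # -grad^T a = lam ||a||_1
```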
Convex Duality
Strong Duality
Figure : primal objective f(α) with minimizer α⋆ and dual objective g(κ) with maximizer κ⋆.
Strong duality means that maxκ g(κ) = minα f (α)
Convex Duality
Duality Gaps
Figure : primal objective f(α), dual objective g(κ), and the duality gap δ(α, κ) between them at a primal-dual pair (α, κ).
Strong duality means that maxκ g(κ) = minα f (α)
The duality gap δ(α, κ) := f(α) − g(κ) guarantees that 0 ≤ f(α) − f(α⋆) ≤ δ(α, κ).
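For the Lasso, such a certificate is computable in closed form: any κ with ‖X⊤κ‖∞ ≤ λ is dual-feasible with dual value g(κ) = ½‖y‖² − ½‖κ − y‖², and scaling the residual gives a feasible κ. The numpy sketch below (illustrative; the function name is hypothetical) computes δ(α, κ):

```python
import numpy as np

def lasso_duality_gap(X, y, alpha, lam):
    """Duality gap certificate for 0.5*||y - X a||^2 + lam*||a||_1.

    A dual-feasible point is built by scaling the residual so that
    ||X^T kappa||_inf <= lam; weak duality then guarantees gap >= 0
    and f(alpha) - f(alpha_star) <= gap."""
    r = y - X @ alpha                              # residual
    scale = min(1.0, lam / np.abs(X.T @ r).max())  # make kappa dual-feasible
    kappa = scale * r
    primal = 0.5 * r @ r + lam * np.abs(alpha).sum()
    dual = 0.5 * y @ y - 0.5 * (kappa - y) @ (kappa - y)
    return primal - dual

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
gap = lasso_duality_gap(X, y, np.zeros(5), lam=1.0)
assert gap >= 0.0   # weak duality
```

This is the kind of duality gap a solver can monitor as a stopping criterion, since it bounds the suboptimality of the current iterate without knowing α⋆.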