Structured Sparsity in Natural Language Processing:
Models, Algorithms, and Applications
Andre F. T. Martins1,2,3 Mario A. T. Figueiredo1 Noah A. Smith2
1Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal
2Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
3Priberam, Lisboa, Portugal
NAACL 2012: Tutorials. Montreal, Quebec, June 3, 2012
Slides online at http://tiny.cc/ssnlp
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 1 / 155
Welcome
This tutorial is about sparsity, a topic of great relevance to NLP.
Sparsity relates to feature selection, model compactness, runtime, memory footprint, and the interpretability of our models.
New idea in the last 5 years: structured sparsity. This tutorial tries to answer:
What is structured sparsity?
How do we apply it?
How has it been used so far?
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Convex Analysis
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Notation
Many NLP problems involve mapping from one structured space to another. Notation:
Input set X
For each x ∈ X, candidate outputs are Y(x) ⊆ Y
Mapping is hw : X → Y
Linear Models
Our predictor will take the form
hw(x) = arg max_{y ∈ Y(x)} w⊤f(x, y)
where:
f is a vector function that encodes all the relevant things about (x, y); the result of a theory, our knowledge, feature engineering, etc.
w ∈ R^D are the weights that parameterize the mapping.
NLP today: D is often in the tens or hundreds of millions.
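As a toy sketch of this prediction rule (the feature names and weights below are invented for illustration), the arg max can be computed by scoring each candidate output:

```python
def predict(w, f, x, candidates):
    """Return the candidate y maximizing the linear score w . f(x, y)."""
    def score(y):
        # Sparse dot product: features absent from w contribute weight 0.
        return sum(w.get(name, 0.0) * value for name, value in f(x, y).items())
    return max(candidates, key=score)

# Toy feature map: one indicator feature per (word, label) pair.
def f(x, y):
    return {"word=%s^label=%s" % (x, y): 1.0}

w = {"word=bank^label=NOUN": 2.0, "word=bank^label=VERB": 0.5}
assert predict(w, f, "bank", ["NOUN", "VERB"]) == "NOUN"
```

In practice Y(x) is exponentially large and the max is computed by dynamic programming rather than enumeration; the enumeration above is only for small candidate sets.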
Learning Linear Models
Max ent, perceptron, CRF, SVM, even supervised generative models all fit the linear modeling framework.
General training setup:
We observe a collection of examples ⟨xn, yn⟩, n = 1, . . . , N.
Perform statistical analysis to discover w from the data. Ranges from “count and normalize” to complex optimization routines.
Optimization view:
w = arg min_w (1/N) Σ_{n=1}^N L(w; xn, yn) + Ω(w)

where the sum is the empirical loss and Ω is the regularizer.
This tutorial will focus on the regularizer, Ω.
What is Sparsity?
The word “sparsity” has (at least) four related meanings in NLP!
1 Data sparsity: N is too small to obtain a good estimate for w. Also known as the “curse of dimensionality.” (Usually bad.)
2 “Probability” sparsity: I have a probability distribution over events (e.g., X × Y), most of which receive zero probability. (Might be good or bad.)
3 Sparsity in the dual: associated with SVMs and other kernel-based methods; implies that the predictor can be represented via kernel calculations involving just a few training instances.
4 Model sparsity: most dimensions of f are not needed for a good hw; those dimensions of w can be zero, leading to a sparse w (model).
This tutorial is about sense #4: today, (model) sparsity is a good thing!
Why Sparsity is Desirable in NLP
Occam’s razor and interpretability.
The bet on sparsity (Friedman et al., 2004): it’s often correct. When it isn’t, there’s no good solution anyway!
Models with just a few features are
easy to explain and implement
attractive as linguistic hypotheses
reminiscent of classical symbolic systems
A decision list from Yarowsky (1995).
Why Sparsity is Desirable in NLP
Computational savings.
wd = 0 is equivalent to erasing the feature from the model; a smaller effective D implies a smaller memory footprint.
This, in turn, implies faster decoding runtime.
Further, sometimes entire kinds of features can be eliminated, giving asymptotic savings.
Why Sparsity is Desirable in NLP
Generalization.
The challenge of learning is to extract from the data only what will generalize to new examples.
Forcing a learner to use few features is one way to discourage overfitting.
Experimental results from Kazama and Tsujii (2003): F1 on two text categorization tasks as the number of features varies.
(Automatic) Feature Selection
Human NLPers are good at thinking of features.
Can we automate the process of selecting which ones to keep?
Three kinds of methods:
1 filters
2 wrappers
3 embedded methods (this tutorial)
Filter-based Feature Selection
For each candidate feature fd, apply a heuristic to determine whether to include it. (Excluding fd equates to fixing wd = 0.)
Examples:
Count threshold: is |{n : fd(xn, yn) > 0}| > τ? (Ignore rare features.)
Mutual information or correlation between features and labels
Advantage: speed!
Disadvantages:
Ignores the learning algorithm
Thresholds require tuning
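A count-threshold filter is only a few lines; the feature names and counts below are made up, and τ plays the role of the cutoff:

```python
def count_filter(feature_counts, tau):
    """Keep only features seen more than tau times in the training data;
    dropping feature d amounts to fixing w_d = 0."""
    return {name for name, count in feature_counts.items() if count > tau}

counts = {"suffix=-ing": 250, "word=absquatulate": 3}
assert count_filter(counts, tau=9) == {"suffix=-ing"}
```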
Ratnaparkhi (1996), on his POS tagger:
The behavior of a feature that occurs very sparsely in the training set is often difficult to predict, since its statistics may not be reliable. Therefore, the model uses the heuristic that any feature which occurs less than 10 times in the data is unreliable, and ignores features whose counts are less than 10.1 While there are many smoothing algorithms which use techniques more rigorous than a simple count cutoff, they have not yet been investigated in conjunction with this tagger.
1Except for features that look only at the current word, i.e., features of the form wi = <word> and ti = <TAG>. The count of 10 was chosen by inspection of Training and Development data.
Wrapper-based Feature Selection
For each subset F ⊆ {1, 2, . . . , D}, learn hw_F for features {fd | d ∈ F}.
2^D − 1 choices; so perform a search over subsets.
Cons:
NP-hard problem (Amaldi and Kann, 1998; Davis et al., 1997)
Must resort to greedy methods
Even those require iterative calls to a black-box learner
Danger of overfitting in choosing F. (Typically use development data or cross-validate.)
Della Pietra et al. (1997) add features one at a time. Step (3) involves re-estimating parameters:
Embedded Methods for Feature Selection
Formulate the learning problem as a trade-off between
minimizing loss (fitting the training data, achieving good accuracy on the training data, etc.)
choosing a desirable model (e.g., one with no more features than needed)

min_w (1/N) Σ_{n=1}^N L(w; xn, yn) + Ω(w)

Key advantage: declarative statements of model “desirability” often lead to well-understood, solvable optimization problems.
Useful Papers on Feature Selection and Sparsity
Overview of many feature selection methods: Guyon and Elisseeff (2003)
Greedy wrapper-based method used for max ent models in NLP: Della Pietra et al. (1997)
Early uses of sparsity in NLP: Kazama and Tsujii (2003); Goodman (2004)
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Convex Analysis
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Loss functions (I)
Regression (y ∈ R) typically uses the squared error loss:
LSE(w; x, y) = (1/2) (y − w⊤f(x))²

Total loss:

(1/2) Σ_{n=1}^N (yn − w⊤f(xn))² = (1/2) ‖Aw − y‖₂²

Design matrix: A = [Aij], i = 1, . . . , N; j = 1, . . . , D, where Aij = fj(xi).
Response vector: y = [y1, . . . , yN]⊤.
Arguably the most/best-studied loss function (statistics, machine learning, signal processing).
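A minimal plain-Python sketch of the total squared-error loss, with the design matrix A stored as a list of feature rows (the numbers are invented):

```python
def total_squared_loss(A, w, y):
    """(1/2) * sum_n (y_n - w . f(x_n))^2 = (1/2) ||A w - y||_2^2,
    where row n of the design matrix A is f(x_n)."""
    loss = 0.0
    for row, target in zip(A, y):
        prediction = sum(a * wi for a, wi in zip(row, w))
        loss += 0.5 * (target - prediction) ** 2
    return loss

A = [[1.0, 0.0], [1.0, 1.0]]   # two examples, two features
y = [1.0, 3.0]
assert total_squared_loss(A, [1.0, 2.0], y) == 0.0   # exact fit
assert total_squared_loss(A, [0.0, 0.0], y) == 5.0   # (1/2)(1 + 9)
```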
Loss functions (II)
Classification and structured prediction using log-linear models (logistic regression, max ent, conditional random fields):

LLR(w; x, y) = −log P(y | x; w)
             = −log [ exp(w⊤f(x, y)) / Σ_{y′ ∈ Y(x)} exp(w⊤f(x, y′)) ]
             = −w⊤f(x, y) + log Z(w, x)

Partition function:

Z(w, x) = Σ_{y′ ∈ Y(x)} exp(w⊤f(x, y′)).
Related loss functions: hinge loss (in SVM) and the perceptron loss.
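A sketch of LLR for a small candidate set, computing log Z with the usual log-sum-exp shift for numerical stability (the feature map here is a made-up toy):

```python
import math

def log_linear_loss(w, f, x, y, candidates):
    """L_LR(w; x, y) = -w.f(x, y) + log Z(w, x) over the candidate set Y(x)."""
    def score(y_prime):
        return sum(w.get(k, 0.0) * v for k, v in f(x, y_prime).items())
    scores = [score(y_prime) for y_prime in candidates]
    m = max(scores)  # log-sum-exp shift avoids overflow in exp()
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -score(y) + log_z

def f(x, y):
    return {"%s^%s" % (x, y): 1.0}

# With w = 0 all candidates tie, so the loss is log |Y(x)|.
assert abs(log_linear_loss({}, f, "x", "a", ["a", "b"]) - math.log(2)) < 1e-9
```

As with decoding, structured Y(x) requires dynamic programming for Z; the explicit sum is only feasible for small candidate sets.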
Main Loss Functions: Summary
Squared (linear regression): (1/2) (y − w⊤f(x))²
Log-linear (MaxEnt, CRF, logistic): −w⊤f(x, y) + log Σ_{y′ ∈ Y} exp(w⊤f(x, y′))
Hinge (SVMs): −w⊤f(x, y) + max_{y′ ∈ Y} (w⊤f(x, y′) + c(y, y′))
Perceptron: −w⊤f(x, y) + max_{y′ ∈ Y} w⊤f(x, y′)

(In the SVM loss, c(y, y′) is a cost function.)

All these losses are particular cases of a general family (Martins et al., 2010).
Regularization
Regularized parameter estimate:
w = arg min_w Ω(w) + Σ_{n=1}^N L(w; xn, yn)   (the sum is the total loss)

where Ω(w) ≥ 0 and lim_{‖w‖→∞} Ω(w) = ∞ (a coercive function).
Why regularize?
Improve generalization by avoiding over-fitting.
The total loss may not be coercive (e.g., the logistic loss on separable data), and thus may have no minimum.
Express prior knowledge about w.
Select relevant features (via sparsity-inducing regularization).
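The objective above as a generic function (the loss and Ω are passed in; the one-dimensional squared-error example is purely illustrative):

```python
def objective(w, examples, loss, omega):
    """Regularized estimate's objective: Omega(w) + total loss over the data."""
    return omega(w) + sum(loss(w, x, y) for x, y in examples)

# 1-D illustration: squared-error loss with a quadratic (coercive) regularizer.
loss = lambda w, x, y: 0.5 * (y - w * x) ** 2
omega = lambda w: 0.5 * w ** 2
assert objective(0.0, [(1.0, 1.0)], loss, omega) == 0.5  # all loss, no penalty
assert objective(1.0, [(1.0, 1.0)], loss, omega) == 0.5  # no loss, all penalty
```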
Regularization Formulations
Tikhonov regularization:

w = arg min_w λΩ(w) + Σ_{n=1}^N L(w; xn, yn)

Ivanov regularization:

w = arg min_w Σ_{n=1}^N L(w; xn, yn)  subject to  Ω(w) ≤ τ

Morozov regularization:

w = arg min_w Ω(w)  subject to  Σ_{n=1}^N L(w; xn, yn) ≤ δ
Equivalent, under mild conditions (namely convexity).
Regularization vs. Bayesian estimation
Regularized parameter estimate: w = arg min_w Ω(w) + Σ_{n=1}^N L(w; xn, yn)

...interpretable as a Bayesian maximum a posteriori (MAP) estimate:

w = arg max_w exp(−Ω(w)) · Π_{n=1}^N exp(−L(w; xn, yn)),

where exp(−Ω(w)) is the prior p(w) and the product is the likelihood (i.i.d. data).

This interpretation underlies the logistic regression (LR) loss: LLR(w; xn, yn) = −log P(yn | xn; w).

Same is true for the squared error (SE) loss:

LSE(w; xn, yn) = (1/2) (yn − w⊤f(xn))² = −log N(yn | w⊤f(xn), 1)
Norms: A Quick Review
Before focusing on regularizers, a quick review of norms.
A function p : R^D → R is a norm if it satisfies:
p(αw) = |α|p(w), for any w (homogeneity);
p(w + w′) ≤ p(w) + p(w′), for any w,w′ (triangle inequality);
p(w) = 0 if and only if w = 0.
Examples of norms:
‖w‖p = (Σᵢ |wi|^p)^{1/p}  (called the ℓp norm, for p ≥ 1)

‖w‖∞ = lim_{p→∞} ‖w‖p = max{|wi|, i = 1, . . . , D}

Fact: all norms are convex.

Also important (but not a norm): ‖w‖0 = lim_{p→0} ‖w‖p^p = |{i : wi ≠ 0}|
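These quantities in a few lines of plain Python (the ℓ0 “norm” is included to show it just counts nonzeros):

```python
def lp_norm(w, p):
    """The l_p norm (sum_i |w_i|^p)^(1/p), for p >= 1."""
    return sum(abs(wi) ** p for wi in w) ** (1.0 / p)

def linf_norm(w):
    """The l_infinity norm: the largest absolute entry."""
    return max(abs(wi) for wi in w)

def l0(w):
    """Not a norm (fails homogeneity): the number of nonzero entries."""
    return sum(1 for wi in w if wi != 0)

w = [3.0, -4.0, 0.0]
assert lp_norm(w, 1) == 7.0
assert lp_norm(w, 2) == 5.0
assert linf_norm(w) == 4.0
assert l0(w) == 2
```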
Classical Regularizers: Ridge
Regularized parameter estimate: w = arg min_w Σ_{n=1}^N L(w; xn, yn) + Ω(w)

Arguably the most classical choice, the squared ℓ2 norm: Ω(w) = (λ/2) ‖w‖₂²

Corresponds to a zero-mean Gaussian prior p(w) ∝ exp(−λ‖w‖₂²)

Ridge regression (SE loss): Hoerl and Kennard (1962 and 1970).

Ridge logistic regression: Schaefer et al. (1984), Cessie and Houwelingen (1992); in NLP: Chen and Rosenfeld (1999).

Closely related to Tikhonov (1943) and Wiener (1949).
Closely related to Tikhonov (1943) and Wiener (1949).
Pros: smooth and convex, thus benign for optimization.
Cons: doesn’t promote sparsity (no explicit feature selection).
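A single-feature sketch (not from the slides) of ridge with the SE loss: the closed-form minimizer of (1/2) Σ (y − wx)² + (λ/2) w² shows shrinkage toward zero, but no exact zeros:

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge solution for a single feature:
    w = sum(x*y) / (sum(x*x) + lam), minimizing
    (1/2) sum_n (y_n - w x_n)^2 + (lam/2) w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

# lam = 0 recovers least squares; larger lam shrinks w, but never to exactly 0.
assert ridge_1d([1.0, 2.0], [2.0, 4.0], 0.0) == 2.0
assert 0.0 < ridge_1d([1.0, 2.0], [2.0, 4.0], 5.0) < 2.0
```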
Classical Regularizers: Lasso
Regularized parameter estimate: w = arg min_w Σ_{n=1}^N L(w; xn, yn) + Ω(w)

The new classic is the ℓ1 norm: Ω(w) = λ ‖w‖₁ = λ Σ_{i=1}^D |wi|.

Corresponds to a zero-mean Laplacian prior p(wi) ∝ exp(−λ|wi|)

Best known as the least absolute shrinkage and selection operator (Lasso) (Tibshirani, 1996).

Used earlier in signal processing (Claerbout and Muir, 1973; Taylor et al., 1979), neural networks (Williams, 1995), ...
In NLP: Kazama and Tsujii (2003); Goodman (2004).
Pros: encourages sparsity: embedded feature selection.
Cons: convex, but non-smooth: challenging optimization.
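One way to see the embedded selection: for an orthonormal design, each lasso coefficient is the least-squares coefficient passed through the soft-thresholding operator (a standard fact, sketched here rather than taken from the slides), which sets small coefficients exactly to zero:

```python
def soft_threshold(z, lam):
    """Soft thresholding, the proximal operator of lam * |.|: shrinks z
    toward zero by lam and zeroes it out entirely when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

assert soft_threshold(3.0, 1.0) == 2.0
assert soft_threshold(-0.5, 1.0) == 0.0  # exact zero: the feature is dropped
```

This operator reappears as the building block of the proximal-gradient methods in the Algorithms section.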
Classical Regularizers: Lasso
Regularized parameter estimate: w = arg minw
N∑n=1
L(w; xn, yn) + Ω(w)
The new classic is the `1 norm: Ω(w) = λ‖w‖1 = λ
D∑i=1
|wi |.
Corresponds to zero-mean Laplacian prior p(wi ) ∝ exp (−λ|wi |)Best known as: least absolute shrinkage and selection operator(Lasso) (Tibshirani, 1996).
Used earlier in signal processing (Claerbout and Muir, 1973; Tayloret al., 1979) neural networks (Williams, 1995),...
In NLP: Kazama and Tsujii (2003); Goodman (2004).
Pros: encourages sparsity: embedded feature selection.
Cons: convex, but non-smooth: challenging optimization.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 31 / 155
The Lasso and Sparsity

Why does the Lasso yield sparsity? The simplest case:

w = arg min_w (1/2)(w − y)^2 + λ|w| = soft(y, λ) =
    y − λ   if y > λ
    0       if |y| ≤ λ
    y + λ   if y < −λ

Contrast with the squared ℓ2 (ridge) regularizer, which only scales linearly:

w = arg min_w (1/2)(w − y)^2 + (λ/2)w^2 = y / (1 + λ)

Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 32 / 155
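The two shrinkage rules above are easy to compare numerically. A minimal NumPy sketch (the function names `soft` and `ridge` are illustrative, not from the tutorial):

```python
import numpy as np

def soft(y, lam):
    """Soft threshold: the closed-form lasso solution for the 1-D case."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def ridge(y, lam):
    """Closed-form ridge solution: only scales, never zeroes."""
    return y / (1.0 + lam)

y = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
print(soft(y, 0.5))   # entries with |y| <= 0.5 are set exactly to zero
print(ridge(y, 0.5))  # every entry shrinks by 1/1.5, none becomes zero
```

The qualitative difference is exactly the slide's point: the kink of |w| at zero lets the soft threshold produce exact zeros, while the smooth ridge penalty never does.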
The Lasso and Sparsity (II)
Why does the Lasso yield sparsity?
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 33 / 155
Relationship Between ℓ1 and ℓ0

The ℓ0 "norm" (number of non-zeros): ‖w‖_0 = |{i : w_i ≠ 0}|. Not convex, but...

w = arg min_w (1/2)(w − y)^2 + λ‖w‖_0 = hard(y, √(2λ)) =
    y   if |y| > √(2λ)
    0   if |y| ≤ √(2λ)

The "ideal" feature selection criterion (best subset):

w = arg min_w ∑_{n=1}^N L(w; x_n, y_n)
    subject to ‖w‖_0 ≤ τ   (limit the number of features)

Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 34 / 155
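The hard-threshold rule can be sketched the same way; unlike the soft threshold, it leaves surviving coefficients completely unshrunk (a minimal sketch, names illustrative):

```python
import numpy as np

def hard(y, t):
    """Hard threshold: closed-form solution of the l0-penalized
    scalar problem, with t = sqrt(2 * lambda)."""
    return np.where(np.abs(y) > t, y, 0.0)

lam = 0.5
t = np.sqrt(2 * lam)            # threshold sqrt(2*lambda) = 1.0
y = np.array([-2.0, -0.8, 0.9, 1.5])
print(hard(y, t))               # survivors keep their full magnitude
```

Comparing with `soft` from the previous slide: both zero out small inputs, but `hard` is discontinuous at the threshold, which mirrors the non-convexity of ℓ0.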
Relationship Between ℓ1 and ℓ0 (II)

The best subset selection problem is NP-hard (Amaldi and Kann, 1998; Davis et al., 1997):

w = arg min_w ∑_{n=1}^N L(w; x_n, y_n)
    subject to ‖w‖_0 ≤ τ

A closely related problem, also NP-hard (Muthukrishnan, 2005):

w = arg min_w ‖w‖_0
    subject to ∑_{n=1}^N L(w; x_n, y_n) ≤ δ

In some cases, one may replace ℓ0 with ℓ1 and obtain "similar" results: this is the central issue in compressive sensing (CS) (Candes et al., 2006a; Donoho, 2006).

Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 35 / 155
Compressive Sensing in One Slide

Setup: linear observations y = Aw (+ noise), with far fewer observations than unknowns.

Even in the noiseless case, it seems impossible to recover w from y...

...unless w is sparse and A has some properties.

If w is sparse enough and A has certain properties, then w is stably recovered via (Haupt and Nowak, 2006):

w = arg min_w ‖w‖_0   subject to ‖Aw − y‖ ≤ δ     (NP-hard!)

Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 36 / 155
...OK, in Two Slides

Under some conditions on A (e.g., the restricted isometry property (RIP)), ℓ0 can be replaced with ℓ1 (Candes et al., 2006b):

w = arg min_w ‖w‖_1   subject to ‖Aw − y‖ ≤ δ     (a convex problem)

Matrix A satisfies the RIP of order k, with constant δ_k ∈ (0, 1), if

‖w‖_0 ≤ k  ⇒  (1 − δ_k)‖w‖_2^2 ≤ ‖Aw‖_2^2 ≤ (1 + δ_k)‖w‖_2^2

...i.e., for k-sparse vectors, A is approximately an isometry.

Other properties (spark and the null space property (NSP)) can also be used; checking RIP, NSP, or spark is itself NP-hard (Tillmann and Pfetsch, 2012).

Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 37 / 155
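The ℓ0→ℓ1 replacement can be tried end-to-end on a toy noiseless problem: min ‖w‖_1 subject to Aw = y becomes a linear program after splitting w = u − v with u, v ≥ 0. A sketch assuming SciPy; the sizes, seed, and support are arbitrary choices:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, D = 20, 40                        # 20 measurements, 40 unknowns
A = rng.standard_normal((m, D))      # random Gaussian sensing matrix
w_true = np.zeros(D)
w_true[[3, 17, 31]] = [1.0, -2.0, 0.5]   # a 3-sparse signal
y = A @ w_true

# min 1'u + 1'v  subject to  A(u - v) = y,  u >= 0, v >= 0
res = linprog(c=np.ones(2 * D),
              A_eq=np.hstack([A, -A]), b_eq=y,
              bounds=(0, None), method="highs")
w_hat = res.x[:D] - res.x[D:]

# Guaranteed by LP optimality (independently of whether exact
# recovery happens): the solution is feasible, and its l1 norm is
# no larger than that of the feasible point w_true.
assert np.allclose(A @ w_hat, y, atol=1e-6)
assert np.abs(w_hat).sum() <= np.abs(w_true).sum() + 1e-6
```

With these dimensions a Gaussian A typically satisfies the kind of conditions described above, so `w_hat` usually coincides with `w_true`; the two assertions, however, hold regardless of whether recovery succeeds.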
Relevant Theory?

Are sparsity-related compressed sensing (CS) results relevant for NLP?

Nearly all CS results assume linear observations y = Aw + noise; recent exceptions: Blumensath (2012); Plan and Vershynin (2012).

"Good" matrices with RIP or NSP are randomly constructed.

What is missing: results for the (multinomial) logistic loss, not based on RIP or NSP.

Other types of results for ℓ1-regularized logistic regression:

PAC-Bayesian bounds (generalization improves with sparsity): Krishnapuram et al. (2005)

Oracle (van de Geer, 2008) and consistency (Negahban et al., 2012) results.

Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 38 / 155
Take-Home Messages

Sparsity is desirable for interpretability, computational savings, and generalization

ℓ1 regularization gives an embedded method for feature selection

Another view of ℓ1: a convex surrogate for direct penalization of cardinality (ℓ0)

Under some conditions, ℓ1 guarantees exact support recovery

However: the currently known sufficient conditions are too strong and are not met in typical NLP problems

Yet: a lot of theory is still missing

There are compelling algorithmic reasons for using convex surrogates like ℓ1

Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 39 / 155
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Convex Analysis
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 40 / 155
Models
ℓ1 regularization promotes sparse models
A very simple sparsity pattern: prefer models with small cardinality
Our main question: how can we promote less trivial sparsity patterns?
We’ll talk about structured sparsity and group-Lasso regularization.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 41 / 155
Structured Sparsity and Groups
Main goal: promote structural patterns, not just penalize cardinality
Group sparsity: discard entire groups of features
density inside each group
sparsity with respect to which groups are selected
choice of groups: prior knowledge about the intended sparsity patterns
Leads to statistical gains if the prior assumptions are correct (Stojnicet al., 2009)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 42 / 155
Tons of Uses
feature template selection (Martins et al., 2011b)
multi-task learning (Caruana, 1997; Obozinski et al., 2010)
multiple kernel learning (Lanckriet et al., 2004)
learning the structure of graphical models (Schmidt and Murphy,2010)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 43 / 155
“Grid” Sparsity
For feature spaces that can be arranged as a grid (examples next)
Goal: push entire columns to have zero weights
The groups are the columns of the grid
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 44 / 155
Example 1: Sparsity with Multiple Classes

Assume the feature map decomposes as f(x, y) = f(x) ⊗ e_y

In words: we're conjoining each input feature with each output class

[Figure: a grid with input features along one axis and labels along the other]

"Standard" sparsity is wasteful: we still need to hash all the input features

What we want: discard some input features, along with each class they conjoin with

Solution: one group per input feature

Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 45 / 155
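The decomposition f(x, y) = f(x) ⊗ e_y can be pictured directly: with a one-hot e_y, the conjoined feature vector is an outer product, and a group is one row of the resulting grid. A small sketch (the example arrays are made up):

```python
import numpy as np

f_x = np.array([1.0, 0.0, 2.5, 3.0])   # input features f(x)
num_labels = 3
y = 1
e_y = np.eye(num_labels)[y]            # one-hot indicator of the class

F = np.outer(f_x, e_y)                 # grid: input features x labels
# Group m = row m of F: all the copies of input feature m, one per
# class. Zeroing a whole group discards that input feature entirely.
groups = [F[m] for m in range(len(f_x))]
print(F)
```

Only the column for class y is populated for a single example; across the training set every column gets filled, which is why group (row-wise) sparsity, not entry-wise sparsity, is what lets input features be dropped.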
Example 2: Multi-Task Learning (Caruana, 1997; Obozinski et al., 2010)

Same thing, except now rows are tasks and columns are shared features

[Figure: a grid with tasks along one axis and shared features along the other]

What we want: discard features that are irrelevant for all tasks

Solution: one group per feature

Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 46 / 155
Example 3: Multiple Kernel Learning (Lanckriet et al., 2004)

Same thing, except now columns are kernel functions {K_m}_{m=1}^M

[Figure: a grid with implicit features along one axis and kernels along the other]

Goal: a new kernel which is a sparse combination of the given kernels:

K((x, y), (x′, y′)) = ∑_{m=1}^M α_m K_m((x, y), (x′, y′)),   with α sparse

Solution: make each group be a kernel K_m

Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 47 / 155
Group Sparsity

D features

M groups G_1, ..., G_M, each G_m ⊆ {1, ..., D}

parameter subvectors w_1, ..., w_M

Group-Lasso (Bakin, 1999; Yuan and Lin, 2006):

Ω(w) = ∑_{m=1}^M λ_m ‖w_m‖_2

Intuitively: the ℓ1 norm of the ℓ2 norms

Technically, still a norm (called a mixed norm, denoted ℓ2,1)

λ_m: prior weight for group G_m (different groups have different sizes)

Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 48 / 155
Regularization Formulations (reminder)

Tikhonov regularization:

w = arg min_w Ω(w) + ∑_{n=1}^N L(w; x_n, y_n)

Ivanov regularization:

w = arg min_w ∑_{n=1}^N L(w; x_n, y_n)   subject to Ω(w) ≤ τ

Morozov regularization:

w = arg min_w Ω(w)   subject to ∑_{n=1}^N L(w; x_n, y_n) ≤ δ

All three are equivalent, under mild conditions (namely convexity).

Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 49 / 155
Lasso versus group-Lasso
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 50 / 155
Other names, other norms

Statisticians call these composite absolute penalties (Zhao et al., 2009)

In general: the (weighted) ℓr norm of the ℓq norms (r ≥ 1, q ≥ 1), called the mixed ℓq,r norm:

Ω(w) = (∑_{m=1}^M λ_m ‖w_m‖_q^r)^{1/r}

Group sparsity corresponds to r = 1

This talk: q = 2

However, q = ∞ is also popular (Quattoni et al., 2009; Graca et al., 2009; Wright et al., 2009; Eisenstein et al., 2011)

Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 51 / 155
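The general mixed norm is a one-liner; as a sanity check, r = q = 2 with unit weights over disjoint groups covering all coordinates collapses to the plain ℓ2 norm, while r = 1, q = 2 gives the group-Lasso penalty. A sketch with illustrative names:

```python
import numpy as np

def mixed_norm(w, groups, q, r, lam=None):
    """(sum_m lam_m * ||w_m||_q^r)^(1/r), with unit weights by default."""
    lam = [1.0] * len(groups) if lam is None else lam
    return sum(l * np.linalg.norm(w[g], ord=q) ** r
               for g, l in zip(groups, lam)) ** (1.0 / r)

w = np.array([3.0, 4.0, 5.0, 12.0])
groups = [np.array([0, 1]), np.array([2, 3])]
print(mixed_norm(w, groups, q=2, r=1))   # group norms 5 and 13 -> 18
print(mixed_norm(w, groups, q=2, r=2))   # sqrt(25 + 169) = ||w||_2
```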
Three Scenarios
Non-overlapping Groups
Tree-structured Groups
Graph-structured Groups
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 52 / 155
Non-overlapping Groups

Assume G_1, ..., G_M are disjoint ⇒ each feature belongs to exactly one group

Ω(w) = ∑_{m=1}^M λ_m ‖w_m‖_2

Trivial choices of groups recover unstructured regularizers:

ℓ2 regularization: one large group G_1 = {1, ..., D}

ℓ1 regularization: D singleton groups G_d = {d}

Examples of non-trivial groups:

label-based groups (groups are columns of a matrix)

template-based groups (next)

Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 53 / 155
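The two trivial cases are easy to verify numerically with the unweighted penalty Ω(w) = ∑_m ‖w_m‖_2 (a sketch; the helper name is made up):

```python
import numpy as np

def group_penalty(w, groups):
    """Unweighted group-Lasso penalty: sum of groupwise l2 norms."""
    return sum(np.linalg.norm(w[g]) for g in groups)

w = np.array([3.0, -4.0, 12.0])
one_group = [np.arange(len(w))]                       # G_1 = {1, ..., D}
singletons = [np.array([d]) for d in range(len(w))]   # G_d = {d}

print(group_penalty(w, one_group))    # ||w||_2 = 13.0
print(group_penalty(w, singletons))   # ||w||_1 = 19.0
```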
Example: Feature Template Selection

Input:   We    want  to    explore  the   feature  space
POS:     PRP   VBP   TO    VB       DT    NN       NN
Output:  B-NP  B-VP  I-VP  I-VP     B-NP  I-NP     I-NP

Goal: Select relevant feature templates (the slide's examples: word bigrams such as "explore the" and "the feature", and POS trigrams such as "VB DT NN" and "DT NN NN")

⇒ Make each group correspond to a feature template

Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 54 / 155
Example: Feature Template Selection
5 5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 54 / 155
Example: Feature Template Selection
5 5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 54 / 155
Example: Feature Template Selection
5
5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 54 / 155
Example: Feature Template Selection
5
5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 54 / 155
Example: Feature Template Selection
5
5Input: We want to explore the feature space
PRP VBP TO VB DT NN NNOutput: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
"the feature"
"explore the"
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 54 / 155
Example: Feature Template Selection
5 5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
"the feature"
"explore the"
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 54 / 155
Example: Feature Template Selection
5
5
Input: We want to explore the feature spacePRP VBP TO VB DT NN NN
Output: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
"the feature"
"explore the"
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 54 / 155
Example: Feature Template Selection
5
5Input: We want to explore the feature space
PRP VBP TO VB DT NN NNOutput: B-NP B-VP I-VP I-VP B-NP I-NP I-NP
Goal: Select relevant feature templates
⇒ Make each group correspond to a feature template
"the feature"
"explore the"
"DT NN NN"
"VB DT NN"
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 54 / 155
Three Scenarios
Non-overlapping Groups
Tree-structured Groups
Graph-structured Groups
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 55 / 155
Tree-Structured Groups
Assumption: if two groups overlap, one is contained in the other
⇒ hierarchical structure (Kim and Xing, 2010; Mairal et al., 2010)
What is the sparsity pattern?
If a group is discarded, all its descendants are also discarded
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 56 / 155
Plate Notation
Typically used for graphical models, but it also works here for representing the Hasse diagram of tree-structured groups
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 57 / 155
Three Scenarios
Non-overlapping Groups
Tree-structured Groups
Graph-structured Groups
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 58 / 155
Graph-Structured Groups
In general: groups can be represented as a directed acyclic graph
set inclusion induces a partial order on groups (Jenatton et al., 2009)
feature space becomes a poset
sparsity patterns: given by this poset
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 59 / 155
Example: coarse-to-fine regularization
1 Define a partial order between basic feature templates (e.g., p0 ≼ w0)
2 Extend this partial order to all templates by lexicographic closure (e.g., p0p1 ≼ w0w1)
Goal: only include finer features if coarser ones are also in the model
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 60 / 155
Things to Keep in Mind
Structured sparsity cares about the structure of the feature space
Group-Lasso regularization generalizes ℓ1 and it's still convex
Choice of groups: problem dependent, an opportunity to use prior knowledge to favour certain structural patterns
Next: algorithms
We'll see that optimization is easier with non-overlapping or tree-structured groups than with arbitrary overlaps
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 61 / 155
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Convex Analysis
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 62 / 155
Learning the Model
Recall that learning involves solving

    min_w  Ω(w) + (1/N) ∑_{i=1}^N L(w, x_i, y_i)

where Ω(w) is the regularizer and the sum is the total loss.

We'll address two kinds of optimization algorithms:
batch algorithms (attack the complete problem);
online algorithms (use the training examples one at a time)
Before that: we'll review some key concepts of convex analysis
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 63 / 155
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Convex Analysis
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 64 / 155
Key Concepts in Convex Analysis: Convex Sets
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 65 / 155
Key Concepts in Convex Analysis: Convex Functions
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 66 / 155
Key Concepts in Convex Analysis: Minimizers
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 67 / 155
Key Concepts in Convex Analysis: Subgradients
Convexity ⇒ continuity; convexity ⇏ differentiability (e.g., f(w) = ‖w‖1).
Subgradients generalize gradients to (possibly non-differentiable) convex functions:

    v is a subgradient of f at x if f(x′) ≥ f(x) + v⊤(x′ − x)

Subdifferential: ∂f(x) = {v : v is a subgradient of f at x}
If f is differentiable, ∂f(x) = {∇f(x)}
[Figures: the linear lower bound; the non-differentiable case]
Notation: ∇f(x) denotes a subgradient of f at x
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 68 / 155
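A quick numerical illustration of the subgradient inequality for f(w) = ‖w‖1, using sign(x) as one valid subgradient (a sketch; any value in [−1, 1] would do at zero coordinates):

```python
import numpy as np

def l1_subgradient(x):
    """One valid subgradient of f(w) = ||w||_1: sign(x), which picks
    0 at coordinates where x == 0 (any value in [-1, 1] is allowed there)."""
    return np.sign(x)

x = np.array([1.0, -2.0, 0.0])
v = l1_subgradient(x)
f = lambda z: np.abs(z).sum()

# Check the linear lower bound f(x') >= f(x) + v.(x' - x) at random points x'.
rng = np.random.default_rng(0)
ok = all(f(xp) >= f(x) + v @ (xp - x) - 1e-12
         for xp in rng.normal(size=(1000, 3)))
print(ok)  # True: the lower bound holds everywhere
```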
Key Concepts in Convex Analysis: Strong Convexity
Recall the definition of a convex function: ∀λ ∈ [0, 1],

    f(λx + (1−λ)x′) ≤ λf(x) + (1−λ)f(x′)

A β-strongly convex function satisfies a stronger condition: ∀λ ∈ [0, 1],

    f(λx + (1−λ)x′) ≤ λf(x) + (1−λ)f(x′) − (β/2) λ(1−λ) ‖x − x′‖₂²

[Figures: convexity; strong convexity]
Strong convexity ⇒ strict convexity, but not conversely.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 69 / 155
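As a sanity check (a minimal sketch, not from the slides): f(x) = x² is 2-strongly convex, and the inequality above can be verified numerically; for this particular f it in fact holds with equality:

```python
import numpy as np

# f(x) = x**2 is beta-strongly convex with beta = 2; check
# f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
#                         - (beta/2)*lam*(1-lam)*(x - y)**2
f, beta = lambda x: x ** 2, 2.0
rng = np.random.default_rng(0)
holds = True
for _ in range(1000):
    x, y = rng.normal(size=2)
    lam = rng.uniform()
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y) \
          - 0.5 * beta * lam * (1 - lam) * (x - y) ** 2
    holds &= lhs <= rhs + 1e-9
print(holds)  # True (and for f(x) = x**2 the bound is tight: lhs == rhs)
```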
Proximity Operators
Let Ω : R^D → R be a convex function.
The Ω-proximity operator is the following R^D → R^D map:

    w ↦ prox_Ω(w) = arg min_u (1/2)‖u − w‖₂² + Ω(u)

...always well defined, because ‖u − w‖₂² is strictly convex.
Classical examples:
Squared ℓ2 regularization, Ω(w) = (λ/2)‖w‖₂²: scaling operation

    prox_Ω(w) = (1/(1+λ)) w

ℓ1 regularization, Ω(w) = λ‖w‖1: soft-thresholding

    prox_Ω(w) = soft(w, λ)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 70 / 155
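These two classical proximity operators are one-liners in NumPy (a minimal sketch):

```python
import numpy as np

def prox_sq_l2(w, lam):
    """prox of (lam/2)*||.||_2^2: just a scaling, w / (1 + lam)."""
    return w / (1.0 + lam)

def prox_l1(w, lam):
    """prox of lam*||.||_1: elementwise soft-thresholding soft(w, lam)."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([3.0, -0.5, 1.5])
print(prox_l1(w, 1.0))     # entries shrunk by lam; |w_i| <= lam becomes 0
print(prox_sq_l2(w, 1.0))  # every entry halved
```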
Proximity Operators (II)

    prox_Ω(w) = arg min_u (1/2)‖u − w‖₂² + Ω(u)

ℓ2 regularization, Ω(w) = λ‖w‖2: vector soft-thresholding

    prox_Ω(w) = 0,                      if ‖w‖2 ≤ λ
    prox_Ω(w) = (w/‖w‖2)(‖w‖2 − λ),     if ‖w‖2 > λ

Indicator function, Ω(w) = ι_S(w) = 0 if w ∈ S, +∞ if w ∉ S:

    prox_Ω(w) = P_S(w), the Euclidean projection onto S
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 71 / 155
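The vector (group) soft-thresholding case in code (a sketch): the whole vector is shrunk toward the origin, and zeroed out entirely if its norm falls below λ:

```python
import numpy as np

def prox_group_l2(w, lam):
    """prox of lam*||.||_2 (NOT squared): vector soft-thresholding.
    Returns 0 if ||w||_2 <= lam, else rescales w to norm ||w||_2 - lam."""
    norm = np.linalg.norm(w)
    if norm <= lam:
        return np.zeros_like(w)
    return (w / norm) * (norm - lam)

print(prox_group_l2(np.array([3.0, 4.0]), 1.0))  # norm 5 -> norm 4: [2.4 3.2]
print(prox_group_l2(np.array([0.3, 0.4]), 1.0))  # norm 0.5 <= 1 -> [0. 0.]
```

This all-or-nothing behavior per group is exactly what makes the group lasso discard entire groups.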
Proximity Operators (III)
Group regularizers: Ω(w) = ∑_{m=1}^M Ω_m(w_{G_m})
Groups: G_m ⊆ {1, 2, ..., D}; w_{G_m} is a sub-vector of w.
Non-overlapping groups (G_m ∩ G_n = ∅ for m ≠ n): separable prox operator

    [prox_Ω(w)]_{G_m} = prox_{Ω_m}(w_{G_m})

Tree-structured groups (two groups are either non-overlapping or one contains the other): prox_Ω can be computed recursively (Jenatton et al., 2011).
Arbitrary groups:
For Ω_m(w_{G_m}) = ‖w_{G_m}‖2: solved via convex smooth optimization (Yuan et al., 2011).
Sequential proximity steps (Martins et al., 2011a) (more later).
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 72 / 155
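For non-overlapping groups, the separability above means the full prox is just the per-group prox applied block by block (a minimal sketch with two hypothetical groups):

```python
import numpy as np

def prox_l2(w, lam):
    # per-group vector soft-thresholding (prox of lam*||.||_2)
    norm = np.linalg.norm(w)
    return np.zeros_like(w) if norm <= lam else (w / norm) * (norm - lam)

def prox_nonoverlapping(w, groups, lam):
    """Separable prox for non-overlapping groups: [prox(w)]_{G_m} is
    computed independently on each sub-vector w_{G_m}."""
    out = w.copy()
    for idx in groups:
        out[idx] = prox_l2(w[idx], lam)
    return out

w = np.array([3.0, 4.0, 0.3, 0.4])
groups = [np.array([0, 1]), np.array([2, 3])]
print(prox_nonoverlapping(w, groups, 1.0))  # first group shrunk, second zeroed
```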
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Convex Analysis
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 73 / 155
Subgradient Methods

min_w Ω(w) + Λ(w), where Λ(w) = (1/N) ∑_{i=1}^N L(w, x_i, y_i) (loss)

Subgradient methods were invented by Shor in the 1970s (Shor, 1985):

input: stepsize sequence (η_t)_{t=1}^T
initialize w
for t = 1, 2, ... do
    (sub-)gradient step: w ← w − η_t (∇Ω(w) + ∇Λ(w))
end for

Key disadvantages:
The step size η_t needs to be annealed for convergence: very slow!
Doesn't explicitly capture the sparsity promoted by ℓ1 regularizers.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 74 / 155
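A minimal sketch of the subgradient method on an ℓ1-regularized least-squares problem (synthetic data; the 1/√t stepsize is one common annealing choice). Note how the iterates approach a sparse solution but are rarely exactly zero, which is the second drawback above:

```python
import numpy as np

# min_w lam*||w||_1 + (1/2N)*||Xw - y||^2 via the subgradient method.
rng = np.random.default_rng(0)
N, D = 100, 20
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true

lam, w = 0.1, np.zeros(D)
for t in range(1, 2001):
    subgrad = lam * np.sign(w) + X.T @ (X @ w - y) / N  # one subgradient
    w -= (0.5 / np.sqrt(t)) * subgrad                   # annealed step

print(np.round(w, 2))  # near-sparse, but coordinates are not exactly 0
```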
(Block-)Coordinate Descent

min_w Ω(w) + Λ(w), where Λ(w) = (1/N) ∑_{i=1}^N L(w, x_i, y_i) (loss)

Update one (block of) component(s) of w at a time:

    w_i^new ← arg min_{w_i} Ω([w_1, ..., w_i, ..., w_D]) + Λ([w_1, ..., w_i, ..., w_D])

(Genkin et al., 2007; Krishnapuram et al., 2005; Liu et al., 2009; Shevade and Keerthi, 2003; Tseng and Yun, 2009; Yun and Toh, 2011)
Squared error loss: closed-form solution. Other losses (e.g., logistic):
solve numerically, e.g., with Newton steps (Shevade and Keerthi, 2003);
use a local quadratic approximation/bound (Krishnapuram et al., 2005; Tseng and Yun, 2009; Yun and Toh, 2011).
Shown to converge; competitive with the state of the art (Yun and Toh, 2011).
Has been used in NLP: Sokolovska et al. (2010); Lavergne et al. (2010).
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 75 / 155
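For the lasso (squared loss plus ℓ1), each coordinate update has the closed-form soft-thresholding solution mentioned above, which makes coordinate descent very simple (a minimal sketch on synthetic data):

```python
import numpy as np

def soft(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

# Coordinate descent for the lasso, min_w lam*||w||_1 + (1/2N)*||Xw - y||^2.
rng = np.random.default_rng(0)
N, D = 100, 20
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:2] = [1.5, -2.0]
y = X @ w_true

lam = 0.1
w = np.zeros(D)
col_sq = (X ** 2).sum(axis=0) / N   # per-coordinate curvature
r = y - X @ w                       # residual, kept up to date
for sweep in range(50):
    for j in range(D):
        r += X[:, j] * w[j]                          # remove j's contribution
        w[j] = soft(X[:, j] @ r / N, lam) / col_sq[j]  # closed-form update
        r -= X[:, j] * w[j]                          # add it back

print(np.round(w, 2))  # exactly sparse: most coordinates are 0
```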
Projected Gradient
Instead of min_w Ω(w) + Λ(w), tackle min_w Λ(w) subject to Ω(w) ≤ τ.
Building blocks:
loss gradient ∇Λ(w)
Euclidean projection P_S(w), where S = {w : Ω(w) ≤ τ}

    w ← P_S(w − η∇Λ(w))

...maybe using line search to adjust the step length.
Example: for S = {w : ‖w‖1 ≤ τ}, the projection P_S(w) has O(D log D) cost (Duchi et al., 2008).
A viable and competitive alternative, which has been used in machine learning and NLP (Duchi et al., 2008; Quattoni et al., 2009).
Shown later: projected gradient is a particular instance of the more general proximal gradient methods.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 76 / 155
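The O(D log D) sort-based projection onto the ℓ1-ball from Duchi et al. (2008) is short enough to sketch:

```python
import numpy as np

def project_l1_ball(w, tau):
    """Euclidean projection onto {w : ||w||_1 <= tau} via the sort-based
    O(D log D) algorithm of Duchi et al. (2008)."""
    if np.abs(w).sum() <= tau:
        return w.copy()                       # already inside the ball
    u = np.sort(np.abs(w))[::-1]              # sorted magnitudes, descending
    css = np.cumsum(u)
    # largest index rho with u[rho] > (css[rho] - tau) / (rho + 1)
    rho = np.nonzero(u > (css - tau) / np.arange(1, len(w) + 1))[0][-1]
    theta = (css[rho] - tau) / (rho + 1)
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

v = project_l1_ball(np.array([3.0, -1.0, 0.5]), 2.0)
print(v, np.abs(v).sum())  # [2. -0. 0.] 2.0 -- lands on the ball's boundary
```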
From Gradient to Hessian: Newton's Method
Assume F(w) = Ω(w) + Λ(w) is twice-differentiable.
Second-order (quadratic) Taylor expansion around w′:

    F(w) ≈ F(w′) + ∇F(w′)⊤(w − w′) + (1/2)(w − w′)⊤ H(w′) (w − w′)

(∇F(w′): gradient; H(w′): Hessian)
Use the direction that minimizes this quadratic approximation:

    w ← w − α (H(w))⁻¹ ∇F(w)

with stepsize α usually determined by line search.
Drawback: it may be costly (or impossible) to compute and invert the Hessian! O(D³) for a naive approach.
Quasi-Newton methods, namely L-BFGS, approximate the inverse Hessian directly from past gradient information.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 77 / 155
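In one dimension the Newton update is easy to see at work (a toy sketch with α = 1 and no line search, on a smooth strictly convex objective):

```python
import math

# Newton's method on the smooth 1-D objective f(w) = w**2 + exp(w):
# iterate w <- w - f'(w) / f''(w).
fp = lambda w: 2.0 * w + math.exp(w)   # first derivative (gradient)
fpp = lambda w: 2.0 + math.exp(w)      # second derivative (Hessian), > 0

w = 0.0
for _ in range(10):
    w -= fp(w) / fpp(w)

print(w)  # the unique minimizer, where 2w + exp(w) = 0 (about -0.3517)
```

Newton's quadratic local convergence shows here: a handful of iterations drive the gradient to machine precision.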
Orthant-Wise Limited-memory Quasi-Newton
OWL-QN: a clever adaptation of L-BFGS to ℓ1 regularization (Andrew and Gao, 2007; Gao et al., 2007)

input: stepsize sequence (η_t)_{t=1}^T
initialize w = 0
for t = 1, 2, ... do
    compute a particular subgradient g_t := ∇Ω(w) + ∇Λ(w)
    compute inverse Hessian approximation S_t "a la L-BFGS"
    compute descent direction d_t = −S_t g_t
    do line search for α, and update w ← w + α d_t
    clip w if necessary to stay in the same orthant
end for

Pros: provably convergent; updates are sparse due to the clipping.
Cons: not applicable to group regularizers.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 78 / 155
Proximal Gradient
Recall the problem: min_w Ω(w) + Λ(w)
Key assumptions: ∇Λ(w) and proxΩ “easy”.
wt+1 ← proxηt Ω (wt − ηt∇Λ(wt))
Key feature: each step decouples the loss and the regularizer.
Projected gradient is a particular case, for proxΩ = PS.
Can be derived with different tools:
expectation-maximization (EM) (Figueiredo and Nowak, 2003);
majorization-minimization (Daubechies et al., 2004);
forward-backward splitting (Combettes and Wajs, 2006);
separable approximation (Wright et al., 2009).
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 79 / 155
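For concreteness, a minimal sketch of proximal gradient for the `1-regularized squared loss (the lasso), where proxΩ is the soft-thresholding operator; the function names and the fixed step size 1/L are our choices:

```python
import numpy as np

def soft_threshold(v, tau):
    # prox of tau*||.||_1: componentwise shrinkage toward zero
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad(A, y, lam, n_iters=200):
    # proximal gradient for min_w lam*||w||_1 + 0.5*||Aw - y||^2
    eta = 1.0 / np.linalg.norm(A, 2) ** 2   # step 1/L, with L = ||A||_2^2
    w = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ w - y)            # gradient of the smooth loss only
        w = soft_threshold(w - eta * grad, eta * lam)
    return w

w = prox_grad(np.eye(3), np.array([3.0, 0.5, -2.0]), lam=1.0)
print(w)  # exact zeros appear in the iterate
```

Note how the gradient step touches only the loss and the prox step touches only the regularizer, exactly the decoupling described above.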
Majorization-Minimization Derivation
Assume Λ(w) has L-Lipschitz gradient: ‖∇Λ(w) − ∇Λ(w′)‖ ≤ L‖w − w′‖.
Separable 2nd-order approximation of Λ(w) around wt:
Λ(wt) + (w − wt)>∇Λ(wt) + (1/(2ηt))‖w − wt‖² = Q(w, wt) ≥ Λ(w)
if ηt ≤ 1/L, with equality for w = wt.
Consequently, if wt+1 = arg min_w Q(w, wt) + Ω(w),
Λ(wt+1) + Ω(wt+1) ≤ Q(wt+1, wt) + Ω(wt+1) ≤ Q(wt, wt) + Ω(wt) = Λ(wt) + Ω(wt)
Easy to show that
wt+1 = arg min_w Q(w, wt) + Ω(w) = proxηtΩ (wt − ηt∇Λ(wt)).
Thus, with ηt ≤ 1/L: objective monotonically decreases.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 80 / 155
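The majorization property is easy to check numerically. The sketch below (all names ours) builds Q(·, wt) for a squared loss with ηt = 1/L and verifies that Q(w, wt) ≥ Λ(w) at random points, with equality at w = wt:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
y = rng.normal(size=20)

def Lam(w):                                  # smooth loss 0.5*||Aw - y||^2
    return 0.5 * np.sum((A @ w - y) ** 2)

def grad(w):
    return A.T @ (A @ w - y)

eta = 1.0 / np.linalg.norm(A, 2) ** 2        # eta = 1/L, L = Lipschitz const.

def Q(w, wt):
    # separable quadratic majorizer of Lam around wt
    d = w - wt
    return Lam(wt) + d @ grad(wt) + (d @ d) / (2.0 * eta)

wt = rng.normal(size=5)
trial = [rng.normal(size=5) for _ in range(100)]
assert all(Q(w, wt) >= Lam(w) - 1e-9 for w in trial)   # Q majorizes Lam
assert abs(Q(wt, wt) - Lam(wt)) < 1e-9                 # equality at wt
```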
Monotonicity and Convergence
Proximal gradient, a.k.a., iterative shrinkage thresholding (IST):
wt+1 ← proxηt Ω (wt − ηt∇Λ(wt)) .
Monotonicity: if ηt ≤ 1/L, then Λ(wt+1) + Ω(wt+1) ≤ Λ(wt) + Ω(wt).
Convergence of objective value (Beck and Teboulle, 2009):
(Λ(wt) + Ω(wt)) − (Λ(w∗) + Ω(w∗)) = O(1/t)
Important: monotonicity doesn't imply convergence of the iterates w1, w2, . . . , wt, . . .
Convergence (even with inexact steps) proved for ηt ≤ 2/L (Combettes and Wajs, 2006).
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 81 / 155
Accelerating IST: SpaRSA
The step sizes ηt ≤ 2/L (guarantees convergence) and ηt ≤ 1/L (monotonicity) are too conservative.
A bolder choice: let ηt mimic a Newton step (Barzilai and Borwein, 1988),
(1/ηt) I ∼ H(wt) (Hessian)
Approximation in the mean-squared sense over the previous step:
1/ηt = arg min_α ‖α(wt − wt−1) − (∇Λ(wt) − ∇Λ(wt−1))‖²
Resulting algorithm: SpaRSA (sparse reconstruction by separable approximation); shown to converge (with a safeguard) and to be fast (Wright et al., 2009).
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 82 / 155
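The Barzilai-Borwein fit above has a closed form; a minimal sketch (function name ours):

```python
import numpy as np

def bb_stepsize(w_t, w_prev, g_t, g_prev):
    # 1/eta = arg min_alpha ||alpha*(w_t - w_prev) - (g_t - g_prev)||^2,
    # i.e. a scalar least-squares fit of the Hessian over the last step.
    dw, dg = w_t - w_prev, g_t - g_prev
    alpha = (dw @ dg) / (dw @ dw)   # closed-form minimizer
    return 1.0 / alpha              # the resulting stepsize eta_t

# For a quadratic with Hessian 4*I (gradient g(w) = 4w) the fit is exact,
# so the stepsize recovers the Newton step 1/4:
eta = bb_stepsize(np.array([1.0, 2.0]), np.array([0.0, 1.0]),
                  np.array([4.0, 8.0]), np.array([0.0, 4.0]))
print(eta)
```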
Accelerating IST: FISTA
Idea: compute wt+1 based not only on wt, but also on wt−1.
Fast IST algorithm (FISTA) (Beck and Teboulle, 2009):
bt+1 = (1 + √(1 + 4bt²)) / 2
z = wt + ((bt − 1)/bt+1) (wt − wt−1)
wt+1 = proxηΩ (z − η∇Λ(z))
Convergence of objective value (Beck and Teboulle, 2009):
(Λ(wt) + Ω(wt)) − (Λ(w∗) + Ω(w∗)) = O(1/t²) (vs O(1/t) for IST)
Convergence of iterates has not been shown.
Another two-step method: TwIST (two-step IST) (Bioucas-Dias and Figueiredo, 2007).
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 83 / 155
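A minimal FISTA sketch for the lasso objective used earlier (names and the fixed step 1/L are ours); the only change from plain IST is the momentum extrapolation point z:

```python
import numpy as np

def soft_threshold(v, tau):
    # prox of tau*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista(A, y, lam, n_iters=200):
    # FISTA for min_w lam*||w||_1 + 0.5*||Aw - y||^2, fixed step 1/L
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    w = w_prev = np.zeros(A.shape[1])
    b = 1.0
    for _ in range(n_iters):
        b_next = (1.0 + np.sqrt(1.0 + 4.0 * b * b)) / 2.0
        z = w + ((b - 1.0) / b_next) * (w - w_prev)   # momentum extrapolation
        w_prev, b = w, b_next
        w = soft_threshold(z - eta * (A.T @ (A @ z - y)), eta * lam)
    return w

w = fista(np.eye(3), np.array([3.0, 0.5, -2.0]), lam=1.0)
print(w)
```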
Least Angle Regression (LARS)
LARS only applies to w(λ) = arg min_w λ‖w‖1 + ‖Aw − y‖²
Key ideas (Efron et al., 2004; Osborne et al., 2000)
“regularization path” w(λ) is piecewise linear (Markowitz, 1952);
the cusps can be identified in closed form;
simply jump from one cusp to the next.
Cons: doesn’t apply to group regularizers; exponential worst-case complexity (Mairal and Yu, 2012).
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 84 / 155
Homotopy/Continuation Methods
LARS is related to a more general family: homotopy/continuation methods.
Consider w(λ) = arg min_w λΩ(w) + Λ(w)
Key ideas
start with high value of λ, such that w(λ) is easy (e.g., zero);
slowly decrease λ while “tracking” the solution;
“tracking” means: use the previous w(λ) to “warm start” the solverfor the next problem.
It’s a meta-algorithm of general applicability when using “warm-startable” solvers (Figueiredo et al., 2007; Hale et al., 2008; Osborne et al., 2000).
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 85 / 155
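A minimal continuation sketch for the lasso, using proximal gradient as the warm-startable inner solver (names and the geometric λ schedule are our choices):

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad(A, y, lam, w0, n_iters=100):
    # warm-startable inner solver: proximal gradient starting from w0
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    w = w0.copy()
    for _ in range(n_iters):
        w = soft_threshold(w - eta * (A.T @ (A @ w - y)), eta * lam)
    return w

def continuation(A, y, lam_target, n_stages=5):
    # start where w(lam) = 0 is optimal, then geometrically decrease lam,
    # warm-starting each stage from the previous stage's solution
    lam_max = np.max(np.abs(A.T @ y))   # smallest lam for which w(lam) = 0
    w = np.zeros(A.shape[1])
    for lam in np.geomspace(lam_max, lam_target, n_stages):
        w = prox_grad(A, y, lam, w)
    return w

w = continuation(np.eye(3), np.array([3.0, 0.5, -2.0]), lam_target=1.0)
print(w)
```

Each inner solve starts close to its solution, which is what makes the schedule cheap in practice.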
Some Stuff We Didn’t Talk About
shooting method (Fu, 1998);
grafting (Perkins et al., 2003) and grafting-light (Zhu et al., 2010);
forward stagewise regression (Hastie et al., 2007);
alternating direction method of multipliers (ADMM) (Figueiredo and Bioucas-Dias, 2011).
Next: We’ll talk about online algorithms.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 86 / 155
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Convex Analysis
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 87 / 155
Why Online?
1 Suitable for large datasets
2 Suitable for structured prediction
3 Faster to approach a near-optimal region
4 Slower convergence, but this is fine in machine learning
cf. “the tradeoffs of large scale learning” (Bottou and Bousquet, 2007)
What we will say can be straightforwardly extended to the mini-batch case.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 88 / 155
Plain Stochastic (Sub-)Gradient Descent
min_w Ω(w) + (1/N) ∑_{i=1}^N L(w; xi, yi)   (regularizer + empirical loss)
input: stepsize sequence (ηt), t = 1, . . . , T
initialize w = 0
for t = 1, 2, . . . do
  take training pair (xt, yt)
  (sub-)gradient step: w ← w − ηt (∇Ω(w) + ∇L(w; xt, yt))
end for
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 89 / 155
What’s the Problem with SGD?
(Sub-)gradient step: w ← w − ηt (∇Ω(w) + ∇L(w; xt, yt))
`2-regularization: Ω(w) = (λ/2)‖w‖²₂ =⇒ ∇Ω(w) = λw
w ← (1 − ηtλ)w − ηt∇L(w; xt, yt)   (first term: scaling)
`1-regularization: Ω(w) = λ‖w‖1 =⇒ ∇Ω(w) = λ sign(w)
w ← w − ηtλ sign(w) − ηt∇L(w; xt, yt)   (middle term: constant penalty)
Problem: iterates are never sparse!
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 90 / 155
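The effect is easy to see on a one-dimensional toy problem whose exact minimizer is w = 0, say min_w 0.1|w| + ½(w − 0.01)² (constants ours). The subgradient iterate only gets nudged by the constant ηλ each step, so it oscillates around zero without ever landing on it; a proximal (soft-thresholding) step, of the kind the following algorithms use, reaches exactly zero:

```python
import numpy as np

lam, eta, target = 0.1, 0.05, 0.01
# 1-D objective: lam*|w| + 0.5*(w - target)^2 ; since lam > target,
# its exact minimizer is w = 0.
w_sub = w_prox = 1.0
for _ in range(200):
    # subgradient step: the l1 term contributes a constant-size nudge
    w_sub = w_sub - eta * (lam * np.sign(w_sub) + (w_sub - target))
    # proximal step: gradient step on the smooth part, then soft-threshold
    v = w_prox - eta * (w_prox - target)
    w_prox = np.sign(v) * max(abs(v) - eta * lam, 0.0)

print(w_sub, w_prox)  # w_sub hovers near zero; w_prox is exactly 0.0
```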
Plain SGD with `2-regularization
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 91 / 155
Plain SGD with `1-regularization
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 92 / 155
“Sparse” Online Algorithms
SGD with Cumulative Penalty (Tsuruoka et al., 2009)
Truncated Gradient (Langford et al., 2009)
Online Forward-Backward Splitting (Duchi and Singer, 2009)
Regularized Dual Averaging (Xiao, 2010)
Online Proximal Gradient (Martins et al., 2011a)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 93 / 155
Cumulative Penalties (Tsuruoka et al., 2009)
an attempt to reconcile SGD and `1 regularization, maintaining algorithmic efficiency
computational trick: accumulates the penalties, and applies them all at once when a feature fires (due to Carpenter (2008))
clipping: if the total penalty is greater than the magnitude of the feature weight wj, clip wj to zero
but store the amount of clipping for future use.
leads to very sparse models
however: no proof of convergence
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 94 / 155
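A minimal sketch of the cumulative-penalty update (class and variable names ours; per-feature stepsizes and other refinements of Tsuruoka et al. (2009) are omitted):

```python
import numpy as np

class CumulativeL1SGD:
    # Sketch of the cumulative-penalty trick: u tracks the total l1
    # penalty every weight *could* have received so far; q[j] tracks the
    # penalty actually applied to w[j]; the difference is applied lazily,
    # clipping at zero, whenever feature j fires.
    def __init__(self, dim, lam, eta):
        self.w = np.zeros(dim)
        self.q = np.zeros(dim)
        self.u = 0.0
        self.lam, self.eta = lam, eta

    def update(self, grad, active):
        self.u += self.eta * self.lam
        for j in active:                     # only features that fire
            z = self.w[j] - self.eta * grad[j]
            if z > 0.0:
                self.w[j] = max(0.0, z - (self.u + self.q[j]))
            else:
                self.w[j] = min(0.0, z + (self.u - self.q[j]))
            self.q[j] += self.w[j] - z       # record what was applied

opt = CumulativeL1SGD(dim=1, lam=1.0, eta=0.1)
opt.update(np.array([-5.0]), active=[0])   # weight moves to 0.5, pays 0.1
opt.update(np.array([0.0]), active=[])     # feature silent; penalty accrues
opt.update(np.array([0.0]), active=[0])    # lazy catch-up pays the missed 0.2
print(opt.w)
```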
Cumulative Penalties (Tsuruoka et al., 2009)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 95 / 155
Truncated Gradient (Langford et al., 2009)
input: laziness coefficient K, threshold τ, stepsize sequence (ηt), t = 1, . . . , T
initialize w = 0
for t = 1, 2, . . . do
  take training pair (xt, yt)
  (sub-)gradient step: w ← w − ηt∇L(w; xt, yt)
  if t/K is integer then
    truncation step: w ← sign(w) max(|w| − ηtKτ, 0)   (soft-thresholding)
  end if
end for
take gradients only with respect to the loss
every K rounds: a “lazy” soft-thresholding step
Langford et al. (2009) also suggest other forms of truncation
converges to ε-accurate objective after O(1/ε²) iterations
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 97 / 155
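The pseudocode above can be sketched directly (a minimal version with a squared loss as the example; names ours):

```python
import numpy as np

def truncated_gradient(data, dim, eta, tau, K):
    # Plain SGD steps on the loss, plus a soft-thresholding ("truncation")
    # step every K rounds with threshold eta*K*tau.
    w = np.zeros(dim)
    for t, (x_t, y_t) in enumerate(data, start=1):
        w -= eta * (w @ x_t - y_t) * x_t     # squared-loss gradient step
        if t % K == 0:                       # lazy truncation
            w = np.sign(w) * np.maximum(np.abs(w) - eta * K * tau, 0.0)
    return w

# Feature 1 never fires, so the truncation keeps it at exactly zero.
data = [(np.array([1.0, 0.0]), 1.0)] * 10
w = truncated_gradient(data, dim=2, eta=0.5, tau=0.01, K=5)
print(w)
```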
Truncated Gradient (Langford et al., 2009)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 98 / 155
Online Forward-Backward Splitting (Duchi and Singer, 2009)
input: stepsize sequences (ηt) and (ρt), t = 1, . . . , T
initialize w = 0
for t = 1, 2, . . . do
  take training pair (xt, yt)
  gradient step: w ← w − ηt∇L(w; xt, yt)
  proximal step: w ← proxρtΩ(w)
end for
generalizes truncated gradient to arbitrary regularizers Ω
can tackle non-overlapping or hierarchical group-Lasso, but arbitrary overlaps are difficult to handle (more later)
practical drawback: without a laziness parameter, iterates are usually not very sparse
converges to ε-accurate objective after O(1/ε²) iterations
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 100 / 155
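A minimal sketch of the forward-backward loop with a non-overlapping group-Lasso prox (squared loss and all names are our choices; singleton groups reduce it to `1 soft-thresholding):

```python
import numpy as np

def group_prox(w, groups, tau):
    # prox of tau * sum_g ||w_g||_2 over non-overlapping groups:
    # shrink each group's norm by tau, zeroing the whole group if needed
    w = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        if norm > 0.0:
            w[g] *= max(0.0, 1.0 - tau / norm)
    return w

def fobos(data, dim, groups, eta, rho):
    # forward (gradient) step on the loss, backward (proximal) step on Omega
    w = np.zeros(dim)
    for x_t, y_t in data:
        w -= eta * (w @ x_t - y_t) * x_t     # forward step, squared loss
        w = group_prox(w, groups, rho)       # backward step
    return w

data = [(np.array([1.0, 0.0]), 1.0)] * 5
w = fobos(data, dim=2, groups=[[0], [1]], eta=0.5, rho=0.1)
print(w)  # the silent feature is exactly zero
```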
Regularized Dual Averaging (Xiao, 2010)
input: coefficient η0
initialize w = 0, s = 0
for t = 1, 2, . . . do
  take training pair (xt, yt)
  gradient step: s ← s + ∇L(w; xt, yt)
  proximal step: w ← η0√t · proxΩ(−s/t)
end for
based on the dual averaging technique (Nesterov, 2009)
in practice: quite effective at getting sparse iterates (the proximal steps are not vanishing)
O(C1/ε² + C2/√ε) convergence, where C1 is a Lipschitz constant and C2 is the variance of the stochastic gradients
drawback: requires storing two vectors (w and s), and s is not sparse
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 102 / 155
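A minimal `1-RDA sketch following the pseudocode above (squared loss and all names ours); note that the threshold applied to the averaged gradient is the constant λ, not a vanishing ηtλ, which is why the iterates stay genuinely sparse:

```python
import numpy as np

def rda_l1(data, dim, lam, eta0):
    # keep the (dense) running gradient sum s, soft-threshold the running
    # average by the constant lam, and rescale by eta0*sqrt(t)
    w = np.zeros(dim)
    s = np.zeros(dim)
    for t, (x_t, y_t) in enumerate(data, start=1):
        s += (w @ x_t - y_t) * x_t            # squared-loss gradient
        avg = -s / t
        w = eta0 * np.sqrt(t) * np.sign(avg) * np.maximum(np.abs(avg) - lam, 0.0)
    return w

data = [(np.array([1.0, 0.0]), 1.0)] * 4
w = rda_l1(data, dim=2, lam=0.1, eta0=0.5)
print(w)  # the silent feature is exactly zero
```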
What About Group Sparsity?
Both online forward-backward splitting (Duchi and Singer, 2009) and regularized dual averaging (Xiao, 2010) can handle groups
All that is necessary is to compute proxΩ(w)
easy for non-overlapping and tree-structured groups
But what about general overlapping groups?
Martins et al. (2011a): a prox-grad algorithm that can handle arbitrary overlapping groups

decompose Ω(w) = ∑_{j=1}^J Ω_j(w), where each Ω_j is non-overlapping
then apply the prox_{Ω_j} sequentially
still convergent (Martins et al., 2011a)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 103 / 155
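The decomposition idea can be sketched with the group prox (block soft-thresholding): split the overlapping groups into collections that are internally non-overlapping, and apply each collection's prox in turn. The groups {0,1} and {1,2} and all numbers below are invented for illustration:

```python
import numpy as np

def prox_group(w, groups, tau):
    """Prox of tau * sum_g ||w_g||_2 for NON-overlapping groups:
    block soft-thresholding applied to each group independently."""
    w = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        w[g] *= max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
    return w

# Overlapping groups {0,1} and {1,2}: put each in its own collection
# (Omega_1, Omega_2) and apply the two prox operators sequentially.
w = np.array([3.0, 4.0, 0.1])
w = prox_group(w, [[0, 1]], tau=1.0)   # prox of Omega_1
w = prox_group(w, [[1, 2]], tau=1.0)   # prox of Omega_2
```

A group whose ℓ2 norm falls below tau is zeroed out entirely, which is what makes whole groups of features disappear.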
“Sparse” Online Algorithms
SGD with Cumulative Penalty (Tsuruoka et al., 2009)
Truncated Gradient (Langford et al., 2009)
Online Forward-Backward Splitting (Duchi and Singer, 2009)
Regularized Dual Averaging (Xiao, 2010)
Online Proximal Gradient (Martins et al., 2011a)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 104 / 155
Online Proximal Gradient (Martins et al., 2011a)
input: gravity sequence (σ_t)_{t=1}^T, stepsize sequence (η_t)_{t=1}^T
initialize w = 0
for t = 1, 2, . . . do
  take training pair (x_t, y_t)
  gradient step: w ← w − η_t ∇L(w; x_t, y_t)
  sequential proximal steps:
  for j = 1, 2, . . . , J do
    w ← prox_{η_t σ_t Ω_j}(w)
  end for
end for
PAC Convergence. ε-accurate solution after T ≤ O(1/ε2) rounds
Computational efficiency. Each gradient step is linear in the number of features that fire. Each proximal step is linear in the number of groups M. Both are independent of D.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 105 / 155
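A toy run of the loop above on a synthetic squared-loss stream; the loss, stepsize η, gravity σ, and group decompositions are all invented choices for illustration:

```python
import numpy as np

def prox_group(w, groups, tau):
    """Block soft-thresholding for one non-overlapping group collection."""
    w = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        w[g] *= max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
    return w

def online_prox_grad(pairs, decompositions, eta=0.1, sigma=0.05, dim=4):
    """Gradient step on a squared loss, then one proximal step per
    non-overlapping collection Omega_j (sequential proximal steps)."""
    w = np.zeros(dim)
    for x, y in pairs:
        g = (w @ x - y) * x                  # gradient of 0.5*(w.x - y)^2
        w = w - eta * g
        for groups in decompositions:        # sequential proximal steps
            w = prox_group(w, groups, eta * sigma)
    return w

rng = np.random.default_rng(0)
pairs = [(rng.normal(size=4), 1.0) for _ in range(50)]
w = online_prox_grad(pairs, decompositions=[[[0, 1]], [[1, 2], [3]]])
```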
Implementation Tricks (Martins et al., 2011b)
Budget-driven shrinkage. Instead of a regularization constant, specify a budget on the number of selected groups. Each proximal step sets σ_t to meet this target.
Sparseptron. Let L(w) = w⊤(f(x, ŷ) − f(x, y)) be the perceptron loss, where ŷ is the current prediction. The algorithm becomes a perceptron with shrinkage.
Debiasing. Run a few iterations of the sparseptron to identify the relevant groups. Then run an unregularized learner in a second stage.
Memory efficiency. Only a small active set of features needs to be maintained. Entire groups can be deleted after each proximal step. Many irrelevant features are never instantiated.
[Figure: number of features vs. number of epochs for MIRA and for sparseptron + MIRA (B = 30).]
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 106 / 155
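Budget-driven shrinkage can be sketched as a threshold rule: block soft-thresholding with parameter η_t σ_t zeroes every group whose ℓ2 norm is at most that threshold, so choosing σ_t from the sorted group norms keeps at most B groups. The helper below is a hypothetical illustration, not the authors' code:

```python
import numpy as np

def budget_sigma(group_norms, budget, eta=1.0):
    """Pick sigma so the proximal step (threshold eta*sigma) keeps at most
    `budget` groups: threshold at the (budget+1)-th largest group norm."""
    norms = np.sort(np.asarray(group_norms))[::-1]
    if budget >= len(norms):
        return 0.0                   # budget not binding: no extra shrinkage
    return norms[budget] / eta       # groups at or below this norm are zeroed

norms = [5.0, 3.0, 2.0, 0.5, 0.1]
sigma = budget_sigma(norms, budget=2)         # sigma = 2.0
survivors = [n for n in norms if n > sigma]   # only the top-2 groups survive
```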
Summary of Algorithms
                     Converges?  Rate      Sparse?  Groups?  Overlaps?
Coordinate descent   yes         ?         yes      maybe    no
Prox-grad (IST)      yes         O(1/ε)    yes/no   yes      not easy
OWL-QN               yes         ?         yes/no   no       no
SpaRSA               yes         O(1/ε)    yes/no   yes      not easy
FISTA                yes         O(1/√ε)   yes/no   yes      not easy
ADMM (C-SALSA)       yes         ?         no       yes      yes
Online subgradient   yes         O(1/ε²)   no       yes      no
Cumulative penalty   ?           ?         yes      no       no
Truncated gradient   yes         O(1/ε²)   yes      no       no
FOBOS                yes         O(1/ε²)   sort of  yes      not easy
RDA                  yes         O(1/ε²)   yes      yes      not easy
Online prox-grad     yes         O(1/ε²)   yes      yes      yes
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 107 / 155
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Convex Analysis
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 108 / 155
Applications of Structured Sparsity in NLP
Relatively few to date (but this list may not be exhaustive).
1 Martins et al. (2011b):
Phrase chunking
Named entity recognition
Dependency parsing
2 Semisupervised lexicon expansion (Das and Smith, 2012)
3 Unsupervised tagging (Graca et al., 2009)
4 Sociolinguistic association discovery (Eisenstein et al., 2011)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 109 / 155
Martins et al. (2011b): Group by Template
Feature templates provide a straightforward way to define non-overlapping groups.
To achieve group sparsity, we optimize

min_w  (1/N) ∑_{n=1}^N L(w; x_n, y_n)  +  Ω(w)

(empirical loss plus regularizer), where we use the ℓ2,1 norm

Ω(w) = λ ∑_{m=1}^M d_m ‖w_m‖₂

for M groups/templates.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 110 / 155
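Evaluating the ℓ2,1 regularizer is a one-liner per template group. The two groups, the weights d_m (√(group size) is a common choice in the group-lasso literature, used here only as an example), and w are invented:

```python
import numpy as np

def l21_penalty(w, groups, d, lam=1.0):
    """Omega(w) = lam * sum_m d_m * ||w_m||_2, one group per feature template."""
    return lam * sum(dm * np.linalg.norm(w[g]) for g, dm in zip(groups, d))

w = np.array([3.0, 4.0, 0.0, 0.0, 1.0])   # hypothetical template weights
groups = [[0, 1], [2, 3, 4]]              # two hypothetical templates
d = [np.sqrt(2), np.sqrt(3)]              # example: d_m = sqrt(group size)
penalty = l21_penalty(w, groups, d)       # = sqrt(2)*5 + sqrt(3)*1
```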
Chunking
CoNLL 2000 shared task (Sang and Buchholz, 2000)
Unigram features: 96 feature templates using POS tags, words, and word shapes, with various context sizes
Bigram features: 1 template indicating the label bigram
Baseline: L2-regularized MIRA, 15 epochs, all features, cross-validation to choose regularization strength
Template-based group lasso: 5 epochs of sparseptron + 10 of MIRA
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 111 / 155
Chunking Experiments
              Baseline   Template-based group lasso
# templates   96         10        20        30        40
model size    5,300,396  71,075    158,844   389,065   662,018
F1 (%)        93.10      92.99     93.28     93.59     93.42
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 112 / 155
[Figure: number of features vs. number of epochs for MIRA and for sparseptron + MIRA (B = 30).]

Memory requirement of the sparseptron is < 7.5% of that of the baseline. (Oscillations are due to proximal steps after every 1,000 instances.)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 113 / 155
Applications of Structured Sparsity in NLP
Relatively few to date (but this list may not be exhaustive).
1 Martins et al. (2011b):
Phrase chunking
Named entity recognition
Dependency parsing
2 Semisupervised lexicon expansion (Das and Smith, 2012)
3 Unsupervised tagging (Graca et al., 2009)
4 Sociolinguistic association discovery (Eisenstein et al., 2011)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 114 / 155
Named Entity Recognition
CoNLL 2002/2003 shared tasks (Sang, 2002; Sang and De Meulder, 2003): Spanish, Dutch, and English

Unigram features: 452 feature templates using POS tags, words, word shapes, prefixes, suffixes, and other string features, all with various context sizes

Bigram features: 1 template indicating the label bigram

Baselines:
L2-regularized MIRA, 15 epochs, all features, cross-validation to choose regularization strength
sparseptron with lasso, different values of C

Template-based group lasso: 5 epochs of sparseptron + 10 of MIRA
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 115 / 155
[Bar chart: number of features for Spanish, Dutch, and English models: MIRA, Lasso (0.1), Lasso (0.5), Lasso (1), Group Lasso (100), Group Lasso (200), Group Lasso (300).]
Named entity models: number of features. (Lasso C = 1/λN.)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 116 / 155
[Bar chart: F1 on the test set for Spanish, Dutch, and English models: MIRA, Lasso (0.1), Lasso (0.5), Lasso (1), Group Lasso (100), Group Lasso (200), Group Lasso (300).]
Named entity models: F1 accuracy on the test set. (Lasso C = 1/λN.)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 117 / 155
Applications of Structured Sparsity in NLP
Relatively few to date (but this list may not be exhaustive).
1 Martins et al. (2011b):
Phrase chunking
Named entity recognition
Dependency parsing
2 Semisupervised lexicon expansion (Das and Smith, 2012)
3 Unsupervised tagging (Graca et al., 2009)
4 Sociolinguistic association discovery (Eisenstein et al., 2011)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 118 / 155
Non-projective Dependency Parsing
CoNLL-X shared task (Buchholz and Marsi, 2006): Arabic, Danish, Dutch, Japanese, Slovene, and Spanish

Arc-factored models (McDonald et al., 2005)

684 feature templates by conjoining words, shapes, lemmas, and POS of the head and the modifier, contextual POS, distance, and attachment direction

Baselines:
MIRA with all features
filter-based template selection (information gain)
standard lasso

Our methods: template-based group lasso; coarse-to-fine regularization

Budget sizes: 200, 300, and 400
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 119 / 155
Non-projective Dependency Parsing (c’ed)
[Figure: six panels plotting UAS (%) against number of features for Arabic, Danish, Japanese, Slovene, Spanish, and Turkish, comparing Group-Lasso, Group-Lasso (C2F), Lasso, and filter-based (IG) template selection.]
Template-based group lasso is better at selecting feature templates thanthe IG criterion, and slightly better than coarse-to-fine.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 120 / 155
Which features get selected?
Qualitative analysis of selected templates:
              Arabic  Danish  Japanese  Slovene  Spanish  Turkish
Bilexical     ++      +       +
Lex. → POS    +       +
POS → Lex.    ++      +       +         +        +
POS → POS     ++      +
Middle POS    ++      ++      ++        ++       ++       ++
Shape         ++      ++      ++        ++
Direction     +       +       +         +        +
Distance      ++      +       +         +        +        +

(Empty: none or very few templates selected; +: some templates selected; ++: most or all templates selected.)
Morphologically rich languages with small datasets (Turkish and Slovene) avoid lexical features.

In Japanese, contextual POS appear to be especially relevant.

Take this with a grain of salt: some patterns may be properties of the datasets, not the languages!
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 121 / 155
Applications of Structured Sparsity in NLP
Relatively few to date (but this list may not be exhaustive).
1 Martins et al. (2011b):
Phrase chunking
Named entity recognition
Dependency parsing
2 Semisupervised lexicon expansion (Das and Smith, 2012)
3 Unsupervised tagging (Graca et al., 2009)
4 Sociolinguistic association discovery (Eisenstein et al., 2011)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 122 / 155
Lexicon Expansion (Das and Smith, 2012)
Desired: mapping from words (types) to categories (e.g., POS or semantic predicates)
Allow ambiguity, but not too much!
Given some words’ mappings and a large corpus
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 123 / 155
Lexicon Expansion (Das and Smith, 2012)
Approach:
1 Calculate distributional vectors for words
2 Construct a graph with words as vertices; edges from each word to the k most similar distributional vectors

3 “Link” known words to empirical distributions

4 “Propagate” label distributions throughout the graph (Corduneanu and Jaakkola, 2003; Zhu et al., 2003; Subramanya and Bilmes, 2008, 2009; Talukdar and Crammer, 2009)
Known as graph-based semisupervised learning.
See Noah’s talk about this work on Wednesday!
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 124 / 155
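The propagation step can be caricatured as iterative neighbour averaging with seed vertices pulled back toward their empirical distributions. This is a generic sketch of graph-based SSL, not the objective of any of the cited papers; the 4-vertex chain, mixing weight mu, and iteration count are all invented:

```python
import numpy as np

def propagate(W, seeds, n_iter=50, mu=0.5):
    """W: symmetric adjacency matrix; seeds: vertex -> label distribution.
    Vertices average their neighbours; seeds are pulled back to their labels."""
    n = W.shape[0]
    k = len(next(iter(seeds.values())))
    q = np.full((n, k), 1.0 / k)                 # start uniform
    deg = np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    for _ in range(n_iter):
        q = W @ q / deg                          # smoothness across edges
        for v, dist in seeds.items():            # fit to empirical labels
            q[v] = mu * q[v] + (1 - mu) * np.asarray(dist)
    return q

# chain 0-1-2-3; vertex 0 seeds category 0, vertex 3 seeds category 1
W = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
q = propagate(W, {0: [1.0, 0.0], 3: [0.0, 1.0]})
# vertex 1 ends up leaning toward category 0, vertex 2 toward category 1
```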
Example (Das and Smith, 2011)
Green words are observed in FrameNet data, each with a single frame (category); other words come from a larger, unlabeled corpus.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 125 / 155
Graph-Based SSL
Here we reason about types, not tokens (instances).
Regularized empirical risk minimization doesn't quite describe this setting.

Instead, think of maximum a posteriori inference in a factor graph G = (V, F, E):

p({v}_{v∈V}) = (1/Z) ∏_{f∈F} φ_f({v}_{v∈V : (v,f)∈E})

where V are random variables, F are factors, and E are edges.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 126 / 155
Factor Graph Representation of Graph-Based SSL
[Figure: factor graph over variables X1, X2, X3, X4.]
Shaded variables X1 and X4 take the values of empirical distributions over categories for words 1 and 4. Shaded factors encourage inferred distributions X1 and X4 to be similar to them. Solid white factors encourage smoothness across the graph, and dashed unary factors can be used to encourage sparsity.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 127 / 155
Unary Factors
Here, q_n(y) is an unnormalized distribution over categories given the word associated with the nth vertex.
Three unary factor conditions:

Uniform squared ℓ2:  −λ ∑_n ∑_y (q_n(y) − 1/|Y|)²
Used in past work (Subramanya et al., 2010); with quadratic pairwise penalties and normalized q, generalizes Zhu et al. (2003)

Lasso (ℓ1) for global sparsity:  −λ ∑_n ∑_y |q_n(y)|

Elitist lasso (squared ℓ1,2; Kowalski and Torresani, 2009) for per-vertex sparsity:  −λ ∑_n (∑_y |q_n(y)|)²
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 128 / 155
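The three conditions can be written directly as penalties on a matrix q of per-vertex category scores (rows are vertices, columns are categories). The matrix is invented, and the signs follow the slide: negative terms to be added to an objective being maximized:

```python
import numpy as np

def uniform_sq_l2(q, lam=1.0):
    """-lam * sum_n sum_y (q_n(y) - 1/|Y|)^2 : pushes q toward uniform."""
    return -lam * np.sum((q - 1.0 / q.shape[1]) ** 2)

def lasso_l1(q, lam=1.0):
    """-lam * sum_n sum_y |q_n(y)| : global sparsity."""
    return -lam * np.sum(np.abs(q))

def elitist_l12(q, lam=1.0):
    """-lam * sum_n (sum_y |q_n(y)|)^2 : squared l_{1,2}, per-vertex sparsity."""
    return -lam * np.sum(np.sum(np.abs(q), axis=1) ** 2)

q = np.array([[0.8, 0.1, 0.0],     # a fairly peaked vertex
              [0.2, 0.2, 0.2]])    # a spread-out vertex
penalties = {"uniform": uniform_sq_l2(q),
             "l1": lasso_l1(q),          # -1.5
             "elitist": elitist_l12(q)}  # -(0.9^2 + 0.6^2) = -1.17
```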
Experimental Results: Expanding the FrameNet Lexicon

Vertices: lemmatized, coarse POS-tagged word types

Each q_n(·) is an (unnormalized) distribution over 877 semantic frames

9,263 vertices with labels, 55,217 unlabeled

Accuracy is for unknown predicates, partial match score (SemEval 2007)
                                        accuracy   lexicon size
supervised                              46.62      –
Das and Smith (2011)
  (normalized q_n, squared ℓ2-uniform)  62.35      128,960
squared ℓ2-uniform                      62.81      128,232
ℓ1                                      62.43      128,771
squared ℓ1,2                            65.28      45,554
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 129 / 155
Applications of Structured Sparsity in NLP
Relatively few to date (but this list may not be exhaustive).
1 Martins et al. (2011b):
Phrase chunking
Named entity recognition
Dependency parsing
2 Semisupervised lexicon expansion (Das and Smith, 2012)
3 Unsupervised tagging (Graca et al., 2009)
4 Sociolinguistic association discovery (Eisenstein et al., 2011)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 130 / 155
Unsupervised Tagging (Graca et al., 2009)
Posterior regularization (Ganchev et al., 2010): penalize probabilistic models based on properties of their posterior distributions.

One such property: for each token, the number of labels with nonzero probability.

Related idea: Ravi and Knight (2009) directly minimized the number of tag bigram types allowed by the model (using ILP).
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 131 / 155
Understanding Posterior Regularization
Begin with a generative model p_w(X, Y). We are unsupervised here, so Y is always hidden.

This leads to the log marginal likelihood as loss: L(w; x_n) = −log ∑_{y∈Y} p_w(x_n, y).

For a given x, define a set of distributions Q_{x,ξ} = { q(Y | x) : E_q[f(x, Y)] − b ≤ ξ }.

For a model distribution p_w(X, Y), define a regularizer:

Ω(w, ξ) = σ‖ξ‖_β + ∑_{n=1}^N min_{q∈Q_{x_n,ξ}} KL(q(·) ‖ p_w(· | x_n))
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 132 / 155
Optimization for Posterior Regularization
An iterative EM-like algorithm can be used to locally optimize the regularized loss:

E-step: calculate posteriors p_w(Y | x_n), ∀n. (Hinges on local factorization of p_w.)

Project onto Q_{x_n}: for each n,

∀y, q(y | x_n) ∝ p_w(y | x_n) exp(−λ_n*⊤ f(x_n, y))

where λ_n* is the solution to the dual of the optimization problem inside Ω. (Hinges on local factorization of f.)

M-step: solve a quasi-supervised problem, using q to fill in the distribution over Y:

min_w ∑_{n=1}^N ∑_y −q(y | x_n) log p_w(x_n, y)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 133 / 155
Posterior Sparsity (Graca et al., 2009)
Define indicator features:

f_{i,w,t}(x_n, y) = 1 if x_n contains the ith occurrence of w at position j and y[j] = t; 0 otherwise

b_{i,w,t} = 0

Regularize with the ℓ∞,1 norm (ξ is solved for and substituted):

Ω(w) = σ ∑_{w,t} max_i E_{p_w}[f_{i,w,t}] + ∑_{n=1}^N min_{q∈Q_{x_n}} KL(q(·) ‖ p_w(· | x_n))

(the max_i is the ℓ∞ part; the sum over (w, t) pairs is the ℓ1 part)

The dual form of the optimization problem is solvable with projected gradient descent.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 134 / 155
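The ℓ∞,1 penalty rewards consistent tagging: if every occurrence of a word uses the same tag, the word contributes a single max term, while occurrences split across tags pay once per tag (cf. the figure note below about the norm dropping from ≈ 4 to ≈ 1). The tiny expectation matrices here are invented:

```python
import numpy as np

def linf1(E):
    """l_{inf,1}: rows are (word, tag) pairs, columns are occurrences i;
    take the max over occurrences (l_inf), then sum over rows (l_1)."""
    return float(np.sum(np.max(E, axis=1)))

# one word, two tags (rows), two occurrences (columns)
E_consistent   = np.array([[1.0, 1.0], [0.0, 0.0]])  # both occurrences: tag 1
E_inconsistent = np.array([[1.0, 0.0], [0.0, 1.0]])  # one occurrence per tag
# linf1(E_consistent) = 1.0 < linf1(E_inconsistent) = 2.0
```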
From Ganchev et al. (2010), figure 11. (Left) Initial tag distributions for 20 instances of a word. (Middle) Optimal dual variables λ; each row sums to σ = 20. (Right) q concentrates posteriors for all instances on the NN tag, reducing the ℓ∞,1 norm from ≈ 4 to ≈ 1.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 135 / 155
Unsupervised Tagging Results (Graca et al., 2009)
Average accuracy (standard deviation in parentheses) over 10 different runs (random seeds identical across models) for 200 iterations. Sparse PR constraint strength is given in parentheses.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 136 / 155
Applications of Structured Sparsity in NLP
Relatively few to date (but this list may not be exhaustive).
1 Martins et al. (2011b):
Phrase chunking
Named entity recognition
Dependency parsing
2 Semisupervised lexicon expansion (Das and Smith, 2012)
3 Unsupervised tagging (Graca et al., 2009)
4 Sociolinguistic association discovery (Eisenstein et al., 2011)
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 137 / 155
Sociolinguistic Association Discovery
Dataset:
geotagged tweets from 9,250 authors
a mapping of locations to the U.S. Census' ZIP code tabulation areas (ZCTAs)
a ten-dimensional vector of statistics on demographic attributes

Can we learn a compact set of terms used on Twitter that associate with demographics?
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 138 / 155
Sociolinguistic Association Discovery (Eisensteinet al., 2011)
Setup: multi-output regression.
x_n is a P-dimensional vector of independent variables; the matrix is X ∈ R^{N×P}

y_n is a T-dimensional vector of dependent variables; the matrix is Y ∈ R^{N×T}

w_{p,t} is the regression coefficient for the pth variable in the tth task; the matrix is W ∈ R^{P×T}

Regularized objective with the squared error loss typical for regression:

min_W Ω(W) + ‖Y − XW‖²_F
Regressions are run in both directions.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 139 / 155
Structured Sparsity with `∞,1
Drive entire rows of W to zero (Turlach et al., 2005): “some predictors are useless for any task”

Ω(W) = λ ∑_{p=1}^P max_t |w_{p,t}|

Optimization with blockwise coordinate ascent (Liu et al., 2009) and some tricks to maintain sparsity:

Scale X to achieve variance 1 for each predictor
Precompute C = X⊤Y − N x̄⊤ȳ, where x̄ and ȳ are the mean row vectors of X and Y, respectively
Precompute D = X⊤X − N x̄⊤x̄
More regression tricks in Eisenstein et al. (2011)

Related work: Duh et al. (2010) used multitask regression and ℓ2,1 to select features useful for reranking across many instances (application in machine translation).
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 140 / 155
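A small sketch of the row-structured penalty: with W ∈ R^{P×T}, one ℓ∞ term per predictor row means a word's whole row can be driven to zero, removing it from every task at once. The 3×2 matrix is invented:

```python
import numpy as np

def linf1_rows(W, lam=1.0):
    """lam * sum_p max_t |w_{p,t}|: one l_inf term per predictor (row)."""
    return lam * float(np.sum(np.max(np.abs(W), axis=1)))

W = np.array([[0.5, -1.0],    # word 0: active for both tasks
              [0.0,  0.0],    # word 1: selected out entirely
              [2.0,  0.3]])   # word 2: active
active = np.flatnonzero(np.max(np.abs(W), axis=1) > 0)   # rows 0 and 2
penalty = linf1_rows(W)                                  # 1.0 + 0.0 + 2.0
```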
Predicting Demographics from Text (Eisenstein et al., 2011)

Predict the 10-dimensional ZCTA characterization from words tweeted in that region (vocabulary size P = 5,418)
Measure Pearson's correlation between prediction and correct value (averaged over tasks, cross-validated test sets)
Compare with truncated SVD, greatest variance across authors, and most frequent words
[Figure: average correlation vs. number of features (log scale) for the multi-output lasso, SVD, highest-variance, and most-frequent baselines.]
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 141 / 155
Predictive Words (Eisenstein et al., 2011)
Significant (p < 0.05) positive (+) and negative (−) associations in a 69-feature model (see the paper for the rest).
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 142 / 155
Predicting Text from Demographics (Eisenstein et al., 2011)

Embed the model in a feature-induction outer loop: “screen and clean” (Wu et al., 2010)

Compare language model perplexity of models with no demographic features, raw demographic features (10), and 37 discovered conjunctive features.
Significant reduction compared to both baselines.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 143 / 155
Predictive Demographic Features (Eisenstein et al., 2011)

Selected demographic features and words with high and low log-odds associated with each.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 144 / 155
Outline
1 Introduction
2 Loss Functions and Sparsity
3 Structured Sparsity
4 Algorithms
Convex Analysis
Batch Algorithms
Online Algorithms
5 Applications
6 Conclusions
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 145 / 155
Summary
Sparsity is desirable in NLP: feature selection, runtime, memory footprint, interpretability

Beyond plain sparsity: structured sparsity can be promoted through group-Lasso regularization

Choice of groups reflects prior knowledge about the desired sparsity patterns.

We have seen examples for feature template selection, grid sparsity, and elite discovery, but many more are possible!

Small/medium scale: many batch algorithms available, with fast convergence (IST, FISTA, SpaRSA, ...)

Large scale: online proximal-gradient algorithms suitable to explore large feature spaces
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 146 / 155
Thank you!
Questions?
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 147 / 155
Acknowledgments
National Science Foundation (USA), CAREER grant IIS-1054319
Fundacao para a Ciencia e Tecnologia (Portugal), grant PEst-OE/EEI/LA0008/2011.

Fundacao para a Ciencia e Tecnologia and the Information and Communication Technologies Institute (Portugal/USA), through the CMU-Portugal Program.
Priberam: QREN/POR Lisboa (Portugal), EU/FEDER programme,Discooperio project, contract 2011/18501.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 148 / 155
References I

Amaldi, E. and Kann, V. (1998). On the approximation of minimizing non zero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209:237–260.
Andrew, G. and Gao, J. (2007). Scalable training of L1-regularized log-linear models. In Proc. of ICML. ACM.
Bakin, S. (1999). Adaptive regression and model selection in data mining problems. PhD thesis, Australian National University.
Barzilai, J. and Borwein, J. (1988). Two point step size gradient methods. IMA Journal of Numerical Analysis, 8:141–148.
Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.

Bioucas-Dias, J. and Figueiredo, M. (2007). A new TwIST: two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Transactions on Image Processing, 16:2992–3004.
Blumensath, T. (2012). Compressed sensing with nonlinear observations and related nonlinear optimisation problems. Technical report, arXiv:1205.1650.
Bottou, L. and Bousquet, O. (2007). The tradeoffs of large scale learning. NIPS, 20.
Buchholz, S. and Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proc. of CoNLL.
Candes, E., Romberg, J., and Tao, T. (2006a). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52:489–509.

Candes, E., Romberg, J., and Tao, T. (2006b). Stable signal recovery from incomplete and inaccurate measurements. Communications in Pure and Applied Mathematics, 59:1207–1223.
Carpenter, B. (2008). Lazy sparse stochastic gradient descent for regularized multinomial logistic regression. Technical report, Alias-i.
Caruana, R. (1997). Multitask learning. Machine Learning, 28(1):41–75.
Cessie, S. L. and Houwelingen, J. C. V. (1992). Ridge estimators in logistic regression. Journal of the Royal Statistical Society, Series C, 41:191–201.

Chen, S. and Rosenfeld, R. (1999). A Gaussian prior for smoothing maximum entropy models. Technical report, CMU-CS-99-108.
Claerbout, J. and Muir, F. (1973). Robust modelling of erratic data. Geophysics, 38:826–844.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 149 / 155
References II

Combettes, P. and Wajs, V. (2006). Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4:1168–1200.
Corduneanu, A. and Jaakkola, T. (2003). On information regularization. In Proc. of UAI.
Das, D. and Smith, N. A. (2011). Semi-supervised frame-semantic parsing for unknown predicates. In Proc. of ACL.
Das, D. and Smith, N. A. (2012). Graph-based lexicon expansion with sparsity-inducing penalties. In Proceedings of NAACL.
Daubechies, I., Defrise, M., and De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 11:1413–1457.

Davis, G., Mallat, S., and Avellaneda, M. (1997). Greedy adaptive approximation. Journal of Constructive Approximation, 13:57–98.

Della Pietra, S., Della Pietra, V., and Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:380–393.
Donoho, D. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52:1289–1306.
Duchi, J., Shalev-Shwartz, S., Singer, Y., and Chandra, T. (2008). Efficient projections onto the L1-ball for learning in high dimensions. In ICML.

Duchi, J. and Singer, Y. (2009). Efficient online and batch learning using forward backward splitting. JMLR, 10:2873–2908.

Duh, K., Sudoh, K., Tsukada, H., Isozaki, H., and Nagata, M. (2010). n-best reranking by multitask learning. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and Metrics.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32:407–499.
Eisenstein, J., Smith, N. A., and Xing, E. P. (2011). Discovering sociolinguistic associations with structured sparsity. In Proc. ofACL.
Figueiredo, M. and Bioucas-Dias, J. (2011). An alternating direction algorithm for (overlapping) group regularization. In Signal Processing with Adaptive Sparse Structured Representations (SPARS11). Edinburgh, UK.

Figueiredo, M. and Nowak, R. (2003). An EM algorithm for wavelet-based image restoration. IEEE Transactions on Image Processing, 12:986–916.

Figueiredo, M., Nowak, R., and Wright, S. (2007). Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing: Special Issue on Convex Optimization Methods for Signal Processing, 1:586–598.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 150 / 155
References III

Friedman, J., Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. (2004). Discussion of three boosting papers. Annals of Statistics, 32(1):102–107.

Fu, W. (1998). Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, pages 397–416.

Ganchev, K., Graca, J., Gillenwater, J., and Taskar, B. (2010). Posterior regularization for structured latent variable models. JMLR, 11:2001–2049.

Gao, J., Andrew, G., Johnson, M., and Toutanova, K. (2007). A comparative study of parameter estimation methods for statistical natural language processing. In Proc. of ACL.
Genkin, A., Lewis, D., and Madigan, D. (2007). Large-scale Bayesian logistic regression for text categorization. Technometrics,49:291–304.
Goodman, J. (2004). Exponential priors for maximum entropy models. In Proc. of NAACL.
Graca, J., Ganchev, K., Taskar, B., and Pereira, F. (2009). Posterior vs. parameter sparsity in latent variable models. In Advances in Neural Information Processing Systems.

Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182.
Hale, E., Yin, W., and Zhang, Y. (2008). Fixed-point continuation for l1-minimization: Methodology and convergence. SIAMJournal on Optimization, 19:1107–1130.
Hastie, T., Taylor, J., Tibshirani, R., and Walther, G. (2007). Forward stagewise regression and the monotone lasso. ElectronicJournal of Statistics, 1:1–29.
Haupt, J. and Nowak, R. (2006). Signal reconstruction from noisy random projections. IEEE Transactions on Information Theory, 52:4036–4048.

Jenatton, R., Audibert, J.-Y., and Bach, F. (2009). Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523.

Jenatton, R., Mairal, J., Obozinski, G., and Bach, F. (2011). Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297–2334.

Kazama, J. and Tsujii, J. (2003). Evaluation and extension of maximum entropy models with inequality constraints. In Proc. of EMNLP.
Martins, Figueiredo, Smith (IST, CMU) Structured Sparsity in NLP http://tiny.cc/ssnlp 151 / 155
References IV

Kim, S. and Xing, E. (2010). Tree-guided group lasso for multi-task regression with structured sparsity. In Proc. of ICML.
Kowalski, M. and Torresani, B. (2009). Sparsity and persistence: mixed norms provide simple signal models with dependent coefficients. Signal, Image and Video Processing, 3(3):251–264.

Krishnapuram, B., Carin, L., Figueiredo, M., and Hartemink, A. (2005). Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:957–968.

Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., and Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. JMLR, 5:27–72.
Langford, J., Li, L., and Zhang, T. (2009). Sparse online learning via truncated gradient. JMLR, 10:777–801.
Lavergne, T., Cappe, O., and Yvon, F. (2010). Practical very large scale CRFs. In Proc. of ACL.
Liu, H., Palatucci, M., and Zhang, J. (2009). Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 649–656. ACM.

Mairal, J., Jenatton, R., Obozinski, G., and Bach, F. (2010). Network flow algorithms for structured sparsity. In Advances in Neural Information Processing Systems.
Mairal, J. and Yu, B. (2012). Complexity analysis of the lasso regularization path. Technical report, arXiv:1205.0079.
Markowitz, H. (1952). Portfolio selection. Journal of Finance, 7:77–91.
Martins, A. F. T., Figueiredo, M. A. T., Aguiar, P. M. Q., Smith, N. A., and Xing, E. P. (2011a). Online learning of structured predictors with multiple kernels. In Proc. of AISTATS.
Martins, A. F. T., Smith, N. A., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2011b). Structured sparsity in structured prediction. In Proc. of EMNLP.
Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2010). Turbo parsers: Dependency parsing by approximate variational inference. In Proc. of EMNLP.
McDonald, R. T., Pereira, F., Ribarov, K., and Hajic, J. (2005). Non-projective dependency parsing using spanning tree algorithms. In Proc. of HLT-EMNLP.
Muthukrishnan, S. (2005). Data Streams: Algorithms and Applications. Now Publishers, Boston, MA.
Negahban, S., Ravikumar, P., Wainwright, M., and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Technical report, Department of EECS, UC Berkeley.
References V
Nesterov, Y. (2009). Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259.
Obozinski, G., Taskar, B., and Jordan, M. (2010). Joint covariate selection and joint subspace selection for multipleclassification problems. Statistics and Computing, 20(2):231–252.
Osborne, M., Presnell, B., and Turlach, B. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20:389–403.
Perkins, S., Lacker, K., and Theiler, J. (2003). Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356.
Plan, Y. and Vershynin, R. (2012). Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. Technical report, arXiv:1202.1212.
Quattoni, A., Carreras, X., Collins, M., and Darrell, T. (2009). An efficient projection for l1,∞ regularization. In Proc. of ICML.
Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proc. of EMNLP.
Ravi, S. and Knight, K. (2009). Minimized models for unsupervised part-of-speech tagging. In Proc. of ACL.
Sang, E. (2002). Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proc. of CoNLL.
Sang, E. and Buchholz, S. (2000). Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of CoNLL-2000 and LLL-2000.
Sang, E. and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proc. of CoNLL.
Schaefer, R., Roi, L., and Wolfe, R. (1984). A ridge logistic estimator. Communications in Statistical Theory and Methods, 13:99–113.
Schmidt, M. and Murphy, K. (2010). Convex structure learning in log-linear models: Beyond pairwise potentials. In Proc. of AISTATS.
Shevade, S. and Keerthi, S. (2003). A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19:2246–2253.
Shor, N. (1985). Minimization Methods for Non-differentiable Functions. Springer.
Sokolovska, N., Lavergne, T., Cappe, O., and Yvon, F. (2010). Efficient learning of sparse conditional random fields for supervised sequence labelling. IEEE Journal of Selected Topics in Signal Processing, 4(6):953–964.
References VI
Stojnic, M., Parvaresh, F., and Hassibi, B. (2009). On the reconstruction of block-sparse signals with an optimal number of measurements. IEEE Transactions on Signal Processing, 57(8):3075–3085.
Subramanya, A. and Bilmes, J. (2008). Soft-supervised learning for text classification. In Proc. of EMNLP.
Subramanya, A. and Bilmes, J. (2009). Entropic graph regularization in non-parametric semi-supervised classification. In Proc. of NIPS.
Subramanya, A., Petrov, S., and Pereira, F. (2010). Efficient graph-based semi-supervised learning of structured tagging models. In Proc. of EMNLP.
Talukdar, P. P. and Crammer, K. (2009). New regularized algorithms for transductive learning. In Proc. of the ECML-PKDD.
Taylor, H., Bank, S., and McCoy, J. (1979). Deconvolution with the `1 norm. Geophysics, 44:39–52.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (B), pages 267–288.
Tikhonov, A. (1943). On the stability of inverse problems. In Dokl. Akad. Nauk SSSR, volume 39, pages 195–198.
Tillmann, A. and Pfetsch, M. (2012). The computational complexity of RIP, NSP, and related concepts in compressed sensing. Technical report, arXiv:1205.2081.
Tseng, P. and Yun, S. (2009). A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming (series B), 117:387–423.
Tsuruoka, Y., Tsujii, J., and Ananiadou, S. (2009). Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty. In Proc. of ACL.
Turlach, B. A., Venables, W. N., and Wright, S. J. (2005). Simultaneous variable selection. Technometrics, 47(3):349–363.
van de Geer, S. (2008). High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36:614–645.
Wiener, N. (1949). Extrapolation, Interpolation, and Smoothing of Stationary Time Series. Wiley, New York.
Williams, P. (1995). Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7:117–143.
Wright, S., Nowak, R., and Figueiredo, M. (2009). Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57:2479–2493.
Wu, J., Devlin, B., Ringquist, S., Trucco, M., and Roeder, K. (2010). Screen and clean: A tool for identifying interactions in genome-wide association studies. Genetic Epidemiology, 34(3):275–285.
References VII
Xiao, L. (2010). Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596.
Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proc. of ACL.
Yuan, L., Liu, J., and Ye, J. (2011). Efficient methods for overlapping group lasso. In Advances in Neural Information Processing Systems 24, pages 352–360.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society (B), 68(1):49.
Yun, S. and Toh, K.-C. (2011). A coordinate gradient descent method for L1-regularized convex minimization. ComputationalOptimization and Applications, 48:273–307.
Zhao, P., Rocha, G., and Yu, B. (2009). Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468–3497.
Zhu, J., Lao, N., and Xing, E. (2010). Grafting-light: Fast, incremental feature selection and structure learning of Markov random fields. In Proc. of International Conference on Knowledge Discovery and Data Mining, pages 303–312.
Zhu, X., Ghahramani, Z., and Lafferty, J. D. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. of ICML.