A Survey of L1 Regression
Vidaurre, Bielza and Larranaga (2013)
Céline Cunen, 20/10/2014
Outline of article
1. Introduction
2.The Lasso for Linear Regression
a) Notation and Main Concepts
b) Statistical Properties
c) Computational Algorithms
d) A Bayesian Interpretation
e) Connection to Boosting
3. Improving the Lasso's Properties
4. Adapting the Lasso to Particular Problems
5. Generalized Linear Models
a) Logistic Regression
b) Poisson Regression
c) Cox Proportional-Hazard Regression
6. Time Series
a) Wavelet Analysis
b) Autoregressive Models
c) Other Regression Models
d) Change Point Analysis
7. Discussion
1 Introduction
● L1-penalized linear regression = Lasso (Least Absolute Shrinkage and Selection Operator)
– The Lasso shrinks the coefficients (like Ridge), but also performs variable selection (unlike Ridge)
● Useful for sparse models: assumes that the “true” model contains only a small subset of the explanatory variables
● Main advantages: the Lasso offers interpretable, stable models and efficient prediction at a reasonable computational cost
– Variable selection
– Solutions when p > n
● Bias/variance trade-off
2.1 Main Concepts
● Linear model: $y = X\beta + \varepsilon$
● Ordinary Least Squares:
– Minimizing the RSS (residual sum of squares): $\hat\beta = \arg\min_\beta \|y - X\beta\|_2^2$
– Where: $\|a\|_2^2 = a^t a = a_1^2 + a_2^2 + \ldots$
– LS solution: $\hat\beta = (X^t X)^{-1} X^t y$
● Lasso: $\hat\beta = \arg\min_\beta \left( \|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \right)$
– Where: $\|a\|_1 = |a_1| + |a_2| + \ldots$
● Ridge: $\hat\beta = \arg\min_\beta \left( \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 \right)$
– Ridge solution: $\hat\beta = (X^t X + \lambda I)^{-1} X^t y$
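As a concrete illustration, here is a minimal sketch of all three estimators using scikit-learn (my choice of tooling, not the survey's; note that scikit-learn calls the penalty weight alpha and scales the RSS term by 1/(2n), so alpha matches λ only up to a constant):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))  # sparse "true" model
y = X @ beta_true + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks, never zeroes
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: shrinks and selects

print("OLS  :", np.round(ols.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))  # all 10 coefficients non-zero
print("Lasso:", np.round(lasso.coef_, 2))  # irrelevant ones typically exactly 0
```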
Orthonormal input matrix X
[Figure from Hastie, Tibshirani and Friedman (2008): each estimator $\hat\beta_1$ plotted against the least-squares estimate $\hat\beta_1^{ls}$]
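In the orthonormal case these estimators also have simple closed forms (standard results, given in Hastie, Tibshirani and Friedman (2008); the exact scaling of λ depends on the convention used in the objective): Ridge rescales the least-squares estimate, while the Lasso soft-thresholds it,

$\hat\beta_j^{ridge} = \hat\beta_j^{ls} / (1+\lambda), \qquad \hat\beta_j^{lasso} = \mathrm{sign}(\hat\beta_j^{ls})\,(|\hat\beta_j^{ls}| - \lambda)_+$

which makes the figure's shapes explicit: a uniform proportional shrinkage for Ridge versus a translation toward zero, with exact zeroing, for the Lasso.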
Optimization problem
● The penalized problems can equivalently be written as minimizing the RSS subject to a constraint:
– Lasso: $\|\beta\|_1 = |\beta_1| + |\beta_2| \leq s$
– Ridge: $\|\beta\|_2^2 = \beta_1^2 + \beta_2^2 \leq s$
[Figure from Hastie, Tibshirani and Friedman (2008): RSS contours with the Lasso (diamond) and Ridge (disk) constraint regions]
General Lq penalty
● Penalty: $\lambda\|\beta\|_q^q = \lambda(|\beta_1|^q + |\beta_2|^q + \ldots)$
● q < 1:
– Less bias than the Lasso
– Non-convexity
– Variable selection
● q > 1:
– Convex
– No variable selection
● q = 1: the Lasso! A compromise
[Figure from Hastie, Tibshirani and Friedman (2008): contours of constant $\|\beta\|_q^q$ for several values of q]
Lasso vs Ridge
● Lasso is better if the “true” model is sparse
– Because it shrinks the remaining coefficients less than Ridge does
● Ridge is better when the predictors are highly correlated
– Ridge keeps the redundant variables but shrinks them; the Lasso tends to discard all but one
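A small simulated illustration of this behaviour (a sketch with scikit-learn and made-up data; which column the Lasso keeps is essentially arbitrary and seed-dependent):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n = 200
z = rng.normal(size=n)
# three nearly identical, highly correlated predictors
X = np.column_stack([z + 0.01 * rng.normal(size=n) for _ in range(3)])
y = z + 0.1 * rng.normal(size=n)

print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)  # weight spread over all three
print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_)  # tends to concentrate on one
```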
2.2 Statistical Properties
● Two measures of quality:
– Prediction accuracy (for new data)
– Recovering the “true” model
● Prediction consistency:
– Based on the expected squared prediction error for new observations
2.2 Statistical Properties (2)
● Recovering the “true” explanatory variables and giving consistent estimates:
– Irrepresentable condition: the true model can be recovered <=> no high correlations between relevant and irrelevant predictors
– In addition, the beta-min condition: the true non-zero coefficients must be sufficiently large
● Variable screening instead: select the “true” variables, but allow some false positives
● Optimal λ:
– λ is adaptively chosen to minimize the expected prediction error
– The optimal λ for prediction > the optimal λ for variable selection
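In practice the cross-validated λ is easy to obtain; a minimal sketch with scikit-learn's LassoCV (library choice and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 20))
beta = np.zeros(20)
beta[:4] = [2.0, -1.5, 1.0, 0.5]           # only 4 "true" variables
y = X @ beta + rng.normal(size=150)

fit = LassoCV(cv=10).fit(X, y)             # 10-fold CV over a grid of penalties
print("lambda chosen by CV:", fit.alpha_)
print("selected variables :", np.flatnonzero(fit.coef_))
```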
2.3 Computational Algorithms
● Algorithms solving the Lasso optimization problem:
– LARS (least-angle regression)
– Pathwise coordinate descent
● LARS:
– Computes the entire regularization path (for all λ) in p steps, at the same order of cost as LS
Regularization path for the Lasso
[Figure: coefficient profiles along the regularization path, from heavy shrinkage at one end to the least-squares solution at the other]
LARS Algorithm
[Algorithm box from Hastie, Tibshirani and Friedman (2008)]
Modification to get the Lasso solution: if a non-zero coefficient hits zero along the path, drop that variable from the active set and recompute the joint least-squares direction.
When can LARS be used?
● The regularization path is piecewise linear when:
– The loss function is quadratic as a function of β
– The penalty function is piecewise linear as a function of β
● When these conditions are not fulfilled (and/or we have a very large number of explanatory variables):
– Coordinate descent optimization algorithms may be used (the problem must still be convex)
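scikit-learn exposes the LARS path directly; a small sketch (an assumed tool choice) showing the breakpoints of the piecewise-linear path and the order in which variables enter:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       noise=1.0, random_state=0)

# method="lasso" applies the LARS modification that yields the Lasso path
alphas, active, coefs = lars_path(X, y, method="lasso")
print("path breakpoints (lambda values):", np.round(alphas, 3))
print("order in which variables enter  :", active)
```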
2.4 A Bayesian Interpretation
● The Lasso can be considered a Bayes estimate
– With a Laplace prior on β (the Ridge solution can be obtained with a Gaussian prior)
– Only the mode of the posterior (MAP) gives a sparse estimate
● Advantages:
– Reliable SE(β)
– Reveals information about dependence between explanatory variables
● But more computationally expensive
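Concretely (a standard derivation, not specific to the survey): with $y \mid \beta \sim N(X\beta, \sigma^2 I)$ and independent Laplace priors $\pi(\beta_j) \propto e^{-\tau|\beta_j|}$, the negative log-posterior is

$-\log p(\beta \mid y) = \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \tau\|\beta\|_1 + \mathrm{const}$

so the posterior mode (MAP) solves exactly the Lasso problem with $\lambda = 2\sigma^2\tau$.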
2.5 Connection to Boosting
● Reminder: ordinary Lasso = squared loss + L1 penalization
● Boosting regression with squared loss ≈ Lasso (when there is low correlation between explanatory variables)
● Boosting can give methods for computing L1-penalized regression with other losses than the squared loss
– blasso: generalization of the Lasso to any convex loss function
3 Improving the Lasso's Properties
● Bias/variance trade-off (reminder)
● Reminder: the regularization parameter λ is generally chosen by cross-validation (adaptively, to minimize the expected prediction error)
● Problem 1: the right variables may be identified, but their coefficient estimates are biased
– Relaxed lasso
– Variable Inclusion and Shrinkage Algorithm (VISA)
– Adaptive lasso (see the sketch after this list)
– (Dantzig selector, LAD-lasso)
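The adaptive lasso, for instance, reweights the penalty by initial coefficient estimates so that strong coefficients are shrunk less. A common implementation trick (a sketch of my own, not the survey's recipe) is to rescale the columns of X by the weights and run an ordinary Lasso:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
beta = np.array([4.0, -3.0] + [0.0] * 8)
y = X @ beta + rng.normal(size=200)

# Step 1: initial estimates (Ridge is a common choice)
w = np.abs(Ridge(alpha=1.0).fit(X, y).coef_)

# Step 2: Lasso on rescaled columns; the effective penalty on beta_j is
# lambda / w_j, so well-supported coefficients suffer less bias
fit = Lasso(alpha=0.1).fit(X * w, y)
beta_adaptive = fit.coef_ * w               # map back to the original scale
print(np.round(beta_adaptive, 2))
```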
3 Improving the Lasso's Properties (2)
● Problem 2: Ridge is better than the Lasso when there are strong correlations between explanatory variables
– Elastic net
● Popular
● If p > N: can select more than N variables
● Can select groups of redundant variables
$\hat\beta = \arg\min_\beta \left( \|y - X\beta\|_2^2 + \lambda\,(\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2) \right)$
Elastic net
[Figure: elastic-net constraint regions; α = 0.2 is more like Ridge, α = 0.8 more like Lasso]
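A minimal sketch with scikit-learn's ElasticNet (note the naming clash: scikit-learn's l1_ratio plays the role of α in the formula above, and its alpha plays the role of λ):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 100))            # p > N
beta = np.zeros(100)
beta[:5] = 2.0
y = X @ beta + rng.normal(size=50)

# l1_ratio=0.8: mostly Lasso-like, with a Ridge component that stabilises
# the fit under correlation and allows selecting more than N variables
fit = ElasticNet(alpha=0.5, l1_ratio=0.8).fit(X, y)
print("non-zero coefficients:", np.flatnonzero(fit.coef_))
```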
4 Adapting the Lasso to Particular Problems
● Group lasso (see the sketch after this list)
– When we are interested in including entire groups of explanatory variables
– Sparsity on both the group and the individual level
● Composite Absolute Penalties (CAP)
● Fused lasso
– Adjacent coefficients get similar values
● Multiresponse regression
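The key ingredient of group-lasso algorithms is block soft-thresholding: a whole group is zeroed if its norm is small, otherwise shrunk as a block. A minimal sketch (the function name and values are my own, illustrative):

```python
import numpy as np

def block_soft_threshold(beta_g, t):
    """Proximal operator of t * ||beta_g||_2: sets the entire group to zero
    when its norm is below t, otherwise shrinks it proportionally."""
    norm = np.linalg.norm(beta_g)
    if norm <= t:
        return np.zeros_like(beta_g)
    return (1.0 - t / norm) * beta_g

print(block_soft_threshold(np.array([0.1, -0.2]), t=0.5))  # -> [0. 0.]
print(block_soft_threshold(np.array([3.0, 4.0]), t=0.5))   # shrunk, not zeroed
```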
5 Generalized Linear Models
● More general than the linear model
● A link function g() defines how the linear contribution of the explanatory variables affects the expectation:
$g(E(Y \mid x)) = \beta_0 + x^t\beta$
5.1 Logistic Regression
● Y is categorical (can belong to 2 classes)
● Link function = logit function:
$g(a) = \log\!\left(\frac{a}{1-a}\right)$
● Solution with L1 penalty:
$\hat\beta = \arg\min_\beta \left( -\sum_{i=1}^{N} \left( y_i x_i^t\beta - \log(1 + e^{x_i^t\beta}) \right) + \lambda\|\beta\|_1 \right)$
● Can be solved by several algorithms
● Also methods for L1-regularized multinomial logistic regression (more than 2 classes)
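A minimal sketch of L1-penalized logistic regression with scikit-learn (an assumed tool choice; note that its C is the inverse of λ, so smaller C means stronger penalization):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=4,
                           random_state=0)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print("selected features:", np.flatnonzero(clf.coef_[0]))  # sparse coefficients
```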
5.2 Poisson Regression
● Y: a non-negative integer
– Modeling count data
● Link function: $g(a) = \log(a)$
● Model: $\log(E(Y \mid x)) = \beta_0 + x^t\beta$
● Solution with L1 penalty:
$\hat\beta = \arg\min_\beta \left( \sum_{i=1}^{N} \left( -y_i(\beta_0 + x_i^t\beta) + e^{\beta_0 + x_i^t\beta} \right) + \lambda\|\beta\|_1 \right)$
● Gives sparse estimates of β
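One way to fit this in practice is statsmodels' penalized GLM (my tool choice, not the survey's; L1_wt=1.0 selects the pure L1 penalty within its elastic-net interface):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 8))
beta = np.array([0.8, -0.5] + [0.0] * 6)
y = rng.poisson(np.exp(0.3 + X @ beta))     # simulated count data

model = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson())
fit = model.fit_regularized(alpha=0.05, L1_wt=1.0)
print(np.round(fit.params, 2))              # many coefficients driven to 0
```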
6.1 Wavelet Analysis
● A method for representing complicated functions in a simpler manner
● W: an orthonormal basis matrix
– Gives a basis for all square-integrable functions
– That is, every square-integrable function can be represented as a (possibly infinite) linear combination of these basis functions
● y: data vector
– Find the optimal z such that y ≈ Wz:
$\hat z = \arg\min_z \left( \|y - Wz\|_2^2 + 2\lambda\|z\|_1 \right)$
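Because W is orthonormal, this problem has a closed-form solution: soft-threshold the wavelet coefficients of y. A sketch using PyWavelets (an assumed dependency; wavelet and threshold choices are illustrative):

```python
import numpy as np
import pywt

rng = np.random.default_rng(6)
t = np.linspace(0, 1, 512)
y = np.sin(8 * np.pi * t**2) + 0.3 * rng.normal(size=t.size)  # noisy signal

lam = 0.3
coeffs = pywt.wavedec(y, "db4")                   # z: wavelet coefficients of y
coeffs = [pywt.threshold(c, lam, mode="soft")     # closed-form L1 solution
          for c in coeffs]
y_denoised = pywt.waverec(coeffs, "db4")          # back to the data domain
```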
6.2 Autoregressive Models
● A model where the output depends linearly on its own previous values
● An L1 penalty gives a sparse solution (only a few lags enter the model)
● There is also a group-penalty variant
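A sketch of a sparse AR fit: build a design matrix of lagged values and apply an ordinary Lasso (lag depth and penalty are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
T, max_lag = 500, 10
y = np.zeros(T)
for t in range(2, T):                       # simulate a true AR(2) process
    y[t] = 0.5 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

# column k holds the series shifted by lag k+1
X = np.column_stack([y[max_lag - k - 1 : T - k - 1] for k in range(max_lag)])
target = y[max_lag:]

fit = Lasso(alpha=0.05).fit(X, target)
print(np.round(fit.coef_, 2))               # ideally non-zero only at lags 1, 2
```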
7 Discussion
● Lasso:
– Minimizing the RSS subject to an L1 constraint
– = Minimizing the L1-penalized negative log-likelihood function
● Wide range of applications
● Advantages:
– Variable selection
– Bias/variance trade-off
7 Discussion (2)
● Extension to non-linear models:
– Use complex models for the entire data domain, with a dictionary of functions
– Or simple (linear) models for different parts of the data domain
● Remaining work:
– Techniques for making rigorous inference
Sources
● Hastie, Tibshirani and Friedman (2008): The Elements of Statistical Learning