A Survey of L1 Regression
Vidaurre, Bielza and Larranaga (2013)
Céline Cunen, 20/10/2014
Outline of article
1. Introduction
2.The Lasso for Linear Regression
a) Notation and Main Concepts
b) Statistical Properties
c) Computational Algorithms
d) A Bayesian Interpretation
e) Connection to Boosting
3. Improving the Lasso's Properties
4. Adapting the Lasso to Particular Problems
5. Generalized Linear Models
a) Logistic Regression
b) Poisson Regression
c) Cox Proportional-Hazard Regression
6. Time Series
a) Wavelet Analysis
b) Autoregressive Models
c) Other Regression Models
d) Change Point Analysis
7. Discussion
1 Introduction
● L1-penalized linear regression = Lasso (Least Absolute Shrinkage and Selection Operator)
– The Lasso shrinks the coefficients (like Ridge), but also performs variable selection (unlike Ridge)
● Useful for sparse models: assumes that the “true” model contains only a small subset of the explanatory variables
● Main advantages: the Lasso offers interpretable, stable models and efficient prediction at a reasonable computational cost
– Variable selection
– Solutions when p > n
● Bias/variance trade-off
2.1 Main Concepts
● Linear model: $y = X\beta + \varepsilon$
● Ordinary Least Squares:
– Minimizing the RSS (residual sum of squares): $\hat\beta = \arg\min_\beta \|y - X\beta\|_2^2$
– Where: $\|a\|_2^2 = a^t a = a_1^2 + a_2^2 + \ldots$
– LS solution: $\hat\beta = (X^t X)^{-1} X^t y$
● Lasso: $\hat\beta = \arg\min_\beta \left( \|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \right)$
– Where: $\|a\|_1 = |a_1| + |a_2| + \ldots$
● Ridge: $\hat\beta = \arg\min_\beta \left( \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 \right)$
– Ridge solution: $\hat\beta = (X^t X + \lambda I)^{-1} X^t y$
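As a concrete illustration, here is a minimal sketch of all three estimators using scikit-learn (my choice of tooling, not the survey's; note that scikit-learn calls the penalty weight alpha and scales the RSS term by 1/(2n), so alpha matches λ only up to a constant):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))  # sparse "true" model
y = X @ beta_true + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks, never zeroes
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: shrinks and selects

print("OLS  :", np.round(ols.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))  # all 10 coefficients non-zero
print("Lasso:", np.round(lasso.coef_, 2))  # irrelevant ones typically exactly 0
```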
Orthonormal input matrix X
[Figure from Hastie, Tibshirani and Friedman (2008): each estimator $\hat\beta_1$ plotted against the least-squares estimate $\hat\beta_1^{ls}$]
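In the orthonormal case these estimators also have simple closed forms (standard results, given in Hastie, Tibshirani and Friedman (2008); the exact scaling of λ depends on the convention used in the objective): Ridge rescales the least-squares estimate, while the Lasso soft-thresholds it,

$\hat\beta_j^{ridge} = \hat\beta_j^{ls} / (1+\lambda), \qquad \hat\beta_j^{lasso} = \mathrm{sign}(\hat\beta_j^{ls})\,(|\hat\beta_j^{ls}| - \lambda)_+$

which makes the figure's shapes explicit: a uniform proportional shrinkage for Ridge versus a translation toward zero, with exact zeroing, for the Lasso.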
Optimization problem
● The penalized problems can equivalently be written as minimizing the RSS subject to a constraint:
– Lasso: $\|\beta\|_1 = |\beta_1| + |\beta_2| \leq s$
– Ridge: $\|\beta\|_2^2 = \beta_1^2 + \beta_2^2 \leq s$
[Figure from Hastie, Tibshirani and Friedman (2008): RSS contours with the Lasso (diamond) and Ridge (disk) constraint regions]
General Lq penalty
● Penalty: $\lambda\|\beta\|_q^q = \lambda(|\beta_1|^q + |\beta_2|^q + \ldots)$
● q < 1:
– Less bias than the Lasso
– Non-convexity
– Variable selection
● q > 1:
– Convex
– No variable selection
● q = 1: the Lasso! A compromise
[Figure from Hastie, Tibshirani and Friedman (2008): contours of constant $\|\beta\|_q^q$ for several values of q]
Lasso vs Ridge
● Lasso is better if the “true” model is sparse
– Because it shrinks the remaining coefficients less than Ridge does
● Ridge is better when the predictors are highly correlated
– Ridge keeps the redundant variables but shrinks them; the Lasso tends to discard all but one
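A small simulated illustration of this behaviour (a sketch with scikit-learn and made-up data; which column the Lasso keeps is essentially arbitrary and seed-dependent):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n = 200
z = rng.normal(size=n)
# three nearly identical, highly correlated predictors
X = np.column_stack([z + 0.01 * rng.normal(size=n) for _ in range(3)])
y = z + 0.1 * rng.normal(size=n)

print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)  # weight spread over all three
print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_)  # tends to concentrate on one
```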
2.2 Statistical Properties
● Two measures of quality:
– Prediction accuracy (for new data)
– Recovering the “true” model
● Prediction consistency:
– Based on the expected squared prediction error for new observations
2.2 Statistical Properties (2)
● Recovering the “true” explanatory variables and giving consistent estimates:
– Irrepresentable condition: the true model can be recovered <=> no high correlations between relevant and irrelevant predictors
– In addition, the beta-min condition: the true non-zero coefficients must be sufficiently large
● Variable screening instead: select the “true” variables, but allow some false positives
● Optimal λ:
– λ is adaptively chosen to minimize the expected prediction error
– The optimal λ for prediction > the optimal λ for variable selection
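In practice the cross-validated λ is easy to obtain; a minimal sketch with scikit-learn's LassoCV (library choice and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 20))
beta = np.zeros(20)
beta[:4] = [2.0, -1.5, 1.0, 0.5]           # only 4 "true" variables
y = X @ beta + rng.normal(size=150)

fit = LassoCV(cv=10).fit(X, y)             # 10-fold CV over a grid of penalties
print("lambda chosen by CV:", fit.alpha_)
print("selected variables :", np.flatnonzero(fit.coef_))
```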
2.3 Computational Algorithms
● Algorithms solving the Lasso optimization problem:
– LARS (least-angle regression)
– Pathwise coordinate descent
● LARS:
– Computes the entire regularization path (for all λ) in p steps, at the same order of cost as LS
Regularization path for the Lasso
[Figure: coefficient profiles along the regularization path, from heavy shrinkage at one end to the least-squares solution at the other]
LARS Algorithm
[Algorithm box from Hastie, Tibshirani and Friedman (2008)]
Modification to get the Lasso solution: if a non-zero coefficient hits zero along the path, drop that variable from the active set and recompute the joint least-squares direction.
When can LARS be used?
● The regularization path is piecewise linear when:
– The loss function is quadratic as a function of β
– The penalty function is piecewise linear as a function of β
● When these conditions are not fulfilled (and/or we have a very large number of explanatory variables):
– Coordinate descent optimization algorithms may be used (the problem must still be convex)
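scikit-learn exposes the LARS path directly; a small sketch (an assumed tool choice) showing the breakpoints of the piecewise-linear path and the order in which variables enter:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       noise=1.0, random_state=0)

# method="lasso" applies the LARS modification that yields the Lasso path
alphas, active, coefs = lars_path(X, y, method="lasso")
print("path breakpoints (lambda values):", np.round(alphas, 3))
print("order in which variables enter  :", active)
```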
2.4 A Bayesian Interpretation
● The Lasso can be considered a Bayes estimate
– With a Laplace prior on β (the Ridge solution can be obtained with a Gaussian prior)
– Only the mode of the posterior (MAP) gives a sparse estimate
● Advantages:
– Reliable SE(β)
– Reveals information about dependence between explanatory variables
● But more computationally expensive
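Concretely (a standard derivation, not specific to the survey): with $y \mid \beta \sim N(X\beta, \sigma^2 I)$ and independent Laplace priors $\pi(\beta_j) \propto e^{-\tau|\beta_j|}$, the negative log-posterior is

$-\log p(\beta \mid y) = \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \tau\|\beta\|_1 + \mathrm{const}$

so the posterior mode (MAP) solves exactly the Lasso problem with $\lambda = 2\sigma^2\tau$.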
2.5 Connection to Boosting
● Reminder: ordinary Lasso = squared loss + L1 penalization
● Boosting regression with squared loss ≈ Lasso (when there is low correlation between explanatory variables)
● Boosting can give methods for computing L1-penalized regression with other losses than the squared loss
– blasso: generalization of the Lasso to any convex loss function
3 Improving the Lasso's Properties
● Bias/variance trade-off (reminder)
● Reminder: the regularization parameter λ is generally chosen by cross-validation (adaptively, to minimize the expected prediction error)
● Problem 1: the right variables may be identified, but their coefficient estimates are biased
– Relaxed lasso
– Variable Inclusion and Shrinkage Algorithm (VISA)
– Adaptive lasso (see the sketch after this list)
– (Dantzig selector, LAD-lasso)
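The adaptive lasso, for instance, reweights the penalty by initial coefficient estimates so that strong coefficients are shrunk less. A common implementation trick (a sketch of my own, not the survey's recipe) is to rescale the columns of X by the weights and run an ordinary Lasso:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
beta = np.array([4.0, -3.0] + [0.0] * 8)
y = X @ beta + rng.normal(size=200)

# Step 1: initial estimates (Ridge is a common choice)
w = np.abs(Ridge(alpha=1.0).fit(X, y).coef_)

# Step 2: Lasso on rescaled columns; the effective penalty on beta_j is
# lambda / w_j, so well-supported coefficients suffer less bias
fit = Lasso(alpha=0.1).fit(X * w, y)
beta_adaptive = fit.coef_ * w               # map back to the original scale
print(np.round(beta_adaptive, 2))
```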
3 Improving the Lasso's Properties (2)
● Problem 2: Ridge is better than the Lasso when there are strong correlations between explanatory variables
– Elastic net
● Popular
● If p > N: can select more than N variables
● Can select groups of redundant variables
$\hat\beta = \arg\min_\beta \left( \|y - X\beta\|_2^2 + \lambda\,(\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2) \right)$
Elastic net
[Figure: elastic-net constraint regions; α = 0.2 is more like Ridge, α = 0.8 more like Lasso]
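A minimal sketch with scikit-learn's ElasticNet (note the naming clash: scikit-learn's l1_ratio plays the role of α in the formula above, and its alpha plays the role of λ):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 100))            # p > N
beta = np.zeros(100)
beta[:5] = 2.0
y = X @ beta + rng.normal(size=50)

# l1_ratio=0.8: mostly Lasso-like, with a Ridge component that stabilises
# the fit under correlation and allows selecting more than N variables
fit = ElasticNet(alpha=0.5, l1_ratio=0.8).fit(X, y)
print("non-zero coefficients:", np.flatnonzero(fit.coef_))
```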
4 Adapting the Lasso to Particular Problems
● Group lasso (see the sketch after this list)
– When we are interested in including entire groups of explanatory variables
– Sparsity on both the group and the individual level
● Composite Absolute Penalties (CAP)
● Fused lasso
– Adjacent coefficients get similar values
● Multiresponse regression
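The key ingredient of group-lasso algorithms is block soft-thresholding: a whole group is zeroed if its norm is small, otherwise shrunk as a block. A minimal sketch (the function name and values are my own, illustrative):

```python
import numpy as np

def block_soft_threshold(beta_g, t):
    """Proximal operator of t * ||beta_g||_2: sets the entire group to zero
    when its norm is below t, otherwise shrinks it proportionally."""
    norm = np.linalg.norm(beta_g)
    if norm <= t:
        return np.zeros_like(beta_g)
    return (1.0 - t / norm) * beta_g

print(block_soft_threshold(np.array([0.1, -0.2]), t=0.5))  # -> [0. 0.]
print(block_soft_threshold(np.array([3.0, 4.0]), t=0.5))   # shrunk, not zeroed
```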
5 Generalized Linear Models
● More general than the linear model
● A link function g() defines how the linear contribution of the explanatory variables affects the expectation:
$g(E(Y \mid x)) = \beta_0 + x^t\beta$
5.1 Logistic Regression
● Y is categorical (can belong to 2 classes)
● Link function = logit function:
$g(a) = \log\!\left(\frac{a}{1-a}\right)$
● Solution with L1 penalty:
$\hat\beta = \arg\min_\beta \left( -\sum_{i=1}^{N} \left( y_i x_i^t\beta - \log(1 + e^{x_i^t\beta}) \right) + \lambda\|\beta\|_1 \right)$
● Can be solved by several algorithms
● Also methods for L1-regularized multinomial logistic regression (more than 2 classes)
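A minimal sketch of L1-penalized logistic regression with scikit-learn (an assumed tool choice; note that its C is the inverse of λ, so smaller C means stronger penalization):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=4,
                           random_state=0)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print("selected features:", np.flatnonzero(clf.coef_[0]))  # sparse coefficients
```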
5.2 Poisson Regression
● Y: a non-negative integer
– Modeling count data
● Link function: $g(a) = \log(a)$
● Model: $\log(E(Y \mid x)) = \beta_0 + x^t\beta$
● Solution with L1 penalty:
$\hat\beta = \arg\min_\beta \left( \sum_{i=1}^{N} \left( -y_i(\beta_0 + x_i^t\beta) + e^{\beta_0 + x_i^t\beta} \right) + \lambda\|\beta\|_1 \right)$
● Gives sparse estimates of β
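One way to fit this in practice is statsmodels' penalized GLM (my tool choice, not the survey's; L1_wt=1.0 selects the pure L1 penalty within its elastic-net interface):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 8))
beta = np.array([0.8, -0.5] + [0.0] * 6)
y = rng.poisson(np.exp(0.3 + X @ beta))     # simulated count data

model = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson())
fit = model.fit_regularized(alpha=0.05, L1_wt=1.0)
print(np.round(fit.params, 2))              # many coefficients driven to 0
```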
6.1 Wavelet Analysis
● A method for representing complicated functions in a simpler manner
● W: an orthonormal basis matrix
– Gives a basis for all square-integrable functions
– That is, every square-integrable function can be represented as a (possibly infinite) linear combination of these basis functions
● y: data vector
– Find the optimal z such that y ≈ Wz:
$\hat z = \arg\min_z \left( \|y - Wz\|_2^2 + 2\lambda\|z\|_1 \right)$
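Because W is orthonormal, this problem has a closed-form solution: soft-threshold the wavelet coefficients of y. A sketch using PyWavelets (an assumed dependency; wavelet and threshold choices are illustrative):

```python
import numpy as np
import pywt

rng = np.random.default_rng(6)
t = np.linspace(0, 1, 512)
y = np.sin(8 * np.pi * t**2) + 0.3 * rng.normal(size=t.size)  # noisy signal

lam = 0.3
coeffs = pywt.wavedec(y, "db4")                   # z: wavelet coefficients of y
coeffs = [pywt.threshold(c, lam, mode="soft")     # closed-form L1 solution
          for c in coeffs]
y_denoised = pywt.waverec(coeffs, "db4")          # back to the data domain
```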
6.2 Autoregressive Models
● A model where the output depends linearly on its own previous values
● An L1 penalty gives a sparse solution (only a few lags enter the model)
● There is also a group-penalty variant
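A sketch of a sparse AR fit: build a design matrix of lagged values and apply an ordinary Lasso (lag depth and penalty are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
T, max_lag = 500, 10
y = np.zeros(T)
for t in range(2, T):                       # simulate a true AR(2) process
    y[t] = 0.5 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

# column k holds the series shifted by lag k+1
X = np.column_stack([y[max_lag - k - 1 : T - k - 1] for k in range(max_lag)])
target = y[max_lag:]

fit = Lasso(alpha=0.05).fit(X, target)
print(np.round(fit.coef_, 2))               # ideally non-zero only at lags 1, 2
```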
7 Discussion
● Lasso:
– Minimizing the RSS subject to an L1 constraint
– = Minimizing the L1-penalized negative log-likelihood function
● Wide range of applications
● Advantages:
– Variable selection
– Bias/variance trade-off
7 Discussion (2)
● Extension to non-linear models:
– Use complex models for the entire data domain, with a dictionary of functions
– Or simple (linear) models for different parts of the data domain
● Remaining work:
– Techniques for making rigorous inference
Sources
● Hastie, Tibshirani and Friedman (2008): The Elements of Statistical Learning