Transcript of Lecture 4. Linear Models for Regression

Page 1

Lecture 4. Linear Models for Regression

Page 2

Outline

Linear Regression

Least Square Solution

Subset Least Squares: subset selection / forward / backward

Penalized Least Squares: Ridge Regression, LASSO, Elastic Nets (LASSO + Ridge)

Page 3

Linear Methods for Regression

Input (features) vector, p-dimensional: X = (X1, X2, …, Xp)

Real-valued output: Y, with joint distribution of (Y, X)

Regression function: E(Y | X) = f(X)

Training data: (x1, y1), (x2, y2), …, (xN, yN), for estimation of the input-output relation f.

Page 4

Linear Model

f(x): Regression function or a good approximation

LINEAR in the unknown parameters (weights, coefficients) β0, β1, …, βp:

f(X) = β0 + ∑_{j=1}^{p} βj Xj

Page 5

Features

Quantitative inputs: any arbitrary but known function of the measured attributes

Transformations of quantitative attributes: g(x), e.g., log, square, square-root etc.

Basis expansions: e.g., a polynomial approximation of f as a function of X1 (Taylor Series expansion with unknown coefficients)

e.g., X2 = X1^2, X3 = X1^3, …, Xk = X1^k

Page 6

Features (Cont.)

Qualitative (categorical) input G. Dummy codes: for an attribute with k categories, one may use k indicator inputs Xj, j = 1, 2, …, k, for the category (level) taken. Together, this collection of inputs represents the effect of G through

∑_{j=1}^{k} βj Xj

This is a set of level-dependent constants, since only one of the Xj equals one and the others are zero.

Page 7

Features (cont.)

Interactions: 2nd- or higher-order interactions of some features, e.g.,

X3 = X1 * X2,   X4 = X1 * X2 * X3

Feature vector for the i-th case in the training set (example):

x_i = (x_i1, x_i2, …, x_ip)^T

Page 8

Generalized Linear Models: Basis Expansion

Wide variety of flexible models

Model for f is a linear expansion of basis functions

Dictionary: Prescribed basis functions

f(x) = ∑_{k=1}^{K} θk hk(x)

Page 9

Other Basis Functions

Polynomial basis of degree s (smooth functions C^s).

Fourier series (band-limited functions, a compact subspace of C^∞).

Splines: piecewise polynomials of degree K between the knots, joined with continuity of order K−1 at the knots (Sobolev spaces).

Wavelets (Besov spaces).

Radial basis functions: symmetric p-dimensional kernels located at particular centroids, f(|x − y|).

Gaussian kernel at each centroid.

And more …

-- Curse of Dimensionality: p could be equal to or much larger than n.

Page 10

Method of Least Squares

Find coefficients β = (β0, β1, …, βp)^T that minimize the Residual Sum of Squares:

RSS(β) = ∑_{i=1}^{N} (yi − f(xi))^2 = ∑_{i=1}^{N} (yi − β0 − ∑_{j=1}^{p} xij βj)^2

RSS is the empirical risk over the training set. It does not by itself guarantee predictive performance over all inputs of interest.

Page 11

Min RSS Criterion

Statistically reasonable provided the examples in the training set are:

a large number of independent random draws from the input population for which prediction is desired;

given the inputs (x1, x2, …, xN), the outputs (y1, y2, …, yN) are conditionally independent.

In principle, predictive performance over the set of future input vectors should be examined.

Gaussian noise: the least squares method is equivalent to maximum likelihood.

Minimize RSS(β) over β in R^{p+1}; RSS is a quadratic function of β.

Optimal solution: take the derivatives with respect to the elements of β and set them equal to zero.

Page 12

Optimal Solution

The Hessian (2nd derivative) of the criterion function is given by 2 X^T X.

The optimal solution satisfies the normal equations

X^T (Y − Xβ) = 0,   or   (X^T X) β = X^T Y

For a unique solution, the matrix X^T X must be of full rank. Then

β̂ = (X^T X)^{-1} X^T Y,   Ŷ = X β̂ = X (X^T X)^{-1} X^T Y = H Y,   where H = X (X^T X)^{-1} X^T
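A minimal R sketch of this closed-form solution (illustrative only; the simulated data and dimensions below are made up, and in practice lm() is the standard tool):

# least squares via the normal equations (illustration, not production code)
set.seed(1)
N <- 100; p <- 3
X <- cbind(1, matrix(rnorm(N * p), N, p))    # design matrix with an intercept column
Y <- X %*% c(2, 1, -1, 0.5) + rnorm(N)       # simulated response
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)    # solves (X'X) beta = X'Y
H <- X %*% solve(t(X) %*% X) %*% t(X)        # hat (projection) matrix
Y_hat <- H %*% Y                             # fitted values, equal to X %*% beta_hat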

Page 13

Projection

When the matrix X^T X is of full rank, the estimated response for the training set is:

H: Projection (Hat) matrix

HY: Orthogonal Projection of Y on the space spanned by the columns of X

Note: the projection is linear in Y

Ŷ = X (X^T X)^{-1} X^T Y = H Y,   H = X (X^T X)^{-1} X^T

Page 14

Geometrical Insight

Page 15

Simple Univariate Regression

One variable, no intercept: Y = Xβ + ε

LS estimate: β̂ = ⟨x, y⟩ / ⟨x, x⟩, with inner product ⟨x, y⟩ = ∑_{i=1}^{N} xi yi = x^T y

The normalized inner product ⟨x, y⟩ / (‖x‖ ‖y‖) is the cosine of the angle between the vectors x and y, a measure of similarity between y and x.

Residuals r = y − x β̂: the projection of y onto the space orthogonal to x.

Definition: "Regress b on a"

Simple regression of response b on input a, with no intercept.

Estimate: γ̂ = ⟨a, b⟩ / ⟨a, a⟩

Residual b − γ̂ a: "b adjusted for a", or "b orthogonalized with respect to a"
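A small R illustration of the no-intercept estimate and residual above (the simulated x and y are assumptions for illustration):

# regress y on x with no intercept: beta_hat = <x,y> / <x,x>
x <- rnorm(50)
y <- 3 * x + rnorm(50)
beta_hat <- sum(x * y) / sum(x * x)
r <- y - x * beta_hat        # residual; sum(x * r) is numerically ~0 (orthogonality)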

Page 16

Multiple Regression

Multiple regression: p > 1

LS estimates differ from the simple univariate regression estimates, unless the columns of the input matrix X are orthogonal.

If ⟨xi, xj⟩ = 0 for all i ≠ j, then β̂j = ⟨xj, y⟩ / ⟨xj, xj⟩. These estimates are uncorrelated, and Var(β̂p) = σ² / ⟨xp, xp⟩.

Orthogonal inputs occur sometimes in balanced, designed experiments (experimental design).

Observational studies will almost never have orthogonal inputs.

Must "orthogonalize" them in order to have a similar interpretation. Use the Gram-Schmidt procedure to obtain an orthogonal basis for multiple regression.

Page 17

Multiple Regression Estimates: Sequence of Simple Regressions

Regression by Successive Orthogonalization:

Initialize z0 = x0 = 1.

For j = 1, 2, …, p, regress xj on z0, z1, …, z_{j−1} to produce coefficients

γ̂_{lj} = ⟨zl, xj⟩ / ⟨zl, zl⟩,   l = 0, …, j−1,

and residual vectors

zj = xj − ∑_{k=0}^{j−1} γ̂_{kj} zk.

Regress y on the residual vector zp for the estimate

β̂p = ⟨zp, y⟩ / ⟨zp, zp⟩.

Instead of using x1 and x2, take x1 and z as features
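A minimal R sketch of the successive-orthogonalization algorithm (assuming X is a numeric matrix whose first column is the constant 1; function and variable names are mine, for illustration):

# regression by successive orthogonalization: returns the coefficient of the last input
successive_orth_coef <- function(X, y) {
  p <- ncol(X)
  Z <- matrix(0, nrow(X), p)
  Z[, 1] <- X[, 1]                            # z0 = x0 = 1
  for (j in 2:p) {
    zj <- X[, j]
    for (l in 1:(j - 1)) {
      gamma <- sum(Z[, l] * X[, j]) / sum(Z[, l] * Z[, l])
      zj <- zj - gamma * Z[, l]               # remove the part of xj explained by zl
    }
    Z[, j] <- zj                              # residual vector zj, orthogonal to z0..z(j-1)
  }
  sum(Z[, p] * y) / sum(Z[, p] * Z[, p])      # simple regression of y on zp gives beta_hat_p
}

Up to numerical error, the returned value matches the p-th coefficient reported by lm(), which is one way to check the sketch.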

Page 18

Multiple Regression = Gram-Schmidt Orthogonalization Procedure

The vector zp is the residual of the multiple regression of xp on all the other inputs x0, x1, …, x_{p−1}.

Successive z's in the above algorithm are orthogonal and form an orthogonal basis for the column space of X.

The least squares projection of y onto this subspace is the usual ŷ.

By re-arranging the order of the variables, any input can be labeled as the p-th variable.

If xj is highly correlated with the other variables, its residual vector zj is quite small, and the coefficient has high variance.

Page 19

Statistical Properties of LS

Model: Y = Xβ + ε

Uncorrelated noise ε: mean zero, variance σ²

Then Var(β̂) = σ² (X^T X)^{-1}

Noise estimation: σ̂² = (1 / (N − p − 1)) ∑_{i=1}^{N} (yi − ŷi)²

Model d.f. = p + 1 (dimension of the model space)

To draw inferences on the parameter estimates, we need assumptions on the noise. If we assume the noise is Gaussian, then

β̂ ~ N(β, σ² (X^T X)^{-1})

(N − p − 1) σ̂² ~ σ² χ²_{N−p−1}

Page 20

Gauss-Markov Theorem

(The Gauss-Markov Theorem) If we have any linear estimator c^T y that is unbiased for a^T β, that is, E(c^T y) = a^T β, then

Var(a^T β̂) ≤ Var(c^T y).

It says that, for inputs in the row space of X, the LS estimate has minimum variance among all linear unbiased estimates.

Page 21

Bias-Variance Tradeoff

Mean squared error of an estimator = variance + bias²

The least squares estimator achieves the minimal variance among all linear unbiased estimators.

There are biased estimators that further reduce variance: Stein's estimator, shrinkage/thresholding (LASSO, etc.).

The more complicated a model is, the more variance but the less bias; a trade-off is needed.

Page 22

Hypothesis Test

• Single parameter test: H0: βj = 0, t-statistic

zj = β̂j / (σ̂ √vj) ~ t_{N−p−1},

where vj is the j-th diagonal element of V = (X^T X)^{-1}.

Confidence interval: β̂j ± z_{1−α} vj^{1/2} σ̂, e.g. z_{1−0.025} = 1.96.

• Group parameters: H0: β_Ω = 0, where Ω indexes the p1 − p0 parameters being tested; F-statistic for nested models

F = [(RSS0 − RSS1) / (p1 − p0)] / [RSS1 / (N − p1 − 1)]

Page 23

Example

R command: lm(Y ~ x1 + x2 + … +xp)
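For instance, with a data frame dat containing a response Y and inputs x1, x2, x3 (names assumed here for illustration), the single-parameter t-tests and a nested-model F-test can be obtained as:

fit_full  <- lm(Y ~ x1 + x2 + x3, data = dat)  # full model
summary(fit_full)                # coefficient estimates, t-statistics, p-values
confint(fit_full)                # confidence intervals for the coefficients
fit_small <- lm(Y ~ x1, data = dat)            # nested model dropping x2, x3
anova(fit_small, fit_full)       # F-statistic for the group hypothesis beta2 = beta3 = 0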

Page 24

Rank Deficiency

X rank deficient: the normal equations have infinitely many solutions β̃, all with the same fitted values Ŷ = X β̃.

The hat matrix H and the projection Ŷ are unique.

For an input in the row space of X, the LS estimate of the response is unique.

For an input not in the row space of X, the estimate may change with the solution β̃ used. How to generalize to inputs outside the training set?

Penalized methods (!)

Page 25

Reasons for Alternatives to LS Estimates

Prediction accuracy: LS estimates have low bias but high variance when inputs are highly correlated,

leading to a larger ESPE (expected squared prediction error).

Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. A small bias in the estimates may yield a large decrease in variance.

The bias/variance tradeoff may provide better predictive ability.

Better interpretation: with a large number of input variables, we would like to determine a smaller subset that exhibits the strongest effects.

Many tools to achieve these objectives:

Subset selection

Penalized regression (constrained optimization)

Page 26

Best Subset Selection Method

Algorithm: leaps & bounds. Find the best subset, i.e. the one with the smallest RSS, for each size k ∈ {0, 1, 2, …, p}.

For each fixed size k, one can also find a specified number of subsets close to the best. For each fixed subset, obtain the LS estimates. Feasible for p ≈ 40.

Choice of the optimal k is based on model selection criteria to be discussed later.
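In R, one common implementation of the leaps-and-bounds search is regsubsets() in the leaps package; a sketch (the data frame dat and the size limit are assumptions for illustration):

library(leaps)
best <- regsubsets(Y ~ ., data = dat, nvmax = 10, nbest = 1)  # best subset of each size
summary(best)$rss                 # smallest RSS achieved for each subset size k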

Page 27

Other Subset Selection Procedures

For larger p: classical forward selection (step-up),

Backward elimination (step down)

Hybrid forward-backward (stepwise) methods. Given a model, these methods only provide local control of variable selection or deletion:

Which current variable is least effective (candidate for deletion)

Which variable not in the model is most effective (candidate for inclusion)

Do not attempt to find the best subset of a given size

Not too popular in current practice

Page 28

Forward Stagewise Selection

(Incremental) Forward stagewise: standardize the input variables, then build up the coefficients in many small steps (see the sketch below).
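The slide's own algorithm details are not in the transcript; the sketch below follows the usual incremental forward-stagewise recipe (standardized inputs, all coefficients started at zero, a small step eps toward the predictor most correlated with the current residual). Names and defaults are illustrative:

# incremental forward stagewise (a sketch of the standard recipe)
forward_stagewise <- function(X, y, eps = 0.01, n_steps = 2000) {
  X <- scale(X)                            # standardize the inputs
  beta <- rep(0, ncol(X))
  r <- y - mean(y)                         # start from the centered response
  for (s in 1:n_steps) {
    cors <- drop(t(X) %*% r)               # correlation (up to scaling) with the residual
    j <- which.max(abs(cors))              # most correlated predictor
    beta[j] <- beta[j] + eps * sign(cors[j])
    r <- r - eps * sign(cors[j]) * X[, j]  # move the residual by a small step
  }
  beta
}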

Page 29

Penalized Regression

• Instead of directly minimizing the residual sum of squares, penalized regression usually takes the form

min_f ∑_{i=1}^{N} (yi − f(xi))^2 + λ J(f),

where J(f) is the penalization term, which usually penalizes the smoothness or complexity of the function f.

λ is chosen by cross-validation.

Page 30

Model Assessment and Selection

If we are in a data-rich situation, split the data into three parts: training, validation, and test.

[Split: Train | Validation | Test]

See chapter 7.1 for details

Page 31

Cross Validation

When the sample size is not sufficiently large, cross validation is a way to estimate the out-of-sample prediction error (or classification rate).

[Split: Available Data → Training | Test, randomly split]

Randomly split many times to get error1, error2, …, errorm, then average over all the errors to get an estimate.
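A minimal sketch of this repeated random-split estimate (the 80/20 split, the number of repeats, and the use of squared error are arbitrary illustrative choices):

# repeated random-split estimate of out-of-sample error
cv_error <- function(X, y, m = 20, train_frac = 0.8) {
  n <- nrow(X)
  errs <- numeric(m)
  for (k in 1:m) {
    idx  <- sample(n, floor(train_frac * n))         # random training split
    fit  <- lm(y[idx] ~ X[idx, , drop = FALSE])      # fit on the training part
    pred <- cbind(1, X[-idx, , drop = FALSE]) %*% coef(fit)
    errs[k] <- mean((y[-idx] - pred)^2)              # error_k on the held-out part
  }
  mean(errs)                                         # average over the m splits
}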

Page 32

Ridge Regression (Tikhonov Regularization)

Ridge regression shrinks coefficients by imposing a penalty on their size

Minimize a penalized RSS:

β̂^ridge = argmin_β { RSS(β) + λ ∑_{j=1}^{p} βj^2 },   λ ≥ 0

Here λ is the complexity parameter that controls the amount of shrinkage.

The larger its value, the greater the amount of shrinkage.

Coefficients are shrunk towards zero.

Choice of the penalty parameter is based on cross validation.

Prostate Cancer Example

Page 33

Ridge Regression (cont)

Equivalent problem: minimize RSS subject to ∑_{j=1}^{p} βj^2 ≤ s.

Lagrange multiplier: there is a 1-1 correspondence between s and λ.

With many correlated variables, LS estimates can become unstable and exhibit high variance and high correlations.

A wildly large positive coefficient on one variable can be cancelled by a large negative coefficient on another.

By imposing a size constraint on the coefficients, this phenomenon is prevented from occurring.

Ridge solutions are not invariant under scaling of the inputs.

Normally standardize the inputs before solving the optimization problem.

Since the penalty term does not include the intercept, estimate the intercept by the mean of the response y.

Page 34

Ridge Regression (cont)

The ridge criterion:

RSS(λ) = (y − Xβ)^T (y − Xβ) + λ β^T β,   β̂^ridge = (X^T X + λI)^{-1} X^T y

Shrinkage: for orthogonal inputs, the ridge estimates are a scaled version of the LS estimates, β̂^ridge = γ β̂, 0 ≤ γ ≤ 1.

Ridge is the mean or mode of the posterior distribution of β under a normal prior.

Centered input matrix X. SVD of X: X = U D V^T

U and V are orthogonal matrices.

Columns of U span the column space of X.

Columns of V span the row space of X.

D: a diagonal matrix of singular values d1 ≥ d2 ≥ … ≥ dp ≥ 0.

Eigen decomposition of X^T X = V D² V^T.

The eigenvectors vj: principal component directions of X (Karhunen-Loeve directions).
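A minimal R sketch of the closed-form ridge solution above (inputs standardized and response centered so that no intercept enters the penalty; names are illustrative):

# ridge regression via the closed form (X'X + lambda I)^{-1} X'y
ridge_coef <- function(X, y, lambda) {
  X <- scale(X)                      # ridge is not invariant to input scaling
  y <- y - mean(y)                   # intercept handled by centering the response
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}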

Page 35

Ridge Regression and Principal Components

First PC direction v1: among all normalized linear combinations of the columns of X, z1 = X v1 = u1 d1 has the largest sample variance. The derived variable z1 is the first principal component of X.

Subsequent PCs zj have maximal variance subject to being orthogonal to the earlier ones. The last PC has minimal variance.

Effective degrees of freedom: df(λ) = tr[X (X^T X + λI)^{-1} X^T] = ∑_{j=1}^{p} dj^2 / (dj^2 + λ).

Page 36

Ridge Regression (Summary)

Ridge regression penalizes the complexity of a linear model by the sum of squares of the coefficients.

It is equivalent to minimizing RSS subject to the constraint ∑_{j=1}^{p} βj^2 ≤ s.

The matrix (X^T X + λI) is always invertible for λ > 0.

The penalization parameter λ controls how simple "you" want the model to be.

Page 37

Prostate Cancer Example

Page 38

Ridge Regression (Summary)

Solutions are not sparse in the coefficient space: the β̂j's are non-zero almost all the time.

The computational complexity is O(p^3) when inverting the matrix X^T X + λI.

Prostate Cancer Example

Page 39

Least Absolute Shrinkage and Selection Operator (LASSO)

Penalized RSS with an L1-norm penalty, or equivalently minimize RSS subject to the constraint

∑_{j=1}^{p} |βj| ≤ t

Shrinks like ridge (L2-norm penalty), but LASSO coefficients hit exactly zero as the penalty increases.

Page 40

LASSO as Penalized Regression

• Instead of directly minimizing the residual sum of squares, the penalized regression takes the form

min_f ∑_{i=1}^{N} (yi − f(xi))^2 + λ J(f),

where

f(X) = β0 + ∑_{j=1}^{p} Xj βj,   J(f) = ‖f‖_1 := ∑_{j=1}^{p} |βj|

Page 41

LASSO(cont)

The computation is a quadratic programming problem.

We can obtain the entire solution path, which is piecewise linear in the regularization parameter.

Coefficients are non-linear in the response y (they are linear in y for ridge regression).

The regularization parameter λ is chosen by cross validation.
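In R, the glmnet package is one common way to compute the LASSO path and pick λ by cross validation; a sketch, assuming a numeric matrix x and response vector y:

library(glmnet)
fit   <- glmnet(x, y, alpha = 1)       # alpha = 1: LASSO penalty
plot(fit)                              # piecewise-linear coefficient paths
cvfit <- cv.glmnet(x, y, alpha = 1)    # cross validation over the lambda grid
coef(cvfit, s = "lambda.min")          # sparse coefficients at the selected lambda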

Page 42

LASSO and Ridge: contours of RSS in the space of the β's

Page 43

Generalize to L-q norm as penalty

Minimize RSS subject to constraints on the l-q norm

Equivalent to minimizing

RSS + λ ∑_{j=1}^{p} |βj|^q

Bridge regression, with ridge (q = 2) and LASSO (q = 1) as special cases; q = 1 is the smallest value for which the constraint region is convex.

For q = 0: best subset regression.

For 0 < q < 1, the problem is not convex!

Page 44

Contours of constant values of L-q norms

Page 45

Why non-convex norms?

LASSO is biased: for the scalar problem

min_β (1/2)(y − β)^2 + λ|β|,

the solution is the soft-thresholding rule

β̂_λ = S_λ(y) = sign(y) max(|y| − λ, 0),

so E(β̂) ≠ E(y) when y >> 0 (the estimate is shifted by λ).

A nonconvex penalty is necessary for an unbiased estimator: for

min_β (1/2)(y − β)^2 + λ J(β),

stationarity gives ∂J(β) = (y − β)/λ, which must tend to 0 as y → ∞.
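The soft-thresholding rule can be written directly; the constant shift by λ for large |y| (the bias) is visible in the output of this small, made-up example:

# scalar LASSO solution: S_lambda(y) = sign(y) * max(|y| - lambda, 0)
soft_threshold <- function(y, lambda) sign(y) * pmax(abs(y) - lambda, 0)
soft_threshold(c(-3, -0.5, 0.5, 3), lambda = 1)   # -2 0 0 2: large inputs shrunk by exactly 1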

Page 46

Elastic Net as a compromise between Ridge and LASSO

Penalty: a combination of the L1 and L2 norms, λ1 ∑_{j} |βj| + λ2 ∑_{j} βj^2 (Zou and Hastie 2005).

Page 47

The Group LASSO

Group norm ℓ1-ℓ2 (also ℓ1-ℓ∞): the ℓ1 norm across groups of the ℓ2 (or ℓ∞) norm within each group, ∑_g ‖β_g‖2.

Every group of variables is simultaneously selected or dropped.

Page 48

Methods using Derived Directions

Principal Components Regression

Partial Least Squares

Page 49

Principal Components Regression

Principal Components Regression: regress y on the first M principal components Z1, …, ZM (M < p).

Motivation: the leading eigenvectors describe most of the variability in X.

[Figure: data in the (X1, X2) plane with the principal component directions Z1 and Z2]

Page 50

Principal Components Regression

Zi and Zj are orthogonal now.

The dimension is reduced.

High correlations between the independent variables are eliminated.

Noise in the X's is (hopefully) removed.

Computation: PCA + Regression
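A minimal sketch of PCR as PCA followed by least squares on the first M scores (M is assumed given here; in practice it is chosen by cross validation, and the function name is illustrative):

# principal components regression: PCA + regression on the first M components
pcr_fit <- function(X, y, M) {
  pc <- prcomp(X, center = TRUE, scale. = TRUE)   # principal components of X
  Z  <- pc$x[, 1:M, drop = FALSE]                 # derived orthogonal inputs Z1, ..., ZM
  lm(y ~ Z)                                       # regress the response on the scores
}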

Page 51

Partial Least Square

Partial Least Squares (PLS)

Uses inputs as well as response y to form the directions Zm

Seeks directions that have high variance and high correlation with the response y.

Popular in chemometrics.

If the original inputs are orthogonal, PLS finds the LS solution after one step; subsequent steps have no effect.

Since the derived inputs use y, the estimates are non-linear functions of the response, when the inputs are not orthogonal

The coefficients for original variables tend to shrink as fewer PLS directions are used

Choice of M can be made via cross validation

Page 52

Partial Least Square Algorithm
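The algorithm shown on this slide is not reproduced in the transcript; the sketch below follows the standard PLS recipe (standardize the inputs; at each step weight them by their inner products with y to form a direction z_m, regress y on z_m, then orthogonalize the inputs with respect to z_m). Names are illustrative:

# partial least squares: a sketch of the standard algorithm
pls_fit <- function(X, y, M) {
  X <- scale(X)
  yhat <- rep(mean(y), length(y))
  Z <- matrix(0, nrow(X), M)
  for (m in 1:M) {
    phi   <- drop(t(X) %*% y)              # weights <xj, y> for the current inputs
    z     <- drop(X %*% phi)               # m-th derived direction
    theta <- sum(z * y) / sum(z * z)       # regress y on z
    yhat  <- yhat + theta * z              # update the fit
    X <- X - z %*% t(drop(t(X) %*% z) / sum(z * z))   # orthogonalize inputs w.r.t. z
    Z[, m] <- z
  }
  list(fitted = yhat, directions = Z)
}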

Page 53

PCR vs. PLS

Principal Components Regression chooses directions by maximizing Var(Xα) over normalized α, subject to orthogonality to the earlier directions.

Partial Least Squares chooses its m-th direction by maximizing Corr²(y, Xα) · Var(Xα) over normalized α, subject to orthogonality to the earlier directions.

The variance term tends to dominate, whence PLS behaves much like ridge regression and PCR.

Page 54

Ridge, PCR and PLS

The solution paths of the different methods in a two-variable regression case (corr(X1, X2) = ρ, β = (4, 2)).

Page 55

Comparisons of Estimated Prediction Errors (Prostate Cancer Example)

Least Squares — test error: 0.586, sd of error: (0.184). The other methods compared report test errors (sd) of 0.574 (0.156), 0.540 (0.168), 0.636 (0.172), 0.491 (0.152), and 0.527 (0.122).

Page 56

LASSO and Forward Stagewise

Page 57

Diabetes Data

Page 58

LASSO and Forward Stagewise

Page 59

Least Angle Regression (LARS)

Efron, Hastie, Johnstone, and Tibshirani (2003)

Page 60

Recall: Forward Stagewise Selection

(Incremental) Forward stagewise: standardize the input variables, then build up the coefficients in many small steps (see the sketch on page 28 above).

Page 61

LAR directions and Example


Page 62

Relationship between those three

• LASSO and forward stagewise can be thought of as restricted versions of LARS.

• For LASSO: start with LARS; if a coefficient crosses zero, stop, drop that predictor, recompute the best direction, and continue. This gives the LASSO path.

• For stagewise: start with LARS; select the most correlated direction at each stage and move in that direction by a small step.

• There are other related methods:

• Orthogonal Matching Pursuit

• Linearized Bregman Iteration

Page 63

Homework Project I

Keyword Pricing (regression)

Page 64

Homework Project II

Click Prediction (classification): two subproblems

click/impression

click/bidding

Data Directory: /data/ipinyou/

Files:

bid.20130301.txt: Bidding log file, 1.2M rows, 470MB

imp.20130301.txt: Impression log, 0.8M rows, 360MB

clk.20130301.txt: Click log file, 796 rows, 330KB

data.zip: compressed files above (Password: ipinyou2013)

dsp_bidding_data_format.pdf: format file

Region&citys.txt: Region and City code

Questions: [email protected]

Page 65

Homework Project II

Data input in R:

bid <- read.table("/Users/Liaohairen/DSP/bid.20130301.txt", sep='\t', comment.char='')

imp <- read.table("/Users/Liaohairen/DSP/imp.20130301.txt", sep='\t', comment.char='')

R's read.table uses '#' as the comment character by default (comment.char = '#'), but the user-agent field may contain a '#' character. To read the files correctly, turn off interpretation of comments by setting comment.char=''.

Page 66

Homework Project III

Heart Operation Effect Prediction (classification)

Note: a large amount of missing values.