Page 1

Data Mining Stat 588

Lecture 02: Linear Methods for Regression

Department of Statistics & Biostatistics, Rutgers University

September 13, 2011

Page 2

Regression Problem

Generic quantitative output variable $Y$.

Generic input vector $X = (X_1, \dots, X_p)^T$.

We seek a regression function $f(X)$ to predict $Y$.

If the pair $(X, Y)$ has a joint probability distribution and we use squared-error loss $L(Y, f(X)) = (Y - f(X))^2$, then the regression function minimizing the expected loss $E_{X,Y}(Y - f(X))^2$ is

$$f(X) = E_{Y|X}(Y \mid X).$$

Target

Typically we have a set of training data $(x_1, y_1), \dots, (x_N, y_N)$, from which we want to learn the regression function $f(X)$ or, in statistical language, estimate $f(X)$.

Page 3

Linear Regression Models

The linear regression model assumes $f(X)$ takes the form

$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j.$$

Under the decision-theoretic framework, the linear model assumes that either the regression function E(Y | X) is linear, or that the linear form is a reasonable approximation. The variables Xj can come from different sources (a small design-matrix sketch in code follows this list):

quantitative inputs;

transformations of quantitative inputs, such as log, square-root, or square;

basis expansions, such as $X_2 = X_1^2$, $X_3 = X_1^3$, leading to a polynomial representation;

numeric or “dummy” coding of the levels of qualitative inputs. For example, if G is a five-level factor input, we might create Xj, j = 1, . . . , 5, such that Xj = I(G = j). Together this group of Xj represents the effect of G by a set of level-dependent constants;

interactions between variables, for example, X3 = X1 ·X2.
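As a quick illustration of these constructions (not from the slides), here is a minimal numpy sketch with made-up inputs — `age`, `weight`, and a five-level factor `G` are hypothetical names — that builds a design matrix containing a transformation, a basis expansion, dummy codes, and an interaction.

```python
import numpy as np

# Hypothetical raw inputs: two quantitative variables and a 5-level factor G.
age = np.array([35.0, 42.0, 58.0, 61.0, 47.0])
weight = np.array([70.0, 82.0, 91.0, 64.0, 77.0])
G = np.array([1, 3, 2, 5, 4])                # levels 1..5 of a qualitative input

# Derived columns, one for each bullet above.
log_age = np.log(age)                        # transformation of a quantitative input
age_sq = age ** 2                            # basis expansion X^2
dummies = (G[:, None] == np.arange(1, 6)).astype(float)   # X_j = I(G = j), j = 1..5
interaction = age * weight                   # interaction X_1 * X_2

# Assemble the design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(len(age)), age, log_age, age_sq, dummies, interaction])
print(X.shape)                               # (5, 10)
```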

Page 4

Least Squares Method

Training set: $(x_1, y_1), \dots, (x_N, y_N)$.

$x_i = (x_{i1}, \dots, x_{ip})^T$.

$\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$.

The most popular estimation method is least squares, in which we pick the coefficients $\beta$ to minimize the residual sum of squares

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2.$$

Page 5

Least Squares Method

[Figure 3.1, Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman, 2009: Linear least squares fitting with X ∈ IR². We seek the linear function of X that minimizes the sum of squared residuals from Y.]

Page 6

Normal Equation

$\mathbf{x}_j = (x_{1j}, x_{2j}, \dots, x_{Nj})^T$ denotes the N measurements on the jth feature/input.

$\mathbf{1}$ is an N-dimensional vector with all entries equal to one.

$\mathbf{X} = (\mathbf{1}, \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_p)$ is an $N \times (p+1)$ matrix.

The least squares solution $\hat\beta$ satisfies the normal equations

$$X^T X \hat\beta = X^T y,$$

and, when $X^T X$ is nonsingular, is given by

$$\hat\beta = (X^T X)^{-1} X^T y.$$

The fitted values at the training inputs are

$$\hat y = X\hat\beta = \underbrace{X (X^T X)^{-1} X^T}_{\text{hat matrix}}\; y.$$
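A minimal numpy sketch of the normal equations on simulated data (illustrative only; it assumes $X^T X$ is nonsingular). In practice one would usually call np.linalg.lstsq rather than forming the hat matrix explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1), intercept column first
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=N)

# Solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values via the hat matrix H = X (X^T X)^{-1} X^T.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y                                # same as X @ beta_hat
print(np.allclose(y_hat, X @ beta_hat))      # True
```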

Page 7

Geometric Interpretation

[Figure 3.2, Elements of Statistical Learning (2nd Ed.): The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x₁ and x₂. The projection ŷ represents the vector of the least squares predictions.]

Page 8

Statistical Inference

The $x_i$ are regarded as fixed.

The $y_i$ are uncorrelated and have constant variance $\sigma^2$.

These assumptions are usually enough for large-sample/asymptotic theory.

The variance matrix of the least squares parameter estimates is given by

$$\mathrm{Var}(\hat\beta) = (X^T X)^{-1}\sigma^2.$$

The usual estimate of the variance $\sigma^2$ is

$$\hat\sigma^2 = \frac{1}{N - p - 1}\sum_{i=1}^{N} (y_i - \hat y_i)^2.$$
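The following sketch (not from the slides) turns the two displays above into code: it estimates $\sigma^2$, forms $(X^TX)^{-1}\hat\sigma^2$, and reports the standard errors $\hat\sigma\sqrt{v_j}$. The helper name ols_inference is hypothetical.

```python
import numpy as np

def ols_inference(X, y):
    """X is N x (p+1) with an intercept column; y is the response."""
    N, q = X.shape                                   # q = p + 1
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (N - q)             # unbiased estimate of sigma^2
    cov_beta = np.linalg.inv(X.T @ X) * sigma2_hat   # Var(beta_hat) = (X^T X)^{-1} sigma^2
    se = np.sqrt(np.diag(cov_beta))                  # standard errors sigma_hat * sqrt(v_j)
    return beta_hat, sigma2_hat, se
```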

Page 9

Exact Distribution Theory

Generic model.

$$Y = \beta_0 + \sum_{j=1}^{p} X_j\beta_j + \varepsilon.$$

The error $\varepsilon \sim N(0, \sigma^2)$.

For the training data, the errors $\varepsilon_i$, $1 \le i \le N$, are i.i.d.

Under this model assumption,

$$\hat\beta \sim N\!\big(\beta,\ (X^T X)^{-1}\sigma^2\big) \qquad\text{and}\qquad (N - p - 1)\,\hat\sigma^2 \sim \sigma^2\,\chi^2_{N-p-1}.$$

Moreover, $\hat\beta$ and $\hat\sigma^2$ are independent.

Page 10

Hypothesis Testing

To test the hypothesis that a particular coefficient $\beta_j = 0$, we use the Z-score

$$z_j = \frac{\hat\beta_j}{\hat\sigma\sqrt{v_j}} \sim t_{N-p-1} \ \text{under the null},$$

where $v_j$ is the $j$th diagonal element of $(X^T X)^{-1}$.

To test for the significance of a group of coefficients simultaneously, we use the F-statistic

$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)} \sim F_{p_1 - p_0,\, N - p_1 - 1} \ \text{under the null},$$

where $\mathrm{RSS}_1$ is the residual sum of squares for the least squares fit of the bigger model with $p_1 + 1$ parameters, and $\mathrm{RSS}_0$ the same for the nested smaller model with $p_0 + 1$ parameters, having $p_1 - p_0$ parameters constrained to be zero.
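A small sketch of the nested-model F-test, assuming scipy is available for the F tail probability; the function name f_test and the argument names are illustrative.

```python
import numpy as np
from scipy import stats

def f_test(X_small, X_big, y):
    """F-test of a nested model (p0 + 1 columns incl. intercept) against a bigger
    model (p1 + 1 columns). Assumes X_small's columns are a subset of X_big's."""
    def rss(X):
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        r = y - X @ beta
        return r @ r

    N = len(y)
    p0, p1 = X_small.shape[1] - 1, X_big.shape[1] - 1
    rss0, rss1 = rss(X_small), rss(X_big)
    F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
    p_value = stats.f.sf(F, p1 - p0, N - p1 - 1)    # P(F_{p1-p0, N-p1-1} > F)
    return F, p_value
```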

Page 11

Confidence Region

When the sample size $N$ is sufficiently large, the distribution of $(\hat\beta_j - \beta_j)/(\hat\sigma\sqrt{v_j})$ is well approximated by $N(0, 1)$, and an approximate $1-\alpha$ confidence interval for $\beta_j$ is given by

$$\big(\hat\beta_j - z^{(1-\alpha/2)}\,\hat\sigma\sqrt{v_j},\ \ \hat\beta_j + z^{(1-\alpha/2)}\,\hat\sigma\sqrt{v_j}\big),$$

where $z^{(1-\alpha/2)}$ is the $(1-\alpha/2)$ percentile of the standard normal distribution; when $N$ is not very large, it should be replaced by $t^{(1-\alpha/2)}_{N-p-1}$, the $(1-\alpha/2)$ percentile of the $t_{N-p-1}$ distribution.

Similarly we can obtain an approximate confidence set for $\beta$,

$$C_\beta = \Big\{\beta : (\hat\beta - \beta)^T X^T X (\hat\beta - \beta) \le \hat\sigma^2\, \chi^{2\,(1-\alpha)}_{p+1}\Big\},$$

where $\chi^{2\,(1-\alpha)}_{p+1}$ is the $(1-\alpha)$ percentile of the $\chi^2_{p+1}$ distribution.
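A sketch of the coefficient intervals above, assuming scipy for the normal and t quantiles; confidence_intervals is a hypothetical helper that recomputes the quantities from the earlier slides.

```python
import numpy as np
from scipy import stats

def confidence_intervals(X, y, alpha=0.05, use_t=True):
    """1 - alpha confidence intervals for each coefficient, using the t quantile
    (or the normal approximation when use_t=False). X includes the intercept column."""
    N, q = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (N - q))
    v = np.diag(np.linalg.inv(X.T @ X))
    crit = stats.t.ppf(1 - alpha / 2, df=N - q) if use_t else stats.norm.ppf(1 - alpha / 2)
    half = crit * sigma_hat * np.sqrt(v)
    return np.column_stack([beta_hat - half, beta_hat + half])
```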

Page 12

Gauss-Markov Theorem

An estimate $\tilde\beta$ is called a linear unbiased estimate (LUE) of $\beta$ if

(i) it is linear in $y$, that is, $\tilde\beta = C^T y$ for some $C \in \mathbb{R}^{N\times(p+1)}$;

(ii) it is unbiased, that is, $E_\beta\,\tilde\beta = \beta$ for all $\beta \in \mathbb{R}^{p+1}$.

Theorem (Gauss-Markov Theorem)

The least squares estimate $\hat\beta$ has the smallest variance among all LUEs of $\beta$:

(a) $\hat\beta$ is a LUE.

(b) If $\tilde\beta$ is any other LUE, then $\mathrm{Var}(\hat\beta) \preceq \mathrm{Var}(\tilde\beta)$, in the sense that the matrix $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta)$ is positive semi-definite.

Page 13

Bias-Variance Tradeoff

[Figure 2.11, Elements of Statistical Learning (2nd Ed.): Test and training error as a function of model complexity. Prediction error is plotted against model complexity (low to high) for the training sample and the test sample; the low-complexity end corresponds to high bias and low variance, the high-complexity end to low bias and high variance.]

For a k-nearest-neighbor fit at a point x0, when k is small the averaged neighbors have values close to f(x0), so the bias is small; as k grows, the neighbors are further away, and then anything can happen. The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias-variance tradeoff.

More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of the test error is the training error $\frac{1}{N}\sum_i (y_i - \hat y_i)^2$. Unfortunately the training error is not a good estimate of the test error, as it does not properly account for model complexity.

Figure 2.11 shows the typical behavior of the test and training error as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However, with too much fitting, the model adapts itself too closely to the training data and will not generalize well (i.e., it will have large test error). In that case the predictions $\hat f(x_0)$ will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.

Figure: Test and training error as functions of model complexity.

Page 14

Subset Selection

There are two reasons why we are often not satisfied with the least squares estimates.

The first is prediction accuracy: the least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By doing so we sacrifice a little bit of bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy.

The second reason is interpretation. With a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects. In order to get the “big picture,” we are willing to sacrifice some of the small details.

Page 15

Best Subset Selection

For each k ∈ {0, 1, 2, . . . , p}, find the subset of size k that gives the smallest residual sum of squares (a brute-force sketch of this search follows the list below).

The best subset of size 2, for example, need not include the variable that was in the best subset of size 1.

The smallest residual sum of squares as a function of k is necessarily decreasing, so it cannot be used to select the subset size k.

The question of how to choose k involves the tradeoff between bias and variance, along with the more subjective desire for parsimony.

Typically we choose the smallest model that minimizes an estimate of the expected prediction error.

Many of the other approaches are similar, in that they use the training data to produce a sequence of models varying in complexity and indexed by a single parameter. Popular methods for selecting the right parameter (the subset size here) include cross-validation, AIC, BIC, etc. More details later.
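The brute-force search referred to in the first bullet could look like the following numpy sketch (illustrative only; for each size k it refits every candidate model with lstsq, which is feasible only for small p).

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Search all size-k subsets of the p predictors (columns of X, no intercept
    column) and return the subset with the smallest residual sum of squares."""
    N, p = X.shape
    ones = np.ones((N, 1))
    best = (np.inf, ())
    for subset in combinations(range(p), k):
        Xs = np.hstack([ones, X[:, subset]])          # intercept + chosen predictors
        beta = np.linalg.lstsq(Xs, y, rcond=None)[0]
        r = y - Xs @ beta
        rss = r @ r
        if rss < best[0]:
            best = (rss, subset)
    return best                                        # (smallest RSS, best size-k subset)
```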

Page 16

Best Subset Selection

[Figure 3.5, Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman, 2009: All possible subset models for the prostate cancer example. At each subset size k = 0, 1, . . . , 8, the residual sum-of-squares of every model of that size is shown.]

Page 17

Forward-Stepwise Selection

Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit (a greedy-search sketch follows this list).

One can exploit the QR decomposition of the current fit to rapidly establish the next candidate.

It produces a sequence of models indexed by k, the subset size, which must be determined.

Pros: computationally efficient; smaller variance as compared to best subset selection, but perhaps more bias.

Cons: errors made at the beginning cannot be corrected later.
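A naive sketch of the greedy search (illustrative; it refits each candidate model from scratch with lstsq rather than updating a QR decomposition as suggested above).

```python
import numpy as np

def forward_stepwise(X, y, k_max):
    """Start from the intercept-only model and at each step add the predictor
    that most reduces the residual sum of squares."""
    N, p = X.shape
    ones = np.ones((N, 1))
    active, path = [], []
    for _ in range(min(k_max, p)):
        best_rss, best_j = np.inf, None
        for j in range(p):
            if j in active:
                continue
            Xc = np.hstack([ones, X[:, active + [j]]])
            beta = np.linalg.lstsq(Xc, y, rcond=None)[0]
            r = y - Xc @ beta
            rss = r @ r
            if rss < best_rss:
                best_rss, best_j = rss, j
        active.append(best_j)
        path.append((list(active), best_rss))
    return path          # sequence of (active set, RSS), indexed by subset size
```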

Page 18

Backward-Stepwise Selection

Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit.

The candidate for dropping is the variable with the smallest Z-score.

Pros: can throw out the “right” predictor by looking at the full model.

Cons: computationally inefficient (it starts with the full model); cannot be used if p ≥ N.

Page 19

Hybrid-Stepwise Selection

Hybrid-stepwise selection considers both forward and backward moves at each step, and selects the better of the two.

Pros: computationally efficient; errors made at an earlier stage can be corrected later.

We need a criterion to decide whether to add or drop at each step; e.g., AIC takes proper account of both the number of parameters and how well the model fits.

Page 20

Forward-Stagewise Regression

Forward-stagewise regression starts with the intercept equal to ȳ, and with centered predictors whose coefficients are initially all 0.

At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable (a sketch of this update loop follows the list).

It can take many more than p steps to reach the least squares fit, so it was historically viewed as inefficient.

Quite competitive in very high dimensional problems.
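A sketch of the update loop described above, assuming the columns of X are centered and scaled to equal norms so that Xᵀr is proportional to the correlations with the residual; the function name is hypothetical.

```python
import numpy as np

def forward_stagewise(X, y, n_steps=1000):
    """Forward-stagewise regression sketch. Assumes the columns of X are centered
    and scaled to equal norms; the intercept is handled as the mean of y."""
    N, p = X.shape
    intercept = y.mean()
    r = y - intercept                  # current residual
    beta = np.zeros(p)
    for _ in range(n_steps):
        scores = X.T @ r               # proportional to correlations under the scaling assumption
        j = np.argmax(np.abs(scores))  # variable most correlated with the residual
        delta = (X[:, j] @ r) / (X[:, j] @ X[:, j])   # simple regression coefficient
        if np.isclose(delta, 0.0):
            break
        beta[j] += delta               # add it to the current coefficient
        r -= delta * X[:, j]           # update the residual
    return intercept, beta
```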

Page 21

Comparison of Four Subset Selection Methods

[Figure 3.6, Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman, 2009: Comparison of four subset-selection techniques (best subset, forward stepwise, backward stepwise, forward stagewise) on a simulated linear regression problem Y = Xᵀβ + ε. There are N = 300 observations on p = 31 standard Gaussian variables, with pairwise correlations all equal to 0.85. For 10 of the variables, the coefficients are drawn at random from a N(0, 0.4) distribution; the rest are zero. The noise ε ∼ N(0, 6.25) gives a signal-to-noise ratio of 0.64. Results are averaged over 50 simulations. Shown is the mean-squared error E‖β̂(k) − β‖² of the estimated coefficient β̂(k) at each step, versus subset size k.]

Page 22

Shrinkage Methods

Subset selection produces a model that is interpretable and has possibly lower prediction error than the full model.

It is a discrete process: variables are either retained or discarded. So it often exhibits high variance, and doesn't reduce the prediction error of the full model.

Shrinkage methods are more continuous, and don't suffer as much from high variability.

Page 23

Motivation: James-Stein’s Estimate

$y_1, y_2, \dots, y_N$ are i.i.d. $N(\mu, I_p)$.

$\mu \in \mathbb{R}^p$ is unknown, and we want to estimate it.

$\bar y = \frac{1}{N}\sum_{i=1}^{N} y_i$ is sufficient, the MLE, and the BLUE.

We say an estimate $\hat\mu$ of $\mu$ is inadmissible if there exists another estimate $\tilde\mu$ such that

(i) $E_\mu\|\tilde\mu - \mu\|_2^2 \le E_\mu\|\hat\mu - \mu\|_2^2$ for all $\mu \in \mathbb{R}^p$;

(ii) $E_{\mu^*}\|\tilde\mu - \mu^*\|_2^2 < E_{\mu^*}\|\hat\mu - \mu^*\|_2^2$ for some $\mu^* \in \mathbb{R}^p$.

Otherwise $\hat\mu$ is said to be admissible.

Theorem (Stein 1956)

(a) If p ≤ 2, then $\bar y$ is admissible.

(b) If p > 2, then $\bar y$ is inadmissible.

Theorem (James-Stein, 1961)

If p ≥ 3, then $\hat\mu^{JS} = \big[1 - (p-2)/\|\bar y\|^2\big]\,\bar y$ has everywhere smaller MSE than $\bar y$.
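A quick Monte Carlo check of the James-Stein phenomenon in its classical single-observation form y ∼ N(µ, I_p), p ≥ 3 (a simplification of the slide's setting; the chosen p, true mean, and replication count are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_rep = 10, 20000
mu = np.full(p, 0.5)                       # an arbitrary true mean, assumed for illustration

y = mu + rng.normal(size=(n_rep, p))       # one observation per replication
norm2 = np.sum(y ** 2, axis=1)
js = (1.0 - (p - 2) / norm2)[:, None] * y  # James-Stein shrinkage toward 0

mse_mle = np.mean(np.sum((y - mu) ** 2, axis=1))
mse_js = np.mean(np.sum((js - mu) ** 2, axis=1))
print(mse_mle, mse_js)                     # the James-Stein estimate has smaller risk
```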

Page 24

Ridge Regression

The ridge regression solves the optimization problem

$$\hat\beta^{\text{ridge}} = \arg\min_{\beta}\left\{\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\right\}$$

for some $\lambda \ge 0$. An equivalent form is

$$\hat\beta^{\text{ridge}} = \arg\min_{\beta}\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^2 \quad\text{subject to}\quad \sum_{j=1}^{p}\beta_j^2 \le t,$$

for some t ≥ 0.

Page 25

Ridge Regression

Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the greater the amount of shrinkage. The coefficients are shrunk toward zero (and each other).

The ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizes the inputs before solving the optimization problem.

The intercept β0 has been left out of the penalty term. We can solve the optimization problem in two steps.

(1) Estimate β0 by $\bar y = \frac{1}{N}\sum_{i=1}^{N} y_i$.

(2) Center y and normalize each $\mathbf{x}_j$, 1 ≤ j ≤ p. The remaining coefficients are then estimated by a ridge regression without intercept, using the normalized $\mathbf{x}_j$.

Page 26

Ridge Regression

We assume

(i) the output vector y is centered, that is, $\sum_{i=1}^{N} y_i = 0$;

(ii) each predictor $\mathbf{x}_j$, 1 ≤ j ≤ p, is normalized, i.e. $\sum_{i=1}^{N} x_{ij} = 0$ and $\sum_{i=1}^{N} x_{ij}^2 = 1$ for all $1 \le j \le p$;

(iii) the input matrix X has p (rather than p + 1) columns;

and solve the problem (here $\beta = (\beta_1, \dots, \beta_p)^T$)

$$\hat\beta^{\text{ridge}} = \arg\min_{\beta\in\mathbb{R}^p}\big\{\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2\big\}.$$

Ridge regression has a closed-form solution

$$\hat\beta^{\text{ridge}} = \big(X^T X + \lambda I\big)^{-1} X^T y.$$
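A direct transcription of the closed form into numpy, assuming y is centered and the columns of X are centered and normalized as on this slide.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge solution (X^T X + lam I)^{-1} X^T y for centered y and
    centered/normalized columns of X (no intercept column)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# As lam -> 0 this recovers ordinary least squares; as lam grows,
# the coefficients shrink toward zero.
```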

Page 27

[Figure 3.8, Elements of Statistical Learning (2nd Ed.): Profiles of ridge coefficients for the prostate cancer example (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45) as the tuning parameter λ is varied. Coefficients are plotted versus df(λ), the effective degrees of freedom. A vertical line is drawn at df = 5.0, the value chosen by cross-validation.]

Page 28

Lasso

The LASSO solves the optimization problem

$$\hat\beta^{\text{lasso}} = \arg\min_{\beta}\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^2 \quad\text{subject to}\quad \sum_{j=1}^{p}|\beta_j| \le t,$$

for some $t \ge 0$. The equivalent Lagrangian form is

$$\hat\beta^{\text{lasso}} = \arg\min_{\beta}\left\{\frac{1}{2}\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|\right\}$$

for some λ ≥ 0.
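The slides do not give an algorithm for the lasso; one standard choice is coordinate descent on the Lagrangian form, sketched below under the assumption that y is centered and each column of X has unit norm, so each coordinate update is a soft-thresholding step.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1, assuming y is
    centered and each column of X has unit norm."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                                    # current residual (y - X beta)
    for _ in range(n_iter):
        for j in range(p):
            r_partial = r + X[:, j] * beta[j]       # residual excluding feature j
            beta[j] = soft_threshold(X[:, j] @ r_partial, lam)
            r = r_partial - X[:, j] * beta[j]
    return beta
```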

Page 29

Comparison: Subset Selection, Ridge Regression and Lasso

When the columns of X are orthonormal, the three procedures act on the least squares estimates $\hat\beta_j$ through the explicit formulas below (Table 3.4 of ESL). Here M and λ are constants chosen by the corresponding technique, sign denotes the sign of its argument (±1), and $x_+$ denotes the positive part of x. In ESL, each estimator is plotted below the table as a broken red line against the 45° gray line of the unrestricted estimate.

  Estimator              Formula
  Best subset (size M)   $\hat\beta_j \cdot I\big(|\hat\beta_j| \ge |\hat\beta_{(M)}|\big)$
  Ridge                  $\hat\beta_j / (1 + \lambda)$
  Lasso                  $\mathrm{sign}(\hat\beta_j)\,\big(|\hat\beta_j| - \lambda\big)_+$
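The three formulas in the table are easy to apply directly; the sketch below (illustrative, with an arbitrary coefficient vector) implements hard thresholding at |β̂(M)|, proportional shrinkage, and soft thresholding.

```python
import numpy as np

def best_subset_orthonormal(beta_ls, M):
    cutoff = np.sort(np.abs(beta_ls))[::-1][M - 1]   # |beta_(M)|, the M-th largest magnitude
    return np.where(np.abs(beta_ls) >= cutoff, beta_ls, 0.0)

def ridge_orthonormal(beta_ls, lam):
    return beta_ls / (1.0 + lam)

def lasso_orthonormal(beta_ls, lam):
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

beta_ls = np.array([2.5, -1.2, 0.4, -0.1])           # hypothetical least squares coefficients
print(best_subset_orthonormal(beta_ls, 2))            # keeps the 2 largest, zeroes the rest
print(ridge_orthonormal(beta_ls, 1.0))                # proportional shrinkage
print(lasso_orthonormal(beta_ls, 0.5))                # soft thresholding
```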


Page 30

Comparison: Ridge Regression and Lasso

[Figure 3.11, Elements of Statistical Learning (2nd Ed.): Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β₁| + |β₂| ≤ t and β₁² + β₂² ≤ t², respectively, while the red ellipses are the contours of the least squares error function.]