Data Mining Stat 588
Lecture 02: Linear Methods for Regression
Department of Statistics & Biostatistics, Rutgers University
September 13, 2011
Regression Problem
Generic quantitative output variable Y.
Generic input vector X = (X_1, …, X_p)^T.
A regression function f(X) used to predict Y.
If the pair (X, Y) has a joint probability distribution and we take the squared error loss
L(Y, f(X)) = (Y − f(X))²,
then the expected prediction error E_{X,Y}(Y − f(X))² is minimized by the regression function
f(X) = E(Y | X).
Target
Typically we have a set of training data (x_1, y_1), …, (x_N, y_N), from which we want to learn the regression function f(X), or, in statistical language, to estimate f(X).
Linear Regression Models
The linear regression model assumes f(X) takes the form
f(X) = β_0 + Σ_{j=1}^p X_j β_j.
Under the decision theory framework, the linear model assumes that either the regression function E(Y | X) is linear, or that the linear form is a reasonable approximation.
The variables X_j can come from different sources:
quantitative inputs;
transformations of quantitative inputs, such as log, square-root or square;
basis expansions, such as X_2 = X_1², X_3 = X_1³, leading to a polynomial representation;
numeric or “dummy” coding of the levels of qualitative inputs. For example, if G is a five-level factor input, we might create X_j, j = 1, …, 5, such that X_j = I(G = j). Together this group of X_j represents the effect of G by a set of level-dependent constants.
interactions between variables, for example, X3 = X1 ·X2.
Least Squares Methods
Training set: (x1, y1), . . . , (xN , yN ).
xi = (xi1, . . . , xip)T .
β = (β0, β1, . . . , βp)T .
The most popular estimation method is least squares, in which we pick the coefficients β to minimize the residual sum of squares
RSS(β) = Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )².
Least Squares Methods
Figure 3.1 (Elements of Statistical Learning, 2nd Ed., Hastie, Tibshirani & Friedman 2009, Chap 3): Linear least squares fitting with X ∈ ℝ². We seek the linear function of X that minimizes the sum of squared residuals from Y.
Normal Equation
x_j = (x_{1j}, x_{2j}, …, x_{Nj})^T denotes the N measurements on the jth feature/input.
1 is an N-dimensional vector with all entries equal to one.
X = (1, x_1, x_2, …, x_p) is an N × (p + 1) matrix.
The least squares solution β̂ satisfies the normal equations
X^T X β̂ = X^T y,
and, when X^T X is invertible, is given by
β̂ = (X^T X)^{−1} X^T y.
The fitted values at the training inputs are
ŷ = X β̂ = X (X^T X)^{−1} X^T y,
where H = X (X^T X)^{−1} X^T is called the hat matrix.
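A minimal NumPy sketch of the normal-equation solution above; the data and variable names are illustrative, and in practice one would prefer a QR-based solver (e.g. `numpy.linalg.lstsq`) over forming X^T X explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X0 = rng.normal(size=(N, p))              # raw inputs
X = np.column_stack([np.ones(N), X0])     # prepend the 1 column: N x (p+1)
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values via the hat matrix H = X (X^T X)^{-1} X^T.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
```

With this low noise level, `beta_hat` closely recovers `beta_true`, and `y_hat` coincides with `X @ beta_hat`.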
Geometric Interpretation
Figure 3.2 (ESL Chap 3): The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x_1 and x_2. The projection ŷ represents the vector of the least squares predictions.
Statistical Inference
x_i are fixed.
y_i are uncorrelated and have constant variance σ².
These assumptions are usually enough for large-sample/asymptotic theory.
The variance matrix of the least squares parameter estimates is
Var(β̂) = (X^T X)^{−1} σ².
The usual estimate of the variance σ² is
σ̂² = (1/(N − p − 1)) Σ_{i=1}^N (y_i − ŷ_i)².
Exact Distribution Theory
Generic model:
Y = β_0 + Σ_{j=1}^p X_j β_j + ε.
The error ε ∼ N(0, σ²).
For the training data, ε_i, 1 ≤ i ≤ N, are i.i.d.
Under this model assumption,
β̂ ∼ N( β, (X^T X)^{−1} σ² )  and  (N − p − 1) σ̂² ∼ σ² χ²_{N−p−1}.
Moreover, β̂ and σ̂² are independent.
Hypothesis Testing
To test the hypothesis that a particular coefficient β_j = 0, we use the Z-score
z_j = β̂_j / (σ̂ √v_j) ∼ t_{N−p−1} under the null,
where v_j is the jth diagonal element of (X^T X)^{−1}.
To test for the significance of groups of coefficients simultaneously, we use the F-statistic
F = [(RSS_0 − RSS_1)/(p_1 − p_0)] / [RSS_1/(N − p_1 − 1)] ∼ F_{p_1−p_0, N−p_1−1} under the null,
where RSS_1 is the residual sum of squares for the least squares fit of the bigger model with p_1 + 1 parameters, and RSS_0 the same for the nested smaller model with p_0 + 1 parameters, having p_1 − p_0 parameters constrained to be zero.
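A hedged sketch of computing the per-coefficient Z-scores z_j = β̂_j / (σ̂ √v_j); the simulated data and names are illustrative, with only one truly nonzero coefficient so the contrast is visible.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([0.0, 3.0, 0.0]) + rng.normal(size=N)  # only X_1 matters

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (N - p - 1)        # sigma-hat squared
v = np.diag(np.linalg.inv(X.T @ X))             # v_j, jth diagonal element
z = beta_hat / np.sqrt(sigma2_hat * v)          # Z-scores
```

Here |z| is large only for the coefficient that is truly nonzero; comparing |z_j| to t_{N−p−1} percentiles gives the test.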
Confidence Region
When the sample size N is sufficiently large, the distribution of (β̂_j − β_j)/(σ̂ √v_j) is well approximated by N(0, 1), and an approximate 1 − α confidence interval for β_j is given by
( β̂_j − z^{(1−α/2)} σ̂ √v_j ,  β̂_j + z^{(1−α/2)} σ̂ √v_j ),
where z^{(1−α/2)} is the (1 − α/2)th percentile of the standard normal distribution. When N is not very large, z^{(1−α/2)} should be replaced by t^{(1−α/2)}_{N−p−1}, the (1 − α/2)th percentile of the t_{N−p−1} distribution.
Similarly we can obtain an approximate confidence set for β,
C_β = { β : (β̂ − β)^T X^T X (β̂ − β) ≤ σ̂² χ²_{p+1}^{(1−α)} },
where χ²_{p+1}^{(1−α)} is the (1 − α)th percentile of χ²_{p+1}.
Gauss-Markov Theorem
An estimate β̃ is called a linear unbiased estimate (LUE) of β if
(i) it is linear in y, that is, β̃ = C^T y for some C ∈ ℝ^{N×(p+1)};
(ii) it is unbiased, that is, E_β β̃ = β for all β ∈ ℝ^{p+1}.
Theorem (Gauss-Markov Theorem)
The least squares estimate β̂ has the smallest variance among all LUEs of β:
(a) β̂ is a LUE.
(b) If β̃ is another LUE, then Var(β̃) ⪰ Var(β̂), in the sense that the matrix Var(β̃) − Var(β̂) is positive semi-definite.
Bias-Variance Tradeoff
Figure 2.11 (ESL Chap 2): Test and training error as a function of model complexity (low to high). Prediction error is shown for the training sample and the test sample; the low-complexity end corresponds to high bias and low variance, the high-complexity end to low bias and high variance.
be close to f(x_0). As k grows, the neighbors are further away, and then anything can happen.
The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias-variance tradeoff. More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.
Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of the test error is the training error (1/N) Σ_i (y_i − ŷ_i)². Unfortunately, training error is not a good estimate of test error, as it does not properly account for model complexity.
Figure 2.11 shows the typical behavior of the test and training error as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However, with too much fitting, the model adapts itself too closely to the training data and will not generalize well (i.e., it will have large test error). In that case the predictions f̂(x_0) will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.
Subset Selection
There are two reasons why we are often not satisfied with the least squares estimates.
The first is prediction accuracy: the least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By doing so we sacrifice a little bit of bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy.
The second reason is interpretation. With a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects. In order to get the “big picture,” we are willing to sacrifice some of the small details.
Best Subset Selection
For each k ∈ {0, 1, 2, …, p}, find the subset of size k that gives the smallest residual sum of squares.
The best subset of size 2, for example, need not include the variable that was in the best subset of size 1.
The smallest residual sum of squares as a function of k is necessarily decreasing, so it cannot be used to select the subset size k.
The question of how to choose k involves the tradeoff between bias and variance, along with the more subjective desire for parsimony.
Typically we choose the smallest model that minimizes an estimate of the expected prediction error.
Many of the other approaches are similar, in that they use the training data to produce a sequence of models varying in complexity and indexed by a single parameter. Popular methods for selecting the right parameter (subset size here) include cross-validation, AIC, BIC, etc. More details later.
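An illustrative sketch of best subset selection by exhaustive search over all size-k subsets; the data and function name are hypothetical, and the approach is only feasible for small p since there are C(p, k) candidate subsets.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Return the size-k subset of columns of X with smallest RSS
    (an intercept is included via a column of ones)."""
    N = len(y)
    best, best_rss = None, np.inf
    for S in combinations(range(X.shape[1]), k):
        Xs = np.column_stack([np.ones(N), X[:, S]])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
        if rss < best_rss:
            best, best_rss = S, rss
    return best, best_rss

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = 2 * X[:, 0] - 3 * X[:, 4] + 0.1 * rng.normal(size=80)
S, rss = best_subset(X, y, 2)   # recovers the two active variables
```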
Best Subset Selection
Figure 3.5 (ESL Chap 3): All possible subset models for the prostate cancer example. At each subset size k (0 through 8) is shown the residual sum of squares for each model of that size.
Forward-Stepwise Selection
Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit.
Can exploit the QR decomposition of the current fit to rapidly establish the next candidate.
It produces a sequence of models indexed by k, the subset size, which must be determined.
Pros: computationally efficient; smaller variance compared to best subset selection, though perhaps more bias.
Cons: errors made at the beginning cannot be corrected later.
Backward-Stepwise Selection
Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit.
The candidate for dropping is the variable with the smallest Z-score.
Pros: can throw out the “right” predictor by looking at the full model.
Cons: computationally inefficient (starts with the full model); cannot work if p ≥ N.
Hybrid-stepwise Selection
Hybrid-stepwise selection considers both forward and backward moves at each step, and selects the better of the two.
Pros: computationally efficient; errors made at an earlier stage can be corrected later.
Needs a criterion to decide whether to add or drop at each step, e.g. AIC, which takes proper account of both the number of parameters and how well the model fits.
Forward-Stagewise Regression
Forward-stagewise regression starts with an intercept equal to ȳ, and centered predictors with coefficients initially all 0.
At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable.
Can take many more than p steps to reach the least squares fit, so it was historically viewed as inefficient.
Quite competitive in very high-dimensional problems.
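The steps above can be sketched as follows; this is a minimal, illustrative implementation (names and simulated data are assumptions), using the variant that adds the full simple-regression coefficient at each step.

```python
import numpy as np

def forward_stagewise(X, y, n_steps=200):
    """Forward-stagewise regression on standardized predictors."""
    X = (X - X.mean(0)) / X.std(0)       # centered, scaled predictors
    r = y - y.mean()                     # residual after the intercept fit
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        corr = X.T @ r
        j = np.argmax(np.abs(corr))      # variable most correlated with residual
        delta = corr[j] / (X[:, j] @ X[:, j])  # simple regression coefficient
        beta[j] += delta                 # add to the current coefficient
        r = r - delta * X[:, j]          # update the residual
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 4 * X[:, 1] + rng.normal(size=100)
beta = forward_stagewise(X, y)           # beta[1] dominates
```

With nearly orthogonal predictors this converges quickly; with correlated predictors it can take many more than p steps, as noted above.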
Comparison of Four Subset Selection Methods
Figure 3.6 (ESL Chap 3): Comparison of four subset-selection techniques (best subset, forward stepwise, backward stepwise, forward stagewise) on a simulated linear regression problem Y = X^T β + ε. There are N = 300 observations on p = 31 standard Gaussian variables, with pairwise correlations all equal to 0.85. For 10 of the variables, the coefficients are drawn at random from a N(0, 0.4) distribution; the rest are zero. The noise ε ∼ N(0, 6.25), resulting in a signal-to-noise ratio of 0.64. Results are averaged over 50 simulations. Shown is the mean squared error E‖β̂(k) − β‖² of the estimated coefficient β̂(k) from the true β, plotted against subset size k.
Shrinkage Methods
Subset selection produces a model that is interpretable and has possibly lower prediction error than the full model.
It is a discrete process: variables are either retained or discarded. So it often exhibits high variance, and doesn't reduce the prediction error of the full model.
Shrinkage methods are more continuous, and don't suffer as much from high variability.
Motivation: James-Stein’s Estimate
y_1, y_2, …, y_N are i.i.d. N(μ, I_p).
μ is unknown, and we want to estimate it.
ȳ = (1/N) Σ_{i=1}^N y_i is sufficient, the MLE, and the BLUE.
We say an estimate μ̂ of μ is inadmissible if there exists another estimate μ̃ such that
(i) E_μ‖μ̃ − μ‖₂² ≤ E_μ‖μ̂ − μ‖₂² for all μ ∈ ℝ^p;
(ii) E_{μ*}‖μ̃ − μ*‖₂² < E_{μ*}‖μ̂ − μ*‖₂² for some μ* ∈ ℝ^p.
Otherwise μ̂ is said to be admissible.
Theorem (Stein, 1956)
(a) If p ≤ 2, then ȳ is admissible.
(b) If p > 2, then ȳ is inadmissible.
Theorem (James-Stein, 1961)
If p ≥ 3, then μ̂_JS = [1 − (p − 2)/‖ȳ‖²] ȳ has everywhere smaller MSE than ȳ.
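A hedged simulation sketch of the James-Stein theorem, comparing the MSE of the James-Stein estimate with that of the plain sample mean; for simplicity it takes N = 1 (so ȳ = y), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, reps = 10, 2000
mu = np.zeros(p)                       # true mean (shrinkage helps most near 0)
y = mu + rng.normal(size=(reps, p))    # one N(mu, I_p) draw per replication

# mu_JS = (1 - (p - 2)/||y||^2) * y, applied row by row
norm2 = np.sum(y ** 2, axis=1, keepdims=True)
mu_js = (1 - (p - 2) / norm2) * y

mse_mle = np.mean(np.sum((y - mu) ** 2, axis=1))     # approx p
mse_js = np.mean(np.sum((mu_js - mu) ** 2, axis=1))  # smaller, since p >= 3
```

At μ = 0 the improvement is dramatic; the theorem guarantees strict improvement at every μ when p ≥ 3.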
Ridge Regression
Ridge regression solves the optimization problem
β̂^ridge = argmin_β { Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )² + λ Σ_{j=1}^p β_j² }
for some λ ≥ 0. An equivalent form is
β̂^ridge = argmin_β Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )²  subject to  Σ_{j=1}^p β_j² ≤ t,
for some t ≥ 0.
Ridge Regression
Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the greater the amount of shrinkage. The coefficients are shrunk toward zero (and toward each other).
The ridge solutions are not equivariant under scaling of the inputs, so one normally standardizes the inputs before solving the optimization problem.
The intercept β_0 has been left out of the penalty term. We can solve the optimization problem in two steps:
(1) Estimate β_0 by ȳ = (1/N) Σ_{i=1}^N y_i.
(2) Center y and normalize each x_j, 1 ≤ j ≤ p. The remaining coefficients are estimated by a ridge regression without intercept, using the normalized x_j.
Ridge Regression
We assume
(i) the output vector y is centered, that is, Σ_{i=1}^N y_i = 0;
(ii) each predictor x_j, 1 ≤ j ≤ p, is normalized, i.e.
Σ_{i=1}^N x_{ij} = 0  and  Σ_{i=1}^N x_{ij}² = 1,  ∀ 1 ≤ j ≤ p;
(iii) the input matrix X has p (rather than p + 1) columns;
and solve the problem (here β = (β_1, …, β_p)^T)
β̂^ridge = argmin_{β ∈ ℝ^p} { ‖y − Xβ‖₂² + λ‖β‖₂² }.
Ridge regression has the closed-form solution
β̂^ridge = (X^T X + λI)^{−1} X^T y.
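A minimal sketch of the closed-form ridge solution under the standardization just described (centered y, columns of X centered with unit sum of squares); the data and names are illustrative.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(0)
X = Xc / np.sqrt(np.sum(Xc ** 2, 0))       # sum_i x_ij^2 = 1 per column
y = X @ np.array([5.0, -2.0, 0.0, 1.0]) + 0.1 * rng.normal(size=50)
y = y - y.mean()                           # centered response

b0 = ridge(X, y, 0.0)   # reduces to least squares when lambda = 0
b1 = ridge(X, y, 1.0)   # coefficients shrink toward zero as lambda grows
```

Note that `ridge(X, y, 0.0)` matches the ordinary least squares fit, and the coefficient norm decreases monotonically in λ.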
Figure 3.8 (ESL Chap 3): Profiles of ridge coefficients (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45) for the prostate cancer example, as the tuning parameter λ is varied. Coefficients are plotted versus df(λ), the effective degrees of freedom. A vertical line is drawn at df = 5.0, the value chosen by cross-validation.
Lasso
The lasso solves the optimization problem
β̂^lasso = argmin_β Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )²  subject to  Σ_{j=1}^p |β_j| ≤ t,
for some t ≥ 0. The equivalent Lagrangian form is
β̂^lasso = argmin_β { (1/2) Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )² + λ Σ_{j=1}^p |β_j| }
for some λ ≥ 0.
Comparison: Subset Selection, Ridge Regression and Lasso
When the columns of X are orthonormal, the three methods have explicit solutions, each a transformation of the least squares estimates β̂_j:
Estimator              Formula
Best subset (size M)   β̂_j · I(|β̂_j| ≥ |β̂_(M)|)
Ridge                  β̂_j / (1 + λ)
Lasso                  sign(β̂_j)(|β̂_j| − λ)₊
Table 3.4 (ESL Chap 3): Estimators of β̂_j in the case of orthonormal columns of X. M and λ are constants chosen by the corresponding techniques; sign denotes the sign of its argument (±1), and x₊ denotes the “positive part” of x. Below the table, the estimators are shown by broken red lines; the 45° line in gray shows the unrestricted estimate for reference.
[Panels: Best Subset (hard threshold at |β̂_(M)|), Ridge, and Lasso (soft threshold at λ), each plotted against β̂ through the origin.]
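The three orthonormal-design formulas from Table 3.4 can be sketched directly; here `beta_hat` plays the role of the ordinary least squares coefficients, and the function names are illustrative.

```python
import numpy as np

def best_subset_keep(beta_hat, M):
    """Keep the M largest |beta_hat_j|; zero out the rest."""
    thresh = np.sort(np.abs(beta_hat))[::-1][M - 1]   # |beta_hat_(M)|
    return np.where(np.abs(beta_hat) >= thresh, beta_hat, 0.0)

def ridge_shrink(beta_hat, lam):
    """Proportional shrinkage beta_hat_j / (1 + lambda)."""
    return beta_hat / (1 + lam)

def lasso_soft_threshold(beta_hat, lam):
    """Soft thresholding sign(beta_hat_j)(|beta_hat_j| - lambda)_+."""
    return np.sign(beta_hat) * np.maximum(np.abs(beta_hat) - lam, 0.0)

b = np.array([3.0, -1.5, 0.5])
```

Best subset zeroes out small coefficients abruptly, ridge shrinks all proportionally, and the lasso translates each toward zero and truncates, which is why the lasso both shrinks and selects.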
Comparison: Ridge Regression and Lasso
Figure 3.11 (ESL Chap 3): Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β_1| + |β_2| ≤ t and β_1² + β_2² ≤ t², respectively, while the red ellipses are the contours of the least squares error function.