Topic 18: Model Selection and Diagnostics
Variable Selection
• We want to choose a “best” model that is a subset of the available explanatory variables
• Two separate problems
  1. How many explanatory variables should we use (i.e., subset size)
  2. Given the subset size, which variables should we choose
KNNL Example
• Page 350, Section 9.2
• n = 54 patients / cases
• Y: survival time (liver operation)
• X’s (explanatory variables) are
  – Blood clotting score
  – Prognostic index
  – Enzyme function test
  – Liver function test
KNNL Example cont.
• We start with the usual plots and descriptive statistics
• Note that time-to-event / survival data are often heavily skewed and typically transformed with a log prior to model fitting
Ln Transform of Y
• Recall that the regression model requires Y|X to be Normally distributed, not Y
• Better to look at residuals
• With data like these, the transform reduces the influence of the long right tail and stabilizes the variance of the residuals
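As an aside, a minimal sketch of how lsurv could be constructed if one started from a raw file containing only surv; the file name here is hypothetical, since the course data set read on the next slide already supplies lsurv:

data a0;
  infile 'raw_survival.txt';   * hypothetical raw file without lsurv;
  input blood prog enz liver age gender alcmod alcheavy surv;
  lsurv = log(surv);           * log() in SAS is the natural logarithm;
run;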
Data

data a1;
  infile 'U:\.www\datasets512\CH09TA01.txt' delimiter='09'x;   * tab delimited;
  input blood prog enz liver age gender
        alcmod alcheavy   /* dummy variables for alcohol use */
        surv lsurv;       /* lsurv = ln(surv) */
run;
Obs  blood  prog  enz  liver  age  gender  alcmod  alcheavy  surv  lsurv
  1    6.7    62   81   2.59   50       0       1         0   695  6.544
  2    5.1    59   66   1.70   39       0       0         0   403  5.999
  3    7.4    57   83   2.16   55       0       0         0   710  6.565
  4    6.5    73   41   2.01   48       0       0         0   349  5.854
  5    7.8    65  115   4.30   45       0       0         1  2343  7.759
  6    5.8    38   72   1.42   65       1       1         0   348  5.852
Data

[Histogram of surv: the distribution shows a long right tail]
Generate scatterplots

proc corr data=a1 plots=matrix;
  var blood prog enz liver;
run;
proc corr data=a1 plots=scatter;
  var blood prog enz liver;
  with lsurv;
run;
Correlation Summary

Pearson Correlation Coefficients, N = 54
Prob > |r| under H0: Rho=0

         blood     prog      enz     liver
lsurv  0.24619  0.46994  0.65389  0.64926
        0.0727   0.0003   <.0001   <.0001
The Two Problems in Variable Selection
1. To determine an appropriate subset size
   – Might use adjusted R2, Cp, MSE, PRESS, AIC, SBC (BIC)
2. To determine the best model of a fixed size
   – Might use R2
Adjusted R2
• R2 by its construction is guaranteed to increase with p
  – SSE cannot decrease with an additional X while SSTO stays constant
• Adjusted R2 uses df to account for changes in p:

$$R_a^2 = 1 - \frac{n-1}{n-p}\cdot\frac{\text{SSE}}{\text{SSTO}} = 1 - \frac{\text{MSE}}{\text{MSTO}}$$
Adjusted R2 cont.
• Want to find the model that maximizes $R_a^2$
• Since MSTO remains constant for a given data set, this is equivalent to finding the model that minimizes MSE
• Details on pages 354-356
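As an aside, PROC REG can rank subsets by adjusted R2 directly; a minimal sketch (the BEST=3 choice is mine, to keep the listing short):

proc reg data=a1;
  model lsurv = blood prog enz liver / selection=adjrsq best=3;
run;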
Cp Criterion
• The basic idea is to compare subset models with a full model
• A subset model is “good” if there is not substantial bias in the predicted values relative to the full model
• Looks at the ratio of the total mean squared error to the true error variance
• See pages 357-359 for details
Cp Criterion

$$C_p = \frac{\text{SSE}_p}{\text{MSE(Full)}} - (n - 2p)$$

• SSE_p is based on a specific choice of p − 1 explanatory variables
• MSE(Full) is based on all the variables
• For the full set, C_p = (n − p) − (n − 2p) = p
Use of Cp
• p is the number of regression coefficients, including the intercept
• A model is good according to this criterion if Cp ≤ p
• Rule: pick the smallest model for which Cp is smaller than p, or pick the model that minimizes Cp, provided that Cp is not much larger than p
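To make the rule concrete, a worked check against the selection results shown later: the best three-variable model (blood, prog, enz) has p = 4 coefficients (including the intercept) and

$$C_p = 3.3905 \le p = 4,$$

so it satisfies the criterion, while the full model gives C_p = 5 = p by construction.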
SBC (BIC) and AIC
Criteria based on log(likelihood) plus a penalty for more complexity

• AIC – minimize
  $$\text{AIC}_p = n\log\frac{\text{SSE}_p}{n} + 2p$$
• SBC – minimize
  $$\text{SBC}_p = n\log\frac{\text{SSE}_p}{n} + p\log(n)$$
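Since the two criteria differ only in their penalty terms, a quick comparison for this data set: with n = 54,

$$\log(54) \approx 3.99 > 2,$$

so SBC charges roughly twice as much per additional parameter as AIC and therefore tends to favor smaller models.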
Other approaches
• PRESS (prediction SS)
  – For each case i, delete the case and predict Y using the model fitted to the other n − 1 cases
  – Look at the SS for observed minus predicted
  – Want to minimize the PRESS
  – This appears to require n separate regressions, but it does not; each deleted residual comes from the single full fit (see the sketch below and the studentized deleted residual slides)
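A minimal sketch of obtaining PRESS in SAS: the PRESS option on the PROC REG statement writes the statistic to the OUTEST= data set (my understanding is that the variable is named _PRESS_; verify against your SAS documentation):

proc reg data=a1 outest=est press;
  model lsurv = blood prog enz liver;
run;
proc print data=est;
  var _press_;   * PRESS statistic for the fitted model;
run;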
Variable Selection in SAS
• Additional proc reg model statement options useful in variable selection
  – INCLUDE=n forces the first n explanatory variables into all models
  – BEST=n limits the output to the best n models of each subset size or total
  – START=n limits the output to models that include at least n explanatory variables
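For instance, a minimal sketch (option values are mine, for illustration) that forces blood into every model and reports the two best subsets of each size:

proc reg data=a1;
  model lsurv = blood prog enz liver / selection=rsquare include=1 best=2;
run;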
Variable Selection
• Step-type procedures
  – Forward selection (step up)
  – Backward elimination (step down)
  – Stepwise (forward selection with a backward glance)
• Very popular, but there are now much better search techniques like BEST
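For reference, a minimal sketch of the stepwise option in PROC REG; the entry/stay significance levels are illustrative choices of mine, not values from the notes:

proc reg data=a1;
  model lsurv = blood prog enz liver
        / selection=stepwise slentry=0.10 slstay=0.15;
run;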
2. Ordering models of the same subset size
• Use R2 or SSE / MSE or F*
• This approach can lead us to consider several models that give approximately the same predicted values
• May need to apply knowledge of the subject matter to make a final selection
• Not that important if prediction is the key goal
Proc Reg Code

proc reg data=a1;
  model lsurv = blood prog enz liver
        / selection=rsquare cp aic sbc b best=3;
run;
Selection Results

Number in
 Model    R-Square      C(p)        AIC         SBC
   1       0.4276     66.4889   -103.8269   -99.84889
   1       0.4215     67.7148   -103.2615   -99.28357
   1       0.2208    108.5558    -87.1781   -83.20011
   2       0.6633     20.5197   -130.4833  -124.51634
   2       0.5995     33.5041   -121.1126  -115.14561
   2       0.5486     43.8517   -114.6583  -108.69138
   3       0.7573      3.3905   -146.1609  -138.20494
   3       0.7178     11.4237   -138.0232  -130.06723
   3       0.6121     32.9320   -120.8442  -112.88823
   4       0.7592      5.0000   -144.5895  -134.64461
Selection Results cont.

Number in              Parameter Estimates
 Model    Intercept    blood     prog      enz      liver
   1       5.26426       .        .      0.01512      .
   1       5.61218       .        .        .       0.29819
   1       5.56613       .      0.01367    .          .
   2       4.35058       .      0.01412  0.01539      .
   2       5.02818       .        .      0.01073   0.20945
   2       4.54623    0.10792     .      0.01634      .
   3       3.76618    0.09546   0.01334  0.01645      .
   3       4.40582       .      0.01101  0.01261   0.12977
   3       4.78168    0.04482     .      0.01220   0.16360
   4       3.85195    0.08368   0.01266  0.01563   0.03216
Proc Reg Code

proc reg data=a1;
  model lsurv = blood prog enz liver
        / selection=cp aic sbc b best=3;
run;
Selection Results

Number in
 Model      C(p)    R-Square      AIC         SBC
   3       3.3905    0.7573    -146.1609  -138.20494
   4       5.0000    0.7592    -144.5895  -134.64461
   3      11.4237    0.7178    -138.0232  -130.06723
WARNING: “selection=cp” just lists the models in order of lowest C(p), regardless of whether any of them is good or not
How to Choose with C(p)
1. Want small C(p)
2. Want C(p) near p
• In the original paper, it was suggested to plot C(p) versus p and consider the smallest model that satisfies these criteria
• Can be somewhat subjective when determining “near”
Proc Reg

proc reg data=a1 outest=b1;   * outest= creates a data set with the estimates & criteria;
  model lsurv = blood prog enz liver
        / selection=rsquare cp aic sbc b;
run;
quit;

symbol1 v=circle i=none;
symbol2 v=none i=join;
proc gplot data=b1;
  plot _Cp_*_P_ _P_*_P_ / overlay;   * C(p) vs p, with the C(p)=p reference line overlaid;
run;

[Plot of C(p) versus p: the models start to approach the C(p)=p line]
Model Validation
• Since the data were used to generate the parameter estimates, you’d expect the model to predict the fitted Y’s well
• Should check the model’s predictive ability on a separate data set, if available
• Various techniques of cross-validation (data split, leave one out) are possible if only one data set is available; a sketch of a data split follows
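A minimal sketch of a random data split in SAS, with my own choice of seed and split fraction; it relies on the standard behavior that cases with a missing response are excluded from estimation but still receive predicted values:

data both;
  set a1;
  holdout = (ranuni(12345) < 0.5);   * flag roughly half the cases;
  lsurv_obs = lsurv;                 * keep the observed response aside;
  if holdout then lsurv = .;         * missing Y -> excluded from the fit;
run;
proc reg data=both;
  model lsurv = blood prog enz;      * candidate model from the selection step;
  output out=pred p=yhat;            * predictions for all cases, incl. holdout;
run;
proc means data=pred n mean;
  where holdout = 1;                 * compare yhat with lsurv_obs here;
  var lsurv_obs yhat;
run;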
Additional Multiple Regression Diagnostics
• Partial regression plots
• Studentized deleted residuals
• Hat matrix diagonals
• Dffits, Cook’s D, DFBETAS
• Variance inflation factor
• Tolerance
KNNL Example
• Page 386, Section 10.1
• Y is amount of life insurance
• X1 is average annual income
• X2 is a risk aversion score
• n = 18 managers
Read in the data set

data a1;
  infile '../data/ch10ta01.txt';
  input income risk insur;
run;
Partial regression plots
• Also called added variable plots or adjusted variable plots
• One plot for each Xi
Partial regression plots
• These plots show the strength of the marginal relationship between Y and Xi in the full model (recall partial correlation)
• They can also detect
  – Nonlinear relationships
  – Heterogeneous variances
  – Outliers
Partial regression plots
• Consider the plot for X1
  – Use the other X’s to predict Y
  – Use the other X’s to predict X1
  – Plot the residuals from the first regression vs the residuals from the second regression (a by-hand sketch follows the partial option below)
The partial option with proc reg

proc reg data=a1;
  model insur = income risk / partial;
run;
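Equivalently, the partial regression plot for income could be built by hand following the recipe above; a minimal sketch (the residual variable names are mine):

proc reg data=a1;
  model insur = risk;          * step 1: Y on the other X's;
  output out=r1 r=e_y;
run;
proc reg data=a1;
  model income = risk;         * step 2: X1 on the other X's;
  output out=r2 r=e_x;
run;
data pr;
  merge r1 r2;                 * both in original case order;
run;
proc gplot data=pr;
  plot e_y*e_x;                * step 3: residuals vs residuals;
run;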
Output

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2       173919          86960       542.33   <.0001
Error             15      2405.1476       160.3431
Corrected Total   17       176324

Root MSE          12.66267   R-Square   0.9864
Dependent Mean   134.44444   Adj R-Sq   0.9845
Coeff Var          9.41851
Output

Parameter Estimates

Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   1       -205.71866          11.39268       -18.06     <.0001
income      1          6.28803           0.20415        30.80     <.0001
risk        1          4.73760           1.37808         3.44     0.0037
[Partial regression plots: a curvilinear relationship is visible, and can also be seen in the second plot]
Other Residuals
• There are several versions of residuals
1. Our usual residuals
   $$e_i = Y_i - \hat{Y}_i$$
2. Studentized residuals
   $$e_i^* = \frac{e_i}{\sqrt{\text{MSE}(1 - h_{ii})}}$$
• Studentized means dividing a residual by its standard error
• These are approximately distributed t(n − p)
Studentized Deleted Residuals
– Delete case i and refit the model
– Compute the predicted value for case i using this refitted model
– Compute the “studentized residual”
– Don’t do this literally, but this is the concept
– Results in t-distributed residuals
Studentized Deleted Residuals
• We use the notation “(i)” to indicate that case i has been deleted from the model fit computations
• The deleted residual is
  $$d_i = Y_i - \hat{Y}_{i(i)}$$
• Turns out d_i = e_i/(1 − h_ii)
• Also Var(d_i) = Var(e_i)/(1 − h_ii)² = MSE_(i)/(1 − h_ii)
• The studentized deleted residual is
  $$t_i = \frac{e_i}{\sqrt{\text{MSE}_{(i)}(1 - h_{ii})}}$$
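This is why the n refits are never done literally: t_i can be computed from the single full fit using the standard identity

$$t_i = e_i\left[\frac{n-p-1}{\text{SSE}(1-h_{ii}) - e_i^2}\right]^{1/2}$$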
Using Residuals
• When we examine the residuals, regardless of version, we are looking for
  – Outliers
  – Non-normal error distributions
  – Influential observations
The r option and studentized residuals

proc reg data=a1;
  model insur = income risk / r;
run;
Output

Output Statistics

Obs  income  risk  Dep Var  Predicted Value  Std Err Mean Predict  Residual  Std Err Residual  Student Residual  Cook's D
1 45.0 6 91 105.7311 3.3332 -14.7311 12.216 -1.206 0.036
2 57.2 4 162 172.9321 4.0172 -10.9321 12.009 -0.910 0.031
3 26.9 5 11 -13.1845 5.5052 24.1845 11.403 2.121 0.349
4 66.3 7 240 244.2780 4.5932 -4.2780 11.800 -0.363 0.007
5 41.0 5 73 75.5522 3.4815 -2.5522 12.175 -0.210 0.001
6 73.0 10 311 300.6583 7.4898 10.3417 10.210 1.013 0.184
7 79.4 1 316 298.1627 9.9907 17.8373 7.780 2.293 2.889
8 52.8 8 154 163.9763 4.5985 -9.9763 11.798 -0.846 0.036
9 55.9 6 164 174.3084 3.2470 -10.3084 12.239 -0.842 0.017
10 38.1 4 54 52.9440 4.0148 1.0560 12.009 0.088 0.000
11 35.8 6 53 48.0699 4.3886 4.9301 11.878 0.415 0.008
12 75.8 9 326 313.5272 6.9287 12.4728 10.599 1.177 0.197
13 37.4 5 55 53.1919 3.8909 1.8081 12.050 0.150 0.001
14 54.4 2 130 145.6744 5.7973 -15.6744 11.258 -1.392 0.171
15 46.2 7 112 117.8634 3.9171 -5.8634 12.042 -0.487 0.008
16 46.1 4 91 103.2985 3.5257 -12.2985 12.162 -1.011 0.029
17 30.4 3 14 -0.5636 5.3985 14.5636 11.454 1.271 0.120
18 39.1 5 63 63.5798 3.6886 -0.5798 12.114 -0.048 0.000
Cook’s Distance
• A measure of the influence of case i on all of the $\hat{Y}_i$’s (all the cases)
• It is a standardized version of the sum of squares of the differences between the predicted values computed with and without case i
• Compare with F(p, n − p)
• Concern if the distance is above the 50th percentile
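In its standard computational form (equivalent to the verbal description above, and obtainable from a single fit):

$$D_i = \frac{e_i^2}{p\,\text{MSE}}\cdot\frac{h_{ii}}{(1-h_{ii})^2}$$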
The influence option and studentized deleted residuals

proc reg data=a1;
  model insur = income risk / influence;
run;
Output

Output Statistics

Obs  income  risk  Residual  RStudent  Hat Diag H  Cov Ratio  DFFITS  DFBETAS (Intercept, income, risk)
1 45.0 6 -14.7311 -1.2259 0.0693 0.9732 -0.3345 -0.1179 0.1245 -0.1107
2 57.2 4 -10.9321 -0.9048 0.1006 1.1532 -0.3027 -0.0395 -0.1470 0.1723
3 26.9 5 24.1845 2.4487 0.1890 0.5205 1.1821 0.9594 -0.9871 0.1436
4 66.3 7 -4.2780 -0.3518 0.1316 1.3794 -0.1369 0.0770 -0.0821 -0.0410
5 41.0 5 -2.5522 -0.2028 0.0756 1.3189 -0.0580 -0.0394 0.0286 0.0011
6 73.0 10 10.3417 1.0138 0.3499 1.5296 0.7437 -0.5298 0.3048 0.5125
7 79.4 1 17.8373 2.7483 0.6225 0.8930 3.5292 -0.3649 2.6598 -2.6751
8 52.8 8 -9.9763 -0.8371 0.1319 1.2237 -0.3263 0.0816 0.0254 -0.2452
9 55.9 6 -10.3084 -0.8336 0.0658 1.1384 -0.2212 0.0308 -0.0672 -0.0366
10 38.1 4 1.0560 0.0850 0.1005 1.3653 0.0284 0.0238 -0.0138 -0.0092
11 35.8 6 4.9301 0.4033 0.1201 1.3502 0.1490 0.0863 -0.1057 0.0536
12 75.8 9 12.4728 1.1933 0.2994 1.3128 0.7801 -0.5820 0.4495 0.4096
13 37.4 5 1.8081 0.1451 0.0944 1.3521 0.0468 0.0348 -0.0294 0.0014
14 54.4 2 -15.6744 -1.4415 0.2096 1.0274 -0.7423 -0.2706 -0.2656 0.6269
15 46.2 7 -5.8634 -0.4742 0.0957 1.2966 -0.1543 -0.0164 0.0532 -0.0953
16 46.1 4 -12.2985 -1.0120 0.0775 1.0788 -0.2934 -0.1810 0.0258 0.1424
17 30.4 3 14.5636 1.3004 0.1818 1.0677 0.6129 0.5803 -0.3608 -0.2577
18 39.1 5 -0.5798 -0.0462 0.0849 1.3434 -0.0141 -0.0101 0.0080 -0.0001
Hat matrix diagonals
• hii is a measure of how much Yi is contributing to the prediction of $\hat{Y}_i$:
  $$\hat{Y}_i = h_{i1}Y_1 + h_{i2}Y_2 + h_{i3}Y_3 + \cdots$$
• hii is sometimes called the leverage of the ith observation
• It is a measure of the distance between the X values for the ith case and the means of the X values
Hat matrix diagonals
• 0 ≤ hii ≤ 1
• Σ(hii) = p
• A large value of hii suggests that the ith case is distant from the center of all X’s
• The average value is p/n
• Values far from this average point to cases that should be examined carefully
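For reference, the hat matrix behind these diagonals is computed from the design matrix X:

$$H = X(X'X)^{-1}X', \qquad \hat{Y} = HY$$

so hii is the ith diagonal entry of H.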
Output
(The Output Statistics table from the influence option is repeated here; see above.)
DFFITS
• A measure of the influence of case i on $\hat{Y}_i$ (a single case)
• Thus, it is closely related to hii
• It is a standardized version of the difference between the values of $\hat{Y}_i$ computed with and without case i
• Concern if greater than 1 for small data sets or greater than $2\sqrt{p/n}$ for large data sets
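In computational form, DFFITS combines the studentized deleted residual with the leverage:

$$\text{DFFITS}_i = t_i\sqrt{\frac{h_{ii}}{1-h_{ii}}}$$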
DFBETAS
• A measure of the influence of case i on each of the regression coefficients
• It is a standardized version of the difference between the regression coefficient computed with and without case i
• Concern if a DFBETA is greater than 1 in small data sets or greater than $2/\sqrt{n}$ for large data sets
Variance Inflation Factor
• The VIF is related to the variance of the estimated regression coefficients
• We calculate it for each explanatory variable
• One suggested rule is that a value of 10 or more for VIF indicates excessive multicollinearity
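Explicitly, in terms of the R_k² defined on the Tolerance slide below:

$$\text{VIF}_k = \frac{1}{1 - R_k^2}$$

For example, income and risk each show VIF ≈ 1.07 in the output below, far from the suggested cutoff of 10.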
Tolerance
• TOL_k = 1 − R_k², where R_k² is the squared multiple correlation obtained from a regression in which all the other explanatory variables are used to predict X_k
• TOL = 1/VIF
• Described in a comment on p 410
Output

Parameter Estimates

Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Tolerance   Variance Inflation
Intercept   1       -205.71866          11.39268       -18.06     <.0001        .            0
income      1          6.28803           0.20415        30.80     <.0001     0.93524      1.06925
risk        1          4.73760           1.37808         3.44     0.0037     0.93524      1.06925
Full diagnostics

proc reg data=a1;
  model insur = income risk / r partial influence tol;
  id income risk;
  plot rstudent.*(income risk);
run;
Plot statement inside Reg
• Can generate several plots within Proc Reg
• Need to know the symbol names
• These are listed in Table 1 of the PLOT statement documentation in the REG syntax
  – r. represents the usual residuals
  – rstudent. represents the deleted residuals
  – p. represents the predicted values
Last slide
• We went over KNNL Chapters 9 and 10
• We used program topic18.sas to generate the output