What You See May Not Be What You Get: A Primer on Regression Artifacts
Michael A. Babyak, PhD
Duke University Medical Center
Topics to Cover
1. Models: what and why?
2. Preliminaries: requirements for a good model
3. Dichotomizing a graded or continuous variable is dumb
4. Using degrees of freedom wisely
5. Covariate selection
6. Transformations and smoothing techniques for non-linear effects
7. Resampling as a superior method of model validation
What is a model?
$Y = f(x_1, x_2, x_3, \ldots, x_n)$
$Y = a + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$
$Y = e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n}$
Why Model? (instead of test)
1. Can capture theoretical/predictive system
2. Estimates of population parameters
3. Allows prediction as well as hypothesis testing
4. More information for replication
Preliminaries
1. Correct model
2. Measure well and don’t throw information away
3. Adequate sample size
Correct Model
• Gaussian: General Linear Model
  – Multiple linear regression
• Binary (or ordinal): Generalized Linear Model
  – Logistic Regression
  – Proportional Odds/Ordinal Logistic
• Time to event:
  – Cox Regression
• Distribution of predictors generally not important
Measure well and don’t throw information away
• Reliable, interpretable
• Use all the information about the variables of interest
• Don’t create “clinical cutpoints” before modeling
• Model with ALL the data first, then use prediction to make decisions about cutpoints
Dichotomizing for Convenience Can Destroy a Model
[Figure: depression score axis (0 to 44) cut into “not depressed” and “depressed”, with three scores A, B, and C marked. Caption: Implausible measurement assumption.]
Dichotomization, by definition, reduces power by a minimum of about
30%
http://psych.colorado.edu/~mcclella/MedianSplit/
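The size of the loss can be checked with a small simulation. The sketch below is an illustration on simulated normal data, not a reanalysis of any study cited here; the sample size, seed, and variable names are assumptions. For a normally distributed predictor split at its median, theory says the correlation shrinks by a factor of sqrt(2/pi), roughly 0.80, so only about 64% of the original r-squared survives.

```r
# Sketch: how much correlation survives a median split (simulated data only).
set.seed(1)
n <- 100000
x <- rnorm(n)
y <- 0.4 * x + rnorm(n, sd = sqrt(1 - 0.4^2))   # population cor(x, y) = .4

x_split <- as.numeric(x > median(x))            # 0 = below median, 1 = above

r_cont <- cor(x, y)
r_dich <- cor(x_split, y)

ratio_r  <- r_dich / r_cont   # theory for normal x: sqrt(2 / pi)
ratio_r2 <- ratio_r^2         # share of r-squared that is kept
round(c(r_cont = r_cont, r_dich = r_dich,
        ratio_r = ratio_r, ratio_r2 = ratio_r2), 3)
```

A lower r for the same N means lower power, which is the sense in which a median split "throws away" part of the sample.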
Dear Project Officer,
In order to facilitate analysis and interpretation, we have decided to throw away about 30% of our data. Even though this will waste about 3 or 4 hundred thousand dollars worth of subject recruitment and testing money, we are confident that you will understand.
Sincerely,
Dick O. Tomi, PhD
Prof. Richard Obediah Tomi, PhD
Examples from the WCGS Study: Correlations with CHD Mortality (n = 750)
                          Continuous      Dichotomized at median   Reduction
Variable                  r      r2       r      r2                in r2
Systolic Blood Pressure   .15    .023     .12    .014              -39%
Hostility                 .15    .023     .08    .006              -74%
Dichotomizing does not reduce measurement error
Gustafson, P. and Le, N.D. (2001). A comparison of continuous and discrete measurement error: is it wise
to dichotomize imprecise covariates? Submitted. Available at http://www.stat.ubc.ca/people/gustaf.
Simulation: Dichotomizing makes matters worse when measure is unreliable
[Path diagram: X1 → Y, path coefficient = .4]
• True model: X1 continuous
• Same model with X1 dichotomized
• Models with reliability of X1 manipulated (continuous vs. dichotomized): reliability = .65, .75, .85, 1.00
[Figure: Dichotomization of a variable measured with error (y = .4x + e). x-axis: reliability of x (1.00, 0.85, 0.75, 0.65); y-axis: % correct rejections of null hypothesis (50 to 100); series: continuous x, dichotomized x.]
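The design of this simulation can be sketched in a few lines of base R. This is not the original simulation code: the sample size (n = 100), the number of replications, and the helper name `power_sim` are assumptions chosen for illustration. Reliability is defined here as var(true score) / var(observed score).

```r
# Sketch: power to detect y = .4x + e when x is measured with error,
# analysed as continuous or after a median split. Settings are illustrative.
set.seed(2)

power_sim <- function(reliability, n = 100, reps = 500) {
  hits <- replicate(reps, {
    x_true <- rnorm(n)
    y      <- 0.4 * x_true + rnorm(n)
    # observed x = true score + error, scaled to give the requested reliability
    err_sd <- sqrt((1 - reliability) / reliability)
    x_obs  <- x_true + rnorm(n, sd = err_sd)
    x_dich <- as.numeric(x_obs > median(x_obs))
    p_cont <- summary(lm(y ~ x_obs))$coefficients[2, 4]
    p_dich <- summary(lm(y ~ x_dich))$coefficients[2, 4]
    c(continuous = p_cont < 0.05, dichotomized = p_dich < 0.05)
  })
  rowMeans(hits)   # proportion of correct rejections of the null
}

results <- sapply(c(1.00, 0.85, 0.75, 0.65), power_sim)
colnames(results) <- c("1.00", "0.85", "0.75", "0.65")
round(100 * results)   # % correct rejections, by reliability
```

Both rows fall as reliability drops, and the dichotomized row sits below the continuous row throughout, so cutting an unreliable measure compounds the loss rather than cleaning it up.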
Dichotomizing will obscure non-linearity
[Figure: Dichotomized at Median (CES-D = 7). Bar chart of percent with wall motion abnormality (y-axis, 0 to 30) for Not Depressed vs. Depressed.]
[Figure: WMA on at Least 1 Task, Using Cubic Spline. x-axis: CES-D score (0 to 40); y-axis: probability of WMA (0.0 to 1.0).]
Simulation 2: Dichotomizing a continuous predictor that is correlated with another predictor
[Path diagram: X1 → Y = .4; X2 → Y = .0; correlation between X1 and X2 (rho12) = .0, .4, .7]
• X1 and X2 continuous
• X1 dichotomized
• X1 dichotomized; rho12 manipulated
[Figure: X1 and X2 continuous. x-axis: correlation between x1, x2 (0, 0.4, 0.7); y-axis: (%) incorrect rejections of X2 = 0 (0 to 4.5).]
[Figure: Both continuous vs. x1 dichotomous, x2 continuous. x-axis: correlation between x1, x2 (0, 0.4, 0.7); y-axis: (%) incorrect rejections of X2 = 0 (0 to 30).]
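A rough version of Simulation 2 in base R follows. It is a sketch under assumed settings (n = 200, 500 replications, the helper name `false_pos`), not the code behind the figures. X2 has no effect on Y, so every rejection of X2 = 0 is a false positive.

```r
# Sketch: Type I error for x2 when a correlated predictor x1 is dichotomized.
set.seed(3)

false_pos <- function(rho, n = 200, reps = 500) {
  hits <- replicate(reps, {
    x1 <- rnorm(n)
    x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)   # cor(x1, x2) = rho
    y  <- 0.4 * x1 + rnorm(n)                     # x2 has no effect on y
    x1_dich <- as.numeric(x1 > median(x1))
    p_both <- summary(lm(y ~ x1 + x2))$coefficients["x2", 4]
    p_dich <- summary(lm(y ~ x1_dich + x2))$coefficients["x2", 4]
    c(both_continuous = p_both < 0.05, x1_dichotomized = p_dich < 0.05)
  })
  rowMeans(hits)   # proportion of incorrect rejections of x2 = 0
}

res <- sapply(c(0, 0.4, 0.7), false_pos)
colnames(res) <- c("0", "0.4", "0.7")
round(100 * res, 1)   # % incorrect rejections, by correlation
```

The mechanism is that the dichotomized X1 no longer carries all of X1's information, so the correlated X2 picks up the leftover signal and looks "significant" even though its true coefficient is zero.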
Is it ever a good idea to categorize quantitatively measured variables?
• Yes:
  – when the variable is truly categorical
  – for descriptive/presentational purposes
  – for hypothesis testing, if enough categories are made.
• However, using many categories can lead to problems of multiple significance tests and still run the risk of misclassification
CONCLUSIONS
• Cutting:
  – Doesn’t always make measurement sense
  – Almost always reduces power
  – Can fool you with too much power in some instances
  – Can completely miss important features of the underlying function
• Modern computing/statistical packages can “handle” continuous variables
• Want to make good clinical cutpoints? Model first, cut later.
[Figure: Clinical Events and LVEF Change during Mental Stress: 5 Year follow-up. x-axis: Maximum Change in LVEF (%); y-axis: Prob{event}.]
Model first, cut later
Requirements: Sample Size
• Linear regression
  – Minimum of N = 50 + 8 per predictor (Green, 1990)
• Logistic Regression
  – Minimum of N = 10-15 per predictor among smallest group (Peduzzi et al., 1990a)
• Survival Analysis
  – Minimum of N = 10-15 per predictor (Peduzzi et al., 1990b)
Concept of Simulation
[Diagram: starting from a known model, Y = b X + error, draw k samples and estimate b in each (bs1, bs2, bs3, bs4, …, bsk-1, bsk); then evaluate the set of estimates.]
Simulation Example
[Diagram: the same scheme with the true model Y = .4 X + error.]
[Figure: histogram of the estimates. x-axis: value of beta for x1 (0.2 to 0.6); y-axis: frequency of beta value (0 to 2500). True model: Y = .4*x1 + e]
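The scheme above translates almost directly into code. The following is a minimal sketch; the number of samples (k = 2000) and the per-sample size (n = 100) are assumptions, since the slides do not state them.

```r
# Sketch: simulate k samples from a known model and collect the estimates.
set.seed(4)
k <- 2000   # number of simulated samples (assumed)
n <- 100    # observations per sample (assumed)

beta_hat <- replicate(k, {
  x1 <- rnorm(n)
  y  <- 0.4 * x1 + rnorm(n)     # true model: Y = .4*x1 + e
  coef(lm(y ~ x1))[["x1"]]      # estimate of beta from this sample
})

# Evaluate: centre, spread, and range of the estimates
c(mean = mean(beta_hat), sd = sd(beta_hat))
quantile(beta_hat, c(0.025, 0.5, 0.975))
```

Because the true value is known, any gap between the estimates and .4 can be attributed to the analysis strategy or to sampling error, which is what makes simulation useful for studying regression artifacts. Passing `beta_hat` to `hist()` gives a histogram like the one described above.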
All-noise, but good fit
[Figure: density of R-square from the full model when all predictors are noise. x-axis: R-square from full model (0.0 to 1.0); y-axis: density (0 to 16); curves for n/p ≈ 3, n/p ≈ 6.6, n/p = 10, n/p ≈ 13.3.]
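The "all-noise, but good fit" effect is easy to reproduce. In the sketch below every predictor is pure noise; the number of predictors (p = 15), the n/p ratios, and the helper name `noise_r2` are assumptions for illustration. Under these conditions the expected R-square is p / (n - 1), so it is far from zero when n/p is small.

```r
# Sketch: R-square obtained by regressing noise on noise at several n/p ratios.
set.seed(5)

noise_r2 <- function(n, p = 15, reps = 300) {
  replicate(reps, {
    x <- matrix(rnorm(n * p), nrow = n)   # p predictors, all unrelated to y
    y <- rnorm(n)
    summary(lm(y ~ x))$r.squared
  })
}

n_per_p <- c(3, 10, 50)
mean_r2 <- sapply(n_per_p, function(ratio) mean(noise_r2(n = ratio * 15)))
names(mean_r2) <- paste0("n/p=", n_per_p)
round(mean_r2, 2)   # compare with 15 / (n - 1) for n = 45, 150, 750
```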
Simulation: number of events/predictor ratio
Y = .5*x1 + 0*x2 + .2*x3 + 0*x4
-- Where x1 x4 = .4
-- N/p = 3, 5, 10, 20, 50
[Figure: Parameter stability and n/p ratio. Four panels (x1, x2, x3, x4), each showing the density of the parameter estimate (x-axis: -2.0 to 2.0; y-axis: density, 0 to 8) for n/p = 3, 5, 10, 20, 50.]
Peduzzi’s Simulation: number of events/predictor ratio
P(survival) = a + b1*NYHA + b2*CHF + b3*VES + b4*DM + b5*STD + b6*HTN + b7*LVC
-- Events/p = 2, 5, 10, 15, 20, 25
-- % relative bias = ((estimated b - true b) / true b) * 100
[Figure: Simulation results: number of events/predictor ratio. x-axis: events per variable (2, 5, 10, 15, 20, 25); y-axis: % relative bias (-20 to 50); one line each for NYHA, CHF, VES, DM, STD, HTN, LVC.]
[Figure: Simulation results: number of events/predictor ratio. x-axis: events per variable (2, 5, 10, 15, 20, 25); y-axis: proportion with bias > 100% (0 to 0.7); one line each for NYHA, CHF, VES, DM, STD, HTN, LVC.]
Predictor (covariate) selection
1. Theory, substantive knowledge, prior models
2. Testing for confounding
3. Univariate testing
4. Last (and least), automated methods, aka stepwise and best subset regression
Searching for Confounders
• Fundamental tension between underfitting and overfitting
• Underfitting = not adjusting for important confounders
• Overfitting = capitalizing on chance relations (sample fluctuation)
Covariate selection
• Overfitting has been studied extensively
• “Scariest” study is by Faraway (1992)—showed that any pre-modeling strategy cost a df over and above df used later in modeling.
• Premodeling strategies included: variable selection, outlier detection, linearity tests, residual analysis.
Covariate selection
• Therefore, if you transform, select, etc., you must include the DF in (i.e., penalize for) the “Final Model”
Covariate selection: Univariate Testing
• Non-significant tests also cost a DF
• Variables may not behave the same way in a multivariable model: a variable “not significant” at univariate test may be very important in the presence of other variables
Covariate selection
• Despite the convention, testing for confounding has not been systematically studied—likely leads to overadjustment and underestimate of true effect of variable of interest.
• At the very least, pulling variables in and out of models inflates the Type I error rate, sometimes dramatically
SOME of the problems with stepwise variable selection.
1. It yields R-squared values that are badly biased high
2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution
3. The method yields confidence intervals for effects and predicted values that are falsely narrow (See Altman and Anderson Stat in Med)
4. It yields P-values that do not have the proper meaning and the proper correction for them is a very difficult problem
5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani, 1996)
6. It has severe problems in the presence of collinearity
7. It is based on methods (e.g. F tests for nested models) that were intended to be used to test pre-specified hypotheses
8. Increasing the sample size doesn't help very much (see Derksen and Keselman)
9. It allows us to not think about the problem
10. It uses a lot of paper
“I now wish I had never written the stepwise selection code for SAS.” --Frank Harrell, author of forward and
backwards selection algorithm for SAS PROC REG
Automated Selection: Derksen and Keselman (1992) Simulation Study
• Studied backward and forward selection
• Some authentic variables and some noise variables among candidate variables
• Manipulated correlation among candidate predictors
• Manipulated sample size
Automated Selection: Derksen and Keselman (1992) Simulation Study
• “The degree of correlation between candidate predictors affected the frequency with which the authentic predictors found their way into the model.”
• “The greater the number of candidate predictors, the greater the number of noise variables were included in the model.”
• “Sample size was of little practical importance in determining the number of authentic variables contained in the final model.”
[Figure: Simulation results: Number of Noise Variables Included. 20 candidate predictors; 100 samples. x-axis: variables in final model (0 to 7); y-axis: % of samples (0 to 35); series: sample size 100, 200, 500, 1000, 10000.]
[Figure: Simulation results: R-Square From Noise Variables. 20 candidate predictors; 100 samples. x-axis: % variance explained (0, 0-5, 5-10, 10-15, 15-20, 20-25, > 25); y-axis: % of samples (0 to 100); series: sample size 100, 200, 500, 1000, 10000.]
[Figure: Simulation results: R-Square From Noise Variables. 20 candidate predictors; 100 samples. x-axis: samples (deciles); y-axis: R-square (0 to 0.3); series: sample size 10,000, 1,000, 500, 200, 100.]
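A scaled-down version of this kind of experiment can be run with base R's `step()`. Note the substitution: `step()` selects by AIC, not by the F-to-enter rules of classic stepwise procedures, so this is an analogue of the simulations summarized above and not a replication. The 20 all-noise candidates, the sample sizes, the 20 replications, and the helper name `count_noise` are assumptions.

```r
# Sketch: forward selection among 20 pure-noise candidates (AIC-based step()).
set.seed(6)

count_noise <- function(n, p = 20) {
  dat <- as.data.frame(matrix(rnorm(n * p), nrow = n))  # columns V1..V20, all noise
  dat$y <- rnorm(n)                                     # outcome unrelated to any V
  upper <- reformulate(paste0("V", 1:p), response = "y")
  fit <- step(lm(y ~ 1, data = dat),
              scope = list(lower = ~ 1, upper = upper),
              direction = "forward", trace = 0)
  length(coef(fit)) - 1   # number of noise variables that entered the model
}

sizes <- c(100, 500)
selected <- sapply(sizes, function(n) mean(replicate(20, count_noise(n))))
names(selected) <- paste0("n=", sizes)
round(selected, 1)   # average number of noise variables in the final model
```

Every variable that enters here is a false discovery, and the count does not shrink to zero at the larger sample size, which is consistent with the point that more data alone does not rescue automated selection.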
Variable Selection
• Pick variables a priori
• Stick with them
• Penalize appropriately for any data-driven decision about how to model a variable
Spending DF wisely
• Select variables of most importance
• Use DF to assess non-linearity using flexible curve approach (more about this later)
• If not enough N/predictor, combine covariates using techniques that do not look at Y in the sample: PCA, FA, conceptual clustering, collapsing, scoring, established indexes, propensity scores.
Can use data to determine where to spend DF
• Use Spearman’s Rho to test “importance”
• Not peeking because we have chosen to include the term in the model regardless of relation to Y
• Use more DF for non-linearity
Example-Predict Survival from age, gender, and fare on Titanic
If you have already decided to include them (and promise to keep them in the model) you can peek at predictors in order to see where to add complexity
[Figure: Spearman Test. x-axis: adjusted rho^2 (0.0 to 0.25); rows: age (N = 1046, df = 1), fare (N = 1308, df = 1), sex (N = 1309, df = 1).]
Non-linearity using splines
[Figure: Linear Spline (piecewise regression). x-axis: X (0 to 25); y-axis: Y (0 to 2.5).]
Y = a + b1(x<10) + b2(10<x<20) + b3(x>20)
[Figure: Cubic Spline (non-linear piecewise regression), with the knots marked. x-axis: X; y-axis: Y (0 to 2.5).]
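A linear spline with knots at 10 and 20 can be fitted with ordinary least squares by adding "hinge" terms. The sketch below uses simulated data with made-up slopes (0.15, 0, and -0.2 in the three segments); the data, the knot locations, and the function name `truth` are assumptions for illustration. The hinge coding is one common way to implement the piecewise idea and is not necessarily how the figure above was produced.

```r
# Sketch: linear spline (piecewise regression) with knots at x = 10 and x = 20.
set.seed(7)
n <- 1000
x <- runif(n, 0, 25)

# Assumed true curve: rises until 10, flat from 10 to 20, falls after 20
truth <- function(x) 0.5 + 0.15 * pmin(x, 10) - 0.2 * pmax(x - 20, 0)
y <- truth(x) + rnorm(n, sd = 0.2)

# Hinge terms let the slope change at each knot while the line stays continuous
fit <- lm(y ~ x + pmax(x - 10, 0) + pmax(x - 20, 0))
b <- coef(fit)

slopes <- c(below_10      = b[[2]],
            from_10_to_20 = b[[2]] + b[[3]],
            above_20      = b[[2]] + b[[3]] + b[[4]])
round(slopes, 2)   # estimated slope within each segment
```

Each knot costs one degree of freedom. A cubic spline replaces the straight segments with smooth cubic pieces joined at the knots, so it spends df on curvature without imposing cutpoints.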
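library(rms)  # lrm() and rcs() come from Harrell's rms package
# Logistic regression model; fare enters as a spline with 3 knots
fitfare <- lrm(survived ~ (rcs(fare, 3) + age + sex)^2, x = TRUE, y = TRUE)
anova(fitfare)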
Wald Statistics          Response: survived

Factor                                        Chi-Square  d.f.  P
fare (Factor+Higher Order Factors)                  55.1     6  <.0001
  All Interactions                                  13.8     4  0.0079
  Nonlinear (Factor+Higher Order Factors)           21.9     3  0.0001
age (Factor+Higher Order Factors)                   22.2     4  0.0002
  All Interactions                                  16.7     3  0.0008
sex (Factor+Higher Order Factors)                  208.7     4  <.0001
  All Interactions                                  20.2     3  0.0002
fare * age (Factor+Higher Order Factors)             8.5     2  0.0142
  Nonlinear                                          8.5     1  0.0036
  Nonlinear Interaction : f(A,B) vs. AB              8.5     1  0.0036
fare * sex (Factor+Higher Order Factors)             6.4     2  0.0401
  Nonlinear                                          1.5     1  0.2153
  Nonlinear Interaction : f(A,B) vs. AB              1.5     1  0.2153
age * sex (Factor+Higher Order Factors)              9.9     1  0.0016
TOTAL NONLINEAR                                     21.9     3  0.0001
TOTAL INTERACTION                                   24.9     5  0.0001
TOTAL NONLINEAR + INTERACTION                       38.3     6  <.0001
TOTAL                                              245.3     9  <.0001
[Figure: Predictors of Survival on Titanic. Effects plotted on an axis from 0.50 to 12.00 at the 0.95 confidence level for fare (31 vs. 7.9), age (39 vs. 21), and sex (female vs. male). Adjusted to: fare=14, age=28, sex=male.]
[Figure: Fare and Age Interaction. Surface plot of prob. of survival (0 to 1) over fare (0 to 250) and age (10 to 60). Adjusted to: sex=male.]
[Figure: Fare and Gender Interaction. x-axis: fare (0 to 300); y-axis: prob. of survival (0.2 to 1.0); separate curves for female and male. Adjusted to: age=28.]
Validation
• Apparent
  – too optimistic
• Internal
  – cross-validation, bootstrap
  – honest estimate for model performance
  – provides an upper limit to what would be found on external validation
• External validation
  – replication with new sample, different circumstances
Validation
• Steyerberg et al. (1999) compared validation methods
• Found that split-half was far too conservative
• Bootstrap was equal or superior to all other techniques
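Bootstrap
[Diagram: from My Sample, k resamples are drawn WITH REPLACEMENT; the quantity of interest is computed in each resample (1, 2, 3, 4, …, k-1, k), and the set of results is then evaluated.]
Example draws WITH REPLACEMENT:
1, 3, 4, 5, 7, 10
7, 1, 1, 4, 5, 10
10, 3, 2, 2, 2, 1
3, 5, 1, 4, 2, 7
2, 1, 1, 7, 2, 7
4, 4, 1, 4, 2, 10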
Bootstrap Validation
Index      Training  Corrected
Dxy        0.6565    0.646
R2         0.4273    0.407
Intercept  0.0000    -0.011
Slope      1.0000    0.952
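The "Training" and "Corrected" columns follow the optimism-bootstrap logic: refit the model in each resample, compare its performance on the resample with its performance on the original data, and subtract the average gap from the apparent performance. The sketch below applies that logic to R-square for a linear model in base R. The simulated data (one real predictor and nine noise predictors, n = 100), the 200 resamples, and the helper name `r2` are assumptions, and this is not the code that produced the table above.

```r
# Sketch: optimism-corrected R-square via the bootstrap (simulated data).
set.seed(8)
n <- 100
dat <- as.data.frame(matrix(rnorm(n * 10), nrow = n))  # V1..V10
dat$y <- 0.4 * dat$V1 + rnorm(n)                       # only V1 matters

# R-square of a fitted model evaluated on an arbitrary data set
r2 <- function(fit, data) {
  pred <- predict(fit, newdata = data)
  1 - sum((data$y - pred)^2) / sum((data$y - mean(data$y))^2)
}

full_fit <- lm(y ~ ., data = dat)
apparent <- r2(full_fit, dat)          # performance on the data used to fit

optimism <- replicate(200, {
  boot     <- dat[sample(n, replace = TRUE), ]   # resample WITH REPLACEMENT
  boot_fit <- lm(y ~ ., data = boot)
  r2(boot_fit, boot) - r2(boot_fit, dat)         # training minus test performance
})

corrected <- apparent - mean(optimism)
round(c(apparent = apparent, optimism = mean(optimism), corrected = corrected), 3)
```

Unlike a split-half approach, every observation is used both to fit and to evaluate, so no data are set aside. Any data-driven modeling steps would need to be repeated inside each resample for the correction to remain honest.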
Summary
• Think about your model
• Collect enough data
• Measure well
• Don’t destroy what you’ve measured
• Pick your variables ahead of time and collect enough data to test the model you want
• Keep all your variables in the model unless extremely unimportant
• Use more df on important variables, fewer df on “nuisance” variables
• Don’t peek at Y to combine, discard, or transform variables
• Estimate validity and shrinkage with bootstrap
• By all means, tinker with the model later, but be aware of the costs of tinkering
• Don’t forget to say you tinkered
• Go collect more data
Web links for references, software, and more
• Harrell’s regression modeling text
  – http://hesweb1.med.virginia.edu/biostat/rms/
• SAS Macros for spline estimation
  – http://hesweb1.med.virginia.edu/biostat/SAS/survrisk.txt
• Some results comparing validation methods
  – http://hesweb1.med.virginia.edu/biostat/reports/logistic.val.pdf
• SAS code for bootstrap
  – ftp://ftp.sas.com/pub/neural/jackboot.sas
• S-Plus home page
  – insightful.com
• Mike Babyak’s e-mail
  – [email protected]
• This presentation
  – http://www.duke.edu/~mbabyak