Lecture 7 Remedial Measures - Purdue Universityghobbs/STAT_512/Lecture_Notes/Regression/... · 7-1...
7-1
Lecture 7
Remedial Measures
STAT 512
Spring 2011
Background Reading
KNNL: 3.8-3.11, Chapter 4
7-2
Topic Overview
Review Assumptions & Diagnostics
Remedial Measures for
Non-normality
Non-constant variance
Non-linearity
Other Miscellaneous Topics (Chapter 4)
7-3
Regression Assumptions
X and Y are related linearly (scatter plot,
residuals vs. X)
Assumptions on the Errors...
Constancy of Variance (residuals vs. X)
Normality (normal probability plot)
Independent (sequence plot)
7-4
Remedial Measures
Two basic choices when assumptions are violated:
Use some more appropriate model (often
more complicated)
Find a transformation of the data for which
the regression model is appropriate
7-5
Non-linear Relationships
Can potentially still use a “linear” model. For
example,
$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$
$Y = \beta_0 + \beta_1 \ln X + \varepsilon$
This model is still “linear” in terms of the
regression coefficients (parameters). Simply
consider a new predictor variable, $X^2$ or $\ln X$,
and treat it like any usual predictor.
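A minimal sketch of how such a derived predictor might be set up in SAS. The dataset name mydata and the variables x and y are hypothetical placeholders, not part of the lecture's example.

```sas
/* Sketch only: mydata, x and y are hypothetical names */
data poly;
  set mydata;
  xsq = x**2;     /* squared predictor */
  lnx = log(x);   /* natural log, defined only for x > 0 */
run;

/* Quadratic model: still linear in the coefficients */
proc reg data=poly;
  model y = x xsq;
run;

/* Log-predictor model */
proc reg data=poly;
  model y = lnx;
run;
```

Both fits are ordinary least squares; only the columns handed to the MODEL statement change.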
7-6
Non-linear Relationships
Can use nonlinear regression models (beyond the
scope of this course, but discussed in Chapter 13).
For now, we will try to guess at a good
transformation and see if it works.
7-7
Variance Not Constant
Might be able to model the change in variance (if
it is related to X). In this case, can use a weighted
analysis (Chapter 11.1)
Sometimes a variance-stabilizing transformation
can be found (log, square-root are common)
Box-Cox procedure can help to find a
transformation
Note: In this class, we use natural logs,
unless specified otherwise
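One common way to carry out a weighted analysis is sketched below, assuming the error standard deviation changes roughly linearly with X. The dataset mydata, the variables x and y, and all output names are hypothetical, and the lecture does not say that this particular recipe is the one used in the course.

```sas
/* Sketch only: mydata, x, y and all output names are hypothetical */

/* Step 1: ordinary fit, keep the residuals */
proc reg data=mydata;
  model y = x;
  output out=fit1 r=resid;
run;

/* Step 2: regress |residual| on x to estimate the error standard deviation */
data fit1;
  set fit1;
  absres = abs(resid);
run;

proc reg data=fit1;
  model absres = x;
  output out=fit2 p=sdhat;
run;

/* Step 3: weight = 1 / estimated variance (left missing if sdhat <= 0) */
data fit2;
  set fit2;
  if sdhat > 0 then w = 1 / sdhat**2;
run;

/* Step 4: weighted least squares */
proc reg data=fit2;
  model y = x;
  weight w;
run;
```

Observations with a missing weight are dropped from the weighted fit, so it is worth checking how many rows fail the sdhat > 0 condition.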
7-8
Errors Not Normal
Is the error distribution known? If so,
can use SAS GENMOD procedure (Chapter 14)
Binomial (Yes/No or Categorical Resp.)
Poisson (Response is a Count)
Is the error distribution unknown?
Sometimes a transformation will help
Often, non-normality/non-constant
variance occur together and
transformations can sometimes help both!
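As a rough illustration of the GENMOD route, the sketch below fits a Poisson model and a binomial model. The datasets counts and trials and the variables y, x, events and n are hypothetical names chosen for the example.

```sas
/* Sketch only: all dataset and variable names are hypothetical */

/* Count response: Poisson errors with a log link */
proc genmod data=counts;
  model y = x / dist=poisson link=log;
run;

/* Yes/No response recorded as events out of n trials: binomial errors */
proc genmod data=trials;
  model events/n = x / dist=binomial link=logit;
run;
```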
7-9
Other Remedies
Correlated errors (not independent)
Use a model for correlated error structure
(Chapter 12)
Omission of Important Predictors
Multiple Regression (starts in Chapter 6)
7-10
Other Remedies
Outliers
Determine whether to keep in analysis
(e.g., was there a recording error? Be
very cautious of deleting observations!)
Determine influence on parameter
estimates and standard errors
Perform more robust estimation
procedure that puts less emphasis on
outliers (Chapter 11.3)
7-11
Transformations
Finding a good transformation gets easier
with practice
Generally the method is to make an
educated guess at a useful transformation
and then try it to see if it works by
rechecking diagnostic plots
A transformation of Y often helps with both
non-constant variance and non-normality.
7-12
Transformations (X)
(For nonlinear relationship issues)
Log or Square-root
Square or Exp(x)
Reciprocal or Exp(-x)
7-13
Transformations (Y)
(For non-constant variance issues)
See page 132
Standard transformations if variance increases
with the mean: square-root, log, or reciprocal
If variance decreases with the mean: a power
greater than 1 (e.g., square)
Simultaneous transformations on X may
also be useful
7-14
Box-Cox Procedure
Automated procedure to determine a “best”
power transformation for the response
Chooses among different values of λ in $Y' = Y^{\lambda}$
λ = 1 (No transformation)
λ = 0.5 (Square Root)
λ = 0 (Natural Log)
λ = -0.5 (Reciprocal Square Root)
λ = -1 (Reciprocal)
Use TRANSREG procedure in SAS
7-15
Example (1) – boxcox.sas
X - Age
Y – Plasma Level
25 Healthy children
data orig;
input age plasma @@;
datalines;
0 13.44 0 12.84 0 11.91 0 20.09 0 15.60
1 10.11 1 11.38 1 10.28 1 8.96 1 8.59
2 9.83 2 9.00 2 8.65 2 7.85 2 8.88
3 7.94 3 6.01 3 5.14 3 6.90 3 6.77
4 4.86 4 5.10 4 5.67 4 5.75 4 6.23
;
proc print data=orig; run;
7-16
Example (2)
First, let’s look at the scatterplot to see the
relationship
goptions ftitle=centb ftext=swissb htitle=3
htext=1.5 ctitle=blue ctext=black;
title1 'Original Variables';
symbol1 v=dot c=blue ;
axis1 label=('Age (Years)');
axis2 label=(angle=90 'Plasma Level');
proc gplot data=orig;
plot plasma*age / haxis=axis1 vaxis=axis2;
run;
Note the method for setting titles and axis labels.
7-17
Example (3)
7-18
Example (4)
Run SLR model and check diagnostic plots
proc reg data=orig;
model plasma=age;
output out = notrans r = resid;
run;
axis1 label=('Age (Years)');
axis2 label=(angle=90 'Residual');
proc gplot data = notrans;
plot resid*age / vref = 0 haxis=axis1 vaxis=axis2;
run;
proc univariate data=notrans;
var resid;
qqplot/normal (L=1 mu = est sigma = est);
run;
Note: Reference line in residual plot, 45-degree
line in normal probability plot
7-19
Example (5)
Root MSE 1.84135 R-Square 0.7532
7-20
Example (6)
7-21
Example (7)
Residuals do not appear to have constant
variance
Relationship not quite linear
Use Box-Cox procedure to suggest a possible
transformation of the Y variable
7-22
Example (8)
proc transreg data = orig;
model boxcox(plasma)=identity(age);
run;
The TRANSREG Procedure
Box-Cox Transformation Information for plasma
Lambda R-Square Log Like
-----
-2.00 0.80 -12.3665
-1.75 0.82 -10.1608
-1.50 0.83 -8.1127
-1.25 0.85 -6.3056
-1.00 0.86 -4.8523 *
-0.75 0.86 -3.8891 *
-0.50 0.87 -3.5523 <
-0.25 0.86 -3.9399 *
0.00 + 0.85 -5.0754 *
0.25 0.84 -6.8988
0.50 0.82 -9.2925
-----
< - Best Lambda
* - 95% Confidence Interval
+ - Convenient Lambda
7-23
Example (9)
“+” indicates the most convenient λ; “<”
indicates the best λ as determined by the
log-likelihood function.
Try $\ln(Y)$ and $1/\sqrt{Y}$
data trans; set orig;
logplasma = log(plasma);*In SAS log=ln, log10=log base 10;
rsplasma = plasma**(-0.5);
proc print data = trans; run;
7-24
Example (10)
Re-run regression, diagnostic plots with
transformed variables
title1 'Natural Log Transformation';
proc reg data = trans;
model logplasma = age;
output out = logtrans r = logresid;
run;
axis1 label=('Age (Years)');
axis2 label=(angle=90 'ln(Plasma)');
proc gplot data = logtrans;
plot logplasma * age/ haxis=axis1 vaxis=axis2;
run;
axis1 label=('Age (Years)');
axis2 label=(angle=90 'Residual');
proc gplot data = logtrans;
plot logresid * age / vref = 0 haxis=axis1 vaxis=axis2;
run;
proc univariate data=logtrans;
var logresid;
qqplot/ normal (L=1 mu = est sigma = est);
run;
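The slides show code only for the natural log fit. A parallel sketch for the reciprocal square-root variable rsplasma created earlier might look like the following; the output dataset name rstrans and residual name rsresid are assumptions made for this example.

```sas
/* Sketch only: rstrans and rsresid are hypothetical names */
title1 'Reciprocal Square Root Transformation';
proc reg data = trans;
  model rsplasma = age;
  output out = rstrans r = rsresid;
run;

axis1 label=('Age (Years)');
axis2 label=(angle=90 'Residual');
proc gplot data = rstrans;
  plot rsresid * age / vref = 0 haxis=axis1 vaxis=axis2;
run;

proc univariate data=rstrans;
  var rsresid;
  qqplot / normal (L=1 mu = est sigma = est);
run;
```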
7-25
Example (11)
7-26
Example (12)
ln(Y) fit: Root MSE 0.14385 R-Square 0.8535
7-27
Example (13)
7-28
Example (14)
7-29
Example (15)
1/sqrt(Y) fit: Root MSE 0.02319 R-Square 0.8665
7-30
Example (16)
7-31
Example (17)
Both transformations, $\ln(Y)$ and $1/\sqrt{Y}$:
Led to a reasonably linear regression
relation
Improved R-square
Reduced the non-constant variance problem
Normality assumption supported in all cases
7-32
Summary of Remedial Measures
Nonlinear Relationships – Sometimes a transformation
on X will fix this.
Nonconstant Variance – If we can model the way in
which the error changes, we can use weighted
regression. Sometimes a transformation on Y will work
instead.
Nonnormal Errors – Could use a procedure that allows
different distributions for the error term. Often, a
transformation on Y will help.
7-33
Summary of Remedial Measures
Often, a transformation on Y may help with more
than one issue (e.g., normality and non-constant
variance).
Box-Cox Procedure – Suggests some possible
Y transformations to try.
Sometimes a transformation on both X and Y may help.
Remember - Assumptions still need to be satisfied
(on the transformed scale) if we are to use linear
regression model. So, we must always recheck
diagnostic plots after transforming any variable.
7-34
Chapter 4
Covers some miscellaneous but important
topics
Joint (family) confidence levels (4.1-4.3)
Regression through the origin (4.4)
Measurement Errors (4.5, optional
reading)
Inverse predictions – when Y becomes X
(4.6)
7-35
Summary of Inference - Reminder
$100(1-\alpha)\%$ Confidence Intervals
$b_1 \pm t_{crit}\, s\{b_1\}$
$b_0 \pm t_{crit}\, s\{b_0\}$
where $t_{crit} = t(1 - \alpha/2;\ n-2)$.
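As a sketch, these separate intervals can be requested for the plasma example with the CLB option of PROC REG, and the critical value can be computed directly with TINV. The dataset name tcrit_check is a hypothetical name; n = 25 comes from the example data.

```sas
/* Separate 95% confidence limits for b0 and b1 */
proc reg data=orig;
  model plasma = age / clb alpha=0.05;
run;

/* Critical value t(1 - alpha/2; n - 2); tcrit_check is a hypothetical name */
data tcrit_check;
  n = 25;
  alpha = 0.05;
  tcrit = tinv(1 - alpha/2, n - 2);
run;
proc print data=tcrit_check; run;
```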
7-36
Family Confidence Levels
Separate confidence intervals for β0 and β1
Now: Joint estimation of β0 and β1
If k 95% CIs are independent, then their
family confidence coefficient is given by
$0.95^k$. Note, here k = 2 and $0.95^2 = 0.9025$.
Usually not independent, so family
confidence coefficient will be somewhat
larger than $0.95^k$, but certainly less than
0.95.
7-37
Bonferroni Adjustment
We want the probability that both intervals are correct
to be 0.95
Basic idea is that we have an error budget (α = .05), so
spend half on β0 and half on β1
We use α = .025 for each CI (97.5% CI), leading to
$b_0 \pm t_c\, s\{b_0\}$
$b_1 \pm t_c\, s\{b_1\}$
where $t_c = t(1 - 0.025/2;\ n-2)$.
We start with a 5% error budget and we have two
intervals so we give 2.5% to each
Each interval has two ends (tails) so we again divide
by 2
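A sketch of the Bonferroni joint intervals for the plasma example: asking PROC REG for 97.5% limits on each coefficient gives a family confidence level of at least 95% for the pair. The dataset name bonf is a hypothetical name.

```sas
/* Each interval at level 1 - 0.05/2 = 0.975, so the pair holds jointly at >= 0.95 */
proc reg data=orig;
  model plasma = age / clb alpha=0.025;
run;

/* Bonferroni critical value t(1 - alpha/(2k); n - 2) with k = 2 intervals */
data bonf;   /* hypothetical dataset name */
  n = 25;
  k = 2;
  alpha = 0.05;
  tc = tinv(1 - alpha/(2*k), n - 2);
run;
proc print data=bonf; run;
```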
7-38
Bonferroni Adjustment (Summary)
Want to control family confidence level at
95%, then need to make adjustments
Instead of , use /k .

This is often more conservative than
necessary, but will work in all cases and for
any number of CIs.
We can use this method for simultaneous
estimation of mean responses and
predictions of new observations too.
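The reason the adjustment works in all cases can be written out in one line. In the sketch below, $A_i$ denotes the event that the $i$-th interval misses its target; this notation is introduced here and is not from the slides.

```latex
% Each of the k intervals is built at level 1 - \alpha/k, so P(A_i) = \alpha/k.
\[
P(\text{all } k \text{ intervals correct})
  = 1 - P\Big(\bigcup_{i=1}^{k} A_i\Big)
  \;\ge\; 1 - \sum_{i=1}^{k} P(A_i)
  = 1 - k \cdot \frac{\alpha}{k}
  = 1 - \alpha .
\]
```

No independence assumption is needed, which is why the bound holds for any number of intervals but can be conservative.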
7-39
Mean Response CIs
We already talked about simultaneous
estimation for all Xh with a confidence band:
use Working-Hotelling
$\hat{Y}_h \pm W\, s\{\hat{Y}_h\}$ where $W^2 = 2F(1-\alpha;\ 2,\ n-2)$
For simultaneous estimation for a few Xh, say k
different values, we may use Bonferroni
instead.
$\hat{Y}_h \pm B\, s\{\hat{Y}_h\}$ where $B = t(1 - \alpha/(2k);\ n-2)$
Similar for simultaneous prediction intervals.
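The two multipliers can be computed and applied as sketched below for the plasma example. The choice k = 3 and the names mult, pred, bands, yhat and se_mean are hypothetical, picked only for illustration.

```sas
/* Sketch only: k = 3 and all new dataset/variable names are hypothetical */
data mult;
  n = 25;
  alpha = 0.05;
  k = 3;
  W = sqrt(2 * finv(1 - alpha, 2, n - 2));   /* Working-Hotelling multiplier */
  B = tinv(1 - alpha/(2*k), n - 2);          /* Bonferroni multiplier for k intervals */
  call symputx('W', W);
  call symputx('B', B);
run;
proc print data=mult; run;

/* Fitted means and their standard errors s{Yhat_h} */
proc reg data=orig noprint;
  model plasma = age;
  output out=pred p=yhat stdp=se_mean;
run;

/* Simultaneous limits under each method */
data bands;
  set pred;
  wh_lower = yhat - &W * se_mean;
  wh_upper = yhat + &W * se_mean;
  bf_lower = yhat - &B * se_mean;
  bf_upper = yhat + &B * se_mean;
run;
proc print data=bands; run;
```

Whichever multiplier is smaller gives the narrower intervals, so comparing W and B for the k of interest is a reasonable way to choose between the two methods.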
7-40
Regression Through Origin
Yi = β1Xi + εi
Should be very cautious using something
like this. Forcing regression line through
(0,0) can introduce bias, especially if X=0
isn’t in the scope of the model.
Problems with $r^2$ and other statistics
Generally safer not to use this method; if
it is true that Y=0 when X=0, then the
intercept β0 will probably not test as
significantly different from zero anyway.
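In SAS, a through-the-origin fit is requested with the NOINT option. The sketch below fits the usual model first so the intercept test can be inspected; mydata, x and y are hypothetical names.

```sas
/* Sketch only: mydata, x and y are hypothetical names */

/* Usual model: check the t-test for the intercept first */
proc reg data=mydata;
  model y = x;
run;

/* Regression through the origin */
proc reg data=mydata;
  model y = x / noint;
run;
```

With NOINT, PROC REG reports an R-square based on the uncorrected total sum of squares, so it is not comparable with the R-square from the model that includes an intercept.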
7-41
Inverse Predictions
From the equation $Y = b_0 + b_1 X$, instead of estimating
Y based on X, we want to estimate X based on Y
Sometimes called calibration
Example: A regression analysis was performed on
the amount of decrease in cholesterol level (Y)
achieved with a given dose (X) of a new drug
based on observations of 50 patients. A
physician needs to know the dose to give if a new
patient’s cholesterol needs to be decreased by a
certain amount ($Y_{h(new)}$).
7-42
Inverse Predictions
Natural point estimate is
$\hat{X}_{h(new)} = \dfrac{Y_{h(new)} - b_0}{b_1}$
Approximate confidence limits are obtained
using the standard error:
$SE\{predX\} = \sqrt{\dfrac{MSE}{b_1^2}\left[1 + \dfrac{1}{n} + \dfrac{(\hat{X}_{h(new)} - \bar{X})^2}{SS_X}\right]}$
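A sketch of the calculation, using the plasma data from earlier in the lecture because no cholesterol data are given. The target value ynew = 8 is an arbitrary illustrative plasma level, and the dataset names est and invpred are hypothetical. The values n = 25, mean age 2, and SS_X = 50 follow from the design (ages 0 to 4, five children each).

```sas
/* Sketch only: ynew = 8 is an arbitrary illustrative value */
proc reg data=orig outest=est noprint;
  model plasma = age;
run;

data invpred;
  set est;
  b0   = Intercept;     /* estimated intercept from OUTEST= */
  b1   = age;           /* estimated slope is stored under the predictor name */
  mse  = _RMSE_**2;
  n    = 25;
  xbar = 2;             /* mean age in the example data */
  ssx  = 50;            /* sum of squared deviations of age */
  ynew = 8;
  xhat = (ynew - b0) / b1;
  se   = sqrt( (mse / b1**2) * (1 + 1/n + (xhat - xbar)**2 / ssx) );
  tcrit = tinv(0.975, n - 2);
  lower = xhat - tcrit * se;   /* approximate 95% limits */
  upper = xhat + tcrit * se;
  keep ynew xhat se lower upper;
run;
proc print data=invpred; run;
```

The limits are approximate, as the slide notes, and the approximation is only reasonable when the slope is clearly different from zero.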
7-43
Upcoming in Lecture 8...
Review of Matrix Algebra in the context of
simple linear regression (Chapter 5)