Lecture 14 Diagnostics & Remedial...
14-1
Lecture 14
Diagnostics & Remedial Measures
STAT 512
Spring 2011
Background Reading
KNNL: 6.8, 7.6, 10.5
14-2
Topic Overview
• Usual plots/tests to examine error
assumptions
• Multicollinearity
• CDI Case Study
14-3
Diagnostic (Residual) Plots
• Residuals vs. Normal Quantiles (Check
Normality)
• Residuals vs. Predicted Values (Check
Constant Variance)
• Residuals vs. Predictor Variables (Check
Linearity, Constant Variance)
• Residuals vs. Order of Observations (Check
Independence)
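As a rough illustration of what goes on the horizontal axis of the normal quantile plot, the sketch below pairs each sorted residual with an approximate expected standard normal quantile. The residuals are hypothetical, and the plotting position (k - 0.375) / (n + 0.25) is one common choice (Blom's), assumed here rather than taken from the slides.

```python
from statistics import NormalDist

def normal_scores(residuals):
    """Pair each sorted residual with its approximate expected normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Blom's plotting position for the k-th smallest of n values (an assumed choice)
    quantiles = [std_normal.inv_cdf((k - 0.375) / (n + 0.25))
                 for k in range(1, n + 1)]
    return list(zip(quantiles, ordered))

# Hypothetical residuals, for illustration only
residuals = [0.8, -1.2, 0.1, 2.5, -0.4, -0.9, 0.3]
pairs = normal_scores(residuals)
for q, e in pairs:
    print(f"{q:7.3f}  {e:6.2f}")
```

If the errors are roughly normal, the printed pairs should fall close to a straight line when plotted.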
14-4
Diagnostic Tests
• Breusch-Pagan or Brown-Forsythe to test
for constancy of variance.
• Kolmogorov-Smirnov, etc. to test for
normality.
• Lack-of-fit test (we’ll hold off talking about
this one until we’ve discussed ANOVA)
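A minimal sketch of the idea behind the Brown-Forsythe test, assuming the residuals have already been split into two groups (for example, low and high values of a predictor): compute absolute deviations from each group's median and compare their means with a pooled two-sample t statistic. The residual values and the function name are hypothetical.

```python
from math import sqrt
from statistics import mean, median

def brown_forsythe_t(resid_low, resid_high):
    """Two-sample t statistic on absolute deviations from each group's median."""
    med1, med2 = median(resid_low), median(resid_high)
    d1 = [abs(e - med1) for e in resid_low]
    d2 = [abs(e - med2) for e in resid_high]
    n1, n2 = len(d1), len(d2)
    m1, m2 = mean(d1), mean(d2)
    # Pooled variance of the absolute deviations
    ss = sum((d - m1) ** 2 for d in d1) + sum((d - m2) ** 2 for d in d2)
    s = sqrt(ss / (n1 + n2 - 2))
    return (m1 - m2) / (s * sqrt(1 / n1 + 1 / n2))

# Hypothetical residuals, for illustration only
low = [-1.0, 0.0, 1.0]
high = [-4.0, 0.0, 4.0]
t_bf = brown_forsythe_t(low, high)
print(round(t_bf, 3))
```

The statistic would be compared with a t distribution on n1 + n2 - 2 degrees of freedom; a large absolute value suggests the error variance is not constant.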
14-5
Scatter Plot Matrix
• Plots Y, X1, X2, etc. against each of the
other variables.
• Compare Y to X’s to find relationships.
• Compare X’s to each other to identify
potential multicollinearity.
14-6
Remedial Measures
• Transform X if relationship non-linear
• Transform Y if violations of constant
variance and/or normality assumptions
• Use Box-Cox to come up with “best”
transformation on Y.
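For reference, a small sketch of the Box-Cox family of power transformations in its usual standardized form, with the log transformation as the case lambda = 0. This shows only the transformation itself, not the step in which software chooses the "best" lambda; the function name is an assumption made for the example.

```python
from math import log

def box_cox(y, lam):
    """Box-Cox power transformation of a positive response value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return log(y)  # limiting case as lambda -> 0
    return (y ** lam - 1) / lam

# Effect of a few common choices of lambda on values spanning two orders of magnitude
for lam in (-1, 0, 0.5, 1, 2):
    print(lam, [round(box_cox(y, lam), 3) for y in (1, 10, 100)])
```

For a fixed nonzero lambda this differs from the simple power Y^lambda only by a linear rescaling, so the fitted regression is equivalent.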
14-7
Multicollinearity (1)
• Definition: Intercorrelation exists
whenever the predictor variables are
correlated. The term multicollinearity is
generally reserved for instances where the
correlation is very high (greater than 0.9).
• Multicollinearity can make it difficult to...
  - Judge the relative importance of predictor variables.
  - Ascertain the magnitude of the effect of a predictor
    variable on the response.
14-8
Ideal Situation
• For a balanced design, absolutely no
intercorrelation exists. (For example, if
there are two predictor variables, $r_{12}^2 = 0$.)
• Uncorrelated variables cannot overlap in the
variation in the response that they explain
• Type I and Type III SS will be identical.
• Slope estimates will also be the same.
14-9
Example
(p. 279) Eight observations on productivity based
on crew-size and bonus-size.
Productivity (Y)   Crew Size (X1)   Bonus Pay (X2)
42, 39             4                2
48, 51             4                3
49, 53             6                2
61, 60             6                3
14-10
Example (2)
Output from PROC GLM
Source DF SS MS F P-value
Model 2 402.250 201.125 57 0.0004
Error 5 17.625 3.525
Total 7 419.875
Source DF Type I SS Mean Square
size 1 231.1250000 231.1250000
bonuspay 1 171.1250000 171.1250000
Source DF Type III SS Mean Square
size 1 231.1250000 231.1250000
bonuspay 1 171.1250000 171.1250000
14-11
Example (3)
Parameter EST SE T P-value
Intercept 0.375 4.74 0.08 0.9400
size 5.375 0.66 8.10 0.0005
bonuspay 9.250 1.33 6.97 0.0009
If we consider only size:
Source DF SS MS F P-val
Model 1 231.125 231.125 7.4 0.035
Error 6 188.750 31.458
Total 7 419.875
Parameter Est SE T P-value
Intercept 23.5 10.1 2.32 0.0591
size 5.375 1.98 2.71 0.0351
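The key numbers in this example can be checked directly from the eight observations. The sketch below (plain Python, variable names chosen for illustration) computes the correlation between crew size and bonus pay, the simple-regression slope for each predictor, and the regression sum of squares each predictor explains on its own.

```python
# Data from the example: (crew size X1, bonus pay X2, productivity Y)
data = [(4, 2, 42), (4, 2, 39), (4, 3, 48), (4, 3, 51),
        (6, 2, 49), (6, 2, 53), (6, 3, 61), (6, 3, 60)]
x1 = [r[0] for r in data]
x2 = [r[1] for r in data]
y = [r[2] for r in data]

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

r12 = cross(x1, x2) / (cross(x1, x1) * cross(x2, x2)) ** 0.5
slope_size = cross(x1, y) / cross(x1, x1)    # Y regressed on size alone
slope_bonus = cross(x2, y) / cross(x2, x2)   # Y regressed on bonus pay alone
ssr_size = slope_size ** 2 * cross(x1, x1)
ssr_bonus = slope_bonus ** 2 * cross(x2, x2)

print(r12, slope_size, slope_bonus, ssr_size, ssr_bonus)
```

Because the predictors are uncorrelated, the single-predictor slopes and sums of squares agree with the two-predictor output, and the two sums of squares add up to the model SS of 402.250.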
14-12
Perfect Correlation Example
(p. 281) Four observations in three-space, but
over the line X2 = 5 + 0.5 * X1.
X1 X2 Y
2 6 25
8 9 81
6 8 60
10 10 113
14-13
Perfect Correlation Example (2)
• Since the points fall exactly on a line in the
(X1, X2) plane, there are infinitely many
regression planes that fit equally well. There is
no unique best regression plane.
• If you try to fit this in SAS, you will get output
that does not fit the full model (it cannot,
since X'X is not invertible).
• SAS is “smart” enough to figure out that
something is wrong, and tries to do something
about it.
14-14
Output
Source DF SS MS F P-val
Model 1 4007.15 4007.15 91.49 0.0108
Error 2 87.60 43.80
Total 3 4094.75
NOTE: Model is not full rank. Least-squares
solutions for the parameters are not unique.
Some statistics will be misleading. A
reported DF of 0 or B means that the
estimate is biased.
14-15
Output (2)
NOTE: The following parameters have been set
to 0, since the variables are a linear
combination of other variables as shown.
x2 = 5 * Intercept + 0.5 * x1
Parameter Std
Variable DF Estimate Error T P-value
Intercept B 0.20 7.99 0.03 0.9800
x1 B 10.70 1.12 9.56 0.0108
x2 0 0 . . .
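To see why the solution is not unique, the sketch below builds X'X for the four observations and checks that its determinant is zero. It then compares the fitted values from the SAS solution (x2 set to 0) with those from an alternative coefficient vector obtained by substituting x1 = 2 * x2 - 10; the alternative vector is derived here for illustration and does not appear in the SAS output.

```python
x1 = [2, 8, 6, 10]
x2 = [6, 9, 8, 10]   # exactly 5 + 0.5 * x1

# Design matrix rows: (1, x1, x2)
X = [(1, a, b) for a, b in zip(x1, x2)]

# X'X, computed with exact integer arithmetic
xtx = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

print("det(X'X) =", det3(xtx))

# Two different coefficient vectors (b0, b1, b2) that give the same fitted values
sas_fit = [0.20 + 10.70 * a + 0.0 * b for a, b in zip(x1, x2)]
alt_fit = [-106.80 + 0.0 * a + 21.40 * b for a, b in zip(x1, x2)]
print(sas_fit)
print(alt_fit)
```

Any weighted combination of these two coefficient vectors (with weights summing to one) fits equally well, which is the sense in which the least-squares solution is not unique.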
14-16
Effects of Multicollinearity
• Variables are almost never 100% correlated.
• When there is a lot of intercorrelation...
  - Can generally still obtain a “good fit”.
  - Prediction made within the scope of the model
    is generally unaffected.
  - The X'X matrix has a near-zero determinant, which
    can be a source of serious round-off errors.
14-17
Effects on Regression Coefficients
• Estimated regression coefficients are highly
correlated with one another and have large
standard errors.
• Cannot use the common interpretation of a
regression coefficient (for one thing, it
probably isn’t feasible to hold the other
variables constant).
14-18
Simultaneous T-tests
• A common abuse of multiple linear regression
models is to do simultaneous t-tests for testing
$\beta_k = 0$, k = 1, 2, ..., p – 1
• The big problem with this is that these are all
MARGINAL (or variable-added-last) tests.
• If there were no intercorrelation, all of the
variables would act “independently” and this
would be no problem. But when there is
intercorrelation, we would often end up
incorrectly dropping ALL variables on this
basis.
14-19
Extra Sums of Squares
• When predictor variables are correlated, Type I
and Type III SS tend to be quite different
• A variable added first may explain a lot of
the variation, but added later it may not
explain much
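For two predictors, the order dependence can be written out with the standard decomposition of the regression sum of squares:

```latex
\begin{align*}
SSR(X_1, X_2) &= SSR(X_1) + SSR(X_2 \mid X_1) \\
              &= SSR(X_2) + SSR(X_1 \mid X_2)
\end{align*}
```

With X1 entered first, its Type I SS is SSR(X1) while its Type III SS is SSR(X1 | X2). The two agree when the predictors are uncorrelated (as in the crew size example) and can differ greatly when they are not.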
14-20
Indicators of Multicollinearity
• Large simple correlations between pairs of
predictors.
• Overall F-test says the model is significant, but
the marginal t-tests do not show any significance.
• Watch for Type I and Type III SS having
large differences.
• Large changes in estimated regression
coefficients when variables are
added/deleted.
14-21
Variance Inflation Factors
• Formal method for detecting multicollinearity
• VIF is related to the variance of the estimated
regression coefficients (think: variances get
“inflated” by having intercorrelation among
the predictors)
  $VIF_k = \frac{1}{1 - R_k^2}$
• $R_k^2$ is the coefficient of determination obtained
in the regression of Xk on all the other predictors.
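The name "variance inflation" comes from the standard expression for the variance of an estimated coefficient, in which the factor 1 / (1 - R_k^2) multiplies the variance the coefficient would have if Xk were uncorrelated with the other predictors:

```latex
\[
\sigma^2\{b_k\}
  = \frac{\sigma^2}{\sum_i (X_{ik} - \bar{X}_k)^2} \cdot \frac{1}{1 - R_k^2}
  = \frac{\sigma^2}{\sum_i (X_{ik} - \bar{X}_k)^2} \cdot VIF_k
\]
```

When R_k^2 = 0 the factor equals 1 (no inflation); as R_k^2 approaches 1 the variance grows without bound.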
14-22
Variance Inflation Factors (2)
• $R_k^2 > 0.9$ means that Xk is well predicted by
the other variables. This corresponds to a VIF
of 10 or higher and indicates excessive
multicollinearity.
• Tolerance is defined as
  $TOL_k = 1 - R_k^2 = 1 / VIF_k$
• Tolerances below 0.01, 0.001, or 0.0001
typically raise concern.
14-23
Physicians Case Study (7.37)
• Goal: Predict # of active physicians in a
county (Y) from
1. X1 = Total Population
2. X2 = Total Personal Income
3. X3 = Land Area
4. X4 = Percent of Pop. Age 65 or older
5. X5 = # of hospital beds
6. X6 = Total Serious Crimes
• SAS code available in file CDI.sas.
14-24
Initial Model
• All six predictors included
• VIF and TOL can be used as options after
the ‘/’ in the model statement of REG
Variable DF Tolerance VIF
tot_pop 1 0.01192 83.87229
tot_income 1 0.01883 53.10731
land_area 1 0.79952 1.25074
pop_elderly 1 0.94209 1.06147
beds 1 0.12251 8.16293
crimes 1 0.16763 5.96556
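As a quick check of this output, the sketch below recomputes each VIF as the reciprocal of the reported tolerance, recovers the implied R_k^2, and flags predictors at or above the VIF = 10 guideline. The tolerances are copied from the table; small differences from the printed VIFs are due to rounding of the tolerances.

```python
# Tolerances copied from the PROC REG output above
tolerance = {
    "tot_pop": 0.01192,
    "tot_income": 0.01883,
    "land_area": 0.79952,
    "pop_elderly": 0.94209,
    "beds": 0.12251,
    "crimes": 0.16763,
}

def vif_from_tolerance(tol):
    """VIF_k = 1 / TOL_k = 1 / (1 - R_k^2)."""
    return 1.0 / tol

flagged = []
for name, tol in tolerance.items():
    vif = vif_from_tolerance(tol)
    r2 = 1.0 - tol  # R_k^2 from regressing this predictor on the others
    print(f"{name:12s} R2_k = {r2:.5f}  VIF = {vif:8.3f}")
    if vif >= 10:
        flagged.append(name)

print("VIF >= 10:", flagged)
```

Only tot_pop and tot_income exceed the guideline, with implied R_k^2 values of about 0.988 and 0.981.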
14-25
Residual Plots
[Figure: Residual vs. Predicted Value of physicians]
14-26
Residual Plots
Resid vs. Total Population – See SAS for other variables
[Figure: Residual vs. tot_pop]
14-27
Normal Probability Plot
[Figure: Residual vs. Normal Quantiles]
14-28
Assumption Violations
• Errors not normal.
• Variance does not appear to be constant.
• BOXCOX suggests a log transformation,
which clears up some of the issues.
14-29
Normal Probability Plot
[Figure: Residual vs. Normal Quantiles]
14-30
Two Outliers
• Can see these in the QQ-plot.
• Further investigation shows that they are for
Los Angeles County and Cook County
  - Twice as many physicians as other counties.
  - Also outliers in total population and total
    income.
  - There is reason to drop these two for the time
    being, as it makes sense that such huge
    counties should not be considered part of the
    same population as the rest.
14-31
QQ Plot w/o Outliers
[Figure: Residual vs. Normal Quantiles]
14-32
Residual Plot
[Figure: Residual vs. Predicted Value of lphysicians]
14-33
Residual Plot
Resid vs. Total Population – See SAS for other variables
[Figure: Residual vs. tot_pop]
14-34
Still Problems?
• Normality is ok
• No other unreasonable outliers
• Residual Plot suggests some nonlinearity
• Look at Residual vs. Predictor Variable
Plots to learn more
• Possibly add some quadratic or other terms
• We’ve thus far ignored multicollinearity –
time to consider it.
14-35
Multicollinearity
• VIF’s for tot_pop and tot_income already
have informed us that there are problems.
              pop    inc    l_ar   eld    beds
tot_pop       1.00   0.99   0.17  -0.03   0.92
tot_income    0.99   1.00   0.13  -0.02   0.90
land_area     0.17   0.13   1.00   0.01   0.07
pop_elderly  -0.03  -0.02   0.01   1.00   0.05
beds          0.92   0.90   0.07   0.05   1.00
• Will continue analysis with Model Selection
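A small sketch of how the pairwise correlations can be screened against the "greater than 0.9" rule of thumb from earlier in the lecture. The matrix is typed in from the slide, reading the emphasized entries as 0.99, 0.92, and 0.90 (an assumption about the original formatting).

```python
names = ["tot_pop", "tot_income", "land_area", "pop_elderly", "beds"]
corr = [
    [1.00, 0.99, 0.17, -0.03, 0.92],
    [0.99, 1.00, 0.13, -0.02, 0.90],
    [0.17, 0.13, 1.00, 0.01, 0.07],
    [-0.03, -0.02, 0.01, 1.00, 0.05],
    [0.92, 0.90, 0.07, 0.05, 1.00],
]
threshold = 0.9  # "very high" cutoff quoted earlier in the lecture

# Scan the upper triangle for pairs whose absolute correlation exceeds the cutoff
high_pairs = [(names[i], names[j], corr[i][j])
              for i in range(len(names))
              for j in range(i + 1, len(names))
              if abs(corr[i][j]) > threshold]

for a, b, r in high_pairs:
    print(f"{a} ~ {b}: r = {r:.2f}")
```

Since the correlations are rounded to two decimals, the tot_income and beds pair (0.90) sits right at the cutoff and is not flagged by a strict inequality.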
14-36
Big Picture
• For checking basic assumptions: PLOTS
are generally easier to construct than
TESTS – and generally if there is
something to see, it will show up in the
appropriate plot.
• MULTICOLLINEARITY is a big issue
when trying to interpret estimates –
however it’s not really a problem for
prediction.
14-37
Upcoming in Lecture 15...
• Model Building: Selection Criteria (Ch 9)
• Continuing the Physicians Dataset Analysis