Lecture 14 Diagnostics & Remedial...

14-1

Lecture 14

Diagnostics & Remedial Measures

STAT 512

Spring 2011

Background Reading

KNNL: 6.8, 7.6, 10.5

14-2

Topic Overview

• Usual plots/tests to examine error

assumptions

• Multicollinearity

• CDI Case Study

14-3

Diagnostic (Residual) Plots

• Residuals vs. Normal Quantiles (Check

Normality)

• Residuals vs. Predicted Values (Check

Constant Variance)

• Residuals vs. Predictor Variables (Check

Linearity, Constant Variance)

• Residuals vs. Order of Observations (Check

Independence)

14-4

Diagnostic Tests

• Breusch-Pagan or Brown-Forsythe to test

for constancy of variance.

• Kolmogorov-Smirnov, etc. to test for

normality.

• Lack-of-fit test (we’ll hold off talking about

this one until we’ve discussed ANOVA)

14-5

Scatter Plot Matrix

• Plots Y, X1, X2, etc. against each of the

other variables.

• Compare Y to X’s to find relationships.

• Compare X’s to each other to identify

potential multicollinearity.

14-6

Remedial Measures

• Transform X if relationship non-linear

• Transform Y if violations of constant

variance and/or normality assumptions

• Use Box-Cox to come up with “best”

transformation on Y.
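As a rough illustration of what a Box-Cox transformation does to Y, here is a minimal Python sketch. The slides do not state the formula; the scaled power form used below is the common textbook version and is an assumption here, as is the function name `box_cox`. The course itself uses SAS for this step.

```python
import math

def box_cox(y, lam):
    """Scaled Box-Cox power transformation of a single positive value y.

    Assumed form: (y**lam - 1) / lam for lam != 0, and log(y) for lam == 0.
    """
    if y <= 0:
        raise ValueError("Box-Cox requires a strictly positive response")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

# lam = 1 leaves the shape of Y unchanged (only shifts it by 1);
# lam = 0 is the log transformation; lam = 0.5 is a square-root-type transformation.
for lam in (1, 0.5, 0):
    print(lam, [round(box_cox(v, lam), 3) for v in (1, 10, 100)])
```

The "best" lambda is then the one whose transformed response gives the best-fitting regression; that search is left to the software.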

14-7

Multicollinearity (1)

• Definition: Intercorrelation exists

whenever the predictor variables are

correlated. The term multicollinearity is

generally reserved for instances where the

correlation is very high (greater than 0.9).

• Multicollinearity can make it difficult to...

  - Judge relative importance of predictor variables.

  - Ascertain the magnitude of an effect of a predictor variable on the response.

14-8

Ideal Situation

• For a balanced design, absolutely no

intercorrelation exists. (For example, if

there are two predictor variables, $r_{12}^2 = 0$).

• Uncorrelated variables cannot overlap in the

variation in the response that they explain

• Type I and Type III SS will be identical.

• Slope estimates will also be the same.

14-9

Example

(p. 279) Eight observations on productivity based

on crew-size and bonus-size.

Productivity (Y)   Crew Size (X1)   Bonus Pay (X2)
42, 39             4                2
48, 51             4                3
49, 53             6                2
61, 60             6                3

14-10

Example (2)

Output from PROC GLM

Source   DF   SS        MS        F    P-value
Model    2    402.250   201.125   57   0.0004
Error    5    17.625    3.525
Total    7    419.875

Source     DF   Type I SS     Mean Square
size       1    231.1250000   231.1250000
bonuspay   1    171.1250000   171.1250000

Source     DF   Type III SS   Mean Square
size       1    231.1250000   231.1250000
bonuspay   1    171.1250000   171.1250000

14-11

Example (3)

Parameter   Est     SE     T      P-value
Intercept   0.375   4.74   0.08   0.9400
size        5.375   0.66   8.10   0.0005
bonuspay    9.250   1.33   6.97   0.0009

If we consider only size:

Source   DF   SS        MS        F     P-value
Model    1    231.125   231.125   7.4   0.035
Error    6    188.750   31.458
Total    7    419.875

Parameter   Est     SE     T      P-value
Intercept   23.5    10.1   2.32   0.0591
size        5.375   1.98   2.71   0.0351
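The numbers in this example can be checked by hand. The following is a minimal pure-Python sketch (not the course's SAS code; the helper names `mean` and `cross` are made up for illustration) showing that the two predictors are uncorrelated and that the slopes and sums of squares for each predictor do not depend on whether the other predictor is in the model.

```python
# Crew-size data from the example (two productivity values per design point).
size  = [4, 4, 4, 4, 6, 6, 6, 6]
bonus = [2, 2, 3, 3, 2, 2, 3, 3]
y     = [42, 39, 48, 51, 49, 53, 61, 60]

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Corrected sum of cross-products: sum of (u_i - ubar) * (v_i - vbar)."""
    ub, vb = mean(u), mean(v)
    return sum((a - ub) * (b - vb) for a, b in zip(u, v))

# A zero cross-product between the predictors means r_12 = 0.
s12 = cross(size, bonus)

# With s12 == 0 the normal equations decouple, so each multiple-regression
# slope equals the slope from the corresponding simple regression.
b_size  = cross(size, y) / cross(size, size)
b_bonus = cross(bonus, y) / cross(bonus, bonus)
b0 = mean(y) - b_size * mean(size) - b_bonus * mean(bonus)

# Regression sum of squares for each predictor fitted alone.
ss_size  = cross(size, y) ** 2 / cross(size, size)
ss_bonus = cross(bonus, y) ** 2 / cross(bonus, bonus)

print("s12 =", s12)
print("slopes:", b_size, b_bonus, "intercept:", b0)
print("SS:", ss_size, ss_bonus)
```

The slopes (5.375 and 9.25), the intercept (0.375), and the sums of squares (231.125 and 171.125) agree with the PROC GLM output, which is why Type I and Type III SS coincide here.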

14-12

Perfect Correlation Example

(p. 281) Four observations in three-space, but

over the line X2 = 5 + 0.5 * X1.

X1 X2 Y

2 6 25

8 9 81

6 8 60

10 10 113

14-13

Perfect Correlation Example (2)

• Since the (X1, X2) points lie exactly on a line in two-space, infinitely many regression planes fit the data equally well. There is no unique best regression plane.

• If you try to fit this in SAS, the output will not be for the full model, because $X'X$ is not invertible and the full model cannot be fit.

• SAS is “smart” enough to figure out that

something is wrong, and try to do something

about it.

14-14

Output

Source   DF   SS        MS        F       P-value
Model    1    4007.15   4007.15   91.49   0.0108
Error    2    87.60     43.80
Total    3    4094.75

NOTE: Model is not full rank. Least-squares

solutions for the parameters are not unique.

Some statistics will be misleading. A

reported DF of 0 or B means that the

estimate is biased.

14-15

Output (2)

NOTE: The following parameters have been set

to 0, since the variables are a linear

combination of other variables as shown.

x2 = 5 * Intercept + 0.5 * x1

Variable    DF   Parameter Estimate   Std Error   T      P-value
Intercept   B    0.20                 7.99        0.03   0.9800
x1          B    10.70                1.12        9.56   0.0108
x2          0    0                    .           .      .
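To see why SAS cannot fit the full model here, the sketch below (pure Python, exact integer and rational arithmetic; the helper `det3` is a made-up name, and this is an illustration rather than the course's SAS code) builds X'X for the four observations and confirms that its determinant is exactly zero. It then fits Y on X1 alone, which is the reduced model SAS falls back to.

```python
from fractions import Fraction

x1 = [2, 8, 6, 10]
x2 = [6, 9, 8, 10]
y  = [25, 81, 60, 113]

# Every observation satisfies X2 = 5 + 0.5 * X1 exactly.
assert all(b == 5 + Fraction(a, 2) for a, b in zip(x1, x2))

# Design matrix rows (intercept, x1, x2) and the 3x3 matrix X'X.
rows = [[1, a, b] for a, b in zip(x1, x2)]
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

print("det(X'X) =", det3(xtx))  # zero, so X'X has no inverse

# Simple regression of Y on X1 only.
n = len(x1)
sxx = sum(a * a for a in x1) - Fraction(sum(x1) ** 2, n)
sxy = sum(a * b for a, b in zip(x1, y)) - Fraction(sum(x1) * sum(y), n)
slope = sxy / sxx
intercept = Fraction(sum(y), n) - slope * Fraction(sum(x1), n)
print("slope =", float(slope), "intercept =", float(intercept))
```

The slope (10.70) and intercept (0.20) match the estimates SAS reports after setting the x2 coefficient to 0.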

14-16

Effects of Multicollinearity

• Variables are almost never 100% correlated.

• When there is a lot of intercorrelation....

  - Can generally still obtain a “good fit”.

  - Prediction made within the scope of the model is generally unaffected.

  - The $X'X$ matrix has a near-zero determinant, which can be a source of serious round-off errors.

14-17

Effects on Regression Coefficients

• Estimated regression coefficients are highly correlated and have large standard errors.

• Cannot use the common interpretation of a regression coefficient (for one thing, it probably isn’t feasible to hold the other variables constant).

14-18

Simultaneous T-tests

• Common abuse of multiple linear regression

models is to do simultaneous t-tests for testing

$\beta_k = 0$, $k = 1, 2, \ldots, p - 1$

• The big problem with this is that these are all

MARGINAL (or variable-added-last) tests.

• If there were no intercorrelation, all of the variables would act “independently” and this would be no problem. But when there is intercorrelation, one would often end up incorrectly dropping ALL variables on this basis.

14-19

Extra Sums of Squares

• When predictor variables are correlated, Type I

and Type III SS tend to be quite different

• Added first, a variable may explain a lot of the variation, but added later it may explain very little.

14-20

Indicators of Multicollinearity

• Large simple correlations between pairs of

predictors.

• F-test says model is significant; Marginal T-

tests do not show any significance.

• Watch for Type I and Type III SS having

large differences.

• Large changes in estimated regression

coefficients when variables are

added/deleted.

14-21

Variance Inflation Factors

• Formal method for detecting multicollinearity

• VIF is related to the variance of the estimated

regression coefficients (think: variances get

“inflated” by having intercorrelation among

the predictors)

$$VIF_k = \frac{1}{1 - R_k^2}$$

• $R_k^2$ is the coefficient of determination obtained

in regression of Xk on all other predictors.

14-22

Variance Inflation Factors (2)

• $R_k^2 > 0.9$ means that Xk is well predicted by

the other variables. This corresponds to VIF

of 10 or higher and indicates excessive

multicollinearity.

• Tolerance is defined as

$$TOL_k = 1 - R_k^2 = 1 / VIF_k$$

• Tolerances below 0.01, 0.001, or 0.0001 typically raise concern.
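The relationship between $R_k^2$, VIF, and tolerance is simple arithmetic. Here is a minimal Python sketch of it; the function names `vif` and `tolerance` are made up for illustration, and in the course these values come from the VIF and TOL options of PROC REG.

```python
def vif(r2_k):
    """Variance inflation factor for predictor k, given R_k^2 from
    regressing X_k on all the other predictors."""
    if not 0 <= r2_k < 1:
        raise ValueError("R_k^2 must lie in [0, 1) for a finite VIF")
    return 1 / (1 - r2_k)

def tolerance(r2_k):
    """Tolerance is the reciprocal of the VIF, i.e. 1 - R_k^2."""
    return 1 - r2_k

# Uncorrelated predictors, the R_k^2 = 0.9 cutoff, and a near-collinear case.
for r2 in (0.0, 0.5, 0.9, 0.99):
    print(f"R2_k = {r2:<5} VIF = {vif(r2):8.2f}  TOL = {tolerance(r2):.4f}")
```

An $R_k^2$ of 0 gives a VIF of 1 (no inflation), and the $R_k^2 = 0.9$ cutoff corresponds to a VIF of 10 and a tolerance of 0.1.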

14-23

Physicians Case Study (7.37)

• Goal: Predict # of active physicians in a

county (Y) from

1. X1 = Total Population

2. X2 = Total Personal Income

3. X3 = Land Area

4. X4 = Percent of Pop. Age 65 or older

5. X5 = # of hospital beds

6. X6 = Total Serious Crimes

• SAS code available in file CDI.sas.

14-24

Initial Model

• All six predictors included

• VIF and TOL can be used as options after

the ‘/’ in the model statement of REG

Variable      DF   Tolerance   VIF
tot_pop       1    0.01192     83.87229
tot_income    1    0.01883     53.10731
land_area     1    0.79952     1.25074
pop_elderly   1    0.94209     1.06147
beds          1    0.12251     8.16293
crimes        1    0.16763     5.96556

14-25

Residual Plots

[Figure: Residual vs. Predicted Value of physicians]

14-26

Residual Plots

Resid vs. Total Population – See SAS for other variables

[Figure: Residual vs. tot_pop]

14-27

Normal Probability Plot

[Figure: Residual vs. Normal Quantiles]

14-28

Assumption Violations

• Errors not normal.

• Variance does not appear to be constant.

• BOXCOX suggests a log transformation,

which clears up some of the issues.

14-29

Normal Probability Plot

[Figure: Residual vs. Normal Quantiles]

14-30

Two Outliers

• Can see these in the QQ-plot.

• Further investigation shows that they are for

Los Angeles County and Cook County

  - Twice as many physicians as other counties.

  - Also outliers in total population and total income.

  - There is reason to drop these two for the time being, since it makes sense that such huge counties should not be treated as coming from the same population as the rest.

14-31

QQ Plot w/o Outliers

[Figure: Residual vs. Normal Quantiles]

14-32

Residual Plot

[Figure: Residual vs. Predicted Value of lphysicians]

14-33

Residual Plot

Resid vs. Total Population – See SAS for other variables

[Figure: Residual vs. tot_pop]

14-34

Still Problems?

• Normality is ok

• No other unreasonable outliers

• Residual Plot suggests some nonlinearity

• Look at Residual vs. Predictor Variable

Plots to learn more

• Possibly add some quadratic or other terms

• We’ve thus far ignored multicollinearity –

time to consider it.

14-35

Multicollinearity

• VIF’s for tot_pop and tot_income already

have informed us that there are problems.

              pop     inc     l_ar    eld     beds
tot_pop       1.00    0.99    0.17    -0.03   0.92
tot_income    0.99    1.00    0.13    -0.02   0.90
land_area     0.17    0.13    1.00    0.01    0.07
pop_elderly   -0.03   -0.02   0.01    1.00    0.05
beds          0.92    0.90    0.07    0.05    1.00

• Will continue analysis with Model Selection

14-36

Big Picture

• For checking basic assumptions: PLOTS

are generally easier to construct than

TESTS – and generally if there is

something to see, it will show up in the

appropriate plot.

• MULTICOLLINEARITY is a big issue

when trying to interpret estimates –

however it’s not really a problem for

prediction.

14-37

Upcoming in Lecture 15...

• Model Building: Selection Criteria (Ch 9)

• Continuing the Physicians Dataset Analysis
