
Transcript of Lecture 14: Diagnostics & Remedial Measures


14-1

Lecture 14

Diagnostics & Remedial Measures

STAT 512

Spring 2011

Background Reading

KNNL: 6.8, 7.6, 10.5


14-2

Topic Overview

• Usual plots/tests to examine error assumptions

• Multicollinearity

• CDI Case Study


14-3

Diagnostic (Residual) Plots

• Residuals vs. Normal Quantiles (Check Normality)
• Residuals vs. Predicted Values (Check Constant Variance)
• Residuals vs. Predictor Variables (Check Linearity, Constant Variance)
• Residuals vs. Order of Observations (Check Independence)
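A minimal SAS sketch of producing these plots from a fitted regression (the dataset and variable names here are placeholders, not from the lecture):

proc reg data=mydata;
  model y = x1 x2;
  output out=resids p=pred r=resid;   /* save fitted values and residuals */
run;
quit;

/* residuals vs. predicted values, and vs. a predictor */
proc sgplot data=resids;
  scatter x=pred y=resid;
  refline 0 / axis=y;
run;
proc sgplot data=resids;
  scatter x=x1 y=resid;
  refline 0 / axis=y;
run;

/* normal quantile (QQ) plot of the residuals */
proc univariate data=resids;
  qqplot resid / normal(mu=est sigma=est);
run;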


14-4

Diagnostic Tests

• Breusch-Pagan or Brown-Forsythe to test for constancy of variance.
• Kolmogorov-Smirnov, etc., to test for normality.
• Lack-of-fit test (we’ll hold off talking about this one until we’ve discussed ANOVA).
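Continuing the sketch above (names are still placeholders), these tests can be run on the saved residuals; the Breusch-Pagan statistic is computed "by hand" here from a regression of the squared residuals on the predictors:

/* normality tests (Kolmogorov-Smirnov, etc.) on the residuals */
proc univariate data=resids normal;
  var resid;
run;

/* Breusch-Pagan: regress squared residuals on the predictors, then
   BP = (SSR_star / 2) divided by (SSE / n)**2, where SSR_star comes from this
   fit and SSE, n come from the original fit; compare to chi-square with p - 1 df */
data resids_sq;
  set resids;
  resid_sq = resid**2;
run;
proc reg data=resids_sq;
  model resid_sq = x1 x2;
run;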


14-5

Scatter Plot Matrix

• Plots Y, X1, X2, etc. against each of the other variables.
• Compare Y to the X’s to find relationships.
• Compare the X’s to each other to identify potential multicollinearity.
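One way to get a scatter plot matrix in SAS (a sketch; the names are placeholders):

proc sgscatter data=mydata;
  matrix y x1 x2 x3 / diagonal=(histogram);
run;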


14-6

Remedial Measures

• Transform X if the relationship is non-linear.
• Transform Y if the constant-variance and/or normality assumptions are violated.
• Use Box-Cox to come up with the “best” transformation of Y, as sketched below.
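A hedged sketch of a Box-Cox search using PROC TRANSREG (names are placeholders; the lambda grid is an arbitrary choice):

proc transreg data=mydata;
  model boxcox(y / lambda=-2 to 2 by 0.25) = identity(x1 x2);
run;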


14-7

Multicollinearity (1)

• Definition: Intercorrelation exists whenever the predictor variables are correlated. The term multicollinearity is generally reserved for instances where the correlation is very high (greater than 0.9).
• Multicollinearity can make it difficult to...
  – Judge the relative importance of predictor variables.
  – Ascertain the magnitude of the effect of a predictor variable on the response.


14-8

Ideal Situation

• For a balanced design, absolutely no intercorrelation exists. (For example, if there are two predictor variables, r_12^2 = 0.)
• Uncorrelated variables cannot overlap in the variation in the response that they explain.
• Type I and Type III SS will be identical.
• Slope estimates will also be the same.


14-9

Example

(p. 279) Eight observations on productivity, based on crew size and bonus pay.

Productivity (Y)   Crew Size (X1)   Bonus Pay (X2)
42, 39             4                2
48, 51             4                3
49, 53             6                2
61, 60             6                3
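A sketch of a data step and PROC GLM call that could produce output like that on the next two slides (the variable names size and bonuspay match the output; the response name y and the dataset name are assumptions):

data productivity;
  input y size bonuspay;
  datalines;
42 4 2
39 4 2
48 4 3
51 4 3
49 6 2
53 6 2
61 6 3
60 6 3
;
run;

proc glm data=productivity;
  model y = size bonuspay;   /* Type I and Type III SS are printed by default */
run;
quit;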


14-10

Example (2)

Output from PROC GLM:

Source     DF   SS        MS        F      P-value
Model       2   402.250   201.125   57     0.0004
Error       5    17.625     3.525
Total       7   419.875

Source     DF   Type I SS     Mean Square
size        1   231.1250000   231.1250000
bonuspay    1   171.1250000   171.1250000

Source     DF   Type III SS   Mean Square
size        1   231.1250000   231.1250000
bonuspay    1   171.1250000   171.1250000


14-11

Example (3)

Parameter   Estimate   SE     T      P-value
Intercept   0.375      4.74   0.08   0.9400
size        5.375      0.66   8.10   0.0005
bonuspay    9.250      1.33   6.97   0.0009

If we consider only size:

Source     DF   SS        MS        F     P-value
Model       1   231.125   231.125   7.4   0.035
Error       6   188.750    31.458
Total       7   419.875

Parameter   Estimate   SE     T      P-value
Intercept   23.5       10.1   2.32   0.0591
size        5.375      1.98   2.71   0.0351


14-12

Perfect Correlation Example

(p. 281) Four observations in three-space, but the predictor values lie exactly on the line X2 = 5 + 0.5 * X1.

X1   X2    Y
 2    6    25
 8    9    81
 6    8    60
10   10   113
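A sketch of a data step and PROC REG fit for this example (the variable names x1, x2, y match the output on the next slides; the dataset name is an assumption). Output like that on the next two slides would result:

data perfect;
  input x1 x2 y;
  datalines;
 2  6  25
 8  9  81
 6  8  60
10 10 113
;
run;

proc reg data=perfect;
  model y = x1 x2;
run;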


14-13

Perfect Correlation Example (2)

• Since the points lie exactly on a line in two-space, there are infinitely many regression planes available. There is no unique best regression plane.
• If you try to fit this in SAS, you will get output that does not fit the full model (because it cannot: X'X is not invertible).
• SAS is “smart” enough to figure out that something is wrong and tries to do something about it.


14-14

Output

Source     DF   SS        MS        F       P-value
Model       1   4007.15   4007.15   91.49   0.0108
Error       2     87.60     43.80
Total       3   4094.75

NOTE: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.


14-15

Output (2)

NOTE: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

x2 = 5 * Intercept + 0.5 * x1

Variable    DF   Parameter Estimate   Std Error   T      P-value
Intercept   B     0.20                7.99        0.03   0.9800
x1          B    10.70                1.12        9.56   0.0108
x2          0     0                   .           .      .


14-16

Effects of Multicollinearity

• Variables are almost never 100% correlated.
• When there is a lot of intercorrelation...
  – Can generally still obtain a “good fit”.
  – Prediction made within the scope of the model is generally unaffected.
  – The X'X matrix has a near-zero determinant that can be a source of serious round-off errors.
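A small illustration (made-up numbers, using SAS/IML if it is available) of how near-collinearity pushes det(X'X) toward zero; here x2 is almost, but not exactly, 5 + 0.5*x1:

proc iml;
  /* columns: intercept, x1, x2 (x2 nearly collinear with the other two columns) */
  x = {1  2  6.1,
       1  8  8.9,
       1  6  8.0,
       1 10 10.1};
  xpx = x` * x;
  d = det(xpx);
  print xpx, d[label="det(X'X)"];   /* d is tiny relative to the entries of X'X */
quit;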


14-17

Effects on Regression Coefficients

• Regression coefficients are highly correlated and have large standard errors.
• Cannot use the common interpretation of a regression coefficient (since, for one thing, it probably isn’t feasible to hold the other variables constant).


14-18

Simultaneous T-tests

• A common abuse of multiple linear regression models is to do simultaneous t-tests of β_k = 0, k = 1, 2, ..., p – 1.
• The big problem with this is that these are all MARGINAL (or variable-added-last) tests.
• If there were no intercorrelation, all of the variables would act “independently” and this would be no problem. But when there is intercorrelation, one would often end up incorrectly dropping ALL variables on this basis.


14-19

Extra Sums of Squares

• When predictor variables are correlated, Type I and Type III SS tend to be quite different.
• Added first, a variable may do a lot in terms of explaining variation, but added later it may not do much.


14-20

Indicators of Multicollinearity

• Large simple correlations between pairs of predictors.
• The F-test says the model is significant, but the marginal T-tests do not show any significance.
• Watch for Type I and Type III SS having large differences.
• Large changes in estimated regression coefficients when variables are added/deleted.


14-21

Variance Inflation Factors

• A formal method for detecting multicollinearity.
• VIF is related to the variance of the estimated regression coefficients (think: variances get “inflated” by having intercorrelation among the predictors):

VIF_k = 1 / (1 - R_k^2)

• R_k^2 is the coefficient of determination obtained in the regression of X_k on all the other predictors.
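The definition in action, as a sketch (the variable names anticipate the CDI case study later in this lecture; the dataset name cdi is an assumption): regress one predictor on all the others and invert 1 minus the resulting R-square.

proc reg data=cdi;
  model tot_pop = tot_income land_area pop_elderly beds crimes;
run;
/* VIF for tot_pop = 1 / (1 - R-square from this regression) */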


14-22

Variance Inflation Factors (2)

• R_k^2 > 0.9 means that X_k is well predicted by the other variables. This corresponds to a VIF of 10 or higher and indicates excessive multicollinearity.
• Tolerance is defined as TOL_k = 1 - R_k^2 = 1 / VIF_k.
• Tolerance values below 0.01, 0.001, or 0.0001 typically raise concern.


14-23

Physicians Case Study (7.37)

• Goal: Predict # of active physicians in a county (Y) from
  1. X1 = Total Population
  2. X2 = Total Personal Income
  3. X3 = Land Area
  4. X4 = Percent of Pop. Age 65 or Older
  5. X5 = # of Hospital Beds
  6. X6 = Total Serious Crimes
• SAS code available in the file CDI.sas.


14-24

Initial Model

• All six predictors included.
• VIF and TOL can be used as options after the ‘/’ in the MODEL statement of PROC REG, as in the sketch below.
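A sketch of that call (dataset and variable names are assumptions based on the output that follows; the course code is in CDI.sas):

proc reg data=cdi;
  model physicians = tot_pop tot_income land_area pop_elderly beds crimes / vif tol;
run;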

Variable      DF   Tolerance   VIF
tot_pop        1   0.01192     83.87229
tot_income     1   0.01883     53.10731
land_area      1   0.79952      1.25074
pop_elderly    1   0.94209      1.06147
beds           1   0.12251      8.16293
crimes         1   0.16763      5.96556


14-25

Residual Plots

[Plot: Residual vs. Predicted Value of physicians]


14-26

Residual Plots

Resid vs. Total Population – See SAS for other variables

[Plot: Residual vs. tot_pop]


14-27

Normal Probability Plot

[QQ plot: Residual vs. Normal Quantiles]


14-28

Assumption Violations

• Errors not normal.

• Variance does not appear to be constant.

• BOXCOX suggests a log transformation, which clears up some of the issues.
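A sketch of that transformation as a data step (the new variable name lphysicians matches the residual-plot label a few slides later; the dataset names are assumptions):

data cdi_log;
  set cdi;
  lphysicians = log(physicians);
run;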


14-29

Normal Probability Plot

[QQ plot after the log transformation: Residual vs. Normal Quantiles]


14-30

Two Outliers

• Can see these in the QQ-plot.
• Further investigation shows that they are Los Angeles County and Cook County.
  – Twice as many physicians as other counties.
  – Also outliers in total population and total income.
  – There is reason to drop these two for the time being, as it makes sense that such huge counties should not be considered part of the same population as the rest.
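One way to set the two counties aside (a sketch; the county identifier variable and its values are assumptions about how the CDI data are coded):

data cdi_sub;
  set cdi_log;
  if county in ('Los Angeles', 'Cook') then delete;
run;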


14-31

QQ Plot w/o Outliers

[QQ plot without the two outlying counties: Residual vs. Normal Quantiles]


14-32

Residual Plot

[Plot: Residual vs. Predicted Value of lphysicians]


14-33

Residual Plot

Resid vs. Total Population – See SAS for other variables

[Plot: Residual vs. tot_pop]


14-34

Still Problems?

• Normality is OK.
• No other unreasonable outliers.
• The residual plot suggests some nonlinearity.
• Look at the residual vs. predictor variable plots to learn more.
• Possibly add some quadratic or other terms (see the sketch after this list).
• We’ve thus far ignored multicollinearity – time to consider it.
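A sketch of adding one quadratic term and refitting, continuing the earlier sketches (which term(s) to add would be guided by the residual-vs-predictor plots; all names are assumptions):

data cdi_sub2;
  set cdi_sub;
  tot_pop2 = tot_pop**2;
run;

proc reg data=cdi_sub2;
  model lphysicians = tot_pop tot_pop2 tot_income land_area pop_elderly beds crimes / vif tol;
run;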


14-35

Multicollinearity

• The VIFs for tot_pop and tot_income have already informed us that there are problems.

              pop     inc    l_ar     eld    beds
tot_pop      1.00    0.99    0.17   -0.03    0.92
tot_income   0.99    1.00    0.13   -0.02    0.90
land_area    0.17    0.13    1.00    0.01    0.07
pop_elderly -0.03   -0.02    0.01    1.00    0.05
beds         0.92    0.90    0.07    0.05    1.00

• Will continue the analysis with model selection.
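A correlation matrix like the one above could be produced with a call like this (a sketch; the dataset name is an assumption):

proc corr data=cdi_log;
  var tot_pop tot_income land_area pop_elderly beds;
run;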


14-36

Big Picture

• For checking basic assumptions: PLOTS are generally easier to construct than TESTS – and generally, if there is something to see, it will show up in the appropriate plot.
• MULTICOLLINEARITY is a big issue when trying to interpret estimates – however, it’s not really a problem for prediction.


14-37

Upcoming in Lecture 15...

• Model Building: Selection Criteria (Ch 9)

• Continuing the Physicians Dataset Analysis