Topic 9: Remedies

Outline

• Review diagnostics for residuals• Discuss remedies– Nonlinear relationship– Nonconstant variance– Non-Normal distribution– Outliers

Diagnostics for residuals

• Look at residuals to find serious violations of the model assumptions – nonlinear relationship – nonconstant variance– non-Normal errors• presence of outliers• a strongly skewed distribution

Recommendations for checking assumptions

• Plot Y vs X (is it a linear relationship?)

• Look at distribution of residuals

• Plot residuals vs X, time, or any other potential explanatory variable

• Use the i=sm## in symbol statement to get smoothed curves

Plots of Residuals

• Plot residuals vs–Time (order)

–X or predicted value (b0+b1X)

• Look for

–nonrandom patterns

–outliers (unusual observations)

Residuals vs Order

• Pattern in plot suggests dependent errors / lack of indep

• Pattern usually a linear or quadratic trend and/or cyclical

• If you are interested read KNNL pgs 108-110

Residuals vs X

• Can look for

–nonconstant variance

–nonlinear relationship

–outliers

–somewhat address Normality of residuals

Tests for Normality

• H0: data are an i.i.d. sample from a Normal population

• Ha: data are not an i.i.d. sample from a Normal population

• KNNL (p 115) suggest a correlation test that requires a table look-up

Tests for Normality

• We have several choices for a significance testing procedure

• Proc univariate with the normal option provides four

proc univariate normal;• Shapiro-Wilk is a common choice

Tests for Normality

Test Statistic p ValueShapiro-Wilk W 0.978904 Pr < W 0.8626

Kolmogorov-Smirnov D 0.09572 Pr > D >0.1500

Cramer-von Mises W-Sq 0.033263 Pr > W-Sq >0.2500

Anderson-Darling A-Sq 0.207142 Pr > A-Sq >0.2500

All P-values > 0.05…Do not reject H0

Other tests for model assumptions

• Durbin-Watson test for serially correlated errors (KNNL p 114)

• Modified Levene test for homogeneity of variance (KNNL p 116-118)

• Breusch-Pagan test for homogeneity of variance (KNNL p 118)

• For SAS commands see topic9.sas

Plots vs significance test

• Plots are more likely to suggest a remedy

• Significance tests results are very dependent on the sample size; with sufficiently large samples we can reject most null hypotheses

Default graphics with SAS 9.3

proc reg data=toluca; model hours=lotsize; id lotsize;run;

Will discuss these diagnostics more in multiple regression

Provides rule of thumb limits

Questionable observation (30,273)

Additional summaries

• Rstudent: Studentized residual…almost all should be between ± 2

• Leverage: “Distance” of X from center…helps determine outlying X values in multivariable setting…outlying X values may be influential

• Cooks’D: Influence of ith case on all predicted values

Lack of fit• When we have repeat observations at

different values of X, we can do a significance test for nonlinearity

• Browse through KNNL Section 3.7

• Details of approach discussed when we get to KNNL 17.9, p 762

• Basic idea is to compare two models

• Gplot with a smooth is a better (i.e., simpler) approach

SAS code and output

proc reg data=toluca; model hours=lotsize / lackfit; run;

Analysis of Variance

Source DFSum of

SquaresMean

Square F Value Pr > FModel 1 252378 252378 105.88 <.0001

Error 23 54825 2383.71562

Lack of Fit 9 17245 1916.06954 0.71 0.6893Pure Error 14 37581 2684.34524

Corrected Total 24 307203

Nonlinear relationships

• We can model many nonlinear relationships with linear models, some have several explanatory variables (i.e., multiple linear regression)

–Y = β0 + β1X + β2X2 + e (quadratic)

–Y = β0 + β1log(X) + e

Nonlinear Relationships• Sometimes can transform a

nonlinear equation into a linear equation

• Consider Y = β0exp(β1X) + e

• Can form linear model using log

log(Y) = log(β0) + β1X + log(e)• Note that we have changed our

assumption about the error

Nonlinear Relationship

• We can perform a nonlinear regression analysis

• KNNL Chapter 13

• SAS PROC NLIN

Nonconstant variance

• Sometimes we model the way in which the error variance changes

–may be linearly related to X

• We can then use a weighted analysis

• KNNL 11.1

• Use a weight statement in PROC REG

Non-Normal errors

• Transformations often help

• Use a procedure that allows different distributions for the error term

–SAS PROC GENMOD

Generalized Linear Model

• Possible distributions of Y:–Binomial (Y/N or percentage data)–Poisson (Count data)–Gamma (exponential)– Inverse gaussian–Negative binomial–Multinomial

• Specify a link function for E(Y)

Ladder of Reexpression(transformations)

p

0.0 0.5

-0.5-1.0

1.0 1.5

Transformation is xp

Circle of Transformations

X up, Y up

X up, Y down

X down, Y up

X down, Y down

X

Y

Box-Cox Transformations

• Also called power transformations

• These transformations adjust for non-Normality and nonconstant variance

• Y´ = Y or Y´ = (Y - 1)/• In the second form, the limit as

approaches zero is the (natural) log

Important Special Cases

= 1, Y´ = Y1, no transformation = .5, Y´ = Y1/2, square root = -.5, Y´ = Y-1/2, one over square root = -1, Y´ = Y-1 = 1/Y, inverse = 0, Y´ = (natural) log of Y

Box-Cox Details

• We can estimate by including it as a parameter in a non-linear model

• Y = β0 + β1X + e

and using the method of maximum likelihood

• Details are in KNNL p 134-137

• SAS code is in boxcox.sas

Box-Cox Solution• Standardized transformed Y is –K1(Y - 1) if ≠ 0–K2log(Y) if = 0

where K2 = ( Yi)1/n (the geometric mean)

and K1 = 1/ ( K2 -1)

• Run regressions with X as explanatory variable

• estimated minimizes SSE

data a1; input age plasma @@;cards;0 13.44 0 12.84 0 11.91 0 20.090 15.60 1 10.11 1 11.38 1 10.281 8.96 1 8.59 2 9.83 2 9.002 8.65 2 7.85 2 8.88 3 7.943 6.01 3 5.14 3 6.90 3 6.774 4.86 4 5.10 4 5.67 4 5.754 6.23;

Example

Box Cox Procedure

*Procedure that will automatically find the Box-Cox transformation;

proc transreg data=a1;

model boxcox(plasma)=identity(age);

run;

Transformation Information for BoxCox(plasma)

Lambda R-Square Log Like

-2.50 0.76 -17.0444

-2.00 0.80 -12.3665

-1.50 0.83 -8.1127

-1.00 0.86 -4.8523 *

-0.50 0.87 -3.5523 <

0.00 + 0.85 -5.0754 *

0.50 0.82 -9.2925

1.00 0.75 -15.2625

1.50 0.67 -22.1378

2.00 0.59 -29.4720

2.50 0.50 -37.0844

< - Best Lambda

* - Confidence Interval

+ - Convenient Lambda

*The first part of the program gets the geometric mean;

data a2; set a1; lplasma=log(plasma);

proc univariate data=a2 noprint; var lplasma; output out=a3 mean=meanl;

data a4; set a2; if _n_ eq 1 then set a3; keep age yl l; k2=exp(meanl); do l = -1.0 to 1.0 by .1; k1=1/(l*k2**(l-1)); yl=k1*(plasma**l -1); if abs(l) < 1E-8 then yl=k2*log(plasma); output; end;

proc sort data=a4 out=a4; by l;proc reg data=a4 noprint outest=a5; model yl=age; by l;data a5; set a5; n=25; p=2; sse=(n-p)*(_rmse_)**2;proc print data=a5; var l sse;

Obs l sse 1 -1.0 33.9089 2 -0.9 32.7044 3 -0.8 31.7645 4 -0.7 31.0907 5 -0.6 30.6868 6 -0.5 30.5596 7 -0.4 30.7186 8 -0.3 31.1763 9 -0.2 31.9487 10 -0.1 33.0552

symbol1 v=none i=join;proc gplot data=a5; plot sse*l;run;

data a1; set a1;

tplasma = plasma**(-.5);

tage = (age+.5)**(-.5);

symbol1 v=circle i=sm50;

proc gplot; plot tplasma*age;

proc sort; by tage;

proc gplot; plot tplasma*tage;

run;

Background Reading

• Sections 3.4 - 3.7 describe significance tests for assumptions (read it if you are interested).

• Box-Cox transformation is in nknw132.sas

• Read sections 4.1, 4.2, 4.4, 4.5, and 4.6

Topic 9: Remedies

Documents

Transcript of Topic 9: Remedies