Lecture 7 Remedial Measures - Purdue Universityghobbs/STAT_512/Lecture_Notes/Regression/... · 7-1...

7-1

Lecture 7

Remedial Measures

STAT 512

Spring 2011

Background Reading

KNNL: 3.8-3.11, Chapter 4

7-2

Topic Overview

Review Assumptions & Diagnostics

Remedial Measures for

Non-normality

Non-constant variance

Non-linearity

Other Miscellaneous Topics (Chapter 4)

7-3

Regression Assumptions

X and Y are related linearly (scatter plot,

residuals vs. X)

Assumptions on the Errors...

Constancy of Variance (residuals vs. X)

Normality (normal probability plot)

Independent (sequence plot)

7-4

Remedial Measures

Two basic choices when assumptions are violated:

Use some more appropriate model (often

more complicated)

Find a transformation of the data for which

the regression model is appropriate

7-5

Non-linear Relationships

Can potentially still use a “linear” model. For

example,

$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$

$Y = \beta_0 + \beta_1 \ln X + \varepsilon$

This model is still “linear” in terms of the

regression coefficients (parameters). Simply

consider a new predictor variable, $X^2$ or $\ln X$,

and just treat this like any usual predictor.
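
As a rough sketch of this idea in SAS: the transformed predictor is created in a DATA step and then used in PROC REG like any other variable. The data set name curve, the variable names x, y, xsq, lnx, and the data values are all made up for illustration.

```sas
* Hypothetical data, for illustration only;
data curve;
  input x y @@;
  xsq = x**2;      * squared predictor;
  lnx = log(x);    * natural log (requires x > 0);
datalines;
1 2.1 2 3.9 3 5.2 4 6.1 5 6.8 6 7.3
;
run;

proc reg data = curve;
  model y = x xsq;   * quadratic in X but linear in the coefficients;
  model y = lnx;     * linear in ln(X);
run;
```

Either fit should be followed by the usual residual plots to see whether the new predictor actually fixed the curvature.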

7-6

Non-linear Relationships

Can use nonlinear regression models (beyond the

scope of this course, but discussed in Chapter 13).

For now, we will try to guess at a good

transformation and see if it works.

7-7

Variance Not Constant

Might be able to model the change in variance (if

it is related to X). In this case, can use a weighted

analysis (Chapter 11.1)

Sometimes a variance-stabilizing transformation

can be found (log, square-root are common)

Box-Cox procedure can help to find a

transformation

Note: In this class, we use natural logs,

unless specified otherwise

7-8

Errors Not Normal

Is the error distribution known? If so,

can use SAS GENMOD procedure (Chapter 14)

Binomial (Yes/No or Categorical Resp.)

Poisson (Response is a Count)

Is the error distribution unknown?

Sometimes a transformation will help

Often, non-normality/non-constant

variance occur together and

transformations can sometimes help both!

7-9

Other Remedies

Correlated errors (not independent)

Use a model for correlated error structure

(Chapter 12)

Omission of Important Predictors

Multiple Regression (starts in Chapter 6)

7-10

Other Remedies

Outliers

Determine whether to keep in analysis

(e.g., was there a recording error? Be

very cautious of deleting observations!)

Determine influence on parameter

estimates and standard errors

Perform more robust estimation

procedure that puts less emphasis on

outliers (Chapter 11.3)

7-11

Transformations

Finding a good transformation gets easier

with practice

Generally the method is to make an

educated guess at a useful transformation

and then try it to see if it works by

rechecking diagnostic plots

Transformations often stabilize the variance

and improve normality at the same time.

7-12

Transformations (X)

(For nonlinear relationship issues)

Log or Square-root

Square or Exp(x)

Reciprocal or Exp(-x)

7-13

Transformations (Y)

(For non-constant variance issues)

See page 132

Standard transformations if increasing

variance: square-root or reciprocal

If decreasing variance: log

Simultaneous transformations on X may

also be useful

7-14

Box-Cox Procedure

Automated procedure to determine a “best”

power transformation for the response

Chooses among different powers λ for the transformed response $Y' = Y^{\lambda}$:

λ = 1 (No transformation)

λ = 0.5 (Square Root)

λ = 0 (Natural Log)

λ = -0.5 (Reciprocal Square Root)

λ = -1 (Reciprocal)

Use TRANSREG procedure in SAS
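
For reference, the power family searched by Box-Cox can be written as below. This is the standard textbook form (see KNNL Section 3.9), where the natural log is used at λ = 0 by convention; the second line just spells out one member of the family.

```latex
Y' =
\begin{cases}
  Y^{\lambda}, & \lambda \neq 0 \\
  \ln Y,       & \lambda = 0
\end{cases}
\qquad \text{e.g. } \lambda = -0.5 \;\Rightarrow\; Y' = \frac{1}{\sqrt{Y}}
```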

7-15

Example (1) – boxcox.sas

X - Age

Y – Plasma Level

25 Healthy children

data orig; input age plasma @@;

datalines;

0 13.44 0 12.84 0 11.91 0 20.09 0 15.60

1 10.11 1 11.38 1 10.28 1 8.96 1 8.59

2 9.83 2 9.00 2 8.65 2 7.85 2 8.88

3 7.94 3 6.01 3 5.14 3 6.90 3 6.77

4 4.86 4 5.10 4 5.67 4 5.75 4 6.23

; proc print data=orig; run;

7-16

Example (2)

First, let’s look at the scatterplot to see the

relationship

goptions ftitle=centb ftext=swissb htitle=3

htext=1.5 ctitle=blue ctext=black;

title1 'Original Variables';

symbol1 v=dot c=blue ;

axis1 label=('Age (Years)');

axis2 label=(angle=90 'Plasma Level');

proc gplot data=orig;

plot plasma*age / haxis=axis1 vaxis=axis2;

run;

Note the method for setting titles and axis labels.

7-17

Example (3)

7-18

Example (4)

Run SLR model and check diagnostic plots

proc reg data=orig;

model plasma=age;

output out = notrans r = resid;

run;


axis2 label=(angle=90 'Residual');

proc gplot data = notrans;

plot resid*age / vref = 0 haxis=axis1 vaxis=axis2;

run;

proc univariate data=notrans;

var resid;

qqplot/normal (L=1 mu = est sigma = est);

run;

Note: Reference line in residual plot, 45-degree

line in normal probability plot

7-19

Example (5)

Root MSE 1.84135    R-Square 0.7532

7-20

Example (6)

7-21

Example (7)

Residuals do not appear to have constant

variance

Relationship not quite linear

Use Box-Cox procedure to suggest a possible

transformation of the Y variable

7-22

Example (8)

proc transreg data = orig;

model boxcox(plasma)=identity(age);

run;

The TRANSREG Procedure

Box-Cox Transformation Information for plasma

Lambda R-Square Log Like

-----

-2.00 0.80 -12.3665

-1.75 0.82 -10.1608

-1.50 0.83 -8.1127

-1.25 0.85 -6.3056

-1.00 0.86 -4.8523 *

-0.75 0.86 -3.8891 *

-0.50 0.87 -3.5523 <

-0.25 0.86 -3.9399 *

0.00 + 0.85 -5.0754 *

0.25 0.84 -6.8988

0.50 0.82 -9.2925

-----

< - Best Lambda

* - 95% Confidence Interval

+ - Convenient Lambda

7-23

Example (9)

“+” indicates the most convenient λ; “<”

indicates the best λ as determined by the

log-likelihood function.

Try $\ln(Y)$ and $1/\sqrt{Y}$

data trans; set orig;

logplasma = log(plasma);*In SAS log=ln, log10=log base 10;

rsplasma = plasma**(-0.5);

proc print data = trans; run;

7-24

Example (10)

Re-run regression, diagnostic plots with

transformed variables

title1 'Natural Log Transformation';

proc reg data = trans;

model logplasma = age;

output out = logtrans r = logresid;

run;


axis2 label=(angle=90 'ln(Plasma)');

proc gplot data = logtrans;

plot logplasma * age/ haxis=axis1 vaxis=axis2;

run;


axis2 label=(angle=90 'Residual');

proc gplot data = logtrans;

plot logresid * age / vref = 0 haxis=axis1 vaxis=axis2;

run;

proc univariate data=logtrans;

var logresid;

qqplot/ normal (L=1 mu = est sigma = est);

run;
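
The same steps apply to the reciprocal square-root transformation. The following is a sketch that assumes the trans data set and the axis1 definition from the code above are still in effect; the names rstrans and rsresid are made up here.

```sas
title1 'Reciprocal Square Root Transformation';
proc reg data = trans;
  model rsplasma = age;
  output out = rstrans r = rsresid;
run;

axis2 label=(angle=90 'Residual');
proc gplot data = rstrans;
  plot rsresid * age / vref = 0 haxis=axis1 vaxis=axis2;
run;

proc univariate data = rstrans;
  var rsresid;
  qqplot / normal (L=1 mu = est sigma = est);
run;
```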

7-25

Example (11)

7-26


7-27

Example (13)

7-28

Example (14)

7-29


7-30

Example (16)

7-31

Example (17)

Both transformations, $\ln(Y)$ and $1/\sqrt{Y}$:

Led to a reasonably linear regression

relation

R-square improvement

Improved non-constant variance problem

Normality assumption supported in all cases

7-32

Summary of Remedial Measures

Nonlinear Relationships – Sometimes a transformation

on X will fix this.

Nonconstant Variance – If we can model the way in

which the error changes, we can use weighted

regression. Sometimes a transformation on Y will work

instead.

Nonnormal Errors – Could use a procedure that allows

different distributions for the error term. Often, a

transformation on Y will help.

7-33

Summary of Remedial Measures

Often, a transformation on Y may help with more

than one issue (e.g., normality and non-constant

variance).

Box-Cox procedure: suggests some possible

Y transformations to try.

Sometimes a transformation on X and Y may help.

Remember: assumptions still need to be satisfied

(on the transformed scale) if we are to use the linear

regression model. So, we must always recheck

diagnostic plots after transforming any variable.

7-34

Chapter 4

Covers some miscellaneous but important

topics

Joint (family) confidence levels (4.1-4.3)

Regression through the origin (4.4)

Measurement Errors (4.5, optional

reading)

Inverse predictions – when Y becomes X

(4.6)

7-35

Summary of Inference - Reminder

100(1 − α)% Confidence Intervals

$b_1 \pm t_{crit}\, s\{b_1\}$

$b_0 \pm t_{crit}\, s\{b_0\}$

where $t_{crit} = t(1 - \alpha/2;\ n - 2)$.

7-36

Family Confidence Levels

Separate confidence intervals for β0 and β1

Now: Joint estimation of β0 and β1

If k 95% CI’s are independent then their

family confidence coefficient is given by

$0.95^k$. Note, here k = 2 and $0.95^k = 0.9025$.

Usually not independent, so family

confidence coefficient will be somewhat

larger than $0.95^k$, but certainly less than

0.95.

7-37

Bonferroni Adjustment

We want the probability that both intervals are correct

to be 0.95

Basic idea is that we have an error budget (α =.05), so

spend half on β0 and half on β1

We use α =.025 for each CI (97.5% CI), leading to

$b_0 \pm t_c\, s\{b_0\}$

$b_1 \pm t_c\, s\{b_1\}$

where $t_c = t(1 - 0.025/2;\ n - 2)$.

We start with a 5% error budget and we have two

intervals so we give 2.5% to each

Each interval has two ends (tails) so we again divide

by 2
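
A short derivation of why splitting the error budget works, regardless of how the two intervals depend on each other. The event names $A_0$ and $A_1$ (the interval for β0, respectively β1, fails to cover its parameter) are introduced here only for this sketch.

```latex
P(\bar{A}_0 \cap \bar{A}_1)
  = 1 - P(A_0 \cup A_1)
  \ge 1 - P(A_0) - P(A_1)
  = 1 - 0.025 - 0.025
  = 0.95
```

The inequality can be far from tight, which is why the adjustment is often conservative.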

7-38

Bonferroni Adjustment (Summary)

Want to control family confidence level at

95%, then need to make adjustments

Instead of α, use α/k.

This is often more conservative than

necessary, but will work in all cases and for

any number of CIs.

We can use this method for simultaneous

estimation of mean responses and

predictions of new observations too.
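
One possible way to obtain Bonferroni joint intervals in SAS is sketched below. It assumes the orig data set from the earlier example (n = 25) and k = 2 intervals (β0 and β1); the variable names inside the DATA step are made up.

```sas
* Bonferroni critical value for k = 2 intervals with family alpha = 0.05 and n = 25;
data _null_;
  alpha = 0.05; k = 2; n = 25;
  tc = tinv(1 - alpha/(2*k), n - 2);
  put tc=;
run;

* Each interval at level 1 - alpha/k = 0.975 gives a family level of at least 0.95;
proc reg data = orig;
  model plasma = age / clb alpha = 0.025;
run;
```

The clb option prints confidence limits for the coefficients, and alpha= sets the per-interval error rate, so dividing the family α by k before passing it in is all that changes.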

7-39

Mean Response CIs

We already talked about simultaneous

estimation for all Xh with a confidence band:

use Working-Hotelling

$\hat{Y}_h \pm W\, s\{\hat{Y}_h\}$ where $W^2 = 2F(1 - \alpha;\ 2,\ n - 2)$

For simultaneous estimation for a few Xh, say k

different values, we may use Bonferroni

instead.

$\hat{Y}_h \pm B\, s\{\hat{Y}_h\}$ where $B = t(1 - \alpha/(2k);\ n - 2)$

Similar for simultaneous prediction intervals.
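
To decide between the two procedures for a given problem, the multipliers can simply be computed and compared. A minimal sketch with assumed values (family α = 0.05, n = 25, k = 3 values of Xh); the smaller multiplier gives the narrower intervals.

```sas
* Compare the Working-Hotelling and Bonferroni multipliers;
* Assumed values are family alpha = 0.05 with n = 25 observations and k = 3 values of Xh;
data _null_;
  alpha = 0.05; n = 25; k = 3;
  W = sqrt(2 * finv(1 - alpha, 2, n - 2));   * Working-Hotelling;
  B = tinv(1 - alpha/(2*k), n - 2);          * Bonferroni;
  put W= B=;
run;
```

W does not depend on k, while B grows with k, so Bonferroni tends to be preferable only when k is small.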

7-40

Regression Through Origin

Yi = β1Xi + εi

Should be very cautious using something

like this. Forcing regression line through

(0,0) can introduce bias, especially if X=0

isn’t in the scope of the model.

Problems with $r^2$ and other statistics

Generally safer not to use this method; if

it is true that Y=0 when X=0, then

β0 will probably not test as significantly

different from zero anyway.
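
In SAS, a sketch of this comparison might look as follows: fit the usual model first and look at the test for the intercept, and only then consider the no-intercept fit. The data set thru0 and its values are hypothetical.

```sas
* Hypothetical data for illustration only;
data thru0;
  input x y @@;
datalines;
1 2.2 2 3.9 3 6.1 4 8.3 5 9.8
;
run;

proc reg data = thru0;
  model y = x;          * usual model with an intercept;
  model y = x / noint;  * regression through the origin;
run;
```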

7-41

Inverse Predictions

From the equation $Y = b_0 + b_1 X$, instead of estimating

Y based on X, we want to estimate X based on Y

Sometimes called calibration

Example: A regression analysis was performed on

the amount of decrease in cholesterol level (Y)

achieved with a given dose (X) of a new drug

based on observations of 50 patients. A

physician needs to know the dose to give if a new

patient’s cholesterol needs to be decreased by a

certain amount ($Y_{h(new)}$).

7-42

Inverse Predictions

Natural point estimate is

$\hat{X}_{h(new)} = \dfrac{Y_{h(new)} - b_0}{b_1}$

Approximate confidence limits are obtained

using the standard error:

$SE\{predX\} = \sqrt{\dfrac{MSE}{b_1^2}\left[1 + \dfrac{1}{n} + \dfrac{(\hat{X}_{h(new)} - \bar{X})^2}{SS_X}\right]}$
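
Assuming the usual t multiplier with n − 2 degrees of freedom (as in KNNL Section 4.6), the approximate 100(1 − α)% limits for the inverse prediction would then take the form:

```latex
\hat{X}_{h(new)} \;\pm\; t(1 - \alpha/2;\ n - 2)\; SE\{predX\}
```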

7-43

Upcoming in Lecture 8...

Review of Matrix Algebra in the context of

simple linear regression (Chapter 5)
