
Chapter 5: Correlation and Regression

1. Correlation

2. Partial Correlation

3. Linear Regression

4. Residual Plots

5. Quadratic Regression

6. Transformations

This material covers sections 5ABCDEFHIJK. Omit 3N, 5GL.


Correlation

• The CORRelation procedure gives sample correlation coefficients between pairs of variables.

• These include:

1. the Pearson product-moment r
2. the Spearman rank ρ
3. the Kendall τ
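
All three coefficients can be requested in a single call; a minimal sketch, assuming a data set MYDATA with at least two numeric variables:

PROC CORR DATA=MYDATA PEARSON SPEARMAN KENDALL;
/* one table of coefficients is printed per
   requested statistic */
RUN;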


Correlation Properties:

1. Correlation coefficients lie in [−1, 1].

2. Pearson's r measures the linear relation between two continuous variables.

3. Positive values indicate that one variable increases linearly with the other variable.

4. Negative values indicate that the variables are inversely related.

5. A correlation of 0 means that the variables are not linearly related.

6. Spearman's rank ρ and Kendall's τ apply to ordinal data.


Correlation Syntax:

PROC CORR DATA=MYDATA;

Gives the Pearson r between all pairs of numeric variables in MYDATA.


Simple Example:

Compute the correlation between the variables X and Y, given in the following table:

 X   Y
22  12
23  11
26  15
24  14
31  20
27  18
25  16


Simple Example: Cont’d

DATA CORR_EG;
INPUT X Y;
DATALINES;
22 12
23 11
26 15
24 14
31 20
27 18
25 16
;
PROC CORR DATA=CORR_EG;
PROC PLOT; /* Not necessary, but recommended as an aid
              in interpreting the correlation. */
PLOT Y*X;
RUN; QUIT;


Default Output:

1. Simple statistics for each numeric variable in MYDATA.

2. Correlations between all pairs of variables in MYDATA.

3. p-values for testing significance.


Significance Tests:

1. Null hypotheses: the true population correlations are 0.

2. Test statistic: the sample correlation r.

3. PROB = p-value, the probability of observing a sample correlation at least as large in magnitude as r, under the null hypothesis.

4. A small p-value is evidence that the population correlation is nonzero.


Assumptions:

• the variables are normally distributed, and

• the observations are independent of each other.

If the data are not normal, use nonparametric tests based on Spearman's ρ or Kendall's τ.

Computing Spearman's ρ:

PROC CORR DATA=MYDATA SPEARMAN;


Alternative Interpretation of Pearson’s r:

r² = the proportion of variance in one of the variables that can be explained by variation in the other variable.

1 − r² = the proportion of variance left unexplained.

e.g.

• Heights are measured for 20 father and (adult) son pairs.

• The correlation is estimated to be r = .6.

• r² = .36, so 36% of the variation in the height of sons is attributable to variation in the height of the father.

• 64% of the variance in sons' heights is left unexplained.


Correlation Cont’d

• Causality: It must be emphasized that a linear relationship between variables does not imply a cause-and-effect relationship between the variables.

• Correlation Matrices: A matrix of correlations between all pairs of numeric variables in the SAS data set can be computed using the VAR statement.

• Example: Information on waste output in a region of California is stored in the file waste.dat. The data set contains 40 observations on 10 variables:


Correlation Cont’d

• ZONE - the area in which the data was collected.

• WASTE - the amount of waste output in the area.

• Predictor variables: each gives the percentage of the zone devoted to

– IND - industry.
– MET - fabricated metals.
– WHS - wholesale and trucking.
– RET - retail trade.
– RES - restaurants and hotels.
– FIN - finance and insurance.
– MSC - miscellaneous activities.
– HOM - residential dwellings.

Find the pairwise correlation matrix for the variables IND, MET, WHS.


Correlation matrix

DATA WASTE;
INFILE 'waste.dat';
INPUT ZONE WASTE IND MET WHS RET RES
      FIN MSC HOM;
PROC CORR DATA=WASTE NOSIMPLE NOPROB;
/* NOSIMPLE suppresses the printing of summary statistics */
/* NOPROB suppresses the significance tests */
VAR IND MET WHS;
RUN; QUIT;

The output window then contains:

Pearson Correlation Coefficients

        IND      MET      WHS
IND     1.0    .39315   .41971
MET   .39315     1.0    .88869
WHS   .41971   .88869     1.0


Correlation: WITH

Using the WITH statement:

PROC CORR DATA=MYDATA;
VAR X Y;
WITH Z1 Z2 Z3;

computes correlations between the following pairs of variables:

• X, Z1
• X, Z2
• X, Z3
• Y, Z1
• Y, Z2
• Y, Z3


WITH example

DATA OZONE;
INFILE 'ozone.dat';
OPTIONS PAGESIZE=40;
/* Daily zonal means: OZONE, in units DOBSON;
   SOURCE: NASA */
/* 100 DOBSON UNITS = 1MM THICKNESS (IF OZONE LAYER
   WERE BROUGHT TO EARTH'S SURFACE) */
/* Each observation contains ozone thickness measurements
   averaged over 288 longitudes at latitudes separated by
   5 degrees; e.g. M875 = average ozone thickness at
   latitude 87.5 */
/* SH = average over the southern hemisphere;
   NH = average over the northern hemisphere */
/* 0 = MISSING VALUE */


WITH example Cont’d

INPUT YRFRAC M875 M825 M775 M725 M675 M625 M575 M525 M475 M425
      M375 M325 M275 M225 M175 M125 M75 M25 P25 P75 P125 P175
      P225 P275 P325 P375 P425 P475 P525 P575 P625 P675 P725
      P775 P825 P875 SH NH;
/* IT IS VERY IMPORTANT TO SET THE MISSING VALUES TO .;
   OTHERWISE, THE 0'S WILL BE ENTERED INTO THE CORRELATION
   COMPUTATION AND GIVE MISLEADING RESULTS. */
IF M875 = 0 THEN M875 = .;
IF M825 = 0 THEN M825 = .;
IF M775 = 0 THEN M775 = .;
IF M725 = 0 THEN M725 = .;
IF M675 = 0 THEN M675 = .;
IF M625 = 0 THEN M625 = .;
IF M575 = 0 THEN M575 = .;
IF M525 = 0 THEN M525 = .;


WITH example Cont’d

IF M475 = 0 THEN M475 = .;
IF M425 = 0 THEN M425 = .;
IF M375 = 0 THEN M375 = .;
IF M325 = 0 THEN M325 = .;
IF M275 = 0 THEN M275 = .;
IF M225 = 0 THEN M225 = .;
IF M175 = 0 THEN M175 = .;
IF M125 = 0 THEN M125 = .;
IF M75 = 0 THEN M75 = .;
IF M25 = 0 THEN M25 = .;
IF P875 = 0 THEN P875 = .;
IF P825 = 0 THEN P825 = .;
IF P775 = 0 THEN P775 = .;
IF P725 = 0 THEN P725 = .;
IF P675 = 0 THEN P675 = .;
IF P625 = 0 THEN P625 = .;
IF P575 = 0 THEN P575 = .;


WITH example Cont’d

IF P525 = 0 THEN P525 = .;
IF P475 = 0 THEN P475 = .;
IF P425 = 0 THEN P425 = .;
IF P375 = 0 THEN P375 = .;
IF P325 = 0 THEN P325 = .;
IF P275 = 0 THEN P275 = .;
IF P225 = 0 THEN P225 = .;
IF P175 = 0 THEN P175 = .;
IF P125 = 0 THEN P125 = .;
IF P75 = 0 THEN P75 = .;
IF P25 = 0 THEN P25 = .;
SASDATE = FLOOR(365.25*(YRFRAC-1960));
MONTH = MONTH(SASDATE);
SEASONAL = ABS(7-MONTH);
PROC CORR NOSIMPLE NOPROB;
VAR M875 M475 M75;
WITH P875 P475 P75;
/* This program gives correlations between various southern
   and northern ozone layer averages */
RUN; QUIT;
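
The long run of IF statements above can be collapsed into a few lines with an ARRAY. A minimal sketch of the same recoding, to be placed inside the DATA step (the positional range M875--P875 assumes the latitude variables sit in the order given on the INPUT statement):

ARRAY LATS M875--P875; /* all latitude variables, in INPUT order */
DO J = 1 TO DIM(LATS);
   IF LATS{J} = 0 THEN LATS{J} = .;
END;
DROP J;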


Partial Correlations:

Often, two variables appear to be highly correlated only because both are highly correlated with a third variable. Computing partial correlations between variables removes the effects of other variables.

Syntax:

PROC CORR DATA=MYDATA;
VAR X;
WITH Y;
PARTIAL V;

This computes the partial correlation between X and Y, after eliminating effects due to correlation with V.
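
With a single variable V in the PARTIAL statement, the reported partial correlation can be expressed in terms of ordinary Pearson correlations (a standard identity, stated here for reference):

r(XY·V) = [r(XY) − r(XV) r(YV)] / √[(1 − r(XV)²)(1 − r(YV)²)]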


Partial Correlations: Example

Ozone measurements are correlated with temperature. The resulting seasonal effect can be removed as follows:

PROC CORR DATA=OZONE NOSIMPLE NOPROB;
VAR M875 M475 M75;
WITH P475 P75;
PARTIAL SEASONAL;

We see that the magnitude of the correlations has been reduced after taking into account the seasonal effects.


Exercise:

Refer to the Winnipeg Climate data set.

1. For each month, compute correlations between maximum temperature and each of minimum temperature, minimum and maximum pressure, and minimum and maximum wind speed.

2. Compute the partial correlations, taking into account minimum temperature.


Simple Linear Regression

• The equation for a straight line relating the (nonrandom) variables y and x is

y = α + βx

β is the slope.
α is the intercept.
The dependent variable y is the response variable.
The independent variable x is the predictor or explanatory variable.


‘Simulation’ Example:

Suppose the intercept is 2.0 and the slope is -1.5. Compute the y values corresponding to x = 1, 2, 4, 5, 8, 10, 13, and obtain a scatterplot of the paired observations.

DATA NONRAN;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
Y = ALPHA + BETA*X;


‘Simulation’ Example: Cont’d

DATALINES;
1
2
4
5
8
10
13
;
PROC PLOT;
PLOT Y*X;
RUN;
QUIT;

If BETA is positive, say BETA = 2.5, then the graph increases.


Simple linear model: Adding Noise

• Even if variables are related to each other by a straight line, experimental observations usually contain some kind of (unobservable) noise or random errors which cause small distortions.

• The simplest way of modelling this kind of noise is to assume it is normally distributed and to add it to y, i.e.

Y = y(x) + ε

or

Y = α + βx + ε

• For each observation, ε is a normal random variable having mean 0 and variance σ².


Simple linear model: Cont’d

• If there are n observations, there must be n random errors. We assume that they are independent of each other.

• This model is called the simple linear model.

• Note that since E[ε] = 0, we have

E[Y] = α + βx

i.e. the mean of the Y variable is a linear function of x.


Simulating the simple linear model:

Add noise to the above data. Assume σ = 0.2.

DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA = 0.2;
EPSILON = SIGMA*RANNOR(0);
Y = ALPHA + BETA*X + EPSILON;
DATALINES;
1
2
4
5


Simulating the simple linear model: Cont'd

8
10
13
;
PROC PLOT;
PLOT Y*X;
RUN;
QUIT;

Repeating this simulation experiment using σ = 1.0, σ = 1.5 and σ = 2.0, we see that as σ increases, the graph appears less 'linear'.


Estimation:

Unlike in our simulation study, α and β are unknown and must be estimated. Least-squares estimates are computed by the REGression procedure, PROC REG.

Syntax:

PROC REG DATA=MYDATA;
MODEL Y = X;


Estimation: Cont’d

Estimating the parameters for the simulated data:

DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA = 0.2;
EPSILON = SIGMA*RANNOR(0);
Y = ALPHA + BETA*X + EPSILON;
DATALINES;
1
2
4
5


Estimation: Cont’d

8
10
13
;
PROC REG;
MODEL Y = X;
RUN; QUIT;

Note that the intercept estimate is near 2.0 and the slope estimate is near -1.5.


Example:

10 fields were planted in wheat, and i kg/acre of nitrate was applied to the ith field, for i = 1, 2, ..., 10. We want to model the relationship between mean wheat yield Y and amount of nitrate X. That is, we fit

Y = α + βX + ε

where ε is the unobserved error random variable.


Example: Cont’d

To estimate α and β we use:

DATA WHTYIELD;
INPUT NITRATE YIELD;
DATALINES;
1 15
2 13
3 16
4 12
5 14
6 18
7 17
8 19
9 16
10 20
;


Example: Cont’d

PROC PLOT;
PLOT YIELD*NITRATE;
/* One should first plot YIELD against NITRATE to see
   whether a linear model is appropriate. */
PROC REG DATA=WHTYIELD;
MODEL YIELD = NITRATE;
RUN;


Output

The output window lists

• an ANOVA table
• estimates of some statistics
• a table of parameter estimates and their standard errors.

The estimates of α and β are

• a = 12.7 (standard error = 1.31)
• b = 0.606 (standard error = .212)

and the fitted model is

y = 12.7 + .606x.

The estimated standard deviation of the errors ε is given by Root MSE and is 1.93.


Tests

We can also test whether the slope of the regression line is 0:

H0 : β = 0

versus

H1 : β ≠ 0.

Test statistic:

T = b/s(b) = .606/.212 = 2.86.

p-value: 0.0212

Conclusion: Reject the null hypothesis at α = .05; there is evidence that the true slope is not 0.


Tests Cont’d

The intercept α can also be tested. This time, we have a p-value of .0001, which is very strong evidence against the hypothesis that α = 0.

These tests are based on the assumption that the observations are independent (i.e. the errors ε are independent of each other). They are correct if the errors are normally distributed, and approximately correct if the true error distribution has a finite variance.


The ANOVA table:

• This table demonstrates how the variance in the Y or response variable decomposes into variance explained by the regression with X (SSModel) plus variance left unexplained (SSError).

• SSModel + SSError = C Total

• MSModel = SSModel/DFModel (DFModel = number of parameters estimated − 1)


The ANOVA table: Cont’d

• MSError = SSError/DFError (DFError = n − DFModel − 1). Note that MSE is the square of the Root MSE.

• The test statistic F = MSModel/MSError is the square of the T statistic used to test whether the slope is 0 (here, F = 2.86² ≈ 8.18).

• If F is large, we have reason to reject the null hypothesis in favor of the alternative that β ≠ 0. Note that the p-value for this test is identical to the earlier p-value.


Other Statistics

• R-Square = the coefficient of determination, R² = .505. It is the square of the correlation between YIELD and NITRATE.

• Dep Mean = average of the dependent variable values = 16.0.


Predicted Values:

• The predicted values can be calculated from the fitted regression equation for the given values of the explanatory variable X.

• For the wheat yield example, we found that

y = 12.7 + .606x

Therefore, if x = 3, we predict y = 12.7 + .606(3) = 14.5. This is the predicted value corresponding to x = 3.
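
Predicted values (and residuals) can also be saved to a data set with PROC REG's OUTPUT statement; a brief sketch for the wheat data, where the names PREDS, YHAT and RESID are arbitrary choices:

PROC REG DATA=WHTYIELD;
MODEL YIELD = NITRATE;
OUTPUT OUT=PREDS P=YHAT R=RESID;
/* PREDS contains YHAT (predicted yield) and
   RESID (residual) for each observation */
RUN; QUIT;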


Plotting Predicted Values:

SAS can plot the predicted values versus the explanatory variable:

PROC REG DATA=WHTYIELD;
MODEL YIELD = NITRATE;
PLOT PREDICTED.*NITRATE;
RUN;


Plotting Predicted Values: Overlay

It is possible to overlay this plot on the plot of the original data:

PROC REG DATA=WHTYIELD;
MODEL YIELD = NITRATE;
PLOT PREDICTED.*NITRATE='P'
     YIELD*NITRATE='*' / OVERLAY;
RUN;
/* PLOT X*Y='!' causes the plotting symbol to be '!' */


Residual Plots:

• Residuals are the differences between the response values and the predicted values:

y − ŷ = y − a − bx

• They are 'estimates' of the errors ε:

ε = y − α − βx

• Examine plots of the residuals:

1. Look for outliers - indications that the linear model may not be adequate or that the error distribution is not close enough to normal. Tests are not trustworthy in this case.


Residual Plots: Cont’d

2. Patterns can

• indicate the need to transform the data or to add a quadratic term to the linear model;

• indicate that the error variance is not constant. If it is not, weighted least-squares should be used (see the sketch below).
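
A minimal weighted least-squares sketch: if the error standard deviation grows like √X (as in the increasing-variance simulation later in this chapter), the variance grows like X, so weights proportional to 1/X are appropriate. MYDATA, Y and X here are hypothetical placeholders:

DATA WEIGHTED;
SET MYDATA;
W = 1/X; /* weight proportional to 1/variance */
PROC REG DATA=WEIGHTED;
MODEL Y = X;
WEIGHT W; /* requests weighted least-squares */
RUN; QUIT;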

To get a feel for what the residual plots should look like if the linear model is appropriate, use simulation:


Residual Plots: Cont’d

DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA = 0.2;
EPSILON = SIGMA*RANNOR(0);
Y = ALPHA + BETA*X + EPSILON;
DATALINES;
1
2
4
5
7
8


Residual Plots: Cont’d

9
10
12
13
;
PROC REG;
MODEL Y = X;
PLOT RESIDUAL.*X;
RUN; QUIT;


A quadratic example:

DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA = 0.2;
EPSILON = SIGMA*RANNOR(0);
Y = ALPHA + BETA*(X - 5)**2 + EPSILON;
DATALINES;
1
2
4
5
7
8


A quadratic example: Cont'd

9
10
12
13
;
PROC REG;
MODEL Y = X;
PLOT RESIDUAL.*X;
RUN; QUIT;


Outlier example:

DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA = 0.2;
SIGMA2 = 1.8;
U = UNIFORM(0);
IF U < .8 THEN EPSILON = SIGMA*RANNOR(0);
ELSE EPSILON = SIGMA2*RANNOR(0);
Y = ALPHA + BETA*X + EPSILON;
DATALINES;
1
2
4
5
7


Outlier example: Cont’d

8
9
10
12
13
;
PROC REG;
MODEL Y = X;
PLOT RESIDUAL.*X;
RUN; QUIT;


An increasing variance example

DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA = 0.2*SQRT(X);
/* SIGMA increases with the square root of X */
EPSILON = SIGMA*RANNOR(0);
Y = ALPHA + BETA*X + EPSILON;
DATALINES;
1
2
4
5
7
8


An increasing variance example Cont’d

9
10
12
13
;
PROC REG;
MODEL Y = X;
PLOT RESIDUAL.*X;
RUN; QUIT;

Exercise: Examine a plot of the residuals for the wheat yield data.


Adding a Quadratic Term:

To fit the quadratic model

Y = α + βX + β₂X² + ε

use the line

X2 = X**2;

in the DATA step, and use the following MODEL statement in PROC REG:

MODEL Y = X X2;


Example:

DATA WHTYIELD;
INPUT NITRATE YIELD;
NITRATE2 = NITRATE**2;
DATALINES;
1 15
2 13
3 16
4 12
5 14
6 18
7 17
8 19
9 16
10 20
;


Example : Cont’d

PROC REG DATA=WHTYIELD;
MODEL YIELD = NITRATE NITRATE2;
RUN;
QUIT;

The output includes a test of the hypothesis that β₂ = 0.


Transformations:

Sometimes an appropriate transformation (like a log or a square root) is sufficient to linearize the relationship between two variables. Sometimes such a transformation can correct for a variance that is not constant. (N.B. If the response variable is a count, a nonconstant variance can almost always be corrected by taking the square root of the response variable.)
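
A minimal sketch of the idea (MYDATA, Y and X are hypothetical placeholders): apply the transformation in the DATA step, then fit the usual linear model to the transformed response.

DATA TRANS;
SET MYDATA;
SQRTY = SQRT(Y); /* square root: the usual choice for counts */
LOGY = LOG(Y);   /* log: another common linearizing choice */
PROC REG DATA=TRANS;
MODEL SQRTY = X;
RUN; QUIT;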


Summary:

PROC REG;
MODEL Y = X;            /* FITS LINEAR MODEL RELATING RESPONSE Y
                           TO EXPLANATORY VARIABLE X */
PLOT PREDICTED.*X;      /* PLOTS PREDICTED VALUES */
PLOT RESIDUAL.*X = 'Y'; /* PLOTS RESIDUALS WITH
                           PLOTTING SYMBOL 'Y' */
