
Chapter 5: Correlation and Regression

1. Correlation
2. Partial Correlation
3. Linear Regression
4. Residual Plots
5. Quadratic Regression
6. Transformations

This material covers sections 5ABCDEFHIJK. Omit sections 3N and 5GL.

Correlation

• The CORRelation procedure gives sample correlation coefficients between pairs of variables. These include:

1. the Pearson product-moment r
2. the Spearman rank ρ
3. the Kendall τ
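All three coefficients can be requested in a single call; a minimal sketch (MYDATA is a placeholder data set name):

PROC CORR DATA=MYDATA PEARSON SPEARMAN KENDALL;
RUN;

This prints a separate table of coefficients for each of the three statistics.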

Correlation Properties:

1. Correlation coefficients lie in [−1, 1].
2. Pearson’s r measures the linear relation between two continuous variables.
3. Positive values indicate that one variable increases linearly with the other variable.
4. Negative values indicate that the variables are inversely related.
5. A correlation of 0 means that the variables are not linearly related.
6. Spearman’s rank ρ and Kendall’s τ apply to ordinal data.

Correlation Syntax:

PROC CORR DATA=MYDATA;

This gives the Pearson r between all pairs of numeric variables in MYDATA.

Simple Example:

Compute the correlation between the variables X and Y, given in the following table:

 X   Y
22  12
23  11
26  15
24  14
31  20
27  18
25  16

Simple Example: Cont’d

DATA CORR_EG;
INPUT X Y;
DATALINES;
22 12
23 11
26 15
24 14
31 20
27 18
25 16
;
PROC CORR DATA=CORR_EG;
PROC PLOT; /* Not necessary, but recommended as an aid
              in interpreting the correlation. */
PLOT Y*X;
RUN; QUIT;

Default Output:

1. Simple statistics for each numeric variable in MYDATA.
2. Correlations between all pairs of variables in MYDATA.
3. p-values for testing the significance of each correlation.

Significance Tests:

1. Null hypothesis: the true population correlation is 0.
2. Test statistic: the sample correlation r.
3. PROB = p-value, the probability of observing a sample correlation at least as large in magnitude as r, under the null hypothesis.
4. A small p-value is evidence that the population correlation is nonzero.

Assumptions:

• the variables are normally distributed, and
• the observations are independent of each other.

If the data are not normal, use nonparametric tests based on Spearman’s ρ or Kendall’s τ.

Computing Spearman’s ρ:

PROC CORR DATA=MYDATA SPEARMAN;

Alternative Interpretation of Pearson’s r:

r² = the proportion of variance in one of the variables that can be explained by variation in the other variable.

1 − r² = the proportion of variance left unexplained.

e.g.

• Heights are measured for 20 father and (adult) son pairs.
• The correlation is estimated to be r = .6.
• r² = .36, so 36% of the variation in the heights of sons is attributable to variation in the heights of the fathers.
• 64% of the variance in sons’ heights is left unexplained.

Correlation Cont’d

• Causality: It must be emphasized that a linear relationship between variables does not imply a cause-and-effect relationship between them.
• Correlation Matrices: A matrix of correlations between all pairs of numeric variables in the SAS data set can be computed using the VAR statement.
• Example: Information on waste output in a region of California is stored in the file waste.dat. The data set contains 40 observations on 10 variables:

Correlation Cont’d

• ZONE - the area in which the data was collected.
• WASTE - the amount of waste output in the area.
• Predictor variables: each gives the percentage of the zone devoted to
  – IND - industry.
  – MET - fabricated metals.
  – WHS - wholesale and trucking.
  – RET - retail trade.
  – RES - restaurants and hotels.
  – FIN - finance and insurance.
  – MSC - miscellaneous activities.
  – HOM - residential dwellings.

Find the pairwise correlation matrix for the variables IND, MET, and WHS.

Correlation matrix

DATA WASTE;
INFILE 'waste.dat';
INPUT ZONE WASTE IND MET WHS RET RES FIN MSC HOM;
PROC CORR DATA=WASTE NOSIMPLE NOPROB;
/* NOSIMPLE suppresses the printing of summary statistics */
/* NOPROB suppresses the significance tests */
VAR IND MET WHS;
RUN; QUIT;

The output window then contains:

Pearson Correlation Coefficients

         IND      MET      WHS
IND   1.00000  0.39315  0.41971
MET   0.39315  1.00000  0.88869
WHS   0.41971  0.88869  1.00000

Correlation: WITH

Using the WITH statement:

PROC CORR DATA=MYDATA;
VAR X Y;
WITH Z1 Z2 Z3;

computes correlations between the following pairs of variables:

• X, Z1
• X, Z2
• X, Z3
• Y, Z1
• Y, Z2
• Y, Z3

WITH example

DATA OZONE;
INFILE 'ozone.dat';
OPTIONS PAGESIZE = 40;
/* Daily zonal means: OZONE, in units DOBSON; SOURCE: NASA */
/* 100 DOBSON UNITS = 1 MM THICKNESS (IF OZONE LAYER WERE
   BROUGHT TO EARTH'S SURFACE) */
/* Each observation contains ozone thickness measurements
   averaged over 288 longitudes at latitudes separated by
   5 degrees; e.g. M875 = average ozone thickness at
   latitude 87.5 */
/* SH = average over the southern hemisphere;
   NH = average over the northern hemisphere */
/* 0 = MISSING VALUE */

WITH example Cont’d

INPUT YRFRAC M875 M825 M775 M725 M675 M625 M575 M525 M475 M425
      M375 M325 M275 M225 M175 M125 M75 M25 P25 P75 P125 P175
      P225 P275 P325 P375 P425 P475 P525 P575 P625 P675 P725
      P775 P825 P875 SH NH;
/* IT IS VERY IMPORTANT TO SET THE MISSING VALUES TO .;
   OTHERWISE, THE 0'S WILL BE ENTERED INTO THE CORRELATION
   COMPUTATION AND GIVE MISLEADING RESULTS. */
IF M875 = 0 THEN M875 = .;
IF M825 = 0 THEN M825 = .;
IF M775 = 0 THEN M775 = .;
IF M725 = 0 THEN M725 = .;
IF M675 = 0 THEN M675 = .;
IF M625 = 0 THEN M625 = .;
IF M575 = 0 THEN M575 = .;
IF M525 = 0 THEN M525 = .;

WITH example Cont’d

IF M475 = 0 THEN M475 = .;
IF M425 = 0 THEN M425 = .;
IF M375 = 0 THEN M375 = .;
IF M325 = 0 THEN M325 = .;
IF M275 = 0 THEN M275 = .;
IF M225 = 0 THEN M225 = .;
IF M175 = 0 THEN M175 = .;
IF M125 = 0 THEN M125 = .;
IF M75 = 0 THEN M75 = .;
IF M25 = 0 THEN M25 = .;
IF P875 = 0 THEN P875 = .;
IF P825 = 0 THEN P825 = .;
IF P775 = 0 THEN P775 = .;
IF P725 = 0 THEN P725 = .;
IF P675 = 0 THEN P675 = .;
IF P625 = 0 THEN P625 = .;
IF P575 = 0 THEN P575 = .;

WITH example Cont’d

IF P525 = 0 THEN P525 = .;
IF P475 = 0 THEN P475 = .;
IF P425 = 0 THEN P425 = .;
IF P375 = 0 THEN P375 = .;
IF P325 = 0 THEN P325 = .;
IF P275 = 0 THEN P275 = .;
IF P225 = 0 THEN P225 = .;
IF P175 = 0 THEN P175 = .;
IF P125 = 0 THEN P125 = .;
IF P75 = 0 THEN P75 = .;
IF P25 = 0 THEN P25 = .;
SASDATE = FLOOR(365.25*(YRFRAC-1960));
MONTH = MONTH(SASDATE);
SEASONAL = ABS(7-MONTH);
PROC CORR NOSIMPLE NOPROB;
VAR M875 M475 M75;
WITH P875 P475 P75;
/* This program gives correlations between various southern
   and northern ozone layer averages */
RUN; QUIT;
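The 36 repetitive IF statements can be replaced by a single loop over an ARRAY; a sketch, relying on the positional variable range M875--P875 defined by the INPUT statement above, which would go in the DATA step in place of the IF statements:

ARRAY OZ{*} M875--P875;        /* all 36 measurement variables */
DO I = 1 TO DIM(OZ);
   IF OZ{I} = 0 THEN OZ{I} = .; /* recode 0 to the missing value */
END;
DROP I;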

Partial Correlations:

Often, two variables appear to be highly correlated only because both are highly correlated with a third variable. Computing partial correlations between variables removes the effects of the other variables.

Syntax:

PROC CORR DATA=MYDATA;
VAR X;
WITH Y;
PARTIAL V;

This computes the partial correlation between X and Y, after eliminating effects due to correlation with V.
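The effect can be seen by simulating a spurious correlation; a sketch (the data set SPURIOUS and variables V, X, Y are hypothetical), in which X and Y are related only through the common variable V:

DATA SPURIOUS;
DO I = 1 TO 100;
   V = RANNOR(0);
   X = V + 0.5*RANNOR(0); /* X = common signal plus its own noise */
   Y = V + 0.5*RANNOR(0); /* Y = common signal plus its own noise */
   OUTPUT;
END;
PROC CORR DATA=SPURIOUS NOSIMPLE NOPROB;
VAR X;
WITH Y;
PARTIAL V; /* the partial correlation should be near 0 */
RUN; QUIT;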

Partial Correlations: Example

Ozone measurements are correlated with temperature. The resulting seasonal effect can be removed as follows:

PROC CORR DATA=OZONE NOSIMPLE NOPROB;
VAR M875 M475 M75;
WITH P475 P75;
PARTIAL SEASONAL;

We see that the magnitude of the correlations has been reduced after taking the seasonal effects into account.

Exercise:

Refer to the Winnipeg Climate data set.

1. For each month, compute correlations between maximum temperature and each of minimum temperature, minimum and maximum pressure, and minimum and maximum wind speed.
2. Compute the partial correlations, taking minimum temperature into account.

Simple Linear Regression

• The equation for a straight line relating the (nonrandom) variables y and x is

y = α + βx

β is the slope.
α is the intercept.
The dependent variable y is the response variable.
The independent variable x is the predictor or explanatory variable.

‘Simulation’ Example:

Suppose the intercept is 2.0 and the slope is −1.5. Compute the y values corresponding to x = 1, 2, 4, 5, 8, 10, 13, and obtain a scatterplot of the paired observations.

DATA NONRAN;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
Y = ALPHA + BETA*X;

‘Simulation’ Example: Cont’d

DATALINES;
1
2
4
5
8
10
13
;
PROC PLOT;
PLOT Y*X;
RUN;
QUIT;

If BETA is positive, say BETA = 2.5, then the plotted line slopes upward instead of downward.

Simple linear model: Adding Noise

• Even if variables are related to each other by a straight line, experimental observations usually contain some kind of (unobservable) noise or random errors which cause small distortions.
• The simplest way of modelling this kind of noise is to assume it is normally distributed and to add it to y, i.e.

Y = y(x) + ε

or

Y = α + βx + ε

• For each observation, ε is a normal random variable having mean 0 and variance σ².

Simple linear model: Cont’d

• If there are n observations, there must be n random errors. We assume that they are independent of each other.
• This model is called the simple linear model.
• Note that since E[ε] = 0, we have

E[Y] = α + βx

i.e. the mean of the Y variable is a linear function of x.

Simulating the simple linear model:

Add noise to the above data. Assume σ = 0.2.

DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA = 0.2;
EPSILON = SIGMA*RANNOR(0);
Y = ALPHA + BETA*X + EPSILON;
DATALINES;
1
2
4
5

Simulating the simple linear model: Cont’d

8
10
13
;
PROC PLOT;
PLOT Y*X;
RUN;
QUIT;

Repeating this simulation experiment using σ = 1.0, σ = 1.5, and σ = 2.0, we see that as σ increases, the graph appears less and less linear.

Estimation:

Unlike in our simulation study, α and β are unknown and must be estimated. Least-squares estimates are computed by the REGression procedure, PROC REG.

Syntax:

PROC REG DATA=MYDATA;
MODEL Y = X;

Estimation: Cont’d

Estimating the parameters for the simulated data:

DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA = 0.2;
EPSILON = SIGMA*RANNOR(0);
Y = ALPHA + BETA*X + EPSILON;
DATALINES;
1
2
4
5

Estimation: Cont’d

8
10
13
;
PROC REG;
MODEL Y = X;
RUN; QUIT;

Note that the intercept estimate is near 2.0 and the slope estimate is near -1.5.

Example:

Ten fields were planted in wheat, and i kg/acre of nitrate was applied to the ith field, for i = 1, 2, ..., 10. We want to model the relationship between mean wheat yield Y and amount of nitrate X. That is, we assume

Y = α + βX + ε

where ε is the unobserved error random variable.

Example: Cont’d

To estimate α and β we use:

DATA WHTYIELD;
INPUT NITRATE YIELD;
DATALINES;
1 15
2 13
3 16
4 12
5 14
6 18
7 17
8 19
9 16
10 20
;

Example: Cont’d

PROC PLOT;
PLOT YIELD*NITRATE;
/* One should first plot YIELD against NITRATE to see
   whether a linear model is appropriate. */
PROC REG DATA=WHTYIELD;
MODEL YIELD = NITRATE;
RUN;

Output

The output window lists

• an ANOVA table,
• estimates of some statistics, and
• a table of parameter estimates and their standard errors.

The estimates of α and β are

• a = 12.7 (standard error = 1.31)
• b = 0.606 (standard error = .212)

and the fitted model is

y = 12.7 + .606x.

The estimated standard deviation of the errors ε is given by Root MSE and is 1.93.

Tests

We can also test whether the slope of the regression line is 0:

H0 : β = 0

versus

H1 : β ≠ 0.

Test statistic:

T = b/s(b) = 2.86.

p-value: 0.0212

Conclusion: Reject the null hypothesis at α = .05. The true slope is not 0.
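The reported p-value can be checked directly from the t distribution with n − 2 = 8 degrees of freedom; a sketch:

DATA PVAL;
T = 0.606/0.212;              /* T = b/s(b) = 2.86 */
P = 2*(1 - PROBT(ABS(T), 8)); /* two-sided p-value: approx. .0212 */
PUT T= P=;
RUN;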

Tests Cont’d

The intercept α can also be tested. This time, we have a p-value of .0001, which is very strong evidence against the hypothesis that α = 0.

These tests are based on the assumption that the observations are independent (i.e. the errors ε are independent of each other). They are exact if the errors are normally distributed, and approximately correct if the true error distribution has a finite variance.

The ANOVA table:

• This table shows how the variance in the Y (response) variable decomposes into variance explained by the regression on X (SSModel) plus variance left unexplained (SSError).
• SSModel + SSError = C Total (the corrected total sum of squares)
• MSModel = SSModel/DFModel (DFModel = number of parameters estimated − 1)

The ANOVA table: Cont’d

• MSError = SSError/DFError (DFError = n − DFModel − 1). Note that MSError is the square of the Root MSE.
• The test statistic F = MSModel/MSError is the square of the T statistic used to test whether the slope is 0.
• If F is large, we have reason to reject the null hypothesis in favor of the alternative that β ≠ 0. Note that the p-value for this test is identical to the earlier p-value.
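For the wheat yield data, for example, F = (2.86)² ≈ 8.18, and its p-value is the same .0212 obtained from the t test of the slope.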

Other Statistics

• R-Square = the coefficient of determination R² = .505. It is the square of the correlation between YIELD and NITRATE.
• Dep Mean = the average of the dependent variable values = 16.0.

Predicted Values:

• The predicted values can be calculated from the fitted regression equation for the given values of the explanatory variable X.
• For the wheat yield example, we found that

y = 12.7 + .606x

Therefore, if x = 3, we predict y = 12.7 + .606(3) = 14.5. This is the predicted value corresponding to x = 3.

Plotting Predicted Values:

SAS can plot the predicted values versus the explanatory variable:

PROC REG DATA=WHTYIELD;
MODEL YIELD = NITRATE;
PLOT PREDICTED.*NITRATE;
RUN;

Plotting Predicted Values: Overlay

It is possible to overlay this plot on the plot of the original data:

PROC REG DATA=WHTYIELD;
MODEL YIELD = NITRATE;
PLOT PREDICTED.*NITRATE='P'
     YIELD*NITRATE='*' / OVERLAY;
RUN;
/* PLOT X*Y='!' causes the plotting symbol to be '!' */

Residual Plots:

• Residuals are the differences between the response values and the predicted values:

y − ŷ = y − a − bx

• They are ‘estimates’ of the errors ε:

ε = y − α − βx

• Examine plots of the residuals:

1. Look for outliers - indications that the linear model may not be adequate or that the error distribution is not close enough to normal. Tests are not trustworthy in this case.

Residual Plots: Cont’d

2. Patterns can
   • indicate the need to transform the data or to add a quadratic term to the linear model, or
   • indicate that the error variance is not constant; if it is not, weighted least-squares should be used.

To get a feel for what the residual plots should look like if the linear model is appropriate, use simulation:

Residual Plots: Cont’d

DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA = 0.2;
EPSILON = SIGMA*RANNOR(0);
Y = ALPHA + BETA*X + EPSILON;
DATALINES;
1
2
4
5
7
8

Residual Plots: Cont’d

9
10
12
13
;
PROC REG;
MODEL Y = X;
PLOT RESIDUAL.*X;
RUN; QUIT;
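The residuals and predicted values can also be saved to a data set for further analysis; a sketch using the OUTPUT statement of PROC REG (DIAG, PRED, and RESID are hypothetical names):

PROC REG DATA=RANDOM;
MODEL Y = X;
OUTPUT OUT=DIAG P=PRED R=RESID; /* save predicted values and residuals */
PROC PLOT DATA=DIAG;
PLOT RESID*X;
RUN; QUIT;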

A quadratic example:

DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA = 0.2;
EPSILON = SIGMA*RANNOR(0);
Y = ALPHA + BETA*(X - 5)**2 + EPSILON; /* the true relation is quadratic in X */
DATALINES;
1
2
4
5
7
8

A quadratic example: Cont’d

9
10
12
13
;
PROC REG;
MODEL Y = X;
PLOT RESIDUAL.*X;
RUN; QUIT;

Outlier example:

DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA = 0.2;
SIGMA2 = 1.8;
U = UNIFORM(0);
/* with probability .2, the error comes from a much more
   variable distribution, producing occasional outliers */
IF U < .8 THEN EPSILON = SIGMA*RANNOR(0);
ELSE EPSILON = SIGMA2*RANNOR(0);
Y = ALPHA + BETA*X + EPSILON;
DATALINES;
1
2
4
5
7

Outlier example: Cont’d

8
9
10
12
13
;
PROC REG;
MODEL Y = X;
PLOT RESIDUAL.*X;
RUN; QUIT;

An increasing variance example

DATA RANDOM;
INPUT X;
ALPHA = 2.0;
BETA = -1.5;
SIGMA = 0.2*SQRT(X); /* SIGMA increases with the square root of X */
EPSILON = SIGMA*RANNOR(0);
Y = ALPHA + BETA*X + EPSILON;
DATALINES;
1
2
4
5
7
8

An increasing variance example Cont’d

9
10
12
13
;
PROC REG;
MODEL Y = X;
PLOT RESIDUAL.*X;
RUN; QUIT;

Exercise: Examine a plot of the residuals for the wheat yield data.

Adding a Quadratic Term:

To fit the quadratic model

Y = α + βX + β₂X² + ε

use the line

X2 = X**2;

in the DATA step, and use the following MODEL statement in PROC REG:

MODEL Y = X X2;

Example:

DATA WHTYIELD;
INPUT NITRATE YIELD;
NITRATE2 = NITRATE**2;
DATALINES;
1 15
2 13
3 16
4 12
5 14
6 18
7 17
8 19
9 16
10 20
;

Example: Cont’d

PROC REG DATA=WHTYIELD;
MODEL YIELD = NITRATE NITRATE2;
RUN;
QUIT;

The output includes a test of the hypothesis that β₂ = 0.

Transformations:

Sometimes an appropriate transformation (like a log or a square root) is sufficient to linearize the relationship between two variables. Such a transformation can also correct for a variance that is not constant. (N.B. If the response variable is a count, a nonconstant variance can almost always be corrected by taking the square root of the response variable.)
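A sketch of how such a transformation is carried out (MYDATA, COUNT, and X are hypothetical names):

DATA TRANS;
SET MYDATA;
SQRTY = SQRT(COUNT); /* square-root transform for a count response */
PROC REG DATA=TRANS;
MODEL SQRTY = X;
RUN; QUIT;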

Summary:

PROC REG;
MODEL Y = X;            /* FITS LINEAR MODEL RELATING RESPONSE Y
                           TO EXPLANATORY VARIABLE X */
PLOT PREDICTED.*X;      /* PLOTS PREDICTED VALUES */
PLOT RESIDUAL.*X = 'Y'; /* PLOTS RESIDUALS WITH PLOTTING SYMBOL 'Y' */