Copyright (c) Bani K. Mallick. STAT 651 Lecture #21

Date posted: 22-Dec-2015

Copyright (c)Bani K. Mallick 1

STAT 651

Lecture #21

Topics in Lecture #21: Correlation

Book Sections Covered in Lecture #21

Chapter 11.7

Lecture 20 Review: Leverage and Outliers

Outliers in Linear Regression are difficult to diagnose

They depend crucially on where X is

[Figure: scatterplot of Y against X with the fitted line]

A boxplot of Y would flag this point as an outlier, when in reality it fits the line quite well

Lecture 20 Review: Outliers and Leverage

It’s also the case that one observation can have a dramatic impact on the fit

[Figure: scatterplot of Y against X with one observation far to the right]

The slope of the line depends crucially on the value far to the right

Lecture 20 Review: Outliers and Leverage

But outliers can occur

[Figure: scatterplot with two fitted lines, labeled "Line with Outlier" and "Line without Outlier"]

This point is simply too high for its value of X

Lecture 20 Review: Outliers and Leverage

A leverage point is an observation with a value of X that is outlying among the X values

An outlier is an observation of Y that seems not to agree with the main trend of the data

Outliers and leverage values can distort the fitted least squares line

It is thus important to have diagnostics to detect when disaster might strike

Lecture 20 Review: Outliers and Leverage

We have three methods for diagnosing high leverage values and outliers

Leverage plots: For a single X, these are basically the same as boxplots of the X-space (leverage)

Cook’s distance (measures how much the fitted line changes if the observation is deleted)

Residual Plots
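The first two diagnostics can be sketched in a few lines for the single-X case. This is only an illustration on made-up data; the formulas used (leverage h_i = 1/n + (x_i - xbar)^2 / Sxx, and Cook's distance D_i = e_i^2 h_i / (p s^2 (1 - h_i)^2) with p = 2) are the usual textbook ones and are assumed here, since the slides do not spell them out.

```python
# Leverage and Cook's distance for simple linear regression (one X).
# The data below are made up for illustration only.

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def leverage(x):
    """h_i = 1/n + (x_i - xbar)^2 / Sxx: large when x_i is far from the other X's."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

def cooks_distance(x, y):
    """D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2, with p = 2 fitted coefficients."""
    n, p = len(x), 2
    b0, b1 = fit_line(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - p)   # residual mean square
    h = leverage(x)
    return [e ** 2 / (p * s2) * hi / (1 - hi) ** 2 for e, hi in zip(resid, h)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]   # the last X is far to the right
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]    # the last Y is well below the trend of the others

h = leverage(x)
d = cooks_distance(x, y)
for xi, hi, di in zip(x, h, d):
    print(f"x = {xi:5.1f}  leverage = {hi:.3f}  Cook's D = {di:.3f}")
```

In this made-up example the last observation has both the largest leverage and the largest Cook's distance, which is the kind of point the diagnostics are meant to flag.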

Correlation and Measures of Fit

You all know the word “correlation”, as in “Height and Weight are positively correlated”

Many of you may also have heard of R-squared, denoted by $R^2$

Both are measures of how well an independent variable predicts a dependent variable

Correlation and Measures of Fit

$R^2$ measures the fraction of variance explained by the least squares line

The relevant sums of squares are

$$\text{Sum of Squares Total: } SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

$$\text{Sum of Squares Residual: } SSR = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

The fraction of the total sum of squares explained by the fitted line is

$$R^2 = \frac{SST - SSR}{SST}$$
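The definition of R-squared above can be checked numerically. The sketch below fits a least squares line by hand and computes SST, SSR (residual) and R-squared; the data and the function names are made up for illustration.

```python
# Compute SST, SSR and R^2 for a least squares line (made-up data).

def least_squares(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def r_squared(x, y):
    """Return (SST, SSR, R^2), where SSR is the residual sum of squares."""
    intercept, slope = least_squares(x, y)
    ybar = sum(y) / len(y)
    fitted = [intercept + slope * xi for xi in x]
    sst = sum((yi - ybar) ** 2 for yi in y)
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return sst, ssr, (sst - ssr) / sst

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_noisy = [2.3, 2.9, 4.8, 4.1, 6.5, 6.2]      # scattered around a line
y_exact = [1.0 + 2.0 * xi for xi in x]        # exactly on a line

sst, ssr, r2 = r_squared(x, y_noisy)
print(f"noisy data: SST = {sst:.3f}, SSR = {ssr:.3f}, R^2 = {r2:.3f}")
print(f"exact line: R^2 = {r_squared(x, y_exact)[2]:.3f}")
```

When every point lies on the line the residual sum of squares vanishes, so the fraction explained is 1.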

Correlation and Measures of Fit

$R^2$ measures the fraction of variance explained by the least squares line

If Y and X are perfectly linearly related, then all the variation in Y is explained by the line, and thus $R^2 = 1$

If Y and X are completely independent, then the line explains nothing about Y, so $R^2 = 0$

However, Y and X can be perfectly related but not linearly, and $R^2$ is misleading in this case (see later on)

GPA and Height

[Figure: scatterplot of Grade Point Average (GPA) versus Height in inches]

Note that this is a fairly weak relationship, so little of the variance is explained: this suggests R-squared is near zero

GPA and Height

Model Summary: GPA regression on Height

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .273a   .074       .065                .4979

a. Predictors: (Constant), Height in inches

ANOVAb

Model 1      Sum of Squares   df   Mean Square   F       Sig.
Regression   1.954            1    1.954         7.881   .006a
Residual     24.294           98   .248
Total        26.247           99

a. Predictors: (Constant), Height in inches
b. Dependent Variable: Grade Point Average (GPA)

Aortic Valve Area and Body Surface Area

Note that this is a stronger relationship: suggests R-squared is higher

[Figure: Healthy Children, Stenosis Data. Scatterplot of Aortic Valve Area versus Body Surface Area]

AVA and BSA in Healthy Kids

Model Summary: AVA Regression on BSA in Healthy Kids

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .866a   .750       .747                .39337

a. Predictors: (Constant), Body Surface Area

ANOVAb

Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   31.640           1    31.640        204.474   .000a
Residual     10.522           68   .155
Total        42.162           69

a. Predictors: (Constant), Body Surface Area
b. Dependent Variable: Aortic Valve Area
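The Model Summary rows for both regressions can be recovered from the ANOVA sums of squares. The sketch below does this for the GPA and AVA output; the sums of squares and sample sizes are taken from the tables, while the adjusted R-squared and standard error formulas are the usual one-predictor textbook versions and are an assumption here, since the slides do not state them.

```python
import math

def summary_from_anova(ss_regression, ss_residual, n):
    """Recover (R, R^2, adjusted R^2, std. error) for a one-predictor regression."""
    ss_total = ss_regression + ss_residual
    r2 = ss_regression / ss_total             # same as (SST - SSR) / SST
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)  # assumed standard adjustment, one predictor
    std_error = math.sqrt(ss_residual / (n - 2))
    return math.sqrt(r2), r2, adj_r2, std_error

# Sums of squares and n (total df + 1) copied from the two ANOVA tables.
gpa = summary_from_anova(1.954, 24.294, n=100)
ava = summary_from_anova(31.640, 10.522, n=70)

for name, (r, r2, adj, se) in [("GPA on Height", gpa), ("AVA on BSA", ava)]:
    print(f"{name}: R = {r:.3f}, R^2 = {r2:.3f}, adj. R^2 = {adj:.3f}, s.e. = {se:.4f}")
```

Up to rounding in the printed sums of squares, the values agree with the Model Summary rows (.273, .074, .065, .4979 and .866, .750, .747, .39337).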

Correlation and Measures of Fit

The (Pearson) correlation coefficient measures how well Y and X are linearly related

The correlation is always between –1 and +1

$$r_{xy} = \begin{cases} +\sqrt{R^2} & \text{if least squares slope} > 0 \\ -\sqrt{R^2} & \text{if least squares slope} < 0 \\ 0 & \text{if least squares slope} = 0 \end{cases}$$
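The link between the Pearson correlation and R-squared can be verified directly: compute the correlation from its definition, then compare it with the square root of R-squared carrying the sign of the slope. The data and helper names below are made up for illustration.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation computed from its definition."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def slope_and_r2(x, y):
    """Least squares slope and R^2 = (SST - SSR) / SST."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sst = sum((yi - ybar) ** 2 for yi in y)
    ssr = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, (sst - ssr) / sst

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [9.8, 8.1, 7.4, 5.2, 4.9, 2.7]      # made-up data with a downward trend

slope, r2 = slope_and_r2(x, y)
r_from_r2 = math.copysign(math.sqrt(r2), slope)   # attach the sign of the slope
print(f"slope = {slope:.3f}, R^2 = {r2:.3f}")
print(f"r from definition = {pearson(x, y):.6f}, signed sqrt(R^2) = {r_from_r2:.6f}")
```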

Correlation and Measures of Fit

If the correlation = +1, then Y and X are perfectly positively related

If the correlation = -1, then Y and X are perfectly negatively related

If the correlation = 0, then Y and X are not linearly related


Correlation and Measures of Fit

The (Spearman) correlation coefficient measures how well Y and X are monotonically related

Replace Y by its rank among the Y’s

Replace X by its rank among the X’s

Compute the (Pearson) correlation of the ranks

Why would someone do a Spearman correlation? Because it is more robust to outliers, and it is not affected by monotone transformations such as the logarithm
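The rank recipe translates directly into code. The sketch below ranks each variable (averaging ranks for ties, an assumption about tie handling), takes the Pearson correlation of the ranks, and checks that replacing Y by log(1+Y) leaves the Spearman correlation unchanged while the Pearson correlation moves. The data are made up.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n, giving tied values the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [0.4, 0.6, 0.9, 1.1, 1.4, 1.7, 2.0]
y = [0.3, 0.5, 0.7, 1.2, 1.1, 2.0, 3.4]          # made-up, curved upward
log_y = [math.log(1 + yi) for yi in y]            # monotone transformation of Y

print(f"Pearson : raw = {pearson(x, y):.3f}, log = {pearson(x, log_y):.3f}")
print(f"Spearman: raw = {spearman(x, y):.3f}, log = {spearman(x, log_y):.3f}")
```

Because the logarithm preserves the ordering of the Y's, the ranks and hence the Spearman correlation are identical before and after the transformation.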

Correlation and Measures of Fit

Both types of correlations are easily obtained in SPSS

Go to “Analyze”, “Correlation” and type in all the variables that you want correlations for

You have to click on Spearman to get it, otherwise you get only Pearson

Confidence intervals for the population correlations are not included

SPSS Demonstration using aortic data

Correlation and Measures of Fit

The diagonals are meaningless: Y is perfectly correlated with Y

Correlations

                                                 Body Surface Area   Aortic Valve Area   Log(1+Aortic Valve Area)
Body Surface Area          Pearson Correlation   1.000               .866**              .873**
                           Sig. (2-tailed)       .                   .000                .000
                           N                     70                  70                  70
Aortic Valve Area          Pearson Correlation   .866**              1.000               .982**
                           Sig. (2-tailed)       .000                .                   .000
                           N                     70                  70                  70
Log(1+Aortic Valve Area)   Pearson Correlation   .873**              .982**              1.000
                           Sig. (2-tailed)       .000                .000                .
                           N                     70                  70                  70

**. Correlation is significant at the 0.01 level (2-tailed).

Correlation and Measures of Fit

Note how the Spearman correlation of BSA and AVA is the same as the Spearman correlation of BSA and log(1+AVA)

Correlations (Spearman's rho)

                                                     Body Surface Area   Aortic Valve Area   Log(1+Aortic Valve Area)
Body Surface Area          Correlation Coefficient   1.000               .883**              .883**
                           Sig. (2-tailed)           .                   .000                .000
                           N                         70                  70                  70
Aortic Valve Area          Correlation Coefficient   .883**              1.000               1.000**
                           Sig. (2-tailed)           .000                .                   .
                           N                         70                  70                  70
Log(1+Aortic Valve Area)   Correlation Coefficient   .883**              1.000**             1.000
                           Sig. (2-tailed)           .000                .                   .
                           N                         70                  70                  70

**. Correlation is significant at the .01 level (2-tailed).

Correlation and Measures of Fit

Both correlations are random variables, i.e., if you redo an experiment, you will get a different Pearson correlation

The population Pearson correlation is $\rho_{xy}$

The estimated standard error of the sample Pearson correlation is

$$\text{s.e.}(r_{xy}) = \sqrt{\frac{1 - r_{xy}^2}{n - 2}}$$
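That the sample correlation is a random variable can be seen by simulating the "redo the experiment" idea. The sketch below repeatedly draws samples from a population with a known correlation and records the sample Pearson correlation each time. The normal population, the value rho = 0.8, the sample size and the seed are all assumptions chosen for illustration.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def sample_correlation(rng, n, rho):
    """Draw n (X, Y) pairs whose population correlation is rho; return the sample r."""
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    return pearson(xs, ys)

rng = random.Random(651)          # fixed seed so the run is repeatable
rho, n, repeats = 0.8, 70, 200    # hypothetical population and design
rs = [sample_correlation(rng, n, rho) for _ in range(repeats)]

mean_r = sum(rs) / repeats
print(f"population correlation: {rho}")
print(f"sample correlations range from {min(rs):.3f} to {max(rs):.3f}")
print(f"average over {repeats} repeats: {mean_r:.3f}")
```

Each repeat gives a different sample correlation, scattered around the population value, which is why a standard error is needed.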

Correlation and Measures of Fit

Null hypothesis of no linear relationship: $H_0: \rho_{xy} = 0$

A $(1-\alpha)100\%$ CI for the population Pearson correlation is

$$r_{xy} \pm z_{\alpha/2} \sqrt{\frac{1 - r_{xy}^2}{n - 2}}$$

Since the population correlation must be between –1 and +1, you should restrict your interval to that range: reject null if interval does not include 0

Correlation and Measures of Fit

Consider the aortic stenosis healthy kids

n = 70, Pearson correlation = 0.866

The 95% CI is

$$r_{xy} \pm z_{\alpha/2} \sqrt{\frac{1 - r_{xy}^2}{n - 2}} = 0.866 \pm 1.96 \sqrt{\frac{1 - 0.866^2}{70 - 2}} = 0.866 \pm 0.119$$

that is, .747 to .985

What is the meaning of this interval? We are 95% confident that the population Pearson correlation is between .747 and .985
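The interval arithmetic for the healthy kids can be reproduced with a small helper. The function name and its default of 1.96 are illustrative choices; the formula and the restriction to the range –1 to +1 follow the slides.

```python
import math

def pearson_ci(r, n, z=1.96):
    """Approximate CI for the population Pearson correlation, restricted to [-1, 1]."""
    standard_error = math.sqrt((1 - r ** 2) / (n - 2))
    half_width = z * standard_error
    lower = max(-1.0, r - half_width)
    upper = min(1.0, r + half_width)
    return lower, upper

lower, upper = pearson_ci(r=0.866, n=70)
print(f"{lower:.3f} to {upper:.3f}")   # prints "0.747 to 0.985"

# The interval excludes 0, so the null hypothesis of no linear relationship is rejected.
print("reject H0" if lower > 0 or upper < 0 else "do not reject H0")
```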

Some Warnings About Correlation

The Pearson correlation can be greatly affected by outliers and leverage values

This is why it is good to have the Spearman correlation as well

Aortic Stenosis Data: Note the outlier in the Stenotic Kids

[Figure: Aortic Stenosis Data. Aortic Valve Area versus Body Surface Area with fitted linear regression lines, in separate panels for Healthy and Stenotic kids]

Some Warnings About Correlation

The Pearson correlation with the outlier in the Stenotic kids is 0.477

It is 0.648 without the outlier

The Spearman correlations are 0.691 and 0.762 with and without the outlier

I can make correlations dance

Some Warnings About Correlation

The correlations are 0.058 (left) and 1.00 (right)! The two data sets differ by only one point: a high leverage outlier

[Figure: Made Up Data with (left) and without (right) a high leverage outlier. Two scatterplot panels against x, each with a fitted linear regression line]
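The same effect is easy to reproduce. The sketch below uses its own made-up data (not the data behind the figure): ten points lying exactly on a line, plus one high leverage outlier. It reports the Pearson correlation with and without the extra point, and the Spearman correlation with it.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def simple_ranks(values):
    """Ranks 1..n; assumes no tied values (true for the data below)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(simple_ranks(x), simple_ranks(y))

x = [float(i) for i in range(1, 11)]
y = list(x)                       # ten points exactly on the line y = x

x_out = x + [30.0]                # one extra point far to the right ...
y_out = y + [2.5]                 # ... and far below the line

r_without = pearson(x, y)
r_with = pearson(x_out, y_out)
rho_with = spearman(x_out, y_out)
print(f"Pearson without outlier:  {r_without:.3f}")
print(f"Pearson with outlier:     {r_with:.3f}")
print(f"Spearman with outlier:    {rho_with:.3f}")
```

A single point drags the Pearson correlation from 1 to nearly 0, while the rank-based Spearman correlation is pulled down much less.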

Some Warnings About Correlation

The Pearson correlation only measures linear correlation

If your relationship is not linear, then Pearson will get confused

Some Warnings About Correlation

Note the perfect quadratic relationship

Pearson corr = 0

[Figure: scatterplot of Y (quadratic in X) against X]
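A perfect quadratic relationship with zero Pearson correlation can be constructed in a few lines. The sketch below uses Y = X squared on X values placed symmetrically about zero; this particular choice of X values is an assumption made for illustration.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [float(i) for i in range(-10, 11)]   # symmetric about 0
y = [xi ** 2 for xi in x]                # Y is determined exactly by X

r = pearson(x, y)
print(f"Pearson correlation: {abs(r):.3f}")

# On the right half alone the relationship is increasing, and the correlation is high.
x_right = [xi for xi in x if xi >= 0]
y_right = [xi ** 2 for xi in x_right]
print(f"Pearson correlation for X >= 0 only: {pearson(x_right, y_right):.3f}")
```

The positive and negative halves cancel, so the best fitting line is flat even though Y is known exactly once X is known.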

Construction Data

[Figure: scatterplot of the construction data, with Age (Modified) on the horizontal axis]

Construction Data

Correlations

                                                         Age (Modified)   Base Pay (modified)   Log(Base Pay modified - $30,000)
Age (Modified)                     Pearson Correlation   1.000            .178**                .120*
                                   Sig. (2-tailed)       .                .000                  .011
                                   N                     447              447                   447
Base Pay (modified)                Pearson Correlation   .178**           1.000                 .896**
                                   Sig. (2-tailed)       .000             .                     .000
                                   N                     447              447                   447
Log(Base Pay modified - $30,000)   Pearson Correlation   .120*            .896**                1.000
                                   Sig. (2-tailed)       .011             .000                  .
                                   N                     447              447                   447

**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).

Construction Data

Correlations (Spearman's rho)

                                                             Age (Modified)   Base Pay (modified)   Log(Base Pay modified - $30,000)
Age (Modified)                     Correlation Coefficient   1.000            .151**                .151**
                                   Sig. (2-tailed)           .                .001                  .001
                                   N                         447              447                   447
Base Pay (modified)                Correlation Coefficient   .151**           1.000                 1.000**
                                   Sig. (2-tailed)           .001             .                     .
                                   N                         447              447                   447
Log(Base Pay modified - $30,000)   Correlation Coefficient   .151**           1.000**               1.000
                                   Sig. (2-tailed)           .001             .                     .
                                   N                         447              447                   447

**. Correlation is significant at the .01 level (2-tailed).

Armspan Data (Males)

[Figure: Height and Armspan for Males. Scatterplot of Armspan (modified) versus Height (modified) with the fitted line]

Armspan (modified) = -9.16 + 1.13 * modh

R-Square = 0.73

Armspan Data (Males)

Correlations

                                           Height (modified)   Armspan (modified)
Height (modified)    Pearson Correlation   1.000               .855**
                     Sig. (2-tailed)       .                   .000
                     N                     50                  50
Armspan (modified)   Pearson Correlation   .855**              1.000
                     Sig. (2-tailed)       .000                .
                     N                     50                  50

**. Correlation is significant at the 0.01 level (2-tailed).

Armspan Data (Males)

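A 95% confidence interval for the population Pearson correlation is

$$r_{xy} \pm 1.96 \sqrt{\frac{1 - r_{xy}^2}{n - 2}} = 0.855 \pm 1.96 \sqrt{\frac{1 - 0.855^2}{50 - 2}} = 0.855 \pm 1.96 \sqrt{0.0056} = 0.855 \pm 0.146$$

giving the interval (0.709, 1.00), with the upper limit restricted to 1.00

Meaning?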

Armspan Data (Males)

Correlations (Spearman's rho)

                                               Height (modified)   Armspan (modified)
Height (modified)    Correlation Coefficient   1.000               .813**
                     Sig. (2-tailed)           .                   .000
                     N                         50                  50
Armspan (modified)   Correlation Coefficient   .813**              1.000
                     Sig. (2-tailed)           .000                .
                     N                         50                  50

**. Correlation is significant at the .01 level (2-tailed).