
Transcript of: Correlation and Regression-I, QSCI 381 – Lecture 36 (Larson and Farber, Sect 9.1)

Page 1: Correlation and Regression-I

QSCI 381 – Lecture 36 (Larson and Farber, Sect 9.1)

Page 2: Overview-I

Many of the questions commonly addressed in the natural sciences relate to whether there is a (significant) (linear) relationship between two variables.

[Figure: two scatterplots of a dependent variable against an independent variable, illustrating negative linear correlation and positive linear correlation.]

Page 3: Overview-II

There may be no correlation between two variables, or there may be a non-linear relationship.

[Figure: two scatterplots of a dependent variable against an independent variable, illustrating no relationship and a non-linear relationship.]

Page 4: Overview-III

We will be looking at two questions this week:
1. Is there really a relationship between two variables (the correlation problem)?
2. How do we predict values for one variable, given particular values for the other variable (the regression problem)?

Page 5: Independent and Dependent Variables

A correlation is a relationship between two variables. The data can be represented by the ordered pairs (x, y), where x is the independent, or explanatory, variable and y is the dependent, or response, variable.

Page 6: Correlation Coefficient-I

Correlations can be examined using scatterplots. However, these are subjective. The correlation coefficient, r, is a measure of the strength and the direction of a linear relationship between two variables. The population correlation coefficient is denoted ρ. The sample correlation coefficient is calculated as:

r = \frac{n \sum_i x_i y_i - (\sum_i x_i)(\sum_i y_i)}{\sqrt{n \sum_i x_i^2 - (\sum_i x_i)^2}\,\sqrt{n \sum_i y_i^2 - (\sum_i y_i)^2}}
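As a rough illustration (not from the textbook), the formula can be evaluated directly from the summations; the function below is a minimal Python sketch with an illustrative name:

```python
import math

def correlation_coefficient(x, y):
    """Sample correlation coefficient r, computed from the summation formula above."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2) *
                   math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator
```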

Page 7: Correlation Coefficient-II

The correlation coefficient ranges between -1 and 1.

[Figure: three scatterplots of a dependent variable against an independent variable, with correlation coefficients r = -0.0612, r = 0.9481, and r = -0.9306.]

Page 8: Example – Calculating r

[Figure: scatterplot of Recruitment (about 900 to 1400) against Sea Surface Temperature (about 14 to 19).]

Recruitment in a fish stock must relate in some way to the environment. Assess the level of correlation between recruitment and Sea Surface Temperature (SST).

Page 9: Example – Calculating r

SST (x)  Recruitment (y)      xy     x²        y²
15.13    1132.10           17127    229   1281639
17.86    1190.61           21259    319   1417560
15.64    1047.49           16379    245   1097233
15.71    1196.45           18794    247   1431486
17.74    1281.47           22739    315   1642167
16.46    1254.98           20656    271   1574976
15.02    1159.21           17410    226   1343757
17.47    1094.87           19124    305   1198733
14.28    1102.58           15749    204   1215676
15.44    1173.70           18127    239   1377565
15.82    1133.52           17929    250   1284862
14.59    955.35            13936    213    912702
17.83    1228.92           21908    318   1510237
17.32    1217.01           21084    300   1481108
Sum:
226.30   16168.24         262223   3679  18769702

r = \frac{n \sum_i x_i y_i - (\sum_i x_i)(\sum_i y_i)}{\sqrt{n \sum_i x_i^2 - (\sum_i x_i)^2}\,\sqrt{n \sum_i y_i^2 - (\sum_i y_i)^2}}
  = \frac{14(262223) - (226.30)(16168.24)}{\sqrt{14(3679) - 226.30^2}\,\sqrt{14(18769702) - 16168.24^2}}
  \approx \frac{12184.2}{19922.36} \approx 0.612

In EXCEL, we can calculate correlation coefficients using the formula: CORREL(A1:A10,B1:B10)
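As a cross-check on the hand calculation and the CORREL result, here is a minimal Python sketch using numpy (assuming numpy is available); the data are the SST and Recruitment columns from the table above:

```python
import numpy as np

# SST (x) and Recruitment (y) values from the table above
sst = [15.13, 17.86, 15.64, 15.71, 17.74, 16.46, 15.02,
       17.47, 14.28, 15.44, 15.82, 14.59, 17.83, 17.32]
recruitment = [1132.10, 1190.61, 1047.49, 1196.45, 1281.47, 1254.98, 1159.21,
               1094.87, 1102.58, 1173.70, 1133.52, 955.35, 1228.92, 1217.01]

# Off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(sst, recruitment)[0, 1]
print(round(r, 3))  # should be close to the 0.612 reported on the slide
```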

Page 10: Testing the Significance of a Correlation Coefficient-I

r is a statistic – it is based on a paired sample from the population.

We want to infer the significance of the population correlation coefficient, ρ, from the sample correlation coefficient, r.

Page 11: Testing the Significance of a Correlation Coefficient-II

1. State H0 and Ha and specify α.
2. Set d.f. = n − 2 and find the critical values of the t-distribution with n − 2 degrees of freedom.
3. Find the standardized test statistic.
4. Make a decision to reject or fail to reject the null hypothesis.

The standardized test statistic is:

t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}
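A minimal Python sketch of these four steps, using scipy for the critical value (the function name test_correlation is illustrative, not part of the textbook):

```python
from math import sqrt
from scipy import stats

def test_correlation(r, n, alpha=0.05):
    """Two-tailed t-test of H0: rho = 0 against Ha: rho != 0."""
    df = n - 2
    t_stat = r / sqrt((1 - r**2) / df)       # standardized test statistic
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-tailed critical value
    reject_h0 = abs(t_stat) > t_crit
    return t_stat, t_crit, reject_h0

# Example: the SST-Recruitment data, r = 0.612, n = 14
print(test_correlation(0.612, 14))  # t about 2.68, critical value about 2.18 -> reject H0
```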

Page 12: Example-I

Test the hypothesis ρ = 0 for the SST – Recruitment data set. Assume that α = 0.05.
1. H0: ρ = 0 (no correlation); Ha: ρ ≠ 0.
2. α = 0.05 – this is a two-tailed test with d.f. = 14 − 2 = 12 degrees of freedom.
3. The rejection region is |t| > 2.179.
4. The standardized test statistic is:

t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.612}{\sqrt{\dfrac{1 - 0.612^2}{14 - 2}}} \approx 2.681

5. Since 2.681 > 2.179, we reject the null hypothesis of no correlation.
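Equivalently, we can compare a p-value with α rather than using the rejection region; a short sketch assuming scipy is available:

```python
from scipy import stats

# Two-tailed p-value for the observed statistic t = 2.681 with 12 degrees of freedom
p_value = 2 * stats.t.sf(2.681, df=12)
print(round(p_value, 3))  # about 0.02, which is below alpha = 0.05, so we reject H0
```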

Page 13: Example-II

Using the same data as for Example I, examine the claim: "Recruitment is higher when the temperature is higher". Note that this claim was developed before the data were collected.

Note: whether a correlation coefficient is significant or not can be examined using tables (e.g. Table 11 of Appendix B). You can construct this table using EXCEL.
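Because the claim was stated before the data were collected and specifies a direction, one reading is a one-tailed test of H0: ρ ≤ 0 against Ha: ρ > 0; here is a minimal sketch under that assumption (scipy assumed available):

```python
from math import sqrt
from scipy import stats

# One-tailed test: H0: rho <= 0 versus Ha: rho > 0 (direction stated in advance)
r, n, alpha = 0.612, 14, 0.05
df = n - 2
t_stat = r / sqrt((1 - r**2) / df)   # about 2.681, as in Example I
t_crit = stats.t.ppf(1 - alpha, df)  # one-tailed critical value, about 1.782
print(t_stat > t_crit)               # True -> the claim is supported at alpha = 0.05
```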

Page 14: Correlation and Causation-I

Perhaps the biggest mistake that can be made when analyzing data is to confuse correlation and causation. The fact that two variables are strongly correlated does not in itself imply a cause-and-effect relationship between the variables.

Page 15: Correlation and Causation-II

Evaluating whether causation can be inferred is not trivial – infer causation only after thinking about the following:

Did I think of the relationship between these variables before I saw the data?

Is there a good theoretical reason for a direct cause-and-effect relationship between the variables?

Is there perhaps a reverse cause-and-effect relationship between the variables?

Is there perhaps a third (or more) variable(s) which determine(s) BOTH x and y?

Is the relationship just coincidental?

Page 16: Correlation and Causation-III

The danger of too much data. We spend five months in the field collecting data to examine the question of the relationship between the density of sparrows and the environment. Thanks to the internet we identify 120 possible environmental data series. We correlate the density estimates with each. What should we expect to find by doing this?
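With α = 0.05 and 120 separate tests, we should expect roughly 120 × 0.05 = 6 "significant" correlations even if none of the environmental series is truly related to sparrow density. A small simulation sketch illustrating this (all data here are randomly generated, not from the lecture):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_years, n_series, alpha = 20, 120, 0.05

# Sparrow densities and 120 environmental series, all generated independently,
# so the true correlation with density is zero for every series.
density = rng.normal(size=n_years)
environment = rng.normal(size=(n_series, n_years))

false_positives = sum(
    stats.pearsonr(density, series)[1] < alpha  # [1] is the two-tailed p-value
    for series in environment
)
print(false_positives)  # typically around 120 * 0.05 = 6 "significant" correlations
```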

Page 17: Correlation and Causation-IV

[Figure: distribution of correlation coefficients, with both tails of the Correlation Coefficient axis labelled "Significant, α = 0.05".]

Type I Error: probability of rejecting the null hypothesis when it is in fact true!

Page 18: Correlation and Causation-V

“Looking for” variables that correlate with a variable of interest is variously called “data mining”, “data dredging”, or “statistical fishing”.