381 Correlation and Regression-I QSCI 381 – Lecture 36 (Larson and Farber, Sect 9.1)
Overview-I
Many of the questions commonly addressed in the natural sciences relate to whether there is a (significant) (linear) relationship between two variables.
[Figure: two scatterplots of Dependent Variable against Independent Variables. Panels: Negative linear correlation; Positive linear correlation]
Overview-II
There may be no correlation between two variables or a non-linear relationship.
[Figure: two scatterplots of Dependent Variable against Independent Variables. Panels: No relationship; Non-linear relationship]
Overview-III
We will be looking at two questions this week:
Is there really a relationship between two variables? (the correlation problem)
How do we predict values for one variable, given particular values for the other variable? (the regression problem)
Independent and Dependent Variables
A correlation is a relationship between two variables. The data can be represented by the ordered pair (x, y), where x is the independent, or explanatory, variable and y is the dependent, or response, variable.
Correlation Coefficient-I
Correlations can be examined using scatterplots. However, these are subjective. The correlation coefficient, r, is a measure of the strength and the direction of a linear relationship between two variables:
$$r = \frac{n\sum_i x_i y_i - \left(\sum_i x_i\right)\left(\sum_i y_i\right)}{\sqrt{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}\,\sqrt{n\sum_i y_i^2 - \left(\sum_i y_i\right)^2}}$$
The population correlation coefficient is denoted ρ.
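The formula for r can be sketched in a few lines of Python. This is only an illustration: the function name `pearson_r` and the toy data are assumptions made for this example and are not part of the lecture.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient from the sums n, Σx, Σy, Σxy, Σx², Σy²."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired samples with at least two points")
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical toy data: one perfectly increasing and one perfectly
# decreasing linear relationship.
x = [1.0, 2.0, 3.0, 4.0]
y_up = [2.0, 4.0, 6.0, 8.0]
y_down = [8.0, 6.0, 4.0, 2.0]

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(r_up, r_down)
```

The two toy series sit at the extremes of the range of r: an exact increasing line gives a value at (or within floating-point error of) 1, and an exact decreasing line gives -1.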
Correlation Coefficient-II
The correlation coefficient ranges between -1 and 1.
[Figure: three scatterplots of Dependent Variable against Independent Variables. Panels: r=-0.0612; r=0.9481; r=-0.9306]
Example – Calculating r
[Figure: scatterplot of Recruitment (axis from 900 to 1400) against Sea Surface Temperature (axis from 14 to 19)]
Recruitment in a fish stock must relate in some way to the environment. Assess the level of correlation between recruitment and Sea Surface Temperature (SST).
Example – Calculating r
SST (x)   Recruitment (y)   xy       x²     y²
15.13     1132.10           17127    229    1281639
17.86     1190.61           21259    319    1417560
15.64     1047.49           16379    245    1097233
15.71     1196.45           18794    247    1431486
17.74     1281.47           22739    315    1642167
16.46     1254.98           20656    271    1574976
15.02     1159.21           17410    226    1343757
17.47     1094.87           19124    305    1198733
14.28     1102.58           15749    204    1215676
15.44     1173.70           18127    239    1377565
15.82     1133.52           17929    250    1284862
14.59      955.35           13936    213     912702
17.83     1228.92           21908    318    1510237
17.32     1217.01           21084    300    1481108
Total:
226.30    16168.24          262223   3679   18769702
$$r = \frac{n\sum_i x_i y_i - \left(\sum_i x_i\right)\left(\sum_i y_i\right)}{\sqrt{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}\,\sqrt{n\sum_i y_i^2 - \left(\sum_i y_i\right)^2}}$$
$$r = \frac{14 \times 262223 - 226.30 \times 16168}{\sqrt{14 \times 3679 - 226.30^2}\,\sqrt{14 \times 18769702 - 16168^2}} = \frac{12184.2}{19922.36} = 0.612$$
In EXCEL, we can calculate correlation coefficients using the formula: CORREL(A1:A10,B1:B10)
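The same calculation can be reproduced outside EXCEL. The sketch below recomputes the column totals and r in Python from the 14 (SST, Recruitment) pairs in the table. The variable names are illustrative choices. The table entries are rounded to two decimals, so the result may differ slightly from the slide's 0.612 in the third decimal.

```python
import math

# (SST, Recruitment) pairs copied from the table above.
sst = [15.13, 17.86, 15.64, 15.71, 17.74, 16.46, 15.02,
       17.47, 14.28, 15.44, 15.82, 14.59, 17.83, 17.32]
recruitment = [1132.10, 1190.61, 1047.49, 1196.45, 1281.47, 1254.98, 1159.21,
               1094.87, 1102.58, 1173.70, 1133.52, 955.35, 1228.92, 1217.01]

n = len(sst)
sum_x = sum(sst)
sum_y = sum(recruitment)
sum_xy = sum(x * y for x, y in zip(sst, recruitment))
sum_x2 = sum(x * x for x in sst)
sum_y2 = sum(y * y for y in recruitment)

# Computational formula for the sample correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
               * math.sqrt(n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(f"n = {n}, sum x = {sum_x:.2f}, sum y = {sum_y:.2f}")
print(f"sum xy = {sum_xy:.0f}, sum x^2 = {sum_x2:.0f}, sum y^2 = {sum_y2:.0f}")
print(f"r = {r:.3f}")
```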
Testing the Significance of a Correlation Coefficient-I
r is a statistic – it is based on a paired sample from the population.
We want to infer the significance of the population correlation coefficient, ρ, from the sample correlation coefficient, r.
Testing the Significance of a Correlation Coefficient-II
1. State H0 and Ha and specify α.
2. Set the d.f.=n-2 and find the critical values of the t-distribution with n-2 degrees of freedom.
3. Find the standardized test statistic:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
4. Make a decision to reject or fail to reject the null hypothesis.
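Steps 3 and 4 can be sketched as two small Python functions. The names `correlation_t_statistic` and `reject_null`, and the values r=0.5 and n=27, are hypothetical and chosen only for illustration. The critical value 2.060 is the standard two-tailed t-table entry for α=0.05 with 25 degrees of freedom.

```python
import math


def correlation_t_statistic(r, n):
    """Standardized test statistic t = r / sqrt((1 - r^2) / (n - 2))."""
    if n < 3:
        raise ValueError("need at least three pairs so that d.f. = n - 2 > 0")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    df = n - 2
    return r / math.sqrt((1.0 - r * r) / df)


def reject_null(r, n, critical_t):
    """Two-tailed decision: reject H0 when |t| exceeds the critical value."""
    return abs(correlation_t_statistic(r, n)) > critical_t


# Hypothetical inputs: r = 0.5 from n = 27 pairs, so d.f. = 25.
t_stat = correlation_t_statistic(0.5, 27)
decision = reject_null(0.5, 27, critical_t=2.060)
print(f"t = {t_stat:.3f}, reject H0: {decision}")
```

The critical value is passed in rather than computed, mirroring the table lookup in step 2.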
Example-I
Test the hypothesis ρ=0 for the SST – Recruitment data set. Assume that α=0.05.
1. H0: ρ=0 (no correlation); Ha: ρ≠0.
2. α=0.05 - this is a two-tailed test with d.f.=14-2=12 degrees of freedom.
3. The rejection region is |t|>2.179.
4. The standardized test statistic is:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.612}{\sqrt{\dfrac{1 - 0.612^2}{14 - 2}}} = 2.681$$
5. Because |t| = 2.681 > 2.179, we reject the null hypothesis of no correlation.
Example-II
Using the same data as for Example I, examine the claim: “Recruitment is higher when the temperature is higher”. Note that this claim was developed before the data were collected.
Note: whether a correlation coefficient is significant or not can be examined using tables (e.g. Table 11 of Appendix B). You can construct this table using EXCEL.
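One possible way to examine the claim is sketched below. It assumes the claim translates into the one-tailed alternative Ha: ρ>0 and that α=0.05 as in Example I. The upper-tail probability of the t-distribution is approximated with Simpson's rule. The helper names are hypothetical. The last lines show how one entry of a critical-value table for r could be derived from a two-tailed critical t, here 2.179 for 12 degrees of freedom.

```python
import math


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def upper_tail_probability(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 with Simpson's rule on [0, t]."""
    if t < 0:
        raise ValueError("this sketch only handles t >= 0")
    if steps % 2:
        raise ValueError("steps must be even for Simpson's rule")
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # The distribution is symmetric, so P(T > 0) = 0.5.
    return 0.5 - total * h / 3


r, n, alpha = 0.612, 14, 0.05      # values from Example I
df = n - 2
t_stat = r / math.sqrt((1 - r * r) / df)

# One-tailed test of H0: rho = 0 against the assumed Ha: rho > 0.
p_one_tailed = upper_tail_probability(t_stat, df)
reject = p_one_tailed < alpha
print(f"t = {t_stat:.3f}, one-tailed p = {p_one_tailed:.4f}, reject H0: {reject}")

# Critical |r| implied by a two-tailed critical t: r = t / sqrt(t^2 + d.f.).
t_crit_two_tailed = 2.179
r_crit = t_crit_two_tailed / math.sqrt(t_crit_two_tailed ** 2 + df)
print(f"critical |r| for n = {n}, alpha = {alpha}: {r_crit:.3f}")
```

Solving the test-statistic formula for r gives the critical |r| directly, so a table of critical correlation coefficients needs only the t critical values and the sample size.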
Correlation and Causation-I
Perhaps the biggest mistake that can be made when analyzing data is to confuse correlation and causation. The fact that two variables are strongly correlated does not in itself imply a cause-and-effect relationship between the variables.
Correlation and Causation-II
Evaluating whether causation can be inferred is not trivial – infer causation only after thinking about the following:
Did I think of the relationship between these variables before I saw the data?
Is there a good theoretical reason for a direct cause-and-effect relationship between the variables?
Is there perhaps a reverse cause-and-effect relationship between the variables?
Is there perhaps a third variable (or several) that determines BOTH x and y?
Is the relationship just coincidental?
Correlation and Causation-III
The danger of too much data.
We spend five months in the field collecting data to examine the question of the relationship between the density of sparrows and the environment. Thanks to the internet we identify 120 possible environmental data series. We correlate the density estimates with each. What should we expect to find by doing this?
Correlation and Causation-IV
[Figure: plot against Correlation Coefficient (axis from -1 to 0.95), with two regions labelled "Significant α=0.05"]
Type I Error: probability of rejecting the null hypothesis when it is in fact true!
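The sparrow scenario can be explored with a small simulation. This is a sketch under stated assumptions, not an analysis of real data: the density series and all 120 environmental series are independent random noise, each series has 14 observations (an arbitrary choice that allows the critical value 2.179 from Example-I to be reused), and the 120 tests are treated as independent. All names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Sample correlation coefficient via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


def count_spurious(n_obs, n_series, r_crit, rng):
    """Count 'significant' correlations between pure-noise series."""
    density = [rng.gauss(0, 1) for _ in range(n_obs)]
    count = 0
    for _ in range(n_series):
        series = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson_r(density, series)) > r_crit:
            count += 1
    return count


N_OBS = 14        # assumption: observations per series
N_SERIES = 120    # environmental series, as in the scenario
ALPHA = 0.05
T_CRIT = 2.179    # two-tailed critical t for d.f. = 12 (from Example-I)

df = N_OBS - 2
r_crit = T_CRIT / math.sqrt(T_CRIT ** 2 + df)

# With no true relationships, each test is a Type I error with probability alpha.
expected_false_positives = N_SERIES * ALPHA
prob_at_least_one = 1 - (1 - ALPHA) ** N_SERIES

rng = random.Random(381)  # fixed seed so the run is repeatable
counts = [count_spurious(N_OBS, N_SERIES, r_crit, rng) for _ in range(200)]
mean_count = sum(counts) / len(counts)

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one 'significant' result): {prob_at_least_one:.4f}")
print(f"mean count over 200 simulated field seasons: {mean_count:.2f}")
```

Under these assumptions, about 120 × 0.05 = 6 of the series would be expected to look significant even though none is related to sparrow density.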
Correlation and Causation-V
“Looking for” variables that correlate with a variable of interest is variously called “data mining”, “data dredging”, “statistical fishing”.