Chapter 2
Looking at Data - Relationships
Relations Among Variables
• Response variable - Outcome measurement (or characteristic) of a study. Also called: dependent variable, outcome, and endpoint. Labelled as y.
• Explanatory variable - Condition that explains or causes changes in response variables. Also called: independent variable and predictor. Labelled as x.
• Theories usually concern relationships among variables, and statistical methods can be used to test them.
• Research questions are stated in forms such as: Do changes in x cause changes in y?
Scatterplots
• Identify the explanatory and response variables of interest, and label them as x and y
• Obtain a set of individuals and observe the pair (x_i, y_i) for each individual. There will be n pairs.
• Statistical convention has the response variable (y) placed on the vertical (up/down) axis and the explanatory variable (x) placed on the horizontal (left/right) axis. (Note: economists reverse axes in price/quantity demand plots)
• Plot the n pairs of points (x,y) on the graph
France August 2003 Heat Wave Deaths
• Individuals: 13 cities in France
• Response: Excess Deaths (%), Aug 1-19, 2003 vs 1999-2002
• Explanatory Variable: Change in Mean Temp in period (°C)
• Data:

City         Dth03   Dth9902   %chng (y)   Degchg (x)
Lille          200     192.3           4          4.0
Marseilles     571     456.8          25          4.3
Grenoble       148     115.6          28          6.3
Rennes         156     114.7          36          5.6
Toulouse       315     231.6          36          6.6
Bordeaux       318     222.4          43          6.2
Strasbourg     253     167.5          51          5.9
Nice           341     222.9          53          4.3
Poitiers       184     102.8          79          7.3
Lyon           447     248.3          80          6.8
Le Mans        204     112.1          82          7.0
Dijon          168      87.0          93          7.4
Paris         1854     766.1         142          6.7
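The slide does not spell out how the %chng column is obtained. A minimal sketch, assuming it is the percent by which the 2003 deaths exceed the 1999-2002 baseline (the function name `excess_pct` is introduced here for illustration), reproduces the tabulated values for a few of the cities:

```python
# Deaths for Aug 1-19, 2003 and the 1999-2002 baseline, taken from the data table.
deaths = {
    "Marseilles": (571, 456.8),
    "Nice": (341, 222.9),
    "Dijon": (168, 87.0),
    "Paris": (1854, 766.1),
}

def excess_pct(observed, baseline):
    """Percent by which observed deaths exceed the baseline (assumed definition)."""
    return 100.0 * (observed - baseline) / baseline

for city, (dth03, dth9902) in deaths.items():
    print(f"{city:<11} {excess_pct(dth03, dth9902):6.1f}")
```

Rounded to whole percents, these agree with the %chng (y) column for the same cities.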
[Figure: 2003 France Heat Wave Mortality. Scatterplot of Excess Mortality (%) on the vertical axis against Change in Mean Temp (Celsius) on the horizontal axis; one point is labelled "Possible Outlier".]
Example - Pharmacodynamics of LSD
• Response (y) - Math score (mean among 5 volunteers)
• Explanatory (x) - LSD tissue concentration (mean of 5 volunteers)
• Raw data and scatterplot of Score vs LSD concentration:

Score (y)   LSD Conc (x)
    78.93           1.17
    58.20           2.97
    67.47           3.26
    37.47           4.69
    45.65           5.83
    32.92           6.00
    29.97           6.41
[Figure: Scatterplot of SCORE (vertical axis) against LSD_CONC (horizontal axis).]
Source: Wagner, et al (1968)
Manufacturer Production/Cost Relation
Month   Prod    Cost     Month   Prod    Cost     Month   Prod    Cost
    1  46.75   92.64        17  36.54   91.56        33  32.26   66.71
    2  42.18   88.81        18  37.03   84.12        34  30.97   64.37
    3  41.86   86.44        19  36.60   81.22        35  28.20   56.09
    4  43.29   88.80        20  37.58   83.35        36  24.58   50.25
    5  42.12   86.38        21  36.48   82.29        37  20.25   43.65
    6  41.78   89.87        22  38.25   80.92        38  17.09   38.01
    7  41.47   88.53        23  37.26   76.92        39  14.35   31.40
    8  42.21   91.11        24  38.59   78.35        40  13.11   29.45
    9  41.03   81.22        25  40.89   74.57        41   9.50   29.02
   10  39.84   83.72        26  37.66   71.60        42   9.74   19.05
   11  39.15   84.54        27  38.79   65.64        43   9.34   20.36
   12  39.20   85.66        28  38.78   62.09        44   7.51   17.68
   13  39.52   85.87        29  36.70   61.66        45   8.35   19.23
   14  38.05   85.23        30  35.10   77.14        46   6.25   14.92
   15  39.16   87.75        31  33.75   75.47        47   5.45   11.44
   16  38.59   92.62        32  34.29   70.37        48   3.79   12.69
x = Amount Produced   y = Total Cost   n = 48 months (not in order)
[Figure: Production (x) / Cost (y) Relation. Scatterplot of Total Cost (vertical axis) against Total Production (horizontal axis).]
Correlation
• Numerical measure to summarize the strength of the linear (straight-line) association between two variables
• Bounded between -1 and +1 (Labelled as r)
– Values near -1: Strong Negative association
– Values near 0: Weak or no association
– Values near +1: Strong Positive association
• Not affected by changes of units (linear transformations with positive slope) of either x or y
• Does not distinguish between response and explanatory variable (x and y can be interchanged)
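$$\mathrm{COV}(x,y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) = \frac{\mathrm{COV}(x,y)}{s_x s_y}$$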
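The two properties listed for r (symmetry in x and y, and invariance to a change of units) can be checked numerically. The sketch below is only an illustration: the helper `correlation` and the rescaling constants are assumptions made here, and the data are the LSD scores and concentrations from the earlier example.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r, computed from standardized deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# LSD example data from this chapter.
lsd_conc = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
score = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]

r = correlation(lsd_conc, score)

# Interchanging x and y leaves r unchanged.
r_swapped = correlation(score, lsd_conc)

# Hypothetical change of units: positive scale factor plus a shift.
rescaled_conc = [1000 * x + 5 for x in lsd_conc]
r_rescaled = correlation(rescaled_conc, score)

print(round(r, 3), round(r_swapped, 3), round(r_rescaled, 3))
```

A negative scale factor would flip the sign of r while leaving its magnitude the same, which is why the invariance is usually stated for changes of units.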
Excess French Heatwave Deaths
City         Degchg (x)   %chng (y)   x-xbar   y-ybar   (x-xbar)(y-ybar)
Lille               4.0           4    -2.03   -53.85           109.3155
Marseilles          4.3          25    -1.73   -32.85            56.8305
Grenoble            6.3          28     0.27   -29.85            -8.0595
Rennes              5.6          36    -0.43   -21.85             9.3955
Toulouse            6.6          36     0.57   -21.85           -12.4545
Bordeaux            6.2          43     0.17   -14.85            -2.5245
Strasbourg          5.9          51    -0.13    -6.85             0.8905
Nice                4.3          53    -1.73    -4.85             8.3905
Poitiers            7.3          79     1.27    21.15            26.8605
Lyon                6.8          80     0.77    22.15            17.0555
Le Mans             7.0          82     0.97    24.15            23.4255
Dijon               7.4          93     1.37    35.15            48.1555
Paris               6.7         142     0.67    84.15            56.3805
Total              78.4       752.0      0.0      0.0              333.7
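$$\bar{x} = 6.03 \qquad s_x = 1.16 \qquad \bar{y} = 57.85 \qquad s_y = 36.46 \qquad n = 13$$

$$\mathrm{COV}(x,y) = \frac{333.7}{13-1} = 27.81 \qquad r = \frac{27.81}{(1.16)(36.46)} = \frac{27.81}{42.29} = 0.66$$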
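The hand calculation above can be reproduced in a few lines. This is a sketch using Python's standard `statistics` module; the variable names are chosen here for illustration, and the data are the 13 (Degchg, %chng) pairs from the table.

```python
from statistics import mean, stdev

# Change in mean temperature (x) and excess deaths in percent (y), 13 French cities.
temp_change = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]
excess_pct = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]

n = len(temp_change)
x_bar, y_bar = mean(temp_change), mean(excess_pct)
s_x, s_y = stdev(temp_change), stdev(excess_pct)  # sample SDs (divisor n - 1)

# Sample covariance: sum of cross-products of deviations, divided by n - 1.
cov_xy = sum((x - x_bar) * (y - y_bar)
             for x, y in zip(temp_change, excess_pct)) / (n - 1)

# Correlation: covariance scaled by the two standard deviations.
r = cov_xy / (s_x * s_y)

print(f"xbar={x_bar:.2f} sx={s_x:.2f} ybar={y_bar:.2f} sy={s_y:.2f}")
print(f"COV(x,y)={cov_xy:.2f} r={r:.2f}")
```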
Examples
Least-Squares Regression
• Goal: Fit a line that “best fits” the relationship between the response variable and the explanatory variable
• Equation of a straight line: y = a + bx
– a - y-intercept (value of y when x = 0)
– b - slope (amount y increases as x increases by 1 unit)
• Prediction: Often want to predict what y will be at a given level of x (e.g. How much will it cost to fill an order of 1000 t-shirts?)
• Extrapolation: Using a fitted line outside the range of the explanatory variable observed in the sample: BAD IDEA
Least-Squares Regression
• y = a + bx is a deterministic equation
• Sample data don’t fall on a straight line, but rather around one
• Obtain equation that “best fits” a sample of data points
• Error - Difference between observed response and predicted response (from equation)
• Least Squares criterion: Choose the line that minimizes the sum of squared errors. Resulting regression line:
$$\hat{y} = a + bx \qquad b = r\,\frac{s_y}{s_x} \qquad a = \bar{y} - b\bar{x}$$
Excess French Heatwave Deaths
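$$\bar{x} = 6.03 \qquad s_x = 1.16 \qquad \bar{y} = 57.85 \qquad s_y = 36.46 \qquad r = 0.66$$

$$b = 0.66\left(\frac{36.46}{1.16}\right) = 0.66(31.43) = 20.74$$

$$a = 57.85 - 20.74(6.03) = 57.85 - 125.06 = -67.21$$

$$\hat{y} = -67.21 + 20.74x$$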
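As a check on the arithmetic, the slope and intercept can be computed from the formulas b = r s_y / s_x and a = ybar - b xbar without rounding the intermediate values. This is a sketch only; `predict` and the 5.0 degree input are hypothetical names and values introduced for illustration.

```python
from statistics import mean, stdev

temp_change = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]
excess_pct = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]

n = len(temp_change)
x_bar, y_bar = mean(temp_change), mean(excess_pct)
s_x, s_y = stdev(temp_change), stdev(excess_pct)
r = sum((x - x_bar) * (y - y_bar)
        for x, y in zip(temp_change, excess_pct)) / ((n - 1) * s_x * s_y)

# Least-squares slope and intercept from the summary statistics.
b = r * s_y / s_x
a = y_bar - b * x_bar

def predict(x_new):
    """Predicted excess mortality (%) at a given temperature change."""
    return a + b * x_new

print(f"b = {b:.2f}, a = {a:.2f}")
print(f"predicted excess mortality at a 5.0 C change: {predict(5.0):.1f}%")
```

With unrounded inputs the fit comes out near b = 20.6 and a = -66.3, slightly different from the 20.74 and -67.21 obtained above from r rounded to 0.66. The predicted values in the residual table later in the chapter (for example 16.04 at x = 4) are consistent with the unrounded coefficients.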
[Figure: 2003 France Heat Wave Mortality. Scatterplot of Excess Mortality (%) against Change in Mean Temp (Celsius) with the fitted regression line.]
For each 1°C increase in mean temperature, predicted excess mortality increases by about 20 percentage points.
Effect of an Outlier (Paris)
• Re-fitting the model without Paris, which had a very high excess mortality (Using EXCEL):
$$r^* = 0.76 \qquad \hat{y}^* = -52.78 + 17.34x$$
[Figure: Heat Wave Mortality (No Paris). Scatterplot of Excess Mortality against Temp Change with the re-fitted line.]
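The slide reports the re-fit from Excel. A comparable check in Python is sketched below; `least_squares` is a helper written for this illustration, and it uses the algebraically equivalent form b = Sxy / Sxx rather than r s_y / s_x.

```python
from math import sqrt

def least_squares(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    r = sxy / sqrt(sxx * syy)
    return a, b, r

# Paris is the last entry in both lists.
temp_change = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]
excess_pct = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]

a_all, b_all, r_all = least_squares(temp_change, excess_pct)
a_np, b_np, r_np = least_squares(temp_change[:-1], excess_pct[:-1])

print(f"all 13 cities: y-hat = {a_all:.2f} + {b_all:.2f}x, r = {r_all:.2f}")
print(f"without Paris: y-hat = {a_np:.2f} + {b_np:.2f}x, r = {r_np:.2f}")
```

Dropping the single city lowers the slope and raises the correlation, which is the sensitivity to an outlier that the slide illustrates.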
Squared Correlation
• The squared correlation represents the fraction of the variation in the response variable that is “explained” by the explanatory variable
• Represents the improvement (reduction in sum of squared errors) by using x (and fitted equation y-hat) to predict y as opposed to ignoring x (and simply using the sample mean y-bar) to predict y
• 0 ≤ r² ≤ 1
– Values near 0: x does not help predict y (regression line flat)
– Values near 1: x predicts y well (data near regression line)
$$r^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$$
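The two readings of r² given above (fraction of variation explained, and relative reduction in the sum of squared errors) give the same number for a least-squares line. The sketch below illustrates this on the heat wave data; the names `sst`, `ssr`, and `sse` are labels chosen here and do not appear in the slides.

```python
temp_change = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]
excess_pct = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]

n = len(temp_change)
x_bar = sum(temp_change) / n
y_bar = sum(excess_pct) / n

# Least-squares fit.
sxx = sum((x - x_bar) ** 2 for x in temp_change)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temp_change, excess_pct))
b = sxy / sxx
a = y_bar - b * x_bar
y_hat = [a + b * x for x in temp_change]

# Total variation, variation explained by the line, and leftover squared error.
sst = sum((y - y_bar) ** 2 for y in excess_pct)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((y - yh) ** 2 for y, yh in zip(excess_pct, y_hat))

r_squared = ssr / sst
reduction = (sst - sse) / sst  # improvement over predicting with y-bar alone

print(f"r^2 = {r_squared:.3f}, reduction in squared error = {reduction:.3f}")
```

For these data r² is about 0.43, the square of r = 0.66.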
Residual Analysis
• Residuals: Difference between observed responses and their predicted values:
$$e = y - \hat{y}$$
• Useful to plot the residuals versus the level of the explanatory variable (x)
• Outliers: Large (positive or negative) residuals. Values of y that are inconsistent with prediction
• Influential observations: Cases where the level of the explanatory variable is far away from the other individuals (extreme x values)
France Heatwave Mortality

  y     x     yhat   e=y-yhat
  4   4.0    16.04     -12.04
 25   4.3    22.22       2.78
 28   6.3    63.39     -35.39
 36   5.6    48.98     -12.98
 36   6.6    69.56     -33.56
 43   6.2    61.33     -18.33
 51   5.9    55.15      -4.15
 53   4.3    22.22      30.78
 79   7.3    83.98      -4.98
 80   6.8    73.68       6.32
 82   7.0    77.80       4.20
 93   7.4    86.03       6.97
142   6.7    71.62      70.38
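The residual column can be regenerated from the data. A minimal sketch, assuming the fitted line is the unrounded least-squares line for all 13 cities (the variable names are illustrative):

```python
temp_change = [4.0, 4.3, 6.3, 5.6, 6.6, 6.2, 5.9, 4.3, 7.3, 6.8, 7.0, 7.4, 6.7]
excess_pct = [4, 25, 28, 36, 36, 43, 51, 53, 79, 80, 82, 93, 142]

n = len(temp_change)
x_bar = sum(temp_change) / n
y_bar = sum(excess_pct) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(temp_change, excess_pct))
     / sum((x - x_bar) ** 2 for x in temp_change))
a = y_bar - b * x_bar

# Predicted values and residuals e = y - y-hat for every city.
y_hat = [a + b * x for x in temp_change]
residuals = [y - yh for y, yh in zip(excess_pct, y_hat)]

print("  y     x    yhat        e")
for y, x, yh, e in zip(excess_pct, temp_change, y_hat, residuals):
    print(f"{y:3d}  {x:4.1f}  {yh:6.2f}  {e:7.2f}")

# The largest residual flags the candidate outlier.
worst = max(range(n), key=lambda i: abs(residuals[i]))
print("largest |residual| at x =", temp_change[worst], "y =", excess_pct[worst])
```

Least-squares residuals sum to zero (up to floating-point error), so a single very large positive residual, as for Paris, has to be balanced by negative residuals elsewhere.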
[Figure: Residual Plot. Residuals (vertical axis) against Temp Change (x) on the horizontal axis; Paris is labelled as an outlier.]
Miscellaneous Topics
• Lurking Variable: Variable not included in the regression analysis that may influence the association between y and x. An association produced this way is sometimes referred to as a spurious association between y and x.
• Association does not imply causation (it is one of several steps in demonstrating cause-and-effect)
• Do not extrapolate outside the range of x observed in the study
• Some relationships are not linear, and may show low correlation even when the relation is strong
• Correlations based on averages across individuals tend to be higher than those based on individuals
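The point about nonlinear relationships can be illustrated with made-up data (not from the slides): y is determined exactly by x through y = x², yet the correlation is zero because the relationship has no linear component over a range of x that is symmetric about 0.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a perfect but curved relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = correlation(x, y)
print(f"r = {r:.3f}")
```

A scatterplot would reveal the curve immediately, which is one reason to plot the data before summarizing it with r.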
Causation
• Association between x and y demonstrated
• Time order confirmed (x “occurs” before y)
• Alternative explanations are considered and explained away:
– Lurking variables - Another variable causes both x and y
– Confounding - Two explanatory variables are highly related, and which causes y cannot be determined
• Dose-Response Effect
• Plausible cause