Chapter 4
Uploaded by quinlan-powell
04/19/23 1
Chapter 4
Regression
Regression
• Like correlation, regression addresses linear relationships between quantitative variables X and Y
• Objective of correlation: quantify the direction and strength of the linear association
• Objective of regression: derive the best-fitting line that describes the association
• We are especially interested in the slope of the line
Country       Per capita GDP, X ($1K)   Life expectancy, Y (yrs)
Austria       21.4                      77.48
Belgium       23.2                      77.53
Finland       20.0                      77.32
France        22.7                      78.63
Germany       20.8                      77.17
Ireland       18.6                      76.39
Italy         21.5                      78.51
Netherlands   22.0                      78.15
Switzerland   23.8                      78.99
UK            21.2                      77.37
Same illustrative data as Ch 3
Enter data into calculator
Algebraic equation for a line
• y = a + b∙X, where
• b ≡ slope ≡ change in Y per unit X
• a ≡ intercept ≡ value of Y when X = 0
Statistical Equation for a Line
ŷ = a + b∙X
where:
ŷ ≡ predicted average of Y at a given level of X
a ≡ intercept
b ≡ slope
a and b are called regression coefficients
How do we find the equation for the best fitting line through the scatter cloud?
[Figure: scatterplot of life expectancy (yrs) against per capita GDP]
Ans: We use the “least squares method”
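For readers who want the criterion spelled out, "least squares" can be written as the quantity being minimized. The notation below (n data points, indexed by i) is an assumption added here for illustration; the slides do not show it.

```latex
% Sum of squared vertical distances between each observed y_i
% and the line a + b x_i. The least squares line is the choice
% of a and b that makes this sum as small as possible.
\[
  \mathrm{SSE}(a, b) \;=\; \sum_{i=1}^{n} \bigl( y_i - (a + b\,x_i) \bigr)^2
\]
```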
Slope
$b = r \, \dfrac{s_y}{s_x}$
Intercept
$a = \bar{y} - b\,\bar{x}$
These formulas derive the coefficients for the least squares regression line
Illustrative Example (GDP & Life Expectancy)
Statistics for illustrative data (calculated with TI-30XIIS):
$\bar{x} = 21.52$, $s_x = 1.532$, $\bar{y} = 77.754$, $s_y = 0.795$, $r = 0.809$
Calculation of regression coefficients by hand:
$b = r \, \dfrac{s_y}{s_x} = 0.809 \cdot \dfrac{0.795}{1.532} = 0.420$
$a = \bar{y} - b\,\bar{x} = 77.754 - (0.420)(21.52) = 68.716$
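The hand calculation can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the variable names are made up for illustration, and the slope is rounded to three decimals before the intercept is computed, which appears to be how the hand calculation was done.

```python
# Summary statistics reported for the GDP / life expectancy data
r = 0.809                      # correlation coefficient
s_x, s_y = 1.532, 0.795        # standard deviations of X and Y
x_bar, y_bar = 21.52, 77.754   # means of X and Y

# Slope: b = r * (s_y / s_x)
b = r * s_y / s_x
b_rounded = round(b, 3)        # assumption: round as in the hand calculation

# Intercept: a = y_bar - b * x_bar
a = y_bar - b_rounded * x_bar

print(f"b = {b_rounded:.3f}, a = {a:.3f}")  # prints: b = 0.420, a = 68.716
```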
“Least Squares” Regression Coefficients via TI-30XIIS
BEWARE! The TI-30XIIS labels the coefficients the opposite way from this chapter: it reports the slope as a and the intercept as b. Swap the two labels when reading the calculator output.
STAT > 2-VAR > DATA > STATVAR
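If a calculator is not at hand, the same coefficients can be computed from the raw data. The sketch below uses only Python's standard library; `fit_line` is a hypothetical helper name, and the data are the ten countries listed earlier.

```python
from statistics import mean, stdev

# Per capita GDP ($1K) and life expectancy (years) for the ten countries
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

def fit_line(x, y):
    """Return (a, b, r): least squares intercept, slope, and correlation."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)          # sample standard deviations
    # Correlation from the sum of cross-products of deviations
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = cross / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x                      # slope
    a = y_bar - b * x_bar                  # intercept
    return a, b, r

a, b, r = fit_line(gdp, life)
print(f"a = {a:.2f}, b = {b:.3f}, r = {r:.3f}")
```

Because nothing is rounded along the way, the intercept can differ from the hand-calculated value in the third decimal place.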
Interpretation of Slope (GDP & Life Expectancy)
ŷ = 68.7 + 0.42∙X
b = increase in predicted Y per 1-unit increase in X = 0.42 years
Each ↑$1K in per capita GDP is associated with a 0.42-year increase in life expectancy
Interpretation of Intercept
• Mathematically = the predicted value of Y when X = 0
• In the real world = has no interpretation unless a value of X = 0 is plausible
Regression Line for Prediction
• Example: What is the predicted life expectancy of a country with a per capita GDP of 20 ($20K)?
• ŷ at X = 20: ŷ = 68.7 + (0.42)X = 68.7 + (0.42)(20) = 77.1 years (77.12 with the unrounded coefficients 68.716 and 0.420)
• The regression line always goes through (x̄, ȳ), which in this case is (21.5, 77.8)
• To draw the regression line, connect any two points on the line
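The prediction steps above can be sketched in Python. The function name `predict_life_expectancy` is hypothetical; the coefficients are the rounded values from the fitted line, so results agree with the slide only to about one decimal place.

```python
def predict_life_expectancy(gdp, a=68.7, b=0.42):
    """Predicted average life expectancy (years) at a given per capita GDP ($1K)."""
    return a + b * gdp

# Prediction for a country with a per capita GDP of 20
y_at_20 = predict_life_expectancy(20)

# The line passes (up to rounding of a and b) through the point of means
y_at_mean = predict_life_expectancy(21.52)

# Any two points on the line are enough to draw it
line_points = [(x, predict_life_expectancy(x)) for x in (18, 24)]

print(round(y_at_20, 2), round(y_at_mean, 2), line_points)
```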
Case Study (Life Expectancy)
[Figure: plot of life expectancy (yrs) against per capita GDP]
Coefficient of Determination r²
Interpretation: proportion of the variability in Y mathematically explained by X
In our example, r = 0.809
r² = 0.809² ≈ 0.65 (about 0.66 if r is not rounded first)
Interpretation: 66% of the variability in Y (life expectancy) is mathematically explained* by X (GDP)
* mathematically explained ≠ causally explained
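To make "proportion of variability explained" concrete, the sketch below computes r² as one minus the ratio of leftover (residual) variation to total variation in Y. For a least squares line this equals the square of r. The variable names are illustrative, and the slope is computed from sums of squares, which is algebraically the same as b = r·s_y/s_x.

```python
from statistics import mean

gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

x_bar, y_bar = mean(gdp), mean(life)

# Least squares slope and intercept
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(gdp, life))
s_xx = sum((x - x_bar) ** 2 for x in gdp)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Total variation in Y, and variation left over after fitting the line
ss_total = sum((y - y_bar) ** 2 for y in life)
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(gdp, life))

r_squared = 1 - ss_resid / ss_total
print(f"r^2 = {r_squared:.3f}")
```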
Cautions about linear regression
1. Applies to linear relationships only
2. Strongly influenced by outliers, especially when outlier is in the X direction
3. Do not extrapolate!
4. Association ≠ causation (Beware of lurking variables.)
Outliers / Influential Points
• Outliers in the X direction have strong influence (they tip the line)
• Example (figure)
– Child 18 = outlier in X direction
– Changes the slope substantially
[Figure: regression lines fitted without and with the outlier]
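The effect of an outlier in the X direction is easy to reproduce. The data below are invented purely for illustration (they are not the data behind the slide's figure): five points that lie close to a line with slope near 1, plus one point far to the right.

```python
from statistics import mean

def slope(x, y):
    """Least squares slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

# Hypothetical data: roughly linear, slope close to 1
x = [1, 2, 3, 4, 5]
y = [2.0, 2.9, 4.1, 5.0, 6.0]

# Same data plus one hypothetical outlier far out in the X direction
x_out = x + [15]
y_out = y + [3.0]

slope_without = slope(x, y)
slope_with = slope(x_out, y_out)
print(f"without outlier: {slope_without:.2f}, with outlier: {slope_with:.3f}")
```

A single extreme X value is enough to flatten the fitted line almost completely in this made-up example.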
Do Not Extrapolate!
• Example (figure): Sarah’s height from age 3 to 5
• Least squares regression line: ŷ = 2.32 + 0.159∙X
• Predict height at age 30:
ŷ = 2.32 + 0.159(30) = 7.09′ (ridiculous)
[Figure: height (feet) against age (years)]
Do NOT extrapolate beyond the range of X
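One way to enforce this rule in code is to have the prediction function refuse inputs outside the observed range of X. This is only a sketch: `predict_height` and its parameters are hypothetical names, with the coefficients and the age range of 3 to 5 taken from the example above.

```python
def predict_height(age, a=2.32, b=0.159, x_min=3, x_max=5):
    """Predicted height (feet) at a given age (years), within the fitted range only."""
    if not x_min <= age <= x_max:
        # Outside the data used to fit the line: refuse to extrapolate
        raise ValueError(f"age {age} is outside the fitted range [{x_min}, {x_max}]")
    return a + b * age

print(round(predict_height(4), 3))   # within range: a prediction is returned

try:
    predict_height(30)               # far outside the range
except ValueError as err:
    print("refused:", err)
```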
Association ≠ Causation
• “Association” is not the same as “causation”
• Lurking variable ≡ an extraneous factor (Z) that is associated with both X and Y
• Lurking variables can confound an association
Example of Confounding by a Lurking Variable
• Explanatory variable X ≡ number of prior children
• Response variable Y ≡ the risk of Down’s syndrome
• Lurking variable Z ≡ advanced age of mother
• X is associated with Y, but does not cause Y in this example
• Z does cause Y
[Diagram: “Older mother” linked to both “Number of children” and “Mental retardation”]
Criteria used to establish causality, with examples about smoking (X) and lung cancer (Y)
• Strength of association
– X & Y are strongly correlated
• Consistency of findings
– Many studies have shown X & Y to be correlated
• Dose-response relationship
– The more you smoke, the more you increase risk
• Temporality (time relation)
– Lung cancer occurs after 10 – 20 years of smoking
• Biological plausibility
– Chemicals in cigarette smoke are mutagenic