Chapter 13 Simple Linear Regression and Correlation: Inferential Methods.
christian-lawson -
Chapter 13
Simple Linear Regression and
Correlation: Inferential Methods
Suppose we were to investigate the relationship between y = the first-year college grade point average and x = high school grade point average.
Is the first-year college grade point average determined solely by the high school grade point average?
A relationship in which the value of y is completely determined by the value of an independent variable x is called a deterministic relationship.
The first-year college grade point average and the high school grade point average do NOT have a deterministic relationship.
A description of the relationship between two variables that are not deterministically related can be given by a probabilistic model.
The equation for an additive probabilistic model is:
$$y = \text{deterministic function of } x + \text{random deviation} = f(x) + e$$
where e is an “error” variable.
[Figure: plot of y against x, with x1 and x2 marked on the x-axis.]
The simple linear regression model assumes that there is a line with y-intercept $\alpha$ and slope $\beta$, called the population regression line.
When a value of the independent variable x is fixed and an observation on the dependent variable y is made,
$$y = \alpha + \beta x + e$$
[Figure: population regression line (slope $\beta$, y-intercept $\alpha$) with the random deviations e1 and e2 of two observed points.]
Without the random deviation e in the equation, all observed (x, y) points would fall exactly on the
population regression line.
Basic Assumptions of the Simple Linear Regression Model
1. The distribution of e at any particular x value has mean value 0. That is, $\mu_e = 0$.
2. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by $\sigma$.
3. The distribution of e at any particular value of x is normal.
4. The random deviations e1, e2, . . ., en associated with different observations are independent of one another.
Let’s look at the heights and weights of a population of adult women.
[Figure: Weight plotted against Height (60 to 68 inches), with a distribution of weights drawn at each height.]
How much would an adult female weigh if she were 5 feet tall?
Weights of women who are 5 feet tall will vary. In other words, there is a distribution of weights for adult females who are 5 feet tall.
Are some of these weights more likely than others? What would this distribution look like?
This distribution is normal.
What would you expect for other heights?
We want the standard deviations of all these normal distributions to be the same.
Where would you expect the population regression line to be?
Basic Assumptions of the Simple Linear Regression Model Revisited
1. The distribution of e at any particular x value has mean value 0. That is, $\mu_e = 0$.
2. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by $\sigma$.
3. The distribution of e at any particular value of x is normal.
4. The random deviations e1, e2, . . ., en associated with different observations are independent of one another.
Remember the variable e is a measure of the extent that individual y-values deviate from
the population regression line.
For any particular x value, the standard deviation
of y equals the standard deviation of e.
The distribution of y at any particular value of x is normal.
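The four assumptions can be illustrated with a small simulation. This is only a sketch: the values of ALPHA, BETA, and SIGMA are hypothetical, chosen for illustration and not taken from any data in this chapter. At each fixed x, the simulated y values should center on the line height $\alpha + \beta x$ and have a standard deviation close to $\sigma$.

```python
import random
import statistics

# Hypothetical population values, chosen only for illustration.
ALPHA, BETA, SIGMA = -100.0, 3.5, 8.0


def simulate_y(x, n, rng):
    """Draw n independent observations y = ALPHA + BETA*x + e, with e ~ Normal(0, SIGMA)."""
    return [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for _ in range(n)]


rng = random.Random(13)
for x in (60, 64, 68):
    ys = simulate_y(x, 20000, rng)
    line_height = ALPHA + BETA * x
    # The sample mean should be near the line height, the sample sd near SIGMA.
    print(x, line_height, round(statistics.mean(ys), 2), round(statistics.stdev(ys), 2))
```

The same SIGMA is used at every x, which is assumption 2; each call to `rng.gauss` is an independent draw, which is assumption 4.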
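We use $\hat{y} = a + bx$ to estimate the true population regression line.
b = point estimate of $\beta$ = $\dfrac{S_{xy}}{S_{xx}}$
where
$$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} \quad \text{and} \quad S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$$
a = point estimate of $\alpha$ = $\bar{y} - b\bar{x}$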
Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females. Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother’s age for babies born to young mothers.
The following data are on x = maternal age (in years) and y = birth weight of baby (in grams).
x   15    17    18    15    16    19    17    16    18    19
y   2289  3393  3271  2648  2897  3327  2970  2535  3138  3573
Sketch a scatterplot of these data.
[Figure: scatterplot of Baby’s Weight (g) against Mother’s Age (yrs).]
The scatterplot shows a linear pattern, and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.
Birth Weight Continued . . .
x = maternal age (in years) and y = birth weight of baby (in grams)
The estimated regression line is
$$\hat{y} = -1163.45 + 245.15x$$
[Figure: scatterplot of Baby’s Weight (g) against Mother’s Age (yrs).]
The average weight of babies increases by approximately 245.15 grams for each increase of 1 year in the mother’s age.
What is the point estimate for the mean weight of babies born to 18-year-old mothers?
$$\hat{y} = -1163.45 + 245.15(18) = 3249.25 \text{ grams}$$
This is the point estimate for the mean weight of all babies born to 18-year-old mothers.
This is also the prediction of the weight of a single baby born to a mother 18 years of age.
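The point estimates for the birth weight data can be reproduced with a short Python sketch of the $S_{xy}$ and $S_{xx}$ formulas. The function name `least_squares` and the variable names are illustrative choices, not part of the slides.

```python
def least_squares(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(x)
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    b = s_xy / s_xx                    # point estimate of the slope beta
    a = sum(y) / n - b * sum(x) / n    # point estimate of the intercept alpha
    return a, b


age = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weight = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

a, b = least_squares(age, weight)
print(f"y-hat = {a:.2f} + {b:.2f} x")                         # y-hat = -1163.45 + 245.15 x
print(f"estimate at x = 18: {a + b * 18:.2f} grams")          # estimate at x = 18: 3249.25 grams
```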
The statistic for estimating the variance $\sigma^2$ is
$$s_e^2 = \frac{\text{SSResid}}{n - 2}$$
where
$$\text{SSResid} = \sum (y - \hat{y})^2$$
The estimate for the standard deviation $\sigma$ is
$$s_e = \sqrt{s_e^2}$$
The subscript e reminds us that we are estimating the variance of the “errors” or residuals.
Note that the degrees of freedom associated with estimating $\sigma^2$ or $\sigma$ in simple linear regression is df = n - 2.
Why n - 2? Since we must estimate both $\alpha$ and $\beta$ in the regression line, we reduce the sample size n by 2.
Recall the coefficient of determination, $r^2$, is the proportion of observed y variation that is attributed to the model relationship.
Birth Weight Revisited . . .
x = maternal age (in years) and y = birth weight of baby (in grams)
[Figure: scatterplot of Baby’s Weight (g) against Mother’s Age (yrs).]
$$s_e = 205.308 \qquad r^2 = .78$$
For a particular mother’s age, the typical deviation of a baby’s weight from the population regression line is approximately 205 grams.
Approximately 78% of the observed variability in the weight of babies can be explained by this model.
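The values of $s_e$ and $r^2$ for the birth weight data can be checked with the following Python sketch. It takes the estimated line $\hat{y} = -1163.45 + 245.15x$ as given and assumes the Chapter 5 relationship $r^2 = 1 - \text{SSResid}/\text{SSTo}$, which is not restated on these slides. The function name `residual_summary` is illustrative.

```python
import math

age = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weight = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]
a, b = -1163.45, 245.15  # estimated intercept and slope for these data


def residual_summary(x, y, a, b):
    """Return (SSResid, s_e, r^2) for the fitted line y-hat = a + b*x."""
    n = len(x)
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    y_bar = sum(y) / n
    ss_to = sum((yi - y_bar) ** 2 for yi in y)
    s_e = math.sqrt(ss_resid / (n - 2))   # df = n - 2
    r_sq = 1 - ss_resid / ss_to           # assumed definition from Chapter 5
    return ss_resid, s_e, r_sq


ss_resid, s_e, r_sq = residual_summary(age, weight, a, b)
print(round(ss_resid, 2), round(s_e, 3), round(r_sq, 3))
```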
Properties of the Sampling Distribution of b
When the four basic assumptions of the simple linear regression model are satisfied, the following statements are true:
1. The mean value of b is $\beta$. That is, $\mu_b = \beta$.
2. The standard deviation of the statistic b is $\sigma_b = \dfrac{\sigma}{\sqrt{S_{xx}}}$.
3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).
Since $\beta$ is almost always unknown, it must be estimated from independently selected observations. The slope b of the least-squares line gives a point estimate for $\beta$.
Since $\sigma$ is usually unknown, the estimated standard deviation of the statistic b is
$$s_b = \frac{s_e}{\sqrt{S_{xx}}}$$
Confidence Interval for $\beta$
When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for $\beta$, the slope of the population regression line, has the form
$$b \pm (t \text{ critical value}) \cdot s_b$$
where the t critical value is based on df = n - 2.
Is cardiovascular fitness (as measured by time to exhaustion from running on a treadmill) related to an athlete’s performance in a 20-km ski race?
The following data on x = treadmill time to exhaustion (in minutes) and y = 20-km ski time (in minutes) were taken from the article “Physiological Characteristics and Performance of Top U.S. Biathletes” (Medicine and Science in Sports and Exercise, 1995):
x   7.7   8.4   8.7   9.0   9.6   9.6   10.0  10.2  10.4  11.0  11.7
y   71.0  71.4  65.0  68.7  64.4  69.4  63.0  64.6  66.9  62.6  61.7
Sketch a scatterplot for the data.
[Figure: scatterplot of Ski Time (min) against Treadmill Time (min).]
The plot shows a linear pattern, and the vertical spread of points does not appear to be changing over the range of x values in the sample. If we assume that the distribution of errors at any given x value is approximately normal, then the simple linear regression model seems appropriate.
Biathletes Continued . . .
x = treadmill exhaustion time
y = ski time
Find a 95% confidence interval for the slope of the true regression line.
$$b \pm (t \text{ critical value}) \cdot s_b = -2.3335 \pm (2.26)(.591) = (-3.671, -.999)$$
We are 95% confident that the true average decrease in ski time associated with a 1 minute increase in treadmill exhaustion time is between 1 minute and 3.7 minutes.
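The confidence interval for the slope can be computed from the raw biathlete data with a sketch like the one below. The t critical value 2.26 (df = 9, 95% confidence) is taken from the slide and supplied by hand, since Python's standard library has no t table. The function name `slope_confidence_interval` is illustrative, and $S_{xx}$ and $S_{xy}$ are computed in deviation form, which is algebraically the same as the computing formulas given earlier. Because of rounding, the endpoints may differ from the slide's in the third decimal place.

```python
import math

treadmill = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
ski = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]


def slope_confidence_interval(x, y, t_crit):
    """Return (b, s_b, (low, high)) for the slope of the least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(ss_resid / (n - 2))
    s_b = s_e / math.sqrt(s_xx)          # estimated standard deviation of b
    return b, s_b, (b - t_crit * s_b, b + t_crit * s_b)


b, s_b, (low, high) = slope_confidence_interval(treadmill, ski, t_crit=2.26)
print(round(b, 4), round(s_b, 4), (round(low, 3), round(high, 3)))
```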
Biathletes Continued . . .
Partial Minitab Output
The regression equation is
Ski time = 88.8 - 2.33 treadmill time
Predictor    Coef      StDev    T       P
Constant     88.796    5.750    15.44   0.000
Treadmill    -2.3335   0.5911   -3.95   0.003
S = 2.188   R-Sq = 63.4%   R-Sq (adj) = 59.3%
Analysis of Variance
Source           DF   SS        MS       F       P
Regression       1    74.630    74.630   15.58   0.003
Residual Error   9    43.097    4.789
Total            10   117.727
Reading the output:
• “The regression equation is”: equation of the estimated regression line
• Coef for Constant: estimated y intercept a
• Coef for Treadmill: estimated slope b
• StDev for Treadmill: $s_b$ = estimated standard deviation of b
• S: $s_e$
• R-Sq: $100 \times r^2$. R-Sq (adj) is not used in simple linear regression.
• SS for Residual Error: SSResid
• SS for Total: SSTo
• MS for Residual Error: $s_e^2$
• DF for Residual Error: n - 2
Summary of Hypothesis Tests Concerning $\beta$
Null hypothesis: H0: $\beta$ = hypothesized value
Test statistic:
$$t = \frac{b - \text{hypothesized value}}{s_b}$$
The test is based on df = n - 2.
Alternative hypothesis and P-value:
• Ha: $\beta$ > hypothesized value: area to the right of t under the appropriate t curve
• Ha: $\beta$ < hypothesized value: area to the left of t under the appropriate t curve
• Ha: $\beta \neq$ hypothesized value: 2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative
Often the hypothesized value is zero. This is called the model utility test for simple linear regression.
Summary of Hypothesis Tests Concerning $\beta$ Continued . . .
Assumptions:
For this test to be appropriate, the four basic assumptions of the simple linear regression model must be met:
1. The distribution of e at any particular x value has a mean of 0 ($\mu_e = 0$).
2. The standard deviation of e is $\sigma$, which does not depend on x.
3. The distribution of e at any particular x value is normal.
4. The random deviations e1, e2, …, en associated with different observations are independent of one another.
[Figure: Weight plotted against Height (60 to 68 inches).]
Suppose the least-squares line is horizontal. Would height be useful in predicting weight?
What is the slope of a horizontal line?
A slope of zero means that there is NO linear relationship between x and y!
The Model Utility Test for Simple Linear Regression
The model utility test for simple linear regression is the test of
H0: $\beta$ = 0
Ha: $\beta \neq$ 0
The null hypothesis specifies that there is no useful linear relationship between x and y.
Test statistic:
$$t = \frac{b}{s_b}$$
Biathletes Revisited . . .
x = treadmill exhaustion time
y = ski time
[Figure: scatterplot of Ski Time (min) against Treadmill Time (min).]
Even though the scatterplot indicates a linear relationship between ski time and treadmill time, let’s perform the model utility test.
H0: $\beta$ = 0
Ha: $\beta \neq$ 0
where $\beta$ is the slope of the population regression line between treadmill time and ski time
$\alpha$ = .05, df = 9
$$t = \frac{-2.3335}{0.5911} = -3.95$$
P-value = .003
Since the P-value < $\alpha$, we reject H0. There is sufficient evidence of a linear relationship between treadmill time and ski time.
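The test statistic and two-sided P-value for the model utility test can be sketched in Python as follows. The standard library has no t distribution, so this sketch approximates the tail area by numerically integrating the t density with Simpson's rule. The helper names `t_pdf` and `t_upper_tail` are illustrative, and the result is a numerical approximation rather than software output.

```python
import math


def t_pdf(u, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


b, s_b, df = -2.3335, 0.5911, 9          # values from the Minitab output
t_stat = b / s_b                          # model utility test: hypothesized value is 0
p_value = 2 * t_upper_tail(abs(t_stat), df)
print(round(t_stat, 2), round(p_value, 3))
```

Doubling the upper-tail area of |t| gives the two-sided P-value for Ha: $\beta \neq$ 0.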
Biathletes Revisited . . .
Partial Minitab Output
The regression equation is
Ski time = 88.8 - 2.33 treadmill time
Predictor    Coef      StDev    T       P
Constant     88.796    5.750    15.44   0.000
Treadmill    -2.3335   0.5911   -3.95   0.003
S = 2.188   R-Sq = 63.4%   R-Sq (adj) = 59.3%
Analysis of Variance
Source           DF   SS        MS       F       P
Regression       1    74.630    74.630   15.58   0.003
Residual Error   9    43.097    4.789
Total            10   117.727
• T for Treadmill: the t test statistic, Coef ÷ StDev = -2.3335 ÷ 0.5911 = -3.95
• P for Treadmill: the P-value
Statistical software usually performs the model utility test with H0: $\beta$ = 0 versus Ha: $\beta \neq$ 0.
Checking Model Adequacy
The simple linear regression model is
$$y = \alpha + \beta x + e$$
where e represents the random deviation of an observed y value from the population regression line $\alpha + \beta x$.
The assumptions for simple linear regression are based on this random deviation e.
If we knew the deviations e1, e2, …, en, we could examine them for any inconsistencies with model assumptions.
However, we do not know the deviations e1, e2, …, en because the population regression line is unknown.
Therefore, we must estimate these deviations using the residuals from the estimated line. Thus, we use the residuals to check our assumptions.
Residual Analysis
• Standardize the residuals to look at their magnitudes:
$$\text{standardized residual} = \frac{\text{residual}}{\text{estimated standard deviation of residual}}$$
Most statistical software will perform this calculation. It is tedious to do by hand.
• Create a residual plot (from Chapter 5) or a standardized residual plot (which is a plot of the (x, standardized residual) pairs)
Any observation with a large positive or negative residual should be examined carefully for any error in recording data, nonstandard experimental condition, or atypical experimental unit.
A desirable plot is one that exhibits no particular pattern (such as curvature or much greater spread in one part of the plot than the other) and that has no point that is far removed from all the others.
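The standardization can be sketched in Python for the biathlete data. The slides leave the estimated standard deviation of a residual to software, so this sketch assumes the usual formula for simple linear regression, $s_e\sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}}}$. The function name `standardized_residuals` is illustrative.

```python
import math

treadmill = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
ski = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]


def standardized_residuals(x, y):
    """Return (residuals, standardized residuals) for the least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a = y_bar - b * x_bar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(r ** 2 for r in resid) / (n - 2))
    # Assumed formula (not on the slides): estimated sd of the i-th residual.
    sd = [s_e * math.sqrt(1 - 1 / n - (xi - x_bar) ** 2 / s_xx) for xi in x]
    return resid, [r / s for r, s in zip(resid, sd)]


resid, std_resid = standardized_residuals(treadmill, ski)
for xi, r, sr in zip(treadmill, resid, std_resid):
    print(xi, round(r, 2), round(sr, 2))
```

Residuals at x values far from $\bar{x}$ have a smaller estimated standard deviation, so dividing by a single $s_e$ for every point would not give the same values.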
A Look at Standardized Residual Plots
This is a desirable plot in that it exhibits no
pattern and has no point that lies far away from
the other points.
This plot exhibits a curved pattern which indicates
that the fitted model should be changed to incorporate
the curvature.
In this plot, the standard deviation of the residuals increases as the x-values increase.
While a straight-line model might still be appropriate, the best-fit line should be found using weighted least-squares.
Consult your local statistician!
Both of these plots contain
points far away from the others.
These points can have substantial effects on estimates of $\alpha$ and $\beta$ as well as other quantities.
Biathletes Revisited . . .
r = residuals
sr = standardized residuals (from Minitab)
x    7.7    8.4    8.7    9.0    9.6    9.6    10.0   10.2   10.4   11.0   11.7
y    71.0   71.4   65.0   68.7   64.4   69.4   63.0   64.6   66.9   62.6   61.7
r    0.17   2.21   -3.49  0.91   -1.99  3.01   -2.46  -0.39  2.37   -0.53  0.21
sr   0.10   1.13   -1.74  0.44   -0.96  1.44   -1.18  -0.19  1.16   -0.27  0.12
[Figure: scatterplot of Ski Time (min) against Treadmill Time (min).]
Let’s look at a normal probability plot of the standardized residuals.
[Figure: normal probability plot of Standardized Residual against Normal Score.]
The normal probability plot of the standardized residuals is quite straight. There is no reason to doubt the plausibility that the random deviations e are normally distributed.
Biathletes Continued . . .
r = residuals
sr = standardized residuals (from Minitab)
Sketch a standardized residual plot.
[Figure: plot of Standardized Residuals against Treadmill Time.]
The standardized residual plot does not show evidence of any pattern or of increasing spread.
Sketch a residual plot.
[Figure: plot of Residuals against Treadmill Time.]
Notice that these two plots have similar appearances.
Remember that residuals can also be plotted against y.
Optional Topics
Inferences Based on the Estimated Regression Line and Inference about the Population Correlation Coefficient
Properties of the Sampling Distribution of a + bx for a Fixed Value of x
Let x* denote a particular value of the independent variable x. When the four basic assumptions of the simple linear regression model are satisfied, the sampling distribution of the statistic a + bx* has the following properties:
1) The mean value of a + bx* is $\alpha + \beta x^*$, so a + bx* is an unbiased statistic for estimating the mean y value when x = x*.
2) The standard deviation of the statistic a + bx*, denoted by $\sigma_{a+bx^*}$, is given by
$$\sigma_{a+bx^*} = \sigma\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
3) The distribution of a + bx* is normal.
The farther x* is from the center $\bar{x}$, the larger $\sigma_{a+bx^*}$ is.
Since $\sigma$ is unknown, $\sigma_{a+bx^*}$ can be estimated by $s_{a+bx^*}$, which substitutes $s_e$ in place of $\sigma$.
Confidence Interval for a Mean y Value
When the basic assumptions of the simple linear regression model are met, a confidence interval for $\alpha + \beta x^*$, the mean y value when x has value x*, is
$$a + bx^* \pm (t \text{ critical value}) \cdot s_{a+bx^*}$$
where the t critical value is based on df = n - 2.
Because $s_{a+bx^*}$ is larger the farther x* is from $\bar{x}$, the confidence interval becomes wider as x* moves away from the center of the data.
Physical characteristics of sharks are of interest to surfers and scuba divers as well as to marine researchers. The data on x = length (in feet) and y = jaw width (in inches) for 44 sharks were found in various articles appearing in the magazines Skin Diver and Scuba News. (These data are found on page 778 of the text.)
Because it is difficult to measure jaw width in living sharks, researchers would like to determine whether it is possible to estimate jaw width from body length, which is more easily measured.
The scatterplot of the data shows a linear pattern and is consistent with use of the simple linear regression model.
Jaws Continued . . .
The regression equation is
Jaw Width = 0.69 + 0.963 Length
Predictor    Coef      StDev     T       P
Constant     0.688     1.299     0.53    0.599
Length       0.96345   0.08228   11.71   0.000
S = 1.376   R-Sq = 76.6%   R-Sq (adj) = 76.0%
The simple linear regression model explains 76.6% of the variability in jaw width.
The model utility test confirms the usefulness of this model.
Let’s use the data to compute a 90% confidence interval for the mean jaw width for 15-foot-long sharks.
The point estimate is
$$a + b(15) = .688 + .96345(15) = 15.140 \text{ in.}$$
The estimated standard deviation of a + b(15) is
$$s_{a+b(15)} = 1.376\sqrt{\frac{1}{44} + \frac{(15 - 15.586)^2}{279.8718}} = .213$$
The 90% confidence interval is
$$a + b(15) \pm (t \text{ critical value}) \cdot s_{a+b(15)} = 15.140 \pm (1.68)(.213) = (14.782, 15.498)$$
Based on these sample data, we can be 90% confident that the mean jaw width for sharks of length 15 feet is between 14.782 and 15.498 inches.
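The shark calculation can be reproduced from the summary statistics alone, as in this Python sketch. The raw data are not reproduced on the slides, so the values of n, $\bar{x}$, $S_{xx}$, $s_e$, and the t critical value 1.68 are taken directly from the worked example. The function name `mean_response_interval` is illustrative.

```python
import math


def mean_response_interval(a, b, s_e, n, x_bar, s_xx, x_star, t_crit):
    """Return (point estimate, s_{a+bx*}, (low, high)) for the mean y value at x = x_star."""
    estimate = a + b * x_star
    s_fit = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
    margin = t_crit * s_fit
    return estimate, s_fit, (estimate - margin, estimate + margin)


estimate, s_fit, (low, high) = mean_response_interval(
    a=0.688, b=0.96345, s_e=1.376,        # from the Minitab output
    n=44, x_bar=15.586, s_xx=279.8718,    # summary values used on the slide
    x_star=15, t_crit=1.68,               # 90% confidence, df = 42
)
print(round(estimate, 3), round(s_fit, 3), (round(low, 3), round(high, 3)))
```

Changing `x_star` to a value farther from `x_bar` makes `s_fit` larger and the interval wider.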
Prediction Interval for a Single y Value
When the basic assumptions of the simple linear regression model are met, a prediction interval for y*, a single y observation made when x = x*, has the form
$$a + bx^* \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+bx^*}^2}$$
where the t critical value is based on df = n - 2.
The prediction interval and the confidence interval are centered at exactly the same place, a + bx*.
The prediction interval is wider than the confidence interval due to the addition of $s_e^2$ under the square-root symbol.
Jaws Revisited . . .
Suppose that we were interested in predicting the jaw width of a single shark of length 15 feet.
$$a + b(15) = .688 + .96345(15) = 15.140$$
$$s_e^2 = 1.376^2 = 1.8934 \qquad s_{a+b(15)}^2 = .213^2 = .0454$$
The 90% prediction interval is
$$a + b(15) \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+b(15)}^2} = 15.140 \pm (1.68)\sqrt{1.9388} = (12.801, 17.479)$$
We can be 90% confident that an individual shark of length 15 feet will have a jaw width between 12.801 and 17.479 inches.
Notice that this interval is much wider than the confidence interval for the mean jaw width.
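A short Python sketch makes the comparison between the two intervals concrete. It reuses the rounded values from the shark example (point estimate 15.140, $s_e$ = 1.376, $s_{a+b(15)}$ = .213, t critical value 1.68). The function names are illustrative.

```python
import math


def confidence_interval(estimate, s_fit, t_crit):
    """Interval for the mean y value at x*."""
    margin = t_crit * s_fit
    return estimate - margin, estimate + margin


def prediction_interval(estimate, s_e, s_fit, t_crit):
    """Interval for a single y observation at x*; adds s_e^2 under the square root."""
    margin = t_crit * math.sqrt(s_e ** 2 + s_fit ** 2)
    return estimate - margin, estimate + margin


ci = confidence_interval(15.140, 0.213, 1.68)
pi = prediction_interval(15.140, 1.376, 0.213, 1.68)
print("CI:", tuple(round(v, 3) for v in ci), "width", round(ci[1] - ci[0], 3))
print("PI:", tuple(round(v, 3) for v in pi), "width", round(pi[1] - pi[0], 3))
```

Both intervals share the center 15.140. The prediction interval is wider because $s_e^2$ dominates the sum under the square root.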
Below is a Regression Plot from Minitab showing the confidence interval and the prediction interval for the shark data.
Notice that the prediction interval is substantially wider than the confidence interval.
Also notice that the confidence interval is very narrow close to $\bar{x}$, but widens the farther x is from the mean.
A Test for Independence in a Bivariate Normal Population
$\rho$ (the Greek letter “rho”) is the population correlation coefficient. It assesses the extent of any linear relationship in the population. $\rho$ must be between -1 and 1.
Many investigators are interested in whether ANY relationship exists between x and y. That is, are x and y independent of each other?
However, $\rho$ = 0 is NOT equivalent to x and y being independent except in the case of a bivariate normal population.
A bivariate normal population is one where for any fixed x value, the distribution of associated y values is normal, and for any fixed y value, the distribution of x values is normal. An example would be the height x and weight y of American adult males.
Null hypothesis: H0: $\rho$ = 0
Test statistic:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
The test is based on df = n - 2.
Alternative hypothesis and P-value:
• Ha: $\rho$ > 0 (positive dependence): area to the right of t
• Ha: $\rho$ < 0 (negative dependence): area to the left of t
• Ha: $\rho \neq$ 0 (dependence): 2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative
Assumptions:
r is the correlation coefficient for a random sample from a bivariate normal population.
One way to check that a bivariate normal population is plausible is to construct separate normal probability plots of the x values and of the y values.
The relationship between sleep duration and the level of the hormone leptin (a hormone related to energy intake and energy expenditure) in the blood was investigated. Average nightly sleep (x, in hours) and blood leptin level (y) were recorded for each person in a sample of 716 participants in the Wisconsin Sleep Cohort Study. The sample correlation coefficient was r = 0.11. Does this support the claim that short sleep duration is associated with reduced leptin? Use $\alpha$ = .01.
State the hypotheses.
H0: $\rho$ = 0
Ha: $\rho$ > 0
where $\rho$ = the correlation between average nightly sleep and blood leptin level for the population of adult Americans
To verify the assumptions, we would look at normal probability plots of the x values and of the y values. However, the data are not available, so we will assume the bivariate normal population is reasonable. We will also assume that it is reasonable to regard the sample of participants as representative of the population of adult Americans.
Test statistic:
$$t = \frac{.11}{\sqrt{\dfrac{1 - (.11)^2}{714}}} = 2.96$$
P-value = .0015, df = 714, $\alpha$ = .01
Sleepless Nights Continued . . .
H0: $\rho$ = 0
Ha: $\rho$ > 0
Test statistic: t = 2.96
P-value = .0015, df = 714, $\alpha$ = .01
Since the P-value < .01, we reject H0. There is evidence to suggest that there is a positive association (perhaps a weak one since r = .11) between sleep duration and blood leptin level.
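The test statistic for the sleep and leptin example can be checked with a few lines of Python. With df = 714 the t curve is very close to the standard normal curve, so this sketch approximates the upper-tail P-value with `statistics.NormalDist` rather than an exact t distribution. That substitution is an assumption of the sketch, not the method used on the slide. The function name `correlation_t` is illustrative.

```python
import math
from statistics import NormalDist


def correlation_t(r, n):
    """t statistic for H0: rho = 0, equal to r / sqrt((1 - r^2) / (n - 2))."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r, n, alpha = 0.11, 716, 0.01
t_stat = correlation_t(r, n)

# Upper-tail area for Ha: rho > 0, using the normal curve as a
# large-df approximation to the t curve with df = n - 2 = 714.
p_approx = 1 - NormalDist().cdf(t_stat)

print(round(t_stat, 2), round(p_approx, 4), p_approx < alpha)
```

Even a weak sample correlation such as r = .11 can give a small P-value when n is large, because the factor $\sqrt{n - 2}$ multiplies r in the test statistic.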
Note: the hypothesis of no linear relationship (H0: $\beta$ = 0) can also be used to test for independence in a bivariate normal population.