AP Statistics
Linear Regression Inference
Hypothesis Tests: Slopes
• Given: Observed slope relating Education to Job Prestige = 2.47
• Question: Can we generalize this to the population of all Americans?
– How likely is it that this observed slope was actually drawn from a population with slope = 0?
• Solution: Conduct a hypothesis test
• Notation: sample slope = b, population slope = β
• H0: Population slope β = 0
• H1: Population slope β ≠ 0 (two-tailed test)
Review: Slope Hypothesis Tests
• What information lets us do a hypothesis test?
• Answer: Estimates of a slope (b) have a sampling distribution, like any other statistic
– It is the distribution of every value of the slope, based on all possible samples (of size N)
• If certain assumptions are met, the sampling distribution approximates the t-distribution
– Thus, we can assess the probability that a given value of b would be observed, if β = 0
– If that probability is low (below alpha), we reject H0
Review: Slope Hypothesis Tests
• Visually: If the population slope (β) is zero, then the sampling distribution would center at zero
– Since the sampling distribution is a probability distribution, we can identify the likely values of b if the population slope is zero
[Figure: Sampling distribution of the slope b, centered at 0]
• If β = 0, observed slopes should commonly fall near zero, too
• If the observed slope falls very far from 0, it is improbable that β is really equal to zero. Thus, we can reject H0.
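The idea of a sampling distribution centered at zero can be sketched with a small simulation. This is only an illustration under made-up assumptions: the X range (0 to 20), the error standard deviation (10), the sample size (50), and the function names are hypothetical and do not come from the Education/Job Prestige data. Only the observed slope of 2.47 is taken from the slides.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Least-squares slope: sum((x - xbar) * (y - ybar)) / sum((x - xbar)**2)."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def simulate_null_slopes(n_samples=2000, n=50, seed=0):
    """Draw many samples in which Y is unrelated to X and record each slope."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        xs = [rng.uniform(0, 20) for _ in range(n)]  # hypothetical X values
        ys = [40 + rng.gauss(0, 10) for _ in xs]     # population slope is 0
        slopes.append(ols_slope(xs, ys))
    return slopes

null_slopes = simulate_null_slopes()
center = statistics.fmean(null_slopes)
spread = statistics.stdev(null_slopes)
# Share of simulated slopes at least as far from zero as the observed 2.47
share_extreme = sum(abs(b) >= 2.47 for b in null_slopes) / len(null_slopes)
print(f"center = {center:.3f}, spread = {spread:.3f}, "
      f"share beyond 2.47 = {share_extreme:.4f}")
```

Under these invented settings the simulated slopes cluster tightly around zero, so a slope as large as 2.47 would be very unusual if the population slope were really zero. With different scales for X and Y the spread would differ, which is why the actual test relies on the standard error rather than on the raw slope.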
Bivariate Regression Assumptions
• Assumptions for bivariate regression hypothesis tests:
• 1. Random sample
– Ideally N > 20
– But different rules of thumb exist (10, 30, etc.)
• 2. Variables are linearly related
– i.e., the mean of Y increases linearly with X
– Check scatter plot for general linear trend
– Watch out for non-linear relationships (e.g., U-shaped)
Bivariate Regression Assumptions
• 3. Y is normally distributed for every outcome of X in the population
– “Conditional normality”
• Ex: Years of Education = X, Job Prestige = Y
• Suppose we look only at a sub-sample: X = 12 years of education
– Is a histogram of Job Prestige approximately normal?
– What about for people with X = 4? X = 16?
• If all are roughly normal, the assumption is met
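A rough numeric companion to the histogram check is to summarize Y separately at each value of X. The sketch below uses invented (education, prestige) pairs and hypothetical helper names. Moment-based skewness is one possible symmetry indicator, not a formal normality test, and it needs some variation in Y within each group.

```python
import statistics
from collections import defaultdict

def skewness(values):
    """Moment-based skewness m3 / m2**1.5; close to 0 for symmetric data."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m3 = statistics.fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5

def summarize_by_x(pairs):
    """Group Y values by X and summarize each sub-sample."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return {
        x: {
            "n": len(ys),
            "mean": statistics.fmean(ys),
            "sd": statistics.stdev(ys),
            "skew": skewness(ys),
        }
        for x, ys in sorted(groups.items())
    }

# Hypothetical data: prestige scores at 12 and 16 years of education
pairs = [(12, y) for y in (38, 42, 45, 45, 48, 52)]
pairs += [(16, y) for y in (50, 52, 53, 55, 60, 78)]

summary = summarize_by_x(pairs)
for x, stats in summary.items():
    print(f"X = {x}: n = {stats['n']}, mean = {stats['mean']:.1f}, "
          f"sd = {stats['sd']:.2f}, skew = {stats['skew']:.2f}")
```

In this made-up data the X = 12 group is symmetric, while the X = 16 group has one high value that pulls its skewness well above zero, the kind of sub-sample whose histogram would deserve a closer look.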
Bivariate Regression Assumptions
• Normality: Examine sub-samples at different values of X. Make histograms and check for normality.
[Figure: scatter plot of HAPPY (0 to 10) against INCOME (0 to 100,000)]
[Figure: histogram of HAPPY for one sub-sample; Mean = 3.84, Std. Dev = 1.51, N = 60. Label: Good]
[Figure: histogram of HAPPY for another sub-sample; Mean = 4.58, Std. Dev = 3.06, N = 60. Label: Not very good]
Bivariate Regression Assumptions
• 4. The variances of prediction errors are identical at different values of X
– Recall: Error is the deviation from the regression line
– Is dispersion of error consistent across values of X?
– Definition: “homoskedasticity” = error dispersion is consistent across values of X
– Opposite: “heteroskedasticity” = error dispersion varies with X
• Test: Compare errors for X=12 years of education with errors for X=2, X=8, etc.
– Are the errors around the line similar? Or different?
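The comparison of errors at different X values can be sketched as follows. The data are constructed for illustration (Y = 10 + 2X plus symmetric offsets), and the function names are hypothetical. The ratio of largest to smallest residual variance is only a rough screen, not a formal test of homoskedasticity.

```python
import statistics
from collections import defaultdict

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b

def residual_variance_by_x(xs, ys):
    """Variance of the errors around the fitted line, separately per X value."""
    a, b = fit_line(xs, ys)
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y - (a + b * x))
    return {x: statistics.variance(errs) for x, errs in sorted(groups.items())}

def variance_ratio(xs, ys):
    """Largest group variance divided by the smallest (1 means equal spread)."""
    variances = residual_variance_by_x(xs, ys)
    return max(variances.values()) / min(variances.values())

def make_data(scale_by_x):
    """Hypothetical data: Y = 10 + 2X plus offsets scaled differently per X."""
    xs, ys = [], []
    for x, scale in scale_by_x.items():
        for offset in (-3, -1, 1, 3):
            xs.append(x)
            ys.append(10 + 2 * x + scale * offset)
    return xs, ys

equal_ratio = variance_ratio(*make_data({2: 1, 8: 1, 12: 1}))
unequal_ratio = variance_ratio(*make_data({2: 1, 8: 2, 12: 4}))
print(f"equal spread ratio = {equal_ratio:.2f}")
print(f"unequal spread ratio = {unequal_ratio:.2f}")
```

A ratio near 1 matches the homoskedastic picture. A ratio far above 1, as in the second data set where the offsets grow with X, matches the heteroskedastic one.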
Bivariate Regression Assumptions
• Homoskedasticity: Equal Error Variance
[Figure: scatter plot of HAPPY (0 to 10) against INCOME (0 to 100,000)]
– Examine error at different values of X. Is it roughly equal?
– Here, things look pretty good.
Bivariate Regression Assumptions
• Heteroskedasticity: Unequal Error Variance
[Figure: scatter plot of HAPPY (0 to 10) against INCOME (0 to 100,000)]
– At higher values of X, error variance increases a lot.
– This looks pretty bad.
Bivariate Regression Assumptions
• Notes/Comments:
• 1. Overall, regression is robust to violations of assumptions
– It often gives fairly reasonable results, even when assumptions aren’t perfectly met
• 2. Variations of regression can handle situations where assumptions aren’t met
• 3. But, there are also further diagnostics to help ensure that results are meaningful…
Regression Hypothesis Tests
• If assumptions are met, the sampling distribution of the slope (b) approximates a t-distribution
• Standard deviation of the sampling distribution is called the standard error of the slope ($\sigma_b$)
• Population formula of standard error:
$$\sigma_b = \sqrt{\frac{\sigma_e^2}{\sum_{i=1}^{N} (X_i - \bar{X})^2}}$$
• Where $\sigma_e^2$ is the variance of the regression error
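Regression Hypothesis Tests
• Estimating $\sigma_e^2$ lets us estimate the standard error:
$$\hat{\sigma}_e^2 = \frac{\sum_{i=1}^{N} e_i^2}{N-2} = \frac{SS_{ERROR}}{N-2} = MS_{ERROR}$$
• Now we can estimate the S.E. of the slope:
$$\hat{\sigma}_b = \sqrt{\frac{MS_{ERROR}}{\sum_{i=1}^{N} (X_i - \bar{X})^2}}$$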
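As a worked illustration of these two estimates, the sketch below computes the error mean square and the estimated standard error of the slope for a tiny invented data set. The five (X, Y) points and the function name are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
import math
import statistics

def slope_standard_error(xs, ys):
    """Return (b, MS_error, estimated S.E. of b) for a bivariate regression."""
    n = len(xs)
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)           # sum of (X_i - Xbar)^2
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar                             # intercept
    ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ms_error = ss_error / (n - 2)                     # estimate of error variance
    return b, ms_error, math.sqrt(ms_error / sxx)

# Hypothetical data, not from the slides
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b, ms_error, s_b = slope_standard_error(xs, ys)
print(f"b = {b:.2f}, MS_error = {ms_error:.2f}, S.E. of slope = {s_b:.4f}")
```

For these points the sum of squared errors is 2.4, so the error mean square is 2.4 / 3 = 0.8. The sum of squared X deviations is 10, giving a standard error of sqrt(0.8 / 10), about 0.28.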
Regression Hypothesis Tests
• Finally: A t-value can be calculated:
– It is the slope divided by the standard error
$$t_{N-2} = \frac{b_{YX}}{s_b} = \frac{b_{YX}}{\sqrt{\dfrac{MS_{ERROR}}{s_X^2 (N-1)}}}$$
• Where $s_b$ is the sample point estimate of the standard error
• The t-value is based on N-2 degrees of freedom
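The t-value calculation can be sketched end to end, using the form of the standard error written with the sample variance of X, since s_X²(N-1) equals the sum of squared X deviations. The data and function name are hypothetical. The critical value 3.182 is the standard two-tailed 5% t-table value for 3 degrees of freedom, passed in by hand because the Python standard library has no t-distribution.

```python
import math
import statistics

def slope_t_test(xs, ys, critical_t):
    """Return (t-value, degrees of freedom, reject H0?) for H0: slope = 0."""
    n = len(xs)
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    s_x_sq = statistics.variance(xs)          # sample variance of X
    sxx = s_x_sq * (n - 1)                    # equals sum of (X_i - Xbar)^2
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    ms_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    s_b = math.sqrt(ms_error / (s_x_sq * (n - 1)))
    t_value = b / s_b
    return t_value, n - 2, abs(t_value) > critical_t

# Hypothetical data, not from the slides
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

t_value, df, reject_h0 = slope_t_test(xs, ys, critical_t=3.182)
print(f"t = {t_value:.3f} on {df} d.f.; reject H0: {reject_h0}")
```

With only five points the t-value of about 2.12 falls short of the critical value, so H0 is not rejected in this toy example, even though the estimated slope (0.6) is well above zero.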
Regression Confidence Intervals
• You can also use the standard error of the slope to estimate confidence intervals:
$$\text{C.I.} = b \pm s_b \, (t_{N-2})$$
• Where $t_{N-2}$ is the t-value for a two-tailed test given a desired α-level
• Example: Observed slope = 2.5, S.E. = .10
• 95% t-value for 102 d.f. is approximately 2
• 95% C.I. = 2.5 +/- 2(.10)
• Confidence Interval: 2.3 to 2.7
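The example's arithmetic can be reproduced in a few lines. The function name is hypothetical, and the t-value of 2 is the slide's approximation, not an exact table value.

```python
def slope_confidence_interval(b, s_b, t_crit):
    """Confidence interval for the slope: b plus or minus t_crit * s_b."""
    margin = t_crit * s_b
    return b - margin, b + margin

# Values from the example: slope 2.5, S.E. 0.10, t of approximately 2
low, high = slope_confidence_interval(b=2.5, s_b=0.10, t_crit=2.0)
contains_zero = low <= 0 <= high
print(f"95% C.I.: {low:.1f} to {high:.1f}; contains zero: {contains_zero}")
```

Because the interval from 2.3 to 2.7 does not contain zero, it agrees with what a two-tailed test at the same alpha level would conclude: a population slope of zero is not plausible given this sample.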