Stor 155, Section 2, Last Time
• 2-way Tables
– Sliced populations in 2 different ways
– Look for independence of factors
– Chi Square Hypothesis test
• Simpson’s Paradox
– Aggregating can give opposite impression
• Inference for Regression
– Sampling Distributions – TDIST & TINV
Reading In Textbook
Approximate Reading for Today’s Material:
Pages 634-667 & Review
Approximate Reading for Next Class:
Pages 634-667 & Review
Inference for Regression
Chapter 10
Recall:
• Scatterplots
• Fitting Lines to Data
Now study statistical inference associated with fit lines
E.g. When is slope statistically significant?
Recall Scatterplot
For data (x,y)
View by plot:
(1,2)
(3,1)
(-1,0)
(2,-1)
[Figure: Toy Scatterplot, Separate Points; y plotted against x]
Recall Linear Regression
Idea:
Fit a line to data in a scatterplot
• To learn about “basic structure”
• To “model data”
• To provide “prediction of new values”
Recall Linear Regression
Given a line, $y = bx + a$, “indexed” by $b$ & $a$
Define “residuals” = “data Y” – “Y on line”
= $y_i - (b x_i + a)$
Now choose $b$ & $a$ to make these “small”
[Figure: data points $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$ with the line]
Recall Linear Regression
Make residuals ≥ 0 by squaring
Least Squares: adjust $b$ & $a$ to
Minimize the “Sum of Squared Errors”
$SSE = \sum_{i=1}^{n} (y_i - (b x_i + a))^2$
Least Squares in Excel
Computation:
1. INTERCEPT (computes y-intercept a)
2. SLOPE (computes slope b)
Revisit Class Example 14
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg14.xls
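For readers working outside Excel, the following is a minimal Python sketch of what INTERCEPT and SLOPE compute, applied to the four toy scatterplot points from the Recall slides. The function name `least_squares_line` is a hypothetical helper introduced here for illustration, not part of the course materials.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared x deviations and sum of cross products about the means
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Toy scatterplot data: (1,2), (3,1), (-1,0), (2,-1)
toy_x = [1, 3, -1, 2]
toy_y = [2, 1, 0, -1]
slope, intercept = least_squares_line(toy_x, toy_y)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

For these four points the slope comes out near 0.057 and the intercept near 0.429, so the fitted line is almost flat.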
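Inference for Regression
Idea: do statistical inference on:
– Slope a
– Intercept b
(Note the switch: a is now the slope and b the intercept, the reverse of the Recall slides)
Model: $Y_i = a X_i + b + e_i$
Assume: $e_i$ are random, independent
and $e_i \sim N(0, \sigma_e)$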
Inference for Regression
Viewpoint: Data generated as:
[Figure: line $y = ax + b$; at each $X_i$, $Y_i$ is chosen from a normal distribution centered on the line]
Note: a and b are “parameters”
Inference for Regression
Parameters $a$ and $b$ determine the
underlying model (distribution)
Estimate with the Least Squares Estimates:
$\hat{a}$ and $\hat{b}$
(Using SLOPE and INTERCEPT in Excel,
based on data)
Inference for Regression
Distributions of $\hat{a}$ and $\hat{b}$?
Under the above assumptions, the sampling
distributions are:
$\hat{a} \sim N(a, \sigma_a)$
$\hat{b} \sim N(b, \sigma_b)$
• Centerpoints are right (unbiased)
• Spreads are more complicated
Inference for Regression
Formula for SD of $\hat{a}$:
$SD(\hat{a}) = \sigma_a = \frac{\sigma_e}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$
• Big (small) for $\sigma_e$ big (small, resp.)
– Accurate data → Accurate est. of slope
• Small for x’s more spread out
– Data more spread → More accurate
• Small for more data
– More data → More accuracy
Inference for Regression
Formula for SD of $\hat{b}$:
$SD(\hat{b}) = \sigma_b = \sigma_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$
• Big (small) for $\sigma_e$ big (small, resp.)
– Accurate data → Accurate est. of intercept
• Smaller for $\bar{x}$ near 0
– Centered data → More accurate intercept
• Smaller for more data
– More data → More accuracy
Inference for Regression
One more detail:
Need to estimate $\sigma_e$ using data
For this use:
$s_e = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{a} x_i - \hat{b})^2}{n - 2}}$
• Similar to earlier sd estimate, $s$
• Except variation is about fit line
• $n - 2$ is similar to $n - 1$ from before
Inference for Regression
Now for Probability Distributions,
Since we are estimating $\sigma_e$ by $s_e$
Use TDIST and TINV
With degrees of freedom = $n - 2$
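To make the plug-in step concrete, here is a minimal Python sketch that computes $s_e$ and then substitutes it for $\sigma_e$ in the two SD formulas, using the four toy scatterplot points from the Recall slides. The function name `regression_standard_errors` is hypothetical, and the code follows this part's convention that a is the slope and b the intercept.

```python
import math


def regression_standard_errors(xs, ys):
    """Return (s_e, se_slope, se_intercept, df) for the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # s_e: variation of the data about the fit line, with n - 2 in the denominator
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    df = n - 2
    s_e = math.sqrt(sse / df)
    # Plug s_e in for sigma_e in the two SD formulas
    se_slope = s_e / math.sqrt(sxx)
    se_intercept = s_e * math.sqrt(1 / n + x_bar ** 2 / sxx)
    return s_e, se_slope, se_intercept, df


s_e, se_slope, se_intercept, df = regression_standard_errors([1, 3, -1, 2], [2, 1, 0, -1])
print(f"s_e = {s_e:.4f}, SE(slope) = {se_slope:.4f}, SE(intercept) = {se_intercept:.4f}, df = {df}")
```

With only four points there are just 2 degrees of freedom, so the standard errors are large relative to the estimates.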
Inference for Regression
Convenient Packaged Analysis in Excel:
Tools → Data Analysis → Regression
Illustrate application using:
Class Example 32,
Old Text Problem 10.12
Inference for Regression
Class Example 32,
Old Text Problem 10.12
Utility companies estimate energy used by their customers. Natural gas consumption depends on temperature. A study recorded average daily gas consumption y (100s of cubic feet) for each month. The explanatory variable x is the average number of heating degree days for that month.
Inference for Regression
Data for October through June are:
Month X = Deg. Days Y = Gas Cons’n
Oct 15.6 5.2
Nov 26.8 6.1
Dec 37.8 8.7
Jan 36.4 8.5
Feb 35.5 8.8
Mar 18.6 4.9
Apr 15.3 4.5
May 7.9 2.5
Jun 0 1.1
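As a cross-check on the spreadsheet, the basic fit for these nine months can be sketched in Python. The variable names are illustrative only; the quantities mirror Excel's SLOPE and INTERCEPT, plus the proportion of the sum of squares explained by the line.

```python
# Heating degree days (x) and gas consumption (y), October through June
degree_days = [15.6, 26.8, 37.8, 36.4, 35.5, 18.6, 15.3, 7.9, 0.0]
gas_used = [5.2, 6.1, 8.7, 8.5, 8.8, 4.9, 4.5, 2.5, 1.1]

n = len(degree_days)
x_bar = sum(degree_days) / n
y_bar = sum(gas_used) / n
sxx = sum((x - x_bar) ** 2 for x in degree_days)
syy = sum((y - y_bar) ** 2 for y in gas_used)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(degree_days, gas_used))

slope = sxy / sxx                    # Excel: SLOPE
intercept = y_bar - slope * x_bar    # Excel: INTERCEPT
r_squared = sxy ** 2 / (sxx * syy)   # proportion of sum of squares explained by the line

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}, r^2 = {r_squared:.4f}")
```

The slope is about 0.20 (hundreds of cubic feet per heating degree day), and the line explains roughly 98% of the sum of squares.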
Inference for Regression
Class Example 32,
Old Text Problem 10.12
Excel Analysis:
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg32.xls
Good News:
Lots of things done automatically
Bad News:
Different language,
so need careful interpretation
Inference for Regression
Excel Glossary:
Excel            Stor 155
R2               $r^2$ = Prop’n of Sum of Squares Explained by Line
intercept        Intercept b
X Variable       Slope a
Coefficient      Estimates $\hat{a}$ & $\hat{b}$
Inference for Regression
Excel Glossary:
Excel            Stor 155
Standard Errors  Estimates of $\sigma_a$ & $\sigma_b$ (recall from Sampling Dist’ns)
T – Stat.        (Est. – mean) / SE, i.e. put on scale of T – distribution
P-value          For 2-sided test of: $H_0: a = 0$ vs. $H_A: a \neq 0$ (likewise for $b$)
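The T – Stat. and P-value columns can be reproduced approximately by hand. The sketch below does this for the slope in the gas consumption data of Class Example 32. It uses only the Python standard library, so the t tail area is obtained by numerically integrating the t density with Simpson's rule rather than by calling a statistics package. The helper names `t_density` and `two_sided_p_value` are hypothetical; the result plays the role of Excel's two-tailed TDIST.

```python
import math


def t_density(t, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=20000):
    """P(|T| >= |t_stat|), integrating the density with Simpson's rule (steps must be even)."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    central_area = 2 * total * h / 3  # P(-|t_stat| <= T <= |t_stat|)
    return max(0.0, 1.0 - central_area)


# Class Example 32 data: heating degree days (x) and gas consumption (y)
xs = [15.6, 26.8, 37.8, 36.4, 35.5, 18.6, 15.3, 7.9, 0.0]
ys = [5.2, 6.1, 8.7, 8.5, 8.8, 4.9, 4.5, 2.5, 1.1]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
intercept = y_bar - slope * x_bar
s_e = math.sqrt(sum((y - slope * x - intercept) ** 2 for x, y in zip(xs, ys)) / (n - 2))
se_slope = s_e / math.sqrt(sxx)

# T - Stat: (estimate - hypothesized value 0) / standard error
t_slope = (slope - 0.0) / se_slope
p_slope = two_sided_p_value(t_slope, n - 2)
print(f"t = {t_slope:.2f}, two-sided P-value = {p_slope:.2e}")
```

The t statistic for the slope is far out in the tail of the t distribution with 7 degrees of freedom, so the P-value is tiny and the slope is strongly statistically significant.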
Inference for Regression
Excel Glossary:
Excel            Stor 155
Lower 95%        Ends of 95% Confidence Interval for a and b
Upper 95%        (since chose 0.95 for Confidence level)
Predicted $Y_i$  Points on line at $X_i$, i.e. $\hat{a} X_i + \hat{b}$
Inference for Regression
Excel Glossary:
Excel                Stor 155
Residual             $Y_i - \hat{a} X_i - \hat{b}$ for $X_i$.
                     Recall: gave useful information about quality of fit (useful to plot)
Standard Residuals   $Y_i - \hat{a} X_i - \hat{b}$ on standardized scale
Inference for Regression
Some useful variations:
Class Example 33,
Text Problems 10.23 - 10.25
Excel Analysis:
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg33.xls
Inference for Regression
Class Example 33, (10.23 – 10.25)
Engineers made measurements of the Leaning Tower of Pisa over the years 1975 – 1987. “Lean” is the difference between a point’s position if the tower were straight and its actual position, in tenths of a millimeter, in excess of 2.9 meters.
Inference for Regression
Class Example 33, (10.23 – 10.25)
The data are:
Year Lean
75 642
76 644
77 656
78 667
79 673
80 688
81 696
82 698
83 713
84 717
85 725
86 742
87 757
Inference for Regression
Class Example 33, (10.23 – 10.25):
(a) Plot the data. Does the trend in lean over time appear to be linear?
(b) What is the equation of the least squares fit line?
(c) Give a 95% confidence interval for the average rate of change of the lean.
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg33.xls
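Parts (b) and (c) can also be checked outside Excel. The Python sketch below fits the line to the Pisa data and forms the 95% confidence interval for the slope as estimate ± t × SE. The critical value 2.201 is hard-coded as an assumption taken from a t table (97.5th percentile, 11 degrees of freedom, which is what TINV(0.05, 11) returns in Excel). Variable names are illustrative.

```python
import math

years = list(range(75, 88))  # 75, 76, ..., 87
lean = [642, 644, 656, 667, 673, 688, 696, 698, 713, 717, 725, 742, 757]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(lean) / n
sxx = sum((x - x_bar) ** 2 for x in years)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, lean))

# (b) least squares line: lean = slope * year + intercept
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# (c) 95% confidence interval for the slope, with n - 2 = 11 degrees of freedom
s_e = math.sqrt(sum((y - slope * x - intercept) ** 2 for x, y in zip(years, lean)) / (n - 2))
se_slope = s_e / math.sqrt(sxx)
t_crit = 2.201  # assumed table value: t, 11 df, two-sided 95%
ci_low = slope - t_crit * se_slope
ci_high = slope + t_crit * se_slope

print(f"lean = {slope:.4f} * year + ({intercept:.2f})")
print(f"95% CI for slope: [{ci_low:.2f}, {ci_high:.2f}]")
```

The slope is the average rate of change of the lean, in the coded units per year, so the interval in (c) is an interval for that rate.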
Inference for Regression
HW:
10.17 b,c
10.26 (using log base 10, for part c:
Est’d slope: 0.194
Est'd intercept: -379
95% CI for slope: [0.186, 0.202])
And Now for Something Completely Different
Graphical Displays:
• Important Topic in Statistics
• Has large impact
• Need to think carefully to do this
• Watch for attempts to fool you
And Now for Something Completely Different
Graphical Displays: Interesting Article:
“How to Display Data Badly”
Howard Wainer
The American Statistician, 38, 137-147.
Internet Available:
http://links.jstor.org
And Now for Something Completely Different
Main Idea:
• Point out 12 types of bad displays
• With reasons behind
• Here are some favorites…
And Now for Something Completely Different
Hiding the data in the scale
And Now for Something Completely Different
The eye perceives areas as “size”:
And Now for Something Completely Different
Change of Scales in Mid-Axis
Really trust the Post???
Review Slippery Issues
Major Confusion:
Population Quantities
Vs.
Sample Quantities
Review Slippery Issues
Population Quantities:
• Parameters
• Will never know
• But can think about
Sample Quantities:
• Estimates (of parameters)
• Numbers we work with
• Contain info about parameters
Review Slippery Issues
Population Mathematical Notation:
$\mu, \sigma, p$ (fixed & unknown)
Sample Mathematical Notation:
$\bar{X}, s, \hat{p}$ (summaries of data, have numbers)
Review Slippery Issues
Sampling Distributions:
Measurement Error: $\bar{X} \sim N(\mu, \sigma / \sqrt{n})$
Counting / Proportions: $\hat{p} \sim \mathrm{Bi}(n, p) / n \approx N\left(p, \sqrt{p(1-p)/n}\right)$
Review Slippery Issues
Confidence Intervals: Based on margin of error $m$:
Measurement Error: $[\bar{X} - m, \bar{X} + m]$ brackets $\mu$ 95% of time
Counting / Proportions: $[\hat{p} - m, \hat{p} + m]$ brackets $p$ 95% of time
Review Slippery Issues
Hypothesis Testing:
Statement of Hypotheses: $H_0$, $H_A$
Actual Test:
P-value = P{What saw or m.c. | Bdry}
Hypothesis Testing from 3/22
Other views of hypothesis testing:
View 2: Z-scores
Idea: instead of reporting p-value (to assess statistical significance)
Report the Z-score
A different way of measuring significance
Hypothesis Testing – Z scores
E.g. Fast Food Menus:
Test $H_0: \mu \le \$20,000$ vs. $H_A: \mu > \$20,000$
Using $\bar{X} = \$21,000$, $s = \$2,400$, $n = 10$
P-value = P{what saw or m.c.| H0 & HA bd’ry}
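Hypothesis Testing – Z scores
P-value = P{what saw or m.c.| H0 & HA bd’ry}
$= P\{\bar{X} \ge \$21,000 \mid \text{bd'ry}\}$
$= P\{\bar{X} \ge \$21,000 \mid \mu = \$20,000\}$
$= P\left\{\frac{\bar{X} - \mu}{s / \sqrt{n}} \ge \frac{\$21,000 - \$20,000}{\$2400 / \sqrt{10}}\right\}$
$= P\{Z \ge 1.317\}$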
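The arithmetic behind the Z-score can be checked with a few lines of Python. This is a minimal sketch that assumes the upper-tail reading of the test (observed mean above the boundary value) and uses the standard normal tail area computed from `math.erfc`. The variable names are illustrative.

```python
import math

x_bar = 21000.0  # observed sample mean
mu_0 = 20000.0   # boundary value between H0 and HA
s = 2400.0       # sample standard deviation
n = 10           # sample size

# Z-score: distance of the sample mean from the boundary, in standard error units
z = (x_bar - mu_0) / (s / math.sqrt(n))

# Upper-tail standard normal probability P{Z >= z}
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(f"z = {z:.4f}, P-value = {p_value:.4f}")
```

The computed Z-score is about 1.3176, which the slide reports as 1.317, and the corresponding upper-tail probability is a little over 0.09.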