Stat 31, Section 1, Last Time 2-way tables –Testing for Independence –Chi-Square distance...

Stat 31, Section 1, Last Time 2-way tables Testing for Independence Chi-Square distance between data & model Chi-Square Distribution Gives P-values (CHIDIST) Simpson’s Paradox: Lurking variables can reverse comparisons Recall Linear Regression Fit a line to a scatterplot


Page 1: Stat 31, Section 1, Last Time 2-way tables –Testing for Independence –Chi-Square distance between data & model –Chi-Square Distribution –Gives P-values.

Stat 31, Section 1, Last Time

• 2-way tables
– Testing for Independence
– Chi-Square distance between data & model
– Chi-Square Distribution
– Gives P-values (CHIDIST)

• Simpson’s Paradox:
– Lurking variables can reverse comparisons

• Recall Linear Regression
– Fit a line to a scatterplot


Recall Linear Regression

Idea:

Fit a line to data in a scatterplot

Recall Class Example 14: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg14.xls

• To learn about “basic structure”

• To “model data”

• To provide “prediction of new values”


Inference for Regression

Goal: develop

• Hypothesis Tests and Confidence Int’s

• For slope & intercept parameters, a & b

• Also study prediction


Inference for Regression

Idea: do statistical inference on:

– Slope a

– Intercept b

Model: $Y_i = aX_i + b + e_i$

Assume: the $e_i$ are random, independent, and $e_i \sim N(0, \sigma_e)$


Inference for Regression

Viewpoint: Data generated as:

[Figure: the line y = ax + b, with each Yi chosen from a normal distribution centered on the line above Xi]

Note: a and b are “parameters”


Inference for Regression

Parameters $a$ and $b$ determine the underlying model (distribution)

Estimate with the Least Squares Estimates: $\hat{a}$ and $\hat{b}$

(Using SLOPE and INTERCEPT in Excel, based on data)
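As a rough sketch of what SLOPE and INTERCEPT compute, the Python below applies the standard closed-form least squares formulas. The function name `least_squares_fit` is made up for illustration and is not part of the course materials; the data are the degree-day and gas consumption values from Class Example 27 later in these notes.

```python
def least_squares_fit(xs, ys):
    """Return (a_hat, b_hat) for the least squares line y = a_hat * x + b_hat."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared x deviations, and sum of cross products of deviations.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a_hat = sxy / sxx               # slope, as Excel's SLOPE would report
    b_hat = y_bar - a_hat * x_bar   # intercept, as Excel's INTERCEPT would report
    return a_hat, b_hat


# Class Example 27 data: X = heating degree days, Y = gas consumption.
degree_days = [15.6, 26.8, 37.8, 36.4, 35.5, 18.6, 15.3, 7.9, 0.0]
gas_use = [5.2, 6.1, 8.7, 8.5, 8.8, 4.9, 4.5, 2.5, 1.1]

a_hat, b_hat = least_squares_fit(degree_days, gas_use)
print(f"slope a_hat = {a_hat:.4f}, intercept b_hat = {b_hat:.4f}")
```

Note that Excel's SLOPE and INTERCEPT take the Y range first and the X range second, while this sketch takes x first.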


Inference for Regression

Distributions of $\hat{a}$ and $\hat{b}$?

Under the above assumptions, the sampling distributions are:

$$\hat{a} \sim N(a, \sigma_a)$$

$$\hat{b} \sim N(b, \sigma_b)$$

• Centerpoints are right (unbiased)

• Spreads are more complicated
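One way to see the "unbiased" claim is by simulation: generate many data sets from the model with known parameters, fit a line to each, and look at the average of the estimates. The sketch below does this with the Python standard library. The true values a = 2, b = 1, sigma_e = 1, the x values 0, ..., 9, and the function names are all arbitrary choices for illustration, not values from the course.

```python
import random
import statistics


def least_squares_fit(xs, ys):
    """Return (a_hat, b_hat) for the least squares line y = a_hat * x + b_hat."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a_hat = sxy / sxx
    return a_hat, y_bar - a_hat * x_bar


def simulate_estimates(a, b, sigma_e, xs, n_reps, seed=0):
    """Draw n_reps data sets from Y_i = a*X_i + b + e_i and refit each one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    a_hats, b_hats = [], []
    for _ in range(n_reps):
        ys = [a * x + b + rng.gauss(0.0, sigma_e) for x in xs]
        a_hat, b_hat = least_squares_fit(xs, ys)
        a_hats.append(a_hat)
        b_hats.append(b_hat)
    return a_hats, b_hats


xs = [float(i) for i in range(10)]  # hypothetical design points
a_hats, b_hats = simulate_estimates(a=2.0, b=1.0, sigma_e=1.0, xs=xs, n_reps=2000)
print("mean of a_hat:", statistics.mean(a_hats))  # should be near the true a
print("mean of b_hat:", statistics.mean(b_hats))  # should be near the true b
print("SD of a_hat:  ", statistics.stdev(a_hats))
print("SD of b_hat:  ", statistics.stdev(b_hats))
```

The averages should land close to the true a and b, while the two spreads differ from each other, which is the "more complicated" part.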


Inference for Regression

Formula for SD of $\hat{a}$:

$$\sigma_a = SD(\hat{a}) = \frac{\sigma_e}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

• Big (small) for $\sigma_e$ big (small, resp.)
– Accurate data ⇒ Accurate est. of slope

• Small for x’s more spread out
– Data more spread ⇒ More accurate

• Small for more data
– More data ⇒ More accuracy


Inference for Regression

Formula for SD of $\hat{b}$:

$$\sigma_b = SD(\hat{b}) = \sigma_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

• Big (small) for $\sigma_e$ big (small, resp.)
– Accurate data ⇒ Accurate est. of intercept

• Smaller for $\bar{x} \approx 0$
– Centered data ⇒ More accurate intercept

• Smaller for more data
– More data ⇒ More accuracy


Inference for Regression

One more detail:

Need to estimate $\sigma_e$ using data

For this use:

$$s_e = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{a}x_i - \hat{b})^2}{n - 2}}$$

• Similar to earlier sd estimate, $s$

• Except variation is about fit line

• $n - 2$ is similar to $n - 1$ from before
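The estimate of the error SD can be sketched in a few lines of Python: fit the line, sum the squared residuals about it, and divide by n - 2 rather than n - 1. The function name `residual_sd` is hypothetical, and the data are the Class Example 27 values from later in these notes.

```python
import math


def residual_sd(xs, ys):
    """Return s_e: the SD of the data about the least squares line (divisor n - 2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a_hat = sxy / sxx
    b_hat = y_bar - a_hat * x_bar
    # Squared vertical distances from each point to the fit line.
    sse = sum((y - a_hat * x - b_hat) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))


# Class Example 27 data: X = heating degree days, Y = gas consumption.
degree_days = [15.6, 26.8, 37.8, 36.4, 35.5, 18.6, 15.3, 7.9, 0.0]
gas_use = [5.2, 6.1, 8.7, 8.5, 8.8, 4.9, 4.5, 2.5, 1.1]

s_e = residual_sd(degree_days, gas_use)
print(f"s_e = {s_e:.4f}")
```

The divisor is n - 2 because two parameters (slope and intercept) were estimated before measuring the spread, just as n - 1 reflects the one estimated mean in the ordinary sample SD.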


Inference for Regression

Now for Probability Distributions,

Since we are estimating $\sigma_e$ by $s_e$

Use TDIST and TINV

With degrees of freedom = $n - 2$
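For readers working outside Excel, here is a minimal standard-library sketch of what the two-tailed forms TDIST(t, df, 2) and TINV(prob, df) return. It integrates the t density numerically (Simpson's rule) and inverts by bisection. The function names `t_pdf`, `t_two_tailed`, and `t_inv` are invented for this illustration, and the numerical method is an assumption of the sketch, not how Excel computes these values.

```python
import math


def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def t_two_tailed(t, df, steps=2000):
    """P(|T| > t), the two-tailed area, comparable to TDIST(t, df, 2)."""
    t = abs(t)
    if t == 0:
        return 1.0
    # Simpson's rule for the area under the density between 0 and t.
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return 1.0 - 2.0 * area


def t_inv(prob, df):
    """Cutoff t* with P(|T| > t*) = prob, comparable to TINV(prob, df)."""
    lo, hi = 0.0, 100.0
    for _ in range(60):  # bisection: the two-tailed area decreases as t grows
        mid = (lo + hi) / 2
        if t_two_tailed(mid, df) > prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


n = 9  # e.g. nine data points, so df = n - 2 = 7
print(f"t* for 95% confidence, df = {n - 2}: {t_inv(1 - 0.95, n - 2):.4f}")
```

This is only meant for small to moderate degrees of freedom, where `math.gamma` does not overflow.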


Inference for Regression

Convenient Packaged Analysis in Excel:

Tools → Data Analysis → Regression

Illustrate application using:

Class Example 27,

Old Text Problem 8.6 (now 10.12)


Inference for Regression

Class Example 27, Old Text Problem 8.6 (now 10.12)

Utility companies estimate energy used by their customers. Natural gas consumption depends on temperature. A study recorded average daily gas consumption y (100s of cubic feet) for each month. The explanatory variable x is the average number of heating degree days for that month. Data for October through June are:


Inference for Regression

Data for October through June are:

Month   X = Deg. Days   Y = Gas Cons’n
Oct     15.6            5.2
Nov     26.8            6.1
Dec     37.8            8.7
Jan     36.4            8.5
Feb     35.5            8.8
Mar     18.6            4.9
Apr     15.3            4.5
May     7.9             2.5
Jun     0               1.1


Inference for Regression

Class Example 27, Old Text Problem 8.6 (now 10.12)

Excel Analysis: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg27.xls

Good News:

Lots of things done automatically

Bad News:

Different language,

so need careful interpretation


Inference for Regression

Excel Glossary:

Excel          Stat 31
R2             $r^2$ = Prop’n of Sum of Squares Explained by Line
intercept      Intercept b
X Variable     Slope a
Coefficient    Estimates $\hat{a}$ & $\hat{b}$


Inference for Regression

Excel Glossary:

Excel             Stat 31
Standard Errors   Estimates of $\sigma_a$ & $\sigma_b$ (recall from Sampling Dist’ns)
T – Stat.         (Est. – mean) / SE, i.e. put on scale of T – distribution
P-value           For 2-sided test of: $H_0: a = 0$ (or $b = 0$) vs. $H_A: a \neq 0$ (or $b \neq 0$)


Inference for Regression

Excel Glossary:

Excel                  Stat 31
Lower 95%, Upper 95%   Ends of 95% Confidence Interval for a and b (since chose 0.95 for Confidence level)
Predicted Y            Points on line at $X_i$, i.e. $\hat{Y}_i = \hat{a}X_i + \hat{b}$


Inference for Regression

Excel Glossary:

Excel                 Stat 31
Residual              $Y_i - \hat{a}X_i - \hat{b}$ for $X_i$. Recall: gave useful information about quality of fit
Standard Residuals    $(Y_i - \hat{a}X_i - \hat{b}) / \sigma_e$, i.e. residuals on standardized scale


Inference for Regression

Some useful variations:

Class Example 28, Old Text Problems 10.8 – 10.10 (now 10.13 – 10.15)

Excel Analysis: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg28.xls


Inference for Regression

Class Example 28, (now 10.13 – 10.15)

Old 10.8:

Engineers made measurements of the Leaning Tower of Pisa over the years 1975 – 1987. “Lean” is the difference between a point’s position if the tower were straight, and its actual position, in tenths of a millimeter, in excess of 2.9 meters (so a table entry of 642 means a lean of 2.9642 meters). The data are:


Inference for Regression

Class Example 28, (now 10.13 – 10.15)

Old 10.8:

The data are:

Year   Lean
75     642
76     644
77     656
78     667
79     673
80     688
81     696
82     698
83     713
84     717
85     725
86     742
87     757


Inference for Regression

Class Example 28, (now 10.13 – 10.15)

Old 10.8:

(a) Plot the data. Does the trend in lean over time appear to be linear?

(b) What is the equation of the least squares fit line?

(c) Give a 95% confidence interval for the average rate of change of the lean.

https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg28.xls
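Parts (b) and (c) can also be checked by hand outside Excel. The following is a rough Python sketch, not the class spreadsheet: it computes the least squares line for the Pisa data, then the standard error of the slope (the SD formula for the slope with $s_e$ in place of $\sigma_e$), then a 95% interval for the slope. The cutoff 2.201 is an assumed table value for a t distribution with n - 2 = 11 degrees of freedom, i.e. what TINV(0.05, 11) should return.

```python
import math

# Leaning Tower of Pisa data from the table above (year coded as 75..87).
years = list(range(75, 88))
lean = [642, 644, 656, 667, 673, 688, 696, 698, 713, 717, 725, 742, 757]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(lean) / n
sxx = sum((x - x_bar) ** 2 for x in years)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, lean))

# (b) Least squares fit line: lean = a_hat * year + b_hat.
a_hat = sxy / sxx
b_hat = y_bar - a_hat * x_bar

# Estimate of the error SD, using the n - 2 divisor.
sse = sum((y - a_hat * x - b_hat) ** 2 for x, y in zip(years, lean))
s_e = math.sqrt(sse / (n - 2))

# (c) Standard error of the slope and the 95% confidence interval.
se_a = s_e / math.sqrt(sxx)
t_star = 2.201  # assumed t table value for df = 11, two-sided 95%
lower = a_hat - t_star * se_a
upper = a_hat + t_star * se_a

print(f"fit line: lean = {a_hat:.4f} * year + ({b_hat:.4f})")
print(f"SE of slope = {se_a:.4f}")
print(f"95% CI for slope: ({lower:.3f}, {upper:.3f})")
```

The slope is the average rate of change of the lean per year, so this interval is the quantity asked for in part (c). The "Lower 95%" and "Upper 95%" entries on the X Variable row of the Excel regression output should agree with it up to rounding.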


Inference for Regression

HW:

10.3 b,c

10.5


And Now for Something Completely Different

Etymology of:

“And now for something completely

different”

Anybody heard of this before?


And Now for Something Completely Different

What is “etymology”?

Google responses to: define: etymology

• The history of words; the study of the history of words.
csmp.ucop.edu/crlp/resources/glossary.html

• The history of a word shown by tracing its development from another language.
www.animalinfo.org/glosse.htm


And Now for Something Completely Different

What is “etymology”?

• Etymology is derived from the Greek word ἔτυμον (etymon) meaning "a sense" and λόγος (logos) meaning "word." Etymology is the study of the original meaning and development of a word, tracing its meaning back as far as possible.
www.two-age.org/glossary.htm


And Now for Something Completely Different

Google response to: define: and now for something completely different

And Now For Something Completely Different is a film spinoff from the television comedy series Monty Python's Flying Circus. The title originated as a catchphrase in the TV show. Many Python fans feel that it excellently describes the nonsensical, non sequitur feel of the program.
en.wikipedia.org/wiki/And_Now_For_Something_Completely_Different


And Now for Something Completely Different

Google Search for:

“And now for something completely different”

Gives more than 100 results….

A perhaps interesting one:

http://www.mwscomp.com/mpfc/mpfc.html


And Now for Something Completely Different

Google Search for:

“Stat 31 and now for something completely different”

Gives:

[PPT] Slide 1
File Format: Microsoft Powerpoint 97 - View as HTML
... But what is missing? And now for something completely different… Review Ideas on State Lotteries, from our study of Expected Value ...
https://www.unc.edu/~marron/UNCstat31-2005/Stat31-05-03-31.ppt - Similar pages


Prediction in Regression

Idea: Given data $(X_1, Y_1), \ldots, (X_n, Y_n)$

Can find the Least Squares Fit Line, and do inference for the parameters.

Given a new X value, say $X_0$, what will the new Y value be?


Prediction in Regression

Dealing with variation in prediction:

Under the model: $Y_i = aX_i + b + e_i$

A sensible guess about $Y_0$, based on the given $X_0$, is:

$$\hat{Y}_0 = \hat{a}X_0 + \hat{b}$$

(point on the fit line above $X_0$)


Prediction in Regression

What about variation about this guess?

Natural Approach: present an interval

(as done with Confidence Intervals)

Careful: Two Notions of this:

1. Confidence Interval for mean of $Y_0$

2. Prediction Interval for value of $Y_0$


Prediction in Regression

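1. Confidence Interval for mean of $Y_0$:

Use: $\hat{Y}_0 \pm t^* \, SE_{\hat{Y}}$

where:

$$SE_{\hat{Y}} = s_e \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

and where $t^* = TINV(1 - 0.95, n - 2)$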


Prediction in Regression

Interpretation of:

$$SE_{\hat{Y}} = s_e \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

• Smaller for $x_0$ closer to $\bar{x}$

• But never 0

• Smaller for $x_i$ more spread out

• Larger for larger $s_e$
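These remarks can be checked numerically. The sketch below evaluates the standard error for the mean of $Y_0$ at several new x values. The function name `se_mean_response` is invented for illustration, the x values are the coded years 75 to 87 from Class Example 28, and the error SD is set to a placeholder value of 1.0 (an assumption, so that only the shape of the formula is on display).

```python
import math


def se_mean_response(x0, xs, s_e):
    """Standard error for the mean of Y at a new x value x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return s_e * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)


years = list(range(75, 88))  # coded years from Class Example 28
s_e = 1.0                    # placeholder error SD, for illustration only

for x0 in (81, 85, 90, 100):
    print(f"x0 = {x0:3d}: SE = {se_mean_response(x0, years, s_e):.4f}")

# At x0 equal to the mean of the x's, only the 1/n term is left.
print("floor s_e / sqrt(n) =", s_e / math.sqrt(len(years)))
```

The smallest value occurs at the mean of the x's (81 here), where the standard error equals $s_e / \sqrt{n}$, which is why it is never 0. It then grows as $x_0$ moves away from the data.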


Prediction in Regression

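2. Prediction Interval for value of $Y_0$

Use: $\hat{Y}_0 \pm t^* \, SE_{\hat{Y}}$

where:

$$SE_{\hat{Y}} = s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

And again $t^* = TINV(1 - 0.95, n - 2)$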


Prediction in Regression

Interpretation of:

$$SE_{\hat{Y}} = s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

• Similar remarks to above …

• Additional “1 + ” accounts for added variation in $Y_0$ compared to $\hat{Y}$
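To compare the two notions side by side, here is a rough Python sketch that computes both intervals at one new x value, using the Leaning Tower of Pisa data from Class Example 28. The new year x0 = 88 is a hypothetical choice for illustration, and 2.201 is an assumed t table value for 11 degrees of freedom, i.e. what TINV(1 - 0.95, 11) should return.

```python
import math

# Leaning Tower of Pisa data from Class Example 28 (year coded as 75..87).
years = list(range(75, 88))
lean = [642, 644, 656, 667, 673, 688, 696, 698, 713, 717, 725, 742, 757]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(lean) / n
sxx = sum((x - x_bar) ** 2 for x in years)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, lean))
a_hat = sxy / sxx
b_hat = y_bar - a_hat * x_bar
sse = sum((y - a_hat * x - b_hat) ** 2 for x, y in zip(years, lean))
s_e = math.sqrt(sse / (n - 2))

t_star = 2.201  # assumed t table value for df = 11, two-sided 95%
x0 = 88         # hypothetical new year, chosen for illustration
y0_hat = a_hat * x0 + b_hat  # point on the fit line above x0

core = 1 / n + (x0 - x_bar) ** 2 / sxx
se_mean = s_e * math.sqrt(core)      # for the mean of Y0
se_pred = s_e * math.sqrt(1 + core)  # for a single new value Y0

ci = (y0_hat - t_star * se_mean, y0_hat + t_star * se_mean)
pi = (y0_hat - t_star * se_pred, y0_hat + t_star * se_pred)

print(f"guess at x0 = {x0}: {y0_hat:.2f}")
print(f"95% confidence interval for mean of Y0: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% prediction interval for Y0:         ({pi[0]:.2f}, {pi[1]:.2f})")
```

The prediction interval is always the wider of the two: the squared standard errors differ by exactly $s_e^2$, the variation of a single new observation about the line.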


Prediction in Regression

Revisit Class Example 28, (now 10.13 – 10.15)

Old 10.8:

Engineers made measurements of the Leaning Tower of Pisa over the years 1975 – 1987. “Lean” is the difference between a point’s position if the tower were straight, and its actual position, in tenths of a millimeter, in excess of 2.9 meters. The data are listed above…


Prediction in Regression

Class Example 28, (now 10.13 – 10.15)

Old 10.9:

(a) Plot the data. Does the trend in lean over time appear to be linear?

(b) What is the equation of the least squares fit line?

(c) Give a 95% confidence interval for the average rate of change of the lean.

https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg28.xls


Prediction in Regression

HW:

10.20 and add part:

(f) Calculate a 95% Confidence Interval for the mean oxygen uptake of individuals having heart rate 96, and heart rate 115.


Additional Issues in Regression

Robustness

Outliers via Java Applet

HW on outliers