Stat 155, Section 2, Last Time


Transcript of Stat 155, Section 2, Last Time

Page 1: Stat 155,  Section 2, Last Time

Stat 155, Section 2, Last Time

• Relations between variables– Scatterplots – useful visualization– Aspects: Form, Direction, Strength

• Correlation– Numerical summary of Strength and Direction

• Linear Regression– Fit a line to data

Page 2: Stat 155,  Section 2, Last Time

Reading In Textbook

Approximate Reading for Today’s Material:

Pages 132-145, 151-163, 192-196, 198-210

Approximate Reading for Next Class:

Pages 218-225, 231-240

Page 3: Stat 155,  Section 2, Last Time

Section 2.3: Linear Regression

Idea:

Fit a line to data in a scatterplot

Reasons:

• To learn about “basic structure”

• To “model data”

• To provide “prediction of new values”

Page 4: Stat 155,  Section 2, Last Time

Linear Regression - Approach

Given a line, $y = bx + a$, “indexed” by $a$ & $b$

Define “residuals” = “data Y” – “Y on line”

$= y_i - (b x_i + a)$

Now choose $a$ & $b$ to make these “small”

(Plot labels: data points $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$)

Page 5: Stat 155,  Section 2, Last Time

Linear Regression - Approach

Make Residuals ≥ 0, by squaring

Least Squares: adjust $a$ & $b$ to

Minimize the “Sum of Squared Errors”

$SSE = \sum_{i=1}^{n} (y_i - (b x_i + a))^2$
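As a rough sketch of the quantity being minimized, the Python below computes SSE for two candidate lines. The data points and both (a, b) choices are hypothetical, picked only so the arithmetic is easy to check by hand.

```python
def sse(xs, ys, a, b):
    """Sum of squared residuals for the line y = b*x + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only
xs = [-1.0, 0.0, 1.0]
ys = [0.5, 1.0, 1.5]

sse_flat = sse(xs, ys, a=1.0, b=0.0)   # horizontal line y = 1
sse_exact = sse(xs, ys, a=1.0, b=0.5)  # line through all three points

print("SSE for y = 1:        ", sse_flat)
print("SSE for y = 0.5x + 1: ", sse_exact)
```

The second line has the smaller SSE, so by the least squares criterion it is the better fit to these points.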

Page 6: Stat 155,  Section 2, Last Time

Least Squares

Can Show: (math beyond this course)

Least Squares Fit Line:

• Passes through the point $(\bar{x}, \bar{y})$

• Has Slope: $b = r \frac{s_y}{s_x}$

(correction factor)
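The two facts about the least squares line (it passes through the point of means, and its slope is r times the ratio of standard deviations) can be turned into a short computation. This is a minimal sketch with hypothetical data; the helper names `sample_sd`, `correlation` and `least_squares_line` are illustrative, not from the course materials.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Correlation r as the average product of standardized values."""
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    products = (((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = b*x + a."""
    b = correlation(xs, ys) * sample_sd(ys) / sample_sd(xs)  # slope = r * s_y / s_x
    a = mean(ys) - b * mean(xs)  # forces the line through (x-bar, y-bar)
    return a, b


# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1]

a, b = least_squares_line(xs, ys)
print("intercept a =", round(a, 3), " slope b =", round(b, 3))
```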

Page 7: Stat 155,  Section 2, Last Time

Least Squares in Excel

First explore basics, from 1st principles

(later will do summaries & more)

Worked out example: http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg14.xls

1. Construct Toy Data Set

a. Fixed x’s (-3, -2, …, 3) (A4:A10)

b. Random Errors: “eps” (B4:B10)

c. Data Y’s = 1 + 0.3 * x’s + eps (C4:C10)

Page 8: Stat 155,  Section 2, Last Time

Least Squares in Excel

2. First Attempt: just try some $a$ & $b$

1. Arbitrarily choose $a_1$ & $b_1$ (B37 & B38)

2. Find points on that line (A41:A47)

3. Overlay Fit Line

(very clumsily done with “double plot”)

4. Residuals (A51:A57) & Squares (B51:B57)

5. Get SSE (B59) = 11.6 “pretty big”

Page 9: Stat 155,  Section 2, Last Time

Least Squares in Excel

3. Second Attempt: Choose $a_2$ & $b_2$

1. To make line pass through $(\bar{x}, \bar{y})$

2. By adjusting: $a_2 = \bar{y}$, since $\bar{x} = 0$

3. Recompute overlay fit line

4. Recompute residuals & SSE

5. Note now SSE = 9.91

(smaller than previous 11.6, i.e. better fit)

Page 10: Stat 155,  Section 2, Last Time

Least Squares in Excel

4. Third Attempt: Choose $a_3$ & $b_3$

1. To also get slope right

2. By making: $b_3 = r \frac{s_y}{s_x}$

3. Recompute line, residuals & SSE

4. Note now SSE = 0.041

(much smaller than previous, 9.91,

and now good visual fit)
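The three attempts can be mimicked outside Excel. The sketch below is not the spreadsheet itself: the error values in `eps` and the arbitrary first guess (a = 0, b = 1) are assumptions made up for illustration, so the SSE values will differ from the 11.6, 9.91 and 0.041 quoted on the slides. Only the ordering of the three SSE values is the point.

```python
from statistics import mean, stdev


def sse(xs, ys, a, b):
    """Sum of squared residuals for the line y = b*x + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))


# Toy data in the spirit of the slides: y = 1 + 0.3 * x + eps
xs = [-3, -2, -1, 0, 1, 2, 3]
eps = [0.1, -0.2, 0.05, 0.0, -0.1, 0.2, -0.05]  # hypothetical errors
ys = [1 + 0.3 * x + e for x, e in zip(xs, eps)]

# Attempt 1: arbitrary intercept and slope (hypothetical guess)
a1, b1 = 0.0, 1.0
sse1 = sse(xs, ys, a1, b1)

# Attempt 2: same slope, but pass through (x-bar, y-bar); x-bar = 0 here, so a = y-bar
a2, b2 = mean(ys), b1
sse2 = sse(xs, ys, a2, b2)

# Attempt 3: also set the slope to r * s_y / s_x
n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (
    (n - 1) * stdev(xs) * stdev(ys)
)
a3, b3 = mean(ys), r * stdev(ys) / stdev(xs)
sse3 = sse(xs, ys, a3, b3)

print("SSE attempt 1:", round(sse1, 3))
print("SSE attempt 2:", round(sse2, 3))
print("SSE attempt 3:", round(sse3, 3))
```

Each step lowers SSE: fixing the intercept removes the vertical offset, and fixing the slope removes the tilt.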

Page 11: Stat 155,  Section 2, Last Time

Least Squares in Excel

5. When you do this:

1. Use EXCEL summaries of these operations

2. INTERCEPT (computes y-intercept a)

3. SLOPE (computes slope b)

4. Much simpler than above operations

5. To draw line, right click data & “Add Trendline”

HW: 2.47a

Page 12: Stat 155,  Section 2, Last Time

Next time

Add slide(s) about difference between:

Regression of Y on X

And

Regression of X on Y

Page 13: Stat 155,  Section 2, Last Time

Effect of a Single Data Point

Nice Webster West Example:

http://www.stat.sc.edu/~west/javahtml/Regression.html

• Illustrates effect of adding a single new point

• Points nearby don’t change line much

• Far away points create “strong leverage”

Page 14: Stat 155,  Section 2, Last Time

Effect of a Single Data Point

HW: 2.71

Page 15: Stat 155,  Section 2, Last Time

Least Squares Prediction

Idea: After finding a & b (i.e. fit line)

For new x, predict new value of y,

Using b x + a

I.e. “predict by point on the line”

Page 16: Stat 155,  Section 2, Last Time

Least Squares Prediction

EXCEL Prediction: revisit example http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg14.xls

EXCEL offers two functions:

• TREND

• FORECAST

They work similarly, input raw x’s and y’s

(careful about order!)
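For comparison, least squares prediction from raw data can be sketched in Python. The function `forecast` below is a hypothetical helper, not an Excel API; its argument order (new x first, then the y's, then the x's) is chosen to mirror Excel's FORECAST(x, known_y's, known_x's), which is one place where the order is easy to get wrong. The data are made up for illustration.

```python
def forecast(new_x, known_ys, known_xs):
    """Predict y at new_x using the least squares line through the data."""
    n = len(known_xs)
    x_bar = sum(known_xs) / n
    y_bar = sum(known_ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(known_xs, known_ys))
    sxx = sum((x - x_bar) ** 2 for x in known_xs)
    b = sxy / sxx            # algebraically the same as r * s_y / s_x
    a = y_bar - b * x_bar    # line passes through (x-bar, y-bar)
    return b * new_x + a     # "predict by point on the line"


# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1]

y_hat = forecast(2.5, ys, xs)
print("predicted y at x = 2.5:", round(y_hat, 3))
```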

Page 17: Stat 155,  Section 2, Last Time

Least Squares Prediction

Caution: prediction outside range of data is called “extrapolation”

Dangerous, since small errors are magnified
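One way to see the magnification is to compare two lines through the same point of means whose slopes differ slightly. This is only a sketch of one mechanism (a small slope error); the slopes and x values are hypothetical, and it does not cover the separate risk that the relationship stops being linear outside the data range.

```python
# Both lines pass through the same (x-bar, y-bar); only the slope differs.
x_bar, y_bar = 0.0, 1.0
b_true = 0.30   # hypothetical "true" slope
b_fit = 0.35    # hypothetical fitted slope, off by 0.05


def line(b, x):
    """Value at x of the line with slope b through (x_bar, y_bar)."""
    return y_bar + b * (x - x_bar)


def prediction_gap(x):
    """Absolute difference between fitted and true line at x."""
    return abs(line(b_fit, x) - line(b_true, x))


gap_inside = prediction_gap(2.0)    # x inside a data range such as -3..3
gap_outside = prediction_gap(20.0)  # x far outside that range

print("gap at x = 2: ", round(gap_inside, 3))
print("gap at x = 20:", round(gap_outside, 3))
```

The gap grows in proportion to the distance from x-bar, so the same slope error costs ten times as much at x = 20 as at x = 2.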

Page 18: Stat 155,  Section 2, Last Time

Least Squares Prediction

HW:

2.47b, 2.49,

2.55 (hint, use Least Squares formula above, since don’t have raw data)

Page 19: Stat 155,  Section 2, Last Time

Interpretation of r squared

Recall correlation $r$ measures

“strength of linear relationship”

$r^2$ is “fraction of variation explained by line”

$r^2 \approx 1$ for “good fit”

$r^2 \approx 0$ for “very poor fit”

$r^2$ measures “signal to noise ratio”
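The “fraction of variation explained” reading can be checked numerically: for a least squares line, the squared correlation equals 1 minus SSE divided by the total variation of the y's about their mean. The sketch below uses hypothetical data, and the names `sst` and `fraction_explained` are illustrative.

```python
import math

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least squares line y = b*x + a
b = sxy / sxx
a = y_bar - b * x_bar

# Variation left over after fitting the line, and total variation
sse = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
sst = syy

r = sxy / math.sqrt(sxx * syy)
fraction_explained = 1 - sse / sst

print("r squared:         ", round(r ** 2, 4))
print("fraction explained:", round(fraction_explained, 4))
```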

Page 20: Stat 155,  Section 2, Last Time

Interpretation of r squared

Revisit http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg13.xls

(a, c, d) “data near line”: high signal to noise ratio, $r^2 \approx 0.97$

(b) “noisier data”: low signal to noise ratio, $r^2 = 0.58$

(c) “almost pure noise”: nearly no signal, $r^2 = 0.003$

Page 21: Stat 155,  Section 2, Last Time

Interpretation of r squared

HW:

2.47c

Page 22: Stat 155,  Section 2, Last Time

And now for something completely different

Recall distribution of majors of students in this course.

Anna Miller: What about Statistics?

[Bar chart “Stat 155, Section 2, Majors”: vertical axis Frequency (0 to 0.4); categories Business / Man., Biology, Public Policy / Health, Pharm / Nursing, Journalism / Comm., Env. Sci., Other, Undecided]

Page 23: Stat 155,  Section 2, Last Time

And Now for Something Completely Different

Joke from Anna Miller:

Three professors (a physicist, a chemist, and a statistician) are called in to see their dean.

As they arrive the dean is called out of his office, leaving the three professors.

The professors see with alarm that there is a fire in the wastebasket.

Page 24: Stat 155,  Section 2, Last Time

And Now for Something Completely Different

The physicist says, "I know what to do! We must cool down the materials until their temperature is lower than the ignition temperature and then the fire will go out."

The chemist says, "No! No! I know what to do! We must cut off the supply of oxygen so that the fire will go out due to lack of one of the reactants."

Page 25: Stat 155,  Section 2, Last Time

And Now for Something Completely Different

While the physicist and chemist debate what course to take, they both are alarmed to see the statistician running around the room starting other fires.

They both scream, "What are you doing?"

To which the statistician replies, "Trying to get an adequate sample size."

Page 26: Stat 155,  Section 2, Last Time

And Now for Something Completely Different

This was a variation on another old joke:

An engineer, physicist and mathematician were taking a long car trip, and stopped for the night at a hotel.

All 3 went to bed, and were smoking when they fell asleep.

Page 27: Stat 155,  Section 2, Last Time

And Now for Something Completely Different

The 3 cigarettes fell to the carpet, and started a fire.

The engineer smelled the smoke, jumped out of bed, ran to the bathroom, grabbed a glass, filled it with water, ran back, and doused the fire.

Page 28: Stat 155,  Section 2, Last Time

And Now for Something Completely Different

The physicist smelled the smoke, jumped out of bed, made a careful estimate of the size of the fire, looked for the bathroom, found it and went in, found a glass, carefully calculated how much water would be needed to put out the fire, put that much water in the glass, went to the fire, and doused it.

Page 29: Stat 155,  Section 2, Last Time

And Now for Something Completely Different

The mathematician smelled the smoke, jumped out of bed, went to the bathroom, found the glass, carefully examined it, to be sure it would hold water, turned on the faucet to be sure that water would come out when that was done, and…

Page 30: Stat 155,  Section 2, Last Time

And Now for Something Completely Different

went back to bed, satisfied that a solution to the problem existed!

Page 31: Stat 155,  Section 2, Last Time

Diagnostic for Linear Regression

Recall Normal Quantile plot shows “how well

normal curve fits a data set”

Useful visual assessment of how well the

regression line fits data is:

Residual Plot

Just Plot of Residuals (on Y axis),

versus X’s (on X axis)

Page 32: Stat 155,  Section 2, Last Time

Residual Diagnostic Plot

Toy Examples: http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg15.xls

1. Generate Data to follow a line

• Residuals seem to be randomly distributed

• No apparent structure

• Residuals seem “random”

• Suggests linear fit is a good model for data

Page 33: Stat 155,  Section 2, Last Time

Residual Diagnostic Plot

Toy Examples: http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg15.xls

2. Generate Data to follow a Parabola

• Shows systematic structure

• Pos. – Neg. – Pos. suggests data follow a

curve (not linear)

• Suggests that line is a poor fit
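The Pos. – Neg. – Pos. pattern can be reproduced with a few lines of Python. This is a minimal sketch, not the spreadsheet example: the data are an exact parabola (y = x squared, an assumption made for simplicity), a least squares line is fit, and the signs of the residuals are listed in order of x instead of being plotted.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = b*x + a."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    a = y_bar - b * x_bar
    return a, b


def residuals(xs, ys):
    """Residual = data y minus y on the fitted line, in the order of xs."""
    a, b = fit_line(xs, ys)
    return [y - (b * x + a) for x, y in zip(xs, ys)]


xs = [-3, -2, -1, 0, 1, 2, 3]
parabola_ys = [x ** 2 for x in xs]  # hypothetical data following a curve

parabola_residuals = residuals(xs, parabola_ys)
signs = ["+" if r > 0 else "-" if r < 0 else "0" for r in parabola_residuals]

print(parabola_residuals)
print(signs)
```

The residuals are positive at both ends and negative in the middle, which is the systematic structure a residual plot would show for these data.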

Page 34: Stat 155,  Section 2, Last Time

Residual Diagnostic Plot

Example from text: problem 2.74 http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg15.xls

Study (for runners), how Stride Rate

depends on Running Speed

(to run faster, need faster strides)

a. & b. Scatterplot & Fit line

c. & d. Residual Plot & Comment

Page 35: Stat 155,  Section 2, Last Time

Residual Diagnostic Plot E.g. http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg15.xls

a. & b. Scatterplot & Fit line

• Linear fit looks very good

• Backed up by correlation ≈ 1

• “Low noise” because data are averaged

(over 21 runners)

Page 36: Stat 155,  Section 2, Last Time

Residual Diagnostic Plot E.g. http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg15.xls

c. & d. Residual Plot & Comment

• Systematic structure: Pos. – Neg. – Pos.

• Not random, but systematic pattern

• Suggests line can be improved

(as a model for these data)

• Residual plot provides “zoomed in view”

(can’t see this in raw data)

Page 37: Stat 155,  Section 2, Last Time

Residual Diagnostic Plot

HW 2.73, 2.63

Page 38: Stat 155,  Section 2, Last Time

Chapter 3: Producing Data

(how this is done is critical to conclusions)

Section 3.1: Statistical Settings

2 Main Types:

I. Observational Study

Simply “see what happens, no intervention”

(to individuals or variables of interest)

e.g. Political Polls, Supermarket Scanners

Page 39: Stat 155,  Section 2, Last Time

Producing Data

2 Main Types:

I. Observational Study

II. Experiment

(Make Changes, & Study Effect)

Apply “treatment” to individuals & measure “responses”

e.g. Clinical trials for drugs, agricultural trials

(safe? effective?) (max yield?)

Page 40: Stat 155,  Section 2, Last Time

Producing Data

2 Main Types:

I. Observational Study

II. Experiment

(common sense)

Caution: Thinking is required for each.

Both if you do statistics & if you need to understand somebody else’s results

Page 41: Stat 155,  Section 2, Last Time

Helpful Distinctions

(Critical Issue of “Good” vs. “Bad”)

I. Observational Studies:

A. Anecdotal Evidence

Idea: Study just a few cases

Problem: may not be representative

(or worse: only considered for this reason)

e.g. Cures for hiccups

Key Question: how were data chosen?

(early medicine: this gave crazy attempts at cures)

Page 42: Stat 155,  Section 2, Last Time

Helpful Distinctions

I. Observational Studies:

B. Sampling

Idea: Seek sample representative of population

HW:

3.1, 3.3, 3.5, 3.7

Challenge: How to sample?

(turns out: not easy)

Page 43: Stat 155,  Section 2, Last Time

How to sample?

History of Presidential Election Polls

During Campaigns, constantly hear in news “polls say …” How good are these? Why?

1936 Landon vs. Roosevelt

Literary Digest Poll: 43% for R

Result: 62% for R

What happened?

Sample size not big enough? 2.4 million

Biggest Poll ever done (before or since)

Page 44: Stat 155,  Section 2, Last Time

Bias in Sampling

Bias: Systematically favoring one outcome

(need to think carefully)

Selection Bias: Addresses from L. D.

readers, phone books, club memberships

(representative of population?)

Non-Response Bias: Return-mail survey

(who had time?)

Page 45: Stat 155,  Section 2, Last Time

How to sample?

1936 Presidential Election (cont.)

Interesting Alternative Poll:

Gallup: 56% for R (sample size ~ 50,000)

Gallup’s prediction of L. D. result: 44% for R (sample size ~ 3,000)

Predicted both correct winner (actual 62% for R),

and L. D. error (actual 43% for R)!

(what was better?)

Page 46: Stat 155,  Section 2, Last Time

Improved Sampling

Gallup’s Improvements:

(i) Personal Interviews

(attacks non-response bias)

(ii) Quota Sampling

(attacks selection bias)

Page 47: Stat 155,  Section 2, Last Time

Quota Sampling

Idea: make “sample like population”

So surveyor chooses people to give:

i. Right % male

ii. Right % “young”

iii. Right % “blue collar”

iv. …

This worked well, until …

Page 48: Stat 155,  Section 2, Last Time

How to sample?

1948        Dewey   Truman   sample size
Crossley    50%     45%
Gallup      50%     44%      50,000
Roper       53%     38%      15,000
Actual      45%     50%      -

Note: Embarrassing for polls, famous photo of Truman + Headline “Dewey Defeats Truman”

Page 49: Stat 155,  Section 2, Last Time

What went wrong?

Problem: Unintentional Bias

(surveyors understood bias,

but still made choices)

Lesson: Human Choice cannot give a Representative Sample

Surprising Improvement: Random Sampling

Now called “scientific sampling”

Random = Scientific???

Page 50: Stat 155,  Section 2, Last Time

Random Sampling

Key Idea: “random error” is smaller than “unintentional bias”, for large enough sample sizes

How large?

Current sample sizes: ~1,000 - 3,000

Note: now far fewer than the 50,000 used in 1948.

So surveys are much cheaper

(thus many more done now….)
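A small simulation can illustrate why a modest random sample can beat a huge biased one. This is a sketch under stated assumptions, not a reconstruction of the 1936 polls: the true support (62%) is the election result quoted in the slides, while the 43% support inside the biased sampling frame, the sample size of 100,000 for the biased poll, and the helper names `poll` and `standard_error` are all assumptions chosen for illustration.

```python
import math
import random

rng = random.Random(155)  # fixed seed so the sketch is repeatable

TRUE_SUPPORT = 0.62          # actual 1936 share for Roosevelt, from the slides
BIASED_FRAME_SUPPORT = 0.43  # assumption: share among people a biased poll reaches


def poll(support, n):
    """Fraction of n simulated respondents who favor the candidate."""
    return sum(rng.random() < support for _ in range(n)) / n


def standard_error(p, n):
    """Typical size of pure random error for a sample proportion."""
    return math.sqrt(p * (1 - p) / n)


random_estimate = poll(TRUE_SUPPORT, 1_000)            # small random sample
biased_estimate = poll(BIASED_FRAME_SUPPORT, 100_000)  # large biased sample

print("random sample, n = 1,000:  ", round(random_estimate, 3))
print("biased sample, n = 100,000:", round(biased_estimate, 3))
print("random error scale, n = 1,000:", round(standard_error(TRUE_SUPPORT, 1_000), 4))
print("random error scale, n = 3,000:", round(standard_error(TRUE_SUPPORT, 3_000), 4))
```

Increasing the size of the biased sample only makes its estimate settle more firmly on the wrong value, whereas the random error of an unbiased sample shrinks as the sample grows.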