Class 8: Tues., Oct. 5
Causation, Lurking Variables in Regression (Ch. 2.4, 2.5)
Inference for Simple Linear Regression (Ch. 10.1)

Page 1:

Class 8: Tues., Oct. 5

• Causation, Lurking Variables in Regression (Ch. 2.4, 2.5)

• Inference for Simple Linear Regression (Ch. 10.1)

• Where we’re headed:
– This Week: Inference (Ch. 10)
– Next Week: Transformations and Polynomial Regression (Ch. 2.6), Example Regression Analysis
– Tue., Oct. 19: Review for Midterm I
– Thu., Oct. 21: Midterm I
– Fall Break!

Page 2:

Bivariate Fit of HousePrice By CrimeRate

[Scatterplot of HousePrice (0 to 500,000) against CrimeRate (10 to 70), with fitted line]

Linear Fit: HousePrice = 225233.55 - 2288.6894 CrimeRate

Regression without Center City Philadelphia

Page 3:

The Question of Causation

• The community that ran this regression would like to increase property values. If low crime rates increase property values, the community might be able to cover the costs of increased police protection by gains in tax revenue from higher property values.

• The regression without Center City Philadelphia is HousePrice = 225233.55 - 2288.6894 CrimeRate.

• The community concludes that if it can cut its crime rate from 30 down to 20 incidents per 1000 population, it will increase its average house price by $2288.6894 * 10 ≈ $22,887.

• Is the community’s conclusion justified?

Page 4:

Potential Outcomes Model

• Let Y_i^30 denote what the house price for community i would be if its crime rate were 30 and all other aspects of community i were held fixed, and let Y_i^20 denote what the house price for community i would be if its crime rate were 20 and all other aspects of community i were held fixed.

• X (crime rate) causes a change in Y (house price) for community i if Y_i^20 ≠ Y_i^30. A decrease in crime rate causes an increase in house price for community i if Y_i^20 > Y_i^30.

Page 5:

Association is Not Causation

• A regression model tells us about how the mean of Y|X is associated with changes in X. A regression model does not tell us what would happen if we actually changed X.

• Possible Explanations for an Observed Association Between Y and X

1. X causes Y.
2. Y causes X.
3. There is a lurking variable Z that is associated with changes in both X and Y.

Any combination of the three explanations may apply to an observed association.
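The third explanation can be made concrete with a small simulation. The sketch below (not part of the lecture; all numbers are made up for illustration) builds a lurking variable Z that drives both X and Y, so Y and X are strongly associated even though neither causes the other:

```python
# Sketch of explanation 3: a lurking variable Z drives both X and Y.
# Hypothetical coefficients chosen only for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)               # lurking variable
x = 2.0 * z + rng.normal(size=n)     # X depends on Z, not on Y
y = 3.0 * z + rng.normal(size=n)     # Y depends on Z, not on X

# The least squares slope of Y on X is clearly nonzero
# (theoretically Cov(X,Y)/Var(X) = 6/5 = 1.2)...
slope_yx = np.cov(x, y)[0, 1] / np.var(x)

# ...but after holding Z "fixed" (regressing out Z from both X and Y,
# then relating the residuals) the association essentially disappears.
x_resid = x - (np.cov(z, x)[0, 1] / np.var(z)) * z
y_resid = y - (np.cov(z, y)[0, 1] / np.var(z)) * z
slope_adj = np.cov(x_resid, y_resid)[0, 1] / np.var(x_resid)

print(slope_yx)   # close to 1.2
print(slope_adj)  # close to 0
```

Adjusting for lurking variables by holding them fixed is exactly what multiple regression (Ch. 11) will formalize.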

Page 6:

Y Causes X

Bivariate Fit of CrimeRate By HousePrice

[Scatterplot of CrimeRate (10 to 70) against HousePrice (0 to 500,000), with fitted line]

Linear Fit: CrimeRate = 41.929126 - 0.0000805 HousePrice

Perhaps it is changes in house price that cause changes in crime rate. When house prices increase, the residents of a community have more to lose by engaging in criminal activities; this is called the economic theory of crime.

Page 7:

Lurking Variables

• Lurking variable for the causal relationship between X and Y: a variable Z that is associated with both X and Y.

• Example of a lurking variable in the Philadelphia crime rate data: level of education. Level of education may be associated with both house prices and crime rate.

• The effect of crime rate on house price is confounded with the effect of education on house price. If we just look at data on house price and crime rate, we can’t distinguish between the effect of crime rate on house price and the effect of education on house price.

• Lurking variables are sometimes called confounding variables.

Page 8:

Weekly Wages (Y) and Education (X) in March 1988 CPS

Bivariate Fit of wage By educ

[Scatterplot of wage (0 to 18,000) against educ (0 to 18), with fitted line]

Linear Fit: wage = -19.06983 + 50.414381 educ

Will getting an extra year of education cause an increase of $50.41 on average in your weekly wage? What are some potential lurking variables?

Page 9:

Establishing Causation

• The best method is an experiment, but many times that is not ethically or practically possible (e.g., smoking and cancer, education and earnings).

Page 10:

Establishing Causation from an Observational Study

• Main strategy for learning about causation when we can’t do an experiment: Consider all lurking variables you can think of. Look at how Y is associated with X when the lurking variables are held “fixed.” We will study methods for doing this when we study multiple regression in Chapter 11.

Page 11:

Statistics and Smoking

• Doctors had long observed a strong association between smoking and death from lung cancer.

• But did smoking cause lung cancer? There were many possible lurking variables: smokers have worse diets, drink more alcohol, and get less exercise than nonsmokers. The possibility that there was a genetic factor that predisposes people both to nicotine addiction and lung cancer was also raised.

• Statistical evidence from observational studies formed an essential part of the Surgeon General’s report in 1964 that declared that smoking causes lung cancer.

• How were objections that the evidence linking lung cancer and smoking came entirely from observational studies overcome?

1957: Smoking 'causes lung cancer' The link between smoking and lung cancer is one of 'direct cause and effect', a special report by the Medical Research Council has found.

The report, published today, studied the dramatic increase in deaths from lung cancer over the past 25 years and concluded the main cause was smoking.

But tobacco firms have rejected the findings saying they are merely a 'matter of opinion'.

This smoker said the findings “didn’t frighten him at all.”

Page 12:

Criteria for Establishing Causation Without an Experiment

• The association is strong.

• The association is consistent.

• Higher doses are associated with stronger responses.

• The alleged cause precedes the effect in time.

• The alleged cause is plausible.

Page 13:

Random Samples and Inference

• The Current Population Survey is a monthly sample survey of the labor force behavior of American households.

• The data in cpswages.JMP is the weekly wages and education for a random sample of 25,631 men from the March 1988 Current Population Survey.

• Suppose we take random subsamples of size 25 from this data.

In JMP, we can take a random sample of the data by clicking Tables, then Subset, then clicking Random Sample and entering the size of the sample you want in the box Sampling Rate or Sample Size. Clicking OK creates a new data table consisting of a random sample of the rows in the original data.

Page 14:

Four Random Samples of Size 25 from cpswage.JMP

[Four scatterplots of wage By educ, each from a random sample of size 25, each with a fitted line:]

Sample 1: wage = -321.7406 + 66.572815 educ
Sample 2: wage = -554.3953 + 96.141786 educ
Sample 3: wage = 8.1902045 + 44.402483 educ
Sample 4: wage = 418.20795 + 12.449913 educ

Page 15:

Least Squares Slopes in 1000 Random Samples of Size 25

[Histogram of the 1000 least squares slopes]

Quantiles:
100.0% maximum 589.0
99.5% 171.4
97.5% 122.4
90.0% 94.0
75.0% quartile 70.4
50.0% median 49.5
25.0% quartile 31.9
10.0% 15.8
2.5% -2.9
0.5% -28.7
0.0% minimum -200.3

Moments:
Mean 52.641995
Std Dev 37.856231
Std Err Mean 1.1971191
Upper 95% Mean 54.991152
Lower 95% Mean 50.292839
N 1000
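The JMP subsampling exercise above can be sketched in Python. Note this uses simulated stand-in data whose coefficients and noise level are loosely modeled on the full-data fit, not the actual cpswage.JMP file, so the quantiles will not match the table exactly:

```python
# Sketch: sampling distribution of the least squares slope over random
# subsamples of size 25. "Population" is simulated stand-in data.
import numpy as np

rng = np.random.default_rng(42)

N = 25_631
educ = rng.integers(0, 19, size=N).astype(float)            # 0..18 years
wage = -19.1 + 50.4 * educ + rng.normal(scale=419.5, size=N)  # noisy wages

def ls_slope(x, y):
    """Least squares slope of y on x."""
    xbar, ybar = x.mean(), y.mean()
    return ((x - xbar) * (y - ybar)).sum() / ((x - xbar) ** 2).sum()

# 1000 random subsamples of size 25, one slope estimate per subsample
slopes = np.array([
    ls_slope(educ[idx], wage[idx])
    for idx in (rng.choice(N, size=25, replace=False) for _ in range(1000))
])

# Centered near the population slope, but with a wide spread,
# mirroring the histogram on this slide
print(slopes.mean(), slopes.std())
```

The wide spread of the 1000 slopes is exactly the sampling variability that standard errors and confidence intervals quantify.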

Page 16:

Inference Based on a Sample

• The whole Current Population Survey (25,631 men ages 18-70) is a random sample from the U.S. population (roughly 75 million men ages 18-70).

• In most regression analyses, the data we have is a sample from some larger (hypothetical) population. We are interested in the true regression line for the larger population.

• Inference Questions:
– How accurate is the least squares estimate of the slope as an estimate of the true slope in the larger population?
– What is a plausible range of values for the true slope in the larger population based on the sample?
– Is it plausible that the slope equals a particular value (e.g., 0) based on the sample?

• Regression Applet: http://gsbwww.uchicago.edu/fac/robert.mcculloch/research/webpage/teachingApplets/ciSLR/index.html

Page 17:

Model for Inference

• For inference, we assume the simple linear regression model is true.

• We should first check the assumptions using residual plots and also look for outliers and influential points before making inferences.

• Simple linear regression model: Y_i = β0 + β1*X_i + e_i

– e_i has a normal distribution with mean 0 and standard deviation (SD) σ_e.

– The subpopulation of Y with corresponding X = X_i has a normal distribution with mean β0 + β1*X_i and SD σ_e.

– Technical note: For inference for simple linear regression, we assume we take repeated samples from the simple linear regression model with the X’s set equal to the X’s in the data, X_1, ..., X_n.
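The technical note can be made concrete with a tiny simulation (hypothetical parameter values, not from the lecture data): in "repeated samples" the X's stay fixed and only the errors, and hence the Y's, are redrawn.

```python
# Sketch: repeated sampling under the simple linear regression model.
# The X's are held fixed; each repeated sample redraws only the errors.
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = -19.1, 50.4, 419.5        # hypothetical true values
x = np.array([8., 10., 12., 12., 14., 16., 16., 18.])  # fixed X's

def draw_y():
    # Y_i = beta0 + beta1*X_i + e_i, with e_i ~ Normal(0, sigma)
    return beta0 + beta1 * x + rng.normal(scale=sigma, size=x.size)

# Two repeated samples: same X's, different Y's
y1, y2 = draw_y(), draw_y()
```

Everything random in the model lives in the errors e_i; that is why the standard error formulas on the next slide condition on the observed X's.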

Page 18:

Standard Error for the Slope

• True model: E(Y|X) = β0 + β1*X.

• From the sample of size n, we estimate β1 by the least squares estimate β̂1.

• In repeated samples of size n with X’s set equal to X_1, ..., X_n, the standard error SE(β̂1) is the “typical” absolute value of the error made in estimating β1 by β̂1:

SE(β̂1) = RMSE / sqrt( Σ_{i=1}^{n} (X_i - X̄)² )
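The standard error formula can be checked by simulation. This is an illustrative sketch with made-up parameter values: the formula's SE from a single sample should be close to the actual spread of the slope estimates over many repeated samples.

```python
# Sketch: SE(beta1_hat) = RMSE / sqrt(sum (X_i - Xbar)^2), checked
# against the empirical SD of the slope over many simulated samples.
import numpy as np

rng = np.random.default_rng(7)
beta0, beta1, sigma = 0.0, 2.0, 3.0   # hypothetical true values
x = np.linspace(0, 10, 30)            # fixed X's
sxx = ((x - x.mean()) ** 2).sum()     # sum of squared deviations of X

def fit(y):
    """Least squares fit of y on x: return (slope, SE of slope)."""
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    rmse = np.sqrt((resid ** 2).sum() / (len(x) - 2))  # df = n - 2
    return b1, rmse / np.sqrt(sxx)

# One sample: the formula's standard error
y = beta0 + beta1 * x + rng.normal(scale=sigma, size=x.size)
b1, se = fit(y)

# Many repeated samples: empirical SD of the slope estimates
slopes = [fit(beta0 + beta1 * x + rng.normal(scale=sigma, size=x.size))[0]
          for _ in range(2000)]

print(se, np.std(slopes))  # the two should be close
```

The denominator sqrt(Σ(X_i - X̄)²) shows why spreading the X's out makes the slope estimate more precise.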

Page 19:

Bivariate Fit of wage By educ

Full Data Set:
[Scatterplot of wage (0 to 18,000) against educ (0 to 18), with fitted line]
Linear Fit: wage = -19.06983 + 50.414381 educ
Summary of Fit: RSquare 0.108609; RSquare Adj 0.108575; Root Mean Square Error 419.4715; Mean of Response 640.1625; Observations 25631
Parameter Estimates:
Term       Estimate    Std Error  t Ratio  Prob>|t|
Intercept  -19.06983   12.08449   -1.58    0.1146
educ       50.414381   0.902171   55.88    0.0000

Random Sample of Size 25:
[Scatterplot of wage (0 to 1,500) against educ (5 to 20), with fitted line]
Linear Fit: wage = 170.98998 + 35.345874 educ
Summary of Fit: RSquare 0.150327; RSquare Adj 0.113384; Root Mean Square Error 308.4266; Mean of Response 592.3128; Observations 25
Parameter Estimates:
Term       Estimate    Std Error  t Ratio  Prob>|t|
Intercept  170.98998   217.7806   0.79     0.4404
educ       35.345874   17.52198   2.02     0.0555

Page 20:

Confidence Intervals

• Confidence interval: a range of values that are plausible for a parameter given the data.

• 95% confidence interval: an interval that 95% of the time will contain the true parameter.

• Approximate 95% confidence interval: estimate of parameter ± 2*SE(estimate of parameter).

• Approximate 95% confidence interval for the slope: β̂1 ± 2*SE(β̂1).

• For the wage-education data, β̂1 = 50.41 and SE(β̂1) = 0.90, so the approximate 95% CI is 50.41 ± 2*0.90 = (48.61, 52.21).

• Interpretation of a 95% confidence interval: it is most plausible that the true slope is in the 95% confidence interval. It is possible but unlikely that the true slope is outside the 95% confidence interval; the confidence interval will fail to contain the true slope only 5% of the time in repeated samples.
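The arithmetic for the wage-education interval, using the slope estimate 50.41 and standard error 0.90 quoted on this slide, is just:

```python
# Approximate 95% CI for the slope: estimate +/- 2*SE(estimate),
# with the wage-education numbers from the slide.
b1_hat, se = 50.41, 0.90
ci = (round(b1_hat - 2 * se, 2), round(b1_hat + 2 * se, 2))
print(ci)  # (48.61, 52.21)
```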

Page 21:

Conf. Intervals for Slope in JMP

• After Fit Line, right click in the parameter estimates table, go to Columns and click on Lower 95% and Upper 95%.

• The exact 95% confidence interval is close to but not equal to β̂1 ± 2*SE(β̂1).

Bivariate Fit of wage By educ
Summary of Fit: RSquare 0.108609; RSquare Adj 0.108575; Root Mean Square Error 419.4715; Mean of Response 640.1625; Observations 25631
Parameter Estimates:
Term       Estimate    Std Error  t Ratio  Prob>|t|  Lower 95%   Upper 95%
Intercept  -19.06983   12.08449   -1.58    0.1146    -42.75612   4.6164574
educ       50.414381   0.902171   55.88    0.0000    48.646075   52.182687

Page 22:

Confidence Intervals and the Polls

• Margin of Error = 2*SE(Estimate).

• 95% CI for the difference between Bush’s and Kerry’s proportions: p̂ ± 2*SE(p̂) = p̂ ± margin of error.

Top Stories - washingtonpost.com In the aftermath of last week's debate, Bush leads Kerry 51 percent to 46 percent among those most likely to vote, according to polling conducted Friday through Sunday...

A total of 1,470 registered voters were interviewed, including 1,169 who were determined to be likely voters. The margin of sampling error for results based on either sample is plus or minus three percentage points.

95% CI: 5% ± 3% = (2%, 8%).
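A rough sketch of where the quoted "plus or minus three percentage points" comes from, assuming a simple random sample of the 1,169 likely voters mentioned in the excerpt (real polls also weight and adjust, and the margin of error for a difference of two proportions is larger than for a single proportion):

```python
# Rough margin of error for a single proportion under simple random
# sampling: 2 * sqrt(p_hat * (1 - p_hat) / n).
import math

n = 1169        # likely voters in the poll quoted above
p_hat = 0.51    # Bush's share among likely voters
se = math.sqrt(p_hat * (1 - p_hat) / n)
moe = 2 * se
print(round(moe, 3))  # about 0.029, i.e. roughly 3 percentage points
```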

Page 23:

Why Do the Polls Sometimes Disagree So Much?

If the election were held today, would you vote for Bush or Kerry?

Published   Bush%  Kerry%  Error%  Polled  Source
Oct. 3      49     49      4       770     CNN-USA Today Gallup
Oct. 2      46     49      4       1,013   Newsweek
Sept. 29    51     46      3       1,100   Los Angeles Times
Sept. 27    52     44      4       758     CNN-USA Today Gallup
Sept. 20    50     42      3       1,088   CBS News/New York Times
Sept. 16    46     46      4       1,002   Pew
Sept. 16    55     42      4       767     Gallup

Page 24:

Assumptions for Validity of Confidence Interval

• The margin of error in a confidence interval covers only random sampling errors according to the assumed random sampling model; the confidence interval’s “95% guarantee” assumes the model is correct.

• In presidential polls, it must be determined who is “likely to vote.” Different polls use different models for determining who is likely to vote. The margin of error in the confidence interval assumes that the poll’s model for who is likely to vote is correct.

• For simple linear regression, the confidence interval for the slope assumes the simple linear regression model is correct; if the model is not correct, the confidence interval’s “95% guarantee” (that it will contain the true slope 95% of the time) is not valid.

• Always check the assumptions of the simple linear regression model before doing inference.