Sampling and Inference (dspace.mit.edu/.../lecture-notes/samp_and_inferce.pdf)


Sampling and Inference

The Quality of Data and Measures


Why we talk about sampling

• General citizen education
• Understand data you’ll be using
• Understand how to draw a sample, if you need to
• Make statistical inferences


Why do we sample?

[Figure: cost/benefit plotted against N, with two curves labeled Benefit (precision) and Cost (hassle factor)]


How do we sample?

• Simple random sample
  – Variant: systematic sample with a random start
• Stratified
• Cluster
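The designs listed above can be sketched in a few lines of Python. This is only an illustration: the sampling frame, the stratum rule, and the block layout are made up, and the function names are not from the lecture.

```python
import random

def simple_random_sample(frame, n, rng):
    """Draw n units without replacement; every subset of size n is equally likely."""
    return rng.sample(frame, n)

def systematic_sample(frame, n, rng):
    """Pick a random start in the first interval, then take every k-th unit."""
    k = len(frame) // n           # sampling interval (assumes len(frame) >= n)
    start = rng.randrange(k)      # random start in 0..k-1
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(frame, stratum_of, n_per_stratum, rng):
    """Group units by a known characteristic, then draw a simple random sample in each group."""
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {key: rng.sample(units, n_per_stratum) for key, units in strata.items()}

def cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Two-stage design: sample clusters first, then units inside each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {c: rng.sample(clusters[c], n_per_cluster) for c in chosen}

rng = random.Random(0)
frame = list(range(1000))                                  # hypothetical frame of unit IDs
srs = simple_random_sample(frame, 50, rng)
sysm = systematic_sample(frame, 50, rng)
strat = stratified_sample(frame, lambda u: u % 4, 10, rng)  # 4 made-up strata
blocks = {b: list(range(b * 20, (b + 1) * 20)) for b in range(50)}  # 50 made-up blocks of 20 units
clus = cluster_sample(blocks, 5, 4, rng)
```

The stratified design guarantees a fixed number of units from every stratum, while the cluster design touches only a handful of blocks, which is what keeps fieldwork cheap.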


Stratification

• Divide sample into subsamples, based on known characteristics (race, sex, religiosity, continent, department)

• Benefit: preserve or enhance variability


Cluster sampling

[Diagram levels: Block, HH Unit, Individual]


Effects of samples

• Obvious: influences marginals
• Less obvious
  – Allows effective use of time and effort
  – Effect on multivariate techniques
    • Sampling of independent variable: greater precision in regression estimates
    • Sampling on dependent variable: bias
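A small simulation can make the last two bullets concrete. The sketch below is an assumption-laden illustration, not the lecture's own example: the population (true slope 2.0, normal errors) and the two selection rules are invented.

```python
import random

def ols_slope(rows):
    """Bivariate OLS slope: cov(x, y) / var(x)."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(1)
TRUE_SLOPE = 2.0                     # made-up population relationship
population = []
for _ in range(20_000):
    x = rng.gauss(0, 1)
    y = TRUE_SLOPE * x + rng.gauss(0, 2)
    population.append((x, y))

slope_full = ols_slope(population)
# Selecting on x: keep only cases with extreme values of the independent variable.
slope_select_x = ols_slope([r for r in population if abs(r[0]) > 1])
# Selecting on y: keep only cases with a high value of the dependent variable.
slope_select_y = ols_slope([r for r in population if r[1] > 0])

print(f"full population: {slope_full:.2f}")
print(f"selected on x:   {slope_select_x:.2f}")
print(f"selected on y:   {slope_select_y:.2f}")
```

Selecting on x leaves the slope close to the true value (and spreads x out, which is where the extra precision comes from), while selecting on y pulls the estimated slope well below it.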


Sampling on Independent Variable

[Figure: two scatterplots of y against x]


Sampling on Dependent Variable

[Figure: two scatterplots of y against x]


Sampling

Consequences for Statistical Inference


Statistical Inference: Learning About the Unknown From the Known

• Reasoning forward: distributions of sample means, when the population mean, s.d., and n are known.

• Reasoning backward: learning about the population mean when only the sample mean, s.d., and n are known


Reasoning Forward


First, we play with some simulations

• http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html

• http://www.kuleuven.ac.be/ucs/java/index.htm


Exponential Distribution Example

[Figure: histogram of inc; x-axis: inc, 0 to 1.0e+06; y-axis: Fraction, 0 to .271441]

Mean = 250,000
Median = 125,000
s.d. = 283,474
Min = 0
Max = 1,000,000


Consider 10 random samples, of n = 100 apiece

Sample   Mean
1        253,396.9
2        198,789.6
3        271,074.2
4        238,928.7
5        280,657.3
6        241,369.8
7        249,036.7
8        226,422.7
9        210,593.4
10       212,137.3

[Figure: histogram of inc; x-axis: inc, 0 to 1.0e+06; y-axis: Fraction, 0 to .271441]


Consider 10,000 samples of n = 100

N = 10,000
Mean = 249,993
s.d. = 28,559
Skewness = 0.060
Kurtosis = 2.92

[Figure: histogram of (mean) inc; x-axis: 0 to 1.0e+06; y-axis: Fraction, 0 to .275972]
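The same experiment can be rerun in Python. This sketch uses a pure exponential population with mean 250,000 as a stand-in (the slide's population has a larger s.d. and a cap at 1,000,000, so the numbers will not match exactly), and it draws 2,000 samples rather than 10,000 to keep the run short.

```python
import random
import statistics

rng = random.Random(42)
POP_MEAN = 250_000     # matches the slide's population mean
N_SAMPLES = 2_000      # assumption: fewer than the slide's 10,000
n = 100                # size of each sample

def sample_mean(size):
    # expovariate takes the rate (1 / mean); an exponential's s.d. equals its mean
    draws = [rng.expovariate(1 / POP_MEAN) for _ in range(size)]
    return statistics.fmean(draws)

means = [sample_mean(n) for _ in range(N_SAMPLES)]
grand_mean = statistics.fmean(means)       # should sit near the population mean
sd_of_means = statistics.stdev(means)      # empirical standard error
predicted_se = POP_MEAN / n ** 0.5         # sigma / sqrt(n) for this stand-in population

print(f"mean of sample means: {grand_mean:,.0f}")
print(f"s.d. of sample means: {sd_of_means:,.0f} (sigma/sqrt(n) = {predicted_se:,.0f})")
```

Even though the population is strongly skewed, the sample means cluster tightly and roughly symmetrically around the population mean, with a spread close to sigma divided by the square root of n.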


Consider 1,000 samples of various sizes

n       Mean      s.d.     Skew    Kurt
10      250,105   90,891   0.38    3.13
100     250,498   28,297   0.02    2.90
1000    249,938   9,376    -0.50   6.80

[Figures: three histograms of (mean) inc, one per sample size; x-axis: 0 to 1.0e+06; y-axis: Fraction, 0 to .731]
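As a quick check on these numbers, the s.d. of the sample means should be close to the population s.d. divided by the square root of n. The sketch below uses only figures reported on the slides (population s.d. 283,474 from the exponential example); the variable names are ours.

```python
import math

POP_SD = 283_474                                       # population s.d. from the exponential example
observed_sd = {10: 90_891, 100: 28_297, 1000: 9_376}   # s.d. of the 1,000 sample means, by n

predicted_sd = {n: POP_SD / math.sqrt(n) for n in observed_sd}
ratios = {n: observed_sd[n] / predicted_sd[n] for n in observed_sd}

for n in observed_sd:
    print(f"n={n:>4}  predicted={predicted_sd[n]:>9,.0f}  "
          f"observed={observed_sd[n]:>7,}  ratio={ratios[n]:.2f}")
```

All three ratios come out near 1, so multiplying the sample size by ten shrinks the spread by roughly the square root of ten, as expected.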


Difference of means example

State 1: Mean = 250,000
[Figure: histogram of inc; x-axis: 0 to 1.0e+06; y-axis: Fraction, 0 to .280203]

State 2: Mean = 300,000
[Figure: histogram of inc2; x-axis: 0 to 1.0e+06; y-axis: Fraction, 0 to .251984]


Take 1,000 samples of 10, of each state, and compare them

First 10 samples

Sample   State 1        State 2
1        311,410   <    365,224
2        184,571   <    243,062
3        468,574   >    438,336
4        253,374   <    557,909
5        220,934   >    189,674
6        270,400   <    284,309
7        127,115   <    210,970
8        253,885   <    333,208
9        152,678   <    314,882
10       222,725   >    152,312


1,000 samples of 10

[Figure: scatterplot of (mean) inc2 against (mean) inc; both axes run from 0 to 1.1e+06]

State 2 > State 1: 673 times


1,000 samples of 100

[Figure: scatterplot of (mean) inc2 against (mean) inc; both axes run from 0 to 1.1e+06]

State 2 > State 1: 909 times


1,000 samples of 1,000

[Figure: scatterplot of (mean) inc2 against (mean) inc; both axes run from 0 to 1.1e+06]

State 2 > State 1: 1,000 times


Another way of looking at it: The distribution of Inc2 – Inc1

n        Mean     s.d.
10       51,845   124,815
100      49,704   38,774
1,000    49,816   13,932

[Figures: three histograms of diff, one per sample size; x-axis: diff, -400000 to 600000; y-axis: Fraction, 0 to .565]
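The means and s.d.s of the difference tie back to the "State 2 > State 1" counts on the preceding slides. If the difference is treated as roughly normal (an approximation, weakest at n = 10), the share of samples with a positive difference follows from the normal curve. The sketch uses only numbers from the slides.

```python
from statistics import NormalDist

# (mean, s.d.) of Inc2 - Inc1 across the 1,000 simulated samples, by sample size
diff_stats = {10: (51_845, 124_815), 100: (49_704, 38_774), 1000: (49_816, 13_932)}
# reported "State 2 > State 1" counts out of 1,000
observed_wins = {10: 673, 100: 909, 1000: 1000}

predicted_wins = {}
for n, (mean, sd) in diff_stats.items():
    p_positive = 1 - NormalDist(mean, sd).cdf(0)   # P(diff > 0) under a normal curve
    predicted_wins[n] = 1000 * p_positive
    print(f"n={n:>4}  predicted={predicted_wins[n]:6.0f}  observed={observed_wins[n]}")
```

The predicted counts land within about a dozen of the observed ones: the true gap of 50,000 is the same at every sample size, but it only becomes reliably visible once the standard error of the difference is small relative to it.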


Reasoning Backward

When you know $\bar{X}$, $s$, and $n$, but want to say something about $\mu$


Central Limit Theorem

As the sample size n increases, the distribution of the mean of a random sample taken from practically any population approaches a normal distribution, with mean $\mu$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$


Calculating Standard Errors

In general:

$$\text{std. err.} = \frac{s}{\sqrt{n}}$$


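Most important standard errors

Mean: $\dfrac{s}{\sqrt{n}}$

Proportion: $\sqrt{\dfrac{p(1-p)}{n}}$

Diff. of 2 means: $s_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}$

Regression (slope) coeff.: $\text{s.e.r.}\times\dfrac{1}{\sqrt{n}}\times\dfrac{1}{s_x}$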
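The four standard errors translate directly into code. The function names below are ours, and the proportion example uses made-up inputs; the other two calls reuse numbers that appear elsewhere in these slides.

```python
import math

def se_mean(s, n):
    """Standard error of a mean: s / sqrt(n)."""
    return s / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a proportion: sqrt(p(1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

def se_diff_means(s_pooled, n1, n2):
    """Standard error of a difference of two means, using a pooled s.d."""
    return s_pooled * math.sqrt(1 / n1 + 1 / n2)

def se_slope(ser, n, s_x):
    """Standard error of a bivariate regression slope: s.e.r. x 1/sqrt(n) x 1/s_x."""
    return ser / math.sqrt(n) / s_x

print(se_mean(283_474, 100))            # income example: s.d. 283,474, n = 100
print(se_proportion(0.5, 1000))         # hypothetical 50/50 poll of 1,000
print(se_slope(0.7115, 1861, 1.4788))   # residual-vote regression inputs
```

In every case the standard error falls with the square root of the sample size, which is why quadrupling n only halves the uncertainty.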


Return to the applets for the regression standard error

• http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html

• http://www.kuleuven.ac.be/ucs/java/index.htm


The Idea Behind Classical Hypothesis Testing

[Diagram labels: True mean or regression coefficient; Sample mean or regression coefficient; H0 = 0]


What We Know

• We know:
  – The sample mean/coeff. will not equal the population mean/coeff.
  – The sample mean/coeff., sample s.d./s.e., & n
• The question:
  – Is the sample mean/coeff. “far” from H0 or “close” to H0?


If n is sufficiently large, we know the distribution of sample means/coeffs. will obey the normal curve

[Figure: normal curve centered on the mean; x-axis marked from −4σ to 4σ; regions labeled 68%, 95%, and 99%]
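The percentages on the normal curve can be recomputed with Python's standard library. This is a side calculation rather than part of the slides; it also shows the exact cutoffs that give 95% and 99% coverage.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, s.d. 1

# Share of the distribution within k standard deviations of the mean
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

# Two-sided cutoffs that give exactly 95% and 99% coverage
cut95 = z.inv_cdf(0.975)
cut99 = z.inv_cdf(0.995)

for k, share in coverage.items():
    print(f"within {k} s.d.: {share:.4f}")
print(f"95% cutoff: {cut95:.3f}   99% cutoff: {cut99:.3f}")
```

The familiar 68/95/99 labels are rounded: two standard deviations cover slightly more than 95%, and three cover a bit more than 99%.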


Therefore….

• When the sample size is large (i.e., > 150), convert the difference into z units and consult a z table

Z = (H1 - H0) / s.e.
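In code, the z calculation and the table lookup look roughly like this. Here H1 is read as the sample estimate, and the input numbers are made up; `NormalDist` stands in for the printed z table.

```python
from statistics import NormalDist

def z_test(estimate, h0, se):
    """Return the z statistic and its two-sided p-value under the normal curve."""
    z = (estimate - h0) / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

# Hypothetical example: estimate 52, null value 50, standard error 1
z, p = z_test(estimate=52.0, h0=50.0, se=1.0)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

A z of 2 leaves a little under 5% in the two tails combined, which is why "about two standard errors from H0" is the usual rule of thumb for significance at the 5% level.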


Reading a z table

z table for standardized normal distribution. Image removed for copyright reasons.


Therefore….

• When the sample size is small (i.e., <150), convert the difference into t units and consult a t table

t = (H1 - H0) / s.e.


t (when the sample is small)

[Figure: t-distribution and z (normal) distribution curves plotted over −4 to 4]
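The gap between the two curves can be quantified without a t table. The sketch below integrates the Student t density numerically using only the standard library; the choice of degrees of freedom (5, 30, 150) and the test value of 2.0 are ours.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_sided_p(t, df, steps=20_000):
    """P(|T| > t), from a midpoint-rule integral of the density over [0, |t|]."""
    t = abs(t)
    h = t / steps
    area = sum(t_pdf((i + 0.5) * h, df) for i in range(steps)) * h
    return 1 - 2 * area

z_p = 2 * (1 - NormalDist().cdf(2.0))                 # normal-curve answer for a statistic of 2.0
p_by_df = {df: t_two_sided_p(2.0, df) for df in (5, 30, 150)}

print(f"z: {z_p:.4f}")
for df, p in p_by_df.items():
    print(f"t, df={df:>3}: {p:.4f}")
```

With 5 degrees of freedom the tails hold roughly twice as much probability as the normal curve gives, while by 150 the two answers are nearly identical, which is the logic behind switching from t to z for large samples.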


Reading a t table

t table for standardized normal distribution. Image removed for copyright reasons.


Doing a t-test

[Figure: histogram of diff9692; x-axis: -.2 to .4; y-axis: Fraction, 0 to .429558]

Q: How likely is it that the residual vote rate in 1996 is equal to the rate in 1992 (i.e., blank96 - blank92 = 0)?

Mean: 0.003069
s.d.: 0.02323
N: 1448

$$\text{s.e.} = \frac{s}{\sqrt{n}} = \frac{0.02323}{\sqrt{1448}} = 0.00061$$


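The picture

Mean: 0.003069
s.d.: 0.02323
N: 1448

[Figure: normal curve plotted against newz; x-axis ticks at -.00059, .000017, .000627, .00124, .00185, .00246, .003069]

$$\text{s.e.} = \frac{s}{\sqrt{n}} = \frac{0.02323}{\sqrt{1448}} = 0.00061$$

$$t = \frac{0.003069 - 0}{0.00061} = 5.028$$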
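The same arithmetic in Python, using the three summary numbers from the slide. Because N is large, the p-value below uses the normal curve as an approximation to the t distribution; that shortcut is ours, not the slide's.

```python
import math
from statistics import NormalDist

mean_diff = 0.003069   # mean of blank96 - blank92
sd_diff = 0.02323      # s.d. of the difference
n = 1448

se = sd_diff / math.sqrt(n)                          # standard error of the mean difference
t = (mean_diff - 0) / se                             # distance from H0 = 0 in s.e. units
p_two_sided = 2 * (1 - NormalDist().cdf(abs(t)))     # normal approximation, fine at this n

print(f"s.e. = {se:.5f}, t = {t:.3f}, p = {p_two_sided:.2g}")
```

A sample mean about five standard errors away from zero is extremely unlikely if the true difference were zero, so the null hypothesis of no change between 1992 and 1996 is rejected.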


The STATA output

. ttest blank96=blank92

Paired t test

------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 blank96 |    1448    .0242941    .0005116    .0194689    .0232904    .0252977
 blank92 |    1448     .021225    .0005382    .0204813    .0201692    .0222808
---------+--------------------------------------------------------------------
    diff |    1448     .003069    .0006104    .0232279    .0018717    .0042664
------------------------------------------------------------------------------

     Ho: mean(blank96 - blank92) = mean(diff) = 0

 Ha: mean(diff) < 0         Ha: mean(diff) ~= 0        Ha: mean(diff) > 0
     t =   5.0278               t =   5.0278               t =   5.0278
 P < t =   1.0000           P > |t| =   0.0000           P > t =   0.0000

. ttest diff9692=0

One-sample t test

------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
diff9692 |    1448     .003069    .0006104    .0232279    .0018717    .0042664
------------------------------------------------------------------------------
Degrees of freedom: 1447

     Ho: mean(diff9692) = 0

 Ha: mean < 0               Ha: mean ~= 0              Ha: mean > 0
     t =   5.0278               t =   5.0278               t =   5.0278
 P < t =   1.0000           P > |t| =   0.0000           P > t =   0.0000


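Final t-test

Q: Was there a relationship between residual vote and county size in 1996?

Slope coeff: -0.07510
s.e.r: 0.7115
N: 1861
Sx: 1.4788

$$\text{s.e.} = \frac{\text{s.e.r.}}{\sqrt{n}}\times\frac{1}{s_x} = \frac{0.7115}{\sqrt{1861}}\times\frac{1}{1.4788} = 0.01649\times 0.6762 = 0.01115$$

[Figure: blank96 and fitted values plotted against vap96_to; x-axis: 326 to 6.5e+06; y-axis: .000281 to .298789]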


Calculating t

$$t = \frac{-0.07510}{0.01115} = -6.7319$$
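The slope's standard error and t can be reproduced from the four summary numbers on the previous slide. Working from these rounded inputs gives a t that differs from the slide's value in the third decimal place; the variable names are ours.

```python
import math

slope = -0.07510   # slope coefficient
ser = 0.7115       # standard error of the regression (Root MSE)
n = 1861
s_x = 1.4788       # s.d. of the independent variable

se_slope = ser / math.sqrt(n) / s_x    # s.e.r. x 1/sqrt(n) x 1/s_x
t = slope / se_slope                   # H0: slope = 0

print(f"s.e. = {se_slope:.5f}, t = {t:.2f}")
```

A slope nearly seven standard errors below zero points to a clearly negative relationship between county size and the residual vote rate.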


The STATA output

. reg lblank96 lvap96

      Source |       SS       df       MS              Number of obs =    1861
-------------+------------------------------           F(  1,  1859) =   45.32
       Model |   22.941515     1   22.941515           Prob > F      =  0.0000
    Residual |  941.080329  1859  .506229332           R-squared     =  0.0238
-------------+------------------------------           Adj R-squared =  0.0233
       Total |  964.021844  1860  .518291314           Root MSE      =   .7115

------------------------------------------------------------------------------
    lblank96 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      lvap96 |  -.0750985   .0111556    -6.73   0.000    -.0969774   -.0532197
       _cons |  -3.129858   .1113781   -28.10   0.000    -3.348298   -2.911419
------------------------------------------------------------------------------


A word about standard errors and collinearity

• The problem: if X1 and X2 are highly correlated, then it will be difficult to precisely estimate the effect of either one of these variables on Y


How does having another collinear independent variable affect standard errors?

$$\text{s.e.}(\hat{\beta}_1) = \sqrt{\frac{1}{N-n-1}\cdot\frac{S_Y^2}{S_{X_1}^2}\cdot\frac{1-R_Y^2}{1-R_{X_1}^2}}$$

$R_{X_1}^2$: R2 of the “auxiliary regression” of X1 on all the other independent variables
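A short sketch shows how the auxiliary R2 drives the standard error. It assumes the usual form of the formula, s.e. = sqrt[(1/(N−n−1)) × (S_Y²/S_X1²) × (1−R_Y²)/(1−R_X1²)], and every input value below is invented for illustration.

```python
import math

def se_beta1(n_obs, n_vars, s_y, s_x1, r2_y, r2_x1):
    """s.e. of the coefficient on X1, given the fit of the main and auxiliary regressions."""
    return math.sqrt(
        (1 / (n_obs - n_vars - 1))
        * (s_y ** 2 / s_x1 ** 2)
        * ((1 - r2_y) / (1 - r2_x1))
    )

# Hypothetical setting: 1,000 cases, 2 regressors, unit variances, main R2 of 0.30
base = se_beta1(1000, 2, 1.0, 1.0, 0.30, 0.0)
inflation = {
    r2: se_beta1(1000, 2, 1.0, 1.0, 0.30, r2) / base
    for r2 in (0.0, 0.5, 0.75, 0.9)
}

for r2, factor in inflation.items():
    print(f"auxiliary R2 = {r2:.2f}: s.e. multiplied by {factor:.2f}")
```

Holding everything else fixed, the standard error grows by a factor of 1/sqrt(1 − auxiliary R2), so it doubles once the other regressors explain 75% of the variance in X1.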


Example: Effect of party, ideology, and religiosity on feelings toward Quincy Bush

                 Bush Feelings   Conserv.   Repub.   Religious
Bush Feelings    1.0             .39        .57      .16
Conserv.                         1.0        .46      .18
Repub.                                      1.0      .06
Relig.                                               1.0


Regression table (standard errors in parentheses)

             (1)            (2)            (3)            (4)
Intercept    32.7 (0.85)    32.9 (1.08)    32.6 (1.20)    29.3 (1.31)
Repub.       6.73 (0.244)   5.86 (0.27)    6.64 (0.241)   5.88 (0.27)
Conserv.     ---            2.11 (0.30)    ---            1.87 (0.30)
Relig.       ---            ---            7.92 (1.18)    5.78 (1.19)
N            1575           1575           1575           1575
R2           .32            .35            .35            .36
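The jump in the Repub. standard error between columns (1) and (2) can be roughly predicted from the collinearity formula. The sketch assumes the s.e. is proportional to sqrt[(1/(N−n−1)) × (1−R_Y²)/(1−R_X1²)], takes the auxiliary R2 as the squared Repub./Conserv. correlation of .46, and works from the rounded values in the tables, so it is only an approximate check.

```python
import math

se_repub_model1 = 0.244          # column (1): Repub. alone
r_repub_conserv = 0.46           # from the correlation matrix
r2_aux = r_repub_conserv ** 2    # auxiliary R2 when Conserv. is the only other regressor
r2_y_model1, r2_y_model2 = 0.32, 0.35
n_obs = 1575

ratio = math.sqrt(
    ((n_obs - 1 - 1) / (n_obs - 2 - 1))           # one more regressor uses one more d.f.
    * ((1 - r2_y_model2) / (1 - r2_y_model1))     # better fit shrinks the s.e.
    / (1 - r2_aux)                                # collinearity inflates it
)
predicted_se_model2 = se_repub_model1 * ratio

print(f"predicted s.e. in column (2): {predicted_se_model2:.3f}")
```

The prediction rounds to the 0.27 reported in column (2). Adding the uncorrelated Relig. variable (correlation .06 with Repub.) leaves the Repub. standard error almost unchanged in column (3).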