
June 25, 2008 Stat 111 - Lecture 14 - Two Means

Comparing Means from Two Samples

and

One-Sample Inference for Proportions

Statistics 111 – Lecture 14



Administrative Notes

• Homework 5 is posted on website

• Due Wednesday, July 1st



Outline

• Two Sample Z-test (known variance)

• Two Sample t-test (unknown variance)

• Matched Pair Test and Examples

• Tests and Intervals for Proportions (Chapter 8)


Comparing Two Samples

• Up to now, we have looked at inference for one sample of continuous data

• Our next focus in this course is comparing the data from two different samples

• For now, we will assume that these two different samples are independent of each other and come from two distinct populations

Population 1: $\mu_1$, $\sigma_1$

Sample 1: $\bar{x}_1$, $s_1$

Population 2: $\mu_2$, $\sigma_2$

Sample 2: $\bar{x}_2$, $s_2$


Blackout Baby Boom Revisited

• Nine months (Monday, August 8th) after Nov 1965 blackout, NY Times claimed an increased birth rate

• Already looked at single two-week sample: found no significant difference from usual rate (430 births/day)

• What if we instead look at difference between weekends and weekdays?

Sun   Mon   Tue   Wed   Thu   Fri   Sat
      452   470   431   448   467   377
344   449   440   457   471   463   405
377   453   499   461   442   444   415
356   470   519   443   449   418   394
399   451   468   432

Key: weekdays (Mon to Fri), weekends (Sat and Sun)


Two-Sample Z test

• We want to test the null hypothesis that the two populations have the same mean

• H0: $\mu_1 = \mu_2$ or equivalently, $\mu_1 - \mu_2 = 0$

• Two-sided alternative hypothesis: $\mu_1 - \mu_2 \neq 0$

• If we assume our population SDs $\sigma_1$ and $\sigma_2$ are known, we can calculate a two-sample Z statistic:

$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$$

• We can then calculate a p-value from this Z statistic using the standard normal distribution



Two-Sample Z test for Blackout Data

• To use the Z test, we need to assume that our population SDs are known: $\sigma_1 = s_1 = 21.7$ and $\sigma_2 = s_2 = 24.5$

$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{21.7^2/23 + 24.5^2/8}} = 7.5$$

• From the normal table, P(Z > 7.5) is less than 0.0002, so our p-value = 2 × P(Z > 7.5) is less than 0.0004

• Conclusion: there is a significant difference between birth rates on weekends and weekdays

• We don’t usually know the population SDs, so we need a method for unknown $\sigma_1$ and $\sigma_2$
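The Z statistic on this slide can be recomputed from the calendar table. The sketch below assumes the six-value first row of the table starts on a Monday, which gives 23 weekdays and 8 weekend days (consistent with n1 - 1 = 22 and n2 - 1 = 7 later in the lecture). The function name `two_sample_z` is illustrative and not part of the lecture.

```python
import math
from statistics import mean, stdev

# Daily birth counts from the calendar table, split by day type
# (assumes the first, six-value row runs Monday through Saturday).
weekdays = [452, 470, 431, 448, 467,
            449, 440, 457, 471, 463,
            453, 499, 461, 442, 444,
            470, 519, 443, 449, 418,
            451, 468, 432]
weekends = [377, 344, 405, 377, 415, 356, 394, 399]


def two_sample_z(x1, x2, sigma1, sigma2):
    """Two-sample Z statistic and two-sided p-value for H0: mu1 = mu2."""
    se = math.sqrt(sigma1 ** 2 / len(x1) + sigma2 ** 2 / len(x2))
    z = (mean(x1) - mean(x2)) / se
    # Two-sided p-value: 2 * P(Z > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# As on the slide, the sample SDs stand in for the "known" population SDs.
z, p = two_sample_z(weekdays, weekends, stdev(weekdays), stdev(weekends))
print(f"s1 = {stdev(weekdays):.1f}, s2 = {stdev(weekends):.1f}")
print(f"Z = {z:.2f}, two-sided p-value = {p:.2e}")
```

With this split the sample SDs come out close to the 21.7 and 24.5 quoted above, and Z rounds to 7.5.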



Two-Sample t test

• We still want to test the null hypothesis that the two populations have equal means (H0: $\mu_1 - \mu_2 = 0$)

• If $\sigma_1$ and $\sigma_2$ are unknown, then we need to use the sample SDs $s_1$ and $s_2$ instead, which gives us the two-sample T statistic:

$$T = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$

• The p-value is calculated using the t distribution, but what degrees of freedom do we use?

• The df can be complicated and is often calculated by software

• Simpler and more conservative: set degrees of freedom equal to the smaller of $(n_1 - 1)$ and $(n_2 - 1)$
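To make the two choices of degrees of freedom concrete, the sketch below compares the conservative rule with the Welch-Satterthwaite approximation. Using Welch-Satterthwaite as "the software formula" is an assumption here, since the slides do not name the formula; the summary statistics are the blackout values used elsewhere in this lecture.

```python
# Summary statistics for the blackout data (sample 1 = weekdays, sample 2 = weekends).
s1, n1 = 21.7, 23
s2, n2 = 24.5, 8

# Estimated variance of each sample mean.
v1 = s1 ** 2 / n1
v2 = s2 ** 2 / n2

# Conservative rule from the lecture: the smaller of n1 - 1 and n2 - 1.
df_conservative = min(n1 - 1, n2 - 1)

# Welch-Satterthwaite approximation (assumed here to be the software choice).
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(f"conservative df = {df_conservative}")
print(f"Welch-Satterthwaite df = {df_welch:.1f}")
```

The approximate df always lies between the smaller of the two (n - 1) values and n1 + n2 - 2, so the conservative rule never overstates the degrees of freedom.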



Two-Sample t test for Blackout Data

• To use the t test, we use our sample standard deviations $s_1 = 21.7$ and $s_2 = 24.5$

$$T = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{21.7^2/23 + 24.5^2/8}} = 7.5$$

• We need to look up the tail probabilities using the t distribution

• Degrees of freedom is the smaller of $n_1 - 1 = 22$ and $n_2 - 1 = 7$, so df = 7



Two-Sample t test for Blackout Data

• From t-table with df = 7, we see that P(T > 7.5) < 0.0005

• If our alternative hypothesis is two-sided, then we know that our p-value < 2 × 0.0005 = 0.001

• We reject the null hypothesis at $\alpha$-level of 0.05 and conclude there is a significant difference between birth rates on weekends and weekdays

• Same result as Z-test, but we are a little more conservative



Two-Sample Confidence Intervals

• In addition to two sample t-tests, we can also use the t distribution to construct confidence intervals for the mean difference

• When $\sigma_1$ and $\sigma_2$ are unknown, we can form the following 100·C% confidence interval for the mean difference $\mu_1 - \mu_2$:

$$(\bar{x}_1 - \bar{x}_2) \pm t_k^* \sqrt{s_1^2/n_1 + s_2^2/n_2}$$

• The critical value $t_k^*$ is calculated from a t distribution with degrees of freedom k

• k is equal to the smaller of $(n_1 - 1)$ and $(n_2 - 1)$



Confidence Interval for Blackout Data

• We can calculate a 95% confidence interval for the mean difference between birth rates on weekdays and weekends:

• Our critical value $t_k^* = 2.365$ is calculated from a t distribution with 7 degrees of freedom, so our 95% confidence interval is:

$$(\bar{x}_1 - \bar{x}_2) \pm 2.365 \sqrt{21.7^2/23 + 24.5^2/8}$$

• Since zero is not contained in this interval, we know the difference is statistically significant!
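As a rough check of the interval, the sketch below recomputes it from the calendar table with the critical value 2.365 quoted above. It assumes the six-value first row of the table starts on a Monday (23 weekdays, 8 weekend days); the variable names are illustrative.

```python
import math
from statistics import mean, stdev

# Daily birth counts, split by day type (assumed Monday start for the first row).
weekdays = [452, 470, 431, 448, 467,
            449, 440, 457, 471, 463,
            453, 499, 461, 442, 444,
            470, 519, 443, 449, 418,
            451, 468, 432]
weekends = [377, 344, 405, 377, 415, 356, 394, 399]

t_star = 2.365  # 95% critical value for df = 7, as quoted on the slide

diff = mean(weekdays) - mean(weekends)
se = math.sqrt(stdev(weekdays) ** 2 / len(weekdays)
               + stdev(weekends) ** 2 / len(weekends))
lower, upper = diff - t_star * se, diff + t_star * se

print(f"difference in means = {diff:.1f} births/day")
print(f"95% CI for mu1 - mu2: ({lower:.1f}, {upper:.1f})")
```

Both endpoints are well above zero, which agrees with the rejection of H0 in the t test.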



Matched Pairs

• Sometimes the two samples that are being compared are matched pairs (not independent)

• Example: Sentences for crack versus powder cocaine

• We could test for the mean difference between X1 = crack sentences and X2 = powder sentences

• However, we realize that these data are paired: each row of sentences has a matching quantity of cocaine

• Our t-test for two independent samples ignores this relationship



Matched Pairs Test

• First, calculate the difference d = X1 - X2 for each pair

• Then, calculate the mean and SD of the differences d

                   Sentences
Quantity   Crack (X1)   Powder (X2)   Difference (d = X1 - X2)
5          70.5         12            58.5
25         87.5         18            69.5
100        136          30            106.0
200        169.5        37            132.5
500        211.5        70.5          141.0
2000       264          87.5          176.5
5000       264          136           128.0
50000      264          211.5         52.5
150000     264          264           0.0


Matched Pairs Test

• Instead of a two-sample test for the difference between X1 and X2, we do a one-sample test on the difference d

• Null hypothesis: mean difference between the two samples is equal to zero

H0: $\mu_d = 0$ versus Ha: $\mu_d \neq 0$

• Usual test statistic when population SD is unknown:

$$T = \frac{\bar{d} - 0}{s_d/\sqrt{n}} = 5.24$$

• p-value calculated from t-distribution with df = 8

• P(T > 5.24) < 0.0005 so p-value < 0.001

• Difference between crack and powder sentences is statistically significant at $\alpha$-level of 0.05
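The matched pairs statistic can be recomputed from the sentencing table. This is a minimal sketch using only the nine rows shown there; the variable names are illustrative.

```python
import math
from statistics import mean, stdev

# Sentences from the table, one entry per cocaine quantity (row order preserved).
crack = [70.5, 87.5, 136, 169.5, 211.5, 264, 264, 264, 264]
powder = [12, 18, 30, 37, 70.5, 87.5, 136, 211.5, 264]

# Step 1: difference within each matched pair.
d = [x1 - x2 for x1, x2 in zip(crack, powder)]

# Step 2: one-sample t statistic on the differences, testing H0: mu_d = 0.
n = len(d)
d_bar = mean(d)
s_d = stdev(d)
t_stat = d_bar / (s_d / math.sqrt(n))

print(f"mean d = {d_bar:.2f}, SD of d = {s_d:.2f}")
print(f"T = {t_stat:.2f} with df = {n - 1}")
```

The result agrees with the T = 5.24 and df = 8 used for the p-value.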


Matched Pairs Confidence Interval

• We can also construct a confidence interval for the mean difference $\mu_d$ of matched pairs

• We can just use the confidence intervals we learned for the one-sample, unknown $\sigma$ case:

$$\bar{d} \pm t_{n-1}^* \frac{s_d}{\sqrt{n}}$$

• Example: 95% confidence interval for mean difference between crack and powder sentences:
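A sketch of the 95% interval for the crack versus powder example follows. The critical value 2.306 is the standard t-table entry for 8 degrees of freedom at 95% confidence; it does not appear on the slides and is supplied here as an assumption.

```python
import math
from statistics import mean, stdev

# Paired differences d = X1 - X2 from the sentencing table.
d = [58.5, 69.5, 106.0, 132.5, 141.0, 176.5, 128.0, 52.5, 0.0]

t_star = 2.306  # t-table value for df = 8, 95% confidence (not from the slides)

n = len(d)
margin = t_star * stdev(d) / math.sqrt(n)
lower, upper = mean(d) - margin, mean(d) + margin

print(f"95% CI for mu_d: ({lower:.1f}, {upper:.1f})")
```

The interval excludes zero, in line with the matched pairs test.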



Summary of Two-Sample Tests

• Two independent samples with known $\sigma_1$ and $\sigma_2$

• We use two-sample Z-test with p-values calculated using the standard normal distribution

• Two independent samples with unknown $\sigma_1$ and $\sigma_2$

• We use two-sample t-test with p-values calculated using the t distribution with degrees of freedom equal to the smaller of $n_1 - 1$ and $n_2 - 1$

• Also can make confidence intervals using t distribution

• Two samples that are matched pairs

• We first calculate the differences for each pair, and then use our usual one-sample t-test on these differences



One-Sample Inference for Proportions


Revisiting Count Data

• Chapters 6 and 7 covered inference for the population mean of continuous data

• We now return to count data:

• Example: Opinion Polls

• $X_i = 1$ if you support Obama, $X_i = 0$ if not

• We call p the population proportion for $X_i = 1$

• What is the proportion of people who support the war?

• What is the proportion of Red Sox fans at Penn?



Inference for population proportion p

• We will use the sample proportion $\hat{p}$ as our best estimate of the unknown population proportion p

$\hat{p} = Y/n$, where Y = sample count

• Tool 1: use our sample statistic as the center of an entire confidence interval of likely values for our population parameter

Confidence Interval : Estimate ± Margin of Error

• Tool 2: Use the data to perform a specific hypothesis test

• Formulate your null and alternative hypotheses

• Calculate the test statistic

• Find the p-value for the test statistic



Distribution of Sample Proportion

• In Chapter 5, we learned that the sample proportion technically has a binomial distribution

• However, we also learned that if the sample size is large, the sample proportion approximately follows a Normal distribution with mean and standard deviation:

$$\mu_{\hat{p}} = p, \qquad \sigma_{\hat{p}} = \sqrt{p(1-p)/n}$$

• We will essentially use this approximation throughout chapter 8, so we can make probability calculations using the standard normal table
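To get a feel for how close the Normal approximation is, the sketch below compares an exact binomial tail probability with its Normal counterpart. The values n = 192, p = 0.10 and the cutoff of 25 successes are chosen only for illustration, and the helper names are hypothetical.

```python
import math

n, p = 192, 0.10  # illustrative sample size and population proportion

# Mean and standard deviation of the sample proportion under the approximation.
mean_phat = p
sd_phat = math.sqrt(p * (1 - p) / n)


def binom_upper_tail(n, p, y):
    """Exact P(Y >= y) for Y ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(y, n + 1))


def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


y = 25
exact = binom_upper_tail(n, p, y)
approx = normal_upper_tail((y / n - mean_phat) / sd_phat)

print(f"SD of sample proportion = {sd_phat:.4f}")
print(f"exact binomial P(Y >= {y}) = {exact:.3f}")
print(f"Normal approximation     = {approx:.3f}")
```

The two tail probabilities are close but not identical; the approximation used here applies no continuity correction.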



Confidence Interval for a Proportion

• We could use our sample proportion $\hat{p}$ as the center of a confidence interval of likely values for the population parameter p:

$$\hat{p} \pm Z^* \sqrt{p(1-p)/n}$$

• The width of the interval is a multiple of the standard deviation of the sample proportion

• The multiple Z* is calculated from a normal distribution and depends on the confidence level



Confidence Interval for a Proportion

• One Problem: this margin of error involves the population proportion p, which we don’t actually know!

• Solution: substitute in the sample proportion $\hat{p}$ for the population proportion p, which gives us the interval:

$$\hat{p} \pm Z^* \sqrt{\hat{p}(1-\hat{p})/n}$$



Example: Red Sox fans at Penn

• What proportion of Penn students are Red Sox fans?

• Use Stat 111 class survey as sample

• Y = 25 out of n = 192 students are Red Sox fans, so $\hat{p} = 25/192 = 0.13$

• 95% confidence interval for the population proportion:

$$0.13 \pm 1.96 \sqrt{0.13 \times 0.87/192} = (0.08, 0.18)$$

• Proportion of Red Sox fans at Penn is probably between 8% and 18%
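The interval for the Red Sox example can be sketched as follows. The function name `proportion_ci` and the default critical value 1.96 for 95% confidence are illustrative choices, not part of the lecture.

```python
import math


def proportion_ci(y, n, z_star=1.96):
    """Large-sample confidence interval for a proportion.

    Uses the sample proportion in the standard error, as described above.
    """
    p_hat = y / n
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


lower, upper = proportion_ci(25, 192)
print(f"p-hat = {25 / 192:.3f}")
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

Rounded to whole percentages, the endpoints match the 8% and 18% quoted on the slide.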



Hypothesis Test for a Proportion

• Suppose that we are now interested in using our count data to test a hypothesized population proportion $p_0$

• Example: an older study says that the proportion of Red Sox fans at Penn is 0.10.

• Does our sample show a significantly different proportion?

• First Step: Null and alternative hypotheses

H0: $p = 0.10$ vs. Ha: $p \neq 0.10$

• Second Step: Test Statistic

$$Z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}}$$



Hypothesis Test for a Proportion

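• Problem: test statistic involves population proportion p

• For confidence intervals, we plugged in the sample proportion $\hat{p}$, but for test statistics, we plug in the hypothesized proportion $p_0$:

$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$$

• Example: test statistic for Red Sox example

$$Z = \frac{0.13 - 0.10}{\sqrt{0.10 \times 0.90/192}} = 1.39$$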



Hypothesis Test for a Proportion

• Third step: need to calculate a p-value for our test statistic using the standard normal distribution

• Red Sox Example: Test statistic Z = 1.39

• What is the probability of getting a test statistic as extreme or more extreme than Z = 1.39? From the normal table, P(Z > 1.39) = 0.082

• Two-sided alternative, so p-value = 2 × P(Z > 1.39) = 0.16

• We don’t reject H0 at the $\alpha = 0.05$ level, and conclude that the Red Sox proportion is not significantly different from $p_0 = 0.10$
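The three steps of the test can be sketched in a few lines. The function name `proportion_z_test` is illustrative; the standard normal tail is computed with `math.erfc` rather than read from a table.

```python
import math


def proportion_z_test(y, n, p0):
    """Z statistic and two-sided p-value for H0: p = p0."""
    p_hat = y / n
    # The hypothesized p0, not p-hat, goes into the standard error.
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    # Two-sided p-value: 2 * P(Z > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = proportion_z_test(25, 192, 0.10)
print(f"Z = {z:.2f}, two-sided p-value = {p_value:.2f}")
print("reject H0" if p_value < 0.05 else "do not reject H0")
```

Small differences from the table-based figures come only from rounding the sample proportion and Z before the lookup.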



Another Example

• Mass ESP experiment in 1977 Sunday Mirror (UK)

• Psychic hired to send readers a mental message about a particular color (out of 5 choices). Readers then mailed back the color that they “received” from psychic

• Newspaper declared the experiment a success because, out of 2355 responses, they received 521 correct ones ($\hat{p} = 521/2355 \approx 0.22$)

• Is the proportion of correct answers statistically different from what we would expect by chance ($p_0 = 0.2$)?

H0: $p = 0.2$ vs. Ha: $p \neq 0.2$



Mass ESP Example

• Calculate a p-value using the standard normal distribution: Z = 2.43, and P(Z > 2.43) = 0.0075

• Two-sided alternative, so p-value = 2 × P(Z > 2.43) = 0.015

• We reject H0 at the $\alpha = 0.05$ level, and conclude that the survey proportion is significantly different from $p_0 = 0.20$

• We could also calculate a 95% confidence interval for p:

$$0.22 \pm 1.96 \sqrt{0.22 \times 0.78/2355} = (0.203, 0.237)$$

• The interval doesn’t contain 0.20
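The ESP numbers can be checked the same way. The slide's Z = 2.43 appears to correspond to the sample proportion rounded to 0.22; that reading is an assumption, so the sketch below shows both the rounded and the unrounded (521/2355) versions. Helper names are illustrative.

```python
import math

n, y, p0 = 2355, 521, 0.20


def z_stat(p_hat, p0, n):
    """One-proportion Z statistic with the hypothesized p0 in the standard error."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


def two_sided_p(z):
    """Two-sided p-value 2 * P(Z > |z|) from the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


z_rounded = z_stat(0.22, p0, n)   # sample proportion rounded to two decimals
z_exact = z_stat(y / n, p0, n)    # unrounded 521/2355

# 95% interval with the sample proportion in the standard error.
p_hat = y / n
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - margin, p_hat + margin)

print(f"rounded: Z = {z_rounded:.2f}, p-value = {two_sided_p(z_rounded):.3f}")
print(f"exact:   Z = {z_exact:.2f}, p-value = {two_sided_p(z_exact):.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Either way the p-value is below 0.05 and the interval lies above 0.20, so the conclusion does not depend on the rounding.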



Margin of Error

• A confidence interval for the proportion p is centered at the sample proportion $\hat{p}$ and has a margin of error:

$$m = Z^* \sqrt{\hat{p}(1-\hat{p})/n}$$

• Before the study begins, we can calculate the sample size needed for a desired margin of error

• Problem: we don’t know the sample proportion before the study begins!

• Solution: use $\hat{p} = 0.5$, which gives us the maximum m

• So, if we want a margin of error less than m, we need

$$n \geq \left(\frac{Z^*}{2m}\right)^2$$



Margin of Error Examples

• Red Sox Example: how many students should I poll in order to have a margin of error less than 5% in a 95% confidence interval?

$$n \geq \left(\frac{1.96}{2 \times 0.05}\right)^2 = 384.16$$

• We would need a sample size of 385 students

• ESP example: how many responses must the newspaper receive to have a margin of error less than 1% in a 95% confidence interval?

$$n \geq \left(\frac{1.96}{2 \times 0.01}\right)^2 = 9604$$
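Both sample-size questions follow from the worst-case choice of 0.5 for the sample proportion. The sketch below uses exact fractions so that rounding up is not affected by floating-point error; the function name `sample_size` and the string-valued arguments are illustrative choices.

```python
import math
from fractions import Fraction


def sample_size(margin, z_star="1.96"):
    """Smallest n with z_star * sqrt(0.25 / n) <= margin.

    Uses the worst case p-hat = 0.5, so n = ceil((z_star / (2 * margin)) ** 2).
    Arguments are decimal strings so the arithmetic stays exact.
    """
    m = Fraction(margin)
    z = Fraction(z_star)
    return math.ceil((z / (2 * m)) ** 2)


print(sample_size("0.05"))  # Red Sox example, 5% margin
print(sample_size("0.01"))  # ESP example, 1% margin
```

The first call gives the 385 students quoted above. A smaller margin is far more expensive: cutting it by a factor of 5 multiplies the required sample size by about 25.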



Next Class - Lecture 15

• Two-Sample Inference for Proportions

• Moore, McCabe and Craig: Section 8.2