Week 2 a Confidence intervals b Hypothesis testing for proportions · 2011-10-31 · 150 Ernst Wit /...

Week 2a Confidence intervals

b Hypothesis testing for proportions

Ernst Wit / Wim Krijnen

Johann Bernoulli Institute

[email protected]

http://www.math.rug.nl/~ernst


Estimation

Remember:

Statistical inference is the business of calculating statistics on variables in a sample in order to say something about parameters in a population.

One statistic we like to calculate is an estimator.

Example. In a country, two candidates are involved in a presidential election, Pal and Oba. The country has 10,000 inhabitants, of which 5,600 are in favour of Oba and 4,400 in favour of Pal.

- Parameter: $p = 0.56$ (true fraction of those in favour of Oba)
- Sample: a news organization samples 100 people for their votes.
- Statistic: $\hat{p}$ = (number in the sample in favour of Oba) / 100 (the estimator).


How good is the estimator?

- Let’s assume there are 1,000 news organizations and each performs its own independent poll.
- None of them will be certain to get it right, but their answers will vary around the truth $p = 0.56$.


7.1 Confidence interval

> country<-c(rep(1,5600),rep(0,4400))

> phat <- vector('numeric',1000)

> for (i in 1:1000){

+ phat[i] <- mean(sample(country,size=100,replace=TRUE))}

Quantiles of the 1,000 values of $\hat{p}$:

> quantile(phat,c(0.1,0.9))

10% 90%

0.50 0.62

> quantile(phat,c(0.025,0.975))

2.5% 97.5%

0.46 0.66


The results above suggest

$P(0.50 \le \hat{p} \le 0.62) \approx 0.80$,

$P(0.46 \le \hat{p} \le 0.66) \approx 0.95$.

Since the true $p = 0.56$,

$P(0.50 - p \le \hat{p} - p \le 0.62 - p) \approx 0.80$

$P(-0.06 \le \hat{p} - p \le 0.06) \approx 0.80$

The distance between $\hat{p}$ and $p$ is less than 0.06 with 80% certainty. Equivalently,

$P(\hat{p} - 0.06 \le p \le \hat{p} + 0.06) \approx 0.80$

“True $p$ lies within the interval $(\hat{p} - 0.06,\ \hat{p} + 0.06)$ with 80% confidence”: an 80% confidence interval.


Deriving some theory...

What can we say about the approximate distribution of $\hat{p}$?

1. It is approximately normal.

2. It is approximately unbiased: $E\,\hat{p} = p$

> mean(phat)

[1] 0.56047

3. It has a tractable standard deviation: $SD(\hat{p}) = \sqrt{p(1-p)/n}$

> sqrt(0.56*0.44/100)

[1] 0.04963869

> sd(phat)

[1] 0.04952708

7.2 Confidence interval for population proportion, p

We have seen that (approximately)

$$\hat{p} \sim N\left(p,\ \frac{p(1-p)}{n}\right),$$

and so (approximately)

$$\frac{\hat{p} - p}{SE(\hat{p})} \sim N(0, 1), \quad \text{where } SE(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}.$$

From the normal distribution we know, for example, that

$$P\left(-1.96 \le \frac{\hat{p} - p}{SE(\hat{p})} \le 1.96\right) \approx 0.95$$

- The interval $(\hat{p} - 1.96\,SE(\hat{p}),\ \hat{p} + 1.96\,SE(\hat{p}))$ contains $p$ with approximate probability 0.95.
- This is a 95% confidence interval.

In general

$\alpha/2 = P(Z \le -z^*)$, $\quad 1 - \alpha/2 = P(Z \le z^*)$

zstar <- -qnorm(alpha/2); zstar <- qnorm(1-alpha/2)  # two equivalent ways

[Figure: standard normal density against the z-axis (−4 to 4). Rejection regions of area α/2 lie in each tail, beyond −z_{α/2} and z_{α/2}; the acceptance region of area 1 − α lies between them.]

So a $(1-\alpha)100\%$ confidence interval for $p$ is given by

$$(\hat{p} - z^* \cdot SE(\hat{p}),\ \hat{p} + z^* \cdot SE(\hat{p}))$$
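The general formula can be wrapped in a small function. This is only a sketch: the name `wald_ci` and its arguments are hypothetical and do not come from the slides, and the interval is the plain normal-approximation interval described above, without any correction.

```r
# Hypothetical helper (not from the slides): normal-approximation CI
# for a proportion, phat -/+ zstar * SE(phat).
wald_ci <- function(x, n, conf.level = 0.95) {
  alpha <- 1 - conf.level
  phat  <- x / n                          # sample proportion
  se    <- sqrt(phat * (1 - phat) / n)    # estimated standard error
  zstar <- qnorm(1 - alpha / 2)           # upper alpha/2 normal quantile
  phat + c(-1, 1) * zstar * se
}

ci_election <- wald_ci(58, 100)     # a single poll with 58 of 100 in favour
ci_voters   <- wald_ci(466, 1013)   # 466 of 1,013 voters
round(ci_election, 3)
round(ci_voters, 4)
```

Changing `conf.level` only changes `zstar`; the estimate and its standard error stay the same.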


What does this mean in practice?

“How does this example really help me, because in practice I have only one example?”

Well, let’s go back to the Election example. We saw in one sample:

$\hat{p} = 0.58$,

and so we can calculate

$$SE(\hat{p}) = \sqrt{\frac{0.58 \times 0.42}{100}} = 0.04935585$$

[NOTE: not so different from 0.04952708 from 1,000 samples]

So we can claim that a 95% CI is given by:

$(0.58 - 0.04935 \times 1.96,\ 0.58 + 0.04935 \times 1.96) = (0.483, 0.677)$


Example 7.2

466 out of 1,013 voters rate the president's performance as “good”. Construct a 95% confidence interval for the true proportion.

> n <- 1013; alpha <- 0.05

> phat <- 466/n

> SE <- sqrt(phat*(1-phat)/n)

> zstar <- qnorm(1-alpha/2)

> round(phat + c(-1,1)*zstar*SE,4)

[1] 0.4293 0.4907

This means

The “probability” that the interval covers the true fraction is 0.95.

We say

“We are 95% confident that the true fraction lies between 0.4293 and 0.4907.”


Let R do the work: prop.test

> prop.test(466,1013,conf.level=0.95)

1-sample proportions test with continuity correction

data: 466 out of 1013, null probability 0.5

X-squared = 6.3179, df = 1, p-value = 0.01195

alternative hypothesis: true p is not equal to 0.5

95 percent confidence interval:

0.4290475 0.4912989

sample estimates:

p

0.4600197

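Note: Slightly different result from solving Equation 7.1 directly, because prop.test uses a score-based (Wilson) interval with a continuity correction rather than $\hat{p} \pm z^* \cdot SE(\hat{p})$.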
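The small discrepancy can be traced in a few lines. The sketch below computes the score (Wilson) interval by hand and compares it with `prop.test` when the continuity correction is switched off via `correct = FALSE`; the variable names are illustrative only.

```r
x <- 466; n <- 1013; alpha <- 0.05
phat  <- x / n
zstar <- qnorm(1 - alpha / 2)

# Score (Wilson) interval, without continuity correction
centre <- (phat + zstar^2 / (2 * n)) / (1 + zstar^2 / n)
half   <- zstar * sqrt(phat * (1 - phat) / n + zstar^2 / (4 * n^2)) /
          (1 + zstar^2 / n)
wilson <- centre + c(-1, 1) * half

# The same interval from R, continuity correction switched off
r_ci <- as.numeric(prop.test(x, n, conf.level = 1 - alpha,
                             correct = FALSE)$conf.int)

rbind(wilson, r_ci)
```

With the default `correct = TRUE` the interval widens slightly, which gives the output shown above.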


7.5 Confidence intervals for differences of proportions

Example 7.8: Poll in Week 1: 1,000 interviewed, 560 agree; Week 2: 1,200 interviewed, 570 agree.

Aim: to see if opinion in the population has changed.

Statistic:

$$\text{Proportional change} = \frac{570}{1200} - \frac{560}{1000} = -0.085.$$

Is this really a change?

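We use (again) the facts that:

- If $n$ is large, then $\hat{p}_2 - \hat{p}_1$ is approximately normal.
- $E(\hat{p}_2 - \hat{p}_1) = p_2 - p_1$
- $SE(\hat{p}_2 - \hat{p}_1) \approx \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

Hence, approximately,

$$\frac{\hat{p}_2 - \hat{p}_1 - (p_2 - p_1)}{SE(\hat{p}_2 - \hat{p}_1)} \sim N(0, 1)$$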


Confidence intervals for differences of proportions (2)

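This means that a $(1-\alpha)100\%$ CI for the proportion difference is:

$$\left(\hat{p}_2 - \hat{p}_1 - z_{\alpha/2} \times SE(\hat{p}_2 - \hat{p}_1),\ \ \hat{p}_2 - \hat{p}_1 + z_{\alpha/2} \times SE(\hat{p}_2 - \hat{p}_1)\right).$$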

In example 7.8:

> prop.test(x=c(560,570),n=c(1000,1200),conf.level=0.95)

data: c(560, 570) out of c(1000, 1200)

X-squared = 15.437, df = 1, p-value = 8.53e-05

95 percent confidence interval: 0.04231207 0.12768793

sample estimates: prop 1 prop 2 0.560 0.475

Conclusion: 0 not in CI; non-zero difference with 95% confidence. (R reports the interval for $p_1 - p_2$, hence the positive endpoints.)
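As a check, the interval for $p_2 - p_1$ can be computed directly from the formula. This is a sketch with illustrative variable names; it omits the continuity correction that `prop.test` applies by default, so it is compared against `prop.test(..., correct = FALSE)`, whose interval refers to $p_1 - p_2$ and therefore has to be sign-flipped.

```r
x <- c(560, 570); n <- c(1000, 1200)
phat   <- x / n
diff21 <- phat[2] - phat[1]                 # Week 2 minus Week 1
se     <- sqrt(sum(phat * (1 - phat) / n))  # SE of the difference
zstar  <- qnorm(0.975)
ci21   <- diff21 + c(-1, 1) * zstar * se    # CI for p2 - p1

# R's interval is for p1 - p2; no continuity correction here
r_ci <- as.numeric(prop.test(x = x, n = n, correct = FALSE)$conf.int)

round(ci21, 4)
round(-rev(r_ci), 4)   # same interval after flipping the sign
```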


7.5.2 Difference in means

Example 7.9: Weight loss under condition placebo (Group 1) or condition drug ephedra (Group 2).

x <- c(0,0,0,2,4,5,13,14,14,14,15,17,17)

y <- c(0,6,7,8,11,13,16,16,16,17,18)

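dif <- numeric(0)  # must exist before the loop appends to it

for (i in 1:1000){

# resample each group with replacement (bootstrap)

x.new <- sample(x,size=length(x),replace=TRUE)

y.new <- sample(y,size=length(y),replace=TRUE)

dif<-c(dif,mean(x.new)-mean(y.new))}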

quantile(dif,c(.025,.975))

2.5% 97.5%

-7.610140 2.168007

hist(dif)

[Figure: histogram of dif; horizontal axis from −10 to 5, frequency axis from 0 to 150.]


General (normal) theory for comparing two means

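Sample 1: $X_1, X_2, \dots, X_{n_x} \sim N(\mu_x, \sigma_x)$ gives mean $\bar{X}$, var. $s_x^2$

Sample 2: $Y_1, Y_2, \dots, Y_{n_y} \sim N(\mu_y, \sigma_y)$ gives mean $\bar{Y}$, var. $s_y^2$

Problem: Construct a CI for $\mu_x - \mu_y$ based on $\bar{X} - \bar{Y}$.

$$T = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{SE(\bar{X} - \bar{Y})} \sim t\text{-distribution with } df \text{ degrees of freedom}$$

$$df = \begin{cases} n_x + n_y - 2 & \text{if } \sigma_x = \sigma_y \\ \left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^2 \cdot \left(\frac{(s_x^2/n_x)^2}{n_x - 1} + \frac{(s_y^2/n_y)^2}{n_y - 1}\right)^{-1} & \text{if } \sigma_x \ne \sigma_y \end{cases}$$

$$SE(\bar{X} - \bar{Y}) \approx \begin{cases} \sqrt{s_p^2\,(1/n_x + 1/n_y)} & \text{if } \sigma_x = \sigma_y \\ \sqrt{s_x^2/n_x + s_y^2/n_y} & \text{if } \sigma_x \ne \sigma_y \end{cases}$$

where $s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}$ is the pooled variance.

$(1-\alpha) \cdot 100\%$ C.I. $= (\bar{X} - \bar{Y}) \pm t^* \cdot SE(\bar{X} - \bar{Y})$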
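To see the unequal-variance formulas at work, the following sketch evaluates them on the weight-loss data of Example 7.9 and compares the result with `t.test`. The variable names are illustrative; only `t.test`, `qt`, `var` and `mean` are R functions.

```r
x <- c(0,0,0,2,4,5,13,14,14,14,15,17,17)   # placebo (Group 1)
y <- c(0,6,7,8,11,13,16,16,16,17,18)       # ephedra (Group 2)
nx <- length(x); ny <- length(y)
vx <- var(x) / nx; vy <- var(y) / ny       # s_x^2/n_x and s_y^2/n_y

se <- sqrt(vx + vy)                        # SE when variances differ
df <- (vx + vy)^2 / (vx^2 / (nx - 1) + vy^2 / (ny - 1))
tstar <- qt(0.975, df)                     # t quantile with fractional df
ci_welch <- (mean(x) - mean(y)) + c(-1, 1) * tstar * se

# Reference: R's own unequal-variance interval
r_fit <- t.test(x, y, var.equal = FALSE, conf.level = 0.95)

c(df = df, lower = ci_welch[1], upper = ci_welch[2])
```

The degrees of freedom need not be an integer; `qt` accepts fractional values.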


Example 7.9: Weight loss under placebo vs ephedra

> t.test(x,y,var.equal=TRUE,conf.level=0.95)

t = -1.0542, df = 22, p-value = 0.3032 # 13+11-2=22


-8.279119 2.698699

> t.test(x,y,var.equal=FALSE,conf.level=0.95)

t = -1.0722, df = 21.99, p-value = 0.2953 # 21.99 ≈ 22


-8.187298 2.606878 # output adapted to fit screen

Conclusion: 0 lies in the CI, so a zero difference in average weight loss between the two groups is among the values compatible with the data at the 95% level.


7.5.3 Matched samples

Data: Thickness of shoe sole materials A and B, each worn on one foot, for each of ten boys.

Aim: is there a difference in average wear between materials A and B?

> t.test(shoes$A,shoes$B)

...

95 percent CI:

-2.745046 1.925046

sample estimates:

mean of x mean of y

10.63 11.04

[Figure: scatter plot of shoes$B against shoes$A; both axes from 8 to 14.]


Matched pairs should be treated specially

> with(shoes, t.test(A,B,paired=TRUE,conf.level=.95))

Paired t-test

data: A and B

t = -3.3489, df = 9, p-value = 0.008539


-0.6869539 -0.1330461

- Conclusion: 0 not in CI; nonzero difference in means with 95% confidence.
- Note: Different result from unpaired confidence intervals!
- Paired CI eliminates variability among boys.
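Why pairing removes the variability among boys can be seen by reducing each pair to a single difference. The sketch below assumes that `shoes` is the data set of that name shipped with the MASS package (the slides do not say where it comes from); the other names are illustrative.

```r
# Assumption: `shoes` is the data set from the MASS package.
shoes <- MASS::shoes
d <- shoes$A - shoes$B                     # one difference per boy

paired_fit <- t.test(shoes$A, shoes$B, paired = TRUE)
diff_fit   <- t.test(d, mu = 0)            # one-sample t-test on differences

# By hand: mean difference divided by its standard error
t_by_hand <- mean(d) / (sd(d) / sqrt(length(d)))

c(paired = unname(paired_fit$statistic),
  one_sample = unname(diff_fit$statistic),
  by_hand = t_by_hand)
```

The three numbers coincide: the paired t-test is a one-sample t-test on the differences, so differences between boys cancel out before the standard error is computed.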


Conclusions

- Estimation is one of the foremost statistical inference techniques.
- Estimates are variable: if new data were collected, the estimate would be different.
- Confidence intervals capture the amount of variability of the estimate w.r.t. the true underlying parameter.
- Correct interpretation of a (1-α)100% CI for a parameter:

  In (1-α)100% of the cases in which a CI would be constructed in the above way, the true parameter value would be contained in it.

- Short-hand:

  We are (1-α)100% confident that the true parameter value lies in the CI.
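The repeated-sampling interpretation can be illustrated by simulation. This is a sketch under stated assumptions: it reuses the election setting ($p = 0.56$, polls of size 100), draws the polls with `rbinom` rather than by sampling from the `country` vector, and uses an arbitrary seed and number of polls.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
p <- 0.56; n <- 100; n_polls <- 10000
zstar <- qnorm(0.975)

phat <- rbinom(n_polls, size = n, prob = p) / n   # one estimate per poll
se   <- sqrt(phat * (1 - phat) / n)

# Does each poll's 95% interval contain the true p?
covered  <- (phat - zstar * se <= p) & (p <= phat + zstar * se)
coverage <- mean(covered)
coverage
```

The fraction of intervals covering the true value should come out close to 0.95, though not exactly, because the normal approximation is only approximate at this sample size.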


Hypothesis testing


A Murder Mystery

DATA:

- 6:15 Ernst W. receives a 20 min phone call
- 7:06 Ernst W. looks at his watch
- 7:08 Ernst W. bumps into his neighbour
- 7:13 Ernst W. arrives at the party
- 7:29 Ernst W. is discovered with the dead body of the host.
- Pathologist claims that the host died between 7:00 and 7:05.

The police arrest W. on suspicion of murder and charge him with:

H0 : Ernst W. is guilty of the murder.

This is their working hypothesis, or the null hypothesis.

W. wants to convince the judge of the alternative hypothesis:

H1 : Ernst W. is not guilty of the murder.


All’s well that ends well

W.’s lawyer suggests the following summary of data (test statistic):

T = the neighbour saw Ernst W. at home at 7:05.

Ernst W.’s lawyer now argues that:

- IF E.W. was guilty of the murder that took place between 7:00 and 7:05pm,
- THEN it would be impossible that the neighbour saw E.W. at home at 7:05pm.
- BUT the neighbour saw E.W. at home at 7:05.
- SO E.W. cannot be the murderer.

In other words, the “possibility value”, the p-value:

Pr(T happens if H0 is true) = very small.

Therefore the judge rejects the null-hypothesis.

What if W. had hired a cheaper lawyer?

Another test-statistic, for instance,

S = W. made phone call at 6:15pm that lasted 20 minutes.

Although true, it does not disprove the null-hypothesis:

- IF E.W. is the murderer,
- THEN he may have made a phone call at 6:15 lasting 20 minutes.

So the p-value,

Pr(S happens if H0 is true) ≠ small.

We cannot reject the null-hypothesis!

[Note: Large p-values do not prove the null-hypothesis.]


Hypothesis Testing: General Theory

1. Key question: Is there evidence in favour of new theory?

2. Null-hypothesis: H0: THE NEW THEORY IS FALSE

3. Alternative hypothesis: H1: THE NEW THEORY IS TRUE

4. Test-statistic: T = summary of evidence in data.

5. Significance level: typically α = 0.05.

6. P-value: P(such an extreme T if H0 is true).

7. Decision:
   - If p-value < significance level, reject H0.
   - If p-value > significance level, do not reject H0.


Statistical hypotheses

The Null-Hypothesis is always of the form:

H0 : parameter = some value.

Alternative Hypothesis is often the negation of the Null-Hypothesis:

H1 : parameter ≠ some value.

This is called a 2-sided test.

1-sided tests also exist, i.e.,

H1 : parameter > value or parameter < value,

but you are not allowed to choose them after seeing the data.

NOTE: Statistical hypotheses are statements about the population, not about the sample.


The test-statistic, p-value and significance

Test statistic: summary of data informative about the parameter.

There is a main distinction:

- Normal Tests
- Non-Parametric Tests

Significance level: allowable probability of the mistake of rejecting H0 when it’s actually true.

P-value: probability, if H0 is true, of a test statistic at least as extreme as the one observed.

Asymmetric decision:

- IF p-value < significance level, then sufficient evidence to reject H0 and believe that H1 is true.
- IF p-value > significance level, then insufficient evidence to reject H0. Proven nothing either way.


8.1 Significance test for a population proportion

Example 8.3: Known poverty rate in 2000 is 11.3%. Sample of 50,000 in 2001; 5,850 (11.7%) indicate poverty.

Question: Did rate of poverty increase?

Hypotheses: Test no change of poverty against change of poverty

H0 : p = 0.113

H1 : p ≠ 0.113

Test-statistic:

T = fraction of poor in sample in 2001.

Significance level: α = 0.05.

P-value: (Think hard about this!)

p-value = P(|T − 0.113| > |0.117− 0.113|)


Calculating the p-value

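Besides being conceptually challenging, we also need to remember a few things from last lecture in order to calculate the p-value.

Remember: For large $n$ the sample proportion $\hat{p}$ satisfies, under $H_0$,

$$\frac{\hat{p} - p_0}{SE(\hat{p} \mid H_0)} = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \sim N(0, 1).$$

So, with $T = \hat{p}$,

$$\begin{aligned} \text{p-value} &= P(|T - 0.113| > |0.117 - 0.113|) \\ &= P\left(\left|\frac{\hat{p} - 0.113}{\sqrt{0.113(1-0.113)/50000}}\right| > \left|\frac{0.117 - 0.113}{\sqrt{0.113(1-0.113)/50000}}\right|\right) \\ &= P(|Z| > 2.825) \\ &= 2P(Z > 2.825) \\ &= 0.0047 \end{aligned}$$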
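The same arithmetic in R, as a sketch with illustrative variable names. Note that the standard error uses the null value $p_0$, not $\hat{p}$, because the p-value is computed under $H_0$.

```r
x <- 5850; n <- 50000; p0 <- 0.113
phat <- x / n                             # observed proportion, 0.117
se0  <- sqrt(p0 * (1 - p0) / n)           # standard error under H0
z    <- (phat - p0) / se0                 # standardised test statistic
p_value <- 2 * pnorm(-abs(z))             # two-sided p-value

round(c(z = z, p_value = p_value), 4)
```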


Decision

By the way, we can also get the p-value from R:

> prop.test(x=5850,n=50000,p=.113)

1-sample proportions test with continuity correction

data: 5850 out of 50000, null probability 0.113


Decision: p-value < 0.05, so we reject H0.

Conclusion: We have significant evidence that poverty increased in the population from 2000 to 2001.

PS. In the book a one-sided test is performed. This is not recommended!


8.5 Two-sample tests of proportion

Example 8.8: Poverty rate (continued). What happened with the fraction of poor from 2001 to 2002?

Sample:
2001: 5850 out of 50000 indicate poverty
2002: 7260 out of 60000 indicate poverty

Hypotheses: H0 : p1 = p2 against H1 : p1 ≠ p2, or equivalently:

H0 : p2 − p1 = 0

H1 : p2 − p1 ≠ 0

Test-statistic: $T = \hat{p}_2 - \hat{p}_1$.

We can use again that approximately for large $n_1$ and $n_2$:

$$\frac{\hat{p}_2 - \hat{p}_1 - (p_2 - p_1)}{\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}} \sim N(0, 1)$$


Did poverty change from 2001 to 2002?

We can calculate the p-value by first principles, or ...

> prop.test(x=c(5850,7260),n=c(50000,60000))

data: c(5850, 7260) out of c(50000, 60000)


alternative hypothesis: two.sided


-0.0078584975 -0.0001415025

sample estimates:

prop 1 prop 2

0.117 0.121

Decision: p-value < 0.05, so reject H0.

Conclusion: Also from 2001 to 2002 the fraction of poor increased in the population.
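Calculating the p-value "by first principles" might look as follows. This sketch uses the unpooled standard error from the normal approximation given earlier; `prop.test` instead carries out a chi-squared test with a continuity correction, so its p-value will be close to, but not identical with, the one computed here. Variable names are illustrative.

```r
x <- c(5850, 7260); n <- c(50000, 60000)
phat <- x / n                              # 0.117 and 0.121
se   <- sqrt(sum(phat * (1 - phat) / n))   # unpooled SE of the difference
z    <- (phat[2] - phat[1]) / se           # standardised under H0: p2 - p1 = 0
p_value <- 2 * pnorm(-abs(z))              # two-sided p-value

round(c(z = z, p_value = p_value), 4)
```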


One Sample T-Test

1. Key question of interest:

   Is the population mean equal to some pre-specified number, say $\mu_0$?

2. Hypotheses: Null-Hypothesis and Alternative Hypothesis:

   H0 : population mean = µ0

   H1 : population mean ≠ µ0.

3. Data is a random sample from a population.

4. Test-Statistic: T = sample mean

5. P-value: $P(|T - \mu_0| > |\bar{x} - \mu_0|)$, where $\bar{x}$ is the observed sample mean.


IQ OF DRUG OFFENDERS

Question: Is there evidence that the average IQ level of the soft drug offender population is different from 100?

Data: IQ scores on a sample of 15 soft drug offenders.

[Figure: histogram “IQ of 15 soft drugs offenders”; horizontal axis x from 80 to 150, frequency axis from 0 to 4.]

Hypotheses:

H0 : pop. mean IQ soft drug offenders = 100

H1 : pop. mean IQ soft drug offenders ≠ 100


Test-statistic, p-value and conclusion

Test-statistic: T = average IQ of 15 drug offenders.

P-value: We observe $\bar{x} = 114.6$ and SD = 16.6:

$$\begin{aligned} \text{p-value} &= P(|T - 100| > |114.6 - 100|) \\ &= P\left(\frac{|T - 100|}{16.6/\sqrt{15}} > \frac{|114.6 - 100|}{16.6/\sqrt{15}}\right) \\ &= P(|t_{14}| > 3.41) \\ &= 0.0042 \end{aligned}$$

... or with R:

> t.test(x,mu=100)

t = 3.41, df = 14, p-value = 0.004228

alternative hypothesis: true mean is not equal to 100

Decision: Since p-value = 0.0042 < 0.05, we reject the null-hypothesis H0.

Conclusion: Soft drug users are on average more intelligent than the general population.

Note: (a) conclusion in terms of population (b) one-sided conclusion.
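When only the summary statistics are available, the test statistic and p-value can still be reproduced. The sketch below uses the rounded values 114.6 and 16.6 quoted above, so its result agrees with the `t.test` output only up to rounding; the variable names are illustrative.

```r
xbar <- 114.6; s <- 16.6; n <- 15; mu0 <- 100
t_obs   <- (xbar - mu0) / (s / sqrt(n))        # standardised sample mean
p_value <- 2 * pt(-abs(t_obs), df = n - 1)     # two-sided, t with 14 df

round(c(t = t_obs, p_value = p_value), 4)
```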


Matched samples: paired t-test

Data: Thickness of shoe sole materials A and B, each worn on one foot, for each of ten boys.

Aim: is there a difference in average wear between materials A and B?

> t.test(shoes$A,shoes$B)

data: shoes$A and shoes$B

t = -0.3689, df = 17.987,

p-value = 0.7165

[Figure: scatter plot of shoes$B against shoes$A; both axes from 8 to 14.]


Matched pairs should be treated specially

> with(shoes, t.test(A,B,paired=TRUE,conf.level=.95))

Paired t-test

data: A and B

t = -3.3489, df = 9, p-value = 0.008539


-0.6869539 -0.1330461

- Decision: p-value < 0.05, so reject H0.
- Conclusion: there is a difference between average wear of the two types of materials.
- Paired t-test eliminates variability among boys.


Non-parametric tests

Have another look at the picture:

[Figure: scatter plot of shoes$B against shoes$A; both axes from 8 to 14.]

Fact: 8 out of 10 points lie above the line-of-equality...

Test-statistic: T = number of points above the line


Sign test

Hypotheses:

H0 : Median wear A = Median wear B

H1 : Median wear A ≠ Median wear B

Note: $T \mid H_0 \sim \text{Binomial}(10, 0.5)$.

$$\begin{aligned} \text{p-value} &= P(|T - 5| \ge |8 - 5|) \\ &= P(T \ge 8) + P(T \le 2) \\ &= 0.1094 \end{aligned}$$

- Decision: p-value > 0.05, so do not reject H0.
- Conclusion: there is no evidence in T that there is a difference between median wear of the two types of materials.
- Sign test is LESS powerful than the paired t-test, BUT it makes fewer assumptions.
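The sign-test p-value is a sum of binomial tail probabilities, which can be sketched in R as below. The variable names are illustrative; `binom.test` is R's exact binomial test and gives the same two-sided p-value here.

```r
n <- 10; t_obs <- 8                # 8 of 10 points above the line

# P(T <= 2) + P(T >= 8) under Binomial(10, 0.5)
p_value <- pbinom(n - t_obs, n, 0.5) +
           pbinom(t_obs - 1, n, 0.5, lower.tail = FALSE)

# Same result from R's exact binomial test
r_value <- binom.test(t_obs, n, p = 0.5)$p.value

round(c(by_hand = p_value, binom.test = r_value), 4)
```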

Conclusions

Hypothesis testing checks to what extent the data can be explained by a status quo assumption, or whether that assumption needs to be rejected because the data would otherwise be too unusual.

- Terminology: hypotheses, test-statistic, p-value, significance level.
- Interpretation: be careful when interpreting p-values!
- Tests: most common tests are based on the normal distribution. Non-parametric tests avoid making normal assumptions, but are less powerful if these assumptions are actually true.

