
Revision: 2-12 1

Module 8: Hypothesis Testing II

Statistics (OA3102)

Professor Ron Fricker

Naval Postgraduate School

Monterey, California

Reading assignment:

WM&S chapter 10.6-10.10

Goals for this Module

• Understand the p-value approach to hypothesis testing

• Small-sample hypothesis tests

– For μ

– For p

• Tests for variance(s)

• Neyman-Pearson Lemma and likelihood ratio tests

An Alternate Approach to Hypothesis Testing: p-Values

• Current hypothesis testing approach: Specify α and report whether the null is rejected

– Distills results down into a simple binary outcome

– In so doing, it leaves out some potentially useful information, since it does not say how strongly or weakly the null was rejected (or not)

• The p-value approach provides this information

– It’s also the standard method for hypothesis testing in stats packages

So, What’s a p-value?

• A p-value is the probability of seeing a test statistic as extreme or more extreme than the one observed, assuming the null hypothesis is true

– Memorize this definition

– “As extreme or more extreme” is defined by the alternative hypothesis

• The idea: Rather than checking whether the test statistic falls in the rejection region, we assess how unusual our test statistic is with a p-value

– If this probability is smaller than a pre-specified significance level α (usually α = 0.1, 0.05, or 0.01), reject the null

– The smaller the p-value, the stronger the evidence against the null

Back to the Down-time Example

• Remember, we observed $z = -3.8$, which fell in the rejection region $(-\infty, -1.645)$

[Figure: standard normal density with the rejection region to the left of -1.645 and the observed z = -3.8 marked: "Very unusual to see this, if the null is true"]

• The p-value for this problem is $\Pr(Z \le -3.8 \mid Z \sim N(0,1)) = 0.000072$

• Since this is less than the test significance level $\alpha = 0.05$, we (strongly) reject the null
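The tail probability for the down-time example can be checked numerically. A minimal sketch in R, assuming only base R's `pnorm()` (the standard normal CDF) and the observed z = -3.8:

```r
# Lower-tailed z-test: p-value = Pr(Z <= z_obs) when H0 is true
z_obs <- -3.8
alpha <- 0.05

p_value <- pnorm(z_obs)      # standard normal CDF at the observed statistic
reject  <- p_value < alpha   # compare the p-value with the significance level

print(p_value)
print(reject)
```

The result is about 0.00007, far below 0.05, so the p-value and rejection region approaches agree.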


And Back to the Lab

• NPS student survey: “My instructors provide real

world examples during class”

– Response scale from 1 to 5, where 1 was “Strongly

Disagree” and 5 was “Strongly Agree”

And All the Way Back to the Challenger Example in Module 1

[Figure: plot against JointTemp, x-axis from 50 to 85, y-axis from -0.5 to 3.5]

Summary of Fit

RSquare                      0.332354
RSquare Adj                  0.302007
Root Mean Square Error       0.647926
Mean of Response             0.416667
Observations (or Sum Wgts)   24

Parameter Estimates

Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   4.63697     1.282118    3.62      0.0015
JointTemp   -0.060362   0.01824     -3.31     0.0032

• Conclusion: There is a significant relationship between temperature and the number of incidents!

Textbook Definition

• Definition. If W is a test statistic, the p-value, or attained significance level, is the smallest level of significance α for which the observed data indicate that the null hypothesis should be rejected.

Interpreting p-values

• Compare the p-value to the significance level α to decide whether to reject H0

• Small p-values mean either a rare event happened, or the null hypothesis is false

• In the down-time example, the observed result is very unlikely if the null is true

– About one chance in 14,000 (i.e., 1/0.000072) of seeing something like this or more extreme

– Pretty convincing evidence that the mod decreases mean down-time

Why Do I Care About p-values?

• For some types of hypothesis tests it is hard (or not possible) to calculate the acceptance/rejection region

• Stats software packages report the results of hypothesis tests in terms of p-values

• As we’ve discussed, p-values provide additional information about the strength of the observed effect

Calculating p-values for z-tests

Steps in Hypothesis Testing

When Using p-values

1. Identify the parameter of interest

2. State null and alternative hypotheses

3. Determine form of test statistic and its sampling distribution

4. Calculate the p-value

5. Determine test outcome by comparing the p-value to the significance level

Example

• You’re an analyst at a logistics depot and you’ve been asked to determine whether the average monthly demand for an end-item is what was originally planned: 245 units per month

– For a sample of size n = 50 months, you calculate $\bar{y} = 246.18$ and $s = 3.6$

• Using the p-value approach, conduct a hypothesis test of $H_0: \mu = 245$ vs. $H_a: \mu \ne 245$ at significance level $\alpha = 0.01$
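One way to work this example is a large-sample two-tailed z-test computed from the summary statistics. The sketch below assumes n = 50 is large enough to treat s as σ. The variable names are illustrative.

```r
# Two-tailed large-sample z-test of H0: mu = 245 vs Ha: mu != 245
n     <- 50
ybar  <- 246.18
s     <- 3.6
mu0   <- 245
alpha <- 0.01

z       <- (ybar - mu0) / (s / sqrt(n))  # standardized test statistic
p_value <- 2 * pnorm(-abs(z))            # both tails count as "as extreme or more extreme"

cat("z =", round(z, 3), " p-value =", round(p_value, 4), "\n")
cat("Reject H0 at alpha =", alpha, "?", p_value < alpha, "\n")
```

With these numbers z is about 2.32 and the p-value is about 0.02, so the null is not rejected at α = 0.01, although it would be at α = 0.05.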

Example (continued)


Example 10.10

• Returning to the poll example (Ex. 10.1-10.4),

remember:

– We want to test H0: p = 0.5 vs. Ha: p < 0.5, where p

is the proportion of the population favoring a

candidate

– The test statistic is Y, the number of people in the

sample favoring the candidate, where n=15 people

– What is the p-value if we observe Y=3?

• Solution:
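For a lower-tailed test with Y as the test statistic, the p-value is the binomial probability of a count at or below the one observed. A minimal sketch in R using the base functions `pbinom()` and `dbinom()`:

```r
# Exact lower-tailed binomial p-value: Pr(Y <= 3 | Y ~ Bin(15, 0.5))
n  <- 15
p0 <- 0.5
y  <- 3

p_value <- pbinom(y, size = n, prob = p0)         # cumulative probability Pr(Y <= y)
p_check <- sum(dbinom(0:y, size = n, prob = p0))  # same value, summing the pmf term by term

print(p_value)
```

The value is about 0.018 (exactly 576/32768), which is what Table 1 of Appendix 3 gives to three decimals.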

Example 10.10 (continued)


Table 1 from Appendix 3


Example 10.11

• Returning to Example 10.7, the hypotheses were $H_0: \mu_0 - \mu_1 = 0$ vs. $H_a: \mu_0 - \mu_1 \ne 0$ and the standardized test statistic was $z = -2.5$

• Calculate the p-value:
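Which tail (or tails) counts as "extreme" depends on the alternative hypothesis. The helper below is a hypothetical sketch (the name `z_p_value` is not from the course materials) that covers the three cases and applies the two-tailed case to z = -2.5:

```r
# p-value for a z-test under each form of the alternative hypothesis
z_p_value <- function(z, alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  switch(alternative,
         less      = pnorm(z),                      # Pr(Z <= z)
         greater   = pnorm(z, lower.tail = FALSE),  # Pr(Z >= z)
         two.sided = 2 * pnorm(-abs(z)))            # Pr(|Z| >= |z|)
}

p_value <- z_p_value(-2.5, "two.sided")
print(p_value)
```

The two-tailed p-value is about 0.0124, twice the lower-tail probability of 0.0062.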



A Note About Collecting Data

• Data collection may or may not follow the definition of the hypothesis

– You can collect the data first, but you shouldn’t use it to generate your hypothesis

• That is, peeking at the data before generating your hypothesis is cheating

• It would be like a gambler who gets to see the outcome before placing his/her bet

• Generally, you only have one hypothesis to test; if it fails, you don’t keep trying until something is “proven”

– E.g., if you “fish” through a large data set, at α = 0.05 you will on average get “significance” in one out of every 20 tests, even when every null is true

• Finally, note that replication is the real proof of results

– Until then, your significant results could just be a Type I error
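The "one out of every 20" figure can be made concrete. The sketch below assumes 20 independent tests in which every null hypothesis is true. Both assumptions are for illustration only.

```r
# Chance of spurious "significance" when fishing through k independent tests
alpha <- 0.05
k     <- 20

expected_false_positives <- k * alpha    # average number of Type I errors
p_at_least_one <- 1 - (1 - alpha)^k      # Pr(at least one Type I error)

print(expected_false_positives)
print(p_at_least_one)
```

Under these assumptions one false positive is expected on average, and the chance of at least one is roughly 64%.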


Statistical vs. Practical Significance

• No matter how small the effect, if we take a big enough sample we can make it statistically significant

– That’s because the larger the sample size, the smaller the standard error

– Smaller standard errors mean statistical significance for smaller differences between the null and alternative

• But statistical significance is not the same as practical significance

– Practical significance follows when the result is important to the decision maker

– A substantive, not statistical, question

Statistical vs. Practical Significance

• Consider a test of $H_0: \mu = 100$ vs. $H_a: \mu > 100$, where μ is the mean of a normal population with $\sigma = 10$

– Suppose a true value of $\mu = 101$ is inconsequential from a practical perspective

– The table below shows the test outcome for various sample sizes (n) and $\bar{x} = 101$

* Table from Probability and Statistics for Engineering and the Sciences, 7th ed., Duxbury Press, 2008.
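The effect of sample size on the p-value can be sketched in R. The sample sizes below are illustrative assumptions, not necessarily those in the cited table. The observed mean is held at 101 throughout.

```r
# Upper-tailed z-test of H0: mu = 100 with sigma = 10 and observed mean 101
mu0   <- 100
sigma <- 10
xbar  <- 101
n     <- c(25, 100, 400, 900, 1600, 2500, 10000)  # assumed sample sizes

z       <- (xbar - mu0) / (sigma / sqrt(n))  # grows like sqrt(n) for a fixed difference
p_value <- pnorm(z, lower.tail = FALSE)      # Pr(Z >= z) under H0

print(data.frame(n = n, z = z, p_value = signif(p_value, 3)))
```

The same one-unit difference is unremarkable at n = 25 (z = 0.5) but significant at the 0.05 level by n = 400 (z = 2), even though it is no more important in practice.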


Small-Sample Hypothesis Testing for the Mean: t-Tests

• If Y1,…,Yn have a normal distribution with unknown mean μ and unknown variance σ², then the standardized statistic has a t-distribution with n−1 degrees of freedom:

$$T = \frac{\bar{Y} - \mu_0}{S/\sqrt{n}}$$

• As always, three possible hypotheses and tests:

Alternative Hypothesis    Rejection Region for Level α Test
$H_a: \mu > \mu_0$        $T > t_{\alpha, n-1}$  (upper-tailed test)
$H_a: \mu < \mu_0$        $T < -t_{\alpha, n-1}$  (lower-tailed test)
$H_a: \mu \ne \mu_0$      $T > t_{\alpha/2, n-1}$ or $T < -t_{\alpha/2, n-1}$  (two-tailed test)

Example 10.12

• In Example 8.11, the muzzle velocity of n = 8 shells was tested using new gunpowder

– Resulting data: $\bar{y} = 2,959$ and $s = 39.1$

– Is there evidence to contradict the manufacturer’s claim that $\mu > 3,000$ fps at a level of $\alpha = 0.025$?

• Specify the hypotheses to be tested:

• Conduct the test using the rejection region approach

Example 10.13

• Conduct the test by calculating the p-value

and compare to the previous result
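Both approaches can be computed from the summary statistics. The sketch below takes the hypotheses to be H0: μ = 3,000 vs. Ha: μ < 3,000, which is an assumption here since the slide leaves them to be filled in. It uses base R's `qt()` and `pt()`.

```r
# Lower-tailed one-sample t-test from summary statistics (muzzle velocity example)
n     <- 8
ybar  <- 2959
s     <- 39.1
mu0   <- 3000
alpha <- 0.025

t_stat <- (ybar - mu0) / (s / sqrt(n))
df     <- n - 1

t_crit  <- qt(alpha, df)    # lower-tail critical value, i.e. -t_{alpha, n-1}
p_value <- pt(t_stat, df)   # Pr(T <= t_stat) under H0

cat("t =", round(t_stat, 3), " critical value =", round(t_crit, 3), "\n")
cat("p-value =", round(p_value, 4), "\n")
```

The statistic is about -2.97 against a critical value of about -2.365, and the p-value is a little over 0.01, so both approaches reject at α = 0.025, as they must.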


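Small-Sample Hypothesis Testing for Differences in Means: t-Tests

• For two independent samples, $Y_{11},\dots,Y_{1n_1}$ and $Y_{21},\dots,Y_{2n_2}$, both normally distributed with unknown means and unknown but equal variances, the standardized statistic has a t-distribution with $n_1 + n_2 - 2$ degrees of freedom:

$$T = \frac{\bar{Y}_1 - \bar{Y}_2 - D_0}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

where $S_p$ is the pooled sample standard deviation

• With $H_0: \mu_1 - \mu_2 = D_0$, three possible tests:

Alternative Hypothesis          Rejection Region for Level α Test
$H_a: \mu_1 - \mu_2 > D_0$      $T > t_{\alpha, n_1+n_2-2}$  (upper-tailed test)
$H_a: \mu_1 - \mu_2 < D_0$      $T < -t_{\alpha, n_1+n_2-2}$  (lower-tailed test)
$H_a: \mu_1 - \mu_2 \ne D_0$    $T > t_{\alpha/2, n_1+n_2-2}$ or $T < -t_{\alpha/2, n_1+n_2-2}$  (two-tailed test)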

Example 10.14

• Back to Example 8.12, assembly time using two different training methods

– Data on two independent random samples of 9 people each were collected, giving $\bar{y}_1 = 35.22$, $s_1^2 = 195.56/8$, $\bar{y}_2 = 31.56$, and $s_2^2 = 160.22/8$

– Is there evidence to suggest a difference in the true mean assembly times at the $\alpha = 0.05$ level?

• Specify the hypotheses to be tested:

• Conduct the test using the rejection region method


• Conduct the test using the p-value method

and compare to the previous results
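A pooled two-sample t-test can be computed directly from the summary statistics. The sketch below assumes the two-sided hypotheses H0: μ1 − μ2 = 0 vs. Ha: μ1 − μ2 ≠ 0 and equal population variances. The variable names are illustrative.

```r
# Pooled two-sample t-test from summary statistics (assembly time example)
n1 <- 9; ybar1 <- 35.22; s1_sq <- 195.56 / 8
n2 <- 9; ybar2 <- 31.56; s2_sq <- 160.22 / 8
d0    <- 0
alpha <- 0.05

df     <- n1 + n2 - 2
sp_sq  <- ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df  # pooled variance estimate
t_stat <- (ybar1 - ybar2 - d0) / sqrt(sp_sq * (1 / n1 + 1 / n2))

t_crit  <- qt(1 - alpha / 2, df)      # two-tailed critical value t_{alpha/2, df}
p_value <- 2 * pt(-abs(t_stat), df)   # two-tailed p-value

cat("t =", round(t_stat, 3), " critical value =", round(t_crit, 3), "\n")
cat("p-value =", round(p_value, 3), "\n")
```

Here t is about 1.65 on 16 degrees of freedom, inside the critical values of about ±2.12, and the p-value is between 0.1 and 0.2, so neither method finds a difference at α = 0.05.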


A Note on the Normality Assumption

• The small-sample tests based on the t distribution assume the data are normally distributed

• It’s generally impossible to be sure this assumption is met

• Turns out, $T = (\bar{Y} - \mu)/(S/\sqrt{n})$ is robust to modest departures from normality

– True for both hypothesis tests and confidence intervals

– Means you don’t have to agonize over distributional assumptions before doing the test

Small-Sample Tests for Population Proportion (p)

• The large-sample method approximates the sampling distribution of $\hat{p}$

• If the sample is small, the approximation no longer holds and we must use the actual sampling distribution

• And, rather than derive the sampling distribution of $\hat{p}$, let’s make the test statistic Y

– We know its distribution: it’s binomial

– We’re back to our original coin flipping problem!

How to Implement the Test?

• Proceed much like we did at the start of the last module, but now only need to determine the p-value based on the observed count

– As before, first define the hypotheses and specify the significance level α

• Example 1: Consider a lower-tailed test where the rejection region is defined as $Y \le c$

– Then $p\text{-value} = \Pr(Y \le y \mid Y \sim \text{Bin}(n, p_0))$

• Example 2: Consider an upper-tailed test where the rejection region is defined as $Y \ge c$

– Then $p\text{-value} = \Pr(Y \ge y \mid Y \sim \text{Bin}(n, p_0)) = 1 - \Pr(Y \le y - 1 \mid Y \sim \text{Bin}(n, p_0))$

Example

• Calculate the p-value for a small-sample test of

H0: p=0.9 vs Ha: p<0.9 if we observe y=14 successes

out of n=20 trials

– What do you conclude for a level a=0.05 test?

Example (continued)


In R:
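A minimal sketch of the calculation (not the original slide output), assuming only base R. The helper name `binom_p_value` is hypothetical and handles both one-tailed cases for an observed count y.

```r
# Exact binomial p-values from an observed count y
binom_p_value <- function(y, n, p0, alternative = c("less", "greater")) {
  alternative <- match.arg(alternative)
  if (alternative == "less") {
    pbinom(y, n, p0)                          # Pr(Y <= y)
  } else {
    pbinom(y - 1, n, p0, lower.tail = FALSE)  # Pr(Y >= y) = 1 - Pr(Y <= y - 1)
  }
}

# H0: p = 0.9 vs Ha: p < 0.9 with y = 14 successes in n = 20 trials
p_value <- binom_p_value(14, 20, 0.9, "less")
print(p_value)

# Base R's exact binomial test reports the same p-value
print(binom.test(14, 20, p = 0.9, alternative = "less")$p.value)
```

The p-value is about 0.011, which is below 0.05, so the null is rejected at the α = 0.05 level.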


Testing the Variance

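• If Y1,…,Yn have a normal distribution with unknown mean μ and unknown variance σ²

• Test statistic: $\chi^2 = \dfrac{(n-1)S^2}{\sigma_0^2}$

• Null hypothesis: $H_0: \sigma^2 = \sigma_0^2$

• Three possible alternative hypotheses:

Alternative Hypothesis            Rejection Region for Level α Test
$H_a: \sigma^2 > \sigma_0^2$      $\chi^2 > \chi^2_{\alpha, n-1}$
$H_a: \sigma^2 < \sigma_0^2$      $\chi^2 < \chi^2_{1-\alpha, n-1}$
$H_a: \sigma^2 \ne \sigma_0^2$    $\chi^2 > \chi^2_{\alpha/2, n-1}$ or $\chi^2 < \chi^2_{1-\alpha/2, n-1}$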

Illustrating the Rejection Region


Example 10.16

• The variance of machined parts cannot be larger than 0.0002 in²

– A random sample of n = 10 parts gives $s^2 = 0.0003$

– Test $H_0: \sigma^2 = 0.0002$ vs. $H_a: \sigma^2 > 0.0002$ at a level of $\alpha = 0.05$

• Conduct the test using the rejection region approach:

Example 10.17

• Conduct the test using the p-value approach

In R:
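A minimal sketch of the calculation (not the original slide output), using base R's `qchisq()` and `pchisq()` with the numbers from Example 10.16:

```r
# Upper-tailed chi-square test of H0: sigma^2 = 0.0002 vs Ha: sigma^2 > 0.0002
n         <- 10
s_sq      <- 0.0003
sigma0_sq <- 0.0002
alpha     <- 0.05

chi_sq <- (n - 1) * s_sq / sigma0_sq
df     <- n - 1

chi_crit <- qchisq(1 - alpha, df)                    # upper-tail critical value
p_value  <- pchisq(chi_sq, df, lower.tail = FALSE)   # Pr(chi-square_df >= observed)

cat("chi-square =", chi_sq, " critical value =", round(chi_crit, 3), "\n")
cat("p-value =", round(p_value, 3), "\n")
```

The statistic is 13.5 against a critical value of about 16.92, and the p-value is above 0.10, so the data do not show that the variance exceeds 0.0002 at α = 0.05.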


Example 10.18

• An experimenter is convinced that $\sigma = 2$

– A sample of n = 16 yields $s^2 = 6.1$

– Do the data disagree with his claim?

• Determine the p-value of the test

• What conclusion do you reach for $\alpha = 0.05$?

• Solution:
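One way to get the p-value is to treat "disagree" as a two-sided alternative, H0: σ² = 4 vs. Ha: σ² ≠ 4, and double the smaller tail probability. Both choices are assumptions in this sketch.

```r
# Two-tailed chi-square test of H0: sigma^2 = 4 (i.e. sigma = 2)
n         <- 16
s_sq      <- 6.1
sigma0_sq <- 2^2
alpha     <- 0.05

chi_sq <- (n - 1) * s_sq / sigma0_sq
df     <- n - 1

upper_tail <- pchisq(chi_sq, df, lower.tail = FALSE)  # Pr(chi-square_df >= observed)
lower_tail <- pchisq(chi_sq, df)                      # Pr(chi-square_df <= observed)
p_value    <- 2 * min(upper_tail, lower_tail)         # double the smaller tail

cat("chi-square =", chi_sq, " p-value =", round(p_value, 3), "\n")
```

The statistic is 22.875 on 15 degrees of freedom and the two-tailed p-value falls between 0.1 and 0.2, so the claim is not rejected at α = 0.05.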


F Test for Equality of Variances

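• Null hypothesis: $H_0: \sigma_1^2 = \sigma_2^2$

• Test statistic is $F = S_1^2 / S_2^2$, where the numerator has n1 observations and the denominator n2 observations

• Three possible alternative hypotheses:

Alternative Hypothesis              Rejection Region for Level α Test
$H_a: \sigma_1^2 > \sigma_2^2$      $F > F_{\alpha, n_1-1, n_2-1}$
$H_a: \sigma_1^2 < \sigma_2^2$      $F < F_{1-\alpha, n_1-1, n_2-1}$
$H_a: \sigma_1^2 \ne \sigma_2^2$    $F > F_{\alpha/2, n_1-1, n_2-1}$ or $F < F_{1-\alpha/2, n_1-1, n_2-1}$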

Calculating the Rejection Region

for a One-Tailed Test

• The easy test to conduct using Table 7 is $H_a: \sigma_1^2 > \sigma_2^2$

– Why? Because the table is set up to calculate rejection regions in the right tail

– Now note that the choice of which population is labeled “1” is arbitrary, so we can always turn a left-tailed test into a right-tailed one

– This avoids having to solve for a left-tail rejection region from Table 7: $F_{1-\alpha, \nu_1, \nu_2} = 1 / F_{\alpha, \nu_2, \nu_1}$

Calculating the Rejection Region

for a Two-Tailed Test

• For $H_a: \sigma_1^2 \ne \sigma_2^2$, we can’t avoid the left tail calculation issue

– Start by solving for $F_{1-\alpha/2, \nu_1, \nu_2} = 1 / F_{\alpha/2, \nu_2, \nu_1}$

– Also solve for $F_{\alpha/2, \nu_1, \nu_2}$

– Then $RR: F > F_{\alpha/2, n_1-1, n_2-1}$ or $F < F_{1-\alpha/2, n_1-1, n_2-1}$

• Example: Find the rejection region for a two-sided test with n1 = 5 and n2 = 6
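The reciprocal relationship between left- and right-tail F critical values can be checked with `qf()`. This sketch uses n1 = 5 and n2 = 6 and assumes α = 0.05, since no level is stated for the example. Note that `qf()` takes a lower-tail probability, so the upper-tail value $F_{\alpha/2}$ corresponds to `qf(1 - alpha / 2, ...)`.

```r
# Two-tailed F rejection region for n1 = 5, n2 = 6
n1 <- 5
n2 <- 6
alpha <- 0.05   # assumed level
nu1 <- n1 - 1
nu2 <- n2 - 1

upper        <- qf(1 - alpha / 2, nu1, nu2)      # F_{alpha/2, nu1, nu2}
lower_table  <- 1 / qf(1 - alpha / 2, nu2, nu1)  # table route: 1 / F_{alpha/2, nu2, nu1}
lower_direct <- qf(alpha / 2, nu1, nu2)          # direct quantile: F_{1-alpha/2, nu1, nu2}

cat("Reject H0 if F >", round(upper, 2), "or F <", round(lower_table, 3), "\n")
```

The table route and the direct quantile give the same lower critical value, which is why a right-tail table is enough.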

Example (continued)


Example


• In a comparison between a new fire control radar and

the existing radar, NAVSEA wants to test whether the

new radar is more precise

• By “precise” we mean there is less variability in

where the rounds land downrange

• In contrast, “accuracy” would measure how well

the rounds actually hit the target

• To compare, 20 rounds are fired from the same

weapons system, under the same conditions,

differing only by which fire control radar was used

Example (continued)

• Results: $s_{\text{new radar}} = 23$ meters and $s_{\text{old radar}} = 50$ meters

• At an $\alpha = 0.01$ level, can you conclude that the new radar is more precise than the existing radar?

Example (continued)

Testing Variances in R

• var.test() in R will do the test:



• The test can be done in either direction:



• And, just to illustrate, here’s the two-tailed

test:
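`var.test()` needs the raw observations rather than summary statistics. The sketch below therefore uses simulated miss distances. The data, sample sizes, and variable names are made up for illustration and are not the slide's data.

```r
# F test for equality of variances on simulated (hypothetical) data
set.seed(1)
new_radar <- rnorm(10, mean = 0, sd = 23)  # simulated miss distances, new radar
old_radar <- rnorm(10, mean = 0, sd = 50)  # simulated miss distances, old radar

# Ha: var(new) < var(old), i.e. the new radar is more precise
res_less <- var.test(new_radar, old_radar, alternative = "less")

# Same test with the labels swapped: Ha: var(old) > var(new)
res_greater <- var.test(old_radar, new_radar, alternative = "greater")

# Two-tailed version: Ha: the variances differ
res_two <- var.test(new_radar, old_radar, alternative = "two.sided")

print(res_less$statistic)   # F = var(new_radar) / var(old_radar)
print(c(less = res_less$p.value,
        greater = res_greater$p.value,
        two.sided = res_two$p.value))
```

Swapping which sample is listed first inverts the F statistic but leaves the one-tailed p-value unchanged.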


Example 10.19

• Compare the variation in parts between two competitors:

– n1 = 10 and $s_1^2 = 0.0003$

– n2 = 20 and $s_2^2 = 0.0001$

• Can we conclude $\sigma_1^2 > \sigma_2^2$ at a significance level of $\alpha = 0.05$?

Example 10.20

• Conduct the test using the p-value approach
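With only summary statistics available, the F statistic and its p-value can be computed by hand with `qf()` and `pf()`. The sketch assumes the upper-tailed alternative Ha: σ1² > σ2² and the Example 10.19 numbers (n1 = 10, s1² = 0.0003, n2 = 20, s2² = 0.0001).

```r
# Upper-tailed F test of H0: sigma1^2 = sigma2^2 from summary statistics
n1 <- 10; s1_sq <- 0.0003
n2 <- 20; s2_sq <- 0.0001
alpha <- 0.05

f_stat <- s1_sq / s2_sq
nu1 <- n1 - 1   # numerator degrees of freedom
nu2 <- n2 - 1   # denominator degrees of freedom

f_crit  <- qf(1 - alpha, nu1, nu2)                   # upper-tail critical value
p_value <- pf(f_stat, nu1, nu2, lower.tail = FALSE)  # Pr(F >= observed) under H0

cat("F =", round(f_stat, 2), " critical value =", round(f_crit, 2), "\n")
cat("p-value =", round(p_value, 4), "\n")
```

F is 3 against a critical value of about 2.42, and the p-value lies between 0.01 and 0.025, so the null is rejected at α = 0.05.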


Example 10.21

• An experiment to assess electrical shock pain thresholds for males and females was conducted (by Bill Murray in Ghostbusters?)

– $n_{\text{males}} = 14$, $\bar{y}_{\text{males}} = 16.2$, $s^2_{\text{males}} = 12.7$

– $n_{\text{females}} = 10$, $\bar{y}_{\text{females}} = 14.9$, $s^2_{\text{females}} = 26.4$

• Can we conclude $\sigma^2_{\text{males}} \ne \sigma^2_{\text{females}}$ at a significance level of $\alpha = 0.05$?


• What can be said about the p-value?
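A table only brackets the p-value, but `pf()` gives it directly. This sketch uses the shock-threshold summary statistics (males: n = 14, s² = 12.7; females: n = 10, s² = 26.4). It puts the larger sample variance in the numerator and doubles the smaller tail. That doubling convention is an assumption, though a common one for two-tailed F tests.

```r
# Two-tailed F test for equal variances from summary statistics
n_m <- 14; s_sq_m <- 12.7   # males
n_f <- 10; s_sq_f <- 26.4   # females
alpha <- 0.05

f_stat <- s_sq_f / s_sq_m   # larger sample variance in the numerator
nu1 <- n_f - 1              # numerator degrees of freedom
nu2 <- n_m - 1              # denominator degrees of freedom

upper_tail <- pf(f_stat, nu1, nu2, lower.tail = FALSE)
p_value    <- 2 * min(upper_tail, 1 - upper_tail)   # double the smaller tail

cat("F =", round(f_stat, 3), " p-value =", round(p_value, 3), "\n")
```

F is about 2.08 on (9, 13) degrees of freedom and the two-tailed p-value exceeds 0.2, so there is no evidence at α = 0.05 that the variances differ.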

A Note on the Variance Tests

• Unlike the hypothesis tests based on the

t distribution, the variance tests are very

sensitive to departures from normality

– That is, these tests are not robust if the normality

assumption is violated

• So, when using these tests, check your data

carefully


How Do We Know Whether All

These Tests are Any Good?

• We’ve learned about how to do hypothesis tests, including evaluating a test’s performance in terms of α and β

• But how do we know the tests we defined are to be preferred?

– I.e., for any given hypothesis testing scenario, we would want to use the test that achieves the smallest β for a given α

– Alternatively, we would want to use the test that achieves the highest power (1 − β) for a given α

Illustrative Power Curves

[Figure: power curves for Test 1 and Test 2; vertical axis is power, from 0 to 1, with α marked]

Simple vs. Composite Hypotheses

• A hypothesis is said to be simple if that hypothesis uniquely identifies the distribution from which the population is taken

– E.g., when testing the mean of a normally distributed population with σ known, $H_0: \mu = \mu_0$ is a simple hypothesis

• Any hypothesis that is not simple is called a composite hypothesis

– E.g., for the same test, $H_a: \mu > \mu_0$ is a composite hypothesis

Neyman-Pearson Lemma

• Neyman-Pearson Lemma. Suppose we want to test a simple null vs. a simple alternative: $H_0: \theta = \theta_0$ vs. $H_a: \theta = \theta_a$

• Then a test of the form

$$\frac{L(\theta_0)}{L(\theta_a)} < k,$$

where k is chosen to achieve a desired level of α, maximizes the power at $\theta_a$

– That is, the Neyman-Pearson Lemma guarantees a most powerful α-level test

Example 10.22

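• Suppose Y is a single observation from a population with pdf

$$f(y \mid \theta) = \begin{cases} \theta y^{\theta - 1}, & 0 < y < 1 \\ 0, & \text{otherwise} \end{cases}$$

• Find the most powerful α-level test for $H_0: \theta = 2$ vs. $H_a: \theta = 1$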
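For a single observation from the pdf $\theta y^{\theta-1}$ on (0, 1), testing θ = 2 against θ = 1, the likelihood ratio is $L(2)/L(1) = 2y$. Rejecting for small values of the ratio therefore means rejecting for small y, say $y < k^*$. Under H0, $\Pr(Y < k^*) = \int_0^{k^*} 2y\,dy = (k^*)^2$, so $k^* = \sqrt{\alpha}$. The sketch below checks this numerically with `integrate()`. The choice α = 0.05 is an assumption for illustration.

```r
# Most powerful test for H0: theta = 2 vs Ha: theta = 1 (single observation)
alpha <- 0.05   # assumed level for illustration

f <- function(y, theta) theta * y^(theta - 1)   # pdf on 0 < y < 1

k_star <- sqrt(alpha)   # solves k*^2 = alpha; reject H0 when y < k*

size  <- integrate(f, 0, k_star, theta = 2)$value  # Pr(Y < k* | theta = 2)
power <- integrate(f, 0, k_star, theta = 1)$value  # Pr(Y < k* | theta = 1)

cat("Reject H0 if y <", round(k_star, 4), "\n")
cat("size =", round(size, 4), " power =", round(power, 4), "\n")
```

The size comes out equal to α, and the power at θ = 1 is $\sqrt{\alpha}$ (about 0.22 when α = 0.05).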




Uniformly Most Powerful α-level Tests

• As we’ve seen in this and the previous module, we usually have composite alternative hypotheses

• So, ideally, we’d like our test to have higher power at every possible alternative

– That is, we would like to have a uniformly most powerful test

• The Neyman-Pearson Lemma only applies to simple vs. simple hypotheses

– However, often the RR for a test with a composite alternative only depends on $\theta_0$ (and not $\theta_a$)

– In these cases we can apply the Neyman-Pearson Lemma and get a uniformly most powerful test

Example 10.23

• Suppose Y1,…,Yn are a random sample from a normal distribution with unknown mean μ and known variance σ²

• Find the uniformly most powerful α-level test for $H_0: \mu = \mu_0$ vs. $H_a: \mu > \mu_0$
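For the normal mean with known variance, the usual answer is the upper-tailed z-test: reject when $\bar{y} > \mu_0 + z_\alpha \sigma/\sqrt{n}$. This cutoff does not depend on the particular alternative, which is what makes the test uniformly most powerful. The sketch below evaluates its power curve. The function name `ump_power` and all numeric values are hypothetical.

```r
# Power of the upper-tailed z-test for H0: mu = mu0 vs Ha: mu > mu0 (sigma known)
ump_power <- function(mu, mu0, sigma, n, alpha) {
  cutoff <- mu0 + qnorm(1 - alpha) * sigma / sqrt(n)  # reject H0 when ybar > cutoff
  pnorm(cutoff, mean = mu, sd = sigma / sqrt(n), lower.tail = FALSE)
}

# Hypothetical values for illustration
mu0   <- 100
sigma <- 10
n     <- 25
alpha <- 0.05

mu_grid <- seq(100, 108, by = 2)
power   <- ump_power(mu_grid, mu0, sigma, n, alpha)

print(data.frame(mu = mu_grid, power = round(power, 3)))
```

At μ = μ0 the power equals α, and it increases as the true mean moves further above μ0.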



Likelihood Ratio Tests

• Method for conducting good hypothesis tests

– Similar in spirit to the use of maximum likelihood for constructing good point estimators

– Often results in uniformly most powerful tests, or at least tests with reasonably good performance properties

• Most tests we have discussed in class are actually derived from the likelihood ratio principle

– They are called likelihood ratio tests

Likelihood Ratio Tests (continued)

• Definition. A likelihood ratio test is a test of $H_0: \Theta \in \Omega_0$ vs. $H_a: \Theta \in \Omega_a$, where the rejection region is $\lambda \le k$ for

$$\lambda = \frac{\max_{\Theta \in \Omega_0} L(\Theta)}{\max_{\Theta \in \Omega} L(\Theta)}$$

– This is a ratio of likelihoods, where the numerator is maximized over the space defined by H0 and the denominator over the whole space

– It can be shown that $0 \le \lambda \le 1$, where smaller values suggest favoring Ha over H0

• See Chapter 10.11 for a brief introduction

More Terminology

• p-value: probability of seeing a test statistic as extreme or more extreme than the one observed, assuming the null is true

• z-test vs. t-test

– Large sample (z-test) vs. small sample with σ unknown (t-test)

• Simple hypothesis vs. composite hypothesis

– A simple hypothesis uniquely specifies the distribution


What We Covered in this Module

• Understanding the p-value approach to hypothesis testing

• Small-sample hypothesis tests

– For μ

– For p

• Tests for variance(s)

• Neyman-Pearson Lemma and likelihood ratio tests

Homework

• WM&S chapter 10

– Required: 50, 53, 56, 63, 65, 67, 69, 71, 79 and 84

– Extra credit: None

• Useful hints:

– In Exercises 63, 65, 69, 71, and 79, use R instead of the applet to calculate the exact p-value