Revision: 2-12 1
Module 8: Hypothesis Testing II
Statistics (OA3102)
Professor Ron Fricker
Naval Postgraduate School
Monterey, California
Reading assignment:
WM&S chapter 10.6-10.10
Goals for this Module
• Understand the p-value approach to
hypothesis testing
• Small-sample hypothesis tests
– For μ
– For p
• Tests for variance(s)
• Neyman-Pearson Lemma and likelihood ratio
tests
An Alternate Approach to
Hypothesis Testing: p-Values
• Current hypothesis testing approach:
Specify α and report whether the null is rejected
– Distills results down into a simple binary outcome
– In so doing, it leaves out some potentially useful
information since it does not say how strongly or
weakly the null was rejected (or not)
• The p-value approach provides this
information
– It’s also the standard method for hypothesis
testing in all stats packages
So, What’s a p-value?
• A p-value is the probability of seeing a test statistic as
extreme or more extreme than the one observed,
assuming the null hypothesis is true
– Memorize this definition
– “As extreme or more extreme” is defined by the alternative
hypothesis
• The idea: Rather than checking whether the test
statistic falls in the rejection region, we assess how
unusual our test statistic is with a p-value
– If this probability is smaller than a pre-specified significance
level α (usually α = 0.1, 0.05, or 0.01), reject the null
– The smaller the p-value, the stronger the evidence against the null
Back to the Down-time Example
• Remember, we observed z = -3.8, which fell in the
rejection region of (-∞, -1.645)
– Very unusual to see this, if the null is true
• The p-value for this problem is
$\Pr(Z \le -3.8 \mid Z \sim N(0,1)) = 0.000072$
• Since this is less than the test significance level
α=0.05, we (strongly) reject the null
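The tail probability in the down-time example can be checked in R with `pnorm()`, the standard normal CDF. This is a minimal sketch; the variable names are illustrative and not from the slides.

```r
# Observed standardized test statistic from the down-time example
z_obs <- -3.8

# Lower-tailed test: p-value = Pr(Z <= z_obs) for Z ~ N(0,1)
p_value <- pnorm(z_obs)

# Compare the p-value to the significance level
alpha <- 0.05
reject_null <- p_value < alpha

print(p_value)
print(reject_null)
```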
And Back to the Lab
• NPS student survey: “My instructors provide real
world examples during class”
– Response scale from 1 to 5, where 1 was “Strongly
Disagree” and 5 was “Strongly Agree”
And All the Way Back to the
Challenger Example in Module 1
[Figure: scatterplot with fitted regression line; horizontal axis JointTemp (50 to 85), vertical axis ticks from -0.5 to 3.5]
Summary of Fit
RSquare                      0.332354
RSquare Adj                  0.302007
Root Mean Square Error       0.647926
Mean of Response             0.416667
Observations (or Sum Wgts)   24
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   4.63697     1.282118    3.62      0.0015
JointTemp   -0.060362   0.01824     -3.31     0.0032
• Conclusion: There is a significant relationship
between temperature and the number of
incidents!
Textbook Definition
• Definition. If W is a test statistic,
the p-value, or attained significance
level, is the smallest level of
significance α for which the
observed data indicate that the null
hypothesis should be rejected.
Interpreting p-values
• Compare the p-value to the significance level α to decide whether to reject H0
• Small p-values mean either a rare event happened, or the null hypothesis is false
• In the down-time example, the observed result is very unlikely if the null is true
– About one chance in 14,000 (i.e., 1/0.000072) to see something like this or more extreme
– Pretty convincing evidence that the mod decreases mean down-time
Why Do I Care About p-values?
• For some types of hypothesis tests it is not
possible (or hard) to calculate the
acceptance/rejection region
• All stats software packages report the results
of hypothesis tests in terms of p-values
• As we’ve discussed, p-values provide
additional information about the strength of
the observed effect
Calculating p-values for z-tests
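As a sketch of how z-test p-values follow from the form of the alternative hypothesis, the helper below covers the lower-tailed, upper-tailed, and two-tailed cases. The function name `z_p_value` and its `alternative` argument are hypothetical, chosen only for this illustration.

```r
# p-value for a z-test, given the observed statistic and the form of Ha.
# "As extreme or more extreme" is defined by the alternative hypothesis.
z_p_value <- function(z, alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  switch(alternative,
         less      = pnorm(z),                      # Pr(Z <= z)
         greater   = pnorm(z, lower.tail = FALSE),  # Pr(Z >= z)
         two.sided = 2 * pnorm(-abs(z)))            # Pr(|Z| >= |z|)
}

print(z_p_value(-3.8, "less"))
print(z_p_value(2.5, "greater"))
print(z_p_value(-2.5, "two.sided"))
```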
Steps in Hypothesis Testing
When Using p-values
1. Identify the parameter of interest
2. State null and alternative hypotheses
3. Determine form of test statistic and its
sampling distribution
4. Calculate the p-value
5. Determine test outcome by comparing
the p-value to the significance level
Example
• You’re an analyst at a logistics depot and you’ve been asked to
determine if the average monthly demand for an end-item is
what was originally planned: 245 units per month
– For a sample of size n=50 months, you calculate ȳ = 246.18
and s=3.6
• Using the p-value approach, conduct a hypothesis test of
H0: μ=245 vs Ha: μ≠245, at significance level α=0.01
Example (continued)
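One way to work the depot example in R is sketched below. It assumes the large-sample z-test applies (n=50) with s substituted for σ; the variable names are illustrative. With these numbers z is about 2.32 and the two-tailed p-value is about 0.02, which is larger than α=0.01, so the null is not rejected.

```r
# Summary statistics from the example
ybar <- 246.18
s    <- 3.6
n    <- 50
mu0  <- 245
alpha <- 0.01

# Large-sample test statistic (s used in place of sigma)
z <- (ybar - mu0) / (s / sqrt(n))

# Two-tailed p-value: Pr(|Z| >= |z|)
p_value <- 2 * pnorm(-abs(z))

reject_null <- p_value < alpha
print(c(z = z, p_value = p_value))
print(reject_null)
```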
Example 10.10
• Returning to the poll example (Ex. 10.1-10.4),
remember:
– We want to test H0: p = 0.5 vs. Ha: p < 0.5, where p
is the proportion of the population favoring a
candidate
– The test statistic is Y, the number of people in the
sample favoring the candidate, where n=15 people
– What is the p-value if we observe Y=3?
• Solution:
Example 10.10 (continued)
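Instead of Table 1 in Appendix 3, the binomial tail probability can be computed with `pbinom()`. A minimal sketch for the lower-tailed test, where the p-value is Pr(Y ≤ 3) under H0:

```r
# Poll example: Y ~ Bin(n = 15, p = 0.5) under H0, observed y = 3
n  <- 15
p0 <- 0.5
y_obs <- 3

# Lower-tailed alternative (p < 0.5): p-value = Pr(Y <= y_obs)
p_value <- pbinom(y_obs, size = n, prob = p0)
print(p_value)
```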
Table 1 from Appendix 3
Example 10.11
• Returning to Example 10.7, the hypotheses
were H0: μ0-μ1 = 0 vs. Ha: μ0-μ1 ≠ 0 and the
standardized test statistic was z = -2.5
• Calculate the p-value:
Example 10.11 (continued)
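For a two-tailed alternative, "as extreme or more extreme" counts both tails, so the p-value is twice the one-tail area. A short sketch in R:

```r
# Observed standardized statistic from Example 10.7
z_obs <- -2.5

# Two-tailed p-value: Pr(Z <= -|z|) + Pr(Z >= |z|) = 2 * Pr(Z <= -|z|)
p_value <- 2 * pnorm(-abs(z_obs))
print(p_value)
```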
A Note About Collecting Data
• Data collection may or may not follow the definition of the hypothesis
– You can collect the data first, but you shouldn’t use it to
generate your hypothesis
• That is, peeking at the data first before generating your hypothesis is cheating
• It would be like a gambler who gets to see the outcome before placing his/her bet
• Generally, you only have one hypothesis to test; if it fails you don’t keep trying until something is “proven”
– E.g., if you “fish” through a large data set, at α=0.05 on
average you will get “significance” one out of every 20 tests, even when every null is true
• Finally, note that replication is the real proof of results
– Until then, your significant results could just be a Type I error
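The "one out of every 20" figure can be made concrete with a little arithmetic. The sketch below additionally assumes the 20 tests are independent and that every null is true; that independence assumption is ours, not the slide's.

```r
alpha   <- 0.05
n_tests <- 20

# Expected number of false "significant" results when every null is true
expected_false_positives <- n_tests * alpha

# Chance of at least one false positive, assuming independent tests
prob_at_least_one <- 1 - (1 - alpha)^n_tests

print(expected_false_positives)
print(prob_at_least_one)
```

Under these assumptions the chance of at least one spurious "significant" result is well over one half.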
Statistical vs. Practical Significance
• No matter how small the effect, if we take a big
enough sample we can make it statistically
significant
– It’s because the larger the sample size, the smaller the
standard error
– Smaller standard errors mean statistical significance for
smaller differences between the null and alternative
• But statistical significance is not the same as
practical significance
– Practical significance follows when the result is important to
the decision maker
– A substantive, not statistical, question
Statistical vs. Practical Significance
• Consider a test of H0: μ=100 vs Ha: μ>100, where μ is
the mean of a normal population with σ=10
– Suppose a true value of μ=101 is inconsequential from a
practical perspective
– The table below shows the test outcome for various sample
sizes (n) and x̄ = 101
* Table from Probability and Statistics for Engineering and the Sciences, 7th ed., Duxbury Press, 2008.
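A table of this kind can be computed in R as sketched below. The sample sizes are illustrative assumptions and are not necessarily the ones in the cited textbook table; the point is only that the same x̄ = 101 moves from "not significant" to "highly significant" as n grows.

```r
mu0   <- 100
sigma <- 10
xbar  <- 101

# Illustrative sample sizes (assumed for this sketch)
n <- c(25, 100, 400, 900, 1600, 2500)

# Upper-tailed z-test with known sigma
z <- (xbar - mu0) / (sigma / sqrt(n))
p_value <- pnorm(z, lower.tail = FALSE)

results <- data.frame(n = n, z = z, p_value = p_value,
                      reject_at_01 = p_value < 0.01)
print(results)
```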
Small-Sample Hypothesis Testing
for the Mean: t -Tests
• If Y1,…,Yn are a random sample from a normal distribution with unknown
mean μ and unknown variance σ², then the
standardized statistic has a t-distribution with n-1
degrees of freedom:
$$T = \frac{\bar{Y} - \mu_0}{S/\sqrt{n}}$$
• As always, three possible hypotheses and tests:
Alternative Hypothesis                        Rejection Region for Level α Test
$H_a: \mu > \mu_0$ (upper-tailed test)        $T \ge t_{\alpha,\,n-1}$
$H_a: \mu < \mu_0$ (lower-tailed test)        $T \le -t_{\alpha,\,n-1}$
$H_a: \mu \ne \mu_0$ (two-tailed test)        $T \ge t_{\alpha/2,\,n-1}$ or $T \le -t_{\alpha/2,\,n-1}$
Example 10.12
• In Example 8.11, we tested the muzzle velocity of
n=8 shells using new gunpowder
– Resulting data: ȳ = 2,959 and s=39.1
– Is there evidence to contradict the manufacturer’s claim
that μ > 3,000 fps at a level of α=0.025?
• Specify the hypotheses to be tested:
Example 10.12 (continued)
• Conduct the test using the rejection region
approach
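A sketch of the rejection region calculation in R, reading the question as H0: μ = 3,000 vs. Ha: μ < 3,000 (that reading of the hypotheses is our assumption; the slide leaves them to be filled in). `qt()` replaces the t table.

```r
# Summary statistics from Example 10.12
ybar <- 2959
s    <- 39.1
n    <- 8
mu0  <- 3000
alpha <- 0.025

# Test statistic, t-distributed with n - 1 df under H0
t_stat <- (ybar - mu0) / (s / sqrt(n))

# Lower-tailed test: reject if T <= -t_{alpha, n-1}
t_crit <- -qt(1 - alpha, df = n - 1)
reject_null <- t_stat <= t_crit

print(c(t_stat = t_stat, t_crit = t_crit))
print(reject_null)
```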
Example 10.13
• Conduct the test by calculating the p-value
and compare to the previous result
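The same muzzle velocity test can be sketched with a p-value from `pt()`, again assuming the lower-tailed alternative Ha: μ < 3,000. The p-value falls between 0.01 and 0.025, so the conclusion at α=0.025 matches the rejection region approach.

```r
# Summary statistics from Example 10.12
ybar <- 2959
s    <- 39.1
n    <- 8
mu0  <- 3000
alpha <- 0.025

t_stat <- (ybar - mu0) / (s / sqrt(n))

# Lower-tailed p-value: Pr(T <= t_stat) with n - 1 df
p_value <- pt(t_stat, df = n - 1)
reject_null <- p_value < alpha

print(p_value)
print(reject_null)
```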
Small-Sample Hypothesis Testing for
Differences in Means: t -Tests
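• For two independent samples, Y11,…,Y1n1 and Y21,…,Y2n2, both
normally distributed with unknown means and
unknown but equal variances, the standardized
statistic has a t-distribution with n1+n2-2 degrees of freedom:
$$T = \frac{\bar{Y}_1 - \bar{Y}_2 - D_0}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
where $S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}$ is the pooled variance estimator
• With $H_0: \mu_1 - \mu_2 = D_0$, three possible tests:
Alternative Hypothesis                              Rejection Region for Level α Test
$H_a: \mu_1 - \mu_2 > D_0$ (upper-tailed test)      $T \ge t_{\alpha,\,n_1+n_2-2}$
$H_a: \mu_1 - \mu_2 < D_0$ (lower-tailed test)      $T \le -t_{\alpha,\,n_1+n_2-2}$
$H_a: \mu_1 - \mu_2 \ne D_0$ (two-tailed test)      $T \ge t_{\alpha/2,\,n_1+n_2-2}$ or $T \le -t_{\alpha/2,\,n_1+n_2-2}$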
Example 10.14
• Back to Example 8.12, assembly time using
two different training methods
– Two independent random samples of 9
people each were collected, giving
ȳ1 = 35.22, s1² = 195.56/8, ȳ2 = 31.56, and s2² = 160.22/8
– Is there evidence to suggest a difference in the true
mean assembly times at the α=0.05 level?
• Specify the hypotheses to be tested:
Example 10.14 (continued)
• Conduct the test using the rejection region
method
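A sketch of the pooled two-sample t-test in R from the summary statistics, taking H0: μ1 - μ2 = 0 vs. Ha: μ1 - μ2 ≠ 0 (our reading of "a difference"). The pooled test has n1 + n2 - 2 = 16 degrees of freedom.

```r
# Summary statistics from Example 10.14
n1 <- 9; ybar1 <- 35.22; s2_1 <- 195.56 / 8
n2 <- 9; ybar2 <- 31.56; s2_2 <- 160.22 / 8
D0 <- 0
alpha <- 0.05

# Pooled variance estimate and test statistic
df   <- n1 + n2 - 2
sp2  <- ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / df
t_stat <- (ybar1 - ybar2 - D0) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))

# Two-tailed test: reject if |T| >= t_{alpha/2, df}
t_crit <- qt(1 - alpha / 2, df = df)
reject_null <- abs(t_stat) >= t_crit

print(c(t_stat = t_stat, t_crit = t_crit))
print(reject_null)
```

With these numbers the statistic does not reach the critical value, so the null is not rejected at α=0.05.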
Example 10.15
• Conduct the test using the p-value method
and compare to the previous results
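The p-value version of the assembly time test can be sketched with `pt()`, again assuming a two-tailed alternative and the pooled statistic with 16 degrees of freedom. The p-value lands between 0.10 and 0.20, consistent with not rejecting at α=0.05.

```r
# Summary statistics from Example 10.14
n1 <- 9; ybar1 <- 35.22; s2_1 <- 195.56 / 8
n2 <- 9; ybar2 <- 31.56; s2_2 <- 160.22 / 8
alpha <- 0.05

df  <- n1 + n2 - 2
sp2 <- ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / df
t_stat <- (ybar1 - ybar2) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))

# Two-tailed p-value: 2 * Pr(T >= |t_stat|)
p_value <- 2 * pt(-abs(t_stat), df = df)
reject_null <- p_value < alpha

print(p_value)
print(reject_null)
```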
A Note on the Normality Assumption
• The small sample tests based on the
t distribution assume the data is normally
distributed
• It’s generally impossible to be sure this
assumption is met
• Turns out, $T = (\bar{Y} - \mu)/(S/\sqrt{n})$ is robust to
modest departures from normality
– True for both hypothesis tests and confidence
intervals
– Means you don’t have to agonize over
distributional assumptions before doing the test
Small-Sample Tests
for Population Proportion (p)
• The large-sample method is approximating
the sampling distribution of p̂
• If the sample is small, the approximation no
longer holds and we must use the actual
sampling distribution
• And, rather than derive the sampling
distribution of p̂, let’s make the test statistic Y
– We know its distribution – it’s binomial
– We’re back to our original coin flipping problem!
How to Implement the Test?
• Proceed much like we did at the start of the last
module, but now only need to determine the p-value
based on the observed count
– As before, first define the hypotheses and specify the
significance level α
• Example 1: Consider a lower-tailed test where the
rejection region is defined as Y ≤ c
– Then $p\text{-value} = \Pr(Y \le y \mid Y \sim \text{Bin}(n, p_0))$
• Example 2: Consider an upper-tailed test where the
rejection region is defined as Y ≥ c
– Then
$p\text{-value} = \Pr(Y \ge y \mid Y \sim \text{Bin}(n, p_0))$
$= 1 - \Pr(Y \le y-1 \mid Y \sim \text{Bin}(n, p_0))$
Example
• Calculate the p-value for a small-sample test of
H0: p=0.9 vs Ha: p<0.9 if we observe y=14 successes
out of n=20 trials
– What do you conclude for a level α=0.05 test?
Example (continued)
In R:
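A minimal sketch of the calculation with `pbinom()`; the variable names are illustrative. For the lower-tailed alternative the p-value is Pr(Y ≤ 14) under Bin(20, 0.9), which is about 0.011, so the null is rejected at α=0.05.

```r
# Small-sample test of H0: p = 0.9 vs Ha: p < 0.9
n  <- 20
p0 <- 0.9
y_obs <- 14
alpha <- 0.05

# Lower-tailed p-value: Pr(Y <= y_obs | Y ~ Bin(n, p0))
p_value <- pbinom(y_obs, size = n, prob = p0)
reject_null <- p_value < alpha

print(p_value)
print(reject_null)
```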
Testing the Variance
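• Assume Y1,…,Yn are a random sample from a normal distribution with unknown
mean μ and unknown variance σ²
• Test statistic: $\chi^2 = (n-1)S^2/\sigma_0^2$
• Null hypothesis: $H_0: \sigma^2 = \sigma_0^2$
• Three possible alternative hypotheses:
Alternative Hypothesis               Rejection Region for Level α Test
$H_a: \sigma^2 > \sigma_0^2$         $\chi^2 \ge \chi^2_{\alpha,\,n-1}$
$H_a: \sigma^2 < \sigma_0^2$         $\chi^2 \le \chi^2_{1-\alpha,\,n-1}$
$H_a: \sigma^2 \ne \sigma_0^2$       $\chi^2 \ge \chi^2_{\alpha/2,\,n-1}$ or $\chi^2 \le \chi^2_{1-\alpha/2,\,n-1}$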
Illustrating the Rejection Region
Example 10.16
• The variance of machined parts cannot be
larger than 0.0002 in²
– A random sample of n=10 parts gives s²=0.0003
– Test H0: σ² = 0.0002 vs. Ha: σ² > 0.0002, at a level of
α=0.05
• Conduct the test using the rejection region
approach:
Example 10.16 (continued)
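A sketch of the rejection region approach in R, using `qchisq()` in place of the chi-square table. The statistic is 13.5, which falls short of the upper 5% point of the chi-square distribution with 9 degrees of freedom, so the null is not rejected.

```r
# Example 10.16: H0: sigma^2 = 0.0002 vs Ha: sigma^2 > 0.0002
n      <- 10
s2     <- 0.0003
sigma2_0 <- 0.0002
alpha  <- 0.05

# Test statistic, chi-square with n - 1 df under H0
chi2_stat <- (n - 1) * s2 / sigma2_0

# Upper-tailed test: reject if the statistic exceeds the upper alpha point
chi2_crit <- qchisq(1 - alpha, df = n - 1)
reject_null <- chi2_stat >= chi2_crit

print(c(chi2_stat = chi2_stat, chi2_crit = chi2_crit))
print(reject_null)
```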
Example 10.17
• Conduct the test using the p-value approach
In R:
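A minimal sketch of the p-value calculation with `pchisq()` for the machined parts test (upper-tailed, 9 degrees of freedom). The p-value comes out above 0.10, which agrees with not rejecting at α=0.05.

```r
# Example 10.16 data: n = 10, s^2 = 0.0003, sigma_0^2 = 0.0002
n        <- 10
s2       <- 0.0003
sigma2_0 <- 0.0002

chi2_stat <- (n - 1) * s2 / sigma2_0

# Upper-tailed p-value: Pr(chi-square_{n-1} >= chi2_stat)
p_value <- pchisq(chi2_stat, df = n - 1, lower.tail = FALSE)
print(p_value)
```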
Example 10.18
• An experimenter is convinced that σ=2
– A sample of n=16 yields s²=6.1
– Do the data disagree with his claim?
• Determine the p-value of the test
• What conclusion do you reach for α=0.05?
• Solution:
Example 10.18 (continued)
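One way to sketch this in R, taking H0: σ² = 4 vs. Ha: σ² ≠ 4 (a two-tailed reading of "disagree", which is our assumption). The two-tailed p-value is taken as twice the smaller tail area.

```r
# Example 10.18: claimed sigma = 2, so sigma_0^2 = 4
n        <- 16
s2       <- 6.1
sigma2_0 <- 2^2
alpha    <- 0.05

chi2_stat <- (n - 1) * s2 / sigma2_0

# Two-tailed p-value: double the smaller of the two tail areas
lower_tail <- pchisq(chi2_stat, df = n - 1)
p_value <- 2 * min(lower_tail, 1 - lower_tail)
reject_null <- p_value < alpha

print(c(chi2_stat = chi2_stat, p_value = p_value))
print(reject_null)
```

The p-value lies between 0.10 and 0.20, so at α=0.05 the data do not contradict the claim.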
F Test for Equality of Variances
• Null hypothesis: $H_0: \sigma_1^2 = \sigma_2^2$
• Test statistic is $F = S_1^2/S_2^2$, where the numerator
has n1 observations and the denominator n2 observations
• Three possible alternative hypotheses:
Alternative Hypothesis                 Rejection Region for Level α Test
$H_a: \sigma_1^2 > \sigma_2^2$         $F \ge F_{\alpha,\,n_1-1,\,n_2-1}$
$H_a: \sigma_1^2 < \sigma_2^2$         $F \le F_{1-\alpha,\,n_1-1,\,n_2-1}$
$H_a: \sigma_1^2 \ne \sigma_2^2$       $F \ge F_{\alpha/2,\,n_1-1,\,n_2-1}$ or $F \le F_{1-\alpha/2,\,n_1-1,\,n_2-1}$
Calculating the Rejection Region
for a One-Tailed Test
• The easy test to conduct using Table 7 is $H_a: \sigma_1^2 > \sigma_2^2$
– Why? Because the table is set up to calculate
rejection regions in the right tail
– Now note that the choice of which population is
labeled “1” is arbitrary, so we can always turn a left-tailed
test into a right-tailed one
– This avoids having to solve for a left-tail rejection region
from Table 7: $F_{1-\alpha,\,\nu_1,\,\nu_2} = 1/F_{\alpha,\,\nu_2,\,\nu_1}$
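The reciprocal relationship between left-tail and right-tail F critical values can be checked numerically with `qf()`. In this sketch the degrees of freedom and α are arbitrary illustrative values. Note that `qf()` takes a lower-tail probability, so the slides' $F_{\alpha,\nu_1,\nu_2}$ (upper-tail area α) corresponds to `qf(1 - alpha, nu1, nu2)`.

```r
# Illustrative values (assumed for this sketch)
alpha <- 0.05
nu1 <- 4
nu2 <- 5

# Left-tail critical value F_{1-alpha, nu1, nu2}: lower-tail area alpha
left_direct <- qf(alpha, nu1, nu2)

# Same value from the right tail with degrees of freedom swapped
left_via_reciprocal <- 1 / qf(1 - alpha, nu2, nu1)

print(c(left_direct, left_via_reciprocal))
```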
Calculating the Rejection Region
for a Two-Tailed Test
• For $H_a: \sigma_1^2 \ne \sigma_2^2$ can’t avoid the left tail
calculation issue
– Start by solving for $F_{\alpha/2,\,\nu_1,\,\nu_2}$
– Also solve for $F_{1-\alpha/2,\,\nu_1,\,\nu_2} = 1/F_{\alpha/2,\,\nu_2,\,\nu_1}$
– Then $RR: F \ge F_{\alpha/2,\,n_1-1,\,n_2-1}$ or $F \le F_{1-\alpha/2,\,n_1-1,\,n_2-1}$
• Example: Find the rejection region for a two-
sided test with n1=5 and n2=6
Example (continued)
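A sketch of the two-sided rejection region for n1=5 and n2=6 in R. The slide does not state α, so α=0.05 is assumed here purely for illustration.

```r
n1 <- 5
n2 <- 6
alpha <- 0.05  # assumed; not stated in the example

nu1 <- n1 - 1
nu2 <- n2 - 1

# Upper critical value F_{alpha/2, nu1, nu2}
upper_crit <- qf(1 - alpha / 2, nu1, nu2)

# Lower critical value F_{1-alpha/2, nu1, nu2} = 1 / F_{alpha/2, nu2, nu1}
lower_crit <- 1 / qf(1 - alpha / 2, nu2, nu1)

# Reject H0 if F >= upper_crit or F <= lower_crit
print(c(lower_crit = lower_crit, upper_crit = upper_crit))
```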
Example
• In a comparison between a new fire control radar and
the existing radar, NAVSEA wants to test whether the
new radar is more precise
• By “precise” we mean there is less variability in
where the rounds land downrange
• In contrast, “accuracy” would measure how well
the rounds actually hit the target
• To compare, 20 rounds are fired from the same
weapons system, under the same conditions,
differing only by which fire control radar was used
Example (continued)
• Results: s_new radar = 23 meters and s_old radar = 50 meters
• At an α=0.01 level, can you conclude that the new
radar is more precise than the existing radar?
Example (continued)
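A sketch of the radar test from the summary statistics. It assumes 20 rounds were fired with each radar (the slide does not state the per-radar count explicitly), and it labels the old radar as population 1 so the test is right-tailed: H0: σ²old = σ²new vs. Ha: σ²old > σ²new.

```r
# Sample standard deviations (meters)
s_new <- 23
s_old <- 50

# Assumed: 20 rounds per radar
n_old <- 20
n_new <- 20
alpha <- 0.01

# Put the old radar in the numerator so "new is more precise" is a right-tailed test
f_stat <- s_old^2 / s_new^2
f_crit <- qf(1 - alpha, n_old - 1, n_new - 1)
p_value <- pf(f_stat, n_old - 1, n_new - 1, lower.tail = FALSE)

reject_null <- f_stat >= f_crit
print(c(f_stat = f_stat, f_crit = f_crit, p_value = p_value))
print(reject_null)
```

Under that assumption the statistic exceeds the critical value, supporting the conclusion that the new radar is more precise.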
Testing Variances in R
• var.test() in R will do the test:
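Since `var.test()` needs the raw observations and the radar data are not reproduced here, the sketch below uses simulated data. The simulated values, seed, and variable names are hypothetical and are not the data from the example.

```r
set.seed(3102)  # arbitrary seed so the simulated data are reproducible

# Hypothetical miss distances (meters); NOT the data from the example
new_radar <- rnorm(20, mean = 0, sd = 23)
old_radar <- rnorm(20, mean = 0, sd = 50)

# Ha: var(new) / var(old) < 1, i.e., the new radar is more precise
result <- var.test(new_radar, old_radar, alternative = "less")
print(result)
```

The reported F statistic is the ratio of the two sample variances, with 19 numerator and 19 denominator degrees of freedom.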
Testing Variances in R
• The test can be done in either direction:
Testing Variances in R
• And, just to illustrate, here’s the two-tailed
test:
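A sketch using simulated (hypothetical) data to show how the `alternative` argument of `var.test()` relates the directions: swapping the samples and the direction gives the same one-tailed p-value, and the two-tailed p-value is twice the smaller tail area.

```r
set.seed(3102)  # arbitrary seed; the data below are simulated, not from the example
x <- rnorm(20, sd = 23)  # stands in for the new radar
y <- rnorm(20, sd = 50)  # stands in for the old radar

# Same test run in either direction
p_less            <- var.test(x, y, alternative = "less")$p.value
p_greater_swapped <- var.test(y, x, alternative = "greater")$p.value

# Two-tailed test (the default alternative)
p_two_sided <- var.test(x, y, alternative = "two.sided")$p.value

print(c(p_less = p_less, p_greater_swapped = p_greater_swapped,
        p_two_sided = p_two_sided))
```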
Example 10.19
• Compare the variation in parts between two
competitors:
– n1=10 and s1² = 0.0003
– n2=20 and s2² = 0.0001
• Can we conclude σ1² > σ2² at a significance
level of α=0.05?
Example 10.19 (continued)
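A sketch of the right-tailed F test in R from the summary statistics, with `qf()` in place of Table 7. The statistic is about 3 on (9, 19) degrees of freedom, which exceeds the upper 5% point, so the null is rejected.

```r
# Example 10.19 summary statistics
n1 <- 10; s2_1 <- 0.0003
n2 <- 20; s2_2 <- 0.0001
alpha <- 0.05

# H0: sigma1^2 = sigma2^2 vs Ha: sigma1^2 > sigma2^2
f_stat <- s2_1 / s2_2
f_crit <- qf(1 - alpha, n1 - 1, n2 - 1)
reject_null <- f_stat >= f_crit

print(c(f_stat = f_stat, f_crit = f_crit))
print(reject_null)
```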
Example 10.20
• Conduct the test using the p-value approach
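The p-value for the competitor comparison can be sketched with `pf()`, using the summary statistics n1=10, s1²=0.0003, n2=20, s2²=0.0001 and the right-tailed alternative. It falls between 0.01 and 0.025, consistent with rejecting at α=0.05.

```r
n1 <- 10; s2_1 <- 0.0003
n2 <- 20; s2_2 <- 0.0001

f_stat <- s2_1 / s2_2

# Right-tailed p-value: Pr(F_{n1-1, n2-1} >= f_stat)
p_value <- pf(f_stat, n1 - 1, n2 - 1, lower.tail = FALSE)
print(p_value)
```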
Revision: 2-12 56
Example 10.21
• An experiment to assess electrical shock
pain thresholds for males and females was
conducted (by Bill Murray in Ghostbusters?)
– nmales=14, ȳmales = 16.2, s²males = 12.7
– nfemales=10, ȳfemales = 14.9, s²females = 26.4
• Can we conclude σ²males ≠ σ²females at a significance
level of α=0.05?
Example 10.21 (continued)
Example 10.21 (continued)
• What can be said about the p-value?
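A sketch of the two-tailed F test and its p-value in R. Following the relabeling idea used for one-tailed tests, the larger sample variance (females) is placed in the numerator so that only the upper critical value matters; that labeling choice is ours.

```r
# Example 10.21 summary statistics
n_m <- 14; s2_m <- 12.7
n_f <- 10; s2_f <- 26.4
alpha <- 0.05

# Larger sample variance in the numerator, so F > 1
f_stat <- s2_f / s2_m
df1 <- n_f - 1
df2 <- n_m - 1

# Two-tailed test: with F > 1 only the upper critical value can be exceeded
upper_crit <- qf(1 - alpha / 2, df1, df2)
reject_null <- f_stat >= upper_crit

# Two-tailed p-value: double the smaller tail area
upper_tail <- pf(f_stat, df1, df2, lower.tail = FALSE)
p_value <- 2 * min(upper_tail, 1 - upper_tail)

print(c(f_stat = f_stat, upper_crit = upper_crit, p_value = p_value))
print(reject_null)
```

The p-value is well above 0.10, so there is little evidence of a difference in variances.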
A Note on the Variance Tests
• Unlike the hypothesis tests based on the
t distribution, the variance tests are very
sensitive to departures from normality
– That is, these tests are not robust if the normality
assumption is violated
• So, when using these tests, check your data
carefully
How Do We Know Whether All
These Tests are Any Good?
• We’ve learned about how to do hypothesis
tests, including evaluating a test’s
performance in terms of α and β
• But how do we know the tests we defined are
to be preferred?
– I.e., for any given hypothesis testing scenario, we
would want to use the test that achieves the
smallest β for a given α
– Alternatively, we would want to use the test that
achieves the highest power (1-β) for a given α
Illustrative Power Curves
[Figure: power curves for Test 1 and Test 2; vertical axis is power, from 0 to 1, with α marked]
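To illustrate how two α-level tests can differ in power, the sketch below computes power curves for an upper-tailed z-test with known σ at two sample sizes. The setup (μ0=100, σ=10, n=25 vs. n=100, α=0.05) and the function name `z_power` are assumptions made only for this illustration; they are not the tests plotted on the slide.

```r
# Power of an upper-tailed z-test of H0: mu = mu0 when the true mean is mu
z_power <- function(mu, mu0, sigma, n, alpha) {
  z_alpha <- qnorm(1 - alpha)
  # Reject when Z >= z_alpha; under mu, Z is shifted by (mu - mu0)/(sigma/sqrt(n))
  pnorm(z_alpha - (mu - mu0) / (sigma / sqrt(n)), lower.tail = FALSE)
}

mu_grid <- seq(100, 106, by = 1)
power_test1 <- z_power(mu_grid, mu0 = 100, sigma = 10, n = 25,  alpha = 0.05)
power_test2 <- z_power(mu_grid, mu0 = 100, sigma = 10, n = 100, alpha = 0.05)

print(data.frame(mu = mu_grid, power_test1, power_test2))
```

Both curves equal α at the null value, and the test with the larger sample has higher power at every alternative on the grid.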
Simple vs. Composite Hypotheses
• A hypothesis is said to be simple if that
hypothesis uniquely identifies the distribution
from which the population is taken
– E.g., when testing the mean of a normally
distributed population with σ known, then H0: μ=μ0
is a simple hypothesis
• Any hypothesis that is not simple is called a
composite hypothesis
– E.g., for the same test, then Ha: μ > μ0 is a
composite hypothesis
Neyman-Pearson Lemma
• Neyman-Pearson Lemma. Suppose we want
to test a simple null vs. a simple alternative:
$H_0: \theta = \theta_0$ vs. $H_a: \theta = \theta_a$
• Then a test of the form
$$\frac{L(\theta_0)}{L(\theta_a)} < k,$$
where k is chosen to achieve a desired level
of α, maximizes the power at θa
– That is, the Neyman-Pearson Lemma guarantees a
most powerful α-level test
Example 10.22
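• Suppose Y is a single observation from a population
with pdf
$$f(y \mid \theta) = \begin{cases} \theta y^{\theta-1}, & 0 < y < 1 \\ 0, & \text{otherwise} \end{cases}$$
• Find the most powerful α-level test for
$H_0: \theta = 2$ vs. $H_a: \theta = 1$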
Example 10.22 (continued)
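One way the solution can be sketched, applying the Neyman-Pearson Lemma to the single observation y with $f(y \mid \theta) = \theta y^{\theta-1}$ on $0 < y < 1$ (the constant $k^*$ is notation introduced here):

```latex
\begin{align*}
\frac{L(\theta_0)}{L(\theta_a)}
  &= \frac{f(y \mid \theta = 2)}{f(y \mid \theta = 1)}
   = \frac{2y}{1} < k
  \quad\Longleftrightarrow\quad y < \frac{k}{2} \equiv k^* \\[4pt]
\alpha
  &= \Pr(Y < k^* \mid \theta = 2)
   = \int_0^{k^*} 2y \, dy = (k^*)^2
  \quad\Longrightarrow\quad k^* = \sqrt{\alpha} \\[4pt]
\text{Power at } \theta_a = 1:\quad
  & \Pr(Y < \sqrt{\alpha} \mid \theta = 1)
   = \int_0^{\sqrt{\alpha}} 1 \, dy = \sqrt{\alpha}
\end{align*}
```

So the most powerful α-level test rejects H0 when $y < \sqrt{\alpha}$; for example, with α=0.05 the rejection region is approximately y < 0.2236.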
Example 10.22 (continued)
Uniformly Most Powerful α-level Tests
• As we’ve seen in this and the previous module,
we usually have composite alternative
hypotheses
• So, ideally, we’d like our test to have higher
power at every possible alternative
– That is, we would like to have a uniformly most powerful test
• The Neyman-Pearson Lemma only applies to
simple vs. simple hypotheses
– However, often the RR for a test with a composite alternative
only depends on θ0 (and not θa)
– In these cases we can apply the Neyman-Pearson Lemma
and get a uniformly most powerful test
Example 10.23
• Suppose Y1,…,Yn are a random sample from a normal
distribution with unknown mean μ and known
variance σ²
• Find the uniformly most powerful α-level test for
$H_0: \mu = \mu_0$ vs. $H_a: \mu > \mu_0$
Example 10.23 (continued)
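A sketch of the argument: fix any particular alternative $\mu_a > \mu_0$, apply the Neyman-Pearson Lemma, and observe that the resulting rejection region does not depend on $\mu_a$ (the constant $k'$ is notation introduced here).

```latex
\begin{align*}
\frac{L(\mu_0)}{L(\mu_a)}
  &= \exp\!\left\{ -\frac{1}{2\sigma^2}
     \left[ \sum_{i=1}^n (y_i - \mu_0)^2 - \sum_{i=1}^n (y_i - \mu_a)^2 \right] \right\} < k \\[4pt]
\sum_{i=1}^n (y_i - \mu_0)^2 - \sum_{i=1}^n (y_i - \mu_a)^2
  &= 2n\bar{y}(\mu_a - \mu_0) + n(\mu_0^2 - \mu_a^2) \\[4pt]
\text{so, since } \mu_a > \mu_0:\quad
\frac{L(\mu_0)}{L(\mu_a)} < k
  &\quad\Longleftrightarrow\quad
  \bar{y} > \frac{-2\sigma^2 \ln k - n(\mu_0^2 - \mu_a^2)}{2n(\mu_a - \mu_0)} \equiv k' \\[4pt]
\alpha = \Pr(\bar{Y} > k' \mid \mu = \mu_0)
  &\quad\Longrightarrow\quad
  k' = \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}}
\end{align*}
```

Because $k'$ is determined by $\mu_0$, σ, n, and α alone, the same test, reject H0 when $\bar{y} > \mu_0 + z_\alpha \sigma/\sqrt{n}$, is most powerful against every $\mu_a > \mu_0$ and is therefore uniformly most powerful.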
Example 10.23 (continued)
Likelihood Ratio Tests
• Method for conducting good hypothesis tests
– Similar in spirit to the use of maximum likelihood
for constructing good point estimators
– Often results in uniformly most powerful tests, or
at least tests with reasonably good performance
properties
• Most tests we have discussed in class
are actually derived from the likelihood ratio
principle
– They are called likelihood ratio tests
Likelihood Ratio Tests (continued)
• Definition. A likelihood ratio test is a test of
$H_0: \theta \in \Omega_0$ vs. $H_a: \theta \in \Omega_a$, where the rejection
region is $\lambda \le k$ for
$$\lambda = \frac{\max_{\theta \in \Omega_0} L(\theta)}{\max_{\theta \in \Omega} L(\theta)}$$
– This is a ratio of likelihoods, where the numerator
is maximized over the space defined by H0 and
the denominator over the whole space
– It can be shown that $0 \le \lambda \le 1$, where smaller
values suggest favoring Ha over H0
• See Chapter 10.11 for a brief introduction
More Terminology
• p-value: probability of seeing a test statistic as
extreme or more extreme assuming null is
true
• z-test vs. t-test
– Large sample (z-test) vs. small sample with σ
unknown (t-test)
• Simple hypothesis vs. composite hypothesis
– A simple hypothesis uniquely specifies the
sampling distribution
What We Covered in this Module
• Understanding the p-value approach to
hypothesis testing
• Small-sample hypothesis tests
– For μ
– For p
• Tests for variance(s)
• Neyman-Pearson Lemma and likelihood ratio
tests
Homework
• WM&S chapter 10
– Required: 50, 53, 56, 63, 65, 67, 69, 71, 79 and 84
– Extra credit: None
• Useful hints:
In Exercises 63, 65, 69, 71, and 79, use R instead
of the applet to calculate the exact p-value