Unit 22: Sampling Distributions

Unit 22: Sampling Distributions | Student Guide

Summary of Video

If we know an entire population, then we can compute population parameters such as the population mean or standard deviation. However, we generally don’t have access to data from the entire population and must base our information about a population on a sample. From samples, we compute statistics such as sample means or sample standard deviations. However, if we resample, chances are good that we won’t get the same results.

This video begins with a population of heights from students in a third grade class at Monica Ros School. A graphic display of the population distribution of heights shows a roughly normal shape with a mean µ = 53.4 inches and standard deviation σ = 1.8 inches (See Figure 22.1.).

Figure 22.1. Population distribution of heights from third-grade class.

Next, we draw random samples of size four from the class and record the heights. Figure 22.2 shows the results from five samples along with their sample means, which can be found in Table 22.1. Notice that the sample means vary from sample to sample, except for Samples 3 and 4 where the sample means match even though the data values differ.


(Plot of Sample, 1 to 5, against Height, 50 to 57.)

Figure 22.2. Random samples of size four.

We can keep sampling until we’ve selected all samples of size four from this population of 20 students. If we plot the sample means of all possible samples of size four, we get what is called the sampling distribution of the sample mean (See bottom graph in Figure 22.3.).

Figure 22.3. Sampling distribution of the sample mean.

Now, compare the sampling distribution of x̄ to the population distribution. Notice that both distributions are approximately normal with mean 53.4 inches. However, the sampling distribution of x̄ is not as spread out as the population distribution.

We can calculate the standard deviation of x̄ as follows:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{1.8 \text{ inches}}{\sqrt{4}} \approx 0.9 \text{ inch}$$

Sample    Mean, x̄
1         53.00
2         52.25
3         52.75
4         52.75
5         53.25

Table 22.1. Sample means.

Next, we put what we have learned about the sampling distribution of the sample mean to use in the context of manufacturing circuit boards. Although the scene depicted in the video is one that you don’t see much anymore in the United States, we can still explore how statistics can be used to help control quality in manufacturing. A key step in manufacturing circuit boards connects the components on a board by passing the board through a bath of molten solder. After boards have passed through the soldering bath, an inspector randomly selects boards for a quality check. A score of 100 is the standard, but there is variation in the scores. The goal of the quality control process is to detect if this variation starts drifting out of the acceptable range, which would suggest that there is a problem with the soldering bath.

Based on historical data collected when the soldering process was in control, the quality scores have a normal distribution with mean 100 and standard deviation 4. The inspector’s random sampling of boards consists of samples of size five. Hence, the sampling distribution of x̄ is normal with a mean of 100 and standard deviation of 4/√5 ≈ 1.79. The inspector uses this information to make an x̄ control chart, a plot of the values of x̄ against time. A normal curve showing the sampling distribution of x̄ has been added to the side of the control chart. Recall from the 68-95-99.7% rule that we expect 99.7% of the sample means to be within three standard deviations of the mean. So, we have added control limits that are three standard deviations (3 × 1.79 or 5.37 units) on either side of the mean (See Figure 22.4.). A point outside either of the control limits is evidence that the process has become more variable, or that its mean has shifted – in other words, that it’s gone out of control. As soon as an inspector sees a point such as the one outside the upper control limit in Figure 22.4, it’s a signal to ask, what’s gone wrong? (For more information on control charts, see Unit 23, Control Charts.)
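The control-limit arithmetic for the soldering example can be checked with a few lines of Python. This is only a sketch: the mean, standard deviation, and sample size come from the text, while the function name out_of_control and the two test values passed to it are made up for illustration.

```python
import math

mu = 100.0     # in-control mean quality score (from the text)
sigma = 4.0    # in-control standard deviation of individual scores (from the text)
n = 5          # number of boards in each sample (from the text)

# Standard deviation of the sample mean: sigma divided by the square root of n.
sigma_xbar = sigma / math.sqrt(n)

# Control limits three standard deviations of the sample mean on either side of mu.
lcl = mu - 3 * sigma_xbar
ucl = mu + 3 * sigma_xbar

print(f"standard deviation of the sample mean = {sigma_xbar:.2f}")
print(f"LCL = {lcl:.2f}, UCL = {ucl:.2f}")


def out_of_control(xbar):
    """Return True if a sample mean falls outside the control limits."""
    return xbar < lcl or xbar > ucl


# Hypothetical sample means, chosen only to exercise the check.
print(out_of_control(101.2))  # inside the limits
print(out_of_control(106.0))  # above the upper control limit
```

The printed limits, 94.63 and 105.37, agree with the 100 ± 5.37 described in the summary.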


Figure 22.4. Control chart with control limits.

So far we’ve been looking at population distributions that follow a roughly normal curve. Next, we look at a distribution of lengths of calls coming into the Mayor’s 24 Hour Hotline call center in Boston, Massachusetts. Most calls are relatively brief but a few last a very long time. The shape of the call-length distribution is skewed to the right as shown in Figure 22.5.

Figure 22.5. Duration of calls to a call center.

To gain insight into the sampling distribution of the sample mean, x̄, for samples of size 10, we randomly selected 40 samples of size 10 and made a histogram of the sample means. We repeated this process for samples of size 20 and then again for samples of size 60. The histograms of the sample means appear in Figure 22.6.


Figure 22.6. Histograms of sample means from samples of size 10, 20, and 60.

Now let’s compare our sampling distributions (Figure 22.6) with the population distribution (Figure 22.5). Notice that the spread of all the sampling distributions is smaller than the spread of the population distribution. Furthermore, as the sample size n increases, the spread of the sampling distributions decreases and their shape becomes more symmetric. By the time n = 60, the sampling distribution appears approximately normally distributed. What we have uncovered here is one of the most powerful tools statisticians possess, called the Central Limit Theorem. This states that, regardless of the shape of the population, the sampling distribution of the sample mean will be approximately normal if the sample size is sufficiently large. It is because of the Central Limit Theorem that statisticians can generalize from sample data to the larger population. We will be seeing applications of the Central Limit Theorem in later units on confidence intervals and significance tests.
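The shrinking spread described above can be imitated with a short simulation. The sketch below does not use the call center's data; as a stand-in it assumes a right-skewed exponential population with a mean of 90 seconds (an exponential population has a standard deviation equal to its mean). The function name sample_means and the seed are illustrative choices.

```python
import random
import statistics

random.seed(22)  # fixed seed so that a rerun gives the same samples

POP_MEAN = 90.0  # assumed mean of the stand-in exponential population, in seconds
POP_SD = 90.0    # an exponential population has standard deviation equal to its mean


def sample_means(n, num_samples=40):
    """Return the means of num_samples random samples, each of size n."""
    return [
        statistics.fmean(random.expovariate(1 / POP_MEAN) for _ in range(n))
        for _ in range(num_samples)
    ]


for n in (10, 20, 60):
    means = sample_means(n)
    print(
        f"n = {n:2d}: center of sample means = {statistics.fmean(means):6.1f}, "
        f"spread of sample means = {statistics.stdev(means):5.1f}, "
        f"sigma/sqrt(n) = {POP_SD / n ** 0.5:5.1f}"
    )
```

With only 40 samples per sample size, the observed spreads will wobble around the theoretical values; raising num_samples makes the pattern steadier.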


Student Learning Objectives

A. Recognize that there is variability due to sampling. Repeated random samples from the same population will give variable results.

B. Understand the concept of a sampling distribution of a statistic such as a sample mean, sample median, or sample proportion.

C. Know that the sampling distributions of some common statistics are approximately normally distributed; in particular, the sample mean x̄ of a simple random sample drawn from a normal population has a normal distribution.

D. Know that the standard deviation of the sampling distribution of x̄ depends on both the standard deviation of the population from which the sample was drawn and the sample size n.

E. Grasp a key concept of statistical process control: Monitor the process rather than examine all of the products; all processes have variation; we want to distinguish the natural variation of the process from the added variation that shows that the process has been disturbed.

F. Make an x̄ control chart. Use the 68-95-99.7% rule and the sampling distribution of x̄ to help identify if a process is out of control.

G. Be familiar with the Central Limit Theorem: the sample mean x̄ of a large number of observations has an approximately normal distribution even when the distribution of individual observations is not normal.


Content Overview

The idea of a sampling distribution in general, and of the sampling distribution of the sample mean x̄ in particular, underlies much of introductory statistical inference. The application to x̄ charts is important in practice, and the discussion of x̄ charts, along with other types of control charts, continues in Unit 23, Control Charts.

If repeated random samples are chosen from the same population, the values of sample statistics such as x̄ will vary from sample to sample. This variation follows a regular pattern in the long run; the sampling distribution is the distribution of values of the statistic in a very large number of samples. For example, suppose we start with data from the population distribution shown in Figure 22.7. This population is skewed to the right, and clearly not normally distributed.


Figure 22.7. Population distribution.

Now, we draw a random sample of size 50 from this population and compute two statistics, the mean and the median, and get 20.7 and 19.8, respectively. Next we take another sample of size 50 and compute the mean and median for that sample. We keep resampling until we have a total of 1000 samples. Histograms of the 1000 means and 1000 medians from those samples appear in Figures 22.8 and 22.9, respectively. In both cases, the sampling distribution of the statistic appears approximately normally distributed. The sampling distribution of the sample mean, x̄, is centered around 24 and the sampling distribution of the sample median is centered around 22.
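The resampling procedure just described is easy to imitate in code. The population behind Figure 22.7 is not given, so this sketch substitutes a hypothetical right-skewed population (a gamma distribution with shape 2 and scale 12, which has mean 24); the centers it prints therefore need not match those in Figures 22.8 and 22.9. The function name draw_sample is illustrative.

```python
import random
import statistics

random.seed(99)  # fixed seed so that a rerun gives the same samples


def draw_sample(size=50):
    """Draw one random sample from a hypothetical right-skewed population."""
    # Assumption: gamma distribution with shape 2 and scale 12 (population mean 24).
    return [random.gammavariate(2.0, 12.0) for _ in range(size)]


means = []
medians = []
for _ in range(1000):
    sample = draw_sample()
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

print(f"sample means:   center {statistics.fmean(means):.1f}, "
      f"spread {statistics.stdev(means):.2f}")
print(f"sample medians: center {statistics.fmean(medians):.1f}, "
      f"spread {statistics.stdev(medians):.2f}")
```

Because the stand-in population is skewed to the right, the sample medians center below the sample means, the same ordering that appears in the unit's two histograms.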


(Histogram: Frequency, 0 to 120, against Sample Mean, 20 to 30.)

Figure 22.8. Distribution of the sample mean from 1000 samples of size 50.

(Histogram: Frequency, 0 to 100, against Sample Median, 16 to 30.)

Figure 22.9. Distribution of the sample median from 1000 samples of size 50.

Although basic statistics such as the sample mean, sample median, and sample standard deviation all have sampling distributions, the remainder of this unit will focus on the sampling distribution of the sample mean, x̄. If x̄ is the mean of a simple random sample of size n from a population with mean µ and standard deviation σ, then the mean and standard deviation of the sampling distribution of x̄ are:

$$\mu_{\bar{x}} = \mu$$

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$


If a population has the normal distribution with mean µ and standard deviation σ, then the sample mean x̄ of n independent observations has a normal distribution with mean µ and standard deviation σ/√n. In our example above, the population distribution was not normal (see Figure 22.7). In such cases, the Central Limit Theorem comes to the rescue – if the sample size is large (say n > 30), the sampling distribution of x̄ is approximately normal for any population with finite standard deviation.

Control charts for the sample mean x̄ provide an immediate application for the sampling distribution of x̄. In the 1920s, Walter Shewhart of Bell Laboratories noticed that production workers were readjusting their machines in response to every variation in the product. If the diameter of a shaft, for example, was a bit small, the machine was adjusted to cut a larger diameter. When the next shaft was a bit large, the machine was adjusted to cut smaller. Any process has some variation, so this constant adjustment did nothing except increase variation. Shewhart wanted to give workers a way to distinguish between the natural variation in the process and the extraordinary variation that shows that the process has been disturbed and hence, actually requires adjustment.

The result was the Shewhart x̄ control chart. The basic idea is that the distribution of the sample mean x̄ is close to normal if either the sample size is large or individual measurements are normally distributed. So, almost all the x̄-values lie within ±3 standard deviations of the mean. The correct standard deviation here is the standard deviation of x̄, which is σ/√n (where σ is the standard deviation of individual measurements). So, the control limits µ ± 3σ/√n contain the range in which sample means can be expected to vary if the process remains stable. The control limits distinguish natural variation from excessive variation.


Key Terms

If repeated random samples are chosen from the same population, the values of sample statistics such as x̄ will vary from sample to sample. This variation follows a regular pattern in the long run; the sampling distribution is the distribution of values of the statistic in a very large number of samples.

If x̄ is the mean of a simple random sample (SRS) of size n from a population having mean µ and standard deviation σ, then the mean and standard deviation of x̄ are:

$$\mu_{\bar{x}} = \mu$$

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

If a population has a normal distribution with mean µ and standard deviation σ, then the sampling distribution of the sample mean, x̄, of n independent observations has a normal distribution with mean µ and standard deviation σ/√n.

If the population is not normal but n is large (say n > 30), then the Central Limit Theorem tells us that the sampling distribution of the sample mean, x̄, of n independent observations has an approximate normal distribution with mean µ and standard deviation σ/√n.


The Video

Take out a piece of paper and be ready to write down answers to these questions as you watch the video.

1. What is the difference between parameters and statistics?

2. Does statistical process control inspect all the items produced after they are finished?

3. The inspector samples five circuit boards at regular intervals and finds the mean solder quality score x̄ for these five boards. Do we expect x̄ to be exactly 100 if the soldering process is functioning properly?

4. If the quality of individual boards varies according to a normal distribution with mean µ = 100 and standard deviation σ = 4, what will be the distribution of the sample averages, x̄? (Recall the sample size is n = 5.)

5. In general, is the mean of several observations more or less variable than single observations from a population? Explain.


6. The distribution of call lengths to a call center is strongly skewed. What does the Central Limit Theorem say about the distribution of the mean call length x̄ from large samples of calls?


Unit Activity: Sampling Distributions of the Sample Mean

Write each of these numbers    On this many slips
50                             10
49, 51                         9
48, 52                         9
47, 53                         8
46, 54                         6
45, 55                         5
44, 56                         3
43, 57                         2
42, 58                         1
41, 59                         1
40, 60                         1

Table 22.2. Numbered slips for the population distribution.

1. Your instructor has a container filled with numbered slips as shown in Table 22.2. Make a histogram of this distribution. Describe its shape.

2. You will need 100 samples of size 9. Your instructor will provide instructions for gathering these samples. After the data have been collected, you will need a copy of the table of results before you can answer parts (a) and (b).

a. Find the sample mean for each of the samples. Record the sample means in the results table. (Save your results table. You will need this table again for the activity in Unit 24, Confidence Intervals.)

b. To get an idea of the characteristics of the sampling distribution for the sample mean, make a histogram of the sample means. (Use the same scaling on the horizontal axis that you used in question 1.) Compare the shape, center and spread of the sampling distribution to that of the original distribution (question 1).


Extension

3. A population has a uniform distribution with density curve as shown in Figure 22.10.

(Density curve: Proportion, 0.0 to 1.0, against x, 0.0 to 1.0.)

Figure 22.10. Density curve for uniform distribution.

a. Your instructor will give you directions for using technology to generate 100 samples of size 9 from this distribution.

b. Once you have your 100 samples, find the sample means.

c. Make a histogram of the 100 sample means. Describe the shape of your histogram. Compare the center of this sampling distribution with the center of the population distribution from Figure 22.10.
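If no other technology is specified, steps (a) through (c) can be carried out in Python. The sketch below is one possible approach, not the instructor's directions: it assumes the uniform distribution runs from 0 to 1 as in Figure 22.10, and the seed, variable names, and text histogram are illustrative choices.

```python
import random
import statistics

random.seed(2210)  # fixed seed so that a rerun gives the same samples

NUM_SAMPLES = 100
SAMPLE_SIZE = 9

# (a) Generate 100 samples of size 9 from the uniform distribution on 0 to 1.
samples = [
    [random.random() for _ in range(SAMPLE_SIZE)]
    for _ in range(NUM_SAMPLES)
]

# (b) Find the mean of each sample.
sample_means = [statistics.fmean(sample) for sample in samples]

# (c) Text histogram of the sample means, using bins of width 0.1.
bins = [0] * 10
for mean in sample_means:
    bins[min(int(mean * 10), 9)] += 1

for i, count in enumerate(bins):
    print(f"{i / 10:.1f} to {(i + 1) / 10:.1f} | {'*' * count}")

print(f"center of the sample means: {statistics.fmean(sample_means):.3f}")
```

Describing the shape and comparing the centers is still left to the reader.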


Exercises

1. The law requires coal mine operators to test the amount of dust in the atmosphere of the mine. A laboratory carries out the test by weighing filters that have been exposed to the air in the mine. The test has a standard deviation of σ = 0.08 milligram in repeated weighings of the same filter. The laboratory weighs each filter three times and reports the mean result.

a. What is the standard deviation of the reported result?

b. Why do you think the laboratory reported a result based on the mean of three weighings?

2. The scores of students on the ACT college entrance examination in a recent year had the normal distribution with mean µ = 18.6 and standard deviation σ = 5.9.

a. What fraction of all individual students who take the test have scores 21 or higher?

b. Suppose we choose 55 students at random from all who took the test nationally. What is the distribution of average scores, x̄, in a sample of size 55? In what fraction of such samples will the average score be 21 or higher?

3. The number of accidents per week at a hazardous intersection varies with mean 2.2 and standard deviation 1.4. This number, x, takes only whole-number values, and so is certainly not normally distributed.

a. Let x̄ be the mean number of accidents per week at the intersection during a year (52 weeks). What is the approximate distribution of x̄ according to the Central Limit Theorem?

b. What is the approximate probability that, on average, there are fewer than two accidents per week over a year?

c. What is the approximate probability that there are fewer than 100 accidents at the intersection in a year? (Hint: Restate this event in terms of x̄.)

4. A company produces a liquid that can vary in its pH levels unless the production process is carefully controlled. Quality control technicians routinely monitor the pH of the liquid. When the process is in control, the pH of the liquid varies according to a normal distribution with mean µ = 6.0 and standard deviation σ = 0.9.


a. The quality control plan calls for collecting samples of size three from batches produced each hour. Using n = 3, calculate the lower control limit (LCL) and upper control limit (UCL).

b. Samples collected over a 24-hour time period appear in Table 22.3.

Sample    pH level           Sample mean
1         5.8   6.2   6.0
2         6.4   6.9   5.3
3         5.8   5.2   5.5
4         5.7   6.4   5.0
5         6.5   5.7   6.7
6         5.2   5.2   5.8
7         5.1   5.2   5.6
8         5.8   6.0   6.2
9         4.9   5.7   5.6
10        6.4   6.3   4.4
11        6.9   5.2   6.2
12        7.2   6.2   6.7
13        6.9   7.4   6.1
14        5.3   6.8   6.2
15        6.5   6.6   4.9
16        6.4   6.1   7.0
17        6.5   6.7   5.4
18        6.9   6.8   6.7
19        6.2   7.1   4.7
20        5.5   6.7   6.7
21        6.6   5.2   6.8
22        6.4   6.0   5.9
23        6.4   4.6   6.7
24        7.0   6.3   7.4

Table 22.3. pH of samples.

c. Make an x̄ chart by plotting the sample means versus the sample number. Draw horizontal reference lines at the mean and lower and upper control limits.

d. Do any of the sample means fall below the lower control limit or above the upper control limit? This is one indication that a process is “out of control.”

e. Apart from sample means falling outside the lower and upper control limits, is there any other reason why you might be suspicious that this process is either out of control or going out of control? Explain.


Review Questions

1. Suppose a chemical manufacturer produces a product that is marketed in plastic bottles. The material is toxic, so the bottles must be tightly sealed. The manufacturer of the bottles must produce the bottles and caps within very tight specification limits. Suppose the caps will be acceptable to the chemical manufacturer only if their diameters are between 0.497 and 0.503 inch. When the manufacturing process for the caps is in control, cap diameter can be described by a normal distribution with µ = 0.500 inch and σ = 0.0015 inch.

a. If the process is in control, what percentage of the bottle caps would have diameters outside the chemical manufacturer’s specification limits?

b. The manufacturer of the bottle caps has instituted a quality control program to prevent the production of defective caps. As part of its quality control program, the manufacturer measures the diameters of a random sample of n = 9 bottle caps each hour and calculates the sample mean diameter. If the process is in control, what is the distribution of the sample mean x̄? Be sure to specify both the mean and standard deviation of x̄’s distribution.

c. The cap manufacturer has a rule that the process will be stopped and inspected any time the sample mean falls below 0.499 inch or above 0.501 inch. If the process is in control, find the proportion of times it will be stopped for inspection.

2. A study of rush-hour traffic in San Francisco records the number of people in each car entering a freeway at a suburban interchange. Suppose that this number, x, has mean 1.5 and standard deviation 0.75 in the population of all cars that enter at this interchange during rush hours.

a. Could the exact distribution of x be normal? Why or why not?

b. Traffic engineers estimate that the capacity of the interchange is 700 cars per hour. According to the Central Limit Theorem, what is the approximate distribution of the mean number of persons, x̄, per car in 700 randomly selected cars at this interchange?

c. What is the probability that 700 cars will carry more than 1075 people? (Hint: Restate the problem in terms of the average number of people per car.)


3. Recall that the distribution of the lengths of calls coming into a Boston, Massachusetts, call center each month is strongly skewed to the right. The mean call length is µ = 90 seconds and the standard deviation is σ = 120 seconds.

a. Let x̄ be the sample mean from 10 randomly selected calls. What are the mean and standard deviation of x̄? What, if anything, can you say about the shape of the distribution of x̄? Explain.

b. Let x̄ be the sample mean from 100 randomly selected calls. What are the mean and standard deviation of x̄? What, if anything, can you say about the shape of the distribution of x̄? Explain.

c. In a random sample of 100 calls from the call center, what is the probability that the average length of these calls will be over 2 minutes?

Unit 23: Control Charts

Unit 23: Control Charts | Student Guide

Summary of Video

Statistical inference is a powerful tool. Using relatively small amounts of sample data we can figure out something about the larger population as a whole. Many businesses rely on this principle to improve their products and services. Management theorist and statistician W. Edwards Deming was among the first to champion the idea of statistical process management. Initially, Deming found the most receptive audience to his management theories in Japan.

After World War II, Japanese industry was shattered. Rebuilding was a daunting challenge, one that Japanese business leaders took on with great determination. In the decades after the war, they transformed the phrase “Made in Japan” from a sign of inferior, cheaply-made goods to a sign of quality respected the world over. Deming’s emphasis on long-term thinking and continuous process improvement was vital in bringing about the so-called “Japanese Miracle.”

At first, Deming’s ideas were not as well received in America. Deming criticized American managers for their lack of understanding of statistics. But as time went on – and competition from Japan grew – companies in the U.S. began to embrace Deming’s ideas on statistical process control. Now his principles of total quality management are an integral part of American business, helping workers uncover problems and produce higher quality goods and services.

In statistics, a process is a chain of steps that turns inputs into outputs. A process could be anything from the way a factory turns raw iron into a finished bolt to the way you turn raw ingredients into a hot dinner. Statisticians say a process that is running smoothly, with its variables staying within an expected range, is in control. Deming was adamant that statistics could help in understanding a manufacturing process and identifying its problems, or when things were out of control or about to go out of control. He advocated the use of control charts as a way to monitor whether a process is in or out of control. This technique is widely used to this day as we’ll see in the video in a visit to Quest Diagnostics’ lab.

Quest performs medical tests for healthcare providers. So, for example, at Quest a patient’s blood sample is the input of the process and the test result is the output. A courier picks up specimens and transports them to the processing lab, where they are sorted by time of arrival and urgency of test. Technicians verify each specimen and confirm the doctor’s orders. Then the specimens are barcoded and are ready to be passed on for testing. Quest’s Seattle processing lab aimed to get all specimens logged in and ready by 2 a.m. so the samples could move on to the technical department for analysis. However, until a few years ago, they were rarely meeting that 2 a.m. goal. Their lateness was leading to poor customer and employee satisfaction and, moreover, it was wasting corporate resources. Enter statistical process control!

Quest needed to know where the process stood at present: How close were they to hitting the 2 a.m. target and how much did finish times vary? Keep in mind that all processes have variation. Common cause variation is due to the day-to-day factors that influence the process. In Quest’s case, it could be things like a printer running out of paper and needing to be refilled, or a worker calling in sick. It is the normal variation in a system.

Processes are also susceptible to special cause variation – that’s when sudden, unpredictable events throw a wrench into the process. Examples of special cause variation would be blackouts that shut down the lab’s power, or a major crash on the highway that would keep the samples from being delivered to the lab. Quest needed to figure out how their process was running on a day-to-day basis when they were only up against common cause variation.

Quest used six months of finish-time data to set up control limits and then created a control chart, which is a graphic way to keep track of variation in finish times. Figure 23.1 shows a control chart for month 1. The center line is the target finish time. The control limits at 12:00 a.m. and 4:00 a.m. are set three standard deviations above and below the center line. The data points are the finish times that Quest tracked over a one-month period.

Figure 23.1. Control chart for month 1.

Quest assumed that their nightly finish times are normally distributed. In Figure 23.2, we add a graph of the normal distribution to the control chart. Remember, in a normal distribution 68% of your data is within one standard deviation of the mean, 95% is within two standard deviations, and 99.7% is within three standard deviations.


Figure 23.2. Adding an assumption of normality.

Using the control chart Quest was able to figure out when their process had been disturbed and gone out of control, or was heading that way. One dead giveaway that the finish times are out of control is if a point falls outside the control limits. That should only happen 0.3% of the time if everything is running smoothly. Take a look at Figure 23.3, which highlights what happened toward the end of the one-month cycle.

Figure 23.3. Highlighting finishing times beyond the control limits.

There are other indicators that something suspicious might be going on besides points falling outside the control limits. For example, if too many points are on one side of the center line or if a strong pattern emerges (hence, the variability is not random) – then it’s time to investigate. Mapping finish times on the control chart helps monitor the process, and alerts techs right away that something has been disturbed. Then they can track down and address the cause immediately.
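Both kinds of warning sign can be turned into simple checks. The sketch below is illustrative only: the video does not say how many points on one side count as "too many", so the run length of 9 is an assumed convention, and the finish times (in minutes after midnight) are made-up values rather than Quest's data. The center line at 120 and the limits at 0 and 240 correspond to the 2 a.m. target and the 12:00 a.m. and 4:00 a.m. control limits in Figure 23.1.

```python
def outside_limits(points, lcl, ucl):
    """Return the positions of points that fall outside the control limits."""
    return [i for i, x in enumerate(points) if x < lcl or x > ucl]


def long_run_one_side(points, center, run_length=9):
    """Return the position where a run of run_length consecutive points on
    one side of the center line is completed, or None if there is no such run.

    run_length=9 is an assumed threshold, not one given in the video.
    """
    run = 0
    side = 0
    for i, x in enumerate(points):
        current = (x > center) - (x < center)  # +1 above, -1 below, 0 on the line
        if current != 0 and current == side:
            run += 1
        else:
            side = current
            run = 1 if current != 0 else 0
        if run >= run_length:
            return i
    return None


# Hypothetical finish times in minutes after midnight (120 is 2:00 a.m.).
finish = [110, 135, 95, 150, 128, 131, 140, 125, 133, 145, 160, 138, 129, 250]

print(outside_limits(finish, lcl=0, ucl=240))    # positions beyond the limits
print(long_run_one_side(finish, center=120))     # position completing a long run
```

Positions are counted from 0, so the last night in this made-up series (position 13) is the one beyond the upper limit, and the run of late finishes is completed at position 11.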

Another way the control chart helped Quest improve efficiency was by revealing some of the causes of variation in the process, which the team could then address. Quest actually restructured the entire department. It set up pods within the department and changed staffing. These sorts of changes brought the mean finish time much closer to the 2 a.m. target, and the remaining variation clustered more tightly around the center line. The days of wildly erratic finish times were gone thanks to statistical process control.



Student Learning Objectives

A. Understand why statistical process control is used.

B. Be able to distinguish between common cause and special cause variation.

C. Know how to construct a run chart and describe patterns/trends in data over time.

D. Know how to construct an x̄ chart and describe the changes in sample means over time.

E. Make decisions based on observed patterns in run charts and x̄ charts.

F. Be able to apply decision rules to determine if a process is out of control.

Content Overview

Figure 23.4: Silicon ingots and polished wafers.

Consider the problem of quality control in the manufacturing process of turning ingots of silicon into polished wafers used to make microchips. (See Figure 23.4.) Assume that the manufacturer wants the polished wafers to have consistent thickness with a target thickness of 0.5 millimeters. A sample of 50 polished wafers is selected as a batch is being produced. Table 23.1 contains these data.

0.555  0.543  0.533  0.538  0.533  0.529  0.526  0.522  0.518  0.519
0.516  0.515  0.513  0.515  0.512  0.510  0.508  0.507  0.507  0.507
0.506  0.506  0.506  0.505  0.503  0.502  0.500  0.498  0.499  0.496
0.497  0.493  0.492  0.491  0.487  0.488  0.486  0.485  0.483  0.484
0.482  0.479  0.476  0.476  0.474  0.471  0.471  0.469  0.454  0.447

Table 23.1. Wafer thickness from sample of 50 polished wafers.

In order to gain a sense of the distribution of wafer thickness, a quality control technician constructs the histogram shown in Figure 23.5.


(Histogram: Percent, 0 to 40, against Thickness (mm), 0.40 to 0.60.)

Figure 23.5. Histogram of wafer thickness.

The histogram indicates that the distribution of wafer thickness is approximately normal. The sample mean is 0.50064 mm, which is close to the target value of 0.5 mm. Furthermore, the standard deviation is 0.02227 mm, which is small relative to the mean. The analysis thus far supports the conclusion that the process is in control.

The sample mean and standard deviation together with the histogram provide information on the overall pattern of the sample data. However, there is more to quality control than simply studying the overall pattern. Manufacturers also keep track of the run order, the order in which the data are collected. For the data in Table 23.1, the run order may relate to which part of the ingot – top, middle, or bottom – the wafers came from, or it may relate to the order in which wafers were fed through the grinding and polishing machines. If a process is stable or in control, the order in which data are collected, or the time in which they are processed, should not affect the thickness of polished wafers. One way to check that the production processes of polished wafers are in control is by creating a run chart.

A run chart is a scatterplot of the data versus the run order. To help visualize patterns over time, the dots in the scatterplot are usually connected. Table 23.1 lists the data values in the order they were collected, starting with the first row 0.555, 0.543, . . . , 0.519, followed by the second row, third row, fourth row and ending with 0.447, the last entry in the fifth row. So, the run order for 0.555 is 1, for 0.543 is 2, and so forth until you get to the run order for 0.447, which is 50. Figure 23.6 shows the run chart for the wafer thickness data. A center line has been drawn on the chart at the target thickness of 0.5 millimeters.
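A rough text version of a run chart can be built from the Table 23.1 data. In the sketch below the data values are copied from the table in run order, and the target of 0.5 mm is the one stated at the start of this section; the function name chart_row and the plotting range of 0.44 to 0.56 mm are illustrative choices.

```python
import statistics

# Wafer thickness (mm) from Table 23.1, listed in run order (row by row).
thickness = [
    0.555, 0.543, 0.533, 0.538, 0.533, 0.529, 0.526, 0.522, 0.518, 0.519,
    0.516, 0.515, 0.513, 0.515, 0.512, 0.510, 0.508, 0.507, 0.507, 0.507,
    0.506, 0.506, 0.506, 0.505, 0.503, 0.502, 0.500, 0.498, 0.499, 0.496,
    0.497, 0.493, 0.492, 0.491, 0.487, 0.488, 0.486, 0.485, 0.483, 0.484,
    0.482, 0.479, 0.476, 0.476, 0.474, 0.471, 0.471, 0.469, 0.454, 0.447,
]
TARGET = 0.5  # target thickness in mm


def chart_row(order, value, low=0.44, high=0.56, width=48):
    """Return one line of a text run chart: '*' marks the value, '|' the target."""
    column = round((value - low) / (high - low) * width)
    center = round((TARGET - low) / (high - low) * width)
    cells = [" "] * (width + 1)
    cells[center] = "|"
    cells[column] = "*"
    return f"{order:2d} {''.join(cells)}"


# Run order increases down the page; thickness increases to the right.
for order, value in enumerate(thickness, start=1):
    print(chart_row(order, value))

overall_mean = statistics.fmean(thickness)
overall_sd = statistics.stdev(thickness)
first_ten = statistics.fmean(thickness[:10])
last_ten = statistics.fmean(thickness[-10:])

print(f"overall mean = {overall_mean:.5f} mm, standard deviation = {overall_sd:.5f} mm")
print(f"mean of first 10 wafers = {first_ten:.4f} mm")
print(f"mean of last 10 wafers  = {last_ten:.4f} mm")
```

The overall mean sits almost exactly on the target, yet the first ten wafers average well above it and the last ten well below it, which is why the summary statistics alone can miss a drift that the run order exposes.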


(Run chart: Thickness (mm), 0.450 to 0.550, against Run Order, 0 to 50.)

Figure 23.6. Run chart for wafer thickness data from Table 23.1.

Even though the overall pattern of the data gave no indication that there were any problems with the grinding and polishing processes, it is clear from the run chart in Figure 23.6 that the thickness of polished wafers is decreasing over time. Production needs to be stopped so that the grinding and polishing processes can be adjusted.

The run chart involved plotting individual data values over time (run order). Another approach is to select samples from batches produced over regular time intervals. For example, a quality control plan for the polished wafers might call for routine collection of a sample of n polished wafers from batches produced each hour. The thickness of each wafer in the sample is recorded and the mean thickness, x̄, is calculated. The information on mean thickness can be used to determine if the process is out of control at a particular time and to track changes in the process over time.

Suppose when the grinding and polishing processes are in control, the thickness of the individual wafers can be described by a normal distribution with mean µ = 0.5 millimeters and standard deviation σ = 0.02 millimeters (similar to the data pattern in Figure 23.5). From Unit 22, Sampling Distributions, we know that under this condition the hourly sample means, x̄, based on samples of size n are normally distributed with the following mean and standard deviation:

$\mu_{\bar{x}} = \mu = 0.5$ millimeters

$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{0.02}{\sqrt{n}}$ millimeters


Assume that the quality control plan calls for taking samples of four polished wafers each hour. In this case, our standard deviation for x̄ is:

$\sigma_{\bar{x}} = \dfrac{0.02}{\sqrt{4}} = 0.01$ millimeter

Each hour a technician collects samples of four polished wafers, measures their thickness, records the values, and then calculates the sample mean. Suppose that the data in Table 23.2 come from samples collected over an eight-hour period.

Sample Number    Sample Thickness (mm)              Sample Mean (mm)
1                0.509   0.502   0.521   0.469      0.5003
2                0.504   0.505   0.525   0.468      0.5005
3                0.489   0.506   0.486   0.497      0.4945
4                0.513   0.516   0.482   0.483      0.4985
5                0.552   0.516   0.476   0.472      0.5040
6                0.480   0.484   0.518   0.510      0.4980
7                0.516   0.489   0.513   0.495      0.5032
8                0.508   0.499   0.466   0.480      0.4882

Table 23.2. Data from samples collected each hour.

According to the 68-95-99.7% Rule, if the process is in control, we would expect:

68% of the x̄ values to be within the interval 0.5 mm ± 0.01 mm or between 0.49 mm and 0.51 mm.

95% of the x̄ values to be within the interval 0.5 mm ± 2(0.01) mm or between 0.48 mm and 0.52 mm.

99.7% of the x̄ values to be within the interval 0.5 mm ± 3(0.01) mm or between 0.47 mm and 0.53 mm.

Next, we make an x̄ chart, which is a scatterplot of the sample means versus the sample order. We draw a reference line at μ = 0.5 called the center line. We use the values from the 68-95-99.7% Rule to provide additional reference lines in our x̄ chart. The lower and upper endpoints on the 99.7% interval are called the lower control limit (LCL) and upper control limit (UCL), respectively. Figure 23.7 shows the completed x̄ chart.
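The reference lines for this chart can be computed directly from µ, σ, and n. The following is a minimal sketch, not part of the original guide; the function name `xbar_chart_lines` and the dictionary keys are illustrative choices.

```python
from math import sqrt

def xbar_chart_lines(mu, sigma, n):
    """Return the center line and the reference lines at 1, 2, and 3
    standard deviations of the sample mean (hypothetical helper)."""
    se = sigma / sqrt(n)  # standard deviation of the sample mean
    lines = {"center": mu}
    for k in (1, 2, 3):
        lines[f"-{k}"] = mu - k * se
        lines[f"+{k}"] = mu + k * se
    # The 3-standard-deviation lines are the control limits.
    lines["LCL"] = lines["-3"]
    lines["UCL"] = lines["+3"]
    return lines

# Wafer example from the text: mu = 0.5 mm, sigma = 0.02 mm, n = 4
lines = xbar_chart_lines(mu=0.5, sigma=0.02, n=4)
for name in ("LCL", "-2", "-1", "center", "+1", "+2", "UCL"):
    print(f"{name:>6}: {lines[name]:.2f} mm")
```

For the wafer example, the lines fall at 0.47, 0.48, 0.49, 0.50, 0.51, 0.52, and 0.53 mm, matching the intervals from the 68-95-99.7% Rule.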


[x̄ chart: Sample Mean plotted against Sample Number (0 to 14); sample-mean axis from 0.47 to 0.53.]

Figure 23.7. x̄ chart for wafer thickness over time.

The x̄ chart in Figure 23.7 does not appear to indicate any problems that warrant stopping the grinding or polishing processes to make adjustments. All of the points except one fall within one $\sigma/\sqrt{n}$ of the mean, in other words, fall between the reference lines corresponding to 0.49 and 0.51. However, as we add additional points, we will need some guidelines – a set of decision rules – that tell us when the process is going out of control. The decision rules below are based on a set of rules developed by the Western Electric Company. Although they are widely used, they are not the only set of decision rules.

Decision Rules:

The following rules identify a process that is becoming unstable or is out of control. If any of the rules apply, then the process should be stopped and adjusted (or the problem fixed) before resuming production.

Rule 1: Any single data point falls below the LCL or above the UCL.

Rule 2: Two of three consecutive points fall beyond the $2\sigma/\sqrt{n}$ limit, on the same side of the center line.

Rule 3: Four out of five consecutive points fall beyond the $\sigma/\sqrt{n}$ limit, on the same side of the center line.

Rule 4: A run of 9 consecutive points (in other words, nine consecutive points on the same side of the center line).
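The four rules above can be checked mechanically once each sample mean is converted to a z-score, (x̄ − µ)/(σ/√n). The sketch below is one possible reading of the rules, not an official implementation: the function name `first_signal` is hypothetical, "beyond" is taken to mean strictly beyond, and at the start of the chart a rule is applied to however many points are available.

```python
from math import sqrt

def first_signal(means, mu, sigma, n):
    """Return (sample_number, rule) for the first decision rule that
    applies, or None if no rule applies (hypothetical helper)."""
    se = sigma / sqrt(n)
    z = [(m - mu) / se for m in means]
    # (rule number, window length, points needed, limit in units of se)
    windowed_rules = ((2, 3, 2, 2.0), (3, 5, 4, 1.0), (4, 9, 9, 0.0))
    for i, value in enumerate(z):
        # Rule 1: a single point beyond the control limits.
        if abs(value) > 3:
            return (i + 1, 1)
        for rule, window, needed, limit in windowed_rules:
            recent = z[max(0, i + 1 - window): i + 1]
            above = sum(v > limit for v in recent)
            below = sum(v < -limit for v in recent)
            # Points must be on the same side of the center line.
            if max(above, below) >= needed:
                return (i + 1, rule)
    return None

# Sample means from Table 23.2 (samples 1 to 8)
means = [0.5003, 0.5005, 0.4945, 0.4985, 0.5040, 0.4980, 0.5032, 0.4882]
print(first_signal(means, mu=0.5, sigma=0.02, n=4))
```

Sample numbers are reported starting from 1 so that they line up with the sample numbers in the tables.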


None of the decision rules apply to the control chart in Figure 23.7. Hence, the processes are allowed to continue. Table 23.3 contains data on the next seven hourly samples.

Sample Number    Sample Thickness (mm)              Sample Mean (mm)
9                0.505   0.489   0.499   0.532      0.5063
10               0.534   0.521   0.530   0.511      0.5240
11               0.526   0.514   0.520   0.530      0.5225
12               0.517   0.518   0.511   0.512      0.5145
13               0.506   0.504   0.511   0.511      0.5080
14               0.507   0.499   0.501   0.510      0.5043
15               0.511   0.509   0.512   0.520      0.5130

Table 23.3. Samples from an additional seven hours.

Figure 23.8 shows the updated x̄ chart that includes the means from the seven additional samples.

[x̄ chart: Sample Mean plotted against Sample Number (0 to 16); sample-mean axis from 0.47 to 0.53.]

Figure 23.8. Updated x̄ chart.

Now, we apply the decision rules. This time, we find that Rule 2 applies. Data points associated with Samples 10 and 11 fall above 0.52 (which, in this case, is above the $2\sigma/\sqrt{n}$ limit). According to Rule 2 the process should be stopped after observing Sample 11’s x̄-value.

The x̄ chart monitors one statistic, the sample mean, over time. The x̄ chart is only one type of control chart. As mentioned earlier, the manufacturer is also interested in producing a consistent product. So, instead of tracking the sample mean, the quality control plan could track the sample standard deviations or the sample ranges over time. More generally, control charts are scatterplots of sample statistics (or individual data values) versus sample order and are commonly used tools in statistical process control.


Before control charts were popular, there was a tendency to adjust processes whenever a slight change was noted. This led to over-adjustment, which often caused more problems than it solved. In addition, it meant that the process was stopped for adjustment more frequently than was necessary, which was a waste of money. Control charts and decision rules give manufacturers concrete guidelines for deciding when processes need attention.


Key Terms

A process is a chain of steps that turns inputs into outputs. Every process has variation. Common cause variation is the variation due to day-to-day factors that influence the process. Special cause variation is the variation due to sudden, unexpected events that affect the process.

When a process is running smoothly, with its variables staying within an acceptable range, the process is in control. When the process becomes unstable or its variables are no longer within an acceptable range, the process is out of control.

A run chart is a scatterplot of the data values versus the order in which these values are collected. The chart displays process performance over time. Patterns and trends can be spotted and then investigated.

Control charts are used to monitor the output of a process. The charts are designed to signal when the process has been disturbed so that it is out of control. Control charts rely on samples taken over regular intervals. Sample statistics (for example, mean, standard deviation, range) are calculated for each sample. A control chart is a scatterplot of a sample statistic (the quality characteristic) versus the sample number. Figure 23.9 shows a generic control chart.

[Control chart: Quality Characteristic plotted against Sample Number (0 to 16), with horizontal reference lines labeled UCL, Center Line, and LCL.]

Figure 23.9. Generic control chart.

The center line on a control chart is generally the target value or the mean of the quality characteristic when the process is in control. The upper control limit (UCL) and lower control limit (LCL) on a control chart are generally set $\pm 3\sigma/\sqrt{n}$ from the center line.


An x̄ chart is one example of a control chart. It is a scatterplot of the sample means versus the sample number. Scatterplots of sample standard deviations or sample ranges over time are two other types of control charts.

Decision rules consist of a set of rules used to identify when a process is becoming unstable or going out of control. Decision rules help quality control managers decide when to stop the process in order to fix problems or make adjustments.


The Video


1. What was W. Edwards Deming known for?

2. What is a process, statistically speaking? Give an example.

3. What does it mean for a process to be in control?

4. Why did Quest Diagnostics’ lab need a statistical-quality-control intervention?

5. In Quest’s control chart, how did they determine where to set the upper and lower control limits?


6. How did Quest respond to what it learned from its control charts? What were the results of these changes?


Unit Activity: You’re In Control!

For this activity, you will play the role of a semiconductor quality control manager in charge of monitoring the thickness of polished wafers. Open the Control Chart tool from the Interactive Tools menu. You will be working with x̄ charts. The activity questions follow the list of steps below.

Step 1: Select a set of values for the mean µ and standard deviation σ. You have three possible choices for each of these parameters.

For now, you will work through the construction of at least two control charts with your selection. In the real world, these values would be determined from past data collected when the process was known to be in control.

Step 2: Since you are in charge of the quality control plan, decide on the sample size n you would like to use for monitoring the process. You have three choices: 5, 10, or 20.

Keep in mind the following: The more wafers you sample, the more time it will take, and the more it will cost. On the other hand, with larger samples, results are more precise.

Step 3: Select the Step-By-Step mode.

In this mode, you will get feedback immediately after each decision that you make. If you make a mistake, you will be told to start over and will need to click the “Start Over” button. Once you feel confident about your decisions, you can change to Continuous mode.

Step 4: Calculate the lower control limit to four decimals and enter its value in the box for LCL. Calculate the upper control limit to four decimals and enter its value in the box for UCL. Click the “Change Control Limits” button.

If your calculations are correct, control lines will appear in the x chart. In Step-By-Step mode, you will get feedback (see bottom of screen) if you have made a mistake. The feedback will say: Recalculate control limit values. To correct the error, enter new values for LCL and UCL and then click the “Change Control Limits” button.

Step 5: Click on the “Collect Sample Data” button. The data will appear in a column under the heading Thickness (mm) near the top of your screen. To calculate x̄, click the “Calculate Mean” button. The mean will appear underneath the column.

Step 6: Click the “Plot Point” button to plot the ordered pair (sample number, mean) on the x̄ chart.

Step 7: Make a decision. Your possible decisions are: (1) Continue Process, which means that you have decided the process is in control; or (2) Stop Process, which means that you have decided to shut down the process for adjustments or inspection.

Step 8: Repeat steps 5 – 7 until one of the following three things happens:

(1) You decide to continue and get the following feedback: Process is not in control. It should be stopped immediately. In this case, click the “Start Over” button at the top of the screen.

(2) You decide to continue and get the following feedback: Good decision. In this case, continue constructing the control chart.

(3) After 25 samples, it will be time for routine maintenance even if the process is still in control. At this time, you can proceed to the next question. Click the “Start Over” button to do so.

1. Work through Steps 1 – 8 using the Control Chart tool. Complete one control chart successfully. Make a sketch of your chart (or do a screen capture and paste the screen capture into a Word document). If the process was stopped before 25 samples were selected, state which of the decision rules applies.

2. Use the same settings as you did for question 1. Rework question 1.

After you have successfully completed two control charts in Step-by-Step mode, you are ready to move on to question 3.

3. Change the settings for µ, σ, and n. Choose Continuous mode. Allow the process to continue until you think it needs to be stopped. After clicking the “Stop Process” button, you will receive feedback.

a. What settings did you choose? What were the values of the upper and lower control limits?

b. Make a sketch of your control chart or save a screen capture of your control chart into a Word document.


c. What feedback did you receive?

d. If your feedback indicated that you made a correct choice to stop the process, state the rule that made you decide it was time to stop the process. If your feedback indicated that you should have stopped the process sooner, state the sample number for when you should have stopped the process and the rule that applies.

4. Select new settings for µ and σ (it is up to you if you also want to change n). Repeat question 3 and make another control chart.


Exercises

1. A manufacturer of electrical resistors makes 100-ohm resistors that have specifications of 100 ± 3 ohms. A quality control inspector collected a sample of 15 electrical resistors and tested their resistance. The results are recorded below:

99 98 101 98 99 101 99 100

100 98 99 102 99 101 100

Assume that these data are recorded in the order they were collected beginning with the first row 99, . . . 100, followed by the second row 100, . . . 100.

a. Make a run chart for these data. Leave room on the horizontal axis to expand the run orders out to 30. (You will be adding 15 more data points in part (c).) Draw a reference line for the target resistance (100 ohms) and for the tolerance interval (these can serve as control limits).

b. Based on your run chart in (a), is there any evidence that the process is out of control? Support your answer.

c. The quality control inspector continued collecting data on the resistors. Results from an additional 15 data values, in the order values were collected, are recorded below:

100 99 102 99 101 102 101 100

101 102 100 103 101 102 103

Use these data to complete the run chart in (a) for run orders from 1 – 30.

d. Based on the completed run chart in (c) is there any evidence that the manufacturing process is out of control? Support your answer.

2. Suppose a chemical manufacturer produces a product that is marketed in plastic bottles. The material is toxic, so the bottles must be tightly sealed. The manufacturer of the bottles must produce the bottles and caps within very tight specification limits. Suppose the caps will be acceptable to the chemical manufacturer only if their diameters are between 0.497 and 0.503 inch. When the manufacturing process for the caps is in control, cap diameter can be described by a normal distribution with µ = 0.5 inch and σ = 0.0015 inch .


a. If the process is in control, what percentage of the bottle caps would have diameters outside the chemical manufacturer’s specification limits?

b. The manufacturer of the bottle caps has instituted a quality control program to prevent the production of defective caps. As part of its quality control program, the manufacturer measures the diameters of a random sample of n = 9 bottle caps each hour and then calculates the sample mean diameter. If the process is in control, what is the distribution of the sample mean x̄? Be sure to specify both the mean and standard deviation of x̄’s distribution.

c. The cap manufacturer has a rule that the process will be stopped and inspected any time the sample mean falls below 0.499 inch or above 0.501 inch. If the process is in control, find the proportion of times it will be stopped during inspection periods.

3. For each of the x̄ charts in Figures 23.10 – 23.12, decide whether or not the process is in control. If the process is out of control, state which decision rule applies. Justify your answer. (Note that reference lines at one, two, and three $\sigma/\sqrt{n}$ on either side of the mean have been drawn on the control charts.)

a.

[Control chart: Sample Mean plotted against Sample Number (0 to 16); sample-mean axis marked from 5 to 35.]

Figure 23.10. Control chart for exercise 3(a).


b.

[Control chart: Sample Mean plotted against Sample Number (0 to 16); sample-mean axis marked from 10 to 70.]

Figure 23.11. Control chart for exercise 3(b).

c.

[Control chart: Sample Mean plotted against Sample Number (0 to 40); sample-mean axis marked from 10 to 70.]

Figure 23.12. Control chart for exercise 3(c).

4. A company produces a liquid which can vary in its pH levels unless the production process is carefully controlled. Quality control technicians routinely monitor the pH of the liquid. When the process is in control, the pH of the liquid varies according to a normal distribution with mean µ = 6.0 and standard deviation σ = 0.9 .

a. The quality control plan calls for collecting samples of size three from batches produced each hour. Using n = 3, calculate the lower control limit (LCL) and upper control limit (UCL).

b. Samples collected over a 24-hour time period appear in Table 23.4. Compute the sample means for each of the 24 samples and add the results to a copy of Table 23.4.


Sample    pH level            Sample Mean
1         5.8   6.2   6.0
2         6.4   6.9   5.3
3         5.8   5.2   5.5
4         5.7   6.4   5.0
5         6.5   5.7   6.7
6         5.2   5.2   5.8
7         5.1   5.2   5.6
8         5.8   6.0   6.2
9         4.9   5.7   5.6
10        6.4   6.3   4.4
11        6.9   5.2   6.2
12        7.2   6.2   6.7
13        6.9   7.4   6.1
14        5.3   6.8   6.2
15        6.5   6.6   4.9
16        6.4   6.1   7.0
17        6.5   6.7   5.4
18        6.9   6.8   6.7
19        6.2   7.1   4.7
20        5.5   6.7   6.7
21        6.6   5.2   6.8
22        6.4   6.0   5.9
23        6.4   4.6   6.7
24        7.0   6.3   7.4

Table 23.4. pH samples collected hourly.

c. Make an x̄ chart. Add reference lines including lines for the lower and upper control limits.

d. Based on the control chart you drew for (c), decide whether or not the process is in control. If not, state which of the decision rules applies.


Review Questions

1. A manager keeps track of duplicate e-mail messages he receives, which he views as a waste of his time. His log of the number of duplicate e-mails over 20 consecutive workdays appears in Table 23.5.

Day Number              1    2    3    4    5    6    7    8    9   10
Number of Duplicates    2    1    0    2   12   14   17   15   25   20

Day Number             11   12   13   14   15   16   17   18   19   20
Number of Duplicates   24   27   22   24   26   20   22    5    2    0

Table 23.5. Duplicate e-mail messages per day.

a. Calculate the mean number of duplicate e-mails per day.

b. Draw a run chart of the duplicate e-mail data. Add the mean number of duplicates as a reference centerline.

c. Nine or more consecutive data points on the same side of a center line can signal a special cause variation. Does the run chart from (b) signal a special cause variation?

2. A quality control inspector at a company that manufactures valve linings monitors the mass of the linings. When the process is in control, the mean mass is µ = 240.0 grams and standard deviation σ = 0.4 gram. The inspector randomly selects a valve liner from batches produced each hour and records its mass. The masses (in grams) of 25 valve liners are displayed in Table 23.6.


Hour    Mass       Hour    Mass
1       240.0      14      240.2
2       239.9      15      239.8
3       239.6      16      240.7
4       240.2      17      239.4
5       239.6      18      240.5
6       239.8      19      239.7
7       239.8      20      239.3
8       240.1      21      240.5
9       239.8      22      239.7
10      240.1      23      239.5
11      240.1      24      240.7
12      239.8      25      239.4
13      240.2

Table 23.6. Mass of valve liners.

a. Make a histogram for mass of valve liners from Table 23.6. For the first class interval, use 239.0 grams to 239.2 grams. Based on the histogram is there any evidence that the manufacturing process is not in control? Explain.

b. Make a run chart for the mass of valve liners. Add a reference center line at µ. Add lower and upper control limits at µ ± 3σ .

c. Does the run chart show any changes in the distribution of valve-liner mass over time? Explain.

3. One process in the production of integrated circuits involves chemical etching of a layer of silicon dioxide until the metal beneath is reached. The company closely monitors the thickness of the silicon dioxide layers because thicker layers require longer etching times. The target thickness is 1 micrometer (µm) and has a standard deviation of 0.06 micrometers (based on past data when the process was in control). The company uses samples of four wafers. An x̄ chart based on 40 consecutive samples appears in Figure 23.13.


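[Control chart: Sample Mean plotted against Sample Number (0 to 40); center line at 1.0, with reference lines labeled LCL, L2, L1, U1, U2, and UCL.]

Figure 23.13. Control chart for thickness of silicon dioxide layers.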

a. Calculate the appropriate control limits (the values of the reference lines drawn in Figure 23.13). Round the values to two decimals.

b. Decide whether or not the process is in control. If not, explain which decision rule applies and identify the sample number after which the process should be shut down for adjustments.

4. The company referred to in exercise 4 has two plant lines that produce the liquid. Data from the second line appears in Table 23.7. When the process is in control, the pH of the liquid varies according to a normal distribution with mean µ = 6.0 and standard deviation σ = 0.9 . The quality control plan calls for collecting samples of size three from batches produced each hour.

Sample    pH level
1         7.2   7.4   7.4
2         6.9   6.6   6.5
3         6.2   6.3   6.3
4         6.8   6.4   6.5
5         6.5   6.6   6.7
6         6.8   6.8   6.8
7         6.2   6.3   6.4
8         5.6   5.7   5.9
9         4.9   5.8   5.6
10        6.4   6.0   4.4
11        6.9   5.3   6.2
12        5.5   5.9   5.9
13        5.3   5.1   5.2
14        6.2   6.7   6.5
15        4.9   4.7   4.8
16        6.4   6.1   7.0
17        6.3   5.8   6.0
18        4.9   5.0   5.1
19        5.5   5.7   5.3
20        5.3   5.2   5.4
21        5.8   5.8   5.6
22        5.8   5.6   5.7
23        4.8   4.7   4.6
24        4.8   4.9   4.8

Table 23.7. pH of samples.

a. Calculate the sample means for each of the 24 samples.

b. Construct an x̄ chart for the pH samples from the second plant line. Include reference lines marking the center line and one, two, and three $\sigma/\sqrt{n}$ on either side of the center line.

c. Based on the control chart from (b), does the process appear to be in control? If not, which decision rule applies and what appears to be the problem?

Unit 24: Confidence Intervals


Summary of Video

This video is an introduction to inference, which means we use information from a sample to infer something about a population. For example, we might use a sample statistic to estimate a population parameter. Suppose we wanted to know a man’s mean blood pressure. A sample of blood pressure readings is shown in Table 24.1.

        Su     M      T      W      Th     F      Sa
A.M.    130    120    140    125    130    130    140
P.M.    125    130    145    140    125    135    110

Table 24.1. Systolic blood pressure readings.

We could estimate his mean blood pressure using the sample mean from these readings, x̄ = 130. But how trustworthy is our conclusion given that different samples could lead to different results, some higher and others lower? Statisticians address this issue by calculating confidence intervals. Rather than a single number like 130, we can compute a range of values along with a confidence level for that range.

Next, the context switches from blood pressure to the length of life of batteries. Because companies promise specific battery lifetimes and improved performance over a competitor, they need proof before ads promoting their product go on the air. At Kodak’s Ultra Technologies, technicians use rigorous testing and calculate confidence intervals to back up their marketing claims. Here’s how the data are collected. Random samples of batteries are pulled from the warehouse. The batteries are drained under controlled conditions and the time it takes for them to run out of juice is recorded. From these data, Kodak has determined that its population of AA batteries when used in a toy will last 7½ hours ± 20 minutes and that their confidence in that range is 95%.

Now, we retrace Kodak’s steps to figure out how they came up with this interval. Before getting started, we need to check that a few underlying assumptions are satisfied:


1. Observations are independent.

2. Data are from a normal population or the sample size n is large.

3. The population standard deviation is known.

Selecting a random sample of batteries for the test takes care of the assumption of independent observations. The second assumption is satisfied since the sample size of n = 40 is considered large. The last assumption is not reasonable in the real world, but for now, we’ll assume that from past data we do know the population standard deviation, σ = 63.5 minutes.

The task is to calculate a confidence interval for μ, the mean life of Kodak’s AA batteries. Our sample statistic x̄ is a point estimate for the parameter μ. If we include a margin of error around our point estimate, we get an interval estimate of the form:

point estimate ± margin of error

From Unit 22, Sampling Distributions, we know that the sampling distribution of x̄ is normal, with mean $\mu_{\bar{x}} = \mu$ and standard deviation $\sigma_{\bar{x}} = \sigma/\sqrt{n}$. In this case, we are given σ = 63.5 minutes and we can compute $\sigma_{\bar{x}} = 63.5/\sqrt{40}$ minutes, or about 10 minutes. Think back to the 68-95-99.7% Rule. In any normal distribution, 95% of the observations lie within two standard deviations of the mean. So, 95% of all possible samples result in battery-life data for which µ is within plus or minus 20 minutes of that sample’s mean, x̄. In our example, x̄ = 450 minutes. So, we can say with 95% confidence that μ lies within 20 minutes of x̄, giving us a confidence interval from 430 minutes to 470 minutes. To say that we are 95% confident in our calculated range of values means that we got the numbers using a method that gives correct results 95% of the time over many, many examples.

What if Kodak were willing to settle for only 90% confidence? Or what if they insisted on 99% confidence? We can get any confidence level that we want by turning to the standard normal distribution and finding the z* critical value. Then just substitute the appropriate values into the following formula:

$\bar{x} \pm z^{*}\left(\dfrac{\sigma}{\sqrt{n}}\right)$

Notice that the margin of error gets larger if we insist on higher confidence because z* will be larger. On the other hand, the margin of error gets smaller if we take more observations so that n is larger.
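To see how the confidence level changes the interval, the formula can be applied to the battery example at 90%, 95%, and 99% confidence. This is a minimal sketch under the same assumptions as the text (σ known, n large); the function name `z_confidence_interval` is hypothetical, and z* is taken from the standard normal distribution in Python's standard library.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, confidence):
    """Confidence interval for mu when sigma is known (hypothetical helper)."""
    # z* cuts off area (1 - C)/2 in the upper tail of the standard normal curve.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / sqrt(n)
    return xbar - margin, xbar + margin

# Battery example from the text: x-bar = 450 min, sigma = 63.5 min, n = 40
for level in (0.90, 0.95, 0.99):
    low, high = z_confidence_interval(450, 63.5, 40, level)
    print(f"{level:.0%} confidence: {low:.1f} to {high:.1f} minutes")
```

With z* = 1.96 rather than the rounded value 2, the 95% margin of error comes out to about 19.7 minutes, which the video rounds to 20 minutes.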



A. Understand that a common reason for taking a sample is to estimate some property of the underlying population.

B. Recognize that a useful estimate requires a measure of how accurate the estimate is.

C. Know that a confidence interval has two parts: an interval that gives the estimate and the margin of error, and a confidence level that gives the likelihood that the method will produce correct results in the long range.

D. Be able to assess whether the underlying assumptions for confidence intervals are reasonably satisfied. Provided the underlying assumptions are satisfied, be able to calculate a confidence interval for μ given the sample mean, sample size, and population standard deviation.

E. Understand the tradeoff between confidence and margin of error in intervals based on the same data.

F. Given a specific confidence level, recognize that increasing the size of the sample can give a margin of error as small as desired.


Content Overview

The purpose of a confidence interval is to estimate an unknown parameter with an indication of (1) how precise the estimate is and (2) how confident we are that the result is correct. Any confidence interval has two parts: an interval computed from the data and a confidence level. The interval often has the form

point estimate ± margin of error.

The confidence level states the probability that the method will give a correct result. That is, if you use 95% confidence intervals often, in the long run 95% of your intervals will contain the true parameter value.

Suppose that a simple random sample of size n is drawn from a normally distributed population having an unknown mean µ and known standard deviation σ. A level C (expressed as a decimal) confidence interval for µ is

$\bar{x} \pm z^{*}\left(\dfrac{\sigma}{\sqrt{n}}\right)$,

where z* is a cutoff point for the standard normal curve with area (1 – C)/2 to its right. For example, if C = 0.95 (for a 95% confidence interval) then (1 – C)/2 = (1 – 0.95)/2 = 0.025. In this case, z* turns out to be 1.96 as shown in Figure 24.1.

[Distribution plot: standard normal density (Mean = 0, StDev = 1), Density plotted against Z, with area 0.025 shaded below -1.960 and area 0.025 shaded above 1.960.]

Figure 24.1. Standard normal density curve illustrating z* = 1.96.
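The long-run meaning of the confidence level can be illustrated with a small simulation: draw many samples from a normal population with a known mean, build a 95% interval from each, and count how often the interval captures the mean. The sketch below is only an illustration; the function name `coverage`, the seed, and the number of repetitions are arbitrary choices, and the population values µ = 50, σ = 4, n = 9 are borrowed from the Unit Activity.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, confidence, repetitions, seed=0):
    """Fraction of simulated confidence intervals that contain mu
    (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / sqrt(n)
    hits = 0
    for _ in range(repetitions):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / repetitions

print(coverage(mu=50, sigma=4, n=9, confidence=0.95, repetitions=10_000))
```

The printed fraction should be close to, but usually not exactly, 0.95, because each run is itself a random sample of intervals.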

If the sample size n is relatively small, we first need to check that the underlying assumption of normality is reasonably satisfied before computing a confidence interval. One way to check the assumption of normality is to make a normal quantile plot of the sample data. Alternatively, we could look at a boxplot. If the sample size n is large (n at least 30), the confidence interval formula is approximately correct even when the population does not have a normal distribution. This result is due to the Central Limit Theorem (Unit 22, Sampling Distributions).

The size of the margin of error controls the precision (width) of the confidence interval estimate. Precision is increased as the margin of error shrinks. The margin of error of a confidence interval decreases if any of the following occur:

• the confidence level C is reduced

• the sample size n increases

• the population standard deviation σ decreases

In practice, the population standard deviation σ is not known and must be estimated from the sample. If the sample size n is fairly large (say at least 30), then the value of the sample standard deviation s should be close to σ. In that case, you can replace σ by s in the confidence interval formula. (See Unit 26, Small Sample Inference for One Mean, for a continued discussion of confidence intervals for µ when σ is unknown.)


Key Terms

A point estimate of an unknown population parameter is a single number based on sample data (a statistic) that represents a plausible value for that parameter.

A confidence interval for a population parameter is an interval of plausible values for that parameter. It is constructed so that the value of the parameter will be captured between the endpoints of the interval with a chosen level of confidence. The confidence level is the success rate of the method used to construct the confidence interval.

Many confidence intervals have the following form: point estimate ± margin of error. The margin of error is the range of values above and below the point estimate.

A formula used to compute a confidence interval for µ when σ is known and either the sample size n is large or the population distribution is normal is given by:

$\bar{x} \pm z^{*}\left(\dfrac{\sigma}{\sqrt{n}}\right)$,

where z* is a z-critical value associated with the confidence level.


The Video


1. Why is a single blood pressure reading not sufficient if we want to estimate a person’s average blood pressure?

2. What are the two parts of any confidence interval?

3. What assumptions need to be checked before computing a confidence interval?

4. In plain language, what does “95% confidence” mean?


Unit Activity: Confidence Interval Simulation

In this activity, you will need the simulation data collected for question 2 in Unit 22’s activity. Recall that samples of size 9 were drawn from an approximately normal distribution with mean µ = 50 and standard deviation σ = 4. Assume for the moment that µ is unknown. You will be using sample data to find confidence interval estimates for µ.

1. a. What is the standard deviation of x̄ for samples of size 9?

b. What is the margin of error for a 95% confidence interval for µ? (Round your answer to two decimals.)

2. Your instructor has a container filled with numbered strips. Draw a sample of size 9.

a. Record the outcomes of your sample and calculate the sample mean, x̄.

b. Based on your sample, determine a 95% confidence interval for µ.

c. In this case, the true value of µ is 50. Does your confidence interval contain the true value of µ?

3. Your instructor should distribute a table of the results from 100 samples of size 9 generated for Unit 22’s activity.

a. For each sample, calculate a 95% confidence interval estimate for μ and record the endpoints of the interval.

b. Of the 100 samples collected, how many of the 95% confidence intervals contain the true value of μ, which is 50? How many did you expect to contain 50? Is there a discrepancy between the number you found and the number you expected to find? Explain how this discrepancy could occur.


Exercises

1. Students who take SATs in most high schools are not representative of all students in the school. Generally, only students who plan to apply for college admission take the test. The statistics class at Lincoln High decides to get better data. They select a random sample of 20 members of the junior class and arrange for all 20 to take the Math SAT. The scores are given below.

410 400 460 440 390 400 450 460 520 380

480 480 490 450 480 330 390 460 600 610

Assume that the standard deviation of scores for all juniors is σ = 100.

a. Find the value of $\sigma_{\bar{x}}$, the standard deviation of the sample mean in size-20 samples.

b. Check to see whether these data could be considered to come from a normally distributed population. (The data need only be roughly normal – in other words, the data should have no severe departures from normality.)

c. Let μ be the mean score that would be observed if every junior at Lincoln High took the exam. Give a 95% confidence interval for μ. Show your calculations. How could you get a smaller margin of error with the same confidence?

d. Give a 99% confidence interval for μ. Explain in plain language, to someone who knows no statistics, why this interval is wider than your result in (c).

2. The Massachusetts Comprehensive Assessment System (MCAS) includes a 10th grade math test which is scaled from 200 to 280. Assume that the standard deviation of math test scores is σ = 17. (This assumption is reasonable based on past results.)

a. A random sample of 30 test results is given below. Use these results to determine a 95% confidence interval for the mean MCAS math score, μ.

252 266 264 244 262 268 236 254 264 276

266 220 218 260 258 232 268 218 262 242

238 262 250 264 276 234 232 266 276 248


b. A second random sample of 30 test results was taken. The results are given below. Combine the data from the two samples, the one below and the one in (a), and use the combined data to compute a 95% confidence interval for μ.

258 252 268 264 264 264 222 258 220 254

254 274 266 264 268 248 238 248 258 254

254 258 208 268 268 272 274 254 272 270

c. Compare the margins of error for the confidence intervals in (a) and (b). Why would you expect the margin of error based on 60 observations to be less than the margin of error based on 30 observations?

d. Keeping the confidence level at 95%, how many observations would you need in order to reduce the margin of error to under 3.0?

3. A city planner randomly selects 100 apartments in Boston, Massachusetts, to estimate the mean living area per apartment. The sample yielded a mean of x̄ = 875 square feet with a standard deviation s = 255 square feet.

a. Calculate a 95% confidence interval for μ, the mean living area per apartment. (Keep in mind that since the sample size is large, s should be close to σ.)

b. Having found the interval in (a), can you say there is a 95% chance that the mean living area is within the interval? Explain why or why not.

4. A random sample of 50 full-time, hourly wage workers between the ages of 20 and 40 was selected from participants in the 2012 March Supplement, which is part of the Current Population Survey (a joint venture of the U.S. Bureau of Labor Statistics and Census Bureau). The hourly rate (in dollars) of these workers is given below.

7.25 30.09 12.00 25.00 8.00 27.53 14.20 31.00 20.00 18.00

12.00 28.12 16.50 8.00 9.00 15.00 15.10 18.00 17.43 14.00

15.25 34.50 8.00 14.80 7.80 11.00 33.07 10.55 19.00 19.50

12.25 18.00 24.00 27.50 15.00 6.75 30.00 10.30 27.00 14.50

8.00 14.00 10.00 11.75 15.00 28.00 7.50 28.50 16.25 11.75


a. Calculate the sample mean and standard deviation.

b. Calculate a 95% confidence interval for μ, the mean hourly wage of full-time, hourly wage workers between the ages 20 and 40. Because the sample size is large, use s, the sample standard deviation, in place of σ, the unknown population standard deviation.

c. A politician speaking around the time that the data for the 2012 March Supplement were collected claimed that salaries were rising. He stated that the average hourly rate for full-time workers between the ages of 20 and 40 was $20.00. Does your confidence interval from (b) affirm or refute the politician’s claim? Explain.

d. After being confronted, the politician complained that we should have used a 99% confidence interval to estimate the mean hourly wage. Compute a 99% confidence interval for μ. Does the 99% confidence interval affirm his claim? Explain.


Review Questions

1. There are many thousands of male high school basketball players. Julie collects the heights of the 96 varsity players in her school’s league. The mean height of these 96 players is x̄ = 71.1 inches and the standard deviation is s = 2.7 inches.

a. Because the sample is large, the population standard deviation σ will be close to 2.7, the observed sample standard deviation. Give a 95% confidence interval for the mean height of all varsity basketball players, assuming that Julie’s observations are a random sample. Show your calculations.

b. Do you think it is reasonable to take these 96 players as a random sample of all male varsity basketball players? Why or why not?

2. A random sample of 36 skeletal remains from females was taken from data stored in the Forensic Anthropology Data Bank (FDB) at the University of Tennessee. The femur lengths (right leg) in millimeters are recorded below.

432 432 435 460 432 440 448 449 434 443

525 451 448 443 450 467 436 423 475 435

433 438 453 438 435 413 439 442 507 424

468 419 434 483 448 514

a. Determine the sample mean and standard deviation.

Since the sample size is large, we can use the sample standard deviation s in place of σ in calculations of confidence intervals.

b. Before doing any calculations, think about 90%, 95%, and 99% confidence intervals for µ, the mean femur bone length for women. Which of these intervals would be the widest? Which would be the narrowest? Explain how you know without calculating the confidence intervals.

c. Calculate 90%, 95%, and 99% confidence intervals for µ, the mean femur bone length for adult females. Do your results confirm your answer to (b)?


3. Birth weights in grams of a random sample of 20 babies born in Massachusetts in 2010 appear below. Based on past data, assume that the standard deviation for birth weights is σ = 600 grams.

4054 3572 2636 3430 3118 3969 3628 3940 4536 4819

3883 3487 3827 3883 2749 3487 3855 4450 4309 3345

a. Are the underlying assumptions for confidence intervals reasonably satisfied?

b. Determine a 95% confidence interval for the mean birth weights of babies born in Massachusetts.

c. In the United States, we are more accustomed to reporting babies’ weights in ounces (or even pounds and ounces) than grams. How would you modify the confidence interval to give a confidence interval for the mean weight in ounces? Calculate that interval. (Use the following conversion: 1 gram ≈ 0.03527 ounce.) Does your result seem reasonable?

4. How much can a single outlier affect a confidence interval? Suppose that the first observation of 4054 grams in the random sample in question 3 had been 350 grams (the weight of a baby that did not survive).

a. Make a boxplot of the modified data set to show that this low weight baby is an outlier.

b. Recalculate the 95% confidence interval based on the modified data. How much did the outlier affect the confidence interval?

Final comment: Always look at your data before calculating confidence intervals. Outliers can greatly affect your results.

Unit 25: Tests of Significance


Summary of Video

Sometimes, when you look at the outcome of a particular study, it can be hard to tell just how noteworthy the results are. For example, if the severe injury and death rates due to car crashes on one state’s roads have dropped from 4.7% down to 3.8% after enacting a seat belt law, how would we know whether this result was due to the seat belt law or simply due to chance variation?

To sort out whether results are due to chance or there is something else at work (such as the enactment of the seat belt law), statisticians turn to a tool of inference called tests of significance. Significance testing can be applied in a variety of situations. We next explore how researchers used it to help solve a controversy in classic literature.

In 1985, scholar Gary Taylor made a surprising find while conducting research for a new edition of the complete works of William Shakespeare. While going through a 17th century anthology at the Bodleian Library at Oxford University, he came upon a sonnet he had never seen before and it was attributed to William Shakespeare. Obviously, Taylor was excited about his new find and wanted to include it in his new edition of The Complete Works.

This discovery caused quite a controversy – some scholars were thrilled by the discovery but others didn’t think the poem was good enough to be one of Shakespeare’s. Statistics to the rescue! A decade earlier, statistician Ron Thisted had done a statistical analysis of Shakespeare’s vocabulary. Thisted’s program provided a detailed, numeric description of Shakespeare’s vocabulary. For every work, Thisted could tell how many new words there were that Shakespeare didn’t use anywhere else. Using this model, Thisted predicted that if Shakespeare had written the poem in question, it would have 7 unique words in it. When they ran the poem through the program, however, they found that there were 10 unique words. Did this difference reflect random variation within Shakespeare’s writing? Or did it indicate that Shakespeare was not the author? This is where significance testing (or tests of hypotheses) can be helpful.

Thisted set up two opposing hypotheses: the null hypothesis, written as H0, that basically means nothing unusual is happening; and the alternative hypothesis, the researchers’ point of view, written as Ha. Researchers aim to reject the null hypothesis with evidence that suggests something more is going on than random variation. In this case, the hypotheses are:

H0: Shakespeare wrote the poem.

Ha: Someone other than Shakespeare wrote the poem.

The question was whether the discrepancy between the observed number of unique words, 10, and the predicted number of unique words, 7, was due to another author writing the poem rather than to chance variation. Is that three-word difference a big difference? To answer this question, Thisted assumed (based on his data) that the number of unique words in Shakespeare’s poems had the approximately normal distribution with mean µ = 7 and standard deviation σ = 2.6 shown in Figure 25.1.

Figure 25.1. Distribution of the number of unique words in Shakespeare’s poems.

The shaded area under the density curve in Figure 25.2 corresponds to the probability of a number of unique words at least as extreme as 10 (in other words, a difference from 7 of 3 or more words).


Figure 25.2. Finding the p-value.

Using technology, we find that the shaded area is 2(0.1243) = 0.2486. Thus, Thisted could expect to find a value at least as extreme as 10 unique words roughly 25% of the time. Therefore, Thisted failed to find significant evidence against the null hypothesis that Shakespeare wrote the poem. He could not reject H0. In the absence of literary or statistical evidence against Shakespeare’s authorship, the poem was published in Taylor’s edition of The Complete Works.
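As one possible way to carry out the "using technology" step, the shaded area can be computed with Python's standard library. The sketch assumes exactly the model stated in the text, a normal distribution with mean 7 and standard deviation 2.6; the variable names are illustrative only.

```python
from statistics import NormalDist

# Model from the text: unique words per poem ~ Normal(mean 7, sd 2.6).
unique_words = NormalDist(mu=7, sigma=2.6)

observed = 10
upper_tail = 1 - unique_words.cdf(observed)   # area at or above 10
p_value = 2 * upper_tail                      # add the matching lower tail (at or below 4)

print(f"upper tail = {upper_tail:.4f}, two-sided p-value = {p_value:.4f}")
```

Doubling the upper tail works here because the normal curve is symmetric about 7, so the area at or below 4 equals the area at or above 10.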

Since we want to work with sample means, let’s suppose researchers found a folio of five new poems that were attributed to Shakespeare. Suppose that our sample mean from the five poems in the folio is x̄ = 8.2. We want to know if, based on this evidence, we can conclude that Shakespeare did not write these poems. We set up our null and alternative hypotheses:

H0 : µ = 7 Shakespeare wrote the poems.
Ha : µ ≠ 7 Someone else wrote the poems.

One thing to decide, when setting up a significance test, is whether to use a one-sided or two-sided alternative hypothesis. In our Shakespeare example, we are using a two-sided alternative hypothesis because a different author might consistently use either more or fewer unique words than Shakespeare. But suppose we suspected the poem was written by a particular author who was known to consistently use more unique words than Shakespeare?


Then the alternative hypothesis would be one-sided:

Ha : µ > 7

We begin by assuming the null hypothesis is true. Then we find the probability of getting a result at least as extreme as ours if the null hypothesis really is true. If these poems were written by Shakespeare, then the distribution of x̄, the mean number of unique words per poem in five poems, would have a normal distribution with the following mean and standard deviation:

$\mu_{\bar{x}} = \mu = 7$

$\sigma_{\bar{x}} = \dfrac{2.6}{\sqrt{5}} \approx 1.163$

Next, we need to find the probability that any sample of five of Shakespeare’s poems would have an x̄ at least as far from 7 as what we observed from our sample, x̄ = 8.2. Figure 25.3 illustrates this probability. Notice that two areas are shaded because our alternative is two-sided.

Figure 25.3. Sampling distribution of x̄.

To calculate this probability from a standard normal table, we find the z-score for our observed sample mean. This is called a z-test statistic:


$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

$z = \dfrac{8.2 - 7}{2.6 / \sqrt{5}} \approx 1.03$

So, the observed value of our test statistic z is 1.03, a little more than one standard deviation away from the mean, 0, on the standard normal curve. The final step in our test of significance is to find the probability of observing a value from a standard normal distribution that is at least this extreme. This probability is called the p-value. To find this p-value, we use z = 1.03 and look in the standard normal table (z-table). From Figure 25.4, we find that the area under the standard normal curve to the left of 1.03 is 0.8485.

Figure 25.4. Portion of standard normal table (z-table).

That means that 1 – 0.8485 or 0.1515 is the area in the right tail (the shaded region in Figure 25.5). Since we chose a two-sided alternative, we double this value because we are interested in the area under BOTH tails (the area to the right of 1.03 and the area to the left of -1.03). Our final result gives a p-value of 0.303.
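The same steps can be checked with software instead of a z-table. The sketch below uses only the numbers given in the text (µ = 7, σ = 2.6, n = 5, sample mean 8.2); the variable names are illustrative. Because the code does not round z to 1.03 before finding the tail area, its p-value comes out very slightly below the table-based 0.303.

```python
from math import sqrt
from statistics import NormalDist

mu_0, sigma, n = 7, 2.6, 5     # null-hypothesis mean, known sd, sample size
x_bar = 8.2                    # observed sample mean

sigma_x_bar = sigma / sqrt(n)                    # standard deviation of the sample mean
z = (x_bar - mu_0) / sigma_x_bar                 # z-test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided: both tails

print(f"sigma_x_bar = {sigma_x_bar:.3f}, z = {z:.2f}, p-value = {p_value:.3f}")
```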


Figure 25.5. Finding the p-value from a standard normal distribution.

From the p-value, we know that there is a 30.3% chance that random variation would produce a mean unique word count as far from 7 in either direction as 8.2. Since a 30.3% chance is a pretty good chance, we have failed to disprove the null hypothesis. We have not found good evidence against Shakespeare’s authorship of these new poems.

This example helps illustrate the general rule about p-values: Small p-values give evidence against the null hypothesis; large p-values fail to give such evidence. Since p-values can range from the very small – close to zero – to the very large – close to one, researchers need to decide when a p-value is small enough for them to reject the null hypothesis. One of the most common levels is 0.05 or 5%. If something is statistically significant at the 5% level, it means that the results produced a p-value less than 0.05. Another widely used level is 0.01 or the 1% level.



A. Understand that a significance test answers the question “Is this sample outcome good evidence that an effect is present in the population, or could it easily occur just by chance?”

B. Be able to formulate the null hypothesis and alternative hypothesis for tests about the mean of a population. Understand that the alternative hypothesis is the researcher’s point of view.

C. Understand the concept of a p-value. Know that smaller p-values indicate stronger evidence against the null hypothesis.

D. Be able to calculate p-values as areas under a normal curve in the setting of tests about the mean of a normal population with known standard deviation.

E. Be able to test a population mean with a z-test.


Content Overview

A significance test (also called a test of hypotheses) answers the question “Is this sample outcome good evidence that an effect is present in the population, or could it easily occur just by chance?” The reasoning is as follows: Suppose, for the moment, that we assume the effect is not present in the population. If the observed result is very unlikely to occur given this assumption, that’s evidence that the supposition of “no population effect” is false.

The statement being tested in a test of significance is called the null hypothesis, written H0. For example, H0 might state that a population parameter, such as the mean µ, takes a specific value. Usually the null hypothesis is a statement of “no effect” or “no difference” or “status quo.” The test of significance is designed to assess the strength of the evidence against the null hypothesis and in favor of an alternative hypothesis Ha that represents the effect we hope or suspect is true. (Ha is generally the researcher’s point of view.) The alternative hypothesis might be that the parameter differs from its null value, in a specific direction (one-sided alternative) or in either direction (two-sided alternative).

Suppose that we want to conduct a test about the mean of a population. More specifically, suppose that we want to test that the mean has a specific value, which we’ll call µ0, or that it doesn’t have that value, or is smaller than that value, or larger than that value. We form two opposing hypotheses – the null and alternative hypotheses – which we express symbolically as follows (select one of the possible alternatives):

H0 : µ = µ0
Ha : µ ≠ µ0 or Ha : µ > µ0 or Ha : µ < µ0

To test the hypothesis H0 : µ = µ0 based on a random sample of size n from a population with unknown mean µ and known standard deviation σ, we compute the sample mean x̄. Here’s a recap of what we know about x̄:

• If H0 is true and the population is normal, then x̄ has the normal distribution with mean µ0 and standard deviation $\sigma / \sqrt{n}$.

• Suppose instead that the population does not follow a normal distribution. If the sample size n is large, we can apply the Central Limit Theorem and conclude that x̄ is approximately normally distributed with mean µ0 and standard deviation $\sigma / \sqrt{n}$.


• Next, still assuming H0 is true, we convert x̄ into a z-score. The result is the z-test statistic given below:

$z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$

If H0 is true, z has the standard normal distribution (at least approximately).

Now, we work through an example. Researchers studying the effects of smoking on sleep believe that men who smoke need more sleep than what is average for men, which is 7.5 hours per night. Let μ be the mean number of hours of sleep for men who smoke. Assume that the standard deviation is σ = 0.5 hours. The null and alternative hypotheses are:

H0 : µ = 7.5
Ha : µ > 7.5

A random sample of 50 smokers completed a questionnaire in which they were asked to record the number of hours they sleep each night. The sample mean is x̄ = 7.7 hours. We compute the z-test statistic as follows:

$z = \dfrac{7.7 - 7.5}{0.5 / \sqrt{50}} \approx 2.83$

From the z-test statistic, we learn that the observed value of x̄ = 7.7 is 2.83 standard deviations from the hypothesized mean from H0, µ = 7.5. If H0 is true, then z has the standard normal distribution. Now, we are ready to evaluate the evidence against H0 – How likely would it be to observe a value from the standard normal distribution that is at least as extreme as 2.83? The answer, around 0.2%, is illustrated in Figure 25.6. Around 0.2% is pretty unlikely. So, in this case, we reject the null hypothesis and accept the alternative: Male smokers, on average, need more sleep than men in general.


Figure 25.6. The evidence against H0.

As we saw in the previous example, the distribution of the z-test statistic, under the assumption that H0 is true, allows us to use the observed z-value to assess the evidence against H0. We calculate the probability, assuming H0 is true, of observing a value from the standard normal distribution as extreme or more extreme than the z-value we calculated – this probability is called the p-value. Because there are three possible alternatives, there are three possibilities for computing the p-value:

1. The p-value for a test of H0 against Ha : µ > µ0 is the probability of observing a value from the standard normal distribution that is at least as large as the observed z-test statistic. (See Figure 25.7 (1).)

2. The p-value for a test of H0 against Ha : µ < µ0 is the probability of observing a value from the standard normal distribution that is at least as small as the observed z-test statistic. (See Figure 25.7 (2).)

3. The p-value for a test of H0 against Ha : µ ≠ µ0 is the probability of observing a value from the standard normal distribution that is at least as far from 0 (on either side of 0) as the observed z-test statistic. (See Figure 25.7 (3).)
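The three cases above can be collected into one small function. This is a sketch, not a standard library routine: the name `z_test` and the `alternative` labels ("greater", "less", "two-sided") are choices made here for illustration. The usage lines rerun the sleep example from this unit (sample mean 7.7, µ0 = 7.5, σ = 0.5, n = 50).

```python
from math import sqrt
from statistics import NormalDist


def z_test(x_bar, mu_0, sigma, n, alternative="two-sided"):
    """Return (z, p_value) for a test of H0: mu = mu_0 when sigma is known."""
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    std_normal = NormalDist()              # mean 0, standard deviation 1
    if alternative == "greater":           # Ha: mu > mu_0, area to the right of z
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":            # Ha: mu < mu_0, area to the left of z
        p_value = std_normal.cdf(z)
    elif alternative == "two-sided":       # Ha: mu != mu_0, area in both tails
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, p_value


# Sleep example from this unit, with the one-sided alternative Ha: mu > 7.5.
z, p_value = z_test(x_bar=7.7, mu_0=7.5, sigma=0.5, n=50, alternative="greater")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

For the same data, the two-sided p-value is exactly twice the one-sided value in the direction of the observed z, which is why the choice of alternative should be made before looking at the data.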

Figure 25.7. Calculating p-values corresponding to alternative hypotheses (1 – 3).

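[Figure 25.7 panels (1), (2), and (3): each panel shows a standard normal curve with 0 and the observed z marked and the p-value shaded.]

[Figure 25.6 plot: Standard Normal Density Curve, with Z on the horizontal axis and Density on the vertical axis; the area 0.002327 lies to the right of z = 2.83.]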


A small p-value means that standard normal values at least as extreme as the observed z-test statistic are very unlikely to occur if the null hypothesis is true. Hence, small p-values provide evidence against the null hypothesis in support of the alternative.

Sometimes we set a cutoff for the p-value, called the significance level. For example, if the p-value is below 0.05 (p < 0.05), we say the results are significant at the 0.05 level, or the 5% level.
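The comparison of a p-value with a significance level is simple enough to write as a one-line rule. In the sketch below, the helper name `is_significant` is hypothetical, and the two p-values are the rounded figures quoted in this unit (0.303 for the five poems and about 0.002 for the sleep study).

```python
def is_significant(p_value, alpha=0.05):
    """True when the p-value falls below the chosen significance level."""
    return p_value < alpha


# Rounded p-values quoted in this unit.
examples = {"five poems": 0.303, "sleep study": 0.002}

for name, p in examples.items():
    for alpha in (0.05, 0.01):
        verdict = "reject H0" if is_significant(p, alpha) else "fail to reject H0"
        print(f"{name}: p = {p}, level = {alpha}: {verdict}")
```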


Key Terms

A significance test or test of hypotheses is a method that uses sample data to decide between two competing claims.

The claim tested by a significance test is called the null hypothesis. Usually the null hypothesis is a statement about “no effect” or “no change.” The claim that we are trying to gather evidence for – the researcher’s point of view – is called the alternative hypothesis. The alternative hypothesis is two-sided if it states that a parameter is different from the null hypothesis value. The alternative hypothesis is one-sided if it states that either a parameter is greater than or a parameter is less than the null hypothesis value.

A test statistic is a quantity computed from the sample data that measures the gap between the null hypothesis and the sample data. A test statistic is used to make a decision between the null and alternative hypotheses.

The p-value is the probability, computed under the assumption that the null hypothesis is true, of observing a value of the test statistic at least as extreme as the one that was actually observed.

The significance level of a test of hypotheses is the highest p-value for which we will reject the null hypothesis.

A z-test statistic for testing H0 : µ = µ0, where μ is the population mean, is given by:

$z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$

The z-test is used in situations where the population standard deviation σ is known and either the population has a normal distribution or the sample size n is large.


The Video


1. In the 1970s, statistician Ron Thisted did a statistical analysis of Shakespeare’s vocabulary. Based on his analysis he created a computer program. What could his program tell you about a Shakespearean poem?

2. In analyzing a poem to see whether or not it was authored by Shakespeare, Thisted set up a null hypothesis and an alternative hypothesis. State those hypotheses in words.

3. What was the approximate distribution of the number of unique words per poem in Shakespeare’s poems?

4. Thisted observed 10 unique words in the newly discovered poem. Was that sufficient evidence to conclude that Shakespeare did not write the poem?

5. Which is better evidence against the null hypothesis, a large p-value or a small p-value?


Unit Activity: Chips Ahoy!

Nabisco Chips Ahoy is a popular brand of chocolate chip cookie. In the 1980s, Nabisco ran television ads claiming that their cookies had, on average, 16 chips per cookie. Since the 1980s many more brands of chocolate chip cookies have appeared on supermarket shelves, which could have put pressure on Nabisco to improve its product perhaps by increasing the amount of chips. On the other hand, the price of chocolate has increased, which could have had the opposite effect. In this activity, you will test whether or not Nabisco could run the same ad today.

1. Collect the data. Your instructor will provide directions and, after the data collection is complete, distribute the data. (Save the data for use in Unit 27’s activity.)

2. Compute the mean and standard deviation of the number of chips per cookie.

3. a. State the null and alternative hypotheses.

b. Calculate the value of the z-test statistic. (Since the sample size is large, use s in place of σ.)

c. Calculate the p-value and state your conclusion.

4. Calculate a 95% confidence interval for µ. Does your confidence interval indicate that µ has increased, decreased, or remained the same from its value in the 1980s?


Exercises

1. Each of the following situations requires a significance test about a population mean μ. State the appropriate null hypothesis, H0, and alternative hypothesis, Ha, in each case.

a. Larry’s car averages 32 miles per gallon on the highway. He switches to a new motor oil that is advertised as increasing gas mileage. After driving 3000 highway miles with the new oil, he wants to determine if his gas mileage actually has increased.

b. A university gives credit in a French language course to students who pass a placement test. The language department wants to know if students who get credit in this way differ in their understanding of spoken French from students who actually take the French course. Some faculty think the students who test out of the course are better, but others argue that they are weaker in oral comprehension. Experience has shown that the mean score of students in the course on a standard listening test is 24. The language department gives the same listening test to a sample of 40 students who passed the placement test to see if their performance is different.

c. Experiments on learning in animals sometimes measure how long it takes a mouse to find its way through a maze. The mean time is 18 seconds for one particular maze. A student thinks that a loud noise will cause the mice to complete the maze faster. She measures how long each of 10 mice takes with a noise as stimulus.

2. The Survey of Study Habits and Attitudes (SSHA) is a psychological test that measures the motivation, attitude toward school, and study habits of students. Scores range from 0 to 200. The mean score for U.S. college students is about 115, and the standard deviation is about 30. A teacher who suspects that older students have better attitudes toward school gives the SSHA to 25 students who are at least 30 years of age. Their mean score is x̄ = 125.2.

Assume that σ = 30 for the population of older students, and that the students tested are a random sample from the population of older college students. Carry out a significance test of

H0 : µ = 115
Ha : µ > 115

Report the value of the test statistic, the p-value of your test, and state your conclusion clearly.


3. Radon is a colorless, odorless gas that is naturally released by rocks and soils and may concentrate in tightly closed houses. Because radon is slightly radioactive, there is some concern that it may be a health hazard. Radon detectors are sold to homeowners worried about this risk, but the detectors may be inaccurate. Tricia wants to study the accuracy of radon detectors for a science fair project. At a nearby university, she places 12 detectors in a chamber where they are exposed to 105 picocuries per liter (pci/l) of radon over 3 days. Here are the readings given by the detectors.

91.9 97.8 111.4 122.3 105.4 95.0 99.6 96.6 119.3 104.8 101.7 103.8

a. In this case, the sample size n = 12 is relatively small. Check to see if it is reasonable to assume these data come from an approximately normal population.

b. Do these observations provide good evidence that the average detector reading differs from the true value of 105? Assume that you know that the standard deviation of readings for all detectors of this type is σ = 9.

4. The CDC publishes charts on Body Mass Index (BMI) percentiles for boys and girls of different ages. Based on the chart for girls, the mean BMI for 6-year-old girls is listed as 15.2 kg/m². The data from which the CDC charts were developed are old, and there is concern that the mean BMI for 6-year-old girls has increased. The BMIs of a random sample of 30 6-year-old girls are given below.

24.5 16.3 15.7 20.6 15.3 14.5 14.7 15.7 14.4 13.2 16.3 15.9 16.3 13.5 15.5 14.3 13.7 14.3 13.7 16.0 14.2 17.3 19.5 22.8 16.4 15.4 18.2 13.9 17.6 15.5

a. State null and alternative hypotheses relevant to this situation.

b. Calculate the sample mean and standard deviation.

c. Since the sample size is relatively large, use s in place of σ and calculate the value of the z-test statistic. Then calculate the p-value.

d. Based on your answer to (c), do the sample data provide sufficient evidence that the mean BMI for 6-year-old girls has increased? Explain.


Review Questions

1. Small amounts of sulfur compounds are often present in wine. Because these compounds have unpleasant odors, wine experts have determined the odor threshold, the lowest concentration of a compound that a trained human nose can detect. For example, the odor threshold for dimethyl sulfide (DMS) is 25 micrograms per liter of wine (µg/l). Untrained noses may be less sensitive, however. A wine researcher found the DMS odor thresholds for 10 students in his restaurant management class. Here are the data.

31 31 43 36 23 34 32 30 20 24

Assume that the standard deviation of the odor threshold for untrained noses is known to be σ = 7 µg/l.

a. Is it reasonable to assume the data are from an approximately normal population? Explain.

b. The researcher believes that the mean odor threshold for beginning students is higher than the published threshold, 25 µg/l, and decides to conduct a significance test. What are the null and alternative hypotheses?

c. Carry out a significance test. Report the value of the test statistic, the p-value, and your conclusion.

2. In 2010/2011 the national mean SAT Math score was 514. Faculty at a state university had disagreements over their students’ mathematics preparation for college. Some felt that their students had fallen below the national average, and others felt that their students had made some advances. To help answer this question, math faculty took a random sample of 50 students who entered the university fall semester 2011. The SAT Math scores from those students are given below.

580 540 520 490 430 570 520 540 440 610 430 390 470 550 390 500 550 440 550 660 560 550 450 560 680 630 400 450 500 460 460 530 590 380 660 570 520 530 500 680 450 590 660 420 370 550 450 510 480 500



b. Do these data provide sufficient evidence that the mean SAT Math scores of students entering the university in fall 2011 differed from the national mean? State the hypothesis you are testing, the value of the test statistic, the p-value and your conclusion. (Replace σ in the test statistic by s since the sample size is large.)

c. Construct a 95% confidence interval for µ, the mean Math SAT for students entering this university in fall 2011. (Refer to Unit 24, Confidence Intervals.) Does your confidence interval indicate that the true mean SAT Math score for students entering the university in fall 2011 is less than 514, could be 514, or is greater than 514? Explain.

3. The average length of calls coming into a municipal call center had been around 90 seconds. Lately, there has been some concern that more complicated calls are coming into the center, causing the mean length of the calls to increase. In order to test this assumption, the city draws a random sample of 100 calls. The sample mean and standard deviation are x̄ = 118.4 seconds and s = 186.5 seconds, respectively.

a. State the hypotheses being tested.

b. Do these data provide good evidence that the average call length has increased from 90 seconds? (Since the sample size is large, use s in place of σ.) Show the work needed to support your answer. Conduct the significance test at the 0.05 level.

c. Suppose city planners are willing to run the test at the 0.10 level. (They will reject the null hypothesis if the p-value is below 0.10.) Would this change the conclusion reached in (b)? Explain.

4. Eating fish contaminated with mercury can cause serious health problems. Mercury contamination from historic gold mining operations is fairly common in sediments of rivers, lakes and reservoirs today. A study was conducted on Lake Natoma in California to determine if the mercury concentration in fish in the lake exceeded guidelines for safe human consumption. Suppose that you are an inspector for the Fish and Game Department and that you are given the task of determining whether to prohibit fishing in Lake Natoma. You will close the lake to fishing if it is determined that fish from the lake have unacceptably high mercury content.


a. Assuming that mercury concentration of 5 ppm is considered the maximum safe concentration, which of the pairs of hypotheses below would you test? Justify your choice.

H0 : µ = 5 versus Ha : µ > 5

or

H0 : µ = 5 versus Ha : µ < 5

b. Would you prefer a significance level of 0.1 or 0.01 for your test? Explain your choice.

Unit 26: Small Sample Inference for One Mean


Summary of Video

The z-procedures for computing confidence intervals or hypothesis testing work in cases where we know the population’s standard deviation. But that’s hardly ever the case in real life. For times when we don’t know the population standard deviation but still want to figure out confidence intervals and do significance tests, statisticians turn to t-inference procedures. These t-procedures were invented in 1908 by William S. Gosset. Gosset was a chemist at the Guinness Brewery in Ireland. Making ale requires constant sampling of everything from barley to yeast to the beer itself. Gosset wanted to save time and money using small samples and their standard deviations as an estimate of the unknown population standard deviation, σ. Using the standard deviation s derived from only a few data values does not give a sufficiently good estimate of the entire population’s standard deviation; and he couldn’t proceed with a z-procedure. In his efforts to figure out a way around this issue, Gosset created a new class of distribution called the t-distributions.

The video now turns to a modern day brewery, Pretty Things Beer and Ale Project. Dann Paquette and Martha Holley-Paquette, owners of the operation, take samples at various stages in the brewing process. At one stage, they take a sample and measure its density. They aim for a density reading of at least 19.5 degrees Plato. (Degrees Plato measure how much more dense the liquid is than water.) In one batch of Baby Tree beer, they got a reading of 20.3 degrees Plato – a good sign that this batch will be great.

Let’s imagine that Pretty Things wants to see how closely their production of Baby Tree beer is hitting their pre-fermentation density goal of 19.5 degrees Plato. Data collected from 10 batches are given below.

20.2 18.9 19.6 20.6 20.3 18.7 21.0 18.5 20.1 19.3

Since our sample size is small, its standard deviation could be quite far from the population standard deviation of all Baby Tree beer ever brewed. So, z-procedures won’t work. Instead, we call on the t-procedure that William Gosset invented.


In Figure 26.1, we compare a t-distribution for a sample of size 3 to the standard normal curve.

Figure 26.1. Comparing t-distribution to standard normal distribution.

The two curves share certain features. They are both bell-shaped. But the t-density curve is broader with a shorter peak, and its tails are higher. These fatter tails mean there is more probability of getting results far from zero. That’s because the sample standard deviation varies from sample to sample, particularly when the sample size is small, adding uncertainty. Another difference is that although there is a single standard normal distribution, there is a different t-distribution for every sample size. As the sample size increases, the sample standard deviation s gets closer and closer to the population standard deviation, σ, and the t-density curve gets closer to the standard normal curve.
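To see the heavier tails numerically, here is a minimal Python sketch that evaluates both density curves at a few points. It assumes the standard formula for the t-density and uses df = 2 for a sample of size 3 (degrees of freedom are n – 1, as described in the next paragraph). The function names t_pdf and normal_pdf are our own illustrative choices and do not come from the video.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Assumption: a sample of size 3 corresponds to df = 3 - 1 = 2.
for x in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(f"x = {x:3.1f}   normal = {normal_pdf(x):.4f}   t(df=2) = {t_pdf(x, 2):.4f}")
```

At 0 the t-density is lower than the normal density, while at 3 or 4 it is noticeably higher, which matches the shorter peak and fatter tails described above.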

The different t-distributions are specified by something called degrees of freedom, which are related to the sample size n: degrees of freedom = n – 1. For our beer data, the degrees of freedom are 10 – 1 = 9. Next, we calculate a confidence interval for the population mean of the density of all Baby Tree beer. For these calculations to work, our data need to be from a normal distribution, which is a safe assumption in this case. Here’s our formula:

$$\bar{x} \pm t^* \frac{s}{\sqrt{n}}$$

Filling in information from our data, we get

$$19.72 \pm t^* \frac{0.85}{\sqrt{10}}$$



In order to determine t*, we choose whatever confidence level we like – we’ll go with 95%. We calculate the value of t* from software as illustrated in Figure 26.2: t* = 2.262.

Figure 26.2. t-distribution with df = 9.

Plugging in the value of t*, we get the following confidence interval:

$$19.72 \pm (2.262)\left(\frac{0.85}{\sqrt{10}}\right) \approx 19.72 \pm 0.61$$

So, our confidence interval (19.11, 20.33) gives us a range of plausible values for µ, the mean density of Baby Tree beer.
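The arithmetic above can be checked with a short Python sketch. It assumes only the ten density readings listed earlier and takes t* = 2.262 directly from the text rather than computing it; the variable names are illustrative.

```python
import math
import statistics

# Pre-fermentation densities (degrees Plato) for the 10 batches listed above.
densities = [20.2, 18.9, 19.6, 20.6, 20.3, 18.7, 21.0, 18.5, 20.1, 19.3]

n = len(densities)
x_bar = statistics.mean(densities)
s = statistics.stdev(densities)   # sample standard deviation (divides by n - 1)
t_star = 2.262                    # 95% critical value for df = 9, taken from the text

margin = t_star * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(f"mean = {x_bar:.2f}, s = {s:.2f}, margin of error = {margin:.2f}")
print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")
```

Note that statistics.stdev divides by n – 1, which is the sample standard deviation s used in the formula.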

Take a moment to compare z*, the z-critical value for a 95% z-confidence interval with our value for t*:

z* = 1.960, t* = 2.262

Here, the t-critical value of 2.262 gives us a wider confidence interval. That is the price we pay for having a small sample and for not knowing σ.

Pretty Things’ goal for their Baby Tree beer is a density of 19.5 degrees Plato. Using the confidence interval that we have calculated, we can say with 95% confidence that a 19.5 population mean is within our range of plausible values for µ.

(Figure 26.2 shows a t-density curve with df = 9: area 0.95 between −2.262 and 2.262, and area 0.025 in each tail.)



A. Understand when to use t-procedures for a single sample and how they differ from the z-procedures covered in Units 24 and 25.

B. Understand what a t-distribution is and how it differs from a normal distribution.

C. Know how to check whether the underlying assumptions for a t-test or t-confidence interval are reasonably satisfied.

D. Be able to calculate a t-confidence interval for a population mean.

E. Be able to test a population mean with a t-test. Be able to calculate the t-test statistic and to determine the p-value as an area under a t-density curve.

F. Be able to adapt one-sample t-procedures to analyze matched pairs data.


Content Overview

In Units 24 and 25, we introduced z-procedures for (1) calculating confidence intervals for a population mean and (2) conducting significance tests about a population mean. The confidence interval formula and z-test statistic are as follows:

(1) $\bar{x} \pm z^* \dfrac{\sigma}{\sqrt{n}}$

(2) $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$

For both procedures we assumed that the population was normally distributed or the sample size n was large, and that the population standard deviation σ was known. However, in real life, σ is generally unknown.

Let’s start with an example. The weights (pounds) from a random sample of 16 4-year-old children who took part in a study on childhood obesity appear below.

37.1 26.7 36.1 36.2 40.3 43.9 36.2 40.7

42.5 34.8 37.9 34.5 31.1 36.4 35.7 33.4

From these data, we can compute the sample mean and sample standard deviation: x̄ = 36.47 lbs and s = 4.23 lbs. However, the population standard deviation σ is unknown.

Nevertheless, we would like to calculate a confidence interval for µ, the mean weight of 4-year-olds.

It seems reasonable to assume that weights of 4-year-olds are normally distributed, and the normal quantile plot in Figure 26.3 supports this assumption.

(Normal quantile plot with 95% confidence bands; horizontal axis: Weight Age 4 (pounds); vertical axis: Percent.)

Figure 26.3. Normal quantile plot of children’s weights.


But we still have to deal with the fact that σ is unknown. We know that when the sample size n is large, s will be close to σ. But in this case n = 16, which is not large enough. Hence, simply replacing σ by s in the z-confidence interval formula would introduce too much additional variability. To compensate for the additional variability, we also replace z*, a critical value from a standard normal distribution, with t*, a critical value from a t-distribution. The result is the formula for computing t-confidence intervals:

$$\bar{x} \pm t^* \frac{s}{\sqrt{n}}$$

The t-distributions have some features in common with the standard normal distribution. Both have density curves that are bell shaped and centered at zero as can be seen in Figure 26.4. However, there is more area in the tails of t-distributions than there is for the standard normal distribution. This difference is particularly noticeable when the degrees of freedom are small (See Figure 26.4(a).).

(a) t-distribution has 5 degrees of freedom. (b) t-distribution has 15 degrees of freedom.

Figure 26.4. Comparison: two t-distributions with standard normal distribution.

The degrees of freedom (df) associated with t*, the t-critical value in our confidence interval formula, are related to the size n of the sample:

df = n – 1

So, for our sample of 16 observations, df = 15. The value of t* for a 95% confidence interval can be determined from a t-table (Figure 26.5) or using statistical software (Figure 26.6). In either case, t* = 2.131.
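As a rough illustration of how software might arrive at t*, the sketch below searches for the value whose central area under the t-density equals the confidence level. It is only one possible approach (Simpson's rule for the area plus a bisection search), not necessarily what any particular statistics package does, and the helper names t_pdf, central_area, and t_critical are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def central_area(t, df, steps=2000):
    """Area under the t-density between -t and t (Simpson's rule, even step count)."""
    h = 2 * t / steps
    total = t_pdf(-t, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(-t + i * h, df)
    return total * h / 3

def t_critical(confidence, df):
    """Bisection search for t* whose central area equals the confidence level."""
    low, high = 0.0, 50.0
    for _ in range(60):
        mid = (low + high) / 2
        if central_area(mid, df) < confidence:
            low = mid    # too little area: t* lies further out
        else:
            high = mid   # enough area: t* lies closer in
    return (low + high) / 2

print(f"t* for 95% confidence, df = 15: {t_critical(0.95, 15):.3f}")
```

Under these assumptions the search should agree, to three decimals, with the t* values quoted in the text for df = 15 and df = 9.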



Figure 26.5. Using a t-table to determine t*.

Figure 26.6. Using statistical software to determine t*.

We now have everything that we need to calculate a 95% confidence interval for µ, the mean weight of 4-year-olds. Here are the calculations:

$$\bar{x} \pm t^* \frac{s}{\sqrt{n}} = 36.47 \pm (2.131)\frac{4.23}{\sqrt{16}} = 36.47 \pm 2.25$$

Hence, we can say that µ is between 34.22 lbs and 38.72 lbs and that we have used a process to calculate this interval that has a 95% track record of giving correct results.

Next, we turn our attention to significance tests about a population mean µ for situations where the sample size is relatively small (n < 30), the population has a normal distribution, but σ is unknown. The t-test statistic results from replacing σ in the z-test statistic with s:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

We return to our sample of 16 4-year-olds to see how this works.

Suppose that a height chart listed the average height of 4-year-olds as 39 inches. The heights of the sample of 16 4-year-olds are given below.

39.9 37.4 40.3 39.6 39.2 43.2 40.5 40.6

41.8 39.5 40.9 39.8 40.3 39.4 40.7 39.5

(Figure 26.6 shows a t-density curve with df = 15: area 0.95 between −2.131 and 2.131, and area 0.025 in each tail.)

(Figure 26.5 shows the t-table row for 15 degrees of freedom:)

Confidence level C     90%      95%      96%
t* (df = 15)           1.753    2.131    2.249

We suspect that children’s heights have increased since the time the height chart was created due in part to better nutrition. To test our supposition, we let µ represent the mean height of 4-year-olds. The null and alternative hypotheses are:

H0 : µ = 39
Ha : µ > 39

The sample mean and standard deviation for the height data are: x̄ = 40.163 inches and s = 1.255 inches. Now we calculate the t-test statistic, replacing µ0 with its value from the null hypothesis and substituting in the sample values for x̄ and s:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{40.163 - 39}{1.255/\sqrt{16}} \approx 3.71$$

If the null hypothesis is true, then the t-test statistic will have a t-distribution with n – 1 = 16 – 1 or 15 degrees of freedom. All that is left is to determine the p-value, the probability of observing a value at least as extreme as 3.71. Using statistical software we find p ≈ 0.001 as illustrated in Figure 26.7. Hence, we reject the null hypothesis and conclude the mean height of 4-year-olds has increased since the time that the height chart was created.

(t-density curve with df = 15; the area to the right of t = 3.71 is 0.001048.)

Figure 26.7. Determining the p-value.
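The test statistic and p-value can be reproduced with a short Python sketch. It uses the 16 heights listed above and approximates the area under the t-density with Simpson's rule, which is our own assumption about how to obtain the tail area without a statistics package; the helper names t_pdf and upper_tail are hypothetical.

```python
import math
import statistics

heights = [39.9, 37.4, 40.3, 39.6, 39.2, 43.2, 40.5, 40.6,
           41.8, 39.5, 40.9, 39.8, 40.3, 39.4, 40.7, 39.5]
mu_0 = 39.0   # mean height under the null hypothesis

n = len(heights)
x_bar = statistics.mean(heights)
s = statistics.stdev(heights)
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
df = n - 1

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=4000):
    """P(T >= t) for t >= 0: one half minus the Simpson's-rule area from 0 to t."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

p_value = upper_tail(t_stat, df)   # one-sided, because Ha is mu > 39
print(f"t = {t_stat:.2f}, df = {df}, one-sided p-value = {p_value:.4f}")
```

Because the alternative hypothesis is one-sided (µ > 39), only the area to the right of the test statistic is counted.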

The study involving the 16 children followed the children for two years. As part of the study, children were weighed when they were four and again when they were six. Table 26.1 shows the results, including the children’s weight gain over the two-year period (Difference).


Weight Age 4 (lbs)    Weight Age 6 (lbs)    Difference (lbs)
37.1                  46.9                   9.8
26.7                  35.9                   9.2
36.1                  48.3                  12.2
36.2                  44.2                   8.0
40.3                  50.6                  10.3
43.9                  56.4                  12.5
36.2                  51.8                  15.6
40.7                  50.2                   9.5
42.5                  51.6                   9.1
34.8                  41.4                   6.6
37.9                  55.3                  17.4
34.5                  44.5                  10.0
31.1                  39.9                   8.8
36.4                  46.3                   9.9
35.7                  49.6                  13.9
33.4                  39.0                   5.6

Table 26.1. Change in weight from age 4 to age 6.

In the past, children around this age would have been expected to gain 4.5 pounds per year or 9 pounds over the two-year period. However, we suspect that the mean change in weight has increased. To test this assumption, we perform what is called a matched-pairs t-test. The parameter µ in a matched-pairs t-procedure is the mean difference in observations (or responses) on each individual (or subject in a matched pair) – in our case, µ is the mean difference between weight at age 4 and weight at age 6. We set up the null hypothesis (no change from what was expected in the past) and alternative hypothesis (increase from what was expected in the past):

H0 : µ = 9
Ha : µ > 9

Now, we calculate the one-sample t-test statistic using the differences. The sample mean and sample standard deviation of the differences are: x̄D = 10.525 lbs and sD = 3.122 lbs. The matched-pairs t-test statistic is computed as follows:

$$t = \frac{\bar{x}_D - \mu_0}{s_D/\sqrt{n}}$$

$$t = \frac{10.525 - 9}{3.122/\sqrt{16}} \approx 1.95$$


From Figure 26.8 we see that p ≈ 0.035. Since p < 0.05, we reject the null hypothesis and accept the alternative that the mean weight gain in children from age 4 to age 6 is greater than 9 pounds.

(t-density curve with df = 15; the p-value, the area to the right of t = 1.95, is 0.03506.)

Figure 26.8. Calculating the p-value for matched-pairs t-test.
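The matched-pairs calculation can be sketched in Python starting from the raw weights in Table 26.1. The sketch treats the differences as a single sample, exactly as the text does. Instead of a p-value it compares the statistic with 1.753, the one-sided 0.05 critical value for df = 15 (the 90% confidence-level entry in a t-table); using a critical value here is our own simplification, and the variable names are illustrative.

```python
import math
import statistics

# Weights (lbs) from Table 26.1, in the same child order.
age4 = [37.1, 26.7, 36.1, 36.2, 40.3, 43.9, 36.2, 40.7,
        42.5, 34.8, 37.9, 34.5, 31.1, 36.4, 35.7, 33.4]
age6 = [46.9, 35.9, 48.3, 44.2, 50.6, 56.4, 51.8, 50.2,
        51.6, 41.4, 55.3, 44.5, 39.9, 46.3, 49.6, 39.0]

# One difference per child: weight at age 6 minus weight at age 4.
diffs = [after - before for before, after in zip(age4, age6)]

n = len(diffs)
mean_d = statistics.mean(diffs)
s_d = statistics.stdev(diffs)
mu_0 = 9.0                      # expected two-year gain under H0
t_stat = (mean_d - mu_0) / (s_d / math.sqrt(n))
t_crit = 1.753                  # one-sided 0.05 critical value, df = 15

print(f"mean difference = {mean_d:.3f}, s_D = {s_d:.3f}, t = {t_stat:.2f}")
print("reject H0 at the 0.05 level:", t_stat > t_crit)
```

Working from the differences is what makes this a one-sample procedure: the two weight columns are never analyzed separately.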

The last step in this analysis is to use a matched-pairs t-procedure to calculate a 95% confidence interval for µ, the mean weight gain from age four to age six. The matched-pairs t-confidence interval is computed as follows:

$$\bar{x}_D \pm t^* \left(\frac{s_D}{\sqrt{n}}\right)$$

$$10.525 \pm (2.131)\left(\frac{3.122}{\sqrt{16}}\right) = 10.525 \pm 1.663$$

Notice that the t-critical value, t*, depends only on the sample size and the confidence level and not on whether we are calculating a one-sample or a matched pairs t-confidence interval. Our confidence interval for µ is from 8.9 lbs to 12.2 lbs.

Look back at the results from our two matched-pairs t-procedures. We concluded from the t-test that µ, the mean weight gain from age 4 to age 6, was greater than 9 pounds. However, our confidence interval for µ was (8.9, 12.2), an interval that includes values that are below 9 pounds. Results from a two-sided confidence interval do not always match the results from a significance test involving a one-sided alternative.
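One way to see why the two results can disagree is to compare the 95% interval with a 90% interval, whose lower limit uses the same critical value as a one-sided test at the 0.05 level. The sketch below works from the summary statistics reported in the text; the critical values 2.131 and 1.753 are the df = 15 t-table entries for 95% and 90% confidence, and the function name t_interval is a hypothetical helper.

```python
import math

def t_interval(mean, s, n, t_star):
    """Two-sided t-confidence interval: mean +/- t* * s / sqrt(n)."""
    margin = t_star * s / math.sqrt(n)
    return mean - margin, mean + margin

n, mean_d, s_d = 16, 10.525, 3.122   # summary statistics for the weight differences

for label, t_star in (("95%", 2.131), ("90%", 1.753)):
    low, high = t_interval(mean_d, s_d, n, t_star)
    print(f"{label} interval: ({low:.2f}, {high:.2f})   contains 9? {low <= 9 <= high}")
```

In this sketch the 95% interval reaches below 9 pounds but the 90% interval does not, which is consistent with rejecting the null hypothesis in a one-sided test at the 0.05 level.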


Key Terms

Density curves for t-distributions are bell-shaped and centered at zero, similar to the standard normal density curve. Compared to the standard normal distribution, a t-distribution has more area under its tails. The shape of a t-distribution, and how closely it resembles the standard normal distribution, is controlled by a number called its degrees of freedom (df). A t-distribution with df > 30 is very close to a standard normal distribution.

A t-confidence interval for µ when σ is unknown is calculated from the formula:

$$\bar{x} \pm t^* \frac{s}{\sqrt{n}}$$

where t* is a t-critical value associated with the confidence level and determined from a t-distribution with df = n – 1.

A t-test statistic for testing H0 : µ = µ0 , where µ is the population mean, is given by:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

The t-test is a modification of the z-test and is used in situations where the population standard deviation σ is unknown and either the population has a normal distribution or the sample size n is large. The p-value is determined from a t-distribution with df = n –1.

A matched-pairs t-confidence interval for µD, the population mean difference, is given by the formula:

$$\bar{x}_D \pm t^* \left(\frac{s_D}{\sqrt{n}}\right)$$

where t* is a t-critical value associated with the confidence level and determined from a t-distribution with df = n – 1, and x̄D and sD are the mean and standard deviation of the sample differences.

A matched-pairs t-test statistic for testing H0 : µD = µD0 , where µD is the population mean difference, is given by

$$t = \frac{\bar{x}_D - \mu_{D0}}{s_D/\sqrt{n}}$$

where x̄D and sD are the mean and standard deviation of the sample differences.


The Video


1. Why won’t the z-procedure work in most cases, particularly if the sample size is small?

2. Who invented t-inference procedures?

3. Compare a normal density curve with a t-distribution for a sample size of 3. How are the two distributions similar and how do they differ?

4. For a t-distribution, how are the degrees of freedom related to sample size?

5. For a 95% confidence interval, which is larger, z* or t*?


Unit Activity: Step-by-Step

Pedometers count the number of steps a person walks. If you want your pedometer to calculate how far you have walked, you need to enter in your step length (distance from heel of one foot to heel of other foot when walking). In this activity, you will collect data on step length for males and females in the class. Assuming that students in your class are representative of the student population, you will calculate confidence intervals for the mean step length of males and females.

1. Discuss methods for getting reliable measurements for step length. After your group discussion, the class must decide on the method that will be used to collect the step-length data. Write a brief description of this method.

2. Collect the step-length data for males and females separately. After the data are collected, the two data sets (male step lengths, female step lengths) should be distributed to the class.

In answering the remaining questions, assume that the class data are representative of the general male and female student populations.

3. Check that the underlying assumption of normality is reasonably satisfied in both data sets.

4. a. Calculate the mean and standard deviation for the male step-length data.

b. Calculate the mean and standard deviation for the female step-length data.

5. a. Calculate a 95% confidence interval for the mean step length of males. What are the degrees of freedom of the t-critical value?

b. Calculate a 95% confidence interval for the mean step lengths of females. What are the degrees of freedom of the t-critical value?

6. Based on your confidence intervals in question 5, can you conclude that the mean step length for males is greater than for females? Explain.


Exercises

1. A woman in a nursing home is on medication for high blood pressure. Her blood pressure is taken daily. A sample of 20 blood pressure readings (mmHg) appears below.

150 148 136 120 142 144 130 150 130 142

140 130 148 142 138 130 120 166 130 152


b. Determine a 95% confidence interval for µ, her mean systolic blood pressure. Show your calculations.

c. Is the underlying assumption that these data come from a normal distribution reasonably satisfied? Explain.

2. A manufacturer of brass washers produces one type of washer that has a target mean thickness of 0.019 inches. After the production process had continued for some time without any adjustment, a random sample of 10 washers was selected and measured for thickness. The data are given below.

0.0185 0.0190 0.0180 0.0184 0.0179

0.0186 0.0188 0.0178 0.0182 0.0186

a. Does the assumption that washer thickness is normally distributed seem reasonable given these data? Explain.

b. Do you think the production process is still in control or do the data indicate that it is time to make some adjustments? To answer this question, test the hypothesis that the mean thickness equals 0.019 inches against the hypothesis that it does not. Report the value of the test statistic, the p-value and your conclusion.

c. Calculate a 95% confidence interval for µ, the mean thickness of washers currently being produced. Show your calculations. Does your interval indicate that the process needs to be adjusted to increase washer thickness or decrease washer thickness? Explain.


3. Students in a statistics class measured their foot lengths and forearm lengths. The data are given in Table 26.2. Assume this sample is representative of the student population.

Forearm Length (inches)      Foot Length (inches)
 8.75     10.00              10.00      9.00
 9.00      9.00               9.00     10.50
 8.50     10.00               9.50     11.00
10.25     10.00              10.00     11.50
10.25     11.50              12.50     11.25
 8.50      9.00              11.50      9.00
 9.25      8.50               9.00     10.50
10.50      6.75               9.25     10.50
 8.25     10.00               9.50      8.50
 9.00      8.25               8.25     10.00
 7.00      8.25               9.50      8.75
 9.50      9.00               9.50      8.75
 9.75      8.00               9.50     10.00

Table 26.2. Data on forearm and foot lengths.

a. Calculate a 95% confidence interval for µForearm , the mean forearm length of students. Then calculate a 95% confidence interval for µFoot , the mean foot length of students.

b. Do your 95% confidence intervals support the hypothesis that the mean forearm length of students differs from the mean foot length of students? Explain.

c. Calculate 90% confidence intervals for µForearm and µFoot . Compare the 95% confidence intervals to the 90% confidence intervals. Which of the two were wider? Explain why that was the case.

d. Answer part (b) based on the 90% confidence intervals calculated for (c).

4. A statistics professor was concerned that students were not as successful on the second exam (which was on inference) as they were on the first exam (which was on descriptive statistics). She took a random sample of 15 students enrolled in her introductory statistics courses over the past three years. Their grades on these two exams appear in Table 26.3.


Exam 1    Exam 2
67        59
74        82
85        96
71        62
78        69
96        83
63        52
91        94
93        82
81        67
84        66
64        66
88        84
89        75
82        90

Table 26.3. Statistics exam scores.

a. Compute the differences between Exam 2 and Exam 1. What is the sample mean for the differences? What is the sample standard deviation for the differences?

b. Return to the professor’s concern that students were not as successful on the material for Exam 2 as they were on the material for Exam 1. Let µD be the population mean difference between the two exam scores. Set up null and alternative hypotheses to test the professor’s supposition.

c. Conduct a significance test. Report the value of the test statistic and the p-value.

d. Do the results of your test of hypotheses support the professor’s concern? Explain.


Review Questions

1. Given a simple random sample of size n, you want to compute a confidence interval for µ of the form: x̄ ± t*(s/√n). Find the value of t* for each of the following confidence levels and sample sizes.

a. A 95% confidence interval for a sample size n = 12.

b. A 99% confidence interval for a sample size n = 15.

c. An 80% confidence interval for a sample size n = 10.

2. Supermarket rotisserie chickens have become very popular with the American public. A study was conducted to compare the nutrient composition of commercially-prepared rotisserie chicken to that of roasted chicken, which is listed in the USDA National Nutrient Database for Standard Reference (SR).

a. The Standard Reference (SR) listed the mean protein content of roasted chicken breast as 31 grams. In the sample of 9 rotisserie chickens, x̄ = 29.86 grams and s = 1.95 grams. Conduct a t-test to see if the mean protein content in rotisserie chicken breasts differs from the SR. Report the value of the test statistic, the p-value, and your conclusion.

b. The SR listed the mean cholesterol level in roasted chicken thighs as 95 milligrams. In a sample of 9 rotisserie chicken thighs, x̄ = 134 milligrams and s = 2.43 milligrams. Conduct a t-test to see if the mean cholesterol level in rotisserie chicken thighs differs from the SR. Report the value of the test statistic, the p-value, and your conclusion.

3. A researcher studying ocean literacy focused her efforts on the program Ocean Commotion – a one-day ocean/wetlands literacy program that includes hands-on demonstrations about marine environments and products. Prior to attending the program, a sample of 337 students from 6 schools were given a pre-test to measure their attitudes toward the ocean and wetlands. After the program, students were given a post-test. The test was graded on a scale from 1 (lowest) to 5 (highest). The mean score on the pre-test was 4.06. The mean score on the post-test was 4.13. The standard deviation of the pre-post differences was 0.40.


a. The researcher wanted to test whether attendance at Ocean Commotion had a significant positive effect on students’ attitudes toward the ocean and wetlands. Let µD be the mean difference in attitude after the program compared to before the program. Write the researcher’s null and alternative hypotheses.

b. Calculate the value of the t-test statistic. What are the degrees of freedom associated with the t-test statistic?

c. What is the p-value?

d. The researcher concluded that Ocean Commotion had a significant effect on students’ attitudes toward the ocean and wetlands. Given your answer to (c), do you agree with her conclusion? Do you think that the difference is of practical importance? Explain.

4. A psychology class wanted to find out whether attendance at a workshop on happiness can teach people how to be happier. The class arranged to attend a “happiness” workshop. A week prior to going they completed the Oxford Happiness Questionnaire. The questionnaire, developed by psychologists Michael Argyle and Peter Hills, is scored on a scale from 1 (not happy) to 6 (too happy). Students completed the Oxford Happiness Questionnaire a second time two days after the workshop and a third time six weeks after the workshop. Table 26.4 displays their data.


First Questionnaire    Second Questionnaire    Third Questionnaire
2.6                    3.1                     2.2
4.9                    5.0                     2.5
6.0                    4.8                     3.9
3.3                    3.5                     4.5
4.1                    4.2                     2.9
3.8                    4.0                     3.1
5.2                    4.6                     4.8
4.3                    4.7                     5.1
3.1                    5.3                     3.9
5.5                    5.2                     3.8
4.2                    5.4                     4.9
2.0                    3.2                     2.8
4.3                    5.1                     2.9
3.6                    4.6                     3.2
3.1                    3.8                     2.9
5.2                    4.7                     3.4

Table 26.4. Results from Oxford Happiness Questionnaire.

a. Students thought that the workshop would have a positive effect on happiness, at least short term. Use the data from the first and second questionnaires to test their hypothesis. State the null and alternative hypotheses, calculate the value of the t-test statistic, determine the p-value and give your conclusion.

b. Use the data from the first and third questionnaires to test whether there is any long-term positive effect on students from this type of happiness workshop. State the null and alternative hypotheses, the value of the test statistic, the p-value, and your conclusion.

c. Let µ(Third – First) be the population mean difference of Oxford Happiness Questionnaire scores taken before and six weeks after participation in a happiness workshop. To estimate the long-term effect of the workshop on students’ happiness, calculate a 95% confidence interval for µ(Third – First). Interpret what it tells you about the change in happiness scores, from before the workshop to six weeks after it, for students who participate in happiness workshops.

d. The sample for this study consisted of students from one psychology class. Do you think the results are valid for all college students from this university? Explain.

Unit 27: Comparing Two Means


Summary of Video

It’s an age-old battle of the sexes. Are men or women worse drivers? Whatever your opinion on this question, a statistician needs evidence in order to make a decision. One way to analyze this question would be to see which gender, on average, gets more moving violations. We could take a sample from all licensed drivers in one state, and then look at the number of tickets each person received in one year. We could then calculate the mean number of tickets received by members of each gender and compare the two numbers to see which group had the worst driving record.

Comparing two populations is an important topic in statistics because it occurs quite frequently. Moreover, we can use inference to move beyond just looking at two sample means, as was suggested in our driving example. We can go on to figure out whether the difference between two groups is statistically significant; and if it is, we can calculate a confidence interval for the difference of population means.

That’s what researchers did when they decided to investigate the difference in the amount of calories necessary to power daily life in two groups of people with very different lifestyles. Herman Pontzer is an anthropologist who is interested in how energy is used by primate species, particularly human beings. Pontzer teamed up with other researchers to work with the Hadza in Tanzania, a group of traditional hunter-gatherers who live in a way very similar to our ancestors. Men hunt with bows and arrows and women forage for plant foods and dig for root vegetables. The Hadza are a lot more active and cover a lot more ground than their Western counterparts. Everyone had always assumed that this physically-demanding hunter-forager lifestyle would require much more energy than the relatively inactive daily life of a Western office worker. In fact, one suspected cause of the obesity epidemic in the West is our more sedentary modern lifestyle. But the Hadza’s actual energy expenditure had never yet been tested.

Was the assumption correct that the Hadza used more calories throughout their day? Pontzer and his team already had data on how many calories typical Americans and Europeans burned in their daily lives. Now they needed to measure how many calories it took to power the Hadza through their daily tasks. They used a technique that relies on the subjects drinking something called doubly-labeled water. For this technique, a person drinks some water where the hydrogen and oxygen have been enriched with rare isotopes. From urine samples taken over a two-week period, researchers can measure with the use of spectroscopy how much of those rare isotopes are in their urine samples. As the concentration of the special traceable hydrogen and oxygen isotopes in the urine goes down over time, Pontzer can figure out how much carbon dioxide the subject has exhaled over the course of the study. When a body burns calories, a byproduct is carbon dioxide, so the amount of carbon dioxide exhaled told the researchers how much energy the Hadza were expending. In addition, the researchers recorded the physical activity of the Hadza by having them wear heart rate monitors and GPS units.

The Hadza are typically smaller and lighter than their Western counterparts. That difference required Pontzer and his colleagues to use sophisticated statistical techniques in their analyses to control for the effects of body size, age, and sex. To keep things simple, so that we can follow their comparison, we’ll look just at women with comparable body sizes from the Hadza and Western groups. We want to use our sample to determine whether there is a significant difference between the means of the Hadza and Western populations.

First, the scientists calculated the mean total energy expenditure (TEE), which was measured in calories, for each group. The sample means, standard deviations, and sample sizes for each group are as follows:

Hadza: x̄1 = 1,877 calories, s1 = 364 calories, n1 = 17

Westerners: x̄2 = 1,975 calories, s2 = 286 calories, n2 = 26

Is the difference between these sample means significant? Or, could the difference we see be due simply to chance variation? We can set up a significance test to figure this out. Below are the null and alternative hypotheses concerning the total energy expenditure.


H0 : Mean Hadza TEE = Mean Western TEE (µ1 = µ2)

Ha : Mean Hadza TEE ≠ Mean Western TEE (µ1 ≠ µ2)

The two-sample t-test statistic is:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

Now we can substitute the numbers into the formula. We have the sample means, standard deviations, and sample sizes. For the value of $\mu_1 - \mu_2$, we use the value from the null hypothesis, which states these two means are equal, and hence, $\mu_1 - \mu_2 = 0$.

$$t = \frac{(1{,}877 - 1{,}975) - 0}{\sqrt{\dfrac{(364)^2}{17} + \dfrac{(286)^2}{26}}} \approx -0.94$$

Like all of the z- or t-test statistics that we have encountered, this one tells us how far $\bar{x}_1 - \bar{x}_2$ is from 0, the hypothesized difference in means, in standard units.

Software can figure out the degrees of freedom for the t-test statistic, or we can just go with a very conservative approach that uses the smaller sample size minus one, which gives us 16 degrees of freedom. We can look up the corresponding p-value in a t-table or use technology; either way, we get p = 0.3612. That means that assuming the null hypothesis is true, we have a 36% chance of seeing a t-value as or more extreme than the one we calculated. A 36% chance is pretty likely, so we have insufficient evidence to reject the null hypothesis. We conclude that there is no significant difference between total energy expenditure of Hadza women and Western women.
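The calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`two_sample_t`, `t_density`, `t_upper_tail`) are made up for this example, and the p-value is approximated by numerically integrating the t-density with Simpson's rule, which stands in for the t-table or statistical software mentioned in the text.

```python
import math

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2, null_diff=0.0):
    """Two-sample t statistic computed from summary statistics."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return ((mean1 - mean2) - null_diff) / standard_error

def t_density(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= |t|) using Simpson's rule on [0, |t|] (steps must be even)."""
    t = abs(t)
    h = t / steps
    total = t_density(0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    # Half the area lies to the right of 0; subtract the part between 0 and |t|.
    return 0.5 - total * h / 3

# Summary statistics from the Hadza and Western samples.
t_stat = two_sample_t(1877, 364, 17, 1975, 286, 26)
df_conservative = min(17, 26) - 1          # smaller sample size minus one
p_two_sided = 2 * t_upper_tail(t_stat, df_conservative)

print(round(t_stat, 2), df_conservative, round(p_two_sided, 2))
```

Because the code carries the unrounded t-value forward, its p-value may differ slightly in the third decimal place from one based on the rounded value of −0.94.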

This, in fact, is what the researchers concluded. After controlling for body size, age, and sex, the scientists did not find any statistical difference when they compared the mean daily energy expenditure of the Hadza and the Westerners. This result seemed counterintuitive, since they knew the Hadza were much more active. The researchers suspect that the Hadza’s bodies are allocating a smaller percentage of those daily calories to run-of-the-mill cellular function and more to physical activity. Researchers think that it is a difference in energy allocation, not energy efficiency.


Today’s obesity epidemic tells us something is out of balance between the amount of calories that we take in and the amount we burn off. Based on this study and others, metabolism seems to hold quite constant among different populations of people with varying activity levels. Because of this finding, Pontzer and his colleagues place the blame for rising societal levels of obesity more on people eating too much than on our modern lifestyle.



A. Understand when to use two-sample t-procedures.

B. Know how to check whether the underlying assumptions for a two-sample t-procedure are reasonably satisfied.

C. Be able to calculate a confidence interval for the difference of two population means.

D. Be able to test hypotheses about the difference between two population means. Be able to calculate the t-test statistic and use technology to determine a p-value.


Content Overview

Consider the following questions: Do men earn more than women? Are women smarter than men? Do students from private schools do better on their SATs than students from public schools? Which relieves headaches more quickly, Tylenol or Advil? Each of these questions seeks to compare two populations or two treatments – a commonly encountered situation in statistical practice.

We begin with a question related to the activity in Unit 26: Are the step lengths of 10th-grade male students longer, on average, than the step lengths of 10th-grade female students? In this case, the comparison is between two populations, 10th-grade males and 10th-grade females. Let $\mu_1$ and $\sigma_1$ be the mean and standard deviation, respectively, of step lengths for the population of 10th-grade males. Let $\mu_2$ and $\sigma_2$ be the mean and standard deviation, respectively, for the 10th-grade females. If there is no difference between the mean step lengths of male students and female students, then $\mu_1 - \mu_2 = 0$. However, if males, on average, take longer steps than females, then $\mu_1 - \mu_2 > 0$. We can state the null hypothesis and alternative hypothesis as follows:

H0 : µ1 − µ2 = 0

Ha : µ1 − µ2 > 0

Suppose we randomly select two samples, one of size $n_1$ from the male students and another of size $n_2$ from the female students. After collecting the data, we can calculate the sample means, $\bar{x}_1$ from the males and $\bar{x}_2$ from the females, and sample standard deviations, $s_1$ from the males and $s_2$ from the females. It seems reasonable to use the difference in sample means, $\bar{x}_1 - \bar{x}_2$, to estimate the difference in population means, $\mu_1 - \mu_2$. If the two populations are normally distributed or if the sample sizes are large, then $\bar{x}_1 - \bar{x}_2$ has a normal distribution (at least approximately) with the following mean and standard deviation:

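$$\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2 \quad \text{and} \quad \sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$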
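One way to see where this mean and standard deviation come from is a small simulation. The sketch below repeatedly draws two samples from normal populations and compares the spread of the simulated differences in sample means with the formula. The population values are hypothetical and were chosen only for illustration; they are not data from the step-length study.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population values, chosen only for illustration.
mu1, sigma1, n1 = 64.0, 8.0, 12
mu2, sigma2, n2 = 60.0, 8.0, 15

def sample_mean(mu, sigma, n):
    """Mean of one random sample of size n from a normal population."""
    return statistics.fmean(random.gauss(mu, sigma) for _ in range(n))

# Difference in sample means for many pairs of independent samples.
differences = [
    sample_mean(mu1, sigma1, n1) - sample_mean(mu2, sigma2, n2)
    for _ in range(20000)
]

simulated_mean = statistics.fmean(differences)
simulated_sd = statistics.stdev(differences)
formula_sd = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)

print("simulated mean:", simulated_mean, "population difference:", mu1 - mu2)
print("simulated SD:  ", simulated_sd, "formula SD:", formula_sd)
```

With this many repetitions, the simulated mean and standard deviation should land close to $\mu_1 - \mu_2$ and to the value given by the formula.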

Transforming $\bar{x}_1 - \bar{x}_2$ into a z-score, we get:


$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

The two-sample z-test statistic has the standard normal distribution. At this point, if we knew the population standard deviations, we could use a z-procedure to test our hypotheses. Unfortunately, $\sigma_1$ and $\sigma_2$ are unknown, so we will need to use the sample standard deviations, $s_1$ and $s_2$, as estimates. Substituting $s_1$ and $s_2$ in place of $\sigma_1$ and $\sigma_2$ gives us the two-sample t-test statistic:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

The two-sample t-test statistic has an approximate t-distribution. The degrees of freedom (df) are a bit complicated to figure out. We can either use software or adopt a conservative approach and set the degrees of freedom to be one less than the smaller of the two sample sizes.

Now, we return to our hypotheses about step lengths of male and female students. Sample data were collected and are summarized below:

Males: $n_1 = 12$, $\bar{x}_1 = 64.08$ cm, $s_1 = 7.71$ cm

Females: $n_2 = 15$, $\bar{x}_2 = 60.34$ cm, $s_2 = 7.74$ cm

Normal quantile plots of the male and female step-length data indicate that it is reasonable to assume that step lengths are approximately normally distributed. Now we are ready to compute the t-test statistic. Using the null hypothesis value of 0 for µ1 − µ2 and our sample means and standard deviations, we get:

$$t = \frac{(64.08 - 60.34) - 0}{\sqrt{\dfrac{(7.71)^2}{12} + \dfrac{(7.74)^2}{15}}} \approx 1.25$$


Taking the conservative approach, we assume df = 12 – 1, or 11. Since the alternative is one-sided, we use technology to determine the area under the t-density curve that lies to the right of 1.25, our observed value of the test statistic. As shown in Figure 27.1, this gives p = 0.1186.

Figure 27.1. Density curve for t-distribution with df = 11.

Since p > 0.05, there is insufficient evidence to reject the null hypothesis. We cannot conclude that 10th-grade male students have, on average, longer step lengths than 10th-grade female students.

The next example involves a study that compares two teaching strategies for nursing students – lecture notes combined with structured group discussions versus a traditional lecture format. Two groups of students taking a medical-surgical nursing course were taught, one with each strategy. Exam scores were used to compare the effectiveness of the two teaching strategies. Let $\mu_1$ be the mean exam score for students enrolled in the lecture notes/group discussion version of the course; let $\mu_2$ be the mean exam score for students enrolled in the lecture-only version of the course. We set up the following null and alternative hypotheses:

H0 : µ1 − µ2 = 0

Ha : µ1 − µ2 ≠ 0

Exam scores of two groups of students taught by each of these methods were collected with the following results:

Lecture notes/group discussion: $n_1 = 81$, $\bar{x}_1 = 80.6$, $s_1 = 7.34$

Lecture format (control): $n_2 = 88$, $\bar{x}_2 = 77.68$, $s_2 = 7.23$

Since the sample sizes are large, we can conduct a t-test to decide between the null and alternative hypotheses without first checking that the data come from normal distributions.



Here are the calculations:

$$t = \frac{(80.60 - 77.68) - 0}{\sqrt{\dfrac{7.34^2}{81} + \dfrac{7.23^2}{88}}} \approx 2.60$$

Adopting the conservative approach, we use df = 81 – 1, or 80, to determine the p-value. Based on Figure 27.2, we get a value of p = 2(0.00555) ≈ 0.011. (Note: In this situation, we have a two-sided alternative, so we double the area in one tail.)

Figure 27.2. Calculating the p-value for a two-sided alternative.

Since p < 0.05, we reject the null hypothesis and accept the alternative hypothesis that the mean exam scores for the two teaching methods differ. To estimate that difference, we calculate a two-sample t-confidence interval for $\mu_1 - \mu_2$ using the following formula:

$$(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Adopting the conservative approach, we set df = 80 and determine a t-critical value for a 95% confidence level: $t^* = 1.990$. Now we are ready to calculate a 95% confidence interval for $\mu_1 - \mu_2$:

$$(80.60 - 77.68) \pm (1.990)\sqrt{\frac{7.34^2}{81} + \frac{7.23^2}{88}} \approx 2.92 \pm 2.23$$

or (0.69, 5.15).

Hence, we are 95% confident that the mean exam score for the lecture notes/group discussion teaching strategy is between 0.69 and 5.15 points higher than the mean exam score for the traditional lecture teaching strategy.
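The interval arithmetic can be checked with a short Python sketch. The function name `two_sample_t_interval` is hypothetical, and the critical value $t^* = 1.990$ is simply taken from the text rather than computed.

```python
import math

def two_sample_t_interval(mean1, sd1, n1, mean2, sd2, n2, t_star):
    """Confidence interval for mu1 - mu2 from summary statistics."""
    difference = mean1 - mean2
    margin = t_star * math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return difference - margin, difference + margin

# Nursing-course exam scores; t* = 1.990 is the 95% critical value for df = 80.
lower, upper = two_sample_t_interval(80.60, 7.34, 81, 77.68, 7.23, 88, t_star=1.990)
print(round(lower, 2), round(upper, 2))  # 0.69 5.15
```

Because the whole interval lies above 0, it agrees with the two-sided test, which rejected a difference of 0 at the 0.05 level.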



We conclude this section with one final comment related to checking the underlying assumptions for the two-sample t-procedures. In the development of the two-sample t-procedures for cases where the sample sizes are small, we assume that the population distributions are normal. As it turns out, if the sample sizes are reasonably close and the population distributions are similar in shape, without major outliers, the probabilities from the t-distribution are quite accurate even if the population distributions are not normal.


Key Terms

Two-sample t-procedures are used to test or estimate $\mu_1 - \mu_2$, the difference in the means of two populations (or treatments). The required data consist of two independent simple random samples of sizes $n_1$ and $n_2$, one from each of the populations (or treatments).

The two-sample t-test statistic for testing the difference in population means is:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where the value for $\mu_1 - \mu_2$ is taken from the null hypothesis. There are two options for finding the degrees of freedom (df) associated with t: (1) use technology or (2) use a conservative approach and let df = smaller of $n_1 - 1$ or $n_2 - 1$.

The two-sample t-confidence interval for µ1 − µ2 is computed as follows:

$$(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

The degrees of freedom used to find t*, the t-critical value associated with the confidence level, are determined by the approach outlined for the two-sample t-test statistic.


The Video


1. How might a statistician gather evidence to answer the following question: Are men or women worse drivers?

2. What was different about the lifestyle of the Hadza compared to typical Europeans or Americans?

3. What was Pontzer’s original assumption about the daily energy expenditure (in calories) of the Hadza compared to the Westerners?

4. What type of test was used in the video to test this assumption for Hadza and Western women of similar body size?

5. Was the assumption in question 3 correct? Explain.

6. On what did Pontzer and his colleagues place the blame for rising societal levels of obesity?


Unit Activity: Chips Ahoy, Regular and Reduced Fat

Nabisco’s Chips Ahoy is a popular brand of chocolate chip cookies. Nabisco makes both a regular and, for those who want to restrict their fat intake, a reduced fat version of chocolate chip cookies. The question for this activity is to find out whether the mean number of chips per cookie is the same for Chips Ahoy reduced fat chocolate chip cookies as it is for Chips Ahoy regular chocolate chip cookies.

1. If needed, collect the data on the number of chips per cookie in regular and reduced fat Chips Ahoy cookies. Your instructor will provide directions. (You may already have collected the data as part of Unit 25’s activity.)

2. a. Do you think the mean number of chips per cookie is the same for both Chips Ahoy regular and Chips Ahoy reduced fat chocolate chip cookies? If not, which type, regular or reduced fat, do you think has, on average, more chips per cookie? Explain.

b. Set up null and alternative hypotheses for testing whether the mean number of chips per cookie is the same for both the regular and the reduced fat version of Chips Ahoy chocolate chip cookies. Be sure to define any symbols that you use in your hypotheses. (Did you choose a one-sided or two-sided alternative?)

3. Report the sample size, mean and standard deviation for the regular chocolate chip cookie data. Then do the same for the reduced fat chocolate chip cookie data.

4. Make comparative graphic displays of the chip count data for the regular and reduced fat cookies. Based on your plots, do the chip counts for the two types of cookies appear to differ?


5. a. Compute the two-sample t-test statistic. Show your calculations.

b. Determine a p-value for your test statistic in (a).

c. Is there a significant difference in the mean number of chips per cookie in regular and reduced fat Chips Ahoy chocolate chip cookies? Explain.

6. Calculate a 95% confidence interval for the difference between the mean number of chips per cookie in Chips Ahoy regular and Chips Ahoy reduced fat chocolate chip cookies.


Exercises

1. A study published in the Journal of Business and Psychology investigated whether being pregnant had an adverse effect on women’s job performance appraisal ratings. Two groups of female employees at a large financial institution were subjects in this study, a pregnancy group and a control group. The first group consisted of 71 pregnant women. For each subject in the pregnancy group, a control group subject was randomly selected from the non-pregnant female employees with the same job title. The performance appraisal ratings were on a scale from 1 (outstanding performance) to 5 (unsatisfactory performance). The sample sizes, sample means and standard deviations for the two groups are given below:

Pregnancy group: $n_1 = 71$, $\bar{x}_1 = 2.38$, $s_1 = 1.10$

Control group: $n_2 = 71$, $\bar{x}_2 = 2.69$, $s_2 = 0.58$

The researchers hypothesized that pregnant employees would be rated differently when compared with the control group.

a. Set up a null hypothesis and an alternative hypothesis to test whether the population mean performance ratings differed for the two groups of female employees.

b. Calculate the t-test statistic and determine a p-value. State your conclusion.

c. Calculate a 95% t-confidence interval for the difference in mean performance appraisal ratings for pregnant employees and non-pregnant female employees. On average, are the performance ratings for pregnant women better or worse than for the non-pregnant female employees? Explain.

2. Return to the study discussed in question 1. The same group of researchers also gathered data on the pregnant group’s performance appraisal ratings during pregnancy and after returning from pregnancy leave. Here is a summary of the data gathered:

During pregnancy: $\bar{x}_1 = 2.38$, $s_1 = 1.10$, $n_1 = 71$

After pregnancy: $\bar{x}_2 = 2.65$, $s_2 = 1.64$, $n_2 = 71$

Difference “During – After”: $\bar{x}_d = -0.27$, $s_d = 2.00$, $n_d = 71$

a. The researchers were interested in answering the following question:

Did the mean performance ratings for the pregnancy group differ significantly between the During Pregnancy and After Pregnancy time periods?

Which test is more appropriate to answer this question, a one-sample t-test or a two-sample t-test? Explain.

b. Use the appropriate test to answer the question posed in (a). Report the value of the test statistic, the p-value, and your conclusion.

3. A state university is concerned that female students are not as well prepared in mathematics as their male counterparts. Random samples of 20 male students and 20 female students were selected from the class of first-year students. Their SAT Math scores are given below.

SAT Math Scores of Female Students

530 450 550 470 450 500 480 510 470 450

600 540 530 470 420 490 440 540 500 480

SAT Math Scores of Male Students

670 440 410 510 410 600 530 490 600 530

570 550 640 530 550 460 660 570 670 490

a. Make graphic displays to compare the SAT Math scores of the female students and the male students. Do your plots provide evidence that male students entering the university have higher SAT Math scores than female students?

b. Is it reasonable to assume that the distributions of SAT Math scores for both populations, first-year male students and first-year female students, are approximately normally distributed? Support your answer.

c. Calculate the sample means and standard deviations of the SAT Math scores for the female students and for the male students.


d. Conduct a test of hypotheses to see if the mean SAT Math scores are significantly higher for males than for females. Report the value of the two-sample t-test statistic, the p-value and your conclusion.

4. A group of 4-year-olds, who were part of the Infant Growth Study, participated in a laboratory meal. Data collected during this meal can be used to answer the following research question: Does the mean number of calories consumed by girls at a meal differ from the mean number of calories consumed by boys? Below is a summary of the results:

Girls: $\bar{x}_G = 494$ calories, $s_G = 172$ calories, $n_G = 31$

Boys: $\bar{x}_B = 409$ calories, $s_B = 148$ calories, $n_B = 27$

a. Write the null and alternative hypotheses.

b. Calculate the value of the two-sample t-test statistic. (Round to three decimals.)

c. Adopt a conservative approach and set the degrees of freedom to be one less than the smaller of the two sample sizes. Calculate the p-value. Are the results significant at the 0.05 level? Explain.

d. For two-sample t-tests, the Content Overview of this unit suggested using a conservative approach for determining the degrees of freedom associated with the test statistic: set df = smaller of $n_1 - 1$ or $n_2 - 1$, where $n_1$ and $n_2$ are the sample sizes of the two groups. However, statistical software calculates degrees of freedom using the following formula:

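$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{s_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{s_2^2}{n_2}\right)^2}$$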

In general, this formula does not result in an integer value. In that case, the degrees of freedom are rounded down to the closest integer below the calculated value.

Use the formula given above to calculate the degrees of freedom. (Be sure to round down to the closest integer.) Calculate the p-value based on the df you calculated from the formula. Based on this p-value, are the results significant at the 0.05 level?
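To show how the software formula behaves, here is a minimal Python sketch applied to the step-length summary statistics from the Content Overview (not to the data in this exercise). The function name `welch_df` is a label chosen for this example; the rounding down follows the instruction given above.

```python
import math

def welch_df(sd1, n1, sd2, n2):
    """Degrees of freedom from the software formula for a two-sample t-test."""
    v1 = sd1**2 / n1   # s1^2 / n1
    v2 = sd2**2 / n2   # s2^2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Step-length data: males n = 12, s = 7.71 cm; females n = 15, s = 7.74 cm.
df_formula = welch_df(7.71, 12, 7.74, 15)
df_rounded = math.floor(df_formula)      # round down to an integer
df_conservative = min(12, 15) - 1        # smaller sample size minus one

print(df_formula, df_rounded, df_conservative)
```

For these data the formula gives noticeably more degrees of freedom than the conservative rule, which is why the conservative approach tends to produce somewhat larger p-values.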


Review Questions

1. The financial aid office of a university asks a sample of students about their employment and earnings. The report says that “for academic year earnings, a significant difference (p = 0.038) was found between the sexes, with men earning more on the average. No difference (p = 0.476) was found between the earnings of African-American and white students.” Explain both of these conclusions, for the effects of sex and of race on mean earnings, in language understandable to someone who knows no statistics.

2. A study was conducted to investigate the effect of different levels of air pollution on the pulmonary functions of healthy, non-smoking, young men. Two geographical areas with different levels of air pollution were selected – Area 1 had lower levels of pollutants than Area 2. Samples of 60 men were selected from each area. The two groups of men had no significant differences in age, height, weight, and BMI. Data on two measures of pulmonary function for each group are provided below:

Forced vital capacity (FVC, in Liters)

Area 1: $\bar{x}_1 = 4.49$, $s_1 = 0.43$

Area 2: $\bar{x}_2 = 4.32$, $s_2 = 0.45$

Respiratory rate (RR, per minute)

Area 1: $\bar{y}_1 = 17.17$, $s_1 = 4.26$

Area 2: $\bar{y}_2 = 16.28$, $s_2 = 2.39$

a. Why do you think the researchers tested to see if there were significant differences between the age, height, weight, and BMI for the two samples?

b. Test whether there is a significant difference between the mean FVC for the participants from Area 1 and Area 2. State the value of the test statistic, the p-value and your conclusion.

c. Test whether there is a significant difference between the mean RR for the two areas. State the value of the test statistic, the p-value and your conclusion.

d. If you find a significant difference in (b) or (c) or both, construct a 95% confidence interval to estimate the difference in means between the two areas.


3. A state university is concerned that there is a difference in the writing abilities of its male and female students. To test this assertion, the university took a random sample of 60 of its first-year students and recorded their genders and SAT Writing scores. The data appear below.

SAT Writing Scores of Female Students

480 540 620 590 530 620 580 530 530 560 510 560 560

550 520 480 560 510 500 540 490 430 610 620 510

SAT Writing Scores of Male Students

480 560 400 580 480 460 430 430 490 610 540 500 540

400 530 640 350 470 600 610 530 580 430 510 520 380

540 460 640 520 570 560 490 440 480

a. Make comparative boxplots for the SAT Writing scores for female and male students. Based on your boxplots, is it reasonable to assume that SAT Writing scores are approximately normally distributed for each gender? Does one gender tend to have higher SAT Writing scores than the other?

b. Summarize the data by reporting the sample sizes, sample means and standard deviations for both groups.

c. Test to see if there is a significant difference in mean SAT Writing scores between female and male first-year students attending this university. Report the value of the test statistic, the p-value, and your conclusion.

d. Compute a 95% confidence interval for the difference in mean SAT Writing scores for female and male students attending this university. Interpret your results.

4. Do 4-year-old boys eat, on average, more mouthfuls of food at a meal than 4-year-old girls? A group of 4-year-olds, who were part of the Infant Growth Study, participated in a laboratory meal.


Data on mouthfuls of food consumed during the laboratory meal were collected. Here is a summary of the results:

Girls: $\bar{x}_G = 80.8$ mouthfuls, $s_G = 41.4$ mouthfuls, $n_G = 31$

Boys: $\bar{x}_B = 92.3$ mouthfuls, $s_B = 42.0$ mouthfuls, $n_B = 27$

a. State the null and alternative hypotheses.

b. Determine the value of the two-sample t-test statistic and the p-value. Report your conclusion.

Unit 28: Inference for Proportions | Student Guide | Page 1

Summary of Video

It is nearly impossible to collect data about an entire population. Take, for example, all the salmon in one watershed. We can’t count the number of eggs laid by every single spawning salmon. But we can count the eggs laid by a sample of some of these salmon. Then, using statistical inference, we can use the mean number of eggs in our sample to draw conclusions about the egg-laying population as a whole. As part of the inference procedure, we use probability to indicate the reliability of our results.

We can also use statistical inference to estimate a population proportion. For instance, suppose we wanted to know how many of the eggs laid by the salmon were fertilized. We could investigate the fertilization rate in our sample to get a sample proportion or sample percentage. Then we could use the sample proportion as an estimate of the unknown population proportion. But how good of an estimate is it? This will be the topic of this video – using information from samples to make inferences about population proportions.

Let’s turn our attention to a completely different context: the workplace. Employers think about how to motivate their employees to do their best, most creative work. Psychologist Teresa Amabile has studied creativity for years. One of Amabile’s discoveries from her earlier research is that creativity fluctuates, even for a given individual, as a function of the kind of work environment the individual is in. Building on that foundation, Amabile designed a study around the question of worker motivation. She recruited 238 people with creative jobs who were willing to keep track of their activities, emotions, and motivation levels every workday. Their electronic diaries had two components. One consisted of participants rating their motivation, emotions, and other subjective factors on a seven-point scale. The second component was an open-ended question where participants were asked to describe one event that stood out that day. It could be anything, as long as it was relevant to the work or the project. After several years, Amabile had nearly 12,000 diary entries. These entries validated her earlier findings that people were able to solve problems creatively and come up with new ideas on days they felt most motivated and excited about their work. So, the next question to ask was: What led to high levels of motivation?



Dipping into the diaries, Amabile was able to see that one factor, more than anything else, made people feel they were having a great day at work. That was simply making progress in meaningful work, even if the progress looked incremental. She called this the Progress Principle. It turned out that 76% of participants’ best days had a progress event; whereas only 25% of their worst days had a progress event. Progress was paramount for people to feel positive and highly motivated – much more than other things like support from management and coworkers, feelings of doing important work, or collaboration, as can be seen from Figure 28.1.

Figure 28.1. Type of event recorded on workers’ best and worst days.

Amabile and her coauthor decided to survey managers to see whether they were aware of how important this feeling of progress was in motivating workers. She asked them to rate five different items in order of how much they felt they affected workers’ motivation. If the managers just randomly chose one of the five options to rank as most important, we would expect 20% of them to pick progress. So, we let p be the proportion of all managers who would pick progress as the most important of the five items for motivating workers. Now, we can set up a test of hypothesis for the population proportion, p:

$H_0: p = 0.20$

$H_a: p \neq 0.20$

As it turned out, only 35 out of 669 managers selected progress as the top motivational factor. That gives a sample proportion of just 0.0523, or a mere 5.23%. This seems pretty low compared to the 20% proportion from our null hypothesis. But is it low enough to reject the null hypothesis? To find out, we can turn to a z-test statistic:


$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}},$$

where $\hat{p}$ (pronounced p-hat) is our sample proportion, $p_0$ is the null hypothesis proportion, and n is the sample size. Substituting our sample proportion and sample size, we get:

$$z = \frac{0.0523 - 0.20}{\sqrt{\dfrac{(0.20)(1 - 0.20)}{669}}} \approx -9.55$$

That is a pretty extreme z-test statistic. Under a standard normal distribution, a value 9.55 standard deviations from the mean is extremely unlikely. As can be seen from Figure 28.2, the area under the curve that far out is not really visible! In fact, the p-value is 0.000 to three decimal places. So, we have our answer: reject the null hypothesis in favor of the alternative. The data give strong evidence that the population proportion of all managers who would select “Support for Making Progress” as the most important motivator is not 20%.
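The z-test above can be reproduced with Python's standard library. This is a sketch, not the software used in the video: the function names are illustrative, and the two-sided p-value uses the identity $P(|Z| \ge z) = \operatorname{erfc}(z/\sqrt{2})$ for a standard normal variable.

```python
import math

def one_proportion_z(successes, n, p0):
    """z-test statistic for H0: p = p0, using the null value in the standard error."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / standard_error

def two_sided_normal_p(z):
    """P(|Z| >= |z|) for a standard normal random variable Z."""
    return math.erfc(abs(z) / math.sqrt(2))

# 35 of 669 managers ranked progress first; null hypothesis p = 0.20.
z = one_proportion_z(35, 669, 0.20)
p_value = two_sided_normal_p(z)

print(round(z, 2))   # -9.55
print(p_value)       # far smaller than 0.001, so it displays as 0.000
```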

Figure 28.2. Determining a p-value from a standard normal density curve.

Now that we have rejected the null hypothesis, let’s calculate a confidence interval for the true population proportion. We know that the sample proportion of managers who selected progress was 0.0523, but we don’t know how close that is to the true population proportion. Just like in the confidence intervals for one mean, we can figure out a standard error to go with our point estimate. Here’s the formula for the confidence interval:

$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$


Suppose we decide that we want a 95% confidence interval. Then our value for z* is 1.96 just as it was for z-confidence intervals for a population mean.

Next, we use our sample information to calculate the 95% confidence interval for the population proportion, p:

$$0.0523 \pm 1.96\sqrt{\frac{0.0523(1 - 0.0523)}{669}}$$

$0.0523 \pm 0.0169$, or 5.23% ± 1.69%

So, our estimate is that only between 3.5% and 6.9% of managers in the overall population would rate progress as the number one motivational factor. A good question to ask is how could managers be so unaware of what really counted to their employees? What managers have said in response to that question is that it is just part of their employees’ jobs – they are supposed to make progress. Managers don’t typically think of progress as something that they need to worry about. But, according to Amabile, they actually do need to worry about it a lot. What Amabile saw in the diaries was that there were often little hassles happening in the work lives of most of the study participants that kept them from making as much progress as they would like. These were things that managers could have cleared away for them, without a lot of effort, if they had just been paying attention.
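As a check on the interval, the following Python sketch computes it from the raw counts. The function name `proportion_interval` is hypothetical; `statistics.NormalDist().inv_cdf` supplies the critical value z* instead of the rounded 1.96.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Large-sample z-confidence interval for a population proportion."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

lower, upper = proportion_interval(35, 669)
print(f"{lower:.3f} to {upper:.3f}")  # 0.035 to 0.069
```

Note that the interval uses $\hat{p}$ in the standard error, whereas the significance test used the null value $p_0$.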

On some level the workers themselves might have recognized that their best days often went hand-in-hand with progress events. But the managers basically had no clue. It is the kind of finding that makes perfect sense once you know about it. Sometimes you just have to ask the right questions and know how to analyze the data.



A. Identify inference problems that concern a population proportion.

B. Know how to conduct a significance test of a population proportion.

C. Be able to calculate a confidence interval for a population proportion.

D. Understand that the z-inference procedures for proportions are based on approximations to the normal distribution and that accuracy depends on having moderately large sample sizes.


Content Overview

Up to this point, all the inference procedures we have discussed involve using sample means, $\bar{x}$, to make inferences about population means, μ. In this unit, we focus on proportions. For example, what if we wanted to know what proportion of people own or use a computer at home, or have access to the Internet from home, or from work, or from school? In order to answer these questions, we need new inference procedures designed for proportions.

In inference, we start by defining the population – for our question on home-use of computers, the population will be all households in America. Of interest is the population proportion, p, of households in which some member owns or uses a computer at home. Now, we don’t have access to every household in America, but we can take a sample. In a random sample of 2,500 households, 2,036 answered yes to the following question:

At home, do you or any member of this household own or use a desktop, laptop, netbook, or notebook computer?

From this information we can calculate the sample proportion, which we label as $\hat{p}$:

$\hat{p} = \dfrac{2036}{2500} = 0.8144$, or 81.44%

But how good is this estimate for p? Remember, the sample proportion, $\hat{p}$, is a statistic. If we take another sample of 2,500 households, we will most likely get a different estimate for p. So, as a first step in developing inference procedures for population proportions, we need to know something about the sampling distribution of the sample proportion, $\hat{p}$.


Sampling Distribution of a Sample Proportion

Suppose that a large population is divided by some characteristic into two categories, successes and failures, and that p is the population proportion of successes. A simple random sample of size n is drawn from the population and $\hat{p}$ is the sample proportion:

$$\hat{p} = \frac{\text{number of successes in the sample}}{n}$$

As a statistic, $\hat{p}$ varies over repeated sampling. Its sampling distribution has the following properties:

• Mean: $\mu_{\hat{p}} = p$

• Standard deviation: $\sigma_{\hat{p}} = \sqrt{\dfrac{p(1 - p)}{n}}$

• Distribution: For large n, $\hat{p}$ has an approximately normal distribution.
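These properties can be illustrated with a small simulation. The sketch below draws many samples from a population with a known proportion of successes and compares the mean and standard deviation of the simulated sample proportions with the formulas. The values of p, n, and the number of repetitions are hypothetical choices made to keep the example quick to run.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

# Hypothetical settings, chosen only for illustration.
p, n, repetitions = 0.79, 400, 5000

def sample_proportion(p, n):
    """Proportion of successes in one simulated sample of size n."""
    successes = sum(random.random() < p for _ in range(n))
    return successes / n

p_hats = [sample_proportion(p, n) for _ in range(repetitions)]

simulated_mean = statistics.fmean(p_hats)
simulated_sd = statistics.stdev(p_hats)
formula_sd = math.sqrt(p * (1 - p) / n)

print("simulated mean:", simulated_mean, "population proportion:", p)
print("simulated SD:  ", simulated_sd, "formula SD:", formula_sd)
```

A histogram of `p_hats` would show the roughly bell-shaped pattern described in the third property.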

Since, in the case of home use and/or ownership of computers, the sample size is large (n = 2,500), the sampling distribution of $\hat{p}$ is approximately normal (as pictured in Figure 28.3).

Figure 28.3. Sampling distribution of the sample proportion, p .

Suppose that an online source claimed that 79% of American households had a member of the household who owned or used a computer at home. We would like to test that claim. To do so, we use the online source’s claim about the population to set up the null and alternative hypotheses:



$$H_0: p = 0.79$$

$$H_a: p \neq 0.79$$

Now, if the null hypothesis is true, then the distribution of $\hat{p}$ from a sample with n = 2,500 will have an approximately normal distribution with the following mean and standard deviation:

$$\mu_{\hat{p}} = 0.79$$

$$\sigma_{\hat{p}} = \sqrt{\frac{(0.79)(1 - 0.79)}{2500}} \approx 0.0081$$

Since we are dealing with an approximately normal distribution, we can express $\hat{p}$ in standardized units (subtract the mean and divide by the standard deviation):

$$z = \frac{\hat{p} - 0.79}{0.0081}$$

If the null hypothesis is true, z will have a standard normal distribution. Now, go back to the results of the survey, $\hat{p} = 0.8144$, and express that value in standardized units:

$$z = \frac{0.8144 - 0.79}{0.0081} \approx 3.01$$

We calculate a p-value for the significance test by determining how likely it is to observe a value from the standard normal distribution that is at least 3.01 units from the mean in either direction. In this case, we get a p-value of 2(0.001306) ≈ 0.003. Since this p-value < 0.05, we can reject the null hypothesis and conclude that the population proportion is not 0.79, or 79%.

Figure 28.4. Calculating the p-value of a z-test statistic.
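The calculation above can be checked with a few lines of Python. This is a minimal sketch rather than output from any particular statistics package: the function name `one_proportion_z_test` is our own, and the two-sided tail area of the standard normal distribution is obtained from the standard library's complementary error function.

```python
import math

def one_proportion_z_test(successes, n, p0):
    """Two-sided z-test for a population proportion (large-sample approximation)."""
    p_hat = successes / n
    sd_null = math.sqrt(p0 * (1 - p0) / n)   # standard deviation of p-hat if H0 is true
    z = (p_hat - p0) / sd_null
    # P(|Z| >= |z|) for a standard normal Z equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_hat, sd_null, z, p_value

p_hat, sd_null, z, p_value = one_proportion_z_test(2036, 2500, 0.79)
print(f"p_hat = {p_hat:.4f}")
print(f"standard deviation under H0 = {sd_null:.4f}")
print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```

Because the code does not round the standard deviation to 0.0081 before dividing, it reports z ≈ 3.00 rather than 3.01. The p-value still rounds to 0.003, so the conclusion is unchanged.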



Before moving on, we summarize the basics of a significance test for population proportions.

Significance Test for a Population Proportion

To test the null hypothesis $H_0: p = p_0$, where p is the population proportion and $p_0$ is the hypothesized value, we use the z-test statistic:

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$

where $\hat{p}$ is the sample proportion. When the null hypothesis is true and the sample size is large, the z-test statistic will have an approximate standard normal distribution.

Now that we have rejected the null hypothesis that members of 79% of American households own/use a computer at home, let’s calculate a confidence interval for the true population proportion. The formula for a confidence interval for a population proportion follows the same pattern that was used to calculate a confidence interval for a population mean:

Point estimate ± z*(standard error of point estimate)

Point estimate ± margin of error

Here’s the formula for calculating a confidence interval for a population proportion.

Confidence Interval for a Population Proportion

$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

where $\hat{p}$ is the sample proportion and z* is the z-critical value (from a standard normal distribution) associated with the confidence level.

Suppose we decide on a 95% confidence interval for p. Then we use z* = 1.96, just as we did in Unit 24, Confidence Intervals. All that is left is to substitute our observed sample proportion, $\hat{p} = 0.8144$, into the formula:

$$0.8144 \pm 1.96\sqrt{\frac{(0.8144)(1 - 0.8144)}{2500}}$$

$$\approx 0.8144 \pm 0.0152$$

81.44% ± 1.52%, or between 79.92% and 82.96%
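The same interval can be reproduced in Python. The sketch below assumes the large-sample formula given above; the function name `proportion_confidence_interval` is our own and the default z* = 1.96 corresponds to 95% confidence.

```python
import math

def proportion_confidence_interval(successes, n, z_star=1.96):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)   # margin of error
    return p_hat, margin, p_hat - margin, p_hat + margin

p_hat, margin, lower, upper = proportion_confidence_interval(2036, 2500)
print(f"point estimate  = {p_hat:.4f}")
print(f"margin of error = {margin:.4f}")
print(f"95% interval    = ({lower:.4f}, {upper:.4f})")
```

Passing a different `z_star` (for example, the critical value for 90% confidence) shows how the width of the interval changes with the confidence level.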

So, now we are ready to use sample proportions to conduct significance tests and calculate confidence intervals for population proportions.


Key Terms

Assume that a population is divided into two categories, successes and failures, based on some characteristic. The population proportion, p, is:

$$p = \frac{\text{number of successes in the population}}{\text{population size}}$$

Draw a sample of size n from this population. Then the sample proportion, $\hat{p}$, is calculated as follows:

$$\hat{p} = \frac{\text{number of successes in the sample}}{n}$$

If the sample size n is relatively large, the sampling distribution of the sample proportion, $\hat{p}$, is approximately normal with the following mean and standard deviation:

• Mean: $\mu_{\hat{p}} = p$, where p is the population proportion.

• Standard deviation: $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$, where n is the sample size.

To test the null hypothesis $H_0: p = p_0$, where p is the population proportion, we can use the z-test statistic for proportions. The formula for the z-test statistic is:

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$

The z-test is used in situations where the sample size n is large.

In situations where the sample size n is large, a confidence interval for the population proportion, p, can be calculated from the formula:

$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

where $\hat{p}$ is the sample proportion and z* is the z-critical value (from a standard normal distribution) associated with the confidence level.


The Video


1. What is the general topic of this video?

2. In Teresa Amabile’s earlier study of workers in creative jobs, how did participants of the study feel on the days when they were most able to solve problems creatively and come up with new ideas?

3. Describe the principle that Amabile dubbed the Progress Principle.

4. Managers were given five items, including progress, and asked to select the one that they felt most affected workers’ motivation. If managers randomly selected one of the five items, what percentage of the managers would we expect to select progress?

5. What type of test statistic was used to test the null hypothesis $H_0: p = 0.20$, where p is the population proportion?

6. In the video, a 95% confidence interval was calculated for the true population proportion of managers who would select progress as the most important motivational factor. After converting to percentages, were the values in this confidence interval below 20%, around 20%, or above 20%?


Unit Activity: Proportions of Blue Eyes

In the activity for Unit 21, you completed Table 21.1 by simulating data for inheriting blue eyes (genes bb) from brown-eyed parents who carried a recessive gene for blue eyes (genes Bb). You will need those data for this activity. In this activity, the population consists of the children of brown-eyed parents, each of whom carries a recessive gene for blue eyes. Here the true population proportion is known, p = 0.25, which is generally not the case. Knowing the population proportion allows us to see how well the statistics perform.

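Table 28.1. Data on children’s eye color.

| Sample Number | Number of Blue-Eyed Children, n = 4 | Estimated Proportion Blue-Eyed Children, n = 4 | Running Total Number of Children | Running Total Number of Blue-Eyed Children |
|---|---|---|---|---|
| 1 | | | | |
| 2 | | | | |
| ... | | | | |
| 30 | | | | |

(The table has one row for each sample number from 1 through 30. Columns two through five are left blank to be filled in.)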


1. In a copy of Table 28.1, enter the x-values from your completed Table 21.1 into the second column.

2. a. For each sample of four children, calculate the sample proportion of blue-eyed children, $\hat{p}$. Enter the sample proportions in the third column of Table 28.1.

b. Notice that your sample proportions vary from sample to sample (even though the population proportion stayed the same). What was the smallest sample proportion? What was the largest?

c. To get a sense of the shape of the sampling distribution of the sample proportion, make a histogram of your values for $\hat{p}$ (from column three). Use class intervals of width 0.25 for your histogram. Does your histogram indicate that the sample proportions have a normal distribution?

3. a. Complete the fourth column of Table 28.1 by entering a running total of the number of children as samples are combined. This list should contain the following numbers: 4, 8, 12, . . . , 120.

b. Complete the fifth column of Table 28.1 by entering a running total of the number of blue-eyed children as samples are combined.

4. The confidence interval formula given in the Content Overview is for large sample sizes. After combining the data from the first 10 samples, you now have a sample of 40 children.

a. Give a point estimate for the population proportion, p, of blue-eyed children based on the 40 children from Samples 1 – 10.

b. Compute a 95% confidence interval for p. (Round to three decimals.)

c. How big is the margin of error in your confidence interval in (b)?

5. After combining the data from the first 20 samples, you now have a sample of 80 children.

a. Give a point estimate for the population proportion, p, of blue-eyed children based on your sample of 80 children.


b. Compute a 95% confidence interval for p.


6. After combining the data from all 30 samples, you now have 120 children.

a. Give a point estimate for the population proportion, p, of blue-eyed children based on your sample of 120 children.

b. Compute a 95% confidence interval for p.


7. Compare the margins of error for the three confidence intervals that you computed in questions 4 – 6. What happened to the margin of error as the sample size increased?

8. From questions 4 – 6, we know that sample size affects the margin of error. How large a sample size n is needed to guarantee that the margin of error for a 95% confidence interval for p is less than 0.05? Complete parts (a) – (c) to find out.

a. The margin of error, E, for a 95% confidence interval is calculated by the following formula:

$$E = 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

Replace E by 0.05 and solve for n.

b. If you solved for n correctly, you found that n is a multiple of $\hat{p}(1 - \hat{p})$, which varies for different values of $\hat{p}$. Complete the second column of Table 28.2 by calculating the values of $\hat{p}(1 - \hat{p})$ for different values of $\hat{p}$.

Table 28.2. Values of $\hat{p}(1 - \hat{p})$.

| $\hat{p}$ | $\hat{p}(1 - \hat{p})$ |
|---|---|
| 0.1 | |
| 0.2 | |
| 0.3 | |
| 0.4 | |
| 0.5 | |
| 0.6 | |
| 0.7 | |
| 0.8 | |
| 0.9 | |

c. To find the value of n that guarantees a margin of error < 0.05, substitute the largest value you found for $\hat{p}(1 - \hat{p})$ into your equation in (a). Report the value of n needed to guarantee that the margin of error will be less than 0.05 (regardless of the value of $\hat{p}$).

9. To conclude this activity, we know that the population proportion of blue-eyed children born to brown-eyed parents with a blue-eye recessive gene is p = 0.25. Which of your confidence intervals from questions 4 – 6 gave correct results? (In other words, which of your confidence intervals contained the true population proportion?)


Exercises

1. A random sample of 2,454 12th-grade students were asked the following question:

Taking all things together, how would you say things are these days – would you say you’re happy or not too happy?

Of the responses, 2,098 students selected happy. (These data were from a Monitoring the Future survey.)

a. Determine the sample proportion of students who responded they were happy.

b. Calculate a 95% confidence interval for the population proportion of 12th-grade students who are happy.

c. Would a 90% confidence interval for the proportion of happy students be wider or narrower than the one you calculated for (b)? Justify your answer.

2. Currently, mothers in North America are advised to put babies to sleep on their backs. This recommendation has reduced the number of cases of sudden infant death syndrome (SIDS). However, it is a likely cause of another problem – flat spots on babies’ heads. A study of 440 babies aged 7 – 12 weeks found that 46.6% had flat spots on their heads.

a. The headline of the online news article reporting this story read: Nearly half of babies have flat spots, study finds. Conduct a test of hypotheses to test $H_0: p = 0.5$ against $H_a: p \neq 0.5$, where p is the population proportion of North American babies aged 7 – 12 weeks who have flat spots on their heads. Report the value of your test statistic, the p-value, and your conclusion.

b. Calculate a 95% confidence interval for the proportion of babies in this age group that have flat spots.

c. Suppose you decide to use your confidence interval from (b) to make a decision between $H_0: p = 0.50$ and $H_a: p \neq 0.50$. Would your decision based on your confidence interval agree with your decision based on the z-test statistic from (a)? Explain.

3. An online article claims that 90% of American households in which a computer is owned/used have access to the Internet. However, an Internet provider questioned the claim. The Internet provider felt that the percentage should be higher. A phone survey contacted 1,910 households in which a computer was owned/used and respondents were asked if they could access the Internet from their home. A total of 1,816 of the households responded yes.

a. Define the population.

b. Set up the null and alternative hypotheses.

c. Calculate the z-test statistic, determine the p-value, and state your conclusion.

4. Return to question 3. Calculate a 95% confidence interval for the population proportion p. Re-express your confidence interval as a percentage.


Review Questions

1. A sample of 5,462 eighth-grade students were asked whether or not they actively participated in sports, athletics, or exercising on a nearly daily basis. Of the students who responded, 2,998 said yes. (These data were from a Monitoring the Future survey.)

a. Determine the sample proportion of eighth-grade students who responded that they were involved nearly daily in some sort of physical activity.

b. A physical education teacher claimed that over 50% of all eighth-grade students in America actively participate in physical activity on a nearly daily basis. Set up a null hypothesis and an alternative hypothesis to test this claim.

c. Conduct a significance test for the population proportion. Report the value of the test statistic, the p-value, and your conclusion.

2. Polls taken a few days before the 2012 presidential election between Barack Obama and Mitt Romney did not indicate a clear winner. An NBC/Wall Street Journal poll showed that 48% of the sample intended to vote for Obama. The polling organization announced that they were 95% confident that the sample result was within ± 2.6 percentage points of the true percent of all voters who favored Obama.

a. Explain in plain language to someone who knows no statistics what “95% confident” means in this announcement.

b. The poll showed Obama leading Romney 48% to 47%. Yet NBC/Wall Street Journal declared the election was too close to call. Explain why.

3. A community college conducted a survey of student learning outcomes just prior to graduation. A sample of its students completed the survey. Student responses have been boiled down to two categories, agree or disagree, for the following three questions:

a. I have improved in my ability to take responsibility for my own actions.

Valid responses: 296; Agree: 255

b. I have improved in my ability to understand my society and the world.



c. I have improved in my awareness and appreciation of cultures other than my own.


For each of questions (a) – (c), determine a point estimate for the proportion of graduates from this college who would agree with the statement. Then calculate a 95% confidence interval for the population proportion.

4. Rasmussen Reports conducted a national survey of 1,000 adults from June 19-20, 2013. The poll found that 63% of Americans think that a government that is too powerful is a bigger danger than one that is not powerful enough.

a. Use the information from the report to calculate a 95% confidence interval for the proportion of Americans who would agree with the statement above. Restate your confidence interval in terms of percentages. What is the margin of error for your confidence interval?

b. The report concluded with the following statement: The margin of error is ±3% with a 95% level of confidence. Compare this statement with the margin of error you calculated in (a).

c. Was a sample size of 1,000 large enough to guarantee that the margin of error was less than 3% even if the sample percentage had been as low as 50% or as high as 80%? Explain.

d. How large a sample size was needed to guarantee that the margin of error was below 3% regardless of the sample proportion?

Unit 29: Inference for Two-Way Tables | Student Guide

Summary of Video

In this video, we visit the Broad Institute in Cambridge, Massachusetts, where our host, Dr. Pardis Sabeti, has a small research team investigating an ancient biological battle – the non-stop evolutionary arms race between our bodies and the infectious microorganisms that try to invade and inhabit them. The Broad Institute is home to new high-tech tools such as the latest generation of genome sequencers. They allow us to sequence out the letters that code the genomes of both humans and our microbial enemies. In her research, Dr. Sabeti and her team use the data that these machines provide to find clues that might lead to new ways to battle some of our most dangerous diseases, diseases that we in the West rarely encounter.

One of the deadliest is Lassa fever, which, like the more notorious tropical disease Ebola, is caused by a virus and kills its victims with hemorrhagic fever. Throughout West Africa, thousands of people die of Lassa fever every year. But what is surprising is that many tens of thousands more throughout the region are exposed to the virus without getting sick. This suggests that these people have some sort of resistance to the virus. It is the source of this resistance that Dr. Sabeti wants to discover.

Dr. Sabeti’s work on Lassa fever is still in its early stages, but one of the models for what she hopes to uncover can be found in the research on another tropical disease, malaria, which kills and sickens millions every year. With malaria we already know of one important source of resistance to the disease. It’s a genetic mutation that is better known for the harm it does than for the good – sickle cell anemia.



Figure 29.1. Inheriting the sickle cell mutation.

As we discovered in the module on binomial distributions (Unit 21), if a child inherits two copies of the sickle cell mutation (SS) from his or her parents, the child will have sickle cell anemia. If the child inherits only one copy of the gene, he or she is unaffected by the disease, but more importantly the child is protected against malaria. (See Figure 29.1.) It is this protective effect that is responsible for the sickle cell mutation becoming so prevalent and it is statistics that can reveal it.

To see how two-way tables can help reveal protective factors, Dr. Sabeti has borrowed some data from Dr. Hans Ackerman. He and his colleagues looked at the genotypes of 315 children with severe malaria. Since each child inherits one hemoglobin gene from each parent, they examined 630 genes in total. The researchers wanted to quantify whether children who came down with malaria were less likely to have inherited the protective sickle cell version of the gene (HbS) rather than the normal version (HbA), as compared to the general population. Table 29.1 shows the breakdown of HbA and HbS in two groups of children. The top row of the table shows the genes they found in the children with severe malaria. The bottom row shows the genes they found in a control group of newborn babies.

Table 29.1. Table of hemoglobin gene in two groups of children.

| | HbA Susceptible | HbS Protective | Total |
|---|---|---|---|
| Malaria | 623 | 7 | 630 |
| Control | 1065 | 101 | 1166 |

Intuitively, we would expect to find the protective version of the gene less frequently in the children sick with malaria than in the control group. After all, if they were protected, they likely wouldn’t have come down with the disease. Table 29.2 shows the conditional distribution of HbA and HbS for each group of children.


Table 29.2. Conditional distribution of HbA and HbS for each group.

Notice that HbS was inherited by the kids who caught malaria only 1.11% of the time compared to 8.66% of the time by the control group. Is that difference larger than would be expected just by chance? Is it statistically significant? We can conduct a test of hypotheses to find out whether there is sufficient evidence that the status of two variables – Malaria/General Population and HbS/HbA – are linked. Our null hypothesis is that there is no association between contracting malaria and having the HbS sickle cell gene. The alternative hypothesis is that there is an association between contracting malaria and having the protective HbS sickle cell gene.

$H_0$: No association between malaria and HbS

$H_a$: Association between malaria and HbS

We can compute what the expected counts in our two-way table would be if there really is no association between our variables as the null hypothesis states. Here’s how to compute the expected counts:

$$\text{expected count} = \frac{(\text{row total})(\text{column total})}{\text{grand total}}$$

Table 29.3 shows the results of adding the expected counts to our two-way table.

Table 29.3. Adding the expected counts.

Now we can see that if there were no relationship between having the gene and coming down with the disease we would expect to find 37.9 HbS genes in the children with malaria. But in reality there are only 7 HbS genes in that group. Is that difference between 7 and 37.9 enough to tell us that there is an association between our two categorical variables? The next step in our analysis is to use the chi-square test statistic, given below, to figure out if that difference is significant.

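Table 29.2

| | HbA Susceptible | HbS Protective | Total |
|---|---|---|---|
| Malaria | 98.89% | 1.11% | 100% |
| Control | 91.34% | 8.66% | 100% |

Table 29.3

| | | HbA Susceptible | HbS Protective | Total |
|---|---|---|---|---|
| Malaria | Observed | 623 | 7 | 630 |
| | Expected | 592.1 | 37.9 | 630.0 |
| Control | Observed | 1065 | 101 | 1166 |
| | Expected | 1095.9 | 70.1 | 1166.0 |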


$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

The chi-square test statistic is a measure of how far the observed counts in the table are from the expected counts. Here are the calculations:

$$\chi^2 = \frac{(623 - 592.1)^2}{592.1} + \frac{(7 - 37.9)^2}{37.9} + \frac{(1065 - 1095.9)^2}{1095.9} + \frac{(101 - 70.1)^2}{70.1}$$

$$\chi^2 \approx 41.26$$

Using software, we find the p-value: $p \approx 0$. So, we have very strong evidence that there is an association between our variables and we can reject our null hypothesis. This result, together with the pattern of the data, gives support to the research hypothesis that the HbS sickle cell variant of the hemoglobin gene does protect against malaria.
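For readers who want to reproduce these numbers, the following Python sketch computes the expected counts, the chi-square statistic, and the p-value for the malaria data. The function names are our own. The p-value line relies on a fact that holds only for a 2×2 table (one degree of freedom): the upper-tail area of the chi-square distribution at x equals erfc(√(x/2)). Tables of other sizes need statistical software or a chi-square table.

```python
import math

# Observed gene counts. Rows: Malaria, Control. Columns: HbA, HbS.
observed = [[623, 7],
            [1065, 101]]

def expected_counts(table):
    """Expected count for each cell: (row total)(column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

expected = expected_counts(observed)
chi2 = chi_square_statistic(observed)

# Valid for df = 1 only: P(chi-square >= x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print("expected counts:", [[round(e, 1) for e in row] for row in expected])
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.2e}")
```

The code keeps the expected counts unrounded, which is why it matches the reported value of 41.26. Plugging the rounded expected counts from Table 29.3 into the formula by hand gives a slightly larger value, about 41.30.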



A. Understand the basic principles of the chi-square test.

B. Know how to calculate the expected cell counts in a two-way table.

C. Know the assumptions required for the chi-square test of independence.

D. Be able to conduct a chi-square test of independence.


Content Overview

Each year, the study Monitoring the Future: A Continuing Study of American Youth (MTF) surveys 12th-grade students on a wide range of topics related to behaviors, attitudes, and values. It is a major source of information on smoking, drinking, and drug habits of American youth.

Suppose we want to investigate whether the environment in which students grow up is linked to the likelihood that they have consumed alcohol (more than just a few sips). We focus on three growing-up environments – a farm, the country, or a small-to-medium size city. Since we expect the growing-up environment may help us explain the likelihood of alcohol consumption, environment is the explanatory variable, and alcohol consumption is the response variable. We are interested in testing if there is an association between these two variables or if they are independent.

The two-way table in Table 29.4 shows the results on these questions from the 2011 MTF survey.

Table 29.4. Results from questions on growing-up environment and drinking alcohol.

| Alcohol (count) | Farm | Country | Small/Medium City |
|---|---|---|---|
| No | 144 | 342 | 1366 |
| Yes | 305 | 800 | 3049 |

We begin analyzing these data using techniques covered in Unit 13, Two-Way Tables. Because we think that growing-up environment explains whether or not students might have consumed alcohol, we calculate the conditional percentages for the variable alcohol for each level of environment. In other words, we compute the column percentages, which appear in Table 29.5.

Table 29.5. Column percentages.

| Alcohol (percent) | Farm | Country | Small/Medium City |
|---|---|---|---|
| No | 32.07 | 29.95 | 30.94 |
| Yes | 67.93 | 70.05 | 69.06 |
| Total | 100.00 | 100.00 | 100.00 |

Based on Table 29.5, it looks as if students who grew up in the country were the most likely (70.05%) to have drunk alcoholic beverages and the students who grew up on a farm were the least likely (67.93%). The question is whether these differences are due to an association between the two variables or simply to chance variation. In order to distinguish between these two cases, we set up hypotheses for a significance test:

$H_0$: No association between drinking alcohol and growing-up environment.

$H_a$: Association between drinking alcohol and growing-up environment.

Remember, the data in Table 29.5 came from a sample of 12th-grade students. The meaning of the null hypothesis is that in the population of all 12th-grade students in America there is no difference among the distributions of alcohol consumption for the three growing-up environments. To test $H_0$ we compare the observed counts from Table 29.4 with the counts that we would expect to see if the two variables were independent (no association). If it turns out that the observed counts are far from the expected counts, then we would have evidence against the null hypothesis. Here’s how to calculate the expected counts.

Calculating the Expected Counts

Assume that $H_0$ is true and that there is no association between two variables in a two-way table. Then the expected count in any cell of the table is computed as follows:

$$\text{expected count} = \frac{(\text{row total})(\text{column total})}{\text{grand total}},$$

where the grand total is the sum of the counts in all cells in the table.

Before calculating the expected counts, we add the row and column totals to our table of counts (See Table 29.6.).

Table 29.6. Addition of row and column totals to Table 29.4.

| Alcohol | Farm | Country | City | Total |
|---|---|---|---|---|
| No | 144 | 342 | 1366 | 1852 |
| Yes | 305 | 800 | 3049 | 4154 |
| Total | 449 | 1142 | 4415 | 6006 |


For example, the expected count for the cell in the first row, first column is:

$$\text{expected count} = \frac{(1852)(449)}{6006} \approx 138.45$$
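Repeating this calculation for every cell is a natural job for a short program. The Python sketch below is one possible way to do it, using the counts from Table 29.6; the variable names are our own.

```python
# Observed counts from Table 29.6, keyed by alcohol response and environment.
observed = {
    "No":  {"Farm": 144, "Country": 342, "City": 1366},
    "Yes": {"Farm": 305, "Country": 800, "City": 3049},
}

# Row totals, column totals, and the grand total.
row_totals = {row: sum(cells.values()) for row, cells in observed.items()}
col_totals = {}
for cells in observed.values():
    for col, count in cells.items():
        col_totals[col] = col_totals.get(col, 0) + count
grand_total = sum(row_totals.values())

# expected count = (row total)(column total) / grand total
expected = {
    row: {col: row_totals[row] * col_totals[col] / grand_total
          for col in col_totals}
    for row in observed
}

for row, cells in expected.items():
    rounded = {col: round(value, 1) for col, value in cells.items()}
    print(row, rounded)
```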

Table 29.7 shows the expected counts added to the table. For each cell, the expected counts appear below the observed counts.

Table 29.7. Table 29.6 with expected counts added.

If there is no association between alcohol consumption and growing-up environment, the expected counts should be close to the observed counts. We compare the observed and expected counts by way of a chi-square test statistic, $\chi^2$, which is given below.

Computing the Chi-Square Test Statistic

The chi-square test statistic measures how far the observed counts in a two-way table are from the expected counts.

The $\chi^2$-test statistic is calculated by the following formula:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

Next, we calculate the value of the chi-square test statistic:

$$\chi^2 = \frac{(144 - 138.5)^2}{138.5} + \frac{(342 - 352.1)^2}{352.1} + \frac{(1366 - 1361.4)^2}{1361.4} + \frac{(305 - 310.5)^2}{310.5} + \frac{(800 - 789.9)^2}{789.9} + \frac{(3049 - 3053.6)^2}{3053.6}$$

$$\chi^2 \approx 0.76$$

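Table 29.7

| Alcohol | | Farm | Country | City | Total |
|---|---|---|---|---|---|
| No | Observed | 144 | 342 | 1366 | 1852 |
| | Expected | 138.5 | 352.1 | 1361.4 | |
| Yes | Observed | 305 | 800 | 3049 | 4154 |
| | Expected | 310.5 | 789.9 | 3053.6 | |
| Total | | 449 | 1142 | 4415 | 6006 |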


If the null hypothesis is true and the cell counts are reasonably large, then the chi-square test statistic has an approximate chi-square distribution. Like t-distributions, chi-square distributions are specified by degrees of freedom. In this case, the degrees of freedom depend on the number of rows and columns of the table: df = (r – 1)(c – 1), where r and c are the number of rows and columns, respectively. Table 29.4 has two rows and three columns. So, we get df = (2 – 1)(3 – 1) = 2. To calculate a p-value, we find the probability of observing a value from the chi-square distribution with df = 2 that is at least as large as the one we observed, $\chi^2 = 0.76$. Using software, we determine that p ≈ 0.684 as can be seen in Figure 29.2. Assuming the null hypothesis is true, we would expect to see $\chi^2$-values at least as large as the one we observed around 68% of the time. That’s pretty often. So, we have insufficient evidence to reject the null hypothesis. We conclude that there does not appear to be an association between students’ drinking alcohol and whether students grew up on a farm, in the country, or in a city.

Figure 29.2. Calculating the p-value from a chi-square distribution.
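As a check on the arithmetic, here is a short Python sketch that starts from the observed and expected counts printed in Table 29.7. It is not the software used to produce Figure 29.2. It uses the fact that, for df = 2 specifically, the upper-tail area of the chi-square distribution at x has the simple closed form exp(−x/2); other degrees of freedom require statistical software or a table.

```python
import math

# Observed counts and rounded expected counts from Table 29.7,
# listed in the order No/Farm, No/Country, No/City, Yes/Farm, Yes/Country, Yes/City.
observed = [144, 342, 1366, 305, 800, 3049]
expected = [138.5, 352.1, 1361.4, 310.5, 789.9, 3053.6]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

rows, cols = 2, 3
df = (rows - 1) * (cols - 1)

# Closed form that applies only when df = 2: P(chi-square >= x) = exp(-x / 2).
if df != 2:
    raise ValueError("this shortcut for the p-value applies only when df = 2")
p_value = math.exp(-chi2 / 2)

print(f"chi-square = {chi2:.2f}, df = {df}, p-value = {p_value:.3f}")
```

The p-value differs from 0.684 only in the later decimal places, because the code carries the unrounded statistic through instead of 0.76.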

Next, we ask whether the same results would be true for 12th-grade students’ smoking habits. In other words, are the smoking habits of 12th-grade students independent of their growing-up environment? Table 29.8 gives results for these questions from the 2011 MTF survey. (More students answered the question on smoking than did on drinking alcohol.)



Table 29.8. Table of smoking and growing-up environment.

| Smoking | Farm | Country | City |
|---|---|---|---|
| Never | 299 | 738 | 3218 |
| Occasionally | 159 | 403 | 1521 |
| Regularly, now or past | 103 | 236 | 596 |

Again we set up the null and alternative hypotheses:

$H_0$: No association between smoking and growing-up environment.

$H_a$: Association between smoking and growing-up environment.

This time we leave the work of calculating the expected cell counts to the statistical software Minitab. Figure 29.3 shows the Minitab output.

Figure 29.3. Minitab chi-square analysis for smoking and growing-up environment.

Notice that the expected cell counts appear below the observed counts in the table. The value of the test statistic is $\chi^2 \approx 56.2$. Since this is a 3×3 table, df = (3 – 1)(3 – 1) = 4. Minitab reports the p-value as approximately 0. So, we conclude that the variables smoking and growing-up environment are not independent – there is an association. The results from the chi-square test do not tell us anything about the nature of the association, only that there is one. To learn about the nature of that association, we look at the conditional distributions of smoking for each of the growing-up environments. Table 29.9 shows the column percentages.

Table 29.9. Conditional distribution of smoking for each growing-up environment.

What we notice from Table 29.9 is that a higher percentage of students who grew up in a city never smoked (60.3%) compared to students who grew up on a farm (53.3%) or in the country (53.6%). The percentages for students who occasionally smoked (but not regular smokers) were about the same for all three growing-up environments. However, the percentage of regular smokers (either now or in the past) was higher for students who grew up on a farm (18.4%) or in the country (17.1%) compared to students who grew up in a city (11.2%).

The chi-square test, like the z-test for proportions, is an approximate method that becomes more accurate as the cell counts get larger. If the expected cell counts get too low, the test becomes untrustworthy. Here are some guidelines for when a chi-square test gives accurate results.

Guidelines for Using Chi-Square Test

The chi-square test gives trustworthy results provided the following are satisfied:

• All expected counts are greater than 1.

• No more than 20% of the expected counts are less than 5.
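These two guidelines are easy to check in code before trusting a chi-square result. The sketch below is illustrative: the function name `chi_square_guidelines_met` is our own, and the table called `sparse_expected` is hypothetical, made up to show a case that fails both guidelines.

```python
def chi_square_guidelines_met(expected_counts):
    """Return True if all expected counts exceed 1 and at most 20% are below 5."""
    cells = [count for row in expected_counts for count in row]
    all_above_one = all(count > 1 for count in cells)
    share_below_five = sum(1 for count in cells if count < 5) / len(cells)
    return all_above_one and share_below_five <= 0.20

# Expected counts for the alcohol data (Table 29.7): all cells are large.
alcohol_expected = [[138.5, 352.1, 1361.4],
                    [310.5, 789.9, 3053.6]]

# Hypothetical expected counts with sparse cells (not from the survey).
sparse_expected = [[40.2, 95.1, 410.7],
                   [6.3, 14.9, 64.8],
                   [0.8, 1.9, 8.3]]

print(chi_square_guidelines_met(alcohol_expected))  # guidelines satisfied
print(chi_square_guidelines_met(sparse_expected))   # guidelines violated
```

In the hypothetical table, one expected count is below 1 and two of the nine cells (about 22%) are below 5, so neither guideline is met.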

Statistical software will often give a warning if the guidelines have been violated. For example, energy drinks – non-alcoholic beverages that usually contain high amounts of caffeine (e.g., Red Bull, Full Throttle, and Monster) – have caused concern in the medical community. Suppose we wanted to know if the pattern of daily consumption of energy drinks was associated with students’ growing-up environment.

The output from Minitab appears in Figure 29.4. Notice the software reports the value of the chi-square test statistic, but this time it does not provide a p-value. Instead it prints a warning, which we have highlighted.

Table 29.9 (column percentages)

                                    Environment
Smoking                     Farm     Country     City
Never                       53.3      53.6       60.3
Occasionally                28.3      29.3       28.5
Regularly, now or past      18.4      17.1       11.2
Total                      100.0     100.0      100.0


Figure 29.4. Minitab chi-square analysis for energy drinks and growing-up environment.

In this case, we could combine some of the categories for energy drinks. For example, we might combine categories Three, Four, and Five or more into a single category “Three or more.” You will get a chance to try this approach in the exercises.

Data for two-way tables can arise in different ways. In the case of the Monitoring the Future data, a single sample of high school students was chosen to take part in the survey. Their responses to two questions (two categorical variables) were organized into two-way tables. That was not the case for the data discussed in the video. Those data came from two different samples, a sample of children sick with malaria and a sample of newborns (control group), which were then classified according to one categorical variable, HbS/HbA. In this case, the sample, malaria or control, was the second variable in the two-way table. There is no difference in the analysis used in these two situations. The expected counts and chi-square test statistics were computed using the same formulas in both cases.


Key Terms

The observed cell counts (or frequencies) are the actual number of observations that fall into each cell (class). The expected counts (or frequencies) are the number of observations that should fall into each class in a frequency distribution under the hypothesized probability distribution.

Chi-square statistic:

$\chi^2 = \sum \dfrac{(\text{observed} - \text{expected})^2}{\text{expected}}$

Degrees of freedom for chi-square test of independence: $(r - 1)(c - 1)$, where r and c are the numbers of rows and columns, respectively.

Expected count for chi-square test of independence:

$\text{expected count} = \dfrac{(\text{row total})(\text{column total})}{\text{grand total}}$
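As a rough illustration of how these three formulas fit together, here is a minimal Python sketch. The function names and the 2 x 3 table of observed counts are hypothetical, chosen only so the arithmetic is easy to follow by hand; they are not data from this unit.

```python
def expected_counts(observed):
    """Expected count for each cell: (row total)(column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def degrees_of_freedom(observed):
    """(r - 1)(c - 1) for a table with r rows and c columns."""
    return (len(observed) - 1) * (len(observed[0]) - 1)


# Hypothetical observed counts (2 rows, 3 columns), for illustration only.
observed = [[20, 30, 50], [30, 30, 40]]

print(expected_counts(observed))                 # [[25.0, 30.0, 45.0], [25.0, 30.0, 45.0]]
print(round(chi_square_statistic(observed), 3))  # 3.111
print(degrees_of_freedom(observed))              # 2
```

The p-value is not computed here; it would come from a chi-square table or from statistical software, using the statistic and degrees of freedom printed above.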


The Video

Take out a piece of paper and be ready to write down the answers to these questions as you watch the video.

1. What type of research is the host of this series, Dr. Pardis Sabeti, involved in?

2. Dr. Sabeti’s work is modeled off of work done on malaria. What genetic mutation is an important source of resistance to malaria?

3. What were the null and alternative hypotheses for testing whether the sickle cell gene protects against malaria?

4. What is the rule for calculating the expected counts under the null hypothesis?

5. The p-value of the chi-square test statistic turned out to be approximately 0. What can you conclude based on this p-value?


Unit Activity: Associations With Color

This activity is in three parts. In Part I, you will examine the reasoning behind the expected count formula. In Part II, you will need to collect data on eye color and gender from a sample of students. In Part III, there are different samples – different types of M&M candies. The candies are classified on one variable, color. In all three cases, you will conduct chi-square analyses.

Part I: Introduction – Assumption of Independence and Expected Count Formula

1. A survey given to 500 students asked: How would you describe your political preference? There were three response choices: GOP (Republican), DEM (Democrat), and IND (Independent). Keeping with the color theme of this activity, GOP is red (red states tend to vote Republican), DEM is blue, and to make the color scheme patriotic, we’ll let IND be represented by the color white. In addition to collecting information on political preference, the students indicated whether they were male or female. The results are given in Table 29.10.

Table 29.10. Distribution of political preference and gender.

We are interested in finding out whether there is an association between gender and political preference. We begin attacking this problem as a problem in probability. For example, to estimate the probability that a randomly selected student will be female and a Democrat (blue), we use the observed proportion 107/500. We can also calculate marginal probabilities using the row or column totals. For example, we estimate the probability that a student prefers the Democratic Party to be 196/500 and the probability that a randomly selected student is female as 246/500.

Using probability, we can examine what it would mean for the variables gender and political preference to be independent (or to have no association). If gender and political preference are independent, then we can use the Multiplication Rule to calculate this probability: P(political preference = DEM and gender = female).

Table 29.10 (counts)

Political Preference (Color)    Male    Female    Total
DEM (Blue)                       107       89      196
GOP (Red)                         76      109      185
IND (White)                       63       56      119
Total                            246      254      500


a. Assume the variables gender and political preference are independent. Use the Multiplication Rule to estimate P(DEM and female) from the marginal probabilities. Show your calculations. (Give your answer to at least 4 decimals.)

b. Use your probability in (a) to determine the number of students out of the 500 observed that you would expect to fall into the category of being female and preferring the Democratic Party.

c. In a test of the null hypothesis $H_0$: no association between the variables, the formula for calculating the expected count is

$\text{expected count} = \dfrac{(\text{row total})(\text{column total})}{\text{grand total}}$

For the cell corresponding to female and DEM, determine the expected count from the formula above. Compare your result with your answer to (b).

d. Repeat (a) - (c) for the cell corresponding to DEM and Male.

2. a. Assuming that the null hypothesis in 1(c) is true, calculate the expected counts for each cell in Table 29.10.

b. Calculate the value of the chi-square test statistic and the degrees of freedom. Then determine the p-value.

c. What can you conclude from your results in (b)?

Part II: Single Sample, Classified on Two Categorical Variables

One way to gather data that is appropriate for chi-square analysis is to select a single sample and then to classify the subjects in that sample by two categorical variables.

You will need a sample of students (your class, combined classes, friends). The two variables that you will use to classify the students in your sample are gender and eye color. The null hypothesis is:

$H_0$: No association between gender and eye color.

or equivalently:

$H_0$: The variables gender and eye color are independent.


3. a. Collect the data from your sample of students. Enter it into a copy of Table 29.11.

Table 29.11. Data on gender and eye color.

b. State the null and alternative hypotheses.

c. Calculate the expected cell counts and enter them into your table.

d. Perform a chi-square test. Report the value of the test statistic, the p-value, and your conclusion.

Part III: Multiple Samples, Classified on One Categorical Variable

Another data structure that is appropriate for chi-square analysis is when samples are drawn from different populations and classified on one categorical variable. In this case, we can think of “which sample” as the second variable. Next, your samples will be from different types of M&M candies. Given bags of at least two types of M&Ms, you will classify the M&Ms into colors, taking care to record which type of M&Ms candies you are classifying.

The null hypothesis is:

$H_0$: No association between M&M type and color.

or equivalently:

$H_0$: The color distributions are the same for the different M&M types.

4. a. Collect the color distribution from bags of up to four types of M&Ms. Then enter your data into a table similar to the one in Table 29.12. (Be sure to record the type.)

Table 29.11 (counts, to be filled in)

                      Eye Color
Gender      Blue     Brown     Other     Total
Male
Female
Total


Table 29.12. Data on M&Ms type and color.

b. State the null and alternative hypotheses.

c. Calculate the expected cell counts and enter them into your table.

d. Perform a chi-square test. Report the value of the test statistic, the p-value, and your conclusion.

Table 29.12 (counts, to be filled in)

Color      Type 1 Regular     Type 2     Type 3     Type 4     Total
Green
Blue
Yellow
Orange
Red
Brown
Total


Exercises

The questions in these exercises all relate to data collected from the study Monitoring the Future: A Continuing Study of American Youth (MTF).

1. One of the questions on the MTF survey asked the following: About how many (if any) energy drinks do you drink PER DAY, on average? Figure 29.4 (see Page 12) shows Minitab results from testing to see if there is an association between the number of energy drinks students consumed each day and their growing-up environment. As noted in the Content Overview, Minitab computed the value of the chi-square test statistic but did not compute a p-value.

a. Explain all ways in which this analysis failed to meet the guidelines for using a chi-square test.

b. In order to continue the investigation into an association between energy drink consumption and growing-up environment, we decided to combine the last three categories (Three, Four, and Five or more) into a single category Three+. Make a copy of Table 29.13. Use the data from Figure 29.4 to fill in the observed values in the third row of the table. Then find the row total and enter that into your table.

Table 29.13. Two-way table of energy drinks and growing-up environment.

c. Use the row and column totals to calculate the expected counts for the third row. Enter the expected counts into your table. Do the expected counts in your completed table meet the guidelines for using the chi-square test?

d. Calculate the value of the chi-square test statistic. How many degrees of freedom are associated with this statistic?

e. Determine the p-value and state your conclusion.

Table 29.13 (counts)

                                   Environment
Energy Drinks               Farm     Country     City      Total
None       Observed           57       144        598        799
           Expected        52.55    150.44     596.01
One        Observed           11        44        160        215
           Expected        14.14     40.48     160.38
Two        Observed            4        13         36         53
           Expected         3.49      9.98      39.54
Three +    Observed
           Expected
Total                         73       209        828       1110


2. Table 29.14 revisits data from Unit 12’s exercises, which also dealt with responses to the MTF survey. Table 29.14 organizes data on gender and responses to the following question: How intelligent do you think you are compared with others your age?

Table 29.14. Results from questions on gender and intelligence rating.

a. We would like to test whether there is a statistical difference between how males and females rate their intelligence compared to their peers. In this context, which is the explanatory variable and which is the response variable? Explain.

b. State an appropriate null hypothesis and alternative hypothesis.

c. Make a copy of Table 29.14. Calculate the row totals and column totals and enter them into your table. Then calculate the expected counts for each cell and enter the expected counts into your table.

d. Calculate the chi-square test statistic. What are the degrees of freedom associated with the chi-square test statistic?

e. Calculate the p-value and state your conclusion.

3. We would expect that there is an association between how students rated their intelligence and their academic success. Table 29.15 organizes students’ responses rating their intelligence compared to their peers and their average grade in high school.

Table 29.15. Results from question on intelligence and average grade.

Table 29.14 (counts)

                            Intelligence
Gender      Below Average     Average     Above Average     Total
Female           407            2224          3437
Male             456            1643          4593
Total

Table 29.15 (counts)

                          Average Grade
Intelligence         A         B       C or Below
Above Average      2886      4044        1387
Average            1335      1881         585
Below Average       305       416         164


a. State the null and alternative hypotheses.

b. Calculate the expected counts and record them in a table.

c. Calculate the chi-square test statistic. State the degrees of freedom. Determine the p-value.

d. If the null hypothesis is true, how likely would it be to observe a value from the chi-square distribution in (c) at least as large as the value of the chi-square test statistic that you calculated in (c)? Does this provide strong evidence against the null hypothesis? Explain.

4. Another question on the MTF survey asked the following: On average over the school year, how many hours per week do you work in a paid or unpaid job? The survey results, classified into a two-way table, are shown in Figure 29.5. In addition, the Minitab output contains the conditional distributions of hours worked per week for each gender (row percentages). And finally, of particular interest is whether or not there is a statistical difference in work patterns between male and female 12th-grade students. The expected counts, under the hypothesis that there is no association between gender and work patterns, also appear in Figure 29.5. (See key at bottom of output for the order of appearance.)

Figure 29.5. Minitab chi-square analysis for gender and weekly work hours.



a. State appropriate null and alternative hypotheses for this situation.

b. Report the outcome of the chi-square test and state your conclusion.

c. A chi-square test tells you whether or not there is an association between the two variables but it doesn’t tell you anything about the nature of that association. Based on the row percentages, describe the nature of the association between gender and hours worked per week or describe evidence for the lack of such an association.


Review Questions

1. The video for Unit 15, Designing Experiments, focused on an observational study of coral reefs. Moray eels are an important component of coral reef fish communities. Researchers Robert Young and Howard Winn conducted an observational study of moray eel behavior in the Belize Barrier Reef. They focused on two common species, the spotted moray and the purplemouth moray. For each eel they observed, they identified its species and classified the locations of the sightings into three categories, G for grass bed, S for sand or rubble, and B for within one meter of the border between grass and sand/rubble. The results are presented in Table 29.16.

Table 29.16. Habitat types for two species of moray eels. Source: Robert F. Young, Howard E. Winn, and W. L. Montgomery. Activity Patterns, Diet, and Shelter Site Use for Two Species of Moray Eels, Gymnothorax moringa and Gymnothorax vicinus, in Belize. Copeia: February 2003, Vol. 2003, No. 1, pp. 44-55.

a. Set up the hypotheses to test whether there is a relationship between eel species and habitat use.

b. Create a table of expected cell counts.

c. Calculate the chi-square test statistic. Show your calculations. Report the degrees of freedom, and the p-value. At the 0.05 level of significance, is the habitat use independent of the species of moray eel?

d. To examine the nature of any association between the two variables, habitat use and moray eel species, calculate either row or column percentages, whichever is more appropriate to the situation under study. Justify your choice of type of percentage. What do your percentages reveal about moray eels?

2. A random sample of registered voters was asked about their educational background and whether or not they voted in the November 2012 elections. Table 29.17 contains the results of the survey.

Table 29.16 (counts)

Habitat Use     Spotted     Purplemouth
G                 127           116
S                  99            67
B                 264           161


Table 29.17. Survey results on voting and highest educational attainment.

a. In this situation, which is the explanatory variable and which is the response variable? Justify your answer.

b. Set up the hypotheses for testing whether educational attainment and voting in the 2012 presidential election are independent.

c. Calculate the expected counts for each cell.

d. Calculate the chi-square test statistic, state the degrees of freedom, and determine the p-value. Are the results significant?

e. Make a bar chart that displays how voting patterns are related to highest educational attainment. (Your choice of which variable is the explanatory variable should be evident in your display.) Label the bars with the corresponding percentages. Describe the nature of the relationship between the two variables.

3. Some tired, stressed-out students have turned to 2-ounce energy drink shots such as 5-Hour Energy to give them the energy boost they feel they need to make it through the day (or night). Compared to energy drinks that can run about 100 calories per 8-ounce serving, energy shots are sugar free and are around 4 calories per shot.

Because of the low calorie count, would female students be apt to drink more energy shots on a daily basis than male students? To find out, researchers asked a group of 12th-grade students the following question: How many (if any) energy drink shots do you drink PER DAY, on average? Table 29.18 gives the results from a survey given to a sample of 12th-grade students.

Table 29.17 (counts)

                                          Voted Nov. 2012
Highest Educational Attainment             Yes       No
Not HS Graduate                             57       64
HS Graduate/No College                     227      163
Some College or Associate's Degree         303       51
Bachelor's Degree or Higher                303       51


Table 29.18. Student responses to question on energy drink shots.

a. The researchers wanted to see if there was an association between the daily number of energy drink shots consumed and gender. Calculate the expected cell counts for each cell.

b. Based on your answer to (a) do the expected counts satisfy the guidelines for using a chi-square test? Explain.

c. Combine some of the categories for the amount of energy shots consumed per day. Compute the expected counts and check to see if the guidelines for using the chi-square test are satisfied. If not, combine some additional categories until the guidelines are satisfied. (There are different choices for how the categories can be combined.)

d. Perform a chi-square test on your data from (c). What is the value of the chi-square test statistic? Report its p-value. What conclusions could the researchers draw from your results?

Table 29.18 (counts)

Energy Shots Consumed Per Day     Female     Male
None                                896       938
Less than one                        63        70
One                                  16        19
Two                                   5        16
Three                                 7         5
Four                                  1         0
Five or Six                           4         4
Seven or more                         4         7

Unit 30: Inference for Regression | Student Guide

Summary of Video

In Unit 11, Fitting Lines to Data, we examined the relationship between winter snowpack and spring runoff. Colorado resource managers made predictions about the seasonal water supply using a least-squares regression line that was fit to a scatterplot of their measurement data, which is shown in Figure 30.1.

Figure 30.1. Least-squares regression line used by Colorado resource managers.

But would we really see a linear relationship between snowpack and runoff if we had all the possible data? Or might the pattern we see in the sample data’s scatterplot occur just by chance? We would like to know whether the positive association we see between snowpack and runoff in the sample is strong enough that we can conclude that the same relationship holds for the whole population. Statisticians rely on inference to determine whether the relationship observed between two variables in a sample is valid for some larger population.

Inference is a powerful tool. Powerful enough, in fact, to help bring an entire bird species back from the brink of extinction. After World War II, the agrichemical industry began mass-producing chemicals to control pests. Cities like San Antonio, Texas, sprayed whole sections of the city with the insecticide DDT in their fight against the spread of poliomyelitis. Unfortunately, there weren’t many safeguards in place, and the damaging environmental effects of these compounds were not taken into account. Eventually, changes in the natural environment due to chemical pesticides became apparent. One species that was severely affected was the peregrine falcon.

In Great Britain, Derek Ratcliffe noticed in the 1950s that peregrine falcons were declining at nesting sites and they were unable to hatch their eggs. This decline in falcons was eventually demonstrated to be a worldwide phenomenon. Researchers determined that the reason peregrine falcons were not successfully hatching their eggs was due to eggshell thinning, a very serious problem since the weaker shells were breaking before the baby birds were ready to hatch. After looking at some of the causes for this eggshell thinning, scientists began to zero in on a possible culprit: DDT and its breakdown product, DDE.

There were a couple of reasons why scientists believed that there was a relationship between DDT or DDE and eggshell thinning. In studying the broken eggshells and eggs collected in the field, scientists found very high residues of DDE that had not been seen in historic samples. The falcons were ingesting DDT through their prey – birds they ate had small concentrations of the chemical in their flesh. Over time the DDT built up in the peregrines’ own bodies and started to affect the females’ ability to lay healthy eggs.

Even though scientists had a pretty strong hunch that DDT was the cause of peregrine falcon eggshell thinning, they could not rely on their scientific instincts alone. So, researchers turned to statistics as a way to validate their analyses. We can follow in the researchers’ footsteps by taking a look at a data set comprised of 68 peregrine falcon eggs from Alaska and Northern Canada. A scatterplot of the two variables we will be studying, eggshell thickness (response variable) and the log-concentration of DDE (explanatory variable), appears in Figure 30.2. We have added the least-squares regression line fit to these data. Remember it is described by an equation of the form $y = a + bx$.


Figure 30.2. Scatterplot of eggshell thickness versus log-concentration of DDE.

The data in Figure 30.2 show a negative, linear relationship between the two variables. Using the equation, we can predict eggshell thickness for any measurement of DDE. The slope b and intercept a are statistics, meaning we calculated them from our sample data. But if we repeated the study with a different sample of eggs, the statistics a and b would take on somewhat different values. So, what we want to know now is whether there really is a negative linear relationship between these variables for the entire population of all peregrine eggs, beyond just the eggs that happen to be in our sample. Or might the pattern we see in the sample data be due simply to chance variation?
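To get a feel for how much a and b can move from one sample to the next, the idea can be tried in a small simulation. The following is only a sketch: the population intercept, slope, and scatter (`ALPHA`, `BETA`, `SIGMA`) and the range of x-values are hypothetical numbers chosen for illustration, not estimates from the falcon data, and `least_squares` is a helper written for this example.

```python
import random


def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical population values, assumed purely for illustration.
ALPHA, BETA, SIGMA = 2.0, -0.3, 0.2
SAMPLE_SIZE = 68

rng = random.Random(0)  # fixed seed so the run is repeatable


def draw_sample_slope():
    """Draw one random sample from the assumed population and return its slope b."""
    xs = [rng.uniform(0.0, 3.0) for _ in range(SAMPLE_SIZE)]
    ys = [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for x in xs]
    return least_squares(xs, ys)[1]


slopes = [draw_sample_slope() for _ in range(200)]
mean_slope = sum(slopes) / len(slopes)

print("smallest and largest sample slope:", min(slopes), max(slopes))
print("average of the 200 sample slopes:", mean_slope)
```

Each of the 200 samples gives a somewhat different slope, but the slopes cluster around the assumed population slope. That sample-to-sample variation is what inference for regression has to account for.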

Data of the entire peregrine egg population might look like the scatterplot in Figure 30.3. Notice that for any given value of the explanatory variable, such as the value indicated by the vertical line, many different eggshell thicknesses may be observed.

Figure 30.3. Scatterplot representing population of peregrine eggs.


In the scatterplot in Figure 30.4, the mean eggshell thickness, $\mu_y$, does have a linear relationship with the log-concentration of DDE, x. The line fit to the hypothetical population data is called the population regression line. Because we don’t have access to ALL the population data, we use our sample data to estimate the population regression line.

Figure 30.4. The population regression line fit to the population data.

Several conditions, which are discussed in the Content Overview, must be met in order to move forward with regression inference. You can check out whether these conditions are satisfied in Review Question 1. But for now, we assume that the conditions for inference are met. The population regression model is written as follows:

µy = α + βx

where $\mu_y$ represents the true population mean of the response y for the given level of x, α is the population y-intercept, and β is the population slope. Now let’s look back at our least-squares regression line, based on the sample of 68 bird eggs. The equation is

$\hat{y} = 2.146 - 0.3191x$

The sample intercept, a = 2.146, is an estimate for the population intercept α . And the sample slope, b = -0.3191, is an estimate for the population slope β.

Of course, we’ve learned by now that other samples from the same population will give us different data, resulting in different estimates of the parameters α and β. In repeated sampling, the values of these statistics, a and b, form sampling distributions, which provide the basis for statistical inference. In particular, we want to infer from the sampling distribution for our statistic b whether the sample data provide sufficiently strong evidence that higher levels of DDE are related to eggshell thinning in the population. To answer this question, we set up our null and alternative hypotheses.

$H_0$: Amount of DDE and eggshell thickness have no linear relationship.

or $H_0: \beta = 0$

$H_a$: Amount of DDE and eggshell thickness have a negative linear relationship.

or $H_a: \beta < 0$

The t-test statistic for testing the null hypothesis is:

$t = \dfrac{b - \beta_0}{s_b}$

where b is our sample estimate for the population slope, $\beta_0$ is the null hypothesis value for the population slope, and $s_b$ is the standard error of the estimate b, which we can get from software. In this case, $s_b = 0.0255$. Next, we calculate the value of our t-test statistic:

$t = \dfrac{-0.3191 - 0}{0.0255} \approx -12.5$

If the null hypothesis is true, then t has a t-distribution with n – 2, or 66, degrees of freedom. The value t = -12.5 is an extreme value and the corresponding p-value is essentially 0. Thus, we have strong evidence to reject the null hypothesis. By rejecting the null hypothesis, we can confirm what scientists already suspected – that there is a connection between peregrine falcon eggshell thickness and the presence of DDE. More precisely, there is a statistically significant, negative linear relationship between the log-concentration of DDE and the thickness of peregrine eggshells.
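The test statistic and its degrees of freedom can be reproduced in a few lines. This is a minimal Python sketch using the values quoted above; the function name `slope_t_statistic` is an illustrative choice, and the p-value is not computed because it requires a t-table or statistical software.

```python
def slope_t_statistic(b, s_b, beta_0=0.0):
    """t statistic for testing H0: beta = beta_0, given slope b and its standard error s_b."""
    return (b - beta_0) / s_b


b = -0.3191    # sample slope from the regression of the 68 eggs
s_b = 0.0255   # standard error of the slope, as reported by software
n = 68         # number of eggs in the sample

t = slope_t_statistic(b, s_b)
df = n - 2     # degrees of freedom for inference about the slope

print(round(t, 1), df)  # -12.5 66
```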

Before researchers could present this finding to the public, however, they had to quantify the relationship. That meant computing a confidence interval for the population slope. Here’s the formula:

$b \pm t^* s_b$

For a 95% confidence interval and df = 68 – 2 = 66, we find t* = 1.997. Now, we can compute the confidence interval:


$-0.3191 \pm (1.997)(0.0255)$

$-0.3191 \pm 0.0509$

$-0.3700$ to $-0.2681$
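The same interval arithmetic can be written as a short Python sketch. The function name `slope_confidence_interval` is an illustrative choice, and the critical value t* = 1.997 is simply taken from the text rather than computed.

```python
def slope_confidence_interval(b, s_b, t_star):
    """Return (lower, upper) endpoints of the interval b +/- t_star * s_b."""
    margin = t_star * s_b
    return b - margin, b + margin


# Values quoted in the text: slope, its standard error, and t* for df = 66.
lower, upper = slope_confidence_interval(b=-0.3191, s_b=0.0255, t_star=1.997)

print(f"{lower:.3f} to {upper:.3f}")  # -0.370 to -0.268
```

Working from the rounded inputs, the endpoints agree with the interval above to three decimal places; a difference in the fourth decimal place can arise from rounding.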

Hence, based on our sample of 68 peregrine falcon eggs, we are 95% confident that a one-unit increase in the log-concentration of DDE is associated with a true average decrease of between 0.27 and 0.37 in Ratcliffe’s eggshell thickness index. Armed with this information, scientists were able to make a strong argument against the use of DDT because of its dangerous impact on peregrines and the environment as a whole. These results led to a prolonged legal battle with people on both sides presenting evidence. Due to scientific and statistical evidence, the United States and many Western European countries banned DDT use. Since then, the peregrine falcon population has rebounded significantly. So, this environmental detective story has a happy ending for the peregrine falcons.



A. Understand the linear regression model. Know how to find the least-squares regression line as an estimate (covered in Unit 11, Fitting Lines to Data.)

B. Know how to check whether the assumptions for the linear regression model are reasonably satisfied.

C. Recall how to find the least-squares regression equation (Unit 11, Fitting Lines to Data).

D. Be able to calculate, or obtain from software, the standard error of the estimate, $s_e$, and the standard error of the slope, $s_b$.

E. Be able to conduct a significance test for the population slope β.

F. Be able to calculate a confidence interval for the population slope β.


Content Overview

While we often hear of the benefits of eating fish, we also hear warnings about limiting our consumption of certain fish whose flesh contains high levels of mercury. Much like the peregrine falcons and DDT, small levels of mercury in oceans, lakes, and streams build up in fish tissue over time. It becomes most concentrated in larger fish, which are higher up on the food chain.

To better understand the relationship between fish size and mercury concentration, the United States Geological Survey (USGS) collected data on total fish length and mercury concentration in fish tissue. (Total length is the length from the tip of the snout to the tip of the tail.) The data from a sample of largemouth bass (of legal size to catch) collected in Lake Natoma, California, appear in Table 30.1. (You may remember these data from Review Question 3 in Unit 11.)

Table 30.1. Fish total length and mercury concentration in fish tissue.

Since we believe that fish length explains mercury concentration, total length is the explanatory variable and mercury concentration is the response variable. A scatterplot of mercury concentration versus total length appears in Figure 30.5.

Table 30.1

Total Length    Mercury Concentration    Total Length    Mercury Concentration
(mm)            (µg/g wet wt.)           (mm)            (µg/g wet wt.)
341             0.515                    490             0.807
353             0.268                    315             0.320
387             0.450                    360             0.332
375             0.516                    385             0.584
389             0.342                    390             0.580
395             0.495                    410             0.722
407             0.604                    425             0.550
415             0.695                    480             0.923
425             0.577                    448             0.653
446             0.692                    460             0.755


Figure 30.5. Scatterplot of mercury concentration versus total fish length.

Since the pattern of the dots in the scatterplot indicates a positive, linear relationship between the two variables, we fit a least-squares line to the data. However, these data are a sample of 20 largemouth bass from the population of all the largemouth bass that live in Lake Natoma. While we can use the least-squares equation to make predictions about mercury concentration for fish of a particular length, we need techniques from statistical inference to answer the following questions about the population:

• Is there really a positive linear relationship between the variables mercury concentration and total length, or might the pattern observed in the scatterplot be due simply to chance?

• Can we determine a confidence interval estimate for the population slope, the rate of change of mercury concentration per one millimeter increase in fish total length?

• If we use the least-squares line to predict the mercury concentration for a fish of a particular length, how reliable is our prediction?

Now, what if we could make a scatterplot of mercury concentration versus total length for all of the largemouth bass (at or close to the legal catch length) in Lake Natoma? Figure 30.6 shows how a scatterplot of the population might look and how a regression line fit to the population data might look.

[Figure 30.5 plot: Mercury Concentration (µg/g) versus Total Length (mm), with the least-squares line ŷ = −0.7374 + 0.003227x.]


Figure 30.6. Population scatterplot of mercury concentration versus total length.

Notice, for each fish length, x, there are many different values of mercury concentration, y. For example, in Figure 30.6 a vertical line segment has been drawn at length x1. That line segment intersects with a whole distribution of mercury concentration values, y-values, on the scatterplot. The mean of that distribution of y-values, µy, is at the intersection of the vertical line at x1 and the regression line. Now look at the vertical line at x2. It too intersects with an entire distribution of y-values, with mean at the intersection of the vertical line at x2 and the regression line. So, the population regression line describes how the mean mercury concentration values, µy, are related to total length, x. In this case, the relationship looks linear and so we express it as: µy = α + βx. As mentioned earlier in this unit, several conditions must be met in order to move forward with regression inference. Those conditions, along with a description of the simple linear regression model, are presented below.

[Figure 30.6 plot: Population Scatterplot of Mercury Concentration (µg/g) versus Total Length (mm), with vertical line segments at x1 and x2 and the population regression line µy = α + βx.]


Simple Linear Regression Model and Conditions

We have n ordered pairs of observations (x, y) on an explanatory variable, x, and response variable, y.

The simple linear regression model assumes that for each value of x the observed values of the response variable, y, vary about a mean µy that has a linear relationship with x:

µy = α + βx

The line described by µy = α + βx is called the population regression line. In addition, the following conditions must be satisfied:

• For any fixed value of x, the response y varies according to a normal distribution. Repeated responses, y-values, are independent of each other.

• The standard deviation of y for any value of x, σ, is the same for all values of x.

Thus, the model has three unknown population parameters: α, β, and σ.
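One way to get a feel for the model is to simulate it. The Python sketch below is only an illustration: the values used for α, β, and σ are hypothetical (loosely inspired by the fish example), and the function name simulate_responses is made up for this sketch. For each fixed x, it draws many independent, normally distributed responses with the same σ and compares their average with α + βx.

```python
import random

# Hypothetical parameter values, chosen only for illustration.
ALPHA = -0.74   # population intercept (assumed)
BETA = 0.0032   # population slope (assumed)
SIGMA = 0.09    # common standard deviation of y about the line (assumed)


def simulate_responses(x, n, rng):
    """Draw n independent responses at a fixed x from the model."""
    mean_y = ALPHA + BETA * x  # point on the population regression line
    return [rng.gauss(mean_y, SIGMA) for _ in range(n)]


rng = random.Random(30)  # fixed seed so the run is repeatable
for x in (350, 400, 450):
    ys = simulate_responses(x, 10_000, rng)
    sample_mean = sum(ys) / len(ys)
    print(f"x = {x}: mean of simulated y = {sample_mean:.3f}, "
          f"alpha + beta*x = {ALPHA + BETA * x:.3f}")
```

In a run like this, the average of the simulated responses at each x should land close to the point on the population regression line, which is what µy = α + βx describes.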

Figure 30.7 provides a graphic representation of the simple linear regression model and conditions.

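[Figure 30.7 diagram: the line µy = α + βx with normal curves, each with standard deviation σ, centered at α + βx1, α + βx2, and α + βx3 for three different x-values, x1, x2, and x3.]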

Figure 30.7. Graphic representation of linear regression model.

A first step in inference is to estimate the unknown parameters. We begin with estimates for the slope and intercept of the population regression line. The estimated regression line for the linear regression model is the least-squares line, ŷ = a + bx. From Figure 30.5, the estimated regression line is:

$$ \hat{y} = -0.7374 + 0.003227x $$

The y-intercept, a = −0.7374, is a point estimate for the population intercept, α, and the slope, b = 0.003227, is a point estimate for the population slope, β.
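The point estimates a and b can be reproduced from the data in Table 30.1. The following Python sketch uses the standard least-squares formulas, b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b x̄; the function name least_squares and the variable names are chosen just for this example and do not come from any particular software package.

```python
# Table 30.1: total length (mm) and mercury concentration (µg/g wet wt.)
lengths = [341, 353, 387, 375, 389, 395, 407, 415, 425, 446,
           490, 315, 360, 385, 390, 410, 425, 480, 448, 460]
mercury = [0.515, 0.268, 0.450, 0.516, 0.342, 0.495, 0.604, 0.695, 0.577, 0.692,
           0.807, 0.320, 0.332, 0.584, 0.580, 0.722, 0.550, 0.923, 0.653, 0.755]


def least_squares(x, y):
    """Return (a, b), the intercept and slope of the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


a, b = least_squares(lengths, mercury)
print(f"y-hat = {a:.4f} + {b:.6f} x")
```

The printed coefficients should agree, up to rounding, with the least-squares line reported in the text.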

Next, we develop an estimate for σ, which measures the variability of the response y about the population regression line. Because the least-squares line estimates the population regression line, the residuals estimate how much y varies about the population regression line:

residual = observed y – predicted y = y − ŷ

We estimate σ from the standard deviation of the residuals, s_e, as follows:

$$ s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} = \sqrt{\frac{\text{SSE}}{n - 2}} $$

Our estimate for σ, s_e, is called the standard error of the estimate.

The computation of s_e is tedious by hand. Regression outputs from statistical software will compute the value for you. However, here’s how it is computed in our example of mercury concentration and fish length. First, we’ll compute the residual corresponding to data value (341, 0.515) as a reminder of how that is done.

$$ \hat{y} = -0.7374 + 0.003227(341) \approx 0.363 $$

$$ y - \hat{y} = 0.515 - 0.363 = 0.152 $$

Here are all 20 residuals (rounded to three decimals):

0.152 -0.134 -0.062 0.043 -0.176

-0.042 0.028 0.093 -0.057 -0.010

-0.037 0.041 -0.092 0.079 0.059

0.136 -0.084 0.111 -0.055 0.008

Next, we calculate the SSE, the sum of the squares of the residuals:

$$ \text{SSE} = (0.152)^2 + (-0.134)^2 + (-0.062)^2 + \cdots + (0.008)^2 \approx 0.1545 $$


Now, we calculate s_e:

$$ s_e = \sqrt{\frac{\text{SSE}}{20 - 2}} \approx \sqrt{\frac{0.1545}{18}} \approx 0.0926 \ \mu\text{g/g} $$
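The arithmetic for SSE and the standard error of the estimate can be checked with a few lines of Python. This sketch starts from the 20 rounded residuals listed above, so its results carry the same rounding as the hand calculation; the variable names are chosen just for this example.

```python
import math

# The 20 residuals listed in the text (rounded to three decimals).
residuals = [0.152, -0.134, -0.062, 0.043, -0.176,
             -0.042, 0.028, 0.093, -0.057, -0.010,
             -0.037, 0.041, -0.092, 0.079, 0.059,
             0.136, -0.084, 0.111, -0.055, 0.008]

n = len(residuals)
sse = sum(r ** 2 for r in residuals)   # sum of squared residuals
s_e = math.sqrt(sse / (n - 2))         # n - 2 degrees of freedom

print(f"n = {n}, SSE = {sse:.4f}, s_e = {s_e:.4f} µg/g")
```

Dividing by n − 2 rather than n reflects the two parameters, α and β, that were estimated from the data before the residuals could be computed.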

We can use the equation of the least-squares line, ŷ = −0.7374 + 0.003227x, to make predictions. However, those predictions are more reliable when the data points lie “close” to the line. Keep in mind that s_e is one measure of the closeness of the data to the least-squares line. If s_e = 0, the data points fall exactly on the least-squares line. Moreover, when s_e is positive, we can use it to place error bounds above and below the least-squares line. These error bounds are lines parallel to the least-squares line that lie one or two s_e above and below the least-squares line. We apply this idea to our mercury concentration and fish length data.

$$ \hat{y} = -0.7374 + 0.003227x \pm 0.0926 $$

$$ \hat{y} = -0.7374 + 0.003227x \pm 2(0.0926) $$

Figure 30.8. Adding lines ± s_e and ± 2s_e above and below the least-squares line.

Recall from Unit 8, Normal Calculations, that we expect roughly 68% of normal data to lie within one standard deviation of the mean and roughly 95% to lie within two standard deviations of the mean. Notice that all of our data fall within two s_e of the least-squares line. So, for fish of a particular length, say total length = 400 mm, we expect roughly 95% of the fish to have mercury concentrations between 0.3682 μg/g and 0.7386 μg/g.
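The interval quoted for fish of length 400 mm comes from evaluating the least-squares line at x = 400 and moving two standard errors of the estimate in each direction. Here is a short Python sketch of that calculation, using the rounded values reported in the text; the function name band is made up for this illustration.

```python
# Rounded values reported in the text.
A = -0.7374      # intercept of the least-squares line
B = 0.003227     # slope of the least-squares line
S_E = 0.0926     # standard error of the estimate (µg/g)


def band(x, k):
    """Return the values k standard errors below and above the line at x."""
    y_hat = A + B * x
    return y_hat - k * S_E, y_hat + k * S_E


low, high = band(400, 2)
print(f"Predicted at 400 mm: {A + B * 400:.4f} µg/g")
print(f"Two-s_e band: {low:.4f} to {high:.4f} µg/g")
```

Calling band(400, 1) instead gives the narrower band that is expected to capture roughly 68% of the fish of that length.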

The standard error of the estimate provides one way to select between competing models. For example, suppose we had a second model relating mercury concentration to the explanatory variable fish weight. We would choose the model with the smaller value for s_e.

[Figure 30.8 plot: Mercury Concentration (µg/g) versus Total Length (mm).]


The scatterplot in Figure 30.5 appears to support the hypothesis that longer fish tend to have higher levels of mercury concentration. But is this positive association statistically significant? Or could it have occurred just by chance? To answer this question, we set up the following null and alternative hypotheses:

H0: Total length and mercury concentration have no linear relationship, or H0: β = 0.

Ha: Total length and mercury concentration have a positive linear relationship, or Ha: β > 0.

A regression line with slope 0 is horizontal. That indicates that the mean of the response y does not change as x changes – which, in turn, means that the linear regression equation is of no value in predicting y. In the case of mercury concentration and total length, the estimate of the population slope is very small, b = 0.003227. So, we might jump to the conclusion that total length is not useful in predicting mercury concentration. But we’d better work through the details of a significance test before jumping to such a conclusion.

Significance Test For Regression Slope, β

To test the hypothesis H0: β = β0, compute the t-test statistic:

$$ t = \frac{b - \beta_0}{s_b} \quad \text{where} \quad s_b = \frac{s_e}{\sqrt{\sum (x - \bar{x})^2}} $$

and b is the least-squares estimate of the population slope, β, and β0 is the null hypothesis value for β.

If the null hypothesis is true and the linear regression conditions are satisfied, then t has a t-distribution with df = n – 2.

Back to the situation with mercury concentration and fish length. We use software to help us calculate s_b:

$$ s_b = \frac{0.093}{\sqrt{39463.2}} \approx 0.000468 $$


Now we are ready to calculate t:

$$ t = \frac{0.003227 - 0}{0.000468} \approx 6.9 $$

In this case, df = n – 2 = 20 – 2 = 18. Since this is a one-sided alternative, we find the probability of observing a value of t at least as large as the one we observed, 6.9. As shown in Figure 30.9, the area under the t-density curve to the right of 6.9 is so small that it isn’t really visible. The area is only 9.4127 × 10^−7; so, p ≈ 0. There is sufficient evidence to reject the null hypothesis and conclude that β > 0: there is a positive linear relationship between total length and mercury concentration.
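For readers who want to see where these numbers come from, the following Python sketch recomputes s_b, t, and an approximate one-sided p-value from the values reported in the text. It is a rough illustration rather than a substitute for statistical software: the tail area is approximated by Simpson's rule applied to the t-density, and the helper names t_pdf and t_upper_tail, along with the integration width and step count, are choices made only for this example.

```python
import math

# Values reported in the text.
b = 0.003227      # least-squares slope
s_e = 0.093       # standard error of the estimate, as rounded in the text
sxx = 39463.2     # sum of squared deviations of x from its mean
df = 20 - 2       # degrees of freedom, n - 2

s_b = s_e / math.sqrt(sxx)   # standard error of the slope
t_stat = (b - 0) / s_b       # test statistic for H0: beta = 0


def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, width=100.0, steps=20_000):
    """Approximate P(T >= t) with Simpson's rule on [t, t + width].

    The area beyond t + width is ignored; that is negligible for df = 18
    but would not be for very small df. steps must be even.
    """
    h = width / steps
    total = t_pdf(t, df) + t_pdf(t + width, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return total * h / 3


p_value = t_upper_tail(t_stat, df)
print(f"s_b = {s_b:.6f}, t = {t_stat:.2f}, one-sided p-value = {p_value:.2e}")
```

Because the inputs are rounded, the computed t and p-value will differ slightly from the figures quoted above, but the p-value is still on the order of one in a million.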

Figure 30.9. Calculating the p-value.

Next, we calculate a confidence interval estimate for the regression slope, β. Here are the details for constructing a confidence interval.

Confidence Interval For Regression Slope, β

A confidence interval for β is computed using the following formula:

$$ b \pm t^{*} s_b $$

where t* is a t-critical value associated with the confidence level and determined from a t-distribution with df = n – 2; b is the least-squares estimate of the population slope, and s_b is the standard error of b.

[Figure 30.9 plot: Density Curve for t-distribution, df = 18; the area to the right of t = 6.9 is 9.4127E-07.]


To calculate the confidence interval, we start by determining the value of t* for a 95% confidence interval when df = 18. Using a t-table, we get t* = 2.101. We can now calculate the confidence interval:

$$ b \pm t^{*} s_b $$

$$ 0.003227 \pm (2.101)(0.000468) \approx 0.003227 \pm 0.000983 $$

or, rounded to four decimals, from 0.0022 to 0.0042.

Thus, for each increase of 1 millimeter in total length, we expect the mercury concentration to increase by between 0.0022 μg/g and 0.0042 μg/g. That may seem like a small increase, but, for example, Florida has set the safe limit on mercury concentration to be below 0.5 μg/g.
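The interval arithmetic is short enough to verify directly. The Python sketch below uses the values from the text (b, s_b, and the table value t* = 2.101) and, purely as an illustration of why a small per-millimeter slope can still matter, also scales the interval to a 100 mm difference in length; that 100 mm figure is an arbitrary choice for this example.

```python
# Values reported in the text.
b = 0.003227       # least-squares slope (µg/g per mm)
s_b = 0.000468     # standard error of the slope
t_star = 2.101     # t-critical value, 95% confidence, df = 18

margin = t_star * s_b
lower, upper = b - margin, b + margin
print(f"95% CI for the slope: {lower:.4f} to {upper:.4f} µg/g per mm")

# Illustration only: the same interval expressed per 100 mm of length.
print(f"Per 100 mm: {100 * lower:.2f} to {100 * upper:.2f} µg/g")
```

Seen per 100 mm, the estimated change is a few tenths of a μg/g, which is not small next to a limit of 0.5 μg/g.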

The results from inference are trustworthy provided the conditions for the simple linear regression model are satisfied. We conclude this overview with a discussion of checking the conditions – what should be done first before proceeding to inference. The conditions involve the population regression line and deviations of responses, y-values, from this line. We don’t know the population regression line, but we have the least-squares line as an estimate. We also don’t know the deviations from the population regression line, but we have the residuals as estimates. So, checking the assumptions can be done through examining the residuals. Here is a rundown of the conditions that must be checked:

1. Linearity. Check the adequacy of the linear model (covered in Unit 11). Make a residual plot, a scatterplot of the residuals versus the explanatory variable. If the pattern of the dots appears random, with about half the dots above the horizontal axis and half below, then the condition of linearity is satisfied.

2. Normality. The responses, y-values, vary normally about the regression line for each x. This does not mean that the y-values are normally distributed because different y-values come from different x-values. However, the deviations of the y-values about their mean (the regression line) are normal and those deviations are estimated by the residuals. So, check that the residuals are approximately normally distributed (covered in Unit 9). Make a normal quantile plot. If the pattern of the dots appears fairly linear, then the condition of normality is satisfied. If the plot indicates that the residuals are severely skewed or contain extreme outliers, then this condition is not satisfied.

3. Independence. The responses, y-values, must be independent of each other. The best evidence of independence is that the data are a random sample.

4. Constant standard deviations of the responses for all x. To check this condition, examine a residual plot. Check to see if the vertical spread of the dots remains about the same as x-values increase. As an example, consider the two residual plots in Figure 30.10. In residual plot (a), the vertical spread is about the same for small x-values as it is for large x-values. In this case, Condition 4 is satisfied. In residual plot (b), the spread of the residuals tends to increase as x-values increase. We’ve used a pencil to roughly draw an outline of the spread as it fans out for larger values of x. Here Condition 4 is not satisfied.

(a) (b)

Figure 30.10. Checking to see if Condition 4 is satisfied.

Now, we return to the fish study: Are the inference results – the significance test and confidence interval that we calculated – trustworthy? Let’s check to see if Conditions 1 – 4 are reasonably satisfied. A residual plot appears in Figure 30.11.

Figure 30.11. Residual plot for checking conditions.

[Figure 30.10 plots: panels (a) and (b), Residuals versus x.]

[Figure 30.11 plot: Residual Plot (Response is Mercury Concentration), Residual versus Total Length (mm).]


The dots appear randomly scattered and split above and below the horizontal axis. In addition, the vertical spread seems to be roughly the same as total length, x, increases. Therefore, Conditions 1 and 4 are reasonably satisfied. Figure 30.12 shows a normal quantile plot of the residuals. The pattern of the dots appears fairly linear. So, Condition 2 is reasonably satisfied.

Figure 30.12. Normal quantile plot of residuals.

Finally, the data were a random sample of fish. So, the mercury concentration levels are independent of each other. Condition 3 is satisfied. So, now we can say that our inference results are trustworthy.
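The graphical checks described above can be supplemented with a few rough numerical summaries. The Python sketch below is not a formal diagnostic procedure; it simply counts residuals above and below zero (Condition 1), compares the spread of the residuals for the shorter and longer halves of the fish (Condition 4), and computes the correlation between the sorted residuals and normal scores, which mimics how straight a normal quantile plot looks (Condition 2). The plotting positions (i + 0.5)/n and the helper names are assumptions made for this illustration.

```python
import statistics

# Table 30.1 data and the rounded least-squares line from the text.
lengths = [341, 353, 387, 375, 389, 395, 407, 415, 425, 446,
           490, 315, 360, 385, 390, 410, 425, 480, 448, 460]
mercury = [0.515, 0.268, 0.450, 0.516, 0.342, 0.495, 0.604, 0.695, 0.577, 0.692,
           0.807, 0.320, 0.332, 0.584, 0.580, 0.722, 0.550, 0.923, 0.653, 0.755]
A, B = -0.7374, 0.003227

residuals = [y - (A + B * x) for x, y in zip(lengths, mercury)]

# Condition 1: roughly half the residuals above and half below zero.
n_above = sum(r > 0 for r in residuals)
n_below = sum(r < 0 for r in residuals)

# Condition 4: compare residual spread for shorter and longer fish.
pairs = sorted(zip(lengths, residuals))
half = len(pairs) // 2
spread_short = statistics.stdev([r for _, r in pairs[:half]])
spread_long = statistics.stdev([r for _, r in pairs[half:]])
spread_ratio = spread_long / spread_short


def correlation(u, v):
    """Pearson correlation of two equal-length lists."""
    mu, mv = statistics.mean(u), statistics.mean(v)
    suv = sum((a - mu) * (c - mv) for a, c in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((c - mv) ** 2 for c in v)
    return suv / (suu * svv) ** 0.5


# Condition 2: sorted residuals versus normal scores.
n = len(residuals)
normal = statistics.NormalDist()
scores = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
quantile_corr = correlation(sorted(residuals), scores)

print(f"above/below zero: {n_above}/{n_below}")
print(f"spread ratio (longer/shorter): {spread_ratio:.2f}")
print(f"normal quantile correlation: {quantile_corr:.3f}")
```

A spread ratio near 1 and a quantile correlation near 1 are consistent with what the plots show, but with only 20 fish these summaries should be read alongside the plots rather than in place of them.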

[Figure 30.12 plot: Normal Quantile Plot, Percent versus Residuals.]


Key Terms

The simple linear regression model assumes that for each value of x the observed values of the response variable y are normally distributed about a mean µy that has the following linear relationship with x:

µy = α + βx

The line described by µy = α + βx is called the population regression line. The estimated regression line for the linear regression model is the least-squares line, ŷ = a + bx.

Assumptions of the linear regression model:

The observed response y for any value of x varies according to a normal distribution.

The y-values are independent of each other.

The mean response, µy , has a straight-line relationship with x: µy = α + βx .

The standard deviation of y, σ , is the same for all values of x.

The standard error of the estimate, s_e, is a measure of how much the observations vary about the least-squares line. It is a point estimate for σ and is computed from the following formula:

$$ s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} = \sqrt{\frac{\text{SSE}}{n - 2}} $$

The standard error of the slope, s_b, is the estimated standard deviation of b, the least-squares estimate for the population slope β. It is calculated from the following formula:

$$ s_b = \frac{s_e}{\sqrt{\sum (x - \bar{x})^2}} $$

The t-test statistic for testing H0: β = β0, where β is the population slope, is calculated as follows:

$$ t = \frac{b - \beta_0}{s_b} $$

where b is the least-squares estimate of the population slope, β0 is the null hypothesis value for β, and s_b is the standard error of b. When H0 is true, t has a t-distribution with df = n – 2, where n is the number of (x, y)-pairs in the sample. The usual null hypothesis is H0: β = 0, which says that the straight-line dependence on x has no value in predicting y.

To calculate a confidence interval for the population slope, β, use the following formula:

$$ b \pm t^{*} s_b $$

where t* is a t-critical value associated with the confidence level and determined from a t-distribution with df = n – 2; b is the least-squares estimate of the population slope, and s_b is the standard error of b.


The Video

Take out a piece of paper and be ready to write down the answers to these questions as you watch the video.

1. The population of peregrine falcons was in decline in the 1950s. What was the reason for the population’s decline?

2. In a scatterplot of eggshell thickness and log-concentration of DDE, which was the explanatory variable and which was the response variable?

3. Describe the form of the relationship between eggshell thickness and log-concentration of DDE – is the form linear or nonlinear? Positive or negative?

4. What is a population regression line?

5. Why are a and b, the y-intercept and slope of the least-squares line, called statistics?

6. State the null and alternative hypotheses used for testing whether the sample data provided strong evidence that higher levels of DDE were related to eggshell thinning in the population.


7. What was the outcome of the significance test?

8. Did the peregrine falcons ever recover?


Unit Activity: Clues to the Thief

A high school’s mascot is stolen and the poster shown in Figure 30.13 has been posted around the school and the town. The thief has left clues: a plain black sweater and a set of footprints under a window. The footprints appear to have been made by a man’s sneaker. Here are more details from the investigation:

• The distance between the footprints reveals that the thief’s steps are about 58 cm long. This distance was measured from the back of the heel on the first footprint to the back of the heel on the second.

• The thief’s forearm is between 26 and 27 cm. The forearm length was estimated from the sweater by measuring from the center of a worn spot on the elbow to the turn at the cuff.

Figure 30.13. The missing manatee.

School officials suspect that the thief is a student from a rival high school. Table 30.2 contains data from a random sample of 9th and 10th-grade students that you can use for this activity. Feel free to add and/or substitute data that your class collects.

In this activity, you will fit two linear regression models to the data. For the first model you will fit a line to forearm length and height; for the second model, you will fit a line to step length and height. To eliminate confusion, express your models using the variable names rather than x and y.


1. a. Make a scatterplot of height versus forearm length. Calculate the equation of the least-squares line and add its graph to your scatterplot.

b. Check to see if the four conditions for the simple linear regression model are reasonably satisfied. (Look to see if there are strong departures from the conditions.)

c. Calculate the standard error of the estimate, s_e.

2. Next, let’s focus on inference related to the relationship between height and forearm length.

a. We expect people with longer forearms to be taller than people with shorter forearms. Conduct a significance test H0 :β = 0 against Ha :β > 0 . Report the value of the test statistic, the degrees of freedom, the p-value, and your conclusion.

b. Construct a 95% confidence interval for β. Interpret your confidence interval in the context of this situation.

3. a. Make a scatterplot of height versus step length. Calculate the equation of the least-squares line and add its graph to your scatterplot.

b. Check to see if the four conditions for the simple linear regression model are reasonably satisfied. (Look to see if there are strong departures from the conditions.)

c. Calculate the standard error of the estimate, s_e.

4. Next, we focus on inference related to the relationship between height and step length.

a. We expect people with longer step lengths to be taller than people with shorter step lengths. Conduct a significance test H0 :β = 0 against Ha :β > 0 . Report the value of the test statistic, the degrees of freedom, the p-value, and your conclusion.

b. Construct a 95% confidence interval for β. Interpret your confidence interval in the context of this situation.

5. a. You have two competing models for predicting height, one based on forearm length and the other based on step length. Which of your two models is likely to produce more precise estimates? Explain.


b. Use one or both of your models to fill in the blanks in the following sentence. Justify your answer.

We predict that the thief is ______ cm tall. But the thief might be as short as ______ or as tall as ______.

Table 30.2. Data from 9th- and 10th-grade students.

Gender   Height (cm)   Stride Length (cm)   Forearm Length (cm)
Male     166.0         58.250               28.5
Male     178.0         68.500               29.0
Male     171.0         58.500               27.2
Male     165.0         50.125               28.0
Male     177.5         58.750               31.3
Male     166.0         62.875               28.3
Male     175.5         59.125               28.6
Male     171.0         67.750               31.5
Male     184.0         68.875               30.5
Male     184.5         66.250               30.8
Male     183.5         79.500               30.5
Male     172.0         70.500               30.3
Female   164.5         55.875               24.2
Female   166.0         52.375               27.3
Female   168.0         55.375               28.0
Female   178.5         59.750               29.1
Female   166.0         48.375               27.9
Female   159.0         57.125               28.0
Female   166.0         64.000               27.4
Female   154.5         57.750               25.8
Female   161.0         63.500               27.0
Female   177.0         69.750               30.1
Female   161.0         72.500               26.5
Female   164.0         75.250               28.2
Female   174.0         58.500               28.4
Female   164.0         59.750               26.8
Female   168.0         55.250               26.4


Exercises

Table 30.3 provides data on femur (thighbone) and ulna (forearm bone) lengths and height. These data are a random sample taken from the Forensic Anthropology Data Bank (FDB) at the University of Tennessee. Notice that height is given in centimeters and bone length in millimeters. All exercises will be based on these data.

Table 30.3. Data on femur and ulna length and height.

1. a. Make a scatterplot of height versus femur length. Would you describe the pattern of the dots as linear or nonlinear? Positive association or negative?

b. Calculate the equation of the least-squares line. Add a graph of the line to your scatterplot in (a).

c. Check to see if the conditions for regression inference are reasonably satisfied. Identify any strong departures from the conditions.

Table 30.3

Femur Length, x1 (mm)   Ulna Length, x2 (mm)   Height, y (cm)
432                     237                    158
498                     288                    188
463                     276                    173
443                     245                    163
511                     278                    191
547                     283                    189
484                     279                    178
522                     293                    182
438                     251                    163
462                     262                    175
449                     255                    159
499                     273                    181
484                     280                    168
472                     255                    175
484                     269                    175
432                     248                    160
439                     248                    165
483                     263                    170
484                     269                    180
508                     307                    183


2. a. Building on the work done for question 1, calculate the standard error of the estimate, s_e.

b. Write the equations of error bands one and two standard errors, s_e, above and below the least-squares line. Add graphs of these lines to your scatterplot from question 1(b).

c. If the distributions of the responses, y-values, for any fixed x are normally distributed with mean on the regression line, then the outermost bands in (b) should trap roughly 95% of the data between the bands. Is that the case?

3. a. Make a scatterplot of height versus ulna length. Determine the equation of the least-squares line and add a graph of the least-squares line to your scatterplot.

b. Calculate the standard error of the estimate, s_e.

c. Suppose a partial skeleton is found on a rugged hillside. The skeleton is brought to a lab for identification. The ulna bone measures 287 mm and the femur measures 520 mm. Use your equation from 3(a) to predict the person’s height. Then use your equation from 1(b) to predict the person’s height. Which of your estimates, the one based on ulna length or the one based on femur length, is likely to be more reliable? Justify your answer based on the standard error of the estimate, s_e, for each equation.

4. Consider the linear regression model for height based on femur length.

a. Test the hypothesis H0 :β = 0 against the one-sided alternative Ha :β > 0 . Report the value of the t-test statistic, the degrees of freedom, the p-value, and your conclusion.

b. Calculate a 95% confidence interval for the population slope, β.


Review Questions

1. The video focused on peregrine falcons and the relationship between eggshell thickness and log-concentration of DDE. During the video, we did not check whether or not the conditions for inference were met and went ahead with conducting a significance test and constructing a confidence interval. Your task is to check whether the four conditions for inference are reasonably satisfied given the following information. Justify your answer.

Assume that the data came from a random sample of eggs collected from Alaska and Northern Canada. Figure 30.14 shows a residual plot and Figure 30.15 displays a normal quantile plot of the residuals.

Figure 30.14. Residual plot.

Figure 30.15. Normal Quantile Plot of Residuals.

[Figure 30.14 plot: Residual Plot, Residuals versus Log-Concentration DDE.]

[Figure 30.15 plot: Normal Quantile Plot, Percent versus Residuals.]


2. Admissions offices of colleges and universities are interested in any information that can help them determine which students will be successful at their institution. For example, could students’ high school grade point averages (GPA) be useful in predicting their first-year college GPAs? Data on high school GPA and first-year college GPA from a random sample of 32 college students attending a state university are displayed in Table 30.4.

Table 30.4. Data on high school GPA and first-year college GPA.

a. Make a scatterplot of first-year college GPA versus high school GPA. Does the form of these data appear to be linear? Would you describe the relationship as positive or negative?

b. Determine the equation of the least-squares line and add the line to your scatterplot in (a).

c. Determine the t-test statistic for testing H0 :β = 0 . How many degrees of freedom does t have?

d. Find the p-value for the one-sided alternative Ha :β > 0 . What do you conclude?

3. Linda heats her house with natural gas. She wonders how her gas usage is related to how cold the weather is. Table 30.5 shows the average temperature (in degrees Fahrenheit) each month from September through May and the average amount of natural gas Linda’s house used (in hundreds of cubic feet) each day that month.

Table 30.4

High School GPA   First-Year College GPA   High School GPA   First-Year College GPA
3.00              3.15                     2.90              1.46
3.00              2.07                     3.50              3.10
2.30              2.60                     3.10              2.76
3.68              4.00                     3.35              2.01
2.20              2.03                     3.70              3.34
3.00              3.53                     2.70              2.90
3.03              3.17                     2.86              2.93
3.00              2.68                     2.51              1.95
3.16              3.88                     2.93              3.01
2.70              2.30                     3.41              3.48
4.00              3.64                     3.30              2.87
3.77              3.62                     3.76              2.85
2.70              2.34                     2.66              1.67
3.10              3.64                     2.91              3.38
3.23              3.67                     3.47              3.68
2.80              3.37                     3.40              3.76


Table 30.5. Gas usage and temperature data.

a. Make a scatterplot of gas usage versus temperature. Describe the form and direction of the relationship between these two variables.

b. Fit a least-squares line to gas usage versus temperature and add a graph of the line to your scatterplot in (a).

c. Check to see if the conditions needed for inference are satisfied.

d. Calculate the standard error of the estimate, s_e, and standard error of the slope, s_b. Show your calculations.

e. Conduct a significance test of H0: β = 0. Should the alternative be one-sided or two-sided? Report the value of the t-test statistic, the degrees of freedom, the p-value and your conclusion.

f. Calculate a 95% confidence interval for the population slope. Interpret your results in the context of this problem.

4. Do taller 4-year-olds tend to become taller 6-year-olds? Can a linear regression model be used to predict a 4-year-old’s height when he or she turns six? Table 30.6 gives data on heights of children when they were four and then when they were six.

Table 30.6. Data on children’s heights at age 4 and 6.

Table 30.5

Month                          Sep   Oct   Nov   Dec   Jan   Feb   Mar   Apr   May
Outdoor temperature (°F)       48    46    38    29    26    28    49    57    65
Gas used per day (100 cu ft)   5.1   4.9   6.0   8.9   8.8   8.5   4.4   2.5   1.1

Table 30.6

Height Age 4   Height Age 6   Height Age 4   Height Age 6
104.4          118.4          98.1           112.8
104.0          119.4          100.6          115.2
92.1           103.9          100.5          115.8
103.3          116.8          102.7          117.3
98.4           113.1          98.5           113.3
96.5           110.0          98.8           109.3
105.3          119.3          102.3          117.9
103.2          118.6          99.0           112.2
105.9          123.2          100.2          112.9
97.4           110.2          100.3          113.4
103.4          118.7          99.6           112.6
101.7          119.2          109.8          124.5
105.4          120.2          100.2          113.7
104.4          119.2          99.6           115.2
100.7          112.6          104.1          117.1


a. Make a scatterplot of height at age 6 versus height at age 4. Determine the equation of the least squares line and add its graph to the scatterplot.

b. From regression output we get s_e = 1.38596 and s_b = 0.07437. Construct a 95% confidence interval for the population slope β. Interpret your confidence interval in the context of children’s growth.

Unit 31: One-Way ANOVA | Student Guide | Page 1

Summary of Video

A vase filled with coins takes center stage as the video begins. Students will be taking part in an experiment organized by psychology professor John Kelly in which they will guess the amount of money in the vase. As a subterfuge for the real purpose of the experiment, students are told that they are taking part in a study to test the theory of the “Wisdom of the Crowd,” which is that the average of all of the guesses will probably be more accurate than most of the individual guesses. However, the real purpose of the study is to see whether holding heavier or lighter clipboards while estimating the amount of money in the jar will have an impact on students’ guesses. The idea being tested is that physical experience can influence our thinking in ways we are unaware of – this phenomenon is called embodied cognition.

The sheet on which students will record their monetary guesses is clipped onto a clipboard. For the actual experiment, clipboards, each holding varying amounts of paper, weigh either one pound, two pounds or three pounds. Students are randomly assigned to clipboards and are unaware of any difference in the clipboards. After the data are collected, guesses are entered into a computer program and grouped according to the weights of the clipboards. The mean guess for each group is computed and the output is shown in Table 31.1.

Table 31.1. Average guesses by clipboard weight.

Money Guesses
Clipboard Weight   Mean      N     Standard Deviation
1                  $106.56   75    $100.62
2                  $129.79   75    $204.95
3                  $143.29   75    $213.13
Total              $126.55   225   $180.16

Looking at the means, the results appear very promising. As clipboard weight goes up, so does the mean of the guesses, and that pattern appears fairly linear (See Figure 31.1.).

Figure 31.1. Mean guess versus clipboard weight.

To test whether or not the apparent differences in means could be due simply to chance, John turns to a technique called a one-way analysis of variance, or ANOVA. The null hypothesis for the analysis of variance will be that there is no difference in population means for the three weights of clipboards: H0: µ1 = µ2 = µ3. He hopes to find sufficient evidence to reject the null hypothesis so that he can conclude that there is a significant difference among the population means. John runs an ANOVA using SPSS statistical software to compute a statistic called F, which is the ratio of two measures of variation:

$$ F = \frac{\text{variation among sample means}}{\text{variation within individual observations in the same sample}} $$

In this case, F = 0.796 with a p-value of 0.45. That means there is a 45% chance of getting an F value at least this extreme when there is no difference between the population means. So, the data from this experiment do not provide sufficient evidence to reject the null hypothesis.
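The F statistic reported here can be reconstructed from the summary statistics in Table 31.1. The Python sketch below is one way to do that by hand, under the usual definitions of the mean square for groups (MSG) and mean square error (MSE); the function names are made up for this example. The p-value line relies on a closed form that holds only when the numerator has 2 degrees of freedom, as it does with three groups, so it should not be reused for other designs.

```python
# Summary statistics from Table 31.1 (clipboard weights 1, 2, and 3).
means = [106.56, 129.79, 143.29]
sds = [100.62, 204.95, 213.13]
sizes = [75, 75, 75]


def one_way_f(means, sds, sizes):
    """Return (F, df_groups, df_error) from group summary statistics."""
    k = len(means)
    n_total = sum(sizes)
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_total
    # Variation among sample means.
    ssg = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    # Variation within samples, pooled across groups.
    sse = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds))
    msg = ssg / (k - 1)
    mse = sse / (n_total - k)
    return msg / mse, k - 1, n_total - k


def f_upper_tail_two_df(f, df_error):
    """P(F >= f) when the numerator has exactly 2 degrees of freedom."""
    return (1 + 2 * f / df_error) ** (-df_error / 2)


f_stat, df_groups, df_error = one_way_f(means, sds, sizes)
p_value = f_upper_tail_two_df(f_stat, df_error)
print(f"F = {f_stat:.3f} with df = ({df_groups}, {df_error}), p = {p_value:.2f}")
```

The large standard deviations within each group make MSE large relative to MSG, which is why F falls below 1 even though the sample means differ by tens of dollars.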

One of the underlying assumptions of ANOVA is that the data in each group are normally distributed. However, the boxplots in Figure 31.2 indicate that the data are skewed and include some rather extreme outliers. John’s students tried some statistical manipulations on the data to make them more normal and reran the ANOVA. However, the conclusion remained the same.

Figure 31.2. Boxplots of guesses grouped by clipboard weight.

But what if we used the data displayed in Figure 31.3 instead? The sample means are the same, around $107, $130, and $143, but this time the data are less spread out about those means.

[Figure 31.2 plot: boxplots of Money Guess ($0.00 to $1,600.00) for Clipboard Weight 1, 2, and 3.]

Figure 31.3. Hypothetical guess data.

In this case, after running ANOVA, the result is F = 33.316 with a p-value that is essentially zero. Our conclusion is to reject the null hypothesis and conclude that the population means are significantly different.

In John’s experiment, the harsh reality of a rigorous statistical analysis has shot down the idea that holding something heavy causes people, unconsciously, to make larger estimates, at least in this particular study. But if the real experiment didn’t work, what about the cover story – the theory of the Wisdom of the Crowd? The actual amount in the vase is $237.52. Figure 31.4 shows a histogram of all the guesses. The mean of the estimates is $129.22 – more than $100 off, but still better than about three-quarters of the individual guesses. So, the crowd was wiser than the people in it.

Figure 31.4. Histogram of guess data.




A. Be able to identify when analysis of variance (ANOVA) should be used and what the null and alternative (research) hypotheses are.

B. Be able to identify the factor(s) and response variable from a description of an experiment.

C. Understand the basic logic of an ANOVA. Be able to describe between-sample variability (measured by mean square for groups (MSG)) and within-sample variability (measured by mean square error (MSE)).

D. Know how to compute the F statistic and determine its degrees of freedom given the following summary statistics: sample sizes, sample means and sample standard deviations. Be able to use technology to compute the p-value for F.

E. Be able to use technology to produce an ANOVA table.

F. Recognize that statistically significant differences among population means depend on the size of the differences among the sample means, the amount of variation within the samples, and the sample sizes.

G. Recognize when underlying assumptions for ANOVA are reasonably met so that it is appropriate to run an ANOVA.

H. Be able to create appropriate graphic displays to support conclusions drawn from ANOVA output.


Content Overview

In Unit 27, Comparing Two Means, we compared two population means, the mean total energy expenditure for Hadza and Western women, and used a two-sample t-test to test whether or not the means were equal. But what if you wanted to compare three population means? In that case, you could use a statistical procedure called Analysis of Variance or ANOVA, which was developed by Ronald Fisher in 1918.

For example, suppose a statistics class wanted to test whether or not the amount of caffeine consumed affected memory. The variable caffeine is called a factor and students wanted to study how three levels of that factor affected the response variable, memory. Fifteen students were recruited to take part in the study. The participants were divided into three groups of five and randomly assigned to one of the following drinks:

A. Coca-Cola Classic (34 mg caffeine)

B. McDonald’s coffee (100 mg caffeine)

C. Jolt Energy (160 mg caffeine).

After drinking the caffeinated beverage, the participants were given a memory test (words remembered from a list). The results are given in Table 31.2.

Table 31.2. Number of words recalled in memory test.

Group A (34 mg)   Group B (100 mg)   Group C (160 mg)
7                 11                 14
8                 14                 12
10                14                 10
12                12                 16
7                 10                 13

For an ANOVA, the null hypothesis is that the population means among the groups are the same. In this case, H0: µA = µB = µC, where µA is the population mean number of words recalled after people drink Coca-Cola, and similarly for µB and µC. The alternative or research hypothesis is that there is some inequality among the three means. Notice that there is a lot of variation in the number of words remembered by the participants. We break that variation into two components:

(1) variation in the number of words recalled among the three groups, also called between-groups variation

(2) variation in the number of words recalled among participants within each group, also called within-groups variation.

To measure each of these components, we’ll compute two different variances, the mean square for groups (MSG) and the mean square error (MSE). The basic idea in gathering evidence to reject the null hypothesis is to show that the between-groups variation is substantially larger than the within-groups variation and we do that by forming the ratio, which we call F:

$$F = \frac{\text{between-groups variation}}{\text{within-groups variation}} = \frac{\text{MSG}}{\text{MSE}}$$

In the caffeine example, we have three groups. More generally, suppose there were k different groups (each assigned to consume varying amounts of caffeine) with sample sizes n1, n2, … nk. Then the null hypothesis is H0 : µ1 = µ2 = . . . = µk and the alternative hypothesis is that at least two of the population means differ. The formulas for computing the between-groups variation and within-groups variation are given below:

$$\text{MSG} = \frac{n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 + \cdots + n_k(\bar{x}_k - \bar{x})^2}{k - 1}$$

where $\bar{x}$ is the mean of all the observations and $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k$ are the sample means for each group.

$$\text{MSE} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2}{N - k}$$

where $N = n_1 + n_2 + \cdots + n_k$ and $s_1, s_2, \ldots, s_k$ are the sample standard deviations for each group.

When H0 is true, F = MSG/MSE has the F distribution with k − 1 and N − k degrees of freedom. We use the F distribution to calculate the p-value for the F-test statistic.
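The formulas for MSG and MSE translate directly into a few lines of code. The following Python sketch is an illustration only and is not part of the original unit; the function name `anova_from_summary` is made up for this example, and the grand mean is assumed to be the size-weighted average of the group means. As a check, the function is applied to the clipboard summary statistics in Table 31.1, for which the video reports F = 0.796.

```python
def anova_from_summary(sizes, means, sds):
    """Return (MSG, MSE, F, df_groups, df_error) from per-group summary statistics.

    Hypothetical helper for illustration: sizes, means, and sds are
    equal-length lists with one entry per group.
    """
    k = len(sizes)
    n_total = sum(sizes)
    # Grand mean as the size-weighted average of the group means.
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_total
    # Between-groups variation: mean square for groups.
    msg = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means)) / (k - 1)
    # Within-groups variation: mean square error.
    mse = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds)) / (n_total - k)
    return msg, mse, msg / mse, k - 1, n_total - k


# Summary statistics copied from Table 31.1 (clipboard weights 1, 2, 3).
sizes = [75, 75, 75]
means = [106.56, 129.79, 143.29]
sds = [100.62, 204.95, 213.13]

msg, mse, f_stat, df1, df2 = anova_from_summary(sizes, means, sds)
print(f"F = {f_stat:.3f} with {df1} and {df2} degrees of freedom")
```

Run on the Table 31.1 values, the sketch should give an F of about 0.796 with 2 and 222 degrees of freedom, in line with the value reported in the video.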

We return to our three-group caffeine experiment to see how this works. To begin, we calculate the sample means and standard deviations (See Table 31.3.).

Table 31.3. Group means and standard deviations.

Group   Sample Mean   Sample Standard Deviation
A       8.8           2.168
B       12.2          1.789
C       13.0          2.236


Before calculating the mean square for groups, we also need the grand mean of all the data values: $\bar{x} \approx 11.33$. Now we are ready to calculate MSG, MSE, and F:

$$\text{MSG} = \frac{5(8.8 - 11.33)^2 + 5(12.2 - 11.33)^2 + 5(13.0 - 11.33)^2}{3 - 1} = \frac{49.73}{2} \approx 24.87$$

$$\text{MSE} = \frac{(5 - 1)(2.168)^2 + (5 - 1)(1.789)^2 + (5 - 1)(2.236)^2}{15 - 3} = \frac{51.602}{12} \approx 4.30$$

$$F = \frac{24.87}{4.30} \approx 5.78$$

All that is left is to find the p-value. If the null hypothesis is true, then the F-statistic has the F distribution with 2 and 12 degrees of freedom. We use software to see how likely it would be to get an F value at least as extreme as 5.78. Figure 31.5 shows the result giving a p-value of around 0.017. Since p < 0.05, we conclude that the amount of caffeine consumed affected the mean memory score.
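The same calculation can be carried out from the raw data in Table 31.2. The sketch below uses only Python's standard library and is offered as an illustration, not as the method used in the unit. The p-value line relies on a shortcut that is an assumption of this example: it holds only when F has 2 numerator degrees of freedom (three groups), in which case the area to the right of F equals (1 + 2F/df2)^(−df2/2). For other numbers of groups, use statistical software as described in the text.

```python
from statistics import mean, variance

# Words recalled, copied from Table 31.2.
words = {
    "A": [7, 8, 10, 12, 7],
    "B": [11, 14, 14, 12, 10],
    "C": [14, 12, 10, 16, 13],
}

k = len(words)
all_values = [x for group in words.values() for x in group]
n_total = len(all_values)
grand_mean = mean(all_values)

# Between-groups variation (MSG) and within-groups variation (MSE).
msg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in words.values()) / (k - 1)
mse = sum((len(g) - 1) * variance(g) for g in words.values()) / (n_total - k)

f_stat = msg / mse
df1, df2 = k - 1, n_total - k

# Shortcut valid ONLY when df1 == 2: right-tail area of the F distribution.
p_value = (1 + 2 * f_stat / df2) ** (-df2 / 2)

print(f"MSG = {msg:.2f}, MSE = {mse:.2f}, F = {f_stat:.2f}")
print(f"df = ({df1}, {df2}), p-value = {p_value:.3f}")
```

The printed values should agree, up to rounding, with the hand calculation above (F of about 5.78) and with the p-value of around 0.017 shown in Figure 31.5.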

Figure 31.5. Finding the p-value from the F-distribution.

It takes a lot of work to compute F and find the p-value. Here’s where technology can help. Statistical software such as Minitab, spreadsheet software such as Excel, and even graphing calculators can calculate ANOVA tables. Table 31.4 shows output from Minitab. Now, match the calculations above with the values in Table 31.4. Check out where you can find the values for MSG, MSE, F, the degrees of freedom for F, and the p-value directly from the output of ANOVA. That will be a time saver!

(Figure 31.5 is a distribution plot of F with df1 = 2 and df2 = 12; the area under the density curve to the right of F = 5.78 is 0.01746.)


Table 31.4. ANOVA output from Minitab.

It is important to understand that ANOVA does not tell you which population means differ, only that at least two of the means differ. We would have to use other tests to help us decide which of the three population means are significantly different from each other. However, we can also get a clue by plotting the data. Figure 31.6 shows comparative dotplots for the number of words for each group. The sample means are marked with triangles. Notice that the biggest difference in sample means is between groups A (34 mg caffeine) and C (160 mg of caffeine). The sample means for groups B and C are quite close together. So, it looks as if consuming Coca Cola doesn’t give the memory boost you could expect from consuming coffee or Jolt Energy.

Figure 31.6. Comparative dotplots.

There is one last detail before jumping into running an ANOVA – there are some underlying assumptions that need to be checked in order for the results of the analysis to be valid. What we should have done first with our caffeine experiment, we will do last. Here are the three things to check.

1. Each group’s data need to be an independent random sample from that population. In the case of an experiment, the subjects need to be randomly assigned to the levels of the factor. Check: The subjects in the caffeine-memory experiment were divided into groups. Groups were then randomly assigned to the level of caffeine.




2. Next, each population has a Normal distribution. The results from ANOVA will be approximately correct as long as the sample group data are roughly normal. Problems can arise if the data are highly skewed or there are extreme outliers. Check: The normal quantile plots of Words Recalled for each group are shown in Figure 31.7. Based on these plots, it seems reasonable to assume these data are from a Normal distribution.

Figure 31.7. Checking the normality assumption.

3. All populations have the same standard deviation. The results from ANOVA will be approximately correct as long as the ratio of the largest standard deviation to the smallest standard deviation is less than 2. Check: The ratio of the largest to the smallest standard deviation is 2.236/1.789 or around 1.25, which is less than 2.
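The check in item 3 is easy to automate. A minimal sketch in Python, using the sample standard deviations from Table 31.3 (the variable names are chosen for illustration and the threshold of 2 is the rule of thumb stated above):

```python
# Sample standard deviations copied from Table 31.3.
sds = {"A": 2.168, "B": 1.789, "C": 2.236}

# Rule of thumb from the text: largest SD divided by smallest SD should be less than 2.
ratio = max(sds.values()) / min(sds.values())
equal_sd_reasonable = ratio < 2

print(f"ratio = {ratio:.2f}, assumption reasonable: {equal_sd_reasonable}")
```

For the caffeine data the ratio comes out at about 1.25, so the equal standard deviations assumption is reasonably met.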

(Figure 31.7 contains three normal quantile plots of Number of Words, one each for Groups A, B, and C, labeled "Normal - 95% CI," with Percent on the vertical axis.)


Key Terms

A factor is a variable that can be used to differentiate population groups. The levels of a factor are the possible values or settings that a factor can assume. The variable of interest is the response variable, which may be related to one or more factors.

An analysis of variance or ANOVA is a method of inference used to test whether or not three or more population means are equal. In a one-way ANOVA there is one factor that is thought to be related to the response variable.

An analysis of variance tests the equality of means by comparing two types of variation, between-groups variation and within-groups variation. Between-groups variation deals with the spread of the group sample means about the grand mean, the mean of all the observations. It is measured by the mean square for groups, MSG. Within-groups variation deals with the spread of individual data values within a group about the group mean. It is measured by the mean square error, MSE.

The F-statistic is the ratio MSG/MSE.


The Video


1. What is the theory called the Wisdom of the Crowd?

2. What was different about the clipboards that students were holding?

3. State the null hypothesis for the one-way ANOVA.

4. What is the name of the test statistic that results from ANOVA?

5. Was the professor able to conclude from the F-statistic that the population means differed depending on the weight of the clipboard? Explain.

6. Did the crowd prove to be wiser than the individual students?


Unit Activity: Controlling Wafer Thickness

You will use the Wafer Thickness tool to collect data for this activity. There are three control settings that affect wafer thickness during the manufacture of polished wafers used in the production of microchips.

1. Leave Controls 2 and 3 set at level 2. Your first task will be to perform an experiment to collect data and determine whether settings for Control 1 affect the mean thickness of polished wafers.

a. Open the Wafer Thickness tool. Set Control 1 to level 1, and Controls 2 and 3 to level 2 (the middle setting). In Real Time mode, collect data from 10 polished wafers. Store the data in a statistical package or Excel spreadsheet or in a calculator list. Make a sketch of the histogram produced by the interactive tool.

b. Set Control 1 to level 2. Leave Controls 2 and 3 set at level 2. Repeat (a). Sketch the second histogram using the same scales as were used on the first. Store the data in your spreadsheet or a calculator list.

c. Set Control 1 to 3. Leave Controls 2 and 3 set at level 2. Repeat (a). Sketch your third histogram, again using the same scales as were used on the first histogram. Store the data in your spreadsheet or a calculator list.

d. Calculate the means and standard deviations for each of your three samples. Based on the sample means and on your histograms, do you think that there is sufficient evidence that changing the level of Control 1 changes the mean thickness of the polished wafers produced? Or might these sample-mean differences be due simply to chance variation? Explain your thoughts.

e. Use technology to run an ANOVA. State the null hypothesis being tested, the value of F, the p-value, and your conclusion.

2. Your next task will be to perform an experiment to collect data and determine whether settings for Control 2 affect the mean thickness of polished wafers.

a. Leave Controls 1 and 3 set at level 2. Adapt the process used in question 1(a – c) to collect the data on Control 2.


b. Compute the standard deviations for the three samples. Is the underlying assumption of equal standard deviations reasonably satisfied? Explain.

c. Provided you answered yes to (b), use technology to run an ANOVA. State the null hypothesis being tested, the value of F, the p-value, and your conclusion. (If you answered no to (b), skip this part.)

3. Your final task will be to perform an experiment to collect data and determine whether settings for Control 3 affect the mean thickness of polished wafers.

a. Leave Controls 1 and 2 set at level 2. Adapt the process used in question 1(a – c) to collect the data on Control 3.

b. Compute the standard deviations for the three samples. Is the underlying assumption of equal standard deviations reasonably satisfied? Explain.

c. Provided you answered yes to (b), use technology to run an ANOVA. State the null hypothesis being tested, the value of F, the p-value, and your conclusion. (If you answered no to (b), skip this part.)


Exercises

1. A professor predicts that students will learn better if they study to white noise (similar to a fan) compared to music or no sound. She randomly divides 27 students into three groups and sends them to three different rooms. In the first room, students hear white noise, in the second, music from a local radio station, and in the third, the door is closed to help block out normal sound from the hall. Each group is given 30 minutes to study a section of their text after which they take a 10-question multiple-choice exam. Table 31.5 contains the results.

Table 31.5. Test results.

White Noise   Music   No Sound
8             5       4
5             7       7
5             3       6
7             5       2
8             5       4
7             8       3
8             4       5
3             5       5
10            7       4

a. Calculate the mean test score for each group. Calculate the standard deviation of the test scores for each group.

b. Make comparative dotplots for the test results of the three groups. Do you think that the dotplots give sufficient evidence that there is a difference in population mean test results depending on the type of noise? Explain.

c. Run an ANOVA. State the hypotheses you are testing. Show the calculations for the F-statistic. What are the degrees of freedom associated with this F-statistic?

d. What is the p-value of the test statistic? What is your conclusion?

2. Not all hotdogs have the same calories. Table 31.6 contains calorie data on a random sample of Beef, Poultry, and Veggie dogs. (One extreme outlier for Veggie dogs was omitted from the data.) Does the mean calorie count differ depending on the type of hotdog? You first encountered this topic in Unit 5, Boxplots.


a. Verify that the standard deviations allow the use of ANOVA to compare population means.

b. Use technology to run an ANOVA. State the value of the F-statistic, the degrees of freedom for F, the p-value of the test, and your conclusion.

c. Make boxplots that compare the calorie data for each type of hot dog. Add a dot to each boxplot to mark the sample means. Do your plots help confirm your conclusion in (b)?

Table 31.6. Calorie content of hotdogs.

Calories
Beef   Poultry   Veggie
110    60        40
110    60        45
130    60        45
130    70        45
140    70        50
150    70        50
160    80        55
160    90        57
170    90        60
170    100       60
175    100       70
180    100       80
180    110       80
180    110       81
190    110       90
190    120       95
190    120       100
200    130       100
210    140       110
230    150

3. Many states rate their high schools using factors such as students’ performance, teachers’ educational backgrounds, and socioeconomic conditions. High school ratings for one state have been boiled down into three categories: high, medium, and low. The question for one of the state universities is whether or not college grade performance differs depending on high school rating. Table 31.7 contains random samples of students from each high school rating level and their first-year cumulative college grade point averages (GPA).

Table 31.7. First-year college GPA by high school rating.

First-Year Cumulative Grade Point Average
High Rating   Medium Rating   Low Rating
3.37          2.92            3.19
3.28          2.11            1.57
1.73          3.92            0.99
3.64          2.83            3.58
3.04          3.26            3.87
2.80          3.18            1.43
3.83          2.28            2.50
3.22          3.13            3.45
3.55          2.90            2.63
2.28          3.41            2.59
2.51          2.64            3.50
1.74          3.71            1.95
2.88          2.52            2.21
2.86          2.24            2.50
4.00          0.98            3.35
2.67          3.65            2.88
3.75          2.87            1.27
2.30          2.21            3.10
3.24          3.02            3.29
2.43          3.66            1.06

a. Calculate the sample means for the GPAs in each group. Based on the sample means alone, does high school rating appear to have an impact on mean college GPA? Explain.

b. Check to see that underlying assumptions for ANOVA are reasonably satisfied.


c. Run an ANOVA to test whether there is a significant difference among the population mean GPAs for the three groups. Report the value of the F-statistic, the p-value, and your conclusion.

4. Researchers in a nursing school of a large university conducted a study to determine if differences exist in levels of active and collaborative learning (ACL) among nursing students, other health professional students, and students majoring in education. A random sample of 1,000 students from each of these three majors was selected from students who completed the National Survey of Student Engagement (NSSE).

a. The sample mean ACL scores for nursing, other health professional students, and education majors were 46.44, 45.58, and 48.59, respectively. Do these sample means provide sufficient evidence to conclude that there was some difference in population mean ACL scores among these three majors? Explain.

b. A one-way analysis of variance was run to determine if there was a difference among the three groups on mean ACL scores. Assuming that all students answered the NSSE questions related to ACL, what were the degrees of freedom of the F-test?

c. The results from the ANOVA gave F = 8.382. Determine the p-value. What can you conclude?


Review Questions

1. Random samples of three types of candy were given to children. The three types of candy were chocolate bars (A), hard candy (B), and chewy candy (C). A group of 15 children rated samples of each type of candy on a scale from 1 (lowest) to 10 (highest). Two hypothetical data sets are given in Tables 31.8 and 31.9.

a. Find the sample means of each candy type based on the ratings in Table 31.8. Then do the same for the ratings in Table 31.9. Based on these results, can you tell if there is a significant difference in population mean ratings among the different types of candies? Explain.

b. Make comparative boxplots for the data in Table 31.8. Then do the same for the data in Table 31.9. For both sets of plots, mark the mean with a dot on each boxplot. For which data set is it more likely that the results from a one-way ANOVA will be significant? Explain.

c. Run an ANOVA based on Data Set #1. Report the value of the F-statistic, the p-value, and your conclusion. Then do the same for Data Set #2. Explain why you should not be surprised by the results.

Table 31.8. Data set #1.

Ratings for A   Ratings for B   Ratings for C
8               4               6
10              5               5
7               7               7
8               8               5
6               7               6
7               8               5
4               6               6
7               5               5
6               5               4
8               6               2
6               6               3
5               7               4
6               3               5
7               8               5
8               5               6

Table 31.9. Data set #2.

Ratings for A   Ratings for B   Ratings for C
8               4               6
10              5               5
7               6               8
8               9               2
3               6               7
7               8               3
4               6               7
9               5               5
6               4               4
8               7               2
5               6               3
4               9               4
6               3               3
8               9               7
10              3               8

2. The data in Table 31.10 were part of a study to investigate online questionnaire design. The researcher was interested in the effect that type of answer entry and type of question-to-question navigation would have on the time it would take to complete online surveys. Twenty-seven volunteers participated in this study. Each completed two questionnaires, one dealing with credit and the other focusing on vacations. Each questionnaire had 14 questions, and participants could select only one answer.

There were three display types for answers:

(1) radio button

(2) drop down list

(3) list box

There were three navigational methods:

(1) questions were on a single page

(2) click the Next/Prev button

(3) press Tab

Table 31.10. Time to complete on-line questionnaires.

Display Type   Navigation Type   Time (sec)     Display Type   Navigation Type   Time (sec)
1              1                 97             1              3                 117
3              3                 83             2              2                 74
1              1                 102            1              3                 66
3              3                 85             3              1                 62
1              1                 92             1              2                 93
3              3                 71             3              1                 62
1              2                 105            1              2                 64
3              3                 92             3              1                 48
1              2                 67             1              2                 57
3              3                 71             3              1                 96
1              2                 54             2              3                 68
3              3                 66             3              1                 90
1              3                 63             2              3                 71
2              1                 61             3              1                 74
1              3                 101            2              3                 74
2              1                 117            3              2                 78
1              3                 124            2              3                 92
2              1                 97             3              2                 71
2              1                 126            2              3                 80
3              2                 83             3              2                 49
2              1                 107            2              3                 67
3              2                 88             1              1                 101
2              1                 88             2              2                 111
3              2                 62             1              1                 103
2              2                 55             2              2                 80
1              3                 73             1              1                 103
2              2                 126            2              2                 111


a. Enter the data in Table 31.10 into statistical software, an Excel spreadsheet, or a graphing calculator. The display types and navigation types have been coded as numbers to facilitate data entry. Once the data are entered, you can replace the coded values with their categorical values. (For example, in Display Type replace 1 with radio button, 2 with drop down list, and 3 with list box.)

Display Type and Time

b. Make comparative boxplots of the times for each level of Display Type. Mark the location of the means on your boxplot. Do you see anything unusual in the data that might make it not appropriate to use ANOVA? If so, follow up with normal quantile plots to check the assumption of normality.

c. Run an ANOVA using Display Type as the factor. State the null hypothesis you are testing. Report the value of the F-statistic, the p-value, and your conclusion.

d. Make comparative boxplots of the times for each level of Navigation Type. Mark the location of the means on your boxplot. Do you see anything unusual in these data that might make it not appropriate to use ANOVA? If so, follow up with normal quantile plots to check the assumption of normality.

e. Run an ANOVA using Navigation Type as the factor. Report the value of the F-statistic, the degrees of freedom of the F-statistic, the p-value, and your conclusion.

f. Based on this study, what recommendations would you make to online questionnaire designers?

3. A group researching wage discrepancies among the four regions of the U.S. focused on full-time, hourly-wage workers between the ages of 20 and 40. Researchers randomly selected 200 workers meeting the age criteria from each of the northeast, midwest, south, and west regions and recorded their hourly pay rates. The mean hourly rate for the combined regions was $15.467. A summary of the data is given in Table 31.11. The researchers ran an ANOVA on these data.

Table 31.11. Summary of hourly rate data.

Region      Sample Size   Sample Mean   Standard Deviation
Northeast   200           16.560        9.164
Midwest     200           15.154        6.381
South       200           13.931        6.933
West        200           16.223        9.289


a. Is the homogeneous standard deviations assumption for ANOVA reasonably satisfied? Explain.

b. State the researchers’ null hypothesis.

c. Calculate the value of the F-statistic and give its degrees of freedom. Show calculations.

d. Determine the p-value.

e. Based on the evidence in Table 31.11 and your answers to (a – d), what conclusions can the researchers make?

4. A study focusing on women’s wages was investigating whether there was a significant difference in salaries in four occupations commonly (but not exclusively) held by women – cashier, customer service representative, receptionist, and secretary/administrative assistant. Weekly wages from 50 women working in each occupation are recorded in Table 31.12.

Table 31.12. Weekly wages of women in four occupations. Data from 2012 March Supplement, Current Population Survey.

385 320 440 473 673 807 640 850427 450 540 400 232 2038 600 580480 333 222 673 523 529 420 840520 300 1000 84 380 520 554 522540 1200 625 769 715 600 577 528690 240 540 1428 596 383 712 382680 288 600 289 402 400 447 800690 315 738 1885 787 945 380 440364 548 344 429 238 400 1384 92113 369 800 555 538 620 705 877360 340 481 480 360 969 600 923360 350 340 788 500 450 225 11542885 387 720 1480 430 577 440 481720 345 885 596 540 440 1058 812360 350 680 600 528 650 673 560297 540 760 920 568 672 769 918300 548 700 624 561 400 692 600340 400 560 340 603 500 826 227360 439 800 769 510 420 900 865511 320 431 400 400 345 1900 640508 315 500 390 500 919 769 692321 302 1162 615 520 400 746 654400 665 400 481 1280 480 1384 543290 220 400 640 560 188 360 320729 331 1270 430 386 680 415 323

Table 31.12

Cashier Customer Service Receptionist Secretary/Adm. Asst.


a. Calculate the means and standard deviations for the weekly wages in each occupation category.

b. Is the homogeneous standard deviations assumption for ANOVA reasonably satisfied? Explain.

c. Run an ANOVA. Record the ANOVA table and highlight the value of F, and the p-value.

d. Based on your answers to (a – c), what are your conclusions?
