1999 by CRC Press LLC

chapter nine

Sampling

Michael S. Broida Miami University

Contents

9.1 Purpose
9.2 Strengths, weaknesses, and limitations
9.3 Inputs and related ideas
9.4 Concepts
    9.4.1 Why sample?
    9.4.2 Sample size and sampling error
    9.4.3 Bias
    9.4.4 Random sampling
    9.4.5 Random-like samples
    9.4.6 Stratified random sampling
        9.4.6.1 Proportional allocation
        9.4.6.2 Optimal allocation
9.5 Key terms
9.6 Software
9.7 References

9.1 Purpose

Sampling is a technique for obtaining an estimate from a population by studying, measuring, or interviewing a subset (or sample) of that population. This chapter discusses basic sampling concepts.

9.2 Strengths, weaknesses, and limitations

A well-selected sample yields an estimate of the target parameters in much less time and at much less cost than studying, measuring, or interviewing the entire population (conducting a census). It is often impossible to achieve 100 percent response because some of the entities to be studied, measured, or interviewed are unavailable or do not respond. A sample is sometimes more accurate than a census because obtaining numerous measurements introduces errors owing to fatigue, inaccurate or inconsistent data entry, and the use of less qualified personnel.

The sample answer, called an estimate, is almost never exactly the same as the corresponding population value. (This difference is called error.) Additionally, before a statistically valid sample can be selected, a great deal of information about the population must be available.

9.3 Inputs and related ideas

Before drawing a sample, it is necessary to define the specific information being sought and the population from which the sample will be drawn. For example, if an analyst needs information about perceived weaknesses in the existing sales order tracking system, the population would consist of all the people who use the existing system.

Sampling can be used to select the subset of a population to be interviewed (Chapter 8), the members of a JAD team (Chapter 14), or the members of an inspection team (Chapter 23). Sampling is an effective way to study an existing system by selecting the entities, transactions, occurrences, or personnel to be observed and measured. Sampling is an effective tool for estimating population characteristics when using such mathematical tools as simulation (Chapter 19) and queuing theory (Chapter 79). During the testing phase of the system development life cycle (Part VII), sampling is used to generate test data and select the specific events to be monitored. During the operation and maintenance phase (Part VIII), sampling is an effective tool for evaluating and monitoring performance and for implementing system controls (Chapter 77). For example, quality control is often implemented by taking random samples of a process. Sometimes the estimates generated by sampling a process are plotted on a control chart (Chapter 10) to determine if the process is in control.

9.4 Concepts

Sampling is a technique for obtaining an estimate from a population by studying, measuring, or interviewing a subset (or sample) of the population. This chapter discusses basic sampling concepts.

9.4.1 Why sample?

Every year, Consumer Reports magazine conducts tests on new automobiles and reports its findings to its readers. Given the (literally) millions of automobiles that roll off the assembly lines every year, testing the entire population would be incredibly time consuming, prohibitively expensive, and practically impossible, so the test results are based on a sample.

In many cases, testing a sample is actually more accurate than testing the entire population. A tester’s reactions and perceptions are likely to change between the first car and the tenth car, if only because of fatigue. Multiple tests mean considerable data, and data entry errors are inevitable. Multiple tests also imply multiple testers, not all of whom are equally skilled. Finally, the test conditions and criteria will almost certainly change over time. For example, if enough cars are crashed into a barrier, the barrier will eventually be deformed, thus changing the test conditions.

If the sample is drawn properly, it is reasonable to assume that the sample estimate reflects the population. The balance of this chapter discusses the process of drawing a good sample.

9.4.2 Sample size and sampling error

The difference between the sample estimate and the true population value is called error. As a general rule, the sampling error decreases as the sample size increases. For example, assuming a 95 percent confidence level, a sample of 1,000 voters might predict the outcome of an election with an error of slightly more than plus or minus 3 percent. Increase the sample size to 4,000, and the error drops to roughly plus or minus 1.5 percent, while a sample size of 10,000 reduces the error to less than plus or minus 1 percent.
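The polling figures above can be checked with the standard margin-of-error expression for a sample proportion, E = z·sqrt(p(1 − p)/n). The sketch below is only illustrative: it assumes the worst case p = 0.5 (the text does not say which proportion was used), simple random sampling from a large population, and z = 1.96 for 95 percent. The function name is hypothetical.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes simple random sampling from a large population. The default
    p = 0.5 is the worst case and is an assumption, not a value from the text.
    """
    return z * math.sqrt(p * (1 - p) / n)

for n in (1_000, 4_000, 10_000):
    print(f"n = {n:>6}: +/- {100 * margin_of_error(n):.2f} percent")
```

Under these assumptions the three sample sizes give about 3.10, 1.55, and 0.98 percent. Because the error shrinks with the square root of n, quadrupling the sample size only halves the error.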

A useful formula for computing the sample size is:

\[ n = \frac{z^2 \sigma^2}{E^2}, \qquad (9.1) \]

where z is a number from the normal distribution table that corresponds to the desired confidence level, σ is the standard deviation of the population as estimated by the sample standard deviation, and E is the maximum acceptable error between the sample mean and the actual population mean. For a 95 percent confidence level, use z = 1.96. For a 99 percent confidence level, use z = 2.575. As a practical matter, one-fifth the sample range can be used as an estimate of the standard deviation.

For example, suppose you want to estimate the average amount of money a state university student spends on food and beverages in an average week. The maximum acceptable error is $2. Based on a preliminary sample, σ is estimated to be $8. The desired confidence level is 95 percent. Plugging those numbers into Equation (9.1) suggests a sample size of:

\[ n = \frac{(1.96^2)(8^2)}{2^2} \approx 61.47 \]

or 62 students. (It is impossible to sample a fractional student, and rounding up yields a confidence level slightly higher than 95 percent.) Assuming the students answer truthfully, averaging the weekly food expenditures of 62 randomly selected university students will yield a value that is within $2 of the population average with 95 percent confidence. To put it another way, there is a 0.95 probability that the sample mean will lie within $2 of the true mean. (Note: A real statistician would probably argue that the last statement is not technically correct, but in most cases it is a reasonable way to visualize a confidence interval.)
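Equation (9.1) is easy to evaluate programmatically. The following is a minimal sketch using the student-spending inputs (σ = $8, E = $2, z = 1.96). The function name is hypothetical, and rounding up with `math.ceil` reflects the rule that a fractional unit cannot be sampled.

```python
import math

def required_sample_size(sigma, max_error, z=1.96):
    """Equation (9.1): n = z^2 * sigma^2 / E^2.

    Returns the exact value and the value rounded up to a whole unit.
    """
    n_exact = (z ** 2) * (sigma ** 2) / (max_error ** 2)
    return n_exact, math.ceil(n_exact)

n_exact, n_students = required_sample_size(sigma=8.0, max_error=2.0)
print(f"exact n = {n_exact:.3f}, students to sample = {n_students}")
```

With these inputs the expression evaluates to about 61.47, which rounds up to 62. Passing z = 2.575 instead shows how much a 99 percent confidence level raises the requirement.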

9.4.3 Bias

Simply selecting the right sample size is not enough, however. For example, a sample taken outside an expensive restaurant and a sample taken outside a food bank will almost certainly yield two very different (and equally invalid) estimates of the weekly food expenditures of university students because those samples are likely to be biased. A biased sample systematically favors some members of the population over others. To cite another example, if a telephone book is used to select a sample, people with unlisted numbers, people who have recently moved into that telephone market, and people with no telephone are automatically excluded from the sample.

Non-response bias occurs when one or more members of the selected group are not included in the sample. A survey that includes information only from people who answer their telephones at a certain time of day excludes one subset of the population. Dismissing or excluding people who refuse to answer certain questions is another source of non-response bias. Be aware of non-response bias. Before taking a sample, study the sampling process, identify subsets of the population that might be excluded or choose not to participate, and adjust the sampling process as necessary.

9.4.4 Random sampling

One relatively easy way to avoid introducing bias is to sample randomly. A sample is considered random if each member of the population has the same chance of being selected. Random samples yield unbiased estimates. Generally, an unbiased estimate is high about half the time and low about half the time.

There are two commonly used techniques for selecting a random sample. If the population is small, the members (or slips of paper representing each member) can be mixed thoroughly and the sample selected directly (like bingo markers or lottery tickets). For larger populations, assign each member a number and use a random number generator or a table of random numbers to select the sample.
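The second technique (number the members, then let a random number generator pick) might be sketched as follows. The population size, sample size, and function name are illustrative assumptions. Python's standard `random.sample` draws without replacement, so no member is chosen twice.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Return sorted member numbers (1..population_size) drawn without replacement."""
    rng = random.Random(seed)  # the seed is only for reproducibility
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

# Hypothetical example: pick 25 members from a numbered population of 5,000.
chosen = simple_random_sample(population_size=5_000, sample_size=25, seed=42)
print(chosen)
```

Fixing the seed makes the draw repeatable for auditing. Omitting it gives a fresh sample on every run.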

9.4.5 Random-like samples

In cases where it is impossible or inconvenient to select a true random sample, the objective is to generate estimates that behave as though they were based on a random sample. The key to successful, almost random sampling is to avoid introducing bias. For example, imagine a grocer inspecting a shipment of fruit. An estimate based on a sample taken from a single box or even from the tops of several boxes is unlikely to accurately reflect the quality of all the fruit. However, if the grocer selects several boxes and then selects fruit from the top, the middle, and the bottom of each, the sample is likely to be random-like.

On an assembly line, selecting every tenth, hundredth, or thousandth item (generally, every nth item) as it flows by might be an effective way to select a random-like sample. Another option is to select every (m ± n)th item, where m is the base interval and n is a random number (for example, every (100 ± 5)th item).
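Both assembly-line schemes can be sketched in a few lines. This is an illustration under assumptions, not a prescribed procedure: the function name, the `jitter` parameter, and the choice to redraw the random offset for every gap are all hypothetical.

```python
import random

def systematic_sample(items, interval, jitter=0, seed=None):
    """Select roughly every `interval`-th item.

    With jitter > 0, each gap is interval plus a random offset in
    [-jitter, +jitter], which makes the selection less predictable.
    """
    if interval - jitter < 1:
        raise ValueError("interval - jitter must be at least 1")
    rng = random.Random(seed)
    selected = []
    # Zero-based index of the first selected item.
    position = interval + rng.randint(-jitter, jitter) - 1
    while position < len(items):
        selected.append(items[position])
        position += interval + rng.randint(-jitter, jitter)
    return selected

line = list(range(1, 1001))  # items numbered 1..1000 as they flow by
every_100th = systematic_sample(line, 100)
jittered = systematic_sample(line, 100, jitter=5, seed=1)
print(every_100th)
print(jittered)
```

With `jitter=0` the function returns items 100, 200, and so on. With `jitter=5` consecutive selections are between 95 and 105 items apart.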

Avoid predictability when sampling human beings, however, because it often introduces bias. For example, if the boss walks through the work area every hour on the hour, he or she is likely to find everyone hard at work. If another boss were to use a random number table to define the times for random visits to the work area, he or she is likely to gain a more accurate picture of the employees’ work habits.

9.4.6 Stratified random sampling

With stratified random sampling, a population of size N is divided into m subgroups. Each subgroup is called a stratum, and each member of the population must lie in exactly one stratum. For example, dividing a group of people by sex yields two strata (male and female); dividing a group of voters into Democrat, Republican, Independent, and Socialist yields four strata; and comparing the products produced on the first, second, and third shifts calls for three strata. Samples are taken randomly within each stratum.

Stratified random sampling is important if the different strata have different means and/or different levels of variability. For example, suppose the newer, relatively inexperienced employees who work the third shift produce markedly more errors than the people who work the other two shifts. In such cases, stratified sampling tends to yield more accurate estimates than simple random sampling.
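Drawing a stratified sample amounts to running a separate simple random draw inside each stratum. A minimal sketch, assuming the strata are already available as lists; the employee IDs, the per-stratum sample sizes, and the function name are all made up for illustration.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample (without replacement) inside each stratum.

    strata: dict mapping stratum name -> list of members
    sizes:  dict mapping stratum name -> number of members to draw
    """
    rng = random.Random(seed)  # the seed is only for reproducibility
    return {name: rng.sample(members, sizes[name])
            for name, members in strata.items()}

# Hypothetical employee IDs for three shifts (illustrative data only).
shifts = {
    "first": [f"A{i:03d}" for i in range(100)],
    "second": [f"B{i:03d}" for i in range(60)],
    "third": [f"C{i:03d}" for i in range(40)],
}
picked = stratified_sample(shifts, {"first": 5, "second": 5, "third": 5}, seed=7)
for shift, members in picked.items():
    print(shift, members)
```

The equal sample sizes here are arbitrary. How many members to take from each stratum is a separate allocation decision.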

9.4.6.1 Proportional allocation

One technique for distributing a sample across several strata is called proportional allocation. If 200 employees are distributed over three shifts with 100 on first shift, 60 on second shift, and 40 on third shift, a reasonable sample distribution might be 50 percent first shift, 30 percent second shift, and 20 percent third shift.
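As a sketch of proportional allocation, the function below splits a total sample across strata in proportion to stratum size. The total sample size of 40 is an assumption for illustration (the text gives only percentages), and the largest-remainder rounding step is one reasonable way to keep the parts summing to the total, not something the text specifies.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to stratum size.

    Uses largest-remainder rounding so the parts always add up to total_sample.
    """
    population = sum(stratum_sizes.values())
    exact = {k: total_sample * v / population for k, v in stratum_sizes.items()}
    alloc = {k: int(x) for k, x in exact.items()}  # round down first
    leftover = total_sample - sum(alloc.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_fraction[:leftover]:
        alloc[k] += 1
    return alloc

shift_sizes = {"first": 100, "second": 60, "third": 40}
allocation = proportional_allocation(shift_sizes, total_sample=40)
print(allocation)
```

For a sample of 40 this yields 20, 12, and 8 employees, matching the 50, 30, and 20 percent split.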

9.4.6.2 Optimal allocation

If one stratum exhibits significantly more variability than the others, proportionally more samples should be taken from the inconsistent stratum. Also, if one stratum is more costly to measure or interview than another, proportionally fewer samples should be taken from the expensive stratum.

Optimal allocation is a technique for distributing a sample across several strata that considers variability and cost. The optimum allocation formula is:

\[ \frac{n_i}{n} = \frac{W_i \sigma_i / C_i^{1/2}}{\sum \left[ W_i \sigma_i / C_i^{1/2} \right]}, \qquad (9.2) \]

where n_i is the number of samples in stratum i, n is the total sample size, W_i is the proportion of the population in stratum i (expressed as a fraction), σ_i is the standard deviation of stratum i, and C_i is the cost to sample stratum i. The sum in the denominator runs over all strata. The formula calculates a relatively larger sample size for a given stratum if its variability (measured by σ_i) is higher than average or if the cost of sampling from that stratum is lower than average.

For example, suppose n, the total sample size, is 500. The population is divided among three strata, with costs to sample of $3, $4, and $5 per item for strata 1, 2, and 3 respectively (C_1 = $3, C_2 = $4, and C_3 = $5). Stratum 1 contains 50 percent of the population (W_1 = 0.5), stratum 2 contains 30 percent of the population (W_2 = 0.3), and stratum 3 contains 20 percent of the population (W_3 = 0.2). Finally, the estimated standard deviations for the three strata are σ_1 = 1.5, σ_2 = 2, and σ_3 = 2.5.

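First calculate

\[
\begin{aligned}
\sum \left[ W_i \sigma_i / C_i^{1/2} \right]
  &= \frac{W_1 \sigma_1}{C_1^{1/2}} + \frac{W_2 \sigma_2}{C_2^{1/2}} + \frac{W_3 \sigma_3}{C_3^{1/2}} \\
  &= \frac{0.5(1.5)}{3^{1/2}} + \frac{0.3(2)}{4^{1/2}} + \frac{0.2(2.5)}{5^{1/2}} \\
  &\approx 0.433 + 0.300 + 0.224 = 0.957.
\end{aligned}
\]

Next, compute

\[
\begin{aligned}
n_1/n &= 0.433/0.957 = 0.452 \\
n_2/n &= 0.300/0.957 = 0.314 \\
n_3/n &= 0.224/0.957 = 0.234.
\end{aligned}
\]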

Those numbers suggest that n_1 (the stratum 1 sample size) should be 45.2 percent (or 226 units) of the total sample size (500 items), n_2 should be 31.4 percent (or 157 units), and n_3 should be 23.4 percent (or 117 units).
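The worked example can be reproduced with a short sketch of Equation (9.2). The function name is hypothetical, and the simple rounding used to turn fractions into unit counts is an assumption: in general it is not guaranteed to sum exactly to n, although it does for these inputs.

```python
import math

def optimal_allocation(total_sample, weights, std_devs, costs):
    """Equation (9.2): n_i / n is proportional to W_i * sigma_i / sqrt(C_i)."""
    terms = [w * s / math.sqrt(c) for w, s, c in zip(weights, std_devs, costs)]
    denominator = sum(terms)
    fractions = [t / denominator for t in terms]
    # Plain rounding; may need adjusting if the counts do not sum to total_sample.
    counts = [round(total_sample * f) for f in fractions]
    return fractions, counts

fractions, counts = optimal_allocation(
    total_sample=500,
    weights=[0.5, 0.3, 0.2],    # W_1, W_2, W_3
    std_devs=[1.5, 2.0, 2.5],   # sigma_1, sigma_2, sigma_3
    costs=[3.0, 4.0, 5.0],      # C_1, C_2, C_3
)
print([f"{f:.3f}" for f in fractions], counts)
```

For these inputs the counts come out as 226, 157, and 117 units, which add up to the total sample of 500.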

9.5 Key terms

Bias — Any factor that systematically favors some members of the population over others when a sample is drawn.
Census — A set of measurements (or interviews) for every element of a population.
Confidence interval — A range of numbers around an estimate that contains the corresponding population parameter with the stated probability. For example, a 95 percent confidence interval for an estimate of the population mean is a range of numbers that contains the population mean with 95 percent certainty.
Error — The difference between the value of a parameter as estimated by a sample and the actual value of that parameter for the entire population.
Estimate — A value of a parameter determined by a sample.
Mean — An arithmetic average; the sum of all the observations divided by the number of observations.
Non-response bias — A form of bias that occurs when one or more members of the selected group are not included or choose not to participate in the sample.
Population — The entire set of relevant entities or measurements.
Random sample — A sample in which each item in the population has the same chance of being selected.
Range — The difference between the highest value and the lowest value in a set of measurements.
Sample — A selected subset of a population.
Standard deviation — The square root of the variance.
Strata — The set of subgroups in a stratified random sample.
Stratified random sampling — A random sampling technique in which the population is divided into subgroups called strata such that each element of the population lies in exactly one stratum; samples are taken randomly within each stratum.
Stratum — A single subgroup in a stratified random sample.
Unbiased estimate — An estimate that is high about half the time and low about half the time.
Variance — The average of the squared differences between the individual population values and the population mean.

9.6 Software

Random number tables are found in many statistics textbooks and/or in the software packages that accompany those books. Random number functions are found in most spreadsheet programs. SAS users can generate random observations from a binomial distribution (RANBIN), an exponential distribution (RANEXP), a normal distribution (RANNOR), a Poisson distribution (RANPOI), or a uniform distribution (RANUNI). Minitab for Windows users should check the RANDOM DATA submenu on the CALC pull-down menu.

9.7 References

1. Aczel, A. D., Complete Business Statistics, Irwin, Homewood, IL, 1989, chap. 16.
2. Badarinathi, R., Introduction to SAS, Dryden Press, New York, 1992, 21.
3. Bowerman, B. L. and O’Connell, R. T., Applied Statistics: Improving Business Processes, Irwin, Chicago, 1997.