CI for a Proportion


  • 7/23/2019 CI for a Proportion


    Confidence Interval for a Proportion

    Example

    74% of a company's customers would like to see new product packaging. A random sample of 50 customers is taken. $X$ is the number of customers in the sample who would like to see the new packaging; then the sample proportion is $\hat{p} = X/n$. The mean and standard deviation for $\hat{p}$ are

    Mean of $\hat{p}$: $\mu_{\hat{p}} = p = 0.74$

    Standard deviation of $\hat{p}$:

    $$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.74(0.26)}{50}} = 0.06203$$

    Since $np = 50(0.74) = 37$ and $n(1-p) = 50(0.26) = 13$ are both at least 10, the distribution of $\hat{p}$ is approximately Normal. Using the Normal approximation we know that the probability is approximately 95% that $\hat{p}$ falls within 1.96 standard deviations of the mean.

    95% of all samples have a sample proportion $\hat{p}$ between $0.74 - 1.96(0.06203) = 0.618$ and $0.74 + 1.96(0.06203) = 0.862$.

    As $1.96(0.06203) = 0.1216$ we could equivalently state "$\hat{p}$ is within 0.122 of 0.74."

    For this situation approximately 95% of all possible samples yield a proportion $\hat{p}$ within 0.122 of 0.740 (within 12.2 percentage points of 74.0%).

    In general, provided a Normal approximation can be used, 95% of all possible samples yield a proportion $\hat{p}$ within $1.96\sqrt{p(1-p)/n}$ of $p$.
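    As a quick numerical check of the packaging example, the sketch below recomputes the standard deviation of $\hat{p}$, the Normal-approximation condition, and the 95% range. The variable names are illustrative choices, not taken from the original notes.

```python
import math

p = 0.74   # population proportion from the example
n = 50     # sample size from the example

mean_p_hat = p
sd_p_hat = math.sqrt(p * (1 - p) / n)

# The Normal approximation is used when np and n(1 - p) are both at least 10
normal_ok = n * p >= 10 and n * (1 - p) >= 10

margin = 1.96 * sd_p_hat
low, high = p - margin, p + margin

print(f"sd = {sd_p_hat:.5f}, margin = {margin:.4f}, Normal OK: {normal_ok}")
print(f"about 95% of samples give p-hat between {low:.3f} and {high:.3f}")
```

    Changing `p` or `n` shows how quickly the range narrows as the sample grows.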

    An approximate 95% Confidence Interval

    The section begins with a description of steps that lead to a usable result. (A rigorous treatment of the issue requires a good deal of mathematical statistics.) The explanations provided below are simplified.

    We've observed the following:

    For approximately 95% of all samples $\hat{p}$ is within $1.96\sqrt{p(1-p)/n}$ of $p$.

    Flip-flopping $p$ and $\hat{p}$ yields the following statement (which is true):

    For approximately 95% of all samples $p$ is within $1.96\sqrt{\hat{p}(1-\hat{p})/n}$ of $\hat{p}$.

    In this second statement the interval is random. Different samples yield different values for $\hat{p}$, which result in different intervals.
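    The claim that roughly 95% of samples produce an interval containing $p$ can be explored by simulation. The following is a minimal sketch, assuming the packaging example's values ($p = 0.74$, $n = 50$); the function names, the number of trials, and the fixed seed are arbitrary choices made here for illustration.

```python
import math
import random

def interval_covers(p, n, rng):
    """Draw one sample of size n and report whether p-hat +/- E contains p."""
    x = sum(1 for _ in range(n) if rng.random() < p)
    p_hat = x / n
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin <= p <= p_hat + margin

def coverage(p, n, trials, seed=0):
    """Fraction of simulated samples whose interval contains the true p."""
    rng = random.Random(seed)
    hits = sum(interval_covers(p, n, rng) for _ in range(trials))
    return hits / trials

rate = coverage(p=0.74, n=50, trials=5000)
print(f"fraction of intervals covering p: {rate:.3f}")
```

    The fraction should land near 0.95 rather than exactly on it, since both the method and the simulation are approximate.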


    The Result: a 95% Confidence Interval for p

    For one sample yielding a result $\hat{p}$, the interval

    $$\hat{p} \pm E \quad \text{where} \quad E = 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

    forms an approximate 95% confidence interval for $p$. $\hat{p}$ is the point estimate of $p$; $E$ is the error margin associated with the estimate.

    That is: (approximately) 95% of all random samples (of size $n$) produce an interval including $p$ within the bounds. When we obtain the random sample and compute the interval from it, we no longer have anything random. At this point we state that we are (approximately) 95% confident that $p$ is within the interval bounds.

    A few restrictions apply to this formula.

    1. It should be applied to situations where units are randomly selected.

    2. If sampling is without replacement, check the "20 Times Rule": the population must be at least 20 times the sample size to use this result. (If not, and you know the population size, you can use the adjustment described a bit later in this section. However: small populations are uncommon, and the adjustment is rarely needed. You should recognize that when the population is not at least 20 times the sample size, then our recipe above for error margin does not work.[1])

    3. The actual counts of Successes $X$ and Failures $(n - X)$ must both be at least 10.[2] (If not, either you have too small a sample, or the Success probability is likely too close to 0 or 1, for the Normal to be a decent approximation.) If this is not the case, you must seek alternative strategies. (Minitab, and most other statistical software, can obtain the confidence interval by an exact method that doesn't require the Normal distribution. When the counts $X$ and $(n - X)$ are both at least 10, the exact method and the approximate interval given here will be quite similar.)

    We state such an interval in one of four equivalent styles:

    $\hat{p} - E < p < \hat{p} + E$;   $(\hat{p} - E,\ \hat{p} + E)$;   $\hat{p} \pm E$;   $\hat{p} - E$ to $\hat{p} + E$

    The first is preferred, as it indicates what is being estimated: $p$. In the versions that list both bounds, the lower value is always stated first. Every confidence interval should be accompanied by an interpretation that states the confidence level (here 95%).

    [1] In fact our formula gives too large an error margin. So in essence you can be more than 95% confident in our result when it's misapplied to situations where the population is small. Probably not the worst error in the world.

    [2] Some sources use 5 in place of 10. 10 is better. 5 is somewhat OK, but if either of these values is between 5 and 9, you'd be better off getting some help from a statistician, rather than using our methods.


    Example

    A simple random sample of 1000 adults finds that 343 approve of the President. Obtain a 95%

    confidence interval for the proportion of all adults who approve of the President.

    This is a random sample where $p$ = the proportion of all adults who approve of the President ($p$ is an unknown, but fixed and unvarying, quantity). The population size is huge, much larger than 20(1,000) = 20,000. The numbers of Successes and Failures are 343 and 657 respectively (both are well above 10). We can obtain and then interpret a confidence interval with the presented method.

    $x = 343$ out of $n = 1000$ trials. So $\hat{p} = 0.3430$. Then

    $$E = 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 1.96\sqrt{\frac{0.343(0.657)}{1000}} = 1.96(0.01501) = 0.0294.$$

    This 95% confidence interval forp(unknown) can be written any of four ways:

    $0.3136 < p < 0.3724$;   $(0.3136,\ 0.3724)$;   $0.3430 \pm 0.0294$;   0.3136 to 0.3724
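    The arithmetic of this example can be reproduced in a few lines. This is a sketch only: the function name `ci95_proportion` is made up here, and the check on the counts simply mirrors the "both at least 10" restriction stated earlier.

```python
import math

def ci95_proportion(x, n):
    """Approximate 95% interval: p-hat +/- 1.96 * sqrt(p-hat * (1 - p-hat) / n)."""
    if x < 10 or n - x < 10:
        raise ValueError("need at least 10 Successes and 10 Failures")
    p_hat = x / n
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin

p_hat, margin = ci95_proportion(343, 1000)
print(f"{p_hat:.4f} +/- {margin:.4f}")
print(f"{p_hat - margin:.4f} < p < {p_hat + margin:.4f}")
```

    Calling the function with fewer than 10 Successes or Failures raises an error instead of returning an interval the method does not support.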


    Confidence interval for a (population) proportion p

    This section summarizes our results. We extend the treatment to include a confidence interval for a population proportion $p$ with any confidence level.

    The typical textbook treatment of the issue takes an overblown approach to notation. In addition to the confidence C% there is the error rate (lack of confidence?) for the procedure, which is generally denoted with $\alpha$, where $\alpha$ is almost always rather small. C% = $1 - \alpha$.

    Then take $z_{\alpha/2}$ to be the Z score with $\alpha/2$ area in the right tail of the Normal distribution; its opposite $-z_{\alpha/2}$ is the Z score with $\alpha/2$ area in the left tail. Between $-z_{\alpha/2}$ and $z_{\alpha/2}$ is area $(1 - \alpha)$ = C%, the confidence.

    The 1 Sample Z Confidence Interval for a Proportion

    An approximate C% confidence interval for $p$ is

    $$\hat{p} \pm E$$

    where $E$ is the error margin

    $$E = z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$$

    with $z_{\alpha/2}$ from the Standard Normal distribution, taking into consideration the confidence C%.

    When should you use this formula for a confidence interval?

    You have a random sample drawn from a population with unknown population proportion $p$.

    If sampling is without replacement, the population must be at least 20 times the size of the sample (in most practical cases this is almost trivially true).

    The sample result has at least 10 Successes and 10 Failures.
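    For a general confidence level, the critical value $z_{\alpha/2}$ can be looked up in software rather than in a table. The sketch below uses `statistics.NormalDist` from the Python standard library; the function names `critical_value` and `proportion_ci` are illustrative and not part of the original notes.

```python
import math
from statistics import NormalDist

def critical_value(confidence):
    """z such that the central area under the Standard Normal equals `confidence`."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def proportion_ci(x, n, confidence=0.95):
    """Approximate 1 sample Z interval for a proportion, returned as (lower, upper)."""
    if x < 10 or n - x < 10:
        raise ValueError("need at least 10 Successes and 10 Failures")
    p_hat = x / n
    margin = critical_value(confidence) * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

for c in (0.90, 0.95, 0.99):
    print(f"C = {c:.2f}: z = {critical_value(c):.3f}")

lower, upper = proportion_ci(343, 1000, 0.95)
print(f"{lower:.4f} < p < {upper:.4f}")
```

    Because the critical value is computed to full precision rather than rounded to 1.96 or 2.576, results can differ from hand calculations in the last decimal place.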

    What if the population is too small or the sample is too large?

    The "at least 20 times" rule is violated. There is a relatively simple fix, but to employ it you must know (at least reasonably accurately) the size $N$ of the population. The error margin becomes

    $$E = z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\,\sqrt{\frac{N-n}{N-1}}.$$

    If you examine the adjustment (a multiplication) you can see the logic in the "20 times" rule: When $N$ is more than 20 times $n$ this multiplier will be quite close to, and just below, 1. In ignoring the adjustment, we end up with a slightly bigger $E$ than is required, so if anything we will be understating the confidence.


    What if the "both at least 10" rule isn't met?

    There is a method that works whether or not the "both at least 10" rule is met. We won't cover it here. (If you like, learn the Wilson interval described in textbooks. This is also an approximate method, but it works pretty well for counts below 10.)

    What if the sampling isn't random?

    No statistical method can guarantee results with any particular reliability (confidence)

    when sampling is not random. The situation may well be hopeless.
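    To see how the small-population multiplier $\sqrt{(N-n)/(N-1)}$ behaves, the sketch below evaluates it for a few population sizes. The sample size of 100 and the population sizes are hypothetical values chosen only for illustration, and `fpc_margin` is a name invented here.

```python
import math

def fpc_margin(p_hat, n, N, z=1.96):
    """Error margin with the small-population multiplier sqrt((N - n) / (N - 1)).

    Returns (adjusted margin, multiplier).
    """
    unadjusted = z * math.sqrt(p_hat * (1 - p_hat) / n)
    multiplier = math.sqrt((N - n) / (N - 1))
    return unadjusted * multiplier, multiplier

# Hypothetical illustration: n = 100 drawn from populations of different sizes
for N in (500, 2000, 100000):
    margin, mult = fpc_margin(0.5, 100, N)
    print(f"N = {N:>6}: multiplier = {mult:.4f}, margin = {margin:.4f}")
```

    At $N = 20n$ the multiplier is already within a few percent of 1, which is the logic behind the "20 times" rule.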

    Example

    A company's human resources department investigates the application materials submitted by 84 applicants for an entry level position over a six month period. One finding is that 15 of the applicants falsified information in the application materials.

    Assume that the 84 applicants are a random sample from a larger pool of similar applicants. Give a 99% confidence interval for the proportion of all applicants who falsify information in application materials.

    Solution

    First check conditions. We take it for granted that the complete pool of applications is at least 20 times larger than the 84 in the sample; we are told to assume the sample is random. (In reality, random sampling in such a circumstance might be hard to accomplish. Still, we might reasonably assume that the sample applicants are representative of the population of all applicants.) Both the number of falsifiers (15) and nonfalsifiers (69) exceed 10. We may use the 1 Sample Z Confidence Interval.

    For 99% confidence, $z_{\alpha/2} = z_{0.005} = 2.576$. The estimated/observed proportion of falsifiers is $\hat{p} = 15/84 = 0.1786$. Then the error margin is

    $$E = z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 2.576\sqrt{\frac{0.1786(1-0.1786)}{84}} = 2.576\sqrt{\frac{0.1786(0.8214)}{84}} = 2.576(0.0418) = 0.1076.$$

    Then $0.179 \pm 0.108$ is the 99% confidence interval: $0.071 < p < 0.287$.


    statistical software. 2) The formula, along with trial and error and some examples, can assist with your understanding of properties of confidence intervals.

    Fact worth knowing

    95% is the standard confidence level for scientific polls published in the media and online. If a poll does not publish an error margin, you may assume that sampling is not random: the poll is not a scientific one. Keep in mind also that many polls with stated error margins are not done properly. You should have less than 95% confidence in results from such polls.

    Nomenclature

    There's some terminology that goes with each of these quantities. (The terminology is useful because it is generalized to other situations.)

    The confidence level (or just confidence) is C. Usually C = 0.90, 0.95 or 0.99; generally we prefer to have a high amount of confidence in our statements. However: There is nothing illegal or necessarily wrong about a 50% CI. (It's just that 50% CIs miss the target quantity half of the time.)

    Don't confuse the confidence level (or just plain confidence) with the confidence interval. The confidence interval is the interval of values you obtain.

    $\hat{p}$ is the (point) estimate of $p$. It's a single good estimate for $p$ from the sample of data. It is the prevalence of Successes in the sample.

    $z_{\alpha/2}$ is the critical value (from the Standard Normal) that goes with C% confidence. (Some reference materials use simpler notation like $z^*$, and leave it to common sense that this $z$ is the one that goes with the confidence C.)

    The two endpoints of the interval are the bounds: lower bound and upper bound. The width $W$ of a confidence interval is the distance from the lower to the upper bound. The "$\pm$" part in total is referred to as the error margin $E$. $2E = W$.

    $\sqrt{\hat{p}(1-\hat{p})/n}$ is often called the (estimated) standard error of $\hat{p}$. A standard error is essentially a standard deviation.[5] Recall: Standard deviation measures typical deviation from mean. The deviation of $\hat{p}$ from its mean $p$ is a (sampling) error. It's common to see the abbreviation SE for standard error. In this case:

    $$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$

    The error margin $E$ for this interval can be expressed $E = z_{\alpha/2}\,SE(\hat{p})$.

    [5] This quantity really is an estimated standard deviation, as the standard deviation of $\hat{p}$ is $\sqrt{p(1-p)/n}$.


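    Mind your p's and q's

    Many textbooks make the formulas look shorter by using a second letter $q$ to stand for Failure rate. So $q = (1 - p)$ and $\hat{q} = 1 - \hat{p} = (n - x)/n$. When this is done

    $$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{\hat{p}\hat{q}}{n}}.$$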

    More on interpreting the interval

    If you followed the development above, you can deduce the proper interpretation of a confidence

    interval. It's also possible to take the justification for granted, and come to an interpretive

    understanding.

    Different samples give different results. Consider all possible samples. Obtain, for each sample, a 95% confidence interval. Some of these intervals include $p$, some do not. Most do. In fact: 95% of all of them do.

    In a statistical study, a single sample is drawn randomly. The data are collected and summarized, and a 95% confidence interval is computed. We have one sample - selected at random from the collection of all samples. Because 95% of all samples lead to an interval that covers $p$, we are 95% confident that the particular interval we have covers $p$.

    We use the word confidence, rather than probability. In statistical applications where parameters are estimated, those parameters are thought of as fixed values describing populations. They do not vary. The parameter $p$ is either in the interval or not. There is no probability involved.

    Where did the probability go?

    There was probability - before the sample was selected. This is similar to tossing a coin. Before it's tossed the probability of a Head is 1/2. But once the toss is completed, the probability - for that toss - is either 0 or 1, depending on the outcome. In this application, the probability is either 1 (the interval covers $p$) or 0 (the interval doesn't), depending on whether in fact the interval does or does not cover $p$. Not knowing $p$, we cannot tell. All we know is that 95% of all samples yield an interval covering $p$. So we are 95% confident that ours does. (Similarly, after the coin is tossed, if you're unable to see the result, you can be 50% confident it's a Head. The word probability doesn't apply here.)

    In short: Use the word "probability" for random things that haven't yet taken place. Once they've

    taken place, even if there are unknowns, use the word confidence. The unknowns merely reflect

    human ignorance about events.


    Properties of Confidence Intervals

    Three values impact the error margin of a confidence interval.

    1. the prevalence of Success ($\hat{p}$)

    2. the sample size ($n$)

    3. the confidence level (C)

    Undertake an investigation: How do changes in each of these impact the error margin? These issues are addressed through the exercises.

    In some respects the properties you discover will convince you that statistics makes sense: The numbers work out in ways that common sense would anticipate in advance. (Common sense would never anticipate the precise results.[6] But certain procedural properties do make sense. That's what you want to discover.)
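    One way to carry out the investigation is to vary one input at a time and watch the error margin. The sketch below does this around a hypothetical baseline ($\hat{p} = 0.5$, $n = 1000$, C = 0.95); the baseline and the function name `error_margin` are choices made here, not values from the notes.

```python
import math
from statistics import NormalDist

def error_margin(p_hat, n, confidence):
    """E = z * sqrt(p-hat * (1 - p-hat) / n) for the given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

base = error_margin(0.5, 1000, 0.95)
print(f"baseline margin: {base:.4f}")

# Vary one input at a time, holding the other two at the baseline values
for p_hat in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p-hat = {p_hat}: E = {error_margin(p_hat, 1000, 0.95):.4f}")
for n in (250, 1000, 4000):
    print(f"n = {n}: E = {error_margin(0.5, n, 0.95):.4f}")
for c in (0.90, 0.95, 0.99):
    print(f"C = {c}: E = {error_margin(0.5, 1000, c):.4f}")
```

    Comparing the printed rows against your own expectations is the point of the exercise; for instance, the formula implies that quadrupling $n$ halves $E$.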

    What a Confidence Interval Cannot Do

    Notice that the 95% in a 95% confidence interval refers to the percent of all samples that yield an interval that covers $p$. If we choose one such sample at random, we're 95% confident in that result. The error margin in a confidence interval addresses errors due to random sampling.

    The error margin in a confidence interval does not include the effects of other errors. Poorly recorded data is one source of error. Or perhaps the study didn't really sample randomly. In these cases, quantifying sampling error is not enough.

    All these other factors will lead to additional estimation error: error that is not captured by our formula. So while you can still use the formula when other types of error are present, it doesn't give a 95% confidence interval. The actual confidence is unknown. For analyses involving nonrandom data, the actual confidence will be considerably lower than 95%. That's a real issue in many studies.

    [6] In fact, given the randomness involved in selecting the samples, and the various other attributes that change from problem to problem ($n$, $p$, $x$, as well as the size of the population), it is remarkable that the formula we have is so simple.

    Polling Refusals

    Suppose (to oversimplify) that 88 million people approve of the President and 72 million disapprove. So the President's approval rating is $p = 88/160 = 0.55$.

    A telephone poll is taken. But: The people that approve of the President are crankier than those that do not. They are less likely to put up with an intruding phone call. In fact, 40% of the approvers will not respond (that's 35.2 million people). The disapprovers are more willing to take the call: only 10% of them will refuse (that's 7.2 million people). Here's the breakdown (in millions of people):

                Approve   Disapprove   Total
    Respond       52.8        64.8     117.6
    Refuse        35.2         7.2      42.4
    Total         88.0        72.0     160.0

    The problem here is that our sample is going to reflect the views of only the responders. Of the 117.6 million responders, 52.8 million approve, for a rating of 52.8/117.6 = 0.45.

    While people will be randomly called, those who refuse to respond will not be included in the results. So: Our poll will be estimating 0.45. With a sample size of 1000, the error margin will be around 0.03. While some samples will give results higher than 0.45, it is highly unlikely that we'll get a sample that produces a confidence interval including 0.55. After all: The interval is designed to include 0.45.

    This is an example of a biased estimation. A result is biased if it systematically [on average] produces the wrong result. Yes: We could get a random sample that has unusually high amounts of Approvers, and luckily gives an interval including 0.55. But we are unlikely to do so, because on average our estimate is 0.45, not 0.55. That's what bias is: The average result from the sampling procedure is not equal to the intended result.

    If we know that nonresponse occurred with 40% probability among Approvers, and 10% among

    Disapprovers, we could adjust the survey results accordingly, and produce an unbiased estimate.

    But generally nonresponse rates are unknown, and the rate changes from survey to survey,

    depending on what the issue is. It is difficult to adjust results to compensate for the nonresponse

    issue.
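    The refusal example can be worked through numerically. The sketch below computes the rate seen among responders and then shows one possible adjustment when the response rates are known: reweighting each group by the inverse of its response rate. That reweighting scheme and both function names are assumptions made here for illustration; the notes only say that an adjustment is possible.

```python
def observed_rate(true_rate, respond_if_approve, respond_if_disapprove):
    """Approval rate among responders when the two groups respond at different rates."""
    a = true_rate * respond_if_approve
    d = (1 - true_rate) * respond_if_disapprove
    return a / (a + d)

def adjust_for_nonresponse(observed, respond_if_approve, respond_if_disapprove):
    """Reweight each group by 1 / (its response rate) to undo the nonresponse."""
    a = observed / respond_if_approve
    d = (1 - observed) / respond_if_disapprove
    return a / (a + d)

true_rate = 88 / 160   # approval rating from the example
# 40% of approvers and 10% of disapprovers refuse, so 60% and 90% respond
seen = observed_rate(true_rate, 0.60, 0.90)
fixed = adjust_for_nonresponse(seen, 0.60, 0.90)

print(f"true {true_rate:.3f}, among responders {seen:.3f}, adjusted {fixed:.3f}")
```

    The adjustment recovers the true rate only because the response rates fed to it are exactly right, which is rarely the case in practice.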

    Poll results are even harder to interpret when sampling is not done randomly. Internet polls choose subjects by convenience and interest. Only people who care enough to vote will vote.

    These people may be significantly different in their views than the population of interest. No

    error margin can fix up such polls. (Hopefully results are stated without an error margin.)

    Statistical Software

    Statistical software will compute the confidence interval for $p$. All you need to do is input three values: $n$, $x$, and the confidence C, along with specifying the method the software should use. The interval you have learned is the (approximate) 1 sample Z interval for a proportion.

    Good software has other choices; they use a different "formula" than that above. The formula you have is approximate, and requires at least ten Successes and Failures to allow using the Normal. For cases where this condition is not met, you may have statistical software compute the interval using a different method/formula.[7] In fact, where there are at least ten Successes and Failures, you may use the alternative method in place of the 1 sample Z interval; you'll get slightly different results.[8] For really large samples, these differences will be quite small.

    One quirk about one method software may use: The intervals may not be balanced: The estimate $\hat{p}$ is not exactly in the middle of the interval. This is particularly noticeable for results with small percents of either Successes or Failures. (If you have, say, 2 Successes in only 10 to 20 trials, then ultimately the value of $p$ is quite small. So the distribution is clamped against the left edge of the range of values, and has right skew. Skewness is an expression of imbalance; so it's no surprise that the interval is not balanced. This is a case where you could not use the formula stated above: 2 is too few Successes. The "both at least 10" restriction prevents use of a Normal approximation when things aren't at all close to Normal.) If both $x$ and $n - x$ are large, the interval, no matter which method is used, is nearly balanced about $\hat{p}$, and in fact, the exact interval and the interval from your formula will give very similar results.

    [7] Usually called the Exact Binomial interval or method.
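    The imbalance of an exact interval can be seen directly. The sketch below assumes that "exact" refers to the Clopper-Pearson construction, which finds each bound by solving a Binomial tail equation; here that is done with plain bisection using only the standard library. The function names are invented for this illustration, and statistical packages may use a different numerical route.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def solve_decreasing(f, target, lo=0.0, hi=1.0, iters=60):
    """Bisection for a function f that decreases on [lo, hi]: find p with f(p) = target."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid      # f is still too large, so the root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2

def exact_interval(x, n, confidence=0.95):
    alpha = 1 - confidence
    # Lower bound solves P(X >= x | p) = alpha/2, i.e. P(X <= x - 1 | p) = 1 - alpha/2
    lower = 0.0 if x == 0 else solve_decreasing(
        lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper bound solves P(X <= x | p) = alpha/2
    upper = 1.0 if x == n else solve_decreasing(
        lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper

x, n = 2, 20   # the "2 Successes in few trials" situation from the text
lower, upper = exact_interval(x, n)
p_hat = x / n
print(f"lower = {lower:.4f}, p-hat = {p_hat:.2f}, upper = {upper:.4f}")
print(f"distance below p-hat: {p_hat - lower:.4f}, above: {upper - p_hat:.4f}")
```

    With so few Successes the upper bound sits much farther from $\hat{p}$ than the lower bound does, which is the imbalance described above.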

    Sample Size Determination

    The error margin for our confidence interval is $E = z_{\alpha/2}\,SE(\hat{p}) = z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$. This is equivalent to $n = (z_{\alpha/2}/E)^2\,\hat{p}(1-\hat{p})$. Suppose, prior to the study, we desire an error margin of $E$. If we can produce a reasonable educated guess for the prevalence of Successes ($\hat{p}$), then an appropriate minimum sample size for the study is

    $$n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \hat{p}(1-\hat{p}).$$

    Example 1

    Suppose you want to estimate the proportion of students at a large university who are nearsighted. The prevalence for the general population is around 0.45. Use this as a guess to determine how many students would need to be included in a random sample if you wanted the error margin for a 95% confidence interval to be less than or equal to 2%.

    Recall that the error margin quantifies the maximum reasonable difference between the observed value $\hat{p}$ and the population value $p$.

    [8] For really large samples, these differences will be quite small. One quirk about the Exact Binomial method: The intervals may not be balanced: The estimate is generally not exactly in the middle of the interval. This is particularly noticeable for results with small prevalence of either Successes or Failures. (Any $p$ other than 0.5 implies some asymmetry, and these intervals reflect this.) If both $x$ and $n - x$ are large, the Exact Binomial interval is nearly symmetric about the sample proportion, and in fact, the exact interval and the interval from your formula will give very similar results.


    Solution: The desired error margin is $E = 0.02$. Our guess is $\hat{p} = 0.45$. The required sample size is $n = (1.96/0.02)^2\,(0.45)(0.55) = 2376.99$. Of course we cannot sample 0.99 of a student, so we move up to 2377.

    The actual study was run and it turned out that 951 of 2377 randomly sampled students were nearsighted. This yields a 95% confidence interval of $0.4001 \pm 0.0197$.

    Remark 1

    If our guess is closer to 0.5 than the prevalence $\hat{p}$ observed in the data, then the actual error margin will be less than desired (and the sample larger than needed). If the guess is further from 0.5, then the actual error margin will be greater than the desired $E$. (If the two are equal then someone is a very lucky guesser.)

    Example 2

    In a study of a new drug, the researchers assume that the cure rate for the drug is the same, 0.60,

    as for the established drug. What sample size is required to obtain a 99% confidence interval

    with error margin no greater than 0.05?

    Solution: The desired error margin is $E = 0.05$. Our guess is 0.60. The required sample size is $n = (2.576/0.05)^2\,(0.60)(0.40) = 637.03$. Of course one cannot sample 0.03 of a patient. To ensure a large enough sample size, round up to 638.

    When the data are collected, it's found that the drug is much more effective than the established drug: 573 of the 638 patients (that's 89.8%) are cured. The 99% confidence interval is $0.898 \pm 0.031$.

    Notice that the error margin is far below 0.05. But this came at a cost. If they had known that the prevalence would be around 0.90, the researchers could have used 0.90 for a guess, and determined a sample size of 239 (because $n = (2.576/0.05)^2\,(0.9)(0.1) = 238.89$). Notice that $2.576\sqrt{0.9(0.1)/239} = 0.0500$. They ended up sampling 399 more patients than necessary. This cost them time and money; if they'd had any clue the cure rate would be far higher, they could have taken advantage of it.
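    The sample size calculations in Examples 1 and 2 follow one pattern: evaluate $(z/E)^2 \cdot \text{guess} \cdot (1 - \text{guess})$ and round up. A minimal sketch, with `required_sample_size` as an illustrative name and the critical values 1.96 and 2.576 passed in by hand as in the text:

```python
import math

def required_sample_size(margin, guess, z):
    """Return (unrounded n, n rounded up) for the desired error margin."""
    raw = (z / margin) ** 2 * guess * (1 - guess)
    return raw, math.ceil(raw)

ex1 = required_sample_size(0.02, 0.45, 1.96)     # Example 1
ex2 = required_sample_size(0.05, 0.60, 2.576)    # Example 2, guess 0.60
ex2b = required_sample_size(0.05, 0.90, 2.576)   # Example 2, guess 0.90

for label, (raw, n) in (("Example 1", ex1), ("Example 2", ex2), ("guess 0.90", ex2b)):
    print(f"{label}: unrounded {raw:.2f}, sample size {n}")
```

    Rounding is always upward, since rounding down would leave the error margin slightly above the target.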

    Remark 2

    If the actual error margin is less than the desired value, then, while the error margin is bettered,

    the expense of conducting the study was larger than necessary. A smaller sample size would

    have sufficed to obtain the desired error margin.

    There is no way to choose exactly the right sample size. In some cases, we may have no idea what the prevalence is in advance. If no guess is possible (we are completely in the dark as to the prevalence) we can use 0.5. This guarantees that the actual error margin will be no larger than what is desired. On the other hand, it also pretty much guarantees that we will take a larger sample than is necessary (only if the prevalence turns out to be 50% will the sample be just large enough).

    Remark 3

    Whenever the range of plausible guesses includes 0.5, use 0.5 as the guess. This rule works when one has no idea what the prevalence is: the range of plausible guesses is from 0 to 1, which certainly includes 0.5.

    Most two-candidate political races are reasonably close. Pollsters[9] generally use 0.50 to determine the sample size. Using a guess of 0.50 tends not to lead to dramatic oversampling unless the result falls below 1/3 or above 2/3.

    Example 3

    Production line defects occur infrequently at an industrial plant. In the past the rate has generally been between 2% and 6% (this value would change over time as the production line, and the employees working on it, change). What sample size is required to estimate the current rate at 90% confidence with error margin no larger than 4%?

    If we assume 2%, then the required size is 34 (rounding 33.15 up); if 6% is assumed, the required size is 96. Here's a plot of the relationship.

    [Figure: required sample size n (vertical axis, ticks from 30 to 100) versus proportion p (horizontal axis, 0.02 to 0.06).]

    (The relationship is not exactly linear. However: Linear interpolation would work well here. In general, as long as the proportion is confined to a small range of values to one side of 0.50, interpolation does work fine.)

    You can see that 6% requires the largest $n$ (it's closest to 0.5). To cover all historical possibilities, use $n = 96$. If the rate is actually less than 0.06, you will have oversampled. What if you sample less than 96? Perhaps a good idea, but if the rate is near 0.06 you won't get the desired error margin. And, of course, if production falls seriously out of control, you might see a result much higher than 6%, leading to an error margin considerably larger than 0.04.
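    The relationship in Example 3 can be tabulated across the historical range of defect rates. The sketch below assumes the usual 90% critical value of 1.645; the constant and function names are illustrative.

```python
import math

Z_90 = 1.645     # critical value for 90% confidence
MARGIN = 0.04    # desired error margin

def required_n(rate):
    """Sample size, rounded up, for the assumed defect rate."""
    return math.ceil((Z_90 / MARGIN) ** 2 * rate * (1 - rate))

for pct in range(2, 7):
    rate = pct / 100
    print(f"assumed rate {rate:.2f}: n = {required_n(rate)}")
```

    The required size rises steadily from the 2% case to the 6% case, which is why the largest plausible rate on this side of 0.5 drives the choice of $n$.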

    Remark 4

    A good idea is to produce a range of plausible guesses, and find the sample size for a number of values within that range. Graph this relationship. If the final decision isn't yours, you can place your graph in front of the decision maker. This point is illustrated by Example 3. The decision maker on sample size needs to see that graph.

    [Figure: required sample size n (vertical axis) versus proportion p (horizontal axis, 0 to 1).]

    The figure plots the required sample size for 95% confidence intervals having error margin 3%. The curve has the same shape for other confidence levels and error margins. You can see that a prevalence of 0.50 requires the largest sample size.

    [9] People or organizations who are paid to conduct polls.

    Appendix

    A general format for confidence intervals

    The confidence interval we just studied is

    $$\hat{p} \pm z_{\alpha/2}\,SE(\hat{p})$$

    where

    $z_{\alpha/2}$ is the critical value, found from the Normal, taking into consideration the desired level of confidence, and

    the standard error of the estimate is $SE(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$, yielding

    the error margin $E = z_{\alpha/2}\,SE(\hat{p}) = z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$.

    More generally, provided the conditions are right, a confidence interval is determined with

    Estimate $\pm$ Error margin

    where

    Error margin = critical value $\times$ SE(estimate)

    This formula is broadly applicable to all sorts of data analyses. In almost all circumstances SE(Estimate) has the square root of the sample size(s) in the denominator.


    Exercises

    1. There are 8640 students enrolled at SUNY Oswego this semester; 5146 live more than 50

    miles from campus. A professor (unaware of these figures) samples 92 students and finds

    that 60 of them live more than 50 miles from campus.

    a) Identify the following: i) the population proportion p; ii) the sample count X; iii) the sample proportion $\hat{p}$.

    b) Which of p and $\hat{p}$ is a parameter? Which is a statistic?

    A student (also unaware of the whole-campus figures) is about to randomly select 142 students to estimate the proportion who live more than 50 miles away.

    c) For the student: What are the mean and standard deviation for $\hat{p}$? Interpret this mean. (Be sure to include the phrase "all possible samples" in your statement.)

    2. The saturation rate for a particular kind of marketing via a newspaper ad is 15%. That is, 15% of all newspaper buyers will read the ad. For a new ad, marketers randomly sample 30 buyers and determine that 2 have read the ad.

    a) Identify values for p, X, and $\hat{p}$.

    b) Which of p and $\hat{p}$ is a parameter? Which is a statistic?

    For the following exercises, when you interpret results, use the word "all" or "population."

    3. A random sample of 212 adoptive parents finds that 85 of them stated "No Preference" for their child's gender. Use this sample data to construct a 95% confidence interval estimate for the proportion of adoptive parents who state "No Preference." Explicitly identify the following:

    a) The point estimate.

    b) The critical value ($z_{\alpha/2}$).

    c) The error margin.

    d) Write the interval bounds in this format: $\hat{p} \pm E$.

    e) Express the interval in this format: ________ < ________ < ________ .

    f) What confidence do you have in this result?

    g) Explain what p represents in this situation. Is its value known?

    h) $\hat{p}$: Parameter or Statistic? p: Parameter or Statistic?

    i) Interpret your interval in words. We are 95% confident that


    4. The Genetics and IVF Institute conducted a clinical trial of the XSORT method designed to

    increase the probability of conceiving a girl. 325 babies were born to parents using XSORT,

    and 295 of them were girls. Use this data to construct a 99% confidence interval for the

    proportion of girls born to parents using XSORT. Interpret your result.

    5. Do individuals have the ability to temporarily postpone death to survive a major holiday? (The hypothesis would be that these holidays are family affairs that give a dying person incentive to live a bit longer.) In one study, 12000 deaths, over the period from one week before to one week after Thanksgiving, were examined. Of these, 6062 occurred in the week before Thanksgiving. Give a 95% confidence interval for the proportion of deaths in this two-week period that occur in the earlier week. Interpret your result. Does your data conclusively support the "postpone death" theory? (Hint: Check where 0.5 lands relative to your interval.)

    6. Complete the small table indicating which critical value from the Standard Normal table goes

    with the given levels of confidence.

    C                50%    75%    90%     95%     98%    99%    99.9%   99.99%
    $z_{\alpha/2}$                 1.645   1.960                         3.891

    7. Over a period of 11 years in Hidalgo County, Texas, 870 people were selected for grand jury duty, and 39% of them were Mexican-American. (Notice that you are told the value of $\hat{p}$; you don't have to compute it: $\hat{p}$ = 0.39. From this you can deduce the number X of Mexican-Americans in the sample. Since 0.39(870) = 339.3, the number must be 339; it can't be 339.3, because you can't select 3/10ths of a Mexican-American. The given value is rounded for convenience: 339/870 = 0.3899 to four significant digits, and 0.390 to three, which is sufficient for computing purposes.)

    a) Assume these data represent a random sample of jury-duty-eligible county citizens. Obtain a 99% confidence interval for the percent of all county citizens that are Mexican-American. Interpret your result.

    b) It was determined that 79.1% of all county citizens were Mexican-American. What does

    your confidence interval suggest about selection for jury duty?

    8. Perform an investigation of the relationship between confidence and error margin. Here's how.

    a) Take exercise 7, where n = 870 and $\hat{p}$ = 0.390. You've already obtained a 99% confidence interval: 0.390 ± 0.043. The error margin is 0.043. Now obtain a 95% confidence interval; determine the error margin.

    b) Compute intervals for each of the confidence levels specified in the table. Fill in the table

    below with the error margin for the various levels of confidence.


    C    50%    75%    90%    95%     99%     99.9%
    E                         0.032   0.043

    c) Write a sentence describing how the error margin changes as the confidence is increased

    (decreased).

    9. A recent survey of 4276 randomly selected households showed that 94.0% of them had

    telephones.

    a) How many of the 4276 households have telephones? Answer with a whole number.

    b) What is the value of $\hat{p}$ to the nearest 0.0001?

    c) Using these results, construct a 99% confidence interval for the proportion of households

    with telephones. Interpret your result. What is the error margin for this interval?

    d) Give a 99% confidence interval for the proportion of households without telephones. How does the error margin compare to that in part c?

    10. Gregor Mendel was responsible for famous genetics experiments with peas. In one experiment he crossed lines of peas, and the results included 428 green peas and 152 yellow peas.

    a) Find a 95% confidence interval for the proportion of all peas that are green. Interpret your

    result.

    b) Mendel's theory of genetic propagation of inherited traits predicted that 75% of all peas

    would be green. Is the theory refuted by his data?

    11. Perform an investigation of the relationship between sample size and error margin. Here's how. Take exercise 10, where $\hat{p}$ = 428/580 = 0.7379. You've already obtained a 95% confidence interval: 0.7379 ± 0.0358 (keep figures to the nearest 0.0001 for this). The error margin is 0.0358.

    a) Suppose hypothetically the study had investigated four times fewer peas, but the percent

    that are green is the same: 107 green and 38 yellow. Determine a 95% confidence interval

    for this outcome. Place the error margin in the table below. The sample size is 4 times

    smaller: How many times larger is this error margin?

    b) Suppose hypothetically the study had investigated twenty-five times more peas than in

    the actual study, with 10700 green and 3800 yellow. Determine a 95% confidence

    interval for this outcome. Place the error margin in the table below. The sample size is 25

    times larger: How many times smaller is this error margin?

    n    145    580      14500
    E           0.0358


    c) Write a sentence describing how the error margin changes when the sample size is k

    times larger. Check the solution to make sure you have the right result in mind as you go

    forward.

    12. Jack conducts a student opinion poll and gets an error margin of 10% for his result. He is not happy. He wants 3%. How must he adjust his sample size? (Assume the confidence and sample proportion remain the same.)

    13. A poll of 4000 people gives an error margin of 0.01. What would the error margin be for a similar poll of 800 people? (Assume the confidence and sample proportion remain the same.)

    Summary

    Here is the formula for the error margin: $E = z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$. At this point you ought to be able to look at the formula and deduce that¹⁰:

    The error margin increases when the confidence is increased. (This happens through the value $z_{\alpha/2}$.)

    The error margin decreases when the sample size is increased. (The sample size appears in the denominator of the formula.) In particular, the relationship is that increasing (decreasing) the sample size by a factor of k decreases (increases) the error margin by a factor of $\sqrt{k}$. (That's because the sample size appears in the square root.)

    The error margin does not depend on what is called a Success and what is called a Failure.

    (Exercise 9 parts c and d explicitly address this.)
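The three summary points can be checked numerically. The following sketch uses a hypothetical baseline (a sample proportion of 0.30 from n = 400) that is not taken from any exercise; the function name `error_margin` is likewise an assumption of this example, and the critical values 1.960 and 2.576 are the usual table values for 95% and 99% confidence.

```python
from math import sqrt


def error_margin(p_hat, n, z):
    """Error margin E = z * sqrt(p_hat * (1 - p_hat) / n)."""
    return z * sqrt(p_hat * (1 - p_hat) / n)


# Hypothetical baseline: p_hat = 0.30, n = 400, 95% confidence (z = 1.960).
base = error_margin(0.30, 400, 1.960)

# Higher confidence (99%, z = 2.576) gives a larger margin.
wider = error_margin(0.30, 400, 2.576)

# A sample k = 4 times larger shrinks the margin by a factor of sqrt(4) = 2.
k = 4
smaller = error_margin(0.30, k * 400, 1.960)

# Swapping Success and Failure (0.30 becomes 0.70) leaves the margin unchanged.
swapped = error_margin(0.70, 400, 1.960)

print(f"base    = {base:.4f}")
print(f"99%     = {wider:.4f}")
print(f"n x {k}   = {smaller:.4f}  (ratio {base / smaller:.3f})")
print(f"swapped = {swapped:.4f}")
```

Holding everything else fixed while changing one input at a time mirrors the "all other factors stay the same" condition in the footnote.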

    14. Perform an investigation of the relationship between the sample prevalence of Success and error margin. Here's an exercise that will help you do this.

    What is the relation between the socio-economic status of parents and college graduation of their children? Different groups are sampled in order to make comparisons. For each socio-economic status, n = 400 children are sampled and tracked through adulthood. The number of the 400 who graduate from college is recorded.

    a) Obtain a 95% CI for each socio-economic status. Determine the error margin for each interval. Place your results in the table below.

    b) How do error margins compare for the cases $\hat{p}$ = 0.10 and $\hat{p}$ = 0.90? How about $\hat{p}$ = 0.20 and $\hat{p}$ = 0.80? $\hat{p}$ = 0.30 and $\hat{p}$ = 0.70? Why does this make sense?

    10 Assuming that all other factors stay the same.


    c) Write a single sentence describing the relationship between the proportion $\hat{p}$ and the error margin of the confidence interval.

    Parents' Status   # of grads   $\hat{p}$   95% CI   Error Margin
    Welfare           40
    Poor              80
    Low Income        120
    Middle Income     200
    High Income       280
    Wealthy           320
    Super rich        360

    15. Go back to problem 10. The 95% confidence interval is 0.3575


    18. At SUNY Oswego n = 125 students are randomly selected; 100 of them are opposed to a

    proposal that calls for the college to jam cell phone signals in classrooms. (This would

    prevent texting in class.) You can confirm that a 90% confidence interval is

    0.741


    b) Convince yourself that the answer to part a is No. For which of these countries is the

    error margin for a confidence interval the largest? The smallest? (Assume the same

    confidence level is used for all three results.)

    22. You want a 95% confidence interval estimate with error margin 4% for the proportion of science majors who are left handed. How many science majors do you sample?

    a) Describe in words the parameter you are estimating. What symbol is it given?

    b) Assume you have no idea what the prevalence of lefties is for this population. Use a guess of 0.5 to determine the required sample size.

    c) In the general population, 10% of people are lefties. Use this value to determine the sample size.

    d) Which of the answers from b or c is the better choice?

    e) It turns out that 24 of 217 sampled science majors are lefties. Obtain the confidence

    interval. How does the error margin compare to 4%?

    23. Suppose you undertook a study of the day of the week that babies are born. You are

    interested in the proportion of babies born on a weekend (Saturday or Sunday). Your goal is

    a 90% confidence interval with error margin no greater than 3.5%.

    a) Explain why a guess of 0.50 is unreasonable.

    b) What is a better value for this guess?

    c) In fact, 25% is probably an adequate value for the guess. If all days are equally likely, then 2/7 = 28.6% should be born on weekends. However, in recent years there is more of a trend for doctors to induce labor, which usually happens on a weekday! Use 0.25 to obtain a sample size for this study.

    d) If the actual prevalence is $\hat{p}$ = 0.25, determine the confidence interval when the sample size from c is used. Identify the error margin: does it meet the goal of 0.035? What would such a result say about the 2/7 hypothesis?

    e) If the actual prevalence is 0.20 and the sample size from c is used, how will the error

    margin compare to 0.035? Explain.

    24. Consider a large city's mayoral race where there are two candidates.

    a) Determine the required sample size for a media poll to estimate the percent of people who favor the Republican candidate with error margin 3%.

    b) Does the required sample size depend on the population of the city?

    25. What proportion of people die during summer (as officially defined)? You decide to

    investigate this issue by collecting data. How many obituaries would you examine in order to

    obtain a 98% confidence interval estimate with error margin of 1%?


    Solutions

    1. a) i) p = 5146/8640 = 0.5956; ii) X = 60; iii) $\hat{p}$ = 60/92 = 0.6522. b) p = 0.5956 is a parameter; $\hat{p}$ = 0.6522 is a statistic. c) The mean is p = 0.5956. If we examined all possible samples of 142 students, determining for each sample the proportion who live more than 50 miles from campus, the mean of these proportions is 0.5956. The standard deviation of these proportions is $\sqrt{0.5956(0.4044)/142} = 0.0412$.

    2. a) p = 0.15, X = 2, and $\hat{p}$ = 2/30 = 0.06667. b) p = 0.15 is a parameter; $\hat{p}$ = 0.06667 is a statistic.

    3. a) $\hat{p}$ = 85/212 = 0.4009. b) 1.96. c) E = 0.0660:

    $$E = 1.96\sqrt{\frac{\frac{85}{212}\cdot\frac{127}{212}}{212}} = 1.96\sqrt{\frac{0.4009(0.5991)}{212}} = 0.0660$$

    d) 0.4009 ± 0.0660. e) 0.3335


    a) The 95% confidence interval is 0.3575


    b) They are the same for each pair. A Success prevalence of 90% is equivalent to a Failure

    prevalence of 10%, so the error margins must be the same.

    c) Error margin is largest for prevalence $\hat{p}$ = 0.5 and drops (symmetrically) as the prevalence gets further from 0.5 on either side of 0.5.

    15. Yellow has prevalence 1 − 0.7379 = 0.2621. So the interval is 0.2621 ± 0.0358. Or take 1 − 0.702 = 0.298 and 1 − 0.774 = 0.226 to get (0.226, 0.298).

    16. a) I am 95% confident that between 40.7% and 47.7% of all voters lean towards D. b) 55.8% ± 3.5%. c) No. We don't know how the 55.8% is split up. d) Yes. If there are, for instance, 20% undecided, then the result for R is around 35.8%.

    17. This is a media poll, so the confidence is 95%. I am 95% confident that between 77% and 85% of all Americans favor a drawdown.

    18. a) The point estimate is 0.800, the error margin is 0.059. b) The interval at Penn State is

    exactly the same. c) The population size does not play into this. The formula for error margin depends only upon the prevalence $\hat{p}$ and the sample size n. This is an underappreciated fact

    about sampling and statistical analysis: As long as a population is large (at least 20 times

    bigger than the sample), its size is pretty much immaterial. What matters in most practical

    situations is the sample size.

    19. a) p is the proportion of all SUNY Oswego students who stay in Oswego. It's a parameter. It'd be difficult to get this value; you'd have to census virtually every student. b) 0.70 is a statistic: it describes a sample. c) For a 90% confidence interval the error margin will be smaller. (See #7.) d) To get an error margin that is 5 times smaller will require a sample size that is 5² = 25 times larger. (See exercises 11 to 13.) That's 2000 students.

    20. The error margin for a larger sample size will be smaller. It will not be three times smaller. It will be $\sqrt{3}$ = 1.732 times smaller: 0.07 / 1.732 = 0.0404, about 4%. (See exercises 11 to 13.)

    21. a) No. b) The error margin is smallest for Canada and largest for the United States. (Go back

    and examine #14.)

    22. a) The symbol is p. p = the proportion of all science majors at this university who are left handed. b) $n = \left(\frac{1.96}{0.04}\right)^2 (0.5)(0.5) = 600.25$. Select 601 science majors. c) $n = \left(\frac{1.96}{0.04}\right)^2 (0.1)(0.9) = 216.09$. Select 217 science majors. d) 217 is the better choice. The lefty rate for scientists is going to be fairly close to that for the general population. (Not only that, but the sample size is smaller!) e) 0.1106 ± 0.0417. Pretty close to 0.04 for the error margin. It missed a little because the actual lefty rate was slightly closer to 0.5 than the guess of 0.10 that was used to determine the sample size.


    23. a) The weekend constitutes 2 days out of 7. We wouldn't expect half of all births to occur in 2/7ths of all days. b) A better guess would be 2/7 = 0.286 (anything from 0.25 to 0.30 is reasonable; anything outside of this is not). c) $n = \left(\frac{1.645}{0.035}\right)^2 (0.25)(0.75) = 414.19$, so sample 415. d) 0.035. The interval is (0.215, 0.285). The error margin is 0.035, right on the target. The interval does not include 0.286 = 2/7. So we have some evidence that the 2/7 hypothesis is false. e) It will be smaller. When the actual result falls further from 0.5, the error margin is smaller. (In fact, if 0.20 is the result, the error margin is 0.032.)

    24. a) It's a media poll, so the confidence is 95%. The required sample size is then 1068. (This number is well known to pollsters. Many polls have 3% error rates because they use a sample size of around 1000.) b) Absolutely not. Exercises 18 and 21 covered this.

    25. You shouldn't guess anything except 0.25. The required sample size is then $n = \left(\frac{2.326}{0.01}\right)^2 (0.25)(0.75) = 10144.3$, so sample 10,145 obituaries.
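The "all possible samples" language in solution 1(c) can be made concrete with a small simulation. This is only a sketch under stated assumptions: the function name `simulate_sample_proportions`, the seed, and the number of repetitions are choices made for this example, and each student is drawn independently with probability p (sampling with replacement). That approximates a real sample of 142 from 8640 students because the population is far more than 20 times the sample size.

```python
import random
from statistics import mean, pstdev


def simulate_sample_proportions(p, n, reps, seed=0):
    """Draw `reps` samples of size n and return each sample's proportion of Successes.

    Assumption: independent draws, each a Success with probability p.
    """
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]


# Exercise 1: p = 5146/8640, samples of n = 142 students.
props = simulate_sample_proportions(p=5146 / 8640, n=142, reps=5000)

# Across many samples, the mean of the sample proportions should land near p,
# and their standard deviation near sqrt(p(1-p)/n).
print(f"mean of sample proportions = {mean(props):.4f}")
print(f"SD of sample proportions   = {pstdev(props):.4f}")
```

The two printed values should come out close to the 0.5956 and 0.0412 given in solution 1(c); they will not match exactly, because 5000 simulated samples stand in for all possible samples.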