Confidence Intervals and Tests based on a Sample Proportion

Course Manual: PG Certificate in Statistics, TCD, Base Module

Chapter 3: The analysis of count data

3.0 Introduction

In this chapter we turn our attention to situations where our data come in the form of counts, rather than continuous measurements. The simplest example, to be discussed in Section 3.1, is perhaps a political poll, where n voters are randomly selected from a population of voters and asked a question to which they must answer either yes or no. What is of interest is the proportion of voters in the population who would say yes to the question. The analysis will involve calculating a confidence interval for the population proportion; this is directly equivalent to calculating a confidence interval for a population mean based on a sample mean, as discussed in Chapter 2 (Section 2.1).

In many comparative studies, the outcome of interest is a count rather than a measured quantity; in Section 3.2 we will consider such studies. One example involves drug trials where two groups of patients, who are treated differently, are classified as improved/not improved, rather than measured on an improvement scale. In such cases the question of interest is whether there would be a difference in the population proportions that improve, if the treatments were applied to whole populations, rather than to relatively small samples. Again, this is directly equivalent to the simple comparative studies (involving two-sample and paired t-tests and the corresponding confidence intervals) of Chapter 2, where comparisons of population means were of interest.

In Section 3.3 we generalise from a two-group comparison to comparing proportions arising in more than two groups (in Chapter 5, Analysis of Variance (ANOVA) will be used to generalise from comparing two sample means to comparing several means for continuous data). We also consider situations where more than two outcomes are observed: for example, we examine the results of a market research study where respondents were classified as having bought a product, having heard of it but not bought it, or as never having heard of it. The chi-square distribution is introduced for analysing such data.

Section 3.4 is concerned with 'goodness-of-fit' tests - we ask if an observed set of counts corresponds to an a priori set of relative frequencies; for example, does the observed distribution of seed colours in a plant study correspond to what would be expected under a Mendelian inheritance scheme? To answer the question, we carry out a statistical test which compares the divergences between the observed counts and those that would be expected given the validity of the a priori relative frequencies. The test assesses whether or not the overall divergence would have been likely under chance variation. Again, the chi-square distribution will be used as the reference distribution for the divergence measure.


3.1 Single proportions

We are often interested in counting the relative frequency of occurrences of particular characteristics of units sampled from a population or process, with a view to making inferences on the population or process characteristics. For example, political polls are carried out to estimate support for individual politicians or parties in a defined population of electors; market research studies are conducted to elicit information on purchasing behaviour in a defined population of consumers; quality control samples are taken to monitor conformance to specifications of industrial products. Here the practical question is how to estimate the relative frequency of a given characteristic in a population or process from that of the sample.

The Statistical Model

Suppose a random sample of n units is selected from a population or process in which the proportion of units that has a particular characteristic is π, and that r members of the sample have this characteristic. For example, in a political poll r=700 voters from a sample of n=1000 respond "yes" to a question; what does the sample proportion p=r/n tell us about the population proportion π who would say "yes"? It can be shown that under repeated sampling the sample proportion p=r/n behaves (for large n) as if selected from a Normal distribution with mean π and standard error:

SE(p) = √(π(1−π)/n).

This means that repeated samples of size n will give sample proportions that vary around π as shown in Figure 3.1.1 below. Note that this result is approximate (it is a large sample or 'asymptotic' result) and that the larger the sample size n is, the better the approximation becomes.

Figure 3.1.1: The distribution of sample proportions


The fact that p is approximately Normally distributed allows the use of the methods previously encountered for the analysis of measured variables, even though we are now dealing with counts, where only integer values are possible.

A confidence interval for a population proportion

The statistical question is how to estimate π, the population proportion, from that of the sample value, p. Recall that when dealing with measured quantities we used the fact that if some variable x is at least approximately Normally distributed (say, with mean μ and standard deviation σ) then the average of n randomly selected values, x̄, will also be Normally distributed, with the same mean, μ, and with standard error σ/√n. This is illustrated in Figure 3.1.2 below.

Figure 3.1.2: Distributions of individual measurements (a) and of the averages of n measurements (b).

This gave us a method for estimating the population or process mean μ, using the sample information. A 95% confidence interval for μ was given by:

x̄ ± 1.96 σ/√n.

Where σ was unknown it was replaced by the sample value s and, if the sample size was small, the standard Normal value of 1.96 was replaced by the corresponding Student's t-value. The sample proportion p in the current case corresponds directly to x̄ above. In estimating the standard error of p we replace π by the sample estimate p in the formula for standard error:

SÊ(p) = √(p(1−p)/n).


This allows us to calculate an approximate 95% confidence interval¹ for π as:

p ± 1.96 √(p(1−p)/n).

Note the structure of the confidence interval formula: the confidence interval is given by a point estimate of the parameter of interest, plus or minus the estimated standard error of the point estimate multiplied by a critical value. Exactly the same structure applies to all confidence intervals, irrespective of the parameter being estimated:

estimate ± (critical value) × (estimated standard error).

Example 1: French 2007 presidential election

On May 4th, 2007 the BBC website reported that an opinion poll (Ifop for the Le Monde newspaper on May 3rd) "put Mr. Sarkozy at 53%, with Ms. Royal trailing with 47%" in the election campaign for the French presidency (voting took place on the following Sunday). The sample size was 858 people and all had watched the head to head television debate between the candidates. What can we infer about the voting intentions of the French electorate from these sample findings, if we can assume that this is a random sample of all French voters? A 95% confidence interval for the population proportion of voters who favoured Sarkozy is given by:

p ± 1.96 √(p(1−p)/n)

= 0.530 ± 1.96 √(0.530(1−0.530)/858)

= 0.530 ± 0.033

i.e., between 49.7% and 56.3% are estimated to favour Sarkozy. A confidence interval allows for the sampling variability involved in basing our conclusions about the population on the results of a random sample. Taken at face value the confidence interval indicates that, although Sarkozy was six percentage points above Royal in the sample, sampling variability is such that Royal could have been very slightly ahead in the population of voters [50.3%, corresponding to the lower bound of 49.7% for Sarkozy]. In other words the sample size was not large enough to distinguish clearly between the two candidates.

¹ Note that in this chapter the relevant reference distribution for calculating confidence intervals is the standard Normal (never the t-distribution). Obviously, confidence levels other than 95% may be obtained by replacing 1.96 by the corresponding Z value; thus, 2.58 gives 99% confidence.


Note that the formula implies that the larger n is, the narrower will be the error bound on the sample result.

The confidence interval is based on the assumption of a simple random sample from the population of interest; this means that all voters would be equally likely to be included in the sample. In practice, this is unlikely to be true for several reasons. The poll was very likely a telephone survey – this raises several questions: do all voters have telephones, if several voters reside in the same house, are they equally likely to answer the telephone, and will they answer truthfully if they do respond to the survey? The poll was aimed at people who had watched the TV debate – does this subset of the French electorate reflect accurately the voting intentions of the overall electorate? What exactly was the question asked – did they ask who had won the debate or who the respondent would support? The questions could appear the same, but it is dangerous to assume that the respondent interprets a question as you do, unless it is made very explicit. (I had a colleague once who, when asked by a TV licence inspector if there was a television in the house, replied that there was and, when asked if he had a TV licence, replied that he did not. When summoned to court he was asked by the judge why he did not have a TV licence; he explained that he did not own the TV, which was his wife's, and that she did have a TV licence!) Many people are likely to refuse to answer survey questions (whether on the telephone or otherwise) – is there a difference between the subset of the population who are willing to respond and that which is not? Any such difference leads to a non-response bias. Many surveys have response rates of less than 20%. In such cases it is virtually impossible to assess the likely bias in the sample of respondents.

In short, surveying is something of a black art! The use of a mathematical formula suggests, for most people, an exactness which is clearly inappropriate in very many survey situations. Judgements need to be made on how representative the sample is likely to be, and the uncertainties inherent in these judgements are not included in the error bounds generated by the confidence interval. The real uncertainty is likely to be quite a lot higher (note the precision in this final sentence!).

Exercise 3.1.1

On Saturday May 17th, 2008, the Irish Times published the results of a poll on attitudes to the planned constitutional referendum on the Lisbon Treaty. The size of the sample was 1000; 35% of respondents said they intended to vote 'yes'. Estimate the population proportion who intended to vote 'yes', at this point in the campaign.
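For readers who like to check such calculations numerically, here is a minimal Python sketch of the interval formula above, using the figures from Example 1 (the function name and layout are my own, not part of the manual):

# Minimal sketch: approximate 95% CI for a population proportion,
# p +/- 1.96*sqrt(p*(1-p)/n). Figures below are from Example 1 (p = 0.530, n = 858).
from math import sqrt

def proportion_ci(p, n, z=1.96):
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

lower, upper = proportion_ci(0.530, 858)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")   # roughly (0.497, 0.563)

The same function, pointed at p = 0.35 and n = 1000, can be used for Exercise 3.1.1.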

Sample Size

In any empirical study, one of the first questions that needs to be answered is: "how many observations are required?". One way of addressing this question in the current context is to ask how wide the confidence interval bounds on the


sample proportion could be while still giving an interval that answers the practical question, i.e., an interval that is 'fit for purpose'. The standard error of the sample proportion is:

SE(p) = √(π(1−π)/n).

This depends not only on the sample size (n) but also on the population or process proportion (π), or rather on the product π(1−π). Table 3.1.1 shows how this varies with π; it assumes its worst value (in the sense of giving the widest interval) when π is 0.5.

π      1−π    π(1−π)
0.1    0.9    0.09
0.2    0.8    0.16
0.3    0.7    0.21
0.4    0.6    0.24
0.5    0.5    0.25
0.6    0.4    0.24
0.7    0.3    0.21
0.8    0.2    0.16
0.9    0.1    0.09

Table 3.1.1: The impact of π on the standard error

Suppose we require that a 95% confidence interval is of the form:

p ± 0.01

then, since the confidence interval is given by:

p ± 1.96 √(p(1−p)/n)

and the worst-case value of either p(1−p) or π(1−π) is 0.25, we have:

1.96 √(0.25/n) ≤ 0.01

which gives a sample size


n ≥ (1.96)² (0.25) / (0.01)² = 9604.
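The same worst-case reasoning is easy to automate. A minimal sketch (it assumes the worst-case value 0.25 for π(1−π); the function name is illustrative only):

# Minimal sketch: sample size needed so that a 95% CI for a proportion has
# half-width no greater than `margin`, using the worst case p*(1-p) = 0.25.
from math import ceil

def sample_size_for_margin(margin, z=1.96, p=0.5):
    return ceil(z**2 * p * (1 - p) / margin**2)

print(sample_size_for_margin(0.01))   # 9604, as derived above
print(sample_size_for_margin(0.03))   # about 1068, close to the 'n around 1000' case below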

If n is around 1000 the error bounds are ± 0.03. This is a typical sample size for political polls, which are usually reported in the press as subject to an error of approximately 3%.

Significance Tests

When only a single proportion is of interest it is much more likely that the desired analysis will be to calculate a confidence interval than to carry out a significance test. Confidence intervals are more useful in that they not only tell us whether the data are consistent with a particular hypothesised population or process value, but also show the range of values with which the results are consistent. Tests may be appropriate, however, in certain contexts, especially where a decision has to be made.

Example 2: An Acceptance Sampling test

In Chapter 2 (Section 2.1) the use of a one-sided test was illustrated by an acceptance sampling example. Acceptance sampling in an industrial context often involves counted data. For example, a company that routinely buys batches of 1000 components for use in one of its own products might take a sample of 80 components on delivery and accept the consignment if no more than 2 components in the sample are 'non-conforming', i.e., do not meet the required specifications. The sample size of 80 and the acceptance number of 2 are suggested by the American National Standard: Sampling Procedures and Tables for Inspection by Attributes (ANSI/ASQC Z1.4-1993) for batches of between 501 and 1200 where the maximum acceptable non-conformance rate is 1%. This decision rule is a one-sided test with a rejection region defined by the numbers 3, 4, 5, …, 80. Equivalently, a sample proportion greater than p=2/80=0.025 leads to rejection of the null hypothesis that π ≤ 0.01. The sampling tables are often referred to as the 'Military Standards' as they were first developed during the second world war for monitoring the purchase of military equipment. Further information on the use of these tables can be found in the guidebook American National Standard: Introduction to Attribute Sampling (ANSI/ASQC S2-1995).
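The operating behaviour of this accept-on-at-most-2 rule can be explored with the Binomial distribution. The following sketch is my own illustration (the rates in the loop are arbitrary choices, not values taken from the standard):

# Minimal sketch: probability that a sample of n = 80 contains at most 2
# non-conforming items, i.e. that the batch is accepted, for assumed rates.
from math import comb

def prob_accept(rate, n=80, acceptance_number=2):
    return sum(comb(n, k) * rate**k * (1 - rate)**(n - k)
               for k in range(acceptance_number + 1))

for rate in (0.01, 0.025, 0.05, 0.10):
    print(f"non-conformance rate {rate:.3f}: P(accept) = {prob_accept(rate):.3f}")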


3.2 Simple Comparative Studies

In Chapter 2, we discussed comparative studies where the outcomes of interest could be measured and the questions to be addressed referred to population or process means. We now turn our attention to studies where the outcome involves counting, rather than measuring, and the questions of interest refer to population or process proportions. Typical examples would be medical studies where the two groups are either inherently different (e.g., smokers versus non-smokers) or the subjects are initially similar and are broken into two groups to be allocated to different treatments; the number in each group who respond in a defined way (e.g., develop lung cancer or become free of a medical condition) after a period of time is then counted and expressed as a proportion of the group size. The statistical problem is to assess the extent to which observed sample differences are likely to be systematic and not simply due to chance variation.

Example 3: Use of Timolol for treating angina

Angina pectoris is a chronic heart condition in which the sufferer has periodic attacks of chest pain. Samuels [1] cites the data in Table 3.2.1 from a study [2] to evaluate the effectiveness of the drug Timolol in preventing angina attacks; patients were randomly assigned to receive a daily dosage of either Timolol or a Placebo for 28 weeks. The numbers of patients who did or did not become completely free of angina attacks (Angina-Free) are shown in Table 3.2.1. Clearly a higher proportion of the Timolol patients became Angina-Free in the study. The statistical question is: "what are the implications for use of Timolol in a population of such patients?"

                    Timolol   Placebo   Totals
Angina-Free (AF)       44        19       63
Not Angina Free       116       128      244
Totals                160       147      307
Proportions          0.275     0.129    0.205

Table 3.2.1: The Timolol study results

The Statistical Model

In Section 3.1, we saw that if a single random sample of n units is selected from a population or process in which the proportion of units that has a particular characteristic is π, and if r members of the sample have this characteristic, then


the sample proportion p=r/n behaves as if selected from a Normal distribution with mean π and standard error:

SE(p) = √(π(1−π)/n).

Assume now that we have two groups which are independent of each other in the sense that the chance that any unit in one group does or does not have the characteristic of interest is not in any way affected by the responses of the units in the second group. If this is true then we can extend the result used above in a very simple way [see Appendix 3.1, page 60 for details]. If p1=r1/n1 is the sample proportion and π1 is the population or process proportion for the first group, and p2=r2/n2 and π2 are the corresponding values for the second group, then the difference between the sample proportions, (p1−p2), has a sampling distribution which is approximately Normal with mean (π1−π2) and standard error:

SE(p1−p2) = √(π1(1−π1)/n1 + π2(1−π2)/n2).

This is illustrated in Figure 3.2.1.

Figure 3.2.1: The sampling distribution of the difference between sample proportions

We are now back on familiar territory and can work with this sampling distribution, as it is merely a single Normal distribution (with which we have worked several times already), even though the standard error formula looks a little complicated.


The statistical questions that we typically wish to address are whether or not the population or process proportions π1 and π2 are the same and, if there is a difference, how big is that difference, π1−π2? As we have seen in several contexts, questions about statistical significance and the size of effects can be answered by calculating confidence intervals. This will be our approach here too. We will, however, also examine the use of a statistical test for comparing sample proportions, since such tests are commonly used in the research literature. We will use the Timolol study to illustrate the methods.

Confidence Interval

We have seen that a confidence interval for a population parameter (or the difference between two such parameters) is given by a point estimate plus or minus the standard error multiplied by a critical value from the relevant reference distribution. Accordingly, since the difference between two sample proportions follows a Normal distribution (approximately), a 95% confidence interval for π1−π2 is given by:

(p1−p2) ± 1.96 √(p1(1−p1)/n1 + p2(1−p2)/n2)

where 1.96 is the critical value from a standard Normal curve and p1 (Timolol) and p2 (Placebo) estimate the corresponding population values π1 and π2, respectively. For the Timolol data this gives:

(0.275−0.129) ± 1.96 √(0.275(1−0.275)/160 + 0.129(1−0.129)/147)

= 0.146 ± 0.088.

The interval suggests that if patients were routinely administered Timolol (according to the study protocol) then an excess of between 6% and 23% of the population would become Angina-Free over and above the percentage that would be expected if the patients were routinely prescribed a placebo. Thus, the interval answers the research questions: it tells us that Timolol works and it measures by how much it is better than a placebo.

The results of studies of this type are very often analysed and reported by way of statistical tests rather than confidence intervals. Accordingly, one such test will be introduced below. However, I strongly recommend that a confidence interval should be obtained, even if a statistical test is the more commonly used method of analysis in a particular discipline. With very large sample sizes even


tiny differences (of no practical interest) may be statistically significant; so simply reporting results as being statistically significant (especially if the 'statistically' qualifier is dropped, as it often is!) can be misleading. On the other hand, small sample sizes result in large standard errors, which will mean non-statistically-significant results unless large sample differences are observed. Calculating a confidence interval makes explicit the uncertainty in the observed sample difference and may lead readers to ask (correctly) if the non-significance is due to the absence of an effect or simply due to the small size of the study.

Significance Test

Obviously, the sample proportion of patients who became Angina-Free is greater in the Timolol group (p1=44/160=0.275) than in the Placebo group (p2=19/147=0.129). If we imagine the two treatments being administered to a large population of such patients (from which the sample may be considered to be randomly selected) then we can ask if the sample results are consistent with a difference in the population proportions. We frame our question in terms of a null hypothesis which considers the long-run proportions as being the same and an alternative hypothesis which denies their equality:

H0: π1 − π2 = 0
H1: π1 − π2 ≠ 0.

When the null hypothesis is true the sample difference (p1−p2) has a sampling distribution which is approximately Normal with mean (π1−π2) = 0 and standard error:

SE(p1−p2) = √(π(1−π)/n1 + π(1−π)/n2)

where π is the common population value. Since the test statistic is calculated on the assumption that the null hypothesis is true, the Z statistic is given by:

Z = ((p1−p2) − 0) / √(π(1−π)/n1 + π(1−π)/n2).

This has a standard Normal distribution. Because π is unknown it must be estimated from the sample data. If there is no group difference then a natural


estimator is the overall sample proportion of patients who become Angina-Free, p=63/307=0.205. This gives our test statistic² as:

Z = ((p1−p2) − 0) / √(p(1−p)/n1 + p(1−p)/n2)

  = ((0.275−0.129) − 0) / √(0.205(1−0.205)/160 + 0.205(1−0.205)/147)

  = 3.16.

If we choose a significance level of α=0.05 then the critical values are ±1.96 as shown in Figure 3.2.2 below.

Figure 3.2.2: The standard Normal distribution, with critical values

Our test statistic Z=3.16 is greater than the upper critical value and the sample difference is said to be ‘statistically significant’: we reject Ho and conclude that, in the long-run, the proportion of patients who would become Angina-Free is greater if Timolol is prescribed than it would be if a placebo were administered routinely. Note the assumption implicit in the analysis, that the patients are a random sample from the corresponding population. In practice, of course, this is unlikely to be true (especially as it would be the intention to administer the treatment to future patients also). It is important, therefore, in considering the scientific value of the test result to assess the extent to which the sample may be treated as if it were a random sample from the target population. In reporting on the findings it would also be important to specify the target population carefully; for example, the sample may be restricted in its coverage: it may include patients of a narrow age profile, of only one sex, of only one nationality or race etc.

² Note: rounding affects such calculations; the ratio is 0.1457/0.0461.



Typical Computer Output

Table 3.2.2 shows the output obtained from Minitab when the Timolol study sample results are entered.

Test and CI for Two Proportions

Sample    Angina-Free    N     Sample p
Timolol        44       160    0.275000
Placebo        19       147    0.129252

Difference = p (1) - p (2)
Estimate for difference: 0.145748
95% CI for difference: (0.0578398, 0.233657)
Test for difference = 0 (vs not = 0): Z = 3.16  P-Value = 0.002

Table 3.2.2: Minitab analysis of Timolol data

The p-value associated with the test indicates that if the null hypothesis were true, the probability would be 0.002 that we would have obtained, just by chance, a Z test statistic as far out into the tails of the standard Normal distribution as, or further out than, the value Z=3.16 obtained here. Accordingly, this is an unusually large value and the null hypothesis of no long-run group difference should be rejected. The computer output also gives the confidence interval (to an impressive number of decimal places!).

Continuity Correction

In carrying out the Z-test (or the corresponding chi-square test which we will meet in Section 3.3), it is often advised that a 'continuity or Yates' correction' be applied to the formula for the test statistic. This arises because although we are counting discrete events, our calculations are based on the Normal distribution, which is continuous – there is, therefore, an approximation involved. This is a technical detail which I have chosen to ignore in the course of this chapter. The continuity correction is only relevant when the sample size is relatively small – in the current case applying the correction changes the test statistic Z from 3.16 to 3.02; the p-value changes from 0.0016 to 0.0025.

Exercise 3.2.1

On Saturday May 17th, 2008, the Irish Times published the results of a poll on attitudes to the planned constitutional referendum on the Lisbon Treaty. The size of the sample was 1000; 35% of respondents said they intended to vote 'yes'. A similar poll, carried out at the end of January, 2008, gave 26% as the percentage who intended to vote 'yes'. Carry out a test to determine if the data suggest that the population proportions intending to vote 'yes' were the same or different on the two sampling dates. Estimate the change in the population proportion of those who intended to vote 'yes'.
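A minimal Python sketch of the two-proportion confidence interval and pooled Z-test described above, reproducing the Timolol figures (it could equally be pointed at the poll counts in Exercise 3.2.1; the function and helper names are my own, not Minitab output):

# Minimal sketch: 95% CI for pi1 - pi2 and the pooled Z-test of H0: pi1 = pi2.
from math import sqrt, erf

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def compare_proportions(r1, n1, r2, n2, z_crit=1.96):
    p1, p2 = r1 / n1, r2 / n2
    # confidence interval uses the unpooled standard error
    se_ci = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    ci = (p1 - p2 - z_crit * se_ci, p1 - p2 + z_crit * se_ci)
    # the test uses the pooled estimate of the common proportion
    p_pool = (r1 + r2) / (n1 + n2)
    se_test = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_test
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return ci, z, p_value

ci, z, p = compare_proportions(44, 160, 19, 147)   # Timolol vs Placebo
print(ci, z, p)   # roughly (0.058, 0.234), Z = 3.16, p = 0.002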


Other Sampling Designs

The data of Example 3 resulted from a designed experiment – a clinical trial – in which patients were assigned randomly to the two treatment groups. Where random assignment is possible it strengthens the conclusions that may be drawn from the study, as randomisation is the best protection against unknown sources of bias, which may be present in purely observational studies (see further discussion in Chapter 4). Our next two examples give data that are analysed in exactly the same way as the Timolol data, even though the data generating mechanisms are different. Example 4 is a typical cohort or longitudinal study. Such studies are said to be prospective: they identify the study groups first, follow them over time, and subsequently classify the subjects or patients according to the outcomes being studied. Example 5 is a case-control study – this type of data collection design is retrospective, as we shall see.

Example 4: Cohort Study of Slate Workers

Machin et al. [3] cite a study by Campbell et al. [4] in which a group of Welsh men who had been exposed to slate dust through their work were compared to a similar group who had not been so exposed; the control group was similar in age and smoking status. The groups were followed over 24 years and the numbers of deaths in the two groups were recorded, as shown in Table 3.2.3.

                Died   Survived   Totals   Proportions
Slate Workers    379      347       726       0.522
Controls         230      299       529       0.435
Totals           609      646      1255

Relative risk: 1.201
Odds Ratio: 1.420

Table 3.2.3: A cohort study of slate workers and controls

Table 3.2.4 shows a statistical comparison of the proportions who died in the two groups: the difference p1–p2 is highly statistically significant, with a p-value of 0.002.


Test and CI for Two Proportions

Sample           Died    N     Sample p
Slate Workers     379   726    0.522039
Controls          230   529    0.434783

Difference = p (1) - p (2)
Estimate for difference: 0.0872560
95% CI for difference: (0.0315353, 0.142977)
Test for difference = 0 (vs not = 0): Z = 3.05  P-Value = 0.002

Table 3.2.4: A statistical comparison of the death rates

The confidence interval shows an increased mortality rate of between 3 and 14 percentage points for the slate workers over the control group. The relative risk and odds ratio shown in Table 3.2.3 will be discussed later.

Example 5: Oesophageal Cancer Case-Control Study

Case-control studies involve selecting a group of patients with some condition (a disease, say), together with a control group, and then looking backwards in time (they are retrospective studies) to see if some antecedent or risk factor occurs more frequently with the study group than it does with the control group. If it does, it suggests a possible link between that factor and the outcome of interest. Obviously, case-control studies involve an inversion of the preferred point of view: we would like to be able to say that "people who do X are more likely to get disease Y", whereas what we can say is that "people who have disease Y are more likely to have done X". Such studies are particularly useful for studying rare conditions, for which enormous sample sizes would be required for prospective studies (in order to be sure of recruiting into the study sufficient numbers of people who will get the disease).

Breslow and Day [5] cite a case-control study by Tuyns et al. [6] involving 200 males who were diagnosed with oesophageal cancer in a Brittany hospital. A control sample of 775 adult males was selected from electoral lists in the nearby communes. Both cases and controls were administered a detailed dietary interview, which contained questions about their consumption of alcohol and tobacco as well as food. Only a small part of the dataset is examined below; Breslow and Day discuss the data in a much more sophisticated way than is attempted here. The data, below, are classified into two groups according to their reported alcohol consumption: high (80+ g/day) and low (<80 g/day).


             Alcohol consumption (g/day)
              80+     <80    Totals   Proportions
Cases          96     104      200       0.480
Controls      109     666      775       0.141
Totals        205     770      975

Odds Ratio: 5.640

Table 3.2.5: Oesophageal cancer case-control study

A Z-test (see Table 3.2.6, p<0.001) provides strong statistical evidence of an association between alcohol consumption and disease status. The sample proportions P(high alcohol consumption for Cases)=0.48 and P(high alcohol consumption for Controls)=0.14 are, therefore, indicative of underlying differences.

Test and CI for Two Proportions

Sample      80+     N     Sample p
Cases        96    200    0.480000
Controls    109    775    0.140645

Difference = p (1) - p (2)
Estimate for difference: 0.339355
95% CI for difference: (0.265916, 0.412793)
Test for difference = 0 (vs not = 0): Z = 10.50  P-Value = 0.000

Table 3.2.6: Minitab analysis of oesophageal cancer case-control study

The confidence interval suggests a higher prevalence of heavy drinking among the cases, over and above the level for the controls, of between 27 and 41 percentage points.

Measuring Association

The Timolol study was a clinical trial – an experimental study in which patients were randomly assigned to the treatment and control groups. In such cases we have strong evidence for a causal relationship between the response and the treatment administered. The oesophageal cancer study, on the other hand, was a purely observational study: in such cases the assumption/deduction of a


causal relationship between the risk factor (heavy drinking) and the response (cancer) is always subject to the caveat that there could be other (unknown) factors related to both of the observed factors, which are responsible for the apparent relationship. This is described as 'confounding'. For this reason, the term 'association' tends to be used, in order to make clear that the data collection method does not allow us to deduce, with any certainty, the existence of a causal relationship.

For the Timolol study a natural measure of association is:

p1−p2 = P(Angina free | Timolol) − P(Angina free | Control)

as it makes a direct comparison between the two proportions of interest. We have already discussed such comparisons extensively. An alternative way of viewing the same information is to calculate an estimate of the relative risk:

Relative Risk = p1/p2 = 0.275/0.129 ≈ 2.13.

This lends itself to a simple description of the outcome of the study: "the chances of becoming angina-free are more than doubled for the Timolol group". Similarly, for the slate workers study the mortality rates after 24 years were p1=0.522 for the slate workers and p2=0.435 for the controls. The relative risk of dying was 1.2 – slate workers had a 20% higher chance of dying within the study period.

Neither of these measures (p1−p2 or p1/p2) is available if the study is retrospective, since the relevant proportions are not available from the data. For example, in the oesophageal cancer study 96/205=0.47 may appear, for the high alcohol group, to estimate the probability of being a case, i.e., of having this cancer, for the heavy drinking group. However, if the researchers had identified 400 rather than 200 cases, the estimate would be 192/301=0.64, assuming the number of heavy drinkers among the cases also doubled. Thus, the estimate would be determined by the design of the study rather than the characteristics of the population being studied. An alternative measure of association, the odds ratio, has the useful property that it provides a direct measure of association, even when the data collection is retrospective.

Odds Ratio

Table 3.2.7 reproduces the Timolol data, together with a second table where the frequencies are represented by letters.


              Timolol   Control   Totals              Timolol   Control   Totals
Angina free      44        19       63       AF          a         b       a+b
Not A-free      116       128      244       not-AF      c         d       c+d
Totals          160       147      307       Totals     a+c       b+d       n

Table 3.2.7: Timolol study results

Out of the 160 people in the Timolol group 44 become Angina-free (AF) while 116 do not. Instead of expressing the chance of becoming Angina-free as a proportion of the total p1=44/160, an alternative is to say that the odds of becoming AF as opposed to not-AF are 44:116, which we may write as 44/116. Similarly, the odds of becoming AF for the control group are 19/128. If we take the ratio of these two quantities we get the odds ratio:

OR = (a/c) / (b/d) = ad/bc = (44/116) / (19/128) ≈ 2.6.

The odds of becoming AF are more than two and a half times higher for patients who received Timolol. Table 3.2.8 reproduces the oesophageal cancer study results in a similar format to that just considered for the Timolol study. Note however, that it is the row totals that are fixed by the study design here, whereas it was the column totals for the Timolol study.

             Alcohol consumption                       Alcohol consumption
             80+     <80    Totals                     80+    <80    Totals
Cases         96     104      200       Cases           a      b      a+b
Controls     109     666      775       Controls        c      d      c+d
Totals       205     770      975       Totals         a+c    b+d      n

Table 3.2.8: Oesophageal cancer study results

By design, we cannot calculate directly the odds of being a case for the high alcohol group (just as we could not calculate the corresponding probability, as discussed earlier), because the number would be meaningless – to see this you need only realise that the odds of being a case would double if the study had included 400 instead of 200 cases (assuming the same proportion of heavy drinkers appeared in the study the odds would be 192/109=1.76 instead of


96/109=0.88). We can, however, find the odds of being a heavy drinker versus a moderate drinker for each of the two study groups. For the cases, this is 96/104 and for the controls, it is 109/666. The odds ratio is then:

OR = (a/b) / (c/d) = ad/bc = (96/104) / (109/666) ≈ 5.6.

The odds are more than five times higher that an oesophageal cancer patient will be classified as a heavy drinker, than that a control will be one. Note, however, that the odds ratio formula (in letters) for the retrospective case-control study turns out to be exactly the same as for the prospective experimental study:

OR = ad/bc

and so, although the data were collected retrospectively, we can say that the odds of getting oesophageal cancer are more than five times higher for heavy drinkers than they are for more moderate drinkers³. The ratio is often called the 'cross-product' ratio because of the 'shape' of the resulting formula: compare it to the layout of the letters in the tables. This property, that it produces a forward-oriented index even when the data collection is retrospective, makes the odds ratio an attractive measure of association. The odds ratio also allows the tools of multiple regression to be applied to frequency data, when we want to analyse the influence of several risk factors simultaneously, e.g., the effects not just of alcohol consumption, but also of age, smoking habit, nutritional status etc.

There is another reason why the odds ratio is considered a useful index: because it is frequently applied to diseases with low rates⁴, it gives a close approximation to the relative risk for the groups being compared, even when the sampling is retrospective and the relative risk cannot be calculated.
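A minimal sketch of these two association measures for a 2x2 table (the counts below are the Timolol study's a, b, c, d from Table 3.2.7; the variable and function names are my own):

# Minimal sketch: relative risk and (cross-product) odds ratio from the
# counts a, b, c, d laid out as in Table 3.2.7.
def relative_risk(a, b, c, d):
    # risk in column 1, a/(a+c), divided by risk in column 2, b/(b+d)
    return (a / (a + c)) / (b / (b + d))

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

a, b, c, d = 44, 19, 116, 128   # Timolol study: AF / not-AF by Timolol / Control
print(f"relative risk = {relative_risk(a, b, c, d):.2f}")   # about 2.13
print(f"odds ratio    = {odds_ratio(a, b, c, d):.2f}")      # about 2.55

Because the Angina-Free rates here are not small, the two measures differ noticeably; for rare outcomes (footnote 4) they would be close.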

³ Thus, although the odds a/c and b/d of being a case are dependent on the design (total number of cases identified) and hence are not interpretable separately, their ratio is, since, while doubling the number of cases will double both a and b, the ratio of the two odds will be unchanged, as the doubling will cancel. Note that this is not true, in general, for the relative risk p1/p2.

⁴ The odds are p/(1−p), which is approximately p(1+p+p²+…), so, as long as p is small (say no more than 0.05), p² and higher powers of p are negligible and p/(1−p) is effectively equal to p. Then the odds for group 1 are p1/(1−p1) ≈ p1 and those for group 2 are p2/(1−p2) ≈ p2, so the odds ratio becomes p1/p2, which is the relative risk. Thus, the relative risk is effectively the same as the odds ratio when p is very small. Note that this approximate formula is based on a power series expansion of the function p/(1−p), so unless you studied this kind of mathematics, do not expect to be able to establish the relationship algebraically.

Exercises

3.2.2 460 adults took part in an influenza vaccination study [7]. Of these 240 received the influenza vaccine and 220 received a placebo vaccination. At the end of the trial 100 people contracted influenza, of whom 20 were in the treatment group and 80 in the control group. Carry out a statistical test to determine if the vaccine is effective. Calculate and interpret a confidence interval which will estimate the potential effect of the vaccine (beyond that of a placebo). Calculate the odds ratio. If the entire population of adults had been vaccinated, estimate the proportion of the population that would be expected to have contracted influenza during the test period.

3.2.3 A study of respiratory disease in childhood [7] examined the question of whether children with bronchitis in infancy get more respiratory symptoms in later life than others. Of 273 children with a history of bronchitis before age 5 years, 26 were reported to have a persistent cough at age 14, whereas in the group of 1046 children in the study who did not have bronchitis before the age of 5 years, 44 were reported to have a persistent cough at 14 years. Are the proportions having a persistent cough statistically significantly different? Calculate an interval estimate of the long run average difference between the two rates of occurrence of persistent cough at age 14 years. Calculate the odds ratio.

3.2.4 Hauer, Ansorge and Zinke [8] report data on otters that were either killed on the road or shot by hunters in eastern Germany over a 40 year time period. Note that the numbers were recovered from summaries in the original paper and the total sample size here is 1023, but the original paper refers to a total sample size of 1027. Table 3.2.9 shows the N=1023 otters cross-classified into a 2x2 table with age-group and sex as the row and column variables, respectively. The authors of the paper described otters of age four or more years as 'grown adults'; I have described otters younger than this as 'young'.

          Male   Female   Totals
Young      286     153      439
Adults     312     272      584
Totals     598     425     1023

Table 3.2.9: Otters cross-classified by age at death and sex

Note that the sampling scheme here is different from our earlier 2x2 tables – the Timolol study was a clinical trial in which patients were randomly assigned to treatment and control groups and the outcome (becoming angina-free) was recorded later, the Slate Workers example was a cohort study in which two groups of subjects were identified and then followed over many years, and the Oesophageal Cancer data came from a case-control study in which cancer patients and a control group were identified and interviewed (retrospectively) about their drinking habits (a potential risk-factor for the cancer). In the Otter study the researchers collected data for a given time period, and then cross-classified the results, without pre-specifying any sample size – this was itself a random quantity.

A natural research question would be: "is there an association between age at death and sex?", i.e., is the proportion of males who are young when killed the same as or different from that for females?


Example 6: Collapsing Multi-dimensional Contingency Tables: Simpson's Paradox – a statistical health warning!

Consider an hypothetical Graduate School with two programmes: Mathematics and English⁵. Suppose Table 3.2.10 represents the admission/rejection decisions for applications in a particular year, classified by sex. The table strongly suggests bias against females (12% acceptance versus 29% for males) on the part of those making the admission decisions. This difference is highly statistically significant, as shown by the Z-test result (Z=4.8).

Overall
            Reject   Accept   Number of Applicants   Percentage Accepted   Z-value   p-value
Males         107       43            150                    29
Females       370       50            420                    12               4.8     0.000
Totals        477       93            570                    16

Table 3.2.10: Overall Graduate School admission decisions

Consider now the results from the two Schools individually, as shown in Table 3.2.11. Note that the data in Table 3.2.11 sum to give the corresponding totals in Table 3.2.10 – 150 males and 420 females.

In both cases the sample admission rate is higher for females (though not statistically significantly) – 50 versus 40% for Mathematics and 10 versus 6% for English. This is in stark contrast to the overall rates for the Graduate School. This phenomenon (aggregated results contradicting the disaggregated results) is called Simpson’s Paradox. Why does it happen? In this example females apply (in greater numbers) for the programme into which it is more difficult to gain admission. Thus, only 10% of applicants are accepted into the English programme while 42% are accepted into Mathematics, but the females overwhelmingly apply for English. The conclusion must be that if there is sex bias, the source needs to be sought, not in the admissions procedures of the Graduate School, but in the conditioning and social influences that lead females to apply disproportionately for the graduate programmes into which it is more difficult to gain entry.

⁵ The Graduate School example is inspired by Bickel et al. [9].


Mathematics
            Reject   Accept   Number of Applicants   Percentage Accepted   Z-value   p-value
Males          60       40            100                    40
Females        10       10             20                    50             0.828     0.408
Totals         70       50            120                    42

English
            Reject   Accept   Number of Applicants   Percentage Accepted   Z-value   p-value
Males          47        3             50                     6
Females       360       40            400                    10             0.907     0.364
Totals        407       43            450                    10

Table 3.2.11: Graduate School admission decisions by School
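The reversal is easy to reproduce directly from the counts; a minimal sketch (acceptance counts taken from Tables 3.2.10 and 3.2.11; the data structure is my own arrangement):

# Minimal sketch: acceptance rates within each School and after collapsing
# over School, illustrating Simpson's paradox with the counts above.
admissions = {                      # (accepted, applicants) by school and sex
    "Mathematics": {"Males": (40, 100), "Females": (10, 20)},
    "English":     {"Males": (3, 50),   "Females": (40, 400)},
}

totals = {"Males": [0, 0], "Females": [0, 0]}
for school, by_sex in admissions.items():
    for sex, (accepted, applied) in by_sex.items():
        print(f"{school:12s} {sex:8s} acceptance rate = {accepted / applied:.0%}")
        totals[sex][0] += accepted
        totals[sex][1] += applied

for sex, (accepted, applied) in totals.items():
    print(f"Overall      {sex:8s} acceptance rate = {accepted / applied:.0%}")
# Within each School females do at least as well, yet overall: Males 29%, Females 12%.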

This should be a sobering example when we consider how often we collect multiply cross-classified data and, typically, collapse across categories to form a multiplicity of two-dimensional tables for analytical purposes. For example, Social/Political surveys will often collect information on Voting Intention, Party Affiliation, Sex, Age, Income Group, Education, Occupational Status, Geographical Location etc. Data collected on expectant mothers will include most of the above and also variables such as number of previous pregnancies, smoking, drinking, dietary, and exercise habits, as well as medical history information. There is an obvious need for care in the analysis of such information, to ensure that apparent effects are not the result of collapsing over influential variables.

Matched Studies

The analysis of two-sample studies where proportions are the appropriate outcome measure was introduced using the results of a clinical trial of Timolol for preventing angina attacks. The Timolol study design is similar to that which underlies studies where two-sample t-tests are used for the analysis (Chapter 2, Section 2.4). If, instead of simply classifying each patient as Angina-Free or not, we could assign an (approximately Normally distributed) improvement score to each individual in the study, we could have used a two-sample t-test (or a Z-test, given the sample sizes) and the corresponding confidence interval to analyse the resulting data. This would be preferable statistically (more powerful; statistical


power is discussed in Chapter 4 – a more powerful test is more likely to lead to a correct result), as there is very little information embodied in a simple yes/no classification. When we studied two-sample t-tests, we also discussed paired t-tests: these were appropriate when each person or experimental unit was individually matched with another, or where, in measurement studies, the same material/experimental unit was measured twice, by two different instruments. Matching increases the statistical power of the test by eliminating much of the chance variation affecting the comparison – the 'treatments' are compared on more uniform material than is the case for two-sample t-tests. Matching can also be used where the response is binary. The analysis of matched studies will be illustrated using data from a case-control study. We previously encountered a case-control study in Example 5 (oesophageal cancer study); there, the control group was selected from the same community of adult males as the cases, but there was no attempt to match the characteristics of individual cases to those of individual controls. In our next example, patients with other forms of cancer were individually matched to the cases, all of whom had a particular cancer.

Example 7: Case-control study – Endometrial cancer

Rice [10] cites a study [11] in which 317 patients, who were diagnosed as having endometrial carcinoma, were matched with other cancer patients. The controls all had one of cervical cancer, ovarian cancer or cancer of the vulva. Each control was matched by age at diagnosis (within 4 years) and year of diagnosis (within two years) to a corresponding case. Table 3.2.12 shows the 317 pairs cross-classified by use or not of oestrogen for at least six months prior to the cancer diagnosis. The question of interest is whether or not there is an association between exposure to oestrogen and the presence of endometrial cancer.

                              Controls
                  Exposed   Not-expos.   Totals
Cases  Exposed       39        113         152        a    b    a+b
       Not-expos.    15        150         165        c    d    c+d
Totals               54        263         317       a+c  b+d    n

Table 3.2.12: A matched case-control study of endometrial cancer


We address the question of interest by asking if the proportion of cases exposed to oestrogen [p1=(a+b)/n = 152/317] is statistically significantly different from the proportion of the controls who were exposed [p2=(a+c)/n = 54/317]. This means that we want to assess whether or not the difference, (p1−p2) = [(a+b)/n − (a+c)/n] = (b−c)/n, is statistically significantly different from zero, and if so, to estimate the long-run difference. A confidence interval provides both the answer to the statistical significance question and also the interval estimate we require.

Confidence Interval

The two sample proportions are not statistically independent of each other, so the standard error formula used in comparing independent proportions (e.g., the Timolol study) is not applicable here. The appropriate standard error estimate is:

SÊ(p1−p2) = (1/n)√((b+c) − (b−c)²/n).

We can use this to obtain an approximate 95% confidence interval for the difference between the corresponding long-run (population) proportions, (π1−π2). Thus, when we fill the relevant numbers for the endometrial cancer study into the formula:

(p1−p2) ± 1.96 (1/n)√((b+c) − (b−c)²/n)

we obtain:

(0.48 − 0.17) ± 1.96 (1/317)√((113+15) − (113−15)²/317)

= 0.31 ± 0.06.

The interval is quite far from zero, suggesting that the proportion exposed to oestrogen in the population of endometrial cancer patients is considerably higher than that for the other cancer groups – clear evidence of an association between the use of oestrogen and the development of endometrial cancer.

A Statistical Test

An alternative approach to determining if an association exists is through the use of a statistical test. If we wish to test the null hypothesis that π1 = π2 then, when this hypothesis is true, the estimated standard error formula simplifies⁶ to:

SÊ(p1−p2) = (1/n)√(b+c).

⁶ If the null hypothesis is true, the expectation (long-run value) of (p1−p2) is zero, i.e., the expectation of (b−c)/n is zero, so the (b−c)²/n part of the standard error formula drops out.


Using this we calculate a Z-statistic:

Z = ((p1−p2) − 0) / SÊ(p1−p2) = [(b−c)/n] / [√(b+c)/n] = (b−c)/√(b+c)

which follows a standard Normal distribution (again, when the null hypothesis is true). For the oestrogen data we obtain:

Z = (b−c)/√(b+c) = (113−15)/√(113+15) = 8.7.

The critical values are ±1.96 for a two-tail test using a significance level of 0.05. Our observed value of Z=8.7 is far out in the right-hand tail. There is, therefore, strong evidence of an association between oestrogen use and endometrial cancer.
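A minimal sketch of this matched-pairs analysis (the discordant counts b and c are taken from Table 3.2.12; the function names are my own, not the manual's):

# Minimal sketch: Z statistic and approximate 95% CI for the difference between
# correlated proportions in a matched-pairs table with discordant counts b and c.
from math import sqrt

def matched_pairs_z(b, c):
    return (b - c) / sqrt(b + c)

def matched_pairs_ci(b, c, n, z=1.96):
    diff = (b - c) / n
    se = sqrt((b + c) - (b - c) ** 2 / n) / n
    return diff - z * se, diff + z * se

print(matched_pairs_z(113, 15))          # about 8.7
print(matched_pairs_ci(113, 15, 317))    # roughly (0.25, 0.37)

The same two functions can be applied directly to the tables in Exercises 3.2.5 and 3.2.6 below.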

Exercises

3.2.5 Samuels [1] reports data from a 1973 study of risk factors for stroke in which 155 women who had experienced a haemorrhagic stroke (cases) were interviewed. Controls (who had not had a stroke) were matched to each case by where they lived, age and race. The subjects were asked if they had used oral contraceptives. The table below displays oral contraceptive use (Yes/No) for all pairs.

                Controls
               Yes    No
Cases   Yes      5    30
        No      13   107

Table 3.2.13: Pairs classified by use of oral contraceptives

Do the data of Table 3.2.13 suggest an association between oral contraception use and strokes?

3.2.6 Altman [12] reports data from a 1975 study carried out to see if patients' reaction to dinitrochlorobenzene (DNCB), a contact allergen, would be related to or independent of their response to croton oil, a skin irritant. The table below shows the results of simultaneous tests on the two chemicals for 173 patients with skin cancer; thus, this is an example of a 'self-paired' study. The results are classified as either positive (+) or negative (-) reactions to the chemicals.


                    DNCB
                   +     –
Croton Oil   +    81    48
             –    23    21

Table 3.2.14: 173 patients classified according to response to two chemicals

Is there evidence of a differential response to the two chemicals?

Comparing two proportions that sum to 1 The matched samples design leads to proportions that are correlated with each other. A special case of correlated proportions occurs when there are only two proportions to be compared and the proportions are based on the same total. The proportions will then be perfectly correlated, in that knowing one immediately gives the value for the other, and the standard error needs to take account of this fact. The analysis is a simple modification of our earlier analysis for two independent proportions. Example 8: French 2007 presidential election poll - revisited

Recall the political poll, of Section 3.1, that estimated the support for Sarkozy (π1) and Royal (π2) in the French 2007 presidential election. Note that in this case the two sample proportions p1 and p2 are not independent since (p1+p2)=1 (assuming respondents chose one or the other, i.e., no 'neithers' or 'don't knows' in the sample of 858). If we wish to focus on the difference in support for the two candidates then the standard error of the sample difference (p1–p2) is

SE(p1 − p2) = 2√[π1(1 − π1)/n]

where n is the size of the poll and p2=(1–p1); see Appendix 1, page 61 for details.

In practice, we replace π1 by p1 in using this formula for estimation. Note the difference between this formula and that for the standard error of independent proportions (page 9). A 95% confidence interval for Sarkozy's lead at the time the poll was taken is:

(p1 − p2) ± 1.96 (2)√[p1(1 − p1)/n]

(0.530 − 0.470) ± 3.92√[0.530(0.470)/858]

= 0.060 ± 0.067. This interval covers zero and so the sample difference is not statistically significant – the corresponding significance test would not reject the hypothesis of no difference in the population of voters. Despite the size of the sample difference in Sarkozy's favour, the sampling variability is such that Royal could, in fact, have had a slight lead in the population of electors from which the sample was selected. The conclusion is exactly the same as that based on an analysis of either Sarkozy's or Royal's support separately, since (because p1+p2=1) no extra information is used in analysing the difference. The modification to the standard error formula would be important, though, where one proportion was statistically significantly bigger than the other and an interval estimate of the difference was required.
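As a quick check on the arithmetic, the calculation can be reproduced with the short Python sketch below (an addition to the text; only the standard math module is assumed).

import math

p1, n = 0.530, 858                      # Sarkozy's sample share and the size of the poll
p2 = 1 - p1                             # Royal's share (no 'don't knows' assumed)
se = 2 * math.sqrt(p1 * (1 - p1) / n)   # SE of (p1 - p2) when the two proportions sum to 1
half_width = 1.96 * se
print("Lead:", round(p1 - p2, 3), "+/-", round(half_width, 3))   # 0.060 +/- 0.067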

Exercise

3.2.7 Exercise 3.2.4 (page 20) analysed data on otters that were either killed on the road or shot by hunters in eastern Germany over a 40-year time period. The total sample size of 1023 was made up of 598 males and 425 females. The authors state that there is an "even sex ratio in living otters". Do the study results support the proposition that male otters are more likely to be killed, or could the sample difference (58.5% versus 41.5%) be attributable to chance sampling variability in the reported data?

Calculate a 95% confidence interval for the difference in the long-run proportions of male and female otters killed in eastern Germany in the study period to answer this question.

3.3 Analysis of Contingency Tables

In Section 3.2 we discussed the analysis of comparative studies involving two groups where the responses were classified into one of two categories. We now turn our attention to studies which may involve more than two groups and where the responses may be classified into more than two categories. The data resulting from such studies are usually presented in tabular form, with the columns (say) representing the different groups and the rows the different response categories. Such tables are often referred to as 'contingency tables' - one characteristic is considered as 'contingent on' or 'dependent on' a second.

Chi-square Test

The analysis of the resulting data involves the use of (for us) a new statistical test – the chi-square test. Before plunging into the analysis of larger tables, the chi-square test will be introduced, by using it to re-analyse an old friend – the Timolol data!

Example 9: The Timolol Study – revisited

Patients were assigned to receive a daily dosage of either Timolol or a Placebo for 28 weeks. The numbers of patients who did or did not become completely free of angina attacks (Angina-Free) are shown again in Table 3.3.1.

                     Timolol    Placebo    Totals

Angina-Free               44         19        63
Not Angina-Free          116        128       244

Totals                   160        147       307

Table 3.3.1: The Timolol Study Results

Suppose we ask the question: “If Timolol has no greater effect than the Placebo, how many of the treatment group would be expected to become Angina-Free?”. Since 63/307 in all become Angina-Free, and this fraction should apply equally to both the treatment and placebo groups if there is no differential effect, we estimate the expected7 number of Angina-Free patients in the Timolol group as:

7 If 100 fair coins are tossed, we expect NxP[head] = 100[0.5] = 50 heads. By this we do not

mean that if the experiment were carried out once, the result would be 50 heads. Rather, that in a very long sequence of such experiments the long-run average number of heads would be 50. Thus, the term ‘expected value’ is a technical statistical term meaning ‘long-run average’.

Since an expected value is an average it may assume non-integer values even though the raw data are counts. Accordingly, the expected values should not be rounded to integer values when carrying out the calculations.

E11 = (63/307)(160) = 32.83

                     Timolol                           Placebo

Angina-Free          E11 = (63/307)(160) = 32.83       E12 = (63/307)(147) = 30.17         63

Not Angina-Free      E21 = (244/307)(160) = 127.17     E22 = (244/307)(147) = 116.83      244

                     160                               147                                307

Table 3.3.2: Calculating the expected values

Following exactly the same logic, we estimate the expected numbers in the other cells of the 2x2 table as shown in Table 3.3.2, where i=1, 2 labels the rows and j=1, 2 labels the columns. Note that the expected value in cell (i, j) is generated by:

Eij = (Row Total × Column Total) / Grand Total
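This rule is easy to program. The following Python sketch (an addition to the text, not part of the original manual) reproduces the expected counts of Table 3.3.2 from the margins of the observed table; only built-in Python is assumed.

# Observed counts for the Timolol study (Table 3.3.1): rows = outcome, columns = treatment group
observed = [[44, 19],
            [116, 128]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count in cell (i, j) = row total x column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
print(expected)    # [[32.83..., 30.16...], [127.16..., 116.83...]]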

We now have a set of ‘observed values’ and a set of ‘expected values’. If the assumption of no differential treatment effect is valid, then the two sets of values should differ only by chance variation. A measure of the divergence is:

X² = Σ (Oij − Eij)²/Eij ,

where the sum is taken over all cells of the table.

Since a difference of (Oij–Eij)=5 would be much more important if we expected Eij=10 than if we expected Eij=100, the divisor scales each squared deviation to allow for the size of the expected value. The deviations are squared, as otherwise they would sum to zero, since each row and column of the 2x2 table of deviations contains a positive and negative deviation (Oij–Eij) of equal magnitude (see Table 3.3.3).

Our divergence measure is a test statistic which has a (chi-square) sampling distribution when the assumption under which the expected values were calculated is true. The shape of the sampling distribution (examples will be shown later) depends on a parameter called the ‘degrees of freedom’. This is

often labelled ν (the Greek letter nu, corresponding to n); here ν = 1. This value follows from the fact that for 2x2 tables once one deviation (Oij–Eij) is found, the rest follow from the fact that the deviations sum to zero across the rows and down the columns, see Table 3.3.3. Equivalently, once one expected value is calculated, all the others can be found by subtracting this value from the marginal values of the table. Thus, the expected value in cell (1,2) is 63–32.83 = 30.17 and so on (see Table 3.3.2).

(O11 − E11)²/E11 = (44 − 32.83)²/32.83 = (11.17)²/32.83 = 3.797

(O12 − E12)²/E12 = (19 − 30.17)²/30.17 = (−11.17)²/30.17 = 4.133

(O21 − E21)²/E21 = (116 − 127.17)²/127.17 = (−11.17)²/127.17 = 0.980

(O22 − E22)²/E22 = (128 − 116.83)²/116.83 = (11.17)²/116.83 = 1.067

Note the pattern of deviations (Oij − Eij):

   11.17    −11.17     0
  −11.17     11.17     0
     0         0

Table 3.3.3: Calculating the contributions to X²

The Statistical Test

If the proportion becoming Angina-Free would be π1 and the proportion becoming not-Angina-Free would be π2 (note π1 + π2 = 1) in a population of patients receiving the placebo, then the null hypothesis is that the pair [π1, π2] applies also to the population of patients receiving Timolol. The pair [π1, π2] is the 'relative frequency distribution': the null hypothesis is that this applies to both groups, equally. The individual cell calculations are shown in Table 3.3.3. These give the test statistic as:

X² = Σ (Oij − Eij)²/Eij = 9.98,

the sum being taken over all cells.

If we choose a significance level α = 0.05 then the critical value is X²c = 3.84, as shown in Figure 3.3.1 (see Statistical Table 4: ST-4). Note that the critical region for the test statistic is inherently one-tailed, as large values of X² point to large divergences between observed and expected values, thus requiring rejection of the null hypothesis under which the expected values were calculated.

Figure 3.3.1: The chi-square distribution with ν = 1 degree of freedom

Since the test statistic X² = 9.98 greatly exceeds the critical value of X²c = 3.84, the null hypothesis is rejected. The test result suggests a systematic difference between the observed and expected values. This points to a difference between the two patient groups in terms of the long-run proportions becoming Angina-Free. Of course, we already know this from our previous analysis of the sample proportions.

Typical Computer Output

Table 3.3.4 shows a chi-square analysis of the Timolol data, produced by Minitab.

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

               Timolol    Placebo    Total
A-Free              44         19       63
                 32.83      30.17
                 3.797      4.133
Not-A-Free         116        128      244
                127.17     116.83
                 0.980      1.067
Total              160        147      307

Chi-Sq = 9.978, DF = 1, P-Value = 0.002

Table 3.3.4: Minitab chi-square output for the Timolol data

The computer output gives the observed and expected values and the contribution to X2 for each of the four cells of the table. The individual contributions are summed to give the test statistic, X2=9.978. This is stated to have a p-value of 0.002, which means that if the null hypothesis were true, the probability would be only 0.002 that we would have obtained, by chance, an X2 statistic equal to or greater than X2=9.978. This means that our test statistic is unusually large (see Figure 3.3.2), which in turn means that the divergence between the observed and expected values is also large. Such a p-value would lead to rejection of the null hypothesis.

Figure 3.3.2: The p-value is the area in the right-hand tail beyond the calculated test statistic X² = 9.978
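If a scientific Python installation is to hand, the whole calculation can be reproduced in a few lines. The sketch below is an addition to the text (the course itself uses Minitab); it assumes the numpy and scipy packages, and correction=False suppresses the continuity correction so that the output matches the uncorrected X² reported above.

import numpy as np
from scipy import stats

# Timolol / Placebo counts from Table 3.3.1
observed = np.array([[44, 19],
                     [116, 128]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(round(chi2, 3), dof, round(p, 3))   # about 9.978 on 1 df, p about 0.002
print(np.round(expected, 2))              # matches the expected counts in Table 3.3.4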

Chi-square or Z-test?

Table 3.3.5 shows the Minitab analysis for the comparison of the two proportions of patients who become Angina-Free.

Test and CI for Two Proportions

Sample     A-Free      N    Sample p
Timolol        44    160    0.275000
Placebo        19    147    0.129252

Difference = p (1) - p (2)
Estimate for difference: 0.145748
95% CI for difference: (0.0578398, 0.233657)
Test for difference = 0 (vs not = 0): Z = 3.16  P-Value = 0.002

Table 3.3.5: Comparing the proportions becoming angina-free using Minitab

The Z value has a standard Normal distribution; when the standard Normal is squared it gives a chi-square distribution with one degree of freedom, which is the sampling distribution of the X² statistic. The Z-test for the difference between the population proportions for the Timolol study gives a test statistic of Z = 3.16; when this is squared it gives (apart from the effects of rounding) a value (9.98) which is the same as the X² statistic for the chi-square test. Also, the square of the critical value for the standard Normal distribution equals the critical value for the chi-square test (1.96² = 3.84): the two tests are, therefore, equivalent. The approach using proportions is (to me at least!) more appealing in that it focuses attention on the quantities of direct interest (the proportions becoming Angina-Free). It also leads naturally to calculating a confidence interval to quantify the likely population difference.

Exercises

3.3.1 Re-analyse the Slate Workers Cohort Study data of Example 4, using a chi-square test.

3.3.2 Re-analyse the Oesophageal Cancer Case-Control Study of Example 5, using a chi-square test.

McNemar's Test

In Example 7 we analysed a matched case-control study (endometrial cancer) using both a confidence interval for the difference between two correlated proportions and the corresponding Z-test. If the Z-statistic is squared it gives a chi-square test statistic with one degree of freedom, just as is the case for the test on two independent proportions. This test gives identical results to the corresponding Z-test. The chi-square version is usually referred to as McNemar's test.

Larger Contingency Tables

No new ideas arise when we carry out chi-square tests on contingency tables that are larger than the 2x2 tables just discussed – the amount of calculation simply increases. The initial chi-square test tells us if all groups behave in the same way or not. If they do not it will be necessary to follow up with further analyses to identify where the differences arise. We consider three examples from different areas: market research, literary style analysis and genetics. Exercises then illustrate applications in physics, hospital management and ecology.

Example 10: Market Research Study

Stuart [13, 14] reports on a market research study in which interviewees were asked whether they had heard of a product, had heard of it but not bought it, or had bought it at least once; these classifications measure 'the degree of market penetration' of the product. The study results for three regions are shown in Table 3.3.6.

Region

A B C Totals

Bought 109 49 168 326

Heard-no-buy 55 56 78 189

Never heard 36 45 54 135

Totals 200 150 300 650

Table 3.3.6: Market penetration study

The sample sizes were clearly fixed for the three regions; why they were different is not reported – the regions may, for example, be of different geographical sizes. Before we begin our formal statistical analysis of the data, a few comments on presenting data in tabular form are worth considering. Table 3.3.7 shows the relative frequencies for the market penetration categories in the three regions; the proportions are expressed to three decimal places.

Region

A B C

Bought 0.545 0.327 0.560

Heard-no-buy 0.275 0.373 0.260

Never heard 0.180 0.300 0.180

Table 3.3.7: Relative frequency distributions of response by region

The same information is shown in Table 3.3.8, but this time the numbers are expressed as column percentages and rounded to whole numbers (note that, because of the rounding, region A sums to 101). The columns are also re-arranged to put A beside C to emphasise their similarity.

                      Region
                  A      C      B

Bought           55     56     33
Heard-no-buy     28     26     37
Never heard      18     18     30

Table 3.3.8: The percentage responses by region (rounded)

I think you will agree that Table 3.3.8 is much easier to read and interpret: regions A and C are similar in response characteristics and different from B. Clearly organised tables are important when presenting information to others - see Ehrenberg [15] or Stuart [14] for a discussion of the principles of good table making.

The Statistical Model

For the chi-square analysis of the Timolol study our model was that a fraction (π1) of the population of patients would become free of angina attacks, and a fraction (π2) would not. The question of interest was whether the same fractions [π1, π2] applied to both the Timolol and the placebo groups. Here, we have three groups and three response categories. Our model is that the observed sets of three response frequencies are generated by the same underlying population response relative frequency distribution [π1, π2, π3].

Initial analysis

The observed relative frequencies from the three regions are clearly not the same. The first statistical question is whether the differences between regions could be attributed to chance sampling variation or not. If the answer is yes, there is little point in making further between-region comparisons. This question will be addressed using a chi-square test which is a straightforward extension of the test carried out on the 2x2 table earlier. If the analysis suggests differences between regions, their magnitudes will then be quantified using confidence intervals for differences between proportions – these are calculated exactly as discussed earlier for the Timolol data. Our null hypothesis is that there is a common set of population relative frequencies [π1, π2, π3]. On this assumption, Table 3.3.6 suggests that a natural estimate of the proportion who have bought the product is 326/650, and this should apply equally to all three regions. This implies that the expected number of respondents in region A who have bought the product is:

E11 = (326/650)(200) = 100.31.

As before, this estimate takes the form:

Eij = (Row Total × Column Total) / Grand Total.

Following precisely the same logic, Table 3.3.9 shows the calculation of the expected values for all the cells. Note that the row and column totals are identical to those for the observed values, as shown in Table 3.3.6.

                     A                                B                               C

Bought               E11 = (326/650)(200) = 100.31    E12 = (326/650)(150) = 75.23    E13 = (326/650)(300) = 150.46     326

Heard-not-bought     E21 = (189/650)(200) = 58.15     E22 = (189/650)(150) = 43.62    E23 = (189/650)(300) = 87.23      189

Never heard          E31 = (135/650)(200) = 41.54     E32 = (135/650)(150) = 31.15    E33 = (135/650)(300) = 62.31      135

                     200                              150                             300

Table 3.3.9: Calculating the expected values

Table 3.3.10 shows the divergences between the observed and expected values [Oij–Eij]. Note that the divergences sum to zero across all rows and down all columns (our method of calculating the expected values forces the expected table margins to be the same as the observed table margins, which means the differences are zero).

                        A                         B                          C

Bought              109 − 100.31 = 8.69       49 − 75.23 = −26.23      168 − 150.46 = 17.54      0
Heard-not-bought     55 − 58.15 = −3.15       56 − 43.62 = 12.38        78 − 87.23 = −9.23       0
Never heard          36 − 41.54 = −5.54       45 − 31.15 = 13.85        54 − 62.31 = −8.31       0

                          0                        0                         0

Table 3.3.10: Differences between observed and expected values

Because of these restrictions on the table margins, once four cells are filled (either expected values or differences between observed and expected) the other five are automatically determined – the deviations are said, therefore, to have four degrees of freedom. This means that when we calculate our X2 test statistic, we will compare it to a chi-square distribution with four degrees of freedom. In general, the degrees of freedom for contingency tables are given by:

degrees of freedom = (Number of rows – 1) x (Number of columns – 1)

which, for the current dataset, gives:

degrees of freedom = (3 – 1) x (3 – 1) = 4.

Figure 3.3.3: Chi-square distribution with 4 degrees of freedom

The Minitab analysis of Table 3.3.11 shows that the individual cell contributions sum to a test statistic of:

X² = Σ (Oij − Eij)²/Eij = 24.6

which greatly exceeds the critical value of 9.49 (shown in Figure 3.3.3). Clearly, the assumption of a common underlying relative frequency distribution of responses must be rejected.

Chi-Square Test: A, B, C

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

                   A         B         C     Total
Bought           109        49       168       326
              100.31     75.23    150.46
               0.753     9.146     2.044
Heard-no-buy      55        56        78       189
               58.15     43.62     87.23
               0.171     3.517     0.977
Never-heard       36        45        54       135
               41.54     31.15     62.31
               0.738     6.154     1.108
Total            200       150       300       650

Chi-Sq = 24.608, DF = 4, P-Value = 0.000

Table 3.3.11: Minitab chi-square analysis of market research data
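The same calculation for a table of any size takes only a few lines of code. The sketch below is an addition to the text (the manual itself uses Minitab); numpy and scipy are assumed to be available.

import numpy as np
from scipy import stats

# Market penetration counts (Table 3.3.6): rows = response category, columns = regions A, B, C
observed = np.array([[109, 49, 168],
                     [55, 56, 78],
                     [36, 45, 54]])

chi2, p, dof, expected = stats.chi2_contingency(observed)
contributions = (observed - expected)**2 / expected    # cell-by-cell contributions to X-squared
print(round(chi2, 3), dof, round(p, 4))                # about 24.608 on 4 df
print(np.round(contributions, 3))                      # compare with Table 3.3.11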

Further analysis

A chi-square analysis of regions A and C supports the conclusion of common underlying rates, which was suggested by Table 3.3.8. The p-value of 0.928 in Table 3.3.12 shows the observed and expected values to be very close to each other8.

Chi-Square Test: A, C

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

                   A         C     Total
Bought           109       168       277
              110.80    166.20
               0.029     0.019
Heard-no-buy      55        78       133
               53.20     79.80
               0.061     0.041
Never heard       36        54        90
               36.00     54.00
               0.000     0.000
Total            200       300       500

Chi-Sq = 0.150, DF = 2, P-Value = 0.928

Table 3.3.12: Chi-square comparison of responses for regions A and C

Table 3.3.13 shows a chi-square comparison of the pooled values for regions A and C versus those of region B: region B clearly behaves differently. Note that the statistical tests shown in Tables 3.3.12 and 3.3.13 have a different status from that in Table 3.3.11. The tests carried out in Tables 3.3.12 and 3.3.13 were prompted by examining the data. It is axiomatic in scientific investigation that you cannot form an hypothesis on the basis of a set of data and test the hypothesis using the same dataset. These tests may be regarded, therefore, as crude measures in an exploratory analysis; we would treat the results with a great deal of caution if the tests were only marginally statistically significant. Given the marked differences in responses between B and the combined A and C, it is possible (even likely?) that the market researchers would have expected such results and might have been in a position to formulate the hypotheses underlying the tests of Tables 3.3.12 and 3.3.13 a priori, in which case the tests would have their usual status. It might, for example, be the case

8 My colleague Michael Stuart found the data in a textbook (but has lost the reference). The

results in Table 3.3.12 are so closely matched for regions A and C as to suggest to me that they may have been invented for illustrative purposes! However, I am taking them at face value until he retrieves the reference.


that B is a relatively new market while A and C are mature markets; Stuart does not provide further information which would throw light on such questions.

Chi-Square Test: A+C, B

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

                 A+C         B     Total
Bought           277        49       326
              250.77     75.23
               2.744     9.146
Heard-no-buy     133        56       189
              145.38     43.62
               1.055     3.517
Never heard       90        45       135
              103.85     31.15
               1.846     6.154
Total            500       150       650

Chi-Sq = 24.461, DF = 2, P-Value = 0.000

Table 3.3.13: Chi-square comparison of pooled A and C versus B

Quantitative statements regarding differences can be supported by confidence intervals for the population differences. For example, Table 3.3.14 shows a comparison of the proportions of consumers who have bought the product in region B versus the combined region A+C. Expressed in terms of percentages, the point estimate for the difference is 23%, but the confidence interval suggests that the percentage of consumers in regions A and C who bought the product is between 14 and 31 percentage points higher than the corresponding value for region B.

Test and CI for Two Proportions

Sample    Bought      N    Sample p
A+C          277    500       0.554
B             49    150       0.327

Difference = p (1) - p (2)
Estimate for difference: 0.227333
95% CI for difference: (0.140550, 0.314117)
Test for difference = 0 (vs not = 0): Z = 4.88  P-Value = 0.000

Table 3.3.14: Difference in buying proportions; regions A+C versus B

Figure 3.3.4: A graphical representation of the study results (percentages)

Figure 3.3.4 presents the results of the study in graphical form. The picture shows regions A and C to be higher by 22 percentage points than region B in relation to those who have bought the product. On the other hand, B is higher by 10-12 percentage points for the other two market penetration categories. Note that the lines connecting the dots have no intrinsic meaning – they simply help us visualise the different market penetration 'profiles' corresponding to the two market segments. Contrast this picture with Table 3.3.6 as a way of reporting the study results. Tables of technical statistical analyses, such as Table 3.3.11, are inappropriate in presenting study results – they contain too many unnecessary technical details. If at all, they should appear in a technical appendix to a report on the market research study.

Example 11: Literary Style

Jane Austen left an unfinished novel (Sanditon) when she died. She also left a summary of the remaining chapters. The novel was finished by an imitator, who tried to match her style as closely as possible. The data below are cited by Rice [10], based on an analysis by Morton [16]. Six words were selected for analysis and their frequencies were counted in Chapters 1 and 3 of Sense and Sensibility, 1, 2 and 3 of Emma, 1 and 6 of Sanditon (described below as Sanditon-A, written by Austen) and 12 and 24 of Sanditon (described below as Sanditon-B, written by the imitator). Can we distinguish between the authors on the basis of the word counts?

          Sense+Sensibility    Emma    Sanditon-A    Sanditon-B    Totals

a 147 186 101 83 517

an 25 26 11 29 91

this 32 39 15 15 101

that 94 105 37 22 258

with 59 74 28 43 204

without 18 10 10 4 42

Totals 375 440 202 196 1213

Table 3.3.15 Word counts for the literary works

Statistical Model

Before beginning our analysis it is worthwhile considering the nature of the data given in the table. Note that the data generation mechanism is quite different from our previous market research example. There, there were three categories of responses (bought, heard but no buy, and never heard of product). The three relative frequencies or percentages for any region provided us with estimates of consumer behaviour. Here the row categories are arbitrarily selected words used by the author. The relative frequencies for any one novel do not have any intrinsic meaning (unlike the market research study); they are simply the relative frequencies of these particular words (relative to the full set of words included in the list) and would change if any of the words were dropped from the list or if others were added. The statistical model that underlies the chi-square analysis that follows is that there is a set of relative frequencies for these arbitrarily selected words [π1, π2, ..., π6] that characterises Austen's writing style; it underlies all her writing and should apply also to the imitator, if he has been successful in matching her style. If this is true, then the observed frequencies (considered as a random sample from her works) should only reflect chance variation away from this stable underlying structure.

Analysis

Before comparing Austen to her imitator, it makes sense to check if the word distributions are stable within her own work. If they are not, then the basis for the comparison is undermined; we might have to ask ourselves to which Jane Austen we want to compare the imitator! The frequency distributions for the three known Austen works are given in Table 3.3.16 and a chi-square test is carried out. The critical value for a test using a significance level of 0.05, where there are 10 degrees of freedom, is 18.31. The test statistic of 12.3 is much less than this (with a correspondingly large p-value)

and so we do not reject the null hypothesis of a consistent underlying structure for the three works. It makes sense, therefore, to combine the counts for the three works before comparing them to the imitator. One set of frequencies, based on a total sample size of 1017, will be more stable than three sets based on smaller sample sizes. The observed values in the first column of Table 3.3.17 are the combined word counts, that is the totals from the right-hand margin of Table 3.3.16.

Chi-Square Test: Sensibility, Emma, Sanditon-A

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

          Sensibility      Emma   Sanditon-A    Total
a                 147       186          101      434
               160.03    187.77        86.20
                1.061     0.017        2.540
an                 25        26           11       62
                22.86     26.82        12.31
                0.200     0.025        0.140
this               32        39           15       86
                31.71     37.21        17.08
                0.003     0.086        0.254
that               94       105           37      236
                87.02    102.10        46.88
                0.560     0.082        2.080
with               59        74           28      161
                59.37     69.66        31.98
                0.002     0.271        0.495
without            18        10           10       38
                14.01     16.44         7.55
                1.135     2.523        0.797
Total             375       440          202     1017

Chi-Sq = 12.271, DF = 10, P-Value = 0.267

Table 3.3.16: Consistency check on Austen's style

In Table 3.3.17 the test statistic for the between author comparison (X2=32.8) is highly statistically significant (the critical value for a significance level of 0.05 with 5 degrees of freedom is 11.07). This indicates a divergence between the authors in the relative frequency of use of one or more of the selected words. Note the very large contributions to the overall chi-square statistic given by ‘an’ (13.899 for the imitator) and ‘that’ (9.298 for the imitator). Table 3.3.18 shows the relative frequencies (expressed as percentages) of the six words for the two authors.

Chi-Square Test: Austen, Imitator

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

            Austen   Imitator    Total
a              434         83      517
            433.46      83.54
             0.001      0.003
an              62         29       91
             76.30      14.70
             2.679     13.899
this            86         15      101
             84.68      16.32
             0.021      0.107
that           236         22      258
            216.31      41.69
             1.792      9.298
with           161         43      204
            171.04      32.96
             0.589      3.056
without         38          4       42
             35.21       6.79
             0.220      1.144
Total         1017        196     1213

Chi-Sq = 32.810, DF = 5, P-Value = 0.000

Table 3.3.17: Between author comparison

Relative to his use of the other words, the imitator uses 'an' much more frequently than does Austen (15% versus 6%). Austen on the other hand uses 'that' more frequently (relative to her use of the other words) than does the imitator (23% versus 11%). The impact of these divergences can be seen in Table 3.3.17. For Austen, 62 instances of 'an' were observed; on the equal relative frequency assumption we would have expected 76.3. For the imitator, the corresponding values are 29 and 14.7. Note that the divergences are equal in magnitude but opposite in sign ±14.3 (as they must be since the row sums of both the observed and expected values equal 91 and, hence, the sum of the cell divergences is zero). The contribution to X² is, however, greater for the imitator since the contribution is:

(Oij − Eij)²/Eij

and Eij is only 14.7 for the imitator, whereas it is 76.3 for Austen. Similar comments apply to the use of ‘that’.

Word Austen Imitator

a 43 42

an 6 15

this 8 8

that 23 11

with 16 22

without 4 2

Table 3.3.18: Relative frequencies (column percentages) of word use

When ‘an’ and ‘that’ are excluded from the word list, the chi-square test that compares the distributions for the remaining words is not statistically significant, as shown in Table 3.3.19. This highlights the potential influence of the arbitrary nature of the word selection in stylistic comparisons.

Chi-Square Test: Austen, Imitator

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

            Austen   Imitator    Total
a              434         83      517
            430.23      86.77
             0.033      0.163
this            86         15      101
             84.05      16.95
             0.045      0.224
with           161         43      204
            169.76      34.24
             0.452      2.243
without         38          4       42
             34.95       7.05
             0.266      1.319
Total          719        145      864

Chi-Sq = 4.746, DF = 3, P-Value = 0.191

Table 3.3.19: The reduced word list comparison

In summary, the imitator appears to have matched the relative frequencies of the uses of ‘a’, ‘this’, ‘with’, and ‘without’ in Austen’s style, but failed to do so with respect to ‘an’ and ‘that’. Does this mean we will never see a BBC Sanditon series? Will Hollywood be quite so fastidious?

Example 12: Mendel's Pea Plant Data

The data consist of 529 pea plants cross-classified into a two-way table according to the shape [round, round and wrinkled, wrinkled] and colour [yellow, yellow and green, green] of their seeds. They are typical of the type of data presented in discussions of simple Mendelian theories of inheritance. What is of interest here is whether the row and column classifications can or cannot be considered independent of each other: this carries implications for the underlying genetic mechanisms at work.

                                Colour
                      Yellow     Y+G     Green     Totals

          Round           38      65        35        138
Shape     R+W             60     138        67        265
          Wrinkled        28      68        30        126

          Totals         126     271       132        529

Table 3.3.20: Mendel's pea plant data

Statistical Question

In the previous two examples, there were natural response characteristics (e.g., bought, heard-but-not-bought, and never-heard) in each of several defined groups (regions for the market research data). An analysis expressed in terms of the equality of the response proportions having these characteristics in each of the groups was a natural approach in addressing the question of interest. In the current example, however, neither classification factor (shape or colour) can be thought of as a response while the other drives that response in some way – here, the row and column classifications have equal status. Accordingly, the research question is expressed in the form: "is the row classification independent of the column classification?" For the data of Table 3.3.20 this would mean, for example, that the probability that a randomly selected seed is yellow does not depend on whether or not the seed is round.

Analysis

Seed shape and colour are genetically determined. In asking if the row and column classifications of the 529 plants are independent of each other, what we are really concerned with are the underlying genetic mechanisms determining seed characteristics. Rejection of the null hypothesis of independence would suggest that the genetic mechanisms are linked in some way.

Chi-Square Test: Yellow, Y+G, Green

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

             Yellow       Y+G     Green     Total
Round            38        65        35       138
              32.87     70.70     34.43
              0.801     0.459     0.009
R+W              60       138        67       265
              63.12    135.76     66.12
              0.154     0.037     0.012
Wrinkled         28        68        30       126
              30.01     64.55     31.44
              0.135     0.185     0.066
Total           126       271       132       529

Chi-Sq = 1.857, DF = 4, P-Value = 0.762

Table 3.3.21: Chi-square test of row-column independence

We estimate the probability that a randomly selected seed will be Round as 138/529, while the probability that it will be Yellow is estimated as 126/529, based on the data in the margins of Table 3.3.20. If the two characteristics are independent of each other then the product of these two sample estimates gives us an estimate of the probability that both characteristics are present simultaneously.

P[Round and Yellow] = P[Round] × P[Yellow] = (138/529) × (126/529).

The estimate of the expected number of round and yellow seeds is given by:

N × P[Round and Yellow] = N × P[Round] × P[Yellow] = 529 × (138/529) × (126/529) = (138 × 126)/529 = 32.87.

Note that the expected value is given by:

E11 = (Row Total × Column Total) / Grand Total

as before. Precisely the same logic, applied to the other cells of the table, gives a complete set of expected values, under the independence assumption (these are shown in Table 3.3.21). If the sample size were infinite then the observed and expected values would coincide (assuming independence): sampling variability would not affect the results. Because of the small sample size and the inevitable chance variation in biological systems, we do not expect exact correspondence, even if the two characteristics are independent. However, the X2 measure of divergence should be smaller than the critical value on a chi-square distribution with 4 degrees of freedom.
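The independence argument translates directly into code: the expected table is the outer product of the two sets of marginal proportions, scaled up by N. The sketch below is an addition to the text (numpy is assumed to be available).

import numpy as np

# Mendel's pea data (Table 3.3.20): rows = shape, columns = colour
observed = np.array([[38, 65, 35],
                     [60, 138, 67],
                     [28, 68, 30]])
n = observed.sum()

p_shape = observed.sum(axis=1) / n      # marginal probabilities for shape
p_colour = observed.sum(axis=0) / n     # marginal probabilities for colour

# Under independence, P[shape i and colour j] = P[shape i] x P[colour j]
expected = n * np.outer(p_shape, p_colour)
print(np.round(expected, 2))            # e.g. E11 = 529 x (138/529) x (126/529) = 32.87

chi2 = ((observed - expected)**2 / expected).sum()
print(round(chi2, 3))                   # about 1.857, as in Table 3.3.21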

If we choose a significance level of α = 0.05, then our critical value is 9.49. Since the value of X² in Table 3.3.21 is much smaller than this, there is no basis for rejecting the null hypothesis of row-column classification independence. The observed data are consistent with the independence assumption. It would appear that the genetic mechanisms that determine seed shape and colour do not affect each other.

Exercises

3.3.3 In a study involving high-altitude balloon flights near the north magnetic pole, the

numbers of positive and negative electrons in cosmic ray showers were counted. Each particle was classified by charge and into one of three energy intervals (MeV stands for millions of electron volts) as shown in Table 3.3.22.

I have lost the source of these data, so no further information is available. Studies like this were the most important source of information on high-energy sub-atomic particles before the construction of high-energy particle accelerators. Cosmic rays are still an important and active research area (see Wikipedia for a long account of the history and current research).

                           Electron Charge
Energy Interval (MeV)    Positive    Negative

50-100                          9          20
101-300                        32          51
301-1000                       23         117

Totals                         64         188

Table 3.3.22: Cosmic ray shower data

(a) What can we learn about the charge-energy characteristics of the two particle types from the data? Carry out a chi-square test. What is the null hypothesis? If you get a statistically significant result carry out further analyses (confidence intervals) to tease out the differences.

(b) Suppose it was realized, subsequent to your analysis, that the energy measurements had

been faulty in such a way that electrons of up to 200 MeV could sometimes have been assigned to the lowest energy interval. Re-analyse the data in the light of this information. Do your initial conclusions still hold?

3.3.4 Box, Hunter and Hunter [17] report the data shown below for five hospitals. The

frequencies represent patient status after a surgical procedure designed to improve the functioning of certain joints which have become impaired by disease. Patients were classified by post-operative status: no improvement, partial functional restoration and complete functional restoration.

Hospital

A B C D E Totals

No improvement 13 5 8 21 43 90

Partial funct. rest. 18 10 36 56 29 149

Complete funct. rest. 16 16 35 51 10 128

Totals 47 31 79 128 82 367

Table 3.3.23: Post-operative improvement status of patients at five hospitals

The authors describe hospital E as a ‘referral hospital’, which presumably means that it is a specialised centre where more difficult procedures are carried out. It is to be expected, therefore, that the post-operative patient characteristics could be different from the other hospitals. Are they? Analyse the data and present your results both in tabular and graphical form.

3.3.5 Hauer, Ansorge and Zinke [8] report data on otters that were either killed on the road or

shot by hunters in eastern Germany over a 40 year time period. Note that the numbers were recovered from summaries in the original paper and the total sample size here is 1023, but the original paper refers to a total sample size of 1027. Table 3.3.24 shows the N=1023 otters cross-classified into a 2x2 table with age-group and sex as the row and column variables, respectively. The authors of the paper described otters of age four or more years as ‘grown adults’; I have described otters younger than this as ‘young’.

             Male    Female    Totals

Young         286       153       439
Adults        312       272       584

Totals        598       425      1023

Table 3.3.24: Otters cross-classified by age and sex

These data were analysed in Exercise 3.2.4 as a comparison of two independent

proportions (the proportion of ‘young’ animals for each sex). Now, investigate if the row and column classifications can be considered independent (as

for the Mendelian data) using a chi-square test to answer the question.

3.4 Goodness-of-Fit Tests

We have used the chi-square test to compare two or more empirical frequency distributions (for example, in the Market Research Study we compared the numbers of consumers falling into each of three 'market penetration' categories, for three regions), to determine if the underlying distributions could be considered the same or not. Sometimes, it is desired to assess whether or not a single frequency distribution corresponds to an expected theoretical distribution; the test is then called a 'goodness-of-fit' test. For example, we might ask if the numbers on a roulette wheel are equally likely to turn up – if not, we might conclude that the wheel is inherently biased or that the wheel operator can influence the outcome of the game.

Example 13: Random Number Generation

Random numbers are used in a variety of important problems, from solving mathematical equations that are too difficult to analyse analytically, to creating simulation models of physical systems (or computer games), to selecting random samples from lists of people for inclusion in a survey. In practice, the numbers used are, very often, 'pseudo-random', that is they are generated by a deterministic algorithm with a view to creating a list of numbers that have the characteristics (suitably defined) of random numbers. I generated 10,000 digits, 0–9, using the random number generator in Minitab and obtained the results shown in Table 3.4.1.

Digits              0     1     2     3     4     5     6     7     8     9    Total
Observed freq.   1036   960  1059   946  1029   995   974   959  1045   997    10000

Table 3.4.1: A set of pseudo-random digits

Perhaps the simplest characteristic we might expect in a set of random digits is that the ten values are equally likely to appear. Does the set of frequencies I have generated suggest that the random number generator is well-behaved in this regard? Our null hypothesis is that the probability of occurrence for each digit is constant, π = 0.10. The alternative hypothesis is that not all probabilities are

equal to 0.10. If the null hypothesis is true, then the expected frequency is 1000

[N=10,000(0.10)] for each digit, as shown in Table 3.4.2. The chi-square test statistic is calculated as before:

X² = Σ (Observed − Expected)²/Expected = 14.55,

where the sum is taken over the ten digit categories.

Digits 0 1 2 3 4 5 6 7 8 9 Totals

Observed freq. 1036 960 1059 946 1029 995 974 959 1045 997 10000

Expected freq. 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 10000

Chi-square 1.296 1.6 3.481 2.916 0.841 0.025 0.676 1.681 2.025 0.009 14.55

Table 3.4.2: A chi-square goodness-of-fit test on the random digits
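A goodness-of-fit test against equal expected frequencies is a one-line call in scipy. The sketch below is an addition to the text (scipy assumed to be available) and reproduces the X² of Table 3.4.2.

from scipy import stats

observed = [1036, 960, 1059, 946, 1029, 995, 974, 959, 1045, 997]   # Table 3.4.1
# With no expected frequencies supplied, chisquare treats all categories as equally likely
result = stats.chisquare(observed)
print(round(result.statistic, 2), round(result.pvalue, 2))   # about 14.55 on 9 df, p about 0.10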

The ten observed and expected values in Table 3.4.2 each add to 10,000 and, therefore, the set of differences [Oi–Ei] sums to zero. Accordingly, once nine expected values (or differences) are calculated, the last one is automatically determined. This means that, when the null hypothesis is true, the X² statistic follows a chi-square distribution with 9 degrees of freedom. The critical value for a test having a significance level of 0.05 is X²c = 16.9. Since our test statistic of X² = 14.55 is less than this, we cannot reject the null hypothesis that the digits are equally likely to occur. The area beyond X² = 14.55 in the right-hand tail of a chi-square distribution with 9 degrees of freedom is 0.10; this is the p-value for our test. The fact that it exceeds 0.05 means that we cannot reject the null hypothesis, if we have previously decided on a significance level of 0.05. The observed frequency distribution is consistent with the pseudo-random number generator producing equal relative frequencies for the ten digits.

Example 14: Lotto Numbers

Table 3.4.3 shows information on the frequencies of occurrence of different numbers in the Irish National Lottery (downloaded from the Lotto website on 30/7/2007). The mechanism used to select the numbers is physical rather than algorithmic (balls are picked from a large glass bubble), and it is obviously important that the numbers are equally likely to be picked. I decided to use only the first 32 (the original number of balls when the Lotto was established) as I was unclear on whether numbers after 32 had been included to the same extent in all draws (clearly 43-45 had not).

1=248 2=235 3=236 4=242 5=214

6=219 7=226 8=205 9=255 10=233

11=255 12=220 13=238 14=259 15=227

16=235 17=200 18=242 19=217 20=230

21=236 22=222 23=217 24=224 25=213

26=192 27=210 28=233 29=198 30=217

31=217 32=253 33=194 34=236 35=216

36=228 37=213 38=213 39=226 40=214

41=229 42=220 43=10 44=17 45=8

NOTE: The numbers 43, 44 and 45 were introduced on the 28th October, 2006

Table 3.4.3: Frequencies for the Lotto numbers

The total for the 32 observed frequencies is N=7268. On the assumption that all numbers are equally likely (null hypothesis) then our expected value for each of the 32 numbers is E=7268/32=227.125. We calculate the X² statistic as above and obtain X²=40.2. As before, the degrees of freedom for the sampling distribution of X² (when the null hypothesis is true) will be the number of categories (K) minus 1, (K–1)=31. For a test with a significance level of 0.05, the critical value is X²c = 45.0; our test statistic is not statistically significant – the corresponding p-value is 0.12. We cannot reject the null hypothesis of equally likely outcomes – the test does not provide any evidence of bias in the selection of Lotto numbers.

Example 15: Mendelian Ratios

Samuels [1] reports data from a breeding study [18] in which white chickens with small combs were mated and produced 190 offspring, which were classified as shown in Table 3.4.4.

Feathers/Comb size    Observed Freq.

White, small                     111
White, large                      37
Dark, small                       34
Dark, large                        8

Total                            190

Table 3.4.4: Breeding study results

This table is a typical example of Mendelian genetics data, where there are two characteristics, here Feather Colour (white is dominant) and Comb Size (small is


dominant), each determined by a single pair of genes; the members of the pair may appear in either of two versions. Dominant means that if even one copy of this version of the gene is present the organism will have the appearance corresponding to the dominant gene. The parents in this study had one dominant and one recessive gene for each characteristic; according to Mendelian theory their offspring would be expected to be born in the ratios ¾ for the appearance (phenotype) of the dominant characteristic, and ¼ for the recessive characteristic. How do these ratios arise? Mendelian theory suggests the following mechanism. The process begins with parents with pure genetic characteristics: both copies of the gene for the characteristic of interest are the same – either both dominant, D, or both recessive, r. When two such parents produce offspring, the genetic structure of the offspring is necessarily hybrid, D-r, as each parent contributes one copy of the gene. When two first generation hybrid offspring breed, the second generation appear as shown in Figure 3.4.1. Thus, the probability that the Dominant characteristic will be seen (phenotype) is ¾, while the corresponding probability for the recessive characteristic is ¼.

Figure 3.4.1: Mendelian mechanism for second generation ratios

If the two characteristics are independently determined then we expect the combined characteristics to appear with probabilities 9/16, 3/16, 3/16, 1/16, that is, in the ratios 9:3:3:1, as shown in Table 3.4.5 (see footnote 9).

Small:3/4 Large:1/4

White:3/4 9/16 3/16

Dark:1/4 3/16 1/16

Table 3.4.5: The expected ratios, given independence

9 Recall, if two characteristics A and B are independent, then the probability that they both occur

together is the product of their separate probabilities of occurring: P(A and B)=P(A).P(B). Hence, P(small and white)=P(small).P(white)=3/4 x 3/4 =9/16

Our null hypothesis is that the Mendelian ratios 9:3:3:1 describe the underlying frequency structure, i.e., the two characteristics are independently determined. A statistical test essentially asks if the observed frequencies differ from those predicted by the probabilities 9/16:3/16:3/16:1/16 by more than would be expected from chance biological variation (the expected frequencies are 190 times the corresponding probabilities). Table 3.4.6 shows the observed and expected values and the contributions to the overall chi-square test statistic X² = 1.55 for the four categories. The critical value for a test with a significance level of 0.05 is X²c = 7.8, for a chi-square distribution with 3 degrees of freedom [DF = No. of categories – 1]. The test is non-significant; the corresponding p-value is 0.67. The empirical data are consistent with the classic Mendelian theory.

Feathers/Comb size    Ratios    Obs. Freq.    Expect. Freq.    Chi-square

White, small            9/16           111          106.875         0.159
White, large            3/16            37           35.625         0.053
Dark, small             3/16            34           35.625         0.074
Dark, large             1/16             8           11.875         1.264

Total                   1.00           190          190             1.551

Table 3.4.6: A chi-square analysis of the chicken breeding data
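The same scipy routine used for the random digits accepts user-supplied expected frequencies, so the 9:3:3:1 test can be reproduced as follows (a sketch added to the text; scipy is assumed to be available).

from scipy import stats

observed = [111, 37, 34, 8]                # Table 3.4.4
ratios = [9/16, 3/16, 3/16, 1/16]          # Mendelian 9:3:3:1 probabilities
expected = [190 * p for p in ratios]       # 106.875, 35.625, 35.625, 11.875

chi2, p = stats.chisquare(observed, f_exp=expected)
print(round(chi2, 2), round(p, 2))         # about 1.55 on 3 df, p about 0.67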

Example 16: Poisson Distribution – α-particles

The Poisson distribution is a probability model for rare events, where the rate of occurrence (λ) is constant in either space or time, and events occur independently. For example, suppose a particular complication in the delivery of babies occurs at a rate of λ = 3 per month in a large maternity hospital. We can calculate the probabilities for given numbers of such events10 (x) in any one month using the formula:

P(x) = (λ^x / x!) e^(−λ)    where λ = 3 and x = 0, 1, 2, 3, 4 …

thus

P(0) = e^(−3) = 0.0498
P(1) = 3 e^(−3) = 0.1494
P(2) = (3² / 2·1) e^(−3) = 0.2240
P(3) = (3³ / 3·2·1) e^(−3) = 0.2240
P(4) = (3⁴ / 4·3·2·1) e^(−3) = 0.1680

10 Note that the requirement that events occur independently means that the model could be applied to numbers of deliveries, but not to numbers of births, if twin births were a possibility for the event being modelled.
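These probabilities can be read directly from scipy's Poisson distribution. The sketch below is an addition to the text and assumes scipy is available; the rate λ = 3 is the one used in the illustration above.

from scipy import stats

lam = 3                                              # assumed rate: 3 complications per month
for x in range(5):
    print(x, round(stats.poisson.pmf(x, lam), 4))    # 0.0498, 0.1494, 0.2240, 0.2240, 0.1680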

The value of such calculations lies in the fact that if special teams/equipment are required for these deliveries, then we can evaluate the probabilities for different numbers of events and plan resources (contingency plans) accordingly. The Poisson model is a fundamental tool in analysing the likelihood of events such as occurrence of equipment failure, overloads on computer systems (e.g., internet switching devices), blackouts on power transmission networks etc. Simulation of such systems will require generation of Poisson random numbers (or of a closely related distribution). Accordingly, we might want to test random numbers to determine if they behave as if they come from a Poisson distribution, rather than being simply equally likely, as in Example 13. The example I have chosen to illustrate testing such a distribution refers to a physics experiment. Rice [10] reports an analysis by Berkson [19] of radioactivity data observed in the American Bureau of Standards. The data are summarised in Table 3.4.7: the observed frequencies refer to the number of 10-second intervals in which

different numbers (x) of α-particles were emitted from an Americium-241 source. The total number of particles observed divided by the total observation time gave

0.8392 particles per second, which corresponds to λ = 8.392 particles per 10-second interval. As this value is calculated from the data, it is only an estimate of the 'true' underlying rate (this is relevant to the analysis). The question of interest is whether or not the data can be modelled by a Poisson distribution. If the answer is yes, it suggests that the individual emissions are independent of each other (independence is a key assumption underlying the Poisson model) – which carries implications for the underlying physical processes. To calculate the set of expected frequencies11, we first use the Poisson model with parameter λ = 8.392 to get a set of probabilities P(x) for different numbers of particles (x) in any one interval, and then multiply these by N=1207, the total number of intervals, to get the expected number of intervals with x particles [Expected value, E(x) = NP(x)].

11 In calculating expected frequencies it is recommended that the cells should have an expectation of at least 1 (some say 5) to ensure that the approximation to the chi-square distribution is good; cells should be combined to ensure that this happens. This is the reason the first row of the table accumulates x=0, 1, 2; otherwise the expected values corresponding to x=0 and 1 would both be less than 5.

For example:

P(x) = (λ^x / x!) e^(−λ)    x = 0, 1, 2, 3, 4 …

P(3) = (8.392³ / 3·2·1) e^(−8.392) = 0.022328

and the expected number of intervals with three particles is:

E[No. of intervals with 3 particles] = NP(3) = 1207x0.022328=26.95 as shown in Table 3.4.8. The probability for the first row of the table is the sum P(0)+P(1)+P(2)=P(<3)=0.010111. Multiplying this by 1207, we get the expected frequency of 12.20 shown in the first row of Table 3.4.8. Since there is no upper bound on the number of particles in a time interval, we calculate P(x>16)=1–[P(0)+P(1)+P(2)…P(16)] = 0.00587, which when multiplied by 1207 gives the last expected frequency of 7.09 in Table 3.4.8.

No. particles

(x) Frequency

0-2 18

3 28

4 56

5 105

6 126

7 146

8 164

9 161

10 123

11 101

12 74

13 53

14 23

15 15

16 9

17+ 5

Total 1207

Table 3.4.7: Numbers of intervals in which (x) α-particles were observed

The last column of Table 3.4.8 is calculated using:

(Oi − Ei)²/Ei

where (i) labels the rows.

Number of Observed Expected

particles (x) Frequency Frequency Chi-square

0-2 18 12.20 2.75

3 28 26.95 0.04

4 56 56.54 0.01

5 105 94.90 1.08

6 126 132.73 0.34

7 146 159.12 1.08

8 164 166.92 0.05

9 161 155.65 0.18

10 123 130.62 0.44

11 101 99.65 0.02

12 74 69.69 0.27

13 53 44.99 1.43

14 23 26.97 0.58

15 15 15.09 0.00

16 9 7.91 0.15

17+ 5 7.09 0.61

Total 1207 1207 9.04

Table 3.4.8: A chi-square analysis of the α-particle data
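The whole fitting exercise can be reproduced in a few lines of Python (an addition to the text; scipy is assumed). The pooled first and last cells are handled with the Poisson cumulative distribution function, and ddof=1 removes the extra degree of freedom for the estimated rate. Because the expected frequencies are carried at full precision rather than rounded to two decimals, the result agrees with Table 3.4.8 only up to rounding.

from scipy import stats

# Observed numbers of 10-second intervals (Table 3.4.7): cells 0-2, 3, 4, ..., 16, 17+
observed = [18, 28, 56, 105, 126, 146, 164, 161, 123, 101, 74, 53, 23, 15, 9, 5]
lam, n = 8.392, 1207                       # rate estimated from the data; total number of intervals

probs = [stats.poisson.cdf(2, lam)]                         # P(x <= 2), the pooled first cell
probs += [stats.poisson.pmf(x, lam) for x in range(3, 17)]  # P(x) for x = 3, ..., 16
probs += [1 - stats.poisson.cdf(16, lam)]                   # P(x >= 17), the open-ended last cell
expected = [n * p for p in probs]

# ddof=1 subtracts one further degree of freedom for the estimated Poisson parameter
chi2, p = stats.chisquare(observed, f_exp=expected, ddof=1)
print(round(chi2, 2), round(p, 2))         # about 9.0 on 14 df, p about 0.83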

The overall X2 statistic is 9.04, as shown in Table 3.4.8. To calculate the degrees of freedom, we subtract 1 from the number of categories, K, as before; we also subtract 1 degree of freedom for every parameter estimated from the

data12 – here we estimated one parameter, λ, so we have:

Degrees of Freedom = No. of categories –1 – no. of parameters estimated

= (K –1 –1) = 16 – 1 – 1 = 14.

12 If we were fitting a Normal distribution to a set of data, we would need to estimate the mean, μ, and the standard deviation, σ, and so we would lose 2 degrees of freedom for parameter estimation (3 in total).

The critical value is X²c = 23.7 for a test with a significance level of 0.05; our test statistic of 9.04 is very much less than this (the corresponding p-value is 0.83), so we cannot reject the null hypothesis - a Poisson model fits the data. This suggests that the α-particles are emitted independently of each other.

Exercise

3.4.1 Example 12 analysed 529 pea plants cross-classified into a two-way table according to

the shape [round, round and wrinkled, wrinkled] and colour [yellow, yellow and green, green] of their seeds. The question of interest was whether the row and column classifications were independent. The data are reproduced below.

                            Colour
                  Yellow   Y+G   Green   Totals
        Round         38    65      35      138
Shape   R+W           60   138      67      265
        Wrinkled      28    68      30      126
        Totals       126   271     132      529

Table 3.4.9: Mendel’s pea plant data

Mendelian theory predicts that the frequencies for the three colours (Y : Y+G : G) will appear in the ratios 1:2:1 [or, equivalently, the probability distribution is: ¼, ½, ¼]. Similarly, the shapes (R : R+W : W) are also predicted to appear in the same ratios. The mechanism is illustrated in Figure 3.4.1, page 52.

Carry out a goodness-of-fit test to check whether the observed data depart from the 1:2:1 ratios by no more than would be expected from chance variation. Do this for both the Colour and Shape distributions.


References

[1] Samuels, M.L., Statistics for the Life Sciences, Dellen Publishing Company, 1989.
[2] Brailovsky, D., Timolol maleate (MK-950): A new beta-blocking agent for prophylactic management of angina pectoris. A multicenter, multinational, cooperative trial. In Beta-Adrenergic Blocking Agents in the Management of Hypertension and Angina Pectoris, B. Magnani (ed.), Raven Press, 1974.
[3] Machin, D., Campbell, M.J., and Walters, S.J., Medical Statistics, Wiley, 2007.
[4] Campbell et al., A 24 year cohort study of mortality in slate workers in North Wales, Journal of Occupational Medicine, 55, 448-453, 2005.
[5] Breslow, N.E. and Day, N.E., Statistical Methods in Cancer Research: Vol. 1 – The analysis of case-control studies, International Agency for Research on Cancer, 1980.
[6] Tuyns, A.J., Pequignot, G. and Jensen, O.M., Le cancer de l’oesophage en Ille-et-Vilaine en fonction des niveaux de consommation d’alcool et de tabac, Bull. Cancer, 64, 45-60, 1977.
[7] Bland, M., An Introduction to Medical Statistics, Oxford University Press, 2000.
[8] Hauer, S., Ansorge, H. and Zinke, O., A long-term analysis of the age structure of otters (Lutra lutra) from eastern Germany, International Journal of Mammalian Biology, 65, 360-368, 2000.
[9] Bickel, P.J., Hammel, E.A. and O’Connell, J.W., Sex Bias in Graduate Admissions: Data from Berkeley, Science, Vol. 187, 398-404, 1975.
[10] Rice, J.A., Mathematical Statistics and Data Analysis, Duxbury Press, 1995.
[11] Smith, D.G., Prentice, R., Thompson, D.J., and Hermann, W.L., Association of exogenous estrogen and endometrial carcinoma, N. Eng. J. Med., 293, 1164-67, 1975.
[12] Altman, D.G., Practical Statistics for Medical Research, Chapman and Hall, 1991.


[13] Stuart, M., Changing the teaching of statistics, The Statistician, Vol. 44, No. 1, 45-54, 1995.
[14] Stuart, M., An Introduction to Statistical Analysis for Business and Industry, Hodder, 2003.
[15] Ehrenberg, A.S.C., Data Reduction, Wiley, 1975.
[16] Morton, A.Q., Literary Detection, Scribner’s, 1978.
[17] Box, G.E.P., Hunter, S., and Hunter, W.G., Statistics for Experimenters, Wiley, 1978 (note a second edition appeared in 2005).
[18] Bateson, W. and Saunders, E.R., Reports for the Evolution Committee of the Royal Society, 1, 1-160, 1902.
[19] Berkson, J., Examination of the randomness of alpha particle emissions, in Research Papers in Statistics, F.N. David (ed.), Wiley, 1966.


Appendix 3.1
Let X and Y be two random quantities each with their own long-run mean (expectation) and standard deviation. The square of the standard deviation is called the variance. The variance is of interest because when we combine random quantities the variances (but not the standard deviations) are additive, as discussed in Chapter 1, Section 1.4. Thus, it can be shown that if X and Y are independent (the value of X does not affect the probable value of Y) then:

VAR(X + Y) = VAR(X) + VAR(Y) and

VAR(X – Y) = VAR(X) + VAR(Y). This latter result is used in finding the standard error of (p1 – p2). We have seen that:

SE(p) = √[π(1 − π)/n]

and, hence,

VAR(p) = π(1 − π)/n

where p is a sample proportion, n the sample size, and π is the corresponding population proportion. It follows that:

VAR(p1 − p2) = π1(1 − π1)/n1 + π2(1 − π2)/n2

which gives

SE(p1 − p2) = √[π1(1 − π1)/n1 + π2(1 − π2)/n2].

It is intuitively obvious (and can be shown mathematically) that the long-run mean of (p1 − p2) is (π1 − π2). These results give us the sampling distribution of (p1 − p2) as shown in Figure 3.2.1, page 9.
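These additivity results can also be checked empirically. The following is a minimal simulation sketch, assuming Python with numpy is available; the population proportions and sample sizes used are purely illustrative.

    # Sketch (illustrative only): VAR(p1 - p2) = pi1(1-pi1)/n1 + pi2(1-pi2)/n2, checked by simulation.
    import numpy as np

    rng = np.random.default_rng(0)
    pi1, pi2, n1, n2 = 0.35, 0.26, 1000, 1000          # illustrative values
    p1 = rng.binomial(n1, pi1, size=100_000) / n1       # simulated sample proportions
    p2 = rng.binomial(n2, pi2, size=100_000) / n2

    print(np.var(p1 - p2))                              # simulated variance of the difference
    print(pi1*(1 - pi1)/n1 + pi2*(1 - pi2)/n2)          # theoretical value, approx. 0.00042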


The Special Case of two proportions that sum to 1
If (p1 + p2) = 1 then (p1 − p2) = [p1 − (1 − p1)] = (2p1 − 1). It follows that:

SE(p1 − p2) = SE(2p1 − 1) = SE(2p1) = 2 SE(p1)

since ‘1’ is a constant. Therefore,

SE(p1 − p2) = 2 SE(p1) = 2 √[π1(1 − π1)/n]

which is the result used in our analysis of the French presidential poll in Example 8, page 26.

Text: © Eamonn Mullins, 2013, Data: see references for copyright owners.


Outline Solutions

3.1.1: Referendum poll
A 95% confidence interval for the population proportion who intended to vote yes in the referendum is given by:

p ± 1.96 √[p(1 − p)/n]

= 0.35 ± 1.96 √[0.35(1 − 0.35)/1000]

= 0.35 ± 0.03

i.e., we estimate that between 32% and 38% of the population of voters intended to vote for the Lisbon treaty. A Minitab analysis is shown below.

Test and CI for One Proportion
Sample   Yes      N   Sample p   95% CI
1        350   1000   0.350000   (0.320438, 0.379562)
Using the normal approximation.

Table 3.1.1.1: Minitab analysis of referendum data
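For readers without Minitab, the same interval can be computed in a couple of lines; a minimal sketch, assuming Python with numpy is available:

    # Sketch: normal-approximation 95% CI for a single proportion (referendum data).
    import numpy as np

    p, n = 350/1000, 1000
    se = np.sqrt(p*(1 - p)/n)                      # standard error of the sample proportion
    print(p - 1.96*se, p + 1.96*se)                # approx. (0.320, 0.380)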


3.2.1: Referendum polls
First, we will carry out a Z-test of the equality of the population proportions. As for the Timolol study, we frame our question in terms of a null hypothesis which considers the long-run proportions as being the same and an alternative hypothesis which denies their equality.

H0: π1 − π2 = 0
H1: π1 − π2 ≠ 0

where π1 is the proportion of the whole electorate who intended to vote “yes” in May, while π2 is the corresponding proportion in January. We calculate a Z statistic:

Z = [(p1 − p2) − 0] / √[π(1 − π)/n1 + π(1 − π)/n2]

– this will have a standard Normal distribution if the null hypothesis is true. Our estimate of the common π (when the null hypothesis is true) is p = 610/2000 = 0.305, giving:

Z = [(p1 − p2) − 0] / √[p(1 − p)/n1 + p(1 − p)/n2]
  = (0.35 − 0.26) / √[0.305(1 − 0.305)/1000 + 0.305(1 − 0.305)/1000]
  = 4.37.

When compared to the critical values of 1.96 corresponding to a significance level of 0.05, the Z-statistic is highly statistically significant, leading us to conclude that the population proportions are not the same. A 95% confidence interval for the difference between the population proportions is given by:

(p1 − p2) ± 1.96 √[p1(1 − p1)/n1 + p2(1 − p2)/n2].

For the referendum poll results this gives:

(0.35 − 0.26) ± 1.96 √[0.35(1 − 0.35)/1000 + 0.26(1 − 0.26)/1000] = 0.09 ± 0.04.


We are 95% confident that the difference between the population proportions is somewhere between 5 and 13 percentage points. A Minitab analysis is shown below.

Test and CI for Two Proportions
Sample   ‘Yes’      N   Sample p
1          350   1000   0.350000
2          260   1000   0.260000

Difference = p(1) - p(2)
Estimate for difference: 0.09
95% CI for difference: (0.0498375, 0.130163)
Test for difference = 0 (vs not = 0): Z = 4.37  P-Value = 0.000

Table 3.2.1.1: Minitab comparison of survey results
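The hand calculation above can be reproduced directly; a minimal sketch, assuming Python with numpy (the pooled proportion is used for the test, the unpooled standard error for the interval, as in the text):

    # Sketch: two-proportion Z-test and 95% CI for the referendum polls.
    import numpy as np

    x1, n1, x2, n2 = 350, 1000, 260, 1000
    p1, p2 = x1/n1, x2/n2
    p_pool = (x1 + x2) / (n1 + n2)                                 # 0.305
    z = (p1 - p2) / np.sqrt(p_pool*(1 - p_pool)*(1/n1 + 1/n2))     # approx. 4.37
    se_diff = np.sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)
    ci = (p1 - p2 - 1.96*se_diff, p1 - p2 + 1.96*se_diff)          # approx. (0.050, 0.130)
    print(round(z, 2), ci)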

3.2.2: Influenza vaccination study
This exercise requires the same analyses for the test and confidence interval as did Exercise 3.2.1 – the Minitab output given below shows the required results.

Test and CI for Two Proportions
Sample   Influenza     N   Sample p
1               80   220   0.363636
2               20   240   0.083333

Difference = p (1) - p (2)

Estimate for difference: 0.280303

95% CI for difference: (0.207754, 0.352852)

Test for difference = 0 (vs not = 0): Z = 7.28 P-Value = 0.000

Table 3.2.2.1: Minitab analysis of vaccine data

The influenza data are set out in tabular form in Table 3.2.2.2, to facilitate the calculation of the odds ratio.


                 Control   Vaccine   Totals                Control   Vaccine   Totals
Influenza             80        20      100    Inf.              a         b      a+b
No influenza         140       220      360    No inf.           c         d      c+d
Totals               220       240      460    Totals          a+c       b+d        n

Table 3.2.2.2: The influenza vaccine data

OR = (a/c) / (b/d) = ad/bc = (80/140) / (20/220) = 6.3.
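The same calculation can be scripted directly from the cell counts; a minimal sketch in plain Python (the letters follow the cell labels in Table 3.2.2.2):

    # Sketch: odds ratio for the influenza vaccine data.
    a, b, c, d = 80, 20, 140, 220      # a, b: influenza row; c, d: no-influenza row
    odds_ratio = (a * d) / (b * c)     # (80/140) / (20/220), approx. 6.3
    print(round(odds_ratio, 1))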

The study suggests that the odds of contracting influenza are more than six times higher for the control group. A one-sample confidence interval for the population incidence of influenza, if the entire population (of adults) had been vaccinated is shown in the Minitab output below. Between 5 and 12% of adults would be expected to contract influenza, even though they had been vaccinated.

Test and CI for One Proportion
Sample   Influenza     N   Sample p   95% CI
1               20   240   0.083333   (0.048366, 0.118300)

Using the normal approximation.

Table 3.2.2.3: Estimating the incidence of influenza in the population

3.2.3: Childhood respiratory disease study
The statistical test and confidence interval for comparing the sample proportions are presented below and show a highly statistically significant difference.

Test and CI for Two Proportions
Sample   Cough      N   Sample p
1           26    273   0.095238
2           44   1046   0.042065

Difference = p (1) - p (2)

Estimate for difference: 0.0531731

95% CI for difference: (0.0162884, 0.0900577)

Test for difference = 0 (vs not = 0): Z = 3.49 P-Value = 0.000

Table 3.2.3.1: Minitab analysis of bronchitis data


              Bronchitis   Control   Totals                 Bronchitis   Control   Totals
Coughs                26        44       70    Coughs                a         b      a+b
No coughs            247      1002     1249    No coughs             c         d      c+d
Totals               273      1046     1319    Totals              a+c       b+d        n

Table 3.2.3.2: The bronchitis data

OR = (a/c) / (b/d) = ad/bc = (26/247) / (44/1002) = 2.4.

The study suggests that the odds of having a persistent cough in later life are more than twice as high for the children who suffered from bronchitis in infancy as they are for the control group.

3.2.4: Otter deaths in eastern Germany
For the otter data a natural practical question is ‘by how much does the long-run proportion of young male deaths exceed that for young females?’ In addressing this question we compare the sample proportions; in doing so, we regard the numbers of males and females as fixed in making our comparison.

Test and CI for Two Proportions
Sample    Young     N   Sample p
Males       286   598   0.478261
Females     153   425   0.360000

Difference = p(1) - p(2)
Estimate for difference: 0.118261
95% CI for difference: (0.0575530, 0.178969)
Test for difference = 0 (vs not = 0): Z = 3.77  P-Value = 0.000

Table 3.2.4.1: Minitab comparison of sample proportions for otters

The confidence interval suggests that the long-run percentage of young male deaths is between 6 and 18 percentage points higher than the corresponding percentage for female otters. The odds-ratio is 1.63 – the odds of a male being young when killed are about 60% higher than those for a female. The original paper suggests reasons for such a difference. These include the larger territorial


size and higher activity levels, particularly of younger males searching for a free territory, leading to increased exposure to road traffic (69% of the deaths were attributed to road kills).

3.2.5: Case-control contraception study
Table 3.2.5.1 shows the 155 pairs cross-classified by exposure to oral contraceptives. The question of interest is whether or not this 1973 study supported the existence of an association between exposure to oral contraception and the occurrence of a stroke.

                            Controls
                   Exposed   Not-expos.   Totals
Cases  Exposed           5           30       35        a     b    a+b
       Not-expos.       13          107      120        c     d    c+d
       Totals           18          137      155       a+c   b+d     n

Table 3.2.5.1: A matched case-control study of strokes and oral contraceptive use

To test the null hypothesis that the proportion of women (in the relevant population) who used oral contraceptives was the same for the population of women who experienced strokes [cases: p1=(a+b)/n] as for the population who did not [controls: p2=(a+c)/n], we calculate a Z-statistic:

Z = [(p1 − p2) − 0] / ŜE(p1 − p2) = [(b − c)/n] / [√(b + c)/n] = (b − c)/√(b + c) = (30 − 13)/√(30 + 13) = 2.59.

When compared to the critical values of ±1.96 given by the conventional significance level of 0.05, the Z-value is clearly statistically significant: cases – those who had strokes – were more likely to have been exposed to oral contraception. A confidence interval for the increased probability of exposure to oral contraception for cases is given by:

(p1 − p2) ± 1.96 √(b + c)/n

where p1 = 35/155 = 0.226 and p2 = 18/155 = 0.116 giving (p1 – p2) = 0.110.


The calculated confidence interval is:

0.110 ± 0.085, or [0.025 to 0.195], that is, a difference of between 3 and 20 percentage points. Note that this is an old study and the formulations of modern contraceptives are quite different, so the results are of mainly historical interest as regards risks associated with contraceptive pill use.
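The matched-pairs test and interval above can be checked with a short script; a minimal sketch, assuming Python with numpy (the half-width computed this way is approximately 0.083, close to the 0.085 quoted above once rounding is allowed for):

    # Sketch: matched-pairs (discordant-pair) Z-test and 95% CI for Table 3.2.5.1.
    import numpy as np

    b, c, n = 30, 13, 155                           # discordant pairs and total number of pairs
    z = (b - c) / np.sqrt(b + c)                    # approx. 2.59
    diff = (b - c) / n                              # p1 - p2, approx. 0.110
    half_width = 1.96 * np.sqrt(b + c) / n          # half-width of the 95% interval
    print(round(z, 2), round(diff, 3), round(half_width, 3))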

3.2.6: Self-paired allergy study
Table 3.2.6.1 shows the 173 self-pairs classified by response to the two chemicals. The question of interest is whether or not there is a differential response to the two chemicals.

                              DNCB
                       +      –    Totals
Croton Oil   +        81     48       129        a     b    a+b
             –        23     21        44        c     d    c+d
             Totals  104     69       173       a+c   b+d     n

Table 3.2.6.1: 173 patients classified according to response to two chemicals

These data may be analysed in exactly the same way as the strokes/contraceptives data of Exercise 3.2.5. To test the null hypothesis that the probability of a negative response to DNCB was the same as that for Croton Oil we can calculate a Z-statistic:

Z = [(p1 − p2) − 0] / ŜE(p1 − p2) = [(b − c)/n] / [√(b + c)/n] = (b − c)/√(b + c) = (48 − 23)/√(48 + 23) = 2.97.

When compared to the critical values of ±1.96 given by the conventional significance level of 0.05, the Z-value is clearly statistically significant – patients (in the relevant population) were more likely to react negatively to DNCB than they were to Croton Oil. A confidence interval for the increased probability of reacting negatively to DNCB is given by:

(p1 − p2) ± 1.96 √(b + c)/n


where p1 = 69/173 = 0.399 and p2 = 44/173 = 0.254, giving (p1 − p2) = 0.145. The calculated confidence interval is:

0.145 ± 0.098

which ranges from 0.05 to 0.24, that is, a difference of between 5 and 24 percentage points.

3.2.7: Otter deaths in eastern Germany
A 95% confidence interval for the difference between the long-run proportions for males (π1) and females (π2) is:

(p1 − p2) ± 1.96 × 2 √[p1(1 − p1)/n]

= (0.585 − 0.415) ± 3.92 √[0.585(0.415)/1023] = 0.17 ± 0.06.
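A quick numerical check of this interval, as a sketch assuming Python with numpy (p1 here is the male share, 598 of the 1023 deaths):

    # Sketch: special-case interval where p1 + p2 = 1 (otter deaths).
    import numpy as np

    p1, n = 598/1023, 1023                           # proportion of the deaths that were male
    half_width = 1.96 * 2 * np.sqrt(p1*(1 - p1)/n)   # approx. 0.06
    print(round(2*p1 - 1, 2), round(half_width, 2))  # difference approx. 0.17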

This suggests that there is an excess of such violent deaths among males, of between 11 and 23 percentage points, over the percentage for female otters.

3.3.1: Slate workers study
A Minitab chi-square analysis of the slate workers’ study is shown below. The X² value of 9.328 is simply the square of the Z-value shown on page 15 (apart from rounding). The conclusions from the chi-square analysis are the same, therefore, as those from the Z-test for comparing two proportions, as discussed on pages 14–15.

Chi-Square Test: C1, C2 Expected counts are printed below observed counts

Chi-Square contributions are printed below expected counts

Died Surv. Total

Slate workers 379 347 726

352.30 373.70

2.024 1.908

Controls 230 299 529

256.70 272.30

2.778 2.618

Total 609 646 1255

Chi-Sq = 9.328, DF = 1, P-Value = 0.002

Table 3.3.1.1: Chi-square analysis of the slate workers’ data
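The same analysis can be reproduced with scipy; a minimal sketch, assuming Python with numpy and scipy (correction=False requests the uncorrected statistic, which matches the Minitab value):

    # Sketch: chi-square test of homogeneity for the slate workers' 2x2 table.
    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[379, 347],    # slate workers: died, survived
                      [230, 299]])   # controls:      died, survived
    x2, p, df, expected = chi2_contingency(table, correction=False)
    print(round(x2, 3), df, round(p, 3))   # approx. 9.328, 1, 0.002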


3.3.2: Oesophageal cancer case-control study
A chi-square analysis of the oesophageal cancer case-control data is shown in Table 3.3.2.1, below – again there is a one-to-one correspondence with the Z-test of Table 3.2.6 (page 16).

Chi-Square Test: C3, C4 Expected counts are printed below observed counts

Chi-Square contributions are printed below expected counts

80+ <80 Total

Cases 96 104 200

42.05 157.95

69.212 18.427

Controls 109 666 775

162.95 612.05

17.861 4.755

Total 205 770 975

Chi-Sq = 110.255, DF = 1, P-Value = 0.000

Table 3.3.2.1: Chi-square analysis of the oesophageal cancer data

3.3.3: Cosmic Rays
It is not obvious to me whether the physicists would ask if the energy distributions are different for the two particle types or if the charge distributions are different at different energy ranges. In analysing these data, therefore, I would frame the null hypothesis as a test of independence – are the energy and charge classifications independent? I would treat the two classifications as if they have equal status, in other words. A chi-square test on the data gives a test statistic of X² = 14.01, as shown in Table 3.3.3.1. This should be compared to a chi-square distribution with (3 − 1)(2 − 1) = 2 degrees of freedom; for a significance level of α = 0.05, the critical value is 5.99 and the null hypothesis of energy-charge independence is rejected.


Chi-Square Test: Positive, Negative
Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

Energy Interval   Positive   Negative   Total
50-100                   9         20      29
                      7.37      21.63
                     0.363      0.124
101-300                 32         51      83
                     21.08      61.92
                     5.658      1.926
301-1000                23        117     140
                     35.56     104.44
                     4.434      1.509
Total                   64        188     252

Chi-Sq = 14.013, DF = 2, P-Value = 0.001

Table 3.3.3.1: Chi-square analysis of the cosmic ray data

Suppose we decide to view the data in terms of the energy distributions for the two particle types; positive electrons are called ‘positrons’ (as in Positron Emission Tomography (PET) scanners). Table 3.3.3.2 shows the energy distributions expressed as percentages, for the two particle types.

                          Electron Charge
Energy Interval (MeV)   Positive   Negative
50-100                        14         11
101-300                       50         27
301-1000                      36         62
Total                        100        100

Table 3.3.3.2: The energy distributions (percentage)

It is obvious that there is a much higher percentage of highest energy particles in the negative charge group; there is a difference of 26 percentage points, i.e., 62 versus 36%. The 95% confidence interval for the difference is from 13 to 40 percentage points, as shown in Table 3.3.3.3.


Test and CI for Two Proportions
Sample   301+     N   Sample p
e-        117   188   0.622340
e+         23    64   0.359375

Difference = p(1) - p(2)
Estimate for difference: 0.262965
95% CI for difference: (0.126506, 0.399425)

Test for difference = 0 (vs not = 0): Z = 3.66 P-Value = 0.000

Table 3.3.3.3: Comparing proportions in the highest energy interval

Presumably, differences in the energy distributions would carry implications for the mechanisms generating the particles. Thus, the non-symmetrical energy distributions observed here may rule in or rule out certain sources of these particles.

Faulty data collection
If the energy measurements cannot be trusted to differentiate between the two lower energy intervals, then it makes sense to combine them into a single 50–300 MeV band; the revised distributions are shown in Table 3.3.3.4.

                          Electron Charge
Energy Interval (MeV)   Positive   Negative
50-300                        41         71
301-1000                      23        117
Total                         64        188

Table 3.3.3.4: Revised energy distributions

A chi-square analysis of this 2×2 table (shown in Table 3.3.3.5) is effectively equivalent to the comparison of two proportions in Table 3.3.3.3, since we focused there on the proportions in the highest energy interval, which remains unchanged. However, the null hypothesis underlying the Z-test for proportions in Table 3.3.3.3 is a ‘homogeneity of proportions’ hypothesis (i.e., the same set of long-run relative frequencies (π) applies to both types of electron), while that


underlying Table 3.3.3.5 could be framed, alternatively, as an energy level–charge independence hypothesis. As we have seen, these are essentially the same hypothesis and the tests are equivalent, (Z = 3.66)² = (X² = 13.37), apart from rounding.

Chi-Square Test: Positron, Electron
Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

Energy Interval   Positron   Electron   Total
Low                     41         71     112
                     28.44      83.56
                     5.542      1.887
High                    23        117     140
                     35.56     104.44
                     4.434      1.509
Total                   64        188     252

Chi-Sq = 13.372, DF = 1, P-Value = 0.000

Table 3.3.3.5: A chi-square analysis of the revised data

The overall conclusion of our analysis is that energy level and particle charge are not independent classifications and that there is a disproportionately higher fraction of (negatively charged) electrons in the highest energy band.


3.3.4: Hospitals
We begin our analysis by excluding the referral hospital, E, and by examining the other four hospitals A–D to determine if their post-operative patient status distributions can be considered as being generated by the same set of underlying relative frequencies.

Chi-Square Test: A, B, C, D
Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

               A        B        C        D    Total
No-imp        13        5        8       21       47
            7.75     5.11    13.03    21.11
           3.555    0.002    1.941    0.001
Partial       18       10       36       56      120
           19.79    13.05    33.26    53.89
           0.162    0.714    0.225    0.082
Complete      16       16       35       51      118
           19.46    12.84    32.71    53.00
           0.615    0.780    0.160    0.075
Total         47       31       79      128      285

Chi-Sq = 8.313, DF = 6, P-Value = 0.216

Table 3.3.4.1: Chi-square comparison of hospitals A – D

Table 3.3.4.1 shows a chi-square analysis of the 3×4 table. The X² = 8.3 test statistic is non-significant when compared to the α = 0.05 critical value of χ²c = 12.6 of the chi-square distribution with (3 − 1)(4 − 1) = 6 degrees of freedom. The large p-value of 0.216 indicates that the test statistic would be expected to be bigger than 8.3 more than 20% of the time, even if the null hypothesis of equal underlying frequency distributions were true. Accordingly, we cannot reject the null hypothesis of equal underlying post-operative result performances for the four hospitals. It makes sense, therefore, to combine the data into one ‘general hospital’ distribution.


Table 3.3.4.2 shows the (rounded) percentages that fall into the three outcome categories for the general and referral hospitals.

                              Hospital
                         General   Referral
No improvement                16         52
Partial funct. rest.          42         35
Complete funct. rest.         41         12

Table 3.3.4.2: Percentage (rounded) changes

There are quite different response patterns, as might be expected. The referral hospital has a much lower ‘complete recovery’ rate (12% versus 41%) and a much higher ‘no improvement’ rate (52% versus 16%). Table 3.3.4.3 is a chi-square analysis of this table.

Chi-Square Test: General, Referral
Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

             General   Referral   Total
No-imp            47         43      90
               69.89      20.11
               7.497     26.058
Partial          120         29     149
              115.71      33.29
               0.159      0.553
Complete         118         10     128
               99.40      28.60
               3.480     12.096
Total            285         82     367

Chi-Sq = 49.844, DF = 2, P-Value = 0.000

Table 3.3.4.3: Chi-square comparison of general and referral hospitals

The chi-square analysis (highly significant test statistic) provides statistical justification for interpreting the two different response distributions of Table 3.3.4.2: the chi-square result tells us that the differences are unlikely to be due to chance fluctuations.


Figure 3.3.4.1: Response ‘profiles’ for hospital types

Figure 3.3.4.1 shows one possible graphical representation of the results. It illustrates the difference in the ‘response profiles’ of the two types of hospital. The profiles do not have any intrinsic meaning, of course, as the horizontal axis is not a continuous scale. The profiles do, however, illustrate the fundamentally different response patterns and would (I think!) help a report reader who might be overwhelmed by the detail shown in Tables 3.3.4.1 and 3.3.4.3, in particular, to interpret the results of the study.

3.3.5: Otter deaths in eastern Germany
We address the question: ‘is the row classification independent of the column classification?’ In the current case this would mean that the probability that an otter is young when it is killed does not depend on its sex. If an otter is selected at random from the total sample of N = 1023, then the probability that it will be male is P[male] = 598/1023 and the probability that it will be young is P[young] = 439/1023. If two characteristics are independent of each other (in the sense that the occurrence of one does not affect the probable occurrence of the other) then the probability that they both occur simultaneously is given by the product of their separate probabilities. So, if independence holds:

P[young and male] = P[young] × P[male]

These sample probabilities are, of course, only estimates of corresponding population values. Under independence the estimate of the expected number of young males is given by:

N × P[young and male] = N × P[young] × P[male]


= 1023 × (439/1023) × (598/1023) = (439 × 598)/1023 = 256.62.

The estimated expected values and the contributions to the X² statistic for all cells are shown in the Minitab analysis in Table 3.3.5.1. The sum of all the cell contributions gives an overall test statistic X² = 14.183, which is highly statistically significant when compared to a critical value of 3.84 (again ‘degrees of freedom’ = 1). The test statistic and the corresponding p-value (p < 0.0005) suggest that the two classifications are not independent of each other: the observed frequencies deviate too much from those calculated on the independence assumption.

Chi-Square Test: Male, Female
Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

            Male   Female   Total
Young        286      153     439
          256.62   182.38
           3.364    4.733
Adult        312      272     584
          341.38   242.62
           2.529    3.558
Total        598      425    1023

Chi-Sq = 14.183, DF = 1, P-Value = 0.000

Table 3.3.5.1: Minitab chi-square analysis of the otter data
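The expected counts and the X² statistic can be generated from the margins in a few lines; a minimal sketch, assuming Python with numpy (each expected count is row total × column total / N, as in the hand calculation above):

    # Sketch: expected counts under independence and the chi-square statistic for the otter data.
    import numpy as np

    observed = np.array([[286, 153],    # young: male, female
                         [312, 272]])   # adult: male, female
    row = observed.sum(axis=1)          # [439, 584]
    col = observed.sum(axis=0)          # [598, 425]
    N = observed.sum()                  # 1023
    expected = np.outer(row, col) / N   # e.g. 439*598/1023 = 256.62
    x2 = ((observed - expected)**2 / expected).sum()   # approx. 14.18
    print(expected.round(2), round(x2, 3))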

Chi-square tests where the null hypothesis is expressed in terms of the equality of two or more proportions (as in the Timolol example) are often referred to as ‘tests for homogeneity of proportions’, while those for which the null hypothesis is posed in the manner just discussed are called ‘tests of independence’ or ‘tests of association’, i.e., they address the question: are frequencies for one classification ‘independent of’ or ‘associated with’ those of the other?


3.4.1: Mendel’s pea plant data
The observed colour distribution for the seeds is compared to the theoretical ratios of 1:2:1 in the Minitab output below. This gives a chi-square value of 0.46, with a corresponding p-value of 0.796. There is no reason to doubt the applicability of the Mendelian theory.

Chi-Square Goodness-of-Fit Test for Observed Counts in Variable: C5

Category        Observed   Test Propn.   Expected   Contribution to Chi-Sq
Yellow               126          0.25     132.25                 0.295369
Yellow+Green         271          0.50     264.50                 0.159735
Green                132          0.25     132.25                 0.000473

  N   DF    Chi-Sq   P-Value
529    2  0.455577     0.796

Table 3.4.1.1: Minitab Goodness of Fit analysis of seed colours

Similarly, the observed shape distribution for the seeds is compared to the theoretical ratios of 1:2:1 in the Minitab output in Table 3.4.1.2. This gives a chi-square value of 0.55, with a corresponding p-value of 0.76. There is no reason to doubt the applicability of the Mendelian theory with respect to the shape distribution either.

Chi-Square Goodness-of-Fit Test for Observed Counts in Variable: C7

Category         Observed   Test Propn.   Expected   Contribution to Chi-Sq
Round                 138          0.25     132.25                 0.250000
Round+Wrinkled        265          0.50     264.50                 0.000945
Wrinkled              126          0.25     132.25                 0.295369

  N   DF    Chi-Sq   P-Value
529    2  0.546314     0.761

Table 3.4.1.2: Minitab Goodness of Fit analysis of seed shapes
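Both goodness-of-fit tests can be reproduced with scipy; a minimal sketch, assuming Python with numpy and scipy (scipy's chisquare expects expected counts, so the ¼, ½, ¼ proportions are scaled by the total of 529):

    # Sketch: 1:2:1 goodness-of-fit tests for Mendel's seed colour and shape data.
    import numpy as np
    from scipy.stats import chisquare

    expected = 529 * np.array([0.25, 0.50, 0.25])   # 132.25, 264.5, 132.25
    colour = np.array([126, 271, 132])
    shape = np.array([138, 265, 126])
    print(chisquare(colour, f_exp=expected))        # X2 approx. 0.456, p approx. 0.796
    print(chisquare(shape,  f_exp=expected))        # X2 approx. 0.546, p approx. 0.761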

© Text, Eamonn Mullins, 2014; data, see references.