SADC Course in Statistics Goodness-of-fit tests (and further issues) (Session 16)




Learning Objectives

By the end of this session, you will be able to:

• conduct and interpret results from a chi-square test for testing the goodness-of-fit of data to a particular distribution

• understand how a two-way contingency table can be examined further by looking at its residuals

• present results from a standard chi-square test, paying attention to the table’s summary features


Goodness-of-fit tests

• In previous sessions, we have seen that many tests are based on the assumption of normality

• On some occasions, it is also important to ascertain whether the data follow other distributions, e.g. the binomial or Poisson distributions

• We shall now look at how the chi-square test can be applied to examine the extent to which assumptions concerning the distribution of a given variable hold


Goodness-of-fit tests

• The basic idea is first to calculate the probability of each possible value occurring

• e.g. the number of cows contracting a disease on a farm with 6 cows may be assumed to follow a binomial distribution.

• e.g. the number of visits made by a pregnant woman in a region to the region’s single antenatal clinic may be assumed to follow a Poisson distribution.

Can we check these assumptions before subjecting the data to tests based on them?


Goodness-of-fit test: Normal distn

• Because the Normal distribution applies to a continuous random variable, it is necessary to group the data and obtain observed frequencies in each group.

• The next step is to determine the probability of an observation falling in each group, and hence the expected frequency for that group.

• The chi-square test can then be applied in the usual way, with d.f. = number of groups – 1 – number of parameters estimated in computing the expected values.
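The grouping step described above can be sketched as follows. This is only an illustration: the group boundaries and n = 56 are taken from the rainfall example that follows, but the mean and standard deviation used here are hypothetical values (the slides do not state them), and the function name `expected_normal_counts` is invented for this sketch.

```python
from statistics import NormalDist


def expected_normal_counts(upper_bounds, mean, sd, n):
    """Expected frequency in each group if X ~ Normal(mean, sd).

    upper_bounds holds the upper limit of every group except the last,
    which is open-ended (everything above the final bound).
    """
    dist = NormalDist(mu=mean, sigma=sd)
    cumulative = [dist.cdf(b) for b in upper_bounds] + [1.0]
    expected = []
    previous = 0.0
    for c in cumulative:
        expected.append(n * (c - previous))  # n * P(observation falls in group)
        previous = c
    return expected


# Group boundaries as in the rainfall example; mean and sd are HYPOTHETICAL.
bounds = [100, 125, 150, 175, 200, 225, 250]
expected = expected_normal_counts(bounds, mean=157.5, sd=49.5, n=56)

# Two parameters (mean and sd) would be estimated from the data.
df = len(expected) - 1 - 2

labels = ["<=100"] + [f"to {b}" for b in bounds[1:]] + ["> 250"]
for label, e in zip(labels, expected):
    print(f"{label:>7}  {e:6.2f}")
print("d.f. =", df)
```

In practice the mean and standard deviation would be the sample estimates, which is why two degrees of freedom are subtracted.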


An example: Normal distn

• Consider the total rainfall in June at a particular site from 1928 to 1983. Suppose we wish to test the assumption that these data follow a normal distribution

• A histogram for the data appears below.

[Histogram of June rainfall totals. x-axis: rainfall totals in groups <=100, to 125, to 150, to 175, to 200, to 225, to 250, > 250; y-axis: frequency, 0 to 14]


An example: Normal distn

Expected values are now calculated for each group, assuming a normal distribution.

The table shows observed and expected frequencies.

The chi-square value is 3.6 with d.f. = 5. P-value = 0.6083. Conclusions?

RainTotal   Observed   Expected
<=100           4         6.86
to 125         11         7.45
to 150         12        10.31
to 175          9        11.12
to 200          9         9.33
to 225          6         6.10
to 250          3         3.11
> 250           2         1.72
Totals         56        56
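As a check on the figures quoted for this example, the statistic can be recomputed from the Observed and Expected columns. The sketch below uses only the Python standard library; the helper `chi2_sf` is a name introduced here, and it evaluates the chi-square upper-tail probability for integer d.f. through a standard recurrence rather than a statistics package.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-square with integer df >= 1.

    Starts from the closed forms for df = 1 and df = 2 and applies
    Q(df + 2) = Q(df) + (x/2)^(df/2) * exp(-x/2) / Gamma(df/2 + 1).
    """
    half = x / 2.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


# Observed and expected frequencies from the rainfall table.
observed = [4, 11, 12, 9, 9, 6, 3, 2]
expected = [6.86, 7.45, 10.31, 11.12, 9.33, 6.10, 3.11, 1.72]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 2   # mean and sd estimated from the data
p_value = chi2_sf(chi2, df)

print(f"chi-square = {chi2:.2f}, d.f. = {df}, p = {p_value:.3f}")
```

The recomputed statistic is about 3.63, in line with the rounded 3.6 on the slide; a p-value this large gives no evidence against the assumption of normality.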


An example: Binomial distn

• First recall (from Module H1) the form of the probability mass function for the binomial random variable with parameters n and p, where p is the probability of a “success” in each of n trials, each trial having just 2 possible outcomes.

• The number of successes (X) in n trials has a binomial distribution.

\[
P(X = k) = \frac{n!}{k!\,(n-k)!}\; p^{k} (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n
\]

• This formula gives the binomial probabilities, which can also be obtained from Excel’s function BINOMDIST(k, n, p, FALSE).


An example: Binomial distn

Suppose we have a binomial variable with observed values as shown (n = 7, p = 0.222).

Expected values are obtained as 404 × P(X = k).

The chi-square value is 141.3 with d.f. = 4 (6 groups – 1 – 1), since p has been estimated from the data. P-value = 0.000 (i.e. < 0.001).

k        Observed   Expected
0            81        69.7
1           130       139.2
2           129       119.2
3            37        56.7
4            14        16.2
5,6,7        23         3.0
Totals      404       404

What are your conclusions?
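The Expected column of this table can be reproduced from the binomial formula with n = 7, p = 0.222 and a total of 404. A minimal sketch follows; `binomial_pmf` is a helper name introduced here, and only the expected frequencies and the degrees of freedom are computed.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p, total = 7, 0.222, 404   # values quoted on the slide

probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Pool k = 5, 6, 7 into a single group, as in the table.
group_probs = probs[:5] + [sum(probs[5:])]
expected = [total * pr for pr in group_probs]

for label, e in zip(["0", "1", "2", "3", "4", "5,6,7"], expected):
    print(f"{label:>5}  {e:6.1f}")

# One parameter (p) was estimated from the data.
df = len(expected) - 1 - 1
print("d.f. =", df)
```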


Other issues

There are two more issues to discuss concerning chi-square tests for testing the association between two categorical variables.

These relate to

• further examination of the table of frequencies when a significant result is found;

and

• how to present the results


Example of Session 15

For the data below, we found a significant chi-square value, with p = 0.0024, i.e. evidence that the proportions of diseased animals are not the same for all vaccines.

Vaccine   diseased   healthy   Total
A             43        237      280
B             52        198      250
C             25        245      270
D             48        212      260
E             57        233      290
Total        225       1125     1350

Question:

But what contributes most to the chi-square statistic?

i.e. which vaccine departs most from the overall Pr(diseased) = 225/1350 = 0.167?
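For reference, the overall test quoted for this table can be recomputed from the counts. The sketch below is one way to do it with the standard library only; the closed-form tail probability used for the p-value is valid for d.f. = 4 specifically and is not a general-purpose routine.

```python
import math

# Observed counts (diseased, healthy) for each vaccine, from the table above.
observed = {
    "A": (43, 237),
    "B": (52, 198),
    "C": (25, 245),
    "D": (48, 212),
    "E": (57, 233),
}

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

chi2 = 0.0
for counts in observed.values():
    row_total = sum(counts)
    for j, o in enumerate(counts):
        e = row_total * col_totals[j] / grand_total   # expected count under H0
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (2 - 1)

# Upper-tail chi-square probability; this closed form holds for d.f. = 4 only.
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)

print(f"Overall Pr(diseased) = {col_totals[0] / grand_total:.3f}")
print(f"chi-square = {chi2:.2f}, d.f. = {df}, p = {p_value:.4f}")
```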


Cell contributions to chi-square:

Vaccine   diseased   healthy
A           0.288      0.057
B           2.563      0.512
C           8.889      1.778
D           0.502      0.100
E           1.554      0.311

The table gives the contribution of each cell to the chi-square statistic, i.e. the values (O-E)^2/E.

Rule of thumb:

Focus on cells with values > 4 and, in larger tables, on those with values > 9.
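The cell contributions can be computed directly from the vaccine counts of Session 15. The following is a small sketch; the threshold of 4 is the rule of thumb quoted above, and the variable names are introduced here for illustration.

```python
# Observed counts (diseased, healthy) per vaccine, from the Session 15 table.
observed = {
    "A": (43, 237),
    "B": (52, 198),
    "C": (25, 245),
    "D": (48, 212),
    "E": (57, 233),
}
labels = ("diseased", "healthy")

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

contributions = {}
for vaccine, counts in observed.items():
    row_total = sum(counts)
    cells = []
    for j, o in enumerate(counts):
        e = row_total * col_totals[j] / grand_total   # expected count
        cells.append((o - e) ** 2 / e)                # (O-E)^2 / E
    contributions[vaccine] = tuple(cells)

# Rule of thumb: look first at cells whose contribution exceeds 4.
flagged = [
    (vaccine, labels[j])
    for vaccine, cells in contributions.items()
    for j, c in enumerate(cells)
    if c > 4
]

for vaccine, cells in contributions.items():
    print(f"{vaccine}  {cells[0]:6.3f}  {cells[1]:6.3f}")
print("Cells above 4:", flagged)
```

Only the diseased cell for vaccine C exceeds the threshold.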


Standardised residuals

Vaccine   diseased   healthy
A          -0.54       0.24
B           1.60      -0.72
C          -2.98       1.33
D           0.71      -0.32
E           1.25      -0.56

Better still, use standardised residuals, which also retain the sign of each departure, i.e. use SR = (O-E)/√E.

Rule of thumb:

Focus on cells with |SR| > 2 or, in larger tables, on those with |SR| > 3.

Conclusion:

Vaccine C shows the greatest discrepancy from H0.
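The standardised residuals can be obtained in the same way as the cell contributions, taking the signed square root. A minimal sketch follows, assuming the residual is defined as (O-E)/√E, which is the definition that agrees with the tabulated values.

```python
import math

# Observed counts (diseased, healthy) per vaccine, from the Session 15 table.
observed = {
    "A": (43, 237),
    "B": (52, 198),
    "C": (25, 245),
    "D": (48, 212),
    "E": (57, 233),
}
labels = ("diseased", "healthy")

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

residuals = {}
for vaccine, counts in observed.items():
    row_total = sum(counts)
    cells = []
    for j, o in enumerate(counts):
        e = row_total * col_totals[j] / grand_total   # expected count
        cells.append((o - e) / math.sqrt(e))          # signed residual
    residuals[vaccine] = tuple(cells)

# Rule of thumb: look first at cells with |SR| > 2.
flagged = [
    (vaccine, labels[j])
    for vaccine, cells in residuals.items()
    for j, sr in enumerate(cells)
    if abs(sr) > 2
]

for vaccine, cells in residuals.items():
    print(f"{vaccine}  {cells[0]:6.2f}  {cells[1]:6.2f}")
print("Cells with |SR| > 2:", flagged)
```

The negative sign for vaccine C (diseased) shows that fewer animals were diseased than expected under H0, which the unsigned contributions cannot show.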


Presentation of results

Vaccine % DISEASED

C 9.3%

A 15.4%

D 18.5%

E 19.7%

B 20.8%

In this example, it would be appropriate to present a table of the percentage of animals diseased under each vaccine.

Sorting the table from the most to the least effective vaccine makes the results easier to see.

Note that there are more advanced methods, e.g. modelling, for making specific comparisons between the above percentages.
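A sorted percentage table of this kind is straightforward to produce from the raw counts. The sketch below uses the vaccine counts from Session 15 and sorts by the percentage diseased, lowest first.

```python
# Observed counts (diseased, healthy) per vaccine, from the Session 15 table.
observed = {
    "A": (43, 237),
    "B": (52, 198),
    "C": (25, 245),
    "D": (48, 212),
    "E": (57, 233),
}

percent_diseased = {
    vaccine: 100 * diseased / (diseased + healthy)
    for vaccine, (diseased, healthy) in observed.items()
}

# Lowest percentage diseased first.
ranked = sorted(percent_diseased.items(), key=lambda item: item[1])

print("Vaccine  % diseased")
for vaccine, pct in ranked:
    print(f"{vaccine:<7}  {pct:5.1f}%")
```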


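Presentation: Example from Session 14

Recall the results below from Session 14. The test of association gave p = 0.000 (i.e. p < 0.001).

                    Usually sleep under a mosquito net?
Suffered malaria?   Yes              No               Total
Yes                 649 (62.5%)      3849 (55.8%)     4498 (56.6%)
No                  390 (37.5%)      3055 (44.2%)     3445 (43.4%)
Total               1039 (100.0%)    6904 (100.0%)    7943 (100%)

Percentages are column percentages, i.e. they are calculated within each mosquito-net group.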
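The percentages and the test result for this table can be checked with a few lines of code. The sketch assumes the layout implied by the conclusions drawn from it, namely that the columns are mosquito-net use (yes, no) and the rows are malaria (yes, no); the tail-probability formula used for the p-value is valid for d.f. = 1 only.

```python
import math

# Rows: suffered malaria (yes, no); columns: sleeps under a net (yes, no).
# This orientation is an ASSUMPTION inferred from the quoted incidences.
table = [
    [649, 3849],   # malaria: yes
    [390, 3055],   # malaria: no
]
groups = ("net", "no net")

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Malaria incidence (%) within each net-use group (column percentages).
incidence = {g: 100 * table[0][j] / col_totals[j] for j, g in enumerate(groups)}

chi2 = 0.0
for i in range(2):
    for j in range(2):
        e = row_totals[i] * col_totals[j] / n   # expected count under H0
        chi2 += (table[i][j] - e) ** 2 / e

# Upper-tail chi-square probability; this closed form holds for d.f. = 1 only.
p_value = math.erfc(math.sqrt(chi2 / 2))

for g in groups:
    print(f"Malaria incidence, {g}: {incidence[g]:.1f}%")
print(f"chi-square = {chi2:.2f}, d.f. = 1, p = {p_value:.6f}")
```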


Presentation and conclusions

Test results indicate that there is an association between use of a mosquito net and incidence of malaria. However, the resulting incidences are unexpected. Note that malaria incidence

for those using a net = 62.5%

for those not using a net = 55.8%.

This emphasises the danger of ignoring other factors that may affect malaria incidence, e.g. altitude, housing conditions, etc. Further, could it be that those who had malaria then started using mosquito nets?


Some final remarks

• Performing a chi-square analysis is simple, but it does not take account of other factors that may affect the results.

• More advanced procedures (e.g. log-linear modelling) do exist for exploring factors affecting a categorical response (here, use of a bednet).

• Recall that the chi-square test is an approximation. This approximation is poor if the expected frequencies are very small (e.g. < 5). Try collapsing some rows or columns if this happens.
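Collapsing categories can be automated in simple cases. The sketch below is one possible approach for a one-way table whose small expected frequencies sit in the upper tail: it merges the last category into its neighbour until the last expected frequency reaches the chosen minimum. The function name `pool_small_tail` and the default minimum of 5 are choices made for this illustration; the data are the Observed and Expected columns of the binomial example, where the pooled group k = 5, 6, 7 has an expected frequency of 3.0.

```python
def pool_small_tail(observed, expected, minimum=5.0):
    """Merge trailing categories until the last expected count >= minimum."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[-1] < minimum:
        last_obs, last_exp = obs.pop(), exp.pop()
        obs[-1] += last_obs   # add the removed cell to its neighbour
        exp[-1] += last_exp
    return obs, exp


# Observed and Expected columns from the binomial example (k = 0, ..., 4, then 5-7).
observed = [81, 130, 129, 37, 14, 23]
expected = [69.7, 139.2, 119.2, 56.7, 16.2, 3.0]

pooled_obs, pooled_exp = pool_small_tail(observed, expected)

print("Observed:", pooled_obs)
print("Expected:", [round(e, 1) for e in pooled_exp])
print("Number of groups after pooling:", len(pooled_exp))
```

Because the degrees of freedom depend on the number of groups, they must be recalculated after pooling.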


Some practical work follows…