Chi Square Test

1

Chi-square testChi-square testor22 test A A nonparametric nonparametric hypothesis testhypothesis test

2

Parametric vs. Parametric vs. Nonparametric TestsNonparametric Tests

•Parametric hypothesis test– about population parameter (

or p)– z, t, tests– interval/ratio data

•Nonparametric tests– do not test a specific parameter– nominal & ordinal data– frequency data ~

3

Chi-square testChi-square test

Used to test the countscounts of categorical dataThreeThree types

Goodness of fit (univariate)Independence (bivariate)Homogeneity (univariate with two samples)

4

22 distribution – distribution –

df=3

df=5

df=10

5

22 distribution distribution

Different df have different curvesSkewed rightAs df increases, curve shifts toward right & becomes more like a normal curvenormal curve

6

2 2 assumptionsassumptionsSRS SRS – reasonably random sampleHave countscounts of categorical data & we expect each category to happen at least onceSample sizeSample size – to insure that the sample size is large enough we should expect at least five in each category.

***Be sure to list expected counts!!

Combine these together:

All expected counts are at

least 5.

7

2 2 formulaformula

i

ii

EO 2

2 E

where where OOi i is the observed frequencyis the observed frequencyEEii is the expected frequency is the expected frequency

8

2 2 Goodness of fit testGoodness of fit testUses univariate dataWant to see how well the observed counts “fit” what we expect the counts to be

Based on df –Based on df –

df = number of df = number of categoriescategories - 1 - 1

9

Hypotheses – written in Hypotheses – written in wordswords

H0: the observed counts equal the expected counts, i.e., there is no significant difference between the observed and the expected counts H1: the observed counts are not equal to the expected counts

Be sure to write in context!

10

Steps for Computation of Steps for Computation of 22(1)Compute the expected frequencies E1,

E2,…, En corresponding to the observed frequencies O1, O2,…, On under some hypothesis.

(2) Compute the deviations (O-E) for each frequency and then square them to obtain (O-E)2 .

(3)Divide the square of the deviations (O-E)2 by the corresponding expected frequency to obtain (O-E)2 / E.

(4) Add the values obtained in step 3 to compute:

i

ii

EO 2

2 E

11

Steps for Computation of Steps for Computation of 2 2 (con’t)(con’t)5 Under the null hypothesis test the theory

fits well, the above statistic follows 22 distribution with v=n-1 d.f.

6 Look up the tabulated value of 22 for (n-1) d.f at given level of significance.

7 If the calculated value of 2 2 is less than the corresponding tabulated value obtained in step 6, then it is said to be non-significant at the required level of significance.

8 If the calculated value of 2 2 is greater than the corresponding tabulated value obtained in step 6, then it is said to be significant at the required level of significance.

12

Let’s test our dice!Let’s test our dice!

13

1. A dice is rolled 100 times with the following distribution:Number : 1 2 3 4 5 6 Observed frequency : 17 14 20 17 17 15At the 0.01 level of significance, determine whether dice is true (unbiased).Solution. We are given: Number of categories = 6 N = total frequency = 17+14+20+17+17+15 =100

Null Hypothesis: Ho : The die is unbiased, i.e., the probability of obtaining six faces is same

Alternate Hypothesis: H1 : The dice is biased

CASELETSCASELETS

14

CASELETS (con’t)CASELETS (con’t)Expected frequency for each face (E) = N*p =100/6 =16.67(all expected values are greater than 5)

Number Observed Frequency (O)

Expected Frequency (E)

(O-E) (O-E)2 (O-E)2/E

1 17 16.67 0.33 0.1089 0.00652 14 16.67 -2.67 7.1289 0.42763 20 16.67 3.33 11.088 0.66524 17 16.67 0.33 0.1089 0.00655 17 16.67 0.33 0.1089 0.00656 15 16.67 -1.67 2.7889 0.1673Total

22==1.2791.27966

15

CASELETS (con’t)CASELETS (con’t)

2796.1E 22

i

ii

EO

Since the calculated value of 22 = 1.2796 = 1.2796 is less than the tabulated value of 22, i.e., 15.086 (for 5 d.f at 1% level of significance) therefore the null hypothesis is accepted and we conclude that the dice is regarded as true (unbiased).

16

2. Offspring of certain fruit flies may have yellow or ebony bodies and normal wings or short wings. Genetic theory predicts that these traits will appear in the ratio 9:3:3:1 (yellow & normal, yellow & short, ebony & normal, ebony & short) A researcher checks 100 such flies and finds the distribution of traits to be 59, 20, 11, and 10, respectively. What are the expected counts? df?Are the results consistent with the theoretical distribution predicted by the genetic model? (5% level of significance)

Expected counts:Y & N = 56.25Y & S = 18.75E & N = 18.75E & S = 6.25 We expect 9/16 of the

100 flies to have yellow and normal

wings. (Y & N)

Since there are 4 categories,

df = 4 – 1 = 3

CASELETSCASELETS

17

Assumption:All expected counts are greater than 5. Expected counts:Y & N = 56.25, Y & S = 18.75, E & N = 18.75, E & S = 6.25H0: The distribution of fruit flies is the same as the theoretical model.Ha: The distribution of fruit flies is not the same as the theoretical model.Number Obser.

Freq. (O)ExpecFreq. (E)

(O-E) (O-E)2 (O-E)2/E

1 59 56.25 2.75 7.5625 0.1352 20 18.75 1.25 1.5625 0.0833 11 18.75 -7.75 60.062

53.203

4 10 6.25 3.75 14.0625

2.25

Total

22==5.6715.671


18


671.5E 22

i

ii

EO

Since the calculated value of 22 = 5.671 = 5.671 is less than the tabulated value of 22, i.e., 7.815 (for 3 d.f at 5% level of significance) therefore the null hypothesis is accepted and we conclude that the distribution of fruit flies is the same as the theoretical model.

19

3.Does your zodiac sign determine how successful you will be? Fortune magazine collected the zodiac signs of 256 heads of the largest 400 companies. Is there sufficient evidence to claim that successful people are more likely to be born under some signs than others?Aries 23 Libra 18 Leo

20Taurus 20 Scorpio 21 Virgo 19Gemini 18 Sagittarius19 Aquarius

24Cancer 23 Capricorn 22 Pisces

29How many would you expect in each sign if there were no difference between them?How many degrees of freedom?

Since there are 12 signs –df = 12 – 1 = 11

CASELETSCASELETS

I would expect CEOs to be equally born under all signs.

So 256/12 = 21.333333

20

AssumptionAll expected counts are greater than 5. (I expect 21.33 CEO’s to be born in each sign.)H0: The number of CEO’s born under each sign is the same.H1: The number of CEO’s born under each sign is the different.

094.53.21

3.2129...3.21

3.21203.21

3.2123 2222


21


094.5E 22

i

ii

EO

Since the calculated value of 22 = 5.094 = 5.094 is less than the tabulated value of 22, i.e., 19.675 (for 11 d.f at 1% level of significance) therefore the null hypothesis is accepted and we conclude that number of CEO’s born under each sign is the same.

22

CASELETSCASELETS4.Records taken of the number of male and female births in 800 families having four children are given as follows:

No. of births FrequencyMale Female

0 4 321 3 1782 2 2903 1 2364 0 64

Test whether the data are consistent with the hypothesis that the binomial law holds and the chance of a male birth is equal to that of female birth.

23

CASELETS (con’t)CASELETS (con’t)Let us set up the null hypothesis that the data are consistent with the binomial law of equal probability for male and female births

We are given n = 4, N = 800According to binomial probability law, the frequency of r male births is given by:F(r) = N*p(r) = N* nCr * pr

*qn-r

= 800* 4Cr * (0.5)r

*(0.5)4-r

= 50 * 4Cr; (r = 0,1,2,3,4)

No. of male births

Expected frequency

F(r)0 50 * 4C0 = 501 50 * 4C1 = 2002 50 * 4C2 = 3003 50 * 4C3 = 2004 50 * 4C4 = 50

Total 800

24

No. of male birth

Obser.Freq. (O)

ExpecFreq. (E)

(O-E) (O-E)2 (O-E)2/E

0 32 50 -18 324 6.481 178 200 -22 484 2.422 290 300 -10 100 0.333 236 200 36 1296 6.484 64 50 14 196 3.92Total

22==19.6319.63


63.19E 22

i

ii

EO

Since the calculated value of 22 = 19.63 = 19.63 is greater than the tabulated value of 22, i.e., 9.488 (for 4 d.f at 5% level of significance) therefore the null hypothesis is rejected and we conclude that hypothesis of equal male and female births is wrong

25

5. A company says its premium mixture of nuts contains 10% Brazil nuts, 20% cashews, 20% almonds, 10% hazelnuts and 40% peanuts. You buy a large can and separate the nuts. Upon weighing them, you find there are 112 g Brazil nuts, 183 g of cashews, 207 g of almonds, 71 g or hazelnuts, and 446 g of peanuts. You wonder whether you mix is significantly different from what the company advertises?Why is the chi-square goodness-of-fit test NOT appropriate here?What might you do instead of weighing the nuts in order to use chi-square?

Because we do NOT have countscounts

of the type of nuts.We could countcount the number of each type of nut and then perform a

2 test.

CASELETSCASELETS

26

1. The following figures show the distribution of digits in numbers chosen at random from a telephone directory:

Digit : 0 1 2 3 4 5 6 7 8 9 10

Frequency:1026 1107 997 966 1075 933 1107 972 964 853

Test whether the digits may be taken to occur equally frequently in the directory.(tabular value for 9 d.f at 5% level of significance is 16.92)

2. The number of scooter accidents per month in a certain town were as follows:

12, 8, 20, 2, 14, 10, 15, 6, 9, 4Are these frequencies in agreement with the belief that

accidents conditions were the same during this 10 month period? (tabular value for 9 d.f at 5% level is 16.92)

Practice CASELETSPractice CASELETS

27

2 test for independence

Used with categorical, bivariate data from ONE sampleUsed to see if the two categorical variables are associated (dependent) or not associated (independent)

28

Assumptions & Assumptions & formula remain the formula remain the same!same!

29

Hypotheses – written in Hypotheses – written in wordswordsH0: two variables are independentH1: two variables are dependent


30

1. A beef distributor wishes to determine whether there is a relationship between geographic region and cut of meat preferred. If there is no relationship, we will say that beef preference is independent of geographic region. Suppose that, in a random sample of 500 customers, 300 are from the North and 200 from the South. Also, 150 prefer cut A, 275 prefer cut B, and 75 prefer cut C. Also suppose that in the actual sample of 500 consumers the observed numbers were as follows:

CASELETSCASELETS

31


Is there sufficient evidence to suggest that geographic regions and beef preference are not independent? (Is there a difference between the expected and observed counts?)

North South Total

Cut A 100 50 150

Cut B 150 125 275

Cut C 50 25 75

Total 300 200 500

32

Expected Counts Expected Counts

Assuming H0 is true,

totaltablealcolumn tot totalrow counts expected


SolutionSolution

33

Degrees of freedomDegrees of freedom

)1c)(1(r df

Or cover up one row & one column & count the number of cells remaining!


34

If beef preference is independent of geographic region, how would we expect this table to be filled in?

North South Total

Cut A 150

Cut B 275

Cut C 75

Total 300 200 500

90 60165

110

45 30


35

Assumptions:All expected counts are greater than 5.

H0: geographic region and beef preference are independent H1: geographic region and beef preference are dependent Obser.

Freq. (O)ExpecFreq. (E)

(O-E) (O-E)2 (O-E)2/E

100 90 10 100 0.1150 60 -10 100 1.67150 165 -15 225 0.33125 110 15 225 1.3650 45 5 25 0.5625 30 -5 25 0.83

22==4.864.86


36

Since the calculated value of 22 = 4.86 = 4.86 is less than the tabulated value of 22, i.e., 5.991 (for (3-1)*(2-1)=2 d.f at 5% level of significance) therefore the null hypothesis is accepted and we conclude that geographic region and beef preference are independent


37

2. In a certain sample of 2000 families 1400 families are consumer of tea. Out of 1800 Hindu families, 1236 families consume tea. Use Chi-Square test and state whether there is any significant difference between consumption of tea among Hindu and non-Hindu families. (5% level of significance)

CASELETSCASELETS

Number of Hindu Non-Hindu TotalFamilies consuming tea

1236 164 1400

Families not consuming tea

564 36 600

Total 1800 200 2000

38

Solution: Expected Solution: Expected Counts Counts

•Assuming H0 is true,



Number of Hindu Non-Hindu TotalFamilies consuming tea

1400

Families not consuming tea

600

Total 1800 200 2000

1260540

14060

39


H0: consumption of tea and community are independent, i.e., there is no significance difference between the consumption of tea among Hindu and non-Hindu familiesH1: consumption of tea and community are dependent

Obser.Freq. (O)

ExpecFreq. (E)

(O-E) (O-E)2 (O-E)2/E

1236 1260 -24 576 0.457564 540 24 576 1.067164 140 24 576 4.11436 60 -24 576 9.6

Total

22==15.2315.2388


40

Since the calculated value of 22 = 15.238 = 15.238 is greater than the tabulated value of 22, i.e., 3.841 (for (2-1)*(2-1)=1 d.f at 5% level of significance) therefore the null hypothesis is rejected and we conclude that the two communities differ significantly as regards the consumption of tea among them.


41

3. A sample of 400 students of under-graduate and 400 students of post-graduate classes was taken to know their opinion about autonomous colleges. 290 of the under-graduate and 310 of the post-graduate students favored the autonomous status. Present these facts in the form of a table and test at 55 level, that the opinion regarding autonomous status of college is independent of the level of classes of students.

CASELETSCASELETS

Class Number of Students Total

Favoring Opposing

Under Graduate 290 400Post Graduate 310 400Total

SolutionObserved Frequencies

11090

600

200

800

42

Expected Counts Expected Counts •Assuming H0 is true,



Class Number of Students Total

Favoring Opposing

Under Graduate 400Post Graduate 400Total 600 200 800

300

10030

0100

43


H0: Opinion about autonomous colleges is independent of the level of classesH1: consumption of tea and community are dependent

Obser.Freq. (O)

ExpecFreq. (E)

(O-E) (O-E)2 (O-E)2/E

290 300 -10 100 0.33110 100 10 100 1.00310 300 10 100 0.3390 100 -10 100 1.00

Total

22==2.662.66


44

Since the calculated value of 22 = 2.66 = 2.66 is less than the tabulated value of 22, i.e., 3.841 (for (2-1)*(2-1)=1 d.f at 5% level of significance) therefore the null hypothesis is accepted and we conclude that the opinion about autonomous colleges may be regarded to be independent of the level of classes of the students


45

4. Suppose that, in a public opinion survey answers to the questions-

(a)Do you drink(b)Are you in favor of local option on sale of liquor?Were as given in the table

CASELETSCASELETS

Question (b)

Question (a) TotalYes No

Yes 56 31 87No 18 6 24Total 74 37 111

Can you infer that opinion on local option is dependent on whether or not an individual drinks?

46

Solution: Expected Solution: Expected Counts Counts

•Assuming H0 is true,



Question (b)

Question (a) TotalYes No

Yes 87No 24Total 74 37 111

58 2916 8

47


H0: the local option on sale of liquor is independent of whether or not an individual drinksH1: geographic region and beef preference are dependent

Obser.Freq. (O)

ExpecFreq. (E)

(O-E) (O-E)2 (O-E)2/E

56 58 -2 4 0.06931 29 2 4 0.13818 16 2 4 0.256 8 -2 4 0.5

Total

22==0.9570.957


48

Since the calculated value of 22 = 0.957 = 0.957 is less than the tabulated value of 22, i.e., 3.841 (for (2-1)*(2-1)=1 d.f at 5% level of significance) therefore the null hypothesis is accepted and we conclude that the opinion on local option on sale of liquor is independent (not dependent) of whether or not an individual drinks


49

Yates correction for Yates correction for ContinuityContinuity

If any cell frequency in 2X2 table is less than 5, then for the application of Chi-Square test it has to be pooled with the preceding or succeeding frequency so that total is greater than 5. This results in the loss of 1 d.f. Since for 2X2 table, d.f. = (2-1)X(2-1) = 1; the d.f. left after adjusting for pooling are v = 1-1 = 0, which is absurd. In such situation we apply Yates correction for ‘continuity’. In this method we add 0.5 to the cell frequency which is less than 5 and adjusting the remaining frequencies accordingly, since row and column totals are fixed and then applying Chi-Square.

50

22 test for homogeneity test for homogeneity

Used with a single single categoricalcategorical variable from two (or more) two (or more) independent samplesindependent samplesUsed to see if the two populations are the same (homogeneous)

51

Assumptions & formula remain the same!Expected counts & df are found the same way as test for independence.

OnlyOnly change is the hypotheses!

52

Hypotheses – written in Hypotheses – written in wordswordsH0: the two (or more) distributions are the sameH1: the distributions are different


Chi Square Test

Documents

Transcript of Chi Square Test