Chi Square Test
Transcript of Chi Square Test
1
Chi-square testChi-square testor22 test A A nonparametric nonparametric hypothesis testhypothesis test
2
Parametric vs. Parametric vs. Nonparametric TestsNonparametric Tests
•Parametric hypothesis test– about population parameter (
or p)– z, t, tests– interval/ratio data
•Nonparametric tests– do not test a specific parameter– nominal & ordinal data– frequency data ~
3
Chi-square testChi-square test
Used to test the countscounts of categorical dataThreeThree types
Goodness of fit (univariate)Independence (bivariate)Homogeneity (univariate with two samples)
4
22 distribution – distribution –
df=3
df=5
df=10
5
22 distribution distribution
Different df have different curvesSkewed rightAs df increases, curve shifts toward right & becomes more like a normal curvenormal curve
6
2 2 assumptionsassumptionsSRS SRS – reasonably random sampleHave countscounts of categorical data & we expect each category to happen at least onceSample sizeSample size – to insure that the sample size is large enough we should expect at least five in each category.
***Be sure to list expected counts!!
Combine these together:
All expected counts are at
least 5.
7
2 2 formulaformula
i
ii
EO 2
2 E
where where OOi i is the observed frequencyis the observed frequencyEEii is the expected frequency is the expected frequency
8
2 2 Goodness of fit testGoodness of fit testUses univariate dataWant to see how well the observed counts “fit” what we expect the counts to be
Based on df –Based on df –
df = number of df = number of categoriescategories - 1 - 1
9
Hypotheses – written in Hypotheses – written in wordswords
H0: the observed counts equal the expected counts, i.e., there is no significant difference between the observed and the expected counts H1: the observed counts are not equal to the expected counts
Be sure to write in context!
10
Steps for Computation of Steps for Computation of 22(1)Compute the expected frequencies E1,
E2,…, En corresponding to the observed frequencies O1, O2,…, On under some hypothesis.
(2) Compute the deviations (O-E) for each frequency and then square them to obtain (O-E)2 .
(3)Divide the square of the deviations (O-E)2 by the corresponding expected frequency to obtain (O-E)2 / E.
(4) Add the values obtained in step 3 to compute:
i
ii
EO 2
2 E
11
Steps for Computation of Steps for Computation of 2 2 (con’t)(con’t)5 Under the null hypothesis test the theory
fits well, the above statistic follows 22 distribution with v=n-1 d.f.
6 Look up the tabulated value of 22 for (n-1) d.f at given level of significance.
7 If the calculated value of 2 2 is less than the corresponding tabulated value obtained in step 6, then it is said to be non-significant at the required level of significance.
8 If the calculated value of 2 2 is greater than the corresponding tabulated value obtained in step 6, then it is said to be significant at the required level of significance.
12
Let’s test our dice!Let’s test our dice!
13
1. A dice is rolled 100 times with the following distribution:Number : 1 2 3 4 5 6 Observed frequency : 17 14 20 17 17 15At the 0.01 level of significance, determine whether dice is true (unbiased).Solution. We are given: Number of categories = 6 N = total frequency = 17+14+20+17+17+15 =100
Null Hypothesis: Ho : The die is unbiased, i.e., the probability of obtaining six faces is same
Alternate Hypothesis: H1 : The dice is biased
CASELETSCASELETS
14
CASELETS (con’t)CASELETS (con’t)Expected frequency for each face (E) = N*p =100/6 =16.67(all expected values are greater than 5)
Number Observed Frequency (O)
Expected Frequency (E)
(O-E) (O-E)2 (O-E)2/E
1 17 16.67 0.33 0.1089 0.00652 14 16.67 -2.67 7.1289 0.42763 20 16.67 3.33 11.088 0.66524 17 16.67 0.33 0.1089 0.00655 17 16.67 0.33 0.1089 0.00656 15 16.67 -1.67 2.7889 0.1673Total
22==1.2791.27966
15
CASELETS (con’t)CASELETS (con’t)
2796.1E 22
i
ii
EO
Since the calculated value of 22 = 1.2796 = 1.2796 is less than the tabulated value of 22, i.e., 15.086 (for 5 d.f at 1% level of significance) therefore the null hypothesis is accepted and we conclude that the dice is regarded as true (unbiased).
16
2. Offspring of certain fruit flies may have yellow or ebony bodies and normal wings or short wings. Genetic theory predicts that these traits will appear in the ratio 9:3:3:1 (yellow & normal, yellow & short, ebony & normal, ebony & short) A researcher checks 100 such flies and finds the distribution of traits to be 59, 20, 11, and 10, respectively. What are the expected counts? df?Are the results consistent with the theoretical distribution predicted by the genetic model? (5% level of significance)
Expected counts:Y & N = 56.25Y & S = 18.75E & N = 18.75E & S = 6.25 We expect 9/16 of the
100 flies to have yellow and normal
wings. (Y & N)
Since there are 4 categories,
df = 4 – 1 = 3
CASELETSCASELETS
17
Assumption:All expected counts are greater than 5. Expected counts:Y & N = 56.25, Y & S = 18.75, E & N = 18.75, E & S = 6.25H0: The distribution of fruit flies is the same as the theoretical model.Ha: The distribution of fruit flies is not the same as the theoretical model.Number Obser.
Freq. (O)ExpecFreq. (E)
(O-E) (O-E)2 (O-E)2/E
1 59 56.25 2.75 7.5625 0.1352 20 18.75 1.25 1.5625 0.0833 11 18.75 -7.75 60.062
53.203
4 10 6.25 3.75 14.0625
2.25
Total
22==5.6715.671
CASELETS (con’t)CASELETS (con’t)
18
CASELETS (con’t)CASELETS (con’t)
671.5E 22
i
ii
EO
Since the calculated value of 22 = 5.671 = 5.671 is less than the tabulated value of 22, i.e., 7.815 (for 3 d.f at 5% level of significance) therefore the null hypothesis is accepted and we conclude that the distribution of fruit flies is the same as the theoretical model.
19
3.Does your zodiac sign determine how successful you will be? Fortune magazine collected the zodiac signs of 256 heads of the largest 400 companies. Is there sufficient evidence to claim that successful people are more likely to be born under some signs than others?Aries 23 Libra 18 Leo
20Taurus 20 Scorpio 21 Virgo 19Gemini 18 Sagittarius19 Aquarius
24Cancer 23 Capricorn 22 Pisces
29How many would you expect in each sign if there were no difference between them?How many degrees of freedom?
Since there are 12 signs –df = 12 – 1 = 11
CASELETSCASELETS
I would expect CEOs to be equally born under all signs.
So 256/12 = 21.333333
20
AssumptionAll expected counts are greater than 5. (I expect 21.33 CEO’s to be born in each sign.)H0: The number of CEO’s born under each sign is the same.H1: The number of CEO’s born under each sign is the different.
094.53.21
3.2129...3.21
3.21203.21
3.2123 2222
CASELETS (con’t)CASELETS (con’t)
21
CASELETS (con’t)CASELETS (con’t)
094.5E 22
i
ii
EO
Since the calculated value of 22 = 5.094 = 5.094 is less than the tabulated value of 22, i.e., 19.675 (for 11 d.f at 1% level of significance) therefore the null hypothesis is accepted and we conclude that number of CEO’s born under each sign is the same.
22
CASELETSCASELETS4.Records taken of the number of male and female births in 800 families having four children are given as follows:
No. of births FrequencyMale Female
0 4 321 3 1782 2 2903 1 2364 0 64
Test whether the data are consistent with the hypothesis that the binomial law holds and the chance of a male birth is equal to that of female birth.
23
CASELETS (con’t)CASELETS (con’t)Let us set up the null hypothesis that the data are consistent with the binomial law of equal probability for male and female births
We are given n = 4, N = 800According to binomial probability law, the frequency of r male births is given by:F(r) = N*p(r) = N* nCr * pr
*qn-r
= 800* 4Cr * (0.5)r
*(0.5)4-r
= 50 * 4Cr; (r = 0,1,2,3,4)
No. of male births
Expected frequency
F(r)0 50 * 4C0 = 501 50 * 4C1 = 2002 50 * 4C2 = 3003 50 * 4C3 = 2004 50 * 4C4 = 50
Total 800
24
No. of male birth
Obser.Freq. (O)
ExpecFreq. (E)
(O-E) (O-E)2 (O-E)2/E
0 32 50 -18 324 6.481 178 200 -22 484 2.422 290 300 -10 100 0.333 236 200 36 1296 6.484 64 50 14 196 3.92Total
22==19.6319.63
CASELETS (con’t)CASELETS (con’t)
63.19E 22
i
ii
EO
Since the calculated value of 22 = 19.63 = 19.63 is greater than the tabulated value of 22, i.e., 9.488 (for 4 d.f at 5% level of significance) therefore the null hypothesis is rejected and we conclude that hypothesis of equal male and female births is wrong
25
5. A company says its premium mixture of nuts contains 10% Brazil nuts, 20% cashews, 20% almonds, 10% hazelnuts and 40% peanuts. You buy a large can and separate the nuts. Upon weighing them, you find there are 112 g Brazil nuts, 183 g of cashews, 207 g of almonds, 71 g or hazelnuts, and 446 g of peanuts. You wonder whether you mix is significantly different from what the company advertises?Why is the chi-square goodness-of-fit test NOT appropriate here?What might you do instead of weighing the nuts in order to use chi-square?
Because we do NOT have countscounts
of the type of nuts.We could countcount the number of each type of nut and then perform a
2 test.
CASELETSCASELETS
26
1. The following figures show the distribution of digits in numbers chosen at random from a telephone directory:
Digit : 0 1 2 3 4 5 6 7 8 9 10
Frequency:1026 1107 997 966 1075 933 1107 972 964 853
Test whether the digits may be taken to occur equally frequently in the directory.(tabular value for 9 d.f at 5% level of significance is 16.92)
2. The number of scooter accidents per month in a certain town were as follows:
12, 8, 20, 2, 14, 10, 15, 6, 9, 4Are these frequencies in agreement with the belief that
accidents conditions were the same during this 10 month period? (tabular value for 9 d.f at 5% level is 16.92)
Practice CASELETSPractice CASELETS
27
2 test for independence
Used with categorical, bivariate data from ONE sampleUsed to see if the two categorical variables are associated (dependent) or not associated (independent)
28
Assumptions & Assumptions & formula remain the formula remain the same!same!
29
Hypotheses – written in Hypotheses – written in wordswordsH0: two variables are independentH1: two variables are dependent
Be sure to write in context!
30
1. A beef distributor wishes to determine whether there is a relationship between geographic region and cut of meat preferred. If there is no relationship, we will say that beef preference is independent of geographic region. Suppose that, in a random sample of 500 customers, 300 are from the North and 200 from the South. Also, 150 prefer cut A, 275 prefer cut B, and 75 prefer cut C. Also suppose that in the actual sample of 500 consumers the observed numbers were as follows:
CASELETSCASELETS
31
CASELETS (con’t)CASELETS (con’t)
Is there sufficient evidence to suggest that geographic regions and beef preference are not independent? (Is there a difference between the expected and observed counts?)
North South Total
Cut A 100 50 150
Cut B 150 125 275
Cut C 50 25 75
Total 300 200 500
32
Expected Counts Expected Counts
Assuming H0 is true,
totaltablealcolumn tot totalrow counts expected
CASELETS (con’t)CASELETS (con’t)
SolutionSolution
33
Degrees of freedomDegrees of freedom
)1c)(1(r df
Or cover up one row & one column & count the number of cells remaining!
CASELETS (con’t)CASELETS (con’t)
34
If beef preference is independent of geographic region, how would we expect this table to be filled in?
North South Total
Cut A 150
Cut B 275
Cut C 75
Total 300 200 500
90 60165
110
45 30
CASELETS (con’t)CASELETS (con’t)
35
Assumptions:All expected counts are greater than 5.
H0: geographic region and beef preference are independent H1: geographic region and beef preference are dependent Obser.
Freq. (O)ExpecFreq. (E)
(O-E) (O-E)2 (O-E)2/E
100 90 10 100 0.1150 60 -10 100 1.67150 165 -15 225 0.33125 110 15 225 1.3650 45 5 25 0.5625 30 -5 25 0.83
22==4.864.86
CASELETS (con’t)CASELETS (con’t)
36
Since the calculated value of 22 = 4.86 = 4.86 is less than the tabulated value of 22, i.e., 5.991 (for (3-1)*(2-1)=2 d.f at 5% level of significance) therefore the null hypothesis is accepted and we conclude that geographic region and beef preference are independent
CASELETS (con’t)CASELETS (con’t)
37
2. In a certain sample of 2000 families 1400 families are consumer of tea. Out of 1800 Hindu families, 1236 families consume tea. Use Chi-Square test and state whether there is any significant difference between consumption of tea among Hindu and non-Hindu families. (5% level of significance)
CASELETSCASELETS
Number of Hindu Non-Hindu TotalFamilies consuming tea
1236 164 1400
Families not consuming tea
564 36 600
Total 1800 200 2000
38
Solution: Expected Solution: Expected Counts Counts
•Assuming H0 is true,
totaltablealcolumn tot totalrow counts expected
CASELETS (con’t)CASELETS (con’t)
Number of Hindu Non-Hindu TotalFamilies consuming tea
1400
Families not consuming tea
600
Total 1800 200 2000
1260540
14060
39
Assumptions:All expected counts are greater than 5.
H0: consumption of tea and community are independent, i.e., there is no significance difference between the consumption of tea among Hindu and non-Hindu familiesH1: consumption of tea and community are dependent
Obser.Freq. (O)
ExpecFreq. (E)
(O-E) (O-E)2 (O-E)2/E
1236 1260 -24 576 0.457564 540 24 576 1.067164 140 24 576 4.11436 60 -24 576 9.6
Total
22==15.2315.2388
CASELETS (con’t)CASELETS (con’t)
40
Since the calculated value of 22 = 15.238 = 15.238 is greater than the tabulated value of 22, i.e., 3.841 (for (2-1)*(2-1)=1 d.f at 5% level of significance) therefore the null hypothesis is rejected and we conclude that the two communities differ significantly as regards the consumption of tea among them.
CASELETS (con’t)CASELETS (con’t)
41
3. A sample of 400 students of under-graduate and 400 students of post-graduate classes was taken to know their opinion about autonomous colleges. 290 of the under-graduate and 310 of the post-graduate students favored the autonomous status. Present these facts in the form of a table and test at 55 level, that the opinion regarding autonomous status of college is independent of the level of classes of students.
CASELETSCASELETS
Class Number of Students Total
Favoring Opposing
Under Graduate 290 400Post Graduate 310 400Total
SolutionObserved Frequencies
11090
600
200
800
42
Expected Counts Expected Counts •Assuming H0 is true,
totaltablealcolumn tot totalrow counts expected
CASELETS (con’t)CASELETS (con’t)
Class Number of Students Total
Favoring Opposing
Under Graduate 400Post Graduate 400Total 600 200 800
300
10030
0100
43
Assumptions:All expected counts are greater than 5.
H0: Opinion about autonomous colleges is independent of the level of classesH1: consumption of tea and community are dependent
Obser.Freq. (O)
ExpecFreq. (E)
(O-E) (O-E)2 (O-E)2/E
290 300 -10 100 0.33110 100 10 100 1.00310 300 10 100 0.3390 100 -10 100 1.00
Total
22==2.662.66
CASELETS (con’t)CASELETS (con’t)
44
Since the calculated value of 22 = 2.66 = 2.66 is less than the tabulated value of 22, i.e., 3.841 (for (2-1)*(2-1)=1 d.f at 5% level of significance) therefore the null hypothesis is accepted and we conclude that the opinion about autonomous colleges may be regarded to be independent of the level of classes of the students
CASELETS (con’t)CASELETS (con’t)
45
4. Suppose that, in a public opinion survey answers to the questions-
(a)Do you drink(b)Are you in favor of local option on sale of liquor?Were as given in the table
CASELETSCASELETS
Question (b)
Question (a) TotalYes No
Yes 56 31 87No 18 6 24Total 74 37 111
Can you infer that opinion on local option is dependent on whether or not an individual drinks?
46
Solution: Expected Solution: Expected Counts Counts
•Assuming H0 is true,
totaltablealcolumn tot totalrow counts expected
CASELETS (con’t)CASELETS (con’t)
Question (b)
Question (a) TotalYes No
Yes 87No 24Total 74 37 111
58 2916 8
47
Assumptions:All expected counts are greater than 5.
H0: the local option on sale of liquor is independent of whether or not an individual drinksH1: geographic region and beef preference are dependent
Obser.Freq. (O)
ExpecFreq. (E)
(O-E) (O-E)2 (O-E)2/E
56 58 -2 4 0.06931 29 2 4 0.13818 16 2 4 0.256 8 -2 4 0.5
Total
22==0.9570.957
CASELETS (con’t)CASELETS (con’t)
48
Since the calculated value of 22 = 0.957 = 0.957 is less than the tabulated value of 22, i.e., 3.841 (for (2-1)*(2-1)=1 d.f at 5% level of significance) therefore the null hypothesis is accepted and we conclude that the opinion on local option on sale of liquor is independent (not dependent) of whether or not an individual drinks
CASELETS (con’t)CASELETS (con’t)
49
Yates correction for Yates correction for ContinuityContinuity
If any cell frequency in 2X2 table is less than 5, then for the application of Chi-Square test it has to be pooled with the preceding or succeeding frequency so that total is greater than 5. This results in the loss of 1 d.f. Since for 2X2 table, d.f. = (2-1)X(2-1) = 1; the d.f. left after adjusting for pooling are v = 1-1 = 0, which is absurd. In such situation we apply Yates correction for ‘continuity’. In this method we add 0.5 to the cell frequency which is less than 5 and adjusting the remaining frequencies accordingly, since row and column totals are fixed and then applying Chi-Square.
50
22 test for homogeneity test for homogeneity
Used with a single single categoricalcategorical variable from two (or more) two (or more) independent samplesindependent samplesUsed to see if the two populations are the same (homogeneous)
51
Assumptions & formula remain the same!Expected counts & df are found the same way as test for independence.
OnlyOnly change is the hypotheses!
52
Hypotheses – written in Hypotheses – written in wordswordsH0: the two (or more) distributions are the sameH1: the distributions are different
Be sure to write in context!