Copyright ©2006 Brooks/Cole, a division of Thomson Learning, Inc. More About Categorical Variables...


More About Categorical Variables

Chapter 15


Principal Question:

Is there a relationship between the two variables, so that the category into which individuals fall for one variable seems to depend on the category they are in for the other variable?


Recall:

• Data displayed in a contingency or two-way table.
• Each combination of row/column is a cell of the table.
• Two types of conditional percents: row and column.
• Row percents: percents across a row, based on total number in the row.
• Column percents: percents down a column, based on total number in the column.
• If one variable is explanatory, use it to define rows and use row percents.
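The two kinds of conditional percents can be sketched in a few lines of Python. The counts below are hypothetical (they do not come from the slides), and the function names are chosen only for this illustration.

```python
def row_percents(table):
    """Percents across each row, based on that row's total."""
    return [[100 * cell / sum(row) for cell in row] for row in table]


def column_percents(table):
    """Percents down each column, based on that column's total."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[100 * cell / col_totals[j] for j, cell in enumerate(row)]
            for row in table]


# Hypothetical two-way table: rows = explanatory groups, columns = (yes, no).
table = [[30, 70],
         [45, 55]]

row_pct = row_percents(table)     # each row sums to 100
col_pct = column_percents(table)  # each column sums to 100
print(row_pct)
print(col_pct)
```

Because the explanatory variable defines the rows here, the row percents are the ones to compare across groups.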


15.1 Chi-Square Test for Two-Way Tables

Recall there are five steps for assessing statistical significance.

Step 1: Determine null and alternative hypotheses

H0: The two variables are not related.

Ha: The two variables are related.

Sometimes associated is used instead of related.


Example 15.1 Ear Infections and Xylitol

Experiment: n = 533 children randomized to 3 groups
Group 1: Placebo Gum; Group 2: Xylitol Gum; Group 3: Xylitol Lozenge
Response = Did child have an ear infection?

Only 16.2% of children in Xylitol Gum group had infection.


Example 15.1 Infections and Xylitol (cont)

Let
p1 = proportion who would get an ear infection in a population given placebo gum
p2 = proportion who would get an ear infection in a population given xylitol gum
p3 = proportion who would get an ear infection in a population given xylitol lozenges

H0: p1 = p2 = p3 (no relationship between treatment and outcome)

Ha: p1, p2, p3 are not all the same (there is a relationship)


Example 15.2 Making Friends

Q: With whom do you find it easiest to make friends: opposite sex, same sex, or no difference?

H0: No difference in distribution of responses of men and women (no relationship between gender and response)

Ha: There is a difference in distribution of responses of men and women (there is a relationship between gender and response)


Tech Note: Homogeneity and Independence

There are two variations of the general hypothesis statements, depending on the method of sampling.

• If samples have been taken from separate populations, the null hypothesis statement is a statement of homogeneity (sameness) among the populations.

• If a sample has been taken from a single population, and two categorical variables measured for each individual, the statement of no relationship is a statement of independence between the two variables.


Step 2: Chi-square Statistic and Necessary Conditions

• Compute expected count for each cell:

$$\text{Expected count} = \frac{(\text{Row total})(\text{Column total})}{\text{Total } n}$$

• Compute test statistic by totaling over all cells:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

Guidelines for large sample:
1. All expected counts should be greater than 1.
2. At least 80% of the cells should have an expected count greater than 5.
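The Step 2 computations and the large-sample guidelines can be sketched as below. The observed table is hypothetical, and the function names are assumptions made for this illustration rather than anything defined in the slides.

```python
def expected_counts(table):
    """Expected count = (row total)(column total) / total n, for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def large_sample_ok(expected):
    """Guidelines: all expected counts > 1 and at least 80% of cells > 5."""
    cells = [e for row in expected for e in row]
    share_above_5 = sum(e > 5 for e in cells) / len(cells)
    return all(e > 1 for e in cells) and share_above_5 >= 0.80


def chi_square_statistic(table):
    """Sum of (Observed - Expected)^2 / Expected over all cells."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))


# Hypothetical 2 x 2 table of observed counts.
observed = [[10, 20],
            [20, 10]]

expected = expected_counts(observed)
conditions_met = large_sample_ok(expected)
chi2 = chi_square_statistic(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)
print(expected, conditions_met, round(chi2, 3), df)
```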


More on the Chi-square Statistic

Chi-square statistic measures the difference between the observed counts and the counts that would be expected if there were no relationship (i.e. if the null hypothesis were true).

Large difference => evidence of a relationship.

Chi-square probability distribution used to find p-value. Degrees of freedom df = (Rows – 1)(Columns – 1).


Example 15.1 Infections and Xylitol (cont)

Output for testing significance of the relationship:

p-value = 0.035 which is < 0.05

There is a statistically significant relationship between the risk of an ear infection and the preventative treatment.


Example 15.1 Infections and Xylitol (cont)

Expected count for “Placebo Gum, Yes Infection” cell:

Expected Counts:


Example 15.1 Infections and Xylitol (cont)

Chi-square Test Statistic:
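The table of counts and the worked computation did not survive in this transcript. The sketch below therefore uses ASSUMED cell counts. They are consistent with the figures that do appear in the slides (n = 533, 16.2% infections in the xylitol gum group, and a chi-square statistic of 6.69), but they should be checked against the original table before being relied on.

```python
# ASSUMED counts per group as (infection yes, infection no).
# Not shown in the transcript; chosen to match the reported summary figures.
groups = ["Placebo Gum", "Xylitol Gum", "Xylitol Lozenge"]
observed = [[49, 129],
            [29, 150],
            [39, 137]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count = (row total)(column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Each cell's contribution to chi-square: (Observed - Expected)^2 / Expected
contributions = [[(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
                 for obs_row, exp_row in zip(observed, expected)]
chi2 = sum(sum(row) for row in contributions)

for name, exp_row, con_row in zip(groups, expected, contributions):
    print(name, [round(e, 2) for e in exp_row], [round(c, 3) for c in con_row])
print("n =", n, "chi-square =", round(chi2, 2))
```

Under these assumed counts the two largest contributions come from the "Yes Infection" cells of the placebo gum and xylitol gum groups.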


Step 3: p-value of Chi-square Test

p-value = probability the chi-square test statistic could have been as large or larger if the null hypothesis were true.

Large test statistic => evidence of a relationship.
So how large is enough to declare significance?

Chi-square probability distribution used to find p-value.

Degrees of freedom df = (Rows – 1)(Columns – 1) = (r – 1)(c – 1)
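Statistical software normally supplies this tail probability. As a minimal sketch using only the Python standard library, the upper-tail area of a chi-square distribution with integer df can be computed from known closed forms. The function name `chi2_sf` is an assumption for this illustration.

```python
import math


def chi2_sf(x, df):
    """P(chi-square with integer df >= 1 is at least x), i.e. the p-value."""
    if x <= 0:
        return 1.0
    t = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-t) * sum_{i=0}^{df/2 - 1} t^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= t / i
            total += term
        return math.exp(-t) * total
    # Odd df: erfc(sqrt(t)) + exp(-t) * sum_{j=1}^{(df-1)/2} t^(j-1/2) / Gamma(j+1/2)
    total = 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += t ** (j - 0.5) / math.gamma(j + 0.5)
    return math.erfc(math.sqrt(t)) + math.exp(-t) * total


# Statistic and df as reported for Example 15.1: 6.69 with (3 - 1)(2 - 1) = 2.
p_xylitol = chi2_sf(6.69, 2)
print(round(p_xylitol, 3))
```

For df = 2 the formula reduces to exp(-x/2), which is why the reported statistic of 6.69 corresponds to a p-value of about 0.035.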


Chi-square Distributions

• Skewed to the right distributions.
• Minimum value is 0.
• Indexed by the degrees of freedom.


Example 15.1 Infections and Xylitol (cont)

Chi-square statistic was 6.69
df = (3 – 1)(2 – 1) = 2
p-value = 0.035


Finding the p-value from Table A.5:

Look in the corresponding “df” row of Table A.5. Scan across until you find where the statistic falls.

• If value of statistic falls between two table entries, p-value is between values of p (column headings) for these two entries.

• If value of statistic is larger than entry in rightmost column (labeled p = 0.001), p-value is less than 0.001 (written as p < 0.001).

• If value of statistic is smaller than entry in leftmost column (labeled p = 0.50), p-value is greater than 0.50 (written as p > 0.50).
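The table-scanning rule can be sketched as a small lookup. Table A.5 itself is not reproduced in this transcript, so the row below is rebuilt for df = 2 only, using the fact that the tail area for df = 2 is exp(-x/2). The column headings are limited to p-values that the slides mention, which is an assumption about the table's layout.

```python
import math

# Column headings (p-values) mentioned in the slides; assumed subset of Table A.5.
P_COLUMNS = [0.50, 0.10, 0.075, 0.05, 0.025, 0.01, 0.001]


def df2_row():
    """(p, critical value) pairs for df = 2, where tail area = exp(-x/2)."""
    return [(p, -2.0 * math.log(p)) for p in P_COLUMNS]


def bracket_p_value(statistic, row):
    """Return (lower, upper) bounds on the p-value from one table row."""
    if statistic < row[0][1]:
        return (row[0][0], 1.0)    # left of leftmost column: p > 0.50
    if statistic > row[-1][1]:
        return (0.0, row[-1][0])   # right of rightmost column: p < 0.001
    for (p_hi, x_lo), (p_lo, x_hi) in zip(row, row[1:]):
        if x_lo <= statistic <= x_hi:
            return (p_lo, p_hi)


row = df2_row()
print([(p, round(x, 2)) for p, x in row])
print(bracket_p_value(8.515, row))  # statistic from Example 15.2
```

The rebuilt entries for p = 0.025 and p = 0.01 round to 7.38 and 9.21, matching the values quoted from Table A.5 in Example 15.2.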


Example 15.3

Table has three rows and three columns. The computed chi-square statistic is 8.12. Degrees of freedom are df = (3 – 1)(3 – 1) = 4.

Finding the p-value: In the df = 4 row of Table A.5, the value of 8.12 falls between the entries 7.78 (p = 0.10) and 8.50 (p = 0.075). Thus, the p-value is between 0.075 and 0.10.

0.075 < p-value < 0.10


Step 4: Making a Decision

Two equivalent rules: Reject H0 when …

• p-value ≤ 0.05

• Chi-square statistic is greater than the entry in the 0.05 column of Table A.5 (the critical value).

Large test statistic => small p-value => evidence a real relationship exists in the population.

Note: For 2 × 2 tables, a test statistic of 3.84 or larger is significant.
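As a quick check of the two rules for a 2 × 2 table (df = 1), the sketch below applies both to the same statistics. It relies on the fact that for df = 1 the chi-square tail area equals erfc(sqrt(x/2)); the test statistics 6.0 and 2.5 are hypothetical.

```python
import math

ALPHA = 0.05
CRITICAL_DF1 = 3.84  # 0.05 entry for df = 1, as quoted in the note above


def p_value_df1(statistic):
    """Tail area above `statistic` for a chi-square distribution with df = 1."""
    return math.erfc(math.sqrt(statistic / 2.0))


def reject_by_p_value(statistic):
    return p_value_df1(statistic) <= ALPHA


def reject_by_critical_value(statistic):
    return statistic > CRITICAL_DF1


# Hypothetical statistics, one on each side of the critical value.
for stat in (6.0, 2.5):
    print(stat, round(p_value_df1(stat), 4),
          reject_by_p_value(stat), reject_by_critical_value(stat))

# 3.84 is a rounded table entry, so its p-value is 0.05 only to three decimals.
print(round(p_value_df1(CRITICAL_DF1), 3))
```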


Step 5: Reporting a Conclusion

Example: Testing whether there is a relationship between smoking (yes or no) and drinking alcohol (never, occasionally, often).

Ways to write “do not reject H0”

• The relationship between smoking and drinking alcohol is not statistically significant.

• The proportions of smokers who never drink, drink occasionally, and drink often are not significantly different from the proportions of non-smokers who do so.

• There is insufficient evidence to conclude that there is a relationship in the population between smoking and drinking alcohol.


Step 5: Reporting a Conclusion

Example: Testing whether there is a relationship between smoking (yes or no) and drinking alcohol (never, occasionally, often).

Ways to write “reject H0”

• There is a statistically significant relationship between smoking and drinking alcohol.

• The proportions of smokers who never drink, drink occasionally, and drink often are not the same as the proportions of non-smokers who do so.

• Smokers have significantly different drinking behavior than non-smokers.


Example 15.2 Making Friends (cont)

Q: With whom do you find it easiest to make friends: opposite sex, same sex, or no difference?

df = (2 – 1)(3 – 1) = 2. Table A.5: the value of 8.515 falls between the entries in the 0.025 column (7.38) and the 0.01 column (9.21).

0.01 < p-value < 0.025

There is a statistically significant relationship at the 0.05 level.

There appears to be a difference in the distribution of responses of men and women if the populations were asked this question.


Supporting Analyses

To learn about the specific nature of the relationship:

• Description of row (or column) percents.

• Bar chart of counts or percents.

• Examination of each cell’s “contribution to chi-square.” Cells with largest values have contributed most to significance of the relationship => deserve attention in any description of the relationship.

• Confidence intervals for important proportions or for differences between proportions.
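One of the supporting analyses listed here, a confidence interval for a difference between two proportions, can be sketched as follows. The slides do not give a formula for it, so this assumes the usual large-sample interval with multiplier 1.96 for approximately 95% confidence, and the counts are hypothetical.

```python
import math


def diff_proportions_ci(x1, n1, x2, n2, z_star=1.96):
    """Approximate large-sample CI for p1 - p2 (assumed standard formula)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se


# Hypothetical data: 45 of 100 in group 1 and 30 of 100 in group 2 say "yes".
low, high = diff_proportions_ci(45, 100, 30, 100)
print(round(low, 4), round(high, 4))
```

An interval that excludes 0 points in the same direction as a significant chi-square result, and it also shows how large the difference might be.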


15.2 Analyzing 2 × 2 Tables

Shortcut Formula:

The test statistic formula is below, based on df = 1.
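The formula itself did not survive in this transcript. A commonly used shortcut for a 2 × 2 table with cell counts A, B (first row) and C, D (second row) is N(AD – BC)² divided by the product of the two row totals and the two column totals. Treat it as an assumption about what the slide showed. The sketch below checks it against the general formula on hypothetical counts.

```python
def chi2_shortcut(a, b, c, d):
    """Assumed 2 x 2 shortcut: N(AD - BC)^2 / (R1 * R2 * C1 * C2)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def chi2_general(a, b, c, d):
    """General formula: sum of (Observed - Expected)^2 / Expected."""
    table = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d
    return sum((table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i in range(2) for j in range(2))


# Hypothetical counts.
shortcut = chi2_shortcut(30, 70, 45, 55)
general = chi2_general(30, 70, 45, 55)
print(shortcut, general)
```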


Example 6.10 Randomly Pick S or Q (cont)

College students asked: “Randomly choose one of the letters S or Q”, or “Randomly choose one of the letters Q or S”.


Chi-Square Test or Z-Test for Difference in Two Proportions?

Does it make a difference?

• If desired Ha has no specific direction (two-sided), the two tests give exactly the same p-value. The squared value of the z-statistic equals the chi-square statistic.

• If desired Ha has a direction (one-sided), the z-test should be used.
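The equivalence for the two-sided case can be checked numerically. The sketch below assumes the z-statistic uses the pooled sample proportion in its standard error, and the counts are hypothetical.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """z-statistic for H0: p1 = p2, using the pooled proportion (assumed form)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def chi2_two_by_two(x1, n1, x2, n2):
    """Chi-square statistic for the 2 x 2 table of successes and failures."""
    table = [[x1, n1 - x1], [x2, n2 - x2]]
    col_totals = [x1 + x2, (n1 - x1) + (n2 - x2)]
    n = n1 + n2
    return sum((table[i][j] - r * col_totals[j] / n) ** 2 / (r * col_totals[j] / n)
               for i, r in enumerate((n1, n2)) for j in range(2))


# Hypothetical data: 30 of 100 versus 45 of 100.
z = two_proportion_z(30, 100, 45, 100)
chi2 = chi2_two_by_two(30, 100, 45, 100)

p_z_two_sided = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
p_chi2 = math.erfc(math.sqrt(chi2 / 2))           # chi-square tail area, df = 1
print(round(z ** 2, 6), round(chi2, 6), p_z_two_sided, p_chi2)
```

The z-statistic keeps its sign, which is what makes a one-sided alternative possible; the chi-square statistic discards it.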


Fisher’s Exact Test for 2 × 2 Tables

Can be used for any 2 × 2 table, but is most commonly used when the necessary sample size conditions for using the z-test or the chi-square test are violated.

Although the computations are cumbersome, most statistical software programs include Fisher’s Exact Test.
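To give a sense of the computation, here is a minimal sketch of a two-sided Fisher's exact p-value built from hypergeometric probabilities. It assumes one common convention for "two-sided" (sum the probabilities of all tables that are no more likely than the observed one), and the table is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k,
        # with all row and column totals held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so ties with the observed table are included.
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_observed * (1 + 1e-7))


# Hypothetical small table where expected counts are too low for chi-square.
p_fisher = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_fisher, 4))
```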


15.3 Testing Hypotheses about One Categorical Variable: GOF

Goodness of Fit (GOF) Test

Step 1: Determine the null and alternative hypotheses.

H0: The probabilities for k categories are p1, p2, . . . , pk.

Ha: Not all probabilities specified in H0 are correct.

Note: Probabilities in the null hypothesis must sum to 1.


Goodness of Fit (GOF) Test (cont)

Step 2: Verify necessary data conditions, and if met, summarize the data into an appropriate test statistic.

If at least 80% of the expected counts are greater than 5 and none are less than 1, compute

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

where the expected count for the ith category is computed as npi.
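The goodness-of-fit computation in Step 2 can be sketched as below. The observed counts and null probabilities are hypothetical, and the function names are assumptions made for this illustration.

```python
def gof_expected(observed, null_probs):
    """Expected count for category i is n * p_i under the null hypothesis."""
    if abs(sum(null_probs) - 1.0) > 1e-9:
        raise ValueError("null probabilities must sum to 1")
    n = sum(observed)
    return [n * p for p in null_probs]


def gof_conditions_met(expected):
    """At least 80% of expected counts > 5 and none less than 1."""
    share_above_5 = sum(e > 5 for e in expected) / len(expected)
    return share_above_5 >= 0.80 and all(e >= 1 for e in expected)


def gof_statistic(observed, expected):
    """Sum of (Observed - Expected)^2 / Expected over the k categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical example with k = 3 categories and n = 100 observations.
observed = [55, 25, 20]
null_probs = [0.5, 0.3, 0.2]

expected = gof_expected(observed, null_probs)
ok = gof_conditions_met(expected)
chi2 = gof_statistic(observed, expected)
df = len(observed) - 1  # k - 1
print([round(e, 1) for e in expected], ok, round(chi2, 3), df)
```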


Goodness of Fit (GOF) Test (cont)

Step 3: Assuming the null hypothesis is true, find the p-value. Use chi-square distribution with df = k – 1.

Step 4: Decide whether or not the result is statistically significant based on the p-value. The result is statistically significant if the p-value ≤ α.

Step 5: Report the conclusion in the context of the situation.


Example 15.8 Pennsylvania Daily Number

State lottery game: Three-digit number made by drawing a digit between 0 and 9 from each of three different containers.

Focus = draws from the first container. If numbers randomly selected, each value would be equally likely to occur.

H0: p = 1/10 for each of the 10 possible digits
Ha: Not H0


Example 15.8 Daily Number (cont)

Data: n = 500 days between 7/19/99 and 11/29/00


Example 15.8 Daily Number (cont)

Chi-square goodness of fit statistic:

df = k – 1 = 10 – 1 = 9
From Table A.5: p-value > 0.50

Result is not statistically significant; the null hypothesis is not rejected.