
CHAPTER 2: SOME TRULY USEFUL BASIC TESTS FOR QUANTITATIVE VARIABLES

TOPICS

RANDOM SAMPLES

INDEPENDENT AND DEPENDENT VARIABLES

PAIRED T-TESTS (Before/After comparisons)

INDEPENDENT SAMPLE T-TESTS (Separate group comparisons)

A. Large samples or unequal variances

B. Equal population variances

TESTS TO COMPARE VARIANCES

NORMALITY ASSUMPTION

Section 1. Introduction

Many important experimental results are based on statistical analyses no more difficult

than those we will review in this chapter. They are among the most useful test statistics ever

devised, simply because the experimental designs which they match are easy, powerful, and

popular.

All statistical tests require that the sample studied be a random sample from the

population of interest. This is an extremely stringent requirement. It means that every person or

item in the population had an equal chance of making it into the sample. These are the ONLY

conditions under which sampling variability can be calculated. It ensures that sampling

variability is the sole source of error in your results. If samples are not selected

randomly, biases can easily contaminate the results. At the very least, if random samples are not

possible, randomization has to be used to create comparison groups.

We should note that most hypothesis tests are loosely stated as questions of the form

"Does variable X affect variable Y?" For instance, we might ask if Gender affects a person's

opinion on Abortion, or whether Age affects a person's Blood Pressure. In questions where one

variable can be thought of as "affecting" the other variable, the "cause" is referred to as the

independent variable. The outcome is referred to as the dependent variable. Hence, in the

preceding examples, Gender and Age are independent variables possibly affecting the dependent

variables Opinion and Blood Pressure.

The tools discussed in this chapter are only applicable when the dependent variable is

quantitative. Basically, that is because all these methods focus on the effect of the independent


variable on the mean and standard deviation of the dependent variable. These parameters only

make sense if the variable is quantitative. There are further mathematical requirements, which

we will list at the end of the chapter.

Section 2. Paired t-test

The most common experimental designs are in the form of a comparison. Often, the

comparison is on values collected on the same experimental subjects. For instance, we may have

reading proficiency scores for children before and after they undergo a six-week training

program. We may have strength scores on right and left arms of the same person. We may have

yields from tomato plants of type A and B, when one of each were planted in the same pot. In

each of these examples, the key feature is that there is a "matching" mechanism which pairs an

observation of one type unambiguously with an observation of the other type. The statistical

technique we discuss will preserve the information due to the pairing by using one of the

observations as a "baseline" against which the other is measured.

The method is simple. Consider the before and after scores for reading proficiency. If we

are interested in whether the program (independent variable) affected the reading proficiency

(dependent variable), we are really interested in whether there was typically a change in the

scores from before to after. We will calculate the individual changes for each child and use the

one-sample t-test (Chapter 1) to test the null hypothesis that the mean change is zero (no effect exists).

Recipe for Paired t-test

Data structure: For n individuals, we have measurement 1 (X1) and measurement 2 (X2)

which we wish to compare. X1 and X2 must be quantitative variables. Form a new

column D = X1 - X2 which contains the differences in the two measurements for each

individual.

Perform one sample t-test on D

1) Ho: μD = 0 (typical values do not differ for measurements 1 and 2)

H1: μD ≠ 0 (typical values do differ for measurements 1 and 2)

2) Since the sample of D which we have observed has n observations, the t-statistic will have n-1 d.f.

Page 3: CHAPTER 2: SOME TRULY USEFUL BASIC TESTS FOR …dmohr/sta5126/CHAP2.pdfIn the first case, our independent variable is gender while the dependent variable is salary. Salary is a quantitative

STA 5126 -- 3 of Chp 2

©D. Mohr

t = (x̄D - 0) / (sD / √n)

The subscript "D" is to remind you that these statistics are calculated from the column of

D=Differences. Form your critical region using the table of the t-distribution with n-1 df

where n is the number of pairs.

3) Calculate the value of t for your sample. If you use a statistical computer package, it

may give you the p-value for this test automatically.

4) Write the appropriate conclusion.

Example of a paired comparison. Notice the presence of a natural pairing mechanism between

observations with the different "treatments". What are the advantages of such a mechanism?

The data below are from Darwin's study of cross- and self-fertilization.

Pairs of seedlings of the same age, one produced by cross-fertilization and

the other by self-fertilization, were grown together so that the members of

each pair were reared under nearly identical conditions. The data are the

final heights of each plant after a fixed period of time, in inches. Darwin

consulted the famous 19th century statistician Francis Galton about the

analysis of these data. The summary information was produced by the

statistical package SAS for Windows.

PAIR CROSS SELF DIFF = cross-self

1 23.5 17.4 6.1

2 12.0 20.4 -8.4

3 21.0 20.0 1.0

4 22.0 20.0 2.0

5 19.1 18.4 0.7

6 21.5 18.6 2.9

7 22.1 18.6 3.5

8 20.4 15.3 5.1

9 18.3 16.5 1.8

10 21.6 18.0 3.6

11 23.3 16.3 7.0

12 21.0 18.0 3.0

13 22.1 12.8 9.3

14 23.0 15.5 7.5

15 12.0 18.0 -6.0

Summary on N Mean Std Dev Minimum Maximum

variable ----------------------------------------------------------

DIFF 15 2.6066667 4.7128194 -8.4000000 9.3000000

Page 4: CHAPTER 2: SOME TRULY USEFUL BASIC TESTS FOR …dmohr/sta5126/CHAP2.pdfIn the first case, our independent variable is gender while the dependent variable is salary. Salary is a quantitative

STA 5126 -- 4 of Chp 2

©D. Mohr

The null hypothesis is that the mean difference in the population is 0,

implying that mean heights of cross and self-fertilized plants would not

differ. In symbols,

Ho: μD = 0 vs H1: μD ≠ 0

There are 15 observations in the data set, so 14 d.f. If we use α = 5%, then the critical region would be "Reject Ho if t < -2.145 or t > 2.145". In this

sample, t = 2.142. Hence, there is no significant evidence, at α = 5%, that cross- and self-fertilized seedlings differ in mean height.
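
If you want to double-check this example without SAS, the same test is easy to run in Python. This is a minimal sketch assuming the numpy and scipy libraries are available; it is not part of the original course software:

    import numpy as np
    from scipy import stats

    cross = np.array([23.5, 12.0, 21.0, 22.0, 19.1, 21.5, 22.1, 20.4,
                      18.3, 21.6, 23.3, 21.0, 22.1, 23.0, 12.0])
    self_ = np.array([17.4, 20.4, 20.0, 20.0, 18.4, 18.6, 18.6, 15.3,
                      16.5, 18.0, 16.3, 18.0, 12.8, 15.5, 18.0])

    # The paired t-test is a one-sample t-test on the differences D = cross - self_
    t, p = stats.ttest_rel(cross, self_)
    print(round(t, 3), round(p, 3))   # t = 2.142 with 14 d.f.; two-sided p just above .05

Running stats.ttest_1samp(cross - self_, 0) gives exactly the same answer, which is the point of the recipe above.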

Section 3. Two-sample t-test (also called the independent samples t-test)

Frequently, we have two separate groups on which we wish to make comparisons. We

may be interested in comparing mean salaries for male and female entry-level employees, or

length of hospital stays for HMO and PPC plan enrollees. In the first case, our independent

variable is gender while the dependent variable is salary. Salary is a quantitative variable for

which we summarize typical values using the mean. Unlike the paired t-test, where the values in

each group are naturally matched, here we assume the two groups are completely independent.

Diagram 1 gives a schematic of the statistical situation. We have two populations

summarized by the means in each (μ1 and μ2) and the standard deviations (σ1 and σ2). Our

hypotheses are

Ho: μ1 = μ2 (μ1 - μ2 = 0) "group" has no effect on mean

Ha: μ1 ≠ μ2 (μ1 - μ2 ≠ 0) "group" has an effect on mean

Since we cannot observe μ1 and μ2, we must use our sample data to reach conclusions. Looking

at the hypotheses, our natural move is to compare the two sample means to each other, or

equivalently, their difference to 0.


DIAGRAM 1. Comparing two populations (Ho: μ1 = μ2)

Population 1 (parameters μ1, σ1)  -->  Sample 1 (statistics: n1, mean x̄1, s1)

Population 2 (parameters μ2, σ2)  -->  Sample 2 (statistics: n2, mean x̄2, s2)

If the population variances are known, probability theory shows that the appropriate statistic

would be

Z = (x̄1 - x̄2) / √(σ1²/n1 + σ2²/n2)

The two sample t-test has two versions, which differ in how they "doctor up" the

denominator of this statistic since the population variances are hardly ever known. The versions

differ depending on whether the two population variances can be assumed equal or unequal. In

Section 4 we cover a method for checking this assumption.

Section 3A. Large samples or unequal variances

When the variances (or standard deviations) in the two groups appear very dissimilar, the

best method may be the unequal variance version. This method does not require the assumption

of equal variances. The disadvantage of this method is that the degrees of freedom are

sometimes small, and they are always difficult to calculate (this is referred to as Satterthwaite's

approximation). While the test statistic itself is easy to calculate, the degrees of freedom are best

computed by a statistical package. Without the computer, it helps to know that the d.f. are

always between ns - 1, where ns is the size of the smaller sample, and n1 + n2 - 2, so if you get

the same conclusion using both those d.f., you are safe. If both samples are large (at least 50), it



is probably safe to use infinite (∞) d.f. The value of the test statistic is computed by:

t = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)
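
As an illustration only (the data here are invented), scipy computes this version of the test, and Satterthwaite's approximate d.f. can be reproduced by hand to see that it lands between ns - 1 and n1 + n2 - 2:

    import numpy as np
    from scipy import stats

    # two small groups with visibly different spreads
    g1 = np.array([5.1, 6.3, 4.8, 5.9, 6.1, 5.5])
    g2 = np.array([3.2, 9.8, 1.5, 8.7, 2.9, 10.4, 6.6, 4.1])

    t, p = stats.ttest_ind(g1, g2, equal_var=False)   # unequal variance version

    # Satterthwaite's approximate degrees of freedom
    v1, v2 = g1.var(ddof=1) / len(g1), g2.var(ddof=1) / len(g2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (len(g1) - 1) + v2 ** 2 / (len(g2) - 1))
    # df falls between min(n1, n2) - 1 = 5 and n1 + n2 - 2 = 12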

Section 3B. Equal variance t-test

When the variances in the two samples appear similar, it is advantageous to "pool" the

two estimates into an estimate of the alleged single underlying variance. This allows us to pool

the degrees of freedom in the two groups as well, giving more sensitive critical regions.

sp² = [(n1 - 1)s1² + (n2 - 1)s2²] / (n1 + n2 - 2)   (pooled variance)

t = (x̄1 - x̄2) / [sp √(1/n1 + 1/n2)]   (note sp, not sp²)

d.f. = n1 + n2 - 2
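
The pooling step is easy to mirror in code. In this sketch (again with invented data; numpy and scipy assumed), sp² is computed by hand and agrees with scipy's equal-variance test:

    import numpy as np
    from scipy import stats

    g1 = np.array([12.1, 11.4, 13.0, 12.6, 11.9])
    g2 = np.array([10.2, 11.1, 10.8, 9.9, 10.5, 11.3])

    n1, n2 = len(g1), len(g2)
    sp2 = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
    t_hand = (g1.mean() - g2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

    t_scipy, p = stats.ttest_ind(g1, g2, equal_var=True)   # pooled version
    # t_hand equals t_scipy, with d.f. = n1 + n2 - 2 = 9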

Section 4. Comparing two standard deviations

Some authors now argue that we should always use the unequal variance version of the

test. Traditionally, however, the equal variance version was preferred both because of the

potentially greater degrees of freedom and because its relation to more advanced topics (like the

one-way ANOVA) is well understood. In this tradition, before we decide which version of the

two-sample t-test to use, we need a tool for deciding whether the variances in two groups are

equal or different. This amounts to a hypothesis test for the hypotheses

Ho: σ1 = σ2 (or equivalently, σ1² = σ2²)

vs Ha: σ1 ≠ σ2 (or equivalently, σ1² ≠ σ2²)

Note the restatement of the hypotheses in terms of the variances. There are many tests available

for testing these hypotheses. The most commonly cited are Fisher's test (F-test classic!) and


Levene's test.

Section 4a. Fisher's test

The test statistic used to compare the variances is the F-statistic. F is for Sir R. A. Fisher,

who pioneered many classic statistical techniques.

F = s1² / s2²   or   F' = s²max / s²min

F' differs from F only in that it always places the larger of the two sample variances in the

numerator. If the null hypothesis is true, we expect F (or F') to be near 1. If F is either very

much larger or very much smaller than 1 (F' very much larger than 1), we would believe Ha is

true. As always, the question is where to draw the line (critical region).

The table of the F-distribution is provided in most statistics texts. Most tables only give

the cutpoint which marks off the lower 1-A of area from the upper A of area in the righthand tail.

It can be quite confusing to understand how to use this to get the critical values for all the

varieties of test.

[Figure: generic shape of an F distribution, showing area 1-A to the left of the cutpoint and area A in the righthand tail.]

a) F is explicitly a two-tailed test. So we need cutpoints which mark off the lower α/2 area

in the lefthand tail, and α/2 in the righthand tail. Most tables only give the righthand cutpoint.

To get the lefthand cutpoint, you use

Lefthand cutpoint for lower α/2 with (M, N) d.f. = 1 / (righthand cutpoint for upper α/2 with (N, M) d.f.)

Example: Suppose we are using α = 5%, and sample 1 has n=10 while sample 2 has n=6 (9 and

5 df, respectively). We should put 2.5% in each tail. From the table, we see that the cutpoint for

the upper tail is 6.68. To get the lower cutpoint, we need to reverse the order of the d.f. (now 5

and 9), then take the reciprocal of the upper cutpoint. That is, the lower cutpoint is 1/4.48=.223.
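
Readers with scipy handy can verify the reciprocal trick directly; stats.f.ppf(q, dfn, dfd) returns the cutpoint with area q to its left (a sketch, not required for the course):

    from scipy import stats

    upper = stats.f.ppf(0.975, 9, 5)       # righthand cutpoint with (9, 5) d.f.: about 6.68
    lower = 1 / stats.f.ppf(0.975, 5, 9)   # lefthand cutpoint: 1/4.48, about 0.223

    # equivalently, lower equals stats.f.ppf(0.025, 9, 5)
    print(round(upper, 2), round(lower, 3))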

b) F' is also two-tailed, but it finesses the problem of getting the lower cutpoint by arranging

to always put the largest variance on top. Hence, if we had sample 1 with n=10 and sample 2

with n=6, our critical region would be: If s1 is the largest, reject if F' > 6.68 (9 and 5 df, with

α/2 in the upper tail); if s2 is the largest, reject if F' > 4.48 (5 and 9 df, with α/2 in the upper


tail).

Example of an F test.

Notice that sometimes hypothesis tests about the variances (or standard deviations) are of

interest in their own right. Drill press operators in a manufacturing plant must drill holes of specified

diameter in sheets of metal. One goal is that all holes should have the same

diameter (small variability in the individual diameters). Actual diameters

are measured for 20 holes drilled by inexperienced operators, and 10 holes

drilled by experienced operators. The data is summarized below. Is there

significant evidence, at α = 5%, that the population variances differ for experienced and inexperienced operators?

Inexperienced n = 20 s = .52 mm

Experienced n = 10 s = .21 mm

1) Ho: σI = σE (variability is the same for experienced and inexperienced operators)

Ha: σI ≠ σE (variability is not the same)

2) We will reject Ho if F' > 3.69, using the F table for upper area of .025, 19

df in numerator and 9 in denominator.

3) F' = .52² / .21² = 6.13

4) There is significant evidence that the variability in the diameters

differs for experienced and inexperienced operators. Inexperienced operators

have larger variability (less consistency) in the diameters of the holes.
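
Here is the same calculation sketched in Python (scipy assumed). Since only summary statistics are given, we work from the standard deviations directly:

    from scipy import stats

    s_inexp, n_inexp = 0.52, 20   # inexperienced operators
    s_exp, n_exp = 0.21, 10       # experienced operators

    F_prime = (s_inexp / s_exp) ** 2                       # larger variance on top: about 6.13
    p_upper = stats.f.sf(F_prime, n_inexp - 1, n_exp - 1)  # area above F' with (19, 9) d.f.

    # F' is a two-tailed procedure, so reject at alpha = 5% if p_upper < .025
    print(round(F_prime, 2), round(p_upper, 4))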

Section 4b. Levene's Test (used by SPSS)

Levene's test actually tests the null hypothesis that the mean absolute distance from the

individual observations to their group mean is the same in each group. Instead of defining

dispersion in terms of 'squared distances' as variances do, it uses absolute values of distances.

The actual algorithm is as follows:

1) Within each group, compute the difference between the individual observations and

the group mean.

2) Take the absolute value of these differences.

3) Do an independent samples t-test (equal variance version) of the null hypothesis that the

means of the absolute differences are equal.

4) Square the t-value from the t-test. (Under Ho, the square of a t should have an F

distribution with 1 df in the numerator and n1 + n2 -2 in the denominator.) Compare it to the

cutpoint which places α (usually 5%) area in the upper tail of the distribution. You are only

interested in large values of F, because only large values would indicate that the variances are

different. (Note the difference between this and the cutpoints for Fisher's test, which place α/2 in

each tail.)


Large values of F indicate that one of the means must be different from the other (Ha

true). Bear in mind at this point that we are no longer talking about the means of the raw data,

but of the distance of the raw values around their group means. In the example above, a large

value for F would indicate that the typical (mean) distance of individual diameters from the

group mean was larger in one group than in another, indicating more variability in one group.

Levene's Test and Fisher's Test do not give exactly the same result. Except in borderline

cases, however, they usually give comparable values. There is some evidence that

Levene's Test is less sensitive to departures from the normality assumption, and I think that is

why it is the default in SPSS.

Example of Levene's Test

The following data show test scores for five freshmen and five juniors on an assessment test for critical thinking. Does variability differ in the two groups, using α = 5%?

Freshmen: 28 32 21 36 33 (sample mean = 30.0)

Juniors: 34 49 43 32 27 (sample mean = 37.0)

Ho: σ1² = σ2² vs Ha: σ1² ≠ σ2². Reject Ho if F > 5.32 (using the table

with 1 and 8 d.f., and 5% in the upper tail.)

Absolute values of differences from mean within each group:

Freshmen: 2 2 9 6 3 (sample mean = 4.4, s = 3.05)

Juniors: 3 12 6 5 10 (sample mean = 7.2, s = 3.70)

Sp = 3.39, df = 8, t = 1.31, F = 1.71. Since 1.71 is less than the

cutpoint of 5.32, there is no significant evidence that the variances are

different in the two groups. When we compute the t-test to compare the means

of the test scores, we can use the equal variance version.
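
scipy has a built-in version of this test. With center='mean' it follows the textbook algorithm above (its default, center='median', is the more robust Brown-Forsythe variant), as this sketch shows:

    import numpy as np
    from scipy import stats

    freshmen = np.array([28, 32, 21, 36, 33])
    juniors = np.array([34, 49, 43, 32, 27])

    W, p = stats.levene(freshmen, juniors, center='mean')
    print(round(W, 2), round(p, 3))   # W = 1.70 with (1, 8) d.f.: the square of the t above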

Section 5. A Full Example!!

Recipe for a two sample t-test:

Data structure: Two separate groups are measured for a quantitative variable Y.

Performing the two sample t-test

1) State the null and alternative hypotheses in terms of the means in the two groups.

2) Decide which version of the t-test to use by using the F-test or Levene's test to examine the

variances within the two samples.

3) Calculate the number of degrees of freedom for the appropriate version of the test. Use your

α and a table of the t-distribution to set the critical region.

4) Calculate the appropriate version of t.

5) Write your conclusion.

Most computer programs will automatically calculate both versions of t as well as F, along with their p-

values, saving you a lot of effort.

Example for two-sample test (two independent samples)


Notice that the two groups of patients are completely separate, with no natural pairing.

In small to moderate samples, the particular version of the two-sample t-test depends on whether

the variances within the two groups seem similar. SAS for Windows computes an F-test to help

you decide which version is appropriate. The data summarized below show cholesterol values for the 39 heaviest

men in the Western Collaborative Group Study. (This study was carried out in

California in 1960-1961 and involved 3,154 middle-aged men. The purpose was

to study behaviour patterns and risk of coronary heart disease.) All the

cholesterols summarized below are for men weighing more than 225 pounds.

Cholesterols are given in mg per 100 ml. Each man was rated as generally

having Behaviour Type A (urgency, aggression, ambition) or Behaviour Type B

(relaxed, non-competitive, less hurried.) In heavy, middle-aged men, is

cholesterol level related to behaviour type?

1) The null hypothesis is that behavior type has no effect on mean

cholesterol, while the alternative hypothesis is that it does have an effect

on mean cholesterol. In symbols:

Ho: μA = μB vs Ha: μA ≠ μB

2) Since the hypotheses concern the means in two separate groups, we will use

the two sample t-test. To decide which version, we notice that the program

has printed the value of F', along with the p-value (labeled Prob>F').

Recall that this statistic tests the null hypothesis that the two population

variances are equal. Since the p-value of .2927 is greater than any

reasonable α (.1 to .01), it is reasonable to assume that the variances are equal and use that version of the t-test.

3) For the pooled (equal) variance version, the d.f.=37. With a significance

level of 5%, we would reject Ho if t < -2.021 or t > 2.021. Alternatively, we

reject if the p-value is less than .05.

4) For the equal variance version, t=2.5191, df=37 and the p-value is .0162.

You should use the table of sample means and standard deviations to check

these results.

5) If we are using a significance level of .05, we would reject the null

hypothesis. Hence, we can say there is significant evidence that behaviour

type is associated with differences in mean cholesterol.

COMPUTER PRINTOUT - TTEST PROCEDURE

Variable: CHOL

TYPE N Mean Std Dev Std Error

---------------------------------------------------------------------

A 19 245.36842105 37.61384279 8.62920735

B 20 210.30000000 48.33991486 10.80913356

Variances T DF Prob>|T| <----- note how SAS labels

--------------------------------------- the p-values

Unequal 2.5355 35.7 0.0158

Equal 2.5191 37.0 0.0162

For H0: Variances are equal, F' = 1.65 DF = (19,18)

Prob>F' = 0.2927 <------note how SAS labels

the p-values
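
As step 4 suggests, the equal-variance t can be checked from the table of means and standard deviations alone. A sketch in Python (scipy assumed); scipy's ttest_ind_from_stats wraps the whole calculation in one call:

    import math
    from scipy import stats

    n1, m1, s1 = 19, 245.36842105, 37.61384279   # Type A
    n2, m2, s2 = 20, 210.30000000, 48.33991486   # Type B

    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # pooled variance
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    p = 2 * stats.t.sf(abs(t), n1 + n2 - 2)
    print(round(t, 4), round(p, 4))   # t = 2.5191, p = .0162, matching the printout

    # one-call equivalent:
    # stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=True)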


Boxplot for CHOL by TYPE

[SAS text boxplot: cholesterol (vertical axis, roughly 100 to 400) by TYPE (A, B); the Type A box is centered higher than the Type B box, and one high value is flagged as an outlier.]

CASE STUDY

Jerrold et al. (2009) compared typically developing children to young adults who had Down syndrome, with respect to a number of psychological measures thought to be related to the ability to learn new words. Data on two of the measures are summarized in the table below. Recall Score is a measure of verbal short-term memory. Raven's CPM is a task in which the participant must correctly identify an image which completes a central pattern. The authors used the pooled t test to compare the typical scores in the two groups. For Raven's CPM, t = .485, p-value = .629. For Recall Score, t = 7.007, p-value < .0001. Hence, the two groups did not differ significantly with respect to mean Raven's CPM, but the Down syndrome group scored significantly differently (apparently lower) on Recall Score. Based on this and a number of other comparisons, the authors conclude that verbal short-term memory is a primary factor in the ability to learn new words.

The authors' choice of the pooled t test rather than the unequal-variance t test appears reasonable here. For Raven's CPM, F = 0.700, p-value = .379. For Recall Score, F = 0.691, p-value = .361. Neither variable showed a significant difference in the variances within the groups. The other distributional assumption underlying t tests is that the data come from normal distributions. Journal publications rarely have space in which to present graphical evidence with which the reader can check this assumption. However, the discussion will often include a sentence addressing this issue, and remark on any transformations (e.g. logarithms) used to make the variable more nearly normal. The authors actually presented the results of the pooled t test (with 80 degrees of freedom) as an F test with 1 degree of freedom in the numerator and 80 in the denominator. The relation between these two test statistics will be explained in Chapter 4.

Summary statistics from Jerrold (2009).

                 Down syndrome young adults    Typically developing children
                 n = 21                        n = 61
                 Mean     S.D.                 Mean     S.D.
Raven's CPM      19.33    4.04                 19.90    4.83
Recall Score     12.00    3.05                 18.25    3.67

(Source: Jerrold, C., Thorn, A. S. C, and Stephens, E. (2009). The relationship among verbal short-term memory, phonological awareness, and new word learning: evidence from typical development and Down syndrome. J. Experimental Child Psychology, 102(2) 196-218.)
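
As an aside, the reported t statistics can be reproduced from the summary table alone. A sketch (Python, scipy assumed) for Recall Score:

    from scipy import stats

    # summary statistics from the table above (Down syndrome group first)
    res = stats.ttest_ind_from_stats(mean1=12.00, std1=3.05, nobs1=21,
                                     mean2=18.25, std2=3.67, nobs2=61,
                                     equal_var=True)
    print(res)   # |t| = 7.007 with 80 d.f., p-value < .0001, as the authors report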


Section 6. Nasty mathematical assumptions

We already know of two fundamental assumptions underlying the tests in this chapter,

and that of the t-test in Chapter 1.

1) The sample must be random

2) The dependent variable must be quantitative

In addition, the derivations of the t and F-distributions have a nasty mathematical assumption:

that the distribution of the variable in the population must follow a normal distribution. In plain

language, if you could draw a histogram of the values for all the observations in the entire

population, you should see the famous "bell curve". So we have a third assumption:

3) The distribution for the individual values is normal.

It is not very likely that we will ever know for sure whether assumption 3 is met. What can we

do to check and how important is it anyway? There are several graphical techniques we can use

to check for normality. So far, we have seen dotplots and boxplots, though we have not

discussed them. (See your elementary text.) In chapter 4 we will meet a tool called a normal

probability plot which gives a more sensitive check. What are we really looking for? An

immediate cause of trouble in a small or moderate data set would be when one or two values are

very far away from the rest. The self/cross-fertilization data used as an example of the paired t-

test may be a case where the data contains two "outliers". Outliers should be rare in normally

distributed data. Outliers can cause the p-values and critical regions to be only approximate.

The most common problem is to make the p-value larger than it should be.

If there are no outliers, and the data show a nearly symmetric pattern with the points

clustering in the middle of the range, then it is unlikely that non-normality is a serious problem.

If you do seriously suspect nonnormality in your data, consult a statistician about the variety of

"nonparametric" statistical tests which do not require the normality assumption.

There is one frequent case in social science data where normality is very questionable. If

you have data collected on an ordinal scale (e.g. 0 = strongly disagree to 4 = strongly agree), it is

unlikely to be normally distributed. Recall that the normal distribution is for continuous or

nearly continuous random variables, and data on a five point scale is quite discrete. This is


especially true if the values cluster at one end or the other end of the scale (e.g. almost all agree

or strongly agree). In this case, one of the techniques of Chapter 3 might be appropriate.

Furthermore, it is questionable whether one can legitimately average values on this kind of

scale: does ("disagree" + "strongly agree")/2 = "agree"? Nevertheless, treating this ordinal

data AS IF it were numerical on a 0-4 scale, and conducting averaging operations, is a sloppy but

common practice in the social sciences. Averages over several questions frequently produce

values which appear reasonably normally distributed.

Finally, when comparing two population means, the choice of the version of the test

depends on whether variances can be assumed equal. Since this assumption, referred to as

"homogeneity of variance", underlies much of the Analysis of Variance, we list it as a fourth

assumption:

4) variances in the two groups are equal.


EXERCISES FOR CHAPTER 2

*Exercise 1. Data below show blood pressures for 5 subjects. The first value was taken while the

subject was resting. The second was taken while the subject was resting, but asked to work a mental

arithmetic problem. Does math affect mean blood pressure? Use α = 5%.

Subject number Resting BP During Math BP

1 115 125

2 125 125

3 110 130

4 120 115

5 110 125

*Exercise 2. Occupancy rates (average annual percentage of beds filled) are compared for randomly

selected urban and suburban hospitals in a state.

a. Is there evidence of difference in variability between the two groups? Compute both Levene's Test

and Fisher's Test to answer this. Use α = 5%.

b. Is there evidence of a difference in the mean occupancy rates? Use the results of A to help you

decide on an appropriate version of the t-test. Use α = 5%.

Urban: 76.5 79.6 77.5 79.4 79.3 78.1

Suburban: 71.5 73.4 71.2 67.8 63.0 76.5

Exercise 3. Eight students volunteer to participate in a test of the effect of caffeine on the speed with

which they can respond to a flashing light. Each student takes the test on a morning when they have

had no caffeine, then again a week later on a morning after having had the equivalent of two cups of

coffee. The data is given below, in hundredths of seconds to respond to the light.

Subject Without caffeine With caffeine

1 12 10

2 18 14

3 22 20

4 9 8

5 14 14

6 24 21

7 21 19

8 16 14

Does caffeine have an effect? Use α = 5%.

Exercise 4. Do HMOs really reduce costs of care? 40 adults aged 55-60 enrolled in HMOs are

questioned on their health care within the last 2 years. They report an average number of days hospitalized


during that period of 1.19 days with a standard deviation of 1.4 days. A similar sample of 40 adults

with ordinary healthcare insurance reports an average of 1.35 days with a standard deviation of 1.7

days.

a) Is there evidence of a difference in the variability within the groups? (You don't have enough

information to do Levene's test here; you must use Fisher's.)

b) Is there evidence of a difference in the means for the groups?

Use α = 5% for each test.

Exercise 5. We are comparing math FCAT scores for rural and urban high schools. We have a

random sample of 20 urban high schools and 20 rural high schools. Their school-aggregated math

FCAT scores for 10th graders are summarized below.

Location n sample mean sample standard deviation

Rural 20 1925 252

Urban 20 1982 212

a. Use the F' test to say whether it is reasonable to assume that the two populations have the same

variance. Why can you not prove that the variances are equal?

b. Do the means differ significantly in the two groups? Use α = 1%.

Exercise 6. From each of 4 different litters of mice, a researcher chooses two female mice (for a total

of 8 mice). Within each pair of sisters, one is chosen to be fed a standard diet, and the other is fed a

high-protein diet. Their weight in grams, at the end of 6 weeks, is shown below.

Diet Pair 1 Pair 2 Pair 3 Pair 4

Standard 19.4 18.2 18.5 19.8

High-protein 17.6 19.4 17.2 19.2

Do the different diets seem to affect mean weight? Use α = 5%.

Exercise 7. (A development from Exercise 1.) The researcher wishes to know whether girls and boys

differ in their reaction to arithmetic. 5 girls are recruited, and their blood pressures are tested

resting, and again resting but doing mental arithmetic. 5 boys are tested under the same

circumstances. The data is given below. Is there significant evidence, at α = 5%, that boys and girls

differ in the mean change in BP experienced while doing arithmetic? Note: this experimental design

uses ideas from both paired and two-sample experiments!

Girls Boys

Resting During Math Resting During Math

120 130 120 125

110 115 110 125

115 115 105 115

110 120 110 120

110 115 120 115


Exercise 8. Pedersen (2007, Perceptual and Motor Skills, 104(1), pp 201-211) interviewed a sample

of students enrolled in psychology courses in a large private university in the western U.S. regarding

their attitudes towards sports. Each student was asked to self-rate his or her degree of sport

participation, on a scale of 1 to 5. The 112 men in the sample had M = 4.3 and SD = 1.7. The 173

women had M = 3.6 and S.D. = 1.7. (M is a common abbreviation for the sample mean, and SD a

common abbreviation for the standard deviation.) Is there significant evidence, at α = 1%, that men

and women at this university differ in their mean self-rankings of sport participation?

Exercise 9. Martinussen et al. (2007, J. Criminal Justice 35, 239-249) compared 'burnout' among a

sample of Norwegian police officers to a comparison group of air traffic controllers, journalists and

building constructors. Burnout was measured on three scales: exhaustion, cynicism, and efficacy. The

data is summarized in the table below. The authors state

The overall level of burnout was not high among police compared to other occupational

groups sampled from Norway. In fact, police scored significantly lower on exhaustion and

cynicism than the comparison group, and the difference between groups was largest for

exhaustion.

Substantiate the authors' claim regarding Exhaustion. That is, check that it does show a significant

difference between the two groups.

Summary Statistics for Exercise 9

Police, n = 222 Comparison group, n = 473

Mean      Std dev      Mean      Std dev

Exhaustion 1.38 1.14 2.20 1.46

Cynicism 1.50 1.33 1.75 1.34

Efficacy 4.72 0.97 4.69 0.89

SOLUTIONS TO STARRED PROBLEMS

Exercise 1. Notice the existence of a pairing mechanism between items. Each experimental unit (a subject) has two

blood pressures--a resting and a 'during math' blood pressure. This should be done via a paired t-test.

a) μD = mean difference in during math - resting blood pressure in the population

Ho: μD = 0 versus Ha: μD ≠ 0.

b) The 5 differences in the sample are: 10 0 20 -5 15. There will be 4 degrees of freedom. We will reject Ho if t

< -2.776 or t > 2.776.

c) d̄ = 8 and sD = 10.368, so t = (8 - 0) / (10.368/√5) = 1.725

d) Do not reject Ho. There is no significant evidence that math affects mean blood pressure, at α = 5%.

Further Note. A computer package would not tell you the cutpoints for t. Instead, it would report that the p-value for this

data was .1595. Since .1595 > .05 (your α), you would not reject Ho.

Exercise 2.

Urban group had mean = 78.4, s.d. 1.2458


Suburban group had mean = 70.5667 and s.d. = 4.6779

a) Difference in variability: Ho: σS² = σU² versus Ha: σS² ≠ σU²

Fisher's test. Reject if F' > F-table value with 6-1=5 and 6-1=5 df and 2.5% in the tail.

F' = 4.6779² / 1.2458² = 14.1. Cutpoint in table is 7.15. Since F' > 7.15, we reject Ho. There is significant evidence of

a difference in variability, at α = 5%. (Note, the tail value is half the desired α for the F' version.)

Levene's test. Absolute values of differences of individual scores from the group mean:

Urban: 1.9 1.2 0.9 1.0 0.9 0.3

Suburban: .93 2.83 .63 2.77 7.57 5.93

Running an independent samples t-test (equal variance version) on this data gives t = 2.099. F = 2.099² = 4.406 with 1 and

10 df. Since the cutpoint in the F-table is 4.96 for α = 5% (don't split the α!) we would not reject Ho: that is, we have no

significant evidence of a difference in variability.

b) Fisher's test and Levene's test differ on the advisability of using the equal variance / unequal variance version.

Fortunately, in this case, the answers don't differ. Both versions of t come out to 3.96, which would be significant

whether you use df=10 (equal variance version) or df=5 (smallest df possible under unequal variance version).


SPSS NOTES IF YOU WANT TO GET STARTED ON YOUR OWN!

Step 1. Deciding how to set up your data.

When you double-click on the SPSS icon, the first thing you see is a spreadsheet-like grid for entering your data. This is

called the Data Editor. Before you charge in and start typing, you have to think about how the data is structured. The

format in which you enter the data must follow that structure.

The basic rule-of-thumb is that entries on the same row, or line, are from the SAME subject, or experimental unit.

Things on different lines are from different subjects. Things in different columns are different measurements. Let's see

how that plays out in the Starred Exercises. By double-clicking on the heading of each column, you can change the name

to something sensible, and also indicate whether your data is nominal ('string') or numerical.

Exercise 1. There are 5 different subjects, so our data entry will have five rows. There should be one column for subject

number ('SUBJ'), one for the resting BP ('REST') and one for the during-math BP ('MATH'). In other words, the data

entry will look very like the table given in the problem.

Exercise 2. There are 14 different hospitals. Each will have its own row in the data entry. In addition to a column for

occupancy rate ('O_RATE'), I will need a column which tells me whether it is an Urban or Suburban hospital. I will call

this column 'LOCATION'. Many of the ANOVA and T-test routines in SPSS want group variables to be coded AS IF

they were numeric. I am going to code Urban=0, Suburban=1. Keep notes of the codes you define.

LOCATION O_RATE

0 76.5

0 79.6

......

1 76.5

Step 2. Request the appropriate t-test

Exercise 1.

In SPSS, click on the ANALYZE option at the top. From the drop-down menu, request COMPARE MEANS. Choose the

type of T-test you need, in this case, the PAIRED SAMPLES T-TEST. You will see a 'Dialog Box' for choosing the variables.

You need to click on the column names with the two variables you are trying to compare (REST and MATH), and move

them into the big box on the right using the key that looks like an arrow >. Then hit the OK button. You will see printout

like that shown below.


Paired Samples Statistics

                Mean       N   Std. Deviation   Std. Error Mean
Pair 1   REST   116.0000   5   6.51920          2.91548
         MATH   124.0000   5   5.47723          2.44949

Paired Samples Correlations

                       N   Correlation   Sig.
Pair 1   REST & MATH   5   -.490         .402

Paired Samples Test

                       Paired Differences                                                    t        df   Sig. (2-tailed)
                       Mean      Std. Deviation   Std. Error Mean   95% CI Lower   Upper
Pair 1   REST - MATH   -8.0000   10.36822         4.63681           -20.8738       4.8738    -1.725   4    .160

The first panel gives you some of the summary statistics within each group. The last panel reports the results of the t-test.

The p-value is labeled Sig., which is short for 'Observed Significance Level'. Since the .16 is greater than your α, you do

not have significant evidence of a math effect.

Exercise 2. From the ANALYZE / COMPARE MEANS menu, choose INDEPENDENT SAMPLES T-TEST. You need

to click on O-RATE and use the > key to move it into the Test Variable(s) box. You need to click on location and move it

into the Grouping Variable box. Then hit OK.


Group Statistics

         LOCATION   N   Mean      Std. Deviation   Std. Error Mean
O_RATE   0          6   78.4000   1.24579          .50859
         1          6   70.5667   4.67789          1.90974

Independent Samples Test

                                       Levene's Test for       t-test for Equality of Means
                                       Equality of Variances
                                       F       Sig.            t       df      Sig.         Mean         Std. Error   95% CI of the Difference
                                                                               (2-tailed)   Difference   Difference   Lower       Upper
O_RATE   Equal variances assumed       4.406   .062            3.964   10      .003         7.8333       1.97630      3.42985     12.23681
         Equal variances not assumed                           3.964   5.706   .008         7.8333       1.97630      2.93638     12.73029

Note that SPSS automatically gives you Levene's test to help you choose the version of the t-test. The p-value is once

again labeled Sig. in SPSS.


BOXPLOTS

Simple boxplots use a box to mark off the middle 50% of the data. The box extends from the first

quartile to the third quartile, with a thick mark at the median. The purpose of the box is to draw your

eye to the central 'typical' half of the data. The lowest 25% of the data is marked off by a whisker that

extends from the minimum value to the first quartile. The highest 25% of the data is marked off by a

whisker that extends from the third quartile (75th percentile) to the maximum.

Modified boxplots alter the whiskers to draw your attention to outliers, or wild values in the data set.

To define outliers, the computer calculates a value called the hinge width, which is 1.5 x (75th percentile - 25th percentile) = 1.5 x length of 'box'. Any value lying more than one hinge width

ABOVE the 75th percentile is an outlier on the high side. Any value lying more than one hinge width

BELOW the 25th percentile is an outlier on the low side. Modified boxplots draw the whiskers from

the quartile to the most extreme value that is not an outlier. Any outliers are marked off with a

separate symbol.
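
The hinge-width rule is simple enough to express in a few lines of code. A minimal sketch in Python (numpy assumed; the function name outlier_limits is ours, for illustration):

    import numpy as np

    def outlier_limits(x):
        # hinge width = 1.5 x (75th percentile - 25th percentile)
        q1, q3 = np.percentile(x, [25, 75])
        hinge = 1.5 * (q3 - q1)
        return q1 - hinge, q3 + hinge   # values outside these limits are outliers

    x = np.array([3, 4, 4, 5, 6, 6, 7, 8, 9, 25])
    lo, hi = outlier_limits(x)
    print(x[(x < lo) | (x > hi)])   # [25] is flagged as a high-side outlier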

Boxplots give you a quick view as to whether typical values (denoted by the boxes) are changing.

They also help you see whether the spread (variance) is relatively stable. They can also help you

diagnose non-normality, by helping you spot asymmetries or outliers.

Outliers are wild, unusual values. Normally distributed data should have very few, if any, outliers.

A very large data set might reasonably have a few (1%?) outliers without causing harm, but very

severe or frequent outliers can cause statistical trouble. Moreover, outliers are of interest in their

own right -- what causes these people to be so different from the rest?

Example of Boxplots

The typical values are higher in Group 1 than in Group 2. The spreads are

similar, except that Group 2 has an outlier with an unusually large value.

[SPSS boxplot figure: X (vertical axis, -2 to 14) by GROUP (1.0000 and 2.0000), N = 15 in each group; the high outlier in Group 2 is flagged with its case number, 25.]