Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS


2003-2005, The Trustees of Indiana University Comparing Group Means: 1

http://www.indiana.edu/~statmath


Hun Myoung Park

This document summarizes the method of comparing group means and illustrates how to

conduct the t-test and one-way ANOVA using STATA 9.0, SAS 9.1, and SPSS 13.0.

1. Introduction
2. Univariate Samples
3. Paired (Dependent) Samples
4. Independent Samples with Equal Variances
5. Independent Samples with Unequal Variances
6. One-way ANOVA, GLM, and Regression
7. Conclusion

1. Introduction

The t-test and analysis of variance (ANOVA) compare group means. The mean of a variable to

be compared should be substantively interpretable. A t-test may examine gender differences in

average salary or racial (white versus black) differences in average annual income. The left-hand side (LHS) variable to be tested should be interval or ratio, whereas the right-hand side

(RHS) variable should be binary (categorical).

1.1 T-test and ANOVA

While the t-test is limited to comparing means of two groups, one-way ANOVA can compare more than two groups. Therefore, the t-test is considered a special case of one-way ANOVA. These analyses do not, however, necessarily imply any causality (i.e., a causal relationship between the left-hand and right-hand side variables). Table 1 compares the t-test and one-way ANOVA.

Table 1. Comparison between the T-test and One-way ANOVA

                     T-test                                 One-way ANOVA
LHS (Dependent)      Interval or ratio variable             Interval or ratio variable
RHS (Independent)    Binary variable with only two groups   Categorical variable
Null Hypothesis      $\mu_1 = \mu_2$                        $\mu_1 = \mu_2 = \mu_3 = \dots$
Prob. Distribution*  T distribution                         F distribution

* In the case of one degree of freedom on the numerator, $F = t^2$.

The t-test assumes that samples are randomly drawn from normally distributed populations with unknown population means. Otherwise, their means are no longer the best measures of central tendency and the t-test will not be valid. The Central Limit Theorem says, however, that the distributions of $\bar{y}_1$ and $\bar{y}_2$ are approximately normal when N is large. When $n_1 + n_2 \geq 30$, in practice, you do not need to worry too much about the normality assumption.

You may numerically test the normality assumption using the Shapiro-Wilk W (N





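Figure 1. Two Types of Data Arrangement

[Figure: the first arrangement stacks all observations in one column (Variable) and identifies the group in a second column (Group, coded 0 or 1); the second arrangement places the two groups in separate columns (Variable1 and Variable2).]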

The data set used here is adopted from J. F. Fraumeni's study on cigarette smoking and cancer (Fraumeni 1968). The data are per capita numbers of cigarettes sold by 43 states and the District of Columbia in 1960 together with death rates per hundred thousand people from various forms of cancer. Two variables were added to categorize states into two groups. See the appendix for the details.





2. Univariate Samples

The univariate-sample or one-sample t-test determines whether an unknown population mean $\mu$ differs from a hypothesized value c that is commonly set to zero: $H_0: \mu = c$. The t statistic follows Student's T probability distribution with n-1 degrees of freedom,

$t = \frac{\bar{y} - c}{s_{\bar{y}}} \sim t(n-1)$,

where y is a variable to be tested and n is the number of observations. [1]

Suppose you want to test if the population mean of the death rates from lung cancer is 20 per 100,000 people at the .01 significance level. Note the default significance level used in most software is the .05 level.

2.1 T-test in STATA

The .ttest command conducts t-tests in an easy and flexible manner. For a univariate sample test, the command requires that a hypothesized value be explicitly specified. The level() option indicates the confidence level as a percentage. The 99 percent confidence level is equivalent to the .01 significance level.

. ttest lung=20, level(99)

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [99% Conf. Interval]
---------+--------------------------------------------------------------------
    lung |      44    19.65318    .6374133    4.228122    17.93529    21.37108
------------------------------------------------------------------------------
    mean = mean(lung)                                             t =  -0.5441
Ho: mean = 20                                    degrees of freedom =       43

    Ha: mean < 20               Ha: mean != 20                 Ha: mean > 20
 Pr(T < t) = 0.2946         Pr(|T| > |t|) = 0.5892          Pr(T > t) = 0.7054

STATA first lists descriptive statistics of the variable lung. The mean and standard deviation of the 44 observations are 19.653 and 4.228, respectively. The t statistic is -.544 = (19.653-20) / .6374. Finally, the degrees of freedom are 43 = 44-1.

There are three t-tests at the bottom of the output above. The first and third are one-tailed tests, whereas the second is a two-tailed test. The t statistic -.544 and its large p-value do not reject the null hypothesis that the population mean of the death rate from lung cancer is 20 at the .01 level. The mean of the death rate may be 20 per 100,000 people. Note that the hypothesized value 20 falls into the 99 percent confidence interval 17.935-21.371. [2]
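The arithmetic behind this output can be retraced from the summary statistics alone. The following is a minimal Python sketch, not part of the original STATA session; the function names are illustrative, and the critical value 2.695 is taken from the document's footnote rather than computed.

```python
import math

def one_sample_t(n, mean, sd, c):
    """Return (t, degrees of freedom, standard error) for H0: population mean = c."""
    se = sd / math.sqrt(n)      # standard error of the mean
    t = (mean - c) / se         # distance from c in standard-error units
    return t, n - 1, se

def confidence_interval(mean, se, critical_value):
    """Two-sided interval: mean plus or minus critical value times standard error."""
    half_width = critical_value * se
    return mean - half_width, mean + half_width

# Summary statistics of lung as reported in the STATA output
t, df, se = one_sample_t(n=44, mean=19.65318, sd=4.228122, c=20)
low, high = confidence_interval(19.65318, se, critical_value=2.695)
print(round(t, 4), df, round(low, 3), round(high, 3))
```

Because the hypothesized value 20 lies inside the interval, the two-tailed test at the .01 level cannot reject the null hypothesis, which matches the reading of the output.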

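[1] $\bar{y} = \frac{\sum y_i}{n}$, $s^2 = \frac{\sum (y_i - \bar{y})^2}{n-1}$, and standard error $s_{\bar{y}} = \frac{s}{\sqrt{n}}$.

[2] The 99 percent confidence interval of the mean is $\bar{y} \pm t_{\alpha/2} s_{\bar{y}} = 19.653 \pm 2.695 \times .6374$, where 2.695 is the critical value with 43 degrees of freedom at the .01 level in the two-tailed test.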





If you just have the aggregate data (i.e., the number of observations, mean, and standard deviation of the sample), use the .ttesti command to replicate the t-test above. Note the hypothesized value is specified at the end of the summary statistics.

. ttesti 44 19.65318 4.228122 20, level(99)

2.2 T-test Using the SAS TTEST Procedure

The TTEST procedure conducts various types of t-tests in SAS. The H0 option specifies a hypothesized value, whereas the ALPHA option indicates a significance level. If omitted, the default values zero and .05 respectively are assumed.

PROC TTEST H0=20 ALPHA=.01 DATA=masil.smoking;
   VAR lung;
RUN;

The TTEST Procedure

Statistics

Lower CL Upper CL Lower CL Upper CL

Variable N Mean Mean Mean Std Dev Std Dev Std Dev Std Err

lung 44 17.935 19.653 21.371 3.2994 4.2281 5.7989 0.6374

T-Tests

Variable DF t Value Pr > |t|

lung 43 -0.54 0.5892

The TTEST procedure reports descriptive statistics followed by a two-tailed t-test (Pr > |t|). You may have a summary data set containing the values of a variable (lung) and their frequencies (count). The FREQ statement of the TTEST procedure provides the solution for this case.

PROC TTEST H0=20 ALPHA=.01 DATA=masil.smoking;
   VAR lung;
   FREQ count;
RUN;

2.3 T-test Using the SAS UNIVARIATE and MEANS Procedures

The SAS UNIVARIATE and MEANS procedures also conduct a t-test for a univariate sample. The UNIVARIATE procedure is basically designed to produce a variety of descriptive statistics of a variable. Its MU0 option tells the procedure to perform a t-test using the hypothesized value specified. The VARDEF=DF option specifies a divisor (degrees of freedom) used in computing the variance (standard deviation). [3] The NORMAL option examines if the variable is normally distributed.

PROC UNIVARIATE MU0=20 VARDEF=DF NORMAL ALPHA=.01 DATA=masil.smoking;
   VAR lung;
RUN;

The UNIVARIATE Procedure

Variable: lung

Moments

N 44 Sum Weights 44

Mean 19.6531818 Sum Observations 864.74

Std Deviation 4.22812167 Variance 17.8770129

Skewness -0.104796 Kurtosis -0.949602

Uncorrected SS 17763.604 Corrected SS 768.711555

Coeff Variation 21.5136751 Std Error Mean 0.63741333

Basic Statistical Measures

Location Variability

Mean 19.65318 Std Deviation 4.22812

Median 20.32000 Variance 17.87701

Mode . Range 15.26000

Interquartile Range 6.53000

Tests for Location: Mu0=20

Test -Statistic- -----p Value------

Student's t t -0.5441 Pr > |t| 0.5892

Sign M 1 Pr >= |M| 0.8804

Signed Rank S -36.5 Pr >= |S| 0.6752

Tests for Normality

Test --Statistic--- -----p Value------

Shapiro-Wilk W 0.967845 Pr < W 0.2535

Kolmogorov-Smirnov D 0.086184 Pr > D >0.1500

Cramer-von Mises W-Sq 0.063737 Pr > W-Sq >0.2500

Anderson-Darling A-Sq 0.382105 Pr > A-Sq >0.2500

Quantiles (Definition 5)

Quantile      Estimate
100% Max        27.270
99%             27.270
95%             25.950
90%             25.450
75% Q3          22.815
50% Median      20.320
25% Q1          16.285
10%             14.110
5%              12.120
1%              12.010
0% Min          12.010

[3] The VARDEF=N option uses N as a divisor, while VARDEF=WDF specifies the sum of weights minus one.

Extreme Observations

-----Lowest---- ----Highest----

Value Obs Value Obs

12.01 39 25.45 16

12.11 33 25.88 1

12.12 30 25.95 27

13.58 10 26.48 18

14.11 36 27.27 8

The third block of the output above reports a t statistic and its p-value. The fourth block contains several statistics of normality test. Since N is less than 2,000, you should read the Shapiro-Wilk W, which suggests that lung is normally distributed (p=.2535).

      Mean        Std Dev      Std Error    t Value
19.6531818      4.2281217      0.6374133      30.83





The MEANS procedure does not, however, have an option to specify a hypothesized value to

anything other than zero. Thus, the null hypothesis here is that the population mean of death

rate from lung cancer is zero. The t statistic 30.83 is (19.6532-0)/.6374. The large t statistic and

small p-value reject the null hypothesis, reporting a consistent conclusion.

2.4 T-test in SPSS

SPSS has the T-TEST command for t-tests. The /TESTVAL subcommand specifies the value with which the sample mean is compared, whereas the /VARIABLES subcommand lists the variables to be tested. Like STATA, SPSS specifies a confidence level rather than a significance level in the /CRITERIA=CI() subcommand.

T-TEST

/TESTVAL = 20

/VARIABLES = lung

/MISSING = ANALYSIS

/CRITERIA = CI(.99) .





3. Paired (Dependent) Samples

When two variables are not independent, but paired, the difference of these two variables, $d_i = y_{1i} - y_{2i}$, is treated as if it were a single sample. This test is appropriate for pre-post treatment responses. The null hypothesis is that the true mean difference of the two variables is $D_0$, $H_0: \mu_d = D_0$. [4] The difference is typically assumed to be zero unless explicitly specified.

3.1 T-test in STATA

In order to conduct a paired sample t-test, you need to list two variables separated by an equal sign. The interpretation of the t-test remains almost unchanged. The t statistic -1.871 = (-10.1667-0)/5.4337 at 35 degrees of freedom does not reject the null hypothesis that the difference is zero at the .05 level.

. ttest pre=post0, level(95)

Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
     pre |      36    176.0278    6.529723    39.17834    162.7717    189.2838
   post0 |      36    186.1944    7.826777    46.96066    170.3052    202.0836
---------+--------------------------------------------------------------------
    diff |      36   -10.16667    5.433655    32.60193   -21.19757    .8642387
------------------------------------------------------------------------------
     mean(diff) = mean(pre - post0)                               t =  -1.8711
 Ho: mean(diff) = 0                              degrees of freedom =       35

 Ha: mean(diff) < 0           Ha: mean(diff) != 0           Ha: mean(diff) > 0
 Pr(T < t) = 0.0349         Pr(|T| > |t|) = 0.0697          Pr(T > t) = 0.9651

Alternatively, you may first compute the difference between the two variables, and then conduct a one-sample t-test. Note that the default confidence level, level(95), can be omitted.

. gen d=pre-post0

. ttest d=0
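The equivalence between the paired test and a one-sample test on the differences can be sketched in a few lines of Python. The scores below are hypothetical and are not the drug data used in the text; the function name is likewise an illustration.

```python
import math
import statistics

def paired_t(first, second, d0=0.0):
    """Paired t statistic computed from the differences first[i] - second[i]."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)   # stdev uses the n-1 divisor
    return (statistics.mean(diffs) - d0) / se, n - 1

# Hypothetical pre and post scores for five subjects
pre = [170.0, 182.0, 165.0, 190.0, 176.0]
post0 = [175.0, 180.0, 172.0, 198.0, 181.0]
t, df = paired_t(pre, post0)
print(round(t, 4), df)

# The same ratio applied to the summary statistics reported by STATA
t_reported = -10.16667 / 5.433655
print(round(t_reported, 4))
```

Only the differences enter the calculation, which is why generating d first and testing d=0 gives the same t statistic as the paired command.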

3.2 T-test in SAS

In the TTEST procedure, you have to use the PAIRED instead of the VAR statement. For the output of the following procedure, refer to the end of this section.

PROC TTEST DATA=temp.drug;
   PAIRED pre*post0;
RUN;

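[4] $t_d = \frac{\bar{d} - D_0}{s_{\bar{d}}} \sim t(n-1)$, where $\bar{d} = \frac{\sum d_i}{n}$, $s_d^2 = \frac{\sum (d_i - \bar{d})^2}{n-1}$, and $s_{\bar{d}} = \frac{s_d}{\sqrt{n}}$.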



The PAIRED statement provides various ways of comparing variables using asterisk (*) and colon (:) operators. The asterisk requests comparisons between each variable on the left with each variable on the right. The colon requests comparisons between the first variable on the left and the first on the right, the second on the left and the second on the right, and so forth. Consider the following examples.

PROC TTEST;
   PAIRED pre:post0;
   PAIRED (a b)*(c d);        /* Equivalent to PAIRED a*c a*d b*c b*d; */
   PAIRED (a b):(c d);        /* Equivalent to PAIRED a*c b*d; */
   PAIRED (a1-a10)*(b1-b10);
RUN;
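The difference between the two operators is the difference between a cross product and an element-by-element pairing. The Python sketch below only mimics that pairing logic with lists of variable names; it does not call SAS, and the function names are illustrative.

```python
from itertools import product

def asterisk_pairs(left, right):
    """Every variable on the left paired with every variable on the right."""
    return list(product(left, right))

def colon_pairs(left, right):
    """First with first, second with second, and so forth."""
    return list(zip(left, right))

print(asterisk_pairs(["a", "b"], ["c", "d"]))
print(colon_pairs(["a", "b"], ["c", "d"]))
```

With two variables on each side the asterisk form yields four comparisons, while the colon form yields two.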

The first PAIRED statement is the same as PAIRED pre*post0. The second and the third PAIRED statements contrast differences between asterisk and colon operators. The hyphen (-) operator in the last statement indicates a1 through a10 and b1 through b10. Let us consider an example of the PAIRED statement.

PROC TTEST DATA=temp.drug;
   PAIRED (pre)*(post0-post1);
RUN;

The TTEST Procedure

Statistics


Difference N Lower CL Mean Mean Upper CL Mean Lower CL Std Dev Std Dev Upper CL Std Dev Std Err

pre - post0 36 -21.2 -10.17 0.8642 26.443 32.602 42.527 5.4337

pre - post1 36 -30.43 -20.39 -10.34 24.077 29.685 38.723 4.9475

T-Tests

Difference DF t Value Pr > |t|

pre - post0 35 -1.87 0.0697

pre - post1 35 -4.12 0.0002

The first t statistic for pre versus post0 is identical to that of the previous section. The second for pre versus post1 rejects the null hypothesis of no mean difference at the .01 level (p=.0002).



PROC UNIVARIATE MU0=0 VARDEF=DF NORMAL; VAR d1 d2; RUN;

PROC MEANS MEAN STD STDERR T PROBT CLM; VAR d1 d2; RUN;

PROC TTEST ALPHA=.05; VAR d1 d2; RUN;

3.3 T-test in SPSS

In SPSS, the PAIRS subcommand indicates a paired sample t-test.

T-TEST PAIRS = pre post0

/CRITERIA = CI(.95)

/MISSING = ANALYSIS .





The SAS TTEST procedure and SPSS T-TEST command conduct F tests for equal variance. SAS reports the folded form F statistic, whereas SPSS computes Levene's weighted F statistic. In STATA, the .oneway command produces Bartlett's statistic for the equal variance test. The following is an example of Bartlett's test that does not reject the null hypothesis of equal variance.

. oneway lung smoke

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      313.031127      1   313.031127     28.85     0.0000
 Within groups      455.680427     42    10.849534
------------------------------------------------------------------------
    Total           768.711555     43   17.8770129

Bartlett's test for equal variances:  chi2(1) =   0.1216  Prob>chi2 = 0.727

STATA, SAS, and SPSS all compute Satterthwaite's approximation of the degrees of freedom. In addition, the SAS TTEST procedure reports the Cochran-Cox approximation and the STATA .ttest command provides Welch's degrees of freedom.

4.2 T-test in STATA

With the .ttest command, you have to specify a grouping variable (smoke in this example) in the parentheses of the by option.

. ttest lung, by(smoke) level(95)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      22    16.98591    .6747158    3.164698    15.58276    18.38906
       1 |      22    22.32045    .7287523    3.418151    20.80493    23.83598
---------+--------------------------------------------------------------------
combined |      44    19.65318    .6374133    4.228122    18.36772    20.93865
---------+--------------------------------------------------------------------
    diff |           -5.334545    .9931371               -7.338777   -3.330314
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -5.3714
Ho: diff = 0                                     degrees of freedom =       42

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

Let us first check the equal variance. The F statistic is

$F = \frac{s_L^2}{s_S^2} = \frac{3.4182^2}{3.1647^2} = 1.17 \sim F(21, 21)$.

The degrees of freedom of the numerator and denominator are 21 (=22-1). The p-value of .7273, virtually the same as that of Bartlett's test above, does not reject the null hypothesis of equal variance. Thus, the t-test here is valid (t=-5.3714 and p<.0001):

$t = \frac{(16.9859 - 22.3205) - 0}{s_{pool}\sqrt{\frac{1}{22} + \frac{1}{22}}} = -5.3714 \sim t(22+22-2)$, where

$s_{pool}^2 = \frac{(22-1)3.1647^2 + (22-1)3.4182^2}{22+22-2} = 10.8497$
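The pooled-variance calculation can be retraced from the group summaries in the STATA output. The following Python sketch is an illustration only; the function name is an assumption, and the inputs are the unrounded group statistics, so the pooled variance differs from the rounded hand calculation in the fourth decimal place.

```python
import math

def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Equal-variance two-sample t statistic from group summaries."""
    df = n1 + n2 - 2
    # Weighted average of the two sample variances
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df, pooled_var

# Non-smoking group (0) versus smoking group (1), from the output above
t, df, pooled_var = pooled_t(22, 16.98591, 3.164698, 22, 22.32045, 3.418151)
print(round(t, 4), df, round(pooled_var, 4))
```

The pooled variance equals the within-groups mean square reported by the .oneway command, which is one way to see why the t-test and one-way ANOVA agree.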

If only aggregate data of the two variables are available, use the .ttesti command and list the number of observations, mean, and standard deviation of the two variables.

. ttesti 22 16.98591 3.164698 22 22.32045 3.418151, level(95)

Suppose a data set is differently arranged (second type in Figure 1) so that one variable smk_lung has data for smokers and the other non_lung for non-smokers. You have to use the unpaired option to indicate that two variables are not paired. A grouping variable here is not necessary. Compare the following output with what is printed above.

. ttest smk_lung=non_lung, unpaired

Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
smk_lung |      22    22.32045    .7287523    3.418151    20.80493    23.83598
non_lung |      22    16.98591    .6747158    3.164698    15.58276    18.38906
---------+--------------------------------------------------------------------
combined |      44    19.65318    .6374133    4.228122    18.36772    20.93865
---------+--------------------------------------------------------------------
    diff |            5.334545    .9931371                3.330313    7.338777
------------------------------------------------------------------------------
    diff = mean(smk_lung) - mean(non_lung)                        t =   5.3714
Ho: diff = 0                                     degrees of freedom =       42

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

This unpaired option is very useful since it enables you to conduct a t-test without additional data manipulation. You may run the .ttest command with the unpaired option to compare two variables, say leukemia and kidney, as independent samples in STATA. In SAS and SPSS, however, you have to stack up two variables and generate a grouping variable before t-tests.

. ttest leukemia=kidney, unpaired

Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
leukemia |      44    6.829773    .0962211    .6382589    6.635724    7.023821
  kidney |      44    2.794545    .0782542    .5190799    2.636731     2.95236
---------+--------------------------------------------------------------------
combined |      88    4.812159    .2249261    2.109994    4.365094    5.259224
---------+--------------------------------------------------------------------
    diff |            4.035227    .1240251                3.788673    4.281781
------------------------------------------------------------------------------
    diff = mean(leukemia) - mean(kidney)                          t =  32.5356
Ho: diff = 0                                     degrees of freedom =       86


The F 1.5119 = (.6382589^2)/(.5190799^2) and its p-value (=.1797) do not reject the null hypothesis of equal variance. The large t statistic 32.5356 rejects the null hypothesis that death rates from leukemia and kidney cancers have the same mean.
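The folded F statistic is simply the larger sample variance divided by the smaller one. A minimal Python sketch under that definition follows; the function name is an assumption, the standard deviations are those reported in the STATA outputs, and the p-value is not computed because it requires an F distribution routine outside the standard library.

```python
def folded_f(sd_a, n_a, sd_b, n_b):
    """Larger variance over smaller variance, with (numerator df, denominator df)."""
    var_a, var_b = sd_a ** 2, sd_b ** 2
    if var_a >= var_b:
        return var_a / var_b, n_a - 1, n_b - 1
    return var_b / var_a, n_b - 1, n_a - 1

# Leukemia versus kidney cancer death rates
f_cancer, num_df, den_df = folded_f(0.6382589, 44, 0.5190799, 44)
print(round(f_cancer, 4), num_df, den_df)

# Non-smoking versus smoking groups (lung cancer death rates)
f_smoke, num_df_s, den_df_s = folded_f(3.164698, 22, 3.418151, 22)
print(round(f_smoke, 4), num_df_s, den_df_s)
```

Placing the larger variance in the numerator keeps the statistic at or above one regardless of the order in which the groups are listed.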

4.3 T-test in SAS

The TTEST procedure by default examines the hypothesis of equal variances, and provides t statistics for either case. The procedure by default reports Satterthwaite's approximation for the degrees of freedom. Keep in mind that a variable to be tested is grouped by the variable that is specified in the CLASS statement.

PROC TTEST H0=0 ALPHA=.05 DATA=masil.smoking;
   CLASS smoke;
   VAR lung;
RUN;

The TTEST Procedure

Statistics


                            Lower CL            Upper CL   Lower CL             Upper CL
Variable  smoke        N        Mean     Mean       Mean    Std Dev   Std Dev    Std Dev
lung      0           22      15.583   16.986     18.389     2.4348    3.1647     4.5226
lung      1           22      20.805    22.32     23.836     2.6298    3.4182     4.8848
lung      Diff (1-2)          -7.339   -5.335      -3.33     2.7159    3.2939     4.1865

Statistics

Variable smoke Std Err Minimum Maximum

lung 0 0.6747 12.01 25.45

lung 1 0.7288 12.11 27.27

lung Diff (1-2) 0.9931

T-Tests

Variable Method Variances DF t Value Pr > |t|

lung Pooled Equal 42 -5.37





lung Folded F 21 21 1.17 0.7273

The F test for equal variance does not reject the null hypothesis of equal variances. Thus, the t-test labeled as Pooled should be referred to in order to get the t -5.37 and its p-value .0001. If the equal variance assumption is violated, the statistics of Satterthwaite and Cochran should be read.

If you have a summary data set with the values of variables (lung) and their frequency (count),

specify the count variable in the FREQ statement.

PROC TTEST DATA=masil.smoking;
   CLASS smoke;
   VAR lung;
   FREQ count;
RUN;

Now, let us compare the death rates from leukemia and kidney in the second data arrangement

type of Figure 1. As mentioned before, you need to rearrange the data set to stack up two

variables into one and generate a grouping variable (first type in Figure 1).

DATA masil.smoking2;
   SET masil.smoking;
   death = leukemia; leu_kid = 'Leukemia'; OUTPUT;
   death = kidney; leu_kid = 'Kidney'; OUTPUT;
   KEEP leu_kid death;
RUN;

PROC TTEST COCHRAN DATA=masil.smoking2; CLASS leu_kid; VAR death; RUN;

The TTEST Procedure

Statistics


                            Lower CL            Upper CL   Lower CL             Upper CL
Variable  leu_kid      N        Mean     Mean       Mean    Std Dev   Std Dev    Std Dev   Std Err
death     Kidney      44      2.6367   2.7945     2.9524     0.4289    0.5191     0.6577    0.0783
death     Leukemia    44      6.6357   6.8298     7.0238     0.5273    0.6383     0.8087    0.0962
death     Diff (1-2)          -4.282   -4.035     -3.789     0.5063    0.5817     0.6838     0.124

T-Tests

Variable Method Variances DF t Value Pr > |t|

death Pooled Equal 86 -32.54





death Folded F 43 43 1.51 0.1794

Compare this SAS output with that of STATA in the previous section.

4.4 T-test in SPSS

In the T-TEST command, you need to use the /GROUPS subcommand in order to specify a grouping variable. SPSS reports Levene's F .0000 that does not reject the null hypothesis of equal variance (p



5. Independent Samples with Unequal Variances

If the assumption of equal variances is violated, we have to compute the adjusted t statistic using individual sample standard deviations rather than a pooled standard deviation. It is also necessary to use the Satterthwaite, Cochran-Cox (SAS), or Welch (STATA) approximations of the degrees of freedom. In this section, you compare mean death rates from kidney cancer between the west (south) and east (north).

5.1 T-test in STATA

As discussed earlier, let us check equality of variances using the .oneway command. The tabulate option produces a table of summary statistics for the groups.

. oneway kidney west, tabulate

            |          Summary of kidney
       west |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          0 |       3.006    .3001298          20
          1 |   2.6183333   .59837219          24
------------+------------------------------------
      Total |   2.7945455   .51907993          44

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      1.63947758      1   1.63947758      6.92     0.0118
 Within groups      9.94661333     42   .236824127
------------------------------------------------------------------------
    Total           11.5860909     43   .269443975

Bartlett's test for equal variances:  chi2(1) =   8.6506  Prob>chi2 = 0.003

Bartlett's chi-squared statistic rejects the null hypothesis of equal variance at the .01 level. It is appropriate to use the unequal option in the .ttest command, which calculates Satterthwaite's approximation for the degrees of freedom.

Unlike the SAS TTEST procedure, the .ttest command cannot specify the mean difference D0 other than zero. Thus, the null hypothesis is that the mean difference is zero.

. ttest kidney, by(west) unequal level(95)

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      20       3.006    .0671111    .3001298    2.865535    3.146465
       1 |      24    2.618333    .1221422    .5983722    2.365663    2.871004
---------+--------------------------------------------------------------------
combined |      44    2.794545    .0782542    .5190799    2.636731     2.95236
---------+--------------------------------------------------------------------
    diff |            .3876667     .139365                .1047722    .6705611
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   2.7817
Ho: diff = 0                     Satterthwaite's degrees of freedom =  35.1098


See Satterthwaite's approximation of 35.110 in the middle of the output. If you want to get Welch's approximation, use the welch as well as unequal options; without the unequal option, welch is ignored.

. ttest kidney, by(west) unequal welch

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      20       3.006    .0671111    .3001298    2.865535    3.146465
       1 |      24    2.618333    .1221422    .5983722    2.365663    2.871004
---------+--------------------------------------------------------------------
combined |      44    2.794545    .0782542    .5190799    2.636731     2.95236
---------+--------------------------------------------------------------------
    diff |            .3876667     .139365                .1050824    .6702509
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   2.7817
Ho: diff = 0                             Welch's degrees of freedom =  36.2258


Satterthwaite's approximation is slightly smaller than Welch's 36.2258. Again, keep in mind that these approximations are not integers, but real numbers. The t statistic 2.7817 and its p-value .0086 reject the null hypothesis of equal population means. The north and east have larger death rates from kidney cancer per 100 thousand people than the south and west.
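Both approximations can be reproduced from the group summaries. The sketch below is a Python illustration, not STATA code; the function name is an assumption, and the Welch expression is the form with n+1 divisors and a subtraction of 2, which is assumed here because it reproduces the 36.2258 reported above.

```python
import math

def unequal_variance_t(n1, mean1, sd1, n2, mean2, sd2):
    """t statistic with Satterthwaite's and Welch's approximate degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2      # squared standard errors
    t = (mean1 - mean2) / math.sqrt(v1 + v2)   # no pooled variance is used
    satterthwaite = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 + 1) + v2 ** 2 / (n2 + 1)) - 2
    return t, satterthwaite, welch

# Kidney cancer death rates for west = 0 and west = 1, from the output above
t, df_satterthwaite, df_welch = unequal_variance_t(
    20, 3.006, 0.3001298, 24, 2.618333, 0.5983722
)
print(round(t, 4), round(df_satterthwaite, 4), round(df_welch, 4))
```

The t statistic is identical under both approximations; only the degrees of freedom, and therefore the p-value and confidence interval, change.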

For aggregate data, use the .ttesti command with the necessary options.

. ttesti 20 3.006 .3001298 24 2.618333 .5983722, unequal welch

As mentioned earlier, the unpaired option of the .ttest command directly compares two variables without data manipulation. The option treats the two variables as independent of each other. The following is an example of the unpaired and unequal options.

. ttest bladder=kidney, unpaired unequal welch

Two-sample t test with unequal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 bladder |      44    4.121136    .1454679    .9649249    3.827772      4.4145
  kidney |      44    2.794545    .0782542    .5190799    2.636731     2.95236
---------+--------------------------------------------------------------------
combined |      88    3.457841    .1086268    1.019009    3.241933    3.673748
---------+--------------------------------------------------------------------
    diff |            1.326591    .1651806                .9968919     1.65629
------------------------------------------------------------------------------
    diff = mean(bladder) - mean(kidney)                           t =   8.0312
Ho: diff = 0                             Welch's degrees of freedom =  67.0324




The F 3.4556 = (.9649249^2)/(.5190799^2) rejects the null hypothesis of equal variance

(p





F 3.9749 = (.5983722^2)/(.3001298^2) and p





Variable Method Num DF Den DF F Value Pr > F

death Folded F 43 43 3.46





6. One-way ANOVA, GLM, and Regression

The t-test is a special case of one-way ANOVA. Thus, one-way ANOVA produces equivalent results to those of the t-test. ANOVA examines mean differences using the F statistic, whereas the t-test reports the t statistic. The one-way ANOVA (t-test), GLM, and linear regression present essentially the same things in different ways.
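The link between the two statistics can be checked numerically: with two groups, the ANOVA F statistic equals the square of the pooled t statistic. The Python sketch below rebuilds the F statistic from the group summaries reported earlier for lung by smoke; the function name is an illustration.

```python
def one_way_anova(groups):
    """groups: list of (n, mean, sd) tuples. Returns (F, df_between, df_within)."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    # Between-groups and within-groups sums of squares
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

f, df_between, df_within = one_way_anova(
    [(22, 16.98591, 3.164698), (22, 22.32045, 3.418151)]
)
t_squared = (-5.3714) ** 2   # pooled t statistic from Section 4
print(round(f, 2), df_between, df_within, round(t_squared, 2))
```

The same function accepts more than two groups, which is the case the t-test cannot handle.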

6.1 One-way ANOVA

Consider the following ANOVA procedure. The CLASS statement is used to specify

categorical variables. The MODEL statement lists the variable to be compared and a grouping

variable, separating them with an equal sign.

PROC ANOVA DATA=masil.smoking;
   CLASS smoke;
   MODEL lung=smoke;
RUN;

The ANOVA Procedure

Dependent Variable: lung

Sum of

Source DF Squares Mean Square F Value Pr > F

Model            1    313.0311273    313.0311273    28.85

smoke            1    313.0311273    313.0311273    28.85

-----------+----------------------------------------------------
     Model |  313.031127     1  313.031127      28.85     0.0000
           |
     smoke |  313.031127     1  313.031127      28.85     0.0000
           |
  Residual |  455.680427    42   10.849534
-----------+----------------------------------------------------
     Total |  768.711555    43  17.8770129





In SPSS, the ONEWAY command is used.

ONEWAY lung BY smoke

/MISSING ANALYSIS .

6.2 Generalized Linear Model (GLM)

The SAS GLM and MIXED procedures and the SPSS UNIANOVA command also report the F statistic for one-way ANOVA. Note that STATA's .glm command does not perform one-way ANOVA.

PROC GLM DATA=masil.smoking;
   CLASS smoke;
   MODEL lung=smoke /SS3;
RUN;

The GLM Procedure


Sum of

Source DF Squares Mean Square F Value Pr > F

Model 1 313.0311273 313.0311273 28.85 F

smoke 1 313.0311273 313.0311273 28.85





The SAS REG procedure, STATA .regress command, and SPSS REGRESSION command estimate linear regression models.

PROC REG DATA=masil.smoking;
   MODEL lung=smoke;
RUN;

The REG Procedure

Model: MODEL1


Number of Observations Read 44

Number of Observations Used 44

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 1 313.03113 313.03113 28.85 |t|

Intercept 1 16.98591 0.70225 24.19





       Model |  313.031127     1  313.031127           Prob > F      =  0.0000
    Residual |  455.680427    42   10.849534           R-squared     =  0.4072
-------------+------------------------------           Adj R-squared =  0.3931
       Total |  768.711555    43  17.8770129           Root MSE      =  3.2939

------------------------------------------------------------------------------
        lung |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   5.334545   .9931371     5.37   0.000     3.330314    7.338777
       _cons |   16.98591    .702254    24.19   0.000      15.5687    18.40311
------------------------------------------------------------------------------

The SPSS REGRESSION command looks complicated compared to the SAS REG procedure and STATA .regress command.

REGRESSION

/MISSING LISTWISE

/STATISTICS COEFF OUTS R ANOVA

/CRITERIA=PIN(.05) POUT(.10)

/NOORIGIN

/DEPENDENT lung/METHOD=ENTER smoke.

Note that ANOVA, GLM, and regression report the same F(1, 42) of 28.85, which is equivalent to t(42) of -5.3714. As long as the numerator has one degree of freedom, F always equals t^2 (28.85 = (-5.3714)^2).
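The identity can be checked by hand from the output above. The following Python sketch is only an illustration (Python is not one of the packages covered here); the variable names are made up, and the numbers are copied from the ANOVA and regression listings in this section. The regression t is positive because of the coding of smoke; the sign disappears once t is squared.

```python
# Values copied from the ANOVA and regression output in this section.
ms_model = 313.031127      # mean square for smoke (1 degree of freedom)
ms_residual = 10.849534    # residual mean square (42 degrees of freedom)
coef_smoke = 5.334545      # regression coefficient of smoke
se_smoke = 0.9931371       # standard error of that coefficient

f_stat = ms_model / ms_residual   # F = MS(model) / MS(residual)
t_stat = coef_smoke / se_smoke    # t = coefficient / standard error

print(f"F   = {f_stat:.2f}")
print(f"t   = {t_stat:.4f}")
print(f"t^2 = {t_stat ** 2:.2f}")
```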





7. Conclusion

The t-test is a basic statistical method for examining the mean difference between two groups. One-way ANOVA can compare means of more than two groups. The number of observations in individual groups does not matter in the t-test or one-way ANOVA; both balanced and unbalanced data are fine. One-way ANOVA, GLM, and linear regression models all use the variance-covariance structure in their analysis, but present the results in different ways.

Researchers must check four issues when performing t-tests. First, a variable to be tested should be interval or ratio so that its mean is substantively meaningful. Do not, for example, run a t-test to compare the mean of skin colors (white=0, yellow=1, black=2) between two countries. If you have a latent variable measured by several Likert-scaled manifest variables, first run a factor analysis to get that latent variable.

Second, examine the normality assumption before conducting a t-test. It is awkward to compare means of variables that are not normally distributed. Figure 2 illustrates a normal probability distribution on top and a Poisson distribution skewed to the right on the bottom. Although the two distributions have the same mean and variance of 1, a comparison of their means is not likely to be substantively interpretable. This is the rationale for conducting normality tests such as the Shapiro-Wilk W, Shapiro-Francia W, and Kolmogorov-Smirnov D statistics. If the normality assumption is violated, try to use nonparametric methods.

Figure 2. Comparing Normal and Poisson Probability Distributions ($\sigma^2 = 1$ and $\mu = 1$)





Third, check the equal variance assumption. You should be careful when comparing means of normally distributed variables with different variances. You may conduct the folded form F test. If the equal variance assumption is violated, compute the adjusted t and approximations of the degrees of freedom.
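As a rough illustration of these steps, the Python sketch below computes the folded form F, the unequal-variance t, and two approximations of the degrees of freedom from summary statistics alone. The function name is hypothetical, and the formulas are the usual textbook forms of Satterthwaite's and Welch's approximations, stated here as assumptions rather than taken from any package's source. The inputs are the bladder and kidney summary statistics reported in this document.

```python
import math

def unequal_variance_ttest(mean1, sd1, n1, mean2, sd2, n2):
    """Folded F, unequal-variance t, and two df approximations (hypothetical helper)."""
    var1, var2 = sd1 ** 2, sd2 ** 2
    folded_f = max(var1, var2) / min(var1, var2)   # larger variance in the numerator
    a, b = var1 / n1, var2 / n2                    # squared standard errors of the means
    t = (mean1 - mean2) / math.sqrt(a + b)
    # Satterthwaite's approximation of the degrees of freedom
    satterthwaite_df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    # Welch's approximation of the degrees of freedom
    welch_df = (a + b) ** 2 / (a ** 2 / (n1 + 1) + b ** 2 / (n2 + 1)) - 2
    return folded_f, t, satterthwaite_df, welch_df

# Bladder versus kidney cancer death rates (44 observations each).
folded_f, t, satt_df, welch_df = unequal_variance_ttest(
    4.121136, 0.9649249, 44, 2.794545, 0.5190799, 44)
print(f"F = {folded_f:.4f}, t = {t:.4f}")
print(f"Satterthwaite df = {satt_df:.4f}, Welch df = {welch_df:.4f}")
```

The F, t, and Welch's degrees of freedom can be compared with the 3.4556, 8.0312, and 67.0324 shown in the output earlier in this document.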

Finally, consider the types of t-tests, data arrangement, and functionalities available in each statistical software package (e.g., STATA, SAS, and SPSS) to determine the best strategy for data analysis (Table 3). The first data arrangement in Figure 1 is commonly used for independent sample t-tests, whereas the second arrangement is appropriate for a paired sample test. Keep in mind that the type II data sets in Figure 1 need to be reshaped into type I in SAS and SPSS.

Table 3. Comparison of T-test Functionalities of STATA, SAS, and SPSS

                              STATA 9.0                SAS 9.1                  SPSS 13.0
Test for equal variance       Bartlett's chi-squared   Folded form F            Levene's weighted F
                              (.ttest command)         (TTEST procedure)        (T-TEST command)
Approximation of the          Satterthwaite's DF       Satterthwaite's DF       Satterthwaite's DF
degrees of freedom (DF)       Welch's DF               Cochran-Cox DF
Second Data Arrangement       var1=var2                Reshaping the data set   Reshaping the data set
Aggregate Data                .ttesti command          FREQ option              N/A

SAS has several procedures (e.g., TTEST, MEANS, and UNIVARIATE) and useful options for t-tests. The STATA .ttest and .ttesti commands provide very flexible ways of handling different data arrangements and aggregate data. Table 4 summarizes usages of options in these two commands.

Table 4. Summary of the Usages of the .ttest and .ttesti Command Options

                                  Usage       by(group var)   unequal   welch   unpaired*
Univariate sample                 var=c
Paired (dependent) sample         var1=var2
Equal variance (1 variable)       var         O
Equal variance (2 variables)**    var1=var2                                     O
Unequal variance (1 variable)     var         O               O         O
Unequal variance (2 variables)    var1=var2                   O         O       O

* The .ttesti command does not allow the unpaired option.
** The var1=var2 assumes the second type of data arrangement in Figure 1.





Appendix: Data Set

Literature: Fraumeni, J. F. 1968. "Cigarette Smoking and Cancers of the Urinary Tract: Geographic Variations in the United States," Journal of the National Cancer Institute, 41(5): 1205-1211.

Data Source: http://lib.stat.cmu.edu

The data are per capita numbers of cigarettes smoked (sold) by 43 states and the District of Columbia in 1960 together with death rates per 100 thousand people from various forms of cancer. The variables used in this document are:

cigar = number of cigarettes smoked (hds per capita)
bladder = deaths per 100k people from bladder cancer
lung = deaths per 100k people from lung cancer
kidney = deaths per 100k people from kidney cancer
leukemia = deaths per 100k people from leukemia
smoke = 1 for those whose cigarette consumption is larger than the median and 0 otherwise
west = 1 for states in the South or West and 0 for those in the North, East or Midwest

The following are summary statistics and normality tests of these variables.

. sum cigar-leukemia

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       cigar |        44    24.91409    5.573286         14       42.4
     bladder |        44    4.121136    .9649249       2.86       6.54
        lung |        44    19.65318    4.228122      12.01      27.27
      kidney |        44    2.794545    .5190799       1.59       4.32
    leukemia |        44    6.829773    .6382589        4.9       8.28

. sfrancia cigar-leukemia

                  Shapiro-Francia W' test for normal data
    Variable |    Obs       W'          V'        z       Prob>z
-------------+--------------------------------------------------
       cigar |     44    0.93061      3.258     2.203    0.01381
     bladder |     44    0.94512      2.577     1.776    0.03789
        lung |     44    0.97809      1.029     0.055    0.47823
      kidney |     44    0.97732      1.065     0.120    0.45217
    leukemia |     44    0.97269      1.282     0.474    0.31759

. tab west smoke

           |         smoke
      west |         0          1 |     Total
-----------+----------------------+----------
         0 |         7         13 |        20
         1 |        15          9 |        24
-----------+----------------------+----------
     Total |        22         22 |        44





References

Fraumeni, J. F. 1968. "Cigarette Smoking and Cancers of the Urinary Tract: Geographic Variations in the United States," Journal of the National Cancer Institute, 41(5): 1205-1211.

Ott, R. Lyman. 1993. An Introduction to Statistical Methods and Data Analysis. Belmont, CA: Duxbury Press.

SAS Institute. 2005. SAS/STAT User's Guide, Version 9.1. Cary, NC: SAS Institute.

SPSS. 2001. SPSS 11.0 Syntax Reference Guide. Chicago, IL: SPSS Inc.

STATA Press. 2005. STATA Reference Manual Release 9. College Station, TX: STATA Press.

Walker, Glenn A. 2002. Common Statistical Methods for Clinical Research with SAS Examples. Cary, NC: SAS Institute.

Acknowledgements

I am grateful to Jeremy Albright, Takuya Noguchi, and Kevin Wilhite at the UITS Center for Statistical and Mathematical Computing, Indiana University, who provided valuable comments and suggestions.

Revision History

2003. First draft 2004. Second draft 2005. Third draft (Added data arrangements and conclusion).



2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 1


Regression Models for Event Count Data

Using SAS, STATA, and LIMDEP

Hun Myoung Park

This document summarizes regression models for event count data and illustrates how to

estimate individual models using SAS, STATA, and LIMDEP. Example models were tested in SAS

9.1, STATA 9.0, and LIMDEP 8.0.

1. Introduction
2. The Poisson Regression Model (PRM)
3. The Negative Binomial Regression Model (NBRM)
4. The Zero-Inflated Poisson Regression Model (ZIP)
5. The Zero-Inflated Negative Binomial Regression Model (ZINB)
6. Conclusion
7. Appendix

1. Introduction

An event count is the realization of a nonnegative integer-valued random variable (Cameron and Trivedi 1998). Examples are the number of car accidents per month, thunder storms per year, and wild fires per year. The ordinary least squares (OLS) method for event count data results in biased, inefficient, and inconsistent estimates (Long 1997). Thus, researchers have developed various nonlinear models that are based on the Poisson distribution and negative binomial distribution.

1.1 Count Data Regression Models

The left-hand side (LHS) of the equation has event count data. Independent variables are, as in the OLS, located at the right-hand side (RHS). These RHS variables may be interval, ratio, or binary (dummy). Table 1 below summarizes the categorical dependent variable regression models (CDVMs) according to the level of measurement of the dependent variable.

Table 1. Ordinary Least Squares and CDVMs

Model                             Dependent (LHS)           Method
OLS     Ordinary least squares    Interval or ratio         Moment based method
CDVMs   Binary response           Binary (0 or 1)           Maximum likelihood method
        Ordinal response          Ordinal (1st, 2nd, 3rd)   Maximum likelihood method
        Nominal response          Nominal (A, B, C)         Maximum likelihood method
        Event count data          Count (0, 1, 2, 3)        Maximum likelihood method

Independent (RHS), all models: a linear function of interval/ratio or binary variables, $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots$

The Poisson regression model (PRM) and negative binomial regression model (NBRM) are basic models for count data analysis. Either the zero-inflated Poisson (ZIP) or the zero-inflated





$\mu_i = \exp(x_i\beta + \varepsilon_i)$ (Long 1997). Thus, the conditional variance of y becomes larger than its conditional mean, $E(y_i|x_i) = \mu_i$, which remains unchanged. Figure 2 illustrates how the probabilities for small and larger counts increase in the negative binomial distribution as the conditional variance of y increases, given $\mu = 3$.

Figure 2. Negative Binomial Probability Distribution with Alpha of .01, .5, 1, and 5

The PRM and NBRM, however, have the same mean structure. If $\alpha = 0$, the NBRM reduces to the PRM (Cameron and Trivedi 1998; Long 1997).

1.3 Overdispersion

When $Var(y_i|x_i) > E(y_i|x_i)$, we are said to have overdispersion. Estimates of a PRM for overdispersed data are unbiased, but inefficient with standard errors biased downward (Cameron and Trivedi 1998; Long 1997). The likelihood ratio test is developed to examine the null hypothesis of no overdispersion, $H_0: \alpha = 0$. The likelihood ratio follows the chi-squared distribution with one degree of freedom, $LR = 2(\ln L_{NB} - \ln L_{Poisson}) \sim \chi^2(1)$. If the null hypothesis is rejected, the NBRM is preferred to the PRM.
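A small Python sketch of this likelihood ratio test follows. It is only an illustration: the function name is made up, and the p-value uses the closed form of the chi-squared(1) survival function, erfc(sqrt(LR/2)), so no statistics library is assumed. The two log likelihoods are the full NBRM and Poisson values that the STATA .nbreg output reports later in this document.

```python
import math

def lr_test_one_df(loglik_unrestricted, loglik_restricted):
    """Likelihood ratio and its chi-squared(1) p-value (hypothetical helper)."""
    lr = 2.0 * (loglik_unrestricted - loglik_restricted)
    # For one degree of freedom, P(chi2 > lr) = erfc(sqrt(lr / 2)).
    p_value = math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
    return lr, p_value

# NBRM (unrestricted) versus PRM (restricted, alpha = 0) for the accident data.
lr, p = lr_test_one_df(-1116.7182, -1821.5101)
print(f"LR = {lr:.4f}, p = {p:.3g}")
```

A p-value this small rejects the null hypothesis of no overdispersion, which points to the NBRM rather than the PRM.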

Zero-inflated models handle overdispersion by changing the mean structure to explicitly model the production of zero counts (Long 1997). These models assume two latent groups. One is the always-zero group and the other is the not-always-zero or sometime-zero group. Thus, zero counts come from the former group and some of the latter group with a certain probability.

The likelihood ratio, $LR = 2(\ln L_{ZINB} - \ln L_{ZIP}) \sim \chi^2(1)$, tests $H_0: \alpha = 0$ to compare the ZIP and ZINB. The PRM and ZIP as well as the NBRM and ZINB cannot, however, be tested by this likelihood ratio, since the models in each pair are not nested. The Vuong statistic compares these non-nested models. If V is greater than 1.96, the ZIP or ZINB is favored. If V is less than -1.96, the PRM or NBRM is preferred (Long 1997).
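The decision rule for the Vuong statistic can be written down in a few lines. The Python function below is a hypothetical helper, not part of any package; treating values between -1.96 and 1.96 as favoring neither model is an assumption added here, since the text only states the two outer cases. The value 9.5740 is the Vuong statistic that LIMDEP reports for the ZIP model later in this document.

```python
def vuong_decision(v, critical=1.96):
    """Translate a Vuong statistic into the decision rule described above."""
    if v > critical:
        return "zero-inflated model favored (ZIP or ZINB)"
    if v < -critical:
        return "standard model favored (PRM or NBRM)"
    # Assumption: values inside the band do not discriminate between the models.
    return "neither model favored"

print(vuong_decision(9.5740))
```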





1.4 Estimation in SAS, STATA, and LIMDEP

The SAS GENMOD procedure estimates Poisson and negative binomial regression models.

STATA has individual commands (e.g., .poisson and .nbreg) for the corresponding count data models. LIMDEP has Poisson$ and Negbin$ commands to estimate various count data models including zero-inflated and zero-truncated models. Table 2 summarizes the procedures and commands for count data regression models.

Table 2. Comparison of the Procedures and Commands for Count Data Models

Model                                     SAS 9.1   STATA 9.0   LIMDEP 8.0
Poisson Regression (PRM)                  GENMOD    .poisson    Poisson$
Negative Binomial Regression (NBRM)       GENMOD    .nbreg      Negbin$
Zero-Inflated Poisson (ZIP)               -         .zip        Poisson; Zip; Rh2$
Zero-inflated Negative Binomial (ZINB)    -         .zinb       Negbin; Zip; Rh2$
Zero-truncated Poisson (ZTP)              -         .ztp        Poisson; Truncation$
Zero-truncated Negative Binomial (ZTNB)   -         .ztnb       Negbin; Truncation$

The example here examines how waste quotas (emps) and the strictness of policy implementation (strict) affect the frequency of waste spill accidents of plants (accident).

1.5 Long and Freese's SPost Module

STATA users may take advantage of user-written modules such as SPost written by J. Scott Long and Jeremy Freese. The module allows researchers to conduct follow-up analyses of various CDVMs including event count data models. See 2.3 for examples of major SPost commands.

In order to install SPost, execute the following commands consecutively. For more details, visit J. Scott Long's Web site at http://www.indiana.edu/~jslsoc/spost_install.htm.

. net from http://www.indiana.edu/~jslsoc/stata/

. net install spost9_ado, replace

. net get spost9_do, replace





the number of regressors between the unrestricted and restricted models. The chi-squared statistic is 124.8218 = 2 * [-667.2291 - (-729.6400)] (p ChiSq

Intercept 1 0.3168 0.0306 0.2568 0.3768 107.20





Poisson regression                                Number of obs   =        778
                                                  LR chi2(2)      =     124.82
                                                  Prob > chi2     =     0.0000
Log likelihood = -1821.5101                       Pseudo R2       =     0.0331

------------------------------------------------------------------------------
    accident |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        emps |   .0054186   .0007434     7.29   0.000     .0039615    .0068757
      strict |  -.7041664   .0667619   -10.55   0.000    -.8350174   -.5733154
       _cons |   .3900961   .0466787     8.36   0.000     .2986076    .4815846
------------------------------------------------------------------------------

Let us run a restricted model and then run the .display command in order to double check that the likelihood ratio for goodness-of-fit is 124.8218.

. poisson accident

Iteration 0:   log likelihood =  -1883.921
Iteration 1:   log likelihood =  -1883.921

Poisson regression                                Number of obs   =        778
                                                  LR chi2(0)      =       0.00
                                                  Prob > chi2     =          .

------------------------------------------------------------------------------
    accident |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .3168165   .0305995    10.35   0.000     .2568426    .3767904
------------------------------------------------------------------------------

. display 2 * (-1821.5101 - (-1883.921))

124.8218

2.3 Using the SPost Module in STATA

The SPost module provides useful commands for follow-up analyses of various categorical dependent variable models. The .fitstat command calculates various goodness-of-fit statistics such as log likelihood, McFadden's R2 (or Pseudo R2), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC).

. quietly poisson accident emps strict

. fitstat

Measures of Fit for poisson of accident

Log-Lik Intercept Only:    -1883.921     Log-Lik Full Model:        -1821.510
D(775):                     3643.020     LR(2):                       124.822
                                         Prob > LR:                     0.000
McFadden's R2:                 0.033     McFadden's Adj R2:             0.032
Maximum Likelihood R2:         0.148     Cragg & Uhler's R2:            0.149
AIC:                           4.690     AIC*n:                      3649.020
BIC:                       -1515.943     BIC':                       -111.508
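Most of the measures above follow directly from the two log likelihoods, the number of observations, and the number of regressors. The Python sketch below is an illustration under stated assumptions: the function name and dictionary keys are made up, and the formulas are the standard definitions of these scalar measures of fit, chosen because they agree with the figures printed above rather than taken from the SPost source.

```python
import math

def poisson_fit_measures(ll_full, ll_null, n_obs, n_regressors):
    """Scalar measures of fit from two log likelihoods (hypothetical helper)."""
    k = n_regressors + 1                 # number of parameters, including the intercept
    deviance = -2.0 * ll_full            # D with (n_obs - k) degrees of freedom
    lr = 2.0 * (ll_full - ll_null)       # likelihood ratio for goodness-of-fit
    ml_r2 = 1.0 - math.exp(-lr / n_obs)
    return {
        "deviance": deviance,
        "lr": lr,
        "mcfadden_r2": 1.0 - ll_full / ll_null,
        "mcfadden_adj_r2": 1.0 - (ll_full - k) / ll_null,
        "ml_r2": ml_r2,
        "cragg_uhler_r2": ml_r2 / (1.0 - math.exp(2.0 * ll_null / n_obs)),
        "aic": (deviance + 2 * k) / n_obs,
        "aic_n": deviance + 2 * k,
        "bic": deviance - (n_obs - k) * math.log(n_obs),
        "bic_prime": -lr + n_regressors * math.log(n_obs),
    }

# Log likelihoods of the full and intercept-only Poisson models reported above.
measures = poisson_fit_measures(-1821.5101, -1883.921, n_obs=778, n_regressors=2)
for name, value in measures.items():
    print(f"{name:16s} {value:12.3f}")
```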





The .listcoef command lists unstandardized coefficients (parameter estimates), factor and percent changes, and standardized coefficients to help interpret regression results.

. listcoef, help

poisson (N=778): Factor Change in Expected Count

Observed SD: 2.9482675

----------------------------------------------------------------------
    accident |      b         z     P>|z|    e^b    e^bStdX      SDofX
-------------+--------------------------------------------------------
        emps |   0.00542    7.289   0.000   1.0054   1.2297    38.1548
      strict |  -0.70417  -10.547   0.000   0.4945   0.7031     0.5003
----------------------------------------------------------------------
       b = raw coefficient
       z = z-score for test of b=0
   P>|z| = p-value for z-test
     e^b = exp(b) = factor change in expected count for unit increase in X
 e^bStdX = exp(b*SD of X) = change in expected count for SD increase in X
   SDofX = standard deviation of X
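The factor changes in the table are simple transformations of the coefficients, as the legend indicates. A minimal Python sketch, with a hypothetical function name and the coefficients and standard deviations copied from the output above:

```python
import math

def factor_changes(b, sd_x):
    """Factor change for a unit increase, for an SD increase, and the percent change."""
    unit = math.exp(b)                   # e^b
    per_sd = math.exp(b * sd_x)          # e^bStdX
    percent = 100.0 * (unit - 1.0)       # percent change for a unit increase
    return unit, per_sd, percent

for name, b, sd_x in [("emps", 0.0054186, 38.1548), ("strict", -0.7041664, 0.5003)]:
    unit, per_sd, percent = factor_changes(b, sd_x)
    print(f"{name:7s} e^b = {unit:.4f}  e^bStdX = {per_sd:.4f}  percent = {percent:.1f}")
```

Read this way, the strict policy multiplies the expected number of accidents by about .49, holding the waste quota constant.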

The .prtab command constructs a table of predicted values (events) for all combinations of categorical variables listed. The following example shows that the predicted number of accidents under the strict policy is .9172 at the mean waste quota (emps=42.0129).

. prtab strict

poisson: Predicted rates for accident

----------------------
   strict | Prediction
----------+-----------
        0 |     1.8547
        1 |     0.9172
----------------------

          emps     strict
x=   42.012853  .50771208

The .prvalue command lists predicted values for a given set of values for the independent variables. For example, the predicted probability of a zero count is .3996 at the mean waste quota under the strict policy (strict=1). Note that the predicted rate of .917 is equivalent to .9172 in the .prtab above.

. prvalue, x(strict=1) maxcnt(5)

poisson: Predictions for accident

Predicted rate: .917   95% CI [.827 , 1.02]

Predicted probabilities:

  Pr(y=0|x):  0.3996   Pr(y=1|x):  0.3665
  Pr(y=2|x):  0.1681   Pr(y=3|x):  0.0514
  Pr(y=4|x):  0.0118   Pr(y=5|x):  0.0022

          emps     strict
x=   42.012853          1
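These predictions can be reproduced from the Poisson coefficients: the predicted rate is exp(xb), and the predicted probabilities come from the Poisson probability function evaluated at that rate. The Python sketch below is an illustration with made-up function names; the coefficients are those of the STATA .poisson output above, and the confidence interval is not reproduced.

```python
import math

# Coefficients from the Poisson regression of accident on emps and strict.
COEF = {"_cons": 0.3900961, "emps": 0.0054186, "strict": -0.7041664}

def predicted_rate(emps, strict):
    """Expected count exp(xb) at the given values of the regressors."""
    xb = COEF["_cons"] + COEF["emps"] * emps + COEF["strict"] * strict
    return math.exp(xb)

def poisson_probabilities(rate, max_count):
    """Pr(y=k|x) for k = 0, ..., max_count under a Poisson distribution."""
    return [math.exp(-rate) * rate ** k / math.factorial(k)
            for k in range(max_count + 1)]

rate = predicted_rate(emps=42.012853, strict=1)
probs = poisson_probabilities(rate, max_count=5)
print(f"Predicted rate: {rate:.4f}")
for k, pr in enumerate(probs):
    print(f"Pr(y={k}|x) = {pr:.4f}")
```

Setting strict=0 in the same call gives the rate for the lenient policy shown in the .prtab table.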





The most useful command is .prchange, which calculates marginal effects (changes) and discrete changes. For instance, a standard deviation increase in waste quota from its mean will increase accidents by .3841 under the lenient policy (strict=0).

. prchange, x(strict=0)

poisson: Changes in Predicted Rate for accident

         min->max      0->1     -+1/2    -+sd/2  MargEfct
  emps     2.3070    0.0080    0.0101    0.3841    0.0101
strict    -0.9375   -0.9375   -1.3332   -0.6568   -1.3060

exp(xb):   1.8547

           emps    strict
    x=  42.0129         0
sd(x)=  38.1548   .500262
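Several entries of this table can be reproduced from the coefficients. In the Python sketch below (variable and function names are illustrative), the marginal effect is taken as b*exp(xb), the -+sd/2 column as the change in the predicted rate when emps moves from half a standard deviation below its mean to half a standard deviation above it, and the 0->1 column for strict as the difference between the two predicted rates. These readings of the column headings are assumptions that happen to match the printed values.

```python
import math

B_CONS, B_EMPS, B_STRICT = 0.3900961, 0.0054186, -0.7041664
EMPS_MEAN, EMPS_SD = 42.0129, 38.1548

def rate(emps, strict):
    """Predicted rate exp(xb) of the Poisson model."""
    return math.exp(B_CONS + B_EMPS * emps + B_STRICT * strict)

base = rate(EMPS_MEAN, 0)                          # exp(xb) under the lenient policy
marginal_emps = B_EMPS * base                      # MargEfct for emps
marginal_strict = B_STRICT * base                  # MargEfct for strict
sd_change_emps = (rate(EMPS_MEAN + EMPS_SD / 2, 0)
                  - rate(EMPS_MEAN - EMPS_SD / 2, 0))   # -+sd/2 for emps
discrete_strict = rate(EMPS_MEAN, 1) - rate(EMPS_MEAN, 0)  # 0->1 for strict

print(f"exp(xb)         = {base:.4f}")
print(f"MargEfct emps   = {marginal_emps:.4f}")
print(f"MargEfct strict = {marginal_strict:.4f}")
print(f"-+sd/2 emps     = {sd_change_emps:.4f}")
print(f"0->1 strict     = {discrete_strict:.4f}")
```

The discrete change is the more meaningful quantity for a binary variable such as strict, because a marginal (infinitesimal) change in a 0/1 variable has no substantive interpretation.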

SPost also includes the .prgen command, which computes a series of predictions by holding all variables but one constant and allowing that variable to vary (Long and Freese 2003). These SPost commands work with most categorical and count data models such as .logit, .probit, .poisson, .nbreg, .zip, and .zinb.

2.4 PRM in LIMDEP

The LIMDEP Poisson$ command estimates the PRM. LIMDEP reports log likelihoods of both the unrestricted and restricted models. Keep in mind that you must include the ONE for the intercept.

POISSON;

Lhs=ACCIDENT;

Rhs=ONE,EMPS,STRICT$

+---------------------------------------------+

| Poisson Regression |

| Maximum Likelihood Estimates |

| Model estimated: Aug 24, 2005 at 04:56:45PM.|

| Dependent variable ACCIDENT |

| Weighting variable None |

| Number of observations 778 |

| Iterations completed 8 |

| Log likelihood function -1821.510 |

| Restricted log likelihood -1883.921 |

| Chi squared 124.8218 |

| Degrees of freedom 2 |

| Prob[ChiSqd > value] = .0000000 |

| Chi- squared = 4944.94781 RsqP= -.0051 |

| G - squared = 2827.20794 RsqD= .0423 |

| Overdispersion tests: g=mu(i) : 4.720 |

| Overdispersion tests: g=mu(i)^2: 4.253 |

+---------------------------------------------+

+---------+--------------+----------------+--------+---------+----------+

|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|

+---------+--------------+----------------+--------+---------+----------+





Constant .3900961420 .46678663E-01 8.357 .0000

EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853

STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208

(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

SAS, STATA, and LIMDEP produce almost the same parameter estimates and standard errors (Table 3). The log likelihood in SAS is different from that of STATA and LIMDEP (-667.2291 versus -1821.5101). This difference seems to come from the generalized linear model that the GENMOD procedure uses. These log likelihoods are, however, equivalent in the sense that they result in the same likelihood ratio.
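The equivalence is easy to verify. In the Python sketch below (function and variable names are illustrative), the log likelihoods are those listed in Table 3. Both packages yield the same likelihood ratio because the SAS values differ from the STATA values by the same amount in the unrestricted and restricted models, which is consistent with the two log likelihoods differing only by a term that does not depend on the parameters.

```python
def likelihood_ratio(ll_unrestricted, ll_restricted):
    """Likelihood ratio statistic 2 * (lnL_unrestricted - lnL_restricted)."""
    return 2.0 * (ll_unrestricted - ll_restricted)

lr_sas = likelihood_ratio(-667.2291, -729.6400)
lr_stata = likelihood_ratio(-1821.5101, -1883.921)

# Gap between the SAS and STATA log likelihoods in each model.
offset_unrestricted = -667.2291 - (-1821.5101)
offset_restricted = -729.6400 - (-1883.921)

print(f"LR (SAS)   = {lr_sas:.4f}")
print(f"LR (STATA) = {lr_stata:.4f}")
print(f"Offsets: {offset_unrestricted:.4f} and {offset_restricted:.4f}")
```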

Table 3. Summary of the Poisson Regression Model in SAS, STATA, and LIMDEP

Model                                   SAS 9.1      STATA 9.0     LIMDEP 8.0
Intercept                               .3901        .3901         .3901
                                        (.0467)      (.0467)       (.0467)
EMPS                                    .0054        .0054         .0054
                                        (.0007)      (.0007)       (.0007)
STRICT                                  -.7042       -.7042        -.7042
                                        (.0668)      (.0668)       (.0668)
Log Likelihood (unrestricted)           -667.2291    -1821.5101    -1821.510
Log Likelihood (restricted)             -729.6400    -1883.921     -1883.921
Likelihood Ratio for Goodness-of-fit    124.8218     124.82        124.8218





3. The Negative Binomial Regression Model

The SAS GENMOD procedure, STATA .nbreg command, and LIMDEP Negbin$ command estimate the negative binomial regression model (NBRM).

3.1 NBRM in SAS

The GENMOD procedure estimates the NBRM using the /DIST=NEGBIN option. Note that the dispersion parameter is equivalent to the alpha in STATA and LIMDEP.

PROC GENMOD DATA = masil.accident;
   MODEL accident=emps strict /DIST=NEGBIN LINK=LOG;
RUN;

The GENMOD Procedure

Model Information

Data Set COUNT.WASTE

Distribution Negative Binomial

Link Function Log

Dependent Variable Accident

Observations Used 778

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 775 589.7752 0.7610

Scaled Deviance 775 589.7752 0.7610

Pearson Chi-Square 775 845.6033 1.0911

Scaled Pearson X2 775 845.6033 1.0911

Log Likelihood 37.5628

Algorithm converged.

Analysis Of Parameter Estimates

Standard Wald 95% Confidence Chi-

Parameter DF Estimate Error Limits Square Pr > ChiSq

Intercept 1 0.3851 0.1278 0.1345 0.6357 9.07 0.0026

Emps 1 0.0052 0.0023 0.0008 0.0096 5.29 0.0214

Strict 1 -0.6703 0.1671 -0.9978 -0.3427 16.09





The likelihood ratio for overdispersion is 1409.5838 = 2 * (37.5628 - (-667.2291)).

3.2 NBRM in STATA

STATA has the .nbreg command for the NBRM. The command reports three log likelihood statistics: for the PRM, restricted NBRM (constant-only model), and unrestricted NBRM (full model), which make it easy to conduct likelihood ratio tests.

. nbreg accident emps strict

Fitting comparison Poisson model:

Iteration 0:   log likelihood = -1821.5112
Iteration 1:   log likelihood = -1821.5101
Iteration 2:   log likelihood = -1821.5101

Fitting constant-only model:

Iteration 0:   log likelihood = -1256.6761
Iteration 1:   log likelihood = -1152.6155
Iteration 2:   log likelihood = -1125.6643
Iteration 3:   log likelihood = -1125.4183
Iteration 4:   log likelihood = -1125.4183

Fitting full model:

Iteration 0:   log likelihood = -1117.1731
Iteration 1:   log likelihood = -1116.7201
Iteration 2:   log likelihood = -1116.7182
Iteration 3:   log likelihood = -1116.7182

Negative binomial regression                      Number of obs   =        778
                                                  LR chi2(2)      =      17.40
                                                  Prob > chi2     =     0.0002

------------------------------------------------------------------------------
    accident |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        emps |   .0051981   .0022595     2.30   0.021     .0007694    .0096267
      strict |  -.6702548   .1671191    -4.01   0.000    -.9978021   -.3427074
       _cons |   .3851111   .1278468     3.01   0.003      .134536    .6356861
-------------+----------------------------------------------------------------
    /lnalpha |    1.37509   .0885176                      1.201599    1.548582
-------------+----------------------------------------------------------------
       alpha |   3.955434   .3501257                       3.32543    4.704793
------------------------------------------------------------------------------
Likelihood ratio test of alpha=0:  chibar2(01) = 1409.58 Prob>=chibar2 = 0.000

The restricted model or constant-only model gives us a log likelihood of -1125.4183. Thus, the likelihood ratio for goodness-of-fit is 17.4002 = 2 * [-1116.7182 - (-1125.4183)] (p





The likelihood ratio test for overdispersion results in a chi-squared of 1409.5838 (p1 of a binary variable strict, since its marginal change at the mean (.5077) is meaningless.

. prchange

nbreg: Changes in Predicted Rate for accident

         min->max      0->1     -+1/2    -+sd/2  MargEfct
  emps     1.5326    0.0055    0.0068    0.2585    0.0068
strict    -0.8931   -0.8931   -0.8885   -0.4383   -0.8721

exp(xb):   1.3011

           emps    strict
    x=  42.0129   .507712
sd(x)=  38.1548   .500262

3.3 NBRM in LIMDEP

LIMDEP has the Negbin$ command for the NBRM that reports the PRM as well. Note that the standard errors of parameter estimates are slightly different from those of SAS and STATA. The Marginal Effects$ and the Means$ subcommands compute marginal effects at the mean of independent variables. You may not omit the Means$ subcommand.

NEGBIN;

Lhs=ACCIDENT;

Rhs=ONE,EMPS,STRICT;

Marginal Effects;

Means$

+---------------------------------------------+


| Maximum Likelihood Estimates |

| Model estimated: Sep 08, 2005 at 09:35:36AM.|








| Degrees of freedom 2 |





| Prob[ChiSqd > value] = .0000000 |

| Chi- squared = 4944.94781 RsqP= -.0051 |

| G - squared = 2827.20794 RsqD= .0423 |



+---------------------------------------------+

+---------+--------------+----------------+--------+---------+----------+


+---------+--------------+----------------+--------+---------+----------+

Constant .3900961420 .46678663E-01 8.357 .0000

EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853

STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208


Normal exit from iterations. Exit status=0.

+---------------------------------------------+

| Negative Binomial Regression |


| Model estimated: Sep 08, 2005 at 09:35:36AM.|

| Dependent variable ACCIDENT |

| Weighting variable None |








+---------------------------------------------+

+---------+--------------+----------------+--------+---------+----------+


+---------+--------------+----------------+--------+---------+----------+

Constant .3851110699 .12855240 2.996 .0027

EMPS .5198057234E-02 .22602075E-02 2.300 .0215 42.012853

STRICT -.6702547660 .16729839 -4.006 .0001 .50771208

Dispersion parameter for count data model

Alpha 3.955434012 .35680876 11.086 .0000


+-------------------------------------------+

| Partial derivatives of expected val. with |

| respect to the vector of characteristics. |

| They are computed at the means of the Xs. |

| Observations used for means are All Obs. |

| Conditional Mean at Sample Point 1.3011 |

| Scale Factor for Marginal Effects 1.3011 |

+-------------------------------------------+

+---------+--------------+----------------+--------+---------+----------+


+---------+--------------+----------------+--------+---------+----------+

Constant .5010628939 .19396434 2.583 .0098

EMPS .6763123170E-02 .29746591E-02 2.274 .0230 42.012853

STRICT -.8720595665 .22469308 -3.881 .0001 .50771208






4.2 ZIP in LIMDEP

The LIMDEP Poisson$ command needs to have the Zip and Rh2 subcommands. The Rh2 is equivalent to the inflate() option in STATA. The Alg=Newton$ subcommand is needed to use the Newton-Raphson algorithm because the default Broyden algorithm failed to converge (see footnote 1).

POISSON;

Lhs=ACCIDENT;

Rhs=ONE,EMPS,STRICT;

ZIP;

Rh2=ONE,EMPS,STRICT;

Alg=Newton$

+---------------------------------------------+



| Model estimated: Sep 06, 2005 at 00:25:07PM.|

| Dependent variable ACCIDENT |

| Weighting variable None |








| Chi- squared = 4944.94781 RsqP= -.0051 |

| G - squared = 2827.20794 RsqD= .0423 |



+---------------------------------------------+

+---------+--------------+----------------+--------+---------+----------+


+---------+--------------+----------------+--------+---------+----------+

Constant .3900961420 .46678663E-01 8.357 .0000

EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853

STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208



+----------------------------------------------------------------------+

| Zero Altered Poisson Regression Model |

| Logistic distribution used for splitting model. |

| ZAP term in probability is F[tau x Z(i) ] |

| Comparison of estimated models |

| Pr[0|means] Number of zeros Log-likelihood |

| Poisson .27329 Act.= 498 Prd.= 212.6 -1821.51007 |

1 If you get a warning message of "Error: 806: Line search does not improve fn. Exit iterations. Status=3" or "Error: 805: Initial iterations cannot improve function. Status=3," you may change the optimization algorithm or increase the maximum number of iterations (e.g., Maxit=1000$).





| Z.I.Poisson .64642 Act.= 498 Prd.= 502.9 -1259.88568 |

| Note, the ZIP log-likelihood is not directly comparable. |

| ZIP model with nonzero Q does not encompass the others. |

| Vuong statistic for testing ZIP vs. unaltered model is 9.5740 |

| Distributed as standard normal. A value greater than |

| +1.96 favors the zero altered Z.I.Poisson model. |

| A value less than -1.96 rejects the ZIP model. |

+----------------------------------------------------------------------+

+---------+--------------+----------------+--------+---------+----------+


+---------+--------------+----------------+--------+---------+----------+

Poisson/NB/Gamma regression model

Constant 1.361977491 .23944641E-01 56.880 .0000

EMPS -.2770010575E-03 .37770090E-03 -.733 .4633 42.012853

STRICT -.9239125073E-01 .33326502E-01 -2.772 .0056 .50771208

Zero inflation model

Constant .4886559537 .12210013 4.002 .0001

EMPS -.1098971050E-01 .22152492E-02 -4.961 .0000 42.012853

STRICT 1.057031399 .17715551 5.967 .0000 .50771208
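The Pr[0|means] figures in the comparison box above can be retraced by hand. The sketch below is a minimal Python illustration, not LIMDEP code. It assumes that Pr[0|means] is the predicted probability of a zero count evaluated at the sample means of EMPS and STRICT (the last column of the coefficient tables) and that the splitting model is a logit, as the output states. All function and variable names are hypothetical, and the sample size of 778 is taken from the STATA output elsewhere in this document.

```python
import math

# Sample means of the regressors, copied from the last column of the LIMDEP output.
MEANS = {"EMPS": 42.012853, "STRICT": 0.50771208}
N_OBS = 778  # number of observations (assumption: same sample as the STATA runs)


def index_at_means(const, b_emps, b_strict):
    """Linear index x'b evaluated at the regressor means."""
    return const + b_emps * MEANS["EMPS"] + b_strict * MEANS["STRICT"]


# Plain Poisson model: P(Y = 0) = exp(-mu), with mu = exp(x'b).
mu_poisson = math.exp(index_at_means(0.3900961420, 0.5418599057e-2, -0.7041663804))
p0_poisson = math.exp(-mu_poisson)

# ZIP model: P(Y = 0) = pi + (1 - pi) * exp(-mu),
# where pi is the logit probability of belonging to the "always zero" group.
mu_zip = math.exp(index_at_means(1.361977491, -0.2770010575e-3, -0.9239125073e-1))
z_inflate = index_at_means(0.4886559537, -0.1098971050e-1, 1.057031399)
pi_zero = 1.0 / (1.0 + math.exp(-z_inflate))
p0_zip = pi_zero + (1.0 - pi_zero) * math.exp(-mu_zip)

print("Poisson:", p0_poisson, "predicted zeros:", N_OBS * p0_poisson)
print("ZIP:    ", p0_zip, "predicted zeros:", N_OBS * p0_zip)
```

Under these assumptions the two probabilities should land close to the reported .27329 and .64642, and multiplying each by the number of observations should land close to the predicted zero counts of 212.6 and 502.9, against 498 actual zeros.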


To estimate the restricted model, run the following command with only ONE in the Rhs subcommand. The Rh2 subcommand remains unchanged.

POISSON;

Lhs=ACCIDENT;

Rhs=ONE;

ZIP; Alg=Newton;

Rh2=ONE,EMPS,STRICT$

Table 5 summarizes parameter estimates and goodness-of-fit statistics for the zero-inflated Poisson model. STATA and LIMDEP report the same parameter estimates, but they produce different standard errors and log likelihoods. In particular, LIMDEP returned a suspicious log likelihood for the restricted model, and thus ended up with the unlikely likelihood ratio of -.0304. In addition, the Vuong statistics in STATA and LIMDEP are different.

Table 5. Summary of the Zero-Inflated Poisson Regression Model in STATA and LIMDEP

Model                                   SAS 9.1   STATA 9.0         LIMDEP 8.0
Intercept                                         1.3620 (.0493)    1.3620 (.0239)
EMPS                                              -.0003 (.0009)    -.0003 (.0004)
STRICT                                            -.0924 (.0729)    -.0924 (.0333)
Intercept (Zero-inflated)                          .4887 (.1211)     .4887 (.1221)
EMPS (Zero-inflated)                              -.0110 (.0023)    -.0110 (.0022)
STRICT (Zero-inflated)                            1.0570 (.1768)    1.0570 (.1772)
Log Likelihood (unrestricted)                     -1269.7206        -1259.8857
Log Likelihood (restricted)                       -1270.9523        -1259.8705
Likelihood Ratio for Goodness-of-fit              2.46              -.0304
Vuong Statistic (ZIP versus Poisson)              8.40              9.5740
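The likelihood ratio in Table 5 is twice the difference between the unrestricted and restricted log likelihoods. The following Python sketch (an illustration, not STATA or LIMDEP output) recomputes it for both packages. It assumes 2 degrees of freedom, on the reasoning that the restricted model drops EMPS and STRICT from the count equation; the function names are hypothetical.

```python
import math


def lr_statistic(ll_unrestricted, ll_restricted):
    """Likelihood ratio statistic: 2 * (lnL_unrestricted - lnL_restricted)."""
    return 2.0 * (ll_unrestricted - ll_restricted)


def chi2_upper_tail_2df(x):
    """Upper-tail probability of a chi-squared variate with 2 df (needs x >= 0)."""
    if x < 0:
        raise ValueError("a chi-squared statistic cannot be negative")
    return math.exp(-x / 2.0)


# Log likelihoods as reported in Table 5.
lr_stata = lr_statistic(-1269.7206, -1270.9523)
lr_limdep = lr_statistic(-1259.8857, -1259.8705)

p_stata = chi2_upper_tail_2df(lr_stata)  # assumption: 2 df

print("STATA  LR:", round(lr_stata, 4), "p-value:", round(p_stata, 4))
print("LIMDEP LR:", round(lr_limdep, 4), "(negative, so no p-value is computed)")
```

Because the restricted model is nested in the unrestricted one, the statistic cannot be negative when both models are fitted to their maxima, which is why the LIMDEP value looks suspicious.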





5. The Zero-Inflated NB Regression Model

STATA and LIMDEP can estimate the zero-inflated negative binomial regression model (ZINB).

5.1 ZINB in STATA (.zinb)

The STATA .zinb command estimates the ZINB. The vuong option computes the Vuong statistic to compare the ZINB and NBRM.

. zinb accident emps strict, inflate(emps strict) vuong

Fitting constant-only model:

Iteration 0:   log likelihood = -1190.5117  (not concave)
Iteration 1:   log likelihood = -1106.9874
Iteration 2:   log likelihood = -1098.8642
Iteration 3:   log likelihood = -1095.3638
Iteration 4:   log likelihood = -1094.0237
Iteration 5:   log likelihood =  -1093.063
Iteration 6:   log likelihood = -1092.6216
Iteration 7:   log likelihood =  -1091.798
Iteration 8:   log likelihood = -1091.7332
Iteration 9:   log likelihood = -1091.7329
Iteration 10:  log likelihood = -1091.7329

Fitting full model:

Iteration 0:   log likelihood = -1091.7329
Iteration 1:   log likelihood = -1089.5565
Iteration 2:   log likelihood = -1089.5198
Iteration 3:   log likelihood = -1089.5198

Zero-inflated negative binomial regression        Number of obs   =        778
                                                  Nonzero obs     =        280
                                                  Zero obs        =        498

Inflation model = logit                           LR chi2(2)      =       4.43
Log likelihood  = -1089.52                        Prob > chi2     =     0.1094

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
accident     |
        emps |  -.0004407   .0020554    -0.21   0.830    -.0044691    .0035877
      strict |  -.3251317   .1659173    -1.96   0.050    -.6503235    .0000602
       _cons |   .7763065   .1508037     5.15   0.000     .4807367    1.071876
-------------+----------------------------------------------------------------
inflate      |
        emps |  -.2087768   .0955122    -2.19   0.029    -.3959772   -.0215763
      strict |   7.562388   3.055775     2.47   0.013     1.573179     13.5516
       _cons |   .1032115   .3800045     0.27   0.786    -.6415835    .8480065
-------------+----------------------------------------------------------------
    /lnalpha |   .9252514   .1351387     6.85   0.000     .6603845    1.190118
-------------+----------------------------------------------------------------
       alpha |   2.522502   .3408876                      1.935536     3.28747
------------------------------------------------------------------------------
Vuong test of zinb vs. standard negative binomial: z =     4.13  Pr>z = 0.0000





The likelihood ratio, 360.4024 = 2*(-1089.5198 - (-1269.721)), rejects the null hypothesis of no overdispersion, indicating that the ZINB can improve goodness-of-fit over the ZIP (p < .001). The Vuong statistic, 4.13 > 1.96, suggests that the ZINB is preferred to the NBRM.
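The two tests in this paragraph can be retraced numerically. The Python sketch below is an illustration under stated assumptions: the likelihood ratio is referred to a chi-squared distribution with 1 degree of freedom (one extra parameter, alpha), and the Vuong z of 4.13 from the STATA output is referred to the standard normal distribution. The helper names are hypothetical.

```python
import math

# Log likelihoods of the ZINB and ZIP models as reported by STATA.
ll_zinb = -1089.5198
ll_zip = -1269.721

# Likelihood ratio statistic for H0: no overdispersion (alpha = 0).
lr_overdispersion = 2.0 * (ll_zinb - ll_zip)


def chi2_upper_tail_1df(x):
    """Upper-tail probability of a chi-squared variate with 1 df (needs x >= 0)."""
    return math.erfc(math.sqrt(x / 2.0))


def normal_upper_tail(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


p_lr = chi2_upper_tail_1df(lr_overdispersion)  # assumption: 1 df, no boundary correction

vuong_z = 4.13  # from the STATA zinb output
p_vuong = normal_upper_tail(vuong_z)
zinb_preferred = vuong_z > 1.96  # decision rule used in the text

print("LR statistic:", round(lr_overdispersion, 4), "p-value:", p_lr)
print("Vuong z:", vuong_z, "one-sided p-value:", p_vuong, "ZINB preferred:", zinb_preferred)
```

Because alpha = 0 lies on the boundary of the parameter space, a boundary-corrected test would halve the chi-squared p-value; the conclusion is the same either way.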

5.2 ZINB in LIMDEP

The LIMDEP Negbin$ command needs to have the Zip and Rh2 subcommands for the ZINB. The following command produces the Poisson regression model, negative binomial model, and zero-inflated negative binomial model. You may omit the Alg=Newton$ subcommand.

NEGBIN;

Lhs=ACCIDENT;

Rhs=ONE,EMPS,STRICT; Rh2=ONE,EMPS,STRICT;

ZIP; Alg=Newton$

+---------------------------------------------+













| Chi- squared = 4944.94781 RsqP= -.0051 |

| G - squared = 2827.20794 RsqD= .0423 |



+---------------------------------------------+

+---------+--------------+----------------+--------+---------+----------+


+---------+--------------+----------------+--------+---------+----------+

Constant .3900961420 .46678663E-01 8.357 .0000

EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853

STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208



+---------------------------------------------+

| Negative Binomial Regression |

| Maximum Likelihood Estimates |







| Restricted log likel
