Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS
2003-2005, The Trustees of Indiana University Comparing Group Means: 1
http://www.indiana.edu/~statmath
Comparing Group Means: The T-test and One-way
ANOVA Using STATA, SAS, and SPSS
Hun Myoung Park
This document summarizes the method of comparing group means and illustrates how to
conduct the t-test and one-way ANOVA using STATA 9.0, SAS 9.1, and SPSS 13.0.
1. Introduction
2. Univariate Samples
3. Paired (dependent) Samples
4. Independent Samples with Equal Variances
5. Independent Samples with Unequal Variances
6. One-way ANOVA, GLM, and Regression
7. Conclusion
1. Introduction
The t-test and analysis of variance (ANOVA) compare group means. The mean of a variable to
be compared should be substantively interpretable. A t-test may examine gender differences in
average salary or racial (white versus black) differences in average annual income. The left-hand side (LHS) variable to be tested should be interval or ratio, whereas the right-hand side
(RHS) variable should be binary (categorical).
1.1 T-test and ANOVA
While the t-test is limited to comparing means of two groups, one-way ANOVA can compare
more than two groups. Therefore, the t-test is considered a special case of one-way ANOVA.
These analyses do not, however, necessarily imply any causality (i.e., a causal relationship
between the left-hand and right-hand side variables). Table 1 compares the t-test and one-way
ANOVA.
Table 1. Comparison between the T-test and One-way ANOVA
                      T-test                                 One-way ANOVA
LHS (Dependent)       Interval or ratio variable             Interval or ratio variable
RHS (Independent)     Binary variable with only two groups   Categorical variable
Null Hypothesis       $\mu_1 = \mu_2$                        $\mu_1 = \mu_2 = \mu_3 = \dots$
Prob. Distribution*   T distribution                         F distribution
* In the case of one degree of freedom on numerator, $F = t^2$.
The t-test assumes that samples are randomly drawn from normally distributed populations
with unknown population means. Otherwise, their means are no longer the best measures of
central tendency and the t-test will not be valid. The Central Limit Theorem says, however, that
the distributions of $\bar{y}_1$ and $\bar{y}_2$ are approximately normal when N is large. When $n_1 + n_2 \geq 30$, in
practice, you do not need to worry too much about the normality assumption.
You may numerically test the normality assumption using the Shapiro-Wilk W (N
Figure 1. Two Types of Data Arrangement
Variable   Group          Variable1   Variable2
x          0              x           y
x          0              x           y
y          1
y          1
The data set used here is adopted from J. F. Fraumeni's study on cigarette smoking and cancer
(Fraumeni 1968). The data are per capita numbers of cigarettes sold by 43 states and the
District of Columbia in 1960 together with death rates per hundred thousand people from
various forms of cancer. Two variables were added to categorize states into two groups. See the
appendix for the details.
2. Univariate Samples
The univariate-sample or one-sample t-test determines whether an unknown population mean
$\mu$ differs from a hypothesized value c that is commonly set to zero: $H_0: \mu = c$. The t statistic
follows Student's T probability distribution with n-1 degrees of freedom,
$t = \frac{\bar{y} - c}{s_{\bar{y}}} \sim t(n-1)$,
where y is a variable to be tested and n is the number of observations.[1]
Suppose you want to test if the population mean of the death rates from lung cancer is 20 per
100,000 people at the .01 significance level. Note the default significance level used in most
software is the .05 level.
2.1 T-test in STATA
The .ttest command conducts t-tests in an easy and flexible manner. For a univariate sample
test, the command requires that a hypothesized value be explicitly specified. The level()
option indicates the confidence level as a percentage. The 99 percent confidence level is
equivalent to the .01 significance level.
. ttest lung=20, level(99)
One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [99% Conf. Interval]
---------+--------------------------------------------------------------------
    lung |      44    19.65318    .6374133    4.228122    17.93529    21.37108
------------------------------------------------------------------------------
    mean = mean(lung)                                             t =  -0.5441
Ho: mean = 20                                    degrees of freedom =       43
    Ha: mean < 20               Ha: mean != 20                 Ha: mean > 20
 Pr(T < t) = 0.2946         Pr(|T| > |t|) = 0.5892          Pr(T > t) = 0.7054
STATA first lists descriptive statistics of the variable lung. The mean and standard deviation
of the 44 observations are 19.653 and 4.228, respectively. The t statistic is -.544 = (19.653-20)
/ .6374. Finally, the degrees of freedom are 43 = 44-1.
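The arithmetic behind the reported t statistic can be retraced from the summary statistics alone. The following is a minimal Python sketch, not part of the STATA session; the function name one_sample_t is made up for illustration.

```python
import math

def one_sample_t(n, mean, sd, c):
    """Return (standard error, t statistic, degrees of freedom)."""
    se = sd / math.sqrt(n)   # standard error of the mean
    t = (mean - c) / se      # distance from the hypothesized value in SE units
    return se, t, n - 1

# Summary statistics of lung taken from the STATA output above
se, t, df = one_sample_t(44, 19.65318, 4.228122, 20)
print(round(se, 4), round(t, 4), df)
```

The printed values should agree with the Std. Err., t, and degrees of freedom in the output, up to rounding.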
There are three t-tests at the bottom of the output above. The first and third are one-tailed tests,
whereas the second is a two-tailed test. The t statistic -.544 and its large p-value do not reject
the null hypothesis that the population mean of the death rate from lung cancer is 20 at the .01
level. The mean of the death rate may be 20 per 100,000 people. Note that the hypothesized
value 20 falls into the 99 percent confidence interval 17.935-21.371.[2]
[1] $\bar{y} = \frac{\sum y_i}{n}$, $s^2 = \frac{\sum (y_i - \bar{y})^2}{n-1}$, and standard error $s_{\bar{y}} = \frac{s}{\sqrt{n}}$.
[2] The 99 percent confidence interval of the mean is $\bar{y} \pm t_{\alpha/2} s_{\bar{y}} = 19.653 \pm 2.695 \times .6374$, where 2.695 is
the critical value with 43 degrees of freedom at the .01 level in the two-tailed test.
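The confidence interval can be checked the same way. This sketch takes the critical value 2.695 as given in the text rather than computing it from the t distribution; the helper name mean_ci is hypothetical.

```python
def mean_ci(mean, se, t_crit):
    """Confidence interval: mean plus or minus critical value times standard error."""
    half_width = t_crit * se
    return mean - half_width, mean + half_width

# Mean and standard error from the STATA output; 2.695 is the critical value
# quoted in the text for 43 degrees of freedom at the .01 level (two-tailed).
lo, hi = mean_ci(19.65318, 0.6374133, 2.695)
contains_20 = lo <= 20 <= hi
print(round(lo, 3), round(hi, 3), contains_20)
```

Because the hypothesized value 20 lies inside the interval, the two-tailed test at the .01 level does not reject the null hypothesis, which matches the p-value reading.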
If you just have the aggregate data (i.e., the number of observations, mean, and standard
deviation of the sample), use the .ttesti command to replicate the t-test above. Note the
hypothesized value is specified at the end of the summary statistics.
. ttesti 44 19.65318 4.228122 20, level(99)
2.2 T-test Using the SAS TTEST Procedure
The TTEST procedure conducts various types of t-tests in SAS. The H0 option specifies a
hypothesized value, whereas the ALPHA option indicates a significance level. If omitted, the default
values zero and .05 respectively are assumed.
PROC TTEST H0=20 ALPHA=.01 DATA=masil.smoking;
VAR lung;
RUN;
The TTEST Procedure
Statistics
Lower CL Upper CL Lower CL Upper CL
Variable N Mean Mean Mean Std Dev Std Dev Std Dev Std Err
lung 44 17.935 19.653 21.371 3.2994 4.2281 5.7989 0.6374
T-Tests
Variable DF t Value Pr > |t|
lung 43 -0.54 0.5892
The TTEST procedure reports descriptive statistics followed by a two-tailed t-test. You may
have a summary data set containing the values of a variable (lung) and their frequencies
(count). The FREQ option of the TTEST procedure provides the solution for this case.
PROC TTEST H0=20 ALPHA=.01 DATA=masil.smoking;
VAR lung;
FREQ count;
RUN;
2.3 T-test Using the SAS UNIVARIATE and MEANS Procedures
The SAS UNIVARIATE and MEANS procedures also conduct a t-test for a univariate sample.
The UNIVARIATE procedure is basically designed to produce a variety of descriptive
statistics of a variable. Its MU0 option tells the procedure to perform a t-test using the
hypothesized value specified. The VARDEF=DF option specifies a divisor (degrees of freedom) used in
computing the variance (standard deviation).[3] The NORMAL option examines if the variable is
normally distributed.
PROC UNIVARIATE MU0=20 VARDEF=DF NORMAL ALPHA=.01 DATA=masil.smoking;
VAR lung;
RUN;
The UNIVARIATE Procedure
Variable: lung
Moments
N 44 Sum Weights 44
Mean 19.6531818 Sum Observations 864.74
Std Deviation 4.22812167 Variance 17.8770129
Skewness -0.104796 Kurtosis -0.949602
Uncorrected SS 17763.604 Corrected SS 768.711555
Coeff Variation 21.5136751 Std Error Mean 0.63741333
Basic Statistical Measures
Location Variability
Mean 19.65318 Std Deviation 4.22812
Median 20.32000 Variance 17.87701
Mode . Range 15.26000
Interquartile Range 6.53000
Tests for Location: Mu0=20
Test -Statistic- -----p Value------
Student's t t -0.5441 Pr > |t| 0.5892
Sign M 1 Pr >= |M| 0.8804
Signed Rank S -36.5 Pr >= |S| 0.6752
Tests for Normality
Test --Statistic--- -----p Value------
Shapiro-Wilk W 0.967845 Pr < W 0.2535
Kolmogorov-Smirnov D 0.086184 Pr > D >0.1500
Cramer-von Mises W-Sq 0.063737 Pr > W-Sq >0.2500
Anderson-Darling A-Sq 0.382105 Pr > A-Sq >0.2500
Quantiles (Definition 5)
Quantile Estimate
100% Max 27.270
[3] The VARDEF=N uses N as a divisor, while VARDEF=WDF specifies the sum of weights minus one.
99% 27.270
95% 25.950
90% 25.450
75% Q3 22.815
50% Median 20.320
25% Q1 16.285
Quantiles (Definition 5)
Quantile Estimate
10% 14.110
5% 12.120
1% 12.010
0% Min 12.010
Extreme Observations
-----Lowest---- ----Highest----
Value Obs Value Obs
12.01 39 25.45 16
12.11 33 25.88 1
12.12 30 25.95 27
13.58 10 26.48 18
14.11 36 27.27 8
The third block of the output above reports a t statistic and its p-value. The fourth block
contains several normality test statistics. Since N is less than 2,000, you should read the
Shapiro-Wilk W, which suggests that lung is normally distributed (p |t| CL for Mean CL for Mean
19.6531818 4.2281217 0.6374133 30.83
The MEANS procedure does not, however, have an option to specify a hypothesized value to
anything other than zero. Thus, the null hypothesis here is that the population mean of death
rate from lung cancer is zero. The t statistic 30.83 is (19.6532-0)/.6374. The large t statistic and
small p-value reject the null hypothesis, reporting a consistent conclusion.
2.4 T-test in SPSS
SPSS has the T-TEST command for t-tests. The /TESTVAL subcommand specifies the value
with which the sample mean is compared, whereas the /VARIABLES subcommand lists the variables to be tested.
Like STATA, SPSS specifies a confidence level rather than a significance level in the
/CRITERIA=CI() subcommand.
T-TEST
/TESTVAL = 20
/VARIABLES = lung
/MISSING = ANALYSIS
/CRITERIA = CI(.99) .
3. Paired (Dependent) Samples
When two variables are not independent, but paired, the difference of these two variables,
$d_i = y_{1i} - y_{2i}$, is treated as if it were a single sample. This test is appropriate for pre-post
treatment responses. The null hypothesis is that the true mean difference of the two variables is
$D_0$, $H_0: \mu_d = D_0$.[4] The hypothesized difference is typically assumed to be zero unless explicitly specified.
3.1 T-test in STATA
In order to conduct a paired sample t-test, you need to list two variables separated by an equal
sign. The interpretation of the t-test remains almost unchanged. The t statistic -1.871 =
(-10.1667-0)/5.4337 with 35 degrees of freedom does not reject the null hypothesis that the
difference is zero at the .05 level (two-tailed p-value .0697).
. ttest pre=post0, level(95)
Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
     pre |      36    176.0278    6.529723    39.17834    162.7717    189.2838
   post0 |      36    186.1944    7.826777    46.96066    170.3052    202.0836
---------+--------------------------------------------------------------------
    diff |      36   -10.16667    5.433655    32.60193   -21.19757    .8642387
------------------------------------------------------------------------------
     mean(diff) = mean(pre - post0)                               t =  -1.8711
 Ho: mean(diff) = 0                              degrees of freedom =       35
 Ha: mean(diff) < 0           Ha: mean(diff) != 0           Ha: mean(diff) > 0
 Pr(T < t) = 0.0349         Pr(|T| > |t|) = 0.0697          Pr(T > t) = 0.9651
Alternatively, you may first compute the difference between the two variables, and then
conduct a one-sample t-test. Note that the default confidence level, level(95), can be omitted.
. gen d=pre-post0
. ttest d=0
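The equivalence between the paired test and a one-sample test on the differences can be sketched in Python. The pre and post lists below are hypothetical illustration values, not the drug data, and paired_t is a made-up helper; only the last line uses figures reported in the output above.

```python
import math

def paired_t(x, y, d0=0.0):
    """Paired t statistic computed as a one-sample test on the differences."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)  # sample variance of d
    se_d = math.sqrt(var_d / n)                          # standard error of mean d
    return (mean_d - d0) / se_d, n - 1

# Hypothetical measurements for five subjects (illustration only)
pre = [180, 172, 165, 190, 201]
post = [185, 170, 171, 199, 204]
t_demo, df_demo = paired_t(pre, post)

# Check of the reported statistic from the summary figures in the output:
# mean difference -10.16667, standard deviation 32.60193, 36 pairs
t_reported = -10.16667 / (32.60193 / math.sqrt(36))
print(round(t_demo, 3), df_demo, round(t_reported, 4))
```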
3.2 T-test in SAS
In the TTEST procedure, you have to use the PAIRED instead of the VAR statement. For the
output of the following procedure, refer to the end of this section.
PROC TTEST DATA=temp.drug;
PAIRED pre*post0;
RUN;
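[4] $t_d = \frac{\bar{d} - D_0}{s_{\bar{d}}} \sim t(n-1)$, where $\bar{d} = \frac{\sum d_i}{n}$, $s_d^2 = \frac{\sum (d_i - \bar{d})^2}{n-1}$, and $s_{\bar{d}} = \frac{s_d}{\sqrt{n}}$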
The PAIRED statement provides various ways of comparing variables using asterisk (*) and
colon (:) operators. The asterisk requests comparisons between each variable on the left with
each variable on the right. The colon requests comparisons between the first variable on the left
and the first on the right, the second on the left and the second on the right, and so forth.
Consider the following examples.
PROC TTEST;
PAIRED pre:post0;
PAIRED (a b)*(c d); /* Equivalent to PAIRED a*c a*d b*c b*d; */
PAIRED (a b):(c d); /* Equivalent to PAIRED a*c b*d; */
PAIRED (a1-a10)*(b1-b10);
RUN;
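The pairing rules of the two operators correspond to a cross product and an element-wise pairing. The short Python sketch below mimics that logic only; it does not call SAS, and the variable names are illustrative.

```python
from itertools import product

left, right = ["a", "b"], ["c", "d"]

# Asterisk: each variable on the left with each variable on the right
star_pairs = [f"{l}*{r}" for l, r in product(left, right)]

# Colon: first with first, second with second, and so forth
colon_pairs = [f"{l}*{r}" for l, r in zip(left, right)]

print(star_pairs)   # ['a*c', 'a*d', 'b*c', 'b*d']
print(colon_pairs)  # ['a*c', 'b*d']
```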
The first PAIRED statement is the same as PAIRED pre*post0. The second and the third
PAIRED statements contrast differences between asterisk and colon operators. The hyphen (-)
operator in the last statement indicates a1 through a10 and b1 through b10. Let us consider an
example of the PAIRED statement.
PROC TTEST DATA=temp.drug;
PAIRED (pre)*(post0-post1);
RUN;
The TTEST Procedure
Statistics
Lower CL Upper CL Lower CL Upper CL
Difference N Mean Mean Mean Std Dev Std Dev Std Dev Std Err
pre - post0 36 -21.2 -10.17 0.8642 26.443 32.602 42.527 5.4337
pre - post1 36 -30.43 -20.39 -10.34 24.077 29.685 38.723 4.9475
T-Tests
Difference DF t Value Pr > |t|
pre - post0 35 -1.87 0.0697
pre - post1 35 -4.12 0.0002
The first t statistic for pre versus post0 is identical to that of the previous section. The second
for pre versus post1 rejects the null hypothesis of no mean difference at the .01 level (p
PROC UNIVARIATE MU0=0 VARDEF=DF NORMAL; VAR d1 d2; RUN;
PROC MEANS MEAN STD STDERR T PROBT CLM; VAR d1 d2; RUN;
PROC TTEST ALPHA=.05; VAR d1 d2; RUN;
3.3 T-test in SPSS
In SPSS, the PAIRS subcommand indicates a paired sample t-test.
T-TEST PAIRS = pre post0
/CRITERIA = CI(.95)
/MISSING = ANALYSIS .
The SAS TTEST procedure and SPSS T-TEST command conduct F tests for equal variance.
SAS reports the folded form F statistic, whereas SPSS computes Levene's weighted F statistic.
In STATA, the .oneway command produces Bartlett's statistic for the equal variance test. The
following is an example of Bartlett's test that does not reject the null hypothesis of equal
variance.
. oneway lung smoke
                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      313.031127      1   313.031127     28.85     0.0000
 Within groups      455.680427     42    10.849534
------------------------------------------------------------------------
    Total           768.711555     43   17.8770129
Bartlett's test for equal variances:  chi2(1) =   0.1216  Prob>chi2 = 0.727
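The entries of the ANOVA table follow from the sums of squares and degrees of freedom. A small Python sketch of that arithmetic, using the figures printed above (variable names are illustrative):

```python
# Sums of squares and degrees of freedom from the .oneway output
ss_between, df_between = 313.031127, 1
ss_within, df_within = 455.680427, 42

ms_between = ss_between / df_between   # mean square between groups
ms_within = ss_within / df_within      # mean square within groups
f_stat = ms_between / ms_within        # F statistic with (1, 42) degrees of freedom

ss_total = ss_between + ss_within
ms_total = ss_total / (df_between + df_within)  # equals the overall sample variance

print(round(ms_within, 6), round(f_stat, 2), round(ms_total, 4))
```

The total mean square reproduces the variance of lung reported by the UNIVARIATE procedure (17.877).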
STATA, SAS, and SPSS all compute Satterthwaite's approximation of the degrees of freedom.
In addition, the SAS TTEST procedure reports the Cochran-Cox approximation and the
STATA .ttest command provides Welch's degrees of freedom.
4.2 T-test in STATA
With the .ttest command, you have to specify a grouping variable (smoke in this example) in
the parentheses of the by option.
. ttest lung, by(smoke) level(95)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      22    16.98591    .6747158    3.164698    15.58276    18.38906
       1 |      22    22.32045    .7287523    3.418151    20.80493    23.83598
---------+--------------------------------------------------------------------
combined |      44    19.65318    .6374133    4.228122    18.36772    20.93865
---------+--------------------------------------------------------------------
    diff |           -5.334545    .9931371               -7.338777   -3.330314
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -5.3714
Ho: diff = 0                                     degrees of freedom =       42
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
Let us first check the equal variance. The F statistic is
$F = \frac{s_L^2}{s_S^2} = \frac{3.4182^2}{3.1647^2} = 1.17 \sim F(21, 21)$. The
degrees of freedom of the numerator and denominator are 21 (=22-1). The p-value of .7273,
virtually the same as that of Bartlett's test above, does not reject the null hypothesis of equal
variance. Thus, the t-test here is valid (t=-5.3714 and p
$t = \frac{(16.9859 - 22.3205) - 0}{s_{pool}\sqrt{\frac{1}{22} + \frac{1}{22}}} = -5.3714 \sim t(22+22-2)$, where
$s_{pool}^2 = \frac{(22-1)3.1647^2 + (22-1)3.4182^2}{22+22-2} = 10.8497$
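The pooled-variance calculation can be retraced numerically. This Python sketch uses the unrounded group statistics from the STATA output, so the pooled variance comes out near 10.8495 rather than the 10.8497 obtained from rounded standard deviations; pooled_t is a hypothetical helper name.

```python
import math

def pooled_t(n1, m1, s1, n2, m2, s2):
    """Two-sample t statistic under the equal variance assumption."""
    df = n1 + n2 - 2
    var_pool = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    se_diff = math.sqrt(var_pool * (1 / n1 + 1 / n2))
    return var_pool, se_diff, (m1 - m2) / se_diff, df

var_pool, se_diff, t, df = pooled_t(22, 16.98591, 3.164698,
                                    22, 22.32045, 3.418151)
print(round(var_pool, 4), round(se_diff, 4), round(t, 4), df)

# With one numerator degree of freedom, the squared t statistic equals
# the F statistic of the one-way ANOVA (28.85 in the .oneway output).
print(round(t * t, 2))
```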
If only aggregate data of the two variables are available, use the .ttesti command and list the
number of observations, mean, and standard deviation of the two variables.
. ttesti 22 16.98591 3.164698 22 22.32045 3.418151, level(95)
Suppose a data set is differently arranged (second type in Figure 1) so that one variable
smk_lung has data for smokers and the other non_lung for non-smokers. You have to use the
unpaired option to indicate that two variables are not paired. A grouping variable here is not
necessary. Compare the following output with what is printed above.
. ttest smk_lung=non_lung, unpaired
Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
smk_lung |      22    22.32045    .7287523    3.418151    20.80493    23.83598
non_lung |      22    16.98591    .6747158    3.164698    15.58276    18.38906
---------+--------------------------------------------------------------------
combined |      44    19.65318    .6374133    4.228122    18.36772    20.93865
---------+--------------------------------------------------------------------
    diff |            5.334545    .9931371                3.330313    7.338777
------------------------------------------------------------------------------
    diff = mean(smk_lung) - mean(non_lung)                        t =   5.3714
Ho: diff = 0                                     degrees of freedom =       42
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
This unpaired option is very useful since it enables you to conduct a t-test without additional
data manipulation. You may run the .ttest command with the unpaired option to compare
two variables, say leukemia and kidney, as independent samples in STATA. In SAS and
SPSS, however, you have to stack up two variables and generate a grouping variable before t-tests.
. ttest leukemia=kidney, unpaired
Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
leukemia |      44    6.829773    .0962211    .6382589    6.635724    7.023821
  kidney |      44    2.794545    .0782542    .5190799    2.636731     2.95236
---------+--------------------------------------------------------------------
combined |      88    4.812159    .2249261    2.109994    4.365094    5.259224
---------+--------------------------------------------------------------------
    diff |            4.035227    .1240251                3.788673    4.281781
------------------------------------------------------------------------------
    diff = mean(leukemia) - mean(kidney)                          t =  32.5356
Ho: diff = 0                                     degrees of freedom =       86
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
The F statistic 1.5119 = (.6382589^2)/(.5190799^2) and its p-value (=.1797) do not reject the null
hypothesis of equal variance. The large t statistic 32.5356 rejects the null hypothesis that death
rates from leukemia and kidney cancers have the same mean.
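The variance-ratio (folded) F statistic puts the larger sample variance in the numerator. A minimal Python sketch, using the standard deviations printed in the STATA output tables; folded_f is a hypothetical helper, and no p-value is computed here.

```python
def folded_f(s1, n1, s2, n2):
    """Return (F, numerator df, denominator df) with the larger variance on top."""
    v1, v2 = s1**2, s2**2
    if v1 >= v2:
        return v1 / v2, n1 - 1, n2 - 1
    return v2 / v1, n2 - 1, n1 - 1

# lung by smoke: standard deviations 3.164698 and 3.418151, 22 states each
f_smoke = folded_f(3.164698, 22, 3.418151, 22)

# leukemia versus kidney: standard deviations .6382589 and .5190799, 44 each
f_leukemia = folded_f(0.6382589, 44, 0.5190799, 44)

print(round(f_smoke[0], 2), f_smoke[1:])
print(round(f_leukemia[0], 4), f_leukemia[1:])
```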
4.3 T-test in SAS
The TTEST procedure by default examines the hypothesis of equal variances, and provides T
statistics for either case. The procedure by default reports Satterthwaite's approximation for the
degrees of freedom. Keep in mind that a variable to be tested is grouped by the variable that is
specified in the CLASS statement.
PROC TTEST H0=0 ALPHA=.05 DATA=masil.smoking;
CLASS smoke;
VAR lung;
RUN;
The TTEST Procedure
Statistics
Lower CL Upper CL Lower CL Upper CL
Variable smoke N Mean Mean Mean Std Dev Std Dev Std Dev
lung 0 22 15.583 16.986 18.389 2.4348 3.1647 4.5226
lung 1 22 20.805 22.32 23.836 2.6298 3.4182 4.8848
lung Diff (1-2) -7.339 -5.335 -3.33 2.7159 3.2939 4.1865
Statistics
Variable smoke Std Err Minimum Maximum
lung 0 0.6747 12.01 25.45
lung 1 0.7288 12.11 27.27
lung Diff (1-2) 0.9931
T-Tests
Variable Method Variances DF t Value Pr > |t|
lung Pooled Equal 42 -5.37
lung Folded F 21 21 1.17 0.7273
The F test for equal variance does not reject the null hypothesis of equal variances. Thus, the
t-test labeled as Pooled should be referred to in order to get the t -5.37 and its p-value .0001. If
the equal variance assumption is violated, the statistics of Satterthwaite and Cochran
should be read.
If you have a summary data set with the values of variables (lung) and their frequency (count),
specify the count variable in the FREQ statement.
PROC TTEST DATA=masil.smoking;
CLASS smoke;
VAR lung;
FREQ count;
RUN;
Now, let us compare the death rates from leukemia and kidney in the second data arrangement
type of Figure 1. As mentioned before, you need to rearrange the data set to stack up two
variables into one and generate a grouping variable (first type in Figure 1).
DATA masil.smoking2;
SET masil.smoking;
death = leukemia; leu_kid ='Leukemia'; OUTPUT;
death = kidney; leu_kid ='Kidney'; OUTPUT;
KEEP leu_kid death;
RUN;
PROC TTEST COCHRAN DATA=masil.smoking2; CLASS leu_kid; VAR death; RUN;
The TTEST Procedure
Statistics
Lower CL Upper CL Lower CL Upper CL
Variable leu_kid N Mean Mean Mean Std Dev Std Dev Std Dev Std Err
death Kidney 44 2.6367 2.7945 2.9524 0.4289 0.5191 0.6577 0.0783
death Leukemia 44 6.6357 6.8298 7.0238 0.5273 0.6383 0.8087 0.0962
death Diff (1-2) -4.282 -4.035 -3.789 0.5063 0.5817 0.6838 0.124
T-Tests
Variable Method Variances DF t Value Pr > |t|
death Pooled Equal 86 -32.54
death Folded F 43 43 1.51 0.1794
Compare this SAS output with that of STATA in the previous section.
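The rearrangement done by the DATA step, turning two variables into one value column plus a grouping column, can be sketched in Python as follows. The two input rows are hypothetical illustration values, and stack is a made-up helper that mirrors the two OUTPUT statements.

```python
def stack(records, columns, value_name, group_name):
    """Emit one output row per (record, column), like two OUTPUT statements."""
    stacked = []
    for rec in records:
        for col in columns:
            stacked.append({group_name: col, value_name: rec[col]})
    return stacked

# Hypothetical wide-format rows (second arrangement type in Figure 1)
wide = [
    {"leukemia": 6.9, "kidney": 2.8},
    {"leukemia": 7.1, "kidney": 3.0},
]

# Long format (first arrangement type): one value column and one group column
long_rows = stack(wide, ["leukemia", "kidney"], "death", "leu_kid")
for row in long_rows:
    print(row)
```

Each input row yields two output rows, so 44 states become 88 observations, which matches the N of 44 per group in the SAS output.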
4.4 T-test in SPSS
In the T-TEST command, you need to use the /GROUP subcommand in order to specify a
grouping variable. SPSS reports Levene's F .0000 that does not reject the null hypothesis of
equal variance (p
5. Independent Samples with Unequal Variances
If the assumption of equal variances is violated, we have to compute the adjusted t statistic
using individual sample standard deviations rather than a pooled standard deviation. It is also
necessary to use the Satterthwaite, Cochran-Cox (SAS), or Welch (STATA) approximations of
the degrees of freedom. In this chapter, you compare mean death rates from kidney cancer
between the west (south) and east (north).
5.1 T-test in STATA
As discussed earlier, let us check equality of variances using the .oneway command. The
tabulate option produces a table of summary statistics for the groups.
. oneway kidney west, tabulate
            |          Summary of kidney
       west |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          0 |       3.006    .3001298          20
          1 |   2.6183333   .59837219          24
------------+------------------------------------
      Total |   2.7945455   .51907993          44
                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      1.63947758      1   1.63947758      6.92     0.0118
 Within groups      9.94661333     42   .236824127
------------------------------------------------------------------------
    Total           11.5860909     43   .269443975
Bartlett's test for equal variances:  chi2(1) =   8.6506  Prob>chi2 = 0.003
Bartlett's chi-squared statistic rejects the null hypothesis of equal variance at the .01 level. It is
appropriate to use the unequal option in the .ttest command, which calculates
Satterthwaite's approximation for the degrees of freedom.
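Bartlett's statistic can be reproduced from the group sizes and standard deviations. The Python sketch below assumes the textbook form of the statistic with its usual correction factor; the source does not print the formula, so this is an assumption that happens to reproduce the reported values. The function name bartlett_chi2 is hypothetical.

```python
import math

def bartlett_chi2(groups):
    """groups: list of (n, sd) pairs. Returns (chi-squared, degrees of freedom)."""
    k = len(groups)
    n_total = sum(n for n, _ in groups)
    # Pooled variance weighted by each group's degrees of freedom
    pooled = sum((n - 1) * sd**2 for n, sd in groups) / (n_total - k)
    m = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(sd**2) for n, sd in groups)
    # Correction factor
    c = 1 + (sum(1 / (n - 1) for n, _ in groups) - 1 / (n_total - k)) / (3 * (k - 1))
    return m / c, k - 1

# kidney by west: group sizes and standard deviations from the tabulate output
chi2_west, df_west = bartlett_chi2([(20, 0.3001298), (24, 0.59837219)])

# lung by smoke, from the equal-variance example in the previous chapter
chi2_smoke, df_smoke = bartlett_chi2([(22, 3.164698), (22, 3.418151)])

print(round(chi2_west, 4), df_west, round(chi2_smoke, 4), df_smoke)
```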
Unlike the SAS TTEST procedure, the .ttest command cannot specify a mean difference
D0 other than zero. Thus, the null hypothesis is that the mean difference is zero.
. ttest kidney, by(west) unequal level(95)
Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      20       3.006    .0671111    .3001298    2.865535    3.146465
       1 |      24    2.618333    .1221422    .5983722    2.365663    2.871004
---------+--------------------------------------------------------------------
combined |      44    2.794545    .0782542    .5190799    2.636731     2.95236
---------+--------------------------------------------------------------------
    diff |            .3876667     .139365                .1047722    .6705611
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   2.7817
Ho: diff = 0                     Satterthwaite's degrees of freedom =  35.1098
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9957         Pr(|T| > |t|) = 0.0086          Pr(T > t) = 0.0043
See Satterthwaite's approximation of 35.110 in the middle of the output. If you want to get
Welch's approximation, use the welch option together with the unequal option; without the unequal option,
welch is ignored.
. ttest kidney, by(west) unequal welch
Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      20       3.006    .0671111    .3001298    2.865535    3.146465
       1 |      24    2.618333    .1221422    .5983722    2.365663    2.871004
---------+--------------------------------------------------------------------
combined |      44    2.794545    .0782542    .5190799    2.636731     2.95236
---------+--------------------------------------------------------------------
    diff |            .3876667     .139365                .1050824    .6702509
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   2.7817
Ho: diff = 0                             Welch's degrees of freedom =  36.2258
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9957         Pr(|T| > |t|) = 0.0085          Pr(T > t) = 0.0043
Satterthwaite's approximation of 35.1098 is slightly smaller than Welch's 36.2258. Again, keep in mind that these approximations are not integers, but real numbers. The t statistic 2.7817 and its p-value .0086 reject the null hypothesis of equal population means. The north and east have larger death rates from kidney cancer per 100 thousand people than the south and west.
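The two approximations can be checked by hand from the group summary statistics in the output above. The following Python sketch is only an illustration, not part of the STATA workflow; the function names are hypothetical, and the Welch formula is written in the form that STATA documents for its welch option (an assumption worth verifying against the manual).

```python
import math

def unequal_variance_df(n1, sd1, n2, sd2):
    """Return (Satterthwaite, Welch) degrees of freedom for two groups."""
    v1 = sd1 ** 2 / n1          # squared standard error of group 1
    v2 = sd2 ** 2 / n2          # squared standard error of group 2
    total = v1 + v2
    satterthwaite = total ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    welch = -2 + total ** 2 / (v1 ** 2 / (n1 + 1) + v2 ** 2 / (n2 + 1))
    return satterthwaite, welch

def unequal_variance_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic that does not pool the two variances."""
    return (mean1 - mean2) / math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Summary statistics for kidney by west, copied from the output above.
satt, welch = unequal_variance_df(20, 0.3001298, 24, 0.5983722)
t_stat = unequal_variance_t(3.006, 0.3001298, 20, 2.618333, 0.5983722, 24)
print(round(satt, 4), round(welch, 4), round(t_stat, 4))
```

Up to rounding, the printed values should agree with the 35.1098, 36.2258, and 2.7817 reported by STATA.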
For aggregate data, use the .ttesti command with the necessary options.
. ttesti 20 3.006 .3001298 24 2.618333 .5983722, unequal welch
As mentioned earlier, the unpaired option of the .ttest command directly compares two variables without data manipulation. The option treats the two variables as independent of each other. The following is an example of the unpaired and unequal options.
. ttest bladder=kidney, unpaired unequal welch
Two-sample t test with unequal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 bladder |      44    4.121136    .1454679    .9649249    3.827772      4.4145
  kidney |      44    2.794545    .0782542    .5190799    2.636731     2.95236
---------+--------------------------------------------------------------------
combined |      88    3.457841    .1086268    1.019009    3.241933    3.673748
---------+--------------------------------------------------------------------
    diff |            1.326591    .1651806                .9968919     1.65629
------------------------------------------------------------------------------
    diff = mean(bladder) - mean(kidney)                           t =   8.0312
Ho: diff = 0                             Welch's degrees of freedom =  67.0324

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
The F 3.4556 = (.9649249^2)/(.5190799^2) rejects the null hypothesis of equal variance
(p
F 3.9749 = (.5983722^2)/(.3001298^2) and p
Variable Method Num DF Den DF F Value Pr > F
death Folded F 43 43 3.46
6. One-way ANOVA, GLM, and Regression
The t-test is a special case of one-way ANOVA. Thus, one-way ANOVA produces results equivalent to those of the t-test. ANOVA examines mean differences using the F statistic, whereas the t-test reports the t statistic. The one-way ANOVA (t-test), GLM, and linear regression present essentially the same things in different ways.
6.1 One-way ANOVA
Consider the following ANOVA procedure. The CLASS statement is used to specify
categorical variables. The MODEL statement lists the variable to be compared and a grouping
variable, separating them with an equal sign.
PROC ANOVA DATA=masil.smoking;
CLASS smoke;
MODEL lung=smoke;
RUN;
The ANOVA Procedure
Dependent Variable: lung
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 1 313.0311273 313.0311273 28.85 F
smoke 1 313.0311273 313.0311273 28.85 F- - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Model |  313.031127     1  313.031127      28.85     0.0000
            |
      smoke |  313.031127     1  313.031127      28.85     0.0000
            |
   Residual |  455.680427    42   10.849534
-----------+----------------------------------------------------
      Total |  768.711555    43  17.8770129
In SPSS, the ONEWAY command is used.
ONEWAY lung BY smoke
/MISSING ANALYSIS .
6.2 Generalized Linear Model (GLM)
The SAS GLM and MIXED procedures and the SPSS UNIANOVA command also report the F statistic for one-way ANOVA. Note that STATA's .glm command does not perform one-way ANOVA.
PROC GLM DATA=masil.smoking;
CLASS smoke;
MODEL lung=smoke /SS3;
RUN;
The GLM Procedure
Dependent Variable: lung
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 1 313.0311273 313.0311273 28.85 F
smoke 1 313.0311273 313.0311273 28.85
The SAS REG procedure, STATA .regress command, and SPSS REGRESSION command estimate linear regression models.
PROC REG DATA=masil.smoking;
MODEL lung=smoke;
RUN;
The REG Procedure
Model: MODEL1
Dependent Variable: lung
Number of Observations Read 44
Number of Observations Used 44
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 1 313.03113 313.03113 28.85 |t|
Intercept 1 16.98591 0.70225 24.19
       Model |  313.031127     1  313.031127           Prob > F      =  0.0000
    Residual |  455.680427    42   10.849534           R-squared     =  0.4072
-------------+------------------------------           Adj R-squared =  0.3931
       Total |  768.711555    43  17.8770129           Root MSE      =  3.2939

------------------------------------------------------------------------------
        lung |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   5.334545   .9931371     5.37   0.000     3.330314    7.338777
       _cons |   16.98591    .702254    24.19   0.000      15.5687    18.40311
------------------------------------------------------------------------------
The SPSS REGRESSION command looks complicated compared to the SAS REG procedure and STATA .regress command.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
 /DEPENDENT lung
 /METHOD=ENTER smoke.
Note that ANOVA, GLM, and regression report the same F(1, 42) of 28.85, which is equivalent to t(42) of -5.3714. As long as the numerator degrees of freedom is 1, F is always t^2 (28.85 = (-5.3714)^2).
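The identity between F and the squared t can be confirmed with a few lines of arithmetic. This Python sketch is an illustration only; it reuses the coefficient, standard error, and mean squares printed in the regression and ANOVA outputs above, and the variable names are arbitrary.

```python
# Values copied from the regression output for lung on smoke.
coef, std_err = 5.334545, 0.9931371
ms_model, ms_residual = 313.031127, 10.849534

t_stat = coef / std_err            # t statistic of the smoke coefficient
f_stat = ms_model / ms_residual    # F statistic with (1, 42) degrees of freedom

# With one numerator degree of freedom, F equals t squared.
# The sign of t depends only on which group mean is subtracted from which.
print(round(t_stat, 4), round(t_stat ** 2, 2), round(f_stat, 2))
```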
7. Conclusion
The t-test is a basic statistical method for examining the mean difference between two groups. One-way ANOVA can compare means of more than two groups. The number of observations in individual groups does not matter in the t-test or one-way ANOVA; both balanced and unbalanced data are fine. One-way ANOVA, GLM, and linear regression models all use the variance-covariance structure in their analysis, but present the results in different ways.
Researchers must check four issues when performing t-tests. First, a variable to be tested should be interval or ratio so that its mean is substantively meaningful. Do not, for example, run a t-test to compare the mean of skin colors (white=0, yellow=1, black=2) between two countries. If you have a latent variable measured by several Likert-scaled manifest variables, first run a factor analysis to get that latent variable.
Second, examine the normality assumption before conducting a t-test. It is awkward to compare means of variables that are not normally distributed. Figure 2 illustrates a normal probability distribution on top and a Poisson distribution skewed to the right on the bottom. Although the two distributions have the same mean and variance of 1, comparing their means is not likely to be substantively interpretable. This is the rationale for conducting normality tests such as the Shapiro-Wilk W, Shapiro-Francia W, and Kolmogorov-Smirnov D statistics. If the normality assumption is violated, try nonparametric methods.
Figure 2. Comparing Normal and Poisson Probability Distributions (σ² = 1 and λ = 1)
Third, check the equal variance assumption. You should be careful when comparing means of normally distributed variables with different variances. You may conduct the folded form F test. If the equal variance assumption is violated, compute the adjusted t and approximations of the degrees of freedom.
Finally, consider the types of t-tests, data arrangement, and functionalities available in each statistical software package (e.g., STATA, SAS, and SPSS) to determine the best strategy for data analysis (Table 3). The first data arrangement in Figure 1 is commonly used for independent sample t-tests, whereas the second arrangement is appropriate for a paired sample test. Keep in mind that the type II data sets in Figure 1 need to be reshaped into type I in SAS and SPSS.
Table 3. Comparison of T-test Functionalities of STATA, SAS and SPSS
                          STATA 9.0                SAS 9.1                  SPSS 13.0
Test for equal variance   Bartlett's chi-squared   Folded form F            Levene's weighted F
                          (.ttest command)         (TTEST procedure)        (T-TEST command)
Approximation of the      Satterthwaite's DF       Satterthwaite's DF       Satterthwaite's DF
degrees of freedom (DF)   Welch's DF               Cochran-Cox DF
Second Data Arrangement   var1=var2                Reshaping the data set   Reshaping the data set
Aggregate Data            .ttesti command          FREQ option              N/A
SAS has several procedures (e.g., TTEST, MEANS, and UNIVARIATE) and useful options for t-tests. The STATA .ttest and .ttesti commands provide very flexible ways of handling different data arrangements and aggregate data. Table 4 summarizes usages of options in these two commands.
Table 4. Summary of the Usages of the .ttest and .ttesti Command Options
                                 Usage       by(group var)   unequal   welch   unpaired*
Univariate sample                var=c
Paired (dependent) sample        var1=var2
Equal variance (1 variable)      var         O
Equal variance (2 variables)**   var1=var2                                     O
Unequal variance (1 variable)    var         O               O         O
Unequal variance (2 variables)   var1=var2                   O         O       O
* The .ttesti command does not allow the unpaired option.
** The var1=var2 assumes second type of data arrangement in Figure 1.
Appendix: Data Set
Literature: Fraumeni, J. F. 1968. "Cigarette Smoking and Cancers of the Urinary Tract: Geographic Variations in the United States," Journal of the National Cancer Institute, 41(5): 1205-1211.
Data Source: http://lib.stat.cmu.edu
The data are per capita numbers of cigarettes smoked (sold) by 43 states and the District of Columbia in 1960 together with death rates per 100 thousand people from various forms of cancer. The variables used in this document are:

cigar = number of cigarettes smoked (hds per capita)
bladder = deaths per 100k people from bladder cancer
lung = deaths per 100k people from lung cancer
kidney = deaths per 100k people from kidney cancer
leukemia = deaths per 100k people from leukemia
smoke = 1 for those whose cigarette consumption is larger than the median and 0 otherwise
west = 1 for states in the South or West and 0 for those in the North, East or Midwest
The following are summary statistics and normality tests of these variables.
. sum cigar-leukemia
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       cigar |        44    24.91409    5.573286         14       42.4
     bladder |        44    4.121136    .9649249       2.86       6.54
        lung |        44    19.65318    4.228122      12.01      27.27
      kidney |        44    2.794545    .5190799       1.59       4.32
    leukemia |        44    6.829773    .6382589        4.9       8.28
. sfrancia cigar-leukemia
                  Shapiro-Francia W' test for normal data
    Variable |    Obs       W'          V'        z       Prob>z
-------------+-------------------------------------------------
       cigar |     44    0.93061      3.258     2.203    0.01381
     bladder |     44    0.94512      2.577     1.776    0.03789
        lung |     44    0.97809      1.029     0.055    0.47823
      kidney |     44    0.97732      1.065     0.120    0.45217
    leukemia |     44    0.97269      1.282     0.474    0.31759
. tab west smoke
           |         smoke
      west |         0          1 |     Total
-----------+----------------------+----------
         0 |         7         13 |        20
         1 |        15          9 |        24
-----------+----------------------+----------
     Total |        22         22 |        44
References
Fraumeni, J. F. 1968. "Cigarette Smoking and Cancers of the Urinary Tract: Geographic Variations in the United States," Journal of the National Cancer Institute, 41(5): 1205-1211.
Ott, R. Lyman. 1993. An Introduction to Statistical Methods and Data Analysis. Belmont, CA: Duxbury Press.
SAS Institute. 2005. SAS/STAT User's Guide, Version 9.1. Cary, NC: SAS Institute.
SPSS. 2001. SPSS 11.0 Syntax Reference Guide. Chicago, IL: SPSS Inc.
STATA Press. 2005. STATA Reference Manual Release 9. College Station, TX: STATA Press.
Walker, Glenn A. 2002. Common Statistical Methods for Clinical Research with SAS Examples. Cary, NC: SAS Institute.
Acknowledgements
I am grateful to Jeremy Albright, Takuya Noguchi, and Kevin Wilhite at the UITS Center for Statistical and Mathematical Computing, Indiana University, who provided valuable comments and suggestions.
Revision History
2003. First draft
2004. Second draft
2005. Third draft (Added data arrangements and conclusion).
2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 1
http://www.indiana.edu/~statmath
Regression Models for Event Count Data
Using SAS, STATA, and LIMDEP
Hun Myoung Park
This document summarizes regression models for event count data and illustrates how to
estimate individual models using SAS, STATA, and LIMDEP. Example models were tested in SAS
9.1, STATA 9.0, and LIMDEP 8.0.
1. Introduction
2. The Poisson Regression Model (PRM)
3. The Negative Binomial Regression Model (NBRM)
4. The Zero-Inflated Poisson Regression Model (ZIP)
5. The Zero-Inflated Negative Binomial Regression Model (ZINB)
6. Conclusion
7. Appendix
1. Introduction
An event count is the realization of a nonnegative integer-valued random variable (Cameron and Trivedi 1998). Examples are the number of car accidents per month, thunderstorms per year, and wildfires per year. The ordinary least squares (OLS) method for event count data results in biased, inefficient, and inconsistent estimates (Long 1997). Thus, researchers have developed various nonlinear models that are based on the Poisson distribution and negative binomial distribution.
1.1 Count Data Regression Models
The left-hand side (LHS) of the equation has event count data. Independent variables are, as in the OLS, located at the right-hand side (RHS). These RHS variables may be interval, ratio, or binary (dummy). Table 1 below summarizes the categorical dependent variable regression models (CDVMs) according to the level of measurement of the dependent variable.
Table 1. Ordinary Least Squares and CDVMs
Model                            Dependent (LHS)           Method
OLS     Ordinary least squares   Interval or ratio         Moment based method
CDVMs   Binary response          Binary (0 or 1)           Maximum likelihood method
        Ordinal response         Ordinal (1st, 2nd, 3rd)   Maximum likelihood method
        Nominal response         Nominal (A, B, C)         Maximum likelihood method
        Event count data         Count (0, 1, 2, 3)        Maximum likelihood method
Independent (RHS): A linear function of interval/ratio or binary variables, β0 + β1X1 + β2X2 + ...
The Poisson regression model (PRM) and negative binomial regression model (NBRM) are basic models for count data analysis. Either the zero-inflated Poisson (ZIP) or the zero-inflated
$\tilde{\mu}_i = \exp(x_i\beta + \varepsilon_i)$ (Long 1997). Thus, the conditional variance of y becomes larger than its conditional mean, $E(y_i|x_i) = \mu_i$, which remains unchanged. Figure 2 illustrates how the probabilities for small and larger counts increase in the negative binomial distribution as the conditional variance of y increases, given $\mu = 3$.
Figure 2. Negative Binomial Probability Distribution with Alpha of .01, .5, 1, and 5
The PRM and NBRM, however, have the same mean structure. If $\alpha = 0$, the NBRM reduces to the PRM (Cameron and Trivedi 1998; Long 1997).
1.3 Overdispersion
When $Var(y_i|x_i) > E(y_i|x_i)$, we are said to have overdispersion. Estimates of a PRM for overdispersed data are unbiased, but inefficient with standard errors biased downward (Cameron and Trivedi 1998; Long 1997). The likelihood ratio test is developed to examine the null hypothesis of no overdispersion, $H_0: \alpha = 0$. The likelihood ratio follows the chi-squared distribution with one degree of freedom, $LR = 2(\ln L_{NB} - \ln L_{Poisson}) \sim \chi^2(1)$. If the null hypothesis is rejected, the NBRM is preferred to the PRM.
Zero-inflated models handle overdispersion by changing the mean structure to explicitly model the production of zero counts (Long 1997). These models assume two latent groups. One is the always-zero group and the other is the not-always-zero or sometime-zero group. Thus, zero counts come from the former group and some of the latter group with a certain probability.
The likelihood ratio, $LR = 2(\ln L_{ZINB} - \ln L_{ZIP}) \sim \chi^2(1)$, tests $H_0: \alpha = 0$ to compare the ZIP and ZINB. The PRM and ZIP, as well as the NBRM and ZINB, cannot, however, be tested by this likelihood ratio, since neither pair is nested. The Vuong statistic compares these non-nested models. If V is greater than 1.96, the ZIP or ZINB is favored. If V is less than -1.96, the PRM or NBRM is preferred (Long 1997).
1.4 Estimation in SAS, STATA, and LIMDEP
The SAS GENMOD procedure estimates Poisson and negative binomial regression models. STATA has individual commands (e.g., .poisson and .nbreg) for the corresponding count data models. LIMDEP has Poisson$ and Negbin$ commands to estimate various count data models including zero-inflated and zero-truncated models. Table 2 summarizes the procedures and commands for count data regression models.
Table 2. Comparison of the Procedures and Commands for Count Data Models
Model                                     SAS 9.1   STATA 9.0   LIMDEP 8.0
Poisson Regression (PRM)                  GENMOD    .poisson    Poisson$
Negative Binomial Regression (NBRM)       GENMOD    .nbreg      Negbin$
Zero-Inflated Poisson (ZIP)               -         .zip        Poisson; Zip; Rh2$
Zero-inflated Negative Binomial (ZINB)    -         .zinb       Negbin; Zip; Rh2$
Zero-truncated Poisson (ZTP)              -         .ztp        Poisson; Truncation$
Zero-truncated Negative Binomial (ZTNB)   -         .ztnb       Negbin; Truncation$
The example here examines how waste quotas (emps) and the strictness of policy implementation (strict) affect the frequency of waste spill accidents of plants (accident).
1.5 Long and Freese's SPost Module
STATA users may take advantage of user-written modules such as SPost, written by J. Scott Long and Jeremy Freese. The module allows researchers to conduct follow-up analyses of various CDVMs including event count data models. See 2.3 for examples of major SPost commands.
In order to install SPost, execute the following commands consecutively. For more details, visit J. Scott Long's Web site at http://www.indiana.edu/~jslsoc/spost_install.htm.
. net from http://www.indiana.edu/~jslsoc/stata/
. net install spost9_ado, replace
. net get spost9_do, replace
the number of regressors between the unrestricted and restricted models. The chi-squaredstatistic is 124.8218 = 2* [-667.2291 - (-729.6400)] (p ChiSq
Intercept 1 0.3168 0.0306 0.2568 0.3768 107.20
Poisson regression                                Number of obs   =        778
                                                  LR chi2(2)      =     124.82
                                                  Prob > chi2     =     0.0000
Log likelihood = -1821.5101                       Pseudo R2       =     0.0331

------------------------------------------------------------------------------
    accident |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        emps |   .0054186   .0007434     7.29   0.000     .0039615    .0068757
      strict |  -.7041664   .0667619   -10.55   0.000    -.8350174   -.5733154
       _cons |   .3900961   .0466787     8.36   0.000     .2986076    .4815846
------------------------------------------------------------------------------
Let us run a restricted model and then run the .display command in order to double check that the likelihood ratio for goodness-of-fit is 124.8218.
. poisson accident
Iteration 0:   log likelihood =  -1883.921
Iteration 1:   log likelihood =  -1883.921

Poisson regression                                Number of obs   =        778
                                                  LR chi2(0)      =       0.00
                                                  Prob > chi2     =          .
Log likelihood = -1883.921                        Pseudo R2       =     0.0000

------------------------------------------------------------------------------
    accident |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .3168165   .0305995    10.35   0.000     .2568426    .3767904
------------------------------------------------------------------------------
. display 2 * (-1821.5101 - (-1883.921))
124.8218
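The same check, together with the pseudo R2 that STATA prints, can be written outside STATA. The Python sketch below is an illustration under the assumption that the two log likelihoods are copied from the outputs above; the variable names are arbitrary. For two degrees of freedom the chi-squared upper tail probability has the closed form exp(-x/2), so no statistics library is needed.

```python
import math

ll_full = -1821.5101   # log likelihood of the model with emps and strict
ll_null = -1883.921    # log likelihood of the constant-only model

lr = 2 * (ll_full - ll_null)          # likelihood ratio for goodness-of-fit
pseudo_r2 = 1 - ll_full / ll_null     # McFadden's (pseudo) R2
p_value = math.exp(-lr / 2)           # chi-squared(2) upper tail probability

print(round(lr, 4), round(pseudo_r2, 4), p_value)
```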
2.3 Using the SPost Module in STATA
The SPost module provides useful commands for follow-up analyses of various categorical dependent variable models. The .fitstat command calculates various goodness-of-fit statistics such as log likelihood, McFadden's R2 (or Pseudo R2), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC).
. quietly poisson accident emps strict
. fitstat
Measures of Fit for poisson of accident

Log-Lik Intercept Only:      -1883.921   Log-Lik Full Model:          -1821.510
D(775):                       3643.020   LR(2):                         124.822
                                         Prob > LR:                       0.000
McFadden's R2:                   0.033   McFadden's Adj R2:               0.032
Maximum Likelihood R2:           0.148   Cragg & Uhler's R2:              0.149
AIC:                             4.690   AIC*n:                        3649.020
BIC:                         -1515.943   BIC':                         -111.508
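Several of the .fitstat measures follow directly from the two log likelihoods. The Python sketch below is an illustration, not SPost's own code; the formulas are the ones that reproduce the printed values (deviance as -2 times the log likelihood, AIC divided by N, BIC based on the deviance and its degrees of freedom), and the variable names are hypothetical.

```python
import math

n_obs = 778            # number of observations
k = 2                  # regressors (emps, strict), excluding the constant
ll_full = -1821.5101
ll_null = -1883.921

deviance = -2 * ll_full                      # D(775)
lr = 2 * (ll_full - ll_null)                 # LR(2)
aic_n = deviance + 2 * (k + 1)               # AIC*n, penalizing k + 1 parameters
aic = aic_n / n_obs                          # AIC on a per-observation scale
bic = deviance - (n_obs - k - 1) * math.log(n_obs)   # BIC
bic_prime = -lr + k * math.log(n_obs)                # BIC'
mcfadden = 1 - ll_full / ll_null
mcfadden_adj = 1 - (ll_full - (k + 1)) / ll_null

for name, value in [("D", deviance), ("AIC", aic), ("AIC*n", aic_n),
                    ("BIC", bic), ("BIC'", bic_prime),
                    ("McFadden", mcfadden), ("McFadden adj", mcfadden_adj)]:
    print(f"{name:>13}: {value:.3f}")
```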
The .listcoef command lists unstandardized coefficients (parameter estimates), factor and percent changes, and standardized coefficients to help interpret regression results.
. listcoef, help
poisson (N=778): Factor Change in Expected Count

  Observed SD: 2.9482675

----------------------------------------------------------------------
    accident |      b         z     P>|z|    e^b    e^bStdX      SDofX
-------------+--------------------------------------------------------
        emps |   0.00542    7.289   0.000   1.0054   1.2297    38.1548
      strict |  -0.70417  -10.547   0.000   0.4945   0.7031     0.5003
----------------------------------------------------------------------
       b = raw coefficient
       z = z-score for test of b=0
   P>|z| = p-value for z-test
     e^b = exp(b) = factor change in expected count for unit increase in X
 e^bStdX = exp(b*SD of X) = change in expected count for SD increase in X
   SDofX = standard deviation of X
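The factor changes in the .listcoef output are simple transformations of the coefficients. The Python sketch below recomputes them as an illustration; the coefficients and standard deviations are copied from the outputs above, and the dictionary layout is just a convenience, not anything SPost defines.

```python
import math

# (coefficient, standard deviation of X) for each regressor.
coefficients = {
    "emps": (0.0054186, 38.1548),
    "strict": (-0.7041664, 0.5003),
}

factor_changes = {}
for name, (b, sd_x) in coefficients.items():
    unit_factor = math.exp(b)          # e^b: factor change per unit increase
    sd_factor = math.exp(b * sd_x)     # e^bStdX: factor change per SD increase
    percent = 100 * (unit_factor - 1)  # percent change per unit increase
    factor_changes[name] = (unit_factor, sd_factor, percent)
    print(f"{name:>6}: e^b={unit_factor:.4f}  e^bStdX={sd_factor:.4f}  %={percent:.1f}")
```

Read this way, the strict policy multiplies the expected number of accidents by about .49, holding the waste quota constant.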
The .prtab command constructs a table of predicted values (events) for all combinations of categorical variables listed. The following example shows that the predicted number of accidents under the strict policy is .9172 at the mean waste quota (emps=42.0129).
. prtab strict
poisson: Predicted rates for accident

----------------------
   strict | Prediction
----------+-----------
        0 |     1.8547
        1 |     0.9172
----------------------

         emps     strict
x=  42.012853  .50771208
The .prvalue command lists predicted values for a given set of values for the independent variables. For example, the predicted probability of a zero count is .3996 at the mean waste quota under the strict policy (strict=1). Note that the predicted rate of .917 is equivalent to .9172 in the .prtab output above.
. prvalue, x(strict=1) maxcnt(5)
poisson: Predictions for accident

  Predicted rate: .917      95% CI [.827 , 1.02]

  Predicted probabilities:

  Pr(y=0|x):  0.3996   Pr(y=1|x):  0.3665
  Pr(y=2|x):  0.1681   Pr(y=3|x):  0.0514
  Pr(y=4|x):  0.0118   Pr(y=5|x):  0.0022

         emps  strict
x=  42.012853       1
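The predicted rate and probabilities follow from the Poisson mean structure, exp(xb), and the Poisson probability function. The Python sketch below is an illustration that plugs the estimated coefficients into these formulas; the function names are hypothetical and the confidence interval is not reproduced.

```python
import math

# Poisson regression estimates copied from the .poisson output.
B_CONS, B_EMPS, B_STRICT = 0.3900961, 0.0054186, -0.7041664

def predicted_rate(emps, strict):
    """Expected count exp(xb) for given values of the regressors."""
    return math.exp(B_CONS + B_EMPS * emps + B_STRICT * strict)

def poisson_probability(count, rate):
    """Probability of observing `count` events when the mean is `rate`."""
    return math.exp(-rate) * rate ** count / math.factorial(count)

# Mean waste quota under the strict policy, as in .prvalue, x(strict=1).
rate = predicted_rate(42.012853, 1)
probabilities = [poisson_probability(k, rate) for k in range(6)]

print(f"Predicted rate: {rate:.4f}")
for k, p in enumerate(probabilities):
    print(f"Pr(y={k}|x): {p:.4f}")
```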
The most useful command is .prchange, which calculates marginal effects (changes) and discrete changes. For instance, a standard deviation increase in waste quota centered at its mean will increase accidents by .3841 under the lenient policy (strict=0).
. prchange, x(strict=0)
poisson: Changes in Predicted Rate for accident

         min->max      0->1     -+1/2    -+sd/2  MargEfct
   emps    2.3070    0.0080    0.0101    0.3841    0.0101
 strict   -0.9375   -0.9375   -1.3332   -0.6568   -1.3060

exp(xb):   1.8547

           emps   strict
    x=  42.0129        0
sd(x)=  38.1548  .500262
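Most entries in the .prchange table can be recovered from the fitted mean function. The Python sketch below is an illustration that recomputes the marginal effects, the centered standard deviation change for emps, and the 0 to 1 change for strict at strict=0; the helper name is hypothetical.

```python
import math

B_CONS, B_EMPS, B_STRICT = 0.3900961, 0.0054186, -0.7041664
EMPS_MEAN, EMPS_SD = 42.012853, 38.1548

def predicted_rate(emps, strict):
    """Expected count exp(xb) from the Poisson regression."""
    return math.exp(B_CONS + B_EMPS * emps + B_STRICT * strict)

base_rate = predicted_rate(EMPS_MEAN, 0)          # exp(xb) at strict=0

# Marginal effect of a continuous regressor in the PRM: coefficient times exp(xb).
marginal_emps = B_EMPS * base_rate
marginal_strict = B_STRICT * base_rate

# Discrete change for a one SD increase in emps centered at its mean (-+sd/2).
sd_change_emps = (predicted_rate(EMPS_MEAN + EMPS_SD / 2, 0)
                  - predicted_rate(EMPS_MEAN - EMPS_SD / 2, 0))

# Discrete change when strict moves from 0 to 1 at the mean of emps.
change_strict = predicted_rate(EMPS_MEAN, 1) - predicted_rate(EMPS_MEAN, 0)

print(round(base_rate, 4), round(marginal_emps, 4), round(marginal_strict, 4))
print(round(sd_change_emps, 4), round(change_strict, 4))
```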
SPost also includes the .prgen command, which computes a series of predictions by holding all variables but one constant and allowing that variable to vary (Long and Freese 2003). These SPost commands work with most categorical and count data models such as .logit, .probit, .poisson, .nbreg, .zip, and .zinb.
2.4 PRM in LIMDEP
The LIMDEP Poisson$ command estimates the PRM. LIMDEP reports log likelihoods of both the unrestricted and restricted models. Keep in mind that you must include ONE for the intercept.
POISSON;
Lhs=ACCIDENT;
Rhs=ONE,EMPS,STRICT$
+---------------------------------------------+
| Poisson Regression |
| Maximum Likelihood Estimates |
| Model estimated: Aug 24, 2005 at 04:56:45PM.|
| Dependent variable ACCIDENT |
| Weighting variable None |
| Number of observations 778 |
| Iterations completed 8 |
| Log likelihood function -1821.510 |
| Restricted log likelihood -1883.921 |
| Chi squared 124.8218 |
| Degrees of freedom 2 |
| Prob[ChiSqd > value] = .0000000 |
| Chi- squared = 4944.94781 RsqP= -.0051 |
| G - squared = 2827.20794 RsqD= .0423 |
| Overdispersion tests: g=mu(i) : 4.720 |
| Overdispersion tests: g=mu(i)^2: 4.253 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant .3900961420 .46678663E-01 8.357 .0000
EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853
STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
SAS, STATA, and LIMDEP produce almost the same parameter estimates and standard errors (Table 3). The log likelihood in SAS is different from that of STATA and LIMDEP (-667.2291 versus -1821.5101). This difference seems to come from the generalized linear model that the GENMOD procedure uses. These log likelihoods are, however, equivalent in the sense that they result in the same likelihood ratio.
Table 3. Summary of the Poisson Regression Model in SAS, STATA, and LIMDEP
Model                                  SAS 9.1     STATA 9.0    LIMDEP 8.0
Intercept                              .3901       .3901        .3901
                                       (.0467)     (.0467)      (.0467)
EMPS                                   .0054       .0054        .0054
                                       (.0007)     (.0007)      (.0007)
STRICT                                 -.7042      -.7042       -.7042
                                       (.0668)     (.0668)      (.0668)
Log Likelihood (unrestricted)          -667.2291   -1821.5101   -1821.510
Log Likelihood (restricted)            -729.6400   -1883.921    -1883.921
Likelihood Ratio for Goodness-of-fit   124.8218    124.82       124.8218
3. The Negative Binomial Regression Model
The SAS GENMOD procedure, STATA .nbreg command, and LIMDEP Negbin$ command estimate the negative binomial regression model (NBRM).
3.1 NBRM in SAS
The GENMOD procedure estimates the NBRM using the /DIST=NEGBIN option. Note that the dispersion parameter is equivalent to the alpha in STATA and LIMDEP.
PROC GENMOD DATA=masil.accident;
MODEL accident=emps strict /DIST=NEGBIN LINK=LOG;
RUN;
The GENMOD Procedure
Model Information
Data Set COUNT.WASTE
Distribution Negative Binomial
Link Function Log
Dependent Variable Accident
Observations Used 778
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 775 589.7752 0.7610
Scaled Deviance 775 589.7752 0.7610
Pearson Chi-Square 775 845.6033 1.0911
Scaled Pearson X2 775 845.6033 1.0911
Log Likelihood 37.5628
Algorithm converged.
Analysis Of Parameter Estimates
Standard Wald 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq
Intercept 1 0.3851 0.1278 0.1345 0.6357 9.07 0.0026
Emps 1 0.0052 0.0023 0.0008 0.0096 5.29 0.0214
Strict 1 -0.6703 0.1671 -0.9978 -0.3427 16.09
The likelihood ratio for overdispersion is 1409.5838 = 2 * (37.5628 - (-667.2291)).
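The arithmetic behind this statistic can be reproduced with a short script. This is a minimal sketch in Python rather than anything produced by SAS; the function names `likelihood_ratio` and `chibar2_01_pvalue` are illustrative. The p-value assumes the boundary-corrected reference distribution that STATA labels chibar2(01), that is, half the upper tail of a chi-squared distribution with one degree of freedom, because alpha = 0 lies on the boundary of the parameter space.

```python
import math

def likelihood_ratio(ll_unrestricted, ll_restricted):
    """LR statistic: twice the gap between the two log likelihoods."""
    return 2.0 * (ll_unrestricted - ll_restricted)

def chibar2_01_pvalue(lr):
    """Upper-tail p-value for a 50:50 mixture of chi2(0) and chi2(1).

    P(chi2_1 > x) equals erfc(sqrt(x / 2)); the mixture halves it.
    """
    if lr <= 0:
        return 1.0
    return 0.5 * math.erfc(math.sqrt(lr / 2.0))

# Log likelihoods reported by PROC GENMOD: NBRM (37.5628) and PRM (-667.2291).
lr_overdispersion = likelihood_ratio(37.5628, -667.2291)
p_value = chibar2_01_pvalue(lr_overdispersion)

print(f"LR = {lr_overdispersion:.4f}")
print(f"p  = {p_value:.3g}")
```

A statistic this large leaves a p-value that is zero for all practical purposes, so the null hypothesis of no overdispersion is rejected.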
3.2 NBRM in STATA
STATA has the .nbreg command for the NBRM. The command reports three log likelihood statistics: for the PRM, the restricted NBRM (constant-only model), and the unrestricted NBRM (full model), which makes it easy to conduct likelihood ratio tests.
. nbreg accident emps strict
Fitting comparison Poisson model:

Iteration 0:   log likelihood = -1821.5112
Iteration 1:   log likelihood = -1821.5101
Iteration 2:   log likelihood = -1821.5101

Fitting constant-only model:

Iteration 0:   log likelihood = -1256.6761
Iteration 1:   log likelihood = -1152.6155
Iteration 2:   log likelihood = -1125.6643
Iteration 3:   log likelihood = -1125.4183
Iteration 4:   log likelihood = -1125.4183

Fitting full model:

Iteration 0:   log likelihood = -1117.1731
Iteration 1:   log likelihood = -1116.7201
Iteration 2:   log likelihood = -1116.7182
Iteration 3:   log likelihood = -1116.7182

Negative binomial regression                      Number of obs   =        778
                                                  LR chi2(2)      =      17.40
                                                  Prob > chi2     =     0.0002
Log likelihood = -1116.7182                       Pseudo R2       =     0.0077

------------------------------------------------------------------------------
    accident |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        emps |   .0051981   .0022595     2.30   0.021     .0007694    .0096267
      strict |  -.6702548   .1671191    -4.01   0.000    -.9978021   -.3427074
       _cons |   .3851111   .1278468     3.01   0.003      .134536    .6356861
-------------+----------------------------------------------------------------
    /lnalpha |    1.37509   .0885176                      1.201599    1.548582
-------------+----------------------------------------------------------------
       alpha |   3.955434   .3501257                       3.32543    4.704793
------------------------------------------------------------------------------
Likelihood ratio test of alpha=0:  chibar2(01) = 1409.58 Prob>=chibar2 = 0.000
The restricted model or constant-only model gives us a log likelihood of -1125.4183. Thus, the likelihood ratio for goodness-of-fit is 17.4002 = 2 * [-1116.7182 - (-1125.4183)] (p
The likelihood ratio test for overdispersion results in a chi-squared of 1409.5838 (p1 of a binary variable strict, since its marginal change at the mean (.5077) is meaningless.
. prchange
nbreg: Changes in Predicted Rate for accident

         min->max      0->1     -+1/2    -+sd/2  MargEfct
  emps     1.5326    0.0055    0.0068    0.2585    0.0068
strict    -0.8931   -0.8931   -0.8885   -0.4383   -0.8721

exp(xb):   1.3011

           emps    strict
    x=  42.0129   .507712
sd(x)=  38.1548   .500262
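The discrete change and the marginal effect reported above can be recomputed by hand from the NBRM coefficients. The following is a minimal sketch in Python, not a replacement for .prchange; the function name `predicted_rate` and the constant names are illustrative. It assumes the log link, so the predicted rate is exp(xb), and it holds emps at its sample mean as .prchange does.

```python
import math

# NBRM coefficients from the STATA .nbreg output.
B_CONS, B_EMPS, B_STRICT = 0.3851111, 0.0051981, -0.6702548
# Sample means reported by .prchange.
MEAN_EMPS, MEAN_STRICT = 42.0129, 0.507712

def predicted_rate(emps, strict):
    """Expected count exp(xb) under the log link."""
    return math.exp(B_CONS + B_EMPS * emps + B_STRICT * strict)

# Discrete change: switch strict from 0 to 1 with emps held at its mean.
discrete_change = predicted_rate(MEAN_EMPS, 1) - predicted_rate(MEAN_EMPS, 0)

# Marginal effect at the means: coefficient times exp(xb) at the means.
rate_at_means = predicted_rate(MEAN_EMPS, MEAN_STRICT)
marginal_strict = B_STRICT * rate_at_means
marginal_emps = B_EMPS * rate_at_means

print(f"exp(xb) at means       : {rate_at_means:.4f}")
print(f"discrete change 0->1   : {discrete_change:.4f}")
print(f"marginal effect strict : {marginal_strict:.4f}")
print(f"marginal effect emps   : {marginal_emps:.4f}")
```

The marginal effect for strict is evaluated at strict = .5077, a value no observation can take, which is why the 0->1 column is the one to read for a binary regressor.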
3.3 NBRM in LIMDEP
LIMDEP has the Negbin$ command for the NBRM, which reports the PRM as well. Note that the standard errors of parameter estimates are slightly different from those of SAS and STATA. The Marginal Effects$ and the Means$ subcommands compute marginal effects at the mean of independent variables. You may not omit the Means$ subcommand.
NEGBIN;
Lhs=ACCIDENT;
Rhs=ONE,EMPS,STRICT;
Marginal Effects;
Means$
+---------------------------------------------+
| Poisson Regression |
| Maximum Likelihood Estimates |
| Model estimated: Sep 08, 2005 at 09:35:36AM.|
| Dependent variable ACCIDENT |
| Weighting variable None |
| Number of observations 778 |
| Iterations completed 8 |
| Log likelihood function -1821.510 |
| Restricted log likelihood -1883.921 |
| Chi squared 124.8218 |
| Degrees of freedom 2 |
| Prob[ChiSqd > value] = .0000000 |
| Chi- squared = 4944.94781 RsqP= -.0051 |
| G - squared = 2827.20794 RsqD= .0423 |
| Overdispersion tests: g=mu(i) : 4.720 |
| Overdispersion tests: g=mu(i)^2: 4.253 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant .3900961420 .46678663E-01 8.357 .0000
EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853
STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
Normal exit from iterations. Exit status=0.
+---------------------------------------------+
| Negative Binomial Regression |
| Maximum Likelihood Estimates |
| Model estimated: Sep 08, 2005 at 09:35:36AM.|
| Dependent variable ACCIDENT |
| Weighting variable None |
| Number of observations 778 |
| Iterations completed 8 |
| Log likelihood function -1116.718 |
| Restricted log likelihood -1821.510 |
| Chi squared 1409.584 |
| Degrees of freedom 1 |
| Prob[ChiSqd > value] = .0000000 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant .3851110699 .12855240 2.996 .0027
EMPS .5198057234E-02 .22602075E-02 2.300 .0215 42.012853
STRICT -.6702547660 .16729839 -4.006 .0001 .50771208
Dispersion parameter for count data model
Alpha 3.955434012 .35680876 11.086 .0000
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
+-------------------------------------------+
| Partial derivatives of expected val. with |
| respect to the vector of characteristics. |
| They are computed at the means of the Xs. |
| Observations used for means are All Obs. |
| Conditional Mean at Sample Point 1.3011 |
| Scale Factor for Marginal Effects 1.3011 |
+-------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant .5010628939 .19396434 2.583 .0098
EMPS .6763123170E-02 .29746591E-02 2.274 .0230 42.012853
STRICT -.8720595665 .22469308 -3.881 .0001 .50771208
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
4.2 ZIP in LIMDEP
The LIMDEP Poisson$ command needs to have the Zip and Rh2 subcommands. The Rh2 is equivalent to the inflate() option in STATA. The Alg=Newton$ subcommand is needed to use the Newton-Raphson algorithm because the default Broyden algorithm failed to converge.1
POISSON;
Lhs=ACCIDENT;
Rhs=ONE,EMPS,STRICT;
ZIP;
Rh2=ONE,EMPS,STRICT;
Alg=Newton$
+---------------------------------------------+
| Poisson Regression |
| Maximum Likelihood Estimates |
| Model estimated: Sep 06, 2005 at 00:25:07PM.|
| Dependent variable ACCIDENT |
| Weighting variable None |
| Number of observations 778 |
| Iterations completed 8 |
| Log likelihood function -1821.510 |
| Restricted log likelihood -1883.921 |
| Chi squared 124.8218 |
| Degrees of freedom 2 |
| Prob[ChiSqd > value] = .0000000 |
| Chi- squared = 4944.94781 RsqP= -.0051 |
| G - squared = 2827.20794 RsqD= .0423 |
| Overdispersion tests: g=mu(i) : 4.720 |
| Overdispersion tests: g=mu(i)^2: 4.253 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant .3900961420 .46678663E-01 8.357 .0000
EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853
STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
Normal exit from iterations. Exit status=0.
+----------------------------------------------------------------------+
| Zero Altered Poisson Regression Model |
| Logistic distribution used for splitting model. |
| ZAP term in probability is F[tau x Z(i) ] |
| Comparison of estimated models |
| Pr[0|means] Number of zeros Log-likelihood |
| Poisson .27329 Act.= 498 Prd.= 212.6 -1821.51007 |
1 If you get a warning message of "Error: 806: Line search does not improve fn. Exit iterations. Status=3" or "Error: 805: Initial iterations cannot improve function. Status=3", you may change the optimization algorithm or increase the maximum number of iterations (e.g., Maxit=1000$).
| Z.I.Poisson .64642 Act.= 498 Prd.= 502.9 -1259.88568 |
| Note, the ZIP log-likelihood is not directly comparable. |
| ZIP model with nonzero Q does not encompass the others. |
| Vuong statistic for testing ZIP vs. unaltered model is 9.5740 |
| Distributed as standard normal. A value greater than |
| +1.96 favors the zero altered Z.I.Poisson model. |
| A value less than -1.96 rejects the ZIP model. |
+----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Poisson/NB/Gamma regression model
Constant 1.361977491 .23944641E-01 56.880 .0000
EMPS -.2770010575E-03 .37770090E-03 -.733 .4633 42.012853
STRICT -.9239125073E-01 .33326502E-01 -2.772 .0056 .50771208
Zero inflation model
Constant .4886559537 .12210013 4.002 .0001
EMPS -.1098971050E-01 .22152492E-02 -4.961 .0000 42.012853
STRICT 1.057031399 .17715551 5.967 .0000 .50771208
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
In order to estimate the restricted model, run the following command with ONE only in the Rhs subcommand. The Rh2 subcommand remains unchanged.
POISSON;
Lhs=ACCIDENT;
Rhs=ONE;
ZIP; Alg=Newton;
Rh2=ONE,EMPS,STRICT$
Table 5 summarizes parameter estimates and goodness-of-fit statistics for the zero-inflated Poisson model. STATA and LIMDEP report the same parameter estimates, but they produce different standard errors and log likelihoods. In particular, LIMDEP returned a suspicious log likelihood for the restricted model, and thus ended up with the unlikely likelihood ratio of -.0304. In addition, the Vuong statistics in STATA and LIMDEP are different.
Table 5. Summary of the Zero-Inflated Poisson Regression Model in STATA and LIMDEP
Model                                   SAS 9.1      STATA 9.0    LIMDEP 8.0
Intercept                                            1.3620       1.3620
                                                     (.0493)      (.0239)
EMPS                                                 -.0003       -.0003
                                                     (.0009)      (.0004)
STRICT                                               -.0924       -.0924
                                                     (.0729)      (.0333)
Intercept (Zero-inflated)                            .4887        .4887
                                                     (.1211)      (.1221)
EMPS (Zero-inflated)                                 -.0110       -.0110
                                                     (.0023)      (.0022)
STRICT (Zero-inflated)                               1.0570       1.0570
                                                     (.1768)      (.1772)
Log Likelihood (unrestricted)                        -1269.7206   -1259.8857
Log Likelihood (restricted)                          -1270.9523   -1259.8705
Likelihood Ratio for Goodness-of-fit                 2.46         -.0304
Vuong Statistic (ZIP versus PRM)                     8.40         9.5740
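The likelihood ratios in Table 5 follow from the two log likelihoods reported by each package. The sketch below, written in Python with illustrative names, recomputes them and flags a negative value. It rests on the assumption that, for nested models fitted by maximum likelihood, the unrestricted fit should not fall below the restricted fit, so a negative ratio points to a computational problem rather than a substantive result.

```python
def likelihood_ratio(ll_unrestricted, ll_restricted):
    """LR statistic: twice the gap between the two log likelihoods."""
    return 2.0 * (ll_unrestricted - ll_restricted)

# (unrestricted, restricted) ZIP log likelihoods copied from Table 5.
zip_log_likelihoods = {
    "STATA 9.0": (-1269.7206, -1270.9523),
    "LIMDEP 8.0": (-1259.8857, -1259.8705),
}

lr_by_package = {}
for package, (ll_u, ll_r) in zip_log_likelihoods.items():
    lr = likelihood_ratio(ll_u, ll_r)
    lr_by_package[package] = lr
    # A negative LR means the restricted model appears to fit better.
    status = "ok" if lr >= 0 else "suspicious"
    print(f"{package}: LR = {lr:.4f} ({status})")
```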
5. The Zero-Inflated NB Regression Model
STATA and LIMDEP can estimate the zero-inflated negative binomial regression model (ZINB).
5.1 ZINB in STATA (.zinb)
The STATA .zinb command estimates the ZINB. The vuong option computes the Vuong statistic to compare the ZINB and NBRM.
. zinb accident emps strict, inflate(emps strict) vuong
Fitting constant-only model:

Iteration 0:   log likelihood = -1190.5117  (not concave)
Iteration 1:   log likelihood = -1106.9874
Iteration 2:   log likelihood = -1098.8642
Iteration 3:   log likelihood = -1095.3638
Iteration 4:   log likelihood = -1094.0237
Iteration 5:   log likelihood =  -1093.063
Iteration 6:   log likelihood = -1092.6216
Iteration 7:   log likelihood =  -1091.798
Iteration 8:   log likelihood = -1091.7332
Iteration 9:   log likelihood = -1091.7329
Iteration 10:  log likelihood = -1091.7329

Fitting full model:

Iteration 0:   log likelihood = -1091.7329
Iteration 1:   log likelihood = -1089.5565
Iteration 2:   log likelihood = -1089.5198
Iteration 3:   log likelihood = -1089.5198

Zero-inflated negative binomial regression        Number of obs   =        778
                                                  Nonzero obs     =        280
                                                  Zero obs        =        498

Inflation model = logit                           LR chi2(2)      =       4.43
Log likelihood  = -1089.52                        Prob > chi2     =     0.1094

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
accident     |
        emps |  -.0004407   .0020554    -0.21   0.830    -.0044691    .0035877
      strict |  -.3251317   .1659173    -1.96   0.050    -.6503235    .0000602
       _cons |   .7763065   .1508037     5.15   0.000     .4807367    1.071876
-------------+----------------------------------------------------------------
inflate      |
        emps |  -.2087768   .0955122    -2.19   0.029    -.3959772   -.0215763
      strict |   7.562388   3.055775     2.47   0.013     1.573179     13.5516
       _cons |   .1032115   .3800045     0.27   0.786    -.6415835    .8480065
-------------+----------------------------------------------------------------
    /lnalpha |   .9252514   .1351387     6.85   0.000     .6603845    1.190118
-------------+----------------------------------------------------------------
       alpha |   2.522502   .3408876                      1.935536     3.28747
------------------------------------------------------------------------------
Vuong test of zinb vs. standard negative binomial: z =     4.13  Pr>z = 0.0000
The likelihood ratio, 360.4024 = 2 * (-1089.5198 - (-1269.721)), rejects the null hypothesis of no overdispersion, indicating that the ZINB can improve goodness-of-fit over the ZIP (p 1.96, suggests that the ZINB is preferred to the NBRM.
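Both likelihood ratios that appear in the ZINB results can be recomputed from the reported log likelihoods. The following Python sketch uses illustrative function names and is not STATA output. For the goodness-of-fit test with two degrees of freedom it relies on the closed form of the chi-squared upper tail, P(X > x) = exp(-x/2), which holds only for exactly two degrees of freedom.

```python
import math

def likelihood_ratio(ll_unrestricted, ll_restricted):
    """LR statistic: twice the gap between the two log likelihoods."""
    return 2.0 * (ll_unrestricted - ll_restricted)

def chi2_2df_pvalue(lr):
    """Upper tail of a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-lr / 2.0)

# ZINB (-1089.5198) against ZIP (-1269.721): test of overdispersion.
lr_zinb_vs_zip = likelihood_ratio(-1089.5198, -1269.721)

# ZINB full model against its constant-only model: goodness-of-fit.
lr_fit = likelihood_ratio(-1089.5198, -1091.7329)
p_fit = chi2_2df_pvalue(lr_fit)

print(f"LR, ZINB vs ZIP       = {lr_zinb_vs_zip:.4f}")
print(f"LR, goodness-of-fit   = {lr_fit:.4f}")
print(f"p-value (2 df)        = {p_fit:.4f}")
```

The second statistic corresponds to the LR chi2(2) line of the .zinb output, and its p-value above .10 indicates that emps and strict add little to the count equation once zero inflation is modeled.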
5.2 ZINB in LIMDEP
The LIMDEP Negbin$ command needs to have the Zip and Rh2 subcommands for the ZINB. The following command produces the Poisson regression model, negative binomial model, and zero-inflated negative binomial model. You may omit the Alg=Newton$ subcommand.
NEGBIN;
Lhs=ACCIDENT;
Rhs=ONE,EMPS,STRICT; Rh2=ONE,EMPS,STRICT;
ZIP; Alg=Newton$
+---------------------------------------------+
| Poisson Regression |
| Maximum Likelihood Estimates |
| Model estimated: Sep 10, 2005 at 00:20:00AM.|
| Dependent variable ACCIDENT |
| Weighting variable None |
| Number of observations 778 |
| Iterations completed 8 |
| Log likelihood function -1821.510 |
| Restricted log likelihood -1883.921 |
| Chi squared 124.8218 |
| Degrees of freedom 2 |
| Prob[ChiSqd > value] = .0000000 |
| Chi- squared = 4944.94781 RsqP= -.0051 |
| G - squared = 2827.20794 RsqD= .0423 |
| Overdispersion tests: g=mu(i) : 4.720 |
| Overdispersion tests: g=mu(i)^2: 4.253 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant .3900961420 .46678663E-01 8.357 .0000
EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853
STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
Normal exit from iterations. Exit status=0.
+---------------------------------------------+
| Negative Binomial Regression |
| Maximum Likelihood Estimates |
| Model estimated: Sep 10, 2005 at 00:20:00AM.|
| Dependent variable ACCIDENT |
| Weighting variable None |
| Number of observations 778 |
| Iterations completed 12 |
| Log likelihood function -1116.718 |
| Restricted log likel