Learn to Test for Heteroscedasticity in SPSS With Data ...
Transcript of Learn to Test for Heteroscedasticity in SPSS With Data ...
Learn to Test for
Heteroscedasticity in SPSS With
Data From the China Health and
Nutrition Survey (2006)
© 2015 SAGE Publications, Ltd. All Rights Reserved.
This PDF has been generated from SAGE Research Methods Datasets.
Learn to Test for
Heteroscedasticity in SPSS With
Data From the China Health and
Nutrition Survey (2006)
How-to Guide for IBM® SPSS® Statistics Software
Introduction
In this guide you will learn how to detect heteroscedasticity following a linear
regression model in IBM® SPSS® Statistical Software (SPSS), using a practical
example to illustrate the process. You will find links to the example dataset and
you are encouraged to replicate this example. An additional practice example
is suggested at the end of this guide. The example assumes you have already
opened the data file in SPSS.
Contents
1. Heteroscedasticity
2. An Example in SPSS: Blood Pressure and Age in China
2.1 The SPSS Procedure
2.2 Exploring the SPSS Output
3. Your Turn
1 Heteroscedasticity
Linear regression models estimated via Ordinary Least Squares (OLS) rest on
several assumptions, one if which is that the variance of the residual from the
SAGE
2015 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
1
Page 2 of 11 Learn to Test for Heteroscedasticity in SPSS With Data From the China
Health and Nutrition Survey (2006)
model is constant and unrelated to the independent variable(s). Constant variance
is called homoscedasticity, while non-constant variance is called
heteroscedasticity. This example illustrates how to detect heteroscedasticity
following the estimation of a simple linear regression model.
2 An Example in SPSS: Blood Pressure and Age in China
This example uses two variables from the 2006 China Health and Nutrition
Survey:
• A person’s systolic blood pressure (systolic).
• A person’s age, measured in years (age).
There are 9178 respondents in this survey. Systolic blood pressure measures
the pressure in a person’s arteries when their heart beats (contracts and pumps
blood). In this dataset, this variable ranges from 70 to 240 with a mean of about
122 and a standard deviation of 18.14. Age is measured in years, and in this
dataset it ranges from 17 to 95 with a mean of about 49 and a standard deviation
of 15.19. Both of these variables are continuous, making them appropriate for
simple regression.
2.1 The SPSS Procedure
Before producing the simple regression model, it is a good idea to look at each
variable separately. However, in the interest of space, we forgo doing so here.
Readers should explore the SAGE Research Methods Dataset examples
associated with Simple Regression and Multiple Regression for more information.
You estimate a simple regression model in SPSS by selecting from the menu:
Analyze → Regression → Linear
In the Linear Regression dialog box that opens, move the systolic blood pressure
SAGE
2015 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
1
Page 3 of 11 Learn to Test for Heteroscedasticity in SPSS With Data From the China
Health and Nutrition Survey (2006)
variable (systolic) into the Dependent: window and move the age variable (age)
into the Independent(s): window.
Figure 1 shows what this looks like in SPSS.
Figure 1: Selecting simple regression from the Analyze menu in SPSS.
Because we want to explore whether there is evidence of heteroscedasticity
among the residuals of this regression, we also want to produce a scatter plot
that plots the standardized residuals on the Y-axis and the standardized predicted
values of the dependent variable on the X-axis.
SAGE
2015 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
1
Page 4 of 11 Learn to Test for Heteroscedasticity in SPSS With Data From the China
Health and Nutrition Survey (2006)
First, we click the “Plots…” button on the right-hand side of the Linear Regression
dialog box. That opens a second dialog box. In this second dialog box, move
*ZRESID into the open box under Y: and *ZPRED into the open box under X: as
shown in Figure 2
Figure 2: Producing a two-way scatter plot of standardized residuals and
standardized predicted values for a regression model in the Linear Regression:
Plots dialog box in SPSS.
Once you are done, click Continue in this dialog box, and then click OK to perform
the analysis.
2.2 Exploring the SPSS Output
Figure 3 presents five tables of results that are produced by the simple linear
regression procedure in SPSS. The fifth table is produced because we asked
SPSS to produce plots using the standardized residuals. The fourth table in Figure
3, outlined in red, includes the results of the regression model itself.
Figure 3: Simple regression of systolic blood pressure on age, 2006 China Health
SAGE
2015 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
1
Page 5 of 11 Learn to Test for Heteroscedasticity in SPSS With Data From the China
Health and Nutrition Survey (2006)
and Nutrition Survey.
SAGE
2015 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
1
Page 6 of 11 Learn to Test for Heteroscedasticity in SPSS With Data From the China
Health and Nutrition Survey (2006)
SAGE
2015 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
1
Page 7 of 11 Learn to Test for Heteroscedasticity in SPSS With Data From the China
Health and Nutrition Survey (2006)
The first three tables in Figure 3 report the independent variable(s) entered into
the model, some summary fit statistics for the regression model, and an analysis
of variance for the model as a whole. While detailed examination of these tables
is beyond the scope of this example, we note in the second table that R Square
measures the proportion of the variance in the dependent variable explained by
the model, which in this case consists of a single independent variable. An R
Square of 0.157 means that approximately 15.7% of the variance in systolic blood
pressure is accounted for by age.
The fourth table in Figure 3, outlined in red, presents the estimates of the
intercept, or constant, and the slope coefficient. The results report an estimate
of the intercept (or constant) as equal to approximately 98.56. The constant of a
simple regression model can be interpreted as the average expected value of the
dependent variable when the independent variable equals zero. In this case, our
independent variable, age, can never be zero, so the constant by itself does not
tell us much.
The estimated value for the slope coefficient linking age to systolic blood pressure
is estimated to be approximately 0.47. This represents the average marginal effect
of age on systolic blood pressure, and can be interpreted as the expected change
on average in the dependent variable for a one-unit increase in the independent
variable. For this example, that means that every increase in age of 1 year is
associated with an average increase of about 0.47 in systolic blood pressure.
The fourth table in Figure 3 reports that this estimate is statistically significantly
different from zero, with a p-value well below 0.001. This leads us to reject the
null hypothesis and conclude that there does appear to be a positive relationship
between a person’s age and their systolic blood pressure in China.
Figure 4 presents a plot with the standardized residuals of this regression on the
Y-axis and the standardized predicted values of the dependent variable on the X-
SAGE
2015 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
1
Page 8 of 11 Learn to Test for Heteroscedasticity in SPSS With Data From the China
Health and Nutrition Survey (2006)
axis. Figure 4 shows that the vertical spread of the residuals is relatively low for
respondents with lower predicted levels of systolic blood pressure. However, as
we move left to right and the predicted level of systolic blood pressure increases,
we see the vertical spread of the residuals also increasing. This spread appears to
shrink somewhat at the very highest predicted values for systolic blood pressure.
Overall, Figure 4 shows a pattern in the variance of the residuals, meaning that
we appear to have evidence of heteroscedasticity.
Figure 4: Two-way scatter plot of standardized residuals from the regression
shown in forth table of Figure 3 on the Y-axis and standardized predicted values
of the dependent variable from that regression on the X-axis, 2006 China Health
and Nutrition Survey.
SAGE
2015 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
1
Page 9 of 11 Learn to Test for Heteroscedasticity in SPSS With Data From the China
Health and Nutrition Survey (2006)
Unfortunately, SPSS does not include any formal tests of heteroscedasticity.
Users can create macros within SPSS to perform specific functions not built into
the software, but that process is beyond the scope of this example. Example code
for a macro that includes the Breusch–Pagen test, and a tutorial video on how to
use it, can be found at the following links:
http://www.spsstools.net/Syntax/RegressionRepeatedMeasure/Breusch-
PaganAndKoenkerTest.sps
https://www.youtube.com/watch?v=3QcX4jqPn14
Applying the steps of the Breusch–Pagen test to this example results in a test
statistic of 652.33. When compared to a Chi-squared distribution with 1 degree of
SAGE
2015 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
1
Page 10 of 11 Learn to Test for Heteroscedasticity in SPSS With Data From the China
Health and Nutrition Survey (2006)
freedom, the resulting p-value falls well below the standard 0.05 level. Thus we
have clear evidence to reject the null hypothesis of homoscedasticity and accept
the alternative hypothesis that we do in fact have heteroscedasticity in the residual
of this regression model.
3 Your Turn
Download this sample dataset and see if you can replicate these results. Then
repeat the analysis, this time replacing the systolic blood pressure variable with a
variable measuring diastolic blood pressure (diastolic) as the dependent variable
and then explore whether or not there is evidence of heteroscedasticity in the
residuals of the regression.
IBM® SPSS® Statistics software (SPSS) screenshots Republished Courtesy of
International Business Machines Corporation, © International Business Machines
Corporation. SPSS Inc. was acquired by IBM in October, 2009. IBM, the IBM logo,
ibm.com, and SPSS are trademarks or registered trademarks of International
Business Machines Corporation, registered in many jurisdictions worldwide. Other
product and service names might be trademarks of IBM or other companies. A
current list of IBM trademarks is available on the Web at “IBM Copyright and
trademark information” at http://www.ibm.com/legal/copytrade.shtml.
SAGE
2015 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods Datasets Part
1
Page 11 of 11 Learn to Test for Heteroscedasticity in SPSS With Data From the China
Health and Nutrition Survey (2006)