SW388R7 Data Analysis & Computers II
Slide 1
Multiple Regression – Split Sample Validation
General criteria for split sample validation
Sample problems
Slide 2
General criteria for split sample validation
It is expected that the results obtained from split sample validations will vary somewhat from the results obtained from the analysis using the full data set. We will use the following criteria to decide whether the validation verifies our analysis and supports the generalizability of our findings:
First, the overall relationship between the dependent variable and the set of independent variables must be statistically significant for both validation analyses.
Second, the R² for each validation must be within 5% (plus or minus) of the R² for the model using the full sample.
Slide 3
General criteria for split sample validation - 2
Third, the pattern of statistical significance for the coefficients of the independent variables for both validation analyses must be the same as the pattern for the full analysis, i.e. the same variables are statistically significant or not significant.
For stepwise multiple regression, we require that the same variables be significant, but it is not required that they enter in exactly the same order.
For hierarchical multiple regression, the R² change for the validation must be statistically significant.
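The first three criteria can be written down as a small check. This is only a sketch: the dictionary keys (`f_sig`, `r2`, `coef_sig`) and the function name are hypothetical, the numbers in the usage lines are illustrative rather than taken from the data set, and "within 5%" is assumed here to mean an absolute difference of 0.05 in R². The extra requirements for stepwise and hierarchical regression are not covered.

```python
def validate_split_sample(full, halves, tolerance=0.05, alpha=0.05):
    """Check the three general split-sample criteria.

    `full` and each entry of `halves` are dicts with hypothetical keys:
      "f_sig"    - p-value of the overall F test
      "r2"       - R squared of the model
      "coef_sig" - dict mapping predictor name -> p-value of its t test
    ASSUMPTION: `tolerance` is an absolute difference in R squared.
    """
    def significant(result):
        # Names of the predictors that are significant at `alpha`.
        return {name for name, p in result["coef_sig"].items() if p <= alpha}

    checks = {
        "overall_significant": all(h["f_sig"] <= alpha for h in halves),
        "r2_within_tolerance": all(
            abs(h["r2"] - full["r2"]) <= tolerance for h in halves
        ),
        "same_significance_pattern": all(
            significant(h) == significant(full) for h in halves
        ),
    }
    checks["validated"] = all(checks.values())
    return checks


# Illustrative numbers only (not from GSS2000.sav).
full = {"f_sig": 0.001, "r2": 0.30, "coef_sig": {"x1": 0.01, "x2": 0.40}}
half_a = {"f_sig": 0.002, "r2": 0.33, "coef_sig": {"x1": 0.03, "x2": 0.55}}
half_b = {"f_sig": 0.004, "r2": 0.27, "coef_sig": {"x1": 0.02, "x2": 0.20}}
print(validate_split_sample(full, [half_a, half_b]))
```

Returning the individual checks rather than a single true/false value makes it easy to see which criterion failed.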
Slide 4
Notes
Findings are stated based on the results for the analysis of the full data set, not the validations.
If our validation analysis does not support the findings of the analysis on the full data set, we will declare the answer to the problem to be false. There is, however, another common option, which is to only cite statistical significance for independent variables supported by the validation analysis as well as the full data set analysis. All other variables are considered to be non-significant. Generally it is the independent variables with the weakest individual relationship to the dependent variable which fail to validate.
Slide 5
Problem 1
1. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Validate the results of your regression analysis by splitting the sample in two, using 552804 as the random number seed.
The variables "age" [age], "highest year of school completed" [educ], and "sex" [sex] have a moderate relationship to the variable "occupational prestige score" [prestg80].
Survey respondents who were older had more prestigious occupations. Survey respondents who had completed more years of school had more prestigious occupations. The variable sex did not have a relationship to "occupational prestige score" [prestg80].
1. True 2. True with caution 3. False 4. Inappropriate application of a statistic
Slide 6
Steps prior to the validation analysis
Prior to the split sample validation analysis, we must test for conformity to assumptions and examine outliers, making whatever transformations are needed and removing outliers.
Next, we must solve the regression problem to make certain that the findings (existence, strength, direction, and importance of relationships) stated in the problem are correct and, therefore, in need of validation before final interpretation.
When we do the validation, we include whatever transformations and omission of outliers were present in the model we want to validate.
Slide 7
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 1
ANOVA (b)
Model 1      Sum of Squares   df    Mean Square   F        Sig.
Regression   14765.673        3     4921.891      36.639   .000 (a)
Residual     33315.228        248   134.336
Total        48080.901        251
a. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT, HIGHEST YEAR OF SCHOOL COMPLETED
b. Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE (1980)
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .554 (a)   .307       .299                11.590
a. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT, HIGHEST YEAR OF SCHOOL COMPLETED
b. Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE (1980)
The probability of the F statistic (36.639) for the overall regression relationship is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the set of independent variables and the dependent variable (R² = 0).
We support the research hypothesis that there is a statistically significant relationship between the set of independent variables and the dependent variable.
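As a cross-check on the SPSS output, the F statistic, R² and standard error of the estimate can be recomputed from the sums of squares and degrees of freedom in the ANOVA table. The sketch below uses only those printed values; the variable names are my own, and the results should agree with the table only up to rounding.

```python
import math

# Sums of squares and degrees of freedom copied from the ANOVA table.
ss_regression, df_regression = 14765.673, 3
ss_residual, df_residual = 33315.228, 248
ss_total = ss_regression + ss_residual

ms_regression = ss_regression / df_regression    # mean square, regression
ms_residual = ss_residual / df_residual          # mean square, residual
f_statistic = ms_regression / ms_residual        # overall F test statistic
r_squared = ss_regression / ss_total             # proportion of variance explained
std_error_estimate = math.sqrt(ms_residual)      # standard error of the estimate

print("F =", round(f_statistic, 3))
print("R squared =", round(r_squared, 3))
print("Std. error of the estimate =", round(std_error_estimate, 3))
```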
Slide 8
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT
VARIABLES - 2
The Multiple R for the relationship between the set of independent variables and the dependent variable is 0.554, which would be characterized as moderate using the rule of thumb that a correlation less than or equal to 0.20 is characterized as very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong.
The R Square statistic is used in the validation analysis. In this example, the proportion of variance in the dependent variable explained by all of the independent variables is 30.7%.
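The rule of thumb for describing the strength of a correlation translates directly into a lookup. A minimal sketch, where the function name is hypothetical and taking the absolute value is an added assumption (Multiple R itself is never negative):

```python
def describe_strength(r):
    """Label a correlation using the rule of thumb quoted above."""
    size = abs(r)  # assumption: only the magnitude matters for the label
    if size <= 0.20:
        return "very weak"
    if size <= 0.40:
        return "weak"
    if size <= 0.60:
        return "moderate"
    if size <= 0.80:
        return "strong"
    return "very strong"


print(describe_strength(0.554))  # Multiple R from the Model Summary table
```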
Slide 9
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1
Coefficients (a)
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient)
Model 1                            B        Std. Error   Beta    t        Sig.
(Constant)                         3.039    5.305                .573     .567
AGE OF RESPONDENT                  .125     .045         .149    2.749    .006
HIGHEST YEAR OF SCHOOL COMPLETED   2.810    .271         .563    10.352   .000
RESPONDENTS SEX                    -1.354   1.477        -.049   -.917    .360
a. Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE (1980)
For the independent variable age, the probability of the t statistic (2.749) for the b coefficient is 0.006 which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with age is equal to zero (b = 0) and conclude that there is a statistically significant relationship between age and occupational prestige score.
The b coefficient associated with age (0.125) is positive, indicating a direct relationship in which higher numeric values for age are associated with higher numeric values for occupational prestige score. Therefore, the positive value of b implies that survey respondents who were older had more prestigious occupations.
Slide 10
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO
DEPENDENT VARIABLE - 2
For the independent variable highest year of school completed, the probability of the t statistic (10.352) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with highest year of school completed is equal to zero (b = 0) and conclude that there is a statistically significant relationship between highest year of school completed and occupational prestige score.
The b coefficient associated with highest year of school completed (2.810) is positive, indicating a direct relationship in which higher numeric values for highest year of school completed are associated with higher numeric values for occupational prestige score. Therefore, the positive value of b implies that survey respondents who had completed more years of school had more prestigious occupations.
Slide 11
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO
DEPENDENT VARIABLE - 3
For the independent variable sex, the probability of the t statistic (-0.917) for the b coefficient is 0.360 which is greater than the level of significance of 0.05. We fail to reject the null hypothesis that the slope associated with sex is equal to zero (b = 0) and conclude that there is not a statistically significant relationship between sex and occupational prestige score.
Slide 12
Setting the random number seed
To set the random number seed, select the Random Number Seed… command from the Transform menu.
Slide 13
Set the random number seed
First, click on the Set seed to option button to activate the text box.
Second, type in the random seed stated in the problem.
Third, click on the OK button to complete the dialog box.
Note that SPSS does not provide you with any feedback about the change.
Slide 14
Select the compute command
To enter the formula for the variable that will split the sample in two parts, click on the Compute… command.
Slide 15
The formula for the split variable
First, type the name for the new variable, split, into the Target Variable text box.
Second, the formula for the value of split is shown in the text box.
The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.50.
If the random number is less than or equal to 0.50, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.50, the formula will return a 0, the SPSS numeric equivalent to false.
Third, click on the OK button to complete the dialog box.
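For readers working outside SPSS, the same idea (a seeded random draw per case, compared to 0.50) can be sketched in Python. The function name is hypothetical, and Python's random number generator is not the one SPSS uses, so the same seed will not reproduce the SPSS assignment of cases; the sketch only shows the logic.

```python
import random


def make_split(n_cases, seed):
    """Return one 0/1 flag per case, mimicking `uniform(1) <= 0.5`."""
    rng = random.Random(seed)  # seeded generator, so the split is repeatable
    # 1 when the draw is <= 0.50 (true), 0 otherwise (false).
    return [1 if rng.random() <= 0.5 else 0 for _ in range(n_cases)]


split = make_split(n_cases=10, seed=552804)
print(split)
```

Running the function twice with the same seed returns the same list, which is the property that setting the random number seed provides.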
Slide 16
The split variable in the data editor
In the data editor, the split variable shows a random pattern of zeros and ones.
To select half of the sample for each validation analysis, we will first select the cases where split = 0, then select the cases where split = 1.
Slide 17
Repeat the regression with first validation sample
To repeat the multiple regression analysis for the first validation sample, select Linear Regression from the Dialog Recall tool button.
Slide 18
Using "split" as the selection variable
First, scroll down the list of variables and highlight the variable split.
Second, click on the right arrow button to move the split variable to the Selection Variable text box.
Slide 19
Setting the value of split to select cases
When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split.
Click on the Rule… button to enter a value for split.
Slide 20
Completing the value selection
First, type the value for the first half of the sample, 0, into the Value text box.
Second, click on the Continue button to complete the value entry.
Slide 21
Requesting output for the first validation sample
When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 0 for the split variable.
Click on the OK button to request the output.
Since the validation analysis requires us to compare the results of the analysis using the two split samples, we will request the output for the second sample before doing any comparison.
Slide 22
Repeat the regression with second validation sample
To repeat the multiple regression analysis for the second validation sample, select Linear Regression from the Dialog Recall tool button.
Slide 23
Setting the value of split to select cases
Since the split variable is already in the Selection Variable text box, we only need to change its value.
Click on the Rule… button to enter a different value for split.
Slide 24
Completing the value selection
First, type the value for the second half of the sample, 1, into the Value text box.
Second, click on the Continue button to complete the value entry.
Slide 25
Requesting output for the second validation sample
When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable.
Click on the OK button to request the output.
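The whole procedure (fit on the full sample, then refit on the cases with split = 0 and on the cases with split = 1) can be sketched without SPSS. Everything below is an assumption made for illustration: the data are synthetic, the helper `fit_r_squared` is hypothetical, and it solves the least-squares normal equations in plain Python rather than calling the SPSS regression procedure.

```python
import random


def fit_r_squared(rows, y):
    """Ordinary least squares via the normal equations; returns R squared.

    `rows` is a list of predictor lists (without the constant term).
    """
    X = [[1.0] + list(r) for r in rows]  # add the intercept column
    k = len(X[0])
    # Augmented normal-equation matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    fitted = [sum(b * v for b, v in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1 - ss_residual / ss_total


# Synthetic stand-in data: two predictors and a noisy outcome.
rng = random.Random(0)
rows = [[rng.uniform(20, 80), rng.uniform(8, 20)] for _ in range(200)]
y = [10 + 0.1 * a + 2.5 * e + rng.gauss(0, 10) for a, e in rows]
split = [1 if rng.random() <= 0.5 else 0 for _ in rows]

r2_full = fit_r_squared(rows, y)
print("full sample R squared:", round(r2_full, 3))
for value in (0, 1):
    # Keep only the cases whose split flag equals `value`.
    sub_rows = [r for r, s in zip(rows, split) if s == value]
    sub_y = [yi for yi, s in zip(y, split) if s == value]
    print("split =", value, "R squared:", round(fit_r_squared(sub_rows, sub_y), 3))
```

The filter inside the loop plays the role of the Selection Variable rule: each pass keeps only the cases with the chosen value of split.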
Slide 26
SPLIT-SAMPLE VALIDATION - 1
ANOVA (b,c)
Model 1      Sum of Squares   df    Mean Square   F        Sig.
Regression   9046.938         3     3015.646      25.527   .000 (a)
Residual     13822.120        117   118.138
Total        22869.058        120
a. Predictors: (Constant), RESPONDENTS SEX, HIGHEST YEAR OF SCHOOL COMPLETED, AGE OF RESPONDENT
b. Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE (1980)
c. Selecting only cases for which SPLIT = 1.000
ANOVA (b,c)
Model 1      Sum of Squares   df    Mean Square   F        Sig.
Regression   6186.368         3     2062.123      13.770   .000 (a)
Residual     19019.143        127   149.757
Total        25205.511        130
a. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT, HIGHEST YEAR OF SCHOOL COMPLETED
b. Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE (1980)
c. Selecting only cases for which SPLIT = .000
In both of the split-sample validation analyses, the relationship between the independent variables and the dependent variable was statistically significant.
In the first validation, the probability for the F statistic testing overall relationship was <0.001.
For the second validation analysis, the probability for the F statistic testing overall relationship was <0.001.
Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables.
Slide 27
SPLIT-SAMPLE VALIDATION - 2
Model Summary (SPLIT = 1.000, Selected)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .629 (a)   .396       .380                10.869
a. Predictors: (Constant), RESPONDENTS SEX, HIGHEST YEAR OF SCHOOL COMPLETED, AGE OF RESPONDENT
Model Summary (SPLIT = .000, Selected)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .495 (a)   .245       .228                12.238
a. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT, HIGHEST YEAR OF SCHOOL COMPLETED
The proportion of variance explained using the full data set was 30.7%, compared to 24.5% for the first split sample validation and 39.6% for the second split sample validation.
In both of the split-sample validation analyses, the proportion of variance in the dependent variable explained by the independent variables was not within 5% of the variance explained in the model using the full data set (30.7%).
The strength of the relationship between the independent variables and the dependent variable is not supported by the validation.
The answer to the question is false, because the validation did not verify the strength of the relationship in the analysis of the full data set.
Slide 28
Table of validation results: standard regression
It may be helpful to create a table for our validation results and fill in its cells as we complete the analysis.
                                         Full Data Set    Split = 0 (Split1 = 1)   Split = 1 (Split2 = 1)
ANOVA significance (sig <= 0.05)         <0.001           <0.001                   <0.001
R²                                       0.307            0.245                    0.396
Significant coefficients (sig <= 0.05)   Age; Education   Education                Age; Education
(Age = age of respondent; Education = highest year of school completed)
This table shows us that the validation failed in the strength of relationship row (R²), where the R² for the validations did not fall within 5% of the R² for the full data set. Had we satisfied this criterion, the validation would still have failed in the Significant Coefficients row, where age of respondent was not statistically significant in the first validation analysis.
Slide 29
Problem 2
1. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Validate the results of your regression analysis by splitting the sample in two, using 397155 as the random number seed.
After controlling for the effects of the variables "age" [age] and "sex" [sex], the addition of the variable "happiness of marriage" [hapmar] reduces the error in predicting "general happiness" [happy] by 37.0%.
After controlling for age and sex, the variable happiness of marriage makes an individual contribution to reducing the error in predicting general happiness. Survey respondents who were less happy with their marriages were less happy overall.
1. True 2. True with caution 3. False 4. Inappropriate application of a statistic
Slide 30
Steps prior to the validation analysis
Prior to the split sample validation analysis, we must test for conformity to assumptions and examine outliers, making whatever transformations are needed and removing outliers.
Next, we must solve the regression problem to make certain that the findings (existence, strength, direction, and importance of relationships) stated in the problem are correct and, therefore, in need of validation before final interpretation.
When we do the validation, we include whatever transformations and omission of outliers were present in the model we want to validate.
For hierarchical regression, the R² change statistics must be statistically significant in order for the model to be validated.
Slide 31
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 1
ANOVA (c)
Model                Sum of Squares   df    Mean Square   F        Sig.
1     Regression     .095             2     .047          .145     .865 (a)
      Residual       42.582           130   .328
      Total          42.677           132
2     Regression     15.878           3     5.293         25.476   .000 (b)
      Residual       26.799           129   .208
      Total          42.677           132
a. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT
b. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT, HAPPINESS OF MARRIAGE
c. Dependent Variable: GENERAL HAPPINESS
The probability of the F statistic (25.476) for the overall regression relationship for all independent variables is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the set of all independent variables and the dependent variable (R² = 0).
We support the research hypothesis that there is a statistically significant relationship between the set of all independent variables and the dependent variable.
Slide 32
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 2
Model Summary (c)
                                                                            Change Statistics
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .047 (a)   .002       -.013               .572                         .002              .145       2     130   .865
2       .610 (b)   .372       .357                .456                         .370              75.972     1     129   .000
a. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT
b. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT, HAPPINESS OF MARRIAGE
c. Dependent Variable: GENERAL HAPPINESS
The R Square Change statistic for the increase in R² associated with the added variable (happiness of marriage) is 0.370. Using a proportional reduction in error interpretation for R², information provided by the added variable reduces our error in predicting general happiness by 37.0%.
The R Square statistic is used in the validation analysis. In this example, the proportion of variance in the dependent variable explained by all of the independent variables is 37.2%.
Slide 33
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT
VARIABLES - 3
The probability of the F statistic (75.972) for the change in R² associated with the addition of the predictor variables to the regression analysis containing the control variables is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no improvement in the relationship between the set of independent variables and the dependent variable when the predictors are added (R² Change = 0).
We support the research hypothesis that there is a statistically significant improvement in the relationship between the set of independent variables and the dependent variable.
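The R Square Change and F Change values can be recovered from the sums of squares in the hierarchical ANOVA table for the full data set. The sketch below uses only the printed values (the variable names are my own), so the results should match the SPSS output only up to rounding.

```python
# Values copied from the hierarchical ANOVA table for the full data set.
ss_total = 42.677
ss_reg_controls = 0.095                      # model 1: age and sex only
ss_reg_full, ss_res_full = 15.878, 26.799    # model 2: adds happiness of marriage
df_added, df_res_full = 1, 129               # df1 and df2 for the change test

# Extra explained variation contributed by the added predictor.
ss_added = ss_reg_full - ss_reg_controls

r2_change = ss_added / ss_total
f_change = (ss_added / df_added) / (ss_res_full / df_res_full)

print("R squared change =", round(r2_change, 3))
print("F change =", round(f_change, 2))
```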
Slide 34
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1
Coefficients (a)
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient)
Model                           B       Std. Error   Beta    t       Sig.
1     (Constant)                1.764   .257                 6.863   .000
      AGE OF RESPONDENT         -.002   .003         -.046   -.517   .606
      RESPONDENTS SEX           -.027   .103         -.023   -.259   .796
2     (Constant)                .924    .226                 4.086   .000
      AGE OF RESPONDENT         -.002   .003         -.056   -.789   .432
      RESPONDENTS SEX           -.082   .083         -.071   -.986   .326
      HAPPINESS OF MARRIAGE     .658    .075         .610    8.716   .000
a. Dependent Variable: GENERAL HAPPINESS
For the independent variable happiness of marriage, the probability of the t statistic (8.716) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with happiness of marriage is equal to zero (b = 0) and conclude that there is a statistically significant relationship between happiness of marriage and general happiness.
Slide 35
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO
DEPENDENT VARIABLE - 2
The b coefficient associated with happiness of marriage (0.658) is positive, indicating a direct relationship in which higher numeric values for happiness of marriage are associated with higher numeric values for general happiness. The independent variable happiness of marriage is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who were less happy with their marriages. The dependent variable general happiness is also an ordinal variable. It is coded so that higher numeric values are associated with survey respondents who were less happy overall. Therefore, the positive value of b implies that survey respondents who were less happy with their marriages were less happy overall.
Slide 36
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO
DEPENDENT VARIABLE - 3
Age and sex are variables whose effects are controlled for, so there is no requirement that they be statistically significant.
Slide 37
Setting the random number seed
To set the random number seed, select the Random Number Seed… command from the Transform menu.
Slide 38
Set the random number seed
First, click on the Set seed to option button to activate the text box.
Second, type in the random seed stated in the problem.
Third, click on the OK button to complete the dialog box.
Note that SPSS does not provide you with any feedback about the change.
Slide 39
Select the compute command
To enter the formula for the variable that will split the sample in two parts, click on the Compute… command.
Slide 40
The formula for the split variable
First, type the name for the new variable, split, into the Target Variable text box.
Second, the formula for the value of split is shown in the text box.
The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.50.
If the random number is less than or equal to 0.50, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.50, the formula will return a 0, the SPSS numeric equivalent to false.
Third, click on the OK button to complete the dialog box.
Slide 41
The split variable in the data editor
In the data editor, the split variable shows a random pattern of zeros and ones.
To select half of the sample for each validation analysis, we will first select the cases where split = 0, then select the cases where split = 1.
Slide 42
Repeat the regression with first validation sample
To repeat the multiple regression analysis for the first validation sample, select Linear Regression from the Dialog Recall tool button.
Slide 43
Using "split" as the selection variable
First, scroll down the list of variables and highlight the variable split.
Second, click on the right arrow button to move the split variable to the Selection Variable text box.
Slide 44
Setting the value of split to select cases
When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split.
Click on the Rule… button to enter a value for split.
Slide 45
Completing the value selection
First, type the value for the first half of the sample, 0, into the Value text box.
Second, click on the Continue button to complete the value entry.
Slide 46
Requesting output for the first validation sample
When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 0 for the split variable.
Click on the OK button to request the output.
Since the validation analysis requires us to compare the results of the analysis using the two split samples, we will request the output for the second sample before doing any comparison.
Slide 47
Repeat the regression with second validation sample
To repeat the multiple regression analysis for the second validation sample, select Linear Regression from the Dialog Recall tool button.
Slide 48
Setting the value of split to select cases
Since the split variable is already in the Selection Variable text box, we only need to change its value.
Click on the Rule… button to enter a different value for split.
Slide 49
Completing the value selection
First, type the value for the second half of the sample, 1, into the Value text box.
Second, click on the Continue button to complete the value entry.
Slide 50
Requesting output for the second validation sample
When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable.
Click on the OK button to request the output.
Slide 51
SPLIT-SAMPLE VALIDATION - 1
ANOVA (c,d) (df and F values match the SPLIT = 1.000 model summary)
Model                Sum of Squares   df    Mean Square   F        Sig.
1     Regression     .022             2     .011          .035     .966 (a)
      Residual       19.415           61    .318
      Total          19.438           63
2     Regression     7.685            3     2.562         13.079   .000 (b)
      Residual       11.752           60    .196
      Total          19.438           63
a. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT
ANOVA (c,d) (df and F values match the SPLIT = .000 model summary)
Model                Sum of Squares   df    Mean Square   F        Sig.
1     Regression     .095             2     .047          .137     .872 (a)
      Residual       22.891           66    .347
      Total          22.986           68
2     Regression     8.465            3     2.822         12.631   .000 (b)
      Residual       14.520           65    .223
      Total          22.986           68
a. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT
In both of the split-sample validation analyses, the relationship between the independent variables and the dependent variable was statistically significant.
In the first validation, the probability for the F statistic testing overall relationship was <0.001.
For the second validation analysis, the probability for the F statistic testing overall relationship was <0.001.
Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables.
Slide 52
SPLIT-SAMPLE VALIDATION - 2
Model Summary (SPLIT = .000, Selected)
                                                                            Change Statistics
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .064 (a)   .004       -.026               .589                         .004              .137       2     66    .872
2       .607 (b)   .368       .339                .473                         .364              37.469     1     65    .000
a. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT
b. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT, HAPPINESS OF MARRIAGE
Model Summary (SPLIT = 1.000, Selected)
                                                                            Change Statistics
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .034 (a)   .001       -.032               .564                         .001              .035       2     61    .966
2       .629 (b)   .395       .365                .443                         .394              39.125     1     60    .000
a. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT
b. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT, HAPPINESS OF MARRIAGE
The total proportion of variance explained using the full data set was 37.2%, compared to 36.8% for the first split sample validation and 39.5% for the second split sample validation.
In both of the split-sample validation analyses, the total proportion of variance in the dependent variable explained by the independent variables was within 5% of the variance explained in the model using the full data set (37.2%).
Slide 53
SPLIT-SAMPLE VALIDATION - 3
In both of the split-sample validation analyses, the change in R² from the model including only control variables to the model containing all variables was statistically significant.
In the first validation, the probability for the F statistic testing the change in R² was <0.001.
For the second validation analysis, the probability for the F statistic testing the change in R² was <0.001.
Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables, and the significance of the contribution of the independent variables included after the control variables.
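The F statistic for the change in R² can be approximated from the Model Summary values. The sketch below uses the standard formula F = (ΔR² / df1) / ((1 − R²) / df2), where R² is for the model containing all variables. The function name f_change is an illustrative assumption, and because the inputs are the rounded values printed by SPSS, the results only approximate the reported 37.469 and 39.125.

```python
def f_change(r2_change, r2_full_model, df1, df2):
    """F statistic for the increase in R² when a block of variables is added."""
    return (r2_change / df1) / ((1.0 - r2_full_model) / df2)


# First validation (SPLIT = 0): R² change .364, model 2 R² .368, df1 = 1, df2 = 65.
f_split0 = f_change(0.364, 0.368, 1, 65)
# Second validation (SPLIT = 1): R² change .394, model 2 R² .395, df1 = 1, df2 = 60.
f_split1 = f_change(0.394, 0.395, 1, 60)

print(f"split = 0: F change is roughly {f_split0:.2f}")
print(f"split = 1: F change is roughly {f_split1:.2f}")
```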
Slide 54
SPLIT-SAMPLE VALIDATION - 4
Coefficients

Model  Variable               B      Std. Error  Beta   t      Sig.
1      (Constant)             1.876  .383               4.895  .000
       AGE OF RESPONDENT      -.002  .005        -.059  -.457  .649
       RESPONDENTS SEX        -.058  .152        -.049  -.379  .706
2      (Constant)             .885   .348               2.547  .013
       AGE OF RESPONDENT      -.003  .004        -.080  -.772  .443
       RESPONDENTS SEX        -.014  .122        -.012  -.115  .909
       HAPPINESS OF MARRIAGE  .667   .109        .605   6.121  .000

B and Std. Error are Unstandardized Coefficients; Beta is the Standardized Coefficient.
a. Dependent Variable: GENERAL HAPPINESS
b. Selecting only cases for which SPLIT = .000
The relationship between "happiness of marriage" [hapmar] and "general happiness" [happy] was statistically significant for the model using the full data set (p<0.001). Similarly, the relationships in both of the validation analyses were statistically significant.
In the first validation analysis, the probability for the test of relationship between "happiness of marriage" [hapmar] and "general happiness" [happy] was <0.001, which was less than or equal to 0.05 and statistically significant.
Slide 55
SPLIT-SAMPLE VALIDATION - 5
Coefficients

Model  Variable               B      Std. Error  Beta   t       Sig.
1      (Constant)             1.659  .351               4.730   .000
       AGE OF RESPONDENT      -.001  .005        -.034  -.263   .793
       RESPONDENTS SEX        -.003  .144        -.002  -.018   .986
2      (Constant)             .903   .300               3.004   .004
       AGE OF RESPONDENT      .000   .004        -.011  -.112   .911
       RESPONDENTS SEX        -.164  .116        -.147  -1.421  .161
       HAPPINESS OF MARRIAGE  .675   .108        .645   6.255   .000

B and Std. Error are Unstandardized Coefficients; Beta is the Standardized Coefficient.
a. Dependent Variable: GENERAL HAPPINESS
b. Selecting only cases for which SPLIT = 1.000
In the second validation analysis, the probability for the test of relationship between "happiness of marriage" [hapmar] and "general happiness" [happy] was <0.001, which was less than or equal to 0.05 and statistically significant.
The split sample validation supports the findings of the regression analysis using the full data set. A caution is added because of the inclusion of ordinal level variables.
The answer to the question is true with caution.
Slide 56
Table of validation results: hierarchical regression
Criterion                               Full Data Set          Split = 0 (Split1 = 1)  Split = 1 (Split2 = 1)
ANOVA significance (sig <= 0.05)        <0.001                 <0.001                  <0.001
R²                                      0.372                  0.368                   0.395
R² Change (sig <= 0.05)                 <0.001                 <0.001                  <0.001
Significant Coefficients (sig <= 0.05)  Happiness of marriage  Happiness of marriage   Happiness of marriage
In this example, we satisfy all of the validation criteria. The validation supports the generalizability of the regression model to the population represented by this sample.
NOTE: we add the R² change statistic to the validation results because it indicates the contribution of the variables added in the second stage.
Slide 57
Problem 3
1. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Validate the results of your regression analysis by splitting the sample in two, using 636396 as the random number seed.
From the list of variables "how many in family earned money" [earnrs], "income" [rincom98], and "age" [age], the best predictors of "total family income" [income98] are "income" [rincom98] and "how many in family earned money" [earnrs]. Income and how many in family earned money have a strong relationship to total family income.
The most important predictor of total family income is income. The second most important predictor of total family income is how many in family earned money.
Survey respondents who had higher incomes had higher total family incomes. Survey respondents who had more family members earning money had higher total family incomes.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Slide 58
Steps prior to the validation analysis
Prior to the split sample validation analysis, we must test for conformity to assumptions and examine outliers, making whatever transformations are needed and removing outliers.
Next, we must solve the regression problem to make certain that the findings (existence, strength, direction, and importance of relationships) stated in the problem are correct and, therefore, in need of validation before final interpretation.
When we do the validation, we include whatever transformations and omission of outliers were present in the model we want to validate.
For stepwise regression, we will require that the same variables be selected in each validation analysis. We will not require that they be entered in the same order.
Slide 59
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 1

ANOVA

Model  Source      Sum of Squares  df   Mean Square  F       Sig.
1      Regression  933.833         1    933.833      84.525  .000a
       Residual    1756.639        159  11.048
       Total       2690.472        160
2      Regression  1490.955        2    745.477      98.194  .000b
       Residual    1199.517        158  7.592
       Total       2690.472        160

a. Predictors: (Constant), RINCOM98
b. Predictors: (Constant), RINCOM98, LGEARNRS
c. Dependent Variable: INCOME98
The best subset of predictors for total family income included the independent variables: income and how many in family earned money. The probability of the F statistic (98.194) for the regression relationship which includes these variables is <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that there is no relationship between the best subset of independent variables and the dependent variable (R² = 0).
We support the research hypothesis that there is a statistically significant relationship between the best subset of independent variables and the dependent variable.
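As a cross-check on the ANOVA table, the F statistic is the ratio of the regression mean square to the residual mean square, where each mean square is a sum of squares divided by its degrees of freedom. The Python sketch below recomputes F for model 2 from the printed sums of squares. The function name f_statistic is illustrative only.

```python
def f_statistic(ss_regression, df_regression, ss_residual, df_residual):
    """Overall F test: mean square regression divided by mean square residual."""
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return ms_regression / ms_residual


# Model 2 of the full-sample stepwise regression (RINCOM98 and LGEARNRS).
f_model2 = f_statistic(1490.955, 2, 1199.517, 158)
print(f"F for model 2 = {f_model2:.3f}")
```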
Slide 60
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 2

Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate  R Square Change  F Change  df1  df2
1      .589a  .347      .343               3.324                       .347             84.525    1    159
2      .744b  .554      .549               2.755                       .207             73.384    1    158

Change Statistics: R Square Change, F Change, df1, df2.
b. Predictors: (Constant), RINCOM98, LGEARNRS
c. Dependent Variable: INCOME98
The Multiple R for the relationship between the subset of independent variables that best predict the dependent variable is 0.744, which would be characterized as strong using the rule of thumb that a correlation less than or equal to 0.20 is very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong.
The R Square statistic is used in the validation analysis. In this example, the proportion of variance in the dependent variable explained by the independent variables included in the analysis is 55.4%.
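The rule of thumb for characterizing the strength of a correlation translates directly into a small lookup. This is a minimal Python sketch. The function name describe_strength is an assumption made for illustration, and taking the absolute value is an added convenience for negative correlations (Multiple R itself is never negative).

```python
def describe_strength(r):
    """Label a correlation using the rule-of-thumb cut points 0.20, 0.40, 0.60, 0.80."""
    size = abs(r)
    if size <= 0.20:
        return "very weak"
    if size <= 0.40:
        return "weak"
    if size <= 0.60:
        return "moderate"
    if size <= 0.80:
        return "strong"
    return "very strong"


# Multiple R values from the Model Summary table.
print(describe_strength(0.744))  # model 2
print(describe_strength(0.589))  # model 1
```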
Slide 61
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 3

Variables Entered/Removed

Model  Variables Entered  Variables Removed  Method
1      RINCOM98           .                  Stepwise (Criteria: Probability-of-F-to-enter <= .010, Probability-of-F-to-remove >= .020).
2      LGEARNRS           .                  Stepwise (Criteria: Probability-of-F-to-enter <= .010, Probability-of-F-to-remove >= .020).

a. Dependent Variable: INCOME98
Based on the table of "Variables Entered/Removed," the most important predictor of total family income is income. The second most important predictor of total family income is how many in family earned money.
Slide 62
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1

Coefficients

Model  Variable    B       Std. Error  Beta  t       Sig.  Tolerance  VIF
1      (Constant)  10.902  .730              14.933  .000
       RINCOM98    .460    .050        .589  9.194   .000  1.000      1.000
2      (Constant)  3.932   1.014             3.877   .000
       RINCOM98    .470    .041        .602  11.322  .000  .999       1.001
       LGEARNRS    16.009  1.869       .455  8.566   .000  .999       1.001

B and Std. Error are Unstandardized Coefficients; Beta is the Standardized Coefficient; Tolerance and VIF are Collinearity Statistics.
a. Dependent Variable: INCOME98
For the independent variable income, the probability of the t statistic (11.322) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.01. We reject the null hypothesis that the slope associated with income is equal to zero (b = 0) and conclude that there is a statistically significant relationship between income and total family income.
Slide 63
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2
The b coefficient associated with income (0.470) is positive, indicating a direct relationship in which higher numeric values for income are associated with higher numeric values for total family income. The independent variable income is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who had higher incomes. The dependent variable total family income is also an ordinal variable. It is coded so that higher numeric values are associated with survey respondents who had higher total family incomes. Therefore, the positive value of b implies that survey respondents who had higher incomes had higher total family incomes.
Slide 64
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3
For the independent variable how many in family earned money, the probability of the t statistic (8.566) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.01. We reject the null hypothesis that the slope associated with how many in family earned money is equal to zero (b = 0) and conclude that there is a statistically significant relationship between how many in family earned money and total family income.
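The t statistic quoted above is the b coefficient divided by its standard error. A minimal Python sketch using the printed values for LGEARNRS in model 2 follows. The function name t_statistic is illustrative. Because SPSS prints rounded values, recomputed t statistics can differ slightly from the table, and the difference is larger for coefficients printed with few significant digits.

```python
def t_statistic(b, std_error):
    """t test of the null hypothesis that the slope equals zero (b = 0)."""
    return b / std_error


# LGEARNRS in model 2: B = 16.009, Std. Error = 1.869.
t_lgearnrs = t_statistic(16.009, 1.869)
print(f"t for LGEARNRS = {t_lgearnrs:.3f}")
```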
Slide 65
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 4
The b coefficient associated with how many in family earned money (16.009) is positive, indicating a direct relationship in which higher numeric values for how many in family earned money are associated with higher numeric values for total family income. Therefore, the positive value of b implies that survey respondents who had more family members earning money had higher total family incomes.
If we wanted to specifically interpret the value of the b coefficient (16.009), we would have to convert it out of log units.
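One way to read a coefficient on a logged predictor is to compute the predicted change in the dependent variable for a specific change in the original, unlogged value. The sketch below ASSUMES that LGEARNRS is the base-10 logarithm of earnrs with no constant added before logging. The slides do not state the exact transformation, so treat the numbers as illustrative only. The function name predicted_change is hypothetical.

```python
import math


def predicted_change(b, value_from, value_to):
    """Predicted change in the dependent variable when the unlogged predictor
    moves from value_from to value_to, assuming a base-10 log transformation."""
    return b * (math.log10(value_to) - math.log10(value_from))


# Moving from 1 earner to 2 earners, with b = 16.009 from model 2.
change_1_to_2 = predicted_change(16.009, 1, 2)
print(f"Predicted change in INCOME98 units: {change_1_to_2:.2f}")
```

Under that assumption, each doubling of the number of earners is associated with the same predicted increase, which is why the coefficient cannot be read as a change per additional earner.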
Slide 66
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 5

Excluded Variables

Model  Variable  Beta In  t      Sig.  Partial Correlation  Tolerance  VIF    Minimum Tolerance
1      LGEARNRS  .455a    8.566  .000  .563                 .999       1.001  .999
       AGE       .094a    1.415  .159  .112                 .931       1.074  .931
2      AGE       .090b    1.646  .102  .130                 .931       1.074  .930

Tolerance, VIF, and Minimum Tolerance are Collinearity Statistics.
a. Predictors in the Model: (Constant), RINCOM98
b. Predictors in the Model: (Constant), RINCOM98, LGEARNRS
c. Dependent Variable: INCOME98
The relationship between "age" [age] and "total family income" [income98] was not statistically significant for the model using the full data set (p=0.102).
Slide 67
Setting the random number seed
To set the random number seed, select the Random Number Seed… command from the Transform menu.
Slide 68
Set the random number seed
First, click on the Set seed to option button to activate the text box.
Second, type in the random seed stated in the problem.
Third, click on the OK button to complete the dialog box.
Note that SPSS does not provide you with any feedback about the change.
Slide 69
Select the compute command
To enter the formula for the variable that will split the sample in two parts, click on the Compute… command.
Slide 70
The formula for the split variable
First, type the name for the new variable, split, into the Target Variable text box.
Second, the formula for the value of split is shown in the text box.
The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.50.
If the random number is less than or equal to 0.50, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.50, the formula will return a 0, the SPSS numeric equivalent to false.
Third, click on the OK button to complete the dialog box.
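The logic of the split formula can be sketched outside SPSS. The Python example below draws one uniform random number per case and assigns 1 when the draw is less than or equal to 0.50, otherwise 0. Python's random number generator is not the SPSS generator, so the same seed will NOT reproduce the SPSS split. The function name make_split and the case count of 161 (taken from the total degrees of freedom plus one in the full-sample ANOVA table) are illustrative assumptions.

```python
import random


def make_split(n_cases, seed):
    """Return a list of 0/1 flags, one per case, from a seeded uniform draw."""
    rng = random.Random(seed)  # a private generator keeps the split reproducible
    return [1 if rng.random() <= 0.50 else 0 for _ in range(n_cases)]


split = make_split(161, 636396)

# Indices of the cases that would be used in each validation analysis.
first_half = [i for i, flag in enumerate(split) if flag == 0]
second_half = [i for i, flag in enumerate(split) if flag == 1]
print(len(first_half), len(second_half))
```

Fixing the seed is what makes the split repeatable: running the sketch again with the same seed yields the same assignment of cases.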
Slide 71
The split variable in the data editor
In the data editor, the split variable shows a random pattern of zeros and ones.
To select half of the sample for each validation analysis, we will first select the cases where split = 0, then select the cases where split = 1.
Slide 72
Repeat the regression with first validation sample
To repeat the multiple regression analysis for the first validation sample, select Linear Regression from the Dialog Recall tool button.
Slide 73
Using "split" as the selection variable
First, scroll down the list of variables and highlight the variable split.
Second, click on the right arrow button to move the split variable to the Selection Variable text box.
Slide 74
Setting the value of split to select cases
When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split. Click on the Rule… button to enter a value for split.
Slide 75
Completing the value selection
First, type the value for the first half of the sample, 0, into the Value text box.
Second, click on the Continue button to complete the value entry.
Slide 76
Requesting output for the first validation sample
When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 0 for the split variable.
Click on the OK button to request the output.
Since the validation analysis requires us to compare the results of the analysis using the two split samples, we will request the output for the second sample before doing any comparison.
Slide 77
Repeat the regression with second validation sample
To repeat the multiple regression analysis for the second validation sample, select Linear Regression from the Dialog Recall tool button.
Slide 78
Setting the value of split to select cases
Since the split variable is already in the Selection Variable text box, we only need to change its value.
Click on the Rule… button to enter a different value for split.
Slide 79
Completing the value selection
First, type the value for the second half of the sample, 1, into the Value text box.
Second, click on the Continue button to complete the value entry.
Slide 80
Requesting output for the second validation sample
When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable.
Click on the OK button to request the output.
Slide 81
SPLIT-SAMPLE VALIDATION - 1

ANOVA (first validation)

Model  Source      Sum of Squares  df  Mean Square  F       Sig.
1      Regression  378.332         1   378.332      29.477  .000a
       Residual    1039.619        81  12.835
       Total       1417.952        82
2      Regression  787.811         2   393.905      50.009  .000b
       Residual    630.141         80  7.877
       Total       1417.952        82

a. Predictors: (Constant), LGEARNRS
b. Predictors: (Constant), LGEARNRS, RINCOM98
c. Dependent Variable: INCOME98
d. Selecting only cases for which SPLIT = .0000

ANOVA (second validation)

Model  Source      Sum of Squares  df  Mean Square  F       Sig.
1      Regression  575.955         1   575.955      63.161  .000a
       Residual    693.032         76  9.119
       Total       1268.987        77
2      Regression  722.986         2   361.493      49.656  .000b
       Residual    546.001         75  7.280
       Total       1268.987        77

a. Predictors: (Constant), RINCOM98
b. Predictors: (Constant), RINCOM98, LGEARNRS
c. Dependent Variable: INCOME98
d. Selecting only cases for which SPLIT = 1.0000
In both of the split-sample validation analyses, the relationship between the independent variables and the dependent variable was statistically significant.
In the first validation, the probability for the F statistic testing overall relationship was <0.001.
For the second validation analysis, the probability for the F statistic testing overall relationship was <0.001.
Slide 82
SPLIT-SAMPLE VALIDATION - 2

Model Summary (first validation)

Model  R, SPLIT = .0000 (Selected)  R, SPLIT ~= .0000 (Unselected)  R Square  Adjusted R Square  Std. Error of the Estimate  R Square Change  F Change  df1  df2
1      .517a                                                        .267      .258               3.583                       .267             29.477    1    81
2      .745b                        .739                            .556      .544               2.807                       .289             51.986    1    80

Change Statistics: R Square Change, F Change, df1, df2.
b. Predictors: (Constant), LGEARNRS, RINCOM98
c. Unless noted otherwise, statistics are based only on cases for which SPLIT = .0000.
d. Dependent Variable: INCOME98

Model Summary (second validation)

Model  R, SPLIT = 1.0000 (Selected)  R, SPLIT ~= 1.0000 (Unselected)  R Square  Adjusted R Square  Std. Error of the Estimate  R Square Change  F Change  df1  df2
1      .674a                                                          .454      .447               3.020                       .454             63.161    1    76
2      .755b                         .799                             .570      .558               2.698                       .116             20.197    1    75

Change Statistics: R Square Change, F Change, df1, df2.
b. Predictors: (Constant), RINCOM98, LGEARNRS
c. Unless noted otherwise, statistics are based only on cases for which SPLIT = 1.0000.
d. Dependent Variable: INCOME98
The total proportion of variance explained in the model using the full data set was 55.4%, compared to 55.6% for the first split sample validation and 57.0% for the second split sample validation.
In both of the split-sample validation analyses, the total proportion of variance in the dependent variable explained by the independent variables was within 5% of the variance explained in the model using the full data set (55.4%).
Slide 83
SPLIT-SAMPLE VALIDATION - 3

Excluded Variables (first validation)

Model  Variable  Beta In  t      Sig.  Partial Correlation  Tolerance  VIF    Minimum Tolerance
1      RINCOM98  .538a    7.210  .000  .628                 .996       1.004  .996
       AGE       .235a    2.549  .013  .274                 .999       1.001  .999
2      AGE       .088b    1.128  .263  .126                 .917       1.091  .914

Tolerance, VIF, and Minimum Tolerance are Collinearity Statistics.
a. Predictors in the Model: (Constant), LGEARNRS
b. Predictors in the Model: (Constant), LGEARNRS, RINCOM98
c. Dependent Variable: INCOME98

Excluded Variables (second validation)

Model  Variable  Beta In  t      Sig.  Partial Correlation  Tolerance  VIF    Minimum Tolerance
1      LGEARNRS  .340a    4.494  .000  .461                 1.000      1.000  1.000
       AGE       .075a    .860   .392  .099                 .941       1.063  .941
2      AGE       .091b    1.170  .246  .135                 .939       1.065  .939

Tolerance, VIF, and Minimum Tolerance are Collinearity Statistics.
a. Predictors in the Model: (Constant), RINCOM98
b. Predictors in the Model: (Constant), RINCOM98, LGEARNRS
c. Dependent Variable: INCOME98
The relationship between "age" [age] and "total family income" [income98] was not statistically significant for the model using the full data set (p=0.102). Similarly, the relationships in both of the validation analyses were not statistically significant.
In the first validation analysis, the probability for the test of relationship between "age" [age] and "total family income" [income98] was 0.263, which was greater than the level of significance of 0.01 and not statistically significant.
In the second validation analysis, the probability for the test of relationship between "age" [age] and "total family income" [income98] was 0.246, which was greater than the level of significance of 0.01 and not statistically significant.
Slide 84
SPLIT-SAMPLE VALIDATION - 4

Coefficients

Model  Variable    B       Std. Error  Beta  t      Sig.  Tolerance  VIF
1      (Constant)  9.585   1.425             6.724  .000
       LGEARNRS    17.243  3.176       .517  5.429  .000  1.000      1.000
2      (Constant)  3.083   1.435             2.148  .035
       LGEARNRS    18.322  2.493       .549  7.351  .000  .996       1.004
       RINCOM98    .447    .062        .538  7.210  .000  .996       1.004

B and Std. Error are Unstandardized Coefficients; Beta is the Standardized Coefficient; Tolerance and VIF are Collinearity Statistics.
a. Dependent Variable: INCOME98
b. Selecting only cases for which SPLIT = .0000
The relationship between "how many in family earned money" [earnrs] and "total family income" [income98] was statistically significant for the model using the full data set (p<0.001). Similarly, the relationships in both of the validation analyses were statistically significant.
In the first validation analysis, the probability for the test of relationship between "how many in family earned money" [earnrs] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant.
Slide 85
SPLIT-SAMPLE VALIDATION - 5

Coefficients

Model  Variable    B       Std. Error  Beta  t       Sig.  Tolerance  VIF
1      (Constant)  10.525  .921              11.429  .000
       RINCOM98    .495    .062        .674  7.947   .000  1.000      1.000
2      (Constant)  5.158   1.450             3.556   .001
       RINCOM98    .492    .056        .670  8.850   .000  1.000      1.000
       LGEARNRS    12.783  2.844       .340  4.494   .000  1.000      1.000

B and Std. Error are Unstandardized Coefficients; Beta is the Standardized Coefficient; Tolerance and VIF are Collinearity Statistics.
a. Dependent Variable: INCOME98
b. Selecting only cases for which SPLIT = 1.0000
In the second validation analysis, the probability for the test of relationship between "how many in family earned money" [earnrs] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant.
Slide 86
SPLIT-SAMPLE VALIDATION - 6
The relationship between "income" [rincom98] and "total family income" [income98] was statistically significant for the model using the full data set (p<0.001). Similarly, the relationships in both of the validation analyses were statistically significant.
In the first validation analysis, the probability for the test of relationship between "income" [rincom98] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant.
Slide 87
SPLIT-SAMPLE VALIDATION - 7
In the second validation analysis, the probability for the test of relationship between "income" [rincom98] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant.
The split sample validation supports the findings of the regression analysis using the full data set. A caution is added because of the inclusion of ordinal level variables.
The answer to the question is true with caution.
Slide 88
Table of validation results: stepwise regression
Criterion                                         Full Data Set                     Split = 0 (Split1 = 1)            Split = 1 (Split2 = 1)
ANOVA significance (sig <= 0.01)                  <0.001                            <0.001                            <0.001
R²                                                0.554                             0.556                             0.570
Significant Coefficients (R² change sig <= 0.01)  Respondent’s income               How many in family earned income  Respondent’s income
                                                  How many in family earned income  Respondent’s income               How many in family earned income

NOTE: we use a lower level of significance to offset stepwise regression’s tendency to over fit a model.
In this example, we satisfy all of the validation criteria. The validation supports the generalizability of the regression model to the population represented by this sample.
The same variables entered the validation analyses in a different order. The difference in order does not negate the validation.
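The validation criteria for the stepwise example can be summarized in a short check. The Python sketch below compares each validation sample with the full-sample results on ANOVA significance, R², and the set of significant variables. Using sets makes the comparison ignore the order in which variables entered. The function name validates and the dictionary keys are illustrative assumptions, and probabilities reported as <0.001 are entered as 0.001, an upper bound rather than the exact value.

```python
def validates(full, splits, alpha, r2_tolerance=0.05):
    """Return True when every validation sample meets all three criteria."""
    for sample in splits:
        if sample["anova_sig"] > alpha:
            return False  # overall relationship not significant
        if abs(sample["r2"] - full["r2"]) > r2_tolerance + 1e-12:
            return False  # R² not within 5 points of the full-sample R²
        if set(sample["significant"]) != set(full["significant"]):
            return False  # different variables significant (order is ignored)
    return True


full = {"anova_sig": 0.001, "r2": 0.554, "significant": ["rincom98", "lgearnrs"]}
split0 = {"anova_sig": 0.001, "r2": 0.556, "significant": ["lgearnrs", "rincom98"]}
split1 = {"anova_sig": 0.001, "r2": 0.570, "significant": ["rincom98", "lgearnrs"]}

outcome = validates(full, [split0, split1], alpha=0.01)
print("Validation supports the full-sample findings:", outcome)
```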
Slide 89
Split sample validation - 1
The following is a guide to the decision process for answering problems about split sample validation:
Dependent variable metric? Independent variables metric or dichotomous?
  No: Inappropriate application of a statistic
  Yes: Compute regression analysis, using transformations and omitting outliers as needed.
Ratio of cases to independent variables at least 5 to 1?
  No: Inappropriate application of a statistic
  Yes: continue
Slide 90
Split sample validation - 2
Are there statistical findings requiring validation?
  No: False
  Yes: continue
Enough valid cases to split sample and keep 5 to 1 ratio of cases/variables?
  Yes:
    • Set the random seed and compute the split variable
    • Re-run regression with split = 0
    • Re-run regression with split = 1
  No:
    • Set the first random seed and compute the split1 variable
    • Re-run regression with split1 = 1
    • Set the second random seed and compute the split2 variable
    • Re-run regression with split2 = 1
Slide 91
Split sample validation - 3
Compute regression analysis for both validation samples, using same method as analysis of full sample.
Probability of ANOVA test <= level of significance for both validation analyses?
  No: False
  Yes: continue
R² for both validations within 5% of R² for analysis of full data set?
  No: False
  Yes: continue
Slide 92
Split sample validation - 4
Pattern of significance for independent variables in both validations matches pattern for full data set?
  No: False
  Yes: continue
Change in R² statistically significant in both validation analyses? (Hierarchical only)
  No: False
  Yes: continue
Satisfies ratio for preferred sample size: 15 to 1 (stepwise: 50 to 1)?
  No: True with caution
  Yes: continue
Slide 93
Split sample validation - 5
Other cautions added for ordinal variables or violation of assumptions?
  Yes: True with caution
  No: True
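The decision process for split sample validation problems can be condensed into one function. The Python sketch below is one reading of the guide, not an official scoring rule. The function name answer and its parameter names are hypothetical, and each argument is the yes/no result of the corresponding question in the guide.

```python
def answer(levels_ok, min_ratio_ok, findings_ok, anova_ok, r2_ok, pattern_ok,
           r2_change_ok=True, preferred_ratio_ok=True, other_cautions=False):
    """Map the yes/no checks of the decision guide to one of the four answers."""
    # Level of measurement and the 5 to 1 minimum ratio gate the whole analysis.
    if not levels_ok or not min_ratio_ok:
        return "Inappropriate application of a statistic"
    # Any failed finding or validation criterion makes the statement false.
    # r2_change_ok applies to hierarchical regression only; leave it True otherwise.
    if not (findings_ok and anova_ok and r2_ok and pattern_ok and r2_change_ok):
        return "False"
    # Missing the preferred sample size, or any other caution, softens the answer.
    if not preferred_ratio_ok or other_cautions:
        return "True with caution"
    return "True"


# The stepwise example: all criteria met, with a caution for ordinal variables.
stepwise_result = answer(True, True, True, True, True, True, other_cautions=True)
print(stepwise_result)
```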