SW388R7 Data Analysis & Computers II

Slide 1

Multiple Regression – Split Sample Validation

General criteria for split sample validation

Sample problems



Slide 2

General criteria for split sample validation

It is expected that the results obtained from split sample validations will vary somewhat from the results obtained from the analysis using the full data set. We will use the following criteria to decide whether the validation verifies our analysis and supports the generalizability of our findings:

First, the overall relationship between the dependent variable and the set of independent variables must be statistically significant for both validation analyses.

Second, the R² for each validation must be within 5 percentage points (plus or minus) of the R² for the model using the full sample.



Slide 3

General criteria for split sample validation - 2

Third, the pattern of statistical significance for the coefficients of the independent variables for both validation analyses must be the same as the pattern for the full analysis, i.e. the same variables are statistically significant or not significant.

For stepwise multiple regression, we require that the same variables be significant, but it is not required that they enter in exactly the same order.

For hierarchical multiple regression, the R² change for the validation must be statistically significant.
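The criteria for a standard regression can be written as a small check. This is only a sketch: the `Fit` container and the function name are hypothetical, the 5% rule is read as 5 percentage points of R², and the example values are the ones reported for Problem 1 in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fit:
    """Summary of one regression run (hypothetical container, not SPSS output)."""
    f_sig: float            # probability of the overall F statistic
    r_squared: float        # R Square from the Model Summary table
    significant: frozenset  # predictors with sig <= alpha


def validation_failures(full, halves, alpha=0.05, tolerance=0.05):
    """Return the criteria that fail; an empty list means the model validated."""
    failures = []
    if any(h.f_sig > alpha for h in halves):
        failures.append("overall relationship")
    # Assumption: "within 5%" means within 0.05 on the R squared scale.
    if any(abs(h.r_squared - full.r_squared) > tolerance for h in halves):
        failures.append("R squared")
    if any(h.significant != full.significant for h in halves):
        failures.append("pattern of significance")
    return failures


# Problem 1 values; 0.0005 stands in for a reported probability of "<0.001".
full = Fit(0.0005, 0.307, frozenset({"age", "educ"}))
split0 = Fit(0.0005, 0.245, frozenset({"educ"}))
split1 = Fit(0.0005, 0.396, frozenset({"age", "educ"}))
print(validation_failures(full, [split0, split1]))
```

Returning the list of failed criteria, rather than a single true/false value, keeps the reason for a failed validation visible.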



Slide 4

Notes

Findings are stated based on the results for the analysis of the full data set, not the validations.

If our validation analysis does not support the findings of the analysis on the full data set, we will declare the answer to the problem to be false. There is, however, another common option, which is to only cite statistical significance for independent variables supported by the validation analysis as well as the full data set analysis. All other variables are considered to be non-significant. Generally it is the independent variables with the weakest individual relationship to the dependent variable which fail to validate.



Slide 5

Problem 1

1. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Validate the results of your regression analysis by splitting the sample in two, using 552804 as the random number seed.

The variables "age" [age], "highest year of school completed" [educ], and "sex" [sex] have a moderate relationship to the variable "occupational prestige score" [prestg80].

Survey respondents who were older had more prestigious occupations. Survey respondents who had completed more years of school had more prestigious occupations. The variable sex did not have a relationship to "occupational prestige score" [prestg80].

1. True 2. True with caution 3. False 4. Inappropriate application of a statistic



Slide 6

Steps prior to the validation analysis

Prior to the split sample validation analysis, we must test for conformity to assumptions and examine outliers, making whatever transformations are needed and removing outliers.

Next, we must solve the regression problem to make certain that the findings (existence, strength, direction, and importance of relationships) stated in the problem are correct and, therefore, in need of validation before final interpretation.

When we do the validation, we include whatever transformations and omission of outliers were present in the model we want to validate.



Slide 7

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 1

ANOVA (b)

Model 1       Sum of Squares    df   Mean Square        F   Sig.
Regression         14765.673     3      4921.891   36.639   .000 (a)
Residual           33315.228   248       134.336
Total              48080.901   251

a. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT, HIGHEST YEAR OF SCHOOL COMPLETED
b. Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE (1980)

Model Summary (b)

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .554 (a)   .307       .299                11.590

a. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT, HIGHEST YEAR OF SCHOOL COMPLETED
b. Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE (1980)

The probability of the F statistic (36.639) for the overall regression relationship is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the set of independent variables and the dependent variable (R² = 0).

We support the research hypothesis that there is a statistically significant relationship between the set of independent variables and the dependent variable.
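As a check on the ANOVA table, the F statistic can be recomputed from the sums of squares and degrees of freedom. A minimal sketch, with a function name of our own choosing:

```python
def f_statistic(ss_regression, df_regression, ss_residual, df_residual):
    """F is the regression mean square divided by the residual mean square."""
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return ms_regression / ms_residual


# Sums of squares and df from the full-sample ANOVA table for Problem 1.
f_full = f_statistic(14765.673, 3, 33315.228, 248)
print(round(f_full, 3))
```

The result agrees with the reported F of 36.639 to within rounding.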



Slide 8

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 2

The Multiple R for the relationship between the set of independent variables and the dependent variable is 0.554, which would be characterized as moderate using the rule of thumb that a correlation less than or equal to 0.20 is characterized as very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong.

The R Square statistic is used in the validation analysis. In this example, the proportion of variance in the dependent variable explained by all of the independent variables is 30.7%.
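The Model Summary figures and the rule of thumb can be reproduced from the ANOVA sums of squares. This is a sketch only: the function name is ours, the case count of 252 is inferred from the total df of 251, and the adjusted R² line uses the usual textbook formula rather than anything stated on the slides.

```python
import math


def strength(r):
    """Label a correlation using the rule of thumb quoted in the text."""
    size = abs(r)
    if size <= 0.20:
        return "very weak"
    if size <= 0.40:
        return "weak"
    if size <= 0.60:
        return "moderate"
    if size <= 0.80:
        return "strong"
    return "very strong"


ss_regression, ss_total = 14765.673, 48080.901
n_cases, n_predictors = 252, 3  # total df of 251 implies 252 cases

r_squared = ss_regression / ss_total
multiple_r = math.sqrt(r_squared)
adjusted = 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)
print(round(multiple_r, 3), round(r_squared, 3), round(adjusted, 3), strength(multiple_r))
```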



Slide 9

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1

Coefficients (a)

                                   Unstandardized         Standardized
Model 1                            B        Std. Error    Beta      t        Sig.
(Constant)                          3.039   5.305                     .573   .567
AGE OF RESPONDENT                    .125    .045          .149      2.749   .006
HIGHEST YEAR OF SCHOOL COMPLETED    2.810    .271          .563     10.352   .000
RESPONDENTS SEX                    -1.354   1.477         -.049      -.917   .360

a. Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE (1980)

For the independent variable age, the probability of the t statistic (2.749) for the b coefficient is 0.006 which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with age is equal to zero (b = 0) and conclude that there is a statistically significant relationship between age and occupational prestige score.

The b coefficient associated with age (0.125) is positive, indicating a direct relationship in which higher numeric values for age are associated with higher numeric values for occupational prestige score. Therefore, the positive value of b implies that survey respondents who were older had more prestigious occupations.



Slide 10





For the independent variable highest year of school completed, the probability of the t statistic (10.352) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with highest year of school completed is equal to zero (b = 0) and conclude that there is a statistically significant relationship between highest year of school completed and occupational prestige score.

The b coefficient associated with highest year of school completed (2.810) is positive, indicating a direct relationship in which higher numeric values for highest year of school completed are associated with higher numeric values for occupational prestige score. Therefore, the positive value of b implies that survey respondents who had completed more years of school had more prestigious occupations.



Slide 11





For the independent variable sex, the probability of the t statistic (-0.917) for the b coefficient is 0.360 which is greater than the level of significance of 0.05. We fail to reject the null hypothesis that the slope associated with sex is equal to zero (b = 0) and conclude that there is not a statistically significant relationship between sex and occupational prestige score.
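Each t statistic in the Coefficients table is the b coefficient divided by its standard error. The sketch below recomputes them from the printed values. Because the table rounds B and Std. Error to three decimals, the results for age and education differ slightly from the reported 2.749 and 10.352. The short variable names are the SPSS names given in the problem statement.

```python
# (B, Std. Error) pairs copied from the full-sample Coefficients table.
coefficients = {
    "age": (0.125, 0.045),
    "educ": (2.810, 0.271),
    "sex": (-1.354, 1.477),
}

# t = B / Std. Error for each predictor.
t_values = {name: b / se for name, (b, se) in coefficients.items()}

for name, t in t_values.items():
    print(name, round(t, 3))
```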



Slide 12

Setting the random number seed

To set the random number seed, select the Random Number Seed… command from the Transform menu.



Slide 13

Set the random number seed

First, click on the Set seed to option button to activate the text box.

Second, type in the random seed stated in the problem.

Third, click on the OK button to complete the dialog box.

Note that SPSS does not provide you with any feedback about the change.



Slide 14

Select the compute command

To enter the formula for the variable that will split the sample in two parts, click on the Compute… command.



Slide 15

The formula for the split variable

First, type the name for the new variable, split, into the Target Variable text box.

Second, the formula for the value of split is shown in the text box.

The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.50.

If the random number is less than or equal to 0.50, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.50, the formula will return a 0, the SPSS numeric equivalent to false.

Third, click on the OK button to complete the dialog box.
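The same idea can be sketched outside SPSS: set a seed, draw a uniform random number for each case, and code the case 1 when the draw is less than or equal to 0.50. The function name is hypothetical, and Python's generator is not the SPSS generator, so the same seed will not assign the same cases to each half as SPSS does.

```python
import random


def make_split(n_cases, seed):
    """Return a list of 0/1 values, one per case, from a seeded generator."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    return [1 if rng.random() <= 0.50 else 0 for _ in range(n_cases)]


# 252 cases (total df of 251 in the full-sample ANOVA), seed from Problem 1.
split = make_split(252, seed=552804)
print(split[:10], sum(split), len(split) - sum(split))
```

As in SPSS, the two halves are only approximately equal in size, because each case is assigned independently.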



Slide 16

The split variable in the data editor

In the data editor, the split variable shows a random pattern of zeros and ones.

To select half of the sample for each validation analysis, we will first select the cases where split = 0, then select the cases where split = 1.



Slide 17

Repeat the regression with first validation sample

To repeat the multiple regression analysis for the first validation sample, select Linear Regression from the Dialog Recall tool button.



Slide 18

Using "split" as the selection variable

First, scroll down the list of variables and highlight the variable split.

Second, click on the right arrow button to move the split variable to the Selection Variable text box.



Slide 19

Setting the value of split to select cases

When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split. Click on the Rule… button to enter a value for split.



Slide 20

Completing the value selection

First, type the value for the first half of the sample, 0, into the Value text box.

Second, click on the Continue button to complete the value entry.



Slide 21

Requesting output for the first validation sample

When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 0 for the split variable.

Click on the OK button to request the output.

Since the validation analysis requires us to compare the results of the analysis using the two split samples, we will request the output for the second sample before doing any comparison.
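The select-then-repeat pattern can be sketched as one function that runs the same analysis on the cases with split = 0 and then on the cases with split = 1. To stay short, the sketch uses a one-predictor regression instead of the three predictors in the problem. The function names and the data rows are made up for illustration and are not GSS2000 values.

```python
def r_squared(xs, ys):
    """R squared for a one-predictor least-squares fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


def r_squared_by_split(rows):
    """rows holds (split, x, y) tuples; returns {split value: R squared}."""
    result = {}
    for value in (0, 1):
        # Keep only the cases whose split value matches, as the Rule dialog does.
        subset = [(x, y) for s, x, y in rows if s == value]
        xs, ys = zip(*subset)
        result[value] = r_squared(xs, ys)
    return result


# Made-up rows for illustration only.
rows = [(0, 1, 2), (1, 2, 3), (0, 3, 5), (1, 4, 4), (0, 5, 6), (1, 6, 8)]
by_split = r_squared_by_split(rows)
print(by_split)
```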



Slide 22

Repeat the regression with second validation sample

To repeat the multiple regression analysis for the second validation sample, select Linear Regression from the Dialog Recall tool button.



Slide 23


Since the split variable is already in the Selection Variable text box, we only need to change its value.

Click on the Rule… button to enter a different value for split.



Slide 24






Slide 25

Requesting output for the second validation sample


Click on the OK button to request the output.



Slide 26

SPLIT-SAMPLE VALIDATION - 1

ANOVA (b, c)

Model 1       Sum of Squares    df   Mean Square        F   Sig.
Regression          9046.938     3      3015.646   25.527   .000 (a)
Residual           13822.120   117       118.138
Total              22869.058   120

a. Predictors: (Constant), RESPONDENTS SEX, HIGHEST YEAR OF SCHOOL COMPLETED, AGE OF RESPONDENT
b. Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE (1980)
c. Selecting only cases for which SPLIT = 1.000

ANOVA (b, c)

Model 1       Sum of Squares    df   Mean Square        F   Sig.
Regression          6186.368     3      2062.123   13.770   .000 (a)
Residual           19019.143   127       149.757
Total              25205.511   130

a. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT, HIGHEST YEAR OF SCHOOL COMPLETED
b. Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE (1980)
c. Selecting only cases for which SPLIT = .000

In both of the split-sample validation analyses, the relationship between the independent variables and the dependent variable was statistically significant.

In the first validation, the probability for the F statistic testing overall relationship was <0.001.

For the second validation analysis, the probability for the F statistic testing overall relationship was <0.001.

Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables.



Slide 27


Model Summary

Model   R, SPLIT = 1.000 (Selected)   R Square   Adjusted R Square   Std. Error of the Estimate
1       .629 (a)                      .396       .380                10.869

a. Predictors: (Constant), RESPONDENTS SEX, HIGHEST YEAR OF SCHOOL COMPLETED, AGE OF RESPONDENT

Model Summary

Model   R, SPLIT = .000 (Selected)    R Square   Adjusted R Square   Std. Error of the Estimate
1       .495 (a)                      .245       .228                12.238

a. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT, HIGHEST YEAR OF SCHOOL COMPLETED

The proportion of variance explained using the full data set was 30.7%, compared to 24.5% for the first split sample validation (split = 0) and 39.6% for the second split sample validation (split = 1).

In both of the split-sample validation analyses, the proportion of variance in the dependent variable explained by the independent variables was not within 5% of the variance explained in the model using the full data set (30.7%).

The strength of the relationship between the independent variables and the dependent variable is not supported by the validation.

The answer to the question is false, because the validation did not verify the strength of the relationship in the analysis of the full data set.



Slide 28

Table of validation results: standard regression

                                    Full Data Set              Split = 0                  Split = 1
                                                               (Split1 = 1)               (Split2 = 1)

ANOVA significance (sig <= 0.05)    <0.001                     <0.001                     <0.001

R²                                  0.307                      0.245                      0.396

Significant Coefficients            Age of respondent          Highest year of            Age of respondent
(sig <= 0.05)                       Highest year of            school completed           Highest year of
                                    school completed                                      school completed

This table shows us that the validation failed in the strength of relationship row (R²), where the R² for the validations did not fall within 5% of the R² for the full data set. Had we satisfied this criterion, the validation would still have failed in the Significant Coefficients row, where age of respondent was not statistically significant in the first validation analysis.

It may be helpful to create a table for our validation results and fill in its cells as we complete the analysis.



Slide 29

Problem 2


After controlling for the effects of the variables "age" [age] and "sex" [sex], the addition of the variable "happiness of marriage" [hapmar] reduces the error in predicting "general happiness" [happy] by 37.0%.

After controlling for age and sex, the variable happiness of marriage makes an individual contribution to reducing the error in predicting general happiness. Survey respondents who were less happy with their marriages were less happy overall.




Slide 30





For hierarchical regression, the R² change statistics must be statistically significant in order for the model to be validated.



Slide 31

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 1

ANOVA (c)

Model              Sum of Squares    df   Mean Square        F   Sig.
1   Regression               .095     2          .047     .145   .865 (a)
    Residual               42.582   130          .328
    Total                  42.677   132
2   Regression             15.878     3         5.293   25.476   .000 (b)
    Residual               26.799   129          .208
    Total                  42.677   132

a. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT
b. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT, HAPPINESS OF MARRIAGE
c. Dependent Variable: GENERAL HAPPINESS

The probability of the F statistic (25.476) for the overall regression relationship for all independent variables is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the set of all independent variables and the dependent variable (R² = 0).

We support the research hypothesis that there is a statistically significant relationship between the set of all independent variables and the dependent variable.



Slide 32

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 2

Model Summary (c)

                                                               Change Statistics
Model   R          R Square   Adjusted   Std. Error of   R Square   F Change   df1   df2   Sig. F
                              R Square   the Estimate    Change                            Change
1       .047 (a)   .002       -.013      .572            .002         .145     2     130   .865
2       .610 (b)   .372        .357      .456            .370       75.972     1     129   .000

b. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT, HAPPINESS OF MARRIAGE

The R Square Change statistic for the increase in R² associated with the added variables (happiness of marriage) is 0.370. Using a proportional reduction in error interpretation for R², information provided by the added variables reduces our error in predicting general happiness by 37.0%.

The R Square statistic is used in the validation analysis. In this example, the proportion of variance in the dependent variable explained by all of the independent variables is 37.2%.



Slide 33

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 3

The probability of the F statistic (75.972) for the change in R² associated with the addition of the predictor variables to the regression analysis containing the control variables is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no improvement in the relationship between the set of independent variables and the dependent variable when the predictors are added (R² Change = 0).

We support the research hypothesis that there is a statistically significant improvement in the relationship between the set of independent variables and the dependent variable.
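The R Square Change and F Change statistics can be recomputed from the two blocks of the hierarchical ANOVA table. The sketch below uses function names of our own choosing: the change in the regression sum of squares is divided by the total sum of squares for R² change, and by the residual mean square of the larger model for F change.

```python
def r_squared_change(ss_reg_reduced, ss_reg_full, ss_total):
    """Increase in R squared when the added variables enter the model."""
    return (ss_reg_full - ss_reg_reduced) / ss_total


def f_change(ss_reg_reduced, ss_reg_full, ss_res_full, df_added, df_res_full):
    """F test for the increase in explained variation."""
    gain_per_df = (ss_reg_full - ss_reg_reduced) / df_added
    return gain_per_df / (ss_res_full / df_res_full)


# Model 1 (age, sex) and Model 2 (plus happiness of marriage), full data set.
change = r_squared_change(0.095, 15.878, 42.677)
f_chg = f_change(0.095, 15.878, 26.799, df_added=1, df_res_full=129)
print(round(change, 3), round(f_chg, 2))
```

Both values agree with the reported .370 and 75.972 to within the rounding of the printed sums of squares.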



Slide 34

Coefficients (a)

                               Unstandardized        Standardized
Model                          B       Std. Error    Beta      t       Sig.
1   (Constant)                 1.764   .257                    6.863   .000
    AGE OF RESPONDENT          -.002   .003          -.046     -.517   .606
    RESPONDENTS SEX            -.027   .103          -.023     -.259   .796
2   (Constant)                  .924   .226                    4.086   .000
    AGE OF RESPONDENT          -.002   .003          -.056     -.789   .432
    RESPONDENTS SEX            -.082   .083          -.071     -.986   .326
    HAPPINESS OF MARRIAGE       .658   .075           .610     8.716   .000

a. Dependent Variable: GENERAL HAPPINESS



For the independent variable happiness of marriage, the probability of the t statistic (8.716) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with happiness of marriage is equal to zero (b = 0) and conclude that there is a statistically significant relationship between happiness of marriage and general happiness.



Slide 35





The b coefficient associated with happiness of marriage (0.658) is positive, indicating a direct relationship in which higher numeric values for happiness of marriage are associated with higher numeric values for general happiness. The independent variable happiness of marriage is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who were less happy with their marriages. The dependent variable general happiness is also an ordinal variable. It is coded so that higher numeric values are associated with survey respondents who were less happy overall. Therefore, the positive value of b implies that survey respondents who were less happy with their marriages were less happy overall.



Slide 36





Age and sex are variables whose effects are controlled for, so there is no requirement that they be statistically significant.



Slide 51

ANOVA (c, d)

Model              Sum of Squares    df   Mean Square        F   Sig.
1   Regression               .022     2          .011     .035   .966 (a)
    Residual               19.415    61          .318
    Total                  19.438    63
2   Regression              7.685     3         2.562   13.079   .000 (b)
    Residual               11.752    60          .196
    Total                  19.438    63

ANOVA (c, d)

Model              Sum of Squares    df   Mean Square        F   Sig.
1   Regression               .095     2          .047     .137   .872 (a)
    Residual               22.891    66          .347
    Total                  22.986    68
2   Regression              8.465     3         2.822   12.631   .000 (b)
    Residual               14.520    65          .223
    Total                  22.986    68






Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables.



Slide 52

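Model Summary

                                                               Change Statistics
Model   R          R Square   Adjusted   Std. Error of   R Square   F Change   df1   df2   Sig. F
                              R Square   the Estimate    Change                            Change
1       .064 (a)   .004       -.026      .589            .004         .137     2     66    .872
2       .607 (b)   .368        .339      .473            .364       37.469     1     65    .000

b. Predictors: (Constant), RESPONDENTS SEX, AGE OF RESPONDENT, HAPPINESS OF MARRIAGE

Model Summary, SPLIT = 1.000 (Selected)

                                                               Change Statistics
Model   R          R Square   Adjusted   Std. Error of   R Square   F Change   df1   df2   Sig. F
                              R Square   the Estimate    Change                            Change
1       .034 (a)   .001       -.032      .564            .001         .035     2     61    .966
2       .629 (b)   .395        .365      .443            .394       39.125     1     60    .000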




The total proportion of variance explained using the full data set was 37.2%, compared to 36.8% for the first split sample validation and 39.5% for the second split sample validation.

In both of the split-sample validation analyses, the total proportion of variance in the dependent variable explained by the independent variables was within 5% of the variance explained in the model using the full data set (37.2%).
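The conclusion that both validations fall within 5% holds when the criterion is read as a difference of 5 percentage points in R². The sketch below contrasts that reading with a relative reading (5% of the full-sample R²), under which the second validation would not pass. Both function names are hypothetical.

```python
def within_points(r2_validation, r2_full, points=0.05):
    """True when the two R squared values differ by at most 5 percentage points."""
    return abs(r2_validation - r2_full) <= points


def within_relative(r2_validation, r2_full, fraction=0.05):
    """True when the difference is at most 5% of the full-sample R squared."""
    return abs(r2_validation - r2_full) <= fraction * r2_full


r2_full, r2_first, r2_second = 0.372, 0.368, 0.395
print(within_points(r2_first, r2_full), within_points(r2_second, r2_full))
print(within_relative(r2_first, r2_full), within_relative(r2_second, r2_full))
```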



Slide 53





In both of the split-sample validation analyses, the change in R² from the model including only control variables to the model containing all variables was statistically significant.

In the first validation, the probability for the F statistic testing the change in R² was <0.001.

For the second validation analysis, the probability for the F statistic testing the change in R² was <0.001.

Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables, and the significance of the contribution of the independent variables included after the control variables.



Slide 54


Coefficients (a, b)

                               Unstandardized        Standardized
Model                          B       Std. Error    Beta      t       Sig.
1   (Constant)                 1.876   .383                    4.895   .000
    AGE OF RESPONDENT          -.002   .005          -.059     -.457   .649
    RESPONDENTS SEX            -.058   .152          -.049     -.379   .706
2   (Constant)                  .885   .348                    2.547   .013
    AGE OF RESPONDENT          -.003   .004          -.080     -.772   .443
    RESPONDENTS SEX            -.014   .122          -.012     -.115   .909
    HAPPINESS OF MARRIAGE       .667   .109           .605     6.121   .000

a. Dependent Variable: GENERAL HAPPINESS
b. Selecting only cases for which SPLIT = .000

The relationship between "happiness of marriage" [hapmar] and "general happiness" [happy] was statistically significant for the model using the full data set (p<0.001). Similarly, the relationships in both of the validation analyses were statistically significant.

In the first validation analysis, the probability for the test of relationship between "happiness of marriage" [hapmar] and "general happiness" [happy] was <0.001, which was less than or equal to 0.05 and statistically significant.



Slide 55


Coefficients (a, b)

                               Unstandardized        Standardized
Model                          B       Std. Error    Beta      t        Sig.
1   (Constant)                 1.659   .351                     4.730   .000
    AGE OF RESPONDENT          -.001   .005          -.034      -.263   .793
    RESPONDENTS SEX            -.003   .144          -.002      -.018   .986
2   (Constant)                  .903   .300                     3.004   .004
    AGE OF RESPONDENT           .000   .004          -.011      -.112   .911
    RESPONDENTS SEX            -.164   .116          -.147     -1.421   .161
    HAPPINESS OF MARRIAGE       .675   .108           .645      6.255   .000

a. Dependent Variable: GENERAL HAPPINESS
b. Selecting only cases for which SPLIT = 1.000

In the second validation analysis, the probability for the test of relationship between "happiness of marriage" [hapmar] and "general happiness" [happy] was <0.001, which was less than or equal to 0.05 and statistically significant.

The split sample validation supports the findings of the regression analysis using the full data set. A caution is added because of the inclusion of ordinal level variables.

The answer to the question is true with caution.



Slide 56

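Table of validation results: hierarchical regression

                                    Full Data Set              Split = 0                  Split = 1

ANOVA significance (sig <= 0.05)    <0.001                     <0.001                     <0.001

R²                                  0.372                      0.368                      0.395

R² Change (sig <= 0.05)             <0.001                     <0.001                     <0.001

Significant Coefficients            Happiness of marriage      Happiness of marriage      Happiness of marriage
(sig <= 0.05)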



In this example, we satisfy all of the validation criteria. The validation supports the generalizability of the regression model to the population represented by this sample.

NOTE: we add the R² change statistic to the validation results because it indicates the contribution of the variables added in the second stage.



Slide 57

Problem 3


From the list of variables "how many in family earned money" [earnrs], "income" [rincom98], and "age" [age], the best predictors of "total family income" [income98] are "income" [rincom98] and "how many in family earned money" [earnrs]. Income and how many in family earned money have a strong relationship to total family income.

The most important predictor of total family income is income. The second most important predictor of total family income is how many in family earned money.

Survey respondents who had higher incomes had higher total family incomes. Survey respondents who had more family members earning money had higher total family incomes.




Slide 58





For stepwise regression, we will require that the same variables be selected in each validation analysis. We will not require that they be entered in the same order.
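Because the order of entry is not required to match, the stepwise criterion amounts to comparing sets of selected variables rather than ordered lists. A minimal sketch follows. The function name is ours, the full-sample order comes from the Variables Entered/Removed table for this problem, and the two validation orders are hypothetical.

```python
def same_variables_selected(full_order, validation_orders):
    """True when every validation run selected exactly the full-sample variables."""
    selected = set(full_order)  # a set ignores the order of entry
    return all(set(order) == selected for order in validation_orders)


full_order = ["rincom98", "lgearnrs"]  # entry order in the full data set
# Hypothetical validation runs: same variables, one entered in reverse order.
validation_orders = [["lgearnrs", "rincom98"], ["rincom98", "lgearnrs"]]
print(same_variables_selected(full_order, validation_orders))
```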



Slide 59

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 1

ANOVA (c)

Model              Sum of Squares    df   Mean Square        F   Sig.
1   Regression            933.833     1       933.833   84.525   .000 (a)
    Residual             1756.639   159        11.048
    Total                2690.472   160
2   Regression           1490.955     2       745.477   98.194   .000 (b)
    Residual             1199.517   158         7.592
    Total                2690.472   160

a. Predictors: (Constant), RINCOM98
b. Predictors: (Constant), RINCOM98, LGEARNRS
c. Dependent Variable: INCOME98

The best subset of predictors for total family income included the independent variables: income and how many in family earned money. The probability of the F statistic (98.194) for the regression relationship which includes these variables is <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that there is no relationship between the best subset of independent variables and the dependent variable (R² = 0).

We support the research hypothesis that there is a statistically significant relationship between the best subset of independent variables and the dependent variable.



Slide 60

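OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 2

Model Summary (c)

                                                               Change Statistics
Model   R          R Square   Adjusted   Std. Error of   R Square   F Change   df1   df2
                              R Square   the Estimate    Change
1       .589 (a)   .347       .343       3.324           .347       84.525     1     159
2       .744 (b)   .554       .549       2.755           .207       73.384     1     158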

The Multiple R for the relationship between the subset of independent variables that best predict the dependent variable is 0.744, which would be characterized as strong using the rule of thumb that a correlation less than or equal to 0.20 is very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong.

The R Square statistic is used in the validation analysis. In this example, the proportion of variance in the dependent variable explained by the independent variables included in the analysis is 55.4%.



Slide 61

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 3

Variables Entered/Removed (a)

Model   Variables Entered   Variables Removed   Method
1       RINCOM98            .                   Stepwise (Criteria: Probability-of-F-to-enter <= .010, Probability-of-F-to-remove >= .020).
2       LGEARNRS            .                   Stepwise (Criteria: Probability-of-F-to-enter <= .010, Probability-of-F-to-remove >= .020).

a. Dependent Variable: INCOME98

Based on the table of "Variables Entered/Removed," the most important predictor of total family income is income. The second most important predictor of total family income is how many in family earned money.



Slide 62

Coefficients (a)

                     Unstandardized         Standardized                     Collinearity Statistics
Model                B        Std. Error    Beta      t        Sig.    Tolerance   VIF
1   (Constant)       10.902    .730                   14.933   .000
    RINCOM98           .460    .050         .589       9.194   .000    1.000       1.000
2   (Constant)        3.932   1.014                    3.877   .000
    RINCOM98           .470    .041         .602      11.322   .000     .999       1.001
    LGEARNRS         16.009   1.869         .455       8.566   .000     .999       1.001




For the independent variable income, the probability of the t statistic (11.322) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.01. We reject the null hypothesis that the slope associated with income is equal to zero (b = 0) and conclude that there is a statistically significant relationship between income and total family income.



Slide 63








The b coefficient associated with income (0.470) is positive, indicating a direct relationship in which higher numeric values for income are associated with higher numeric values for total family income. The independent variable income is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who had higher incomes. The dependent variable total family income is also an ordinal variable. It is coded so that higher numeric values are associated with survey respondents who had higher total family incomes. Therefore, the positive value of b implies that survey respondents who had higher incomes had higher total family incomes.


Slide 64

[Coefficients table for the full data set, as shown on Slide 62]







For the independent variable how many in family earned money, the probability of the t statistic (8.566) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.01. We reject the null hypothesis that the slope associated with how many in family earned money is equal to zero (b = 0) and conclude that there is a statistically significant relationship between how many in family earned money and total family income.


Slide 65

[Coefficients table for the full data set, as shown on Slide 62]







The b coefficient associated with how many in family earned money (16.009) is positive, indicating a direct relationship in which higher numeric values for how many in family earned money are associated with higher numeric values for total family income. Therefore, the positive value of b implies that survey respondents who had more family members earning money had higher total family incomes.

If we wanted to specifically interpret the value of the b coefficient (16.009), we would have to convert it out of log units, because how many in family earned money entered the model in log-transformed form (LGEARNRS).
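As a minimal sketch of that conversion, assume LGEARNRS is the base-10 logarithm of the number of earners. The base and the exact form of the transformation are assumptions; the slides only say the variable is in log units. Under that assumption, multiplying the number of earners by a factor k changes the predicted total family income score by b × log10(k). The function name is illustrative.

```python
import math

B_LGEARNRS = 16.009  # b coefficient for LGEARNRS in the full-sample model


def predicted_change(b, factor, base=10.0):
    """Change in the predicted dependent variable when the raw value of a
    log-transformed predictor is multiplied by `factor`.

    Assumes the predictor entered the model as log_base(raw value).
    """
    return b * math.log(factor, base)


# Effect of doubling the number of earners, under the base-10 assumption.
change = predicted_change(B_LGEARNRS, 2)
print(round(change, 3))
```

Because total family income is itself coded in ordinal categories, the result is a change in the category score, not in dollars.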


Slide 66

Excluded Variables (c)

Model           Beta In   t       Sig.   Partial Correlation  Tolerance  VIF     Minimum Tolerance
1   LGEARNRS    .455(a)   8.566   .000   .563                 .999       1.001   .999
    AGE         .094(a)   1.415   .159   .112                 .931       1.074   .931
2   AGE         .090(b)   1.646   .102   .130                 .931       1.074   .930

Tolerance, VIF, and Minimum Tolerance are the Collinearity Statistics.

a. Predictors in the Model: (Constant), RINCOM98
b. Predictors in the Model: (Constant), RINCOM98, LGEARNRS




The relationship between "age" [age] and "total family income" [income98] was not statistically significant for the model using the full data set (p=0.102).


Slide 81

ANOVA (c, d): first validation analysis

Model              Sum of Squares  df   Mean Square  F       Sig.
1   Regression     378.332         1    378.332      29.477  .000(a)
    Residual       1039.619        81   12.835
    Total          1417.952        82
2   Regression     787.811         2    393.905      50.009  .000(b)
    Residual       630.141         80   7.877
    Total          1417.952        82

a. Predictors: (Constant), LGEARNRS
b. Predictors: (Constant), LGEARNRS, RINCOM98
d. Selecting only cases for which SPLIT = .0000

ANOVA (c, d): second validation analysis

Model              Sum of Squares  df   Mean Square  F       Sig.
1   Regression     575.955         1    575.955      63.161  .000(a)
    Residual       693.032         76   9.119
    Total          1268.987        77
2   Regression     722.986         2    361.493      49.656  .000(b)
    Residual       546.001         75   7.280
    Total          1268.987        77

a. Predictors: (Constant), RINCOM98
d. Selecting only cases for which SPLIT = 1.0000





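Slide 82

Model Summary (c, d): first validation analysis

Model  R (SPLIT = .0000,  R (SPLIT ~= .0000,  R Square  Adjusted  Std. Error of  R Square  F Change  df1  df2
       Selected)          Unselected)                   R Square  the Estimate   Change
1      .517(a)                                .267      .258      3.583          .267      29.477    1    81
2      .745(b)            .739                .556      .544      2.807          .289      51.986    1    80

R Square Change, F Change, df1, and df2 are the Change Statistics.

b. Predictors: (Constant), LGEARNRS, RINCOM98
c. Unless noted otherwise, statistics are based only on cases for which SPLIT = .0000.
d. Dependent Variable: INCOME98

Model Summary (c, d): second validation analysis

Model  R (SPLIT = 1.0000,  R (SPLIT ~= 1.0000,  R Square  Adjusted  Std. Error of  R Square  F Change  df1  df2
       Selected)           Unselected)                    R Square  the Estimate   Change
1      .674(a)                                  .454      .447      3.020          .454      63.161    1    76
2      .755(b)             .799                 .570      .558      2.698          .116      20.197    1    75

R Square Change, F Change, df1, and df2 are the Change Statistics.

c. Unless noted otherwise, statistics are based only on cases for which SPLIT = 1.0000.
d. Dependent Variable: INCOME98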

The total proportion of variance in the dependent variable explained by the model using the full data set was 55.4%, compared to 55.6% for the first split sample validation and 57.0% for the second split sample validation.

In both of the split-sample validation analyses, the total proportion of variance in the dependent variable explained by the independent variables was within 5% of the variance explained in the model using the full data set (55.4%): the differences are 0.2 and 1.6 percentage points, respectively.
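The R² comparison can be sketched in a few lines of Python. This is an illustration, not part of the SPSS procedure, and it reads "within 5%" as an absolute difference of at most 0.05 in R² (5 percentage points), which is an assumption about how the criterion is meant. The function and constant names are hypothetical.

```python
FULL_R2 = 0.554  # R-squared for the model using the full data set
VALIDATION_R2 = {"split = 0": 0.556, "split = 1": 0.570}
TOLERANCE = 0.05  # assumption: "within 5%" means 5 percentage points


def r2_within_tolerance(full_r2, validation_r2, tolerance=TOLERANCE):
    """True when the validation R-squared is within +/- tolerance of the full-sample value."""
    return abs(validation_r2 - full_r2) <= tolerance


results = {name: r2_within_tolerance(FULL_R2, r2)
           for name, r2 in VALIDATION_R2.items()}

for name, ok in results.items():
    print(name, "passes" if ok else "fails")
```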


Slide 83

Excluded Variables (c): first validation analysis

Model           Beta In   t       Sig.   Partial Correlation  Tolerance  VIF     Minimum Tolerance
1   RINCOM98    .538(a)   7.210   .000   .628                 .996       1.004   .996
    AGE         .235(a)   2.549   .013   .274                 .999       1.001   .999
2   AGE         .088(b)   1.128   .263   .126                 .917       1.091   .914

a. Predictors in the Model: (Constant), LGEARNRS
b. Predictors in the Model: (Constant), LGEARNRS, RINCOM98

Excluded Variables (c): second validation analysis

Model           Beta In   t       Sig.   Partial Correlation  Tolerance  VIF     Minimum Tolerance
1   LGEARNRS    .340(a)   4.494   .000   .461                 1.000      1.000   1.000
    AGE         .075(a)   .860    .392   .099                 .941       1.063   .941
2   AGE         .091(b)   1.170   .246   .135                 .939       1.065   .939

a. Predictors in the Model: (Constant), RINCOM98
b. Predictors in the Model: (Constant), RINCOM98, LGEARNRS


The relationship between "age" [age] and "total family income" [income98] was not statistically significant for the model using the full data set (p=0.102). Similarly, the relationships in both of the validation analyses were not statistically significant.

In the first validation analysis, the probability for the test of relationship between "age" [age] and "total family income" [income98] was 0.263, which was greater than the level of significance of 0.01 and not statistically significant.

In the second validation analysis, the probability for the test of relationship between "age" [age] and "total family income" [income98] was 0.246, which was greater than the level of significance of 0.01 and not statistically significant.


Slide 84

Coefficients (a, b): first validation analysis

Model             B       Std. Error  Beta   t       Sig.   Tolerance  VIF
1   (Constant)    9.585   1.425              6.724   .000
    LGEARNRS      17.243  3.176       .517   5.429   .000   1.000      1.000
2   (Constant)    3.083   1.435              2.148   .035
    LGEARNRS      18.322  2.493       .549   7.351   .000   .996       1.004
    RINCOM98      .447    .062        .538   7.210   .000   .996       1.004

b. Selecting only cases for which SPLIT = .0000


The relationship between "how many in family earned money" [earnrs] and "total family income" [income98] was statistically significant for the model using the full data set (p<0.001). Similarly, the relationships in both of the validation analyses were statistically significant.

In the first validation analysis, the probability for the test of relationship between "how many in family earned money" [earnrs] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant.


Slide 85

Coefficients (a, b): second validation analysis

Model             B       Std. Error  Beta   t       Sig.   Tolerance  VIF
1   (Constant)    10.525  .921               11.429  .000
    RINCOM98      .495    .062        .674   7.947   .000   1.000      1.000
2   (Constant)    5.158   1.450              3.556   .001
    RINCOM98      .492    .056        .670   8.850   .000   1.000      1.000
    LGEARNRS      12.783  2.844       .340   4.494   .000   1.000      1.000

b. Selecting only cases for which SPLIT = 1.0000


In the second validation analysis, the probability for the test of relationship between "how many in family earned money" [earnrs] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant.


Slide 86

[Coefficients table for the first validation analysis (SPLIT = .0000), as shown on Slide 84]


The relationship between "income" [rincom98] and "total family income" [income98] was statistically significant for the model using the full data set (p<0.001). Similarly, the relationships in both of the validation analyses were statistically significant.

In the first validation analysis, the probability for the test of relationship between "income" [rincom98] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant.


Slide 87

[Coefficients table for the second validation analysis (SPLIT = 1.0000), as shown on Slide 85]


In the second validation analysis, the probability for the test of relationship between "income" [rincom98] and "total family income" [income98] was <0.001, which was less than or equal to the level of significance of 0.01 and statistically significant.

The split sample validation supports the findings of the regression analysis using the full data set. A caution is added because of the inclusion of ordinal level variables.

The answer to the question is true with caution.


Slide 88

Table of validation results: stepwise regression

                          Full data set                     First validation (split = 0)      Second validation (split = 1)
ANOVA significance        <0.001                            <0.001                            <0.001
R²                        0.554                             0.556                             0.570
Significant coefficients  Respondent’s income               How many in family earned income  Respondent’s income
(R² change sig <= 0.01)   How many in family earned income  Respondent’s income               How many in family earned income

NOTE: we use a lower level of significance to offset stepwise regression’s tendency to overfit a model.

In this example, we satisfy all of the validation criteria. The validation supports the generalizability of the regression model to the population represented by this sample.

The same variables entered the validation analyses in a different order. The difference in order does not negate the validation.
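The pattern-of-significance check can be sketched as a comparison of sets, which makes it explicit that order of entry is ignored. The p-values below are taken from the Coefficients and Excluded Variables tables; values reported as <0.001 are stored as 0.001 (an upper bound), and the function names are hypothetical.

```python
ALPHA = 0.01  # level of significance used for the stepwise problem

# p-values per variable; dictionaries are listed in order of entry
FULL = {"RINCOM98": 0.001, "LGEARNRS": 0.001, "AGE": 0.102}
SPLIT_0 = {"LGEARNRS": 0.001, "RINCOM98": 0.001, "AGE": 0.263}
SPLIT_1 = {"RINCOM98": 0.001, "LGEARNRS": 0.001, "AGE": 0.246}


def significant_set(p_values, alpha=ALPHA):
    """Names of the variables whose p-value is at or below alpha."""
    return {name for name, p in p_values.items() if p <= alpha}


def same_pattern(full, validations, alpha=ALPHA):
    """True when every validation flags exactly the same variables as the full model."""
    target = significant_set(full, alpha)
    return all(significant_set(v, alpha) == target for v in validations)


print(sorted(significant_set(FULL)))
print(same_pattern(FULL, [SPLIT_0, SPLIT_1]))
```

Using sets rather than lists is the design choice that encodes the rule for stepwise regression: the same variables must be significant, but they need not enter in the same order.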


Slide 89

Split sample validation - 1

The following is a guide to the decision process for answering problems about split sample validation:

Dependent variable metric? Independent variables metric or dichotomous?
    No: Inappropriate application of a statistic
    Yes: continue

Compute regression analysis, using transformations and omitting outliers as needed.

Ratio of cases to independent variables at least 5 to 1?
    No: Inappropriate application of a statistic
    Yes: continue


Slide 90

Are there statistical findings requiring validation?
    No: False
    Yes: continue

Enough valid cases to split sample and keep 5 to 1 ratio of cases/variables?
    Yes:
        • Set the random seed and compute the split variable
        • Re-run the regression with split = 0
        • Re-run the regression with split = 1
    No:
        • Set the first random seed and compute the split1 variable
        • Re-run the regression with split1 = 1
        • Set the second random seed and compute the split2 variable
        • Re-run the regression with split2 = 1
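The slides carry out these steps in SPSS. As a rough Python analogue, the sketch below shows the idea of a seeded split variable: fixing the seed makes the random assignment reproducible. The seed values, the function name, and the 0.75 proportion used for the two-seed alternative are all assumptions made for illustration; 161 is the sum of the two validation sample sizes implied by the ANOVA tables (83 and 78).

```python
import random


def compute_split(n_cases, seed, proportion=0.5):
    """Return a list of 0/1 flags, one per case.

    Each case is flagged 1 with probability `proportion`. The same seed
    always yields the same flags, so the split can be reproduced.
    """
    rng = random.Random(seed)
    return [1 if rng.random() < proportion else 0 for _ in range(n_cases)]


N_CASES = 161  # assumption: 83 + 78 cases from the two validation analyses

# Enough cases: one split variable, analysed once with split = 0 and once with split = 1.
split = compute_split(N_CASES, seed=1234)

# Too few cases: two separately seeded samples, each analysed with flag = 1.
split1 = compute_split(N_CASES, seed=1111, proportion=0.75)
split2 = compute_split(N_CASES, seed=2222, proportion=0.75)

print(sum(split), N_CASES - sum(split))
```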


Slide 91

Compute regression analysis for both validation samples, using same method as analysis of full sample.

Probability of ANOVA test <= level of significance for both validation analyses?
    No: False
    Yes: continue

R² for both validations within 5% of R² for analysis of full data set?
    No: False
    Yes: continue


Slide 92

Pattern of significance for independent variables in both validations matches pattern for full data set?
    No: False
    Yes: continue

Change in R² statistically significant in both validation analyses? (Hierarchical only)
    No: False
    Yes: continue

Satisfies ratio for preferred sample size: 15 to 1 (stepwise: 50 to 1)?
    No: True with caution
    Yes: continue


Slide 93

Other cautions added for ordinal variables or violation of assumptions?
    Yes: True with caution
    No: True
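The validation part of this decision process can be summarized as a small function. This is a sketch of the flowchart as read from the slides, not the course's official scoring rule. The function name and parameters are hypothetical, "within 5%" is again read as 0.05 in R², and the cases-per-variable figure in the example (161 cases, 3 candidate independent variables) is an assumption.

```python
def validation_answer(anova_p_values, r2_full, r2_validations, pattern_matches,
                      cases_per_variable, stepwise=False, r2_change_p_values=None,
                      other_cautions=False, alpha=0.05, r2_tolerance=0.05):
    """Walk the validation flowchart and return 'False', 'True with caution', or 'True'."""
    # The overall relationship must be significant in both validation analyses.
    if any(p > alpha for p in anova_p_values):
        return "False"
    # R-squared for each validation must be close to the full-sample value.
    if any(abs(r2 - r2_full) > r2_tolerance for r2 in r2_validations):
        return "False"
    # The same variables must be significant or non-significant in every analysis.
    if not pattern_matches:
        return "False"
    # Hierarchical regression only: the R-squared change must be significant.
    if r2_change_p_values is not None and any(p > alpha for p in r2_change_p_values):
        return "False"
    # Preferred sample size: 15 to 1, or 50 to 1 for stepwise regression.
    preferred_ratio = 50 if stepwise else 15
    if cases_per_variable < preferred_ratio:
        return "True with caution"
    # Ordinal variables or violated assumptions add a caution.
    if other_cautions:
        return "True with caution"
    return "True"


# The stepwise example from these slides (p-values of <0.001 stored as 0.001).
answer = validation_answer(
    anova_p_values=[0.001, 0.001],
    r2_full=0.554,
    r2_validations=[0.556, 0.570],
    pattern_matches=True,
    cases_per_variable=161 / 3,  # assumption, see the note above
    stepwise=True,
    other_cautions=True,         # ordinal level variables were included
    alpha=0.01,
)
print(answer)
```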
