1 PSY6010: Statistics, Psychometrics and Research Design Regression (OLS, binary logistic, ordinal)...

1

PSY6010: Statistics, Psychometrics and Research Design

Regression (OLS, binary logistic, ordinal)

Professor Leora Lawton, Spring 2008

Wednesdays 7-10 PM, Room 204

2

Regression

Steps to conduct a regression analysis

1. Formulate research question

2. Create conceptual model

3. Operationalize

4. Conduct tests on variables

5. Build regression model

6. Run regressions

7. Write up results

8. Interpret results and draw conclusions and/or make recommendations

3

1. Formulate Research Question

• Understanding the factors behind fertility rates is important because population growth is intrinsic to health, political and economic stability, and the environment.

• The debate still rages whether controlling infant mortality is more important in controlling fertility, or whether investing in alternative choices for women provides motivation to lower fertility. In addition, how do different religious groups behave under similar conditions? Moslems have been shown to maintain higher fertility rates despite other signs of modernization.

4

2. Create Conceptual Model

• We propose that fertility is higher when infant mortality is higher, because families have ‘replacement children’.

• Furthermore, we propose that when women have more choices than family raising, fertility will be lower.

• Finally, we propose that there are religious differences, such that Moslem families will have more children than Christian, Hindu or Buddhist families, because of strong motivations to have ‘what Allah wills’, in contrast to alternative religious beliefs in G-d-given control over one’s life path.

5

2. Conceptual Model

Fertility Model (diagram): Women’s choices, Religion, and Infant Mortality each point to Fertility.

6

3. Operationalized Model

Fertility Model (diagram): Women’s Literacy Rate, Religion (dummy variables), and Infant Mortality Rate each point to Number of children.

7

3. Operationalized Model

We use the 1995 World Data survey, which includes data from 109 countries. The unit of analysis therefore is the country itself, so variation among subgroups within a country cannot be examined.

The dependent variable, fertility, ranges from 1.3 to 8.2 children per woman.

Female literacy is continuous and ranges from 9% to 100% of the population. The higher the rate of female literacy, the lower the rate of fertility.

The infant mortality rate (IMR) is continuous and ranges from 4 to 168 deaths per 1,000 live births. The higher the infant mortality rate, the higher the fertility rate.

Several religions are represented, but we will simply compare Moslems to other religions: a dummy variable for Moslem takes the value of 1 for Moslems and 0 for other religions. Moslems are hypothesized to have higher fertility rates, compared to other religious groups.

8

4. Conduct tests on variables

• While the initial tests indicate that the independent variables are skewed, we will enter them here as they are.

9

5. Build Regression Model (methods section)

• We will be using an OLS regression model because the dependent variable is continuous and so regression is the appropriate method. The model is as follows:

Y = a + b1x1 + b2x2 + b3x3 + e

or

Fertility = Constant + b1(IMR) + b2(LIT) + b3(Moslem) + e,

where fertility is the total fertility rate (TFR), IMR is the infant mortality rate (deaths per 1,000 children within the first year of life), LIT is the female literacy rate (%), Moslem is a dichotomous variable that compares predominantly Moslem countries with those of other predominant religions, and e is the error term.

10

6. Run Regressions

R = .867; R2 = .752; Adjusted R2 = .743

Anova: F = 81.829, sig = .000

Coefficients (a)

Model 1                                                        B    Std. Error    Beta        t    Sig.
(Constant)                                                 4.805          .769               6.249    .000
babymort  Infant mortality (deaths per 1000 live births)    .019          .005    .393       3.741    .000
Muslim    Moslems                                           .518          .256    .126       2.023    .046
lit_fema  Females who read (%)                             -.030          .007   -.454      -4.055    .000

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

a. Dependent Variable: fertilty Fertility: average number of kids
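To make the B column concrete, here is a minimal sketch that plugs the unstandardized coefficients from the table into the regression equation. The two country profiles are hypothetical illustrations, not rows from the World Data file, and the function name is made up for this example.

```python
# Unstandardized B coefficients copied from the Coefficients table.
CONSTANT = 4.805
B_IMR = 0.019      # infant mortality, deaths per 1,000 live births
B_MUSLIM = 0.518   # 1 = predominantly Moslem country, 0 = other
B_LIT = -0.030     # female literacy, percent


def predicted_fertility(imr, lit, muslim):
    """Return the fitted average number of children per woman."""
    return CONSTANT + B_IMR * imr + B_MUSLIM * muslim + B_LIT * lit


# Hypothetical profiles (assumptions for illustration only).
low_risk = predicted_fertility(imr=10, lit=95, muslim=0)
high_risk = predicted_fertility(imr=100, lit=40, muslim=1)

print("low IMR, high literacy :", round(low_risk, 2))
print("high IMR, low literacy :", round(high_risk, 2))
```

Each coefficient is the change in predicted fertility for a one-unit change in that variable, holding the other two constant.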

11

7. Report Results

• Our hypotheses are supported by the results: countries with higher rates of female literacy have lower rates of fertility, and countries with higher infant mortality rates have higher fertility. While it makes a smaller contribution to the model, the religion variable also shows that countries with predominantly Moslem populations have higher fertility.

12

8. Interpret Results

• Both literacy and infant mortality had very high impact. Judging by the standardized coefficients, the effect of female literacy (Beta = -.454) was somewhat stronger than that of infant mortality (Beta = .393), yet it would still make sense to prioritize reducing infant mortality where that is particularly high. However, a reduction in IMR without improvements in female choices is not likely to be helpful in the long run. Further research is needed to understand why Moslem fertility is higher even when controlling for the other two factors.

13

9. Interaction Effects

• Sometimes the effect of one variable is moderated by another. That is, the effect of a variable changes (it may be weakened or strengthened) depending on the status of another. For example, the effect of education on wages may be different for minorities versus the dominant race.

• The general form of the model is

Y = a + b1 X + b2 Z + b3 XZ
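A short sketch of what the interaction term does to the slope of X. The coefficient values below are hypothetical, chosen only to illustrate the general form; they do not come from any data set in these slides.

```python
def predict(x, z, a, b1, b2, b3):
    """Y = a + b1*X + b2*Z + b3*X*Z (the general form above)."""
    return a + b1 * x + b2 * z + b3 * x * z


def slope_of_x(z, b1, b3):
    """Effect of a one-unit increase in X, which depends on Z."""
    return b1 + b3 * z


# Hypothetical coefficients (assumptions for illustration only).
a, b1, b2, b3 = 10.0, 2.0, -3.0, -0.5

# With a 0/1 moderator Z, the slope of X differs between the two groups.
print("slope of X when Z = 0:", slope_of_x(0, b1, b3))
print("slope of X when Z = 1:", slope_of_x(1, b1, b3))
```

When b3 is zero the two slopes are identical and there is no moderation; the size and sign of b3 say how much the effect of X shifts per unit of Z.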

14

9. Interaction Effects

• Visit the site http://courses.washington.edu/psy209/Interpreting_Interactions_WorksheetKEY.htm and check the graphical examples of main effects versus interactions.

Computing an interaction term is simple:

Transform – Compute

Target variable = name for the new variable, the interaction term

Numeric expression is term1 times term2, e.g., male*educ.

compute INTERACT = EDUC * MALE.

There is substantial debate whether it’s appropriate to have an interaction term of anything other than two continuous variables, but most social scientists agree that it is, subject to theory and interpretability. Interval and ratio variables are generally considered appropriate. A dummy * non-categorical is appropriate, but not categorical * categorical UNLESS it’s a 1/0 for both. Even so, for a two-by-two interaction, consider a series of dummy variables instead.

15

9. Interactions - Example

• Open the spss supplied employee.sav

• Create a minority-education interaction term

• Run a regression where the DV is starting salary, and include minority status and educational attainment.

• Then run another and add the interaction term. Look at B coefficient and R2 changes.

16

Transforming variables - a

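Computing a variable from two variables (syntax):

compute minsex = 0 .
execute .
if (female eq 1) and (minority eq 0) minsex = 1 .
if (female eq 1) and (minority eq 1) minsex = 2 .
if (female eq 0) and (minority eq 0) minsex = 3 .
if (female eq 0) and (minority eq 1) minsex = 4 .
execute .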

17

Transforming variables - b

Now compute a set of dummies:

Recode minsex (1=1)(else = 0) into femwhite .

Recode minsex (2=1)(else = 0) into femminor .

Recode minsex (3=1)(else = 0) into malewhit .

Recode minsex (4 = 1)(else = 0) into malemin .

execute .
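The same two-step logic (combine two 0/1 variables into one code, then split the code into dummies) can be sketched outside SPSS. This is an illustration only; the meaning of codes 3 and 4 is inferred from the dummy names malewhit and malemin, and the function names are made up for the example.

```python
def minsex(female, minority):
    """Combine two 0/1 indicators into one four-category code."""
    if female == 1 and minority == 0:
        return 1  # white female
    if female == 1 and minority == 1:
        return 2  # minority female
    if female == 0 and minority == 0:
        return 3  # white male (inferred from the name malewhit)
    if female == 0 and minority == 1:
        return 4  # minority male (inferred from the name malemin)
    return 0      # anything else keeps the starting value of 0


# Code value -> dummy variable name, as in the Recode statements.
DUMMY_NAMES = {1: "femwhite", 2: "femminor", 3: "malewhit", 4: "malemin"}


def dummies(code):
    """Return one 0/1 dummy per category; exactly one is 1 for codes 1-4."""
    return {name: int(code == value) for value, name in DUMMY_NAMES.items()}


print(dummies(minsex(female=1, minority=1)))
# {'femwhite': 0, 'femminor': 1, 'malewhit': 0, 'malemin': 0}
```

When the dummies go into a regression, leave one of the four out as the reference category; entering all four alongside a constant makes them perfectly collinear.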

18

Logistic Regression

• Binary Logistic (aka regular or dichotomous)

• Appropriate for a dichotomous dependent variable, e.g., a ‘yes/no’ or ‘one/other’ kind of outcome or group.

• B coefficients show direction, but exp(B) is the odds ratio, and is most commonly reported.

• You want a smaller chi-square statistic on the goodness-of-fit test this time (that is, a small chi-square suggests the model is unlikely to need other explanatory variables… sort of counter-intuitive).

• First conduct bivariate analyses to determine variable suitability and potential patterns (this step will also help you interpret the results).

19

SPSS Commands

• SPSS: Analyze – Regression – Binary Logistic

• Example: spss file: employee data

• Recode jobcat to manager = 1, else = 0.

• Make Manager the dependent variable.

• Use female and minority as two explanatory variables.

• Add education, too.

• Run.

LOGISTIC REGRESSION manager
 /METHOD = ENTER female minority educ
 /CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

20

The Results.

Variables in the Equation

                  B     S.E.     Wald   df   Sig.   Exp(B)
female         -.895    .447    4.013    1   .045     .409
minority      -2.325    .794    8.573    1   .003     .098
educ           1.773    .275   41.625    1   .000    5.886
Constant     -28.195   4.280   43.388    1   .000     .000

Interpretation: Females and minorities are less likely to be managers (from the negative B values), and the Exp(B) values (odds ratios) tell us that the odds of being a manager for women are less than half those for men, and for minorities less than 1/10 those for whites. With education, the interpretation is trickier. We can see from B that the effect is positive, that is, the more education, the more likely it is that one will be a manager. But the correct interpretation of 5.886 is that each additional year of education multiplies the odds of being a manager nearly 6 times; the probability of being a manager for any given year is not obvious from the coefficient alone. Clearly, education is the largest and most significant contributor to the likelihood of being a manager.
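To see why an odds ratio of about 6 does not translate into a fixed change in probability, here is a small sketch that turns the B column into predicted probabilities. The coefficients are copied from the Variables in the Equation table; the helper names and the chosen education values are illustrative assumptions.

```python
import math

# B coefficients copied from the Variables in the Equation table.
B = {"constant": -28.195, "female": -0.895, "minority": -2.325, "educ": 1.773}


def logit(female, minority, educ):
    """Log-odds of being a manager."""
    return (B["constant"] + B["female"] * female
            + B["minority"] * minority + B["educ"] * educ)


def probability(female, minority, educ):
    """Convert log-odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-logit(female, minority, educ)))


def odds(p):
    return p / (1.0 - p)


# Non-minority men at three illustrative education levels.
for years in (15, 16, 17):
    p = probability(female=0, minority=0, educ=years)
    print(years, "years: probability", round(p, 3), "odds", round(odds(p), 3))

print("Exp(B) for educ:", round(math.exp(B["educ"]), 3))
```

The odds are multiplied by the same factor for every extra year, but the jump in probability depends on where on the curve you start.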

Recommendation: Discuss.

21

Ordinal Logistic Regression

• Appropriate for a non-continuous but ordinal dependent variable that is inappropriate for OLS regression (too few categories, or the categories are ordered but the distances between them are hard to interpret).

• Need to distinguish between IVs that are factors (categorical variables) and covariates (continuous or at least, non-categorical).

• Useful to understand predictors of different levels of the DV (analogous to OLS).

22

Website Reference for Ordinal and Multinomial Logistic Regression

• http://www.xs4all.nl/~jhckx/spss/mlogist/

• Click on the appropriate link:

• Example of multinomial logistic regression using NOMREG

• Example of ordered logistic regression using PLUM

23

SPSS Commands

• SPSS: Analyze – Regression – Ordinal

• Example: spss file: GSS93 subset.sav

• Frequencies: musicals (likes broadway musicals).

• Use sex (factor) and education (covariate) as two explanatory variables.

• Run.

PLUM musicals BY sex WITH educ
 /CRITERIA = CIN(95) DELTA(0) LCONVERGE(0) MXITER(100) MXSTEP(5) PCONVERGE(1.0E-6) SINGULAR(1.0E-8)
 /LINK = LOGIT
 /PRINT = FIT PARAMETER SUMMARY .

24

Interpretation of Ordinal Logistic Regression Results I

The value of 178.2 with 2 df is the most relevant value here. This is the likelihood ratio test that all coefficients for all independent variables are equal to zero. This null hypothesis can be rejected since the test is highly significant.

The pseudo R-square measures indicate that the model does not perform very well. The Nagelkerke R2 value will usually be the most relevant value to report. It corrects the Cox and Snell value so that it can theoretically achieve a value of 1.
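For reference, the three pseudo R-square measures are usually defined as below. The notation is an assumption introduced here: L_0 is the likelihood of the intercept-only model, L_1 the likelihood of the final model, and n the sample size. The last factor in the Nagelkerke formula is the rescaling that lets the measure reach 1.

```latex
\begin{align*}
R^2_{\text{Cox--Snell}}  &= 1 - \left(\frac{L_0}{L_1}\right)^{2/n} \\
R^2_{\text{Nagelkerke}} &= \frac{R^2_{\text{Cox--Snell}}}{1 - L_0^{\,2/n}} \\
R^2_{\text{McFadden}}   &= 1 - \frac{\ln L_1}{\ln L_0}
\end{align*}
```

Because the maximum of the Cox and Snell measure is 1 - L_0^{2/n}, which is below 1, dividing by that maximum is what produces the Nagelkerke value.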

Note: Do not worry about the Chi-Square test for this model. These goodness of fit tests are not highly significant, indicating that the model sort of fits the data. However, the tests are not informative because of the large number of zero frequencies in a three-way table of the variables in use here. This information is really only relevant if a small number of categorical independent variables is used.

Pseudo R-Square

Cox and Snell    .119
Nagelkerke       .126
McFadden         .043

Link function: Logit.

Model Fitting Information

Model             -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only              676.087
Final                       497.918      178.169    2   .000

Goodness-of-Fit

             Chi-Square    df   Sig.
Pearson         165.939   138   .053
Deviance        169.262   138   .036

Link function: Logit.

25

Interpretation of Ordinal Logistic Regression Results II

Parameter Estimates

                              Estimate   Std. Error      Wald   df   Sig.   95% CI Lower   95% CI Upper
Threshold   [musicals = 1]      -3.522         .237   221.506    1   .000         -3.986         -3.058
            [musicals = 2]      -1.590         .222    51.339    1   .000         -2.024         -1.155
            [musicals = 3]       -.365         .218     2.797    1   .094          -.794           .063
            [musicals = 4]       1.578         .239    43.548    1   .000          1.110          2.047
Location    educ                 -.159         .016    93.897    1   .000          -.191          -.127
            [sex=1]               .959         .100    91.125    1   .000           .762          1.156
            [sex=2]               0(a)            .         .    0      .              .              .

a. This parameter is set to zero because it is redundant.

The threshold numbers are the constants for each level of the ordinal dependent variable. The location parameters are the regression coefficients. In this model, the highest value is set as the reference category for factors and the DV. So we compare musicals 1, 2, 3, 4 to 5, and sex = 1 to sex = 2. Now, like with the OLS regression, you look at the Estimate for direction and the Sig. for significance (derived from the Wald estimate).
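A sketch of how thresholds and location parameters combine into predicted probabilities for each category of the DV. It assumes the usual cumulative-logit form used by PLUM, logit P(Y <= j) = threshold_j minus the sum of the location terms; the estimates are copied from the Parameter Estimates table, while the function names and the example values of educ are illustrative assumptions.

```python
import math

# Estimates copied from the Parameter Estimates table.
THRESHOLDS = [-3.522, -1.590, -0.365, 1.578]  # musicals = 1, 2, 3, 4
B_EDUC = -0.159
B_SEX1 = 0.959  # sex = 1; sex = 2 is the reference category (0)


def logistic(value):
    return 1.0 / (1.0 + math.exp(-value))


def category_probabilities(educ, sex1):
    """Return P(musicals = 1), ..., P(musicals = 5).

    Assumes logit P(Y <= j) = threshold_j - (B_EDUC*educ + B_SEX1*sex1).
    """
    location = B_EDUC * educ + B_SEX1 * sex1
    cumulative = [logistic(t - location) for t in THRESHOLDS] + [1.0]
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs


# Illustrative profiles: 12 years of education, sex = 1 versus sex = 2.
for sex1 in (1, 0):
    probs = category_probabilities(educ=12, sex1=sex1)
    print("sex1 =", sex1, [round(p, 3) for p in probs])
```

Under this form, a positive location estimate shifts probability toward the higher-numbered categories of the DV and a negative one toward the lower-numbered categories.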

26

Interpretation of Ordinal Logistic Regression Results II

• The problem is that, right now, the interpretation reads: the more education one has, the less likely one is to dislike Broadway musicals, and Sex = 1 (males) are more likely to dislike musicals. This is a bit backwards, which is why it’s good to reverse the coding on the DV and also clarify the coding on the IV dummies (from SEX to MALE). You know how to do the dummies; here’s how to do the switch in codes for a numeric variable:

• Go to Transform – Automatic Recode

• Add in the variable you want to transform, and give it a new name (so you leave the original variable alone)

• Click on Recode Starting from the HIGHEST Value. OK.

• Check your values for the original and transformed to make sure it came out right.

27

Multinomial Logistic Regression

• Appropriate for a 3- or maybe 4-category dependent variable.

• Need to distinguish between IVs that are factors (categorical variables) and covariates (continuous or at least, non-categorical).

• Useful to predict membership, or alternatively, to understand what kind of people can be found in that group.

28

SPSS Commands

• SPSS: Analyze – Regression – Multinomial

• Example: spss file: GSS93 subset.sav

• Frequencies: wrkstat

wrkstat Labor Force Status

Valid                  Frequency   Percent   Valid Percent   Cumulative Percent
1 Working fulltime           747      49.8            49.8                 49.8
2 Working parttime           161      10.7            10.7                 60.5
3 Temp not working            32       2.1             2.1                 62.7
4 Unempl, laid off            51       3.4             3.4                 66.1
5 Retired                    231      15.4            15.4                 81.5
6 School                      42       2.8             2.8                 84.3
7 Keeping house              200      13.3            13.3                 97.6
8 Other                       36       2.4             2.4                100.0
Total                       1500     100.0           100.0

29

SPSS Commands

• Need to recode wrkstat to laborfrc, where 1 = fulltime, 2 = part-time, 3 = retired.

• You can enter in sex as is for factors.

• Add Age as well for a covariate.

NOMREG laborfrc (BASE=LAST ORDER=ASCENDING) BY sex WITH age
 /CRITERIA CIN(95) DELTA(0) MXITER(100) MXSTEP(5) CHKSEP(20) LCONVERGE(0) PCONVERGE(0.000001) SINGULAR(0.00000001)
 /MODEL
 /STEPWISE = PIN(.05) POUT(0.1) MINEFFECT(0) RULE(SINGLE) ENTRYMETHOD(LR) REMOVALMETHOD(LR)
 /INTERCEPT = INCLUDE
 /PRINT = PARAMETER SUMMARY LRT CPS STEP MFI .

30

Interpretation of Multinomial Logistic Regression Results I

The most interesting result here is that the chi-square value of 839.7 with 4 df is highly significant. This means that the null hypothesis that all effects of the independent variables are zero can be rejected.

The likelihood ratio tests show that the null hypothesis that the effects on both log odds-ratios of the dependent variable are simultaneously equal to zero can be rejected for both independent variables. However, the loss of fit from omitting AGE (chi-square = 801.5) is much larger than that from omitting SEX (chi-square = 37.7).

The pseudo R-square measures indicate that the model performs well. The Nagelkerke R2 value will usually be the most relevant value to report. It corrects the Cox and Snell value so that it can theoretically achieve a value of 1.

Model Fitting Information

Model             -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only             1354.540
Final                       514.894      839.646    4   .000

Pseudo R-Square

Cox and Snell    .522
Nagelkerke       .632
McFadden         .421

Likelihood Ratio Tests

Effect       -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept                            514.894(a)         .000    0      .
age                                    1316.366      801.473    2   .000
sex                                     552.610       37.716    2   .000

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.

a. This reduced model is equivalent to the final model because omitting the effect does not increase the degrees of freedom.

31

Interpretation of Multinomial Logistic Regression Results II

Parameter Estimates

laborfrc (a)                  B   Std. Error      Wald   df   Sig.   Exp(B)   95% CI for Exp(B): Lower   Upper
1.00        Intercept    14.940        1.127   175.875    1   .000
            age           -.238         .018   181.989    1   .000     .788                       .761    .816
            [sex=1]        .352         .281     1.566    1   .211    1.421                       .819   2.465
            [sex=2]        0(b)            .         .    0      .        .                          .       .
2.00        Intercept    14.088        1.160   147.590    1   .000
            age           -.244         .019   169.472    1   .000     .784                       .755    .813
            [sex=1]       -.756         .324     5.428    1   .020     .470                       .249    .887
            [sex=2]        0(b)            .         .    0      .        .                          .       .

laborfrc: fulltime = 1, parttime = 2, retired = 3

a. The reference category is: 3.00.

b. This parameter is set to zero because it is redundant.
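As a rough illustration of how the two sets of coefficients work together, the sketch below converts them into predicted probabilities for all three labor force categories. The B values are copied from the Parameter Estimates table; the reference category (retired) has all coefficients fixed at zero. The function name and the example ages are illustrative assumptions.

```python
import math

# B coefficients copied from the Parameter Estimates table.
# Category 3 (retired) is the reference, so its log-odds are fixed at 0.
COEF = {
    1: {"intercept": 14.940, "age": -0.238, "sex1": 0.352},   # fulltime
    2: {"intercept": 14.088, "age": -0.244, "sex1": -0.756},  # parttime
}


def labor_force_probabilities(age, sex1):
    """Return {category: probability} for categories 1, 2 and 3."""
    scores = {3: 0.0}
    for category, b in COEF.items():
        scores[category] = b["intercept"] + b["age"] * age + b["sex1"] * sex1
    denominator = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / denominator for c, s in scores.items()}


# Illustrative profiles: sex = 1 at ages 40 and 70.
for age in (40, 70):
    probs = labor_force_probabilities(age=age, sex1=1)
    print(age, {c: round(p, 3) for c, p in sorted(probs.items())})
```

Each Exp(B) is the factor by which the odds of that category versus the reference change per unit of the predictor, which is why the probabilities of all three categories move together.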

32

Interpretation of Multinomial Logistic Regression Results II

• Interpretation of Multinomial Logistic Regression

• One category chosen as reference group

• odds of being in category other than reference

• Now, this one is like the dichotomous logistic regression: we’re back with the Exp(B) coefficient (Exp(B) is the odds ratio). Now we see the odds ratios for age and sex on the first two categories of labor force, as compared to the third category. For age, the older you get, the less likely you are to be working full or part time compared to being retired (but let’s look at the compare means).

• For the factors (categorical independent variables), you can enter them without transformation. Males are more likely than women to be in the fulltime labor force compared to retired (although this difference is not statistically significant, Sig. = .211), and less likely than women to be parttime compared to retired.

33

Crosstabs Sex and Labor Force

sex Respondent's Sex * laborfrc (fulltime = 1, parttime = 2, retired = 3) Crosstabulation

                                     1.00     2.00     3.00    Total
1 Male     Count                      408       46      107      561
           % within sex             72.7%     8.2%    19.1%   100.0%
           % within laborfrc        54.6%    28.6%    46.3%    49.3%
2 Female   Count                      339      115      124      578
           % within sex             58.7%    19.9%    21.5%   100.0%
           % within laborfrc        45.4%    71.4%    53.7%    50.7%
Total      Count                      747      161      231     1139
           % within sex             65.6%    14.1%    20.3%   100.0%
           % within laborfrc       100.0%   100.0%   100.0%   100.0%
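One way to connect this crosstab to the multinomial results is to compute the unadjusted odds ratios directly from the counts. The sketch below does that; the counts are copied from the crosstabulation, and the function names are made up for the example. These raw ratios ignore age, so they should only roughly resemble the Exp(B) values for sex from the model that includes age.

```python
# Counts copied from the sex by laborfrc crosstabulation.
COUNTS = {
    "male":   {"fulltime": 408, "parttime": 46,  "retired": 107},
    "female": {"fulltime": 339, "parttime": 115, "retired": 124},
}


def odds_vs_retired(sex, category):
    """Odds of being in `category` rather than retired, within one sex."""
    return COUNTS[sex][category] / COUNTS[sex]["retired"]


def odds_ratio(category):
    """Male odds divided by female odds (female is the comparison group)."""
    return odds_vs_retired("male", category) / odds_vs_retired("female", category)


print("fulltime vs retired, male:female odds ratio:", round(odds_ratio("fulltime"), 3))
print("parttime vs retired, male:female odds ratio:", round(odds_ratio("parttime"), 3))
```

A ratio above 1 means men have higher odds than women of being in that category rather than retired; a ratio below 1 means lower odds.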
