Simple Linear Regression Scott M Lynch


(copyright by Scott M. Lynch, February 2003)

Simple Linear Regression (Soc 504)

The linear regression model is one of the most important and widely-used models in statistics. Most of the statistical methods that are common in sociology (and other disciplines) today are extensions of this basic model. The model is used to summarize large quantities of data and to make inference about relationships between variables in a population. Before we discuss this model in depth, we should consider why we need statistics in the first place, in order to lay the foundation for understanding the importance of regression.

1 Why Statistics?

There are three goals of statistics:

1. Data summarization

2. Making inference about populations from which we have a sample

3. Making predictions about future observations

1.1 Data Summarization

You have already become familiar with this notion by creating univariate “summary statistics” about sample distributions, like measures of central tendency (mean, median, mode) and dispersion (variance, ranges). Linear regression is also primarily about summarizing data, but regression is a way to summarize information about relationships between 2 or more variables. Assume, for instance, we have the following information:

[Figure 1. F(death rates) by Age: scatterplot of the observed/predicted f(rates) against age.]


This plot is of a function of 1995 death rates for the total US population ages 25-105, by age (see http://www.demog.berkeley.edu/wilmoth/mortality). [Side Note: Death rates are approximately exponential across age, so I’ve taken the ln() to linearize the relationship. Also, I’ve added a random normal variable to them [N(0, .5)], because at the population level, there is no sampling error, something we will discuss in the next lecture.]

There is clearly a linear pattern between age and these 81 death rates. One goal of a statistical model is to summarize a lot of information with a few parameters. In this example, since there is a linear pattern, we could summarize these data with 3 parameters in a linear model:

Yi = β0 + β1Agei + εi    (Population Equation)

or

Yi = b0 + b1Agei + ei    (Sample Equation)

or

Ŷi = b0 + b1Agei    (Sample Model)

We will discuss the meaning of each of these equations soon, but for now, note that, if this model ‘fits’ the data well, we will have reduced 81 pieces of information to 3 pieces of information (an intercept, b0; a slope, b1; and an error term, or more specifically, an error variance, s²e).
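To make “81 pieces of information reduced to 3” concrete, here is a minimal Python sketch on simulated data; the generating intercept (−9.5), slope (.085), and noise level are placeholder values chosen to resemble the example, not the actual mortality data.

    # Summarize 81 simulated points with three numbers: b0, b1, and s^2_e.
    import numpy as np

    rng = np.random.default_rng(0)
    age = np.arange(25, 106)                                # 81 ages
    y = -9.5 + 0.085 * age + rng.normal(0, 0.5, age.size)   # hypothetical ln(rates)

    b1, b0 = np.polyfit(age, y, 1)                          # slope, intercept
    resid = y - (b0 + b1 * age)
    s2_e = resid @ resid / (age.size - 2)                   # error variance
    print(b0, b1, s2_e)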

1.2 Inference

Typically we have a sample, but we would like to infer things about the population from which the sample was drawn. This is the same goal that you had when you conducted various statistical tests in previous courses. Inference in linear regression isn’t much different. Our goal is to let the estimates b0 and b1 be our ‘best guess’ about the population parameters β0 and β1, which represent the relationship between two (or more) variables at the population level. Formalizing inference in this model will be the topic of the next lecture.

1.3 Prediction

Sometimes, a goal of statistical modeling is to predict future observations from observed data. For example, given the data above, we might extrapolate our line out from age 105 to predict what the death rate should be for age 106, or 110. Or we might extrapolate our line back from age 25 to predict the death rate for persons at age 15. Alternatively, suppose I had a time series of death rates from 1950-2001, and I wanted to project death rates for 2002. Or, finally, suppose I had a model that predicted the death rates of smokers versus nonsmokers, and I wanted to predict whether a person would die within the next 10 years based on whether s/he were a smoker versus a nonsmoker.

1.4 Prediction and Problems with Causal Thinking

Inference and prediction are not very different, but prediction tends to imply causal arguments. We often use causal arguments in interpreting regression coefficients, but we need to realize that statistical models, no matter how complicated, may never be able to demonstrate causality. There are three rules of causality:

1. Temporality. Cause must precede effect.

2. Correlation. Two variables must be correlated (in some way) if one causes the other.

3. Nonspuriousness. The relationship between two variables can’t be attributable to the influence of a third variable.

1.4.1 Temporality

Many, if not most, social science data sets are cross-sectional. This makes it impossible to determine whether A causes B or vice versa. Here is where theory (and some common sense) comes in. Theory may tell us that A is causally prior to B. For example, social psychological theory suggests that stress induces depression, and not that depression leads to stress. (Note that two theories, however, may posit opposite causal directions.) Common sense may also reveal the direction of the relationship. For example, in the mortality rate example, it is unreasonable to assume that death rates make people older.

1.4.2 Correlation and Nonspuriousness

Two variables must be related (B) if there is a causal relationship between them (A), but this does not imply the reverse statement that correlation demonstrates causation. Why? Because there could be any number of alternate explanations for the relationship between two variables. There are several types of alternate explanations. First, a variable C could be correlated with A but be the true cause of B. In that case, the relationship between A and B is “spurious.” A classic example of this is that ice cream consumption rates (in gallons per capita) are related to rape rates (in rapes per 100,000 persons). This is not a causal relationship, however, because both are ultimately driven by “season.” More rapes occur, and more ice cream is consumed, during the summer. Regression modeling can help us rule out such spurious relationships.

Second, A could affect C, with C affecting B. In that case, A may be considered a cause, but C is the more proximate cause (often called an “intervening variable”). For example, years of education is strongly linearly related to health. However, we seriously doubt that time spent in school is ultimately the cause of health; instead, education probably affects income, and income is more proximately related to health. Often, our goal is to find proximate causes of an outcome variable, and we will be discussing how the linear model is often used (albeit somewhat incorrectly) to find proximate causes.

Third, two variables may be correlated, but the relationship may not be a causal one. When we say that two variables are related, we are generally thinking about this at the within-individual level: that as a characteristic changes for an individual, it will influence some other characteristic of the individual. Yet, our models generally capture covariance at the between-individual level. The fact that gender, for example, covaries with income DOES NOT imply that a sex change operation will automatically lead to an increase in pay. With fixed characteristics, like gender or race, we often use causal terminology, but because gender cannot change, it technically cannot be a cause of anything. Instead, there may be more proximate and changeable factors that are associated with gender which are also associated with the outcome variable in which we are interested. Experimentalists realize this, and experiments involve observing average within-individual change, with a manipulable intervention.

As another example, life course research emphasizes three different types of effects of time: age, period, and cohort effects. Age effects refer to biological or social processes that occur at the within-individual level as the individual ages. Period effects refer to historical processes that occur at the macro level at some point in time and influence individuals at multiple ages. Cohort effects, at least one interpretation of them, refer to the interaction of period with age. A period event at time t may affect persons age x at time t differently than persons age x+5 at t. Think, for example, about the difference between the computer knowledge of persons currently age 20 versus those currently age 70. The fact that 70 year-olds know less about computers is not an artifact of some decline in cognitive function across age; it’s due to the differential effect of the period event of the invention of the home PC across birth cohorts.

Recognizing these different types of time effects demonstrates why our models may fail at determining causality. Imagine some sort of life course process that is stable across an individual’s life course but may vary across birth cohorts; suppose it is decreasing across birth cohorts. Assume that we take a cross section and look at the age pattern: we will observe a linear age pattern, even though this is not true for any individual:

[Figure 2. Hypothetical Y by Age: hypothetical Y plotted against age (0-100), with separate lines for the 1900, 1920, 1940, 1960, 1980, and 2000 birth cohorts and the 2000 period mean.]


This example, as well as the ice cream and rape example, highlights additional fallacies that we need to be aware of when thinking causally: ecological fallacies and individualistic fallacies. If we assume that, because ice cream consumption rates and rape rates are related, people who eat ice cream are rapists, this would be committing the ecological fallacy. Briefly stated, macro relationships aren’t necessarily true at the micro level. To give a more reasonable example, we know that SES is related to heart disease mortality at the macro level (with richer countries having greater rates of heart disease mortality). This does not imply that having low SES at the individual level is an advantage; we know that the pattern is reversed at that level. The explanation for the former (macro) finding may be differences in competing risks or diet at the national level. The explanation for the latter (micro) finding may be differences in access to health care, diet, exercise, etc.

The individualistic fallacy is essentially the opposite fallacy: reverse the arguments above. We cannot infer macro patterns from micro patterns. Furthermore, exceptions to a pattern don’t invalidate the pattern.

Because of the various problems with causality and its modeling, we will stay away from thinking causally, although to some extent, the semantics we use in discussing regression will be causal.

2 Back to Regression

Recalling the example above under ‘data summarization,’ we have data on death rates by age, and we would like to summarize the data by using a linear model. We saw 3 equations above. In these equations, β0 is the linear intercept term, β1 is the (linear) effect of age on death rates, and εi is an error term that captures the difference (vertical distance) between the line implied by the coefficients and the actual data we observed; obviously every observation cannot fall exactly on a straight line through the data.

We assume that the first equation holds in the population. However, we don’t have population data. So, we change our notation slightly to the second equation.

In the third equation, the error term has dropped out of the equation, because Ŷ (“y-hat”) is the expected (mean) score for the rate applicable to a particular age. Note that simple algebra yields ei = Yi − Ŷi, where Ŷi ≡ b0 + b1Agei.

3 Least Squares Estimation

How do we find estimates of the regression coefficients? We should develop some criteria that lead us to prefer one set of coefficients over another set.

One reasonable strategy is to find the line that gives us the least error. This must be done for the entire sample, so we would like the b0 and b1 that yield min ∑ei. However, the sum of the raw errors will always be 0, so long as the regression line passes through the point (X̄, Ȳ). An alternate strategy is to find the line that gives us the least absolute value error: min ∑|ei|. We will discuss this strategy later in the semester.

It is easier to work with squares (which are also positive), so we may consider: min ∑ei². In fact, this is the criterion generally used, which is why the method is called Ordinary Least Squares regression.


We need to minimize this term, so, recalling from calculus that a maximum or minimum is reached where the derivative of a function equals 0, we simply need to take the derivative of the least squares function, set it equal to 0, and solve for b0 and b1.

There is one catch: in this model we have two parameters that we need to solve for. So, we need to take two partial derivatives: one with respect to b0 and the other with respect to b1. Then we will need to solve the set of two equations:

∂F/∂b0 = ∂/∂b0 [ ∑(Yi − (b0 + b1Xi))² ] = 2nb0 − 2∑Yi + 2b1∑Xi

and:

∂F/∂b1 = ∂/∂b1 [ ∑(Yi − (b0 + b1Xi))² ] = −2∑XiYi + 2b0∑Xi + 2b1∑Xi².

Setting these equations to 0 and solving for b0 and b1 yields:

b0 = Ȳ − b1X̄

and

b1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)²

Notice that the denominator of b1 is (n − 1) × s²x, and the numerator is (n − 1) × cov(X, Y). So, b1 = Cov(X, Y) / Var(X).
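These formulas can be computed directly; the sketch below uses a small hypothetical data set (the same six observations, X = 1, ..., 6 and Y = 1, 1, 2, 2, 3, 3, that reappear in the matrix examples later in these notes).

    # Closed-form OLS estimates for simple regression.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    # Equivalently, b1 = Cov(X, Y) / Var(X).
    b1_alt = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    print(b0, b1, b1_alt)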

4 Maximum Likelihood Estimation

An alternative approach to estimating the coefficients is to use maximum likelihood estimation. Recall that for ML estimation, we first establish a likelihood function. If we assume the errors (ei) are ∼ N(0, se), then we can establish a likelihood function based on the error. Once again, assuming observations are independent, the joint pdf for the data given the parameters (the likelihood function) is:

p(Y | b0, b1, X) ≡ L(b0, b1 | X, Y) = ∏ (1 / (se√(2π))) exp{ −(Yi − (b0 + b1Xi))² / (2s²e) }

This likelihood reduces to:

p(Y | b0, b1, X) ≡ L(b0, b1 | X, Y) = (2πs²e)^(−n/2) exp{ −∑(Yi − (b0 + b1Xi))² / (2s²e) }

Taking the log of the likelihood, we get:

LL(b0, b1, X) ∝ −n log(se) − (1/(2s²e)) ∑(Yi − (b0 + b1Xi))²

Taking the derivative of this function with respect to each parameter yields the following 3 equations:

∂LL/∂b0 = −(1/(2s²e)) (2nb0 − 2∑Yi + 2b1∑Xi)

and

∂LL/∂b1 = −(1/(2s²e)) (−2∑XiYi + 2b0∑Xi + 2b1∑Xi²)

and

∂LL/∂se = ∑(Yi − (b0 + b1Xi))² / s³e − n/se

Setting these partial derivatives equal to 0 and solving for the parameters yields the same values as the OLS approach. Furthermore, the error variance is found to be:

s²e = ∑ei² / n

However, due to estimation, we lose 2 ‘degrees of freedom,’ making the unbiased denominator n − 2, rather than n. Realize that this end result is really nothing more than the average squared error. If we take the square root, we get the ‘standard error of the regression,’ which we can use to construct measures of model fit.
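A short sketch of these two versions of the error variance (ML with n in the denominator, unbiased with n − 2), continuing with the same six hypothetical observations:

    # Error variance and the 'standard error of the regression'.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    e = y - (b0 + b1 * x)                        # residuals
    s2_ml = np.sum(e ** 2) / len(y)              # ML estimate (divide by n)
    s2_unbiased = np.sum(e ** 2) / (len(y) - 2)  # unbiased estimate (divide by n - 2)
    se_regression = np.sqrt(s2_unbiased)         # standard error of the regression
    print(s2_ml, s2_unbiased, se_regression)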

Using the results above, the regression for the death rate data yields the following line:

[Figure 3. Observed and Predicted F(death rates): the observed f(rates) plotted by age with the fitted regression line.]


(copyright by Scott M. Lynch, February 2003)

Simple Regression II: Model Fit and Inference (Soc 504)

In the previous notes, we derived the least squares and maximum likelihood estimators for the parameters in the simple regression model. Estimates of the parameters themselves, however, are only useful insofar as we can determine a) how well the model fits the data and b) how good our estimates are, in terms of the information they provide about the possible relationship between variables in the population. In this set of notes, I discuss how to interpret the parameter estimates in the model, how to evaluate the fit of the model, and how to make inference to the population using the sample estimates.

1 Interpretation of Coefficients/Parameters

The interpretation of the coefficients from a linear regression model is fairly straightforward. The estimated intercept (b0) tells us the value of y that is expected when x = 0. This is often not very useful, because many of our variables don’t have true 0 values (or at least not relevant ones, like education or income, which rarely if ever have 0 values). The slope (b1) is more important, because it tells us the relationship between x and y. It is interpreted as the expected change in y for a one-unit change in x. In the death rates example, the intercept estimate was −9.45, and the slope estimate was .084. This means that the log of the death rates is expected to be −9.45 at age 0 and is expected to increase .084 units for each year of age.

The other parameter in the model, the standard error of the regression (se), is important because it gives us (in y units) an estimate of the average error associated with a predicted score. In the model for death rates, the standard error of the regression was .576.

2 Evaluating Model Fit

Before we make inferences about parameters from our sample estimates, we would like to decide just how well the model fits the data. A first approach to this is to determine the amount of the variance in the dependent variable that is ‘accounted for’ by the regression line. If we think of the formula for variance:

∑(Yi − Ȳ)² / (n − 1),

we can consider the numerator to be called the “Total Sum of Squares” (TSS). We have an estimate of the variance that is unexplained by the model, the “Residual Sum of Squares” (SSE), which is just the numerator of the standard error of the regression function:

∑ei² = ∑(Yi − Ŷi)²

The difference between these two is the “Regression Sum of Squares” (RSS), or the amount of variance accounted for by the model. RSS can be represented as:

∑(Ŷi − Ȳ)²

Some basic algebra reveals that, in fact, RSS + SSE = TSS. A measure of model fit can be constructed by taking the ratio of the explained variance to the total variance:

R² = ∑(Ŷi − Ȳ)² / ∑(Yi − Ȳ)² = 1 − ∑(Yi − Ŷi)² / ∑(Yi − Ȳ)²

This measure ranges from 0 to 1. For a poor-fitting model, the error (SSE) will be large (possibly equal to the TSS), making RSS small. For example, if we were to allow the mean of y to be our best guess for y (Ŷ = Ȳ), then we would be supposing that the relationship between x and y did not exist. In this case, RSS would be 0, as would R². For a perfect model, on the other hand, RSS = TSS, so R² = 1. In the death rates example, the R² was .92, indicating a good linear fit.
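A sketch of TSS, SSE, RSS, and R², again on the six hypothetical observations rather than the death rate data:

    # Sums of squares and R-squared for a fitted simple regression.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    yhat = b0 + b1 * x

    TSS = np.sum((y - y.mean()) ** 2)
    SSE = np.sum((y - yhat) ** 2)
    RSS = np.sum((yhat - y.mean()) ** 2)     # note: RSS + SSE = TSS

    R2 = RSS / TSS                           # equals 1 - SSE/TSS
    r = np.sign(b1) * np.sqrt(R2)            # signed square root = corr(x, y) here
    print(TSS, SSE, RSS, R2, r)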

Interestingly, in the simple linear regression model, the signed square root of R² is the correlation between x and y (sign based on the sign of b1). In multiple regression, this computation yields the ‘multiple correlation,’ which we will discuss later.

These results are almost all that is needed to complete an ANOVA table for the regressionmodel. The typical model ANOVA output looks like (this is from my death rate example):

ANOVA TABLE Df SS MS F SigRegression 1 315.54 315.54 949.78 p = 0

Residual 79 26.25 ..33223

Total 80 341.79

The general computation of the table is:

ANOVA TABLE   DF      SS             MS                        F         Sig
Regression    k − 1   ∑(Ŷi − Ȳ)²     RSS/Df(R) (called MSR)    MSR/MSE   (from F table)
Residual      n − k   ∑(Yi − Ŷi)²    SSE/Df(E) (called MSE)
Total         n − 1   ∑(Yi − Ȳ)²

where k is the number of parameters, and n is the number of data points (sample size).

For the F test, the numerator and denominator degrees of freedom are k − 1 and n − k, respectively. The F test is a joint test that all the coefficients in the model are 0. In simple linear regression, F = t² (for b1).
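The sketch below assembles these ANOVA quantities and verifies that F = t² for b1, using the same six hypothetical observations:

    # ANOVA quantities for simple regression and the F = t^2 identity.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
    n, k = len(y), 2                                 # k parameters: b0 and b1

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    yhat = b0 + b1 * x

    MSR = np.sum((yhat - y.mean()) ** 2) / (k - 1)   # regression mean square
    MSE = np.sum((y - yhat) ** 2) / (n - k)          # residual mean square
    F = MSR / MSE

    t_b1 = b1 / np.sqrt(MSE / np.sum((x - x.mean()) ** 2))
    print(F, t_b1 ** 2)                              # the two values agree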

There are numerous ways to further assess model fit, specifically by examining the error terms. We will discuss these later, in the context of multiple regression.

3 Inference

Just as the mean in the ‘models’ that you discussed in previous classes (and that we discussed earlier) has a sampling distribution, from which you can assume your estimate x̄ is a sampled value, regression coefficients also have a sampling distribution. For these sampling distributions to be valid, there are several assumptions that the OLS (or MLE) regression model must meet. They include:

1. Linearity. This assumption says that the relationship between x and y must be linear. If this assumption is not met, then the parameters will be biased.

2. Homoscedasticity (constant variance of y across x). This assumption says that the variance of the error term cannot change across values of x. If this assumption is violated, then the standard errors for the parameters will be biased. Note that if the linearity assumption is violated, the homoscedasticity assumption need not be.

3. Independence of ei and ej, for i ≠ j. This assumption says that there cannot be a relationship between the errors for any two individuals. If this assumption is violated, then the likelihood function does not hold, because the probability multiplication rule says that joint probabilities are the product of the individual probabilities ONLY if the events are independent.

4. ei ∼ N(0, σε). Because the structural part of y is fixed, given x, this is equivalent to saying that yi ∼ N(b0 + b1xi, σe). This assumption is simply that the errors are normally distributed. If this assumption is not met, then the likelihood function is not appropriate. In terms of least squares, this assumption guarantees that the sampling distributions for the parameters are also normal. However, as n gets large, this assumption becomes less important, by the CLT.

If all of these assumptions are met, then the OLS/MLE estimates are BLUE (Best Linear Unbiased Estimates), and the sampling distributions for the parameters can be derived. They are:

b0 ∼ N( β0 , σ²ε ∑Xi² / (n ∑(Xi − X̄)²) )

b1 ∼ N( β1 , σ²ε / ∑(Xi − X̄)² )

With the standard errors obtained (square roots of the sampling distribution variance estimators above), we can construct t-tests on the parameters just as we conducted t-tests on the mean in previous courses. Note that these are t-tests, because we must replace σε with our estimate, se, for reasons identical to those we discussed before. The formulas for the standard errors are often expressed as (where √MSE and se are identical):

S.E.(b0) = √( MSE ∑Xi² / (n ∑(Xi − X̄)²) )

S.E.(b1) = √( MSE / ∑(Xi − X̄)² )

Generally, we are interested in whether x is, in fact, related to y. If x is not related to y, then β1 will equal 0, and our best guess for the value of y is ȳ. Thus, we typically hypothesize that β1 = 0 and construct the following t-test:

t = (b1 − 0) / s.e.(b1)    (under H0: β1 = 0)

In the death rate example, the standard error for b0 was .189, and the standard error for b1 was .0027. Computing the t-tests yields values of −49.92 and 30.82, respectively. Both of these are large enough that we can reject the null hypothesis that the population parameters, β0 and β1, are 0. In other words, if the null hypothesis were true, these data would be almost impossible to obtain in a random sample. Thus, we reject the null hypothesis.
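The standard errors and t statistics follow directly from the formulas above; the sketch below uses the six hypothetical observations rather than the death rate data:

    # Standard errors and t tests for b0 and b1 in simple regression.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
    n = len(y)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    MSE = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)

    Sxx = np.sum((x - x.mean()) ** 2)
    se_b0 = np.sqrt(MSE * np.sum(x ** 2) / (n * Sxx))
    se_b1 = np.sqrt(MSE / Sxx)

    t_b0 = b0 / se_b0          # test of H0: beta0 = 0
    t_b1 = b1 / se_b1          # test of H0: beta1 = 0
    print(se_b0, se_b1, t_b0, t_b1)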

As we discussed before, constructing a confidence interval for this estimate is a simple algebraic contortion of the t-test, in which we decide a priori the ‘confidence level’ we desire for our interval:

(1 − α)% C.I. = b1 ± tα/2 × s.e.(b1)

These tests work also for the intercept, although, as we discussed before, the intercept is often not of interest. However, there are some cases in which the intercept may be of interest. For example, if we are attempting to determine whether one variable is an unbiased measure of another, then the intercept should be 0. If there is a bias, then the intercept should pick this up, and the intercept will not be 0. For example, if people tend to under-report their weight by 5 pounds, then the regression model for observed weight regressed on reported weight will have an intercept of 5 (implying you can take an individual’s reported weight and add 5 pounds to it to get their actual weight; see Fox).

Sometimes, beyond confidence intervals for the coefficients themselves, we would like to have prediction intervals for an unobserved y. Because the regression model posits that y is a function of both b0 and b1, a prediction interval for y must take variability in both parameter estimates into account. There are various ways to do such calculations, depending on exactly what quantity you are interested in (but these are beyond the scope of this material).

With the above results, you can pretty much complete an entire bivariate regression analysis.


4 Deriving the Standard Errors for Simple Regression

The process of deriving the standard errors for the parameters in the simple regression problem using maximum likelihood estimation involves the same steps as we used in finding the standard errors in the normal mean/standard deviation problem before:

1. Construct the second derivative matrix of the log likelihood function with respect to each of the parameters. That is, in this problem, we have three parameters, and hence the Hessian matrix will contain 6 unique elements (I’ve replaced s²e with τ):

[ ∂²LL/∂b0²      ∂²LL/∂b0∂b1    ∂²LL/∂b0∂τ ]
[ ∂²LL/∂b1∂b0    ∂²LL/∂b1²      ∂²LL/∂b1∂τ ]
[ ∂²LL/∂τ∂b0     ∂²LL/∂τ∂b1     ∂²LL/∂τ²   ]

2. Take the negative expectation of this matrix to obtain the Information matrix.

3. Invert the information matrix to obtain the variance-covariance matrix of the parameters. For the actual standard error estimates, we substitute the MLEs for the population parameters.

4.1 Deriving the Hessian Matrix

We can obtain the elements of the Hessian matrix using the first derivatives shown above. Using the first derivative of the log likelihood with respect to b0, we can take the second derivative with respect to b0. The first derivative with respect to b0 was:

∂LL/∂b0 = −(1/(2τ)) (2nb0 − 2∑Y + 2b1∑X).

The second derivative with respect to b0 is thus simply:

∂²LL/∂b0² = −(1/(2τ))(2n) = −n/τ.

The second derivative with respect to b1 is:

∂²LL/∂b0∂b1 = −(1/(2τ))(2∑X) = −∑X/τ

The second derivative with respect to τ is:

∂²LL/∂b0∂τ = (1/τ²) (nb0 − ∑Y + b1∑X)

The three remaining second partial derivatives require the other two first partial derivatives. Using the first partial derivative with respect to b1:

∂LL/∂b1 = −(1/(2τ)) (−2∑XY + 2b0∑X + 2b1∑X²),

we can easily take the second partial derivative with respect to b1:

∂²LL/∂b1² = −(1/(2τ))(2∑X²) = −∑X²/τ

We can also obtain the second partial derivative with respect to τ:

∂²LL/∂b1∂τ = τ⁻² (−∑XY + b0∑X + b1∑X²)

For finding the second partial derivative with respect to s²e, we need to re-derive the first derivative with respect to s²e rather than se. So, letting τ = s²e as we did above, we get the following:

∂LL/∂τ = −n/(2τ) + (1/(2τ²)) ∑(Y − (b0 + b1X))².

The second partial derivative, then, is:

∂²LL/∂τ² = (1/2) [ nτ⁻² − 2τ⁻³ ∑(Y − (b0 + b1X))² ].

We now have all 6 second partial derivatives, giving us the following Hessian matrix (after a little rearranging of the terms):

[ −n/τ                         −∑X/τ                          (nb0 − ∑Y + b1∑X)/τ²              ]
[ −∑X/τ                        −∑X²/τ                         (−∑XY + b0∑X + b1∑X²)/τ²          ]
[ (nb0 − ∑Y + b1∑X)/τ²         (−∑XY + b0∑X + b1∑X²)/τ²       n/(2τ²) − ∑(Y − (b0 + b1X))²/τ³   ]

4.2 Computing the Information Matrix

In order to obtain the information matrix, we need to negate the expectation of the Hessian matrix. There are a few ‘tricks’ involved in this process. The negative expectation of the first element is simply n/τ. The negative expectation of the second element (and also the fourth, given the symmetry of the matrix) is nµX/τ (recall the trick we used earlier, that if X̄ = ∑X/n, then ∑X = nX̄). We will skip the third and sixth (and hence also the seventh and eighth) elements for the moment. The negative of the expectation of the fifth element is ∑X²/τ. (We leave it unchanged, given that there is no simple way to reduce this quantity.) Finally, to take the negative of the expectation of the ninth and last element, we must first note that the expression can be rewritten as: (nτ − 2∑(Y − (b0 + b1X))²) / (2τ³). The expectation of the sum, though, is nothing more than the error variance itself, τ, taken n times. Thus, the expectation of the numerator is nτ − 2nτ. After some simplification, including some cancelling with terms in the denominator, we are left with n/(2τ²) for the negative expectation.

All the other elements in this matrix (3, 6, 7, and 8) go to 0 in expectation. Let’s take the case of the third (and seventh) element. The expectation of b0 is β0, which is equal to µY − β1µX. The expectation of ∑Y is nµY, and the expectation of ∑X is nµX. Finally, the expectation of b1 is β1. Substituting these expressions into the numerator yields:

n(µY − β1µX) − nµY + nβ1µX.

This term clearly sums to 0, and hence the entire third and seventh expressions are 0. A similar finding results for elements six and eight.

4.3 Inverting the Information Matrix

Our information matrix obtained in the previous section turns out to be:

[ n/τ        nµX/τ      0         ]
[ nµX/τ      ∑X²/τ      0         ]
[ 0          0          n/(2τ²)   ]

Although inverting 3 × 3 matrices is generally not easy, the 0 elements in this one make the problem relatively simple. We can break this problem into parts. First, recall that the product of a matrix and its inverse produces an identity matrix. So, in this case, the sum of the products of the elements of the last row of the information matrix with the elements of the last column of the inverse of the information matrix must be 1. But the first two of these products are 0:

(0)(I(θ)⁻¹₁₃) + (0)(I(θ)⁻¹₂₃) + (n/(2τ²)) I(θ)⁻¹₃₃ = 1.

The only way for this to occur is for I(θ)⁻¹₃₃ to be 2τ²/n.

If we do a little more thinking about this problem, we will see that, because of the 0 elements, to invert the rest of the matrix we can treat the remaining non-zero elements of the matrix as a 2 × 2 sub-matrix, with the elements that are 0 in the information matrix also being 0 in the variance-covariance matrix. The inverse of a 2 × 2 matrix M can be found using the following rule:

M⁻¹ ≡ [ A  B ]⁻¹  =  (1/|M|) [  D  −B ]
      [ C  D ]               [ −C   A ].

You can derive this result algebraically for yourself by setting up a system of four equations in four unknowns.

The determinant of the sub-matrix here is (n∑X² − n²µX²)/τ², so, after inverting (the constant in front of the inverse formula above is 1/|M|), we obtain τ²/(n∑X² − n²µX²). This expression can be simplified by recognizing that an n can be factored from n∑X² − n²µX², leaving us with n multiplied by the numerator of the so-called computational formula for the variance of X (∑X² − nµX² = ∑(X − µX)²). Thus, we have τ²/(n∑(X − µX)²). The inverse matrix thus becomes:

[ τ∑X² / (n∑(X − µX)²)     −τµX / ∑(X − µX)² ]
[ −τµX / ∑(X − µX)²         τ / ∑(X − µX)²    ]

We now have all the elements of the variance-covariance matrix of the parameters. These terms should look familiar, after replacing τ with σ²ε and the population-level parameters with the sample ML estimates:

[ s²e∑X² / (n∑(X − X̄)²)     −s²eX̄ / ∑(X − X̄)² ]
[ −s²eX̄ / ∑(X − X̄)²          s²e / ∑(X − X̄)²    ]
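As a numerical check on this derivation, the sketch below builds the information matrix for the six hypothetical observations used earlier (with τ set to the ML estimate of s²e), inverts it with numpy, and compares the result to the closed-form variance and covariance expressions above:

    # Invert the information matrix numerically and compare to the formulas.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
    n = len(y)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    tau = np.sum((y - (b0 + b1 * x)) ** 2) / n      # ML estimate of s^2_e

    info = np.array([
        [n / tau,             n * x.mean() / tau,   0.0],
        [n * x.mean() / tau,  np.sum(x ** 2) / tau, 0.0],
        [0.0,                 0.0,                  n / (2 * tau ** 2)],
    ])
    vcov = np.linalg.inv(info)

    Sxx = np.sum((x - x.mean()) ** 2)
    print(vcov[0, 0], tau * np.sum(x ** 2) / (n * Sxx))  # Var(b0)
    print(vcov[1, 1], tau / Sxx)                         # Var(b1)
    print(vcov[0, 1], -tau * x.mean() / Sxx)             # Cov(b0, b1)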


(copyright by Scott M. Lynch, February 2003)

Multiple Regression I (Soc 504)

Generally, we are not interested in examining the relationship between simply two variables. Rather, we may be interested in examining the relationship between multiple variables and some outcome of interest. Or, we may believe that a relationship between one variable and another is spurious on a third variable. Or, we may believe that the relationship between one variable and another is being ‘masked’ by some third variable. Or, still yet, we may believe that a relationship between one variable and another may depend on another variable. In these cases, we conduct multiple regression analysis, which is simply an extension of the simple regression model we have discussed thus far.

1 The Multiple Regression Model

In scalar notation, the multiple regression model is:

Yi = β0 + β1Xi1 + β2Xi2 + . . . + βkXik + εi

We rarely express the model in this fashion, however, because it is more compact to use matrix notation. In that case, we often use:

[ Y1  ]     [ 1   X11   ...   X1k ] [ β0 ]     [ ε1  ]
[ Y2  ]     [ 1   X21   ...   X2k ] [ β1 ]     [ ε2  ]
[ ... ]  =  [ ... ...   ...   ... ] [ ...]  +  [ ... ]
[ Yn  ]     [ 1   Xn1   ...   Xnk ] [ βk ]     [ εn  ]

or just Y = Xβ + ε. At the sample level, the model is Y = Xb + e. In these equations, n is the number of observations in the data, and k + 1 is the number of regression parameters (the +1 is for the intercept, β0 or b0).

Note that if you use what you know about matrix algebra, you can multiply out the X matrix by the β matrix, add the error vector, and get the scalar result above for each observation. The column of ones gets multiplied by β0, so that the intercept term stands alone without an X variable.

2 The OLS Solution in Matrix Form

The OLS solution for β can be derived the same way we derived it in the previous lecture, but here we must use matrix calculus. Again, we need to minimize the sum of the squared error terms (∑ei²). This can be expressed in matrix notation as:

F = min (Y − Xb)ᵀ(Y − Xb)

Notice that the summation symbol is not needed here, because (Y − Xb) is an n × 1 column vector. Transposing this vector and multiplying it by itself (untransposed) produces a scalar that is equal to the sum of the squared errors.

In the next step, we need to minimize F by taking the derivative of the expression and setting it equal to 0, as before:

∂F/∂β = −2Xᵀ(Y − Xβ) = −2XᵀY + 2XᵀXβ.

It may be difficult to see how this derivative is taken. Realize that the construction above is a quadratic form in β (and X). We could think of the equation as: (Y − Xβ)². In that case, we would obtain: −2X(Y − Xβ) for our derivative. This is exactly what this expression is. X is transposed so that the multiplication that is implied in the result is possible. Note that, using the distributive property of matrix multiplication, we are able to distribute the −2Xᵀ across the parenthetical.

Setting the derivative equal to 0 and dividing by -2 yields:

0 = XT Y −XT Xβ

Obviously, if we move −XT Xβ to the other side of the equation, we get:

XT Xβ = XT Y

In order to isolate β, we need to premultiply both sides of the equation by (XᵀX)⁻¹. This leaves us with Iβ on the left, which equals β, and the OLS solution, (XᵀX)⁻¹(XᵀY), on the right.

The standard error of the regression looks much as before. Its analog in matrix form is:

σ²e = (1/(n − k)) eᵀe.

Finally, the variance-covariance matrix of the parameter estimates can be obtained by:

Var(β) = σ²e (XᵀX)⁻¹

You would need to take the square roots of the diagonal elements to obtain the standard errors of the parameters for hypothesis testing. Notice that this result looks similar to the bivariate regression result, if you think of the inverse function as being similar to division. We will derive these estimators for the standard errors using an ML approach in the next lecture.
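A minimal numpy sketch of these matrix formulas, using a hypothetical design matrix (a column of ones plus one predictor; with more predictors the code is unchanged):

    # b = (X'X)^{-1} X'Y, s^2_e = e'e/(n - k), Var(b) = s^2_e (X'X)^{-1}.
    import numpy as np

    X = np.column_stack([np.ones(6), np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])])
    Y = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
    n, k = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ Y)               # coefficient estimates
    e = Y - X @ b                         # residual vector
    s2_e = (e @ e) / (n - k)              # squared standard error of the regression
    vcov_b = s2_e * XtX_inv               # variance-covariance matrix of b
    se_b = np.sqrt(np.diag(vcov_b))       # standard errors for t tests
    print(b, s2_e, se_b)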

3 Matrix Solution for Simple Regression

We will demonstrate the OLS solution for the bivariate regression model. Before we do so, though, let me discuss a little further a type of matrix expression you will see often: XᵀX. As we discussed above, this type of term is a quadratic form, more or less equivalent to X² in scalar notation. The primary difference between the matrix and the scalar form is that in the matrix form the off-diagonal elements of the resulting matrix will be the ‘cross-products’ of the X variables (essentially their covariances), while the main diagonal will be the sum of X² for each X (essentially the variances).

The X matrix for the bivariate regression model looks like:

[ 1    X11 ]
[ 1    X21 ]
[ ...  ... ]
[ 1    Xn1 ]

For the purposes of exposition, I will not change the subscripts when we transpose this matrix. If we compute XᵀX, we will get:

[ 1    1    ...  1   ]  [ 1    X11 ]     [ n     ∑X  ]
[ X11  X21  ...  Xn1 ]  [ 1    X21 ]  =  [ ∑X    ∑X² ]
                        [ ...  ... ]
                        [ 1    Xn1 ]

To compute the inverse of this matrix, we can use the rule presented in the last chapter for inverting 2 × 2 matrices:

M⁻¹ = (1/|M|) [  m22  −m12 ]
              [ −m21   m11 ].

In this case, the determinant of (XᵀX) is n∑X² − (∑X)², and so the inverse is:

(XᵀX)⁻¹ = 1/(n∑X² − (∑X)²) × [  ∑X²   −∑X ]
                              [ −∑X      n ]

We now need to postmultiply this by XT Y , which is:

[ 1    1    ...  1   ]  [ Y1  ]     [ ∑Y  ]
[ X11  X21  ...  Xn1 ]  [ Y2  ]  =  [ ∑XY ]
                        [ ... ]
                        [ Yn  ]

So, (XᵀX)⁻¹(XᵀY) is:

1/(n∑X² − (∑X)²) × [  ∑X²   −∑X ] [ ∑Y  ]     [ (∑X²∑Y − ∑X∑XY) / (n∑X² − (∑X)²) ]
                    [ −∑X      n ] [ ∑XY ]  =  [ (−∑X∑Y + n∑XY) / (n∑X² − (∑X)²)  ]

Let’s work first with the denominator of the elements in the vector on the right. The term (∑X)² can be rewritten as n²X̄². Then, an n can be factored from the denominator, and we are left with n(∑X² − nX̄²). As we discussed before, this is equal to n times the numerator for the variance: n∑(X − X̄)².

The numerator of the second element in the vector can be rewritten as n∑XY − n²X̄Ȳ. An n can be factored from this expression and will cancel with the n in the denominator. So, we are left with ∑XY − nX̄Ȳ in the numerator. For reasons that will become apparent in a moment, we can express this term as ∑XY − 2nX̄Ȳ + nX̄Ȳ. The middle term can be rewritten as −Ȳ∑X − X̄∑Y, and the last term can be written as ∑X̄Ȳ. All four terms can now be collected under a single summation as:

∑(XY − X̄Y − ȲX + X̄Ȳ),

which factors into ∑(X − X̄)(Y − Ȳ), and so the whole expression (numerator and denominator) becomes:

∑(X − X̄)(Y − Ȳ) / ∑(X − X̄)².

This should look familiar. It is the same expression we obtained previously for the slope. Observe that we now have a new ‘computational’ formula:

∑XY − nX̄Ȳ = ∑(X − X̄)(Y − Ȳ).

What is interesting about the result we just obtained is that we obtained it without deviating each variable from its mean in the (XᵀX) matrix, but the means re-entered anyway.

Now, let’s determine the first element in the solution vector. The numerator is ∑X²∑Y − ∑X∑XY. This can be reexpressed as nȲ∑X² − nX̄∑XY. Once again, the denominator is n∑(X − X̄)². So, the ns in the numerator and denominator all cancel. Now, we can strategically add and subtract nX̄²Ȳ in the numerator to obtain: Ȳ∑X² − nX̄²Ȳ − (X̄∑XY − nX̄²Ȳ). With a minimal amount of algebraic manipulation, we can obtain:

Ȳ(∑X² − nX̄²) − X̄(∑XY − nX̄Ȳ).

If we now separate out the two halves of the numerator and make two fractions, we get:

Ȳ(∑X² − nX̄²) / ∑(X − X̄)²  −  X̄(∑XY − nX̄Ȳ) / ∑(X − X̄)²,

which is just Ȳ − b1X̄.

4 Why Do We Need Multiple Regression?

If we have more than 1 independent variable, the matrices become larger, and, as stated above, the off-diagonal elements of the (XᵀX) matrix contain information equivalent to the covariances among the X variables. This information is important, because the only reason we need to perform multiple regression is to ‘control’ out the effects of other X variables when trying to determine the true effect of one X on Y. For example, suppose we were interested in examining racial differences in health. We might conduct a simple regression of health on race, and we would find a rather large and significant difference between whites and nonwhites. But, suppose we thought that part of the racial difference in health was attributable to income differences between racial groups. Multiple regression analysis would allow us to control out the income differences between racial groups to determine the residual race differences. If, on the other hand, there weren’t racial differences in income (i.e., race and income were not correlated), then including income in the model would not have an effect on estimated race differences in health.

In other words, if the Xs aren’t correlated, then there is no need to perform multiple regression. Let me demonstrate this. For the sake of keeping the equations manageable (so they fit on a page) and so that I can demonstrate another point, let’s assume that all the X variables have a mean of 0 (or, alternatively, that we have deviated them from their means so that the new set of X variables each have a mean of 0). Above, we derived a matrix expression for (XᵀX) in simple regression, but we now need the more general form with multiple X variables (once again, I have left the subscripts untransposed for clarity):

[ 1    1    ...  1   ]  [ 1    X11  ...  X1k ]     [ n     ∑X1    ∑X2    ...   ∑Xk  ]
[ X11  X21  ...  Xn1 ]  [ 1    X21  ...  X2k ]     [ ∑X1   ∑X1²   0      ...   0    ]
[ ...  ...  ...  ... ]  [ ...  ...  ...  ... ]  =  [ ∑X2   0      ∑X2²   ...   0    ]
[ X1k  X2k  ...  Xnk ]  [ 1    Xn1  ...  Xnk ]     [ ...   ...    ...    ...   0    ]
                                                    [ ∑Xk   0      0      ...   ∑Xk² ]

Thus, the first row and column of the matrix contain the sums of the variables, the main diagonal contains the sums of squares of each variable, and all the cross-product positions are 0. This matrix simplifies considerably if we realize that, if the means of all of the X variables are 0, then ∑X must be 0 for each variable:

          [ n    0      ...   0    ]
(XᵀX)  =  [ 0    ∑X1²   ...   0    ]
          [ ...  ...    ...   0    ]
          [ 0    0      ...   ∑Xk² ]

The matrix is now a diagonal matrix, and so its inverse is obtained simply by inverting each of the diagonal elements. The (XᵀY) vector is:

[ 1    1    ...  1   ]  [ Y1  ]     [ ∑Y   ]
[ X11  X21  ...  Xn1 ]  [ Y2  ]     [ ∑X1Y ]
[ ...  ...  ...  ... ]  [ ... ]  =  [ ...  ]
[ X1k  X2k  ...  Xnk ]  [ Yn  ]     [ ∑XkY ]

So, the solution vector is:

                 [ 1/n   0       ...   0      ] [ ∑Y   ]     [ ∑Y/n       ]
(XᵀX)⁻¹(XᵀY)  =  [ 0     1/∑X1²  ...   0      ] [ ∑X1Y ]  =  [ ∑X1Y/∑X1²  ]
                 [ ...   ...     ...   0      ] [ ...  ]     [ ...        ]
                 [ 0     0       ...   1/∑Xk² ] [ ∑XkY ]     [ ∑XkY/∑Xk²  ]

The last thing we need to consider is what ∑XY and ∑X² are when the mean of X is 0. Let’s take the denominators, ∑X², first. This is the same as ∑(X − 0)², which, since the means of all the X variables are 0, means the denominator for each of the coefficients is ∑(X − X̄)². Now let’s think about the numerator. As it turns out, the numerator for each coefficient can be rewritten as ∑(X − X̄)(Y − Ȳ). Why? Try substituting 0 for X̄ and expanding:

∑(X)(Y − Ȳ) = ∑XY − Ȳ∑X = ∑XY − ȲnX̄ = ∑XY − Ȳ·0 = ∑XY

So, each of our coefficients can be viewed as: ∑(X − X̄)(Y − Ȳ) / ∑(X − X̄)². Notice that this is exactly equal to the slope coefficient in the simple regression model. Thus, we have shown that, if the means of the X variables are equal to 0, and the X variables are uncorrelated, then the multiple regression coefficients are identical to what we would obtain if we conducted separate simple regressions for each variable. The results we obtained here also apply if the means of the X variables are not 0, but the matrix expressions become much more complicated.

What about the intercept term? Notice in the solution vector that the intercept term simply turned out to be the mean of Y. This highlights an important point: the intercept is simply an adjusted mean of the outcome variable. More specifically, it represents the mean of the dependent variable when all the X variables are set to 0 (in this case, their means). This interpretation holds when the X variables’ means are not 0, as well, just as we discussed previously in interpreting the coefficients in the simple regression model.
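A small numerical sketch of this result: the x1 and x2 vectors below are hypothetical but constructed to have mean 0 and to be uncorrelated, so the multiple regression slopes should match the separate simple regression slopes, and the intercept should equal the mean of Y.

    # Uncorrelated, mean-zero predictors: multiple = separate simple regressions.
    import numpy as np

    x1 = np.array([-1.0, 0.0, 1.0, -1.0, 0.0, 1.0])
    x2 = np.array([1.0, -2.0, 1.0, 1.0, -2.0, 1.0])
    y = np.array([2.0, 1.0, 4.0, 3.0, 0.0, 5.0])
    assert abs(x1 @ x2) < 1e-12 and abs(x1.mean()) < 1e-12 and abs(x2.mean()) < 1e-12

    X = np.column_stack([np.ones_like(y), x1, x2])
    b_multiple = np.linalg.inv(X.T @ X) @ (X.T @ y)

    b1_simple = np.sum(x1 * (y - y.mean())) / np.sum(x1 ** 2)
    b2_simple = np.sum(x2 * (y - y.mean())) / np.sum(x2 ** 2)
    print(b_multiple)                     # [mean of y, b1_simple, b2_simple]
    print(y.mean(), b1_simple, b2_simple)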


(copyright by Scott M. Lynch, March 2003)

Multiple Regression II (Soc 504)

1 Evaluation of Model Fit and Inference

Previously, we discussed the theory underlying the multiple regression model, and we derived the OLS estimator for the model. In that process, we discussed the standard error of the regression, as well as the standard errors for the regression coefficients. We will now develop the t-tests for the parameters, as well as the ANOVA table for the regression and the regression F test.

However, first, we must be sure that our data meet the assumptions of the model. The assumptions of multiple regression are no different from those of simple regression:

1. Linearity

2. Homoscedasticity

3. Independence of errors across observations

4. Normality of the errors.

Typically, assumptions 1-4 are lumped together as: Y ∼ N(Xβ, σ²εI), or equivalently as e ∼ N(0, σ²εI). If linearity holds, then the expectation of the error is 0 (see the preceding formula). If homoscedasticity holds, then the variance is constant across observations (see the identity matrix in the preceding formulas). If independence holds, then the off-diagonal elements of the error variance matrix will be 0 (also refer to the identity matrix in the preceding formulas). Finally, if the errors are normal, then both Y and e will be normally distributed, as the formulas indicate.

If the assumptions are met, then inference can proceed (if not, then we have some problems; we will discuss this soon). This information is not really new: much of what we developed for the simple regression model is still true. The basic ANOVA table is (k is the number of x variables INCLUDING the intercept):

ANOVA TABLE   DF      SS                  MS                        F         Sig
Regression    k − 1   (Xβ − Ȳ)ᵀ(Xβ − Ȳ)   SSR/Df(Reg) (called MSR)  MSR/MSE   (from F table)
Residual      n − k   eᵀe                 SSE/Df(E) (called MSE)
Total         n − 1   (Y − Ȳ)ᵀ(Y − Ȳ)

R² = SSR/SST, or 1 − SSE/SST, and Multiple Correlation = √R².

Note that in the sums of squares calculations above, I have simply replaced the scalar notation with matrix notation. Also note that the mean of Y (Ȳ) is a column vector (n × 1) in which the elements are all the mean of Y. There is therefore no difference between these calculations and those presented in the simple regression notes; in fact, if you’re more comfortable with scalar notation, you can still obtain the sums of squares as before.

The multiple correlation before was simply the bivariate correlation between X and Y. Now, however, there are multiple X variables, so the multiple R has a different interpretation. Specifically, it is the correlation between the observed values of Y and the model-predicted values of Y (Ŷ = Xβ).

The F-test is now formally an overall (global) test of the model’s fit. If at least one variable has a significant coefficient, then the model F should be significant (except in cases in which multicollinearity is a problem; we will discuss this later). There is more use for F in the multiple regression context, as we will discuss shortly.

The t-tests for the parameters themselves are conducted just as before. Recall from the last set of notes that the standard errors (more specifically, the variances) for the coefficients are obtained using: σ²e(XᵀX)⁻¹. This is a k × k variance-covariance matrix for the coefficients. However, we are generally only interested in the diagonal elements of this matrix. The diagonal elements are the covariances of the coefficients with themselves. Hence, their square roots are the standard errors. The off-diagonal elements of this matrix are the covariances of the coefficients with other coefficients and are not important for our t-tests. Since we don’t know σ²e, we replace it with the MSE from the ANOVA table.

For example, suppose we have data that look like:

    [ 1  1 ]        [ 1 ]
    [ 1  2 ]        [ 1 ]
X = [ 1  3 ]    Y = [ 2 ] ,   so   (XᵀY) = [ 12 ]
    [ 1  4 ]        [ 2 ]                  [ 50 ]
    [ 1  5 ]        [ 3 ]
    [ 1  6 ]        [ 3 ]

So,

(XᵀX) = [  6  21 ]    and    (XᵀX)⁻¹ = [  91/105   −21/105 ]
        [ 21  91 ]                     [ −21/105     6/105 ]

If we compute (XᵀX)⁻¹(XᵀY), we get the OLS solution:

[ 42/105 ]
[ 48/105 ],

which will ultimately yield an MSE of .085714 (if you compute the residuals, square them, sum them, and divide by 4).

If we compute MSE(XᵀX)⁻¹, then we will get:

[  .0743   −.0171 ]
[ −.0171    .0049 ],

the variance-covariance matrix of the coefficients. If we take the diagonal elements and square root them, we get .273 and .07 as the standard errors of b0 and b1, respectively. The off-diagonal elements tell us how the coefficients are related to each other. We can get the correlation between b0 and b1 by using the formula: Corr(X, Y) = Cov(X, Y) / (S(X)S(Y)). In this case, the ‘S’ values are the standard errors we just computed. So, Corr(b0, b1) = −.0171 / ((.273)(.07)) = −.895. This indicates a very strong negative relationship between the coefficients, which is typical in a bivariate regression setting. This tells us that if the intercept were large, the slope would be shallow (and vice versa), which makes sense.
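The same numbers can be verified with numpy (the data are the six observations given above):

    # Check the worked example: b, MSE, Var-Cov of b, SEs, and Corr(b0, b1).
    import numpy as np

    X = np.column_stack([np.ones(6), np.arange(1.0, 7.0)])
    Y = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
    n, k = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ Y                      # [42/105, 48/105]
    e = Y - X @ b
    MSE = e @ e / (n - k)                      # about .085714

    vcov = MSE * XtX_inv                       # [[.0743, -.0171], [-.0171, .0049]]
    se = np.sqrt(np.diag(vcov))                # about .273 and .07
    corr_b0_b1 = vcov[0, 1] / (se[0] * se[1])  # about -.895
    print(b, MSE, se, corr_b0_b1)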

If we wish to make inference about a single parameter, then the t test on the coefficient of interest is all we need to determine whether the parameter is ‘significant.’ However, occasionally we may be interested in a test of a set of parameters, rather than a single parameter. For example, suppose our theory suggests that 3 variables are important in predicting an outcome, net of a host of control variables. Ideally we should be able to construct a joint test of the importance of all 3 variables. In this case, we can conduct ‘nested F tests.’ There are numerous approaches to doing this, but here we will discuss two equivalent approaches. One approach uses R² from two nested models; the other uses the ANOVA table information directly.

The F-test approach is computed as:

F = [ (SSE(R) − SSE(F)) / (df(R) − df(F)) ]  ÷  [ SSE(F) / df(F) ]

Here, R refers to the reduced model (with fewer variables), and F refers to the full model (with all the variables). The degrees of freedom for the reduced model will be greater than for the full model, because the degrees of freedom for the error are n − k, where k is the number of regressors (including the intercept). The numerator df for the test is df(R) − df(F); the denominator df is df(F). Recognize that this test is simply determining the proportional increase in error the reduced model generates relative to the full model.

An equivalent formulation of this test can be constructed using R²:

F = [ (R²F − R²R) / (df(R) − df(F)) ]  ÷  [ (1 − R²F) / df(F) ]

Simple algebra shows that these are equivalent tests; however, when there is no intercept in the models, we must use the first calculation.
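The SSE version of the test is easy to write as a small function; the inputs in the example call are hypothetical placeholders, not results from these notes.

    # Nested F test from the SSE formulation above.
    def nested_f(sse_reduced, df_reduced, sse_full, df_full):
        """df_* are the error degrees of freedom (n - k) for each model."""
        numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
        denominator = sse_full / df_full
        return numerator / denominator

    # Hypothetical example: jointly testing 3 added coefficients.
    F = nested_f(sse_reduced=120.0, df_reduced=96, sse_full=100.0, df_full=93)
    print(F)   # compare to an F distribution with 3 and 93 degrees of freedom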

2 Interpretation of the Model Parameters

The interpretation of the parameters in this model is ultimately no different than the interpretation of the parameters in the simple linear regression, with one key difference. The parameters are now interpreted as the relationship between X and Y, net of the effects of the other variables. What exactly does this mean? It means simply that the coefficient for X in the model represents the relationship between X and Y, holding all other variables constant. Let’s look at this a little more in depth. Suppose we have a data set consisting of 3 variables: X1, X2, and Y, and the data set looks like:

X1   X2   Y
0    1    2
0    2    4
0    3    6
0    4    8
0    5    10
1    6    12
1    7    14
1    8    16
1    9    18
1    10   20

Suppose we are interested in the relationship between X1 and Y. The mean of Y when X1 = 0 is 6, and the mean of Y when X1 = 1 is 16. In fact, when a regression model is conducted, here are the results:

ANOVA TABLE   Df   SS    MS    F    Sig
Regression     1   250   250   25   p = .001
Residual       8   80    10
Total          9   330

Effect      Coefficient   Standard Error   t      p-value
Intercept   6             1.41             4.24   .003
Slope       10            2                5.00   .001

The results indicate there is a strong relationship between X1 and Y. But, now let’s conduct a multiple regression that includes X2. Those results are:

ANOVA TABLE   Df   SS    MS    F    Sig
Regression     2   330   165   ∞    p = 0
Residual       7   0     0
Total          9   330

Effect      Coefficient   Standard Error   t   p-value
Intercept   0             0                ∞   0
bX2         2             0                ∞   0
bX1         0             0                ∞   0

Notice how the coefficient for X1 has now been reduced to 0. Don’t be confused by the fact that the t-test for the coefficient is large: this is simply an artifact of the contrived nature of the data, such that the standard error is 0 (and the computer considers 0/0 to be ∞). The coefficient for X2 is 2, and the R² for the regression (if you compute it) is 1, indicating perfect linear fit without error.
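Because the ten observations are given in full above, the two regressions can be verified directly; a minimal numpy sketch:

    # Regress Y on X1 alone, then on X1 and X2 together.
    import numpy as np

    x1 = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
    x2 = np.arange(1.0, 11.0)
    y = 2.0 * x2                                   # Y = 2, 4, ..., 20

    X_a = np.column_stack([np.ones(10), x1])       # simple regression on X1
    b_a = np.linalg.lstsq(X_a, y, rcond=None)[0]   # intercept 6, slope 10

    X_b = np.column_stack([np.ones(10), x1, x2])   # add X2 to the model
    b_b = np.linalg.lstsq(X_b, y, rcond=None)[0]   # X1's coefficient drops to 0
    print(b_a, b_b)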

This result shows that, once we have controlled for the effect of X2, there is no relationship between X1 and Y: X2 accounts for all of the relationship between X1 and Y. Why? Because the effect of X1 we first observed is simply capturing the fact that there are differences in X2 by levels of X1. Let’s look at this in the context of some real data.

Race differences in birthweight (specifically black-white differences) have been a hot topic of investigation over the last decade or so. African Americans tend to have lighter babies than whites. Below is a regression that demonstrates this. The data are from the National Center for Health Statistics, which has recorded every live birth in the country for decades. The data I use here are a subset of the total births for 1999.

ANOVA TABLE   Df       SS         MS          F        Sig
Regression    1        2.94e+8    2.94e+8     815.15   p = 0
Residual      36,349   1.31e+10   361,605.4
Total         36,350   1.34e+10

Effect      Coefficient   Standard Error   t        p-value
Intercept   3364.1        3.44             977.07   0
Black       -245.1        8.58             -28.55   0

These results indicate that the average white baby weighs 3364.1 grams, while the average African American baby weighs 3364.1 − 245.1 = 3119 grams. The racial difference in these birth weights is significant, as indicated by the t-test. The question is: why is there a racial difference? Since birth weight is an indicator of the health status of the child, we may begin by considering that social class may influence access to prenatal care, and prenatal care may increase the probability of a healthy and heavier baby. Thus, if there is a racial difference in social class, this may account for the racial difference in birthweight. So, here are the results for a model that ‘controls on’ education (a measure of social class):

ANOVA TABLE   Df       SS         MS          F        Sig
Regression    2        3.71e+8    1.85e+8     515.57   p = 0
Residual      36,348   1.31e+10   359,525.5
Total         36,350   1.34e+10

Effect      Coefficient   Standard Error   t        p-value
Intercept   3149.8        15.14            208.06   0
Black       -236.96       8.58             -27.62   0
Education   16.67         1.15             14.54    0


Education, in fact, is positively related to birthweight: more educated mothers produce heavier babies. The coefficient for ‘Black’ has become smaller, indicating that, indeed, part of the racial difference in birthweight is attributable to racial differences in educational attainment. In other words, at similar levels of education, the birthweight gap between blacks and whites is about 237 grams. The overall mean difference, however, is about 245 grams, if educational differences in these populations are not considered.

Let’s include one more variable: gestation length. Obviously, the longer a fetus is in utero, the heavier it will tend to be at birth. In addition to social class differences, there may be gestation-length differences between blacks and whites (perhaps another proxy for prenatal care). When we include gestation length, here are the results:

ANOVA TABLE   Df       SS         MS          F         Sig
Regression    3        4.17e+9    1.4e+9      5443.88   p = 0
Residual      36,347   9.3e+9     255,108.2
Total         36,350   1.34e+10

Effect      Coefficient   Standard Error   t         p-value
Intercept   -1695.4       41.7             -40.64    0
Black       -160.44       7.25             -22.12    0
Education   15.7          .97              16.26     0
Gestation   124.9         1.02             121.98    0

These results indicate that a large portion of the racial difference in birthweight is attributable to racial differences in gestation length: notice how the race coefficient has been reduced to −160.44. Interestingly, the effect of education has also been reduced slightly, indicating that there are some gestational differences by education level. Finally, notice now how the intercept has become large and negative. This is because there are no babies with gestation length 0. In fact, in these data, the minimum gestation time is 17 weeks (the mean is just under 40 weeks). Thus, a white baby born to a mother with no education who gestated for 17 weeks would be expected to weigh only about 430 grams (−1695.4 + 124.9 × 17, or less than one pound). As we’ve discussed before, taken out of the context of the variables included in the model, the intercept is generally meaningless.

3 Comparing Coefficients in the Model

In multiple regression, it may be of interest to compare effects of variables in the model. Comparisons are fairly clear cut if one variable reaches significance while another one doesn't, but generally results are not that clear. Two variables may reach significance, but you may be interested in which one is more important. To some extent, perhaps even a large extent, this is generally more of a substantive or situational question than a statistical one. For example, from a policy perspective, if one variable is changeable, while another one isn't, then the more important finding may be that the changeable variable has a significant effect even after controlling for unchangeable ones. In that case, an actual comparison of coefficients may not be warranted. But let's suppose we are testing two competing hypotheses, each of which is measured by its own independent variable.

Let's return to the birth weight example. Suppose one theory indicates that the racial difference in birthweight is a function of prenatal care differences, while another theory indicates that birthweight differences are a function of genetic differences between races. (As a side note, there in fact is a somewhat contentious debate in this literature about racial differences in genetics.) Suppose that our measure of prenatal care differences is education (reflecting the effect of social class on ability to obtain prenatal care), and that our measure of genetic differences is gestation length (perhaps the genetic theory explicitly posits that racial differences in genetics lead to fewer weeks of gestation). We would like to compare the relative merits of the two hypotheses. We have already conducted a model that included gestation length and education. First, we should note that, because the racial differences remained even after including these two variables, neither is a sufficient explanation. However, both variables have a significant effect, so how do we compare their effects?

It would be unreasonable to simply examine the relative size of the coefficients (about 15.7 for education and 124.9 for gestation), because these variables are measured in different units. Education is measured in years, while gestation is measured in weeks. Our first thought may be, therefore, to recode one or the other variable so that the units match. So, what are the results if we recode gestation length into years? If we do so, we will find that the effect of education does not change, but the effect of gestation becomes 6,494.9. Nothing else in the model changes (including the t-tests, the model F, R^2, etc.). But now, the gestation length effect appears huge relative to the education effect. To demonstrate why this approach doesn't work, suppose we rerun the model after recoding education into centuries. In that case, the effect of education becomes 1,570.5. Now, you could argue that measuring education in centuries and comparing to gestation length in years does not place the two variables in the same units. However, I would argue that neither of these units is more inappropriate than the other. Practically no (human) fetus gestates for a year, and no one attends school for a century. The problem is thus a little more complicated than simply a difference in units of measurement. Plus, I would add that it is often impossible to make units comparable: for example, how would you compare years of education with salary in dollars?

What we are missing is some measure of the variability in the measures. As a first step, I often compute the effect of variables at the smallest and largest values possible in the measure. For example, gestation length has a minimum and maximum of 17 and 52, respectively. Thus, the net effect of gestation for a fetus that gestates for 17 weeks is 2,123 grams, while the net effect of gestation for a fetus that gestates for 52 weeks is 6,495 grams. Education, on the other hand, ranges from 0 to 17. Thus, the net effect of education at 0 years is 0 grams, while the net effect at 17 years is 267 grams. On its face, then, the effect of gestation length appears to be larger.

However, even though there is a wider range in the net effect of gestation versus education, we have not considered that the extremes may be just that (extreme) in the sample. A 52-week gestation period is a very rare (if real) event, as is having no education whatsoever. For this reason, we generally standardize our coefficients for comparisons using some function of the standard deviation of the X variables of interest and sometimes the standard deviation of Y as well. For fully standardized coefficients (often denoted using β as opposed to b), we compute:

β_X = b_X × [s.d.(X) / s.d.(Y)].

If we do that for this example, we get standardized effects for education and gestation of .07 and .53, respectively. The interpretation is that, for a standard unit increase in education, we expect a .07 standard unit increase in birth weight; whereas for gestation, we expect a .53 standard unit increase in birth weight for a one standard unit increase in gestation. From this perspective, the effect of gestation length indeed seems more important.
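To make the computation concrete, here is a minimal Python sketch (numpy only). The data generated below are fabricated stand-ins for the birthweight variables, which are not reproduced in these notes; only the standardization formula β_X = b_X × s.d.(X)/s.d.(Y) comes from the text above, and the variable names are mine.

    import numpy as np

    # Hypothetical stand-ins for the variables in the birthweight example.
    rng = np.random.default_rng(0)
    education = rng.integers(0, 18, size=500).astype(float)   # years
    gestation = rng.normal(39, 2, size=500)                   # weeks
    birthweight = 3000 + 16 * education + 125 * gestation + rng.normal(0, 500, size=500)

    # Fit the regression by OLS to get unstandardized coefficients.
    X = np.column_stack([np.ones_like(education), education, gestation])
    b, *_ = np.linalg.lstsq(X, birthweight, rcond=None)

    # Fully standardized coefficients: beta_X = b_X * sd(X) / sd(Y).
    beta_educ = b[1] * education.std(ddof=1) / birthweight.std(ddof=1)
    beta_gest = b[2] * gestation.std(ddof=1) / birthweight.std(ddof=1)
    print(beta_educ, beta_gest)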

As a final note on comparing coefficients, we must realize that it is not always useful to compare even the standardized coefficients. The standardized effect of an X variable that is highly nonnormal may not be informative. Furthermore, we cannot compare standardized coefficients of dummy variables, because they only take two values (more on this in the next set of notes). Finally, our comparisons cannot necessarily help us determine which hypothesis is more correct: in part, this decision rests on how well the variables operationalize the hypothesis. For example, in this case, I would not conclude that genetics is more important than social class (or prenatal care). It could easily be argued (and perhaps more reasonably) that gestation length better represents prenatal care than genetics!

4 The Maximum Likelihood Solution

So far, we have derived the OLS estimator, and we have discussed the standard error estimator, for multiple regression. Here, I derive the Maximum Likelihood estimator and the standard errors.

Once again, we have a normal likelihood function, but we will now express it in matrix form:

p(Y | β, σ_e, X) ≡ L(β, σ_e | X, Y) = (2πσ_e^2)^(−n/2) exp{ −(1/(2σ_e^2)) (Y − Xβ)^T (Y − Xβ) }

Taking the log of this likelihood yields:

LL(β, σ_e) ∝ −n log(σ_e) − (1/(2σ_e^2)) (Y − Xβ)^T (Y − Xβ)

We now need to take the partial derivative of the log of the likelihood with respect to each parameter. Here, we will consider that we have two parameters: the vector of coefficients and the variance of the error. The derivative of the log likelihood with respect to β should look somewhat familiar (much of it was shown in the derivation of the OLS estimator):

∂LL/∂β = −(1/(2σ_e^2)) (−2X^T)(Y − Xβ) = (1/σ_e^2) X^T (Y − Xβ).

If we set this expression equal to 0 and multiply both sides by σ_e^2, we end up with the same result as we did for the OLS estimator. We can also take the partial derivative with respect to σ_e:


∂LL/∂σ_e = −n/σ_e + σ_e^(−3) (Y − Xβ)^T (Y − Xβ)

The solution, after setting the derivative to 0 and performing a little algebra, is:

σ_e^2 = (Y − Xβ)^T (Y − Xβ) / n.

These solutions are the same as we found in the simple regression problem, only expressed in matrix form.

The next step is to take the second partial derivatives in order to obtain the standard errors. Let's first simplify the first partial derivatives (and exchange σ_e^2 with τ):

∂LL/∂β = (1/τ)(X^T Y − X^T Xβ)

and

∂LL/∂τ = −(n/2)τ^(−1) + (1/2)τ^(−2) e^T e.

The second partial derivative of LL with respect to β is:

∂^2 LL/∂β^2 = −(1/τ)(X^T X).

The second partial derivative of LL with respect to τ is:

∂^2 LL/∂τ^2 = (n/2)τ^(−2) − τ^(−3) e^T e.

The off-diagonal elements of the Hessian matrix are:

∂^2 LL/∂β∂τ = ∂^2 LL/∂τ∂β = (1/τ^2)(−X^T Y + X^T Xβ).

Thus, the Hessian matrix is:

[ −(1/τ)(X^T X)               (1/τ^2)(−X^T Y + X^T Xβ)      ]
[ (1/τ^2)(−X^T Y + X^T Xβ)    (n/2)τ^(−2) − τ^(−3) e^T e    ].

We now need to take the negative expectation of this matrix to obtain the information matrix. The expectation of β is β, and so, if we substitute the computation of β (= (X^T X)^(−1) X^T Y), the numerator of the off-diagonal elements is 0.

The expectation of the second partial derivative with respect to β remains unchanged. However, the second partial derivative with respect to τ changes a little. First, the expectation of e^T e is nτ, just as we discussed while deriving the ML estimators for the simple regression model. This gives us:

n/(2τ^2) − nτ/τ^3.

Simple algebraic manipulation gives us:


−n/(2τ^2).

Thus, after taking the negative of these elements, our information matrix is:

[ (1/τ)(X^T X)    0           ]
[ 0               n/(2τ^2)    ].

As we've discussed before, we need to invert this matrix and square root the diagonal elements to obtain the standard errors of the parameters. Also as we've discussed before, the inverse of a (block) diagonal matrix is simply a matrix with the diagonal elements (blocks) inverted. Thus, our variance-covariance matrix is:

[ τ(X^T X)^(−1)    0          ]
[ 0                2τ^2/n     ],

which should look familiar after substituting σ_e^2 in for τ.
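The closed-form results above translate directly into a few lines of linear algebra. The following sketch, in Python with numpy and simulated data in place of any real dataset, computes the ML estimates and then the standard errors from the inverted information matrix; the variable names are mine.

    import numpy as np

    # Simulated data standing in for an actual dataset.
    rng = np.random.default_rng(1)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
    beta_true = np.array([3.0, 1.0, -0.5])
    y = X @ beta_true + rng.normal(scale=2.0, size=n)

    # ML (= OLS) estimate of the coefficient vector.
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # ML estimate of the error variance uses n (not n - k) in the denominator.
    e = y - X @ beta_hat
    sigma2_ml = (e @ e) / n

    # Standard errors from the inverse information matrix:
    # Var(beta_hat) = sigma2 * (X'X)^{-1},  Var(sigma2_hat) = 2*sigma2^2 / n.
    se_beta = np.sqrt(np.diag(sigma2_ml * XtX_inv))
    se_sigma2 = np.sqrt(2 * sigma2_ml**2 / n)
    print(beta_hat, se_beta, sigma2_ml, se_sigma2)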


(copyright by Scott M. Lynch, March 2003)

Expanding the Model Capabilities: Dummy Variables, Interactions, and Nonlinear Transformations (Soc 504)

Until now, although we have considered including multiple independent variables in our regression models, we have only considered continuous regressors and linear effects of them on the outcome variable. However, the linear regression model is quite flexible, and it is fairly straightforward to include qualitative variables, like race, gender, type of occupation, etc. Furthermore, it is quite easy to model nonlinear relationships between independent and dependent variables.

1 Dummy Variables

When we have dichotomous (binary) regressors, we construct something called 'indicator,' or 'dummy,' variables. For example, for gender, we could construct a binary variable called 'male' and code it 1 if the person is male and 0 if the person is female. Suppose gender were the only variable in the model:

Yi = b0 + b1Malei.

The expected value of Y for a male, then, would be b0 + b1, and the expected value of Y for a female would be just b0. Why? Because 0 × b1 = 0, and 1 × b1 is just b1. Interestingly, the t-test on the b1 parameter in this case would be identical to the t test you could perform on the difference in the mean of Y for males and females.

When we have a qualitative variable with more than two categories (e.g., race coded as white, black, other), we can construct more than one dummy variable to represent the original variable. In general, the rule is that you construct k − 1 dummy variables for a qualitative variable with k categories. Why not construct k dummies? Let's assume we have a model in which race is our only predictor. If we construct a dummy for 'black' and a dummy for 'other,' then we have the following model:

Yi = b0 + b1Blacki + b2Otheri.

When ‘black’=1, we get:

Yi = b0 + b1.

When ‘other’=1, we get:

Yi = b0 + b2.

But, when we come across an individual who is white, both dummy coefficients drop from the model, and we're left with:

Yi = b0.


If we also included a dummy variable for 'white,' we could not separate the effect of the dummy variable and the intercept (more specifically, the white dummy would be perfectly collinear with the intercept and the other dummies). We call the omitted dummy variable the 'reference' category, because the other dummy coefficients are interpreted in terms of the expected mean change relative to it. (Note: as an alternative to having a reference group, we could simply drop the intercept from the model.)
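A small sketch of the k − 1 coding rule in Python (numpy). The race values below are invented purely for illustration; the point is simply that 'white' serves as the reference category and gets no column of its own.

    import numpy as np

    # Hypothetical qualitative variable with k = 3 categories.
    race = np.array(["white", "black", "other", "black", "white", "other"])

    # Construct k - 1 = 2 dummy variables; "white" is the omitted reference.
    black = (race == "black").astype(float)
    other = (race == "other").astype(float)

    # Design matrix: intercept plus the two dummies.
    X = np.column_stack([np.ones(len(race)), black, other])
    print(X)

With this coding, b0 is the expected mean of Y for whites, and the dummy coefficients are expected mean differences from whites.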

When we have only dummy variables in a model representing a single qualitative variable, the model is equivalent to a one-way ANOVA. Recall from your previous coursework that ANOVA is used to detect differences in means across two or more groups. When there are only two groups, a simple t-test is appropriate for detecting differences in means (indeed, an ANOVA model, and a simple regression as discussed above, would yield an F value that would simply be t^2). However, when there are more than two groups, the t-test becomes inefficient. We could, for example, conduct a t-test on all the possible two-group comparisons, but this would be tedious, because the number of these tests is C(k, 2) = k! / ((k − 2)! 2!), where k is the number of categories/groups (with k = 4 groups, for example, that is already 6 tests). Thus, in those cases, we typically conduct an ANOVA, in which all group means are simultaneously compared. If the F statistic from the ANOVA is significant, then we can conclude that at least one group mean differs from the others.

The standard ANOVA model with J groups is constructed using the following computations:

Grand Mean ≡ X̄ = (Σ_j Σ_i X_ij) / n

Sum of Squares Total (SST) = Σ_j Σ_i (X_ij − X̄)^2

Sum of Squares Between (SSB) = Σ_j n_j (X̄_j − X̄)^2

Sum of Squares Within (SSW) = Σ_j Σ_i (X_ij − X̄_j)^2

The ANOVA table is then constructed as:

            SS     df       MS        F
Between     SSB    J − 1    SSB/df    F
Within      SSW    N − J    SSW/df
Total       SST    N − 1

This should look familiar: it is the same table format used in regression. Indeed, as I've said, the results will be equivalent also. The TSS calculation is identical to that from regression. The degrees of freedom are also equivalent. Furthermore, if we realize that, in a model with dummy variables only, Ŷ is simply the group mean, then the SSW calculation is identical to the Σ(Y − Ŷ)^2 calculation we use in regression. Thus, ANOVA and regression are equivalent models, and when dummy variables are used alongside continuous covariates, the regression model is sometimes called ANCOVA (analysis of covariance).
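To see the equivalence concretely, the sketch below computes SSB, SSW, and the ANOVA F directly from group means, and then recovers the same F from a regression of Y on J − 1 dummies. The three groups and their data are fabricated for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    groups = np.repeat([0, 1, 2], 30)                    # J = 3 fabricated groups
    y = np.concatenate([rng.normal(m, 1, 30) for m in (5.0, 6.0, 6.5)])

    # One-way ANOVA sums of squares.
    grand = y.mean()
    ssb = sum(len(y[groups == j]) * (y[groups == j].mean() - grand) ** 2 for j in range(3))
    ssw = sum(((y[groups == j] - y[groups == j].mean()) ** 2).sum() for j in range(3))
    J, N = 3, len(y)
    F_anova = (ssb / (J - 1)) / (ssw / (N - J))

    # Same F from a regression on J - 1 dummies.
    X = np.column_stack([np.ones(N), (groups == 1).astype(float), (groups == 2).astype(float)])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = ((y - X @ b) ** 2).sum()
    sst = ((y - grand) ** 2).sum()
    F_reg = ((sst - sse) / (J - 1)) / (sse / (N - J))
    print(F_anova, F_reg)   # identical up to rounding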

A limitation of the ANOVA approach is that the F-test only tells us whether at least one group mean differs from the others; it doesn't pinpoint which group mean is the culprit. In fact, if you use ANOVA and want to determine which group mean differs from the others, additional tests must be constructed. However, that's the important feature of regression with dummies: the t-tests on the individual dummy coefficients provide us with this information. So, why do psychologists (and clinicians, often) use ANOVA? A short answer is that there is a 'cultural inertia' at work. Psychologists (and clinicians) historically have dealt with distinct groups in clinical trials/experimental settings in which the ANOVA model is perfectly appropriate, and regression theoretically unnecessary. Sociologists, on the other hand, generally don't use such data, and hence, sociologists have gravitated toward regression analyses.

So far, we have discussed the inclusion of dummy variables in a relatively simple model. When we have continuous covariates also in the model, interpretation of the dummy variable effect is enhanced with graphics. A dummy coefficient in such a model simply tells us how much the regression line 'jumps' for one group versus another. For example, suppose we were interested in how education relates to depressive symptoms, net of gender. Then, we might construct a model like:

Y = b0 + b1Education + b2Male.

I have conducted this model using data from the National Survey of Families and Households (NSFH), with the following results: b0 = 25.64, b1 = −.74, and b2 = −3.27. These results suggest that education reduces symptoms (or more correctly, that individuals with greater education in the sample have fewer symptoms, at a rate of −.74 depressive symptoms per year of education) and that men have 3.27 fewer symptoms on average. Graphically, this implies:

Figure 1. Depressive Symptoms by Years of Education and Gender


If we had a qualitative variable in the model that had more than two categories (hence more than one dummy variable), we would have more than two lines, but all would be parallel.

2 Statistical Interactions

Notice that the lines in the above example are parallel: it is assumed in this model that education has the same effect on depressive symptoms for men and for women. This is often an unrealistic assumption, and often our theories provide us with reasons why we may expect different slopes for different groups. In these cases, we need to come up with some way of capturing this difference in slopes.

2.1 Interactions Between Dummy and Continuous Variables

The simplest interactions involve the interaction between a dummy variable and a continuous variable. In the education, gender, and depressive symptoms example, we may expect that education's effect varies across gender, and we may wish to model this differential effect. A model which does so might look like:

Y = b0 + b1Education + b2Male + b3(Education×Male)

This model differs from the previous one, because it includes a 'statistical interaction term,' Education × Male. This additional variable is easy to construct: it is simply the product of each individual's education value and their gender. For women (Male = 0), this interaction term is 0, but for men, the interaction term is equal to their education level (Education × 1). This yields the following equations for men and women:

Ymen = (b0 + b2) + (b1 + b3)Education

Ywomen = b0 + b1Education

In the equation for men, I have consolidated the effects of education into one coefficient, so that the difference in education slopes for men and women is apparent. We still have an expected mean difference for men and women (b2), but we now also allow the effect of education to vary by gender, unless the interaction effect is 0. I have estimated this model, with the following results: b0 = 26.29, b1 = −.794, b2 = −4.79, b3 = .118 (n.s.):


Figure 2. Depressive Symptoms by Years of Education and Gender (with Interaction)

Notice that in this plot, the lines for women and men appear to converge slightly as education increases. This implies that women gain more, in terms of reduction of depressive symptoms, from education than men do.

In this example, the interaction effect is not significant, and I didn't expect it to be. There is no theory that suggests an interaction between gender and education.
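A minimal sketch of how such an interaction is built and read. It simply hard-codes the estimates reported above (b0 = 26.29, b1 = −.794, b2 = −4.79, b3 = .118) rather than re-fitting the NSFH data, which are not included here; the function name is mine.

    # Coefficients reported in the text for the education x male interaction model.
    b0, b1, b2, b3 = 26.29, -0.794, -4.79, 0.118

    def predicted_symptoms(education, male):
        """Predicted depressive symptoms; 'male' is a 0/1 dummy."""
        return b0 + b1 * education + b2 * male + b3 * (education * male)

    # Group-specific education slopes implied by the interaction.
    slope_women = b1            # -0.794
    slope_men = b1 + b3         # -0.676
    print(slope_women, slope_men)
    print(predicted_symptoms(12, male=0), predicted_symptoms(12, male=1))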

In the following example, I examine racial differences in depressive symptoms. The results for a model with race (white) only are: b0 = 17.14, bwhite = −2.92. In a second model, I examine the racial difference, net of education. The results for that model are: b0 = 25.62, beduc = −.73, bwhite = −1.85:

Figure 3. Depressive Symptoms by Years of Education and Race


The change in the coefficient for 'white' between the two models indicates that education accounts for part of the racial difference in symptoms. More specifically, it suggests that the large difference observed in the first model is in part attributable to compositional differences in educational attainment between white and nonwhite populations. The second model suggests that whites have lower levels of depressive symptoms than nonwhites, but that education reduces symptoms for both groups.

Theory might suggest that nonwhites (blacks specifically) get fewer returns from education, though, so in the next model, I include a white × education interaction term. Those results (b0 = 21.64, beduc = −.39, bwhite = 4.34, bed×w = −.51) yield a plot that looks like:

Figure 4. Depressive Symptoms by Years of Education and Race (with Interaction)

Here, it is apparent that whites actually have greater depressive symptoms than nonwhites at very low levels of education, perhaps because nonwhites are more adaptive to adverse economic conditions than whites are. However, as education increases, symptoms decrease much faster for whites than for nonwhites. In fact, we could determine the precise level of education at which the lines cross, by setting the nonwhite and white results equal to each other and solving for Education:

White Symptoms = Nonwhite Symptoms

21.64 − .39E + 4.34 − .51E = 21.64 − .39E

4.34 = .51E

E = 8.51

We may choose to get more elaborate. So far, I have alluded to the fact that these racial results may be attributable to differences between blacks and whites; we may not find the same results if we further subdivided our race variable, but this is an empirical question. Thus, I conducted an additional set of models in which I disaggregated the 'nonwhite' category into 'black' and 'other' categories, and I constructed interaction terms between education and the 'white' and 'other' dummy variables. I obtained the following results:


Variable     Model 1     Model 2     Model 3
Intercept    17.4***     26.2***     22.8***
White        -3.18***    -2.32***    3.22#
Other        -.84        -1.53*      -2.22
Educ                     -.74***     -.45***
E×W                                  -.45**
E×O                                  .09

What do these results suggest? It appears, based on Model 1, that whites have significantly fewer symptoms than blacks, but that 'others' don't differ much from blacks. The second model indicates that educational attainment is important, and, interestingly, the 'other' coefficient is now significant. Here we observe a suppressor relationship: educational differences between 'blacks' and 'others' mask the differences in symptoms between these races. This result implies that 'other' races have lower educational attainment than blacks. Model 3 confirms our earlier result, that education provides greater returns to whites. It also dispels the hypothesis that this interaction effect is unique to blacks (blacks and others appear to have similar patterns, given the nonsignificance of the coefficients for 'others'). It is important to note that the 'other' category is composed (in this study) primarily of Hispanics (who do, in fact, have lower educational attainment than blacks). If we were to further subdivide the race variable, we may find that heterogeneity within the 'other' category is masking important racial differences.

2.2 Interactions Between Continuous Variables

So far, we have been able to represent regression models with a single variable (continuous or dummies), and models with a single continuous variable and a set of dummies, and models with interactions between a single continuous variable and a dummy variable, with two-dimensional graphs. However, we will now begin considering models which can be represented graphically only in three dimensions. Just as a continuous variable may have an effect that varies across levels of a dummy variable, we may have continuous variables whose effects vary across levels of another continuous variable. These interactions become somewhat more difficult to perceive visually.

Below, I have shown a two-variable model in which depressive symptoms were regressed on functional limitations (ADLs) and years of education. The regression function is a plane, rather than a line, as indicated in the figure.


Figure 5. Depressive Symptoms by Years of Education and ADLs

Notice that the plane is tilted in both the ADL and Education dimensions. This tells us that as we move down in education or up in ADLs, we get increases in depressive symptoms. Obviously, the lowest depressive symptoms occur for highly-educated, unlimited individuals, and the greatest symptoms occur for low-educated, highly-limited individuals. Suppose that we don't believe this pattern is linear in both dimensions, but rather that there may be some synergy between education and ADLs. For example, we might expect the combination of physical limitation and low education to be far worse, in terms of producing depressive symptoms, than this additive model suggests. In that case, we may include an interaction term between education and ADL limitations.


Figure 6. Depressive Symptoms by Years of Education and ADLs (with Interaction)

Now we are no longer dealing with a flat plane, but rather a twisted surface.

If you don't want to plot the model-predicted data, you can always algebraically factor the model to determine the total effect of one variable, but you may need to do this in both dimensions. For example, this model is:

Y = b0 + b1Education + b2ADLs + b3(Education× ADLs)

Thus, the total effect of education is:

Education Effect = b1 + b3ADLs

Similarly, the total effect of ADLs is:

ADL Effect = b2 + b3Education

If you want, you can plot these two-dimensional effects, but it is often useful to just plug in the numbers and observe what happens. I find the following table useful for making interpretations:


XZ Interaction    Effect of X (+)                           Effect of X (−)
(+)               Effect of X grows across levels of Z      Effect of X shrinks across levels of Z
(−)               Effect of X shrinks across levels of Z    Effect of X grows across levels of Z

In our example, the individual effect of education is negative, the effect of ADLs is positive, and the interaction effect is negative. This implies that the total effect of education increases (in magnitude) across levels of disability, and that the total effect of disability decreases across levels of education. We can discern this from the graph above: it appears that at very high levels of education, adding ADL limitations increases depressive symptoms more slowly (in fact, it actually reduces symptoms) than it does at low levels of education. In the other dimension, it appears that adding years of education has a relatively large effect on reducing depressive symptoms among persons with 6 ADLs compared to the effect among persons with 0 ADLs. Substantively, these results suggest that physical limitation is particularly troubling for lower-SES individuals, and that it is less troubling for persons with high SES. Certainly, we could attempt to pin down an explanation by including additional variables, and here we enter the realm of some complex interpretation. Without interactions, if one variable (Z), when entered into a model, reduces the effect of another (X), we may argue that Z partially explains the effect of X. We can make similar interpretations with interactions. If, for example, we entered some measure of 'sense of control' into the above model, and the interaction effect disappeared or reversed, we would have some evidence that differential feelings of control explain why functional limitations have less effect on depressive symptoms for higher-SES individuals. Sometimes, however, we may need an interaction to explain an interaction.
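One way to 'plug in the numbers' is to tabulate the total effect of each variable over a grid of the other, as in the short sketch below. The coefficient values here are hypothetical placeholders chosen only to match the signs described in the text (the notes do not report the fitted coefficients for this model); the factoring of the total effects follows the equations above.

    import numpy as np

    # Hypothetical coefficients with the signs described in the text:
    # education negative, ADLs positive, interaction negative.
    b1_educ, b2_adl, b3_int = -0.6, 2.0, -0.15

    adls = np.arange(0, 7)            # 0 through 6 limitations
    education = np.arange(0, 21, 5)   # 0, 5, 10, 15, 20 years

    # Total effect of education at each level of ADLs: b1 + b3*ADLs.
    educ_effect = b1_educ + b3_int * adls
    # Total effect of ADLs at each level of education: b2 + b3*Education.
    adl_effect = b2_adl + b3_int * education

    print(dict(zip(adls.tolist(), educ_effect.round(2))))
    print(dict(zip(education.tolist(), adl_effect.round(2))))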

2.3 Limitations of Interactions

In theory, there are no limits to the number of interaction terms that can be put into a model, and there are no limits on the degree of interaction. For example, you can construct three-way or even higher-order interactions. But there is a tradeoff, in terms of parsimony and interpretability. Multi-way interactions become very difficult to interpret, and they also require complex theories to justify their inclusion in a model. As an example from my own research on education and health, it is clear that there is a life-course pattern in the education-health relationship. That is, education appears to become more important across age. Hence, education interacts with age. It is also clear that education is becoming more important in differentiating health across birth cohorts: education has a stronger relationship to health for more recent cohorts than for older birth cohorts, perhaps due to improvement in the quality of education (content-wise), or perhaps due to the greater role education plays today in getting a high-paying job. Taken together, these arguments imply a three-way interaction between age, birth cohort, and education. However, a simple three-way interaction would be very difficult to interpret.


A second difficulty with interactions occurs when all the variables that are thought to interact are dummy variables. For example, suppose from the previous example, age were indicated by a dummy variable representing being 50 years old or older (versus younger than 50), education were indicated by a dummy representing having a high school diploma (versus not), and cohort was indicated by a dummy representing having a birth date pre-WWII (versus post-WWII). A simple three-way interaction between these variables would ONLY take a value of 1 for persons who were a 1 on all three of these indicators. Yet, this may not be the contrast of interest.

There are no simple rules to guide you through the process of selecting reasonable interactions; it just takes practice and thoughtful consideration of the hypotheses you seek to address.

3 Nonlinear Transformations

We often find that the standard linear model is inappropriate for a given set of data, either because the dependent variable is not normally distributed, or because the relationship between the independent and dependent variables is nonlinear. In these cases, we may make suitable transformations of the independent and/or dependent variables to coax the dependent variable to normality or to produce a linear relationship between X and Y. Here, we will first discuss the case in which the dependent variable is not normally distributed.

3.1 Transformations of Dependent Variables

A dependent variable may not be normally distributed, either because it simply doesn't have a normal bell shape, or because its values are bounded, creating a skew. In other cases, it is possible that the variable has a nice symmetric shape, but that the interval on which the values of the variable exist is narrow enough to produce unreasonable predicted values (e.g., the dependent variable is a proportion). In terms of inference for parameters, while a nonnormal dependent variable doesn't guarantee a nonnormal distribution of errors, it generally will produce one. As stated previously, this invalidates the likelihood function, and it also invalidates the derivation of the sampling distributions that provide us with the standard errors for testing. Thus, we may consider a transformation to normalize the dependent variable.

There are 4 common transformations that are used for dependent variables, including the logarithmic, exponential, power, and logistic transformations:

Transformation    New D.V.                 New Model
Logarithmic       Z = ln(Y)                Z = Xb
Exponential       Z = exp(Y)               Z = Xb
Power             Z = Y^p, p ≠ 0, 1        Z = Xb
Logistic          Z = ln(Y/(1 − Y))        Z = Xb


Notice in all cases, the new model is the transformed dependent variable regressed on the regressors. This highlights that there will be a change in the interpretation of the effects of X on Y. Specifically, you can either interpret the effects of X on Z, or you can invert the transformation after running the model and computing predicted scores in Z units.

Each of these transformations has a unique effect on the distribution of Y. The logarithmic transformation, as well as power transformations in which 0 < p < 1, each reduce right skew. Why? Because log_b(X) is a function whose return value is the power to which b must be raised to equal X. For example, log_10(100) = 2. Below is the pattern for log base 10:

X        Log X
1        0
10       1
100      2
1000     3
10000    4

You can see that, although X is increasing by powers of 10, log X is increasing linearly. We tend to use the natural log function, rather than the base 10 function, because it has nice properties. The base of the natural logs is e ≈ 2.718. This transformation will tend to pull large values of Y back toward the mean, reducing right skewness. By the same token, power transformations (0 < p < 1) will have a similar effect, only such transformations generally aren't as strong as the log transformation.
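A small numerical illustration of the tail-pulling behavior described here, using a right-skewed variable generated for the purpose (no real data involved); the helper function is mine.

    import numpy as np

    rng = np.random.default_rng(3)
    y = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)   # right-skewed by construction

    def skew(v):
        """Simple moment-based skewness coefficient."""
        return ((v - v.mean()) ** 3).mean() / v.std() ** 3

    print(skew(y))              # strongly positive (right skew)
    print(skew(np.log(y)))      # approximately 0 after the log
    print(skew(y ** 0.5))       # square root: reduced skew, but less than the log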

Recognize that changing the distribution of Y will also have an effect on the relationship between X and Y. In the listing above of the logarithms of the powers of 10, the transformation changes the structure of the variable, so that an X variable which was linearly related to powers of 10 will now be nonlinearly related to the log of these numbers.

The log and power transformations are used very often, because we encounter right skewness problems frequently. Any variable that is bounded on the left by 0 will tend to have right-skewed distributions, especially if the mean of the variable is close to 0 (e.g., income, depressive symptoms, small attitude scales, count variables, etc.).

When distributions have left skewness, we may use other types of power transformations in which p > 1, or we may use the exponential transformation. By the same token that roots and logs of variables pull in the right tail of a distribution, the inverse functions of roots and logs will expand the right tail of a distribution, correcting for a left skew.

What if p < 0 in a power transformation? This does two things. First, it inverts the distribution: what were large values of Y would now be small values after the transformation (and vice versa). Then, it accomplishes the same tail-pulling/expanding that power transformations in which p is positive do. We don't often see these types of transformations, unless we are dealing with a variable in which interest really lies in its inverse. For example, we may take rates and invert them to get time until an event. The main reason that we don't see this type of transformation frequently is that it doesn't affect skewness any more than a positive power transformation would.


We need to be careful when applying the log and power transformations. We can't take even roots when a variable takes negative values. Similarly, we can't take the logarithm of 0. When we are faced with these problems, we may simply add a constant to the variable before transforming it. For example, with depressive symptoms, we may wish to add 1 to the variable before taking the log to ensure that our logarithms will all exist.

The final transformation we will discuss is the logistic transformation. The logistic transformation is useful when the outcome variable of interest is a proportion: a number bounded on the [0, 1] interval. The distribution of such a variable might very well be symmetric, but the bounding poses a problem, because a simple linear model will very likely predict values less than 0 and greater than 1. In such a case, we may take:

Z = ln(Y / (1 − Y)).

This transformation yields a boundless variable, because Z approaches −∞ as Y approaches 0, and Z approaches +∞ as Y approaches 1. Of course, because of a division-by-zero problem, and because the log of 0 is undefined, something must be done with values of 0 or 1.
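A sketch of the transformation, with the usual ad hoc fix of squeezing exact 0s and 1s slightly into the interior so the logit exists. The proportions and the squeeze value of .001 are arbitrary choices for illustration, not recommendations from the text.

    import numpy as np

    p = np.array([0.0, 0.10, 0.50, 0.90, 1.0])      # made-up proportions

    # Squeeze exact 0s and 1s slightly into (0, 1) so the logit exists.
    eps = 0.001
    p_adj = np.clip(p, eps, 1 - eps)

    z = np.log(p_adj / (1 - p_adj))                 # logistic (logit) transformation
    print(z)

    # The transformation is easily inverted after modeling Z:
    back = 1 / (1 + np.exp(-z))
    print(back)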

3.2 Transformations of Independent Variables

Whereas we may be interested in transforming a dependent variable so that the distribution of the variable becomes normal, there is no such requirement for independent variables: they need not be normally distributed. However, we often perform transformations on independent variables in order to linearize the relationship between X and Y. Before we discuss methods for linearizing a relationship between X and Y, we should note that what makes the 'linear model' linear is that the effects of the parameters are additive. Stated another way, the model is linear in the parameters. We may make all the transformations of the variables themselves that we like, and the linear model is still appropriate so long as the model parameters can be expressed in an additive fashion. For example:

Y = exp(b0 + b1X)

is a linear model, because it can be re-expressed as:

ln(Y ) = b0 + b1X.

However, the model Y = b0^(b1 X) is not a linear model, because it cannot be re-expressed in an additive fashion.

The same transformations that we used for altering the distribution of the dependent variable can also be used to capture nonlinearity in a regression function. The most common such transformations are power transformations, specifically squaring and cubing the independent variable. When we conduct such transformations on independent variables, however, we generally include both the original variable in the model as well as the transformed variables, just as when we included interaction terms, we included the original 'main effects.' (Note: this is a matter of convention, however, and not a mathematical requirement!) Thus, we often see models like:

Y = b0 + b1X + b2X^2 + b3X^3 + ...

Such models are called 'polynomial regression models,' because they include polynomial expressions of the independent variables. These models are good for fitting regression functions when there is relatively mild curvature in the relationship between X and Y. For example, it is well known that depressive symptoms tend to decline across young adulthood, bottoming out after retirement, before rising again in late life before death. This suggests a quadratic relationship between age and depressive symptoms rather than a linear relationship. Thus, I conducted a regression model of depressive symptoms on age, and then a second model with an age-squared term included:

Variable     Model 1     Model 2
Intercept    18.47***    23.34***
Age          -.085***    -.322***
Age^2                    .0025***

The first model indicates that age has a negative effect: every year of age contributes to a .085 symptom reduction. The second model reveals that age reduces symptoms, but also that each year of age contributes a smaller and smaller reduction. Ultimately, when age is large enough, age will actually begin to lead to increases in symptoms. We can see this by taking the first derivative of the regression function with respect to age, setting it equal to 0, and solving to find the age at which age begins to raise symptoms:

Y = b0 + b1·Age + b2·Age^2.

The derivative is:

∂Y/∂Age = b1 + 2·b2·Age.

Solving for the age at which the effect reaches a minimum gives:

Age_min = −b1 / (2·b2).

In this case, if we substitute our regression estimates in, we find the age is 64.4. The plot below confirms this result.
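The minimum is easy to verify numerically. The sketch below just plugs the reported Model 2 estimates into the derivative formula; nothing here goes beyond the arithmetic above, and the function name is mine.

    # Model 2 estimates reported above.
    b1_age, b2_age2 = -0.322, 0.0025

    # Net effect of a year of age at a given age: dY/dAge = b1 + 2*b2*Age.
    def age_effect(age):
        return b1_age + 2 * b2_age2 * age

    age_min = -b1_age / (2 * b2_age2)
    print(age_min)              # 64.4
    print(age_effect(40))       # negative: age still reducing symptoms
    print(age_effect(80))       # positive: age now raising symptoms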


Figure 7. Net Effect of Age on Depressive Symptoms: Second Degree (Quadratic) Polynomial Regression

In the plot, it is clear that the effect of age reaches a minimum in the mid-60s and then becomes positive afterward.

We can use polynomial regression functions to capture more than a single turning point. The rule is that we can capture up to k − 1 turning points with a polynomial regression of degree k. As an example, I reconducted the above regression model with an age-cubed term and got the following results:

Figure 8. Net Effect of Age on Depressive Symptoms: Third Degree Polynomial Regression


In this case, it looks like the effect of age bottoms out just before retirement age, and that the effect of age is positive but relatively trivial beyond retirement. Additionally, it appears that if one lives past 100 years of age, age again contributes to reducing symptoms (a second turning point). Of course, the age-cubed term was nonsignificant, so it really shouldn't be interpreted.

I should note that as you add polynomial terms, you can ultimately almost perfectly fit the data (and you can perfectly fit it, if there is only one person at each value of X). However, the tradeoff of a perfect fit is a loss of parsimony.

3.3 Dummy Variables and Nonlinearity

Sometimes, nonlinearity is severe enough or odd-patterned enough that we can't easily capture the nonlinearity with a simple transformation of either (or both) independent or dependent variables. In these cases, we may consider a spline function. Splines can be quite complicated, but for our purposes, we can accomplish a spline using dummy variables and polynomial expressions. In this case, we sometimes call the model a 'piecewise regression.' Imagine that the regression function is linear for values of X less than C and also linear for values above C:

Figure 9. Data That are Linear in Two Segments

In this figure, the relationship between X and Y is clearly linear, but in sections. Specifically, the function is:

Y = 5 + 2X,            if X < 11
Y = 27 + 6(X − 11),    if X > 10

If we ran separate regression models for X < 11 and X > 10 (e.g., constructed two samples), we would find a perfectly linear fit for both pieces. However, each section would have a different slope and intercept. One solution is to estimate a second-degree polynomial regression model (Y = b0 + b1X + b2X^2):

Figure 10. Quadratic Fit to Two-Segment Data

This model offers a fairly decent fit in this case, but it tends to underpredict and overpredict systematically, which produces serial correlation of the errors (violating the independence assumption). Plus, these data could be predicted perfectly with another model.

We could construct a dummy variable indicating whether X is greater than 10, and include this along with X in the model:

Y = b0 + b1X + b2(I(X > 10)).

This model says Y is b0 + b1X when X < 11, and (b0 + b2) + b1X when X > 10. This model doesn't work, because only the intercept has changed; the slope is still the same across pieces of X. But we could include an interaction term between the dummy variable and the X variable itself:

Y = b0 + b1X + b2(I(X > 10)) + b3(I ×X).

This model says Y = b0 + b1X when X < 11, and (b0 + b2) + (b1 + b3)X when X > 10. We have now perfectly fit the data, because we have allowed both the intercepts and slopes to vary across the 'pieces' of the data:


Figure 11. Spline (with Single Fixed Knot and Degree=1) or Piecewise Regression Model: Observed and Predicted Scores

This model can be called either a piecewise regression or a spline regression. Notice that the title of the figure includes "degree 1" and "1 knot." The 'knots' are the number of junctions in the function. In this case, there are 2 regression functions joined in one place. The 'degree' of the spline is 1, because our function within each regression is a polynomial of degree 1. We can, of course, include quadratic functions in each piece, and as we include more and more of them, the function becomes more and more complex because of the multiple interaction terms that must be included. Technically, a spline function incorporates the location of the knot as a parameter to be estimated (and possibly the number of knots, as well).
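The dummy-plus-interaction construction takes only a few lines of code. The sketch below generates the two-segment data described above, builds I(X > 10) and its interaction with X, and recovers an exact fit; the variable names are mine, and the setup follows the piecewise model in the text.

    import numpy as np

    # Two-segment data as described in the text.
    X = np.arange(0, 21, dtype=float)
    Y = np.where(X < 11, 5 + 2 * X, 27 + 6 * (X - 11))

    # Dummy for the upper segment and its interaction with X.
    upper = (X > 10).astype(float)
    design = np.column_stack([np.ones_like(X), X, upper, upper * X])

    b, *_ = np.linalg.lstsq(design, Y, rcond=None)
    print(b)                                   # intercept/slope shifts across the knot
    print(np.allclose(design @ b, Y))          # True: the fit is exact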

3.4 Final Note

For those who are interested in constructing a cognitive map of statistical methods, here are a few tidbits. First, conducting the log transformation on a dependent variable leaves us with one form of a Poisson regression model, which you would discuss in more depth in a class on generalized linear models (GLM). Also, the logistic transformation is used in a model called 'logistic regression,' which you would also discuss in a GLM class. However, the logistic transformation we are discussing, while the same as the one used in logistic regression, is not applied to the same type of data. In logistic regression, the outcome variable is dichotomous (or polytomous). Finally, the piecewise/spline regression discussed above is approximately a special case of a neural network model.


(copyright by Scott M. Lynch, March 2003)

1 Outliers and Influential Cases (Soc 504)

Throughout the semester we have assumed that all observations are clustered around the mean of x and y, and that variation in y is attributable to variation in x. Occasionally, we find that some cases don't fit the general pattern we observe in the data, but rather that some cases appear 'strange' relative to the majority of cases. We call these cases 'outliers.' We can think of outliers in (at least) two dimensions: outliers on x and outliers on y.

Figure 1. Y regressed on X, no outliers.

In the above plot, there are few if any outliers present, in either the X or Y dimensions. The regression coefficients for this model are: b0 = 2.88, b1 = 1.08. In the following plot, there is one Y outlier. The regression coefficients for this model are: b0 = 2.93, b1 = 1.08.


Figure 2. Y regressed on X, One Outlier on Y.

Notice that the regression line is only slightly affected by this outlier. The same would be true if we had an outlier that was only an outlier in the X dimension. However, if a case is an outlier in both the X and Y dimensions, we have a problem, as indicated by the next figure.


Figure 3. Y regressed on X, One Outlier on X and Y (lines shown with the outlier included and with it omitted).

In this figure, it is apparent that the outlier at the far right edge of the figure is pulling the regression line away from where it 'should' be. Outliers, therefore, have the potential to cause serious problems in regression analyses.

1.1 Detecting Outliers

There are numerous ways to detect outliers in data. The simplest method is to construct plots like the ones above to examine outliers. This procedure works well in two dimensions, but may not work well in higher dimensions. Additionally, what may appear as an outlier in one dimension may in fact not be an outlier when all variables are jointly modeled in a regression.

Numerically, there are several statistics that you can compute to detect outliers, but we will concentrate on only two: studentized residuals via dummy regression, and DFBetas via regression. These statistics are produced by most software packages on request.

Studentized residuals can be obtained by constructing a dummy variable representing an observation that is suspected to be an outlier (or do all of them, one at a time) and including the dummy variable in the regression model. If the dummy variable coefficient for a particular case is significant, it indicates that the observation is an outlier. The t test on the dummy variable coefficient is the studentized residual, which can also be obtained in other ways.

Many packages will produce studentized residuals if you ask for them. However, it is important to realize that these residuals will follow a t distribution, implying that (in a large sample) approximately 5% of them will appear 'extreme.' We correct for this problem by adjusting our t test critical value, using a 'Bonferroni correction.' The correction is to replace the usual critical t with t_(α/2n). This moves the critical t further out into the tails of the distribution, making it more difficult to obtain a 'significant' residual.

The above approach finds outliers, but it doesn't tell us how influential an outlier may be. Not all outliers are influential. An approach to examining the influence of a particular outlier is to conduct the regression model both with and without the offending observation. A statistic D_ij can then be computed as D_ij = B_j − B_j(−i). This is simply the difference between the j-th coefficient in the full model and the j-th coefficient in the model with observation i deleted from the data. This measure can be standardized using the following formula:

D*_ij = D_ij / SE_(−i)(β_j)

This standardization makes these statistics somewhat more comparable across variables, but we may be more interested in comparing these statistics across observations for the same coefficient/variable. In doing so, it may be helpful to simply plot all the D statistics for all observations for each variable.
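A rough sketch of both ideas using numpy and fabricated data: the studentized residual for a suspect case via the dummy-variable trick, and the standardized D statistics via delete-one refits. The Bonferroni-adjusted critical value is only indicated in a comment, since computing t quantiles would require scipy or a table; all names here are mine.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 50
    x = rng.normal(size=n)
    y = 1 + 2 * x + rng.normal(size=n)
    x[0], y[0] = 5.0, -10.0                    # plant an influential outlier

    X = np.column_stack([np.ones(n), x])

    def ols(Xm, ym):
        b = np.linalg.solve(Xm.T @ Xm, Xm.T @ ym)
        e = ym - Xm @ b
        s2 = (e @ e) / (len(ym) - Xm.shape[1])         # unbiased error variance
        se = np.sqrt(np.diag(s2 * np.linalg.inv(Xm.T @ Xm)))
        return b, se

    # Studentized residual for case 0 via the dummy-variable regression.
    dummy = np.zeros(n)
    dummy[0] = 1.0
    b_d, se_d = ols(np.column_stack([X, dummy]), y)
    # Bonferroni correction: compare to the critical t at alpha/(2n), not alpha/2.
    print(b_d[-1] / se_d[-1])

    # D statistics (DFBeta-style): refit with each observation deleted.
    b_full, _ = ols(X, y)
    for i in range(3):                                  # first few cases for brevity
        keep = np.arange(n) != i
        b_i, se_i = ols(X[keep], y[keep])
        print(i, (b_full - b_i) / se_i)                 # standardized D*_ij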

1.2 Compensating for Outliers

Probably the easiest solution for dealing with outliers is to delete them. This costs information, but it may be the best solution. Often outliers are miscoded values (either in the original data or in your own recoding schemes). Sometimes, however, they are important cases that should be investigated further. Finding clusters of outliers, for example, may lead you to discover that you have omitted something important from your model.


1.3 Cautions

Outliers can be very difficult to detect in high dimensions. You must also be very careful in using the approaches discussed above in looking for outliers. If there is a cluster of outliers, deleting only one of the observations in the cluster will not lead to a significant change in the coefficients, so the D statistic will not detect it.


(copyright by Scott M. Lynch, February 2003)

1 Multicollinearity (Soc 504)

Until now, we have assumed that the data are well-conditioned for linear regression. The next few sets of notes will consider what happens when the data are not so cooperative. The first problem we will discuss is multicollinearity.

1.1 Defining the Problem

Multicollinearity is a problem with being able to separate the effects of two (or more) variables on an outcome variable. If two variables are sufficiently alike, it becomes impossible to determine which of the variables accounts for variance in the dependent variable. As a rule of thumb, the problem primarily occurs when x variables are more highly correlated with each other than they are with the dependent variable.

Mathematically, the problem is that the X matrix is not of full rank. When this occurs, the X matrix (and hence the X^T X matrix) has determinant 0 and cannot be inverted. For example, take a general 3 × 3 matrix A:

A = [ a  b  b ]
    [ c  d  d ]
    [ e  f  f ]

The determinant of this matrix is:

adf + bde + bcf − (bde + adf + bcf) = 0

Recall from the notes on matrix algebra that the inverse can be found using the determinant function:

A^(−1) = (1/det(A)) adj(A).

However, when det(A) = 0, all of the elements of the inverse are clearly undefined. Generally, the problem is not severe enough (e.g., not every element of one column of X will be identical to another) to crash a program, but it will produce other symptoms.

To some extent, multicollinearity is a problem of not having enough information. Additional data points, for example, will tend to produce more variation across the columns of the X matrix and allow us to better differentiate the effects of two variables.

1.2 Detecting/Diagnosing Multicollinearity

In order to lay the foundation for discussing the detection of multicollinearity problems, I conducted a brief simulation, using the following generated variables (n = 100 each):


u ∼ N(0, 1)

x2 ∼ N(0, 1)

e ∼ N(0, 1)

x1 = .9·x2 + (1 − .9^2)^(1/2)·u

The construction of the fourth variable gives us variables x1 and x2 that have a correlation of .9. I then created a series of y variables:

y1 = 3 + x1 + x2 + (.01)e

y2 = 3 + x1 + x2 + (.1)e

y3 = 3 + x1 + x2 + (1)e

y4 = 3 + x1 + x2 + (5)e

Changing the error variance has the effect of altering the noise contained in y, reducing the relationship between each x and y. For example, the following are the correlations of each x and y:

        X1     X2
X1      1      .91
X2      .91    1
Y1      .98    .98
Y2      .98    .98
Y3      .87    .87
Y4      .37    .38

Notice that all of the correlations are high (which is atypical of sociological variables), but that the X variables are more highly correlated with each other than they are with Y3 and Y4.

I conducted regression models of each y on x1 and x2 to examine the effect of the very high collinearity between x1 and x2, given different levels of 'noise' (f(e)) disrupting the correlation between each x and y. I then reduced the data to size n = 20 and re-conducted these regressions. Here are the results:


             y1            y2           y3           y4
N = 100
  R^2        1             1            .79          .15
  F          1740033***    17463***     181.11***    8.51***
  b0         3.0***        3.0          3.04***      3.19***
  b1         1.0***        .99***       .93**        .63
  b2         1.0***        1.01***      1.11**       1.57
N = 20
  R^2        1             1            .82          .27
  F          333802***     3378***      38.63**      3.08#
  b0         3.0***        2.98***      2.76*        1.82
  b1         1.01***       1.12***      2.22*        7.08
  b2         .99***        .90***       0            −4.0

Some classic symptoms of multicollinearity include: 1) having a significant F, but no significant t-ratios; 2) wildly changing coefficients when an additional (collinear) variable is included in a model; and 3) unreasonable coefficients.

This example highlights some of these classic symptoms. First, in the final model (n = 20, y4), we have a significant F (p < .1), but none of the coefficients is significant (based on the t-ratios). Second, if we were to reconduct this final model with only x1 included, the coefficient for x1 would be 2.06***. However, as the model reported above indicates, the coefficient jumps to 7.08 when x2 is included. Finally, as the final model results indicate, the coefficients also appear to be unreasonable. If we examined the bivariate correlation between each x and y, we would find moderate and positive correlations, so it may be unreasonable for us to find regression coefficients that are opposite in sign and large as in this model.

To summarize the findings of the simulation, it appears that when there is relatively little 'noise' in y, collinearity between the x variables doesn't appear to cause much of a problem. However, when there is considerable noise (as is typical in the social sciences), collinearity significantly influences the coefficients, and this effect is exacerbated when there is less information (e.g., n is smaller). This highlights that to some extent multicollinearity is a problem of having too little information.
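The simulation is easy to reproduce in outline. The sketch below follows the same general recipe (two x's correlated at about .9 and y built with increasing error variance) and refits the model at n = 100 and n = 20; exact numbers will differ from the table above because the random draws differ, and the function name is mine.

    import numpy as np

    rng = np.random.default_rng(5)

    def fit(n, error_sd):
        u = rng.normal(size=n)
        x2 = rng.normal(size=n)
        x1 = 0.9 * x2 + np.sqrt(1 - 0.9**2) * u       # corr(x1, x2) ~ .9
        y = 3 + x1 + x2 + error_sd * rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return b

    for n in (100, 20):
        for sd in (0.01, 0.1, 1, 5):
            print(n, sd, fit(n, sd).round(2))          # coefficients wobble as noise grows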

There are several classical tests for diagnosing collinearity problems, but we will focus on only one, the variance-inflation factor, perhaps the most common test. As Fox notes, the sampling distribution variance for OLS slope coefficients can be expressed as:

V(b_j) = [1 / (1 − R_j^2)] × [σ_e^2 / ((n − 1) S_j^2)]

In this formula, R_j^2 is the explained variance we obtain when regressing x_j on the other x variables in the model, and S_j^2 is the variance of x_j. Recall that the variance of b_j is used in constructing the t-ratios that we use to evaluate significance. This variance is increased if either σ_e^2 is large, S_j^2 is small, or R_j^2 is large. The first term of the expression above is called the variance inflation factor (VIF). If x_j is highly correlated with the other x variables, then R_j^2 will be large, making the denominator of the VIF small, and hence the VIF very large. This inflates the variance of b_j, making it difficult to obtain a significant t-ratio. To some extent, we can offset this problem if σ_e^2 is very small (e.g., there is little noise in the dependent variable, or, alternatively, the x's account for most of the variation in y). We can also offset some of the problem if S_j^2 is large. We've discussed this previously, in terms of gaining leverage on y, but here increasing the variance of x_j will also help generate more noise in the regression of x_j on the other x's, and will thus tend to make R_j^2 smaller.

ically, we often use 10 as our ‘threshold’ at which we consider it to be a problem, but this issimply a rule of thumb. The figure below shows what V IF is at different levels of correlationbetween xj and the other variables. V IF = 10 implies that the r-square for the regressionmust be .9.

Figure. Variance Inflation Factor by Level of Relationship between X Variables (reference line is for VIF = 10).

1.3 Compensating for Multicollinearity

There are several ways of dealing with multicollinearity when it is a problem. The first,

and most obvious, solution is to eliminate some variables from the model. If two variables

are highly collinear, then it means they contain highly redundant information. Thus, we can

pick one variable to keep in the model and discard the other one.

If collinearity is a problem but you can't decide on an appropriate variable to omit, you


can combine the offending x variables into a reduced set of variables. One such approach

would be to conduct an exploratory factor analysis of the data, determine the number of

unique factors the x variables contain, and generate either (factor) weighted or unweighted

scales, based on the factors on which each variable 'loads.' For example, suppose we have

10 x variables that may be collinear. Suppose also that a factor analysis suggests that the

variables really reflect only two underlying factors, and that variables 1−5 strongly correlate

with the first factor, while variables 6− 10 strongly correlate with the second factor. In that

case, we could sum variables 1 − 5 to create a scale, and sum variables 6 − 10 to make a

second scale. We can either sum the variables directly, or we can weight them based on their

factor loadings. We then include these two scales in the regression model.

Another solution is to transform one of the offending x variables. We have already seen

that multicollinearity becomes particularly problematic when two x variables have a stronger

relationship with each other than they have with the dependent variable. Ideally, if we want

to model the relationship between each x and y, we would like to see a strong relationship

between the x variables and y. Transforming one or both x variables may yield a better

relationship to y, and at the same time, it will eliminate the collinearity problem. Of course,

be sure not to perform the same transformation on both x variables, or you will be back at

square 1.

A final approach to remedying multicollinearity is to conduct ‘ridge regression.’ Ridge

regression involves transforming all variables in the model and adding a biasing constant to

the new (X^T X) matrix before solving the equation system for b. You can read about this technique in more depth in the book.

1.4 Final Note

As a final note, we should discuss why collinearity is an issue. As we’ve discussed before,

the only reason we conduct multiple regression is to determine the effect of x on y, net of

other variables. If there is no relationship between x and the other variables, then multi-

ple regression is unnecessary. Thus, to some extent, collinearity is the basis for conducting

multiple regression. However, when collinearity is severe, it leads to unreasonable coefficient

estimates, large standard errors, and consequently bad interpretation/inference. Ultimately

there is a very thin line between collinearity being problematic and collinearity simply ne-

cessitating the use of multiple regression.


(copyright by Scott M. Lynch, March 2003)

1 Non-normal and Heteroscedastic Errors (Soc 504)

The Gauss-Markov Theorem says that OLS coefficient estimates are BLUE (Best Linear Unbiased Estimates) when the errors are homoscedastic and uncorrelated. Normality is not required for that result, but it matters for inference: when errors are nonnormal, OLS is no longer efficient relative to estimators outside the linear class, and in small samples the usual standard errors will be biased. When errors are heteroscedastic, the standard errors become biased. Thus, we typically examine the distribution of the errors to determine whether they are normal and homoscedastic.

Some approaches to examining nonnormality of the errors include constructing a histogram of the error terms and constructing a Q-Q plot (quantile-quantile plot). Certainly, there are other techniques, but these are two simple and effective methods. Constructing a histogram of errors is trivial, so I don't provide an example of it. Creating a Q-Q plot, on the other hand, is a little more involved. The objective of a Q-Q plot is to compare the empirical error distribution to a theoretical distribution (i.e., normal). If the errors are normally distributed, then the empirical and theoretical distributions will look identical, and a scatterplot of the two will fall along a straight line. The observed errors are obtained from the regression analysis. The steps to computing the theoretical distribution are as follows (a small code sketch implementing them appears after the list):

1. Order the errors from smallest to largest so that X_1 < X_2 < . . . < X_n.

2. Compute the statistic CDF_empirical(i) = (i − 1/2)/n, where i is the rank of the error after step 1. This gives us the empirical CDF for the data.

3. Compute the inverse of this CDF value under the theoretical distribution: z_i = Φ^{-1}(CDF_empirical(i)).

4. Plot X against z.
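The following C sketch (my own illustration, written in the same spirit as the handout's other C programs, using made-up residuals) carries out these steps: it sorts the errors, computes the empirical CDF (i − 1/2)/n, and inverts the standard normal CDF by bisection using erf() from the C math library. The printed pairs (z_i, X_i) are what would be plotted.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>

/* standard normal CDF computed from the error function */
double phi(double z)
{
  return 0.5 * (1.0 + erf(z / sqrt(2.0)));
}

/* invert the standard normal CDF by bisection */
double phi_inverse(double p)
{
  double lo = -10.0, hi = 10.0, mid;
  while(hi - lo > 1e-8)
  {
    mid = 0.5 * (lo + hi);
    if(phi(mid) < p) {lo = mid;} else {hi = mid;}
  }
  return 0.5 * (lo + hi);
}

int compare(const void *a, const void *b)
{
  double d = *(const double *)a - *(const double *)b;
  return (d > 0) - (d < 0);
}

int main(void)
{
  /* hypothetical regression residuals, for illustration only */
  double e[10] = {-.30, -1.28, .24, 1.28, 1.20, 1.73, -2.18, -.23, 1.10, .38};
  int n = 10, i;
  double p, z;

  qsort(e, n, sizeof(double), compare);  /* step 1: order the errors */
  for(i = 0; i < n; i++)
  {
    p = (i + 1 - 0.5) / n;               /* step 2: empirical CDF for rank i+1 */
    z = phi_inverse(p);                  /* step 3: theoretical quantile */
    printf("%f %f\n", z, e[i]);          /* step 4: the pair to plot */
  }
  return 0;
}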

The figure below provides an example of this plot. I simulated 100 observations (errors) from a N(5, 2) distribution:


[Figure 1. Q-Q Plot for 100 random draws (X) from a N(5, 2) distribution. X-axis: Theoretical CDF; y-axis: Empirical CDF.]

Notice that in this case, the distribution clearly falls on a straight line, indicating the errors

are probably normally distributed. The following figure is a histogram of exp(X). This

distribution is clearly right skewed.


[Figure 2. Histogram of exp(X) from above.]

The resulting Q-Q plot below reveals this skew clearly. The scatterplot points are not

on a straight line. Rather, the plot follows a curve. The plot reveals that there is too much

mass at the far left end of the distribution relative to what one would expect if the distribution

were normal.


[Figure 3. Q-Q Plot for exp(X). X-axis: Theoretical CDF; y-axis: Empirical CDF.]

The next histogram shows the distribution of ln(X), which has a clear

left skew.


[Figure 4. Histogram of Log(X).]

The resulting Q-Q plot also picks up this skew, as evidenced by the curvature at the top

edge of the plot. This curvature indicates that there is not enough cumulative mass in the

middle of the distribution relative to what would be expected under a normal distribution.


[Figure 5. Q-Q Plot of Log(X). X-axis: Theoretical CDF; y-axis: Empirical CDF.]

1.1 Heteroscedasticity

The above plots are helpful diagnostic tools for nonnormal errors. What about heteroscedas-

ticity? Heteroscedasticity means non-constant variance of the errors across levels of X. The

main consequence of heteroscedasticity is that the standard errors of the regression coeffi-

cients can no longer be trusted. Recall that the standard errors are a function of σ_e. When σ_e

is not constant, then, the typical standard error formula is incorrect.

The standard technique for detecting heteroscedasticity is to plot the error terms against

either the predicted values of Y or each of the X variables. Below is a plot of errors that

evidences heteroscedasticity across levels of X.


[Figure 6. Example of Heteroscedasticity: errors plotted against x (0 to 100), with spread increasing as x increases.]

The plot has a classic ‘horn’ shape in which the variance of the errors appears to increase as

X increases. This is typical of count variables, which generally follow a Poisson distribution. The Poisson distribution is governed by one parameter, λ (its density is p(X) = λ^X exp(−λ) / X!),

which is both the mean and variance of X. When X is large, it implies that λ is probably

large, and that the variance of X is also large. Heteroscedasticity is not limited to this

horn-shaped pattern, but it often follows this pattern.

1.2 Causes of, and Solutions for, Nonnormality and Heteroscedasticity

Nonnormal errors and heteroscedasticity may be symptoms of an incorrect functional form

for the relationship between X and Y , an inappropriate level of measurement for Y , or an

omitted variable (or variables). If the functional form of the relationship between X and Y

is misspecified, errors may be nonnormal because there may be clusters of incredibly large

errors where the model doesn’t fit the data. For example, the following plot shows the

observed and predicted values of Y from a model in which an incorrect functional form was


estimated.

[Figure 7. Observed and (Improperly) Fitted Values: observed y and the fitted line plotted against x.]

In this case, the correct model is y = b_0 + b_1 exp(X), but the model y = b_0 + b_1 X was fitted.

The histogram of the errors is clearly nonnormal:


[Figure 8. Histogram of Errors from Example.]

and the Q-Q plot reveals the same problem:

[Figure 9. Q-Q Plot of Errors in Example.]

A plot of the errors against the X values is:

[Figure 10. Plot of Errors against X in Example.]

This plot suggests that heteroscedasticity is not a problem, because the range of the errors

doesn’t really appear to vary across levels of X. However, the figure shows there is clearly a

problem with the functional form of the model: the errors reveal a clear pattern. This implies

that we may not only have the wrong functional form, but we also have a problem with serial

autocorrelation (which we will discuss in the context of time series analyses later).

If we do the appropriate transformation, we get the following.


[Figure 11. Histogram of Errors in Revised Example.]

[Figure 12. Q-Q Plot of Errors in Revised Example.]


[Figure 13. Plot of Errors Against X in Revised Example.]

These plots show clear normality and homoscedasticity (and no autocorrelation) of the errors.

In other cases, a simple transformation of one or more variables may not be sufficient. For

example, if we had a discrete, dichotomous outcome, the functional form of the model would

need to be significantly respecified. Nonnormal and heteroscedastic errors are very often the

consequence of measurement of the dependent variable at a level that is inappropriate for

the linear model.

As stated above, another cause of heteroscedasticity is omitting a variable that should

be included in the model. The obvious solution is to include the appropriate variable.

An additional approach to correcting for heteroscedasticity (weighted least squares) will be

discussed soon.


(copyright by Scott M. Lynch, March 2003)

1 Alternative Estimation Strategies (Soc 504)

When regression assumptions are violated to the point that they degrade the quality of the OLS estimator, we may use alternative strategies for estimating the model (or use alternative models). I discuss four types of alternative estimation strategies here: bootstrapping, robust estimation with M-estimators, Weighted Least Squares (WLS) estimation, and Generalized Least Squares (GLS) estimation.

2 Bootstrapping

Bootstrapping is useful when your sample size is small enough that the asymptotic properties of MLE or OLS estimators are questionable. It is also useful when you know the errors (in a small sample) aren't normally distributed.

The bootstrapping approach for a simple statistic is relatively simple. Given a sample of n observations, we take m resamples with replacement of size n from the original sample. For each of these resamples, we compute the statistic of interest and form the empirical distribution of the statistic from the results.

For example, I took a sample of 10 observations from a U(0, 1) distribution. This size sample is hardly large enough to justify using normal theory for estimating the standard error of the mean. In this example, the sample mean was .525, and the estimated standard error was .1155. After taking 1000 bootstrap samples, the mean of the distribution of bootstrap sample means was .525, and the estimated standard error was .1082. The distribution of means looked like:


[Figure 1. Bootstrap Sample Means (histogram).]

This distribution is approximately normal, as it should be. The empirical 95% confidence interval for the mean was (.31, .74). This interval can be found by taking the values at the 2.5th and 97.5th percentiles of the empirical bootstrap distribution.

In this case, the bootstrap results did not differ much from the original results. However, we can better trust the bootstrap results, because normal theory really doesn't allow us to be confident in our original estimate of the standard error.

Below is the C program that produces the bootstrap samples for the above example.

#include<stdio.h>

#include<math.h>

#include<stdlib.h>

double uniform(void);

int main(int argc, char *argv[])

{

int samples,rep,pick;

double mean,threshold;

double replicate,r;

double y[10]={.382,.100681,.596484,.899106,.88461,.958464,.014496,.407422,.863247,.138585};

FILE *fpout;

for(samples=1;samples<=1000;samples++)

{

printf("doing sample %d\n",samples);

mean=0;

for(rep=0;rep<10;rep++)


{
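/* draw one of the 10 original observations with replacement: the scan below keeps y[pick] for the largest threshold (0, .1, ..., .9) that r exceeds */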

r=uniform();

threshold=0;

for(pick=0;pick<10;pick++)

{

if(r>threshold){replicate=y[pick];}

threshold+=.1;

}

mean+=replicate;

}

mean/=10;

if ((fpout=fopen(argv[1],"a"))==NULL)

{printf("couldn't open the file\n"); exit(0);}

fprintf(fpout,"%d %f\n",samples,mean);

fclose(fpout);

}

return 0;

}

double uniform(void)

{
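/* scale random(), which returns a long in [0, 2^31 - 1], to a uniform deviate on (0, 1]; the floor keeps the deviate strictly positive */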

double x;

double deviate;

x=random();

deviate=x/(2147483647);

if (deviate<.0000000000000001){deviate=.0000000000000001;}

return deviate;

}
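Given the 1000 bootstrap means written out by this program, the percentile interval described above takes only a few more lines. Here is a sketch that assumes the output file holds the 1000 "sample mean" pairs in the format written by the program:

#include <stdio.h>
#include <stdlib.h>

int cmp(const void *a, const void *b)
{
  double d = *(const double *)a - *(const double *)b;
  return (d > 0) - (d < 0);
}

int main(int argc, char *argv[])
{
  double means[1000];
  int id, i = 0;
  FILE *fpin;

  /* read the "sample mean" pairs written by the bootstrap program above */
  if(argc < 2 || (fpin = fopen(argv[1], "r")) == NULL)
  {printf("couldn't open the file\n"); exit(0);}
  while(i < 1000 && fscanf(fpin, "%d %lf", &id, &means[i]) == 2) {i++;}
  fclose(fpin);

  qsort(means, i, sizeof(double), cmp);

  /* with 1000 resamples, the 2.5th and 97.5th percentiles are the
     25th and 975th ordered values (array indices 24 and 974) */
  printf("95%% CI: (%f, %f)\n", means[24], means[974]);
  return 0;
}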

In a regression setting, there are two ways to conduct bootstrapping. In one approach, we assume the X variables are random (rather than fixed). Then, we can obtain bootstrap estimates of the sampling distribution of β by taking samples of size n from the original sample and forming the distribution of (X^{(j)T} X^{(j)})^{-1}(X^{(j)T} Y^{(j)}) (the OLS estimates from each bootstrap sample "j"). In the other approach, we treat X as fixed. If X is fixed, then we must sample the error term (the only random component of the model). We do this as follows (a small C sketch implementing these steps for a simple regression appears after the list):

1. Compute the OLS estimates β̂ for the original sample.

2. Obtain the residuals e_i = Y_i − X_i β̂.

3. Take bootstrap samples of the residuals e.

4. Compute Y_i^{(j)} = Ŷ_i + e_i^{(j)} for each bootstrap sample, where Ŷ_i = X_i β̂ is the fitted value.

5. Compute the OLS estimates for each bootstrap sample Y^{(j)}.
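As a concrete illustration (my own sketch, using made-up data rather than anything from the handout), the following C program carries out the fixed-X residual bootstrap for a simple regression of y on a single x: it fits OLS once, resamples the residuals with replacement, rebuilds y^(j) = ŷ + e^(j), and refits OLS on each bootstrap sample.

#include <stdio.h>
#include <stdlib.h>

#define N 10
#define REPS 1000

/* simple-regression OLS: returns intercept and slope through pointers */
void ols(double *x, double *y, int n, double *b0, double *b1)
{
  int i;
  double xbar = 0, ybar = 0, sxx = 0, sxy = 0;
  for(i = 0; i < n; i++) {xbar += x[i]; ybar += y[i];}
  xbar /= n; ybar /= n;
  for(i = 0; i < n; i++)
  {
    sxx += (x[i] - xbar) * (x[i] - xbar);
    sxy += (x[i] - xbar) * (y[i] - ybar);
  }
  *b1 = sxy / sxx;
  *b0 = ybar - (*b1) * xbar;
}

int main(void)
{
  /* hypothetical data, for illustration only */
  double x[N] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  double y[N] = {2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2};
  double e[N], yhat[N], ystar[N];
  double b0, b1, b0star, b1star;
  int i, j, pick;

  ols(x, y, N, &b0, &b1);               /* step 1: OLS on the original sample */
  for(i = 0; i < N; i++)
  {
    yhat[i] = b0 + b1 * x[i];
    e[i] = y[i] - yhat[i];              /* step 2: residuals */
  }

  for(j = 1; j <= REPS; j++)
  {
    for(i = 0; i < N; i++)
    {
      pick = rand() % N;                /* step 3: resample residuals with replacement */
      ystar[i] = yhat[i] + e[pick];     /* step 4: rebuild the bootstrap outcome */
    }
    ols(x, ystar, N, &b0star, &b1star); /* step 5: OLS on the bootstrap sample */
    printf("%d %f %f\n", j, b0star, b1star);
  }
  return 0;
}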


3 Robust Estimation (with M-Estimators)

Fox discusses M-Estimation as a supplemental method to OLS for examining the effect of outliers. The book introduces the notion of "influence plots" by showing how a single outlying observation influences a sample estimate (e.g., the mean, median, etc.):

[Figure 2. Influence Plot Example: the sample mean and median plotted as the additional Y observation varies from -10 to 10.]

The influence plot was produced by taking a sample of 9 N(0, 1) observations and adding a 10th observation taking values incrementally in the range (−10, 10). The statistics (mean and median) were then computed. The median is clearly more resistant to the outlying observation, as indicated by the plot. The true mean and median of a N(0, 1) variable are 0, and the sample median remains much closer to this value as the 10th observation becomes more extreme.

When outliers are a problem, OLS estimation may not be the best approach to estimating regression coefficients, because the OLS estimator minimizes the sum of the squared errors of the observations. Outliers then exert undue influence on the coefficients, because the error terms are squared. In order to determine whether our estimates are robust, we can try criteria other than OLS. For example, a common alternative is the least absolute value estimator:

LAV = min Σ |Y_i − Ŷ_i|

Fox shows that this class of estimators can be estimated generally using iteratively reweighted least squares (IRLS). In IRLS, the derivative of the objective function is re-


expressed in terms of weights that apply to each observation. These weights are a function of the error term (for LAV, w_i = 1/|E_i|), which, of course, is a function of the regression coefficients in a regression model, or the error can simply be Y − µ when estimating a mean. Thus, we can solve for the regression coefficients by using a starting value for them, computing the weights, recomputing the regression coefficients, and so on. While an estimate of a sample mean is:

µ̂ = (Σ w_i Y_i) / (Σ w_i)

the estimate of the regression coefficients is:

β̂ = (X^T W X)^{-1} (X^T W Y).

I illustrate the use of IRLS estimation with the same sample of 9 N(0, 1) observations used above. In this example, I use IRLS on the LAV objective function to estimate the mean. Notice that the LAV estimate of the mean is bounded. If we think about the median of a sample of 10, the median is the mean of the two centermost observations. In this sample data, when the 10th observation is an extreme negative value, the centermost observations are −.23 and .24. When the 10th observation is an extreme positive value, the centermost observations are .24 and 1.10. Thus, the LAV estimate of the mean will be in the range (−.23, 1.10), as the figure below illustrates. Once the value becomes extreme enough, the weight for that observation becomes very small in the IRLS routine, so the influence of the observation is minimal. Below is a C program that estimates the mean under the LAV criterion using IRLS.

[Figure 3. IRLS Results of Using the LAV Function to Estimate a Mean: the LAV estimate remains bounded as the additional observation becomes extreme, while the ordinary mean does not.]

5

Page 77: Simple Linear Regression Scott M Lynch

#include<stdio.h>

#include<math.h>

#include<stdlib.h>

int main(int argc, char *argv[])

{

int i,j,k,loop;

double mean[100],weight[10],num,denom;

double y[10]={-.30023, -1.27768, .244257, 1.276474, 1.19835,

1.733133,-2.18359,-.23418, 1.095023, 0.0};

FILE *fpout;

for(i=-20;i<=20;i++)

{

y[9]=i*1.0;

mean[0]=0;

mean[1]=0;

for(j=0;j<=9;j++)

{mean[1]+=y[j];}

mean[1]/=10;

loop=1;

while(fabs(mean[loop]-mean[loop-1])>.000001)

{

printf("mean %d=%f\n",loop,mean[loop]);

for(k=0;k<=9;k++){weight[k]=1.0/(fabs(y[k]-mean[loop]));}

num=0; denom=0;

for(k=0;k<=9;k++){denom+=weight[k]; num+=(weight[k]*y[k]);}

loop++;

mean[loop]=num/denom;

}

if ((fpout=fopen(argv[1],"a"))==NULL)

{

printf("couldn't open the file\n"); exit(0);

}

fprintf(fpout,"%f %f %f\n",y[9],mean[1],mean[loop]);

fclose(fpout);

}

return 0;

}

Within the while() loop, the previous value of the mean is used to compute the weights of each observation (weight[k]=1.0/(fabs(y[k]-mean[loop]));). Then, given the new weights, a new value of the mean is computed (mean[loop]=num/denom;). If we wanted to use an alternate objective function (e.g., Huber, bisquare), then we would simply replace the weight calculation. I do not illustrate IRLS for a robust regression, but it is a straightforward modification of this algorithm in which µ is replaced by Xβ in the weight calculation, and the calculation of µ is replaced by the weighted OLS estimator shown above.
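For example, to switch from LAV to Huber weights (a sketch under the assumption of a fixed tuning constant, not something estimated in the handout), only the weight line inside the while() loop changes:

/* Huber weights: full weight for small residuals, downweighting for large ones;
   1.345 is a commonly used tuning constant for unit-variance errors */
double huberk = 1.345, resid;
for(k=0;k<=9;k++)
{
  resid = fabs(y[k]-mean[loop]);
  weight[k] = (resid <= huberk) ? 1.0 : huberk/resid;
}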


4 Weighted Least Squares and Generalized Least Squares Estimation

When errors are heteroscedastic, the error variance is no longer constant across all observations. Thus, the assumption e ∼ N(0, σ_e^2 I) is no longer true. Rather, e ∼ N(0, Σ), where Σ is a diagonal matrix (the off-diagonal elements are still 0).

In this case, the likelihood function is modified to incorporate this altered error variance term:

L(β | X, Y) = ∏_i [1 / ((2π)^{1/2} σ_i)] exp{ −(Y_i − (Xβ)_i)^2 / (2σ_i^2) },

We can typically assume that the diagonal elements of the matrix are weighted values of a constant error variance, say:

Σ = σ_e^2 × diag(1/w_1, 1/w_2, . . . , 1/w_n)

which gives us a matrix expression for the likelihood:

L(β | X, Y) = [1 / ((2π)^{n/2} |Σ|^{1/2})] exp{ −(1/2)(Y − Xβ)^T Σ^{-1} (Y − Xβ) }.

The estimator for the parameters then becomes β̂ = (X^T W X)^{-1}(X^T W Y), where W = diag(w_1, . . . , w_n), and the variance of the estimator is σ_e^2 (X^T W X)^{-1}.

The obvious question is: "What do we use for weights?" We may know that the error variance is related to one of the observed variables, in which case we could either build this function into the likelihood function above (and estimate it), or we could use the inverse of that variable as the weights. Alternatively, we could bypass WLS estimation altogether and simply divide every variable in the model by the offending X variable and use OLS estimation.
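As a concrete sketch of the one-predictor case (my own illustration with made-up data; the choice of weights w_i = 1/x_i is just an assumption that the error variance grows with x), the weighted estimator (X^T W X)^{-1}(X^T W Y) reduces to a few weighted sums:

#include <stdio.h>

#define N 5

int main(void)
{
  /* hypothetical data; suppose the error variance is proportional to x,
     so each observation gets weight w_i = 1/x_i */
  double x[N] = {1, 2, 4, 8, 16};
  double y[N] = {2.2, 3.9, 8.5, 16.0, 33.5};
  double w[N];
  double sw = 0, swx = 0, swy = 0, swxx = 0, swxy = 0;
  double b0, b1;
  int i;

  for(i = 0; i < N; i++)
  {
    w[i] = 1.0 / x[i];
    sw   += w[i];
    swx  += w[i] * x[i];
    swy  += w[i] * y[i];
    swxx += w[i] * x[i] * x[i];
    swxy += w[i] * x[i] * y[i];
  }

  /* weighted normal equations for y = b0 + b1*x: this is exactly
     (X'WX)^{-1}(X'WY) written out for a single predictor */
  b1 = (swxy - swx * swy / sw) / (swxx - swx * swx / sw);
  b0 = (swy - b1 * swx) / sw;

  printf("WLS estimates: b0 = %f  b1 = %f\n", b0, b1);
  return 0;
}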

Generalized Least Squares estimation is the generalization of WLS. In WLS, we assume the off-diagonal elements of the Σ matrix are 0. This assumption, if you recall from our notes on multiple regression theory, implies that errors are independent across observations. Often, however, this may be an unreasonable assumption (e.g., in time series, or in clustered data). We can relax this assumption to obtain the GLS estimator: β̂ = (X^T Σ^{-1} X)^{-1} X^T Σ^{-1} Y. This should look similar to the WLS estimator—in fact, it's the same (the WLS estimator just has 0's on the off-diagonal elements). The variance estimator for β̂ is (X^T Σ^{-1} X)^{-1}.

Obviously, we cannot estimate all the elements of Σ—there are n(n + 1)/2 unique elements in the matrix. Thus, we may use functions or other simplifications to reduce the elements to be estimated. This is the essence of basic time series models, as we will discuss in a few weeks.


(copyright by Scott M. Lynch, March 2003)

1 Missing Data (Soc 504)

The topic of missing data has gained considerable attention in the last decade, as evidenced by several recent trends. First, most graduating PhDs in statistics now claim "missing data" as an area of interest or expertise. Second, it has become difficult to publish empirical work in sociology without discussion of how missing data were handled. Third, more and more methods for handling missing data have sprouted up over the last few years.

Although missing data have received a growing amount of attention, there are still some key misunderstandings regarding the problems that missing data generate, as well as acceptable solutions. Missing data are important to consider, because they may lead to substantial biases in analyses. On the other hand, missing data are often harmless beyond reducing statistical power. In these notes, I define various types of missingness and discuss methods for handling it in your research.

2 Types of Missingness

Little and Rubin (1987) define three different classes of missingness. Here, I define the key terms used in discussing missingness in the literature. I will then discuss how they relate to some ways in which we encounter missing data.

• Data missing on Y are observed at random (OAR) if missingness on Y is not a function of X. Phrased another way, if X determines missingness on Y, the data are not OAR.

• Data missing on Y are missing at random (MAR) if missingness on Y is not a function of Y. Phrased another way, if Y determines missingness on Y, the data are not MAR.

• Data are Missing Completely at Random (MCAR) if missingness on Y is unrelated to X or Y. In other words, MCAR = OAR + MAR. If the data are MCAR or at least MAR, then the missing data mechanism is considered "ignorable." Otherwise, the missing data mechanism is considered "nonignorable."

To make these ideas concrete, suppose we are examining the effect of education on income. If missingness on income is a function of education (e.g., highly educated individuals don't report their income), then the data are not OAR. If missingness on income is a function of income (e.g., persons with high income do not report their income), then the data are not MAR.

There are a number of ways in which we may encounter missing data in social science research, including the following:

• Individuals not followed up by design (meets MCAR assumption)

• Item nonresponse (may not meet any assumption)


• Loss due to followup, or attrition (may not meet any assumption)

• Mortality (respondent’s, not yours)

• Sample selection (e.g., estimating a model that is only applicable to a subset of the total sample) (may not meet any assumption)

All of these avenues of missingness are common in sociology, and indeed, it is generally more surprising to find you have very little missing data when conducting an analysis than to find you have much missing data. The good news is that the statistical properties of maximum likelihood estimation obtain if the data are MCAR or MAR. That is, the data do not have to be OAR. This is true only when the model is properly specified, however. In that case, whatever "piece" of the data is observed is sufficient to produce unbiased parameters. Why is this true? Below is a plot of the relationship between some variable X and Y.

[Figure 1. Regression Results for Different Types of Missing Data. Upper left = complete data; upper right = MAR but not OAR; bottom left = not MAR; bottom right = incorrect functional form, but data are MAR.]

The first plot (upper left corner) shows the fit of the regression model when all the data are observed. The second shows what happens to the fit of the regression model when the data are MAR but not OAR. For this case, I deleted all observations with X values greater than 1. Notice that the regression line is virtually identical to the one obtained with complete data. The bottom left plot shows what happens when the data are neither MAR nor OAR. For that case, I deleted all observations with Y values greater than 4. Observe that the regression line is biased considerably in this case. Finally, the last plot shows what happens


if the data are MAR, but an incorrect functional form is specified for the model. In that case, the model estimates will also be biased.

Although these results are somewhat encouraging in that they indicate that missing data may not always lead to biases, the fact is that it is practically impossible to assess whether data are MAR, exactly because the data are missing on the variable of interest! Furthermore, methods for handling data that are not MAR are statistically difficult and rely on our ability to correctly model the process that generates the missing data. This is also difficult to assess, because the data are missing!

3 Traditional approaches to handling missing data.

A variety of approaches to handling missing data have emerged over the last few years. Below is a list of some of them, along with brief descriptions. The bad news is that most of them are of no help when the data are not MAR; they address only departures from OAR. This is unfortunate because, as we discussed above, parameter estimates are not biased when the data are MAR but not OAR.

• Listwise deletion. Simply delete an entire observation if it is missing on any item used in the analyses.

Problems: Appropriate only when data are MCAR (or MAR). In that case, the only loss is statistical power due to reduced n. If the data are not MAR, then results will be biased. However, this is often the best method, even when the data are not MAR.

• Pairwise deletion. Delete missing variables by pairs. This only works if a model is estimated from covariance or correlation matrices. In that case, the covariances/correlations are estimated pairwise, so you simply don't include an observation that is missing data on one of the items in a pair.

Problems: Seems nice, but has a poor conceptual/statistical foundation. What n should tests be based on? Also leads to bias if the data are not MAR.

• Dummy variable adjustment. Set missing values on a variable equal to some arbitrary value. Then construct a dummy variable indicating missingness and include this variable in the regression model.

Problems: This approach simply doesn't work and leads to biases in parameters and standard errors. It also has no sound theoretical justification; it is simply an ad hoc method to keep observations in an analysis.

• Mean Imputation. Replace a missing observation with the sample mean of the variable. When using longitudinal data, we can instead replace a missing score with the mean of the individual's responses on the other waves. I think this makes more sense than sample mean imputation in this case. An even better approach, I think, is discussed below under regression imputation.


Problems: Sounds reasonable but isn't. Whether the data are OAR or MAR, this approach leads to biases in both the standard errors and the parameters. The main reasons are that it shifts possibly extreme values back to the middle of the distribution, and it reduces variance in the variable being imputed.

• Hotdecking. Replace a missing observation with the score of a person who matches the individual who is missing on the item on a set of other covariates. If multiple individuals match the individual who is missing on the item, use the mean of the scores of the persons with complete information. Alternatively, a random draw from a distribution can be used.

Problems: Seems reasonable, but reduces standard errors because it (generally) ignores variability in the x. That is, the x are not perfectly correlated, but this method assumes they are. The method also assumes the data are MAR. It may be particularly difficult to implement with lots of continuous covariates. Also, the more variables used to match the missing observation, the better, but also the less likely you will be to find a match. Finally, what do you do with multiple missing variables? Impute, and impute again? Theoretically, you could replace all missing data through multiple passes through the data, but this would definitely produce overconfident and suspect results.

• Regression-based Imputation. Estimate a regression model predicting the missing variable of interest for those in the sample with complete information. Then compute predicted scores, using the regression coefficients, for the individuals who are missing on the item. Use these predicted scores to replace the missing data. When longitudinal data are used, and the missing variable is one with a within-individual pattern across time, use an individual-specific regression to predict the missing score.

Problems: This is probably one of the best simple approaches, but it suffers from the same main problem that hotdecking does: it underestimates standard errors by underestimating the variance in x. A simple remedy is to add some random error to the predicted score from the regression, but this begs another question: what distribution should the error follow? This method also assumes the data are MAR.

• Heckman Selection Modeling. The classic two-step method for which Heckman won the Nobel Prize involves a) estimating a selection equation (I(Observed) = Xβ + e); b) constructing a new variable—a hazard for sample inclusion—that is a function of the predicted scores from this model, λ_i = φ(X'_i β) / (1 − Φ(X'_i β)); and c) including this new variable in the structural model of interest (Y_i = Z'_i γ + δλ_i + u_i).

Problems: This method is an excellent method for handling data that are NOT MAR. However, it has problems. One is that if there is significant overlap between X and Z, then the method is inconsistent. Theoretically, there should be no overlap between X and Z, but this is often unreasonable: variables related to Y are also relevant to


selection (otherwise, there would be no relationship between observing Y and Y, and the data would thus be MAR). Another is that the standard errors in the structural model are incorrect and must be adjusted. This is difficult. Fortunately, STATA has two procedures that make this adjustment for you: Heckman (continuous structural outcome) and Heckprob (dichotomous structural outcome).

• Multiple Imputation. This method involves simulating possible values of the missing data and constructing multiple datasets. We then estimate the model using each new data set and compute the means of the parameters across the samples. Standard errors can be obtained via a combination of the between-model variance and the within-model standard errors.

Problems: Although this approach adjusts for the downward bias that some of the previous approaches produce, its key drawbacks are that it assumes the data are MAR, and it is fairly difficult to implement. To my knowledge, there are some standard packages that do this, but they may be tedious to use. Another drawback is that you must assume some distribution for the stochasticity. This can be problematic as well.

• The EM algorithm. The EM algorithm is a two-step process for estimating model parameters. It integrates missing data into the estimation process, thus bypassing the need to impute. The basic algorithm consists of two steps: expectation (E step) and maximization (M step). First, separate the data into missing and nonmissing, and establish starting values for the parameters. In the first step, using the parameters, compute the predicted scores for the missing data (the expectation). In the second step, using the predicted scores for the missing data, maximize the likelihood function to obtain new parameter estimates. Repeat the process until convergence is obtained.

Problems: There are two key problems with this approach. One is that there is no standard software (to my knowledge) that makes this accessible to the average user. Second, the algorithm doesn't produce standard errors for the parameters. Other than these problems, this is the best maximum likelihood estimation has to offer. Note that this method assumes the data are MAR.

• Direct ML Estimation. Direct estimation involves factoring the likelihood function into components such that the missing data simply don't contribute to estimation of parameters for which the data are missing. This approach retains all observations in the sample and makes full use of the data that are observed.

Problems: Once again, this approach assumes the data are MAR. Not all likelihoods factor easily, and not many standard packages allow such estimation.

• Bayesian modeling with MCMC methods. Bayesian estimation is concerned with simulating distributions of parameters so that simple descriptive statistics can be used to summarize knowledge about a parameter. A simple example of a Bayesian model would be a linear regression model (assume for simplicity that σ_e is known). We


already know that the sampling distribution for the regression coefficients in a linear regression model has a mean vector equal to β and a covariance matrix σ_e^2 (X'X)^{-1}. In an OLS model, the OLS estimate for β, β̂, is found using (X'X)^{-1}(X'Y). The central limit theorem tells us that the sampling distribution is normal, so we can say: β̂ ∼ N((X'X)^{-1}(X'Y), σ_e^2 (X'X)^{-1}). That being the case, if we simply draw normal variables with this distribution, we will have a sample of parameters from which inference can be made. (Note: if σ_e is not known, its distribution is inverse gamma. The conditional distribution for the regression parameters is still normal, but the marginal distribution for the parameters will be t.)

If there are missing data, we can use the assumption that the model is based on, namely that Y ∼ N(Xβ, σ_e^2), and integrate this into the estimation. Now, rather than simulating β only, we break estimation into two steps. After establishing starting values for the parameters, first simulate the missing data using the assumption above regarding the distribution of Y. Second, given a complete set of data, simulate the parameters using the formula discussed above regarding the distribution of β. Although this approach seems much like the EM algorithm, it has a couple of distinct advantages. One is that standard errors (technically, the standard deviation of the posterior distribution for the parameters) are obtained as a byproduct of the algorithm, so there is no need to come up with some additional method to do so. Another is that, in this process, a better estimate of uncertainty is obtained, given the missing data, than would be obtained using EM.

Problems: Bayesian approaches are not easy to implement, because there is very little packaged software in existence. Also, this method assumes the data are MAR. However, I note that we can modify the model and estimation slightly to adjust for data that aren't MAR. This can become complicated, though.

• Selection and Pattern Mixture Models. These approaches are somewhat opposite of each other. We have already dealt with one type of selection model (Heckman). In general, both approaches exploit the conditional probability rule, but they do so in opposite fashions. Pattern mixture models model p(Y, Observed) = p(Y | Observed)p(Observed), while selection models model p(Y, Observed) = p(Observed | Y)p(Y). We have already seen an example of selection. A pattern mixture approach would require us to specify a probability model for Y conditional on being observed, multiplied by a model predicting whether an individual is observed. We would simultaneously need to model Y conditional on being unobserved by this model for being observed. This model is underidentified, because, without information on the Y that are missing, we do not know any characteristics of their distribution; thus, some identifying constraints are required.

Problems: The key problem with these approaches is that standard software does not estimate them (with the exception of Heckman's method). However, these are appropriate approaches when the data are not MAR.


4 A Simulation Demonstrating Common Approaches.

I generated 500 samples of size n = 100 each, consisting of 3 variables: X1, X2, and Y. ρ_{X1,X2} = .4, and the error term, u, was drawn from N(0, 1). Y was computed using: Y = 5 + 3X1 + 3X2 + u. First, I estimated the regression model on all the samples with complete data. Second, I estimated the regression model on the samples after causing some data to be missing. I used 4 different missing data patterns. First, I forced Y to be missing if X1 > 1.645. This induces approximately 5% to be missing in each sample and generates samples in which the data are MAR but not OAR. Second, I forced Y to be missing if X1 > 1.037. This induces approximately 15% to be missing. For comparison, I also estimated the models after making X1 missing under the same conditions (rather than making Y missing). These results emphasize that the real problem occurs when the dependent variable is missing. Third, I forced Y to be missing if Y > 12.0735. Fourth, I forced Y to be missing if Y > 9.4591. These latter two patterns make the data violate the MAR assumption and generate approximately 5% and 15% missing data, respectively (the mean and variance for Y differ from those for the X variables). I estimated regression models using various approaches to handling the missing data. Below is a table summarizing the results of the simulation.
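For readers who want to replicate the setup, here is a small C sketch of the data-generating step (my own illustration; the Box-Muller normal generator and the NA coding of missing values are assumptions of the sketch, not part of the original simulation code). It generates one sample of n = 100 and applies the first missingness pattern.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PI 3.141592653589793

/* one standard normal draw via the Box-Muller transform */
double rnorm(void)
{
  double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
  double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
  return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

int main(void)
{
  int i, n = 100;
  double x1, x2, u, y;

  for(i = 0; i < n; i++)
  {
    x1 = rnorm();
    x2 = .4 * x1 + sqrt(1.0 - .16) * rnorm();   /* gives corr(x1, x2) = .4 */
    u  = rnorm();
    y  = 5.0 + 3.0 * x1 + 3.0 * x2 + u;

    /* first pattern: Y missing if x1 > 1.645 (MAR but not OAR) */
    if(x1 > 1.645)
      printf("%f %f NA\n", x1, x2);
    else
      printf("%f %f %f\n", x1, x2, y);
  }
  return 0;
}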


4.1 Simulation Results

Approach      Int. (Emp/Est S.E.)   X1 (Emp/Est S.E.)   X2 (Emp/Est S.E.)
No Missing    5.01 (.100/.101)      3.01 (.107/.111)    3.00 (.112/.111)

X1 missing if X1 > 1.645
Listwise      5.01 (.103/.104)      3.01 (.122/.125)    3.00 (.115/.114)
Mean          5.31 (.170/.176)      2.88 (.157/.215)    3.31 (.233/.187)
Dummy         5.01 (.103/.105)      3.00 (.122/.126)    3.01 (.117/.114)

X1 missing if X1 > 1.037
Listwise      5.01 (.113/.116)      3.01 (.140/.147)    3.00 (.121/.121)
Mean          5.76 (.219/.234)      2.82 (.212/.314)    3.57 (.271/.230)
Dummy         5.01 (.113/.130)      2.99 (.141/.163)    3.04 (.126/.125)

Y missing if X1 > 1.645
Listwise      same as above
Mean          4.57 (.230/.210)      2.12 (.368/.230)    2.86 (.236/.231)
Dummy         5.00 (.104/.122)      3.02 (.133/.145)    2.88 (.153/.130)

Y missing if X1 > 1.037
Listwise      same as above
Mean          3.87 (.321/.250)      1.28 (.342/.275)    2.57 (.316/.275)
Dummy         4.95 (.130/.173)      2.97 (.195/.213)    2.58 (.227/.166)

Y missing if Y > 12.0735
Listwise      4.96 (.102/.107)      2.97 (.119/.122)    2.96 (.118/.121)
Mean          4.19 (.257/.245)      2.07 (.359/.269)    2.08 (.341/.268)
Dummy         4.95 (.104/.121)      2.90 (.143/.134)    2.90 (.143/.133)
Heckman       11.41 (1.43/.314)     2.95 (.128/.134)    lambda = -2.23

Y missing if Y > 9.4591
Listwise      4.90 (.118/.122)      2.93 (.130/.135)    2.92 (.130/.135)
Mean          4.10 (.281/.267)      1.97 (.391/.293)    1.98 (.372/.292)
Dummy         4.75 (.139/.164)      2.67 (.197/.168)    2.67 (.187/.168)
Heckman       9.57 (.85/.274)       2.90 (.132/.136)    lambda = -2.29

4.2 Summary of Results

The results indicate that, when the data are MAR and only the OAR assumption is violated, listwise deletion only costs us efficiency. The dummy variable approach appears to work well, at least with little missing data. As the percent missing increases, the biases in the parameters and (especially) the standard errors become apparent. Mean imputation performs very poorly in all cases. The biases appear to be most problematic when the data that are missing are the outcome data; missingness on the covariate is less troublesome.

In the results for the data that violate the MAR assumption, listwise deletion appears to work about as well as any approach. Once again, mean imputation performs very poorly, as does the dummy variable approach. Heckman selection recovers the slope coefficients well, but the intercept and the standard errors are substantially biased.

It is important to remember that this simulation study used better data than are typically


found in real datasets. The variables were all normally distributed, and the regression relationship between the independent and dependent variables was quite strong. These results may be more interesting if the signal/noise ratio were reduced (i.e., the error variance were increased).

5 Recommendations for handling missing data.

The guidelines below are my personal recommendations for handling missing data. They are based on a pragmatic view of the consequences of ignoring missing data, the consequences of handling missing data inappropriately, and the likelihood of publication/rejection using a particular method. These suggestions primarily apply when the outcome variable is missing and not the covariates. When covariates are missing, the consequences of missingness are somewhat less severe.

The standard method for reporting the extent of missingness in current sociology articles is to a) construct a dummy variable indicating whether an individual is missing on an item of interest, b) conduct logistic regression analyses predicting missingness using covariates/predictors of interest, and c) report the results: do any variables predict missingness at a significant level? There are at least two problems with this common approach: 1) what do you do if there is some pattern? Most people ignore it, and this seems to be acceptable. Some people make arguments for why any biases that will be created will be conservative. 2) This approach only demonstrates whether the data are OAR, which, as I said above, doesn't matter!!! I don't recommend this approach, but unfortunately, it is fairly standard.

Beyond simply reporting patterns of missingness, I recommend the following for dealing with missing data:

1. The first rule I recommend is to try several approaches in your modeling. If the findings are robust to the various approaches, this should be comforting, not just to you, but also to reviewers. So report, at least in a footnote, all the various approaches you tried. This will save you the trouble of having to respond to a reviewer by saying you already tried his/her suggestions. If the findings are not robust, this may point you to the source of the problem, or it may give you an idea for altering your research question.

2. If you have less than 5% missing, just listwise delete the missing observations. Any sort of simple imputation or correction may be more likely to generate biases, as the simulation showed.

3. If you have between 5% and 10% missing, think about using either a selection model or some sort of technique like multiple imputation. Here, pay attention to whether the data are missing on X or Y. If the data are missing on X, think about finding a substitute for the X (assuming there is one in particular) that is causing the problem. For example, use education rather than income to measure SES. Also consider whether the data are OAR or MAR. If you have good reason to believe the data are MAR, then just listwise delete, but explain why you're doing so. If the data are not MAR, then you should do something beyond listwise deletion, if possible. If nothing else, try to


determine whether the missingness leads to conservative or liberal results. If they should be conservative, then make this argument in the text.

If the data are missing on Y, use either listwise deletion or Heckman selection. If you use listwise deletion, you need to be able to justify it. This becomes more difficult as the missing percentage increases. Use Heckman selection if you are pretty sure the data are not MAR. Otherwise, a reviewer is likely to question your results. At the least, run a Heckman selection model and report the results in a footnote indicating there was no difference between a listwise deletion approach and a Heckman approach (assuming there isn't!). Heckman selection is sensitive to the choice of variables in the selection equation and structural equation, so try different combinations. Don't have too much overlap.

4. If you have more than 10% missing, you must use a selection model or some sophisticated technique for handling the missing data, unless the data are clearly MAR. If they are, then either listwise delete or use some smart method of imputation (e.g., multiple imputation, hotdecking, EM, etc.).

5. If you have more than 20% missing, find other data, drop problematic variables, get help, or give up.

6. If you have a sample selection issue (either on the independent or dependent variables), use Heckman selection. For example, if you are looking at "happiness with marriage," this item is only applicable to married persons. However, if you don't compensate for differential propensities to marry (and remain so), your parameters will be biased. As a case in point, our divorce rates are higher now than they were in the past, yet marital happiness is greater now than ever before.

6 Recommended Reading

• Allison (2002). Missing Data. Sage series monograph.

• Little and Rubin (1987). Statistical Analysis with Missing Data.


(copyright by Scott M. Lynch, April 2003)

1 Generalizations of the Regression Model (Soc 504)

As I said at the beginning of the semester, beyond the direct applicability of OLS regression to many research topics, one of the reasons that a full semester course on the linear model is warranted is that the linear model lays a foundation for understanding most other models used in sociology today. In these last sets of notes, I cover three basic generalizations of linear regression modeling that, taken as a whole, probably account for over 90% of the methods used in published research over the last few years. Specifically, we will discuss 1) generalized linear models, 2) multivariate models, and 3) time series and longitudinal methods. I will include discussions of fixed/random effects models in this process.

2 Generalized Linear Models

In sociological data, having a continuous outcome variable is rare. More often, we tend to have dichotomous, ordinal, or nominal level outcomes, or we have count data. In these cases, the standard linear model that we have been discussing all semester is inappropriate for several reasons. First, heteroscedasticity (and nonnormal errors) are guaranteed when the outcome is not continuous. Second, the linear model will often predict values that are impossible. For example, if the outcome is dichotomous, the linear model will predict scores that are less than 0 or greater than 1. Third, the functional form specified by the linear model will often be incorrect. For example, we should doubt that increases in a covariate will yield the same returns on the dependent variable at the extremes as toward the middle.

2.1 Basic Setup of GLMs

Generalized linear models provide a way to handle these problems. The basic OLS model can be expressed as:

Y ∼ N(Xβ, σe)

Y = Xβ + e

Generalized linear models can be expressed as:

F (µ) = Xβ + e

E(Y ) = µ

That is, some function of the expected value of Y is equal to the linear predictor with which we are already familiar. The function that relates Xβ to µ is called the link function. The choice of link function determines the name of the model we are using. The most common GLMs used in sociology have the following link functions:


Link Function F(µ)        Model
µ                         Linear Regression
ln(µ / (1 − µ))           Logistic Regression
Φ^{-1}(µ)                 Probit Regression
ln(µ)                     Poisson Regression
ln(−ln(1 − µ))            Complementary Log-Log Regression

An alternate way of expressing this is in terms of probabilities. The logit and probit models are used to predict probabilities of observing a 1 on the outcome. Thus, we could write the model as:

p(y_i = 1) = F(X_i β).

In this notation, F is the link function. I will illustrate this with the probit regression model. If our outcome variable is dichotomous, then the appropriate likelihood function for the data is the binomial distribution:

data is the binomial distribution:

L(p | y) = p(y | p) ∝ ∏_{i=1}^{n} p^{y_i} (1 − p)^{1−y_i}

Our observed data constitute the y_i—if a person is a 1 on the dependent variable, then the second term in the likelihood drops out (for that individual); if a person is a 0, then the first term drops. We would like to link p to Xβ, but as discussed at the beginning, this is problematic because an identity link (i.e., p = Xβ) will predict illegitimate values for p. A class of functions that can map the predictor from the entire real line onto the interval [0, 1] is the class of cumulative distribution functions. So, for example, in the probit case, we let p = Φ(Xβ), where Φ is the cumulative normal distribution function (i.e., ∫_{−∞}^{Xβ} N(0, 1)). Regardless of the value of Xβ, p will fall in the acceptable range. To obtain a logistic regression model, one would simply set p = e^{Xβ} / (1 + e^{Xβ}) (the cumulative logistic distribution function).

in the table; however, the only difference is in how the link function is expressed—whetheras a function of the expected value of Y , or in terms of the linear predictor. These areequivalent (just inverses of one another). For example, the logistic regression model couldbe written as:

ln

1− µ

)= Xβ,

where µ = p. Another way to think about GLMs is in terms of latent distributions. Wecould express the probit model as:

Y ∗ = Xβ + e,

using the link: {Y = 1 iff Y ∗ > 0Y = 0 iff Y ∗ ≤ 0


Here, Y* is a latent (unobserved) propensity. However, due to crude measurement, we only observe a dichotomous response. If the individual's latent propensity is strong enough, his/her propensity pushes him/her over the threshold (0), and we observe a 1. Otherwise, we observe a 0.

From this perspective, we need to rearrange the model somewhat to allow estimation. We can note that if Y* = Xβ + e, then the expressions in the link equation above can be rewritten such that: if Y = 1 then e > −Xβ; if Y = 0 then e < −Xβ. If we assume a distribution for e (say normal), then we can say that:

p(Y = 1) = P(e > −Xβ) = P(e < Xβ) = ∫_{−∞}^{Xβ} N(0, 1).

Observe that this is the same expression we placed into the likelihood function above. If we assume a logistic distribution for the error, then we obtain the logistic regression model discussed above.

I will use this approach to motivate the generalization of the dichotomous probit model we've been discussing to the ordinal probit model. If our outcome variable is ordinal, rather than dichotomous, OLS is still inappropriate. If we assume once again that a latent variable Y* underlies our observed ordinal measure, then we can expand the link equation above:

Y = 1 iff −∞ = τ_0 ≤ Y* < τ_1
Y = 2 iff τ_1 ≤ Y* < τ_2
. . .
Y = k iff τ_{k−1} ≤ Y* < τ_k = ∞

Just as before, this link, given a specification for the error term, implies an integral over the error distribution, but now the integral is bounded by the thresholds:

p(Y = j) = P(τ_{j−1} − Xβ < e < τ_j − Xβ) = ∫_{τ_{j−1}−Xβ}^{τ_j−Xβ} N(0, 1).

2.2 Interpreting GLMs

GLMs are not as easy to interpret as the standard linear regression model. Because the link function is nonlinear, the model is now nonlinear, even though the predictor is linear. This complicates interpretation, because the effects of variables are no longer independent of the effects of other variables. That is, the effect of X_j depends on the values of the other variables X_k. The probit model is linear in Z (standard normal) units. That is, given that Xβ implies an increase in the upper limit of the integral of the standard normal distribution, each β can be viewed in terms of its effect on the Z score for the individual.

The logit model is linear in log-odds units. Recall that odds are computed as the ratio p / (1 − p). The logistic link function, then, is a log-odds function. The coefficients from the model can be interpreted in terms of their linear effect on the log-odds, but this is not of much help. Instead, if we exponentiate the model, we obtain:

exp( ln(p / (1 − p)) ) = exp(Xβ) = e^{β_0} e^{β_1 X_1} · · · e^{β_j X_j}


This says that the odds are equal to the product of the exponentiated coefficients. Suppose we had an exponentiated coefficient for gender (male) of 2. This would imply that the odds are twice as great for men as for women, net of the other variables in the model. The interpretation is slightly more complicated for a continuous variable, but can be stated as: the odds are multiplied by exp(β_j) for each unit increase in X_j. Be careful with this interpretation: saying the odds are multiplied by 2 does NOT mean that men are twice as likely to die as women. The word "likely" implies a ratio of probabilities, not odds.

The logistic regression model has become quite popular because of the odds-ratio interpretation. However, the unfortunate aspect is that this interpretation tells us nothing about the absolute risk (probability) of obtaining a '1' response. To speak in those terms, we must compute the probabilities predicted by the model. It is in this process that we can see how the effect of one variable depends on the values of the other variables in the model. Below are a logistic regression and a probit regression of death on baseline age, gender (male), race (nonwhite), and education.

Variable      Logistic Reg. Parameter   Exp(β)   Probit Reg. Parameter
Intercept     -6.2107                            -3.4661
Age             .1091                   1.115      .0614
Male            .7694                   2.158      .4401
Nonwhite        .5705                   1.769      .3341
Education      -.0809                    .922     -.0487

The results (for either model) indicate that age, being male, and being nonwhite increase one's probability of death, while education reduces it. Although the coefficients differ between the two models, this is simply a function of the difference in the variances of the logistic and probit distributions. The variance of the probit distribution is 1 (N(0, 1)); the variance of the logistic distribution is π²/3 ≈ 3.29. The ratio of the standard deviations (logistic to probit) is π/√3 ≈ 1.81, and this is also approximately the ratio of the coefficients—there is some slight deviation that is attributable to the slight differences in the shape of the distribution functions (the probit is steeper than the logit in the middle).

If we wanted to determine the difference in probability of mortality for a person with a high school diploma versus a college degree, we would need to fix the other covariates at some value, compute Xβ, and perform the appropriate transformation to invert the link function. Below are the predicted probabilities for 50 year-olds with different gender, race, and education profiles.


Profile                                (Predicted probabilities)
Sex       Race       Education         Probit    Logit
Male      White      12 yrs.            .29       .28
Male      White      16 yrs.            .23       .22
Male      Nonwhite   12 yrs.            .42       .40
Male      Nonwhite   16 yrs.            .34       .33
Female    White      12 yrs.            .16       .15
Female    White      16 yrs.            .12       .11
Female    Nonwhite   12 yrs.            .26       .24
Female    Nonwhite   16 yrs.            .20       .19

Notice that the estimated probabilities are very similar between the two models. For most data, the models can be used interchangeably. Notice also that the change in probability from altering one characteristic depends on the values of the other variables. For example, the difference in probability of death between 12 and 16 years of education is .06 for white males, .08 for nonwhite males, .04 for white females, and .06 for nonwhite females (all based on the probit model results). The odds ratio, however, does not vary. For example, take the odds ratio for white males with 12 versus 16 years of education (OR = 1.38) and the odds ratio for nonwhite males with 12 versus 16 years of education (OR = 1.35). The difference is only due to rounding of the probabilities.
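As a check on the table above, the sketch below (not part of the original notes) plugs the reported coefficients into the inverse link functions with scipy; it should approximately reproduce the predicted probabilities for 50 year-olds.

    import numpy as np
    from scipy.stats import norm

    # Coefficients reported above: intercept, age, male, nonwhite, education
    b_logit = np.array([-6.2107, 0.1091, 0.7694, 0.5705, -0.0809])
    b_probit = np.array([-3.4661, 0.0614, 0.4401, 0.3341, -0.0487])

    def predicted_probs(age, male, nonwhite, educ):
        x = np.array([1.0, age, male, nonwhite, educ])
        xb_l, xb_p = x @ b_logit, x @ b_probit
        p_probit = norm.cdf(xb_p)                 # inverse probit: standard normal CDF
        p_logit = 1.0 / (1.0 + np.exp(-xb_l))     # inverse logit
        return p_probit, p_logit

    # 50-year-old white male with 12 vs. 16 years of education
    print(predicted_probs(50, 1, 0, 12))   # approximately (.29, .28)
    print(predicted_probs(50, 1, 0, 16))   # approximately (.23, .22)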

3 Multivariate Models

For the entire semester, we have discussed univariate models—that is, models with a single outcome. Often, however, we may be interested in estimating models that have multiple dependent variables. Let's take a very simple model first.

Modeled:

y1 = Xβ + e1
y2 = Zγ + e2
e1 ∼ N(0, σ²_e1)
e2 ∼ N(0, σ²_e2)

Not modeled, but true:

(e1, e2)′ ∼ N( (0, 0)′, [[σ²_e1, σ_e1e2], [σ_e1e2, σ²_e2]] )


This model says that outcomes y1 and y2 are functions of several covariates (X and Z) plus some error. The third and fourth components indicate that e1 and e2 are assumed to be uncorrelated. However, in fact, the last expression indicates that the errors across equations are correlated (as long as σ_e1e2 is nonzero). This model is sometimes called the "seemingly unrelated regression model." The regressions seem as though they could be estimated independently, but if e1 and e2 are correlated, then it implies that there are variables (namely y2 and possibly some Z) that are omitted from the model for y1 (and vice versa for the model for y2). Omitting relevant variables leads to 'omitted variable bias,' which means that our estimates for coefficients are incorrect. We could rewrite the model, either specifically incorporating the error covariance portion of the model, or specifying a joint distribution for y:

(y1, y2)′ ∼ N( (Xβ, Zγ)′, [[σ²_e1, σ_e1e2], [σ_e1e2, σ²_e2]] )

This model is the same as the model above, just expressed explicitly as a multivariate model.

3.1 Multivariate Regression

So far, we have dealt with a number of univariate distributions, especially the univariate normal distribution:

f(Y) = [1 / (σ√(2π))] exp{ −(Y − µ)² / (2σ²) }

The multivariate normal distribution is simply an extension of this:

f(Y) = (2π)^{−d/2} |Σ|^{−1/2} exp{ −(1/2) [Y − µ]′ Σ^{−1} [Y − µ] }

Here, µ is a vector of means, and Σ is the covariance matrix of Y. If the matrix is diagonal, then the distribution could be rewritten simply as a set of univariate normal distributions; when Σ has off-diagonal elements, it indicates that there is some (linear) relationship between the variables.
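As a quick numerical check on this formula (not part of the original notes), the sketch below evaluates the multivariate normal density with scipy and confirms that, with a diagonal Σ, it equals the product of univariate normal densities; the means and variances are arbitrary.

    import numpy as np
    from scipy.stats import multivariate_normal, norm

    mu = np.array([0.0, 1.0])
    Sigma = np.diag([1.0, 2.0])     # diagonal covariance: no linear relationship between the variables
    y = np.array([0.5, 0.0])

    joint = multivariate_normal(mean=mu, cov=Sigma).pdf(y)
    product = norm.pdf(y[0], loc=mu[0], scale=1.0) * norm.pdf(y[1], loc=mu[1], scale=np.sqrt(2.0))
    print(joint, product)           # identical when Sigma is diagonal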

Just as with linear regression, we can assume the errors are (multivariately) normally distributed, allowing us to place (Y − Xβ) into the numerator of the kernel of the density for maximum likelihood estimation of β. In the multivariate case, each element of the [Y − µ] vector is replaced with [Y − Xβ(j)], where I'm using (j) to index the set of β parameters in each dimension of the model (i.e., X does not have to have the same effect on all outcomes). We can assume the X matrix is the same across equations in the model, and if an X does not influence one of the outcomes, then its β parameter is constrained to be 0.

3.2 Path Analysis

Some multivariate models can be called 'path analysis' models. The requirements for path analysis include that the variables must all be continuous, and the model must be recursive; that is, following the paths through the variables, one cannot revisit a variable. Below is an example of a path-analytic graph.


[Figure: Path diagram relating education, income, health, and depression. The lettered paths are A = education → health, B = education → income, C = education → depression, D = income → health, E = health → depression, and F = income → depression.]

This path model says that depression is affected by physical health, education, and income; health is influenced by education and income; and income is influenced by education. If we estimated the model Depression = b0 + b1·Education, we would find the total effect of education on depression. However, it is unlikely that education's effect is only direct. It is more reasonable that income and physical health also affect depression, and that education has direct effects (and possibly indirect effects) on income and health. Thus, the simple model would produce a biased effect of education if income and health were ignored. If we estimate the path model above, the coefficient for the direct effect of education (C) would most likely be reduced.

The path model above can be estimated using a series of univariate regression models:

Depression = β0 + β1 education + β2 health + β3 income

Health = γ0 + γ1 education + γ2 income

Income = α0 + α1 education

The lettered paths in the diagram can be replaced as: A = γ1, B = α1, C = β1, D = γ2, E = β2, F = β3.

Now the direct effect of education on depression is no longer equal to the total effect. Rather, the direct effect is simply C, while the total effect is:

Total = Direct + Indirect
      = C + (AE) + (BF) + (BDE)


As these expressions indicate, the indirect effects are simply the products of the coefficients along the paths that lead indirectly from education to depression.
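To make this concrete, here is a minimal sketch (not from the original notes) using simulated data and statsmodels: three OLS regressions estimate the paths, and the total effect recovered from the path products matches the coefficient from the simple bivariate regression. All variable names and simulation values are invented.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulate data consistent with the recursive structure above (illustrative values only)
    rng = np.random.default_rng(0)
    n = 1000
    education = rng.normal(12, 3, n)
    income = 5 + 0.8 * education + rng.normal(0, 2, n)                                  # path B
    health = 2 + 0.3 * education + 0.2 * income + rng.normal(0, 1, n)                   # paths A, D
    depression = 10 - 0.2 * education - 0.3 * health - 0.1 * income + rng.normal(0, 1, n)  # paths C, E, F
    df = pd.DataFrame(dict(education=education, income=income, health=health, depression=depression))

    m_dep = smf.ols("depression ~ education + health + income", df).fit()
    m_hea = smf.ols("health ~ education + income", df).fit()
    m_inc = smf.ols("income ~ education", df).fit()

    A, D = m_hea.params["education"], m_hea.params["income"]
    B = m_inc.params["education"]
    C, E, F = m_dep.params["education"], m_dep.params["health"], m_dep.params["income"]

    direct = C
    indirect = A * E + B * F + B * D * E
    total = direct + indirect
    # The total effect equals the coefficient from the simple bivariate regression
    print(total, smf.ols("depression ~ education", df).fit().params["education"])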

3.3 Structural Equation Models

When nonrecursive models (i.e., models with reciprocal effects) are needed, variables are not measured on a continuous scale, and/or measurement error is to be considered, we can generalize the path analysis model above. Structural equation models provide such a generalization. These models are often also called LISREL models (after the first software package to estimate them) and covariance structure models (because estimation is based on the covariance matrix of the data). Using LISREL notation, these models consist of 3 basic equations and 4 matrices:

η = Bη + Γξ + ζ

y = Λ_y η + ε

x = Λ_x ξ + δ

(Here η is j × 1, ξ is k × 1, B is j × j with zeros on the diagonal, Γ is j × k, Λ_y is p × j, Λ_x is q × k, y is p × 1, and x is q × 1.)

Φ = k-by-k covariance matrix of the ξ

Ψ = j-by-j covariance matrix of the ζ

θδ = q-by-q covariance matrix of the δ

θε = p-by-p covariance matrix of the ε

In this model, latent (unobserved) variables are represented by the Greek symbols η and ξ. The distinction between η and ξ is whether the variable is exogenous (ξ) or endogenous (η) in the model, where 'endogenous' means the variable is influenced by other variables in the model. The coefficients that relate the η to each other are β, while the coefficients that relate the ξ to the η are γ. The first equation is the 'structural equation' that relates the latent variables, with an error term ζ for each η. The second and third equations are measurement equations that show how the observed y and x are related to the latent variables η and ξ, respectively (via the λ coefficients). In these equations, ε and δ represent measurement errors—that is, the part of the observed variable that is unaccounted for by the latent variable(s) which influence it.


The Φ matrix models the covariances of the exogenous latent variables, while the Ψ matrix models the covariances of the structural equation errors (allowing cross-equation error correlation to exist, which is something univariate regressions do not allow). The two θ matrices allow correlation between errors in the measurement equations to exist (again, something that univariate regression cannot handle).

This model is very general. If there is only one outcome variable (η), and all variables are assumed to be measured without error, then the model reduces to OLS regression. If we are uninterested in structural relations, but are only interested in the measurement portion of the model (and possibly in estimating simple correlations between latent variables), then we have a (confirmatory) factor analysis.

The model is estimated by recognizing that (1) the parameters are functions of the covariances (or correlations) of the variables and (2) a multivariate normal likelihood can be written in terms of these covariances. When some of the data are not continuous, but rather are ordinal, we can estimate 'polychoric' and 'polyserial' correlations between the observed variables, and the resulting correlation matrix can be used as the basis for estimation. The resulting model could then be called a 'multivariate generalized linear model.'

Below is a graphic depiction of a relatively simple structural equation model. The equations for this SEM would be:

η1 = γ11 ξ1 + ζ1

(y1, y2, y3)′ = (λy1, λy2, λy3)′ η1 + (ε1, ε2, ε3)′

(x1, x2, x3)′ = (λx1, λx2, λx3)′ ξ1 + (δ1, δ2, δ3)′

Φ = [φ11],   Ψ = [ψ11]

θδ = [[θδ11, 0, 0], [0, θδ22, 0], [0, 0, θδ33]]

θε = [[θε11, 0, 0], [0, θε22, θε23], [0, θε32, θε33]]

Notice how most of the off-diagonal elements of the various covariance matrices are 0; this is because we have only specified 1 error correlation (between ε2 and ε3). The top and bottom portions of the figure constitute confirmatory factor analyses: the idea is that the observed x and y variables reflect underlying and imperfectly measured constructs (factors). We are really interested in examining the relationship between these constructs, but there is measurement error in our measures for them. Thus, with this model, γ11 is our estimate of


the relationship between the latent variables independent of any measurement error existing in our measures.

[Figure: SEM diagram. The exogenous latent variable ξ1 is measured by x1–x3 (loadings λx1–λx3, measurement errors δ1–δ3); the endogenous latent variable η1 is measured by y1–y3 (loadings λy1–λy3, measurement errors ε1–ε3). The structural path γ11 runs from ξ1 to η1, with disturbance ζ1, and a single correlation between two of the measurement errors is allowed.]

3.4 Final note about multivariate models

Something you must remember with multivariate (and latent variable) models that isn't typically an issue in simple univariate models is the notion of identification or identifiability. We cannot estimate more parameters than we have pieces of information. In SEMs, the pieces of information are the variances and covariances of the observed variables—there are (p + q)(p + q + 1)/2 of them. We need to be sure that we are not attempting to estimate too many parameters—if we are, then our model is 'under-identified,' which means loosely that there is no unique solution set for the parameters. The models we have dealt with to date have been 'just-identified,' meaning that there are exactly as many pieces of information as parameters. With SEMs, we often have 'over-identified' models, which means we have more than enough


information to estimate the model. This comes in handy when we would like to compare models.
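For example, in the simple structural equation model above, with p = 3 observed y variables and q = 3 observed x variables, there are (3 + 3)(3 + 3 + 1)/2 = 21 observed variances and covariances available as pieces of information with which to estimate the model's parameters.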

4 Time Series and Longitudinal Methods

Analysis of time series and panel data is a very large area of methodology. Indeed, you can take entire courses on basic time series, event history models, or other methods of analysis for longitudinal data. In this brief introduction to these classes of models, I am exchanging depth for breadth in an attempt to point you to particular types of models that you may need to investigate further in doing empirical work.

I'll start with some basic terminology that's relevant to longitudinal data. First, the term 'longitudinal data' is somewhat vague. Generally, the term implies that one has panel data, that is, data collected on multiple units across multiple points in time (like the PSID). However, it is often also used to refer to repeated cross-sectional data, that is, data collected on multiple different units at multiple points in time (like the GSS). For the purposes of our discussion here, we will limit our usage to the first definition.

• Time Series typically refers to repeated observations on a single observational unit across time.

• Multiple Time Series means multiple observational units observed across time. This can also be considered a panel study, although the usage often differs depending on the unit of analysis. Generally, micro data are considered panel data, while macro data are considered time series data.

• Multivariate Time Series means that there are multiple outcomes measured over time.

• A Trend is a general pattern in time series data over time. For example, U.S. life expectancy shows an increasing trend over the last 100 years.

• Seasonality is a repeated pattern in time series data. For example, the sale of Christmas trees evidences considerable seasonality—sales are low except in November and December. No trend is necessarily implied, but the pattern repeats itself annually.

• Stationarity. A stationary time series evidences no trending. A stationary series may have seasonality, however. Stationarity is important for time series models, for reasons that will be discussed shortly.

4.1 Problems that Time Series/Longitudinal Data Present

There are really very few differences between the approaches that are used to analyze time series or panel data and the basic linear regression model with which you are already familiar. However, there are four basic problems that such data present that necessitate the expansion of the OLS model or the development of alternative models.


1. Error correlation within units across time. This problem requires alternative estimation of the linear model (e.g., GLS estimation), or the development of a different model (e.g., fixed/random effects models).

2. Spuriousness due to trending. When attempting to match two time series, it may appear that two aggregate time series with similar trends are causally related, when in fact, they aren't. For example, population size (which has an increasing trend) may appear to be related to increases in life expectancy (also with an increasing trend). However, this is probably not a causal relationship, because, in fact, countries with the fastest growth in population don't necessarily have the fastest increases (if any at all) in life expectancy. This may be an ecological fallacy or it may simply be that two similarly-trending time series aren't related.

3. Few units. Sometimes, time series data are relatively sparse. When dealing with a single time series, for example, we often have relatively few measurements. This makes it difficult to observe a trend and to include covariates/explanatory variables in models.

4. Censoring. Although our power to examine relationships is enhanced with time series and panel data, such data present their own problems in terms of "missing" data. What do we do with people who die or can't be traced for subsequent waves of study? What do we do with people who are missing at some waves but not others? What do we do when a person experiences an event that takes him/her out of the risk set? Etc.

4.2 Time Series Methods

Problems (1) and (2) above can often be resolved by placing some structure on the error term. There are many ways to do this, and doing so forms the basis for a smorgasbord of approaches to analyzing time series data.

The most basic models for time series are ones in which trends and seasonality are modeled by including time as a variable (or a collection thereof, e.g., time dummies) in a model. For example, if we wanted to examine a trend in birth rates across time, and we observed birth rates on a monthly basis across, say, a 20 year span, we could first model birth rates as a function of time (years), examine the residuals, and then include dummy variables (or some other type of variables) to capture seasonality. The key problem with this approach is that we must make sure that autocorrelation of the errors does not remain after detrending and removing seasonality.

There are two basic domains of more complicated time series analysis: the time domain and the frequency domain. In frequency domain models, measures are seen as being a mixture of sine and cosine curves at different frequencies:

y_i ∼ N( X_i β + Σ_{j=1}^{J} [ a_j sin(w_j t_i) + b_j cos(w_j t_i) ], σ² )

These models are often called 'spectral models.' I will not discuss these models, because a) they are not particularly common in sociology and demography and b) these methods are


ultimately related to time domain models (i.e., they are simply an alternate parameterization of time domain models).

More common in sociology are time domain models. Time domain models model a time t variable as a function of the outcome at time t − 1. In economics, these are called 'dynamic models.' A general class of time domain time series models is the class of ARMA (AutoRegressive Moving Average) models, which can be represented as:

y_t = X_t β + φ_1 y_{t−1} + φ_2 y_{t−2} + . . . + φ_p y_{t−p} + e_t + γ_1 e_{t−1} + γ_2 e_{t−2} + . . . + γ_q e_{t−q}

In this equation, the autoregressive terms are the lagged y-values, while the moving average terms are the lagged errors. e_t is assumed to be N(0, σ²I) under this model. As specified, this model would be called an ARMA(p,q) model (although, technically, the classic ARMA model does not contain regressors). Very often, we do not need more than one AR term or one MA term to achieve stationarity of the error. (As a side note, a time series process in which only y_{t−1} is needed is called a Markov process.) The model requires that |φ| be less than 1, or the model is considered 'explosive' (this can be observed by repeated substitution).
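As an illustration (not from the original notes), the sketch below simulates an AR(1) series and fits an ARMA-type model with statsmodels; the simulation parameters are arbitrary, and exogenous regressors could be supplied through the exog argument.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    # Simulate an AR(1) series: y_t = 0.6 * y_{t-1} + e_t
    rng = np.random.default_rng(1)
    n = 500
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = 0.6 * y[t - 1] + rng.normal(scale=1.0)

    # Fit an ARMA(1,1) model (an ARIMA with no differencing, d = 0)
    result = ARIMA(y, order=(1, 0, 1)).fit()
    print(result.summary())      # the AR coefficient should be estimated near 0.6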

To put the ARMA(p,q) model into perspective, we can view this model as placing structure on the error term in an OLS model. Recall from previous discussions that an assumption of the OLS regression model is that e ∼ N(0, σ²_e I). In other words, the errors for observations i and j are uncorrelated for i ≠ j. This assumption is violated by time series data, and so a natural extension of the OLS model is to use GLS estimation. GLS estimation involves estimating the parameters of the model with a modified estimator: β_GLS = (X′Σ^{-1}X)^{-1}(X′Σ^{-1}Y). Σ, however, is an n × n matrix, and all elements of this matrix cannot be estimated without some simplifying constraints. One such constraint that can be imposed is that error correlations only exist between errors for adjacent time periods, and that all adjacent time periods have equal error correlation. In that case, we only need to estimate one additional parameter, say σ_{i,i+1}, ∀i, so our Σ matrix appears as:

Σ =
[ σ11   σ12   0     ...   0   ]
[ σ21   σ22   σ23   ...   0   ]
[ 0     σ32   σ33   σ34   ... ]
[ ...   ...   σ43   σ44   ... ]
[ 0     ...   0     ...   ... ]

where σ11 = σ22 = . . . = σTT, and σ12 = σ21 = σ23 = σ32 = . . . .

An AR(1) model simplifies the error covariance matrix for GLS estimation by decomposing the original OLS error term into two components: e_t = ρ e_{t−1} + v_t. Here the error at one time point is simply a function of the error at the previous time point plus a random shock at time t (v_t). Higher order AR models are obtained by allowing the error to be a function of errors at lags > 1. Typically, autoregressive models are estimated by incorporating a lagged y variable into the model. So long as the absolute value of the coefficient for the lagged term(s) is less than 1, the series can be considered stationary.
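A minimal sketch of GLS with a banded (tridiagonal) error covariance matrix, using statsmodels; the toy series, the common variance, and the adjacent-period covariance are all made up for illustration.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n = 50
    t = np.arange(n)
    X = sm.add_constant(t.astype(float))
    y = 1.0 + 0.05 * t + rng.normal(scale=1.0, size=n)    # toy series with a linear trend

    # Banded error covariance: equal variance on the diagonal,
    # one common covariance between adjacent time periods, zero elsewhere
    sigma2, sigma_adj = 1.0, 0.4
    Sigma = sigma2 * np.eye(n) + sigma_adj * (np.eye(n, k=1) + np.eye(n, k=-1))

    gls_fit = sm.GLS(y, X, sigma=Sigma).fit()
    print(gls_fit.params)       # compare with sm.OLS(y, X).fit().params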

An MA(1) model specifies structure on the random shocks: e_t = γ v_{t−1} + v_t. As with the AR models, higher order MA models can be obtained by adding additional lagged terms.


Moving average models are more difficult to estimate than autoregressive models, however, because the error term depends on the coefficients in the model, which, in turn, depend on the error.

How do we determine what type of ARMA model we need? Typically, before we model the data, we first construct an autocorrelation plot (also sometimes called a correlogram), which is a plot of the autocorrelation of the data (or of the errors, if a model was previously specified) at various lags. The autocorrelation function is computed as:

AC_L = [ m Σ_t (θ_t − θ̄)(θ_{t−L} − θ̄) ] / [ (m − L) Σ_t (θ_t − θ̄)² ]

where m is the number of time series data points and L is the number of lags. The shape of this function across L tells us what type of model we may need, as we will discuss below.
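For reference, here is a small sketch (not in the original notes) that computes the formula above directly and compares it with the statsmodels acf function on an arbitrary simulated series; the two differ only by the m/(m − L) scaling convention.

    import numpy as np
    from statsmodels.tsa.stattools import acf

    rng = np.random.default_rng(3)
    y = rng.normal(size=200).cumsum()      # an arbitrary, strongly autocorrelated series

    def acf_at_lag(theta, L):
        """Autocorrelation at lag L, following the formula in the text."""
        m = len(theta)
        dev = theta - theta.mean()
        num = m * np.sum(dev[L:] * dev[:m - L])
        den = (m - L) * np.sum(dev ** 2)
        return num / den

    print(acf_at_lag(y, 1), acf(y, nlags=1)[1])   # close, up to the m/(m - L) scaling factor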

When autocorrelation between errors cannot be removed with an ARMA(p,q) model, the next step may be to employ an ARiMA (autoregressive integrated moving average) model. The most basic ARiMA model is one in which there are no autoregressive terms and no moving average terms, but the data are differenced once. That is, we simply take the difference in all variables from time t − 1 to t, so that y_t^diff = y_t − y_{t−1}, ∀t. We do the same for all the covariates. We then regress (using OLS) the differences in y on the differences in x. This model therefore ultimately relates change in x to change in y (this is why the term 'integrated' is used—from a calculus perspective, relating change to change is matching the first derivatives of x and y; thus, the original variable is 'integrated' relative to the differences).

Sometimes, a first differences approach is not sufficient to remove autocorrelation. In those cases, we may need to add autoregressive or moving average components, or we may even need to take second or higher order differences.
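A brief sketch of the first-difference regression described above, with invented data; np.diff handles the differencing and OLS relates change in y to change in x.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 120
    x = np.cumsum(rng.normal(size=n))                               # trending covariate
    y = 2.0 + 0.5 * x + np.cumsum(rng.normal(scale=0.5, size=n))    # trending outcome

    # Difference both series once, then regress change in y on change in x via OLS
    dy, dx = np.diff(y), np.diff(x)
    diff_fit = sm.OLS(dy, sm.add_constant(dx)).fit()
    print(diff_fit.params)       # slope should be near 0.5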

5 Methods for Panel Data

The time series methods just discussed are commonly used in economics, but are used somewhat less often in sociology. Over the last twenty years, panel data have become quite common, and sociologists have begun to use methods appropriate for multiple time series/panel data. In many cases, the methods that I will discuss here are very tightly related to the time series methods discussed above; however, they are generally presented differently than the econometric approach to time series. In this section, I will discuss two general types of panel methods: hazard/event history models and fixed/random effects hierarchical models (including growth models). The key feature that distinguishes these classes is whether the outcome variable is the occurrence of an event or simply a repeated measure on individuals over time.

5.1 Hazard and Event History Models

Hazard models are another class of models that go by several names. One is "event history models." In demography, "life table methods" accomplish the same goals. Other names


include "discrete time hazard models" and "continuous time hazard models." Related approaches are called "survival models" and "failure time models." These related approaches are so-called because they model the survival probabilities (S(t)) and the distribution of times-to-event (f(t)), whereas hazard models model the hazard of an event. Hazard models are distinct from time series models, because time series models generally model an outcome variable that takes different values across time, but hazard models model the occurrence of a discrete event. If the time units in which the respondents are measured for the event are discrete, we can use 'discrete time event history methods'; if the time units are continuous (or very nearly so), we can use 'continuous time event history methods.'

A hazard is the probability of experiencing an event within a specified period of time, conditional on having survived to be at risk during the time interval. Mathematically, the hazard can be represented as:

h(t) = lim_{Δt→0} p(E(t, t + Δt) | S(t)) / Δt

Here, p(E(t, t + Δt)) represents the probability of experiencing the event between time t and t + Δt, S(t) indicates that the individual survived to time t, and Δt is an infinitely small period of time. The hazard, unlike a true probability, is not bounded between 0 and 1, but rather has no upper bound.

If we examine the hazard a little further, we find that the hazard can be viewed as

h(t) = (# who experience the event during the interval) / (t × # exposed in the interval)

The numerator represents the number of persons experiencing the event, while the denominator is a measure of person-time units of exposure. Given that individuals can experience the event at any time during the interval, each individual who experiences the event can only contribute as many time units of exposure as s/he existed in the interval. For example, if the time interval is one year, and an individual experiences the event in the middle of the interval, his/her contribution to the numerator is 1, while his/her contribution to the denominator is .5.

In a hazard model, the outcome to be modeled is the hazard. If the time intervals are sufficiently small (e.g., minutes, seconds, etc.), then we may use a "continuous time" hazard model; if the intervals are sufficiently large, then we may use a "discrete time" hazard model. There is no clear break point at which one should prefer a discrete time model to a continuous time model. As time intervals become smaller, discrete time methods converge on continuous time methods.

In this discussion, I will focus on hazard models rather than survival or failure time models, although all three functions are related. For example, the hazard (h(t)) is equal to:

h(t) = f(t) / S(t) = f(t) / (1 − F(t)),

the density function (indicating the probability of the event at time t) conditional on (divided by) survival up to that point. Notice that the survival function is represented as 1 − F(t), where F(t) is the integral of the density function. The density function gives the probabilities of experiencing the event at each time t. So, if we want to know the probability that a person


will experience the event by time t, we simply need to know the area under the density function from −∞ to t, which is ∫_{−∞}^{t} f(u) du = F(t). The survival probability beyond t, therefore, is 1 − F(t).

We will focus on hazard models because they are more common in sociology and demography. Survival analysis, beyond simple plotting techniques, is more common in clinical and biological research.

5.1.1 A Demographic Approach: Life Tables

The earliest type of hazard model developed was the life table. The life table generally takes as its input the hazard rate at each time (generally age), uses some assumptions to translate the hazard rate into transition (death) probabilities, and uses a set of flow equations to apply these probabilities to a hypothetical population to ultimately end up with a measure of the time remaining for a person at age a. A basic table might look like:

Table: Single Decrement Life Table

Age   l_a                 h() = µ()   q_a   d_a          L_a              e_a
20    l20 = 100,000       µ20         q20   q20 × l20    (l20 + l21)/2    (Σ_{a=20 to ω} L_a) / l20
21    l21 = l20 − d20     µ21         q21   q21 × l21    (l21 + l22)/2    (Σ_{a=21 to ω} L_a) / l21
...   ...                 ...         ...   ...          ...              ...

The columns in the life table are the following. The l column represents the number of individuals remaining alive in a hypothetical cohort (radix = 100,000) at the start of the interval. The µ column represents the hazard for the time interval. The q column is the probability that an individual alive at the beginning of the interval will die before the end of the interval. An assumption is used to transform µ into q. For example, if we assume that individuals die, on average, in the middle of the interval, then the exposure (in person years) is simply the average of the number of individuals alive at the beginning and the end of the interval, and so:

µ = (q × l) / ( (1/2)(l + (1 − q) × l) ).

Some rearranging yields:

q = (µ × l) / (l + (1/2) µ l),

So, we can obtain the transition probabilities from the hazards. We then apply these probabilities to the population surviving to the beginning of the interval to obtain the count of deaths in the interval (d). We then have enough information to calculate the next l, and we can proceed through the table like this. When we are done, we then construct the L column, which is a count of the person years lived in the interval. Once again, if we assume individuals died in the middle of the interval, then the person years lived is simply the average of


the number of persons alive at the beginning and end of the interval (the denominator of the equation for µ). The next column, T, sums these person years lived from this interval through the end of the table. Finally, the last column, e, divides the cumulative person years remaining to be lived (T) by the number of individuals alive at the start of the interval (l) to obtain an average of the number of years remaining for each person (life expectancy). The value at the earliest age in the table is an approximation of ∫_{−∞}^{+∞} t × f(t) dt, the expectation of the distribution of failure times.

The life table has been extended to handle multiple decrements (multiple types of exits), as well as reverse and repeatable transitions (the multistate life table). A key limitation of the life table has been the difficulty with which covariates can be incorporated in it and the difficulty in performing statistical tests for comparing groups. For this reason, researchers began using hazard regression models.
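To make the flow-equation logic concrete, here is a small sketch (not from the notes) that builds a three-row life table from made-up hazards, using the mid-interval assumption above; a real table would continue to the highest age ω before computing T and e.

    import numpy as np
    import pandas as pd

    mu = np.array([0.01, 0.015, 0.02])       # made-up hazards for three single-year age intervals
    ages = np.array([20, 21, 22])

    q = mu / (1 + 0.5 * mu)                  # q = (mu*l) / (l + 0.5*mu*l), the mid-interval assumption
    l = np.empty(len(mu))
    l[0] = 100_000                           # radix
    for a in range(1, len(mu)):
        l[a] = l[a - 1] * (1 - q[a - 1])     # survivors: l_{a+1} = l_a - d_a

    d = q * l                                # deaths in each interval
    L = l - 0.5 * d                          # person years lived, deaths occurring mid-interval
    T = L[::-1].cumsum()[::-1]               # person years remaining from each age onward
    e = T / l                                # life expectancy at each age (truncated here at age 22)

    print(pd.DataFrame(dict(age=ages, l=l, q=q, d=d, L=L, T=T, e=e)))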

5.1.2 Continuous Time Hazard Models

The most basic continuous time hazard models include the exponential model (also called the constant hazard model), the Gompertz model, and the Weibull model. The difference between these models is their representation of the "baseline hazard," which is the hazard function when all covariate values are 0. I emphasize this definition, because some may be misled by the name into thinking that the baseline hazard is the hazard at time t = 0, and that is not the case.

Suppose for a minute that there are no covariates, and that we assume the hazard does not change over time. In that case, the exponential model is:

ln(h(t)) = a

The hazard is logged because of the bounding at 0. The name "exponential" model stems from 1) the fact that if we exponentiate each side of the equation, the hazard is an exponential function of the constant, and 2) the density function for failure times that corresponds to this hazard is the exponential distribution.

If we assume that the hazard is constant across time, but that different subpopulations have different values of this constant, the exponential model is:

ln(h(t)) = a+Xβ

In this specification, a captures the baseline hazard (on the log scale) and is time-independent, and Xβ is the linear combination of covariates thought to raise or lower the hazard.

Generally, the assumption of a constant hazard is unreasonable: instead, the hazard is often assumed to increase across time. The most common model representing such time-dependent hazards is the Gompertz model, which says that the log of the hazard increases linearly with time:

ln(h(t)) = Xβ + bt.

In demography, we often see this model as:

h(t) = αexp(bt),


with α = exp(Xβ).

Often, we do not think the log of the hazard increases linearly with time, but rather we believe it increases more slowly. Thus, the Weibull model is:

ln(h(t)) = Xβ + b× ln(t).

Each of these models is quite common in practice. However, sometimes we do not believe that any of these specifications for the baseline hazard is appropriate. In those cases, we can construct "piecewise" models that break the baseline hazard into intervals in which the hazard may vary in its form.

A special and very common model in which the baseline hazard remains completely unspecified, while covariate effects can still be estimated, is the Cox proportional hazards model. Cox's great insight was that the likelihood function for a hazard model can be factored into a baseline hazard and a portion that contains the covariate effects. The Cox model looks like:

ln(h(t)) = g(t) +Xβ

where g(t) is an unspecified (log) baseline hazard. Estimation of this model ultimately rests on the ordering of the event times. Thus, a problem exists whenever there are lots of "ties" in the data. This method, therefore, is generally most appropriate when the time intervals in the data are very small, and hence the probability of ties is minimal.
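If the lifelines package is available, a Cox model can be fit along the lines of the sketch below; the data are simulated and all variable names and parameter values are invented for illustration.

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(5)
    n = 200
    age = rng.normal(60, 10, n)
    male = rng.integers(0, 2, n)

    # Exponential event times whose rate depends on the covariates (illustrative data only)
    rate = np.exp(-5 + 0.05 * age + 0.5 * male)
    time = rng.exponential(1 / rate)
    event = (time < 10).astype(int)          # administratively censor at t = 10
    time = np.minimum(time, 10)

    df = pd.DataFrame(dict(time=time, event=event, age=age, male=male))
    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event")
    cph.print_summary()                      # exp(coef) gives hazard ratios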

A final note on continuous time methods: realize that the hazard, being an instantaneous probability, is ultimately always unobserved. Estimation of these models thus requires special software/procedures; they cannot be estimated with OLS or other standard techniques.

5.1.3 Discrete Time Models

When time intervals are discrete, we may use discrete time models. The most common discrete time method is the discrete time logit model. The discrete time logit model is the same logit model that we have already discussed. The only difference in application is the data structure to which the model is applied. The logit model is represented as:

ln( p/(1 − p) ) = Xβ

where p is the probability that the event occurs to the individual in the discrete time interval. Construction of the data set for estimation involves treating each individual as multiple person-time records in which an individual's outcome is coded '0' for the time intervals prior to the occurrence of the event, and is coded '1' for the time interval in which the event does occur. Individuals who do not experience the event over the course of the study are said to be censored, but they do not pose a problem for the analyses: they are simply coded '0' on the outcome for all time intervals. To visualize the structure of the data, the first table shows 10 hypothetical individuals.

Standard format for data


ID    Time until event    Experienced event?
1     3                   1
2     5                   0
3     1                   1
4     1                   1
5     2                   1
6     4                   1
7     4                   1
8     3                   1
9     1                   0
10    2                   1

The study ran for 5 time intervals. Persons 2 and 9 did not experience the event and thus are censored. Person 2 is censored simply because s/he did not experience the event before the end of the study. Person 9 is censored due to attrition. The new data structure is shown in the second table.

Person-year format for data


Record    ID    Time Interval    Experienced event?
1         1     1                0
2         1     2                0
3         1     3                1
4         2     1                0
5         2     2                0
6         2     3                0
7         2     4                0
8         2     5                0
9         3     1                1
10        4     1                1
11        5     1                0
12        5     2                1
13        6     1                0
14        6     2                0
15        6     3                0
16        6     4                1
17        7     1                0
18        7     2                0
19        7     3                0
20        7     4                1
21        8     1                0
22        8     2                0
23        8     3                1
24        9     1                0
25        10    1                0
26        10    2                1

Now we have 26 records rather than the 10 in the original data set. We can compute the hazard by observing the proportion of persons at risk who experience the event during each time period. The hazards are: h(1) = 2/10, h(2) = 2/7, h(3) = 2/5, h(4) = 2/3, h(5) = 0/1. Now, when we run our logit model, the outcome variable is the logit of the hazard, ln[h()/(1 − h())]. We can include time-varying covariates very easily, by simply recording the value of the variable for the respondent record for which it applies. As you can see, censoring is also handled very easily. We can also specify whatever form we would like for the baseline hazard. If we want the baseline hazard to be constant, we simply don't include a variable or function for time in the model. If we want complete flexibility (a fully piecewise specification), we would simply include a dummy variable for every time interval (except one).
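To illustrate the data restructuring and the discrete-time logit, here is a sketch (not from the notes) using pandas and statsmodels on the ten hypothetical cases above; with so few records, time is entered linearly rather than through a full set of interval dummies.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Standard-format data from the first table
    std = pd.DataFrame({
        "id":    range(1, 11),
        "time":  [3, 5, 1, 1, 2, 4, 4, 3, 1, 2],
        "event": [1, 0, 1, 1, 1, 1, 1, 1, 0, 1],
    })

    # Expand to person-period records: 0 until the event interval, 1 in the event interval
    rows = []
    for _, r in std.iterrows():
        for interval in range(1, r["time"] + 1):
            rows.append({"id": r["id"], "interval": interval,
                         "y": int(r["event"] == 1 and interval == r["time"])})
    pp = pd.DataFrame(rows)                        # 26 person-period records

    # Discrete-time hazards by interval (events / persons at risk)
    print(pp.groupby("interval")["y"].mean())

    # Discrete-time logit; a dummy for each interval would give a fully flexible baseline,
    # but is not well determined with so few records, so time enters linearly here
    fit = smf.logit("y ~ interval", data=pp).fit()
    print(fit.params)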

This model is very similar to the ones we have already discussed. As the time intervals get


smaller, the model converges on the continuous time hazard models. Which one it converges to is simply a matter of how we specify time in the model. For example, if we construct a variable equal to ln(t) and enter it into the model, we essentially have the Weibull model. If we just enter t as a variable, we have a Gompertz model.

The interpretation of the model is identical to that of the standard logit model, the only difference being that the outcome is the hazard rather than the probability.

5.2 Fixed/Random Effects Hierarchical Models

Hierarchical modeling is an approach to analyzing data that takes advantage of, and/or compensates for, nesting structure in data. The approach is used to compensate for multi-stage sampling, which induces dependence among errors and leads to biased standard errors. The approach is used to take advantage of hierarchical structuring of data by distinguishing between effects that occur at multiple levels. For example, the approach can be used to differentiate between within-individual change over time and between-individual heterogeneity. The approach can be used to distinguish between family-level effects and neighborhood-level effects on individual outcomes, etc. Thus, fixed/random effects hierarchical models are not exclusively used for panel data, but they can be used for panel data when the nesting structure is individuals (level 2) measured across time (level 1).

As alluded to above, hierarchical modeling is a quite flexible and general approach to modeling data that are collected at multiple levels. Because of this flexibility and wide applicability, hierarchical modeling has been called many things. Here is a list of some of the terms that have been used in referring to these types of models:

• Hierarchical modeling

• Multilevel modeling

• Contextual effects modeling

• Random coefficient models

• Random/Fixed effects models

• Random intercept models

• Random effects models with random intercepts and slopes

• Growth curve modeling

• Latent curve analysis

• 2-level models, 3-level models, etc.

• Variance component models

• Mixed (effects) models

• Random effects ANOVA


This list is not exhaustive, but it covers the most common labels applied to this type of modeling. In this brief discussion, I am not going to give an in-depth mathematical treatment of these models; instead, I will try to show how these names have arisen in different empirical research contexts, even though they are all part of the general hierarchical model. Consider first a simple model for individuals i nested in groups j:

Yij = β0 + β1xij + β2zj + eij

Here, Yij is the individual-level outcome for individual i in group j, β0 is the intercept, β1 is the effect of the individual-level variable xij, β2 is the effect of the group-level variable zj, and eij is a random disturbance.

If there is clustering within groups (e.g., you have all family members in a family, or you have repeated measures on an individual over time), this model is not appropriate, because the eij are not independent (violating the OLS assumption that e ∼ N(0, σ²I)).

Two simple solutions to this dilemma are 1) to pull out the structure in the error term (similar to ARMA models) by decomposing it into a group effect and truly random error, and 2) to separate the intercept into two components: a grand mean and a group mean. The former approach leads to a random effects model; the latter to a fixed effects model, but they look the same:

Yij = β00 + β1xij + γj + eij

Here, the subscript on the intercept has changed to denote the difference between this and the OLS intercept. γj denotes either a fixed effect (the decomposition of the intercept term) or a random effect (the decomposition of the error term). I have eliminated z, because in a fixed effects approach, group-level (fixed) characteristics are not identifiable apart from the group intercepts.

If we treat the model as a random effects model, this model can be called a random intercept model. Note that there are two levels of variance in this specification: true within-individual variance (denoted σ²_e) and between-individual (level 2) variance (denoted τ²), the variance of the random effects. The total variance can be computed as σ²_e + τ².

We now have a model specification that breaks variance across two levels, and we can begin to bring in variables to explain variance at both levels. Suppose we allow the random intercept γj to be a function of group-level covariates and residual group-specific random effects uj, so that γj = γ0 + γ1 zj + uj. Then, substitution yields:

Yij = β1xij + (γ0 + γ1zj + uj) + eij

Notice that I have eliminated the original intercept, β00, as it is no longer identified as a parameter distinct from γ0, the new intercept after adjustment for group-level differences contained in zj. Now, every group has a unique intercept that is a function of a grand mean (γ0), a fixed effect of a group-level variable (γ1), and a random, unexplained component (uj). τ² should shrink as group-level variables are added to account for structure in u, and measures of second-level model fit can be constructed from this information. Similarly, the addition of more individual-level measures (xij) should reduce σ²_e, and first-level model fit can be constructed from this information.

The next extension of this model can be made by observing that, if the intercept can vary between groups, so may the slope coefficients. We could, for example, assume that slopes vary according to a random group factor. So, in the equation:


Yij = (γ0 + γ1zj + uj) + β1xij + eij

We could allow the slope on xij, now written β1j to indicate that it varies by group, to be a function of group-level characteristics:

β1j = δ0 + δ1zj + vj

Substitution yields:

Yij = (γ0 + γ1zj + uj) + (δ0 + δ1zj + vj)xij + eij

Simplification yields:

Yij = (γ0 + γ1zj + uj) + (δ0xij + δ1zjxij + vjxij) + eij

This is the full hierarchical linear model, also called a random coefficients model, a multilevel model, etc. Notice that this model almost appears as a regular OLS model with simply the addition of a cross-level interaction between zj and xij. Indeed, prior to the development of software to estimate this model, many people did simply include the cross-level interaction and estimate the model via OLS, possibly adjusting for standard error bias using some robust estimator for standard errors.

However, this model is NOT appropriately estimated by OLS, because we still have the random effect uj and the term vjxij. In a nutshell, this model now contains 3 sources of variance: within-individual (residual) variance, σ²_e, and two between-individual variances (τ²_intercept and τ²_slope).

We have now discussed the reason for many of the names for the hierarchical model, including multilevel modeling, hierarchical modeling, random/fixed effects modeling, random coefficient modeling, etc. We have not discussed growth curve modeling. I use growth curve modeling extensively in my research, and approach the hierarchical model from that perspective. I also approach the model from a probability standpoint, rather than a pure equation/residual variance standpoint. That being the case, this brief discussion of growth curve modeling will use a very different notation.

Until now, we have treated the two levels of analysis as individual and group. For growth curve modeling, the two levels are based on repeated measures on individuals across time. A basic growth model looks like:

Within-Individual equation:

y_it ∼ N(α_i + β_i t, σ²)

This equation says that time-specific individual measures are a function of an individual-specific intercept term and a slope term capturing change across time.

Between-Individual equations:

α_i ∼ N( γ0 + Σ_{j=1}^{J} γj X_ij, τ²_α )

β_i ∼ N( δ0 + Σ_{k=1}^{K} δk X_ik, τ²_β )

These second-level equations say that there may be between-individual differences in growth patterns, and that they may be explained by individual-level characteristics. Realize that this model, aside from notation, is no different from the hierarchical model discussed above.
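For completeness, here is a sketch (not part of the original notes) of fitting a random-intercept, random-slope growth model with statsmodels; the data are simulated and all variable names and parameter values are invented.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulate repeated measures: 100 individuals, 5 waves each, with
    # person-specific intercepts and slopes (illustrative values only)
    rng = np.random.default_rng(6)
    n, waves = 100, 5
    ids = np.repeat(np.arange(n), waves)
    time = np.tile(np.arange(waves), n)
    alpha = rng.normal(10, 2, n)[ids]          # random intercepts
    beta = rng.normal(0.5, 0.3, n)[ids]        # random slopes
    y = alpha + beta * time + rng.normal(0, 1, n * waves)
    df = pd.DataFrame(dict(id=ids, time=time, y=y))

    # Random intercept and random slope for time, grouped by individual
    growth = smf.mixedlm("y ~ time", df, groups=df["id"], re_formula="~time").fit()
    print(growth.summary())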
