VI. Multiple Linear Regression - Utah State University


Page 1: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

VI. Multiple Linear Regression

The term “multiple” refers to the inclusion of more than one regression variable. The general regression model extends the simple model by including p – 1 covariates Xi1, Xi2, …, Xi,p-1 as follows:

Yi = β0 + β1Xi1 + β2Xi2 + ··· + βp-1Xi,p-1 + εi,

where as before we assume εi ~ N(0, σ²).

Page 2: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Why “control” for additional factors?

In scientific settings, there may be several reasons to include additional variables in a regression model:

• Other investigative variables may have scientifically interesting predictive effects with respect to Y.

• We may be interested in testing the simultaneous effects of multiple factors.

• We need to model a predictor that (a) is a nominal (unordered) categorical variable (e.g., the feed type in an earlier example), or (b) has nonlinear effects that require including polynomial terms (e.g., an X² variable for a quadratic effect).

• There are confounding factors that need to be controlled for as part of the modeling analysis.

Page 3: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Confounding Variables

Suppose that we are investigating the relationship between two variables X and Y. That is, we want to know if variation in X causes variation in Y.

Controlled or randomized experiments are those in which study subjects are randomly assigned to levels of X. This ensures balanced variation in other factors that might influence the relationship between X and Y. We often refer to these additional factors as confounding variables.

Observational studies are those in which there is no randomization – investigative groups are defined based on how subjects selected themselves into a particular treatment or exposure. In this case, we need to control for confounding factors as a part of the data analysis.

Page 4: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.A

Consider recent findings from a study of the effect of “lighthearted” pursuits on average blood pressure:

http://www.cnn.com/2011/HEALTH/03/25/laughter.music.lower.blood.pressure/index.html.

Is this a controlled or observational study? What advantages (if any) does the study design have in this case?

Consider also the following results reported on a study of weight and bullying among children:

http://inhealth.cnn.com/living-well/overweight-kids-more-likely-to-be-bullied?did=cnnmodule1.

What kind of study is this? What factors might the researchers need to control for?

Page 5: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Interpretation of Regression Coefficients

For the multiple regression model, a coefficient βj represents the effect of Xij on E{Yi} (the average of the outcome variable), holding all other variables constant.

In this sense, we are controlling for the effects of other factors by assuming a common effect of Xij on the average of Y across all levels of the additional variables.

Note that in the regression setting it is also quite straightforward to model interactions between two predictor variables, if the effect of one depends on the level of the other. We will discuss interactions more later.

Page 6: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.B

Consider a two-variable model

Yi = β0 + β1Xi1 + β2Xi2 + εi.

Suppose that β0 = 5, β1 = 2, and β2 = 1.5.

What is your interpretation of β0?

What is the effect of Xi1 on the average of Yi when Xi2 = 3? What is the effect of Xi1 when Xi2 = 10?
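As a quick arithmetic check of these interpretations, write out E{Yi} at the two settings of Xi2:

E{Yi} = 5 + 2Xi1 + 1.5Xi2
At Xi2 = 3:  E{Yi} = 5 + 2Xi1 + 4.5 = 9.5 + 2Xi1
At Xi2 = 10: E{Yi} = 5 + 2Xi1 + 15 = 20 + 2Xi1

In either case, a one-unit increase in Xi1 changes the average of Yi by β1 = 2; the level of Xi2 shifts the intercept but not the slope in Xi1.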

Page 7: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.B (continued)

A geometric illustration of this relationship is shown in the plot below. This is a three-dimensional scatterplot, assuming a random sample of points. The superimposed plane represents the combined linear effect of X1 and X2 on the average of Y. What model parameter measures the variation of the points around the plane?

What relationship do we assume between X1 and Y if we hold X2 constant?

Page 8: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.B (continued)

Note that for three or more predictor variables, the linear relationship is manifest through a hyperplane in p dimensions, where p – 1 represents the number of variables included in the model.

Beyond two predictors, it is therefore difficult to actually visualize the model, but we can apply the same interpretation of the relative effects (in terms of the model coefficients) on the average of the outcome variable.

Page 9: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Additional Remarks on “Linear”

“Linear” refers to linearity in the coefficients, not necessarily in the effects of the predictors themselves.

We have already seen this before when discussing the role of transformations for a single variable. For example, we may observe a relationship such that

log(Yi) = β0 + β1Xi1 + ··· + βp-1Xi,p-1 + εi,

so that fitting the linear regression model is simply a matter of fitting

Yi′ = β0 + β1Xi1 + ··· + βp-1Xi,p-1 + εi,

where Yi′ = log(Yi).

Page 10: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Least-squares Fit

Note that the least-squares fit involves the same minimization of squared residuals (in this case around a plane or hyperplane). The residual is defined as:

εi = Yi – (β0 + β1Xi1 + ··· + βp-1Xi,p-1).

To find the least-squares fit, we minimize the objective function

Q = Σ [Yi – (β0 + β1Xi1 + ··· + βp-1Xi,p-1)]²,

where the sum is taken over i = 1, …, n.

Note that there is a lot of matrix notation in Chapter 6 to describe the least-squares approach. Unless you’re interested in the linear algebra, you can ignore all of that.

Page 11: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Fitted Residuals and MSE

As in the simple case, the least-squares fit yields estimates b0, b1, …, bp-1 for the regression coefficients, and our prediction equation can be expressed as:

Ŷi = b0 + b1Xi1 + b2Xi2 + ··· + bp-1Xi,p-1.

The observed residuals are then given by:

ei = Yi – Ŷi = Yi – (b0 + b1Xi1 + ··· + bp-1Xi,p-1), i = 1, …, n,

and our estimated error variance (MSE) is computed as

s² = Σ ei² / (n – p) = Σ (Yi – Ŷi)² / (n – p) = SSE / (n – p) = MSE,

where the sums are taken over i = 1, …, n.
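As a small SAS sketch of how these quantities can be obtained in practice (the data set and variable names here are placeholders, not part of the slides):

proc reg data=mydata;
  model y = x1 x2;                /* least-squares fit; the ANOVA table reports SSE and MSE */
  output out=fitted p=yhat r=e;   /* save the fitted values Y-hat and the residuals e */
run;

proc means data=fitted n uss;
  var e;                          /* USS(e) is the sum of squared residuals (SSE); SSE/(n - p) gives the MSE */
run;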

Page 12: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Analysis of Variance

Note the error degrees of freedom for MSE. This suggests an ANOVA table for a multiple regression model that looks something like this:

Source        Degrees of Freedom    Sum of Squares    Mean Square    F-statistic    p-value
Regression    p – 1                 SSR               MSR            MSR/MSE
Error         n – p                 SSE               MSE
Total         n – 1                 SSTO

Page 13: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Components of Variation

As with the simple model:

• SSTO represents the total variation in Y.

• SSR represents the variation in Y explained by the regression model that includes the set of X variables X1, …, Xp-1.

• SSE represents the residual (unexplained) variation, or the total variation of the points around the hyperplane.

Page 14: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

ANOVA F Test

Note that there are p – 1 degrees of freedom associated with SSR, and n – p degrees of freedom for SSE. The ANOVA F test statistic is therefore given by MSR/MSE, and approximately follows an F(p – 1, n – p) distribution.

The question is: what is the F statistic for? For a regression model with multiple predictors, the F statistic tests the relationship between Y and the set of X variables X1, …, Xp-1. In other words, the overall F test evaluates the hypothesis

H0: β1 = β2 = ··· = βp-1 = 0, versus
HA: at least one βj ≠ 0, for j = 1, …, p – 1.

Under H0, MSR and MSE should be approximately equal, so a large value of F (i.e., a small p-value) provides evidence against the null.

Page 15: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Distribution of Individual Fitted Coefficients

As with the estimated slope in the simple case, the individual fitted coefficients b1, …, bp-1 are at least approximately normally distributed.

All we need to know for a given bj is its estimated standard error s{bj}, and the t distribution can be applied as before.

In other words, the statistic

(bj – βj) / s{bj}

follows a t(n – p) distribution. Note that (especially in the case of multiple predictors) the computation of s{bj} is too complicated to perform by hand, and in general is carried out using software.

Page 16: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Inference for Individual Coefficients

Using the distribution on the previous slide, a (1 – α)100% confidence interval for βj is given by

bj ± t(1 – α/2; n – p) s{bj}.

We also would like to test the null hypothesis H0: βj = 0 versus the alternative hypothesis HA: βj ≠ 0. A test statistic for assessing the evidence against H0 is given by

t = (bj – 0) / s{bj}.

Under H0, this test statistic approximately follows the t(n – p) distribution. The p-value is therefore given by 2P{t(n – p) ≥ |t|}.

Note that we can conceivably test against any specific value of βj, although 0 is generally the value of interest.

Page 17: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.C

The data for this example come from a study of cholesterol levels (Y) among 25 individuals, who were also measured for weight in kg (X1) and age in years (X2). The data are tabulated below:

Chol  Wt  Age        Chol  Wt  Age
354   84   46        220   82   34
254   57   23        385   72   36
190   73   20        402   79   57
405   65   52        365   75   44
263   70   30        311   59   46
395   59   60        181   67   23
434   69   48        274   85   37
220   60   34        209   27   24
451   76   57        290   89   31
302   69   25        346   65   52
288   63   28        303   55   40
374   79   51        244   63   30
308   75   50

Page 18: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.C (continued)

A good exploratory tool when looking at associations between several variables is a scatterplot matrix, as shown on the following slide. (A scatterplot matrix can be produced in SAS – for version 9.2 or higher – using commands shown later in this example.)

What bivariate associations do you observe?

Based on these associations, what do you predict we will see with respect to a fitted model that includes both predictors?

Is it possible that age might confound the relationship between weight and cholesterol level?

Page 19: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Page 20: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.C (continued)

We want to fit a linear model for these data of the form

Yi = β0 + β1Xi1 + β2Xi2 + εi.

The SAS code below reads in the data file, fits the two-variable model (with confidence intervals for the individual coefficients), and also produces a scatterplot matrix.

options ls=79 nodate;
data;
  infile "c:\chris\classes\stat5100\data\cholesterol.txt" firstobs=2;
  input chol weight age;
proc reg;
  model chol=weight age / clb;
run;
proc sgscatter;
  title "Scatterplot Matrix for Cholesterol Data";
  matrix chol weight age;
run;

Page 21: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.C (continued)

SAS Output

The REG Procedure
Model: MODEL1
Dependent Variable: chol

Number of Observations Read          25
Number of Observations Used          25

                        Analysis of Variance

                               Sum of           Mean
Source              DF        Squares         Square    F Value    Pr > F
Model                2         102571          51285      26.36    <.0001
Error               22          42806     1945.73752
Corrected Total     24         145377

Root MSE             44.11051    R-Square    0.7056
Dependent Mean      310.72000    Adj R-Sq    0.6788
Coeff Var            14.19623

Page 22: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.C (continued)

SAS Output

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|
Intercept     1       77.98254       52.42964       1.49      0.1511
weight        1        0.41736        0.72878       0.57      0.5727
age           1        5.21659        0.75724       6.89      <.0001

                        Parameter Estimates

Variable     DF       95% Confidence Limits
Intercept     1       -30.74988      186.71495
weight        1        -1.09403        1.92875
age           1         3.64616        6.78702
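As a check on where these limits come from, the interval for the age coefficient can be reproduced by hand from the formula on the earlier slide, using t(0.975; 22) ≈ 2.074:

b2 ± t(1 – α/2; n – p) s{b2} = 5.21659 ± 2.074 × 0.75724 ≈ 5.21659 ± 1.571,

i.e., approximately (3.646, 6.787), matching the SAS limits for age.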

Page 23: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.C (continued)

Using the SAS results, write out the fitted model.

Interpret the fitted regression coefficients, including the intercept.

Report and interpret the estimated model variance.

Interpret the 95% confidence intervals for each of the coefficients of weight and age.

Is there evidence that either weight or age is individually associated with average cholesterol level in the presence of the other?

Interpret the ANOVA F test results.

Page 24: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Analyzing Partial Sums of Squares

Sums of squares in the ANOVA table can be decomposed into components that correspond to individual predictor variables or sets of predictor variables.

This is a very useful tool for comparing models in order to assess which variables should be included and in what form.

When variables (either one or more) are added to a given model, the SSE is always reduced. The idea behind analyzing so-called partial or extra sums of squares is that we would like to know whether adding these extra factors leads to a significant reduction in SSE. If the reduction is significant, this indicates that the additional variables likewise have significant effects with respect to the outcome variable.
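In SAS, one convenient way to look at these quantities is to request sequential and partial sums of squares on the MODEL statement; a minimal sketch for the cholesterol example is:

proc reg;
  model chol = weight age / ss1 ss2;  /* SS1: sequential (extra) sums of squares as each variable
                                         is added in order; SS2: partial SS given all other predictors */
run;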

Page 25: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.D

Consider again the cholesterol data in Example VI.C. Suppose we begin with a simple regression model using only the weight variable Xi1. Fitting that model, we obtain the prediction equation

Ŷi = 199.30 + 1.62Xi1,

along with the ANOVA table below:

Source        df    SS        MS          F       p-value
Regression     1     10232     10232      1.74    0.200
Error         23    135145    5875.883
Total         24    145377

Page 26: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.D (continued)

If we add in the age variable Xi2, as in the two-variable model of Example VI.C, note what the effect is on the regression and error sums of squares. Denote the simple model on the previous slide as (I), and the two-variable model as (II).

How is (I) nested in (II)?

What are the SSR and SSE for model (I)? What are the SSR and SSE for model (II)? How do they compare?

Page 27: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Effects of Additional Covariates on ANOVA Sums of Squares

In summary, what we observe in Example VI.D holds generally: adding covariates to a regression model always increases the SSR, and decreases the SSE by the same amount.

The issue is whether this change in the sums of squares is significant enough to warrant the larger model.

In other words, is the information or cost required to estimate additional effects worth it?

Page 28: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Nested Models

Consider the predictive factors X1, X2, …, Xq, yielding the model

(I)  Yi = β0 + β1Xi1 + β2Xi2 + ··· + βqXiq + εi.

Suppose we augment this with additional factors represented by Xq+1, Xq+2, …, Xp-1, where p – 1 ≥ q, yielding the model

(II)  Yi = β0 + β1Xi1 + ··· + βqXiq + βq+1Xi,q+1 + ··· + βp-1Xi,p-1 + εi.

We say that model (I) is nested within model (II), in the sense that the regression parameters in model (I) represent a subset of the parameters in model (II).

We can evaluate whether (II) is an improvement on (I) – i.e., whether the reduction in SSE is significant relative to the number of additional parameters – by using a partial F test that compares (II) to (I).

Page 29: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Notation

SSR(X1, …, Xp-1): regression sum of squares including all p – 1 predictors.

SSR(X1, …, Xq): regression sum of squares including only X1, …, Xq.

SSR(Xq+1, …, Xp-1 | X1, …, Xq) = SSE(X1, …, Xq) – SSE(X1, …, Xp-1): the additional or extra regression sum of squares due to inclusion of Xq+1, …, Xp-1.
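To connect this notation with Example VI.D (where X1 = weight and X2 = age), a quick check using the two ANOVA tables already shown gives

SSR(X2 | X1) = SSE(X1) – SSE(X1, X2) = 135145 – 42806 = 92339,

which equals SSR(X1, X2) – SSR(X1) = 102571 – 10232 = 92339.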

Page 30: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Comparing Nested Models with Partial F Tests

Considering the effect of adding covariates to the model, how do we assess whether the additional covariates are “worth it”? We wish to assess the hypotheses

H0: βq+1 = βq+2 = ··· = βp-1 = 0, versus
HA: βk ≠ 0 for at least one k = q + 1, q + 2, …, p – 1.

To make this comparison, we can use the F statistic

F = MSR(Xq+1, …, Xp-1 | X1, …, Xq) / MSE(X1, …, Xp-1)
  = [SSR(Xq+1, …, Xp-1 | X1, …, Xq) / (p – 1 – q)] / MSE(X1, …, Xp-1),

where p – 1 – q is the number of covariates added to the smaller model.

Page 31: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Comparing Nested Models with Partial F Tests, continued

Note that the degrees of freedom in the numerator correspond to the difference in the number of parameters fit for each model, p – 1 – q.

Hence, under the null hypothesis, this statistic follows an F(p – 1 – q, n – p) distribution.

The p-value is the right tail probability corresponding to the observed statistic.
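As an illustration using the cholesterol models from Example VI.D (q = 1 and p = 3, so the numerator has a single degree of freedom):

F = [SSR(X2 | X1) / 1] / MSE(X1, X2) = 92339 / 1945.73752 ≈ 47.5,

which is referred to an F(1, 22) distribution. Note that this agrees with the square of the t statistic for age in the SAS output (6.89² ≈ 47.5).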

Page 32: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.E

Using the results in Example VI.C and Example VI.D, carry out an F test of the hypothesis of no age effect.

What are the test statistic and corresponding p-value?

What are the degrees of freedom associated with this test statistic?

How do these results compare to the confidence interval and t-test for the coefficient of age?
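One way to carry out this partial F test directly in SAS is to add a TEST statement to the PROC REG call from Example VI.C (a sketch):

proc reg;
  model chol = weight age / clb;
  test age = 0;   /* partial F test that the age coefficient is zero, given weight in the model */
run;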

Page 33: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Common Variable Types in Multiple Regression

A multivariable regression model can accommodate:

• non-linear (polynomial) effects,

• nominal categorical variables,

• interactions.

The accompanying three handouts describe such models, and provide examples analyzed using SAS to illustrate their application and interpretation.

Page 34: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

General Guidelines for Model Selection

What do you do with a large data set that has many covariates? How might you go about finding which subset of the covariates adequately describes their effects on the outcome of interest? How might you decide when to treat a given covariate as discrete versus continuous? Which covariates might interact?

These questions are often best answered using a combination of both the art and science of model building. The former has to do with our substantive knowledge of the research question at hand (e.g., what’s been accomplished previously in your line of research), and the latter has to do with empiricism and formal statistical analyses.

The slides that follow contain some suggestions for achieving the balance of parsimony and sophistication that underlies a good model.

Page 35: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

1. Understand your covariates.

In any data analysis setting, you must spend some quality time exploring the distributions of the outcome and predictor variables.

This involves performing simple cross-tabulations and univariate procedures to ensure consistency and correctness, and exploring two- or three-way (or higher order) associations between the variables in order to discern the types of marginal associations that may exist.

Exploratory analysis generally involves a combination of descriptive, inferential, and plotting tools.

Page 36: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

1. Understand your covariates (continued).

Many an analyst has found after much work on a problem that, for example, some of the data were either miscoded or coded in a way that the investigator neglected to understand. Models and their associated interpretations in that case need to be completely reassessed.

In addition, a thoughtful exploratory analysis helps you to avoid including covariates in the model that are very highly correlated. Including such a set of covariates results in the problem of collinearity, where your fitted model may be unreliable: small perturbations in the data may lead to large differences in the resulting parameter estimates, as well as their standard errors.

Page 37: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

2. Start simple.

It’s a good idea to make a short list of some of the most important covariates in your data set, and begin by considering simple models that involve only those. Your list should be dictated by your familiarity with the problem at hand, as well as the primary hypotheses of interest. For example, in an observational public health study of people, it’s a good bet that you would want to account for measures like gender, race, age, and socioeconomic status.

Especially with respect to observational data, there is a tendency for researchers to consider everything at once. This can make the analysis initially overwhelming.

Page 38: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

3. Use automated model selection procedures sparingly.

Procedures that automatically select a model for you, while popular in some settings, are completely ad hoc. Stepwise methods, for example, basically add covariates one at a time (or subtract one at a time), using an arbitrary significance level as the only criterion for inclusion versus exclusion. While such a tool might prove useful for exploratory purposes, it’s a terrible idea to wholly rely on these kinds of procedures.

Page 39: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

4. Remember goodness-of-fit.

Goodness-of-fit is an aspect of data analysis that is all too often ignored. For example, investigators often simply treat covariates as continuous, assuming that their effects are linear without bothering to check such assumptions through simple exploratory analyses.

Avoid running roughshod over your data by checking to make sure the final model you’ve selected is not missing something important.

Page 40: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Variable Inclusion versus Exclusion

There are several motivations that can inform variable inclusion. For example, you likely want to include a variable if:

• It represents a primary hypothesis of interest.

• It has been consistently used in prior analyses in the same line of research.

• It has a statistically significant predictive effect within your own analyses.

Page 41: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Formal Model Comparison Procedures

In terms of the science (i.e., empirical approaches) for model selection, we have already discussed important tools such as t tests for individual coefficients and partial F tests for one or more coefficients.

Aside from these, there are other strategies and metrics that make an objective attempt at balancing predictive power versus parsimony.

The issue: given P – 1 total predictor variables available in your dataset, how do we select the “best” subset of p – 1? This will yield p total regression coefficients (including the intercept) such that

1 ≤ p ≤ P.

Page 42: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Using the Coefficient of Determination

The so-called coefficient of multiple determination is conventionally denoted by R², and represents the percent of the total variation in Y that is explained by a given regression model with p – 1 predictor variables. Its definition is given by

R² = SSR/SSTO = 1 – SSE/SSTO.

You can see that, by construction, 0 ≤ R² ≤ 1.
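For instance, plugging in the ANOVA quantities from the two-variable fit in Example VI.C:

R² = SSR/SSTO = 102571/145377 ≈ 0.7056,

which matches the R-Square value reported by SAS.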

Page 43: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

The Adjusted R2

Recall that whenever a predictor is added to a model, SSR increases, so that R² will also increase. To balance this increase against the worth of an additional coefficient, the adjusted R² is often also considered, defined as

R²a = 1 – [SSE/(n – p)] / [SSTO/(n – 1)].

Note that the adjusted value may actually decrease with an additional predictor, if the decrease in SSE is offset by the loss of a degree of freedom in the numerator.
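Again checking against the Example VI.C output:

R²a = 1 – (42806/22) / (145377/24) = 1 – 1945.74/6057.38 ≈ 0.6788,

matching the Adj R-Sq value reported by SAS.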

Page 44: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Selection Criteria Based on R2 Measures

R²p: the value of R² for a given set of p – 1 covariates.

R²a,p: the corresponding adjusted value for the same set of covariates.

One criterion is to choose a set of covariates that yields a “large” R²p. Recall that since this value increases each time a variable is added to the model, under this criterion we need to make a subjective decision about whether the increase is sufficiently large to warrant the additional factor.

To make the decision more objective, another criterion is to select the model that maximizes the adjusted value R²a,p.

Page 45: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Mallows’ Cp

This metric attempts to quantify the total mean square error (MSE) across the sampled subjects, relative to the model fit. Note that MSE is a measure that combines variability and bias. Here, the bias represents deviations from the true underlying model for E{Y} that arise because important variables have not been included.

While the technical details behind the computation of Cp for a given model are not that critical, in terms of interpretation we note that if key factors are omitted then Cp > p. That is, generally speaking, models with little bias (which is more desirable) will yield values of Cp close to p.
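For reference (the slide omits the formula; this is the standard definition found in most regression texts), Cp for a candidate model with p coefficients can be computed as

Cp = SSEp / MSE(X1, …, XP-1) – (n – 2p),

where the MSE in the denominator comes from the full model containing all P – 1 available predictors. In particular, the full model itself always has Cp = p, which is one way to see why models with little bias tend to give Cp close to p.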

Page 46: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

AICp and SBCp

As with the adjusted R² and Mallows’ Cp, both the AICp and SBCp penalize models that have large numbers of predictors relative to their predictive power. Their definitions are respectively given by

AICp = n log(SSEp) – n log(n) + 2p, and

SBCp = n log(SSEp) – n log(n) + [log(n)]p.

The first term in these definitions will decrease as p increases. The second term is fixed for a given sample size. The third term in either will increase with larger p.

Page 47: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Comparing AICp and SBCp

Models with small SSEp will perform well using either of these criteria, provided that the penalties imposed by the third term are not too large.

Note that if n ≥ 8, then the penalty for SBCp is larger than that for AICp. Hence, the SBCp criterion favors models with fewer covariates.

Page 48: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.E

To illustrate the use of these various criteria, we will use the Concord, NH, summer water-use data that we examined for Example V.U, which are posted on the course website in the file concord.txt. The outcome variable Y is gallons of water used during the summer months in 1980. Predictors include annual income, years of education, retirement status (yes/no), and number of individuals in the household.

See the accompanying handout for SAS code, including commands to generate all possible regressions along with the model selection metrics discussed thus far.
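The handout has the complete code; a minimal sketch of the kind of PROC REG call that produces these metrics for all possible subsets might look like the following (the data set and predictor names here are placeholders for those in concord.txt):

proc reg data=concord;
  model water = income education retired household
        / selection=rsquare adjrsq cp aic sbc;   /* all possible subsets, reporting R-square,
                                                    adjusted R-square, Cp, AIC, and SBC for each */
run;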

Page 49: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Collinearity

Collinearity or multicollinearity describes a setting in which two or more predictors are highly correlated. When such a set or subset of explanatory variables is included in a model, we can experience problems with

• the conventional interpretation of regression coefficients (i.e., effects of predictive factors holding all other variables constant);

• higher sampling variability of fitted coefficients, making it more difficult to detect actual associations between the response variable and predictors;

• numerical computation required for the least-squares fit.

Page 50: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.F

Consider a case where we have a model for an outcome variable Y = (5, 7, 4, 3, 8), with two predictors X1 = (4, 5, 5, 4, 11) and X2 = (8, 10, 10, 8, 22). Pair-wise scatterplots of these three variables are shown below. What is the relationship between Y and X1? Between Y and X2? What do you notice about X1 and X2? What is the correlation coefficient between the two predictors?

Page 51: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.F (continued)

Suppose we consider first the model

Y = β0 + β1X1 + ε.

Because X2 = 2X1, we have perfect collinearity between the predictors. In general, this is true if any predictor variable can be expressed as a linear function of the others. In this case, suppose we now consider the two-variable model:

Y = α0 + α1X1 + α2X2 + ε = α0 + α1X1 + α2(2X1) + ε.

Notice that for this second model the coefficients are no longer unique. That is, suppose we choose α2 = 0. Then α1 = β1 yields the original (simple) model. Now consider α2 = 2. Then α1 = β1 – 4 yields the original model.

Computationally speaking, given our data, the least-squares solution for the two-variable model will be indeterminate.

Page 52: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.F (continued)

In fact, if we try to fit the two-variable model in SAS, this is the result:

The REG Procedure
Model: MODEL1
Dependent Variable: Y

Number of Observations Read           5
Number of Observations Used           5

                        Analysis of Variance

                               Sum of           Mean
Source              DF        Squares         Square    F Value    Pr > F
Model                1       10.89811       10.89811       5.19    0.1072
Error                3        6.30189        2.10063
Corrected Total      4       17.20000

Root MSE             1.44935    R-Square    0.6336
Dependent Mean       5.40000    Adj R-Sq    0.5115
Coeff Var           26.83990

Page 53: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Example VI.F (continued)

NOTE: Model is not full rank. Least-squares solutions for the parameters are
      not unique. Some statistics will be misleading. A reported DF of 0 or B
      means that the estimate is biased.
NOTE: The following parameters have been set to 0, since the variables are a
      linear combination of other variables as shown.

      X2 = 2 * X1

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|
Intercept     1        1.52830        1.81920       0.84      0.4625
X1            B        0.71698        0.31478       2.28      0.1072
X2            0              0              .          .           .

Page 54: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Near versus Perfect Collinearity

Of course, having perfectly correlated variables is an easily detectable problem, as your software package will yield an error when fitting models that include those variables.

The more practical worry in real-world research settings is the issue of variables that are highly – but not perfectly – correlated.

How can you detect such a potential problem?

Page 55: VI. Multiple Linear Regression - Utah State University

Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2011

Collinearity Diagnostics

• Compute the correlation matrix for your set of variables. A rule of thumb is to be very careful about including two predictors with an r² of around 0.8 or higher.

• Regress each predictor on all others in turn, and examine the value of the coefficient of determination R² for each model.

• Examine the marginal effect on the fit of coefficients already in the model when an additional factor is added. For example, if a previously included variable has a highly significant coefficient whose magnitude, sign, and/or p-value dramatically changes with the addition of another factor, then those two predictors should be closely examined.

• (Not mentioned in the text.) Examine the eigenvalues of the correlation matrix of the predictors. Eigenvalues equal to zero or relatively close to zero indicate singularity or near-singularity in this matrix. The ratio of the largest to smallest eigenvalue is called the condition number of the matrix. A common rule of thumb is that a condition number > 30 represents a red flag. (A SAS sketch of some of these checks follows below.)
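As a minimal sketch of how the first two checks, along with SAS’s built-in regression collinearity diagnostics, can be run for the cholesterol example (the VIF, COLLIN, and PROC CORR pieces are standard SAS; the data set name is a placeholder):

proc corr data=chol;
  var weight age;                        /* pairwise correlations among the predictors */
run;

proc reg data=chol;
  model chol = weight age / vif collin;  /* VIF = 1/(1 - R-square) from regressing each predictor
                                            on the others; COLLIN prints eigenvalues and condition indices */
run;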