Chapter14

37
1 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Chapter 14 Multiple Regression Models

description

 

Transcript of Chapter14

Page 1: Chapter14

1 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Chapter 14

Multiple Regression Models

Page 2: Chapter14

2 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

A general additive multiple regression model, which relates a dependent variable y to k predictor variables x1, x2,…, xk is given by the model equation

y = + 1x1 + 2x2 + … + kxk + e

Multiple Regression Models

Page 3: Chapter14

3 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The random deviation e is assumed to be normally distributed with mean value 0 and variance 2 for any particular values of x1, x2,…, xk.

This implies that for fixed x1, x2,…, xk values, y has a normal distribution with variance 2 and

Multiple Regression Models

(mean y value for fixed x1, x2,…, xk values) = + 1x1 + 2x2 + … + kxk

Page 4: Chapter14

4 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The i’s are called population regression coefficients; each i can be interpreted as the true average change in y when the predictor xi increases by 1 unit and the values of all the other predictors remain fixed.

Multiple Regression Models

The deterministic portion

+ 1x1 + 2x2 + … + kxk

is called the population regression function.

Page 5: Chapter14

5 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The kth degree polynomial regression model y = + 1x + 2x2 + … + kxk + eis a special case of the general multiple regression model with x1 = x, x2 = x2, … , xk = xk.

The population regression function (mean value of y for fixed values of the predictors) is

+ 1x + 2x2 + … + kxk.

Polynomial Regression Models

Page 6: Chapter14

6 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The most important special case other than simple linear regression (k = 1) is the quadratic regression model

y = + 1x + 2x2.

This model replaces the line y = + x with a parabolic cure of mean values + 1x + 2x2.

If 2 > 0, the curve opens upward, whereas if 2 < 0, the curve opens downward.

Polynomial Regression Models

Page 7: Chapter14

7 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

If the change in the mean y value associated with a 1-unit increase in one independent variable depends on the value of a second independent variable, there is interaction between these two variables. When the variables are denoted by x1 and x2, such interaction can be modeled by including x1x2, the product of the variables that interact, as a predictor variable.

Interaction

Page 8: Chapter14

8 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Up to now, we have only considered the inclusion of quantitative (numerical) predictor variables in a multiple regression model.

Two other types are very common:

Dichotomous variable: One with just two possible categories coded 0 and 1

Examples

Gender {male, female}

Marriage status {married, not-married}

Qualitative Predictor Variables.

Page 9: Chapter14

9 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Ordinal variables: Categorical variables that have a natural ordering

Activity level {light, moderate, heavy} coded respectively as 1, 2 and 3

Education level {none, elementary, secondary, college, graduate} coded respectively 1, 2, 3, 4, 5 (or for that matter any 5 consecutive integers}

Qualitative Predictor Variables.

Page 10: Chapter14

10 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

According to the principle of least squares, the fit of a particular estimated regression function a + b1x1 + b2x2 + … + bkxk to the observed data is measured by the sum of squared deviations between the observed y values and the y values predicted by the estimated function:

[y –(a + b1x1 + b2x2 + … + bkxk )]2

Least Square Estimates

The least squares estimates of , 1, 2,…, k are those values of a, b1, b2, … , bk that make this sum of squared deviations as small as possible.

Page 11: Chapter14

11 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Predicted Values & Residuals

Doing this successively for the remaining observations yields the predicted values

(sometimes referred to as the fitted values or fits).

2 3 ky ,y , ,yˆ ˆ ˆ

The first predicted value is obtained by taking the values of the predictor variables x1, x2,…, xk for the first sample observation and substituting these values into the estimated regression function.

1y

Page 12: Chapter14

12 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Predicted Values & Residuals

The residuals are then the differences

between the observed and predicted y values.

1 1 2 2 k ky y ,y y , ,y yˆ ˆ ˆ

Page 13: Chapter14

13 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Sums of Squares

The number of degrees of freedom associated with SSResid is n - (k + 1), because k + 1 df are lost in estimating the k + 1 coefficients , 1, 2,…,k.

The residual (or error) sum of sqyares, SSResid, and total sum of squares, SSTo, are given by

where is the mean of the y observations in the sample.

2 2ˆSSResid= y-y SSTo= y-y

y

Page 14: Chapter14

14 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Estimate for 2

An estimate of the random deviation variance 2 is given by

and is the estimate of .

2e

SSResids

n - (k + 1)

2e es s

Page 15: Chapter14

15 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Coefficient of Multiple Determination, R2

The coefficient of multiple determination, R2, interpreted as the proportion of variation in observed y values that is explained by the fitted model, is

2 SSResidR 1

SSTo

Page 16: Chapter14

16 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Adjusted R2

Generally, a model with large R2 and small se are desirable. If a large number of variables (relative to the number of data points) is used those conditions may be satisfied but the model will be unrealistic and difficult to interpret.

Page 17: Chapter14

17 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Adjusted R2

To sort out this problem, sometimes computer packages compute a quantity called the adjusted R2,

2 n 1 SSResidadjusted R 1

n (k 1) SSTo

Notice that when a large number of variables are used to build the model, this value will be substantially lower than R2 and give a better indication of usability of the model.

Page 18: Chapter14

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

F DistributionsF distributions are similar to a Chi-Square distributions, but have two parameters, dfden and dfnum.

Page 19: Chapter14

19 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The F Test for Model Utility

The regression sum of squares denoted by SSReg is defined by

SSREG = SSTo - SSresid

Page 20: Chapter14

20 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The F Test for Model Utility

When all k i’s are zero in the model

y = + 1x1 + 2x2 + … + kxk + e

And when the distribution of e is normal with mean 0 and variance 2 for any particular values of x1, x2,…, xk, the statistic

has an F probability distribution based on k numerator df and n - (K+ 1) denominator df

SSRegrkF

SSResidn (k 1)

Page 21: Chapter14

21 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The F Test for Utility of the Model y = + 1x1 + 2x2 + … + kxk + e

Null hypothesis: H0: 1 = 2 = … = k =0

(There is no useful linear relationship between y and any of the predictors.)

Alternate hypothesis: Ha: At least one among 1, 2, … , k is

not zero(There is a useful linear relationship between y and at least one of the predictors.)

Page 22: Chapter14

22 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The F Test for Utility of the Model y = + 1x1 + 2x2 + … + kxk + e

Test statistic: SSRegrkF

SSResidn (k 1)

where SSreg = SST0 - SSresid.

An alternate formula: 2

2

RkF

(1 R )n (k 1)

where SSreg = SST0 - SSresid.

Page 23: Chapter14

23 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The F Test Utility of the Model y = + 1x1 + 2x2 + … + kxk + e

The test is upper-tailed, and the information in the Table of Values that capture specified upper-tail F curve areas is used to obtain a bound or bounds on the P-value using numerator df = k and denominator df = n - (k + 1).

Assumptions: For any particular combination of predictor variable values, the distribution of e, the random deviation, is normal with mean 0 and constant variance.

Page 24: Chapter14

24 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

A number of years ago, a group of college professors teaching statistics met at an NSF program and put together a sample student research project.

They attempted to create a model to explain lung capacity in terms of a number of variables.

Specifically,

Numerical variables: height, age, weight, waist

Categorical variables: gender, activity level and smoking status.

Page 25: Chapter14

25 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

They managed to sample 41 subjects and obtain/measure the variables.

There was some discussion and many felt that the calculated variable (height)(waist)2 would be useful since it would likely be proportional to the volume of the individual.

The initial regression analysis performed with Minitab appears on the next slide.

Page 26: Chapter14

26 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

ExampleLinear Model with All Numerical Variables

The regression equation isCapacity = - 13.0 - 0.0158 Age + 0.232 Height - 0.00064 Weight - 0.0029 Chest + 0.101 Waist -0.000018 hw2

40 cases used 1 cases contain missing values

Predictor Coef SE Coef T PConstant -13.016 2.865 -4.54 0.000Age -0.015801 0.007847 -2.01 0.052Height 0.23215 0.02895 8.02 0.000Weight -0.000639 0.006542 -0.10 0.923Chest -0.00294 0.06491 -0.05 0.964Waist 0.10068 0.09427 1.07 0.293hw2 -0.00001814 0.00001761 -1.03 0.310

S = 0.5260 R-Sq = 78.2% R-Sq(adj) = 74.2%

Page 27: Chapter14

27 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example

The only coefficient that appeared to be significant and the 5% level was the height. Since the P-value for the coefficient on the age was very close to 5% (5.2%) it was decided that a linear model with the two independent variables height and age would be calculated.

The resulting model is on the next slide.

Page 28: Chapter14

28 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

ExampleLinear Model with variables: Height & Age

The regression equation isCapacity = - 10.2 + 0.215 Height - 0.0133 Age

40 cases used 1 cases contain missing values

Predictor Coef SE Coef T PConstant -10.217 1.272 -8.03 0.000Height 0.21481 0.01921 11.18 0.000Age -0.013322 0.005861 -2.27 0.029

S = 0.5073 R-Sq = 77.2% R-Sq(adj) = 76.0%

Notice that even though the R2 value decreases slightly, the adjusted R2 value actually increases. Also note that the coefficient on Age is now significant at 5%.

Page 29: Chapter14

29 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

ExampleIn an attempt to determine if incorporating the categorical variables into the model would significantly enhance the it.

Gender was coded as an indicator variable (male = 0 and female = 1),

Smoking was coded as an indicator variable (No = 0 and Yes = 1), and

Activity level (light, moderate, heavy) was coded respectively as 1, 2 and 3.

The resulting Minitab output is given on the next slide.

Page 30: Chapter14

30 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

ExampleLinear Model with categorical variables added

The regression equation isCapacity = - 7.58 + 0.171 Height - 0.0113 Age - 0.383 C-Gender + 0.260 C-Activity - 0.289 C-Smoke

37 cases used 4 cases contain missing values

Predictor Coef SE Coef T PConstant -7.584 2.005 -3.78 0.001Height 0.17076 0.02919 5.85 0.000Age -0.011261 0.005908 -1.91 0.066C-Gender -0.3827 0.2505 -1.53 0.137C-Activi 0.2600 0.1210 2.15 0.040C-Smoke -0.2885 0.2126 -1.36 0.185

S = 0.4596 R-Sq = 84.2% R-Sq(adj) = 81.7%

Page 31: Chapter14

31 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

ExampleIt was noted that coefficient for the coded indicator variables gender and smoking were not significant, but after considerable discussion, the group felt that a number of the variables were related.

This, the group felt, was confounding the study. In an attempt to determine a reasonable optimal subgroup of the variables to keep in the study, it was noted that a number of the variables were highly related. Since the study was small, a stepwise regression was run and the variables, Height, Age, Coded Activity, Coded Gender were kept and the following model was obtained.

Page 32: Chapter14

32 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

ExampleLinear Model with Height, Age & Coded Activity

and Gender

The regression equation isCapacity = - 6.93 + 0.161 Height - 0.0137 Age + 0.302 C-Activity - 0.466 C-Gender

40 cases used 1 cases contain missing values

Predictor Coef SE Coef T PConstant -6.929 1.708 -4.06 0.000Height 0.16079 0.02454 6.55 0.000Age -0.013744 0.005404 -2.54 0.016C-Activi 0.3025 0.1133 2.67 0.011C-Gender -0.4658 0.2082 -2.24 0.032

S = 0.4477 R-Sq = 83.2% R-Sq(adj) = 81.3%

Page 33: Chapter14

33 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example Linear Model with Height, Age & Coded Activity

and Gender

Analysis of Variance

Source DF SS MS F PRegression 4 34.8249 8.7062 43.44 0.000Residual Error 35 7.0151 0.2004Total 39 41.8399

Source DF Seq SSHeight 1 30.9878Age 1 1.3296C-Activi 1 1.5041C-Gender 1 1.0034

Unusual ObservationsObs Height Capacity Fit SE Fit Residual St Resid 4 66.0 2.2000 3.2039 0.1352 -1.0039 -2.35R 23 74.0 5.7000 4.7635 0.2048 0.9365 2.35R 39 70.0 5.4000 4.4228 0.1064 0.9772 2.25R

R denotes an observation with a large standardized residual

The rest of the Minitab output is given below.

Page 34: Chapter14

34 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example Linear Model with Height, Age & Coded Activity

and GenderAll of the coefficients in this model were significant at the 5% level and the R2 and adjusted R2 were both fairly large.

This appeared to be a reasonable model for describing lung capacity even though the study was limited by sample size, and measurement limitations due to antique equipment.

Minitab identified 3 outliers (because the standardized residuals were unusually large.

Various plots of the standardized residuals are produced on the next few slides with comments

Page 35: Chapter14

35 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example Linear Model with Height, Age & Coded Activity

and Gender

The histogram of the residuals appears to be consistent with the assumption that the residuals are a sample from a normal distribution.

1.00.80.60.40.20.0-0.2-0.4-0.6-0.8-1.0

10

5

0

Residual

Fre

quen

cy

Histogram of the Residuals(response is Capacity)

Page 36: Chapter14

36 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example Linear Model with Height, Age & Coded Activity

and Gender

The normality plot also tends to indicate the residuals can reasonably be thought to be a sample from a normal distribution.

10-1

2

1

0

-1

-2

Nor

mal

Sco

re

Residual

Normal Probability Plot of the Residuals(response is Capacity)

Page 37: Chapter14

37 Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example Linear Model with Height, Age & Coded Activity

and Gender

The residual plot also tends to indicate that the model assumptions are not unreasonable, although there would be some concern that the residuals are predominantly positive for smaller fitted lung capacities.

65432

1

0

-1

Fitted Value

Res

idua

l

Residuals Versus the Fitted Values(response is Capacity)