Regression Analysis and Multiple Regression
Session 7
• Using Statistics
• The Simple Linear Regression Model
• Estimation: The Method of Least Squares
• Error Variance and the Standard Errors of Regression Estimators
• Correlation
• Hypothesis Tests about the Regression Relationship
• How Good is the Regression?
• Analysis of Variance Table and an F Test of the Regression Model
• Residual Analysis and Checking for Model Inadequacies
• Use of the Regression Model for Prediction
• Using the Computer
• Summary and Review of Terms
Simple Linear Regression Model
This scatterplot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. We notice that:
Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
Scatterplot of Advertising Expenditures (X) and Sales (Y)
[Figure: scatterplot with Advertising (0 to 50) on the x-axis and Sales (0 to 140) on the y-axis.]
The scatter of points tends to be distributed around a positively sloped straight line.
The pairs of values of advertising expenditures and sales are not located exactly on a straight line. The scatter plot reveals a more or less strong tendency rather than a precise linear relationship. The line represents the nature of the relationship on average.
7-1 Using Statistics
[Figure: six scatterplot panels showing different possible X-Y patterns.]
Examples of Other Scatterplots
The inexact nature of the relationship between advertising and sales suggests that a statistical model might be useful in analyzing the relationship.
A statistical model separates the systematic component of a relationship from the random component.
Statistical model:
Data = Systematic component + Random errors
In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE).
In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
Model Building
The population simple linear regression model:
Y = β₀ + β₁X + ε
(nonrandom, systematic component: β₀ + β₁X; random component: ε)
where Y is the dependent variable, the variable we wish to explain or predict; X is the independent variable, also called the predictor variable; and ε is the error term, the only random component in the model, and thus the only source of randomness in Y.
β₀ is the intercept of the systematic component of the regression relationship.
β₁ is the slope of the systematic component.
The conditional mean of Y: E[Y|X] = β₀ + β₁X
7-2 The Simple Linear Regression Model
The simple linear regression model posits an exact linear relationship between the expected or average value of Y, the dependent variable, and X, the independent or predictor variable:
E[Yᵢ] = β₀ + β₁Xᵢ
Actual observed values of Y differ from the expected value by an unexplained or random error:
Yᵢ = E[Yᵢ] + εᵢ = β₀ + β₁Xᵢ + εᵢ
[Figure: regression plot of the line E[Y] = β₀ + β₁X, with intercept β₀, slope β₁, and the error εᵢ shown as the vertical distance from the observed Yᵢ to the line at Xᵢ.]
Regression Plot
Picturing the Simple Linear Regression Model
• The relationship between X and Y is a straight-line relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term εᵢ.
• The errors εᵢ are normally distributed with mean 0 and variance σ². The errors are uncorrelated (not related) across successive observations. That is: ε ~ N(0, σ²).
[Figure: the line E[Y] = β₀ + β₁X with identical normal distributions of errors, all centered on the regression line.]
Assumptions of the Simple Linear Regression Model
Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.
The estimated regression equation:
Y = b₀ + b₁X + e
where b₀ estimates the intercept of the population regression line, β₀; b₁ estimates the slope of the population regression line, β₁; and e stands for the observed errors, the residuals from fitting the estimated regression line b₀ + b₁X to a set of n points.
The estimated regression line:
Ŷ = b₀ + b₁X
where Ŷ (Y-hat) is the value of Y lying on the fitted regression line for a given value of X.
7-3 Estimation: The Method of Least Squares
Fitting a Regression Line
[Figure: four panels: the raw data; three errors from an arbitrary fitted line; three errors from the least squares regression line; errors from the least squares regression line are minimized.]
Error: eᵢ = Yᵢ − Ŷᵢ, where Ŷᵢ is the predicted value of Y for Xᵢ and Ŷ = b₀ + b₁X is the fitted regression line.
Errors in Regression
Least Squares Regression
The sum of squared errors in regression is:
SSE = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
The least squares regression line is the line that minimizes the SSE with respect to the estimates b₀ and b₁.
The normal equations:
Σy = nb₀ + b₁Σx   (yields least squares b₀)
Σxy = b₀Σx + b₁Σx²   (yields least squares b₁)
Sums of Squares and Cross Products:
SS_X = Σ(x − x̄)² = Σx² − (Σx)²/n
SS_Y = Σ(y − ȳ)² = Σy² − (Σy)²/n
SS_XY = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n
Least squares regression estimators:
b₁ = SS_XY / SS_X
b₀ = ȳ − b₁x̄
Sums of Squares, Cross Products, and Least Squares Estimators
Miles   Dollars   Miles²      Miles×Dollars
 1211     1802     1466521      2182222
 1345     2405     1809025      3234725
 1422     2005     2022084      2851110
 1687     2511     2845969      4236057
 1849     2332     3418801      4311868
 2026     2305     4104676      4669930
 2133     3016     4549689      6433128
 2253     3385     5076009      7626405
 2400     3090     5760000      7416000
 2468     3694     6091024      9116792
 2699     3371     7284601      9098329
 2806     3998     7873636     11218388
 3082     3555     9498724     10956510
 3209     4692    10297681     15056628
 3466     4244    12013156     14709704
 3643     5298    13271449     19300614
 3852     4801    14837904     18493452
 4033     5147    16265089     20757852
 4267     5738    18207288     24484046
 4498     6420    20232004     28877160
 4533     6059    20548088     27465448
 4804     6426    23078416     30870504
 5090     6321    25908100     32173890
 5233     7026    27384288     36767056
 5439     6964    29582720     37877196
-----    ------   ---------    ---------
79448   106605   293426944    390185024
SS_X = Σx² − (Σx)²/n = 293426944 − 79448²/25 = 40,947,557.84
SS_XY = Σxy − (Σx)(Σy)/n = 390185024 − (79448)(106605)/25 = 51,402,852.4
b₁ = SS_XY / SS_X = 51402852.4 / 40947557.84 = 1.255333776 ≈ 1.26
b₀ = ȳ − b₁x̄ = 106605/25 − (1.255333776)(79448/25) = 274.85
Example 7-1
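These estimates are easy to reproduce numerically. A minimal sketch in Python, assuming the 25 (miles, dollars) pairs from the table above are loaded into two arrays (only the first five pairs are shown here for brevity):

import numpy as np

# First five of the 25 (miles, dollars) pairs from the Example 7-1 table;
# extend with the remaining rows to reproduce the slide's numbers exactly.
miles = np.array([1211, 1345, 1422, 1687, 1849], dtype=float)
dollars = np.array([1802, 2405, 2005, 2511, 2332], dtype=float)

n = len(miles)
ss_x = np.sum(miles**2) - np.sum(miles)**2 / n                          # SS_X
ss_xy = np.sum(miles * dollars) - np.sum(miles) * np.sum(dollars) / n  # SS_XY

b1 = ss_xy / ss_x                          # slope estimate
b0 = dollars.mean() - b1 * miles.mean()    # intercept estimate
print(f"b0 = {b0:.2f}, b1 = {b1:.5f}")     # full data give b0 = 274.85, b1 = 1.25533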
[Figure: fitted-line plot of Dollars (1000 to 8000) against Miles (1000 to 5500).]
Y = 274.850 + 1.25533X   R-Squared = 0.965
Regression of Dollars Charged against Miles
MTB > Regress 'Dollars' 1 'Miles';
SUBC> Constant.

Regression Analysis

The regression equation is
Dollars = 275 + 1.26 Miles

Predictor     Coef     Stdev   t-ratio      p
Constant     274.8     170.3      1.61  0.120
Miles      1.25533   0.04972     25.25  0.000

s = 318.2   R-sq = 96.5%   R-sq(adj) = 96.4%

Analysis of Variance

SOURCE      DF        SS        MS       F      p
Regression   1  64527736  64527736  637.47  0.000
Error       23   2328161    101224
Total       24  66855896

Example 7-1: Using the Computer
The following results are the output created by selecting the REGRESSION option from Excel's DATA ANALYSIS toolkit.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.98243393
R Square            0.965176428
Adjusted R Square   0.963662359
Standard Error      318.1578225
Observations        25

ANOVA
            df          SS            MS            F        Significance F
Regression   1   64527736.8    64527736.8   637.4721586   2.85084E-18
Residual    23   2328161.201     101224.4
Total       24   66855898

           Coefficients  Standard Error      t Stat      P-value       Lower 95%    Upper 95%
Intercept   274.8496867     170.3368437  1.61356569   0.120259309  -77.51844165   627.217815
MILES       1.255333776     0.049719712   25.248211   2.85084E-18   1.152480856  1.358186696

Example 7-1: Using Computer-Excel
Residual Analysis. The plot shows the absence of a relationship between the residuals and the X-values (miles).
[Figure: Residuals vs. Miles; residuals between -800 and 600 scatter randomly around zero across Miles 0 to 6000.]
Example 7-1: Using Computer-Excel
[Figure: two views of the same scatter: the total variation of Y, seen when looking at Y alone, and the smaller error variance of Y, seen when looking along the regression line.]
Total Variance and Error Variance
Degrees of Freedom in Regression:
df = (n − 2)   (n total observations, less one degree of freedom for each parameter estimated, b₀ and b₁)

An unbiased estimator of σ², denoted by s²:
MSE = SSE / (n − 2)

SSE = Σ(Y − Ŷ)² = SS_Y − (SS_XY)²/SS_X = SS_Y − b₁SS_XY
(Square and sum all regression errors to find SSE.)

Example 7-1:
SSE = SS_Y − b₁SS_XY = 66855898 − (1.255333776)(51402852.4) = 2328161.2
MSE = SSE/(n − 2) = 2328161.2/23 = 101224.4
s = √MSE = √101224.4 = 318.158
7-4 Error Variance and the Standard Errors of Regression Estimators
The standard error of b₀ (intercept):
s(b₀) = s·√( Σx² / (n·SS_X) ),   where s = √MSE
The standard error of b₁ (slope):
s(b₁) = s / √SS_X

Example 7-1:
s(b₀) = 318.158·√( 293426944 / ((25)(40947557.84)) ) = 170.338
s(b₁) = 318.158 / √40947557.84 = 0.04972

Standard Errors of Estimates in Regression
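A quick numeric check of these standard errors, as a sketch using the quantities already computed in Example 7-1:

import math

n = 25
ss_x = 40947557.84        # SS_X from Example 7-1
sum_x2 = 293426944.0      # sum of squared miles
mse = 101224.4
s = math.sqrt(mse)        # 318.158

se_b0 = s * math.sqrt(sum_x2 / (n * ss_x))   # ~170.338
se_b1 = s / math.sqrt(ss_x)                  # ~0.04972
print(round(se_b0, 3), round(se_b1, 5))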
A (1 − α)100% confidence interval for β₀:
b₀ ± t_(α/2, n−2)·s(b₀)
A (1 − α)100% confidence interval for β₁:
b₁ ± t_(α/2, n−2)·s(b₁)

Example 7-1, 95% confidence intervals:
b₀ ± t_(0.025, 25−2)·s(b₀) = 274.85 ± (2.069)(170.338) = 274.85 ± 352.43 = [−77.58, 627.28]
b₁ ± t_(0.025, 25−2)·s(b₁) = 1.25533 ± (2.069)(0.04972) = 1.25533 ± 0.10287 = [1.15246, 1.35820]

[Figure: the least-squares point estimate of the slope, b₁ = 1.25533, with the lower 95% bound 1.15246 and the upper 95% bound 1.35820; a slope of 0 lies outside these bounds and so is not a possible value of the regression slope at 95% confidence.]
Confidence Intervals for the Regression Parameters
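The intervals above can be verified with scipy's t distribution; a sketch using the estimates and standard errors from Example 7-1:

from scipy import stats

t_crit = stats.t.ppf(1 - 0.05 / 2, df=23)           # 2.069
for est, se in [(274.85, 170.338), (1.25533, 0.04972)]:
    half = t_crit * se
    print(f"[{est - half:.5g}, {est + half:.5g}]")  # [-77.58, 627.28] and [1.15246, 1.35820]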
The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables. The population correlation, denoted by ρ, can take on any value from −1 to 1.
ρ = −1 indicates a perfect negative linear relationship
−1 < ρ < 0 indicates a negative linear relationship
ρ = 0 indicates no linear relationship
0 < ρ < 1 indicates a positive linear relationship
ρ = 1 indicates a perfect positive linear relationship
The absolute value of ρ indicates the strength or exactness of the relationship.
7-5 Correlation
[Figure: six scatterplot panels illustrating ρ = 0, ρ = −.8, ρ = .8, ρ = 0, ρ = −1, and ρ = 1.]
Illustrations of Correlation
The sample correlation coefficient*:
r = SS_XY / √(SS_X·SS_Y)
The population correlation coefficient:
ρ = Cov(X, Y) / (σ_X σ_Y)
The covariance of two random variables X and Y:
Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]
where μ_X and μ_Y are the population means of X and Y respectively.

Example 7-1:
r = SS_XY / √(SS_X·SS_Y) = 51402852.4 / √((40947557.84)(66855898)) = 51402852.4 / 52321943.29 = 0.9824

*Note: If ρ < 0, b₁ < 0; if ρ = 0, b₁ = 0; if ρ > 0, b₁ > 0.
Covariance and Correlation
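A sketch computing r directly from the sums of squares above:

import math

ss_xy = 51402852.4
ss_x = 40947557.84
ss_y = 66855898.0

r = ss_xy / math.sqrt(ss_x * ss_y)   # sample correlation coefficient
print(round(r, 4))                   # 0.9824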
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.992265946
R Square            0.984591707
Adjusted R Square   0.98266567
Standard Error      0.279761372
Observations        10

ANOVA
            df          SS            MS            F        Significance F
Regression   1   40.0098686    40.0098686   511.2009204   1.55085E-08
Residual     8   0.626131402   0.078266425
Total        9   40.636

           Coefficients  Standard Error       t Stat      P-value      Lower 95%     Upper 95%
Intercept  -8.762524695     0.594092798  -14.74942084  4.39075E-07  -10.13250603  -7.39254336
US          1.423636087     0.062965575   22.60975277  1.55085E-08   1.278437117   1.568835058

RESIDUAL OUTPUT

Observation   Predicted Y     Residuals
 1            2.057109569    0.242890431
 2            2.484200395    0.115799605
 3            3.05365483    -0.15365483
 4            3.480745656   -0.280745656
 5            3.765472874   -0.065472874
 6            4.050200091    0.049799909
 7            4.619654526    0.180345474
 8            5.758563396   -0.058563396
 9            7.466926701   -0.466926701
10            8.463471962    0.436528038

[Figure: X Variable 1 Line Fit Plot showing Y and Predicted Y against X Variable 1 (0 to 15).]
Example 7-2: Using Computer-Excel
[Figure: Regression Plot of International (2 to 9) against United States (8 to 12).]
Y = -8.76252 + 1.42364X   R-Sq = 0.9846
Example 7-2: Regression Plot
H₀: ρ = 0 (No linear relationship)
H₁: ρ ≠ 0 (Some linear relationship)
Test statistic:
t(n−2) = r / √( (1 − r²) / (n − 2) )

Example 7-1:
t = 0.9824 / √( (1 − 0.9651) / (25 − 2) ) = 0.9824 / 0.0389 = 25.25
t_(0.005, 23) = 2.807 < 25.25, so H₀ is rejected at the 1% level.
Hypothesis Tests for the Correlation Coefficient
[Figure: three panels in which β₁ = 0: constant Y, purely unsystematic variation, and a nonlinear relationship.]
A hypothesis test for the existence of a linear relationship between X and Y:
H₀: β₁ = 0
H₁: β₁ ≠ 0
Test statistic for the existence of a linear relationship between X and Y:
t(n−2) = (b₁ − 0) / s(b₁)
where b₁ is the least-squares estimate of the regression slope and s(b₁) is the standard error of b₁. When the null hypothesis is true, the statistic has a t distribution with n − 2 degrees of freedom.
Hypothesis Tests about the Regression Relationship
Example 7-1:
H₀: β₁ = 0
H₁: β₁ ≠ 0
t = (b₁ − 0)/s(b₁) = 1.25533/0.04972 = 25.25
t_(0.005, 23) = 2.807 < 25.25
H₀ is rejected at the 1% level and we may conclude that there is a relationship between charges and miles traveled.

Example 10-3:
H₀: β₁ = 1
H₁: β₁ ≠ 1
t = (b₁ − 1)/s(b₁) = (1.24 − 1)/0.21 = 1.14
t_(0.05, 58) = 1.671 > 1.14
H₀ is not rejected at the 10% level. We may not conclude that the beta coefficient is different from 1.
Hypothesis Tests for the Regression Slope
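Both slope tests follow the same recipe: compute t = (b₁ − hypothesized value)/s(b₁) and compare against a t critical point. A sketch with scipy, using the numbers from the two examples above:

from scipy import stats

def slope_test(b1, se_b1, beta_null, df):
    """t statistic and two-sided p-value for H0: beta1 = beta_null."""
    t = (b1 - beta_null) / se_b1
    p = 2 * stats.t.sf(abs(t), df)
    return t, p

print(slope_test(1.25533, 0.04972, 0.0, df=23))  # Example 7-1: t = 25.25, p ~ 0
print(slope_test(1.24, 0.21, 1.0, df=58))        # Example 10-3: t = 1.14, p ~ 0.26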
The coefficient of determination, r2, is a descriptive measure of the strength of the regression relationship, a measure of how well the regression line fits the data.
[Figure: an observed point Y, its fitted value Ŷ, and the mean Ȳ, showing the total deviation split into explained and unexplained parts.]
Total Deviation = Unexplained Deviation (Error) + Explained Deviation (Regression)
(y − ȳ) = (y − ŷ) + (ŷ − ȳ)
SST = SSE + SSR
Σ(y − ȳ)² = Σ(y − ŷ)² + Σ(ŷ − ȳ)²
r² = SSR/SST = 1 − SSE/SST
Percentage of total variation explained by the regression.
7-7 How Good is the Regression?
[Figure: three panels illustrating r² = 0 (SSE = SST), r² = 0.90 (small SSE, large SSR), and r² = 0.50 (SSE and SSR comparable).]

Example 7-1:
r² = SSR/SST = 64527736.8/66855898 = 0.96518

[Figure: fitted-line plot of Dollars (2000 to 7000) against Miles (1000 to 5500).]
The Coefficient of Determination
7-8 Analysis of Variance and an F Test of the Regression Model
Example 7-1

Source of    Sum of       Degrees of   Mean Square   F Ratio   p Value
Variation    Squares      Freedom
Regression   64527736.8    1           64527736.8    637.47    0.000
Error         2328161.2   23             101224.4
Total        66855898.0   24

Source of    Sum of    Degrees of   Mean Square   F Ratio
Variation    Squares   Freedom
Regression   SSR       (1)          MSR           MSR/MSE
Error        SSE       (n−2)        MSE
Total        SST       (n−1)        MST
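The ANOVA entries follow directly from the sums of squares; a sketch reproducing the Example 7-1 table and its p-value:

from scipy import stats

n = 25
sst = 66855898.0
sse = 2328161.2
ssr = sst - sse                # 64527736.8

msr = ssr / 1                  # regression df = 1 in simple regression
mse = sse / (n - 2)            # error df = n - 2
f = msr / mse                  # 637.47
p = stats.f.sf(f, 1, n - 2)    # upper-tail F probability, ~0.000
print(round(f, 2), p)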
[Residual plots against x or ŷ:]
• Homoscedasticity: Residuals appear completely random. No indication of model inadequacy.
• Curved pattern in residuals, resulting from an underlying nonlinear relationship.
• Residuals exhibit a linear trend with time.
• Heteroscedasticity: Variance of residuals changes when x changes.
7-9 Residual Analysis and Checking for Model Inadequacies
• Point Prediction: A single-valued estimate of Y for a given value of X, obtained by inserting the value of X in the estimated regression equation.
• Prediction Interval:
For a value of Y given a value of X, accounting for variation in the regression line estimate and variation of points around the regression line.
For an average value of Y given a value of X, accounting for variation in the regression line estimate only.
7-10 Use of the Regression Model for Prediction
1) Uncertainty about the slope of the regression line
[Figure: regression line with upper and lower limits on the slope.]
2) Uncertainty about the intercept of the regression line
[Figure: regression line with upper and lower limits on the intercept.]
Errors in Predicting E[Y|X]
[Figure: prediction band for E[Y|X] around the regression line.]
• The prediction band for E[Y|X] is narrowest at the mean value of X.
• The prediction band widens as the distance from the mean of X increases.
• Predictions become very unreliable when we extrapolate beyond the range of the sample itself.
Prediction Interval for E[Y|X]
3) Variation around the regression line
[Figure: the prediction band for an individual value of Y is wider than the prediction band for E[Y|X].]
Additional Error in Predicting Individual Value of Y
A (1 − α)100% prediction interval for Y:
ŷ ± t_(α/2, n−2)·s·√( 1 + 1/n + (x − x̄)²/SS_X )

Example 7-1 (X = 4000):
ŷ = 274.85 + (1.2553)(4000) = 5296.05
5296.05 ± (2.069)(318.16)·√( 1 + 1/25 + (4000 − 3177.92)²/40947557.84 )
= 5296.05 ± 676.62 = [4619.43, 5972.67]
Prediction Interval for a Value of Y
A (1 − α)100% prediction interval for E[Y|X]:
ŷ ± t_(α/2, n−2)·s·√( 1/n + (x − x̄)²/SS_X )

Example 7-1 (X = 4000):
ŷ = 274.85 + (1.2553)(4000) = 5296.05
5296.05 ± (2.069)(318.16)·√( 1/25 + (4000 − 3177.92)²/40947557.84 )
= 5296.05 ± 156.48 = [5139.57, 5452.53]
Prediction Interval for the Average Value of Y
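Both intervals share one leverage term; a sketch reproducing the X = 4000 intervals, with values taken from Example 7-1:

import math
from scipy import stats

n, x0, x_bar = 25, 4000.0, 3177.92
ss_x, s = 40947557.84, 318.158
y_hat = 274.85 + 1.2553 * x0                    # 5296.05

t_crit = stats.t.ppf(0.975, df=n - 2)           # 2.069
leverage = 1 / n + (x0 - x_bar) ** 2 / ss_x

pi = t_crit * s * math.sqrt(1 + leverage)       # individual Y: ~676.62
ci = t_crit * s * math.sqrt(leverage)           # mean E[Y|X]: ~156.48
print((y_hat - pi, y_hat + pi), (y_hat - ci, y_hat + ci))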
MTB > regress 'Dollars' 1 'Miles' tres in C3 fits in C4;
SUBC> predict 4000;
SUBC> residuals in C5.

Regression Analysis

The regression equation is
Dollars = 275 + 1.26 Miles

Predictor     Coef     Stdev   t-ratio      p
Constant     274.8     170.3      1.61  0.120
Miles      1.25533   0.04972     25.25  0.000

s = 318.2   R-sq = 96.5%   R-sq(adj) = 96.4%

Analysis of Variance

SOURCE      DF        SS        MS       F      p
Regression   1  64527736  64527736  637.47  0.000
Error       23   2328161    101224
Total       24  66855896

   Fit  Stdev.Fit       95.0% C.I.          95.0% P.I.
5296.2       75.6   ( 5139.7, 5452.7)   ( 4619.5, 5972.8)
Using the Computer
MTB > PLOT 'Resids' * 'Fits'
MTB > PLOT 'Resids' * 'Miles'
[Figures: residuals (-500 to 500) plotted against Miles (1000 to 5500) and against Fits (2000 to 7000).]
Plotting on the Computer (1)
Plotting on the Computer (2)
MTB > HISTOGRAM 'StRes'
[Figure: histogram of standardized residuals 'StRes' from -2 to 2, with frequencies 0 to 8.]
MTB > PLOT 'Dollars' * 'Miles'
[Figure: scatterplot of Dollars (2000 to 7000) against Miles (1000 to 5500).]
• Using Statistics.
• The k-Variable Multiple Regression Model.
• The F Test of a Multiple Regression Model.
• How Good is the Regression.
• Tests of the Significance of Individual Regression Parameters.
• Testing the Validity of the Regression Model.
• Using the Multiple Regression Model for Prediction.
Multiple Regression (1)
• Qualitative Independent Variables.
• Polynomial Regression.
• Nonlinear Models and Transformations.
• Multicollinearity.
• Residual Autocorrelation and the Durbin-Watson Test.
• Partial F Tests and Variable Selection Methods.
• Using the Computer.
• The Matrix Approach to Multiple Regression Analysis.
• Summary and Review of Terms.
Multiple Regression (2)
Any two points (A and B), or an intercept and slope (β₀ and β₁), define a line on a two-dimensional surface.
Any three points (A, B, and C), or an intercept and coefficients of x₁ and x₂ (β₀, β₁, and β₂), define a plane y = β₀ + β₁x₁ + β₂x₂ in a three-dimensional surface.
[Figure: two panels, Lines and Planes, illustrating these cases.]
7-11 Using Statistics
The population regression model of a dependent variable, Y, on a set of k independent variables, X₁, X₂, ..., Xₖ is given by:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
where β₀ is the Y-intercept of the regression surface and each βᵢ, i = 1, 2, ..., k, is the slope of the regression surface (sometimes called the response surface) with respect to Xᵢ.
Model assumptions:
1. ε ~ N(0, σ²), independent of other errors.
2. The variables Xᵢ are uncorrelated with the error term.
7-12 The k-Variable Multiple Regression Model
In a simple regression model, the least-squares estimators minimize the sum of squared errors from the estimated regression line.
In a multiple regression model, the least-squares estimators minimize the sum of squared errors from the estimated regression plane.
[Figure: a fitted line ŷ = b₀ + b₁x in (x, y) space, and a fitted plane ŷ = b₀ + b₁x₁ + b₂x₂ in (x₁, x₂, y) space.]
Simple and Multiple Least-Squares Regression
The estimated regression relationship:
Ŷ = b₀ + b₁X₁ + b₂X₂ + ... + bₖXₖ
where Ŷ is the predicted value of Y, the value lying on the estimated regression surface. The terms b₀, ..., bₖ are the least-squares estimates of the population regression parameters βᵢ.
The actual, observed value of Y is the predicted value plus an error:
y = b₀ + b₁x₁ + b₂x₂ + ... + bₖxₖ + e
The Estimated Regression Relationship
Minimizing the sum of squared errors with respect to the estimated coefficients b₀, b₁, and b₂ yields the following normal equations:
Σy = nb₀ + b₁Σx₁ + b₂Σx₂
Σx₁y = b₀Σx₁ + b₁Σx₁² + b₂Σx₁x₂
Σx₂y = b₀Σx₂ + b₁Σx₁x₂ + b₂Σx₂²
Least-Squares Estimation: The 2-Variable Normal Equations
  Y   X₁   X₂   X₁X₂   X₁²   X₂²   X₁Y   X₂Y
 72   12    5     60   144    25    864   360
 76   11    8     88   121    64    836   608
 78   15    6     90   225    36   1170   468
 70   10    5     50   100    25    700   350
 68   11    3     33   121     9    748   204
 80   16    9    144   256    81   1280   720
 82   14   12    168   196   144   1148   984
 65    8    4     32    64    16    520   260
 62    8    3     24    64     9    496   186
 90   18   10    180   324   100   1620   900
---  ---  ---   ----  ----   ---   ----  ----
743  123   65    869  1615   509   9382  5040

Normal Equations:
743 = 10b₀ + 123b₁ + 65b₂
9382 = 123b₀ + 1615b₁ + 869b₂
5040 = 65b₀ + 869b₁ + 509b₂

b₀ = 47.164942
b₁ = 1.5990404
b₂ = 1.1487479

Estimated regression equation:
Ŷ = 47.164942 + 1.5990404X₁ + 1.1487479X₂
Example 7-3
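The three normal equations form a linear system that can be solved directly; a sketch with numpy:

import numpy as np

# Coefficient matrix and right-hand side of the Example 7-3 normal equations.
A = np.array([[10.0, 123.0, 65.0],
              [123.0, 1615.0, 869.0],
              [65.0, 869.0, 509.0]])
rhs = np.array([743.0, 9382.0, 5040.0])

b = np.linalg.solve(A, rhs)
print(b)   # [47.164942, 1.5990404, 1.1487479]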
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.980326323
R Square            0.961039699
Adjusted R Square   0.949908185
Standard Error      1.910940432
Observations        10

ANOVA
            df          SS            MS            F        Significance F
Regression   2   630.5381466   315.2690733   86.33503537   1.16729E-05
Residual     7   25.56185335   3.651693336
Total        9   656.1

           Coefficients  Standard Error       t Stat      P-value     Lower 95%    Upper 95%
Intercept   47.16494227     2.470414433  19.09191496  2.69229E-07  41.32334457  53.00653997
X1          1.599040336     0.280963057  5.691283238   0.00074201  0.934668753  2.263411919
X2          1.148747938      0.30524885  3.763316185  0.007044246  0.426949621  1.870546256
Excel Output
Example 7-3: Using the Computer
Total Deviation = Regression Deviation + Error Deviation
SST = SSR + SSE
[Figure: a point y above the fitted regression plane, showing:
Y − Ŷ : Error Deviation
Ŷ − Ȳ : Regression Deviation
Y − Ȳ : Total Deviation]
Decomposition of the Total Deviation in a Multiple Regression Model
A statistical test for the existence of a linear relationship between Y and any or all of the independent variables X₁, X₂, ..., Xₖ:
H₀: β₁ = β₂ = ... = βₖ = 0
H₁: Not all the βᵢ (i = 1, 2, ..., k) are 0

Source of    Sum of    Degrees of            Mean Square             F Ratio
Variation    Squares   Freedom
Regression   SSR       k                     MSR = SSR/k             MSR/MSE
Error        SSE       n − (k+1) = n−k−1     MSE = SSE/(n−(k+1))
Total        SST       n − 1                 MST = SST/(n−1)
7-13 The F Test of a Multiple Regression Model
Analysis of Variance

SOURCE      DF      SS      MS      F      p
Regression   2  630.54  315.27  86.34  0.000
Error        7   25.56    3.65
Total        9  656.10

The test statistic, F = 86.34, is greater than the critical point of F(2, 7) for any common level of significance (p-value ≈ 0), so the null hypothesis is rejected, and we may conclude that the dependent variable is related to one or more of the independent variables.

[Figure: F distribution with 2 and 7 degrees of freedom; the α = 0.01 critical point is F₀.₀₁ = 9.55, and the test statistic 86.34 lies far into the rejection region.]
Using the Computer: Analysis of Variance Table (Example 7-3)
The multiple coefficient of determination, R², measures the proportion of the variation in the dependent variable that is explained by the combination of the independent variables in the multiple regression model:
R² = SSR/SST = 1 − SSE/SST
The mean square error is an unbiased estimator of the variance of the population errors, denoted by σ²:
MSE = SSE/(n − (k+1)) = Σ(y − ŷ)²/(n − (k+1))
Standard error of estimate:
s = √MSE
[Figure: errors y − ŷ around the fitted regression plane.]
7-14 How Good is the Regression
The adjusted multiple coefficient of determination is the coefficient of determination with the SSE and SST divided by their respective degrees of freedom:
adjusted R² = 1 − [ SSE/(n − (k+1)) ] / [ SST/(n − 1) ]
SST = SSR + SSE
R² = SSR/SST = 1 − SSE/SST
Example 7-3: s = 1.911   R-sq = 96.1%   R-sq(adj) = 95.0%
Decomposition of the Sum of Squares and the Adjusted Coefficient of Determination
Source of    Sum of    Degrees of            Mean Square             F Ratio
Variation    Squares   Freedom
Regression   SSR       k                     MSR = SSR/k             F = MSR/MSE
Error        SSE       n − (k+1) = n−k−1     MSE = SSE/(n−(k+1))
Total        SST       n − 1                 MST = SST/(n−1)

R² = SSR/SST = 1 − SSE/SST
adjusted R² = 1 − [ SSE/(n − (k+1)) ] / [ SST/(n − 1) ] = 1 − MSE/MST
F = MSR/MSE = [ R²/(1 − R²) ] · [ (n − (k+1))/k ]
Measures of Performance in Multiple Regression and the ANOVA Table
Hypothesis tests about individual regression slope parameters:
(1) H₀: β₁ = 0   H₁: β₁ ≠ 0
(2) H₀: β₂ = 0   H₁: β₂ ≠ 0
. . .
(k) H₀: βₖ = 0   H₁: βₖ ≠ 0
Test statistic for test i:
t(n − (k+1)) = (bᵢ − 0) / s(bᵢ)
7-15 Tests of the Significance of Individual Regression Parameters
Variable   Coefficient Estimate   Standard Error   t-Statistic
Constant   53.12                  5.43               9.783 *
X1          2.03                  0.22               9.227 *
X2          5.60                  1.30               4.308 *
X3         10.35                  6.88               1.504
X4          3.45                  2.70               1.259
X5         -4.25                  0.38             -11.184 *
n = 150    t₀.₀₂₅ ≈ 1.96
Regression Results for Individual Parameters
MTB > regress 'Y' on 2 predictors 'X1' 'X2'

Regression Analysis

The regression equation is
Y = 47.2 + 1.60 X1 + 1.15 X2

Predictor    Coef    Stdev   t-ratio      p
Constant   47.165    2.470     19.09  0.000
X1         1.5990   0.2810      5.69  0.000
X2         1.1487   0.3052      3.76  0.007

s = 1.911   R-sq = 96.1%   R-sq(adj) = 95.0%

Analysis of Variance

SOURCE      DF      SS      MS      F      p
Regression   2  630.54  315.27  86.34  0.000
Error        7   25.56    3.65
Total        9  656.10

SOURCE  DF  SEQ SS
X1       1  578.82
X2       1   51.72
Example 7-3: Using the Computer
MTB > READ 'a:\data\c11_t6.dat' C1-C5
MTB > NAME c1 'EXPORTS' c2 'M1' c3 'LEND' c4 'PRICE' c5 'EXCHANGE'
MTB > REGRESS 'EXPORTS' on 4 predictors 'M1' 'LEND' 'PRICE' 'EXCHANGE'

Regression Analysis

The regression equation is
EXPORTS = - 4.02 + 0.368 M1 + 0.0047 LEND + 0.0365 PRICE + 0.27 EXCHANGE

Predictor       Coef      Stdev   t-ratio      p
Constant      -4.015      2.766     -1.45  0.152
M1           0.36846    0.06385      5.77  0.000
LEND         0.00470    0.04922      0.10  0.924
PRICE       0.036511   0.009326      3.91  0.000
EXCHANGE       0.268      1.175      0.23  0.820

s = 0.3358   R-sq = 82.5%   R-sq(adj) = 81.4%

Analysis of Variance

SOURCE      DF       SS      MS      F      p
Regression   4  32.9463  8.2366  73.06  0.000
Error       62   6.9898  0.1127
Total       66  39.9361
Using the Computer: Example 7-4
MTB > REGRESS 'EXPORTS' on 3 predictors 'LEND' 'PRICE' 'EXCHANGE'

Regression Analysis

The regression equation is
EXPORTS = - 0.29 - 0.211 LEND + 0.0781 PRICE - 2.10 EXCHANGE

Predictor       Coef      Stdev   t-ratio      p
Constant      -0.289      3.308     -0.09  0.931
LEND        -0.21140    0.03929     -5.38  0.000
PRICE       0.078148   0.007268     10.75  0.000
EXCHANGE      -2.095      1.355     -1.55  0.127

s = 0.4130   R-sq = 73.1%   R-sq(adj) = 71.8%

Analysis of Variance

SOURCE      DF       SS      MS      F      p
Regression   3  29.1919  9.7306  57.06  0.000
Error       63  10.7442  0.1705
Total       66  39.9361
Example 7-5: Three Predictors
MTB > REGRESS 'EXPORTS' on 2 predictors 'M1' 'PRICE'

Regression Analysis

The regression equation is
EXPORTS = - 3.42 + 0.361 M1 + 0.0370 PRICE

Predictor       Coef      Stdev   t-ratio      p
Constant     -3.4230     0.5409     -6.33  0.000
M1           0.36142    0.03925      9.21  0.000
PRICE       0.037033   0.004094      9.05  0.000

s = 0.3306   R-sq = 82.5%   R-sq(adj) = 81.9%

Analysis of Variance

SOURCE      DF      SS      MS       F      p
Regression   2  32.940  16.470  150.67  0.000
Error       64   6.996   0.109
Total       66  39.936
Example 7-5: Two Predictors
[Figure: Residuals Plotted Against M1 (Apparently Random); residuals between -1 and 1 across M1 values 5 to 9.]
[Figure: Residuals Plotted Against Price (Apparent Heteroscedasticity); residual spread changes across PRICE values 110 to 160.]
7-16 Investigating the Validity of the Regression Model: Residual Plots
Investigating the Validity of the Regression: Residual Plots (2)
Residuals Plotted Against Time (Apparently Random)
Residuals Plotted Against Fitted Values (Apparent Heteroscedasticity)
[Figures: residuals between -1 and 1 plotted against TIME (0 to 70) and against the fitted values Y-HAT (3 to 5).]
MTB > Histogram 'SRES1'.
Histogram of SRES1   N = 67

Midpoint   Count
  -3.0        1  *
  -2.5        1  *
  -2.0        3  ***
  -1.5        1  *
  -1.0        5  *****
  -0.5       13  *************
   0.0       19  *******************
   0.5       12  ************
   1.0        6  ******
   1.5        3  ***
   2.0        2  **
   2.5        0
   3.0        1  *

Standardized residuals are approximately distributed as N(0, 1).
Histogram of Standardized Residuals: Example 7-6
[Figure: a scatter of points with one outlier (*) far from the pattern; the regression line with the outlier is pulled away from the regression line without the outlier.]
Outliers
[Figure: a cluster of points with no relationship among themselves, plus one point (*) with a large value of x; the regression line when all data are included is determined largely by this influential observation.]
Investigating the Validity of the Regression: Outliers and Influential Observations
Unusual Observations
Obs.    M1  EXPORTS     Fit  Stdev.Fit  Residual  St.Resid
  1   5.10   2.6000  2.6420     0.1288   -0.0420    -0.14 X
  2   4.90   2.6000  2.6438     0.1234   -0.0438    -0.14 X
 25   6.20   5.5000  4.5949     0.0676    0.9051     2.80R
 26   6.30   3.7000  4.6311     0.0651   -0.9311    -2.87R
 50   8.30   4.3000  5.1317     0.0648   -0.8317    -2.57R
 67   8.20   5.6000  4.9474     0.0668    0.6526     2.02R

R denotes an obs. with a large st. resid.
X denotes an obs. whose X value gives it large influence.
Outliers and Influential Observations: Example 7-6
[Figure: estimated regression plane for Example 7-3, with Sales plotted over Advertising (8.00 to 18.00) and Promotions.]
7-17 Using the Multiple Regression Model for Prediction
MTB > regress 'EXPORTS' 2 'M1' 'PRICE';
SUBC> predict 6 160;
SUBC> predict 5 150;
SUBC> predict 4 130.

   Fit  Stdev.Fit       95.0% C.I.          95.0% P.I.
4.6708     0.0853   ( 4.5003, 4.8412)   ( 3.9885, 5.3530)
3.9390     0.0901   ( 3.7590, 4.1190)   ( 3.2543, 4.6237)
2.8370     0.1116   ( 2.6140, 3.0599)   ( 2.1397, 3.5342)
A (1 − α)100% prediction interval for a value of Y given values of Xᵢ:
ŷ ± t_(α/2, n−(k+1)) · √( s²(ŷ) + MSE )
A (1 − α)100% prediction interval for the conditional mean of Y given values of Xᵢ:
ŷ ± t_(α/2, n−(k+1)) · s[Ê(Y)]
Prediction in Multiple Regression
MOVIE  EARN  COST  PROM  BOOK
    1    28   4.2   1.0     0
    2    35   6.0   3.0     1
    3    50   5.5   6.0     1
    4    20   3.3   1.0     0
    5    75  12.5  11.0     1
    6    60   9.6   8.0     1
    7    15   2.5   0.5     0
    8    45  10.8   5.0     0
    9    50   8.4   3.0     1
   10    34   6.6   2.0     0
   11    48  10.7   1.0     1
   12    82  11.0  15.0     1
   13    24   3.5   4.0     0
   14    50   6.9  10.0     0
   15    58   7.8   9.0     1
   16    63  10.1  10.0     0
   17    30   5.0   1.0     1
   18    37   7.5   5.0     0
   19    45   6.4   8.0     1
   20    72  10.0  12.0     1
MTB > regress 'EARN' 3 'COST' 'PROM' 'BOOK'

Regression Analysis

The regression equation is
EARN = 7.84 + 2.85 COST + 2.28 PROM + 7.17 BOOK

Predictor    Coef   Stdev  t-ratio      p
Constant    7.836   2.333     3.36  0.004
COST       2.8477  0.3923     7.26  0.000
PROM       2.2782  0.2534     8.99  0.000
BOOK        7.166   1.818     3.94  0.001

s = 3.690   R-sq = 96.7%   R-sq(adj) = 96.0%

Analysis of Variance

SOURCE      DF      SS      MS       F      p
Regression   3  6325.2  2108.4  154.89  0.000
Error       16   217.8    13.6
Total       19  6543.0
An indicator (dummy, binary) variable of qualitative level A:
X_h = 1 if level A is obtained
X_h = 0 if level A is not obtained
7-18 Qualitative (or Categorical) Independent Variables (in Regression)
A regression with one quantitative variable (X₁) and one qualitative variable (X₂):
ŷ = b₀ + b₁x₁ + b₂x₂
[Figure: two parallel lines, with intercept b₀ for X₂ = 0 and intercept b₀ + b₂ for X₂ = 1.]
A multiple regression with two quantitative variables (X₁ and X₂) and one qualitative variable (X₃):
ŷ = b₀ + b₁x₁ + b₂x₂ + b₃x₃
[Figure: two parallel regression planes separated by a vertical distance b₃.]
Picturing Qualitative Variables in Regression
A regression with one quantitative variable (X₁) and two qualitative variables (X₂ and X₃):
ŷ = b₀ + b₁x₁ + b₂x₂ + b₃x₃
A qualitative variable with r levels or categories is represented with (r − 1) 0/1 (dummy) variables.
Category    X₂   X₃
Adventure    0    0
Drama        0    1
Romance      1    0
[Figure: three parallel lines with intercepts b₀ (X₂ = 0, X₃ = 0), b₀ + b₂ (X₂ = 1, X₃ = 0), and b₀ + b₃ (X₂ = 0, X₃ = 1).]
Picturing Qualitative Variables in Regression: Three Categories and Two Dummy Variables
Salary = 8547 + 949 Education + 1258 Experience - 3256 Gender
 (SE)    (32.6)  (45.1)         (78.5)            (212.4)
 (t)     (262.2) (21.0)         (16.0)            (-15.3)

Gender = 1 if Female, 0 if Male
On average, female salaries are $3256 below male salaries.
Using Qualitative Variables in Regression: Example 7-6
A regression with interaction between a quantitative variable (X₁) and a qualitative variable (X₂):
ŷ = b₀ + b₁x₁ + b₂x₂ + b₃x₁x₂
[Figure: for X₂ = 0, a line with intercept b₀ and slope b₁; for X₂ = 1, a line with intercept b₀ + b₂ and slope b₁ + b₃.]
Interactions between Quantitative and Qualitative Variables: Shifting Slopes
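A sketch of fitting such an interaction model by building the design matrix explicitly; the data here are simulated for illustration, not from the slides:

import numpy as np

rng = np.random.default_rng(1)
n = 40
x1 = rng.uniform(0, 10, n)                     # quantitative variable
x2 = rng.integers(0, 2, n).astype(float)       # 0/1 qualitative variable
y = 2 + 1.5 * x1 + 3 * x2 + 0.8 * x1 * x2 + rng.normal(0, 1, n)

# Columns: intercept, x1, x2, and the x1*x2 interaction term.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)   # b0, b1, b2 (intercept shift), b3 (slope shift)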
One-variable polynomial regression model:
Y = β₀ + β₁X + β₂X² + β₃X³ + ... + βₘX^m + ε
where m is the degree of the polynomial, the highest power of X appearing in the equation. The degree of the polynomial is the order of the model.
[Figure: fitted curves of increasing order: a straight line ŷ = b₀ + b₁X; a quadratic ŷ = b₀ + b₁X + b₂X²; a cubic ŷ = b₀ + b₁X + b₂X² + b₃X³.]
7-19 Polynomial Regression
MTB > regress 'sales' 2 'advert' 'advsqr'

Regression Analysis

The regression equation is
SALES = 3.52 + 2.51 ADVERT - 0.0875 ADVSQR

Predictor       Coef     Stdev   t-ratio      p
Constant      3.5150    0.7385      4.76  0.000
ADVERT        2.5148    0.2580      9.75  0.000
ADVSQR      -0.08745   0.01658     -5.28  0.000

s = 1.228   R-sq = 95.9%   R-sq(adj) = 95.4%

Analysis of Variance

SOURCE      DF      SS      MS       F      p
Regression   2  630.26  315.13  208.99  0.000
Error       18   27.14    1.51
Total       20  657.40

[Figure: SALES (5 to 25) against ADVERT (0 to 15) with the fitted quadratic curve.]
Polynomial Regression: Example 7-7
Variable   Estimate   Standard Error   T-statistic
X1             2.34             0.92          2.54
X2             3.11             1.05          2.96
X1²            4.22             1.00          4.22
X2²            3.57             2.12          1.68
X1X2           2.77             2.30          1.20
Polynomial Regression: Other Variables and Cross-Product Terms
The multiplicative model:
Y = β₀ X₁^β₁ X₂^β₂ X₃^β₃ ε
The logarithmic transformation:
log Y = log β₀ + β₁ log X₁ + β₂ log X₂ + β₃ log X₃ + log ε

MTB > loge c1 c3
MTB > loge c2 c4
MTB > name c3 'LOGSALE' c4 'LOGADV'
MTB > regress 'logsale' 1 'logadv'

Regression Analysis

The regression equation is
LOGSALE = 1.70 + 0.553 LOGADV

Predictor     Coef    Stdev  t-ratio      p
Constant   1.70082  0.05123    33.20  0.000
LOGADV     0.55314  0.03011    18.37  0.000

s = 0.1125   R-sq = 94.7%   R-sq(adj) = 94.4%

Analysis of Variance

SOURCE      DF      SS      MS       F      p
Regression   1  4.2722  4.2722  337.56  0.000
Error       19  0.2405  0.0127
Total       20  4.5126
7-20 Nonlinear Models and Transformations: Multiplicative Model
MTB > regress 'sales' 1 'logadv'

Regression Analysis

The regression equation is
SALES = 3.67 + 6.78 LOGADV

Predictor    Coef   Stdev  t-ratio      p
Constant   3.6683  0.4016     9.13  0.000
LOGADV     6.7840  0.2360    28.74  0.000

s = 0.8819   R-sq = 97.8%   R-sq(adj) = 97.6%

Analysis of Variance

SOURCE      DF      SS      MS       F      p
Regression   1  642.62  642.62  826.24  0.000
Error       19   14.78    0.78
Total       20  657.40

The exponential model:
Y = β₀ e^(β₁X) ε
The logarithmic transformation:
log Y = log β₀ + β₁X + log ε
Transformations: Exponential Model
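Both transformations reduce to ordinary least squares on transformed data. A sketch using simulated data (not the slides' sales series), fitting the multiplicative model by regressing log(y) on log(x):

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 15, 30)
y = 5.5 * x**0.55 * np.exp(rng.normal(0, 0.1, 30))   # multiplicative errors

# Multiplicative model: regress log(y) on log(x).
X = np.column_stack([np.ones_like(x), np.log(x)])
b, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print(np.exp(b[0]), b[1])   # estimates of beta0 and beta1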
[Figure: Simple Regression of Sales on Advertising. Y = 6.59271 + 1.19176X, R-Squared = 0.895.]
[Figure: Regression of Log(Sales) on Log(Advertising). Y = 1.70082 + 0.553136X, R-Squared = 0.947.]
[Figure: Regression of Sales on Log(Advertising). Y = 3.66825 + 6.784X, R-Squared = 0.978.]
[Figure: Residual Plots: Sales vs Log(Advertising); residuals plotted against the fitted values Y-HAT.]
Plots of Transformed Variables
• Square root transformation, Ỹ = √Y: Useful when the variance of the regression errors is approximately proportional to the conditional mean of Y.
• Logarithmic transformation, Ỹ = log(Y): Useful when the variance of regression errors is approximately proportional to the square of the conditional mean of Y.
• Reciprocal transformation, Ỹ = 1/Y: Useful when the variance of the regression errors is approximately proportional to the fourth power of the conditional mean of Y.
Variance Stabilizing Transformations
The logistic function:
E[Y|X] = p = e^(β₀+β₁X) / (1 + e^(β₀+β₁X))
Transformation to linearize the logistic function:
log[ p / (1 − p) ] = β₀ + β₁X
[Figure: the S-shaped logistic curve rising from 0 to 1 as x increases.]
Regression with Dependent Indicator Variables
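A sketch of the linearizing transformation, assuming the observed responses are proportions p strictly between 0 and 1 (the data below are illustrative, not from the slides):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
p = np.array([0.08, 0.18, 0.35, 0.60, 0.80, 0.92])  # observed proportions

logit = np.log(p / (1 - p))                  # log[p/(1-p)]
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, logit, rcond=None)
print(b)   # estimates of beta0 and beta1 on the logit scale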
[Figures: pairs of x₁, x₂ axes illustrating the cases below.]
• Orthogonal X variables provide information from independent sources. No multicollinearity.
• Perfectly collinear X variables provide identical information content. No regression.
• Some degree of collinearity. Problems with regression depend on the degree of collinearity.
• A high degree of negative collinearity also causes problems with regression.
7-21 Multicollinearity
• Variances of regression coefficients are inflated.
• Magnitudes of regression coefficients may be different from those expected.
• Signs of regression coefficients may not be as expected.
• Adding or removing variables produces large changes in coefficients.
• Removing a data point may cause large changes in coefficient estimates or signs.
• In some cases, the F ratio may be significant while the t ratios are not.
Effects of Multicollinearity
MTB > CORRELATION 'm1' 'lend' 'price' 'exchange'

Correlations (Pearson)

              M1    LEND   PRICE
LEND      -0.112
PRICE      0.447   0.745
EXCHANGE  -0.410  -0.279  -0.420

MTB > regress 'exports' on 4 predictors 'm1' 'lend' 'price' 'exchange';
SUBC> vif.

Regression Analysis

The regression equation is
EXPORTS = - 4.02 + 0.368 M1 + 0.0047 LEND + 0.0365 PRICE + 0.27 EXCHANGE

Predictor       Coef      Stdev   t-ratio      p  VIF
Constant      -4.015      2.766     -1.45  0.152
M1           0.36846    0.06385      5.77  0.000  3.2
LEND         0.00470    0.04922      0.10  0.924  5.4
PRICE       0.036511   0.009326      3.91  0.000  6.3
EXCHANGE       0.268      1.175      0.23  0.820  1.4

s = 0.3358   R-sq = 82.5%   R-sq(adj) = 81.4%
Detecting the Existence of Multicollinearity: Correlation Matrix of Independent Variables and Variance Inflation Factors
[Figure: VIF (0 to 100) plotted against R_h² (0 to 1); VIF rises slowly at first, then climbs steeply as R_h² approaches 1.]
Relationship between VIF and R_h²
The variance inflation factor associated with X_h:
VIF(X_h) = 1 / (1 − R_h²)
where R_h² is the R² value obtained for the regression of X_h on the other independent variables.
Variance Inflation Factor
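VIFs can be computed by regressing each predictor on the others; a self-contained sketch (the nearly collinear data are simulated for illustration):

import numpy as np

def vif(X):
    """VIF for each column of X (predictor columns only, no intercept column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        # Regress column j on an intercept plus all other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid.var() / X[:, j].var()   # R_h^2
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(3)
x1 = rng.normal(size=50)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=50)    # nearly collinear with x1
print(vif(np.column_stack([x1, x2])))        # both VIFs come out large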
• Drop a collinear variable from the regression.
• Change the sampling plan to include elements outside the multicollinearity range.
• Transformations of variables.
• Ridge regression.
Solutions to the Multicollinearity Problem
An autocorrelation is a correlation of the values of a variable with values of the same variable lagged one or more periods back. Consequences of autocorrelation include inaccurate estimates of variances and inaccurate predictions.
Lagged Residuals

 i     εᵢ    εᵢ₋₁   εᵢ₋₂   εᵢ₋₃   εᵢ₋₄
 1    1.0     *      *      *      *
 2    0.0    1.0     *      *      *
 3   -1.0    0.0    1.0     *      *
 4    2.0   -1.0    0.0    1.0     *
 5    3.0    2.0   -1.0    0.0    1.0
 6   -2.0    3.0    2.0   -1.0    0.0
 7    1.0   -2.0    3.0    2.0   -1.0
 8    1.5    1.0   -2.0    3.0    2.0
 9    1.0    1.5    1.0   -2.0    3.0
10   -2.5    1.0    1.5    1.0   -2.0

The Durbin-Watson test (first-order autocorrelation):
H₀: ρ₁ = 0   H₁: ρ₁ ≠ 0
The Durbin-Watson test statistic:
d = Σᵢ₌₂ⁿ (eᵢ − eᵢ₋₁)² / Σᵢ₌₁ⁿ eᵢ²
7-22 Residual Autocorrelation and the Durbin-Watson Test
Critical Points of the Durbin-Watson Statistic: α = 0.05, n = Sample Size, k = Number of Independent Variables

        k = 1        k = 2        k = 3        k = 4        k = 5
  n   dL    dU     dL    dU     dL    dU     dL    dU     dL    dU
 15  1.08  1.36   0.95  1.54   0.82  1.75   0.69  1.97   0.56  2.21
 16  1.10  1.37   0.98  1.54   0.86  1.73   0.74  1.93   0.62  2.15
 17  1.13  1.38   1.02  1.54   0.90  1.71   0.78  1.90   0.67  2.10
 18  1.16  1.39   1.05  1.53   0.93  1.69   0.82  1.87   0.71  2.06
  .    .     .      .     .      .     .      .     .      .     .
 65  1.57  1.63   1.54  1.66   1.50  1.70   1.47  1.73   1.44  1.77
 70  1.58  1.64   1.55  1.67   1.52  1.70   1.49  1.74   1.46  1.77
 75  1.60  1.65   1.57  1.68   1.54  1.71   1.51  1.74   1.49  1.77
 80  1.61  1.66   1.59  1.69   1.56  1.72   1.53  1.74   1.51  1.77
 85  1.62  1.67   1.60  1.70   1.57  1.72   1.55  1.75   1.52  1.77
 90  1.63  1.68   1.61  1.70   1.59  1.73   1.57  1.75   1.54  1.78
 95  1.64  1.69   1.62  1.71   1.60  1.73   1.58  1.75   1.56  1.78
100  1.65  1.69   1.63  1.72   1.61  1.74   1.59  1.76   1.57  1.78
MTB > regress 'EXPORTS' 4 'M1' 'LEND' 'PRICE' 'EXCHANGE';
SUBC> dw.

Durbin-Watson statistic = 2.58

[Diagram of the DW scale from 0 to 4: positive autocorrelation below dL; test inconclusive between dL and dU; no autocorrelation between dU and 4−dU; test inconclusive between 4−dU and 4−dL; negative autocorrelation above 4−dL.]

For n = 67, k = 4: dL ≈ 1.47, dU ≈ 1.73, 4 − dU ≈ 2.27, 4 − dL ≈ 2.53.
Since 2.58 > 4 − dL = 2.53, H₀ is rejected, and we conclude there is negative first-order autocorrelation.
Using the Durbin-Watson Statistic
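The statistic itself is simple to compute from a residual series; a sketch using the residuals from the lagged-residuals table above:

import numpy as np

def durbin_watson(e):
    """d = sum of squared successive differences over sum of squares."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

e = np.array([1.0, 0.0, -1.0, 2.0, 3.0, -2.0, 1.0, 1.5, 1.0, -2.5])
print(durbin_watson(e))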
Full model:
Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + ε
Reduced model:
Y = β₀ + β₁X₁ + β₂X₂ + ε
Partial F test:
H₀: β₃ = β₄ = 0
H₁: β₃ and β₄ not both 0
Partial F statistic:
F(r, n − (k+1)) = [ (SSE_R − SSE_F) / r ] / MSE_F
where SSE_R is the sum of squared errors of the reduced model, SSE_F is the sum of squared errors of the full model, MSE_F is the mean square error of the full model [MSE_F = SSE_F/(n − (k+1))], and r is the number of variables dropped from the full model.
7-23 Partial F Tests and Variable Selection Methods
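A sketch of the partial F computation from the two models' error sums of squares, using the exports example (full k = 4 model vs. the M1 + PRICE reduced model from Examples 7-4 and 7-5):

from scipy import stats

def partial_f(sse_reduced, sse_full, r, n, k):
    """Partial F statistic for dropping r variables from a k-variable full model."""
    mse_full = sse_full / (n - (k + 1))
    f = (sse_reduced - sse_full) / r / mse_full
    p = stats.f.sf(f, r, n - (k + 1))
    return f, p

# Dropping LEND and EXCHANGE: F is tiny and p is near 1,
# consistent with those two variables being insignificant.
print(partial_f(sse_reduced=6.996, sse_full=6.9898, r=2, n=67, k=4))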
• All possible regressions: Run regressions with all possible combinations of independent variables and select the best model.
• Stepwise procedures:
Forward selection: Add one variable at a time to the model, on the basis of its F statistic.
Backward elimination: Remove one variable at a time, on the basis of its F statistic.
Stepwise regression: Adds variables to the model and subtracts variables from the model, on the basis of the F statistic.
Variable Selection Methods
1. Compute the F statistic for each variable not in the model.
2. Is there at least one variable with p-value < P_in? If not, stop.
3. Enter the most significant (smallest p-value) variable into the model.
4. Calculate the partial F statistic for all variables in the model.
5. Is there a variable with p-value > P_out? If so, remove that variable.
6. Return to step 1.
Stepwise Regression
MTB > STEPWISE 'EXPORTS' PREDICTORS 'M1' 'LEND' 'PRICE' 'EXCHANGE'

Stepwise Regression

F-to-Enter: 4.00   F-to-Remove: 4.00

Response is EXPORTS on 4 predictors, with N = 67

Step           1        2
Constant  0.9348  -3.4230

M1         0.520    0.361
T-Ratio     9.89     9.21

PRICE             0.0370
T-Ratio             9.05

S          0.495    0.331
R-Sq       60.08    82.48
Stepwise Regression: Using the Computer
MTB > REGRESS 'EXPORTS' 4 'M1' 'LEND' 'PRICE' 'EXCHANGE';
SUBC> vif;
SUBC> dw.

Regression Analysis

The regression equation is
EXPORTS = - 4.02 + 0.368 M1 + 0.0047 LEND + 0.0365 PRICE + 0.27 EXCHANGE

Predictor       Coef      Stdev   t-ratio      p  VIF
Constant      -4.015      2.766     -1.45  0.152
M1           0.36846    0.06385      5.77  0.000  3.2
LEND         0.00470    0.04922      0.10  0.924  5.4
PRICE       0.036511   0.009326      3.91  0.000  6.3
EXCHANGE       0.268      1.175      0.23  0.820  1.4

s = 0.3358   R-sq = 82.5%   R-sq(adj) = 81.4%

Analysis of Variance

SOURCE      DF       SS      MS      F      p
Regression   4  32.9463  8.2366  73.06  0.000
Error       62   6.9898  0.1127
Total       66  39.9361

Durbin-Watson statistic = 2.58
Using the Computer: MINITAB
data exports;
  infile 'c:\aczel\data\c11_t6.dat';
  input exports m1 lend price exchange;
proc reg data = exports;
  model exports = m1 lend price exchange / dw vif;
run;

Model: MODEL1
Dependent Variable: EXPORTS

Analysis of Variance

                    Sum of       Mean
Source      DF     Squares     Square    F Value   Prob>F
Model        4    32.94634    8.23658     73.059   0.0001
Error       62     6.98978    0.11274
C Total     66    39.93612

Root MSE   0.33577   R-square   0.8250
Dep Mean   4.52836   Adj R-sq   0.8137
C.V.       7.41473
Using the Computer: SAS
Parameter Estimates

                    Parameter     Standard    T for H0:
Variable    DF       Estimate        Error    Parameter=0   Prob > |T|
INTERCEP     1      -4.015461   2.76640057       -1.452       0.1517
M1           1       0.368456   0.06384841        5.771       0.0001
LEND         1       0.004702   0.04922186        0.096       0.9242
PRICE        1       0.036511   0.00932601        3.915       0.0002
EXCHANGE     1       0.267896   1.17544016        0.228       0.8205

                  Variance
Variable    DF   Inflation
INTERCEP     1  0.00000000
M1           1  3.20719533
LEND         1  5.35391367
PRICE        1  6.28873181
EXCHANGE     1  1.38570639

Durbin-Watson D             2.583
(For Number of Obs.)           67
1st Order Autocorrelation  -0.321
Using the Computer: SAS (continued)
The population regression model, in matrix form:
y = Xβ + ε
where y = (y₁, y₂, ..., yₙ)′ is the n×1 vector of observations on the dependent variable; X is the n×(k+1) matrix whose first column is a column of 1s and whose remaining columns hold the observations x_i1, x_i2, ..., x_ik on the k independent variables; β = (β₀, β₁, ..., βₖ)′ is the (k+1)×1 vector of parameters; and ε = (ε₁, ε₂, ..., εₙ)′ is the n×1 vector of errors.
The estimated regression model:
Y = Xb + e
The Matrix Approach to Regression Analysis (1)
The normal equations:
X′Xb = X′Y
Estimators:
b = (X′X)⁻¹X′Y
Predicted values:
Ŷ = Xb = X(X′X)⁻¹X′Y = HY
V(b) = σ²(X′X)⁻¹
s²(b) = MSE·(X′X)⁻¹
The Matrix Approach to Regression Analysis (2)
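These matrix formulas translate line-for-line into numpy; a sketch using the Example 7-3 data (Y, X₁, X₂ columns as in the table):

import numpy as np

y = np.array([72, 76, 78, 70, 68, 80, 82, 65, 62, 90], dtype=float)
x1 = np.array([12, 11, 15, 10, 11, 16, 14, 8, 8, 18], dtype=float)
x2 = np.array([5, 8, 6, 5, 3, 9, 12, 4, 3, 10], dtype=float)

X = np.column_stack([np.ones(10), x1, x2])   # first column of 1s

b = np.linalg.solve(X.T @ X, X.T @ y)        # b = (X'X)^(-1) X'Y
y_hat = X @ b                                # fitted values Xb
mse = np.sum((y - y_hat) ** 2) / (10 - 3)    # SSE / (n - (k+1))
cov_b = mse * np.linalg.inv(X.T @ X)         # s^2(b) = MSE (X'X)^(-1)
print(b)                                     # [47.1649, 1.5990, 1.1487]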