Multiple Regression (Reduced Set with MiniTab Examples)

1 Slide

Multiple Regression (Reduced Set with MiniTab Examples)

Chapter 15

BA 303

2 Slide

MULTIPLE REGRESSION

3 Slide

Estimated Multiple Regression Equation

A simple random sample is used to compute sample statistics b0, b1, b2, . . . , bp that are used as the point estimators of the parameters β0, β1, β2, . . . , βp.

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$$

4 Slide

Least Squares Method

Least Squares Criterion

$$\min \sum (y_i - \hat{y}_i)^2$$

5 Slide

Multiple Regression Model

Programmer Salary Survey

A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine if salary was related to the years of experience and the score on the firm’s programmer aptitude test.

The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for a sample of 20 programmers are shown on the next slide.

6 Slide

Exper. (Yrs.)   Test Score   Salary ($000s)
 4               78          24.0
 7              100          43.0
 1               86          23.7
 5               82          34.3
 8               86          35.8
10               84          38.0
 0               75          22.2
 1               80          23.1
 6               83          30.0
 6               91          33.0
 9               88          38.0
 2               73          26.6
10               75          36.2
 5               81          31.6
 6               74          29.0
 8               87          34.0
 4               79          30.1
 6               94          33.9
 3               70          28.2
 3               89          30.0


7 Slide

Suppose we believe that salary (y) is related to the years of experience (x1) and the score on the programmer aptitude test (x2) by the following regression model:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

where
y = annual salary ($000)
x1 = years of experience
x2 = score on programmer aptitude test

8 Slide

Solving for the Estimates of β0, β1, β2

Salary = 3.174 + 1.4039 YearsExp + 0.25089 ApScore

Note: Predicted salary will be in thousands of dollars.
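A minimal sketch of how these estimates can be reproduced, assuming Python with NumPy in place of MiniTab. The data are the 20 observations from the survey slide; the variable names are illustrative.

```python
import numpy as np

# Programmer salary survey data (20 programmers), as listed on the data slide.
years_exp = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6,
                      9, 2, 10, 5, 6, 8, 4, 6, 3, 3], dtype=float)
test_score = np.array([78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
                       88, 73, 75, 81, 74, 87, 79, 94, 70, 89], dtype=float)
salary = np.array([24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
                   38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0])

# Design matrix: a column of ones for the intercept, then x1 and x2.
X = np.column_stack([np.ones_like(years_exp), years_exp, test_score])

# Least squares: choose b to minimize sum((y_i - yhat_i)^2).
b, *_ = np.linalg.lstsq(X, salary, rcond=None)
b0, b1, b2 = b

print(f"Salary = {b0:.3f} + {b1:.4f} YearsExp + {b2:.5f} ApScore")
```

Up to rounding, the fitted coefficients should agree with the equation reported above.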

9 Slide

MULTIPLE COEFFICIENT OF DETERMINATION

10 Slide

Multiple Coefficient of Determination

Relationship Among SST, SSR, SSE

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

SST = SSR + SSE

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error

11 Slide

SSR, SSE, and SST

[Figure: SSR, SSE, and SST labeled on the regression output]

12 Slide

Multiple Coefficient of Determination

$$R^2 = \frac{SSR}{SST}$$

13 Slide

Adjusted Multiple Coefficient of Determination

$$R_a^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$

where p is the number of independent variables in the regression equation.

14 Slide

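$R^2$ and $R_a^2$

$$R^2 = \frac{SSR}{SST} = \frac{500.33}{599.79} = .834$$

$$R_a^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - .834)\,\frac{20 - 1}{20 - 2 - 1} = .81447$$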
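The sums of squares and both coefficients of determination can be checked numerically. The following is a sketch assuming Python with NumPy and the programmer salary data; variable names are illustrative. Because the code carries full precision rather than the rounded .834, the adjusted value may differ from the slide in the fourth decimal place.

```python
import numpy as np

# Programmer salary survey data (20 programmers).
years_exp = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6,
                      9, 2, 10, 5, 6, 8, 4, 6, 3, 3], dtype=float)
test_score = np.array([78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
                       88, 73, 75, 81, 74, 87, 79, 94, 70, 89], dtype=float)
salary = np.array([24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
                   38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0])

n = len(salary)  # sample size
p = 2            # number of independent variables

# Fit the estimated regression equation by least squares.
X = np.column_stack([np.ones(n), years_exp, test_score])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)
y_hat = X @ b
y_bar = salary.mean()

sst = np.sum((salary - y_bar) ** 2)  # total sum of squares
ssr = np.sum((y_hat - y_bar) ** 2)   # sum of squares due to regression
sse = np.sum((salary - y_hat) ** 2)  # sum of squares due to error

r2 = ssr / sst
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"R^2 = {r2:.3f}, adjusted R^2 = {r2_adj:.4f}")
```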

15 Slide

TESTING FOR SIGNIFICANCE

16 Slide

Testing for Significance: F Test

The F test is referred to as the test for overall significance.

The F test is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables.

17 Slide

Testing for Significance: F Test

Hypotheses

H0: β1 = β2 = . . . = βp = 0
Ha: One or more of the parameters is not equal to zero.

Test Statistic

F = MSR/MSE

Rejection Rule

Reject H0 if p-value < α or if F > Fα, where Fα is based on an F distribution with p d.f. in the numerator and n - p - 1 d.f. in the denominator.

18 Slide

F Test for Overall Significance

Say α = 0.05. Is the regression significant overall?
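One way to answer this without MiniTab is to compute F = MSR/MSE directly and compare it with the critical value. This is a sketch assuming Python with NumPy; the critical value 3.59 is the upper 5% point of the F distribution with 2 and 17 d.f., read from a standard F table and hard-coded here as an assumption.

```python
import numpy as np

# Programmer salary survey data (20 programmers).
years_exp = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6,
                      9, 2, 10, 5, 6, 8, 4, 6, 3, 3], dtype=float)
test_score = np.array([78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
                       88, 73, 75, 81, 74, 87, 79, 94, 70, 89], dtype=float)
salary = np.array([24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
                   38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0])

n, p = len(salary), 2

X = np.column_stack([np.ones(n), years_exp, test_score])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)
y_hat = X @ b

ssr = np.sum((y_hat - salary.mean()) ** 2)
sse = np.sum((salary - y_hat) ** 2)

msr = ssr / p            # mean square due to regression, p d.f.
mse = sse / (n - p - 1)  # mean square error, n - p - 1 d.f.
f_stat = msr / mse

F_CRIT = 3.59  # F_.05 with (2, 17) d.f., taken from a standard F table
reject_h0 = f_stat > F_CRIT

print(f"F = {f_stat:.2f}, critical value = {F_CRIT}, reject H0: {reject_h0}")
```

With these data F is far above the critical value, so H0 is rejected and the regression is significant overall at α = 0.05.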

19 Slide

Testing for Significance: t Test

The t test is used to determine whether each of the individual independent variables is significant.

A separate t test is conducted for each of the independent variables in the model.

We refer to each of these t tests as a test for individual significance.

20 Slide

Testing for Significance: t Test

Hypotheses

H0: βi = 0
Ha: βi ≠ 0

Test Statistic

$$t = \frac{b_i}{s_{b_i}}$$

Rejection Rule

Reject H0 if p-value < α or if t < -tα/2 or t > tα/2, where tα/2 is based on a t distribution with n - p - 1 degrees of freedom.

21 Slide

t Test for Significance of Individual Parameters

Say α = 0.05. Which parameters are significant?
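The individual t statistics can be computed as a sketch in Python with NumPy, using the standard result that the estimated covariance matrix of the coefficients is MSE · (XᵀX)⁻¹. The critical value 2.110 (t.025 with 17 d.f.) is read from a standard t table and hard-coded as an assumption.

```python
import numpy as np

# Programmer salary survey data (20 programmers).
years_exp = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6,
                      9, 2, 10, 5, 6, 8, 4, 6, 3, 3], dtype=float)
test_score = np.array([78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
                       88, 73, 75, 81, 74, 87, 79, 94, 70, 89], dtype=float)
salary = np.array([24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
                   38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0])

n, p = len(salary), 2

X = np.column_stack([np.ones(n), years_exp, test_score])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)
resid = salary - X @ b

mse = resid @ resid / (n - p - 1)

# Standard errors s_b: square roots of the diagonal of MSE * (X'X)^-1.
se_b = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))

t_stats = b / se_b  # t = b_i / s_{b_i}

T_CRIT = 2.110  # t_.025 with 17 d.f., taken from a standard t table
significant = np.abs(t_stats) > T_CRIT

for name, t, sig in zip(["Constant", "YearsExp", "ApScore"], t_stats, significant):
    print(f"{name:<9} t = {t:6.2f}  significant at 0.05: {sig}")
```

For these data both slope coefficients exceed the critical value in absolute terms, while the constant does not.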

22 Slide

MULTICOLLINEARITY

23 Slide

Multicollinearity

The term multicollinearity refers to the correlation among the independent variables.

When the independent variables are highly correlated, it is not possible to determine the separate effect of any particular independent variable on the dependent variable.

Every attempt should be made to avoid including independent variables that are highly correlated.

24 Slide

Multicollinearity

The Variance Inflation Factor (VIF) measures how much the variance of the coefficient for an independent variable is inflated by one or more of the other independent variables.

This inflation of the variance means that the independent variable is highly correlated with at least one other independent variable.

• VIF around 1 = no multicollinearity (good)
• VIF much greater than 1 = multicollinearity (bad)
• “much greater” is subjective!
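A VIF can also be computed by hand: regress each independent variable on all the others, take that regression's R², and form 1 / (1 - R²). The sketch below assumes Python with NumPy and the programmer salary predictors; the function name `vif` is illustrative, not a library call.

```python
import numpy as np

def vif(predictors):
    """Return the variance inflation factor for each column of `predictors`."""
    n, p = predictors.shape
    factors = []
    for j in range(p):
        target = predictors[:, j]
        others = np.delete(predictors, j, axis=1)
        # Regress predictor j on the remaining predictors (with an intercept).
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2_j = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2_j))
    return np.array(factors)

# Independent variables from the programmer salary survey.
years_exp = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6,
                      9, 2, 10, 5, 6, 8, 4, 6, 3, 3], dtype=float)
test_score = np.array([78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
                       88, 73, 75, 81, 74, 87, 79, 94, 70, 89], dtype=float)

vifs = vif(np.column_stack([years_exp, test_score]))
print(f"VIF YearsExp = {vifs[0]:.3f}, VIF ApScore = {vifs[1]:.3f}")
```

With only two independent variables the two VIFs are equal, and here both are close to 1, which suggests little multicollinearity between experience and test score.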

25 Slide

Multicollinearity

VIF values are not available in Excel.

MiniTab:

26 Slide

ESTIMATION AND PREDICTION

27 Slide

Using the Estimated Regression Equation for Estimation and Prediction

The procedures for estimating the mean value of y and predicting an individual value of y in multiple regression are similar to those in simple regression.

We substitute the given values of x1, x2, . . . , xp into the estimated regression equation and use the corresponding value of ŷ as the point estimate.
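As a sketch of the substitution step, assuming Python with NumPy: the values x1 = 5 years and x2 = 80 are hypothetical, chosen only for illustration. The interval formulas are the standard ones for a confidence interval on the mean and a prediction interval for an individual value, with t.025 = 2.110 (17 d.f.) hard-coded from a t table.

```python
import numpy as np

# Programmer salary survey data (20 programmers).
years_exp = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6,
                      9, 2, 10, 5, 6, 8, 4, 6, 3, 3], dtype=float)
test_score = np.array([78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
                       88, 73, 75, 81, 74, 87, 79, 94, 70, 89], dtype=float)
salary = np.array([24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
                   38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0])

n, p = len(salary), 2
X = np.column_stack([np.ones(n), years_exp, test_score])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)
resid = salary - X @ b
s = np.sqrt(resid @ resid / (n - p - 1))  # standard error of the estimate

# Hypothetical programmer: 5 years of experience, aptitude score of 80.
x0 = np.array([1.0, 5.0, 80.0])
y_hat0 = x0 @ b  # point estimate, in $000s

# h0 = x0' (X'X)^-1 x0 drives the width of both intervals.
h0 = x0 @ np.linalg.inv(X.T @ X) @ x0
T_CRIT = 2.110  # t_.025 with 17 d.f., taken from a standard t table

ci = (y_hat0 - T_CRIT * s * np.sqrt(h0), y_hat0 + T_CRIT * s * np.sqrt(h0))
pi = (y_hat0 - T_CRIT * s * np.sqrt(1 + h0), y_hat0 + T_CRIT * s * np.sqrt(1 + h0))

print(f"Point estimate: {y_hat0:.3f}")
print(f"95% CI for the mean salary:      ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI for an individual salary: ({pi[0]:.2f}, {pi[1]:.2f})")
```

The prediction interval is always wider than the confidence interval because it also accounts for the variability of an individual observation around the mean.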

28 Slide

PI and CI Using MiniTab

29 Slide

CATEGORICAL VARIABLES

30 Slide

Categorical Independent Variables

In many situations we must work with categorical independent variables such as gender (male, female), method of payment (cash, check, credit card), etc.

For example, x2 might represent gender where x2 = 0 indicates male and x2 = 1 indicates female.

In this case, x2 is called a dummy or indicator variable.

31 Slide

Programmer Salary Survey

As an extension of the problem involving the computer programmer salary survey, suppose that management also believes that the annual salary is related to whether the individual has a graduate degree in computer science or information systems.

The years of experience, the score on the programmer aptitude test, whether the individual has a relevant graduate degree, and the annual salary ($000) for each of the sampled 20 programmers are shown on the next slide.

32 Slide

Exper. (Yrs.)   Test Score   Degr.   Salary ($000s)
 4               78          No      24.0
 7              100          Yes     43.0
 1               86          No      23.7
 5               82          Yes     34.3
 8               86          Yes     35.8
10               84          Yes     38.0
 0               75          No      22.2
 1               80          No      23.1
 6               83          No      30.0
 6               91          Yes     33.0
 9               88          Yes     38.0
 2               73          No      26.6
10               75          Yes     36.2
 5               81          No      31.6
 6               74          No      29.0
 8               87          Yes     34.0
 4               79          No      30.1
 6               94          Yes     33.9
 3               70          No      28.2
 3               89          No      30.0

If grad degree, Degr = 1. If no grad degree, Degr = 0.

33 Slide

Exper. (Yrs.)   Test Score   Degr.   Salary ($000s)
 4               78          0       24.0
 7              100          1       43.0
 1               86          0       23.7
 5               82          1       34.3
 8               86          1       35.8
10               84          1       38.0
 0               75          0       22.2
 1               80          0       23.1
 6               83          0       30.0
 6               91          1       33.0
 9               88          1       38.0
 2               73          0       26.6
10               75          1       36.2
 5               81          0       31.6
 6               74          0       29.0
 8               87          1       34.0
 4               79          0       30.1
 6               94          1       33.9
 3               70          0       28.2
 3               89          0       30.0


34 Slide

Estimated Regression Equation

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$$

where:
ŷ = annual salary ($1000)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if individual does not have a graduate degree, 1 if individual does have a graduate degree

x3 is a dummy variable
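A sketch of fitting this equation, assuming Python with NumPy in place of MiniTab. The dummy variable is simply a third 0/1 column in the design matrix; variable names are illustrative.

```python
import numpy as np

# Programmer salary survey data with the graduate-degree dummy (1 = yes, 0 = no).
years_exp = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6,
                      9, 2, 10, 5, 6, 8, 4, 6, 3, 3], dtype=float)
test_score = np.array([78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
                       88, 73, 75, 81, 74, 87, 79, 94, 70, 89], dtype=float)
degree = np.array([0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
                   1, 0, 1, 0, 0, 1, 0, 1, 0, 0], dtype=float)
salary = np.array([24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
                   38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0])

# Design matrix: intercept, x1, x2, and the dummy variable x3.
X = np.column_stack([np.ones_like(salary), years_exp, test_score, degree])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)
b0, b1, b2, b3 = b

print(f"y_hat = {b0:.3f} + {b1:.3f} x1 + {b2:.3f} x2 + {b3:.3f} x3")
```

The coefficient b3 is the estimated difference in mean salary (in $1000s) between programmers with and without a graduate degree, holding experience and test score fixed.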

35 Slide


36 Slide


37 Slide


38 Slide

More Complex Categorical Variables

If a categorical variable has k levels, k - 1 dummy variables are required, with each dummy variable being coded as 0 or 1.

For example, a variable with levels A, B, and C could be represented by x1 and x2 values of (0, 0) for A, (1, 0) for B, and (0,1) for C.

Care must be taken in defining and interpreting the dummy variables.

39 Slide

More Complex Categorical Variables

For example, a variable indicating level of education could be represented by x1 and x2 values as follows:

Highest Degree   x1   x2
Bachelor’s       0    0
Master’s         1    0
Ph.D.            0    1
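The k - 1 coding rule can be sketched in a few lines of plain Python. The helper name `dummy_code` is hypothetical; the first level passed in is treated as the baseline and coded as all zeros.

```python
def dummy_code(values, levels):
    """Code a categorical variable with k levels as k - 1 dummy (0/1) columns.

    The first entry of `levels` is the baseline and is coded as all zeros.
    """
    non_baseline = levels[1:]
    rows = []
    for value in values:
        if value not in levels:
            raise ValueError(f"unknown level: {value!r}")
        rows.append([1 if value == level else 0 for level in non_baseline])
    return rows

levels = ["Bachelor's", "Master's", "Ph.D."]
coded = dummy_code(levels, levels)

for degree, (x1, x2) in zip(levels, coded):
    print(f"{degree:<12} x1 = {x1}  x2 = {x2}")
```

Choosing a different baseline changes the interpretation of each coefficient (it is always a comparison against the baseline level), which is why care is needed when defining the dummy variables.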

40 Slide

AND RESIDUALS

41 Slide

Assumptions About the Error Term

The error ε is a random variable with mean of zero.

The variance of ε, denoted by σ², is the same for all values of the independent variables.

The values of ε are independent.

The error ε is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by β0 + β1x1 + β2x2 + . . . + βpxp.

42 Slide

Standardized Residual Plot Against ŷ

Standardized residuals are frequently used in residual plots for purposes of:

• Identifying outliers (typically, standardized residuals < -2 or > +2)
• Providing insight about the assumption that the error term has a normal distribution
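A sketch of computing standardized residuals, assuming Python with NumPy and the two-variable programmer salary model. It uses the common definition e_i / (s · sqrt(1 - h_i)), where h_i is the leverage of observation i; MiniTab's exact definition should be checked against its documentation.

```python
import numpy as np

# Programmer salary survey data (20 programmers).
years_exp = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6,
                      9, 2, 10, 5, 6, 8, 4, 6, 3, 3], dtype=float)
test_score = np.array([78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
                       88, 73, 75, 81, 74, 87, 79, 94, 70, 89], dtype=float)
salary = np.array([24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
                   38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0])

n, p = len(salary), 2
X = np.column_stack([np.ones(n), years_exp, test_score])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)

y_hat = X @ b
resid = salary - y_hat
s = np.sqrt(resid @ resid / (n - p - 1))  # standard error of the estimate

# Leverage h_i: diagonal of the hat matrix H = X (X'X)^-1 X'.
leverage = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

std_resid = resid / (s * np.sqrt(1 - leverage))

# Flag candidate outliers using the usual +/- 2 rule of thumb.
outliers = np.where(np.abs(std_resid) > 2)[0]

for i in range(n):
    print(f"obs {i + 1:2d}  y_hat = {y_hat[i]:6.2f}  std. residual = {std_resid[i]:6.2f}")
print("Observations flagged as possible outliers:", (outliers + 1).tolist())
```

Plotting `std_resid` against `y_hat` gives the standardized residual plot described above.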

43 Slide

Standardized Residual Plot Against ŷ

44 Slide

Residuals

45 Slide
