Multiple Regression (Reduced Set with MiniTab Examples)
1 Slide
Multiple Regression (Reduced Set with MiniTab Examples)
Chapter 15
BA 303
2 Slide
MULTIPLE REGRESSION
3 Slide
Estimated Multiple Regression Equation
A simple random sample is used to compute sample statistics b0, b1, b2, . . . , bp that are used as the point estimators of the parameters β0, β1, β2, . . . , βp.
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p \]
4 Slide
Least Squares Method
Least Squares Criterion
\[ \min \sum (y_i - \hat{y}_i)^2 \]
5 Slide
Multiple Regression Model
Programmer Salary Survey
A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine if salary was related to the years of experience and the score on the firm’s programmer aptitude test.
The years of experience, score on the aptitude test, and corresponding annual salary ($000s) for the sample of 20 programmers are shown on the next slide.
6 Slide
Multiple Regression Model
Exper. (Yrs.)   Test Score   Salary ($000s)
      4             78           24.0
      7            100           43.0
      1             86           23.7
      5             82           34.3
      8             86           35.8
     10             84           38.0
      0             75           22.2
      1             80           23.1
      6             83           30.0
      6             91           33.0
      9             88           38.0
      2             73           26.6
     10             75           36.2
      5             81           31.6
      6             74           29.0
      8             87           34.0
      4             79           30.1
      6             94           33.9
      3             70           28.2
      3             89           30.0
7 Slide
Multiple Regression Model
Suppose we believe that salary (y) is related to the years of experience (x1) and the score on the programmer aptitude test (x2) by the following regression model:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \]
where
y = annual salary ($000s)
x1 = years of experience
x2 = score on programmer aptitude test
8 Slide
Solving for the Estimates of β0, β1, β2
Salary = 3.174 + 1.4039 YearsExp + 0.25089 ApScore
Note: Predicted salary will be in thousands of dollars.
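The estimates above can be reproduced without MiniTab. The following is a minimal sketch in plain Python, assuming the 20 observations from the survey table; the helper name `fit_least_squares` is hypothetical, and the normal-equations approach is one common way to apply the least squares criterion, not necessarily how MiniTab computes it.

```python
# Programmer salary data as listed in the survey table (20 programmers).
exper = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
         88, 73, 75, 81, 74, 87, 79, 94, 70, 89]
salary = [24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
          38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0]


def fit_least_squares(columns, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    # Each row of X is [1, x1, x2, ...]; the leading 1 gives the intercept b0.
    rows = [[1.0] + [float(c[i]) for c in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


b0, b1, b2 = fit_least_squares([exper, score], salary)
print(f"Salary = {b0:.3f} + {b1:.4f} YearsExp + {b2:.5f} ApScore")
```

The printed coefficients should agree with the MiniTab equation on this slide up to rounding.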
9 Slide
MULTIPLE COEFFICIENT OF DETERMINATION
10 Slide
Multiple Coefficient of Determination
Relationship Among SST, SSR, SSE
SST = SSR + SSE
\[ \sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2 \]
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
11 Slide
SSR, SSE, and SST
[Figure: regression output with SSR, SSE, and SST labeled]
12 Slide
Multiple Coefficient of Determination
\[ R^2 = \frac{SSR}{SST} \]
13 Slide
Adjusted Multiple Coefficient of Determination
\[ R_a^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} \]
where p is the number of independent variables in the regression equation.
14 Slide
R² and R²a
\[ R^2 = \frac{SSR}{SST} = \frac{500.33}{599.79} = 0.834 \]
\[ R_a^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} = 1 - (1 - 0.834) \frac{20 - 1}{20 - 2 - 1} = 0.81447 \]
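The arithmetic on this slide can be checked with a few lines of Python. This is a small sketch using the SSR and SST values shown above; the function names are hypothetical, and R² is rounded to three decimals before the adjustment because that is what the slide does.

```python
def r_squared(ssr, sst):
    """Multiple coefficient of determination: R^2 = SSR / SST."""
    return ssr / sst


def adjusted_r_squared(r2, n, p):
    """Adjust R^2 for n observations and p independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# SSR and SST as reported on the slide; n = 20 programmers, p = 2 predictors.
r2 = round(r_squared(500.33, 599.79), 3)
r2_adj = adjusted_r_squared(r2, n=20, p=2)
print(r2, round(r2_adj, 5))
```

Because the factor (n - 1)/(n - p - 1) is greater than 1 whenever p ≥ 1, the adjusted value is never larger than R².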
15 Slide
TESTING FOR SIGNIFICANCE
16 Slide
Testing for Significance: F Test
The F test is referred to as the test for overall significance.
The F test is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables.
17 Slide
Testing for Significance: F Test
Hypotheses
H0: β1 = β2 = . . . = βp = 0
Ha: One or more of the parameters is not equal to zero.
Test Statistic
F = MSR/MSE
Rejection Rule
Reject H0 if p-value < α or if F > Fα, where Fα is based on an F distribution with p d.f. in the numerator and n - p - 1 d.f. in the denominator.
18 Slide
F Test for Overall Significance
Say α = 0.05, is the regression significant overall?
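One way to work through this question by hand is sketched below. It assumes the SSR and SST values from the R² slide, and it takes MSR = SSR/p and MSE = SSE/(n - p - 1). The critical value 3.59 is an assumption taken from a standard F table (α = 0.05, 2 and 17 d.f.); it does not appear in the slides.

```python
n, p = 20, 2                 # 20 programmers, 2 independent variables
ssr, sst = 500.33, 599.79    # from the R-squared slide
sse = sst - ssr              # SST = SSR + SSE

msr = ssr / p                # mean square due to regression
mse = sse / (n - p - 1)      # mean square due to error
f_stat = msr / mse

# Assumption: F table value for alpha = 0.05 with 2 and 17 d.f.
f_crit = 3.59
reject_h0 = f_stat > f_crit
print(f"F = {f_stat:.2f}, reject H0: {reject_h0}")
```

Since F is far above the critical value, the sketch rejects H0, which indicates a significant overall relationship.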
19 Slide
Testing for Significance: t Test
A separate t test is conducted for each of the independent variables in the model.
The t test is used to determine whether each of the individual independent variables is significant.
We refer to each of these t tests as a test for individual significance.
20 Slide
Testing for Significance: t Test
Hypotheses
H0: βi = 0
Ha: βi ≠ 0
Test Statistic
\[ t = \frac{b_i}{s_{b_i}} \]
Rejection Rule
Reject H0 if p-value < α or if t < -tα/2 or t > tα/2, where tα/2 is based on a t distribution with n - p - 1 degrees of freedom.
21 Slide
t Test for Significance of Individual Parameters
Say α = 0.05, which parameters are significant?
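The t statistics can also be computed directly from the data. The sketch below assumes the survey data and uses the closed-form solution that exists when there are exactly two independent variables (centered sums of squares and Cramer's rule). The variable names are hypothetical. The critical value 2.110 is an assumption taken from a standard t table (α/2 = 0.025, 17 d.f.), not from the slides.

```python
from math import sqrt

exper = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
         88, 73, 75, 81, 74, 87, 79, 94, 70, 89]
salary = [24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
          38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0]
n, p = len(salary), 2


def centered_sum(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


s11, s22 = centered_sum(exper, exper), centered_sum(score, score)
s12 = centered_sum(exper, score)
s1y, s2y = centered_sum(exper, salary), centered_sum(score, salary)
syy = centered_sum(salary, salary)           # this is SST
det = s11 * s22 - s12 ** 2

# Slopes from the centered normal equations (two-predictor case only).
b1 = (s1y * s22 - s2y * s12) / det
b2 = (s2y * s11 - s1y * s12) / det

sse = syy - b1 * s1y - b2 * s2y              # SSE = SST - SSR
mse = sse / (n - p - 1)

# Standard errors of the slopes, then t = b_i / s_bi.
se_b1 = sqrt(mse * s22 / det)
se_b2 = sqrt(mse * s11 / det)
t1, t2 = b1 / se_b1, b2 / se_b2

t_crit = 2.110  # assumption: t table value, alpha/2 = 0.025, 17 d.f.
for name, t in (("YearsExp", t1), ("ApScore", t2)):
    verdict = "significant" if abs(t) > t_crit else "not significant"
    print(f"{name}: t = {t:.2f} ({verdict})")
```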
22 Slide
MULTICOLLINEARITY
23 Slide
Multicollinearity
The term multicollinearity refers to the correlation among the independent variables.
When the independent variables are highly correlated, it is not possible to determine the separate effect of any particular independent variable on the dependent variable.
Every attempt should be made to avoid including independent variables that are highly correlated.
24 Slide
Multicollinearity
The Variance Inflation Factor (VIF) measures how much the variance of the coefficient for an independent variable is inflated by one or more of the other independent variables.
This inflation of the variance means that the independent variable is highly correlated with at least one other independent variable.
• VIF around 1 = no multicollinearity (good)
• VIF much greater than 1 = multicollinearity (bad)
• “much greater” is subjective!
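For readers without MiniTab, the VIF can be sketched by hand. The usual definition is VIF = 1/(1 - R²j), where R²j comes from regressing predictor j on the other predictors. With only two predictors, as in the programmer data, R²j is the squared correlation between them. The function name below is hypothetical, and the data are assumed to be the survey values.

```python
exper = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
         88, 73, 75, 81, 74, 87, 79, 94, 70, 89]


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    r_squared = s12 ** 2 / (s11 * s22)   # squared correlation of x1 and x2
    return 1 / (1 - r_squared)


vif = vif_two_predictors(exper, score)
print(f"VIF = {vif:.2f}")
```

In the two-predictor case both variables share the same VIF, so a single number covers the whole model.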
25 Slide
Multicollinearity
VIF values are not available in Excel. MiniTab:
26 Slide
ESTIMATION AND PREDICTION
27 Slide
Using the Estimated Regression Equation for Estimation and Prediction
The procedures for estimating the mean value of y and predicting an individual value of y in multiple regression are similar to those in simple regression.
We substitute the given values of x1, x2, . . . , xp into the estimated regression equation and use the corresponding value of ŷ as the point estimate.
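As a small illustration of the substitution step, the sketch below plugs values into the estimated equation from the earlier slide (Salary = 3.174 + 1.4039 YearsExp + 0.25089 ApScore). The programmer with 5 years of experience and a test score of 80 is a hypothetical example, not a case from the slides.

```python
def predict_salary(years_exp, ap_score):
    """Point estimate of salary ($000s) from the estimated equation."""
    return 3.174 + 1.4039 * years_exp + 0.25089 * ap_score


# Hypothetical programmer: 5 years of experience, aptitude score of 80.
estimate = predict_salary(5, 80)
print(f"Estimated salary: {estimate:.3f} thousand dollars")
```

The same number serves as the point estimate for both the mean salary and an individual salary; only the interval around it differs.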
28 Slide
PI and CI Using MiniTab
29 Slide
CATEGORICAL VARIABLES
30 Slide
Categorical Independent Variables
In many situations we must work with categorical independent variables such as gender (male, female), method of payment (cash, check, credit card), etc.
For example, x2 might represent gender where x2 = 0 indicates male and x2 = 1 indicates female.
In this case, x2 is called a dummy or indicator variable.
31 Slide
Categorical Independent Variables
Programmer Salary Survey
As an extension of the problem involving the computer programmer salary survey, suppose that management also believes that the annual salary is related to whether the individual has a graduate degree in computer science or information systems.
The years of experience, the score on the programmer aptitude test, whether the individual has a relevant graduate degree, and the annual salary ($000s) for each of the 20 sampled programmers are shown on the next slide.
32 Slide
Categorical Independent Variables
Exper. (Yrs.)   Test Score   Degr.   Salary ($000s)
      4             78        No         24.0
      7            100        Yes        43.0
      1             86        No         23.7
      5             82        Yes        34.3
      8             86        Yes        35.8
     10             84        Yes        38.0
      0             75        No         22.2
      1             80        No         23.1
      6             83        No         30.0
      6             91        Yes        33.0
      9             88        Yes        38.0
      2             73        No         26.6
     10             75        Yes        36.2
      5             81        No         31.6
      6             74        No         29.0
      8             87        Yes        34.0
      4             79        No         30.1
      6             94        Yes        33.9
      3             70        No         28.2
      3             89        No         30.0
If grad degree, Degr = 1. If no grad degree, Degr = 0.
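The coding rule above is a one-line transformation. A minimal sketch, assuming the Yes/No values are read in the order of the survey table; the list names are hypothetical.

```python
# Graduate degree column in the order the 20 programmers are listed.
degree = ["No", "Yes", "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes",
          "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "No", "No"]

# Dummy (indicator) coding: 1 if grad degree, 0 if no grad degree.
degr = [1 if d == "Yes" else 0 for d in degree]
print(degr)
```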
33 Slide
Categorical Independent Variables
Exper. (Yrs.)   Test Score   Degr.   Salary ($000s)
      4             78         0         24.0
      7            100         1         43.0
      1             86         0         23.7
      5             82         1         34.3
      8             86         1         35.8
     10             84         1         38.0
      0             75         0         22.2
      1             80         0         23.1
      6             83         0         30.0
      6             91         1         33.0
      9             88         1         38.0
      2             73         0         26.6
     10             75         1         36.2
      5             81         0         31.6
      6             74         0         29.0
      8             87         1         34.0
      4             79         0         30.1
      6             94         1         33.9
      3             70         0         28.2
      3             89         0         30.0
34 Slide
Estimated Regression Equation
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 \]
where:
ŷ = annual salary ($000s)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if individual does not have a graduate degree, 1 if individual does have a graduate degree
x3 is a dummy variable
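A short worked step may help with interpreting the dummy variable. Assuming the underlying model adds a term β3x3 to the earlier two-variable model (an assumption, since the slides show only the estimated equation), setting x3 to 0 and to 1 gives two parallel equations:

```latex
\begin{align*}
E(y \mid x_3 = 0) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 \\
E(y \mid x_3 = 1) &= (\beta_0 + \beta_3) + \beta_1 x_1 + \beta_2 x_2
\end{align*}
```

Under that assumption, β3 is the difference in expected salary between programmers with and without a graduate degree who have the same experience and test score, and b3 is its point estimate.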
35 Slide
Categorical Independent Variables
36 Slide
Categorical Independent Variables
37 Slide
Categorical Independent Variables
38 Slide
More Complex Categorical Variables
If a categorical variable has k levels, k - 1 dummy variables are required, with each dummy variable being coded as 0 or 1.
For example, a variable with levels A, B, and C could be represented by x1 and x2 values of (0, 0) for A, (1, 0) for B, and (0, 1) for C.
Care must be taken in defining and interpreting the dummy variables.
39 Slide
More Complex Categorical Variables
For example, a variable indicating level of education could be represented by x1 and x2 values as follows:
Highest Degree   x1   x2
Bachelor’s        0    0
Master’s          1    0
Ph.D.             0    1
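The k - 1 coding scheme in the table can be sketched as a small function. The names `LEVELS` and `dummy_code` are hypothetical; the first level in the list is treated as the baseline, which gets all zeros.

```python
# Levels in table order; the first one (Bachelor's) is the baseline.
LEVELS = ["Bachelor's", "Master's", "Ph.D."]


def dummy_code(value, levels=LEVELS):
    """Return k - 1 dummy variables (0 or 1) for a k-level category."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    # One dummy per non-baseline level; at most one of them is 1.
    return tuple(1 if value == level else 0 for level in levels[1:])


for level in LEVELS:
    x1, x2 = dummy_code(level)
    print(f"{level}: x1 = {x1}, x2 = {x2}")
```

Using k - 1 dummies rather than k avoids a column that is an exact linear combination of the others and the intercept.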
40 Slide
AND RESIDUALS
41 Slide
Assumptions About the Error Term
The error ε is a random variable with mean of zero.
The variance of ε, denoted by σ², is the same for all values of the independent variables.
The values of ε are independent.
The error ε is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by β0 + β1x1 + β2x2 + . . . + βpxp.
42 Slide
Standardized Residual Plot Against ŷ
Standardized residuals are frequently used in residual plots for purposes of:
• Identifying outliers (typically, standardized residuals < -2 or > +2)
• Providing insight about the assumption that the error term has a normal distribution
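The outlier rule above can be sketched in plain Python. This assumes the two-variable programmer model and the common definition of a standardized residual, e_i / (s · sqrt(1 - h_i)), where h_i is the leverage of observation i. MiniTab's exact computation is not shown in the slides, so treat the definition and all variable names as assumptions.

```python
from math import sqrt

exper = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
         88, 73, 75, 81, 74, 87, 79, 94, 70, 89]
salary = [24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
          38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0]
n, p = len(salary), 2


def centered(values):
    """Deviations of each value from the mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


d1, d2, dy = centered(exper), centered(score), centered(salary)
s11 = sum(a * a for a in d1)
s22 = sum(b * b for b in d2)
s12 = sum(a * b for a, b in zip(d1, d2))
s1y = sum(a * c for a, c in zip(d1, dy))
s2y = sum(b * c for b, c in zip(d2, dy))
det = s11 * s22 - s12 ** 2

# Least squares slopes (two-predictor closed form).
b1 = (s1y * s22 - s2y * s12) / det
b2 = (s2y * s11 - s1y * s12) / det

# Residuals e_i = y_i - y_hat_i; the intercept cancels after centering.
residuals = [c - b1 * a - b2 * b for a, b, c in zip(d1, d2, dy)]
s = sqrt(sum(e * e for e in residuals) / (n - p - 1))

# Leverage h_i of each observation in the two-predictor model.
leverage = [1 / n + (a * a * s22 - 2 * a * b * s12 + b * b * s11) / det
            for a, b in zip(d1, d2)]

std_resid = [e / (s * sqrt(1 - h)) for e, h in zip(residuals, leverage)]
outliers = [i + 1 for i, r in enumerate(std_resid) if abs(r) > 2]
print("Observations flagged as possible outliers:", outliers)
```

Plotting `std_resid` against the fitted values would give the kind of plot named in the slide title.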
43 Slide
Standardized Residual Plot Against ŷ
44 Slide
Residuals
45 Slide