Transcript of Week 5: Is the model a good fit for the data? Evaluating goodness of fit (pg 649 & ff.) Coefficient of determination (pg 646) Computing predictions including qualitative variables Multicollinearity (if time permits)

Page 1

Week 5: Is the model a good fit for the data?

Evaluating goodness of fit (pg 649 & ff.)

Coefficient of determination (pg 646)

Computing predictions, including qualitative variables

Multicollinearity (if time permits)

Page 2

Announcements

Need help with SAS?

Midterm next week in class at 5:45-7:30pm. Closed book and closed notes. You can bring a single-sided page of notes. Study guide posted on the COL website.

Page 3

How is a Linear Regression Analysis done? A Protocol

Page 4

Steps of Regression Analysis

1) Examine the scatterplot of the data. Does the relationship look linear? Are there points in locations they shouldn’t be? Do we need a transformation?

2) Assuming a linear function looks appropriate, estimate the regression parameters.

(Least squares estimates.)

3) Test whether there really is a statistically significant linear relationship.

Does the model fit the data adequately? Goodness-of-fit test (F-test for variances).

4) Examine the residuals for systematic inadequacies in the linear model as fit to the data.

Is there evidence that a more complicated relationship (say a polynomial) should be examined? (Residual analysis).

Are there data points which do not seem to follow the proposed relationship? (influential values or outliers).

5) Using the model for predictions

Page 5

Outline: Model diagnostics

Evaluate the goodness of fit for the observed data: goodness-of-fit test, coefficient of determination

Analyze whether the model assumptions are satisfied: residual analysis

Collinearity problems (if time permits): correlation matrix, VIF metric

Using the model for predictions: computing predictions and evaluating prediction accuracy

Page 6

Diagnostics on the model: Goodness-of-Fit

The question we must answer before we use the model for prediction is: Is this model a good representation of the observed data?

There are various techniques to answer this question:

Perform a goodness-of-fit test

Compute the coefficient of determination R2

Perform a residual analysis

Page 7

Overall test on Goodness-of-fit

The test is on the hypothesis that the model is completely wrong!

Null hypothesis: none of the x-variables included in the model has any association with Y:

Ho: β1 = β2 = β3 = ... = βp = 0

Alternative hypothesis: at least one x-variable has a significant effect on changes in Y:

Ha: at least one coefficient βj ≠ 0

Page 8

Testing for a Statistically Significant Regression

Ho: There is no relationship between Y and the X's.

HA: There is a relationship between Y and some of the X's.

Which of the two competing models is acceptable?

Mean model: yi = β0 + ei

Linear model: yi = β0 + β1xi1 + β2xi2 + β3xi3 + ei, for i = 1, ..., n

TEST: We look at the sums of squares of the prediction errors for the two models and decide whether that for the mean model is significantly larger than that for the linear model.

Page 9

Sums of Squares About the Mean: SS(Total)

Sum of the squared prediction errors for the null (mean model) hypothesis.

SS(Total) is a measure of the overall variation of the observed responses:

SS(Total) = Syy = Σi=1..n (yi − ȳ)²

Page 10

Residual Sums of Squares: SS(Residual)

Sum of the squared prediction errors for the alternative (linear regression model) hypothesis.

SSE measures the variation of the residuals, the part of the response variation that is not explained by the model:

SS(Residual) = SSE = Σi=1..n (yi − (β̂0 + β̂1xi1 + ... + β̂pxip))² = Σi=1..n (yi − ŷi)²

Page 11

Regression Sums of Squares: SS(Regression)

Difference between SS(Total) and SS(Residual):

SSR = Σi=1..n (yi − ȳ)² − Σi=1..n (yi − ŷi)² = Σi=1..n (ŷi − ȳ)²

SSR measures how much variability in the response data is explained by the fitted regression model over that explained by simply using the response mean.
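These three sums of squares are easy to compute by hand. The following Python sketch (not part of the course's SAS material; the data values are invented for illustration) fits a least-squares line to a toy data set and verifies that SS(Total) = SS(Regression) + SS(Residual):

```python
# Toy data (hypothetical values, for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Least-squares estimates for the straight-line model y = b0 + b1*x
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

# The three sums of squares defined on these slides
ss_total = sum((yi - ybar) ** 2 for yi in y)                # mean-model errors
ss_resid = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # linear-model errors
ss_regr = sum((yh - ybar) ** 2 for yh in yhat)              # explained variation

# The decomposition holds exactly for least-squares fits
assert abs(ss_total - (ss_regr + ss_resid)) < 1e-9
```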

Page 12

Total variability in y-values = Variability explained by the regression + Unexplained variability

SS(Total) = SS(Regression) + SS(Residual)

If the regression model fits well, SS(Regression) approaches SS(Total) and SS(Residual) gets small.

If the regression model adds little, SS(Regression) approaches 0 and SS(Residual) approaches SS(Total).

Page 13

Graphical View

Mean model: ŷi = ȳ

Linear model: ŷi = β̂0 + β̂1xi

Total variability in y-values = Variability explained by the regression + Unexplained variability

SS(Total) = SS(Regression) + SS(Residual)

Page 14

Goodness of fit test

Null hypothesis: none of the x-variables included in the model has any association with Y:

Ho: β1 = β2 = β3 = ... = βk = 0

Alternative hypothesis: at least one x-variable has a significant effect on changes in Y:

Ha: at least one coefficient βj ≠ 0

Test Statistic:

F = [SS(Regression)/k] / [SS(Residual)/(n − (k+1))] = MS(Regression)/MS(Residual)

where n = sample size and k = # of x-variables in the model.

F has an F-distribution with {k, n − (k+1)} degrees of freedom.

Page 15

The computations of this test are summarized in the Analysis of Variance table.

Analysis of Variance Table

Source             Sums of Squares   Degrees of Freedom   Mean Squares            F
                   (SS)              (DF)                 (MS)
Regression         SSR               k                    MSR = SSR/k             F = MSR/MSE
Error (Residual)   SSE               n − (k+1)            MSE = SSE/(n − (k+1))
Total              SS(Total)         n − 1

Page 16

F Test for Significant Regression

Under the assumption that the null (mean) model holds, both MSE and MSR measure the same underlying variance quantity: E(MSE) = σ² and E(MSR) = σ² + σR², with σR² = 0.

Under the alternative hypothesis, σR² > 0, so MSR (the Regression Mean Square) should be much greater than MSE (the Error Mean Square), because the regression line can explain the variation better.

This places the test in the context of a test comparing two variances.

Goodness of Fit Test Statistic:

F = MSR / MSE

F should be near 1 if the regression is not significant, i.e. if H0 (the model with no predictors) holds.

Page 17

Conditions for the F-test:

1. e1, e2, ..., en are independent of each other.
2. The ei are normally distributed with mean zero and have common variance σ².

How do we check these assumptions?

I. Appropriate graphs.
II. Correlations.
III. Formal goodness-of-fit tests.

Page 18

Goodness of Fit test: SAS output for CPU usage data

The REG Procedure
Analysis of Variance

                            Sum of       Mean
Source             DF      Squares     Square    F Value    Pr > F
Model               3      0.59705    0.19902     166.38    <.0001
Error              34      0.04067    0.00120
Corrected Total    37      0.63772

Root MSE          0.03459    R-Square    0.9362
Dependent Mean    0.15710    Adj R-Sq    0.9306
Coeff Var        22.01536

(Root MSE is σ̂ = √MSE.)

The goodness-of-fit test statistic is F = MSR/MSE = 0.19902/0.00120 = 166.38, with p-value less than 0.0001.

The null hypothesis of no relationship is strongly rejected, and the F-test gives strong support to the fitted model!
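The arithmetic in this output can be reproduced directly from the sums of squares. A quick Python check (illustrative only; SAS is the course's tool):

```python
# Values taken from the SAS ANOVA table above
ssr, sse = 0.59705, 0.04067   # SS(Regression), SS(Residual)
k, n = 3, 38                  # 3 x-variables; 37 total df, so n = 38

msr = ssr / k                 # Regression Mean Square
mse = sse / (n - (k + 1))     # Error Mean Square
f_stat = msr / mse            # goodness-of-fit F statistic

assert abs(msr - 0.19902) < 1e-5
assert abs(mse ** 0.5 - 0.03459) < 1e-4   # Root MSE
assert abs(f_stat - 166.38) < 0.5
```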

Page 19

Goodness of Fit – Recap

The test statistic is

F = MS(Regression) / MS(Residual)

The p-value is computed by SAS.

Decision rules:

P-value > 0.05: the fitted regression model is not a good model for the data; none of the x-variables has a significant effect on Y.

P-value < 0.05: there is at least one x-variable that has a significant effect on Y.

Page 20

Outline: Model diagnostics

Evaluate the goodness of fit for the observed data: goodness-of-fit test, coefficient of determination

Analyze whether the model assumptions are satisfied: residual analysis

Collinearity problems: correlation matrix, VIF metric

Using the model for predictions: computing predictions and evaluating prediction accuracy

Page 21

The coefficient of determination R² (section 12.4)

R² takes values between 0 and 1.

R² indicates the amount of variation in Y that is explained by the regression model in X1, ..., Xk:

R² = 1 − SSE / SS(Total)

Caution

A high value of R² does not necessarily mean that the regression model is a good fit for the data. Use it only if the number n of sample data points is larger than the number of variables k in the model.

R² will always take a high value, even if the variables have no effect on Y, if the sample size is close to k.

Page 22

Adjusted R-square

adj-R² is useful when comparing two models with different sets of x-variables.

Unlike R², the adj-R² value does not increase with the addition of an x-variable that does not improve the regression model.

A higher adj-R² typically indicates a better model.

adj-R² = 1 − [(n − 1) / (n − (k+1))] (1 − R²)

Page 23

CPU usage example

The REG Procedure
Analysis of Variance

                            Sum of       Mean
Source             DF      Squares     Square    F Value    Pr > F
Model               3      0.59705    0.19902     166.38    <.0001
Error              34      0.04067    0.00120
Corrected Total    37      0.63772

Root MSE          0.03459    R-Square    0.9362
Dependent Mean    0.15710    Adj R-Sq    0.9306
Coeff Var        22.01536

The adj-R² value = 0.9306.

What does this value mean?
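As a sanity check, both R² and adj-R² can be recomputed from the sums of squares in the output above (a short illustrative Python snippet, not part of the SAS session):

```python
# Values from the SAS output above
sse, ss_total = 0.04067, 0.63772
n, k = 38, 3   # 38 observations, 3 x-variables

r2 = 1 - sse / ss_total
adj_r2 = 1 - (n - 1) / (n - (k + 1)) * (1 - r2)

assert round(r2, 4) == 0.9362     # matches R-Square
assert round(adj_r2, 4) == 0.9306 # matches Adj R-Sq
```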

Page 24

Outline: Model diagnostics

Evaluate the goodness of fit for the observed data: goodness-of-fit test, coefficient of determination

Analyze whether the model assumptions are satisfied: residual analysis

Collinearity problems: correlation matrix, VIF metric

Using the model for predictions: computing predictions and evaluating prediction accuracy

Page 25

Compute predictions from the model

Section 11.4

Section 12.6

Page 26

A confidence interval for the mean response

Suppose we have finally found a model that represents the associations in the data, say

Yi = β0 + β1xi1 + β2xi2 + β3xi3 + ei

We can use it to compute the mean response of Y at certain values of the x-variables.

The average value of Y for values x*1, x*2, x*3 is computed as

Ê(Y | x*1, x*2, x*3) = β̂0 + β̂1x*1 + β̂2x*2 + β̂3x*3

A 95% Confidence Interval for the "true" mean response is

Ê(Y | x*1, x*2, x*3) ± t(0.95, n−p) · S.E.(Ê(Y | x*1, x*2, x*3))

(This is typically computed by the computer... intractable by hand!)

Page 27

A confidence interval for predictions

Suppose we want to compute the predicted response Y at a new (not observed) value of the x-variables.

The predicted value of Y for values x*1, x*2, x*3 is computed as

Ŷ = β̂0 + β̂1x*1 + β̂2x*2 + β̂3x*3

This is the same as the expression for the average value Ê(Y | x*1, x*2, x*3).

Thus the regression model is used to predict both the mean response and a future single response.

A useful prediction should include a margin of error to indicate its accuracy: this is called a Prediction Interval:

ŷ ± t(0.95, n−p) · S.E.(ŷ)

The margin of error for predicting a point value is larger than the margin of error for predicting the average response.

Page 28

Prediction band: what we would expect for one new observation at X=x.

Confidence band: what we would expect for the average of many observations taken at X=x.

The prediction interval is wider than the confidence bands for the mean response!

Page 29

Margins of error for the average response and for the predicted value

Straight-line example. Model:

Y(x) = β0 + β1x + e

A 95% Confidence Interval for the "true" mean response at x*:

Ê(Y | x*) ± t(0.95, n−2) · S.E.(Ê(Y | x*))

The 95% prediction interval for the new predicted value at x*:

Ŷ ± t(0.95, n−2) · S.E.(Ŷ)

Standard Errors:

S.E.(Ê(Y | x*)) = se · √( 1/n + (x* − x̄)² / Σi (xi − x̄)² )

S.E.(Ŷ) = se · √( 1 + 1/n + (x* − x̄)² / Σi (xi − x̄)² )

The additional term "1" under the square root makes the standard error of predictions larger.
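Both standard errors can be computed directly from the data. Here is a small illustrative Python sketch (toy data invented for the example, not the CPU data):

```python
import math

# Toy data (hypothetical values)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.8, 3.1, 3.9, 5.2, 5.8]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Residual standard deviation s_e (n - 2 df for the straight-line model)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(sse / (n - 2))

xstar = 3.5
se_mean = se * math.sqrt(1 / n + (xstar - xbar) ** 2 / sxx)      # mean response
se_pred = se * math.sqrt(1 + 1 / n + (xstar - xbar) ** 2 / sxx)  # new observation

# The extra "1" always makes the prediction standard error larger
assert se_pred > se_mean
```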

Page 30

Standard errors for predictions and mean responses in multiple regression

Standard error for the predicted average response:

S.E.(Ê(Y | x)) = σ̂ · √( xᵀ(XᵀX)⁻¹x )

Standard error for prediction:

S.E.(Ŷ) = σ̂ · √( 1 + xᵀ(XᵀX)⁻¹x )

where σ̂² = MSE, x = (1, x1, x2, ..., xp)ᵀ, and X is the data matrix.
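These matrix formulas can be sketched in Python/NumPy as follows (toy design matrix and responses invented for illustration):

```python
import numpy as np

# Toy design: 6 observations, 2 predictors; first column of X is the intercept
X = np.array([[1, 0.5, 2.0],
              [1, 1.0, 1.5],
              [1, 1.5, 3.0],
              [1, 2.0, 2.5],
              [1, 2.5, 4.0],
              [1, 3.0, 3.5]])
y = np.array([1.2, 1.9, 3.1, 3.4, 4.8, 5.1])

# Least-squares fit and MSE (error df = n - (p + 1))
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
n, p1 = X.shape
mse = resid @ resid / (n - p1)
sigma_hat = np.sqrt(mse)

# Standard errors at a new point x = (1, x1, x2)
x = np.array([1.0, 1.75, 2.75])
h = x @ np.linalg.inv(X.T @ X) @ x      # x'(X'X)^{-1} x
se_mean = sigma_hat * np.sqrt(h)        # for the mean response
se_pred = sigma_hat * np.sqrt(1 + h)    # for a single new response

assert se_pred > se_mean
```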

Page 31

The SAS System
The REG Procedure
Output Statistics

       Dep Var   Predicted   Std Error
Obs       time       Value   Mean Predict      95% CL Mean        95% CL Predict
  1          .      0.2411       0.0120     0.2167   0.2655     0.1667   0.3155

The predicted value of the CPU usage for linet = 7, step = 6 and device = 3 is 0.24 seconds.

The prediction interval is (0.1667, 0.3155).

The average CPU usage for linet = 7, step = 6 and device = 3 is 0.24 seconds.

The 95% C.I. is (0.2167, 0.2655).

Page 32

SAS Code for the CPU usage data

Data cpu;infile "cpudat.txt";input time line step device;linet=line/1000;label time="CPU time in seconds" line="lines in program execution"

step="number of computer programs" device="mounted devices" linet="lines in program (thousand)";

/*Exploratory data analysis */

proc corr data=cpu;run;

proc gplot data=cpu;plot time*(linet step device);run;

Page 33

/* Regression analysis */
proc reg data=cpu;
  model time = linet step device;
  plot student.*predicted.;
  plot student.*(linet step device) / nostat hplots=2 vplots=3;
  plot npp.*student. / nostat;
run;
quit;

Page 34

/* Computes predicted values, prediction intervals, and the
   confidence interval for the predicted average */
data predict;
  input time line linet step device;
  datalines;
. . 7 6 3
;
data pred;
  set predict cpu;
proc print data=pred;
run;
proc reg data=pred;
  model time = linet step device / clm cli alpha=0.05;
run;
quit;

Define a new data set containing the values of the x-variables you want to predict for.

Merge the original data set and the new data set together; the new data values are attached to the top of the original dataset.

Refit the regression model using the new data set "pred".

The option clm computes the predicted values and the 95% C.I. for the average response.

The option cli computes the predicted values and the 95% prediction interval for the predicted response.

Page 35

Predictions for transformed variables

Data on OPEN = opening revenue for new movies and BUDGET = cost of the movie. The fitted regression line is

log(open) = 1.692 + 0.0175 · budget

Movies with higher budget costs typically gain more money at their first weekend opening.

Suppose you want to predict the average opening revenue for a new movie whose budget was equal to 65 million dollars.

The predicted average value of log(Y) and the 95% C.I.:

The REG Procedure
Dependent Variable: logopen

       Dep Var   Predicted   Std Error
Obs    logopen       Value   Mean Predict     95% CL Mean
             .      2.8314       0.1203     2.5856   3.0771

Page 36

Thus a movie that costs 65 million dollars can expect to gain on average:

Predicted log(Y) = 2.8314, with 95% C.I. equal to (2.5856, 3.0771).

Transform the dependent variable back to the original value!

Predicted Y = exp(2.8314) = 16.969 million dollars.

Apply the same inverse transformation to the 95% C.I. to obtain an approximate 95% C.I. for the predicted average response.

An approximate 95% C.I. for the average predicted gross receipts for movies with a budget cost of 65 million dollars is

(exp(2.5856), exp(3.0771)) = (13.27, 21.69) million dollars.
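The back-transformation arithmetic takes only a couple of lines to verify in Python (values taken from the output above):

```python
import math

log_pred, ci_low, ci_high = 2.8314, 2.5856, 3.0771

# Invert the log transform to return to millions of dollars
pred = math.exp(log_pred)
ci = (math.exp(ci_low), math.exp(ci_high))

assert abs(pred - 16.97) < 0.01
assert abs(ci[0] - 13.27) < 0.01
assert abs(ci[1] - 21.69) < 0.01
```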

Page 37

Predictions for the Y variable

Suppose a transformation ytransf = f(y) on the response variable Y is used to stabilize a non-constant error variance, so that the parameter estimates have a smaller standard error.

The predicted values computed from the regression model are on the transformed scale:

ŷtransf = β̂0 + β̂1x1 + ... + β̂kxk

and need to be converted back to the original units:

ŷorig = f⁻¹(ŷtransf)

Apply the same inverse transformation for the prediction intervals or 95% confidence intervals for the average response: compute the P.I. or the 95% C.I. in the transformed scale and then transform it back to the original units:

(Lorig, Uorig) = (f⁻¹(Ltransf), f⁻¹(Utransf))

Page 38

Outline

Computing predictions

Including qualitative variables in the model: defining dummy variables

Problems in regression analysis: collinearity problems, correlation matrix, VIF metrics

Page 39

Example – Movie opening ticket sales

A movie producer has two new movie scripts to choose from. He wants to analyze which factors have a strong positive effect on the opening gross revenue of a movie. He collects data on 32 movies released between 1997 and 1998.

The data are on the variables:

Movie = title of the movie
Opening = gross receipts for the weekend after the movie was released (in millions of dollars)
Budget = the total budget for the movie (in millions of dollars)

CHARACTER VARIABLES:
Star = whether or not the movie has a superstar; values: Star, NoStar
Summer = whether or not the movie was released in the summer; values: Summer, NoSummer

TASK: Fit a regression model for the gross opening revenue with independent variables chosen among budget, star and summer!

Page 40

DATA:

Movie            Opening   Budget   Star?    Release?
AirForceOne       37.132    85.00   Star     Summer
BatmanandRobin    42.870   110.00   Star     Summer
Bean               2.255    22.00   NoStar   NoSummer
ConAir            24.131    75.00   Star     Summer
Contact           20.584    90.00   Star     Summer
KisstheGirl       13.215    27.00   NoStar   NoSummer
TheLostWorld      92.729    73.00   NoStar   NoSummer
MeninBlack        84.133    90.00   Star     Summer
Metro             18.734    55.00   NoStar   NoSummer
Mimic              7.818    25.00   NoStar   Summer
ThePeacemaker     12.311    50.00   Star     NoSummer
PrivateParts      14.616    20.00   NoStar   NoSummer
TheSaint          16.278    70.00   Star     NoSummer
SoulFood          11.197     7.00   NoStar   NoSummer
…….
Speed2            16.158   110.00   Star     NoSummer
Spawn             21.210    40.00   NoStar   Summer
Volcano           14.581    90.00   NoStar   NoSummer
187                2.912    23.00   NoStar   Summer

Page 41

How do we include the alphanumeric (qualitative) variables in the regression model?

Each alphanumeric variable is replaced by one or more dummy variables (that take only 0 or 1 values).

For instance, the variable Star is replaced by the numeric variable numstar:

Numstar = 1 if Star = STAR
Numstar = 0 if Star = NOSTAR

Analogously for the variable Summer:

Numsum = 1 if Release = SUMMER
Numsum = 0 if Release = NOSUMMER

Page 42

Dummy variables

Suppose X = A or B (X has two levels/treatments).

The dummy variable Z1 is defined as: Z1 = 1 if X = A; Z1 = 0 if X = B.

The regression model of Y on X is

y = β0 + β1z1 + e

1. If the observations come from X = A, we substitute Z1 = 1 in the model to obtain

yA = β0 + β1(1) + e = β0 + β1 + e

2. If the observations come from X = B, we substitute Z1 = 0 in the model to obtain

yB = β0 + β1(0) + e = β0 + e

Thus

β1 = μY|X=A − μY|X=B,  estimated by β̂1 = ȳA − ȳB,

represents the difference in averages between responses for X = A and X = B.
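For a single two-level dummy, the least-squares slope really is the difference of the two group means. A small Python check on toy data (values invented for illustration):

```python
# Toy responses for the two groups (hypothetical values)
y_a = [5.0, 6.0, 7.0]   # observations with X = A (z1 = 1)
y_b = [2.0, 3.0, 4.0]   # observations with X = B (z1 = 0)

z = [1] * len(y_a) + [0] * len(y_b)
y = y_a + y_b
n = len(y)

# Least-squares slope for y = b0 + b1*z + e
zbar = sum(z) / n
ybar = sum(y) / n
b1 = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y)) \
     / sum((zi - zbar) ** 2 for zi in z)

mean_a = sum(y_a) / len(y_a)
mean_b = sum(y_b) / len(y_b)
assert abs(b1 - (mean_a - mean_b)) < 1e-9   # b1 equals ȳA − ȳB
```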

Page 43

Creating dummy variables in SAS

data movie;
  infile 'movies.TXT';
  input movie $ opening budget star $ release $;

  * the following code creates two dummy variables;
  numstar = 1;
  if star = 'NoStar' then numstar = 0;
  numsum = 1;
  if release = 'NoSummer' then numsum = 0;

Alternative code is

  numstar = (star = 'Star');
  numsum = (release = 'Summer');

NOTE: numstar = (star = 'Star'); creates a variable numstar with value 1 if star = 'Star' and 0 otherwise.

Page 44

Dummy variables – cont.

In general, if X is not numeric and has k states, you can include it in the regression model for the average response Y by replacing X with k−1 dummy variables Z1, ..., Z(k−1).

Suppose X can take three values A, B or C.

Define 2 (= 3−1) dummy variables Z1 and Z2 as follows:

Z1 = 1 if X = A; Z1 = 0 otherwise;
Z2 = 1 if X = B; Z2 = 0 otherwise;

X    Z1   Z2
A     1    0
B     0    1
C     0    0

The regression model of Y on X is

y = β0 + β1z1 + β2z2 + e

Thus

yA = β0 + β1(1) + β2(0) + e = β0 + β1 + e
yB = β0 + β1(0) + β2(1) + e = β0 + β2 + e
yC = β0 + β1(0) + β2(0) + e = β0 + e

so that

β̂1 = ȳA − ȳC
β̂2 = ȳB − ȳC
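The k−1 encoding is just the table above applied row by row. A minimal Python sketch (C is the reference level, as in the table):

```python
# Encode a three-level factor X in {A, B, C} with k-1 = 2 dummies,
# using C as the reference level (all-zero row in the table above)
def dummies(x):
    return (1 if x == "A" else 0,   # Z1
            1 if x == "B" else 0)   # Z2

codes = {x: dummies(x) for x in ["A", "B", "C"]}

assert codes["A"] == (1, 0)
assert codes["B"] == (0, 1)
assert codes["C"] == (0, 0)   # reference level gets all zeros
```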

Page 45

Back to our example on the movie data

Step 1: Exploratory data analysis. Examine the scatter plots of the y-variable "opening" against each x-variable.

[Scatter plots of opening vs. budget, opening vs. Star/numstar, and opening vs. Summer/numsum]

Page 46

Correlation matrix

Simple Statistics
Variable    N       Mean      Std Dev    Minimum     Maximum
opening    32   20.32619     19.93042    1.64200    92.72900
budget     32   56.19531     32.02662    0.25000   110.00000
numstar    32    0.40625      0.49899    0           1.00000
numsum     32    0.46875      0.50701    0           1.00000

Pearson Correlation Coefficients, N = 32
Prob > |r| under H0: Rho=0

           opening    budget   numstar    numsum
opening    1.00000   0.46839   0.00141   0.17427
                      0.0069    0.9939    0.3401
budget     0.46839   1.00000   0.51767   0.09748
            0.0069              0.0024    0.5956
numstar    0.00141   0.51767   1.00000   0.11555
            0.9939    0.0024              0.5289
numsum     0.17427   0.09748   0.11555   1.00000
            0.3401    0.5956    0.5289

There is a stronger association between opening revenue and budget money, but the association with star and summer is weak!

Page 47

Step 2: Fitting the regression model. Find the x-variables that have a significant effect on Y. Start with the full model, which includes all the x-variables.

The REG Procedure
Dependent Variable: opening

Analysis of Variance
                            Sum of          Mean
Source             DF      Squares        Square    F Value    Pr > F
Model               3   3960.58314    1320.19438       4.43    0.0115
Error              28   8353.29135     298.33183
Corrected Total    31        12314

Root MSE          17.27229    R-Square    0.3216
Dependent Mean    20.32619    Adj R-Sq    0.2490
Coeff Var         84.97553

Parameter Estimates
                       Parameter     Standard
Variable       DF       Estimate        Error    t Value    Pr > |t|
Intercept       1        1.01440      6.68888       0.15      0.8805
budget          1        0.39269      0.11332       3.47      0.0017
numstar         1      -13.69455      7.28767      -1.88      0.0707
numsum          1        5.98995      6.16596       0.97      0.3396

Page 48

Step 3 – Residual analysis – Plots show some problems!

Normal probability plots for the model residuals

Residual versus predicted values

Page 49

(Multi-)Collinearity Problem

What is it?

This is a problem that arises when some of the x-variables are strongly correlated. If that happens, the regression model provides poor predictions!

How do we detect it?

The overall F-test is highly significant, but none of the individual t-tests on the betas turns out to be significant!

Why?

Because the correlated x-variables vary together, it is hard to detect which one has a strong effect on the response variable Y.

Page 50

Outline: Model diagnostics

Evaluate the goodness of fit for the observed data: goodness-of-fit test, coefficient of determination

Analyze whether the model assumptions are satisfied: residual analysis

Collinearity problems: correlation matrix, VIF metric

Using the model for predictions: computing predictions and evaluating prediction accuracy

Page 51

Multicollinearity (section 12.4)

What to do?

When two x-variables are strongly correlated, there is no need to keep them both in the model! They don't add predictive value to the model.

How do we assess multicollinearity?

Examine the Pearson correlation matrix and the scatter plots for each pair of x-variables. Correlation values larger than about 0.9 indicate a serious collinearity problem.

Compute the VIF statistics (see later).

Page 52

Detecting multicollinearity from correlations

[Three scatter plots of x2 versus x1, illustrating pairwise correlations of r² = 0, r² = .8 and r² = .95]

Page 53

Variance inflation factor

Tolerance or VIF (variance inflation factor) can be used to assess multivariate multicollinearity.

The value of tolerance for an x-variable is computed by regressing that x-variable on all the others.

If the x-variable is highly correlated with one or more other x-variables, the R² value for the regression above is by definition very large.

Page 54

Detecting Multi-collinearity

Tolerance value = (1 − Rj²)

where Rj² is for the regression of Xj on all the other x-variables, ignoring the y-variable. Tolerance coefficients are computed for each x-variable.

The higher the collinearity of the x-variables, the more the tolerance will approach zero. As a rule of thumb, if tolerance is less than .20, a problem with multicollinearity is indicated.

The variance inflation factor (VIF) is simply the reciprocal of tolerance:

VIF = 1 / (1 − Rj²)

A large value of VIF (larger than 4 or 5) is a sign of strong multicollinearity.

Page 55

The table below shows the inflationary impact on the standard error of the regression coefficient (beta) of the jth independent variable for various levels of multiple correlation (Rj), tolerance, and VIF (adapted from Fox, 1991: 12). Note that in the "Impact on SE" column, 1.0 corresponds to no impact, 2.0 to doubling the standard error, etc.:

Rj     Tolerance   VIF    Impact on SEb
0      1.00        1.00   1.00
.4     .84         1.19   1.09
.6     .64         1.56   1.25
.75    .44         2.25   1.50
.8     .36         2.78   1.67
.87    .25         4.00   2.00
.9     .19         5.26   2.29

Therefore VIF >= 4 (or 5) is an arbitrary but common cut-off criterion for deciding when a given independent variable displays "too much" multicollinearity: values above 4 or 5 suggest a multicollinearity problem.

Detecting Multicollinearity

Page 56:

Multicollinearity using SAS

The tolerance and VIF multicollinearity statistics are requested with the VIF and TOL options in the MODEL statement:

PROC REG;
  MODEL yvar = xvar_1 xvar_2 ... xvar_k / VIF TOL;
RUN;

Page 57:

Reducing or redefining the explanatory variables used in the model can help reduce multicollinearity.

First realize that the regression estimates are still acceptable but that our confidence in model fits and predictions will be impacted by the multicollinearity problem.

Use more advanced model-fitting techniques: principal components regression, ridge regression.

What to do when multicollinearity is detected.
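Of the advanced techniques listed above, ridge regression has a compact closed form. The Python sketch below (an illustration under assumed data, not the course's SAS workflow) adds a penalty alpha to the diagonal of X'X, which shrinks the coefficient estimates and stabilizes them when the X-variables are nearly collinear:

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge estimate on centered data:
    b = (Xc'Xc + alpha*I)^(-1) Xc'yc,
    with the intercept recovered from the variable means."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    k = X.shape[1]
    b = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(k), Xc.T @ yc)
    b0 = y.mean() - X.mean(axis=0) @ b   # intercept
    return b0, b
```

As alpha approaches 0 this reduces to ordinary least squares; larger alpha trades a little bias for much smaller variance in the presence of multicollinearity.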

Page 58:

Example of multiple regression

Case Study in Chapter 13 of your textbook: Performance of field sales representatives

Data are collected on 51 representatives. The data include

PROFIT = net profit margin for all orders placed through the representative
AREA = area of the district in thousands of square miles
POPN = millions of people in the district
OUTLETS = number of outlets in the district
COMMIS = 1 for full-commission representatives and 0 for partially salaried representatives.

GOAL of the ANALYSIS: Find out whether representative profit improves in districts with a higher number of outlets or a larger population. Are larger districts harder to cover? What about full commission? Since profits obviously increase when representatives serve many outlets, can you construct a model measuring the profit per outlet?

Page 59:

Correlation analysis shows some collinearity

The CORR Procedure
5 Variables: PROFIT AREA POPN OUTLETS COMMIS

Simple Statistics

Variable   N    Mean        Std Dev     Minimum   Maximum
PROFIT     51   1120        358.56843   188.00    1786
AREA       51   13.05961    7.03102     6.12      40.34000
POPN       51   3.77822     1.07928     0.297     5.74400
OUTLETS    51   174.21569   30.90651    85.000    234.00000

Pearson Correlation Coefficients, N = 51
Prob > |r| under H0: Rho=0

          PROFIT     AREA       POPN       OUTLETS    COMMIS
PROFIT    1.00000   -0.69571    0.60172    0.46029    0.27067
                     <.0001     <.0001     0.0007     0.0547
AREA     -0.69571    1.00000   -0.83563   -0.63878    0.14452
          <.0001                <.0001     <.0001     0.3116
POPN      0.60172   -0.83563    1.00000    0.74572   -0.31428
          <.0001     <.0001                <.0001     0.0247
OUTLETS   0.46029   -0.63878    0.74572    1.00000   -0.28831
          0.0007     <.0001     <.0001                0.0402
COMMIS    0.27067    0.14452   -0.31428   -0.28831    1.00000
          0.0547     0.3116     0.0247     0.0402

Note the correlations with the response variable Y, and the high correlations among the X variables.
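Screening a correlation matrix for collinear pairs like these can be automated; here is a small Python sketch (illustrative, with an assumed 0.8 cut-off rather than anything prescribed by the textbook):

```python
import numpy as np

def high_correlation_pairs(X, names, threshold=0.8):
    """Return (name_i, name_j, r) for every pair of columns of X whose
    absolute Pearson correlation exceeds the threshold."""
    corr = np.corrcoef(X, rowvar=False)   # columns are variables
    k = len(names)
    return [(names[i], names[j], corr[i, j])
            for i in range(k) for j in range(i + 1, k)
            if abs(corr[i, j]) > threshold]
```

Applied to the case-study data, a function like this would flag the AREA/POPN pair, whose correlation of about −0.84 appears in the output above.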

Page 60:

Scatterplot matrix for the 4 quantitative variables.

Which pairs of variables show strong correlation?

Page 61:

The correlation matrix shows that POPN (the size of the population living in the district) and AREA (the area of the district) are strongly negatively correlated (corr = −0.84).

We don’t need to keep both variables in the model.

Start the regression analysis for predicting the profit margin of sales representatives by using the independent variables POPN, AREA, OUTLETS and COMMIS.

Start with the full regression model

Check whether each variable has a significant effect on the profit margin, and check the model assumptions!

Y = β0 + β1·AREA + β2·OUTLETS + β3·COMMIS + β4·POPN + e

Page 62:

Regression Procedure
Dependent Variable: PROFIT

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              4        4242550        1060638      22.32   <.0001
Error             46        2186016          47522
Corrected Total   50        6428566

Root MSE          217.99559   R-Square   0.6600
Dependent Mean   1120.03922   Adj R-Sq   0.6304
Coeff Var          19.46321

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1          700.14983         363.15833       1.93     0.0600
AREA         1          -23.36958           8.21232      -2.85     0.0066
POPN         1          103.96061          62.45281       1.66     0.1028
COMMIS       1          330.15153          67.94095       4.86     <.0001
OUTLETS      1            0.75551           1.50573       0.50     0.6182

Page 63:

VIF values from the SAS output, computed by the VIF and TOL options in the MODEL statement:

model profit=area popn commis outlets /vif tol;

Variable    Tolerance   Variance Inflation
Intercept       .            0
AREA         0.28507      3.50787
POPN         0.20920      4.78017
COMMIS       0.84686      1.18084
OUTLETS      0.43887      2.27859

VIF values above 4 or 5 indicate a collinearity problem. We will omit POPN and refit the model.

Page 64:

The REG Procedure Dependent Variable: PROFIT

Number of Observations Read 51 Number of Observations Used 51

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3        4110867        1370289      27.79   <.0001
Error             47        2317698          49313
Corrected Total   50        6428566

Root MSE          222.06470   R-Square   0.6395
Dependent Mean   1120.03922   Adj R-Sq   0.6165
Coeff Var          19.82651

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Tolerance   Variance Inflation
Intercept    1         1041.77773         305.20134       3.41     0.0013        .             0
AREA         1          -33.19926           5.81379      -5.71     <.0001    0.59025       1.69420
COMMIS       1          299.45143          66.61049       4.50     <.0001    0.91422       1.09382
OUTLETS      1            1.89312           1.36675       1.39     0.1726    0.55273       1.80920

Page 65:

Fitted model

The fitted model is

Ŷ = 1041.778 − 33.199·AREA + 299.451·COMMIS + 1.893·OUTLETS

Is it true that commission-only representatives with many outlets to cover are highly productive?

What is the estimated average difference in production between representatives working on commission and those receiving a salary?
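Questions like these can be answered directly from the fitted equation. A small Python sketch (the coefficient values are taken from the SAS output above, rounded as on the slide) shows that, holding AREA and OUTLETS fixed, the commission-versus-salary difference is exactly the COMMIS coefficient:

```python
def predict_profit(area, commis, outlets):
    """Prediction from the fitted model:
    Y-hat = 1041.778 - 33.199*AREA + 299.451*COMMIS + 1.893*OUTLETS"""
    return 1041.778 - 33.199 * area + 299.451 * commis + 1.893 * outlets

# Same district (sample-mean AREA and OUTLETS, roughly), commission
# (COMMIS=1) vs. salaried (COMMIS=0):
diff = predict_profit(13.06, 1, 174) - predict_profit(13.06, 0, 174)
```

Here diff equals the COMMIS coefficient, about 299.45: the estimated average profit advantage of a full-commission representative, all else being equal.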

Page 66:

SAS code

data sales;
  infile "sales.txt";
  input PROFIT AREA POPN OUTLETS COMMIS;
  /* profout is profit per outlet (it is a better response variable) */
  profout = profit/outlets;
  label profout = "Monthly net profit margin per outlet($)";
run;

* computes correlation values;
proc corr;
  var profit profout area popn outlets commis;
run;

* draws scatterplots;
symbol value=dot color=black;
proc gplot;
  plot profout*(area popn outlets commis);
run;

Page 67:

SAS code

* computes regression analysis;
proc reg;
  * fits model and computes VIF values for multicollinearity;
  model profit = area popn commis outlets / vif;
run; quit;

* fits model without the variable outlets;
proc reg;
  model profit = area popn commis;
  plot student.*predicted.;
  plot npp.*student.;
  plot student.*(area popn commis);
run; quit;

Page 68:

Remarks on multiple regression

How do you begin to develop an appropriate multiple regression model for a given problem? There are no fixed rules.

First decide on the dependent Y-variable and the independent X-variables. Use expert judgment and the problem description!

The most critical decision is the initial selection of the X-variables. Which ones should you include? Knowledge of the problem area is critical in identifying the variables that might have an important effect on the response variable we want to make predictions about. Use expert knowledge!

In the next few weeks we'll learn how to select the best variables to include in the model.