Stat 112: Lecture 13 Notes
• Finish Chapter 5:
  – Review predictions in the log-log transformation.
  – Polynomials and transformations in multiple regression.
• Start Chapter 6: Checking assumptions of multiple regression and remedies for the assumptions.
• Schedule: Homework 4 will be assigned next week and is due Thursday, Nov. 2nd.
Another Example of Transformations: Y = count of tree seeds, X = seed weight
[Figure: Bivariate fit of Seed Count by Seed weight (mg), with three fits overlaid: a linear fit, a transformed fit log to log, and a transformed fit to log.]
Linear Fit
Seed Count = 6751.7179 - 2.1076776 Seed weight (mg)
Summary of Fit
  RSquare                      0.220603
  RSquare Adj                  0.174756
  Root Mean Square Error       6199.931
  Mean of Response             4398.474
  Observations (or Sum Wgts)   19

Transformed Fit Log to Log
Log(Seed Count) = 9.758665 - 0.5670124 Log(Seed weight (mg))
Fit Measured on Original Scale
  Sum of Squared Error         161960739
  Root Mean Square Error       3086.6004
  RSquare                      0.8068273
  Sum of Residuals             3142.2066

Transformed Fit to Log
Seed Count = 12174.621 - 1672.3962 Log(Seed weight (mg))
Summary of Fit
  RSquare                      0.566422
  RSquare Adj                  0.540918
  Root Mean Square Error       4624.247
  Mean of Response             4398.474
  Observations (or Sum Wgts)   19
By looking at the root mean square error on the original y-scale, we see that both of the transformations improve upon the untransformed model and that the transformation to log y and log x is by far the best.
Prediction using the log y/log x transformation
• What is the predicted seed count for a seed weight of 50 mg?
• Math trick: exp{log(y)} = y (remember, by log we always mean the natural log, ln). So:

  Ê(Y | X = 50) = exp{Ê(log Y | X = 50)}
                = exp{Ê(log Y | log X = log 50)}
                = exp{Ê(log Y | log X = 3.912)}
                = exp{9.7587 - 0.5670*3.912}
                = exp{7.5406} ≈ 1882.96
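This back-transformation is easy to check numerically. A minimal sketch in Python, using the coefficients from the JMP log-log fit (natural logs, per the note above):

```python
import math

# Coefficients from the log-log fit:
# Log(Seed Count) = 9.758665 - 0.5670124 * Log(Seed weight (mg))
b0 = 9.758665
b1 = -0.5670124

def predict_seed_count(weight_mg):
    """Predict seed count on the original scale by exponentiating
    the prediction of log(seed count)."""
    log_pred = b0 + b1 * math.log(weight_mg)  # predicted log(Y)
    return math.exp(log_pred)                 # math trick: exp{log(y)} = y

print(round(predict_seed_count(50), 1))
```

With the full-precision coefficients this comes out slightly below the slide's 1882.96, which was computed from rounded intermediate values (9.7587, 0.5670, 3.912).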
Polynomials and Transformations in Multiple Regression
• Example: Fast Food Locations. An analyst working for a fast food chain is asked to construct a multiple regression model to identify new locations that are likely to be profitable. For a sample of 25 locations, the analyst has the annual gross revenue of the restaurant (y), the mean annual household income, and the mean age of children in the area. Data in fastfoodchain.jmp.
[Figure: Scatterplot matrix of Revenue (about 900-1300), Income (about 20-35), and Age (about 5-15).]
There seems to be a nonlinear relationship between revenue and income and between revenue and age.
Polynomials and Transformations for Multiple Regression in JMP
• For multiple regression, transformations can be done by creating a new column, right-clicking it, and choosing Formula to define the transformed variable.
• Polynomials can be added by using Fit Model and then highlighting the X variable in both the Select Columns box and the Construct Model Effects Box and then clicking cross.
• For choosing the order of the polynomials, we use the same procedure as in simple regression, making the polynomials higher order until the coefficient on the highest order term is not significant.
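The same order-selection idea can be sketched outside JMP. A hypothetical illustration in Python (simulated data, not fastfoodchain.jmp), building centered polynomial terms the way JMP's cross button does and checking the t-ratio of the highest-order term:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with a true quadratic relationship, for illustration only.
n = 200
x = rng.uniform(15, 35, n)
y = 1000 + 5 * x - 4 * (x - 25) ** 2 + rng.normal(0, 10, n)

xc = x - x.mean()  # center x, as JMP does with (x - xbar)*(x - xbar)
# Include a cubic term so we can test whether it is needed.
X = np.column_stack([np.ones(n), x, xc ** 2, xc ** 3])

beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (n - X.shape[1])   # residual variance estimate
cov = s2 * np.linalg.inv(X.T @ X)       # covariance matrix of the coefficients
t_cubic = beta[3] / np.sqrt(cov[3, 3])  # t-ratio for the highest-order term

print("cubic t-ratio:", t_cubic)
# Since the true relationship is quadratic, |t| is typically small here,
# so by the procedure above we would drop the cubic and keep the quadratic.
```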
Polynomial Regression for Fast Food Chain Data
Response Revenue
Parameter Estimates
  Term                          Estimate    Std Error   t Ratio   Prob>|t|
  Intercept                     1062.4317   72.9538     14.56     <.0001
  Income                        5.4563847   2.162126    2.52      0.0202
  Age                           1.6421762   5.413888    0.30      0.7648
  (Income-24.2)*(Income-24.2)   -3.979104   0.570833    -6.97     <.0001
  (Age-8.392)*(Age-8.392)       -4.112892   1.267459    -3.24     0.0041
Since the quadratic coefficients for both income and age are significant, we will consider cubic terms for both income and age.

Response Revenue
Parameter Estimates
  Term                                        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept                                   734.8803    146.2852    5.02      <.0001
  Income                                      17.781924   5.237722    3.39      0.0032
  Age                                         2.8688282   6.581383    0.44      0.6681
  (Income-24.2)*(Income-24.2)                 -3.197419   0.607609    -5.26     <.0001
  (Age-8.392)*(Age-8.392)                     -5.009426   1.501641    -3.34     0.0037
  (Income-24.2)*(Income-24.2)*(Income-24.2)   -0.248737   0.099073    -2.51     0.0218
  (Age-8.392)*(Age-8.392)*(Age-8.392)         0.2871644   0.330796    0.87      0.3968
Since the cubic coefficient for age is not significant, we will use a second-order (quadratic) polynomial for age. Since the cubic coefficient for income is significant, let's consider a fourth-order polynomial for income.
Response Revenue
Parameter Estimates
  Term                                                      Estimate    Std Error   t Ratio   Prob>|t|
  Intercept                                                 746.81501   146.7115    5.09      <.0001
  Income                                                    16.32413    4.872106    3.35      0.0036
  Age                                                       6.3972838   5.319759    1.20      0.2447
  (Income-24.2)*(Income-24.2)                               -4.023468   1.240477    -3.24     0.0045
  (Age-8.392)*(Age-8.392)                                   -4.240435   1.163926    -3.64     0.0019
  (Income-24.2)*(Income-24.2)*(Income-24.2)                 -0.221251   0.090451    -2.45     0.0249
  (Income-24.2)*(Income-24.2)*(Income-24.2)*(Income-24.2)   0.0092713   0.015497    0.60      0.5571
Since the fourth-order coefficient for income is not significant, we will use a third-order (cubic) polynomial for income.

Final Model:
Response Revenue
Parameter Estimates
  Term                                        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept                                   752.87321   143.8674    5.23      <.0001
  Income                                      15.903936   4.739058    3.36      0.0033
  Age                                         6.3016037   5.226739    1.21      0.2428
  (Income-24.2)*(Income-24.2)                 -3.36781    0.571292    -5.90     <.0001
  (Age-8.392)*(Age-8.392)                     -4.166012   1.137538    -3.66     0.0017
  (Income-24.2)*(Income-24.2)*(Income-24.2)   -0.205342   0.084981    -2.42     0.0259
Ê(Revenue | Income = 30, Age = 10) = 752.87 + 15.90*30 + 6.30*10 - 3.37*(30-24.2)² - 4.17*(10-8.392)² - 0.21*(30-24.2)³ ≈ 1128.88
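A quick check of this arithmetic in Python, using the full-precision coefficients from the Final Model parameter estimates:

```python
# Prediction from the final cubic-in-income, quadratic-in-age model.
def predict_revenue(income, age):
    return (752.87321
            + 15.903936 * income
            + 6.3016037 * age
            - 3.36781 * (income - 24.2) ** 2
            - 4.166012 * (age - 8.392) ** 2
            - 0.205342 * (income - 24.2) ** 3)

print(round(predict_revenue(30, 10), 2))  # → 1128.88
```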
Chapter 6: Checking the Assumptions of the Regression Model and Remedies for When the Assumptions are Not Met
Assumptions of Multiple Linear Regression Model
1. Linearity: E(Y | X1 = x1, …, XK = xK) = β0 + β1x1 + … + βKxK.
2. Constant variance: The standard deviation of Y for the subpopulation of units with X1 = x1, …, XK = xK is the same for all subpopulations.
3. Normality: The distribution of Y for the subpopulation of units with X1 = x1, …, XK = xK is normally distributed for all subpopulations.
4. The observations are independent.
Assumptions for linear regression and their importance to inferences
• Point prediction, point estimation: linearity, independence.
• Confidence interval for slope, hypothesis test for slope, confidence interval for mean response: linearity, constant variance, independence, normality (only if n < 30).
• Prediction interval: linearity, constant variance, independence, normality.
Fast Food Chain Data
Response Revenue
Summary of Fit
  RSquare                      0.325221
  RSquare Adj                  0.263877
  Root Mean Square Error       111.6051
  Mean of Response             1085.56
  Observations (or Sum Wgts)   25
Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
  Model      2    132070.87        66035.4       5.3016     0.0132
  Error      22   274025.29        12455.7
  C. Total   24   406096.16
Parameter Estimates
  Term        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept   667.80548   132.3305    5.05      <.0001
  Income      11.429981   4.677122    2.44      0.0230
  Age         16.819467   7.999592    2.10      0.0472
[Figure: Revenue residuals versus predicted Revenue (about 800-1300).]
Checking Linearity
• Plot residuals versus each of the explanatory variables. Each of these plots should look like random scatter, with no pattern in the mean of the residuals.
• If the residual plots show a problem, then we could try to transform the x-variable and/or the y-variable.
• Residual plot: use Fit Y by X with Y being the residuals. Fit Line will draw a horizontal line.
[Figure: Revenue residuals plotted versus Income (15-35) and versus Age (2.5-15).]
Residual Plots in JMP
• After Fit Model, click the red triangle next to Response, click Save Columns, and click Residuals.
• Use Fit Y by X with Y=Residuals and X the explanatory variable of interest. Fit Line will draw a horizontal line with intercept zero. It is a property of the residuals from multiple linear regression that a least squares regression of the residuals on an explanatory variable has slope zero and intercept zero.
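The zero-slope, zero-intercept property is easy to verify numerically. A sketch in Python on simulated data (illustrative only, not the fastfoodchain.jmp data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two explanatory variables and a linear response, for illustration.
n = 100
x1 = rng.normal(25, 5, n)
x2 = rng.normal(8, 2, n)
y = 700 + 12 * x1 + 15 * x2 + rng.normal(0, 50, n)

# Multiple regression of y on x1 and x2 by least squares.
X = np.column_stack([np.ones(n), x1, x2])
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Regress the residuals on x1 alone: the least squares slope and
# intercept are both zero (up to floating-point error), as stated above.
slope, intercept = np.polyfit(x1, resid, 1)
print(slope, intercept)  # both ~ 0
```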
Residual by Predicted Plot
• Fit Model displays the Residual by Predicted Plot automatically in its output.
• The plot is a plot of the residuals versus the predicted Y's, Ŷ_i = Ê(Y | X1 = x_i1, …, XK = x_iK). We can think of the predicted Y's as summarizing all the information in the X's. As usual, we would like this plot to show random scatter.
• Pattern in the mean of the residuals as the predicted Y's increase: indicates a problem with linearity. Look at residual plots versus each explanatory variable to isolate the problem and consider transformations.
• Pattern in the spread of the residuals: indicates a problem with constant variance.
[Figure: Residual by Predicted plot for the fast food data: Revenue residuals versus predicted Revenue (about 800-1300).]
Corrections for Violations of the Linearity Assumption
• When the residual plot shows a pattern in the mean of the residuals for one of the explanatory variables Xj, we should consider:
  – Transforming Xj.
  – Adding polynomial variables in Xj: (Xj)², (Xj)³, etc.
  – Transforming Y.
• After making the transformation/adding polynomials, we need to refit the model and look at the new residual plot vs. X to see if linearity has been achieved.
Response Revenue
Summary of Fit
  RSquare                      0.885175
  RSquare Adj                  0.86221
  Root Mean Square Error       48.28563
  Mean of Response             1085.56
  Observations (or Sum Wgts)   25
Parameter Estimates
  Term                          Estimate    Std Error   t Ratio   Prob>|t|
  Intercept                     1062.4317   72.9538     14.56     <.0001
  Income                        5.4563847   2.162126    2.52      0.0202
  Age                           1.6421762   5.413888    0.30      0.7648
  (Age-8.392)*(Age-8.392)       -4.112892   1.267459    -3.24     0.0041
  (Income-24.2)*(Income-24.2)   -3.979104   0.570833    -6.97     <.0001
Residual by Predicted Plot
[Figure: Revenue residuals versus predicted Revenue (about 700-1300) for the quadratic model.]
Quadratic Polynomials for Age and Income
Fit Y by X Group
[Figure: Bivariate fit of Residual Revenue 2 by Income (15-35), and bivariate fit of Residual Revenue 2 by Age (2.5-15).]
Linearity now appears to be satisfied.
Checking Constant Variance Assumption
• Residual plot versus explanatory variables should exhibit constant variance.
• Residual plot versus predicted values should exhibit constant variance (this plot is often the most useful for detecting nonconstant variance).
Heteroscedasticity
• When the requirement of a constant variance is violated, we have a condition of heteroscedasticity.
• Diagnose heteroscedasticity by plotting the residuals against the predicted y.
[Figure: sketch of residuals plotted against ŷ. The spread increases with ŷ.]
How much traffic would a building generate?
• The goal is to predict how much traffic will be generated by a proposed new building of 150,000 occupied sq ft. (Data are from the MidAtlantic States City Planning Manual.)
• The data tell how many automobile trips per day were made in the AM to office buildings of different sizes.
• The variables are x = "Occupied Sq Ft of floor space in the building (in 1000 sq ft)" and y = "number of automobile trips arriving at the building per day in the morning".
Reducing Nonconstant Variance/Nonnormality by Transformations
• A brief list of transformations:
  » y' = y^(1/2) (for y > 0)
    • Use when s² increases with ŷ.
  » y' = log y (for y > 0)
    • Use when s increases with ŷ.
    • Use when the error distribution is skewed to the right.
  » y' = y²
    • Use when s² is decreasing with ŷ.
    • Use when the error distribution is skewed to the left.
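A small simulation sketch of why the log transformation helps when the standard deviation grows with the mean (illustrative synthetic data, not the textbook data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate data with multiplicative noise: y = x * exp(noise), so the
# standard deviation of y grows with its mean -- the case where the
# transformation y' = log y is recommended.
n = 500
x = rng.uniform(1, 100, n)
y = x * np.exp(rng.normal(0, 0.3, n))

def spread_ratio(xv, resid):
    """Ratio of residual spread in the upper half of x to the lower half."""
    hi = resid[xv > np.median(xv)]
    lo = resid[xv <= np.median(xv)]
    return hi.std() / lo.std()

# Straight-line fit on the raw scale: residual spread grows with x.
raw_resid = y - np.polyval(np.polyfit(x, y, 1), x)
# Fit of log y on log x: the noise is additive with constant variance,
# so the residual spread is roughly the same everywhere.
log_resid = np.log(y) - np.polyval(np.polyfit(np.log(x), np.log(y), 1), np.log(x))

print(spread_ratio(x, raw_resid), spread_ratio(x, log_resid))
# The first ratio is well above 1 (heteroscedastic); the second is near 1.
```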
[Figure: Bivariate fit of AM Trips (0-1500) by Sq Ft (1000) (0-800), with a linear fit.]
Linear Fit
AM Trips = -4.55 + 1.515 Sq Ft (1000)
Summary of Fit
  RSquare                      0.857071
  Observations (or Sum Wgts)   61
Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
  Model      1    3810173.9        3810174       353.7926   <.0001
  Error      59   635401.2         10770
  C. Total   60   4445575.1
[Figure: residuals versus Sq Ft (1000) for the linear fit.]
The heteroscedasticity shows here.
[Figure: Bivariate fit of Log AM Trips by Occup. Sq. Ft. (1000), with the corresponding residual plot.]
To try to fix the heteroscedasticity, we transform Y to Log(Y). This fixes the heteroscedasticity, BUT it creates a nonlinear pattern.
[Figure: Bivariate fit of Log AM Trips by Log(OccupSqFt), with the corresponding residual plot.]
Linear Fit
Log AM Trips = 0.6393 + 0.8033 Log(OccupSqFt)
Summary of Fit
  RSquare   0.827
To fix the nonlinearity, we now transform x to Log(x), without changing Y any further (Y stays on the log scale).
The resulting pattern is both satisfactorily homoscedastic AND linear.
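The stated goal was to predict traffic for a 150,000 sq ft building. A sketch of that prediction from the fitted log-log equation, assuming the Log here is base 10 (the 1.5-3 axis scale for trips in the hundreds suggests base-10 logs, though the notes do not say so explicitly):

```python
import math

# Fitted equation from the slide: Log AM Trips = 0.6393 + 0.8033 Log(OccupSqFt)
# Assumption: both logs are base 10 (not confirmed in the notes).
b0, b1 = 0.6393, 0.8033

def predict_am_trips(occup_sqft_thousands):
    log_pred = b0 + b1 * math.log10(occup_sqft_thousands)
    return 10 ** log_pred  # back-transform from the log10 scale

# Proposed building: 150,000 occupied sq ft = 150 in units of 1000 sq ft.
pred_trips = predict_am_trips(150)
print(round(pred_trips))  # → 244
```

If the logs were actually natural logs, the back-transform would use math.exp instead and give a different answer, so the base matters.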
[Figure: residuals of Log AM Trips versus predicted Log AM Trips.]
Often we will plot residuals versus predicted. For simple regression the two residual plots are equivalent.