Stat 112: Lecture 13 Notes
• Finish Chapter 5:
  – Review predictions in the log-log transformation.
  – Polynomials and transformations in multiple regression.
• Start Chapter 6: Checking assumptions of multiple regression and remedies for the assumptions.
• Schedule: Homework 4 will be assigned next week and is due Thursday, Nov. 2nd.
Another Example of Transformations: Y = count of tree seeds, X = seed weight
[Figure: Bivariate fit of Seed Count by Seed weight (mg), with three fits overlaid: a linear fit, a transformed fit log to log, and a transformed fit to log.]
Linear Fit
Seed Count = 6751.7179 - 2.1076776 Seed weight (mg)
Summary of Fit
  RSquare                      0.220603
  RSquare Adj                  0.174756
  Root Mean Square Error       6199.931
  Mean of Response             4398.474
  Observations (or Sum Wgts)   19

Transformed Fit Log to Log
Log(Seed Count) = 9.758665 - 0.5670124 Log(Seed weight (mg))
Fit Measured on Original Scale
  Sum of Squared Error         161960739
  Root Mean Square Error       3086.6004
  RSquare                      0.8068273
  Sum of Residuals             3142.2066

Transformed Fit to Log
Seed Count = 12174.621 - 1672.3962 Log(Seed weight (mg))
Summary of Fit
  RSquare                      0.566422
  RSquare Adj                  0.540918
  Root Mean Square Error       4624.247
  Mean of Response             4398.474
  Observations (or Sum Wgts)   19
By looking at the root mean square error on the original y-scale, we see that both of the transformations improve upon the untransformed model and that the transformation to log y and log x is by far the best.
Prediction using the log y/log x transformation
• What is the predicted seed count for a seed weight of 50 mg?
• Math trick: exp{log(y)} = y (remember, by log we always mean the natural log, ln). So:

  Ê(Y | X = 50) = exp{Ê(log Y | X = 50)}
                = exp{Ê(log Y | log X = log 50)}
                = exp{Ê(log Y | log X = 3.912)}
                = exp{9.7587 - 0.5670*3.912}
                = exp{7.5406} ≈ 1882.96
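This back-transformation is easy to check numerically. A minimal sketch in Python, using the coefficients from the JMP log-log fit (natural logs, per the note above):

```python
import math

# Coefficients from the log-log fit:
# Log(Seed Count) = 9.758665 - 0.5670124 * Log(Seed weight (mg))
b0 = 9.758665
b1 = -0.5670124

def predict_seed_count(weight_mg):
    """Predict seed count on the original scale by exponentiating
    the prediction of log(seed count)."""
    log_pred = b0 + b1 * math.log(weight_mg)  # predicted log(Y)
    return math.exp(log_pred)                 # math trick: exp{log(y)} = y

print(round(predict_seed_count(50), 1))
```

With the full-precision coefficients this comes out slightly below the slide's 1882.96, which was computed from rounded intermediate values (9.7587, 0.5670, 3.912).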
Polynomials and Transformations in Multiple Regression
• Example: Fast Food Locations. An analyst working for a fast food chain is asked to construct a multiple regression model to identify new locations that are likely to be profitable. For a sample of 25 locations, the analyst has the annual gross revenue of the restaurant (y), the mean annual household income, and the mean age of children in the area. Data in fastfoodchain.jmp.
[Figure: Scatterplot matrix of Revenue (about 900-1300), Income (about 20-35), and Age (about 5-15).]
There seems to be a nonlinear relationship between revenue and income and between revenue and age.
Polynomials and Transformations for Multiple Regression in JMP
• For multiple regression, transformations can be done by creating a new column, right-clicking it, and choosing Formula to define the transformed variable.
• Polynomials can be added by using Fit Model and then highlighting the X variable in both the Select Columns box and the Construct Model Effects Box and then clicking cross.
• For choosing the order of the polynomials, we use the same procedure as in simple regression, making the polynomials higher order until the coefficient on the highest order term is not significant.
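The same order-selection idea can be sketched outside JMP. A hypothetical illustration in Python (simulated data, not fastfoodchain.jmp), building centered polynomial terms the way JMP's cross button does and checking the t-ratio of the highest-order term:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with a true quadratic relationship, for illustration only.
n = 200
x = rng.uniform(15, 35, n)
y = 1000 + 5 * x - 4 * (x - 25) ** 2 + rng.normal(0, 10, n)

xc = x - x.mean()  # center x, as JMP does with (x - xbar)*(x - xbar)
# Include a cubic term so we can test whether it is needed.
X = np.column_stack([np.ones(n), x, xc ** 2, xc ** 3])

beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (n - X.shape[1])   # residual variance estimate
cov = s2 * np.linalg.inv(X.T @ X)       # covariance matrix of the coefficients
t_cubic = beta[3] / np.sqrt(cov[3, 3])  # t-ratio for the highest-order term

print("cubic t-ratio:", t_cubic)
# Since the true relationship is quadratic, |t| is typically small here,
# so by the procedure above we would drop the cubic and keep the quadratic.
```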
Polynomial Regression for Fast Food Chain Data
Response Revenue
Parameter Estimates
  Term                          Estimate    Std Error   t Ratio   Prob>|t|
  Intercept                     1062.4317   72.9538     14.56     <.0001
  Income                        5.4563847   2.162126    2.52      0.0202
  Age                           1.6421762   5.413888    0.30      0.7648
  (Income-24.2)*(Income-24.2)   -3.979104   0.570833    -6.97     <.0001
  (Age-8.392)*(Age-8.392)       -4.112892   1.267459    -3.24     0.0041
Since the quadratic coefficients for both income and age are significant, we will consider cubic terms for both income and age.

Response Revenue
Parameter Estimates
  Term                                        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept                                   734.8803    146.2852    5.02      <.0001
  Income                                      17.781924   5.237722    3.39      0.0032
  Age                                         2.8688282   6.581383    0.44      0.6681
  (Income-24.2)*(Income-24.2)                 -3.197419   0.607609    -5.26     <.0001
  (Age-8.392)*(Age-8.392)                     -5.009426   1.501641    -3.34     0.0037
  (Income-24.2)*(Income-24.2)*(Income-24.2)   -0.248737   0.099073    -2.51     0.0218
  (Age-8.392)*(Age-8.392)*(Age-8.392)         0.2871644   0.330796    0.87      0.3968
Since the cubic coefficient for age is not significant, we will use a second-order (quadratic) polynomial for age. Since the cubic coefficient for income is significant, let's consider a fourth-order polynomial for income.
Response Revenue
Parameter Estimates
  Term                                                      Estimate    Std Error   t Ratio   Prob>|t|
  Intercept                                                 746.81501   146.7115    5.09      <.0001
  Income                                                    16.32413    4.872106    3.35      0.0036
  Age                                                       6.3972838   5.319759    1.20      0.2447
  (Income-24.2)*(Income-24.2)                               -4.023468   1.240477    -3.24     0.0045
  (Age-8.392)*(Age-8.392)                                   -4.240435   1.163926    -3.64     0.0019
  (Income-24.2)*(Income-24.2)*(Income-24.2)                 -0.221251   0.090451    -2.45     0.0249
  (Income-24.2)*(Income-24.2)*(Income-24.2)*(Income-24.2)   0.0092713   0.015497    0.60      0.5571
Since the fourth-order coefficient for income is not significant, we will use a third-order (cubic) polynomial for income.

Final Model:
Response Revenue
Parameter Estimates
  Term                                        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept                                   752.87321   143.8674    5.23      <.0001
  Income                                      15.903936   4.739058    3.36      0.0033
  Age                                         6.3016037   5.226739    1.21      0.2428
  (Income-24.2)*(Income-24.2)                 -3.36781    0.571292    -5.90     <.0001
  (Age-8.392)*(Age-8.392)                     -4.166012   1.137538    -3.66     0.0017
  (Income-24.2)*(Income-24.2)*(Income-24.2)   -0.205342   0.084981    -2.42     0.0259
Ê(Revenue | Income = 30, Age = 10) = 752.87 + 15.90*30 + 6.30*10 - 3.37*(30-24.2)² - 4.17*(10-8.392)² - 0.21*(30-24.2)³ ≈ 1128.88
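A quick check of this arithmetic in Python, using the full-precision coefficients from the Final Model parameter estimates:

```python
# Prediction from the final cubic-in-income, quadratic-in-age model.
def predict_revenue(income, age):
    return (752.87321
            + 15.903936 * income
            + 6.3016037 * age
            - 3.36781 * (income - 24.2) ** 2
            - 4.166012 * (age - 8.392) ** 2
            - 0.205342 * (income - 24.2) ** 3)

print(round(predict_revenue(30, 10), 2))  # → 1128.88
```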
Chapter 6: Checking the Assumptions of the Regression Model and Remedies for When the Assumptions are Not Met
Assumptions of Multiple Linear Regression Model
1. Linearity: E(Y | X1 = x1, …, XK = xK) = β0 + β1x1 + … + βKxK.
2. Constant variance: The standard deviation of Y for the subpopulation of units with X1 = x1, …, XK = xK is the same for all subpopulations.
3. Normality: The distribution of Y for the subpopulation of units with X1 = x1, …, XK = xK is normally distributed for all subpopulations.
4. The observations are independent.
Assumptions for linear regression and their importance to inferences
• Point prediction, point estimation: linearity, independence.
• Confidence interval for slope, hypothesis test for slope, confidence interval for mean response: linearity, constant variance, independence, normality (only if n < 30).
• Prediction interval: linearity, constant variance, independence, normality.
Fast Food Chain Data
Response Revenue
Summary of Fit
  RSquare                      0.325221
  RSquare Adj                  0.263877
  Root Mean Square Error       111.6051
  Mean of Response             1085.56
  Observations (or Sum Wgts)   25
Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
  Model      2    132070.87        66035.4       5.3016     0.0132
  Error      22   274025.29        12455.7
  C. Total   24   406096.16
Parameter Estimates
  Term        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept   667.80548   132.3305    5.05      <.0001
  Income      11.429981   4.677122    2.44      0.0230
  Age         16.819467   7.999592    2.10      0.0472
[Figure: Revenue residuals versus predicted Revenue (about 800-1300).]
Checking Linearity
• Plot residuals versus each of the explanatory variables. Each of these plots should look like random scatter, with no pattern in the mean of the residuals.
• If the residual plots show a problem, then we could try to transform the x-variable and/or the y-variable.
• Residual plot: use Fit Y by X with Y being the residuals. Fit Line will draw a horizontal line.
[Figure: Revenue residuals plotted versus Income (15-35) and versus Age (2.5-15).]
Residual Plots in JMP
• After Fit Model, click the red triangle next to Response, click Save Columns, and click Residuals.
• Use Fit Y by X with Y=Residuals and X the explanatory variable of interest. Fit Line will draw a horizontal line with intercept zero. It is a property of the residuals from multiple linear regression that a least squares regression of the residuals on an explanatory variable has slope zero and intercept zero.
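The zero-slope, zero-intercept property is easy to verify numerically. A sketch in Python on simulated data (illustrative only, not the fastfoodchain.jmp data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two explanatory variables and a linear response, for illustration.
n = 100
x1 = rng.normal(25, 5, n)
x2 = rng.normal(8, 2, n)
y = 700 + 12 * x1 + 15 * x2 + rng.normal(0, 50, n)

# Multiple regression of y on x1 and x2 by least squares.
X = np.column_stack([np.ones(n), x1, x2])
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Regress the residuals on x1 alone: the least squares slope and
# intercept are both zero (up to floating-point error), as stated above.
slope, intercept = np.polyfit(x1, resid, 1)
print(slope, intercept)  # both ~ 0
```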
Residual by Predicted Plot
• Fit Model displays the Residual by Predicted Plot automatically in its output.
• The plot is a plot of the residuals versus the predicted Y's, Ŷ_i = Ê(Y | X1 = x_i1, …, XK = x_iK). We can think of the predicted Y's as summarizing all the information in the X's. As usual, we would like this plot to show random scatter.
• Pattern in the mean of the residuals as the predicted Y's increase: indicates a problem with linearity. Look at residual plots versus each explanatory variable to isolate the problem and consider transformations.
• Pattern in the spread of the residuals: indicates a problem with constant variance.
[Figure: Residual by Predicted plot for the fast food data: Revenue residuals versus predicted Revenue (about 800-1300).]
Corrections for Violations of the Linearity Assumption
• When the residual plot shows a pattern in the mean of the residuals for one of the explanatory variables Xj, we should consider:
  – Transforming Xj.
  – Adding polynomial variables in Xj: (Xj)², (Xj)³, etc.
  – Transforming Y.
• After making the transformation/adding polynomials, we need to refit the model and look at the new residual plot vs. X to see if linearity has been achieved.
Response Revenue
Summary of Fit
  RSquare                      0.885175
  RSquare Adj                  0.86221
  Root Mean Square Error       48.28563
  Mean of Response             1085.56
  Observations (or Sum Wgts)   25
Parameter Estimates
  Term                          Estimate    Std Error   t Ratio   Prob>|t|
  Intercept                     1062.4317   72.9538     14.56     <.0001
  Income                        5.4563847   2.162126    2.52      0.0202
  Age                           1.6421762   5.413888    0.30      0.7648
  (Age-8.392)*(Age-8.392)       -4.112892   1.267459    -3.24     0.0041
  (Income-24.2)*(Income-24.2)   -3.979104   0.570833    -6.97     <.0001
Residual by Predicted Plot
[Figure: Revenue residuals versus predicted Revenue (about 700-1300) for the quadratic model.]
Quadratic Polynomials for Age and Income
Fit Y by X Group
[Figure: Bivariate fit of Residual Revenue 2 by Income (15-35), and bivariate fit of Residual Revenue 2 by Age (2.5-15).]
Linearity now appears to be satisfied.
Checking Constant Variance Assumption
• Residual plot versus explanatory variables should exhibit constant variance.
• Residual plot versus predicted values should exhibit constant variance (this plot is often the most useful for detecting nonconstant variance).
Heteroscedasticity
• When the requirement of a constant variance is violated, we have a condition of heteroscedasticity.
• Diagnose heteroscedasticity by plotting the residuals against the predicted y.
[Figure: sketch of residuals plotted against ŷ. The spread increases with ŷ.]
How much traffic would a building generate?
• The goal is to predict how much traffic will be generated by a proposed new building of 150,000 occupied sq ft. (Data are from the MidAtlantic States City Planning Manual.)
• The data tell how many automobile trips per day were made in the AM to office buildings of different sizes.
• The variables are x = "Occupied Sq Ft of floor space in the building (in 1000 sq ft)" and y = "number of automobile trips arriving at the building per day in the morning".
Reducing Nonconstant Variance/Nonnormality by Transformations
• A brief list of transformations:
  » y' = y^(1/2) (for y > 0)
    • Use when s² increases with ŷ.
  » y' = log y (for y > 0)
    • Use when s increases with ŷ.
    • Use when the error distribution is skewed to the right.
  » y' = y²
    • Use when s² is decreasing with ŷ.
    • Use when the error distribution is skewed to the left.
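A small simulation sketch of why the log transformation helps when the standard deviation grows with the mean (illustrative synthetic data, not the textbook data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate data with multiplicative noise: y = x * exp(noise), so the
# standard deviation of y grows with its mean -- the case where the
# transformation y' = log y is recommended.
n = 500
x = rng.uniform(1, 100, n)
y = x * np.exp(rng.normal(0, 0.3, n))

def spread_ratio(xv, resid):
    """Ratio of residual spread in the upper half of x to the lower half."""
    hi = resid[xv > np.median(xv)]
    lo = resid[xv <= np.median(xv)]
    return hi.std() / lo.std()

# Straight-line fit on the raw scale: residual spread grows with x.
raw_resid = y - np.polyval(np.polyfit(x, y, 1), x)
# Fit of log y on log x: the noise is additive with constant variance,
# so the residual spread is roughly the same everywhere.
log_resid = np.log(y) - np.polyval(np.polyfit(np.log(x), np.log(y), 1), np.log(x))

print(spread_ratio(x, raw_resid), spread_ratio(x, log_resid))
# The first ratio is well above 1 (heteroscedastic); the second is near 1.
```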
[Figure: Bivariate fit of AM Trips (0-1500) by Sq Ft (1000) (0-800), with a linear fit.]
Linear Fit
AM Trips = -4.55 + 1.515 Sq Ft (1000)
Summary of Fit
  RSquare                      0.857071
  Observations (or Sum Wgts)   61
Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
  Model      1    3810173.9        3810174       353.7926   <.0001
  Error      59   635401.2         10770
  C. Total   60   4445575.1
[Figure: residuals versus Sq Ft (1000) for the linear fit.]
The heteroscedasticity shows here.
[Figure: Bivariate fit of Log AM Trips by Occup. Sq. Ft. (1000), with the corresponding residual plot.]
To try to fix the heteroscedasticity, we transform Y to Log(Y). This fixes the heteroscedasticity, BUT it creates a nonlinear pattern.
[Figure: Bivariate fit of Log AM Trips by Log(OccupSqFt), with the corresponding residual plot.]
Linear Fit
Log AM Trips = 0.6393 + 0.8033 Log(OccupSqFt)
Summary of Fit
  RSquare   0.827
To fix the nonlinearity, we now transform x to Log(x), without changing Y any further (Y stays on the log scale).
The resulting pattern is both satisfactorily homoscedastic AND linear.
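The stated goal was to predict traffic for a 150,000 sq ft building. A sketch of that prediction from the fitted log-log equation, assuming the Log here is base 10 (the 1.5-3 axis scale for trips in the hundreds suggests base-10 logs, though the notes do not say so explicitly):

```python
import math

# Fitted equation from the slide: Log AM Trips = 0.6393 + 0.8033 Log(OccupSqFt)
# Assumption: both logs are base 10 (not confirmed in the notes).
b0, b1 = 0.6393, 0.8033

def predict_am_trips(occup_sqft_thousands):
    log_pred = b0 + b1 * math.log10(occup_sqft_thousands)
    return 10 ** log_pred  # back-transform from the log10 scale

# Proposed building: 150,000 occupied sq ft = 150 in units of 1000 sq ft.
pred_trips = predict_am_trips(150)
print(round(pred_trips))  # → 244
```

If the logs were actually natural logs, the back-transform would use math.exp instead and give a different answer, so the base matters.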
[Figure: residuals of Log AM Trips versus predicted Log AM Trips.]
Often we will plot residuals versus predicted. For simple regression the two residual plots are equivalent.