Lecture 23: Tues., Dec. 2
• Today:
  – Review of using nominal variables in multiple regression
  – Extra sum of squares F-tests (Ch. 10.3)
  – R-squared statistic (10.4.1)
  – Course evaluations
• Thursday:
  – Residual plots (11.2)
  – Dealing with influential observations (11.3-11.4)
  – Take-home points from course
Info about Final Assignment (no Final Exam)
• Handed out after Thursday’s class (the final class)
• Due Friday, Dec. 12th by 5 p.m.
• Approximate length of a regular homework but will involve somewhat more challenging questions (emphasis will be on multiple regression but questions will involve material from whole course)
• Do not talk to each other about it.
• I will answer general conceptual questions about the course but not specific questions about the assignment [office hours by appointment any time this week and next week]
Separate/Parallel Regression Lines Model
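• Separate regression lines model:
  $\mu\{\text{lenergy} \mid \text{lmass}, \text{TYPE}\} = \text{lmass} + \text{TYPE} + \text{lmass} * \text{TYPE}$
  $\mu\{\text{lenergy} \mid \text{lmass}, \text{TYPE}\} = \beta_0 + \beta_1 I_{nebat} + \beta_2 I_{nebird} + \beta_3\,\text{lmass} + \beta_4\,\text{lmass} * I_{nebat} + \beta_5\,\text{lmass} * I_{nebird}$
• Parallel regression lines model:
  $\mu\{\text{lenergy} \mid \text{lmass}, \text{TYPE}\} = \text{lmass} + \text{TYPE}$
  $\mu\{\text{lenergy} \mid \text{lmass}, \text{TYPE}\} = \beta_0 + \beta_1 I_{nebat} + \beta_2 I_{nebird} + \beta_3\,\text{lmass}$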
Parallel Regression Lines Model
• No strong evidence that echolocating bats use less energy than either nonecholocating bats (p-value = 0.70) or nonecholocating birds (p-value = 0.88) of same body size.
• 95% Confidence interval for difference in mean of log energy for nonecholocating bats and echolocating bats of same body size: (-0.51, 0.35).
• This means that the 95% confidence interval for the ratio of median energy for nonecholocating bats to echolocating bats of same body size is $(e^{-0.51}, e^{0.35}) = (0.60, 1.42)$.
• Summary of findings: Although there is no strong evidence that echolocating bats use less energy than nonecholocating bats of same body size, it is still plausible that they use quite a bit less energy (60% as much at the median). Study is inconclusive.

Parameter Estimates
Term        Estimate    Std Error   Prob>|t|   Lower 95%   Upper 95%
Intercept   -1.497696   0.149869    <.0001     -1.815405   -1.179988
Log Mass     0.8149575  0.044541    <.0001      0.7205339   0.9093811
nebat       -0.078664   0.202679    0.7030     -0.508325    0.3509971
nebird       0.0235982  0.1576      0.8828     -0.3105      0.3576964
Two review points from example
• If the difference in the means of the two populations log(Y1) and log(Y2) is $\delta$, then the ratio of the median of the population of Y1 to the median of the population of Y2 is $e^{\delta}$ (Ch. 3.5, 8.4).
• A nonsignificant (>0.05) p-value does not mean that the null hypothesis is true. It means that there is not strong evidence to reject the null hypothesis. A confidence interval provides information about the range of plausible values for the parameter based on the study.
  – Here study is inconclusive because CI contains both null hypothesis and practically significant alternative (situation D of Display 23.1). If possible, choose sample size to avoid situation D.
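The back-transformation in the first review point can be checked numerically. The sketch below exponentiates the endpoints of the 95% interval for the nebat coefficient from the Parameter Estimates table; the variable names are illustrative and not part of the lecture.

```python
import math

# 95% CI for the nebat coefficient (difference in mean log energy),
# copied from the Parameter Estimates table.
lower_log, upper_log = -0.508325, 0.3509971

# Exponentiating each endpoint gives a CI for the ratio of medians
# (nonecholocating bats to echolocating bats of the same body size).
ratio_ci = (math.exp(lower_log), math.exp(upper_log))

print(tuple(round(v, 2) for v in ratio_ci))
```

The lower endpoint of about 0.60 is the "60% as much at the median" figure quoted in the summary of findings.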
Nominal Variables in JMP
• To automatically incorporate a nominal variable in JMP, make the modeling type nominal, fit the model, and then click the red triangle next to Response, then Estimates, then Expanded Estimates.
• JMP creates variables for each level of the nominal variable that represent the difference between that level and the average of all the levels.
• To look directly at the difference between two levels of a categorical variable, it is easier to make your own indicator variables, leaving out one of the levels [Use this method for homework].
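Outside JMP, hand-made indicator variables could look like the following minimal sketch. The helper name `make_indicators`, the level label "ebat" for echolocating bats, and the five sample rows are assumptions made for illustration; they are not taken from the course data.

```python
def make_indicators(levels, reference):
    """Build one 0/1 column per level, leaving out the reference level."""
    kept = sorted(set(levels) - {reference})
    return {lvl: [1 if x == lvl else 0 for x in levels] for lvl in kept}

# Hypothetical TYPE column; "ebat" (echolocating bat) is the omitted level.
types = ["ebat", "nebat", "nebird", "nebat", "ebat"]
indicators = make_indicators(types, reference="ebat")

print(indicators)
```

With the reference level left out, each indicator's coefficient is the difference between that level and the reference level at the same value of the other explanatory variables.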
Prediction Intervals
• Predicted value of y for $x_1,\ldots,x_p$: $\hat{\mu}\{y \mid x_1,\ldots,x_p\} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p$
• To find a 95% prediction interval for the log energy of a flying vertebrate of a given type and mass,
  – Fit the multiple regression model (i.e., for parallel regression lines model, fit $\mu\{\text{lenergy} \mid \text{lmass}, \text{TYPE}\} = \beta_0 + \beta_1 I_{nebat} + \beta_2 I_{nebird} + \beta_3\,\text{lmass}$)
  – Click red triangle next to response log energy, click save columns, click predicted values and also click indiv confid interval. This saves the predicted values, lower 95% prediction interval endpoint and upper 95% prediction interval endpoint for each observation in data set.
  – For an echolocating bat with body mass 8g, the prediction interval for log energy is (-0.393, 0.499) [for energy it is $(e^{-0.393}, e^{0.499}) = (0.675, 1.647)$].
  – Use mean confid interval for confidence intervals for mean response.
Nominal Variables Example
• An analysis has shown that the time required in minutes to complete a production run increases with the number of items produced. Data were collected for the time required to process 20 randomly selected orders as supervised by three managers. How do the managers compare?
One Way Layout Analysis
• One way layout analysis does not take into account run size. Manager C might be a better manager but have been supervising smaller production runs.
[Figure: Oneway Analysis of Time for Run By Manager. Vertical axis: Time for Run (150 to 300); horizontal axis: Manager (a, b, c).]
Separate/Parallel Regression Lines Model
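• Separate regression lines model:
  $\mu\{\text{runtime} \mid \text{runsize}, \text{MANAGER}\} = \text{runsize} + \text{MANAGER} + \text{runsize} * \text{MANAGER}$
  $\mu\{\text{runtime} \mid \text{runsize}, \text{MANAGER}\} = \beta_0 + \beta_1 I_{ManagerA} + \beta_2 I_{ManagerB} + \beta_3\,\text{runsize} + \beta_4\,\text{runsize} * I_{ManagerA} + \beta_5\,\text{runsize} * I_{ManagerB}$
• Parallel regression lines model:
  $\mu\{\text{runtime} \mid \text{runsize}, \text{MANAGER}\} = \text{runsize} + \text{MANAGER}$
  $\mu\{\text{runtime} \mid \text{runsize}, \text{MANAGER}\} = \beta_0 + \beta_1 I_{ManagerA} + \beta_2 I_{ManagerB} + \beta_3\,\text{runsize}$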
Coded Scatterplot
• Can follow procedure using graph – overlay plot from last lecture.
• Shortcut: Click Rows, Color or Mark by Column and then highlight Manager. Now use Fit Y by X with Y=Time for Run and X=Run Size.
[Figure: Bivariate Fit of Time for Run By Run Size. Vertical axis: Time for Run (150 to 300); horizontal axis: Run Size (50 to 300).]
Model Fits
• Parallel Regression Lines Model

Parameter Estimates
Term          Estimate    Std Error   Prob>|t|   Lower 95%   Upper 95%
Intercept     152.53408   6.282065    <.0001     139.94454   165.12362
I-Manager A   62.178074   5.194105    <.0001     51.768855   72.587293
I-Manager B   8.3438732   5.317727    0.1224     -2.31309    19.000836
Run Size      0.2454321   0.025265    <.0001     0.1947992   0.296065

• Separate Regression Lines Model

Parameter Estimates
Term                   Estimate    Std Error   Prob>|t|   Lower 95%   Upper 95%
Intercept              149.7477    8.084041    <.0001     133.53317   165.96224
I-Manager A            52.786451   12.352      <.0001     28.011482   77.561419
I-Manager B            35.398214   14.51216    0.0181     6.2905144   64.505913
Run Size               0.2592431   0.036053    <.0001     0.1869294   0.3315568
I-Manager A*Run Size   0.0480219   0.056812    0.4018     -0.065928   0.161972
I-Manager B*Run Size   -0.118178   0.061188    0.0588     -0.240906   0.0045504

• How do we test whether parallel regression lines model is appropriate ($H_0: \beta_4 = \beta_5 = 0$)?
Extra Sum of Squares F-tests
• Suppose we want to test whether multiple coefficients are equal to zero, e.g., test $H_0: \beta_1 = \beta_2 = 0$ in the model
  $\mu\{\text{lenergy} \mid \text{lmass}, \text{SPECIES}\} = \beta_0 + \beta_1 I_{nebat} + \beta_2 I_{nebird} + \beta_3\,\text{lmass}$
• t-tests, either individually or in combination, cannot be used to test such a hypothesis involving more than one parameter.
• F-test for joint significance of several terms:
  $F\text{-statistic} = \dfrac{(\text{Sum of squared errors of reduced model} - \text{Sum of squared errors of full model}) / \text{Number of betas being tested}}{\text{Estimate of } \sigma^2 \text{ from full model}}$
Extra Sum of Squares F-test
• Under $H_0$: all betas being tested equal zero, the F-statistic has an F distribution with (number of betas being tested, n-(p+1)) degrees of freedom.
• p-value can be found by using Table A.4 or by creating a Formula in JMP with probability, F distribution and then putting in the value of the F-statistic for F and the appropriate degrees of freedom. This gives P(F random variable with those degrees of freedom < observed F-statistic), which equals 1 – p-value.
Extra Sum of Squares F-test example
• Testing parallel regression lines model (H0, reduced model) vs. separate regression lines model (full model) in manager example
• Full model:

Summary of Fit
Root Mean Square Error   15.7761
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Ratio
Model     5   67040.713        13408.1       53.8728
Error    53   13190.915          248.9

• Reduced model:

Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Ratio
Model     3   65401.417        21800.5       80.8502
Error    55   14830.210          269.6

• F-statistic: $F = \dfrac{(14830.210 - 13190.915)/2}{15.7761^2} = 3.29$
• p-value: P(F random variable with 2,53 df > 3.29) = 1 - 0.955 = 0.045
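The arithmetic for this test can be reproduced from the two Analysis of Variance tables. The sketch below is one way to do it with the Python standard library. The function names are illustrative, and the tail probability uses a closed form, P(F > f) = (1 + 2f/d)^(-d/2), that holds only when the numerator degrees of freedom equal 2 (as they do here); it is not a general F-distribution routine.

```python
def extra_ss_f(sse_reduced, sse_full, n_tested, sigma2_full):
    """Extra sum of squares F-statistic."""
    return ((sse_reduced - sse_full) / n_tested) / sigma2_full

def f_upper_tail_2df(f_stat, df_denom):
    """P(F > f_stat) for an F distribution with exactly 2 numerator df."""
    return (1.0 + 2.0 * f_stat / df_denom) ** (-df_denom / 2.0)

# Error sums of squares and error df from the JMP output above.
sse_reduced = 14830.210      # parallel lines model, 55 df
sse_full = 13190.915         # separate lines model, 53 df
df_full = 53
sigma2_full = sse_full / df_full   # mean square error of the full model

f_stat = extra_ss_f(sse_reduced, sse_full, n_tested=2, sigma2_full=sigma2_full)
p_value = f_upper_tail_2df(f_stat, df_full)

print(round(f_stat, 2), round(p_value, 3))
```

Using the mean square error (13190.915 / 53) is the same as squaring the Root Mean Square Error of 15.7761, up to rounding.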
Second Example of F-test
• For echolocation study, in parallel regression model, test $H_0: \beta_1 = \beta_2 = 0$ in
  $\mu\{\text{lenergy} \mid \text{lmass}, \text{SPECIES}\} = \beta_0 + \beta_1 I_{nebat} + \beta_2 I_{nebird} + \beta_3\,\text{lmass}$
• Full model:

Summary of Fit
Root Mean Square Error   0.185963
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Ratio
Model     3   29.421483        9.80716       283.5887
Error    16   0.553317         0.03458

• Reduced model:

Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Ratio
Model     1   29.391909        29.3919       907.6384
Error    18   0.582891         0.0324

• F-statistic: $F = \dfrac{(29.421483 - 29.391909)/2}{0.185963^2} = 0.43$ (the difference in model sums of squares equals the difference in error sums of squares, 0.582891 - 0.553317)
• p-value: P(F random variable with 2,16 degrees of freedom > 0.43) = 0.66
Manager Example Findings
• The runs supervised by Manager a appear abnormally time consuming. Manager b has high initial fixed setup costs, but the time per unit is the best of the three. Manager c has the lowest fixed costs and per unit production time in between managers a and b.
• Adjustments to marginal analysis via regression only control for possible differences in size among production runs. Other differences might be relevant, e.g., difficulty of production runs. It could be that Manager a supervised most difficult production runs.

Parameter Estimates (separate regression lines model)
Term                   Estimate    Std Error   Prob>|t|   Lower 95%   Upper 95%
Intercept              149.7477    8.084041    <.0001     133.53317   165.96224
I-Manager A            52.786451   12.352      <.0001     28.011482   77.561419
I-Manager B            35.398214   14.51216    0.0181     6.2905144   64.505913
Run Size               0.2592431   0.036053    <.0001     0.1869294   0.3315568
I-Manager A*Run Size   0.0480219   0.056812    0.4018     -0.065928   0.161972
I-Manager B*Run Size   -0.118178   0.061188    0.0588     -0.240906   0.0045504
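The fixed-cost and per-unit comparisons in the findings follow from combining the separate regression lines estimates into one intercept and one slope per manager. A minimal sketch, with dictionary keys chosen here for illustration and Manager c as the level without an indicator:

```python
# Estimates from the separate regression lines model.
coef = {
    "intercept": 149.7477,
    "I_A": 52.786451,
    "I_B": 35.398214,
    "run_size": 0.2592431,
    "I_A_x_run_size": 0.0480219,
    "I_B_x_run_size": -0.118178,
}

# (intercept, slope) of the fitted line for each manager.
lines = {
    "a": (coef["intercept"] + coef["I_A"], coef["run_size"] + coef["I_A_x_run_size"]),
    "b": (coef["intercept"] + coef["I_B"], coef["run_size"] + coef["I_B_x_run_size"]),
    "c": (coef["intercept"], coef["run_size"]),
}

for manager, (intercept, slope) in lines.items():
    print(manager, round(intercept, 1), round(slope, 3))
```

The intercepts play the role of fixed setup time and the slopes the time per additional item, which gives the ordering described in the first bullet.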
Special Cases of F-test
• Multiple Regression Model: $\mu\{Y \mid X_1,\ldots,X_p\} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
• If we want to test if one $\beta$ equals zero, e.g., $H_0: \beta_1 = 0$, F-test is equivalent to t-test.
• Suppose we want to test $H_0: \beta_1 = \cdots = \beta_p = 0$, i.e., null hypothesis is that the mean of Y does not depend on any of the explanatory variables.
• JMP automatically computes this test under Analysis of Variance, Prob>F. For separate regression lines model, strong evidence that mean run time does depend on at least one of run size, manager.

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
Model       5   67040.713        13408.1       53.8728    <.0001
Error      53   13190.915          248.9
C. Total   58   80231.627
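The F Ratio that JMP reports for this overall test is the model mean square divided by the error mean square. A short check against the table, with variable names chosen here for illustration:

```python
# Sums of squares and degrees of freedom from the Analysis of Variance table.
model_ss, model_df = 67040.713, 5
error_ss, error_df = 13190.915, 53

mean_square_model = model_ss / model_df
mean_square_error = error_ss / error_df
f_ratio = mean_square_model / mean_square_error

print(round(mean_square_model, 1), round(mean_square_error, 1), round(f_ratio, 4))
```

This is the extra sum of squares F-test with the intercept-only model as the reduced model, so all p = 5 betas are tested at once.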
The R-Squared Statistic
• For separate regression lines model in production time example, Summary of Fit gives RSquare = 0.83559.
• Similar interpretation as in simple linear regression. The R-squared statistic is the proportion of the variation in y explained by the multiple regression model:
  $R^2 = \dfrac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}}$
• Total Sum of Squares: $\sum_{i=1}^{n} (y_i - \bar{y})^2$
• Residual Sum of Squares: $\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip})^2$
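The reported RSquare can be recovered from the sums of squares in the Analysis of Variance table for the separate regression lines model (C. Total and Error rows). A minimal check, with illustrative variable names:

```python
# From the Analysis of Variance table for the separate regression lines model.
total_ss = 80231.627      # C. Total
residual_ss = 13190.915   # Error

r_squared = (total_ss - residual_ss) / total_ss

print(round(r_squared, 5))
```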
Assumptions of Multiple Linear Regression Model
• Assumptions of multiple linear regression:
  – For each subpopulation $x_1,\ldots,x_p$,
    • (A-1A) $\mu\{Y \mid X_1,\ldots,X_p\} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
    • (A-1B) $Var(Y \mid X_1,\ldots,X_p) = \sigma^2$
    • (A-1C) The distribution of $Y \mid X_1,\ldots,X_p$ is normal
    [Distribution of residuals should not depend on $x_1,\ldots,x_p$]
  – (A-2) The observations are independent of one another
Checking/Refining Model
• Tools for checking (A-1A) and (A-1B)
  – Residual plots versus predicted (fitted) values
  – Residual plots versus explanatory variables $x_1,\ldots,x_p$
  – If model is correct, there should be no pattern in the residual plots
• Tool for checking (A-1C)
  – Normal quantile plot
• Tool for checking (A-2)
  – Residual plot versus time or spatial order of observations