Lecture 23: Tues., Dec. 2 Today: –Review of using nominal variables in multiple regression...

Lecture 23: Tues., Dec. 2

• Today:
  – Review of using nominal variables in multiple regression
  – Extra sum of squares F-tests (Ch. 10.3)
  – R-squared statistic (10.4.1)
  – Course evaluations

• Thursday:
  – Residual plots (11.2)
  – Dealing with influential observations (11.3-11.4)
  – Take home points from course

Info about Final Assignment (no Final Exam)

• Handed out after Thursday's class (the final class)

• Due Friday, Dec. 12th by 5 p.m.

• Approximate length of a regular homework but will involve somewhat more challenging questions (emphasis will be on multiple regression but questions will involve material from whole course)

• Do not talk to each other about it.

• I will answer general conceptual questions about the course but not specific questions about the assignment [office hours by appointment any time this week and next week]

Separate/Parallel Regression Lines Model

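• Separate regression lines model:

$\mu\{lenergy \mid lmass, TYPE\} = lmass + TYPE + lmass \times TYPE$

$\mu\{lenergy \mid lmass, TYPE\} = \beta_0 + \beta_1 I_{nebat} + \beta_2 I_{nebird} + \beta_3\,lmass + \beta_4\,lmass \times I_{nebat} + \beta_5\,lmass \times I_{nebird}$

• Parallel regression lines model:

$\mu\{lenergy \mid lmass, TYPE\} = lmass + TYPE$

$\mu\{lenergy \mid lmass, TYPE\} = \beta_0 + \beta_1 I_{nebat} + \beta_2 I_{nebird} + \beta_3\,lmass$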

Parallel Regression Lines Model

• No strong evidence that echolocating bats use less energy than either nonecholocating bats (p-value = 0.70) or nonecholocating birds (p-value = 0.88) of same body size.

• 95% Confidence interval for difference in mean of log energy for nonecholocating bats and echolocating bats of same body size: (-0.51, 0.35).

• This means that a 95% confidence interval for the ratio of median energy for nonecholocating bats to echolocating bats of same body size is $(e^{-0.51}, e^{0.35}) = (0.60, 1.42)$.

• Summary of findings: Although there is no strong evidence that echolocating bats use less energy than nonecholocating bats of same body size, it is still plausible that they use quite a bit less energy (as little as about 70% as much at the median, since the ratio above can be as large as 1.42 and 1/1.42 ≈ 0.70). Study is inconclusive.

Parameter Estimates

Term        Estimate     Std Error   Prob>|t|   Lower 95%    Upper 95%
Intercept   -1.497696    0.149869    <.0001     -1.815405    -1.179988
Log Mass     0.8149575   0.044541    <.0001      0.7205339    0.9093811
nebat       -0.078664    0.202679    0.7030     -0.508325     0.3509971
nebird       0.0235982   0.1576      0.8828     -0.3105       0.3576964

Two review points from example

• If the difference in the means of the two populations log(Y1) and log(Y2) is $\delta$, then the ratio of the medians of the populations of Y1 and Y2 is $e^{\delta}$ (Ch. 3.5, 8.4)

• A nonsignificant (>0.05) p-value does not mean that the null hypothesis is true. It means that there is not strong evidence to reject the null hypothesis. A confidence interval provides information about the range of plausible values for the parameter based on the study.
  – Here study is inconclusive because CI contains both null hypothesis and practically significant alternative (situation D of Display 23.1). If possible, choose sample size to avoid situation D.
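The first review point can be checked numerically. The sketch below is only an illustration: the function name is hypothetical, and the inputs are the Lower 95% and Upper 95% values reported for the nebat term in the Parameter Estimates table of the parallel regression lines model.

```python
import math

def ratio_of_medians_ci(lower_log_diff, upper_log_diff):
    """Back-transform a CI for a difference in mean log(Y)
    into a CI for the ratio of medians of Y (hypothetical helper)."""
    return math.exp(lower_log_diff), math.exp(upper_log_diff)

# 95% CI for the nebat coefficient (difference in mean log energy,
# nonecholocating bats minus echolocating bats of the same body size)
lower, upper = ratio_of_medians_ci(-0.508325, 0.3509971)
print(f"ratio of medians: ({lower:.2f}, {upper:.2f})")
```

Exponentiating the endpoints is valid because the exponential function is increasing, so the order of the endpoints is preserved.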

Nominal Variables in JMP

• To automatically incorporate a nominal variable in JMP, make the modeling type nominal, fit the model and then click the red triangle next to Response, then Estimates, then Expanded Estimates.

• JMP creates variables for each level of the nominal variable that represent the difference between that level and the average of all the levels.

• To look directly at the difference between two levels of a categorical variable, it is easier to make your own indicator variables, leaving out one of the levels [Use this method for homework].
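Outside JMP, making your own indicator variables amounts to adding one 0/1 column per level and leaving out the level chosen as the reference. A minimal sketch in Python, assuming a hypothetical helper name and a few made-up rows of the Manager variable:

```python
def make_indicators(values, reference):
    """Return one 0/1 indicator column per level of a nominal variable,
    leaving out the reference level (hypothetical helper)."""
    levels = sorted(set(values) - {reference})
    return {
        f"I_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }

# Made-up rows for illustration only; Manager c is the omitted level
manager = ["a", "b", "c", "a", "c"]
indicators = make_indicators(manager, reference="c")
for name, column in indicators.items():
    print(name, column)
```

With this coding, each indicator's coefficient is the difference between that level and the omitted level, holding the other explanatory variables fixed.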

Prediction Intervals

• Predicted value of y for $x_1,\ldots,x_p$: $\hat{\mu}\{y \mid x_1,\ldots,x_p\} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p$

• To find a 95% prediction interval for the log energy of a flying vertebrate of a given type and mass,
  – Fit the multiple regression model (i.e., for parallel regression lines model, fit $\mu\{lenergy \mid lmass, TYPE\} = \beta_0 + \beta_1 I_{nebat} + \beta_2 I_{nebird} + \beta_3\,lmass$)
  – Click red triangle next to response log energy, click Save Columns, click Predicted Values and also click Indiv Confid Interval. This saves the predicted values, lower 95% prediction interval endpoint and upper 95% prediction interval endpoint for each observation in data set.
  – For an echolocating bat with body mass 8g, the prediction interval for log energy is (-0.393, 0.499) [for energy it is $(e^{-0.393}, e^{0.499}) = (0.675, 1.647)$].
  – Use Mean Confid Interval for confidence intervals for mean response.

Nominal Variables Example

• An analysis has shown that the time required in minutes to complete a production run increases with the number of items produced. Data were collected for the time required to process 20 randomly selected orders as supervised by three managers. How do the managers compare?

One Way Layout Analysis

• One way layout analysis does not take into account run size. Manager C might be a better manager but have been supervising smaller production runs

[Figure: Oneway Analysis of Time for Run By Manager; y-axis: Time for Run (150 to 300), x-axis: Manager (a, b, c)]

Separate/Parallel Regression Lines Model

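• Separate regression lines model:

$\mu\{runtime \mid runsize, MANAGER\} = runsize + MANAGER + runsize \times MANAGER$

$\mu\{runtime \mid runsize, MANAGER\} = \beta_0 + \beta_1 I_{ManagerA} + \beta_2 I_{ManagerB} + \beta_3\,runsize + \beta_4\,runsize \times I_{ManagerA} + \beta_5\,runsize \times I_{ManagerB}$

• Parallel regression lines model:

$\mu\{runtime \mid runsize, MANAGER\} = runsize + MANAGER$

$\mu\{runtime \mid runsize, MANAGER\} = \beta_0 + \beta_1 I_{ManagerA} + \beta_2 I_{ManagerB} + \beta_3\,runsize$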

Coded Scatterplot

• Can follow procedure using graph – overlay plot from last lecture.

• Shortcut: Click Rows, Color or Mark by Column and then highlight Manager. Now use Fit Y by X with Y=Time for Run and X=Run Size.

[Figure: Bivariate Fit of Time for Run By Run Size; y-axis: Time for Run (150 to 300), x-axis: Run Size (50 to 300)]

Model Fits

• Parallel Regression Lines Model

Term           Estimate    Std Error   Prob>|t|   Lower 95%   Upper 95%
Intercept      152.53408   6.282065    <.0001     139.94454   165.12362
I-Manager A    62.178074   5.194105    <.0001     51.768855   72.587293
I-Manager B    8.3438732   5.317727    0.1224     -2.31309    19.000836
Run Size       0.2454321   0.025265    <.0001     0.1947992   0.296065

• Separate Regression Lines Model

Term                   Estimate    Std Error   Prob>|t|   Lower 95%   Upper 95%
Intercept              149.7477    8.084041    <.0001     133.53317   165.96224
I-Manager A            52.786451   12.352      <.0001     28.011482   77.561419
I-Manager B            35.398214   14.51216    0.0181     6.2905144   64.505913
Run Size               0.2592431   0.036053    <.0001     0.1869294   0.3315568
I-Manager A*Run Size   0.0480219   0.056812    0.4018     -0.065928   0.161972
I-Manager B*Run Size   -0.118178   0.061188    0.0588     -0.240906   0.0045504

• How do we test whether parallel regression lines model is appropriate ($H_0: \beta_4 = \beta_5 = 0$)?

Extra Sum of Squares F-tests

• Suppose we want to test whether multiple coefficients are equal to zero, e.g., in the model

$\mu\{lenergy \mid lmass, SPECIES\} = \beta_0 + \beta_1 I_{nebat} + \beta_2 I_{nebird} + \beta_3\,lmass$

test $H_0: \beta_1 = \beta_2 = 0$.

• t-tests, either individually or in combination, cannot be used to test such a hypothesis involving more than one parameter.

• F-test for joint significance of several terms:

$F\text{-statistic} = \dfrac{(\text{Sum of squared errors of reduced model} - \text{Sum of squared errors of full model}) / \text{Number of betas being tested}}{\text{Estimate of } \sigma^2 \text{ from full model}}$

Extra Sum of Squares F-test

• Under $H_0$: all betas being tested equal zero, the F-statistic has an F distribution with (number of betas being tested, n-(p+1)) degrees of freedom.

• p-value can be found by using Table A.4 or creating a Formula in JMP with probability, F distribution and then putting in the value of the F-statistic for F and the appropriate degrees of freedom. This gives P(F random variable with those degrees of freedom < observed F-statistic), which equals 1 – p-value.

Extra Sum of Squares F-test example

• Testing parallel regression lines model (H0, reduced model) vs. separate regression lines model (full model) in manager example

• Full model (separate regression lines):

Summary of Fit
Root Mean Square Error   15.7761

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio
Model       5   67040.713        13408.1       53.8728
Error      53   13190.915        248.9         Prob > F
C. Total   58   80231.627                      <.0001

• Reduced model (parallel regression lines): sum of squared errors 14830.210

• F-statistic: $F = \dfrac{(14830.210 - 13190.915)/2}{15.7761^2} = 3.29$

• p-value: P(F random variable with 2,53 df > 3.29) = 1 - 0.955 = 0.045

Second Example of F-test

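• For echolocation study, in parallel regression model $\mu\{lenergy \mid lmass, SPECIES\} = \beta_0 + \beta_1 I_{nebat} + \beta_2 I_{nebird} + \beta_3\,lmass$, test $H_0: \beta_1 = \beta_2 = 0$

• Full model: the parallel regression lines model above

Summary of Fit
Root Mean Square Error   0.185963

• Reduced model: $\mu\{lenergy \mid lmass\} = \beta_0 + \beta_3\,lmass$

• F-statistic: $F = \dfrac{(29.421483 - 29.39109)/2}{0.185963^2} = 0.44$ (the numerator uses the model sums of squares of the full and reduced models; their difference equals the difference in sums of squared errors)

• p-value: P(F random variable with 2,16 degrees of freedom > 0.44) = 1 - 0.3484 = 0.6516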
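The two F-test examples can be reproduced with a few lines of Python. This is a sketch, not the JMP procedure: the function names are hypothetical, and the p-value function uses a closed form that holds only when exactly two betas are tested (numerator degrees of freedom equal to 2), which happens to be the case in both examples. For other numerators, a table or statistical software is needed.

```python
def extra_ss_f(extra_sum_of_squares, n_betas_tested, sigma2_full):
    """Extra sum of squares F-statistic (hypothetical helper).
    extra_sum_of_squares: SSE(reduced) - SSE(full)
    sigma2_full: estimate of sigma^2 from the full model (RMSE squared)"""
    return (extra_sum_of_squares / n_betas_tested) / sigma2_full

def f_pvalue_two_betas(f_stat, df_error):
    """P(F > f_stat) for an F distribution with 2 and df_error degrees
    of freedom. Closed form valid ONLY for numerator df = 2."""
    return (1.0 + 2.0 * f_stat / df_error) ** (-df_error / 2.0)

# Manager example: reduced SSE 14830.210, full SSE 13190.915, RMSE 15.7761
f_manager = extra_ss_f(14830.210 - 13190.915, 2, 15.7761 ** 2)
p_manager = f_pvalue_two_betas(f_manager, 53)
print(f"manager: F = {f_manager:.2f}, p-value = {p_manager:.3f}")

# Echolocation example: F-statistic 0.44 on 2 and 16 degrees of freedom
p_bats = f_pvalue_two_betas(0.44, 16)
print(f"echolocation: p-value = {p_bats:.4f}")
```

The p-values agree with the slide values of 0.045 and 0.6516 to the precision shown there.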

Manager Example Findings

• The runs supervised by Manager a appear abnormally time consuming. Manager b has high initial fixed setup costs, but the time per unit is the best of the three. Manager c has the lowest fixed costs and per unit production time in between managers a and b.

• Adjustments to marginal analysis via regression only control for possible differences in size among production runs. Other differences might be relevant, e.g., difficulty of production runs. It could be that Manager a supervised the most difficult production runs.


Term                   Estimate    Std Error   Prob>|t|   Lower 95%   Upper 95%
Intercept              149.7477    8.084041    <.0001     133.53317   165.96224
I-Manager A            52.786451   12.352      <.0001     28.011482   77.561419
I-Manager B            35.398214   14.51216    0.0181     6.2905144   64.505913
Run Size               0.2592431   0.036053    <.0001     0.1869294   0.3315568
I-Manager A*Run Size   0.0480219   0.056812    0.4018     -0.065928   0.161972
I-Manager B*Run Size   -0.118178   0.061188    0.0588     -0.240906   0.0045504
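To see how the findings follow from the estimates, the separate regression lines fit can be turned into one fitted line per manager. The sketch below assumes Manager c is the omitted reference level (only indicators for A and B appear in the table); the dictionary keys and function name are illustrative.

```python
# Estimates copied from the separate regression lines parameter estimates;
# Manager c is assumed to be the omitted (reference) level.
estimates = {
    "Intercept": 149.7477,
    "I-Manager A": 52.786451,
    "I-Manager B": 35.398214,
    "Run Size": 0.2592431,
    "I-Manager A*Run Size": 0.0480219,
    "I-Manager B*Run Size": -0.118178,
}

def fitted_line(manager):
    """Return (intercept, slope) of the fitted run-time line
    for manager 'a', 'b' or 'c' (hypothetical helper)."""
    intercept = estimates["Intercept"]
    slope = estimates["Run Size"]
    if manager in ("a", "b"):
        label = f"I-Manager {manager.upper()}"
        intercept += estimates[label]               # shift in fixed setup time
        slope += estimates[f"{label}*Run Size"]     # shift in time per item
    return intercept, slope

lines = {m: fitted_line(m) for m in "abc"}
for m, (b0, b1) in lines.items():
    print(f"Manager {m}: time = {b0:.2f} + {b1:.4f} * run size")
```

Under these estimates Manager b has the smallest slope (time per item) and Manager c the smallest intercept (fixed setup time), with Manager c's slope between those of a and b.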

Special Cases of F-test

• Multiple Regression Model: $\mu\{Y \mid X\} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$

• If we want to test if one $\beta$ equals zero, e.g., $H_0: \beta_1 = 0$, F-test is equivalent to t-test.

• Suppose we want to test $H_0: \beta_1 = \cdots = \beta_p = 0$, i.e., null hypothesis is that the mean of Y does not depend on any of the explanatory variables.

• JMP automatically computes this test under Analysis of Variance, Prob>F. For separate regression lines model, strong evidence that mean run time does depend on at least one of run size, manager.

Analysis of Variance: C. Total DF 58, Sum of Squares 80231.627, Prob > F <.0001
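As a worked check, the overall F-test can be written as an extra sum of squares F-test in which the reduced model contains only the intercept, so that its sum of squared errors is the total sum of squares. Using the Analysis of Variance values reported for the separate regression lines model in the manager example (C. Total 80231.627, Error 13190.915 on 53 degrees of freedom, 5 betas tested):

```latex
\[
F = \frac{(80231.627 - 13190.915)/5}{13190.915/53}
  = \frac{13408.14}{248.89}
  \approx 53.87
\]
```

This agrees with the F Ratio of 53.8728 in the JMP output for that model.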

The R-Squared Statistic

• For separate regression lines model in production time example,

Summary of Fit
RSquare   0.83559

• Similar interpretation as in simple linear regression. The R-squared statistic is the proportion of the variation in y explained by the multiple regression model:

$R^2 = \dfrac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}}$

• Total Sum of Squares: $\sum_{i=1}^{n} (y_i - \bar{y})^2$

• Residual Sum of Squares: $\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip})^2$
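The reported RSquare can be reproduced from the sums of squares in the Analysis of Variance output for the separate regression lines model (C. Total 80231.627, Error 13190.915). A short sketch, with a hypothetical function name:

```python
def r_squared(total_ss, residual_ss):
    """Proportion of the variation in y explained by the regression model."""
    return (total_ss - residual_ss) / total_ss

# Sums of squares from the separate regression lines model, manager example
r2 = r_squared(total_ss=80231.627, residual_ss=13190.915)
print(f"R-squared = {r2:.5f}")
```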

Assumptions of Multiple Linear Regression Model

• Assumptions of multiple linear regression:
  – For each subpopulation $x_1,\ldots,x_p$,
    • (A-1A) $\mu\{Y \mid X_1,\ldots,X_p\} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
    • (A-1B) $Var(Y \mid X_1,\ldots,X_p) = \sigma^2$
    • (A-1C) The distribution of $Y \mid X_1,\ldots,X_p$ is normal
    [Distribution of residuals should not depend on $x_1,\ldots,x_p$]
  – (A-2) The observations are independent of one another

Checking/Refining Model

• Tools for checking (A-1A) and (A-1B)
  – Residual plots versus predicted (fitted) values
  – Residual plots versus explanatory variables $x_1,\ldots,x_p$
  – If model is correct, there should be no pattern in the residual plots

• Tool for checking (A-1C)
  – Normal quantile plot

• Tool for checking (A-2)
  – Residual plot versus time or spatial order of observations
