2DS00 Statistics 1 for Chemical Engineering, lecture 4

Week schedule

Week 1: Measurement and statistics

Week 2: Error propagation

Week 3: Simple linear regression analysis

Week 4: Multiple linear regression analysis

Week 5: Nonlinear regression analysis

Detailed contents of week 4

• multiple linear regression

• polynomial regression

• interaction

• multicollinearity

• measures of model adequacy

• selection of regression models

Specific heat

• specific heat of vapour at constant pressure as a function of temperature

• data set from Perry’s Chemical Engineers’ Handbook

• thermodynamic theory says that a quadratic relation between temperature and specific heat usually suffices:

$C_p = \beta_0 + \beta_1 T + \beta_2 T^2$

Scatter plot of specific heat data

[Figure: plot of Cp vs T; T axis from 250 to 400, Cp axis from 1800 to 2200]

Regression output for specific heat data

Multiple Regression Analysis
-----------------------------------------------------------------------------
Dependent variable: Cp
-----------------------------------------------------------------------------
                               Standard          T
Parameter       Estimate         Error       Statistic        P-Value
-----------------------------------------------------------------------------
CONSTANT         3590,36        76,3041        47,0533         0,0000
T               -12,1386       0,454369       -26,7153         0,0000
T^2            0,0213415    0,000670762        31,8169         0,0000
-----------------------------------------------------------------------------

Analysis of Variance
-----------------------------------------------------------------------------
Source          Sum of Squares    Df    Mean Square    F-Ratio    P-Value
-----------------------------------------------------------------------------
Model                 169252,0     2        84626,2    6227,13     0,0000
Residual               285,388    21        13,5899
-----------------------------------------------------------------------------
Total (Corr.)         169538,0    23

R-squared = 99,8317 percent
R-squared (adjusted for d.f.) = 99,8156 percent
Standard Error of Est. = 3,68645
Mean absolute error = 2,94042
Durbin-Watson statistic = 0,310971 (P=0,0000)
Lag 1 residual autocorrelation = 0,640511
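The summary statistics in this output follow directly from the analysis-of-variance table. As a quick check, the sketch below recomputes them in Python from the printed sums of squares and degrees of freedom (decimal commas written as points). The variable names are illustrative only, and small differences in the last digits come from rounding in the printed values.

```python
import math

# Values copied from the regression output above (decimal commas written as points).
ss_model, df_model = 169252.0, 2
ss_resid, df_resid = 285.388, 21
ss_total, df_total = 169538.0, 23

ms_model = ss_model / df_model                   # mean square of the model
ms_resid = ss_resid / df_resid                   # mean square error (MSE)

f_ratio = ms_model / ms_resid                    # overall F test of the model
r2 = 1 - ss_resid / ss_total                     # coefficient of determination
r2_adj = 1 - ms_resid / (ss_total / df_total)    # adjusted for degrees of freedom
std_err = math.sqrt(ms_resid)                    # standard error of the estimate

# t statistic of a parameter = estimate / standard error (here for T^2)
t_T2 = 0.0213415 / 0.000670762

print(f"F = {f_ratio:.1f}, R2 = {100 * r2:.3f}%, adj R2 = {100 * r2_adj:.3f}%")
print(f"s = {std_err:.4f}, t(T^2) = {t_T2:.2f}")
```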

Issues in regression output

• significance of model

• significance of individual regression parameters

• residual plots:

– normality (density trace, normal probability plot)

– constant variance (against predicted values + each independent variable)

– model adequacy (against predicted values)

– outliers

– independence

• influential points
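The independence check is usually summarised by the Durbin-Watson statistic and the lag 1 residual autocorrelation that appear in the regression output. A minimal sketch of both, assuming the residuals are available as a list in observation order (the function names and the example residuals are made up for illustration):

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den

def lag1_autocorrelation(residuals):
    """Lag 1 autocorrelation of the residuals around their mean."""
    mean = sum(residuals) / len(residuals)
    centred = [e - mean for e in residuals]
    num = sum(a * b for a, b in zip(centred, centred[1:]))
    den = sum(e ** 2 for e in centred)
    return num / den

# Made-up residuals: a slow drift gives positive autocorrelation and a
# Durbin-Watson value well below 2; an alternating pattern gives negative
# autocorrelation and a value above 2.
drifting = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
alternating = [1.0, -1.0, 1.0, -1.0]

print(durbin_watson(drifting), lag1_autocorrelation(drifting))
print(durbin_watson(alternating), lag1_autocorrelation(alternating))
```

Durbin-Watson values near 2 are consistent with independent residuals; values near 0 point to positive autocorrelation.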

Residual plot for specific heat data

This behaviour of the residuals is visible in the plot of the fitted line only after rescaling!

[Figure: residual plot; studentized residuals (axis from -3.8 to 4.2) against predicted Cp (1800 to 2200)]

Plot of fitted quadratic model for specific heat data

[Figure: plot of fitted model; Cp (1800 to 2200) against T (250 to 450)]

Conclusions on regression models for specific heat data

• we need a third-order model (polynomial of degree 3)

• be careful with extrapolation

• original data set contains influential points

• original data set contains potential outliers
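Whether a third-order term is worth adding can be judged by fitting both polynomials and comparing the adjusted R-squared. The sketch below does this with NumPy on synthetic data. The Perry data set is not reproduced here, so the temperatures and the cubic curve that generates `cp` are invented purely for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

def adjusted_r2(y, fitted, n_params):
    """Adjusted R-squared for a model with n_params parameters (incl. constant)."""
    n = len(y)
    sse = np.sum((y - fitted) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    return 1 - (sse / (n - n_params)) / (sst / (n - 1))

# Synthetic stand-in data (NOT the Perry data): a cubic trend plus small noise.
rng = np.random.default_rng(0)
T = np.linspace(250.0, 400.0, 24)
x = T - 325.0
cp = 1900.0 + 1.8 * x + 0.02 * x**2 + 0.0002 * x**3 + rng.normal(0.0, 1.0, T.size)

# Polynomial.fit rescales T internally, which keeps the fit well conditioned.
quadratic = Polynomial.fit(T, cp, deg=2)
cubic = Polynomial.fit(T, cp, deg=3)

adj2 = adjusted_r2(cp, quadratic(T), n_params=3)
adj3 = adjusted_r2(cp, cubic(T), n_params=4)
print(f"adjusted R2: quadratic {adj2:.5f}, cubic {adj3:.5f}")
```

A high R-squared for the quadratic fit does not by itself rule out a missing cubic term, which is why the residual plot remains the decisive check.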

Yield data

• yield of chemical reaction as function of both temperature and pressure

• goal of regression analysis is to find optimal settings of temperature and pressure

• start with simplest linear models:

– no interaction: $\mathrm{Yield} = \beta_0 + \beta_1 T + \beta_2 P$

– interaction: $\mathrm{Yield} = \beta_0 + \beta_1 T + \beta_2 P + \beta_3 \, T \cdot P$
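Both models are linear in the parameters, so they can be fitted by ordinary least squares once the design matrix is built; the interaction model only adds a column T*P. A minimal NumPy sketch, assuming a grid of settings chosen for illustration and made-up coefficients that generate noise-free yields (none of these numbers come from the yield data set):

```python
import numpy as np

# Illustrative settings: every combination of three temperatures and three pressures.
T, P = np.meshgrid([50.0, 125.0, 200.0], [1.0, 5.5, 10.0])
T, P = T.ravel(), P.ravel()

# Made-up "true" coefficients, used only to generate illustrative yields.
beta_true = np.array([90.0, -0.3, -3.0, 0.04])
ones = np.ones_like(T)

X_no_interaction = np.column_stack([ones, T, P])        # b0 + b1*T + b2*P
X_interaction = np.column_stack([ones, T, P, T * P])    # ... + b3*T*P

yield_ = X_interaction @ beta_true

# Ordinary least squares for each design matrix.
b_no, *_ = np.linalg.lstsq(X_no_interaction, yield_, rcond=None)
b_int, *_ = np.linalg.lstsq(X_interaction, yield_, rcond=None)

sse_no = np.sum((yield_ - X_no_interaction @ b_no) ** 2)
sse_int = np.sum((yield_ - X_interaction @ b_int) ** 2)
print("without interaction:", b_no, "SSE =", sse_no)
print("with interaction:   ", b_int, "SSE =", sse_int)
```

Because the synthetic yields contain a T*P effect, the model without interaction leaves a clearly non-zero residual sum of squares, while the interaction model reproduces the data.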

No interaction

[Figure: yield against temperature (50 to 100) at Pressure = 1 and Pressure = 5.5; yield values marked: 65, 70, 80, 85]

Interaction

[Figure: yield against temperature (50 to 100) at Pressure = 1 and Pressure = 5.5; yield values marked: 65, 70, 75, 80]

Interaction plot for yield data

[Figure: interaction plot; yield (axis from 27 to 87) against T (50, 125, 200) for P = 1, 5.5 and 10]

First-order interaction model for yield data

Multiple Regression Analysis
-----------------------------------------------------------------------------
Dependent variable: Yield
-----------------------------------------------------------------------------
                               Standard          T
Parameter       Estimate         Error       Statistic        P-Value
-----------------------------------------------------------------------------
CONSTANT         92,3593        10,1698        9,08172         0,0000
T              -0,312222      0,0732812       -4,26061         0,0006
P               -2,87037        1,54214       -1,86129         0,0812
T*P            0,0444444      0,0110791        4,01157         0,0010
-----------------------------------------------------------------------------

R-squared = 68,3352 percent
R-squared (adjusted for d.f.) = 62,398 percent
Standard Error of Est. = 10,576
Mean absolute error = 8,62
Durbin-Watson statistic = 1,19093 (P=0,0058)
Lag 1 residual autocorrelation = 0,375807
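The T*P term means that the effect of temperature on yield depends on the pressure. With the estimates printed above (decimal commas written as points), the slope of the fitted yield with respect to T at a fixed pressure is b1 + b3*P. The short sketch below evaluates it at the pressure levels 1, 5.5 and 10 read from the interaction plot legend; the function name is illustrative.

```python
# Estimates from the first-order interaction model (decimal commas written as points).
b0, b1, b2, b3 = 92.3593, -0.312222, -2.87037, 0.0444444

def temperature_slope(pressure):
    """Change in fitted yield per unit of T at a fixed pressure: b1 + b3 * P."""
    return b1 + b3 * pressure

for pressure in (1.0, 5.5, 10.0):
    print(f"P = {pressure:4.1f}: slope = {temperature_slope(pressure):+.4f}")

# Pressure at which the fitted temperature effect changes sign.
p_zero_slope = -b1 / b3
print(f"slope is zero at P = {p_zero_slope:.2f}")
```

In this fitted model the temperature effect is negative at low pressure and positive at high pressure, which is exactly what non-parallel lines in an interaction plot express.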

Comments on first-order interaction model

• model significant, but R-squared relatively low

• residual plots suggest quadratic terms are missing:

[Figures: residual plots; studentized residuals (axis from -1,8 to 2,2) against T (50 to 200) and against P (0 to 10)]

Full quadratic model for yield data

Multiple Regression Analysis
-----------------------------------------------------------------------------
Dependent variable: Yield
-----------------------------------------------------------------------------
                               Standard          T
Parameter       Estimate         Error       Statistic        P-Value
-----------------------------------------------------------------------------
CONSTANT         64,2072        3,26458        19,6678         0,0000
T             -0,0471429      0,0504541      -0,934371         0,3659
P                 6,3448       0,675556        9,39196         0,0000
T*P            0,0444444     0,00243463        18,2551         0,0000
T^2          -0,00106032    0,000191261       -5,54383         0,0001
P^2            -0,837743       0,053128       -15,7684         0,0000
-----------------------------------------------------------------------------

R-squared = 98,662 percent
R-squared (adjusted for d.f.) = 98,1842 percent
Standard Error of Est. = 2,32408
Mean absolute error = 1,51905
Durbin-Watson statistic = 2,71456 (P=0,0021)
Lag 1 residual autocorrelation = -0,427245

Comments on full quadratic model for yield data

• strong improvement on R-squared

• independent variable T is no longer significant

• other independent variables involving T remain significant

• refit model omitting independent variable T while keeping the other independent variables

Incomplete quadratic model for yield data

• all parameters significant

• residual plots OK

• normality OK

• 3 influential points but standard deviations of parameter estimates are OK, so no action

• 2 possible outliers at predicted yield of 61%

• accept model and use it for finding optimal settings for yield

Optimal settings for yield

[Figure: contour plot of the fitted yield function; contour levels from 60.0 to 85.0 in steps of 5.0, X axis from 0 to 200, Y axis from 0 to 10]
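For a quadratic model in T and P, the optimum can also be located analytically by setting both partial derivatives to zero, which gives two linear equations. The coefficients of the accepted incomplete model are not listed in these slides, so the sketch below uses the printed full quadratic estimates as a stand-in. This is an assumption; the optimum of the refitted model will differ somewhat.

```python
# Full quadratic estimates (decimal commas written as points); used as a stand-in
# because the refitted model without the linear T term is not printed.
b0, bT, bP, bTP, bTT, bPP = 64.2072, -0.0471429, 6.3448, 0.0444444, -0.00106032, -0.837743

def fitted_yield(T, P):
    return b0 + bT * T + bP * P + bTP * T * P + bTT * T**2 + bPP * P**2

# Stationary point: dY/dT = bT + bTP*P + 2*bTT*T = 0
#                   dY/dP = bP + bTP*T + 2*bPP*P = 0
# Solve the 2x2 linear system with Cramer's rule.
a11, a12, a21, a22 = 2 * bTT, bTP, bTP, 2 * bPP
det = a11 * a22 - a12 * a21
T_opt = (-bT * a22 - a12 * (-bP)) / det
P_opt = (a11 * (-bP) - a21 * (-bT)) / det

# A maximum requires a negative definite Hessian: a11 < 0 and det > 0.
is_maximum = a11 < 0 and det > 0
print(f"T = {T_opt:.1f}, P = {P_opt:.2f}, "
      f"yield = {fitted_yield(T_opt, P_opt):.1f}, maximum: {is_maximum}")
```

The stationary point is only meaningful if it lies inside the range of settings used in the experiment; outside that range it would be an extrapolation.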

Problems with model selection

• variables may be significant in one model but not in another

• number of possible models increases rapidly with number of independent variables

• independent variables may influence each other (multicollinearity)

Multicollinearity

• Phenomenon: variables x_i (almost) satisfy a linear relation

• Consequence: large variances of parameter estimates

• Not harmful for predictions

• Unpleasant for finding causal relations

• Ways to check for multicollinearity:

– wrong signs of parameter estimates

– significant model, but (many) non-significant parameters

– large variances of parameter estimates
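Polynomial regression is a common source of multicollinearity, because T and T^2 are almost linearly related over a limited temperature range. The sketch below assumes 24 equally spaced temperatures between 250 and 400 (an illustrative grid, not the actual measurements). It computes their correlation and the corresponding variance inflation factor 1/(1 - r^2), a standard diagnostic that is not mentioned on the slides.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Assumed grid of temperatures (illustrative only).
T = [250.0 + 150.0 * i / 23 for i in range(24)]
T2 = [t ** 2 for t in T]

r = correlation(T, T2)
vif = 1.0 / (1.0 - r ** 2)   # factor by which the variance of an estimate is inflated
print(f"correlation(T, T^2) = {r:.4f}, VIF = {vif:.0f}")
```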

Procedures for model selection

• compute all possible regression models

– only possible with few independent variables

– choice of best model through adequacy measures:

• coefficient of determination (adjusted for number of ind. variables)

• MSE (directly related to standard error)

• Mallows' Cp (estimates total mean square error)

• sequentially add terms (forward regression)

• sequentially delete terms from full model (backward regression)

• These procedures do not necessarily yield the same result

• Final models should always be checked!
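The "all possible regression models" procedure can be sketched in a few lines. The example below is a minimal illustration with NumPy, not a replacement for a statistics package: the data are synthetic (y depends on x1 and x2 but not on x3), and the helper names `sse` and `all_subsets` are hypothetical. Mallows' Cp is computed as SSE_p / MSE_full - n + 2p, with p the number of parameters including the constant.

```python
import itertools
import numpy as np

def sse(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def all_subsets(columns, y):
    """Adjusted R-squared and Mallows' Cp for every non-empty subset of columns."""
    n = len(y)
    names = list(columns)
    ones = np.ones(n)
    full = np.column_stack([ones] + [columns[k] for k in names])
    mse_full = sse(full, y) / (n - full.shape[1])
    sst = float(((y - y.mean()) ** 2).sum())
    results = {}
    for size in range(1, len(names) + 1):
        for subset in itertools.combinations(names, size):
            X = np.column_stack([ones] + [columns[k] for k in subset])
            p = X.shape[1]                      # parameters incl. constant
            s = sse(X, y)
            adj_r2 = 1 - (s / (n - p)) / (sst / (n - 1))
            cp = s / mse_full - n + 2 * p
            results[subset] = (adj_r2, cp)
    return results

# Synthetic data for illustration only: x3 has no effect on y.
rng = np.random.default_rng(1)
n = 30
cols = {"x1": rng.normal(size=n), "x2": rng.normal(size=n), "x3": rng.normal(size=n)}
y = 2.0 + 3.0 * cols["x1"] - 1.5 * cols["x2"] + rng.normal(scale=0.3, size=n)

results = all_subsets(cols, y)
for subset, (adj_r2, cp) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(f"{'+'.join(subset):10s} adj R2 = {adj_r2:.4f}  Cp = {cp:8.2f}")
```

With k candidate variables there are 2^k - 1 non-empty subsets, which is why this approach is only practical for a few independent variables. For the full model, Cp equals its number of parameters by construction, so it cannot be used to judge the full model itself.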
