k 2DS00 Statistics 1 for Chemical Engineering lecture 4.

2DS00 Statistics 1 for Chemical Engineering

lecture 4

Week schedule

Week 1: Measurement and statistics

Week 2: Error propagation

Week 3: Simple linear regression analysis

Week 4: Multiple linear regression analysis

Week 5: Nonlinear regression analysis

Detailed contents of week 4

• multiple linear regression

• polynomial regression

• interaction

• multicollinearity

• measures of model adequacy

• selection of regression models

Specific heat

• specific heat of vapour at constant pressure as a function of temperature

• data set from Perry’s Chemical Engineers’ Handbook

• thermodynamic theory says that a quadratic relation between temperature and specific heat usually suffices:

$C_p = \beta_0 + \beta_1 T + \beta_2 T^2$

Scatter plot of specific heat data

[Figure: plot of Cp vs T; T axis from 250 to 400, Cp axis from 1800 to 2200]

Regression output specific heat data

Multiple Regression Analysis
-----------------------------------------------------------------------------
Dependent variable: Cp
-----------------------------------------------------------------------------
                                  Standard             T
Parameter          Estimate          Error     Statistic       P-Value
-----------------------------------------------------------------------------
CONSTANT            3590,36        76,3041       47,0533        0,0000
T                  -12,1386       0,454369      -26,7153        0,0000
T^2               0,0213415    0,000670762       31,8169        0,0000
-----------------------------------------------------------------------------

Analysis of Variance
-----------------------------------------------------------------------------
Source          Sum of Squares     Df    Mean Square    F-Ratio      P-Value
-----------------------------------------------------------------------------
Model                 169252,0      2        84626,2    6227,13       0,0000
Residual               285,388     21        13,5899
-----------------------------------------------------------------------------
Total (Corr.)         169538,0     23

R-squared = 99,8317 percent
R-squared (adjusted for d.f.) = 99,8156 percent
Standard Error of Est. = 3,68645
Mean absolute error = 2,94042
Durbin-Watson statistic = 0,310971 (P=0,0000)
Lag 1 residual autocorrelation = 0,640511
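The summary statistics under the ANOVA table follow directly from its sums of squares and degrees of freedom. The sketch below recomputes them from the values printed above (decimal commas written as decimal points); the variable names are illustrative, and small differences in the last digit come from rounding in the printed table.

```python
import math

# Values copied from the ANOVA table for the specific heat data.
ss_model, df_model = 169252.0, 2
ss_resid, df_resid = 285.388, 21
ss_total, df_total = 169538.0, 23

ms_model = ss_model / df_model          # mean square of the model
ms_resid = ss_resid / df_resid          # mean square error (MSE)

f_ratio = ms_model / ms_resid                            # overall F test
r_squared = 1.0 - ss_resid / ss_total                    # coefficient of determination
adj_r_squared = 1.0 - ms_resid / (ss_total / df_total)   # adjusted for d.f.
std_error = math.sqrt(ms_resid)                          # standard error of estimate

print(f"F-ratio            = {f_ratio:.2f}")
print(f"R-squared          = {100 * r_squared:.4f} percent")
print(f"R-squared (adj.)   = {100 * adj_r_squared:.4f} percent")
print(f"Standard error     = {std_error:.5f}")
```

The same relations hold for the yield models later in the lecture, so they give a quick consistency check on any regression output of this form.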

Issues in regression output

• significance of model

• significance of individual regression parameters

• residual plots:

– normality (density trace, normal probability plot)

– constant variance (against predicted values + each independent variable)

– model adequacy (against predicted values)

– outliers

– independence

• influential points
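The residual checks and the search for influential points listed above both rest on the hat matrix. The following is a minimal sketch, assuming a straight-line model and synthetic data (not the course data set); the names `leverage` and `studentized` and the 2p/n threshold are illustrative choices, the latter being a common rule of thumb rather than something stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example data: a straight line plus noise (assumption for illustration).
x = np.linspace(0.0, 10.0, 15)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Hat matrix H = X (X'X)^-1 X'; its diagonal holds the leverages.
hat = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(hat)

# Internally studentized residuals: e_i / (s * sqrt(1 - h_ii)).
s2 = residuals @ residuals / (n - p)
studentized = residuals / np.sqrt(s2 * (1.0 - leverage))

# Rule-of-thumb flag for high-leverage (potentially influential) points.
high_leverage = leverage > 2 * p / n

print("largest |studentized residual|:", np.abs(studentized).max())
print("indices with high leverage:", np.flatnonzero(high_leverage))
```

Plotting `studentized` against the fitted values and against each independent variable gives the residual plots named in the list.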

Residual plot specific heat data

This behaviour is visible in the plot of the fitted line only after rescaling!

[Figure: residual plot, studentized residual against predicted Cp; predicted Cp axis from 1800 to 2200, residual axis from -3.8 to 4.2]
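A systematic residual pattern of this kind also shows up as a low Durbin-Watson statistic, as in the regression output for the quadratic model. The sketch below illustrates the effect on synthetic data that are cubic by construction (an assumption for illustration; the coefficients, noise level and seed are made up and are not the Perry data): a quadratic fit leaves smooth, autocorrelated residuals, while a cubic fit does not.

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(1)

# Synthetic "specific heat" curve with a genuine cubic component.
T = np.linspace(250.0, 400.0, 24)
x = T - T.mean()                      # centring keeps the polynomial fit well conditioned
cp = 2000.0 + 0.5 * x + 0.02 * x**2 + 0.0004 * x**3 + rng.normal(scale=1.0, size=T.size)

residuals = {}
for degree in (2, 3):
    coeffs = np.polyfit(x, cp, degree)            # least-squares polynomial fit
    residuals[degree] = cp - np.polyval(coeffs, x)

dw = {degree: durbin_watson(e) for degree, e in residuals.items()}
sse = {degree: float(e @ e) for degree, e in residuals.items()}

for degree in (2, 3):
    print(f"degree {degree}: SSE = {sse[degree]:.1f}, Durbin-Watson = {dw[degree]:.2f}")
```

A Durbin-Watson value far below 2 for the lower-order fit signals that neighbouring residuals have the same sign, which is what the residual plot shows visually.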

Plot of fitted quadratic model for specific heat data

[Figure: plot of fitted model, Cp against T; T axis from 250 to 450, Cp axis from 1800 to 2200]

Conclusion regression models for specific heat data

• we need a third-order model (polynomial of degree 3)

• be careful with extrapolation

• original data set contains influential points

• original data set contains potential outliers
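The warning about extrapolation can be made concrete with the fitted quadratic model itself. The sketch below evaluates the quadratic with the estimates from the regression output (decimal commas written as points) and locates its turning point; the function name `cp_quadratic` and the chosen evaluation temperatures are illustrative assumptions.

```python
# Parameter estimates of the fitted quadratic model Cp = b0 + b1*T + b2*T^2.
b0, b1, b2 = 3590.36, -12.1386, 0.0213415

def cp_quadratic(t):
    """Predicted Cp from the fitted quadratic model at temperature t."""
    return b0 + b1 * t + b2 * t**2

# A parabola with b2 > 0 has its minimum where the derivative b1 + 2*b2*T is zero.
t_min = -b1 / (2.0 * b2)

print(f"turning point of the fitted parabola: T = {t_min:.1f}")
for t in (200.0, 250.0, 300.0, 400.0, 500.0):
    print(f"T = {t:5.0f}  ->  predicted Cp = {cp_quadratic(t):8.1f}")
```

The turning point lies inside the plotted temperature range, so the fitted parabola bends upwards again towards lower temperatures and grows quadratically towards higher ones. Predictions outside the range covered by the data are therefore driven by the shape of the polynomial, not by measurements.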

Yield data

• yield of chemical reaction as function of both temperature and pressure

• goal of regression analysis is to find optimal settings of temperature and pressure

• start with simplest linear models:

– no interaction: $\text{Yield} = \beta_0 + \beta_1 T + \beta_2 P$

– interaction: $\text{Yield} = \beta_0 + \beta_1 T + \beta_2 P + \beta_3 \, T \cdot P$

No interaction

[Figure: Yield against Temperature (50 to 100), one line each for Pressure = 1 and Pressure = 5.5; labelled yields 65, 70, 80 and 85]

Interaction

[Figure: Yield against Temperature (50 to 100), one line each for Pressure = 1 and Pressure = 5.5; labelled yields 65, 70, 75 and 80]

Interaction plot for yield data

[Figure: interaction plot of Yield against T (50, 125, 200), one line per pressure level P = 1, 5,5 and 10; Yield axis from 27 to 87]

First-order interaction model for yield data

Multiple Regression Analysis
-----------------------------------------------------------------------------
Dependent variable: Yield
-----------------------------------------------------------------------------
                                  Standard             T
Parameter          Estimate          Error     Statistic       P-Value
-----------------------------------------------------------------------------
CONSTANT            92,3593        10,1698       9,08172        0,0000
T                 -0,312222      0,0732812      -4,26061        0,0006
P                  -2,87037        1,54214      -1,86129        0,0812
T*P               0,0444444      0,0110791       4,01157        0,0010
-----------------------------------------------------------------------------

Analysis of Variance
-----------------------------------------------------------------------------
Source          Sum of Squares     Df    Mean Square    F-Ratio      P-Value
-----------------------------------------------------------------------------
Model                  3862,17      3        1287,39      11,51       0,0003
Residual               1789,63     16        111,852
-----------------------------------------------------------------------------
Total (Corr.)           5651,8     19

R-squared = 68,3352 percent
R-squared (adjusted for d.f.) = 62,398 percent
Standard Error of Est. = 10,576
Mean absolute error = 8,62
Durbin-Watson statistic = 1,19093 (P=0,0058)
Lag 1 residual autocorrelation = 0,375807
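In a model with a T*P term, the effect of temperature on yield is no longer a single number: the slope with respect to T is the T coefficient plus the T*P coefficient times the pressure. The sketch below works this out from the estimates printed above; the pressure levels 1, 5.5 and 10 are taken from the legend of the interaction plot (an assumed reading of the garbled legend), and the function name is illustrative.

```python
# Estimates from the first-order interaction model (decimal commas written as points).
b_T = -0.312222     # coefficient of T
b_TP = 0.0444444    # coefficient of T*P

def temperature_slope(pressure):
    """Change in predicted yield per unit of T at a given pressure."""
    return b_T + b_TP * pressure

for pressure in (1.0, 5.5, 10.0):
    print(f"P = {pressure:4.1f}: slope of yield in T = {temperature_slope(pressure):+.4f}")

# Pressure at which the fitted temperature effect changes sign.
p_zero = -b_T / b_TP
print(f"temperature slope is zero at P = {p_zero:.3f}")
```

According to this fitted model, raising the temperature lowers the predicted yield at low pressure and raises it at high pressure, which is what non-parallel lines in an interaction plot express.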

Comments on first-order interaction model

• model significant, but R-squared relatively low

• residual plots suggest quadratic terms are missing:

[Figure: residual plot, studentized residual against T; T axis from 50 to 200, residual axis from -1,8 to 2,2]

[Figure: residual plot, studentized residual against P; P axis from 0 to 10, residual axis from -1,8 to 2,2]

Full quadratic model for yield data

Multiple Regression Analysis
-----------------------------------------------------------------------------
Dependent variable: Yield
-----------------------------------------------------------------------------
                                  Standard             T
Parameter          Estimate          Error     Statistic       P-Value
-----------------------------------------------------------------------------
CONSTANT            64,2072        3,26458       19,6678        0,0000
T                -0,0471429      0,0504541     -0,934371        0,3659
P                    6,3448       0,675556       9,39196        0,0000
T*P               0,0444444     0,00243463       18,2551        0,0000
T^2             -0,00106032    0,000191261      -5,54383        0,0001
P^2               -0,837743       0,053128      -15,7684        0,0000
-----------------------------------------------------------------------------

Analysis of Variance
-----------------------------------------------------------------------------
Source          Sum of Squares     Df    Mean Square    F-Ratio      P-Value
-----------------------------------------------------------------------------
Model                  5576,18      5        1115,24     206,47       0,0000
Residual                75,619     14        5,40136
-----------------------------------------------------------------------------
Total (Corr.)           5651,8     19

R-squared = 98,662 percent
R-squared (adjusted for d.f.) = 98,1842 percent
Standard Error of Est. = 2,32408
Mean absolute error = 1,51905
Durbin-Watson statistic = 2,71456 (P=0,0021)
Lag 1 residual autocorrelation = -0,427245

Comments on full quadratic model for yield data

• strong improvement on R-squared

• independent variable T is no longer significant

• other independent variables involving T remain significant

• refit model omitting independent variable T while keeping the other independent variables
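The improvement from adding the two quadratic terms can also be judged formally with a partial F test on the two nested models. This test is not on the slide; the sketch below is one way to quantify the comparison, using only the residual sums of squares and degrees of freedom printed in the two ANOVA tables (variable names are illustrative).

```python
# Residual sum of squares and residual degrees of freedom from the two ANOVA tables.
sse_reduced, df_reduced = 1789.63, 16   # first-order interaction model
sse_full, df_full = 75.619, 14          # full quadratic model (adds T^2 and P^2)

extra_terms = df_reduced - df_full      # number of added parameters

# Partial F statistic: reduction in SSE per added term, relative to the MSE of the full model.
f_partial = ((sse_reduced - sse_full) / extra_terms) / (sse_full / df_full)

print(f"added terms: {extra_terms}")
print(f"partial F  : {f_partial:.2f}")
```

The statistic is compared with an F distribution with 2 and 14 degrees of freedom, whose 5% critical value is roughly 3.7, so the quadratic terms jointly explain far more than chance would.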

Incomplete quadratic model for yield data

• all parameters significant

• residual plots OK

• normality OK

• 3 influential points but standard deviations of parameter estimates are OK, so no action

• 2 possible outliers at predicted yield of 61%

• accept model and use it for finding optimal settings for yield

Optimal settings for yield

[Figure: contour plot of the fitted yield function, contour levels 60.0 to 85.0; X axis from 0 to 200, Y axis from 0 to 10]
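For a quadratic response surface, the optimum can be found by setting both partial derivatives to zero, which gives a 2x2 linear system. The estimates of the refitted (incomplete) model are not listed in the slides, so the sketch below uses the full quadratic model estimates instead; this substitution is an assumption, and the optimum of the refitted model may differ somewhat. Variable names are illustrative.

```python
import numpy as np

# Full quadratic model estimates (decimal commas written as points).
b0 = 64.2072
b_T, b_P = -0.0471429, 6.3448
b_TP = 0.0444444
b_TT, b_PP = -0.00106032, -0.837743

# Yield = b0 + g.x + 0.5 * x'Hx with x = (T, P).
g = np.array([b_T, b_P])
H = np.array([[2.0 * b_TT, b_TP],
              [b_TP, 2.0 * b_PP]])

# Stationary point: gradient g + Hx = 0.
t_opt, p_opt = np.linalg.solve(H, -g)

# Both eigenvalues negative means the stationary point is a maximum.
eigenvalues = np.linalg.eigvalsh(H)
x_opt = np.array([t_opt, p_opt])
yield_opt = b0 + g @ x_opt + 0.5 * x_opt @ H @ x_opt

print(f"stationary point: T = {t_opt:.1f}, P = {p_opt:.2f}")
print(f"eigenvalues of H: {eigenvalues}")
print(f"predicted yield at the stationary point: {yield_opt:.1f}")
```

The stationary point falls inside the plotted region, so it is an interpolation and not an extrapolation of the fitted surface.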

Problems with model selection

• variables may be significant in one model but not in another

• number of possible models increases rapidly with number of independent variables

• independent variables may influence each other (multicollinearity)

Multicollinearity

• Phenomenon: variables x_i (almost) satisfy a linear relation

• Consequence: large variances of parameter estimates

• Not harmful for predictions

• Unpleasant for finding causal relations

• Ways to check for multicollinearity:

– wrong signs of parameter estimates

– significant model, but (many) non-significant parameters

– large variances of parameter estimates
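The link between an almost linear relation among the variables and large variances can be seen from the diagonal of (X'X)^-1, which multiplies the error variance to give the variance of each parameter estimate. The sketch below compares two deterministic, made-up designs (an assumption for illustration): one where the second variable is unrelated to the first, and one where it is almost a copy of it.

```python
import numpy as np

def variance_multipliers(*columns):
    """Diagonal of (X'X)^-1 for a design with an intercept and the given columns."""
    X = np.column_stack([np.ones_like(columns[0]), *columns])
    return np.diag(np.linalg.inv(X.T @ X))

x1 = np.linspace(-1.0, 1.0, 20)
q = x1**2 - np.mean(x1**2)        # centred square: uncorrelated with x1 by symmetry

x2_unrelated = q                  # second variable carries separate information
x2_collinear = x1 + 0.05 * q      # second variable is almost identical to x1

v_unrelated = variance_multipliers(x1, x2_unrelated)
v_collinear = variance_multipliers(x1, x2_collinear)

# How much the variance of the x1 coefficient grows under near-collinearity.
ratio = v_collinear[1] / v_unrelated[1]

print(f"variance multiplier for x1, unrelated design: {v_unrelated[1]:.4f}")
print(f"variance multiplier for x1, collinear design: {v_collinear[1]:.4f}")
print(f"inflation factor: {ratio:.0f}")
```

The error variance is the same in both designs; only the geometry of the independent variables changes, yet the coefficient of x1 becomes far less precisely determined.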

Procedures for model selection

• compute all possible regression models

– only possible with few independent variables

– choice of best model through adequacy measures:

• coefficient of determination (adjusted for number of independent variables)

• MSE (directly related to standard error)

• Mallows’ Cp (estimates total mean square error)

• sequentially add terms (forward regression)

• sequentially delete terms from full model (backward regression)

• These procedures do not necessarily yield the same result

• Final models should always be checked!
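The "all possible regression models" procedure with its adequacy measures can be sketched in a few lines. The example below uses synthetic data with four hypothetical candidate variables, of which only the first two influence the response (all assumptions for illustration); it computes the adjusted coefficient of determination and Mallows' Cp for each of the 16 subsets, with Cp taken as SSE_p / MSE_full - n + 2p, where p counts the parameters including the intercept.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: four candidate variables, only the first two matter.
n = 40
X = rng.normal(size=(n, 4))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

def sse_of(columns):
    """Residual sum of squares of the model with an intercept and the given columns."""
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in columns])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

sst = float(np.sum((y - y.mean()) ** 2))
full = tuple(range(X.shape[1]))
mse_full = sse_of(full) / (n - len(full) - 1)   # error variance estimate from the full model

results = []
for k in range(len(full) + 1):
    for cols in itertools.combinations(full, k):
        p = k + 1                                # parameters including the intercept
        sse = sse_of(cols)
        adj_r2 = 1.0 - (sse / (n - p)) / (sst / (n - 1))
        mallows_cp = sse / mse_full - n + 2 * p
        results.append((cols, adj_r2, mallows_cp))

best_by_adj_r2 = max(results, key=lambda r: r[1])
best_by_cp = min(results, key=lambda r: r[2])

print("best by adjusted R-squared:", best_by_adj_r2[0])
print("best by Mallows' Cp       :", best_by_cp[0])
```

The two criteria need not pick the same subset, which mirrors the remark that the procedures do not necessarily yield the same result. With m candidate variables there are 2^m subsets, so this brute-force approach is only practical for few variables.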