2DS00 Statistics 1 for Chemical Engineering
Lecture 4
Week schedule
Week 1: Measurement and statistics
Week 2: Error propagation
Week 3: Simple linear regression analysis
Week 4: Multiple linear regression analysis
Week 5: Nonlinear regression analysis
Detailed contents of week 4
• multiple linear regression
• polynomial regression
• interaction
• multicollinearity
• measures of model adequacy
• selection of regression models
Specific heat
• specific heat of vapour at constant pressure as a function of temperature
• data set from Perry’s Chemical Engineers’ Handbook
• thermodynamic theory says that a quadratic relation between temperature and specific heat usually suffices:
$C_p = \beta_0 + \beta_1 T + \beta_2 T^2$
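A minimal sketch of how such a quadratic model can be fitted by least squares. The coefficients are the ones reported in the lecture's regression output for Cp; the temperatures and the noise are synthetic assumptions and are NOT the handbook data.

```python
import numpy as np

# Coefficients taken from the lecture's regression output (decimal commas converted).
b0, b1, b2 = 3590.36, -12.1386, 0.0213415

# Synthetic temperatures and noise: illustrative assumptions, not the handbook data.
rng = np.random.default_rng(0)
T = np.linspace(250.0, 400.0, 24)
cp = b0 + b1 * T + b2 * T**2 + rng.normal(0.0, 3.0, T.size)

# Design matrix with columns 1, T, T^2. The model is quadratic in T
# but linear in the parameters, so ordinary least squares applies.
X = np.column_stack([np.ones_like(T), T, T**2])
beta, *_ = np.linalg.lstsq(X, cp, rcond=None)

fitted = X @ beta
residuals = cp - fitted
print("estimated parameters:", beta)
```

Polynomial regression is therefore just multiple linear regression with T and T^2 as the independent variables.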
Scatter plot of specific heat data
[Figure: plot of Cp vs T; T axis from 250 to 400, Cp axis from 1800 to 2200]
Regression output specific heat data
Multiple Regression Analysis
-----------------------------------------------------------------------------
Dependent variable: Cp
-----------------------------------------------------------------------------
                                 Standard           T
Parameter        Estimate        Error              Statistic      P-Value
-----------------------------------------------------------------------------
CONSTANT         3590,36         76,3041            47,0533        0,0000
T                -12,1386        0,454369           -26,7153       0,0000
T^2              0,0213415       0,000670762        31,8169        0,0000
-----------------------------------------------------------------------------
Analysis of Variance
-----------------------------------------------------------------------------
Source           Sum of Squares  Df   Mean Square   F-Ratio        P-Value
-----------------------------------------------------------------------------
Model            169252,0        2    84626,2       6227,13        0,0000
Residual         285,388         21   13,5899
-----------------------------------------------------------------------------
Total (Corr.)    169538,0        23
R-squared = 99,8317 percent
R-squared (adjusted for d.f.) = 99,8156 percent
Standard Error of Est. = 3,68645
Mean absolute error = 2,94042
Durbin-Watson statistic = 0,310971 (P=0,0000)
Lag 1 residual autocorrelation = 0,640511
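The summary statistics follow from the sums of squares in the ANOVA table above. The sketch below recomputes them from the reported values (decimal commas converted to points); the variable names are only illustrative.

```python
import math

# Values copied from the ANOVA table for the Cp model.
ss_model, df_model = 169252.0, 2
ss_resid, df_resid = 285.388, 21
ss_total, df_total = 169538.0, 23

ms_model = ss_model / df_model            # mean square of the model
ms_resid = ss_resid / df_resid            # mean square error (MSE)
f_ratio = ms_model / ms_resid             # overall significance of the model
r2 = 1.0 - ss_resid / ss_total            # R-squared
r2_adj = 1.0 - ms_resid / (ss_total / df_total)  # adjusted for d.f.
std_err = math.sqrt(ms_resid)             # standard error of estimate

print(f"F = {f_ratio:.2f}, R2 = {100 * r2:.4f} %, "
      f"adj. R2 = {100 * r2_adj:.4f} %, s = {std_err:.5f}")
```

Small differences in the last digit of the F-ratio are due to rounding of the tabulated sums of squares.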
Issues in regression output
• significance of model
• significance of individual regression parameters
• residual plots:
– normality (density trace, normal probability plot)
– constant variance (against predicted values + each independent variable)
– model adequacy (against predicted values)
– outliers
– independence
• influential points
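Several of the checks listed above can be computed directly from the design matrix. The sketch below uses a hypothetical helper, `regression_diagnostics`, on synthetic data. It returns internally studentized residuals, which may differ from the definition used by the statistical package behind the lecture's output, together with the leverages (for influential points) and the Durbin-Watson statistic (for independence).

```python
import numpy as np

def regression_diagnostics(X, y):
    """Residuals, leverages, internally studentized residuals, Durbin-Watson."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Diagonal of the hat matrix X (X'X)^-1 X', computed via a QR factorisation.
    Q, _ = np.linalg.qr(X)
    leverage = np.sum(Q**2, axis=1)
    s2 = resid @ resid / (n - p)                      # MSE
    studentized = resid / np.sqrt(s2 * (1.0 - leverage))
    # Durbin-Watson: values near 2 indicate no lag-1 autocorrelation.
    dw = np.sum(np.diff(resid) ** 2) / np.sum(resid**2)
    return resid, leverage, studentized, dw

# Synthetic straight-line data, for illustration only.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 20)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, x.size)
X = np.column_stack([np.ones_like(x), x])

resid, leverage, studentized, dw = regression_diagnostics(X, y)
print("largest |studentized residual|:", np.max(np.abs(studentized)))
print("largest leverage:", leverage.max(), " Durbin-Watson:", dw)
```

Plotting `studentized` against the predicted values and against each independent variable gives the residual plots named in the list.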
Residual plot specific heat data
[Figure: residual plot; studentized residual (-3.8 to 4.2) against predicted Cp (1800 to 2200)]
The pattern in these residuals is visible in the plot of the fitted model only after rescaling!
Plot of fitted quadratic model for specific heat data
[Figure: plot of fitted model; Cp (1800 to 2200) against T (250 to 450)]
Conclusion regression models for specific heat data
• we need a third-order model (polynomial of degree 3)
• be careful with extrapolation
• original data set contains influential points
• original data set contains potential outliers
Yield data
• yield of chemical reaction as function of both temperature and
pressure
• goal of regression analysis is to find optimal settings of temperature
and pressure
• start with simplest linear models:
– no interaction: $\text{Yield} = \beta_0 + \beta_1 T + \beta_2 P$
– interaction: $\text{Yield} = \beta_0 + \beta_1 T + \beta_2 P + \beta_3 T P$
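A small sketch of what the interaction term does. The coefficients in `beta_true` are hypothetical and the yields are noise-free synthetic values; the temperature and pressure levels are borrowed from the axes of the lecture's interaction plot. With the extra column T*P in the design matrix, the slope of yield in T is no longer constant but depends on P.

```python
import numpy as np

# Grid of settings (levels as on the lecture's interaction plot).
T, P = np.meshgrid([50.0, 125.0, 200.0], [1.0, 5.5, 10.0])
T, P = T.ravel(), P.ravel()

# Hypothetical coefficients b0, b1, b2, b3, chosen for illustration only.
beta_true = np.array([90.0, -0.3, -3.0, 0.04])

# Design matrix of the interaction model: columns 1, T, P, T*P.
X_int = np.column_stack([np.ones_like(T), T, P, T * P])
y = X_int @ beta_true                      # noise-free synthetic yields

beta_hat, *_ = np.linalg.lstsq(X_int, y, rcond=None)

def slope_in_T(beta, pressure):
    """Change in yield per unit of T at a given pressure: b1 + b3 * P."""
    return beta[1] + beta[3] * pressure

for p in (1.0, 5.5, 10.0):
    print(f"P = {p}: slope in T = {slope_in_T(beta_hat, p):.3f}")
```

Without the T*P column the three slopes would be identical, which corresponds to parallel lines in a plot of yield against temperature.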
No interaction
[Figure: yield against temperature (50 to 100) for pressure = 1 and pressure = 5.5; yield labels 65, 70, 80, 85]
Interaction
[Figure: yield against temperature (50 to 100) for pressure = 1 and pressure = 5.5; yield labels 65, 70, 75, 80]
Interaction plot for yield data
[Figure: interaction plot; yield (27 to 87) against T (50, 125, 200), one line per pressure level P = 1, 5.5, 10]
First-order interaction model for yield data
Multiple Regression Analysis
-----------------------------------------------------------------------------
Dependent variable: Yield
-----------------------------------------------------------------------------
                                 Standard           T
Parameter        Estimate        Error              Statistic      P-Value
-----------------------------------------------------------------------------
CONSTANT         92,3593         10,1698            9,08172        0,0000
T                -0,312222       0,0732812          -4,26061       0,0006
P                -2,87037        1,54214            -1,86129       0,0812
T*P              0,0444444       0,0110791          4,01157        0,0010
-----------------------------------------------------------------------------
Analysis of Variance
-----------------------------------------------------------------------------
Source           Sum of Squares  Df   Mean Square   F-Ratio        P-Value
-----------------------------------------------------------------------------
Model            3862,17         3    1287,39       11,51          0,0003
Residual         1789,63         16   111,852
-----------------------------------------------------------------------------
Total (Corr.)    5651,8          19
R-squared = 68,3352 percent
R-squared (adjusted for d.f.) = 62,398 percent
Standard Error of Est. = 10,576
Mean absolute error = 8,62
Durbin-Watson statistic = 1,19093 (P=0,0058)
Lag 1 residual autocorrelation = 0,375807
Comments on first-order interaction model
• model significant, but R-squared relatively low
• residual plots suggest quadratic terms are missing:
[Figure: residual plot; studentized residual (-1.8 to 2.2) against T (50 to 200)]
[Figure: residual plot; studentized residual (-1.8 to 2.2) against P (0 to 10)]
Full quadratic model for yield data
Multiple Regression Analysis
-----------------------------------------------------------------------------
Dependent variable: Yield
-----------------------------------------------------------------------------
                                 Standard           T
Parameter        Estimate        Error              Statistic      P-Value
-----------------------------------------------------------------------------
CONSTANT         64,2072         3,26458            19,6678        0,0000
T                -0,0471429      0,0504541          -0,934371      0,3659
P                6,3448          0,675556           9,39196        0,0000
T*P              0,0444444       0,00243463         18,2551        0,0000
T^2              -0,00106032     0,000191261        -5,54383       0,0001
P^2              -0,837743       0,053128           -15,7684       0,0000
-----------------------------------------------------------------------------
Analysis of Variance
-----------------------------------------------------------------------------
Source           Sum of Squares  Df   Mean Square   F-Ratio        P-Value
-----------------------------------------------------------------------------
Model            5576,18         5    1115,24       206,47         0,0000
Residual         75,619          14   5,40136
-----------------------------------------------------------------------------
Total (Corr.)    5651,8          19
R-squared = 98,662 percent
R-squared (adjusted for d.f.) = 98,1842 percent
Standard Error of Est. = 2,32408
Mean absolute error = 1,51905
Durbin-Watson statistic = 2,71456 (P=0,0021)
Lag 1 residual autocorrelation = -0,427245
Comments on full quadratic model for yield data
• strong improvement on R-squared
• the linear term T is no longer significant
• the other terms involving T (T*P and T^2) remain significant
• refit the model without the linear term T while keeping the other terms
Incomplete quadratic model for yield data
• all parameters significant
• residual plots OK
• normality OK
• 3 influential points but standard deviations of parameter estimates
are OK, so no action
• 2 possible outliers at predicted yield of 61%
• accept model and use it for finding optimal settings for yield
Optimal settings for yield
[Figure: contour plot of the fitted yield function; X axis from 0 to 200, Y axis from 0 to 10, contour levels 60.0 to 85.0 in steps of 5.0]
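A sketch of how optimal settings can be located analytically for a quadratic response surface. The coefficients of the accepted incomplete model are not listed in these slides, so as an assumption this example uses the coefficients of the full quadratic model from the regression output; the result is therefore only indicative.

```python
import numpy as np

# Full quadratic model for yield (decimal commas converted to points).
b0, bT, bP, bTP, bTT, bPP = (64.2072, -0.0471429, 6.3448,
                             0.0444444, -0.00106032, -0.837743)

def yield_model(T, P):
    """Fitted yield as a function of temperature T and pressure P."""
    return b0 + bT * T + bP * P + bTP * T * P + bTT * T**2 + bPP * P**2

# Setting both partial derivatives to zero gives a linear system:
#   2*bTT*T + bTP*P = -bT
#   bTP*T + 2*bPP*P = -bP
H = np.array([[2 * bTT, bTP],
              [bTP, 2 * bPP]])             # Hessian of the quadratic surface
T_opt, P_opt = np.linalg.solve(H, -np.array([bT, bP]))

# The stationary point is a maximum if the Hessian is negative definite.
is_maximum = bool(np.all(np.linalg.eigvalsh(H) < 0))

print(f"T = {T_opt:.1f}, P = {P_opt:.2f}, "
      f"yield = {yield_model(T_opt, P_opt):.1f}, maximum: {is_maximum}")
```

The stationary point should always be compared with the experimental region, since an optimum outside the range of the data is an extrapolation.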
Problems with model selection
• variables may be significant in one model but not in another
• number of possible models increases rapidly with number of
independent variables
• independent variables may influence each other (multicollinearity)
Multicollinearity
• Phenomenon: variables xi (almost) satisfy a linear relation
• Consequence: large variances of parameter estimates
• Not harmful for predictions
• Unpleasant for finding causal relations
• Ways to check for multicollinearity:
– wrong signs of parameter estimates
– significant model, but (many) non-significant parameters
– large variances of parameter estimates
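The link between an (almost) linear relation among the variables and large variances can be illustrated numerically. The variance of each parameter estimate equals the error variance times the corresponding diagonal element of (X'X)^-1. The sketch below compares that element for two synthetic designs; all data and names are illustrative assumptions.

```python
import numpy as np

def coefficient_variance_factors(X):
    """Diagonal of (X'X)^-1; Var(b_j) is sigma^2 times this factor."""
    return np.diag(np.linalg.inv(X.T @ X))

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                              # unrelated to x1
x2_collinear = 2.0 * x1 + rng.normal(scale=0.01, size=n)   # almost 2 * x1

X_indep = np.column_stack([np.ones(n), x1, x2_indep])
X_coll = np.column_stack([np.ones(n), x1, x2_collinear])

v_indep = coefficient_variance_factors(X_indep)
v_coll = coefficient_variance_factors(X_coll)

print("variance factor of b1, independent design:", v_indep[1])
print("variance factor of b1, collinear design:  ", v_coll[1])
```

The fitted values are hardly affected by the collinearity, which is why predictions inside the range of the data remain usable while the individual parameter estimates become unreliable.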
Procedures for model selection
• compute all possible regression models
– only possible with few independent variables
– choice of best model through adequacy measures:
    • coefficient of determination (adjusted for number of ind. variables)
    • MSE (directly related to standard error)
    • Mallows' Cp (estimates total mean square error)
• sequentially add terms (forward regression)
• sequentially delete terms from full model (backward regression)
• These procedures do not necessarily yield the same result
• Final models should always be checked!
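A minimal sketch of the "all possible regression models" procedure with the three adequacy measures named above. The helper names (`fit_sse`, `all_subsets`) and the data are hypothetical; Mallows' Cp is computed as SSE_p / MSE_full - (n - 2p), with p the number of parameters including the constant.

```python
import itertools
import numpy as np

def fit_sse(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def all_subsets(candidates, y):
    """Fit every non-empty subset; return (names, adj. R2, MSE, Mallows' Cp)."""
    n = y.size
    names = list(candidates)
    sst = float(np.sum((y - y.mean()) ** 2))
    X_full = np.column_stack([np.ones(n)] + [candidates[k] for k in names])
    mse_full = fit_sse(X_full, y) / (n - X_full.shape[1])
    results = []
    for k in range(1, len(names) + 1):
        for subset in itertools.combinations(names, k):
            X = np.column_stack([np.ones(n)] + [candidates[c] for c in subset])
            p = X.shape[1]                       # parameters incl. constant
            sse = fit_sse(X, y)
            mse = sse / (n - p)
            adj_r2 = 1.0 - mse / (sst / (n - 1))
            mallows_cp = sse / mse_full - (n - 2 * p)
            results.append((subset, adj_r2, mse, mallows_cp))
    return results

# Synthetic data: y depends on x1 and x2 only; x3 is irrelevant.
rng = np.random.default_rng(3)
n = 40
cols = {name: rng.normal(size=n) for name in ("x1", "x2", "x3")}
y = 1.0 + 2.0 * cols["x1"] - 1.0 * cols["x2"] + rng.normal(scale=0.5, size=n)

results = all_subsets(cols, y)
best = max(results, key=lambda r: r[1])          # highest adjusted R-squared
for subset, adj_r2, mse, mallows_cp in results:
    print(subset, round(adj_r2, 4), round(mse, 4), round(mallows_cp, 2))
```

With k candidate variables there are 2^k - 1 non-empty subsets, which is why this approach is only feasible for few independent variables; whichever model is selected still needs the residual checks from earlier in the lecture.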