Regression Analysis
Transcript of Regression Analysis
Purpose of Regression
Fit data to a model
- Known model based on physics, e.g. the Antoine equation: P* = exp[A - B/(T+C)]
- Assumed correlation, e.g. y = a + b*x1 + c*x2
Use the model to
- Interpolate
- Extrapolate (use extreme caution)
- Identify outliers
- Identify trends in the data
Linear Regression
There are two classes of regression:
- Linear
- Nonlinear
"Linear" refers to the parameters, not the independent variables.
The sensitivity coefficients of a linear model contain no model parameters.
Which of these models are linear?
Example: Surface Tension Model
Issue 1: Nonlinear vs. Linear Regression
Nonlinear model
Linearized model
Nonlinear Regression: Mathcad - GENFIT
Nonlinear Regression Results
Linear Regression: Mathcad - LINFIT
Performs the linear regression
Redefines the dependent variable
Defines the independent variables
Linear Regression Results
Comparison
nonlinear
linear
Issue 2: How many parameters?
Linear regressions with 2, 3, 4, and 5 parameters
Statistical Analysis of Regression
Straight Line Model as Example
Fit a Line Through This Data
Least Squares
How Good is the Fit?
What is the R² value?
- A useful statistic, but not definitive
- Tells you how well the model fits the data
- Does not tell you that the model is correct
- Tells you how much of the distribution about the mean is described by the model
Problems with R²
[Plots: four X-Y data sets (Anscombe's quartet). All four give essentially the same fitted line and R² value, yet the patterns of the data are very different.]
How Good is the Fit?
Are the residuals random?
Residuals Should Be Normally Distributed
[Plots: residuals vs. X for two of the data sets above. One set of residuals is randomly scattered; the other follows a systematic parabolic trend, indicating the wrong model.]
How Good is the Fit?
Find the confidence interval
Parameter Confidence Level
Confidence Level of y
Multiple Linear Regression in Mathcad
Multiple Linear Regression: Mathcad - Regress
Mathcad Regress Function
Results on a Ycalc vs. Y Plot
Residuals
R² Statistic
Confidence Level for Parameters
n is the number of points; kk is the number of independent variables
Confidence Level for Ycalc
For the linear model y = b0 + b1*x1 + b2*x2, the sensitivity coefficients contain no model parameters:
dy/db1 = x1,   dy/db2 = x2,   dy/db0 = 1
Candidate models:
1. y = a*x^c + b*x
2. y = a*e^(b*x)
3. y = a + b*T + c*T^3 + d*T^4 + e*T^5
4. y = exp[A - B/(T+C)]
5. y = m*x + b
6. y = a*x^b + c
Only models 3 and 5 are linear in the parameters; in the others at least one parameter appears inside an exponent or other nonlinear function.
Tr := data<1>/Tc   (reduced temperature)
[Plot: surface tension data g vs. reduced temperature Tr; Tr from 0.5 to 1, g from 0 to 0.03]
Taking the logarithm of the nonlinear model gives
ln(y) = ln(C1) + (C2 + C3*x + C4*x^2 + C5*x^3)*ln(1 - x)
which expands to
ln(y) = ln(C1) + C2*ln(1 - x) + C3*x*ln(1 - x) + C4*x^2*ln(1 - x) + C5*x^3*ln(1 - x)
In this case, ln(y) becomes our new dependent variable, and x^n*ln(1-x) for n = 0...3 become our new independent variables. Because every coefficient multiplies its independent variable and the terms are added linearly, we may use linear regression (LINFIT) to find the coefficients of this linearized equation.
The nonlinear model is
y(x) = C1*(1 - x)^(C2 + C3*x + C4*x^2 + C5*x^3)
For GENFIT, F(x, C) returns the model value followed by its partial derivatives with respect to each coefficient. Writing p = C2 + C3*x + C4*x^2 + C5*x^3:
F(x, C) := [ C1*(1 - x)^p,
             (1 - x)^p,                       (dy/dC1)
             C1*(1 - x)^p * ln(1 - x),        (dy/dC2)
             C1*(1 - x)^p * x*ln(1 - x),      (dy/dC3)
             C1*(1 - x)^p * x^2*ln(1 - x),    (dy/dC4)
             C1*(1 - x)^p * x^3*ln(1 - x) ]   (dy/dC5)
vg := (0.02, 1, 0, 0, 0)^T   vector of initial guesses for the coefficients
D := genfit(Tr, g, vg, F)    performs the nonlinear regression and returns the coefficients
STn_i := F(Tr_i, D)_1   calculates the correlated value (the first element of F) at each Tr value
D = (0.191, 15.202, -45.609, 53.469, -21.808)^T
[Plot "Results": data g and the nonlinear fit STn vs. Tr]
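The GENFIT step above can be sketched in Python with SciPy's curve_fit. The surface tension data g is not reproduced in the transcript, so this sketch generates noiseless synthetic data from the reported coefficient vector D; the model, initial guess vg, and coefficient values come from the text, and everything else is illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

# Model from the text: y = C1*(1 - x)^(C2 + C3*x + C4*x^2 + C5*x^3)
def st_model(x, C1, C2, C3, C4, C5):
    p = C2 + C3*x + C4*x**2 + C5*x**3
    return C1 * (1.0 - x)**p

# Synthetic stand-in for the worksheet data: evaluate the model at the
# coefficients genfit reported (the transcript does not include g itself).
D_reported = [0.191, 15.202, -45.609, 53.469, -21.808]
Tr = np.linspace(0.5, 0.95, 25)
g = st_model(Tr, *D_reported)

# Nonlinear regression starting from the document's initial-guess vector vg.
vg = [0.02, 1.0, 0.0, 0.0, 0.0]
D, _ = curve_fit(st_model, Tr, g, p0=vg, maxfev=20000)

STn = st_model(Tr, *D)            # correlated values at each Tr
sse = np.sum((STn - g)**2)        # sum of squared errors
```

Unlike genfit, curve_fit estimates the Jacobian numerically, so no vector of analytical derivatives is required.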
newy_i := ln(g_i)   the linearized dependent variable
F(x) := (1, ln(1 - x), x*ln(1 - x), x^2*ln(1 - x), x^3*ln(1 - x))^T   basis functions for LINFIT
D := linfit(Tr, newy, F)    performs the linear regression
STl_i := exp(F(Tr_i)·D)     these are the new calculated values for the linear regression
D_1 := exp(D_1)             the first coefficient in the linearized version was ln(C1), so convert it to C1
[Plot "Results": data g with the nonlinear (STn) and linear (STl) fits vs. Tr]
After converting the first coefficient from ln(C1) to C1, the linear regression gives
D = (7.374·10^-4, 93.46, -252.234, 271.601, -108.799)^T
For comparison, the nonlinear regression gave
D = (0.191, 15.202, -45.609, 53.469, -21.808)^T
Note that there is a big difference between the coefficients D that we would report in the two cases. Also note from the plot that the two regressions give different results. The sum of squared errors (SSE) for the two methods is given below. The nonlinear equation gives a smaller SSE, so does that mean it is the "best" fit? The linear regression follows the up-and-down trend of the points, so does that mean it is the "best"? Is the up-and-down trend even statistically real?
SSEn := Σ(STn - g)^2 = 4.346·10^-6
SSEl := Σ(STl - g)^2 = 6.416·10^-6
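The LINFIT path can be sketched the same way with an ordinary least-squares solve. Again, the worksheet data is not in the transcript, so this sketch reuses noiseless synthetic data generated from the reported nonlinear coefficients; since the linearized basis spans exactly the same model, the solve recovers C1 almost exactly. With real, noisy data the two methods give different answers, as the SSE comparison above shows, because the ln transform reweights the errors.

```python
import numpy as np

# Same synthetic data as in the nonlinear sketch (from the reported coefficients).
D_rep = [0.191, 15.202, -45.609, 53.469, -21.808]
Tr = np.linspace(0.5, 0.95, 25)
p = D_rep[1] + D_rep[2]*Tr + D_rep[3]*Tr**2 + D_rep[4]*Tr**3
g = D_rep[0] * (1.0 - Tr)**p

# Linearization: ln(y) = ln(C1) + C2*L + C3*x*L + C4*x^2*L + C5*x^3*L,  L = ln(1 - x)
newy = np.log(g)                      # redefined dependent variable
L = np.log(1.0 - Tr)
X = np.column_stack([np.ones_like(Tr), L, Tr*L, Tr**2*L, Tr**3*L])

beta, *_ = np.linalg.lstsq(X, newy, rcond=None)
C1 = np.exp(beta[0])                  # first coefficient was ln(C1), convert it
STl = np.exp(X @ beta)                # calculated values, back in y-space
sse_l = np.sum((STl - g)**2)
```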
[Plot "Results": data g with linear fits using 2 (ST2), 3 (ST3), 4 (ST4), and 5 (ST5) parameters vs. Tr]
n := 11        n is the number of points
i := 1 .. n    i is the index for all points
x := (1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0)^T
y := (8.1, 7.8, 8.5, 9.8, 9.5, 8.9, 8.6, 10.2, 9.3, 9.2, 10.5)^T
[Plot: y vs. x scatter of the 11 data points]
SSE = Σ_{i=1..n} (r_i)^2 = Σ_{i=1..n} [y_i - (a + b*x_i)]^2
b = (A_xy - A_x*A_y) / (A_xx - A_x^2)
a = A_y - b*A_x
where
A_x  = (1/n)*Σ x_i
A_xx = (1/n)*Σ x_i^2
A_xy = (1/n)*Σ x_i*y_i
A_y  = (1/n)*Σ y_i
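The average-based formulas above can be checked directly; this Python/NumPy sketch (illustrative, not part of the worksheet) applies them to the 11-point data set:

```python
import numpy as np

x = np.array([1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0])
y = np.array([8.1, 7.8, 8.5, 9.8, 9.5, 8.9, 8.6, 10.2, 9.3, 9.2, 10.5])
n = len(x)

Ax  = x.sum() / n        # A_x:  mean of x
Axx = (x**2).sum() / n   # A_xx: mean of x^2
Axy = (x*y).sum() / n    # A_xy: mean of x*y
Ay  = y.sum() / n        # A_y:  mean of y

b = (Axy - Ax*Ay) / (Axx - Ax**2)    # slope
a = Ay - b*Ax                        # intercept
print(a, b)   # a ≈ 6.414, b ≈ 1.809
```

The result agrees with np.polyfit(x, y, 1), which solves the same least-squares problem.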
[Plot: data y and the fitted line vs. x]
You already knew how to do this; we introduced it simply to establish the notation and the averages defined above, which will be used in the statistical analysis. The real question is how to analyze the regression we just performed. Once an equation has been fitted to some data, we should always ask: "How good is the fit?" The "goodness" of a fit can be measured by a statistical analysis of the residuals. There are several things you should always check.
The R² value is a useful statistic, but not definitive. It tells you how well the data fit the model. It does not tell you whether the model is correct. It tells you how much of the distribution of the data about the mean is described by the model.
R² = Σ(y_i,calc - y_av)^2 / Σ(y_i,exp - y_av)^2
In Mathcad:
R2 := [Σ_{i=1..n} (fit_i - A_y)^2] / [Σ_{i=1..n} (y_i - A_y)^2]
R2 = 0.5
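As a numerical check (a sketch reusing the same 11 points), the Mathcad R2 computation corresponds to:

```python
import numpy as np

x = np.array([1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0])
y = np.array([8.1, 7.8, 8.5, 9.8, 9.5, 8.9, 8.6, 10.2, 9.3, 9.2, 10.5])

b, a = np.polyfit(x, y, 1)     # slope and intercept of the fitted line
fit = a + b*x                  # fit_i, the calculated values
Ay = y.mean()                  # A_y, the mean of the data

# R2: fraction of the variation about the mean described by the model
R2 = np.sum((fit - Ay)**2) / np.sum((y - Ay)**2)
print(round(R2, 3))   # 0.5, matching the worksheet value
```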
The residuals are r_i = y_i,exp - y_i,calc; for the straight-line fit:
r_i := y_i - (a + b*x_i)
[Plot: residuals r vs. x, ranging from about -1 to 1.5]
bias := (1/n)*Σ_{i=1..n} r_i
bias = -1.211·10^-15   (zero to machine precision, as expected for a least-squares fit)
S_xx := Σ_{i=1..n} (x_i - A_x)^2 = Σ_{i=1..n} x_i^2 - (1/n)*(Σ_{i=1..n} x_i)^2
Var = s^2 = SSE/(n - p), where n - p is the degrees of freedom (number of points minus the number of parameters).
The confidence interval on the slope is
b - t(k, n-2)*s/sqrt(S_xx)  <=  true slope  <=  b + t(k, n-2)*s/sqrt(S_xx)
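The slope confidence interval can be checked numerically. This Python/SciPy sketch uses the same 11 points; a two-sided 95% level is assumed for t(k, n-2), since the transcript does not state k:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0])
y = np.array([8.1, 7.8, 8.5, 9.8, 9.5, 8.9, 8.6, 10.2, 9.3, 9.2, 10.5])
n, p = len(x), 2                       # p = number of fitted parameters (a and b)

b, a = np.polyfit(x, y, 1)
SSE = np.sum((y - (a + b*x))**2)       # sum of squared residuals
s2 = SSE / (n - p)                     # Var = s^2 = SSE/(n - p)
Sxx = np.sum((x - x.mean())**2)        # S_xx

t = stats.t.ppf(0.975, n - 2)          # two-sided 95% t-statistic (assumed level)
half = t * np.sqrt(s2 / Sxx)           # half-width: t(k, n-2)*s/sqrt(S_xx)
print(b - half, b + half)              # confidence interval on the slope
```

The wide interval (roughly 1.81 ± 1.36) shows how uncertain the slope is for this scattered data, even though the fitted line looks reasonable.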