Regression Analysis

Transcript of Regression Analysis

Regression Analysis

RLR

1

Purpose of Regression

Fit data to model

Known model based on physics

P* = exp[A - B/(T+C)] Antoine eq.

Assumed correlation

y = a + b*x1 + c*x2

Use model

Interpolate

Extrapolate (use extreme caution)

Identify outliers

Identify trends in data

2

[Plot: an X-Y data set illustrating trends and outliers]

Linear Regression

There are two classes of regressions:

Linear

Non-linear

"Linear" refers to the parameters, not the independent variables.

Sensitivity coefficients of linear models contain no model parameters.

3
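The sensitivity-coefficient test can be sketched numerically (a hypothetical helper, not from the slides): approximate dy/d(parameter) by finite differences and see whether it depends on the parameter values. For a model that is linear in its parameters it does not.

```python
# Sketch: finite-difference sensitivity coefficients dy/d(params[i]).
# For a model linear in its parameters the sensitivity is independent
# of the parameter values; for a nonlinear model it is not.
import math

def sensitivity(model, params, i, x, h=1e-6):
    """Central-difference estimate of dy/d(params[i]) at point x."""
    up = list(params); up[i] += h
    dn = list(params); dn[i] -= h
    return (model(x, up) - model(x, dn)) / (2 * h)

linear    = lambda x, p: p[0] + p[1]*x + p[2]*x**2   # y = a + b*x + c*x^2
nonlinear = lambda x, p: p[0] * math.exp(p[1]*x)     # y = a*exp(b*x)

# dy/db for the linear model is x itself, whatever the parameters are:
s1 = sensitivity(linear, [1.0, 2.0, 3.0], 1, x=0.7)
s2 = sensitivity(linear, [9.0, -4.0, 0.5], 1, x=0.7)
print(abs(s1 - 0.7) < 1e-4 and abs(s2 - 0.7) < 1e-4)   # True

# dy/da for the nonlinear model is exp(b*x), so it changes with b:
t1 = sensitivity(nonlinear, [1.0, 1.0], 0, x=0.7)
t2 = sensitivity(nonlinear, [1.0, 2.0], 0, x=0.7)
print(abs(t1 - t2) > 0.1)   # True
```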

Which of these models are linear?

4

Example: Surface Tension Model

5

Issue 1: Nonlinear vs. Linear Regression

Nonlinear model

Linearized model

6

Nonlinear Regression: Mathcad - GENFIT

7

Nonlinear Regression Results

8

Linear Regression: Mathcad - Linfit

To do the linear regression:

Redefine the dependent variable

Define the independent variables

9

Linear Regression Results

10

Comparison

nonlinear

linear

11

Issue 2: How many parameters?

Linear regressions with 2, 3, 4, and 5 parameters

12

Statistical Analysis of Regression

Straight Line Model as Example

13

Fit a Line Through This Data

14

Least Squares

15

How Good is the Fit?

What is the R2 value?

Useful statistic, but not definitive

Does tell you how well the model fits the data

Does not tell you that the model is correct

Tells you how much of the distribution about the mean is described by the model

16

Problems with R2

17

[Four X-Y scatter plots (Anscombe's quartet): four data sets that give nearly identical fitted lines and R2 values despite very different patterns]

How Good is the Fit?

Are residuals random

18

Residuals Should Be Normally Distributed

19

[Plots: residuals e vs. X for two fits. One set of residuals scatters randomly about zero; the other follows a systematic parabolic pattern, indicating a poor model.]

How Good is the Fit?

Find Confidence Interval

20

Parameter Confidence Level

21

Confidence Level of y

22

Multiple Linear Regression in Mathcad

23

Multiple Linear Regression: Mathcad - Regress

24

Mathcad Regress Function

25

Results on Ycalc vs Y Plot

26

Residuals

27

R2 Statistic

28

Confidence Level for Parameters

n is number of points, kk is number of independent variables

29

Confidence Level for Ycalc

30

For the model

y = b0 + b1*x1 + b2*x2

the sensitivity coefficients are

dy/db0 = 1        dy/db1 = x1        dy/db2 = x2

None of these contains a model parameter, so the model is linear in b0, b1, b2.

The candidate models:

y = ax + bx^2 + c                     linear in a, b, c
y = a*e^(bx)                          nonlinear (b sits in the exponent)
y = a + bT + cT^3 + dT^4 + eT^5       linear in a, b, c, d, e
y = exp[A - B/(T+C)]                  nonlinear (Antoine equation)
y = mx + b                            linear in m and b
y = ax^b + c                          nonlinear (parameter b is an exponent)

Tr := data<1>/Tc        (reduced temperature)

[Plot: surface-tension data g vs. Tr; Tr from 0.5 to 1, g from 0 to 0.03]

ln(y) = ln(C1) + (C2 + C3*x + C4*x^2 + C5*x^3)*ln(1-x)

Expanding:

ln(y) = ln(C1) + C2*ln(1-x) + C3*x*ln(1-x) + C4*x^2*ln(1-x) + C5*x^3*ln(1-x)

In this case, ln(y) becomes our new dependent variable and x^n*ln(1-x) for n = 0...3 become our new independent variables. Because all of the coefficients multiply their independent variables and are added linearly, we may use linear regression (LINFIT) to find the coefficients in this linearized equation.
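The linearization can be sketched with NumPy standing in for LINFIT. The data here are synthetic and the "true" coefficient values are made up for illustration:

```python
# Sketch: fit the linearized surface-tension model by ordinary least
# squares. Synthetic data generated from assumed "true" coefficients.
import numpy as np

# "True" coefficients for y = C1*(1-x)^(C2 + C3*x + C4*x^2 + C5*x^3)
C_true = np.array([0.05, 1.2, -0.4, 0.3, -0.1])
x = np.linspace(0.5, 0.95, 20)          # reduced temperature Tr
y = C_true[0] * (1 - x) ** (C_true[1] + C_true[2]*x
                            + C_true[3]*x**2 + C_true[4]*x**3)

# Linearized form: ln(y) = ln(C1) + sum_n C_{n+2} * x^n * ln(1-x), n = 0..3
A = np.column_stack([np.ones_like(x)] +
                    [x**n * np.log(1 - x) for n in range(4)])
coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)

C_fit = np.concatenate([[np.exp(coef[0])], coef[1:]])   # undo ln(C1)
print(np.allclose(C_fit, C_true))                       # True (exact data)
```

With noise-free synthetic data the linearized fit recovers the coefficients up to round-off; with real data the two approaches differ, as the slides go on to show.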

The nonlinear model:

y(x) = C1*(1-x)^(C2 + C3*x + C4*x^2 + C5*x^3)

For genfit, define the vector F(x, C) whose first element is the model and whose remaining elements are its partial derivatives with respect to each coefficient:

F(x, C) := [ C1*(1-x)^p(x)
             (1-x)^p(x)                        (d/dC1)
             C1*(1-x)^p(x) * ln(1-x)           (d/dC2)
             C1*(1-x)^p(x) * x*ln(1-x)         (d/dC3)
             C1*(1-x)^p(x) * x^2*ln(1-x)       (d/dC4)
             C1*(1-x)^p(x) * x^3*ln(1-x) ]     (d/dC5)

where p(x) = C2 + C3*x + C4*x^2 + C5*x^3.

vg := (0.02  1  0  0  0)^T        vector of initial guesses for the coefficients

D := genfit(Tr, g, vg, F)        This performs the nonlinear regression and returns the coefficients.

STn_i := F(Tr_i, D)_1        This calculates the correlated values at each Tr value.

D = (0.191  15.202  -45.609  53.469  -21.808)^T

[Plot "Results": data g and nonlinear fit STn vs. Tr; Tr from 0.5 to 1, y-axis from 0 to 0.03]
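For readers without Mathcad, SciPy's curve_fit can play the role of genfit; it estimates the derivatives numerically, so only the model function is needed. The data and coefficient values below are synthetic, assumed for illustration:

```python
# Sketch: nonlinear regression of the surface-tension model, with SciPy
# standing in for Mathcad's genfit. Data and coefficients are made up.
import numpy as np
from scipy.optimize import curve_fit

def st_model(x, C1, C2, C3, C4, C5):
    # y = C1*(1-x)^(C2 + C3*x + C4*x^2 + C5*x^3)
    return C1 * (1 - x) ** (C2 + C3*x + C4*x**2 + C5*x**3)

x = np.linspace(0.5, 0.95, 20)                 # reduced temperatures
y = st_model(x, 0.05, 1.2, -0.4, 0.3, -0.1)    # synthetic "data"

vg = [0.02, 1, 0, 0, 0]                        # initial guesses, like vg above
D, _ = curve_fit(st_model, x, y, p0=vg, maxfev=20000)
print(np.round(D, 3))
```

As with genfit, the result depends on a reasonable initial-guess vector; a poor starting point can land the minimizer in a local minimum.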

newy_i := ln(g_i)

F(x) := (1  ln(1-x)  x*ln(1-x)  x^2*ln(1-x)  x^3*ln(1-x))^T        basis functions for linfit


D := linfit(Tr, newy, F)

STl_i := exp(F(Tr_i)*D)        These are the new calculated values for the linear regression.

[Plot "Results": data g with nonlinear fit STn and linear fit STl vs. Tr]


D_1 := exp(D_1)        The first coefficient in the linearized version was ln(C1), so convert it to C1.

D = (7.374*10^4  93.46  -252.234  271.601  -108.799)^T


Compare with the nonlinear result: D = (0.191  15.202  -45.609  53.469  -21.808)^T

Note that there is a big difference in the coefficients, D, that we would report in the two different cases. Also note from the plot that the two regressions give different results. The sum of the squared errors (SSE) obtained from the two methods is given below. The nonlinear equation gives a smaller SSE, so does that mean it is the "best" fit? The linear regression follows the up-and-down trend of the points, so does that mean that it is the "best"? Is the up-and-down trend even statistically real?

SSEn := SUM (STn - g)^2        SSEn = 4.346*10^-6

SSEl := SUM (STl - g)^2        SSEl = 6.416*10^-6

[Plot "Results": data g with the 2-, 3-, 4-, and 5-parameter fits ST2, ST3, ST4, ST5 vs. Tr; y-axis from 8*10^-4 to 0.026]

n := 11        n is number of points

i := 1 .. n    i is the index for all points

x := (1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0)^T

y := (8.1  7.8  8.5  9.8  9.5  8.9  8.6  10.2  9.3  9.2  10.5)^T

[Plot: y vs. x]

SSE = SUM_{i=1..n} (r_i)^2 = SUM_{i=1..n} [y_i - (a + b*x_i)]^2


b = (Axy - Ax*Ay) / (Axx - Ax^2)        a = Ay - b*Ax

Ax := (1/n) SUM_{i=1..n} x_i        Axx := (1/n) SUM_{i=1..n} (x_i)^2

Axy := (1/n) SUM_{i=1..n} (x_i*y_i)        Ay := (1/n) SUM_{i=1..n} y_i
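With the x and y vectors defined above, the slope and intercept formulas can be checked directly (a NumPy sketch; np.polyfit serves as an independent cross-check):

```python
# Sketch: least-squares slope and intercept from the averages formulas.
import numpy as np

x = np.array([1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0])
y = np.array([8.1, 7.8, 8.5, 9.8, 9.5, 8.9, 8.6, 10.2, 9.3, 9.2, 10.5])

Ax, Ay = x.mean(), y.mean()
Axx, Axy = (x**2).mean(), (x*y).mean()

b = (Axy - Ax*Ay) / (Axx - Ax**2)   # slope
a = Ay - b*Ax                       # intercept

print(round(b, 3), round(a, 3))     # 1.809 6.414
assert np.allclose([b, a], np.polyfit(x, y, 1))
```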

[Plot: data y and fitted line fit vs. x]

You already knew how to do this; we introduced it simply to establish the notation and the averages defined above that will be used in the statistical analysis. The real question is how to analyze the regression we just performed. Once an equation has been fitted to some data, we should always ask ourselves: "how good is the fit?" The "goodness" of a fit can be measured by a statistical analysis of the residuals. There are several things you should always check.

The R^2 value is a useful statistic, but not definitive. It tells you how well the data fit the model. It does not tell you if the model is correct. It tells you how much of the distribution of the data about the mean is described by the model.

R^2 = SUM (y_i,calc - y_av)^2 / SUM (y_i,exp - y_av)^2

R2 := [ SUM_{i=1..n} (fit_i - Ay)^2 ] / [ SUM_{i=1..n} (y_i - Ay)^2 ]        R2 = 0.5
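Evaluating the R^2 formula on the same example data reproduces the value above (a NumPy sketch):

```python
# Sketch: R^2 as the fraction of the variation about the mean that the
# fitted line describes, computed on the example data set.
import numpy as np

x = np.array([1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0])
y = np.array([8.1, 7.8, 8.5, 9.8, 9.5, 8.9, 8.6, 10.2, 9.3, 9.2, 10.5])

b, a = np.polyfit(x, y, 1)          # slope, intercept
fit = a + b*x

R2 = np.sum((fit - y.mean())**2) / np.sum((y - y.mean())**2)
print(round(R2, 2))                 # 0.5 -- the model explains half the spread
```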

r_i = y_i,exp - y_i,calc        r_i := y_i - (a + b*x_i)

[Plot: residuals r vs. x]


bias := (1/n) SUM_{i=1..n} r_i        bias = -1.211*10^-15

Sxx := SUM_{i=1..n} (x_i - Ax)^2 = SUM_{i=1..n} (x_i)^2 - (1/n) [SUM_{i=1..n} x_i]^2

Variance: s^2 = SSE / (n - p)

where n - p is the degrees of freedom (number of points minus the number of parameters).

Confidence interval for the slope b:

b - s*t(k, n-2)/sqrt(Sxx)  <  beta  <  b + s*t(k, n-2)/sqrt(Sxx)
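The interval calculation can be sketched for the example data with SciPy's t distribution (an assumption here: a 95% two-sided confidence level stands in for the unspecified k):

```python
# Sketch: 95% confidence interval for the slope of the example fit.
import numpy as np
from scipy.stats import t

x = np.array([1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0])
y = np.array([8.1, 7.8, 8.5, 9.8, 9.5, 8.9, 8.6, 10.2, 9.3, 9.2, 10.5])
n = len(x)

b, a = np.polyfit(x, y, 1)
SSE = np.sum((y - (a + b*x))**2)
s2 = SSE / (n - 2)                    # variance: SSE over degrees of freedom
Sxx = np.sum((x - x.mean())**2)

tval = t.ppf(0.975, n - 2)            # two-sided 95% level
half = tval * np.sqrt(s2 / Sxx)
print(f"b = {b:.3f} +/- {half:.3f}")  # b = 1.809 +/- 1.364
```

The interval is wide because R^2 is only 0.5: a slope anywhere from about 0.4 to 3.2 is consistent with this scattered data.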