
8

Practical Guidelines for Algorithm Implementation

Besides the basic estimation algorithm, one should be aware of several details that should be taken care of for a successful and meaningful estimation of unknown parameters. Items such as the quality of the data at hand (detection of outliers), generation of initial guesses, overstepping and potential remedies to overcome possible ill-conditioning of the problem are discussed in the following sections. Particular emphasis is given to implementation guidelines for differential equation models. Finally, we touch upon the problem of autocorrelation.

8.1 INSPECTION OF THE DATA

One of the very first things that the analyst should do prior to attempting to fit a model to the data is to inspect the data at hand. Visual inspection is a very powerful tool as the eye can often pick up inconsistent behavior. The primary goal of this data inspection is to spot potential outliers.

Barnett et al. (1994) define outliers as the observations in a sample which appear to be inconsistent with the remainder of the sample. In engineering applications, an outlier is often the result of gross measurement error. This includes a mistake in the calculations or in data coding or even a copying error. An outlier could also be the result of inherent variability of the process, although chances are it is not!

Besides visual inspection, there are several statistical procedures for detecting outliers (Barnett and Lewis, 1978). These tests are based on the examination of the residuals. Practically this means that if a residual is bigger than 3 or 4 standard deviations, it should be considered a potential outlier since the probability of occurrence of such a data point (due to the inherent variability of the process only) is less than 0.1%.

A potential outlier should always be examined carefully. First we check if there was a simple mistake during the recording or the early manipulation of the data. If this is not the case, we proceed to a careful examination of the experimental circumstances during the particular experiment. We should be very careful not to discard ("reject") an outlier unless we have strong non-statistical reasons for doing so. In the worst case, we should report both results, one with the outlier and the other without it.

Instead of a detailed presentation of the effect of extreme values and outliers on least squares estimation, the following common sense approach is recommended in the analysis of engineering data:

(i) Start with careful visual observation of the data provided in tabular and graphical form.
(ii) Identify a "first" set of potential outliers.
(iii) Examine whether they should be discarded due to non-statistical reasons, and eliminate those from the data set.
(iv) Estimate the parameters, obtain the values of the response variables and calculate the residuals.
(v) Plot the residuals and examine whether any data points lie beyond 3 standard deviations around the mean response estimated by the mathematical model.
(vi) Having identified the second set of potential outliers, examine whether they should be discarded due to non-statistical reasons and eliminate them from the data set.
(vii) For all remaining "outliers" examine their effect on the model response and the estimated parameter values. This is done by performing the parameter estimation calculations twice, once with the outlier included in the data set and once without it.
(viii) Determine which of the outliers are highly informative compared to the rest of the data points. The remaining outliers should have little or no effect on the model parameters or the mean estimated response of the model and hence, they can be ignored.
(ix) Prepare a final report where the effect of highly informative outliers is clearly presented.
(x) Perform replicate experiments at the experimental conditions where the outliers were detected if, of course, time, money and circumstances allow it.

In summary, the possibility of detecting and correcting an outlier in the data is of paramount importance, particularly if it is spotted by an early inspection. Therefore, careful inspection of the data is highly advocated prior to using any parameter estimation software.

8.2 GENERATION OF INITIAL GUESSES

One of the most important tasks prior to employing an iterative parameter estimation method for the solution of the nonlinear least squares problem is the generation of initial guesses (starting values) for the unknown parameters. A good initial guess facilitates quick convergence of any iterative method, and particularly of the Gauss-Newton method, to the optimum parameter values. There are several approaches that can be followed to arrive at reasonably good starting values for the parameters and they are briefly discussed below.

8.2.1 Nature and Structure of the Model

The nature of the mathematical model that describes a physical system may dictate a range of acceptable values for the unknown parameters. Furthermore, repeated computation of the response variables for various values of the parameters and subsequent plotting of the results provides valuable experience to the analyst about the behavior of the model and its dependency on the parameters. As a result of this exercise, we often come up with fairly good initial guesses for the parameters. The only disadvantage of this approach is that it could be time consuming. This is counterbalanced by the fact that one learns a lot about the structure and behavior of the model at hand.

8.2.2 Asymptotic Behavior of the Model Equations

Quite often the asymptotic behavior of the model can aid us in determining sufficiently good initial guesses. For example, let us consider the Michaelis-Menten kinetics for enzyme catalyzed reactions,

$ y_i = \frac{k_1 x_i}{k_2 + x_i} $    (8.1)

When $x_i$ tends to zero, the model equation reduces to

$ y_i \approx \frac{k_1}{k_2} x_i $    (8.2)


and hence, from the very first data points at small values of $x_i$, we can obtain $k_1/k_2$ as the slope of the straight line $y_i = \beta x_i$. Furthermore, as $x_i$ tends to infinity, the model equation reduces to

$ y_i = k_1 $    (8.3)

and $k_1$ is obtained as the asymptotic value of $y_i$ at large values of $x_i$.

Let us also consider the following exponential decay model, often encountered in analyzing environmental samples,

$ y_i = k_1 + k_2 \exp(-k_3 x_i) $    (8.4)

At large values of $x_i$ (i.e., as $x_i \to \infty$) the model reduces to

$ y_i = k_1 $    (8.5)

whereas at values of $x_i$ near zero (i.e., as $x_i \to 0$) the model reduces to

$ y_i = k_1 + k_2 (1 - k_3 x_i) $    (8.6a)

or

$ y_i = k_1 + k_2 - k_2 k_3 x_i $    (8.6b)

which is of the form $y_i = \beta_0 + \beta_1 x_i$ and hence, by performing a linear regression with the values of $x_i$ near zero, we obtain estimates of $k_1 + k_2$ and $k_2 k_3$. Combined with our estimate of $k_1$, we obtain starting values for all the unknown parameters.
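As an illustration, the following sketch generates starting values for the exponential decay model of Equation 8.4 from its asymptotic behavior. It assumes the data are stored in NumPy arrays x and y; the number of points used at the tail (n_tail) and near the origin (n_head) are choices made for this illustration only.

import numpy as np

def decay_initial_guess(x, y, n_head=4, n_tail=3):
    """Starting values for y = k1 + k2*exp(-k3*x) (Equation 8.4) from the
    asymptotic behavior of the model."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    k1 = y[-n_tail:].mean()                    # plateau at large x gives k1
    slope, intercept = np.polyfit(x[:n_head], y[:n_head], 1)
    k2 = intercept - k1                        # intercept estimates k1 + k2
    k3 = -slope / k2                           # slope estimates -k2*k3
    return k1, k2, k3

# Synthetic check with k1=2, k2=5, k3=1.5:
x = np.linspace(0.0, 5.0, 20)
y = 2.0 + 5.0 * np.exp(-1.5 * x)
print(decay_initial_guess(x, y))               # roughly (2, 5, 1.5)

The guesses need not be accurate; they only have to place the Gauss-Newton method inside its region of convergence.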

In some cases we may not be able to obtain estimates of all the parameters by examining the asymptotic behavior of the model. However, it can still be used to obtain estimates of some of the parameters, or even to establish an approximate relationship between some of the parameters as functions of the rest.

8.2.3 Transformation of the Model Equations

A suitable transformation of the model equations can simplify the structure of the model considerably and thus initial guess generation becomes a trivial task. The most interesting case, which is also often encountered in engineering applications, is that of transformably linear models. These are nonlinear models that reduce to simple linear models after a suitable transformation is performed. These models have been used extensively in engineering, particularly before the wide availability of computers, so that the parameters could easily be obtained with linear least squares estimation. Even today such models are also used to reveal characteristics of the behavior of the model in graphical form.


If we have a transformably linear system, our chances are outstanding of quickly getting very good initial guesses for the parameters. Let us consider a few typical examples.

The integrated form of substrate utilization in an enzyme catalyzed batch bioreactor is given by the implicit equation

$ y_0 - y_i + k_2 \ln\!\left(\frac{y_0}{y_i}\right) = k_1 (x_i - x_0) $    (8.7)

where $x_i$ is time and $y_i$ is the substrate concentration. The initial conditions ($x_0$, $y_0$) are assumed to be known precisely. The above model is implicit in $y_i$ and we should use the implicit formulation discussed in Chapter 2 to solve it by the Gauss-Newton method. Initial guesses for the two parameters can be obtained by noticing that this model is transformably linear since Equation 8.7 can be written as

$ \frac{y_0 - y_i}{x_i - x_0} = k_1 - k_2 \frac{1}{x_i - x_0} \ln\!\left(\frac{y_0}{y_i}\right) $    (8.8)

which is of the form $Y_i = \beta_0 + \beta_1 X_i$ where

$ Y_i = \frac{y_0 - y_i}{x_i - x_0} $    (8.9a)

and

$ X_i = \frac{1}{x_i - x_0} \ln\!\left(\frac{y_0}{y_i}\right) $    (8.9b)

Initial estimates for the parameters can be readily obtained using linear least squares estimation with the transformed model.

The famous Michaelis-Menten kinetics expression shown below

$ y_i = \frac{k_1 x_i}{k_2 + x_i} $    (8.10)

can become linear by the following transformations, also known as the

Lineweaver-Burk plot:  $ \frac{1}{y_i} = \frac{1}{k_1} + \frac{k_2}{k_1}\,\frac{1}{x_i} $    (8.11)

Eadie-Hofstee plot:  $ y_i = k_1 - k_2\,\frac{y_i}{x_i} $    (8.12)

and Hanes plot:  $ \frac{x_i}{y_i} = \frac{1}{k_1}\,x_i + \frac{k_2}{k_1} $    (8.13)

All the above transformations can readily produce initial parameter estimates for the kinetic parameters $k_1$ and $k_2$ by performing a simple linear regression, as sketched below.
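The following minimal sketch uses the Lineweaver-Burk route (Equation 8.11); the arrays x and y holding the measurements are assumptions of the illustration.

import numpy as np

def michaelis_menten_guess(x, y):
    """Starting values for y = k1*x/(k2 + x) from the Lineweaver-Burk line
    1/y = 1/k1 + (k2/k1)*(1/x) (Equation 8.11)."""
    slope, intercept = np.polyfit(1.0 / x, 1.0 / y, 1)
    k1 = 1.0 / intercept
    k2 = slope * k1
    return k1, k2

x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
y = 3.0 * x / (1.5 + x)                  # noise-free data with k1=3, k2=1.5
print(michaelis_menten_guess(x, y))      # approximately (3.0, 1.5)

Note that such transformations distort the error structure, so the transformed fit should only be used to seed the nonlinear estimation, not to replace it.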

Another class of models that are often transformably linear arises in heterogeneous catalysis. For example, the rate of dehydrogenation of ethanol into acetaldehyde over a Cu-Co catalyst is given by the following expression, assuming that the reaction on two adjacent sites is the rate controlling step (Franckaerts and Froment, 1964; Froment and Bischoff, 1990),

$ y_i = \frac{k_5\,x_{1i}}{\left(1 + k_1 x_{1i} + k_2 x_{2i} + k_3 x_{3i} + k_4 x_{4i}\right)^2} $    (8.14)

where $y_i$ is the measured overall reaction rate and $x_1$, $x_2$, $x_3$ and $x_4$ are the partial pressures of the chemical species. Good initial guesses for the unknown parameters, $k_1, \ldots, k_5$, can be obtained by linear least squares estimation of the transformed equation,

$ \sqrt{\frac{x_{1i}}{y_i}} = \frac{1}{\sqrt{k_5}} + \frac{k_1}{\sqrt{k_5}}\,x_{1i} + \frac{k_2}{\sqrt{k_5}}\,x_{2i} + \frac{k_3}{\sqrt{k_5}}\,x_{3i} + \frac{k_4}{\sqrt{k_5}}\,x_{4i} $    (8.15)

which is of the form $Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i}$.

8.2.4 Conditionally Linear Systems

In engineering we often encounter conditionally linear systems. These were defined in Chapter 2, where it was indicated that special algorithms can be used which exploit their conditional linearity (see Bates and Watts, 1988). In general, we need to provide initial guesses only for the nonlinear parameters since the conditionally linear parameters can be obtained through linear least squares estimation. For example, let us consider the three-parameter model,


$ y_i = k_1 + k_2\,e^{k_3 x_i} $    (8.16)

where both $k_1$ and $k_2$ are the conditionally linear parameters. These can be readily estimated by linear least squares once the value of $k_3$ is known. Conditionally linear parameters arise naturally in reaction rates with an Arrhenius temperature dependence.
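A possible sketch of this idea for the model of Equation 8.16 is given below; the grid of trial values for the nonlinear parameter $k_3$ (k3_grid) is an assumption of the illustration, and for each trial value $k_1$ and $k_2$ follow from linear least squares.

import numpy as np

def fit_conditionally_linear(x, y, k3_grid):
    """For y = k1 + k2*exp(k3*x): only k3 is nonlinear.  For each trial k3
    the conditionally linear k1 and k2 follow from linear least squares;
    the best triple over the grid is returned."""
    best = None
    for k3 in k3_grid:
        X = np.column_stack([np.ones_like(x), np.exp(k3 * x)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        ssr = float(np.sum((y - X @ coef)**2))
        if best is None or ssr < best[0]:
            best = (ssr, coef[0], coef[1], k3)
    return best[1:]                      # (k1, k2, k3)

x = np.linspace(0.0, 2.0, 15)
y = 1.0 + 0.5 * np.exp(-2.0 * x)         # true k1=1, k2=0.5, k3=-2
print(fit_conditionally_linear(x, y, np.linspace(-5.0, 0.0, 101)))

The grid search over the single nonlinear parameter is cheap because each inner problem is linear.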

8.2.5 Direct Search Approach

If we have very little information about the parameters, direct search methods, like the LJ optimization technique presented in Chapter 5, present an excellent way to generate very good initial estimates for the Gauss-Newton method. Actually, for algebraic equation models, direct search methods can be used to determine the optimum parameter estimates quite efficiently. However, if estimates of the uncertainty in the parameters are required, use of the Gauss-Newton method is strongly recommended, even if it is only for a couple of iterations.

8.3 OVERSTEPPING

Quite often the direction determined by the Gauss-Newton method, or any other gradient method for that matter, is towards the optimum; however, the length of the suggested increment of the parameters could be too large. As a result, the value of the objective function at the new parameter estimates could actually be higher than its value at the previous iteration.

Figure 8.1 Contours of the objective function in the vicinity of the optimum. Potential problems with overstepping are shown for a two-parameter problem.


The classical solution to this problem is to limit the step-length through the introduction of a stepping parameter $\mu$ ($0 < \mu \le 1$), namely

$ \mathbf{k}^{(j+1)} = \mathbf{k}^{(j)} + \mu\,\Delta\mathbf{k}^{(j+1)} $    (8.17)

The easiest way to arrive at an acceptable value of $\mu$, $\mu_a$, is by employing the bisection rule as previously described. Namely, we start with $\mu = 1$ and we keep on halving $\mu$ until the objective function at the new parameter values becomes less than that obtained in the previous iteration, i.e., we reduce $\mu$ until

$ S(\mathbf{k}^{(j)} + \mu\,\Delta\mathbf{k}^{(j+1)}) < S(\mathbf{k}^{(j)}) $    (8.18)

Normally, we stop the step-size determination here and proceed to perform another iteration of the Gauss-Newton method. This is what has been implemented in the computer programs accompanying this book.
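A minimal sketch of the bisection rule is given below. The objective function S, the current estimates k and the Gauss-Newton increment dk are assumed supplied by the surrounding estimation code; the lower bound mu_min is a safeguard added for this illustration.

def bisection_step(S, k, dk, mu_min=1e-10):
    """Halve the step-size mu until S(k + mu*dk) < S(k) (Equation 8.18).
    S is the objective function, k the current estimates and dk the
    Gauss-Newton increment; mu_min is an illustrative safeguard."""
    S0 = S(k)
    mu = 1.0
    while S(k + mu * dk) >= S0:
        mu *= 0.5
        if mu < mu_min:               # the direction is not productive
            raise RuntimeError("no acceptable step-size found")
    return mu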

8.3.1 An Optimal Step-Size Policy

Once an acceptable value for the step-size has been determined, we can continue and, with only one additional evaluation of the objective function, obtain the optimal step-size that should be used along the direction suggested by the Gauss-Newton method.

Essentially we need to perform a simple line search along the direction of $\Delta\mathbf{k}^{(j+1)}$. The simplest way to do this is by approximating the objective function by a quadratic along this direction. Namely,

$ S(\mu) = S(\mathbf{k}^{(j)} + \mu\,\Delta\mathbf{k}^{(j+1)}) = \beta_0 + \beta_1 \mu + \beta_2 \mu^2 $    (8.19)

The first coefficient is readily obtained at $\mu = 0$ as

$ \beta_0 = S_0 \equiv S(\mathbf{k}^{(j)}) $    (8.20)

which has already been calculated. The other two coefficients, $\beta_1$ and $\beta_2$, can be computed from information already available at $\mu = \mu_a$ together with an additional evaluation at $\mu = \mu_a/2$. If we denote by $S_1$ and $S_2$ the values of the objective function at $\mu = \mu_a/2$ and $\mu = \mu_a$ respectively, we have

$ S_1 = S_0 + \beta_1\,\frac{\mu_a}{2} + \beta_2\,\frac{\mu_a^2}{4} $    (8.21a)

and

$ S_2 = S_0 + \beta_1\,\mu_a + \beta_2\,\mu_a^2 $    (8.21b)

Solution of the above two equations yields

$ \beta_1 = \frac{4S_1 - 3S_0 - S_2}{\mu_a} $    (8.22a)

and

$ \beta_2 = \frac{2(S_2 + S_0 - 2S_1)}{\mu_a^2} $    (8.22b)

Having estimated $\beta_1$ and $\beta_2$, we can proceed and obtain the optimum step-size by using the stationary criterion

$ \frac{\partial S(\mu)}{\partial \mu} = 0 $    (8.23)

which yields

$ \mu_{opt} = -\frac{\beta_1}{2\beta_2} $    (8.24)

Substitution of the expressions for $\beta_1$ and $\beta_2$ yields

$ \mu_{opt} = \frac{\mu_a\,(4S_1 - 3S_0 - S_2)}{4\,(2S_1 - S_0 - S_2)} $    (8.25)

The above expression for the optimal step-size is used in the calculation of the parameter estimates for the next iteration of the Gauss-Newton method,

$ \mathbf{k}^{(j+1)} = \mathbf{k}^{(j)} + \mu_{opt}\,\Delta\mathbf{k}^{(j+1)} $    (8.26)

If we wish to avoid the additional objective function evaluation at $\mu = \mu_a/2$, we can use the extra information that is available at $\mu = 0$. This approach is preferable for differential equation models where evaluation of the objective function requires the integration of the state equations. It is presented later in Section 8.7 where we discuss the implementation of the Gauss-Newton method for ODE models.
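The three-point construction of Equations 8.20 to 8.25 translates directly into a few lines of code; the sketch below assumes the objective function values S0, S1 and S2 have already been computed at mu = 0, mu_a/2 and mu_a.

def optimal_step(S0, S1, S2, mu_a):
    """Optimal step-size from the quadratic fit of Equations 8.19-8.25,
    with S0 = S(0), S1 = S(mu_a/2) and S2 = S(mu_a)."""
    beta1 = (4.0 * S1 - 3.0 * S0 - S2) / mu_a        # Equation 8.22a
    beta2 = 2.0 * (S2 + S0 - 2.0 * S1) / mu_a**2     # Equation 8.22b
    return -beta1 / (2.0 * beta2)                    # Equation 8.24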

8.4 ILL-CONDITIONING OF MATRIX A AND PARTIAL REMEDIES

If two or more of the unknown parameters are highly correlated, or one of the parameters does not have a measurable effect on the response variables, matrix A may become singular or near-singular. In such a case we have a so-called ill-posed problem and matrix A is ill-conditioned.

A measure of the degree of ill-conditioning of a nonsingular square matrix is the condition number, which is defined as

$ cond(\mathbf{A}) = \|\mathbf{A}\| \cdot \|\mathbf{A}^{-1}\| $    (8.27)

The condition number is always greater than or equal to one and it represents the maximum amplification of the errors in the right hand side in the solution vector. The condition number is also equal to the ratio of the largest to the smallest singular value of A. In parameter estimation applications, A is a positive definite symmetric matrix and hence $cond(\mathbf{A})$ is also equal to the ratio of the largest to the smallest eigenvalue of A, i.e.,

$ cond(\mathbf{A}) = \frac{\lambda_{max}}{\lambda_{min}} $    (8.28)

Generally speaking, for condition numbers less than $10^3$ the parameter estimation problem is well-posed. For condition numbers greater than $10^{10}$ the problem is relatively ill-conditioned, whereas for condition numbers of about $10^{15}$ or greater the problem is very ill-conditioned and we may encounter computer overflow problems.

The condition number of a matrix A is intimately connected with the sensitivity of the solution of the linear system of equations Ax = b. When solving this equation, the relative error in the solution can be magnified by an amount as large as cond(A) times the relative error in A and b,

$ \frac{\|\delta\mathbf{x}\|}{\|\mathbf{x}\|} \;\le\; cond(\mathbf{A}) \left[ \frac{\|\delta\mathbf{A}\|}{\|\mathbf{A}\|} + \frac{\|\delta\mathbf{b}\|}{\|\mathbf{b}\|} \right] $    (8.29)

Thus, the error in the solution vector is expected to be large for an ill-conditioned problem and small for a well-conditioned one. In parameter estimation, vector b is comprised of a linear combination of the response variables (measurements) which contain the error terms. Matrix A does not depend explicitly on the response variables; it depends only on the parameter sensitivity coefficients, which depend only on the independent variables (assumed to be known precisely), and on the estimated parameter vector k, which incorporates the uncertainty in the data. As a result, we expect most of the uncertainty in Equation 8.29 to be present in $\delta\mathbf{b}$.

If matrix A is ill-conditioned at the optimum (i.e., at k = k*), there is not much we can do. We are faced with a "truly" ill-conditioned problem and the estimated parameters will have highly questionable values with unacceptably large estimated variances. Probably, the most productive thing to do is to reexamine the structure and dependencies of the mathematical model and try to reformulate a better posed problem. Sequential experimental design techniques can also aid us in this direction to see whether additional experiments will improve the parameter estimates significantly (see Chapter 12).

If, however, matrix A is reasonably well-conditioned at the optimum, A could easily be ill-conditioned when the parameters are away from their optimal values. This is quite often the case in parameter estimation and it is particularly true for highly nonlinear systems. In such cases, we would like to have the means to move the parameter estimates from the initial guess to the optimum even if the condition number of matrix A is excessively high for these initial iterations. In general there are two remedies to this problem: (i) use a pseudoinverse and/or (ii) use Levenberg-Marquardt's modification.

8.4.1 Pseudoinverse

The eigenvalue decomposition of the positive definite symmetric matrix A yields

$ \mathbf{A} = \mathbf{V}^T \mathbf{\Lambda} \mathbf{V} $    (8.30)

where V is orthogonal (i.e., $\mathbf{V}^{-1} = \mathbf{V}^T$) and $\mathbf{\Lambda} = diag(\lambda_1, \lambda_2, \ldots, \lambda_p)$. Furthermore, it is assumed that the eigenvalues are in descending order ($\lambda_1 > \lambda_2 > \cdots > \lambda_p$).

Having decomposed matrix A, we can readily compute the inverse from

$ \mathbf{A}^{-1} = \mathbf{V}^T \mathbf{\Lambda}^{-1} \mathbf{V} $    (8.31)

If matrix A is well-conditioned, the above equation should be used. If, however, A is ill-conditioned, we have the option, without any additional computational effort, to use instead the pseudoinverse of A. Essentially, instead of $\mathbf{\Lambda}^{-1}$ in Equation 8.31, we use the pseudoinverse of $\mathbf{\Lambda}$, denoted $\mathbf{\Lambda}^{\#}$.

The pseudoinverse, $\mathbf{\Lambda}^{\#}$, is obtained by inverting only the eigenvalues which are greater than a user-specified small number, $\delta$, and setting the rest equal to zero. Namely,

$ \mathbf{\Lambda}^{\#} = diag(1/\lambda_1,\, 1/\lambda_2,\, \ldots,\, 1/\lambda_\pi,\, 0,\, \ldots,\, 0) $    (8.32)

where $\lambda_j \ge \delta$, $j = 1, 2, \ldots, \pi$ and $\lambda_j < \delta$, $j = \pi+1, \pi+2, \ldots, p$. In other words, by using the pseudoinverse of A, we are neglecting the contribution of the eigenvalues that are very close to zero. Typically, we have used for $\delta$ a value of $10^{-20}$ when $\lambda_1$ is of order 1.

Finally, it should be noted that this approach yields an approximate solution to $\mathbf{A}\Delta\mathbf{k}^{(j+1)} = \mathbf{b}$ that has the additional advantage of keeping $\|\Delta\mathbf{k}^{(j+1)}\|$ as small as possible.
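A sketch of the pseudoinverse solution using an eigendecomposition is given below; the cutoff delta is the user-specified small number of Equation 8.32.

import numpy as np

def pseudoinverse_step(A, b, delta=1e-20):
    """Solve A dk = b with the pseudoinverse of the symmetric positive
    semidefinite matrix A; eigenvalues below delta are set to zero
    rather than inverted (Equation 8.32)."""
    lam, V = np.linalg.eigh(A)        # columns of V are eigenvectors
    inv_lam = np.zeros_like(lam)
    keep = lam > delta
    inv_lam[keep] = 1.0 / lam[keep]   # invert only the retained eigenvalues
    return V @ (inv_lam * (V.T @ b))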


8.4.2 Marquardt's Modification

In order to improve the convergence characteristics and robustness of the Gauss-Newton method, Levenberg in 1944 and later Marquardt (1963) proposed to modify the normal equations by adding a small positive number, $\gamma^2$, to the diagonal elements of A. Namely, at each iteration the increment in the parameter vector is obtained by solving the following equation

$ (\mathbf{A} + \gamma^2 \mathbf{I})\,\Delta\mathbf{k}^{(j+1)} = \mathbf{b} $    (8.33)

Levenberg-Marquardt's modification is often given a geometric interpretation as being a compromise between the direction of the Gauss-Newton method and that of steepest descent. The increment vector, $\Delta\mathbf{k}^{(j+1)}$, is a weighted average between that of Gauss-Newton and the one of steepest descent.

A more interesting interpretation of Levenberg-Marquardt's modification can be obtained by examining the eigenvalues of the modified matrix ($\mathbf{A} + \gamma^2\mathbf{I}$). If we consider the eigenvalue decomposition of A, $\mathbf{V}^T\mathbf{\Lambda}\mathbf{V}$, we have

$ \mathbf{A} + \gamma^2\mathbf{I} = \mathbf{V}^T\mathbf{\Lambda}\mathbf{V} + \gamma^2\mathbf{V}^T\mathbf{V} = \mathbf{V}^T(\mathbf{\Lambda} + \gamma^2\mathbf{I})\mathbf{V} = \mathbf{V}^T\mathbf{\Lambda}_M\mathbf{V} $    (8.34)

where

$ \mathbf{\Lambda}_M = diag(\lambda_1 + \gamma^2,\, \lambda_2 + \gamma^2,\, \ldots,\, \lambda_p + \gamma^2) $    (8.35)

The net result is that all the eigenvalues of matrix A are increased by $\gamma^2$, i.e., the eigenvalues are now $\lambda_1 + \gamma^2, \lambda_2 + \gamma^2, \ldots, \lambda_p + \gamma^2$. Obviously, the large eigenvalues will be hardly changed whereas the small ones that are much smaller than $\gamma^2$ become essentially equal to $\gamma^2$ and hence, the condition number of matrix A is reduced from $\lambda_{max}/\lambda_{min}$ to

$ cond(\mathbf{A} + \gamma^2\mathbf{I}) = \frac{\lambda_{max} + \gamma^2}{\lambda_{min} + \gamma^2} $    (8.36)

There are some concerns about the implementation of the method (Bates and Watts, 1988) since one must decide how to manipulate both Marquardt's $\gamma^2$ and the step-size $\mu$. In our experience, a good implementation strategy is to decompose matrix A and examine the magnitude of the small eigenvalues. If some of them are excessively small, we use a $\gamma^2$ which is large compared to the small eigenvalues and we compute $\|\Delta\mathbf{k}^{(j+1)}\|$. Subsequently, using the bisection rule, we obtain an acceptable or even an optimal step-size along the direction of $\Delta\mathbf{k}^{(j+1)}$.
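The sketch below illustrates one such strategy; the rule used to pick gamma^2 from the eigenvalue spectrum (the floor factor) is an assumption of this illustration, not a prescription from the text.

import numpy as np

def marquardt_step(A, b, floor=1e-8):
    """Levenberg-Marquardt increment from (A + gamma^2 I) dk = b
    (Equation 8.33).  Here gamma^2 is chosen so that the smallest
    eigenvalue is lifted to about floor*lambda_max -- one possible
    heuristic, assumed for this illustration."""
    lam = np.linalg.eigvalsh(A)                       # ascending order
    gamma2 = max(floor * lam[-1] - lam[0], 0.0)
    return np.linalg.solve(A + gamma2 * np.eye(A.shape[0]), b)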


8.4.3 Scaling of Matrix A

When the parameters differ by more than one order of magnitude, matrix A may appear to be ill-conditioned even if the parameter estimation problem is well-posed. The best way to overcome this problem is by introducing the reduced sensitivity coefficients, defined as

$ G_R(i,j) = \frac{\partial f(x_i, \mathbf{k})}{\partial k_j}\,k_j $    (8.37)

and hence, the reduced parameter sensitivity matrix, $\mathbf{G}_R$, is related to our usual sensitivity matrix, $\mathbf{G}$, as follows

$ \mathbf{G}_R = \mathbf{G}\mathbf{K} $    (8.38)

where

$ \mathbf{K} = diag(k_1, k_2, \ldots, k_p) $    (8.39)

As a result of this scaling, the normal equations $\mathbf{A}\Delta\mathbf{k}^{(j+1)} = \mathbf{b}$ become

$ \mathbf{A}_R\,\Delta\mathbf{k}_R^{(j+1)} = \mathbf{b}_R $    (8.40)

where

$ \mathbf{A}_R = \mathbf{K}^T\mathbf{A}\mathbf{K} = \mathbf{K}\mathbf{A}\mathbf{K} $    (8.41)

$ \mathbf{b}_R = \mathbf{K}^T\mathbf{b} = \mathbf{K}\mathbf{b} $    (8.42)

and

$ \Delta\mathbf{k}^{(j+1)} = \mathbf{K}\,\Delta\mathbf{k}_R^{(j+1)} $    (8.43)

The parameter estimates for the next iteration are now obtained (using a step-size $\mu$, $0 < \mu \le 1$) as follows

$ \mathbf{k}^{(j+1)} = \mathbf{k}^{(j)} + \mu\,\mathbf{K}\,\Delta\mathbf{k}_R^{(j+1)} $    (8.44)

or equivalently, element by element,

$ k_i^{(j+1)} = k_i^{(j)}\left(1 + \mu\,\Delta k_{R,i}^{(j+1)}\right) $    (8.45)

With this modification the conditioning of matrix A is significantly improved and $cond(\mathbf{A}_R)$ gives a more reliable measure of the ill-conditioning of the parameter estimation problem. This modification has been implemented in all computer programs provided with this book.
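A compact sketch of one scaled Gauss-Newton update is shown below; it assumes the sensitivity matrix G, the weighting matrix Q and the residual vector resid have been stacked over all data points by the caller.

import numpy as np

def scaled_gauss_newton_step(G, Q, resid, k, mu=1.0):
    """One Gauss-Newton update using reduced sensitivities G_R = G K with
    K = diag(k) (Equations 8.38-8.45).  G, Q and resid are assumed stacked
    over all data points."""
    K = np.diag(k)
    GR = G @ K                        # Equation 8.38
    AR = GR.T @ Q @ GR                # A_R = K A K
    bR = GR.T @ Q @ resid             # b_R = K b
    dkR = np.linalg.solve(AR, bR)
    return k * (1.0 + mu * dkR)       # Equation 8.45, element by element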

8.5 USE OF "PRIOR" INFORMATION

Under certain conditions we may have some prior information about the parameter values. This information is often summarized by assuming that each parameter is distributed normally with a given mean and a small or large variance depending on how trustworthy our prior estimate is. The Bayesian objective function, $S_B(\mathbf{k})$, that should be minimized for algebraic equation models is

$ S_B(\mathbf{k}) = \sum_{i=1}^{N} \left[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i, \mathbf{k})\right]^T \mathbf{Q}_i \left[\mathbf{y}_i - \mathbf{f}(\mathbf{x}_i, \mathbf{k})\right] + (\mathbf{k} - \mathbf{k}_B)^T \mathbf{V}_B^{-1} (\mathbf{k} - \mathbf{k}_B) $    (8.46)

and for differential equation models it takes the form

$ S_B(\mathbf{k}) = \sum_{i=1}^{N} \left[\mathbf{y}(t_i) - \mathbf{C}\mathbf{x}(t_i, \mathbf{k})\right]^T \mathbf{Q}_i \left[\mathbf{y}(t_i) - \mathbf{C}\mathbf{x}(t_i, \mathbf{k})\right] + (\mathbf{k} - \mathbf{k}_B)^T \mathbf{V}_B^{-1} (\mathbf{k} - \mathbf{k}_B) $    (8.47)

We have assumed that the prior information can be described by the multivariate normal distribution, i.e., k is normally distributed with mean $\mathbf{k}_B$ and covariance matrix $\mathbf{V}_B$.

The required modifications to the Gauss-Newton algorithm presented in Chapter 4 are rather minimal. At each iteration, we just need to add the following terms to matrix A and vector b,

$ \mathbf{A} = \mathbf{A}_{GN} + \mathbf{V}_B^{-1} $    (8.48)

and

$ \mathbf{b} = \mathbf{b}_{GN} - \mathbf{V}_B^{-1}\left(\mathbf{k}^{(j)} - \mathbf{k}_B\right) $    (8.49)

where $\mathbf{A}_{GN}$ and $\mathbf{b}_{GN}$ are matrix A and vector b for the Gauss-Newton method as given in Chapter 4 for algebraic or differential equation models. The prior covariance matrix of the parameters ($\mathbf{V}_B$) is often a diagonal matrix and, since only the inverse of $\mathbf{V}_B$ is used in the solution of the problem, it is preferable to use the inverse itself as input to the program.
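In code, the modification amounts to two lines; the sketch below assumes A_GN and b_GN have been formed as in Chapter 4 and that all arguments are NumPy arrays.

def add_prior(A_GN, b_GN, k_current, k_B, VB_inv):
    """Augment the Gauss-Newton normal equations with prior information
    (Equations 8.48 and 8.49).  A zero on the diagonal of VB_inv means
    no prior knowledge for that parameter."""
    A = A_GN + VB_inv
    b = b_GN - VB_inv @ (k_current - k_B)
    return A, b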


From a computer implementation point of view, this provides some extra flexibility to handle simultaneously parameters for which we have some prior knowledge and others for which no information is available. For the latter we simply need to input zero as the inverse of their prior variance.

Practical experience has shown that (i) if we have a relatively large number of data points, the prior has an insignificant effect on the parameter estimates, and (ii) if the parameter estimation problem is ill-posed, use of "prior" information has a stabilizing effect. As seen from Equation 8.48, all the eigenvalues of matrix A are increased by the addition of positive terms to its diagonal. It acts almost like Marquardt's modification as far as convergence characteristics are concerned.

8.6 SELECTION OF WEIGHTING MATRIX Q IN LEAST SQUARES ESTIMATION

As we mentioned in Chapter 2, the user-specified matrix $\mathbf{Q}_i$ should be equal to the inverse of $COV(\mathbf{\varepsilon}_i)$. However, on many occasions we have very little information about the nature of the error in the measurements. In such cases, we have found it very useful to use $\mathbf{Q}_i$ as a normalization matrix to make the measured responses of the same order of magnitude. If the measurements do not change substantially from data point to data point, we can use a constant Q. The simplest form of Q that we have found adequate is a diagonal matrix whose j-th diagonal element is the inverse of the squared mean response of the j-th variable,

$ Q_{jj} = \frac{1}{\bar{y}_j^2} $    (8.50)

where $\bar{y}_j$ is the mean of the measured j-th response over all data points.

This is equivalent to assuming a constant standard error in the measurement of the j-th response variable, while at the same time the standard errors of different response variables are proportional to the average value of the variables. This is a "safe" assumption when no other information is available, and least squares estimation pays equal attention to the errors from different response variables (e.g., concentration versus pressure or temperature measurements).

If, however, the measurements of a response variable change over several orders of magnitude, it is better to use the non-constant diagonal weighting matrix $\mathbf{Q}_i$ whose j-th diagonal element is given by

$ Q_{jj} = \frac{1}{y_{ij}^2} $    (8.51)


This is equivalent to assuming that the standard error in the i-th measurement of the j-th response variable is proportional to its value, again a rather "safe" assumption as it forces least squares to pay equal attention to all data points.
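Both weighting choices are one-liners; the sketch below assumes the measured responses are stored in an N-by-m array Y (Equation 8.50) or, for the non-constant case, that y_i holds the m responses of one data point (Equation 8.51).

import numpy as np

def constant_Q(Y):
    """Diagonal Q with elements 1/(mean response)^2 (Equation 8.50);
    Y is an N-by-m array of measured responses."""
    return np.diag(1.0 / Y.mean(axis=0)**2)

def pointwise_Q(y_i):
    """Non-constant diagonal Q_i with elements 1/y_ij^2 (Equation 8.51)
    for a single data point y_i (length m)."""
    return np.diag(1.0 / np.asarray(y_i, float)**2)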

8.7 IMPLEMENTATION GUIDELINES FOR ODE MODELS

For models described by a set of ordinary differential equations there are a few modifications we may consider implementing that enhance the performance (robustness) of the Gauss-Newton method. The issues that one needs to address more carefully are (i) numerical instability during the integration of the state and sensitivity equations, and (ii) ways to enlarge the region of convergence.

8.7.1 Stiff ODE Models

Stiff differential equations appear quite often in engineering. The wide range of the prevailing time constants creates additional numerical difficulties which tend to shrink even further the region of convergence. It is noted that if the state equations are stiff, so are the sensitivity differential equations, since both have the same Jacobian. It has been pointed out by many numerical analysts who deal with the stability and efficiency of integration algorithms (e.g., Edsberg, 1976) that the scaling of the problem is crucial for the numerical solver to behave at its best. Therefore, it is preferable to scale the state equations according to the order of magnitude of the given measurements, whenever possible. In addition, to normalize the sensitivity coefficients we introduce the reduced sensitivity coefficients,

$ G_R(i,j) = \frac{\partial x_i}{\partial k_j}\,k_j $    (8.52)

and hence, the reduced parameter sensitivity matrix, $\mathbf{G}_R$, is related to our usual sensitivity matrix, $\mathbf{G}$, as follows

$ \mathbf{G}_R(t) = \mathbf{G}(t)\,\mathbf{K} $    (8.53)

where $\mathbf{K} = diag(k_1, k_2, \ldots, k_p)$. By this transformation the sensitivity differential equations shown below

$ \frac{d\mathbf{G}(t)}{dt} = \left(\frac{\partial\mathbf{f}}{\partial\mathbf{x}}\right)\mathbf{G}(t) + \frac{\partial\mathbf{f}}{\partial\mathbf{k}} \;;\quad \mathbf{G}(t_0) = \mathbf{0} $    (8.54)

become

$ \frac{d\mathbf{G}_R(t)}{dt} = \left(\frac{\partial\mathbf{f}}{\partial\mathbf{x}}\right)\mathbf{G}_R(t) + \frac{\partial\mathbf{f}}{\partial\mathbf{k}}\,\mathbf{K} \;;\quad \mathbf{G}_R(t_0) = \mathbf{0} $    (8.55)

With this transformation, the normal equations now become

$ \mathbf{A}_R\,\Delta\mathbf{k}_R^{(j+1)} = \mathbf{b}_R $    (8.56)

where

$ \mathbf{A}_R = \sum_{i=1}^{N} \mathbf{G}_R(t_i)^T \mathbf{C}^T \mathbf{Q}_i \mathbf{C}\,\mathbf{G}_R(t_i) $    (8.57)

$ \mathbf{b}_R = \sum_{i=1}^{N} \mathbf{G}_R(t_i)^T \mathbf{C}^T \mathbf{Q}_i \left[\mathbf{y}(t_i) - \mathbf{C}\mathbf{x}(t_i, \mathbf{k}^{(j)})\right] $    (8.58)

and

$ \Delta\mathbf{k}^{(j+1)} = \mathbf{K}\,\Delta\mathbf{k}_R^{(j+1)} $    (8.59)

The parameter estimates for the next iteration are now obtained as

$ \mathbf{k}^{(j+1)} = \mathbf{k}^{(j)} + \mu\,\mathbf{K}\,\Delta\mathbf{k}_R^{(j+1)} $    (8.60)

or, element by element,

$ k_i^{(j+1)} = k_i^{(j)}\left(1 + \mu\,\Delta k_{R,i}^{(j+1)}\right) $    (8.61)

As we have already pointed out in this chapter for systems described by algebraic equations, the introduction of the reduced sensitivity coefficients results in a reduction of cond(A). Therefore, the use of the reduced sensitivity coefficients should also be beneficial to non-stiff systems.

We strongly suggest the use of the reduced sensitivity coefficients whenever we are dealing with differential equation models. Even if the system of differential equations is non-stiff at the optimum (when k = k*), the equations may become stiff temporarily for a few iterations of the Gauss-Newton method when the parameters are far from their optimal values. Furthermore, since this transformation also results in better conditioning of the normal equations, we propose its use at all times. This transformation has been implemented in the program for ODE systems provided with this book.
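The sketch below shows the combined integration of a state and its reduced sensitivity (Equation 8.55) for a deliberately simple single-state example, dx/dt = -k*x, which is an assumption made purely for illustration; a stiff-capable integrator (LSODA) is chosen in the spirit of this section.

import numpy as np
from scipy.integrate import solve_ivp

# For f(x) = -k*x the reduced sensitivity G_R = (dx/dk)*k obeys
# dG_R/dt = (df/dx)*G_R + (df/dk)*k with G_R(t0) = 0  (Equation 8.55),
# and is integrated together with the state.
def augmented_rhs(t, z, k):
    x, GR = z
    dfdx = -k                # Jacobian of f with respect to the state
    dfdk = -x                # derivative of f with respect to the parameter
    return [-k * x, dfdx * GR + dfdk * k]

k, x0 = 0.8, 2.0
sol = solve_ivp(augmented_rhs, (0.0, 5.0), [x0, 0.0], args=(k,),
                method="LSODA", t_eval=np.linspace(0.0, 5.0, 11))
x_traj, GR_traj = sol.y      # state and reduced sensitivity trajectories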


8.7.2 Increasing the Region of Convergence

A well known problem of the Gauss-Newton method is its relatively small region of convergence. Unless the initial guess of the unknown parameters is in the vicinity of the optimum, divergence may occur. This problem has received much attention in the literature. For example, Ramaker et al. (1970) suggested the incorporation of Marquardt's modification to expand the region of convergence. Donnely and Quon (1970) proposed a procedure whereby the measurements are perturbed if divergence occurs and a series of problems is solved until the model trajectories "match" the original data. Nieman and Fisher (1972) incorporated linear programming into the parameter estimation procedure and suggested the solution of a series of constrained parameter estimation problems where the search is restricted to a small parameter space around the chosen initial guess. Wang and Luus (1980) showed that the use of a shorter data-length can substantially enlarge the region of convergence of Newton's method. Similar results were obtained by Kalogerakis and Luus (1980) using the quasilinearization method.

In this section we first present an efficient step-size policy for differential equation systems and then two approaches to increase the region of convergence of the Gauss-Newton method: one through the use of the Information Index and the other by using a two-step procedure that involves direct search optimization.

8.7.2.1 An Optimal Step-Size Policy

The proposed step-size policy for differential equation systems is fairly similar to our approach for algebraic equation models. First we start with the bisection rule. We start with $\mu = 1$ and we keep on halving it until an acceptable value, $\mu_a$, has been found, i.e., we reduce $\mu$ until

$ S(\mathbf{k}^{(j)} + \mu\,\Delta\mathbf{k}^{(j+1)}) < S(\mathbf{k}^{(j)}) $    (8.18)

It should be emphasized here that it is unnecessary to integrate the state equations over the entire data length for each value of $\mu$. Once the objective function becomes greater than $S(\mathbf{k}^{(j)})$, a smaller value for $\mu$ can be chosen. By this procedure, besides the savings in computation time, numerical instability is also avoided since the objective function often becomes large very quickly and integration is stopped well before computer overflow is threatened.

Having an acceptable value for $\mu_a$, we can either stop here and proceed to perform another iteration of the Gauss-Newton method, or we can attempt to locate the optimal step-size along the direction suggested by the Gauss-Newton method.

The main difference with differential equation systems is that every evaluation of the objective function requires the integration of the state equations. In this section we present an optimal step-size policy proposed by Kalogerakis and Luus (1983b) which uses information only at $\mu = 0$ (i.e., at $\mathbf{k}^{(j)}$) and at $\mu = \mu_a$ (i.e., at $\mathbf{k}^{(j)} + \mu_a\,\Delta\mathbf{k}^{(j+1)}$). Let us consider a Taylor series expansion of the state vector with respect to $\mu$, retaining terms up to second order,

$ \mathbf{x}(t, \mathbf{k}^{(j)} + \mu\,\Delta\mathbf{k}^{(j+1)}) = \mathbf{x}(t, \mathbf{k}^{(j)}) + \mu\,\mathbf{G}(t)\,\Delta\mathbf{k}^{(j+1)} + \mu^2\,\mathbf{r}(t) $    (8.62)

where the n-dimensional residual vector $\mathbf{r}(t)$ is not a function of $\mu$. Vector $\mathbf{r}(t)$ is easily computed from the integration already performed at $\mu_a$ and the state and sensitivity coefficients already computed at $\mathbf{k}^{(j)}$; namely,

$ \mathbf{r}(t) = \frac{\mathbf{x}(t, \mathbf{k}^{(j)} + \mu_a\,\Delta\mathbf{k}^{(j+1)}) - \mathbf{x}(t, \mathbf{k}^{(j)}) - \mu_a\,\mathbf{G}(t)\,\Delta\mathbf{k}^{(j+1)}}{\mu_a^2} $    (8.63)

Thus, having $\mathbf{r}(t_i)$, $i = 1, 2, \ldots, N$, the objective function $S(\mu)$ becomes

$ S(\mu) = \sum_{i=1}^{N} \left[\mathbf{y}(t_i) - \mathbf{C}\mathbf{x}(t_i, \mathbf{k}^{(j)}) - \mu\,\mathbf{C}\mathbf{G}(t_i)\Delta\mathbf{k}^{(j+1)} - \mu^2\,\mathbf{C}\mathbf{r}(t_i)\right]^T \mathbf{Q}_i \left[\mathbf{y}(t_i) - \mathbf{C}\mathbf{x}(t_i, \mathbf{k}^{(j)}) - \mu\,\mathbf{C}\mathbf{G}(t_i)\Delta\mathbf{k}^{(j+1)} - \mu^2\,\mathbf{C}\mathbf{r}(t_i)\right] $    (8.64)

Subsequent use of the stationary criterion $\partial S/\partial\mu = 0$ yields the following third order equation for $\mu$,

$ \beta_3\mu^3 + \beta_2\mu^2 + \beta_1\mu + \beta_0 = 0 $    (8.65)

where

$ \beta_3 = 2\sum_{i=1}^{N} \mathbf{r}(t_i)^T \mathbf{C}^T \mathbf{Q}_i \mathbf{C}\,\mathbf{r}(t_i) $    (8.66a)

$ \beta_2 = 3\sum_{i=1}^{N} \mathbf{r}(t_i)^T \mathbf{C}^T \mathbf{Q}_i \mathbf{C}\,\mathbf{G}(t_i)\,\Delta\mathbf{k}^{(j+1)} $    (8.66b)

$ \beta_1 = \Delta\mathbf{k}^{(j+1)T} \sum_{i=1}^{N} \mathbf{G}(t_i)^T \mathbf{C}^T \mathbf{Q}_i \mathbf{C}\,\mathbf{G}(t_i)\,\Delta\mathbf{k}^{(j+1)} - 2\sum_{i=1}^{N} \mathbf{r}(t_i)^T \mathbf{C}^T \mathbf{Q}_i \left[\mathbf{y}(t_i) - \mathbf{C}\mathbf{x}(t_i, \mathbf{k}^{(j)})\right] $    (8.66c)

and

$ \beta_0 = -\Delta\mathbf{k}^{(j+1)T} \sum_{i=1}^{N} \mathbf{G}(t_i)^T \mathbf{C}^T \mathbf{Q}_i \left[\mathbf{y}(t_i) - \mathbf{C}\mathbf{x}(t_i, \mathbf{k}^{(j)})\right] $    (8.66d)

Solution of Equation 8.65 yields the optimum value for the step-size. The solution can be readily obtained with Newton's method within 3 or 4 iterations using $\mu_a$ as a starting value. This optimal step-size policy was found to yield very good results. Its only drawback is that one needs to store the values of the state and sensitivity equations at each iteration, which is not advisable for high dimensional systems.

Finally, it is noted that in the above equations we can substitute $\mathbf{G}(t)$ with $\mathbf{G}_R(t)$ and $\Delta\mathbf{k}^{(j+1)}$ with $\Delta\mathbf{k}_R^{(j+1)}$ in case we wish to use the reduced sensitivity coefficient formulation.
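Since Equation 8.65 is a cubic with known coefficients, the Newton iteration suggested above is only a few lines; the sketch below assumes the beta coefficients of Equations 8.66a-d have already been accumulated.

def optimal_step_ode(b3, b2, b1, b0, mu_a, iters=4):
    """Newton's method on b3*mu^3 + b2*mu^2 + b1*mu + b0 = 0
    (Equation 8.65), started from the acceptable step-size mu_a."""
    mu = mu_a
    for _ in range(iters):
        f = ((b3 * mu + b2) * mu + b1) * mu + b0       # Horner evaluation
        df = (3.0 * b3 * mu + 2.0 * b2) * mu + b1      # derivative
        mu -= f / df
    return mu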

8.7.2.2 Use of the Information Index

The remedies to increase the region of convergence include the use of a pseudoinverse or Marquardt's modification, which overcome the problem of ill-conditioning of matrix A. However, if the basic sensitivity information is not there, the estimated direction $\Delta\mathbf{k}^{(j+1)}$ cannot be obtained reliably.

A careful examination of matrix A (or $\mathbf{A}_R$) shows that A depends not only on the sensitivity coefficients but also on their values at the times when the given measurements have been taken. When the parameters are far from their optimum values, some of the sensitivity coefficients may be excited and become large in magnitude at times which fall outside the range of the given data points. This means that the output vector will be insensitive to these parameters at the given measurement times, resulting in loss of the sensitivity information captured in matrix A. The normal equations, $\mathbf{A}\Delta\mathbf{k}^{(j+1)} = \mathbf{b}$, become ill-conditioned and any attempt to change the equations is rather self-defeating since the parameter estimates will be unreliable.

Therefore, instead of modifying the normal equations, we propose a direct approach whereby the conditioning of matrix A can be significantly improved by using an appropriate section of the data so that most of the available sensitivity information is captured for the current parameter values. To be able to determine the proper section of the data where sensitivity information is available, Kalogerakis and Luus (1983b) introduced the Information Index for each parameter, defined as

$ I_j(t) = k_j \left(\frac{\partial\mathbf{y}(t)}{\partial k_j}\right)^T \mathbf{Q} \left(\frac{\partial\mathbf{y}(t)}{\partial k_j}\right) k_j \;;\quad j = 1, \ldots, p $    (8.67)

or equivalently, using the sensitivity coefficient matrix,

$ I_j(t) = k_j\,\boldsymbol{\delta}_j^T \left[\mathbf{G}^T(t)\,\mathbf{C}^T \mathbf{Q}\,\mathbf{C}\,\mathbf{G}(t)\right] \boldsymbol{\delta}_j\,k_j \;;\quad j = 1, \ldots, p $    (8.68)

where $\boldsymbol{\delta}_j$ is a p-dimensional vector with 1 in the j-th element and zeros elsewhere. The scalar $I_j(t)$ should be viewed as an index measuring the overall sensitivity of the output vector to parameter $k_j$ at time $t$.

Thus, given an initial guess for the parameters, we can integrate the state and sensitivity equations and compute the Information Indices, $I_j(t)$, $j = 1, \ldots, p$, as functions of time. Subsequently, by plotting $I_j(t)$ versus time, preferably on a semi-log scale, we can spot immediately where they become excited and large in magnitude. If observations are not available within this time interval, artificial data can be generated by data smoothing and interpolation to provide the missing sensitivity information. In general, the best section of the data could be determined at each iteration, as during the course of the iterations different sections of the data may become the most appropriate. However, it is also expected that during the iterations the most appropriate sections of the data will lie somewhere between the range of the given measurements and the section determined by the initial parameter estimates. It is therefore suggested that the section selected based on the initial parameter estimates be combined with the section of the given measurements, and that the entire data length be used for all iterations. At the last iteration all artificial data are dropped and only the given data are used.
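Evaluating Equation 8.68 at a given time is a one-line matrix product; the sketch below assumes the sensitivity matrix G_t, the observation matrix C and the weighting matrix Q are available at that time and that k holds the current parameter values as a NumPy array (for plotting, each I_j(t) is typically normalized by its maximum over time).

import numpy as np

def information_index(G_t, C, Q, k):
    """Information Index I_j(t) of Equation 8.68 at one time t.
    G_t is the n-by-p sensitivity matrix at that time."""
    M = G_t.T @ C.T @ Q @ C @ G_t
    return k**2 * np.diag(M)     # element j is I_j(t)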

The use of the Information Index and the optimal step-size policy are illustrated in Chapter 16 where we present parameter estimation examples for ODE models. For illustration purposes we present here the Information Indices for the two-parameter model describing the pyrolytic dehydrogenation of benzene to diphenyl and triphenyl (introduced first in Section 6.5.2). In Figure 8.2 the Information Indices are presented in graphical form with parameter values ($k_1 = 355{,}400$ and $k_2 = 403{,}300$) three orders of magnitude away from the optimum. With this initial guess the Gauss-Newton method fails to converge as all the available sensitivity information falls outside the range of the given measurements. By generating artificial data by interpolation, subsequent use of the Gauss-Newton method brings the parameters to the optimum ($k_1 = 355.4$ and $k_2 = 403.3$) in nine iterations. For the sake of comparison the Information Indices are also presented when the parameters have their optimal values. As seen in Figure 8.3, in terms of experimental design the given measurements have been taken at proper times, although some extra information could have been gained by having a few extra data points in the interval $[10^{-4},\, 5.83 \times 10^{-4}]$.


Figure 8.2 Normalized Information Index of $k_1$ and $k_2$ versus time to determine the best section of data to be used by the Gauss-Newton method ($k_1 = 355{,}400$, $k_2 = 403{,}300$; $I_{1,max} = 0.0884$, $I_{2,max} = 0.0123$). The range of the given data points is indicated on the time axis. [Reprinted from Industrial & Engineering Chemistry Fundamentals with permission from the American Chemical Society.]

Figure 8.3 Normalized Information Index of $k_1$ and $k_2$ versus time to determine whether the measurements have been collected at proper times ($k_1 = 355.4$, $k_2 = 403.3$; $I_{1,max} = 0.0885$, $I_{2,max} = 0.0123$). The range of the given data points is indicated on the time axis. [Reprinted from Industrial & Engineering Chemistry Fundamentals with permission from the American Chemical Society.]

When the dynamic system is described by a set of stiff ODEs and observations during the fast transients are not available, generation of artificial data by interpolation at times close to the origin may be very risky. If, however, we observe all state variables (i.e., C = I), we can overcome this difficulty by redefining the initial state (Kalogerakis and Luus, 1983b). Instead of restricting the use of data close to the initial state, the data can be shifted so that any of the early data points, where the fast transients have died out, constitutes the initial state. At the shifted origin, generation of artificial data is considerably easier. Illustrative examples of this technique are provided by Kalogerakis and Luus (1983b).

8.7.2.3 Use of Direct Search Methods

A simple procedure to overcome the problem of the small region of convergence is to use a two-step approach whereby direct search optimization is used initially to bring the parameters into the vicinity of the optimum, followed by the Gauss-Newton method to obtain the best parameter values and estimates of the uncertainty in the parameters (Kalogerakis and Luus, 1982).

For example, let us consider the estimation of the two kinetic parameters in the Bodenstein-Linder model for the homogeneous gas phase reaction of NO with O₂ (first presented in Section 6.5.1). In Figure 8.4 we see that the use of direct search (LJ optimization) can increase the overall size of the region of convergence by at least two orders of magnitude. A minimal sketch of such a search is given below.

Figure 8.4 Use of the LJ optimization procedure to bring the first parameter estimates inside the region of convergence of the Gauss-Newton method (denoted by the solid line). All test points are denoted by +. The actual path of some typical runs is shown by the dotted line.


8.8 AUTOCORRELATION IN DYNAMIC SYSTEMS

When experimental data are collected over time or distance there is always a chance of having autocorrelated residuals. Box et al. (1994) provide an extensive treatment of correlated disturbances in discrete time models. The structure of the disturbance term is often a moving average or autoregressive model. Detection of autocorrelation in the residuals can be established either from a time series plot of the residuals versus time (or experiment number) or from a lag plot. If we can see a pattern in the residuals over time, it probably means that there is correlation between the disturbances.

An autoregressive (AR) model of order 1 for the residuals has the form

$ \varepsilon_i = \rho\,\varepsilon_{i-1} + \xi_i $    (8.69)

where $\varepsilon_i$ is the residual of the i-th measurement, $\rho$ is an unknown constant and the $\xi_i$ are independent random disturbances with zero mean and constant variance (also known as white noise). In engineering, when we are dealing with parameter estimation of mechanistic models, we do not encounter highly correlated residuals. This is not the case, however, when we use the black box approach (Box et al., 1994).

If we have data collected at a constant sampling interval, any autocorrelation in the residuals can be readily detected by plotting the residual autocorrelation function versus lag number. The latter is defined as

$ r_k = \frac{\sum_{i=k+1}^{N} \varepsilon_i\,\varepsilon_{i-k}}{\sum_{i=1}^{N} \varepsilon_i^2} \;;\quad \text{lag } k = 1, 2, \ldots $    (8.70)

The 95% confidence interval of the autocorrelation function beyond lag 2 is simply given by $\pm 2/\sqrt{N}$. If the estimated autocorrelation function has values greater than $2/\sqrt{N}$ in magnitude, we should correct the data for autocorrelation.
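A sketch of this check is given below; e is the vector of residuals from the fit, and centering the residuals before computing Equation 8.70 is a common convention assumed here.

import numpy as np

def residual_autocorrelation(e, max_lag):
    """Sample autocorrelation function of the residuals (Equation 8.70)."""
    e = np.asarray(e, float)
    e = e - e.mean()             # centering: a convention assumed here
    denom = np.sum(e * e)
    return np.array([np.sum(e[k:] * e[:-k]) / denom
                     for k in range(1, max_lag + 1)])

# Lags where |r_k| > 2/sqrt(N) suggest autocorrelated disturbances:
# flags = np.abs(residual_autocorrelation(e, 10)) > 2.0 / np.sqrt(len(e))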

If we assume that the residuals satisfy Equation 8.69, we must transform the data so that the new residuals are not correlated. Substituting the definition of the residual (for the single-response case) into Equation 8.69 we obtain, upon rearrangement,

$ y_i = \rho\,y_{i-1} + f(x_i, \mathbf{k}) - \rho\,f(x_{i-1}, \mathbf{k}) + \xi_i $    (8.71)

The above model equation now satisfies all the criteria for least squares estimation. As an initial guess for k we can use the parameter values estimated under the assumption that no correlation was present. Of course, in the second step we have to include $\rho$ in the list of parameters to be determined.


Finally, it is noted that the above equations can be readily extended to the multi-response case, especially if we assume that there is no cross-correlation between different response variables.