Adaptive local kernel-based learning for soft sensor modeling of nonlinear processes

Chemical Engineering Research and Design 89 (2011) 2117–2124. doi:10.1016/j.cherd.2011.01.032

Kun Chen (a), Jun Ji (a), Haiqing Wang (a,*), Yi Liu (b), Zhihuan Song (a)

(a) State Key Laboratory of Industrial Control Technology, Institute of Industrial Process Control, Zhejiang University, 38 Zheda Road, Hangzhou 310027, PR China
(b) Institute of Process Equipment and Control Engineering, Zhejiang University of Technology, Hangzhou 310014, PR China

(*) Corresponding author. Tel.: +86 571 87951442/810. E-mail address: [email protected] (H. Wang).
Received 11 October 2010; received in revised form 20 January 2011; accepted 31 January 2011.

Abstract

Soft sensor techniques have been widely used to estimate product quality or other key indices that cannot be measured online by hardware sensors. Unfortunately, their estimation performance deteriorates under certain circumstances, e.g., when the process characteristics change, especially for global learning approaches. Meanwhile, local learning methods usually utilize only input information to select relevant instances, which may lead to a waste of output information and inaccurate sample selection. To overcome these disadvantages, a new local modeling algorithm, the adaptive local kernel-based learning scheme (ALKL), is proposed. First, a new similarity measurement using both input and output information is proposed and utilized in a supervised locality preserving projection technique to select relevant samples. Second, an adaptive weighted least squares support vector regression (AW-LSSVR) is employed to establish a local model and predict output indices for each query datum. In AW-LSSVR, instead of using traditional cross-validation methods, the trade-off parameters are adjusted iteratively and the local model is updated recursively, which greatly reduces the computational complexity. The proposed ALKL is applied to online crude oil endpoint prediction in an industrial fluidized catalytic cracking unit (FCCU) process. The experimental results demonstrate the high precision of the ALKL approach.

© 2011 The Institution of Chemical Engineers. Published by Elsevier B.V. All rights reserved.

Keywords: Adaptive local kernel-based learning; Supervised locality preserving projection; Adaptive weighted least squares support vector regression; FCCU

1. Introduction

Industrial processing plants are generally instrumented with a large number of sensors whose primary purpose is to collect and deliver data for process modeling, monitoring and control. Although hardware analyzers have been widely used in industry, they are usually expensive or hard to maintain. Meanwhile, some critical variables, such as biomass concentration in bioprocesses and the crude oil endpoint in a fluidized catalytic cracking unit (FCCU), cannot be measured online by hardware sensors. Moreover, the delayed information resulting from offline assays can hardly meet the requirements of real-time monitoring and control, and ultimately affects product quantity and quality. As a result, it is crucial to develop soft sensors that use secondary variables, which can be measured online, to infer these key variables (Dochain, 2003; Kadlec et al., 2009; Kaneko et al., 2009; Lin et al., 2007).
At a very general level, soft sensors can be divided into two classes, namely model-driven and data-driven methods (Kadlec et al., 2009). The model-driven family is commonly based on First Principle Models (FPMs), which describe the physical and chemical background of the process. Building such models is extremely difficult for chemical processes, however, owing to their complex structure or a lack of expert knowledge. As a solution, data-driven methods, which aim to estimate the relationship between the prime and secondary variables acquired directly from the process under real conditions, have gained considerable interest. In the past decades, various data-driven techniques have been proposed. Extensive reviews can be found in the work



of Kadlec et al. (2009). Among these methods, multivariate static techniques (Dong et al., 1995; Facco et al., 2009; Hartnett et al., 1998; Kim et al., 2005; Kresta et al., 1994; Li et al., 2000; X.Q. Liu et al., 2009; Lu et al., 2004; Wang et al., 2003; Yao and Gao, 2009) have been widely used. However, these methods require a relatively large number of samples and are commonly sensitive to measurement noise as well. Various artificial neural network (ANN) algorithms (Gonzaga et al., 2009; Rallo et al., 2002) have been proposed and successfully applied to polymerization processes, but how to construct the network topology remains an open question. Moreover, they often fail under small-sample circumstances. To address these problems, kernel-based methods have been presented, such as support vector regression (SVR) (Vapnik, 1995), least squares support vector regression (LSSVR) (Suykens et al., 2002) and their online-updated forms, e.g., adaptive kernel learning networks (AKL) (Wang et al., 2006) and selective recursive kernel learning (SRKL) (Y. Liu et al., 2009; Liu et al., 2010). With the structural risk minimization criterion, these 'kernelized' techniques can attain better performance under small-sample conditions.

Note that all the algorithms mentioned above are global approaches. They all aim at a universal generalization performance, which may lead to an inaccurate model in local regions where the process characteristics change. To overcome such shortcomings, local learning, also named just-in-time (JIT) learning (Atkeson et al., 1997; Cheng and Chiu, 2004, 2007; Fujiwara et al., 2009; Lee et al., 2005; Liu et al., 2008; Quan et al., 2010), was proposed. In these approaches, no model is established until one is required. There are three main steps to predict the prime variables for each query instance: (1) select relevant data samples from the historical data based on some nearest-neighborhood criterion, e.g., Euclidean distance based (Lee et al., 2005; Quan et al., 2010), distance and angle based (Cheng and Chiu, 2004; Liu et al., 2008), or correlation similarity such as the Q and T2 statistics (Fujiwara et al., 2009); (2) build a local model on the selected relevant instances; (3) predict the output and discard the model.

Two key issues in local learning are how to construct the similarity criterion and which local model to use. For the first problem, the criteria in the works mentioned above may fail: the correlation-similarity-based criterion in Fujiwara et al.'s method (2009) tends to choose one group of data from several predefined groups, but how to cluster these groups creates another problem; the Euclidean-distance- or angle-based criteria (Cheng and Chiu, 2004; Lee et al., 2005; Liu et al., 2008; Quan et al., 2010) cannot represent the true similarity, as different secondary variables may have different importance. Meanwhile, only the secondary-variable information in the training dataset is used to select the relevant samples, which leads to insufficient use of the output information.

For the second question, as discussed earlier, a kernel-based method would be a good choice due to its structural risk minimization criterion and its performance under small-sample conditions. For instance, Lee et al. (2005) adopted a weighted SVR to estimate the local model, which requires solving a quadratic programming problem; Quan et al. (2010) employed a weighted LSSVR, which instead requires solving a set of linear equations. Note that such a weighted LSSVR performs well when sufficient historical data are available, but under small-sample circumstances it is questionable, as some samples are effectively discarded by the weighting scheme.

In our work, we explore the potential to improve local learning by utilizing a new similarity measurement with both input and output information, and by employing a new version of adaptive weighted LSSVR (AW-LSSVR) with a self-adjusting procedure for the trade-off parameters to train the local model. As the query sample contains only secondary variables, the new criterion cannot be used directly to guide the selection of the nearest samples. Thus, a supervised locality preserving projection (SLPP) procedure, extended from the traditional LPP method (He and Niyogi, 2003), which aims to keep the local structure, is adopted to search for a rough mapping direction. The query data are then projected into the low-dimensional space to find relevant samples, which no longer requires the target value of the query instance. To the best of our knowledge, although some supervised LPP (SLPP) variants already exist for classification problems (Cheng et al., 2005; Li et al., 2007), no SLPP version for regression has been presented. Meanwhile, in the AW-LSSVR procedure, the trade-off weights are optimized iteratively and recursively along the gradient descent direction by minimizing the sum of squared predictive errors under the fast leave-one-out (FLOO) (Cawley and Talbot, 2004) cross-validation scheme for each query datum.

The remainder of this paper is organized as follows. In Section 2, the details of the proposed ALKL method, including SLPP and AW-LSSVR, are presented. In Section 3, experiments on a numerical function and on online crude oil endpoint prediction for an industrial FCCU process, using the proposed ALKL method and other algorithms, are presented. Finally, in Section 4, the main contributions of this paper are summarized.

2. Methodology

The general idea of the proposed adaptive local kernel-based learning is to estimate a local model for each query instance. Since a MIMO system can be described as a set of multiple-input single-output (MISO) systems, it is straightforward to extend the algorithm to multiple-input multiple-output (MIMO) systems, and we therefore restrict ourselves to MISO systems in this paper. The detailed steps are as follows.

2.1. SLPP based relevant sample selection

As the first step of local learning, relevant samples should be selected for each query instance. The traditional way is to employ a distance- or angle-based similarity index using input information, e.g.,

$$SI_{ij} = \rho \exp(-d_{ij}) + (1-\rho)\cos\theta_{ij} \qquad (1)$$

as in Cheng and Chiu (2004) and Liu et al. (2008), where $d_{ij}$ and $\theta_{ij}$ are the Euclidean distance and the angle between the ith and jth samples in the training dataset, respectively, and ρ is a trade-off parameter. Obviously, only the input information is taken into account to construct this likelihood index, while the output information of the training dataset is discarded. To better utilize the information, and following the basic principle of enhancing the similarity of data with both similar inputs and similar outputs, we propose a new criterion that contains both the input and target indices:

$$SI_{ij} = (S^x_{ij})^{\rho_1} (S^y_{ij})^{1-\rho_1} \qquad (2)$$


where $S^x_{ij}$ and $S^y_{ij}$ are the similarity measurements between the ith and jth samples based on the input and output information, respectively, which can also be defined as in Eq. (1), and $\rho_1$ is a flexible parameter balancing the importance of $S^x_{ij}$ and $S^y_{ij}$: e.g., $\rho_1 = 0$ for monotone functions, to use the output information only, and $\rho_1 = 1$ to utilize the input factors alone, as in (1). Although the proposed criterion (2) should be better than (1), since it takes the full information of the training dataset into account, it cannot be used directly to select the relevant samples for query instances, because query samples contain only the input features. Thus, it is crucial to find a rough mapping direction that keeps the local structure of the data and thereby helps select the nearest samples.

As a solution, the locality preserving projection (LPP) (He and Niyogi, 2003), which aims to preserve local information, can be adopted to search for the mapping direction. LPP is a recently proposed dimensionality reduction method that has been successfully applied in pattern recognition and information retrieval. Its main idea can be described as follows. Given a set of N-dimensional secondary data $\{x_1, x_2, \dots, x_n\}$ in $R^N$ with corresponding prime variables $\{y_1, y_2, \dots, y_n\}$ in $R^1$, LPP aims to find a transformation matrix W that projects these input instances to a low-dimensional sample set $\{z_1, z_2, \dots, z_n\}$ in $R^d$ ($d \ll N$) based on the following objective criterion:

$$\min \sum_{i,j=1,\dots,n,\; i \neq j} \|z_i - z_j\|^2 S_{ij} \qquad (3)$$

where $S_{ij}$ is the similarity characterizing the likelihood of the ith and jth samples, and $z_i$ is the 1-dimensional representation of $x_i$ under a projection vector w, i.e., $z_i = w^T x_i$. By minimizing this objective function, LPP incurs a heavy penalty if neighboring samples are mapped to points $z_i$ and $z_j$ that are far apart. Thus, LPP keeps the mapped points close whenever their original points are close to each other, thereby preserving the local information. The minimization problem can be solved through the following eigenvalue problem (He and Niyogi, 2003):

$$X L X^T w = \lambda X D X^T w \qquad (4)$$

Here, $X = [x_1, x_2, \dots, x_n]$, $D = \mathrm{diag}(\mathbf{1}^T S)$ and $L = D - S$, where $\mathbf{1} = [1, 1, \dots, 1]^T$ and S is the matrix with $S(i,j) = S_{ij}$.

Although some supervised likelihood measurements have been proposed for classification problems (Cheng et al., 2005; Li et al., 2007) together with a supervised LPP (SLPP) technique, there is still no SLPP version for regression problems. By utilizing the similarity index defined in (2), the SLPP maps instances with both similar input and similar output close to each other. Then, by solving the generalized eigenvalue problem (4), the first d eigenvectors corresponding to the largest absolute eigenvalues are adopted as the transformation matrix W, and $z_i = W^T x_i$. Finally, for each query datum, relevant samples are selected according to the Euclidean distance in the low-dimensional (e.g., one-dimensional) space. Fig. 1 gives an example of how the relevant samples are selected in the naive and SLPP projected spaces (d = 1 is chosen for simplicity), where w is obtained by solving Eq. (4).

Fig. 1 – Illustration of the relevant samples selected in the naive and SLPP projected spaces. w indicates the rough mapping direction found by SLPP, and w⊥ is the direction orthogonal to w. The instances inside the dotted black and dash-dotted magenta ellipses indicate the relevant samples selected in the naive space and in the SLPP projected space, respectively.

It is observed that SLPP not only helps select relevant samples for local learning, but also determines the weights of the feature variables, which will be illustrated in the experimental part. Also, by reducing the feature dimension from N to d, SLPP eases the computational load while keeping the local information of the training dataset.
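To make the SLPP step concrete, the following sketch (our own minimal rendering, not the authors' code; the names slpp_fit and slpp_project are hypothetical) builds the graph matrices from a given supervised similarity matrix S, e.g., constructed via Eq. (2) with the distance-based similarities introduced in Section 3, and solves the generalized eigenvalue problem (4):

```python
import numpy as np
from scipy.linalg import eigh

def slpp_fit(Xd, S, d=1):
    """SLPP projection: solve X L X^T w = lambda X D X^T w (Eq. (4)).

    Xd : (n, N) data matrix, one sample per row.
    S  : (n, n) supervised similarity matrix, e.g. from Eq. (2).
    Returns W with shape (d, N), rows sorted by |eigenvalue| (largest first).
    """
    D = np.diag(S.sum(axis=1))            # D = diag(1^T S)
    L = D - S                             # graph Laplacian of the similarity graph
    A = Xd.T @ L @ Xd                     # X L X^T in the features-by-samples convention
    B = Xd.T @ D @ Xd                     # X D X^T
    lam, V = eigh(A, B)                   # generalized symmetric eigenproblem
    order = np.argsort(np.abs(lam))[::-1] # the paper keeps the d largest absolute eigenvalues
    return V[:, order[:d]].T

def slpp_project(Xd, W):
    """Map samples into the low-dimensional space: z_i = W x_i."""
    return Xd @ W.T
```

Relevant samples for a query x* can then be taken as the l training points with the smallest Euclidean distance to the projected query, e.g., np.argsort(np.linalg.norm(slpp_project(X_train, W) - slpp_project(x_star[None, :], W), axis=1))[:l].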

2.2. AW-LSSVR based local modeling

Suppose that $\{x_{k1}, x_{k2}, x_{k3}, \dots, x_{kl}\}$ are the l selected relevant secondary-variable samples with corresponding prime variables $\{y_{k1}, y_{k2}, y_{k3}, \dots, y_{kl}\}$, where $\{x_k, y_k\}$ indicates the selected relevant sample sequence and l denotes the number of these samples. A weighted LSSVR (WLSSVR) approach is employed to build the local model, which can be written as the following optimization problem:

$$\min_{\omega,b,e}\; J(\omega,b) = \frac{1}{2}\omega^T\omega + \frac{1}{2}\sum_{i=1}^{l} v_{ki}\, e_{ki}^2$$
$$\text{s.t.}\quad y_{ki} = \omega^T\varphi(x_{ki}) + b + e_{ki},\quad i = 1, 2, \dots, l \qquad (5)$$

where ω, b, $e_{ki}$ and $v_{ki}$ are the weight vector, bias, residual and regularization weight for the ith sample in $\{x_k, y_k\}$, respectively. To solve this optimization problem, the following Lagrangian is considered:

$$L(\omega, b, e, \alpha) = J(\omega, b) - \sum_{i=1}^{l}\alpha_{ki}\{\omega^T\varphi(x_{ki}) + b + e_{ki} - y_{ki}\} \qquad (6)$$

From the conditions for optimality,

$$\frac{\partial L}{\partial \omega} = \frac{\partial L}{\partial b} = \frac{\partial L}{\partial e_{ki}} = \frac{\partial L}{\partial \alpha_{ki}} = 0 \qquad (7)$$

and after elimination of ω and e, one obtains the Karush–Kuhn–Tucker (KKT) system:

$$\begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \Omega + V \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix} \qquad (8)$$

where $\Omega(i,j) = \varphi(x_{ki})^T\varphi(x_{kj}) = K(x_{ki}, x_{kj})$ with K(·,·) a positive definite kernel function (e.g., Gaussian kernel, linear kernel), $\alpha = [\alpha_{k1}, \alpha_{k2}, \dots, \alpha_{kl}]^T$, $y = [y_{k1}, y_{k2}, \dots, y_{kl}]^T$, and V is a diagonal matrix given by:

$$V(i,i) = \frac{1}{v_{ki}} \qquad (9)$$


In our work, a modified Gaussian kernel is employed in the WLSSVR:

$$K(x_i, x_j) = \exp\left(-\frac{\|Wx_i - Wx_j\|^2}{2\sigma^2}\right) \qquad (10)$$

with W the projection weight matrix obtained by SLPP in the first step. By solving (8), b and α can be obtained easily, and the resulting WLSSVR model can be evaluated at the query point x* as:

$$y^* = \sum_{i=1}^{l} \alpha_{ki} K(x_{ki}, x^*) + b \qquad (11)$$
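As a concrete illustration of Eqs. (8), (10) and (11), the sketch below assembles the KKT system and evaluates the resulting model at a query point. It is a minimal rendering under our own naming (weighted_gaussian_kernel, wlssvr_fit, wlssvr_predict); passing W = np.eye(N) recovers the naive-space WLSSVR used later in Section 3.2.1:

```python
import numpy as np

def weighted_gaussian_kernel(A, B, W, sigma):
    """Modified Gaussian kernel of Eq. (10): exp(-||W x - W x'||^2 / (2 sigma^2))."""
    PA, PB = A @ W.T, B @ W.T                          # project both sets with the SLPP matrix W
    d2 = ((PA[:, None, :] - PB[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def wlssvr_fit(X, y, v, W, sigma):
    """Solve the KKT system (8) for the bias b and dual coefficients alpha."""
    l = len(y)
    Omega = weighted_gaussian_kernel(X, X, W, sigma)
    H = np.empty((l + 1, l + 1))
    H[0, 0] = 0.0
    H[0, 1:] = H[1:, 0] = 1.0                          # bordering ones of Eq. (8)
    H[1:, 1:] = Omega + np.diag(1.0 / v)               # Omega + V, with V(i,i) = 1/v_i (Eq. (9))
    sol = np.linalg.solve(H, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                             # b, alpha

def wlssvr_predict(X, b, alpha, W, sigma, x_star):
    """Prediction of Eq. (11): y* = sum_i alpha_i K(x_i, x*) + b."""
    k = weighted_gaussian_kernel(x_star[None, :], X, W, sigma).ravel()
    return float(k @ alpha + b)
```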

Note that the parameters, e.g., the kernel width σ and the trade-off parameters $v_{ki}$, i = 1, 2, ..., l, play an important role in achieving high prediction accuracy. Currently, the most common way to select their optimal values is cross validation (CV) or leave-one-out CV (LOO-CV). In the naive implementation of LOO-CV, one trains the WLSSVR model l times, each time leaving out one sample for prediction; this involves solving l linear systems of dimension (l − 1) for each potential combination of hyper-parameters. Supposing there are M potential values for each parameter (so that the number of combinations for the l trade-off parameters is $C_M = M^l$), the complexity of the naive LOO-CV is $O((C_M \cdot l/3)(l-1)^3)$. To ease the computational load, two improvements are adopted in our work: (1) FLOO-CV (Cawley and Talbot, 2004) is used to calculate the predictive error, so that only one linear system needs to be solved for each set of hyper-parameters, reducing the complexity to $O((C_M/3)(l-1)^3)$; (2) instead of validating each combination, a gradient descent method is employed to optimize the parameters iteratively by minimizing the sum of squared prediction errors.

Let H denote the matrix $\begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \Omega + V \end{bmatrix}$ and let $P = H^{-1}$, with $P_i$ and $P_{i,j}$ denoting the ith column and the (i,j)th element of P, respectively. Then the sum of squared leave-one-out errors (SSE) can be expressed as:

$$SSE = \sum_{i=1}^{l}\left(\frac{\alpha_{ki}}{P_{i+1,i+1}}\right)^2 \qquad (12)$$
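Eq. (12) is what makes FLOO-CV cheap: the leave-one-out residual of sample i is simply $\alpha_{ki}$ divided by the corresponding diagonal element of $P = H^{-1}$ (Cawley and Talbot, 2004), so all l folds cost a single matrix inversion. A minimal sketch (the helper name floo_sse is ours):

```python
import numpy as np

def floo_sse(H, y):
    """FLOO sum of squared errors of Eq. (12): e_i = alpha_i / P[i+1, i+1]."""
    P = np.linalg.inv(H)                      # one inversion covers all l leave-one-out folds
    sol = P @ np.concatenate(([0.0], y))      # [b, alpha_1, ..., alpha_l]
    e_loo = sol[1:] / np.diag(P)[1:]          # leave-one-out residuals, no retraining
    return float(e_loo @ e_loo)
```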

Suppose $v_j$ is the trade-off parameter currently being optimized. Its update can be written as:

$$\frac{1}{v'_j} = \frac{1}{v_j} + \frac{1}{v_R} \qquad (13)$$

where $v_j$ is the current value and $v'_j$ is a more optimal one, so that optimizing $v_j$ amounts to searching for a suitable $v_R$. The updated H′ and P′ can then be calculated as:

$$H' = H + u \frac{1}{v_R} u^T \qquad (14)$$

$$P' = P - z^{-1} P_{j+1} P_{j+1}^T \qquad (15)$$

where $u = [\underbrace{0, \dots, 0}_{j}, 1, \underbrace{0, \dots, 0}_{l-j}]^T$ and $z = v_R + P_{j+1,j+1}$.

The new parameters b′, α′ can be updated by:

$$[b'\; \alpha'^T]^T = [b\; \alpha^T]^T - z^{-1} P_{j+1}\, \alpha_{kj} \qquad (16)$$

and the diagonal elements of P′ and the index SSE become:

$$P'_{i,i} = P_{i,i} - z^{-1} P_{i,j+1}^2$$

$$SSE' = \sum_{i=1}^{l}\left(\frac{\alpha'_{ki}}{P'_{i+1,i+1}}\right)^2 = \sum_{i=1}^{l}\left(\frac{\alpha_{ki} - z^{-1} P_{j+1,i+1}\, \alpha_{kj}}{P_{i+1,i+1} - z^{-1} P_{i+1,j+1}^2}\right)^2 \qquad (17)$$

Thus, the derivative of SSE with respect to $v_R$ can be written as:

$$\frac{\partial SSE}{\partial v_R} = 2\sum_{i=1}^{l}\frac{P_{j+1,i+1}\,(z\,\alpha_{ki} - P_{j+1,i+1}\,\alpha_{kj})\,(P_{i+1,i+1}\,\alpha_{kj} - P_{j+1,i+1}\,\alpha_{ki})}{(z\,P_{i+1,i+1} - P_{j+1,i+1}^2)^3} \qquad (18)$$

Using the gradient descent method, $v_R$ can then be optimized as:

$$v_R \leftarrow v_R - \eta \frac{\partial SSE}{\partial v_R} \qquad (19)$$

where η is the step size.

The detailed implementation of AW-LSSVR is as follows:

Step 1: Predefine parameters, such as l, the step size η and the maximum iteration number It_max.
Step 2: Select the kernel width σ and a common initial value for v_i, i = 1, 2, ..., l, based on the FLOO technique on the training dataset.
Step 3: For each query datum, calculate the P, α and SSE values using Eqs. (8) and (12).
Step 4: For each trade-off parameter v_i:
Step 4.1: Compute the gradient descent direction by Eq. (18), and update v_i by Eqs. (19) and (13).
Step 4.2: Recalculate the matrix P′, α and SSE using Eqs. (16) and (17).
Step 4.3: If the iteration number It satisfies It > It_max, or the change of SSE fulfills ΔSSE < T_ΔSSE, use the current P, α, b values to predict the output for the query instance; otherwise execute Steps 4.1 and 4.2 iteratively.
Step 5: Return to Step 3 for the next query datum. (A sketch of this loop is given below.)
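The following sketch renders Steps 3–4 under our own simplifications: a numerical finite difference stands in for the closed-form gradient of Eq. (18), a single sweep over the l trade-off parameters is made, and the function names (floo_terms, sse_after_update, aw_lssvr_tune) are hypothetical:

```python
import numpy as np

def floo_terms(P, y):
    """b, alpha and the FLOO SSE of Eq. (12) from P = H^{-1}."""
    sol = P @ np.concatenate(([0.0], y))
    alpha = sol[1:]
    return sol[0], alpha, float(np.sum((alpha / np.diag(P)[1:]) ** 2))

def sse_after_update(P, alpha, j, v_r):
    """FLOO SSE after the candidate rank-one update (13)-(17) with parameter v_R."""
    z = v_r + P[j + 1, j + 1]
    c = P[1:, j + 1]                               # alpha-block of the (j+1)th column of P
    alpha_new = alpha - c * alpha[j] / z           # Eq. (16)
    diag_new = np.diag(P)[1:] - c ** 2 / z         # diagonal part of Eq. (17)
    return float(np.sum((alpha_new / diag_new) ** 2))

def aw_lssvr_tune(H, y, eta=0.5, it_max=20):
    """One sweep of Step 4: tune each trade-off weight by descending on v_R."""
    P = np.linalg.inv(H)
    for j in range(len(y)):
        _, alpha, _ = floo_terms(P, y)
        v_r = 1e6                                  # very large v_R ~ "leave v_j unchanged"
        for _ in range(it_max):                    # Step 4.1: descend on v_R
            g = (sse_after_update(P, alpha, j, v_r * 1.001)
                 - sse_after_update(P, alpha, j, v_r)) / (0.001 * v_r)
            v_new = v_r - eta * g                  # Eq. (19) with a finite-difference gradient
            if v_new <= 0 or sse_after_update(P, alpha, j, v_new) >= sse_after_update(P, alpha, j, v_r):
                break                              # Step 4.3: stop when SSE no longer improves
            v_r = v_new
        z = v_r + P[j + 1, j + 1]
        P = P - np.outer(P[:, j + 1], P[:, j + 1]) / z   # Eq. (15): recursive update of P
    b, alpha, sse = floo_terms(P, y)
    return b, alpha, sse
```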

In summary, the flowchart of the proposed ALKL algorithm is illustrated in Fig. 2. Firstly, the SLPP technique is employed to search for the proper mapping direction, which helps select the relevant samples and determines the weight matrix in Eq. (10). Secondly, the proposed AW-LSSVR, with a self-adjusting procedure for the trade-off parameters, is adopted to train the local model and predict the output. For each query datum, the trade-off weights are optimized iteratively and recursively along the gradient descent direction by minimizing the SSE of the l relevant samples under the FLOO-CV scheme.

3. Experimental results

In this section, two examples are utilized to evaluate the characteristics of the proposed scheme from different aspects and for comparison purposes. Note that in all the experiments in this paper, $S^x_{ij}$ in the newly proposed similarity measurement is defined as:

$$S^x_{ij} = 1 - \frac{d_{ij}}{\max(d)} \qquad (20)$$

where max(d) represents the maximum of the Euclidean distances between the training samples, and $S^y_{ij}$ is defined analogously.
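A minimal sketch of this distance-based similarity and of its combination in Eq. (2); the helper names are ours:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def similarity_matrix(V):
    """Eq. (20): S_ij = 1 - d_ij / max(d), usable for inputs (V = X) and outputs (V = y)."""
    V = np.asarray(V, dtype=float).reshape(len(V), -1)   # 1-D targets become a column
    d = squareform(pdist(V))                             # pairwise Euclidean distances
    return 1.0 - d / d.max()

def combined_similarity(X, y, rho1):
    """Eq. (2): SI_ij = (Sx_ij)^rho1 * (Sy_ij)^(1 - rho1)."""
    return similarity_matrix(X) ** rho1 * similarity_matrix(y) ** (1.0 - rho1)
```

The resulting matrix can be passed directly as S to the slpp_fit sketch shown in Section 2.1.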
Fig. 2 – The flowchart of the proposed ALKL method.

iwmi

e

wut

3d

Futs

y

wwdadw

n

ssw

Table 1 – MER values of different similaritymeasurement for the sigmoid function.

Proposed measurement (2) Naive way (1)

�1 = 0 � = 0 � = 0.5 � = 1

MER 5.56% 86.04% 86.7% 85.8%

To assess the performance more insightful, the follow-ng indices are considered: (1) SD: the square distance of theeight vector w, i.e., (SD = (w − w)

T(w − w)) and (2) MER: the

ean error rate of the selected l relevant samples are adoptedn the numerical examples; while (1) RMSE: root mean square

rror RMSE −√∑l

i=1(yi − yi)/l, and (2) Pert: the percentage

ith the predictive error within a predefined threshold t aresed for the industrial FCCU process where w and yi denoteshe prediction term of w and yi, respectively.

.1. Monotone function estimation: comparison ofifferent similarity criteria

irst, to illustrate the effect of the similarity measurementsing output information, a simple monotone nonlinear func-ion, sigmoid function is choose to compare different kinds ofimilarity indices. The sigmoid function is described as:

= 1

1 + e−wTx+ � (21)

here w is the weight vector, � is a white Gaussian noiseith zeros mean and as standard deviation 5% of the stan-ard deviation of the true output sequence y. 250 data pointsre generated where the first 200 samples are used as trainingata, and the last 50 are used for validation. The weight vectorfollows unit distribution [0 1] for each component, and it is

ormalized as wi = wi/

√∑N

1 wi2. For each query instance, 10

amples are chosen from the training dataset as the relevant

amples and compared with the ground truth which utilizesas the feature weights.Two kinds of similarity measurement

are considered: the naive way, which only consider the inputinformation as defined in (1) with � = 0, 0.5, 1, and the proposedmeasurement in (2) where �1 = 0 for monotone function. Thisexperiment is executed for 100 times with different w andTable 1 shows the MER values of these four measurements.Also, the estimated SD value is with a mean value 9.45e−5and a standard deviation 7.71e−5.
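A sketch of how such a benchmark can be generated; the feature dimension, the input distribution and the random seed are not specified in the paper, so the values below (N = 5, uniform inputs, fixed seed) are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)                       # fixed seed, for reproducibility only
N = 5                                                # assumed feature dimension
w = rng.uniform(0.0, 1.0, N)
w /= np.sqrt(np.sum(w ** 2))                         # normalization w_i = w_i / sqrt(sum w_i^2)
X = rng.uniform(0.0, 1.0, (250, N))                  # assumed input distribution
y_true = 1.0 / (1.0 + np.exp(-X @ w))                # sigmoid response of Eq. (21)
y = y_true + rng.normal(0.0, 0.05 * y_true.std(), 250)  # 5% white Gaussian noise
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]
```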

As can be drawn from Table 1 and the calculated SD value, utilizing the output information plays an important role both in estimating the weights of the secondary variables (the weight vector w can be estimated accurately) and in selecting the relevant samples for local learning (a far better accuracy is achieved than when utilizing the input information only). Thus, it is crucial to employ both input and output variables in the designed similarity measurement.

3.2. FCCU endpoint prediction: comparison of kernel learning strategies

The FCCU is the core unit of secondary oil processing and an essential component of refinery equipment. A typical FCCU process converts the high-boiling, high-molecular-weight hydrocarbon fractions of petroleum crude oil into a range of lighter hydrocarbon products (e.g., gasoline). Its operating conditions strongly affect the yield of light oil in petroleum refining; therefore, the FCCU plays an important role in the overall economic performance of the refinery (Alaradi and Rohani, 2002; Feng, 2003; Yan et al., 2004; Yang et al., 1998). Generally, an FCCU process consists of a reactor–regenerator subsystem, a fractionator subsystem, an absorber–stabilizer subsystem and a gas sweetening subsystem. Our work focuses on the fractionator subsystem, whose main objective is to split crude oil into several more valuable products, e.g., gasoline, through fractional distillation. To better control quality and quantity, it is necessary to predict the crude oil endpoint online and accurately.

The process data are collected from the distributed control system and the corresponding daily laboratory analyses of an industrial FCCU refinery in China (Feng, 2003). Based on prior information, 7 features are selected as secondary variables: the temperature at the top of the main fractionating tower; the rate and temperature of the cool reflux flow of the crude oil at the top of the tower; the rate, discharge temperature and return temperature of the cycle flow at the top of the tower; and the catalytic reaction temperature. The prime variable to be estimated is the endpoint of the crude oil (Feng, 2003). 120 pairs of samples are used for the validation, of which the first 40 are selected as the initial training dataset and the rest are treated as the testing set.

Two cases are investigated in this part. The first studies the modeling performance of different kernel learning strategies in the naive space (W = I in (10)); LSSVR, SRKL with different predictive error bound (PEB) values, LLSSVR and the proposed local AW-LSSVR are compared in detail. The second case is performed in the SLPP projected space, to show the effectiveness of ALKL with different ρ1 values. Note that the hyper-parameters, i.e., the Gaussian kernel width σ, the trade-off regularization parameters and the similarity balance weight, are selected using FLOO (Cawley and Talbot, 2004) on the training dataset, and l is fixed at 15 for the second case.
Fig. 3 – Prediction results of the crude oil endpoint using LSSVR, LLSSVR (l = 15) and AW-LSSVR (l = 15) in the naive space.

Fig. 4 – RMSE values of different kernel learning strategies (LSSVR, LLSSVR and ALKL) with different ρ1 values in the SLPP projected space.

3.2.1. Kernel learning strategy comparison in the naive space

The comparison of the kernel learning strategies in this case is carried out in the naive space, where W = I in (10) and the relevant data are selected directly based on the Euclidean distance of the input variables. Two global kernel-based methods, i.e., LSSVR (Suykens et al., 2002) and SRKL (Liu et al., 2010) with PEB values 1 and 1.5, and two local modeling techniques, i.e., LLSSVR (Liu et al., 2008) and the proposed local AW-LSSVR with different numbers of relevant instances (l = 10, 15 and 20), are applied to the online FCCU endpoint prediction. The prediction comparison on the testing dataset is shown in Fig. 3, and the RMSE and P1.5 indices are listed in Table 2.

From Fig. 3 and Table 2, one can confirm that, in the naive space, the local modeling algorithms (LLSSVR and AW-LSSVR) obtain better prediction results than the global techniques (LSSVR and SRKL). Also, no matter what the value of l is, the proposed local AW-LSSVR achieves a lower RMSE and a higher P1.5 index than LLSSVR. In general, the proposed local AW-LSSVR outperforms the other three modeling methods, which also indicates that the choice of the trade-off parameters plays an important role in obtaining an accurate prediction model.

Table 2 – RMSE and P1.5 indices using different kernel learning strategies in the naive space.

          LSSVR   SRKL                  LLSSVR                  AW-LSSVR
                  PEB = 1   PEB = 1.5   l = 10  l = 15  l = 20  l = 10  l = 15  l = 20
RMSE      1.71    1.76      1.78        1.68    1.60    1.62    1.59    1.50    1.57
P1.5 (%)  72.5    72.5      72.5        81.3    78.8    78.8    83.8    81.3    78.8

Table 3 – RMSE values for different ρ1 values with three different modeling methods.

          ρ1
          0      0.1    0.2    0.3     0.4    0.5    0.6    0.7    0.8    0.9     1
LSSVR     2.49   1.81   1.83   1.92    1.94   1.97   1.98   1.99   2.00   1.54a   2.49
LLSSVR    2.50   1.78   1.74   1.78    1.74   1.74   1.74   1.75   1.75   1.56a   2.49
ALKL      2.49   1.62   1.51   1.44a   1.40   1.46   1.54   1.48   1.58   1.56    2.49

a The RMSE value at the optimal ρ1 chosen from the training dataset (0.9, 0.9 and 0.3 for LSSVR, LLSSVR and ALKL, respectively).

3.2.2. Kernel learning strategy comparison in the SLPP projected space

The proposed ALKL is also applied to the FCCU data in the SLPP projected space. Firstly, different ρ1 values are chosen to construct the similarity index in (2), with the optimal value selected from the training dataset marked by the symbol (a). Fig. 4 shows the RMSE values on the testing dataset for different ρ1 values, and details of the corresponding soft analyzer models (LSSVR, LLSSVR and ALKL) are tabulated in Table 3. Also, the P1.5 index and the predictive error at the optimal ρ1 value are shown in Table 4 and Fig. 5, respectively.

Table 4 – P1.5 values at the optimal ρ1 values in the SLPP projected space.

       LSSVR (ρ1 = 0.9)   LLSSVR (ρ1 = 0.9)   ALKL (ρ1 = 0.3)
P1.5   81.3%              81.3%               85%

From Fig. 4 and Table 3, two conclusions can be drawn:

(1) No matter what kernel-based learning strategy is used to estimate the model, the criterion utilizing both the input features and the target information (0 < ρ1 < 1) helps achieve better prediction than using the input features only (ρ1 = 1) or employing the target information alone (ρ1 = 0).

(2) Regardless of the ρ1 value, ALKL outperforms the other two methods, which also proves that the proposed AW-LSSVR achieves the best performance in the SLPP projected space.


Together with the experimental results in the naive space, these results show that AW-LSSVR is generally able to estimate the local model with a higher accuracy than the other validated kernel-based techniques. Meanwhile, from Table 4, ALKL performs the best of all the tested algorithms, which shows the effectiveness of the proposed method.

Fig. 5 – Predictive error using LSSVR, LLSSVR (l = 15) and ALKL (l = 15) in the SLPP projected space. The parameter ρ1 is chosen from the training dataset using the FLOO criterion, and is 0.9, 0.9 and 0.3 for LSSVR, LLSSVR and ALKL, respectively.

4. Conclusion

In this paper, we propose a new local adaptive kernel-based learning method to address limitations such as small-sample conditions and changing process characteristics in industrial processes. Our main contributions include a new similarity measurement utilizing both input and output information; a new SLPP version for regression problems, used to select relevant samples and determine the weights of the relevant features; and a self-adjusting procedure to tune the trade-off parameters in WLSSVR (named AW-LSSVR) in an iterative and recursive way, which leads to high precision at a low computational complexity.

Also, the experimental results on the numerical monotone function and on crude oil endpoint prediction demonstrate that ALKL can estimate the output with high precision, and that the new similarity measurement containing both input and output information indeed helps determine the relevant features and select the relevant nearest samples.

Acknowledgements

This work was supported by the National High Technology R&D Program ("863" Program) of China (No. 2009AA04Z126), the National Natural Science Foundation of China (No. 20776128) and the Alexander von Humboldt Foundation of Germany (Dr. Haiqing Wang), which are gratefully acknowledged. Meanwhile, the authors would like to thank the reviewers for their constructive comments in improving the quality of this paper.

References

Alaradi, A., Rohani, S., 2002. Identification and control of a riser-type FCC unit using neural networks. Comput. Chem. Eng. 26 (3), 401–421.
Atkeson, C.G., Moore, A.W., Schaal, S., 1997. Locally weighted learning. Artif. Intell. Rev. 11 (1), 11–73.
Cawley, G., Talbot, N., 2004. Fast exact leave-one-out cross-validation of sparse least-squares support vector machines. Neural Networks 17 (10), 1467–1475.
Cheng, C., Chiu, M.S., 2004. A new data-based methodology for nonlinear process modeling. Chem. Eng. Sci. 59 (13), 2801–2810.
Cheng, C., Chiu, M.S., 2007. Adaptive IMC controller design for nonlinear process control. Chem. Eng. Res. Des. 85 (2), 234–244.
Cheng, J., Liu, Q., Lu, H., Chen, Y.W., 2005. Supervised kernel locality preserving projections for face recognition. Neurocomputing 67, 443–449.
Dochain, D., 2003. State and parameter estimation in chemical and biochemical processes: a tutorial. J. Process Control 13 (8), 801–818.
Dong, D., McAvoy, T.J., Chang, L.J., 1995. Emission monitoring using multivariate soft sensors. Proc. Am. Control Conf. 1, 761–765.
Facco, P., Doplicher, F., Bezzo, F., Barolo, M., 2009. Moving average PLS soft sensor for online product quality estimation in an industrial batch polymerization process. J. Process Control 19 (3), 520–529.
Feng, R., 2003. Study on support vector machines and its application in soft sensor of industrial processes. Ph.D. Thesis, Shanghai Jiaotong University, PR China.
Fujiwara, K., Kano, M., Hasebe, M., Takinami, A., 2009. Soft-sensor development using correlation-based just-in-time modeling. AIChE J. 55 (7), 1754–1765.
Gonzaga, J.C.B., Meleiro, L.A.C., Kiang, C., Maciel Filho, R., 2009. ANN-based soft-sensor for real-time process monitoring and control of an industrial polymerization process. Comput. Chem. Eng. 33 (1), 43–49.
Hartnett, M.K., Lightbody, G., Irwin, G.W., 1998. Dynamic inferential estimation using principal components regression (PCR). Chemom. Intell. Lab. Syst. 40 (2), 215–224.
He, X.F., Niyogi, P., 2003. Locality preserving projections. NIPS 16, 153–160.
Kadlec, P., Gabrys, B., Strandt, S., 2009. Data-driven soft sensors in the process industry. Comput. Chem. Eng. 33 (4), 795–814.
Kaneko, H., Arakawa, M., Funatsu, K., 2009. Development of a new soft sensor method using independent component analysis and partial least squares. AIChE J. 55 (1), 87–98.
Kim, K., Lee, J.M., Lee, I.B., 2005. A novel multivariate regression approach based on kernel partial least squares with orthogonal signal correction. Chemom. Intell. Lab. Syst. 79 (1/2), 22–30.
Kresta, J.V., Marlin, T.E., MacGregor, J.F., 1994. Development of inferential process models using PLS. Comput. Chem. Eng. 18 (7), 597–611.
Lee, D.E., Song, J.H., Song, S.O., Yoon, E.S., 2005. Weighted support vector machine for quality estimation in the polymerization process. Ind. Eng. Chem. Res. 44 (7), 2101–2105.
Li, J.B., Pan, J.S., Chu, S.C., 2007. Kernel class-wise locality preserving projection. Inform. Sci. 178 (7), 1823–1835.
Li, W., Yue, H.H., Valle-Cervantes, S., Qin, S.J., 2000. Recursive PCA for adaptive process monitoring. J. Process Control 10 (5), 471–486.
Lin, B., Recke, B., Knudsen, J.K.H., Jørgensen, S.B., 2007. A systematic approach for soft sensor development. Comput. Chem. Eng. 31 (5/6), 419–425.
Liu, X.Q., Kruger, U., Littler, T., Xie, L., Wang, S.Q., 2009a. Moving window kernel PCA for adaptive monitoring of nonlinear processes. Chemom. Intell. Lab. Syst. 96 (2), 132–143.
Liu, Y., Hu, N., Wang, H.Q., Li, P., 2009b. Soft chemical analyzer development using adaptive least-squares support vector regression with selective pruning and variable moving window size. Ind. Eng. Chem. Res. 48 (12), 5731–5741.
Liu, Y., Wang, H.Q., Li, P., 2008. Adaptive local learning based least squares support vector regression with application to online modeling. Chin. J. Chem. Eng. 59 (8), 2052–2057.
Liu, Y., Wang, H.Q., Yu, J., Li, P., 2010. Selective recursive kernel learning for online identification of nonlinear systems with NARX form. J. Process Control 20 (2), 181–194.
Lu, N.Y., Yang, Y., Gao, F.R., Wang, F.L., 2004. Multirate dynamic inferential modeling for multivariable processes. Chem. Eng. Sci. 59 (4), 855–864.
Quan, T.W., Liu, X.M., Liu, Q., 2010. Weighted least squares support vector machine local region method for nonlinear time series prediction. Appl. Soft Comput. 10 (2), 562–566.
Rallo, R., Ferre-Giné, J., Arenas, A., Giralt, F., 2002. Neural virtual sensor for the inferential prediction of product quality from process variables. Comput. Chem. Eng. 26 (12), 1735–1754.
Suykens, J.A.K., Van Gestel, T., De Brabanter, J., 2002. Least Squares Support Vector Machines. World Scientific, Singapore.
Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer, New York.
Wang, H.Q., Li, P., Song, Z.H., Ding, S.X., 2006. Adaptive kernel learning networks with application to nonlinear system identification. LNCS 4232, 737–746.
Wang, X., Kruger, U., Lennox, B., 2003. Recursive partial least squares algorithms for monitoring complex industrial processes. Control Eng. Pract. 11 (6), 613–632.
Yan, W.W., Shao, H.H., Wang, X.F., 2004. Soft sensing modeling based on support vector machine and Bayesian model selection. Comput. Chem. Eng. 28 (8), 1489–1498.
Yang, S.H., Wang, X.Z., McGreavy, C., Chen, Q.H., 1998. Soft sensor based predictive control of industrial fluid catalytic cracking processes. Chem. Eng. Res. Des. 76 (4), 499–508.
Yao, Y., Gao, F.R., 2009. A survey on multistage/multiphase statistical modeling methods for batch processes. Annu. Rev. Control 33 (2), 172–183.