
    "Gheorghe Asachi" Tehnical UniversityFaculty of Automatic Control and Computer

    Engineering

    Departament of Automatic Control and Applied

    Informatics

    Iai

    University Duisburg Essen

    Faculty Automatic Control and Complex Sciences

    Duisburg

    2009

    Fault diagnosis in linear andnonlinear systemsDiploma thesis

    Lucian-Adrian Hulu


    Table of Contents

    1. Introduction ......................................................................................................................................... 3

    2. The Learning Methodology .................................................................................................................. 4

    2.1. Supervised Learning ...................................................................................................................... 4

    2.2. Learning and Generalization ......................................................................................................... 6

    2.3. Improving Generalization ............................................................................................................. 7

    2.4. Fault diagnosis as a classification problem ................................................................................... 8

    3. Support Vector Machines for fault diagnosis ....................................................................................... 8

    3.1. Two-Class Support Vector Machines ........................................................................................ 8

    3.1.1. Hard-Margin Support Vector Machines ................................................................................ 9

    3.1.2. L1 Soft-Margin Support Vector Machines ........................................................................... 14

    3.1.3. Mapping to a High-Dimensional Space ............................................................................... 18

    3.1.3.1. Kernel tricks .................................................................................................... 18

    3.1.3.2. Kernels ............................................................................................................. 18

    3.1.4. L2 Soft-Margin Support Vector Machines ........................................................................... 19

    3.2. Multiclass Support Vector Machines ...................................................................................... 22

    3.2.1. One-against-all .................................................................................................................... 22

    3.2.2. One-against-one .................................................................................................................. 22

    3.3. Support Vector Regression ..................................................................................................... 22

    3.4. Advantages and Disadvantages .............................................................................................. 23

    4. Data-driven methods for fault diagnosis ........................................................................................... 24

    4.1. Principal component analysis ..................................................................................................... 24

    4.2. Projection algorithms .................................................................................................................. 24

    5. Application to Tennessee Eastman process as a benchmark ............................................................ 24

    5.1. Problem formulation ................................................................................................................... 26

    5.2. Support vector machines & Kernel methods on TE .................................................................... 27

    5.3. PCA on TE .................................................................................................................................... 28

    6. List of Figures ..................................................................................................................................... 30

    7. Abbreviations ..................................................................................................................................... 30

    8. Nomenclature .................................................................................................................................... 30

    9. References .......................................................................................................................................... 31

    10. Annex 1 Code listing ..................................................................................................................... 32

    10.1. PCA, T² statistic and Q statistic algorithms .......................................................................... 32

    10.2. Support vector machines & Kernel methods .......................................................................... 34


1. Introduction

Given the serious consequences of accidents that may occur in chemical plants operated under faulty conditions, the importance of incipient fault detection and isolation is well recognized. Although catastrophes and disasters due to chemical plant failures may be infrequent, minor incidents are very common, resulting in personal injury, illness, and material losses that cost society large amounts of resources every year.

It is difficult for process operators to detect the existence of a fault by examining time-sequenced data among the hundreds of variables recorded in modern chemical processes. Isolating the fault in order to discern its root causes is an even more difficult problem, as it requires managing a great amount of information. These supervision difficulties point out the need to develop decision-support tools that help process operators manage plant incidents.

Different techniques have been developed to detect and diagnose faults in complex chemical plants, together with computer-aided decision-support tools. In parallel with the growth of computational power over the last decades, diverse statistical techniques have been developed to carry out effective process fault detection. The difficulty of developing accurate first-principles models, the highly coupled nature of chemical processes and the overwhelming amount of stored data have made these statistical techniques one of the most successfully applied tools for process monitoring in the chemical industry.

On the other hand, the fault diagnosis problem, i.e. the isolation and characterization of the detected anomalies, has not been properly solved yet, as it still presents important practical limitations. Three main methodologies have been followed in fault diagnosis of chemical processes:

1. Quantitative model-based methods,
2. Qualitative model-based methods,
3. Process history-based methods.

Several approaches based on the third alternative, such as neural networks or fuzzy systems, have been implemented to overcome the lack of the quantitative and qualitative process knowledge required by the first two FDS approaches.

Besides not requiring first principles or qualitative knowledge, data-based models can deal with the intrinsic non-linearity of the variables involved in most chemical processes at no additional cost. Even though they are also limited by difficulties in class generalization, multiple-


fault diagnosis and the handling of data uncertainties, they are the most popular techniques applied in industry because of their simplicity and their ability to handle non-linearity.

In any case, none of the works previously reported in the literature guarantees sufficient confidence and reliability in a general diagnosis problem. All the existing practical limitations therefore make this an open field requiring further work.

This work proposes a new classification approach to the fault diagnosis problem using a technique recently developed in the machine learning area. This technique is a process history-based methodology that has shown higher performance than other reported pattern classifiers in different technical areas. The proposed approach uses support vector machines (SVM), a kernel-based method from statistical learning theory that has succeeded in many classification problems (e.g. computational linguistics). Recently, it has been applied to fault diagnosis in chemical plants using the Tennessee Eastman (TE) benchmark.

2. The Learning Methodology

The construction of machines capable of learning from experience has for a long time been

    the object of both philosophical and technical debate. The technical aspect of the debate has

    received an enormous impetus from the advent of electronic computers. They have

    demonstrated that machines can display a significant level of learning ability, though the

    boundaries of this ability are far from being clearly defined.

    The availability of reliable learning systems is of strategic importance, as there are many

    tasks that cannot be solved by classical programming techniques, since no mathematical model

    of the problem is available. So for example it is not known how to write a computer program to

    perform hand-written character recognition, though there are plenty of examples available. It is

therefore natural to ask whether a computer could be trained to recognize the letter 'A' from examples; after all, this is the way humans learn to read. We will refer to this approach to problem solving as the learning methodology.

    The same reasoning applies to the problem of finding genes in a DNA sequence, filtering

    email, detecting or recognizing objects in machine vision, and so on. Solving each of these

    problems has the potential to revolutionize some aspect of our life, and for each of them

machine learning algorithms could provide the key to its solution.

2.1. Supervised Learning

When computers are applied to solve a practical problem it is usually the case that the

    method of deriving the required output from a set of inputs can be described explicitly. The


    task of the system designer and eventually the programmer implementing the specifications

    will be to translate that method into a sequence of instructions which the computer will follow

    to achieve the desired effect.

    As computers are applied to solve more complex problems, however, situations can arise in

which there is no known method for computing the desired output from a set of inputs, or where that computation may be very expensive. Examples of this type of situation might be

    modeling a complex chemical reaction, where the precise interactions of the different reactants

    are not known, or classification of protein types based on the DNA sequence from which they

    are generated, or the classification of credit applications into those who will default and those

    who will repay the loan.

    These tasks cannot be solved by a traditional programming approach since the system

    designer cannot precisely specify the method by which the correct output can be computed

    from the input data. An alternative strategy for solving this type of problem is for the computer

    to attempt to learn the input/output functionality from examples, in the same way that

children learn which cars are sports cars, simply by being told which of a large number of cars are sporty, rather than by being given a precise specification of sportiness. The approach of using examples to synthesize programs is known as the learning methodology, and in the particular

    case when the examples are input/output pairs it is called supervised learning. The examples of

    input/output functionality are referred to as the training data.

    The input/output pairings typically reflect a functional relationship mapping inputs to

    outputs, though this is not always the case as for example when the outputs are corrupted by

    noise. When an underlying function from inputs to outputs exists it is referred to as the target

    function. The estimate of the target function which is learnt or output by the learning algorithm

is known as the solution of the learning problem. In the case of classification this function is sometimes referred to as the decision function. The solution is chosen from a set of candidate

    functions which map from the input space to the output domain. Usually we will choose a

    particular set or class of candidate functions known as hypotheses before we begin trying to

    learn the correct function. For example, so-called decision trees are hypotheses created by

    constructing a binary tree with simple decision functions at the internal nodes and output

    values at the leaves. Hence, we can view the choice of the set of hypotheses (or hypothesis

    space) as one of the key ingredients of the learning strategy. The algorithm which takes the

    training data as input and selects a hypothesis from the hypothesis space is the second

    important ingredient. It is referred to as the learning algorithm.

    In the case of learning to distinguish sports cars the output is a simple yes/no tag which we

    can think of as a binary output value. For the problem of recognizing protein types, the output

    value will be one of a finite number of categories, while the output values when modeling a

    chemical reaction might be the concentrations of the reactants given as real values. A learning

    problem with binary outputs is referred to as a binary classification problem, one with a finite

    number of categories as multi-class classification, while for real-valued outputs the problem

    becomes known as regression.


    There are other types of learning, for example unsupervised learning considers the case

    where there are no output values and the learning task is to gain some understanding of the

    process that generated the data. This type of learning includes density estimation, learning the

    support of a distribution, clustering, and so on. There are also models of learning which

consider more complex interactions between a learner and their environment. Perhaps the simplest case is when the learner is allowed to query the environment about the output

    associated with a particular input. The study of how this affects the learner's ability to learn

    different tasks is known as query learning. Further complexities of interaction are considered in

    reinforcement learning, where the learner has a range of actions at their disposal which they

    can take to attempt to move towards states where they can expect high rewards. The learning

    methodology can play a part in reinforcement learning if we treat the optimal action as the

    output of a function of the current state of the learner. There are, however, significant

    complications since the quality of the output can only be assessed indirectly as the

    consequences of an action become clear.

2.2. Learning and Generalization

The quality of an on-line learning algorithm can be assessed in terms of

    the number of mistakes it makes during the training phase. It is not immediately clear,

    however, how we can assess the quality of a hypothesis generated during batch learning. Early

    machine learning algorithms aimed to learn representations of simple symbolic functions that

    could be understood and verified by experts. Hence, the goal of learning in this paradigm was

    to output a hypothesis that performed the correct classification of the training data and early

    learning algorithms were designed to find such an accurate fit to the data. Such a hypothesis is

said to be consistent. There are two problems with the goal of generating a verifiable consistent hypothesis.

    The first is that the function we are trying to learn may not have a simple representation

    and hence may not be easily verified in this way. An example of this situation is the

    identification of genes within a DNA sequence. Certain subsequences are genes and others are

    not, but there is no simple way to categorize which are which.

    The second problem is that frequently training data are noisy and so there is no guarantee

    that there is an underlying function which correctly maps the training data. The example of

credit checking is clearly in this category, since the decision to default may be a result of factors simply not available to the system. A second example would be the classification of web pages

    into categories, which again can never be an exact science.

    The type of data that is of interest to machine learning practitioners is increasingly of these

    two types, hence rendering the proposed measure of quality difficult to implement. There is,

    however, a more fundamental problem with this approach in that even when we can find a

hypothesis that is consistent with the training data, it may not make correct classifications of


    unseen data. The ability of a hypothesis to correctly classify data not in the training set is known

    as its generalization, and it is this property that we shall aim to optimize.

    Shifting our goal to generalization removes the need to view our hypothesis as a correct

    representation of the true function. If the hypothesis gives the right output it satisfies the

generalization criterion, which in this sense has now become a functional measure rather than a descriptional one. In this sense the criterion places no constraints on the size or on the 'meaning' of the hypothesis; for the time being these can be considered to be arbitrary.

2.3. Improving Generalization

The generalization criterion places an altogether different constraint on the learning

    algorithm. This is most amply illustrated by the extreme case of rote learning. Many classical

    algorithms of machine learning are capable of representing any function and for difficult

    training sets will give a hypothesis that behaves like a rote learner. By a rote learner we mean

one that correctly classifies the data in the training set, but makes essentially uncorrelated predictions on unseen data. For example, decision trees can grow so large that there is a leaf

    for each training example. Hypotheses that become too complex in order to become consistent

    are said to overfit. One way of trying to control this difficulty is to restrict the size of the

    hypothesis, for example pruning the size of the decision tree.

    The approach that we will adopt is to motivate the trade-off by reference to statistical

    bounds on the generalization error. These bounds will typically depend on certain quantities

    such as the margin of the classifier, and hence motivate algorithms which optimize the

    particular measure. The drawback of such an approach is that the algorithm is only as good as

    the result that motivates it. On the other hand the strength is that the statistical result provides

    a well-founded basis for the approach, hence avoiding the danger of a heuristic that may be

    based on a misleading intuition.

    The fact that the algorithm design is based on a statistical result does not mean that we

    ignore the computational complexity of solving the particular optimization problem. We are

    interested in techniques that will scale from toy problems to large realistic datasets of hundreds

    of thousands of examples. It is only by performing a principled analysis of the computational

    complexity that we can avoid settling for heuristics that work well on small examples, but break

    down once larger training sets are used. The theory of computational complexity identifies two

classes of problems. For the first class there exist algorithms that run in time polynomial in the size of the input, while for the second the existence of such an algorithm would imply that any

    problem for which we can check a solution in polynomial time can also be solved in polynomial

    time. This second class of problems is known as the NP-complete problems and it is generally

    believed that these problems cannot be solved efficiently.

    If no restriction is placed over the set of all possible hypotheses (that is all possible

    functions from the input space to the output domain), then learning is impossible since no


    amount of training data will tell us how to classify unseen examples. Problems also arise if we

    allow ourselves the freedom of choosing the set of hypotheses after seeing the data, since we

can simply place all of the prior probability on the correct hypothesis. In this sense it is true that all learning systems have to make some prior assumption of a Bayesian type, often called

    the learning bias.

2.4. Fault diagnosis as a classification problem

Following the previous analysis, the fault diagnosis problem could be formulated as a

    classical classification problem in order to extend the concepts successfully applied in machine

    learning theory. In this sense, better generalization capabilities or more accurate and flexible

pattern training algorithms could overcome limitations on fault isolation or multiple-fault

    management.

    By adopting this classification view, the faults to be managed in a process industry would be

    different classes in the general classification problem. As in any classification problem, it will be

crucial to choose meaningful variables or features that properly represent the diagnosis problem, to adopt suitable learning methodologies and to follow rigorous validation procedures to

    avoid the FDS over-fitting. Under this view the classical fault detection problem is a binary

    classification (BC) problem, consisting of determining whether the process is in or out of

    control. The global fault diagnosis problem is considered as a multi-class classification (MC)

problem, in which many classes are involved. In order to solve both kinds of problems, different learning algorithms have been developed in the machine learning area, obtaining very promising results (e.g. Naive Bayes, K-Nearest Neighbours, SVM, AdaBoost.MH, etc.).
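A minimal illustration of this binary view, using the TE data files that also appear in the annex code (the variable names X and y are illustrative only): the fault-free test set forms the "in control" class and a faulty test set the "out of control" class.

% Fault detection cast as binary classification:
% class -1 = normal operation (d00_te), class +1 = faulty operation (here fault 10).
load data/d00_te.dat
load data/d10_te.dat
X = [d00_te; d10_te];                                    % feature matrix of recorded process variables
y = [-ones(size(d00_te,1),1); ones(size(d10_te,1),1)];   % class labels for the two process states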

3. Support Vector Machines for fault diagnosis

3.1. Two-Class Support Vector Machines

In training a classifier, we usually try to maximize the classification performance for the training data. But if the classifier fits the training data too closely, the classification ability for unknown data, i.e. the generalization ability, is degraded. This phenomenon is called overfitting. Namely, there is a trade-off between the generalization ability and the fit to the training data.

    Determination of decision functions using input-output pairs is called training.

    Conventional training methods determine the indirect decision functions so that each training

    input is correctly classified into the class designated by the associated training output. Figure 1

    shows an example of the decision functions obtained when the training data of two classes do

    not overlap.


    Figure 1 Simple example of classification

    Thus there are infinite possibilities of the positions of the decision functions that

    correctly classify the training data. Although the generalization ability is directly affected by the

    positions, conventional training methods do not consider this.

    In a SVM, the direct decision function that maximizes the generalization ability is

    determined for a two-class problem. Assuming that the training data of different classes do not

    overlap, the decision function is determined so that the distance from the training data is

    maximized. We call this the optimal decision function (Figure 2). For a two-class problem, a

    support vector machine is trained so that the direct decision function maximizes the

    generalization ability.

    Figure 2 Optimal decision function

3.1.1. Hard-Margin Support Vector Machines

Let the $m$-dimensional training inputs $\mathbf{x}_i$ ($i = 1,\dots,M$) belong to Class 1 or Class 2 and the associated labels be $y_i = 1$ for Class 1 and $y_i = -1$ for Class 2. If these data are linearly separable, we can determine the decision function:


$$D(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b, \qquad (1)$$

where $\mathbf{w}$ is an $m$-dimensional vector, $b$ is a bias term, and for $i = 1,\dots,M$

$$\mathbf{w}^{\top}\mathbf{x}_i + b \;\begin{cases} > 0 & \text{for } y_i = 1, \\ < 0 & \text{for } y_i = -1. \end{cases} \qquad (2)$$

Because the training data are linearly separable, no training data satisfy $\mathbf{w}^{\top}\mathbf{x} + b = 0$. Thus, to control separability, instead of (2) we consider the following inequalities:

$$\mathbf{w}^{\top}\mathbf{x}_i + b \;\begin{cases} \ge 1 & \text{for } y_i = 1, \\ \le -1 & \text{for } y_i = -1. \end{cases} \qquad (3)$$

Here, 1 and $-1$ on the right-hand sides of the inequalities can be a constant $a$ ($> 0$) and $-a$, respectively; but by dividing both sides of the inequalities by $a$, (3) is obtained. Equation (3) is equivalent to

$$y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,M. \qquad (4)$$

The hyperplane

$$D(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b = c, \qquad -1 < c < 1, \qquad (5)$$

forms a separating hyperplane that separates $\mathbf{x}_i$ ($i = 1,\dots,M$). When $c = 0$, the separating hyperplane is in the middle of the two hyperplanes with $c = 1$ and $c = -1$. The distance between the separating hyperplane and the training datum nearest to the hyperplane is called the margin (Figure 3).

Figure 3 Margin

Assuming that the hyperplanes $D(\mathbf{x}) = 1$ and $D(\mathbf{x}) = -1$ include at least one training datum each, the hyperplane $D(\mathbf{x}) = 0$ has the maximum margin for $-1 < c < 1$. The region $\{\mathbf{x} \mid -1 \le D(\mathbf{x}) \le 1\}$ is the generalization region for the decision function. The generalization ability depends on


the location of the separating hyperplane, and the hyperplane with the maximum margin is called the optimal separating hyperplane (Figure 4). Because we can obtain the same optimal separating hyperplane even if we delete all the data that satisfy the strict inequalities in (4), the data that satisfy the equalities are called support vectors (Figure 4).

    Figure 4 Optimal hyperplane

Assume that no outliers are included in the training data and that unknown test data will obey the same probability law as that of the training data. Then it is intuitively clear that the generalization ability is maximized if the optimal separating hyperplane is selected as the separating hyperplane. Now consider determining the optimal separating hyperplane. The Euclidean distance from a training datum $\mathbf{x}$ to the separating hyperplane is given by $|D(\mathbf{x})| / \|\mathbf{w}\|$. This can be shown as follows. Because the vector $\mathbf{w}$ is orthogonal to the separating hyperplane, the line that goes through $\mathbf{x}$ and that is orthogonal to the hyperplane is given by $a\,\mathbf{w}/\|\mathbf{w}\| + \mathbf{x}$, where $|a|$ is the Euclidean distance from $\mathbf{x}$ to the hyperplane. It crosses the hyperplane at the point where

$$D\!\left(\frac{a\,\mathbf{w}}{\|\mathbf{w}\|} + \mathbf{x}\right) = 0 \qquad (6)$$

is satisfied. Solving (6) for $a$, we obtain $a = -D(\mathbf{x})/\|\mathbf{w}\|$.

Then all the training data must satisfy

$$\frac{y_k\,D(\mathbf{x}_k)}{\|\mathbf{w}\|} \ge \delta, \qquad k = 1,\dots,M, \qquad (7)$$

where $\delta$ is the margin.

Now if $(\mathbf{w}, b)$ is a solution, $(a\mathbf{w}, ab)$ is also a solution, where $a$ is a scalar. Thus we impose the following constraint:


$$\delta\,\|\mathbf{w}\| = 1. \qquad (8)$$

From (7) and (8), to find the optimal separating hyperplane, we need to find the $\mathbf{w}$ with the minimum Euclidean norm that satisfies (4).

Therefore, the optimal separating hyperplane can be obtained by minimizing

$$Q(\mathbf{w}) = \frac{1}{2}\,\|\mathbf{w}\|^{2} \qquad (9)$$

with respect to $\mathbf{w}$ and $b$, subject to the constraints

$$y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,M. \qquad (10)$$

Here, the square of the Euclidean norm in (9) is taken to make the optimization problem a quadratic program. The assumption of linear separability means that there exist $\mathbf{w}$ and $b$ that satisfy (10). We call the solutions that satisfy (10) feasible solutions. Because the optimization problem has a quadratic objective function with inequality constraints, even if the solutions are nonunique, the value of the objective function is unique. Thus nonuniqueness is not a problem for support vector machines. Because we can obtain the same optimal separating hyperplane even if we delete all the data that satisfy the strict inequalities in (10), the data that satisfy the equalities are called support vectors¹ (Figure 2 and Figure 4).

The variables of the convex optimization problem given by (9) and (10) are $\mathbf{w}$ and $b$. Thus the number of variables is the number of input variables plus 1: $m + 1$. When the number of input variables is small, we can solve (9) and (10) by the quadratic programming technique. But, as will be discussed later, because we map the input space into a high-dimensional feature space, in some cases with infinite dimensions, we convert (9) and (10) into the equivalent dual problem, whose number of variables is the number of training data.

To do this, we first convert the constrained problem given by (9) and (10) into the unconstrained problem

$$Q(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\,\mathbf{w}^{\top}\mathbf{w} - \sum_{i=1}^{M}\alpha_i\bigl( y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) - 1 \bigr), \qquad (11)$$

where $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_M)^{\top}$ and $\alpha_i$ are the nonnegative Lagrange multipliers. The optimal solution of (11) is given by the saddle point, where (11) is minimized with respect to $\mathbf{w}$ and $b$ and maximized with respect to $\alpha_i$ ($\ge 0$), and it satisfies the following Karush-Kuhn-Tucker (KKT) conditions:

¹ This definition is imprecise. Data can satisfy $y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) = 1$ and yet be deleted without changing the optimal separating hyperplane. Support vectors are defined using the solution of the dual problem, as discussed later.


Because

$$\frac{1}{2}\sum_{i,j=1}^{M}\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^{\top}\mathbf{x}_j = \frac{1}{2}\left(\sum_{i=1}^{M}\alpha_i y_i\mathbf{x}_i\right)^{\!\top}\left(\sum_{j=1}^{M}\alpha_j y_j\mathbf{x}_j\right) \ge 0 \qquad (20)$$

holds, maximizing (18) under the constraints (19) is a concave quadratic programming problem. If a solution exists, namely, if the classification problem is linearly separable, the global optimal solution $\alpha_i$ ($i = 1,\dots,M$) exists. For quadratic programming, the values of the primal and dual objective functions coincide at the optimal solutions if they exist. This is called the zero duality gap.
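For reference, the hard-margin dual problem denoted by (18)-(19) has the standard form; it differs from the L1 soft-margin dual (38)-(39) given later only in the absence of the upper bound $C$ on the multipliers. Maximize

$$Q(\boldsymbol{\alpha}) = \sum_{i=1}^{M}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{M}\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^{\top}\mathbf{x}_j \qquad (18)$$

subject to the constraints

$$\sum_{i=1}^{M} y_i\alpha_i = 0, \qquad \alpha_i \ge 0, \qquad i = 1,\dots,M. \qquad (19)$$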

Data that are associated with positive $\alpha_i$ are support vectors for Classes 1 and 2. Then from (16) the decision function is given by

$$D(\mathbf{x}) = \sum_{i \in S}\alpha_i y_i\,\mathbf{x}_i^{\top}\mathbf{x} + b, \qquad (21)$$

where $S$ is the set of support vector indices, and, from the KKT conditions given by (14), $b$ is given by

$$b = y_i - \mathbf{w}^{\top}\mathbf{x}_i, \qquad (22)$$

where $\mathbf{x}_i$ is a support vector. From the standpoint of precision of calculations, it is better to take the average among the support vectors as follows:

$$b = \frac{1}{|S|}\sum_{i \in S}\bigl( y_i - \mathbf{w}^{\top}\mathbf{x}_i \bigr). \qquad (23)$$

Then an unknown datum $\mathbf{x}$ is classified into

$$\mathbf{x} \in \begin{cases} \text{Class 1} & \text{if } D(\mathbf{x}) > 0, \\ \text{Class 2} & \text{if } D(\mathbf{x}) < 0. \end{cases} \qquad (24)$$

If $D(\mathbf{x}) = 0$, $\mathbf{x}$ is on the boundary and thus is unclassifiable. When the training data are separable, the region $\{\mathbf{x} \mid -1 \le D(\mathbf{x}) \le 1\}$ is a generalization region.

3.1.2. L1 Soft-Margin Support Vector Machines

In hard-margin support vector machines, we assumed that the training data are linearly

    separable. When the data are linearly inseparable, there is no feasible solution, and the hard-

    margin support vector machine is unsolvable. Here we extend the support vector machine so

    that it is applicable to an inseparable case.

To allow inseparability, we introduce the nonnegative slack variables $\xi_i$ ($\ge 0$) into (4), requiring only $y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1 - \xi_i$ for $i = 1,\dots,M$, and minimize $\frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{M}\xi_i$, where $C$ is the margin parameter.


Similar to the linearly separable case, introducing the nonnegative Lagrange multipliers $\alpha_i$ and $\beta_i$, we obtain

$$Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\,\|\mathbf{w}\|^{2} + C\sum_{i=1}^{M}\xi_i - \sum_{i=1}^{M}\alpha_i\bigl( y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) - 1 + \xi_i \bigr) - \sum_{i=1}^{M}\beta_i\xi_i, \qquad (28)$$

where $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_M)^{\top}$ and $\boldsymbol{\beta} = (\beta_1,\dots,\beta_M)^{\top}$. For the optimal solution, the following KKT conditions are satisfied:

$$\frac{\partial Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta})}{\partial \mathbf{w}} = \mathbf{0}, \qquad (29)$$

$$\frac{\partial Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta})}{\partial b} = 0, \qquad (30)$$

$$\frac{\partial Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta})}{\partial \boldsymbol{\xi}} = \mathbf{0}, \qquad (31)$$

$$\alpha_i\bigl( y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) - 1 + \xi_i \bigr) = 0, \qquad i = 1,\dots,M, \qquad (32)$$

$$\beta_i\,\xi_i = 0, \qquad i = 1,\dots,M, \qquad (33)$$

$$\alpha_i \ge 0, \quad \beta_i \ge 0, \quad \xi_i \ge 0, \qquad i = 1,\dots,M. \qquad (34)$$

Using (28), we reduce (29) to (31), respectively, to

$$\mathbf{w} = \sum_{i=1}^{M}\alpha_i y_i\mathbf{x}_i, \qquad (35)$$

$$\sum_{i=1}^{M}\alpha_i y_i = 0, \qquad (36)$$

$$\alpha_i + \beta_i = C, \qquad i = 1,\dots,M. \qquad (37)$$

Thus, substituting (35) to (37) into (28), we obtain the following dual problem. Maximize

$$Q(\boldsymbol{\alpha}) = \sum_{i=1}^{M}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{M}\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^{\top}\mathbf{x}_j \qquad (38)$$

subject to the constraints


$$\sum_{i=1}^{M} y_i\alpha_i = 0, \qquad 0 \le \alpha_i \le C, \qquad i = 1,\dots,M. \qquad (39)$$

The only difference between L1 soft-margin support vector machines and hard-margin support vector machines is that $\alpha_i$ cannot exceed $C$. Especially, (32) and (33) are called the KKT (complementarity) conditions. From these and (37), there are three cases for $\alpha_i$:

1. $\alpha_i = 0$. Then $\xi_i = 0$ and $\mathbf{x}_i$ is correctly classified.

2. $0 < \alpha_i < C$. Then $y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) - 1 + \xi_i = 0$ and $\xi_i = 0$. Therefore $y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) = 1$ and $\mathbf{x}_i$ is a support vector. Especially, we call a support vector with $0 < \alpha_i < C$ an unbounded support vector.

3. $\alpha_i = C$. Then $y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) - 1 + \xi_i = 0$ and $\xi_i \ge 0$. Thus $\mathbf{x}_i$ is a support vector. We call a support vector with $\alpha_i = C$ a bounded support vector. If $0 \le \xi_i < 1$, $\mathbf{x}_i$ is correctly classified, and if $\xi_i \ge 1$, $\mathbf{x}_i$ is misclassified.

The decision function is the same as that of the hard-margin support vector machine and is given by

$$D(\mathbf{x}) = \sum_{i \in S}\alpha_i y_i\,\mathbf{x}_i^{\top}\mathbf{x} + b, \qquad (40)$$

where $S$ is the set of support vector indices. Because $\alpha_i$ are nonzero for the support vectors, the summation in (40) is taken only over the support vectors. For the unbounded support vectors,

$$b = y_i - \mathbf{w}^{\top}\mathbf{x}_i \qquad (41)$$

is satisfied. To ensure the precision of calculations, we take the average of $b$ calculated for the unbounded support vectors,

$$b = \frac{1}{|U|}\sum_{i \in U}\bigl( y_i - \mathbf{w}^{\top}\mathbf{x}_i \bigr), \qquad (42)$$

where $U$ is the set of unbounded support vector indices.

Then an unknown datum $\mathbf{x}$ is classified into

$$\mathbf{x} \in \begin{cases} \text{Class 1} & \text{if } D(\mathbf{x}) > 0, \\ \text{Class 2} & \text{if } D(\mathbf{x}) < 0. \end{cases} \qquad (43)$$

If $D(\mathbf{x}) = 0$, $\mathbf{x}$ is on the boundary and thus is unclassifiable. When there are no bounded support vectors, the region $\{\mathbf{x} \mid -1 \le D(\mathbf{x}) \le 1\}$ is a generalization region, which is the same as for the hard-margin support vector machine.
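To make the preceding development concrete, the following sketch trains a linear L1 soft-margin support vector machine by solving the dual (38)-(39) with MATLAB's quadprog (Optimization Toolbox). The toy data, the value of C and the tolerance 1e-6 are illustrative choices only, not values used elsewhere in this work.

% Linear L1 soft-margin SVM trained via the dual (38)-(39) using quadprog.
X = [1 1; 2 2; 2 0; 0 0; 1 0; 0 1];   % toy training inputs (one row per datum)
y = [1; 1; 1; -1; -1; -1];            % class labels (+1 / -1)
C = 10;                               % margin parameter
M = size(X, 1);

H = (y*y') .* (X*X');                 % H(i,j) = y_i y_j x_i' x_j
f = -ones(M, 1);                      % quadprog minimizes 1/2 a'Ha + f'a, i.e. -Q(a)
alpha = quadprog(H, f, [], [], y', 0, zeros(M,1), C*ones(M,1));   % constraints (39)

sv  = alpha > 1e-6;                   % support vectors (alpha_i > 0)
usv = sv & (alpha < C - 1e-6);        % unbounded support vectors (0 < alpha_i < C)
w   = X' * (alpha .* y);              % weight vector, eq. (35)
b   = mean(y(usv) - X(usv,:)*w);      % bias averaged over unbounded SVs, eq. (42)

D = X*w + b;                          % decision values, eq. (40)
predicted = sign(D);                  % classification rule (43)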


    3.1.3. Mapping to a High-Dimensional Space

    Figure 6 Linear inseparability

    Figure 7 Mapping input data

3.1.3.1. Kernel tricks

3.1.3.2. Kernels

Examples of kernels:

1. Linear: $K(\mathbf{x}, \mathbf{x}') = \mathbf{x}^{\top}\mathbf{x}'$

2. Polynomial: $K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^{\top}\mathbf{x}' + 1)^{d}$

3. Radial Basis Function (RBF): $K(\mathbf{x}, \mathbf{x}') = \exp\bigl(-\gamma\,\|\mathbf{x} - \mathbf{x}'\|^{2}\bigr)$
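In practice these kernels are evaluated for all pairs of training inputs at once, giving an M-by-M kernel matrix. A small sketch in plain MATLAB; the data matrix X and the parameters d and gamma below are arbitrary illustrative values.

% Kernel matrices K(i,j) = K(x_i, x_j) for the three kernels listed above.
X = randn(20, 5);            % example data: 20 inputs with 5 variables each
M = size(X, 1);
d = 3;                       % polynomial degree
gamma = 0.5;                 % RBF parameter

K_lin  = X * X';                                            % linear: x_i'*x_j
K_poly = (X * X' + 1).^d;                                   % polynomial: (x_i'*x_j + 1)^d
sq     = sum(X.^2, 2);                                      % squared norms of the inputs
D2     = repmat(sq, 1, M) + repmat(sq', M, 1) - 2*(X*X');   % ||x_i - x_j||^2
K_rbf  = exp(-gamma * D2);                                  % RBF: exp(-gamma*||x_i - x_j||^2)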

3.1.4. L2 Soft-Margin Support Vector Machines

Instead of the linear sum of the slack variables in the objective function, the L2 soft-margin support vector machine uses the square sum of the slack variables (Figure 8). Namely, training is done by minimizing

$$Q(\mathbf{w}, b, \boldsymbol{\xi}) = \frac{1}{2}\,\|\mathbf{w}\|^{2} + \frac{C}{2}\sum_{i=1}^{M}\xi_i^{2} \qquad (44)$$

with respect to $\mathbf{w}$, $b$ and $\boldsymbol{\xi}$, subject to the inequality constraints

$$y_i\bigl(\mathbf{w}^{\top}\mathbf{g}(\mathbf{x}_i) + b\bigr) \ge 1 - \xi_i, \qquad i = 1,\dots,M. \qquad (45)$$

Here, $\mathbf{w}$ is an $l$-dimensional vector, $b$ is the bias, $\mathbf{g}(\mathbf{x})$ is the mapping function that maps the $m$-dimensional vector $\mathbf{x}$ into the $l$-dimensional feature space (discussed in Section 3.1.3, Mapping to a High-Dimensional Space), $\xi_i$ is the slack variable for $\mathbf{x}_i$, and $C$ is the margin parameter.

    Figure 8 L2 Soft-Margin slack variables

Introducing the Lagrange multipliers $\alpha_i$ ($\ge 0$), we obtain


$$Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}) = \frac{1}{2}\,\|\mathbf{w}\|^{2} + \frac{C}{2}\sum_{i=1}^{M}\xi_i^{2} - \sum_{i=1}^{M}\alpha_i\bigl( y_i(\mathbf{w}^{\top}\mathbf{g}(\mathbf{x}_i) + b) - 1 + \xi_i \bigr). \qquad (46)$$

Here, we do not need to introduce the Lagrange multipliers associated with $\xi_i$. As is shown immediately below, $C\xi_i = \alpha_i$ is satisfied for the optimal solution; hence $\xi_i$ is nonnegative so long as $\alpha_i$ is nonnegative.

For the optimal solution the following KKT conditions are satisfied:

$$\frac{\partial Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha})}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{M}\alpha_i y_i\,\mathbf{g}(\mathbf{x}_i) = \mathbf{0}, \qquad (47)$$

$$\frac{\partial Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha})}{\partial \xi_i} = C\xi_i - \alpha_i = 0, \qquad (48)$$

$$\frac{\partial Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha})}{\partial b} = \sum_{i=1}^{M}\alpha_i y_i = 0, \qquad (49)$$

$$\alpha_i\bigl( y_i(\mathbf{w}^{\top}\mathbf{g}(\mathbf{x}_i) + b) - 1 + \xi_i \bigr) = 0, \qquad i = 1,\dots,M. \qquad (50)$$

Equation (50) gives the KKT complementarity conditions; from (47), (48), and (50), the optimal solution must satisfy either $\alpha_i = 0$ or

$$y_i\left( \sum_{j=1}^{M}\alpha_j y_j\Bigl( K(\mathbf{x}_i, \mathbf{x}_j) + \frac{\delta_{ij}}{C} \Bigr) + b \right) - 1 = 0, \qquad (51)$$

where $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{g}^{\top}(\mathbf{x}_i)\,\mathbf{g}(\mathbf{x}_j)$ and $\delta_{ij}$ is Kronecker's delta function, in which $\delta_{ij} = 1$ for $i = j$ and 0 otherwise. Thus the bias term $b$ is calculated for $\alpha_i > 0$ as

$$b = y_i - \sum_{j=1}^{M}\alpha_j y_j\Bigl( K(\mathbf{x}_i, \mathbf{x}_j) + \frac{\delta_{ij}}{C} \Bigr), \qquad (52)$$

which is different from that of the L1 support vector machine. But the decision function is the same:

$$D(\mathbf{x}) = \sum_{i=1}^{M}\alpha_i y_i\,K(\mathbf{x}_i, \mathbf{x}) + b. \qquad (53)$$

Substituting (47) to (49) into (46), we obtain the dual objective function:


$$Q(\boldsymbol{\alpha}) = \sum_{i=1}^{M}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{M}\alpha_i\alpha_j y_i y_j\Bigl( K(\mathbf{x}_i, \mathbf{x}_j) + \frac{\delta_{ij}}{C} \Bigr). \qquad (54)$$

Thus the following dual problem is obtained. Maximize (54) subject to

$$\sum_{i=1}^{M} y_i\alpha_i = 0, \qquad \alpha_i \ge 0, \qquad i = 1,\dots,M. \qquad (55)$$

This is similar to a hard-margin support vector machine. The difference is the addition of $\delta_{ij}/C$ in (54). Therefore, for the L1 support vector machine, if we replace $K(\mathbf{x}_i, \mathbf{x}_j)$ with $K(\mathbf{x}_i, \mathbf{x}_j) + \delta_{ij}/C$ and remove the upper bound given by $C$ for $\alpha_i$, we obtain the L2 support vector machine. But we must notice that when we calculate the decision function in (53) we must not add $\delta_{ij}/C$. Because $1/C$ is added to the diagonal elements of the matrix $K = \{K(\mathbf{x}_i, \mathbf{x}_j)\}$, called the kernel matrix, the resulting matrix becomes positive definite. Thus the associated optimization problem is more computationally stable than that of the L1 support vector machine.
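In code, the relationship just described amounts to adding 1/C to the diagonal of the kernel matrix and dropping the upper bound on the multipliers. A minimal sketch, again assuming quadprog and an already computed kernel matrix K, a column vector of labels y and a margin parameter C (all placeholders):

% L2 soft-margin dual (54)-(55) solved by reusing the hard-margin machinery.
M = length(y);
Ktilde = K + eye(M)/C;                 % kernel matrix with 1/C added to the diagonal
H = (y*y') .* Ktilde;
alpha = quadprog(H, -ones(M,1), [], [], y', 0, zeros(M,1), []);   % only alpha_i >= 0

sv = alpha > 1e-6;                     % support vector indices
i0 = find(sv, 1);                      % any alpha_i > 0 can be used for the bias (52)
b  = y(i0) - (alpha .* y)' * Ktilde(:, i0);
% Decision function (53) uses the plain K, without the 1/C diagonal term.
D  = K(:, sv) * (alpha(sv) .* y(sv)) + b;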

L2 soft-margin support vector machines look similar to hard-margin support vector machines. Actually, letting

$$\hat{\mathbf{w}} = \begin{pmatrix} \mathbf{w} \\ \sqrt{C}\,\boldsymbol{\xi} \end{pmatrix}, \qquad \hat{b} = b, \qquad \hat{\mathbf{g}}(\mathbf{x}_i) = \begin{pmatrix} \mathbf{g}(\mathbf{x}_i) \\ \dfrac{y_i\,\mathbf{e}_i}{\sqrt{C}} \end{pmatrix}, \qquad (56)$$

where $\mathbf{e}_i$ is the $M$-dimensional vector with the $i$th element being 1 and the remaining elements 0, training the L2 support vector machine given by (44) and (45) is converted to the following problem. Minimize

$$\frac{1}{2}\,\|\hat{\mathbf{w}}\|^{2} \qquad (57)$$

subject to

$$y_i\bigl(\hat{\mathbf{w}}^{\top}\hat{\mathbf{g}}(\mathbf{x}_i) + \hat{b}\bigr) \ge 1, \qquad i = 1,\dots,M. \qquad (58)$$

Therefore, the L2 support vector machine is equivalent to a hard-margin support vector machine in the augmented feature space. Because the L2 support vector machine always has a solution, owing to the slack variables, the equivalent hard-margin support vector machine also has a solution. But this only means that the two classes do not overlap in the augmented feature space. Therefore, there may be cases where the classes overlap in the original


feature space, and thus the recognition rate of the training data for the L2 support vector machine is not necessarily 100 percent.
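A short verification of this equivalence, using only the definitions in (56):

$$y_i\bigl(\hat{\mathbf{w}}^{\top}\hat{\mathbf{g}}(\mathbf{x}_i) + \hat{b}\bigr) = y_i\bigl(\mathbf{w}^{\top}\mathbf{g}(\mathbf{x}_i) + b\bigr) + y_i^{2}\,\xi_i = y_i\bigl(\mathbf{w}^{\top}\mathbf{g}(\mathbf{x}_i) + b\bigr) + \xi_i \ge 1,$$

so the hard-margin constraint (58) reproduces (45), while $\frac{1}{2}\|\hat{\mathbf{w}}\|^{2} = \frac{1}{2}\|\mathbf{w}\|^{2} + \frac{C}{2}\sum_{i=1}^{M}\xi_i^{2}$ reproduces the objective (44).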

3.2. Multiclass Support Vector Machines

3.2.1. One-against-all

3.2.2. One-against-one

3.3. Support Vector Regression

The Support Vector method can also be applied to the case of regression, maintaining all

    the main features that characterize the maximal margin algorithm: a non-linear function is

    learned by a linear learning machine in a kernel-induced feature space while the capacity of the

    system is controlled by a parameter that does not depend on the dimensionality of the space.

    As in the classification case the learning algorithm minimizes a convex functional and its

    solution is sparse.
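For concreteness, the band shown in Figure 9 is that of the standard linear ε-insensitive loss used in support vector regression,

$$L_{\varepsilon}\bigl(y, f(\mathbf{x})\bigr) = \max\bigl(0,\;|y - f(\mathbf{x})| - \varepsilon\bigr),$$

which is zero whenever the prediction $f(\mathbf{x})$ lies within $\varepsilon$ of the target $y$; the quadratic variant in Figure 10 squares this quantity.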

Figure 9 The ε-insensitive band for a one-dimensional linear regression problem


Figure 10 The linear (red) and quadratic (green) ε-insensitive loss for zero and non-zero ε

3.4. Advantages and Disadvantages

The advantages of support vector machines over multilayer neural network classifiers are as

    follows:

    1. Maximization of generalization ability. In training a multilayer neural network classifier,

    the sum-of-squares error between outputs and desired training outputs is minimized. Thus, the

class boundaries change as the initial weights change, and so does the generalization ability. Therefore,

    especially when training data are scarce and linearly separable, the generalization ability

    deteriorates considerably. But because a support vector machine is trained to maximize the

    margin, the generalization ability does not deteriorate very much.

    2. No local minima. A multilayer neural network classifier is known to have numerous local

    minima, and there have been extensive discussions on how to avoid a local minimum in

    training. But because a support vector machine is formulated as a quadratic programming

    problem, there is a global optimum solution.

    3. Robustness to outliers. Multilayer neural network classifiers are vulnerable to outliers

    because they use the sum-of-squares errors. Thus to prevent the effect of outliers, outliers

    need to be eliminated before training, or some mechanism for suppressing outliers needs to be

incorporated in training. In support vector machines the margin parameter $C$ controls the misclassification error. If a large value is set for $C$, misclassification is suppressed, and if a small value is set, training data that lie away from the bulk of the data are allowed to be misclassified. Thus, by properly setting the value of $C$, we can suppress the effect of outliers.

    The disadvantages of support vector machines explained so far are as follows.


    1. Extension to multiclass problems. Unlike multilayer neural network classifiers, support

    vector machines use direct decision functions. Thus an extension to multiclass problems is not

    straightforward, and there are several formulations.

    2. Long training time. Because training of a support vector machine is done by solving the

    associated dual problem, the number of variables is equal to the number of training data. Thus

for a large number of training data, solving the dual problem becomes difficult in terms of both memory size and training time.

    3. Selection of parameters. In training a support vector machine, we need to select an

appropriate kernel and its parameters, and then we need to set the value of the margin parameter $C$. Selecting the optimal parameters for a given problem is called model selection. The situation is the same as for neural network classifiers, where we need to set the number of hidden units, the initial values of the weights, and so on. In support vector machines, model selection is done by estimating the generalization ability through repeatedly training support vector machines. But because this is time-consuming, several indices for estimating the

    generalization ability have been proposed.

4. Data-driven methods for fault diagnosis

4.1. Principal component analysis

4.2. Projection algorithms

5. Application to Tennessee Eastman process as a benchmark

    The case involves the production of two products from four reactants. There are 41

    measured process variables and 12 manipulated variables. The process (Figure 11) is composed

    of five main unit operations: an exothermic two-phase reactor, a product condenser, a flash

    separator, a reboiled stripper, and a recycle compressor.

All 20 faults proposed in the original paper (Downs & Vogel, 1993) were considered, in order to face a global and complex diagnosis problem. The main diagnosis difficulties in this problem come from the number of different faults to be managed and their wide diversity. Random

    and step biases as well as soft drifts on process variables must be managed in this complex

    diagnosis case study.


    Figure 11 Case study: Tennessee Eastman process flowsheet

    Besides, some of the considered disturbances cause insignificant biases in variables

because of the adjusted control strategy, which makes it very difficult to find out the source causes of the biases. Therefore, the TE process represents a challenging, realistic diagnosis problem for testing and

    comparing the approach proposed in this work.

Three main limitations were found when reviewing former approaches in order to compare the

    performance of a novel fault diagnosis system:

    1. Reported results are very limited from the quantitative point of view. No common or

    standard quantitative indices have been used for that purpose;

    2. Unclear distinctions between the detection and diagnosis steps reduce the chances

    for systematic comparisons;

    3. Since many simulation runs may be obtained from the same case study (different

    ways of sampling, several initial and final conditions, simulation span, etc.), complete data sets

need to be agreed upon in order to establish a unique basis for comparison;


Classification tools are among the most widespread fault diagnosis systems in the literature. They are based on mapping input data from the process plant to different process states. The classification methodologies have succeeded because of their implementation simplicity, since they do not require process operator experience or first-principles process knowledge. Among these classification methodologies, data-based rule systems and machine learning are the most commonly applied. However, techniques coming from the machine learning area have not been properly exploited to overcome significant limitations of chemical process fault diagnosis. Only neural networks, a technique that has been superseded in other technical fields, have been widely applied as a pattern learning technique.

    Thus, considering fault diagnosis of chemical process as a classification problem would

allow taking advantage of the progress made by the machine learning community throughout the last decades.

5.1. Problem formulation

The fault diagnosis problem could be formulated as a classical classification problem in

    order to extend the concepts successfully applied in machine learning theory. In this sense,

    better generalization capabilities or more accurate and flexible pattern training algorithms

could overcome limitations on fault isolation or multiple-fault management. By adopting this

    classification view, the faults to be managed in a process industry would be different classes in

    the general classification problem. As in any classification problem, it will be crucial to properly

choose meaningful variables or features that properly represent the diagnosis problem, to adopt suitable learning methodologies and to follow rigorous validation procedures to avoid FDS over-

    fitting. Under this view the classical fault detection problem is a binary classification (BC)

    problem, consisting of determining whether the process is in or out of control. The global fault

diagnosis problem is considered as a multi-class classification (MC) problem, in which many

    classes are involved.

In order to solve both kinds of problems, different learning algorithms have been developed in the machine learning area, obtaining very promising results (e.g. Naive Bayes, K-Nearest Neighbours, SVM, AdaBoost.MH, etc.).

    Some of these algorithms employ BC techniques, but can also be applied to solve multi-class

problems when they are adapted by techniques like one-versus-all or constraint classification. In order to deal with the multi-class fault diagnosis problem, the one-versus-all methodology was applied throughout this work. It consists of decomposing the problem into as many binary problems as the original problem has classes; then, one classifier is trained for each class, trying to separate the examples of that class (positives) from the examples of all other classes (negatives). This method generates classifiers between each class and the set of all


    other classes. When classifying a new example, all binary classifiers predict a class and the one

with the highest confidence is selected (winner-take-all strategy).
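A minimal sketch of this winner-take-all step in MATLAB, assuming the decision values of the binary classifiers have already been computed for each test example; the matrix Dvals and its entries are purely illustrative.

% One-versus-all, winner-take-all prediction.
% Dvals(i,k) is the decision value of the classifier trained for class k,
% evaluated on test example i (placeholder values below).
Dvals = [ 1.2  -0.3  -2.0;
         -0.7   0.4  -0.1;
         -1.1  -0.6   0.9];
[confidence, predictedClass] = max(Dvals, [], 2);   % winner-take-all over the classes
% predictedClass(i) is the class whose classifier is most confident about example i.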

    5.2. Support vector machines & Kernel methods on TE

    Figure 12 Evolution of Error when changing arguments on


    5.3. PCA on TE

Figure 13 PCA method for datasets D00, D10 and D12


    Figure 14 Q statistic for fault 12

Figure 15 T² statistic for fault 12


6. List of Figures

Figure 1 Simple example of classification .................................................................................... 9

    Figure 2 Optimal decision function .............................................................................................. 9

    Figure 3 Margin .......................................................................................................................... 10

    Figure 4 Optimal hyperplane ..................................................................................................... 11

    Figure 5 L1 Soft-Margin slack variables ..................................................................................... 15

    Figure 6 Linear inseparability ..................................................................................................... 18

    Figure 7 Mapping input data ...................................................................................................... 18

    Figure 8 L2 Soft-Margin slack variables ..................................................................................... 19

    Figure 9 The insensitive band for a one dimensional linear regression problem ..................... 22

Figure 10 The linear (red) and quadratic (green) ε-insensitive loss for zero and non-zero ε ...... 23
Figure 11 Case study: Tennessee Eastman process flowsheet .................................................. 25

    Figure 12 Evolution of Error when changing arguments on ...................................................... 27

Figure 13 PCA method for datasets D00, D10 and D12 ....................................................... 28

    Figure 14 Q statistic for fault 12 ................................................................................................ 29

Figure 15 T² statistic for fault 12 ................................................................................................ 29

7. Abbreviations

SVM Support Vector Machines

    FDS Fault Diagnosis System

TE Tennessee Eastman
PCA Principal Component Analysis

    PSVM Parallel support vector machines

    FDA Fisher Discriminant Analysis

    BC Binary Classification

    MC Multi Class Classification

    KKT Karush-Kuhn-Tucker

8. Nomenclature

Lowercase bold letters are used to denote vectors and uppercase italic letters to denote matrices. The following list shows the symbols used in this diploma project.

$\mathbf{x}_i$ — training data
$\alpha_i$ — Lagrange multiplier for $\mathbf{x}_i$
$\beta_i$ — Lagrange multiplier for $\xi_i$
$\xi_i$ — slack variable associated with $\mathbf{x}_i$


$C$ — margin parameter
$d$ — degree of a polynomial kernel
$\mathbf{g}(\mathbf{x})$ — mapping function from $\mathbf{x}$ to the feature space
$\gamma$ — parameter for a radial basis function kernel
$K(\mathbf{x}, \mathbf{x}')$ — kernel
$l$ — dimension of the feature space
$M$ — number of training data
$m$ — number of input variables
$n$ — number of classes
$S$ — set of support vector indices
$U$ — set of unbounded support vector indices
$\|\mathbf{a}\|$ — Euclidean norm of vector $\mathbf{a}$
$b$ — bias
$D(\mathbf{x})$ — decision function
$Q$ — cost function
$\mathbf{w}$ — weights

9. References

1. Nello Cristianini, John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000;
2. John Shawe-Taylor, Nello Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004;
3. Support vector machines (SVMs), http://en.wikipedia.org;
4. Ovidiu Ivanciuc, Applications of Support Vector Machines in Chemistry. In: Reviews in Computational Chemistry, Volume 23, Eds.: K. B. Lipkowitz and T. R. Cundari, Wiley-VCH, Weinheim, 2007, pp. 291-400;
5. Shigeo Abe, Support Vector Machines for Pattern Classification, Springer-Verlag London Limited, 2005;
6. Chih-Wei Hsu, Chih-Jen Lin, A Comparison of Methods for Multiclass Support Vector Machines, IEEE Transactions on Neural Networks, vol. 13, no. 2, March 2002;
7. Abhijit Kulkarni, V.K. Jayaraman, B.D. Kulkarni, Knowledge incorporated support vector machines to detect faults in Tennessee Eastman Process, Computers and Chemical Engineering 29, pp. 2128-2133, Elsevier Ltd., 2005;
8. A. Mathur, G. M. Foody, Multiclass and Binary SVM Classification: Implications for Training and Classification Users, IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, April 2008;


9. Shawn Martin, The Numerical Stability of Kernel Methods, Sandia National Laboratories, Albuquerque, NM, USA, December 2005;
10. A. Mauricio Sales Cruz, Tennessee Eastman Plant-wide Industrial Process Challenge Problem. Complete Model, Department of Chemical Engineering, Technical University of Denmark, January 2004;
11. Ignacio Yélamos, Gerard Escudero, Moisés Graells, Luis Puigjaner, Performance assessment of a novel fault diagnosis system based on support vector machines, Computers and Chemical Engineering 33, pp. 244-255, 2009.

10. Annex 1 Code listing

10.1. PCA, T² statistic and Q statistic algorithms

clear; clc; close all;

% Load Tennessee Eastman data sets (fault-free data and two faulty test sets)
load data/d00.dat
load data/d00_te.dat
load data/d10_te.dat
load data/d12_te.dat

X_1 = d10_te;            % test data with fault 10
X_2 = d12_te;            % test data with fault 12
X_3 = d00_te;            % fault-free (normal operation) data

num_row_X_3 = size(X_3,1); num_col_X_3 = size(X_3,2);
num_row_X_2 = size(X_2,1); num_col_X_2 = size(X_2,2);
num_row_X_1 = size(X_1,1); num_col_X_1 = size(X_1,2);

% Mean and standard deviation of the fault-free data, used to scale all sets
Mn_x_3  = mean(X_3);
Std_x_3 = std(X_3);

% Normalization (zero mean, unit variance with respect to the fault-free data)
X_norm_3 = (X_3 - kron(Mn_x_3, ones(num_row_X_3,1))) ./ kron(Std_x_3, ones(num_row_X_3,1));
X_norm_2 = (X_2 - kron(Mn_x_3, ones(num_row_X_2,1))) ./ kron(Std_x_3, ones(num_row_X_2,1));
X_norm_1 = (X_1 - kron(Mn_x_3, ones(num_row_X_1,1))) ./ kron(Std_x_3, ones(num_row_X_1,1));

% Sanity check: mean and standard deviation of the normalized sets
TestMn_x_3 = mean(X_norm_3); TestStd_x_3 = std(X_norm_3);
TestMn_x_2 = mean(X_norm_2); TestStd_x_2 = std(X_norm_2);
TestMn_x_1 = mean(X_norm_1); TestStd_x_1 = std(X_norm_1);

% Covariance matrix of the normalized fault-free data
S_3 = (X_norm_3' * X_norm_3) / (num_row_X_3 - 1);


% Eigendecomposition of the covariance matrix (eigenvalues and eigenvectors)
[V_3, D_3] = eigs(S_3);
a = 2;                                  % number of principal components retained

% Share of variance captured by the first two components (of those returned by eigs)
Represented_variance_2 = (D_3(1,1) + D_3(2,2)) / sum(diag(D_3));

P = V_3(:, 1:a);                        % loading matrix (first a eigenvectors)

% Score matrices (projections onto the principal components)
T_3 = X_norm_3 * P;
T_2 = X_norm_2 * P;
T_1 = X_norm_1 * P;

% 99% control limit for the T^2 statistic (F-distribution based)
T_threshold = a*(num_row_X_3-1)*(num_row_X_3+1) / (num_row_X_3*(num_row_X_3-a)) ...
              * finv(0.99, a, num_row_X_3-a);

% Plot the scores together with the elliptical T^2 control region
b = sqrt(D_3(1,1)*T_threshold);
c = sqrt(D_3(2,2)*T_threshold);
theta = 0:pi/50:2*pi;
t_1 = b*cos(theta);
t_2 = c*sin(theta);
figure;
plot(T_3(:,1), T_3(:,2), 'x', T_2(:,1), T_2(:,2), '*', T_1(:,1), T_1(:,2), 'o', t_1, t_2);
xlabel('t_1'); ylabel('t_2');
legend('Class 3', 'Class 2', 'Class 1', '99% Threshold', 'location', 'northwest');
title('PCA method');

% Calculate the T^2 statistic for the faulty data set X_2
T_UCL_square = ((num_row_X_2^2 - 1)*a*finv(0.99, a, num_row_X_2-a)) / (num_row_X_2*(num_row_X_2-a));
T_Square = zeros(1, num_row_X_2);
Y = T_2 / sqrtm(D_3(1:a, 1:a));         % scale the scores by Lambda_a^(-1/2)
for i = 1:num_row_X_2
    yy = Y(i, 1:a)';
    T_Square(i) = yy' * yy;
end

% Plot the T^2 statistic against its control limit
figure;
plot(1:num_row_X_2, T_Square, 1:num_row_X_2, T_threshold*ones(1, num_row_X_2));
title('T square statistic'); xlabel('time'); ylabel('T2');

% Q (SPE) statistic: residual subspace of the fault-free data
[n, m] = size(X_2);
q1Array = zeros(1, n);
I = eye(m);
X_norm_residual = X_norm_3 - X_norm_3*P*P';
Co_x = X_norm_residual' * X_norm_residual;
[M_res, N_res] = eig(Co_x);             % all residual eigenvalues are needed for the control limit


N_res = N_res / (num_row_X_3 - 1);      % residual covariance estimated from the fault-free set
theta_1 = sum(diag(N_res));
theta_2 = sum(diag(N_res.^2));
theta_3 = sum(diag(N_res.^3));
h_0 = 1 - (2*theta_1*theta_3) / (3*theta_2^2);
part_1 = norminv(0.99) * sqrt(2*theta_2*h_0^2) / theta_1;
part_2 = theta_2*h_0*(h_0-1) / theta_1^2;
% 99% control limit for the squared prediction error (Q / SPE statistic)
SPE_UCL_x = theta_1 * (part_1 + part_2 + 1)^(1/h_0);

% Q statistic for each sample of the faulty data set X_2
for i = 1:n
    xi = X_norm_2(i,:);
    q1Array(i) = xi * (I - P*P') * xi';
end

% Plot the Q statistic against its control limit
figure;
plot(1:n, q1Array, 1:n, SPE_UCL_x*ones(1,n));
xlabel('time'); ylabel('Q'); title('Q statistic');

    10.2. Support vector machines & Kernel methods