Diploma Thesis - June 2009 - draft
8/14/2019 Diploma Thesis - June 2009 - draft
1/34
"Gheorghe Asachi" Technical University of Iași
Faculty of Automatic Control and Computer Engineering
Department of Automatic Control and Applied Informatics
Iași

University of Duisburg-Essen
Faculty of Automatic Control and Complex Systems
Duisburg

2009

Fault diagnosis in linear and nonlinear systems

Diploma thesis

Lucian-Adrian Huluță
Table of Contents

1. Introduction
2. The Learning Methodology
   2.1. Supervised Learning
   2.2. Learning and Generalization
   2.3. Improving Generalization
   2.4. Fault diagnosis as a classification problem
3. Support Vector Machines for fault diagnosis
   3.1. Two-Class Support Vector Machines
      3.1.1. Hard-Margin Support Vector Machines
      3.1.2. L1 Soft-Margin Support Vector Machines
      3.1.3. Mapping to a High-Dimensional Space
         3.1.3.1. Kernel tricks
         3.1.3.2. Kernels
      3.1.4. L2 Soft-Margin Support Vector Machines
   3.2. Multiclass Support Vector Machines
      3.2.1. One-against-all
      3.2.2. One-against-one
   3.3. Support Vector Regression
   3.4. Advantages and Disadvantages
4. Data-driven methods for fault diagnosis
   4.1. Principal component analysis
   4.2. Projection algorithms
5. Application to the Tennessee Eastman process as a benchmark
   5.1. Problem formulation
   5.2. Support vector machines & Kernel methods on TE
   5.3. PCA on TE
6. List of Figures
7. Abbreviations
8. Nomenclature
9. References
10. Annex 1: Code listing
   10.1. PCA, T² statistic and Q statistic algorithms
   10.2. Support vector machines & Kernel methods
1. Introduction

In view of the serious consequences of accidents that may occur in chemical plants under faulty operating conditions, the importance of incipient fault detection and isolation is well recognized. Although catastrophes and disasters due to chemical plant failures may be infrequent, minor incidents are very common, resulting in personal injury, illness, and material losses that cost society large amounts of resources every year.

It is difficult for process operators to detect the existence of a fault by examining time-sequenced data among the hundreds of variables recorded in modern chemical processes. Isolating the fault to discern its root causes is an even more difficult problem, as it requires managing a great amount of information. These supervision difficulties point out the necessity of developing decision-support tools that help process operators manage plant incidents.
Different techniques have been developed to detect and diagnose faults in complex chemical plants, together with computer-aided decision-support tools. In parallel with the growing computational power of the last decades, diverse statistical techniques have been developed to carry out effective process fault detection. The difficulty of developing accurate first-principles models, the highly coupled nature of chemical processes, and the overwhelming amount of stored data have made these statistical techniques one of the most successfully applied tools for process monitoring in the chemical industry.
On the other hand, the fault diagnosis problem, i.e., the isolation and characterization of the detected anomalies, has not been properly solved yet, as it still presents important practical limitations. Three main methodologies have been followed in fault diagnosis of chemical processes:
1. Quantitative model-based methods,
2. Qualitative model-based methods,
3. Process history-based methods.
Several approaches based on the third alternative, such as neural networks or fuzzy systems, have been implemented to overcome the lack of useful quantitative and qualitative process knowledge required by the first two fault diagnosis system (FDS) approaches.
Besides not requiring first-principles or qualitative knowledge, data-based models can deal with the intrinsic nonlinearity of the variables involved in most chemical processes at no additional cost. Even though they are also limited by difficulties in class generalization, multiple
fault diagnosis, or the handling of data uncertainties, they are the most popular techniques applied in industry because of their simplicity and their nonlinearity-processing capabilities.
In any case, none of the previously reported works in the literature guarantees enough confidence and reliability in a general diagnosis problem. Thus, all the existing practical limitations make this area an open field requiring further work.
This work proposes a new classification approach to the fault diagnosis problem using a novel technique recently developed in the machine learning area. This technique is a process history-based methodology that has shown higher performance than other reported pattern classifiers in different technical areas. The proposed approach uses support vector machines (SVM), a kernel-based method from Statistical Learning Theory that has succeeded in many classification problems (e.g., computational linguistics). Recently, it has been applied to fault diagnosis in chemical plants using the Tennessee Eastman (TE) benchmark.
2. The Learning Methodology

The construction of machines capable of learning from experience has for a long time been
the object of both philosophical and technical debate. The technical aspect of the debate has
received an enormous impetus from the advent of electronic computers. They have
demonstrated that machines can display a significant level of learning ability, though the
boundaries of this ability are far from being clearly defined.
The availability of reliable learning systems is of strategic importance, as there are many
tasks that cannot be solved by classical programming techniques, since no mathematical model
of the problem is available. For example, it is not known how to write a computer program to perform hand-written character recognition, though there are plenty of examples available. It is therefore natural to ask whether a computer could be trained to recognize the letter 'A' from examples; after all, this is the way humans learn to read. We will refer to this approach to problem solving as the learning methodology.
The same reasoning applies to the problem of finding genes in a DNA sequence, filtering
email, detecting or recognizing objects in machine vision, and so on. Solving each of these
problems has the potential to revolutionize some aspect of our life, and for each of them
machine learning algorithms could provide the key to its solution.

2.1. Supervised Learning

When computers are applied to solve a practical problem it is usually the case that the
method of deriving the required output from a set of inputs can be described explicitly. The
task of the system designer and eventually the programmer implementing the specifications
will be to translate that method into a sequence of instructions which the computer will follow
to achieve the desired effect.
As computers are applied to solve more complex problems, however, situations can arise in
which there is no known method for computing the desired output from a set of inputs, or where that computation may be very expensive. Examples of this type of situation might be
modeling a complex chemical reaction, where the precise interactions of the different reactants
are not known, or classification of protein types based on the DNA sequence from which they
are generated, or the classification of credit applications into those that will default and those that will repay the loan.
These tasks cannot be solved by a traditional programming approach since the system
designer cannot precisely specify the method by which the correct output can be computed
from the input data. An alternative strategy for solving this type of problem is for the computer
to attempt to learn the input/output functionality from examples, in the same way that
children learn what sports cars are simply by being told which of a large number of cars are sporty rather than by being given a precise specification of sportiness. The approach of using examples to synthesize programs is known as the learning methodology, and in the particular case when the examples are input/output pairs it is called supervised learning. The examples of
input/output functionality are referred to as the training data.
The input/output pairings typically reflect a functional relationship mapping inputs to
outputs, though this is not always the case as for example when the outputs are corrupted by
noise. When an underlying function from inputs to outputs exists it is referred to as the target
function. The estimate of the target function which is learnt or output by the learning algorithm
is known as the solution of the learning problem. In the case of classification this function is sometimes referred to as the decision function. The solution is chosen from a set of candidate
functions which map from the input space to the output domain. Usually we will choose a
particular set or class of candidate functions known as hypotheses before we begin trying to
learn the correct function. For example, so-called decision trees are hypotheses created by
constructing a binary tree with simple decision functions at the internal nodes and output
values at the leaves. Hence, we can view the choice of the set of hypotheses (or hypothesis
space) as one of the key ingredients of the learning strategy. The algorithm which takes the
training data as input and selects a hypothesis from the hypothesis space is the second
important ingredient. It is referred to as the learning algorithm.
In the case of learning to distinguish sports cars the output is a simple yes/no tag which we
can think of as a binary output value. For the problem of recognizing protein types, the output
value will be one of a finite number of categories, while the output values when modeling a
chemical reaction might be the concentrations of the reactants given as real values. A learning
problem with binary outputs is referred to as a binary classification problem, one with a finite
number of categories as multi-class classification, while for real-valued outputs the problem
becomes known as regression.
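The three problem types can be made concrete with a minimal supervised-learning sketch. The code below fits a regression model to hypothetical input/output pairs; the target function, the noise level, and all numbers are assumptions for illustration only.

```python
import numpy as np

# Hypothetical training data: input/output pairs sampled from a noisy
# linear target function y = 2x + 1 (an assumption for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x + 1.0 + 0.05 * rng.normal(size=50)

# The hypothesis space is the set of affine functions h(x) = w*x + b;
# the learning algorithm is least squares, which selects the hypothesis
# minimizing the squared error on the training data.
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

Binary or multi-class classification follows the same pattern, with the outputs restricted to a tag or a finite set of categories instead of real values.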
There are other types of learning, for example unsupervised learning considers the case
where there are no output values and the learning task is to gain some understanding of the
process that generated the data. This type of learning includes density estimation, learning the
support of a distribution, clustering, and so on. There are also models of learning which
consider more complex interactions between a learner and their environment. Perhaps the simplest case is when the learner is allowed to query the environment about the output
associated with a particular input. The study of how this affects the learner's ability to learn
different tasks is known as query learning. Further complexities of interaction are considered in
reinforcement learning, where the learner has a range of actions at their disposal which they
can take to attempt to move towards states where they can expect high rewards. The learning
methodology can play a part in reinforcement learning if we treat the optimal action as the
output of a function of the current state of the learner. There are, however, significant
complications since the quality of the output can only be assessed indirectly as the
consequences of an action become clear.
2.2. Learning and Generalization

We discussed how the quality of an on-line learning algorithm can be assessed in terms of
the number of mistakes it makes during the training phase. It is not immediately clear,
however, how we can assess the quality of a hypothesis generated during batch learning. Early
machine learning algorithms aimed to learn representations of simple symbolic functions that
could be understood and verified by experts. Hence, the goal of learning in this paradigm was
to output a hypothesis that performed the correct classification of the training data and early
learning algorithms were designed to find such an accurate fit to the data. Such a hypothesis is
said to be consistent. There are two problems with the goal of generating a verifiable consistent hypothesis.
The first is that the function we are trying to learn may not have a simple representation
and hence may not be easily verified in this way. An example of this situation is the
identification of genes within a DNA sequence. Certain subsequences are genes and others are
not, but there is no simple way to categorize which are which.
The second problem is that frequently training data are noisy and so there is no guarantee
that there is an underlying function which correctly maps the training data. The example of
credit checking is clearly in this category, since the decision to default may be a result of factors simply not available to the system. A second example would be the classification of web pages
into categories, which again can never be an exact science.
The type of data that is of interest to machine learning practitioners is increasingly of these
two types, hence rendering the proposed measure of quality difficult to implement. There is,
however, a more fundamental problem with this approach: even when we can find a hypothesis that is consistent with the training data, it may not make correct classifications of
unseen data. The ability of a hypothesis to correctly classify data not in the training set is known
as its generalization, and it is this property that we shall aim to optimize.
Shifting our goal to generalization removes the need to view our hypothesis as a correct
representation of the true function. If the hypothesis gives the right output it satisfies the
generalization criterion, which in this sense has now become a functional measure rather than a descriptive one. In this sense the criterion places no constraints on the size or on the 'meaning' of the hypothesis; for the time being these can be considered to be arbitrary.
2.3. Improving Generalization

The generalization criterion places an altogether different constraint on the learning
algorithm. This is most amply illustrated by the extreme case of rote learning. Many classical
algorithms of machine learning are capable of representing any function and for difficult
training sets will give a hypothesis that behaves like a rote learner. By a rote learner we mean
one that correctly classifies the data in the training set, but makes essentially uncorrelated predictions on unseen data. For example, decision trees can grow so large that there is a leaf
for each training example. Hypotheses that become too complex in order to become consistent
are said to overfit. One way of trying to control this difficulty is to restrict the size of the
hypothesis, for example pruning the size of the decision tree.
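The overfitting trade-off described above can be illustrated with a small sketch, under assumed data: a flexible hypothesis class (a high-degree polynomial) fits noisy training points better than a restricted one, which is exactly the behaviour that restricting hypothesis size is meant to control.

```python
import numpy as np

# Sketch of overfitting under stated assumptions: 10 noisy samples of a
# sine target (hypothetical data), fitted by polynomials of degree 1 and 9.
rng = np.random.default_rng(1)
x_train = np.linspace(-1.0, 1.0, 10)
y_train = np.sin(np.pi * x_train) + 0.2 * rng.normal(size=10)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = np.sin(np.pi * x_test)

def mse(coeffs, x, y):
    # Mean squared error of the fitted polynomial on the pairs (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)
complex_ = np.polyfit(x_train, y_train, deg=9)

# The richer hypothesis class always fits the training data at least as well...
print(mse(simple, x_train, y_train) >= mse(complex_, x_train, y_train))  # True
# ...but its behaviour between the training points can be much worse.
print(mse(simple, x_test, y_test), mse(complex_, x_test, y_test))
```

The degree-9 fit here plays the role of the rote learner: near-zero training error, with no guarantee at unseen points.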
The approach that we will adopt is to motivate the trade-off by reference to statistical
bounds on the generalization error. These bounds will typically depend on certain quantities
such as the margin of the classifier, and hence motivate algorithms which optimize the
particular measure. The drawback of such an approach is that the algorithm is only as good as
the result that motivates it. On the other hand the strength is that the statistical result provides
a well-founded basis for the approach, hence avoiding the danger of a heuristic that may be
based on a misleading intuition.
The fact that the algorithm design is based on a statistical result does not mean that we
ignore the computational complexity of solving the particular optimization problem. We are
interested in techniques that will scale from toy problems to large realistic datasets of hundreds
of thousands of examples. It is only by performing a principled analysis of the computational
complexity that we can avoid settling for heuristics that work well on small examples, but break
down once larger training sets are used. The theory of computational complexity identifies two
classes of problems. For the first class there exist algorithms that run in time polynomial in the size of the input, while for the second the existence of such an algorithm would imply that any
problem for which we can check a solution in polynomial time can also be solved in polynomial
time. This second class of problems is known as the NP-complete problems and it is generally
believed that these problems cannot be solved efficiently.
If no restriction is placed over the set of all possible hypotheses (that is all possible
functions from the input space to the output domain), then learning is impossible since no
amount of training data will tell us how to classify unseen examples. Problems also arise if we
allow ourselves the freedom of choosing the set of hypotheses after seeing the data, since we could simply place all of the prior probability on the correct hypothesis. In this sense it is true that all learning systems have to make some prior assumption of a Bayesian type, often called
2.4. Fault diagnosis as a classification problem

Following the previous analysis, the fault diagnosis problem can be formulated as a
classical classification problem in order to extend the concepts successfully applied in machine
learning theory. In this sense, better generalization capabilities or more accurate and flexible
pattern training algorithms could overcome limitations in fault isolation or multiple-fault
management.
By adopting this classification view, the faults to be managed in a process industry would be
different classes in the general classification problem. As in any classification problem, it will be
crucial to choose meaningful variables or features that properly represent the diagnosis problem, to adopt optimal learning methodologies, and to follow rigorous validation procedures to
avoid the FDS over-fitting. Under this view the classical fault detection problem is a binary
classification (BC) problem, consisting of determining whether the process is in or out of
control. The global fault diagnosis problem is considered as a multi-class classification (MC)
problem, in which many classes are involved. To solve both kinds of problems, different learning algorithms have been developed in the machine learning area, obtaining very promising results (e.g., Naive Bayes, K-Nearest Neighbours, SVM, AdaBoost.MH).
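As an illustration of this classification view, the sketch below diagnoses hypothetical two-variable fault data with a nearest-centroid classifier, a deliberately simple stand-in for the learners listed above; the class means, spreads, and labels are invented for the example.

```python
import numpy as np

# Hypothetical fault data: each class shifts the mean of two measured
# process variables. "normal" is in-control operation; the two fault
# classes are assumed signatures, chosen only for illustration.
rng = np.random.default_rng(2)
means = {"normal": (0.0, 0.0), "fault_1": (3.0, 0.0), "fault_2": (0.0, 3.0)}
X_train, y_train = [], []
for label, mu in means.items():
    X_train.append(rng.normal(mu, 0.3, size=(20, 2)))
    y_train += [label] * 20
X_train = np.vstack(X_train)

labels = list(means)
centroids = np.array([X_train[np.array(y_train) == c].mean(axis=0) for c in labels])

def diagnose(x):
    # Multi-class decision: assign the sample to the nearest class centroid.
    d = np.linalg.norm(centroids - np.asarray(x), axis=1)
    return labels[int(np.argmin(d))]

print(diagnose([3.1, -0.2]))  # a sample near the assumed fault_1 signature
```

Restricting the label set to {normal, faulty} recovers the binary fault detection problem as a special case.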
3. Support Vector Machines for fault diagnosis

3.1. Two-Class Support Vector Machines

In training a classifier, we usually try to maximize classification performance for the training data. But if the classifier fits the training data too closely, the classification ability for unknown data, i.e., the generalization ability, is degraded. This phenomenon is called overfitting. Namely, there is a trade-off between the generalization ability and fitting to the training data.
Determination of decision functions using input-output pairs is called training.
Conventional training methods determine the indirect decision functions so that each training
input is correctly classified into the class designated by the associated training output. Figure 1
shows an example of the decision functions obtained when the training data of two classes do
not overlap.
Figure 1 Simple example of classification
Thus there are infinite possibilities of the positions of the decision functions that
correctly classify the training data. Although the generalization ability is directly affected by the
positions, conventional training methods do not consider this.
In a SVM, the direct decision function that maximizes the generalization ability is
determined for a two-class problem. Assuming that the training data of different classes do not
overlap, the decision function is determined so that the distance from the training data is
maximized. We call this the optimal decision function (Figure 2). For a two-class problem, a
support vector machine is trained so that the direct decision function maximizes the
generalization ability.
Figure 2 Optimal decision function
3.1.1. Hard-Margin Support Vector Machines

Let the $m$-dimensional training inputs $x_i$ ($i = 1, \dots, M$) belong to Class 1 or 2, and let the associated labels be $y_i = 1$ for Class 1 and $y_i = -1$ for Class 2. If these data are linearly separable, we can determine the decision function:
$$D(x) = w^T x + b, \qquad (1)$$

where $w$ is an $m$-dimensional vector, $b$ is a bias term, and for $i = 1, \dots, M$

$$w^T x_i + b \;\begin{cases} > 0 & \text{for } y_i = 1, \\ < 0 & \text{for } y_i = -1. \end{cases} \qquad (2)$$

Because the training data are linearly separable, no training data satisfy $w^T x + b = 0$. Thus, to control separability, instead of (2) we consider the following inequalities:

$$w^T x_i + b \;\begin{cases} \ge 1 & \text{for } y_i = 1, \\ \le -1 & \text{for } y_i = -1. \end{cases} \qquad (3)$$

Here, 1 and $-1$ on the right-hand sides of the inequalities can be replaced by a constant $a$ ($> 0$) and $-a$, respectively; dividing both sides of the inequalities by $a$, (3) is obtained. Equation (3) is equivalent to

$$y_i (w^T x_i + b) \ge 1, \quad i = 1, \dots, M. \qquad (4)$$

The hyperplane

$$D(x) = w^T x + b = c, \quad -1 < c < 1, \qquad (5)$$

forms a separating hyperplane that separates $x_i$ ($i = 1, \dots, M$). When $c = 0$, the separating hyperplane is in the middle of the two hyperplanes with $c = 1$ and $c = -1$. The distance between the separating hyperplane and the training datum nearest to the hyperplane is called the margin (Figure 3).
Figure 3 Margin
Assuming that the hyperplanes $D(x) = 1$ and $D(x) = -1$ each include at least one training datum, the hyperplane $D(x) = 0$ has the maximum margin for $-1 < c < 1$. The region $\{x \mid -1 \le D(x) \le 1\}$ is the generalization region for the decision function. The generalization ability depends on
the location of the separating hyperplane, and the hyperplane with the maximum margin is called the optimal separating hyperplane (Figure 4). Because we can obtain the same optimal separating hyperplane even if we delete all the data that satisfy the strict inequalities in (4), the data that satisfy the equalities are called support vectors (Figure 4).
Figure 4 Optimal hyperplane
Assume that no outliers are included in the training data and that unknown test data will obey the same probability law as that of the training data. Then it is intuitively clear that the generalization ability is maximized if the optimal separating hyperplane is selected as the separating hyperplane. Now consider determining the optimal separating hyperplane. The Euclidean distance from a training datum $x$ to the separating hyperplane is given by $|D(x)| / \|w\|$. This can be shown as follows. Because the vector $w$ is orthogonal to the separating hyperplane, the line that goes through $x$ and that is orthogonal to the hyperplane is given by $a\, w / \|w\| + x$, where $|a|$ is the Euclidean distance from $x$ to the hyperplane. It crosses the hyperplane at the point where

$$D(a\, w / \|w\| + x) = 0 \qquad (6)$$

is satisfied. Solving (6) for $a$, we obtain $a = -D(x) / \|w\|$.

Then all the training data must satisfy

$$\frac{y_k D(x_k)}{\|w\|} \ge \delta, \quad k = 1, \dots, M, \qquad (7)$$

where $\delta$ is the margin.

Now if $(w, b)$ is a solution, $(a w, a b)$ is also a solution, where $a$ is a scalar. Thus we impose the following constraint:
$$\delta \|w\| = 1. \qquad (8)$$

From (7) and (8), to find the optimal separating hyperplane, we need to find the $w$ with the minimum Euclidean norm that satisfies (4). Therefore, the optimal separating hyperplane can be obtained by minimizing

$$Q(w) = \frac{1}{2} \|w\|^2 \qquad (9)$$

with respect to $w$ and $b$, subject to the constraints

$$y_i (w^T x_i + b) \ge 1, \quad i = 1, \dots, M. \qquad (10)$$

Here, the square of the Euclidean norm in (9) makes the optimization problem quadratic programming. The assumption of linear separability means that there exist $w$ and $b$ that satisfy (10). We call the solutions that satisfy (10) feasible solutions. Because the optimization problem has a quadratic objective function with inequality constraints, even if the solutions are nonunique, the value of the objective function is unique. Thus nonuniqueness is not a problem for support vector machines. Because we can obtain the same optimal separating hyperplane even if we delete all the data that satisfy the strict inequalities in (10), the data that satisfy the equalities are called support vectors¹ (Figure 2 and Figure 4).
The variables of the convex optimization problem given by (9) and (10) are $w$ and $b$. Thus the number of variables is the number of input variables plus 1: $m + 1$. When the number of input variables is small, we can solve (9) and (10) by the quadratic programming technique. But, as will be discussed later, because we map the input space into a high-dimensional feature space, in some cases with infinite dimensions, we convert (9) and (10) into the equivalent dual problem, whose number of variables is the number of training data.

To do this, we first convert the constrained problem given by (9) and (10) into the unconstrained problem

$$Q(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{M} \alpha_i \left( y_i (w^T x_i + b) - 1 \right), \qquad (11)$$
where $\alpha = (\alpha_1, \dots, \alpha_M)^T$ and $\alpha_i$ are the nonnegative Lagrange multipliers. The optimal solution of (11) is given by the saddle point, where (11) is minimized with respect to $w$ and $b$ and maximized with respect to $\alpha_i$ ($\ge 0$), and it satisfies the following Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{\partial Q(w, b, \alpha)}{\partial w} = 0, \qquad (12)$$

$$\frac{\partial Q(w, b, \alpha)}{\partial b} = 0, \qquad (13)$$

$$\alpha_i \left( y_i (w^T x_i + b) - 1 \right) = 0, \quad i = 1, \dots, M, \qquad (14)$$

$$\alpha_i \ge 0, \quad i = 1, \dots, M. \qquad (15)$$

Using (11), we reduce (12) and (13), respectively, to

$$w = \sum_{i=1}^{M} \alpha_i y_i x_i, \qquad (16)$$

$$\sum_{i=1}^{M} \alpha_i y_i = 0. \qquad (17)$$

Substituting (16) and (17) into (11), we obtain the following dual problem.
¹ This definition is imprecise. Data can satisfy $y_i (w^T x_i + b) = 1$ and yet be deleted without changing the optimal separating hyperplane. Support vectors are defined using the solution of the dual problem, as discussed later.
Maximize

$$Q(\alpha) = \sum_{i=1}^{M} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{M} \alpha_i \alpha_j y_i y_j x_i^T x_j \qquad (18)$$

subject to the constraints

$$\sum_{i=1}^{M} y_i \alpha_i = 0, \quad \alpha_i \ge 0, \quad i = 1, \dots, M. \qquad (19)$$

Because

$$\frac{1}{2} \sum_{i,j=1}^{M} \alpha_i \alpha_j y_i y_j x_i^T x_j = \frac{1}{2} \left( \sum_{i=1}^{M} \alpha_i y_i x_i \right)^T \left( \sum_{i=1}^{M} \alpha_i y_i x_i \right) \ge 0, \qquad (20)$$

maximizing (18) under the constraints (19) is a concave quadratic programming problem. If a solution exists, namely, if the classification problem is linearly separable, the global optimal solution $\alpha_i$, $i = 1, \dots, M$, exists. For quadratic programming, the values of the primal and dual objective functions coincide at the optimal solutions if they exist. This is called the zero duality gap.
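As a numerical illustration, the dual problem (18) subject to (19) can be solved by hand for an assumed two-point data set: with $x_1 = (1, 1)^T$, $y_1 = 1$ and $x_2 = (-1, -1)^T$, $y_2 = -1$, the equality constraint forces $\alpha_1 = \alpha_2 = t$, and the objective reduces to $Q = 2t - 4t^2$, maximized at $t = 1/4$. The sketch below recovers this solution numerically and rebuilds $w$ and $b$ from it; the data are invented for the example.

```python
import numpy as np

# Assumed toy data: two linearly separable training points.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# On the constraint line alpha1 = alpha2 = t (from sum_i y_i alpha_i = 0),
# the dual objective (18) becomes Q(t) = 2t - 4t^2; maximize over t >= 0.
t = np.linspace(0.0, 1.0, 100001)
Q = 2.0 * t - 4.0 * t ** 2
alpha = np.full(2, t[np.argmax(Q)])

w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                        # support vectors (here: both points)
b = float(np.mean(y[sv] - X[sv] @ w))    # bias averaged over support vectors

# alpha1 = alpha2 = 0.25, w = (0.5, 0.5), b = 0
print(np.round(alpha, 3), np.round(w, 3), round(b, 3))
```

The resulting decision function $D(x) = 0.5 x_1 + 0.5 x_2$ indeed puts both training points exactly on the margins $D(x) = \pm 1$.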
Data that are associated with positive $\alpha_i$ are support vectors for Classes 1 and 2. Then, from (16), the decision function is given by

$$D(x) = \sum_{i \in S} \alpha_i y_i x_i^T x + b, \qquad (21)$$

where $S$ is the set of support vector indices and, from the KKT conditions given by (14), $b$ is given by

$$b = y_i - w^T x_i, \qquad (22)$$

where $x_i$ is a support vector. From the standpoint of precision of calculations, it is better to take the average among the support vectors as follows:

$$b = \frac{1}{|S|} \sum_{i \in S} \left( y_i - w^T x_i \right). \qquad (23)$$

Then an unknown datum $x$ is classified into

$$\begin{cases} \text{Class 1} & \text{if } D(x) > 0, \\ \text{Class 2} & \text{if } D(x) < 0. \end{cases} \qquad (24)$$

If $D(x) = 0$, $x$ is on the boundary and thus is unclassifiable. When the training data are
separable, the region $\{x \mid -1 \le D(x) \le 1\}$ is a generalization region.

3.1.2. L1 Soft-Margin Support Vector Machines

In hard-margin support vector machines, we assumed that the training data are linearly separable. When the data are linearly inseparable, there is no feasible solution, and the hard-margin support vector machine is unsolvable. Here we extend the support vector machine so that it is applicable to the inseparable case.

To allow inseparability, we introduce the nonnegative slack variables $\xi_i$ ($\ge 0$) into (4):

$$y_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, \dots, M. \qquad (25)$$

The optimal hyperplane is then obtained by minimizing

$$Q(w, b, \xi) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{M} \xi_i \qquad (26)$$

with respect to $w$, $b$, and $\xi = (\xi_1, \dots, \xi_M)^T$ subject to (25), where $C$ is the margin parameter that determines the trade-off between the maximization of the margin and the minimization of the classification error.
Similar to the linearly separable case, introducing the nonnegative Lagrange multipliers $\alpha_i$ and $\beta_i$, we obtain

$$Q(w, b, \xi, \alpha, \beta) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{M} \xi_i - \sum_{i=1}^{M} \alpha_i \left( y_i (w^T x_i + b) - 1 + \xi_i \right) - \sum_{i=1}^{M} \beta_i \xi_i, \qquad (28)$$

where $\alpha = (\alpha_1, \dots, \alpha_M)^T$ and $\beta = (\beta_1, \dots, \beta_M)^T$.

For the optimal solution, the following KKT conditions are satisfied:

$$\frac{\partial Q(w, b, \xi, \alpha, \beta)}{\partial w} = 0, \qquad (29)$$

$$\frac{\partial Q(w, b, \xi, \alpha, \beta)}{\partial b} = 0, \qquad (30)$$

$$\frac{\partial Q(w, b, \xi, \alpha, \beta)}{\partial \xi} = 0, \qquad (31)$$

$$\alpha_i \left( y_i (w^T x_i + b) - 1 + \xi_i \right) = 0, \quad i = 1, \dots, M, \qquad (32)$$

$$\beta_i \xi_i = 0, \quad i = 1, \dots, M, \qquad (33)$$

$$\alpha_i \ge 0, \quad \beta_i \ge 0, \quad \xi_i \ge 0, \quad i = 1, \dots, M. \qquad (34)$$

Using (28), we reduce (29) to (31), respectively, to

$$w = \sum_{i=1}^{M} \alpha_i y_i x_i, \qquad (35)$$

$$\sum_{i=1}^{M} \alpha_i y_i = 0, \qquad (36)$$

$$\alpha_i + \beta_i = C, \quad i = 1, \dots, M. \qquad (37)$$

Thus, substituting (35) to (37) into (28), we obtain the following dual problem. Maximize
$$Q(\alpha) = \sum_{i=1}^{M} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{M} \alpha_i \alpha_j y_i y_j x_i^T x_j \qquad (38)$$

subject to the constraints

$$\sum_{i=1}^{M} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \dots, M. \qquad (39)$$

The only difference between L1 soft-margin support vector machines and hard-margin support vector machines is that $\alpha_i$ cannot exceed $C$. The conditions (32) and (33) are called KKT (complementarity) conditions.
From these and (37), there are three cases for :1. = 0 then = 0. Thus is correctly classified.2. 0 . Then + 1 + = 0 and then = 0. Therefore + = 1 and is a support vector. Especially we call the support vector with > > 0
an unbounded support vector.
3. = . Then + 1 + = 0 and 0. Thus is a support vector. Wecall the support vector with
=
a bounded support vector. If 0
< 1,
is
correctly classified, and if 1, is misclassified .The decision function is the same as that of the hard-margin support vector machine and is
given by
= + (40)where is the set of support vector indices. Because are nonzero for the support vectors,the summation (40) is added only for the support vectors. For the unbounded
,
= (41)is satisfied. To ensure the precision of calculations, we take the average of that is calculatedfor unbounded support vectors,
= 1( ) (42)where U is the set of unbounded support vector indices.
Then unknown datum x is classified into
1 > 0 1 < 0 (43)If = 0, is on the boundary and thus is unclassifiable. When there are no bounded
support vectors, the region 1 1 is a generalization region, which is the sameas the hard-margin support vector machine.
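The relations (35), (40), and (42) and the three cases for α_i can be made concrete with a small numerical sketch. Rather than running a full quadratic programming solver, the example below plugs in a dual solution computed by hand: for the two-point toy problem x₁ = (1, 1), y₁ = 1 and x₂ = (−1, −1), y₂ = −1, the dual (38)-(39) is maximized at α₁ = α₂ = 0.25. The function (illustrative name, written in Python for compactness) recovers w, b, and D(x) from such a solution and categorizes each α_i:

```python
def l1_svm_from_alphas(X, y, alpha, C):
    """Recover w via (35), b via (42), and the decision function (40)
    from a solved L1 soft-margin dual; categorize each alpha_i."""
    dim = len(X[0])
    # (35): w = sum_i alpha_i * y_i * x_i
    w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(dim)]
    # unbounded support vectors: 0 < alpha_i < C
    unbounded = [i for i, a in enumerate(alpha) if 0.0 < a < C]
    # (42): average b over the unbounded support vectors
    b = sum(y[i] - sum(wk * xk for wk, xk in zip(w, X[i]))
            for i in unbounded) / len(unbounded)

    def D(x):  # (40): decision function
        return sum(wk * xk for wk, xk in zip(w, x)) + b

    def category(i):  # the three cases for alpha_i
        if alpha[i] == 0.0:
            return "not a support vector"
        return "unbounded SV" if alpha[i] < C else "bounded SV"

    return w, b, D, category

# Toy problem: dual optimum alpha = (0.25, 0.25) computed by hand
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1.0, -1.0]
alpha = [0.25, 0.25]
w, b, D, category = l1_svm_from_alphas(X, y, alpha, C=10.0)
# w = (0.5, 0.5), b = 0; both points lie exactly on the margin: y_i * D(x_i) = 1
```

Here both points come out as unbounded support vectors lying exactly on the margin; in practice α would be obtained by a QP solver applied to (38)-(39).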
3.1.3. Mapping to a High-Dimensional Space
Figure 6 Linear inseparability
Figure 7 Mapping input data
3.1.3.1. Kernel tricks
3.1.3.2. Kernels
Examples of kernels:
1. Linear: H(x, x') = xᵀx';
2. Polynomial: H(x, x') = (xᵀx' + 1)^d, where d is the degree of the polynomial;
3. Radial Basis Function (RBF): H(x, x') = exp(−γ ‖x − x'‖²), where γ is a positive parameter.
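These three kernels can be written out directly; a minimal sketch in Python (function names and default parameter values are illustrative):

```python
import math

def linear_kernel(x, xp):
    # H(x, x') = x^T x'
    return sum(a * b for a, b in zip(x, xp))

def poly_kernel(x, xp, d=2):
    # H(x, x') = (x^T x' + 1)^d
    return (linear_kernel(x, xp) + 1.0) ** d

def rbf_kernel(x, xp, gamma=1.0):
    # H(x, x') = exp(-gamma * ||x - x'||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-gamma * sq_dist)

# linear_kernel([1, 2], [3, 4]) -> 11
# poly_kernel([1, 2], [3, 4], d=2) -> 144.0
# rbf_kernel(x, x) -> 1.0 for any x
```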
3.1.4. L2 Soft-Margin Support Vector Machines
Instead of the linear sum of the slack variables in the objective function, the L2 soft-margin support vector machine uses the square sum of the slack variables (Figure 8). Namely, training is done by minimizing

  Q(w, b, ξ) = (1/2) ‖w‖² + (C/2) Σ_{i=1}^{M} ξ_i²   (44)

with respect to w, b, and ξ, subject to the inequality constraints

  y_i (wᵀg(x_i) + b) ≥ 1 − ξ_i,  i = 1, …, M.   (45)

Here, w is the l-dimensional weight vector, b is the bias, g(x) is the mapping function that maps the m-dimensional vector x into the l-dimensional feature space (discussed in Section 3.1.3, Mapping to a High-Dimensional Space), ξ_i is the slack variable for x_i, and C is the margin parameter.
Figure 8 L2 Soft-Margin slack variables
Introducing the Lagrange multipliers α_i (≥ 0), we obtain

  Q(w, b, ξ, α) = (1/2) ‖w‖² + (C/2) Σ_{i=1}^{M} ξ_i² − Σ_{i=1}^{M} α_i (y_i (wᵀg(x_i) + b) − 1 + ξ_i).   (46)

Here, we do not need to introduce the Lagrange multipliers associated with ξ_i. As is shown immediately, ξ_i = α_i / C is satisfied for the optimal solution, and hence ξ_i is nonnegative so long as α_i is nonnegative.

For the optimal solution the following KKT conditions are satisfied:

  ∂Q(w, b, ξ, α)/∂w = w − Σ_{i=1}^{M} α_i y_i g(x_i) = 0,   (47)
  ∂Q(w, b, ξ, α)/∂ξ_i = C ξ_i − α_i = 0,   (48)
  ∂Q(w, b, ξ, α)/∂b = Σ_{i=1}^{M} α_i y_i = 0,   (49)
  α_i (y_i (wᵀg(x_i) + b) − 1 + ξ_i) = 0,  i = 1, …, M.   (50)

Equation (50) gives the KKT complementarity conditions; from (47), (48), and (50), the optimal solution must satisfy either α_i = 0 or

  y_i (Σ_{j=1}^{M} α_j y_j (H(x_i, x_j) + δ_ij / C) + b) − 1 = 0,   (51)

where H(x, x') = g(x)ᵀg(x') and δ_ij is Kronecker's delta function, in which δ_ij = 1 for i = j and 0 otherwise. Thus the bias term is calculated for any α_i > 0 as

  b = y_i − Σ_{j=1}^{M} α_j y_j (H(x_i, x_j) + δ_ij / C),   (52)

which is different from that of the L1 support vector machine. But the decision function is the same:

  D(x) = Σ_{i=1}^{M} α_i y_i H(x, x_i) + b.   (53)

Substituting (47) to (49) into (46), we obtain the dual objective function:
  Q(α) = Σ_{i=1}^{M} α_i − (1/2) Σ_{i,j=1}^{M} α_i α_j y_i y_j (H(x_i, x_j) + δ_ij / C).   (54)

Thus the following dual problem is obtained. Maximize (54) subject to

  Σ_{i=1}^{M} y_i α_i = 0,  α_i ≥ 0,  i = 1, …, M.   (55)

This is similar to a hard-margin support vector machine. The difference is the addition of δ_ij / C in (54). Therefore, starting from the L1 support vector machine, if we replace H(x_i, x_j) with H(x_i, x_j) + δ_ij / C and remove the upper bound C on α_i, we obtain the L2 support vector machine. But we must notice that when we calculate the decision function in (53) we must not add the δ_ij / C term. Because 1/C is added to the diagonal elements of the kernel matrix H = {H(x_i, x_j)}, the resulting matrix becomes positive definite. Thus the associated optimization problem is more computationally stable than that of the L1 support vector machine.
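The effect of the δ_ij / C term can be checked on a tiny example. For a symmetric 2×2 matrix the eigenvalues follow from the trace and the determinant, so a singular (only positive semidefinite) kernel matrix can be seen to become positive definite after adding 1/C to its diagonal. A minimal numerical sketch (Python, illustrative values):

```python
def eig2x2_sym(m):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]]
    via the trace/determinant formula."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    tr, det = a + c, a * c - b * b
    disc = (tr * tr / 4.0 - det) ** 0.5
    return tr / 2.0 - disc, tr / 2.0 + disc

C = 10.0
# A PSD but singular kernel matrix: eigenvalues 0 and 2
K = [[1.0, 1.0], [1.0, 1.0]]
# Add 1/C to the diagonal, as in the L2 dual (54)
K_l2 = [[K[i][j] + (1.0 / C if i == j else 0.0) for j in range(2)]
        for i in range(2)]

lo, hi = eig2x2_sym(K_l2)
# every eigenvalue is shifted up by 1/C: lo = 0.1, hi = 2.1 -> positive definite
```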
L2 soft-margin support vector machines look similar to hard-margin support vector machines. Actually, letting

  ŵ = (wᵀ, √C ξᵀ)ᵀ,  ĝ(x_i) = (g(x_i)ᵀ, y_i e_iᵀ / √C)ᵀ,   (56)

where e_i is the M-dimensional vector with the ith element being 1 and the remaining elements 0, training the L2 support vector machine given by (44) and (45) is converted to the following problem. Minimize

  (1/2) ‖ŵ‖²   (57)

subject to

  y_i (ŵᵀĝ(x_i) + b) ≥ 1,  i = 1, …, M.   (58)

Therefore, the L2 support vector machine is equivalent to the hard-margin support vector machine with the augmented feature space. Because the L2 support vector machine always has a solution thanks to the slack variables, the equivalent hard-margin support vector machine also has a solution. But this only means that the solution is non-overlapping in the augmented feature space. There may still be cases where the solution overlaps in the original feature space, and thus the recognition rate of the training data for the L2 support vector machine is not 100 percent.
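The equivalence can be checked numerically. The sketch below assumes the identity mapping g(x) = x and the augmentation ŵ = (w, √C ξ), ĝ(x_i) = (g(x_i), y_i e_i / √C) as read from (56) (all names and values are illustrative), and verifies that (57) reproduces (44) and (58) reproduces (45):

```python
import math

C = 4.0
X = [(1.0, 0.5), (-0.5, -1.0)]   # g(x) = x (identity mapping, for illustration)
y = [1.0, -1.0]
w = [0.8, 0.6]                   # arbitrary candidate weight vector
b = 0.1                          # arbitrary bias
xi = [0.3, 0.2]                  # arbitrary slack variables
M = len(X)
rootC = math.sqrt(C)

# Augmented vectors per (56)
w_hat = list(w) + [rootC * s for s in xi]

def g_hat(i):
    e_i = [1.0 if j == i else 0.0 for j in range(M)]
    return list(X[i]) + [y[i] * e / rootC for e in e_i]

# (57) equals (44): (1/2)||w_hat||^2 == (1/2)||w||^2 + (C/2) * sum xi_i^2
lhs = 0.5 * sum(v * v for v in w_hat)
rhs = 0.5 * sum(v * v for v in w) + (C / 2.0) * sum(s * s for s in xi)

# (58) equals (45): y_i (w_hat . g_hat(x_i) + b) == y_i (w . x_i + b) + xi_i
margins_hat = [y[i] * (sum(a * g for a, g in zip(w_hat, g_hat(i))) + b)
               for i in range(M)]
margins_l2 = [y[i] * (sum(a * x for a, x in zip(w, X[i])) + b) + xi[i]
              for i in range(M)]
```

Both sides agree for any choice of w, b, and ξ, which is exactly why (57)-(58) and (44)-(45) define the same training problem.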
3.2. Multiclass Support Vector Machines
3.2.1. One-against-all
3.2.2. One-against-one
3.3. Support Vector Regression
The Support Vector method can also be applied to the case of regression, maintaining all
the main features that characterize the maximal margin algorithm: a non-linear function is
learned by a linear learning machine in a kernel-induced feature space while the capacity of the
system is controlled by a parameter that does not depend on the dimensionality of the space.
As in the classification case the learning algorithm minimizes a convex functional and its
solution is sparse.
Figure 9 The insensitive band for a one dimensional linear regression problem
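The two losses of Figure 10 can be stated compactly: inside the band of half-width ε the loss is zero, and outside it grows linearly or quadratically with the excursion. A small sketch (Python, names illustrative):

```python
def eps_insensitive(r, eps):
    # linear eps-insensitive loss: 0 inside the band, |r| - eps outside
    return max(0.0, abs(r) - eps)

def eps_insensitive_sq(r, eps):
    # quadratic eps-insensitive loss: square of the excursion outside the band
    return eps_insensitive(r, eps) ** 2

# Inside the band (|r| <= eps) both losses are zero; outside it, the
# quadratic loss penalizes large residuals more heavily than the linear one.
```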
Figure 10 The linear (red) and quadratic (green) ε-insensitive loss for zero and non-zero ε

3.4. Advantages and Disadvantages
The advantages of support vector machines over multilayer neural network classifiers are as
follows:
1. Maximization of generalization ability. In training a multilayer neural network classifier,
the sum-of-squares error between outputs and desired training outputs is minimized. Thus, the
class boundaries change as the initial weights change. So does the generalization ability. Thus,
especially when training data are scarce and linearly separable, the generalization ability
deteriorates considerably. But because a support vector machine is trained to maximize the
margin, the generalization ability does not deteriorate very much.
2. No local minima. A multilayer neural network classifier is known to have numerous local
minima, and there have been extensive discussions on how to avoid a local minimum in
training. But because a support vector machine is formulated as a quadratic programming
problem, there is a global optimum solution.
3. Robustness to outliers. Multilayer neural network classifiers are vulnerable to outliers
because they use the sum-of-squares errors. Thus to prevent the effect of outliers, outliers
need to be eliminated before training, or some mechanism for suppressing outliers needs to be
incorporated in training. In support vector machines, the margin parameter C controls the misclassification error. If a large value is set to C, misclassification is suppressed, and if a small value is set, training data that lie far away from the gathered data are allowed to be misclassified. Thus, by properly setting the value of C, we can suppress the effect of outliers.
The disadvantages of support vector machines are as follows.
1. Extension to multiclass problems. Unlike multilayer neural network classifiers, support
vector machines use direct decision functions. Thus an extension to multiclass problems is not
straightforward, and there are several formulations.
2. Long training time. Because training of a support vector machine is done by solving the
associated dual problem, the number of variables is equal to the number of training data. Thus
for a large number of training data, solving the dual problem becomes difficult from both the
memory size and the training time.
3. Selection of parameters. In training a support vector machine, we need to select an appropriate kernel and its parameters, and then we need to set the value of the margin parameter C. Selecting the optimal parameters for a given problem is called model selection. This is the same situation as for neural network classifiers, where we need to set the number of hidden units, the initial values of the weights, and so on. In support vector machines, model selection is done by estimating the generalization ability through repeatedly training support vector machines. But because this is time-consuming, several indices for estimating the generalization ability have been proposed.
4. Data-driven methods for fault diagnosis
4.1. Principal component analysis
4.2. Projection algorithms
5. Application to the Tennessee Eastman process as a benchmark
The case involves the production of two products from four reactants. There are 41
measured process variables and 12 manipulated variables. The process (Figure 11) is composed
of five main unit operations: an exothermic two-phase reactor, a product condenser, a flash
separator, a reboiled stripper, and a recycle compressor.
All 20 faults proposed in the original paper (Downs & Vogel, 1993) were considered, in order to face a global and complex diagnosis problem. The main diagnosis difficulties come from the number of different faults to be managed and their wide diversity: random and step biases as well as soft drifts on process variables must all be handled in this complex diagnosis case study.
Figure 11 Case study: Tennessee Eastman process flowsheet
Besides, some of the considered disturbances cause only insignificant biases in the measured variables because of the adjusted control strategy, which makes it very difficult to find out the source causes of the biases. Therefore, the process represents a challenging, realistic diagnosis problem for testing and comparing the approach proposed in this work.
Three main limitations were found when reviewing former approaches to compare the performance of a novel fault diagnosis system:
1. Reported results are very limited from the quantitative point of view. No common or
standard quantitative indices have been used for that purpose;
2. Unclear distinctions between the detection and diagnosis steps reduce the chances
for systematic comparisons;
3. Since many simulation runs may be obtained from the same case study (different
ways of sampling, several initial and final conditions, simulation span, etc.), complete data sets
are required to be agreed for establishing a unique comparative basis;
Classification tools are among the most widespread fault diagnosis systems in the literature. They are based on mapping input data from the process plant to different process states. The classification methodologies have succeeded because of their implementation simplicity: they do not require process operators' experience or first-principles process knowledge. Among these classification methodologies, data-based rule systems and machine learning are the most commonly applied. However, techniques coming from the machine learning area have not been properly exploited to overcome significant limitations of chemical process fault diagnosis. Only neural networks, a technique superseded in other technical fields, have been widely applied as a pattern learning technique.
Thus, considering fault diagnosis of chemical processes as a classification problem would allow taking advantage of the progress made by the learning community throughout the last decades.
5.1. Problem formulation
The fault diagnosis problem can be formulated as a classical classification problem in
order to extend the concepts successfully applied in machine learning theory. In this sense,
better generalization capabilities or more accurate and flexible pattern training algorithms
could overcome limitations on faults isolation or multiple fault management. By adopting this
classification view, the faults to be managed in a process industry would be different classes in
the general classification problem. As in any classification problem, it will be crucial to properly
choose meaningful variables or features to properly represent the diagnosis problem, adopt
optimal learning methodologies, and follow rigorous validation procedures to avoid over-fitting of the FDS. Under this view, the classical fault detection problem is a binary classification (BC)
problem, consisting of determining whether the process is in or out of control. The global fault
diagnosis problem is considered as a multi-class classification (MC) problem, so that many
classes are involved.
To solve both kinds of problems, different learning algorithms have been developed in the machine learning area, obtaining very promising results (e.g., Naive Bayes, K-Nearest Neighbours, SVM, AdaBoost.MH).
Some of these algorithms employ BC techniques but can also be applied to solve multi-class problems when adapted by techniques like one-versus-all or constraint classification. In order to deal with the multi-class, or fault diagnosis, problem, the one-versus-all methodology was applied throughout this work. It consists of decomposing the problem into as many binary problems as the original problem has classes; then, one classifier is trained for each class, trying to separate the examples of that class (positives) from the examples of all other classes (negatives). This method generates classifiers between each class and the set of all
other classes. When classifying a new example, all binary classifiers give their predictions, and the class with the highest confidence is selected (winner-take-all strategy).
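The one-versus-all decomposition with the winner-take-all rule can be sketched as follows. To keep the example self-contained, the binary scorers are simple nearest-centroid discriminants rather than trained support vector machines; the class labels, data, and scoring rule are purely illustrative:

```python
def centroid(points):
    n, dim = len(points), len(points[0])
    return [sum(p[k] for p in points) / n for k in range(dim)]

def train_one_vs_all(X, labels):
    """One binary scorer per class: examples of the class (positives)
    against the pooled examples of all other classes (negatives)."""
    scorers = {}
    for c in set(labels):
        pos = [x for x, l in zip(X, labels) if l == c]
        neg = [x for x, l in zip(X, labels) if l != c]
        cp, cn = centroid(pos), centroid(neg)
        # signed confidence: squared distance to the negative centroid
        # minus squared distance to the positive centroid
        def score(x, cp=cp, cn=cn):
            dp = sum((a - b) ** 2 for a, b in zip(x, cp))
            dn = sum((a - b) ** 2 for a, b in zip(x, cn))
            return dn - dp
        scorers[c] = score
    return scorers

def classify(scorers, x):
    # winner-take-all: the class whose binary scorer is most confident
    return max(scorers, key=lambda c: scorers[c](x))

# Illustrative data: two faults plus normal operation
X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (11, 0)]
labels = ["normal", "normal", "fault1", "fault1", "fault2", "fault2"]
scorers = train_one_vs_all(X, labels)
# classify(scorers, (0.2, 0.5)) -> "normal"
```

With SVMs, each `score` would instead be the decision-function value D(x) of the binary machine trained for that class.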
5.2. Support vector machines & Kernel methods on TE
Figure 12 Evolution of Error when changing arguments on
5.3. PCA on TE
Figure 13 PCA Method for datasets D00, D10 and D12
Figure 14 Q statistic for fault 12
Figure 15 T² statistic for fault 12
6. List of Figures
Figure 1 Simple example of classification .................................................................................... 9
Figure 2 Optimal decision function .............................................................................................. 9
Figure 3 Margin .......................................................................................................................... 10
Figure 4 Optimal hyperplane ..................................................................................................... 11
Figure 5 L1 Soft-Margin slack variables ..................................................................................... 15
Figure 6 Linear inseparability ..................................................................................................... 18
Figure 7 Mapping input data ...................................................................................................... 18
Figure 8 L2 Soft-Margin slack variables ..................................................................................... 19
Figure 9 The insensitive band for a one dimensional linear regression problem ..................... 22
Figure 10 The linear (red) and quadratic (green) ε-insensitive loss for zero and non-zero ε ... 23
Figure 11 Case study: Tennessee Eastman process flowsheet .................................................. 25
Figure 12 Evolution of Error when changing arguments on ...................................................... 27
Figure 13 PCA Method for datasets D00, D10 and D12 ............................................................. 28
Figure 14 Q statistic for fault 12 ................................................................................................ 29
Figure 15 T² statistic for fault 12 ................................................................................................ 29
7. Abbreviations
SVM Support Vector Machines
FDS Fault Diagnosis System
TE Tennessee Eastman
PCA Principal Component Analysis
PSVM Parallel Support Vector Machines
FDA Fisher Discriminant Analysis
BC Binary Classification
MC Multi-Class Classification
KKT Karush-Kuhn-Tucker
8. Nomenclature
Lowercase letters are used to denote vectors and uppercase italic letters to denote matrices. The following list shows the symbols used in this diploma project.

x_i      training data
α_i      Lagrange multiplier for x_i
β_i      Lagrange multiplier for ξ_i
ξ_i      slack variable associated with x_i
C        margin parameter
d        degree of a polynomial kernel
g()      mapping function from x to the feature space
γ        parameter for a radial basis function kernel
H(x, x') kernel
l        dimension of the feature space
M        number of training data
m        number of input variables
n        number of classes
S        set of support vector indices
U        set of unbounded support vector indices
‖a‖      Euclidean norm of vector a
b        bias
D(x)     decision function
Q        cost function
w        weights
9. References
1. Nello Cristianini, John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000;
2. John Shawe-Taylor, Nello Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004;
3. Support vector machines (SVMs), http://en.wikipedia.org;
4. Ovidiu Ivanciuc, Applications of Support Vector Machines in Chemistry. In: Reviews in Computational Chemistry, Volume 23, Eds.: K. B. Lipkowitz and T. R. Cundari. Wiley-VCH, Weinheim, 2007, pp. 291-400;
5. Shigeo Abe, Support Vector Machines for Pattern Classification, Springer-Verlag London Limited, 2005;
6. Chih-Wei Hsu, Chih-Jen Lin, A Comparison of Methods for Multiclass Support Vector Machines, IEEE Transactions on Neural Networks, vol. 13, no. 2, March 2002;
7. Abhijit Kulkarni, V. K. Jayaraman, B. D. Kulkarni, Knowledge incorporated support vector machines to detect faults in Tennessee Eastman Process, Computers and Chemical Engineering 29, pp. 2128-2133, Elsevier Ltd., 2005;
8. A. Mathur, G. M. Foody, Multiclass and Binary SVM Classification: Implications for Training and Classification Users, IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, April 2008;
9. Shawn Martin, The Numerical Stability of Kernel Methods, Sandia National Laboratories, Albuquerque, NM, USA, December 2005;
10. A. Mauricio Sales Cruz, Tennessee Eastman Plant-wide Industrial Process. Challenge Problem. Complete Model, Department of Chemical Engineering, Technical University of Denmark, January 2004;
11. Ignacio Yélamos, Gerard Escudero, Moisès Graells, Luis Puigjaner, Performance assessment of a novel fault diagnosis system based on support vector machines, Computers and Chemical Engineering 33, pp. 244-255, 2009;
10. Annex 1 Code listing
10.1. PCA, T² statistic and Q statistic algorithms

clear; clc; close all;

% Load the Tennessee Eastman data sets
load data/d00.dat
load data/d00_te.dat
load data/d10_te.dat
load data/d12_te.dat

X_1 = d10_te;                  % test data with fault 10
X_2 = d12_te;                  % test data with fault 12 (with faults)
X_3 = d00_te;                  % normal operation test data (without faults)

num_row_X_3 = size(X_3,1); num_col_X_3 = size(X_3,2);
num_row_X_2 = size(X_2,1); num_col_X_2 = size(X_2,2);
num_row_X_1 = size(X_1,1); num_col_X_1 = size(X_1,2);

% Mean and standard deviation of the normal operation data
Mn_x_3  = mean(X_3);
Std_x_3 = std(X_3);

% Normalization: all data sets are scaled with the normal operation statistics
X_norm_3 = (X_3 - kron(Mn_x_3,ones(num_row_X_3,1))) ./ kron(Std_x_3,ones(num_row_X_3,1));
X_norm_2 = (X_2 - kron(Mn_x_3,ones(num_row_X_2,1))) ./ kron(Std_x_3,ones(num_row_X_2,1));
X_norm_1 = (X_1 - kron(Mn_x_3,ones(num_row_X_1,1))) ./ kron(Std_x_3,ones(num_row_X_1,1));

TestMn_x_3 = mean(X_norm_3); TestStd_x_3 = std(X_norm_3);
TestMn_x_2 = mean(X_norm_2); TestStd_x_2 = std(X_norm_2);
TestMn_x_1 = mean(X_norm_1); TestStd_x_1 = std(X_norm_1);

% Covariance matrix of the normalized normal operation data
S_3 = (X_norm_3'*X_norm_3)/(num_row_X_3-1);

% Eigendecomposition: eigenvalues and eigenvectors
[V_3,D_3] = eigs(S_3);
a = 2;                         % number of principal components
Represented_variance_2 = (D_3(1,1)+D_3(2,2))/sum(diag(D_3));

P = V_3(:,1:a);                % loading matrix

% Score matrices
T_3 = X_norm_3*P;
T_2 = X_norm_2*P;
T_1 = X_norm_1*P;

% T^2 threshold (99% confidence limit)
T_threshold = a*(num_row_X_3-1)*(num_row_X_3+1)/(num_row_X_3*(num_row_X_3-a)) ...
              *finv(0.99,a,(num_row_X_3-a));

% Plot the PCA scores together with the threshold ellipse
b = sqrt(D_3(1,1)*T_threshold);
c = sqrt(D_3(2,2)*T_threshold);
theta = 0:pi/50:2*pi;
t_1 = b*cos(theta);
t_2 = c*sin(theta);
figure;
plot(T_3(:,1),T_3(:,2),'x', T_2(:,1),T_2(:,2),'*', T_1(:,1),T_1(:,2),'o', t_1,t_2);
xlabel('t_1'); ylabel('t_2');
legend('Class 3','Class 2','Class 1','99% Threshold','Location','NorthWest');
title('PCA method');

% Calculate the T^2 statistic for the faulty data set
T_UCL_square = (((num_row_X_2)^2-1)*a*finv(0.99,a,num_row_X_2-a)) ...
               /(num_row_X_2*(num_row_X_2-a));
T_Square = [];
Y = T_2*diag(diag(D_3(1:a,1:a)))^-.5;   % scale the scores by the retained eigenvalues
for i = 1:num_row_X_2
    yy = Y(i,1:a)';
    T_Square(i) = yy'*yy;
end

% Plot the T^2 statistic with its threshold
figure
plot(1:num_row_X_2,T_Square, [1 num_row_X_2],[T_threshold T_threshold]);
title('T square statistic'); xlabel('time'); ylabel('T^2');

% Calculate the Q statistic (squared prediction error, SPE)
q1Array = [];
[n,m] = size(X_2);
I = eye(m);
X_norm_residual = X_norm_3 - X_norm_3*P*P';   % residual part of the normal data
Co_x = X_norm_residual'*X_norm_residual;
[M,N] = eigs(Co_x);
N = N/(num_row_X_2-1);
theta_1 = sum(diag(N));
theta_2 = sum(diag(N.^2));
theta_3 = sum(diag(N.^3));
h_0 = 1 - 2*theta_1*theta_3/(3*theta_2^2);
part_1 = norminv(0.99)*sqrt(2*theta_2*h_0^2)/theta_1;
part_2 = theta_2*h_0*(h_0-1)/theta_1^2;
% 99% control limit for the squared prediction error
SPE_UCL_x = theta_1*(part_1 + part_2 + 1)^(1/h_0);

% Plot the Q statistic with its control limit
for i = 1:n
    xi = X_norm_2(i,:);
    q1Array(i) = xi*(I - P*P')*xi';
end
figure
plot(1:n,q1Array, [1 n],[SPE_UCL_x SPE_UCL_x]);
xlabel('time'); ylabel('Q'); title('Q statistic');
10.2. Support vector machines & Kernel methods