
    "Gheorghe Asachi" Tehnical UniversityFaculty of Automatic Control and Computer

    Engineering

    Departament of Automatic Control and Applied

    Informatics

    Iai

    University Duisburg Essen

    Faculty Automatic Control and Complex Sciences

    Duisburg

    2009

    Fault diagnosis in linear andnonlinear systemsDiploma thesis

    Lucian-Adrian Hulu


    Table of Contents

    1. Introduction ......................................................................................................................................... 3

    2. The Learning Methodology .................................................................................................................. 4

    2.1. Supervised Learning ...................................................................................................................... 4

    2.2. Learning and Generalization ......................................................................................................... 6

    2.3. Improving Generalization ............................................................................................................. 7

    2.4. Fault diagnosis as a classification problem ................................................................................... 8

    3. Support Vector Machines for fault diagnosis ....................................................................................... 8

    3.1. Two-Class Support Vector Machines ........................................................................................ 8

    3.1.1. Hard-Margin Support Vector Machines ................................................................................ 9

    3.1.2. L1 Soft-Margin Support Vector Machines ........................................................................... 14

    3.1.3. Mapping to a High-Dimensional Space ............................................................................... 18

    3.1.3.1. Kernel tricks .................................................................................................... 18

    3.1.3.2. Kernels ............................................................................................................. 18

    3.1.4. L2 Soft-Margin Support Vector Machines ........................................................................... 19

    3.2. Multiclass Support Vector Machines ...................................................................................... 22

    3.2.1. One-against-all .................................................................................................................... 22

    3.2.2. One-against-one .................................................................................................................. 22

    3.3. Support Vector Regression ..................................................................................................... 22

    3.4. Advantages and Disadvantages .............................................................................................. 23

    4. Data-driven methods for fault diagnosis ........................................................................................... 24

    4.1. Principal component analysis ..................................................................................................... 24

    4.2. Projection algorithms .................................................................................................................. 24

    5. Application to Tennessee Eastman process as a benchmark ............................................................ 24

    5.1. Problem formulation ................................................................................................................... 26

    5.2. Support vector machines & Kernel methods on TE .................................................................... 27

    5.3. PCA on TE .................................................................................................................................... 28

    6. List of Figures ..................................................................................................................................... 30

    7. Abbreviations ..................................................................................................................................... 30

    8. Nomenclature .................................................................................................................................... 30

    9. References .......................................................................................................................................... 31

    10. Annex 1 Code listing ..................................................................................................................... 32

    10.1. PCA, T² statistic and Q statistic algorithms .......................................................................... 32

    10.2. Support vector machines & Kernel methods .......................................................................... 34


1. Introduction

Given the serious consequences of accidents that may occur in chemical plants operated under faulty conditions, the importance of incipient fault detection and isolation is well recognized. Although catastrophes and disasters due to chemical plant failures may be infrequent, minor incidents are very common, resulting in personal injury, illness, and material losses that cost society large amounts of resources every year.

It is difficult for process operators to detect the existence of a fault by examining time-sequenced data among the hundreds of variables recorded in modern chemical processes. Isolating the fault in order to discern its root causes is an even more difficult problem, as it requires managing a great amount of information. These supervision difficulties point out the need to develop decision-support tools that help process operators manage plant incidents.

Different techniques have been developed to detect and diagnose faults in complex chemical plants, together with computer-aided decision-support tools. In parallel with the growth of computational power over the last decades, diverse statistical techniques have been developed to carry out effective process fault detection. The difficulty of developing accurate first-principles models, the highly coupled nature of chemical processes and the overwhelming amount of stored data have made these statistical techniques one of the most successfully applied tools for process monitoring in the chemical industry.

On the other hand, the fault diagnosis problem, i.e. the isolation and characterization of the detected anomalies, has not been properly solved yet, as it still presents important practical limitations. Three main methodologies have been followed in fault diagnosis of chemical processes:

1. Quantitative model-based methods,
2. Qualitative model-based methods,
3. Process history-based methods.

Several approaches based on the third alternative, such as neural networks or fuzzy systems, have been implemented to overcome the lack of the quantitative and qualitative process knowledge required by the first two FDS approaches.

Besides not requiring first principles or qualitative knowledge, data-based models can deal with the intrinsic non-linearity of the variables involved in most chemical processes at no additional cost. Even though they are also limited by difficulties in class generalization, multiple-


fault diagnosis and the handling of data uncertainties, they are the most popular techniques applied in industry because of their simplicity and their ability to handle non-linearity.

In any case, none of the works previously reported in the literature guarantees sufficient confidence and reliability in a general diagnosis problem. All the existing practical limitations therefore make this an open field requiring further work.

This work proposes a new classification approach to the fault diagnosis problem using a technique recently developed in the machine learning area. This technique is a process history-based methodology that has shown higher performance than other reported pattern classifiers in different technical areas. The proposed approach uses support vector machines (SVM), a kernel-based method from statistical learning theory that has succeeded in many classification problems (e.g. computational linguistics). Recently, it has been applied to fault diagnosis in chemical plants using the Tennessee Eastman (TE) benchmark.

2. The Learning Methodology

The construction of machines capable of learning from experience has for a long time been

    the object of both philosophical and technical debate. The technical aspect of the debate has

    received an enormous impetus from the advent of electronic computers. They have

    demonstrated that machines can display a significant level of learning ability, though the

    boundaries of this ability are far from being clearly defined.

    The availability of reliable learning systems is of strategic importance, as there are many

    tasks that cannot be solved by classical programming techniques, since no mathematical model

    of the problem is available. So for example it is not known how to write a computer program to

    perform hand-written character recognition, though there are plenty of examples available. It is

therefore natural to ask whether a computer could be trained to recognize the letter 'A' from examples; after all, this is the way humans learn to read. We will refer to this approach to problem solving as the learning methodology.

    The same reasoning applies to the problem of finding genes in a DNA sequence, filtering

    email, detecting or recognizing objects in machine vision, and so on. Solving each of these

    problems has the potential to revolutionize some aspect of our life, and for each of them

machine learning algorithms could provide the key to its solution.

2.1. Supervised Learning

When computers are applied to solve a practical problem it is usually the case that the

    method of deriving the required output from a set of inputs can be described explicitly. The


    task of the system designer and eventually the programmer implementing the specifications

    will be to translate that method into a sequence of instructions which the computer will follow

    to achieve the desired effect.

    As computers are applied to solve more complex problems, however, situations can arise in

which there is no known method for computing the desired output from a set of inputs, or where that computation may be very expensive. Examples of this type of situation might be

    modeling a complex chemical reaction, where the precise interactions of the different reactants

    are not known, or classification of protein types based on the DNA sequence from which they

    are generated, or the classification of credit applications into those who will default and those

    who will repay the loan.

    These tasks cannot be solved by a traditional programming approach since the system

    designer cannot precisely specify the method by which the correct output can be computed

    from the input data. An alternative strategy for solving this type of problem is for the computer

    to attempt to learn the input/output functionality from examples, in the same way that

children learn which cars are sports cars, simply by being told which of a large number of cars are sporty, rather than by being given a precise specification of sportiness. The approach of using examples to synthesize programs is known as the learning methodology, and in the particular

    case when the examples are input/output pairs it is called supervised learning. The examples of

    input/output functionality are referred to as the training data.

    The input/output pairings typically reflect a functional relationship mapping inputs to

    outputs, though this is not always the case as for example when the outputs are corrupted by

    noise. When an underlying function from inputs to outputs exists it is referred to as the target

    function. The estimate of the target function which is learnt or output by the learning algorithm

is known as the solution of the learning problem. In the case of classification this function is sometimes referred to as the decision function. The solution is chosen from a set of candidate

    functions which map from the input space to the output domain. Usually we will choose a

    particular set or class of candidate functions known as hypotheses before we begin trying to

    learn the correct function. For example, so-called decision trees are hypotheses created by

    constructing a binary tree with simple decision functions at the internal nodes and output

    values at the leaves. Hence, we can view the choice of the set of hypotheses (or hypothesis

    space) as one of the key ingredients of the learning strategy. The algorithm which takes the

    training data as input and selects a hypothesis from the hypothesis space is the second

    important ingredient. It is referred to as the learning algorithm.

    In the case of learning to distinguish sports cars the output is a simple yes/no tag which we

    can think of as a binary output value. For the problem of recognizing protein types, the output

    value will be one of a finite number of categories, while the output values when modeling a

    chemical reaction might be the concentrations of the reactants given as real values. A learning

    problem with binary outputs is referred to as a binary classification problem, one with a finite

    number of categories as multi-class classification, while for real-valued outputs the problem

    becomes known as regression.


    There are other types of learning, for example unsupervised learning considers the case

    where there are no output values and the learning task is to gain some understanding of the

    process that generated the data. This type of learning includes density estimation, learning the

    support of a distribution, clustering, and so on. There are also models of learning which

consider more complex interactions between a learner and their environment. Perhaps the simplest case is when the learner is allowed to query the environment about the output

    associated with a particular input. The study of how this affects the learner's ability to learn

    different tasks is known as query learning. Further complexities of interaction are considered in

    reinforcement learning, where the learner has a range of actions at their disposal which they

    can take to attempt to move towards states where they can expect high rewards. The learning

    methodology can play a part in reinforcement learning if we treat the optimal action as the

    output of a function of the current state of the learner. There are, however, significant

    complications since the quality of the output can only be assessed indirectly as the

    consequences of an action become clear.

2.2. Learning and Generalization

The quality of an on-line learning algorithm can be assessed in terms of

    the number of mistakes it makes during the training phase. It is not immediately clear,

    however, how we can assess the quality of a hypothesis generated during batch learning. Early

    machine learning algorithms aimed to learn representations of simple symbolic functions that

    could be understood and verified by experts. Hence, the goal of learning in this paradigm was

    to output a hypothesis that performed the correct classification of the training data and early

    learning algorithms were designed to find such an accurate fit to the data. Such a hypothesis is

said to be consistent. There are two problems with the goal of generating a verifiable consistent hypothesis.

    The first is that the function we are trying to learn may not have a simple representation

    and hence may not be easily verified in this way. An example of this situation is the

    identification of genes within a DNA sequence. Certain subsequences are genes and others are

    not, but there is no simple way to categorize which are which.

    The second problem is that frequently training data are noisy and so there is no guarantee

    that there is an underlying function which correctly maps the training data. The example of

credit checking is clearly in this category, since the decision to default may be a result of factors simply not available to the system. A second example would be the classification of web pages

    into categories, which again can never be an exact science.

    The type of data that is of interest to machine learning practitioners is increasingly of these

    two types, hence rendering the proposed measure of quality difficult to implement. There is,

    however, a more fundamental problem with this approach in that even when we can find a

hypothesis that is consistent with the training data, it may not make correct classifications of


    unseen data. The ability of a hypothesis to correctly classify data not in the training set is known

    as its generalization, and it is this property that we shall aim to optimize.

    Shifting our goal to generalization removes the need to view our hypothesis as a correct

    representation of the true function. If the hypothesis gives the right output it satisfies the

generalization criterion, which in this sense has now become a functional measure rather than a descriptional one. In this sense the criterion places no constraints on the size or on the 'meaning' of the hypothesis; for the time being these can be considered to be arbitrary.

2.3. Improving Generalization

The generalization criterion places an altogether different constraint on the learning

    algorithm. This is most amply illustrated by the extreme case of rote learning. Many classical

    algorithms of machine learning are capable of representing any function and for difficult

    training sets will give a hypothesis that behaves like a rote learner. By a rote learner we mean

one that correctly classifies the data in the training set, but makes essentially uncorrelated predictions on unseen data. For example, decision trees can grow so large that there is a leaf

    for each training example. Hypotheses that become too complex in order to become consistent

    are said to overfit. One way of trying to control this difficulty is to restrict the size of the

    hypothesis, for example pruning the size of the decision tree.

    The approach that we will adopt is to motivate the trade-off by reference to statistical

    bounds on the generalization error. These bounds will typically depend on certain quantities

    such as the margin of the classifier, and hence motivate algorithms which optimize the

    particular measure. The drawback of such an approach is that the algorithm is only as good as

    the result that motivates it. On the other hand the strength is that the statistical result provides

    a well-founded basis for the approach, hence avoiding the danger of a heuristic that may be

    based on a misleading intuition.

    The fact that the algorithm design is based on a statistical result does not mean that we

    ignore the computational complexity of solving the particular optimization problem. We are

    interested in techniques that will scale from toy problems to large realistic datasets of hundreds

    of thousands of examples. It is only by performing a principled analysis of the computational

    complexity that we can avoid settling for heuristics that work well on small examples, but break

    down once larger training sets are used. The theory of computational complexity identifies two

classes of problems. For the first class there exist algorithms that run in time polynomial in the size of the input, while for the second the existence of such an algorithm would imply that any

    problem for which we can check a solution in polynomial time can also be solved in polynomial

    time. This second class of problems is known as the NP-complete problems and it is generally

    believed that these problems cannot be solved efficiently.

    If no restriction is placed over the set of all possible hypotheses (that is all possible

    functions from the input space to the output domain), then learning is impossible since no


    amount of training data will tell us how to classify unseen examples. Problems also arise if we

    allow ourselves the freedom of choosing the set of hypotheses after seeing the data, since we

can simply place all of the prior probability on the correct hypothesis. In this sense it is true that all learning systems have to make some prior assumption of a Bayesian type, often called

    the learning bias.

2.4. Fault diagnosis as a classification problem

Following the previous analysis, the fault diagnosis problem could be formulated as a

    classical classification problem in order to extend the concepts successfully applied in machine

    learning theory. In this sense, better generalization capabilities or more accurate and flexible

pattern training algorithms could overcome limitations on fault isolation or multiple-fault

    management.

    By adopting this classification view, the faults to be managed in a process industry would be

    different classes in the general classification problem. As in any classification problem, it will be

crucial to choose meaningful variables or features that properly represent the diagnosis problem, to adopt suitable learning methodologies and to follow rigorous validation procedures to

    avoid the FDS over-fitting. Under this view the classical fault detection problem is a binary

    classification (BC) problem, consisting of determining whether the process is in or out of

    control. The global fault diagnosis problem is considered as a multi-class classification (MC)

problem, in which many classes are involved. In order to solve both kinds of problems, different learning algorithms have been developed in the machine learning area, obtaining very promising results (e.g. Naive Bayes, K-Nearest Neighbours, SVM, AdaBoost.MH, etc.).
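A minimal illustration of this binary view, using the TE data files that also appear in the annex code (the variable names X and y are illustrative only): the fault-free test set forms the "in control" class and a faulty test set the "out of control" class.

% Fault detection cast as binary classification:
% class -1 = normal operation (d00_te), class +1 = faulty operation (here fault 10).
load data/d00_te.dat
load data/d10_te.dat
X = [d00_te; d10_te];                                    % feature matrix of recorded process variables
y = [-ones(size(d00_te,1),1); ones(size(d10_te,1),1)];   % class labels for the two process states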

3. Support Vector Machines for fault diagnosis

3.1. Two-Class Support Vector Machines

In training a classifier, we usually try to maximize the classification performance for the training data. But if the classifier fits the training data too closely, the classification ability for unknown data, i.e. the generalization ability, is degraded. This phenomenon is called overfitting. Namely, there is a trade-off between the generalization ability and the fit to the training data.

    Determination of decision functions using input-output pairs is called training.

    Conventional training methods determine the indirect decision functions so that each training

    input is correctly classified into the class designated by the associated training output. Figure 1

    shows an example of the decision functions obtained when the training data of two classes do

    not overlap.


    Figure 1 Simple example of classification

    Thus there are infinite possibilities of the positions of the decision functions that

    correctly classify the training data. Although the generalization ability is directly affected by the

    positions, conventional training methods do not consider this.

    In a SVM, the direct decision function that maximizes the generalization ability is

    determined for a two-class problem. Assuming that the training data of different classes do not

    overlap, the decision function is determined so that the distance from the training data is

    maximized. We call this the optimal decision function (Figure 2). For a two-class problem, a

    support vector machine is trained so that the direct decision function maximizes the

    generalization ability.

    Figure 2 Optimal decision function

3.1.1. Hard-Margin Support Vector Machines

Let the $m$-dimensional training inputs $\mathbf{x}_i$ ($i = 1,\dots,M$) belong to Class 1 or Class 2 and the associated labels be $y_i = 1$ for Class 1 and $y_i = -1$ for Class 2. If these data are linearly separable, we can determine the decision function:


$$D(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b, \qquad (1)$$

where $\mathbf{w}$ is an $m$-dimensional vector, $b$ is a bias term, and for $i = 1,\dots,M$

$$\mathbf{w}^{\top}\mathbf{x}_i + b \;\begin{cases} > 0 & \text{for } y_i = 1, \\ < 0 & \text{for } y_i = -1. \end{cases} \qquad (2)$$

Because the training data are linearly separable, no training data satisfy $\mathbf{w}^{\top}\mathbf{x} + b = 0$. Thus, to control separability, instead of (2) we consider the following inequalities:

$$\mathbf{w}^{\top}\mathbf{x}_i + b \;\begin{cases} \ge 1 & \text{for } y_i = 1, \\ \le -1 & \text{for } y_i = -1. \end{cases} \qquad (3)$$

Here, 1 and $-1$ on the right-hand sides of the inequalities can be a constant $a$ ($> 0$) and $-a$, respectively; but by dividing both sides of the inequalities by $a$, (3) is obtained. Equation (3) is equivalent to

$$y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,M. \qquad (4)$$

The hyperplane

$$D(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b = c, \qquad -1 < c < 1, \qquad (5)$$

forms a separating hyperplane that separates $\mathbf{x}_i$ ($i = 1,\dots,M$). When $c = 0$, the separating hyperplane is in the middle of the two hyperplanes with $c = 1$ and $c = -1$. The distance between the separating hyperplane and the training datum nearest to the hyperplane is called the margin (Figure 3).

Figure 3 Margin

Assuming that the hyperplanes $D(\mathbf{x}) = 1$ and $D(\mathbf{x}) = -1$ include at least one training datum each, the hyperplane $D(\mathbf{x}) = 0$ has the maximum margin for $-1 < c < 1$. The region $\{\mathbf{x} \mid -1 \le D(\mathbf{x}) \le 1\}$ is the generalization region for the decision function. The generalization ability depends on


the location of the separating hyperplane, and the hyperplane with the maximum margin is called the optimal separating hyperplane (Figure 4). Because we can obtain the same optimal separating hyperplane even if we delete all the data that satisfy the strict inequalities in (4), the data that satisfy the equalities are called support vectors (Figure 4).

    Figure 4 Optimal hyperplane

Assume that no outliers are included in the training data and that unknown test data will obey the same probability law as that of the training data. Then it is intuitively clear that the generalization ability is maximized if the optimal separating hyperplane is selected as the separating hyperplane. Now consider determining the optimal separating hyperplane. The Euclidean distance from a training datum $\mathbf{x}$ to the separating hyperplane is given by $|D(\mathbf{x})| / \|\mathbf{w}\|$. This can be shown as follows. Because the vector $\mathbf{w}$ is orthogonal to the separating hyperplane, the line that goes through $\mathbf{x}$ and that is orthogonal to the hyperplane is given by $a\,\mathbf{w}/\|\mathbf{w}\| + \mathbf{x}$, where $|a|$ is the Euclidean distance from $\mathbf{x}$ to the hyperplane. It crosses the hyperplane at the point where

$$D\!\left(\frac{a\,\mathbf{w}}{\|\mathbf{w}\|} + \mathbf{x}\right) = 0 \qquad (6)$$

is satisfied. Solving (6) for $a$, we obtain $a = -D(\mathbf{x})/\|\mathbf{w}\|$.

Then all the training data must satisfy

$$\frac{y_k\,D(\mathbf{x}_k)}{\|\mathbf{w}\|} \ge \delta, \qquad k = 1,\dots,M, \qquad (7)$$

where $\delta$ is the margin.

Now if $(\mathbf{w}, b)$ is a solution, $(a\mathbf{w}, ab)$ is also a solution, where $a$ is a scalar. Thus we impose the following constraint:


$$\delta\,\|\mathbf{w}\| = 1. \qquad (8)$$

From (7) and (8), to find the optimal separating hyperplane, we need to find the $\mathbf{w}$ with the minimum Euclidean norm that satisfies (4).

Therefore, the optimal separating hyperplane can be obtained by minimizing

$$Q(\mathbf{w}) = \frac{1}{2}\,\|\mathbf{w}\|^{2} \qquad (9)$$

with respect to $\mathbf{w}$ and $b$, subject to the constraints

$$y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,M. \qquad (10)$$

Here, the square of the Euclidean norm in (9) is taken to make the optimization problem a quadratic program. The assumption of linear separability means that there exist $\mathbf{w}$ and $b$ that satisfy (10). We call the solutions that satisfy (10) feasible solutions. Because the optimization problem has a quadratic objective function with inequality constraints, even if the solutions are nonunique, the value of the objective function is unique. Thus nonuniqueness is not a problem for support vector machines. Because we can obtain the same optimal separating hyperplane even if we delete all the data that satisfy the strict inequalities in (10), the data that satisfy the equalities are called support vectors¹ (Figure 2 and Figure 4).

The variables of the convex optimization problem given by (9) and (10) are $\mathbf{w}$ and $b$. Thus the number of variables is the number of input variables plus 1: $m + 1$. When the number of input variables is small, we can solve (9) and (10) by the quadratic programming technique. But, as will be discussed later, because we map the input space into a high-dimensional feature space, in some cases with infinite dimensions, we convert (9) and (10) into the equivalent dual problem, whose number of variables is the number of training data.

To do this, we first convert the constrained problem given by (9) and (10) into the unconstrained problem

$$Q(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\,\mathbf{w}^{\top}\mathbf{w} - \sum_{i=1}^{M}\alpha_i\bigl( y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) - 1 \bigr), \qquad (11)$$

where $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_M)^{\top}$ and $\alpha_i$ are the nonnegative Lagrange multipliers. The optimal solution of (11) is given by the saddle point, where (11) is minimized with respect to $\mathbf{w}$ and $b$ and maximized with respect to $\alpha_i$ ($\ge 0$), and it satisfies the following Karush-Kuhn-Tucker (KKT) conditions:

¹ This definition is imprecise. Data can satisfy $y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) = 1$ and yet be deleted without changing the optimal separating hyperplane. Support vectors are defined using the solution of the dual problem, as discussed later.


Because

$$\frac{1}{2}\sum_{i,j=1}^{M}\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^{\top}\mathbf{x}_j = \frac{1}{2}\left(\sum_{i=1}^{M}\alpha_i y_i\mathbf{x}_i\right)^{\!\top}\left(\sum_{j=1}^{M}\alpha_j y_j\mathbf{x}_j\right) \ge 0 \qquad (20)$$

holds, maximizing (18) under the constraints (19) is a concave quadratic programming problem. If a solution exists, namely, if the classification problem is linearly separable, the global optimal solution $\alpha_i$ ($i = 1,\dots,M$) exists. For quadratic programming, the values of the primal and dual objective functions coincide at the optimal solutions if they exist. This is called the zero duality gap.
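For reference, the hard-margin dual problem denoted by (18)-(19) has the standard form; it differs from the L1 soft-margin dual (38)-(39) given later only in the absence of the upper bound $C$ on the multipliers. Maximize

$$Q(\boldsymbol{\alpha}) = \sum_{i=1}^{M}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{M}\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^{\top}\mathbf{x}_j \qquad (18)$$

subject to the constraints

$$\sum_{i=1}^{M} y_i\alpha_i = 0, \qquad \alpha_i \ge 0, \qquad i = 1,\dots,M. \qquad (19)$$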

Data that are associated with positive $\alpha_i$ are support vectors for Classes 1 and 2. Then from (16) the decision function is given by

$$D(\mathbf{x}) = \sum_{i \in S}\alpha_i y_i\,\mathbf{x}_i^{\top}\mathbf{x} + b, \qquad (21)$$

where $S$ is the set of support vector indices, and, from the KKT conditions given by (14), $b$ is given by

$$b = y_i - \mathbf{w}^{\top}\mathbf{x}_i, \qquad (22)$$

where $\mathbf{x}_i$ is a support vector. From the standpoint of precision of calculations, it is better to take the average among the support vectors as follows:

$$b = \frac{1}{|S|}\sum_{i \in S}\bigl( y_i - \mathbf{w}^{\top}\mathbf{x}_i \bigr). \qquad (23)$$

Then an unknown datum $\mathbf{x}$ is classified into

$$\mathbf{x} \in \begin{cases} \text{Class 1} & \text{if } D(\mathbf{x}) > 0, \\ \text{Class 2} & \text{if } D(\mathbf{x}) < 0. \end{cases} \qquad (24)$$

If $D(\mathbf{x}) = 0$, $\mathbf{x}$ is on the boundary and thus is unclassifiable. When the training data are separable, the region $\{\mathbf{x} \mid -1 \le D(\mathbf{x}) \le 1\}$ is a generalization region.

3.1.2. L1 Soft-Margin Support Vector Machines

In hard-margin support vector machines, we assumed that the training data are linearly

    separable. When the data are linearly inseparable, there is no feasible solution, and the hard-

    margin support vector machine is unsolvable. Here we extend the support vector machine so

    that it is applicable to an inseparable case.

To allow inseparability, we introduce the nonnegative slack variables $\xi_i$ ($\ge 0$) into (4), requiring only $y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1 - \xi_i$ for $i = 1,\dots,M$, and minimize $\frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{M}\xi_i$, where $C$ is the margin parameter.


Similar to the linearly separable case, introducing the nonnegative Lagrange multipliers $\alpha_i$ and $\beta_i$, we obtain

$$Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\,\|\mathbf{w}\|^{2} + C\sum_{i=1}^{M}\xi_i - \sum_{i=1}^{M}\alpha_i\bigl( y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) - 1 + \xi_i \bigr) - \sum_{i=1}^{M}\beta_i\xi_i, \qquad (28)$$

where $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_M)^{\top}$ and $\boldsymbol{\beta} = (\beta_1,\dots,\beta_M)^{\top}$. For the optimal solution, the following KKT conditions are satisfied:

$$\frac{\partial Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta})}{\partial \mathbf{w}} = \mathbf{0}, \qquad (29)$$

$$\frac{\partial Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta})}{\partial b} = 0, \qquad (30)$$

$$\frac{\partial Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta})}{\partial \boldsymbol{\xi}} = \mathbf{0}, \qquad (31)$$

$$\alpha_i\bigl( y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) - 1 + \xi_i \bigr) = 0, \qquad i = 1,\dots,M, \qquad (32)$$

$$\beta_i\,\xi_i = 0, \qquad i = 1,\dots,M, \qquad (33)$$

$$\alpha_i \ge 0, \quad \beta_i \ge 0, \quad \xi_i \ge 0, \qquad i = 1,\dots,M. \qquad (34)$$

Using (28), we reduce (29) to (31), respectively, to

$$\mathbf{w} = \sum_{i=1}^{M}\alpha_i y_i\mathbf{x}_i, \qquad (35)$$

$$\sum_{i=1}^{M}\alpha_i y_i = 0, \qquad (36)$$

$$\alpha_i + \beta_i = C, \qquad i = 1,\dots,M. \qquad (37)$$

Thus, substituting (35) to (37) into (28), we obtain the following dual problem. Maximize

$$Q(\boldsymbol{\alpha}) = \sum_{i=1}^{M}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{M}\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^{\top}\mathbf{x}_j \qquad (38)$$

subject to the constraints


$$\sum_{i=1}^{M} y_i\alpha_i = 0, \qquad 0 \le \alpha_i \le C, \qquad i = 1,\dots,M. \qquad (39)$$

The only difference between L1 soft-margin support vector machines and hard-margin support vector machines is that $\alpha_i$ cannot exceed $C$. Especially, (32) and (33) are called the KKT (complementarity) conditions. From these and (37), there are three cases for $\alpha_i$:

1. $\alpha_i = 0$. Then $\xi_i = 0$ and $\mathbf{x}_i$ is correctly classified.

2. $0 < \alpha_i < C$. Then $y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) - 1 + \xi_i = 0$ and $\xi_i = 0$. Therefore $y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) = 1$ and $\mathbf{x}_i$ is a support vector. Especially, we call a support vector with $0 < \alpha_i < C$ an unbounded support vector.

3. $\alpha_i = C$. Then $y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) - 1 + \xi_i = 0$ and $\xi_i \ge 0$. Thus $\mathbf{x}_i$ is a support vector. We call a support vector with $\alpha_i = C$ a bounded support vector. If $0 \le \xi_i < 1$, $\mathbf{x}_i$ is correctly classified, and if $\xi_i \ge 1$, $\mathbf{x}_i$ is misclassified.

The decision function is the same as that of the hard-margin support vector machine and is given by

$$D(\mathbf{x}) = \sum_{i \in S}\alpha_i y_i\,\mathbf{x}_i^{\top}\mathbf{x} + b, \qquad (40)$$

where $S$ is the set of support vector indices. Because $\alpha_i$ are nonzero for the support vectors, the summation in (40) is taken only over the support vectors. For the unbounded support vectors,

$$b = y_i - \mathbf{w}^{\top}\mathbf{x}_i \qquad (41)$$

is satisfied. To ensure the precision of calculations, we take the average of $b$ calculated for the unbounded support vectors,

$$b = \frac{1}{|U|}\sum_{i \in U}\bigl( y_i - \mathbf{w}^{\top}\mathbf{x}_i \bigr), \qquad (42)$$

where $U$ is the set of unbounded support vector indices.

Then an unknown datum $\mathbf{x}$ is classified into

$$\mathbf{x} \in \begin{cases} \text{Class 1} & \text{if } D(\mathbf{x}) > 0, \\ \text{Class 2} & \text{if } D(\mathbf{x}) < 0. \end{cases} \qquad (43)$$

If $D(\mathbf{x}) = 0$, $\mathbf{x}$ is on the boundary and thus is unclassifiable. When there are no bounded support vectors, the region $\{\mathbf{x} \mid -1 \le D(\mathbf{x}) \le 1\}$ is a generalization region, which is the same as for the hard-margin support vector machine.
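To make the preceding development concrete, the following sketch trains a linear L1 soft-margin support vector machine by solving the dual (38)-(39) with MATLAB's quadprog (Optimization Toolbox). The toy data, the value of C and the tolerance 1e-6 are illustrative choices only, not values used elsewhere in this work.

% Linear L1 soft-margin SVM trained via the dual (38)-(39) using quadprog.
X = [1 1; 2 2; 2 0; 0 0; 1 0; 0 1];   % toy training inputs (one row per datum)
y = [1; 1; 1; -1; -1; -1];            % class labels (+1 / -1)
C = 10;                               % margin parameter
M = size(X, 1);

H = (y*y') .* (X*X');                 % H(i,j) = y_i y_j x_i' x_j
f = -ones(M, 1);                      % quadprog minimizes 1/2 a'Ha + f'a, i.e. -Q(a)
alpha = quadprog(H, f, [], [], y', 0, zeros(M,1), C*ones(M,1));   % constraints (39)

sv  = alpha > 1e-6;                   % support vectors (alpha_i > 0)
usv = sv & (alpha < C - 1e-6);        % unbounded support vectors (0 < alpha_i < C)
w   = X' * (alpha .* y);              % weight vector, eq. (35)
b   = mean(y(usv) - X(usv,:)*w);      % bias averaged over unbounded SVs, eq. (42)

D = X*w + b;                          % decision values, eq. (40)
predicted = sign(D);                  % classification rule (43)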


    3.1.3. Mapping to a High-Dimensional Space

    Figure 6 Linear inseparability

    Figure 7 Mapping input data

3.1.3.1. Kernel tricks

3.1.3.2. Kernels

Examples of kernels:

1. Linear: $K(\mathbf{x}, \mathbf{x}') = \mathbf{x}^{\top}\mathbf{x}'$

2. Polynomial: $K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^{\top}\mathbf{x}' + 1)^{d}$

3. Radial Basis Function (RBF): $K(\mathbf{x}, \mathbf{x}') = \exp\bigl(-\gamma\,\|\mathbf{x} - \mathbf{x}'\|^{2}\bigr)$
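In practice these kernels are evaluated for all pairs of training inputs at once, giving an M-by-M kernel matrix. A small sketch in plain MATLAB; the data matrix X and the parameters d and gamma below are arbitrary illustrative values.

% Kernel matrices K(i,j) = K(x_i, x_j) for the three kernels listed above.
X = randn(20, 5);            % example data: 20 inputs with 5 variables each
M = size(X, 1);
d = 3;                       % polynomial degree
gamma = 0.5;                 % RBF parameter

K_lin  = X * X';                                            % linear: x_i'*x_j
K_poly = (X * X' + 1).^d;                                   % polynomial: (x_i'*x_j + 1)^d
sq     = sum(X.^2, 2);                                      % squared norms of the inputs
D2     = repmat(sq, 1, M) + repmat(sq', M, 1) - 2*(X*X');   % ||x_i - x_j||^2
K_rbf  = exp(-gamma * D2);                                  % RBF: exp(-gamma*||x_i - x_j||^2)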

3.1.4. L2 Soft-Margin Support Vector Machines

Instead of the linear sum of the slack variables in the objective function, the L2 soft-margin support vector machine uses the square sum of the slack variables (Figure 8). Namely, training is done by minimizing

$$Q(\mathbf{w}, b, \boldsymbol{\xi}) = \frac{1}{2}\,\|\mathbf{w}\|^{2} + \frac{C}{2}\sum_{i=1}^{M}\xi_i^{2} \qquad (44)$$

with respect to $\mathbf{w}$, $b$ and $\boldsymbol{\xi}$, subject to the inequality constraints

$$y_i\bigl(\mathbf{w}^{\top}\mathbf{g}(\mathbf{x}_i) + b\bigr) \ge 1 - \xi_i, \qquad i = 1,\dots,M. \qquad (45)$$

Here, $\mathbf{w}$ is an $l$-dimensional vector, $b$ is the bias, $\mathbf{g}(\mathbf{x})$ is the mapping function that maps the $m$-dimensional vector $\mathbf{x}$ into the $l$-dimensional feature space (discussed in Section 3.1.3, Mapping to a High-Dimensional Space), $\xi_i$ is the slack variable for $\mathbf{x}_i$, and $C$ is the margin parameter.

    Figure 8 L2 Soft-Margin slack variables

Introducing the Lagrange multipliers $\alpha_i$ ($\ge 0$), we obtain


$$Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}) = \frac{1}{2}\,\|\mathbf{w}\|^{2} + \frac{C}{2}\sum_{i=1}^{M}\xi_i^{2} - \sum_{i=1}^{M}\alpha_i\bigl( y_i(\mathbf{w}^{\top}\mathbf{g}(\mathbf{x}_i) + b) - 1 + \xi_i \bigr). \qquad (46)$$

Here, we do not need to introduce the Lagrange multipliers associated with $\xi_i$. As is shown immediately below, $C\xi_i = \alpha_i$ is satisfied for the optimal solution; hence $\xi_i$ is nonnegative so long as $\alpha_i$ is nonnegative.

For the optimal solution the following KKT conditions are satisfied:

$$\frac{\partial Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha})}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{M}\alpha_i y_i\,\mathbf{g}(\mathbf{x}_i) = \mathbf{0}, \qquad (47)$$

$$\frac{\partial Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha})}{\partial \xi_i} = C\xi_i - \alpha_i = 0, \qquad (48)$$

$$\frac{\partial Q(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha})}{\partial b} = \sum_{i=1}^{M}\alpha_i y_i = 0, \qquad (49)$$

$$\alpha_i\bigl( y_i(\mathbf{w}^{\top}\mathbf{g}(\mathbf{x}_i) + b) - 1 + \xi_i \bigr) = 0, \qquad i = 1,\dots,M. \qquad (50)$$

Equation (50) gives the KKT complementarity conditions; from (47), (48), and (50), the optimal solution must satisfy either $\alpha_i = 0$ or

$$y_i\left( \sum_{j=1}^{M}\alpha_j y_j\Bigl( K(\mathbf{x}_i, \mathbf{x}_j) + \frac{\delta_{ij}}{C} \Bigr) + b \right) - 1 = 0, \qquad (51)$$

where $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{g}^{\top}(\mathbf{x}_i)\,\mathbf{g}(\mathbf{x}_j)$ and $\delta_{ij}$ is Kronecker's delta function, in which $\delta_{ij} = 1$ for $i = j$ and 0 otherwise. Thus the bias term $b$ is calculated for $\alpha_i > 0$ as

$$b = y_i - \sum_{j=1}^{M}\alpha_j y_j\Bigl( K(\mathbf{x}_i, \mathbf{x}_j) + \frac{\delta_{ij}}{C} \Bigr), \qquad (52)$$

which is different from that of the L1 support vector machine. But the decision function is the same:

$$D(\mathbf{x}) = \sum_{i=1}^{M}\alpha_i y_i\,K(\mathbf{x}_i, \mathbf{x}) + b. \qquad (53)$$

Substituting (47) to (49) into (46), we obtain the dual objective function:


$$Q(\boldsymbol{\alpha}) = \sum_{i=1}^{M}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{M}\alpha_i\alpha_j y_i y_j\Bigl( K(\mathbf{x}_i, \mathbf{x}_j) + \frac{\delta_{ij}}{C} \Bigr). \qquad (54)$$

Thus the following dual problem is obtained. Maximize (54) subject to

$$\sum_{i=1}^{M} y_i\alpha_i = 0, \qquad \alpha_i \ge 0, \qquad i = 1,\dots,M. \qquad (55)$$

This is similar to a hard-margin support vector machine. The difference is the addition of $\delta_{ij}/C$ in (54). Therefore, for the L1 support vector machine, if we replace $K(\mathbf{x}_i, \mathbf{x}_j)$ with $K(\mathbf{x}_i, \mathbf{x}_j) + \delta_{ij}/C$ and remove the upper bound given by $C$ for $\alpha_i$, we obtain the L2 support vector machine. But we must notice that when we calculate the decision function in (53) we must not add $\delta_{ij}/C$. Because $1/C$ is added to the diagonal elements of the matrix $K = \{K(\mathbf{x}_i, \mathbf{x}_j)\}$, called the kernel matrix, the resulting matrix becomes positive definite. Thus the associated optimization problem is more computationally stable than that of the L1 support vector machine.
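In code, the relationship just described amounts to adding 1/C to the diagonal of the kernel matrix and dropping the upper bound on the multipliers. A minimal sketch, again assuming quadprog and an already computed kernel matrix K, a column vector of labels y and a margin parameter C (all placeholders):

% L2 soft-margin dual (54)-(55) solved by reusing the hard-margin machinery.
M = length(y);
Ktilde = K + eye(M)/C;                 % kernel matrix with 1/C added to the diagonal
H = (y*y') .* Ktilde;
alpha = quadprog(H, -ones(M,1), [], [], y', 0, zeros(M,1), []);   % only alpha_i >= 0

sv = alpha > 1e-6;                     % support vector indices
i0 = find(sv, 1);                      % any alpha_i > 0 can be used for the bias (52)
b  = y(i0) - (alpha .* y)' * Ktilde(:, i0);
% Decision function (53) uses the plain K, without the 1/C diagonal term.
D  = K(:, sv) * (alpha(sv) .* y(sv)) + b;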

L2 soft-margin support vector machines look similar to hard-margin support vector machines. Actually, letting

$$\hat{\mathbf{w}} = \begin{pmatrix} \mathbf{w} \\ \sqrt{C}\,\boldsymbol{\xi} \end{pmatrix}, \qquad \hat{b} = b, \qquad \hat{\mathbf{g}}(\mathbf{x}_i) = \begin{pmatrix} \mathbf{g}(\mathbf{x}_i) \\ \dfrac{y_i\,\mathbf{e}_i}{\sqrt{C}} \end{pmatrix}, \qquad (56)$$

where $\mathbf{e}_i$ is the $M$-dimensional vector with the $i$th element being 1 and the remaining elements 0, training the L2 support vector machine given by (44) and (45) is converted to the following problem. Minimize

$$\frac{1}{2}\,\|\hat{\mathbf{w}}\|^{2} \qquad (57)$$

subject to

$$y_i\bigl(\hat{\mathbf{w}}^{\top}\hat{\mathbf{g}}(\mathbf{x}_i) + \hat{b}\bigr) \ge 1, \qquad i = 1,\dots,M. \qquad (58)$$

Therefore, the L2 support vector machine is equivalent to a hard-margin support vector machine in the augmented feature space. Because the L2 support vector machine always has a solution, owing to the slack variables, the equivalent hard-margin support vector machine also has a solution. But this only means that the two classes do not overlap in the augmented feature space. Therefore, there may be cases where the classes overlap in the original


feature space, and thus the recognition rate of the training data for the L2 support vector machine is not necessarily 100 percent.
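A short verification of this equivalence, using only the definitions in (56):

$$y_i\bigl(\hat{\mathbf{w}}^{\top}\hat{\mathbf{g}}(\mathbf{x}_i) + \hat{b}\bigr) = y_i\bigl(\mathbf{w}^{\top}\mathbf{g}(\mathbf{x}_i) + b\bigr) + y_i^{2}\,\xi_i = y_i\bigl(\mathbf{w}^{\top}\mathbf{g}(\mathbf{x}_i) + b\bigr) + \xi_i \ge 1,$$

so the hard-margin constraint (58) reproduces (45), while $\frac{1}{2}\|\hat{\mathbf{w}}\|^{2} = \frac{1}{2}\|\mathbf{w}\|^{2} + \frac{C}{2}\sum_{i=1}^{M}\xi_i^{2}$ reproduces the objective (44).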

3.2. Multiclass Support Vector Machines

3.2.1. One-against-all

3.2.2. One-against-one

3.3. Support Vector Regression

The Support Vector method can also be applied to the case of regression, maintaining all

    the main features that characterize the maximal margin algorithm: a non-linear function is

    learned by a linear learning machine in a kernel-induced feature space while the capacity of the

    system is controlled by a parameter that does not depend on the dimensionality of the space.

    As in the classification case the learning algorithm minimizes a convex functional and its

    solution is sparse.
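For concreteness, the band shown in Figure 9 is that of the standard linear ε-insensitive loss used in support vector regression,

$$L_{\varepsilon}\bigl(y, f(\mathbf{x})\bigr) = \max\bigl(0,\;|y - f(\mathbf{x})| - \varepsilon\bigr),$$

which is zero whenever the prediction $f(\mathbf{x})$ lies within $\varepsilon$ of the target $y$; the quadratic variant in Figure 10 squares this quantity.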

Figure 9 The ε-insensitive band for a one-dimensional linear regression problem


Figure 10 The linear (red) and quadratic (green) ε-insensitive loss for zero and non-zero ε

3.4. Advantages and Disadvantages

The advantages of support vector machines over multilayer neural network classifiers are as

    follows:

    1. Maximization of generalization ability. In training a multilayer neural network classifier,

    the sum-of-squares error between outputs and desired training outputs is minimized. Thus, the

class boundaries change as the initial weights change, and so does the generalization ability. Therefore,

    especially when training data are scarce and linearly separable, the generalization ability

    deteriorates considerably. But because a support vector machine is trained to maximize the

    margin, the generalization ability does not deteriorate very much.

    2. No local minima. A multilayer neural network classifier is known to have numerous local

    minima, and there have been extensive discussions on how to avoid a local minimum in

    training. But because a support vector machine is formulated as a quadratic programming

    problem, there is a global optimum solution.

    3. Robustness to outliers. Multilayer neural network classifiers are vulnerable to outliers

    because they use the sum-of-squares errors. Thus to prevent the effect of outliers, outliers

    need to be eliminated before training, or some mechanism for suppressing outliers needs to be

incorporated in training. In support vector machines the margin parameter $C$ controls the misclassification error. If a large value is set for $C$, misclassification is suppressed, and if a small value is set, training data that lie away from the bulk of the data are allowed to be misclassified. Thus, by properly setting the value of $C$, we can suppress the effect of outliers.

    The disadvantages of support vector machines explained so far are as follows.


    1. Extension to multiclass problems. Unlike multilayer neural network classifiers, support

    vector machines use direct decision functions. Thus an extension to multiclass problems is not

    straightforward, and there are several formulations.

    2. Long training time. Because training of a support vector machine is done by solving the

    associated dual problem, the number of variables is equal to the number of training data. Thus

for a large number of training data, solving the dual problem becomes difficult in terms of both memory size and training time.

    3. Selection of parameters. In training a support vector machine, we need to select an

appropriate kernel and its parameters, and then we need to set the value of the margin parameter $C$. Selecting the optimal parameters for a given problem is called model selection. The situation is the same as for neural network classifiers, where we need to set the number of hidden units, the initial values of the weights, and so on. In support vector machines, model selection is done by estimating the generalization ability through repeatedly training support vector machines. But because this is time-consuming, several indices for estimating the

    generalization ability have been proposed.

4. Data-driven methods for fault diagnosis

4.1. Principal component analysis

4.2. Projection algorithms

5. Application to Tennessee Eastman process as a benchmark

    The case involves the production of two products from four reactants. There are 41

    measured process variables and 12 manipulated variables. The process (Figure 11) is composed

    of five main unit operations: an exothermic two-phase reactor, a product condenser, a flash

    separator, a reboiled stripper, and a recycle compressor.

All 20 faults proposed in the original paper (Downs & Vogel, 1993) were considered, in order to face a global and complex diagnosis problem. The main diagnosis difficulties in this problem come from the number of different faults to be managed and their wide diversity. Random

    and step biases as well as soft drifts on process variables must be managed in this complex

    diagnosis case study.


    Figure 11 Case study: Tennessee Eastman process flowsheet

    Besides, some of the considered disturbances cause insignificant biases in variables

because of the adjusted control strategy, which makes it very difficult to find out the source causes of the biases. Therefore, the TE process represents a challenging, realistic diagnosis problem for testing and

    comparing the approach proposed in this work.

Three main limitations were found when reviewing former approaches in order to compare the

    performance of a novel fault diagnosis system:

    1. Reported results are very limited from the quantitative point of view. No common or

    standard quantitative indices have been used for that purpose;

    2. Unclear distinctions between the detection and diagnosis steps reduce the chances

    for systematic comparisons;

    3. Since many simulation runs may be obtained from the same case study (different

    ways of sampling, several initial and final conditions, simulation span, etc.), complete data sets

need to be agreed upon in order to establish a unique basis for comparison;


Classification tools are among the most widespread fault diagnosis systems in the literature. They are based on mapping input data from the process plant to different process states. The classification methodologies have succeeded because of their implementation simplicity, since they do not require process operator experience or first-principles process knowledge. Among these classification methodologies, data-based rule systems and machine learning are the most commonly applied. However, techniques coming from the machine learning area have not been properly exploited to overcome significant limitations of chemical process fault diagnosis. Only neural networks, a technique that has been superseded in other technical fields, have been widely applied as a pattern learning technique.

    Thus, considering fault diagnosis of chemical process as a classification problem would

allow taking advantage of the progress made by the machine learning community throughout the last decades.

5.1. Problem formulation

The fault diagnosis problem could be formulated as a classical classification problem in

    order to extend the concepts successfully applied in machine learning theory. In this sense,

    better generalization capabilities or more accurate and flexible pattern training algorithms

could overcome limitations on fault isolation or multiple-fault management. By adopting this

    classification view, the faults to be managed in a process industry would be different classes in

    the general classification problem. As in any classification problem, it will be crucial to properly

choose meaningful variables or features that properly represent the diagnosis problem, to adopt suitable learning methodologies and to follow rigorous validation procedures to avoid FDS over-

    fitting. Under this view the classical fault detection problem is a binary classification (BC)

    problem, consisting of determining whether the process is in or out of control. The global fault

diagnosis problem is considered as a multi-class classification (MC) problem, in which many

    classes are involved.

In order to solve both kinds of problems, different learning algorithms have been developed in the machine learning area, obtaining very promising results (e.g. Naive Bayes, K-Nearest Neighbours, SVM, AdaBoost.MH, etc.).

    Some of these algorithms employ BC techniques, but can also be applied to solve multi-class

problems when they are adapted by techniques like one-versus-all or constraint classification. In order to deal with the multi-class fault diagnosis problem, the one-versus-all methodology was applied throughout this work. It consists of decomposing the problem into as many binary problems as the original problem has classes; then, one classifier is trained for each class, trying to separate the examples of that class (positives) from the examples of all other classes (negatives). This method generates classifiers between each class and the set of all


    other classes. When classifying a new example, all binary classifiers predict a class and the one

with the highest confidence is selected (winner-take-all strategy).
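A minimal sketch of this winner-take-all step in MATLAB, assuming the decision values of the binary classifiers have already been computed for each test example; the matrix Dvals and its entries are purely illustrative.

% One-versus-all, winner-take-all prediction.
% Dvals(i,k) is the decision value of the classifier trained for class k,
% evaluated on test example i (placeholder values below).
Dvals = [ 1.2  -0.3  -2.0;
         -0.7   0.4  -0.1;
         -1.1  -0.6   0.9];
[confidence, predictedClass] = max(Dvals, [], 2);   % winner-take-all over the classes
% predictedClass(i) is the class whose classifier is most confident about example i.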

    5.2. Support vector machines & Kernel methods on TE

    Figure 12 Evolution of Error when changing arguments on


    5.3. PCA on TE

Figure 13 PCA method for datasets D00, D10 and D12


    Figure 14 Q statistic for fault 12

Figure 15 T² statistic for fault 12


6. List of Figures

Figure 1 Simple example of classification .................................................................................... 9

    Figure 2 Optimal decision function .............................................................................................. 9

    Figure 3 Margin .......................................................................................................................... 10

    Figure 4 Optimal hyperplane ..................................................................................................... 11

    Figure 5 L1 Soft-Margin slack variables ..................................................................................... 15

    Figure 6 Linear inseparability ..................................................................................................... 18

    Figure 7 Mapping input data ...................................................................................................... 18

    Figure 8 L2 Soft-Margin slack variables ..................................................................................... 19

    Figure 9 The insensitive band for a one dimensional linear regression problem ..................... 22

Figure 10 The linear (red) and quadratic (green) ε-insensitive loss for zero and non-zero ε ...... 23
Figure 11 Case study: Tennessee Eastman process flowsheet .................................................. 25

    Figure 12 Evolution of Error when changing arguments on ...................................................... 27

Figure 13 PCA method for datasets D00, D10 and D12 ....................................................... 28

    Figure 14 Q statistic for fault 12 ................................................................................................ 29

Figure 15 T² statistic for fault 12 ................................................................................................ 29

7. Abbreviations

SVM Support Vector Machines

    FDS Fault Diagnosis System

TE Tennessee Eastman
PCA Principal Component Analysis

    PSVM Parallel support vector machines

    FDA Fisher Discriminant Analysis

    BC Binary Classification

    MC Multi Class Classification

    KKT Karush-Kuhn-Tucker

8. Nomenclature

Lowercase bold letters are used to denote vectors and uppercase italic letters to denote matrices. The following list shows the symbols used in this diploma project.

$\mathbf{x}_i$ — training data
$\alpha_i$ — Lagrange multiplier for $\mathbf{x}_i$
$\beta_i$ — Lagrange multiplier for $\xi_i$
$\xi_i$ — slack variable associated with $\mathbf{x}_i$


$C$ — margin parameter
$d$ — degree of a polynomial kernel
$\mathbf{g}(\mathbf{x})$ — mapping function from $\mathbf{x}$ to the feature space
$\gamma$ — parameter for a radial basis function kernel
$K(\mathbf{x}, \mathbf{x}')$ — kernel
$l$ — dimension of the feature space
$M$ — number of training data
$m$ — number of input variables
$n$ — number of classes
$S$ — set of support vector indices
$U$ — set of unbounded support vector indices
$\|\mathbf{a}\|$ — Euclidean norm of vector $\mathbf{a}$
$b$ — bias
$D(\mathbf{x})$ — decision function
$Q$ — cost function
$\mathbf{w}$ — weights

9. References

1. Nello Cristianini, John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000;
2. John Shawe-Taylor, Nello Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004;
3. Support vector machines (SVMs), http://en.wikipedia.org;
4. Ovidiu Ivanciuc, Applications of Support Vector Machines in Chemistry. In: Reviews in Computational Chemistry, Volume 23, Eds.: K. B. Lipkowitz and T. R. Cundari, Wiley-VCH, Weinheim, 2007, pp. 291-400;
5. Shigeo Abe, Support Vector Machines for Pattern Classification, Springer-Verlag London Limited, 2005;
6. Chih-Wei Hsu, Chih-Jen Lin, A Comparison of Methods for Multiclass Support Vector Machines, IEEE Transactions on Neural Networks, vol. 13, no. 2, March 2002;
7. Abhijit Kulkarni, V.K. Jayaraman, B.D. Kulkarni, Knowledge incorporated support vector machines to detect faults in Tennessee Eastman Process, Computers and Chemical Engineering 29, pp. 2128-2133, Elsevier Ltd., 2005;
8. A. Mathur, G. M. Foody, Multiclass and Binary SVM Classification: Implications for Training and Classification Users, IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, April 2008;


9. Shawn Martin, The Numerical Stability of Kernel Methods, Sandia National Laboratories, Albuquerque, NM, USA, December 2005;
10. A. Mauricio Sales Cruz, Tennessee Eastman Plant-wide Industrial Process Challenge Problem. Complete Model, Department of Chemical Engineering, Technical University of Denmark, January 2004;
11. Ignacio Yélamos, Gerard Escudero, Moisés Graells, Luis Puigjaner, Performance assessment of a novel fault diagnosis system based on support vector machines, Computers and Chemical Engineering 33, pp. 244-255, 2009.

10. Annex 1 Code listing

10.1. PCA, T² statistic and Q statistic algorithms

clear; clc; close all;

% Load Tennessee Eastman data sets (fault-free data and two faulty test sets)
load data/d00.dat
load data/d00_te.dat
load data/d10_te.dat
load data/d12_te.dat

X_1 = d10_te;            % test data with fault 10
X_2 = d12_te;            % test data with fault 12
X_3 = d00_te;            % fault-free (normal operation) data

num_row_X_3 = size(X_3,1); num_col_X_3 = size(X_3,2);
num_row_X_2 = size(X_2,1); num_col_X_2 = size(X_2,2);
num_row_X_1 = size(X_1,1); num_col_X_1 = size(X_1,2);

% Mean and standard deviation of the fault-free data, used to scale all sets
Mn_x_3  = mean(X_3);
Std_x_3 = std(X_3);

% Normalization (zero mean, unit variance with respect to the fault-free data)
X_norm_3 = (X_3 - kron(Mn_x_3, ones(num_row_X_3,1))) ./ kron(Std_x_3, ones(num_row_X_3,1));
X_norm_2 = (X_2 - kron(Mn_x_3, ones(num_row_X_2,1))) ./ kron(Std_x_3, ones(num_row_X_2,1));
X_norm_1 = (X_1 - kron(Mn_x_3, ones(num_row_X_1,1))) ./ kron(Std_x_3, ones(num_row_X_1,1));

% Sanity check: mean and standard deviation of the normalized sets
TestMn_x_3 = mean(X_norm_3); TestStd_x_3 = std(X_norm_3);
TestMn_x_2 = mean(X_norm_2); TestStd_x_2 = std(X_norm_2);
TestMn_x_1 = mean(X_norm_1); TestStd_x_1 = std(X_norm_1);

% Covariance matrix of the normalized fault-free data
S_3 = (X_norm_3' * X_norm_3) / (num_row_X_3 - 1);


% Eigendecomposition of the covariance matrix (eigenvalues and eigenvectors)
[V_3, D_3] = eigs(S_3);
a = 2;                                  % number of principal components retained

% Share of variance captured by the first two components (of those returned by eigs)
Represented_variance_2 = (D_3(1,1) + D_3(2,2)) / sum(diag(D_3));

P = V_3(:, 1:a);                        % loading matrix (first a eigenvectors)

% Score matrices (projections onto the principal components)
T_3 = X_norm_3 * P;
T_2 = X_norm_2 * P;
T_1 = X_norm_1 * P;

% 99% control limit for the T^2 statistic (F-distribution based)
T_threshold = a*(num_row_X_3-1)*(num_row_X_3+1) / (num_row_X_3*(num_row_X_3-a)) ...
              * finv(0.99, a, num_row_X_3-a);

% Plot the scores together with the elliptical T^2 control region
b = sqrt(D_3(1,1)*T_threshold);
c = sqrt(D_3(2,2)*T_threshold);
theta = 0:pi/50:2*pi;
t_1 = b*cos(theta);
t_2 = c*sin(theta);
figure;
plot(T_3(:,1), T_3(:,2), 'x', T_2(:,1), T_2(:,2), '*', T_1(:,1), T_1(:,2), 'o', t_1, t_2);
xlabel('t_1'); ylabel('t_2');
legend('Class 3', 'Class 2', 'Class 1', '99% Threshold', 'location', 'northwest');
title('PCA method');

% Calculate the T^2 statistic for the faulty data set X_2
T_UCL_square = ((num_row_X_2^2 - 1)*a*finv(0.99, a, num_row_X_2-a)) / (num_row_X_2*(num_row_X_2-a));
T_Square = zeros(1, num_row_X_2);
Y = T_2 / sqrtm(D_3(1:a, 1:a));         % scale the scores by Lambda_a^(-1/2)
for i = 1:num_row_X_2
    yy = Y(i, 1:a)';
    T_Square(i) = yy' * yy;
end

% Plot the T^2 statistic against its control limit
figure;
plot(1:num_row_X_2, T_Square, 1:num_row_X_2, T_threshold*ones(1, num_row_X_2));
title('T square statistic'); xlabel('time'); ylabel('T2');

% Q (SPE) statistic: residual subspace of the fault-free data
[n, m] = size(X_2);
q1Array = zeros(1, n);
I = eye(m);
X_norm_residual = X_norm_3 - X_norm_3*P*P';
Co_x = X_norm_residual' * X_norm_residual;
[M_res, N_res] = eig(Co_x);             % all residual eigenvalues are needed for the control limit


N_res = N_res / (num_row_X_3 - 1);      % residual covariance estimated from the fault-free set
theta_1 = sum(diag(N_res));
theta_2 = sum(diag(N_res.^2));
theta_3 = sum(diag(N_res.^3));
h_0 = 1 - (2*theta_1*theta_3) / (3*theta_2^2);
part_1 = norminv(0.99) * sqrt(2*theta_2*h_0^2) / theta_1;
part_2 = theta_2*h_0*(h_0-1) / theta_1^2;
% 99% control limit for the squared prediction error (Q / SPE statistic)
SPE_UCL_x = theta_1 * (part_1 + part_2 + 1)^(1/h_0);

% Q statistic for each sample of the faulty data set X_2
for i = 1:n
    xi = X_norm_2(i,:);
    q1Array(i) = xi * (I - P*P') * xi';
end

% Plot the Q statistic against its control limit
figure;
plot(1:n, q1Array, 1:n, SPE_UCL_x*ones(1,n));
xlabel('time'); ylabel('Q'); title('Q statistic');

    10.2. Support vector machines & Kernel methods