
SUBMITTED TO IEEE TKDE - SPECIAL ISSUE ON INTELLIGENT DATA PREPARATION 1

An Evaluation of Neural Network Methods and Data Preparation Strategies for Novelty Detection

Rewbenio A. Frota and Guilherme A. Barreto, Member IEEE, and João C. M. Mota, Member IEEE

The authors are with the Department of Teleinformatics Engineering, Federal University of Ceará (UFC), Fortaleza-CE, Brazil.

E-mails: {rewbenio, guilherme, mota}@deti.ufc.br.

November 24, 2004 DRAFT


Abstract

An important issue in the design of a model for a particular data set is the quality of the data concerning the presence of anomalous observations (outliers) and their influence on the performance of pattern classifiers. Common approaches to dealing with outliers either remove them from the data or improve the robustness of the machine learning method by handling them directly. We explore these two views by introducing a systematic methodology to compare the performance of neural methods applied to novelty detection. Firstly, we describe in a tutorial-like fashion the most common neural-based novelty detection techniques. Then, in order to compute reliable decision thresholds, we generalize the recent application of the bootstrap resampling technique from unsupervised novelty detection to the supervised case, and propose an outlier removal procedure based on it. Finally, we evaluate the performance of the neural network methods through simulations on a breast cancer data set, assessing their robustness to outliers and their sensitivity to training parameters, such as data scaling, number of neurons, training epochs and size of the training set. We conclude the paper by discussing the obtained results.

Index Terms

Novelty detection, self-organizing maps, multilayer neural networks, bootstrap, prediction intervals.

I. INTRODUCTION

Novelty detection is the problem of reporting the occurrence of novel events or data. As such, it has been the focus of increasing attention in many pattern recognition applications whose success depends on building a reliable model for the data, such as machine monitoring [1], [2], [3], image processing [4], radar target detection [5], detection of masses in mammograms [6], mobile robotics [7], [8], handwritten digit recognition [9], computer network security [10], [11], [12], statistical process control [13] and fault management [14], among others.

This interest is due in part to the crucial importance, for some problems, of the model being able to detect patterns that do not match the stored data representation well. Due to the wide range of applicability across disciplines in engineering and science, novelty detection is also called anomaly detection, intruder detection, fault detection or outlier detection.

Several neural, system-theoretic, statistical and hybrid approaches to novelty detection have been proposed over the years, and it has become usual to formulate novelty detection as one of the following pattern classification problems:


• Single-Class Classification: The data available for learning a representation of the expected behavior of the system of interest is composed of only one type of data vector, usually representing normal activity of the system. This type of data is also referred to as positive examples. The goal is to indicate whether a given input vector corresponds to normal or abnormal behavior.

• Multi-Class Classification: The training set contains data vectors of different classes. The data should be representative of positive (normal) and negative (abnormal) behavior, in order to build an overall representation of the known system behavior, even (and especially) in faulty operation [15]. The goal is to classify the input vector into one, or none, of the existing classes.

Thus, the design of novelty detectors can generally be stated as the task in which a description of what is already known about the system is learned by fitting a set of normal and/or abnormal data vectors, so that subsequent unseen patterns are evaluated by comparing a measure of novelty against some threshold. The main challenges are then the collection of reliable data, the definition of an appropriate learning machine (i.e. the classifier) and the computation of decision thresholds with which novel patterns can be detected for a given application.

As can be verified in recently published survey papers [16], [17], [18], [19], considerable effort has been devoted to the design of powerful classifiers and threshold computation techniques, while much less attention has been paid to data-related issues, such as the occurrence of outliers and data scaling methods, and their influence on the performance of the classifiers.

Concerning the quality of the collected data, most works on novelty detection assume, implicitly or explicitly, that the training data is outlier-free or that the outliers are known in advance. Since outliers may arise for a number of reasons, such as measurement error, mechanical/electrical faults, unexpected behavior of the system (e.g. fraudulent behavior), or simply natural statistical deviations within the data set, these assumptions are unrealistic. It is worth mentioning that the data labelling process, even if performed by an expert, is also error-prone.

Even if we assume that the data is outlier-free, it is very difficult, if not impossible, to know in advance whether the sampled data, concerning the number of positive and/or negative examples, suffices to give a reliable description of the underlying statistical nature of the system. For example, for some applications, the number of negative (abnormal) examples can be very small, since they are


rare or difficult (expensive) to collect. It is well known that, for good classification performance, the number of examples per class should ideally be balanced [20]. This is particularly true for powerful classifiers, such as feedforward neural networks [21], where overfitting is common.

In this case, it is recommended to treat the few negative examples available as outliers, handling the novelty detection task as a single-class classification problem in which training of the classifier is carried out with positive (normal) examples only. The outliers are then used to test the performance of the novelty detection system. Some authors, however, argue that the inclusion of outliers during training can be beneficial to the novelty detection system, improving its robustness as a whole [4], [22], [23]. If known outliers are unavailable, these authors suggest generating artificial outliers for that purpose.

Bearing in mind the aforementioned issues concerning the design of a robust novelty detection system, the contributions of this paper are manifold:

1) Proposal of a generic methodology to compute thresholds that is applicable to supervised and unsupervised networks alike.
2) Proposal of a data cleaning strategy for outlier removal based on the proposed methodology.
3) Comparison of the proposed methodology with existing threshold determination techniques using different neural network paradigms.
4) Evaluation of the proposed methodology in the presence of known and unknown outliers and for different data scaling strategies.

The remainder of the paper is organized as follows. In Section II, we briefly present the novelty detection task as a hypothesis testing procedure. Then, in Section III, we describe in a tutorial-like fashion the most common neural-based novelty detection techniques. In Sections IV and V, in order to compute reliable decision thresholds, we generalize the recent application of the bootstrap resampling technique from unsupervised novelty detection to the supervised case, and propose an outlier removal procedure based on it. Finally, in Section VI we evaluate the performance of the neural network methods through simulations on a breast cancer data set and discuss the obtained results. We conclude the paper in Section VII.


II. NOVELTY DETECTION AS HYPOTHESIS TESTING

Before presenting techniques for novelty detection, it is worth understanding the novelty detection task under the formalism of statistical hypothesis testing, in order to establish criteria to measure the performance of the neural models. First of all, it is necessary to define a null hypothesis, i.e., the hypothesis to be tested. For our purposes, H0 is stated as follows:

• H0: The input vector xnew reflects KNOWN activity.

where by the adjective known we mean that the vector xnew represents normal behavior, if we are dealing with a single-class classification problem. If we have a multi-class classification problem, the adjective known means that the input vector belongs to one of the already learned classes.

The so-called alternative hypothesis, denoted by H1, is obviously given by:

• H1: The input vector reflects UNKNOWN activity.

so that, in this case, the input vector carries novel information, which in general is indicative of abnormal behavior of the system being analyzed.

Thus, when formulating a conclusion regarding the condition of the system based on the definitions of H0 and H1, two types of errors are possible:

• Type I error: This error occurs when H0 is rejected when it is, in fact, true. The probability of making a type I error is denoted by the so-called significance level, α, whose value is set by the investigator in relation to the consequences of such an error. That is, we want to make the significance level as small as possible in order to protect the null hypothesis and to prevent, as far as possible, the investigator from inadvertently making false claims. A type I error is also referred to as a False Alarm, False Detection or False Positive.

• Type II error: This error occurs when H0 is accepted when it should be rejected. The probability of making a type II error is denoted by β (which is generally unknown). A type II error is also referred to as an Absence of Alarm or False Negative. A type II error is frequently due to the sample size N being too small.

Novelty detection systems are usually evaluated by the number of false alarms and absences of alarm they produce. The ideal novelty detector would have α = 0 and β = 0, but this is not really possible in practice. So, one tries to manage the α and β error probabilities based on the


overall consequences (e.g. high costs, death, machine breakdown, virus infection, etc.) for the system being analyzed.

For example, a system that reports false alarms too frequently will lead its operators gradually to put no faith in its decisions, to the point that they may refuse to believe that an actual problem is occurring. In medical testing, absences of alarm (false negatives) provide false reassurance to both patients and physicians that a disease which is actually present is absent. This in turn leads to patients not receiving the advice and treatment needed to protect their interests.

The difficulty is that, for any fixed sample size N, a decrease in α will cause an increase in β. Conversely, an increase in α will cause a decrease in β. To decrease both α and β to acceptable levels, we may increase the sample size N. Also, for any fixed α, an increase in the sample size N will cause a reduction in β, i.e., a greater number of samples reduces the probability of accepting the null hypothesis when it is false.
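The effect of the sample size N on β can be illustrated with a small Monte Carlo sketch. This is an illustration of the general statistical fact, not an experiment from the paper; the test statistic (sample mean), the true mean and the trial counts are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def beta_estimate(N, alpha=0.05, mu1=0.5, trials=4000):
    """Estimate the type II error (beta) of a one-sided test of
    H0: mu = 0 against a true mean mu1, at significance level alpha,
    using the sample mean of N observations as the test statistic."""
    # Critical value: the (1 - alpha) quantile of the null distribution
    null_means = rng.normal(0.0, 1.0, size=(trials, N)).mean(axis=1)
    crit = np.quantile(null_means, 1.0 - alpha)
    # beta = probability of accepting H0 when the true mean is mu1
    alt_means = rng.normal(mu1, 1.0, size=(trials, N)).mean(axis=1)
    return float(np.mean(alt_means <= crit))

beta_small, beta_large = beta_estimate(N=5), beta_estimate(N=50)
print(beta_small, beta_large)   # beta shrinks as N grows, alpha fixed
```

With α held fixed, the larger sample yields a much smaller estimated β, which is exactly the trade-off exploited later when the number of samples is increased by resampling.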

Usually, in neural-based novelty detection the number of samples is fixed and strongly related to the number of neurons. If one increases the number of neurons in order to decrease both α and β, the computational costs also increase rapidly. This can be problematic if the novelty detection system is supposed to work in real-time, as in computer network monitoring or spam detection software. Even for offline applications, greater computational effort demands greater computational power, increasing the cost of the hardware. In this paper, we propose an alternative for increasing the number of samples that demands much less additional computational effort. For this purpose, the number of samples of the variable of interest is increased through statistical resampling techniques, such as the bootstrap [24].

III. NEURAL METHODS FOR NOVELTY DETECTION

Supervised and unsupervised artificial neural network (ANN) algorithms have been used in a wide range of novelty detection tasks, mainly due to their nonparametric1 nature and their powerful generalization performance [25]. In this section we briefly review the most common ANN approaches to novelty detection. It is not our intention to provide a comprehensive survey of possible approaches, but rather to give an introduction to the issue.

1 By nonparametric we mean methods that make no or very few assumptions about the statistical distribution of the data.


A. Optimal Linear Associative Memory

One of the first approaches to novelty detection, called the Novelty Filter, was proposed by Kohonen and Oja [26]. The following theoretical development is a special case of the Optimal Linear Associative Memory (OLAM) [27], in which one tries to learn a linear mapping y = Mx from a finite set of input-output pairs (xi, yi), i = 1, . . . , m.

For novelty detection purposes, we are interested only in the autoassociative OLAM. In this case, the pair (xi, yi) reduces to the redundant pair (xi, xi). So, given a set of sample vectors x1, x2, . . . , xm, it is possible to compute the matrix M as follows:

M = X∗X,     (1)

where the columns of X are the training vectors xi and X∗ = XT(XXT)−1 denotes the pseudo-inverse matrix of X.

Let the known vectors x1, x2, . . . , xm span some unique linear subspace L(x1, x2, . . . , xm) of ℝn, or alternatively,

L = L(x1, x2, . . . , xm) = { x | x = Σ_{i=1}^{m} ci xi }     (2)

where c1, . . . , cm are arbitrary real scalars from the domain (−∞, ∞). It can be shown that the matrix M behaves as a projection operator: the operator M projects ℝn onto L. There is another operator, called the dual operator, that projects ℝn onto L⊥, which is the orthogonal complement space {x ∈ ℝn : xyT = 0, ∀y ∈ L}.

It can be shown that the dual operator is given by I − X∗X, where I denotes the n × n identity matrix. Every vector in ℝn can be uniquely decomposed as follows:

x = x(X∗X) + x(I − X∗X) = x̂ + x̃,     (3)

in which the projection x̂ measures what is known about the input x relative to the vectors x1, x2, . . . , xm stored in matrix M as shown in (1).

In turn, the projection x̃ is called the novelty vector, since it measures what is maximally unknown or novel in the measured input vector x. Thus, the magnitude of x̃ can be used for novelty detection purposes. In such applications, the larger the norm ||x̃||, the less certain we are in judging the vector x as belonging to the linear subspace L. In [26], Kohonen and Oja implemented the novelty filter as a fully-connected adaptive neural feedback system.
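A minimal NumPy sketch of this novelty filter follows. It builds the batch projection operators directly (not the adaptive feedback network of [26]); the toy vectors are illustrative:

```python
import numpy as np

def novelty_filter(X):
    """Projection operators of the autoassociative OLAM.
    The training vectors are the columns of X (n x m).
    P projects onto the subspace L they span; I - P is the
    dual operator that extracts the novelty vector of Eq. (3)."""
    P = X @ np.linalg.pinv(X)        # projector onto L
    n = X.shape[0]
    return P, np.eye(n) - P          # (projector, dual operator)

# Two known vectors spanning a plane in R^3
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
P, D = novelty_filter(X)

x_known = np.array([2.0, -3.0, 0.0])  # lies in L: nothing novel
x_novel = np.array([0.0, 0.0, 5.0])   # orthogonal to L: all novel

print(np.linalg.norm(D @ x_known))    # ~0
print(np.linalg.norm(D @ x_novel))    # 5.0
```

The norm of the novelty vector is (numerically) zero for any input inside the learned subspace and grows with the component of the input that the stored vectors cannot explain.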


B. The Self-Organizing Map

The Self-Organizing Map (SOM) [28], [29] is one of the most popular neural network architectures. It belongs to the category of competitive learning algorithms and is usually designed to build an ordered representation of spatial proximity among the vectors of an unlabelled data set.

The neurons in the SOM are put together in an output layer, A, in one-, two- or even three-dimensional arrays. Each neuron i ∈ A has a weight vector wi ∈ ℝn with the same dimension as the input vector x ∈ ℝn. The network weights are trained according to a competitive-cooperative scheme in which the weight vectors of a winning neuron and its neighbors in the output array are updated after the presentation of an input vector. Roughly speaking, the functioning of this type of learning algorithm is based on the concept of the winning neuron, defined as the neuron whose weight vector is closest to the current input vector.

During the learning phase, the weight vectors of the winning neurons are modified incrementally in time in order to extract average features from the set of input patterns. The SOM has been widely applied to pattern recognition and classification tasks, such as clustering and vector quantization. In these applications, the weight vectors are called prototypes or centroids of a given class or category, since through learning they become the most representative elements of a given group of input vectors.

Using the Euclidean distance, the simplest strategy to find the winning neuron, i∗(t), is given by:

i∗(t) = arg min∀i ||x(t) − wi(t)||     (4)

where x(t) ∈ ℝn denotes the current input vector, wi(t) ∈ ℝn is the weight vector of neuron i, and t symbolizes the time steps associated with the iterations of the algorithm. Accordingly, the weight vector of the winning neuron is modified as follows:

wi(t + 1) = wi(t) + η(t)h(i∗, i; t)[x(t) − wi(t)]     (5)

where h(i∗, i; t) is a Gaussian function which controls the degree of change imposed on the weight vectors of those neurons in the neighborhood of the winning neuron:

h(i∗, i; t) = exp( −||ri(t) − ri∗(t)||² / σ²(t) )     (6)

where σ(t) defines the radius of the neighborhood function, and ri(t) and ri∗(t) are, respectively, the positions of neurons i and i∗ in the array. The learning rate, 0 < η(t) < 1, should decay


with time to guarantee convergence of the weight vectors to stable states. In this paper, we use η(t) = η0 (ηT/η0)^(t/T), where η0 and ηT are the initial and final values of η(t), respectively. The variable σ(t) should decay in time similarly to the learning rate η(t).
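Equations (4)-(6), together with the decaying η(t) and σ(t), can be sketched as follows. This is a minimal 1-D SOM on toy data; the map size, schedules and clusters are illustrative choices, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(42)

def train_som(data, n_neurons=10, T=200, eta0=0.5, etaT=0.01,
              sigma0=3.0, sigmaT=0.1):
    """Train a 1-D SOM following Eqs. (4)-(6), with the exponential
    decay schedule eta(t) = eta0*(etaT/eta0)**(t/T) used in the paper."""
    n = data.shape[1]
    W = rng.uniform(data.min(), data.max(), size=(n_neurons, n))
    pos = np.arange(n_neurons, dtype=float)  # neuron positions r_i
    for t in range(T):
        x = data[rng.integers(len(data))]
        eta = eta0 * (etaT / eta0) ** (t / T)
        sigma = sigma0 * (sigmaT / sigma0) ** (t / T)
        i_star = int(np.argmin(np.linalg.norm(W - x, axis=1)))  # Eq. (4)
        h = np.exp(-((pos - pos[i_star]) ** 2) / sigma ** 2)    # Eq. (6)
        W += eta * h[:, None] * (x - W)                         # Eq. (5)
    return W

# Toy "normal" data: two clusters in R^2
data = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                  rng.normal(1.0, 0.1, size=(50, 2))])
W = train_som(data)

# Quantization error (smallest distance to a prototype) of a training
# vector versus a vector far from the training distribution
e_norm = np.linalg.norm(W - data[0], axis=1).min()
e_far = np.linalg.norm(W - np.array([5.0, 5.0]), axis=1).min()
print(e_norm < e_far)   # prototypes lie near the training data
```

After training, the prototypes concentrate around the data clusters, so inputs drawn from the "normal" distribution yield small quantization errors while distant inputs do not.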

The SOM has several beneficial features which make it a valuable tool in data mining applications [30]. In particular, the use of a neighborhood function imposes an order on the weight vectors, so that, at the end of the training phase, input vectors that are close in the input space are mapped onto the same winning neuron or onto winning neurons that are close in the output array. This is the so-called topology-preserving property of the SOM, which has been particularly useful for data visualization purposes [31].

Once the SOM algorithm has converged, the set of ordered weight vectors summarizes important statistical characteristics of the input. The SOM should reflect variations in the statistics of the input distribution: regions of the input space X from which samples x are drawn with a high probability of occurrence are mapped onto larger domains of the output space A, and therefore with better resolution, than regions of X from which sample vectors are drawn with a low probability of occurrence.

This density-matching property is one of the most important for novelty detection purposes. For example, once the SOM has been trained with unlabelled vectors that one believes to consist only of data representing the normal state of the system being analyzed, we can use the quantization error, e(x, wi∗, t), between the current input vector x(t) and the winning weight vector wi∗(t) as a measure of the degree of proximity of x(t) to the distribution of "normal" data vectors encoded by the weight vectors of the SOM.

The quantization error is computed as follows:

e(x, wi∗, t) = ||x(t) − wi∗(t)|| = sqrt( Σ_{j=1}^{n} (xj(t) − wi∗j(t))² )     (7)

where t is an index denoting the current discrete time step. Roughly speaking, if e(x, wi∗, t) is larger than a certain threshold ρ, one assumes that the current input is far from the region of the input space representing normal behavior as modelled by the SOM weights, thus revealing a novelty or an anomaly in the system being analyzed. Several procedures to compute the threshold ρ have been developed in recent years, most of them based on well-established statistical techniques (see e.g. [16], [18]). In the following sections we describe some of these techniques in the context of SOM-based novelty detection.


C. Computing Single-Thresholds

In [12], the SOM is trained with data representing the activity of normal users within a computer network. The threshold ρ is determined by computing the statistical p-value associated with the distribution of training quantization errors. The p-value defines the probability of observing the test statistic e(x, wi∗, t) as extreme as or more extreme than the observed value, assuming that the null hypothesis is true. This novelty detection procedure is implemented as follows:

• Step 1: After training is finished, the quantization errors for the training set vectors are computed (e1, e2, . . . , em).
• Step 2: The quantization error for a new input vector is computed, enew.
• Step 3: The p-value for the new input vector, denoted by Pnew, is computed as follows. Let B be the number of errors in (e1, e2, . . . , em) that are greater than enew. Thus,

ρ = Pnew = B/m     (8)

• Step 4: If ρ > α, then H0 is accepted; otherwise it is rejected. A significance level α = 0.05 is commonly used.
• Step 5: Steps 2-4 are repeated for every new input vector.
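Steps 2-4 can be sketched in a few lines; the stand-in error values are illustrative, not from the paper's experiments:

```python
import numpy as np

def p_value_threshold(train_errors, e_new, alpha=0.05):
    """Single-threshold test of Steps 2-4: accept H0 (known activity)
    when the p-value of e_new under the distribution of training
    quantization errors exceeds the significance level alpha."""
    train_errors = np.asarray(train_errors)
    B = int(np.sum(train_errors > e_new))   # errors larger than e_new
    p_new = B / len(train_errors)           # rho = Pnew = B/m, Eq. (8)
    return p_new, p_new > alpha             # (p-value, H0 accepted?)

errors = np.linspace(0.01, 1.0, 100)        # stand-in training errors
print(p_value_threshold(errors, 0.30))      # typical error: H0 accepted
print(p_value_threshold(errors, 2.00))      # beyond all errors: rejected
```

An input whose quantization error exceeds nearly all training errors gets a p-value close to zero and is flagged as novel.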

According to the authors, the system is very reliable and presented acceptable rates of false negatives and false positives; they concluded that these errors were caused by normal changes in user profiles. Similar approaches have been applied to novelty detection in cellular networks [32], time series modelling [33] and machine monitoring [2].

A single-threshold SOM-based method for fault detection in rotating machinery is presented in [3]. The procedure follows the same steps described previously, except that, in this case, the novelty threshold is computed as follows. For each neuron in the immediate neighborhood of the winning neuron (also called the 1-neighborhood neurons), one computes its distance, Di∗j = ||wi∗ − wj||, to the winning neuron. The novelty threshold is taken as the maximum value of these distances:

ρ = max∀j∈V1 {Di∗j}     (9)

where V1 is the set of 1-neighborhood neurons of the current winning neuron. Thus, if enew > ρ, then the input vector carries novel or anomalous information, i.e. the null hypothesis should be rejected.
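A sketch of Eq. (9) for a 1-D output array follows; the toy prototypes and winner index are arbitrary, and for a 2-D array V1 would contain up to four or eight neighbors, depending on the lattice:

```python
import numpy as np

def neighborhood_threshold(W, i_star):
    """Novelty threshold of Eq. (9): the maximum distance between
    the winning neuron's weight vector and those of its
    1-neighborhood, here for a 1-D output array."""
    V1 = [j for j in (i_star - 1, i_star + 1) if 0 <= j < len(W)]
    return max(np.linalg.norm(W[i_star] - W[j]) for j in V1)

# Toy 1-D map of four prototypes in R^2; the winner is neuron 2
W = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
rho = neighborhood_threshold(W, 2)
print(rho)            # max(||w2 - w1||, ||w2 - w3||) = max(1.0, 2.0) = 2.0

e_new = 2.5
print(e_new > rho)    # True: reject H0, the input carries novelty
```

The threshold thus adapts to the local spread of the map: a winner in a densely packed region yields a tight threshold, while one in a sparse region is more tolerant.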


D. Computing Double-Thresholds

In this section, we describe techniques that compute two thresholds for evaluating the degree of novelty in the input vector. The rationale behind this approach is that, for certain applications, not only a very high quantization error is indicative of novelty, but also a very small one.

One can argue that a small quantization error means that the input vector is almost surely normal. This is true if no outliers are present in the data. However, in more realistic scenarios, there is no guarantee that the training data is outlier-free, and a given neuron could be representing exactly the region to which the outliers belong.

Thus, since outliers can be handled directly by double-threshold methodologies, they are more robust than single-threshold approaches, as will be shown in the simulations. In addition, double-threshold methodologies are well suited for outlier removal purposes, as we propose in Section V.

In [14], the authors proposed a novel technique to detect faults in cellular systems by computing the Bootstrap Prediction Interval (BOOPI) for the distribution of quantization errors. The lower and upper limits of the BOOPI define the novelty thresholds. Several competitive models were analyzed and the SOM provided the best results, generating low false alarm rates.

To implement this procedure, a sample of M bootstrap instances (eb1, eb2, . . . , ebM) is drawn with replacement from the original sample of m (m ≪ M) quantization errors (e1, e2, . . . , em), where each original instance has an equal probability of being sampled. Then, the lower and upper limits of the BOOPI method are computed via percentiles.2

It is shown that prediction (or confidence) intervals can be computed from the bootstrap samples without making any assumption about the distribution of the original data, provided the number M of bootstrap samples is large, e.g. M > 1000 [34], [35], [24]. For a given significance level α, we are interested in an interval within which we can find a percentage 100(1 − α) (e.g. α = 0.05) of the normal values of the quantization error. Hence, we compute the lower and upper limits of this interval as follows:

2 The percentile of a distribution of values is a number Nα such that a percentage 100(1 − α) of the population values is less than or equal to Nα. For example, the 75th percentile (also referred to as the 0.75 quantile) is a value Nα such that 75% of the values of the variable fall below it.


• Lower Limit (ρ−): This is the 100(α/2)th percentile.
• Upper Limit (ρ+): This is the 100(1 − α/2)th percentile.

This interval [ρ−, ρ+] can then be used to classify a new state vector as normal/abnormal by means of a simple hypothesis test:

IF enew ∈ [ρ−, ρ+]
THEN xnew is NORMAL (10)
ELSE xnew is ABNORMAL
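The BOOPI thresholds and decision rule (10) can be sketched as follows; the error distribution and bootstrap size used here are illustrative assumptions, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(7)

def boopi(errors, M=2000, alpha=0.05):
    """BOOPI thresholds: resample the m training quantization errors
    M times with replacement (m << M) and take the 100(alpha/2)th and
    100(1 - alpha/2)th percentiles as the lower/upper thresholds."""
    boot = rng.choice(errors, size=M, replace=True)
    lo = np.percentile(boot, 100 * alpha / 2)
    hi = np.percentile(boot, 100 * (1 - alpha / 2))
    return lo, hi

errors = rng.gamma(shape=2.0, scale=0.1, size=50)  # stand-in errors
lo, hi = boopi(errors)

def is_normal(e_new):
    return lo <= e_new <= hi   # decision rule (10)

print(is_normal(float(np.median(errors))))  # typical error -> NORMAL
print(is_normal(10.0))                      # far outside -> ABNORMAL
```

Note that no distributional assumption is made about the errors; the skewed gamma sample above is handled exactly like any other.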

Instead of computing the 100(α/2)th and 100(1 − α/2)th percentiles, one can also use the well-known statistical box-plot technique3 to determine the interval [ρ−, ρ+]. As will be shown in the simulations, this box-plot approach proved to be one of the most robust approaches to novelty detection.
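A sketch of the box-plot alternative follows; the text does not fix the fence rule, so the conventional Tukey fences at 1.5 × IQR are assumed here:

```python
import numpy as np

def boxplot_thresholds(errors, k=1.5):
    """Box-plot interval [rho-, rho+] from the quartiles of the
    quantization errors. The fence constant k = 1.5 (Tukey's rule)
    is a conventional assumption, not a value fixed by the paper."""
    q1, q3 = np.percentile(errors, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

errors = np.array([0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 3.0])  # 3.0: outlier
lo, hi = boxplot_thresholds(errors)

print(lo <= 0.35 <= hi)   # ordinary error -> NORMAL
print(lo <= 3.0 <= hi)    # extreme error -> ABNORMAL (False)
```

Because the quartiles are insensitive to a few extreme values, the interval stays tight even when the training errors themselves contain an outlier, which is what makes this variant attractive for the outlier removal strategy proposed later.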

It is worth noting that this use of the box plot for novelty detection is very similar to the one introduced by [36]. However, there are two important differences: (i) in our case, the interval [ρ−, ρ+] is computed from the set of M bootstrap instances (eb1, eb2, . . . , ebM), while in [36] the interval is computed from the quantization errors generated by a "cleaned" training data set from which the outliers were removed; (ii) in order to detect and remove outliers, the method of [36] demands the additional computation of the MID matrix4 and of Sammon's mapping [37], which makes it unsuitable for online applications due to the excessive computational burden required5.

E. Habituation-based Methods

In psychology, habituation is defined as a response diminishment as a function of stimulus repetition, when no reward or punishment follows, and it is a constant finding in almost any

3 In box plots, ranges or distribution characteristics of the values of a selected variable (or variables) are plotted separately for groups of cases defined by the values of a categorical (grouping) variable. The central tendency (e.g., median or mean) and range or variation statistics (e.g., quartiles, standard errors, or standard deviations) are computed for each group of cases.

4 The Median Interneuron Distance (MID) matrix is defined as the matrix whose mij entry is the median of the Euclidean distances between the weight vector wi and all neurons within its L-neighborhood.

5 Sammon's mapping is a non-linear mapping that maps a set of input vectors onto a plane, trying to approximately preserve the relative distances between the input vectors. It is widely used to visualize the SOM ordering by mapping the weight vectors onto a plane. Sammon's mapping can be applied directly to data sets, but it is computationally very intensive.


behavioral response [38]. For instance, if you make an unusual sound in the presence of the family dog, it will respond, usually by turning its head toward the sound. If the stimulus is given repeatedly and nothing either pleasant or unpleasant happens to the dog, it will soon cease to respond. This lack of response is a result neither of fatigue nor of sensory adaptation and is long-lasting; when fully habituated, the dog will not respond to the stimulus even though weeks or months have elapsed since it was last presented. Recently, some authors have proposed the use of mathematical models of habituation together with the SOM for the novelty detection task.

Marsland et al. [39], [7] presented an unsupervised algorithm, called Habituating Self-Organizing

 Map (HSOM), for detecting novel stimuli encountered by a mobile robot during navigation. The

HSOM is a two-layered network in which each neuron of the first layer (a usual SOM) is connected to the output neuron via a habituable synapse, so that the more frequently a first-layer neuron is chosen as the winner, the lower the efficacy of its output synapse, and hence the lower the activation of the output neuron. The output value associated with the winning neuron is taken as the novelty value: the more familiar the input vector, the faster the output value decays to zero.
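Habituation dynamics of this kind are usually described with a first-order differential equation. The sketch below assumes Stanley's model, τ dy/dt = α[y0 − y(t)] − S(t), which is the model commonly adopted in this literature; the parameter values are illustrative, not taken from [39], [7].

```python
def habituate(stimulus, y0=1.0, alpha=1.05, tau=3.3, dt=1.0):
    """Simulate a habituating synapse with Stanley's first-order model:
    tau * dy/dt = alpha * (y0 - y) - S(t).
    The efficacy y decays while the stimulus S(t) is present and
    recovers toward y0 when the stimulus is absent."""
    y, trace = y0, []
    for s in stimulus:
        y += dt * (alpha * (y0 - y) - s) / tau
        y = min(max(y, 0.0), y0)  # keep the efficacy within [0, y0]
        trace.append(y)
    return trace

# Repeated presentation of the same stimulus: the synaptic efficacy
# (and hence the novelty signal) decays toward a small resting value.
trace = habituate([1.0] * 20)
assert trace[-1] < trace[0]
```

Presenting a stimulus to a fresh synapse yields a high efficacy, hence a strong output; repeated presentations drive the efficacy down, mirroring the HSOM behavior described above.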

In [8] the authors proposed an alternative to HSOM, called Grow When Required  (GWR)

network, which allows the insertion of new neurons when necessary. In the GWR, both the synapses and the neurons have counters that indicate how many times they have fired (i.e., have been selected as winner). Using these counters, it is possible to determine whether a given neuron is

still learning the inputs or if it is ‘confused’ (i.e., it tries to encode input vectors from different

classes). If this is the case, then a new neuron is added to the network between the input and

the winning neuron that caused the confusion.

The insertion of neurons is dependent upon two user-defined thresholds. The first is a minimum activity threshold, below which the current node is not considered to be a sufficiently good match; the second is a maximum habituation threshold, above which the current node is not considered to have learnt sufficiently well. The GWR network can be used as a novelty filter without any modification: if the neuron that fires has not fired before, or has fired very infrequently, then the input is novel.
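Following this description, the insertion decision can be sketched as below. The threshold values are illustrative, and we adopt the convention, common in the GWR literature, that the habituation counter starts at 1 and decreases as the neuron fires (so a low value means the node is already well habituated).

```python
def should_insert(activity, habituation, a_T=0.8, h_T=0.1):
    """GWR node-insertion test (sketch). A new neuron is added only when
    the winner both matches poorly (activity below the minimum activity
    threshold a_T) and has already fired often (habituation counter below
    h_T), i.e. it is 'confused' rather than still learning."""
    return activity < a_T and habituation < h_T

# A fresh node (habituation near 1) is left to keep learning:
assert not should_insert(activity=0.5, habituation=0.9)
# A well-habituated node that still matches poorly triggers insertion:
assert should_insert(activity=0.5, habituation=0.05)
```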


F. Multilayer Feedforward Supervised Networks

The most popular supervised ANN algorithm, the multilayer Perceptron (MLP), learns an input-output mapping through minimization of some objective function, usually the mean squared error [25]. Due to its popularity, it is natural that MLPs have also been widely used for novelty detection. Two main approaches are common for this purpose: (i) if examples of normal and abnormal behaviors are available, the MLP is used as a nonlinear classifier [4], [22], [23]; (ii) if only data representing normal behavior are available, the MLP is commonly used as an auto-associator [40]. These two possibilities are described next.

The single-hidden layer MLP classifier, using the logistic or the hyperbolic tangent function

for the activation function of the hidden and output layer neurons, implements very general

nonlinear discriminant functions [20], [25]. Usually, if there are q classes of data, denoted here (C1, C2, . . . , Cq), we need q output neurons. These neurons are then trained to produce output values yi, i = 1, . . . , q, that encode the different classes. For example, neuron i should produce an output value yi close to 1 if the input vector belongs to class Ci; otherwise, yi = 0 (or yi = −1). For testing the classification performance, we select the neuron with the highest

output value:

yk = max∀i {yi}    (11)

Then, we assign the new input vector x to class C k, in a “winner-take-all” (WTA) classification

scheme.

For novelty detection purposes, given a new input vector, once we find the neuron with the

highest output value as in (11), we verify if yk is below a preset threshold (ρ). If so, then a novelty

is declared. This approach was used by Singh and Markou [4], Augusteijn and Folkert [22] and

Vasconcelos et al. [23]. Additionally, Augusteijn and Folkert argued that the WTA classification

scheme is unsuitable for novelty detection, since it takes into account only the information carried by a single output neuron. Hence, they suggested taking the entire output pattern into account: the distance between this output pattern and each of the target patterns (used during training) is computed, and if the smallest of these distances is above a preset threshold, the input pattern is considered to belong to a novel class.
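The two decision rules just described can be sketched as follows; this is a hypothetical illustration, with the threshold value and the one-of-q target encoding chosen by us.

```python
import numpy as np

def wta_novelty(y, rho=0.5):
    """WTA test: pick the neuron with the highest output as in (11);
    if even that output falls below the threshold rho, declare the
    input novel (None), otherwise return the class index k."""
    k = int(np.argmax(y))
    return None if y[k] < rho else k

def output_pattern_novelty(y, targets, rho=0.5):
    """Augusteijn-Folkert variant: compare the whole output pattern with
    each target pattern; if the smallest distance exceeds the threshold,
    the input is assigned to a novel class (None)."""
    d = np.linalg.norm(targets - y, axis=1)
    return None if d.min() > rho else int(np.argmin(d))

targets = np.eye(3)                                      # one-of-q encoding
assert wta_novelty(np.array([0.1, 0.2, 0.15])) is None   # all outputs weak
assert output_pattern_novelty(np.array([0.9, 0.1, 0.0]), targets) == 0
```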

To improve the MLP performance in novelty/outlier detection tasks, Vasconcelos et al. [23] suggested the use of the Gaussian Multilayer Perceptron (GMLP) [41]. In this network, a Gaussian activation function is used for the hidden-layer neurons instead of the usual sigmoids. This simple modification provided better results: the Gaussian activation function forces the receptive field of a neuron to be more selective, active only over a narrow partition of the input space, since it tends to produce closed regions surrounding the training data.

Finally, the MLP is also commonly used for novelty detection tasks as an autoassociative

architecture [42], [40]. The autoassociative MLP is designed to learn an input-output mapping

in which the target vectors are the input vectors themselves. This is usually implemented through

a hidden layer whose number of neurons is lower than the dimension of the input vectors.

The network is trained to reconstruct as well as possible a training set consisting of vectors representing normal behavior. In this sense, the autoassociative MLP can be viewed as a

nonlinear version of the Novelty Filter presented in Section III-A. Hence, it should be able to

adequately reconstruct subsequent normal input vectors, but should perform poorly on the task 

of reconstructing abnormal (novel) ones.

Thus, the detection of novel or anomalous input patterns reduces to the task of assessing how

well such vectors are reconstructed by the autoassociative MLP. Quantitatively, this procedure

consists in computing an upper bound for the reconstruction error over all the training set vectors at the end of training. For testing purposes, this upper bound is usually relaxed a little, by a certain percentage. New input patterns are subsequently classified by checking whether the reconstruction error of the new input pattern is above the relaxed upper bound (novel data) or below it (normal data).
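As an illustration of this threshold test, the sketch below stands in a toy linear projection for the trained bottleneck MLP; the projection, the 10% slack, and the synthetic data are our assumptions.

```python
import numpy as np

def reconstruction_threshold(errors, slack=0.10):
    """Upper bound on the training-set reconstruction errors, relaxed by
    a percentage (slack=0.10 means a 10% relaxation), as described above."""
    return (1.0 + slack) * max(errors)

def is_novel(x, reconstruct, threshold):
    """Declare novelty when the autoassociator reconstructs x poorly."""
    return np.linalg.norm(x - reconstruct(x)) > threshold

# Toy 'autoassociator': projection onto the first principal axis of the
# training data, standing in for a trained bottleneck MLP.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9)) * ([3.0] + [0.3] * 8)
u = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)[2][0]
reconstruct = lambda x: (x @ u) * u
errors = [np.linalg.norm(x - reconstruct(x)) for x in X]
rho = reconstruction_threshold(errors)
assert not is_novel(X[0], reconstruct, rho)          # normal vector
assert is_novel(np.full(9, 5.0), reconstruct, rho)   # off-manifold vector
```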

Another popular supervised multilayer ANN, the Radial Basis Function (RBF) network, has also been used for novelty detection [43]. In such applications, RBF networks frequently use the WTA classification rule. This rule is very simple to use and is often an appropriate solution [44]. However, the same issues discussed for the MLP concerning the WTA classification rule also apply to RBF-based novelty detectors, and the method proposed by Augusteijn and Folkert [22] can be used instead.

A mathematically well-founded alternative was proposed by Li et al. [1]. Let the output value of neuron i be given by:

yi(x) = wiT φ(x) + bi    (12)

where φ(x) = [φ1(x) · · · φq(x)]T is the vector of Gaussian basis functions and bi is the bias of the output neuron i. Li et al. developed a method to set the threshold values for each output neuron of an RBF network as follows:

ρi = bi + εi (13)

where bi is the bias of neuron i computed during training and 0 < εi ≪ 1 is a very small positive constant introduced to make the classifier robust to noise and disturbances while causing little increase in the misclassification rate. By using this method, outputs may be readily interpreted as an ‘unknown fault’ when none of the ‘normal’ or ‘fault’ output neurons exceeds its threshold ρi.
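The rule in (12)-(13) can be sketched as follows; the weights, centers, and ε value below are illustrative choices of ours, not taken from [1].

```python
import numpy as np

def rbf_outputs(x, centers, W, b, sigma=1.0):
    """RBF network output as in (12): y_i(x) = w_i^T phi(x) + b_i,
    with Gaussian basis functions centred at `centers`."""
    phi = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    return W @ phi + b

def classify_or_unknown(y, b, eps=1e-3):
    """Thresholds rho_i = b_i + eps as in (13): if no output exceeds its
    threshold, report an 'unknown fault' (None); otherwise return the
    index of the largest output among those exceeding their thresholds."""
    exceeding = np.flatnonzero(y > b + eps)
    return None if exceeding.size == 0 else int(exceeding[y[exceeding].argmax()])

centers = np.array([[0.0, 0.0], [5.0, 5.0]])
W, b = np.eye(2), np.zeros(2)
y_far = rbf_outputs(np.array([50.0, 50.0]), centers, W, b)
assert classify_or_unknown(y_far, b) is None        # unknown fault
y_near = rbf_outputs(np.array([0.1, 0.0]), centers, W, b)
assert classify_or_unknown(y_near, b) == 0          # known class
```

Far from every center, all basis functions vanish and each output collapses to its bias bi, so no output exceeds ρi = bi + εi and an unknown fault is declared.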

IV. A GENERAL METHODOLOGY FOR COMPARISON

As pointed out by Markou and Singh [17], there are a number of studies and proposals in the field of novelty detection, but comparative work has been much scarcer. To our knowledge, only a few papers have compared different neural models on the same data set [4], [10], [11], [45]. None of them provided general guidelines on which technique works best on which types of data, which one is more robust to outliers, and which data preprocessing method provides better results. In this paper we take a first step toward answering some of these questions by providing a general methodology to compare the performance of neural-based novelty detection systems under a more statistically oriented framework, thus avoiding ad hoc approaches.

The rationale behind the proposal of a general methodology was the observation that the decision thresholds of many neural-based novelty detection systems, especially those based on MLP and RBF networks, were computed heuristically, without clearly stated principles. For example, a commonly used heuristic for MLP- or RBF-based novelty detectors is to set the decision threshold to ρ = 0.5: if all the outputs of the network fall below this value, then an unknown (novel) activity is detected.

This problem is also observed in many unsupervised methods, but to a lesser extent. In general, the decision thresholds in these cases are more statistically inspired, such as the p-value or the BOOPI approach. In this paper, we argue that most of the techniques described for SOM-based novelty detectors can be equally used by MLP- and RBF-based novelty detectors. For that, once a neural method to be evaluated is defined, the approach we propose to compute decision thresholds is a combination of the four main steps listed below:


Step 1: Define the output variable, zt, to be evaluated at a given time step t. It is worth emphasizing that zt should reflect the statistical variability of the training data. For that purpose, we list some possibilities next.

• OLAM: One can choose the Euclidean norm of the novelty vector, zt = ‖x̃(t)‖ (see Section III-A).
• SOM: The quantization error, zt = ‖x(t) − wi∗(t)‖, is the usual choice (see Section III-B).

• MLP/RBF: In this case, we have two situations.
– For single-output networks, it can be the output value of the network itself, i.e., zt = y(t).
– For multi-output networks, it can be the Euclidean norm of the difference between the vector of desired outputs, d(t), and the vector of actual outputs, y(t). Then, we have zt = ‖d(t) − y(t)‖. If the autoassociative MLP is being used, zt can be chosen as the norm of the reconstruction error.

Step 2: After the learning machine has been trained, compute the values of zt for each vector of the training set, Z = (z1, z2, . . . , zm).
Step 3: Generate a sample of M bootstrap instances Zb = (zb1, zb2, . . . , zbM) drawn with replacement from the original sample (z1, z2, . . . , zm), where each original value of zi has equal probability of being sampled.

Step 4: Compute the thresholds for the novelty detection tests using the bootstrap samples Zb. In this case, we again have two possibilities.
• For single-threshold methods, one can choose, e.g., the p-value approach described in (8) or Tanaka's method described in (9).
• For double-threshold methods, one can choose to compute prediction intervals through percentiles [14] or by the box-plot method, both described in Section III-D.
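Steps 2-4 can be sketched as follows. The percentile levels and the value of M are illustrative choices, with the upper quantile standing in for a single-threshold test and the percentile interval for a double-threshold (BOOPI-style) prediction interval.

```python
import numpy as np

def bootstrap_thresholds(z, M=5000, alpha=0.05, seed=0):
    """Given the training-set output values z_t (Step 2), draw M bootstrap
    instances with replacement (Step 3) and compute decision thresholds
    from them (Step 4)."""
    rng = np.random.default_rng(seed)
    zb = rng.choice(z, size=M, replace=True)      # bootstrap sample Z_b
    upper = np.quantile(zb, 1.0 - alpha)          # single-threshold test
    interval = (np.quantile(zb, alpha / 2.0),     # double-threshold test
                np.quantile(zb, 1.0 - alpha / 2.0))
    return upper, interval

# The z values would come from the trained network (e.g. SOM quantization
# errors); synthetic values are used here purely for illustration.
z = np.abs(np.random.default_rng(1).normal(size=355))
upper, (lo, hi) = bootstrap_thresholds(z)
assert lo < upper <= z.max()
# A test vector is then flagged as novel when z_t > upper (single
# threshold) or when z_t falls outside [lo, hi] (double threshold).
```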

Several advantages of the proposed methodology are listed below:

• Reliability: It is a statistically well-founded approach, since its functioning is based on the bootstrap resampling method. In addition, if one adopts the BOOPI, the computed thresholds correspond exactly to prediction (confidence) intervals for the output variable zt.

• Nonparametric: No assumptions about the statistical properties of the output variable are

made in any stage of the procedure.

• Generality: It allows the comparison of supervised and unsupervised learning methods on a common basis.

• Robustness: The bootstrap resampling technique allows the generation of a large number

of samples, improving the estimates of parameters.

• Simplicity: The method is logical, very easy to understand and apply.

As will be shown in the simulations (Section VI), one of the main conclusions we have drawn from the comparison of the neural methods under the proposed methodology is that training with outliers is not as beneficial as suggested by some authors. In addition, an interesting by-product of the proposed methodology is the development of a simple data cleaning strategy, as described in the next section.

V. DATA PREPARATION STRATEGIES

In [46], Ypma and Duin comment on the usual unavailability of samples that accurately describe the faults in a system, and claim that the best solution is to accurately build a representation of the normal operation of the system and to measure faults as deviations from this normality. In [47], this problem is addressed using Vapnik's principle of never solving a problem that is more general than the one we are actually interested in.

If we are only interested in detecting novelty, it is not always necessary to estimate a full density model of the data. As pointed out earlier in this paper, a common approach to novelty detection (for some authors the genuine one!) is to treat the problem as a single-class modelling/classification problem, in which we are interested in building a good representation of a single restricted class and then creating a method to test whether novel events are members of this class. In this kind of method, the training set must ideally be free of outliers, even unknown ones. The supporters of this viewpoint argue that a novelty detection system may have its performance improved if associated with some mechanism of outlier cleaning.

The general methodology proposed in the previous section lends itself to automatic data

cleaning, removing anomalous (undesirable) vectors from the training set, and then retraining


the neural model with the cleaned data set. The proposed data cleaning procedure is detailed

next:

• Step 1: Choose a neural model and compute decision thresholds according to the general

methodology described in Section IV.

• Step 2: Apply novelty tests using the training data vectors. Obviously, for a well-trained

network, only a few of these vectors will be considered as novel ones.

• Step 3: Exclude those “abnormal” vectors from the original training set and retrain the

network with the new (cleaned) set.

In Section VI, we report simulations showing the benefits of this data cleaning procedure.
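The three steps above can be sketched generically as below; `train`, `score` and `threshold_fn` are placeholders for the chosen neural model and threshold method, here replaced by a trivial one-prototype model and box-plot fences purely for illustration.

```python
import numpy as np

def clean_and_retrain(X, train, score, threshold_fn):
    """Step 1: train and compute decision thresholds; Step 2: run the
    novelty test on the training vectors themselves; Step 3: drop the
    'abnormal' vectors and retrain on the cleaned set."""
    model = train(X)
    z = np.array([score(model, x) for x in X])
    lo, hi = threshold_fn(z)
    keep = (z >= lo) & (z <= hi)
    return train(X[keep]), X[keep]

def boxplot_fences(z):
    """Box-plot decision thresholds: Q1 - 1.5 IQR and Q3 + 1.5 IQR."""
    q1, q3 = np.percentile(z, [25, 75])
    return q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])  # planted outlier
model, X_clean = clean_and_retrain(
    X,
    train=lambda X: X.mean(axis=0),              # one-prototype 'model'
    score=lambda m, x: np.linalg.norm(x - m),    # quantization error
    threshold_fn=boxplot_fences)
assert len(X_clean) < len(X)                     # the outlier was removed
```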

 A. Data Scaling Strategies

Data scaling is one important issue that is usually underemphasized in applications of neural

methods to novelty detection. In this paper, we also evaluate the neural algorithms in this

respect, assessing the influence of different scaling methods on their performance. For this purpose, three techniques are utilized: two apply to the components of the input vectors individually, and one applies to the vectors as a whole.

• Soft Normalization - The distributions of the individual components, xj, j = 1, . . . , m, are standardized to zero mean and unit variance:

xjnew = (xj − x̄j) / σj    (14)

where the mean and standard deviation are computed over the m training samples:

x̄j = (1/m) Σmi=1 xj(i)    and    σj = sqrt( (1/(m − 1)) Σmi=1 (xj(i) − x̄j)² )    (15)

• Hard Normalization - The components xj are rescaled to the [0, 1] range:

xjnew = (xj − min(xj)) / (max(xj) − min(xj))    (16)

in which max(xj) and min(xj) are the maximum and minimum values of xj, respectively.

• Whitening and Sphering - The data vectors x are transformed to a new vector v whose components are uncorrelated and have unit variance. In other words, the covariance matrix of v equals the identity matrix, E{vvT} = I. This is usually performed


through the eigenvalue decomposition (EVD) of the covariance matrix Rx = E {xxT } of 

the original data vectors [20], [25].
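The three scaling strategies can be sketched as follows; equation numbers refer to the text, and the synthetic test data are illustrative.

```python
import numpy as np

def soft_norm(X):
    """(14)-(15): standardize each component to zero mean, unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def hard_norm(X):
    """(16): rescale each component to the [0, 1] range."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def whiten(X):
    """Whitening/sphering via the EVD of the sample covariance matrix,
    so that the transformed vectors satisfy E{vv^T} = I."""
    Xc = X - X.mean(axis=0)
    lam, E = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return (Xc @ E) / np.sqrt(lam)

X = np.random.default_rng(3).normal(size=(500, 3)) * [1.0, 4.0, 0.5]
assert np.allclose(soft_norm(X).mean(axis=0), 0.0)
assert hard_norm(X).min() == 0.0 and hard_norm(X).max() == 1.0
assert np.allclose(np.cov(whiten(X), rowvar=False), np.eye(3), atol=1e-8)
```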

VI. SIMULATIONS AND RESULTS

In this section we evaluate the performance of the neural network methods discussed in Section

III through simulations on a breast cancer data set [48], available through the UCI Machine

Learning Repository [49]. This dataset was chosen because biomedical applications require high accuracy due to the human factors involved. False positive and false negative errors in diagnosis have

different implications to the person being analyzed, but both should be reduced. Unsupervised

and supervised architectures were assessed under the proposed methodology by their robustness

to outliers and their sensitivity to training parameters, such as data scaling, number of neurons,

training epochs and size of the training set.

Let a cancer detection test be performed under the null hypothesis H 0 that the person is healthy

(i.e. normal behavior). If an actual cancer is not detected (false negative), the person will most probably go home and forget about his or her health for a while, until the next visit to the doctor. This is a serious problem, since the detection of a malignant tumor in the early stages of development is crucial for the success of the treatment. If a false cancer is detected (false positive), the person will probably make further investigations about the disease and will finally discover that the previous diagnosis was wrong. In this case, besides the additional costs of new exams, the person is exposed to undesirable psychological stress while waiting for the final results. For these reasons, we put more emphasis on false negative error rates, by virtue of their greater importance to the patient's health.

For SOM-based novelty detection, the output variable z t was the quantization error. For MLP-

and RBF-based novelty detectors, the output variable z t was the output of the network itself,

except for the Autoassociative MLP for which we selected the norm of the reconstruction error.

For all the neural algorithms, the decision thresholds were determined from the bootstrap samples

of zt using the following methods: p-value, box-plot, and BOOPI. Additionally, for SOM-based novelty detectors, Tanaka's method is also used to compute decision thresholds.

All tests were performed using a 1-dimensional SOM. The MLP consisted of a single hidden

layer of neurons trained with the standard backpropagation algorithm with a momentum term. The logistic function was adopted for all neurons. The RBF consisted of a first layer of Gaussian


basis functions whose centers ci were computed by the SOM algorithm. A single radius is defined for all the Gaussians, computed as a fraction of the maximum distance among all the centers, i.e. σ = dmax/√(2q), where q is the number of basis functions and dmax = max∀i≠j{‖ci − cj‖}. In the simulations, we are interested in the evaluation of the following issues:

• Novelty detection using the aforementioned neural network techniques.

• Performance improvement through the proposed outlier removal procedure.

• Performance sensitivity to different data preprocessing methodologies.

Unsupervised ANNs: The first set of simulations compares the novelty detection ability of 

the neural methods. The entire data set consisted of 699 nine-dimensional feature vectors, whose

attributes xi, i = 1, . . . , 9 are the following: clump thickness, uniformity of cell size, uniformity

of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal

nucleoli, and mitoses. All the attributes assume values within the range [1, 10]. The hard normalization method was used to rescale the data to the range [0, 1].

Sixteen instances containing a single missing attribute value were excluded from the original data set. From the remaining set of 683 vectors, 444 vectors corresponded to benign

tumors and 239 to malignant ones. From the total of 444 “normal” vectors, 355 of them (about

80%) were selected for training purposes. From the remaining 89 “normal” vectors, 30 of them

were replaced by “abnormal” vectors, randomly chosen from the set of 239 “abnormal” vectors.

The inclusion of abnormal vectors in the testing set was necessary in order to evaluate the

false negative (Error II) rates. If only examples of normal vectors were present in the testing

set, we could estimate only the false positive (Error I) rates. For each combination of neural

network model and decision threshold computation strategy, this procedure was repeated for 100

simulation runs, and the final error rates were averaged accordingly.

The false negative rates obtained for the SOM-based novelty detectors as a function of the number of neurons are shown in Figure 1. Each neural model in this figure was trained for 100 epochs. It can be noted that the pair (SOM, Box-plot) produced the lowest rates, followed very closely by the pair (SOM, p-value). The pairs (SOM, BOOPI) and (SOM, TANAKA) provided

the worst rates.

The second set of simulations evaluates the sensitivity of the SOM-based novelty detectors to

changes in the number of training epochs, as shown in Figure 2. The training parameters used


Fig. 1. False negative rates (in percentage) as a function of the number of neurons in the SOM (curves: BOOPI, p-value, Box-plot, Tanaka).

were the same as those used for the first set of simulations, except that the number of neurons

was set to 40. The overall performances remained the same as in Figure 1, with the pair (SOM,

Box-plot) achieving the lowest false negative rates.

As a double-threshold decision test, the box-plot method can detect outliers in regions of high quantization error (QE) as well as in regions of low QE. As discussed in Section III-D, the latter type of outlier (referred to as unknown outliers) can be the result of erroneous labelling. If unknown outliers are present in the training set, some neurons may be attracted to these spurious patterns, so that in the future some outliers will probably fire these neurons, giving low quantization errors. Only novelty decisions based on double-threshold methods, such as the box-plot or the BOOPI, can detect outliers in this case.

The pair (SOM, BOOPI), which in theory could also detect outliers in the low-QE region, obtained a performance better only than that of the pair (SOM, TANAKA). This may be due to the fact that the great majority of the outliers lie in the high-QE region, probably due to the low occurrence of unknown outliers (such as mislabelled normal data) in the training set, thus implicitly revealing the good quality of the data set.

It is interesting to note that the performance of the pair (SOM, TANAKA) gets worse as the number of training epochs increases. This may occur because of the very nature of


Fig. 2. False negative rates (in percentage) as a function of the number of training epochs (curves: BOOPI, p-value, Box-plot, Tanaka).

Tanaka’s test. Once the SOM network has more time to converge, it better fits the data manifold.

Then, we can observe that the quantization error enew tends to decrease even more, while the

novelty threshold ρ computed in (9) tends to stabilize, remaining constant. So, as the network 

achieves a better representation of the data, it becomes more and more rare to observe enew > ρ,

and hence the test is almost never positive for novelty, even when the presented data vector is

truly novel. This contradicts the common-sense notion that the better the representation of the data, the better the network's results.

Finally, another interesting conclusion drawn from Figures 1 and 2 is that, for a large number of 

neurons or a very long training period, the pairs (SOM, Box-plot) and (SOM, p-value) produced

very similar false negative rates. This may be due to the fact that, as the number of neurons of the SOM or the number of training epochs increases, the mean value of the quantization error decreases, so that few real outliers fall below the decision threshold computed according to the p-value method.

The third set of simulations compares the accuracy of SOM-based novelty detectors with

respect to the size of the training and testing sets. The purpose of this test is to give a rough

idea of which method requires less data to achieve high accuracy. In Figure 3 we observe that no relevant changes in performance occurred even as the sample size varied significantly. For


Fig. 3. False negative rates as a function of the training set size (curves: Box-plot, p-value, BOOPI).

these tests, the number of neurons and the number of training epochs were set to 40 and 100,

respectively.

Supervised ANNs: The same tests described above for SOM-based novelty detectors were

repeated here for the supervised methods (MLP, Autoassociative MLP, GMLP and RBF). The

first set of simulations evaluates the false negative rate as a function of the number of hidden

neurons. For these tests, each MLP network was trained for 1000 epochs with normal data

vectors only. The learning rate and the momentum factor were set to 0.35 and 0.5, respectively.

For clarity's sake, the results are shown only for the p-value (Figure 4) and the Box-plot (Figure

5) decision threshold methods, since they provided the best overall results. The best individual

performances were produced by the pairs (MLP, p-value) and (RBF, Box-plot). These figures also

illustrate that some methods of computing decision thresholds (e.g. the p-value) are unsuitable

for certain supervised neural networks (e.g. RBF).

To illustrate how the presence of outliers in the training set influences the performance of

unsupervised and supervised novelty detectors, we simulated the pairs (SOM, Box-plot) and

(MLP, p-value) on a training set that contains a given number of  fake outliers, i.e. originally

abnormal data vectors that we intentionally labelled as being normal ones. The result is shown in

Figure 6. For comparison purposes, we also simulated a standard MLP classifier for a two-class


Fig. 4. False negative rates for supervised novelty detectors (GMLP, MLP, RBF, AA) as a function of the number of hidden neurons, using the p-value decision threshold method.

Fig. 5. False negative rates for supervised novelty detectors (MLP, GMLP, AA, RBF) as a function of the number of hidden neurons, using the Box-plot decision threshold method.


(normal/abnormal) problem, using the WTA decision scheme. The pair (Two-class MLP, WTA)

was trained according to the guidelines presented in Section III-F, using the actual labels of the

data vectors.

Figure 6 shows how the false negative rates vary with the number of outliers. As expected,

the performance of the single-class methods, (SOM, Box-plot) and (MLP, p-value), deteriorates

with the presence of outliers, while the performance of the two-class methods improves. This

occurs because single-class methods learn erroneously to consider outliers as normal data vectors,

diminishing their sensitivity to novelty. For the two-class method, the sensitivity to novelty is

increased, since the classifier learns to separate better and better what is normal from what is

abnormal. However, this occurs only when more than 30% of the training patterns are abnormal

ones. Since it is generally unrealistic to have such a high proportion of abnormal vectors, the overall conclusion is that, if only a few abnormal patterns are available, the best thing to do is to exclude

them from the training data and to choose a single-class approach. Note that in Figure 6 the

performance of the single-class methods is much better than the two-class approach when the

percentage of outliers is lower than 10%.

TABLE I
BEST RESULTS OBTAINED FOR THE NOVELTY DETECTION TASK.

                               False Negative       False Positive
MODEL                          Mean    Variance     Mean    Variance
(RBF, Box-plot)                0.1     0.1          9.9     11.8
(MLP, p-value)                 0.6     0.5          3.4     3.5
(SOM, Box-plot)                2.0     1.0          3.7     2.9
(Novelty Filter, Box-plot)     3.3     41.0         5.2     11.9
(Two-Class MLP, WTA)           0.9     0.9          3.5     3.2

Finally, Table I presents the best results obtained for novelty detection on the dataset used in this paper. The best overall performance was obtained by the pair (RBF, Box-plot). Note that the result shown for the pair (Two-class MLP, WTA) is for a training set containing 50% abnormal vectors.


[Plot: x-axis, Percentage of Outliers in Data (0 to 50); y-axis, False Negative Rates (%); curves for (SOM, Box-plot), (MLP, P-value), and the MLP classifier.]

Fig. 6. False negative rates (in percentage) as a function of the percentage of outliers in the training set.

Data Cleaning: All novelty detection methods showed a considerable reduction in their false negative rates after application of the data cleaning procedure proposed in Section V, as shown in Figure 7 for the pair (SOM, Box-plot). The number of neurons was set to 40, and the number of training epochs was varied from 1 to 200. It is worth noting that training on a cleaned data set yielded the best performance of a SOM-based novelty detector, achieving a false negative rate below 3%.
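A cleaning pass of this kind can be sketched as follows (an illustrative reconstruction under the box-plot criterion, not necessarily the exact procedure of Section V): score each training vector with a detector fitted on the raw data, discard vectors whose errors fall above the box-plot upper whisker, and retrain on the remainder. The `centroid_err` scorer below is a hypothetical stand-in for the SOM quantization error.

```python
import numpy as np

def clean_training_set(X, error_fn, k=1.5):
    """Remove suspected outliers from X before the final training run.
    error_fn maps X to one error value per row (e.g. the distance to
    the winning SOM prototype of a map trained on the raw data)."""
    errors = error_fn(X)
    q1, q3 = np.percentile(errors, [25, 75])
    tau = q3 + k * (q3 - q1)            # box-plot upper whisker
    keep = errors <= tau
    return X[keep], int(np.count_nonzero(~keep))

# Toy example: distance to the data centroid as a stand-in error measure.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)),   # normal cluster
               rng.normal(8, 1, (5, 2))])    # a few injected outliers
centroid_err = lambda A: np.linalg.norm(A - A.mean(axis=0), axis=1)
X_clean, n_removed = clean_training_set(X, centroid_err)
print(n_removed)  # the injected outliers are among those removed
```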

Data Scaling: For the simulations carried out so far, we rescaled the data vectors with the hard normalization method. Soft normalization and data decorrelation were also tested, but for this particular data set they performed worse than hard normalization, as can be seen in Figure 8, where the number of neurons for the pair (SOM, Box-plot) is varied. The general conclusion we draw from these results is that different data scaling methods produce different error rates, and therefore several of them should be tested whenever possible.
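The three scaling strategies compared in Figure 8 can be sketched as follows (hedged reconstructions based on their usual definitions; the paper's exact formulas appear in its preprocessing section): hard normalization rescales each component to a fixed range, soft normalization standardizes each component to zero mean and unit variance, and the whitening transform additionally decorrelates the components.

```python
import numpy as np

def hard_normalize(X):
    """Min-max rescaling of each component to [0, 1] (one common
    definition of 'hard' normalization)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def soft_normalize(X):
    """Zero mean, unit variance per component (z-scoring)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def whiten(X):
    """PCA whitening: project onto the covariance eigenvectors and
    rescale, so the result has identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    return (Xc @ eigvec) / np.sqrt(eigval)

# Correlated toy data; after whitening its covariance is the identity.
rng = np.random.default_rng(2)
X = rng.normal(0, 1, (300, 3)) @ np.array([[2., 1, 0], [0, 1, 0], [0, 0, 3]])
Xw = whiten(X)
print(np.allclose(np.cov(Xw, rowvar=False), np.eye(3)))  # prints True
```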

VII. CONCLUSION AND FURTHER WORK

In this paper we have introduced a systematic methodology for comparing the performance of neural methods applied to novelty detection tasks. This methodology allowed us not only to evaluate the computational properties of both supervised and unsupervised algorithms on a common basis, but also paved the way for the proposal of a data cleaning strategy for outlier


[Plot: x-axis, Number of Training Epochs (1 to 200); y-axis, False Negative Rates (%); curves for After Data Cleaning and No Data Cleaning.]

Fig. 7. False negative rates for the pair (SOM, Box-plot) trained on the original and the cleaned data set as the number of 

training epochs varies.

[Plot: x-axis, Number of Neurons (2 to 50); y-axis, False Negative Rates (%); curves for Soft Normalization, Hard Normalization, and Whitening Transf.]

Fig. 8. False negative rates obtained by the pair (SOM, Box-plot) for three different data scaling methods, as the number of 

neurons varies.


removal. This outlier removal strategy proved very effective in reducing the false negative error rates of all the simulated neural methods.

The proposed methodology was also used to assess the effectiveness of existing decision threshold computation techniques used in conjunction with different neural network algorithms, such as the SOM, MLP, and RBF. The influence of different data scaling methods and the robustness of several neural-based novelty detectors to outliers in the training data were also evaluated.

Further work is under way to extend the applicability of the proposed methodology to novelty detection in time series data. For this purpose, we are currently evaluating several recurrent neural network architectures, such as the Elman network and the Recursive SOM, with different decision threshold computation methods. The chosen application is a computer network intrusion detection task, in which anomalous behavior is to be detected from the analysis of network traffic time series.

ACKNOWLEDGMENT

This work was developed with the financial support of CNPq (grant DCR:305275/2002-0). The first author also thanks FUNCAP for supporting his graduate studies.
