Robust outlier detection


Transcript of Robust outlier detection


Projectnummer: RSM-80820
BPA-nummer: 324-00-RSM/INTERN

Datum: 19-jul-00

Statistics Netherlands, Division Research and Development, Department of Statistical Methods

ROBUST MULTIVARIATE OUTLIER DETECTION

Peter de Boer and Vincent Feltkamp

Summary: Two robust multivariate outlier detection methods, based on the

Mahalanobis distance, are reported: the projection method and the Kosinski

method. The ability of those methods to detect outliers is exhaustively tested.

A comparison is made between the two methods as well as a comparison with

other robust outlier detection methods that are reported in the literature.

The opinions in this paper are those of the authors and do not necessarily reflect

those of Statistics Netherlands.


1. Introduction

The statistical process can be separated into three steps. The input phase involves the collection of data by means of surveys and registers. The throughput phase involves preparing the raw data for tabulation purposes, weighting and variance estimation. The output phase involves the publication of population totals, means, correlations, etc., which have come out of the throughput phase.

Data editing is one of the first steps in the throughput process. It is the procedure for

detecting and adjusting individual errors in data. Editing also comprises the

detection and treatment of correct but influential records, i.e. records that have a

substantial contribution to the aggregates to be published.

The search for suspicious records, i.e. records that are possibly wrong or influential,

can be done in basically two ways. The first way is by examining each record and

looking for strange or wrong fields or combinations of fields. In this view a record

includes all fields referring to a particular unit, be it a person, household or business

unit, even if those fields are stored in separate files, like files containing survey data

and files containing auxiliary data.

The second way is by comparing each record with the other records. Even if the

fields of a particular record obey all the edit rules one has laid down, the record

could be an outlier. An outlier is a record that does not follow the bulk of the records.

The data can be seen as a rectangular file, each row denoting a particular record and

each column a particular variable. The first way of searching for suspicious data can

be seen as searching in rows, the second way as searching in columns. It is remarked

that some and possibly many errors can be detected by both ways.

Records could be outliers while their outlyingness is not apparent by examining the

variables, or columns, one by one. For instance, a company that has a relatively

large turnover but that has paid relatively little taxes might be no outlier in either one

of the variables, but could be an outlier considering the combination. Outliers

involving more than one variable are multivariate outliers.

In order to quantify how far a record lies from the bulk of the data, one needs a

measure of distance. In the case of categorical data no useful distance measure

exists, but in the case of continuous data the so-called Mahalanobis distance is often

employed.

A distance measure should be robust against the presence of outliers. It is known

that the classical Mahalanobis distance is not. This means that the outliers, which are

to be detected, seriously hamper the detection of those outliers. Hence, a robust

version of the Mahalanobis distance is needed.

In this report two robust multivariate outlier detection algorithms for continuous

data, based on the Mahalanobis distance, are reported. In the next section the

classical Mahalanobis distance is introduced and ways to robustify this distance

measure are discussed. In sections 3 and 4 the two algorithms, successively the


Kosinski method and the projection method, are presented. In section 5 a

comparison between the two algorithms is made as well as a comparison with other

algorithms reported in the outlier literature. A practical example, and problems

involved with it, is the subject of section 6. In section 7 some concluding remarks

are made.

2. The Mahalanobis distance

The Mahalanobis distance is a measure of the distance between a point and the

center of all points, with respect to the scale of the data — and in the multivariate

case with respect to the shape of the data as well. It is remarked that in regression

analysis another distance measure is more convenient: instead of the distance

between a point and the center of the data, the distance between the point and the

regression plane (see also section 5).

Suppose we have a continuous data set $y_1, y_2, \ldots, y_n$. The vectors $y_i$ are p-dimensional, i.e. $y_i = (y_{i1} \; y_{i2} \; \ldots \; y_{ip})^t$, where $y_{iq}$ denotes a real number. The classical squared Mahalanobis distance is defined by

$$MD_i^2 = (y_i - \bar{y})^t C^{-1} (y_i - \bar{y})$$

where $\bar{y}$ and C denote the mean and the covariance matrix respectively:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad C = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})^t$$

In the case of one-dimensional data the covariance matrix reduces to the variance and the Mahalanobis distance to $MD_i = |y_i - \bar{y}| / \sigma$, where $\sigma$ denotes the standard deviation.

Another point of view results by noting that the Mahalanobis distance is the solution of a maximization problem. The maximization problem is defined as follows. The data points $y_i$ can be projected on a projection vector a. The outlyingness of the point $y_i$ is the squared projected distance $(a^t(y_i - \bar{y}))^2$, with respect to the projected variance $a^t C a$. Assuming that the covariance matrix C is positive definite, there exists a non-singular matrix A such that $A C A^t = I$. Using the Cauchy-Schwarz inequality we have

$$\frac{(a^t(y_i - \bar{y}))^2}{a^t C a}
= \frac{\left((A^{-t}a)^t\, A(y_i - \bar{y})\right)^2}{a^t C a}
\le \frac{\left(a^t A^{-1} A^{-t} a\right)\left((y_i - \bar{y})^t A^t A (y_i - \bar{y})\right)}{a^t C a}
= \frac{\left(a^t C a\right)\left((y_i - \bar{y})^t C^{-1} (y_i - \bar{y})\right)}{a^t C a}
= MD_i^2$$

with equality if and only if $A^{-t} a = c\, A(y_i - \bar{y})$ for some constant c. Hence

$$MD_i^2 = \sup_{a^t a = 1} \frac{(a^t(y_i - \bar{y}))^2}{a^t C a}$$

i.e., the Mahalanobis distance is equal to the supremum of the outlyingness of $y_i$ over all possible projection vectors.

If the data set $y_i$ is multivariate normal, the squared Mahalanobis distances $MD_i^2$ follow the $\chi^2$ distribution with p degrees of freedom.
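As an aside, the classical quantities above are straightforward to compute. The following minimal Python sketch (numpy and scipy are our choice of tools, not part of the original report, and the function names are ours) illustrates the definition and the chi-squared cutoff used later in this report.

import numpy as np
from scipy.stats import chi2

def classical_mahalanobis_sq(y):
    # squared classical Mahalanobis distance MD_i^2 of each row of y (an n x p array, p >= 2)
    y = np.asarray(y, dtype=float)
    center = y.mean(axis=0)                              # ybar
    cov = np.cov(y, rowvar=False)                        # C (divisor n-1)
    diff = y - center
    return np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)

def classical_outliers(y, alpha=0.01):
    # flag points whose MD_i^2 exceeds the chi-squared cutoff with p degrees of freedom
    md2 = classical_mahalanobis_sq(y)
    return md2 > chi2.ppf(1 - alpha, df=np.asarray(y).shape[1])

As discussed below, this classical version suffers from masking and swamping, which is why robust estimates of the center and the covariance matrix are needed.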

The classical Mahalanobis distance suffers however from the masking and

swamping effect. Outliers seriously affect the mean and the covariance matrix in

such a way that the Mahalanobis distance of outliers could be small (masking),

while the Mahalanobis distance of points which are not outliers could be large

(swamping).

Therefore, robust estimates of the center and the covariance matrix should be found

in order to calculate a useful Mahalanobis distance. In the univariate case the most

robust choice is the median (med) and the median of absolute deviations (mad)

replacing the mean and the standard deviation respectively. The med and mad have a

robustness of 50%. The robustness of a quantity is defined as the maximum

percentage of data points that can be moved arbitrarily far away while the change in

that quantity remains bounded.
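In code, the robust univariate outlyingness based on the med and the mad can be sketched as follows (Python; the 1.484 consistency factor and the 2.5 threshold are the ones used in sections 3.2 and 4.1, and the function name is ours).

import numpy as np

def robust_univariate_outlyingness(y):
    # |y_i - med| / (1.484 * mad): robust counterpart of |y_i - mean| / sigma
    y = np.asarray(y, dtype=float)
    med = np.median(y)
    mad = np.median(np.abs(y - med))
    return np.abs(y - med) / (1.484 * mad)

# example: flag points more than 2.5 robust standard deviations from the median
# outliers = robust_univariate_outlyingness(y) > 2.5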

It is not trivial to generalize the robust one-dimensional Mahalanobis distance to the

multivariate case. Several robust estimators for the location and scale of multivariate

data have been developed. We have tested two methods, the projection method and

the Kosinski method. Other methods for robust outlier detection will be discussed in

section 5, where we will compare the different methods on their ability to detect

outliers.

In the next two sections the Kosinski method and the projection method will be

discussed in detail.


3. The Kosinski method

3.1 The principle of Kosinski

The method discussed in this section was quite recently published by Kosinski. The

idea of Kosinski is basically the following:

1) start with a few, say g, points, denoted the “good” part of the data set;

2) calculate the mean and the covariance matrix of those points;

3) calculate the Mahalanobis distances of the complete data set;

4) increase the good part of the data set with one point by selecting the g+1 points

with the smallest Mahalanobis distance and define g=g+1;

5) return to step 2 or stop as soon as the good part contains more than half the data

set and the smallest Mahalanobis distance of the remaining points is higher than

a predefined cutoff value. At the end the remaining part, or the “bad” part,

should contain the outliers.

In order to assure that the good part will contain no outliers at the end, it is important

to start the algorithm with points which all are good. In the paper by Kosinski this

problem is solved by repetitively choosing a small set of random points, and

performing the algorithm for each set. The number of sets of points to start with is

taken high enough to be sure that at least one set contains no outliers.

We made two major adjustments to the Kosinski algorithm. The first one is the

choice of the starting data set. The demanded property of the starting data set is that

it contains no outliers. It does not matter how these points are found. We choose the

starting data set by robustly estimating the center of the data set and selecting the

p+1 closest points. In the case of a p-dimensional data set, p+1 points are needed to

get a useful starting data set, since the covariance matrix of a set of at most p points

is always non-invertible. A separation of the data set in p+1 good points and n-p-1

bad points is called an elemental partition.

The center is estimated by calculating the mean of the data set, neglecting all

univariate robustly detected outliers. This is of course just a rude estimation, but is

satisfactory for the purpose of selecting a good starting data set. Another rude

estimation of the center that was tried out was the coordinate-wise median. The

coordinate-wise median appeared to result in less satisfactory starting data sets.

The p+1 points closest to the mean are chosen, where closest is defined by an

ordinary distance measure. In order to take the different scales and units of the

different dimensions into account, the data set is coordinate-wisely scaled before the

mean is calculated, i.e. each component of each point is divided by the median of

absolute deviations of the dimension concerned. It is remarked that, after the first

p+1 points are selected the algorithm continues with the original unscaled data.

It is, of course, possible to construct a data set for which this algorithm fails to select

p+1 points that are all good points. However, in all the data sets exploited in this

report, artificial and real, this choice of a starting data set worked very well.


This adjustment results in a spectacular gain in computer time, since the algorithm

has to be run only once instead of more than once. Kosinski estimates the required

number of random starting data sets in his own original algorithm to be

approximately 35 in the case of 2-dimensional data sets, and up to 10000 in 10

dimensions.

The other adjustment is in the expansion of the good part. In the Kosinski paper the

increment is always one point. We implemented an increment proportional to the

good part already found, for instance 10%. This means that the good part is

increased with a factor of 10% each step. This speeds up the algorithm as well,

especially in large data sets. The original algorithm with one-point increment scales

with $n^2$, where n is the number of data points, while the algorithm with proportional increment scales with $n \ln n$. Also this adjustment was tested and appeared to be

very good.

In the remainder of this report, “the Kosinski method” denotes the adjusted Kosinski

method, unless otherwise noted.

3.2 The Kosinski algorithm

The purpose of the algorithm is, given a set of n multivariate data points $y_1, y_2, \ldots, y_n$, to calculate the outlyingness $u_i$ for each point i. The algorithm can be

summarized as follows.

Step 0. In: data set

The algorithm is started with a set of continuous p-dimensional data $y_1, y_2, \ldots, y_n$, where $y_i = (y_{i1} \ldots y_{ip})^t$.

Step 1. Choose an elemental partition

A good part of p+1 points is found as follows.

• Calculate the med and mad for each dimension q:

$$M_q = \operatorname{med}_k\, y_{kq}, \qquad S_q = \operatorname{med}_l\, |y_{lq} - M_q|$$

• Divide each component q of each data point i by the mad of the dimension concerned. The scaled data points are denoted by the superscript s:

$$y_{iq}^s = \frac{y_{iq}}{S_q}$$

• Declare a point to be a univariate outlier if at least one component of the data point is farther than 2.5 standard deviations away from the scaled median. The standard deviation is approximated by 1.484 times the mad (see section 4.1 for the background of the factor 1.484). So calculate for each component q of each point i:

$$u_{iq} = \frac{1}{1.484}\left| y_{iq}^s - \frac{M_q}{S_q} \right|$$

If $u_{iq} > 2.5$ for any q, then point i is a univariate outlier.

• Calculate the mean of the data set, neglecting the univariate outliers:

$$\bar{y}^s = \frac{1}{n_0}\sum_{i:\ y_i \text{ no outlier}} y_i^s$$

where $n_0$ denotes the number of points that are no univariate outliers.

• Select the p+1 points that are closest to the mean. Define those points to be the good part of the data set. So calculate:

$$d_i = \left\| y_i^s - \bar{y}^s \right\|$$

The g = p+1 points with the smallest $d_i$ form the good part, denoted by G (see the sketch below).
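A minimal Python sketch of this first step could look as follows (numpy only; the helper name elemental_partition and the guard against zero mads are ours, and this is not the Borland Pascal program described in section 3.3).

import numpy as np

def elemental_partition(y, z=2.5):
    # select the p+1 starting points of the adjusted Kosinski method (Step 1)
    y = np.asarray(y, dtype=float)
    n, p = y.shape
    med = np.median(y, axis=0)                       # M_q per dimension
    mad = np.median(np.abs(y - med), axis=0)         # S_q per dimension
    mad = np.where(mad > 0, mad, 1.0)                # guard against a zero mad (our assumption)
    ys = y / mad                                     # coordinate-wise scaled data
    u = np.abs(ys - med / mad) / 1.484               # univariate outlyingness per component
    keep = (u <= z).all(axis=1)                      # points that are no univariate outlier
    center = ys[keep].mean(axis=0)                   # mean of the scaled non-outliers
    d = np.linalg.norm(ys - center, axis=1)          # ordinary distance to that mean
    return np.argsort(d)[:p + 1]                     # indices of the p+1 closest points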

Step 2. Iteratively increase the good part

The good part is increased until a certain stop criterion is fulfilled.

• Continue with the original data set $y_i$, not with the scaled data set $y_i^s$.

• Calculate the mean and the covariance matrix of the good part:

$$\bar{y} = \frac{1}{g}\sum_{i \in G} y_i, \qquad C = \frac{1}{g-1}\sum_{i \in G} (y_i - \bar{y})(y_i - \bar{y})^t$$

• Calculate the Mahalanobis distance of all the data points:

$$MD_i^2 = (y_i - \bar{y})^t C^{-1} (y_i - \bar{y})$$

• Calculate the number of points with a Mahalanobis distance smaller than a predefined cutoff value. A useful cutoff value is $\chi^2_{p,1-\alpha}$, with $\alpha$ = 1%.

• Increase the good part with a predefined percentage (a useful percentage is 20%) by selecting the points with the smallest Mahalanobis distances, but not more than up to

a) half the data set if the good part is smaller than half the data set (g < h = [½(n+p+1)]);

b) the number of points with a Mahalanobis distance smaller than the cutoff if the good part is larger than half the data set.

• Stop the algorithm if the good part was already larger than half the data set and

no more points were added in the last iteration.

Step 3. Out: outlyingnesses

The outlyingness of each point is now simply the Mahalanobis distance of the point,

calculated with the mean and the covariance matrix of the good part of the data set.
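Steps 2 and 3 can be sketched along the same lines (again a simplified Python sketch, not the tested implementation; perc is the proportional increment of section 3.1 and alpha the cutoff level).

import numpy as np
from scipy.stats import chi2

def kosinski_outlyingness(y, start, perc=0.20, alpha=0.01):
    # forward search: grow the good part and return the final squared Mahalanobis distances
    y = np.asarray(y, dtype=float)
    n, p = y.shape
    cutoff = chi2.ppf(1 - alpha, df=p)
    h = (n + p + 1) // 2                             # half the data set
    good = np.asarray(start)                         # indices of the elemental partition
    while True:
        center = y[good].mean(axis=0)
        cov = np.cov(y[good], rowvar=False)
        diff = y - center
        md2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        below = int((md2 < cutoff).sum())            # points below the cutoff
        grown = int(np.ceil(len(good) * (1 + perc))) # proportional increment
        target = min(grown, h) if len(good) < h else min(grown, below)
        if len(good) >= h and target <= len(good):
            return md2                               # Step 3: outlyingness of every point
        good = np.argsort(md2)[:max(target, len(good) + 1)]

# typical use, continuing the Step 1 sketch:
#   md2 = kosinski_outlyingness(y, elemental_partition(y))
#   outliers = md2 > chi2.ppf(0.99, y.shape[1])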


3.3 Test results

A prototype/test program was implemented in a Borland Pascal 7.0 environment.

Documentation of the program is published elsewhere. We successively tested the

choice of the elemental partition by means of the mean, the amount of swamped

observations of data sets containing no outliers, the amount of masked and swamped

observations of data sets containing outliers, the algorithm with proportional

increment and the time-performance of the proportional increment of the good part

compared to the one-point increment. Finally, we tested the sensitivity of the

number of detected outliers to the cutoff value and the increment percentage in some

known data sets.

3.3.1 Elemental partition

First of all, the choice of the elemental partition was tested with the generated data

set published by Kosinski. The Kosinski data set is a kind of worst-case data set. It

contains a large amount of outliers (40% of the data) and the outliers are distributed

with a variance much smaller than the variance of the good points.

Before using the mean, we calculated the coordinate-wise median as a robust

estimator of the center of the data, and selected the three closest points. This strategy

failed. Although the median has a 50%-robustness, the 40% outliers strongly shift

the median. Hence, one of the three selected points appeared to be an outlier. As a

consequence, the forward search algorithm indicated all points to be good points, i.e.

all the outliers were masked.

This was the reason we searched for another robust measure of the location of the

data. One of the simplest ideas is to search for univariate outliers first, and to

calculate the mean of the points that are outlier in none of the dimensions.

The selected points, the three points closest to the mean, appeared all to be good

points. Moreover, the forward search algorithm, applied with this elemental

partition, successfully distinguished the outliers from the good points.

All following tests were performed using this “mean” to select the first p+1 points.

For all tested data sets the selected p+1 points appeared to be good points, resulting

in a successful forward search. It is possible, in principle, to construct a data set for

which this selection algorithm still fails, for instance a data set with a large fraction

of outliers which are univariately invisible and with no unambiguous dividing line

between the group of outliers and the group of good points. This is, however, a very

hypothetical situation.

3.3.2 Swamping

A simulation study was performed in order to determine the average fraction of

swamped observations in normal distributed data sets. In large data sets almost

always a few points are indicated to be an outlier, even if the whole data set nicely

follows a normal distribution. This is due to the cutoff value. If a cutoff value of $\chi^2_{p,1-\alpha}$ is used as discriminator between good points and outliers in a p-dimensional standard normal data set, a fraction of $\alpha$ data points will have a Mahalanobis distance larger than the cutoff value.


For each dimension p between 1 and 8 we generated 100 standard normal data sets of 100 points. The Kosinski algorithm was run twice on each data set, once with a cutoff value $\chi^2_{p,0.99}$, and once with $\chi^2_{p,0.95}$. Each point that is indicated to be an outlier is a swamped observation since there are no true outliers by construction. We calculated the average fraction of swamped observations (i.e. the number of swamped observations of each data set divided by 100, the number of points in the data set, averaged over all 100 data sets). Results are shown in Table 3.1.

α       p=1     2       3       4       5       6       7       8
0.01    0.015   0.011   0.010   0.008   0.008   0.008   0.007   0.007
0.05    0.239   0.112   0.081   0.070   0.059   0.052   0.045   0.042

Table 3.1. The average fraction of swamped observations of the simulations of 100 generated p-dimensional data sets of 100 points for each p between 1 and 8, with cutoff value $\chi^2_{p,1-\alpha}$.

For α = 0.01 the fraction of swamped observations is very close to the value of α itself. These results are very similar to the results of the original Kosinski algorithm. For α = 0.05, however, the average fraction of swamped observations is much larger than 0.05 for the lower dimensions, especially for p=1 and p=2. The reason for this is the following. Consider a one-dimensional standard normal data set. If the variance of all points is used, the outlyingness of a fraction of α points will be larger than $\chi^2_{1,1-\alpha}$. However, in the Kosinski algorithm the variance of all points but at least that fraction of α points with the largest outlyingnesses is calculated. This variance is smaller than the variance of all points. Hence, the Mahalanobis distances are overestimated and too many points are indicated to be an outlier. This is a self-magnifying effect. More outliers lead to a smaller variance, which leads to more points indicated to be an outlier, etc.

The effect is the strongest in one dimension. In higher dimensions the points with a large Mahalanobis distance are “all around”. Therefore they influence the variance in the separate directions less.

Apparently, the effect is quite strong for α = 0.05, but almost negligible for α = 0.01. In the remaining tests α = 0.01 is used, unless otherwise stated.

3.3.3 Masking and swamping

The ability of the algorithm to detect outliers was tested in another simulation. We

generated data sets in the same way as is done in the Kosinski paper in order to get a

fair comparison between the original and our adjusted Kosinski algorithm. Thus we

generated data sets of 100 points containing good points as well as outliers. Both the

good points and the outliers were generated from a multivariate normal distribution, with $\sigma^2 = 40$ for the good points and $\sigma^2 = 1$ for the bad points. The distance between


the center of the good points and the bad points is denoted by d. The vector between

the centers is along the vector of 1’s.

We varied the dimension (p=2, 5), the fraction of outliers (0.10-0.45), and the distance (d=20-60). We calculated the fraction of masked outliers (the number of

masked outliers of each data set divided by the number of outliers) and the fraction

of swamped points (the number of swamped points of each data set divided by the

number of good points), both averaged over 100 simulation runs for each set of

parameters p, d, and fraction of outliers. Results are shown in Table 3.2.

              p=2                                           p=5
fraction of   fraction of    fraction of      fraction of   fraction of    fraction of
outliers      masked obs.    swamped obs.     outliers      masked obs.    swamped obs.

              d=20                                          d=25
0.10          0.81           0.009            0.10          0.90           0.008
0.20          0.89           0.014            0.20          0.91           0.021
0.30          0.88           0.022            0.30          0.93           0.146
0.40          0.86           0.146            0.40          0.97           0.551
0.45          0.88           0.350            0.45          1.00           0.855

              d=30                                          d=40
0.10          0.03           0.011            0.10          0.00           0.008
0.20          0.00           0.011            0.20          0.04           0.008
0.30          0.01           0.010            0.30          0.03           0.022
0.40          0.05           0.043            0.40          0.02           0.020
0.45          0.01           0.019            0.45          0.01           0.014

              d=40                                          d=60
0.10          0.00           0.011            0.10          0.00           0.008
0.20          0.00           0.011            0.20          0.00           0.007
0.30          0.00           0.011            0.30          0.00           0.009
0.40          0.00           0.009            0.40          0.00           0.010
0.45          0.00           0.010            0.45          0.00           0.008

Table 3.2. Average fraction of masked and swamped observations of 2- and 5-dimensional data sets over 100 simulation runs. Each data set consisted of 100 points with a certain fraction of outliers. The good (bad) points were generated from a multivariate normal distribution with $\sigma^2 = 40$ ($\sigma^2 = 1$) in each direction. The distance between the center of the good points and the bad points is denoted by d.

The following conclusions can be drawn from these results. The algorithm is said to

be performing well if the fraction of masked outliers is close to zero and the fraction of swamped observations is close to α = 0.01. The first conclusion is: the larger the

distance between the good points and the bad points the better the algorithm

performs. This conclusion is not surprising and is in agreement with Kosinski’s

results. Secondly, the higher the dimension, the worse the performance of the

algorithm. In five dimensions the algorithm starts to perform well at d=40, and close

to perfect at d=60, while in two dimensions the performance is good at d=30,

respectively perfect at d=40. The original algorithm did not show such a dependence

on the dimension. It is remarked, however, that the paper by Kosinski does not give


enough details for a good comparison on this point. Third, for both two and five

dimensions the adjusted algorithm performs worse than the original algorithm. The

original algorithm is almost perfect at d=25 for both p=2 and p=5, while the adjusted

algorithm is not perfect until d=40 or d=60. This is the price that is paid for the large

gain in computer time. The fourth conclusion is: the performance of the algorithm is

almost not dependent on the fraction of outliers, in agreement with Kosinski’s

results. In some cases, the algorithm even seems to perform better for higher

fractions. This is however due to the relatively small number of points (100) per data

set. For very large data sets and a very large number of simulation runs this artifact

will disappear.

p    d    fr     inc    masked  swamped
2    20   0.10   1p     0.79    0.010
2    20   0.10   10%    0.80    0.009
2    20   0.10   100%   0.80    0.009
2    20   0.40   1p     0.86    0.225
2    20   0.40   10%    0.86    0.146
2    20   0.40   100%   0.89    0.093

2    30   0.10   1p     0.00    0.011
2    30   0.10   10%    0.03    0.011
2    30   0.10   100%   0.02    0.011
2    30   0.40   1p     0.05    0.042
2    30   0.40   10%    0.05    0.043
2    30   0.40   100%   0.08    0.038

2    40   0.10   1p     0.00    0.011
2    40   0.10   10%    0.00    0.011
2    40   0.10   100%   0.00    0.011
2    40   0.40   1p     0.00    0.010
2    40   0.40   10%    0.00    0.009
2    40   0.40   100%   0.02    0.009

5    40   0.10   1p     0.00    0.008
5    40   0.10   10%    0.00    0.008
5    40   0.10   100%   0.01    0.008
5    40   0.40   1p     0.01    0.016
5    40   0.40   10%    0.01    0.016
5    40   0.40   100%   0.06    0.035

Table 3.3. Average fraction of masked and swamped observations for p-dimensional data sets with a fraction of fr outliers on a distance d from the good points (for more details about the data sets see Table 3.2), calculated with runs with either one-point increment (1p) or proportional increment (10% or 100% of the good part).

3.3.4 Proportional increment

Until now all tests have been performed using the one-point increment, i.e. at each

step of the algorithm the size of the good part is increased with just one point. In

section 3.1 it was already mentioned that a gain in computer time is possible by

increasing the size of the good part with more than one point per step. The

simulations on the masked and swamped observations were repeated with the

proportional increment algorithm. The increment with a certain percentage was


tested for percentages up to 100% (which means that the size of the good part is

doubled at each step).

The results of Table 3.1, showing the average fraction of swamped observations in

outlier-free data sets, did not change. Small changes showed up for large

percentages in the presence of outliers. A summary of the results is shown in Table

3.3. In order to avoid an unnecessary profusion of data we only show the results for

p=2 in some relevant cases and, as an illustration, in a few cases for p=5.

A general conclusion from the table is that for a wide range of percentages the

proportional increment algorithm works satisfactorily. For a percentage of 100%

outliers are masked slightly more frequently than for lower percentages. The

differences between 10% increment and one-point increment are negligible.

3.3.5 Time dependence

To illustrate the possible gain with the proportional increment we measured the time

per run for p-dimensional data sets of n points, with p ranging from 1 to 8 and n

from 50 to 400. The simulations were performed with outlier-free generated data

sets so that the complete data sets had to be included in the good part. This was done

in order to obtain useful information about the dependence of the simulation times

on the number of points. Table 3.4 shows the results for the simulation runs with

one-point increment. The results for the runs with a proportional increment of 10%

are shown in Table 3.5.

n      p=1    2      3      4      5      6      7      8
50     0.09   0.18   0.29   0.45   0.64   0.84   1.08   1.35
100    0.36   0.68   1.05   1.75   2.5    3.3    4.3    5.5
200    1.46   2.8    4.6    7.0    10
400    6.2    12

Table 3.4. Time (in seconds) per run on p-dimensional data sets of n points, using the one-point increment.

n      p=1    2      3      4      5      6      7      8
50     0.05   0.10   0.16   0.23   0.31   0.39   0.52   0.62
100    0.14   0.24   0.39   0.56   0.76   1.00   1.25   1.55
200    0.33   0.60   0.92   1.35   1.90
400    0.80   1.40

Table 3.5. Time (in seconds) per run on p-dimensional data sets of n points, using the proportional increment (perc = 10%).

Let us denote the time per run as a function of n for fixed p by $t_p$, and the time per run as a function of p for fixed n by $t_n$. For the one-point increment simulations $t_p$ is approximately proportional to $n^2$. This is as expected since there are O(n) steps with an increment of one point and at each step the Mahalanobis distance has to be

calculated for each point (O(n)) and sorted (O(n ln n)). For the simulations with

proportional increment $t_p$ is approximately O(n ln n), due to the fact that only

O(ln n) steps are needed instead of O(n). As a consequence there is a substantial


gain in the time per run, ranging from a factor of 2 for 50 points up to a factor of 8

for 400 points.

The time per run for fixed n, $t_n$, is approximately proportional to $p^{1.5}$, for both one-point and proportional increment runs. The exponent 1.5 is just an empirical average over the range p=1..8 and is the result of several O(p) and O($p^2$) steps. Since the exponent is much smaller than 2 it is more efficient to search for outliers in one p-dimensional run than in ½p(p-1) 2-dimensional runs, one for each pair of dimensions, even if one is not interested in outliers in more than 2 dimensions. Consider for instance p=8, n=50. One run takes 0.62 seconds. However, a total of 1.4 seconds would be needed for the 28 runs, one for each pair of dimensions, each run taking 0.05 seconds.

3.3.6 Sensitivity to parameters

The Kosinski algorithm was tested on the twelve data sets described in section 5. A

full description of the outliers and a comparison of the results with the results of the

projection algorithm as well as with other methods described in the literature is

given in that section. In the present section we restrict the discussion to the

sensitivity of the number of outliers to the cutoff and the increment percentage.

The algorithm was run with a cutoff $\chi^2_{p,1-\alpha}$ for α = 1% as well as α = 5%. Furthermore, both one-point increment and proportional increment (in the range 0-40%) were used. The number of detected outliers of the twelve data sets is shown in Table 3.6.

It is clear that the number of outliers for a specific data set is not the same for each

set of parameters. It is remarked that, in all cases, if different sets of parameters lead

to the same number of outliers, the outliers are exactly the same points. Moreover, if

one set of parameters leads to more outliers than another set, all outliers detected by

the latter are also detected by the former (these are empirical results).

Let us first discuss the differences between the detection with α = 1% and with α = 5%. It is obvious that in many cases α = 5% results in slightly more outliers than α = 1%. However, in two cases the differences are substantial, i.e. in the Stackloss data and in the Factory data.

In the Stackloss data five outliers for α = 5% are found using moderate increments, while α = 1% shows no outliers at all. The reason for this difference is the relatively

small number of points related to the dimension of the data set. It has been argued

by Rousseeuw that the ratio n/p should be larger than 5 in order to be able to detect

outliers reliably. If n/p is smaller than 5 one comes to a point where it is not useful

to speak about outliers since there is no real bulk of data.

With n=21 and p=4 the Stackloss data lie on the edge of meaningful outlier

detection. Moreover, if the five points which are indicated as outliers with α = 5% are

left out, only 16 good points remain, resulting in a ratio n/p=4. In such a case any

outlier detection algorithm will presumably fail to find outliers consistently.


Data set                p    n     inc       α=5%   α=1%
1. Kosinski             2    100   1p        42     40
                                   ≤ 40%     42     40
2. Brain mass           2    28    1p        5      3
                                   ≤ 10%     5      3
                                   15-20%    4      3
                                   30-40%    3      3
3. Hertzsprung-Russel   2    47    1p        7      6
                                   ≤ 30%     7      6
                                   40%       6      6
4. Hadi                 3    25    1p        3      3
                                   ≤ 5%      3      3
                                   10%       3      0
                                   15-25%    3      3
                                   30%       3      0
                                   40%       3      3
5. Stackloss            4    21    1p        5      0
                                   ≤ 17%     5      0
                                   18-24%    4      0
                                   25-30%    1      0
                                   40%       0      0
6. Salinity             4    28    1p        4      2
                                   ≤ 30%     4      2
                                   40%       2      2
7. HBK                  4    75    1p        15     14
                                   ≤ 30%     15     14
                                   40%       14     14
8. Factory              5    50    1p        20     0
                                   ≤ 40%     20     0
9. Bush fire            5    38    1p        16     13
                                   ≤ 40%     16     13
10. Wood gravity        6    20    1p        6      5
                                   ≤ 20%     6      5
                                   30%       6      6
                                   40%       6      5
11. Coleman             6    20    1p        7      7
                                   ≤ 40%     7      7
12. Milk                8    85    1p        20     17
                                   ≤ 30%     20     17
                                   40%       18     15

Table 3.6. Number of outliers detected by the Kosinski algorithm with a cutoff of $\chi^2_{p,1-\alpha}$, for α = 1% and α = 5% respectively, with either one-point (1p) or proportional increment in the range 0-40%.


The Factory data is an interesting case. For α = 5% twenty outliers are detected, which is 40% of all points, while detection with α = 1% shows no outliers. Explorative data analysis shows that about half the data set is quite narrowly concentrated in a certain region, while the other half is distributed over a much larger space. There is however no clear distinction between these two parts. The more widely distributed part is rather a very thick tail of the other part. In such a case the effect that the algorithm with α = 5% tends to detect too many outliers, which was explained in the discussion of Table 3.1, is very strong. It is questionable whether the indicated points should be considered as outliers.

Let us now discuss the sensitivity of the number of detected outliers to the increment. At low percentages the number of outliers is always the same as for the one-point increment; in fact, at very low percentages the proportional increment procedure leads to an increment of just one point per step, making the two algorithms equal. For most data sets the number of outliers is constant for a wide range of percentages and starts to differ slightly only at 30-40% or higher. Three of the twelve data sets behave differently: the Brain mass data, the Hadi data, and the Stackloss data.

The Brain mass data shows 5 outliers at low percentages for α = 5%. At percentages around 15% the number of outliers is only 4 and at 30% only 3. So the number of outliers changes earlier (at 15%) than in most other data sets (at 30% or higher). For α = 1% the number of outliers is constant over the whole range. In fact, the three outliers which are found at 30-40% for α = 5% are exactly the same as the three outliers found for α = 1%. The two outliers which are missed at higher percentages for α = 5% both lie just above the cutoff value. Therefore it is disputable whether they are real outliers at all.

The Hadi data shows strange behavior. At all percentages for α = 5% and at most percentages for α = 1% three outliers are found. However, near 10% and near 30% no outliers are detected. Again, the three outliers are disputable. All have a Mahalanobis distance just above the cutoff (see Table 5.2). Hence it is not strange that sometimes these three points are included in the good part (the three points lie close together; hence, the inclusion of one of them in the good part leads to low Mahalanobis distances for the other two as well). On the other hand, it is also not a big problem, since it is rather a matter of taste than a matter of science to call the three points outliers or good points.

The Stackloss data shows a decreasing number of outliers for α = 5% at relatively low percentages, as in the Brain mass data. Here, the sensitivity to the percentage is related to the low ratio n/p, as discussed previously.

In conclusion, for increments up to 30% the same outliers are found as with the one-

point increment. In cases where this is not true, the supposed outliers always have an

outlyingness slightly above or below the cutoff, so that missing such outliers has no

big consequences. Furthermore, relatively low cutoff values could lead to

disproportionate swamping.


4. The projection method

4.1 The principle of projection

The projection method is based on the idea that outliers in univariate data are easily recognized, visually as well as by computational means. In one dimension the Mahalanobis distance is simply $|y_i - \bar{y}| / \sigma$. A robust version of the univariate outlyingness is found by replacing the mean by the med and replacing the standard deviation by the mad. Denoting the robust outlyingness by $u_i$, this leads to

$$u_i = \frac{|y_i - M|}{S}$$

where M and S denote the med respectively the mad:

$$M = \operatorname{med}_k\, y_k, \qquad S = \operatorname{med}_l\, |y_l - M|$$

In the case of multivariate data the idea is to “look” at the data set from all possible

directions and to “see” whether a particular data point lies far away from the bulk of

the data points. Looking in this context means projecting the data set on a projection

vector a; seeing means calculating the outlyingness as is done in univariate data. The

ultimate outlyingness of a point is just the maximum of the outlyingnesses over all

projection directions.

The outlyingness defined in this way corresponds to the multivariate Mahalanobis distance as is shown in section 2. Recalling the expression for the Mahalanobis distance:

$$MD_i^2 = \sup_{a^t a = 1} \frac{(a^t(y_i - \bar{y}))^2}{a^t C a}$$

Robustifying the Mahalanobis distance leads to

$$u_i = \sup_{a^t a = 1} \frac{|a^t y_i - M|}{S}$$

Now M and S are defined as follows:

$$M = \operatorname{med}_k\, a^t y_k, \qquad S = \operatorname{med}_l\, |a^t y_l - M|$$

It is remarked that $MD_i^2$ corresponds to $u_i^2$.

How is the maximum calculated? The outlyingness

$$\frac{|a^t y_i - M|}{S}$$

as a function of a could possess several local maxima, making gradient search methods unfeasible. Therefore the outlyingness is calculated on a grid of a finite number of projection vectors. The grid should be fine enough in order to calculate the maximum outlyingness with enough accuracy.

This robust measure of outlyingness was first developed by Stahel and Donoho.

More recent work on this subject has been reported by Maronna and Yohai. These

authors used the outlyingness in order to calculate a weighted mean and covariance

matrix. Outliers were given small weights so that the Stahel-Donoho estimator of the

mean was robust against the presence of outliers. It is of course possible to use the

weighted mean and covariance matrix to calculate a weighted Mahalanobis distance.

This is not done in the projection method discussed here.

The robust outlyingness $u_i$ was slightly adjusted for the following reason. The mad of univariate standard normal data, which has a standard deviation of 1 by definition, is 0.674 = 1/1.484. In order to assure that, in the limiting case of an infinitely large multivariate normal data set, the outlyingness $u_i^2$ is equal to the squared Mahalanobis distance, the mad in the denominator is multiplied by 1.484:

$$u_i = \sup_{a^t a = 1} \frac{|a^t y_i - M|}{1.484\, S}$$
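For one fixed projection vector a this outlyingness is easily written down; a small Python sketch (ours, not the Excel/Visual Basic prototype mentioned in section 4.3) is given below and is reused in the sketches of section 4.2.

import numpy as np

def outlyingness_along(y, a):
    # 1.484-scaled med/mad outlyingness of the points of y projected on direction a
    proj = np.asarray(y, dtype=float) @ np.asarray(a, dtype=float)   # the projections a^t y_i
    m = np.median(proj)                                              # M
    s = np.median(np.abs(proj - m))                                  # S (the mad)
    return np.abs(proj - m) / (1.484 * s)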

4.2 The projection algorithm

The purpose of the algorithm is, given a set of n multivariate data points $y_1, y_2, \ldots, y_n$, to calculate the outlyingness $u_i$ for each point i. The algorithm can be summarized as follows.

Step 0. In: data set

The algorithm is started with a set of continuous p-dimensional data $y_1, y_2, \ldots, y_n$, with $y_i = (y_{i1} \ldots y_{ip})^t$.

Step 1. Define a grid

There are $\binom{p}{q}$ subsets of q dimensions in the total set of p dimensions. The “maximum search dimension” q is predefined. Projection vectors a in a certain subset are parameterized by the angles $\theta_1, \theta_2, \ldots, \theta_{q-1}$:

$$a = \begin{pmatrix} \cos\theta_1 \\ \cos\theta_2 \sin\theta_1 \\ \cos\theta_3 \sin\theta_2 \sin\theta_1 \\ \vdots \\ \cos\theta_{q-1} \sin\theta_{q-2} \cdots \sin\theta_1 \\ \sin\theta_{q-1} \sin\theta_{q-2} \cdots \sin\theta_1 \end{pmatrix}$$

A certain predefined step size step (in degrees) is used to define the grid.

The first angle $\theta_1$ can take the values $i \cdot step_1$, with $step_1$ the largest angle smaller than or equal to step for which $180/step_1$ is an integer value, and with $i = 1, 2, \ldots, 180/step_1$.

The second angle can take the values $j \cdot step_2$, with $step_2$ the largest angle smaller than or equal to $step_1/\cos\theta_1$ for which $180/step_2$ is an integer value, and with $j = 1, 2, \ldots, 180/step_2$.

The r-th angle can take the values $k \cdot step_r$, with $step_r$ the largest angle smaller than or equal to $step_{r-1}/\cos\theta_{r-1}$ for which $180/step_r$ is an integer value, and with $k = 1, 2, \ldots, 180/step_r$.

Such a grid is defined in each subset of q dimensions.
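A Python sketch of this grid construction for a single q-dimensional subset is given below. It follows the rules above as reconstructed from the source; the absolute value and the lower bound on the cosine are our guards for angles at or beyond 90 degrees.

import numpy as np

def _snap(step):
    # largest angle <= step (in degrees) for which 180/angle is an integer
    return 180.0 / max(1, int(np.ceil(180.0 / step)))

def _angles_to_vector(angles_deg):
    # the projection vector a belonging to the angles theta_1 .. theta_{q-1}
    t = np.radians(angles_deg)
    a = np.empty(len(t) + 1)
    sin_prod = 1.0
    for r, tr in enumerate(t):
        a[r] = np.cos(tr) * sin_prod
        sin_prod *= np.sin(tr)
    a[-1] = sin_prod
    return a

def grid_directions(q, step):
    # all grid projection vectors for one q-dimensional subset
    if q == 1:
        return [np.array([1.0])]
    vectors = []

    def recurse(angles, prev_step):
        if len(angles) == q - 1:
            vectors.append(_angles_to_vector(angles))
            return
        if not angles:
            s = _snap(step)                              # step_1
        else:
            c = abs(np.cos(np.radians(angles[-1])))      # cos of the previous angle
            s = _snap(prev_step / max(c, 1e-12))         # step_r <= step_{r-1} / cos(theta_{r-1})
        for i in range(1, int(round(180.0 / s)) + 1):
            recurse(angles + [i * s], s)

    recurse([], step)
    return vectors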

Step 2. Outlyingness for each grid point

For each grid point a, calculate the outlyingness for each data point $y_i$:

• Calculate the projections $a^t y_i$.

• Calculate the median $M_a = \operatorname{med}_k\, a^t y_k$.

• Calculate the mad $L_a = \operatorname{med}_l\, |a^t y_l - M_a|$.

• Calculate the outlyingness $u_i(a) = \dfrac{|a^t y_i - M_a|}{1.484\, L_a}$.

Step 3. Out: outlyingness

The outlyingness $u_i$ is the maximum over the grid:

$$u_i = \sup_a u_i(a)$$
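Combining the two sketches above, Steps 1-3 could be assembled as follows (Python; itertools.combinations enumerates the subsets of dimensions).

import numpy as np
from itertools import combinations

def projection_outlyingness(y, q, step):
    # u_i: maximum of the 1.484-scaled med/mad outlyingness over all subsets and grid points
    # (uses outlyingness_along from the section 4.1 sketch and grid_directions from the Step 1 sketch)
    y = np.asarray(y, dtype=float)
    n, p = y.shape
    u = np.zeros(n)
    for subset in combinations(range(p), q):                # the (p choose q) subsets of dimensions
        cols = y[:, list(subset)]
        for a in grid_directions(q, step):                  # Step 1: the grid in this subset
            u = np.maximum(u, outlyingness_along(cols, a))  # Step 2: outlyingness per grid point
    return u                                                # Step 3: the maximum over the grid

# points whose u_i^2 exceeds the chi-squared cutoff are declared outliers, e.g.
#   from scipy.stats import chi2
#   outliers = projection_outlyingness(y, q=3, step=10) ** 2 > chi2.ppf(0.99, df=y.shape[1])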

4.3 Test results

A prototype/test program was implemented in an Excel/Visual Basic environment.

Documentation of the program is published elsewhere. We successively tested the

amount of swamped observations of data sets containing no outliers, the amount of

masked observations of data sets containing outliers, the time-dependence of the

algorithm on the parameters step and q, and the sensitivity of the number of detected

outliers to these parameters in some known data sets.


4.3.1 Swamping

A simulation study was performed in order to determine the average fraction of

swamped observations in normal distributed data sets. See section 3.3.2 for more

detailed remarks about the swamping effect and about generating the data sets. The

results of the simulations are shown in Table 4.1.

α     step    p=1     2       3       4       5
1%    10      0.010   0.011   0.016   0.018   0.023
5%    10      0.049   0.052   0.067   0.071   0.088
1%    30      0.010   0.010   0.012   0.011   0.012
5%    30      0.049   0.049   0.051   0.049   0.058

Table 4.1. The average fraction of swamped observations of the simulations on several generated p-dimensional data sets of 100 points, with cutoff value $\chi^2_{p,1-\alpha}$ and step size step. The parameter q is equal to p.

         p=2, q=2                  p=5, q=2                  p=5, q=5
fraction of  fraction of     fraction of  fraction of     fraction of  fraction of
outliers     masked obs.     outliers     masked obs.     outliers     masked obs.

         d=20                      d=30                      d=30
0.12     0.83                0.12     1.00                0.12     0.22
0.23     1.00                0.23     1.00                0.23     0.54
0.34     1.00                0.34     1.00                0.34     1.00
0.45     1.00                0.45     1.00                0.45     1.00

         d=40                      d=50                      d=50
0.12     0.00                0.12     0.00                0.12     0.00
0.23     0.00                0.23     0.67                0.23     0.00
0.34     0.62                0.34     1.00                0.34     0.65
0.45     1.00                0.45     1.00                0.45     1.00

         d=50                      d=80                      d=60
0.12     0.00                0.12     0.00                0.12     0.00
0.23     0.00                0.23     0.00                0.23     0.00
0.34     0.00                0.34     0.00                0.34     0.00
0.45     1.00                0.45     1.00                0.45     1.00

         d=90                      d=140                     d=120
0.12     0.00                0.12     0.00                0.12     0.00
0.23     0.00                0.23     0.00                0.23     0.00
0.34     0.00                0.34     0.00                0.34     0.00
0.45     0.00                0.45     0.00                0.45     0.00

Table 4.2. Average fraction of masked outliers of 2- and 5-dimensional generated data sets (see also section 3.3.3).

For low dimensions the average fraction of swamped observations tends to be almost equal to α. The fraction increases, however, with increasing dimension. This is due to the decreasing ratio n/p. It is remarkable that if the step size is 30 the fraction of swamped observations seems to be much better than for step size 10. This is just a coincidence. The fact that more observations are declared to be an outlier is


compensated by the fact that outlyingnesses are usually smaller if high step sizes are

used. In fact, the differences between step size 10 and 30 are so large for higher

dimensions that this is an indication that a step size of 30 could be too coarse to result in reliable outlyingnesses.

4.3.2 Masking and swamping

The ability of the projection algorithm to detect outliers was tested by generating

data sets that contain good points as well as outliers. See section 3.3.3 for details on

how the data sets were generated.

Results are shown in Table 4.2. In all cases, the ability to detect the outliers is

strongly dependent on the contamination of outliers. If there are many outliers, they

can only be detected if they lie very far away from the cloud of good points. This is

due to the fact that, although the med and the mad have a robustness of 50%, a large

concentrated fraction of outliers strongly shifts the med towards the cloud of outliers

and enlarges the mad.

In higher dimensions it is more difficult to detect the outliers, like in the Kosinski

method. The ability to detect the outliers depends also on the maximum search

dimension q. If q is taken equal to p fewer outliers are masked.

4.3.3 Time dependence

The time dependence of the projection algorithm on the step size step and the

maximum search dimension q is shown in Table 4.3.

n     p    q    step    t
400   2    2    36      13.0
400   2    2    18      21.0
400   2    2    9       32.7
400   2    2    4.5     56.8
400   3    3    36      28.1
400   3    3    18      68.6
400   3    3    9       209.1
400   3    3    4.5     719.3
50    5    2    9       26.3
100   5    2    9       50.1
200   5    2    9       107.7
400   5    2    9       202.9
100   2    2    9       8.0
100   3    2    9       19.3
100   4    2    9       33.5
100   5    2    9       50.1
100   6    2    9       71.4
100   7    2    9       98.9
100   8    2    9       128.0
100   5    1    9       5.9
100   5    2    9       50.1
100   5    3    9       479.8
100   5    4    9       2489.1
100   5    5    9       4692.1

Table 4.3. Time t (in seconds) per run on p-dimensional data sets of n points using maximum search dimension q and step size step (in degrees).

Asymptotically the time per run should be proportional to $\binom{p}{q}\left(\frac{180}{step}\right)^{q-1} n \ln n$, since for each of the $\binom{p}{q}$ subsets a grid is defined with a number of grid points of the order of $\left(\frac{180}{step}\right)^{q-1}$, and at each grid point the median of the projected points has to be calculated ($n \ln n$). The results in the table roughly confirm this theoretical

estimation. The most important conclusion from the table is that the time per run

strongly increases with the search dimension q. This makes the algorithm only

useful for relatively low dimensions.

4.3.4 Sensitivity to parameters

The projection method was tested with the twelve data sets that are fully described

in section 5, as is done with the Kosinski method (see section 3.3.6). The results

are shown in Table 4.4.

Let us first discuss the differences between α = 5% and 1%. In almost all cases the number of outliers detected with α = 5% is larger than with α = 1%. This is completely due to stronger swamping. It is remarked that there is no algorithmic dependence on the cutoff value, as there is in the Kosinski method. In the projection method a set of outlyingnesses is calculated and after the calculation a certain cutoff value is used in order to discriminate between good and bad points. Hence, a smaller cutoff value leads to more outliers, but all points still have the same outlyingness. In the Kosinski method the cutoff value is already used during the algorithm: the cutoff is used in order to decide whether more points should be added to the good part. A smaller cutoff leads not only to more outliers but also to a different set of outlyingnesses, since the mean and the covariance matrix are calculated with a different set of points. As a consequence, in cases where the Kosinski method possibly shows a rather strong sensitivity to the cutoff value, this sensitivity is missing in the projection method.

Now let us discuss the dependence of the number of outliers on the maximum search

dimension q. In the Hertzsprung-Russel data set and in the HBK data set the number

of outliers found with q=1 is already as large as found with higher values of q. In the

Brain mass data set and in the Milk data set, the number of outliers for q=1 is

however much smaller than for large values of q. In those cases, many outliers are

truly multivariate.

In the Hadi data set, the Factory data set and the Bush fire data set there is also a

rather large discrepancy between q=2 and q=3. It is remarked that the Hadi data set

was constructed so that all outliers were invisible looking at two dimensions only

(see section 5.2.4). Also in the other two data sets it is clear that many outliers can

only be found by inspecting three or more dimensions at the same time.

If q is higher than three, only slightly more outliers are found than for q=3.

Differences can be explained by the fact that searching in higher dimensions with

the projection method leads to more outliers (see section 4.3.1).


Data set                p    n     q    step   α=5%   α=1%
1. Kosinski             2    100   2    10     78     34
                                   2    20     77     34
                                   2    30     42     31
2. Brain mass           2    28    2    5      9      6
                                   2    10     9      4
                                   2    30     8      4
                                   1    n/a    3      1
3. Hertzsprung-Russel   2    47    2    1      7      6
                                   2    30     6      5
                                   2    90     6      5
                                   1    n/a    6      5
4. Hadi                 3    25    3    5      11     5
                                   3    10     8      0
                                   2    10     0      0
5. Stackloss            4    21    4    5      14     9
                                   4    10     10     9
                                   4    15     8      6
                                   4    20     9      7
                                   4    30     6      6
6. Salinity             4    28    4    10     12     8
                                   4    20     9      7
                                   3    30     6      4
7. HBK                  4    75    4    10     15     14
                                   4    20     14     14
                                   1    n/a    14     14
8. Factory              5    50    5    10     24     18
                                   5    20     14     9
                                   4    10     24     17
                                   3    10     22     14
                                   2    10     9      9
9. Bush fire            5    38    5    10     24     19
                                   5    20     19     17
                                   4    10     22     19
                                   3    10     21     17
                                   2    10     13     12
10. Wood gravity        6    20    5    20     14     14
                                   5    30     12     11
                                   3    10     15     14
11. Coleman             6    20    5    20     10     8
                                   5    30     4      4
12. Milk                8    85    5    20     18     14
                                   5    30     15     13
                                   4    20     16     14
                                   4    30     15     13
                                   3    20     15     13
                                   3    30     15     12
                                   2    20     13     11
                                   2    30     12     7
                                   1    n/a    6      5

Table 4.4. Number of outliers detected by the projection algorithm with a cutoff of $\chi^2_{p,1-\alpha}$, for α = 1% and α = 5% respectively, with maximum search dimension q and angular step size step (in degrees).


The sensitivity to the step size is not large in most cases. In cases like the Hadi data,

the Stackloss data, the Salinity data and the Coleman data, the sensitivity can be

explained by the sparsity of the data sets. A step size near 10-20 seems to work well

in most cases.

In conclusion, the number of outliers is not very sensitive to the parameters q and

step. However, the sensitivity is not completely negligible. In most practical cases

q=3 and step=10 work well enough.

5. Comparison of methods

In this section the projection method and the Kosinski method are compared with

each other as well as with other robust outlier detection methods. In section 5.1 we will

shortly describe some other methods reported in the literature. The comparison is

made by applying the projection method and the Kosinski method on data sets that

are analyzed by at least one of the other methods. Those data sets and the results of

the said methods are described in section 5.2. In section 5.3 the results are discussed.

Unfortunately, in most papers on outlier detection methods very little is said about

the efficiency of the methods, i.e. how fast the algorithms are and how it depends on

the number of points and the dimension of the data set. Therefore we restrict the

discussion to the abili ty to detect outliers.

5.1 Other methods

It is important to note that two different types of outliers are distinguished in the outlier literature. The first type, which is used in this report, is a point that lies far away from the bulk of the data. The second type is a point that lies far away from the regression plane formed by the bulk of the data. The two types will be denoted by bulk outliers and regression outliers respectively.

Of course, outliers are often so according to both points of view. That is why we

compare the results of the projection method and the Kosinski method, which are

both bulk outlier methods, also with regression outlier methods. An outlier that is

declared to be so from both points of view is called a bad leverage point. In the case that a

point lies far away from the bulk of the points but close to the regression plane it is

called a good leverage point.

Rousseeuw (1987, 1990) developed the minimum volume ellipsoid (MVE) estimator in order to robustly detect bulk outliers. The principle is to search for the ellipsoid, covering at least half the data points, for which the volume is minimal. The mean and the covariance matrix of the points inside the ellipsoid are inserted in the expression for the Mahalanobis distance. This method is costly due to the complexity of the algorithm that searches the minimum volume ellipsoid.

A related technique is based on the minimum covariance determinant (MCD)

estimator. This technique is employed by Rocke. The aim of this technique is to

search for the set of points, containing at least half the data, for which the

determinant of the covariance matrix is minimal. Again, the mean and the


covariance matrix, determined by that set of points, are inserted in the Mahalanobis

distance expression. Also this method is rather complex, although substantially

optimized by Rocke.

Hadi (1992) developed a bulk outlier method that is very similar to the Kosinski

method. He also starts with a set of p+1 “good” points and increases the good set

one point at a time. The difference lies in the choice of the first p+1 points. Hadi

orders the n points using another robust measure of outlyingness. The question

arises why that other outlyingness would not be appropriate for outlier detection. A

reason could be that an arbitrary robust measure of outlyingness deviates relatively

strongly from the “real” Mahalanobis distance.

Atkinson combines the MVE method of Rousseeuw and the forward search

technique also employed by Kosinski. A few sets of p+1 randomly chosen points are

used for a forward search. The set that results in the ellipsoid with minimal volume

is used for the calculation of the Mahalanobis distances.

Maronna employed a projection-like method, but slightly more complicated. The outlyingnesses are calculated as in the projection method. Then, weights are

assigned to each point, with low weights for the outlying points, i.e. the influence of

outliers is restricted. The mean and the covariance matrix are calculated using these

weights. They form the Stahel-Donoho estimator for location and scatter. Finally,

Maronna inserts this mean and this covariance matrix in the expression for the

Mahalanobis distance.

Egan proposes resampling by the half-mean method (RHM) and the smallest half-

volume method (SHV). In the RHM method several randomly selected portions of

the data are generated. In each case the outlyingnesses are calculated. For each point

it is counted how many times it has a large outlyingness. It is declared to be a true

outlier if this happens often. In the SHV method the distance between each pair of

points is calculated and put in a matrix. The column with the smallest sum of the

smallest n/2 distances is selected. The corresponding n/2 points form the smallest

half-volume. The mean and the covariance of those points are inserted in the

Mahalanobis distance expression.

The detection of regression outliers is mainly done with the least median of squares

(LMS) method. The LMS method is developed by Rousseeuw (1984, 1987, 1990).

Instead of minimizing the sum of the squares of the residuals in the least squares

method (which should rather be called the least sum of squares method in this

context) the median of the squares is minimized. Outliers are simply the points with

large residuals as calculated with the regression coefficients determined with the

LMS method.

Hadi (1993) uses a forward search to detect the regression outliers. The regression

coefficients of a small good set are determined. The set is increased by subsequently

adding the points with the smallest residuals and recalculating the regression

coefficients until a certain stop criterion is fulfilled. A small good set has to be found

beforehand.


Atkinson combines forward search and LMS. A few sets of p+1 randomly chosen

points are used in a forward search. The set that results in the smallest LMS is used

for the final determination of the regression residuals.

A completely different approach is the genetic algorithm for detection of regression

outliers by Walczak. We will not describe this approach here since it lies beyond the

scope of deterministic calculation of outlyingnesses.

Fung developed an adding-back algorithm for confirmation of regression outliers.

Once points are declared to be outliers by any other robust method, the points are

added back to the data set in a stepwise way. The extent to which the estimates of the

regression coefficients are affected by the adding-back of a point is used as a

diagnostic measure to decide whether that point is a real outlier. This method was

developed since robust outlier methods tend to declare too many points to be

outliers.

5.2 Known data sets

In this section the projection method and the Kosinski method are compared by

running both algorithms on the twelve data sets given in Table 5.1. Most of

these data sets are well described in the robust outlier detection literature. Hence, we

are able to compare the results of the two algorithms with known results.

The outlyingnesses as calculated by the projection method and the Kosinski method

are shown in Table 5.2, Table 5.4 and Table 5.5. In both methods the cutoff value

for α = 1% is used. In the Kosinski method a proportional increment of 20% was

used. The outlyingnesses of the projection method were calculated with q = p if p < 6

and q = 5 if p > 5, and with the lowest step size shown in Table 4.4.
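
The cutoff values quoted in the table headers (e.g. 3.035 for p = 2 and 4.482 for p = 8) can be reproduced from the χ²-distribution. A minimal check in Python, under the assumption that the cutoff equals the square root of the χ² quantile at probability 1-α:

import numpy as np
from scipy.stats import chi2

for p in (2, 3, 4, 5, 6, 8):
    print(p, round(float(np.sqrt(chi2.ppf(0.99, df=p))), 3))
# prints 3.035, 3.368, 3.644, 3.884, 4.1 and 4.482 for p = 2, 3, 4, 5, 6 and 8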

We will now discuss the data sets one by one.

Data set                p    n     Source
1. Kosinski             2    100   Ref. [1]
2. Brain mass           2    28    Ref. [3]
3. Hertzsprung-Russel   2    47    Ref. [3]
4. Hadi                 3    25    Ref. [4]
5. Stackloss            4    21    Ref. [3]
6. Salinity             4    28    Ref. [3]
7. HBK                  4    75    Ref. [3]
8. Factory              5    50    This work
9. Bush fire            5    38    Ref. [5]
10. Wood gravity        6    20    Ref. [6]
11. Coleman             6    20    Ref. [3]
12. Milk                8    85    Ref. [7]

Table 5.1. The name, the dimension p, the number of points n, and the source of the

tested data sets.

5.2.1 Kosinski data

The Kosinski data form a data set that is difficult to handle from a point of view of

robust outlier detection. The two-dimensional data set contains 100 points. Points 1-


40 are generated from a bivariate normal distribution with $\mu_1 = \mu_2 = -18$,

$\sigma_1^2 = \sigma_2^2 = 1$ and $\rho = 0$, and are considered to be outliers. Points

41-100 are good points and are a sample from the bivariate normal distribution with

$\mu_1 = \mu_2 = 0$, $\sigma_1^2 = \sigma_2^2 = 40$ and $\rho = 0.7$.
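
A data set with this structure can be generated as follows; this is a minimal sketch, and the random draw is of course not Kosinski's original one.

import numpy as np

rng = np.random.default_rng(0)
outliers = rng.multivariate_normal(mean=[-18.0, -18.0], cov=[[1.0, 0.0], [0.0, 1.0]], size=40)
cov_good = 40.0 * np.array([[1.0, 0.7], [0.7, 1.0]])   # variance 40 and correlation 0.7
good = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_good, size=60)
data = np.vstack([outliers, good])                      # points 1-40 are the outliers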

The Kosinski method correctly identifies all outliers (see Table 5.2). The projection

method identifies none of the outliers and declares many good points to be outliers.

The reason for this failure is the large contamination and the small scatter of the

outliers. Since there are so many outliers they strongly shift the median towards the

outliers. Hence, the outliers are not detected. Furthermore, since they are narrowly

distributed, they almost completely determine the median of absolute deviations in

the projection direction perpendicular to the vector pointing from the center of the

good points to the center of the outliers. Hence, many points, lying at the end points

of the ellipsoid of good points, have a large outlyingness.

It is remarked that this data set is not an arbitrarily chosen data set. It was generated

by Kosinski in order to demonstrate the superiority of his own method over other

methods.

5.2.2 Brain mass data

The Brain mass data contain three outliers according to the Kosinski method: points

6, 16 and 25. Those points are also indicated to be outliers by Rousseeuw (1990) and

Hadi (1992). Those authors also declare point 14 to be an outlier, but with an

outlyingness slightly above the cutoff. The projection method declares points 6, 14,

16, 17, 20 and 25 to be outliers.

5.2.3 Hertzsprung-Russel data

The two methods produce almost the same outlyingnesses for all points. Both

declare points 11, 20, 30 and 34 to be large outliers, in agreement with results by

Rousseeuw (1987) and Hadi (1993). However, the projection method and the

Kosinski method also declare points 7 and 14 to be outliers, and point 9 is an outlier

according to the Kosinski method. The outlyingness of these three points is

relatively small. Visual inspection of the data (see page 28 in Rousseeuw (1987))

shows that these points are indeed moderately outlying.

5.2.4 Hadi data

The Hadi data set is an artificial one. It contains three variables $x_1$, $x_2$ and $y$.

The two predictors were originally created as uniform (0,15) and were then

transformed to have a correlation of 0.5. The target variable was then created by

$y = x_1 + x_2 + \varepsilon$ with $\varepsilon \sim N(0,1)$. Finally, cases 1-3 were perturbed to have

predictor values around (15,15) and to satisfy $y = x_1 + x_2 + 4$.
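
A minimal sketch of how such a data set can be generated is given below; the particular transformation used here to impose the 0.5 correlation between the predictors is an illustrative choice of ours.

import numpy as np

rng = np.random.default_rng(1)
n = 25
u = rng.uniform(0, 15, size=(n, 2))
x1 = u[:, 0]
x2 = 0.5 * u[:, 0] + np.sqrt(0.75) * u[:, 1]        # correlation 0.5 with x1
y = x1 + x2 + rng.normal(0, 1, size=n)
x1[:3] = 15 + rng.normal(0, 0.1, size=3)            # perturb cases 1-3
x2[:3] = 15 + rng.normal(0, 0.1, size=3)
y[:3] = x1[:3] + x2[:3] + 4
data = np.column_stack([x1, x2, y])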

The Kosinski method finds the outliers, with a relatively small outlyingness. The

projection method finds these outliers too but also declares two good points to be

outliers.


A: Kosinski | Brain mass | Hertzsprung-Russel | Hadi
B: 3,035 | 3,035 | 3,035 | 3,368
C: Proj Kos | Proj Kos | Proj Kos | Proj Kos | Proj Kos

1 2,59 7,45 51 4,37 1,01 1 1,79 0,75 1 0,80 1,20 1 4,75 3,472 2,80 7,96 52 1,53 0,98 2 1,05 1,13 2 1,39 1,46 2 4,75 3,473 2,46 7,14 53 2,22 1,05 3 0,37 0,16 3 1,41 1,83 3 4,76 3,464 2,87 8,21 54 4,69 1,32 4 0,65 0,13 4 1,39 1,46 4 2,86 1,845 2,78 7,97 55 3,97 1,50 5 1,99 0,92 5 1,42 1,90 5 0,96 0,706 2,59 7,48 56 3,47 1,44 6 8,40 6,19 6 0,80 1,04 6 3,43 1,577 2,84 8,09 57 4,59 2,55 7 2,08 1,27 7 5,55 6,35 7 2,21 0,918 2,75 7,89 58 2,27 0,37 8 0,66 0,55 8 1,44 1,38 8 0,46 0,369 2,51 7,22 59 2,96 0,51 9 0,94 0,91 9 2,59 3,26 9 0,99 0,35

10 2,45 7,12 60 2,22 0,54 10 1,93 0,99 10 0,61 0,93 10 1,74 1,3411 2,69 7,71 61 4,94 1,83 11 1,23 0,51 11 11,01 12,67 11 2,50 1,6512 2,84 8,12 62 5,07 1,29 12 0,96 0,90 12 0,91 1,21 12 1,54 1,1313 2,77 7,95 63 4,66 1,13 13 0,64 0,60 13 0,79 0,88 13 2,81 1,2514 2,68 7,72 64 1,68 1,17 14 3,87 2,21 14 3,04 3,51 14 0,98 0,6815 2,37 6,95 65 3,32 1,03 15 2,22 1,44 15 1,55 1,22 15 2,65 1,3716 2,46 7,17 66 2,25 1,03 16 7,54 5,63 16 1,23 0,99 16 0,97 0,8417 2,64 7,59 67 2,59 1,13 17 3,18 1,83 17 2,17 1,80 17 3,31 1,6418 2,40 6,96 68 3,89 1,04 18 0,90 0,92 18 2,17 2,04 18 3,17 1,3919 2,46 7,11 69 1,82 0,88 19 3,00 1,43 19 1,77 1,54 19 2,78 1,4920 2,45 7,15 70 5,96 1,59 20 3,59 1,71 20 11,26 13,01 20 2,94 1,3721 2,70 7,71 71 2,29 0,70 21 1,54 0,66 21 1,35 1,07 21 0,90 0,6622 2,62 7,54 72 3,91 0,86 22 0,50 0,25 22 1,62 1,28 22 1,61 1,2723 2,82 8,11 73 2,15 1,30 23 0,66 0,74 23 1,60 1,41 23 3,89 1,3924 2,68 7,67 74 6,76 2,00 24 2,18 1,11 24 1,21 1,10 24 2,80 1,2225 2,37 6,88 75 6,20 2,01 25 8,97 6,75 25 0,34 0,58 25 2,04 1,1226 2,75 7,86 76 3,37 0,77 26 2,61 1,24 26 1,04 0,7827 2,67 7,70 77 2,67 0,49 27 2,59 1,41 27 0,88 1,0728 2,85 8,14 78 1,83 0,50 28 1,13 1,17 28 0,36 0,3329 2,78 7,98 79 4,19 2,45 29 1,43 1,6030 2,78 8,00 80 2,71 0,46 30 11,61 13,4831 2,45 7,14 81 4,49 1,12 31 1,36 1,0932 2,91 8,29 82 2,74 0,79 32 1,59 1,4833 2,51 7,27 83 1,62 0,31 33 0,49 0,5234 2,33 6,80 84 2,81 0,47 34 11,87 13,8835 2,68 7,72 85 5,94 1,57 35 1,50 1,5036 2,82 8,08 86 3,50 1,01 36 1,57 1,7037 2,52 7,31 87 1,38 1,93 37 1,27 1,1338 2,65 7,66 88 2,21 1,57 38 0,49 0,5239 2,49 7,18 89 5,47 1,73 39 1,14 1,0340 2,61 7,52 90 3,07 1,44 40 1,17 1,5241 1,89 0,50 91 2,94 1,54 41 0,88 0,6042 1,84 0,41 92 6,02 1,59 42 0,46 0,3043 7,94 2,03 93 3,65 0,80 43 0,81 0,7744 3,04 0,61 94 3,89 0,98 44 0,61 0,8045 2,35 0,67 95 6,68 1,64 45 1,17 1,1946 6,42 1,76 96 2,50 0,84 46 0,58 0,3747 5,36 1,68 97 4,59 1,32 47 1,41 1,2048 3,74 0,77 98 5,65 1,4649 3,92 0,92 99 2,12 1,6450 6,53 1,78 100 2,31 0,30

Table 5.2. The outlyingness of each point of the Kosinski, the Brain mass, the Hertzsprung-

Russel and the Hadi data. A: Name of data set. B: Cutoff value for α = 1%; outlyingnesses

higher than the cutoff are shown in bold. C: Method (Proj: projection method; Kos: Kosinski

method).


The projection method finds consistently larger outlyingnesses than the Kosinski

method, roughly a factor 2 for most points. This is related to the sparsity of the data

set. Consider for instance the extreme case of three points in two dimensions. Every

point will have an infinitely large outlyingness according to the projection method.

This can be understood by noting that the mad of the projected points is zero if the

projection vector intersects two points. The remaining point has an infinite

outlyingness. For data sets with more points the situation is less extreme. But as long

as there are relatively few points the projection outlyingnesses will be relatively

large. In such a case the cutoff values based on the $\chi^2$-distribution are in fact too

low, leading to the swamping effect.
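
A small numerical illustration of this argument: take three points in two dimensions and a projection direction in which two of them coincide after projection.

import numpy as np

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # three points in two dimensions
v = np.array([1.0, 0.0])                               # projection direction
proj = pts @ v                                         # projections: [0, 1, 0]
med = np.median(proj)                                  # 0
mad = np.median(np.abs(proj - med))                    # 0, since two projections coincide
# the outlyingness |proj - med| / mad of the remaining point is therefore infinite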

5.2.5 Stackloss data

The Stackloss data outlyingnesses show large differences between the two methods.

One of the reasons is the sensitivity of the Kosinski results to the cutoff value in this

case, as is discussed in section 3. If a cutoff value $\sqrt{\chi^2_{4,0.95}} = 3.080$ is used instead of

$\sqrt{\chi^2_{4,0.99}} = 3.644$, the Kosinski method shows outlyingnesses as in Table 5.3.

point  outl.    point  outl.    point  outl.
1      4.73     8      0.98     15     1.07
2      3.30     9      0.76     16     0.87
3      4.42     10     0.98     17     1.14
4      4.19     11     0.83     18     0.71
5      0.63     12     0.93     19     0.80
6      0.76     13     1.24     20     1.04
7      0.87     14     1.04     21     3.80

Table 5.3. The outlyingnesses of the Stackloss data, calculated with the Kosinski

method with cutoff value $\sqrt{\chi^2_{4,0.95}} = 3.080$. Outlyingnesses above this value are

shown in bold, outlyingnesses that are even higher than $\sqrt{\chi^2_{4,0.99}} = 3.644$ are shown

in bold italic.

Here 5 points have an outlyingness exceeding the cutoff value for α = 5%, four of

them (points 1, 3, 4 and 21) even above the value for α = 1%. Even in this case the

differences with the projection method are large. The projection outlyingnesses are

up to 5 times larger than the Kosinski ones.

For comparison, Walczak and Atkinson declared points 1, 3, 4 and 21 to be outliers,

Rocke also indicated point 2 as an outlier, while points 1, 2, 3 and 21 are outliers

according to Hadi (1992). These results are comparable with the results of the

Kosinski method with α = 5%. Hence, considering the results in Table 5.4, the

Kosinski method results in too few outliers, the projection method in too many. In

both cases the origin lies in the low n/p ratio.


A: Stackloss | Salinity | HBK | Factory
B: 3,644 | 3,644 | 3,644 | 3,884
C: Proj Kos | Proj Kos | Proj Kos | Proj Kos | Proj Kos

1 8,42 1,62 1 2,67 1,29 1 30,38 32,34 51 1,99 1,64 1 5,23 2,122 6,92 1,53 2 2,58 1,46 2 31,36 33,36 52 2,20 2,06 2 5,66 1,673 8,14 1,45 3 4,65 1,84 3 32,81 34,90 53 3,18 2,80 3 5,55 1,914 9,00 1,51 4 3,54 1,63 4 32,60 34,97 54 2,13 1,96 4 4,57 2,055 1,74 0,41 5 6,06 4,06 5 32,71 34,92 55 1,57 1,22 5 3,28 2,346 2,33 0,82 6 3,12 1,41 6 31,42 33,49 56 1,78 1,46 6 2,19 1,487 3,45 1,31 7 2,62 1,25 7 32,34 34,33 57 1,81 1,61 7 2,27 1,498 3,45 1,24 8 2,87 1,59 8 31,35 33,24 58 1,67 1,55 8 1,85 1,239 2,15 1,11 9 3,31 1,90 9 32,13 34,35 59 0,89 1,13 9 2,15 1,17

10 4,26 1,16 10 2,08 0,91 10 31,84 33,86 60 2,08 2,05 10 3,56 1,7011 3,01 1,11 11 2,76 1,24 11 28,95 32,68 61 1,78 1,99 11 3,64 1,8712 3,30 1,34 12 0,77 0,43 12 29,42 33,82 62 2,29 2,00 12 3,67 1,9913 3,25 1,01 13 2,36 1,28 13 29,42 33,82 63 1,70 1,70 13 2,24 1,4314 3,75 1,15 14 2,52 1,24 14 33,97 36,63 64 1,62 1,75 14 2,13 1,7915 3,90 1,20 15 3,71 2,16 15 1,99 1,89 65 1,90 1,85 15 1,84 1,2916 2,88 0,85 16 14,83 8,08 16 2,33 2,03 66 1,78 1,87 16 3,52 2,3417 7,09 1,78 17 3,68 1,60 17 1,65 1,74 67 1,34 1,20 17 2,42 1,7918 3,56 0,98 18 1,84 0,82 18 0,86 0,70 68 2,93 2,20 18 5,55 2,4919 3,07 1,04 19 2,93 1,79 19 1,54 1,18 69 1,97 1,56 19 5,65 1,7620 2,48 0,61 20 2,00 1,22 20 1,67 1,95 70 1,59 1,93 20 5,91 2,8321 8,85 2,11 21 2,50 0,95 21 1,57 1,76 71 0,75 1,01 21 4,35 1,90

22 3,34 1,23 22 1,90 1,70 72 1,00 0,83 22 2,20 1,6323 5,20 2,07 23 1,72 1,72 73 1,70 1,53 23 2,77 1,6224 4,62 1,90 24 1,70 1,56 74 1,77 1,80 24 2,14 0,9025 0,77 0,42 25 2,06 1,83 75 2,44 1,98 25 3,11 2,1326 1,80 0,87 26 1,73 1,80 26 2,27 1,3127 2,85 1,11 27 2,17 2,01 27 4,88 2,0228 3,72 1,48 28 1,41 1,13 28 5,08 2,67

29 1,33 1,13 29 4,49 2,5930 2,04 1,86 30 1,91 1,2731 1,61 1,53 31 1,13 0,8332 1,78 1,70 32 2,00 1,3433 1,55 1,45 33 3,13 2,0534 2,10 2,07 34 2,43 1,7035 1,41 1,80 35 5,96 2,8236 1,63 1,61 36 5,78 2,2537 1,75 1,87 37 5,75 1,8338 2,01 1,86 38 4,14 1,6239 2,16 1,93 39 3,16 2,1940 1,25 1,17 40 2,77 1,6241 1,65 1,81 41 2,75 1,8642 1,91 1,72 42 2,56 1,6743 2,50 2,17 43 4,54 2,1544 2,04 1,91 44 4,25 1,8945 2,07 1,86 45 3,91 2,1446 2,04 1,91 46 2,10 1,5247 2,92 2,56 47 1,06 0,8448 1,40 1,70 48 1,47 1,1049 1,73 2,01 49 3,34 2,1650 1,05 1,36 50 2,51 1,39

Table 5.4. The outlyingness of each point of the Stackloss, the Salinity, the HBK

and the Factory data. A, B, C: see Table 5.2.


A: Bush fire | Wood gravity | Coleman | Milk
B: 3,884 | 4,100 | 4,100 | 4,482
C: Proj Kos | Proj Kos | Proj Kos | Proj Kos | Proj Kos

1 3,48 1,38 1 4,72 2,65 1 3,56 2,84 1 9.06 9,46 51 2.62 1,982 3,27 1,04 2 2,71 1,20 2 4,92 6,37 2 10.57 10,81 52 3.64 2,983 2,76 1,11 3 3,68 2,19 3 6,76 2,94 3 4.04 5,09 53 2.38 2,224 2,84 1,02 4 14,45 33,75 4 2,99 1,53 4 3.86 2,83 54 1.22 1,165 3,85 1,40 5 3,02 2,80 5 2,70 1,43 5 2.23 2,52 55 1.68 1,696 4,92 1,90 6 16,19 38,83 6 5,74 10,43 6 2.97 2,84 56 1.10 1,017 11,79 4,37 7 7,90 5,00 7 3,11 2,23 7 2.36 2,35 57 1.96 2,198 17,96 11,87 8 15,85 37,88 8 1,48 1,83 8 2.32 2,08 58 2.05 1,959 18,36 12,18 9 6,12 2,72 9 2,49 5,95 9 2.58 2,49 59 1.47 2,21

10 14,75 7,64 10 8,59 2,37 10 5,71 12,04 10 2.20 1,98 60 2.04 1,7611 12,31 6,76 11 5,38 3,04 11 5,07 7,70 11 5.28 4,60 61 1.48 1,4212 6,17 2,38 12 6,79 2,65 12 4,31 2,77 12 6.65 6,05 62 2.64 2,0713 5,83 1,77 13 7,14 1,98 13 3,49 2,92 13 5.63 5,38 63 2.33 2,6014 2,30 1,59 14 2,38 2,09 14 1,95 2,16 14 6.17 5,48 64 2.58 1,9015 4,70 1,55 15 2,40 1,47 15 6,11 6,56 15 5.47 5,73 65 1.85 1,5616 3,43 1,38 16 4,74 2,86 16 2,18 2,30 16 3.84 4,56 66 2.01 1,6417 3,06 0,92 17 6,07 2,12 17 3,78 5,95 17 3.59 4,76 67 3.28 2,5918 2,75 1,41 18 3,28 2,49 18 7,86 3,09 18 3.74 3,30 68 2.41 2,3319 2,82 1,38 19 18,33 44,49 19 3,48 2,11 19 2.43 2,85 69 46.45 44,6120 2,89 1,20 20 7,16 2,07 20 2,80 1,56 20 4.14 3,44 70 1.99 1,8721 2,47 1,13 21 2.26 2,08 71 2.19 2,2722 2,44 1,73 22 1.69 1,59 72 3.24 3,0223 2,46 1,04 23 1.81 2,04 73 6.89 6,9924 3,44 1,04 24 2.28 2,05 74 5.01 4,9025 1,90 0,91 25 2.81 2,83 75 2.02 2,0326 1,69 0,97 26 1.83 2,09 76 4.77 4,5127 2,27 0,99 27 4.24 3,71 77 1.35 1,4328 3,31 1,35 28 3.29 3,04 78 1.49 1,8729 4,82 1,83 29 3.19 2,57 79 2.93 2,6630 5,06 2,18 30 1.47 1,39 80 1.40 1,3831 6,00 5,66 31 2.87 2,29 81 2.59 2,3432 13,48 14,08 32 2.37 2,66 82 2.14 2,4233 15,34 16,35 33 1.78 1,33 83 3.00 2,5634 15,10 16,11 34 2.09 1,96 84 3.88 3,0635 15,33 16,43 35 2.73 2,10 85 2.19 2,3636 15,02 16,04 36 2.66 2,3237 15,17 16,30 37 2.61 2,2338 15,25 16,41 38 2.23 2,07

39 2.27 2,0740 3.31 2,8941 10.63 10,1142 3.69 3,0443 3.20 2,8544 7.67 6,0845 1.99 2,2846 1.78 2,4147 5.19 5,3548 2.92 2,5849 3.43 2,7050 3.96 2,69

Table 5.5. The outlyingness of each point of the Bush fire, the Wood gravity, the

Coleman, and the Milk data. A, B, C: see Table 5.2.


5.2.6 Salinity data

The outlyingnesses of the Salinity data are roughly two times larger for the

projection method as compared to the Kosinski method. As a consequence, the latter

shows just 2 outliers (points 5 and 16), the former 8 points. Rousseeuw (1987) and

Walczak agree that the points 5, 16, 23 and 24 are outliers, with points 23 and 24

lying just above the cutoff. Fung initially finds the same points, but after

applying his adding-back algorithm he concludes that point 16 is the only outlier.

The projection method shows too many outliers, while the Kosinski method misses

points 23 and 24.

5.2.7 HBK data

In the case of the HBK data the projection method and the Kosinski method agree

completely. Both indicate points 1-14 to be outliers. This is also in agreement with

the results of the original Kosinski method and of Egan, Hadi (1992,1993), Rocke,

Rousseeuw (1987,1990), Fung and Walczak. It is remarked that some of these

authors only find points 1-10 as outliers, but they use the “regression” definition of

an outlier. The HBK data set is an artificial one, where the good points lie along a

regression plane. Points 1-10 are bad leverage points, i.e. they lie far away from the

center of the good points and from the regression plane as well. Points 11-14 are

good leverage points, i.e. although they lie far away from the bulk of the data they

still lie close to the regression plane. If one considers the distance from the

regression plane, the points 11-14 are not outliers.

5.2.8 Factory data

The Factory data set is a new one¹. It is given in Table 5.6.

The outlyingnesses show a big discrepancy between the two methods. The

projection outlyingnesses are much larger than the Kosinski ones, resulting in 18

versus 0 outliers. The outlyingnesses are so large due to the shape of the data. About

half the data set is quite narrowly concentrated around the center of the data, the

other half forms a rather thick tail. Hence, in many projection directions the mad is

very small, leading to large outlyingnesses for the points in the tail. It is remarked

that the projection outliers are quite comparable to the Kosinski outliers found with a

cutoff for α = 5% (see also section 3.3.6).

¹ The Factory data is a generated data set, originally used in an exercise on regression analysis in the CBS course “multivariate techniques with SPSS”. It is interesting to note that the regression coefficients change radically if the points that are indicated to be outliers by the projection method and the Kosinski method with low cutoff are removed from the data set. In other words, the regression coefficients are mainly determined by the “outlying” points.


x1 x2 x3 x4 x5 x1 x2 x3 x4 x5

1 14.9 7.107 21 129 11.609 26 12.3 12.616 20 192 11.4782 8.4 6.373 22 141 10.704 27 4.1 14.019 20 177 14.2613 21.6 6.796 22 153 10.942 28 6.8 16.631 23 185 15.3004 25.2 9.208 20 166 11.332 29 6.2 14.521 19 216 10.1815 26.3 14.792 25 193 11.665 30 13.7 13.689 22 188 13.4756 27.2 14.564 23 189 14.754 31 18 14.525 21 192 14.1557 22.2 11.964 20 175 13.255 32 22.8 14.523 21 183 15.4018 17.7 13.526 23 186 11.582 33 26.5 18.473 22 205 14.8919 12.5 12.656 20 190 12.154 34 26.1 15.718 22 200 15.459

10 4.2 14.119 20 187 12.438 35 14.8 7.008 21 124 10.76811 6.9 16.691 22 195 13.407 36 18.7 6.274 21 145 12.43512 6.4 14.571 19 206 11.828 37 21.2 6.711 22 153 9.65513 13.3 13.619 22 198 11.438 38 25.1 9.257 22 169 10.44514 18.2 14.575 22 192 11.060 39 26.3 14.832 25 191 13.15015 22.8 14.556 21 191 14.951 40 27.5 14.521 24 177 14.06716 26.1 18.573 21 200 16.987 41 17.6 13.533 24 186 12.18417 26.3 15.618 22 200 12.472 42 12.4 12.618 21 194 12.42718 14.8 7.003 22 130 9.920 43 4.3 14.178 20 181 14.86319 18.2 6.368 22 144 10.773 44 6 16.612 21 192 14.27420 21.3 6.722 21 123 15.088 45 6.6 14.513 20 213 10.70621 25 9.258 20 157 13.510 46 13.1 13.656 22 192 13.19122 26.1 14.762 24 183 13.047 47 18.2 14.525 21 191 12.95623 27.4 14.464 23 177 15.745 48 22.8 14.486 21 189 13.69024 22.4 11.864 21 175 12.725 49 26.2 18.527 22 200 17.55125 17.9 13.576 23 167 12.119 50 26.1 15.578 22 204 13.530

Table 5.6. The Factory data (n=50, p=5). The average temperature (x1, in degrees

Celsius), the production (x2, in 1000 pieces), the number of working days (x3), the

number of employees (x4) and the water consumption (x5, in 1000 liters) at a factory

in 50 successive months.

5.2.9 Bushfire data

The outliers found by the adjusted Kosinski method (points 7-11, 31-38) agree

perfectly with those found by the original algorithm of Kosinski and with the results

by Rocke and Maronna. The projection method shows as additional outliers points 6,

12, 13, 15, 29 and 30. Due to the large contamination the projected median is shifted

strongly, leading to relatively large outlyingnesses for the good points and,

consequently, many swamped points.

5.2.10 Wood gravity data

Rousseeuw (1984), Hadi (1993), Atkinson, Rocke and Egan declare points 4, 6, 8

and 19 to be outliers. The Kosinski method finds these outliers too, but additionally

declares point 7 to be an outlier. The projection method shows strange results. Fourteen points have an

outlyingness above the cutoff, which is 70% of the data set. This is of course not

realistic. The reason is again the sparsity of the data set. Hence, it is rather surprising

that the Kosinski method and the methods by other authors perform relatively well

in this case.

5.2.11 Coleman data

The Coleman data contain 8 outliers according to the projection method, 7 according

to the Kosinski method. However, they agree only upon 5 points (2, 6, 10, 11, 15).


The Kosinski method finds as additional outliers points 9 and 17, the projection

method points 3, 12 and 18. Only one author has searched for outliers in this data

set, to our knowledge. Rousseeuw (1987) declares points 3, 17 and 18 to be outliers.

A straightforward conclusion is difficult. None of the outliers is found by all three

methods. There is more agreement between the Kosinski method and the projection

method than between the Rousseeuw method and either of the other two.

However, it is possible that the original Kosinski method would give different results,

since the data set is very sparse. If the number of outliers is truly 7 or 8, the

contamination is also extremely large, since [½(n+p+1)]=13 should be the minimum

number of good points.

5.2.12 Milk data

The adjusted Kosinski method is in good agreement with the results of the original

Kosinski method and with the results of Rocke, which both give points 1-3, 12-17,

41, 44, 47, 69, 73 and 74 as outliers. The adjusted Kosinski method finds points 11

and 76 as additional outliers. Point 76 is also found by Atkinson, who misses point

69 and who finds point 27 as another additional outlier. The projection method

misses points 3, 16 and 17 compared to the Kosinski method. Hence, there is good

agreement between the several methods, while the disagreement concerns only the

points with an outlyingness just below or above the cutoff.

5.3 Discussion

In general, both the projection method and the Kosinski method show roughly the

same outliers as other methods. If there are any differences between methods, the

disagreement usually concerns points that have an outlyingness just below or just

above the arbitrarily chosen cutoff, i.e. points whose true outlyingness is

disputable.

In the case of sparse data sets the projection method tends to give too many outliers

and the Kosinski method too few. The Kosinski method is more reliable in the case

of very many outliers. If the distribution deviates from the normal distribution,

especially when there are thick tails, the projection method and the Kosinski method

with low cutoff declare many points in the tails to be outliers.

In almost all cases both the projection method and the Kosinski method show larger

outlyingnesses for points declared to be outliers by other methods than for points

declared to be good by those other methods. This holds even if the projection

method and/or the Kosinski method declare too few points to be outliers or declare too

many good points to be outliers. This means that the ordering of the Mahalanobis

distances is, roughly, similar across methods. Exceptions only occur in the case of

very sparse or very contaminated data sets.


6. A practical example

As an illustration we will show some results in a practical case, a file with VAT data

of the retail trade in 1996. This file contains 87376 companies with data on several

VAT entries. A complete search on statistical outliers would meet the problem that

many cells are filled with zeroes. In the VAT file there are many different VAT

entries. Many companies show zeroes for most entries. In many SBI classes this is

even true for one of the two most important variables, turnover of goods with a high

VAT rate and turnover of goods with a low VAT rate, i.e. in many classes almost all companies have

a zero on either turnover high or turnover low. Hence, the file is mainly filled with

zeroes, with non-zero values in only a few places.

If this fact is neglected, application of the projection method or the Kosinski method

will lead to strange results. The projection method will show a zero median of

absolute deviations in cases where more than half the records show a zero for a

particular variable. The Kosinski method will show non-invertible covariance

matrices in those cases. From the point of view of distributions, this is due to the fact

that data with many zeroes and few non-zeroes deviates extremely strongly from the

normal distribution. From the point of view of the definition of an outlier, if more

than half the records show a zero for a particular variable, all records that do not

show a zero should be considered outliers.

The presence of a zero is, in fact, often (but not necessarily always) due to an

implicit categorical variable. If a particular variable is zero (non-zero) for a

particular company, one could say that a hypothetical categorical variable indicates that this

company has no (has a) contribution to this variable. As was previously mentioned, the

projection method and the Kosinski method can only be used for numerical

continuous variables, not for categorical variables. Hence, the algorithms should be

used with care. Searching for outliers is only useful if appropriate categories and

combinations of variables are selected.

If one would still like to apply an outlier detection method to a file like the VAT file

at once (with the advantage of searching in one run), the following adjustment to the

Kosinski method is a possible solution: simply neglect the zero-value cells in all

summations in the expressions of the mean, the covariance matrix and the

Mahalanobis distance. The expressions will then be:

$\bar{y}_j = \frac{1}{n_j} \sum_{i:\, y_{ij} \neq 0} y_{ij}$

$C_{jk} = \frac{1}{n_{jk} - 1} \sum_{i:\, y_{ij} \neq 0,\; y_{ik} \neq 0} (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k)$

$MD_i^2 = \sum_{j,k = 1;\; y_{ij} \neq 0,\; y_{ik} \neq 0}^{p} (y_{ij} - \bar{y}_j)\,(C^{-1})_{jk}\,(y_{ik} - \bar{y}_k)$


with $n_j$ denoting the number of points for which $y_{ij}$ is non-zero and $n_{jk}$ denoting the

number of points for which both $y_{ij}$ and $y_{ik}$ are non-zero. This possibility is promising if

one assumes that the presence of a zero is not strongly correlated to the magnitude of

other variables. The Kosinski method including the alternative expressions is worth

further examination. Unfortunately, such a simple adjustment to the projection

method is not possible since zeroes disappear in projection directions that are not

parallel to one of the axes.
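
As an illustration, a minimal Python sketch of this adjusted calculation could look as follows; function and variable names are ours, and no claim is made that this reproduces Kosinski's implementation.

import numpy as np

def zero_skipping_mahalanobis(y):
    # y: (n, p) array in which a zero cell means 'no contribution'.
    n, p = y.shape
    nonzero = y != 0
    n_j = nonzero.sum(axis=0)                          # number of non-zero cells per variable
    ybar = y.sum(axis=0) / n_j                         # zero cells add nothing to the sums
    C = np.empty((p, p))
    for j in range(p):
        for k in range(p):
            both = nonzero[:, j] & nonzero[:, k]       # points where both variables are non-zero
            n_jk = both.sum()
            C[j, k] = ((y[both, j] - ybar[j]) * (y[both, k] - ybar[k])).sum() / (n_jk - 1)
    Cinv = np.linalg.inv(C)
    md = np.empty(n)
    for i in range(n):
        mask = nonzero[i]                              # restrict the quadratic form to non-zero cells
        d = y[i, mask] - ybar[mask]
        md[i] = np.sqrt(d @ Cinv[np.ix_(mask, mask)] @ d)
    return md                                          # square root of MD_i^2 given above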

So for the moment, for a successful application of the methods, variables have to be

selected carefully. As an example, we search for outliers only among the companies in SBI 5211.

We only take the variables “annual turnover high” and “annual turnover low” into

account. We choose SBI 5211 since almost all companies in this class show

substantial contributions for both variables, so that a two-dimensional search for

outliers makes sense.

Since some companies make a declaration once a month or once a quarter, the sum

of the declarations was calculated for each company. So a new file was created

containing 3755 companies and three variables: turnover high, turnover low, and

size class. Outliers were searched per size class. The number of companies in each

size class is shown in Table 6.1. Size classes 7, 8 and 9 contain few companies,

making an outlier search useless. These classes were combined with class 6, merely

as an example.

                       cutoff=50.0          cutoff=9.210
size class     n       #good    #outl.      #good    #outl.
0              748     733      15          609      139
1              952     942      10          835      117
2              696     685      11          623      73
3              417     412      5           357      60
4              484     482      2           439      45
5              420     418      2           382      38
6              30      30       0           30       0
7              5
8              2
9              1
6-9            38      35       3           33       5

Table 6.1. Number of companies, number of good points and number of outliers for

cutoff values 50.0 and 9.210 respectively, for each size class and for the combined size

classes 6-9 in SBI 5211, as found with the Kosinski method.

Outliers were searched for with the Kosinski method as well as with the projection

method. The methods showed roughly the same results. Therefore only the results of

the Kosinski method are discussed here.

Outliers were searched for with two different cutoff values. Results are shown in figures

6.1, 6.2 and 6.3. First a cutoff value $\chi^2_{2,0.99} = 9.210$ ($\sqrt{\chi^2_{2,0.99}} = 3.035$)

was used.

It appeared that many companies were indicated to be outliers, roughly 10-20% in


each size class. The reason for this phenomenon is the distribution of the data, which

deviates from the normal distribution rather strongly. The data show very thick tails,

compared to the variance, which is small due to the large amount of data in the

neighborhood of the origin. For this reason a second search with a much larger

cutoff value (50.0) was performed. This led to more realistic numbers of outliers.

7. Conclusions

Both the projection method and the Kosinski method are well able to detect

multivariate outliers. In the case of strong contamination it is slightly more difficult

to find the outliers with the projection method than with the Kosinski method. The

Kosinski method tends to overestimate the number of outliers rather strongly in

the case of a low cutoff value. At a given cutoff value, the number of outliers is

slightly more sensitive to the tunable parameters in the case of the projection method

than in the case of the Kosinski method.

The time dependence of the two algorithms on the number of points in a data set is

roughly the same, i.e. they are both roughly proportional to n ln n. It is remarked that

the absolute times per run, as shown in this report, cannot be compared due to the

different implementations.

The time dependence on the dimension of the data set is much worse for the

projection method than for the Kosinski method. In the case of the Kosinski method,

the time per run is between linear and quadratic in the dimension. In the case of the

projection method, it is either exponential if the maximum search dimension is taken

equal to the dimension itself or cubic if a moderate maximum search dimension of

three is chosen.

As far as the ability to detect outliers and the expected time performance are

concerned, there is not a strong preference for either of the two methods if outlier

detection is restricted to 2 or at most 3 dimensions. Since it is expected that for large

data sets in higher dimensions the projection method could lead to undesirably large

run times, the Kosinski method is the recommended method if a multivariate outlier

detection method in high dimensions is required.


References

[1] A.S. Kosinski, Computational Statistics & Data Analysis 29, 145 (1999).

[2] D.M. Rocke and D.L. Woodruff, Journal of the American Statistical Association

91, 1047 (1996).

[3] P.J. Rousseeuw and A.M. Leroy, Robust regression & outlier detection (Wiley,

NY, 1987).

[4] A.S. Hadi and J.S. Simonoff, Journal of the American Statistical Association 88,

1264 (1993).

[5] R.A. Maronna and V.J. Yohai, Journal of the American Statistical Association

90, 330 (1995).

[6] P.J. Rousseeuw, Journal of the American Statistical Association 79, 871 (1984).

[7] J.J. Daudin, C. Duby, and P. Trecourt, Statistics 19, 241 (1988).

[8] P.J. Rousseeuw and B.C. van Zomeren, Journal of the American Statistical

Association 85, 633 (1990).

[9] A.S. Hadi, J.R. Statist. Soc. B 54, 761 (1992).

[10] A.C. Atkinson, Journal of the American Statistical Association 89, 1329 (1994).

[11] B. Walczak, Chemometrics and Intelligent Laboratory Systems 28, 259 (1995).

[12] W.-K. Fung, Journal of the American Statistical Association 88, 515 (1993).

[13] W.J. Egan and S.L. Morgan, Anal. Chem. 70, 2372 (1998).

[14] W.A. Stahel, Research Report 31, Fachgruppe für Statistik, E.T.H. Zürich

(1981).

[15] D.L. Donoho, Ph.D. qualifying paper, Harvard University (1982).
