8/2/2019 A Deterministic Algorithm for the MCD
SECTION OF STATISTICS
DEPARTMENT OF MATHEMATICS
KATHOLIEKE UNIVERSITEIT LEUVEN

TECHNICAL REPORT
TR-10-01

A DETERMINISTIC ALGORITHM FOR THE MCD

Hubert, M., Rousseeuw, P.J., Verdonck, T.

http://wis.kuleuven.be/stat/
A deterministic algorithm for the MCD

Mia Hubert
Department of Mathematics, Katholieke Universiteit Leuven
and
Peter J. Rousseeuw
Department of Mathematics, Katholieke Universiteit Leuven
and
Tim Verdonck
Department of Mathematics and Computer Science, Universiteit Antwerpen

January 4, 2010
Abstract
The minimum covariance determinant (MCD) method is a robust estimator of multivariate location and scatter (Rousseeuw, 1984). The MCD is highly resistant to outliers, and it is often applied by itself and as a building block for other robust multivariate methods. Computing the exact MCD is very hard, so in practice one resorts to approximate algorithms. Most often the FASTMCD algorithm of Rousseeuw and Van Driessen (1999) is used. This algorithm starts by drawing many random subsets, followed by so-called concentration steps. The FASTMCD algorithm is affine equivariant but not permutation invariant. In this article we present a deterministic algorithm, denoted as DetMCD, which does not use random subsets and is even faster. It is permutation invariant and very close to affine equivariant. We illustrate DetMCD on real and simulated data sets, with applications involving principal component analysis, multivariate regression, and classification. Supplemental material (Matlab code of the DetMCD algorithm and the data sets) is available online.
Keywords: affine equivariance, covariance, outliers, multivariate, robustness.
1 Introduction
The Minimum Covariance Determinant (MCD) method (Rousseeuw, 1984) is a highly robust
estimator of multivariate location and scatter. Given an n × p data matrix X = (x₁, . . . , xₙ)ᵀ with
xᵢ = (x_{i1}, . . . , x_{ip})ᵀ, its objective is to find the h observations (with n/2 ≤ h ≤ n) whose covariance matrix has the lowest determinant. The MCD estimate of location is then the average of these h points, and the scatter estimate is a multiple of their covariance matrix. Consistency and asymptotic normality of the MCD estimator have been shown by Butler et al. (1993) and Cator and Lopuhaä (2009). The MCD has a bounded influence function (Croux and Haesbroeck, 1999). The breakdown value is the smallest fraction of contamination that can have an arbitrarily large effect on the estimator. The MCD estimator has the highest possible breakdown value (i.e. 50%) when h = ⌊(n + p + 1)/2⌋ (Lopuhaä and Rousseeuw, 1991). In practice we often do not need the maximal breakdown value, and therefore typically h = 0.75n is chosen, yielding a breakdown value of 25%, which is sufficiently robust for most applications.
In addition to being highly resistant to outliers, the MCD is affine equivariant, i.e. the estimates behave properly under affine transformations of the data. To be precise, the estimators μ̂ and Σ̂ are affine equivariant if for any n × p data set X it holds that

    μ̂(XA + 1ₙvᵀ) = μ̂(X)A + vᵀ        (1)
    Σ̂(XA + 1ₙvᵀ) = Aᵀ Σ̂(X) A          (2)

for all nonsingular p × p matrices A and all p × 1 vectors v. The vector 1ₙ denotes (1, 1, . . . , 1)ᵀ with n entries. Affine equivariance makes the analysis independent of the measurement scales of the variables, as well as of translations and rotations of the data.
Although the MCD was already introduced in 1984, its practical use only became feasible with the introduction of the computationally efficient FASTMCD algorithm of Rousseeuw and Van Driessen (1999). Since then the MCD has been applied in various fields such as quality control, medicine, finance, image analysis, and chemistry; see e.g. Hubert et al. (2008) and Hubert and Debruyne (2009) for references. The MCD is also being used as a basis to develop robust and computationally efficient multivariate techniques, such as principal component analysis (Croux and Haesbroeck, 2000; Hubert et al., 2005), factor analysis (Pison et al., 2003), classification (Hubert and Van Driessen, 2004; Vanden Branden and Hubert, 2005), clustering (Hardin and Rocke, 2004), and multivariate regression (Rousseeuw et al., 2004). For a review, see Hubert et al. (2008).
The FASTMCD algorithm starts by drawing random subsets of size p+1. It needs to draw many
in order to obtain at least one that is outlier-free. Starting from each subset several iteration steps
are taken, as will be described in the next section. The overall computation time of FASTMCD
is thus roughly proportional to the number of initial subsets.
If one is willing to give up the affine equivariance requirement, certain robust covariance ma-
trices can be computed much faster. This is the idea behind the BACON algorithm (Billor et al.,
2000; Hadi et al., 2009), the spatial sign and rank covariance matrices (Visuri et al., 2000), and
the OGK estimator (Maronna and Zamar, 2002).
In this article we will present a deterministic algorithm for the MCD, denoted as DetMCD,
which does not use random subsets and runs even faster than FASTMCD. Unlike the latter it is
permutation invariant, i.e. the result does not depend on the order of the observations in the data
set. It starts from only a few well-chosen initial estimates. In Section 2 we give brief descriptions of
FASTMCD and the OGK estimator, since parts of both are used in Section 3 to construct the new
DetMCD algorithm. Section 4 reports on an extensive simulation study, showing that DetMCD is
as robust as FASTMCD. In Section 5 we show that DetMCD is permutation invariant and close
to affine equivariant. Section 6 illustrates the algorithm on several real data sets with applications
involving principal component analysis, multivariate regression, and discriminant analysis.
2 FASTMCD and OGK
In this section we briefly describe the FASTMCD algorithm and the OGK estimator, as our new algorithm DetMCD will use aspects of both. The observations will be denoted as xᵢ (i = 1, . . . , n), whereas the columns of our data matrix are denoted by Xⱼ (j = 1, . . . , p). For a data set X with estimated center μ̂ and scatter matrix Σ̂, the statistical distance of the i-th observation xᵢ will be written as

    D(xᵢ, μ̂, Σ̂) = √((xᵢ − μ̂)ᵀ Σ̂⁻¹ (xᵢ − μ̂)).

2.1 The FASTMCD algorithm
A major component of the FASTMCD algorithm is the concentration step (C-step), which works as follows. Given initial estimates μ̂_old for the center and Σ̂_old for the scatter matrix,

1. Compute the distances d_old(i) = D(xᵢ, μ̂_old, Σ̂_old) for i = 1, . . . , n.

2. Sort these distances, yielding a permutation π for which d_old(π(1)) ≤ d_old(π(2)) ≤ . . . ≤ d_old(π(n)), and set H = {π(1), π(2), . . . , π(h)}.
3. Compute μ̂_new = (1/h) ∑_{i∈H} xᵢ and Σ̂_new = (1/h) ∑_{i∈H} (xᵢ − μ̂_new)(xᵢ − μ̂_new)ᵀ.
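In outline, one C-step takes (μ̂_old, Σ̂_old) to the mean and covariance of the h observations with smallest statistical distance. The following NumPy sketch is our own illustration, not the paper's Matlab code:

```python
import numpy as np

def c_step(X, mu, Sigma, h):
    """One concentration step: keep the h observations with smallest
    statistical distance and recompute their mean and covariance."""
    diff = X - mu
    # squared statistical distances (x_i - mu)^T Sigma^{-1} (x_i - mu)
    d2 = np.einsum('ij,ij->i', diff @ np.linalg.inv(Sigma), diff)
    H = np.argsort(d2)[:h]
    mu_new = X[H].mean(axis=0)
    Sigma_new = np.cov(X[H], rowvar=False, bias=True)   # 1/h normalization
    return mu_new, Sigma_new, H

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:10] += 10                        # ten planted outliers
mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)
for _ in range(10):                 # iterating can only decrease det(Sigma)
    mu, Sigma, H = c_step(X, mu, Sigma, h=75)
```

Each step lowers (or keeps) the determinant, so the loop stabilizes after a few iterations; here the planted outliers drop out of H immediately.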
In Theorem 1 of Rousseeuw and Van Driessen (1999) it was proved that det(Σ̂_new) ≤ det(Σ̂_old), with equality only if Σ̂_new = Σ̂_old. Therefore, if we apply C-steps iteratively, the sequence of determinants obtained in this way must converge in a finite number of steps (because there are only finitely many h-subsets). Since there is no guarantee that the final value of the iteration process is the global minimum of the MCD objective function, an approximate MCD solution is obtained by taking many initial h-subsets H₁ ⊂ {1, 2, . . . , n}, applying C-steps to each, and keeping the solution with the overall lowest determinant.

To construct an initial subset H₁, a random (p + 1)-subset J is drawn and μ̂₀ = (1/(p + 1)) ∑_{i∈J} xᵢ and Σ̂₀ = (1/(p + 1)) ∑_{i∈J} (xᵢ − μ̂₀)(xᵢ − μ̂₀)ᵀ are computed. (If Σ̂₀ is singular, random points are added to J until it becomes nonsingular.) Next, we apply the C-step to (μ̂₀, Σ̂₀), yielding (μ̂₁, Σ̂₁), etc. Since each C-step involves the calculation of a covariance matrix, its inverse, and the corresponding distances, we don't want to use too many. Therefore, the FASTMCD algorithm applies only two C-steps to each initial subset, and only on the ten subsets with lowest determinant are further C-steps taken until convergence. The raw FASTMCD estimates, μ̂_RAWMCD and Σ̂_RAWMCD, then correspond to the empirical mean and covariance matrix of the h-subset with the lowest determinant.
In order to increase the statistical efficiency while retaining high robustness, reweighted estimators are computed:

    μ̂_FASTMCD = ∑ᵢ wᵢ xᵢ / ∑ᵢ wᵢ

    Σ̂_FASTMCD = c₁ ∑ᵢ wᵢ (xᵢ − μ̂_FASTMCD)(xᵢ − μ̂_FASTMCD)ᵀ / (∑ᵢ wᵢ − 1)

where c₁ is a correction factor to obtain consistency when the data come from a multivariate normal distribution (Pison et al., 2002) and wᵢ is an appropriate weight function, e.g.

    wᵢ = 1 if D(xᵢ, μ̂_RAWMCD, Σ̂_RAWMCD) ≤ √(χ²_{p,0.975}), and wᵢ = 0 otherwise,

with χ²_{p,α} the α-quantile of the χ²_p distribution.
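A sketch of this reweighting step (our own illustration; c₁ is set to 1 for brevity, and SciPy supplies the χ² quantile):

```python
import numpy as np
from scipy.stats import chi2

def reweight(X, mu_raw, Sigma_raw, c1=1.0):
    """Reweighted location and scatter: w_i = 1 iff the statistical
    distance of x_i is at most sqrt(chi2_{p,0.975})."""
    n, p = X.shape
    diff = X - mu_raw
    d2 = np.einsum('ij,ij->i', diff @ np.linalg.inv(Sigma_raw), diff)
    w = (d2 <= chi2.ppf(0.975, p)).astype(float)
    mu = (w[:, None] * X).sum(axis=0) / w.sum()
    centered = X - mu
    Sigma = c1 * (w[:, None] * centered).T @ centered / (w.sum() - 1)
    return mu, Sigma, w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
mu_rw, Sigma_rw, w = reweight(X, X.mean(axis=0), np.cov(X, rowvar=False))
```

On clean normal data about 97.5% of the weights equal 1, so the reweighted estimates stay close to the classical ones while downweighting flagged outliers.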
Implementations of the FASTMCD algorithm are available in the package S-PLUS (as the
built-in function cov.mcd ), in R (as part of the packages rrcov , robust and robustbase ), in
SAS/IML Version 7, and in SAS Version 9 (in PROC ROBUSTREG). FASTMCD is also
part of LIBRA, a Matlab LIBrary for Robust Analysis (Verboven and Hubert, 2005) as the function
mcdcov . Moreover, it is available in the PLS Toolbox of Eigenvector Research (Wise et al., 2006)
used in chemometrics.
2.2 The OGK estimator

Maronna and Zamar (2002) presented a general method to obtain positive definite and approximately affine equivariant robust scatter matrices starting from any pairwise robust scatter matrix. This method was applied to the robust covariance estimate of Gnanadesikan and Kettenring (1972). The resulting multivariate location and scatter estimates are called orthogonalized Gnanadesikan-Kettenring (OGK) estimates and are calculated as follows:

1. Let m(.) and s(.) be robust univariate estimators of location and scale.

2. Construct yᵢ = D⁻¹ xᵢ for i = 1, . . . , n, with D = diag(s(X₁), . . . , s(Xₚ)).

3. Compute the correlation matrix U of the variables of Y = (Y₁, . . . , Yₚ), given by u_jk = ¼ (s(Yⱼ + Yₖ)² − s(Yⱼ − Yₖ)²).

4. Compute the matrix E of eigenvectors of U and

   (a) project the data on these eigenvectors, i.e. V = Y E;

   (b) compute robust variances of V = (V₁, . . . , Vₚ), i.e. Λ = diag(s²(V₁), . . . , s²(Vₚ));

   (c) set μ̂(Y) = E m, where m = (m(V₁), . . . , m(Vₚ))ᵀ, and compute the positive definite matrix Σ̂(Y) = E Λ Eᵀ.

5. Transform back to X, i.e. μ̂_RAWOGK = D μ̂(Y) and Σ̂_RAWOGK = D Σ̂(Y) Dᵀ.
In the OGK algorithm, m(.) is a weighted mean and s(.) is the τ-scale of Yohai and Zamar (1988). Step 2 makes the estimate scale equivariant, whereas the following steps are a kind of principal components analysis that replaces the eigenvalues of U (which may be negative) by robust variances. As in the FASTMCD algorithm, the estimate is improved by a reweighting step, where the cutoff value in the weight function is now taken as c = χ²_{p,0.9} med(d₁, . . . , dₙ)/χ²_{p,0.5}, with dᵢ = D(xᵢ, μ̂_RAWOGK, Σ̂_RAWOGK). The reweighted estimates are denoted as μ̂_OGK and Σ̂_OGK.
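The steps above can be sketched in NumPy. For brevity this sketch takes m(·) = median and s(·) = MAD rather than the weighted mean and τ-scale of the actual OGK, so its numbers differ from the real estimator:

```python
import numpy as np

def mad(x):
    """Robust scale: the (consistency-corrected) median absolute deviation."""
    med = np.median(x, axis=0)
    return 1.4826 * np.median(np.abs(x - med), axis=0)

def ogk_raw(X):
    """Raw OGK-style estimate with m = median and s = MAD (the real OGK
    uses a weighted mean and the tau-scale of Yohai and Zamar)."""
    n, p = X.shape
    D = np.diag(mad(X))                      # step 2: robust scales
    Y = X @ np.linalg.inv(D)
    U = np.eye(p)                            # step 3: pairwise robust correlations
    for j in range(p):
        for k in range(j):
            u = 0.25 * (mad(Y[:, j] + Y[:, k])**2 - mad(Y[:, j] - Y[:, k])**2)
            U[j, k] = U[k, j] = u
    _, E = np.linalg.eigh(U)                 # step 4: eigenvectors of U
    V = Y @ E                                # 4(a): project
    L = np.diag(mad(V)**2)                   # 4(b): robust variances
    mu_Y = E @ np.median(V, axis=0)          # 4(c): back-transformed center
    Sigma_Y = E @ L @ E.T                    # positive definite by construction
    return D @ mu_Y, D @ Sigma_Y @ D.T       # step 5: undo the scaling

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4)) * np.array([1.0, 2.0, 3.0, 4.0])
mu_ogk, Sigma_ogk = ogk_raw(X)
```

Because the eigenvalues of U are replaced by the (nonnegative) robust variances in L, the resulting scatter matrix is positive definite even when U itself is not.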
3 Deterministic MCD algorithm
3.1 General procedure
In this section we present an alternative algorithm to calculate the MCD. First we standardize each variable Xⱼ by subtracting its median and dividing by the Qₙ scale estimator of Rousseeuw and Croux (1993). This standardization makes the algorithm location and scale equivariant, i.e. (1) and (2) hold for any non-singular diagonal matrix A. (We also looked into centering by the spatial median, but based on speed considerations and simulation results we stayed with the coordinatewise median.) The standardized data set is denoted by Z, with rows zᵢᵀ (i = 1, . . . , n) and columns Zⱼ (j = 1, . . . , p).
Next, we construct seven initial estimates μ̂ₖ(Z) and Σ̂ₖ(Z) (k = 1, . . . , 7) for the center and scatter of Z. Apart from the last one, each computes a preliminary estimate Sₖ of the covariance or correlation matrix of Z. They will be described in Section 3.2. As these Sₖ may have very inaccurate eigenvalues, we apply the following steps to each. Note that the first two steps are similar to steps 4(a) and 4(b) of the OGK algorithm:

1. Compute the matrix E of eigenvectors of Sₖ and put B = Z E.

2. Estimate the covariance of Z by Σ̂ₖ(Z) = E L Eᵀ, where L = diag(Q²ₙ(B₁), . . . , Q²ₙ(Bₚ)).

3. To estimate the center of Z we sphere the data, apply the coordinatewise median, and transform it back, i.e. μ̂ₖ(Z) = Σ̂ₖ(Z)^{1/2} med(Z Σ̂ₖ(Z)^{−1/2}).

For all seven estimates (μ̂ₖ(Z), Σ̂ₖ(Z)) we then compute the statistical distances

    d_{ik} = D(zᵢ, μ̂ₖ(Z), Σ̂ₖ(Z)).        (3)

For each initial estimate k we take the h observations with smallest d_{ik} and apply C-steps until convergence. The solution with the smallest determinant we call the raw DetMCD. Then we apply a reweighting step as in the FASTMCD algorithm, yielding the final DetMCD.
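A condensed sketch of this outer loop, using only two of the seven initial estimates (Spearman and classical), MAD standardization instead of Qₙ, and a plain coordinatewise-median center instead of the sphering step, purely to keep the illustration short:

```python
import numpy as np
from scipy.stats import rankdata

def mad(x):
    med = np.median(x, axis=0)
    return 1.4826 * np.median(np.abs(x - med), axis=0)

def c_steps(Z, mu, Sigma, h, max_iter=50):
    """Iterate C-steps; the determinant sequence is non-increasing and the
    number of h-subsets is finite, so this converges."""
    prev_det = np.inf
    for _ in range(max_iter):
        diff = Z - mu
        d2 = np.einsum('ij,ij->i', diff @ np.linalg.inv(Sigma), diff)
        H = np.argsort(d2)[:h]
        mu = Z[H].mean(axis=0)
        Sigma = np.cov(Z[H], rowvar=False, bias=True)
        det = np.linalg.det(Sigma)
        if det >= prev_det:          # no further decrease: converged
            break
        prev_det = det
    return mu, Sigma, det

def detmcd_sketch(X, h):
    Z = (X - np.median(X, axis=0)) / mad(X)   # robust standardization
    inits = [
        np.corrcoef(rankdata(Z, axis=0), rowvar=False),  # S2: Spearman
        np.cov(Z, rowvar=False),                         # S7: classical
    ]
    mu0 = np.median(Z, axis=0)       # simplified center (the paper spheres first)
    results = [c_steps(Z, mu0, S, h) for S in inits]
    return min(results, key=lambda r: r[2])   # smallest determinant wins

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
X[:15] += 8.0                        # 15% shifted outliers
mu_best, Sigma_best, det_best = detmcd_sketch(X, h=75)
```

Even when the classical starting value is distorted by the outliers, the robust Spearman start reaches a clean h-subset, and taking the minimum-determinant solution keeps the result robust.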
3.2 Initial scatter estimates
(1) The first initial scatter estimate is obtained by computing the hyperbolic tangent (sigmoid) of each column of Z, i.e. Yⱼ = tanh(Zⱼ) for j = 1, . . . , p. This bounded function reduces
the effect of large coordinatewise outliers. Computing the classical correlation matrix of Y yields S₁ = corr(Y).

(2) Now let Rⱼ be the ranks of the column Zⱼ, and put S₂ = corr(R). This is the Spearman correlation matrix of Z. (Note that since the population Spearman correlation satisfies ρ_S = (6/π) sin⁻¹(ρ/2) for bivariate normal distributions with correlation coefficient ρ (Kendall, 1975), we also applied the inverse transformation to each element of S₂. As it did not improve the results, we did not retain this option.)

(3) For S₃ we compute normal scores from the ranks Rⱼ, namely Tⱼ = Φ⁻¹((Rⱼ − 1/3)/(n + 1/3)), where Φ(.) is the normal cumulative distribution function, and set S₃ = corr(T).

(4) The fourth scatter estimate is based on the spatial sign covariance matrix (Visuri et al., 2000). Define kᵢ = zᵢ/‖zᵢ‖ for all i and let S₄ = cov(K). (Note that this is not the usual spatial sign covariance matrix, because the zᵢ were centered by the coordinatewise median instead of the spatial median, to save computation time.) We also tried the spatial rank covariance matrix

    cov((1/n) ∑ⱼ (zᵢ − zⱼ)/‖zᵢ − zⱼ‖),

but this matrix requires O(n²) operations (for fixed p) whereas the other estimates only require O(n log n) time, and it did not improve the performance of the algorithm.

(5) For S₅ we take the first step of the BACON algorithm (Billor et al., 2000). Consider the ⌈n/2⌉ standardized observations zᵢ with smallest norm, and compute their mean and covariance matrix. (Note that the BACON algorithm starts with a smaller set.)

(6) The sixth scatter estimate is the raw OGK estimator. For m(.) and s(.) we used the median and Qₙ, for reasons of simplicity (no choice of tuning parameters) and to be consistent with the other components of DetMCD.

(7) Finally we consider the classical mean μ̂₇(Z) and covariance matrix Σ̂₇(Z) of the full data set. This initial estimate is not robust, but it is fast and accurate at uncontaminated data.
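The first four Sₖ can be written down directly; a sketch (our own code, with Z assumed already standardized as in Section 3.1):

```python
import numpy as np
from scipy.stats import rankdata, norm

def initial_scatters(Z):
    """S1-S4 of Section 3.2: tanh, Spearman, normal scores, spatial signs."""
    n, p = Z.shape
    S1 = np.corrcoef(np.tanh(Z), rowvar=False)        # (1) bounded transform
    R = rankdata(Z, axis=0)
    S2 = np.corrcoef(R, rowvar=False)                 # (2) Spearman correlation
    T = norm.ppf((R - 1.0 / 3) / (n + 1.0 / 3))       # (3) normal scores
    S3 = np.corrcoef(T, rowvar=False)
    K = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # (4) spatial signs of z_i
    S4 = np.cov(K, rowvar=False)
    return S1, S2, S3, S4

rng = np.random.default_rng(4)
Z = rng.normal(size=(200, 2))
Z[:, 1] = 0.9 * Z[:, 0] + 0.4 * rng.normal(size=200)  # correlated toy data
S1, S2, S3, S4 = initial_scatters(Z)
```

Each Sₖ is then refined by replacing its eigenvalues with robust variances, as in steps 1 and 2 of Section 3.1.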
4 Simulation study
We will compare the new DetMCD algorithm with FASTMCD.
4.1 Simulation design
The simulation is similar to the setup of Maronna and Zamar (2002). Because the DetMCD estimates are not fully affine equivariant, their behavior may depend on the covariance structure, hence we need to generate correlated data. These are obtained by first generating uncorrelated normal data yᵢ ∼ Nₚ(0, I) and applying an affine transformation xᵢ = G yᵢ to them, where G is the matrix with G_jj = 1 and G_jk = ρ for j ≠ k. If there is no contamination (ε = 0), X has covariance matrix G², and the squared multiple correlation ρ²_mult (which is the R² obtained by regressing any coordinate of X on all of the others) can be calculated as a function of ρ. In the simulations we have taken ρ such that ρ_mult = 0.75, which is a rather collinear situation.

Outliers were generated in y-space, and the same affine transformation G was applied to them. We considered three types of contamination: point contamination, cluster contamination, and radial contamination. In all cases yᵢ ∼ Nₚ(0, I) for i = 1, . . . , n − m, where m = εn and ε is the fraction of contamination. Point contamination was obtained as in Maronna and Zamar (2002) by generating yᵢ ∼ Nₚ(y₀, λ²I) for i = n − m + 1, . . . , n, with λ = 0.1 and y₀ = r a₀, where a₀ is a unit vector generated orthogonal to (1, 1, . . . , 1)ᵀ. The value of r, which determines the distance between the outliers and the main center, was varied with the data dimension as specified below. Cluster contamination was generated by shifting the center while using the same covariance matrix, i.e. yᵢ ∼ Nₚ([10, 10, 0_{p−2}]ᵀ, I). For radial contamination, many observations were generated from the distribution Nₚ(0, 5I), and as radial outliers we took the first m observations whose statistical distance exceeded the cutoff value √(χ²_{p,0.8}).

Different data sizes were considered, namely

    A: n = 100 and p = 2 (r = 50)
    B: n = 100 and p = 5 (r = 100)
    C: n = 200 and p = 10 (r = 150)
and different contamination levels were investigated, namely ε = 0%, 10%, and 20%. We always put h, the number of observations whose covariance determinant will be minimized, equal to the default value 0.75n in both FASTMCD and DetMCD, so that the algorithms can resist about 25% of outliers.
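For concreteness, point contamination under this design can be generated as follows. This is our own sketch: we fix a₀ = (1, −1, 0, . . .)ᵀ/√2 instead of drawing it randomly, and write λ = 0.1 for the outliers' dispersion:

```python
import numpy as np

def simulate_point_contamination(n, p, rho, eps, r, rng):
    """Correlated data x_i = G y_i with G_jj = 1, G_jk = rho, where the
    last eps*n of the y_i are outliers at distance r along a unit
    vector a0 orthogonal to (1,...,1)^T, with dispersion lambda = 0.1."""
    m = int(round(eps * n))
    G = np.full((p, p), rho)
    np.fill_diagonal(G, 1.0)
    Y = rng.normal(size=(n, p))
    a0 = np.zeros(p)
    a0[0], a0[1] = 1.0, -1.0                 # orthogonal to (1,...,1)^T
    a0 /= np.linalg.norm(a0)
    Y[n - m:] = r * a0 + 0.1 * rng.normal(size=(m, p))
    return Y @ G.T                           # rows are x_i = G y_i

rng = np.random.default_rng(5)
X = simulate_point_contamination(n=100, p=5, rho=0.5, eps=0.10, r=100, rng=rng)
```

Since a₀ ⊥ (1, . . . , 1)ᵀ and G = (1 − ρ)I + ρ11ᵀ, the transformed outliers sit at distance (1 − ρ)r from the origin.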
For both the FASTMCD and DetMCD algorithms we compute the raw and the reweighted location vectors μ̂_raw(X) and μ̂(X), and the raw and the reweighted scatter matrices Σ̂_raw(X) and Σ̂(X). The corresponding estimators for the data set Y are obtained by transforming back to μ̂(Y) = G⁻¹ μ̂(X) and Σ̂(Y) = G⁻¹ Σ̂(X) G⁻¹. The following performance measures were considered:

- The objective function of the raw scatter estimator, OBJ = det(Σ̂_raw(Y)).
- An error measure of the location estimator, given by e_μ = ‖μ̂(Y)‖².
- An error measure of the scatter estimate, defined as the logarithm of its condition number: e_Σ = log₁₀(cond(Σ̂(Y))).
- The computation time t (in seconds).
Each of these performance measures should be as close to zero as possible. All simulations were run in MATLAB R2007a (The MathWorks, Natick, MA). We wrote new code for DetMCD, whereas FASTMCD was obtained from the mcdcov function in the Matlab library LIBRA (Verboven and Hubert, 2005).
4.2 Simulation results
Table 1 shows the simulation results for clean data (without contamination) for the different data sizes. Each entry is the average (over 100 runs) of the performance measure in question. We see that the algorithms perform similarly for the first three performance criteria. The objective function attained by FASTMCD was on average slightly smaller than with DetMCD, but this difference is not significant (according to the Mann-Whitney test). Also the differences in e_μ and e_Σ are not significant. Moreover, we see that DetMCD is much faster. These results also hold when outliers are present in the data, irrespective of the type or fraction of contamination, as can be
Table 1: Simulation results for clean data.

          FASTMCD   DetMCD
A  OBJ     0.2658   0.2666
   e_μ     0.0236   0.0231
   e_Σ     0.1507   0.1445
   t       1.2832   0.0434
B  OBJ     0.1409   0.1421
   e_μ     0.0597   0.0592
   e_Σ     0.3606   0.3478
   t       1.3943   0.0616
C  OBJ     0.0743   0.0746
   e_μ     0.0560   0.0560
   e_Σ     0.3773   0.3734
   t       2.3281   0.1352
Table 2: Simulation results for data with 10% of contamination.

                Point                Cluster               Radial
          FASTMCD  DetMCD     FASTMCD  DetMCD     FASTMCD  DetMCD
A  OBJ     0.3873  0.3890      0.3873  0.3889      0.3873  0.3880
   e_μ     0.0231  0.0231      0.0231  0.0234      0.0231  0.0238
   e_Σ     0.1376  0.1375      0.1376  0.1361      0.1376  0.1370
   t       1.2812  0.0446      1.2917  0.0452      1.2962  0.0423
B  OBJ     0.2412  0.2425      0.2411  0.2426      0.2411  0.2426
   e_μ     0.0571  0.0570      0.0570  0.0566      0.0577  0.0565
   e_Σ     0.3403  0.3364      0.3404  0.3384      0.3403  0.3364
   t       1.3786  0.0583      1.3910  0.0585      1.3805  0.0567
C  OBJ     0.1490  0.1499      0.1490  0.1499      0.1490  0.1499
   e_μ     0.0579  0.0566      0.0578  0.0571      0.0576  0.0567
   e_Σ     0.3720  0.3693      0.3719  0.3694      0.3725  0.3695
   t       2.2429  0.1303      2.2359  0.1314      2.1922  0.1259
Table 3: Simulation results for data with 20% of contamination.

                Point                Cluster               Radial
          FASTMCD  DetMCD     FASTMCD  DetMCD     FASTMCD  DetMCD
A  OBJ     0.6258  0.6267      0.6258  0.6268      0.6258  0.6261
   e_μ     0.0256  0.0256      0.0256  0.0256      0.0306  0.0313
   e_Σ     0.1283  0.1283      0.1283  0.1283      0.1583  0.1597
   t       1.3077  0.0422      1.2945  0.0418      1.3042  0.0400
B  OBJ     0.4955  0.4968      0.4955  0.4965      0.4956  0.4963
   e_μ     0.0621  0.0622      0.0621  0.0622      0.0621  0.0621
   e_Σ     0.3302  0.3308      0.3302  0.3292      0.3302  0.3298
   t       1.3762  0.0536      1.3948  0.0565      1.3791  0.0532
C  OBJ     0.4001  0.4015      0.4001  0.4017      0.4002  0.4015
   e_μ     0.0633  0.0630      0.0633  0.0630      0.0633  0.0631
   e_Σ     0.3786  0.3772      0.3786  0.3780      0.3778  0.3781
   t       2.2410  0.1215      2.2495  0.1312      2.1975  0.1199
seen in Tables 2 and 3. We conclude that DetMCD is a fast and robust alternative to FASTMCD.

For DetMCD, Figure 1 shows how many times on average each initial subset led (after convergence) to the smallest value of the objective function. Figure 1(a) shows this for the uncontaminated case, whereas Figures 1(b) and (c) correspond to 10% and 20% of point contamination. For clustered and radial contamination we obtained similar figures, hence they are not included here. We immediately see that the first subset (using the hyperbolic tangent transformation) is often best in low dimensions. In higher dimensions the frequencies are more evenly distributed (except that at contaminated data, the initial estimate based on the classical mean and covariance matrix is rarely selected). Therefore, we kept all seven initial h-subsets in the algorithm. Figure 2 shows for each initial h-subset how many C-steps were needed on average to reach convergence. Typically 3 or 4 C-steps were sufficient, so the DetMCD algorithm used around 25 C-steps in all, compared to over 1000 in FASTMCD.
[Figure: three bar charts of the number of times selected versus h-subset 1-7, for n=100,p=2; n=100,p=5; n=200,p=10]

Figure 1: Number of times each of the seven initial subsets of DetMCD led to the best objective function, for (a) 0%, (b) 10%, and (c) 20% of point contamination.
5 Properties of DetMCD
5.1 Affine equivariance
DetMCD is no longer fully affine equivariant, due to the construction of the initial estimates. We will measure its deviation from affine equivariance as done in Maronna and Zamar (2002) for the OGK. Since DetMCD is clearly location equivariant, we can drop v from (1) and (2) and only consider non-singular matrices A. We generate such a p × p matrix A as the product of a random orthogonal matrix and a diagonal matrix diag(u₁, . . . , uₚ), where the uᵢ are independent and uniformly distributed on (0, 1). Let X_A = {Ax₁, . . . , Axₙ}. We then compare the original estimates μ̂_X = μ̂(X) and Σ̂_X = Σ̂(X) with μ̂_A = A⁻¹ μ̂(X_A) and Σ̂_A = A⁻¹ Σ̂(X_A) A⁻ᵀ.
[Figure: three bar charts of the average number of C-steps versus h-subset 1-7, for n=100,p=2; n=100,p=5; n=200,p=10]

Figure 2: For each initial subset of DetMCD, the average number of C-steps until convergence, for (a) 0%, (b) 10%, and (c) 20% of point contamination.
Maronna and Zamar (2002) measured the deviation from equivariance by d_μ = ‖μ̂_A − μ̂_X‖ and d_Σ = cond(Σ̂_X^{−1/2} Σ̂_A Σ̂_X^{−1/2}). Note that affine equivariant estimators satisfy d_μ = 0 and d_Σ = 1.
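These deviations are easy to compute for any estimator. A sketch (the helper names are ours), checked below on the classical mean/covariance, which is exactly affine equivariant:

```python
import numpy as np

def sqrtm_spd(S):
    """Symmetric square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(np.sqrt(vals)) @ vecs.T

def equivariance_deviation(estimator, X, rng):
    """d_mu and d_Sigma as in Maronna and Zamar (2002): transform the
    data by a random A, back-transform the estimates, and compare."""
    n, p = X.shape
    Q, _ = np.linalg.qr(rng.normal(size=(p, p)))       # random orthogonal matrix
    A = Q @ np.diag(rng.uniform(0.0, 1.0, size=p))
    mu_X, Sig_X = estimator(X)
    mu_A, Sig_A = estimator(X @ A.T)                   # rows become A x_i
    Ainv = np.linalg.inv(A)
    mu_back = Ainv @ mu_A
    Sig_back = Ainv @ Sig_A @ Ainv.T
    W = np.linalg.inv(sqrtm_spd(Sig_X))
    d_mu = np.linalg.norm(mu_back - mu_X)
    d_Sigma = np.linalg.cond(W @ Sig_back @ W)
    return d_mu, d_Sigma

classical = lambda X: (X.mean(axis=0), np.cov(X, rowvar=False))
rng = np.random.default_rng(6)
X = rng.normal(size=(60, 3))
d_mu, d_Sigma = equivariance_deviation(classical, X, rng)
```

For the classical estimator, d_μ and d_Σ − 1 vanish up to rounding error; for DetMCD or OGK they are small but nonzero.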
First, we study the deviation from affine equivariance on the ionospheric data taken from Bay (1999) and preprocessed as in Maronna and Zamar (2002), which has n = 225 observations and p = 31 variables. Table 4 shows the average d_μ and d_Σ over 100 matrices A. Note that DetMCD was even closer to affine equivariance than OGK. (The values for FASTMCD confirm its affine equivariance.) We also considered data sets from the simulation in Section 4.1. For 20 such data sets we generated 50 matrices A. Tables 5, 6 and 7 report d_μ and d_Σ for 0%, 10%, and 20% of contamination. In all cases DetMCD was closer to affine equivariance than OGK. Since Maronna and Zamar (2002) concluded that the OGK's deviation from affine equivariance was small enough
not to be concerned about, this holds even more so for DetMCD.
Table 4: Deviation from affine equivariance for the ionospheric data.

       FASTMCD       OGK   DetMCD
d_μ          0    0.3451   0.0133
d_Σ          1  177.5803   2.9954
Table 5: Deviation from affine equivariance for the reweighted estimators on simulated data without contamination.

           OGK   DetMCD
A  d_μ  0.0183   0.0002
   d_Σ  1.0397   1.0012
B  d_μ  0.0819   0.0030
   d_Σ  1.5229   1.0072
C  d_μ  0.2997   0.0423
   d_Σ  1.8086   1.1796
Table 6: Deviation from affine equivariance for the reweighted estimators on simulated data with 10% of contamination.

             Point             Cluster             Radial
           OGK  DetMCD      OGK  DetMCD      OGK  DetMCD
A  d_μ  0.0382  0.0009   0.0436  0.0000   0.0394  0.0012
   d_Σ  1.1840  1.0023   1.2240  1.0000   1.0995  1.0025
B  d_μ  0.0924  0.0337   0.0866  0.0416   0.1097  0.0274
   d_Σ  1.8610  1.0515   1.5662  1.0626   1.7491  1.0433
C  d_μ  0.2551  0.0369   0.1672  0.0261   0.1942  0.0309
   d_Σ  2.1819  1.1403   2.6909  1.1300   1.8949  1.1261
Table 7: Deviation from affine equivariance for the reweighted estimators on simulated data with 20% of contamination.

             Point             Cluster             Radial
           OGK  DetMCD      OGK  DetMCD      OGK  DetMCD
A  d_μ  0.0658  0.0000   0.0837  0.0000   0.0346  0.0000
   d_Σ  1.2592  1.0000   3.1184  1.0000   1.1487  1.0000
B  d_μ  0.1869  0.0000   0.1464  0.0000   0.1092  0.0000
   d_Σ  2.1709  1.0000   1.9857  1.0000   1.8884  1.0000
C  d_μ  0.2928  0.0005   0.3770  0.0830   0.1860  0.0000
   d_Σ  2.0009  1.0022  12.1472  4.5434   1.8217  1.0000
5.2 Permutation invariance

Another property we are interested in is permutation invariance. An estimator T(.) is said to be permutation invariant if T(PX) = T(X) for any data set X and any permutation matrix P. A permutation matrix is a square matrix that has a single entry 1 in each row and each column, and zeroes elsewhere; therefore PX simply permutes the rows of X. Note that FASTMCD is not permutation invariant, because the initial subsets (generated by a pseudorandom number generator with a fixed seed) will have the same case numbers but correspond to different observations. By contrast, all ingredients of DetMCD are permutation invariant. Analogous to the previous section, the deviation from permutation invariance can be measured by d_μ = ‖μ̂(PX) − μ̂(X)‖ and d_Σ = cond(Σ̂(X)^{−1/2} Σ̂(PX) Σ̂(X)^{−1/2}). For the ionospheric data, Table 8 shows the average d_μ and d_Σ over 100 matrices P. They confirm that FASTMCD is not permutation invariant, whereas OGK and DetMCD are.
Table 8: Deviation from permutation invariance for the ionospheric data.

       FASTMCD  OGK  DetMCD
d_μ     0.0410    0       0
d_Σ    12.4131    1       1
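Permutation invariance is easy to check empirically for any estimator; a sketch (our own code, using the classical mean/covariance, which is permutation invariant like DetMCD and OGK):

```python
import numpy as np

def permutation_deviation(estimator, X, rng):
    """d_mu = ||mu(PX) - mu(X)|| for a random row permutation P;
    a permutation invariant estimator gives 0."""
    perm = rng.permutation(X.shape[0])
    mu_X, _ = estimator(X)
    mu_P, _ = estimator(X[perm])     # X[perm] is PX: the rows permuted
    return np.linalg.norm(mu_P - mu_X)

classical = lambda X: (X.mean(axis=0), np.cov(X, rowvar=False))
rng = np.random.default_rng(7)
X = rng.normal(size=(80, 4))
d_mu = permutation_deviation(classical, X, rng)
```

For FASTMCD the analogous quantity is nonzero, because its random subsets pick the same row indices but different observations after permuting.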
5.3 Different values of h

We already noted that DetMCD is faster than FASTMCD when applying the algorithm once for a fixed value of h. As the number of outliers should be below n − h, it is commonly advised to set h ≈ 0.5n when a large proportion of outliers could occur, and h ≈ 0.75n otherwise. Alternatively, one could also compute the MCD for a whole range of h-values, and see whether at some h there is an important change in the objective function or the estimates. This is related to the forward search of Atkinson et al. (2004). With DetMCD it becomes very easy to compute the MCD for several h-values: since the seven initial estimates do not depend on h, we only need to store the resulting ordered distances (3), yielding the initial h-subset for any h. We will illustrate this feature on several examples in Section 6.
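The point is that a single sort of the distances (3) serves every h at once; a trivial sketch:

```python
import numpy as np

def initial_subsets_for_all_h(d, h_values):
    """Given the distances d_ik of one initial estimate, a single argsort
    yields the initial h-subset for every requested h."""
    order = np.argsort(d)
    return {h: order[:h] for h in h_values}

d = np.array([3.1, 0.4, 2.2, 0.9, 1.5, 2.8])
subsets = initial_subsets_for_all_h(d, [2, 4])
```

The subsets are nested, so scanning over h only ever adds observations; no work from smaller h is discarded.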
6 Real Examples
6.1 Philips data
Rousseeuw and Van Driessen (1999) illustrated FASTMCD on data provided by Philips, which produced diaphragm parts for television sets. When a new production line was started, the engineers measured 9 characteristics for each of the 677 parts. Applying FASTMCD with h = 0.75n yielded the robust distances in Figure 3(a). Many distances exceed the cutoff value √(χ²_{9,0.975}). In particular, the observations 491-565 are clearly different from the others, indicating that something happened in the production process. Figure 3(b) shows the robust distances using the DetMCD algorithm. They are almost identical to the FASTMCD results, with the same points being flagged as outlying. Moreover, the estimates for location and scatter were almost identical, i.e. d_μ = ‖μ̂_MCD − μ̂_DetMCD‖ = 0.0004 and d_Σ = cond(Σ̂_MCD^{−1/2} Σ̂_DetMCD (Σ̂_MCD^{−1/2})ᵀ) = 1.0488. Also the objective functions reached by the raw DetMCD and the raw FASTMCD were almost the same, since OBJ_MCD/OBJ_DetMCD = 0.9930. The optimal h-subsets that determine the raw estimates only differed in five observations. The main difference lies in the computation time: whereas FASTMCD took 4.8 seconds, DetMCD only needed 0.5 seconds.
[Figure: two index plots of robust distance (0-16) against observation index (1-677)]

Figure 3: Robust distances of the Philips data with (a) FASTMCD and (b) DetMCD.
6.2 Swiss bank notes data

As a second example we consider p = 6 measurements of n = 100 forged Swiss 1000-franc bills, from Flury and Riedwyl (1988). As shown in Salibian-Barrera et al. (2006), Pison and Van Aelst (2004), and Willems et al. (2009), this data set contains several outlying observations and highly correlated variables. Therefore it is appropriate to analyze the data with a robust PCA method. In Croux and Haesbroeck (2000) it is argued that the MCD estimator can be used for this purpose. The first three principal components are retained, because together they explain 92% of the variance. Figure 4 shows the resulting outlier maps based on FASTMCD and DetMCD using h = 75. On the horizontal axis they show the robust distance of each observation in the three-dimensional PCA subspace. The vertical axis shows the orthogonal distance of the observation to the PCA subspace. Such an outlier map makes it possible to classify observations into regular cases, good PCA leverage points, orthogonal outliers, and bad PCA leverage points (Hubert et al., 2005). We see that the outlier maps are very similar, and that the same observations are flagged as outlying. The final h-subsets obtained with both algorithms had h − 1 points in common. Again the computation times were quite different: FASTMCD took 1.5 seconds whereas DetMCD was 20 times faster.
6.3 Pulp fibre data
The MCD can also be used to perform a robust multivariate regression (Rousseeuw et al., 2004).
Denoting the q-dimensional response variable of the ith observation by y i , the goal of multivariate
[Figure: two outlier maps of orthogonal distance against score distance (3 LV), with outlying observations labeled]

Figure 4: Outlier map of the Swiss bank notes data using robust PCA with (a) FASTMCD and (b) DetMCD.
linear regression is to estimate the intercept vector α and the slope matrix B in the model

y_i = α + B x_i + ε_i.

The MCD regression estimates of α and B are obtained by matrix operations on the MCD location
and covariance estimates of the joint (x_i, y_i) data.
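These matrix operations amount to partitioning the joint location and scatter estimates. A minimal sketch (the function name is ours; in practice mu and Sigma would be the MCD estimates of the joint data, but any location/scatter pair works):

```python
import numpy as np

def regression_from_joint(mu, Sigma, p):
    """Slope and intercept of the multivariate regression y = alpha + B x + eps,
    derived from a joint (x, y) location mu and scatter Sigma.

    p: number of x-variables (the first p coordinates of mu and Sigma).
    Plugging in the MCD estimates yields a robust regression fit.
    """
    mu_x, mu_y = mu[:p], mu[p:]
    Sxx = Sigma[:p, :p]              # scatter of the regressors
    Syx = Sigma[p:, :p]              # cross-scatter of responses and regressors
    B = Syx @ np.linalg.inv(Sxx)     # q x p slope matrix
    alpha = mu_y - B @ mu_x          # q-dimensional intercept
    return alpha, B
```

This mirrors the classical least-squares formulas, with the robust location and scatter taking the place of the sample mean and covariance.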
To illustrate MCD regression we consider a dataset of Lee (1992) that contains properties of
n = 62 pulp fibres and the paper made from them. MCD regression is applied to predict the q = 4
paper properties from the p = 4 fibre characteristics. Figure 5 shows the norm of each slope B_j for
all 27 values of h from 35 to 61, obtained with the FASTMCD and the DetMCD algorithms. Note
that FASTMCD has to start from scratch for each h, whereas DetMCD only needs to compute
the seven initial estimates once. The algorithms yield identical results for all values of h. Figure 5
has a sizeable jump at h = 52. In fact, from h = 52 on the final h-subset contains bad leverage
points. It turned out that the most severe bad leverage points were produced from fir wood, and
that most of the outlying samples were obtained using different pulping processes. The 27 robust
regressions together took 53.2 seconds with FASTMCD, whereas the version with DetMCD only
needed 5.9 seconds.
[Curves of the norm of each of the four slopes against h, with overlapping FASTMCD and DetMCD lines for every slope.]
Figure 5: Norm of each slope of the pulp bre data, for different values of h.
6.4 Fruit data
Our last example is high-dimensional. The fruit data set contains spectra of three different cultivars
(with sizes 490, 106, and 500) of a type of cantaloupe, and was previously analyzed in Hubert and
Van Driessen (2004). All spectra were measured at 256 wavelengths, hence the data set contains
1096 observations and 256 variables. First, we performed a robust PCA using the ROBPCA
method of Hubert et al. (2005). ROBPCA mainly consists of two steps. In the first step a
robust subspace is constructed based on the Stahel-Donoho outlyingness (Stahel, 1981; Donoho
and Gasko, 1992; Debruyne and Hubert, 2009). Next, robust eigenvectors and eigenvalues are
found within this subspace by applying the MCD to the projected observations. We consider
the original ROBPCA method that uses FASTMCD in the second stage of the algorithm, and a
modified ROBPCA that applies DetMCD. From the scree plot we decided to retain two principal
components. Again FASTMCD and DetMCD gave identical results. We note a big group of
outliers in Figure 6, which corresponds to a change in the instrument's illumination system.
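The Stahel-Donoho outlyingness used in the first step of ROBPCA can be sketched as follows. ROBPCA samples candidate directions through pairs of data points; in this illustrative version (function name ours) the directions are simply passed in:

```python
import numpy as np

def sd_outlyingness(X, directions):
    """Stahel-Donoho outlyingness of each row of X.

    For every candidate direction v, project the data on v and measure how
    far each projection lies from the median, in units of the median
    absolute deviation (MAD); the outlyingness is the maximum over
    directions.
    """
    out = np.zeros(len(X))
    for v in directions:
        proj = X @ v
        med = np.median(proj)
        mad = np.median(np.abs(proj - med))
        if mad > 0:                       # skip degenerate directions
            out = np.maximum(out, np.abs(proj - med) / mad)
    return out
```

Observations with small outlyingness are then used to build the robust subspace.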
Next, we applied the robust quadratic discriminant rule RQDR (Hubert and Van Driessen,
2004) to the robust two-dimensional PCA scores for values of h from 550 to 1095. The RQDR
method first runs the MCD estimator on each of the groups. A new datum is then assigned to the
group for which it attains the largest discriminant score. Also membership probabilities for each
group are estimated, as the proportion of regular observations in each group. Figure 7 shows these
[Outlier map plotting the orthogonal distance against the score distance (2 LV), with a large group of outliers labeled by case number.]
Figure 6: Outlier map of the fruit data using ROBPCA.
membership probabilities obtained with both methods as a function of h. Also here FASTMCD
[Curves of the membership probability of each of the three groups against h, with overlapping FASTMCD and DetMCD lines for every group.]
Figure 7: Membership probabilities of each group of the fruit data, for different values of h.
and DetMCD give almost the same results. We see that the membership probabilities change
significantly at h = 700, hence there are a substantial number of outliers present. Therefore h
should be taken sufficiently small to obtain robust results. The entire analysis took 562 seconds
when using FASTMCD, whereas it only needed 111 seconds with DetMCD. The computation time
only went down by a factor of 5 here because the analyses have several parts in common, such as
the computation of the discriminant scores.
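The assignment step of such a quadratic discriminant rule can be sketched as below (Python/NumPy; the classical Gaussian log-density form stands in for the published RQDR score, whose details may differ, and the per-group mu and Sigma would be the MCD estimates):

```python
import numpy as np

def quadratic_assign(x, groups):
    """Assign x to the group with the largest quadratic discriminant score.

    groups: list of (mu_g, Sigma_g, pi_g) with a location, a scatter matrix,
    and a prior membership probability per group.  With MCD estimates
    plugged in, this gives a robust rule in the spirit of RQDR.
    """
    scores = []
    for mu, Sigma, pi in groups:
        d = x - mu
        score = (-0.5 * np.log(np.linalg.det(Sigma))   # penalize spread-out groups
                 - 0.5 * d @ np.linalg.solve(Sigma, d)  # robust squared distance
                 + np.log(pi))                          # prior membership probability
        scores.append(score)
    return int(np.argmax(scores))
```

Because the MCD center and scatter ignore the outliers within each group, outlying training points cannot distort the decision boundaries.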
7 Conclusions and outlook
DetMCD is a new algorithm for the MCD, which needs even less time than FASTMCD. It starts
from a few easily computed h-subsets, and then takes concentration steps until convergence. The
DetMCD algorithm is deterministic in that it does not use any random subsets. It is permutation
invariant and close to affine equivariant, and allows the analysis to be run for many values of h without
much additional computation. We illustrated DetMCD in the contexts of PCA, regression, and
classification.
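A single concentration step, the workhorse of both FASTMCD and DetMCD, can be sketched as follows (Rousseeuw and Van Driessen, 1999, showed that iterating it never increases the covariance determinant):

```python
import numpy as np

def c_step(X, h, mu, Sigma):
    """One concentration step: keep the h observations with smallest
    Mahalanobis distance relative to the current (mu, Sigma) and recompute
    the mean and covariance on them.  Iterating until the determinant of
    Sigma no longer decreases yields a (local) MCD solution."""
    diff = X - mu
    d2 = np.einsum('ij,ij->i', diff @ np.linalg.inv(Sigma), diff)
    idx = np.argsort(d2)[:h]                 # h most central points
    Xh = X[idx]
    return Xh.mean(axis=0), np.cov(Xh, rowvar=False)
```

DetMCD differs from FASTMCD only in where the initial (mu, Sigma) pairs come from: a few deterministic, easily computed estimates instead of many random subsets.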
Also many other methods that directly or indirectly rely on the MCD (e.g. through its robust
distances) may benefit from the DetMCD approach, such as robust canonical correlation
(Croux and Dehon, 2002), robust regression with continuous and categorical regressors (Hubert
and Rousseeuw, 1996), robust errors-in-variables regression (Fekri and Ruiz-Gazen, 2004), robust
principal component regression (Hubert and Verboven, 2003), and robust partial least squares
(Hubert and Vanden Branden, 2003). In particular, on-line applications or procedures that require
the MCD to be computed many times, such as genetic algorithms (Wiegand et al., 2009), will
become more efficient. The cross-validation techniques of Hubert and Engelen (2007) and Engelen
and Hubert (2005) may benefit from the fact that DetMCD is easily updated when an observation
is added or removed. Following Copt and Victoria-Feser (2004) and Serneels and Verdonck (2008)
we will also investigate whether DetMCD can be extended to the missing data framework.

The DetMCD algorithm will be made available in Matlab as part of LIBRA (Verboven and
Hubert, 2005). Also an implementation in R will be provided.
The random sampling mechanism is currently used for many other high-breakdown robust
estimators. Our deterministic approach could improve on those algorithms as well. In particular we
intend to study a deterministic algorithm for S-estimators and τ-estimators, for which algorithms
in the spirit of FASTMCD were developed recently (Salibian-Barrera and Yohai, 2006; Salibian-
Barrera et al., 2008). We will also work on a deterministic algorithm for LTS regression, which is
typically computed with the FASTLTS algorithm (Rousseeuw and Van Driessen, 2006).
SUPPLEMENTAL MATERIALS
Matlab code for DetMCD algorithm: Matlab code to perform the DetMCD algorithm that
is proposed in this article. Note that this algorithm requires the Matlab library for Robust
Analysis LIBRA, which can be freely downloaded from
http://wis.kuleuven.be/stat/robust/LIBRA.html. (.m file)

Data sets: Matlab file that contains all the data sets used in this article. (.mat file)
References
Atkinson, A., Riani, M. and Cerioli, A. (2004). Exploring multivariate data with the forward
search. Springer-Verlag, New York.

Bay, S. (1999). The UCI KDD Archive, http://kdd.ics.uci.edu. Irvine, CA: University of
California, Department of Information and Computer Science.

Billor, N., Hadi, A. and Velleman, P. (2000). BACON: blocked adaptive computationally
efficient outlier nominators. Computational Statistics and Data Analysis 34(3) 279–298.

Butler, R., Davies, P. and Jhun, M. (1993). Asymptotics for the Minimum Covariance
Determinant estimator. The Annals of Statistics 21 1385–1400.

Cator, E. and Lopuhaä, H. (2009). Central limit theorem and influence function for the MCD
estimators at general multivariate distributions. Submitted.

Copt, S. and Victoria-Feser, M.-P. (2004). Fast algorithms for computing high breakdown
covariance matrices with missing data. In Theory and Applications of Recent Robust Methods
(M. Hubert, G. Pison, A. Struyf and S. Van Aelst, eds.). Statistics for Industry and Technology,
Birkhäuser, Basel.

Croux, C. and Dehon, C. (2002). Analyse canonique basée sur des estimateurs robustes de la
matrice de covariance. La Revue de Statistique Appliquée 2 5–26.

Croux, C. and Haesbroeck, G. (1999). Influence function and efficiency of the Minimum
Covariance Determinant scatter matrix estimator. Journal of Multivariate Analysis 71 161–190.
Croux, C. and Haesbroeck, G. (2000). Principal components analysis based on robust esti-
mators of the covariance or correlation matrix: influence functions and efficiencies. Biometrika
87 603–618.

Debruyne, M. and Hubert, M. (2009). The influence function of the Stahel-Donoho covariance
estimator of smallest outlyingness. Statistics and Probability Letters 79 275–282.

Donoho, D. and Gasko, M. (1992). Breakdown properties of location estimates based on
halfspace depth and projected outlyingness. The Annals of Statistics 20 1803–1827.

Engelen, S. and Hubert, M. (2005). Fast model selection for robust calibration. Analytica
Chimica Acta 544 219–228.

Fekri, M. and Ruiz-Gazen, A. (2004). Robust weighted orthogonal regression in the errors-
in-variables model. Journal of Multivariate Analysis 88 89–108.

Flury, B. and Riedwyl, H. (1988). Multivariate statistics: a practical approach. Cambridge
University Press.

Gnanadesikan, R. and Kettenring, J. (1972). Robust estimates, residuals, and outlier de-
tection with multiresponse data. Biometrics 28 81–124.

Hadi, A., Rahmatullah Imon, H. and Werner, M. (2009). Detection of outliers. Wiley
Interdisciplinary Reviews: Computational Statistics 1 57–70.

Hardin, J. and Rocke, D. (2004). Outlier detection in the multiple cluster setting using the
minimum covariance determinant estimator. Computational Statistics and Data Analysis 44
625–638.

Hubert, M. and Debruyne, M. (2009). Minimum Covariance Determinant. Wiley Interdisci-
plinary Reviews: Computational Statistics, in press.

Hubert, M. and Engelen, S. (2007). Fast cross-validation for high-breakdown resampling
algorithms for PCA. Computational Statistics and Data Analysis 51 5013–5024.

Hubert, M. and Rousseeuw, P. (1996). Robust regression with both continuous and binary
regressors. Journal of Statistical Planning and Inference 57 153–163.
Hubert, M., Rousseeuw, P. and Van Aelst, S. (2008). High breakdown robust multivariate
methods. Statistical Science 23 92–119.

Hubert, M., Rousseeuw, P. and Vanden Branden, K. (2005). ROBPCA: a new approach
to robust principal components analysis. Technometrics 47 64–79.

Hubert, M. and Van Driessen, K. (2004). Fast and robust discriminant analysis. Computa-
tional Statistics and Data Analysis 45 301–320.

Hubert, M. and Vanden Branden, K. (2003). Robust methods for Partial Least Squares
Regression. Journal of Chemometrics 17 537–549.

Hubert, M. and Verboven, S. (2003). A robust PCR method for high-dimensional regressors.
Journal of Chemometrics 17 438–452.

Kendall, M. (1975). Multivariate Analysis. Griffin, London.

Lee, J. (1992). Relationships between Properties of Pulp-Fibre and Paper. Ph.D. thesis,
University of Toronto.

Lopuhaä, H. and Rousseeuw, P. (1991). Breakdown points of affine equivariant estimators of
multivariate location and covariance matrices. The Annals of Statistics 19 229–248.

Maronna, R. and Zamar, R. (2002). Robust estimates of location and dispersion for high-
dimensional data sets. Technometrics 44 307–317.

Pison, G., Rousseeuw, P., Filzmoser, P. and Croux, C. (2003). Robust factor analysis.
Journal of Multivariate Analysis 84 145–172.

Pison, G. and Van Aelst, S. (2004). Diagnostic plots for robust multivariate methods. Journal
of Computational and Graphical Statistics 13 310–329.

Pison, G., Van Aelst, S. and Willems, G. (2002). Small sample corrections for LTS and
MCD. Metrika 55 111–123.

Rousseeuw, P. (1984). Least median of squares regression. Journal of the American Statistical
Association 79 871–880.
Rousseeuw, P. and Croux, C. (1993). Alternatives to the median absolute deviation. Journal
of the American Statistical Association 88 1273–1283.

Rousseeuw, P., Van Aelst, S., Van Driessen, K. and Agulló, J. (2004). Robust multi-
variate regression. Technometrics 46 293–305.

Rousseeuw, P. and Van Driessen, K. (1999). A fast algorithm for the Minimum Covariance
Determinant estimator. Technometrics 41 212–223.

Rousseeuw, P. and Van Driessen, K. (2006). Computing LTS regression for large data sets.
Data Mining and Knowledge Discovery 12 29–45.

Salibian-Barrera, M., Van Aelst, S. and Willems, G. (2006). PCA based on multivariate
MM-estimators with fast and robust bootstrap. Journal of the American Statistical Association
101 1198–1211.

Salibian-Barrera, M., Willems, G. and Zamar, R. (2008). The fast-τ estimator for regres-
sion. Journal of Computational and Graphical Statistics 17 659–682.

Salibian-Barrera, M. and Yohai, V. J. (2006). A fast algorithm for S-regression estimates.
Journal of Computational and Graphical Statistics 15 414–427.

Serneels, S. and Verdonck, T. (2008). Principal component analysis for data containing
outliers and missing elements. Computational Statistics and Data Analysis 52 1712–1727.

Stahel, W. (1981). Robuste Schätzungen: infinitesimale Optimalität und Schätzungen von Ko-
varianzmatrizen. Ph.D. thesis, ETH Zürich.

Vanden Branden, K. and Hubert, M. (2005). Robust classification in high dimensions based
on the SIMCA method. Chemometrics and Intelligent Laboratory Systems 79 10–21.

Verboven, S. and Hubert, M. (2005). LIBRA: a Matlab library for robust analysis. Chemo-
metrics and Intelligent Laboratory Systems 75 127–136.

Visuri, S., Koivunen, V. and Oja, H. (2000). Sign and rank covariance matrices. Journal of
Statistical Planning and Inference 91 557–575.
Wiegand, P., Pell, R. and Comas, E. (2009). Simultaneous variable selection and outlier
detection using a robust genetic algorithm. Chemometrics and Intelligent Laboratory Systems
98 108–114.

Willems, G., Joe, H. and Zamar, R. (2009). Diagnosing multivariate outliers detected by
robust estimators. Journal of Computational and Graphical Statistics 18(1) 73–91.

Wise, B., Gallagher, N., Bro, R., Shaver, J., Windig, W. and Koch, R. (2006).
PLS Toolbox 4.0 for use with MATLAB. Software, Eigenvector Research, Inc.
URL http://software.eigenvector.com/

Yohai, V. and Zamar, R. (1988). High breakdown point estimates of regression by means of the
minimization of an efficient scale. Journal of the American Statistical Association 83 406–413.