
International Journal of Advances in Engineering & Technology, May 2013.

©IJAET ISSN: 2231-1963

Vol. 6, Issue 2, pp. 573-582

EXAMINING OUTLIER DETECTION PERFORMANCE FOR PRINCIPAL COMPONENTS ANALYSIS METHOD AND ITS ROBUSTIFICATION METHODS

Nada Badr, Noureldien A. Noureldien

Department of Computer Science

University of Science and Technology, Omdurman, Sudan

ABSTRACT

Intrusion detection has attracted the attention of both commercial institutions and the academic research community. In this paper PCA (Principal Components Analysis) is used as an unsupervised technique to detect multivariate outliers in a dataset covering one hour of network traffic. PCA is sensitive to outliers since it depends on non-robust estimators. This led us to use MCD (Minimum Covariance Determinant) and PP (Projection Pursuit) as two different robustification techniques for PCA. The experimental results show that PCA generates a high false alarm rate due to masking and swamping effects, while the MCD and PP detection rates are much more accurate, and both reveal the masking and swamping effects that the PCA method suffers from.

KEYWORDS: Multivariate Techniques, Robust Estimators, Principal Components, Minimum Covariance Determinant, Projection Pursuit.

I. INTRODUCTION

Principal Components Analysis (PCA) is a multivariate statistical method concerned with analyzing and understanding data in high dimensions; that is, PCA analyzes data sets whose observations are described by several inter-correlated dependent variables. PCA is one of the best known and most widely used multivariate exploratory analysis techniques [5].

Several robust competitors to the classical PCA estimators have been proposed in the literature. A natural way to robustify PCA is to use robust location and scatter estimators instead of the sample mean and sample covariance matrix when estimating the eigenvalues and eigenvectors of the population covariance matrix. The minimum covariance determinant (MCD) method is a highly robust estimator of multivariate location and scatter. Its objective is to find h observations out of n whose covariance matrix has the lowest determinant. The MCD location estimate is then the mean of these h points, and the scatter estimate is their covariance matrix. Another robust approach to principal component analysis uses the Projection-Pursuit (PP) principle: the data are projected onto a lower-dimensional space such that a robust measure of the variance of the projected data is maximized.

In this paper we investigate the effectiveness of the robust estimators provided by MCD and PP by applying PCA to the Abilene dataset and comparing its outlier detection performance to that of MCD and PP.

The rest of this paper is organized as follows. Section 2 gives an overview of related work. Section 3 is dedicated to classical PCA. The PCA robustification methods, MCD and PP, are discussed in Section 4. Section 5 presents the experimental results, and conclusions and future work are drawn in Section 6.

II. RELATED WORK

A number of researchers have utilized principal components analysis to reduce dimensionality and to detect anomalous network traffic. The use of PCA to structure network traffic flows was introduced by Lakhina [13], whereby principal components analysis is used to decompose the structure of Origin-Destination flows from two backbone networks into three main constituents, namely periodic trends, bursts and noise.

Labib [2] utilized PCA to reduce the dimension of the traffic data and to visualize and identify attacks. Bouzida et al. [7] presented a performance study of two machine learning algorithms, nearest neighbors and decision trees, when used on traffic data with or without PCA. They found that when PCA is applied to the KDD99 dataset to reduce the dimension of the data, the algorithms' learning speed improved while accuracy remained the same.

Terrell [9] used principal components analysis on features of aggregated network traffic of a link connecting a university campus to the Internet in order to detect anomalous traffic. Sastry [10] proposed the use of singular value decomposition and wavelet transforms for detecting anomalies in self-similar network traffic data. Wang [12] proposed an anomaly intrusion detection model based on PCA for monitoring network behavior. The model uses PCA to reduce the dimensions of historical data and to build the normal profile, as represented by the first few principal components. An anomaly is flagged when the distance between a new observation and the normal profile exceeds a predefined threshold.

Mei-Ling Shyu [4] proposed an anomaly detection scheme based on robust principal components analysis. Two classifiers were implemented to detect anomalies: one based on the major components that capture most of the variation in the data, and the other based on the minor components, or residuals. A new observation is considered an outlier, or anomalous, when the sum of squares of the weighted principal components exceeds the threshold in either of the two classifiers.

Lakhina [6] applied principal components analysis to Origin-Destination (OD) flow traffic. The traffic is separated into normal and anomalous subspaces by projecting the data onto the resulting principal components one at a time, ordered from high to low. Principal components (PCs) are added to the normal subspace as long as a predefined threshold is not exceeded; once the threshold is exceeded, that PC and all subsequent PCs are assigned to the anomalous subspace. New OD flow traffic is projected onto the anomalous subspace, and an anomaly is flagged if the value of the squared prediction error, or Q-statistic, exceeds a predefined limit.

PCA is thus widely used to identify lower-dimensional structure in data and is commonly applied to high-dimensional data. PCA represents the data by a small number of components that account for most of the variability in the data. This dimension reduction step can be followed by other multivariate methods, such as regression, discriminant analysis or cluster analysis.

In classical PCA the sample mean and the sample covariance matrix are used to derive the principal components. These two estimators are highly sensitive to outlying observations and render PCA unreliable when outliers are encountered.

III. CLASSICAL PCA MODEL

The PCA detection model detects outliers by projecting the observations of the dataset onto newly computed axes known as PCs. The outliers detected by the PCA method are of two types: outliers detected by the major PCs, and outliers detected by the minor PCs.

The basic goals of PCA [5] are to extract the important information from the data set, to compress the size of the data set by keeping only this important information, and to simplify the description of the data while analyzing the structure of the observations and variables (finding patterns of similarity and difference).

To achieve these goals PCA calculates new variables from the original variables, called Principal Components (PCs). The computed variables are linear combinations of the original variables (chosen to maximize the variance of the projected observations) and are mutually uncorrelated. The first computed PCs, called major PCs, have the largest inertia (total variance in the data set), while each subsequently computed PC, down to the minor PCs, has the greatest remaining residual inertia and is orthogonal to the preceding principal components.

The Principal Components define orthogonal directions in the space of observations. In other words, PCA simply makes a change of orthogonal reference frame, the original variables being replaced by the Principal Components.
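To make the construction concrete, the standard variance-maximization formulation (a textbook statement, not reproduced from this paper) can be written as:

    $z_k = a_k^T x$, where $a_k = \arg\max_{\|a\|=1,\ a \perp a_1, \dots, a_{k-1}} \mathrm{Var}(a^T x)$,

so that $\mathrm{Var}(z_k) = \lambda_k$, the k-th largest eigenvalue of the covariance matrix, and $a_k$ is the corresponding eigenvector.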


3.1 PCA Advantages

PCA common advantages are:

3.1.1 Exploratory Data Analysis

PCA is mostly used for making two-dimensional plots of the data for visual examination and interpretation. For this purpose, the data are projected onto factorial planes spanned by pairs of Principal Components chosen among the first (that is, the most significant) ones. From these plots, one tries to extract information about the data structure, such as the detection of outliers (observations that are very different from the bulk of the data).

According to most research [8][11], PCA detects two types of outliers: type 1, outliers that inflate variance, which are detected by the major PCs; and type 2, outliers that violate structure, which are detected by the minor PCs.

3.1.2 Data Reduction Technique

All multivariate techniques are prone to the bias-variance tradeoff, which dictates that the number of variables entering a model should be severely restricted. Data is often described by many more variables than necessary for building the best model. PCA improves on other statistical reduction techniques in that it selects and feeds the model with a reduced number of variables.

3.1.3 Low Computational Requirement

PCA has low computational requirements since its algorithm consists of simple calculations.

3.2 PCA Disadvantages

It may be noted that PCA is based on the assumptions that the dimensionality of the data can be efficiently reduced by a linear transformation, and that most of the information is contained in those directions where the variance of the input data is maximal. As is evident, these conditions are by no means always met. For example, if the points of an input set are positioned on the surface of a hypersphere, no linear transformation can reduce the dimension (a nonlinear transformation, however, can easily cope with this task). From the above, the following disadvantages of PCA can be identified.

3.2.1 Dependence on Linear Algebra

PCA relies on simple linear algebra as its main mathematical engine and is quite easy to interpret geometrically. But this strength is also a weakness, for it might very well be that other synthetic variables, more complex than linear combinations of the original variables, would lead to a better description of the data.

3.2.2 Smallest Principal Components Receive Little Attention in Statistical Techniques

This lack of interest is due to the fact that, compared with the largest principal components, which contain most of the total variance in the data, the smallest principal components contain only the noise of the data and therefore appear to contribute minimal information. However, because outliers are a common source of noise, the smallest principal components should be useful for outlier detection.

3.2.3 High False Alarms

Principal components are sensitive to outliers, since their directions are calculated from classical estimators such as the sample mean and the sample covariance or correlation matrix.

IV. PCA ROBUSTIFICATION

In real datasets it often happens that some observations are different from the majority; such observations are called outliers, intrusions, discordant observations, etc. The classical PCA method can be affected by outliers so severely that the PCA model cannot detect all of the actually deviating observations; this is known as the masking effect. In addition, some good data points may appear to be outliers, which is known as the swamping effect.

Masking and swamping cause PCA to generate high false alarm rates. To reduce these false alarms, the use of robust estimators has been proposed, since outlying points are less likely to enter into the calculation of robust estimators.

The well-known PCA robustification methods are the minimum covariance determinant (MCD) and the Projection-Pursuit (PP) principle. The objective of the raw MCD is to find h > n/2 observations out of n whose covariance matrix has the smallest determinant. Its breakdown value is $b_n = (n - h + 1)/n$, hence the number h determines the robustness of the estimator; for instance, with the n = 144 observations used below and the smallest admissible h = 73, $b_n = 72/144 = 0.5$. In the Projection-Pursuit principle [3], the data are projected onto a lower-dimensional space such that a robust measure of the variance of the projected data is maximized. PP is suited to cases where the number of variables or dimensions is very large, so PP has an advantage over MCD, since MCD is recommended only for datasets of at most about 50 dimensions.

Principal Component Analysis (PCA) is itself an example of the PP approach, because both search for directions with maximal dispersion of the data projected onto them; but instead of using the variance as the measure of dispersion, robust PP methods use a robust scale estimator [4].
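As a minimal sketch of this plug-in idea (our illustration under stated assumptions, not the authors' code), the MCD location and scatter, here taken from scikit-learn's MinCovDet, can replace the sample mean and covariance before the eigendecomposition:

    import numpy as np
    from sklearn.covariance import MinCovDet

    def robust_pca_mcd(X, random_state=0):
        # Robust PCA: eigendecompose the MCD scatter instead of the sample covariance.
        mcd = MinCovDet(random_state=random_state).fit(X)   # robust location and scatter
        eigvals, eigvecs = np.linalg.eigh(mcd.covariance_)  # ascending eigenvalues
        order = np.argsort(eigvals)[::-1]                   # sort descending
        eigvals, loadings = eigvals[order], eigvecs[:, order]
        scores = (X - mcd.location_) @ loadings             # robust scores
        return eigvals, loadings, scores

Projecting the robustly centered data onto the resulting loadings gives scores in which outlying observations can no longer tilt the axes.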

V. EXPERIMENTS AND RESULTS

In this section we show how we tested PCA and its robustification methods, MCD and PP, on a dataset. The data used consist of OD (Origin-Destination) flows collected and made available by Zhang [1]. The dataset is an extraction of sixty minutes of traffic flows from the first week of the traffic matrix of 2004-03-01, which is the traffic matrix Yin Zhang built from the Abilene network. The dataset is available offline, extracted from the offline traffic matrix.

5.1 PCA on Dataset

At first, the dataset or traffic matrix is arranged into the data matrix X, where rows represent observations and columns represent variables or dimensions:

$X_{(144 \times 12)} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,12} \\ \vdots & \ddots & \vdots \\ x_{144,1} & \cdots & x_{144,12} \end{bmatrix}$

The following steps are considered in applying the PCA method to the dataset.

Centering the dataset to have zero mean: the mean vector is calculated from

$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$   (1)

and the mean is subtracted off each dimension. The product of this step is a centered data matrix Y of the same size as the original dataset:

$Y_{(n,p)} = (x_{i,j} - \mu(X))$   (2)

The covariance matrix is calculated from

$C(X)$ or $\Sigma(X) = \frac{1}{n-1} (X - T(X))^{T} (X - T(X))$   (3)

where $T(X)$ denotes the location estimate, which for classical PCA is the sample mean.

Finding the eigenvectors and eigenvalues of the covariance matrix, the eigenvalues being the diagonal elements of the matrix produced by the eigen-decomposition in equation (4):

$E^{-1} \Sigma(Y) E = \Lambda$   (4)

where $E$ is the matrix of eigenvectors and $\Lambda$ the diagonal matrix of eigenvalues.

Ordering the eigenvalues in decreasing order and sorting the eigenvectors according to the ordered eigenvalues; the sorted eigenvector matrix becomes the loadings matrix.

Calculating the scores matrix (the dataset projected onto the principal components), which expresses the relations between the principal components and the observations:

$scores_{(n,p)} = Y_{(n,p)} \times loadings_{(p,p)}$   (5)


Applying the 97.5% tolerance ellipse to the bivariate projections (the data projected onto the first PCs, and onto the minor PCs) to reveal outliers automatically. The ellipse is defined by the data points whose distance equals the square root of the 97.5% quantile of the chi-square distribution with 2 degrees of freedom; the cutoff has the form

$dist \le \sqrt{\chi^{2}_{p,\,0.975}}$   (6)
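A compact sketch of steps (1)-(6) follows (our Python illustration; the function and variable names are ours, and the 144x12 matrix X would be loaded from the Abilene extraction):

    import numpy as np
    from scipy.stats import chi2

    def pca_outliers(X, alpha=0.975):
        # Classical PCA, equations (1)-(5).
        mu = X.mean(axis=0)                         # eq. (1)
        Y = X - mu                                  # eq. (2)
        C = Y.T @ Y / (len(X) - 1)                  # eq. (3)
        eigvals, E = np.linalg.eigh(C)              # eq. (4), ascending order
        order = np.argsort(eigvals)[::-1]
        eigvals, loadings = eigvals[order], E[:, order]
        scores = Y @ loadings                       # eq. (5)
        # Tolerance-ellipse cutoff, eq. (6), on the two major PCs
        # (squared score distances against the squared cutoff).
        d2 = (scores[:, :2] ** 2 / eigvals[:2]).sum(axis=1)
        outliers = np.where(d2 > chi2.ppf(alpha, df=2))[0]
        return scores, outliers

Applying the same cutoff to the last two columns of the scores matrix (and the two smallest eigenvalues) flags the type 2 outliers detected by the minor PCs.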

The screeplot was studied: the first and second principal components account for 98% of the total variance of the dataset, so the first two principal components are retained to represent the dataset as a whole. Figure (1) shows the screeplot, and figure (2) shows the plot of the data projected onto the first two principal components, used to reveal the outliers in the dataset visually.

Figure 1: PCA Screeplot
Figure 2: PCA Visual Outliers
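The retention rule used here, keeping the leading PCs that together account for 98% of the total variance, can be written as a small helper (again our sketch; the eigenvalues come from equation (4)):

    import numpy as np

    def retained_components(eigvals, threshold=0.98):
        # Smallest k whose leading k eigenvalues reach the variance threshold.
        ratios = np.sort(eigvals)[::-1] / np.sum(eigvals)
        return int(np.searchsorted(np.cumsum(ratios), threshold) + 1)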

Figure (3) shows the tolerance ellipse on the major PCs, and figures (4) and (5) show, respectively, the visual recording of outliers from the scatter plot of the data projected onto the minor principal components, and the outliers detected by the minor principal components tuned by the tolerance ellipse.

Figure 3: PCA Tolerance Ellipse
Figure 4: PCA Type 2 Outliers


Figure 5: PCA Tuned Minor PCs

5.2 MCD on Dataset

Testing the robust MCD (Minimum Covariance Determinant) estimator yields a robust location measure T_mcd and a robust dispersion Σ_mcd. The following steps are applied to test MCD on the dataset in order to reach the robust principal components.

The MCD distance measure is calculated from the formula

$R_i = (x_i - T_{mcd}(X))^{T} \, \Sigma_{mcd}(X)^{-1} \, (x_i - T_{mcd}(X))$, for $i = 1, \dots, n$   (7)

For this dataset the robust location estimate is $T_{mcd}$ (i.e., $\mu_{mcd}$) = 1.0e+006 × [...] and the robust covariance matrix is $C(X)_{mcd}$ (i.e., $\Sigma(X)_{mcd}$) = 1.0e+012 × [...].

From the robust covariance matrix $\Sigma_{mcd}$ the following are calculated:

* the robust eigenvalues, as a diagonal matrix, as in equation (4), replacing n with h;
* the robust eigenvectors, sorted into the loadings matrix, as in equation (5).

The robust scores matrix is then calculated as

$robustscores_{(n,p)} = Y_{(n,p)} \times loadings_{(p,p)}$   (8)
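A sketch of the robust distances of equation (7), using scikit-learn's MCD estimator as the plug-in for T_mcd and Σ_mcd (function and variable names are ours), with the usual chi-square cutoff for flagging:

    import numpy as np
    from scipy.stats import chi2
    from sklearn.covariance import MinCovDet

    def mcd_robust_distances(X, alpha=0.975):
        mcd = MinCovDet(random_state=0).fit(X)
        diff = X - mcd.location_                        # x_i - T_mcd(X)
        Sinv = np.linalg.inv(mcd.covariance_)           # inverse of Sigma_mcd(X)
        R = np.einsum('ij,jk,ik->i', diff, Sinv, diff)  # eq. (7), squared distances
        return R, R > chi2.ppf(alpha, df=X.shape[1])    # distances and outlier flags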

The robust screeplot, from which the first two robust principal components, accounting for over 98% of the total variance, are retained, is shown in figure (6). Figures (7) and (8) show, respectively, the visual recording of outliers from the scatter plot of the data projected onto the robust major principal components, and the outliers detected by the robust major principal components tuned by the tolerance ellipse. Figures (9) and (10) show, respectively, the visual recording of outliers from the scatter plot of the data projected onto the robust minor principal components, and the outliers detected by the robust minor principal components tuned by the tolerance ellipse.

Figure 6: MCD Screeplot
Figure 7: MCD Visual Outliers


Figure 8: MCD Tolerance Ellipse
Figure 9: MCD Type 2 Outliers

Figure 10: MCD Tuned Minor PCs

5.3 Projection Pursuit on Dataset

Testing the projection pursuit method on the dataset involves the following steps:

Center the data matrix $X_{(n,p)}$ around the L1-median to reach the centralized data matrix $Y_{(n,p)}$:

$Y_{(n,p)} = X_{(n,p)} - L_1(X)$   (9)

where $L_1(X)$ is a highly robust estimator of multivariate data location with 50% resistance to outliers [11].

Construct the candidate directions $P_i$ as the normalized rows of the centered matrix; this process includes the following:

$P_Y = (Y[i,:])^{T}$ for $i = 1, \dots, n$   (10)

$NP_Y = \max(\mathrm{SVD}(P_Y))$   (11)

where SVD stands for the singular value decomposition, and

$P_i = \frac{P_Y}{NP_Y}$   (12)

Project the whole dataset on all candidate directions:

$T_i = Y \times (P_i)^{T}$   (13)

Calculate a robust scale estimator for all the projections and find the direction that maximizes the $Q_n$ estimator:

$q = \max_i \big( Q_n(T_i) \big)$   (14)

$Q_n$ is a scale estimator; essentially it is the first quartile of all pairwise distances between data points [5]. These steps yield the robust eigenvectors (PCs), and the square of the robust scale estimator gives the eigenvalues.

Project all the data on the selected direction q to obtain the robust principal components:

$T_i = Y_{(n,p)} \times P_q^{T}$   (15)

Update the data matrix by its orthogonal complement:

$Y = Y - (P_q \times P_q^{T}) \, Y$   (16)


Finally, project all the data onto the orthogonal complement:

$scores = Y \times P_i$   (17)
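A minimal sketch of this projection-pursuit scheme as we read steps (9)-(17) (our illustration: the coordinatewise median stands in for the L1-median, and Qn is implemented directly as the first quartile of pairwise distances, without a consistency factor):

    import numpy as np

    def qn_scale(t):
        # Qn-type scale: first quartile of all pairwise absolute differences.
        d = np.abs(t[:, None] - t[None, :])
        return np.percentile(d[np.triu_indices(len(t), k=1)], 25)

    def pp_robust_pca(X, n_components=2):
        center = np.median(X, axis=0)                 # stand-in for the L1-median, eq. (9)
        Y = X - center
        loadings, eigvals = [], []
        for _ in range(n_components):
            norms = np.linalg.norm(Y, axis=1, keepdims=True)
            P = Y / np.clip(norms, 1e-12, None)       # candidate directions, eqs. (10)-(12)
            scales = [qn_scale(Y @ p) for p in P]     # projections and Qn, eqs. (13)-(14)
            best = int(np.argmax(scales))
            p_q = P[best]
            loadings.append(p_q)
            eigvals.append(scales[best] ** 2)         # squared robust scale = eigenvalue
            Y = Y - np.outer(Y @ p_q, p_q)            # deflation, eq. (16)
        loadings = np.array(loadings).T
        scores = (X - center) @ loadings              # robust scores, eq. (17)
        return np.array(eigvals), loadings, scores

Restricting the candidate directions to the (normalized) observations themselves is the device of Croux and Ruiz-Gazen [3] that keeps the search tractable.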

The plot of the data projected onto the first two robust principal components, to detect outliers visually, is shown in figure (11), and the tuning of the first two robust principal components by the tolerance ellipse is shown in figure (12). Figures (13) and (14) show, respectively, the plot of the data projected onto the minor robust principal components, to detect outliers visually, and the tuning of the last robust principal components by the tolerance ellipse.

Figure 11: PP Visual Outliers
Figure 12: PP Tolerance Ellipse

Figure 13: PP Type 2 Outliers
Figure 14: PP Tuned Minor PCs

5.4 Results

Table (1) summarizes the outliers detected by each method. The table shows that PCA suffers from both masking and swamping. The MCD and PP results reveal the masking and swamping effects of the PCA method. The PP results are similar to those of MCD, with slight differences, since we use all 12 dimensions of the dataset.

Table 1: Outliers Detection

    PCA outliers          MCD outliers          PP outliers           False alarm effects
    (major & minor PCs)   (major & minor PCs)   (major & minor PCs)   Masking    Swamping
    66                    66                    66                    No         No
    99                    99                    99                    No         No
    100                   100                   100                   No         No
    116                   116                   116                   No         No
    117                   117                   117                   No         No
    118                   118                   118                   No         No
    119                   119                   119                   No         No
    120                   120                   120                   No         No
    129                   129                   129                   No         No
    131                   131                   131                   No         No
    135                   135                   135                   No         No
    Normal                Normal                69                    Yes        No
    Normal                Normal                70                    Yes        No
    71                    Normal                Normal                No         Yes
    76                    Normal                Normal                No         Yes
    81                    Normal                Normal                No         Yes
    101                   Normal                Normal                No         Yes
    104                   Normal                Normal                No         Yes
    111                   Normal                Normal                No         Yes
    144                   Normal                Normal                No         Yes
    Normal                84                    Normal                Yes        No
    Normal                96                    Normal                Yes        No
    Normal                97                    97                    Yes        No
    Normal                98                    98                    Yes        No

VI. CONCLUSION AND FUTURE WORK

The study has examined the performance of PCA and its robustification methods (MCD, PP) for intrusion detection by presenting the bi-plots and extracting outlying observations that are very different from the bulk of the data. The study showed that the tuned (tolerance-ellipse) results are identical to the visualized ones. The study attributes the PCA false alarms to the masking and swamping effects. The comparison showed that the PP results are similar to those of MCD, with slight differences in type 2 outliers, since these are considered a source of noise. Our future work will apply the hybrid method (ROBPCA), which uses PP as a dimension reduction technique and MCD as a robust measure, for further performance gains, and will apply a dynamic robust PCA model with regard to online intrusion detection.

REFERENCES

[1]. Abilene TMs, collected by Zhang, www.cs.utexas.edu/yzhang/research, visited on 13/07/2012.
[2]. Khalid Labib and V. Rao Vemuri, "An application of principal components analysis to the detection and visualization of computer network attacks", Annals of Telecommunications, pp. 218-234, 2005.
[3]. C. Croux and A. Ruiz-Gazen, "A fast algorithm for robust principal components based on projection pursuit", COMPSTAT: Proceedings in Computational Statistics, Physica-Verlag, Heidelberg, 1996, pp. 211-217.
[4]. Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang, "A novel anomaly detection scheme based on principal component classifier", in Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM'03).
[5]. J. Edward Jackson, "A User's Guide to Principal Components", Wiley-Interscience, 1st edition, 2003.
[6]. Anukool Lakhina, Mark Crovella, and Christophe Diot, "Diagnosing network-wide traffic anomalies", in Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, ACM, 2004.
[7]. Yacine Bouzida, Frédéric Cuppens, Nora Cuppens-Boulahia, and Sylvain Gombault, "Efficient intrusion detection using principal component analysis", La Londe, France, June 2004.
[8]. R. Gnanadesikan, "Methods for Statistical Data Analysis of Multivariate Observations", Wiley-Interscience, New York, 2nd edition, 1997.
[9]. J. Terrell, K. Jeffay, L. Zhang, H. Shen, Zhu, and A. Nobel, "Multivariate SVD analysis for network anomaly detection", in Proceedings of the ACM SIGCOMM Conference, 2005.
[10]. Challa S. Sastry, Sanjay Rawat, Arun K. Pujari, and V. P. Gulati, "Network traffic analysis using singular value decomposition and multiscale transforms", Information Sciences: An International Journal, 2007.
[11]. I. T. Jolliffe, "Principal Component Analysis", Springer Series in Statistics, Springer, New York, 2nd edition, 2007.
[12]. Wei Wang, Xiaohong Guan, and Xiangliang Zhang, "Processing of massive audit data streams for real-time anomaly intrusion detection", Computer Communications, Elsevier, 2008.
[13]. A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. Kolaczyk, and N. Taft, "Structural analysis of network traffic flows", in Proceedings of SIGMETRICS, New York, NY, USA, 2004.

AUTHORS BIOGRAPHIES

Nada Badr earned her B.Sc. in Mathematical and Computer Science at the University of Gezira, Sudan. She received her M.Sc. in Computer Science at the University of Science and Technology. She is pursuing her Ph.D. in Computer Science at the University of Science and Technology, Omdurman, Sudan. She is currently serving as a lecturer at the University of Science and Technology, Faculty of Computer Science and Information Technology.

Noureldien A. Noureldien is working as an associate professor in Computer Science, Department of Computer Science and Information Technology, University of Science and Technology, Omdurman, Sudan. He received his B.Sc. and M.Sc. from the School of Mathematical Sciences, University of Khartoum, and received his Ph.D. in Computer Science in 2001 from the University of Science and Technology, Khartoum, Sudan. He has many papers published in journals of repute. He is currently working as the dean of the Faculty of Computer Science and Information Technology at the University of Science and Technology, Omdurman, Sudan.