
Expert Systems with Applications 38 (2011) 6417–6423

Contents lists available at ScienceDirect

Expert Systems with Applications

journal homepage: www.elsevier.com/locate/eswa

Genetic-based minimum classification error mapping for accurate identifying Peer-to-Peer applications in the internet traffic

Mehdi Mohammadi*, Bijan Raahemi, Ahmad Akbari, Hossein Moeinzadeh, Babak Nasersharif
Iran University of Science and Technology, University Road, Hengam Street, Resalat Square, Tehran 16846-13114, Iran

Article info

Keywords: Genetic algorithm; Minimum classification error; Packet classification

0957-4174/$ - see front matter © 2011 Published by Elsevier Ltd. doi:10.1016/j.eswa.2010.09.114

* Corresponding author. Tel.: +98 21 77491192; fax: +98 21 77491128.
E-mail addresses: [email protected], [email protected] (M. Mohammadi), [email protected] (B. Raahemi), [email protected] (A. Akbari), [email protected] (H. Moeinzadeh), [email protected] (B. Nasersharif).

Abstract

In this paper, we propose a hybrid approach using genetic algorithm and neural networks to classify Peer-to-Peer (P2P) traffic in IP networks. We first compute the minimum classification error (MCE) matrix using genetic algorithm. The MCE matrix is then used during the pre-processing step to map the original dataset into a new space. The mapped dataset is then fed to three different classifiers: distance-based, K-Nearest Neighbors, and neural networks classifiers. We measure three different indexes, namely mutual information, Dunn, and SD, to evaluate the extent of separation of the data points before and after mapping is performed. The experimental results demonstrate that with the proposed mapping scheme we achieve, on average, 8% higher accuracy in classification of the P2P traffic compared to the previous solutions. Moreover, the genetic-based MCE matrix increases the classification accuracy more than what the basic MCE does.

© 2011 Published by Elsevier Ltd.

1. Introduction

Recent studies have shown a dramatic shift of the Internet traffic away from HTML text pages and images towards multimedia file sharing and Peer-to-Peer (P2P) applications. P2P is an Internet application that allows a group of users to share their files and computing resources. It has been observed that as much as 70% of broadband traffic is P2P (Azzouna & Guillemin, 2004). While allocating resources for such a significant network usage, telecom carriers and service providers do not see proportional profits out of the services they offer through their infrastructure. As such, telecommunication equipment vendors and Internet service providers are interested in efficient solutions to identify and filter P2P traffic for further control and regulation.

With the growth of the Internet traffic, in terms of number and type of applications, traditional identification techniques such as port matching, protocol decoding or packet payload analysis are no longer effective. In particular, P2P applications may use randomly selected non-standard ports to communicate, which makes it difficult to distinguish them from other types of traffic if we only inspect the port numbers (Cloud Shield, 2007). Current P2P networks tend to intentionally disguise their generated traffic to circumvent both filtering firewalls and legal issues, most emphatically articulated by the Recording Industry Association of America (RIAA). Not only do most P2P networks now operate on top of non-standard and proprietary protocols, but P2P clients can also easily operate on any port number, even HTTP's port 80.

Because of the problems associated with the traditional port-based methods, several data mining techniques have been proposed in recent years to identify the Internet traffic based on its statistical characteristics (Auld, Moore, & Gull, 2007; Huang & Zhu, 2006; Moore & Zuev, 2005; Zander, Nguyen, & Armitage, 2005; Zuev & Moore, 2005). For example, in Raahemi, Hayajneh, and Rabinovitch (2007), the authors used neural networks, incremental neural networks (ARTMAP), very fast decision trees (VFDT), and concept drift very fast decision trees (CDVFDT) to identify P2P traffic. These approaches classify P2P applications based on their statistical characteristics using various classification techniques. These methods can generally be defined as follows: a set of N training examples of the form (X, y) is given, where y is a discrete class label and X is a vector of d attributes, each of which may be symbolic or numeric. The goal is to produce a model y = f(X) from these examples which will predict the classes y of future examples X with a high accuracy. With this method, the P2P applications can be identified without knowing the port numbers in advance.

To increase the accuracy of the classifier, various mapping schemes have been proposed (Duda, Hart, & Stork, 2001; Loog & Duin, 2004). These techniques, applied at the pre-processing stage, map the original training dataset into a new space minimizing the classification error. Among these approaches, the minimum classification error (MCE) technique (De La Torre, Peinado, & Rubio, 1996; Hung & Lee, 2002) is a discriminative training technique that explicitly incorporates classification performance into the training


criterion. Through minimization of the criterion function, MCE is aimed directly at minimizing classification error rather than learning the true data probability distributions, the target of Maximum Likelihood Estimation (MLE) via Baum-Welch or Viterbi training. MCE has been used successfully to train Hidden Markov Models for speech recognition tasks (Hung & Lee, 2002).

The most important component of the MCE technique is its MCE matrix. The test data is multiplied by the MCE matrix, changing its feature space. This may increase the classification accuracy, as the new features in each class become more distinguishable from the features in other classes.
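As a toy illustration of this mapping step (a pure-Python sketch; the function name and the two-feature example are ours, not the paper's), multiplying a feature vector by the MCE matrix moves it into the new space:

```python
def map_features(A, x):
    """Map feature vector x into the new space via the MCE matrix A: x' = A.x."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]

# Hypothetical 2-feature example: an identity MCE matrix leaves features unchanged,
# while any other matrix rotates/scales the feature space.
A = [[1.0, 0.0], [0.0, 1.0]]
print(map_features(A, [3.0, 4.0]))  # [3.0, 4.0]
```

In the actual method, A is the matrix learned by the genetic algorithm, and every training and test record is passed through this product before classification.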

The proposed deterministic approaches to calculate the MCE matrix may get trapped in a local optimum; i.e., they may not generate a globally optimum MCE matrix. This drawback may cause poor performance of the mapping process. In this paper, we use a genetic algorithm to calculate the MCE matrix. GA is a global search algorithm guided by natural evolution, and one of its important features is to decrease the likelihood of the solution being trapped in a local optimum.

We first compute an MCE matrix using the genetic algorithm. The MCE matrix is then applied to the input dataset to map its features into a new space. We then use various classifiers, the main one being a neural networks classifier, to classify the input dataset whose features are already modified by the MCE mapping. We demonstrate that the mapping process provides more accurate classification of Peer-to-Peer applications in the Internet traffic. Our results show that the proposed method achieves classification accuracy higher than 96%.

The rest of the paper is organized as follows. Section 2 surveys related work on MCE and P2P classification. Section 3 presents the details of our proposed hybrid method. Section 4 describes our evaluation criteria and presents the experimental results. Finally, Section 5 concludes the paper.

2. Related works

In this section, we review related works on classification of P2P applications, as well as related works on the minimum classification error method.

2.1. Related works on classification of P2P applications using machine learning techniques

P2P traffic identification has recently gained much attention in both academic and industrial research communities. Various solutions have been developed for P2P traffic classification. A popular approach is TCP port based analysis, where tools such as Netflow (Kamei & Kimura, 2003, 2006) and cflowd (Crovella & Krishnamurthy, 2006) are configured to read the service port numbers in the TCP/UDP packet headers, and compare them with the known (default) port numbers of the P2P applications. The packets are then classified as P2P if a match occurs. Although P2P applications have default port numbers, newer versions allow the user to change the port numbers, or choose a random port number within a specified range. Hence, port based analysis becomes inefficient and misleading.

The method using application signatures was developed by Sen, Spatscheck, and Wang (2004), noticing the fact that internet applications have a unique string (signature) located in the data portion of the packet (payload). They used the available information in the proprietary P2P protocol specifications in conjunction with information extracted from packet-level trace analysis to identify the signatures, and classify the packets accordingly. This signature detection approach is process intensive, and performs deep packet inspection, which may complicate cases where privacy is a major concern. Also, most P2P applications can encrypt the data, making it impossible to detect the signature.

Karagiannis, Broido, Faloutsos, and Klaffy (2004) proposed a P2P traffic identification method based on transport layer analysis. This approach relies on the connection-level patterns of P2P traffic by observing the behavior of P2P communications. Although this method was able to detect 95% of P2P flows from an OC48 (2.4 Gbps) backbone link, it also has some limitations. First, the approach will be misled if a P2P application uses the port numbers of applications with the same behavior. Second, the approach will be misled if a user runs different P2P applications at the same time, or runs the same P2P application to download different files from different peers.

Researchers have also considered the behavioral and statistical characteristics of Internet traffic to identify P2P applications. Zander et al. (2005) proposed a framework for IP traffic classification based on a flow's statistical properties using an unsupervised machine learning technique. While the authors planned to evaluate their approach using a larger number of flows and more applications, they indicated that the accuracy and performance of the resulting classifier had not yet been evaluated.

Zuev and Moore (2005) proposed a supervised machine learning approach to classify network traffic. They started by allocating flows of traffic to one of several predefined categories: Bulk, Database, Interactive, Mail, WWW, P2P, Service, Attack, Games and Multimedia. They then utilized 248 per-flow discriminators (characteristics) to build their model using Naive Bayes analysis. They evaluated the performance of the solution in terms of accuracy (the raw count of flows that were classified correctly divided by the total number of flows) and trust (the probability that a flow that has been classified into a class is in fact from that class). Although this approach is promising, there is a question about its scalability, as it involves many discriminators, and it takes much time to prepare the data (with many attributes) and assign the traffic flows to predefined categories. To overcome this, Moore and Zuev used a Fast Correlation-Based Filter and a variation of a wrapper method to reduce the number of discriminators (Moore & Zuev, 2005). Furthermore, Auld classified Internet traffic using a Bayesian neural network to improve the classification accuracy (Auld et al., 2007).

Raahemi et al. (2007) applied supervised machine learning techniques, namely neural networks and decision trees, to classify P2P traffic. They pre-processed and labeled the data, and built several models using a combination of different attributes for various ratios of P2P/NonP2P in the training dataset. They also considered incremental learning approaches for mining streams of data (Raahemi, Kouznetsov, Hayajneh, & Rabinovitch, 2008), and explored the issues of concept drift (Raahemi, Zhong, & Liu, 2008b), imbalanced data (Zhong et al., 2009) and learning from unlabeled data (Raahemi, Zhong, & Liu, 2008a). While the authors' main focus is on classification techniques, they do not exploit mapping of the feature space (in the pre-processing step) to increase the accuracy of the classifiers. The mapping techniques, and in particular the minimum classification error mapping technique, are the main focus of the current paper.

2.2. Related works on minimum classification error techniques

Minimum classification error is a well-known discriminative method used for both feature transformation (De La Torre et al., 1996; Hung & Lee, 2002) and training classifiers. When the MCE method is used in training of the Hidden Markov Model (HMM), the parameters are adjusted so as to reduce the total classification error. In the MCE training method, the objective function to find the


HMM parameters is modeled first using a continuous function. Then, the minimum of the function is found using a gradient-search method such as the generalized probabilistic descent (GPD) technique (Juang & Katagiri, 1992; McDermott, Hazen, Roux, Nakamura, & Katagiri, 2007). However, we should note that gradient search approaches often get trapped in local optima.

The main idea behind the MCE algorithm is to optimize an empirical error rate on the training set to improve the overall recognition rate. After the empirical training error rate is optimized by a classifier or recognizer, a biased estimate of the true error rate is obtained. One effective way to reduce this bias is to increase "margins" on the training data. It is desirable to use large margins to achieve low errors on the test dataset, even if this may result in high empirical errors on the training dataset. This leads to methods such as Large-Margin MCE (LM-MCE), which adjusts the margin incrementally in the MCE training process such that a desirable balance can be achieved between the empirical error rates on the training set and the margin (Yu, Deng, He, & Acero, 2006, 2007).

3. GAMCE-based classification of P2P applications

Fig. 1 shows the block diagram of the proposed GAMCE-based classification approach, where we employ the search capability of the genetic algorithm to calculate the optimum MCE matrix that reduces the classification error. As shown in Fig. 1, the genetic algorithm finds the optimum MCE matrix throughout several iterations. The MCE matrix is then applied to the feature set in the training dataset. The mapped features are more discriminative than the original ones, which improves the classifier accuracy. The mapped dataset is then forwarded to the classifier, where the input data is classified into P2P or NonP2P applications.

The process of calculating the MCE matrix using the genetic algorithm is explained in the following section.

3.1. Calculating the MCE matrix using genetic algorithm

The genetic algorithm (GA) is a randomized search method based on natural evolution (Eiben & Smith, 2007). GA copes with search in complex and large spaces, and usually provides near-optimal solutions for a defined fitness function of an optimization problem.

In GA, each instance of the search space is encoded in the form of a string called a chromosome (genotype or individual). A collection of these chromosomes is called a population. The initial population is created at random and contains some points in the search space. The fitness function measures the goodness of each chromosome, and based on that, the fitter chromosomes are selected for the next population. Variation operators such as mutation and crossover are applied on the chromosomes to yield a new generation of chromosomes. The process of selection, crossover and mutation continues for a fixed number of generations or until the termination criterion is satisfied. GAs are capable of finding near-optimal solutions in many applications such as VLSI design, image processing and machine learning.

Fig. 1. Block diagram of the proposed GAMCE-based classification approach.

Fig. 2 shows the details of the genetic algorithm, which are also followed in the proposed GAMCE-based classification approach.

3.1.1. Representation (definition of chromosomes)

The first step in defining a GA is to link the "real world" to the "GA world", that is, to set up a bridge between the original problem context and the problem-solving space. The chromosomes in this genetic algorithm are square n × n matrices, where n is the number of features.

3.1.2. Population initialization

The first population is selected from randomly generated chromosomes. The value of each position in each chromosome is a real number between [−MaxValue, MaxValue]. MaxValue is a parameter of the problem and we initialize it with 1.

3.1.3. Crossover operation

Crossover is a probabilistic process that exchanges information between parent chromosomes to generate child chromosomes. We use uniform crossover as the crossover operation, with a fixed crossover probability of μc (Eiben & Smith, 2007; Goldberg, 1989).

3.1.4. Mutation operation

Four mutation methods are implemented to increase the performance of the genetic algorithm. The mutation operation is applied on the chromosomes with probability μm.

• Random mutation: A chromosome is selected at random, and then one bit of the chromosome is revalued randomly. The selection rate of this mutation is 25% of μm (Eiben & Smith, 2007; Goldberg, 1989).
• Swap mutation: A chromosome is selected at random, then two bits of this chromosome are selected randomly and their values are swapped. The selection rate of this mutation is 25% of μm (Eiben & Smith, 2007; Goldberg, 1989).

Fig. 2. The genetic cycle.


• Creep mutation: This mutation works by adding a small random value to a bit of the selected chromosome, changing the value of that bit. The small random value is a real number between [−CreepValue, CreepValue]. CreepValue is a parameter of the problem and we initialize it with 0.2. The selection rate of this mutation is 30% of μm (Eiben & Smith, 2007; Goldberg, 1989).
• Scramble mutation: This is the most disruptive mutation. A chromosome is selected at random and the values of this chromosome are reconfigured. The selection rate of this mutation is 20% of μm (Eiben & Smith, 2007; Goldberg, 1989).

3.2. Evaluation function (fitness function)

A fitness function is a particular type of objective function that quantifies the optimality of a chromosome in a genetic algorithm and creates a measure to rank a chromosome against all the other chromosomes in a genetic population. In this paper, we use the output of a distance classifier as the fitness function. Fig. 3 shows the computation method of the fitness function.

The details of the fitness function are described as follows. For each chromosome in the population:

• Multiply the feature set by the chromosome (the MCE matrix) to generate the transformed feature vectors.
• Partition the transformed dataset into two sets of training and test datasets (60% training, 40% test).
• Calculate the center of each class of the transformed training dataset based on Eq. (1) (training phase). In the following equation, Vector_Sample(i) is the feature vector of the ith sample of the dataset, |P2P| is the number of P2P samples and |NonP2P| is the number of NonP2P samples in the training dataset. Vector_CenterP2P is the center of class P2P and Vector_CenterNonP2P is the center of class NonP2P in the training set.

Vector_CenterP2P = (1 / |P2P|) Σ_{i=1, Vector_Sample(i) ∈ P2P}^{|P2P|} Vector_Sample(i)

Vector_CenterNonP2P = (1 / |NonP2P|) Σ_{i=1, Vector_Sample(i) ∈ NonP2P}^{|NonP2P|} Vector_Sample(i)    (1)

• Present all of the test dataset to the distance classifier and calculate the number of false negatives (i.e. a P2P packet is classified as NonP2P) and false positives (i.e. a NonP2P packet is classified as P2P).

Fig. 3. Fitness calculation for each chromosome.

• Calculate the fitness function for each chromosome based on Eq. (2):

Fitness(i) = ((|FP_i| / |P2P|) + (|FN_i| / |NonP2P|)) / 2 × 100    (2)

We use normalized error in the computation of the fitness function. This technique is useful when the dataset is not balanced (the number of samples in each class differs significantly). With imbalanced data, the classifier tends to bias towards the larger class (i.e. mostly minimizes the classification errors of the large classes). Since the dataset in our study is imbalanced, we normalize the error as in Eq. (2). In this equation, for the ith chromosome, |FP_i| is the number of P2P samples which are classified in the NonP2P class and |FN_i| is the number of NonP2P samples which are classified in the P2P class in the training dataset.

This fitness value is the error of the classifier in the mapped space. The chromosomes (the MCE matrices) generating the lowest fitness values are the winners. The winner chromosomes map the original dataset into a new, better-separated dataset, thus increasing the accuracy of the classifier.
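Eq. (2) is direct to implement; a sketch with a small worked example (Python; the function name and counts are ours):

```python
def fitness(fp, fn, n_p2p, n_nonp2p):
    """Normalized classification error (Eq. (2)).
    fp = P2P samples misclassified as NonP2P, fn = NonP2P samples
    misclassified as P2P (the paper's |FP_i| and |FN_i| notation);
    each count is normalized by its own class size, so the larger
    class cannot dominate the error."""
    return (fp / n_p2p + fn / n_nonp2p) / 2 * 100

# 25 of 100 P2P samples and 250 of 1000 NonP2P samples misclassified:
print(fitness(25, 250, 100, 1000))  # 25.0
```

Because each error count is divided by its own class size before averaging, 25% error on a class of 100 weighs exactly as much as 25% error on a class of 1000 — the property the paragraph above requires for imbalanced data.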

3.3. Selection operation

The selection process selects individuals from the population based on fitness values. In this study, binary tournament selection (Eiben & Smith, 2007; Goldberg, 1989) is used.

Having computed the MCE matrix, it is applied to the training dataset to generate mapped (i.e. more discriminative) training samples. The mapped training dataset is then used to train the classifier. We show that the accuracy of the classifier is increased when trained using the mapped samples.

4. Experiments

4.1. Measuring criteria

We first introduce the criteria we use to evaluate the proposed method, including mutual information, the Dunn index, and the SD index.

4.1.1. Mutual information criterion (MI)

The mutual information I(F_i, C) is the amount of information that feature i contains about the output classes C. This value can be used to rank the features, and also to measure how the feature may improve the classifier accuracy.

Mutual information and entropy were introduced by Shannon's information theory to measure the information of random variables (Shannon, 1948). Basically, entropy is a measure of the uncertainty of a random variable. If a discrete random variable X has alphabet W with probability density function p(x) = Pr{X = x}, x ∈ W, then the entropy of X can be defined as

H(X) = −Σ_{x∈W} p(x) log(p(x))    (3)

For two discrete random variables X and Y, the joint entropy of X and Y can be defined as Eq. (4):

H(X, Y) = −Σ_{x∈W} Σ_{y∈C} p(x, y) log(p(x, y))    (4)

where p(x, y) denotes the joint probability density function of X and Y, and C is the alphabet of Y.

The shared information of two random variables X and Y is defined as the mutual information between them, calculated by Eq. (5). When the two random variables are closely related, the mutual information between them is large. If the mutual information is zero, then the two variables are independent of each other.

I(X; Y) = H(X) + H(Y) − H(X, Y)    (5)
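Eqs. (3)–(5) can be estimated from samples with empirical (plug-in) distributions; a small sketch (Python; function names and sample data are ours):

```python
from collections import Counter
from math import log2

def entropy(values):
    """H(X) = -sum p(x) log p(x) over the empirical distribution (Eq. (3))."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) (Eq. (5)), joint entropy via pairs."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# A feature identical to the class label carries maximal information
# (1 bit for a balanced binary label); an unrelated feature carries none.
labels = [0, 0, 1, 1]
print(mutual_information(labels, labels))        # 1.0
print(mutual_information(labels, [0, 1, 0, 1]))  # 0.0
```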

Other measures we use for evaluating the proposed method are the Dunn and SD measures. They represent data dispersion before and after transformation. In the following subsections, we present a brief introduction of these two measures.

4.1.2. Dunn index

Dunn is a validity index which identifies compact and well-separated classes, defined by Eq. (6) for a specific number of classes (Abbasian, Nasersharif, & Akbari, 2008; Dunn & Dunn, 1974).

D_nc = min_{i=1,...,nc} { min_{j=i+1,...,nc} [ dist(c_i, c_j) / max_{k=1,...,nc} diam(c_k) ] }    (6)

where nc is the number of classes, and dist(c_i, c_j) is the dissimilarity function between two classes c_i and c_j defined by Eq. (7):

dist(c_i, c_j) = min_{x∈c_i, y∈c_j} dist(x, y)    (7)

and diam(c) is the diameter of the class c, a measure of dispersion of the class. The diameter of a class c can be defined as Eq. (8):

diam(c) = max_{x,y∈c} dist(x, y)    (8)

It is clear that if the dataset contains compact and well-separated classes, the distance between the classes is expected to be large, and the diameter of the classes is expected to be small. Thus, based on the Dunn index definition, we may conclude that large values of the index indicate the presence of compact and well-separated classes.
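A direct transcription of Eqs. (6)–(8) (Python sketch; Euclidean point distance and all names are our choices):

```python
def dunn_index(classes):
    """Dunn index (Eq. (6)): minimum inter-class distance divided by the
    maximum class diameter. `classes` is a list of classes, each a list
    of points (tuples of coordinates)."""
    def dist(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

    def set_dist(ci, cj):   # Eq. (7): closest pair between two classes
        return min(dist(x, y) for x in ci for y in cj)

    def diam(c):            # Eq. (8): farthest pair within a class
        return max(dist(x, y) for x in c for y in c)

    nc = len(classes)
    max_diam = max(diam(c) for c in classes)
    return min(set_dist(classes[i], classes[j])
               for i in range(nc) for j in range(i + 1, nc)) / max_diam

# Two compact, well-separated 1-D classes give a large index:
print(dunn_index([[(0.0,), (1.0,)], [(10.0,), (11.0,)]]))  # 9.0
```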

4.1.3. SD index

The SD validity index is defined based on the average scattering of classes and the total separation between classes. Average scattering of classes is defined as Eq. (9), where nc is the number of classes, v_i is the center of class i, X is the overall dataset, σ(v_i) is the variance of the ith class, and σ(X) is the variance of the overall data (Abbasian et al., 2008; Halkidi, Vazirgiannis, & Batistakis, 2000).

Scat(nc) = (1/nc) Σ_{i=1}^{nc} ||σ(v_i)|| / ||σ(X)||    (9)

Total separation between classes is defined as Eq. (10):

Dis(nc) = (D_max / D_min) Σ_{k=1}^{nc} ( Σ_{z=1}^{nc} ||v_k − v_z|| )^{−1}    (10)

where D_max and D_min are the maximum and minimum distance between class centers, respectively:

D_max = max{ ||v_i − v_j|| }  ∀ i, j ∈ {1, 2, ..., nc}    (11)
D_min = min{ ||v_i − v_j|| }  ∀ i, j ∈ {1, 2, ..., nc}    (12)

The SD validity index is then defined as follows:

SD(nc) = k · Scat(nc) + Dis(nc)    (13)

In this equation, the first term Scat(nc), defined as in Eq. (9), represents the average scattering (or average compactness) of classes. The smaller the value of Scat(nc), the more compact the class. The second term, Dis(nc), is a function of the location of the centers of the classes. It measures the separation between the nc classes, and increases with the number of classes. A small value of the SD index indicates the presence of compact and well-separated classes. Since the two terms of SD have different ranges, the weighting factor k is used to balance their overall contribution. The number of classes that minimizes the above SD index can be considered as an optimal value for the number of classes presented in the dataset.

Table 1
Sample records of full IP header extracted by Windump.

Protocol | IP header
TCP      | 15:39:54.369946 IP (tos 0x0, ttl 127, id 35950, offset 0, flags [DF], proto: T... win 17439
UDP      | 15:39:54.369535 IP (tos 0x0, ttl 127, id 19203, offset 0, flags [none], pro... length 101
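A one-dimensional sketch of Eqs. (9)–(13) (Python; we use the scalar variance for σ, k = 1, and our own names — the paper does not fix these details, and real usage would be over multi-dimensional feature vectors):

```python
def sd_index(classes, k=1.0):
    """SD validity index (Eq. (13)): k * Scat(nc) + Dis(nc).
    `classes` is a list of classes; each class is a list of 1-D feature
    values (kept one-dimensional to keep the sketch short)."""
    def variance(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    nc = len(classes)
    centers = [sum(c) / len(c) for c in classes]
    all_points = [v for c in classes for v in c]

    # Eq. (9): average scattering of the classes
    scat = sum(variance(c) for c in classes) / nc / variance(all_points)

    # Eqs. (10)-(12): total separation between class centers
    dists = [abs(centers[i] - centers[j])
             for i in range(nc) for j in range(nc) if i != j]
    d_max, d_min = max(dists), min(dists)
    dis = (d_max / d_min) * sum(
        1.0 / sum(abs(centers[kk] - centers[z]) for z in range(nc) if z != kk)
        for kk in range(nc))

    return k * scat + dis

# Compact, well-separated classes yield a small SD value.
val = sd_index([[0.0, 1.0], [10.0, 11.0]])
```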

4.2. Collecting and labeling data

To identify P2P traffic using data mining techniques, we first need a training dataset. Using Tcpdump, we captured the IP packet headers of two-way Internet traffic. Using Windump, sample entries were extracted from the captured files and transformed from binary format into a readable text format. A sample full IP header with protocol being TCP or UDP is shown in Table 1.

To make the IP headers suitable for data mining techniques, we consider each IP header as one example, and label all examples into three classes, namely "P2P", "NonP2P", and "Unknown", based on their "source port" and "destination port" numbers. The labeling rule is:

If (source port OR destination port) < 1024
    Then Class = "NonP2P"
Else If (source port OR destination port) ∈ {well-known standard port numbers: 1214, 6881, 6889, 6699, 6700, 6701, 4661, 4665, 4672, 4662, 6346, 6347, 6348, 6349, 6257, 1044, 1045, 1337, 2340, 2705, 4500, 4329, 5190, 5500, 5501, 5502, 5503, 6666, 6667, 7668, 7788, 8038, 8080, 28864, 8311, 8888, 8889, 41170, 3074, 3531}
    Then Class = "P2P"
Else Class = "Unknown"

The port numbers used in the above rule are the default port numbers of the most popular P2P applications.
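The labeling rule above can be written directly in Python (a sketch; the function and constant names are our own, and the port list is taken verbatim from the rule):

```python
# Default port numbers of popular P2P applications (from the labeling rule).
P2P_PORTS = {1214, 6881, 6889, 6699, 6700, 6701, 4661, 4665, 4672, 4662,
             6346, 6347, 6348, 6349, 6257, 1044, 1045, 1337, 2340, 2705,
             4500, 4329, 5190, 5500, 5501, 5502, 5503, 6666, 6667, 7668,
             7788, 8038, 8080, 28864, 8311, 8888, 8889, 41170, 3074, 3531}

def label_packet(src_port, dst_port):
    """Label an IP header record as NonP2P, P2P, or Unknown by its ports."""
    if src_port < 1024 or dst_port < 1024:
        return "NonP2P"          # a well-known service port is involved
    if src_port in P2P_PORTS or dst_port in P2P_PORTS:
        return "P2P"             # a default P2P application port is involved
    return "Unknown"
```

Note that the rule checks the well-known range first, so a packet on port 80 is labeled NonP2P even if its other port is a P2P default.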

After all examples are labeled into three classes, we select the most useful information in the IP headers as attributes, since irrelevant and redundant information may increase the computational cost. First, we remove the "tos", "offset", "flags", and "cksum" fields, which are almost-unary attributes (more than 95% of their values are identical). The "ack" field is nearly a random number and contains no information to differentiate records. Since "length" can be calculated from the two "sequence number" fields in the TCP header and, in addition, "packet length" and "length" can be deduced from each other, the two "sequence number" fields and "length" are redundant and can be removed. The "arrival time" is implicitly considered in our analysis, since the IP header records are fed to the algorithm in sequence. Accordingly, we select "id", "protocol", "packet length", "source IP", and "destination IP" as the significant attributes. The attributes "source IP" and "destination IP" were originally captured in dotted-decimal notation, and were binned into 256 bins according to the value of their first octet. It is worth noting that we do not use "source port" and "destination port" in the mining process because they have already been used to label the examples.
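The first-octet binning of the two IP attributes can be sketched as follows (a hypothetical helper, not the authors' code; the first octet of a dotted-decimal address is already in the range 0–255, so it serves directly as the bin number):

```python
def bin_ip(ip):
    """Bin a dotted-decimal IP address into one of 256 bins by its first octet."""
    return int(ip.split(".")[0])
```

For example, both 137.122.72.6 and 137.122.14.100 from Table 1 fall into bin 137.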


6422 M. Mohammadi et al. / Expert Systems with Applications 38 (2011) 6417–6423

4.3. Experimental results

In this subsection, we present the experimental results evaluating the proposed GAMCE approach. The dataset used in our experiments has 32,767 sample records with five features per record. We designed three different experiments to evaluate the proposed method on this dataset. The tests were written in Matlab and run on a Pentium IV 2 GHz machine with 1 GB of memory in a Windows environment.

4.3.1. Experiment-1

In this experiment, we use three different datasets: the original dataset, the dataset mapped by the MCE matrix, and the dataset mapped by the GAMCE matrix. We also use three different classifiers, namely distance-based, K-Nearest Neighbors, and neural network classifiers, to separate the P2P and NonP2P classes. Altogether we run nine simulations, generate the confusion matrices, and measure the accuracy (or the error rate) of the classifiers.

The accuracy (or the error rate) of the three classifiers on the three groups of datasets is shown in Fig. 4. The three groups of datasets are:

Original dataset: the original dataset without any mapping applied (Normal case).
MCE dataset: we calculate the MCE matrix (Section 2.2) and apply it to the original dataset in the pre-processing step (MCE case).
GAMCE dataset: we calculate the genetic-based MCE (GAMCE) matrix as per our proposal and apply it to the original dataset (GAMCE case).
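Assuming the MCE (and GAMCE) transform is a linear mapping matrix W, as in the MCE feature-space transformation of De La Torre, Peinado, and Rubio (1996), the pre-processing step for the last two cases reduces to a single matrix product; a minimal sketch (names are our own):

```python
import numpy as np

def apply_mapping(X, W):
    """Map each row x of X into the new feature space: y = W x."""
    return X @ W.T

# e.g., X_mce = apply_mapping(X, W_mce) before training any of the classifiers
```

The same classifiers are then trained on the original and mapped datasets, so any accuracy difference is attributable to the mapping alone.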

As shown in Fig. 4, in all three cases the accuracy of the classifiers improves when the MCE or GAMCE mapping is used in the pre-processing step. This demonstrates that the MCE and GAMCE mappings discriminate between the classes more accurately.

Using the distance classifier on the original dataset, the error rate is about 27%. When applying the MCE mapping, the classification error is reduced to 24%, and by using the GAMCE matrix in the pre-processing step, it is further reduced to 22%. For the K-Nearest Neighbors (KNN) classifier, the error reduction follows the same pattern as the distance classifier. Finally, for the neural network classifier, the classification error is about 12% when we use the original dataset for training. The error is reduced to 10.5% and 7% when we apply the neural network classifier to the datasets mapped by the

Fig. 4. Comparison of the MCE and GAMCE approaches with the Normal case (no mapping of features) for three different classifiers (y-axis: error rate in percent; bars: Distance, KNN, NN).

MCE and GAMCE matrices, respectively. To calculate the error rate for each case in Fig. 4, we ran 100 independent experiments and report the average error rate over the 100 runs.

4.3.2. Experiment-2

In this experiment, we calculate the mutual information between the classes and each feature. As described in Section 4.1.1, this measure indicates the amount of information that each feature carries about the classes. For instance, a high mutual information value indicates that the feature carries more information about the classes, and as such, it can be selected as a significant feature for classifying the data.

Fig. 5 shows the mutual information between features and classes for each of the five features used in the classification tasks, both before and after the MCE matrix is applied to the original dataset. As shown in this figure, the MCE mapping increases the mutual information between the features and the classes. We therefore expect the MCE mapping to improve the accuracy of the classifiers in the P2P/NonP2P classification task accordingly.
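The mutual information between a discrete feature and the class labels can be estimated from empirical frequencies (a sketch following Shannon, 1948; the function name is our own):

```python
import math
from collections import Counter

def mutual_information(feature, classes):
    """I(F;C) in bits for a discrete feature column and the class labels."""
    n = len(feature)
    p_f = Counter(feature)              # marginal counts of feature values
    p_c = Counter(classes)              # marginal counts of class labels
    p_fc = Counter(zip(feature, classes))   # joint counts
    mi = 0.0
    for (f, c), joint in p_fc.items():
        # p(f,c) * log2( p(f,c) / (p(f) p(c)) ), with counts rewritten as
        # (joint/n) * log2( joint * n / (count_f * count_c) )
        mi += (joint / n) * math.log2(joint * n / (p_f[f] * p_c[c]))
    return mi
```

A feature that determines the class perfectly yields I(F;C) = H(C) (1 bit for two balanced classes), while a feature independent of the class yields 0.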

4.3.3. Experiment-3

In this experiment, we calculate the Dunn and SD measures to evaluate the impact of the MCE and GAMCE mappings on the original dataset. Fig. 6 shows the Dunn measure for the original dataset and for the datasets mapped by the MCE and GAMCE matrices. As shown in this figure, the Dunn index increases after applying the MCE matrix to the original dataset, and increases even further when the GAMCE matrix is applied.

Next, we measure the SD index on the original dataset and on the datasets mapped by the MCE and GAMCE matrices. The results are shown in Fig. 7. We observe a similar improvement when the original dataset is mapped into the new feature space: the SD index decreases after applying the MCE matrix, and decreases further when the GAMCE matrix is applied. This decrease indicates better separation between the classes in the new feature space.

Both the SD and Dunn measures indicate that mapping the original dataset using the MCE and the proposed GAMCE matrices can improve the accuracy of classifiers on the P2P/NonP2P dataset. Furthermore, the proposed GAMCE method increases the accuracy of the classifiers more than the MCE mapping does; a clear indication of the superiority of the proposed genetic-based MCE mapping approach.

Fig. 5. Mutual information (in bits) between each feature (id, protocol, length, source IP, destination IP) and the classes, before and after MCE processing.

Fig. 6. Comparison of the Dunn index for the Normal (no mapping of features), MCE, and GAMCE cases.

Fig. 7. Comparison of the SD index for the Normal (no mapping of features), MCE, and GAMCE cases.


5. Conclusion

In this paper, we introduced a new hybrid approach to classify P2P and NonP2P packets with high accuracy. The proposed approach is based on calculating the minimum classification error (MCE) matrix using a genetic algorithm. Since the genetic algorithm does not usually get trapped in local optima, the computed MCE matrix performs better and increases the overall accuracy of the classifiers. The genetic-based MCE matrix is used to map the features of the dataset into a new space where they can be separated into different classes more easily. According to the experimental results, the proposed method leads to higher classifier accuracy. It outperforms the standard MCE method for three different classifiers, namely distance-based, K-Nearest Neighbors, and neural network classifiers.

Moreover, we used the mutual information, SD index, and Dunn index to compare the proposed GAMCE-based method against the standard MCE-based and Normal (no feature mapping) approaches. The experimental results show that the proposed mapping technique reduces the overlap among the classes and, accordingly, improves the classification accuracy.

References

Abbasian, H., Nasersharif, B., & Akbari, A. (2008). Class dependent LDA optimization using genetic algorithm for robust MFCC extraction (Vol. 1, pp. 807–810). Berlin, Heidelberg: Springer.

Auld, T., Moore, W., & Gull, F. (2007). Bayesian neural networks for Internet traffic classification. IEEE Transactions on Neural Networks, 18(1), 223–239.

Azzouna, B., & Guillemin, F. (2004). Impact of peer-to-peer applications on wide area network traffic: an experimental approach. IEEE Global Telecommunications Conference, 3, 1544–1548.

Cloud Shield (2007). Peer-to-peer traffic control. http://www.cloudshield.com/solutions/p2pcontrol.asp.

Crovella, M., & Krishnamurthy, B. (2006). Internet measurement: Infrastructure, traffic and applications. West Sussex: John Wiley and Sons Ltd.

De La Torre, A., Peinado, A. M., & Rubio, A. J. (1996). An application of minimum classification error to feature space transformation for speech recognition. Speech Communication, 20, 273–290.

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). John Wiley & Sons.

Dunn, J. C. (1974). Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4, 95–104.

Eiben, A., & Smith, J. E. (2007). Introduction to evolutionary computing (2nd ed.). Springer.

Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. New York: Addison-Wesley.

Halkidi, M., Vazirgiannis, M., & Batistakis, Y. (2000). Quality scheme assessment in the clustering process. In Proceedings of the 4th European conference on principles of data mining and knowledge discovery (pp. 265–276).

Huang, H., & Zhu, J. (2006). Kernel based non-linear feature extraction methods for speech recognition. In Proceedings of the international conference on intelligent system design and applications (Vol. 6, pp. 749–754).

Hung, J., & Lee, L. S. (2002). Data-driven temporal filters for robust features in speech recognition obtained via minimum classification error (MCE). In Proceedings of ICASSP conference (pp. 373–376).

Juang, B. H., & Katagiri, S. (1992). Discriminative learning for minimum error classification. IEEE Transactions on Signal Processing, 40, 3043–3054.

Kamei, S., & Kimura, T. (2006). Cisco IOS NetFlow overview. Whitepaper, available at www.Cisco.com.

Kamei, S., & Kimura, T. (2003). Practicable network design for handling growth in the volume of peer-to-peer traffic. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 2, 597–600.

Karagiannis, T., Broido, A., Faloutsos, M., & Klaffy, K. (2004). Transport layer identification of P2P traffic. In Proceedings of the 4th ACM SIGCOMM conference on internet measurement, Italy (pp. 121–134).

Loog, M., & Duin, R. P. W. (2004). Linear dimensionality reduction via a heteroscedastic extension of LDA: The Chernoff criterion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6), 732–739.

McDermott, E., Hazen, T. J., Roux, J. L., Nakamura, A., & Katagiri, S. (2007). Discriminative training for large vocabulary speech recognition using minimum classification error. IEEE Transactions on Audio, Speech, and Language Processing, 15, 203–223.

Moore, W., & Zuev, D. (2005). Internet traffic classification using Bayesian analysis techniques. In Proceedings of ACM Sigmetrics (pp. 50–59). Alberta, Canada.

Raahemi, B., Hayajneh, A., & Rabinovitch, P. (2007a). Classification of peer-to-peer traffic using neural networks. In Proceedings of artificial intelligence and pattern recognition, Orlando, USA (pp. 411–417).

Raahemi, B., Kouznetsov, A., Hayajneh, A., & Rabinovitch, P. (2008). Classification of peer-to-peer traffic using incremental neural networks. In Proceedings of the IEEE Canadian conference on electrical and computer engineering CCECE'08, Niagara Falls, Canada.

Raahemi, B., Zhong, W., & Liu, J. (2008b). Peer-to-peer traffic identification by mining IP layer data streams using concept-adapting very fast decision tree. In Proceedings of the international conference on tools for artificial intelligence ICTAI'08.

Raahemi, B., Hayajneh, A., & Rabinovitch, P. (2007b). Peer-to-peer IP traffic classification using decision tree and IP layer attributes. International Journal of Business Data Communications and Networks, 3, 60–74.

Raahemi, B., Zhong, W., & Liu, J. (2008a). Exploiting unlabeled data to improve peer-to-peer traffic identification using incremental tri-training method. Journal of Peer-to-Peer Networking and Applications. Springer.

Sen, S., Spatscheck, O., & Wang, D. (2004). Accurate, scalable in-network identification of P2P traffic using application signatures. In 13th international world wide web conference, NY, USA (pp. 512–521).

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423 & 623–656.

Yu, D., Deng, L., He, X., & Acero, A. (2006). Use of incrementally regulated discriminative margins in MCE training for speech recognition. In Proceedings of interspeech conference (pp. 2418–2421).

Yu, D., Deng, L., He, X., & Acero, A. (2007). Large-margin minimum classification error training for large-scale speech recognition tasks. In Proceedings of ICASSP (pp. 1137–1140).

Zander, S., Nguyen, T., & Armitage, G. (2005). Self-learning IP traffic classification based on statistical flow characteristics. Lecture notes in computer science (Vol. 3441, pp. 325–328). Springer-Verlag.

Zhong, W., Raahemi, B., & Liu, J. (2009). Learning on class imbalanced data to classify peer-to-peer applications in IP traffic using resampling techniques. In International joint conference on neural networks, Atlanta, Georgia, USA.

Zuev, D., & Moore, W. (2005). Traffic classification using a statistical approach. Lecture notes in computer science (Vol. 3441, pp. 321–324). Springer-Verlag.