[IEEE 2013 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery...



Page 1: [IEEE 2013 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC) - Beijing, China (2013.10.10-2013.10.12)] 2013 International Conference

Flow-Based P2P Network Traffic Classification using Machine Learning Algorithm

Prof. Shashikala Tapaswi ABV-IIITM Gwalior

Gwalior, India

Arpit S. Gupta

ABV-IIITM Gwalior Gwalior, India

Abstract – With new services being introduced in the market every day, the internet is growing rapidly. The network traffic generated by these network protocols and applications needs to be categorised, which is an important task of network management. Among these, p2p has the largest share of the bandwidth. This great demand for bandwidth has increased the importance of network traffic engineering. So, in order to meet the current demand and develop new architectures that help in improving network performance, a broad understanding of the network traffic properties is required. Flow-based methods classify p2p and non-p2p traffic using the characteristics of flows on the internet. In this paper, a Naïve Bayes estimator is used to categorize the traffic into p2p and non-p2p. Our results show that, with the right set of features and good training data, a high level of accuracy is achievable with the simple Naïve Bayes estimator.

Keywords – Naive Bayesian Estimator, peer-to-peer, traffic classification, P2P

I. INTRODUCTION

Over the past few years, peer-to-peer (p2p) file sharing has gained popularity. The bandwidth share of p2p traffic is the largest and is continuously increasing [1][2]. Peer-to-peer (p2p) allows resources to be shared on a network in which any computer can act as a client or a server for other computers. This allows sharing of files, peripherals or any other resource without the need for a central server. Its simplicity in connecting and searching makes it the most popular file sharing protocol. As p2p traffic consumes the most bandwidth, it poses a serious problem for traffic engineers on limited-capacity links. Meeting the bandwidth requirements and handling billing issues, Quality of Service (QoS) issues etc. has become a major cause of concern to these traffic engineers. As most of the data shared by p2p applications is copyright protected, it has become essential to detect this p2p traffic.

It is difficult to identify the application that is generating this p2p traffic, because the logs of the applications running in the network are never up to date. The network administrator therefore cannot determine accurately which application is generating the traffic. So, there is a need to classify the network traffic so that heavy bandwidth-consuming p2p traffic can be distinguished from non-p2p traffic. This process can help in engineering the bandwidth for the critical applications in the network.

There are three approaches mentioned in the academic literature and used in practice: port-based methods, payload inspection-based methods and flow-based methods.

Port-based methods rely on the fact that an application works on a specific port. So, if we know the port on which a p2p application is working, we can classify the traffic into p2p and non-p2p. This was the initial criterion for classification. However, most applications nowadays assign ports dynamically [1]. As a result, the success rate of this method has become limited, and it is no longer appropriate for traffic classification.

The most widely used method is the payload inspection-based method. In this method, the packet data is inspected, which violates the user’s privacy and incurs a huge amount of data processing [3][4]. A major disadvantage of this method arises when the data is encrypted: in that case, the method is impossible to implement [5].

The flow-trace method of traffic classification has become an alternative to the above methods. In this method, flow traces that summarize the features of the flows are collected from the IP headers. This method does not suffer from the disadvantages of the port-based and payload inspection-based detection techniques. Neither the amount of data to be analysed nor encrypted data poses any problem, and the user’s privacy is preserved.

A machine learning algorithm is used to classify the traffic into p2p and non-p2p using the flow-based method. The accuracy of the algorithm can be determined by using either signature-based or port-based tools. But as these tools are limited in functionality because of the drawbacks explained above, the measured accuracy can vary accordingly.

The motivation of our work is to cope with the problems faced in classification using port-based and signature-based methods. The low accuracy of these methods motivated us to propose a methodology for traffic classification that gives higher accuracy with low complexity. A reliable method is also needed because traffic classification is central to the job of network engineers. In our approach, in order to cope with these problems, custom-made data is used [6] to measure the accuracy. A quarantined p2p server is established to generate the traffic needed for the experiment. Hence, the type of each flow is known in advance and can be used to calculate the accuracy of the algorithm. Also, different network scenarios are generated by varying the number of packets and the amount of pre-classified p2p traffic to determine the accuracy of the algorithm.

2013 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery

978-0-7695-5106-7/13 $26.00 © 2013 IEEE

DOI 10.1109/CyberC.2013.75

The working of the supervised machine learning algorithm can be seen from Figure 1:

Figure 1. Supervised Machine Learning

From Figure 1 it can be seen that known data and known responses are used to form the model (Bayesian model) of the algorithm, which is then used to predict the values of new unclassified data. The better the known responses, the better the model is for prediction.

Through our work, we propose a methodology for traffic classification with high accuracy and low complexity. The importance of custom-made data is shown by the results with different network scenarios. The results also show the importance of the number of packets in the classification and the importance of the features on which the classification is based.

The rest of the paper is organized as follows. Section 2 discusses the related work in traffic classification, the machine learning algorithm used, and the accuracy determination methods. Detailed experimental analysis is provided in Section 3. Section 4 presents the evaluation results and Section 5 concludes.

II. LITERATURE REVIEW

A. Previous work

There have been studies on the classification of network traffic using the various methods mentioned in Section 1. These methods classify traffic with different levels of accuracy.

Port-based traffic classification has been used in [1][12]. Layer 4 ports of different applications are used to classify the traffic. The problem with this approach is that services now increasingly use dynamic rather than static ports.

Sen et al. [3] and Collins et al. [5] used signature-based methods in which they intercepted the traffic of high-traffic 1 Gbps links and then used the method to classify the data. The problem was that the volume of collected data was too large and the privacy of the data was compromised.

Various machine learning algorithms are used in [6][7][8][9][10][11][12][13][14]. Erman et al. [7] used an unsupervised machine learning algorithm to classify traffic, utilizing the distinctive characteristics that applications exhibit when they communicate on the network. In [8] the authors used an IDS approach to classify the traffic.

The works [9][10][11][12][14] discuss the accuracy of the naïve Bayesian algorithm, with accuracies varying from 65% to 91%. Moore et al. [9] used hand-classified data for both training and testing. In [10] a supervised machine learning algorithm using a Bayesian neural network is proposed. In [11], the authors performed their experiments on pre-categorized data from another research project. In [12], Williams et al. showed the importance of computational performance along with accuracy in various machine learning algorithms, while the authors of [14] compared the results of an unsupervised algorithm with those of a supervised algorithm.

B. Naïve Bayes Estimator

The Naïve Bayes estimator can be seen as a way to understand the probability of occurrence of an event given a new piece of evidence. It can be written as:

$$P(C_i \mid E) = \frac{P(C_i)\,P(E \mid C_i)}{P(E)}, \qquad P(E) = \sum_i P(C_i)\,P(E \mid C_i) \tag{1}$$

In this, $C_i$ stands for the class that we are deducing and $E$ is the new piece of evidence. The left-hand side is the posterior probability, i.e. the probability of the hypothesis being true once the evidence has been taken into consideration.

The denominator of eq. (1) is the same for every class, so it does not affect which class scores highest. The problem therefore reduces to calculating the numerator, i.e.

$$P(C_i \mid E) \propto P(C_i)\,P(E \mid C_i) \tag{2}$$

$P(C_i)$ is the probability of occurrence of class $C_i$, estimated as the count of occurrences of class $C_i$ divided by the total count of class occurrences:

$$P(C_i) = \frac{\#C_i}{\#C} \tag{3}$$

The probability of an individual piece of evidence $E_j$ in class $C_i$ is the count of $E_j$ belonging to $C_i$ divided by the count in $C_i$:

$$P(E_j \mid C_i) = \frac{\#(E_j \mid C_i)}{\#C_i} \tag{4}$$

$P(E \mid C_i)$ is the probability of occurrence of the evidence $E$ in class $C_i$:

$$P(E \mid C_i) = \prod_j P(E_j \mid C_i) \tag{5}$$

Now, using eqs. (3), (4) and (5), we can calculate eq. (2) for each class and predict the class with the highest value. This is the mathematics behind the Naïve Bayes Estimator.

Algorithm for the estimator:

Train(C, D)
    N = countTuples(D)
    E = extractEvidences(D)
    for each c ∈ C do
        Nc = countOccurenceOfClass(c, D)
        prior[c] = Nc / N
        X = countTuplesWithClass(c, D)
        for each e ∈ E do
            evi_count = countEvidenceInClass(e, c)
            condprob[e][c] = evi_count / X
    return (E, prior, condprob)

Test(C, E, prior, condprob, d)
    for each e ∈ E do
        for each c ∈ C do
            temp = prior[c] * condprob[e][c]
            maxi[c] = max(maxi[c], temp)
    return arg max over c ∈ C of maxi[c]

The above algorithm implements the mathematics of the Naïve Bayes Estimator described in Section II-B. The function Train() takes two arguments: C, the classes into which we want to classify, and D, the tuples of data in our collected data file. The values returned by this function are the evidence vector E, the probability of occurrence of each class Ci (prior) and the conditional probability array condprob. These values are used in the testing phase of our experiment.

The function Test() uses the values returned by the Train() function in order to classify the data into p2p and non-p2p. It takes five arguments: C is the class vector, E is the evidence vector, prior and condprob are the values returned by the Train() function, and d is the new document to be tested, i.e. classified. The value returned by this function is arg max (maxi[c]), which is the class to which this data belongs.

These two functions help in making the confusion matrix required to calculate the accuracy of the classification.
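The counting in eqs. (3) and (4) and the decision rule in eqs. (2) and (5) can be sketched in a few lines of Python. This is a minimal illustration, not the WEKA implementation used in the experiments. It combines the per-evidence probabilities with the product of eq. (5), computed as a sum of logarithms, rather than with the running maximum in the Test() listing. The feature tuples, the `floor` value used for evidence never seen in a class, and the function names are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import log


def train(rows):
    """rows: list of (features, label) pairs, where features is a tuple.

    Returns (prior, condprob) estimated by counting, as in eqs. (3) and (4).
    """
    class_count = Counter(label for _, label in rows)
    total = len(rows)
    prior = {c: class_count[c] / total for c in class_count}  # eq. (3)

    # evidence_count[c][(j, value)] = number of rows of class c whose
    # j-th feature equals value.
    evidence_count = defaultdict(Counter)
    for features, label in rows:
        for j, value in enumerate(features):
            evidence_count[label][(j, value)] += 1

    condprob = {
        c: {ev: n / class_count[c] for ev, n in evidence_count[c].items()}  # eq. (4)
        for c in class_count
    }
    return prior, condprob


def classify(features, prior, condprob, floor=1e-6):
    """Return the class maximising P(Ci) * prod_j P(Ej|Ci), eqs. (2) and (5).

    `floor` is an assumed small probability for evidence never seen with a
    class; it only keeps log() defined and is not part of the paper.
    """
    scores = {}
    for c in prior:
        score = log(prior[c])
        for j, value in enumerate(features):
            score += log(condprob[c].get((j, value), floor))
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical toy flows: (protocol, packet-size bucket) -> label.
rows = [
    (("udp", "high"), "p2p"),
    (("udp", "high"), "p2p"),
    (("tcp", "high"), "p2p"),
    (("tcp", "low"), "non-p2p"),
    (("tcp", "low"), "non-p2p"),
    (("udp", "low"), "non-p2p"),
]
prior, condprob = train(rows)
print(classify(("udp", "high"), prior, condprob))
print(classify(("tcp", "low"), prior, condprob))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied; it does not change which class wins.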

We used WEKA (Waikato Environment for Knowledge Analysis) [15] to conduct our experiments. It is a collection of machine learning algorithms for data mining written in Java.

C. Accuracy Evaluation

The traffic classification can be seen as a confusion matrix in which false positives and false negatives can occur. The percentage of the overall traffic that is correctly classified is the accuracy of the algorithm.

TABLE 1. CONFUSION MATRIX

                  Classified as p2p    Classified as non-p2p
Actual p2p        a                    b
Actual non-p2p    c                    d

where,
a = #p2p packets classified as p2p.
b = #p2p packets classified as non-p2p.
c = #non-p2p packets classified as p2p.
d = #non-p2p packets classified as non-p2p.

The accuracy of the algorithm can be written as:

$$\text{Accuracy} = \frac{a + d}{a + b + c + d} \tag{6}$$
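As a small illustration of the accuracy formula, the sketch below tallies the four cells of the confusion matrix from paired lists of actual and predicted labels and then applies the formula. It follows the row and column layout of Table 1 (rows are the actual class, columns the predicted class). The label strings and function names are assumptions chosen for the example.

```python
def confusion_counts(actual, predicted, positive="p2p"):
    """Return (a, b, c, d) following the layout of Table 1.

    a: positive classified as positive    b: positive classified as negative
    c: negative classified as positive    d: negative classified as negative
    """
    a = b = c = d = 0
    for act, pred in zip(actual, predicted):
        if act == positive and pred == positive:
            a += 1
        elif act == positive:
            b += 1
        elif pred == positive:
            c += 1
        else:
            d += 1
    return a, b, c, d


def accuracy(a, b, c, d):
    """Fraction of all items that were classified correctly."""
    total = a + b + c + d
    if total == 0:
        raise ValueError("empty confusion matrix")
    return (a + d) / total


# Hypothetical labels, for illustration only.
actual = ["p2p", "p2p", "non-p2p", "non-p2p", "p2p"]
predicted = ["p2p", "non-p2p", "p2p", "non-p2p", "p2p"]
cells = confusion_counts(actual, predicted)
print(cells, accuracy(*cells))
```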

III. EXPERIMENTAL ANALYSIS

A. Data acquisition

The network topology of the setup used in our experimentation is shown in Figure 2:

Figure 2. Network topology

The router of the network captures all the packets and sends them to the logging machine, which saves them. The router logs all the traffic using the NetFlow [17] feature of the Cisco router. The traffic is then logged using flow-tools [18], from which the features are extracted. In this way, flows are generated.
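To make the notion of a flow concrete, the sketch below groups packet records that share the same address pair, port pair and protocol, and counts packets and bytes per group. In the experiment this aggregation is done by NetFlow on the router, so the code is only an illustration of the idea. The dictionary field names (`src`, `dst`, `sport`, `dport`, `proto`, `size`) are hypothetical.

```python
from collections import defaultdict


def aggregate_flows(packets):
    """Group packet records into flows keyed by the usual 5-tuple.

    Each packet is a dict with hypothetical keys:
    src, dst, sport, dport, proto, size.
    """
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p["size"]
    return dict(flows)


# Hypothetical packet records, for illustration only.
packets = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 5000, "dport": 6881, "proto": "tcp", "size": 1500},
    {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 5000, "dport": 6881, "proto": "tcp", "size": 400},
    {"src": "10.0.0.3", "dst": "10.0.0.4", "sport": 4000, "dport": 80, "proto": "tcp", "size": 600},
]
for key, stats in aggregate_flows(packets).items():
    print(key, stats)
```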

The data is generated from the quarantined p2p server, so we know beforehand the type of each flow. Various machines are also set up with and without p2p applications. This generates the traffic required for the experiment.

Network flows were captured at different points of time over a period of 2 weeks, from 27 January 2013 to 9 February 2013. A dedicated p2p server was established in order to generate the pre-classified flow data. The flows were collected at different times of day so that the sample is not biased towards a particular load, because the amount of traffic during the day differs from the traffic at night. Each flow lasts 3 to 5 minutes. Each flow contains a different number of packets. So, custom data is created with different numbers of packets and different amounts of p2p traffic content in each flow. These constitute the different network scenarios required for the experiments.

TABLE 2. NETWORK SCENARIOS

Scenario No.    p2p percentage
Scenario 1      25%
Scenario 2      50%
Scenario 3      75%
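One way to assemble such scenarios from pre-classified flows is sketched below. It assumes that the p2p percentage is interpreted as the share of flows in the data set, which the text does not state explicitly, and it uses a fixed random seed so that a scenario can be rebuilt. The function and parameter names are hypothetical.

```python
import random


def build_scenario(p2p_flows, other_flows, p2p_fraction, total, seed=0):
    """Draw `total` labelled flows with the requested share of p2p flows.

    Assumption: the scenario percentage is taken as a share of flows.
    """
    n_p2p = round(total * p2p_fraction)
    n_other = total - n_p2p
    if n_p2p > len(p2p_flows) or n_other > len(other_flows):
        raise ValueError("not enough pre-classified flows for this scenario")
    rng = random.Random(seed)
    sample = [(f, "p2p") for f in rng.sample(p2p_flows, n_p2p)]
    sample += [(f, "non-p2p") for f in rng.sample(other_flows, n_other)]
    rng.shuffle(sample)  # avoid all p2p flows appearing first
    return sample


# Hypothetical flow identifiers standing in for real flow records.
p2p_pool = list(range(100))
other_pool = list(range(100, 200))
for fraction in (0.25, 0.50, 0.75):
    scenario = build_scenario(p2p_pool, other_pool, fraction, total=40)
    n_p2p = sum(1 for _, label in scenario if label == "p2p")
    print(fraction, len(scenario), n_p2p)
```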

B. Evaluation methodology

We performed the experiment using the custom-made data discussed in Section 3(A). For each of the scenarios, 10,000 flows were used in the experiment.

Feature Identification: The features are extracted from the IP header information of the packets. These are address pairs, port pairs, protocol, size of the packet and packets per flow. The work in [8] helps us in determining the distribution of these features in the flows. These features are chosen because they are independent of one another in the traffic.

Using WEKA, these flows are tested with the Naïve Bayes Estimator to classify the traffic into p2p and non-p2p. For each network scenario, the test is performed using 10,000 flows and with a different number of packets in each flow. Using the same number of flows for every scenario allows a better comparison of the algorithm across scenarios.

The user-supplied test data method can be used in this case because the amount of data supplied is large enough. One set of data is used to train the machine while the remaining set is used as the test set to find the accuracy of the algorithm. The confusion matrix shown in Table 1 gives the distribution of the classified results into p2p and non-p2p, which helps in finding the accuracy of the algorithm. The accuracy of the algorithm in each scenario is calculated using equation 6. This accuracy helps us in determining the performance of the algorithm under the given conditions.

IV. IMPLEMENTATION RESULTS

The results were obtained with different numbers of packets in each flow. The accuracy of the algorithm in each scenario with different numbers of packets is shown in Table 3.

TABLE 3. ACCURACY COMPARISON UNDER DIFFERENT AMOUNT OF PACKETS

#packets    Scenario 1    Scenario 2    Scenario 3
5000        87.41%        88.65%        87.27%
10000       89.06%        89.03%        89.52%
15000       90.02%        89.17%        90.53%
20000       90.42%        89.61%        90.74%
25000       90.87%        90.43%        91.17%
30000       91.45%        90.47%        92.24%
35000       92.32%        90.96%        92.77%
40000       93.02%        91.43%        93.21%
45000       93.02%        91.57%        93.43%
50000       93.89%        91.60%        93.09%

Table 3 shows that the accuracy of the algorithm generally increases as the number of packets increases. Better results are seen in scenarios 1 and 3, which have a larger difference between the amounts of p2p and non-p2p packet content. In scenario 2, where p2p and non-p2p data are distributed equally, the accuracy changes less and is also lower.
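The trend described for Table 3 can be checked with simple arithmetic on the reported values. The sketch below computes, for each scenario, the accuracy gain between 5000 and 50000 packets and whether the series ever decreases. The numbers are copied from Table 3; the helper names are chosen for the example.

```python
# Accuracy values in percent, copied from Table 3, ordered by packets per flow.
PACKETS = [5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000]
TABLE3 = {
    "Scenario 1": [87.41, 89.06, 90.02, 90.42, 90.87, 91.45, 92.32, 93.02, 93.02, 93.89],
    "Scenario 2": [88.65, 89.03, 89.17, 89.61, 90.43, 90.47, 90.96, 91.43, 91.57, 91.60],
    "Scenario 3": [87.27, 89.52, 90.53, 90.74, 91.17, 92.24, 92.77, 93.21, 93.43, 93.09],
}


def total_gain(values):
    """Percentage-point gain from the first to the last packet count."""
    return round(values[-1] - values[0], 2)


def is_non_decreasing(values):
    """True if accuracy never drops as the packet count grows."""
    return all(later >= earlier for earlier, later in zip(values, values[1:]))


for name, values in TABLE3.items():
    print(name, total_gain(values), is_non_decreasing(values))
```

On these figures the gain for scenario 2 is less than half of the gain for scenarios 1 and 3, and scenario 3 is the only series with a drop, between 45000 and 50000 packets.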

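Figure 3. Accuracy plot with different packets (x-axis: number of packets, 5000 to 50000; y-axis: accuracy; series: Scenario 1, Scenario 2, Scenario 3)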

From Figure 3, it can be seen that the accuracy is lower in scenario 2 than in scenarios 1 and 3, as stated in the previous paragraph. For the other two cases, the accuracy increases as the number of packets increases.

To identify the importance of the features, we reduced the number of features one at a time down to three features and evaluated the accuracy of the algorithm at each step. It was found that the accuracy decreased as the number of features decreased. This experiment was conducted for the scenarios listed in Table 2. The results with 50000 packets in each flow are shown in Table 4.

TABLE 4. ACCURACY COMPARISON UNDER DIFFERENT NUMBER OF FEATURES

#features    Scenario 1    Scenario 2    Scenario 3
7            92.74%        88.62%        92.89%
6            90.21%        86.59%        90.62%
5            83.25%        81.02%        83.54%
4            76.57%        77.35%        74.85%
3            76.02%        75.92%        73.32%

From Table 4, it can be seen that the number of features used to classify the traffic is crucial in determining the accuracy. As the number of features decreased, so did the accuracy. So, with the given features it is possible to classify the network traffic with a high level of accuracy using the simplest of the machine learning algorithms, the Naïve Bayesian algorithm.

Figure 4 shows a line graph of the accuracy as the number of features is changed. From Figure 4 it can be seen that the accuracy decreases as the number of features decreases. The change in accuracy is more prominent when the number of features goes from 6 to 5 to 4 than in the other two cases. The change in accuracy from 7 to 6 or from 4 to 3 is not very significant. So it can be concluded that the accuracy increases as features are added and then remains almost the same.
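The size of each step can be read directly from Table 4. The sketch below computes the drop in accuracy, in percentage points, each time one feature is removed. The values are copied from Table 4; the helper name is chosen for the example.

```python
# Accuracy values in percent, copied from Table 4, ordered by feature count.
FEATURES = [7, 6, 5, 4, 3]
TABLE4 = {
    "Scenario 1": [92.74, 90.21, 83.25, 76.57, 76.02],
    "Scenario 2": [88.62, 86.59, 81.02, 77.35, 75.92],
    "Scenario 3": [92.89, 90.62, 83.54, 74.85, 73.32],
}


def step_drops(values):
    """Accuracy lost at each step 7->6, 6->5, 5->4, 4->3."""
    return [round(before - after, 2) for before, after in zip(values, values[1:])]


for name, values in TABLE4.items():
    print(name, step_drops(values))
```

In every scenario the two middle steps (6 to 5 and 5 to 4) lose more accuracy than the first and last steps, which matches the reading of Figure 4 above.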

Figure 4. Accuracy plot with varying features (x-axis: number of features, 7 to 3; y-axis: accuracy; series: Scenario 1, Scenario 2, Scenario 3)

There are two popular machine learning verification methods mentioned in the literature: the 66% split test and the 10-fold test. These two tests were also applied to test the accuracy of the algorithm. The results of the tests are shown in Table 5. It can be seen that the results of the 66% split test and the 10-fold test are almost the same. The results of the tests decrease for the user-supplied test data.

TABLE 5. ACCURACY COMPARISON OF THE TESTS

#Test        Scenario 1    Scenario 2    Scenario 3
66% split    84.42%        88.77%        85.47%
10 fold      84.56%        89.03%        85.49%
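The two verification methods differ only in how the data is divided into training and test sets. The sketch below shows one plain way to produce a 66% split and 10 folds in Python. It is an illustration of the general idea, not WEKA's implementation, and the shuffling seed and function names are assumptions.

```python
import random


def percentage_split(rows, train_fraction=0.66, seed=0):
    """Shuffle once, then use the first 66% for training and the rest for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold(rows, k=10, seed=0):
    """Yield (train, test) pairs; each row is used for testing exactly once."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


# Hypothetical row identifiers standing in for labelled flows.
rows = list(range(100))
train_set, test_set = percentage_split(rows)
print(len(train_set), len(test_set))
for train_part, test_part in k_fold(rows):
    print(len(train_part), len(test_part))
```

With k-fold testing the reported accuracy is normally the average over the k runs, so every row contributes to the estimate.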

V. CONCLUSION

This paper evaluates the accuracy of a machine learning based algorithm that uses the flow-based method for network traffic classification. The custom-made data has helped in better classification of the traffic into p2p and non-p2p. Results with a high level of accuracy can be achieved with the simple Bayesian algorithm, as opposed to the previous work with this algorithm in which the accuracy was between 65% and 85%. It is also shown that the amount of data used plays a crucial role in determining the accuracy of the algorithm.

REFERENCES

[1] T. Karagiannis, A. Broido, N. Brownlee, kc claffy, and M. Faloutsos, “Is P2P dying or just hiding?” in IEEE Global Telecommunications Conf. (GLOBECOM 04), 2004

[2] A. Gerber, J. Houle, H. Nguyen, M. Roughan, S. Sen, “P2P the gorilla in the cable,” in National Cable and Telecommunications Association (NCTA) 2003 National Show, 2003.

[3] S. Sen, O. Spatscheck, and D. Wang, “Accurate, scalable network identification of P2P traffic using application signature,” in 13th international conference on World Wide Web, 2004.

[4] T. Karagiannis, A. Broido, N. Brownlee, kc claffy, and M. Faloutsos, “File-sharing in the internet: A characterization of P2P traffic in the backbone,” Tech. Rep., 2004.

[5] P. Collins and M. K. Reiter, “Finding peer-to-peer file-sharing using coarse network behaviors,” Lecture Notes in Computer Science, vol. 4189, pp. 1–17, 2006.

[6] M. Soysal, E.G. Schmidt, “An accurate evaluation of machine learning algorithms for flow-based P2P traffic detection”, in 22nd International Symposium on Computer and Information Sciences, ISCIS07, 2007.

[7] J. Erman, M. Arlitt, and A. Mahanti, “Traffic classification using clustering algorithms,” in Proceedings of the 2006 SIGCOMM Workshop on Mining Network Data (MineNet06), 2006.

[8] E.G. Schmidt, M. Soysal, “An intrusion detection based approach for the scalable detection of P2P traffic in the national academic network backbone”, in International Symposium on Computer Networks, ISCN06, 2006, pp. 128–133.

[9] A.W. Moore, D. Zuev, “Internet traffic classification using Bayesian analysis techniques”, in ACM SIGMETRICS 05, 2005.

[10] T. Auld, A.W. Moore, S.F. Gull, “Bayesian neural networks for Internet traffic classification”, IEEE Transactions on Neural Networks 18 (2007) 223–239.

[11] A.W. Moore, D. Zuev, “Traffic classification using a statistical approach”, in Passive and Active Measurement Workshop, PAM05, 2005.

[12] N. Williams, S. Zander, G. Armitage, “A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification”, Computer Communication Review 36 (2006) 7–15.

[13] N. Williams, S. Zander, G. Armitage, “Evaluating machine learning algorithms for automated network application identification”, Center for Advanced Internet Architectures (CAIA), Technical Report 060410B, 2006.

[14] J. Erman, A. Mahanti, M. Arlitt, “Internet traffic identification using machine learning”, in IEEE Global Telecommunications Conference, GLOBECOM'06, 2006.

[15] Online resource: http://www.cs.waikato.ac.nz/ml/weka/

[16] Online resource: http://www.trinity.edu/cbrown/bayesweb/

[17] Online resource: http://www.cisco.com/en/US/products/ps6601/products_ios_protocol_group_home.html/

[18] Online Resource: http://www.splintered.net/sw/flow-tools/
