A Novel Approach for Practical, Real-Time, Machine Learning Based
IP Traffic Classification
Dissertation submitted in accordance with the requirements for the degree of Doctor of Philosophy
Thuy T.T. Nguyen
Centre for Advanced Internet Architectures
Faculty of Information and Communication Technologies
Swinburne University of Technology
Melbourne, Australia
February 2009
Declaration
To the best of my knowledge and belief, this thesis contains no material previously published or
written by any other person, except where due reference is made in the text of the thesis. This
thesis has not been submitted previously, in whole or in part, to qualify for any other academic
award. The content of the thesis is the result of work which has been carried out since the
beginning of my candidature in March 2003.
Melbourne, 23rd February 2009
Thuy Nguyen
© Thuy Nguyen 2009. All rights reserved.
To my beloved family, especially
my husband Uy Dzung and my little daughter Khiet Linh!
Acknowledgements
Working towards this PhD was a long and challenging journey, and I would like to thank the
following people for making it possible.
First of all, I would like to express my sincere gratitude and appreciation to my first super-
visor, Professor Grenville Armitage. Throughout the years he has been a great mentor to me. I
have experienced both successful and frustrating experimental outcomes, sometimes losing my
way, and it was his guidance, support, encouragement, enthusiasm and passion that helped me
stay inspired and motivated. I feel so grateful to have had a supervisor who is willing to stand
up for his students; who tries hard to provide a great working environment with all the neces-
sary facilities and equipment for our experiments and research; and who creates opportunities
for us to present our work and establish networking connections at both local and international
conferences and workshops. Finally, I would like to thank him for all his patience during many
long hours of discussions and experiments, for teaching me how to do good research, how to
write a good paper, etc. All of these have really built a solid grounding for my future research
career.
I would like to thank Dr Philip Branch, my colleague and recently my second supervisor,
who has always been willing to provide me with help, support and advice when needed. I
am thankful for the encouragement he has given me since the early days of starting my thesis.
I really appreciate all the time he spent helping me review my work and offering valuable
suggestions and feedback. I would also like to thank Dr Jim Lambert (who was my second
supervisor for the first two years of my candidature and is now retired) for his support of my
work.
I owe special thanks to my colleagues, Sebastian Zander and Nigel Williams, for the inspiration
of their work, which ultimately led me to my thesis topic. My thanks to them for always being so
helpful and kind to me over the years. I deeply thank Warren Harrop and Lawrence Stewart, for
their kindness and generosity in giving me the VoIP data trace collected at their home network
to support my research. I would also like to thank Dragi Klimovski for giving me the opportu-
nity to attend and study his Cisco CCNP class – to widen my knowledge and gain experience
which benefited my research. To my other colleagues at the Centre for Advanced Internet Ar-
chitectures, I must say that I have been so lucky to have a chance to work in a great research
environment, with such smart, helpful and nice people – thank you all!
I would like to thank the Swinburne IT Services Department and the Centre for Astrophysics
and Supercomputing for providing the laboratory equipment that facilitated my research.
I would also like to thank the Centre for Advanced Internet Architectures, Cisco Systems
Australia, and Swinburne University of Technology for awarding me the Swinburne University
Postgraduate Research Award (SUPRA) and for providing funding support for the duration of
my candidature.
I would like to thank my husband – Uy Dzung – for walking with me on this journey with
his infinite support, love, and encouragement. Thanks to my parents, my sisters and my parents-
in-law for their patience and understanding. Completing this PhD would not have been possible
without the friendship of many special people. This is a sincere thank you to all of them. Last
but not least, this thesis is specially dedicated to Khiet Linh, my dear little daughter, who has
been separated from mommy for many months so that I could complete my thesis!
Contents
Acknowledgements 6
Abstract 14
Publications 17
Table of Acronyms 19
1 Introduction 21
2 Application Context for ML Based IPTC 28
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 The importance of IP traffic classification . . . . . . . . . . . . . . . . . . . . 30
2.2.1 QoS issues over Last Mile networks . . . . . . . . . . . . . . . . . . . 30
2.2.2 QoS provisioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Internet QoS standards . . . . . . . . . . . . . . . . . . . . . . . . . . 32
QoS-enabled solutions from industry . . . . . . . . . . . . . . . . . . 32
Automated QoS solution . . . . . . . . . . . . . . . . . . . . . . . . . 33
The role of IP traffic classification . . . . . . . . . . . . . . . . . . . . 33
2.2.3 Internet pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.4 Lawful interception . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3 Traffic classification metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.1 Positives, negatives, accuracy, precision and recall . . . . . . . . . . . 37
2.3.2 Byte and flow accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4 Limitations of packet inspection for traffic classification . . . . . . . . . . . . . 40
2.4.1 Port-based IP traffic classification . . . . . . . . . . . . . . . . . . . . 40
2.4.2 Payload-based IP traffic classification . . . . . . . . . . . . . . . . . . 41
2.5 Classification based on statistical traffic properties . . . . . . . . . . . . . . . . 42
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3 A Brief Background on Machine Learning 44
3.1 A review of classification with Machine Learning . . . . . . . . . . . . . . . . 44
3.1.1 Input and output of an ML process . . . . . . . . . . . . . . . . . . . . 45
3.1.2 Different types of learning . . . . . . . . . . . . . . . . . . . . . . . . 46
3.1.3 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
The Naive Bayes algorithm . . . . . . . . . . . . . . . . . . . . . . . . 47
The C4.5 Decision Tree algorithm . . . . . . . . . . . . . . . . . . . . 49
3.1.4 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.1.5 Evaluating supervised learning algorithms . . . . . . . . . . . . . . . . 52
3.1.6 Evaluating unsupervised learning algorithms . . . . . . . . . . . . . . 54
3.1.7 Feature selection algorithms . . . . . . . . . . . . . . . . . . . . . . . 55
3.1.8 Imbalanced datasets problem . . . . . . . . . . . . . . . . . . . . . . . 55
3.2 The application of ML in IP traffic classification . . . . . . . . . . . . . . . . . 57
3.2.1 Training and testing a supervised ML traffic classifier . . . . . . . . . . 59
3.2.2 Supervised versus unsupervised learning . . . . . . . . . . . . . . . . 62
3.3 Challenges for operational deployment . . . . . . . . . . . . . . . . . . . . . . 63
3.3.1 A deployment scenario . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.3.2 The operational challenges . . . . . . . . . . . . . . . . . . . . . . . . 66
Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Timely and continuous classification . . . . . . . . . . . . . . . . . . . 66
Directional neutrality . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Efficient use of memory and processors . . . . . . . . . . . . . . . . . 67
Portability and Robustness . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4 IP Traffic Classification Using Machine Learning 70
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2 Clustering approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.1 Flow clustering using Expectation Maximisation . . . . . . . . . . . . 71
4.2.2 Automated application identification using AutoClass . . . . . . . . . 72
4.2.3 TCP-based application identification using Simple K-Means . . . . . . 73
4.2.4 Identifying HTTP and P2P traffic in the network core . . . . . . . . . . 75
4.3 Supervised learning approaches . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.1 Statistical signature-based approach using NN, LDA and QDA algorithms 76
4.3.2 Classification using Bayesian analysis techniques . . . . . . . . . . . . 77
4.3.3 GA-based classification techniques . . . . . . . . . . . . . . . . . . . 78
4.3.4 Simple statistical protocol fingerprint method . . . . . . . . . . . . . . 79
4.4 Hybrid approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.5 Comparisons and related work . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.5.1 Comparison of different clustering algorithms . . . . . . . . . . . . . . 81
4.5.2 Comparison of clustering versus supervised techniques . . . . . . . . . 82
4.5.3 Comparison of different supervised ML algorithms . . . . . . . . . . . 83
4.5.4 ACAS: Classification using machine learning techniques on application
signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.5.5 BLINC: Multilevel traffic classification in the dark . . . . . . . . . . . 85
4.5.6 Pearson’s Chi-Square test and Naive Bayes classifier . . . . . . . . . . 86
4.6 Limitations of the reviewed works . . . . . . . . . . . . . . . . . . . . . . . . 87
4.6.1 Timely and continuous classification . . . . . . . . . . . . . . . . . . . 87
4.6.2 Directional neutrality . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.6.3 Efficient use of memory and processors . . . . . . . . . . . . . . . . . 88
4.6.4 Portability and Robustness . . . . . . . . . . . . . . . . . . . . . . . . 88
4.7 My research goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5 Training Using Multiple Sub-Flows for Real-Time IPTC 90
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2 My proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3 My experimental approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3.1 Flows and features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3.2 Machine Learning algorithms . . . . . . . . . . . . . . . . . . . . . . 94
5.3.3 Some statistical properties of ET traffic . . . . . . . . . . . . . . . . . 95
5.3.4 Constructing training and testing datasets . . . . . . . . . . . . . . . . 100
ET traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Other traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Training with full-flow, testing with four different sliding windows . . . 102
Training with full-flow instances of more than 25 packets (called filtered
full-flow), testing with a sliding window of N = 25 packets . . 103
Training with individual sub-flow, testing with a sliding window of N =
25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Training with multiple sub-flows, testing with a sliding window of N =
25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.5 Data processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.4 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.4.1 Training with full-flows, testing with four different sliding windows . . 109
5.4.2 Training with filtered full-flows, testing with a sliding window of N =
25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.4.3 Training with individual sub-flows, testing with a sliding window of N
= 25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4.4 Training with multiple sub-flows, testing with a sliding window of N =
25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6 Clustering For Automated Sub-Flow Selection 128
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.2 My proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.2.1 Step 1 - Sub-flow identification . . . . . . . . . . . . . . . . . . . . . 130
6.2.2 Step 2 - Sub-flows selection . . . . . . . . . . . . . . . . . . . . . . . 131
6.3 An experimental illustration of my proposal . . . . . . . . . . . . . . . . . . . 133
6.3.1 Step 1 - Sub-flow identification . . . . . . . . . . . . . . . . . . . . . 133
6.3.2 Step 2 - Sub-flow selection . . . . . . . . . . . . . . . . . . . . . . . . 134
6.3.3 Evaluation of classifiers trained with sub-flows selected by EM . . . . . 137
6.4 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.4.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.4.2 Computational performance . . . . . . . . . . . . . . . . . . . . . . . 144
6.4.3 Summary of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.5 Sampling for faster clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.5.1 The problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.5.2 Down-sampling for the clustering proposal . . . . . . . . . . . . . . . 152
6.5.3 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.6 Discussion and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7 Training Using Synthetic Sub-Flow Pairs 162
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.2 Proposal using a synthetic sub-flow pairs approach . . . . . . . . . . . . . . . 163
7.3 Illustrating the Synthetic Sub-Flow Pairs Training Approach . . . . . . . . . . 168
7.3.1 Experimental data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.3.2 Test methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7.4 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7.4.1 Classifying without training on SSP . . . . . . . . . . . . . . . . . . . 170
7.4.2 Training on SSP Option 1, classifying with a sliding window . . . . . . 172
7.4.3 Training on SSP Option 2, classifying with a sliding window . . . . . . 176
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
8 Training Using SSP-ACT 183
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
8.2 Evaluation of SSP-ACT in identifying VoIP traffic . . . . . . . . . . . . . . . . 184
8.2.1 A brief background on ITU-T G.711 PCMU and GSM 06.10 encoded
voice traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
8.2.2 Data collection and research methodology . . . . . . . . . . . . . . . . 185
Statistical properties of VoIP flows . . . . . . . . . . . . . . . . . . . . 186
8.2.3 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
8.3 Evaluation of SSP-ACT in the presence of additional packet loss . . . . . . . . 194
8.3.1 Impact of packet loss on the classification of ET traffic . . . . . . . . . 198
8.3.2 Impact of packet loss on the classification of VoIP traffic . . . . . . . . 200
8.4 Concurrent classification of multiple applications with SSP-ACT . . . . . . . . 204
8.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
8.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
9 Conclusion 214
Bibliography 218
List of Figures 240
List of Tables 240
A Traffic Characteristics of Selected Internet Applications 251
A.1 Asymmetric properties in bi-directional communication . . . . . . . . . . . . . 251
A.1.1 Client server ports asymmetry . . . . . . . . . . . . . . . . . . . . . . 251
A.1.2 Statistical Properties Asymmetry . . . . . . . . . . . . . . . . . . . . 252
A.2 Variation of traffic statistics during flow lifetime . . . . . . . . . . . . . . . . . 254
B Summary of ML-Based IP TC works in the Literature 258
B.1 A summary of key points for each reviewed work . . . . . . . . . . . . . . . . 258
B.2 A qualitative evaluation of the reviewed works . . . . . . . . . . . . . . . . . . 258
C Some Properties of Data Used for Training and Testing 265
C.1 Geographical distribution of ET traffic . . . . . . . . . . . . . . . . . . . . . . 265
C.2 Traffic mix for training and testing . . . . . . . . . . . . . . . . . . . . . . . . 266
D Characteristics of VoIP Traffic 271
D.1 VoIP data extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
D.2 Statistical properties of G.711 and GSM flows . . . . . . . . . . . . . . . . . . 272
E Trade-offs in Cluster Quality and Classifier Performance 274
E.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
E.2 Computational performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Abstract
Today’s Internet does not guarantee any bounds on packet delay, loss or jitter for traffic travers-
ing its networks. Uncontrolled networks can easily lead to bad user experiences for those emerg-
ing applications that have more stringent Quality of Service (QoS) requirements. This suggests
there is a vital need for an effective QoS-enabled network architecture, in which the network
equipment is capable of classifying Internet traffic into different classes for different QoS treat-
ments. Beyond technology, there are other issues related to a practical QoS solution for the
Internet, including the challenges of minimising the deployment cost of QoS technologies and
simplifying users’ experiences. Like other services, the Internet is expected to be user-friendly,
simple and easy to understand, stable and available on request, predictable and transparent, and
not requiring users to understand its underlying architecture in order to use the service.
With an awareness of these issues, my thesis focuses on the automation of the QoS control
process, particularly by means of an automated, real-time IP traffic classification (IPTC) mech-
anism. Traditional techniques for the identification of Internet applications are based either on
the use of well-known registered port numbers or on payload-based protocol reconstruction.
However, applications can use unregistered ports or encryption to obfuscate packet contents;
and governments may impose privacy regulations that constrain the ability of third parties to
lawfully inspect packet payloads. Newer approaches, on the other hand, classify traffic by learn-
ing and recognising statistical patterns in externally observable attributes of the traffic (such as
packet lengths and inter-packet arrival times). State-of-the-art approaches look closely at the
application of Machine Learning (ML) – a powerful technique for data mining and knowledge
discovery – to the classification of IP traffic.
However, before I began publishing my work no ML-based approach to IPTC properly con-
sidered the constraints of being deployed in real-time operational networks. Most publications
on the use of ML algorithms for classifying IP traffic have relied on bi-directional, full-flow
statistics (from start until finish or time-out), while assuming that flows have an explicit direc-
tion implied by the first packet captured, or a known client-server relationship. Some other
studies have tried classification using the first few packets of a flow. In contrast, most if not
all real-world scenarios require a classification decision well before a flow has finished, using
statistics derived from a small number of recent packets rather than from the entire flow. Clas-
sifiers may also have missed an arbitrary number of packets from the start of a flow, and be
unsure of the direction in which the flow started.
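One way to meet this real-time constraint is to maintain per-flow statistics over a sliding window of only the most recent N packets. The sketch below (in Python, with hypothetical class and feature names; it is an illustration of the idea, not the implementation used in this thesis) shows how such window statistics might be computed:

```python
from collections import deque
from statistics import mean, stdev

class SlidingWindowFeatures:
    """Maintain statistics over the most recent N packets of a flow."""

    def __init__(self, n=25):
        self.n = n
        self.packets = deque(maxlen=n)  # holds (timestamp, length) pairs

    def add(self, timestamp, length):
        """Record a newly observed packet; the oldest is evicted automatically."""
        self.packets.append((timestamp, length))

    def ready(self):
        """True once the window contains N packets."""
        return len(self.packets) == self.n

    def features(self):
        """Return per-window statistics usable as input to an ML classifier."""
        times = [t for t, _ in self.packets]
        lengths = [l for _, l in self.packets]
        iats = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
        return {
            "mean_len": mean(lengths),
            "stdev_len": stdev(lengths),
            "mean_iat": mean(iats),
            "stdev_iat": stdev(iats),
        }
```

Because the window slides forward with every packet, a classification decision can be made at any point in the flow's lifetime, regardless of how many earlier packets were missed.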
To overcome these problems, I propose and evaluate novel modifications to the current ML-
based approaches. My goal is to achieve classification by using statistics derived from only the
most recent N packets of a flow (for some small value of N). Because a target application’s
short-term traffic statistics vary within the lifetime of a single flow, I propose training the ML
classifier on a set of multiple short sub-flows, each ‘sub-flow’ being a collection of N consec-
utive packets extracted from full-flow samples of the target application’s traffic. The sub-flows
are picked from regions of the application’s flow that have noticeably different statistical char-
acteristics. I further augment the training set by synthesising a complementary version of every
sub-flow in the reverse direction, since most Internet applications exhibit asymmetric traffic
characteristics in the client-to-server and server-to-client directions. Finally, I propose a novel
use of unsupervised ML algorithms for the automated selection of appropriate sub-flow pairs
when examples of traffic are given from applications that we wish to classify.
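The sub-flow and synthetic-pair ideas above can be sketched as follows. The helper names and the (timestamp, length, direction) packet representation are hypothetical, chosen to illustrate the concept rather than reproduce the exact training procedure:

```python
def extract_subflow(flow, start, n=25):
    """Take N consecutive packets beginning at an arbitrary offset in the flow."""
    return flow[start:start + n]

def mirror_direction(subflow):
    """Synthesise the complementary sub-flow as seen from the reverse direction
    by flipping each packet's direction flag (client-to-server <-> server-to-client)."""
    return [(ts, length, "s2c" if d == "c2s" else "c2s")
            for ts, length, d in subflow]

def build_training_set(flow, offsets, n=25):
    """Train on multiple sub-flows drawn from statistically distinct regions of
    the flow, plus a synthetic reverse-direction pair for each one."""
    training = []
    for start in offsets:
        sub = extract_subflow(flow, start, n)
        if len(sub) == n:
            training.append(sub)
            training.append(mirror_direction(sub))  # the synthetic pair
    return training
```

Training on both members of each pair means the classifier no longer depends on knowing which direction a flow started in, which matters when the first captured packet may be from either endpoint.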
I combine my proposals into a training approach that I call Synthetic Sub-flow Pairs with the
Assistance of Clustering Techniques (SSP-ACT). I demonstrate my optimisation when applied
to the Naive Bayes and C4.5 Decision Tree ML algorithms, for the identification of an online
game – Wolfenstein Enemy Territory (ET) – and of VoIP traffic. My experiments showed that for
ET, when trained using SSP-ACT and classifying using a small sliding classification window
of 25 packets (roughly corresponding to 0.5 seconds in real time), the Naive Bayes classifier
achieved 98.9% median Recall and 87% median Precision, and the C4.5 Decision Tree classifier
achieved 99.3% median Recall and 97% median Precision. My results also confirmed that
classification performance is maintained even when the classification is initiated at an arbitrary
point within a flow and is independent of the direction of the first packet captured.
For VoIP, when trained using SSP-ACT and classifying on a sliding window of 25 packets
(approximately 0.25 seconds in real-time when there is voice traffic in both directions), the
Naive Bayes classifier achieved 100% median Recall and 95.4% median Precision, and the
C4.5 Decision Tree classifier achieved 95.7% median Recall and 99.2% median Precision.
I also study the impact of packet loss on SSP-ACT’s performance, with 5% synthetic, ran-
dom and independent packet loss. For Wolfenstein Enemy Territory traffic, 5% packet loss only
degraded the Recall and Precision of both the Naive Bayes and C4.5 Decision Tree classifiers
by less than 0.5%. For VoIP traffic, 5% packet loss caused no noticeable degradation in
the Naive Bayes classifier's Recall and Precision. However, it degraded the C4.5 Decision
Tree classifier’s Recall and Precision by 8.5% and 0.1% respectively. Despite this degradation,
median Recall and Precision of the C4.5 Decision Tree classifier still remained above 87% and
99% for all the tested positions of the sliding window. Deeper investigation of the sensitivity of
the Naive Bayes and C4.5 Decision Tree classifiers with regards to packet loss is left for future
research. This work also can be expanded in future with other loss rates and loss models.
I also demonstrate that SSP-ACT is effective in identifying both ET and VoIP traffic con-
currently, by using a single common classifier or two separate classifiers in parallel, one for
each application. My results reveal that using a common classifier provides better Precision and
Recall, at the cost of classification speed, and each option carries further trade-offs relative
to the other. How SSP-ACT could scale to classify a
larger number of applications simultaneously is a question that requires further study.
My results show that SSP-ACT is a significant improvement over the previous, published
state of the art for IP traffic classification. My present work has focused on IPTC of an online
game and VoIP, and revealed a potential solution to the accurate and timely classification of
traffic belonging to other Internet applications.
Publications
A number of peer-reviewed papers have been published based on material and discussion in this
thesis, as listed below.
Peer-reviewed Journal Papers:
• T.T.T. Nguyen and G. Armitage, “A Survey of Techniques for Internet Traffic Classifica-
tion using Machine Learning,” IEEE Communications Surveys & Tutorials, Vol. 10, No.
4, 2008
• J. But, T.T.T. Nguyen, G. Armitage, “The Brave New World of Online Digital Home
Entertainment,” IEEE Communications, May 2005
• T.T.T. Nguyen, G. Armitage, “Evaluating Internet Pricing Schemes - A Three Dimen-
sional Visual Model,” ETRI Journal, Vol. 27, No. 1, February 2005.
Peer-reviewed Conference Papers:
• T.T.T. Nguyen and G. Armitage, “Clustering to Assist Supervised Machine Learning for
Real-Time IP Traffic Classification,” in Proc. 2008 IEEE International Conference on
Communications, pp. 5857-5862. Beijing, China, 19-23 May 2008.
• T.T.T. Nguyen, G. Armitage, “Synthetic Sub-flow Pairs for Timely and Stable IP Traffic
Identification,” in Proc. Australian Telecommunication Networks and Applications Con-
ference, Melbourne, Australia, December 2006.
• T.T.T. Nguyen, G. Armitage, “Training on Multiple Sub-flows to Optimise the Use of
Machine Learning Classifiers in Real-world IP Networks,” in IEEE 31st Conference on
Local Computer Networks, Tampa, Florida, USA, November 2006.
• S. Zander, T.T.T. Nguyen, G. Armitage, “Automated Traffic Classification and Appli-
cation Identification using Machine Learning,” Proc. IEEE 30th Conference on Local
Computer Networks (LCN 2005), Sydney, Australia, November 2005
• S. Zander, T.T.T. Nguyen, G. Armitage, “Self-learning IP Traffic Classification based
on Statistical Flow Characteristics,” Passive Active Measurement Workshop (PAM) 2005,
Boston, USA, March/April 2005.
• T.T.T. Nguyen, G. Armitage, “Experimentally Derived Interactions between TCP Traffic
and Service Quality over DOCSIS Cable Links,” Proc. of Global Internet and Next
Generation Networks Symposium, IEEE Globecom 2004, Texas, USA, November 2004.
• T.T.T. Nguyen, G. Armitage, “Quantitative Assessment of IP Service Quality in 802.11b
Networks,” The 3rd Workshop on the Internet, Telecommunications and Signal Process-
ing (WITSP’04), Adelaide, Australia, December 2004.
• T.T.T. Nguyen, G. Armitage, “Quantitative Assessment of IP Service Quality in 802.11b
and DOCSIS networks,” The Australian Telecommunication Networks and Applications
Conference (ATNAC 2004), Sydney, Australia, December 2004.
• T.T.T. Nguyen, G. Armitage, “Pricing the Internet - A Visual 3-Dimensional Evaluation
Model,” Proc. of Australian Telecommunications Networks and Applications Conference
(ATNAC), Melbourne, Australia, December 2003
Table of Acronyms
CM Cable Modem
DBSCAN Density Based Spatial Clustering of Applications with Noise
DiffServ Differentiated Services
DNS Domain Name System
DOCSIS Data Over Cable Service Interface Specifications
DS Downstream
ET Wolfenstein Enemy Territory
FPS First Person Shooter
FTP File Transfer Protocol
HTTP HyperText Transfer Protocol
IMAP Internet Message Access Protocol
IntServ Integrated Services
IP Internet Protocol
IPTC IP Traffic Classification
ISP Internet Services Provider
Kbyte Kilobyte (equal to 1024 bytes)
LI Lawful Interception
Mbps Megabits per second
Mbyte Megabyte (equal to 1024 Kbytes)
ML Machine Learning
MPLS Multi Protocol Label Switching
NTP Network Time Protocol
QoS Quality of Service
RTT Round Trip Time
SMTP Simple Mail Transfer Protocol
SSP-ACT Synthetic Sub-Flow Pairs with the Assistance of Clustering Techniques
US Upstream
HFC Hybrid Fibre Coaxial Network
CMTS Cable Modem Termination System
ACK Acknowledgement
P2P Peer-to-Peer
ICMP Internet Control Message Protocol
MTU Maximum Transmission Unit
FN False Negatives
FP False Positives
TP True Positives
TN True Negatives
MSS Maximum Segment Size
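The True/False Positive and Negative counts listed above combine into the Recall and Precision metrics reported throughout this thesis. A minimal sketch of the standard definitions:

```python
def recall(tp, fn):
    """Fraction of the target application's actual flows that were correctly
    identified: TP / (TP + FN)."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of flows labelled as the target application that truly belong
    to it: TP / (TP + FP)."""
    return tp / (tp + fp)
```

A classifier can achieve high Recall while still mislabelling much other traffic (low Precision), which is why the thesis reports both metrics together.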
Chapter 1
Introduction
The Internet has now been a part of our lives for more than 30 years, since the first public demon-
stration of the ARPANET network technology in 1972 [1]. Its use has been growing rapidly over
the years with increases not only in the number of users [2][3], hosts and servers [4], networks
and autonomous systems [5], but also in volume and types of traffic [6]. Traditional Internet
applications, such as electronic mail, file transfer, and static-content web sites, are being joined
by newer services such as IP telephony, real-time interactive audio and video conferencing,
streaming of multimedia content, online games, and electronic commerce. This creates a wide
range of household [7][8][9] and business Internet uses [10][11][12][13][14][15][16]. This
expanding trend is driven further by the rapid development of computing and communications
in portable forms (e.g. laptop computers, PDAs and cellular phones), along with new modes of
Internet access (e.g. from dial-up to broadband to possible optical access networks in the future
[17][18]), which will potentially spawn more new applications and services.
With these developing trends, parameters such as the timeliness of data delivery, packet loss
and variability in end-to-end packet delay (jitter) become more important for Internet quality of
service (QoS). Traditional non-interactive applications, such as bulk data transfer (FTP), backup
operations or database synchronising, can span their operations over a long period of non-peak
time as background activities [19]. On the other hand, emerging interactive applications such as
business transactions and Web surfing are delay-sensitive; waiting times for these are tolerable
only on the order of seconds [20]. Even less tolerant of delay are those applications that need
to satisfy human requirements for interactivity, such as real-time voice communication and
networked online games. The delay limits for these application types are a fraction of a second
[21][22][23]. Similarly, video performance can suffer from jerky appearance due to jitter and
frame distortion resulting from packet loss [24]. For voice applications, the loss of two or
more consecutive voice samples may result in noticeable degradation of voice quality [19]. In
various studies, online game applications have also been shown to be sensitive to network delay,
loss and jitter [25][26][27][28][29].
Finding viable solutions for QoS-enabled Internet has attracted considerable research effort
since the early 1990s, with the introduction of the Integrated Services (IntServ) [30], Differen-
tiated Services (DiffServ) [31], and Multi Protocol Label Switching (MPLS) [32] architectures.
However, the introduction of these architectures has yet to make a significant impact on the QoS
perceived by Internet end users. Most networks and applications are still dominated by ‘Best-
Effort’ services, in which the network provides no guarantee on the bounds of packet delay, loss
or jitter.
One reason for the poor uptake of these QoS approaches is the lack of an effective mecha-
nism that allows applications to signal their explicit QoS requirements to the underlying network
[33]. One option is to leave this task to the applications or to the users. However, it might be
unreasonable to expect software developers to be aware of the network issues or to understand
the underlying technologies and explicit network requirements for providing QoS for their ap-
plications. Furthermore, tying an application to a particular standard for QoS provisioning, or
requiring complicated user intervention or knowledge may restrict its options for deployment
[33][34]. An alternative solution is to shift QoS signalling from the application to the network
[33][35]. In this approach, the network is equipped with intelligent devices that can automat-
ically classify traffic in terms of QoS demands, and prompt the ISP’s QoS control system to
provide appropriate QoS treatment.
There are also other issues beyond those related to technology to be faced in order to achieve
a successful Internet QoS solution [36]. These are the challenges of minimising the deployment
cost of QoS technologies and simplifying users’ experiences. For ISPs, implementation and
operational costs must not exceed the revenues likely to be gained by deploying any new QoS
scheme. ISPs may also resist deploying a complex technology if there are questions as to its
reliability and operational effort [37][38]. For Internet users, the Internet is expected to be user-
friendly, simple to understand, stable and available on request, predictable and transparent, and
should not require that users understand the underlying architecture in order to use the service
[20][37][39][40].
The work of this thesis is motivated by the desire to find a good solution for Internet QoS.
My literature review on the QoS problem space suggests that a network based, robust and auto-
mated real-time IP traffic classification technique is an important component for implementing
QoS across the Internet. IP traffic classification (which I will refer to as IPTC) is the process
of identifying and classifying an individual Internet application or a group of applications of
interest. It can serve as a core part of an automated QoS-enabled architecture, assist the QoS
signalling process by quickly identifying the traffic of interest, and trigger an automated QoS
control system for allocation of network resources for priority applications. Real-time IPTC
allows network operators to know in good time what is flowing over their networks, so they
can react quickly in support of their various business goals. It also has the potential to support
class-based QoS accounting and billing. More importantly, it can be done automatically by the
network providers, and does not require users’ intervention or specialist knowledge about the
underlying technologies. It can help to bring QoS to consumers in a user friendly way. Further-
more, IPTC can assist in automated intrusion detection [41][42]. Recently, governments have
also been clarifying ISP obligations with respect to ‘lawful interception’ (LI) of IP traffic [43].
IPTC is an integral part of ISP-based LI solutions [44][45][46].
Traditional techniques for identifying Internet applications are typically based on the use
of well-known registered port numbers or on payload-based protocol reconstruction. How-
ever, applications may use unregistered ports or encryption to obfuscate packet contents and
governments may impose privacy regulations constraining the ability of third parties to law-
fully inspect packet payloads. Newer approaches classify traffic by learning and recognising
statistical patterns in externally observable attributes of the traffic (such as packet lengths and
inter-packet arrival times). In particular, state-of-the-art techniques include the application of
Machine Learning (ML) – a powerful technique for data mining and knowledge discovery – to
IPTC.
However, the literature of ML-based IPTC has not properly considered the constraints of be-
ing deployed in real-time operational networks. Most published work has primarily focused on
the efficacy of different ML algorithms when applied to entire datasets of IP traffic. Classification models typically rely on flow statistical properties measured over full-flows (from their start
until they finish or are timed out); some more recent work has attempted classification using the
first few packets of a flow. Yet in real networks, traffic classifiers must reach decisions well
before a flow has finished, so that network operators can react quickly to support their various
business goals, for example, for flow QoS mapping and priority treatment. The classifier may
start (or restart) at an arbitrary time and may not see the beginning of a flow. An application’s
statistical behaviour may change over its flow lifetime; in addition there may be thousands of
concurrent flows, and the classifier needs to operate with finite CPU and memory resources.
Further, although this has not always been clearly stated in the literature, directionality
has been an implicit attribute of the features on which ML classifiers were built and used.
Application flows in many cases are defined as bi-directional, and the application’s statistical
features are calculated separately in the forward and backward (reverse) directions. Most work
assumes that the forward direction is indicated by the first packet of the flow (on the basis that it
is commonly the initial packet from a client to a server). Subsequent evaluations assume that the
classifier sees the first packet of every flow, in order to calculate features with the correct sense
of direction. However, a real-world classifier cannot be sure whether the first packet it sees
(of any bi-directional flow of packets) is heading in the forward (client-to-server) or backward
(server-to-client) direction. Because the traffic of many Internet applications is asymmetric in the client-to-server and server-to-client directions, this ambiguity can lead to degraded classification performance.
In contrast to previously published work, I consider not only the timeliness of an ML traffic classifier, but also the sustainability of its performance when monitoring traffic flows at any point in
their lifetime, given the constraints of limited physical resources. This makes the contribution
of my work novel and unique.
I propose that practical real-time traffic classifiers must accurately classify traffic in the face
of a number of constraints:
• The classifier should use statistical methods (such as ML algorithms), since TCP/UDP
port numbers may be misleading, and packet payloads may be opaque to direct interpre-
tation.
• ML classification should be done over a small sliding window of the last N packets (to
minimise memory requirements and perform classification in a timely manner).
• The classifier must use only features that have a low processing/computation cost.
• The classifier must cope with applications changing their network traffic patterns during the life of a flow.
• The classifier must recognise flows already in progress, since the beginning of a flow may be missed.
• The classifier must not rely on knowing the direction of the original flow. It can assume the forward direction is the direction of the first packet of the most recent N packets it has captured, regardless of whether this is from client to server or server to client.
My research question, therefore, is to investigate the possibility of building practical ML-
based real-time traffic classifiers that address all of the above requirements.
In this thesis I propose a novel approach to ML-based IPTC that I call the ‘Synthetic Sub-
Flow Pairs with the Assistance of Clustering Techniques’ (SSP-ACT) training method. Instead
of using the statistical properties of a flow calculated over its whole lifetime, or from its first few
packets, I train the ML classifier on a set of short sub-flows (each sub-flow contains a number of
consecutive packets extracted from full-flow examples of the target application’s traffic). This
allows the classifier to properly identify an application, regardless of where within a flow the
classifier begins capturing packets.
Dealing with the directionality issues, SSP-ACT further augments the training set by synthe-
sising a complementary version of every sub-flow in the reverse direction (hence the ‘synthetic
sub-flow pairs’ term). The first packet of a sliding window can alternatively represent traffic
between a client to a server or a server to a client. SSP-ACT trains the classifier to recognise
the application either way.
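The two training-set augmentations just described can be sketched as follows. This is my own toy representation (a flow as a list of (direction, length) pairs, with illustrative parameter values), not the thesis's code: first slice short sub-flows of N consecutive packets out of a full flow at several offsets, then synthesise a direction-reversed twin of each sub-flow so the classifier learns both orientations.

```python
# Sketch of sub-flow extraction plus synthetic sub-flow pair generation.
N = 6  # sub-flow length in packets (illustrative)

def sub_flows(full_flow, offsets):
    """Extract N-packet sub-flows starting at the given packet offsets."""
    return [full_flow[o:o + N] for o in offsets if o + N <= len(full_flow)]

def reverse_pair(sub_flow):
    """Synthetic twin: same packets, with every direction flag flipped."""
    flip = {"fwd": "bwd", "bwd": "fwd"}
    return [(flip[d], length) for d, length in sub_flow]

# A toy full flow: (direction, packet length) for 20 packets.
flow = [("fwd" if i % 3 else "bwd", 60 + i) for i in range(20)]
subs = sub_flows(flow, offsets=[0, 5, 10])
training_set = [s for sf in subs for s in (sf, reverse_pair(sf))]
assert len(training_set) == 6            # each sub-flow contributes a pair
assert reverse_pair(reverse_pair(subs[0])) == subs[0]
```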
A limited number of representative sub-flows that best capture distinctive statistical varia-
tions of the full-flows are selected to train the classifier. SSP-ACT makes use of unsupervised
clustering ML techniques to automate the selection process.
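The selection step can be illustrated with a compact k-means clustering sketch. This is my own stand-in for the unsupervised clustering techniques referred to above (the seeding strategy, feature vectors and parameter values are illustrative assumptions): group the sub-flow feature vectors, then keep the member nearest each cluster centre as that cluster's representative training sub-flow.

```python
# Sketch: k-means over sub-flow feature vectors, then pick one
# representative sub-flow per cluster for the training set.
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    centroids = points[:k]                      # simple deterministic seeding
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: dist2(p, centroids[j]))].append(p)
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
                     for j, g in enumerate(groups)]
    return centroids

def representatives(points, centroids):
    """Pick, per cluster, the actual sub-flow vector closest to the centroid."""
    return [min(points, key=lambda p: dist2(p, c)) for c in centroids]

# Toy feature vectors (mean packet length, mean inter-arrival time) from two
# distinct statistical behaviours within one application's flows.
points = [(60.0, 0.05), (62.0, 0.06), (61.0, 0.05),
          (400.0, 0.01), (410.0, 0.01), (405.0, 0.02)]
cents = kmeans(points, k=2)
reps = representatives(points, cents)
assert len(reps) == 2 and reps[0] != reps[1]
```

The point of the representative step is that the classifier is trained on a small number of real sub-flows that between them span the full-flows' distinctive statistical variations, rather than on every sub-flow.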
I demonstrate the effectiveness of SSP-ACT by constructing an ML classifier designed to
identify highly interactive online game traffic mixed with thousands of unrelated interfering
traffic flows. I chose a popular First Person Shooter (FPS) game application (Wolfenstein En-
emy Territory (ET) [47]), the traffic characteristics of which can change significantly over the
lifetime of each flow, and are asymmetric in the client-to-server and server-to-client directions.
I evaluate the generality of SSP-ACT with the classification of another Internet application,
Voice over IP (VoIP) traffic. The characteristics of VoIP traffic differ markedly from ET traffic, being more stable over a flow's lifetime and more symmetric in the forward and backward directions. I also perform a preliminary investigation of the impact of 5% random, independent packet loss on the classification of VoIP and ET traffic. The scalability of SSP-ACT for
concurrent classification of multiple applications is also discussed.
I demonstrate that SSP-ACT can significantly improve a classifier’s performance using a
small sliding window, regardless of how many packets are missed from the beginning of each
flow and of the direction of the first packet of the most recent N packets used for the classification. The classifiers trained using SSP-ACT maintain their accuracy well in the presence of 5% random, independent synthetic packet loss. I also demonstrate that SSP-ACT is effective in
identifying both ET and VoIP traffic concurrently, by using a single common classifier or two
separate classifiers in parallel, one for each application.
At the time of submitting this thesis, SSP-ACT has been implemented and used in an auto-
mated QoS-control system at Swinburne University of Technology [35], and has been demon-
strated to provide sub-second real-time classification of online game traffic.
My results show that SSP-ACT is a significant improvement over the previously published state of the art for IP traffic classification. Although the experiments are confined to online
game and VoIP applications, my results reveal a potential solution to the accurate and timely
classification of traffic belonging to other Internet applications.
The thesis is organised as follows.
In Chapter 2 I provide the context for IPTC in IP networks, and highlight its importance
in the areas of QoS provisioning, Internet accounting and charging, and lawful interception. I
then review the traditional methods of traffic classification, and highlight the motivations for
emerging ML-based IPTC techniques.
ML-based IPTC is an interdisciplinary field, involving both networking and data mining techniques. It leverages data mining to explore the large space of traffic statistical properties and to devise novel classification rules emerging from the mining process. In Chapter 3, I
summarise the basic concepts of ML and how they can be applied to IPTC. I discuss a number
of key requirements for the employment of ML-based classifiers in operational IP networks,
which act as guidelines for my research presented in subsequent chapters.
In Chapter 4 I review significant prior work related to ML-based IPTC. I discuss its limitations with regard to the operational challenges addressed in Chapter 3. This helps
me define my research question with a justification of its originality and novelty and the reasons
why it is worth pursuing. The chapter is concluded with the problem statement for my thesis.
In Chapter 5 I present my novel modification to traditional ML training and classification
techniques, using a multiple sub-flows training method. I demonstrate that the method optimises
the classification of flows within finite periods of time, regardless of where within the flows’
lifetime the traffic is captured. My experiments are conducted on the Naive Bayes and C4.5
Decision Tree classifiers, with the goal of classifying ET traffic against a number of other common
Internet applications.
In Chapter 6 I propose and demonstrate an automated approach based on the use of clus-
tering ML techniques to choose appropriate, representative sub-flows, from which effective
ML-based IP traffic classifiers may be trained.
In Chapter 7, I demonstrate the directionality issues that arise when a classifier is trained on an assumed flow direction, an assumption that may be wrong when classifying in real operational networks. I propose and demonstrate that training on synthetic sub-flow pairs allows the classifier to maintain its performance without relying on prior knowledge of the inferred or actual directionality of a flow.
Chapter 8 provides an evaluation of the overall SSP-ACT approach. Its effectiveness is demonstrated with a VoIP application. My preliminary investigation of the impact of 5% random, independent packet loss on the classification of VoIP and ET traffic is
presented. I also propose two different implementation options for the concurrent classification
of multiple applications, the pros and cons of which are discussed.
Chapter 9 concludes the thesis with final remarks and suggestions for future work.
Chapter 2
Application Context for ML Based IPTraffic Classification
2.1 Introduction
Real-time IP traffic classification (IPTC) has the potential to solve difficult network manage-
ment problems for Internet service providers (ISPs) and their equipment vendors. Network
operators need to know what is flowing over their networks promptly so they can react quickly
in support of their various business goals. Traffic classification may be a core part of automated
intrusion detection systems [48][42][49], used to detect patterns indicative of denial of service
attacks, to trigger automated re-allocation of network resources for priority customers [33], or to
identify customer use of network resources for accounting and billing purposes. More recently,
governments have also been clarifying ISP obligations with respect to ‘lawful interception’ (LI)
of IP data traffic [50]. Just as telephone companies must support interception of telephone
usage, ISPs are increasingly subject to government requests for information on network use
by particular individuals at particular points in time. IPTC is an integral part of ISP-based LI
solutions.
Commonly deployed IPTC techniques have been based around direct inspection of each
packet’s contents at some point on the network. Successive IP packets that have the same
five-tuple of protocol type, source address:port and destination address:port are considered to
belong to a flow whose controlling application we wish to determine. Simple classification
infers the controlling application’s identity by assuming that most applications consistently
use ‘well known’ TCP or UDP port numbers (visible in the TCP or UDP headers). However,
many applications are increasingly using unpredictable (or at least obscure) port numbers [51].
Consequently, more sophisticated classification techniques infer application types by looking
for application-specific data (or well-known protocol behaviour) within the TCP or UDP pay-
loads [52].
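The classical scheme described above can be sketched briefly. This is a minimal illustration of my own (not any deployed classifier's code): packets sharing the same five-tuple, in either direction, are grouped into one flow, which is then labelled by looking up 'well known' ports. The port numbers shown are a small illustrative subset of IANA registrations.

```python
# Sketch: bi-directional five-tuple flow keying and port-based classification.
WELL_KNOWN_PORTS = {80: "http", 443: "https", 25: "smtp", 53: "dns"}

def flow_key(proto, src_ip, src_port, dst_ip, dst_port):
    """Canonical bi-directional five-tuple: both directions map to one key."""
    a = (src_ip, src_port)
    b = (dst_ip, dst_port)
    return (proto,) + (a + b if a <= b else b + a)

def classify_by_port(src_port, dst_port):
    """Guess the controlling application from either end's port number."""
    return (WELL_KNOWN_PORTS.get(dst_port)
            or WELL_KNOWN_PORTS.get(src_port)
            or "unknown")

# Forward and reverse packets of the same conversation share one flow key.
fwd = flow_key("tcp", "10.0.0.1", 49152, "192.0.2.7", 80)
rev = flow_key("tcp", "192.0.2.7", 80, "10.0.0.1", 49152)
assert fwd == rev
assert classify_by_port(49152, 80) == "http"
assert classify_by_port(6881, 50000) == "unknown"  # unregistered ports defeat this scheme
```

The final assertion illustrates the weakness discussed in the text: an application on unregistered or obscure ports falls straight through to "unknown".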
Unfortunately, the effectiveness of such ‘deep packet inspection’ techniques is diminishing.
Such packet inspection relies on two related assumptions:
• Third parties unaffiliated with either source or recipient are able to inspect each IP packet’s
payload (i.e. the payload is visible).
• The classifier knows the syntax of each application’s packet payloads (i.e. the payload
can be interpreted).
Two emerging challenges undermine the first assumption – customers may use encryption to
obfuscate packet contents (including TCP or UDP port numbers), and governments may impose
privacy regulations constraining the ability of third parties to lawfully inspect payloads at all.
The second assumption imposes a heavy operational load – commercial devices would need
repeated updates to stay ahead of regular (or simply gratuitous) changes in every application’s
packet payload formats.
The research community has responded by investigating classification schemes capable of
inferring application-level usage patterns without deep inspection of packet payloads. Newer
approaches (e.g. [53], [54], [55], [56], [57] and [58]) classify traffic by recognising statistical
patterns in externally observable attributes of the traffic (such as typical packet lengths, inter-
packet arrival times, and flow duration and volume). The goal is to either cluster IP traffic flows
into groups that have similar traffic patterns, or classify one or more applications of interest.
A number of researchers are looking at the application of Machine Learning (ML) tech-
niques (a subset of the wider Artificial Intelligence discipline) to IPTC (e.g. [59], [60], [61]).
The application of ML techniques involves a number of steps. First, features are defined by
which future unknown IP traffic may be identified and differentiated. Features are attributes of
flows calculated over multiple packets (such as maximum or minimum packet lengths in each
direction, flow durations or inter-packet arrival times). The ML classifier is trained to associate
sets of features with known traffic classes (creating rules), and to apply the ML algorithm to
classify unknown traffic using the previously learned rules. Every ML algorithm has a different
approach to sorting and prioritising sets of features, which leads to different dynamic behaviours
during training and classification.
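The two ML steps described above can be sketched as follows. This is a small illustration of my own (toy data, and a nearest-centroid rule standing in for a real ML algorithm such as Naive Bayes or C4.5): compute per-flow features, then train a model that maps feature vectors to known traffic classes and applies the learned rules to unseen flows.

```python
# Sketch: per-flow feature computation, training, and classification.
def features(pkts):
    """pkts: list of (timestamp_s, length_bytes). Returns a feature vector."""
    lengths = [l for _, l in pkts]
    gaps = [t2 - t1 for (t1, _), (t2, _) in zip(pkts, pkts[1:])]
    return (min(lengths), max(lengths),
            sum(lengths) / len(lengths),
            sum(gaps) / len(gaps) if gaps else 0.0)

def train(labelled_flows):
    """Learn one centroid (mean feature vector) per known traffic class."""
    centroids = {}
    for label, flows in labelled_flows.items():
        vecs = [features(f) for f in flows]
        centroids[label] = tuple(sum(col) / len(col) for col in zip(*vecs))
    return centroids

def classify(centroids, flow):
    """Assign the class whose centroid is nearest (squared Euclidean distance)."""
    v = features(flow)
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(v, c))
    return min(centroids, key=lambda lbl: dist(centroids[lbl]))

# Toy training data: small, regularly spaced packets for 'game'; large for 'bulk'.
game = [[(i * 0.05, 60 + i % 5) for i in range(20)] for _ in range(3)]
bulk = [[(i * 0.01, 1400) for i in range(20)] for _ in range(3)]
model = train({"game": game, "bulk": bulk})
assert classify(model, [(i * 0.05, 62) for i in range(20)]) == "game"
assert classify(model, [(i * 0.01, 1380) for i in range(20)]) == "bulk"
```

A real ML algorithm would of course sort and weight the features far more cleverly than this centroid rule, which is exactly the point made above about different algorithms exhibiting different training and classification behaviour.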
This chapter provides the rationale for IPTC in IP networks, reviews the traditional ap-
proaches to traffic classification, and highlights the motivations for emerging ML-based tech-
niques for IPTC.
The rest of this chapter is organised as follows. Section 2.2 justifies the importance of IPTC by reviewing the important networking areas of QoS issues and provisioning, Internet pricing and lawful interception. Section 2.3 introduces a number of metrics for
assessing classification accuracy. Section 2.4 discusses the limitations of traditional port- and
payload-based classification techniques. This provides the basis for the motivation for statistical
and ML based traffic classification approaches discussed in section 2.5. Section 2.6 concludes
the chapter with some final remarks.
2.2 The importance of IP traffic classification
The importance of IPTC may be illustrated by reviewing the important areas of IP QoS issues
and provisioning, Internet pricing and Lawful Interception (LI).
2.2.1 QoS issues over Last Mile networks
Network capacity tends to be high in core (backbone) networks, low in access networks and
high in home or enterprise LANs. Consequently the edge (the boundary between ISP and
customer networks) tends to make a significant contribution to observed network queuing delay
and jitter [62]. I conducted a number of experimental studies to observe the degree to which
modern, ‘high bandwidth’ access technologies still introduce uncontrolled latency fluctuations
[63], [64] and [65]. I focused in particular on two common Internet access technologies: Data
Over Cable Service Interface Specifications (DOCSIS) [66] networks and 802.11b [67] wireless
local area networks.
A typical DOCSIS access network is illustrated in Figure 2.1. In this scenario, the home
user’s equipment (used for various activities, such as Web browsing, data and movie down-
loading, or playing interactive online games and chat) is connected to the remote content or
game servers through the DOCSIS cable network of an ISP. Conceptually, the user’s traffic
travels through the user’s Cable Modem (CM), the Hybrid Fibre Coaxial Network (HFC) and
the Cable Modem Termination System (CMTS) at the ISP site, and the remote links.
[Figure 2.1: A typical DOCSIS cable network from ISP to home users. The diagram shows home devices (VoIP, online game, and Web/P2P/SSH/SMTP traffic) connected through a Cable Modem and the HFC network to the CMTS at the ISP, and onward to remote game and application servers; DS and US denote the downstream and upstream directions.]
I observed that when a client downloads content from an ISP-hosted server the DOCSIS
link exhibits a significant spike in latency that impacts on all traffic concurrently sharing the
DOCSIS link. (In my particular experiments [64], [63] the RTT jumped from 13ms when idle
to over 100ms during long-lived TCP-based data transfers from a remote server to a home-based
client¹.)
Wireless LAN networks have become popular for interactive applications such as online
gaming and videoconferencing. As with DOCSIS, I observed that consumer-grade 802.11b
networks exhibited latency fluctuations in excess of 100ms during long-lived TCP-based data
transfers [64] and [65].
These experiments confirmed my belief that modern access link technologies must deploy
traffic prioritisation mechanisms to effectively isolate different classes of end-user traffic from
each other. (With respect to my specific examples, better QoS control requires a CMTS, CM,
802.11 AP and/or 802.11 client that can discriminate between Internet applications, classes of
traffic and customers with different needs.)
2.2.2 QoS provisioning
In responding to the problem of network congestion, a common strategy for network providers is to under-utilise (over-provision) the link capacity. However, this is not necessarily an economic solution for most ISPs. The Internet QoS solutions proposed over the last decade can be classified into three broad categories: Internet QoS standards, industry QoS-enabled products, and others.

¹ The downstream (DS) and upstream (US) directions were capped to 2Mbps and 1Mbps respectively. This approximated a consumer-grade cable-modem downlink while also ensuring the upstream ACK rate was not a limiting factor. Further characterisation of the increase in RTT as a function of offered load is presented in [64], [63].
A common requirement for these frameworks is an effective IPTC mechanism. An overview
of these frameworks provides the context for the use of IPTC in IP networks.
Internet QoS standards
The Integrated Services (IntServ) architecture [30] was the first major attempt to enhance the
Internet with QoS capabilities. It developed a new architecture for resource allocation, to meet
the requirements of real-time applications while preserving the datagram model of IP-based
networks. The basis of this approach is per-flow resource reservation. The resource reservation
protocol (RSVP) [68] has been developed as an end-to-end resource reservation set-up protocol
that maintains the reservation state inside the network [69].
The differentiated services (DiffServ) [31] architecture, unlike IntServ, does not provide a
complete solution for end-to-end QoS set-up or management. DiffServ defines only a set of
per-hop building blocks and a language in which to express per-hop forwarding behaviours.
Both IntServ and DiffServ rely on packet header inspection to map traffic to reserved re-
sources or forwarding behaviours (respectively) in each router along a path.
QoS-enabled solutions from industry
Since the early 2000s the telecommunications industry has introduced a number of QoS-enabled
products which can provide some QoS guarantees. For example, Ubicom Inc.’s ‘StreamEngine’
technology [70], and D-Link’s ‘GameFuel’ products [71] built upon that StreamEngine technol-
ogy, offer routers targeted specifically at providing QoS for multiplayer games and real-time,
interactive traffic applications. StreamEngine technology relies on local packet inspection to
classify packets into QoS classes for QoS provisioning and management.
Another example is Cisco Systems’ integration of AutoQoS features (e.g. for voice traffic
[72] ) into their high-end switches and routers. The technology combines traffic classification
with configuration of Differentiated Services across the network. Packets are classified based
on policies specified by the network operator, which are mostly based on the physical port,
source or destination IP or MAC address, IP protocol type, or payload content. Marked packets
or flows are then tagged with a specific priority for treatment when they arrive at the Cisco
QoS-enabled device [72].
Allot Communications Ltd has developed a range of products called NetEnforcer to provide
QoS control and service level management in IP networks. The Allot NetEnforcer technique
relies on deep content packet inspection for traffic classification and control [73]. Priority queu-
ing is used to provide QoS. With NetEnforcer, each new connection flow gets its own queue
(per-flow queuing). The new queue is treated equally with other flows having the same priority
policy class.
In 2008, Exinda Networks [74] introduced a range of products called Application Acceleration, and NetIntact [75] introduced its PacketLogic Generation 2 products. These products
also rely on packet content inspection for traffic identification, while rate shaping and priority
queuing are used to provide QoS.
Automated QoS solutions using traffic classification and priority systems
Distributed mechanisms for classifying and controlling traffic over access links are also being
explored. One example is ANGEL (Automated Network Games Enhancement Layer [35] [76],
itself an evolution of [33]). ANGEL provides for remote control of traffic differentiation in
customer modems or routers based on traffic classification occurring inside the ISP network.
The architecture of ANGEL is comprised of both ISP-side and CPE-side components. The
ISP-side components of ANGEL receive a copy of network traffic that is later classified, so as
to detect network game traffic. Once a game flow is detected², ANGEL informs individual
CPE devices of this identification (using the ANGEL ISP/CPE protocol) to allow flow prioriti-
sation at the CPE. ANGEL has been implemented using machine learning techniques for traffic
classification, building on my work presented in this thesis.
The role of IP traffic classification
All QoS schemes have some degree of IPTC implicit in their design. DiffServ assumes that edge routers can recognise and differentiate between aggregate classes of traffic in order to set the DiffServ code point (DSCP) on packets entering the network core. IntServ presumes that routers along a path are able to differentiate between finely grained traffic classes (and historically has presumed the use of packet header inspection to achieve this goal).

² Although the goal of ANGEL is to provide QoS for game traffic, its architecture can be used for other real-time traffic as well.
Furthermore, real-time traffic classification is the core component of recent QoS-enabled
products [70] and automated QoS architectures [33] [35]. For example, with the StreamEngine
technology, one of the most important steps is to automatically classify the traffic passing
through the system to assign appropriate levels of priority [77]. The deployability of the ar-
chitecture in [33], [35] depends on the choice of the core components, including the traffic
classifier.
2.2.3 Internet pricing
The development of QoS solutions such as IntServ or DiffServ has been stymied in part due to
the lack of an effective service pricing mechanism (as suggested in [69] and [78]). A pricing
mechanism is needed to differentiate customers with different needs and charge for the QoS that
they receive. It would also act as a cost recovery mechanism and provide revenue generation
for the ISPs to compensate for their efforts in providing QoS and managing resource allocation.
Traffic classification has great potential to support a practical class-based Internet QoS charging
mechanism.
Finding a fairer and more efficient charging scheme for the Internet has attracted a signifi-
cant amount of research over the past decade. Work has included proposals for a smart market
[79], shadow pricing [80], rate-based pricing [81], edge pricing [82], congestion discount [83],
zone-based cost sharing [84], Paris metro pricing [85], Tirupati pricing [86], priority pricing
[87] [88] [89], pricing for Integrated Services [38], Differentiated Services [90], pricing for re-
source negotiation [91] and pricing over congestion control [92] [93]. (While not central to this
thesis, I present a detailed review of these ideas in [37] and [36].)
Many pricing models have been proposed, aimed at an ideal pricing scheme which is able
to:
• Provide levels of services suited to different users with different needs.
• Charge users only for their perceived quality of service (QoS) and the resources they
consume.
• Cope with the non-uniformity of Internet traffic with different QoS requirements.
• Enable ISPs to develop sustainable and profitable business models.
Most proposed Internet pricing models achieve a subset of the goals listed above. However, no particular solution has been widely implemented: the Internet is still dominated by flat-rate pricing and simple usage charges (such as charging per volume of traffic or
connection duration). Exploring the issues associated with Internet pricing reveals that a prac-
tical solution needs to consider three important metrics, namely technical efficiency, economic
efficiency, and social impact [36] [37].
Technical efficiency refers to the costs associated with applying the new technology of a
particular pricing model or QoS provisioning scheme. Economic efficiency captures the impact
of a pricing scheme on network utilisation and the optimisation of a service provider’s revenue.
This dimension reflects the capability to accommodate new Internet services and valued customers, and the maximisation of profit gained by charging for customers' traffic and the QoS delivered.
Social impact concerns fairness for network users.
For most pricing schemes there is a distinct coupling and interrelationship between eco-
nomic efficiency, social impact and technical efficiency. Clearly, it is desirable to discover an
optimal pricing model in which economic efficiency, social impact and technical efficiency are
all concurrently maximised. However, in reality pricing models always tend to reveal a trade-off
between these three dimensions.
Most QoS provisioning schemes reviewed in section 2.2.2 try to provide QoS to end users
by differentiating among users on the basis of different needs/preferences or their different
types of applications. Different QoS treatment on the network is then provided accordingly. A
compatible Internet pricing model, therefore, needs to rely on an accurate classification of users' traffic, and to charge users for the QoS delivered.
Furthermore, as indicated in [36],[38],[39],[40], and [94], from the user’s perspective, the
most important requirements and expectations are the transparency, stability and predictability
of a pricing scheme, as well as the QoS provisioning mechanism. Probably most Internet users
have little or no interest in the underlying technologies or complicated ways by which their
applications or network are managed. Techniques which require user intervention and special
knowledge about the underlying technology are likely to be a hindrance to deployment. Users
should not have to signal their application’s identity to the underlying network through explicit
QoS preferences. Such tasks should be performed automatically by the network providers.
From an ISP perspective, implementation costs are critical and must not exceed the revenues
likely to be gained by introducing any new scheme. Network stability and reliability must
also be considered. ISPs resist deploying a complex technology if there are questions as to its
reliability or the operational effort required.
Accurate, automated traffic classification is an important component of any practical and
deployable QoS-based pricing scheme.
2.2.4 Lawful interception
There is an emerging requirement for ISP networks to provide Lawful Interception (LI) capabil-
ities, and traffic classification is an important solution in this regard [50] [95] [43]. Governments
typically implement LI at various levels of abstraction. In the telephony world a law enforce-
ment agency may nominate a ‘person of interest’ and issue a warrant for the collection of in-
tercept information. The intercept may be high-level call records (who called whom and when)
or low-level ‘tapping’ of the audio from actual phone calls in progress. In the ISP space, traffic
classification techniques offer the possibility of identifying traffic patterns (which end points
are exchanging packets and when), and identifying what classes of applications are being used
by a ‘person of interest’ at any given point in time (e.g. [96], [46], [44] and [97]). Depending on
the particular traffic classification scheme, this information may potentially be obtained without
violating any privacy laws covering the TCP or UDP payloads of the ISP customer’s traffic [45].
2.3 Traffic classification metrics
A key criterion on which to differentiate between classification techniques is predictive accuracy
(i.e., how accurately the technique or model makes decisions when presented with previously
unseen data). A number of metrics exist with which to express predictive accuracy.
2.3.1 Positives, negatives, accuracy, precision and recall
Let us assume there is a traffic class X that we wish to identify, mixed with a broader set of IP
traffic. A traffic classifier is used to identify (classify) packets (or flows of packets) belonging to
class X when presented with a mixture of previously unseen traffic. The classifier is presumed
to give one of two outputs: a flow (or packet) is either believed to be a member of class X, or it
is not.
A common way to characterise a classifier’s accuracy is through metrics known as the per-
centage of False Positives, False Negatives, True Positives and True Negatives. These metrics
are defined as follows:
• False Negatives (FN): The number of members of class X incorrectly classified as not
belonging to class X.
False Negatives Percentage (FN%): The percentage of FN, among all members of class
X.
• False Positives (FP): The number of members of other classes incorrectly classified as
belonging to class X.
False Positives Percentage (FP%): The percentage of FP, among all members of other
classes.
• True Positives (TP): The number of members of class X correctly classified as belonging
to class X.
True Positives Percentage (TP%): The percentage of TP among all members of class X
(equivalent to 100% - FN%).
• True Negatives (TN): The number of members of other classes correctly classified as not
belonging to class X.
True Negatives Percentage (TN%): The percentage of TN, among all members of other
classes (equivalent to 100% - FP%).
Figure 2.2 illustrates the relationships between FN, FP, TP and TN. A good traffic classifier
aims to minimise the False Negatives and False Positives.
[Figure 2.2: Evaluation Metrics. The diagram shows how classified instances partition into the
TP, FN, FP and TN regions.]
Some work in the literature makes use of Accuracy as an evaluation metric. It is generally
defined as the percentage of correctly classified instances among the total number of instances.
This definition is used throughout the thesis unless otherwise stated.
The ML literature often utilises two additional metrics known as Recall and Precision.
These metrics are defined as follows:
• Recall: Percentage of members of class X correctly classified as belonging to class X,
among all members of class X.
• Precision: Percentage of those instances that truly belong to class X, among all those
classified as class X.
The percentage metrics above range from 0% to 100%. For TP%, TN%, Recall and Precision,
100% is optimal, while for FN% and FP% the optimum is 0%. It can be seen that Recall is
equivalent to TP%.
With regard to Figure 2.2, Recall and Precision are defined as follows:

Recall = TP / (TP + FN)        Precision = TP / (TP + FP)
Though all these metrics can be appropriate to evaluate a classifier, it is important to realise
that the overall accuracy metric might not be the best metric (sometimes even misleading) to re-
flect the classifier’s performance. This can be the case when there is a great imbalance between
traffic classes’ population sizes, for example, if class X contains only a single member, while
non-class X (X̄) contains 99 members. If the classifier misclassifies class X's single member
as X̄, and correctly classifies all 99 X̄ members as X̄, then the overall accuracy is very high (99%) while
actually the FN% is 100% (or Recall is 0%). If we want to identify members of class X, the
accuracy per class X (e.g. FN% and FP% or Recall and Precision) is more important than the
overall accuracy of the classifier.
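The imbalance example above can be checked with a short sketch (the counts are the hypothetical 1-versus-99 scenario, not measured data):

```python
def metrics(tp, fp, tn, fn):
    """Return (accuracy, recall, precision) as fractions; precision is
    None when nothing was classified as X (TP + FP == 0)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else None
    precision = tp / (tp + fp) if (tp + fp) else None
    return accuracy, recall, precision

# Class X has 1 member, non-X has 99, and the classifier labels everything as non-X.
acc, rec, prec = metrics(tp=0, fp=0, tn=99, fn=1)
print(acc, rec, prec)  # 0.99 0.0 None
```

Overall accuracy is 99% even though every member of class X was missed (Recall is 0), which is exactly the misleading case discussed above.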
In this thesis I focus on the Recall and Precision metrics commonly used in the ML literature,
as they summarise well the performance of the classifier per class. It is also important to note
that high Precision is only meaningful when the classifier has achieved good Recall and vice
versa.
2.3.2 Byte and flow accuracy
When comparing the literature on different classification techniques it is also important to note
the unit of the author’s chosen metric. Recall, Precision, FN and FP may all be reported as
percentages of bytes or flows relative to the traffic being classified. An author’s choice here can
significantly alter the meaning of the reported accuracy results.
Most recently published traffic classification studies have focused on flow accuracy – mea-
suring the accuracy with which flows are correctly classified, relative to the number of other
flows in the author’s test and/or training dataset(s). However, some recent work has also chosen
to express accuracy calculations in terms of byte accuracy – focusing more on how many bytes
are carried by the packets of correctly classified flows, relative to the total number of bytes in
the author’s test and/or training dataset(s) (e.g. [53] and [98]).
Erman et al. in [99] argue that byte accuracy is crucial when evaluating the accuracy of
traffic classification algorithms. They note that the majority of flows on the Internet are small
and account for only a small portion of total bytes and packets in the network (mice flows).
On the other hand, the majority of traffic bytes are generated by a small number of large flows
(elephant flows). They provide an example from a six-month data trace which found the top
(largest) 1% of flows accounted for over 73% of the traffic in terms of bytes. With a threshold to
differentiate elephant and mice flows of 3.7MB, the top 0.1% of flows would account for 46%
of the traffic (in bytes). Presented with such a dataset, a classifier optimised to identify all but
the top 0.1% of the flows could attain a 99.9% flow accuracy but still result in 46% of the bytes
in the dataset being misclassified.
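The elephant/mice effect can be reproduced with a toy flow table (the flow counts and sizes below are invented, chosen only to echo the shape of the example in [99]):

```python
# 999 small ("mice") flows, all classified correctly, plus one large
# ("elephant") flow that is misclassified. Each entry is
# (bytes_carried, correctly_classified).
flows = [(10_000, True)] * 999 + [(8_500_000, False)]

flow_acc = sum(ok for _, ok in flows) / len(flows)
byte_acc = sum(b for b, ok in flows if ok) / sum(b for b, _ in flows)

print(f"flow accuracy = {flow_acc:.1%}, byte accuracy = {byte_acc:.1%}")
```

Flow accuracy is 99.9%, yet nearly half the bytes are misclassified, illustrating why the choice of unit matters.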
Whether flow accuracy or byte accuracy is more important will generally depend on the
classifier’s intended use. For example, when classifying traffic for IP QoS purposes it is plau-
sible that identifying every instance of a short-lived flow needing QoS (such as five-minute,
32Kbit/sec phone calls) is as important as identifying long-lived flows needing QoS (such as
a 30 minute, 256Kbit/sec video conference), with both being far more important to correctly
identify than the few flows that represent multi-hour (and/or hundreds of megabytes) peer-to-
peer file sharing sessions. Conversely, an ISP undertaking analysis of load patterns on their
network may well be significantly interested in correctly classifying the applications driving the
elephant flows that contribute a disproportionate number of packets across their network.
In this thesis, I focus on IPTC to support QoS solutions. Hence, I use flow accuracy to
evaluate the performance of a classifier under test.
2.4 Limitations of packet inspection for traffic classification
Traditional IPTC relies on the inspection of a packet’s TCP or UDP port numbers (port-based
classification), or the reconstruction of protocol signatures in its payload (payload-based clas-
sification). Each approach suffers from a number of limitations.
2.4.1 Port-based IP traffic classification
TCP and UDP provide for the multiplexing of multiple flows between common IP end points
through the use of port numbers. Historically many applications utilise a ‘well-known’ port
on their local host as a rendezvous point to which other hosts may initiate communication. A
classifier sitting in the middle of a network need only look for TCP SYN packets (the first step
in a TCP’s three-way handshake during session establishment) to know the server side of a new
client-server TCP connection. The application is then inferred by looking up the TCP SYN
packet’s target port number in the Internet Assigned Numbers Authority’s (IANA) list of reg-
istered ports [100]. UDP uses ports in a similar way, though without connection establishment
or the maintenance of connection state.
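The port-lookup step can be sketched as follows. The dictionary is a tiny hand-picked excerpt; a real classifier would consult the full IANA registry:

```python
# Excerpt of well-known port assignments (illustrative subset only).
IANA_PORTS = {20: "ftp-data", 21: "ftp", 22: "ssh", 25: "smtp",
              53: "dns", 80: "http", 110: "pop3", 443: "https"}

def classify_syn(dst_port: int) -> str:
    """Infer the application from the destination (server) port of a TCP SYN."""
    return IANA_PORTS.get(dst_port, "unknown")

print(classify_syn(80))    # http
print(classify_syn(6699))  # unknown (e.g. an unregistered P2P port)
```

The second call shows the fundamental weakness: any application on an unregistered or non-standard port falls straight through the lookup.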
However, this approach has limitations. Firstly, some applications may not have their ports
registered with IANA (for example, peer-to-peer applications such as Napster and Kazaa) [61].
An application may use ports other than its well-known ports to avoid operating system access
control restrictions (for example, non-privileged users on Unix-like systems may be forced to
run HTTP servers on ports other than port 80). Also, in some cases server ports are dynamically
allocated as needed. For example, the RealVideo streamer allows the dynamic negotiation of
the server port to be used for the data transfer. This server port is negotiated on an initial TCP
connection, which is established using the well-known RealVideo control port [101].
Moore and Papagiannaki [102] observed no better than a 70% byte accuracy for port-based
classification using the official IANA list. Madhukar and Williamson [103] showed that port-
based analysis was unable to identify 30-70% of the Internet traffic flows they investigated. Sen
et al. [52] reported that the default port accounted for only 30% of the total traffic (in bytes) for
the Kazaa P2P protocol.
In some circumstances IP layer encryption may also obfuscate the TCP or UDP header,
making it impossible to know the actual port numbers.
2.4.2 Payload-based IP traffic classification
To avoid total reliance on the semantics of port numbers, many current industry products utilise
stateful reconstruction of session and application information from each packet’s content.
Sen et al. [52] demonstrated that payload-based classification of P2P traffic (by examin-
ing the signatures of the traffic at the application level) could reduce false positives and false
negatives to 5% of total bytes for most P2P protocols studied.
Moore and Papagiannaki [102] use a combination of port- and payload-based techniques
to identify network applications. The classification procedure starts with the examination of
a flow’s port number. If no well-known port is used, the flow is passed through to the next
stage. In the second stage, the first packet is examined to see whether it contains a known
signature. If one is not found, then the packet is examined to see whether it contains a well-
known protocol. If these tests fail, the protocol signatures in the first KByte of the flow are
studied. Flows that remain unclassified after that stage require inspection of the entire flow
payload. Their results show that port information by itself is capable of correctly classifying
69% of the total bytes. Furthermore, including the information observed in the first KByte of
each flow increases the accuracy to almost 79%. Higher accuracy (up to nearly 100%) can only
be achieved by investigating the remaining unclassified flows’ entire payload.
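The staged procedure can be caricatured in a few lines. The signature strings here are invented placeholders, not the actual signatures used in [102]:

```python
# Stage 1: well-known port; Stage 2: signature in the first packet's
# payload; anything else would require deeper payload inspection.
WELL_KNOWN = {80: "http", 25: "smtp"}
SIGNATURES = {b"GET ": "http", b"HELO": "smtp", b"\x13Bit": "bittorrent"}

def classify(dst_port: int, first_payload: bytes) -> str:
    if dst_port in WELL_KNOWN:                 # stage 1: port number
        return WELL_KNOWN[dst_port]
    for sig, app in SIGNATURES.items():        # stage 2: payload signature
        if first_payload.startswith(sig):
            return app
    return "unclassified"                      # later stages: full payload

print(classify(8080, b"GET /index.html"))  # http (caught by signature, not port)
```

Each stage trades extra inspection cost for extra coverage, mirroring the accuracy progression (69%, 79%, up to nearly 100%) reported above.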
Although payload-based inspection avoids reliance on fixed port numbers, it imposes sig-
nificant complexity and a substantial processing load on the traffic identification device. It must
be kept up-to-date with extensive knowledge of application protocol semantics, and must be
powerful enough to perform concurrent analysis of a potentially large number of flows. This
approach can be difficult or impossible when dealing with proprietary protocols or encrypted
traffic. Furthermore, direct analysis of session and application layer content may represent a
breach of organisational privacy policies or a violation of relevant privacy legislation.
2.5 Classification based on statistical traffic properties
The preceding techniques are limited by their dependence on the inferred semantics of the infor-
mation gathered through deep inspection of packet content (payload and port numbers). Newer
approaches rely on the traffic’s statistical characteristics to identify the application. An assump-
tion underlying such methods is that traffic at the network layer has statistical properties (such
as the distribution of flow duration, flow idle time, packet inter-arrival time and packet lengths)
that are unique to certain classes of applications, enabling different source applications to
be distinguished from each other.
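A sketch of how such statistical properties might be computed for a single flow from (timestamp, packet length) pairs; the four packets are invented for illustration:

```python
import statistics

# (timestamp_seconds, packet_length_bytes) for one flow's packets.
packets = [(0.000, 60), (0.021, 1500), (0.043, 1500), (0.090, 60)]

lengths = [plen for _, plen in packets]
iats = [t2 - t1 for (t1, _), (t2, _) in zip(packets, packets[1:])]  # inter-arrival times

features = {
    "mean_len": statistics.mean(lengths),
    "stdev_len": statistics.stdev(lengths),
    "mean_iat": statistics.mean(iats),
    "duration": packets[-1][0] - packets[0][0],
}
print(features)
```

Features like these are externally observable, requiring neither port semantics nor payload access.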
The relationship between the class of traffic and its observed statistical properties has been
noted in [104] where the authors analysed and constructed empirical models of connection
characteristics - such as bytes, duration and arrival periodicity - for a number of specific TCP
applications, and in [105] where the authors analysed Internet chat systems by focusing on the
characteristics of the traffic in terms of flow duration, packet inter-arrival time and packet size
and byte profile. Later work (for example [106], [107] and [108]) also observed distinctive
traffic characteristics, such as the distributions of packet lengths and packet inter-arrival times,
for a number of Internet applications. The results of these studies have stimulated new classifi-
cation techniques based on the statistical properties of traffic flow. The need to deal with traffic
patterns, large datasets and multi-dimensional spaces of flow and packet attributes is one of the
reasons for the introduction of ML techniques into this field.
2.6 Conclusion
In this chapter, I have discussed the negative impacts of the Last Mile bottleneck on QoS de-
livery to sensitive applications. The results showed that QoS problems in access links cannot
be solved simply by providing larger amounts of uncontrolled bandwidth. This suggests the
need for an effective QoS control and traffic prioritising system to overcome the problem. For
implementing a number of proposed QoS architectures, a real-time automated IPTC is a crucial
component.
IPTC also plays an important role in the areas of Internet pricing and lawful interception.
An automated, real-time and low-cost IP traffic classifier operating at the network layer may
be a useful tool for a simple class-based Internet pricing and billing system. This may offer a
solution that satisfies the deployment requirements of both ISPs and their customers. A robust
IPTC scheme also has great potential to provide a non-intrusive solution for ISPs to satisfy
government LI requirements.
I have also demonstrated that the commonly deployed approaches, such as port-based or
deep packet inspection techniques, have been diminishing in effectiveness. The new direction
of classifying traffic by learning and recognising statistical patterns in externally observable
attributes of the traffic (such as packet lengths and inter-packet arrival times) is therefore
emerging as a promising solution. This provides the application context for Machine Learning-
based IPTC techniques.
In the next chapter, a brief background on Machine Learning and its application in the IPTC
field will be presented. I will also address a number of requirements for a deployable machine
learning based IP traffic classifier in an operational network. This acts as a guideline to my
novel proposal for a practical, real-time, automated IP traffic classifier, presented in Chapter 5
and Chapter 7.
Chapter 3
A Brief Background on Machine Learning and its Application to IP Traffic Classification
Machine Learning (ML) has long been known as a powerful technique for data mining and
knowledge discovery, which searches for and describes useful structural patterns in data. ML
has a great range of applications, including in relation to search engines, medical diagnosis, text
and handwriting recognition, image screening, load forecasting, marketing and sales diagnosis
[109].
This chapter summarises the basic concepts of ML and outlines how ML can be applied to
IPTC. It also discusses a number of key requirements for the employment of ML-based classi-
fiers in operational IP networks, which act as guidelines for my research on a novel, practical,
real-time, automated and deployable ML-based IP traffic classifier.
3.1 A review of classification with Machine Learning
In 1992 Shi [110] noted, ‘One of the defining features of intelligence is the ability to learn...
Machine learning is the study of making machines acquire new knowledge, new skills, and
reorganise existing knowledge’. A learning machine has the ability to learn automatically from
experience and refine and improve its knowledge base. In 1983 Simon noted, ‘Learning denotes
changes in the system that are adaptive in the sense that they enable the system to do the same
task or tasks drawn from the same population more efficiently and more effectively the next
time’ [111]; and in 2000 Witten and Frank observed, ‘Things learn when they change their
behavior in a way that makes them perform better in the future’ [109].
As mentioned previously, ML has a wide range of applications. The use of ML techniques
by network traffic controllers was proposed in 1990, aiming to maximise call completion in
a circuit-switched telecommunications network [112]; this was one of the studies that marked
the point at which ML techniques expanded their application space into the telecommunications
networking field. In 1994 ML was first utilised for Internet flow classification in the context of
intrusion detection [41]. This was the starting point for much of the work on ML techniques in
Internet traffic classification that followed.
3.1.1 Input and output of an ML process
ML takes input in the form of a dataset of instances (also known as examples). An instance
refers to an individual, independent example of the dataset. Each instance is characterised by the
values of its features (also known as attributes or discriminators) that measure different aspects
of the instance. (In the networking field consecutive packets from the same flow might form an
instance, while the set of features might include median inter-packet arrival times or standard
deviation of packet lengths over a number of consecutive packets in a flow.) The dataset is
ultimately presented as a matrix of instances versus features [109]. An example dataset is
illustrated in Figure 3.1. Within a dataset, the same set of features must be used to describe
every instance, although the feature values themselves can vary from instance to instance.
                     Feature 1           Feature 2                ...  Feature K  Class
                     (e.g. mean packet   (e.g. mean packet
                     length)             inter-arrival time)
 Flow instance 1     A_1                 B_1                      ...             Game
 Flow instance 2     A_2                 B_2                      ...             Other
 ...                 ...                 ...                      ...             ...
 Flow instance N-1   A_{N-1}             B_{N-1}                  ...             Other
 Flow instance N     A_N                 B_N                      ...             Game

Figure 3.1: An example dataset as a matrix of instances versus features
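The layout of Figure 3.1 can be expressed directly in code: one row per flow instance, the same feature set for every row, plus a class column. The feature names and values here are invented placeholders:

```python
# Each instance is described by the same features plus a class label.
instances = [
    {"mean_len": 180.0, "mean_iat": 0.02, "class": "Game"},
    {"mean_len": 900.0, "mean_iat": 0.40, "class": "Other"},
]
FEATURES = ["mean_len", "mean_iat"]  # identical feature set for every instance

X = [[inst[f] for f in FEATURES] for inst in instances]  # instances-by-features matrix
y = [inst["class"] for inst in instances]                # class column

print(X, y)
```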
The output is the description of the knowledge that has been learnt. How the specific out-
come of the learning process is represented (the syntax and semantics) depends largely on the
particular ML approach being used.
3.1.2 Different types of learning
Witten and Frank [109] define four basic types of learning:
• Classification (or supervised learning)
• Clustering (or unsupervised learning)
• Association
• Numeric prediction
Classification learning involves a machine learning from a set of pre-classified (also called pre-
labelled) examples, from which it builds a set of classification rules (a model) to classify unseen
examples. Clustering is the grouping of instances that have similar characteristics into clusters,
without any prior guidance. In association learning, any rules that strongly relate different
features’ values are sought (not only those that relate features’ values and class). In numeric
prediction, the outcome to be predicted is not a discrete class but a numeric quantity.
Most ML techniques used for IPTC focus on the use of supervised and unsupervised learn-
ing.
3.1.3 Supervised learning
Supervised learning creates knowledge structures that support the task of classifying new in-
stances into pre-defined classes [113]. The learning machine is provided with a collection of
sample instances, pre-classified into classes. Output of the learning process is a classification
model that is constructed by examining and generalising from the provided instances.
In effect, supervised learning focuses on modelling the input/output relationships. Its goal
is to identify a mapping from input features to an output class. The knowledge learnt (such as
commonalities among members of the same class and differences between competing ones) can
be presented, for example, as a flowchart, a decision tree or classification rules, which can be
used later to classify a new unseen instance.
There are two major phases (steps) in supervised learning:
• Training: The learning phase that examines the data provided (called the training dataset)
and constructs (builds) a classification model.
• Testing (also known as classifying): The model built in the training phase is used to
classify new, previously unseen instances.
For example, let C be a discrete set {y_1, y_2, ..., y_M} consisting of all the pre-defined classes.
A number of instances are selected for each class y_j (1 ≤ j ≤ M) to train the classifier. Let TS
be a training dataset, that is, a set of input/output pairs,

TS = {<x_1, y_1>, <x_2, y_1>, ..., <x_{N-1}, y_M>, <x_N, y_M>}
where x_i is the vector of input feature values for the i-th instance and y_j is its output class
value. The goal of classification can be formulated as follows: from a training dataset TS, find
a function f(x) of the input features that best predicts the output class y for any new, unseen
value of x (for example, with minimum probability of error). The function f(x) is the core of
the classification model.
As an ML principle, training data should have the same characteristics as the data to be
classified. Also, the model created during training is improved if we simultaneously provide
examples of instances that belong to class(es) of interest and instances known to not be mem-
bers of the class(es) of interest. This allows the ML algorithm to compare and contrast, and
to generalise the classification rules in distinguishing between the instances belonging to the
class(es) of interest and the others. This will enhance the model’s performance later in the
classification of new, previously unknown instances [110].
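The two phases can be sketched with a deliberately simple stand-in model (a nearest-centroid rule, chosen for brevity; it is not one of the algorithms used in this thesis): training builds f(x) from the labelled pairs, and testing applies it to unseen instances.

```python
def train(ts):
    """Training phase. ts: list of (feature_vector, class_label) pairs."""
    groups = {}
    for x, y in ts:
        groups.setdefault(y, []).append(x)
    # One centroid (per-feature mean) per class.
    centroids = {y: [sum(col) / len(xs) for col in zip(*xs)]
                 for y, xs in groups.items()}

    def f(x):  # the learnt classification model
        dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return min(centroids, key=lambda y: dist(x, centroids[y]))
    return f

TS = [([100, 0.01], "game"), ([120, 0.02], "game"),
      ([1400, 0.5], "bulk"), ([1300, 0.6], "bulk")]
f = train(TS)           # training phase: build the model
print(f([110, 0.015]))  # testing phase: classify an unseen instance -> game
```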
There exists a number of supervised learning classification algorithms, each differing mainly
in terms of the way the classification model is constructed and the optimisation algorithm used
to search for a good model. In this thesis I make use of two different learning algorithms: the
C4.5 Decision Tree [114] and supervised Naive Bayes [115] algorithms. Brief descriptions of
these two algorithms are presented in the sub-sections below.
The Naive Bayes algorithm
The Naive Bayes algorithm provides a simple approach to classification based on probabilistic
knowledge [115]. The method is designed for use in supervised classification, in which the goal
is to predict accurately the class of unseen data using the classification model built on training
instances.
Let C be the random variable denoting the class of an instance and let X = {X1,X2, ...Xn} be
a vector of random variables denoting the observed attribute values. Let c be a particular class
and x be an instance to be classified. The algorithm makes a statistical conclusion about the
probability of instance x belonging to a class c, based on:
• the probability of observing an instance of each class in the training set (the prior
probability, denoted by P(C = c))
• and the probability of the instance x given class c.
The calculation follows Bayes’s rule:
P(C = c | X = x) = P(C = c) P(X = x | C = c) / P(X = x)     (3.1)
Based on the outcome the classifier then predicts the most probable class. The Naive Bayes
algorithm relies on two assumptions: Firstly, it assumes that the instance’s attributes are inde-
pendent given the class, and that no hidden or latent attributes influence the prediction process
[115]. The second assumption is that within each class, the values of numeric attributes are
Normally (or Gaussian) distributed, so that the attribute’s value distribution can be represented
in terms of its mean and standard deviation, and the probability of an observed value can be
easily computed from such estimates.
In equation 3.1, X = x represents the event that X_1 = x_1 ∧ X_2 = x_2 ∧ ... ∧ X_n = x_n. With
the assumption that these attribute values are independent, one obtains:

P(X = x | C = c) = ∏_i P(X_i = x_i | C = c)     (3.2)
Generally the denominator of equation 3.1 is not directly estimated as it can be simply
considered to be a normalising factor [115].
Both independence and normality assumptions are violated in many cases 1. However, this
approach has been shown to work better than more complex methods, and it can also cope with
complex situations [109].
1 The normality assumption is violated in my ET datasets used in Chapter 5 as well. However, my experiments show that the Naive Bayes classifier trained using SSP-ACT still performs well in identifying ET traffic.
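Equations 3.1 and 3.2 translate almost directly into code. The sketch below trains on invented one-attribute flows, estimates each class's prior and per-attribute Gaussian parameters, and classifies by comparing the (un-normalised) numerators of equation 3.1:

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian probability density, per the normality assumption."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def nb_train(data):  # data: list of (attribute_vector, class) pairs
    groups = {}
    for x, c in data:
        groups.setdefault(c, []).append(x)
    model = {}
    for c, xs in groups.items():
        stats = []
        for col in zip(*xs):  # per-attribute mean and (floored) std deviation
            mu = sum(col) / len(col)
            sigma = max(1e-6, (sum((v - mu) ** 2 for v in col) / len(col)) ** 0.5)
            stats.append((mu, sigma))
        model[c] = (len(xs) / len(data), stats)  # (prior P(C=c), attribute stats)
    return model

def nb_classify(model, x):
    def score(c):  # numerator of eq. 3.1, using the product from eq. 3.2
        prior, stats = model[c]
        return prior * math.prod(gaussian(v, mu, s) for v, (mu, s) in zip(x, stats))
    return max(model, key=score)

data = [([100.0], "game"), ([120.0], "game"), ([1400.0], "bulk"), ([1300.0], "bulk")]
m = nb_train(data)
print(nb_classify(m, [110.0]))  # game
```

The denominator P(X = x) is omitted, since it is a common normalising factor across classes.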
The C4.5 Decision Tree algorithm
C4.5 is one of the most commonly used algorithms that deploy decision trees for classification.
It has a history dating back to the 1960s and the work of Hunt et al. [116]. The attractiveness
of this algorithm is that, in contrast to the Naive Bayes algorithm, it produces rules that can be
easily understood by humans. In addition, it requires no a priori assumptions about the nature
of the data.
The classification model takes the form of a tree structure, where each node is either a leaf
node, representing a class, or a test node, specifying a test to be carried out on a single attribute
value with two or more outcomes (branches), each linked to a sub-tree.
An instance can be classified by starting at the root of the tree and following the path until it
reaches a leaf node, which provides the classification of the example.
To construct the tree, C4.5 (similarly to other decision tree classifiers) uses a method known
as ‘divide and conquer’ that employs a top-down, greedy search through the space of possible
decision trees from a set of training instances. For optimal tree construction C4.5 selects the
attribute to test at each test node in the tree that maximises a heuristic splitting criterion. One
criterion used in the algorithm is information gain measurement. A detailed description for
calculating this factor is described in [114].
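The information gain criterion can be illustrated with a small worked example (the class labels and the candidate split are invented):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(parent, children):
    """Entropy before the split, minus the weighted entropy after it."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["game"] * 4 + ["other"] * 4
split = [["game"] * 4, ["other"] * 4]  # a perfect split on some attribute
print(info_gain(parent, split))  # 1.0
```

A split that perfectly separates the classes yields the maximum gain (here one bit); C4.5 greedily picks the attribute test with the highest such value at each node.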
The divide and conquer algorithm partitions the data until every leaf contains cases of a
single class, or until further partitioning is impossible because two cases have the same values
for each attribute but belong to different classes [114]. While this is sometimes a reasonable
strategy, it can lead to a loss of predictive accuracy in most applications if there is noise 2 in the
training data, or when the number of training examples is too small to produce a representative
sample of the true target application class. In other words, this simple algorithm can produce
trees that over-fit the training dataset. There are two approaches to overcoming the problem.
The first approach tries to stop the growing of the tree before it reaches the point where it
perfectly classifies the training data. The second one allows the tree to over-fit the data, and
then post prunes the tree. The latter seems to be more successful in practice - and is employed
2 Noise can be a random error or variance in a measured variable [117] (in our case, an error in examples' feature values in the training dataset, such as variability in packet inter-arrival times caused by congestion, or changes in MTU sizes caused by alternate paths between sender and receiver). Noise can also be an error in the class that is assigned to an example in the training dataset [109].
by the C4.5 algorithm [118] [114].
Despite the advantages mentioned above, classification based on decision trees has a num-
ber of limitations. It is unstable - small variations in the training data can result in different
attribute selections at each test point within the tree, and large changes in the classification rules
[119] [118] [120].
Furthermore, trees created from numeric datasets can be quite complex since attribute splits
for numeric data are binary. The process of growing a decision tree is also computationally
expensive. At each node, each candidate splitting field must be sorted before its best split can
be found. Pruning algorithms can also be expensive since many candidate sub-trees must be
formed and compared [118] [120].
3.1.4 Clustering
Classification techniques use pre-defined classes of training instances. In contrast, clustering
methods are not provided with this guidance; instead, they discover natural clusters (groups) in
the data using internalised heuristics [121].
Clustering focuses on finding patterns in the input data. It clusters instances with similar
properties (defined by a specific distance measure, such as Euclidean distance) into
groups. The groups that are so identified may be exclusive, so that any instance belongs in only
one group; or they may be overlapping, where one instance may fall into several groups; they
may also be probabilistic, such that an instance belongs to a group with a certain probability.
They may be hierarchical, where there is a division of instances into groups at the top level, and
then each of these groups is refined further - even down to the level of individual instances [109].
There are three basic clustering methods: the classic k-means, incremental clustering, and
the probability-based clustering. The classic k-means forms clusters in numeric domains, par-
titioning instances into disjoint clusters, while incremental clustering generates a hierarchical
grouping of instances. The probability-based methods assign instances to classes probabilisti-
cally, not deterministically [109].
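A minimal sketch of the first of these methods, classic k-means, restricted to one numeric feature for brevity; the deterministic seeding strategy and the data values are illustrative assumptions, not part of any cited algorithm, and k ≥ 2 is assumed.

```python
def kmeans_1d(points, k, iters=20):
    """A minimal one-dimensional k-means: repeatedly assign each point
    to its nearest centre, then move each centre to the mean of its
    members (disjoint clusters over a numeric domain)."""
    pts = sorted(points)
    # deterministic seeding: k evenly spaced points from the sorted data
    centres = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: abs(p - centres[j]))
            clusters[nearest].append(p)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

# two obvious groups of (hypothetical) mean packet lengths
centres, clusters = kmeans_1d([60, 70, 65, 1400, 1500, 1450], k=2)
```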
To be used for classification, intermediate steps are required to label the resulting clusters and to generate rules from the clusters for future classification. Generally, 'labelling' is the process of classifying the members of a dataset using manual (human) inspection or an irrefutable automated process. A common method is to label a cluster according to the member class that contributes the most to the cluster's population. Rules created from the clusters can take the form of a parametric model used to assign a new flow to a cluster. For example, in [122] and [123] the Euclidean
distance between the new flow and the centre of each pre-defined cluster is computed, and the
new flow belongs to the cluster for which the distance is the least.
The Expectation Maximisation (EM) algorithm is one of the probabilistic clustering meth-
ods. It assigns a data point to each cluster with a certain probability. The underlying statistical
model of EM is a finite mixture. A mixture is a set of probability distributions - one for each
cluster - that models the attribute values for members of that cluster. The algorithm starts with
initial guesses for the parameters for each cluster, uses them to calculate the cluster probabil-
ities for each instance, then uses these probabilities to re-estimate the parameters, and repeats
until convergence is attained [109]. EM has been used to cluster IP traffic flows in previous
work, such as [59] and [60]. Since the algorithm is used in this thesis, a brief description of the
algorithm is presented in a subsection below.
EM algorithm
The EM algorithm applies when both the distributions and the parameters that characterise a mixture model are unknown. It adopts a procedure similar to that of the k-means clustering algorithm: it starts with initial guesses for the unknown parameters, uses them to calculate the cluster probabilities for each instance ('expectation'), uses these probabilities to re-estimate the parameters ('maximisation'), and repeats until convergence [109].
Let y be a random vector whose joint density f(y; θ) is indexed by a p-dimensional parameter θ in Θ. If the complete-data vector y is observed, it is of interest to compute the maximum likelihood estimate of θ based on the distribution of y. The log-likelihood function of y,

log L(θ; y) = l(θ; y) = log f(y; θ)   (3.3)

is then required to be maximised. If θ^(0) is an initial value for θ, then on the first iteration it is necessary to compute

Q(θ, θ^(0)) = E_θ^(0) [l(θ; y)]   (3.4)

Q(θ, θ^(0)) is now maximised with respect to θ; that is, θ^(1) is found such that

Q(θ^(1), θ^(0)) ≥ Q(θ, θ^(0))   (3.5)
for all θ in Θ.
Thus the EM algorithm consists of an E-step (Expectation step) followed by an M-step (Maximisation step), defined as follows.

E-step: Compute Q(θ, θ^(t)), where

Q(θ, θ^(t)) = E_θ^(t) [l(θ; y)]   (3.6)

M-step: Find θ^(t+1) such that

Q(θ^(t+1), θ^(t)) ≥ Q(θ, θ^(t))   (3.7)
for all θ in Θ.
The E-step and the M-step are repeated alternately until the increase in the log-likelihood
is less than ε , where ε is a prescribed small quantity (that can be considered negligible). The
EM algorithm guarantees convergence to a local maximum. To obtain the global maximum,
the whole procedure should be repeated several times, with different initial guesses for the
parameter values. The model that provides the highest local maximum should be chosen as the
best [109].
Another issue in this process involves choosing the number of clusters to model. If it is not known in advance, the process models 1, 2, 3, ... clusters in turn, increasing the number until the gain in log-likelihood falls below ε. Some implementations of the EM algorithm (e.g. WEKA [124]) include an option to find the number of clusters automatically: beginning with one cluster, the algorithm continues to add clusters until the estimated log-likelihood no longer increases [125].
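The E-step/M-step loop and the ε-based stopping rule described above can be sketched for a one-dimensional Gaussian mixture as follows. This is a simplified illustration, not the WEKA implementation; it assumes k ≥ 2, a single run from one deterministic initial guess (a full treatment would restart from several guesses, as noted above), and reasonably well-separated data.

```python
import math

def em_gmm_1d(data, k=2, eps=1e-6, max_iter=100):
    """EM for a one-dimensional Gaussian mixture.  E-step: compute each
    instance's cluster-membership probabilities under the current
    parameter guesses.  M-step: re-estimate weights, means and
    variances from those probabilities.  Stop when the log-likelihood
    improves by less than eps.  Assumes k >= 2."""
    n = len(data)
    lo, hi = min(data), max(data)
    mean_all = sum(data) / n
    var_all = sum((x - mean_all) ** 2 for x in data) / n
    # initial guesses: equal weights, means spread over the data range
    means = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    variances = [var_all] * k
    weights = [1.0 / k] * k
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: responsibilities r[j] = P(cluster j | instance x)
        resp, ll = [], 0.0
        for x in data:
            dens = [w * math.exp(-(x - m) ** 2 / (2 * v))
                    / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            ll += math.log(total)
            resp.append([d / total for d in dens])
        if ll - prev_ll < eps:
            break
        prev_ll = ll
        # M-step: re-estimate parameters from the responsibilities
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, data)) / nj + 1e-9
    return weights, means, variances

# two well-separated groups of (hypothetical) feature values
w, m, v = em_gmm_1d([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
```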
3.1.5 Evaluating supervised learning algorithms
A good ML classifier optimises Recall and Precision. However, there may be trade-offs between
these metrics. To decide which one is more important or should be given higher priority one
needs to take into account the cost of making wrong decisions or wrong classifications. The de-
cision must depend on the specific application context and one’s commercial and/or operational
priorities.
3.1. A REVIEW OF CLASSIFICATION WITH MACHINE LEARNING 53
Various tools exist to study this trade-off, and thus support this decision-making process.
The receiver operating characteristic (ROC) curve provides a way to visualise the trade-off between TP and FP by plotting the TP rate as a function of the FP rate (TP expressed as a percentage of all positive instances, and FP as a percentage of all negative instances). This has been found useful in analysing how classifiers perform over a range of threshold settings [109]. Another tool is the Neyman-Pearson criterion [126], which attempts to maximise TP subject to a fixed threshold on FP [127].
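For concreteness, the metrics underlying this trade-off can be computed directly from the four confusion-matrix counts of a two-class problem; the counts below are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Recall, Precision and the ROC axes from the confusion-matrix
    counts: tp/fp/fn/tn are the True Positive, False Positive, False
    Negative and True Negative counts respectively."""
    recall = tp / (tp + fn)      # also the ROC y-axis (TP rate)
    precision = tp / (tp + fp)
    fp_rate = fp / (fp + tn)     # the ROC x-axis
    return recall, precision, fp_rate

# e.g. 90 game flows found, 10 missed, 30 other flows mislabelled as game
r, p, fpr = classification_metrics(tp=90, fp=30, fn=10, tn=870)
```

Sweeping a classifier's decision threshold and recording (fp_rate, recall) pairs traces out the ROC curve.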
A challenge when using supervised learning algorithms is that both the training and test-
ing phases must be performed using datasets that have been pre-labelled 3. Ideally one would
have a large training set (for optimal learning and creation of models) and a large, yet indepen-
dent, testing dataset to assess the algorithm’s performance. (Testing on the training dataset is
usually misleading. Such testing will usually only show that the constructed model is good at
recognising the instances from which it was constructed.)
In the real world we are often faced with a limited quantity of pre-labelled datasets. A
simple procedure (sometimes known as holdout [109]) involves setting aside some part (e.g.
two thirds) of the pre-labelled dataset for training, and the rest (e.g. one third) for testing.
In practice, when only small or limited datasets are available a variant of holdout, called
N-fold cross-validation, is most commonly used. The dataset is first split into N approximately
equal partitions (or folds). Each partition (1/N) in turn is then used for testing, while the
remainder ((N− 1)/N) are used for training. The procedure is repeated N times so that in the
end, every instance has been used exactly once for testing. The overall Recall and Precision
are calculated from the average (mean value) of the Recalls and Precisions measured from all N
tests. The results therefore do not apply to a particular classifier among those tested, but can be
considered as an estimation for a classifier being trained on the whole dataset [128][129]. It has
been claimed that N = 10 (10-fold cross-validation) provides a good estimate of classification
performance [109].
Simply partitioning the full dataset N ways does not guarantee that each fold preserves the class proportions of the full dataset. A further step, known as stratification, is usually applied: randomly sampling the dataset in such a way that each class is properly represented in both the training and testing datasets.3 When stratification is used in combination with cross-validation, it is called stratified cross-validation. It is common to use stratified 10-fold cross-validation when only limited pre-labelled datasets are available.
3 In contrast to a controlled training and testing environment, operational classifiers do not have access to previously labelled example flows.
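A minimal sketch of stratified fold assignment, dealing the instances of each class round-robin across the folds, under the simplifying assumption that the instances have already been shuffled:

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=10):
    """Stratified N-fold partitioning: the instances of each class are
    dealt round-robin across the folds, so every fold keeps roughly the
    class proportions of the full dataset.  Returns the fold index
    assigned to each instance."""
    fold_of = [None] * len(labels)
    per_class = defaultdict(list)
    for i, y in enumerate(labels):
        per_class[y].append(i)
    for indices in per_class.values():
        for pos, i in enumerate(indices):
            fold_of[i] = pos % n_folds
    return fold_of

# a hypothetical imbalanced dataset: 20 'game' flows, 80 'other' flows
labels = ['game'] * 20 + ['other'] * 80
folds = stratified_folds(labels, n_folds=10)
# each fold receives 2 'game' and 8 'other' instances
```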
3.1.6 Evaluating unsupervised learning algorithms
While Recall and Precision are common metrics to evaluate classification algorithms, evaluating
clustering algorithms is more complicated. There are intermediate steps required in evaluating
the resulting clusters before labelling them or generating rules for future classification. Given a
dataset, a clustering algorithm can always generate a division, with its own finding of structure
within the data. Different approaches can lead to different clusters, and even for the same
algorithm, different parameters or different orders of input patterns might alter the final results
[130] [131].
Therefore, it is important to have effective evaluation standards and criteria to provide the
users with a certain level of confidence in results generated by a particular algorithm, or comparisons of different algorithms [132]. Criteria should help answer useful questions such as: how many clusters are hidden in the data; what the optimal number of clusters is [131]; whether the resulting clusters are meaningful or just an artifact of the algorithm [132]; how one algorithm performs compared to another - how easy each is to use and how fast it runs [130]; what the intra-cluster quality is; how good the inter-cluster separation is; what the cost of labelling the clusters is; and what the requirements are in terms of computation and storage.
Halkidi et al.[131] identify three approaches to investigating cluster validity: external cri-
teria, internal criteria and relative criteria. The first two approaches are based on statistical
hypothesis testing. The external criteria approach is based on some pre-specified structure,
which is known as prior information on the data and is used as a standard to compare and validate the clustering results [132]. The internal criteria approach evaluates the clustering result of an algorithm by examining the internal structure inherent in the dataset. The relative criteria approach emphasises finding the best clustering scheme that a clustering algorithm
can define under certain assumptions and parameters. The basic idea is to evaluate a clustering
structure by comparing it to others that use the same algorithm but with different parameter
values [133]. (More details on these approaches can be found in [131], [132], [130] and [109].)
3.1.7 Feature selection algorithms
A key to building an ML classifier is identification of the smallest necessary set of features
required to achieve one’s goals in relation to accuracy - a process known as feature selection.
The quality of the feature set is crucial to the performance of an ML algorithm. Using
irrelevant or redundant features often leads to negative impacts on the accuracy of most ML
algorithms. It can also make the system more computationally expensive, as the amount of
information stored and processed rises with the dimensionality of a feature set used to describe
the data instances. Consequently it is desirable to select a subset of features that is small in size
yet retains essential and useful information about the classes of interest.
Feature selection algorithms can be broadly classified into filter methods or wrapper meth-
ods. Filter method algorithms make independent assessments based on the general characteris-
tics of the data. They rely on a certain metric to rate and select the best subset before learning
commences. The results provided therefore should not be biased toward a particular ML al-
gorithm. Wrapper method algorithms, on the other hand, evaluate the performance of different
subsets using the ML algorithm that will ultimately be employed for learning. Their results are
therefore biased toward the ML algorithm used. A number of subset search techniques can be
used, such as Correlation-based Feature Selection (CFS) filter techniques with a Greedy, Best-
First or Genetic search. (Additional details on these techniques can be found in [109], [134],
[135], [136] and [137].)
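As a simple illustration of the filter approach, features can be rated independently of any learning algorithm. The score used here is a plain absolute correlation between a feature and a numerically encoded class label - a generic filter metric, not the CFS merit formula - and the feature names and values are hypothetical.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def filter_rank(features, labels):
    """Filter-method feature ranking: rate each feature by the absolute
    correlation between its values and the class labels, before any
    learning commences."""
    scores = {name: abs(pearson_r(vals, labels))
              for name, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# hypothetical per-flow features; class encoded as 1 = game, 0 = other
features = {
    'mean_pkt_len': [60, 70, 65, 1200, 1400, 1300],
    'mean_iat_ms':  [20, 22, 21, 21, 20, 22],   # uninformative
}
labels = [1, 1, 1, 0, 0, 0]
ranking = filter_rank(features, labels)
```

Keeping only the top-ranked features yields the small, informative subset that the text argues for.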
3.1.8 Imbalanced datasets problem
A common assumption for machine learning classification is that the participating classes share similar prior probabilities - that is, they possess similar percentages of examples in the dataset. However,
this assumption is normally violated in real-world problems, for example, in network intrusion
detection, and fraud and anomaly detection. It is often the case that the ratios of prior probabili-
ties between classes are significantly skewed. For example, there may be a ‘majority class’ that
greatly outweighs a ‘minority class’ in terms of number of examples. This problem is referred
to as inter-class imbalance.
Another problem of imbalance is within a single class, normally referred to as intra-class
imbalance. This occurs when the members of a class are under-represented compared to other
members of the same class drawn from different distributions.
These imbalances may have a negative impact on the performance of standard classification
algorithms, such as the C4.5 Decision Tree algorithm, which normally aim to maximise the
overall classification accuracy. When dealing with unbalanced datasets, these algorithms may
result in classifiers that ignore the minority class or classifiers that over-fit the training data
[138][139].
Several methods have been proposed to deal with the problem of inter-class imbalance,
including re-sampling the training datasets (e.g. [140] and [141]), adjusting misclassification
costs (e.g. [142]), and learning from the minority class (e.g. [143]). Among them, re-sampling
appears to be a reasonably effective approach [144].
Re-sampling is the process of changing the prior probabilities of the majority and minority
classes in the training set by changing the number of examples in the majority and minority
classes [144]. In particular, over-sampling duplicates the minority examples in the training set.
While this helps with balancing the dataset, it does not increase the amount of information, and
may lead to over-fitting [139] [138]. In under-sampling, examples are removed from the majority class. While this improves the balance of prior probabilities between the classes, it results in a loss of information that may be useful in building an accurate classification model.
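The two re-sampling strategies can be sketched as follows; the dataset here is a hypothetical 5:95 minority:majority mix, and the `'class'` field name is illustrative.

```python
import random

def undersample(dataset, majority_cls, target, rng):
    """Randomly discard majority-class examples until only `target`
    remain; the balance improves but some information is lost."""
    majority = [ex for ex in dataset if ex['class'] == majority_cls]
    rest = [ex for ex in dataset if ex['class'] != majority_cls]
    return rest + rng.sample(majority, target)

def oversample(dataset, minority_cls, target, rng):
    """Randomly duplicate minority-class examples until the class has
    `target` members; no new information is added."""
    minority = [ex for ex in dataset if ex['class'] == minority_cls]
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    return dataset + extra

rng = random.Random(7)
data = [{'class': 'game'}] * 5 + [{'class': 'other'}] * 95
down = undersample(data, 'other', target=5, rng=rng)
up = oversample(data, 'game', target=95, rng=rng)
```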
Nickerson and Milios [145] propose a solution that addresses both inter-class and intra-class
imbalance problems. Their approach firstly clusters the minority class and the majority class
separately. Cluster memberships are then examined and re-sampled based on the number of
examples per cluster instead of the number of examples per class.
The authors of [144] point out that past studies have not reached any conclusive results
with regard to whether under-sampling or over-sampling is better at optimising classification performance. Most likely, the conflicting results are due to the combination of specific datasets and
classification algorithms. Yet under-sampling has the advantage of requiring less training time
and physical resources compared to the over-sampling method.
Another notable point is that there is as yet no solution for the case where the training and testing datasets have different balancing characteristics. The training data may be balanced but
the testing may not and vice versa. Studies such as [146] and [147] have shown that a balanced
class distribution is not always the best for learning, and in some cases naturally occurring class
distribution is shown to perform well.
Besides the issue of an imbalanced ratio, a minority class can also create the problem of a
lack of information. The extent to which an algorithm suffers from an imbalanced ratio may
be different from one algorithm to another; however, all such algorithms will suffer from a lack
of examples presented for training [139]. As in the example provided in [139], for a dataset
consisting of 5:95 minority:majority examples the imbalanced ratio is the same as in a dataset
of 50:950. However, in the first case the minority class is poorly represented and thus suffers
more from the lack of information problem than in the second case. Therefore, when the impact
of the imbalance ratio on a learning algorithm is unclear, it is more important to gather as many
examples from the minority class as possible (under-sampling should not be performed on a
minority class).
More information about this issue can be found in [148], [149] and [139].
3.2 The application of ML in IP traffic classification
A number of general ML concepts take a specific meaning when applied to IPTC. For the
purpose of the subsequent discussion I define the following three terms relating to flows:
• Flow or Uni-directional flow: A series of packets that share the same five-tuple: source and destination IP addresses, source and destination ports, and protocol number.
• Bi-directional flow: A bi-directional flow is a pair of uni-directional flows, one in each
direction between the same source and destination IP addresses and ports 4.
• Full-flow: A bi-directional flow captured over its entire lifetime, from the establishment
to the end of the communication connection.
A class usually indicates the IP traffic caused by (or belonging to) an application or group
of applications. Instances are usually multiple packets belonging to the same flow. Features are
typically numerical attributes calculated over multiple packets belonging to individual flows.
4 In asymmetric routing, server-to-client packets may take a different path to client-to-server packets - the traffic capture point needs to be located where it can see packets in both directions for bi-directional flows to be used.
Examples include mean packet lengths, standard deviation of inter-packet arrival times, total
flow lengths (in bytes and/or packets), Fourier transform of packet inter-arrival time, and so on
[150]. As previously noted not all features are equally useful, so practical ML classifiers choose
the smallest set of features that lead to efficient differentiation between members of a class and
other traffic outside the class.
Internet applications' traffic is often bi-directional. For example, flows consist of data and acknowledgements, requests and replies, or commands and feedback, with each half of the exchange travelling in the opposite direction. Hence bi-directional flows are often chosen for study in the literature
(e.g. [98], [59], [122], [60] and [151]). Each bi-directional flow instance is normally charac-
terised by the values of its features calculated separately in the client-to-server (forward) and
the server-to-client (backward) directions.
The definition of a full-flow is illustrated in Figure 3.2.
Figure 3.2: An illustration of a full-flow. The forward direction is normally defined as the client-to-server direction
Figure 3.3 presents a visual illustration of how the features are calculated for full-flow in-
stances.
Figure 3.3: An illustration of the definition of flow direction and features calculation
Let L_F1, L_F2, ..., L_FJ be the IP packet lengths of packets 1, 2, ..., J in the forward direction, and let L_B1, L_B2, ..., L_BK be the IP packet lengths of packets 1, 2, ..., K in the backward direction. Packet length features for the forward direction are then calculated from the statistics of {L_F1, L_F2, ..., L_FJ}, and features for the backward direction from the statistics of {L_B1, L_B2, ..., L_BK}. Similarly, packet inter-arrival time (IAT) features in the forward and backward directions are calculated from the statistics of {IAT_F1, IAT_F2, ..., IAT_FJ} and {IAT_B1, IAT_B2, ..., IAT_BK} respectively.
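A sketch of this per-direction feature calculation; the packet timestamps, lengths and feature names below are hypothetical illustrations.

```python
import statistics

def direction_features(timestamps, lengths):
    """Features for one direction of a bi-directional flow: statistics
    over the packet lengths, and over the inter-arrival times derived
    from consecutive packet timestamps, as in Figure 3.3."""
    iats = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
    feats = {'pkt_len_mean': statistics.mean(lengths),
             'pkt_len_stdev': statistics.pstdev(lengths)}
    if iats:  # a single-packet direction has no IATs
        feats['iat_mean'] = statistics.mean(iats)
        feats['iat_stdev'] = statistics.pstdev(iats)
    return feats

# hypothetical forward-direction packets: (arrival time in s, IP length)
fwd = [(0.00, 60), (0.05, 1500), (0.10, 1500), (0.20, 60)]
fwd_feats = direction_features([t for t, _ in fwd], [l for _, l in fwd])
```

The same function applied to the backward-direction packets yields the second half of the bi-directional feature vector.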
3.2.1 Training and testing a supervised ML traffic classifier
Figure 3.4 presents an example scenario, in which the traffic classifier is intended to recognise
real-time online game traffic (the application class of interest) among the usual mix of traffic
seen on an IP network.
Figure 3.4: A simple scenario of online game traffic classification
Figure 3.5 illustrates the steps involved in building a traffic classifier using a supervised ML
algorithm. As noted earlier, the optimal approach to training a supervised ML algorithm is to
provide previously classified examples of two types of IP traffic: traffic matching the class of
traffic that one wishes later to identify in the network (in this case online game traffic), and
representative traffic of entirely different applications one would expect to see in future (often
referred to as Interfering or Other traffic).
The lower part of Figure 3.5 (Training) expands on the sequence of events involved in train-
ing a supervised ML traffic classifier. First, sample traffic is collected for both the application of
interest (e.g. game traffic) and other interfering applications (such as VoIP, Web, P2P, SSH, and SMTP) that the classifier may see in the network. The ‘features calculation’ step involves calculating the statistical properties of these flows (such as mean packet inter-arrival time, median packet length and/or flow duration) as a prelude to generating features.
Figure 3.5: Training and classification for a two-class supervised ML traffic classifier
An optional next step is ‘data sampling’ or ‘features filtering/selection’, designed to narrow
down the search space for the ML algorithm when faced with extremely large training datasets
(traffic traces). The data sampling step extracts statistics from a subset of instances of various
application classes, and passes these along to the classifier to be used in the training process.
As noted in section 3.1.7, a feature filtering/selection step is desirable to limit the number of
features actually used to train the supervised ML classifier, and thus create an effective classi-
fication model. The input into the ML step is a dataset with training instances for both classes,
presented as a matrix of instances versus features as illustrated in Figure 3.1.
The output of the ML training process is a classification model. It is used in the classification
(sometimes referred to as testing/evaluating) step (illustrated in the upper part of Figure 3.5) to
identify a new unknown flow as either Game or Other traffic. In this classification step, traffic
captured in real-time is used to construct flow statistics from which features are determined and
then submitted to the classification model. (Here we presume that the set of features calculated
from captured traffic is the same as the optimal feature set determined during training.) The
classifier’s output indicates which flows are deemed to be members of the class of interest.
Certain implementations may optionally allow the classification model to be updated in
real-time (performing a similar data sampling and training process). For controlled testing and
evaluation purposes offline traffic traces can be used instead of live traffic capture.
Cross-validation (or stratified cross-validation) may be used to generate accuracy evalua-
tion results during the training/classification steps. However, if the source dataset consists of
IP packets collected at the same time and the same network measurement point, the cross-
validation results are likely to over-estimate the classifier’s accuracy. (Ideally the source data
trace would contain traffic collected at different times and measurement points, using entirely
independent training and testing datasets.)
3.2.2 Supervised versus unsupervised learning
As previously noted, IPTC is usually used to identify traffic belonging to known applications
(classes of interest) within previously unseen streams of IP packets. The key challenge is to
determine the relationship(s) between classes of IP traffic (as differentiated by ML features)
and the applications generating the IP traffic.
Supervised ML schemes require a training phase to cement the link between classes and ap-
plications. Training requires a priori classification (or labelling) of the flows within the training
datasets. For this reason, supervised ML may be attractive for the identification of a particular
(or groups of) application(s) of interest. However, as noted in section 3.1.3, the supervised ML
classifier works best when trained on examples of all the classes it expects to see in practice.
Consequently, its performance may be degraded or skewed if not trained on a representative
mix of traffic or if the network link(s) being monitored start seeing traffic of previously un-
known applications. (For example, Park et al. [152] demonstrated that accuracy is sensitive to
site-dependent training datasets, while Erman et al. [153] revealed different accuracy results
between the two data traces studied for the same ML algorithms.)
When evaluating supervised ML schemes in an operational context it is worthwhile consid-
ering how the classifier will be supplied with adequate supervised training examples, when it
will be necessary to re-train, and how the user will detect new types of applications.
It might appear that one advantage of unsupervised ML schemes is the automatic discov-
ery of classes through the recognition of ‘natural’ patterns (clusters) in the dataset. However,
resulting clusters still need to be labelled (for example, through direct inspection by a human
expert) in order that new instances may be properly mapped to applications. (A related benefit
is that traffic from previously unknown applications may be detected by noting when new clus-
ters emerge - sometimes the emergence of new application flows is noteworthy even before the
identity of the application has been determined.)
Another issue for unsupervised ML schemes is that clusters do not necessarily map 1:1 to
applications. It would be ideal if the number of clusters formed were equal to the number of
application classes to be identified, and each application dominated one cluster group. However,
in practice, the number of clusters is often greater than the number of application classes [60]
One application might spread over and dominate a number of clusters, or conversely an application might spread over several clusters without dominating any of them. Mapping back from a
cluster to a source application can become a great challenge.
When evaluating unsupervised ML schemes in an operational context it is worthwhile con-
sidering how clusters will be labelled (mapped to specific applications), how labels will be up-
dated as new applications are detected, and the optimal number of clusters (balancing accuracy,
cost of labelling and label look-up, and computational complexity).
3.3 Challenges for operational deployment
3.3.1 A deployment scenario
Section 2.2.1 discussed the negative impacts of Last Mile bottlenecks on real-time interactive
traffic. Studies (such as [154], [155], [35], [156] and [157]) have shown that prioritisation
of real-time traffic over non-real-time traffic (such as ‘bursty’ TCP traffic) could improve the
perceived performance of the real-time traffic applications.
With the DOCSIS network considered in section 2.2.1, if the cable modem has the ability
to do class-based queuing and QoS scheduling, it can separate traffic into different queues,
and apply QoS scheduling mechanisms to them. The queuing and scheduling system requires
classification (identification) of traffic. While queuing and scheduling need to be done locally at
the CPE (e.g. embedded at the cable modem), traffic classification may be done at the ISP [35].
Suppose we have a classifier machine that listens to a limited number of packets of a traffic flow, derives their statistical properties, and then recognises the type of application that generates the traffic. Once the flow has been classified, its classification rule can be logged in a database and communicated to the CPE. As a result, its subsequent packets can quickly be mapped to a QoS class, put into a priority queue, or given special network monitoring and treatment when traversing the network. This deployment scenario is illustrated in Figure 3.6.
Figure 3.7 illustrates an example of the operation of a classifier in a QoS-enabled architecture.
Data traffic passing a sniffer point is divided into separate flows (based on the five-tuple packet
header information: source and destination IP addresses, source and destination ports, and pro-
tocol) 5. These flows are passed through the classifier for identification. If a flow is classified as
5 This five-tuple information serves purely to differentiate flows. The numerical values or semantics therein
Figure 3.6: Example of an automated QoS and priority control
one requiring prioritisation, the classifier signals the CPE with this information; the CPE will
use this to apply priority queuing and scheduling for the flow.
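The flow-table mechanics described above - keying packets on the five-tuple and buffering only a small classification window of packets per flow - might be sketched as follows; the field names, window size and addresses are illustrative assumptions.

```python
from collections import defaultdict

def five_tuple(pkt):
    """The flow key: source/destination IP addresses and ports plus
    protocol number.  The values act purely as an opaque identifier."""
    return (pkt['src_ip'], pkt['dst_ip'],
            pkt['src_port'], pkt['dst_port'], pkt['proto'])

def split_into_flows(packets, window=10):
    """Group packets into flows by five-tuple, buffering only the first
    `window` packets of each flow - the classifier's decision window.
    Later packets of an already-classified flow need only a rule-table
    lookup on the same key."""
    flows = defaultdict(list)
    for pkt in packets:
        key = five_tuple(pkt)
        if len(flows[key]) < window:
            flows[key].append(pkt)
    return flows

pkts = [
    {'src_ip': '10.0.0.1', 'dst_ip': '203.0.113.9',
     'src_port': 5000, 'dst_port': 27015, 'proto': 17, 'len': 60},
    {'src_ip': '10.0.0.1', 'dst_ip': '203.0.113.9',
     'src_port': 5000, 'dst_port': 27015, 'proto': 17, 'len': 90},
    {'src_ip': '10.0.0.2', 'dst_ip': '198.51.100.7',
     'src_port': 44000, 'dst_port': 80, 'proto': 6, 'len': 1500},
]
flows = split_into_flows(pkts)   # two distinct flows
```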
The actual number of QoS classes and associated prioritisation levels used in such a sce-
nario will vary depending on customer requirements and ISP capabilities. Diffserv-style QoS (a
small handful of classes, or even only two classes [158]) is often considered sufficient, so long
as no individual QoS class is overloaded. In principle one might imagine hundreds or thou-
sands of different applications being mapped into a small number of QoS classes. In practical
consumer situations there are likely to be only a small number of applications (such as online
games or VoIP) that require prioritisation (with the default being that unrecognised flows are
not prioritised).
A further consideration involves applications whose QoS requirements and statistical traffic
properties vary over time. For example, game traffic features may vary during different phases
of the game. It is ultimately a business decision whether each phase is mapped to a different QoS
class, or always mapped to the high priority class. As noted in section 2.2.3, an ISP will aim for
the simplest technical solution that satisfies their customer’s goals. My focus is to identify an
IPTC technique capable of flexibly supporting the mapping of application traffic (across all or
parts of one or more application flows' lifetimes) to QoS classes. Operational challenges for
such classification are addressed in the following sub-section.
are not important to our ML-based classifier; as long as the combination of the actual binary bits for IP addresses, port numbers and protocol makes a unique identification of a flow, it can be used later to identify the flow's subsequent packets for prioritisation. In rare cases where the IP/TCP headers are encrypted, as long as this combination stays constant for a period of time, it can be used as a unique key to distinguish a flow.
3.3. CHALLENGES FOR OPERATIONAL DEPLOYMENT 65
Figure 3.7: Example operation of an IP flows classifier. [Figure: packets arriving at a sniffer, sequenced by arrival time, are separated into flows by five-tuple (source and destination IPs and ports, and protocol), passed through feature computation and the ML classifier model over a classifier window (e.g. a packet buffer), and the resulting labels (e.g. Game, VoIP, P2P) are recorded in a flow classification rule table.]
66 CHAPTER 3. A BRIEF BACKGROUND ON MACHINE LEARNING
3.3.2 The operational challenges
Section 3.3.1’s scenario and the discussion in section 2.2.3 raise a number of key requirements
for a practical, deployable IP traffic classifier in an operational network. These requirements
may be summarised into five broad categories: accuracy; timely and continuous classification;
directional neutrality; efficient use of memory and processors; and portability and robustness.
They are described in turn below.
Accuracy
Accuracy is a critical requirement that can be measured in terms of Recall and Precision (section 2.3),
both of which are important. A classifier cannot be accepted, either by ISPs or consumers, if
it has low Recall (a high False Negatives Percentage) or low Precision (a high False Positives
Percentage). For example, if the class X application of interest is real-time and interactive, and
is desired to receive priority treatment when traversing the network, a low value of Precision
might not only seriously interfere with the QoS mapping, queuing and scheduling system, but
would also be unacceptable to the Internet customers who are charged for the priority traffic
they inject into the network. On the other hand, a low Recall rate would make the ISP fail to
meet the QoS level guaranteed to the customer.
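As an illustration of how these two metrics behave, Recall and Precision can be computed directly from true/false positive and negative counts (the flow counts below are invented for illustration):

```python
def recall_precision(tp, fp, fn):
    """Recall = TP/(TP+FN); Precision = TP/(TP+FP).

    A high False Negative percentage lowers Recall; a high
    False Positive percentage lowers Precision.
    """
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Example: 90 game flows correctly prioritised, 10 missed,
# 30 non-game flows wrongly given priority treatment.
r, p = recall_precision(tp=90, fp=30, fn=10)
# r == 0.9 (the ISP misses 10% of the guaranteed flows),
# p == 0.75 (25% of prioritised traffic is billed incorrectly).
```

Here a low Precision directly translates into customers being charged for priority treatment of traffic they did not send, while a low Recall means the guaranteed QoS level is not delivered.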
Timely and continuous classification
A timely classifier should reach its decision using as few packets as possible from each flow
rather than waiting until each flow completes before finalising a decision. Reducing the number
of packets required for classification also reduces the memory required to buffer packets during
feature calculations. This is an important consideration for situations where the classifier is
calculating features for (tens of) thousands of concurrent flows. Depending on the business reason
for performing classification, it may be unacceptable to sample the available flows in order to
reduce memory consumption. Instead, one must aim to use fewer packets from each flow.
However, it is not sufficient to classify based only on the first few packets of a flow. For
example, malicious attacks might disguise themselves with the statistical properties of a trusted
application early in their flow’s lifetime. Or the classifier itself might have been started (or
restarted) while hundreds or thousands of flows were already active through a network monitoring
point (thereby missing the starts of these active flows). Consequently, the classifier should
ideally perform continuous classification – recomputing its classification decision throughout
the lifetime of every flow.
Timely and continuous ML classification must also address the fact that many applications
change their statistical properties over time, yet a flow should ideally be correctly classified as
being the same application throughout the flow’s lifetime.
Directional neutrality
Application flows are often assumed to be bi-directional, and the application’s statistical fea-
tures are calculated separately in the forward and reverse directions. Many applications (such
as multiplayer online games or streaming media) exhibit different (asymmetric) statistical prop-
erties in the client-to-server and server-to-client directions. Consequently, the classifier must
either ‘know’ the direction of a previously unseen flow (for example, at which ends the server
and the client are located) or be trained to recognise an application of interest without relying
on external indications of directionality.
Inferring the server and client ends of a flow is fraught with practical difficulties. As a real
world classifier should not presume that it has seen the first packet of every flow currently being
evaluated, it cannot be sure whether the first packet it sees (of any new bi-directional flow of
packets) is heading in the ‘forward’ or ‘reverse’ direction. Furthermore, as noted in section 2.4,
the semantics of the TCP or UDP port fields should be considered unreliable, so it becomes
difficult to justify using ‘well known’ server-side port numbers to infer a flow’s direction.
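At the flow-tracking level, one way to avoid guessing a flow's direction is to canonicalise the five-tuple so that packets travelling in either direction map to the same flow record. The following sketch is illustrative only (the addresses and ports are invented):

```python
def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Direction-neutral flow key (illustrative sketch): sort the two
    endpoints so packets from both directions of a bi-directional flow
    map to the same key, without inferring which end is the server."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (min(a, b), max(a, b), proto)

# Both directions of the same flow yield the same key:
k1 = flow_key('10.0.0.1', 40000, '192.0.2.9', 27015, 'UDP')
k2 = flow_key('192.0.2.9', 27015, '10.0.0.1', 40000, 'UDP')
assert k1 == k2
```

This solves only the bookkeeping side of the problem; the statistical features themselves must still be either symmetric or learnt without relying on external indications of directionality, as argued above.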
Efficient use of memory and processors
Another important criterion for operational deployment is the classification system's use of
computational resources (such as CPU time and memory consumption). The classifier's efficiency
impacts on the financial cost of building, purchasing and operating large-scale traffic classifi-
cation systems. An inefficient classifier may be inappropriate for operational use regardless of
how quickly it can be trained or how accurately it identifies flows.
Minimising CPU cycles and memory consumption is advantageous whether the classifier is
expected to sit in the middle of an ISP network (where a small number of large, powerful devices
may see hundreds of thousands of concurrent flows at multi-gigabit rates) or out toward the
edges (where the traffic load is substantially smaller, but the CPU power and memory resources
of individual devices are also diminished).
Portability and Robustness
A model may be considered portable if it can be used in a variety of network locations, and
robust if it provides consistent accuracy in the face of network layer perturbations such as
packet loss, traffic shaping, packet fragmentation, and jitter. A classifier is also robust if it can
efficiently identify the emergence of new traffic applications.
3.4 Conclusion
In this chapter I have provided background information about ML and how it could be applied
to IPTC. More information on ML algorithms can be found in [109], [121].
I have also addressed the crucial requirements for a practical and deployable real-time IP
traffic classifier, namely Accuracy, Timely and continuous classification, Directional neutrality,
Efficient use of memory and processors, Portability and Robustness. These critical factors not
only emphasise the technical viability of a solution (by meeting the Accuracy, Timely and
continuous classification and Directional neutrality requirements), but also address the
requirements for an economically feasible and deployable solution (by meeting the Efficient use
of memory and processors, Portability and Robustness requirements). The importance of these
requirements is justified in section 2.2 when considering the context of IPTC and its important
role as the core of most QoS solutions.
With a primary focus on the accuracy of ML-based traffic classifiers, most published re-
search to date 6 has not considered the constraints on classifiers deployed in real-time and
operational networks. My approaches proposed in Chapters 5 and 7 address these vital re-
quirements. I consider not only the real-time requirements of an ML traffic classifier, but also
its sustainable performance when monitoring traffic flows over their lifetime with limited phys-
ical resources. This is what makes my contribution novel and significant.
6Prior to the publications of my proposals in late 2006 [159] [160].
In the next chapter, I review state-of-the-art IPTC approaches using ML techniques. A qual-
itative critique of the reviewed works is then presented, which leads to the problem statement
of my research.
Chapter 4
IP Traffic Classification Using Machine Learning
4.1 Introduction
In Chapter 2 I have shown that ML has potential for solving difficult IP network problems. I
have also provided some background in ML and discussed the application of ML algorithms to
IP traffic classification in Chapter 3. In this chapter I review the previous literature on applying
ML to IPTC, which can be divided into four broad categories:
• Clustering Approaches: Works whose main approach centres around unsupervised learn-
ing techniques.
• Supervised Learning Approaches: Works whose main approach centres around super-
vised learning techniques.
• Hybrid Approaches: Works whose approaches combine supervised and unsupervised
learning techniques.
• Comparisons and Related Work: Works that compare and contrast different ML algo-
rithms, or consider non-ML approaches that could be utilised in conjunction with ML
approaches.
The key points of each reviewed work are discussed in the following subsections and sum-
marised in Tables B.1, B.2, B.3 and B.4 (Appendix B.1).
This chapter demonstrates that most published research has focused primarily on the ac-
curacy of ML-based traffic classifiers, and has not considered the constraints on classifiers
deployed in real-time, operational networks. These studies have typically relied on features
calculated for full-flows consisting of thousands of packets, both for training and for subsequent
classification. The efficacy and timeliness of ML classifiers have not been evaluated
under conditions where the beginning of the flow is missed and the classifier sees only a subset
of its packets.
Yet, as mentioned in 3.3.2, in real IP networks traffic classifiers must reach decisions well
before a flow has finished. The classifier may start (or re-start) at any time, and may not see the
actual beginning of a flow. The application’s statistical behaviour may change over the lifetime
of each flow. In addition there may be thousands of concurrent flows, and the classifier has to
operate with finite CPU and memory resources.
Section 4.6 of this chapter discusses the limitations of the reviewed works with regards to
the operational challenges. This helps to define the problem statement for my thesis, justify its
originality and explain why it is worthwhile pursuing.
4.2 Clustering approaches
4.2.1 Flow clustering using Expectation Maximisation
In 2004 McGregor et al. [59] published one of the earliest works applying ML to IP traffic
classification, using the Expectation Maximisation (EM) algorithm [161]. The approach clusters traffic
with similar observable properties into different application types.
This study examined HTTP, FTP, SMTP, IMAP, NTP and DNS traffic. Packets in a 6-hour
Auckland-VI trace were divided into bi-directional flows. Flow features (listed in Table B.1)
were calculated on a full-flow basis. Flows were not timed out, except when they exceeded the
length of the traffic trace.
Based on these features, the EM algorithm was used to group the traffic flows into a small
number of clusters, and classification rules were then created from these clusters. From these
rules, features that did not have a large impact on the classification were identified and removed
from the input to the learning machine and the process was repeated. The implementation of
EM in this study included an option to allow the number of clusters to be found automatically
via cross-validation. The resulting estimation of performance was then used to select the best
competing model (hence the number of clusters).
In this study, the algorithm was found to separate traffic into a number of classes based
on traffic type (such as bulk transfer, small transactions, or multiple transactions). However,
the results were limited in identifying individual applications of interest. Nonetheless, it
may be suitable to apply this approach as the first step of classification in cases where the traffic
is completely unknown, as it could possibly provide an indication of the group of applications
that have similar traffic characteristics. Importantly, the use of features that are calculated on
the basis of the completion of traffic flows hinders the application of the approach for real-time
IP traffic classification in an operational network.
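For readers unfamiliar with EM, the following toy sketch fits a two-component, one-dimensional Gaussian mixture to a single flow feature (e.g. packet length). It is a generic illustration of the algorithm, not McGregor et al.'s implementation, and the data values are synthetic:

```python
import math

def em_1d(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture by EM (toy sketch)."""
    mu = [min(xs), max(xs)]       # crude initialisation
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 for k in (0, 1)]
            s = p[0] + p[1]
            resp.append((p[0] / s, p[1] / s))
        # M-step: re-estimate weights, means and variances
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, xs)) / nk + 1e-6
    return mu, var, w

# Synthetic packet lengths from two 'applications':
xs = [60, 62, 58, 61, 59, 1400, 1395, 1410, 1405, 1390]
mu, var, w = em_1d(xs)
# The two component means converge near 60 and 1400.
```

The repeated E/M iterations converge to a local maximum of the likelihood, which is why AutoClass (next subsection) restarts EM from multiple pseudo-random points.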
4.2.2 Automated application identification using AutoClass
The work of Zander et al. [60], proposed in 2005, uses AutoClass [162], an unsupervised
Bayesian classifier that uses the EM algorithm to determine the best set of clusters from the
training data. EM is guaranteed to converge to a local maximum. To find the global maximum,
AutoClass repeats EM searches starting from pseudo-random points in the parameter space. The
model with the parameter set that has the highest probability is considered the best.
AutoClass can be preconfigured with the number of classes (if known) or it can try to
estimate the number of classes itself. First, packets are classified into bi-directional flows and
flow characteristics are computed using NetMate [163]. A number of features are calculated for
each flow, in each direction (listed in Table B.1). Feature values are calculated on a full-flow
basis. A flow timeout of 60 seconds is used.
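The flow construction step described above can be sketched as follows. This is an illustrative sketch only; NetMate's actual flow logic is more elaborate, and the five-tuples here are assumed to be already canonicalised so that both directions map to one key:

```python
def assemble_flows(packets, timeout=60.0):
    """Group (timestamp, five_tuple) packets into flows, starting a new
    flow whenever a five-tuple has been idle for longer than `timeout`
    seconds (a sketch of the flow definition described above)."""
    last_seen = {}   # five_tuple -> timestamp of its previous packet
    flow_id = {}     # five_tuple -> current flow number
    flows = {}       # flow number -> list of packet timestamps
    next_id = 0
    for ts, key in packets:           # packets assumed time-ordered
        if key not in last_seen or ts - last_seen[key] > timeout:
            flow_id[key] = next_id    # idle too long: open a new flow
            next_id += 1
        last_seen[key] = ts
        flows.setdefault(flow_id[key], []).append(ts)
    return flows

# A 90-second gap splits five-tuple 'k' into two flows:
flows = assemble_flows([(0.0, 'k'), (5.0, 'j'), (10.0, 'k'), (100.0, 'k')])
```

Feature vectors (per flow, per direction) would then be computed over each of the resulting packet groups.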
Sampling is used to select a subset of the flow data for the learning process. Once the
classes (clusters) have been learnt, new flows are classified. The results of the learning and
classification are exported for evaluation. The approach is evaluated based on random samples
of flows obtained from three 24-hour traffic traces (Auckland-VI, NZIX-II and Leipzig-II traces
from NLANR [164]).
Taking a further step from [59], this study proposed a method for cluster evaluation. A met-
ric called intra-class homogeneity, H, is introduced to assess the quality of the resulting classes
and classification. H of a class is defined as the largest fraction of flows of one application in
the class. The overall homogeneity H of a set of classes is the mean of the class homogeneities.
The goal is to maximise H to achieve a good separation between different applications.
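The metric can be sketched directly from its definition (the cluster contents below are invented for illustration):

```python
from collections import Counter

def class_homogeneity(cluster_labels):
    """H of a class: the largest fraction of flows of one application."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)

def overall_homogeneity(clusters):
    """Overall H: the mean of the per-class homogeneities."""
    hs = [class_homogeneity(c) for c in clusters]
    return sum(hs) / len(hs)

# Two clusters: one pure Half-Life, one mixed HTTP/FTP (invented data).
clusters = [['halflife'] * 10,
            ['http'] * 6 + ['ftp'] * 4]
# Per-class H values are 1.0 and 0.6, so overall H = 0.8.
```

A perfectly separated clustering has H = 1; mixed clusters pull H down.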
The results of this study revealed that some separation between the different applications
could be achieved, especially for certain particular applications (such as Half-Life online game
traffic) in comparison with others. With different sets of features used, the authors demonstrated
that H increased with an increase in the number of features used. H reached a maximum value
of between 85% and 89%, depending on the trace. However, their work did not address
the trade-off between the number of features used and the consequent computational
overhead.
To compute the accuracy for each application the authors mapped each class to the applica-
tion that was dominating that class (by having the largest fraction of flows in that class). The
authors used accuracy (Recall) as an evaluation metric. Median accuracy was ≥ 80% for all ap-
plications across all traces. However, there were some exceptional cases. For example, for the
Napster application there was one trace where it was not dominating any of the classes (hence
the accuracy is 0%). The results also indicated that FTP, HTTP and Telnet seemed to have the
most diverse traffic characteristics and were spread across many classes.
In general, although the mapping of class to application shows promising results in separat-
ing the different applications, the number of classes resulting from the clustering algorithm is
high (approximately 50 classes for 8 selected applications). For class and application mapping,
it is a challenge to identify applications that do not dominate any of the classes. The use of fea-
tures that require (or are calculated on the basis of) the completion of traffic flows also hinders
the application of the approach in real-time IPTC in an operational network.
4.2.3 TCP-based application identification using Simple K-Means
In 2006 Bernaille et al. [122] proposed a technique using an unsupervised ML (Simple K-
Means) algorithm that classified different types of TCP-based applications using the first few
packets of the traffic flow.
In contrast to previously published work, the method proposed in this paper allows early
classification of a traffic flow by looking at only the first few packets of a TCP flow. The intuition
behind this method is that the first few packets capture the application’s negotiation phase,
which is usually a pre-defined sequence of messages and is distinct among applications.
The training phase is performed offline. The input is a one-hour packet trace of TCP flows
from a mix of applications. Flows are grouped into clusters based on the values of their first
P packets. Flows are represented by points in a P-dimensional space, where each packet is
associated with a dimension; the coordinate on dimension p is the size of packet p in the flow.
Bi-directional flows are used. Packets sent by the TCP server are distinguished from packets
sent by the TCP client by having a negative coordinate.
Similarity between flows is measured by the Euclidean distance between their associated
spatial representations. After natural clusters are formed, the modelling step defines a rule to
assign a new flow to a cluster. (The number of clusters is chosen by trialling different values
of k for the K-means algorithm.) The classification rule is simple: the Euclidean distance
between the new flow and the centre of each pre-defined cluster is computed, and the new
flow belongs to the cluster for which the distance is the least. The training set also consists of
payload, so that flows in each cluster can be labelled with their source application. The learning
output consists of two sets: one with the description of each cluster (the centre of the cluster),
and the other with the composition of its applications. Both sets are used to classify flows
online.
In the classification phase, packets are formed into a bi-directional flow. The sizes of the
first P packets of the connection are captured and used to map the new flow to a spatial repre-
sentation. After the cluster is defined, the flow is associated with the application that is the most
prevalent in that cluster.
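The nearest-centre classification rule can be sketched as follows. The cluster centres and packet sizes are invented for illustration; as in the paper, packets sent by the server carry a negative coordinate:

```python
import math

def nearest_cluster(flow_sizes, centres):
    """Assign a flow (the sizes of its first P packets, with server-sent
    packets negated) to the cluster whose centre is closest in
    Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centres)), key=lambda k: dist(flow_sizes, centres[k]))

# Toy centres for P = 3 (invented values, not from the paper):
centres = [(64, -1400, 64),    # cluster 0: e.g. a bulk-download start
           (120, -120, 120)]   # cluster 1: e.g. a chatty negotiation
label = nearest_cluster((100, -130, 110), centres)  # assigned to cluster 1
```

The flow is then labelled with whichever application is most prevalent in the chosen cluster, which is exactly where the POP3 failure mode below arises.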
The results reveal that more than 80% of total flows are correctly identified for a number of
applications by using the first five packets of each TCP flow. One exceptional case is the POP3
application. The classifier labels 86% of POP3 flows as NNTP and 12.6% as SMTP, because
POP3 flows always belong to clusters where POP3 is not the dominant application.
The results of this approach are promising for early classification of traffic flows. However,
this approach assumes that the classifier can always capture the start of each flow. The effec-
tiveness of the approach when the classifier misses the first few packets of the traffic flow has
not been discussed or addressed. Furthermore, with the use of an unsupervised algorithm and its
classification technique, the proposal faces the challenge of classifying an application when it
does not dominate any of the clusters found.
4.2.4 Identifying HTTP and P2P traffic in the network core
The work of Erman et al. [123] in early 2007 addressed the challenge of traffic classification at
the core of the network, where the available information about the flows and their contributors
might be limited. This work proposed to classify a flow using only uni-directional flow informa-
tion. The authors note that, while the server-to-client direction of a TCP connection might provide more
useful statistics and better accuracy than the reverse direction, it may not always be feasible to
capture traffic in that direction. The researchers also developed and evaluated an algorithm
that could estimate missing statistics from a uni-directional packet trace.
The proposed approach makes use of clustering machine learning techniques, demon-
strated using the K-Means algorithm. Similar to other clustering approaches, Euclidean
distance is used to measure the similarity between two flow vectors.
Uni-directional traffic flows are described by a full-flow-based features set (listed in Table
B.3). Possible traffic classes include HTTP, P2P and FTP. For the training phase, it is assumed
that labels for all training flows are available (manually classified based on payload content and
protocol signatures), and a cluster is mapped back to a traffic class that makes up the majority of
flows in that cluster. An unseen flow will be mapped to the nearest cluster based on its distance
from the clusters’ centroids.
The approach is evaluated using flow accuracy and byte accuracy as performance metrics.
Three datasets are considered: datasets containing only client-to-server packets, datasets con-
taining only server-to-client packets, and datasets containing a random mixture of each direc-
tion. The K-Means algorithm requires the number of clusters as an input and it has been shown
that both flow and byte accuracies improved as k increased from 25 to 400. Overall, the server-
to-client datasets consistently gave the best accuracy (95% and 79% in terms of flows and bytes
respectively). With the random datasets, the average flow and byte accuracy was 91% and 67%
respectively. For the client-to-server datasets, 94% of the flows and 57% of the bytes were
correctly classified.
The algorithm to estimate the missing flow statistics is based on the syntax and seman-
tics of the TCP protocol, so it works only with TCP, not with other transport protocols. The
flow statistics are divided into three general categories: duration, number of bytes, and number
of packets. The flow duration in the missing direction is estimated as the duration calculated
with the first and the last packet seen in the observed direction. The number of bytes trans-
mitted is estimated according to information contained in acknowledgement (ACKs) packets.
The number of packets sent is estimated with the tracking of the last sequence number and ac-
knowledgement number seen in the flow, with regards to the maximum segment size (MSS). A
number of assumptions are made in this process. For example, a common MSS value of 1460
bytes is assumed, along with a simple acknowledgement strategy of one ACK (a 40-byte header
with no payload) per data packet, and no packet loss or retransmission. In
this study, an evaluation of the estimation algorithm is reported; the results were promising for
flow duration and byte estimation, with a relatively larger error range revealed for the
packet-count estimation.
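The estimation idea can be sketched as follows. This is a simplified illustration built on the assumptions stated above, not the authors' algorithm, and the timestamps and acknowledgement numbers are invented:

```python
import math

def estimate_reverse_stats(first_ts, last_ts, first_ack, last_ack, mss=1460):
    """Estimate the missing direction of a TCP flow from the observed
    direction only, under the stated assumptions (MSS = 1460 bytes,
    one ACK per data packet, no loss or retransmission)."""
    duration = last_ts - first_ts          # same span as the observed side
    data_bytes = last_ack - first_ack      # bytes covered by the ACK numbers
    packets = math.ceil(data_bytes / mss)  # full-MSS data packets implied
    return duration, data_bytes, packets

# ACK numbers seen in the observed direction advance from 1000 to 30200:
d, b, n = estimate_reverse_stats(0.0, 2.5, 1000, 30200)
# 29200 bytes were sent in the unseen direction, i.e. 20 full-MSS packets.
```

Violations of the assumptions (delayed ACKs, loss, sub-MSS segments) bias the packet-count estimate, which is consistent with the larger error range reported for that statistic.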
This work addressed the interesting issue of the possibility of using uni-directional flow
statistics for traffic classification and proposed a method to estimate the missing statistics. A
related issue of directionality in the use of bi-directional traffic flows, based on the material and
discussion in this thesis, was addressed earlier in [160]. The use of features that are calculated on
the basis of the completion of traffic flows hinders the application of the approach in real-time
IP traffic classification in an operational network.
4.3 Supervised learning approaches
4.3.1 Statistical signature-based approach using NN, LDA and QDA algorithms
In 2004 Roughan et al. [61] proposed to use the nearest neighbours (NN), linear discriminant
analysis (LDA) and quadratic discriminant analysis (QDA) ML algorithms to map different
network applications to predetermined QoS traffic classes.
The authors list a number of possible features, and classify them into five categories:
• Packet Level: e.g. packet length (mean and variance, root mean square).
• Flow Level: flow duration, data volume per flow, number of packets per flow (all with
mean and variance values) etc. Uni-directional flow is used.
• Connection Level: e.g. advertised TCP window sizes, throughput distribution and the
symmetry of the connection.
• Intra-flow/connection features: e.g. packet inter-arrival times between packets in flows.
• Multi-flow: e.g. multiple concurrent connections between the same set of end-systems.
Of the features considered, the most valuable pair was the average packet length and flow
duration. These features are computed per full-flow, then per aggregate of flows within 24-hour
periods (an aggregate is a collection of statistics indexed by server port and server IP address).
Three cases of classification are considered. The three-class classification looks at three
types of application: Bulk data (FTP-data), Interactive (Telnet), and Streaming (RealMedia).
The four-class classification looks at four types of applications: Interactive (Telnet), Bulk data
(FTP), Streaming (RealMedia) and Transactional (DNS). The seven-class classification looks
at seven applications: DNS, FTP, HTTPS, Kazaa, RealMedia, Telnet and WWW.
The classification process is evaluated using 10-times cross-validation. The classification
error rates are shown to vary depending on the number of classes the process has sought to
identify. The three-class classification had the lowest error rate, varying from 2.5% to 3.4% for
the different algorithms, while the four-class classification had an error rate in the range of 5.1% to
7.9%, and the seven-class one had the highest error rate, of 9.4% to 12.6%. The use of features
that are calculated on the basis of the completion of traffic flows hinders the application of the
approach in real-time IP traffic classification in an operational network.
4.3.2 Classification using Bayesian analysis techniques
In 2005 Moore and Zuev [98] proposed to apply the supervised ML Naive Bayes technique to
categorise Internet traffic by application. Traffic flows in the dataset used are manually classified
(based upon flow content) allowing accurate evaluation.
In this study, the classifier was trained using 248 full-flow based features (a summary is
listed in Table B.2). Selected traffic for Internet applications was grouped into different cate-
gories for classification, such as bulk data transfer, database, interactive, mail, services, HTTP,
P2P, attack, games and multimedia.
To evaluate the classifier’s performance, the authors used Accuracy and Trust (equivalent
to Recall) as evaluation metrics. The results showed that with the simple Naive Bayes tech-
nique, using the whole population of flow features, approximately 65% flow accuracy could be
achieved in classification. Two refinements for the classifier were performed, with the use of the
Naive Bayes Kernel Estimation (NBKE) and Fast Correlation-Based Filter (FCBF) methods 1.
These refinements helped to reduce the feature space and improved the classifier performance
to a flow accuracy of greater than 95% overall. With the best combination technique, the Trust
value for an individual class of application ranged, for instance, from 98% for HTTP, to 90%
for bulk data transfer, to approximately 44% for services traffic and 55% for P2P.
This research was extended with the application of the Bayesian neural network approach in
[165]. It has been demonstrated that accuracy is further improved when compared to the Naive
Bayes technique. The Bayesian trained neural network approach is able to classify flows with
up to 99% accuracy for data trained and tested on the same day, and 95% accuracy for data
trained and tested eight months apart. This paper also presented a list of features including their
descriptions and ranking in terms of importance.
While achieving very good classification results, similar to that of other studies reviewed
in previous sections, this work made use of full-flow features. The use of features that are
calculated on the basis of the completion of traffic flows hinders the application of the approach
in real-time IP traffic classification in an operational network.
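As a generic illustration of the Naive Bayes technique underlying these studies (not Moore and Zuev's 248-feature classifier), a minimal Gaussian Naive Bayes over two invented flow features might look like this:

```python
import math
from collections import defaultdict

def train_gnb(samples):
    """Fit a per-class prior and per-feature mean/variance
    (Gaussian Naive Bayes sketch on invented flow features)."""
    by_class = defaultdict(list)
    for feats, label in samples:
        by_class[label].append(feats)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        stats = []
        for j in range(len(rows[0])):
            col = [r[j] for r in rows]
            mu = sum(col) / n
            var = sum((x - mu) ** 2 for x in col) / n + 1e-6
            stats.append((mu, var))
        model[label] = (n / len(samples), stats)
    return model

def predict_gnb(model, feats):
    """Pick the class maximising log prior + sum of per-feature
    Gaussian log-likelihoods (the 'naive' independence assumption)."""
    def log_like(prior, stats):
        ll = math.log(prior)
        for x, (mu, var) in zip(feats, stats):
            ll += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return ll
    return max(model, key=lambda c: log_like(*model[c]))

# Invented (mean packet length, duration) samples for two classes:
samples = [((1400, 300), 'bulk'), ((1350, 280), 'bulk'),
           ((90, 20), 'game'), ((110, 25), 'game')]
model = train_gnb(samples)
```

The NBKE refinement discussed above replaces the per-feature normal distribution with a kernel density estimate, which is what lifted the reported accuracy.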
4.3.3 GA-based classification techniques
In 2006 Park et al. [166] made use of a feature selection technique based on the Genetic Algo-
rithm (GA). Using the same feature set specified in [152] (listed in Table B.2), three classifiers were
tested and compared: the Naive Bayesian classifier with Kernel Estimation (NBKE), Decision
Tree J48 and the Reduced Error Pruning Tree (REPTree) classifier. Their results suggest that
the two decision tree classifiers provide more accurate classification results than the NBKE
classifier. Their work also highlights the impact of using training and testing data from different
measurement points.
Early flow classification is also briefly mentioned. Accuracy as a function of the number of
packets used for classification is presented for J48 and REPTree classifiers. The first 10 packets
used for classification seem to provide the most accurate result. However, the accuracy result
1The NBKE method is a generalisation of Naive Bayes. It addresses the problem of approximating every feature by a normal distribution. Instead of using a normal distribution with parameters estimated from the data, it uses kernel estimation methods. FCBF is a feature selection and redundancy reduction technique. In FCBF, the goodness of a feature is measured by its correlation with the class and with other good features. A feature is considered good if it is highly correlated with the class, yet is not correlated with any other good features [98].
is provided as an overall result. It is not clear how it would be different for different types of
Internet applications. The effectiveness of the approach when the classifier misses the first few
packets of the traffic flow also has not been discussed or addressed.
4.3.4 Simple statistical protocol fingerprint method
Crotti et al. [167] in early 2007 proposed a flow classification mechanism based on three prop-
erties of the captured IP packets: packet length, inter-arrival time and packet arrival order. They
defined a structure called a protocol fingerprint, which expresses the three traffic properties in a
compact way, and used an algorithm based on normalised thresholds for flow classification.
There are two phases in the classification process: training and classifying. In the training
phase, pre-labelled flows from the application to be classified (the training dataset) are analysed
to build the protocol fingerprints. Uni-directional flows are used; a classifier on the path between
the client and the server will see a pair of flows, one in each direction.
At the IP layer, a flow with N packets can be characterised as an ordered sequence of N
pairs Pi = {si, ∆ti}, with 1 ≤ i ≤ N, where si represents the size of Packeti and ∆ti represents
the inter-arrival time between Packeti−1 and Packeti. Given a set of flows generated by the same,
known protocol, captured by a monitoring device, and with L + 1 being the number of packets of the
longest-lived flow, the protocol's fingerprint is generated as a Probability Density Function
vector PDF that consists of L Probability Density Functions PDFi. The ith PDFi is built from all
the ith pairs Pi belonging to those flows that are at least i + 1 packets long [167].
In order to classify an unknown traffic flow given a set of different PDFs, the authors check
whether the behaviour of the flow is statistically compatible with the description given by at
least one of the PDFs, and choose which PDF best describes it. An anomaly score that gives
a value between 0 and 1 is used to indicate how ‘statistically distant’ an unknown flow is from
a given protocol PDF. It shows the correlation between the unknown flow’s ith packet and the
application layer protocol PDFi described by the specific PDF used; the lower the value, the
higher the probability that the flow was generated by that protocol. To avoid the effects of noise
within the training data, a Gaussian filter is applied to each component of the PDF vector.
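A much-simplified sketch of the fingerprint idea, using packet sizes only (the inter-arrival times, the normalised-threshold scoring and the Gaussian smoothing step are omitted, and all values are invented):

```python
from collections import Counter

def build_fingerprint(flows, positions=3, bin_size=100):
    """Per-position packet-size histograms: a crude stand-in for the
    per-position PDFi of the protocol fingerprint described above."""
    fp = []
    for i in range(positions):
        bins = [f[i] // bin_size for f in flows if len(f) > i]
        counts = Counter(bins)
        total = sum(counts.values())
        fp.append({b: c / total for b, c in counts.items()})
    return fp

def anomaly_score(fp, flow, bin_size=100):
    """0 = perfectly typical of the protocol, 1 = never seen: one minus
    the mean per-position probability of the flow's packet sizes."""
    probs = [fp[i].get(flow[i] // bin_size, 0.0)
             for i in range(min(len(fp), len(flow)))]
    return 1.0 - sum(probs) / len(probs)

# Three training flows of one 'protocol' (first three packet sizes each):
fp = build_fingerprint([[60, 1400, 60], [70, 1380, 65], [80, 1420, 55]])
```

A flow matching the learnt pattern scores near 0, while a flow with unseen per-position sizes scores 1, mirroring the lower-is-better semantics of the anomaly score.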
Their results reveal a flow accuracy of more than 91% for classifying three applications –
HTTP, SMTP and POP3 – using the first few packets of each application’s traffic flow.
80 CHAPTER 4. IP TRAFFIC CLASSIFICATION USING MACHINE LEARNING
In a similar way to the work of Bernaille et al. [122] reviewed above, this approach demon-
strates advanced results for timeliness of the classification. However, it has the same limitation
in assuming that the classifier can always capture the start of each flow, and is aware of the loca-
tions of client and server (for constructing the PDF of client-server and server-client directions).
The effectiveness of the approach when the classifier misses the first few packets of the traffic
flow (assumed to carry the protocol fingerprint) has not been addressed.
4.4 Hybrid approaches
Erman et al. [168] in early 2007 proposed a semi-supervised traffic classification approach
which combines unsupervised and supervised methods. Motivations for their proposal are
grounded in two main reasons. Firstly, labelled examples are scarce and difficult to obtain,
while supervised learning methods do not generalise well when trained with few examples
in the dataset. Secondly, new applications may appear over time, and not all of them are
known a priori; traditional supervised methods map unseen flow instances into one of the
known classes, without the ability to detect new types of flows [168].
To overcome the challenges, the proposed classification method consists of two steps. First,
a training dataset consisting of labelled flows combined with unlabelled flows is fed into a
clustering algorithm. Second, the available labelled flows are used to obtain a mapping from
the clusters to the different known classes. This step allows some clusters to remain unmapped. To map a
cluster with labelled flows back to an application type, a probabilistic assignment is used. The
probability is estimated by the maximum likelihood estimate njk / nk, where njk is the number of
flows with label j that were assigned to cluster k, and nk is the total number of labelled flows that
were assigned to cluster k. Clusters without any labelled flows assigned to them are labelled
‘Unknown’ as application type. Finally, a new unseen flow will be assigned to the nearest
cluster with the distance metric chosen in the clustering step.
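The mapping and classification steps can be sketched as follows (illustrative Python, not the authors' implementation; it assumes the clustering step has already produced per-flow cluster assignments and centroids, and all names are mine):

```python
from collections import Counter

def map_clusters_to_labels(assignments, labels):
    """Map each cluster to an application label by maximum likelihood.

    assignments: cluster index for every training flow.
    labels: application label for every flow, or None if unlabelled.
    A cluster's label is the j maximising n_jk / n_k; clusters that
    received no labelled flows are left out and treated as 'Unknown'."""
    counts = {}
    for k, j in zip(assignments, labels):
        if j is not None:
            counts.setdefault(k, Counter())[j] += 1
    return {k: c.most_common(1)[0][0] for k, c in counts.items()}

def classify(flow, centroids, cluster_labels):
    """Assign a new flow to the nearest centroid (squared Euclidean
    distance here, standing in for the metric chosen in clustering)."""
    nearest = min(centroids,
                  key=lambda k: sum((a - b) ** 2 for a, b in zip(flow, centroids[k])))
    return cluster_labels.get(nearest, 'Unknown')
```

For example, a flow landing nearest to a cluster that received no labelled training flows would be reported as 'Unknown', which is exactly how this scheme flags previously unseen applications.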
This newly proposed approach has promising results. Preliminary results have been shown
in [168] with the employment of the K-Means clustering algorithm. In this case, the classifier
was provided with 64,000 unlabelled flows. Once the flows were clustered, a fixed number
of random flows in each cluster were labelled. Results reveal that with two labelled flows per
cluster and K = 400, this approach results in a 94% flow accuracy. The increase in classification
accuracy is marginal when five or more flows are labelled per cluster. Further discussion of
these results can be found in [169].
The proposal is claimed to offer the following advantages: faster training time with a small
number of labelled flows mixed with a large number of unlabelled flows; being able to handle
previously unseen applications and the variation of existing applications’ characteristics; and
the possibility of enhancing the classifier’s performance by adding unlabelled flows for iterative
classifier training [169]. An evaluation of these advantages has not been performed in [169].
Nevertheless, these findings motivate my investigation into using only a small number of la-
belled samples (down-sampling) for clustering in assisting SSP-ACT, as presented in Chapter
6.
4.5 Comparisons and related work
4.5.1 Comparison of different clustering algorithms
In 2006 Erman et al. [153] compared three unsupervised clustering algorithms: K-Means,
Density Based Spatial Clustering of Applications with Noise (DBSCAN) and AutoClass. The
comparison was performed on two empirical data traces: one public trace from the University
of Auckland, and one self-collected trace from the University of Calgary.
The effectiveness of each algorithm is evaluated using overall accuracy and the number of
clusters it produces. Overall accuracy measurement determines how well the clustering algor-
ithm is able to create clusters that contain only a single traffic category. A cluster is labelled
by the traffic class that makes up the majority of its total connections (bi-directional traffic
flows). Any connection that has not been assigned to a cluster is labelled as noise. Then overall
accuracy is determined by the portion of the total TP for all clusters out of the total number
of connections to be classified. In all clustering algorithms, the number of clusters produced
by a clustering algorithm is an important evaluation factor as it affects the performance of the
algorithm in the classification stage.
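Under the definition above, overall accuracy might be computed as follows (a sketch of my own; unassigned connections are treated as noise, and each cluster's majority class supplies its true positives):

```python
from collections import Counter

def overall_accuracy(assignments, true_labels):
    """Overall accuracy of a clustering: label each cluster with the
    majority traffic class of its connections, count those majority
    connections as true positives, and divide the total TP by the
    number of connections. Connections assigned to no cluster
    (cluster index None) count as noise and contribute no TP."""
    clusters = {}
    for k, label in zip(assignments, true_labels):
        if k is not None:
            clusters.setdefault(k, Counter())[label] += 1
    tp = sum(c.most_common(1)[0][1] for c in clusters.values())
    return tp / len(true_labels)
```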
The results of this study revealed that the AutoClass algorithm produced the best overall
accuracy. On average, AutoClass was 92.4% and 88.7% accurate in the Auckland and Calgary
datasets respectively. It produced on average 167 clusters for the Auckland dataset (for less
than 10 groups of applications) and 247 clusters for the Calgary dataset (for four groups of
applications). For K-Means, the number of clusters can be set, and the overall accuracy steadily
improves as the number of clusters (K) increases. When K is around 100, overall accuracy
was 79% and 84% on average for the Auckland and Calgary datasets respectively. Accuracy
is improved only slightly with a greater value of K. The DBSCAN algorithm produces lower
overall accuracy (up to 75.6% for the Auckland and 72% for the Calgary data sets); however, it
places the majority of the connections in a small subset of the clusters. Looking at the accuracy
for particular traffic class categories, the DBSCAN algorithm has the highest precision value
for P2P, POP3 and SMTP (lower than AutoClass for HTTP traffic).
This study only briefly considers the model build time, and does not explore other perfor-
mance evaluation measurements, such as processing speed, CPU and memory usage, or the
timeliness of classification.
4.5.2 Comparison of clustering versus supervised techniques
In 2006 Erman et al. [170] evaluated the effectiveness of the supervised Naive Bayes and
clustering AutoClass algorithms. Three accuracy metrics were used for evaluation: Recall,
Precision and overall accuracy (overall accuracy is defined here as it is in [153], reviewed in the
previous section).
Classification using the supervised Naive Bayes algorithm is straightforward. For classifi-
cation using AutoClass, once AutoClass comes up with the most probable set of clusters from
the training data, the clustering is transformed into a classifier. A cluster is labelled with the
most common traffic category of the flows within it. If two or more categories are tied, then a
label is chosen randomly amongst the tied category labels. A new flow is then classified with
the traffic class label of the cluster to which it is most similar [170].
The evaluation was performed on two 72-hour data traces provided by the University of
Auckland (NLANR). A connection in this instance is defined as a bi-directional flow. The
feature set is shown in Table B.4.
This research indicated that with the dataset used and nine application classes (HTTP, SMTP,
DNS, SOCKS, IRC, FTP control, FTP data, POP3 and LIMEWIRE), AutoClass has an average
overall accuracy of 91.2% whereas the Naive Bayes classifier has an overall accuracy of 82.5%.
According to the authors, AutoClass also performs better in terms of Precision and Recall for
individual traffic classes. On average, for Naive Bayes, both Precision and Recall for six out of
nine classes were above 80%; whereas for AutoClass, all classes have Precision and Recall
values above 80%, six out of the nine classes have average Precision values of above 90%, and
seven have average Recall values of above 90%. However, in terms of the time taken to build
a classification model, AutoClass takes far longer than the Naive Bayes algorithm (2,070 seconds
versus 0.06 seconds for the algorithm implementation, data and equipment used).
The conclusion that the unsupervised AutoClass outperforms the supervised Naive Bayes
in terms of overall accuracy might be counter-intuitive. Furthermore, another issue related to
clustering approaches is the real-time classification speed, as the number of clusters resulting
from the training phase is typically larger than the number of application classes. Neither of
these two issues has been explored further in [170].
4.5.3 Comparison of different supervised ML algorithms
Williams et al. [171] in 2006 provided insights into the performance aspect of ML traffic clas-
sification. Their work compared a number of supervised ML algorithms: Naive Bayes with
Discretisation (NBD), Naive Bayes with Kernel Density Estimation (NBK), C4.5 Decision
Tree, Bayesian Network, and Naive Bayes Tree. The computational performance of these algo-
rithms is evaluated in terms of classification speed (number of classifications per second) and
the time taken to build the associated classification model.
Results have been collected from experiments on three public NLANR traces. The features
used for analysis include the full set of 22 features, and the two best reduced feature sets selected
by correlation-based feature selection (CFS) and consistency-based feature selection (CON)
algorithms. The feature set is shown in Table B.4.
The results indicate that most algorithms achieve high flow accuracy with the full set of 22
features (the NBK algorithm achieves > 80% accuracy and the rest of the algorithms achieve
greater than 95% accuracy). With the reduced sets of eight (CFS) and nine (CON) features, the
results achieved by cross-validation reveal only slight changes in the overall accuracy compared
to the use of the full feature set. The largest reduction in accuracy was 2-2.5% for NBD and
NBK with the use of the CON reduced feature set.
Despite the similarity in classification accuracy, this study found significant differences in
classification computational performance. The C4.5 Decision Tree algorithm was seen as the
fastest algorithm when using any of the feature sets (with a maximum of 54,700 classifications
per second on a 3.4GHz Pentium 4 workstation running SUSE Linux 9.3 with Waikato Envi-
ronment for Knowledge Analysis (WEKA) implementation of ML algorithms). The ranking of
algorithms in descending order in terms of classification speeds is: C4.5 Decision Tree, NBD,
Bayesian Network, Naive Bayes Tree, NBK.
In terms of the required model build time, the Naive Bayes Tree algorithm takes significantly
longer than the other algorithms. The ranking of algorithms in descending order in terms of
required model build time is: Naive Bayes Tree, C4.5 Decision Tree, Bayesian Network, NBD
and NBK. Feature reduction is also shown to greatly improve performance of the algorithms in
terms of model build time and classification speeds for most algorithms.
These findings are in line with my results presented in Chapters 6, 7 and 8: the C4.5
Decision Tree classifier takes longer to build, but is faster than the NBD classifier in terms of
classification speed.
4.5.4 ACAS: Classification using machine learning techniques on application signatures
Haffner et al. [172] in 2005 proposed an approach for the automated construction of application
signatures using machine learning techniques. In contrast to the other works, this work makes
use of the first n Bytes of a data stream as features. Although it shares the same limitation as
other works that require access to packet payloads, I include it in my literature review as it is
also ML-based, and its interesting results may be useful in a composite ML-based approach that
combines different information such as statistical characteristics, contents, and communication
patterns.
Three learning algorithms – Naive Bayes, AdaBoost and Maximum Entropy – have been
investigated for constructing application signatures for a range of network applications:
FTP control, SMTP, POP3, IMAP, HTTPS, HTTP and SSH. A flow instance is characterised
with n Bytes represented in binary value, and ordered by the position of the Byte in the flow
stream. The collection of flow instances with binary features is used as input by the machine
learning algorithms.
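The byte-based binary representation might be sketched as follows (a sparse encoding of my own devising, for illustration only; the original work's exact feature layout may differ):

```python
def byte_features(payload, n=64):
    """Sparse binary features over the first n bytes of a flow's data
    stream, ordered by byte position: feature (i, b) is active when the
    byte at position i has value b. Shorter streams simply activate
    fewer features."""
    return {(i, b) for i, b in enumerate(payload[:n])}
```

Sets of active features like these are what a learner such as Naive Bayes, AdaBoost or Maximum Entropy would consume as input.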
Using the first 64 bytes of each TCP unidirectional flow, the overall error rate is below 0.51%
for all applications considered. Adaboost and Maximum Entropy provide the best results with
more than 99% of all flows classified correctly. Precision is above 99% for all applications
and Recall is above 94% for all applications except SSH (86.6%). (The poor performance on
SSH application was suspected due to the small amount of sample instances in the training
dataset). As with previously reviewed work on early traffic classification, the effectiveness of
this approach when the classifier misses the first few packets of the traffic flow (assumed to
carry the protocol fingerprint) has not been addressed.
4.5.5 BLINC: Multilevel traffic classification in the dark
Karagiannis et al. [53] in 2005 developed an application classification method based on the
behaviours of the source host at the transport layer, divided into three different levels. The social
level captures and analyses the interactions of the examined host with other hosts, in terms of
the numbers of them with which it communicates. The host’s popularity and that of other hosts
in its community’s circle are considered. The role of the host, in acting as a provider or the
consumer of a service, is classified at the functional level. Finally, transport layer information
is used, such as the four-tuple of the traffic (source and destination IP addresses, and source
and destination ports), and flow characteristics such as the transport protocol, and the average
packet size.
A range of application types was studied in this work, including HTTP, P2P, data transfer,
network management traffic, mail, chat, media streaming, and gaming. By analysing the social
activities of the host, the authors concluded that among the host’s communities, neighbouring
IPs may offer the same service (a server farm) if they use the same service port; exact
communities might indicate attacks, while partial communities may signify P2P or gaming applications.
In addition, most IPs acting as clients have a minimum number of destination IPs. Thus, focus-
ing on the identification of that small number of servers can help client identification, leading
to the classification of a large amount of traffic. Classification at the functional level shows that
a host is likely to be providing a service if, over a period of time, it uses a small number of
source ports, normally less than or equal to two across all of its flows. Typical client behaviour is
normally represented when the number of source ports is equal to the number of distinct flows.
The consistency of average packet size per flow across all flows at the application level is suggested to be a good property for identifying certain applications, such as gaming and malware.
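As an illustrative sketch of the functional-level heuristics above (not BLINC's exact rules; the threshold of two ports and the flow-tuple layout are assumptions):

```python
def functional_role(flows_of_host):
    """Guess a host's functional role from its transport-layer flows:
    a host using very few source ports (here <= 2) across its flows is
    likely providing a service, while a host whose source-port count
    equals its number of distinct flows behaves like a client.

    flows_of_host: list of (src_port, dst_ip, dst_port) tuples."""
    src_ports = {f[0] for f in flows_of_host}
    if len(src_ports) <= 2:
        return 'server'
    if len(src_ports) == len(flows_of_host):
        return 'client'
    return 'unknown'
```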
Completeness and accuracy are the two metrics used for the classification approach in this
case. Completeness is defined as the ratio of the number of flows (bytes) classified by BLINC
over the total number of flows (bytes), indicated through payload analysis. The results show
that BLINC can classify 80% to 90% of traffic flows with more than 95% flow accuracy (70%
to 90% for byte accuracy).
BLINC must gather information from several flows for each host before it can decide on the
role of one host. Such a requirement will present challenges to the employment of this method
in real-time operational networks.
4.5.6 Pearson’s Chi-Square test and Naive Bayes classifier
Focusing on the identification of Skype [173] traffic, in late 2007 Bonfiglio et al. [54] pre-
sented two tests: the first test, based on Pearson’s Chi-Square test, detects Skype’s fingerprint
through analysis of the message content randomness introduced by the encryption process; and
the second test, based on the Naive Bayes theorem, detects Skype’s traffic from its statistical
characteristics.
The aim of Pearson’s Chi-Square test in this context is to check if a message under analysis
complies with one of the Skype message formats, and can thus reveal fingerprints. The test is
based on the first few bits, bytes or the content of the whole message, dependent on the different
types of Skype traffic (e.g. Skype flows transported by UDP or TCP). The second test identifies
Skype flows based on message size (the segment size at the transport layer) and the average
packet inter-arrival time (called average-inter packet gap (average-IPG) in this work) features.
For a window of w packets, it characterises the message size distribution for each possible
Codec, using a number of joined Naive Bayes classifiers. The average-IPG is evaluated as 1/w
times the time elapsed between the reception of the first and the wth packet in the window. A
single Naive Bayes classifier is used for the average-IPG.
The combination of the two tests is shown to be effective in detecting Skype voice traf-
fic over UDP or TCP, with almost zero percent of false positives, and a few percent of false
negatives.
The idea of using feature values averaged on a small window (a window size of 30 packets
was chosen) for the Naive Bayes classifiers is similar to the idea of training on sub-flows in
SSP-ACT; however, its mechanism is different: it makes use of completely different feature
sets, with multiple Naive Bayes classifiers employed. (It is also worth noting that [54] was
published well after the basics of SSP-ACT had been published in 2006 [159] [160].)
4.6 Limitations of the reviewed works
This section provides a qualitative look at the extent to which the reviewed works overlap with
the additional constraints and requirements for using ML techniques inside real-time IP traffic
classifiers outlined in section 3.3.2.
Table B.5 (Appendix B.2) provides a qualitative summary of the reviewed works in relation
to the following criteria.
4.6.1 Timely and continuous classification
Most of the reviewed work has evaluated the efficacy of different ML algorithms when applied
to entire datasets of IP traffic, trained and tested over full-flows consisting of thousands of
packets (such as [98], [61], [60], [59], [170], and [171]).
Some studies ([122] and [167]) have explored the performance of ML classifiers that utilise
only the first few packets of a flow, but they cannot cope with missing the flow’s initial packets.
4.6.2 Directional neutrality
The assumption that application flows are bi-directional, and that the application’s direction
may be inferred prior to classification, permeates many of the works published to date ([98]
[59] [122] [60] [151]). Most works have assumed that they will see the first packet of each
bi-directional flow, and that this initial packet travels from a client to a server. Classification models
are often trained using this assumption, and subsequent evaluations have presumed the ML
classifier can calculate features with a correct sense of forward and reverse direction. However,
in a real-world network a classifier can assume nothing about the direction (client to server or
vice versa) of the first packet captured, particularly if it misses a number of packets from the
actual start of a given flow.
4.6.3 Efficient use of memory and processors
There are definite trade-offs between the classification performance of a classifier and the re-
source consumption of the implementation. For example, [98] and [165] reveal excellent poten-
tial for classification accuracy. However, they use a large number of features, many of which are
computationally challenging. The overheads involved with computing complex features (such
as effective bandwidth based upon entropy, or Fourier Transform of the packet inter-arrival
time) must be considered against the potential loss of accuracy if one simply did without those
features.
Williams et al. [171] provide some pertinent warnings about the trade-off between training
time and classification speed. (For example, among five ML algorithms studied, Naive Bayes
with Kernel Estimation took the shortest time to build a classification model, yet performed
slowest in terms of classification speed.)
Techniques for timely and continuous classification have tended to suggest a sliding win-
dow over which features are calculated. Increasing the length of this window ([159], [160] and
[172]) might increase classification accuracy. However, depending on the method of implemen-
tation (whether it includes opportunities for pipelining, step size with which the window slides
across the incoming packet streams, etc.) this may decrease the timeliness with which classifi-
cation decisions are made (and increase the memory required to buffer packets during feature
calculations). Most of the reviewed work has not, to date, closely investigated this issue.
4.6.4 Portability and Robustness
None of the reviewed works has seriously considered or addressed the issue of classification
model portability mentioned in section 3.3.2.
None of the reviewed works has addressed and evaluated their model’s robustness in terms
of classification performance with the introduction of packet loss, packet fragmentation, delay
and jitter. Unsupervised approaches have the potential to detect the emergence of new types of
traffic. However, this capability has not been evaluated in most of the works, and was only
briefly mentioned in [168].
4.7 My research goal
My goal is to identify and demonstrate a real-time, ML-based traffic classification system that
addresses the limitations identified in section 4.6. My primary focus is on the requirements
of timely and continuous classification, directional neutrality and efficient use of physical re-
sources.
4.8 Conclusion
In this chapter I have reviewed the state of the art in ML-based IP traffic classification. As
can be seen from the literature review, ML-based traffic classification provides very promising
results. However, all of the reviewed studies (prior to my published proposals in 2006 [159] and
[160]) have been concerned only with the accuracy of identifying traffic, but have overlooked
real-time operational deployment issues and requirements.
I have analysed the limitations of the reviewed proposals in terms of timely and continuous
classification, directional neutrality, efficient use of memory and processors, and portability and
robustness.
My research goal is to find an effective real-time, ML-based traffic classification system that
meets the requirements of timely and continuous classification, directional neutrality and effi-
cient use of physical resources. This is the motivation underlying my novel approach presented
in Chapters 5 and 7, while the portability and robustness of my proposed solution is evaluated
in Chapter 8.
Chapter 5
Training Using Multiple Sub-Flows to Optimise the Use of Machine Learning Classifiers in Real-World IP Networks
5.1 Introduction
In this chapter I present a novel modification to traditional ML training and classification tech-
niques. My technique optimises the classification of flows within finite periods of time and with
limited physical resources. I propose that realistic ML-based traffic classification tools should:
• Operate the ML classifier using a sliding window over each flow - the classifier can see
(or must use) no more than the N most recent packets of a flow at any given time.
• Train the ML classifier using sets of features calculated from multiple sub-flows - each
sub-flow is a fragment of N consecutive packets taken from different points within the
lifetime of the flow used for training.
N is chosen to reflect memory limitations in the classifier implementation or the upper limit
on the time allowed to classify a flow. Training on multiple sub-flows allows the sliding window
classifier to properly identify an application regardless of where within a flow the classifier
begins capturing packets.
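As a minimal sketch of the training side of this proposal (illustrative Python; the packet representation and choice of offsets are assumptions, not the exact tooling used in my experiments):

```python
def extract_subflows(packets, n, offsets):
    """Cut training sub-flows out of one full flow: each sub-flow is n
    consecutive (bi-directional) packets starting at a given offset M
    from the start of the flow. Offsets are chosen at points where the
    flow's statistics differ (e.g. the start and the middle); sub-flows
    that would run past the end of the flow are skipped."""
    return [packets[m:m + n] for m in offsets if m + n <= len(packets)]
```

Feature values would then be calculated over each extracted sub-flow, and the ML classifier trained on the combined sub-flow instances rather than on full-flow statistics.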
I illustrate my proposal’s benefits by considering an ISP that wishes to automatically and
quickly detect online interactive game traffic mingled in amongst regular consumer IP traffic. I
apply my modifications to the well-known Naive Bayes and C4.5 Decision Tree algorithms and
demonstrate distinct improvements in classification accuracy and timeliness, compared to the
performance of the training approaches used in the literature.
This chapter is organised as follows. Section 5.2 illustrates and justifies my proposed app-
roach. The details of my experimental method are described in section 5.3. I analyse the results
in section 5.4, followed by some discussions and conclusions in sections 5.5 and 5.6.
5.2 My proposal
My goal is to classify traffic based on only the most recent N packets of a flow (for some
small value of N), which I have called the sliding window. This is driven by two primary
considerations. First, an ML classifier is likely to be part of a larger system (for example,
automated QoS control as discussed in section 3.3.1), that must react swiftly once it identifies a
new flow as belonging to a class of interest. Reducing the time taken to detect traffic of interest
implies reducing the number of packets that must pass the monitoring point before classification
can be achieved. Second, re-calculating features over a sliding window of N packets requires us
to buffer the most recent N packets, so that we can remove the effect of the Nth most recent packet
when a new packet arrives in the same flow (hence the term sliding window). Particularly
on high-speed networks, a classifier may be observing (tens of) thousands of concurrent flows;
minimising the number of buffered packets per flow provides a beneficial reduction in physical
memory requirements.
A practical real-time classifier cannot assume it will see the beginning of all flows. For
example, classification may be initiated at a point in time when many thousands of flows are
already in progress. Thus, a classifier should be capable of recognising flows using N packets
starting from anywhere in a flow.
Using a sliding window of N packets does, however, pose potential problems. Application
flow statistics (such as the maximum packet inter-arrival time) over a small sliding window will
differ from those of the statistics over an entire long flow [174]. Application flow statistics also
often change during the lifetime of a flow. For example, the initial handshake of a new SMTP
connection may look quite different to the traffic while transferring the body of each email.
A classifier trained on feature values calculated for entire flows (as done in the majority of
previous research) may not recognise members of the class when presented with feature values
calculated from subsets of an unknown flow.
The preceding considerations give rise to my novel proposal for training ML classifiers.
First, extract two or more sub-flows (of N packets) from every flow that represents the class
of traffic we wish to identify in the future. Each sub-flow should be taken from places in
the original flow that have noticeably different statistical properties (for example, the start and
middle of the flow). Each sub-flow would result in a set of instances with feature values derived
from its N packets. Then train the ML classifier with the combination of these sub-flows rather
than the original full-flows.
To illustrate my proposal the following scenario is constructed: a real-time classifier must
accurately identify Wolfenstein Enemy Territory (ET) [47] traffic mixed amongst other un-
related, interfering traffic flows. ET is a highly interactive online game representative of ap-
plications whose traffic characteristics can change noticeably over the lifetime of each flow.
I compare classification accuracy using full-flows and sub-flows for various values of N, and
show that training on full-flows performs poorly when classifying using a small sliding window.
Poor Precision and Recall are seen even with a large window of 1000 packets. On the other
hand, training with multiple sub-flows allows a small window of N = 25 packets to achieve high
Recall and Precision. Importantly, classification performance can be maintained even when
packets are missed at the beginning of a flow. An evaluation of my proposal with VoIP traffic
is presented in Chapter 8.
5.3 My experimental approach
This section describes in detail my experimental approach, including how to prepare the data
for training, and how to build and test a classification model.
5.3.1 Flows and features
In my experiments, for UDP traffic a flow is considered to have stopped when no more packets
are seen for 60 seconds in both directions. For TCP traffic, a flow is stopped when the connec-
tion is explicitly torn down or no packets are seen for 60 seconds in both directions (whichever
comes first) 1.
1 A TCP flow is known to be explicitly torn down based on TCP header information. In the strict situation where a classifier is not allowed to do packet header inspection at all (even including TCP header information), only
In this chapter, I introduce a new term, sub-flow, which is defined as follows.
Sub-flow: each sub-flow is a fragment of N consecutive packets (bi-directional) taken from
different points within the original application flow’s lifetime. The forward direction of the
sub-flow is defined as it is in the full-flow: in the client-to-server direction.
Referring to the definition of full-flow presented in Figure 3.2, sub-flows are illustrated in
Figure 5.1. Let M (M ≥ 0) be the number of packets offset from the beginning of each full-flow,
and sub-flow SF-M denotes N consecutive packets starting from the Mth packet with regard to
the beginning of the full-flow.
Figure 5.1: An illustration of sub-flow definition (the full-flow, with forward and backward directions, begins at Packet 0, the first packet; sub-flow SF-M begins at Packet M)
I trained and classified using the following features:
• Minimum, maximum, mean and standard deviation of inter-packet arrival time 2 in the
forward and backward directions.
• Minimum, maximum, mean and standard deviation of inter-packet arrival length 3 in the
forward and backward directions.
• Minimum, maximum, mean and standard deviation of IP packet length in the forward and
backward directions.
These features are chosen as they are independent of flow length and packet contents. They
also require low computation overhead. For each packet captured, only its length and arrival
timestamp are needed for feature calculation. The calculation of minimum, maximum, mean
a timeout can be used to determine when the flow is finished.
2 The difference in arrival times of two consecutive packets traversing in the same direction.
3 The difference in lengths of two consecutive packets traversing in the same direction.
and standard deviation is simple and can be done incrementally as a packet arrives 4. This can
help to improve the classifier’s performance in terms of timeliness and processing speed, as well
as efficient use of memory and physical resources.
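The incremental calculation can be sketched with Welford's online algorithm for mean and variance (an illustrative Python sketch of the idea; NETMATE's actual implementation is in C++ and may differ):

```python
class RunningStats:
    """Running minimum, maximum, mean and standard deviation, updated in
    O(1) per packet with Welford's online algorithm, so no per-packet
    values need to be buffered for the feature calculation itself."""

    def __init__(self):
        self.n = 0
        self.min = float('inf')
        self.max = float('-inf')
        self.mean = 0.0
        self._m2 = 0.0                 # running sum of squared deviations

    def add(self, x):
        self.n += 1
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def stddev(self):
        # population standard deviation over the packets seen so far
        return (self._m2 / self.n) ** 0.5 if self.n else 0.0
```

One such accumulator per feature (inter-arrival time, inter-packet length difference, and packet length, in each direction) is updated as each packet arrives.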
Features for full-flows and sub-flows are calculated with a modified version of the NETMATE tool 5 [163].
5.3.2 Machine Learning algorithms
The Naive Bayes and C4.5 Decision Tree algorithms were chosen because:
• Both algorithms support supervised training, so that a classifier can be trained to identify
an individual application or a group (class) of applications of interest.
• These are well-known supervised learning algorithms. They also have been used in other
IP traffic classification work, including [98] and [151].
• They have quite different internal training and classification mechanisms. Testing my pro-
posal with both algorithms reveals similar benefits in each case, suggesting the approach
is applicable to more than just one type of ML algorithm.
• The underlying statistical computation is simple, tractable and understandable. Classifi-
cation models can be expressed as decision trees or sets of classification rules.
I used the WEKA (Waikato Environment for Knowledge Analysis) implementation of the
Naive Bayes and C4.5 Decision Tree (J48) algorithms [109] 6.
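Because the underlying statistics are simple, the core of Gaussian Naive Bayes can be sketched directly (a toy Python illustration of the principle, with hypothetical feature values; it is not the WEKA implementation used in my experiments):

```python
import math
from collections import defaultdict

def train_gnb(X, y):
    """Fit per-class prior, means and variances for each feature."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {}
    for cls, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        varis = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                 for col, m in zip(zip(*rows), means)]
        model[cls] = (n / len(X), means, varis)
    return model

def classify_gnb(model, x):
    """Pick the class maximising log prior + sum of log Gaussian likelihoods."""
    best, best_score = None, -math.inf
    for cls, (prior, means, varis) in model.items():
        score = math.log(prior)
        for v, m, var in zip(x, means, varis):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best, best_score = cls, score
    return best

# Toy features: (mean packet length, stddev of packet length), invented values.
X = [(80, 10), (90, 12), (600, 200), (550, 180)]
y = ["ET", "ET", "Other", "Other"]
model = train_gnb(X, y)
assert classify_gnb(model, (85, 11)) == "ET"
assert classify_gnb(model, (580, 190)) == "Other"
```

This per-attribute Gaussian modelling is essentially what a Naive Bayes classifier computes for numeric attributes; the C4.5 Decision Tree instead recursively splits on the attribute thresholds that maximise information gain.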
4 Since feature calculation is done incrementally as a new packet arrives, increasing the sliding window size (bigger N) does not increase the memory requirement to buffer information of the most recent N packets (discussed in section 5.2). However, it does increase the time required for a classification decision to be made, which includes the waiting time for a packet to arrive and computational overhead.
5 NETMATE (Network Measurement and Accounting System) is free, open-source network measurement software. It was developed and maintained by Zander (Centre for Advanced Internet Architectures, Swinburne University of Technology) and Schmoll (Fraunhofer Institute for Open Communication Systems (FOKUS), Berlin, Germany). The tool is written in C++. It has a modular (class-based) structure, which means it can easily be extended, and dynamically loadable packet processing and information export modules.
6 WEKA is a free, open-source ML and data mining tool written in Java. A number of standard ML techniques have been incorporated into the software. WEKA has a wide range of users, including ML researchers, industrial scientists and teachers [175]. A number of works in IP traffic classification make use of the tool, for example [98], [171], [153], [151], [60] and [59].
5.3. MY EXPERIMENTAL APPROACH 95
5.3.3 Some statistical properties of ET traffic
Wolfenstein Enemy Territory is an online, team-based first person shooter (FPS) game built on
the Quake III Arena game engine 7. It has characteristics representative of the online FPS game
genre, which also make it ideal for demonstrating my proposal, in that:
• Its traffic statistical properties vary over different phases during the flow lifetime.
• Its traffic is asymmetric in client-to-server and server-to-client directions.
A demonstration of the results I gathered with ET traffic can serve as a first proof of concept,
and suggests the applicability of my approach to other similar types of traffic. (Elaboration on
this point is given in Appendix A, where I show the similarities in statistical properties between
ET and other applications.) This section introduces and analyses some critical statistical
properties of ET traffic, justifying the reasons for my novel training approach.
Consistent with many other online FPS games, ET traffic seen at a server can exhibit three
different phases: clients probing the server (Probing phase), clients connecting to the server
(Connecting phase), and clients playing a game on the server (In-game phase) [176].
Figure 5.2 shows the variation of an ET flow’s characteristics as a scatter plot of two features
- standard deviation versus mean of packet length - calculated with N = 25 across 1000 ET flow
samples. In this illustrative example, the Probing phase’s features are calculated on sub-flows
that cover the first N packets of the full-flows; the Connecting phase’s features are calculated
on sub-flows of size N starting from the 20th packet; and the In-game playing phase’s features
are calculated on sub-flows of size N starting from the 2000th packet. Full-flow features are
calculated over the entire Probing, Connecting and In-game periods.
Full-flow and In-game feature values are shown on the right, Probing and Connecting feature
values are shown on the left. With only two features the regions are partially overlapping and
partially disjoint. While there is considerable overlap between the different phases, there is
also some separation, which needs to be learnt by a classifier to identify the different sub-flow
phases.
Similar behaviours are seen with the server-to-client direction, as illustrated in Figure 5.3.
A similar mix of overlapping and disjoint regions also occurs with other features (such as inter-
7 Detailed game settings and properties can be found in [176].
Figure 5.2: Packet length from client to server for ET traffic, N = 25 packets (packet length standard deviation versus mean packet length, in bytes; left panel: C-S Probing and Connecting sub-flows; right panel: Full-flow and In-game)
Figure 5.3: Packet length from server to client for ET traffic, N = 25 packets (packet length standard deviation versus mean packet length, in bytes; left panel: Probing and Connecting sub-flows; right panel: Full-flow and In-game)
packet arrival time and inter-packet length variation). This suggests that a classifier trained on
full-flow feature values may have trouble recognising the clusters of feature values calculated
on small windows of packets.
The changes in ET's statistical properties over different phases become clearer when looking at
the distribution of feature values calculated at different points during a flow's lifetime. Figures
5.4 and 5.5 show the values of two features, the mean and standard deviation of packet length
from client to server, for ET traffic with a window size of 25 packets. The classifier's window
slides across 1,000 ET flows, where M is the number of packets offset from the beginning of
each flow.
Figure 5.4: Mean packet length from client to server for ET traffic, N = 25 packets (boxplots of feature values for M = 0 to 9000 and full-flow)
Figure 5.5: Standard deviation of packet length from client to server for ET traffic, N = 25 packets (boxplots of feature values for M = 0 to 9000 and full-flow)
The results are presented using boxplots 8. As shown in Figures 5.4 and 5.5, the Probing
phase (M = 0) and the In-game phase (M > 1000) have quite different ranges of values, while the
Connecting phase or early In-game phase (10 ≤ M ≤ 90) has a large range of values, which
seems to cover the value ranges of both the Probing and In-game phases.
Figures 5.6 and 5.7 show the mean packet length and standard deviation of packet length
features calculated over the first N packets of ET flows in the client-to-server direction. The
results are collected from 1,000 flow samples. The sample set contains flows longer than
1,000 packets, with a median flow size of 60,000 packets.
Figure 5.6: Mean packet length in the client-to-server direction, calculated for the window of the first N packets taken from 1,000 flow samples for ET traffic (1,000 values of the mean for each N value: 10, 25, 50, 500, 1000 and full-flow)
From the plots it appears that the statistics calculated for different N values are different
from each other and from those calculated for full-flows. The differences in these feature
values are most significant when calculated for a small window compared to full-flow. Features
calculated for larger windows yield distributions closer to those calculated on full-flows.
8 The black line in the box indicates the median; the bottom and top of the box indicate the 25th and 75th percentiles, respectively. The vertical lines drawn from the box are whiskers. The upper cap is drawn at the largest observation that is less than or equal to the 75th percentile + 1.5*IQR (the interquartile range, which is essentially the length of the box). The lower cap is drawn at the smallest observation that is greater than or equal to the 25th percentile - 1.5*IQR. Any observations beyond the caps are drawn as individual points; these points indicate outliers.
Figure 5.7: The standard deviation of packet length in the client-to-server direction, calculated for the window of the first N packets taken from 1,000 flow samples for ET traffic (1,000 values of the standard deviation for each N value)
However, I use the two-sample Kolmogorov-Smirnov (two-sample KS) test 9 [178] to show
that there is strong evidence that they are different.
Table 5.1 shows the p-values for pairs of feature sets calculated for two different N
values. Each feature set contains 1,000 mean packet length values, calculated over the first N
packets of 1,000 flow samples. With each pair of feature sets the p-value is the probability that both of the
observed results could arise by chance from the same parent source. A p-value of less than
0.05 is regarded as probably significant and is usually the threshold at which the null hypothesis
(that both results come from the same source) is rejected [178]. In all cases, other than when
comparing a sample set with itself, we observe very small p-values indicating strong evidence
that the distributions are different. Similar characteristics have been seen with other features.
Consequently, this confirms that ET’s statistics calculated for different N values are different
from each other and different from those calculated for full-flow.
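The decision rule of the two-sample KS test can be sketched as follows (a Python sketch using Smirnov's asymptotic critical value at significance level alpha, rather than the exact p-values reported in Table 5.1; function names are hypothetical):

```python
import math

def ks_2samp(a, b, alpha=0.05):
    """Two-sample KS test: D is the maximum distance between the two
    empirical CDFs; reject the null hypothesis (same distribution) when
    D exceeds Smirnov's asymptotic critical value at level alpha."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            i += 1
        else:
            j += 1
        # (ties are handled approximately; adequate for illustration)
        d = max(d, abs(i / len(a) - j / len(b)))
    c = math.sqrt(-0.5 * math.log(alpha / 2))   # ~1.36 for alpha = 0.05
    reject = d > c * math.sqrt((len(a) + len(b)) / (len(a) * len(b)))
    return d, reject

# Two clearly different samples: D = 1.0, null hypothesis rejected.
d, reject = ks_2samp(list(range(100)), [x + 200 for x in range(100)])
assert reject and abs(d - 1.0) < 1e-9
```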
Figures 5.2 and 5.3 also demonstrate another important aspect of ET traffic. They suggest
the asymmetry of ET flows’ characteristics for traffic in the client-to-server and server-to-client
directions, which motivates my work as presented in Chapter 7.
9 My sample datasets are not normally distributed for all N values, according to the results of applying the Anderson-Darling test [177] to the datasets. Hence the two-sample KS test is chosen as a general nonparametric method for comparing two samples. It quantifies a distance between the empirical distribution functions of the two samples. The null hypothesis is that the samples are drawn from the same distribution.
Table 5.1: Two-sample KS test p-values (probability of occurrence of the null hypothesis) for the mean packet length feature sets calculated for different N values, based on a set of 1,000 flow samples

N          10         25         50         500        1000       Full-flow
10         1          <2.2e-16   1.6e-04    <2.2e-16   <2.2e-16   <2.2e-16
25         <2.2e-16   1          <2.2e-16   <2.2e-16   <2.2e-16   <2.2e-16
50         <2.2e-16   <2.2e-16   1          <2.2e-16   <2.2e-16   <2.2e-16
500        <2.2e-16   <2.2e-16   <2.2e-16   1          <2.2e-16   <2.2e-16
1000       <2.2e-16   <2.2e-16   <2.2e-16   <2.2e-16   1          <2.2e-16
Full-flow  <2.2e-16   <2.2e-16   <2.2e-16   <2.2e-16   <2.2e-16   1
5.3.4 Constructing training and testing datasets
I demonstrate the effectiveness of my approach using different datasets for training and testing
the classifiers (as described in section 3.2) [109].
Flows in each dataset are divided into two classes - ET and Other (non-ET) - because super-
vised learning algorithms work best when trained with examples of traffic in the class of interest
and traffic known to be outside the class of interest (‘interfering’ or Other traffic).
A high-level description of the data traces used in the training and testing phases is shown
in Figure 5.8. Details of the training and testing steps of an ML classifier were presented in
Figure 3.5.
ET traffic
The ET datasets consist of two separate month-long traces collected during May and Septem-
ber 2005 at a public ET server in Australia [179]. The server was running ETPro (v3.2.0) [47].
Full-payload traffic was captured to disk with timestamps of microsecond resolution and ac-
curacy of better than +/-100usec. The distribution of domestic and international traffic on this
server was consistent with previously published work [180]. More information on geographical
5.3. MY EXPERIMENTAL APPROACH 101
Figure 5.8: High-level description of datasets used for training and testing (the ET and Other traffic training datasets feed ML training; the ET and Other traffic testing datasets feed ML classification, which produces the classification results)
distribution of game clients in terms of countries and hop counts is presented in Appendix C.
Raw ET traffic traces taken at an ET server typically contain far more short flows (clients
probing the server, usually less than 10 packets from client to server) than actual game-play
flows [180]. Balanced ET datasets for each month were created by taking all non-probe flows
(assumed to have more than 10 packets from client to server) and then sampling an equal number
of probe flows from the raw monthly traces. Table 5.2 summarises the resulting balanced ET
datasets.
Table 5.2: ET traffic full-flow datasets

Month  Non-Probe Flows  Probe Flows  Total Flows  Total Packets  Total Bytes
May    4344             4344         8688         107.9M         14.9G
Sep    3444             3444         6888         187.9M         26.6G
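The balancing step described above might be sketched as follows (an illustrative sketch with hypothetical names, not my actual processing scripts):

```python
import random

def build_balanced_dataset(flows, c2s_packet_count, seed=1):
    """Keep all non-probe flows (> 10 client-to-server packets) and sample
    an equal number of probe flows from the raw monthly trace."""
    non_probe = [f for f in flows if c2s_packet_count(f) > 10]
    probe = [f for f in flows if c2s_packet_count(f) <= 10]
    random.seed(seed)   # reproducible sampling for the illustration
    sampled_probe = random.sample(probe, min(len(probe), len(non_probe)))
    return non_probe + sampled_probe

# Toy trace: a flow is (flow_id, client-to-server packet count).
trace = [(i, 3) for i in range(50)] + [(i, 500) for i in range(50, 60)]
balanced = build_balanced_dataset(trace, lambda f: f[1])
assert len(balanced) == 20   # 10 non-probe flows + 10 sampled probe flows
```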
Other traffic
The interfering (non-ET) traffic is constructed from two 24-hour data traces collected by the
University of Twente, Netherlands, on February 6th and 7th 2004 [181]. I will refer to these
traffic sources as T1 and T2 respectively.
The interfering traffic datasets were built by extracting flows from T1 and T2 belonging to
a range of common applications. As payloads were missing, I inferred application type from the
port numbers (judged an acceptable approach because my primary criterion for interfering traffic
is that it was not ET). For each application's default port(s) I sampled a maximum of 10,000
flows per raw trace file 10. Table 5.3 summarises the overall mix of traffic in my resulting
interfering datasets.
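The port-based labelling could be sketched as follows; the port map below is a small hypothetical subset for illustration, not the full list used in my experiments:

```python
# Hypothetical subset of default server ports used to label interfering flows.
PORT_CLASSES = {
    80: "Web", 443: "Web",
    53: "DNS etc.", 123: "DNS etc.",
    25: "Mail etc.", 110: "Mail etc.", 143: "Mail etc.",
    22: "Mail etc.", 23: "Mail etc.",
    27015: "HalfLife",
}

def label_flow(server_port):
    """Infer application class from the server port; None means the flow
    matches no application of interest and is discarded."""
    return PORT_CLASSES.get(server_port)

assert label_flow(443) == "Web"
assert label_flow(9999) is None   # an unmapped port
```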
Table 5.3: Sampled interfering application flows - full-flow datasets

Applications                                Flows (x1000)  Bytes (MB)  Flows (x1000)  Bytes (MB)
                                            (T1)           (T1)        (T2)           (T2)
HTTP, HTTPS (Web)                           13.8           329.2       13.3           267.2
DNS, NTP (DNS etc.)                         2.4            1.4         2.7            1.4
SMTP, IMAP, POP3, Telnet, SSH (Mail etc.)   0.6            15.8        0.5            10.1
HalfLife                                    8.7            25.4        10.0           38.6
Kazaa, Bittorrent, Gnutella, eDonkey (P2P)  48.0           1,354.6     56.4           1,524.5
For each experiment described below, I trained my classifiers using a mix of ET traffic from
the May dataset and interfering traffic from T2. Subsequent testing of each classifier scenario
was performed using a mix of ET traffic from September and interfering traffic from T1.
Training with full-flow, testing with four different sliding windows
First I look at the effectiveness of classifying data using a sliding window across the test dataset
and an ML classifier trained on full-flow. I use windows of sizes N = 10, 25, 100, and 1000
packets. During ET game-play we see 20 packets per second (PPS) from server to client and
roughly 28 PPS from client to server, so these windows correspond to 0.2, 0.5, 2.1, and 20.8
seconds of actual time. Recall and Precision results are averaged across ET flows and interfering
flows in the test dataset. I show that a classifier trained on full-flow is ineffective in identifying
ET traffic even with N as large as 1,000 packets.
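The window-to-time conversion quoted above follows from the combined bi-directional packet rate; a quick arithmetic check, assuming 20 + 28 = 48 packets per second in total:

```python
PPS_SERVER_TO_CLIENT = 20
PPS_CLIENT_TO_SERVER = 28
total_pps = PPS_SERVER_TO_CLIENT + PPS_CLIENT_TO_SERVER   # 48 packets/s

for n in (10, 25, 100, 1000):
    print(f"N = {n:4d} -> {n / total_pps:.1f} s")
# N = 10 -> 0.2 s, N = 25 -> 0.5 s, N = 100 -> 2.1 s, N = 1000 -> 20.8 s
```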
Then I look at the effectiveness of different modified approaches to training the classifier in
the following experiments. In the scenario considered in section 5.2, my goal is to classify ET
traffic in less than one second, for timely classification. N = 25 (corresponding roughly to a time
window of 0.5 seconds) is therefore chosen as the sliding window size for testing.
10 Please note that P2P applications have a range of port numbers to which a server can default. All server ports in the range are used; hence the large number of flows collected.
Training with full-flow instances of more than 25 packets (called filtered full-flow), testing with a sliding window of N = 25 packets
The motivation for this modification is that the classifier only performs its classification on a
full sliding window, so only flows with more than 25 packets will be classified. Training the
classifier with flow instances shorter than 25 packets may only add noise to the classification
model, and may also incur a longer training time.
Training with individual sub-flow, testing with a sliding window of N = 25 packets
The motivation for this modification comes from my analysis of ET's statistical properties,
which suggests that a classifier trained on full-flows may have difficulty classifying on small
windows of packets, and that it should instead be trained on sub-flows of the same length as the
sliding window.
Training with multiple sub-flows, testing with a sliding window of N = 25 packets
The motivation for this modification is that a flow's statistical characteristics vary over the
different phases of its duration. By training on a combination of multiple sub-flows representing
different phases within the original full-flow, the classifier can recognise new flows if they
have statistical properties similar to any of the sub-flows on which it was trained.
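Assembling the multiple sub-flow training set described above might look like the following sketch (hypothetical names; flows shorter than M + N packets simply contribute no sub-flow at that offset):

```python
def build_multi_subflow_training_set(flows, offsets, n):
    """Pool sub-flows taken at several offsets M within each training flow,
    so the model sees examples from every phase of a flow's lifetime."""
    def extract(packets, m):
        return packets[m:m + n] if len(packets) >= m + n else None

    training = []
    for packets in flows:
        for m in offsets:
            sf = extract(packets, m)
            if sf is not None:
                training.append((m, sf))   # keep the offset for bookkeeping
    return training

flows = [list(range(3000)), list(range(60))]   # one long flow, one short flow
train = build_multi_subflow_training_set(flows, offsets=(0, 20, 40, 2000), n=25)
# The 60-packet flow contributes only SF-0 and SF-20 (SF-40 needs 65 packets),
# mirroring the reduction of training examples for later sub-flow positions.
```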
The detailed training and testing implementation is summarised in Tables 5.4 and 5.5.
Please note that from the original data trace, only flows that have at least N + M packets are
considered in each test.
Figure 5.9 presents the detailed combination of traffic (in flows and percentage) for training
the classification models. Close to 90% of flows in the full-flow model are shorter than 25
packets, hence the large reduction in flow instances to train the filtered full-flow model 11.
The ET class takes 9.5% and 25.0% out of the total training instances for the full-flow and
filtered full-flow models respectively. For the single sub-flow model, ET traffic takes from
10.5% to 14.1% of the total training instances. There are approximately 21.2% of ET instances
11 The filtered full-flow model has a slightly smaller number of training instances compared to the SF-0 model, since its flows are sampled from the training dataset for the full-flow model, which is a subset of the whole training dataset from which training instances for the SF-0 model are sampled.
Figure 5.9: Distribution of different applications' traffic (in flows and percentage) in the training datasets, shown as (a) flow counts and (b) flow percentages for the Full-Flow, Filtered Full-Flow, SF-0, SF-20, SF-40, SF-2000 and Multi-SFs models (application classes: Web, Mail etc., DNS etc., HalfLife, P2P and ET)
Table 5.4: Detailed training and testing implementation for each experiment

Experiment: Training with full-flow, testing with four different sliding windows
  Training: The classifier is trained using features calculated from full-flows in the training datasets, for both ET and Other traffic.
  Testing: The classifier is tested using features calculated for sliding windows of N = 10, 25, 100 and 1000. For ET traffic, M is chosen to cover two periods - early client contact with the game server (0 ≤ M ≤ 90) and active game-play (1000 ≤ M ≤ 9000). For Other traffic, M is chosen to be 19. (This value is arbitrary; what matters is that it is greater than 0, so that the classifier is also tested in the extreme case of missing the beginning of the test flows.)

Experiment: Training with filtered full-flow, testing with a sliding window of N = 25 packets
  Training: For both ET and Other traffic, the classifier is trained using features calculated from instances with more than 25 packets in both directions from the full-flow training datasets.
  Testing: The classifier is tested using features calculated from a sliding window of N = 25 packets. Similar to the above test, M is chosen to cover two periods: 0 ≤ M ≤ 90 and 1000 ≤ M ≤ 9000 for ET traffic, and M = 19 for Other traffic.
in the training datasets for the multiple sub-flows model.
The ET data trace used for training contains flows of unequal lengths, hence the reduction
of flow examples used for training when the sub-flow is chosen towards the end of the flows.
As a result, for the multiple sub-flows model, there are slight differences in the total num-
bers of training instances per sub-flow. There are more training instances for sub-flows at the
start, compared to sub-flows towards the end of a flow. SF-0 has the greatest number of ex-
amples (2,500 instances), compared to other sub-flows (consisting of between 1,000 and 1,500
instances). This may give more weight toward sub-flows selected at the early phase of flows
used for training. As a result, there is a possibility of inconsistent performance of the classifier
with regards to the position of the sliding window during the flow’s duration (e.g. better Re-
call when the sliding window is at the early stage of a flow as more instances are available for
training). In this experimental study, I chose to use the maximum number of sub-flow instances
available in the dataset, to maximise the information for training an accurate classifier. Evaluating the pros
and cons of applying re-sampling techniques (discussed in section 3.1.8) to create a balance
between different sub-flows is left for future research.
Table 5.5: Detailed training and testing implementation for each experiment (continued)

Experiment: Training with individual sub-flow, testing with a sliding window of N = 25 packets
  Training (ET traffic): The classifier is trained using features calculated from 25-packet sub-flows rather than full-flows. Four separate variants of the classifier are trained, using sub-flows that cover packets 0-24, 20-44, 40-64 and 2000-2024 respectively of the original ET flows in the training dataset. These sub-flows are selected to represent the statistical properties of ET traffic over different phases of the original ET flows. These classifier models are denoted SF-0, SF-20, SF-40 and SF-2000 respectively.
  Training (Other traffic): As most of the Other flows are short (95% of interference flows are shorter than 50 packets), only two phases are considered for training: the Beginning phase, which covers packets 1-25, and the Middle phase, which covers packets 10-34 of the original interference flows in the training dataset. For the SF-0 model, ET traffic is trained in combination with features calculated from the Beginning phase of the Other traffic. For the other sub-flow models, ET traffic is trained in combination with features calculated from the Middle phase of the Other traffic.
  Testing: Similar to the above tests, the testing instances are built using features from 25-packet sub-flows covering two periods - 0 ≤ M ≤ 90 and 1000 ≤ M ≤ 9000 for ET traffic. M is chosen to be 19 for Other traffic, so that the sliding window is at a different position from where the sub-flows were chosen for training. This is to avoid biasing the test results.

Experiment: Training with multiple sub-flows, testing with a sliding window of N = 25 packets
  Training (ET traffic): The classifier is trained using features calculated from different sub-flows.
  Training (Other traffic): The classifier is trained using features from 25-packet sub-flows that cover the Beginning and Middle phases of the original full interference flows in the training datasets.
  Testing: The same as the testing for the individual sub-flow models.
Figure 5.10: Distribution of different applications' traffic (in flows and percentage) in the testing datasets for N = 25, shown as (a) flow counts and (b) flow percentages versus M, the number of packets offset from the beginning of each flow
Figure 5.10 presents the detailed combination of traffic (in flows and percentages) for testing
the different classification models with N = 25 packets. Similar to the training datasets, the total
number of ET flows reduces as N and M increase. With different N values (shown in Appendix
C.2), the ratios of the classes' instances differ: ET is the minority class with N = 10 packets,
and the majority class with N = 1,000 packets.
In addition, the proportions of the applications' traffic in the training and testing datasets are
different. For example, with N = 25 packets, the ET class takes approximately 11.9-17.1% of
the total testing instances across all M values 12. This is different from the traffic profile of all
the training models, which suggests that the classifier may not have to be trained on the same
traffic profile as the one it expects to see in deployment.
5.3.5 Data processing
To build and test the classification model, a large range of sample traffic was collected. ET
traffic used for training and testing was extracted from two month-long traces with a total size
of 41.5 GBytes of ET data. Other traffic was extracted from two day-long traces with a total of
3.6 GBytes.
Each data trace needs to be processed for full-flow and sub-flow feature values. A total of
six classification models need to be trained and tested with features calculated for 19 different
positions of the sliding window for ET traffic and two different positions for Other traffic. The
data processing is therefore time-consuming if done sequentially on a single processing unit.
In my experiments, the data processing was undertaken in parallel using the supercomputer
cluster provided by the Centre for Astrophysics and Supercomputing, Swinburne University of
Technology [182] 13.
12 My Precision results can, by definition, be affected by the traffic mix (due to the number of false positives coming from the Other class), and may be lower than if a balanced mix of 50% Other and 50% ET were used instead. However, I wanted more data for testing, and my assumption is that the traffic of a single application can be much smaller than the total traffic aggregate. I accept that my Precision may be negatively affected by the traffic mix chosen.
13 Each node of the supercomputer cluster has a quad-core Clovertown 64-bit low-volt Intel Xeon 5138 processor running Linux CentOS 5 at 2.33 GHz. Virtual memory is set at 1 GByte. Jobs are submitted to the cluster via a batch queue system. I used WEKA implementation version 3.4 with Java version 1.4.2.
5.4. RESULTS AND ANALYSIS 109
5.4 Results and analysis
In this section I present the results with respect to M, the number of packets offset from the
beginning of each ET flow in the test dataset.
5.4.1 Training with full-flows, testing with four different sliding windows
Figure 5.11 shows Recall for the Naive Bayes classifier as each sliding window moves across
the test dataset. For all N values, Recall degrades rapidly as we move further from the start of
each flow. Recall for N = 1000 is good (85%) when the flow is captured from the beginning,
but rapidly drops below 10% if the classifier misses more than the first 30 packets. Recall for
small sliding windows is poor (≤ 66%) even when the beginning of a flow is captured. Missing
more than the first 20 packets further degrades Recall to lower than 20% for all N values.
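The Recall and Precision figures reported throughout this section follow the standard per-class definitions, which can be computed from confusion counts as in the following generic sketch (treating ET as the positive class):

```python
def recall_precision(predictions, truths, positive="ET"):
    """Recall = TP / (TP + FN); Precision = TP / (TP + FP), as percentages."""
    tp = sum(1 for p, t in zip(predictions, truths) if p == positive and t == positive)
    fn = sum(1 for p, t in zip(predictions, truths) if p != positive and t == positive)
    fp = sum(1 for p, t in zip(predictions, truths) if p == positive and t != positive)
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Toy example: 2 true positives, 1 false negative, 1 false positive.
preds  = ["ET", "ET", "Other", "ET", "Other"]
truths = ["ET", "Other", "ET", "ET", "Other"]
assert recall_precision(preds, truths) == (100.0 * 2 / 3, 100.0 * 2 / 3)
```

Note that when no instance is predicted as ET, Precision is undefined; the sketch (and the plots below, where 0% Recall yields 0% Precision) reports it as zero.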
Figure 5.11: ET Recall: Classifier trained with full-flows, tested with four different sliding windows (N = 10, 25, 100, 1000) - Naive Bayes models
Figure 5.12 depicts Precision for the Naive Bayes classifier as each sliding window moves
across the test dataset. Precision is 100% for all window sizes and M values where Recall > 0.
Classifying in the middle of game-play (M > 1000) yields Recall close to 0%, making the
high Precision achieved somewhat meaningless. At the few points where M = 6K, 8K and 9K,
Precision is 0% (caused by 0% Recall).
Figure 5.12: ET Precision: Classifier trained with full-flows, tested with four different sliding windows (N = 10, 25, 100, 1000) - Naive Bayes models
Figure 5.13 summarises Recall as each sliding window moves across the test dataset for
the C4.5 Decision Tree classifiers. Recall is quite good at a flow's beginning, even for small
sliding windows (80%, 92% and 92% for N = 10, 25 and 100 respectively), and reaches 98%
for N = 1000 when the beginning of a flow is captured. However, as with the Naive Bayes
classifiers, there is significant degradation in Recall for all window sizes if we miss the start of
the flows (dropping to approximately 50% and 25% when missing the first 20 packets).
Figure 5.14 presents Precision for the C4.5 Decision Tree classifiers as each sliding window
moves across the test dataset. For window sizes of 10, 25 and 100, Precision drops significantly
for M ≥ 10. With N = 1000, it reduces gradually from 99% to 80%; however, with Recall of
less than 20% for M ≥ 20 packets, the high Precision becomes less meaningful.
To sum up, these results demonstrate the poor performance of the full-flow models when
classifying on small sliding windows and when missing the start of a flow.
5.4.2 Training with filtered full-flows, testing with a sliding window of N = 25 packets
Figure 5.15 shows Recall when the Naive Bayes classifier is trained using features from filtered
full-flow instances.
This model has a better Recall compared to the full-flow model when classifying with N
Figure 5.13: ET Recall: Classifier trained with full-flows, tested with four different sliding windows (N = 10, 25, 100, 1000) - C4.5 Decision Tree models
Figure 5.14: ET Precision: Classifier trained with full-flows, tested with four different sliding windows (N = 10, 25, 100, 1000) - C4.5 Decision Tree models
Figure 5.15: ET Recall and Precision: Classifier trained on filtered full-flows, N = 25 for classification - Naive Bayes models
= 25 packets. However, Recall still drops off quickly to less than 50% if we miss more than
the first 10 packets of a flow. Precision is greater than 50% if the beginning of each flow is
captured, otherwise it becomes rather poor. Precision is lower compared to the full-flow model
when Recall is greater than 0.
The increase in Recall and reduction in Precision when comparing the filtered full-flow to
the full-flow model can be explained as follows.
As shown in section 5.3.4, the training dataset for the full-flow model contains many noisy
instances, due to short flows of ET and Other traffic (≥ 50% of ET flows and 90% of Other flows
are shorter than 25 packets). The ML algorithm needs to rule out this noise in order to identify ET
traffic. This may lead to a classification model that is over-fitted and can only identify a small
number of ET instances. In this case, the low Recall and perfect Precision (100% where Recall
> 0) of the full-flow model are a strong indicator of over-fitting.
On the other hand, this noise has already been removed from the filtered full-flow model.
The ML algorithm can now create a model that covers a larger range of ET instances. This helps
improve Recall yet creates the opportunity for false positives, which leads to lower Precision.
To demonstrate this explanation, consider the simplified example illustrated in Figure 5.16(a)
below. Due to noise in the Other traffic, the classification model created only covers a small
range of ET instances. In Figure 5.16(b), with the noise removed (its points are faded in
the figure), the filtered full-flow model covers a greater range of ET instances.
Depending on the internal construction of a particular ML algorithm, the impact of noise
traffic on Recall and Precision may differ. With this particular use of a Naive Bayes classifier,
the removal of noise improves Recall yet decreases Precision as a trade-off.
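The filtering step discussed above can be sketched as follows. This is a hedged, minimal illustration: the flow-record structure and the helper name are hypothetical, and only the 25-packet threshold comes from this section.

```python
# Sketch: build a "filtered full-flow" training set by discarding short flows,
# which section 5.3.4 treats as noise. The flow records and the helper name
# are hypothetical; only the 25-packet threshold comes from the text.

def filter_training_flows(flows, min_packets=25):
    """Keep only flows long enough to fill an N = 25 packet sliding window."""
    return [f for f in flows if f["num_packets"] >= min_packets]

flows = [
    {"label": "ET", "num_packets": 12},      # short ET flow  -> noise, dropped
    {"label": "ET", "num_packets": 4000},    # long ET flow   -> kept
    {"label": "Other", "num_packets": 8},    # short Other    -> noise, dropped
    {"label": "Other", "num_packets": 300},  # long Other     -> kept
]

filtered = filter_training_flows(flows)
print([f["num_packets"] for f in filtered])  # [4000, 300]
```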
Figure 5.17 summarises Recall when the C4.5 Decision Tree classifier is trained using
features from filtered full-flows.
This model has a better Recall compared to the full-flow model when classifying with N =
25 packets. Precision is also better compared to the full-flow model when Recall is greater than
0, which suggests the benefit of filtering out noise in training the C4.5 Decision Tree classifier.
However, poor Precision and Recall are seen when the classifier misses more than the first 20
packets of a flow. Both Recall and Precision appear better with M ≥ 1000, yet they are still
quite low (less than 70%).14
In summary, both the Naive Bayes and C4.5 Decision Tree classifiers trained on filtered full-
flows perform poorly on a small sliding window and when missing the beginning of a traffic
flow.
I will go on to compare the results of the filtered full-flow model with the multiple sub-flows
and the best single sub-flow models because:
• They are all trained on flow instances with more than 25 packets.
• Between the filtered full-flow and full-flow models, the former produces better results
while the latter is considered to be over-fitted to the training set.
5.4.3 Training with individual sub-flows, testing with a sliding window of N = 25 packets
Figure 5.18 presents Recall when the Naive Bayes classifier is trained using features calculated
from 25-packet sub-flows rather than full-flows. Each model (as defined in Table 5.5) shows
interesting variation in Recall with respect to the position of the sliding window across the test
dataset. Training on a sub-flow at a particular phase of a flow tends to demonstrate higher Recall
at the same phase in the test dataset, and lower Recall otherwise.
14 It is noted that the classifier trained on filtered full-flows performs worst in the transition stage when 20 ≤ M ≤ 90. However, I will not investigate this particular issue further, as I later show that training the classifier on multiple sub-flows overcomes this issue and provides more stable Recall and Precision throughout the phases of a full-flow.
[Figure 5.16: An illustration of creating classification rules for the full-flow and filtered full-flow models. (a) Full-flow model: region of ET traffic covered by classifiers trained on full-flows, amid long and short ET and Other flows. (b) Filtered full-flow model: region of ET traffic covered by classifiers trained on filtered full-flows (where all short flows are removed).]
[Figure 5.17: ET Recall and Precision: Classifier trained on filtered full-flows, N = 25 for classification - C4.5 Decision Tree models]
[Figure 5.18: ET Recall: Classifier trained on 25-packet sub-flows, N = 25 for classification - Naive Bayes models]
Recall starts very high at the beginning of a flow and then drops off quickly if we miss more
than the first 10 packets for the SF-0 model. On the other hand, it stays low until the sliding window
has moved beyond the early period of each flow (M ≥ 90) for the SF-2000 model. Recall for the
SF-20 model is quite good even if we miss 30 or 40 packets, but eventually becomes quite
poor. The SF-40 model exhibits good overall Recall of greater than 80% for all M values.15
These results are expected as each sub-flow model presents a particular phase with distinctive
statistical properties of an ET flow during its lifetime.
[Figure 5.19: ET Precision: Classifier trained on 25-packet sub-flows, N = 25 for classification - Naive Bayes models]
Figure 5.19 shows Precision for all sub-flow models. Precision drops off quickly if we miss
more than the first 10 packets of a flow for the SF-0 model. On the other hand, it stays low until the
sliding window has moved beyond the early period of each flow (M ≥ 60) for the SF-2000 model.
When trained on SF-20 the classifier's Precision is quite good even if we miss 90 packets, but
eventually becomes quite poor. The SF-40 model exhibits good overall Precision, from 97.8% to
98.7%, for all M values. These results follow the same trend as Recall. This is to be expected,
because if the classifier recognises instances of both classes better (better Recall for both ET
and Other traffic), it will have better overall Precision.
15 This suggests that perhaps the game's transition between 'Connecting' and 'In-game' phases occurs during or near SF-40, so this cluster contains instances of both 'Connecting' and 'In-game' statistics. For excellent Recall we cannot use just SF-40, as will be shown later in section 5.4.4.
Compared to the results of the full-flow and filtered full-flow models, training on a sub-flow
picked from within each original training flow (e.g. the SF-40 model) significantly improves
the classification performance, especially when M is greater than 0 (i.e. real-world scenarios
where the classifier cannot be sure it sees the start of every flow).
Figure 5.20 presents Recall when the C4.5 Decision Tree classifier is trained using features
from 25-packet sub-flows rather than full-flows.
[Figure 5.20: ET Recall: Classifier trained on 25-packet sub-flows, N = 25 for classification - C4.5 Decision Tree models]
Similar to the results seen with the Naive Bayes classifier, Recall for the C4.5 Decision Tree
classifier is very high at 99.2% at the start of a flow, then drops off quickly if we miss more than
the first 20 packets for the SF-0 model. On the other hand, Recall remains low until the sliding
window has moved beyond the early period of each flow (M ≥ 90) for the SF-2000 model.
When trained on SF-40 the classifier’s Recall is quite good even if we miss more than 1,000
packets. When trained on SF-20, Recall degrades when the sliding window has moved beyond
the first 30 packets.
Figure 5.21 shows Precision for all sub-flow models (Figure 5.22 is a zoomed-in version for
clearer presentation). Similar to the Naive Bayes classifier, the SF-0, SF-20 and SF-2000 models
have good Precision only when the sliding window is either at the beginning of each flow or has
moved beyond its early period. The SF-40 model maintains good Precision, from 97.6% to 98.6%,
for all M values.
[Figure 5.21: ET Precision: Classifier trained on 25-packet sub-flows, N = 25 for classification - C4.5 Decision Tree models]
Again, compared to the filtered full-flow models, training on a sub-flow picked from within
each original training flow (e.g. the SF-40 model) significantly improves the classification per-
formance for all M values.
5.4.4 Training with multiple sub-flows, testing with a sliding window of N = 25 packets
In this section I demonstrate how a far more effective classifier can be constructed using multiple
sub-flows, which represent different time periods within the original full-flows.
Trying different combinations of the four sub-flows SF-0, SF-20, SF-40 and SF-2000, the
combination of all four produced better overall Precision and Recall than any combination of a
subset of those sub-flows.
Figure 5.23 shows this new classifier's Recall, along with Recall for a classifier trained on
SF-40 (the best-performing single sub-flow model) and a classifier trained on filtered full-flows,
using a sliding window of 25 packets.
The multiple sub-flows curve shows very good Recall early in a flow’s life (M≤ 30) (95.7%-
98.8% compared to 83.2%-93.4% for the best single sub-flow model). For 40≤M ≤ 70 Recall
is comparable to training on the single sub-flow (91.3%-93.4% versus 89.4%-93.7% respectively).
For M ≥ 80, training the classifier on multiple sub-flows outperforms training on the single
sub-flow by 5%-14%. Training the classifier on filtered full-flows results in substantially
degraded Recall compared to training on a single or multiple sub-flows.

[Figure 5.22: ET Precision: Classifier trained on 25-packet sub-flows, N = 25 for classification - C4.5 Decision Tree models - a zoomed-in version of Figure 5.21]

[Figure 5.23: ET Recall: Comparing full-flow and sub-flow training of the Naive Bayes classifier]
[Figure 5.24: ET Precision: Comparing full-flow and sub-flow training of the Naive Bayes classifier]
Figure 5.24 shows Precision of the models. Both the best single sub-flow and multiple sub-
flows models produce greatly improved Precision compared to the filtered full-flows model.
Given the analysis of traffic statistical properties for ET traffic in the section above (and for
Other traffic in Appendix A), this is expected: training the classifier on features calculated from
sub-flows of the same length as the sliding window should identify both ET and Other traffic
better than training on full-flows.
Comparing the Precision of training on a single sub-flow with that of training on multiple
sub-flows (with features calculated the same way) suggests a trade-off between Recall and
Precision. Precision when trained on the multiple sub-flows (from 89.8% to 94% for all M values)
is 4.7% to 8% lower than the Precision for the best single sub-flow model.
The improvement in Recall and reduction in Precision when comparing the multiple sub-
flows model with the best single sub-flow model can be explained as follows.
Each sub-flow’s feature values form a cluster. Multiple sub-flows’ clusters may either be
overlapping or disjoint. Using a single sub-flow to train the classifier can leave out members of
other sub-flows, which are outliers to its cluster. Including those outliers in training the classifier
improves Recall, yet on the other hand creates opportunities for false positives, which leads to
lower Precision.
To demonstrate this concept, consider the example displayed in Figure 5.25.
[Figure 5.25: An illustration of creating a multiple sub-flows classifier from a number of individual sub-flows (SF-0, SF-40, SF-2000 and other sub-flows; data points are artificially created for illustrative purposes only).]
As indicated in Figure 5.25, each single sub-flow forms a cluster: SF-0 forms a cluster of
pink squares, SF-40 forms a cluster of blue circles, SF-2000 forms a cluster of orange triangles,
and other points indicate members of other sub-flows. These clusters are partially overlapping
and partially disjoint. Training on a single sub-flow (e.g. SF-40 for the best single sub-flow
model) leaves out many members of other sub-flows (outliers to the SF-40 cluster). Training
on multiple sub-flows makes sure these members are included in constructing the classification
model. The classifier’s Recall is therefore improved when the sliding window moves across
the test dataset. (This helps explain the results in Figure 5.23, where Recall for the multiple
sub-flows classifier is better during the early and later phases of the ET flows, compared to the
best single sub-flow classifier. This is due to the additional inclusion of SF-0 and SF-2000’s
members in training the multiple sub-flows model.)
On the other hand, the inclusion of these members creates a greater unwanted area, which
is the gap between the contributing clusters (indicated by the grey area in the figure), when
compared to training on a single sub-flow. The greater the unwanted area, the greater the
opportunities for false positives – hence the lower Precision.
This illustration is a simple explanation of the trade-offs between Precision and Recall with
regards to the selection of sub-flows to train the classifier. This also suggests a novel approach
for an automated sub-flows selection, which will be presented in the next chapter.
It is also notable that Precision for the multiple sub-flows model decreases slightly as M
increases, especially for 1K ≤ M ≤ 9K, even though Recall remains almost the same for these M
values and the same instances of Other traffic are used for testing at all M values. This can be
explained as follows.
Precision for ET traffic is calculated as TP / (TP + FP) (defined in section 2.3). FP is a constant
(as the same instances of Other traffic are used for testing at all M values). Precision, therefore,
depends only on the TP of ET traffic for each test dataset.

Since d(Precision)/d(TP) = FP / (TP + FP)^2, which is always positive, Precision increases
monotonically with TP.

When M increases there are fewer flows longer than M + N packets. Consequently, there are
fewer ET flows for testing (as shown in Figure 5.10) and the TP for ET traffic is reduced. This
explains why Precision falls as M increases.
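The argument above can be checked numerically. The sketch below uses invented TP and FP counts purely for illustration; it confirms that with FP held constant, Precision falls as TP falls.

```python
# Sketch: Precision = TP / (TP + FP). With FP constant (the same Other
# instances are tested at every M), Precision moves with TP alone.
# The TP and FP counts below are invented for illustration.

def precision(tp, fp):
    return tp / (tp + fp)

FP = 50                       # constant false positives across all M values
tps = [1000, 800, 600, 400]   # TP shrinks as M grows and fewer ET flows remain
precisions = [precision(tp, FP) for tp in tps]

# Precision strictly decreases as TP decreases, i.e. d(Precision)/d(TP) > 0.
assert all(a > b for a, b in zip(precisions, precisions[1:]))
```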
Similar results have been seen with the C4.5 Decision Tree model. Figure 5.26 summarises
Recall for different classification models. For all values of M, the best single sub-flow model
and the multiple sub-flows model outperform the filtered full-flow model.
Total false positives for the multiple sub-flows models are 0.41% and 1.59% for the C4.5
Decision Tree and Naive Bayes models, respectively. Of these, most false positives come from
P2P traffic (25% and 71.0% respectively) and Half-Life traffic (45.4% and 10.7% respectively).
Figure 5.27 shows Precision for the different C4.5 Decision Tree classifiers. Precision holds
steady at 97%-98% when trained on the multiple sub-flows. It is much better than that of the
filtered full-flow model, and comparable to the best single sub-flow model. Similar to the Naive
Bayes classifier, Precision for the multiple sub-flows model slightly decreases as M moves
further from the beginning of the flows.
To sum up, my results demonstrate that for applications with time-varying traffic character-
istics there are significant benefits to training ML classifiers using features calculated from one
(or more) sub-flows rather than full-flows.
[Figure 5.26: ET Recall: Comparing full-flow and sub-flow training of the classifier - C4.5 Decision Tree models]
[Figure 5.27: ET Precision: Comparing full-flow and sub-flow training of the C4.5 Decision Tree classifier]
5.5 Discussion
There are several notable points about my approach:
• Sub-flows taken near the start of a flow usually yield more training instances than sub-flows
taken towards the end of the training flows, due to the variation in flow length. This gives
more weight towards the start of the flows, and may create intra-class imbalance effects
(discussed in section 3.1.8).
In my experimental approach, I selected sub-flows at different phases during the flow’s
lifetime, and obtained the maximum number of instances available in each sub-flow to
train a classifier. It was demonstrated that this performed well with my datasets. An
evaluation of the classifier’s performance with intra-class balancing (such as [145]) is left
for future research.
• For different applications, the number of sub-flows required may be different. This de-
cision is dependent on the application’s statistical characteristics, especially the typical
flow lengths and variations in traffic characteristics over the flow’s lifetime.
Selecting a much greater number of sub-flows for the application of interest than for the
other applications (or vice versa) may lead to the inter-class imbalance problem (discussed
in section 3.1.8).
For the illustrative scenario studied in this chapter, I consider it more important to
correctly classify the ET class. The sensitivity of Recall and Precision when using more
sub-flows in training for ET traffic is investigated in Chapter 6.
• My analysis in this chapter depended on manual inspection of ET’s particular traffic char-
acteristics. Training a classifier for optimal recognition of another application may re-
quire an entirely different choice of sub-flows. Ideally we would like to avoid having to
manually inspect and identify the optimal set of sub-flows for each application of interest.
In the next chapter I propose utilising unsupervised clustering ML algorithms to
automatically identify key sub-flows within examples of an application’s full-flows. Intuitively
this seems reasonable, since unsupervised learning algorithms identify a ‘natural’
clustering of sub-flows, from which we may identify a set of sub-flows that represent key
statistical characteristics of the full-flow. (The existence of natural clustering of feature
values in ET was hinted at in Figures 5.2 and 5.3.)
• While making use of bi-directional flows for both the training and classifying phases, I define
the forward direction as the client-to-server direction. In practice, the classifier cannot
assume anything about the direction (whether client to server or vice-versa) of the first
packet in the N-packet sliding window at any instant (particularly if the classifier misses
some packets from the start of any given flow). The challenge of building a direction-
neutral classifier is addressed in Chapter 7.
• Depending on the particular application we are trying to classify and the particular ML
algorithm, there will be a trade-off between keeping N low (for timely classification and
reduced memory consumption) and keeping N high (for acceptable Recall and Precision).
A very short window may not be good enough to differentiate between different appli-
cations. A large window may improve the classifier’s Precision and Recall, yet increase
the time required to collect statistics from enough packets before a classification decision is
made. For example, I performed similar comparisons using N = 10. For the Naive Bayes
classifier, the median Recall and Precision were 24.3% and 5% lower than for N = 25,
respectively. For the C4.5 Decision Tree classifier, the median Recall and Precision were
0.5% and 5% lower than for N = 25, respectively. Detailed analysis of this trade-off is a
subject for future research.
• Since Recall and Precision rarely reach 100%, in continuous classification there is a possi-
bility of flapping (oscillation) in classification results when monitoring traffic flows over
their lifetime. This can be overcome by applying a scheme to verify the classification
result before taking further action. For example, the classifier would only send an update
of the flow classification result if it sees two new, consecutive and identical results.16
This technique was applied and demonstrated to work well in [76].
16 The classifier then has hysteresis included, so that the result is sustained for a longer duration, with noise suppression during the steady state (e.g. [183]).
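The two-consecutive-results scheme above can be sketched as follows. This is a hedged illustration of the idea only: the generator below is my own construction, not the implementation used in [76].

```python
# Sketch: suppress classification flapping by reporting a class only once it
# has appeared in two consecutive sliding-window results (simple hysteresis).

def stable_updates(results, confirmations=2):
    """Yield a new class label only after it repeats `confirmations` times."""
    current, candidate, streak = None, None, 0
    for label in results:
        streak = streak + 1 if label == candidate else 1
        candidate = label
        if streak >= confirmations and label != current:
            current = label
            yield current

# A one-off "Other" result between "ET" results is suppressed.
updates = list(stable_updates(["ET", "ET", "Other", "ET", "ET", "Other", "Other"]))
print(updates)  # ['ET', 'Other']
```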
There are also a number of limitations in my current experimental approach. Further im-
provement can be made in the following areas:
• My training dataset used a mixture of traffic collected at different locations. This approach is
practical when examples of traffic collected at a single monitoring point are not sufficient
for learning. It does not affect my results, as it preserves the inter-flow characteristics of
the sample traffic used for training. However, the portability of the trained classifier should
be evaluated.17
• Only a limited number of common interfering applications were used for training the classifier
model. Extending the training dataset for the Other class is a subject for future research.
• There is another FPS game (Half-Life) in my dataset for the Other traffic, which ac-
counted for less than 10% of the traffic mix for training and less than 0.3% of the traffic mix for
testing (as shown in section 5.3.4). With such a small amount of traffic, the negative im-
pact of this FPS game on my Precision results is insignificant (as presented in sections 5.4.3
and 5.4.4). However, the proportion of Half-Life traffic among the false positives in my
results suggests that including other FPS games’ traffic (which has similar character-
istics to ET [176]) in the Other class could degrade the Precision of ET traffic classification.
The classifiers should be trained with more examples of these other FPS games’ traffic and
re-evaluated in that case. The separation of ET from other FPS games’ traffic is a subject
for future work.
• Sub-flows used to train the Other class have not been optimised. One reason for this is
that most of the Other class’s example flows are short. Further investigation into which
sub-flows are best for training the classifier on the Other class may lead to better results.
• The test dataset is constructed with a static selection of sliding window positions. Ex-
haustive testing of the classifier models throughout the flow’s lifetime would be ideal.
17 Results from [76] demonstrate that the classification model constructed in this thesis performs well on live capture of ET traffic in a lab environment.
5.6 Conclusion
In this chapter I have proposed a novel solution: the ML classifier should be trained using
statistical features calculated from multiple short sub-flows extracted from full-flows generated
by the target application. The sub-flows are selected from regions of the application’s full-flows
that have noticeably different statistical characteristics.
I show that this can significantly improve a classifier’s performance when using a small
sliding window, regardless of how many packets are missed from each flow’s beginning. My
proposal is illustrated by constructing, training and testing Naive Bayes and C4.5 Decision
Tree classifiers for the detection of Wolfenstein Enemy Territory online game traffic. With this
particular scenario good results were found when trained on four sub-flows and using a sliding
window of only 25 packets.
Focusing on the identification of ET traffic, I have thousands of full-flow samples for most
interfering applications, but only a few hundred for some (as detailed in section 5.3.4). For a
better classification model (in terms of Precision) we need to train the classifier in the presence
of a larger and more diverse collection of interfering traffic. This endeavour is left for future
work.
In this chapter, representative sub-flows used for training were manually selected. Training
a classifier for a new application may require an entirely different set of sub-flows. This step
should ideally be done automatically, without requiring expert knowledge of the application of
interest. Furthermore, for both training and classifying phases, I ensure that the forward direc-
tion is defined as the client-to-server direction. In the next chapter, I propose novel approaches
that supplement my training method to overcome the issue of automated sub-flows selection
and the problem of directionality.
Chapter 6
Automated Sub-Flow Selection using Unsupervised Clustering Techniques
6.1 Introduction
In Chapter 5 I presented a novel approach to ML-based IPTC with two unique characteristics:
the classifier is trained on multiple short sub-flows (each sub-flow being a fragment of N consec-
utive packets taken from different points within the original application flow’s lifetime); and the
classification decision process is repeated continuously on a sliding window of the most recent
N packets seen by the classifier. This allows my classifier to accurately identify applications
whose traffic statistics change over time.
A crucial step is the a priori identification of appropriate sub-flows to train the classifier.
These sub-flows must cover all possible phases of the full-flow during its lifetime for consistent
and stable classification. For applications with well understood traffic characteristics, this can
be done based on the domain knowledge of an expert. For example, the initial handshake of
an SMTP connection may look quite different to the traffic while transferring the body of each
email; hence sub-flows should be taken at the beginning and middle of the flow. However,
training a classifier for a new application may require an entirely different set of sub-flows.
Ideally the identification of sub-flows would be done automatically, without the need for
expert knowledge about the application of interest. It is also ideal to eliminate the need to
manually handle the complexity of data analysis in studying the application’s traffic. In this
chapter I propose and demonstrate an automated approach that uses clustering ML techniques
to select sub-flows for training.
My approach first identifies sub-flows that are subsets of each full-flow’s packets passing
the classifier, from the beginning to the end of the full-flow. However, training the
classifier using all sub-flows found in this step may incur substantial processing overhead. The
next step is to select only a limited number of representative sub-flows to train the classifier:
those that best capture the distinctive statistical variation of a full-flow during its
lifetime. This step is important to minimise the load on the classifier, both during training and
classification, while still maintaining accurate classification. It is this
step that I propose to automate through the use of clustering ML techniques.
To demonstrate my proposal, I use the same hypothetical scenario as in Chapter 5 where ET
application traffic needs to be identified. I use the Expectation Maximisation (EM) algorithm
[161] for automated selection of sub-flows, and the Naive Bayes and C4.5 Decision Tree su-
pervised learning ML algorithms for subsequent traffic classification. The classifiers built using
the proposed approach are evaluated using accuracy and computational performance metrics.
This chapter is organised as follows. Section 6.2 introduces my proposal. Section 6.3
describes my experimental approach. The results and analysis are presented in section 6.4.
Finally, the chapter is concluded in section 6.7 with some final remarks and suggestions for
future work.
6.2 My proposal
The sub-flows identification and selection to train a classifier can be described in two steps as
follows:
1. Sub-flow identification: Extract two or more sub-flows from every flow that represents
the class of traffic one wishes to identify in the future.
2. Sub-flow selection: Examine the extracted sub-flows to select a number of representa-
tive sub-flows that best capture distinctive statistical characteristics of the application of
interest (e.g. at the start and middle of the flow).
The purpose of Step 1 is to find all possible sub-flows to train the classifier. The only crucial
requirement is that the step must cover all possible phases of the application’s flows during their
lifetime.
Step 2, which involves the selection of representative sub-flows among all sub-flows found
in Step 1, can be a challenging task in practice. Each sub-flow’s instance is represented by the
values of multiple features, which results in multi-dimensional datasets to be examined.
The following sub-sections elaborate on how we can automate these two steps.
6.2.1 Step 1 - Sub-flow identification
I propose the following approach to automate Step 1 of sub-flow identification:
• Choose a window size N and a step size S.

• Starting at the first packet, slide across the training dataset in steps of S packets (for
example, S = N/2), creating sub-flows of N consecutive packets at each step.
The same sub-flow positions1 are used for all full-flows in the data trace. This is proposed
because sub-flow instances selected at the same position with respect to the full-flow
should share similar statistical properties, and each position gives us a collection of instances to
study, representing a specific phase of the full-flow’s lifetime.
The positions of sub-flows are selected based on the chosen values of N and S. With a
suitable value of S, we can cover all full-flows’ phases. This coverage is better than randomly
selecting the position of sub-flows. One potential drawback is that it may lead to a greater
number of sub-flows identified and thus higher computational processing cost (where S and N
are small). This approach can also be applied for flows that exhibit periodic characteristics.
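The windowing loop described above can be sketched as follows. This is a hedged illustration: packet records are placeholders for per-packet features, and N = 24 is used here simply so that S = N/2 is a whole number (the thesis itself uses N = 25).

```python
# Sketch of Step 1 (sub-flow identification): slide a window of N packets in
# steps of S packets across one full-flow, recording each window's offset.

def identify_subflows(flow_packets, n=24, s=12):
    """Return {offset: list of n consecutive packets} for one full-flow."""
    return {start: flow_packets[start:start + n]
            for start in range(0, len(flow_packets) - n + 1, s)}

full_flow = list(range(100))             # a toy 100-packet flow
subflows = identify_subflows(full_flow)  # S = N/2 -> offsets 0, 12, 24, ...
print(sorted(subflows))  # [0, 12, 24, 36, 48, 60, 72]
```

With S = N/2 adjacent windows overlap by half, as the text notes may happen depending on the choice of S.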
This approach can be illustrated as follows. A data trace with multiple full-flow instances
will result in a set of sub-flows with the same offset position from the beginning of each full-flow. SF-M denotes the set of sub-flows at position M, taken from the Q full-flow instances in the data trace: SF-M = {SF-M0, SF-M1, SF-M2, ..., SF-MQ−1}. SF-M is called a sub-flow class (or sub-flow for short). SF-M0, SF-M1, SF-M2, ..., SF-MQ−1 are members of sub-flow SF-M, or
sub-flow SF-M’s instances. Figure 6.1 illustrates how sub-flow instances are identified within
a single full-flow. S = N/2 is chosen in this example. K sub-flows are identified for a full-flow instance Fi (0 ≤ i ≤ Q−1), namely SF-0i, SF-[N/2]i, ..., SF-[(K−1)N/2]i.
1 Referring to Figure 5.1, the position of a sub-flow is indicated by the number of packets offset from the beginning of the original full-flow.
6.2. MY PROPOSAL 131
Figure 6.1: An illustration of the sub-flow identification step
Depending on the choice of S, sub-flows may overlap. N is chosen to reflect the lower bound on classification timeliness (as discussed in Section 5.5). The choice of S influences the number of sub-flow instances identified, which subsequently affects the processing overhead for sub-flow selection in Step 2. Flow length may vary, which results in a variation in the number of sub-flows identified per full-flow and in the number of instances for each sub-flow class. The typical flow length of the application should therefore be taken into account when choosing suitable N and S values. The optimisation of N and S is implementation-specific.
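The sliding procedure above can be sketched as follows. This is a minimal illustration under assumed inputs (a flow represented as a plain list of packets), not the implementation used in this thesis:

```python
def identify_subflows(packets, n, s):
    """Slide a window of n packets across a full-flow in steps of s packets,
    returning (offset, window) pairs; only complete windows of n packets count."""
    subflows = []
    offset = 0
    while offset + n <= len(packets):
        subflows.append((offset, packets[offset:offset + n]))
        offset += s
    return subflows

# A hypothetical 100-packet flow with N = 25 and S = N/2 (12 packets):
flow = list(range(100))
subs = identify_subflows(flow, n=25, s=25 // 2)
offsets = [off for off, _ in subs]  # 0, 12, 24, ..., 72
```

Smaller S gives denser coverage of the flow's phases, at the cost of more sub-flow instances to process in Step 2.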
6.2.2 Step 2 - Sub-flow selection
Selection of sub-flows is automated through the use of clustering ML techniques. This novel
approach is motivated by the analysis of ET traffic’s statistical properties in Section 5.3.3. The
scatter plots of Figures 5.2 and 5.3 hinted that feature values calculated for sub-flow instances
naturally form into a number of clusters. A cluster may contain members of different sub-flow classes that nevertheless share similar statistical characteristics. This suggests that the classifier may be trained using only a subset of each cluster's members instead of the whole cluster's population. Representative members for a cluster can, for example, be members of the sub-flow class that dominates the cluster's population.
The key now is to search for the clusters among the sub-flows’ instances. To search in multi-
dimensional datasets, ML clustering techniques appear to be the best tools. An unsupervised
clustering algorithm identifies ‘natural’ clusters among the initial set of sub-flows from Step 1,
from which we can then select a set of sub-flows that represent key statistical characteristics of
the application’s traffic.
Once the clusters are found, a subset of their members must be chosen to represent the
clusters in training the classifier 2. How to choose representative members for a cluster is a
choice to be made during implementation. For example, one can choose to use members of one
sub-flow class that dominates the cluster, or members of more than one sub-flow per cluster to
train the classifier. The choice again involves trade-offs between required classification model
build time, and classification speed and accuracy of the built classification model.
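One way to realise the 'dominant sub-flow class per cluster' option is a simple majority count. The sketch below is a hedged illustration only; the cluster contents are invented:

```python
from collections import Counter

def dominant_subflow_per_cluster(cluster_members):
    """For each cluster, return the sub-flow class whose instances dominate
    the cluster's population. cluster_members maps a cluster id to the list
    of sub-flow class labels of its member instances."""
    return {cid: Counter(labels).most_common(1)[0][0]
            for cid, labels in cluster_members.items()}

# Toy clusters: SF-0 dominates cluster 0, SF-50 cluster 1, SF-80 cluster 2.
clusters = {
    0: ["SF-0", "SF-0", "SF-10", "SF-20", "SF-30"],
    1: ["SF-50", "SF-50", "SF-40", "SF-60", "SF-70"],
    2: ["SF-80", "SF-80", "SF-40", "SF-90"],
}
representatives = dominant_subflow_per_cluster(clusters)
```

Training then proceeds on the instances of the selected classes only, rather than every member of every cluster.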
Figure 6.2 illustrates this concept 3.
Figure 6.2: An illustration of selecting representative sub-flows to train a classifier
In this example, Step 1 identified 10 sub-flow classes, namely SF-0, SF-10, ..., SF-90. Mem-
bers of these sub-flows form into three clusters: Cluster 0 contains members of SF-0, SF-10,
SF-20 and SF-30; Cluster 1 contains members of SF-30, SF-40, SF-50, SF-60 and SF-70; Cluster 2 contains members of SF-40, SF-80 and SF-90. These three clusters demonstrate three phases in which the application's traffic flows have noticeably distinctive statistical properties over their lifetime. Each cluster is then examined to determine which sub-flow's members contribute the majority value of the class attribute (that is, which sub-flow dominates) within the cluster. It might happen that, for this particular example, members of SF-0 dominate Cluster 0's population, members of SF-50 dominate Cluster 1's population, and SF-80's members dominate Cluster 2's population. In this case, SF-0, SF-50 and SF-80 are selected as representative sub-flows for Clusters 0, 1 and 2 respectively. In this simplified example, all members of SF-0 belong to a single cluster, as do the members of SF-50 and SF-80. Members of SF-0, SF-50 and SF-80 are then used to train the classifier.

2 Sub-flows entered into a clustering algorithm are labelled for post-analysis of sub-flow selection. The clustering process itself is unsupervised.
3 Data points given in this example are created for illustrative purposes only. They are not the actual data drawn from my ET datasets.
Using only the most dominant sub-flow class of each cluster reduces the computational load
when training the classifier. It also ensures that the sub-flows chosen cover all critical phases of
the application’s flows during their lifetime.
In practice, members of a sub-flow may belong to more than one cluster, and one sub-
flow can dominate more than one cluster. This will lead to different implementation options in
choosing the representative sub-flow members to train the classifier.
6.3 An experimental illustration of my proposal
To illustrate my proposal I use the same scenario as described in Chapter 5: a real-time Naive Bayes / C4.5 Decision Tree classifier must accurately identify Wolfenstein Enemy Territory traffic mixed in amongst unrelated, interfering traffic. The flow definition and feature set are the same as those defined in Section 5.3.1.
The EM [161] clustering algorithm, in its WEKA implementation [175], is chosen for automated sub-flow selection. The EM algorithm is described in Section 3.1.4. It has been used to cluster IP traffic flows in previous studies, such as [59] and [60].
6.3.1 Step 1 - Sub-flow identification
As in Chapter 5, I use N = 25 packets for the sliding classification window. The identification
of sub-flows in Step 1 is carried out as follows. I divided the full-flow into two phases, the
‘earlier phase’ and the ‘later phase’, and selected a number of sub-flows for each phase. Let M
be the number of packets offset from the beginning of each flow in the dataset. Sub-flows for the
‘earlier phase’ started at M = 0, increasing by steps (S) of 10 packets, until M = 90. Sub-flows
for the 'later phase' started at M = 2000 and increased by steps (S) of 1000 packets until M = 9000 4. This step resulted in 18 different sub-flows starting at different points within the
full-flow lifetime. Instances of these sub-flows are labelled (for post-clustering analysis only)
then submitted to the EM Clustering ML algorithm for the Step 2 process.
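The 18 starting positions described above can be generated directly (a sketch that simply restates the values in the text):

```python
# 'Earlier phase': M = 0, 10, ..., 90 (S = 10 packets).
earlier = list(range(0, 91, 10))
# 'Later phase': M = 2000, 3000, ..., 9000 (S = 1000 packets).
later = list(range(2000, 9001, 1000))
offsets = earlier + later  # 18 sub-flow starting positions in total
```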
Figure 6.3: Step 1 - Experimental approach
The ET trace described in Section 5.3.4 is used for the analysis of ET's statistical properties and sub-flow selection. With this dataset, the 18 sub-flows result in a total of 23,875 instances, which will be used as input to the clustering process in Step 2. Figure 6.4 presents in detail the
number of instances per sub-flow identified in this step.
6.3.2 Step 2 - Sub-flow selection
In the WEKA implementation of EM, one can either specify the desired number of clusters or leave it to the tool to determine the optimal number of clusters (see Section 3.1.4). The optimal number of clusters is the one that produces the highest estimated log-likelihood, a measure of the goodness of the clustering, which denotes the likelihood that the data originates from the clustering model, given the values of the estimated parameters.
4 With my current implementation of feature calculations and choice of clustering algorithm, using a small S value consistently in identification of sub-flows would result in an enormous processing overhead. I chose to use a simplified approach with different S values for the earlier and later phases of a flow's lifetime.
Figure 6.4: Number of instances for each sub-flow identified in Step 1
This optimal number of clusters can produce an optimal cluster model, which can subse-
quently lead to an accurate classifier. However, it may not necessarily produce an optimal
classifier, which involves the trade-offs between classification accuracy, classification speed
and computational complexity. The choice of the number of clusters, therefore, may need to be
evaluated against the performance of the corresponding classifier built on the clustering results.
Consequently, there are two options available for choosing the optimal number of clusters. I
refer to the first option as pre-classification. This optimisation process begins with one cluster,
and continues to add clusters until the estimated log-likelihood can no longer be increased. I
refer to the second option as post-classification. This technique begins with one cluster, and
continues to add clusters until there can be no more increase in the estimated performance of
the classifier trained and tested using each different number of clusters.
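The pre-classification option reduces to a simple stopping rule. The sketch below is a hedged illustration, not this thesis's implementation: `fit_and_score` is a hypothetical callable standing in for one run of the EM algorithm (WEKA's EM in my experiments) that returns the estimated log-likelihood for a given number of clusters.

```python
def choose_num_clusters(data, fit_and_score, k_max=20, min_gain=1e-3):
    """Pre-classification option: start with one cluster and keep adding
    clusters while the estimated log-likelihood still improves."""
    best_k = 1
    best_ll = fit_and_score(data, 1)
    for k in range(2, k_max + 1):
        ll = fit_and_score(data, k)
        if ll - best_ll <= min_gain:
            break  # log-likelihood can no longer be increased
        best_k, best_ll = k, ll
    return best_k, best_ll

# Hypothetical scores: the log-likelihood stops improving after 4 clusters.
scores = {1: -10.0, 2: -6.0, 3: -3.0, 4: -2.0, 5: -2.0}
k, ll = choose_num_clusters(None, lambda _data, kk: scores[kk], k_max=10)
```

The post-classification option replaces `fit_and_score` with a full train-and-test cycle of the classifier, which is why it is far more complex and computationally expensive.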
Although the post-classification option may produce the optimal classifier, the optimisation
process is dependent on the ML classification algorithm used in the classifier. This process
can also be very complex and computationally expensive. A range of factors (such as Recall,
Precision, classification speed, classification model build time, and physical resource require-
ments) need to be taken into consideration. It is still more challenging to automate the whole
optimisation process.
In this chapter, with the aim of making the clustering process simple, fully automated and
independent of the ML classification algorithm used by the classifier, I choose to use the optimal
number of clusters found by the pre-classification option. My research as outlined in Appendix
E, which evaluates the performance of the classifiers using both options, indicates that the pre-
classification option can produce a classifier that possesses high accuracy with only small trade-
offs in required classification model build time and memory usage.
To select representative sub-flows for each cluster, I use WEKA's 'classes to clusters' evaluation mode. First, this mode ignores the sub-flow label class attribute and generates the clusters. Then, during the test phase, it assigns a sub-flow label to each cluster 5.
Figure 6.5: Sub-flow to cluster mapping and evaluation.
5 In this mode, all possibilities of the class-to-cluster assignment are tried. The total number of incorrectly clustered instances, compared to the labels of the instances in the training set, called the classification error, is computed for each assignment. The class-to-cluster assignment with the smallest classification error is chosen [124]. The actual classification error, however, is not important, as it is expected that a sub-flow's instances can be classified into clusters with different sub-flow class labels.
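The footnote describes trying every class-to-cluster assignment and keeping the one with the fewest incorrectly clustered instances. A brute-force sketch of that idea follows (workable only for tiny examples; the data here is invented):

```python
from itertools import permutations

def classes_to_clusters(instances, classes):
    """Try every assignment of a distinct class label to each cluster and
    keep the one with the fewest incorrectly clustered instances.
    instances is a list of (cluster_id, true_label) pairs."""
    clusters = sorted({c for c, _ in instances})
    best_map, best_err = None, None
    for labels in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, labels))
        err = sum(1 for c, lab in instances if mapping[c] != lab)
        if best_err is None or err < best_err:
            best_map, best_err = mapping, err
    return best_map, best_err

# Toy data: cluster 0 is mostly SF-0, cluster 1 mostly SF-50.
data = [(0, "SF-0"), (0, "SF-0"), (0, "SF-10"),
        (1, "SF-50"), (1, "SF-50"), (1, "SF-0")]
mapping, error = classes_to_clusters(data, ["SF-0", "SF-10", "SF-50"])
```

As the footnote notes, the absolute error is not the point; the chosen mapping is what identifies a representative sub-flow class per cluster.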
Figure 6.5 summarises the results of the clustering process. The EM algorithm found eight clusters among the 18 sub-flows identified in Step 1, and assigned eight sub-flow classes to map to these eight clusters. I use these eight sub-flows as representative sub-flows to train my Naive Bayes and C4.5 Decision Tree classifiers. The sub-flows chosen are SF-0, SF-10, SF-20, SF-30, SF-40, SF-50, SF-60, and SF-3000, with a total of 12,804 instances (approximately half of the total instances for all 18 sub-flows found in Step 1) 6.
6.3.3 Evaluation of classifiers trained with sub-flows selected by EM
With the selected sub-flow classes in Step 2, Naive Bayes and C4.5 Decision Tree classifiers
are built and tested using the same method as described in section 5.3.4 for multiple sub-flows
classifiers. I call this an Automatically selected multiple sub-flows (MultiSFs-AutoSel) classifier.
The performance of MultiSFs-AutoSel is evaluated, and compared with other classifiers trained
by different approaches, including:
• Full-flow classifier (Full-flow): as defined in Chapter 5.
• Filtered full-flow classifier (Filtered full-flow): as defined in Chapter 5.
• Manually selected multiple sub-flows classifier (MultiSFs-ManualSel): The four sub-
flows selected to build the multiple sub-flows classification model as outlined in Chapter
5.
• All found multiple sub-flows classifier (MultiSFs-AllFound): All sub-flows identified in
Step 1 are used to train the classifier.
Different training approaches lead to differences in the number of training instances for each classifier. This is summarised in Table 6.1.
These classifiers are compared for Accuracy (based on Precision and Recall as defined in
Section 2.3) and Computational performance. Computational performance is evaluated using
the three sub-metrics:
• Model build time: The CPU time required to train a classifier.
6 This is justified as acceptable for my study because I will have a collection of subsets of all clusters to train my classifiers. This meets the requirements of Steps 1 and 2 of my proposed approach.
Table 6.1: The differences in training instances for each classifier

Classifier          | Training Instances
Full-flow           | One full-flow in the data trace results in one instance to train the classifier.
Filtered full-flow  | One full-flow (greater than 25 packets long) in the data trace results in one instance to train the classifier.
MultiSFs-ManualSel  | One full-flow in the data trace results in up to four 7 instances to train the classifier.
MultiSFs-AutoSel    | One full-flow in the data trace results in up to eight instances to train the classifier.
MultiSFs-AllFound   | One full-flow in the data trace results in up to eighteen instances to train the classifier.
• Classification speed: The number of classifications that can be performed in each CPU
second.
• Memory usage: Memory usage for building the classification model and classifying using
the built model.
In addition, I study the clustering time, which refers to the CPU time required for the clus-
tering process.
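As a reminder of how the two accuracy metrics reduce to simple counts (a minimal sketch; the labels and data below are made up, and the definitions themselves are those of Section 2.3):

```python
def precision_recall(predicted, actual, target="ET"):
    """Precision = TP / (TP + FP) and Recall = TP / (TP + FN), with the
    application of interest (here, hypothetically, 'ET') as the positive class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == target and a == target)
    fp = sum(1 for p, a in zip(predicted, actual) if p == target and a != target)
    fn = sum(1 for p, a in zip(predicted, actual) if p != target and a == target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Four test windows: two true positives, one false positive, one false negative.
pred = ["ET", "ET", "ET", "Other"]
act = ["ET", "ET", "Other", "ET"]
p, r = precision_recall(pred, act)
```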
This experiment was run on the Swinburne supercomputer cluster (described in Section
5.3.5). The physical resource consumption (CPU time and memory usage) was tracked using
Qstat [184].
6.4 Results and analysis
Figure 6.6 shows the normalised number of instances used to train each classifier. A value
of 1 represents the highest number of instances (91,641 instances for the Full-flow classifier).
The Filtered full-flow classifier has the smallest number of instances, as one full-flow resulted in one training instance and all flows shorter than 25 packets have been filtered out, as described in Chapter 5. For the three classifiers trained on multiple sub-flows, the more sub-flows selected, the greater the number of instances used to train the classifier. The MultiSFs-AllFound model has the highest number of training instances among the three, followed in order by the MultiSFs-AutoSel and the MultiSFs-ManualSel classifiers. The number of instances used to
train a classifier directly affects the time taken to build the classification model, as revealed in
the results section below.
Figure 6.6: Normalised number of instances in training each classifier
6.4.1 Accuracy
Figure 6.7 depicts the Recall for each of the Naive Bayes classifiers for 19 positions of the
sliding window with the test dataset (detailed in Section 5.3.4). The results are presented using
boxplots 8.
Consistent with the results seen in Chapter 5, Full-flow and Filtered full-flow classifiers
result in very low Recall when classifying traffic using the sliding window. All classifiers trained
with multiple sub-flows produce greater than 98% Recall.
Figure 6.8 is a zoomed-in version of Figure 6.7, to enable a more precise comparison of
the MultiSFs-ManualSel, MultiSFs-AutoSel and MultiSFs-AllFound classifiers. Among these
three classifiers, the MultiSFs-AutoSel classifier has the highest median Recall of 99%, fol-
8 The black line in the box indicates the median; the bottom and top of the box indicate the 25th and 75th percentiles, respectively. The vertical lines drawn from the box are whiskers. The upper cap is the largest observation that is ≤ the 75th percentile + 1.5*IQR (interquartile range, essentially the length of the box). The lower cap is the smallest observation that is ≥ the 25th percentile - 1.5*IQR. Any observations beyond the caps are drawn as individual points, and indicate outliers.
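The whisker rule in the footnote can be computed directly. The sketch below uses one common quartile convention (medians of the lower and upper halves of the sorted data); plotting packages may use slightly different conventions, so it is illustrative only:

```python
import statistics

def boxplot_stats(data):
    """Median, quartiles (medians of the lower and upper halves), whisker
    caps per the 1.5*IQR rule, and outliers beyond the caps."""
    xs = sorted(data)
    half = len(xs) // 2
    q1 = statistics.median(xs[:half])                  # 25th percentile
    q3 = statistics.median(xs[half + len(xs) % 2:])    # 75th percentile
    iqr = q3 - q1
    upper_cap = max(x for x in xs if x <= q3 + 1.5 * iqr)
    lower_cap = min(x for x in xs if x >= q1 - 1.5 * iqr)
    outliers = [x for x in xs if x < lower_cap or x > upper_cap]
    return statistics.median(xs), q1, q3, lower_cap, upper_cap, outliers

# 100 lies beyond the upper cap, so it would be drawn as an outlier point.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```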
Figure 6.7: Recall for Naive Bayes classifiers trained on various selections of full-flows and sub-flows
Figure 6.8: Recall for Naive Bayes classifiers using multiple sub-flows, expanded from Figure 6.7
lowed by the MultiSFs-AllFound classifier with a median Recall of 98.5% and the MultiSFs-
ManualSel with a median Recall of 98.3%. The differences in Recall among these three clas-
sifiers are small, at less than 1%. However, the slightly higher Recall of the MultiSFs-AutoSel
classifier compared to the MultiSFs-ManualSel classifier suggests that even without the expert
knowledge, clustering ML techniques can effectively assist the selection of sub-flows that cover
all distinct phases of the application’s flows during their lifetime. The MultiSFs-AutoSel clas-
sifier, in addition, has the smallest gap between the 25th and 75th percentile, which suggests a
better consistency in the classification’s Recall for all positions of the sliding window consid-
ered.
Figure 6.9 presents the Precision for each of the Naive Bayes classifiers. While the Full-flow
classifier shows the maximum median Precision of 100%, it is an indication of an over-fitting
problem (discussed in section 5.4). The high Precision of this classifier does not have much
meaning due to the low Recall achieved. Consistent with the results presented in Chapter 5,
the Filtered full-flow classifier has a low median Precision of 12.7%. All classifiers using the
multiple sub-flows training approach achieve higher than 91% Precision.
Figure 6.9: Precision for Naive Bayes classifiers trained on various selections of full-flows and sub-flows
Figure 6.10 is an expanded version of Figure 6.9 and focuses only on comparison of the
three classifiers trained on multiple sub-flows. Among the three classifiers, the MultiSFs-
AllFound classifier achieves the highest median Precision (94.9%), followed by the MultiSFs-
AutoSel (93.3%) and the MultiSFs-ManualSel (91.9%) classifiers. The MultiSFs-AutoSel clas-
sifier achieves slightly higher results in both Precision and Recall than the MultiSFs-ManualSel
classifier, which suggests that the automated sub-flows selection approach positively assists
in building a more accurate classifier. Using all found sub-flows results in lower Recall and
higher Precision of the MultiSFs-AllFound classifier compared to the MultiSFs-AutoSel classi-
fier, suggesting the possibility of over-fitting (similar to the case illustrated in Figure 5.16(a)).
The trade-off between Precision and Recall with regard to the number of clusters (hence the
number of sub-flows to train a classifier) is presented in Appendix E.
Figure 6.10: Precision for Naive Bayes classifiers using multiple sub-flows, expanded from Figure 6.9
Figure 6.11 shows the Recall for each C4.5 Decision Tree classifier. Consistent with the
results seen in Chapter 5, Full-flow and Filtered full-flow classifiers result in very low Recall
when classifying traffic using the sliding window. All classifiers trained on multiple sub-flows
produce greater than 98% median Recall.
The differences among the three classifiers trained on multiple sub-flows are small, at less
than 0.5%. The MultiSFs-AllFound classifier has the highest median Recall of 98.9%, followed
Figure 6.11: Recall for C4.5 Decision Tree classifiers trained on various selections of full-flows and sub-flows
by the MultiSFs-AutoSel (98.7%) and the MultiSFs-ManualSel (98.4%) classifiers.
Figure 6.12 summarises the Precision for each of the C4.5 Decision Tree classifiers. Con-
sistent with the results shown in Chapter 5, the Full-flow and Filtered full-flow classifiers have
low median Precision at less than 62%. All classifiers trained with multiple sub-flows achieve
a median Precision of higher than 97%.
Among the three classifiers trained on multiple sub-flows, the MultiSFs-ManualSel and
MultiSFs-AutoSel classifiers achieve an almost identical median Precision of 97.8%. The
MultiSFs-AllFound classifier achieves a slightly lower Precision, with a median of 97.5%.
However, the differences in Precision achieved by all three classifiers trained on multiple sub-
flows are less than 0.5%. The three classifiers achieved similar levels of consistency in Recall
and Precision with the 19 positions of the sliding window tested.
To sum up, my results indicate that manual selection of sub-flows for training is not necessary in the general case. My datasets even demonstrate that slightly better Precision and
Recall can be achieved using the clustering technique for automated sub-flow selection. Using
all sub-flows identified results in similar Precision and Recall to using only sub-flows automat-
ically selected by the EM algorithm. Also the C4.5 Decision Tree classifiers achieved higher
Figure 6.12: Precision for C4.5 Decision Tree classifiers trained on various selections of full-flows and sub-flows
Precision and Recall than the Naive Bayes classifiers across all tests.
6.4.2 Computational performance
This section compares the classifiers in terms of computational performance. This evaluation
is important considering real-time classification of potentially thousands of simultaneous traffic
flows. Each experiment was repeated three times with the median of all three runs taken to
represent each experiment.
Figure 6.13 compares the normalised build time for each of the Naive Bayes classifiers. A
value of 1 represents the slowest build time (214.47 seconds in the supercomputer environment
described earlier).
As expected, the larger the number of training instances, the longer the time required to
construct a classification model. As shown in Figure 6.13, the Full-flow classifier has the longest
required model build time as it contains the largest number of training instances. The Filtered
full-flow classifier has the shortest required model build time. The MultiSFs-AutoSel classifier
has shorter required model build time than the MultiSFs-AllFound classifier (∼15% less), and
slightly longer model build time than the MultiSFs-ManualSel classifier (∼9% more).
Figure 6.14 depicts the normalised classification speed for the classifiers. A value of 1
Figure 6.13: Normalised build time for Naive Bayes classifiers
represents the fastest classification speed (4,051 classifications per second). The Full-flow clas-
sifier has the fastest classification speed, followed by the Filtered full-flow classifier which is
∼5% slower. Among the three classifiers trained on multiple sub-flows, the MultiSFs-AutoSel
classifier achieves the highest classification speed, ∼2% and 4% higher than the speed of the
MultiSFs-AllFound and MultiSFs-ManualSel classifiers, respectively.
Figure 6.15 outlines the normalised memory usage for the Naive Bayes classifiers while
performing 10-times cross-validation [109] of their training dataset. A value of 1 represents
the most memory consumption (552MB). Although all classifiers consume quite low mem-
ory resources, the Full-flow classifier consumes the most resources, followed by the MultiSFs-
AllFound classifier. The MultiSFs-AutoSel classifier is in the middle range of memory usage
compared to the other classifiers (∼15% less than the MultiSFs-AllFound and comparable to
the MultiSFs-ManualSel classifiers). It seems that memory usage is proportional to the required
model build time. The longer the time taken to build the model, the greater the memory usage.
Figure 6.16 compares the normalised build time for each of the C4.5 Decision Tree clas-
sifiers. A value of 1 represents the slowest build time (450.87 seconds on our test platform).
Similar to the results of the Naive Bayes classifiers, the Full-flow classifier has the longest model
build time. The Filtered full-flow classifier has the shortest model build time. The MultiSFs-
Figure 6.14: Normalised classification speed for Naive Bayes classifiers
Figure 6.15: Normalised memory usage for Naive Bayes classifiers while performing 10-times cross-validation (during both training and testing)
AutoSel classifier's required build time is 29% less than that of the MultiSFs-AllFound classifier, and ∼15% longer than the required build time for the MultiSFs-ManualSel classifier.
Figure 6.16: Normalised build time for C4.5 Decision Tree classifiers
Figure 6.17 presents the normalised classification speed for the C4.5 Decision Tree classi-
fiers. A value of 1 represents the fastest classification speed (15,402 classifications per second).
The Full-flow classifier has the fastest classification speed, followed by the Filtered full-flow
classifier which is ∼7% slower. Among the three classifiers trained on multiple sub-flows, the
MultiSFs-AutoSel classifier is ∼3% slower than the MultiSFs-ManualSel classifier, and ∼10% faster than the MultiSFs-AllFound classifier.
Figure 6.18 shows the normalised memory usage for the classifiers while performing 10-times cross-validation [109] of their training dataset. A value of 1 represents the most memory
consumption (128MB). Although all classifiers consume relatively low memory resources, the
Full-flow classifier consumes the most resources, followed by the MultiSFs-AllFound classifier.
The MultiSFs-AutoSel classifier consumes ∼15% less than the MultiSFs-AllFound classifier,
and ∼10% more than the MultiSFs-ManualSel classifier.
6.4.3 Summary of results
These results suggest that in general, training on multiple sub-flows is significantly more effec-
tive than the traditional full-flow training approach in terms of Precision and Recall, required
Figure 6.17: Normalised classification speed for C4.5 Decision Tree classifiers
Figure 6.18: Normalised memory usage for C4.5 Decision Tree classifiers while performing 10-times cross-validation
model build time and physical resources usage, with a slight trade-off in terms of classification
speed.
For the comparison among the three classifiers trained on multiple sub-flows, Figure 6.19
summarises their median Precision and Recall results.
Figure 6.19: Summary of Precision / Recall results for Naive Bayes (NB) and C4.5 Decision Tree (DT) classifiers trained on multiple sub-flows
The results of training the Naive Bayes classifiers on the sub-flows selected automatically
by the EM algorithm include:
• Highest Recall (0.5% and 0.7% higher than the MultiSFs-AllFound and MultiSFs-ManualSel
classifiers). The Precision is 1.4% higher than when the classifier is trained on manually
selected sub-flows and 1.6% lower than the Precision of the classifier trained on all sub-
flows identified in Step 1.
• A reduction in the required classification model build time by ∼15%, an improvement in the classification speed by ∼2%, and consumption of ∼15% less memory, compared to training on all sub-flows identified in Step 1.
• A faster classification speed (by ∼4%), a longer required model build time (by ∼9%), and memory usage similar to the classifier trained on manually selected sub-flows.
The results of training the C4.5 Decision Tree classifiers on the sub-flows selected automatically
by the EM algorithm include:
Figure 6.20: Summary of computational performance results for Naive Bayes (NB) and C4.5 Decision Tree (DT) classifiers trained on multiple sub-flows: (a) normalised model build time; (b) normalised memory usage while performing 10-times cross-validation; (c) normalised classification speed
6.5. SAMPLING FOR FASTER CLUSTERING 151
• A slightly higher Recall (0.3%) than the MultiSFs-ManualSel classifier, and a slightly
lower Recall (0.2%) than the MultiSFs-AllFound classifier. The median Precision is sim-
ilar to when the classifier is trained on manually selected sub-flows and 0.3% higher than
the Precision of the classifier trained on all sub-flows identified in Step 1.
• A reduction in the required classification model build time (by ∼29%), an improvement in the classification speed (by ∼10%), and consumption of ∼15% less memory, compared to training on all sub-flows identified in Step 1.
• A slower classification speed (by ∼3%), a longer required model build time (by ∼15%) and consumption of ∼10% more memory, compared to when the classifier is trained on manually selected sub-flows.
The results suggest that we can automatically train an effective classifier without requiring expert knowledge of the application of interest. Clustering ML techniques offer a distinct advantage in building a fast classifier for real-time classification with high Precision and Recall. Using all sub-flows found in Step 1 can also create accurate classifiers with high Precision and Recall; however, this requires a longer model training time, and the resulting classifiers are slower in classification speed, which could become an issue when scaling to multiple concurrent application classification. From my results, while the C4.5 Decision Tree classifiers take a longer time to build, they are much faster (nearly three times) than the Naive Bayes classifiers and have higher Precision and Recall overall. This is consistent with the previous findings of [171].
6.5 Sampling for faster clustering
6.5.1 The problem
One limitation of my current experimental approach is the slow clustering time using the EM algorithm. With the sub-flows identified in Step 1, the clustering process took up to 172 CPU hours to complete in the supercomputer environment described earlier. Although this step can be carried out offline, it should be improved so that it does not outweigh the gain in the required classification model build time discussed above.
This can be improved in a number of ways:
• Using a smaller number of iterations when running the EM algorithm.
• Using another ML clustering algorithm.
• Using a more powerful processing unit.
• Down-sampling the number of instances for the clustering process.
Each of these solutions involves trade-offs between the reduction in processing overhead (especially CPU time), the cost of the processing unit, and the quality of the clusters produced.
6.5.2 Down-sampling for the clustering proposal
In this section I investigate the method of down-sampling the dataset for clustering. My pro-
posed solution is to sample only a small number of instances from each sub-flow class identified
from Step 1 to use as input to the clustering process in Step 2. With the aim of understanding the
statistical properties of an application’s traffic, small samples of flow instances may be sufficient
to give us valuable hints for representative sub-flows.
For each sub-flow identified in Step 1, I randomly sampled 25, 50 and 100 instances and used them as input for Step 2's clustering process (compared to more than 1,000 instances per sub-flow for the full dataset, as presented in Figure 6.4). I measured the time taken for the clustering
process to complete for each case, and compared them with the time taken when all sub-flows’
members were used. The clusters produced in each case were evaluated using the same method
as described earlier in this chapter.
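A minimal Python sketch of this sampling step follows. The data layout (a mapping from sub-flow identifiers to lists of feature vectors) and the function name are my own illustrative assumptions, not the implementation used in my experiments; the sampled subset is what would then be handed to the expensive EM clustering step.

```python
import random

def downsample_per_subflow(subflow_instances, k, seed=0):
    """Keep at most k randomly chosen instances from each sub-flow class.

    subflow_instances maps a sub-flow identifier to its list of feature
    vectors. Only the sampled subset is passed on to the (expensive)
    EM clustering step, in place of the full >1,000 instances per sub-flow.
    """
    rng = random.Random(seed)  # fixed seed for repeatable experiments
    return {sf: rng.sample(insts, min(k, len(insts)))
            for sf, insts in subflow_instances.items()}

# Example: one large and one small sub-flow class, sampled at k = 50.
data = {"sf0": list(range(1000)), "sf1": list(range(30))}
sampled = downsample_per_subflow(data, 50)
```

Note that classes smaller than k are kept whole, so the method never discards information from already-small sub-flows.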
6.5.3 Results and analysis
Figure 6.21 depicts the number of instances used for clustering, and the normalised clustering
time for different sample sizes. A value of 1 represents the longest time (172 hours). The
clustering time is proportional to the number of instances used for the clustering process. As
shown in this figure, down-sampling the number of instances for each sub-flow significantly
reduces the CPU time required for clustering. Using 100 samples per sub-flow only took 0.64%
of the clustering time when using all instances per sub-flow. The differences in clustering time when using 25, 50 and 100 samples per sub-flow are insignificant in this experiment (at less than 0.1%).
Figure 6.21: Sampled clustering: (a) normalised number of instances used for the clustering process; (b) normalised clustering time, for 25, 50 and 100 instances sampled per sub-flow and for all instances. [Figure omitted: values are shown on a normalised 0–1 scale.]
Table 6.2 presents the results of the clustering process in terms of number of sub-flows
selected.
Table 6.2: Number of sub-flows selected automatically by the clustering process

    Number of instances sampled per sub-flow:   25   50   100   All
    Number of clusters produced:                 7    9     8     8
Based on the resultant clusters, four Naive Bayes and C4.5 Decision Tree classifiers were built and compared. Figure 6.22 shows Precision and Recall results for the Naive Bayes classifiers. There are trade-offs in terms of Precision and Recall achieved. Sub-flows selected when sampling 25 instances per sub-flow produce a classifier with the lowest median Recall of 92%. Sub-flows selected when sampling ≥ 50 instances per sub-flow produce classification models with better Recall, with a median of greater than 95%. Interestingly, the experimental results reveal that using 50 instances per sub-flow seems to produce a classifier with the best combination of Precision and Recall, slightly better than using 100 instances per sub-flow. However, the difference is less than 0.3%. Although finding an optimal number of samples is left to future research, my results can be taken as an indication that we only need a small number of samples for the clustering process to produce a good sub-flow selection. This assists in markedly reducing the CPU time required for the sub-flow selection step.
Figure 6.23 shows the Precision and Recall results for the C4.5 Decision Tree classifiers. Similar to the results seen with the Naive Bayes classifiers, sub-flows selected when sampling 25 instances per sub-flow produce a classifier with lower Recall compared to no sampling. However, sampling 50 or 100 instances per sub-flow produces classification models with Recall and Precision as good as in the case of no sampling.

Since the numbers of clusters (and representative sub-flows) identified by the EM algorithm for different sampling rates are similar, the differences in model build time and classification speed for the classifiers are small. The results are presented in Figures 6.24 and 6.25.
In summary, the results of this section demonstrate that we can reduce the time taken for
the clustering process significantly by using a small number of instances without noticeably
compromising the classifier’s performance.
Figure 6.22: Precision and Recall for Naive Bayes classifiers using sub-flows selected by EM with small numbers of samples for the clustering process: (a) Recall (%); (b) Precision (%), for 25, 50 and 100 instances sampled per sub-flow and for all instances.
Figure 6.23: Results for C4.5 Decision Tree classifiers using sub-flows selected by EM with small numbers of samples for the clustering process: (a) Recall (%); (b) Precision (%), for 25, 50 and 100 instances sampled per sub-flow and for all instances.
Figure 6.24: Normalised model build time for Naive Bayes and C4.5 Decision Tree classifiers trained on sub-flows selected by EM with small numbers of samples used in the clustering process (25, 50 and 100 samples per sub-flow, and all sub-flow instances).
Figure 6.25: Normalised classification speed for Naive Bayes and C4.5 Decision Tree classifiers trained on sub-flows selected by EM with small numbers of samples used in the clustering process (25, 50 and 100 samples per sub-flow, and all sub-flow instances).
6.6 Discussion and future work
There are a number of limitations in my current experimental approach. Further improvement
can be gained in the following areas:
• This chapter studies sub-flow selection for ET traffic. The sub-flows used to train the Other class were not optimised. One reason for this is that most of the Other class's example flows are short. Thus I chose only two sub-flows, taken at the beginning and middle of the original full-flows as described in section 5.3. Further investigation into which sub-flows best train the classifier for the Other class may lead to a better result.
• In the evaluation metric, I have not considered the prior processing overhead, which is the
processing for the preparation of datasets to train a classifier. For each of the classifiers
listed above the prior processing overhead includes:
– Full-flow classifier: Features calculation for all full-flow instances in the data trace.
– Filtered full-flow classifier: Processing for the removal of flows shorter than the size
of the sliding window and features calculation for all longer full-flow instances in
the data trace.
– MultiSFs-ManualSel classifier: Study and analysis to understand the statistical char-
acteristics of the application’s traffic; features calculation for a selected number of
sub-flows; searching for the best combination of sub-flows to produce the classifier.
– MultiSFs-AutoSel: Features calculation for all sub-flows identified in Step 1 and in
the clustering process in Step 2.
– MultiSFs-AllFound: Features calculation for all sub-flows identified in Step 1.
Some components may not be precisely measurable (such as the study and analysis to understand the statistical characteristics of the application's traffic, which is normally performed by a domain expert) or may be too dependent on the choice of implementation (such as a recursive search for the best combination of sub-flows to train a classifier).
• Feature calculation takes a finite period of time, depending on the complexity of features
used by a particular ML-based IPTC system. In my experiments, the feature sets are
simple, and statistics can be computed incrementally when a packet arrives in the sliding
window. I consider the feature calculation time to be relatively small compared to the
total time taken to collect N packets for the classification.9 Consequently my focus
in this chapter has been on overall classification speed. More detailed analysis of the
computational load of alternative features is a topic for future work.
• More sub-flow positions could be chosen in Step 1 in the experiments to make the com-
parison between MultiSFs-AutoSel and MultiSFs-AllFound classifiers clearer.
• The test dataset is constructed with a static selection of the sliding window. Hence the
stability and consistency of the classification result is limited to the selected positions of
the sliding window tested. Testing the classifier models in a live network would be ideal.
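The incremental feature computation mentioned in the bullet on feature calculation can be maintained in constant time per arriving packet. A minimal sketch follows; the window size and the mean-packet-length feature are arbitrary illustrative choices, not my experimental configuration.

```python
from collections import deque

class SlidingWindowMean:
    """Incrementally maintain the mean packet length over the last n packets.

    Each call to add() is O(1): the running total is adjusted by the
    evicted and arriving packet lengths, rather than recomputed.
    """
    def __init__(self, n):
        self.window = deque(maxlen=n)
        self.total = 0.0

    def add(self, pkt_len):
        if len(self.window) == self.window.maxlen:
            # deque drops the oldest element on append; subtract it first
            self.total -= self.window[0]
        self.window.append(pkt_len)
        self.total += pkt_len
        return self.total / len(self.window)

w = SlidingWindowMean(3)
for length in (1, 2, 3, 5):
    mean = w.add(length)
```

The same pattern extends to the other simple statistics (e.g. inter-arrival times) by keeping one running total per feature.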
In future work, ML clustering techniques can be used to recognise new and unknown appli-
cations as described below.
One well-known problem related to using supervised ML techniques for classification is the
inability to detect new and unknown applications. For example, suppose a classifier uses supervised ML techniques to identify ET against Other applications, where the Other class is trained with the traffic of known applications such as FTP, Kazaa, Email and Web. The classifier performs very well until a new and unknown application, such as VoIP traffic, is introduced into the network.
Since the classifier has not yet been updated with the newly emerged application, it will classify
some of the VoIP traffic as ET, and some of the VoIP as Other traffic. The classifier’s Precision
will degrade as a result.
An ML clustering technique may offer a solution. A small sample of the classifier’s output
can be used to keep track of the historical profile of the traffic’s statistics and its variation trends,
using ML clustering techniques. When a significant change in the application’s statistical pro-
file is detected, we know the classifier should be updated. For example, when the clustering
9 With a sliding window of 25 packets, it usually takes less than 0.5 seconds to collect enough packets for a classification. Most processors handle millions of instructions per second, so calculating simple mathematical features in microseconds represents a trivial fraction of the typical arrival time between packets making up the sliding window.
technique detects a new cluster, the new cluster will be examined and traced back to its source
application. Its traffic then will be collected to re-train and update the classifier. Figure 6.26
illustrates the idea.
My results presented in section 6.5 suggest that only a small sample of the classifier's output traffic is needed for this purpose.
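The update trigger sketched above can be reduced to a comparison between the historical and current cluster sets. In the sketch below, clusters are represented simply by labels; how clusters found at different times are matched in practice is an open implementation question and the function name is my own.

```python
def detect_new_clusters(baseline, current):
    """Return clusters present in the current clustering of the sampled
    classifier output but absent from the historical baseline.

    A non-empty result signals a significant change in the traffic's
    statistical profile: the new cluster should be traced back to its
    source application and the classifier re-trained.
    """
    return sorted(set(current) - set(baseline))

# Example: a VoIP cluster appears that the baseline has never seen.
alerts = detect_new_clusters({"ET", "Web"}, {"ET", "Web", "VoIP"})
```

A real deployment would re-cluster periodically on the sampled output and raise an update event whenever the returned list is non-empty.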
6.7 Conclusion
In this chapter I extend my work on training with multiple sub-flows, presented in Chapter 5, to include the idea of using clustering ML techniques for automated sub-flow selection. This extension is significant for the deployment of the proposed approach to classify new applications of interest. It eliminates the need for expert knowledge of the application and removes the complexity of manually choosing the best combination of sub-flows to train the classifier. I have presented a performance comparison between my approach, traditional full-flow training, the use of all identified sub-flows without a selection method, and the use of sub-flows selected manually.
The results suggest that my proposed approach has the potential to select the optimal num-
ber of representative sub-flows for training, which takes into account the trade-offs between
accuracy and computational performance. One limitation of my approach is the long time taken
in the clustering process using the EM algorithm. I have proposed and evaluated an approach
to overcome this limitation by down-sampling the training instances for the clustering process.
The application of my proposed method for other Internet applications and the trade-offs
in selecting parameters such as the classification window size (N) and forwarding step (S) are
subjects for future work.
Figure 6.26: An illustration of updating a classifier when new, previously unknown traffic is detected. [Diagram: labelled 'Game' and 'Other' (Web, P2P, SSH, SMTP) traffic pass through feature calculation, optional data sampling and feature filtering/selection, and ML training to produce a 'Game or Other' classification model. Sampled classifier output is clustered; detection of significant changes in the clustering results (e.g. a new cluster for new and unknown traffic such as VoIP) triggers investigation and a model update, for example by adding the new and unknown traffic to the training dataset for the Other class to re-train the classifier.]
Chapter 7
Training Using Synthetic Sub-Flow Pairs
7.1 Introduction
In Chapter 5 I presented a novel technique to train a classifier on a combination of short sub-
flows, such that IP flows can be classified in a finite period of time, starting at an arbitrary point
in a flow’s lifetime. In Chapter 6 I proposed and demonstrated an automated approach based
on the use of clustering ML techniques to choose appropriate, representative sub-flows, from
which a classifier may be trained. In this chapter, I present an improvement to the training phase
so that subsequent flow classifications need not rely on prior knowledge of inferred or actual
directionality of a flow.
The directional neutrality issue was identified in section 3.3.2 and discussed in sections 4.6.2 and 5.5. Classifiers that rely on bi-directional statistics must make an explicit assumption about the direction of each captured packet in order to calculate feature values. This becomes a challenge when classifying in an operational network, where such assumptions about the traffic direction can be wrong.
In this chapter, I propose and evaluate a novel approach for direction-neutral classification.
I train the ML classifier using:
• multiple short sub-flows’ instances extracted from the full-flows generated by the target
application. The feature values of these instances are calculated with the forward direction
defined as the client-to-server direction;
• and their mirror-image replicas, as if the flows were in the reverse direction, that is,
features of the multiple short sub-flow instances are transposed and replicated to construct
a synthetic ‘pair’ of features.
The combination of a sub-flow instance and its mirror-image replica is called a synthetic sub-flow pair (SSP). In classification, the forward direction is defined as the direction of the first packet captured in the sliding window, regardless of whether it is from client to server or server to client. This helps the classifier identify traffic flows in either direction.
For example, consider a classifier trained with the following simple scenario: a flow whose
first packet is destined for port 25 is an SMTP flow. However, imagine if the classifier misses
the first packet of the SMTP flow, instead capturing a later packet (the reply from server to
client). This packet has the source port of 25 instead of the destination port. The classifier then
classifies the flow as non-SMTP, when in fact it is an SMTP flow. To overcome the problem, the classifier should be trained with the rule that a flow whose first packet is destined for or originates from port 25 is an SMTP flow. The classifier would then not miss the SMTP flow.
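The direction-neutral rule in this example can be written in a few lines; the packet representation and function name are illustrative assumptions, not part of my classifier.

```python
def is_smtp(first_pkt):
    """Direction-neutral toy rule: a flow is SMTP if the first captured
    packet has port 25 as either its source or destination port."""
    return 25 in (first_pkt["src_port"], first_pkt["dst_port"])

# Matches both the client-to-server packet and the server's reply.
client_to_server = {"src_port": 49152, "dst_port": 25}
server_to_client = {"src_port": 25, "dst_port": 49152}
```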
I demonstrate my optimisation when applied to the Naive Bayes and C4.5 Decision Tree
classifiers, and show that the SSP approach results in good performance even when classifica-
tion is initiated mid-way through a flow, without prior knowledge of the flow’s direction.
In the next section, I present the SSP approach. Section 7.3 presents the details of my ex-
perimental approach, section 7.4 analyses the experimental results, followed by the conclusion
in section 7.5.
7.2 Proposal using a synthetic sub-flow pairs approach
Training on mirror-image replicas of each sub-flow is an important augmentation of the technique demonstrated in Chapter 5. Figure 3.5's key steps of feature calculation (F), training (T) and classification (C) are illustrated in Figure 7.1. In step F, features are calculated as described in Section 5.3.1. Each sub-flow instance is then labelled as either the ET or Other class to train the ML classifier. The output of the training step is the set of classification rules used to identify ET and Other traffic in future.
With the SSP approach, the dataset of mirror-image replicas of the sub-flows is created artificially in a separate step called F'. From the feature calculation in Step F, the mirror-image
Figure 7.1: Steps in training an ML classifier for identification of ET traffic versus Other traffic, without using the SSP approach. [Diagram: ET traffic and Other traffic each pass through feature calculation (F) and are labelled as the ET and Other classes; the training step (T) produces ET rules and Other rules, which the classification step (C) applies to output ET or Other.]
replica of a sub-flow instance is created by swapping its feature values in the client-to-server
and server-to-client (forward and backward) directions.
Figure 7.2 presents an example to illustrate how a mirror-image replica is created for a sub-
flow instance X. Consider LF and LB to be the mean forward and backward packet lengths
(respectively) of sub-flow instance X. The mirror-image replica of sub-flow instance X is as-
signed mean forward and backward packet lengths of LB and LF respectively. The same trans-
position (mirroring) step is repeated for other features of sub-flow instance X.
Figure 7.2: An illustration of how to create a mirror-image replica for a sub-flow instance. [Sub-flow instance X has feature values (LF, LB, ..., IF, IB), where LF/LB are the mean forward/backward packet lengths and IF/IB the mean forward/backward packet inter-arrival times; its mirror-image replica has the transposed values (LB, LF, ..., IB, IF).]
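The transposition step can be sketched as follows. The `fwd_`/`bwd_` naming convention is my own; the actual feature set is the one defined in section 5.3.1.

```python
def mirror_replica(instance):
    """Create the mirror-image replica of a sub-flow instance by swapping
    each forward-direction feature value with its backward counterpart.

    The original instance is left untouched; a new dict is returned.
    """
    replica = dict(instance)
    for fwd_key in [k for k in instance if k.startswith("fwd_")]:
        bwd_key = "bwd_" + fwd_key[len("fwd_"):]
        replica[fwd_key] = instance[bwd_key]
        replica[bwd_key] = instance[fwd_key]
    return replica

# Example sub-flow instance (values are arbitrary).
x = {"fwd_mean_pkt_len": 420.0, "bwd_mean_pkt_len": 120.0,
     "fwd_mean_iat": 0.035, "bwd_mean_iat": 0.050}
x_mirror = mirror_replica(x)
```

The same swap is applied uniformly to every forward/backward feature pair, which is all the mirroring step requires.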
The sub-flows’ instances and their mirror-image replicas are then labelled to train a classi-
fier. There are two options for building a classifier using SSP.
In Option 1, both sub-flows’ instances and their mirror-image replicas are labelled as one
Figure 7.3: Option 1: Both sub-flows' instances and the mirror-image replicas of every short sub-flow are labelled as one class. The classifier is trained with two classes: ET and Other. [Diagram: for each of ET traffic and Other traffic, the outputs of feature calculation (F) and mirror-image replication (F') are merged by the OR function before the training step T; the classification step C outputs ET or Other.]
class. For example, ET instances and their mirror-image replicas are both labelled as ET class.
This option trains the classifier such that a new flow which has traffic characteristics similar
to either ET or its mirror-image replica will be classified as ET traffic. The OR function, as
indicated in Figure 7.3, is placed before the training T step.
In Option 2, sub-flows’ instances and their mirror-image replicas are labelled independently
as two separate classes. For example ET instances are labelled as ET class, and their mirror-
image replicas are labelled as ET’ class. This option trains the classifier to identify ET, ET’,
Other and Other’ classes separately. Then a new flow which is classified as ET OR ET’ will
be classified as ET traffic. The OR function, as indicated in Figure 7.4, is placed after the
classifying C step.
Figure 7.5 provides an example of datasets used to train a classifier for ET traffic using
Option 1 and Option 2.
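The two labelling options, and Option 2's OR function applied after classification, can be sketched as below. Each pair holds a sub-flow instance and its precomputed mirror-image replica; the function names and the primed-label convention (`"ET'"`) are my own shorthand for the scheme described above.

```python
def label_ssp_pairs(pairs, cls, option):
    """Label (instance, mirror_replica) pairs for SSP training.

    Option 1: the replica shares the original class label (e.g. 'ET').
    Option 2: the replica gets a primed label (e.g. "ET'").
    Returns a list of (feature_vector, label) training rows.
    """
    rows = []
    for inst, replica in pairs:
        rows.append((inst, cls))
        rows.append((replica, cls if option == 1 else cls + "'"))
    return rows

def merge_prediction(label):
    """Option 2's OR function: map ET/ET' to ET and Other/Other' to Other."""
    return label[:-1] if label.endswith("'") else label

# One ET sub-flow instance and its mirror replica, labelled under Option 2.
pairs = [((420.0, 120.0), (120.0, 420.0))]
training_rows = label_ssp_pairs(pairs, "ET", 2)
```

Under Option 1 the OR function is a no-op at classification time, since replicas were merged into the original class before training.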
Figure 7.6 presents an example to illustrate the proposal. Feature values for sub-flow instances form a cluster of pink squares. Feature values for these sub-flow instances' mirror-image replicas form a cluster of blue circles. For applications with asymmetric statistics in the forward and backward directions, these clusters are mostly disjoint. Training on multiple sub-
Figure 7.4: Option 2: Sub-flows' instances and their mirror-image replicas are labelled independently as two separate classes. The classifier is trained with four classes: ET, ET', Other and Other'. [Diagram: F and F' outputs for ET and Other traffic are labelled as ET, ET', Other and Other' respectively; the OR function placed after the classification step C maps ET or ET' to ET, and Other or Other' to Other.]
flows left out many members of the sub-flows' mirror-image replicas (outliers to the multiple sub-flows cluster). Training on SSP ensures that these members are included in constructing the classification model. The classifier's Recall can therefore be improved, as the classifier does not need to make an assumption about the direction of the first packet captured in the sliding window.
On the other hand, the inclusion of these members creates an unwanted area, which is the
gap between the contributing clusters (indicated by the grey area in the figure), compared to
training without SSP. Depending on the internal construction of an ML classification algorithm,
and the method of implementation of SSP (i.e. whether Option 1 or Option 2) this area may
have different impact on the classifier’s Precision.
In Option 1, the synthetic sub-flow pairs share the same class in training the classifier. As a result, the grey area is included in training the classifier, which can create opportunities for false positives and thus lower Precision. Using Option 2, a classifier is trained with multiple
sub-flows instances and their mirror-image replicas separately. This means that the classifier is
trained to recognise members of the pink squares and blue circles clusters without the need to
include the grey area in the training phase. This may have positive impacts on Precision of the
Figure 7.5: Example datasets used to train a classifier using Option 1 and Option 2. [(a) Option 1: each sub-flow instance Xi, with feature values (LFi, LBi, ..., IFi, IBi), and its mirror-image replica, with values (LBi, LFi, ..., IBi, IFi), are both labelled as the ET class. (b) Option 2: the sub-flow instances are labelled as the ET class and their mirror-image replicas as the ET' class.]
Figure 7.6: An illustration of creating an SSP classifier from sub-flow instances and their mirror-image replicas. [Diagram: the sub-flow instances and their mirror-image replicas form two mostly disjoint clusters; the gap between them is the unwanted area. Data points are artificially created for illustration purposes only.]
classifier.
However, the classifier built with Option 1 is simpler, entailing a two-class classification. Option 2, with a four-class classification, requires more processing complexity to train the classifier. In the latter, the classification rules could be more complicated (for example, involving a much larger tree size for the C4.5 Decision Tree classifier), leading to slower classification in real-time.
The following sections present the results of my study on Naive Bayes and C4.5 Decision
Tree classifiers trained without SSP, and with SSP using Option 1 and Option 2.
7.3 Illustrating the Synthetic Sub-Flow Pairs Training Approach
To illustrate my proposal I use the same scenario as described in Chapter 5: real-time Naive Bayes and C4.5 Decision Tree classifiers must accurately identify asymmetric Wolfenstein Enemy Territory traffic mixed in among unrelated, interfering traffic. The same training and testing datasets and feature set as in Chapter 5 are used.
7.3.1 Experimental data
As seen in Chapter 5 (section 5.3.3), ET traffic characteristics are noticeably asymmetric. Mea-
sured across all the ET flows in the test dataset, Figure 7.7 shows the percentage of sub-flows
whose first packet is in the client-to-server direction as a function of M – the number of packets
offset from the start of the full-flow.1 Not surprisingly, this is 100% when M = 0, and fluctuates
significantly for 1 ≤M ≤ 9 (the value does not reach 0% for M = 1 because for ∼35% of ET
flows, both the first and second packets seen on the wire are in the client-to-server direction).
This fluctuation is expected as this region is the Probing phase where the client is discoverying
the server. In the region 2000≤M ≤ 2009 (assumed to be the In-game phase) it is more stable.
There appears to be a ∼60:40 chance that the 2001st , 2002nd and ... 2009th packets traverse in
the client-to-server or server-to-client directions. This is consistent with my analysis of the data
trace in Chapter 5, where during ET game-play we see roughly 28 PPS from client to server and
20 PPS from server to client.
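The measurement behind Figure 7.7 can be sketched as follows. The flow representation is my own assumption: each flow is a sequence of per-packet directions, +1 for client-to-server and -1 for server-to-client.

```python
def pct_client_to_server(flows, offsets):
    """For each offset M, return the percentage of flows whose (M+1)-th
    packet travels in the client-to-server direction.

    flows:   list of flows, each a list of +1 (c2s) / -1 (s2c) markers.
    offsets: the M values to evaluate (e.g. 0..9 and 2000..2009).
    """
    result = {}
    for m in offsets:
        eligible = [f for f in flows if len(f) > m]  # flows long enough
        c2s = sum(1 for f in eligible if f[m] == +1)
        result[m] = 100.0 * c2s / len(eligible)
    return result

# Two toy flows: at M = 0 both start c2s; at M = 1 only one is c2s.
flows = [[+1, -1], [+1, +1]]
percentages = pct_client_to_server(flows, [0, 1])
```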
Figure 7.7: Percentage of flows that have the first packet captured in the client-to-server direction if the first M packets are missed, for 0 ≤ M ≤ 9 and 2000 ≤ M ≤ 2009.
In general, when a classifier model is trained with an explicit definition of the direction
(client-to-server direction or the forward direction is defined as the direction of the first packet of
1 Here I chose to slide the classification window with a step of 1 packet. This is to make the alternating direction (from client to server and server to client) of the 1st packet in the sliding window clearer.
a full-flow), its Recall is dependent on the proportion of sub-flows’ instances that actually start
with the first packet traversing in the same direction (i.e. from client to server). If the first packet
of the sub-flow traverses in the opposite direction (i.e. from server to client), the classifier’s
performance will be negatively affected due to the asymmetric flow statistical properties in the
two directions. This is confirmed by my experimental results, as shown in section 7.4.
7.3.2 Test methodology
In my experiments, I study the performance of a classifier trained with an explicit definition
of flow direction that classifies in real-time. I show the classification accuracy of classifiers
trained on full-flow and multiple sub-flows that have the forward direction defined as the client-
to-server direction, when in testing the client-to-server (or forward) direction is defined as the
direction of the 1st packet captured in the sliding window, which can be from client to server or
server to client.
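Redefining the forward direction relative to the first captured packet can be sketched as below; the per-packet dictionary representation is illustrative, not my capture format.

```python
def orient_window(window):
    """Relabel packet directions so that 'forward' is the direction of the
    first packet captured in the sliding window, whatever that may be.

    Each packet is a dict with at least a 'direction' field; a new list
    of packets with 'fwd'/'bwd' directions is returned.
    """
    ref = window[0]["direction"]  # reference direction for this window
    return [dict(p, direction="fwd" if p["direction"] == ref else "bwd")
            for p in window]

# A window whose first captured packet happens to be server-to-client.
window = [{"direction": "s2c", "len": 90}, {"direction": "c2s", "len": 60}]
oriented = orient_window(window)
```

Feature calculation then proceeds on the 'fwd'/'bwd' labels, with no assumption about which endpoint is the client.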
My experimental results reveal that training with SSP using both Options 1 and 2 allows us
to achieve high Recall and Precision for both the Naive Bayes and C4.5 Decision Tree classi-
fiers. My results also confirm that classification performance is maintained, even when packets
are missed at the beginning of a flow and regardless of the direction of the first packet captured.
7.4 Results and analysis
First I look at the effectiveness of classifying data using a sliding window across the test dataset
and an ML classifier trained on full-flow, filtered full-flow and multiple sub-flows.2 Then I
show how Recall and Precision improve when each ML classifier is trained using SSP Option
1 and SSP Option 2 instead. Similar to the work in Chapters 5 and 6 I use a sliding window of
25 packets.
7.4.1 Classifying without training on SSP
Figure 7.8 outlines Recall for the Naive Bayes full-flow model, filtered full-flow model and
multiple sub-flows model as each sliding window moves across the test dataset. M is the number
of packets offset from the beginning of each flow in the test dataset. The graphs cover two
2 The multiple sub-flow model is trained on eight sub-flows found automatically by the EM algorithm described in Chapter 6.
periods: early client contact with the game server (0 ≤ M ≤ 9) and during active game-play
(2000≤M ≤ 2009).
Figure 7.8: Recall (%) for Naive Bayes classifiers trained on full-flow (full-flow model), filtered full-flow (filtered full-flow model) and multiple sub-flows (multiple sub-flows model), as a function of M, the number of packets offset from the beginning of each flow.
Recall for all three classifiers suffers as M increases above zero. Similar to the results seen
in Chapters 5 and 6, full-flow and filtered full-flow models result in very poor Recall. Training
on multiple sub-flows achieves better Recall with a median of greater than 65%. However,
even with the multiple sub-flows model, Recall fluctuates significantly around 66% when 0 ≤
M ≤ 9 and stays relatively stable at ∼70% when 2000≤M ≤ 2009. More importantly, we see
noticeable shifts in Recall each time the sliding classification window moves by one packet.
This is a direct consequence of the classifier assuming (sometimes incorrectly) that the first
packet in the sliding window represents the client-to-server direction, when in reality it does
not (as shown in Figure 7.7).
Figure 7.9 shows Precision for the three Naive Bayes classifiers. While the full-flow model displays the maximum Precision of 100%, this is an indication of an over-fitting problem, as discussed in section 5.4. The high Precision of this classifier means little given its low Recall. Consistent with the results presented in Chapter 5, the filtered full-flow model has a low Precision that fluctuates around ∼40% when 0 ≤ M ≤ 9 and stays lower than 10% when 2000 ≤ M ≤ 2009. The classifier trained on multiple sub-flows achieves
Figure 7.9: Precision for Naive Bayes classifiers trained on full-flow, filtered full-flow and multiple sub-flows
greater than 88% Precision for all M values. A common point for all three classifiers is that their Precision fluctuates noticeably when 0 ≤ M ≤ 9, and less noticeably when 2000 ≤ M ≤ 2009. This is consistent with the fluctuation in the probability of the first packet in the sliding window being in the client-to-server direction, as shown in Figure 7.7.
Figures 7.10 and 7.11 depict Recall and Precision for full-flow, filtered full-flow and multi-
ple sub-flows C4.5 Decision Tree classifiers. Similar to the results seen with the Naive Bayes
classifiers, all three C4.5 Decision Tree classifiers perform better when the classifier correctly
assumes the first packet in the sliding window is in the client-to-server direction. Their Recall
and Precision degrade otherwise.
The C4.5 Decision Tree classifier trained on multiple sub-flows, while achieving the best Recall and Precision among the three, still suffers when the classifier incorrectly assumes the direction of the first packet. Its median Recall is low, fluctuating around 66% when 0 ≤ M ≤ 9 and remaining at ∼76% when 2000 ≤ M ≤ 2009. Its Precision fluctuates above 90% for all M values.
7.4.2 Training on SSP Option 1, classifying with a sliding window
Figure 7.12 compares Recall as a function of M for the Naive Bayes classifier trained with SSP
Option 1 and the multiple sub-flows model.
Figure 7.10: Recall for C4.5 Decision Tree classifiers trained on full-flow, filtered full-flow and multiple sub-flows
Figure 7.11: Precision for C4.5 Decision Tree classifiers trained on full-flow, filtered full-flow and multiple sub-flows
Figure 7.12: Recall for Naive Bayes classifiers trained using SSP Option 1 and multiple sub-flows
A Naive Bayes classifier trained using SSP Option 1 shows a significant improvement in Recall (a median of 98.9%) compared with the multiple sub-flows model (a median of 72.1%). More importantly, Recall is more stable, being less affected by the direction of the traffic flows when the classifier misses the first M packets.
However, there is a trade-off between the gain in Recall and a loss in Precision. Figure 7.13 summarises Precision for the two Naive Bayes classifiers. Compared to the classifier trained on multiple sub-flows, the SSP Option 1 classifier achieves 3% lower Precision on average across all M values, with Precision remaining between 85.2% and 89.8%. It is also notable that median Precision when M ≥ 2000 is lower than when 0 ≤ M ≤ 9. This is due to a smaller number of ET flows when M ≥ 2000, which leads to a smaller number of true positives for ET traffic, as explained in section 5.4.
A C4.5 Decision Tree classifier trained using SSP Option 1 shows a similarly significant improvement in Recall. As presented in Figure 7.14, it displays good Recall (a median of 99.3%) compared to the one trained on multiple sub-flows (a median of 75.2%). Its Recall is not only higher but also more stable, being less affected by the direction of the traffic flows when the classifier misses M packets.
Figure 7.15 presents Precision for the SSP Option 1 and multiple sub-flows C4.5 Decision
Figure 7.13: Precision for Naive Bayes classifiers trained using SSP Option 1 and multiple sub-flows
Figure 7.14: Recall for C4.5 Decision Tree classifiers trained using SSP Option 1 and multiple sub-flows
Tree classifiers. In contrast to the Naive Bayes classifiers, there is also a gain in Precision when training on SSP Option 1. Precision increases by 2.7% on average, staying at 97.3%-98.2% for the SSP Option 1 classifier. This is due to the different responses of the Naive Bayes and C4.5 Decision Tree classifiers when trained with the unwanted range of feature values (as described earlier in Figure 7.6). In the next section I evaluate SSP Option 2, which does not include the unwanted grey area when training the classifiers.
Figure 7.15: Precision for C4.5 Decision Tree classifiers trained using SSP Option 1 and multiple sub-flows
7.4.3 Training on SSP Option 2, classifying with a sliding window
In this section, I compare Precision, Recall, model build time and classification speed³ for Naive Bayes and C4.5 Decision Tree classifiers trained on multiple sub-flows (multiple sub-flows model), SSP Option 1 (SSP Option 1 model) and SSP Option 2 (SSP Option 2 model). Figure 7.16 summarises Recall for the three Naive Bayes classifiers.
As shown in Figure 7.16, a Naive Bayes classifier trained using SSP Option 2 has almost identical Recall to one trained using SSP Option 1. Both models show a great improvement in Recall – higher and more stable – compared to the classifier trained on multiple sub-flows only.
Figure 7.17 summarises Precision for the three Naive Bayes classifiers.
³ Model build time and classification speed are two evaluation metrics defined in Chapter 6.
Figure 7.16: Recall for Naive Bayes classifiers trained using SSP Option 1, SSP Option 2 and multiple sub-flows
Figure 7.17: Precision for Naive Bayes classifiers trained using SSP Option 1, SSP Option 2 and multiple sub-flows
A Naive Bayes classifier trained using SSP Option 2 has increased Precision by approximately 5% for all M values compared to being trained using SSP Option 1. This suggests a positive impact on Precision for the Naive Bayes classifier when using this augmented option. In this case, Precision for the Naive Bayes classifier is even higher and more stable than for a classifier trained on multiple sub-flows only.
Figure 7.18 outlines Recall for the C4.5 Decision Tree classifiers. A C4.5 Decision Tree
classifier trained using SSP Option 2 has a slightly better Recall compared to being trained using
SSP Option 1. This suggests the positive impact on Recall for the C4.5 Decision Tree model
when the unwanted area for each feature is eliminated. Both models show great improvement
in Recall, which is both higher and more stable, compared to the classifier trained on multiple
sub-flows only.
Figure 7.18: Recall for C4.5 Decision Tree classifiers trained using SSP Option 1, SSP Option 2 and multiple sub-flows
Figure 7.19 depicts Precision for the C4.5 Decision Tree classifiers. A C4.5 Decision Tree classifier trained using SSP Option 2 has almost identical Precision to one trained using SSP Option 1. This suggests that the C4.5 Decision Tree algorithm is less affected than the Naive Bayes algorithm by the inclusion of the unwanted area when training the classifier, which is why the further augmentation of Option 2 has little impact on its Precision.
Figure 7.20 shows the normalised model build time and classification speed for the Naive
Figure 7.19: Precision for C4.5 Decision Tree classifiers trained using SSP Option 1, SSP Option 2 and multiple sub-flows
Bayes and C4.5 Decision Tree classifiers. A value of 1 represents the longest model build time of 1,636 seconds, and the highest classification speed of 12,303 instances per second.
Figure 7.20 indicates that the Naive Bayes and C4.5 Decision Tree classifiers trained using
SSP Option 1 take almost double the model build time compared to the same classifiers trained
on multiple sub-flows only. This result is to be expected as the training approach using SSP
Option 1 doubles the number of training instances for the classifiers. Using SSP Option 2
slightly increases model build time for the Naive Bayes classifier, while almost doubling the
model build time for the C4.5 Decision Tree classifier compared to the SSP Option 1 approach. This is a trade-off of training the classifier for four-class rather than two-class classification. In my experiment, the size of the C4.5 Decision Tree increased roughly 20-fold (a tree size of 3,507 versus 175, and 1,754 leaves versus 88).
Both the Naive Bayes and C4.5 Decision Tree classifiers trained using SSP Option 1 are slightly slower compared to the same classifiers trained on multiple sub-flows (by 1% and 7% respectively). Training with SSP Option 2 slows down the Naive Bayes classifier trained on multiple sub-flows by ∼9% and the C4.5 Decision Tree classifier trained on multiple sub-flows by ∼60% (due to the significant increase in tree size mentioned previously).

Figure 7.20: Computational performance for Naive Bayes and C4.5 Decision Tree classifiers trained on multiple sub-flows, SSP Option 1 and SSP Option 2. (a) Normalised model build time; (b) normalised classification speed.
To sum up, training on SSP Option 2 has produced classifiers that are not only accurate but also stable. Recall for both the Naive Bayes and C4.5 Decision Tree classifiers is close to 99%, with Precision close to 98% and 95% for the C4.5 Decision Tree and Naive Bayes classifiers respectively. The classifier is able to maintain its performance regardless of where it begins to capture packets of a given traffic flow in real-time classification.
As a trade-off, this results in a longer time required to build a classification model, and
slower classification speed, especially for the C4.5 Decision Tree classifier. In practice classifi-
cation models can be built offline, hence the longer model build time should not be considered
a significant drawback. The slower classification speed, however, may need to be considered in
deployment when scaling to classification of multiple applications concurrently.
Compared to SSP Option 1, SSP Option 2 improves the Precision of the Naive Bayes classifier significantly, with only a slight trade-off in model build time and classification speed. On the other hand, SSP Option 2 produces only a slight improvement in Recall for the C4.5 Decision Tree classifier, with significant degradation in model build time and classification speed. Considering the gains in Precision and Recall, and the trade-offs in computational performance, SSP Option 1 is considered the simpler and more effective choice at this stage.
7.5 Conclusion
Most research in the literature has assumed that a classifier will see the first packet of each bi-
directional flow, and that this initial packet will be from a client to a server. The classification
model is trained on the basis of this assumption, and subsequent evaluations have presumed the
ML classifier can calculate features with the correct sense of forward and reverse direction. In
real-time IPTC, this assumption can be wrong, especially when the classifier is initiated when
the traffic flows are already in progress.
To solve this problem, I have introduced a novel approach that further complements the
multiple sub-flows training approach presented in the previous chapter. I propose that the ML
classifier should be trained using synthetic sub-flow pairs (SSP). With SSP the statistical fea-
tures of multiple short sub-flows associated with a target application (as discussed in Chapter 5)
are transposed and replicated to construct a synthetic ‘pair’ of features. These pairs of sub-flow
features now reflect the statistical characteristics of a target application’s traffic, whether seen
in the forward or reverse direction.
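The pairing step described above can be sketched in a few lines. The `_fwd`/`_bwd` feature-name convention below is my own illustrative assumption, not the thesis's exact representation:

```python
def make_ssp_pair(features):
    """Given one sub-flow's direction-sensitive features (keys ending in
    '_fwd' or '_bwd'), return the original instance plus a mirrored twin
    in which forward and backward features are transposed. Training on
    both instances makes the learned model direction-neutral."""
    def flip(key):
        if key.endswith("_fwd"):
            return key[:-4] + "_bwd"
        if key.endswith("_bwd"):
            return key[:-4] + "_fwd"
        return key  # direction-neutral features stay put
    mirrored = {flip(k): v for k, v in features.items()}
    return [features, mirrored]

sub_flow = {"mean_len_fwd": 75, "mean_len_bwd": 180,
            "mean_iat_fwd": 0.02, "mean_iat_bwd": 0.05}
pair = make_ssp_pair(sub_flow)
# pair[1]["mean_len_fwd"] == 180: the twin sees the flow 'backwards'
```

Whichever direction the classifier later assumes for the first packet of a window, one member of each synthetic pair matches it.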
My proposal is illustrated by constructing, training and testing with Naive Bayes and C4.5
Decision Tree classifiers for the detection of Wolfenstein Enemy Territory online game traffic.
With this particular scenario and two options for implementing SSP, I demonstrate that SSP
using either option can significantly improve a classifier’s performance when using a small slid-
ing window, regardless of the direction of the first packet of the most recent N packets used for
the classification. Significantly higher and more stable Recall is achieved compared to train-
ing solely on multiple sub-flows with an explicit definition of flow direction. Depending on
the internal construction of each ML algorithm, SSP Option 1 and SSP Option 2 have differ-
ent impacts on the Precision achieved. SSP Option 1 shows an improvement in Precision for
C4.5 Decision Tree classifiers, but degradation in Precision for Naive Bayes classifiers. Op-
tion 2 seems to improve the trade-off in Precision and Recall, especially for the Naive Bayes
classifiers. However, this option results in a longer required model build time and slower clas-
sification speed. With both accuracy (Precision and Recall) and computational performance
(model build time and classification speed) taken into consideration, SSP Option 1 is chosen to
be evaluated in the next chapter, as my overall proposed training method.
Chapter 8
Training Using Synthetic Sub-Flow Pairs with the Assistance of Clustering Techniques (SSP-ACT)
8.1 Introduction
In Chapters 5, 6 and 7, I proposed an approach to building a practical, real-time ML-based IP traffic classifier. This approach uses:

• Training on multiple sub-flows instead of the full-flow, for timely and continuous classification.

• Clustering techniques to automate the selection of sub-flows for training.

• Training on synthetic sub-flow pairs for direction-neutral classification.
I shall refer to this proposed approach as Synthetic Sub-flow Pairs with the Assistance of
Clustering Techniques (SSP-ACT). I have shown that SSP-ACT can significantly improve the
timely and direction-independent real-time classification of Wolfenstein Enemy Territory online
game traffic.
In this chapter, I study the generality and robustness of SSP-ACT. I demonstrate that SSP-ACT also benefits the identification of VoIP traffic. Naive Bayes and C4.5 Decision Tree classifiers trained using SSP-ACT can maintain their performance well under 5% random, independent synthetic packet loss. SSP-ACT can also scale well to simultaneous three-class (ET versus VoIP versus Other) classification.
This chapter is organised as follows. In section 8.2 I demonstrate the robustness of SSP-
ACT when applied to the classification of VoIP traffic. In contrast to ET, VoIP traffic tends to
be more stable over a flow’s lifetime, and more symmetric in the caller-to-callee and callee-to-
caller directions.
Section 8.3 evaluates the performance of the Naive Bayes and C4.5 Decision Tree classi-
fiers trained in Chapter 7 when classifying ET and VoIP traffic in the presence of 5% random,
independent synthetic packet loss.
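The loss model referred to here, each packet dropped independently with probability 0.05, can be sketched as follows. The function name and trace representation are hypothetical:

```python
import random

def drop_packets(packets, loss_rate=0.05, seed=1):
    """Apply random, independent synthetic packet loss: each packet is
    dropped with probability `loss_rate`, independently of the others."""
    rng = random.Random(seed)  # fixed seed for a reproducible experiment
    return [p for p in packets if rng.random() >= loss_rate]

trace = list(range(10_000))
kept = drop_packets(trace)
print(len(kept) / len(trace))  # close to 0.95
```

Because the drops are independent and identically distributed, any 25-packet sliding window is equally likely to be affected, which is what makes this a useful robustness test.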
The possibility of using a single classifier to classify multiple applications simultaneously
is explored in section 8.4. Naive Bayes and C4.5 Decision Tree classifiers are trained using
SSP-ACT to identify ET, VoIP and Other traffic at the same time. A discussion of the pros and
cons of using a common classifier versus multiple classifiers in parallel, each for an individual
application, is presented.
This chapter concludes with a discussion of a number of remaining implementation issues
of SSP-ACT and some options for possible future work.
8.2 Evaluation of SSP-ACT in identifying VoIP traffic
To test the generality of the approach, I identify voice flows encoded with the ITU-T G.711 PCMU¹ [185] (G.711) and GSM 06.10 [186] (GSM) codecs and transported over RTP [101], using Naive Bayes and C4.5 Decision Tree classifiers trained using SSP-ACT. This section starts with a brief background on VoIP and the G.711 and GSM codecs. This is followed by a description of
my data processing method. I then analyse the statistical properties of my VoIP data trace, and
justify why and how SSP-ACT can benefit the identification of VoIP traffic. Finally, Recall and
Precision results are presented and analysed.
8.2.1 A brief background on ITU-T G.711 PCMU and GSM 06.10 encoded voice traffic
VoIP has become a popular Internet application for both home users and enterprises. A voice
session is normally set up and torn down using a signalling protocol, such as Session Initiation
Protocol (SIP) [187]. The signalling packets carry information required to locate users and
allow them to negotiate compatible media types. Setup information can be conveyed between
¹ PCM µ-law
participants using the Session Description Protocol (SDP)² [188].
Voice traffic is carried in Real-Time Transport Protocol (RTP) [101] flows. RTP provides end-to-end delivery services suitable for real-time applications, including interactive audio and video. Different codecs may be used to encode and transmit VoIP traffic. The G.711 PCMU codec samples audio at 8,000 samples/sec with eight bits per sample. The default packetisation interval is 20ms. With this sampling interval, 50 frames/sec are generated, each frame containing 160 bytes of payload. GSM 06.10 (Group Special Mobile) is a European standard for full-rate speech transcoding. It is a frame-based codec, coding at 8,000 samples/sec, with a fixed packetisation interval of 20ms/frame. In the GSM packing used by RTP, every block of 160 audio samples is compressed into a 33-octet frame [189].
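The per-frame numbers quoted above follow from simple arithmetic, sketched here:

```python
# G.711 PCMU: 8,000 samples/sec, 8 bits per sample, 20 ms packetisation.
SAMPLE_RATE = 8000      # samples per second
BITS_PER_SAMPLE = 8
PTIME = 0.020           # packetisation interval, in seconds

samples_per_frame = round(SAMPLE_RATE * PTIME)           # 160 samples
g711_payload = samples_per_frame * BITS_PER_SAMPLE // 8  # 160 bytes/frame
frames_per_sec = round(1 / PTIME)                        # 50 frames/sec

# GSM 06.10: each block of 160 samples is compressed into a 33-octet frame.
gsm_payload = 33
compression_ratio = g711_payload / gsm_payload           # roughly 4.8x

print(samples_per_frame, g711_payload, frames_per_sec, gsm_payload)
```

These payload sizes are what make the two codecs separable by packet-length features later in this section.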
During a voice call, there can be times when both parties are talking, yet a typical voice
conversation has a talk spurt period, followed by a silent period (e.g. one party listening to the
other). When one of the parties remains silent, background noise is picked up and sent over
the network. However, RTP allows discontinuous transmission (silence suppression) when one
party does not speak, to save bandwidth. When silence suppression is enabled, the line may
appear to have dropped at the receiving end. For this reason, comfort noise packets [190] are
generated to compensate for the lack of background noise. At the beginning of an inactive voice
segment (silence period), a comfort noise packet is transmitted in the same RTP voice stream
and indicated by the comfort noise payload type. The comfort noise generator algorithm at the
receiver uses the information in the comfort noise payload to update its noise generation model
and then to produce an appropriate amount of comfort noise. The comfort noise packet sending
rate is implementation specific. It may be sent periodically or only when there is a significant
change in the background noise characteristics [190].
8.2.2 Data collection and research methodology
My experimental data is 3.4 GBytes of VoIP traffic extracted from 50 GBytes of full-payload traffic collected on a home network between 27 November 2006 and 3 August 2007³. The raw data

² SDP is purely a format for session description. It is used in conjunction with different transport protocols as appropriate, including SIP.

³ The network has been set up and maintained by two of my colleagues, L. Stewart and W. Harrop. The data used in my experiments are personal VoIP calls going to and from two VoIP phones, connected to a VoIP provider through an Asterisk server.
trace is a mixture of voice and other Internet applications’ traffic. Voice traffic is extracted (as
described in Appendix D) to provide 644 RTP flows (made up of 594 G.711 flows and 50 GSM
flows) to use as benchmark VoIP flows for subsequent analysis.
These RTP flows contain voice calls with a duration ranging between 19 and 8,207 seconds.
The cumulative distribution of the flows’ duration is presented in Figure 8.1. Median call dura-
tion is 80 seconds (with a total of 7,061 packets in both directions), with the 75th percentile at
355 seconds (with a total of 30,530 packets in both directions).
Figure 8.1: Cumulative distribution of call duration
To provide independent training and testing datasets, I used 341 flows (52% of the available data, consisting of 314 G.711 flows and 27 GSM flows) for clustering and training, and the remaining 303 flows (consisting of 280 G.711 flows and 23 GSM flows) for testing. This split was chosen so that I would have a good number of instances for both training and testing.
Statistical properties of VoIP flows
Figures 8.2 and 8.3 show the mean packet length and mean packet inter-arrival time feature
values from the caller to callee for G.711 traffic for sub-flows with a window size of 25 packets.
M is the number of packets offset from the beginning of each flow. (On the far right of the
x-axis, FF represents features calculated on full-flow.) Statistics of G.711 traffic in the reverse
direction are presented in Appendix D.
Figure 8.2: G.711 traffic - forward direction, mean packet length calculated over a window of 25 packets
Figure 8.3: G.711 traffic - forward direction, mean packet inter-arrival time calculated over a window of 25 packets
As shown in Figure 8.2, most packets are 200 bytes long. For M ≥ 80, there are outliers,
which are less than 200 bytes. These are due to the presence of comfort noise packets (each 41
bytes long) within the sliding window. It makes sense that these outliers are only evident when
the conversations are in progress.
Figure 8.3 indicates that most packets arrive at 20ms intervals. However, there are outliers
that indicate a packet inter-arrival time of more than 20ms. These longer packet inter-arrival
times are due to jitter, packet loss⁴ and silent periods during voice conversations.
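The loss figures reported in the accompanying footnote are derived from gaps in the observed RTP sequence numbers. A sketch of that estimate, handling 16-bit sequence-number wraparound, might look like:

```python
def rtp_loss_fraction(seqs):
    """Estimate the fraction of lost packets in one RTP stream from the
    16-bit sequence numbers actually captured (assumed in sent order)."""
    expected = 0
    for prev, cur in zip(seqs, seqs[1:]):
        expected += (cur - prev) % 65536  # step size, allowing wraparound
    received = len(seqs) - 1
    return (expected - received) / expected if expected else 0.0

# 1000 packets sent, packets 100-104 never captured: 5 of 999 steps lost.
seqs = [s for s in range(1000) if not 100 <= s <= 104]
print(round(rtp_loss_fraction(seqs), 4))  # -> 0.005
```

The same discontinuity count also reveals where in a call the losses occurred, which matters when a 25-packet window happens to straddle a loss burst.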
The analysis of packet length and inter-arrival time statistics for G.711 traffic reveals that, in most cases, voice packets have the same length, with little variation in the packet arrival interval. This makes the voice traffic stable for the duration of a conversation, and symmetric in the forward and reverse directions.
GSM flows have similar traffic characteristics. Figures 8.4 and 8.5 show the mean packet length and mean packet inter-arrival time from the caller to callee for GSM traffic with a window size of 25 packets. Statistics of GSM traffic in the reverse direction are presented in Appendix D.
Figure 8.4: GSM traffic - forward direction, mean packet length calculated over a window of 25 packets
As shown in Figure 8.4, almost all packets are 73 bytes long. There are only a few outliers, due to telephone-event packets. Figure 8.5 shows that most packets arrive at 20ms intervals. However, there are outliers indicating a packet inter-arrival time of greater than 20ms. These
⁴ By observing discontinuities in RTP sequence numbers, 93% of the recorded flows are missing less than 2% of their packets. The largest observed loss of packets (4.9%) involved a single voice conversation where 729 voice RTP packets were missed.
Figure 8.5: GSM traffic - forward direction, mean packet inter-arrival time calculated over a window of 25 packets
longer packet inter-arrival times are due to jitter or packet loss during the voice conversations.⁵

However, there are exceptions (indicated by the outliers in Figures 8.2 and 8.3, for example). When silence suppression is enabled, if the sliding window captures packets that cover a silent period, the occasional presence of comfort noise packets will affect packet length statistics such as the minimum and mean packet length features. Similarly, the presence of silent periods affects the packet inter-arrival time statistics within the window.
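Numerically, the effect on window statistics is easy to see. Assuming 200-byte G.711 voice packets and 41-byte comfort noise packets (as in Figure 8.6):

```python
from statistics import mean

VOICE, COMFORT = 200, 41  # IP packet lengths in bytes (G.711 example)

# A window entirely inside a talk spurt: 25 voice packets.
talk_window = [VOICE] * 25
# A window overlapping a silence period: 20 voice + 5 comfort noise packets.
silence_window = [VOICE] * 20 + [COMFORT] * 5

# Mean packet length drops from 200 to (20*200 + 5*41)/25 = 168.2 bytes,
# and minimum packet length drops from 200 to 41 bytes.
features = {
    "talk": (mean(talk_window), min(talk_window)),
    "silence": (mean(silence_window), min(silence_window)),
}
print(features)
```

A classifier trained only on talk-spurt windows would therefore see quite different feature values whenever the window straddles a silence period.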
Figure 8.6 illustrates the impact of silence suppression by focusing on 15 seconds of traffic in each direction, starting at the 30th second of a 121-second G.711 call. Inspection of the RTP sequence numbers in each direction reveals that no packets were missed. The vertical y-axis presents the IP packet length. The voice packet length is 200 bytes in both directions; however, to make the distinction between the two directions clearer, I plotted the callee-to-caller packets lower. In this example, the traffic is asymmetric in the forward and reverse directions at position 1 and symmetric at position 2.
Figure 8.6: Voice traffic generated during a voice conversation. Comfort noise packets and silence suppression periods during a conversation can create asymmetry and multiple packet sizes within the traffic captured by the sliding window.

My analysis suggests that although voice traffic is usually stable and symmetric in both directions, there are cases where:

• features calculated on full-flow can differ from those calculated on a small sliding window;

• features calculated on a small sliding window at different points during a flow's lifetime can be different from each other; and

• features calculated in one direction can be different from those calculated in the other direction.

⁵ By observing discontinuities in RTP sequence numbers, 88% of the recorded flows are missing less than 2% of their packets. The largest observed loss of packets (3.1%) involved a single voice conversation where 189 voice RTP packets were missed.
In the following section I present an experimental analysis that demonstrates that SSP-ACT
can produce a more accurate classifier compared to training on full-flow. In addition, when
classifying VoIP against other traffic whose statistical properties vary greatly during a flow’s
lifetime and are asymmetric in the forward and reverse directions, SSP-ACT identifies Other
traffic better, and hence can outperform training on full-flow in terms of Precision.
8.2.3 Results and analysis
To maintain consistency with the work in previous chapters, I choose the same sliding window size N of 25 packets. With a 20ms packet interval in one direction, this is equivalent to a maximum of 0.5 seconds to collect 25 packets in a talk spurt⁶. Training and testing datasets are constructed using the same approach described in Chapter 7. Let M be the number of packets offset from the beginning of a flow. Nineteen sub-flows were selected for the clustering process, with 0 ≤ M ≤ 90 and 1,000 ≤ M ≤ 9,000, assuming that this would capture the early and in-progress phases of the voice calls. EM found five clusters among these 19 sub-flows' instances.

⁶ This is the worst case, when there is only voice traffic in one direction; it should be only 0.25 seconds when there is traffic in both directions.
From these clusters, five representative sub-flows, namely SF-0, SF-10, SF-30, SF-1000 and SF-3000, were chosen to train the Naive Bayes and C4.5 Decision Tree classifiers. The Other traffic used for training was the same as that outlined in Chapters 5, 6 and 7.
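The selection step can be sketched as follows. I assume here, as an illustration rather than the thesis's exact rule (which is defined in Chapter 6), that after EM assigns sub-flow instances to clusters, the representative kept for each cluster is the sub-flow offset contributing the most instances to it:

```python
from collections import Counter, defaultdict

def representative_subflows(assignments):
    """Given (subflow_offset, cluster_id) pairs produced by a clustering
    step such as EM, pick one representative sub-flow per cluster: the
    offset contributing the most instances to that cluster."""
    per_cluster = defaultdict(Counter)
    for offset, cluster in assignments:
        per_cluster[cluster][offset] += 1
    return {c: counts.most_common(1)[0][0]
            for c, counts in per_cluster.items()}

# Toy example: instances from sub-flows at offsets 0, 10 and 1000.
assignments = [(0, "A"), (0, "A"), (10, "A"), (1000, "B"), (1000, "B")]
print(representative_subflows(assignments))  # {'A': 0, 'B': 1000}
```

The point of the reduction is that training proceeds on a handful of representative sub-flows rather than on every candidate offset.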
Figure 8.7 shows Recall for VoIP traffic for the Naive Bayes classifiers trained on full-flow and with SSP-ACT. Recall for the former is 100% for all M values. For the latter, 18 of the 19 positions of the sliding window have a Recall of 100%, and the remaining position has a Recall of 99.6% (median Recall of 100%).
Figure 8.7: VoIP Recall: Naive Bayes classifiers trained on full-flow and SSP-ACT
However, training on full-flow shows a very low Precision of less than 6% for all positions of the sliding window, as shown in Figure 8.8. The classifier trained using SSP-ACT, on the other hand, achieves a good median Precision of 95.4% across all positions of the sliding window. Precision is higher for smaller M values, and slightly decreases as M increases. This is because the number of VoIP instances available for testing reduces as M increases (as explained in Section 6.4). The number of false positives is constant (the same instances of Other traffic are used for testing at all M values); Precision therefore depends only on the number of true positives for each M value. When M increases there are fewer flows longer than M+N packets.
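Concretely, with Precision = TP/(TP + FP) and FP held constant, shrinking TP alone pulls Precision down. The counts below are hypothetical, chosen only to show the shape of the effect:

```python
def precision(tp, fp):
    """Precision: the fraction of positive classifications that are correct."""
    return tp / (tp + fp)

FP = 40  # hypothetical fixed count of misclassified Other instances
# As M grows, fewer flows survive to offset M, so true positives shrink:
for tp in (1000, 600, 300):
    print(f"TP={tp}: Precision={precision(tp, FP):.3f}")
# TP=1000 gives ~0.962, TP=600 gives ~0.938, TP=300 gives ~0.882
```

This is why the gentle downward slope in Figure 8.8 reflects the shrinking test set rather than any degradation of the classifier itself.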
Figure 8.8: VoIP Precision: Naive Bayes classifiers trained on full-flow and SSP-ACT
Consequently, there are fewer VoIP flows for testing and the number of true positives for VoIP traffic is reduced. This explains why Precision falls as M increases.
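The dependence of Precision on the number of true positives alone can be seen with a small worked example. The counts below are hypothetical, chosen only to illustrate the arithmetic: the Other test set is fixed, so the false positive count is the same at every window position M, while the true positive count shrinks as fewer VoIP flows survive past larger M.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)

# Constant false positives (fixed Other test set), shrinking true positives
fp = 40
for tp in (1000, 400, 100):   # fewer VoIP flows survive past larger M
    print(f"TP={tp:4d}  Precision={precision(tp, fp):.3f}")
```

Recall is unaffected throughout, since no VoIP instance that is tested is misclassified; only the pool of testable VoIP instances shrinks.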
Similar results are found with the C4.5 Decision Tree classifiers. As shown in Figure 8.9, training on full-flow exhibits a median Recall of 99.6%, while training using SSP-ACT shows a median Recall of 95.7% for all M values. Although training on full-flow produces higher Recall than training using SSP-ACT, the higher Recall is less meaningful once we look at the Precision.
Figure 8.9: VoIP Recall: C4.5 Decision Tree classifiers trained on full-flow and SSP-ACT
Figure 8.10 summarises Precision for the C4.5 Decision Tree classifiers trained on full-flow and SSP-ACT. The classifier trained on full-flow demonstrates very low Precision, of less than 5.6%, for all positions of the sliding window. In contrast, the classifier trained using SSP-ACT achieves a significantly higher median Precision of 99.2% over all positions of the sliding window.
Figure 8.10: VoIP Precision: C4.5 Decision Tree classifiers trained on full-flow and SSP-ACT
The high Recall and low Precision when using classifiers trained on full-flow can be explained as follows. As most VoIP flows are stable and symmetric in the forward and reverse directions, training on the full-flow is sufficient to identify VoIP traffic even when classifying on a small window; hence the high Recall for VoIP traffic.
However, the statistical characteristics of Other traffic vary during a flow's lifetime and are asymmetric in the forward and reverse directions (see Appendix A). When classifying using a small sliding window, training on the full-flow fails to distinguish the Other traffic. Both the Naive Bayes and C4.5 Decision Tree classifiers trained on full-flow misclassify a substantial amount of Other traffic as VoIP traffic, hence their very poor Precision. In contrast, classifiers trained using SSP-ACT can identify both VoIP and Other traffic well when classifying on a small window, hence their good results in both Recall and Precision.
Figure 8.9 reveals that training using SSP-ACT can sometimes produce slightly lower Recall than training on full-flow. This is because features calculated on a sub-flow can differ from those calculated on the full-flow, for both VoIP and Other traffic. The distinction between VoIP and Other traffic features calculated on sub-flows may not be as pronounced as when calculated on full-flows. Training on sub-flows hence generates more conservative classification rules than training on full-flows. While these rules produce high Precision for VoIP traffic, they come with a trade-off of more false negatives, resulting in lower Recall compared to training on full-flow.
This explanation is illustrated in Figure 8.11. In Figure 8.11(a), the classification model trained on full-flow covers a greater range of VoIP instances. When training using SSP-ACT, because sub-flows' feature values differ from those of full-flows, the resulting classification model covers a smaller range of VoIP sub-flow instances (as shown in Figure 8.11(b)). For VoIP traffic, the differences between feature values calculated on sub-flows and full-flows are less significant than for Other traffic. Hence, the full-flow model can produce high Recall but very low Precision when classifying using a small sliding window. Training using SSP-ACT provides much greater Precision, with a small trade-off in Recall due to a greater number of false negatives.
There are also trade-offs between Recall and Precision due to the particular internal construction of each ML algorithm. When both are trained using SSP-ACT, the Naive Bayes classifier has a slightly higher Recall and slightly lower Precision than the C4.5 Decision Tree classifier. Deeper exploration of this area is a subject for future research.
To sum up, SSP-ACT has been demonstrated to benefit VoIP traffic, whose statistical characteristics differ significantly from those of ET. Training using SSP-ACT produces a more accurate classifier, in terms of both Recall and Precision, when identifying VoIP traffic among interference traffic whose statistical characteristics vary over the flows' lifetimes and are asymmetric in the forward and reverse directions.
8.3 Evaluation of SSP-ACT in the presence of additional packet loss
In this section I present my preliminary results from an investigation of the robustness of SSP-
ACT. Like any other IP traffic classification approach which relies on the statistical properties
of traffic at the network layer, the performance of a classifier trained using SSP-ACT could be
affected when deployed in a variety of network environments. Network layer perturbations such
as packet loss, packet re-ordering, traffic shaping, packet fragmentation, and jitter are likely to
result in variations of the feature values.
(a) Classification model for VoIP traffic, trained on full-flow
(b) Classification model for VoIP traffic, trained using SSP-ACT
(c) Classifying VoIP using classifiers trained on full-flow and SSP-ACT

Figure 8.11: VoIP classification using classifiers trained on full-flow and SSP-ACT: Training on full-flow may cover a larger area of VoIP instances when classifying using a small sliding window, hence resulting in higher Recall but lower Precision compared to training using SSP-ACT. (The data points are artificially created for illustration purposes only. They are not actual data points from my dataset.)

My preliminary investigation is on the impact of packet loss, as packet loss will have an impact on the values of features that are based on packet length and packet arrival time statistics.
For example, a loss of a few packets in a sliding window would result in longer gaps in packet
arrivals, which consequently affects the packet inter-arrival time statistics. Similarly, the loss
of large or small packets would affect the packet length statistics, and so on. An example of
packet loss on VoIP traffic is illustrated in Figure 8.12. A loss of packet 2 increases the packet
inter-arrival time from T to 2T.
Figure 8.12: A simple illustration of the impact of packet loss on packet inter-arrival time statistics
These changes are likely to have an impact on the classifier’s performance, in terms of both
Precision and Recall. Packet loss on the target application can result in a greater number of
false negatives, and hence can lower the classifier’s Recall. Packet loss on Other applications
can result in a greater number of false positives, which can thus lower the Precision for the
classification of the application of interest.
Different packet loss patterns (such as random or bursty losses [191]) can also have different impacts on feature values compared to the no-loss case. In this initial evaluation, I focus on the impact of random, independent packet loss. Future work can expand on this through an evaluation of other loss patterns and loss processes (for example, the Gilbert model as described in [192]).
In this section, I study the changes in Precision and Recall of a classifier trained using SSP-ACT when classifying a testing dataset perturbed by the inclusion of synthetic packet loss. From the original test trace file, I create synthetic loss by randomly skipping packets when calculating feature statistics (I assume here that any given packet may be lost with a pre-specified probability p, and that these random losses are independent).
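The synthetic loss process just described can be sketched as follows. This is an illustrative reconstruction, not the actual tool used in this thesis; the function name and the fixed seed are my own assumptions.

```python
import random

def apply_random_loss(packets, p, seed=1):
    """Simulate random, independent packet loss: each packet is dropped
    with probability p before feature statistics are computed."""
    rng = random.Random(seed)
    return [pkt for pkt in packets if rng.random() >= p]

# Timestamps of a steady stream with inter-arrival time T = 20 ms
T = 0.02
times = [i * T for i in range(1000)]
survivors = apply_random_loss(times, p=0.05)
iats = [b - a for a, b in zip(survivors, survivors[1:])]
print(len(times) - len(survivors), "packets lost")
# A lost packet turns a gap of T into 2T (or more, for back-to-back losses)
print("max inter-arrival:", round(max(iats), 3))
```

Recomputing the window features on the surviving packets then gives the loss-perturbed test instances.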
I consider the classification of both ET and VoIP traffic. These applications are not reactive to packet loss (no significant flow control or packet retransmissions occur at the application or transport layers). On the other hand, many Other applications are TCP-based; consequently, they exhibit network-layer retransmissions and adjust their sending rate in the presence of packet loss. The loss-reactive nature of TCP flows makes it hard to predict the impact of additional (synthetic) loss on previously calculated feature values. Therefore, in this preliminary study, the synthetic loss is applied only to the ET and VoIP traffic in the test dataset, not to the Other traffic. In so doing, my study focuses on evaluating the impact of packet loss on Recall for both ET and VoIP traffic. The experimental Precision results may be higher than the actual Precision achieved when packet loss also occurs with Other traffic. This work can be expanded in future with the application of more comprehensive packet loss models to both the application of interest and the Other interference traffic.
In terms of selecting a realistic loss rate to test, the review in [193] shows that, due to the enormous diversity of the Internet, few studies agree on the average packet loss rate and the average loss burst length (i.e. the number of packets lost in a row). Among the works reviewed in [193], covering the period 1998-1999, the average Internet packet loss was reported to vary between 0.36% and 11%, depending on the particular study ([194], [195], [196] and [197]). A 2007 study characterising residential broadband networks [198] reveals that both cable and DSL have remarkably low packet loss rates, of less than 1% for more than 95% of all broadband paths. The data was drawn from 1,894 broadband hosts from 11 major cable and DSL providers in North America and Europe. The studies of [199] and [200] indicate that a small packet loss of 1% or 2% would lead to dramatic degradation in TCP throughput. These findings suggest that an ISP should not tolerate greater loss on its access links, as it would likely trigger complaints from consumers. The analysis of my VoIP dataset also shows a very low packet loss rate.
My assumption is that the greater the packet loss, the greater the impact on feature values, and hence the more significant the possible impact on the classifier's performance. Therefore, I chose to study the impact of a 5% packet loss (total loss in both directions for bi-directional traffic), to approximate a reasonable upper bound on tolerable packet loss on an Internet access link. I apply a 5% packet loss to the ET dataset used in Chapters 5 and 7 and to the VoIP dataset used earlier in section 8.2. The impact of other loss rates remains for future research.
For a small sliding window of 25 packets, a 5% packet loss typically results in the loss of only 1-2 packets. This small loss therefore has little impact on ET and VoIP feature values. With ET traffic, the median of the features calculated with packet loss is only slightly greater than for those calculated on the original test dataset. For example, the median of the maximum packet inter-arrival time feature over a 25-packet window, for all sub-flow instances, is increased by only ∼2 ms (0.2%) compared to the original test dataset. Analysis of other features shows similar results.
With VoIP traffic, the median of the maximum packet inter-arrival time feature for all sub-flow instances increased by ∼18.7 ms (88.3%) 7 compared to the original test dataset. However, a 5% packet loss changes the mean packet inter-arrival time and minimum packet inter-arrival time features only very slightly. On average, with packet loss, the median of the minimum packet inter-arrival time feature for all sub-flow instances increased by only ∼0.8 ms (4%) compared to the original test dataset. A 5% packet loss also has little impact on the maximum, minimum and mean packet length features. This is because VoIP traffic is quite stable in terms of packet length and arrival statistics, so a small loss of 1-2 packets out of a 25-packet window has only a minor impact on these statistics.
The following sections report on the effects of these changes on the classification of ET and
VoIP traffic.
8.3.1 Impact of packet loss on the classification of ET traffic
Figure 8.13 shows Recall for the Naive Bayes classifier both with and without a 5% synthetic loss applied to the test dataset. The classification model is trained with SSP-ACT as described in Chapter 7. Results are recorded for each position of the sliding window, expressed as the number of packets (M) offset from the beginning of each flow.
As shown in Figures 8.13 and 8.14, a 5% synthetic loss applied to ET traffic caused the
median Recall for all M values to reduce by 0.45%, and the median Precision for all M values
to reduce by 0.4%.
With these slight degradations, the Naive Bayes classifier exhibits good median Recall and Precision of 98% and 86% respectively. This suggests that a Naive Bayes classifier trained using SSP-ACT could maintain its performance well with a loss rate of 5% applied to the test dataset.

7 This is not increased by 100% as in the example given in Figure 8.12, as a bigger gap caused by packet loss may still be less than the packet gap caused by silent periods within the sliding window.

Figure 8.13: ET Recall: Training with SSP-ACT and classifying with ET traffic experiencing 5% random packet loss - Naive Bayes classifier

Figure 8.14: ET Precision: Training with SSP-ACT and classifying with ET traffic experiencing 5% random packet loss - Naive Bayes classifier
Figures 8.15 and 8.16 show Recall and Precision for the C4.5 Decision Tree classifier both
with and without a 5% synthetic loss applied to the test dataset. On average, a 5% synthetic loss
applied to ET traffic caused the median Recall for all M values to reduce by only 0.5%, and the
median Precision for all M values to reduce by only 0.25%.
Figure 8.15: ET Recall: Training with SSP-ACT and classifying with ET traffic experiencing 5% random packet loss - C4.5 Decision Tree classifier
To summarise, despite the slight degradation due to a 5% packet loss, the C4.5 Decision Tree
classifier still demonstrates a good median Recall and Precision of 97% and 96.7% respectively.
This suggests the C4.5 Decision Tree classifier trained using SSP-ACT can also maintain its
performance well with a loss rate of 5%.
8.3.2 Impact of packet loss on the classification of VoIP traffic
Figure 8.17 depicts Recall for a Naive Bayes classifier trained using SSP-ACT and tested both with and without a 5% synthetic loss for VoIP traffic. The results are recorded for each position of the sliding window, expressed as the number of packets (M) offset from the beginning of each flow. A 5% synthetic loss applied to VoIP traffic does not have a noticeable impact on Recall for the Naive Bayes classifier. Recall remains the same, at more than 99.6%, for all M values.
Figure 8.16: ET Precision: Training with SSP-ACT and classifying with ET traffic experiencing 5% random packet loss - C4.5 Decision Tree classifier
Figure 8.17: VoIP Recall: Training with SSP-ACT and classifying with VoIP traffic experiencing 5% random packet loss - Naive Bayes classifier
Not surprisingly, there was no noticeable impact on Precision, as shown in Figure 8.18. This
is because Recall is maintained, and the same test dataset is used for Other traffic in both tests.
Figure 8.18: VoIP Precision: Training with SSP-ACT and classifying with VoIP traffic experiencing 5% random packet loss - Naive Bayes classifier
With a 5% packet loss, the Naive Bayes classifier trained using SSP-ACT maintains its performance in terms of both Precision and Recall for VoIP traffic. This is because, while packet loss does affect packet inter-arrival time statistics, the longer packet inter-arrival time caused by packet loss may simply look similar to a long packet gap due to silent periods during a voice conversation or jitter on the network. Flows used in training contain silence suppression periods, jitter, and even packet loss 8. For this reason, the classifier has a good chance of maintaining its performance even with a 5% packet loss.
However, this also depends on the internal construction of the particular ML algorithm. In contrast to the Naive Bayes classifier, the C4.5 Decision Tree classifier shows a significant negative impact of packet loss on Recall.
Figures 8.19 and 8.20 show Recall and Precision respectively for the C4.5 Decision Tree classifier.
8 Filtering out these flows in training would make the test results clearer; however, it would also reduce the number of instances for training significantly. Repeating the experiments with absolutely no packet loss in the training dataset is left for future work.

Figure 8.19: VoIP Recall: Training with SSP-ACT and classifying with VoIP traffic experiencing 5% random packet loss - C4.5 Decision Tree classifier

Figure 8.20: VoIP Precision: Training with SSP-ACT and classifying with VoIP traffic experiencing 5% random packet loss - C4.5 Decision Tree classifier

As shown in Figure 8.19, a 5% packet loss applied to VoIP traffic reduced median Recall by up to 8.5% for all M values. Median Precision was only slightly reduced, by 0.1%, due to
the variation in Recall (a reduction in the number of true positives) for VoIP traffic. Precision falls only slightly while Recall falls significantly because the number of false positives is very small relative to the number of false negatives. Despite these degradations, median Recall and Precision for the C4.5 Decision Tree classifier still remained above 87% and 99% respectively.
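The arithmetic behind this asymmetry can be illustrated with hypothetical counts (not taken from my dataset): when false positives are rare, even a sizeable loss of true positives moves Precision only marginally.

```python
# Hypothetical counts chosen only to illustrate the arithmetic.
tp, fp = 10_000, 80            # very few false positives (Precision ~ 99.2%)
print(f"before: precision={tp / (tp + fp):.4f}")

tp_lost = int(tp * 0.915)      # an 8.5% drop in Recall removes true positives
print(f"after:  precision={tp_lost / (tp_lost + fp):.4f}")
```

Since fp is small, both the numerator and the dominant part of the denominator shrink together, leaving the ratio nearly unchanged.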
The C4.5 Decision Tree classifier is more sensitive to packet loss than the Naive Bayes
classifier because of differences between the internal mechanisms of each ML algorithm. The
former builds a tree based on precise differences in feature values, while the latter builds a
model based on approximations. For the C4.5 Decision Tree, a small change in feature values
can lead to different sub-tree paths at test nodes within the tree, which can subsequently lead to
a different classification result [119] (as noted in the discussion in section 3.1.3 on the instability
of decision tree algorithms). The Naive Bayes classifier, on the other hand, is more tolerant of small variations in feature values. Deeper investigation into the sensitivity of the Naive Bayes and C4.5 Decision Tree classifiers to packet loss is left for future research.
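This difference in sensitivity can be illustrated with a toy one-feature model. The threshold, means, and standard deviations below are invented for illustration only and are unrelated to the classifiers actually trained in this chapter.

```python
import math

# Hypothetical per-class (mean, std dev) for a single feature, e.g. the
# maximum packet inter-arrival time in ms. Equal class priors assumed.
PARAMS = {"VoIP": (8.0, 3.0), "Other": (20.0, 5.0)}

def tree_predict(x, threshold=10.0):
    """One test node of a decision tree: a hard threshold on the feature."""
    return "VoIP" if x < threshold else "Other"

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def nb_predict(x):
    """One-feature Gaussian Naive Bayes: pick the class with higher likelihood."""
    return max(PARAMS, key=lambda c: gaussian_pdf(x, *PARAMS[c]))

# A small loss-induced shift in the feature (9.9 -> 10.2) crosses the tree's
# split threshold and flips its decision, while the Naive Bayes likelihoods
# shift only slightly and the predicted class is unchanged:
for x in (9.9, 10.2):
    print(x, tree_predict(x), nb_predict(x))
```

The brittleness comes from the hard threshold at a test node, not from the tree's overall accuracy; the Gaussian model degrades gracefully because its likelihoods are continuous in the feature value.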
My experimental results suggest that the SSP-ACT training approach is robust when classifying traffic experiencing a 5% packet loss. This work could be extended with the use of different loss models and packet loss rates. However, since I propose classifying on a small sliding window, a classifier seeing a burst of packet loss may perform as if classifying traffic whose initial packets were missed. Training using SSP-ACT can help the classifier maintain performance in this case. Further exploration of this issue, and evaluation of SSP-ACT with changes in other network environment parameters, is left for future work.
8.4 Concurrent classification of multiple applications with SSP-ACT
So far, SSP-ACT has only been applied to the identification of a single application of interest. For further development, it should be evaluated for the identification of multiple applications simultaneously.
This evaluation idea can be illustrated using the simple scenario presented in Figure 3.4. A home network generates Game (e.g. ET), VoIP and traffic from other common Internet protocols (such as P2P, Web, SSH and SMTP). Previously I considered only VoIP or only Game traffic as the priority traffic.
However, there may be cases where both VoIP and ET are specified as priority traffic. To improve the flexibility of users' options, VoIP and ET may also require different priority levels, in which case it might be desirable for them to be identified as two separate classes. This leads to the requirement for three-class classification: VoIP versus ET versus Other.
For the simultaneous identification of VoIP and ET traffic, there are a number of possible
solutions. In this section, I consider two possible options. The first option is to use a common
classifier to classify both types of traffic (I shall refer to this option as ‘Option A - Common
classifier’). The second option is to use two separate classifiers in parallel; each classifies one
type of traffic (I shall refer to this option as ‘Option B - Separate classifiers in parallel’). Figure
8.21 and Figure 8.22 illustrate Option A and Option B respectively.
In Option A, VoIP, ET and other traffic (such as Web, P2P, SSH and SMTP) are labelled
separately as VoIP, ET and Other classes to train the classifier. Only one classification model is
used to classify a new traffic flow as either VoIP or ET or Other traffic.
In Option B, two classifiers are required to classify a new flow as {ET or Other} and {VoIP or Other} traffic. The difference from Option A is that the task for each classifier is simpler: each classifier identifies only two classes, rather than three classes simultaneously. Option A thus may require a more powerful, centralised processing unit than Option B. A logical AND operation is needed to avoid the classification results of classifier 1 and classifier 2 overriding each other when identifying ET and VoIP traffic. For example, for an ET flow, classifier 2 would indicate 'ET' and classifier 1 would indicate 'Other'; the correct output for the overall system would be 'ET'.
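The combining logic of Option B can be sketched as follows. The function and label names are illustrative; the conflicting case, where both classifiers claim the same flow, is discussed at the end of this section.

```python
def combine(voip_result, et_result):
    """Combine the outputs of classifier 1 ({VoIP, Other}) and classifier 2
    ({ET, Other}): a flow is labelled Other only when BOTH classifiers say
    Other; otherwise the non-Other label wins."""
    if voip_result != "Other":
        return voip_result
    return et_result

print(combine("Other", "ET"))     # an ET flow: classifier 1 says Other
print(combine("VoIP", "Other"))   # a VoIP flow: classifier 2 says Other
print(combine("Other", "Other"))  # neither priority class matched
```

This is the AND operation shown in Figure 8.22: neither classifier's Other verdict is allowed to override the other's positive identification.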
In order to decide which option to use, we need to compare and evaluate the performance of classifiers trained using the two options, based on the operational challenges addressed in section 3.3.2. My work as outlined in section 7.4.3 offers insights into the pros and cons of four-class classification versus two-class classification. In the case of section 7.4.3, four-class classification results in better trade-offs between Precision and Recall 9. However, it comes with a longer model training time and a slower classification speed. This finding is consistent with my results reported below.

9 The level of improvement is dependent on the particularities of the ML algorithm.
Figure 8.21: Training for VoIP and ET traffic identification: Option A - Common classifier
Figure 8.22: Training for VoIP and ET traffic identification: Option B - Separate classifiers in parallel
(a) Recall - Naive Bayes classifier
(b) Precision - Naive Bayes classifier
Figure 8.23: VoIP Recall and Precision: Naive Bayes classifier
(a) Recall - C4.5 Decision Tree classifier
(b) Precision - C4.5 Decision Tree classifier
Figure 8.24: VoIP Recall and Precision: C4.5 Decision Tree classifier
Figures 8.23 and 8.24 present Recall and Precision respectively for VoIP traffic identified by the Naive Bayes and C4.5 Decision Tree classifiers trained using SSP-ACT. As seen in the figures, Option A results in better trade-offs between Precision and Recall for both the Naive Bayes and C4.5 Decision Tree classifiers.
In the case of the Naive Bayes classifier, Option A has similar Recall to Option B but slightly better Precision, due to the reduction in the number of false positives. For the C4.5 Decision Tree classifier, Option A's Precision is slightly lower than Option B's, yet its Recall is significantly higher. This result is consistent with the findings in section 7.4.3 and occurs for the same reasons. In Option B, ET traffic samples and other traffic samples are merged into a single Other class to train a classifier. This may create an unwanted area of feature values in training the classifier. In Option A, this unwanted area is removed, which helps improve Precision for the Naive Bayes classifier, and improves Recall for the C4.5 Decision Tree classifier with a slight trade-off in Precision.
(a) Normalised Model Build Time
(b) Normalised Classification Speed
Figure 8.25: Computational performance for Naive Bayes and C4.5 Decision Tree classifiers trained with Option A and Option B
Figure 8.25 presents the normalised model build time and classification speed for the Naive Bayes and C4.5 Decision Tree classifiers with Option A and Option B. The value of 1 represents the longest time taken to build a classification model (789 seconds), and the highest classification speed (19,073 instances per second).
In terms of required model build time, training a Naive Bayes classifier using Option A takes approximately 10% longer than using Option B. For the C4.5 Decision Tree classifier, training using Option A takes approximately 60% longer than training a single Option B classifier to identify VoIP traffic. Yet Option B requires training two classifiers, for both ET and VoIP traffic, which may double the model build time of a single classifier.
In terms of classification speed, using Option A with the Naive Bayes classifier slows down
the classifier slightly (∼ 5%). With the C4.5 Decision Tree classifier, using Option A slows
down the classification speed significantly, by ∼ 60%. The slower classification speed when
using Option A could become an issue when scaling to classify a large number of applications
simultaneously.
Besides accuracy and computational performance, the differences in statistical traffic characteristics and QoS requirements of the applications of interest also need to be considered. Each application may have a particular optimal classification window size, which would influence trade-offs between classification timeliness, Recall and Precision, and the computational performance of the classification. For example, my studies indicate that ET and VoIP require only 25 packets for high-Precision and high-Recall classification. However, another application, such as video conferencing, might require a different number of packets to obtain an acceptably accurate and timely classification. A common classifier in Option A must balance the trade-offs between different performance parameters for all applications, while the individual classifiers in Option B have better opportunities to optimise the parameters for each individual application.
Another drawback of Option B is that there may be conflicts between classification results. For example, a flow might be classified as VoIP by classifier 1 and as ET by classifier 2. The solution to such a situation is implementation specific. For example, a flow that produces conflicting classification results could be assigned to the class with the lower priority of the two.
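A minimal sketch of this priority-based resolution follows; the numeric priority table is my own illustrative construct, not part of the thesis's implementation.

```python
# Hypothetical priority table: a smaller number means a higher priority.
PRIORITY = {"VoIP": 1, "ET": 2}

def resolve_conflict(result_1, result_2):
    """When classifier 1 and classifier 2 each claim the same flow (e.g.
    VoIP and ET), assign it to the class with the LOWER priority of the
    two: a conservative choice that avoids wrongly promoting traffic to
    a higher-priority class."""
    return max((result_1, result_2), key=lambda c: PRIORITY[c])

print(resolve_conflict("VoIP", "ET"))
```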
Table 8.1 summarises the pros and cons of both options.
My preliminary results and analysis suggest that a common classifier can be used to classify multiple applications. (In other words, it is not necessary to have a separate classifier for each application.) However, this approach has pros and cons compared with using a separate classifier for each individual application. Taking these into account, Option A may still be the simpler and more effective choice if its classification speed is acceptable for a particular purpose. Furthermore, as suggested by [158], one- or two-class classification could be sufficient for QoS-enabled Internet access networks.
Table 8.1: Comparison of the pros and cons of Option A (common classifier) versus Option B (separate classifiers in parallel)

Option A: Common classifier
• Provides better balance in trade-offs between Precision and Recall
• Slower classification speed, which can be an issue when scaling to a great number of applications
• A single classifier requires updating when the classification model needs to be updated
• Model building and classification tasks are concentrated in a single processing component, which may require a powerful and expensive processing unit when scaling to a great number of applications
• Must use the same sliding window for all applications, making it harder to optimise performance for individual applications
• No conflicts in classification results

Option B: Separate classifiers in parallel
• Provides worse balance in trade-offs between Precision and Recall
• Faster classification speed
• All individual classifiers require updating when the classification models need to be updated
• Training and classification workload can be divided among multiple processing components; cheaper components can be used in parallel
• Flexible in optimising for individual applications
• Possible conflicts in classification results
8.5 Discussion
Since this chapter has outlined my preliminary study of the generality and robustness of SSP-
ACT, there are a number of limitations in my current experimental approach. Further improve-
ment can be made in the following areas:
• In evaluating SSP-ACT’s robustness, only a single 5% random, independent loss rate has
been studied. Future work could expand on this through an evaluation covering a wide range
of loss rates and other loss patterns (such as the Gilbert model described in [192]).
Furthermore, in my experiments packet loss was applied only to the traffic of interest. This
work could be extended to study the performance of SSP-ACT when packet loss affects other
traffic as well.
• SSP-ACT has been tested with only a limited number of sliding window positions. Its
stability could be evaluated further through more exhaustive testing of window positions
throughout the flow’s lifetime.
• In evaluating SSP-ACT’s scalability, I have studied its performance in classifying up
to three classes (VoIP versus ET versus Other) simultaneously. How SSP-ACT would scale to
classify a large number (e.g. hundreds or thousands) of applications simultaneously is a
question that requires further study. Although it is not clear that any business scenario
requires the identification of hundreds of applications or QoS classes simultaneously, it
would be of interest for future work to evaluate the trade-offs between accuracy and
computational performance of SSP-ACT at that scale. Even in the scenario of two QoS classes,
how many applications can be grouped into a single class before classification accuracy and
computational performance degrade is a valuable subject for further research.
• In my experiments a sliding window of 25 packets worked well in terms of classification
timeliness, Precision and Recall for both ET and VoIP traffic. However, another application
may have different requirements, calling for a different optimal classification window size.
An optimal window size should balance the trade-offs between the classifier’s Precision and
Recall, classification timeliness, classification speed and processing overhead.
Nevertheless, using a common classifier to classify multiple applications simultaneously
requires the same sliding window for all applications. Detailed characterisation of this
trade-off is a subject for future research.
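To make the loss-pattern discussion above concrete, the sketch below generates per-packet loss indicators from a two-state Gilbert model. The parameterisation is a common textbook form and not necessarily the exact variant of [192]; with equal per-state drop probabilities it degenerates to the random, independent loss used in this chapter.

```python
import random

def gilbert_loss_pattern(n, p_gb, p_bg, loss_good=0.0, loss_bad=1.0, seed=1):
    """Per-packet loss indicators (True = lost) from a two-state Gilbert model.

    The chain starts in the 'good' state. p_gb and p_bg are the
    good->bad and bad->good transition probabilities; each state drops
    packets with its own probability. With loss_good == loss_bad == q the
    pattern reduces to random, independent loss at rate q.
    """
    rng = random.Random(seed)
    bad = False
    lost = []
    for _ in range(n):
        lost.append(rng.random() < (loss_bad if bad else loss_good))
        # State transition before the next packet.
        if bad:
            bad = not (rng.random() < p_bg)
        else:
            bad = rng.random() < p_gb
    return lost
```

Unlike independent loss, the Gilbert model produces bursts of consecutive losses (the long-run loss rate here is governed by the stationary bad-state probability p_gb / (p_gb + p_bg)), which is precisely the kind of pattern future robustness testing of SSP-ACT could cover.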
8.6 Conclusion
In this chapter, I have demonstrated the effectiveness of SSP-ACT in identifying VoIP traffic.
Training using SSP-ACT produces a classifier that is accurate in terms of both Precision and
Recall, even when the VoIP traffic to be identified has statistical properties that vary over
the flows’ lifetimes and are asymmetric in the forward and reverse directions.
I have evaluated the robustness of SSP-ACT when classifying with synthetic packet loss.
Both the Naive Bayes and C4.5 Decision Tree classifiers maintain their performance well with
a 5% synthetic packet loss applied to the test dataset. Evaluating SSP-ACT with a larger range
of applications and other loss, delay and jitter models remains for future work.
I also considered the use of SSP-ACT for classifying multiple applications simultaneously. My
preliminary results suggest that it is possible to use a common classifier for the
classification of multiple applications. However, this approach entails pros and cons versus
using a separate classifier for each individual application. In my experiments, the common
classifier provided a better balance in the trade-offs between Precision and Recall, and is
easier to update. However, it is slower in classification speed, which can be an issue when
scaling to a greater number of applications, and it uses the same sliding window size for all
applications, which makes it harder to optimise performance for individual applications. The
requirement for a powerful central processing unit may also be an issue at larger scales.
Deeper study of this subject would be valuable future research.
Chapter 9
Conclusion
Real-time traffic classification has the potential to solve difficult network management prob-
lems, in the areas of QoS provisioning, Internet accounting and charging, and lawful intercep-
tion. The traditional IP traffic classification (IPTC) techniques that rely mostly on destination
port numbers or ‘deep packet inspection’ are becoming less effective. This has motivated the
development of new approaches that classify traffic by learning and recognising statistical
patterns in the externally observable attributes of the traffic[1].
My literature review on the use of Machine Learning (ML) in IPTC suggests that ML-based
IPTC has great potential as a new and robust approach. Previously proposed work on ML-based
IPTC has shown very promising results. However, it has not considered the practical constraints
of deployment in real-life operational networks. My thesis, therefore, has filled this research
gap and shown that statistically based IPTC using ML is a feasible and practical approach.
In this thesis I have proposed and demonstrated a novel solution in which an ML classifier is
trained using a set of short sub-flows extracted from full-flows generated by the target
application(s), coupled with their mirror-image replicas. My proposal (referred to as SSP-ACT)
was illustrated by considering an ISP that wishes to automatically and quickly detect online
interactive game traffic (ET) or voice (VoIP) traffic mingled in among regular consumer IP
traffic. The results presented in Chapters 5 to 8 revealed that, using a sliding window of 25
packets, the Naive Bayes classifier achieved 98.9% median Recall and 87% median Precision
when classifying ET traffic, and 100% median Recall and 95.4% median Precision when clas-
sifying VoIP traffic. The C4.5 Decision Tree classifier achieved 99.3% median Recall and 97%
[1] Elaboration on these issues was presented in Chapter 2.
median Precision when classifying ET traffic, and 95.7% median Recall and 99.2% Precision
when classifying VoIP traffic[2]. Both classifiers maintained their performance well
regardless of how many packets were missed from the beginning of each flow, and regardless of
the direction of the first packet among the most recent N packets used for the classification.
Compared with the poor performance of classifiers trained on full-flows – a common method in
the literature – these results indicate that SSP-ACT is a significant improvement over the
previously published state of the art in IP traffic classification. Although the experiments
were confined to online game and VoIP applications, my results suggest a potential solution
for the accurate and timely classification of traffic belonging to other Internet applications.
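The core of the proposal can be sketched as follows (Python; the packet representation and field names are illustrative, not the thesis's actual data format). Short N-packet sub-flows are cut from a full-flow at chosen packet offsets, and each is paired with a mirror-image replica in which the forward and reverse directions are swapped:

```python
def subflows_with_mirrors(flow, offsets, n=25):
    """Extract n-packet sub-flows at given packet offsets, plus mirrors.

    `flow` is a list of (direction, length) tuples with direction 'fwd'
    or 'rev'. The mirror-image replica of each sub-flow swaps the two
    directions.
    """
    out = []
    for off in offsets:
        sub = flow[off:off + n]
        if len(sub) < n:
            continue  # not enough packets left for a full sub-flow
        mirror = [("rev" if d == "fwd" else "fwd", size) for d, size in sub]
        out.append(sub)
        out.append(mirror)
    return out
```

Training on both a sub-flow and its mirror is what makes the resulting classifier insensitive to the direction of the first observed packet.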
Furthermore, I have proposed and demonstrated a novel approach that uses unsupervised ML
clustering techniques to choose appropriate, representative sub-flows from which a classifier
may be trained. This extension is significant for deploying SSP-ACT to classify new
applications of interest: it eliminates the need for expert knowledge of the application and
removes the complexity of manually choosing the best combination of sub-flows to train a
classifier. This approach has been demonstrated using the EM algorithm.
In Chapter 6 I showed that manual selection of sub-flows for training was not necessary in
general cases. Instead, sub-flows automatically selected by the EM algorithm could produce
classifiers with slightly better Precision and Recall, and minor trade-offs in classification speed
and computational complexity. I further showed that the clustering process for optimal sub-flow
selection can be up to 99% faster, yet still result in acceptable SSP-ACT classifier performance,
when sub-flow selection is performed on small samples of sub-flow instances (for example, 50
instances per sub-flow).
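As a toy illustration of the clustering step, the sketch below fits a one-dimensional, k-component Gaussian mixture by EM. It is a deliberately simplified stand-in: the actual sub-flow selection clusters multivariate feature vectors, and the initialisation scheme here (evenly spaced order statistics, for determinism) is my own choice, not the thesis's:

```python
import math
import statistics

def em_1d(xs, k=2, iters=100):
    """Fit a k-component 1-D Gaussian mixture by EM.

    Returns (weights, means, variances).
    """
    srt = sorted(xs)
    # Deterministic init: means on evenly spaced order statistics.
    if k > 1:
        mu = [srt[(j * (len(xs) - 1)) // (k - 1)] for j in range(k)]
    else:
        mu = [srt[len(xs) // 2]]
    var = [max(1e-6, statistics.pvariance(xs))] * k
    w = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [w[j] / math.sqrt(2 * math.pi * var[j])
                    * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                    for j in range(k)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = max(1e-6, sum(r[j] * (x - mu[j]) ** 2
                                   for r, x in zip(resp, xs)) / nj)
    return w, mu, var
```

Once the mixture is fitted, each sub-flow instance can be assigned to its most likely component, and sub-flows whose instances concentrate in distinct components are the natural candidates for training.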
I have also briefly explored the impact of packet loss on the Naive Bayes and C4.5 Decision
Tree classifiers trained using SSP-ACT. As presented in Chapter 8, with 5% random, independent
synthetic packet loss, both classifiers maintained their
performance well. For ET traffic, a 5% packet loss only degraded Recall and Precision of both
classifiers by less than 0.5%. For VoIP traffic, a 5% packet loss did not produce any notice-
able degradation on the Naive Bayes classifier’s Recall and Precision. However, it degraded the
C4.5 Decision Tree classifier’s Recall and Precision by 8.5% and 0.1% respectively. Despite
this degradation, median Recall and Precision of the C4.5 Decision Tree classifier remained
above 87% and 99% respectively for all tested positions of the sliding window. Deeper
investigation into the sensitivity of the Naive Bayes and C4.5 Decision Tree classifiers to
packet loss, with other loss rates and loss models, is left for future research.

[2] ET flows made up from 11.9% to 17.1% of the traffic mix in the test datasets; VoIP flows made up from 1.1% to 2.4%.
Finally, I have discussed the pros and cons of using a common classifier (Option A) versus
multiple classifiers in parallel (Option B) to identify multiple applications simultaneously.
My initial study, presented in Chapter 8, investigated the concurrent classification of ET,
VoIP and other traffic. I showed that Option A produced better Precision and Recall for both
the Naive Bayes and C4.5 Decision Tree classifiers. Furthermore, this option does not carry
Option B’s potential for conflicting classification results, and allows the classifier to be
updated more easily. Yet Option A does come at a significant cost in classification speed and
computational complexity. For example, Option A required more than 60% additional time to
build a C4.5 Decision Tree classifier, and the resulting classifier was 60% slower in
classification speed than under Option B. Using a common classifier for all applications also
makes it more difficult to optimise performance for individual applications. This initial
work can serve as a starting point for future investigation of these issues.
My work can be extended in a number of future research directions. These include:
• Characterising the optimal sliding classification window size (N) for a wider range of
applications (as discussed in sections 5.5 and 8.5);
• Identifying how varying N affects classification accuracy, classification timeliness, clas-
sification speed, and the stability of results for continuous classification (as discussed in
section 5.5);
• Evaluating the stability of classification accuracy in the presence of network perturba-
tions, such as packet loss, delay and packet re-ordering (as briefly outlined in section
8.3);
• Evaluating the impact on classification accuracy of different traffic mixes and the em-
ployment of different ML classification and clustering algorithms (as briefly mentioned
in sections 5.3.4, 5.5, and E.1);
• Expanding SSP-ACT for the recognition of new and unknown applications (as discussed
in section 6.6); and
• Evaluating the scalability of SSP-ACT to classify a large number (for example, hundreds)
of applications simultaneously (as discussed in section 8.5).
In summary, my thesis has opened a new path for research on optimising the use of ML
classifiers in real-time IP traffic classification. I believe my proposal will assist the
adoption of ML algorithms inside practical, deployable IP traffic classifiers.
Bibliography
[1] B. Leiner, V. G. Cerf, D. Clark, R. E. Kahn, L. Kleinrock, D. Lynch, J. Postel, L. G.
Roberts, and S. Wolff, “The past and future history of the Internet,” Communications of
the ACM, vol. 40, no. 2, pp. 102–108, 1997.
[2] (2006, January) Worldwide Internet users top 1 billion in 2005. eTForecasts. [Online].
Available: http://www.etforecasts.com/pr/pr106.htm [Last accessed: 2009, 22 February].
[3] J. Zhu and E. Wang, “Diffusion, use, and effect of the Internet in China,” Communica-
tions of the ACM, vol. 48, no. 4, pp. 49–53, 2005.
[4] G. Huston. (2005, March) IPv4 Address utilization. [Online]. Available: http:
//www.potaroo.net/papers/2005-03-ipv4.pdf [Last accessed: 2009, 22 February].
[5] G. Huston. (2005, June) The BGP report for 2005. [Online]. Available: http:
//www.potaroo.net/ispcol/2006-06/bgpupds.html [Last accessed: 2009, 22 February].
[6] M. Fomenkov, K. Keys, D. Moore, and K. Claffy, “Longitudinal study of Internet traffic
in 1998-2003,” in WISICT ’04: Proceedings of the winter international symposium on
Information and communication technologies. Cancun, Mexico: Trinity College Dublin,
2004, pp. 1–6.
[7] R. Kraut, T. Mukhopadhyay, J. Szczypula, S. Kiesler, and W. Scherlis, “Information and
communication: Alternative uses of the Internet in households,” Information Systems
Research, vol. 10, no. 4, pp. 287–303, 1999.
[8] S. E. Stern, S. Gregor, M. A. Martin, S. Goode, and J. Rolfe, “A classification tree anal-
ysis of broadband adoption in Australian households,” in ICEC ’04: Proceedings of the
6th international conference on Electronic commerce. Delft, The Netherlands: ACM
Press, October 2004, pp. 451–456.
[9] E. Castronova, “Network technology, markets, and the growth of synthetic worlds,” in
NetGames ’03: Proceedings of the 2nd workshop on Network and system support for
games. Redwood City, California: ACM Press, 2003, pp. 121–134.
[10] L. Chen, M. L. Gillenson, and D. L. Sherrell, “Consumer acceptance of virtual stores:
a theoretical model and critical success factors for virtual stores,” SIGMIS Database,
vol. 35, no. 2, pp. 8–31, 2004.
[11] Y. Chen and J. H. Rankin, “A framework for benchmarking e-procurement in the AEC
industry,” in ICEC 06: Proceedings of the 8th international conference on Electronic
commerce. Fredericton, New Brunswick, Canada: ACM, August 2006, pp. 411–419.
[12] K. Kim and B. Prabhakar, “Initial trust and the adoption of B2C e-commerce: The case
of Internet banking,” SIGMIS Database, vol. 35, no. 2, pp. 50–64, 2004.
[13] S. Bolin, “E-commerce: a market analysis and prognostication,” StandardView, vol. 6,
no. 3, pp. 97–105, 1998.
[14] J. Yang and G. Miao, “The estimates and forecasts of worldwide e-commerce,” in ICEC
’05: Proceedings of the 7th international conference on Electronic commerce. Xi’an,
China: ACM, 2005, pp. 52–56.
[15] A. Ginsberg, P. Hodge, T. Lindstrom, B. Sampieri, and D. Shiau, “The little Web school-
house: using virtual rooms to create a multimedia distance learning environment,” in
MULTIMEDIA 98: Proceedings of the sixth ACM international conference on Multime-
dia. Bristol, United Kingdom: ACM, September 1998, pp. 89–98.
[16] E. McLoughlin, D. O’Sullivan, M. Bertolotto, and D. C. Wilson, “MEDIC: Mobile di-
agnosis for improved care,” in SAC 06: Proceedings of the 2006 ACM symposium on
Applied computing. Dijon, France: ACM Press, April 2006, pp. 204–208.
[17] I. Tomkos and A. Tzanakaki, “Towards digital optical networks,” in Proceedings of 7th
International Conference on Transparent Optical Networks, 2005, vol. 1, Barcelona,
Spain, July 2005, pp. 1–4.
[18] R. Alferness, “The all-optical networks,” in International Conference on Communication
Technology Proceedings (WCC - ICCT 2000), 2000., vol. 1, Beijing, China, August 2000,
pp. 14–15.
[19] (2006, July) Service provider Quality-of-Service overview. Cisco. [Online]. Available:
http://www.cisco.com/warp/public/cc/so/neso/sqso/spqos wp.pdf [Last accessed: 2009,
22 February].
[20] A. Bouch, A. Kuchinsky, and N. Bhatti, “Quality is in the eye of the beholder: meet-
ing users’ requirements for Internet quality of service,” in CHI ’00: Proceedings of the
SIGCHI conference on Human factors in computing systems. The Hague, The Nether-
lands: ACM Press, April 2000, pp. 297–304.
[21] ITU-T, “ITU-T Recommendation G.114: One-way transmission time,” ITU-T G.114
Standard, International Telecommunication Union, 1996.
[22] G. Armitage, “An experimental estimation of latency sensitivity in multiplayer Quake3,”
in The 11th IEEE International Conference on Networks (ICON2003), 2003., Sydney,
Australia, September 2003, pp. 137–141.
[23] J. Nichols and M. Claypool, “The effects of latency on online madden NFL football,”
in NOSSDAV 04: Proceedings of the 14th international workshop on Network and oper-
ating systems support for digital audio and video. New York, NY, USA: ACM Press,
2004, pp. 146–151.
[24] M. Claypool and J. Tanner, “The effects of jitter on the perceptual quality of video,”
in MULTIMEDIA ’99: Proceedings of the seventh ACM international conference on
Multimedia (Part 2). Orlando, Florida, United States: ACM Press, September 1999, pp.
115–118.
[25] G. Armitage and L. Stewart, “Limitations of using real-world, public servers to estimate
jitter tolerance of first person shooter games,” in ACE ’04: Proceedings of the 2004 ACM
SIGCHI International Conference on Advances in computer entertainment technology.
Singapore: ACM Press, June 2004, pp. 257–262.
[26] T. Henderson and S. Bhatti, “Networked games: a QoS-sensitive application for QoS-
insensitive users?” in RIPQoS ’03: Proceedings of the ACM SIGCOMM workshop on
Revisiting IP QoS. Karlsruhe, Germany: ACM Press, 2003, pp. 141–147.
[27] G. Armitage and L. Stewart, “Some thoughts on emulating jitter for user experience
trials,” in NetGames 04: Proceedings of 3rd ACM SIGCOMM workshop on Network
and system support for games. Portland, Oregon, USA: ACM Press, August 2004, pp.
157–160.
[28] S. Zander and G. Armitage, “Empirically measuring the QoS sensitivity of interactive
online game players,” in Proceedings of Australian Telecommunications and Network
Application Conference (ATNAC), December 2004.
[29] M. Dick, O. Wellnitz, and L. Wolf, “Analysis of factors affecting players’ performance
and perception in multiplayer games,” in NetGames ’05: Proceedings of 4th ACM SIG-
COMM workshop on Network and system support for games. Hawthorne, NY: ACM,
October 2005, pp. 1–7.
[30] R. Braden, D. Clark, and S. Shenker, “Integrated Services in the Internet architecture: an
overview,” RFC 1633, IETF, 1994.
[31] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, “An architecture for
Differentiated Services,” RFC 2475, IETF, 1998.
[32] E. Rosen, A. Viswanathan, and R. Callon, “Multiprotocol Label Switching Architecture,”
RFC 3031, IETF, 2001.
[33] L. Stewart, G. Armitage, P. Branch, and S. Zander, “An architecture for automated net-
work control of QoS over consumer broadband links,” in IEEE TENCON 05, Melbourne,
Australia, November 2005.
[34] M. Oliveira and T. Henderson, “What online gamers really think of the Internet?” in
NetGames ’03: Proceedings of the 2nd workshop on Network and system support for
games. New York, NY, USA: ACM Press, 2003.
[35] J. But, N. Williams, S. Zander, L. Stewart, and G. Armitage, “ANGEL - Automated net-
work games enhancement layer,” in NetGames ’06: Proceedings of 5th ACM SIGCOMM
workshop on Network and system support for games. Singapore: ACM, October 2006,
p. 9.
[36] T. Nguyen and G. Armitage, “Evaluating Internet pricing schemes - A three dimensional
visual model,” ETRI Journal, vol. 27, no. 1, pp. 64–74, February 2005.
[37] T. Nguyen and G. Armitage, “Pricing the Internet - A visual 3-dimensional evaluation
model,” in Australian Telecommunications Networks & Applications Conference 2003
(ATNAC 2003), Melbourne, Australia, December 2003.
[38] M. Karsten, J. Schmitt, C. L. Wolf, and R. Steinmetz, “Cost and price calculation for
Internet Integrated Services,” in Kommunikation in Verteilten Systemen. Springer, 1999,
pp. 46–57.
[39] J. R. Edell and P. P. Varaiya, “Providing Internet access: What we learn from INDEX,”
IEEE Network, vol. 13, no. 5, pp. 18–25, September/October 1999.
[40] B. Stiller, T. Braun, M. Gunter, and B. Plattner, “The CATI Project: Charging and ac-
counting technology for the Internet,” in ECMAST ’99: Proceedings of the 4th European
Conference on Multimedia Applications, Services and Techniques. Springer-Verlag,
1999, pp. 281–296.
[41] J. Frank, “Machine learning and intrusion detection: Current and future directions,” in
Proceedings of the 17th National Computer Security Conference, Baltimore, MD, Octo-
ber 1994.
[42] Bro Intrusion Detection System – Bro Overview, Lawrence Berkeley National
Laboratory, April 2006. [Online]. Available: http://bro-ids.org. [Last accessed: 2009, 22
February].
[43] P. Branch, “Lawful Interception of the Internet,” Australian Journal of Emerging Tech-
nologies and Society, 2003.
[44] A. Milanovic, S. Srbljic, I. Raznjevic, D. Sladden, I. Matosevic, and D. Skrobo, “Meth-
ods for Lawful Interception in IP telephony networks based on H.323,” in Computer as
a Tool. The IEEE Region 8. EUROCON 2003, vol. 1, September 2003, pp. 198–202.
[45] A. Rojas and P. Branch, “Lawful Interception based on sniffers in Next Generation Net-
works,” in Australian Telecommunications Networks & Applications Conference 2004
(ATNAC2004), Sydney, Australia, December 8-10 2004.
[46] P. Branch, A. Pavlicic, and G. Armitage, “Using MAC addresses in the Lawful Intercep-
tion of IP traffic,” in Australian Telecommunications Networks & Applications Confer-
ence 2004 (ATNAC2004), Sydney, Australia, December 2004.
[47] Wolfenstein Enemy Territory, February 2009. [Online]. Available: http://enemy-territory.
4players.de:1041/news.php [Last accessed: 2005, December].
[48] Snort – the de facto standard for intrusion detection/prevention, Sourcefire, Inc., April
2006. [Online]. Available: http://www.snort.org [Last accessed: 2009, 22 February].
[49] V. Paxson, “Bro: A system for detecting network intruders in real-time,” Computer Net-
works, no. 31 (23-24), pp. 2435–2463, 1999.
[50] F. Baker, B. Foster, and C. Sharp, “Cisco architecture for Lawful Intercept in IP net-
works,” RFC 3924, Internet Engineering Task Force IETF, October 2004.
[51] T. Karagiannis, A. Broido, N. Brownlee, and K. Claffy, “Is P2P dying or just hiding?” in
IEEE Global Telecommunications Conference (GLOBECOM ’04), 2004., vol. 3, Dallas,
Texas, USA, November/December 2004, pp. 1532–1538.
[52] S. Sen, O. Spatscheck, and D. Wang, “Accurate, scalable in-network identification of P2P
traffic using application signatures,” in WWW ’04: Proceedings of the 13th international
conference on World Wide Web. New York, NY, USA: ACM, May 2004, pp. 512–521.
[53] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINC: multilevel traffic classifica-
tion in the dark,” in SIGCOMM ’05: Proceedings of the 2005 conference on Applications,
technologies, architectures, and protocols for computer communications. Philadelphia,
Pennsylvania, USA: ACM, August 2005, pp. 229–240.
[54] D. Bonfiglio, M. Mellia, M. Meo, D. Rossi, and P. Tofanelli, “Revealing Skype traffic:
when randomness plays with you,” in SIGCOMM ’07: Proceedings of the 2007 confer-
ence on Applications, technologies, architectures, and protocols for computer communi-
cations. Kyoto, Japan: ACM, August 2007, pp. 37–48.
[55] K. Papagiannaki, N. Taft, S. Bhattacharyya, P. Thiran, K. Salamatian, and C. Diot, “A
pragmatic definition of elephants in Internet backbone traffic,” in IMW ’02: Proceedings
of the 2nd ACM SIGCOMM Workshop on Internet measurement. Marseille, France:
ACM, 2002, pp. 175–176.
[56] N. Brownlee and K. Claffy, “Understanding Internet traffic streams: Dragonflies and
tortoises,” IEEE Communications Magazine, vol. 40, no. 10, pp. 110–117, 2002.
[57] S. Sarvotham, R. Riedi, and R. Baraniuk, “Connection-level analysis and modeling of
network traffic,” in IMW ’01: Proceedings of the 1st ACM SIGCOMM Workshop on
Internet Measurement. San Francisco, California, USA: ACM, 2001, pp. 99–103.
[58] A. Soule, K. Salamatian, N. Taft, R. Emilion, and K. Papagiannaki, “Flow classification
by histograms or how to go on safari in the Internet,” in ACM SIGMETRICS Performance
Evaluation Review, vol. 32, no. 1. New York, NY, USA: ACM, 2004, pp. 49–60.
[59] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, “Flow clustering using machine learn-
ing techniques,” in Passive and Active Measurement (PAM) Conference, 2004, Antibes
Juan-les-Pins, France, April 2004.
[60] S. Zander, T. Nguyen, and G. Armitage, “Automated traffic classification and application
identification using machine learning,” in IEEE 30th Conference on Local Computer
Networks (LCN 2005), Sydney, Australia, November 2005, pp. 250–257.
[61] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, “Class-of-service mapping for
QoS: a statistical signature-based approach to IP traffic classification,” in IMC ’04: Pro-
ceedings of the 4th ACM SIGCOMM conference on Internet measurement. Taormina,
Sicily, Italy: ACM, October 2004, pp. 135–148.
[62] B. Choi, S. Moon, Z. Zhang, K. Papagiannaki, and C. Diot, “Analysis of point-to-point
packet delay in an operational network,” in INFOCOM 2004. Twenty-third Annual Joint
Conference of the IEEE Computer and Communications Societies, Hong Kong, March
2004, pp. 1797–1807.
[63] T. Nguyen and G. Armitage, “Experimentally derived interactions between TCP traffic
and service quality over DOCSIS cable links,” in IEEE Global Telecommunications Con-
ference (GLOBECOM ’04), 2004., vol. 3, Texas, USA, November/December 2004, pp.
1314–1318.
[64] T. Nguyen and G. Armitage, “Quantitative assessment of IP service quality in 802.11b
and DOCSIS networks,” in Proceedings of Australian Telecommunications Networks &
Applications Conference 2004 (ATNAC 2004), Sydney, Australia, December 2004.
[65] T. Nguyen and G. Armitage, “Quantitative assessment of IP service quality in 802.11b
networks,” in Telecommunications and Signal Processing (WITSP’04), Adelaide, Aus-
tralia, December 2004.
[66] CableLabs, “Data-Over-Cable Service Interface Specifications Radio Frequency Inter-
face Specification,” SP-RFIv1.1-I01-990311, 1999.
[67] IEEE, “ANSI/IEEE Std 802.11, 1999 edition,” ISO/IEC 8802-11: 1999, 1999.
[68] R. Braden, D. Clark, and S. Shenker, “Resource Reservation Protocol (RSVP) version 1
functional aspects,” RFC 2205, IETF, 1997.
[69] G. Armitage, Quality of Service In IP Networks: Foundations for a Multi-Service Inter-
net. Macmillan Technical Publishing, 2000.
[70] StreamEngine, Ubicom, June 2008. [Online]. Available: http://streamengine.ubicom.
com/ [Last accessed: 2009, 22 February].
[71] (2008, June) D-link GamerLounge - Product Categories. D-Link. [Online]. Available:
http://games.dlink.com/products/?pid=370&#DGL-4300 [Last accessed: 2009, 22
February].
[72] (2008, June) AutoQoS for Voice over IP (voip). Cisco White Paper. [Online]. Available:
http://www.cisco.com/warp/public/732/Tech/qos/docs/autoqos wp.pdf. [Last accessed:
2008, 28 June].
[73] (2008, June) Solutions: Service control. Allot Communications. [On-
line]. Available: http://www.allot.com/index.php?option=com content&task=view&id=
51&Itemid=51 [Last accessed: 2008, June].
[74] (2008) Exinda - WAN Optimization, WAN Acceleration, Application Acceleration,
Unified Performance Management. Exinda Networks. [Online]. Available: http:
//www.exinda.com/public/products/products.html. [Last accessed: 2009, 22 February].
[75] NetIntact, NetIntact, 2008. [Online]. Available: http://www.netintact.com/ [Last
accessed: 2008, June].
[76] J. But, T. Nguyen, L. Stewart, N. Williams, and G. Armitage, “Performance analysis of
the ANGEL system for automated control of game traffic prioritisation,” in NetGames
’07: Proceedings of the 6th ACM SIGCOMM workshop on Network and system support
for games. Melbourne, Australia: ACM, September 2007, pp. 123–128.
[77] (2008, June) How StreamEngine works. Ubicom. [Online]. Available: http:
//streamengine.ubicom.com/html/activity.cfm?page=how streamengine works [Last ac-
cessed: 2009, 22 February].
[78] L. Burgstahler, K. Dolzer, C. Hauser, J. Jahnert, S. Junghans, C. Macian, and W. Payer,
“Beyond technology: the missing pieces for QoS success,” in RIPQoS ’03: Proceedings
of the ACM SIGCOMM workshop on Revisiting IP QoS. Karlsruhe, Germany: ACM
Press, August 2003, pp. 121–130.
[79] J. K. MacKie-Mason and H. R. Varian, “Pricing the Internet,” EconWPA, Computational
Economics 9401002, January 1994. [Online]. Available: http://ideas.repec.org/p/wpa/
wuwpco/9401002.html [Last accessed: 2009, 22 February].
[80] F. P. Kelly, “Charging and accounting for bursty connections,” Internet economics, pp.
253–278, 1997.
[81] F. Kelly, “Charging and rate control for elastic traffic,” European Transactions on
Telecommunications, vol. 8, pp. 33–37, 1997.
[82] S. Shenker, D. Clark, D. Estrin, and S. Herzog, “Pricing in computer networks: reshaping
the research agenda,” ACM SIGCOMM Computer Communication Review, vol. 26, no. 2,
pp. 19–43, April 1996.
[83] N. Keon and G. Anandalingam. (2003, July) A new pricing
model for competitive telecommunications services using congestion dis-
counts. [Online]. Available: http://mail3.rhsmith.umd.edu/Faculty/KM/papers.nsf\
/0/d5ea3f525a84fc5485256d0c006f210d?OpenDocument [Last accessed: 2009, 22
February].
[84] D. Clark, “Combining sender and receiver payments in the Internet,” in Telecommunica-
tions Research Policy Conference, 1996.
[85] M. Odlyzko, “Paris metro pricing for the Internet,” in EC ’99: Proceedings of the 1st
ACM conference on Electronic commerce, Denver, Colorado, United States, 1999, pp.
140–147.
[86] P. Dube, V. Borkar, and D. Manjunath, “Differential join prices for parallel queues: so-
cial optimality, dynamic pricing algorithms and application to internet pricing,” in IN-
FOCOM 2002. Twenty-First Annual Joint Conference of the IEEE Computer and Com-
munications Societies. Proceedings. IEEE, vol. 1, 2002, pp. 276–283.
[87] P. Marbach, “Priority service and max-min fairness,” in IEEE INFOCOM 2002, The 21st
Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 1,
New York, USA, 2002, pp. 266–275.
[88] P. Marbach, “Priority service and max-min fairness,” IEEE/ACM Transactions on Net-
working, vol. 11, no. 5, pp. 733–746, October 2003.
[89] R. Cocchi, D. Estrin, S. Shenker, and L. Zhang, “A study of priority pricing in multiple
service class networks,” in SIGCOMM ’91: Proceedings of the conference on Communi-
cations architecture protocols. USA: ACM Press, 1991, pp. 123–130.
[90] G. Fankhauser and B. Plattner, “Diffserv bandwidth brokers as mini-markets,” in Pro-
ceedings of International Workshop on Internet Service Quality Economics (ICQE), MIT,
US, December 2-3 1999.
[91] X. Wang and H. Schulzrinne, “RNAP: A resource negotiation and pricing protocol,”
in Proceedings of the Ninth International Workshop on Network and Operating Systems
Support for Digital Audio and Video NOSSDAV ’99, Basking Ridge, NJ, June 1999, pp.
77–93.
[92] M. Yuksel, S. Kalyanaraman, and A. Goel, “Congestion pricing overlaid on edge-to-
edge congestion control,” IEEE International Conference on Communications (ICC ’03),
2003., vol. 2, pp. 880–884, May 2003.
[93] A. J. O’Donnell and H. Sethu, “Congestion control, Differentiated Services, and efficient
capacity management through a novel pricing strategy,” Computer Communications,
vol. 26, no. 13, pp. 1457–1469, 2003.
[94] B. Stiller, P. Reichl, and S. Leinen, “Pricing and cost recovery for Internet Services:
Practical review, classification and application of relevant models.” in NETNOMICS -
Economic Research and Electronic Networking, vol. 3. Kluwer Academic Publishers,
2001, pp. 149–171.
[95] “Bills Digest no.67 1997-98, Telecommunications Legislation Amendment Bill 1997,”
Parliament of Australia, 1997.
[96] B. Karpagavinayagam, R. State, and O. Festor, “Monitoring architecture for Lawful In-
terception in VoIP networks,” in Second International Conference on Internet Monitoring
and Protection (ICIMP 2007), San Jose, CA, July 2007.
[97] A. Rojas, P. Branch, and G. Armitage, “Predictive Lawful Interception in mobile IPv6
networks,” in ICON 2007: 15th IEEE International Conference on Networks, 2007.,
Adelaide, Australia, November 2007, pp. 501–506.
[98] A. Moore and D. Zuev, “Internet traffic classification using Bayesian analysis tech-
niques,” in SIGMETRICS ’05: Proceedings of the 2005 ACM SIGMETRICS interna-
tional conference on Measurement and modeling of computer systems. Banff, Alberta,
Canada: ACM, June 2005, pp. 50–60.
[99] J. Erman, A. Mahanti, and M. Arlitt, “Byte me: a case for byte accuracy in traffic clas-
sification,” in MineNet ’07: Proceedings of the 3rd annual ACM workshop on Mining
network data. San Diego, California, USA: ACM Press, June 2007, pp. 35–38.
[100] (2007, August) Internet Assigned Numbers Authority (IANA). [Online]. Available:
http://www.iana.org/assignments/port-numbers [Last accessed: 2009, 22 February].
[101] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, “RTP: A transport protocol for
real-time applications,” RFC 1889, IETF, 1996.
[102] A. Moore and K. Papagiannaki, “Toward the accurate identification of network applica-
tions,” in Sixth Passive and Active Measurement Workshop (PAM), Boston, MA, USA,
March/April 2005.
[103] A. Madhukar and C. Williamson, “A longitudinal study of P2P traffic classification,” in
14th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer
and Telecommunication Systems, September 2006, pp. 179 –188.
[104] V. Paxson, “Empirically derived analytic models of wide-area TCP connections,”
IEEE/ACM Transactions on Networking, vol. 2, no. 4, pp. 316–336, 1994.
[105] C. Dewes, A. Wichmann, and A. Feldmann, “An analysis of Internet chat systems,” in
IMC ’03: Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement.
Miami, Florida, USA: ACM, October 2003, pp. 51–64.
[106] K. C. Claffy, “Internet traffic characterisation,” PhD Thesis, University of California, San
Diego, 1994.
[107] T. Lang, G. Armitage, P. Branch, and H.-Y. Choo, “A synthetic traffic model for Half-
Life,” in Proceedings of Australian Telecommunications Networks & Applications Con-
ference 2003 ATNAC2003, Melbourne, Australia, December 2003.
[108] T. Lang, P. Branch, and G. Armitage, “A synthetic traffic model for Quake 3,” in Pro-
ceedings of ACM SIGCHI ACE2004, Singapore, June 2004.
[109] I. Witten and E. Frank, Data mining: Practical machine learning tools and techniques
with Java implementations, 2nd ed. Morgan Kaufmann Publishers, 2005.
[110] Z. Shi, Principles of machine learning. International Academic Publishers, 1992.
[111] H. A. Simon, “Why should machines learn?” in R. S. Michalski, J. G. Carbonell, and
T. M. Mitchell (eds.), Machine Learning: An Artificial Intelligence Approach. Tioga, 1983.
[112] B. Silver, “Netman: A learning network traffic controller,” in IEA/AIE ’90: Proceedings
of the 3rd international conference on Industrial and engineering applications of artifi-
cial intelligence and expert systems. Charleston, South Carolina, United States: ACM,
1990, pp. 923–931.
[113] Y. Reich and S. J. Fenves, “The formation and use of abstract concepts in design,” Con-
cept formation knowledge and experience in unsupervised learning, pp. 323–353, 1991.
[114] R. Kohavi, J. R. Quinlan, W. Klosgen, and J. Zytkow, “Decision tree discovery,” Hand-
book of Data Mining and Knowledge Discovery, pp. 267–276, 2002.
[115] G. John and P. Langley, “Estimating continuous distributions in Bayesian classifiers,” in
Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Mon-
treal, Quebec, Canada: Morgan Kaufmann, August 1995, pp. 338–345.
[116] E. B. Hunt, J. Marin, and P. J. Stone, Experiments in Induction. New York, NY, USA:
Academic Press, 1966.
[117] J. Han and M. Kamber, Data Mining - Concepts and Techniques. Morgan Kaufmann
Publishers, 2001.
[118] D. Gamberger, T. Smuc, and I. Maric. (2006, April) Tutorial on decision tree. [Online].
Available: http://dms.irb.hr/tutorial/tut dtrees.php [Last accessed: 2009, 22 February].
[119] R.-H. Li and G. G. Belford, “Instability of decision tree classification algorithms,” in
KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowl-
edge discovery and data mining. New York, NY, USA: ACM Press, 2002, pp. 570–575.
[120] D. J. Haglin. (2006, April) Decision trees for supervised learning. [Online]. Available:
http://grb.mnsu.edu/grbts/doc/manual/J48 Decision Trees.html [Last accessed: 2009, 22
February].
[121] H. D. Fisher, J. M. Pazzani, and P. Langley, Concept Formation: Knowledge and Ex-
perience in Unsupervised Learning. San Francisco, CA, USA: Morgan Kaufmann
Publishers, 1991.
[122] L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian, “Traffic classifica-
tion on the fly,” ACM SIGCOMM Computer Communication Review, vol. 36, no. 2, pp.
23–26, 2006.
[123] J. Erman, A. Mahanti, M. Arlitt, and C. Williamson, “Identifying and discriminating
between web and peer-to-peer traffic in the network core,” in WWW ’07: Proceedings of
the 16th international conference on World Wide Web. Banff, Alberta, Canada: ACM
Press, May 2007, pp. 883–892.
[124] (2006, April) Weka en:primer (3.4.6). The University of Waikato. [Online]. Available:
http://weka.sourceforge.net/wekadoc/index.php/en:Primer [Last accessed: 2009, 22
February].
[125] (2009, February) WEKA API documentation (weka.clusterers class EM). [Online].
Available: http://weka.sourceforge.net/doc/ [Last accessed: 2009, 22 February].
[126] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. Wiley-Interscience,
2000.
[127] O. Carmichael and M. Hebert, “Shape-based recognition of wiry objects,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, vol. 26, no. 12, pp. 1537–1552,
December 2004.
[128] (2007, November) [Wekalist] ten-fold cross validation (1). [Online]. Available: https://
list.scms.waikato.ac.nz/mailman/htdig/wekalist/2005-April/003836.html [Last accessed:
2007, 30 November].
[129] (2007, November) [wekalist] ten-fold cross validation (2). [Online]. Available: https://
list.scms.waikato.ac.nz/mailman/htdig/wekalist/2005-April/003847.html [Last accessed:
2007, 30 November].
[130] W. Rand, “Objective criteria for the evaluation of clustering methods,” Journal of the
American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
[131] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Cluster validity methods: part I,” ACM
SIGMOD Record, vol. 31, no. 2, pp. 40–45, 2002.
[132] R. Xu and D. Wunsch, “Survey of clustering algorithms,” IEEE Transactions on Neural
Networks, vol. 16, no. 3, pp. 645–678, May 2005.
[133] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Clustering validity checking methods:
part II,” ACM SIGMOD Record, vol. 31, no. 3, pp. 19–27, 2002.
[134] M. Hall and G. Holmes, “Benchmarking attribute selection techniques for discrete class
data mining,” IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 6,
pp. 1437–1447, November/December 2003.
[135] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning.
Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1989.
[136] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial Intelligence,
vol. 97, no. 1-2, pp. 273–324, 1997.
[137] P. H. Winston, Artificial Intelligence, 2nd ed. Boston, MA, USA: Addison-Wesley
Longman Publishing Co., Inc., 1984.
[138] J. Zhang and I. Mani, “kNN approach to unbalanced data distributions: A case study
involving information extraction,” in Proceedings of the ICML’03 Workshop on Learning
from Imbalanced Data Sets, Washington, DC, 2003.
[139] S. Visa and A. Ralescu, “Issues in mining imbalanced data sets - a review paper,” in Pro-
ceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference,
MAICS-2005, 2005, pp. 67–73.
[140] C. X. Ling and C. Li, “Data mining for direct marketing: Problems and solutions,” in
Knowledge Discovery and Data Mining. AAAI Press, 1998, pp. 73–79.
[141] M. Kubat and S. Matwin, “Addressing the curse of imbalanced training sets: one-sided
selection,” in Proceedings of the Fourteenth International Conference on Machine Learn-
ing. Morgan Kaufmann, 1997, pp. 179–186.
[142] P. Domingos, “Metacost: a general method for making classifiers cost-sensitive,” in KDD
’99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge dis-
covery and data mining. San Diego, California, United States: ACM, 1999, pp. 155–
164.
[143] N. Japkowicz, C. Myers, and M. A. Gluck, “A novelty detection approach to classifica-
tion,” in Proceedings of the Fourteenth International Joint Conference on Artificial
Intelligence, 1995,
pp. 518–523.
[144] A. Y. Liu, “The effect of oversampling and undersampling on classifying imbalanced text
datasets,” Master Thesis, The University of Texas at Austin, 2004.
[145] A. Nickerson, N. Japkowicz, and E. Milios, “Using unsupervised learning to guide re-
sampling in imbalanced data sets,” in Proceedings of the Eighth International Workshop
on Artificial Intelligence and Statistics, 2001, pp. 261–265.
[146] G. M. Weiss and F. Provost, “Learning when training data are costly: The effect of class
distribution on tree induction,” Journal of Artificial Intelligence Research, vol. 19, pp.
315–354, 2003.
[147] S. Visa and A. Ralescu, “The effect of imbalanced data class distribution on fuzzy classi-
fiers - experimental study,” in The 14th IEEE International Conference on Fuzzy Systems
(FUZZ ’05), 2005, May 2005, pp. 749–754.
[148] N. Japkowicz, “Learning from imbalanced data sets: a comparison of various
strategies,” Learning from imbalanced data sets: The AAAI Workshop 10-15.
Menlo Park, CA: AAAI Press, Tech. Rep. WS-00-05, 2000. [Online]. Available:
http://www.aaai.org/Library/Workshops/2000/ws00-05-003.php [Last accessed: 2009,
22 February].
[149] N. Japkowicz and S. Stephen, “The class imbalance problem: A systematic study,” Intel-
ligent Data Analysis Journal, vol. 6, no. 5, pp. 429–450, 2002.
[150] A. Moore, D. Zuev, and M. Crogan, “Discriminators for use in flow-based classification,”
Department of Computer Science, Queen Mary, University of London, Tech. Rep.
RR-05-13, August 2005. [Online]. Available: http://www.dcs.qmul.ac.uk/tech reports/
RR-05-13.pdf [Last accessed: 2009, 22 February].
[151] N. Williams, S. Zander, and G. Armitage, “Evaluating machine learning methods for
online game traffic identification,” Centre for Advanced Internet Architectures (CAIA),
Tech. Rep. 060410C, April 2006. [Online]. Available: http://caia.swin.edu.au/reports/
060410C/CAIA-TR-060410C.pdf [Last accessed: 2009, 22 February].
[152] J. Park, H.-R. Tyan, and C.-C. J. Kuo, “Internet traffic classification for scalable QoS pro-
vision,” in IEEE International Conference on Multimedia and Expo, 2006, Toronto, On-
tario, Canada, July 2006, pp. 1221–1224.
[153] J. Erman, M. Arlitt, and A. Mahanti, “Traffic classification using clustering algorithms,”
in MineNet ’06: Proceedings of the 2006 SIGCOMM workshop on Mining network data.
Pisa, Italy: ACM, 2006, pp. 281–286.
[154] Y. Yang and R. Kravets, “Throughput guarantees for multi-priority traffic in ad hoc net-
works,” Ad Hoc Networks, vol. 5, pp. 228–253, 2007.
[155] H.-H. W. Lin and X., “Multiple priorities QoS scheduling for simultaneous videos trans-
missions,” in Proceedings. International Symposium on Multimedia Software Engineer-
ing, 2000. IEEE Computer Society, 2000, pp. 135–141.
[156] V. Ambetkar, P. Bender, J. Ma, Y. Pei, and W. J. Modestino, “Distributed flow admission
control for real-time multimedia services over wireless ad hoc networks,” in MobiMedia
’06: Proceedings of the 2nd international conference on Mobile multimedia communi-
cations. Alghero, Italy: ACM, 2006, pp. 1–6.
[157] S. Wang, D. Xuan, and W. Zhao, “Differentiated Services with statistical QoS guar-
antees in static-priority scheduling networks,” Texas A & M University, Tech. Rep.
TR01-015, 2001.
[158] K. Cieliebak and B. Liver, “How many QoS classes are optimal?” in EC ’99: Proceed-
ings of the 1st ACM conference on electronic commerce. Denver, Colorado, United
States: ACM, 1999, pp. 48–57.
[159] T. Nguyen and G. Armitage, “Training on multiple sub-flows to optimise the use of
machine learning classifiers in real-world IP networks,” in Proceedings 2006 31st IEEE
Conference on Local Computer Networks, Tampa, Florida, USA, November 2006, pp.
369–376.
[160] T. Nguyen and G. Armitage, “Synthetic sub-flow pairs for timely and stable IP traffic
identification,” in Proceedings of Australian Telecommunication Networks and Applica-
tion Conference, Melbourne, Australia, December 2006.
[161] A. Dempster, N. Laird, and D. Rubin, “Maximum Likelihood from incomplete data via
the EM algorithm,” Journal of the Royal Statistical Society, vol. 39, pp. 1–22, 1977.
[162] P. Cheeseman and J. Stutz, “Bayesian classification (AutoClass): Theory and results,” in
Advances in Knowledge Discovery and Data Mining. Menlo Park, CA, USA: American
Association for Artificial Intelligence, 1996, pp. 153–180.
[163] C. Schmoll and S. Zander, Netmate, February 2009. [Online]. Available: http:
//sourceforge.net/projects/netmate-meter/ [Last accessed: 2009, 22 February].
[164] (2006, September) Traffic measurement data repository. The National Laboratory for
Applied Network Research (NLANR). [Online]. Available: http://pma.nlanr.net/Special/
[Last accessed: 2009, 22 February].
[165] T. Auld, A. W. Moore, and S. F. Gull, “Bayesian neural networks for Internet traffic clas-
sification,” IEEE Transactions on Neural Networks, vol. 18, no. 1, pp. 223–239, January
2007.
[166] J. Park, H.-R. Tyan, and C.-C. J. Kuo, “GA-based Internet traffic classification technique
for QoS provisioning,” in IIH-MSP ’06: Proceedings of the 2006 International Confer-
ence on Intelligent Information Hiding and Multimedia. Pasadena, California: IEEE
Computer Society, December 2006, pp. 251–254.
[167] M. Crotti, M. Dusi, F. Gringoli, and L. Salgarelli, “Traffic classification through simple
statistical fingerprinting,” ACM SIGCOMM Computer Communication Review, vol. 37,
no. 1, pp. 5–16, 2007.
[168] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, and C. Williamson, “Semi-supervised net-
work traffic classification,” ACM SIGMETRICS Performance Evaluation Review, vol. 35,
no. 1, pp. 369–370, 2007.
[169] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, and C. Williamson, “Offline/realtime
network traffic classification using semi-supervised learning,” Department of Computer
Science, University of Calgary, Tech. Rep., February 2007. [Online]. Available: http:
//pages.cpsc.ucalgary.ca/∼mahanti/papers/semi.supervised.pdf [Last accessed: 2009, 22
February].
[170] J. Erman, A. Mahanti, and M. Arlitt, “QRP05-4: Internet traffic identification using
machine learning,” in GLOBECOM ’06. IEEE Global Telecommunications Conference,
2006., San Francisco, USA, December 2006, pp. 1–6.
[171] N. Williams, S. Zander, and G. Armitage, “A preliminary performance comparison of
five machine learning algorithms for practical IP traffic flow classification,” SIGCOMM
Computer Communication Review, vol. 36, no. 5, pp. 5–16, 2006.
[172] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, “ACAS: Automated construction of
application signatures,” in MineNet ’05: Proceeding of the 2005 ACM SIGCOMM work-
shop on Mining network data. New York, NY, USA: ACM Press, August 2005, pp.
197–202.
[173] N. Zennström and J. Friis, Skype, Skype Technologies S.A., February 2009. [Online].
Available: http://www.skype.com/intl/en/. [Last accessed: 2009, 22 February].
[174] E. J. Gumbel, Statistics of Extremes. New York: Columbia University Press, 1958.
[175] Weka 3.4.4, The University of Waikato, February 2009. [Online]. Available:
http://www.cs.waikato.ac.nz/ml/weka [Last accessed: 2009, 22 February].
[176] G. Armitage, M. Claypool, and P. Branch, Networking and online games - understanding
and engineering multiplayer Internet games. UK: John Wiley & Sons, 2006.
[177] T. W. Anderson and D. A. Darling, “Asymptotic theory of certain ‘goodness of fit’ criteria
based on stochastic processes,” Annals of Mathematical Statistics, vol. 23, no. 2, pp.
193–212, 1952.
[178] W. J. Conover, Practical nonparametric statistics. New York: John Wiley & Sons,
1971.
[179] (2009, February) CAIA Grangenet game server (GENIUS project). Centre for Advanced
Internet Architectures (CAIA). [Online]. Available: http://caia.swin.edu.au/genius/
games.html [Last accessed: 2009, 22 February].
[180] S. Zander, D. Kennedy, and G. Armitage, “Dissecting server-discovery traffic patterns
generated by multiplayer first person shooter games,” in NetGames ’05: Proceedings of
4th ACM SIGCOMM workshop on Network and system support for games. Hawthorne,
NY: ACM, October 2005, pp. 1–12.
[181] (2006, March) Traffic measurement data repository. The University of Twente. [Online].
Available: http://m2c-a.cs.utwente.nl/repository [Last accessed: 2006, 26 March].
[182] (2007, January) Supercomputing overview. The Centre for Astrophysics and
Supercomputing, Swinburne University of Technology. [Online]. Available: http:
//astronomy.swinburne.edu.au/supercomputing/ [Last accessed: 2009, 22 February].
[183] T. G. Renna, I. Bar-Kana, and P. Kalata, “A two-level gain stochastic disturbance ob-
server with hysteresis,” in IEEE International Conference on Systems Engineering, Au-
gust 1990, pp. 77–80.
[184] Qstat, The Open Group Base Specifications Issue 6, IEEE Std 1003.1, 2004 Edition,
January 2009. [Online]. Available: http://www.opengroup.org/onlinepubs/000095399/
utilities/qstat.html. [Last accessed: 2009, 22 February].
[185] ITU-T, “G.711: Pulse code modulation (PCM) of voice frequencies,” G.711 ITU-T Stan-
dard, International Telecommunication Union, 1988.
[186] ETSI, “European Digital Cellular Telecommunications System (Phase 2): Full rate
speech transcoding. ETSI spec. GSM 06.10, GSM 06.32 ed.” European Standard, The
International Telegraph and Telephone Consultative Committee., 1994.
[187] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. Johnston, J. Peterson, R. Sparks, M. Han-
dley, and E. Schooler, “SIP: Session Initiation Protocol,” RFC 3261, IETF, 2002.
[188] M. Handley and V. Jacobson, “SDP: Session Description Protocol,” RFC 2327, IETF,
1998.
[189] H. Schulzrinne and S. Casner, “RTP profile for audio and video conferences with minimal
control,” RFC 3551, IETF, 2003.
[190] R. Zopf, “Real-time transport protocol (RTP) payload for comfort noise (CN),” RFC
3389, IETF, 2002.
[191] J.-C. Bolot, “End-to-end packet delay and loss behavior in the Internet,” ACM SIGCOMM
Computer Communication Review, vol. 23, no. 4, pp. 289–298, 1993.
[192] W. Jiang and H. Schulzrinne, “Comparison and optimization of packet loss repair meth-
ods on VoIP perceived quality under bursty loss,” in NOSSDAV ’02: Proceedings of the
12th international workshop on Network and operating systems support for digital audio
and video. Miami, Florida, USA: ACM, 2002, pp. 73–81.
[193] D. Loguinov and H. Radha, “Measurement study of low-bitrate Internet video stream-
ing,” in IMW ’01: Proceedings of the 1st ACM SIGCOMM Workshop on Internet Mea-
surement. San Francisco, California, USA: ACM, 2001, pp. 281–293.
[194] M. Borella, D. Swider, S. Uludag, and G. Brewster, “Internet packet loss: Measurement
and implications for end-to-end QoS,” in Proceedings of the 1998 ICPP Workshops on
Architectural and OS Support for Multimedia Applications/Flexible Communication Sys-
tems/Wireless Networks and Mobile Computing. IEEE Computer Society, August 1998,
pp. 3–12.
[195] H. Balakrishnan, V. Padmanabhan, S. Seshan, M. Stemm, and R. Katz, “TCP behavior
of a busy Internet server: Analysis and improvements,” in INFOCOM ’98, Seventeenth
Annual Joint Conference of the IEEE Computer and Communications Societies. San
Francisco, CA, USA: University of California at Berkeley, 1998, pp. 252–262.
[196] V. E. Paxson, “Measurements and analysis of end-to-end Internet dynamics,” PhD Thesis,
University of California at Berkeley, Berkeley, CA, USA, 1998.
[197] M. Yajnik, S. Moon, J. Kurose, and D. Towsley, “Measurement and modelling of the
temporal dependence in packet loss,” in INFOCOM ’99. Eighteenth Annual Joint Con-
ference of the IEEE Computer and Communications Societies, vol. 1, March 1999, pp.
345–352.
[198] M. Dischinger, A. Haeberlen, K. P. Gummadi, and S. Saroiu, “Characterizing residential
broadband networks,” in IMC ’07: Proceedings of the 7th ACM SIGCOMM conference
on Internet measurement. San Diego, California, USA: ACM, 2007, pp. 43–56.
[199] M. Mathis, J. Semke, and J. Mahdavi, “The macroscopic behavior of the TCP conges-
tion avoidance algorithm,” ACM SIGCOMM Computer Communication Review, vol. 27,
no. 3, pp. 67–82, 1997.
[200] L. Cottrell. (2000, February) Throughput versus loss. Stanford Linear Accelerator Center.
[Online]. Available: http://www.slac.stanford.edu/comp/net/wan-mon/thru-vs-loss.html
[Last accessed: 2009, 22 February].
[201] J.-A. Bussiere and S. Zander, “Enemy Territory traffic analysis,” Centre for Advanced
Internet Architectures (CAIA), Tech. Rep. 060203A, February 2006. [Online]. Available:
http://caia.swin.edu.au/reports/060203A/CAIA-TR-060203A.pdf [Last accessed: 2009,
22 February].
[202] J. Ma, K. Levchenko, C. Kreibich, S. Savage, and G. M. Voelker, “Unexpected means
of protocol inference,” in IMC ’06: Proceedings of the 6th ACM SIGCOMM on Internet
measurement. Rio de Janeriro, Brazil: ACM Press, October 2006, pp. 313–326.
[203] C. Jin, H. Wang, and K. G. Shin, “Hop-count filtering: an effective defense against
spoofed DDoS traffic,” in CCS ’03: Proceedings of the 10th ACM conference on Com-
puter and communications security. Washington D.C., USA: ACM, 2003, pp. 30–41.
[204] G. Armitage, C. Javier, and S. Zander, “Post-game estimation of game client RTT
and hop count distributions,” Centre for Advanced Internet Architectures (CAIA),
Tech. Rep. 060801A, August 2006. [Online]. Available: http://caia.swin.edu.au/reports/
060801A/CAIA-TR-060801A.pdf [Last accessed: 2009, 22 February].
[205] Wireshark, Wireshark foundation, February 2009. [Online]. Available: http:
//www.wireshark.org/ [Last accessed: 2009, 22 February].
[206] Wireshark. (2008, December) Wireshark frequently asked questions. [Online]. Available:
http://www.wireshark.org/faq.html [Last accessed: 2009, 22 February].
[207] H. Schulzrinne and S. Petrack, “RTP payload for DTMF digits, telephony tones and
telephony signals,” RFC 2833, IETF, 2000.
[208] T. Nguyen and G. Armitage, “Clustering to assist supervised machine learning for real-
time IP traffic classification,” in IEEE International Conference on Communications
(ICC ’08), 2008, Beijing, China, 2008, pp. 5857–5862.
List of Figures
2.1 A typical DOCSIS cable network from ISP to home users . . . . . . . . . . . 31
2.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1 An example dataset as a matrix of instances versus features . . . . . . . . . . . 45
3.2 An illustration of a full-flow. The forward direction is normally defined as
the client-to-server direction . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 An illustration of the definition of flow direction and features calculation . . . . 58
3.4 A simple scenario of online game traffic classification . . . . . . . . . . . . . . 59
3.5 Training and classification for a two-class supervised ML traffic classifier . . 60
3.6 Example of an automated QoS and priority control . . . . . . . . . . . . . . . 64
3.7 Example operation of an IP flow classifier . . . . . . . . . . . . . . . . . . . 65
5.1 An illustration of sub-flow definition . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Packet length from client to server for ET traffic - N = 25 packets . . . . . . . . 96
5.3 Packet length from server to client for ET traffic - N = 25 packets . . . . . . . . 96
5.4 Mean packet length from client to server for ET traffic - N = 25 packets . . . . 97
5.5 Standard deviation of packet length from client to server for ET traffic - N = 25
packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.6 Mean packet length in the client-to-server direction, calculated for the window
of the first N packets taken from 1,000 flow samples for ET traffic (1,000 values
of the means for each N value) . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.7 The standard deviation of packet length in the client-to-server direction, calcu-
lated for window of the first N packets taken from 1,000 flow samples for ET
traffic (1,000 values of the standard deviations for each N value) . . . . . . . . 99
5.8 High-level description of datasets used for training and testing . . . . . . . . . 101
5.9 Distribution of different applications’ traffic (in flows and percentage) in the
training datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.10 Distribution of different applications’ traffic (in flows and percentage) in testing
datasets for N = 25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.11 ET Recall: Classifier trained with full-flows, tested with four different sliding
windows - Naive Bayes models . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.12 ET Precision: Classifier trained with full-flows, tested with four different slid-
ing windows - Naive Bayes models . . . . . . . . . . . . . . . . . . . . . . . . 110
5.13 ET Recall: Classifier trained with full-flows, tested with four different sliding
windows - C4.5 Decision Tree models . . . . . . . . . . . . . . . . . . . . . . 111
5.14 ET Precision: Classifier trained with full-flows, tested with four different slid-
ing windows - C4.5 Decision Tree models . . . . . . . . . . . . . . . . . . . . 111
5.15 ET Recall and Precision: Classifier trained on filtered full-flows, N = 25 for
classification - Naive Bayes models . . . . . . . . . . . . . . . . . . . . . . . 112
5.16 An illustration of creating classification rules for the full-flow and filtered full-
flow models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.17 ET Recall and Precision: Classifier trained on filtered full-flows, N = 25 for
classification - C4.5 Decision Tree models . . . . . . . . . . . . . . . . . . . . 115
5.18 ET Recall: Classifier trained on 25-packet sub-flows, N = 25 for classification -
Naive Bayes models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.19 ET Precision: Classifier trained on 25-packet sub-flows, N = 25 for classifica-
tion - Naive Bayes models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.20 ET Recall: Classifier trained on 25-packet sub-flows, N = 25 for classification -
C4.5 Decision Tree models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.21 ET Precision: Classifier trained on 25-packet sub-flows, N = 25 for classifica-
tion - C4.5 Decision Tree models . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.22 ET Precision: Classifier trained on 25-packet sub-flows, N = 25 for classifica-
tion - C4.5 Decision Tree models - a zoomed-in version of Figure 5.21 . . . . . 119
5.23 ET Recall: Comparing full-flow and sub-flow training of the Naive Bayes clas-
sifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.24 ET Precision: Comparing full-flow and sub-flow training of the Naive Bayes
classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.25 An illustration of creating multiple sub-flows classifier from a number of in-
dividual sub-flows (data points are artificially created for illustrative purposes
only). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.26 ET Recall: Comparing full-flow and sub-flow training of the C4.5 Decision
Tree classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.27 ET Precision: Comparing full-flow and sub-flow training of the C4.5 Decision
Tree classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.1 An illustration of the sub-flow identification step . . . . . . . . . . . . . . . . 131
6.2 An illustration of selecting representative sub-flows to train a classifier . . . . . 132
6.3 Step 1 - Experimental approach . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.4 Number of instances for each sub-flow identified in Step 1 . . . . . . . . . . . 135
6.5 Sub-flow to cluster mapping and evaluation. . . . . . . . . . . . . . . . . . . . 136
6.6 Normalised number of instances in training each classifier . . . . . . . . . . . 139
6.7 Recall for Naive Bayes classifiers trained on various selections of full-flows and
sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.8 Recall for Naive Bayes classifiers using multiple sub-flows, expanded from Fig-
ure 6.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.9 Precision for Naive Bayes classifiers trained on various selections of full-flows
and sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.10 Precision for Naive Bayes classifiers using multiple sub-flows, expanded from
Figure 6.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.11 Recall for C4.5 Decision Tree classifiers trained on various selections of full-
flows and sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.12 Precision for C4.5 Decision Tree classifiers trained on various selections of full-
flows and sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.13 Normalised build time for Naive Bayes classifiers . . . . . . . . . . . . . . . . 145
6.14 Normalised classification speed for Naive Bayes classifiers . . . . . . . . . . . 146
6.15 Normalised memory usage for Naive Bayes classifiers while performing 10-
times cross validation (during both training and testing) . . . . . . . . . . . . . 146
6.16 Normalised build time for C4.5 Decision Tree classifiers . . . . . . . . . . . . 147
6.17 Normalised classification speed for C4.5 Decision Tree classifiers . . . . . . . 148
6.18 Normalised memory usage for C4.5 Decision Tree classifiers while performing
10-times cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.19 Summary of Precision / Recall results for Naive Bayes (NB) and C4.5 Decision
Tree (DT) classifiers trained on multiple sub-flows . . . . . . . . . . . . . . . 149
6.20 Summary of computational performance results for Naive Bayes (NB) and C4.5
Decision Tree (DT) classifiers trained on multiple sub-flows . . . . . . . . . . 150
6.21 Sampled clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.22 Precision and Recall for Naive Bayes classifiers using sub-flows selected by EM
with small numbers of samples for the clustering process. . . . . . . . . . . . . 155
6.23 Results for C4.5 Decision Tree classifiers using sub-flows selected by EM with
small numbers of samples for the clustering process. . . . . . . . . . . . . . . 156
6.24 Normalised Model Build Time for classifiers trained on sub-flows selected by
EM with small numbers of samples used in the clustering process . . . . . . . 157
6.25 Normalised classification speed for classifiers trained on sub-flows selected by
EM with small numbers of samples used in the clustering process . . . . . . . 157
6.26 An illustration of updating a classifier when new, previously unknown traffic is
detected . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.1 Steps in training an ML classifier for identification of ET traffic versus Other
traffic - without using the SSP approach . . . . . . . . . . . . . . . . . . . . . 164
7.2 An illustration of how to create a mirror-image replica for a sub-flow instance . 164
7.3 Option 1: Both sub-flows’ instances and the mirror-image replicas of every
short sub-flow are labelled as one class. The classifier is trained with two
classes: ET and Other. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.4 Option 2: Sub-flows’ instances and their mirror-image replicas are labelled in-
dependently as two separate classes. The classifier is trained with four classes:
ET, ET’, Other and Other’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.5 Example datasets used to train a classifier using Option 1 and Option 2 . . . . . 167
7.6 An illustration of creating SSP classifier from sub-flow instances and their
mirror-image replicas (data points are artificially created for illustration purposes
only.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.7 Percentage of flows that have the first packet captured in the client-to-server
direction if the first M packets are missed . . . . . . . . . . . . . . . . . . . . 169
7.8 Recall for Naive Bayes classifiers trained on full-flow (full-flow model), filtered
full-flow (filtered full-flow model) and multiple sub-flows (multiple sub-flows
model) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
7.9 Precision for Naive Bayes classifiers trained on full-flow, filtered full-flow and
multiple sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
7.10 Recall for C4.5 Decision Tree classifiers trained on full-flow, filtered full-flow
and multiple sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7.11 Precision for C4.5 Decision Tree classifiers trained on full-flow, filtered full-
flow and multiple sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7.12 Recall for Naive Bayes classifiers trained using SSP Option 1 and multiple sub-
flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
7.13 Precision for Naive Bayes classifiers trained using SSP Option 1 and multiple
sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.14 Recall for C4.5 Decision Tree classifiers trained using SSP Option 1 and multi-
ple sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.15 Precision for C4.5 Decision Tree classifiers trained using SSP Option 1 and
multiple sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.16 Recall for Naive Bayes classifiers trained using SSP Option 1, SSP Option 2
and Multiple Sub-Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
7.17 Precision for Naive Bayes classifiers trained using SSP Option 1, SSP Option 2
and multiple sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
7.18 Recall for C4.5 Decision Tree classifiers trained using SSP Option 1, SSP Op-
tion 2 and Multiple Sub-Flows . . . . . . . . . . . . . . . . . . . . . . . . . . 178
7.19 Precision for C4.5 Decision Tree classifiers trained using SSP Option 1, SSP
Option 2 and Multiple Sub-Flows . . . . . . . . . . . . . . . . . . . . . . . . 179
7.20 Computational performance for Naive Bayes and C4.5 Decision Tree classifiers
trained on multiple sub-flows, SSP Option 1 and SSP Option 2 . . . . . . . . . 180
8.1 Cumulative distribution of call duration . . . . . . . . . . . . . . . . . . . . 186
8.2 G.711 traffic - forward direction, mean packet length calculated over a window
of 25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
8.3 G.711 traffic - forward direction, mean packet inter-arrival time calculated over
a window of 25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
8.4 GSM traffic - forward direction, mean packet length calculated over a window
of 25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
8.5 GSM traffic - forward direction, mean packet inter-arrival time calculated over
a window of 25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
8.6 Voice traffic generated during a voice conversation: Comfort noise packets and
silence suppression periods during a conversation can create asymmetry and
multiple packet sizes within the traffic captured by the sliding window . . . . . 190
8.7 VoIP Recall: Naive Bayes classifiers trained on full-flow and SSP-ACT . . . . 191
8.8 VoIP Precision: Naive Bayes classifiers trained on full-flow and SSP-ACT . . . 192
8.9 VoIP Recall: C4.5 Decision Tree classifiers trained on full-flow and SSP-ACT . 192
8.10 VoIP Precision: C4.5 Decision Tree classifiers trained on full-flow and SSP-ACT 193
8.11 VoIP classification using classifiers trained on full-flow and SSP-ACT: Training
on full-flow may cover a larger area of VoIP instances when classifying using
a small sliding window, hence resulting in higher Recall but lower Precision
compared to training using SSP-ACT. (The data points are artificially created
for illustration purposes only. They are not actual data points from my dataset.) 195
8.12 A simple illustration of the impact of packet loss on packet inter-arrival time
statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
8.13 ET Recall: Training with SSP-ACT and classifying with ET traffic experiencing
5% random packet loss - Naive Bayes classifier . . . . . . . . . . . . . . . . . 199
8.14 ET Precision: Training with SSP-ACT and classifying with ET traffic experi-
encing 5% random packet loss - Naive Bayes classifier . . . . . . . . . . . . . 199
8.15 ET Recall: Training with SSP-ACT and classifying with ET traffic experiencing
5% random packet loss - C4.5 Decision Tree classifier . . . . . . . . . . . . . 200
8.16 ET Precision: Training with SSP-ACT and classifying with ET traffic experi-
encing 5% random packet loss - C4.5 Decision Tree classifier . . . . . . . . . . 201
8.17 VoIP Recall: Training with SSP-ACT and classifying with VoIP traffic experi-
encing 5% random packet loss - Naive Bayes classifier . . . . . . . . . . . . . 201
8.18 VoIP Precision: Training with SSP-ACT and classifying with VoIP traffic expe-
riencing 5% random packet loss - Naive Bayes classifier . . . . . . . . . . . . 202
8.19 VoIP Recall: Training with SSP-ACT and classifying with VoIP traffic expe-
riencing 5% random packet loss - C4.5 Decision Tree classifier . . . . . . . . . 203
8.20 VoIP Precision: Training with SSP-ACT and classifying with VoIP traffic expe-
riencing 5% random packet loss - C4.5 Decision Tree classifier . . . . . . . . 203
8.21 Training for VoIP and ET traffic identification: Option A - Common classifier . 206
8.22 Training for VoIP and ET traffic identification: Option B - Separate classifiers
in parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
8.23 VoIP Recall and Precision: Naive Bayes classifier . . . . . . . . . . . . . . . . 208
8.24 VoIP Recall and Precision: C4.5 Decision Tree classifier . . . . . . . . . . . . 208
8.25 Computational performance for Naive Bayes and C4.5 Decision Tree classifiers
trained with Option A and Option B . . . . . . . . . . . . . . . . . . . . . . . 209
A.1 Client port range for the other selected applications . . . . . . . . . . . . . . . 252
A.2 ET packet length in C-S and S-C directions . . . . . . . . . . . . . . . . . . . 253
A.3 ET packet inter-arrival time in C-S and S-C directions . . . . . . . . . . . . . . 254
A.4 HTTP packet length in C-S and S-C directions . . . . . . . . . . . . . . . . . . 255
A.5 SMTP packet length in C-S and S-C directions . . . . . . . . . . . . . . . . . 255
A.6 P2P packet length in C-S and S-C directions . . . . . . . . . . . . . . . . . . . 256
A.7 Packet length statistics calculated over five consecutive packets at different
phases during a flow’s lifetime for SMTP traffic - C-S direction . . . . . . . . . 256
A.8 Packet length statistics calculated over five consecutive packets at different
phases during a flow’s lifetime for Kazaa traffic - C-S direction . . . . . . . . . 257
A.9 Packet length statistics calculated over five consecutive packets at different
phases during a flow’s lifetime for HTTP traffic - S-C direction . . . . . . . . . 257
C.1 Top 10 countries that contributed the greatest amount of ET traffic in the training
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
C.2 Top 10 Countries that contributed the most amount of ET traffic in the testing
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
C.3 Cumulative distribution of client hop counts per country for the training dataset 267
C.4 Distribution of different applications’ traffic (in flows and percentage) in testing
datasets for N = 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
C.5 Distribution of different applications’ traffic (in flows and percentage) in testing
datasets for N = 100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
C.6 Distribution of different applications’ traffic (in flows and percentage) in testing
datasets for N = 1000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
D.1 G.711 traffic - mean packet length - reverse direction . . . . . . . . . . . . . . 272
D.2 G.711 traffic - mean packet inter-arrival time - reverse direction . . . . . . . . . 273
D.3 GSM traffic - mean packet length - reverse direction . . . . . . . . . . . . . . . 273
D.4 GSM traffic - mean packet inter-arrival time - reverse direction . . . . . . . . . 273
E.1 Recall for different classifiers trained using different number of clusters . . . . 276
E.2 Recall for different classifiers trained using different number of clusters . . . . 277
E.3 Normalised build time for different classifiers trained using different number of
clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
E.4 Normalised classification speed for different classifiers trained using different
number of clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
E.5 Normalised memory usage for different classifiers trained using different num-
ber of clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
E.6 Normalised clustering time for different classifiers trained using different num-
ber of clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
List of Tables
5.1 Two-sample KS test p-values (probability of occurrence of the null hypothesis)
for the mean packet length feature sets calculated for different N values, based
on a set of 1000 flow samples . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2 ET traffic full-flow dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3 Sampled interfering application flows - full-flow dataset . . . . . . . . . . . . . 102
5.4 Detailed training and testing implementation for each experiment . . . . . . . . 105
5.5 Detailed training and testing implementation for each experiment (continued) . 106
6.1 The differences in training instances for each classifier . . . . . . . . . . . . . 138
6.2 Number of sub-flows selected automatically by the clustering process . . . . . 154
8.1 Comparison of the pros and cons of Option A: Common classifier versus Option
B: Separate classifiers in parallel . . . . . . . . . . . . . . . . . . . . . . . . . 211
B.1 A Summary of Research Reviewed in Chapter 4 . . . . . . . . . . . . . . . . . 260
B.2 A Summary of Research Reviewed in Chapter 4 (continued) . . . . . . . . . . 261
B.3 A Summary of Research Reviewed in Chapter 4 (continued) . . . . . . . . . . 262
B.4 A Summary of Research Reviewed in Chapter 4 (continued) . . . . . . . . . . 263
B.5 Reviewed work in light of considerations for operational traffic classification . 264
Appendix A
Traffic Characteristics of Selected Internet Applications
In this appendix I look at some characteristics of an FPS game, Wolfenstein Enemy Territory
(ET), and three other common Internet applications: SMTP, Kazaa, and HTTP. These applica-
tions are among the sample applications used to train my classifiers in Chapters 5 to 8. I place
emphasis on their asymmetry in the client-to-server (C-S) and server-to-client (S-C) directions,
and in the variation of their traffic statistics over a flow’s lifetime.
A.1 Asymmetric properties in bi-directional communication
The asymmetry in bi-directional traffic is seen both in UDP/TCP ports used at the client and
server and in the statistical properties in client-to-server (C-S) and server-to-client (S-C) direc-
tions for the applications considered.
A.1.1 Client server ports asymmetry
I sample up to 100 flows of each application for this analysis. ET traffic is sampled from a full-
month data trace collected at a public server [179] in Australia during September 2005. Only
flows with more than 1,000 packets in the C-S direction are selected, to ensure that they
are actual game flows. Traffic for SMTP, Kazaa and HTTP is sampled from one 24-hour trace
collected by the University of Twente, the Netherlands, on the 6th of February 2004 [181].
In a typical TCP connection the traffic flow starts with a three-way (SYN, SYN/ACK, ACK)
handshake initiated by the client. The server port to which the client addresses its initial SYN
packet is usually well-known¹. The client port, on the other hand, is typically chosen
dynamically. It follows that if
the classifier misses the first SYN packet (from C-S) there will be a chance that the first packet
that it captures will be from the reverse direction (S-C). For applications with asymmetric client
and server ports, the performance of a port-based classification approach would then degrade
(i.e. it would produce a higher rate of false negatives).
Figure A.1 shows the client port numbers for ET, HTTP, Kazaa and SMTP traffic. These
applications show a wide distribution of client port numbers, spread across the whole
range of possible port numbers (1 through 65,535). For my sample ET traffic, the server port is
configured as 27961 and client ports are distributed across a wide range. Approximately 50%
of the flow samples have client ports equal to the default ET port of 27960. Less than 1% of the
flow samples have client ports of 27961 (the port actually used by the ET server in this case).
For HTTP, Kazaa and SMTP traffic, the client ports are widely distributed, and different from
the server ports of 80, 1214 and 25 for each application respectively.
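The client-port analysis behind Figure A.1 can be reproduced with a short sketch like the following. The flow records, field names and sample values here are hypothetical stand-ins for illustration, not output from the tools used in this thesis:

```python
# Sketch: empirical CDF of client port numbers per application.
# Flow records and values below are invented for illustration only.
from collections import namedtuple

Flow = namedtuple("Flow", ["client_port", "server_port", "app"])

def client_port_cdf(flows):
    """Return sorted (port, cumulative_fraction) pairs for plotting a CDF."""
    ports = sorted(f.client_port for f in flows)
    n = len(ports)
    return [(port, (i + 1) / n) for i, port in enumerate(ports)]

# Toy sample: ET clients often reuse the default ET port 27960,
# while TCP clients typically pick ephemeral ports at random.
sample = [Flow(27960, 27961, "ET"), Flow(27960, 27961, "ET"),
          Flow(51234, 80, "HTTP"), Flow(60001, 80, "HTTP")]
cdf = client_port_cdf(sample)
```

A flat CDF with a large jump at one port (as for ET at 27960) indicates a concentrated client-port choice, while a near-diagonal CDF indicates ports spread across the ephemeral range.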
[CDF plot: client port number (0-65,535) versus cumulative distribution (0-1), for ET, HTTP, SMTP and Kazaa]
Figure A.1: Client port range for the other selected applications
A.1.2 Statistical Properties Asymmetry
All applications considered exhibit asymmetry in their statistical properties in the C-S and S-C
directions. Figures A.2 and A.3 illustrate this aspect of ET traffic in terms of packet length
and packet inter-arrival time². Packet length in the C-S direction (normally carrying the client’s
queries and updates) is typically small, mostly ranging between 62 and 75 bytes in my dataset.
Packet length in the S-C direction (normally carrying the server’s response information) is
more varied, mostly ranging between 77 and 276 bytes in my dataset. The S-C packet length
is influenced by the combination of the map and the number of players participating in a particular
game, while the C-S packet length is driven mostly by the behaviour of a particular player (for ex-
ample, C-S packets are shorter when a player is connected but idle, and longer when the player
starts playing) [201].

¹If the application uses a registered port [100] or a port number selected from a range of default values.
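The per-direction contrast can be illustrated with a minimal sketch. The packet lengths below are invented toy values chosen to echo the reported ranges, not measurements from my trace:

```python
# Sketch: summarise packet lengths per direction to expose C-S/S-C asymmetry.
# All values below are invented for illustration only.
def length_summary(pkt_lengths):
    """Min/mean/max (bytes) of a list of packet lengths."""
    return {"min": min(pkt_lengths),
            "mean": sum(pkt_lengths) / len(pkt_lengths),
            "max": max(pkt_lengths)}

cs = [62, 66, 70, 75]        # client-to-server: small, narrow range
sc = [77, 120, 200, 276]     # server-to-client: larger and more varied

# Positive value indicates S-C packets are larger on average.
asymmetry = length_summary(sc)["mean"] - length_summary(cs)["mean"]
```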
[CDF plot: packet length (bytes) versus cumulative distribution (0-1), C-S and S-C directions]
Figure A.2: ET packet length in C-S and S-C directions
The ET packet rate in the S-C direction depends on the server’s update algorithm. Figure
A.3 shows a fairly consistent packet inter-arrival time of approximately 0.05 seconds. The
Cumulative Distribution Function (CDF) of S-C packet inter-arrival times jumps slightly at 0.1
or 0.15 seconds. This can be due to a lost packet, or to a packet skipped when the ET server
briefly rate-limits its transmissions to particular clients; the packet inter-arrival times increase
at multiples of 50 ms in these cases. In contrast, from C-S there is a wider range of packet inter-arrival time
values, which can be due to the diversity in graphic cards and maps used by particular clients
[201] and/or a choice of the client’s software to lower its sending rate due to slower speeds at
the client’s access links [176]³.

²Data is collected from all packets of the full-flow samples.
³The packet rates in the C-S direction seen in my analysis are slightly lower than those reported in [201],
because [201] analysed LAN players with high-speed links to a local server. My data trace consists of widely
dispersed, geographically distributed players connecting to the server via the Internet. Players can configure
their clients to work within lower Internet access rate limits, which leads to longer average and peak packet
inter-arrival times (i.e. lower packet rates).

[CDF plot: packet IAT (seconds) versus cumulative distribution (0-1), C-S and S-C directions]
Figure A.3: ET packet inter-arrival time in C-S and S-C directions

For Web, Kazaa and SMTP traffic the asymmetry in packet inter-arrival times is minimal,
because with TCP as the transport protocol the sending rate is regulated by the receiver’s flow
control mechanism. However, packet length asymmetry in the C-S and S-C directions is significant,
as shown in Figures A.4, A.5 and A.6. Packet lengths in one direction are typically smaller than
those in the reverse direction due to the typical asymmetry in the application traffic (e.g. small
request packets in one direction versus long response packets in the reverse direction).

A.2 Variation of traffic statistics during flow lifetime

The variation of ET traffic statistics during a flow lifetime was presented in section 5.3.3. While
not as significant as those of ET traffic, Kazaa and SMTP flow statistics also change during the
lifetime of a flow. For example, the initial handshake of a new SMTP connection looks quite
different to the traffic while transferring the body of each email.

Figure A.7 presents the mean packet length of five consecutive packets in the C-S direction⁴,
taken at different points in time of the SMTP flows. I consider two different phases of the traffic

⁴The three-way handshake at the beginning of SMTP traffic typically occurs within the first six packets exchanged between a client and a server.
[CDF plot: packet length (bytes) versus cumulative distribution (0-1), C-S and S-C directions]
Figure A.4: HTTP packet length in C-S and S-C directions
[CDF plot: packet length (bytes) versus cumulative distribution (0-1), C-S and S-C directions]
Figure A.5: SMTP packet length in C-S and S-C directions
[CDF plot: packet length (bytes) versus cumulative distribution (0-1), C-S and S-C directions]
Figure A.6: P2P packet length in C-S and S-C directions
flow during its lifetime: Starting (the beginning of the traffic flow) and In progress (the five
consecutive packets starting from the 10th packet). As shown in Figure A.7, the statistical
properties computed over five packets taken at different phases are different from each other,
and different from those calculated over a full-flow. Similar characteristics are seen with Kazaa
and HTTP traffic, as shown in Figures A.8 and A.9.
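The phase comparison in these figures can be sketched as follows, using a window of five packets with the "In progress" window starting at the 10th packet, as in the text. The example flow is invented for illustration, not taken from my dataset:

```python
# Sketch: mean packet length over a five-packet window at different
# phases of a flow's lifetime. The example flow is invented.
def window_mean(lengths, start, size=5):
    """Mean packet length over `size` consecutive packets from index `start`."""
    window = lengths[start:start + size]
    return sum(window) / len(window)

def phase_features(lengths):
    return {
        "starting":    window_mean(lengths, 0),   # first five packets
        "in_progress": window_mean(lengths, 9),   # from the 10th packet
        "full_flow":   sum(lengths) / len(lengths),
    }

# Invented SMTP-like C-S flow: short handshake/command packets first,
# then larger packets while a message body is transferred.
flow = [60, 60, 66, 90, 120, 300, 800, 1400, 1400,
        1400, 1400, 1400, 900, 60]
feats = phase_features(flow)
```

In such a flow the "Starting" window mean differs markedly from both the "In progress" window mean and the full-flow mean, which is the effect the figures illustrate.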
[CDF plot: mean packet length (C-S, bytes) versus cumulative distribution (0-1); curves: Starting, In progress, Full flow]
Figure A.7: Packet length statistics calculated over five consecutive packets at different phases during a flow's lifetime for SMTP traffic - C-S direction
[CDF plot: mean packet length (C-S, bytes) versus cumulative distribution (0-1); curves: Starting, In progress, Full flow]
Figure A.8: Packet length statistics calculated over five consecutive packets at different phases during a flow's lifetime for Kazaa traffic - C-S direction
[CDF plot: mean packet length (S-C, bytes) versus cumulative distribution (0-1); curves: Starting, In progress, Full flow]
Figure A.9: Packet length statistics calculated over five consecutive packets at different phases during a flow's lifetime for HTTP traffic - S-C direction
Appendix B
A Summary of ML-Based IP Traffic Classification Works in the Literature
B.1 A summary of key points for each reviewed work
Some key points for each work reviewed in Chapter 4 are summarised in Tables B.1, B.2, B.3,
and B.4.
B.2 A qualitative evaluation of the reviewed works
Table B.5 provides a qualitative summary of the reviewed works in Chapter 4 against the fol-
lowing criteria:
• Real-Time Classification
– No: The work makes use of features that require flow completion to compute (e.g.
flow duration, total flow bytes count)
– Yes: The work requires the capture of a small number of packets/bytes of a flow to
complete the classification
• Feature Computation Overhead
– Low: The work makes use of a small number of features (e.g. sizes of the first few
packets, binary encoding of the first few bytes of a uni-directional flow)
– Average: The work makes use of an average set of features (such as packet length
and inter-arrival times statistics, flow duration, bytes count)
– High: The work makes use of a large (compared with other work in the area) set of
computationally complex features (such as the Fourier transform of packet inter-arrival
times)
• Continuous Classification
– Not addressed: The issue is not considered in the work
– Yes: The issue is considered and solved in the work
• Directional Neutrality
– No: The work makes use of bi-directional flow and feature calculations, but does
not consider the issue
– Yes: The work makes use of bi-directional flow and feature calculations, addresses
the issue and proposes a solution
– N/A: The work makes use of uni-directional flow and the issue is not applicable
– Not clear: Not clearly stated in the paper
Table B.1: A Summary of Research Reviewed in Chapter 4

McGregor et al. [59]
  ML algorithms: Expectation Maximisation
  Features: Packet length statistics (min, max, quartiles, ...); inter-arrival statistics; byte counts; connection duration; number of transitions between transaction mode and bulk transfer mode; idle time. Calculated on full flows.
  Data traces: NLANR and Waikato trace
  Traffic considered: A mixture of HTTP, SMTP, FTP (control), NTP, IMAP, DNS, ...
  Classification level: Coarse grained (bulk transfer, small transactions, multiple transactions, ...)

Zander et al. [60]
  ML algorithms: AutoClass
  Features: Packet length statistics (mean and variance in forward and backward directions); inter-arrival time statistics (mean and variance in forward and backward directions); flow size (bytes); flow duration. Calculated on full flows.
  Data traces: Auckland-VI, NZIX-II and Leipzig-II from NLANR
  Traffic considered: Half-Life, Napster, AOL, HTTP, DNS, SMTP, Telnet, FTP (data)
  Classification level: Fine grained (8 applications studied)

Roughan et al. [61]
  ML algorithms: Nearest Neighbour, Linear Discriminant Analysis and Quadratic Discriminant Analysis
  Features: Packet level; flow level; connection level; intra-flow/connection features; multi-flow features. Calculated on full flows.
  Data traces: Waikato trace and session logs from a commercial streaming service
  Traffic considered: Telnet, FTP (data), Kazaa, Real Media Streaming, DNS, HTTPS
  Classification level: Fine grained (three, four and seven classes of individual applications)

Moore and Zuev [98]
  ML algorithms: Bayesian techniques (Naive Bayes, and Naive Bayes with Kernel Estimation and the Fast Correlation-Based Filter method)
  Features: Total of 248 features, among them (detailed in [150]): flow duration; TCP port; packet inter-arrival time statistics; payload size statistics; effective bandwidth based upon entropy; Fourier transform of packet inter-arrival time. Calculated on full flows.
  Data traces: Proprietary hand-classified traces
  Traffic considered: A large range of Database, P2P, Bulk, Mail, Services, ... traffic
  Classification level: Coarse grained
Table B.2: A Summary of Research Reviewed in Chapter 4 (continued)

Bernaille et al. [122]
  ML algorithms: Simple K-Means
  Features: Packet lengths of the first few packets of bi-directional traffic flows
  Data traces: Proprietary traces
  Traffic considered: eDonkey, FTP, HTTP, Kazaa, NTP, POP3, SMTP, SSH, HTTPS, POP3S
  Classification level: Fine grained (10 applications studied)

Park et al. [152]
  ML algorithms: Naive Bayes with Kernel Estimation, Decision Tree J48 and Reduced Error Pruning Tree
  Features: Flow duration; initial advertised window bytes; number of actual data packets; number of packets with the PUSH option; packet lengths; advertised window bytes; packet inter-arrival time; size of total burst packets
  Data traces: NLANR, USC/ISI, CAIDA
  Traffic considered: WWW, Telnet, Chat (Messenger), FTP, P2P (Kazaa, Gnutella), Multimedia, SMTP, POP, IMAP, DNS, Oracle, X11
  Classification level: N/A (comparison work)

Erman et al. [123]
  ML algorithms: K-Means
  Features: Total number of packets; mean packet length; mean payload length excluding headers; number of bytes transferred; flow duration; mean inter-arrival time
  Data traces: Eight self-collected 1-hour campus traces, April 6-9, 2006
  Traffic considered: Web, P2P, FTP, Others
  Classification level: Coarse grained (29 different protocols grouped into a number of application categories)

Crotti et al. [167]
  ML algorithms: Protocol fingerprints (Probability Density Function vectors) and an anomaly score (from protocol PDFs to protocol fingerprints)
  Features: Packet lengths; inter-arrival time; packet arrival order
  Data traces: 6-month self-collected traces at the edge gateway of the University of Brescia data centre network
  Traffic considered: TCP applications (HTTP, SMTP, POP3, SSH)
  Classification level: Fine grained (four TCP protocols)
Table B.3: A Summary of Research Reviewed in Chapter 4 (continued)

Ma et al. [202]
  ML algorithms: Unsupervised learning (product distribution, Markov processes, and common substring graphs)
  Features: Discrete byte encoding of the first n bytes of payload of a TCP unidirectional flow
  Data traces: Proprietary
  Traffic considered: FTP (control), SMTP, POP3, IMAP, HTTPS, HTTP, SSH
  Classification level: Fine grained

Auld et al. [165]
  ML algorithms: Bayesian Neural Network
  Features: 246 features in total, including: flow metrics (duration, packet count, total bytes); packet inter-arrival time statistics; size of TCP/IP control fields; total packets in each direction and total for the bi-directional flow; payload size; effective bandwidth based upon entropy; top-ten Fourier transform components of packet inter-arrival times for each direction
  Data traces: Proprietary hand-classified traces
  Traffic considered: A large range of Database, P2P, Bulk, Mail, Services, Multimedia, Web ... traffic
  Classification level: Coarse grained

Williams et al. [171]
  ML algorithms: Naive Bayes with Discretisation, Naive Bayes with Kernel Estimation, C4.5 Decision Tree, Bayesian Network and Naive Bayes Tree
  Features: Protocol; flow duration; flow volume in bytes and packets; packet length (minimum, mean, maximum and standard deviation); inter-arrival time between packets (minimum, mean, maximum and standard deviation)
  Data traces: NLANR
  Traffic considered: FTP (data), Telnet, SMTP, DNS, HTTP
  Classification level: N/A (comparison work)

Haffner et al. [172]
  ML algorithms: Naive Bayes, AdaBoost, Regularized Maximum Entropy
  Features: Discrete byte encoding of the first n bytes of payload of a TCP unidirectional flow
  Data traces: Proprietary
  Traffic considered: FTP (control), SMTP, POP3, IMAP, HTTPS, HTTP, SSH
  Classification level: Fine grained
Table B.4: A Summary of Research Reviewed in Chapter 4 (continued)

Erman et al. [153]
  ML algorithms: K-Means, DBSCAN and AutoClass
  Features: Total number of packets; mean packet length; mean payload length excluding headers; number of bytes transferred (in each direction and combined); mean packet inter-arrival time
  Data traces: NLANR and a self-collected 1-hour trace from the University of Calgary
  Traffic considered: HTTP, P2P, SMTP, IMAP, POP3, MSSQL, Other
  Classification level: N/A (comparison work)

Erman et al. [170]
  ML algorithms: Naive Bayes and AutoClass
  Features: Total number of packets; mean packet length (in each direction and combined); flow duration; mean data packet length; mean packet inter-arrival time
  Data traces: NLANR
  Traffic considered: HTTP, SMTP, DNS, SOCKS, FTP (control), FTP (data), POP3, Limewire
  Classification level: N/A (comparison work)

Bonfiglio et al. [54]
  ML algorithms: Naive Bayes and Pearson's Chi-Square test
  Features: Message size (the length of the message encapsulated in the transport layer protocol segment); average inter-packet gap
  Data traces: Two self-collected datasets
  Traffic considered: Skype traffic
  Classification level: Application specific
Table B.5: Reviewed work in light of considerations for operational traffic classification

Work                   | Real-time classification | Feature computation overhead | Classify flows in progress | Directional neutrality
McGregor et al. [59]   | No  | Average | Not addressed | No
Zander et al. [60]     | No  | Average | Not addressed | No
Roughan et al. [61]    | No  | Average | Not addressed | N/A
Moore and Zuev [98]    | No  | High    | Not addressed | No
Bernaille et al. [122] | Yes | Low     | Not addressed | No
Park et al. [152]      | No  | Average | Not addressed | Not clear
Erman et al. [123]     | No  | Average | Not addressed | No
Crotti et al. [167]    | Yes | Average | Not addressed | No
Haffner et al. [172]   | Yes | Average | Not addressed | N/A
Ma et al. [202]        | No  | Average | Not addressed | No
Auld et al. [165]      | No  | High    | Not addressed | No
Williams et al. [171]  | N/A | Average | N/A           | N/A
Erman et al. [153]     | N/A | Average | N/A           | N/A
Erman et al. [170]     | N/A | Average | N/A           | N/A
Bonfiglio et al. [54]  | Yes | Average | Not addressed | Not clear
Appendix C
Some Properties of Data Used for Training and Testing
This appendix presents some properties of the training and testing dataset used in Chapters 5,
6 and 7.
C.1 Geographical distribution of ET traffic
Figures C.1 and C.2 show the top 10 countries that contributed the greatest amount of ET
traffic (in terms of total bytes and number of flows) in the May 2005 (training) and September
2005 (testing) datasets.
[Figure: bar charts of the top 10 contributing countries. Panel (a), percentage of total flows (0 to 20%), in order: United States, Australia, Poland, Germany, France, Finland, Netherlands, United Kingdom, Canada, Belgium. Panel (b), percentage of total bytes (0 to 80%), in order: Australia, Poland, France, Germany, United States, United Kingdom, New Zealand, Finland, Sweden, Netherlands.]

Figure C.1: Top 10 countries that contributed the greatest amount of ET traffic in the training dataset
Figures C.1(b) and C.2(b) are more peaked than Figures C.1(a) and C.2(a), as most actual long game flows are from Australia (where the server is located).
[Figure: bar charts of the top 10 contributing countries. Panel (a), percentage of total flows (0 to 30%), in order: Australia, United States, Poland, Germany, France, Netherlands, Finland, United Kingdom, Belgium, Canada. Panel (b), percentage of total bytes (0 to 100%), in order: Australia, Poland, New Zealand, United States, Germany, France, Netherlands, Finland, United Kingdom, Italy.]

Figure C.2: Top 10 countries that contributed the greatest amount of ET traffic in the testing dataset
I estimate the distribution of hop counts for the client IP addresses from the TTL field values in each flow's packets. My estimation method is based on assumptions similar to those outlined in [203]: the initial TTL is usually a multiple of 32, and it is decremented once at each hop back towards the game server (the measurement point). Since the default TTL values configured at the client machines are unknown, I assume that the maximum hop count is no more than 32 hops. The hop count can therefore be inferred from the observed TTL value as:

HopCount = ceiling(TTL / 32) * 32 - TTL
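As a sketch, this inference amounts to a one-line function (the function name and example TTL values here are my own, for illustration only):

```python
import math

def hop_count(ttl: int) -> int:
    """Infer the hop count from an observed TTL, assuming the sender's
    initial TTL was the next multiple of 32 at or above the observed
    value, and that the path is no longer than 32 hops."""
    return math.ceil(ttl / 32) * 32 - ttl

# An observed TTL of 54 implies an initial TTL of 64, i.e. 10 hops:
print(hop_count(54))   # -> 10
print(hop_count(120))  # -> 8 (implied initial TTL of 128)
```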
The topological spread of international game clients is illustrated by the distributions of hop counts for game flows from the most popular countries to the server. As can be seen in Figure C.3, Australian clients are between 5 and 17 hops from the server, while international clients are at least 10 hops away. This finding is consistent with the study of [204]. Similar results are seen with the testing dataset.
C.2 Traffic mix for training and testing
This section provides details on the traffic mix used for training and testing the different classifiers in Chapter 5. N is the size of the sliding window; I consider N = 10, 100 and 1,000 packets. M is the number of packets offset from the beginning of each flow.
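As an illustration of this windowing, a sub-flow can be sketched as the N-packet slice taken M packets into a flow (the function and variable names here are my own, not taken from the classifier implementation):

```python
def sub_flow(packets, m, n):
    """Return the N-packet sliding window starting M packets into the flow."""
    return packets[m:m + n]

flow = list(range(50))          # a toy flow of 50 'packets'
window = sub_flow(flow, 10, 10)
print(window[0], window[-1])    # -> 10 19 (packets 10 through 19)
```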
[Figure: cumulative distribution functions of hop counts (0 to 30) for Australia, United States, Finland, Poland, Germany, France, Netherlands, United Kingdom, Canada and Brazil.]

Figure C.3: Cumulative distribution of client hop counts per country for the training dataset
[Figure: traffic per application category (Mail etc., HalfLife, DNS etc., ET, Web, P2P) against M (packets, 0 to 9K). Panel (a): number of flows (0 to 9,000). Panel (b): percentage of flows (0 to 50%).]

Figure C.4: Distribution of different applications' traffic (in flows and percentage) in testing datasets for N = 10
[Figure: traffic per application category (Mail etc., HalfLife, DNS etc., ET, Web, P2P) against M (packets, 0 to 9K). Panel (a): number of flows (0 to 2,500). Panel (b): percentage of flows (0 to 45%).]

Figure C.5: Distribution of different applications' traffic (in flows and percentage) in testing datasets for N = 100
[Figure: traffic per application category (Mail etc., HalfLife, DNS etc., ET, Web, P2P) against M (packets, 0 to 9K). Panel (a): number of flows (0 to 1,800). Panel (b): percentage of flows (0 to 70%).]

Figure C.6: Distribution of different applications' traffic (in flows and percentage) in testing datasets for N = 1,000
Appendix D
Characteristics of VoIP Traffic
D.1 VoIP data extraction
While my data trace is a mixture of voice traffic and other applications, voice traffic is filtered out as follows. Firstly, RTP traffic is extracted using TShark 1.0.4, a command-line-based version of Wireshark [205]. This tool dissects traffic protocols using deep packet inspection. It can identify a UDP datagram as containing a packet of a particular protocol running on top of UDP only if: the protocol has a particular standard port number and the UDP source or destination port number is that port; packets of that protocol can be identified by looking for a 'signature' of some type in the packet; or some other traffic earlier in the capture indicated that traffic between two particular addresses and ports belongs to the protocol [206].
When Tshark sees SIP/SDP traffic setting up an RTP session, the details of the RTP session (source IP address and source UDP port number) are identified from the SIP/SDP packets and used to extract the subsequent RTP stream.
While this approach is sound in most cases, registering only the source IP address and
source port number may result in false positives in subsequent classification of RTP traffic. As
the conversation registration is set up indefinitely, other traffic initiated from the same IP address
and port number will be falsely classified as RTP traffic.
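This false-positive mechanism can be sketched as follows (a toy model of the registration logic described above, not Tshark's actual implementation; the function names and addresses are my own):

```python
# Registration is keyed only on (source IP, source UDP port) and never
# expires, so any later flow reusing that pair is labelled RTP even if
# it actually carries another protocol (e.g. DNS).
registered = set()

def register_from_sdp(src_ip, src_port):
    """Record the RTP session details announced in a SIP/SDP packet."""
    registered.add((src_ip, src_port))

def classify(src_ip, src_port):
    return "RTP" if (src_ip, src_port) in registered else "other"

register_from_sdp("192.0.2.10", 16384)  # hypothetical SIP/SDP session setup
print(classify("192.0.2.10", 16384))    # -> RTP: the genuine voice stream
print(classify("192.0.2.10", 16384))    # -> RTP: a later DNS flow from the
                                        #    same address/port, falsely matched
```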
To eliminate false positives I then manually inspected all the flows believed by Tshark to be RTP traffic. Anomalous flows, that is, flows with an IP packet length different from 200 bytes (G.711 PCMU voice packets), 73 bytes (GSM voice packets), 41 bytes (comfort noise packets) or 44 bytes (Telephone-Event packets [207]), were filtered out for quarantine.
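The length-based quarantine test amounts to checking every packet in a candidate flow against the four expected sizes; a minimal sketch (the function name is my own):

```python
# Expected IP packet lengths in bytes: G.711 PCMU voice, GSM voice,
# comfort noise and Telephone-Event packets, respectively.
EXPECTED_LENGTHS = {200, 73, 41, 44}

def is_anomalous(packet_lengths):
    """Flag a candidate RTP flow containing any unexpected packet length."""
    return any(length not in EXPECTED_LENGTHS for length in packet_lengths)

print(is_anomalous([200, 41, 200, 44]))  # -> False: all expected sizes
print(is_anomalous([280, 280, 280]))     # -> True: quarantined for inspection
```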
Out of the 666 RTP flows identified by Tshark, I found one RTP flow containing 280-byte packets, sent at 30ms intervals in one direction. Closer inspection reveals that it contains G.711 PCMU packets, but instead of the default 20ms packet interval, its packets are sent at 30ms intervals (three 10ms frames), with a payload size of 240 bytes (80 bytes/frame x 3) in one direction. Most G.711 and GSM flows have a constant packet length in each direction (apart from the presence of comfort noise and Telephone-Event packets). There are eight exceptions where the flow switched between G.711 and GSM in one direction, and hence contains a mixture of 200-byte and 73-byte packets in that direction.
There were five DNS flows falsely classified as RTP traffic. The reason for this was that
these DNS flows had the same source IP address and source port as one previously registered
RTP session. There were also 17 video (H.263) flows encapsulated in RTP sessions. These
DNS and video flows were removed from the dataset. The remaining 644 RTP flows were then
used as benchmark VoIP flows for my analysis in Chapter 8.
D.2 Statistical properties of G.711 and GSM flows
This section summarises some statistical properties of my VoIP dataset. As shown in Figure
8.2, most G.711 voice packets are 200 bytes long. The sliding window’s mean packet length
sometimes falls below 200 bytes due to the presence of 41-byte comfort noise packets from
time to time. Figure 8.3 reveals that most packets arrive at 20ms intervals. However, there
are outliers that indicate a packet inter-arrival time of greater than 20ms. These longer packet
inter-arrival times are due to jitter, packet loss or silent periods during voice conversations.
Similar traffic characteristics are seen in the reverse direction.
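The windowed statistic behind Figure 8.2 can be sketched as a simple mean over each window of packet lengths (the helper function and toy window below are my own, for illustration):

```python
def mean_packet_length(packet_lengths):
    """Mean IP packet length over one sliding window of packets."""
    return sum(packet_lengths) / len(packet_lengths)

# A window of G.711 voice packets containing one 41-byte comfort-noise
# packet pulls the window mean below 200 bytes:
window = [200] * 9 + [41]
print(mean_packet_length(window))  # -> 184.1
```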
[Figure: mean packet length (roughly 50 to 200 bytes) against M, the number of packets offset from the beginning of each flow (0 to 9K, and FF).]
Figure D.1: G.711 traffic - mean packet length - reverse direction
As shown in Figure 8.4, almost all GSM voice packets are 73 bytes long. There are only
[Figure: mean packet inter-arrival time (0 to 100 msec) against M, the number of packets offset from the beginning of each flow (0 to 9K, and FF).]
Figure D.2: G.711 traffic - mean packet inter-arrival time - reverse direction
a few outliers due to telephone-event packets. Figure 8.5 shows that most packets arrive at
20ms intervals. However, there are outliers that indicate a packet inter-arrival time of greater
than 20ms. These longer packet inter-arrival times are due to jitter, packet loss or silent periods
during voice conversations. Similar traffic characteristics are seen in the reverse direction.
[Figure: mean packet length (50 to 100 bytes) against M, the number of packets offset from the beginning of each flow (0 to 9K, and FF).]
Figure D.3: GSM traffic - mean packet length - reverse direction
[Figure: mean packet inter-arrival time (0 to 60 msec) against M, the number of packets offset from the beginning of each flow (0 to 9K, and FF).]
Figure D.4: GSM traffic - mean packet inter-arrival time - reverse direction
Appendix E
Trade-offs in Cluster Quality and Classifier Performance
In Chapter 6 I demonstrated that there are two options available when choosing an 'optimal' number of clusters: the pre-classification option and the post-classification option. The former was chosen for study in Chapter 6; in this appendix, I investigate the latter.
In Chapter 6, the pre-classification option found eight 'natural' clusters in the 18 sub-flows input from Step 1. From these I obtained eight representative sub-flows, which were used to train and test my Naive Bayes classifier. Using the post-classification option, I instead pre-specify EM to start from two clusters, obtain the representative sub-flows from the clustering results, and use them to train and test my Naive Bayes classifier. I then continue to add clusters until the accuracy of the Naive Bayes classifier can no longer be increased.
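The post-classification search can be sketched as a simple loop. Here `evaluate` is a hypothetical callback standing in for the whole cluster/train/test cycle, and all names and the toy tolerance are my own:

```python
def choose_cluster_count(evaluate, start=2, max_clusters=20, tol=0.005):
    """Grow the number of clusters until classifier accuracy stops improving.

    `evaluate(k)` is assumed to cluster the sub-flows into k groups, train
    the Naive Bayes classifier on the representative sub-flows, and return
    its accuracy. Gains below `tol` are treated as noise.
    """
    best_k, best_acc = start, evaluate(start)
    for k in range(start + 1, max_clusters + 1):
        acc = evaluate(k)
        if acc <= best_acc + tol:
            break                      # no meaningful improvement: stop
        best_k, best_acc = k, acc
    return best_k, best_acc

# Toy accuracies loosely echoing the Recall medians reported in section E.1:
accs = {2: 0.193, 3: 0.948, 4: 0.978, 5: 0.979}
print(choose_cluster_count(lambda k: accs[k]))  # -> (4, 0.978)
```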
My results reveal that the pre-classification option produces a classifier with higher Recall and Precision that is also a little faster in classification, with a small trade-off in memory usage and a longer time required to build the classifier (more than 16% longer, yet equivalent to only approximately 15 seconds on my test platform). As it is simple, fully automated and independent of the ML classification algorithm, this option can be used as a general approach to assist in automated sub-flow selection. Its only drawback is the long clustering time required, which can be overcome by using a smaller number of sub-flow instances, as demonstrated in Chapter 6. My experimental results are presented in the following sections.
E.1 Accuracy
Figure E.1 presents Recall for classifiers trained using different numbers of clusters. The results are illustrated using boxplots1. Using two clusters produces a very low Recall (a median of 19.3%), while using three or more clusters produces a median Recall of at least 94.8%. Median Recall reaches 97.8% at four clusters and remains almost the same at five clusters. The eight clusters selected by the pre-classification process result in the maximum median Recall of 99%. They also produce the most stable Recall result, with the smallest gap between the 25th and 75th percentiles of the boxplot.
As outlined in Figure E.1, Recall generally increases with the number of clusters. This is because using more clusters provides a better chance of covering more distinct phases within a flow's lifetime, and hence better and more stable Recall in classification. Once there are sufficient clusters to cover all the phases of the full-flow, adding more clusters yields no further improvement. This explains the jump in Recall when selecting three clusters versus two, as presented in Figure E.1, and why the improvement is small for classifiers built using more than three clusters.
Figure E.2 depicts the Precision of the classifiers. Using two clusters produces the classifier with the lowest median Precision, 81.9%. Classifiers built on four and five clusters result in an almost identical median Precision of 93%. The eight clusters selected by the pre-classification option produce the classifier with the highest median Precision, 93.3%. The Precision results follow a similar curve to the Recall results: Precision increases noticeably when the number of clusters increases from two to three, and only very slightly beyond three.
The number of clusters can affect Precision, as the inclusion of one or more clusters can reduce or increase the unwanted range of feature values (the gap between the disjoint clusters) used to train a classifier, as illustrated in section 5.4.4. This can in turn reduce or increase the possibility of false positives. Precision can also be affected by the ratio of traffic mix between
1 The black line in the box indicates the median; the bottom and top of the box indicate the 25th and 75th percentiles, respectively. The vertical lines drawn from the box are whiskers. The upper cap is the largest observation that is ≤ the 75th percentile + 1.5*IQR (interquartile range, essentially the length of the box). The lower cap is the smallest observation that is ≥ the 25th percentile - 1.5*IQR. Any observations beyond the caps are drawn as individual points, which indicate outliers.
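For concreteness, the whisker caps defined in this footnote can be computed as follows (a sketch using NumPy's default linear-interpolation percentiles; the function name and sample data are my own):

```python
import numpy as np

def whisker_caps(values):
    """Return (lower cap, upper cap): the extreme observations lying
    within 1.5*IQR of the box edges."""
    data = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    upper = data[data <= q3 + 1.5 * iqr].max()
    lower = data[data >= q1 - 1.5 * iqr].min()
    return lower, upper

low, high = whisker_caps([1, 2, 3, 4, 5, 100])
print(low, high)  # -> 1.0 5.0 (100 falls beyond the caps: an outlier)
```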
[Figure: boxplots of Recall (0 to 100%) for classifiers trained using 2, 3, 4, 5 and 8 clusters.]

Figure E.1: Recall for different classifiers trained using different numbers of clusters
the number of examples for Other traffic versus the number of examples for ET traffic used for
testing.
These are two possible factors that lead to the Precision results presented in Figure E.2.
Detailed analysis of the impact of each factor on Precision is left for future work. Nonetheless, there is little difference between the Precision for eight clusters and that produced for five clusters (the former is only 0.3% higher)2.
In summary, my results demonstrate that using the pre-classification option produces the best classifier in terms of both Precision and Recall. Using the post-classification option produces a classifier with slightly lower Precision and Recall. However, the former needs to use eight clusters, while for the latter Recall and Precision appear to reach an optimal point at four clusters (after which the increase in Recall is insignificant)3. Using a smaller number of clusters (and hence a smaller number of sub-flow instances to train the classifier) could reduce the time and memory required to build a classifier. This is evaluated in the
2 These results are slightly different from the results I previously reported in a similar study [208], because [208] used different examples of Other traffic to train and test the classifier. Furthermore, the ratio of ET traffic versus Other traffic in the traffic mix used for testing in [208] was different: approximately 1:2, in contrast to the range of approximately 1:10 to 1:5 used in my experiment (see section 5.3.4).
3 There can be trade-offs between Precision and Recall, both of which may not converge to an optimal point at the same number of clusters. A priority rule can be useful in the decision-making process. Analysis of these trade-offs is a subject for future research.
[Figure: boxplots of Precision (20 to 100%) for classifiers trained using 2, 3, 4, 5 and 8 clusters.]

Figure E.2: Precision for different classifiers trained using different numbers of clusters
next section.
E.2 Computational performance
Figure E.3 compares the normalised build time required for each classifier. A value of 1 represents the slowest build time (96 seconds on my test platform). The smaller the number of clusters used, the faster a classification model can be constructed, because the number of instances used to train the classifier grows with the number of clusters. Building a classifier using eight clusters takes approximately 16% longer than using only four clusters; however, this difference is only equivalent to approximately 15 seconds of CPU time.
Figure E.4 presents the normalised classification speed for the models with the same test dataset. A value of 1 represents the fastest classification speed (3,984 classifications per second on my test platform). In contrast to the build time, the classification speed varies only slightly with the number of clusters chosen. This is because the classification speed depends directly on the classification rules, rather than on the number of instances used to train the classifier. The classifier using the pre-classification option achieves the highest classification speed; however, the difference is small, only 1.5% faster than in the case of the post-classification
[Figure: normalised build time (0.0 to 1.0) for classifiers trained using 2, 3, 4, 5 and 8 clusters.]

Figure E.3: Normalised build time for different classifiers trained using different numbers of clusters
option.
Figure E.5 presents the normalised memory usage for the classification models while performing 10-times cross-validation on their training datasets. A value of 1 represents the highest memory consumption (304MB on my test platform). Although all models consume quite modest memory resources, classifiers built using fewer clusters consume less memory (due to the smaller training dataset). However, the differences are less than 3%.
Figure E.6 presents the normalised clustering time for different numbers of clusters. A value of 1 represents the longest time (approximately 5.4 hours on my test platform). The greater the number of clusters specified, the longer the clustering algorithm takes in terms of CPU time. In the pre-classification option, generating eight clusters requires up to 172 hours to complete, because of the many repeated trial runs required by the WEKA cross-validation implementation. This clustering time can be vastly reduced by using a smaller number of sub-flow instances in the clustering process, as demonstrated in Chapter 6.
[Figure: normalised classification speed (0.0 to 1.0) for classifiers trained using 2, 3, 4, 5 and 8 clusters.]

Figure E.4: Normalised classification speed for different classifiers trained using different numbers of clusters
[Figure: normalised memory usage (0.0 to 1.0) for classifiers trained using 2, 3, 4, 5 and 8 clusters.]

Figure E.5: Normalised memory usage for different classifiers trained using different numbers of clusters
[Figure: normalised clustering time (0.0 to 1.0) for 2, 3, 4 and 5 clusters.]

Figure E.6: Normalised clustering time for different classifiers trained using different numbers of clusters